## Loading the data
Si hay multiwords, se debe saltar una linea y coger las dos siguientes. Ejemplo:

19-20	don't	_	_	_	_	_	_	_	_

19	do	do	AUX	VBP	Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin	21	aux	21:aux	_

20	n't	not	PART	RB	Polarity=Neg	21	advmod	21:advmod	_

In [3]:
def load_conllu_data(filepath):
    """
    Carga y procesa un archivo CoNLL-U, extrayendo las oraciones y sus etiquetas UPOS.
    """
    sentences = []
    tags = []
    current_sentence = []
    current_tags = []

    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()

            # 1. Ignorar comentarios y líneas vacías que no sean separadores de oración
            if line.startswith('#'):
                continue
            
            # 2. Línea en blanco: indica el final de una oración
            elif line == '':
                if current_sentence:
                    sentences.append(current_sentence)
                    tags.append(current_tags)
                    current_sentence = []
                    current_tags = []
            
            # 3. Procesar línea de palabra
            else:
                fields = line.split('\t')
                
                # Ignorar tokens multiword (ID con guion, e.g., '1-2') o nodos vacíos (ID con punto, e.g., '1.1') 
                if '-' in fields[0] or '.' in fields[0]:
                    continue

                # Extraer la palabra (FORM, índice 1) y la etiqueta PoS (UPOS, índice 3)
                word = fields[1]
                pos_tag = fields[3]
                
                current_sentence.append(word)
                current_tags.append(pos_tag)

    # Asegurarse de añadir la última oración si el archivo no termina en línea vacía
    if current_sentence:
        sentences.append(current_sentence)
        tags.append(current_tags)

    return sentences, tags

In [4]:
filepath = "./en_ewt-ud-train.conllu"

sentences = []
tags = []
current_sentence = []
current_tags = []

with open (filepath, 'r', encoding='utf-8') as file:
    for line in file:
        line = line.strip()

        # Ignorar comentarios y lineas vacias
        if line.startswith('#'):
            continue

        # Si hay una linea en blanco indica el final de una oracion
        elif line == '':
            if current_sentence:
                sentences.append(current_sentence)
                tags.append(current_tags)
                current_sentences = []
                current_tags = []
                
        # Procesar línea de palabra
        else:
            fields = line.split('\t')

            # Ignorar tokens multiword
            if '-' in fields[0] or '.' in fields[0]:
                    continue
                
            # Extraer la palabra (FORM, índice 1) y la etiqueta PoS (UPOS, índice 3)
            word = fields[1]
            pos_tag = fields[3]
                
            current_sentence.append(word)
            current_tags.append(pos_tag)


    # Asegurarse de añadir la última oración si el archivo no termina en línea vacía
    if current_sentence:
        sentences.append(current_sentence)
        tags.append(current_tags)

    
# Ejemplo de uso (asumiendo que los archivos están en la misma carpeta):
train_sents, train_tags = load_conllu_data('en_ewt-ud-train.conllu')
dev_sents, dev_tags = load_conllu_data('en_ewt-ud-dev.conllu')
test_sents, test_tags = load_conllu_data('en_ewt-ud-test.conllu')

print(test_sents[5], test_tags[5])

['Google', 'is', 'a', 'nice', 'search', 'engine', '.'] ['PROPN', 'AUX', 'DET', 'ADJ', 'NOUN', 'NOUN', 'PUNCT']


In [4]:
filepath = "./en_ewt-ud-train.conllu"

sentences = []
tags = []
current_sentence = []
current_tags = []

with open (filepath, 'r', encoding='utf-8') as file:
    for line in file:
        line = line.strip()

        # Ignorar comentarios y lineas vacias
        if line.startswith('#'):
            continue

        # Si hay una linea en blanco indica el final de una oracion
        elif line == '':
            if current_sentence:
                sentences.append(current_sentence)
                tags.append(current_tags)
                current_sentences = []
                current_tags = []
                
        # Procesar línea de palabra
        else:
            fields = line.split('\t')

            # Ignorar tokens multiword
            if '-' in fields[0] or '.' in fields[0]:
                    continue
                
            # Extraer la palabra (FORM, índice 1) y la etiqueta PoS (UPOS, índice 3)
            word = fields[1]
            pos_tag = fields[3]
                
            current_sentence.append(word)
            current_tags.append(pos_tag)


    # Asegurarse de añadir la última oración si el archivo no termina en línea vacía
    if current_sentence:
        sentences.append(current_sentence)
        tags.append(current_tags)

    
# Ejemplo de uso (asumiendo que los archivos están en la misma carpeta):
train_sents, train_tags = load_conllu_data('en_ewt-ud-train.conllu')
dev_sents, dev_tags = load_conllu_data('en_ewt-ud-dev.conllu')
test_sents, test_tags = load_conllu_data('en_ewt-ud-test.conllu')

print(test_sents[5], test_tags[5])

['Google', 'is', 'a', 'nice', 'search', 'engine', '.'] ['PROPN', 'AUX', 'DET', 'ADJ', 'NOUN', 'NOUN', 'PUNCT']


In [None]:
## 2. Text Vectorization: Creating the Dictionaries

The first step in preparing the data for the LSTM model is to convert our text-based sentences and tags into numerical sequences. Neural networks can only process numbers, so we need a consistent way to map each word and each tag to a unique integer ID.

For this task, we'll use Keras's modern `TextVectorization` layer. We will create two separate instances of this layer: one for the input words (`word_vectorizer`) and one for the output tags (`tag_vectorizer`).

The process involves two main stages:
1.  **Configuration**: We initialize the `TextVectorization` layer with `output_mode='int'` to ensure it produces sequences of integer IDs (e.g., "Google is nice" -> `[2, 3, 42]`). We also set `output_sequence_length=128` to enforce that all sequences are padded or truncated to a fixed length, which is a requirement for the model.
2.  **Adaptation**: We then call the `.adapt()` method on our training data. This step builds the internal vocabulary for each vectorizer. It analyzes all the words (or tags) in the training set and assigns a unique integer to each one. This ensures our "dictionaries" are based only on the data the model is allowed to learn from.


In [10]:
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

# Max lend of words for a sentence
MAX_LEN = 128 

# Create the TextVectorization layer.
word_vectorizer = TextVectorization(
    output_mode='int',
    output_sequence_length=MAX_LEN
)

#Flatten the training sentences.
train_sents_flat = [' '.join(sentence) for sentence in train_sents]

# Adapt the vectorizer to the training data.
# This builds the internal vocabulary (the word-to-integer dictionary).
word_vectorizer.adapt(train_sents_flat) 




# --- Let's test it with an example ---
# Create an example sentence containing an unknown word ("jojoto").
example_sentence = ["Google", "is", "a", "jojoto", "engine"]
print("example:", example_sentence)

# running example
example_vec = word_vectorizer([" ".join(example_sentence)])
print("\nexample vec:")
print(example_vec.numpy())

example: ['Google', 'is', 'a', 'jojoto', 'engine']

example vec:
[[2475    9    5    1 1862    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0]]


In [27]:
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

def create_and_adapt_vectorizer(sentences, max_len=128):

    vectorizer = TextVectorization(
        output_mode='int',
        output_sequence_length=max_len
    )

    sentences_flat = [' '.join(sentence) for sentence in sentences]
    
    vectorizer.adapt(sentences_flat)
    
    vocab_size = len(vectorizer.get_vocabulary())
    
    print(f"Adaptation complete. Vocabulary size: {vocab_size}")
    
    return vectorizer, vocab_size

# How to use the function ---


word_vectorizer, WORD_VOCAB_SIZE = create_and_adapt_vectorizer(train_sents)

tags_vectorizer, TAGS_VOCAB_SIZE = create_and_adapt_vectorizer(train_tags_flat)

print(f"\nWe have successfully created a vectorizer with a vocabulary of {WORD_VOCAB_SIZE} words.")

print(f"\nWe have successfully created a vectorizer with a vocabulary of {TAGS_VOCAB_SIZE} tags.")


print("Vectorizing all data sets...")
# The vectorizer layer can be called like a function on the raw text data.
# Note that we pass the original lists of lists (e.g., train_sents), not the flattened ones.
train_flat = [' '.join(sentence) for sentence in train_sents]
train_tags_flat = [' '.join(sentence) for sentence in train_tags]
X_train = word_vectorizer(train_flat )
y_train = tag_vectorizer(train_tags_flat)

#X_dev = word_vectorizer(dev_sents)
#y_dev = tag_vectorizer(dev_tags)

# We also need to vectorize the test set for the final evaluation later
#X_test = word_vectorizer(test_sents)
#y_test = tag_vectorizer(test_tags)

print("Vectorization complete!")
print("\nShape of X_train:", X_train.shape)
print("Shape of y_train:", y_train.shape)


Adaptation complete. Vocabulary size: 16250
Adaptation complete. Vocabulary size: 20

We have successfully created a vectorizer with a vocabulary of 16250 words.

We have successfully created a vectorizer with a vocabulary of 20 tags.
Vectorizing all data sets...
Vectorization complete!

Shape of X_train: (12544, 128)
Shape of y_train: (12544, 128)


In [28]:
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

def create_and_adapt_vectorizer(sentences, max_len=128):

    vectorizer = TextVectorization(
        output_mode='int',
        output_sequence_length=max_len
    )

    sentences_flat = [' '.join(sentence) for sentence in sentences]
    
    vectorizer.adapt(sentences_flat)
    
    vocab_size = len(vectorizer.get_vocabulary())
    
    print(f"Adaptation complete. Vocabulary size: {vocab_size}")
    
    return vectorizer, vocab_size

# How to use the function ---


word_vectorizer, WORD_VOCAB_SIZE = create_and_adapt_vectorizer(train_sents)

tags_vectorizer, TAGS_VOCAB_SIZE = create_and_adapt_vectorizer(train_tags_flat)

print(f"\nWe have successfully created a vectorizer with a vocabulary of {WORD_VOCAB_SIZE} words.")

print(f"\nWe have successfully created a vectorizer with a vocabulary of {TAGS_VOCAB_SIZE} tags.")

print("Vectorizing all data sets...")

# --- 1. Flatten the data from list of lists to list of strings ---
# The vectorizer layers expect a flat list of strings as input.
train_sents_flat = [' '.join(sentence) for sentence in train_sents]
train_tags_flat = [' '.join(tag_list) for tag_list in train_tags]

dev_sents_flat = [' '.join(sentence) for sentence in dev_sents]
dev_tags_flat = [' '.join(tag_list) for tag_list in dev_tags]

test_sents_flat = [' '.join(sentence) for sentence in test_sents]
test_tags_flat = [' '.join(tag_list) for tag_list in test_tags]


# --- 2. Use the vectorizers to transform the flattened data ---
# Now we call the vectorizers with the correct input format.
X_train = word_vectorizer(train_sents_flat)
y_train = tag_vectorizer(train_tags_flat) # <-- This is the corrected line

X_dev = word_vectorizer(dev_sents_flat)
y_dev = tag_vectorizer(dev_tags_flat)

X_test = word_vectorizer(test_sents_flat)
y_test = tag_vectorizer(test_tags_flat)

print("Vectorization complete!")
print("\nShape of X_train:", X_train.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of X_dev:", X_dev.shape)
print("Shape of y_dev:", y_dev.shape)


Adaptation complete. Vocabulary size: 16250
Adaptation complete. Vocabulary size: 20

We have successfully created a vectorizer with a vocabulary of 16250 words.

We have successfully created a vectorizer with a vocabulary of 20 tags.
Vectorizing all data sets...
Vectorization complete!

Shape of X_train: (12544, 128)
Shape of y_train: (12544, 128)
Shape of X_dev: (2001, 128)
Shape of y_dev: (2001, 128)


In [14]:
# Create the TextVectorization layer for the tags.
tag_vectorizer = TextVectorization(
    output_mode='int',
    output_sequence_length=MAX_LEN # Tag sequences must have the same length as word sequences.
)

# Flatten the training tags for the adaptation step.
train_tags_flat = [' '.join(tag_list) for tag_list in train_tags]

# Adapt the layer to learn the vocabulary of our training tags.
tag_vectorizer.adapt(train_tags_flat)
print("Adaptation complete.")




# --- Let's test it with an example ---

# 1. Take the first list of tags from our training data as an example.
example_tags = train_tags[0]
print("Example tag sequence (original):")
print(example_tags)

# 2. Vectorize the example tag sequence.
# We must join it into a single string and pass it as a list.
example_tags_vec = tag_vectorizer([" ".join(example_tags)])

# 3. Print the resulting numerical sequence.
# Notice how the output is a sequence of 128 numbers, padded with 0s at the end.
print("\nVectorized tag sequence:")
print(example_tags_vec.numpy())

Adaptation complete.
Example tag sequence (original):
['PROPN', 'PUNCT', 'PROPN', 'PUNCT', 'ADJ', 'NOUN', 'VERB', 'PROPN', 'PROPN', 'PROPN', 'PUNCT', 'PROPN', 'PUNCT', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'ADP', 'PROPN', 'PUNCT', 'ADP', 'DET', 'ADJ', 'NOUN', 'PUNCT']

Vectorized tag sequence:
[[10  3 10  3  8  2  4 10 10 10  3 10  3  7  2  6  7  2  6  7  2  6 10  3
   6  7  8  2  3  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0]]


In [12]:
# get the vocab and tags
word_vocab = word_vectorizer.get_vocabulary()
tag_vocab = tag_vectorizer.get_vocabulary()

# het the sizes
WORD_VOCAB_SIZE = len(word_vocab)
TAG_VOCAB_SIZE = len(tag_vocab)
print(TAG_VOCAB_SIZE)
print(f"size of the worids: {WORD_VOCAB_SIZE}")
print(f"size of the tags: {TAG_VOCAB_SIZE}")
print(f"some tags: {tag_vocab[:10]}")

Tamaño del vocabulario de palabras: 16250
Tamaño del vocabulario de etiquetas: 19
Algunas etiquetas del vocabulario: ['', '[UNK]', np.str_('noun'), np.str_('punct'), np.str_('verb'), np.str_('pron'), np.str_('adp'), np.str_('det'), np.str_('adj'), np.str_('aux')]


In [None]:
esto es el modeo

In [29]:
# Importaciones adicionales para construir el modelo
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, TimeDistributed

# Hiperparámetros del modelo (puedes experimentar con estos valores)
EMBEDDING_DIM = 64
LSTM_UNITS = 64

# Construcción del modelo
model = Sequential([
    # 1. Capa de Embedding: Convierte IDs de palabras en vectores de significado
    Embedding(input_dim=WORD_VOCAB_SIZE, 
              output_dim=EMBEDDING_DIM, 
              mask_zero=True), # mask_zero=True le dice al modelo que ignore los '0' del padding

    # 2. Capa LSTM: Procesa la secuencia de embeddings y recuerda el contexto
    LSTM(units=LSTM_UNITS, 
         return_sequences=True), # ¡Crucial! Devuelve una salida para cada palabra

    # 3. Capa de Salida: Aplica un clasificador a cada palabra de la secuencia
    TimeDistributed(Dense(units=TAG_VOCAB_SIZE, activation='softmax'))
])

# Imprimir un resumen de la arquitectura
model.summary()

In [29]:
# Importaciones adicionales para construir el modelo
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, TimeDistributed

# Hiperparámetros del modelo (puedes experimentar con estos valores)
EMBEDDING_DIM = 64
LSTM_UNITS = 64

# Construcción del modelo
model = Sequential([
    # 1. Capa de Embedding: Convierte IDs de palabras en vectores de significado
    Embedding(input_dim=WORD_VOCAB_SIZE, 
              output_dim=EMBEDDING_DIM, 
              mask_zero=True), # mask_zero=True le dice al modelo que ignore los '0' del padding

    # 2. Capa LSTM: Procesa la secuencia de embeddings y recuerda el contexto
    LSTM(units=LSTM_UNITS, 
         return_sequences=True), # ¡Crucial! Devuelve una salida para cada palabra

    # 3. Capa de Salida: Aplica un clasificador a cada palabra de la secuencia
    TimeDistributed(Dense(units=TAG_VOCAB_SIZE, activation='softmax'))
])

# Imprimir un resumen de la arquitectura
model.summary()

In [32]:
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, TimeDistributed
from tensorflow.keras.models import Model

# --- Hiperparámetros (igual que antes) ---
# (Asegúrate de tener estas variables definidas de pasos anteriores)
# WORD_VOCAB_SIZE, TAG_VOCAB_SIZE, MAX_LEN
EMBEDDING_DIM = 64
LSTM_UNITS = 64

# --- Construcción del Modelo con la API Funcional ---

# 1. Definir la capa de Entrada ✅
# Le decimos al modelo que recibirá secuencias de números enteros de longitud MAX_LEN.
inputs = Input(shape=(MAX_LEN,), name='word_ids_input')

# 2. Conectar las capas en una "carrera de relevos" 🔗
# La capa Embedding recibe los 'inputs' y su salida se guarda en 'x'.
x = Embedding(
    input_dim=WORD_VOCAB_SIZE, 
    output_dim=EMBEDDING_DIM, 
    mask_zero=True, # Importante para que ignore el padding
    name='word_embedding'
)(inputs)

# La capa LSTM recibe la salida del Embedding ('x') y su propia salida se guarda de nuevo en 'x'.
x = LSTM(
    units=LSTM_UNITS, 
    return_sequences=True, # Necesitamos una salida para cada palabra
    name='lstm_layer'
)(x)

# La capa TimeDistributed(Dense) recibe la salida de la LSTM ('x') y su salida es la final.
# La llamamos 'outputs' para que quede claro que es el final del camino.
outputs = TimeDistributed(
    Dense(units=TAG_VOCAB_SIZE, activation='softmax'), 
    name='pos_tag_output'
)(x)

# 3. Crear el Modelo final ✅
# Le decimos a Keras dónde empieza el modelo (inputs) y dónde termina (outputs).
model = Model(inputs=inputs, outputs=outputs, name='pos_tagger_model')

# ¡Listo! Ahora puedes imprimir el resumen y ver la arquitectura.
model.summary()

In [33]:
# Compilar el modelo
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

In [34]:
# Parámetros de entrenamiento
EPOCHS = 5 # Empezamos con pocas para probar rápido
BATCH_SIZE = 64

# Entrenar el modelo
history = model.fit(
    X_train, 
    y_train,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    validation_data=(X_dev, y_dev)
)

Epoch 1/5
[1m196/196[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m20s[0m 87ms/step - accuracy: 0.0386 - loss: 2.2298 - val_accuracy: 0.0416 - val_loss: 1.7180
Epoch 2/5
[1m196/196[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m15s[0m 79ms/step - accuracy: 0.0580 - loss: 1.5977 - val_accuracy: 0.0505 - val_loss: 1.3730
Epoch 3/5
[1m196/196[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m15s[0m 76ms/step - accuracy: 0.0665 - loss: 1.3255 - val_accuracy: 0.0524 - val_loss: 1.2473
Epoch 4/5
[1m196/196[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m16s[0m 80ms/step - accuracy: 0.0715 - loss: 1.1567 - val_accuracy: 0.0530 - val_loss: 1.2091
Epoch 5/5
[1m196/196[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m16s[0m 81ms/step - accuracy: 0.0754 - loss: 1.0369 - val_accuracy: 0.0533 - val_loss: 1.2180
