#Predicción de Tiempos Verbales

##Descarga y creación del dataset
Se utilizará el dataset del corpus AnCora en Español que 17662 oraciones, 547558 tokens y 560137 palabras sintácticas entre otros detalles. Se preprocesará para la aplicación a desarrollar

###1.Utilizamos GPU

In [1]:
import torch

# If there's a GPU available...
if torch.cuda.is_available():

    # Tell PyTorch to use the GPU.
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: Tesla T4


###2.Instalación de librerías y descarga del dataset

In [2]:
!pip install wget

Collecting wget
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... [?25l[?25hdone
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9655 sha256=2bd86bb58a822edf38fc37b1ed16724db62f97253e3700321fe95d91b6f17d82
  Stored in directory: /root/.cache/pip/wheels/40/b3/0f/a40dbd1c6861731779f62cc4babcb234387e11d697df70ee97
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


In [3]:
import wget
import os

print('Downloading dataset...')

# The URL for the dataset zip file.
url = 'https://github.com/UniversalDependencies/UD_Spanish-AnCora/archive/refs/heads/master.zip'

# Download the file (if we haven't already)
if not os.path.exists('./master.zip'):
    wget.download(url, './master.zip')

Downloading dataset...


In [4]:
if not os.path.exists('./master/'):
    !unzip master.zip

Archive:  master.zip
7068ba2e51d77ab8ea9b807ad919e47f0db07b9b
   creating: UD_Spanish-AnCora-master/
 extracting: UD_Spanish-AnCora-master/.gitignore  
  inflating: UD_Spanish-AnCora-master/CONTRIBUTING.md  
  inflating: UD_Spanish-AnCora-master/LICENSE.txt  
  inflating: UD_Spanish-AnCora-master/README.md  
  inflating: UD_Spanish-AnCora-master/es_ancora-ud-dev.conllu  
  inflating: UD_Spanish-AnCora-master/es_ancora-ud-test.conllu  
  inflating: UD_Spanish-AnCora-master/es_ancora-ud-train.conllu  
  inflating: UD_Spanish-AnCora-master/eval.log  
  inflating: UD_Spanish-AnCora-master/stats.xml  


In [5]:
!pip install conllu

Collecting conllu
  Downloading conllu-6.0.0-py3-none-any.whl.metadata (21 kB)
Downloading conllu-6.0.0-py3-none-any.whl (16 kB)
Installing collected packages: conllu
Successfully installed conllu-6.0.0


###3.Extraemos los datos que vamos a utilizar

In [6]:
from conllu import parse_incr
import pandas as pd

def extract_verbs_from_conllu(file_path):
    data = []
    with open(file_path, "r", encoding="utf-8") as f:
        for sentence in parse_incr(f):
            tokens = [t["form"] for t in sentence if isinstance(t["id"], int)]
            for token in sentence:
                if isinstance(token["id"], int) and token["upos"] == "VERB" and token["feats"] is not None:
                    feats = token["feats"]
                    if all(k in feats for k in ["Tense", "Mood", "Person", "Number"]):
                        data.append({
                            "sentence": " ".join(tokens),
                            "verb": token["form"],
                            "Tense": feats["Tense"],
                            "Mood": feats["Mood"],
                            "Person": feats["Person"],
                            "Number": feats["Number"]
                        })
    return data

In [7]:
# Extraer datos de entrenamiento
train_data = extract_verbs_from_conllu("/content/UD_Spanish-AnCora-master/es_ancora-ud-train.conllu")
df = pd.DataFrame(train_data)

In [8]:
print(df.sample(5))
for item in train_data[0:10]:
    print(item)

                                                sentence         verb Tense  \
22091  El rock and roll empezó a morir cuando los art...       empezó  Past   
21019  Y con este argumento , con un equipo que busca...  corresponde  Pres   
3254   Cuando el conflicto representado por el obstin...       ejerce  Pres   
24403  En una lista provisional aparecida el día 2 fi...      estaban   Imp   
9157   " Parte de el actual Gobierno está bajo sospec...     calificó  Past   

      Mood Person Number  
22091  Ind      3   Sing  
21019  Ind      3   Sing  
3254   Ind      3   Sing  
24403  Ind      3   Plur  
9157   Ind      3   Sing  
{'sentence': 'Las reservas de oro y divisas de Rusia subieron 800 millones de dólares y el 26 de mayo equivalían a 19.100 millones de dólares , informó hoy un comunicado de el Banco Central .', 'verb': 'subieron', 'Tense': 'Past', 'Mood': 'Ind', 'Person': '3', 'Number': 'Plur'}
{'sentence': 'Las reservas de oro y divisas de Rusia subieron 800 millones de dólare

###4.Tokenizamos con BERT

In [8]:
from transformers import BertTokenizer, BertModel
import torch
from sklearn.preprocessing import LabelEncoder
import numpy as np

# Cargar tokenizer en español
tokenizer = BertTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")
model = BertModel.from_pretrained("dccuchile/bert-base-spanish-wwm-cased", output_hidden_states=True)
model.to(device)
model.eval()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/364 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/242k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/134 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/480k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/648 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertModel were not initialized from the model checkpoint at dccuchile/bert-base-spanish-wwm-cased and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

'\nsentences = df["sentence"].tolist()\nlabels = df[["Tense", "Mood", "Person", "Number"]].values.tolist()\n\n# Tokenizar oraciones\ninput_ids = []\nattention_masks = []\n\nfor sent in sentences:\n    encoded = tokenizer.encode_plus(\n        sent,\n        add_special_tokens=True,\n        max_length=64,\n        padding="max_length",\n        truncation=True,\n        return_attention_mask=True,\n        return_tensors="pt"\n    )\n    input_ids.append(encoded["input_ids"])\n    attention_masks.append(encoded["attention_mask"])\n\ninput_ids = torch.cat(input_ids, dim=0)\nattention_masks = torch.cat(attention_masks, dim=0)\n\n# Codificar etiquetas\nlabel_encoders = [LabelEncoder() for _ in range(4)]\nencoded_labels = np.stack([\n    le.fit_transform([row[i] for row in labels]) for i, le in enumerate(label_encoders)\n], axis=1)\nlabels_tensor = torch.tensor(encoded_labels)\n\nprint(input_ids.shape)\nprint(attention_masks.shape)'

In [27]:
def get_bert_embeddings(sentence):
    inputs = tokenizer(sentence, return_tensors="pt")
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)
        hidden_states = outputs.hidden_states
    return inputs, hidden_states

def get_verb_embedding(inputs, hidden_states, verb, strategy):
    # Convertir input_ids a tokens
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

    # Tokenizar el verbo para saber cómo está dividido
    verb_subtokens = tokenizer.tokenize(verb)

    # Buscar el índice de inicio del verbo en la secuencia
    for i in range(len(tokens) - len(verb_subtokens) + 1):
        if tokens[i:i + len(verb_subtokens)] == verb_subtokens:
            verb_indices = list(range(i, i + len(verb_subtokens)))
            break
    else:
        return None  # Verbo no encontrado

    # === Embedding según estrategia ===
    if strategy == "second_last":
        emb = torch.stack([hidden_states[-2][0][i] for i in verb_indices], dim=0)
        return emb.mean(dim=0).cpu()

    elif strategy == "sum_last4":
        emb = sum(hidden_states[-i][0][verb_indices].mean(dim=0) for i in range(1, 5))
        return emb.cpu()

    elif strategy == "concat_last4":
        emb = [hidden_states[-i][0][verb_indices].mean(dim=0) for i in range(1, 5)]
        return torch.cat(emb, dim=-1).cpu()

    elif strategy == "sum_all":
        emb = sum(hidden_states[i][0][verb_indices].mean(dim=0) for i in range(1, 13))
        return emb.cpu()

    else:
        raise ValueError(f"Estrategia desconocida: {strategy}")

In [28]:
dataset_time, dataset_mood, dataset_person, dataset_number = [], [], [], []

for row in train_data:
    inputs, hidden_states = get_bert_embeddings(row["sentence"])
    if hidden_states is not None:
        embTense = get_verb_embedding(inputs, hidden_states, row["verb"], "second_last")
        dataset_time.append([row["verb"], embTense, row["Tense"]])
        embMood = get_verb_embedding(inputs, hidden_states, row["verb"], "sum_last4")
        dataset_mood.append([row["verb"], embMood, row["Mood"]])
        embPerson = get_verb_embedding(inputs, hidden_states, row["verb"], "concat_last4")
        dataset_person.append([row["verb"], embPerson, row["Person"]])
        embNumber = get_verb_embedding(inputs, hidden_states, row["verb"], "sum_all")
        dataset_number.append([row["verb"], embNumber, row["Number"]])

In [19]:
print(dataset_time[0])

['subieron', tensor([-7.6705e-01,  9.6662e-01, -1.3231e+00,  8.4063e-01, -6.2120e-02,
         2.3841e-01, -5.9221e-01, -5.0417e-01,  1.3399e-01, -4.6301e-01,
         2.3400e-01, -6.7053e-02,  1.6126e+00, -2.0420e-01,  1.0109e+00,
         1.5516e+00,  5.3710e-01, -5.1367e-01,  8.3881e-02,  1.6062e-01,
         9.4995e-01, -2.5507e-01,  5.5640e-02,  3.1618e-01, -5.4903e-01,
         2.9648e-01,  1.9622e-01,  2.5557e-02, -7.6783e-02, -2.7634e-01,
         3.1030e-01, -3.2442e-01,  1.3436e-01,  1.1187e+00, -1.8510e-01,
        -3.3856e-01,  2.6190e-01,  6.2866e-01,  7.6826e-01, -7.2731e-01,
         9.3358e-01,  7.3711e-01,  6.1321e-01,  1.0075e+00,  8.4744e-01,
        -1.1516e+00,  1.5824e-01,  6.4528e-01, -3.9458e-01, -1.9185e-01,
         2.4498e-03,  6.1034e-01, -1.2408e-02, -1.0185e+00, -5.5456e-01,
        -5.0382e-01, -1.1598e-01,  4.4478e-01, -6.0384e-01,  3.4050e-01,
         2.7770e-03,  1.2708e-01,  2.6839e-02, -2.0070e-01,  1.0579e+00,
        -6.5686e-01, -5.8496e-01,  6.6

In [29]:
# Extraer etiquetas únicas de cada lista
all_tenses = sorted(set(row[2] for row in dataset_time))
all_moods = sorted(set(row[2] for row in dataset_mood))
all_persons = sorted(set(row[2] for row in dataset_person))
all_numbers = sorted(set(row[2] for row in dataset_number))

# Crear diccionarios {etiqueta_str: clase_int}
tense2id = {label: idx for idx, label in enumerate(all_tenses)}
mood2id = {label: idx for idx, label in enumerate(all_moods)}
person2id = {label: idx for idx, label in enumerate(all_persons)}
number2id = {label: idx for idx, label in enumerate(all_numbers)}

dataset_time = [[verb, emb, tense2id[label]] for verb, emb, label in dataset_time]
dataset_mood = [[verb, emb, mood2id[label]] for verb, emb, label in dataset_mood]
dataset_person = [[verb, emb, person2id[label]] for verb, emb, label in dataset_person]
dataset_number = [[verb, emb, number2id[label]] for verb, emb, label in dataset_number]

In [30]:
from torch.utils.data import Dataset

class SimpleVerbDataset(Dataset):
    def __init__(self, data):
        self.data = data  # Lista de tuplas (verbo, embedding, etiqueta)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        _, embedding, label = self.data[idx]

        # Asegurarse de que embedding ya esté en tipo float y label como long
        if not isinstance(embedding, torch.Tensor):
            embedding = torch.tensor(embedding, dtype=torch.float32)
        else:
            embedding = embedding.float()

        if not isinstance(label, torch.Tensor):
            label = torch.tensor(label, dtype=torch.long)
        else:
            label = label.long()

        return embedding, label

In [31]:
id2tense = {v: k for k, v in tense2id.items()}
id2mood = {v: k for k, v in mood2id.items()}
id2person = {v: k for k, v in person2id.items()}
id2number = {v: k for k, v in number2id.items()}

###5. Obtenemos los dataloaders de entrenamiento y validación

In [32]:
time_dataset = SimpleVerbDataset(dataset_time)
mood_dataset = SimpleVerbDataset(dataset_mood)
person_dataset = SimpleVerbDataset(dataset_person)
number_dataset = SimpleVerbDataset(dataset_number)

from torch.utils.data import DataLoader
batch_size=32

time_loader = DataLoader(time_dataset, batch_size=batch_size, shuffle=True)
mood_loader = DataLoader(mood_dataset, batch_size=batch_size, shuffle=True)
person_loader = DataLoader(person_dataset, batch_size=batch_size, shuffle=True)
number_loader = DataLoader(number_dataset, batch_size=batch_size, shuffle=True)

###6. Planteo del modelo y entrenamiento

In [33]:
import torch.nn as nn
import torch.nn.functional as F
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Modelo simple
class VerbClassifier(nn.Module):
    def __init__(self, input_dim, num_classes):
        super().__init__()
        self.fc = nn.Linear(input_dim, num_classes)

    def forward(self, x):
        return self.fc(x)

In [34]:
embedding_dim = 768
embedding_dim_concat_last4 = 3072

modelTime = VerbClassifier(input_dim=embedding_dim, num_classes=len(tense2id)).to(device)
modelMood = VerbClassifier(input_dim=embedding_dim, num_classes=len(mood2id)).to(device)
modelPerson = VerbClassifier(input_dim=embedding_dim_concat_last4, num_classes=len(person2id)).to(device)
modelNumber = VerbClassifier(input_dim=embedding_dim, num_classes=len(number2id)).to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

In [41]:
def train_and_evaluate(dataloader, label_name, model, loss_fn, optimizer, epochs=20):

    # Entrenamiento
    for epoch in range(epochs):
        model.train()
        total_loss = 0
        for batch_x, batch_y in dataloader:
            batch_x, batch_y = batch_x.to(device), batch_y.to(device)
            optimizer.zero_grad()
            outputs = model(batch_x)
            loss = loss_fn(outputs, batch_y)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        print(f"Epoch {epoch+1} | Loss: {total_loss:.4f}")

    # Evaluación
    model.eval()
    all_preds, all_labels = [], []
    with torch.no_grad():
        for batch_x, batch_y in dataloader:
            outputs = model(batch_x.to(device))
            preds = outputs.argmax(dim=1).cpu().numpy()
            labels = batch_y.cpu().numpy()
            all_preds.extend(preds)
            all_labels.extend(labels)

    acc = accuracy_score(all_labels, all_preds)
    f1 = f1_score(all_labels, all_preds, average="macro")

    print(f"[{label_name}] Accuracy: {acc:.4f} | Macro F1: {f1:.4f}")


In [42]:
epochs=50
train_and_evaluate(time_loader, "Tiempo", modelTime, loss_fn, optimizer, epochs)
train_and_evaluate(mood_loader, "Modo", modelMood, loss_fn, optimizer, epochs)
train_and_evaluate(person_loader, "Persona", modelPerson, loss_fn, optimizer, epochs)
train_and_evaluate(number_loader, "Número", modelNumber, loss_fn, optimizer, epochs)

Epoch 1 | Loss: 1066.0843
Epoch 2 | Loss: 1066.0415
Epoch 3 | Loss: 1065.9851
Epoch 4 | Loss: 1065.8393
Epoch 5 | Loss: 1066.2335
Epoch 6 | Loss: 1066.0938
Epoch 7 | Loss: 1065.7669
Epoch 8 | Loss: 1065.9020
Epoch 9 | Loss: 1066.1423
Epoch 10 | Loss: 1066.0482
Epoch 11 | Loss: 1065.8616
Epoch 12 | Loss: 1066.0209
Epoch 13 | Loss: 1065.9709
Epoch 14 | Loss: 1065.9459
Epoch 15 | Loss: 1066.3215
Epoch 16 | Loss: 1065.9631
Epoch 17 | Loss: 1066.0607
Epoch 18 | Loss: 1066.1041
Epoch 19 | Loss: 1066.1010
Epoch 20 | Loss: 1066.0762
Epoch 21 | Loss: 1066.2447
Epoch 22 | Loss: 1065.9631
Epoch 23 | Loss: 1065.8310
Epoch 24 | Loss: 1065.9581
Epoch 25 | Loss: 1066.1192
Epoch 26 | Loss: 1066.4419
Epoch 27 | Loss: 1065.9957
Epoch 28 | Loss: 1066.0720
Epoch 29 | Loss: 1065.8885
Epoch 30 | Loss: 1065.8793
Epoch 31 | Loss: 1066.0282
Epoch 32 | Loss: 1066.1435
Epoch 33 | Loss: 1065.9082
Epoch 34 | Loss: 1066.3600
Epoch 35 | Loss: 1065.5857
Epoch 36 | Loss: 1065.8761
Epoch 37 | Loss: 1066.0771
Epoch 38 |

###7. Evaluación

In [23]:
!pip install -U spacy
!python -m spacy download es_core_news_md

Collecting es-core-news-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_md-3.8.0/es_core_news_md-3.8.0-py3-none-any.whl (42.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.3/42.3 MB[0m [31m15.5 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: es-core-news-md
Successfully installed es-core-news-md-3.8.0
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('es_core_news_md')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [24]:
import spacy
nlp = spacy.load("es_core_news_md")

In [25]:
def detectar_verbos_spacy(oracion):
    doc = nlp(oracion)
    return [(token.text, token.i) for token in doc if token.pos_ == "VERB"]

In [39]:
def analizar_oracion(oracion):
    verbos = detectar_verbos_spacy(oracion)
    resultados = []

    # Solo pasamos por BERT una vez por oración
    inputs, hidden_states = get_bert_embeddings(oracion)

    for verbo, _ in verbos:
        # Extraer embeddings usando cada estrategia correspondiente
        emb_time = get_verb_embedding(inputs, hidden_states, verbo, "second_last")
        emb_mood = get_verb_embedding(inputs, hidden_states, verbo, "sum_last4")
        emb_person = get_verb_embedding(inputs, hidden_states, verbo, "concat_last4")
        emb_number = get_verb_embedding(inputs, hidden_states, verbo, "sum_all")

        # Si alguno no se encuentra (por problemas con subwords), salteamos
        if None in [emb_time, emb_mood, emb_person, emb_number]:
            continue

        tiempo = id2tense[modelTime(emb_time.unsqueeze(0).to(device)).argmax(dim=1).item()]
        modo = id2mood[modelMood(emb_mood.unsqueeze(0).to(device)).argmax(dim=1).item()]
        persona = id2person[modelPerson(emb_person.unsqueeze(0).to(device)).argmax(dim=1).item()]
        numero = id2number[modelNumber(emb_number.unsqueeze(0).to(device)).argmax(dim=1).item()]

        resultados.append({
            "verbo": verbo,
            "tiempo": tiempo,
            "modo": modo,
            "persona": persona,
            "numero": numero
        })

    return resultados

In [43]:
oracion = "Cuando llegamos a casa, ella estaba cocinando y yo lavaba los platos."
for resultado in analizar_oracion(oracion):
    print(f"🔹 Verbo: {resultado['verbo']}")
    print(f"  • Tiempo: {resultado['tiempo']}")
    print(f"  • Modo: {resultado['modo']}")
    print(f"  • Persona: {resultado['persona']}")
    print(f"  • Número: {resultado['numero']}")

🔹 Verbo: llegamos
  • Tiempo: Past
  • Modo: Sub
  • Persona: 1
  • Número: Plur
🔹 Verbo: cocinando
  • Tiempo: Pres
  • Modo: Sub
  • Persona: 2
  • Número: Plur
🔹 Verbo: lavaba
  • Tiempo: Past
  • Modo: Sub
  • Persona: 1
  • Número: Plur
