**ISCTE - Lisbon University Institute**

Master Degree in Artificial Intelligence
Advanced Machine Learning
2025/2026 - 1st semester

*Project Assignement*
Version 1.0 (2025-11-18)

    This work must be carried individually or in pairs of 2 students (recommended).
    Deadline: Saturday, December 6, until 11:59 PM.
    The project presentation will take place on December 9, during class time.

*Part 1 – POS Tagging*
The goal of this assignment is to develop and compare models for Part-of-Speech (POS) tagging using two different deep learning architectures:

    *LSTM-based models*
    *Pre-trained transformer-based models*

Students will train or fine-tune, evaluate, and analyze the performance of these models on the provided dataset.

**Dataset**

For this part we will be using the Universal Dependencies English Web Treebank data (v2.17 - 2025-11-15). The data is already split into training, development and test subsets.

en_ewt-ud-train.conllu
en_ewt-ud-dev.conllu
en_ewt-ud-test.conllu

Each one of these subsets contain words, grouped by sentences, each one of them labeled with the corresponding POS tag. After downloading the data, the following function can be used to load each one of the datasets individually.

In [3]:
import numpy as np
import os
import time
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, Bidirectional, LSTM, TimeDistributed, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.metrics import classification_report, accuracy_score

2025-12-04 16:03:57.397521: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-12-04 16:03:57.435236: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-12-04 16:03:58.395677: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.


**Task 1.1 - Train an LSTM-Based Model**

    Train a sequence model (e.g. LSTM) on the training set.
    Include an embedding layer. Pre-trained embeddings, such as Glove embeddings, are avaliable for download and can be used.
    Output a POS tag per token using a softmax classifier.

Evaluate your model on the test set

    Report the global performance using Accuracy
    Report the performance per-class using Precision, Recall, F1-score
    
Report training time, model size, and any hardware constraints.

In [None]:
# --- Funções de Leitura de Dados ---

def read_conllu_file(filepath):
    """
    Read a CoNLL-U format file and extract words and POS tags sentence by sentence.
    
    Args:
        filepath: Path to the CoNLL-U file
        
    Returns:
        A list of dictionaries, each containing 'words' and 'pos_tags' lists for a sentence
    """
    sentences = []
    current_sentence = {'words': [], 'pos_tags': []}
    
    with open(filepath, "r", encoding="utf-8") as data_file:
        for line in data_file:
            if line.startswith("#"):
                # Skip comment lines
                pass
            elif line.strip() == "":
                # Empty line marks end of sentence
                if current_sentence['words']:  # Only add non-empty sentences
                    sentences.append(current_sentence)
                    current_sentence = {'words': [], 'pos_tags': []}
            else:
                # Parse the token line
                fields = line.split("\t")
                word, pos = fields[1], fields[3]
                current_sentence['words'].append(word)
                current_sentence['pos_tags'].append(pos)
    
    return sentences

#load data
TRAIN = "./data/en_ewt-ud-train.conllu"
DEV = "./data/en_ewt-ud-dev.conllu"
TEST = "./data/en_ewt-ud-test.conllu"

# --- 1) Carregar Dados ---
try:
    train_sents = read_conllu_file(TRAIN)
    dev_sents = read_conllu_file(DEV)
    test_sents = read_conllu_file(TEST)
    print("Loaded sentences:", len(train_sents), len(dev_sents), len(test_sents))
except FileNotFoundError as e:
    print(f"Erro: Ficheiro de dados não encontrado: {e.filename}. Certifique-se de que os ficheiros CoNLL-U estão em './data/'")
    exit()

# Display preview
print(f"Total sentences: {len(dev_sents)}")
print(f"First 3 sentences:")
for i, sent in enumerate(dev_sents[:3]):
    print(f"Sentence {i+1}:")
    print(f"  Words: {sent['words']}")
    print(f"  POS tags: {sent['pos_tags']}")

Loaded sentences: 12544 2001 2077
Total sentences: 2001
First 3 sentences:
Sentence 1:
  Words: ['From', 'the', 'AP', 'comes', 'this', 'story', ':']
  POS tags: ['ADP', 'DET', 'PROPN', 'VERB', 'DET', 'NOUN', 'PUNCT']
Sentence 2:
  Words: ['President', 'Bush', 'on', 'Tuesday', 'nominated', 'two', 'individuals', 'to', 'replace', 'retiring', 'jurists', 'on', 'federal', 'courts', 'in', 'the', 'Washington', 'area', '.']
  POS tags: ['PROPN', 'PROPN', 'ADP', 'PROPN', 'VERB', 'NUM', 'NOUN', 'PART', 'VERB', 'VERB', 'NOUN', 'ADP', 'ADJ', 'NOUN', 'ADP', 'DET', 'PROPN', 'NOUN', 'PUNCT']
Sentence 3:
  Words: ['Bush', 'nominated', 'Jennifer', 'M.', 'Anderson', 'for', 'a', '15', '-', 'year', 'term', 'as', 'associate', 'judge', 'of', 'the', 'Superior', 'Court', 'of', 'the', 'District', 'of', 'Columbia', ',', 'replacing', 'Steffen', 'W.', 'Graae', '.']
  POS tags: ['PROPN', 'VERB', 'PROPN', 'PROPN', 'PROPN', 'ADP', 'DET', 'NUM', 'PUNCT', 'NOUN', 'NOUN', 'ADP', 'ADJ', 'NOUN', 'ADP', 'DET', 'ADJ', 'PROP

In [5]:
# --- Funções de Loss e Métrica Personalizadas para Ignorar Padding ---

# O valor que usamos para padding nos labels.
# Deve ser um valor negativo ou um valor que não seja um índice de classe válido.
PAD_VALUE = -100

def masked_sparse_categorical_crossentropy(y_true, y_pred):
    """
    Calcula a sparse_categorical_crossentropy ignorando os tokens de padding.
    O padding é identificado pelo valor PAD_VALUE nos labels y_true.
    """
    # 1. Criar a máscara: 1.0 onde o label não é PAD_VALUE, 0.0 onde é.
    # y_true tem shape (batch_size, max_len).
    mask = tf.cast(tf.not_equal(y_true, PAD_VALUE), tf.float32)
    
    # 2. Converter y_true para um tensor de inteiros (necessário para a loss).
    # O Keras exige que os labels sejam >= 0. Vamos substituir PAD_VALUE por 0
    # APENAS para o cálculo da loss, mas a máscara garante que o seu contributo é zero.
    y_true_safe = tf.where(tf.equal(y_true, PAD_VALUE), tf.constant(0, dtype=y_true.dtype), y_true)
    
    # 3. Calcular a loss normal (por token).
    # sparse_categorical_crossentropy espera y_true com shape (batch_size, max_len)
    # e y_pred com shape (batch_size, max_len, num_classes).
    loss = tf.keras.losses.sparse_categorical_crossentropy(y_true_safe, y_pred)
    
    # 4. Aplicar a máscara à loss.
    masked_loss = loss * mask
    
    # 5. Normalizar a loss pelo número de tokens não-padding.
    # Isto é crucial para que a loss não diminua artificialmente com o aumento do padding.
    num_non_padded_tokens = tf.reduce_sum(mask)
    
    # Evitar divisão por zero
    return tf.reduce_sum(masked_loss) / (num_non_padded_tokens + 1e-7)

def masked_accuracy(y_true, y_pred):
    """
    Calcula a acurácia ignorando os tokens de padding.
    """
    # 1. Criar a máscara.
    mask = tf.cast(tf.not_equal(y_true, PAD_VALUE), tf.float32)
    
    # 2. Obter as classes previstas (índice da maior probabilidade).
    y_pred_class = tf.cast(tf.argmax(y_pred, axis=-1), tf.int32)
    
    # 3. Converter y_true para um tensor de inteiros (para comparação).
    y_true_int = tf.cast(y_true, tf.int32)
    
    # 4. Comparar previsões com labels verdadeiros.
    matches = tf.cast(tf.equal(y_true_int, y_pred_class), tf.float32)
    
    # 5. Aplicar a máscara.
    masked_matches = matches * mask
    
    # 6. Normalizar a acurácia pelo número de tokens não-padding.
    num_non_padded_tokens = tf.reduce_sum(mask)
    
    # Evitar divisão por zero
    return tf.reduce_sum(masked_matches) / (num_non_padded_tokens + 1e-7)

In [6]:
from collections import Counter

# --- Configuração de Parâmetros ---
#glove_path = "../Part1/data/wiki_giga_2024_100_MFT20_vectors_seed_2024_alpha_0.75_eta_0.05.050_combined.txt"
#glove_path = "../Part1/data/wiki_giga_2024_200_MFT20_vectors_seed_2024_alpha_0.75_eta_0.05_combined.txt"
glove_path = "../Part1/data/wiki_giga_2024_300_MFT20_vectors_seed_2024_alpha_0.75_eta_0.05_combined.txt"

DIM = 300
EMB_DIM = DIM
MAX_LEN = DIM
BATCH_SIZE = 128
EPOCHS = 5
LEARNING_RATE = 1e-3
PAD_VALUE = -100
PAD_INDEX = 0 # Índice para o token <PAD> no word2idx


# --- 2) Vocab / encoding / padding ---
word_counts = Counter(w for s in train_sents for w in s['words'])
word2idx = {"<PAD>":PAD_INDEX, "<UNK>":1}
for w,_ in word_counts.items():
    if w not in word2idx:
        word2idx[w] = len(word2idx)

tags = sorted(list({t for s in train_sents for t in s['pos_tags']}))
tag2idx = {t:i for i,t in enumerate(tags)}
idx2tag = {i:t for t,i in tag2idx.items()}
num_tags = len(tag2idx)
vocab_size = len(word2idx)
print("Vocab size:", vocab_size, "num tags:", num_tags)

def encode_sentences(sents, w2i, t2i):
    X, y = [], []
    for s in sents:
        X.append([w2i.get(w, w2i["<UNK>"]) for w in s['words']])
        y.append([t2i[t] for t in s['pos_tags']])
    return X, y

X_train, y_train = encode_sentences(train_sents, word2idx, tag2idx)
X_dev, y_dev = encode_sentences(dev_sents, word2idx, tag2idx)
X_test, y_test = encode_sentences(test_sents, word2idx, tag2idx)

# Padding dos inputs (X)
X_train_p = pad_sequences(X_train, maxlen=MAX_LEN, padding='post', truncating='post')
X_dev_p = pad_sequences(X_dev, maxlen=MAX_LEN, padding='post', truncating='post')
X_test_p = pad_sequences(X_test, maxlen=MAX_LEN, padding='post', truncating='post')

# Padding dos labels (y) com o valor especial PAD_VALUE
y_train_p = pad_sequences(y_train, maxlen=MAX_LEN, padding='post', truncating='post', value=PAD_VALUE)
y_dev_p = pad_sequences(y_dev, maxlen=MAX_LEN, padding='post', truncating='post', value=PAD_VALUE)
y_test_p = pad_sequences(y_test, maxlen=MAX_LEN, padding='post', truncating='post', value=PAD_VALUE)

# Garante que os labels são int32 (necessário para Keras)
y_train_sparse = y_train_p.astype(np.int32)
y_dev_sparse = y_dev_p.astype(np.int32)
y_test_sparse = y_test_p.astype(np.int32)

# --- 3) Carregar GloVe e montar embedding_matrix ---
rng = np.random.RandomState(12345)
embedding_matrix = rng.normal(scale=0.6, size=(vocab_size, EMB_DIM)).astype(np.float32)
embedding_matrix[PAD_INDEX] = np.zeros((EMB_DIM,), dtype=np.float32)

if os.path.exists(glove_path):
    wanted = set(word2idx.keys()) | set(w.lower() for w in word2idx.keys())
    found = 0
    with open(glove_path, "r", encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word = parts[0]
            if word not in wanted:
                continue
            vec = np.asarray(parts[1:], dtype=np.float32)
            if vec.shape[0] != EMB_DIM:
                continue

            # Coloca em qualquer índice correspondente (case-sensitive then lowercase)
            if word in word2idx:
                embedding_matrix[word2idx[word]] = vec
                found += 1
            lower = word.lower()
            if lower in word2idx and word2idx[lower] != word2idx.get(word): # Evitar contar duas vezes se word == lower
                embedding_matrix[word2idx[lower]] = vec
                found += 1
    print(f"GloVe: found {found} tokens from vocab and loaded into embedding_matrix")
else:
    print("Warning: GloVe file not found at", glove_path, "-> using random init")

embedding_matrix[PAD_INDEX] = np.zeros((EMB_DIM,), dtype=np.float32) # Re-assert PAD row is zero

# --- 4) Construir o modelo Keras com Embedding congelada ---
inp = Input(shape=(MAX_LEN,), dtype='int32', name="input_ids")
emb = Embedding(input_dim=vocab_size,
                output_dim=EMB_DIM,
                weights=[embedding_matrix],
                input_length=MAX_LEN,
                mask_zero=True, # Importante para o Keras ignorar o padding no LSTM
                trainable=False,
                )(inp)

x = Bidirectional(LSTM(128, return_sequences=True), name="bilstm")(emb)
out = TimeDistributed(Dense(num_tags, activation='softmax'), name="tag_out")(x)

model = Model(inputs=inp, outputs=out)
optimizer = Adam(learning_rate=LEARNING_RATE)

# --- COMPILAÇÃO COM LOSS E MÉTRICA PERSONALIZADAS ---
model.compile(optimizer=optimizer,
              loss=masked_sparse_categorical_crossentropy,
              metrics=[masked_accuracy])

model.summary()

# --- 5) Treinar ---
print("\n--- INÍCIO DO TREINO (Com Loss e Métrica Corrigidas) ---")
start = time.time()
history = model.fit(X_train_p, y_train_sparse,
                    validation_data=(X_dev_p, y_dev_sparse),
                    epochs=EPOCHS,
                    batch_size=BATCH_SIZE
                    )
train_time = time.time() - start
print(f"\nTreino concluído em {train_time:.1f} s")

# --- 6) Avaliar no test set: obter previsões token-level e métricas por classe ---
preds = model.predict(X_test_p, batch_size=BATCH_SIZE, verbose=0) # shape (N, MAX_LEN, num_tags)
pred_labels = np.argmax(preds, axis=-1) # Flatten ignorando os PAD_VALUE tokens

def flatten_preds_trues(preds_labels, true_padded, mask_value=PAD_VALUE):
    """ Função original do utilizador para avaliação final, mantida. """
    pred_flat = []
    true_flat = []
    for p_seq, t_seq in zip(preds_labels, true_padded):
        for p, t in zip(p_seq, t_seq):
            if t == mask_value:
                continue
            pred_flat.append(int(p))
            true_flat.append(int(t))
    return pred_flat, true_flat

pred_flat, true_flat = flatten_preds_trues(pred_labels, y_test_p, mask_value=PAD_VALUE)
acc = accuracy_score(true_flat, pred_flat)
print(f"\nTest Accuracy (token-level - avaliação final): {acc:.4f}\n")

# O classification_report requer que os labels sejam 0-indexed, o que é o caso
# porque o tag2idx começa em 0.
print("Classification report (Precision/Recall/F1):\n")
print(classification_report(true_flat, pred_flat, target_names=[idx2tag[i] for i in range(num_tags)], zero_division=0))

# --- 7) Guardar modelo / embedding_matrix se quiseres ---
model.save("pos_model_glove_frozen_corrected.h5")
np.save(f"embedding_matrix_${DIM}d.npy", embedding_matrix)
print("Model e embedding matrix salvos.")

Vocab size: 20203 num tags: 18


I0000 00:00:1764868141.950386    1310 gpu_device.cc:2020] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 4130 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 4050 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.9
2025-12-04 17:09:02.174897: W external/local_xla/xla/service/gpu/llvm_gpu_backend/default/nvptx_libdevice_path.cc:41] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
Searched for CUDA in the following directories:
  ./cuda_sdk_lib
  ipykernel_launcher.runfiles/cuda_nvcc
  ipykernel_launcher.runfiles/cuda_nvdisasm
  ipykernel_launcher.runfiles/nvidia_nvshmem
  ipykern/cuda_nvcc
  ipykern/cuda_nvdisasm
  ipykern/nvidia_nvshmem
  
  /usr/local/cuda
  /opt/cuda
  /home/ricadinho/Desktop/cenas_universidade/2_ano/1_semestre/AAA/ree/.venv/lib/python3.13/site-packages/tensorflow/python/platform/../../../nvidia/cuda_nvcc
  /home/rica

UnknownError: {{function_node __wrapped__Sign_device_/job:localhost/replica:0/task:0/device:GPU:0}} JIT compilation failed. [Op:Sign] name: 

In [None]:
#DIM 100

import matplotlib.pyplot as plt

history

## --- 1. Plot Training & Validation Loss ---
plt.figure(figsize=(12, 5))

# Subplot for Loss
plt.subplot(1, 2, 1) # 1 row, 2 columns, 1st plot
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Loss Evolution Over Epochs')
plt.ylabel('Loss Value')
plt.xlabel('Epoch')
plt.legend()
plt.grid(True)

## --- 2. Plot Training & Validation Accuracy ---

# Subplot for Accuracy
plt.subplot(1, 2, 2) # 1 row, 2 columns, 2nd plot
# NOTE: Use the exact name of your custom metric defined in model.compile()
train_metric = 'masked_accuracy'
val_metric = 'val_masked_accuracy'

plt.plot(history.history[train_metric], label='Training Accuracy')
plt.plot(history.history[val_metric], label='Validation Accuracy')
plt.title('Accuracy Evolution Over Epochs')
plt.ylabel('Accuracy (%)')
plt.xlabel('Epoch')
plt.legend()
plt.grid(True)

plt.tight_layout() # Adjusts subplots to fit in figure area
plt.show()

In [None]:
#DIM 200

import matplotlib.pyplot as plt

history

## --- 1. Plot Training & Validation Loss ---
plt.figure(figsize=(12, 5))

# Subplot for Loss
plt.subplot(1, 2, 1) # 1 row, 2 columns, 1st plot
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Loss Evolution Over Epochs')
plt.ylabel('Loss Value')
plt.xlabel('Epoch')
plt.legend()
plt.grid(True)

## --- 2. Plot Training & Validation Accuracy ---

# Subplot for Accuracy
plt.subplot(1, 2, 2) # 1 row, 2 columns, 2nd plot
# NOTE: Use the exact name of your custom metric defined in model.compile()
train_metric = 'masked_accuracy'
val_metric = 'val_masked_accuracy'

plt.plot(history.history[train_metric], label='Training Accuracy')
plt.plot(history.history[val_metric], label='Validation Accuracy')
plt.title('Accuracy Evolution Over Epochs')
plt.ylabel('Accuracy (%)')
plt.xlabel('Epoch')
plt.legend()
plt.grid(True)

plt.tight_layout() # Adjusts subplots to fit in figure area
plt.show()

In [None]:
#DIM 300

import matplotlib.pyplot as plt

history

## --- 1. Plot Training & Validation Loss ---
plt.figure(figsize=(12, 5))

# Subplot for Loss
plt.subplot(1, 2, 1) # 1 row, 2 columns, 1st plot
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Loss Evolution Over Epochs')
plt.ylabel('Loss Value')
plt.xlabel('Epoch')
plt.legend()
plt.grid(True)

## --- 2. Plot Training & Validation Accuracy ---

# Subplot for Accuracy
plt.subplot(1, 2, 2) # 1 row, 2 columns, 2nd plot
# NOTE: Use the exact name of your custom metric defined in model.compile()
train_metric = 'masked_accuracy'
val_metric = 'val_masked_accuracy'

plt.plot(history.history[train_metric], label='Training Accuracy')
plt.plot(history.history[val_metric], label='Validation Accuracy')
plt.title('Accuracy Evolution Over Epochs')
plt.ylabel('Accuracy (%)')
plt.xlabel('Epoch')
plt.legend()
plt.grid(True)

plt.tight_layout() # Adjusts subplots to fit in figure area
plt.show()

**Task 1.2 - Transformer-Based Encoder Model**

    Choose an encoder-based model (e.g., DistilBERT, BERT, RoBERTa, NeoBERT, EuroBERT)
    Ensure proper handling of subword tokenization (alignment between tokens and tags).
    Fine-tune the model for token-level classification (POS tagging).

Evaluate your model on the test set. Please take into account that the tokens being used by the model may not entirely correspond to existing the tokens

    Report the global performance using Accuracy
    Report the performance per-class using Precision, Recall, F1-score

Report training time, model size, and any hardware constraints.

In [9]:
# bert_finetune_and_eval_sklearn.py
import os
from pathlib import Path
from collections import Counter

import numpy as np
from datasets import Dataset, DatasetDict
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    TrainingArguments,
    Trainer,
)
import torch
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report
import random

# -------------------------
# Helpers: parse CoNLL-U
# -------------------------
def parse_conllu(path):
    examples = []
    with open(path, "r", encoding="utf-8") as f:
        tokens = []
        upos = []
        for line in f:
            line = line.strip()
            if line == "" or line.startswith("#"):
                if tokens:
                    examples.append({"tokens": tokens, "upos": upos})
                    tokens = []
                    upos = []
                continue
            parts = line.split("\t")
            if len(parts) != 10:
                continue
            idx, form, lemma, upos_tag, xpos, feats, head, deprel, deps, misc = parts
            # skip multiword tokens and empty nodes
            if "-" in idx or "." in idx:
                continue
            tokens.append(form)
            upos.append(upos_tag)
        if tokens:
            examples.append({"tokens": tokens, "upos": upos})
    return examples

# -------------------------
# Align labels to tokens (wordpiece)
# -------------------------
def align_labels_with_tokens(tokenizer, examples, label_to_id, max_length=128):
    tokenized_inputs = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in examples:
        tokens = ex["tokens"]
        labels = ex["upos"]
        # tokenize with is_split_into_words
        enc = tokenizer(
            tokens,
            is_split_into_words=True,
            truncation=True,
            padding="max_length",
            max_length=max_length,
            return_attention_mask=True,
        )
        word_ids = enc.word_ids()
        aligned_labels = []
        previous_word_idx = None
        for word_idx in word_ids:
            if word_idx is None:
                aligned_labels.append(-100)
            elif word_idx != previous_word_idx:
                aligned_labels.append(label_to_id[labels[word_idx]])
            else:
                # ignore subsequent subword tokens in loss
                aligned_labels.append(-100)
            previous_word_idx = word_idx

        tokenized_inputs["input_ids"].append(enc["input_ids"])
        tokenized_inputs["attention_mask"].append(enc["attention_mask"])
        tokenized_inputs["labels"].append(aligned_labels)
    return tokenized_inputs

# Paths - adjust if necessary
train_file = TRAIN
dev_file =  DEV
test_file =  TEST

# model/tokenizer settings
model_name = "bert-base-cased"  # UD often benefits from cased tokenizer
max_length = 128
train_batch_size = 16
eval_batch_size = 32
epochs = 3
output_dir = "./out_bert_pos"

# reproducibility
seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)

# Parse datasets
print("Parsing CoNLL-U files...")
train_examples = parse_conllu(train_file)
dev_examples = parse_conllu(dev_file)
test_examples = parse_conllu(test_file)
print(f"Loaded: train={len(train_examples)}, dev={len(dev_examples)}, test={len(test_examples)}")
 
# Build label list from all splits
counter = Counter()
for ex in (train_examples + dev_examples + test_examples):
    counter.update(ex["upos"])
label_list = sorted(counter.keys())
label_to_id = {l: i for i, l in enumerate(label_list)}
id_to_label = {i: l for l, i in label_to_id.items()}
print(f"Labels ({len(label_list)}): {label_list}")

# Tokenizer & model
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(label_list),
    id2label=id_to_label,
    label2id=label_to_id,
)

# Align labels to tokenizer outputs
print("Tokenizing and aligning labels...")
tokenized_train = align_labels_with_tokens(tokenizer, train_examples, label_to_id, max_length=max_length)
tokenized_val   = align_labels_with_tokens(tokenizer, dev_examples,   label_to_id, max_length=max_length)
tokenized_test  = align_labels_with_tokens(tokenizer, test_examples,  label_to_id, max_length=max_length)

# Convert to HF Dataset objects
hf_train = Dataset.from_dict(tokenized_train)
hf_val   = Dataset.from_dict(tokenized_val)
hf_test  = Dataset.from_dict(tokenized_test)
datasets = DatasetDict({"train": hf_train, "validation": hf_val, "test": hf_test})

# Training arguments
training_args = TrainingArguments(
    output_dir=output_dir,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=train_batch_size,
    per_device_eval_batch_size=eval_batch_size,
    num_train_epochs=epochs,
    weight_decay=0.01,
    fp16=torch.cuda.is_available(),
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    logging_steps=200,
    save_total_limit=2,
    disable_tqdm=False,
)

# Trainer (no compute_metrics here because we'll evaluate with sklearn after train)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=datasets["train"],
    eval_dataset=datasets["validation"],
    tokenizer=tokenizer,
)

# Train
print("Starting training...")
trainer.train()

# Save model & tokenizer
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)
print(f"Model & tokenizer saved to {output_dir}")
 
# -------------------------
# Predict on test set
# -------------------------
print("Running predictions on test set...")
raw_pred = trainer.predict(datasets["test"])
logits = raw_pred.predictions  # shape (N, seq_len, n_labels)
preds = np.argmax(logits, axis=-1)
labels = raw_pred.label_ids  # -100 where ignored

# Flatten predictions & labels, ignoring -100 positions
y_true = []
y_pred = []
for i in range(labels.shape[0]):
    for j in range(labels.shape[1]):
        lab = labels[i, j]
        if lab == -100:
            continue
        y_true.append(id_to_label[int(lab)])
        y_pred.append(id_to_label[int(preds[i, j])])

# Global accuracy
acc = accuracy_score(y_true, y_pred)
print(f"\nGlobal Accuracy on test set: {acc:.4f}")

# Per-class precision/recall/f1 & support
# Convert to integer ids for sklearn functions
y_true_ids = [label_to_id[l] for l in y_true]
y_pred_ids = [label_to_id[l] for l in y_pred]

precision, recall, f1, support = precision_recall_fscore_support(
    y_true_ids, y_pred_ids, labels=list(range(len(label_list))), zero_division=0
)

print("\nPer-class performance:")
print("{:10s} {:8s} {:8s} {:8s} {:8s}".format("LABEL", "PREC", "RECL", "F1", "SUPPORT"))
for i, lab in enumerate(label_list):
    print(f"{lab:10s} {precision[i]:8.4f} {recall[i]:8.4f} {f1[i]:8.4f} {support[i]:8d}")

# More readable classification_report
print("\nClassification report (sklearn):\n")
print(classification_report(y_true_ids, y_pred_ids, labels=list(range(len(label_list))), target_names=label_list, zero_division=0))

Parsing CoNLL-U files...
Loaded: train=12544, dev=2001, test=2077
Labels (17): ['ADJ', 'ADP', 'ADV', 'AUX', 'CCONJ', 'DET', 'INTJ', 'NOUN', 'NUM', 'PART', 'PRON', 'PROPN', 'PUNCT', 'SCONJ', 'SYM', 'VERB', 'X']


Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Tokenizing and aligning labels...


TypeError: TrainingArguments.__init__() got an unexpected keyword argument 'evaluation_strategy'

**Task 1.3 - Use LLMs to perform the task in the test set (optional, only if you have time)**

    Choose an existing LLM of your choice (e.g., ChatGPT)
    Define a prompt and perform the classification.
    Report the performance of the model, and compare it with your previous models.
    
**Comparison and Analysis**

    Compare the performance of the previous models, in terms of:
        Quantitative performance (metrics)
        Qualitative behavior (e.g., errors, generalization)
        Computational efficiency and training stability

    Discuss potential reasons for performance differences.
    Optionally visualize:
        Confusion matrices
        Learning curves