The CoNLL-2003 dataset illustrates token-level labeling for three tasks: named entity recognition (NER), part-of-speech (POS) tagging, and chunking. Each field of a dataset record (printed as a JSON-like object further below) corresponds to a different layer of annotation for the sentence:

1. **Tokens**: These are the individual words or punctuation marks from the text. In this case, the sentence "EU rejects German call to boycott British lamb." is split into tokens:
   - "EU"
   - "rejects"
   - "German"
   - "call"
   - "to"
   - "boycott"
   - "British"
   - "lamb"
   - "."

2. **POS Tags**: This array contains the POS tag for each token. The tags are encoded as integers that index into the `pos_names` list defined in the loading script, which follows the Penn Treebank tag set:
   - "EU" is tagged as 22 (`NNP`), a singular proper noun.
   - "rejects" is tagged as 42 (`VBZ`), a verb in the third-person singular present tense.
   - The remaining tokens follow the same pattern, e.g. "German" is 16 (`JJ`, adjective) and "lamb" is 21 (`NN`, singular noun).

3. **Chunk Tags**: This array marks phrase chunk boundaries and types (like NP for noun phrase, VP for verb phrase). Each integer indexes into the `chunk_names` list, where a `B-` prefix begins a chunk and an `I-` prefix continues it:
   - "EU" begins a noun phrase, hence 11 (`B-NP`).
   - "rejects" begins a verb phrase, indicated by 21 (`B-VP`).
   - The chunk tags help in parsing the sentence into linguistically meaningful phrases.

4. **NER Tags**: These tags are used for named entity recognition. They identify whether each token is part of a named entity (like a person, location, organization) and the type of entity. Each integer indexes into the `ner_names` list:
   - "EU" is tagged as 3 (`B-ORG`), the beginning of an organization name.
   - "German" and "British" are tagged as 7 (`B-MISC`), the miscellaneous category, which in CoNLL-2003 covers nationalities among other things.
   - Other tokens are tagged as 0 (`O`), meaning they are not part of any named entity.
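
To make the integer codes above concrete, here is a minimal sketch that maps the example's chunk and NER tag IDs back to their string names. The label lists are copied from the `chunk_names` and `ner_names` definitions in the loading script of this notebook; `decoded` is only an illustrative variable name.

```python
# Label lists copied from the chunk_names / ner_names definitions in the loading script.
chunk_names = [
    "O", "B-ADJP", "I-ADJP", "B-ADVP", "I-ADVP", "B-CONJP", "I-CONJP",
    "B-INTJ", "I-INTJ", "B-LST", "I-LST", "B-NP", "I-NP", "B-PP", "I-PP",
    "B-PRT", "I-PRT", "B-SBAR", "I-SBAR", "B-UCP", "I-UCP", "B-VP", "I-VP",
]
ner_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

# The first training example, as printed later in the notebook.
tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
chunk_tags = [11, 21, 11, 12, 21, 22, 11, 12, 0]
ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]

# Each integer is simply an index into the corresponding name list.
decoded = [
    (tok, chunk_names[c], ner_names[n])
    for tok, c, n in zip(tokens, chunk_tags, ner_tags)
]
for tok, chunk, ner in decoded:
    print(f"{tok:8s} {chunk:5s} {ner}")
```

Reading the printed table row by row shows, for instance, that "German call" forms one noun phrase (`B-NP`, `I-NP`) in which only "German" carries an entity tag.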

Homework:

1. Load a NER dataset (e.g. CoNLL-2003) using the script provided below.
   - Create a custom nn.Module class that takes GloVe word embeddings as input, passes them through a linear layer, and outputs NER tags
   - Train the model using cross-entropy loss and evaluate its performance using entity-level F1 score
   - Analyze the model's predictions and visualize the confusion matrix to identify common errors
2. Build a multi-layer perceptron (MLP) for NER using GloVe embeddings
   - Extend the previous exercise by creating an nn.Module class that defines an MLP architecture on top of GloVe embeddings
   - Experiment with different hidden layer sizes and number of layers
   - Evaluate the trained model using entity-level precision, recall, and F1 scores
   - Compare the performance of the MLP model with the simple linear model from exercise 1
3. Explore the effects of different activation functions and regularization techniques for NER
   - Modify the MLP model from exercise 2 to allow configurable activation functions (e.g. ReLU, tanh, sigmoid)
   - Train models with different activation functions
   - Visualize the learned entity embeddings using dimensionality reduction techniques like PCA or t-SNE

In [5]:
!pip install uv
!uv pip install numpy pandas torch transformers datasets scikit-learn umap-learn matplotlib seaborn

Using Python 3.12.3 environment at: /home/jj/github/deepl_nlp/.venv
Audited 9 packages in 7ms


# Instructions 
1. Download the **conll2003** dataset from the following [link](https://data.deepai.org/conll2003.zip)
2. Unzip the file
3. Download the GloVe embeddings from this [link](https://huggingface.co/datasets/SLU-CSCI4750/glove.6B.100d.txt/resolve/main/glove.6B.100d.txt.gz)
4. Decompress the GloVe embeddings file (it is a `.gz` archive)
5. Update the constants in the code below to point to the correct file paths on your machine

In [6]:
# basic python data science tooling
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from datasets import (
    Dataset, 
    DatasetDict, 
    Features, 
    Sequence, 
    ClassLabel, 
    Value,
)

# progress bar
from tqdm import tqdm


# deep learning stuff
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence


# constants for config
LOCAL_DIR = "/home/jj/github/deepl_nlp/assignment1/data/conll2003/"
GLOVE_EMBEDS_PATH = '/home/jj/github/deepl_nlp/assignment1/embeddings/glove.6B.100d.txt'



In [7]:
train_file = os.path.join(LOCAL_DIR, "train.txt")
valid_file = os.path.join(LOCAL_DIR, "valid.txt")
test_file = os.path.join(LOCAL_DIR, "test.txt")

pos_names = [
    '"',
    "''",
    "#",
    "$",
    "(",
    ")",
    ",",
    ".",
    ":",
    "``",
    "CC",
    "CD",
    "DT",
    "EX",
    "FW",
    "IN",
    "JJ",
    "JJR",
    "JJS",
    "LS",
    "MD",
    "NN",
    "NNP",
    "NNPS",
    "NNS",
    "NN|SYM",
    "PDT",
    "POS",
    "PRP",
    "PRP$",
    "RB",
    "RBR",
    "RBS",
    "RP",
    "SYM",
    "TO",
    "UH",
    "VB",
    "VBD",
    "VBG",
    "VBN",
    "VBP",
    "VBZ",
    "WDT",
    "WP",
    "WP$",
    "WRB",
]

chunk_names = [
    "O",
    "B-ADJP",
    "I-ADJP",
    "B-ADVP",
    "I-ADVP",
    "B-CONJP",
    "I-CONJP",
    "B-INTJ",
    "I-INTJ",
    "B-LST",
    "I-LST",
    "B-NP",
    "I-NP",
    "B-PP",
    "I-PP",
    "B-PRT",
    "I-PRT",
    "B-SBAR",
    "I-SBAR",
    "B-UCP",
    "I-UCP",
    "B-VP",
    "I-VP",
]

ner_names = [
    "O",
    "B-PER",
    "I-PER",
    "B-ORG",
    "I-ORG",
    "B-LOC",
    "I-LOC",
    "B-MISC",
    "I-MISC",
]


def parse_conll(path: str):
    """Parse a CoNLL-2003 file into a list of examples."""
    examples = []
    tokens, pos_tags, chunk_tags, ner_tags = [], [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("-DOCSTART-") or line.strip() == "":
                if tokens:
                    examples.append(
                        {
                            "tokens": tokens,
                            "pos_tags": pos_tags,
                            "chunk_tags": chunk_tags,
                            "ner_tags": ner_tags,
                        }
                    )
                    tokens, pos_tags, chunk_tags, ner_tags = [], [], [], []
            else:
                splits = line.rstrip().split(" ")
                tokens.append(splits[0])
                pos_tags.append(splits[1])
                chunk_tags.append(splits[2])
                ner_tags.append(splits[3])
    if tokens:
        examples.append(
            {
                "tokens": tokens,
                "pos_tags": pos_tags,
                "chunk_tags": chunk_tags,
                "ner_tags": ner_tags,
            }
        )
    return examples


def as_dataset(examples, features: Features):
    ids = []
    tokens_col, pos_col, chunk_col, ner_col = [], [], [], []
    for i, ex in enumerate(examples):
        ids.append(str(i))
        tokens_col.append(ex["tokens"])
        pos_col.append(ex["pos_tags"])
        chunk_col.append(ex["chunk_tags"])
        ner_col.append(ex["ner_tags"])
    return Dataset.from_dict(
        {
            "id": ids,
            "tokens": tokens_col,
            "pos_tags": pos_col,
            "chunk_tags": chunk_col,
            "ner_tags": ner_col,
        },
        features=features,
    )


features = Features(
    {
        "id": Value("string"),
        "tokens": Sequence(Value("string")),
        "pos_tags": Sequence(ClassLabel(names=pos_names)),
        "chunk_tags": Sequence(ClassLabel(names=chunk_names)),
        "ner_tags": Sequence(ClassLabel(names=ner_names)),
    }
)

train_examples = parse_conll(train_file)
valid_examples = parse_conll(valid_file)
test_examples = parse_conll(test_file)

conll2003 = DatasetDict(
    {
        "train": as_dataset(train_examples, features),
        "validation": as_dataset(valid_examples, features),
        "test": as_dataset(test_examples, features),
    }
)

display(conll2003)

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})
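
For readers who have not opened the raw files, the sketch below shows the layout that `parse_conll` expects and mirrors its grouping logic on an in-memory sample. The sample is a shortened, illustrative excerpt (the column layout and the `-DOCSTART-` line are assumptions about the downloaded files, chosen to be consistent with the tags printed in this notebook), and `parse_lines` is a hypothetical helper, not part of the notebook's script.

```python
import io

# Assumed raw layout: one token per line as "TOKEN POS CHUNK NER", a blank line
# between sentences, and "-DOCSTART-" lines marking document boundaries.
sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""


def parse_lines(handle):
    """Group token lines into sentences, mirroring the logic of parse_conll."""
    sentences, current = [], []
    for line in handle:
        if line.startswith("-DOCSTART-") or not line.strip():
            # A boundary line closes the sentence collected so far, if any.
            if current:
                sentences.append(current)
                current = []
        else:
            token, pos, chunk, ner = line.split()
            current.append((token, pos, chunk, ner))
    if current:  # flush the last sentence when the input does not end with a blank line
        sentences.append(current)
    return sentences


sentences = parse_lines(io.StringIO(sample))
print(len(sentences))
print(sentences[1])
```

The string tags in the fourth column are what `ClassLabel` later converts into the integer IDs seen in the printed example.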

In [8]:
print(conll2003['train'][0])

{'id': '0', 'tokens': ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'], 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7], 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0], 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}


In [9]:
def load_glove_embeddings(file_path, embedding_dim):
    # dict to store word embed vectors
    word_vectors = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = torch.tensor(
                [float(val) for val in values[1:]], dtype=torch.float32)
            word_vectors[word] = vector

    # matrix of embeddings
    vocab_size = len(word_vectors)
    embedding_matrix = torch.zeros((vocab_size, embedding_dim))
    word_to_idx = {}
    idx_to_word = {}
    for i, (word, vector) in enumerate(word_vectors.items()):
        embedding_matrix[i] = vector
        word_to_idx[word] = i
        idx_to_word[i] = word

    return embedding_matrix, word_to_idx, idx_to_word


embedding_dim = 100
embedding_matrix, word_to_idx, idx_to_word = load_glove_embeddings(GLOVE_EMBEDS_PATH, embedding_dim)


embedding_layer = nn.Embedding.from_pretrained(embedding_matrix)
embedding_layer

Embedding(400000, 100)
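
One detail worth keeping in mind: `nn.Embedding.from_pretrained` freezes its weights by default, whereas the `LinearNER` model further down creates a plain `nn.Embedding` and copies the GloVe vectors into it, which leaves them trainable. The minimal sketch below illustrates the difference on a toy matrix; `toy_matrix`, `frozen`, and `trainable` are hypothetical names with made-up values.

```python
import torch
import torch.nn as nn

# Toy stand-in for the GloVe matrix: 4 words, 3 dimensions (made-up values).
toy_matrix = torch.arange(12, dtype=torch.float32).reshape(4, 3)

# from_pretrained freezes the weights by default (freeze=True).
frozen = nn.Embedding.from_pretrained(toy_matrix)

# freeze=False keeps the same initial values but lets training update them.
trainable = nn.Embedding.from_pretrained(toy_matrix, freeze=False)

print(frozen.weight.requires_grad)     # False
print(trainable.weight.requires_grad)  # True

# Looking up index 2 returns row 2 of the matrix.
row = frozen(torch.tensor([2]))
print(row)
```

Whether to fine-tune the embeddings is a design choice: frozen vectors train faster and keep unseen words close to their GloVe neighbours, while trainable vectors can adapt to the NER task at the cost of many more parameters.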

In [10]:
# Index for unknown tokens (out-of-vocabulary words not in GloVe)
UNK_IDX = len(word_to_idx)

def tokens_to_indices(tokens_batch):
    """
    Convert a batch of token sequences to their corresponding GloVe indices.
    Args:
        tokens_batch: List of token sequences (list of lists of strings)
    Returns:
        List of tensors containing token indices
    """
    indices = []
    for tokens in tokens_batch:
        # Look up each token (lowercased) in word_to_idx, use UNK_IDX if not found
        idxs = [
            word_to_idx.get(t.lower(), UNK_IDX) for t in tokens
        ]
        indices.append(torch.tensor(idxs, dtype=torch.long))
    return indices


def labels_to_tensors(labels_batch):
    """
    Convert a batch of NER label sequences to tensors.
    Args:
        labels_batch: List of label sequences (list of lists of ints)
    Returns:
        List of tensors containing label indices
    """
    return [torch.tensor(lbls, dtype=torch.long) for lbls in labels_batch]


def collate_fn(batch):
    """
    Collate function for DataLoader to batch and pad sequences.
    Args:
        batch: List of dataset samples, each with 'tokens' and 'ner_tags'
    Returns:
        Dictionary with padded 'input_ids' and 'labels' tensors
    """
    # Convert tokens to indices
    input_ids = tokens_to_indices([b["tokens"] for b in batch])
    
    # Convert NER tags to tensors
    labels = labels_to_tensors([b["ner_tags"] for b in batch])
    
    # Pad sequences to the same length within the batch
    # Padding value for input_ids is UNK_IDX (unknown token)
    input_ids = pad_sequence(input_ids, batch_first=True, padding_value=UNK_IDX)
    
    # Padding value for labels is -100 (ignored by CrossEntropyLoss)
    labels = pad_sequence(labels, batch_first=True, padding_value=-100)
    
    return {"input_ids": input_ids, "labels": labels}


# Create DataLoaders for training, validation, and test sets
# batch_size=32: process 32 sequences at a time
# shuffle=True: randomly shuffle training data each epoch
train_dataloader = DataLoader(conll2003["train"], batch_size=32, shuffle=True, collate_fn=collate_fn)
val_dataloader = DataLoader(conll2003["validation"], batch_size=32, collate_fn=collate_fn)
test_dataloader = DataLoader(conll2003["test"], batch_size=32, collate_fn=collate_fn)
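
To see what `collate_fn` produces, the sketch below runs the same lookup-and-pad steps on two toy sentences. The three-word vocabulary and the label values are hypothetical; only `pad_sequence` and the padding conventions (unknown-word index for inputs, -100 for labels) come from the cell above.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

toy_vocab = {"eu": 0, "rejects": 1, "lamb": 2}  # hypothetical 3-word vocabulary
unk_idx = len(toy_vocab)                         # 3, the extra index for unknown words

sentences = [["EU", "rejects", "lamb"], ["Zorblax"]]  # "Zorblax" is out of vocabulary
labels = [[3, 0, 0], [1]]

# Lowercase each token before the lookup, falling back to unk_idx.
ids = [torch.tensor([toy_vocab.get(t.lower(), unk_idx) for t in s]) for s in sentences]
lbl = [torch.tensor(l) for l in labels]

# Pad both to the longest sequence in the batch.
input_ids = pad_sequence(ids, batch_first=True, padding_value=unk_idx)
padded_labels = pad_sequence(lbl, batch_first=True, padding_value=-100)

print(input_ids)      # second row is all 3s: one real unknown word plus two pads
print(padded_labels)  # pads are marked with -100
```

Note that in this setup a genuine unknown word and a padding position share the same input index, so only the -100 entries in the labels distinguish them.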


In [11]:

class LinearNER(nn.Module):
    """
    Simple linear model for Named Entity Recognition using GloVe embeddings.
    Architecture: Embedding -> Linear -> Logits
    """
    def __init__(self, embedding_matrix: torch.Tensor, num_tags: int):
        super().__init__()
        # if you use 100-dimensional GloVe vectors for a 50,000-word vocabulary,
        # vocab_size would be 50,000 and embed_dim would be 100.
        vocab_size, embed_dim = embedding_matrix.shape
        
        # Create embedding layer with extra slot for unknown tokens (UNK)
        # nn.Embedding is just a big lookup table. It stores a vector for each word index.
        self.embedding = nn.Embedding(vocab_size + 1, embed_dim)
        
        # Initialize embeddings with pre-trained GloVe vectors
        # The last row (index vocab_size) is reserved for UNK tokens and initialized to zeros
        # This tells PyTorch not to track gradients for the next steps.
        # This is important because we are just initializing the weights, not training them yet.
        with torch.no_grad():
            # Copies the pre-trained embedding_matrix (your GloVe vectors) into the embedding layer's weight table.
            self.embedding.weight[:vocab_size].copy_(embedding_matrix)
            self.embedding.weight[vocab_size].zero_()
        
        # Linear classifier maps embedding dimension to number of NER tags
        # This is the model's "brain." It's a single linear (or "fully-connected") layer.
        self.classifier = nn.Linear(embed_dim, num_tags)

    def forward(self, input_ids):
        """
        Args:
            input_ids: (batch_size, sequence_length) - token indices
        Returns:
            logits: (batch_size, sequence_length, num_tags) - unnormalized scores for each tag
        """
        # input_ids is a batch of sentences, where each word is an index (e.g., [[10, 45, 132], [7, 500, 9]]).
        # This line looks up the embedding vector for every single index.
        # If the input shape is (Batch_Size, Sequence_Length), the output emb shape is (Batch_Size, Sequence_Length, Embedding_Dimension).
        emb = self.embedding(input_ids)           # (B, T, D) - embed each token
        # This applies the same linear layer to each token's embedding in the sequence
        # logits = raw scores like [1.2, -0.5, 3.1, 0.1]
        # (Batch_Size, Sequence_Length, Num_Tags)
        logits = self.classifier(emb)             # (B, T, C) - project to tag space
        return logits


# Get number of NER tags from dataset (e.g., O, B-PER, I-PER, B-ORG, etc.)
num_tags = len(conll2003["train"].features["ner_tags"].feature.names)

# Initialize model with pre-trained GloVe embeddings
model = LinearNER(embedding_matrix, num_tags)

# CrossEntropyLoss with ignore_index=-100 to skip padding tokens in loss calculation
# It is also convenient because it takes the raw logits and the integer labels directly
criterion = nn.CrossEntropyLoss(ignore_index=-100)

# Adam optimizer for parameter updates
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Move model to GPU if available, otherwise CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Training loop
num_epochs = 3
for epoch in range(num_epochs):
    model.train()  # Set model to training mode
    total_loss = 0.0
    
    # An epoch is too large to process at once, so it's broken into smaller chunks
    # called batches (handled by your train_dataloader)
    for batch in train_dataloader:
        # Move batch data to the selected device (GPU if available, otherwise CPU)
        input_ids = batch["input_ids"].to(device)
        labels = batch["labels"].to(device)
        
        # Zero gradients from previous step
        # This line clears the gradients from the previous batch,
        # ensuring we only update the model based on the current batch's error.
        optimizer.zero_grad()
        
        # Forward pass: get predictions
        logits = model(input_ids)
        
        # Compute loss (flatten to 2D for CrossEntropyLoss: [batch*seq_len, num_tags])
        # The model output logits is 3D: (B, T, C).
        # The labels are 2D: (B, T).
        #
        # CrossEntropyLoss expects 2D logits (N, C) and 1D labels (N), where N is the total number of items.
        #
        # logits.view(-1, num_tags): This "flattens" the (B, T, C) tensor into (B*T, C). It essentially makes one giant list of all the tokens in the batch.
        # labels.view(-1): This flattens the (B, T) tensor into (B*T).
        #
        # Now, the loss function compares the logits for every single token against its corresponding true label. Thanks to ignore_index=-100, any token where the label is -100 is skipped.
        loss = criterion(logits.view(-1, num_tags), labels.view(-1))
        
        # Backward pass: compute gradients
        loss.backward()
        
        # Update model parameters
        optimizer.step()
        
        # Accumulate loss for monitoring
        total_loss += loss.item()
    
    # Print average loss per batch for this epoch
    print(f"Epoch {epoch+1}/{num_epochs} - Train Loss: {total_loss/len(train_dataloader):.4f}")


Epoch 1/3 - Train Loss: 0.6490
Epoch 2/3 - Train Loss: 0.2935
Epoch 3/3 - Train Loss: 0.2168
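
The flattening and `ignore_index` behaviour described in the training loop comments can be checked on a tiny random batch. This is a minimal sketch with made-up shapes (2 sentences, 3 positions, 9 tags): it compares the loss computed with `ignore_index=-100` against the loss computed after manually dropping the padded positions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_tags = 9
logits = torch.randn(2, 3, num_tags)                  # (B=2, T=3, C=9)
labels = torch.tensor([[3, 0, 0], [1, -100, -100]])   # second sentence has two pads

# Same call pattern as the training loop: flatten to (B*T, C) and (B*T,).
flat_logits = logits.view(-1, num_tags)
flat_labels = labels.view(-1)
loss_ignore = nn.CrossEntropyLoss(ignore_index=-100)(flat_logits, flat_labels)

# Reference: keep only the real tokens, then use the default loss.
mask = flat_labels != -100
loss_masked = nn.CrossEntropyLoss()(flat_logits[mask], flat_labels[mask])

print(loss_ignore.item(), loss_masked.item())
print(torch.allclose(loss_ignore, loss_masked))
```

The two values agree because, with the default mean reduction, the loss is averaged over the non-ignored targets only.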



In [12]:
from sklearn.metrics import classification_report, confusion_matrix
from collections import defaultdict


"""
Why is this necessary?
Token-level and entity-level evaluation measure very different things.

Token-level (simpler but less meaningful):
    Evaluates each token independently.
    If you predict [B-PER, O, O] instead of [B-PER, I-PER, O], you still get 2/3 ≈ 67% accuracy.
    Problem: this does not tell you whether you identified complete entities.

Entity-level (what NER really cares about):
    An entity is correct only if both its type AND its complete span match exactly.
    If you predict "John" as B-PER but miss "Smith" (I-PER), you get:
        Predicted: 1 incomplete entity
        True: 1 complete entity
        Result: no match, because the spans differ
"""
def get_entities(tags, tag_names):
    """
    Extract named entities from a sequence of NER tags using BIO tagging scheme.
    BIO scheme: B-TYPE (begin), I-TYPE (inside), O (outside)
    
    Args:
        tags: Sequence of tag IDs
        tag_names: List mapping tag IDs to tag names (e.g., ['O', 'B-PER', 'I-PER', ...])
    Returns:
        Set of tuples: (entity_type, start_idx, end_idx) for each entity found
    """
    entities = []
    current_entity = None
    
    for i, tag_id in enumerate(tags):
        if tag_id == -100:
            continue
        tag = tag_names[tag_id]
        
        if tag.startswith('B-'):
            # B- tag starts a new entity
            if current_entity is not None:
                entities.append(current_entity)
            entity_type = tag[2:]
            current_entity = (entity_type, i, i)
        elif tag.startswith('I-'):
            # I- tag continues the current entity
            entity_type = tag[2:]
            if current_entity is not None and current_entity[0] == entity_type:
                # Extend current entity to include this token
                current_entity = (current_entity[0], current_entity[1], i)
            else:
                # Mismatched I- tag (e.g., I-PER after B-ORG), treat as new entity
                if current_entity is not None:
                    entities.append(current_entity)
                current_entity = (entity_type, i, i)
        else:
            # O tag ends the current entity
            if current_entity is not None:
                entities.append(current_entity)
                current_entity = None
    
    # Save the last entity if sequence ended while inside an entity
    if current_entity is not None:
        entities.append(current_entity)
    
    return set(entities)

def compute_entity_f1(model, dataloader, tag_names, device):
    """
    Compute entity-level precision, recall, and F1 score.
    Entity-level: an entity is correct only if both type and span match exactly.
    
    Args:
        model: Trained NER model
        dataloader: DataLoader with evaluation data
        tag_names: List of NER tag names
        device: Device to run evaluation on (CPU/GPU)
    Returns:
        Dictionary with precision, recall, F1, and counts
    """
    model.eval()
    
    all_true_entities = []
    all_pred_entities = []
    
    with torch.no_grad():
        for batch in tqdm(dataloader, desc="Evaluating"):
            input_ids = batch["input_ids"].to(device)
            labels = batch["labels"].to(device)
            
            # Get model predictions
            logits = model(input_ids)
            predictions = torch.argmax(logits, dim=-1)
            
            # Process each sequence in the batch
            for pred_seq, true_seq in zip(predictions.cpu().numpy(), labels.cpu().numpy()):
                # Drop padded positions from both sequences first, so that tags
                # predicted on padding are never counted as predicted entities
                mask = true_seq != -100
                true_entities = get_entities(true_seq[mask], tag_names)
                pred_entities = get_entities(pred_seq[mask], tag_names)
                
                all_true_entities.append(true_entities)
                all_pred_entities.append(pred_entities)
    
    # Calculate entity-level metrics
    true_positives = 0
    false_positives = 0
    false_negatives = 0
    
    for true_ents, pred_ents in zip(all_true_entities, all_pred_entities):
        # Intersection: entities present in both true and predicted
        true_positives += len(true_ents & pred_ents)
        # Predicted but not in ground truth
        false_positives += len(pred_ents - true_ents)
        # In ground truth but not predicted
        false_negatives += len(true_ents - pred_ents)
    
    # Compute precision, recall, and F1
    precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0
    recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
    
    return {
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'true_positives': true_positives,
        'false_positives': false_positives,
        'false_negatives': false_negatives
    }


tag_names = conll2003["train"].features["ner_tags"].feature.names

print("=" * 60)
print("Entity-level Evaluation on Validation Set")
print("=" * 60)
val_metrics = compute_entity_f1(model, val_dataloader, tag_names, device)
print(f"Precision: {val_metrics['precision']:.4f}")
print(f"Recall:    {val_metrics['recall']:.4f}")
print(f"F1 Score:  {val_metrics['f1']:.4f}")
print(f"\nTrue Positives:  {val_metrics['true_positives']}")
print(f"False Positives: {val_metrics['false_positives']}")
print(f"False Negatives: {val_metrics['false_negatives']}")

print("\n" + "=" * 60)
print("Entity-level Evaluation on Test Set")
print("=" * 60)
test_metrics = compute_entity_f1(model, test_dataloader, tag_names, device)
print(f"Precision: {test_metrics['precision']:.4f}")
print(f"Recall:    {test_metrics['recall']:.4f}")
print(f"F1 Score:  {test_metrics['f1']:.4f}")
print(f"\nTrue Positives:  {test_metrics['true_positives']}")
print(f"False Positives: {test_metrics['false_positives']}")
print(f"False Negatives: {test_metrics['false_negatives']}")

Entity-level Evaluation on Validation Set


Evaluating: 100%|██████████| 102/102 [00:00<00:00, 598.10it/s]


Precision: 0.5952
Recall:    0.6866
F1 Score:  0.6376

True Positives:  4080
False Positives: 2775
False Negatives: 1862

Entity-level Evaluation on Test Set


Evaluating: 100%|██████████| 108/108 [00:00<00:00, 673.10it/s]

Precision: 0.5196
Recall:    0.6100
F1 Score:  0.5612

True Positives:  3445
False Positives: 3185
False Negatives: 2203
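
The gap between token-level and entity-level scoring is easy to reproduce on the three-token example from the comment at the top of this cell. The sketch below uses a compact, hypothetical `spans` helper (a simplified stand-in for `get_entities` that works on string tags) to compare both views.

```python
def spans(tags):
    """Return the set of (type, start, end) spans in a BIO tag sequence."""
    found, current = set(), None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing entity
        # A B- tag, or an I- tag that does not continue the open entity, starts a new span.
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and (current is None or current[0] != tag[2:])
        )
        if current is not None and (tag == "O" or starts_new):
            found.add((current[0], current[1], i - 1))
            current = None
        if starts_new:
            current = (tag[2:], i)
    return found


true_tags = ["B-PER", "I-PER", "O"]
pred_tags = ["B-PER", "O", "O"]

# Token level: two of the three positions agree.
token_acc = sum(t == p for t, p in zip(true_tags, pred_tags)) / len(true_tags)

# Entity level: the predicted span is shorter than the true one, so nothing matches.
true_spans, pred_spans = spans(true_tags), spans(pred_tags)
matches = len(true_spans & pred_spans)

print(f"token accuracy: {token_acc:.2f}")
print(f"true spans: {true_spans}, predicted spans: {pred_spans}, exact matches: {matches}")
```

Here the truncated prediction counts as one false positive and one false negative at the entity level, even though most of its tokens are right.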





In [13]:
!pip install seqeval



In [14]:
from seqeval.metrics import classification_report, f1_score

def get_seqeval_predictions(model, dataloader, tag_names, device):
    """
    Get predictions in format required by seqeval (list of lists of string labels).
    
    Args:
        model: Trained NER model
        dataloader: DataLoader with evaluation data
        tag_names: List of NER tag names
        device: Device to run evaluation on
    Returns:
        Tuple of (true_tags, pred_tags) where each is a list of sequences
    """
    model.eval()
    true_tags = []
    pred_tags = []
    
    with torch.no_grad():
        for batch in tqdm(dataloader, desc="Getting seqeval predictions"):
            input_ids = batch["input_ids"].to(device)
            labels = batch["labels"].to(device)
            
            logits = model(input_ids)
            predictions = torch.argmax(logits, dim=-1)
            
            # Process each sequence in the batch
            for pred_seq, label_seq in zip(predictions.cpu().numpy(), labels.cpu().numpy()):
                # Convert indices to tag names, filtering out padding tokens
                true_seq = []
                pred_seq_tags = []
                
                for pred, label in zip(pred_seq, label_seq):
                    if label != -100:  # Skip padding tokens
                        true_seq.append(tag_names[label])
                        pred_seq_tags.append(tag_names[pred])
                
                if true_seq:  # Only add non-empty sequences
                    true_tags.append(true_seq)
                    pred_tags.append(pred_seq_tags)
    
    return true_tags, pred_tags


# Get predictions in seqeval format
print("Collecting predictions for seqeval evaluation...")
true_tags, pred_tags = get_seqeval_predictions(model, test_dataloader, tag_names, device)

# Print entity-level classification report using seqeval
print("\n" + "=" * 60)
print("Entity-level Classification Report (seqeval - Test Set)")
print("=" * 60)
print(classification_report(true_tags, pred_tags, digits=4))

# Overall entity-level F1 score
f1 = f1_score(true_tags, pred_tags)
print(f"\nOverall Entity-level F1-Score: {f1:.4f}")

# Also evaluate on validation set
print("\n" + "=" * 60)
print("Entity-level Classification Report (seqeval - Validation Set)")
print("=" * 60)
val_true_tags, val_pred_tags = get_seqeval_predictions(model, val_dataloader, tag_names, device)
print(classification_report(val_true_tags, val_pred_tags, digits=4))
val_f1 = f1_score(val_true_tags, val_pred_tags)
print(f"\nOverall Entity-level F1-Score: {val_f1:.4f}")

Collecting predictions for seqeval evaluation...


Getting seqeval predictions: 100%|██████████| 108/108 [00:00<00:00, 732.70it/s]


Entity-level Classification Report (seqeval - Test Set)





              precision    recall  f1-score   support

         LOC     0.6536    0.7908    0.7157      1668
        MISC     0.4550    0.5470    0.4968       702
         ORG     0.4907    0.5232    0.5064      1661
         PER     0.4372    0.5399    0.4831      1617

   micro avg     0.5196    0.6100    0.5612      5648
   macro avg     0.5091    0.6002    0.5505      5648
weighted avg     0.5190    0.6100    0.5603      5648


Overall Entity-level F1-Score: 0.5612

Entity-level Classification Report (seqeval - Validation Set)



Getting seqeval predictions: 100%|██████████| 102/102 [00:00<00:00, 718.24it/s]



              precision    recall  f1-score   support

         LOC     0.7544    0.8176    0.7847      1837
        MISC     0.5552    0.6161    0.5841       922
         ORG     0.5141    0.6107    0.5583      1341
         PER     0.5298    0.6466    0.5824      1842

   micro avg     0.5952    0.6866    0.6376      5942
   macro avg     0.5884    0.6728    0.6274      5942
weighted avg     0.5996    0.6866    0.6398      5942


Overall Entity-level F1-Score: 0.6376


In [15]:
# Token-level confusion matrix and classification report
def get_token_predictions(model, dataloader, device):
    """Get all predictions and labels at token level (excluding padding)."""
    model.eval()
    all_preds = []
    all_labels = []
    
    with torch.no_grad():
        for batch in tqdm(dataloader, desc="Getting predictions"):
            input_ids = batch["input_ids"].to(device)
            labels = batch["labels"].to(device)
            
            logits = model(input_ids)
            predictions = torch.argmax(logits, dim=-1)
            
            # Flatten and filter out padding tokens
            for pred_seq, label_seq in zip(predictions.cpu().numpy(), labels.cpu().numpy()):
                for pred, label in zip(pred_seq, label_seq):
                    if label != -100:  # Ignore padding
                        all_preds.append(pred)
                        all_labels.append(label)
    
    return np.array(all_preds), np.array(all_labels)


# Get predictions
print("Collecting predictions for confusion matrix...")
preds, labels = get_token_predictions(model, test_dataloader, device)

# Classification report
print("\n" + "=" * 60)
print("Token-level Classification Report (Test Set)")
print("=" * 60)
# scikit-learn's report is imported under an explicit alias because the seqeval
# import in the previous cell rebinds the name `classification_report`, and
# seqeval's version does not accept `target_names`.
from sklearn.metrics import classification_report as sk_classification_report

# Passing the full list of label IDs keeps the report and the matrix aligned
# with tag_names even if some tag never occurs in the labels or predictions.
label_ids = list(range(len(tag_names)))
print(sk_classification_report(labels, preds, labels=label_ids, target_names=tag_names, digits=4))

# Confusion matrix (rows: true labels, columns: predicted labels)
cm = confusion_matrix(labels, preds, labels=label_ids)
plt.figure(figsize=(12, 10))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=tag_names, yticklabels=tag_names,
            cbar_kws={'label': 'Count'})
plt.xlabel('Predicted Label', fontsize=12)
plt.ylabel('True Label', fontsize=12)
plt.title('Confusion Matrix - LinearNER Model (Test Set)', fontsize=14, fontweight='bold')
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

# Analyze most common errors
print("\n" + "=" * 60)
print("Top 10 Most Common Prediction Errors")
print("=" * 60)
errors = []
for i in range(len(tag_names)):
    for j in range(len(tag_names)):
        if i != j and cm[i, j] > 0:
            errors.append((tag_names[i], tag_names[j], cm[i, j]))

errors.sort(key=lambda x: x[2], reverse=True)
for true_tag, pred_tag, count in errors[:10]:
    print(f"True: {true_tag:10s} -> Predicted: {pred_tag:10s} | Count: {count}")

Collecting predictions for confusion matrix...


Getting predictions: 100%|██████████| 108/108 [00:00<00:00, 770.69it/s]


Token-level Classification Report (Test Set)