# SNLP Assignment 9

Name 1: <br/>
Student id 1: <br/>
Email 1: <br/>


Name 2: <br/>
Student id 2: <br/>
Email 2: <br/> 

Name 3: <br/>
Student id 3: <br/>
Email 3: <br/> 

**Instructions:** Read each question carefully. <br/>
Make sure you appropriately comment your code wherever required. Your final submission should contain the completed Notebook. There is no need to submit the data files. <br/>
Upload the zipped folder to CMS. Please follow the naming convention **Name1_studentID1_Name2_studentID2_Name3_studentID3.zip**. Make sure to click on "Turn-in" (or the equivalent on CMS) after you upload your submission; otherwise the assignment will not be considered submitted. Only one member of the group should make the submission.

---

In [None]:
# !pip install -q transformers datasets sklearn-crfsuite seqeval  # uncomment and run if needed

## Ex 9.3: Transformers and CRFs  

In this exercise, you will enhance a Transformer-based Named Entity Recognition (NER) model by adding a Conditional Random Field (CRF) layer on top of it. The goal is to compare a baseline Transformer model with a hybrid Transformer+CRF model on a subset of the CoNLL-2003 dataset, which you will load using Hugging Face datasets.

The baseline model uses a Transformer’s built-in token classification head, while the hybrid model extracts embeddings from the Transformer and feeds them into a separate CRF model for prediction.
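
Both models work on sub-word pieces, while the CoNLL-2003 tags are given per original token, so predictions and embeddings have to be aligned back to words. The following is a minimal sketch of one common strategy (keep the value at the first sub-word of each word). It assumes word ids in the format returned by a fast tokenizer's `word_ids()` method, with `None` for special tokens; the helper name `first_subword_per_word` and the toy ids are hypothetical and not part of the provided code.

```python
from typing import List, Optional, Sequence


def first_subword_per_word(word_ids: Sequence[Optional[int]], values: Sequence) -> List:
    """Keep one value per original word: the one at its first sub-word position."""
    aligned = []
    previous = None
    for word_id, value in zip(word_ids, values):
        # None marks special tokens such as [CLS], [SEP], or padding.
        if word_id is not None and word_id != previous:
            aligned.append(value)
        previous = word_id
    return aligned


# Toy example: four words, the last one split into three sub-word pieces.
word_ids = [None, 0, 1, 2, 3, 3, 3, None]
pred_ids = [0, 1, 2, 0, 5, 6, 6, 0]  # illustrative per-piece label ids
print(first_subword_per_word(word_ids, pred_ids))  # [1, 2, 0, 5]
```

The same helper works for embedding vectors instead of label ids. Averaging all sub-word vectors of a word is an equally reasonable alternative.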

The dataset loading and model initialization code is already provided. [**Total**: 5 points]

Your task is to complete the following functions:

- `get_transformer_predictions()`: Make NER predictions using a Transformer’s classification head. (0.5 points)

- `get_transformer_embeddings()`: Extract token embeddings from a pre-trained Transformer. (0.5 points)

- `embeddings_to_features()`: Convert token embeddings into CRF-compatible feature dictionaries. (0.5 points)

- `train_crf_model()`: Train a CRF using sklearn-crfsuite. (0.5 points)

- `evaluate_predictions()`: Evaluate predictions using F1 score and classification report. (0.5 points)

- `plot_per_label_f1()`: Plot the per-label F1 scores as a horizontal bar chart. (0.5 points)

- `plot_confusion_matrix()`: Plot the confusion matrix of predicted vs. true tags. (0.5 points)
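
As a hint on what "CRF-compatible feature dictionaries" can look like: sklearn-crfsuite expects each sentence as a list of dictionaries, one per token, mapping feature names to values. The sketch below shows one possible encoding, with one real-valued feature per embedding dimension plus a constant bias. The names `vector_to_feature_dict`, `sentence_to_features`, and the `dim_i` keys are illustrative choices, not a required format.

```python
from typing import Dict, List, Sequence


def vector_to_feature_dict(vector: Sequence[float]) -> Dict[str, float]:
    """Turn one embedding vector into a {feature_name: weight} dictionary."""
    features = {f"dim_{i}": float(v) for i, v in enumerate(vector)}
    features["bias"] = 1.0  # constant feature, an optional design choice
    return features


def sentence_to_features(sentence_vectors: Sequence[Sequence[float]]) -> List[Dict[str, float]]:
    """Convert all token vectors of one sentence."""
    return [vector_to_feature_dict(v) for v in sentence_vectors]


# Toy sentence with two tokens and 3-dimensional "embeddings".
toy_sentence = [[0.1, -0.4, 0.7], [0.0, 0.2, -0.3]]
toy_features = sentence_to_features(toy_sentence)
print(toy_features[0])  # {'dim_0': 0.1, 'dim_1': -0.4, 'dim_2': 0.7, 'bias': 1.0}
```

With real Transformer embeddings, every token gets several hundred such features, which makes CRF training noticeably slower than with hand-crafted features.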

Your goal is to compare the performance of:

- A baseline model that uses only the Transformer’s classification head.

- A hybrid model that feeds Transformer embeddings into a CRF.
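
When comparing the two models, keep in mind that NER is usually scored at the entity level rather than per token. The sketch below illustrates the idea in plain Python: it extracts entity spans from BIO tag sequences and computes a micro-averaged F1 over exact span matches. It is a simplified illustration, not the seqeval implementation; the function names are hypothetical, and the handling of stray `I-` tags (treated as the start of a new span) is an assumption.

```python
from typing import List, Set, Tuple


def extract_spans(tags: List[str]) -> Set[Tuple[str, int, int]]:
    """Return (entity_type, start, end_exclusive) spans from one BIO sequence."""
    spans = set()
    start, current = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == current
        if current is not None and not continues:
            spans.add((current, start, i))
            current = None
        if prefix == "B" or (prefix == "I" and current is None):
            # Assumption: a stray I- tag opens a new span.
            start, current = i, label
    return spans


def entity_f1(true: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1 over exactly matching entity spans."""
    tp = n_true = n_pred = 0
    for t, p in zip(true, pred):
        t_spans, p_spans = extract_spans(t), extract_spans(p)
        tp += len(t_spans & p_spans)
        n_true += len(t_spans)
        n_pred += len(p_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


toy_true = [["B-PER", "I-PER", "O", "B-LOC"]]
toy_pred = [["B-PER", "O", "O", "B-LOC"]]
print(entity_f1(toy_true, toy_pred))  # 0.5
```

In the toy example three of four tokens are tagged correctly, yet only one of the two entities is recovered exactly, so the entity-level F1 is 0.5.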

In [None]:
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModel, AutoModelForTokenClassification
import sklearn_crfsuite


In [None]:
# Load a small subset of the dataset
def load_conll_subset(train_size=500, test_size=100):
    """
    Load a small subset of the CoNLL-2003 dataset for quick testing.
    """
    train = load_dataset('conll2003', split='train', trust_remote_code=True).select(range(train_size))
    test = load_dataset('conll2003', split='validation', trust_remote_code=True).select(range(test_size))
    return train, test

# Load models and tokenizer
def load_models(model_checkpoint="distilbert-base-cased", num_labels=9):
    """
    Load the tokenizer and models for token classification and embeddings.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
    model_cls = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=num_labels)
    model_embed = AutoModel.from_pretrained(model_checkpoint)
    return tokenizer, model_cls, model_embed

# Get label mapping
def get_label_mappings(train_dataset):
    """
    Get label mappings from the training dataset.
    """
    labels = train_dataset.features['ner_tags'].feature.names
    id2label = {i: label for i, label in enumerate(labels)}
    return labels, id2label


In [None]:
def get_transformer_predictions(dataset, tokenizer, model, id2label):
    """
    Generate NER predictions using the classification head of a pretrained Transformer model.

    Args:
        dataset (datasets.Dataset): A dataset of tokenized sequences with NER tags.
        tokenizer (PreTrainedTokenizer): HuggingFace tokenizer corresponding to the model.
        model (PreTrainedModel): Transformer model with a token classification head.
        id2label (dict): Mapping from tag ID to string label.

    Returns:
        tuple: A pair (true_labels, predicted_labels), where each is a list of token-level NER label sequences.
    """
    pass


def get_transformer_embeddings(dataset, tokenizer, model, id2label):
    """
    Extract last-layer hidden state embeddings from a Transformer model and align them to original tokens.

    Args:
        dataset (datasets.Dataset): A dataset of tokenized sequences with NER tags.
        tokenizer (PreTrainedTokenizer): HuggingFace tokenizer corresponding to the model.
        model (PreTrainedModel): Transformer model (without classification head) that outputs hidden states.
        id2label (dict): Mapping from tag ID to string label.

    Returns:
        tuple: A pair (all_embeddings, all_labels), where:
            - all_embeddings is a list of lists of token-level embedding vectors (np.ndarray).
            - all_labels is a list of NER label sequences corresponding to each sentence.
    """
    pass


def embeddings_to_features(sentence_embeddings):
    """
    Convert the token embeddings of one sentence into CRF-compatible feature dictionaries.

    Args:
        sentence_embeddings (List[np.ndarray]): Embedding vectors for the tokens of a single sentence.

    Returns:
        List[dict]: One feature dictionary per token, mapping feature names to values.
    """
    pass


def train_crf_model(X_train, y_train):
    """
    Train a linear-chain CRF using token-level feature dictionaries.

    Args:
        X_train (List[List[dict]]): List of token-level feature sequences (sentences).
        y_train (List[List[str]]): List of corresponding label sequences.

    Returns:
        sklearn_crfsuite.CRF: A trained CRF model.
    """
    pass


def evaluate_predictions(true, pred, model_name="Model"):
    """
    Compute and print the F1 score and classification report for sequence labeling predictions.

    Args:
        true (List[List[str]]): Ground truth token label sequences.
        pred (List[List[str]]): Predicted token label sequences.
        model_name (str): Name of the model for display purposes.

    Returns:
        float: The micro-averaged F1 score.
    """
    pass

def plot_per_label_f1(true, pred, title="Model"):
    """
    Plot a horizontal bar chart of per-label F1 scores.

    Args:
        true (List[List[str]]): Ground truth token label sequences.
        pred (List[List[str]]): Predicted token label sequences.
        title (str): Title to prefix the plot with (typically the model name).
    """
    pass



def plot_confusion_matrix(true, pred, labels, title="Model"):
    """
    Plot a confusion matrix for NER predictions.

    Args:
        true (List[List[str]]): Ground truth token label sequences.
        pred (List[List[str]]): Predicted token label sequences.
        labels (List[str]): List of all possible labels in a consistent order.
        title (str): Title to prefix the plot with (typically the model name).
    """
    pass

In [None]:
print("Loading data and models...")
train_data, test_data = load_conll_subset()
tokenizer, model_cls, model_embed = load_models()
labels, id2label = get_label_mappings(train_data)

print("\n--- Running Baseline (Transformer-Only) Experiment ---")
baseline_true, baseline_pred = get_transformer_predictions(test_data, tokenizer, model_cls, id2label)
baseline_f1 = evaluate_predictions(baseline_true, baseline_pred, model_name="Transformer Only")

print("\n--- Running Transformer + CRF Experiment ---")
train_embs, y_train = get_transformer_embeddings(train_data, tokenizer, model_embed, id2label)
test_embs, y_test = get_transformer_embeddings(test_data, tokenizer, model_embed, id2label)

X_train = [embeddings_to_features(e) for e in train_embs]
X_test = [embeddings_to_features(e) for e in test_embs]

crf_model = train_crf_model(X_train, y_train)
y_pred_crf = crf_model.predict(X_test)

crf_f1 = evaluate_predictions(y_test, y_pred_crf, model_name="Transformer + CRF")

print("\n" + "="*50)
print("FINAL RESULTS COMPARISON")
print("="*50)
print(f"Transformer Only F1 Score:     {baseline_f1:.4f}")
print(f"Transformer + CRF F1 Score:    {crf_f1:.4f}")
print("="*50)

In [None]:
print("Confusion Matrices")
plot_confusion_matrix(baseline_true, baseline_pred, labels=labels, title="Transformer only")
plot_confusion_matrix(y_test, y_pred_crf, labels=labels, title="Transformer + CRF")


In [None]:
print("Per-label F1 score")
plot_per_label_f1(baseline_true, baseline_pred, title="Transformer only")
plot_per_label_f1(y_test, y_pred_crf, title="Transformer + CRF")



Answer the following questions:

1. Explain your results. Why do you think a CRF might be more effective than a simple classification head in certain NER tasks? **(0.25 points)**

2. Under what circumstances might the CRF layer not improve performance over the baseline Transformer model?  **(0.25 points)**

3. In what way does a CRF impose **global sequence-level constraints**, and how does this affect prediction quality?  **(0.25 points)**

4. What properties of Transformer embeddings make them well-suited (or not) for CRF-based modeling, as compared to LSTMs, for example? **(0.25 points)**

5. Do you think you could use this pipeline for domain adaptation (e.g., transferring NER from news articles to scientific literature)?  **(0.25 points)**

6. Why do you think we are using F1-score as a metric here? **(0.25 points)**
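
As background for the questions on sequence-level constraints, the following toy sketch contrasts per-token argmax decoding with Viterbi decoding in a linear-chain model. All scores are hand-set for illustration; in a trained CRF the transition weights are learned from data. The function `viterbi` and the dictionary-based score format are assumptions made for this example and do not mirror the sklearn-crfsuite internals. The sketch assumes a non-empty sequence.

```python
from typing import Dict, List, Tuple


def viterbi(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    labels: List[str],
) -> List[str]:
    """Return the highest-scoring label path; missing transitions score 0."""
    # best[i][y]: score of the best path that ends in label y at position i
    best = [{y: emissions[0][y] for y in labels}]
    back = []
    for em in emissions[1:]:
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions.get((p, y), 0.0))
            scores[y] = best[-1][prev] + transitions.get((prev, y), 0.0) + em[y]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


toy_labels = ["O", "B-PER", "I-PER"]
toy_emissions = [
    {"O": 1.0, "B-PER": 0.8, "I-PER": 0.1},
    {"O": 0.2, "B-PER": 0.1, "I-PER": 1.0},
]
# Strongly penalise the invalid BIO transition O -> I-PER.
toy_transitions = {("O", "I-PER"): -10.0}

independent = [max(toy_labels, key=em.get) for em in toy_emissions]
print(independent)                                         # ['O', 'I-PER']
print(viterbi(toy_emissions, toy_transitions, toy_labels))  # ['B-PER', 'I-PER']
```

Independent decoding picks the locally best tag at each position and produces an invalid sequence, while the path-level search trades a slightly worse first tag for a consistent sequence.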