*SOURCES*
- PyTorch documentation
- DeepSeek
- [Build GPT video](https://www.youtube.com/watch?v=kCc8FmEb1nY&t=3644s)
- Hugging Face documentation

## Working with Transformers in the HuggingFace Ecosystem

In this laboratory exercise we will learn how to work with the HuggingFace ecosystem to adapt models to new tasks. As you will see, much of what is required is *investigation* into the inner workings of the HuggingFace abstractions. With a little work and a little trial and error, it is fairly easy to get a working adaptation pipeline up and running.

### Exercise 1: Sentiment Analysis (warm up)

In this first exercise we will start from a pre-trained BERT transformer and build up a model able to perform text sentiment analysis. Transformers are complex beasts, so we will build up our pipeline in several explorative and incremental steps.

#### Exercise 1.1: Dataset Splits and Pre-trained model
There are many sentiment analysis datasets, but we will use one of the smallest ones available: the [Cornell Rotten Tomatoes movie review dataset](cornell-movie-review-data/rotten_tomatoes), which consists of 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews.

**Your first task**: Load the dataset and figure out what splits are available and how to get them. Spend some time exploring the dataset to see how it is organized. Note that we will be using the [HuggingFace Datasets](https://huggingface.co/docs/datasets/en/index) library for downloading, accessing, splitting, and batching data for training and evaluation.

In [1]:
from datasets import load_dataset, get_dataset_split_names
import pandas as pd
from collections import Counter

# Load the rotten_tomatoes dataset
# dataset = load_dataset("rotbert/rotten_tomatoes")
dataset = load_dataset("rotten_tomatoes")

# Check available splits
print(f"Available splits: {list(dataset.keys())}")

# Examine the first training example
print("\nFirst training example:")
print(dataset['train'][0])

# Check the features/columns
print("\nDataset features:")
print(dataset['train'].features)

# Get the number of examples in each split
print("")
for split in dataset.keys():
    print(f"{split} set has {len(dataset[split])}")

# Convert a small portion to pandas dataframe for easier viewing
df = pd.DataFrame(dataset['train'][:5])
print("\nSample of training data:")
print(df)

for split in dataset.keys():
    labels = dataset[split]['label']
    print(f"\nLabel distribution in {split} set:")
    print(Counter(labels))

Available splits: ['train', 'validation', 'test']

First training example:
{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', 'label': 1}

Dataset features:
{'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['neg', 'pos'], id=None)}

train set has 8530
validation set has 1066
test set has 1066

Sample of training data:
                                                text  label
0  the rock is destined to be the 21st century's ...      1
1  the gorgeously elaborate continuation of " the...      1
2                     effective but too-tepid biopic      1
3  if you sometimes like to go to the movies to h...      1
4  emerges as something rare , an issue movie tha...      1

Label distribution in train set:
Counter({1: 4265, 0: 4265})

Label distribution in validation set:
Counter({1: 533, 0: 533})

Label distribution in test set:
C

#### Exercise 1.2: A Pre-trained BERT and Tokenizer

The model we will use is a *very* small BERT transformer called [Distilbert](https://huggingface.co/distilbert/distilbert-base-uncased). This model was trained (using self-supervised learning) on the same corpus as BERT, with the full BERT base model acting as a *teacher*.

**Your next task**: Load the Distilbert model and corresponding tokenizer. Use the tokenizer on a few samples from the dataset and pass the tokens through the model to see what outputs are provided. I suggest you use the [`AutoModel`](https://huggingface.co/transformers/v3.0.2/model_doc/auto.html) class (and the `from_pretrained()` method) to load the model and `AutoTokenizer` to load the tokenizer.

In [2]:
from transformers import AutoTokenizer, AutoModel
import torch

# Specify the DistilBERT model name
model_name = "distilbert-base-uncased"

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)


# Let's verify what I loaded
print(f"Loaded tokenizer: {type(tokenizer)}")
print(f"Loaded model: {type(model)}")
print(f"Model configuration:\n{model.config}")


# Get some samples from the dataset
samples = dataset['train'][:3]['text']
labels = dataset['train'][:3]['label']

print("\nOriginal samples:")
for i, sample in enumerate(samples):
    print(f"{i+1}. {sample} (Label: {labels[i]})")


# Tokenize the samples
encoded_inputs = tokenizer(samples, padding=True, truncation=True, return_tensors="pt")

print("\nTokenized outputs of samples up here:")
print(encoded_inputs)


# Decode the tokens to see what the tokenizer did
print("\nDecoded tokens:")
for i in range(len(samples)):
    print(f"\nSample {i+1}:")
    print(f"Original: {samples[i]}")
    print(f"Token IDs: {encoded_inputs['input_ids'][i]}")
    print(f"Decoded: {tokenizer.decode(encoded_inputs['input_ids'][i])}")


# Don't compute gradients
with torch.no_grad():
    outputs = model(**encoded_inputs)

# Examine the outputs
print("\nModel outputs:")
print(f"Last hidden states shape: {outputs.last_hidden_state.shape}")

# ARCHITECTURE
print("\nModel architecture:")
print(model)


# EXTRACTING EMBEDDINGS
# Get the embeddings for the [CLS] token (often used for classification)
cls_embeddings = outputs.last_hidden_state[:, 0, :]
print(f"\n[CLS] token embeddings shape: {cls_embeddings.shape}")
print(f"Sample embedding for first sentence:\n{cls_embeddings[0][:10]}...")  # Showing first 10 values


# TRYING A COMPLEX SAMPLE
complex_sample = "While the cinematography was stunning and the acting superb, the plot was so convoluted that it ruined the entire movie experience."

# Tokenize and process
inputs = tokenizer(complex_sample, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(f"\nComplex sample processing:")
print(f"Input length: {len(inputs['input_ids'][0])} tokens")
print(f"Output shape: {outputs.last_hidden_state.shape}")

Loaded tokenizer: <class 'transformers.models.distilbert.tokenization_distilbert_fast.DistilBertTokenizerFast'>
Loaded model: <class 'transformers.models.distilbert.modeling_distilbert.DistilBertModel'>
Model configuration:
DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.52.4",
  "vocab_size": 30522
}


Original samples:
1. the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal . (Label: 1)
2. the gorgeously elaborate continuation 

#### Exercise 1.3: A Stable Baseline

In this exercise I want you to:
1. Use Distilbert as a *feature extractor* to extract representations of the text strings from the dataset splits;
2. Train a classifier (your choice, but an SVM from Scikit-learn is an easy option).
3. Evaluate performance on the validation and test splits.

These results are our *stable baseline* -- the **starting** point on which we will (hopefully) improve in the next exercise.

**Hint**: There are a number of ways to implement the feature extractor, but probably the best is to use a [feature extraction `pipeline`](https://huggingface.co/tasks/feature-extraction). You will need to interpret the output of the pipeline and extract only the `[CLS]` token from the *last* transformer layer. *How can you figure out which output that is?*
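
One way to answer that question is to inspect the nesting of what the pipeline returns. The sketch below assumes that, for a single input string, the feature-extraction pipeline returns a nested list shaped `[1][num_tokens][hidden_size]` holding the last-layer hidden states, and that the tokenizer places `[CLS]` first. Both points are assumptions worth verifying on a real output; `fake_output` and `cls_vector` are illustrative names, not part of the library.

```python
def cls_vector(pipeline_output):
    """Pick the [CLS] vector from one pipeline result.

    Assumes the nesting [1][num_tokens][hidden_size]; check this by printing
    the length at each level of a real output.
    """
    tokens = pipeline_output[0]   # drop the leading batch dimension of size 1
    return tokens[0]              # [CLS] is the first token of the sequence


# Hand-made stand-in for a pipeline result: 1 sequence, 3 tokens, hidden size 4
fake_output = [[[0.1, 0.2, 0.3, 0.4],   # [CLS]
                [1.0, 1.1, 1.2, 1.3],   # first word piece
                [2.0, 2.1, 2.2, 2.3]]]  # [SEP]

print(len(fake_output), len(fake_output[0]), len(fake_output[0][0]))
print(cls_vector(fake_output))
```

On the real model the innermost length should be 768, matching the `dim` entry of the Distilbert configuration.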

In [None]:
from transformers import pipeline
import numpy as np
from tqdm.auto import tqdm

# Initialize feature extraction pipeline
checkpoint = "distilbert-base-uncased"
feature_extractor = pipeline(
    "feature-extraction",
    model=checkpoint,
    tokenizer="distilbert-base-uncased",
    device=0 if torch.cuda.is_available() else -1
)

Device set to use cuda:0


In [22]:
def extract_cls_embeddings_batch(texts, tokenizer, model, batch_size=32, device=None):
    """Return the last-layer [CLS] embedding of every text as a NumPy array."""
    # Fall back to the CPU when no GPU is available
    if device is None:
        device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model.to(device)
    model.eval()
    embeddings = []

    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i+batch_size]
        inputs = tokenizer(batch_texts, return_tensors="pt", padding=True, truncation=True)
        inputs = {k: v.to(device) for k, v in inputs.items()}

        with torch.no_grad():
            outputs = model(**inputs)
            cls_embeds = outputs.last_hidden_state[:, 0, :]  # [CLS] token
            embeddings.append(cls_embeds.cpu().numpy())

    return np.concatenate(embeddings)


In [23]:
# Extract features for each split
print("Extracting training features...")
train_embeddings = extract_cls_embeddings_batch(dataset['train']['text'], tokenizer, model)
train_labels = np.array(dataset['train']['label'])

print("\nExtracting validation features...")
val_embeddings = extract_cls_embeddings_batch(dataset['validation']['text'], tokenizer, model)
val_labels = np.array(dataset['validation']['label'])

print("\nExtracting test features...")
test_embeddings = extract_cls_embeddings_batch(dataset['test']['text'], tokenizer, model)
test_labels = np.array(dataset['test']['label'])

# Check shapes
print(f"\nTrain embeddings shape: {train_embeddings.shape}")
print(f"Validation embeddings shape: {val_embeddings.shape}")
print(f"Test embeddings shape: {test_embeddings.shape}")

Extracting training features...

Extracting validation features...

Extracting test features...

Train embeddings shape: (8530, 768)
Validation embeddings shape: (1066, 768)
Test embeddings shape: (1066, 768)


In [24]:
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

# Initialize and train SVM
svm = LinearSVC(random_state=42, max_iter=10000)
svm.fit(train_embeddings, train_labels)

# Evaluate on validation set
val_preds = svm.predict(val_embeddings)
print("\nValidation Set Performance:")
print(classification_report(val_labels, val_preds, target_names=['negative', 'positive']))

# Evaluate on test set
test_preds = svm.predict(test_embeddings)
print("\nTest Set Performance:")
print(classification_report(test_labels, test_preds, target_names=['negative', 'positive']))


Validation Set Performance:
              precision    recall  f1-score   support

    negative       0.81      0.84      0.83       533
    positive       0.84      0.80      0.82       533

    accuracy                           0.82      1066
   macro avg       0.82      0.82      0.82      1066
weighted avg       0.82      0.82      0.82      1066


Test Set Performance:
              precision    recall  f1-score   support

    negative       0.79      0.81      0.80       533
    positive       0.81      0.78      0.80       533

    accuracy                           0.80      1066
   macro avg       0.80      0.80      0.80      1066
weighted avg       0.80      0.80      0.80      1066



In [25]:
# ALTERNATIVE

from sklearn.linear_model import LogisticRegression

# Train logistic regression
lr = LogisticRegression(max_iter=1000, random_state=42)
lr.fit(train_embeddings, train_labels)

# Evaluate
lr_val_preds = lr.predict(val_embeddings)
print("\nLogistic Regression Validation Performance:")
print(classification_report(val_labels, lr_val_preds, target_names=['negative', 'positive']))

lr_test_preds = lr.predict(test_embeddings)
print("\nLogistic Regression Test Performance:")
print(classification_report(test_labels, lr_test_preds, target_names=['negative', 'positive']))


Logistic Regression Validation Performance:
              precision    recall  f1-score   support

    negative       0.81      0.86      0.83       533
    positive       0.85      0.80      0.83       533

    accuracy                           0.83      1066
   macro avg       0.83      0.83      0.83      1066
weighted avg       0.83      0.83      0.83      1066


Logistic Regression Test Performance:
              precision    recall  f1-score   support

    negative       0.79      0.81      0.80       533
    positive       0.81      0.79      0.80       533

    accuracy                           0.80      1066
   macro avg       0.80      0.80      0.80      1066
weighted avg       0.80      0.80      0.80      1066



In [26]:
# Get some example predictions
sample_indices = [10, 20, 30]  # A few fixed test-set indices (not randomly sampled)
sample_texts = [dataset['test']['text'][i] for i in sample_indices]
sample_true = [dataset['test']['label'][i] for i in sample_indices]
sample_preds = test_preds[sample_indices]

print("\nSample Predictions:")
for text, true_label, pred_label in zip(sample_texts, sample_true, sample_preds):
    print(f"\nText: {text}")
    print(f"True: {'positive' if true_label else 'negative'}")
    print(f"Pred: {'positive' if pred_label else 'negative'}")
    print("---")


Sample Predictions:

Text: exposing the ways we fool ourselves is one hour photo's real strength .
True: positive
Pred: positive
---

Text: this kind of hands-on storytelling is ultimately what makes shanghai ghetto move beyond a good , dry , reliable textbook and what allows it to rank with its worthy predecessors .
True: positive
Pred: positive
---

Text: what's so striking about jolie's performance is that she never lets her character become a caricature -- not even with that radioactive hair .
True: positive
Pred: negative
---


-----
### Exercise 2: Fine-tuning Distilbert

In this exercise we will fine-tune the Distilbert model to (hopefully) improve sentiment analysis performance.

#### Exercise 2.1: Token Preprocessing

The first thing we need to do is *tokenize* our dataset splits. Our current datasets return a dictionary with *strings*, but we want *input token ids* (i.e. the output of the tokenizer). This is easy enough to do by hand, but the HuggingFace `Dataset` class provides convenient, efficient, and *lazy* methods. See the documentation for [`Dataset.map`](https://huggingface.co/docs/datasets/v3.5.0/en/package_reference/main_classes#datasets.Dataset.map).

**Tip**: Verify that your new datasets return, for every element: `text`, `label`, `input_ids`, and `attention_mask`.
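
To make the `batched=True` calling convention concrete, here is a toy stand-in written in plain Python. It is not the `datasets` implementation: `toy_batched_map` and `fake_tokenize` are hypothetical helpers that only mimic the contract, in which the mapped function receives a dict of column lists and returns a dict of new column lists.

```python
def fake_tokenize(batch):
    """Hypothetical tokenizer: one integer id (the word length) per word."""
    input_ids = [[len(word) for word in text.split()] for text in batch["text"]]
    attention_mask = [[1] * len(ids) for ids in input_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


def toy_batched_map(columns, fn, batch_size=2):
    """Apply fn to slices of the columns and merge the columns it returns."""
    n_rows = len(next(iter(columns.values())))
    result = {name: list(values) for name, values in columns.items()}
    for start in range(0, n_rows, batch_size):
        batch = {name: values[start:start + batch_size]
                 for name, values in columns.items()}
        for name, values in fn(batch).items():
            result.setdefault(name, [None] * n_rows)
            result[name][start:start + len(values)] = values
    return result


data = {"text": ["a fine film", "dull", "not bad at all"], "label": [1, 0, 1]}
mapped = toy_batched_map(data, fake_tokenize)
print(sorted(mapped))        # column names after the map
print(mapped["input_ids"])
```

The original columns survive next to the new ones, which is why the tip above expects `text` and `label` to still be present unless they are removed explicitly.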

In [27]:
from datasets import load_dataset, get_dataset_split_names
from transformers import AutoTokenizer
import torch

def tokenize_function(examples):
    """Tokenize the text examples and return relevant fields"""
    return tokenizer(
        examples['text'],
        padding='max_length',  # Pad to max sequence length
        truncation=True,       # Truncate to max length
        max_length=128,        # Set maximum sequence length
        return_tensors=None    # Return as plain lists (not tensors)
    )

In [28]:
# Tokenize all splits
tokenized_datasets = dataset.map(
    tokenize_function,
    batched=True,          # Process in batches for efficiency
    batch_size=32,         # Adjust based on your memory
    remove_columns=['text'] # Remove original text column (we keep tokenized version)
)

# Verify the new structure
print("Tokenized dataset structure:")
print(tokenized_datasets)

Tokenized dataset structure:
DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 1066
    })
})


In [29]:
# Check the features of the processed dataset
print("\nProcessed dataset features:")
print(tokenized_datasets['train'].features)

# Examine a single example
sample = tokenized_datasets['train'][0]
print("\nSample processed example:")
print(f"Input IDs length: {len(sample['input_ids'])}")
print(f"Attention mask: {sample['attention_mask'][:10]}...")  # Show first 10 positions
print(f"Label: {sample['label']}")


Processed dataset features:
{'label': ClassLabel(names=['neg', 'pos'], id=None), 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}

Sample processed example:
Input IDs length: 128
Attention mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]...
Label: 1


In [30]:
# Decode a sample to verify tokenization
sample_ids = tokenized_datasets['train'][0]['input_ids']
print("\nDecoded sample:")
print(tokenizer.decode(sample_ids))

# Compare with original
original_text = dataset['train'][0]['text']
print("\nOriginal text:")
print(original_text)


Decoded sample:
[CLS] the rock is destined to be the 21st century ' s new " conan " and that he ' s going to make a splash even greater than arnold schwarzenegger, jean - claud van damme or steven segal. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]

Original text:
the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .


In [31]:
# Dataset Format Verification
required_fields = {'input_ids', 'attention_mask', 'label'}

for split in tokenized_datasets.keys():
    current_fields = set(tokenized_datasets[split].features.keys())
    missing = required_fields - current_fields
    if missing:
        print(f"Warning: Split {split} missing fields: {missing}")
    else:
        print(f"Split {split} has all required fields")

Split train has all required fields
Split validation has all required fields
Split test has all required fields


In [32]:
# Dataset Statistics
import numpy as np

for split in tokenized_datasets.keys():
    lengths = [len(x['input_ids']) for x in tokenized_datasets[split]]
    print(f"\n{split} set token statistics:")
    print(f"Average length: {np.mean(lengths):.1f}")
    print(f"Max length: {max(lengths)}")
    print(f"Min length: {min(lengths)}")
    print(f"Padded length: {len(tokenized_datasets[split][0]['input_ids'])}")


train set token statistics:
Average length: 128.0
Max length: 128
Min length: 128
Padded length: 128

validation set token statistics:
Average length: 128.0
Max length: 128
Min length: 128
Padded length: 128

test set token statistics:
Average length: 128.0
Max length: 128
Min length: 128
Padded length: 128
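
Because the tokenizer call used `padding='max_length'`, every `input_ids` list has 128 entries, so the statistics above measure the padded length rather than the real sentence length. One way to recover the number of real tokens is to sum the attention mask, which is 1 for real tokens and 0 for padding. A minimal sketch on a hand-made example (the ids are made up and `true_length` is an illustrative helper):

```python
def true_length(attention_mask):
    """Count real (non-padding) tokens: the mask is 1 for real tokens, 0 for padding."""
    return sum(attention_mask)


# Made-up example padded to length 8: 5 real tokens followed by 3 padding tokens (id 0)
example = {
    "input_ids":      [101, 2023, 3185, 2001, 102, 0, 0, 0],
    "attention_mask": [1,   1,    1,    1,    1,   0, 0, 0],
}

print(len(example["input_ids"]))               # padded length
print(true_length(example["attention_mask"]))  # real token count
```

Applied to the real splits, this would mean using `sum(x['attention_mask'])` in place of `len(x['input_ids'])` in the statistics cell.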


In [33]:
# Save processed datasets
tokenized_datasets.save_to_disk("tokenized_rotten_tomatoes")

In [4]:
# To load later:
from datasets import load_from_disk
tokenized_datasets = load_from_disk("tokenized_rotten_tomatoes")

#### Exercise 2.2: Setting up the Model to be Fine-tuned

In this exercise we need to prepare the base Distilbert model for fine-tuning on a *sequence classification task*. This means, at the very least, appending a new, randomly-initialized classification head connected to the `[CLS]` token of the last transformer layer. Luckily, HuggingFace already provides an `AutoModel` for just this type of instantiation: [`AutoModelForSequenceClassification`](https://huggingface.co/transformers/v3.0.2/model_doc/auto.html#automodelforsequenceclassification). You will want to instantiate one of these for fine-tuning.

In [35]:
# Load the Model with Classification Head
from transformers import AutoModelForSequenceClassification

# Load DistilBERT with a sequence classification head
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",  # Base pre-trained model
    num_labels=2,               # Number of output labels (positive/negative)
    id2label={0: "negative", 1: "positive"},  # Optional: label mapping
    label2id={"negative": 0, "positive": 1}   # Optional: reverse mapping
)

# Verify the model architecture
print("Model architecture:")
print(model)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model architecture:
DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=

In [36]:
# Examine the Classification Head

# Check the classification head
print("\nClassification head:")
print(model.classifier)

# Check the dimensions
print(f"\nInput dimension to classifier: {model.classifier.in_features}")
print(f"Output dimension: {model.classifier.out_features}")


Classification head:
Linear(in_features=768, out_features=2, bias=True)

Input dimension to classifier: 768
Output dimension: 2


In [37]:
# Model Configuration

# Print model configuration
print("\nModel configuration:")
print(model.config)

# Important configuration parameters
print(f"\nHidden size: {model.config.hidden_size}")
print(f"Number of labels: {model.config.num_labels}")


Model configuration:
DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "negative",
    "1": "positive"
  },
  "initializer_range": 0.02,
  "label2id": {
    "negative": 0,
    "positive": 1
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.52.4",
  "vocab_size": 30522
}


Hidden size: 768
Number of labels: 2


In [38]:
# Verify Forward Pass
import torch

# Get a sample from our tokenized dataset
sample = tokenized_datasets['train'][0]
input_ids = torch.tensor([sample['input_ids']])
attention_mask = torch.tensor([sample['attention_mask']])

# Forward pass
with torch.no_grad():
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)

# Examine outputs
print("\nModel outputs:")
print(f"Logits shape: {outputs.logits.shape}")
print(f"Sample logits: {outputs.logits}")


Model outputs:
Logits shape: torch.Size([1, 2])
Sample logits: tensor([[-0.0224, -0.0162]])
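
The two logits printed above are almost equal, which is what one would expect from a randomly initialized head. A quick way to see this is to turn them into probabilities with a softmax. The sketch below uses only the standard library and the logit values copied from the output above.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the forward pass above (untrained classification head)
probs = softmax([-0.0224, -0.0162])
print([round(p, 4) for p in probs])
```

Both probabilities come out close to 0.5, so before fine-tuning the model is effectively guessing.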


In [39]:
# Check Device Compatibility

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
print(f"\nModel moved to: {device}")

# Verify device placement
sample_input = input_ids.to(device)
sample_attention = attention_mask.to(device)
with torch.no_grad():
    outputs = model(input_ids=sample_input, attention_mask=sample_attention)
print(f"Outputs generated on: {outputs.logits.device}")


Model moved to: cuda
Outputs generated on: cuda:0


In [40]:
# Verify Training Readiness

# Check training mode
print(f"\nModel in training mode: {model.training}")

# Check parameter requires_grad
print("\nParameter gradients:")
for name, param in model.named_parameters():
    if param.requires_grad:
        print(f"{name}: requires grad (trainable)")
    else:
        print(f"{name}: frozen (not trainable)")

# By default, all parameters should be trainable


Model in training mode: False

Parameter gradients:
distilbert.embeddings.word_embeddings.weight: requires grad (trainable)
distilbert.embeddings.position_embeddings.weight: requires grad (trainable)
distilbert.embeddings.LayerNorm.weight: requires grad (trainable)
distilbert.embeddings.LayerNorm.bias: requires grad (trainable)
distilbert.transformer.layer.0.attention.q_lin.weight: requires grad (trainable)
distilbert.transformer.layer.0.attention.q_lin.bias: requires grad (trainable)
distilbert.transformer.layer.0.attention.k_lin.weight: requires grad (trainable)
distilbert.transformer.layer.0.attention.k_lin.bias: requires grad (trainable)
distilbert.transformer.layer.0.attention.v_lin.weight: requires grad (trainable)
distilbert.transformer.layer.0.attention.v_lin.bias: requires grad (trainable)
distilbert.transformer.layer.0.attention.out_lin.weight: requires grad (trainable)
distilbert.transformer.layer.0.attention.out_lin.bias: requires grad (trainable)
distilbert.transformer.la

#### Exercise 2.3: Fine-tuning Distilbert

Finally. In this exercise you should use a HuggingFace [`Trainer`](https://huggingface.co/docs/transformers/main/en/trainer) to fine-tune your model on the Rotten Tomatoes training split. Setting up the trainer will involve (at least):


1. Instantiating a [`DataCollatorWithPadding`](https://huggingface.co/docs/transformers/en/main_classes/data_collator) object which is what *actually* does your batch construction (by padding all sequences to the same length).
2. Writing an *evaluation function* that will measure the classification accuracy. This function takes a single argument which is a tuple containing `(logits, labels)` which you should use to compute classification accuracy (and maybe other metrics like F1 score, precision, recall) and return a `dict` with these metrics.  
3. Instantiating a [`TrainingArguments`](https://huggingface.co/docs/transformers/v4.51.1/en/main_classes/trainer#transformers.TrainingArguments) object using some reasonable defaults.
4. Instantiating a `Trainer` object using your train and validation splits, your data collator, and the function that computes performance metrics.
5. Calling `trainer.train()`, waiting, waiting some more, and then calling `trainer.evaluate()` to see how it did.

**Tip**: When prototyping this laboratory I discovered the HuggingFace [Evaluate library](https://huggingface.co/docs/evaluate/en/index), which provides evaluation metrics. However, I found that its insufferable layers of abstraction get in the way of actually computing metrics. I suggest just using the Scikit-learn metrics...

In [41]:
from transformers import DataCollatorWithPadding

# Create data collator for dynamic padding
data_collator = DataCollatorWithPadding(
    tokenizer=tokenizer,
    padding='longest',  # Pad to longest in batch
    return_tensors='pt' # Return PyTorch tensors
)

# Test the collator with a sample batch
sample_batch = [tokenized_datasets['train'][i] for i in range(2)]
collated = data_collator(sample_batch)
print("Collated batch keys:", collated.keys())
print("Input IDs shape:", collated['input_ids'].shape)

Collated batch keys: dict_keys(['input_ids', 'attention_mask', 'labels'])
Input IDs shape: torch.Size([2, 128])
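
The collated shape is `[2, 128]` because the examples were already padded to 128 tokens during tokenization, so the collator has nothing left to pad. Dynamic padding only helps when the dataset stores unpadded sequences. The plain-Python sketch below illustrates the idea of padding to the longest sequence in a batch. `pad_batch` is a hypothetical helper, not the `DataCollatorWithPadding` implementation, and `pad_id=0` matches the `pad_token_id` shown in the model configuration.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence to the longest one in this batch and build the masks."""
    longest = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up unpadded sequences of different lengths
batch = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

With this scheme a batch of short reviews is only as wide as its longest member, instead of always being 128 tokens wide.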


In [42]:
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def compute_metrics(eval_pred):
    """Compute metrics for evaluation"""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    
    return {
        'accuracy': accuracy_score(labels, predictions),
        'f1': f1_score(labels, predictions, average='macro'),
        'precision': precision_score(labels, predictions, average='macro'),
        'recall': recall_score(labels, predictions, average='macro')
    }
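
To see what `compute_metrics` computes, here is the same calculation written without NumPy or Scikit-learn, on made-up logits: take the argmax of each row, then average the per-class F1 scores with equal weight (macro averaging). This is an illustrative re-derivation; `argmax` and `macro_f1` are local helpers, not library functions.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


def macro_f1(labels, predictions, classes=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in classes:
        tp = sum(1 for y, p in zip(labels, predictions) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, predictions) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, predictions) if y == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up logits for four examples and their true labels
logits = [[2.0, -1.0], [0.1, 0.3], [1.5, 0.2], [-0.5, 0.9]]
labels = [0, 1, 1, 1]
predictions = [argmax(row) for row in logits]
accuracy = sum(y == p for y, p in zip(labels, predictions)) / len(labels)
print(predictions, accuracy, round(macro_f1(labels, predictions), 4))
```

Here the third example is misclassified, giving an accuracy of 0.75, a per-class F1 of 2/3 for class 0 and 0.8 for class 1, and a macro F1 of about 0.733.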

In [None]:
from transformers import TrainingArguments

# Set training parameters
training_args = TrainingArguments(
    output_dir='./results',          # Output directory
    eval_strategy='epoch',           # Evaluate after each epoch
    save_strategy='epoch',           # Save after each epoch
    learning_rate=3e-5,              # Learning rate
    per_device_train_batch_size=8,   # Batch size
    per_device_eval_batch_size=16,   # Eval batch size
    num_train_epochs=5,              # Number of epochs
    weight_decay=0.02,               # Weight decay
    load_best_model_at_end=True,     # Load best model at end
    metric_for_best_model='f1',      # Use F1 to select best model
    logging_dir='./logs',            # Logging directory
    logging_steps=50,                # Log every 50 steps
    report_to='none',                # Disable external logging
    gradient_accumulation_steps=2,   # Effective batch size of 16
    fp16=True,                       # Mixed precision training
    gradient_checkpointing=True,     # Memory optimization
    warmup_ratio=0.1,                # Learning rate warmup
    lr_scheduler_type="cosine",      # Learning rate decay
)
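
The `warmup_ratio=0.1` and `lr_scheduler_type="cosine"` settings together describe a learning rate that ramps up linearly over the first 10% of the optimizer steps and then decays along half a cosine wave. The sketch below reproduces that shape in plain Python, under the assumption that the decay ends at zero after a single half cycle. `lr_at` is an illustrative helper and `total_steps = 1000` is an arbitrary value, not the step count of this run.

```python
import math


def lr_at(step, total_steps, peak_lr=3e-5, warmup_ratio=0.1):
    """Learning rate at an optimizer step: linear warmup, then cosine decay to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # illustrative only
for step in (0, 50, 100, 550, 1000):
    print(step, f"{lr_at(step, total_steps):.2e}")
```

The peak of `3e-5` is reached at the end of warmup (step 100 here), and the rate is back to half of the peak midway through the decay.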


In [None]:
from transformers import Trainer
from transformers import EarlyStoppingCallback

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    processing_class=tokenizer,  # `tokenizer` is deprecated; use `processing_class` instead
    callbacks=[EarlyStoppingCallback(
        early_stopping_patience=2,
        early_stopping_threshold=0.001
    )]
)

print(device)

No label_names provided for model class `PeftModelForSequenceClassification`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


cuda


In [63]:
# Train the model
print("Starting training...")
trainer.train()

# Evaluate on validation set
print("\nEvaluating on validation set...")
val_metrics = trainer.evaluate()
print("\nValidation metrics:")
for k, v in val_metrics.items():
    print(f"{k}: {v:.4f}")

# Evaluate on test set
print("\nEvaluating on test set...")
test_metrics = trainer.evaluate(tokenized_datasets['test'])
print("\nTest metrics:")
for k, v in test_metrics.items():
    print(f"{k}: {v:.4f}")

Starting training...




| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.4265 | 0.403002 | 0.822702 | 0.822633 | 0.823203 | 0.822702 |
| 2 | 0.4407 | 0.399947 | 0.818949 | 0.818879 | 0.819445 | 0.818949 |
| 3 | 0.4322 | 0.400174 | 0.821764 | 0.821763 | 0.821768 | 0.821764 |





Evaluating on validation set...



Validation metrics:
eval_loss: 0.4030
eval_accuracy: 0.8227
eval_f1: 0.8226
eval_precision: 0.8232
eval_recall: 0.8227
eval_runtime: 45.7464
eval_samples_per_second: 23.3020
eval_steps_per_second: 1.4650
epoch: 3.0000

Evaluating on test set...

Test metrics:
eval_loss: 0.4420
eval_accuracy: 0.7974
eval_f1: 0.7973
eval_precision: 0.7976
eval_recall: 0.7974
eval_runtime: 46.4760
eval_samples_per_second: 22.9370
eval_steps_per_second: 1.4420
epoch: 3.0000


**Results Analysis**:

The model was trained for 3 epochs with stable validation results. Accuracy is already high after the first epoch (82.3%), and the validation metrics (accuracy, F1, precision, recall) stay between roughly 81.9% and 82.3% in every epoch, which suggests good generalization.

The training loss fluctuates slightly (0.4265, 0.4407, 0.4322) instead of decreasing steadily, while the validation loss stays flat at about 0.40. The gap between training and validation loss remains small, which suggests that the model is not memorizing the training set.

Accuracy on the test set (79.7%) is about 2.5 points lower than on validation, but still close, which confirms reasonably good performance. Test F1, precision, and recall are all balanced (~79.7%), so the model does not appear to favor one class over the other.

In summary, the fine-tuned model performs well on sentiment analysis and generalizes acceptably to the test set, although the epochs after the first bring no real improvement.
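To put the gap between validation and test accuracy in perspective, it helps to compare it with the sampling noise of an accuracy estimate. The sketch below uses the binomial standard error; the accuracies are the ones reported above, while the split size of 1,066 examples is an assumption (it is consistent with `eval_runtime × eval_samples_per_second` in the logs). The helper name `accuracy_standard_error` is introduced here for illustration only.

```python
import math


def accuracy_standard_error(accuracy: float, n_examples: int) -> float:
    """Binomial standard error of an accuracy measured on n_examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_examples)


# Values reported in the evaluation logs above.
val_acc, test_acc = 0.8227, 0.7974
n_eval = 1066  # assumed size of the validation and test splits

gap = val_acc - test_acc
se_test = accuracy_standard_error(test_acc, n_eval)

print(f"validation - test gap: {gap:.4f}")
print(f"test standard error:   {se_test:.4f}")
print(f"gap in standard errors: {gap / se_test:.1f}")
```

Under these assumptions the gap is about two standard errors, so it is noticeable but not dramatic for splits of this size.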

In [64]:
# Save the best model
trainer.save_model('./fine_tuned_distilbert')

# Save tokenizer as well
tokenizer.save_pretrained('./fine_tuned_distilbert')

('./fine_tuned_distilbert/tokenizer_config.json',
 './fine_tuned_distilbert/special_tokens_map.json',
 './fine_tuned_distilbert/vocab.txt',
 './fine_tuned_distilbert/added_tokens.json',
 './fine_tuned_distilbert/tokenizer.json')

In [2]:
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer

# Load the fine-tuned model
model = AutoModelForSequenceClassification.from_pretrained('./fine_tuned_distilbert')

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained('./fine_tuned_distilbert')

In [65]:
# Get predictions for test set
predictions = trainer.predict(tokenized_datasets['test'])
pred_labels = np.argmax(predictions.predictions, axis=1)

# Show some examples
sample_indices = [10, 20, 30, 40, 50]
for idx in range(1, 100):  # or iterate over sample_indices for a shorter listing
    text = dataset['test']['text'][idx]
    true_label = dataset['test']['label'][idx]
    pred_label = pred_labels[idx]
    print(f"\nText: {text}")
    print(f"True: {'positive' if true_label else 'negative'}")
    print(f"Pred: {'positive' if pred_label else 'negative'}")
    print("---")


Text: consistently clever and suspenseful .
True: positive
Pred: positive
---

Text: it's like a " big chill " reunion of the baader-meinhof gang , only these guys are more harmless pranksters than political activists .
True: positive
Pred: negative
---

Text: the story gives ample opportunity for large-scale action and suspense , which director shekhar kapur supplies with tremendous skill .
True: positive
Pred: positive
---

Text: red dragon " never cuts corners .
True: positive
Pred: negative
---

Text: fresnadillo has something serious to say about the ways in which extravagant chance can distort our perspective and throw us off the path of good sense .
True: positive
Pred: positive
---

Text: throws in enough clever and unexpected twists to make the formula feel fresh .
True: positive
Pred: positive
---

Text: weighty and ponderous but every bit as filling as the treat of the title .
True: positive
Pred: negative
---

Text: a real audience-pleaser that will strike a chord with any

-----
### Exercise 3: Choose at Least One


#### Exercise 3.1: Efficient Fine-tuning for Sentiment Analysis (easy)

In Exercise 2 we fine-tuned the *entire* DistilBERT model on Rotten Tomatoes. This is expensive, even for a small model. Find an *efficient* way to fine-tune DistilBERT on the Rotten Tomatoes dataset (or some other dataset).

**Hint**: You could check out the [HuggingFace PEFT library](https://huggingface.co/docs/peft/en/index) for some state-of-the-art approaches that should "just work". How else might you go about making fine-tuning more efficient without having to change your training pipeline from above?
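One possible answer to the second question in the hint is to freeze the backbone and train only the classification head, which leaves the `Trainer` pipeline untouched. The sketch below illustrates the idea on a tiny stand-in model so that it runs without downloading anything; `freeze_except` and `TinyClassifier` are hypothetical names introduced here, not part of any library.

```python
import torch
from torch import nn


def freeze_except(model: nn.Module, trainable_prefixes: tuple) -> int:
    """Freeze every parameter whose name does not start with one of the
    given prefixes, and return the number of parameters left trainable."""
    n_trainable = 0
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(trainable_prefixes)
        if param.requires_grad:
            n_trainable += param.numel()
    return n_trainable


class TinyClassifier(nn.Module):
    """Stand-in for a pre-trained backbone plus a classification head."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(16, 8)
        self.classifier = nn.Linear(8, 2)

    def forward(self, x):
        return self.classifier(torch.relu(self.backbone(x)))


toy_model = TinyClassifier()
n_trainable = freeze_except(toy_model, ("classifier",))
print(f"trainable parameters: {n_trainable}")  # head only: 8 * 2 + 2
```

For the DistilBERT classifier used in this notebook, the prefixes would presumably be `("pre_classifier", "classifier")`, the two head modules named in the initialization warning shown for that model.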

In [76]:
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model
import torch

# Load base model
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
    id2label={0: "negative", 1: "positive"}
)

# Configure LoRA
peft_config = LoraConfig(
    r=6,  # Slightly higher rank than before
    lora_alpha=12,
    target_modules=["q_lin", "v_lin", "out_lin"],
    lora_dropout=0.05,
    bias="lora_only",
    task_type="SEQ_CLS",
    #fan_in_fan_out=True,  # Fixes gradient flow
    modules_to_save=["classifier"]  # Ensure classifier gets gradients
)

# Wrap model with LoRA
model = get_peft_model(model, peft_config)

# Check trainable parameters
model.print_trainable_parameters()
# In this run about 1.14% of the parameters are trainable (see the output below)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 771,842 || all params: 67,713,028 || trainable%: 1.1399
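The reported count of trainable parameters can be reproduced by hand. The sketch below is one breakdown that adds up exactly to the printed value. It assumes the standard `distilbert-base-uncased` dimensions (hidden size 768, 6 layers) and assumes that `modules_to_save=["classifier"]` also captures `pre_classifier`, an inference from the numbers rather than something stated in the notebook.

```python
hidden = 768            # assumed hidden size of distilbert-base-uncased
n_layers = 6            # assumed number of transformer layers
rank = 6                # r in LoraConfig
num_labels = 2
targets_per_layer = 3   # q_lin, v_lin, out_lin

# Each adapted linear layer gets A (rank x hidden) and B (hidden x rank).
lora_weights = n_layers * targets_per_layer * rank * (hidden + hidden)

# bias="lora_only" also trains the bias of every adapted layer.
lora_biases = n_layers * targets_per_layer * hidden

# Fully trainable head modules (weights plus biases).
pre_classifier = hidden * hidden + hidden
classifier = hidden * num_labels + num_labels

trainable = lora_weights + lora_biases + pre_classifier + classifier
all_params = 67_713_028  # total reported by print_trainable_parameters()

print(f"trainable params: {trainable:,}")
print(f"trainable%: {100 * trainable / all_params:.4f}")
```

Most of the trainable parameters come from the dense `pre_classifier` layer rather than from the LoRA matrices, which explains why the share ends up slightly above 1%.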


In [77]:
import os

from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DataCollatorWithPadding,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

# Disable tokenizer parallelism to avoid fork warnings with DataLoader workers
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Load and tokenize data (same as before)
dataset = load_dataset("rotten_tomatoes")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_fn(examples):
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=128  # Reduced from default 512 if possible
    )

tokenized_ds = dataset.map(tokenize_fn, batched=True)

# Training arguments
training_args = TrainingArguments(
    output_dir="./peft-lora",
    learning_rate=8e-5,  # Adjusted learning rate
    per_device_train_batch_size=6,  # Balanced batch size
    per_device_eval_batch_size=12,
    gradient_accumulation_steps=3,
    num_train_epochs=12,  # More epochs with early stopping
    eval_strategy="epoch",
    save_strategy="epoch",
    fp16=True,
    gradient_checkpointing=True,
    optim="adamw_torch_fused",
    logging_steps=50,
    warmup_steps=100,  # Explicit warmup
    lr_scheduler_type="cosine",
    report_to="none",
    metric_for_best_model="eval_f1",
    greater_is_better=True,
    load_best_model_at_end=True,
    dataloader_pin_memory=True,  # Faster data loading
    dataloader_num_workers=2  # If you have multiple CPU cores
)

# Create Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["validation"],
    processing_class=tokenizer,
    compute_metrics=compute_metrics,
    data_collator=DataCollatorWithPadding(tokenizer),
    callbacks=[
        EarlyStoppingCallback(early_stopping_patience=3),
        #GradientDebugCallback()  # For monitoring
    ]
)

No label_names provided for model class `PeftModelForSequenceClassification`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
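The batch-size settings above interact with gradient accumulation and warmup. As a rough sketch, the arithmetic below works out the effective batch size and the share of training spent in warmup. It assumes a single GPU and a training split of 8,530 examples; neither is stated in the cells above. It also assumes the full 12 epochs run, which early stopping may cut short.

```python
import math

train_examples = 8530      # assumed size of the Rotten Tomatoes train split
num_devices = 1            # assumed single-GPU training
per_device_batch = 6       # per_device_train_batch_size
grad_accum = 3             # gradient_accumulation_steps
num_epochs = 12            # num_train_epochs (upper bound with early stopping)
warmup_steps = 100

# Examples that contribute to one optimizer update.
effective_batch = per_device_batch * grad_accum * num_devices

# Approximate number of optimizer updates per epoch and in total.
batches_per_epoch = math.ceil(train_examples / (per_device_batch * num_devices))
updates_per_epoch = math.ceil(batches_per_epoch / grad_accum)
max_updates = updates_per_epoch * num_epochs

print(f"effective batch size: {effective_batch}")
print(f"updates per epoch:    {updates_per_epoch}")
print(f"warmup share:         {warmup_steps / max_updates:.2%}")
```

Because the cosine schedule is laid out over the maximum number of updates, a run that stops early never reaches the low-learning-rate tail of the schedule.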


In [78]:
# Train the model
print("Starting training...")
trainer.train()

# Evaluate on validation set
print("\nEvaluating on validation set...")
val_metrics = trainer.evaluate()
print("\nValidation metrics:")
for k, v in val_metrics.items():
    print(f"{k}: {v:.4f}")

# Evaluate on test set
print("\nEvaluating on test set...")
test_metrics = trainer.evaluate(tokenized_ds['test'])
print("\nTest metrics:")
for k, v in test_metrics.items():
    print(f"{k}: {v:.4f}")

Starting training...




Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.4874,0.455879,0.802064,0.801937,0.802841,0.802064
2,0.451,0.420344,0.818949,0.818865,0.819544,0.818949
3,0.4545,0.408677,0.819887,0.819865,0.82005,0.819887
4,0.4358,0.405964,0.82364,0.82364,0.82364,0.82364
5,0.4546,0.401385,0.821764,0.821761,0.821782,0.821764
6,0.4183,0.403893,0.810507,0.809821,0.815049,0.810507
7,0.4317,0.404573,0.811445,0.810697,0.816445,0.811445





Evaluating on validation set...



Validation metrics:
eval_loss: 0.4060
eval_accuracy: 0.8236
eval_f1: 0.8236
eval_precision: 0.8236
eval_recall: 0.8236
eval_runtime: 46.5326
eval_samples_per_second: 22.9090
eval_steps_per_second: 1.9130
epoch: 7.0000

Evaluating on test set...

Test metrics:
eval_loss: 0.4448
eval_accuracy: 0.7871
eval_f1: 0.7870
eval_precision: 0.7872
eval_recall: 0.7871
eval_runtime: 46.6550
eval_samples_per_second: 22.8490
eval_steps_per_second: 1.9080
epoch: 7.0000


**Results Analysis:**

In this experiment we adopted a more efficient fine-tuning method, LoRA from the PEFT library, which reduces the computational load without substantially changing the pipeline: only about 1.14% of the parameters are trainable.

Training stopped after 7 epochs. The best validation accuracy, 82.4%, was reached at epoch 4 and is comparable to the result of Exercise 2 (82.3%), at a lower training cost.

The metrics are stable and balanced, which indicates solid generalization, although test accuracy drops to 78.7%, below the validation value and about one point below the 79.7% of Exercise 2. Evaluation runtime and throughput are essentially unchanged (about 23 samples per second), which is expected: LoRA reduces the number of trained parameters, not the cost of a forward pass.

In conclusion, efficient fine-tuning with techniques such as those in PEFT is a valid alternative to full fine-tuning, maintaining competitive performance with fewer resources.
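The training log for this run ends at epoch 7 even though `num_train_epochs=12`, which is the effect of `EarlyStoppingCallback(early_stopping_patience=3)` monitoring `eval_f1`. The sketch below is a simplified re-implementation of the patience rule, not the library code itself, applied to the F1 values from the log table; `simulate_early_stopping` is a hypothetical helper.

```python
def simulate_early_stopping(scores, patience, threshold=0.0):
    """Return (best_epoch, last_epoch) for a higher-is-better metric.

    Training stops once `patience` consecutive evaluations fail to beat
    the best score so far by more than `threshold`.
    """
    best_score, best_epoch, bad_evals = None, None, 0
    for epoch, score in enumerate(scores, start=1):
        if best_score is None or score - best_score > threshold:
            best_score, best_epoch, bad_evals = score, epoch, 0
        else:
            bad_evals += 1
        if bad_evals >= patience:
            return best_epoch, epoch
    return best_epoch, len(scores)


# Validation F1 per epoch, copied from the training log above.
f1_per_epoch = [0.801937, 0.818865, 0.819865, 0.82364,
                0.821761, 0.809821, 0.810697]

best_epoch, last_epoch = simulate_early_stopping(f1_per_epoch, patience=3)
print(f"best epoch: {best_epoch}, training stopped after epoch: {last_epoch}")
```

With `load_best_model_at_end=True`, the final evaluation uses the epoch 4 checkpoint, which matches the validation metrics printed after training.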

In [79]:
# Save the best model
trainer.save_model('./exercise_peft_lora')

# Save tokenizer as well
tokenizer.save_pretrained('./exercise_peft_lora')

('./exercise_peft_lora/tokenizer_config.json',
 './exercise_peft_lora/special_tokens_map.json',
 './exercise_peft_lora/vocab.txt',
 './exercise_peft_lora/added_tokens.json',
 './exercise_peft_lora/tokenizer.json')

In [80]:
# Get predictions for test set
predictions = trainer.predict(tokenized_ds['test'])
pred_labels = np.argmax(predictions.predictions, axis=1)

# Show some examples
sample_indices = [10, 20, 30, 40, 50]
for idx in sample_indices:
    text = dataset['test']['text'][idx]
    true_label = dataset['test']['label'][idx]
    pred_label = pred_labels[idx]
    print(f"\nText: {text}")
    print(f"True: {'positive' if true_label else 'negative'}")
    print(f"Pred: {'positive' if pred_label else 'negative'}")
    print("---")


Text: exposing the ways we fool ourselves is one hour photo's real strength .
True: positive
Pred: positive
---

Text: this kind of hands-on storytelling is ultimately what makes shanghai ghetto move beyond a good , dry , reliable textbook and what allows it to rank with its worthy predecessors .
True: positive
Pred: positive
---

Text: what's so striking about jolie's performance is that she never lets her character become a caricature -- not even with that radioactive hair .
True: positive
Pred: negative
---

Text: you needn't be steeped in '50s sociology , pop culture or movie lore to appreciate the emotional depth of haynes' work . though haynes' style apes films from the period . . . its message is not rooted in that decade .
True: positive
Pred: positive
---

Text: works because we're never sure if ohlinger's on the level or merely a dying , delusional man trying to get into the history books before he croaks .
True: positive
Pred: negative
---


#### Exercise 3.2: Fine-tuning a CLIP Model (harder)

Use a (small) CLIP model like [`openai/clip-vit-base-patch16`](https://huggingface.co/openai/clip-vit-base-patch16) and evaluate its zero-shot performance on a small image classification dataset like ImageNette or TinyImageNet. Fine-tune (using a parameter-efficient method!) the CLIP model to see how much improvement you can squeeze out of it.

**Note**: There are several ways to adapt the CLIP model; you could fine-tune the image encoder, the text encoder, or both. Or, you could experiment with prompt learning.

**Tip**: CLIP probably already works very well on ImageNet and ImageNet-like images. For extra fun, look for an image classification dataset with different image types (e.g. *sketches*).

In [4]:
# Your code here.

#### Exercise 3.3: Choose your Own Adventure

There are a *ton* of interesting and fun models on the HuggingFace hub. Pick one that does something interesting and adapt it in some way to a new task. Or, combine two or more models into something more interesting or fun. The sky's the limit.

**Note**: Reach out to me by email or on the Discord if you are unsure about anything.

In [5]:
# Your code here.