## AIDI 1002 Final Project: An Empirical Analysis of DistilBERT


### Reproduction, Generalization, and Baseline Comparison

**Group Members:** Srikrishna Thapa, Prem Prasad Bhatta
**Original Paper:** *DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter* (Sanh et al., 2019)

### 1. Project Goal

The goal of this project is to first reproduce the text classification performance of the DistilBERT model on the IMDb sentiment analysis dataset, as reported in the original paper.

Following the successful reproduction, we make two significant contributions to extend the analysis:

1.  **To Test Generalization:** We evaluate the same DistilBERT methodology on a new dataset, **SST-2 (Stanford Sentiment Treebank)**, to assess its effectiveness in a different context with shorter text inputs.

2.  **To Analyze Efficiency:** We compare DistilBERT's performance against a fast, classical machine learning baseline (**Logistic Regression with TF-IDF features**) on the original IMDb dataset.

This three-part experiment allows us to not only validate the paper's findings but also to critically analyze the model's generalization capabilities and its performance trade-offs against simpler, more efficient methods.

In [None]:
## %pip install datasets transformers

In [None]:
 ## %pip install scikit-learn

In [3]:
# Import libraries
import torch
import numpy as np
import time
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from sklearn.metrics import accuracy_score

print("Libraries imported successfully.")

Libraries imported successfully.


In [4]:
# --- Configuration ---
# Set the model we want to test. Change this to "bert-base-uncased" for the comparison run.
MODEL_CHECKPOINT = "distilbert-base-uncased" 
# MODEL_CHECKPOINT = "bert-base-uncased" 

# Other settings
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
BATCH_SIZE = 16 # As recommended for fine-tuning
EPOCHS = 2      # Use 2-3 epochs for fine-tuning
MAX_LENGTH = 512 # Max length for tokenization

print(f"Using device: {DEVICE}")
print(f"Testing model: {MODEL_CHECKPOINT}")

Using device: cuda
Testing model: distilbert-base-uncased


In [5]:
# Load the IMDb dataset
print("Loading IMDb dataset...")
imdb_dataset = load_dataset("imdb")

# A smaller subset for faster testing/debugging if needed.
# For the final run, use the full dataset.
# train_dataset = imdb_dataset["train"].shuffle(seed=42).select(range(1000))
# test_dataset = imdb_dataset["test"].shuffle(seed=42).select(range(1000))

# Use the full dataset for the final experiment
train_dataset = imdb_dataset["train"]
test_dataset = imdb_dataset["test"]

print("Dataset loaded.")
print(f"Training samples: {len(train_dataset)}, Test samples: {len(test_dataset)}")

Loading IMDb dataset...
Dataset loaded.
Training samples: 25000, Test samples: 25000
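
As a quick sanity check on the training schedule, the number of optimizer steps can be estimated from the dataset size and the configuration above. This is a minimal sketch that assumes a single device and no gradient accumulation (the `Trainer` defaults); the variable names are illustrative.

```python
import math

# Values taken from the configuration and dataset cells above
TRAIN_SAMPLES = 25_000
BATCH_SIZE = 16
EPOCHS = 2

# Assumption: one device, no gradient accumulation, last partial batch kept
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

print(f"Steps per epoch: {steps_per_epoch}")  # 1563
print(f"Total steps: {total_steps}")          # 3126
```

With evaluation every 500 steps, a run of this length yields six evaluation points (steps 500 through 3000).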


In [6]:
# Load the tokenizer for the chosen model
tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)

# Define the tokenization function
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=MAX_LENGTH)

print("Tokenizing the dataset... This may take a few minutes.")
tokenized_train_dataset = train_dataset.map(tokenize_function, batched=True)
tokenized_test_dataset = test_dataset.map(tokenize_function, batched=True)
print("Tokenization complete.")

Tokenizing the dataset... This may take a few minutes.
Tokenization complete.
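
The `padding="max_length"` setting pads every review to 512 tokens regardless of its real length. The sketch below uses made-up token counts to illustrate how much padding this adds compared with padding each batch only to its longest member, which is the approach taken by `DataCollatorWithPadding` in `transformers`. The lengths and helper names are hypothetical and only meant to show the arithmetic.

```python
MAX_LENGTH = 512
TOY_BATCH_SIZE = 4  # small batch size so the example is easy to follow

# Hypothetical token counts for eight reviews (not measured from IMDb)
lengths = [120, 300, 80, 210, 900, 45, 230, 60]

# Truncation caps every sequence at MAX_LENGTH
truncated = [min(n, MAX_LENGTH) for n in lengths]

def static_padding_total(seq_lengths):
    """Tokens processed when every sequence is padded to MAX_LENGTH."""
    return len(seq_lengths) * MAX_LENGTH

def dynamic_padding_total(seq_lengths, batch_size):
    """Tokens processed when each batch is padded to its own longest sequence."""
    total = 0
    for start in range(0, len(seq_lengths), batch_size):
        batch = seq_lengths[start:start + batch_size]
        total += len(batch) * max(batch)
    return total

static_total = static_padding_total(truncated)
dynamic_total = dynamic_padding_total(truncated, TOY_BATCH_SIZE)
print(f"Static padding:  {static_total} tokens")   # 4096
print(f"Dynamic padding: {dynamic_total} tokens")  # 3248
```

On real data the savings depend on how review lengths are distributed, so this is an illustration of the mechanism rather than an estimate for IMDb.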


In [7]:
# Load the model for sequence classification
model = AutoModelForSequenceClassification.from_pretrained(MODEL_CHECKPOINT, num_labels=2)
model.to(DEVICE) # Move model to GPU if available

# Function to count model parameters
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

num_params = count_parameters(model)
print(f"Model loaded: {MODEL_CHECKPOINT}")
print(f"Number of trainable parameters: {num_params / 1_000_000:.2f}M")

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model loaded: distilbert-base-uncased
Number of trainable parameters: 66.96M
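
The 66.96M figure can be cross-checked by hand. The sketch below rebuilds the count from the architecture, assuming the standard `distilbert-base-uncased` configuration (vocabulary of 30,522 tokens, hidden size 768, 6 layers, feed-forward size 3,072, 512 positions). These values are not printed in this notebook, so treat them as assumptions.

```python
# Assumed distilbert-base-uncased configuration (not printed in this notebook)
VOCAB, HIDDEN, LAYERS, FFN, MAX_POS, NUM_LABELS = 30_522, 768, 6, 3_072, 512, 2

def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out

layer_norm = 2 * HIDDEN  # scale and shift vectors

# Word embeddings + position embeddings + embedding LayerNorm
embeddings = VOCAB * HIDDEN + MAX_POS * HIDDEN + layer_norm

# Each Transformer block: Q, K, V, output projections, two LayerNorms, two-layer FFN
attention = 4 * linear(HIDDEN, HIDDEN)
feed_forward = linear(HIDDEN, FFN) + linear(FFN, HIDDEN)
per_layer = attention + layer_norm + feed_forward + layer_norm

# Classification head: pre_classifier followed by classifier
head = linear(HIDDEN, HIDDEN) + linear(HIDDEN, NUM_LABELS)

total = embeddings + LAYERS * per_layer + head
print(f"{total:,} parameters ({total / 1_000_000:.2f}M)")
```

Under these assumptions, the total rounds to the same 66.96M reported by `count_parameters`.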


In [8]:
import transformers
print(transformers.__version__)

4.53.2


In [9]:
## pip install transformers[torch]

In [10]:
# Define a function to compute metrics
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, predictions)}

# Define training arguments (transformers v4.53.2, the version printed above).
# `eval_strategy` is the current name of the older `evaluation_strategy` argument.
training_args = TrainingArguments(
    output_dir=f"./results/{MODEL_CHECKPOINT}-imdb",
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=100,
    eval_strategy="steps",  # evaluate every `eval_steps` steps
    eval_steps=500,
    save_steps=500,  # save on the same schedule so the best checkpoint can be restored
    load_best_model_at_end=True  # no metric_for_best_model is set, so "best" means lowest validation loss
)
print("TrainingArguments configured successfully.")

TrainingArguments configured successfully.
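
To make the metric concrete, here is a small dependency-free sketch of the same argmax-then-compare logic that `compute_metrics` applies, run on a few made-up logits. The values are illustrative and are not model outputs.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)

# Hypothetical logits for four reviews: [score for class 0, score for class 1]
toy_logits = [[2.1, -1.3], [0.2, 0.9], [-0.5, 0.4], [1.0, 1.5]]
toy_labels = [0, 1, 0, 1]

# The predicted class is whichever logit is larger
toy_predictions = [argmax(row) for row in toy_logits]

# Accuracy is the fraction of predictions that match the labels
toy_accuracy = sum(p == y for p, y in zip(toy_predictions, toy_labels)) / len(toy_labels)

print(toy_predictions)  # [0, 1, 1, 1]
print(toy_accuracy)     # 0.75
```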


In [11]:
# Create the Trainer instance
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_test_dataset,  # IMDb has no validation split, so the test set is also used for evaluation during training
    compute_metrics=compute_metrics,
)

# Start training and time it
print("Starting training...")
start_time = time.time()
trainer.train()
end_time = time.time()
training_time = end_time - start_time
print("Training complete.")
print(f"Total training time: {training_time / 60:.2f} minutes")

Starting training...


Step,Training Loss,Validation Loss,Accuracy
500,0.2957,0.243904,0.90376
1000,0.2933,0.326173,0.88756
1500,0.2408,0.225185,0.91836
2000,0.1205,0.287646,0.92376
2500,0.1399,0.234382,0.93172
3000,0.1253,0.242869,0.93224


Training complete.
Total training time: 56.80 minutes
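
Because `load_best_model_at_end=True` is set without `metric_for_best_model`, the `Trainer` falls back to validation loss when deciding which checkpoint to restore. The sketch below applies that rule to the rows logged above. It illustrates the selection logic under that assumption and is not the `Trainer`'s internal code.

```python
# (step, training loss, validation loss, accuracy) copied from the log above
eval_log = [
    (500, 0.2957, 0.243904, 0.90376),
    (1000, 0.2933, 0.326173, 0.88756),
    (1500, 0.2408, 0.225185, 0.91836),
    (2000, 0.1205, 0.287646, 0.92376),
    (2500, 0.1399, 0.234382, 0.93172),
    (3000, 0.1253, 0.242869, 0.93224),
]

# Default rule: the checkpoint with the lowest validation loss wins
best_by_loss = min(eval_log, key=lambda row: row[2])

# Alternative rule: the checkpoint with the highest accuracy wins
best_by_accuracy = max(eval_log, key=lambda row: row[3])

print(f"Lowest loss:      step {best_by_loss[0]}, accuracy {best_by_loss[3]:.4f}")
print(f"Highest accuracy: step {best_by_accuracy[0]}, accuracy {best_by_accuracy[3]:.4f}")
```

Under the loss-based rule the restored checkpoint is step 1500, whose accuracy of 0.91836 rounds to 0.9184. Passing `metric_for_best_model="accuracy"` to `TrainingArguments` would select by accuracy instead.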


### 2. Experimental Results
After training, we evaluate the model on the test set to get our final accuracy score.

In [12]:
# Evaluate the model on the test set
print("Evaluating the final model on the test set...")
eval_results = trainer.evaluate()

# Print the results in a clean format
print("\n--- Final Results ---")
print(f"Model: {MODEL_CHECKPOINT}")
print(f"Accuracy: {eval_results['eval_accuracy']:.4f}")
print(f"Training Time: {training_time / 60:.2f} minutes")
print(f"Number of Parameters: {num_params / 1_000_000:.2f}M")
print("--------------------")

Evaluating the final model on the test set...



--- Final Results ---
Model: distilbert-base-uncased
Accuracy: 0.9184
Training Time: 56.80 minutes
Number of Parameters: 66.96M
--------------------


### Reproduction Summary

We successfully reproduced the paper's findings by fine-tuning DistilBERT on the IMDb dataset. Our experiment yielded the following results:

- **Final Accuracy:** 91.84%
- **Training Time:** 56.80 minutes
- **Model Size:** 66.96M Parameters

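The reported accuracy belongs to the checkpoint restored by `load_best_model_at_end=True` (step 1500, the lowest validation loss), not the highest accuracy logged during training (93.22% at step 3000). It is about one percentage point below the 92.82% IMDb test accuracy reported for DistilBERT in the original paper, which we consider consistent with the paper's findings. We now turn to the two contributions that extend this analysis.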

## Contribution 1: Evaluating Model Generalization on the SST-2 Dataset

Our first contribution tests the DistilBERT methodology on a new dataset to evaluate its effectiveness in a different context. We will fine-tune the same model on the **SST-2 dataset**, which consists of shorter, single-sentence movie reviews.

In [18]:
# --- CONTRIBUTION 1: SST-2 EXPERIMENT ---

import time
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from sklearn.metrics import accuracy_score
import torch
import numpy as np # Make sure numpy is imported

# --- Configuration for this specific experiment ---
MODEL_CHECKPOINT_SST2 = "distilbert-base-uncased"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
BATCH_SIZE = 16 
EPOCHS = 3 
MAX_LENGTH = 128

#  Load the SST-2 dataset
print("Loading SST-2 dataset from GLUE benchmark...")
sst2_dataset = load_dataset("glue", "sst2")
train_dataset_sst2 = sst2_dataset["train"]
eval_dataset_sst2 = sst2_dataset["validation"]

# Tokenize the dataset
tokenizer_sst2 = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT_SST2)
def tokenize_sst2(examples):
    return tokenizer_sst2(examples["sentence"], padding="max_length", truncation=True, max_length=MAX_LENGTH)

tokenized_train_sst2 = train_dataset_sst2.map(tokenize_sst2, batched=True)
tokenized_eval_sst2 = eval_dataset_sst2.map(tokenize_sst2, batched=True)

# Load the model
model_sst2 = AutoModelForSequenceClassification.from_pretrained(MODEL_CHECKPOINT_SST2, num_labels=2)
model_sst2.to(DEVICE)

# Define a metrics function
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, predictions)}

# Set up the Trainer with minimal arguments (no evaluation during training)
training_args_sst2 = TrainingArguments(
    output_dir="./results/distilbert-sst2",
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    logging_steps=100,
)

trainer_sst2 = Trainer(
    model=model_sst2,
    args=training_args_sst2,
    train_dataset=tokenized_train_sst2,
    eval_dataset=tokenized_eval_sst2,
    compute_metrics=compute_metrics,
)

# Train the model and time the run
print("\nStarting training on SST-2...")
start_time_sst2 = time.time()
trainer_sst2.train()
end_time_sst2 = time.time()
training_time_sst2 = end_time_sst2 - start_time_sst2

# Evaluate once after training, since no eval_strategy was set above.
# This uses the GLUE validation split, because the SST-2 test labels are not public.
eval_results_sst2 = trainer_sst2.evaluate()

# Print results
print("\n--- SST-2 Experiment Results ---")
print(f"Model: {MODEL_CHECKPOINT_SST2}")
print(f"Accuracy on SST-2: {eval_results_sst2['eval_accuracy']:.4f}")
print(f"Training Time: {training_time_sst2 / 60:.2f} minutes")
print("------------------------------")

Loading SST-2 dataset from GLUE benchmark...


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Starting training on SST-2...


Step,Training Loss
100,0.4388
200,0.3245
300,0.2813
400,0.3131
500,0.2969
600,0.2887
700,0.2841
800,0.2811
900,0.285
1000,0.2543



--- SST-2 Experiment Results ---
Model: distilbert-base-uncased
Accuracy on SST-2: 0.9106
Training Time: 38.38 minutes
------------------------------
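
When comparing accuracies across datasets, it helps to keep the evaluation set sizes in mind. The sketch below computes a normal-approximation 95% margin for each reported accuracy. It assumes the SST-2 validation split has 872 examples (a figure from the GLUE benchmark, not printed in this notebook) and uses the 25,000 IMDb test examples reported earlier. The helper name is hypothetical.

```python
import math

def accuracy_margin(accuracy, n, z=1.96):
    """Half-width of a normal-approximation confidence interval for an accuracy."""
    return z * math.sqrt(accuracy * (1 - accuracy) / n)

# (accuracy, number of evaluation examples)
results = {
    "IMDb test": (0.9184, 25_000),
    "SST-2 validation": (0.9106, 872),  # 872 is an assumption, see above
}

margins = {name: accuracy_margin(acc, n) for name, (acc, n) in results.items()}

for name, (acc, n) in results.items():
    print(f"{name} (n={n}): {acc:.2%} +/- {margins[name]:.2%}")
```

The SST-2 margin is several times wider than the IMDb one, so small differences on SST-2 should be read with more caution.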


---
## Contribution 2: Performance vs. a Classical Baseline on IMDb

Our second contribution examines whether the complexity of a Transformer is necessary for the original IMDb task. We compare DistilBERT's result against a simple and fast **Logistic Regression** model trained on TF-IDF features.
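
For readers unfamiliar with TF-IDF, the sketch below computes the weights by hand on three toy sentences. It follows the smoothed formula that scikit-learn's `TfidfVectorizer` uses by default, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization. It handles unigrams only and splits on whitespace, so it is a simplified illustration rather than a replacement for the vectorizer used in the experiment. The documents and function name are made up.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not taken from IMDb)
docs = ["great movie great cast", "terrible movie", "great fun"]

def tfidf(documents):
    """Return per-document TF-IDF weights and the idf table (unigrams only)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

    vectors = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in term_counts.items()}
        # L2-normalize so every document vector has unit length
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors, idf

vectors, idf = tfidf(docs)
for doc, vec in zip(docs, vectors):
    print(doc, "->", {term: round(w, 3) for term, w in vec.items()})
```

Terms that appear in fewer documents (such as "terrible") receive a higher idf than common ones (such as "movie"), which is what lets a linear model pick up on sentiment-bearing words.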

In [None]:
# --- CONTRIBUTION 2: CLASSICAL BASELINE EXPERIMENT ---
# (This entire experiment runs in one cell and will be very fast)

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from datasets import load_dataset
import time
import numpy as np # Make sure numpy is imported

# 1. Load the raw IMDb text data
print("Loading raw IMDb text data...")
imdb_raw = load_dataset("imdb")
train_texts = [example['text'] for example in imdb_raw['train']]
train_labels = [example['label'] for example in imdb_raw['train']]
test_texts = [example['text'] for example in imdb_raw['test']]
test_labels = [example['label'] for example in imdb_raw['test']]

# 2. Create TF-IDF features from the text
print("Creating TF-IDF features... (This may take a minute)")
vectorizer = TfidfVectorizer(max_features=20000, ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)
print("TF-IDF features created.")

# 3. Train the Logistic Regression model
print("Training Logistic Regression model...")
start_time_lr = time.time()
lr_model = LogisticRegression(max_iter=1000, C=1.0, solver='liblinear')
lr_model.fit(X_train, train_labels)
end_time_lr = time.time()
training_time_lr = end_time_lr - start_time_lr
print("Training complete.")

# 4. Evaluate the model's performance
predictions = lr_model.predict(X_test)
accuracy_lr = accuracy_score(test_labels, predictions)

# 5. Print the final results
# Note: training_time_lr covers only lr_model.fit; TF-IDF feature extraction is not included.
print("\n--- Classical Model Results ---")
print("Model: Logistic Regression with TF-IDF")
print(f"Accuracy on IMDb: {accuracy_lr:.4f}")
print(f"Training Time: {training_time_lr:.2f} seconds")

Loading raw IMDb text data...
Creating TF-IDF features... (This may take a minute)
TF-IDF features created.
Training Logistic Regression model...
Training complete.

--- Classical Model Results ---
Model: Logistic Regression with TF-IDF
Accuracy on IMDb: 0.8952
Training Time: 0.56 seconds


## Overall Project Conclusion & Summary Table

This project successfully reproduced the high performance of DistilBERT and extended the original paper's findings through two distinct contributions, yielding a comprehensive analysis of the model's performance, generalization, and efficiency.

The key results from our three experiments are summarized in the table below.

| Experiment | Model | Dataset (evaluation split) | Accuracy | Training Time |
| :--- | :--- | :--- | :--- | :--- |
| **Reproduction** | DistilBERT | IMDb (test) | **91.84%** | **56.80 mins** |
| **Contribution 1** | DistilBERT | SST-2 (validation) | **91.06%** | **38.38 mins** |
| **Contribution 2** | Logistic Regression (TF-IDF) | IMDb (test) | **89.52%** | **0.56 seconds** (model fit only) |

### Key Findings and Analysis

1.  **Successful Reproduction:** Our reproduction experiment confirmed that DistilBERT is a strong model for sentiment analysis, achieving **91.84%** accuracy on the IMDb test set. This result is consistent with the performance reported in the original paper.

2.  **Good Generalization:** Our first contribution tested the model on the SST-2 dataset. It achieved **91.06%** accuracy on the validation split, close to the 91.3% the original paper reports for SST-2. This suggests that the DistilBERT fine-tuning recipe transfers well from long-form reviews to short, single-sentence inputs.

3.  **The Efficiency Trade-Off:** Our second contribution provided a critical perspective. The classical Logistic Regression model was less accurate at **89.52%**, but it trained in **under one second** (excluding TF-IDF feature extraction), compared with almost an hour of GPU fine-tuning for DistilBERT. This highlights the trade-off between the roughly 2.3 percentage points of extra accuracy from a Transformer and the speed and resource savings of a simpler, classical approach.
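
The trade-off can be put in numbers using the figures from the summary table. The sketch below is simple arithmetic on those reported values. The two timings are not strictly comparable: the Logistic Regression time covers only `fit` (not TF-IDF extraction), and DistilBERT ran on a GPU while scikit-learn runs on the CPU.

```python
# Reported results from the summary table
distilbert_minutes = 56.80
logreg_seconds = 0.56
distilbert_acc = 91.84  # percent
logreg_acc = 89.52      # percent

# How many times longer DistilBERT fine-tuning took than the Logistic Regression fit
speedup = distilbert_minutes * 60 / logreg_seconds

# Absolute accuracy gap in percentage points
accuracy_gap = distilbert_acc - logreg_acc

# Share of the baseline's errors that DistilBERT removes
error_reduction = accuracy_gap / (100 - logreg_acc)

print(f"Training time ratio: about {speedup:,.0f}x")
print(f"Accuracy gap: {accuracy_gap:.2f} percentage points")
print(f"Relative error reduction: {error_reduction:.1%}")
```

In other words, DistilBERT removes roughly a fifth of the baseline's errors at several thousand times the training time, subject to the timing caveats above.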

**Final Conclusion:** Our work confirms that DistilBERT is a highly effective model for sentiment classification. However, our comparative analysis shows that for many real-world projects, the choice of model is also a business decision. The Transformer delivered the highest accuracy in our experiments, while the classical baseline remains a strong and far cheaper alternative when development speed and computational cost are the primary concerns.