# ============================================
# Module 9b: Foundations of Training & Transformers
# Lab 5 – Training from Scratch vs Fine-Tuning
# ============================================
**Author:** Dr. Dasha Trofimova

### Learning Goals
- Contrast training from scratch (random initialization) with fine-tuning a pretrained model
- Observe convergence speed & accuracy differences
- Connect results to transfer learning intuitions

---


In [1]:
!pip install datasets transformers torch scikit-learn matplotlib seaborn accelerate --quiet

import numpy as np, torch, torch.nn as nn
from datasets import load_dataset, Dataset
from sklearn.metrics import accuracy_score
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)
import matplotlib.pyplot as plt, seaborn as sns
sns.set(style="whitegrid", context="talk")
torch.manual_seed(42); np.random.seed(42)
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cpu'

In [2]:
# Build a small tri-class dataset from TweetEval
raw = load_dataset("tweet_eval", "sentiment")
label_names = raw["train"].features["label"].names

# Shuffle each split once with a fixed seed, then take a fixed-size subset
train_sub = raw["train"].shuffle(seed=42).select(range(4000))
test_sub  = raw["test"].shuffle(seed=42).select(range(2000))
train_texts, train_labels = list(train_sub["text"]), list(train_sub["label"])
test_texts,  test_labels  = list(test_sub["text"]),  list(test_sub["label"])

num_labels = len(set(train_labels))
num_labels, label_names

(3, ['negative', 'neutral', 'positive'])

In [3]:
# (A) Scratch baseline: TF-IDF + Logistic Regression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

tfidf = TfidfVectorizer(max_features=20_000, ngram_range=(1,2), stop_words="english")
Xtr = tfidf.fit_transform(train_texts)
Xte = tfidf.transform(test_texts)

bow_clf = LogisticRegression(max_iter=1000, n_jobs=-1)
bow_clf.fit(Xtr, train_labels)
bow_preds = bow_clf.predict(Xte)
acc_bow = accuracy_score(test_labels, bow_preds)
acc_bow

0.499
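
An accuracy near 0.5 on a three-class task is easier to judge next to a trivial reference point. The sketch below computes a majority-class baseline, that is, the accuracy of always predicting the most frequent training label. The function name `majority_class_accuracy` and the toy labels are illustrative assumptions, not part of the lab; in the notebook you would pass `train_labels` and `test_labels` instead.

```python
from collections import Counter

def majority_class_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for y in test_labels if y == majority_label)
    return majority_label, hits / len(test_labels)

# Toy labels for illustration only (0=negative, 1=neutral, 2=positive);
# these are NOT the TweetEval label counts.
toy_train = [1, 1, 1, 0, 2, 2, 1, 0]
toy_test = [1, 0, 2, 1, 1, 2]

majority_label, majority_acc = majority_class_accuracy(toy_train, toy_test)
print(f"majority label: {majority_label}, baseline accuracy: {majority_acc:.3f}")
```

If the TF-IDF model only barely beats this number on the real labels, most of its accuracy may come from the class imbalance rather than from the text features.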

In [None]:
# (B) Fine-tune DistilBERT on same data
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tok_fn(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train_ds = Dataset.from_dict({"text": train_texts, "label": train_labels})
test_ds  = Dataset.from_dict({"text": test_texts,  "label": test_labels})

tok_train = train_ds.map(tok_fn, batched=True).remove_columns(["text"]).with_format("torch")
tok_test  = test_ds.map(tok_fn, batched=True).remove_columns(["text"]).with_format("torch")

model_ft = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=num_labels
).to(device)

args = TrainingArguments(
    output_dir="./distilbert-tweeteval",
    eval_strategy="epoch",
    save_strategy="no",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    weight_decay=0.01,
    logging_steps=50,
)

def compute_metrics(eval_pred):
    import numpy as np
    from sklearn.metrics import accuracy_score
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, preds)}

trainer = Trainer(
    model=model_ft,
    args=args,
    train_dataset=tok_train,
    eval_dataset=tok_test,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

train_out = trainer.train()
eval_out  = trainer.evaluate()
acc_ft = eval_out["eval_accuracy"]; acc_ft

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.72 | 0.729613 | 0.6815 |
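
The scratch baseline in this lab is a TF-IDF model, not a randomly initialized Transformer. For a like-for-like comparison, one option is to build the same architecture from a config object, so that no pretrained checkpoint is loaded. The following is a minimal sketch under that assumption: the helper `build_random_init_distilbert` is a hypothetical name, and the tiny layer sizes are chosen only so the example runs quickly on CPU.

```python
import torch
from transformers import DistilBertConfig, DistilBertForSequenceClassification

def build_random_init_distilbert(num_labels=3, **config_overrides):
    """Build a DistilBERT classifier from a config only (weights are randomly initialized)."""
    config = DistilBertConfig(num_labels=num_labels, **config_overrides)
    return DistilBertForSequenceClassification(config)

torch.manual_seed(42)

# Illustrative tiny sizes (assumption); omit the overrides to get the
# default DistilBERT-base shape with random weights.
scratch_model = build_random_init_distilbert(
    num_labels=3, n_layers=2, dim=64, hidden_dim=128, n_heads=2
)

# Forward pass on random token ids to confirm the output shape.
dummy_ids = torch.randint(0, scratch_model.config.vocab_size, (4, 16))
with torch.no_grad():
    logits = scratch_model(input_ids=dummy_ids).logits

n_params = sum(p.numel() for p in scratch_model.parameters())
print(tuple(logits.shape), n_params)
```

A model built this way could be passed to `Trainer` in place of `model_ft`, with the same tokenizer and arguments, to see how far random initialization gets on 4,000 examples.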




In [None]:
# Compare accuracies
plt.figure(figsize=(6,4))
sns.barplot(x=["Scratch (BoW+LR)","Fine-tuned DistilBERT"], y=[acc_bow, acc_ft],
            hue=["Scratch","Fine-tuned"], palette=["#9ecae1","#fd8d3c"], legend=False)
plt.ylim(0,1); plt.title("Accuracy: Scratch vs Fine-Tuning (TweetEval Sentiment)")
for i,v in enumerate([acc_bow, acc_ft]):
    plt.text(i, v+0.02, f"{v:.2f}", ha="center", fontsize=12)
plt.ylabel("Accuracy"); plt.tight_layout(); plt.show()

print(f"Scratch (BoW+LR) accuracy:   {acc_bow:.3f}")
print(f"Fine-tuned DistilBERT accuracy: {acc_ft:.3f}")
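
With 2,000 test examples, it is worth checking whether the gap between the two accuracies is larger than sampling noise. The sketch below uses a normal-approximation (Wald) interval for a proportion, which is a rough choice rather than the only one. It plugs in the two accuracies visible in the outputs above (0.499 for the baseline and the epoch-1 value 0.6815 for DistilBERT); the helper name `accuracy_interval` is hypothetical.

```python
import math

def accuracy_interval(acc, n, z=1.96):
    """Approximate 95% interval for an accuracy measured on n examples (Wald interval)."""
    se = math.sqrt(acc * (1.0 - acc) / n)  # binomial standard error
    return acc - z * se, acc + z * se

n_test = 2000  # size of the test subset used in this lab

bow_lo, bow_hi = accuracy_interval(0.499, n_test)
ft_lo, ft_hi = accuracy_interval(0.6815, n_test)

# If the intervals do not overlap, the gap is unlikely to be sampling noise alone.
intervals_overlap = bow_hi >= ft_lo

print(f"BoW+LR:     [{bow_lo:.3f}, {bow_hi:.3f}]")
print(f"DistilBERT: [{ft_lo:.3f}, {ft_hi:.3f}]")
print("Intervals overlap:", intervals_overlap)
```

This treats the two models as if they were evaluated on independent samples; since both use the same test set, a paired test such as McNemar's would be a more precise follow-up.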

### 🎯 Quick Card Quiz: Scratch vs Fine-Tuning

**Color legend:**  
- **Blue = Train from scratch (BoW/MLP or random init)**  
- **Orange = Fine-tuned pretrained checkpoint**  
- **Green = "It depends" (data too small/large)**

1) Which approach typically reaches good accuracy faster on small datasets?
2) Which approach is more sensitive to having very little labeled data?
3) After just 2 epochs here, which achieved higher test accuracy?