# Comprehensive Research: Fine-Tuning LLMs (BERT)

## 1. Concept
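**Objective**: Adapt a general-purpose pre-trained language model to a specific task (here, binary sentiment classification).
**Risk**: Catastrophic forgetting, where overly large updates overwrite the pre-trained knowledge.
**Control**: A small learning rate, a warmup scheduler, and weight decay.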
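
To make the warmup control concrete, here is a minimal sketch of a linear warmup followed by linear decay, written in plain Python. The function name `linear_warmup_decay` is hypothetical, and the values (75 total steps, 8 warmup steps, peak rate 2e-5) are illustrative assumptions roughly matching 200 samples, batch size 8, 3 epochs, and a 10% warmup. It mirrors the general shape of such a schedule, not any library's exact implementation.

```python
def linear_warmup_decay(step, total_steps, warmup_steps, peak_lr):
    """Learning rate at `step`: linear ramp up to peak_lr, then linear decay to 0."""
    if step < warmup_steps:
        # Warmup phase: the rate grows proportionally to the step count.
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: the rate shrinks linearly with the steps that remain.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Illustrative values (assumptions): 75 steps in total, 8 of them warmup.
schedule = [linear_warmup_decay(s, 75, 8, 2e-5) for s in range(76)]
peak_step = schedule.index(max(schedule))
print(f"LR at step 0: {schedule[0]}, peak at step {peak_step}, final LR: {schedule[-1]}")
```

The early steps use a tiny learning rate, so the randomly initialized classification head cannot push large, noisy gradients into the pre-trained layers before it has started to settle.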


In [None]:
import torch
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# --- 1. MOCK DATASET ---
# Simulating the IMDB dataset structure
texts = ["This movie was fantastic!", "Horrible, waste of time.", "Great acting but bad plot.", "I loved it."] * 50
labels = [1, 0, 0, 1] * 50 # 200 samples

df = pd.DataFrame({"text": texts, "label": labels})
raw_ds = Dataset.from_pandas(df)

print("Top 5 Samples:")
print(df.head())

## 2. Tokenization Analysis
BERT does not read whole words; it reads sub-word units (WordPiece tokens). Let's see how our data gets split.
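
The sub-word idea can be illustrated with a toy greedy longest-match-first splitter in the style of WordPiece. This is a sketch under stated assumptions: the function `wordpiece` and the tiny `toy_vocab` are hypothetical, and the real tokenizer also handles lowercasing, punctuation, and a vocabulary of roughly 30k entries.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedily split one lowercase word into the longest matching vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # "##" marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"movie", "fantastic", "un", "##watch", "##able"}
tokens = wordpiece("unwatchable", toy_vocab)
print(tokens)
```

A word that is missing from the vocabulary as a whole can still be represented by known pieces, which is why the model rarely has to fall back to `[UNK]`.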

In [None]:
model_ckpt = "distilbert-base-uncased"
tokenizer = DistilBertTokenizer.from_pretrained(model_ckpt)

def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True, max_length=128)

tokenized_ds = raw_ds.map(tokenize, batched=True)

# Inspect Tokens
sample_ids = tokenized_ds[0]['input_ids']
print(f"Raw IDs: {sample_ids}")
print(f"Decoded: {tokenizer.decode(sample_ids)}")

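**Observation**: The tokenizer adds `[CLS]` at the start and `[SEP]` at the end of every sequence. BERT-style models expect both, and the classification head reads the `[CLS]` position. Because we pad to the longest text in the batch, shorter samples also end with `[PAD]` tokens.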
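
As a rough illustration of what the tokenizer assembles, the sketch below wraps a list of token IDs with the special tokens, pads it to a fixed length, and builds the matching attention mask. The helper `build_inputs` is hypothetical, the content IDs `7, 8, 9` are placeholders, and the special-token IDs are assumed for illustration; on a real tokenizer, `cls_token_id`, `sep_token_id`, and `pad_token_id` are the authoritative values.

```python
# Assumed special-token IDs, for illustration only.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def build_inputs(token_ids, max_length):
    """Wrap token IDs in [CLS] ... [SEP], truncate, pad, and build the attention mask."""
    body = token_ids[: max_length - 2]        # leave room for the two special tokens
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)     # 1 = real token the model should attend to
    pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * pad,
        "attention_mask": attention_mask + [0] * pad,  # 0 = padding, ignored by attention
    }


encoded = build_inputs([7, 8, 9], max_length=8)
print(encoded)
```

The attention mask is what lets the model ignore the padded positions, so padding changes the tensor shape without changing what the model attends to.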

In [None]:
# 3. Model Initialization
model = DistilBertForSequenceClassification.from_pretrained(model_ckpt, num_labels=2)

# 4. Training Arguments (The Research Part)
# We strictly control the learning rate to avoid destroying the pre-trained weights.
args = TrainingArguments(
    output_dir="bert-finetuned",
    num_train_epochs=3,
    learning_rate=2e-5, # Small on purpose: 50x below 1e-3, a common default when training from scratch
    per_device_train_batch_size=8,
    weight_decay=0.01, # Regularization: shrinks weights slightly at every update
    lr_scheduler_type='linear', # Linear warmup, then linear decay to 0
    warmup_ratio=0.1, # First 10% of steps ramp the LR up from 0 to its peak
    logging_steps=10
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_ds,
)

print("Starting Fine-Tuning...")
trainer.train()

## 5. Learning Curve Diagnostics
We read the Trainer's log history to plot the training loss against optimization steps.
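
With `logging_steps=10` the raw curve has only a handful of points and can look jagged, so a light smoothing pass may help when reading it. The sketch below uses made-up entries in `mock_logs` that imitate the general shape of a log history (the exact keys and values are assumptions, including a final summary entry without a `loss` key); `extract_losses` and `ema` are hypothetical helpers.

```python
def extract_losses(log_history):
    """Return (steps, losses) from the entries that actually carry a 'loss' key."""
    pairs = [(entry["step"], entry["loss"]) for entry in log_history if "loss" in entry]
    steps = [step for step, _ in pairs]
    losses = [loss for _, loss in pairs]
    return steps, losses


def ema(values, alpha=0.3):
    """Exponential moving average; a higher alpha follows the raw values more closely."""
    smoothed = []
    for value in values:
        previous = smoothed[-1] if smoothed else value
        smoothed.append(alpha * value + (1 - alpha) * previous)
    return smoothed


# Made-up entries for illustration; real logs come from trainer.state.log_history.
mock_logs = [
    {"loss": 0.69, "learning_rate": 2e-5, "step": 10},
    {"loss": 0.52, "learning_rate": 1.6e-5, "step": 20},
    {"loss": 0.31, "learning_rate": 1.3e-5, "step": 30},
    {"train_runtime": 12.3, "train_loss": 0.51, "step": 30},  # summary entry, no 'loss'
]

mock_steps, mock_losses = extract_losses(mock_logs)
mock_smoothed = ema(mock_losses)
print(mock_steps, mock_losses, mock_smoothed)
```

Filtering on the `loss` key in a single pass keeps steps and losses aligned even when the history contains entries of other kinds.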

In [None]:
logs = trainer.state.log_history
loss = [x['loss'] for x in logs if 'loss' in x]
steps = [x['step'] for x in logs if 'loss' in x]

plt.figure(figsize=(10, 5))
plt.plot(steps, loss, label="Training Loss")
plt.title("Fine-Tuning Dynamics")
plt.xlabel("Steps")
plt.ylabel("Loss")
plt.grid()
plt.legend()
plt.show()