#🧪 Practical: Transfer Learning with BERT, GPT-2 & Custom LLMs (Text Classification + Generation)

#🎯 Objectives
By the end of this practical, you will:

- Understand the difference between encoder-based (BERT) and decoder-based (GPT-2) models
- Fine-tune BERT for text classification
- Fine-tune GPT-2 for text generation
- Apply transfer learning with HuggingFace transformers
- Compare use-cases and outputs

#🛠️ Tools Used
| Tool            | Purpose                      |
| --------------- | ---------------------------- |
| 🤗 Transformers | Pretrained models & training |
| 🤗 Datasets     | Dataset handling             |
| `Trainer`       | Fine-tuning pipelines        |
| Google Colab    | Free GPU training            |

#✅ Part A: Transfer Learning with BERT (Text Classification)

#🔧 Step 1A: Install Requirements

In [1]:
!pip install transformers datasets evaluate


Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.3


#📦 Step 2A: Prepare Dataset
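We’ll build a tiny, hand-labelled sentiment dataset with three classes (positive, negative, neutral). Six examples are enough to exercise the pipeline end to end, but far too few to train a useful classifier.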

In [2]:
import pandas as pd
from datasets import Dataset

data = {
    "text": [
        "I loved the product!", "Worst service ever.", "It was okay.",
        "Amazing support team!", "Never buying again.", "Satisfactory"
    ],
    "label": [0, 1, 2, 0, 1, 2]  # 0=Positive, 1=Negative, 2=Neutral
}

df = pd.DataFrame(data)
dataset = Dataset.from_pandas(df).train_test_split(test_size=0.3)
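
The integer labels are easier to read back if the mapping is kept next to the data. The sketch below is an illustration only: `id2label`, `label2id`, and `expected_split_sizes` are hypothetical names that do not appear in the notebook, and the rounding rule (test examples rounded up, the remainder used for training) is an assumption about how `train_test_split` handles a fractional `test_size`.

```python
import math

# Label ids as used in the dataset above (0=Positive, 1=Negative, 2=Neutral).
id2label = {0: "Positive", 1: "Negative", 2: "Neutral"}
label2id = {name: idx for idx, name in id2label.items()}

def expected_split_sizes(n_samples, test_size):
    """Assumed rounding: test examples are rounded up, the rest go to train."""
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test

n_train, n_test = expected_split_sizes(6, 0.3)
print(n_train, n_test)  # 4 2
```

Under that assumption the split leaves four training examples and two test examples. The two dictionaries could also be passed as `id2label=` and `label2id=` when the classification model is loaded, so that predictions come back with readable names.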


#🔠 Step 3A: Load Tokenizer and Tokenize Dataset

In [3]:
from transformers import AutoTokenizer

bert_model = "bert-base-uncased"
tokenizer_bert = AutoTokenizer.from_pretrained(bert_model)

def tokenize(batch):
    return tokenizer_bert(batch["text"], truncation=True, padding=True)

tokenized = dataset.map(tokenize, batched=True)
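
To make `padding=True` and `truncation=True` less opaque, here is a minimal sketch of what padding a batch to a common length looks like. It does not call the real tokenizer: `pad_batch` is a hypothetical helper, and the token ids are made up apart from 101 and 102, which mimic BERT's `[CLS]` and `[SEP]` ids.

```python
def pad_batch(token_id_lists, pad_id=0, max_length=None):
    """Pad (and optionally truncate) a batch of token-id lists to a common length."""
    if max_length is not None:
        token_id_lists = [ids[:max_length] for ids in token_id_lists]
    target = max(len(ids) for ids in token_id_lists)
    input_ids, attention_mask = [], []
    for ids in token_id_lists:
        n_pad = target - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(batch["input_ids"])       # [[101, 7, 8, 102], [101, 9, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

The attention mask is what lets the model skip padded positions, which is why the tokenizer returns it alongside `input_ids`.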



#🤖 Step 4A: Load Pretrained BERT + Train for Classification

In [4]:
from transformers import AutoModelForSequenceClassification

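# Load the pretrained encoder and attach a new, randomly initialized 3-class head.
# The checkpoint name is passed literally because `bert_model` is rebound to the model here.
bert_model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)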



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


#🏁 Step 5A: Train BERT with HuggingFace Trainer

In [9]:
from transformers import TrainingArguments, Trainer, AutoTokenizer
from sklearn.metrics import accuracy_score
import numpy as np

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=1)
    return {"accuracy": accuracy_score(labels, preds)}

args = TrainingArguments(
    output_dir="./bert_cls",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    eval_strategy="epoch",
    save_strategy="no",
    report_to="none",
    fp16=True  # mixed precision; needs a CUDA GPU (e.g. a Colab GPU runtime)
)

trainer = Trainer(
    model=bert_model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    processing_class=tokenizer_bert,  # pass the tokenizer instance, not the AutoTokenizer class
    compute_metrics=compute_metrics
)

trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
| ----- | ------------- | --------------- | -------- |
| 1     | No log        | 0.98865         | 0.5      |
| 2     | No log        | 0.979096        | 0.5      |
| 3     | No log        | 0.983051        | 0.5      |


TrainOutput(global_step=3, training_loss=0.7207605044047037, metrics={'train_runtime': 10.6009, 'train_samples_per_second': 1.132, 'train_steps_per_second': 0.283, 'total_flos': 43167045096.0, 'train_loss': 0.7207605044047037, 'epoch': 3.0})
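
One way to read the validation loss above is to compare it with the loss of a model that assigns equal probability to all three classes. The sketch below computes that baseline with plain Python; `softmax` and `cross_entropy` are hypothetical helpers written for this illustration, not functions from the notebook.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[label])

# Equal logits mean a uniform guess over the three classes.
uniform_loss = cross_entropy([0.0, 0.0, 0.0], label=1)
print(round(uniform_loss, 4))  # 1.0986, i.e. ln(3)
```

Validation losses around 0.98 sit only slightly below this value, which suggests the classifier has learned very little from such a small training set.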

#📊 Step 6A: Evaluate BERT Classification

In [11]:
eval_results = trainer.evaluate()
print("📊 BERT Classification Accuracy:", eval_results["eval_accuracy"])


📊 BERT Classification Accuracy: 0.5
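
The accuracy figure comes from taking the arg-max of each row of logits and comparing it with the true label, as `compute_metrics` does. A minimal sketch of that calculation without NumPy or scikit-learn follows; the logits are hypothetical values chosen for illustration, not the notebook's real outputs.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    preds = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)

# Hypothetical logits for a two-example test set.
toy_logits = [[2.0, 0.1, -1.0], [0.3, 0.2, 0.9]]
toy_labels = [0, 1]
print(accuracy_from_logits(toy_logits, toy_labels))  # 0.5
```

With only two test examples, accuracy can take just three values (0.0, 0.5, or 1.0), so the reported 0.5 says little about how well the model generalizes.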


#✅ Part B: Transfer Learning with GPT-2 (Text Generation)
#📖 Step 1B: Load GPT-2 and Tokenizer

In [12]:
from transformers import AutoModelForCausalLM

gpt_model_name = "gpt2"
tokenizer_gpt = AutoTokenizer.from_pretrained(gpt_model_name)
gpt2_model = AutoModelForCausalLM.from_pretrained(gpt_model_name)



#📝 Step 2B: Prepare GPT-2 Training Dataset (Short Prompt-to-Text)

In [13]:
generation_data = {
    "text": [
        "Once upon a time in a distant land,",
        "The future of AI is",
        "In a quiet village nestled in the mountains,"
    ]
}
gpt_df = pd.DataFrame(generation_data)
gpt_dataset = Dataset.from_pandas(gpt_df)
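
Each of these texts is used for next-token prediction: at every position the model sees the tokens so far and is trained to predict the one that follows. The sketch below spells out those (context, target) pairs. It is a simplification: `next_token_pairs` is a hypothetical helper, and a whitespace split stands in for GPT-2's byte-pair tokenizer.

```python
def next_token_pairs(tokens):
    """Return (context, target) pairs: each prefix predicts the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Whitespace tokens are an assumption made for readability.
tokens = "The future of AI is".split()
pairs = next_token_pairs(tokens)
for context, target in pairs:
    print(context, "->", target)
```

A five-token text therefore yields four training targets, all computed in one forward pass.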


#🧠 Step 3B: Tokenize for GPT-2

In [15]:
# GPT-2 has no pad token, so reuse the end-of-sequence token for padding (set once, up front).
tokenizer_gpt.pad_token = tokenizer_gpt.eos_token

def gpt_tokenize(batch):
    return tokenizer_gpt(batch["text"], padding="max_length", truncation=True, max_length=64)

gpt_tokenized = gpt_dataset.map(gpt_tokenize, batched=True)
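
Padding every prompt to 64 tokens means most positions hold the pad token. When `input_ids` are reused as labels in Step 4B, those padded positions contribute to the loss as well. A common refinement, which this notebook does not apply, is to set the label to -100 at padded positions so the loss skips them. The sketch below shows the idea on plain lists; `mask_padding` and the example ids are hypothetical.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def mask_padding(input_ids, attention_mask):
    """Copy input_ids to labels, replacing padded positions with IGNORE_INDEX."""
    return [tok if keep == 1 else IGNORE_INDEX
            for tok, keep in zip(input_ids, attention_mask)]

# Made-up ids; 50256 is GPT-2's end-of-text id, used here as the pad token.
example_ids = [818, 257, 5897, 50256, 50256]
example_mask = [1, 1, 1, 0, 0]
masked_labels = mask_padding(example_ids, example_mask)
print(masked_labels)  # [818, 257, 5897, -100, -100]
```

In the notebook this could replace the plain copy of `input_ids`, for example by returning `{"labels": mask_padding(ex["input_ids"], ex["attention_mask"])}` from the per-example `map` call. Doing so would change the reported training loss.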


#⚙️ Step 4B: Fine-tune GPT-2

In [18]:
from transformers import TrainingArguments, Trainer, AutoTokenizer

gen_args = TrainingArguments(
    output_dir="./gpt2_gen",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    save_strategy="no",
    logging_strategy="no",
    fp16=True
)

# Add labels to the dataset for causal language modeling
gpt_tokenized = gpt_tokenized.map(lambda examples: {"labels": examples["input_ids"]})

trainer_gpt = Trainer(
    model=gpt2_model,
    args=gen_args,
    train_dataset=gpt_tokenized,
    processing_class=tokenizer_gpt,  # pass the tokenizer instance, not the AutoTokenizer class
)

trainer_gpt.train()



TrainOutput(global_step=6, training_loss=1.0108715693155925, metrics={'train_runtime': 23.8033, 'train_samples_per_second': 0.378, 'train_steps_per_second': 0.252, 'total_flos': 293953536000.0, 'train_loss': 1.0108715693155925, 'epoch': 3.0})
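
The `global_step` values in the two `TrainOutput` lines can be checked by hand: the number of optimizer steps is the number of batches per epoch times the number of epochs. The sketch below assumes a single device and no gradient accumulation; `total_steps` is a hypothetical helper.

```python
import math

def total_steps(n_examples, batch_size, epochs):
    """Optimizer steps on a single device with no gradient accumulation."""
    return math.ceil(n_examples / batch_size) * epochs

gpt2_steps = total_steps(3, 2, 3)  # 3 prompts, batch size 2, 3 epochs
bert_steps = total_steps(4, 4, 3)  # assumes a 4-example training split
print(gpt2_steps, bert_steps)  # 6 3
```

Both numbers match the logged `global_step` values, which also shows how little training actually took place: six updates for GPT-2 and three for BERT.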

#🧪 Step 5B: Generate with GPT-2

In [19]:
input_prompt = "In a future where machines"
# Keep the attention mask and place the inputs on the same device as the model
inputs = tokenizer_gpt(input_prompt, return_tensors="pt").to(gpt2_model.device)

# GPT-2 has no dedicated pad token, so the end-of-text token is passed as pad_token_id
gen_output = gpt2_model.generate(**inputs, max_length=50, do_sample=True, temperature=0.8,
                                 pad_token_id=tokenizer_gpt.eos_token_id)
print("📝 GPT-2 Output:\n", tokenizer_gpt.decode(gen_output[0], skip_special_tokens=True))


📝 GPT-2 Output:
 In a future where machines are used to take a turn, this will be something for the future
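
The `temperature=0.8` setting divides the logits by 0.8 before sampling, which sharpens the distribution slightly compared with the default of 1.0. The sketch below illustrates the effect on three hypothetical token scores; `softmax_with_temperature` is a helper written for this example, not part of the Transformers API.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by the temperature (numerically stable)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
p_default = softmax_with_temperature(toy_logits, 1.0)
p_cool = softmax_with_temperature(toy_logits, 0.8)
print([round(p, 3) for p in p_default])
print([round(p, 3) for p in p_cool])
```

At the lower temperature the most likely token receives a larger share of the probability mass, so samples become less varied. Temperatures above 1.0 have the opposite effect.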


#📊 Summary Table

| Model | Type    | Task            | Use Case               |
| ----- | ------- | --------------- | ---------------------- |
| BERT  | Encoder | Classification  | Sentiment, intent      |
| GPT-2 | Decoder | Text Generation | Content creation, chat |
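
The encoder/decoder distinction in the table comes down to which positions each token may attend to. The sketch below builds the two attention patterns as plain 0/1 matrices. It illustrates the concept only and is not how the library constructs its masks; both function names are hypothetical.

```python
def bidirectional_mask(n):
    """Encoder-style (BERT): every position may attend to every position."""
    return [[1] * n for _ in range(n)]

def causal_mask(n):
    """Decoder-style (GPT-2): position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

for row in bidirectional_mask(3):
    print(row)
# [1, 1, 1]
# [1, 1, 1]
# [1, 1, 1]

for row in causal_mask(3):
    print(row)
# [1, 0, 0]
# [1, 1, 0]
# [1, 1, 1]
```

Seeing the whole sentence at once suits classification, while the left-to-right constraint is what lets a decoder generate text one token at a time.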
