#🧪 Practical: Fine-Tune TinyBERT using HuggingFace Trainer

#🎯 Objective
Learn how to:

- Load a custom text classification dataset
- Tokenize using the TinyBERT tokenizer
- Fine-tune TinyBERT with HuggingFace’s Trainer class
- Evaluate accuracy and save your model

✅ This practical runs on a free-tier Colab GPU with HuggingFace Transformers and is built for lightweight fine-tuning.

#🛠 Tools Used

| Tool           | Purpose                         |
| -------------- | ------------------------------- |
| `transformers` | Model + training pipeline       |
| `datasets`     | Dataset loading/processing      |
| `pandas`       | CSV loading                     |
| `evaluate`     | Accuracy metric                 |
| `sklearn`      | Backend for the accuracy metric |
| Google Colab   | Free-tier GPU runtime           |


#✅ Step-by-Step Guide
#🔧 Step 1: Install Dependencies

In [1]:
!pip install transformers datasets evaluate scikit-learn

Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.3


#📄 Step 2: Use a Sample Dataset (Quick Start)

In [3]:
import pandas as pd

data = {
    "text": [
        "I love this product! It's amazing 😍",
        "Horrible experience. Will never buy again!!",
        "Delivery was on time. Packaging was good.",
        "Customer support didn’t help me. Waste of money.",
        "Wow, absolutely loved it! <3",
        "Meh. It was okay I guess...",
        "Terrible. Broke after one day.",
        "Super fast shipping, very happy!",
        "Why does this even exist?? useless",
        "The quality is top-notch. Highly recommend."
    ],
    "label_final": [
        "Positive", "Negative", "Neutral", "Negative", "Positive",
        "Neutral", "Negative", "Positive", "Negative", "Positive"
    ]
}

df = pd.DataFrame(data)
df.to_csv("clean_labeled_dataset.csv", index=False)
print("✅ Sample dataset created and saved.")


✅ Sample dataset created and saved.


#🧠 Step 3: Load the Saved Dataset

In [6]:
df = pd.read_csv("clean_labeled_dataset.csv")
df.head()

|   | text | label_final |
| - | ---- | ----------- |
| 0 | I love this product! It's amazing 😍 | Positive |
| 1 | Horrible experience. Will never buy again!! | Negative |
| 2 | Delivery was on time. Packaging was good. | Neutral |
| 3 | Customer support didn’t help me. Waste of money. | Negative |
| 4 | Wow, absolutely loved it! <3 | Positive |


#📦 Step 4: Encode Labels

In [8]:
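# Map each sentiment name to an integer class id.
label_map = {"Positive": 0, "Negative": 1, "Neutral": 2}
df["label"] = df["label_final"].map(label_map)

# Labels missing from label_map become NaN; drop those rows and store the ids as integers.
# The columns are already named "text" and "label", the names used in the rest of this
# practical, so no rename is needed.
df = df[["text", "label"]].dropna().astype({"label": int})
df.head()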


|   | text | label |
| - | ---- | ----- |
| 0 | I love this product! It's amazing 😍 | 0 |
| 1 | Horrible experience. Will never buy again!! | 1 |
| 2 | Delivery was on time. Packaging was good. | 2 |
| 3 | Customer support didn’t help me. Waste of money. | 1 |
| 4 | Wow, absolutely loved it! <3 | 0 |
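
The mapping above goes from label names to integer ids. A minimal sketch of deriving the reverse mapping and of spotting labels the map would miss (pandas' `.map` turns those into NaN, and `dropna()` then silently discards the rows). The names `id2label`, `raw_labels`, and `unmapped` are illustrative and not part of the original notebook:

```python
# Label-to-id mapping used in this practical.
label_map = {"Positive": 0, "Negative": 1, "Neutral": 2}

# Reverse mapping, handy for turning predicted ids back into names.
id2label = {idx: name for name, idx in label_map.items()}

# Hypothetical raw labels, including a lowercase variant the map does not cover.
raw_labels = ["Positive", "Negative", "neutral", "Neutral"]

# Labels absent from the map would become NaN under pandas' .map().
unmapped = sorted(set(raw_labels) - set(label_map))

print(id2label)   # {0: 'Positive', 1: 'Negative', 2: 'Neutral'}
print(unmapped)   # ['neutral']
```

Both dictionaries can also be passed to `AutoModelForSequenceClassification.from_pretrained` through its `id2label` and `label2id` arguments, so the saved config carries readable label names.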


#🧠 Step 5: Convert to HuggingFace Dataset Format

In [9]:
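from datasets import Dataset

# preserve_index=False keeps the pandas index from becoming an extra column.
dataset = Dataset.from_pandas(df, preserve_index=False)
# Hold out 20% of the rows for evaluation; the fixed seed makes the split reproducible.
dataset = dataset.train_test_split(test_size=0.2, seed=42)

dataset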


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2
    })
})
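
With ten rows and a 20% hold-out, the test split contains only two examples, so it is worth checking which classes actually land in each split. The sketch below imitates a random 80/20 split in plain Python rather than calling `datasets`; the seed and variable names are arbitrary choices for illustration:

```python
import random
from collections import Counter

# Encoded labels of the ten sample reviews (0=Positive, 1=Negative, 2=Neutral).
labels = [0, 1, 2, 1, 0, 2, 1, 0, 1, 0]

# Plain-Python stand-in for a random 80/20 split; the seed value is arbitrary.
indices = list(range(len(labels)))
random.Random(42).shuffle(indices)
test_idx, train_idx = indices[:2], indices[2:]

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)

print("train:", dict(train_counts))
print("test: ", dict(test_counts))

# Classes that never appear in the test split cannot be evaluated at all.
missing_in_test = sorted(set(labels) - set(test_counts))
print("classes missing from test:", missing_in_test)
```

Because two test rows cannot cover three classes, at least one class is always missing from the test split here. Treat the accuracy in this quick start as a smoke test of the pipeline, not as a meaningful score.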

#🧪 Step 6: Load Tokenizer & Tokenize Data

In [10]:
from transformers import AutoTokenizer

# bert-tiny: a compact BERT checkpoint (2 layers, hidden size 128), used here as the TinyBERT model for fast training
model_checkpoint = "prajjwal1/bert-tiny"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

def tokenize_fn(batch):
    # This checkpoint's tokenizer defines no maximum length, so set one explicitly;
    # otherwise truncation=True has no effect.
    # Padding is left to the Trainer's data collator, which pads each batch dynamically.
    return tokenizer(batch["text"], truncation=True, max_length=128)

# batched=True tokenizes many rows per call instead of one row at a time.
tokenized_dataset = dataset.map(tokenize_fn, batched=True)
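
Tokenized texts have different lengths, so they must be truncated and padded to a common length before they can be stacked into one batch. The sketch below imitates that step in plain Python with made-up token ids. `pad_batch` is a hypothetical helper, not part of `transformers`, and its truncation is simplified (a real BERT tokenizer keeps the closing special token when it truncates):

```python
def pad_batch(sequences, pad_id=0, max_length=None):
    """Truncate to max_length (if given), then pad to the longest sequence."""
    if max_length is not None:
        sequences = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids for three texts of different lengths.
batch = pad_batch([[101, 7, 8, 102], [101, 9, 102], [101, 5, 6, 4, 3, 102]], max_length=5)
print(batch["input_ids"])       # [[101, 7, 8, 102, 0], [101, 9, 102, 0, 0], [101, 5, 6, 4, 3]]
print(batch["attention_mask"])  # [[1, 1, 1, 1, 0], [1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Padding to the longest sequence in each batch, instead of a fixed global length, keeps batches small when most texts are short.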

#🧪 Step 7: Load Model & Define Training Arguments

In [12]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=3)

training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",  # called `evaluation_strategy` in older transformers releases
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=4,
    weight_decay=0.01,
    save_strategy="epoch",
    load_best_model_at_end=True,  # requires save_strategy to match eval_strategy
    logging_dir="./logs",
    logging_steps=10,
    report_to="none",  # disable Weights & Biases and other external loggers
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
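
How often the Trainer logs the training loss depends on the number of optimizer steps, which follows from the dataset size and the batch size. A back-of-the-envelope check for the settings above, assuming a single device and no gradient accumulation:

```python
import math

num_train_examples = 8   # rows in the train split
batch_size = 16          # per_device_train_batch_size (single device assumed)
num_epochs = 4           # num_train_epochs
logging_steps = 10

steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)   # 1 4
print(total_steps >= logging_steps)   # False
```

With only 4 optimizer steps in total, `logging_steps=10` is never reached, so the Trainer reports "No log" for the training loss. A smaller value such as `logging_steps=1` would log the loss at every step.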


#🚀 Step 8: Define Evaluation Metric

In [14]:
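import evaluate
import numpy as np

# The accuracy metric from `evaluate` relies on scikit-learn under the hood.
accuracy_metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    # eval_pred holds the model's raw logits and the true label ids.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)  # predicted class = index of the largest logit
    return accuracy_metric.compute(predictions=preds, references=labels)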
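
To make the metric concrete, here is a small worked example of the same argmax-then-compare logic, written without NumPy or `evaluate` so it runs anywhere. The logits and labels are made up for illustration:

```python
# Hypothetical logits for two examples over three classes.
logits = [[0.2, 1.5, -0.3],   # highest score at index 1
          [0.9, 0.1, 0.4]]    # highest score at index 0
labels = [1, 2]

# argmax over the class dimension, written in plain Python for clarity.
preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)

print(preds)      # [1, 0]
print(accuracy)   # 0.5
```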


#📈 Step 9: Train with HuggingFace Trainer

In [15]:
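trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,  # newer transformers releases name this argument `processing_class`
    compute_metrics=compute_metrics,
)

# Runs training and evaluates on the test split at the end of each epoch.
trainer.train()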




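| Epoch | Training Loss | Validation Loss | Accuracy |
| ----- | ------------- | --------------- | -------- |
| 1     | No log        | 1.08666         | 0.5      |
| 2     | No log        | 1.0864          | 0.5      |
| 3     | No log        | 1.085273        | 0.5      |
| 4     | No log        | 1.084676        | 0.5      |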


TrainOutput(global_step=4, training_loss=1.1176750659942627, metrics={'train_runtime': 266.5832, 'train_samples_per_second': 0.12, 'train_steps_per_second': 0.015, 'total_flos': 1112022912.0, 'train_loss': 1.1176750659942627, 'epoch': 4.0})
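
A useful sanity check for a classifier's loss is the cross-entropy of a uniform guess, which for C classes equals ln(C). A quick calculation for the three classes used here:

```python
import math

num_classes = 3
# Cross-entropy of predicting probability 1/3 for every class.
chance_loss = -math.log(1 / num_classes)
print(round(chance_loss, 4))  # 1.0986
```

The validation losses reported above (about 1.085) sit just below this value, and accuracy stays at 0.5 on two test rows. The model has barely moved from chance after four optimizer steps on eight examples, so a larger dataset and more training steps are needed before the numbers mean much.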

#💾 Step 10: Evaluate, Save Locally, and Optionally Upload to HuggingFace Hub

In [16]:
results = trainer.evaluate()
print("✅ Final Evaluation Metrics:\n", results)


✅ Final Evaluation Metrics:
 {'eval_loss': 1.0846757888793945, 'eval_accuracy': 0.5, 'eval_runtime': 0.0142, 'eval_samples_per_second': 140.68, 'eval_steps_per_second': 70.34, 'epoch': 4.0}


In [17]:
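# Saves the model weights and config; the tokenizer passed to the Trainer is saved alongside.
trainer.save_model("finetuned-tinybert-sentiment")

# Optional: upload to the Hub (requires a HuggingFace account and a write token).
# Note: the first argument of trainer.push_to_hub() is a commit message, not a repo name,
# so push the model and tokenizer objects directly to name the target repo.
# from huggingface_hub import login
# login(token="your_hf_token_here")
# model.push_to_hub("your_model_name")
# tokenizer.push_to_hub("your_model_name")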
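
Once the model is saved, its predictions arrive as raw logits, one score per class. A minimal sketch of turning logits into a label name and a confidence, using the label ids from this practical. The logits are made up, and `softmax` is a small helper written here, not a library call:

```python
import math

id2label = {0: "Positive", 1: "Negative", 2: "Neutral"}

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 0.5, -1.0]   # made-up model output for one text
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)

print(id2label[best], round(probs[best], 3))  # Positive 0.786
```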


#✅ Summary

| Step       | Purpose                     |
| ---------- | --------------------------- |
| Tokenize   | Convert text to model input |
| Train      | TinyBERT using Trainer API  |
| Evaluate   | Accuracy and test loss      |
| Save/Share | Export to local or HF Hub   |
