# **FINE_TUNING**

## Introduction

Fine-tuning is the process of adapting a pretrained language model to a specific downstream task using labeled data.
In this experiment, a small language model (SLM) is fine-tuned for sentiment analysis to demonstrate efficient task-specific learning using limited computational resources.

**Objective**

The objective of this experiment is to fine-tune a Small Language Model (SLM) with fewer than 3 billion parameters on a text dataset from Hugging Face, and evaluate its performance using appropriate metrics.
The experiment is performed using Google Colab with GPU acceleration to ensure efficient training.

The model used is DistilBERT (`distilbert-base-uncased`), a pretrained transformer-based language model that is fine-tuned here for text classification, specifically sentiment analysis.


**Step 1: Enable and Verify GPU**

This step checks whether a GPU is available in the Google Colab environment.
Transformer-based models require high computational power, and using a GPU significantly speeds up training and inference. Without GPU acceleration, fine-tuning transformer models would be extremely slow and inefficient on large datasets.

In [1]:
import torch
torch.cuda.is_available()

True

This confirms that the GPU has been successfully enabled and is ready for use.
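A common follow-up is to store the chosen device in a variable so that later code can fall back to the CPU when no GPU is attached. The sketch below is one way to do that; the helper name `pick_device` is hypothetical and not part of the original notebook.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" when PyTorch is installed and sees a GPU, else "cpu"."""
    # Guard the import so the snippet also runs where PyTorch is absent.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")
```

The `Trainer` used later places the model on the GPU by itself, so a variable like this is mainly useful for hand-written inference code.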

**Step 2: Install Required Libraries**

This command installs the essential Hugging Face libraries:

* transformers → for loading pretrained language models and tokenizers   
* datasets → for downloading and handling datasets from Hugging Face
* evaluate → for computing evaluation metrics
* accelerate → for optimized training on GPU

The -q flag is used to keep the output clean and minimal.

In [2]:
!pip install -q transformers datasets evaluate accelerate


**Step 3: Import Necessary Libraries**

This step imports all the required Python modules:

*	load_dataset → loads datasets directly from Hugging Face Hub
*	AutoTokenizer → automatically selects the correct tokenizer for the model
*	AutoModelForSequenceClassification → loads a pretrained model for classification tasks
* Trainer and TrainingArguments → simplify training and evaluation
*	evaluate → provides standard evaluation metrics
*	torch → core deep learning framework

In [3]:
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments
)
import evaluate
import torch

**Step 4: Load the Dataset from Hugging Face**


## Dataset Description

The SST-2 (Stanford Sentiment Treebank v2) dataset is a sentiment analysis dataset consisting of short English sentences derived from movie reviews.
Each sentence is labeled as either positive or negative sentiment, making it suitable for binary text classification tasks.

The command below downloads and loads SST-2 directly from the Hugging Face Hub.

In [4]:
from datasets import load_dataset

ds = load_dataset("stanfordnlp/sst2")





**Step 5: Inspect the Dataset**

In [5]:
ds

DatasetDict({
    train: Dataset({
        features: ['idx', 'sentence', 'label'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['idx', 'sentence', 'label'],
        num_rows: 872
    })
    test: Dataset({
        features: ['idx', 'sentence', 'label'],
        num_rows: 1821
    })
})

This displays:

* Available splits (train, validation, test)
*	Feature names
*	Number of samples

**Step 6: View a Sample Data Point**

To understand how text and labels are stored.

In [6]:
ds["train"][0]

{'idx': 0,
 'sentence': 'hide new secretions from the parental units ',
 'label': 0}

This helps identify:

*	Input text field (sentence)
*	Target label (label)
*	Label meaning (0 = negative, 1 = positive)
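The label convention above can be written down as a small lookup table. This is only an illustrative sketch; the name `ID2LABEL` is an assumption and does not appear in the notebook.

```python
# Label convention for SST-2 as described above.
ID2LABEL = {0: "negative", 1: "positive"}

# The sample printed in Step 6.
sample = {
    "idx": 0,
    "sentence": "hide new secretions from the parental units ",
    "label": 0,
}

# Translate the integer label into a readable class name.
label_name = ID2LABEL[sample["label"]]
print(f"{sample['sentence'].strip()!r} -> {label_name}")
```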

**Step 7: Load the Tokenizer**

**Objective**

To convert raw text into numerical tokens that the model can understand.

In [7]:
tokenizer = AutoTokenizer.from_pretrained(
    "distilbert-base-uncased"
)


DistilBERT is chosen because it provides high accuracy while using fewer parameters and computational resources compared to larger transformer models, making it suitable for training on Google Colab.

The tokenizer:
* Breaks text into subword tokens
*	Converts tokens into numerical IDs
*	Adds special tokens required by the model
*	Ensures compatibility with the selected model

Using the model’s tokenizer is essential for correct training.
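The tokenizer duties listed above can be illustrated with a toy WordPiece-style encoder. This is a simplified sketch, not the real DistilBERT tokenizer: the vocabulary `TOY_VOCAB`, its IDs, and the helper names are all made up for illustration, and the real vocabulary is far larger.

```python
# Hypothetical miniature vocabulary; the real one has tens of thousands of entries.
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "hide": 4, "new": 5, "secret": 6, "##ions": 7,
}


def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def encode(text, vocab):
    """Lowercase, split into subwords, add special tokens, map to IDs."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens, [vocab[t] for t in tokens]


tokens, ids = encode("hide new secretions", TOY_VOCAB)
print(tokens)
print(ids)
```

In the notebook, calling `tokenizer("hide new secretions")` performs the same kind of steps with the pretrained vocabulary, so the IDs it produces will differ from this toy example.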


**Step 8: Define the Tokenization Function**

To create a reusable function for preprocessing the dataset.

In [8]:
def tokenize_function(examples):
    return tokenizer(
        examples["sentence"],
        padding="max_length",
        truncation=True,
        max_length=128
    )

This function:
* Tokenizes the input sentence
*	Pads all sequences to the same length
*	Truncates long sentences
*	Ensures uniform input size for batch training
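What `padding="max_length"` and `truncation=True` do to a sequence of token IDs can be sketched in plain Python. The helper `pad_or_truncate` is hypothetical, the pad ID of 0 is an assumption, and the sketch ignores one detail of the real tokenizer, which keeps the closing special token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Force a token-ID list to exactly max_length entries."""
    ids = token_ids[:max_length]          # truncation: drop the overflow
    mask = [1] * len(ids)                 # 1 marks a real token
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,   # padding: fill to max_length
        "attention_mask": mask + [0] * n_pad,  # 0 marks padding to be ignored
    }


short = pad_or_truncate([2, 4, 5, 3], max_length=6)
long = pad_or_truncate(list(range(10, 20)), max_length=6)
print(short)
print(long)
```

The attention mask is what lets the model ignore the padded positions, which is why `map()` later adds it alongside `input_ids`.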

**Step 9: Apply Tokenization to the Dataset**

To preprocess the entire dataset for model training.


In [9]:
tokenized_ds = ds.map(tokenize_function, batched=True)


The map() function:

*	Applies tokenization to all samples
* Processes data efficiently in batches
*	Adds tokenized fields such as input_ids and attention_mask
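The idea behind `batched=True` is that the function receives a slice of each column at once rather than one row at a time. The following is a rough stand-in for that behaviour, not the `datasets` implementation; `map_batched` and `add_length` are hypothetical names.

```python
def map_batched(columns, fn, batch_size=1000):
    """Apply fn to consecutive slices of a dict-of-lists and collect new columns."""
    n_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, n_rows, batch_size):
        # Each batch is a dict with the same keys, holding a slice of every column.
        batch = {k: v[start:start + batch_size] for k, v in columns.items()}
        for name, values in fn(batch).items():
            new_columns.setdefault(name, []).extend(values)
    # Original columns are kept; the function's outputs are added next to them.
    return {**columns, **new_columns}


def add_length(batch):
    """Toy per-batch function: count whitespace-separated words."""
    return {"n_words": [len(s.split()) for s in batch["sentence"]]}


data = {"sentence": ["a b c", "d e", "f"], "label": [1, 0, 1]}
result = map_batched(data, add_length, batch_size=2)
print(result)
```

In the notebook, `tokenize_function` plays the role of `add_length`, which is how `input_ids` and `attention_mask` end up as new columns.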

**Step 10: Prepare Dataset for PyTorch**

To remove unused columns and convert data into PyTorch tensors.


In [10]:
tokenized_ds = tokenized_ds.remove_columns(["sentence"])
tokenized_ds.set_format("torch")

* Removes raw text to save memory
* Converts dataset into PyTorch tensor format
* Makes the dataset compatible with the Trainer API


**Step 11: Load the Pretrained Model**

To load a small language model suitable for fine-tuning.


In [11]:
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2
)


DistilBertForSequenceClassification LOAD REPORT from: distilbert-base-uncased
Key                     | Status     | 
------------------------+------------+-
vocab_layer_norm.bias   | UNEXPECTED | 
vocab_transform.weight  | UNEXPECTED | 
vocab_projector.bias    | UNEXPECTED | 
vocab_layer_norm.weight | UNEXPECTED | 
vocab_transform.bias    | UNEXPECTED | 
classifier.weight       | MISSING    | 
pre_classifier.bias     | MISSING    | 
classifier.bias         | MISSING    | 
pre_classifier.weight   | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.


Explanation
* Loads DistilBERT with pretrained weights
*	Adds a classification head with 2 output labels
*	Satisfies the constraint of < 3B parameters
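The parameter constraint can be checked with a back-of-the-envelope count. The configuration values below (vocabulary size, hidden size, and so on) are assumptions about `distilbert-base-uncased` that are not stated in the notebook, so treat the result as an estimate.

```python
# Assumed configuration of distilbert-base-uncased (not taken from the notebook).
VOCAB, MAX_POS, DIM, FFN_DIM, LAYERS, NUM_LABELS = 30522, 512, 768, 3072, 6, 2


def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


layer_norm = 2 * DIM                                     # scale and shift
embeddings = VOCAB * DIM + MAX_POS * DIM + layer_norm    # word + position + norm
attention = 4 * linear(DIM, DIM)                         # query, key, value, output
feed_forward = linear(DIM, FFN_DIM) + linear(FFN_DIM, DIM)
per_layer = attention + feed_forward + 2 * layer_norm
head = linear(DIM, DIM) + linear(DIM, NUM_LABELS)        # pre_classifier + classifier

total = embeddings + LAYERS * per_layer + head
print(f"{total:,} parameters")
print(f"Under the 3B limit: {total < 3_000_000_000}")
```

Under these assumptions the model has roughly 67 million parameters, far below the 3B limit. Inside the notebook, `sum(p.numel() for p in model.parameters())` gives the exact figure for the loaded model.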


**Step 12: Define Evaluation Metric**

To evaluate model performance quantitatively.

In [12]:
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = torch.argmax(
        torch.tensor(logits), dim=1
    )
    return accuracy.compute(
        predictions=predictions,
        references=labels
    )


Explanation

* Accuracy is suitable for binary classification when the classes are roughly balanced
*	Compares predicted labels with true labels
*	Provides an easy-to-interpret performance score
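The logic inside `compute_metrics` is an argmax over each row of logits followed by a comparison with the true labels. A dependency-free sketch of the same computation, with made-up logits and the hypothetical helper names `argmax` and `accuracy_score`, looks like this:

```python
def argmax(row):
    """Index of the largest value in a list (the predicted class)."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_score(logits, labels):
    """Fraction of rows whose highest-scoring class equals the true label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Made-up logits for four sentences: column 0 = negative, column 1 = positive.
toy_logits = [[2.1, -1.0], [0.3, 0.9], [-0.5, 0.2], [1.5, 1.4]]
toy_labels = [0, 1, 0, 0]

result = accuracy_score(toy_logits, toy_labels)
print(result)
```

Three of the four toy predictions match their labels, so the sketch reports an accuracy of 0.75.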


**Step 13: Define Training Arguments**

To configure how the model will be trained.


In [14]:
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    load_best_model_at_end=True,
    report_to="none"
)

These parameters control:

*	Learning rate
*	Batch size
*	Number of epochs
*	Evaluation frequency
*	Model checkpointing

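A small learning rate (2e-5) and a batch size of 16 are conventional choices for fine-tuning BERT-style models, and evaluating and checkpointing once per epoch keeps overhead low on Colab.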
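The learning rate does not stay at 2e-5 for the whole run. Assuming the `Trainer` defaults of a linear schedule with no warmup (an assumption, since the notebook does not set a scheduler), the rate decays to zero over the run. The sketch below uses the 8420 total steps reported as `global_step` in Step 15; `linear_lr` is a hypothetical helper.

```python
LEARNING_RATE = 2e-5
TOTAL_STEPS = 8420  # global_step reported by trainer.train() in Step 15


def linear_lr(step, base_lr=LEARNING_RATE, total_steps=TOTAL_STEPS):
    """Linearly decay the learning rate from base_lr to 0 (no warmup assumed)."""
    remaining = max(0, total_steps - step)
    return base_lr * remaining / total_steps


for step in (0, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(f"step {step:>5}: lr = {linear_lr(step):.2e}")
```

Under this assumption, the second epoch trains with at most half of the initial learning rate.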

**Step 14: Initialize the Trainer**

To simplify training and evaluation using Hugging Face’s Trainer API.


In [16]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

The Trainer:

* Handles training loop
* Performs evaluation automatically
* Manages GPU usage and batching
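`DataCollatorWithPadding` pads each batch to the length of its longest sequence. The plain-Python sketch below shows that idea only; `pad_batch` is a hypothetical helper and the pad ID of 0 is an assumption.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch."""
    longest = max(len(s) for s in sequences)
    return {
        "input_ids": [s + [pad_id] * (longest - len(s)) for s in sequences],
        "attention_mask": [
            [1] * len(s) + [0] * (longest - len(s)) for s in sequences
        ],
    }


batch = pad_batch([[2, 4, 3], [2, 4, 5, 6, 3]])
print(batch)
```

Because Step 8 already pads every sentence to 128 tokens, the collator has nothing left to pad in this notebook. Dynamic padding only saves computation if `padding="max_length"` is removed from the tokenization function.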

**Step 15: Fine-Tune the Model**

To train the pretrained model on the SST-2 dataset.

In [17]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1 | 0.188843 | 0.311305 | 0.908257 |
| 2 | 0.13022 | 0.352051 | 0.90367 |



There were missing keys in the checkpoint model loaded: ['distilbert.embeddings.LayerNorm.weight', 'distilbert.embeddings.LayerNorm.bias'].
There were unexpected keys in the checkpoint model loaded: ['distilbert.embeddings.LayerNorm.beta', 'distilbert.embeddings.LayerNorm.gamma'].


TrainOutput(global_step=8420, training_loss=0.1776808374001691, metrics={'train_runtime': 1422.0137, 'train_samples_per_second': 94.723, 'train_steps_per_second': 5.921, 'total_flos': 4460773416041472.0, 'train_loss': 0.1776808374001691, 'epoch': 2.0})

Explanation

* Model learns task-specific patterns
* Pretrained knowledge is adapted to sentiment classification
* Training loss decreases across epochs (0.189 to 0.130), while validation loss rises slightly (0.311 to 0.352)
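The numbers in `TrainOutput` can be cross-checked against the dataset size and the training arguments. This sketch assumes a single GPU and no gradient accumulation, which matches the arguments shown but is not stated explicitly.

```python
import math

TRAIN_SAMPLES = 67349        # rows in the train split
BATCH_SIZE = 16              # per_device_train_batch_size
EPOCHS = 2                   # num_train_epochs
TRAIN_RUNTIME_S = 1422.0137  # train_runtime from TrainOutput

# The last batch of each epoch is smaller, so round up.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
samples_per_second = TRAIN_SAMPLES * EPOCHS / TRAIN_RUNTIME_S

print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps:     {total_steps}")
print(f"samples/second:  {samples_per_second:.3f}")
print(f"runtime:         {TRAIN_RUNTIME_S / 60:.1f} minutes")
```

The computed step count and throughput agree with the reported `global_step=8420` and `train_samples_per_second` of 94.723.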

**Step 16: Evaluate the Model**

To measure performance on unseen validation data.


In [18]:
trainer.evaluate()

{'eval_loss': 0.31124016642570496,
 'eval_accuracy': 0.908256880733945,
 'eval_runtime': 3.1033,
 'eval_samples_per_second': 280.989,
 'eval_steps_per_second': 17.723,
 'epoch': 2.0}

This returns:

* Validation loss
* Accuracy score
## Results

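After fine-tuning, the model reached 90.8% accuracy on the 872-sentence validation set, indicating that it learned sentiment patterns from the training data.
Validation loss rose from 0.311 after epoch 1 to 0.352 after epoch 2 while training loss kept falling, a sign of mild overfitting in the second epoch. Because `load_best_model_at_end=True`, the final evaluation reflects the epoch 1 checkpoint.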

## Observations

- The pretrained DistilBERT model converged quickly: validation accuracy peaked after the first of two epochs.
- Tokenization played a crucial role in converting raw text into numerical representations.
- Training the small language model took about 24 minutes (1,422 s) for two epochs on a Colab GPU while reaching about 91% validation accuracy.

## Conclusion

This experiment demonstrates that a pretrained small language model can be effectively fine-tuned for sentiment analysis on the SST-2 dataset.
The results highlight the efficiency of transfer learning and the suitability of DistilBERT for text classification tasks under limited computational resources.
