<a href="https://colab.research.google.com/github/Paul-NIROB/agentic-ai/blob/main/Fine_tuning_Lab_task1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
!pip install -U transformers accelerate evaluate




In [3]:
!pip install -q transformers datasets evaluate accelerate


In [4]:
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
import evaluate

In [5]:
dataset = load_dataset("ag_news")
dataset




DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})

In [6]:
model_name = "distilbert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=4
)



DistilBertForSequenceClassification LOAD REPORT from: distilbert-base-uncased
Key                     | Status     | 
------------------------+------------+-
vocab_transform.weight  | UNEXPECTED | 
vocab_layer_norm.weight | UNEXPECTED | 
vocab_transform.bias    | UNEXPECTED | 
vocab_projector.bias    | UNEXPECTED | 
vocab_layer_norm.bias   | UNEXPECTED | 
classifier.weight       | MISSING    | 
classifier.bias         | MISSING    | 
pre_classifier.weight   | MISSING    | 
pre_classifier.bias     | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.


In [7]:
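def tokenize_function(examples):
    # Tokenize a batch of texts; truncate or pad each one to exactly 128 tokens.
    # With fixed-length padding here, the DataCollatorWithPadding created later
    # has nothing left to pad.
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=128
    )

# batched=True passes many examples per call, which speeds up tokenization.
tokenized_datasets = dataset.map(tokenize_function, batched=True)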



In [8]:
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = logits.argmax(axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)


In [10]:
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    logging_dir="./logs",
    report_to="none"
)


`logging_dir` is deprecated and will be removed in v5.2. Please set `TENSORBOARD_LOGGING_DIR` instead.


In [12]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

In [14]:
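# Use small shuffled subsets (5,000 train / 1,000 test examples) to shorten the run.
small_train = tokenized_datasets["train"].shuffle(seed=42).select(range(5000))
small_test = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

# Override the full splits passed to Trainer above; the Trainer reads these
# attributes when train() and evaluate() are called.
trainer.train_dataset = small_train
trainer.eval_dataset = small_test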


In [15]:
trainer.train()


Step,Training Loss
500,0.279367



TrainOutput(global_step=626, training_loss=0.26227819424467724, metrics={'train_runtime': 4626.7739, 'train_samples_per_second': 2.161, 'train_steps_per_second': 0.135, 'total_flos': 448285665558528.0, 'train_loss': 0.26227819424467724, 'epoch': 2.0})

In [16]:
trainer.evaluate()




{'eval_loss': 0.2814665138721466,
 'eval_accuracy': 0.912,
 'eval_runtime': 145.6285,
 'eval_samples_per_second': 6.867,
 'eval_steps_per_second': 0.433,
 'epoch': 2.0}

# Fine-Tuning a Small Language Model (SLM)

## Agentic AI – Lab Task 1

### Objective
The objective of this task is to fine-tune a Small Language Model (SLM) with fewer than 3 billion parameters on a text dataset from Hugging Face and to evaluate its performance. Here the model is `distilbert-base-uncased` and the dataset is `ag_news`.

### Tools Used
- Google Colab
- Hugging Face Transformers
- Hugging Face Datasets
- Hugging Face Evaluate
- DistilBERT model (`distilbert-base-uncased`)


## Dataset Description

The AG News dataset is a text classification dataset consisting of news articles categorized into four classes:
- World
- Sports
- Business
- Science/Technology

The dataset provides a training split of 120,000 examples and a test split of 7,600 examples, each with a `text` and a `label` field, and is suitable for text classification tasks.
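
The integer values in the `label` field can be mapped back to class names. The following is a minimal sketch, assuming the label order given on the `ag_news` dataset card (0 = World, 1 = Sports, 2 = Business, 3 = Sci/Tech). `AG_NEWS_LABELS` and `label_name` are illustrative names and are not part of the notebook.

```python
# Assumed label order, taken from the ag_news dataset card.
AG_NEWS_LABELS = ["World", "Sports", "Business", "Sci/Tech"]


def label_name(label_id: int) -> str:
    """Return the class name for an integer AG News label."""
    if not 0 <= label_id < len(AG_NEWS_LABELS):
        raise ValueError(
            f"label must be in 0..{len(AG_NEWS_LABELS) - 1}, got {label_id}"
        )
    return AG_NEWS_LABELS[label_id]


print(label_name(1))  # Sports
```

If the `label` feature of the loaded dataset is a `ClassLabel`, `dataset["train"].features["label"].names` is the safer source for this mapping than a hard-coded list.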


## Results and Observations

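After two epochs of fine-tuning on a 5,000-example subset, DistilBERT reached an
accuracy of 0.912 (evaluation loss 0.281) on a 1,000-example subset of the AG News
test split. The mean training loss over 626 steps was 0.262, with 0.279 logged at
step 500. Because only one intermediate loss value was logged, the run does not
show the shape of the loss curve.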
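
The accuracy metric is the fraction of examples whose highest-scoring class matches the reference label. The following pure-Python sketch illustrates that calculation, mirroring the `argmax` step in `compute_metrics`. The helper names and the toy logits are made up for illustration and are not taken from the run.

```python
def argmax(row):
    """Return the index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the reference label."""
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same length")
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Toy logits for four examples over four classes (made-up values).
toy_logits = [
    [2.1, 0.3, -1.0, 0.2],   # highest score at class 0
    [0.1, 0.2, 3.4, -0.5],   # highest score at class 2
    [-0.3, 1.9, 0.4, 0.0],   # highest score at class 1
    [0.0, 0.1, 0.2, 5.0],    # highest score at class 3
]
toy_labels = [0, 2, 3, 3]    # the third example is misclassified

print(accuracy_from_logits(toy_logits, toy_labels))  # 0.75
```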

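Training on the reduced dataset took about 4,627 seconds (roughly 77 minutes) at
2.16 samples per second. The full 120,000-example training split was not run, but
at the same throughput it would take about 24 times longer. The subset therefore
keeps training time manageable while still giving acceptable accuracy, which makes
this approach suitable for resource-constrained environments.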
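
The effect of subsetting on the length of training can be checked with a little arithmetic. This sketch assumes a single device, no gradient accumulation, and a partial final batch that is kept, which is consistent with the `global_step=626` reported by `trainer.train()`. The function name `total_optimizer_steps` is illustrative.

```python
import math


def total_optimizer_steps(num_examples, batch_size, epochs):
    """ceil(examples / batch size) steps per epoch, times the number of epochs."""
    return math.ceil(num_examples / batch_size) * epochs


# Settings from the notebook: batch size 16, 2 epochs.
subset_steps = total_optimizer_steps(5000, 16, 2)
full_steps = total_optimizer_steps(120000, 16, 2)

print(subset_steps)                         # 626
print(full_steps)                           # 15000
print(round(full_steps / subset_steps, 1))  # 24.0
```

The ratio of about 24 is only a rough guide to wall-clock time, since it assumes the per-step cost stays the same on the full dataset.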



## Conclusion

This lab demonstrated the fine-tuning of a Small Language Model with fewer than
3 billion parameters (DistilBERT, at roughly 66 million) on a Hugging Face dataset.
The results show that SLMs can be adapted to a downstream classification task with
limited computational resources and a small fraction of the available training data.
