In [41]:
!pip install -q transformers datasets accelerate evaluate torch


In [42]:
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer
import math


In [43]:
dataset = load_dataset("yelp_review_full")
dataset


DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})

## Dataset Description

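The Yelp Review Full dataset is loaded from the Hugging Face Hub with the `datasets` library.
It contains 650,000 training and 50,000 test reviews written by customers, each with a `text` field and a star-rating `label`.
Only the `text` field is used here, because the goal is to fine-tune a Small Language Model on real-world text rather than to classify ratings.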


In [44]:
model_name = "distilgpt2"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

tokenizer.pad_token = tokenizer.eos_token


GPT2LMHeadModel LOAD REPORT from: distilgpt2
Key                                        | Status     |  | 
-------------------------------------------+------------+--+-
transformer.h.{0, 1, 2, 3, 4, 5}.attn.bias | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


## Model Selection

DistilGPT-2 is selected as the Small Language Model for this task.
It has approximately 82 million parameters, which is well below the 3 billion parameter limit.
The model is lightweight enough to fine-tune on Google Colab.
GPT-2 tokenizers define no padding token, so the end-of-sequence token is reused for padding.
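The 82 million figure can be sanity-checked with plain arithmetic. The sketch below assumes the published DistilGPT-2 configuration (vocabulary of 50,257 tokens, 1,024 positions, hidden size 768, 6 transformer blocks); these values are not read from the notebook, and `gpt2_param_count` is a hypothetical helper written for illustration.

```python
def gpt2_param_count(vocab_size, n_positions, n_embd, n_layer):
    """Estimate the parameter count of a GPT-2-style model (hypothetical helper)."""
    embeddings = vocab_size * n_embd + n_positions * n_embd  # token + position tables
    layer_norm = 2 * n_embd                                  # weight + bias
    # Attention: fused QKV projection plus output projection (weights + biases).
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # Feed-forward: expand to 4 * n_embd, then project back (weights + biases).
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    block = 2 * layer_norm + attention + mlp
    # The output head shares its weights with the token embedding, so it adds nothing.
    return embeddings + n_layer * block + layer_norm  # final layer norm

# Assumed DistilGPT-2 configuration values.
distilgpt2_params = gpt2_param_count(
    vocab_size=50257, n_positions=1024, n_embd=768, n_layer=6
)
print(f"{distilgpt2_params:,}")  # 81,912,576
```

With the model loaded, the same number should be available from `sum(p.numel() for p in model.parameters())`.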


In [45]:
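def tokenize_function(examples):
    # Truncate or pad every review to exactly 128 tokens.
    return tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=128
    )

# The full 700,000-review dataset is not tokenized here; the function is
# applied to the subsampled splits in the next cell.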
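To make the effect of `truncation=True` and `padding="max_length"` concrete, here is a minimal pure-Python sketch of what happens to one sequence of token IDs. `pad_or_truncate` is a hypothetical stand-in, not the tokenizer's implementation, and the token IDs are made up; it assumes right-side padding, which is the GPT-2 tokenizer default.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Cut a sequence to max_length, then pad it on the right (illustrative only)."""
    ids = token_ids[:max_length]            # truncation=True
    attention_mask = [1] * len(ids)         # 1 marks real tokens
    n_pad = max_length - len(ids)           # padding="max_length"
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,  # 0 marks padding
    }

short = pad_or_truncate([11, 22, 33], max_length=5, pad_id=0)
long = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5, pad_id=0)

print(short)  # {'input_ids': [11, 22, 33, 0, 0], 'attention_mask': [1, 1, 1, 0, 0]}
print(long)   # {'input_ids': [1, 2, 3, 4, 5], 'attention_mask': [1, 1, 1, 1, 1]}
```

Every example ends up with the same length, which keeps batching simple at the cost of computing over padding for short reviews.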


In [46]:
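# Shuffle with a fixed seed so the subsets are reproducible.
small_train_dataset = dataset["train"].shuffle(seed=42).select(range(2000))
small_test_dataset = dataset["test"].shuffle(seed=42).select(range(500))

# Drop the raw "label" and "text" columns; only the token IDs and attention mask are needed.
tokenized_train = small_train_dataset.map(
    tokenize_function, batched=True, remove_columns=small_train_dataset.column_names
)
tokenized_test = small_test_dataset.map(
    tokenize_function, batched=True, remove_columns=small_test_dataset.column_names
)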


## Data Subsampling

The original dataset is very large, so a random subset of 2,000 training reviews and 500 test reviews is selected with a fixed seed.
This reduces computational cost while still demonstrating the fine-tuning workflow.


In [47]:
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",           # evaluate at the end of every epoch
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=2,
    weight_decay=0.01,
    logging_steps=100,
    save_total_limit=2,              # keep at most two checkpoints on disk
    fp16=torch.cuda.is_available()   # mixed precision requires a CUDA GPU
)


## Training Configuration

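This cell defines the hyperparameters for fine-tuning: a learning rate of 2e-5, a batch size of 4 per device for both training and evaluation, two epochs, and a weight decay of 0.01.
Evaluation runs at the end of every epoch, the loss is logged every 100 steps, and at most two checkpoints are kept on disk.
Mixed-precision (`fp16`) training reduces memory use and speeds up training on a CUDA GPU such as those offered by Google Colab.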
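The number of optimizer steps follows directly from these settings. The sketch below assumes a single device and no gradient accumulation (the default); both are assumptions about the runtime rather than something the notebook states. The runtime value is copied from the recorded training output.

```python
import math

train_examples = 2000          # size of the training subset
per_device_batch_size = 4
num_epochs = 2
num_devices = 1                # assumption: one GPU
gradient_accumulation = 1      # assumption: library default

effective_batch = per_device_batch_size * num_devices * gradient_accumulation
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch)  # 500
print(total_steps)      # 1000

# Throughput implied by the recorded runtime of 149.7426 seconds.
samples_per_second = train_examples * num_epochs / 149.7426
print(round(samples_per_second, 3))
```

The total of 1,000 steps matches `global_step=1000` in the training output, and the computed throughput agrees with the reported `train_samples_per_second`.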


In [48]:
from transformers import DataCollatorForLanguageModeling

# mlm=False selects causal language modeling: labels are a copy of input_ids,
# and padding positions are ignored by the loss.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    data_collator=data_collator
)

trainer.train()


Epoch,Training Loss,Validation Loss
1,3.904373,3.744351
2,3.827787,3.729372


TrainOutput(global_step=1000, training_loss=3.8863327331542967, metrics={'train_runtime': 149.7426, 'train_samples_per_second': 26.713, 'train_steps_per_second': 6.678, 'total_flos': 130648375296000.0, 'train_loss': 3.8863327331542967, 'epoch': 2.0})

## Model Fine-tuning

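The Small Language Model was fine-tuned on the selected subset for two epochs (1,000 optimizer steps).
Validation loss fell slightly, from 3.744 after the first epoch to 3.729 after the second, and the logged training loss fell from 3.904 to 3.828.
The pre-trained model was not evaluated before training, so these numbers show improvement between epochs rather than a measured gain over the original DistilGPT-2 checkpoint.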
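The loss values above are mean next-token cross-entropies. The following pure-Python sketch illustrates the idea under stated assumptions: labels are a copy of the input IDs with padding replaced by `-100`, and the prediction at position `t` is scored against the token at position `t + 1`. The helper names, token IDs, and probabilities are made up for illustration, and this is not the library's implementation.

```python
import math

IGNORE_INDEX = -100  # label value that the loss skips

def make_causal_labels(input_ids, pad_id):
    """Copy input_ids and mask padding so it does not contribute to the loss."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in input_ids]

def causal_lm_loss(next_token_probs, labels):
    """Mean negative log-likelihood of each next token.

    next_token_probs[t] maps a token ID to the probability the model assigns
    to that token appearing at position t + 1.
    """
    losses = []
    for t in range(len(labels) - 1):
        target = labels[t + 1]
        if target == IGNORE_INDEX:
            continue  # padding target: ignored
        losses.append(-math.log(next_token_probs[t][target]))
    return sum(losses) / len(losses)

input_ids = [5, 7, 9, 0, 0]                       # made-up IDs; 0 is the pad ID here
labels = make_causal_labels(input_ids, pad_id=0)
next_token_probs = [{7: 0.5}, {9: 0.25}, {}, {}]  # made-up model probabilities
loss = causal_lm_loss(next_token_probs, labels)

print(labels)  # [5, 7, 9, -100, -100]
print(loss)
```

Because the notebook reuses the end-of-sequence token as the padding token, masking by pad ID would also exclude genuine end-of-sequence tokens from the loss.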


In [49]:
eval_results = trainer.evaluate()
eval_results


{'eval_loss': 3.729372024536133,
 'eval_runtime': 2.0549,
 'eval_samples_per_second': 243.322,
 'eval_steps_per_second': 60.83,
 'epoch': 2.0}

## Model Evaluation

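The fine-tuned model was evaluated on the 500-review subset of the test split.
Evaluation loss, the mean per-token cross-entropy, is the primary metric of the model’s performance on unseen data; it matches the validation loss reported after the second epoch (3.729).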


In [50]:
# Perplexity is the exponential of the mean per-token cross-entropy (eval loss).
perplexity = math.exp(eval_results["eval_loss"])
perplexity



41.65294292317091

## Perplexity

Perplexity is the exponential of the evaluation loss: exp(3.729) ≈ 41.65.
A lower perplexity indicates that the model assigns higher probability to the actual next token in a sequence.
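As a worked example, the snippet below recomputes perplexity from the two validation losses recorded during training and compares them with a uniform-guess baseline. The vocabulary size of 50,257 is an assumption taken from the published GPT-2 tokenizer, not from the notebook output.

```python
import math

def perplexity(mean_cross_entropy):
    """Perplexity is exp of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)

# Validation losses copied from the training log.
validation_loss = {1: 3.744351, 2: 3.729372024536133}
ppl_by_epoch = {epoch: perplexity(loss) for epoch, loss in validation_loss.items()}

for epoch, ppl in ppl_by_epoch.items():
    print(f"epoch {epoch}: perplexity {ppl:.2f}")

# Baseline: a model that guesses uniformly over the vocabulary.
vocab_size = 50257  # assumption: GPT-2 vocabulary size
uniform_ppl = perplexity(math.log(vocab_size))
print(f"uniform baseline: {uniform_ppl:.0f}")
```

A perplexity near 42 can be read as the model being, on average, about as uncertain as if it were choosing uniformly among roughly 42 tokens at each step, compared with tens of thousands for a uniform guess.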


In [51]:
input_text = "The restaurant was"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# Sample a continuation; max_length counts the prompt tokens as well.
outputs = model.generate(
    **inputs,
    max_length=50,
    temperature=0.7,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id  # set explicitly to avoid the pad-token warning
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))


The restaurant was a total mess.  The food was disgusting and had its own problems.  We were only able to order one meal per person.  I was very annoyed when I saw the "barbecue" sign.  I had to say


## Text Generation

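After fine-tuning, the model was used to generate text from the prompt "The restaurant was", sampling with a temperature of 0.7.
The continuation is fluent and reads like a restaurant review, which is consistent with the model having adapted to the style of the training data, although a single sample is only anecdotal evidence.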
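The role of the temperature setting can be illustrated with a small softmax sketch. The logits below are made up, and `softmax_with_temperature` is a hypothetical helper showing the general technique rather than the library's sampling code.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract the max for stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens

p_default = softmax_with_temperature(logits, temperature=1.0)
p_sharp = softmax_with_temperature(logits, temperature=0.7)

print([round(p, 3) for p in p_default])
print([round(p, 3) for p in p_sharp])
```

A temperature below 1 concentrates probability on the highest-scoring tokens, so sampled text tends to be more conservative; a temperature above 1 flattens the distribution and increases variety.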


## Conclusion

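In this task, DistilGPT-2 was fine-tuned on a 2,000-review subset of the Yelp Review Full dataset from Hugging Face.
Validation loss decreased between epochs to 3.729, corresponding to a perplexity of about 41.65, and a sampled continuation reads like a restaurant review.
The pre-trained checkpoint was not evaluated for comparison, so the size of the gain from fine-tuning is not quantified.
This experiment demonstrates the complete workflow of fine-tuning and evaluating a language model.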
