<a href="https://colab.research.google.com/github/SandeepKonduruFeb12/aiml/blob/master/gold/goldassignment2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task
Demonstrate transfer learning with Large Language Models (LLMs) by fine-tuning `distilgpt2` on a small, synthetic dataset for a simple text generation task, then qualitatively compare the output of the fine-tuned model against the baseline model to illustrate the impact of transfer learning with limited data.


Install necessary Python libraries such as `transformers` and `datasets` to work with pre-trained language models and handle data.


In [1]:
print('Installing necessary Python libraries...')
!pip install transformers datasets
print('Libraries installed successfully.')

Installing necessary Python libraries...
Libraries installed successfully.


Let's create a small, synthetic dataset as a Python list of dictionaries to demonstrate the text generation task.



In [2]:
# Create a synthetic dataset for product name descriptions
dataset = [
    {'product_name': 'EcoBottle', 'description': 'A reusable water bottle made from recycled materials.'},
    {'product_name': 'SmartWatch Pro', 'description': 'Advanced smartwatch with health tracking and AI assistant.'},
    {'product_name': 'AeroGlide Sneakers', 'description': 'Lightweight running shoes designed for ultimate comfort and speed.'},
    {'product_name': 'PowerBlend Blender', 'description': 'High-performance blender for smoothies, soups, and shakes.'},
    {'product_name': 'LumiDesk Lamp', 'description': 'Adjustable LED desk lamp with customizable brightness and color temperature.'},
    {'product_name': 'GamerX Headset', 'description': 'Immersive gaming headset with noise cancellation and crystal-clear audio.'},
    {'product_name': 'ChefMate Pan Set', 'description': 'Non-stick ceramic cookware set for healthy and easy cooking.'},
    {'product_name': 'TravelPro Backpack', 'description': 'Durable and spacious backpack perfect for adventurers and daily commuters.'}
]

# Display the dataset
print("Synthetic dataset created successfully:")
for item in dataset:
    print(item)

Synthetic dataset created successfully:
{'product_name': 'EcoBottle', 'description': 'A reusable water bottle made from recycled materials.'}
{'product_name': 'SmartWatch Pro', 'description': 'Advanced smartwatch with health tracking and AI assistant.'}
{'product_name': 'AeroGlide Sneakers', 'description': 'Lightweight running shoes designed for ultimate comfort and speed.'}
{'product_name': 'PowerBlend Blender', 'description': 'High-performance blender for smoothies, soups, and shakes.'}
{'product_name': 'LumiDesk Lamp', 'description': 'Adjustable LED desk lamp with customizable brightness and color temperature.'}
{'product_name': 'GamerX Headset', 'description': 'Immersive gaming headset with noise cancellation and crystal-clear audio.'}
{'product_name': 'ChefMate Pan Set', 'description': 'Non-stick ceramic cookware set for healthy and easy cooking.'}
{'product_name': 'TravelPro Backpack', 'description': 'Durable and spacious backpack perfect for adventurers and daily commuters.'}



Load a resource-efficient pre-trained LLM (e.g., `distilgpt2`) and its corresponding tokenizer.


In [3]:
print('Loading tokenizer and model...')
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the distilgpt2 tokenizer
tokenizer = AutoTokenizer.from_pretrained('distilgpt2')

# Load the distilgpt2 model
model = AutoModelForCausalLM.from_pretrained('distilgpt2')

# Set pad_token_id for tokenizer to avoid warnings during generation
# distilgpt2 does not have a default pad_token, so we set it to eos_token_id
# which is common practice for GPT-like models that are autoregressive.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print('Tokenizer and model loaded successfully.')

Loading tokenizer and model...



Tokenizer and model loaded successfully.


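`distilgpt2` is a distilled, six-layer variant of GPT-2 with roughly 82 million parameters, which makes it small enough to load and fine-tune within the memory and time limits of a free Colab session. The tokenizer converts raw text into the integer token IDs the model consumes and decodes generated IDs back into text, so the same tokenizer must be used for both fine-tuning and generation.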
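As a rough check on why this checkpoint fits comfortably in Colab, the parameter count can be estimated from the GPT-2 architecture alone. The sketch below assumes the published `distilgpt2` configuration (6 layers, hidden size 768, a vocabulary of 50,257 tokens, 1,024 positions) and an output head that shares weights with the input embedding. The helper name `gpt2_param_count` is made up for illustration.

```python
def gpt2_param_count(n_layer, n_embd, vocab_size, n_positions):
    """Estimate trainable parameters of a GPT-2 style decoder (tied LM head)."""
    # Token embeddings plus learned position embeddings
    embeddings = vocab_size * n_embd + n_positions * n_embd
    # A layer norm has a weight and a bias vector
    layer_norm = 2 * n_embd
    # Fused QKV projection plus output projection, each with a bias
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # Two-layer feed-forward block with a 4x hidden expansion
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    per_layer = 2 * layer_norm + attention + mlp
    # Add the final layer norm after the last block
    return embeddings + n_layer * per_layer + layer_norm


# Assumed distilgpt2 configuration values
distil = gpt2_param_count(n_layer=6, n_embd=768, vocab_size=50257, n_positions=1024)
print(f"{distil:,} parameters, ~{distil * 4 / 1e6:.0f} MB in float32")
```

The same function with `n_layer=12` gives the size of the original small GPT-2, which shows that halving the depth removes about a third of the parameters, since the embedding tables are unchanged.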




Let's structure the small, synthetic dataset (product names paired with descriptions) and prepare it for fine-tuning: format each record as a single text string, tokenize it, and wrap the result in a `Dataset` object that the `transformers` library can consume.


In [10]:
from datasets import Dataset

# 1. Define a Python function to format the data
def format_data(item):
    return f"product_name: {item['product_name']} description: {item['description']}{tokenizer.eos_token}"

# 2. Define a Python function to tokenize the formatted text
def tokenize_function(examples):
    return tokenizer(examples['text'], truncation=True, max_length=128, return_attention_mask=True)

# 3. Apply the format_data function to each item in the dataset
formatted_texts = [format_data(item) for item in dataset]
print("Formatted texts (first 3 examples):")
for i, text in enumerate(formatted_texts[:3]):
    print(f"  {i+1}: {text}")

# 4. Convert the list of formatted text strings into a datasets.Dataset object
# The dataset library expects a dictionary-like structure, so we put formatted_texts under a 'text' key.
data_dict = {'text': formatted_texts}
hf_dataset = Dataset.from_dict(data_dict)
print(f"\nDataset object created with {len(hf_dataset)} entries.")
print("First entry of the raw Hugging Face dataset:")
print(hf_dataset[0])

# 5. Apply the tokenize_function to the dataset
tokenized_dataset = hf_dataset.map(tokenize_function, batched=True, remove_columns=['text'])

print("\nTokenization complete. First entry of the tokenized dataset:")
print(tokenized_dataset[0])
print(f"Original columns: {hf_dataset.column_names}")
print(f"Tokenized dataset columns: {tokenized_dataset.column_names}")

Formatted texts (first 3 examples):
  1: product_name: EcoBottle description: A reusable water bottle made from recycled materials.<|endoftext|>
  2: product_name: SmartWatch Pro description: Advanced smartwatch with health tracking and AI assistant.<|endoftext|>
  3: product_name: AeroGlide Sneakers description: Lightweight running shoes designed for ultimate comfort and speed.<|endoftext|>

Dataset object created with 8 entries.
First entry of the raw Hugging Face dataset:
{'text': 'product_name: EcoBottle description: A reusable water bottle made from recycled materials.<|endoftext|>'}


Map:   0%|          | 0/8 [00:00<?, ? examples/s]


Tokenization complete. First entry of the tokenized dataset:
{'input_ids': [11167, 62, 3672, 25, 38719, 28653, 293, 6764, 25, 317, 42339, 1660, 9294, 925, 422, 32099, 5696, 13, 50256], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
Original columns: ['text']
Tokenized dataset columns: ['input_ids', 'attention_mask']
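Since every training example follows the same `product_name: ... description: ...` layout, a generation prompt is more likely to elicit a description when it stops right after the `description:` prefix. Below is a minimal sketch of keeping the training text and the prompt in sync. `build_prompt` and `format_example` are hypothetical helpers, and the end-of-text string is hard-coded as an assumption so the snippet runs without loading the tokenizer.

```python
EOS = "<|endoftext|>"  # assumption: matches tokenizer.eos_token for distilgpt2


def build_prompt(product_name):
    """Return the template prefix up to and including 'description:'."""
    return f"product_name: {product_name} description:"


def format_example(item, eos=EOS):
    """Full training string: prompt prefix, description, end-of-text marker."""
    return f"{build_prompt(item['product_name'])} {item['description']}{eos}"


example = {
    "product_name": "EcoBottle",
    "description": "A reusable water bottle made from recycled materials.",
}
training_text = format_example(example)
prompt = build_prompt("LumiDesk Lamp")

print(training_text)
print(prompt)
```

Deriving both strings from one helper means a later change to the template cannot leave the inference prompt out of step with the training data.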



Implement the fine-tuning process for the chosen LLM on the prepared small dataset. Focus on using basic training arguments suitable for resource-constrained environments to show how the model adapts to the new task.


In [11]:
import torch
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

# Ensure pad_token_id is set for the tokenizer before fine-tuning
# This was already done in a previous step, but re-confirming for safety
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# 1. Define TrainingArguments
training_args = TrainingArguments(
    output_dir="./distilgpt2_fine_tuned",       # Output directory for model checkpoints
    overwrite_output_dir=True,                 # Overwrite the output directory
    num_train_epochs=3,                        # Number of training epochs (small for demonstration)
    per_device_train_batch_size=1,             # Batch size per device during training (small due to limited data/resources)
    save_steps=10_000,                         # Save a checkpoint every N update steps (effectively disabled for this small example)
    save_total_limit=2,                        # Limit the total number of checkpoints on disk
    logging_dir='./logs',                      # Directory for storing logs
    logging_steps=10,                          # Log metrics every X steps
    prediction_loss_only=True,                 # Only calculate loss, not predictions, to save memory
)

# 2. Instantiate DataCollatorForLanguageModeling
# Set mlm=False for causal language models like GPT-2
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False
)

# 3. Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset,
)

# 4. Start the fine-tuning process
print("Starting fine-tuning...")
trainer.train()
print("Fine-tuning completed successfully.")

Starting fine-tuning...


[wandb interactive login prompt omitted: account selection and API key entry]


`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


Step,Training Loss
10,4.1255
20,2.8918


Fine-tuning completed successfully.
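The two rows in the loss table follow from the training settings. A small arithmetic sketch, assuming a single device and no gradient accumulation (neither is set explicitly in the training cell):

```python
import math

num_examples = 8        # size of the synthetic dataset
batch_size = 1          # per_device_train_batch_size
num_epochs = 3          # num_train_epochs
logging_steps = 10      # logging_steps
save_steps = 10_000     # save_steps

# One optimizer step per batch, so steps per epoch is the number of batches
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Steps at which the Trainer reports the running training loss
logged_at = list(range(logging_steps, total_steps + 1, logging_steps))

# Number of times the periodic save interval is reached during the run
periodic_saves = total_steps // save_steps

print(f"total optimizer steps: {total_steps}")
print(f"loss logged at steps: {logged_at}")
print(f"periodic checkpoints from save_steps: {periodic_saves}")
```

With so few steps, each logged value averages only ten single-example updates, so the loss figures are a coarse indication of progress rather than a smooth curve.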


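The first run triggered interactive Weights & Biases (wandb) prompts, which interrupt automated execution. Setting `report_to="none"` in `TrainingArguments` disables wandb logging. Because `model` was already updated by the first run, re-running the cell continues training from those weights rather than from the pre-trained checkpoint, which is consistent with the lower starting loss (2.1451 at step 10, versus 4.1255 in the first run).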



In [12]:
import torch
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

# Ensure pad_token_id is set for the tokenizer before fine-tuning
# This was already done in a previous step, but re-confirming for safety
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# 1. Define TrainingArguments
training_args = TrainingArguments(
    output_dir="./distilgpt2_fine_tuned",       # Output directory for model checkpoints
    overwrite_output_dir=True,                 # Overwrite the output directory
    num_train_epochs=3,                        # Number of training epochs (small for demonstration)
    per_device_train_batch_size=1,             # Batch size per device during training (small due to limited data/resources)
    save_steps=10_000,                         # Save a checkpoint every N update steps (effectively disabled for this small example)
    save_total_limit=2,                        # Limit the total number of checkpoints on disk
    logging_dir='./logs',                      # Directory for storing logs
    logging_steps=10,                          # Log metrics every X steps
    prediction_loss_only=True,                 # Only calculate loss, not predictions, to save memory
    report_to="none",                          # Disable logging to any platform like wandb
)

# 2. Instantiate DataCollatorForLanguageModeling
# Set mlm=False for causal language models like GPT-2
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False
)

# 3. Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset,
)

# 4. Start the fine-tuning process
print("Starting fine-tuning...")
trainer.train()
print("Fine-tuning completed successfully.")

Starting fine-tuning...




Step,Training Loss
10,2.1451
20,1.6081


Fine-tuning completed successfully.
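With `mlm=False`, the collator pads each batch and derives the labels from the input IDs, replacing padding positions with -100 so the loss ignores them. The pure-Python sketch below mimics that behaviour for illustration only: `collate_causal_lm` is a made-up name, not the library's code, and it is modelled on the collator behaviour in commonly used `transformers` releases. One consequence worth knowing: because the pad token was set to the EOS token (ID 50256), the end-of-text label appears to be masked as well, so the model may get little or no training signal for when to stop.

```python
IGNORE_INDEX = -100   # label value skipped by the cross-entropy loss
PAD_ID = 50256        # assumption: pad token reuses distilgpt2's EOS id


def collate_causal_lm(sequences, pad_id=PAD_ID):
    """Right-pad token id lists and build labels with pad positions masked out."""
    max_len = max(len(seq) for seq in sequences)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids = seq + [pad_id] * n_pad
        batch["input_ids"].append(input_ids)
        batch["attention_mask"].append([1] * len(seq) + [0] * n_pad)
        # Every position equal to pad_id is masked, including a genuine EOS token
        batch["labels"].append(
            [IGNORE_INDEX if tok == pad_id else tok for tok in input_ids]
        )
    return batch


# Two toy sequences that both end with the EOS id
batch = collate_causal_lm([[11167, 62, 3672, 25, 50256], [317, 42339, 50256]])
for key, rows in batch.items():
    print(key, rows)
```

With a batch size of 1 no padding is actually added, but the masking of the trailing EOS label would still apply. Adding a dedicated pad token is one common way to avoid this, at the cost of resizing the embedding matrix.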


In [18]:
print('Saving fine-tuned model...')
# Save the fine-tuned weights explicitly. With save_steps=10_000 no intermediate
# checkpoint is written during this short run, and whether the Trainer writes a
# final checkpoint to output_dir may depend on the transformers version.
# The in-memory `model` object already holds the fine-tuned weights and is used below for generation.
trainer.save_model('./fine_tuned_model_final')
print('Fine-tuned model saved.')

print('\nLoading baseline model for comparison...')
# Load a fresh instance of the pre-trained distilgpt2 for baseline comparison
baseline_model = AutoModelForCausalLM.from_pretrained('distilgpt2')
print('Baseline model loaded.')

# Define a generation function for cleaner code
def generate_text(model, tokenizer, prompt, max_length=50, num_return_sequences=1):
    # Tokenize the prompt; calling the tokenizer returns input_ids and attention_mask
    # (encode_plus is deprecated in favor of calling the tokenizer directly)
    encoded_input = tokenizer(prompt, return_tensors='pt')

    # Move the model and inputs to the GPU if one is available, otherwise use the CPU
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)
    model.eval()  # Disable dropout in case the model is still in training mode after fine-tuning
    input_ids = encoded_input['input_ids'].to(device)
    attention_mask = encoded_input['attention_mask'].to(device)

    # Sample a continuation; max_length counts the prompt tokens as well
    output = model.generate(
        input_ids,
        attention_mask=attention_mask,
        max_length=max_length,
        num_return_sequences=num_return_sequences,
        pad_token_id=tokenizer.eos_token_id,  # Avoids the missing-pad-token warning
        do_sample=True,   # Enable sampling for more diverse outputs
        top_k=50,         # Sample only from the 50 most likely tokens
        top_p=0.95,       # Nucleus sampling
        temperature=0.7   # Values below 1 sharpen the distribution (less adventurous outputs)
    )
    return [tokenizer.decode(g, skip_special_tokens=True) for g in output]

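# Define a prompt for generation.
# Note: this prompt ends with ": " rather than the " description:" prefix used in the
# training template, so it does not match the format the model was fine-tuned on.
# A template-matching alternative would be "product_name: LumiDesk Lamp description:".
prompt = "product_name: LumiDesk Lamp: "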

print(f'\n--- Generating with Baseline Model (Original distilgpt2) ---\nPrompt: {prompt}')
baseline_output = generate_text(baseline_model, tokenizer, prompt)
for i, text in enumerate(baseline_output):
    print(f'Baseline Output {i+1}: {text}')

print(f'\n--- Generating with Fine-tuned Model (distilgpt2) ---\nPrompt: {prompt}')
fine_tuned_output = generate_text(model, tokenizer, prompt)
for i, text in enumerate(fine_tuned_output):
    print(f'Fine-tuned Output {i+1}: {text}')

print('\nQualitative comparison complete. Observe the difference in generated descriptions.')

Saving fine-tuned model...
Fine-tuned model saved.

Loading baseline model for comparison...
Baseline model loaded.

--- Generating with Baseline Model (Original distilgpt2) ---
Prompt: product_name: LumiDesk Lamp: 
Baseline Output 1: product_name: LumiDesk Lamp: _________name: LumiDesk Lamp: _________name: LumiDesk Lamp: _________name: LumiDesk Lamp: _________name: LumiDesk Lamp: 

--- Generating with Fine-tuned Model (distilgpt2) ---
Prompt: product_name: LumiDesk Lamp: 
Fine-tuned Output 1: product_name: LumiDesk Lamp: ivel-prob RedLine Durability and noise-free liquid-in-sensor backpack for professionals and professionals alike. Made in two-way, eight-pack full of ultra-high

Qualitative comparison complete. Observe the difference in generated descriptions.
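
The `top_k`, `top_p` and `temperature` arguments reshape the next-token distribution before sampling. The following is a simplified, pure-Python sketch of that filtering on a toy distribution. The function name and the toy logits are invented for illustration, and the library's actual implementation differs in detail (for example, it operates on tensors of logits).

```python
import math


def filter_next_token_probs(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return renormalised probabilities after temperature, top-k and top-p filtering."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exp = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exp.values())
    ranked = sorted(((tok, e / total) for tok, e in exp.items()),
                    key=lambda pair: pair[1], reverse=True)

    # Top-k: keep only the k most likely tokens, then renormalise
    if top_k > 0:
        ranked = ranked[:top_k]
    mass = sum(prob for _, prob in ranked)
    ranked = [(tok, prob / mass) for tok, prob in ranked]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    norm = sum(prob for _, prob in kept)
    return {tok: prob / norm for tok, prob in kept}


toy_logits = {"lamp": 3.0, "light": 2.0, "desk": 1.0, "banana": -2.0}
probs = filter_next_token_probs(toy_logits, temperature=0.7, top_k=3, top_p=0.95)
print(probs)
```

Because the final token is drawn at random from the filtered distribution, repeated runs of the comparison cell will generally produce different text for the same prompt.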
