<a href="https://colab.research.google.com/github/RDGopal/IB9LQ0-GenAI/blob/main/Fine_tuning_LLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Fine-tuning an LLM
Fine-tuning refers to the process of taking a pre-trained Large Language Model (LLM) and training it further on a smaller, specialized dataset to adapt it for a specific task or domain. Exposing the model to domain-specific data can improve its accuracy on specialized tasks such as classification, drafting legal contracts, or processing medical diagnoses.


##Full Fine-tuning
Full fine-tuning involves taking a pre-trained language model and updating all of its trainable parameters (weights and biases) on your new, task-specific dataset. The entire model architecture remains the same; only the values of the weights are adjusted during the training process to better perform on the downstream task.

While full fine-tuning is a straightforward approach, it can be computationally expensive (especially for large models) and may lead to overfitting on small datasets.
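
To get a feel for why full fine-tuning is expensive, the sketch below estimates the memory needed just to hold the model state during training. It assumes 32-bit weights and gradients plus the two moment buffers kept by an Adam-style optimizer (about 16 bytes per parameter) and ignores activations, so it is a rough lower bound rather than a measurement. The figure of roughly 82 million parameters for `distilgpt2` and the helper name `full_finetune_state_bytes` are assumptions made for illustration.

```python
# Rough training-memory estimate for full fine-tuning (illustrative assumptions only).
BYTES_PER_FP32 = 4

def full_finetune_state_bytes(num_params: int) -> int:
    """Bytes for weights + gradients + two Adam moment buffers, all stored in fp32."""
    weights = num_params * BYTES_PER_FP32
    gradients = num_params * BYTES_PER_FP32
    adam_moments = 2 * num_params * BYTES_PER_FP32
    return weights + gradients + adam_moments

approx_params = 82_000_000  # assumed, approximate size of distilgpt2
estimate_gb = full_finetune_state_bytes(approx_params) / 1e9
print(f"Approximate training state: {estimate_gb:.2f} GB")
```

The same arithmetic applied to a model with billions of parameters gives tens of gigabytes before any activations are stored, which is the main motivation for the parameter-efficient methods described next.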


##Parameter-Efficient Fine-Tuning (PEFT)
These methods aim to achieve performance comparable to full fine-tuning while only training a small fraction of the model's parameters. This significantly reduces computational cost and storage requirements. Some popular PEFT techniques include:

* Low-Rank Adaptation (LoRA): Introduces small, low-rank matrices alongside the original weights. Only these low-rank matrices are trained, while the original pre-trained weights are kept frozen.


* QLoRA (Quantized Low-Rank Adaptation): Combines quantization (storing the frozen base weights at reduced precision, typically 4-bit) with LoRA to further reduce the memory footprint during training.

Two related techniques are often discussed alongside PEFT. They describe the training data or objective rather than how many parameters are updated, and they can be combined with either full fine-tuning or PEFT:

* Instruction Tuning: A type of fine-tuning in which the model is trained on a dataset of tasks described in natural-language instructions. The goal is to make the model better at following instructions and at generalizing to tasks it has not seen during fine-tuning. The training data is often formatted as (instruction, input, output) triplets.

* Reinforcement Learning from Human Feedback (RLHF): Although not a fine-tuning method in the traditional supervised sense, RLHF is a key technique for aligning large language models with human preferences. It typically involves three stages: supervised fine-tuning, training a reward model on human comparisons of different model outputs, and using reinforcement learning to optimize the language model to maximize that reward.
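
To make the LoRA idea from the list above concrete, here is a minimal sketch on plain Python lists. It assumes the usual LoRA formulation, in which the adapted layer behaves like W + (alpha / r) · B · A, with W frozen and B initialized to zero. The helper names and all numeric values are hypothetical and chosen only so the arithmetic is easy to follow; this is not how a library implements it.

```python
# Minimal LoRA weight update on plain Python lists (illustrative, not a library API).

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(w, lora_b, lora_a, alpha, r):
    """Return W + (alpha / r) * B @ A, the weight the adapted layer behaves like."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Frozen 2x3 pre-trained weight and a rank-1 adapter (B is 2x1, A is 1x3).
w = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
lora_a = [[0.5, -0.5, 1.0]]
lora_b_init = [[0.0], [0.0]]      # B starts at zero, so training begins from W unchanged
lora_b_trained = [[1.0], [2.0]]   # hypothetical values after some training

w_start = lora_effective_weight(w, lora_b_init, lora_a, alpha=2, r=1)
w_after = lora_effective_weight(w, lora_b_trained, lora_a, alpha=2, r=1)
print(w_start)
print(w_after)
```

Only `lora_a` and `lora_b` would receive gradient updates; `w` is never modified, which is why the adapter can be stored and shared separately from the base model.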


#Full Fine-tuning
We will first conduct full fine-tuning and then explore PEFT.

##Install Necessary Libraries
* transformers: For loading pre-trained models and tokenizers and for training them.
* datasets: For loading and managing datasets.

In [None]:
!pip install transformers datasets
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from datasets import load_dataset, DatasetDict
import random
import torch

##Load a Small Sample of the Dataset:
We will load the emotion dataset and then take a small random sample.

In [None]:
!pip install datasets --upgrade
# "dair-ai/emotion" is the namespaced Hub ID of the dataset formerly loaded as "emotion"
dataset_name = "dair-ai/emotion"
full_dataset = load_dataset(dataset_name)

In [None]:
full_dataset

In [None]:
from IPython.display import Markdown

In [None]:
display(Markdown(str(full_dataset['train'][:5])))

###Take 100 rows of data

In [None]:
# Sample 50 rows from train and 25 each from validation and test, then combine them
from datasets import concatenate_datasets

train_subset = full_dataset["train"].shuffle(seed=42).select(range(50))
features = train_subset.features  # cast the other splits to the same schema before concatenating
validation_subset = full_dataset["validation"].shuffle(seed=42).select(range(25)).cast(features)
test_subset = full_dataset["test"].shuffle(seed=42).select(range(25)).cast(features)

# Final shuffled sample of 100 rows
sampled_dataset = concatenate_datasets([train_subset, validation_subset, test_subset]).shuffle(seed=42)

In [None]:
display(Markdown(f"Number of examples in the sampled dataset: {len(sampled_dataset)}"))
display(Markdown(str(sampled_dataset[:5])))

In [None]:
import pandas as pd

# Convert the dataset to a Pandas DataFrame
df = sampled_dataset.to_pandas()

# Display the DataFrame
display(df)

#Load a Tiny Pre-trained Language Model and Tokenizer

In [None]:
model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = model.config.eos_token_id

##Architecture of the Base Model

In [None]:
# Print the name of every parameter tensor (each layer contributes one or more)
for name, param in model.named_parameters():
    print(name)

##Number of Parameters in the Base Model

In [None]:
# Calculate the total number of parameters
total_params = sum(p.numel() for p in model.parameters())

# Print the result
print(f"Total number of parameters: {total_params}")

#Preprocess the Dataset:
Now, we need to format our data for the language model. We format our data as:

emotion: `[emotion_label]` text: `[text]`

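The model will learn to associate the emotion label with the style of the text. The dataset stores each label as an integer class ID, so we convert it to its name (for example, `joy`) before building the string; the prompts used at test time must use these same names.

In [None]:
# ClassLabel feature that converts integer class IDs to names such as "joy" or "sadness"
label_feature = sampled_dataset.features["label"]

def preprocess_function(examples):
    inputs = [f"emotion: {label_feature.int2str(label)} text: {text}{tokenizer.eos_token}" for text, label in zip(examples['text'], examples['label'])]
    tokenized_inputs = tokenizer(inputs, truncation=True, padding='max_length', max_length=128)
    # Labels are a copy of input_ids (the model shifts them internally for next-token prediction);
    # padded positions are set to -100 so that the loss ignores them
    tokenized_inputs["labels"] = [
        [tok if mask == 1 else -100 for tok, mask in zip(ids, attn)]
        for ids, attn in zip(tokenized_inputs["input_ids"], tokenized_inputs["attention_mask"])
    ]
    return tokenized_inputs

# Drop the raw text/label columns so that only model inputs reach the Trainer
tokenized_dataset = sampled_dataset.map(preprocess_function, batched=True, remove_columns=sampled_dataset.column_names)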
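
The relationship between `input_ids` and `labels` in a causal language model can be sketched without any library. The model shifts the labels by one position internally, so each token is scored on predicting the token that follows it, and a label of -100 is the conventional value that the PyTorch and Transformers loss skips. The token IDs and helper names below are made up for illustration.

```python
IGNORE_INDEX = -100  # label value skipped by the loss in PyTorch / Transformers

def build_labels(input_ids, attention_mask):
    """Copy input_ids, replacing padded positions (mask == 0) with IGNORE_INDEX."""
    return [tok if mask == 1 else IGNORE_INDEX for tok, mask in zip(input_ids, attention_mask)]

def next_token_targets(input_ids, labels):
    """Pair each input token with the label one step ahead, as the model does internally."""
    pairs = zip(input_ids[:-1], labels[1:])
    return [(tok, target) for tok, target in pairs if target != IGNORE_INDEX]

# Made-up token IDs: three real tokens, an end-of-text token (99), then two pads (also 99).
input_ids = [11, 12, 13, 99, 99, 99]
attention_mask = [1, 1, 1, 1, 0, 0]

labels = build_labels(input_ids, attention_mask)
print(labels)
print(next_token_targets(input_ids, labels))
```

Because the padding token here is the same as the end-of-text token, the attention mask is the only way to tell the real end-of-text token (kept as a target) from the padding that follows it (ignored).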

##View Prepared Data

In [None]:
for i in range(5):  # Print the first 5 examples
    example = tokenized_dataset[i]
    decoded_text = tokenizer.decode(example['input_ids'])
    print(f"Example {i + 1}:")
    print(decoded_text)
    print(example)  # Print the full dictionary for the example
    print("-" * 20)

#Define Training Arguments

In [None]:
training_args = TrainingArguments(
    output_dir="./fine_tuned_emotion_model",
    per_device_train_batch_size=4,
    num_train_epochs=10,
    logging_dir="./logs",
    report_to="none",
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_steps=100,
    save_strategy="epoch"
)
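
It can help to translate these arguments into optimizer steps. The sketch below assumes the 100-example sample from earlier, a single device, and no gradient accumulation (the default), so the numbers would change on multi-GPU setups. The helper name `training_schedule` is hypothetical.

```python
import math

def training_schedule(num_examples, batch_size, num_epochs, warmup_steps):
    """Return (steps per epoch, total steps, fraction of steps spent in warmup)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * num_epochs
    return steps_per_epoch, total_steps, warmup_steps / total_steps

steps_per_epoch, total_steps, warmup_fraction = training_schedule(
    num_examples=100, batch_size=4, num_epochs=10, warmup_steps=100
)
print(steps_per_epoch, total_steps, f"{warmup_fraction:.0%}")
```

Under these assumptions the learning rate is still ramping up for a large share of the run, which is worth keeping in mind when changing `num_train_epochs` or `warmup_steps`.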

#Create the Trainer

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    tokenizer=tokenizer,
    # Build each batch with tokenizer.pad, padding to 128 tokens and returning PyTorch tensors
    data_collator=lambda data: tokenizer.pad(data, padding='max_length', max_length=128, return_tensors='pt')
)

#Fine-tune the Model

In [None]:
trainer.train()

#Test the Fine-tuned Model

In [None]:
emotion_to_generate = "joy"  # must be one of the dataset's label names: sadness, joy, love, anger, fear, surprise
prompt = f"emotion: {emotion_to_generate} text:"
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)

output = model.generate(input_ids, max_length=50, num_return_sequences=1, pad_token_id=tokenizer.eos_token_id)
generated_text = tokenizer.decode(output[:, input_ids.shape[-1]:][0], skip_special_tokens=True)

display(Markdown(f"Prompt: {prompt}"))
display(Markdown(f"Fine-tuned Model (Emotion: {emotion_to_generate}) Response: {generated_text}"))

#Parameter-Efficient Fine-Tuning (PEFT)
We will now explore PEFT. We will continue to use the same dataset and the same LLM.


In [None]:
!pip install accelerate peft

In [None]:
from peft import LoraConfig, get_peft_model
import torch

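**The steps for loading and preparing the data and for loading the LLM and tokenizer remain the same as before. Reload the base model before applying LoRA; otherwise the adapters are attached to the model that was already fully fine-tuned above.**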

##Define the LoRA Configuration

In [None]:
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM", # Specify the task type
    target_modules=["attn.c_attn", "attn.c_proj"], # Adjust based on the model architecture
)

# Wrap the base model with the LoRA adapters
model = get_peft_model(model, lora_config)
model.print_trainable_parameters() # See how many parameters are trainable
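
The count reported by `print_trainable_parameters()` can be sanity-checked by hand. Each LoRA adapter adds `r * (in_features + out_features)` parameters. The sketch assumes that `distilgpt2` has 6 transformer blocks with a hidden size of 768, that `attn.c_attn` maps the hidden state to fused query, key, and value projections (3 × 768 outputs), and that `bias="none"` adds no further trainable parameters. If the number printed by PEFT differs, one of these assumptions does not hold for the model in use.

```python
def lora_params(in_features, out_features, r):
    """Parameters added by one LoRA adapter: A is r x in_features, B is out_features x r."""
    return r * (in_features + out_features)

hidden = 768      # assumed hidden size of distilgpt2
num_layers = 6    # assumed number of transformer blocks in distilgpt2
r = 8             # rank used in the LoRA configuration

per_layer = (
    lora_params(hidden, 3 * hidden, r)   # attn.c_attn: fused query, key, value projection
    + lora_params(hidden, hidden, r)     # attn.c_proj: attention output projection
)
total_trainable = per_layer * num_layers
print(f"LoRA trainable parameters: {total_trainable:,}")
```

Compared with the total parameter count computed for the base model earlier, this is well under one percent of the weights.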

##Define Training Arguments and Create the Trainer

In [None]:
training_args = TrainingArguments(
    output_dir="./lora_fine_tuned_emotion_model",
    per_device_train_batch_size=4,
    num_train_epochs=30,
    logging_dir="./logs",
    report_to="none",
    learning_rate=2e-4, # LoRA is usually trained with a higher learning rate than full fine-tuning (here 10x the 2e-5 used above)
    weight_decay=0.01,
    warmup_steps=100,
    save_strategy="epoch"
)

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    tokenizer=tokenizer,
    # Build each batch with tokenizer.pad, padding to 128 tokens and returning PyTorch tensors
    data_collator=lambda data: tokenizer.pad(data, padding='max_length', max_length=128, return_tensors='pt')
)

##Train the Model

In [None]:
trainer.train()

##Test the Fine-tuned Model

In [None]:
def generate_for_emotion(lm, emotion, max_length=500):
    """Generate a continuation for the prompt 'emotion: <emotion> text:'."""
    prompt = f"emotion: {emotion} text:"
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(lm.device)
    output = lm.generate(input_ids=input_ids, max_length=max_length, num_return_sequences=1, pad_token_id=tokenizer.eos_token_id)
    # Decode only the newly generated tokens, not the prompt
    return prompt, tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True)

for emotion in ["joy", "sadness"]:
    prompt, response = generate_for_emotion(model, emotion)
    print(f"\nPrompt: {prompt}")
    print(f"LoRA Fine-tuned Model (Emotion: {emotion}) Response: {response}")

# You can also test the original model for comparison
original_model = AutoModelForCausalLM.from_pretrained(model_name).to(model.device)
_, original_response = generate_for_emotion(original_model, "joy", max_length=50)
print(f"\nOriginal Model Response (with 'joy' prompt): {original_response}")

#Your turn
1. Fine-tune based on the data `AI_Human_Review.csv`. The first column contains AI-written text and the second column contains the corresponding human-written text. The objective of fine-tuning is to take AI-written text and output its human-written equivalent. Fine-tune based on a sample of 100 data points.
2. Fine-tune based on the data `fakenews100.csv`. The first column contains text and the second column indicates whether it is fake or not. The objective of fine-tuning is to take input text and identify whether it is fake news. Fine-tune based on a sample of 100 data points.