# Finetuning an LLM

### 🔧 Inputs Required for Finetuning

The following inputs are essential for the finetuning process:

### 1. Dataset
- High-quality training data
- Input/output pairs for supervised learning

### 2. Model and Training Arguments
- Pre-trained model selection
- Learning parameters configuration
- Training environment setup




## 1. Dataset

In [None]:
prompts = [
    "Explain quantum computing in simple terms.",
    "What are the benefits of renewable energy?",
    "Describe the process of photosynthesis.",
    "What is the significance of the Renaissance?",
    "How does the human immune system work?"
]
responses = [
    "Quantum computing uses quantum bits, or qubits, to perform calculations. Unlike classical bits that are either 0 or 1, qubits can exist in multiple states simultaneously, allowing quantum computers to solve certain complex problems faster.",
    "Renewable energy, such as solar and wind, reduces greenhouse gas emissions, decreases air pollution, and conserves natural resources. It also promotes energy independence and sustainability.",
    "Photosynthesis is the process by which green plants use sunlight to make food from carbon dioxide and water. It occurs in the chloroplasts, producing oxygen as a byproduct.",
    "The Renaissance was a cultural movement from the 14th to the 17th century, characterized by a renewed interest in classical art, science, and philosophy. It led to significant advancements in many fields and a shift towards humanism.",
    "The human immune system protects the body from infections and diseases. It consists of physical barriers, immune cells, and proteins that identify and destroy pathogens like bacteria and viruses."
]
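
Because the two lists are matched by position, a quick sanity check before training can catch a missing or empty entry. The sketch below is an illustration, not part of the original pipeline: `check_pairs` is a hypothetical helper, and the sample strings are shortened stand-ins for the lists above.

```python
def check_pairs(prompts, responses):
    """Return (prompt, response) tuples, raising if the lists do not line up."""
    if len(prompts) != len(responses):
        raise ValueError(
            f"{len(prompts)} prompts but {len(responses)} responses"
        )
    pairs = []
    for i, (prompt, response) in enumerate(zip(prompts, responses)):
        # Reject empty or whitespace-only strings, which add no training signal
        if not prompt.strip() or not response.strip():
            raise ValueError(f"empty prompt or response at index {i}")
        pairs.append((prompt, response))
    return pairs


# Shortened stand-ins for the lists defined above
sample_prompts = ["Describe the process of photosynthesis."]
sample_responses = ["Photosynthesis is the process by which green plants use sunlight to make food."]
pairs = check_pairs(sample_prompts, sample_responses)
print(len(pairs))
```

Calling `check_pairs(prompts, responses)` on the full lists would work the same way.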

## 2. Model and Training Arguments

In [None]:
model_name='meta-llama/Llama-3.2-1B'
output_path='./finetuned_model'

In [None]:
training_args = {
    "overwrite_output_dir": True,        # 🔄 Overwrite the output directory if it already exists
    "eval_strategy": "no",               # 📊 Evaluation strategy; "no" skips evaluation during training
    "learning_rate": 2e-5,               # 📈 Learning rate for the optimizer
    "per_device_train_batch_size": 1,    # 📦 Number of samples processed per device per training step
    "gradient_accumulation_steps": 4,    # 🔄 Number of batches to accumulate gradients over before each weight update
    "num_train_epochs": 3,               # 🔁 Number of complete passes through the training dataset
    "weight_decay": 0.01,                # ⚖️ Weight decay coefficient, a regularizer that helps limit overfitting
    "fp16": True,                        # 🚀 Mixed precision training for faster computation (requires a supported GPU)
    "gradient_checkpointing": True,      # 💾 Trade extra compute for lower memory use
}
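
To see how the batch size, gradient accumulation, and epoch settings interact, the following sketch estimates the effective batch size and the number of optimizer steps. It is a rough approximation under stated assumptions: `training_schedule` is a hypothetical helper, the example count of 4 assumes an 80% split of the five prompts, and the exact rounding the `Trainer` applies may differ between library versions.

```python
import math


def training_schedule(num_examples, per_device_batch_size, grad_accum_steps,
                      num_epochs, num_devices=1):
    """Estimate the effective batch size and total optimizer steps."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    # Batches each device processes per epoch
    batches_per_epoch = math.ceil(
        num_examples / (per_device_batch_size * num_devices)
    )
    # One optimizer step per grad_accum_steps batches, at least one per epoch
    steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return effective_batch, steps_per_epoch * num_epochs


effective_batch, total_steps = training_schedule(
    num_examples=4, per_device_batch_size=1, grad_accum_steps=4, num_epochs=3
)
print(effective_batch, total_steps)  # 4 3
```

With so few optimizer steps, this configuration is only suitable as a smoke test of the pipeline; a real run needs far more data.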

---

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments, DataCollatorForSeq2Seq
from huggingface_hub import login
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset
import torch
import os

### 📦 Custom Dataset Class
- 🔧 A `torch.utils.data.Dataset` subclass that wraps the tokenized inputs and labels
- 🔄 Returns one example (`input_ids`, `attention_mask`, `labels`) per index; the `Trainer` handles batching

In [None]:
class CustomDataset(Dataset):
    def __init__(self, inputs, labels):
        self.inputs = inputs
        self.labels = labels

    def __len__(self):
        return len(self.inputs["input_ids"])

    def __getitem__(self, idx):
        return {
            "input_ids": self.inputs["input_ids"][idx],
            "attention_mask": self.inputs["attention_mask"][idx],
            "labels": self.labels[idx]
        }

### 🔧 Finetuner Class
- 🚀 Handles the complete finetuning pipeline
- 🤖 Manages model and tokenizer initialization
- 📊 Processes and prepares training data
- ⚙️ Configures training parameters
- 📈 Executes model training and saves results

In [None]:
class finetuner:
  def __init__(self, model_name):
    # Read the Hugging Face access token from the environment; never hardcode it in a notebook
    login(token=os.environ.get("HF_TOKEN"))
    self.tokenizer = AutoTokenizer.from_pretrained(model_name)
    self.model = AutoModelForCausalLM.from_pretrained(model_name)

  def enable_gradient_checkpointing(self):
    self.model.gradient_checkpointing_enable()  # Enable gradient checkpointing

  def pad_tokenizer(self):
    if self.tokenizer.pad_token is None:
      self.tokenizer.pad_token = self.tokenizer.eos_token

  def tokenize_data(self):
    max_length = 128  # Long enough to hold a prompt plus its response without truncation
    # A causal LM learns to continue its own input, so each example is prompt + response + EOS
    texts = [p + "\n" + r + self.tokenizer.eos_token for p, r in zip(self.inputs, self.outputs)]
    self.tokenized_inputs = self.tokenizer(texts, padding="max_length", truncation=True, max_length=max_length, return_tensors="pt")
    # Labels are a copy of input_ids; the model shifts them by one position internally
    self.tokenized_labels = self.tokenized_inputs["input_ids"].clone()
    # Mask padding by attention_mask (pad and EOS share an ID) so it is ignored in the loss
    self.tokenized_labels[self.tokenized_inputs["attention_mask"] == 0] = -100

  def create_dataset(self,indices):
     inputs={
                'input_ids':self.tokenized_inputs["input_ids"][indices],
                'attention_mask':self.tokenized_inputs["attention_mask"][indices]
            }
     labels=self.tokenized_labels[indices]
     return CustomDataset(inputs,labels)
    
  def split_dataset(self):
    indices = list(range(len(self.tokenized_inputs["input_ids"])))
    train_indices, val_indices = train_test_split(indices, test_size=self.test_size, random_state=self.random_seed)
    self.train_dataset = self.create_dataset(train_indices)
    self.eval_dataset = self.create_dataset(val_indices)

  def prepare_dataset(self,inputs,outputs):
    self.inputs=inputs
    self.outputs=outputs
  
  def collate_data(self):
    self.data_collator = DataCollatorForSeq2Seq(
      tokenizer=self.tokenizer,
      model=self.model,
      padding=True)
  
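  def prepare_training_Args(self,output_path):
    os.environ["WANDB_DISABLED"] = "true"  # Turn off Weights & Biases logging
    # Note: reads the module-level training_args dict defined in an earlier cell
    self.training_args = TrainingArguments(output_dir=output_path,**training_args)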
    self.trainer = Trainer( 
                            model=self.model,
                            args=self.training_args,
                            train_dataset=self.train_dataset,
                            eval_dataset=self.eval_dataset,
                            data_collator=self.data_collator)
    
  def train(self):
    torch.cuda.empty_cache()
    try:
        self.trainer.train()
    except ValueError as e:
        print("\nError during training:")
        print(e)
        raise  # Re-raise so run() does not go on to save an untrained model

  def save_model(self):
    self.model.save_pretrained(self.output_path)
    self.tokenizer.save_pretrained(self.output_path)

  def run(self,inputs,outputs,output_path,train_size=0.8,random_seed=42):
    self.random_seed=random_seed
    self.test_size=1-train_size
    self.output_path=output_path
    self.enable_gradient_checkpointing()
    self.pad_tokenizer()
    self.prepare_dataset(inputs,outputs)
    self.tokenize_data()
    self.prepared_dataset = CustomDataset(self.tokenized_inputs, self.tokenized_labels)
    self.collate_data()
    self.split_dataset()
    self.prepare_training_Args(output_path)
    self.train()
    self.save_model()
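
For a causal language model, `input_ids` and `labels` normally come from the same token sequence (the prompt followed by the response), with padding positions set to `-100` so they do not contribute to the loss. The sketch below shows that layout on plain Python lists. It is an illustration only: `build_example` is a hypothetical helper, the token IDs are made up, and it trains on the prompt tokens as well as the response.

```python
IGNORE_INDEX = -100  # Label value that PyTorch's cross-entropy loss ignores by default


def build_example(prompt_ids, response_ids, eos_id, pad_id, max_length):
    """Build input_ids, attention_mask and labels for one causal-LM example."""
    # Concatenate prompt, response and EOS, then truncate to max_length
    ids = (prompt_ids + response_ids + [eos_id])[:max_length]
    n_real = len(ids)
    n_pad = max_length - n_real
    input_ids = ids + [pad_id] * n_pad
    attention_mask = [1] * n_real + [0] * n_pad
    # Labels mirror input_ids; the model shifts them by one position internally.
    # Padding is masked by position, so a real EOS survives even when pad_id == eos_id.
    labels = ids + [IGNORE_INDEX] * n_pad
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
    }


# Made-up token IDs; pad_id equals eos_id, as when the pad token is set to EOS
example = build_example([11, 12], [21, 22, 23], eos_id=2, pad_id=2, max_length=8)
print(example["labels"])  # [11, 12, 21, 22, 23, 2, -100, -100]
```

Masking by position rather than by token ID matters here because the pad token is set to the EOS token, and masking every EOS ID would also hide the real end-of-sequence marker.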

### 🔧 Fine-tuning the Model

This section handles the model fine-tuning process using custom training data. The fine-tuning will adapt the base model to better handle our specific use case.

Key steps:
- Initialize the fine-tuning process
- Train on custom dataset
- Save the fine-tuned model


In [None]:
finetuner_instance = finetuner(model_name)
finetuner_instance.run(prompts,responses,output_path)
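
Once training has finished, the saved directory can be loaded back for a quick check of the model's output. The following is a minimal sketch rather than part of the original notebook: `generate_reply` is a hypothetical helper, `max_new_tokens=64` is an arbitrary choice, and the imports sit inside the function so it can be defined without loading the libraries.

```python
def generate_reply(model_dir, prompt, max_new_tokens=64):
    """Load a saved checkpoint and generate a continuation for one prompt."""
    # Imported lazily so defining the function does not require the libraries
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir)
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            pad_token_id=tokenizer.eos_token_id,  # Avoids a missing-pad-token warning
        )
    # The decoded text includes the prompt followed by the generated continuation
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call (assumes training has finished and written ./finetuned_model):
# print(generate_reply("./finetuned_model", "Describe the process of photosynthesis."))
```

With only five training examples, the output is unlikely to differ much from the base model; the call mainly confirms that the saved files load correctly.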