<a href="https://colab.research.google.com/github/Rachit23110261/llm-finetuning-lora-4bit/blob/main/DL_H02_23110261.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Installed all requirements


In [None]:
!pip install -q transformers datasets accelerate bitsandbytes peft


### Load Dataset from the Hugging Face datasets Library

- Dataset used -> WikiText-103 (`wikitext-103-v1`), first 10,000 rows of the train split
- Tokenizer used -> AutoTokenizer corresponding to GPT-2 (`gpt2`)
- Tokenized the whole dataset with dataset.map() and a tokenize function
- Language models like GPT are usually trained on fixed-length blocks of tokens. Here, each block is 128 tokens long.
- Attention mask set to 1 for every position, since the fixed-length blocks contain no padding
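The chunking step described above can be sketched on toy data without any library. This is a minimal illustration, not the notebook's exact function: `chunk_token_ids` and the token IDs are made up for the example, and a block size of 4 stands in for 128.

```python
from itertools import chain


def chunk_token_ids(batch_of_ids, block_size):
    """Concatenate tokenized texts and cut them into equal-length blocks."""
    concatenated = list(chain.from_iterable(batch_of_ids))
    # Drop the tail that does not fill a whole block
    total_length = (len(concatenated) // block_size) * block_size
    input_ids = [
        concatenated[i:i + block_size]
        for i in range(0, total_length, block_size)
    ]
    # No padding is added, so every position holds a real token
    attention_mask = [[1] * block_size for _ in input_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy_batch = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]  # made-up token IDs
blocks = chunk_token_ids(toy_batch, block_size=4)
print(blocks["input_ids"])       # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(blocks["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 1]]
```

Note that the ninth token is discarded because it cannot fill a block, and that block boundaries ignore document boundaries.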

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer

# Load dataset (WikiText-103, built from Wikipedia articles: easier to analyze)
dataset = load_dataset("wikitext", "wikitext-103-v1", split="train[:10000]")

# Load the tokenizer corresponding to the GPT-2 model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # For padding compatibility

# Tokenize function
def tokenize_function(example):
    return tokenizer(example["text"])

# Tokenize dataset (do NOT return special tokens mask)
tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])

# Grouping function: split into chunks of 128 tokens
block_size = 128

def group_texts(examples):
    concatenated = sum(examples["input_ids"], [])
    total_length = (len(concatenated) // block_size) * block_size
    result = {
        "input_ids": [concatenated[i:i+block_size] for i in range(0, total_length, block_size)],
        "attention_mask": [[1] * block_size] * (total_length // block_size),
    }
    return result

# Apply chunking
lm_dataset = tokenized_dataset.map(group_texts, batched=True, remove_columns=tokenized_dataset.column_names)

# Check first sample
print(lm_dataset[0])




{'input_ids': [796, 569, 18354, 7496, 17740, 6711, 796, 220, 198, 2311, 73, 13090, 645, 569, 18354, 7496, 513, 1058, 1279, 2954, 29, 17740, 357, 4960, 1058, 10545, 230, 99, 161, 254, 112, 5641, 44444, 9202, 25084, 24440, 12675, 11839, 18, 837, 6578, 764, 569, 18354, 7496, 286, 262, 30193, 513, 1267, 837, 8811, 6412, 284, 355, 569, 18354, 7496, 17740, 6711, 2354, 2869, 837, 318, 257, 16106, 2597, 2488, 12, 31, 2712, 2008, 983, 4166, 416, 29490, 290, 6343, 13, 44206, 329, 262, 14047, 44685, 764, 28728, 287, 3269, 2813, 287, 2869, 837, 340, 318, 262, 2368, 983, 287, 262, 569, 18354, 7496, 2168, 764, 12645, 278, 262, 976, 21748, 286, 16106, 290, 1103, 2488, 12, 31, 640, 11327, 355, 663, 27677, 837, 262, 1621, 4539, 10730, 284, 262], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

### Defining the pretrained model

- Model used -> "EleutherAI/gpt-neo-1.3B"
- Loaded the model in 4-bit precision to save GPU memory
- Added Low-Rank Adaptation (LoRA) with r=8
- r=8: Sets the rank of the adapter matrices (low-rank dimension). Smaller r = fewer trainable params.
- Target Modules: Specify which layers in the transformer get adapters (here the query and value projections, `q_proj` and `v_proj`).
- Task Type="CAUSAL_LM": Sets task type to causal language modeling (next-token prediction).
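To make the effect of `r` concrete, here is a rough count of the parameters LoRA would train under this configuration. It is a back-of-the-envelope sketch: the hidden size (2048) and layer count (24) are assumed values for GPT-Neo 1.3B rather than numbers read from the loaded model, and `lora_trainable_params` is a helper invented for the illustration.

```python
def lora_trainable_params(d_in, d_out, r):
    """LoRA adds A (r x d_in) and B (d_out x r) beside each frozen weight."""
    return r * (d_in + d_out)


# Assumed GPT-Neo 1.3B shape (not read from the model config)
hidden_size = 2048
num_layers = 24
targets_per_layer = 2  # q_proj and v_proj
r = 8

per_module = lora_trainable_params(hidden_size, hidden_size, r)
total = per_module * targets_per_layer * num_layers
frozen_per_module = hidden_size * hidden_size

print(per_module)                      # 32768
print(total)                           # 1572864
print(per_module / frozen_per_module)  # 0.0078125
```

Under these assumptions each adapter is under 1% of the size of the projection it wraps. Calling `model.print_trainable_parameters()` on the PEFT model should report the actual figure.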

In [None]:
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

model_name = "EleutherAI/gpt-neo-1.3B"  # ~1.3B params

# Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=True,
    device_map="auto",
    torch_dtype=torch.float16
)

# Prepare for LoRA fine-tuning
model = prepare_model_for_kbit_training(model)

# LoRA config
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # You can use a broader list depending on the architecture
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)

# Inject LoRA adapters
model = get_peft_model(model, lora_config)


### Calculated accuracy and perplexity for the pretrained model on the chosen dataset (before training)

- Calculated on a small subset of 512 blocks (128 tokens each)
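Perplexity here is the exponential of the average per-token negative log-likelihood (natural log). A minimal sketch with made-up losses, where `perplexity` is a helper written for this illustration:

```python
import math


def perplexity(token_nlls):
    """exp of the mean negative log-likelihood per token."""
    return math.exp(sum(token_nlls) / len(token_nlls))


# A model that spreads probability evenly over 4 candidate tokens
# pays ln(4) per token, so its perplexity is about 4
uniform_nlls = [math.log(4)] * 10
print(perplexity(uniform_nlls))
```

A perplexity of k can be read as the model being, on average, as uncertain as a uniform choice among k tokens.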


In [None]:
import torch
from torch.utils.data import DataLoader
from transformers import DataCollatorForLanguageModeling
import math

# Evaluate a causal LM: perplexity and token-level accuracy
def compute_perplexity_acc(model, dataset, tokenizer):
    # Collator for causal LM (mlm=False): batches examples and builds labels
    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

    # Prepare DataLoader
    eval_dataloader = DataLoader(dataset, batch_size=8, collate_fn=data_collator)

    # Set model to eval mode
    model.eval()

    # Move model to the appropriate device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    total_loss = 0.0
    total_tokens = 0
    correct_predictions = 0

    # No gradient computation needed
    with torch.no_grad():
        for batch in eval_dataloader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = input_ids.clone()

            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            logits = outputs.logits

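            # outputs.loss is a per-batch mean, so weight it by the batch's token count
            total_loss += loss.item() * input_ids.numel()
            total_tokens += input_ids.numel()

            # Compute accuracy: compare logits' argmax with labels.
            # NOTE: logits at position t score the token at position t+1, so this
            # same-position comparison is not next-token accuracy; compare
            # predictions[:, :-1] with labels[:, 1:] to measure that.
            predictions = torch.argmax(logits, dim=-1)
            correct_predictions += (predictions == labels).float().sum().item()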

    # Compute perplexity and accuracy
    avg_loss = total_loss / total_tokens
    perplexity = math.exp(avg_loss)
    accuracy = correct_predictions / total_tokens
    return perplexity, accuracy


perplexity, accuracy = compute_perplexity_acc(model, lm_dataset.select(range(512)), tokenizer)
print(f"Perplexity of Pre_Trained Model: {perplexity:.4f}")
print(f"Accuracy of Pre_trained Model: {accuracy:.4f}")


Perplexity of Pre_Trained Model: 38.6291
Accuracy of Pre_trained Model: 0.0199
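The accuracy printed above comes from comparing each position's argmax with the token at the same position. In a causal LM the output at position t is a guess for the token at position t+1, so this may explain why the value is so low. A toy sketch of the difference, using plain lists and hypothetical helper names:

```python
def next_token_accuracy(predicted_ids, input_ids):
    """predicted_ids[t] is the model's argmax guess for input_ids[t + 1]."""
    shifted_preds = predicted_ids[:-1]  # the last prediction has no target
    shifted_targets = input_ids[1:]     # the first token is never predicted
    correct = sum(p == t for p, t in zip(shifted_preds, shifted_targets))
    return correct / len(shifted_targets)


def same_position_accuracy(predicted_ids, input_ids):
    """Compare each prediction with the token at the same index."""
    correct = sum(p == t for p, t in zip(predicted_ids, input_ids))
    return correct / len(input_ids)


input_ids = [10, 11, 12, 13, 14]     # made-up token IDs
predicted_ids = [11, 12, 13, 99, 0]  # right next-token guess at 3 of 4 positions

print(next_token_accuracy(predicted_ids, input_ids))     # 0.75
print(same_position_accuracy(predicted_ids, input_ids))  # 0.0
```

With tensors, the equivalent shift would be `predictions[:, :-1] == labels[:, 1:]`, with the token count reduced to match.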


### Defined training arguments

- Per-device batch size => 50 (with 4 gradient accumulation steps, 200 samples per optimizer step on one device)
- Number of epochs => 1
- Learning rate => 0.0002
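These settings determine how many optimizer steps one epoch contains. The sketch below assumes a single GPU and assumes that leftover batches which do not fill an accumulation window are dropped (floor division). This matches the `global_step=25` and `epoch` of about 0.99 logged by the training run in this notebook, but it is not guaranteed for every `transformers` version. The sample count is hypothetical; any value from 5,001 to 5,050 gives the same result.

```python
import math


def optimizer_steps_per_epoch(num_samples, per_device_batch_size, grad_accum_steps):
    """Return (batches per epoch, optimizer steps per epoch) for one device."""
    batches = math.ceil(num_samples / per_device_batch_size)
    # Assumption: batches that do not fill an accumulation window are dropped
    steps = max(batches // grad_accum_steps, 1)
    return batches, steps


num_samples = 5040  # hypothetical number of 128-token blocks
batches, steps = optimizer_steps_per_epoch(num_samples, 50, 4)

print(50 * 4)                 # 200 samples per optimizer step
print(batches, steps)         # 101 25
print(steps * 4 / batches)    # fraction of the epoch covered, 100/101
```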

In [None]:
training_args = TrainingArguments(
    output_dir="./gpt-neo-lora-finetuned",
    per_device_train_batch_size=50,
    per_device_eval_batch_size=50,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,
    logging_dir="./logs",
    logging_steps=10,
    save_strategy="epoch",
    evaluation_strategy="epoch",
    report_to="none",  # avoid using wandb
    fp16=True
)




### Trained the model on the WikiText dataset

In [None]:
from transformers import DataCollatorForLanguageModeling

# Data collator pads inputs and creates labels for LM
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset,
    eval_dataset=lm_dataset.select(range(512)),  # small subset for eval (drawn from the training data)
    data_collator=data_collator
)

trainer.train()


No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
  return fn(*args, **kwargs)


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 0 | 3.05 | 3.102378 |


TrainOutput(global_step=25, training_loss=3.0575904083251952, metrics={'train_runtime': 507.9804, 'train_samples_per_second': 9.861, 'train_steps_per_second': 0.049, 'total_flos': 4646507642880000.0, 'train_loss': 3.0575904083251952, 'epoch': 0.9900990099009901})

### Saved the trained model

In [None]:
model.save_pretrained("./gpt-neo-lora-finetuned")
tokenizer.save_pretrained("./gpt-neo-lora-finetuned")


('./gpt-neo-lora-finetuned/tokenizer_config.json',
 './gpt-neo-lora-finetuned/special_tokens_map.json',
 './gpt-neo-lora-finetuned/vocab.json',
 './gpt-neo-lora-finetuned/merges.txt',
 './gpt-neo-lora-finetuned/added_tokens.json',
 './gpt-neo-lora-finetuned/tokenizer.json')

### Loaded the saved model and calculated perplexity and accuracy for the fine-tuned model

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer from the saved directory
model = AutoModelForCausalLM.from_pretrained("./gpt-neo-lora-finetuned")
tokenizer = AutoTokenizer.from_pretrained("./gpt-neo-lora-finetuned")

# Move model to CUDA if available
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)


In [None]:
perplexity, accuracy = compute_perplexity_acc(model, lm_dataset.select(range(512)), tokenizer)
print(f"Perplexity of Fine_tuned Model: {perplexity:.4f}")
print(f"Accuracy of Fine_tuned Model: {accuracy:.4f}")

Perplexity of Fine_tuned Model: 21.5071
Accuracy of Fine_tuned Model: 0.0106
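Because perplexity is the exponential of the average loss, the two reported perplexities can be converted back to losses and compared directly. The short calculation below uses only the numbers printed in this notebook. Keep in mind that the 512 evaluation blocks are also part of the training set, so the drop reflects fit to text the model has seen rather than held-out performance.

```python
import math

ppl_before = 38.6291  # reported for the pre-trained model
ppl_after = 21.5071   # reported for the fine-tuned model

loss_before = math.log(ppl_before)
loss_after = math.log(ppl_after)
relative_drop = 1 - ppl_after / ppl_before

print(f"average loss: {loss_before:.3f} -> {loss_after:.3f}")  # 3.654 -> 3.068
print(f"perplexity reduction: {relative_drop:.1%}")            # 44.3%
```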


### Qualitative Analysis on fine-tuned model and pre-trained model

In [None]:
import torch
import os
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline
from accelerate import init_empty_weights, infer_auto_device_map
from transformers import AutoConfig

# Create a folder for offloading
os.makedirs("/content/offload", exist_ok=True)

# Define BitsAndBytes quantization config
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")
tokenizer.pad_token = tokenizer.eos_token

# Step 1: Create an empty model to calculate device map
config = AutoConfig.from_pretrained("EleutherAI/gpt-neo-1.3B")
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)
    empty_model.tie_weights()  # fix: tie weights before infer_auto_device_map

device_map = infer_auto_device_map(
    empty_model,
    max_memory={
        0: "8GiB",  # Adjust depending on your GPU
        "cpu": "30GiB"
    },
    no_split_module_classes=["GPTNeoBlock"]
)

# Step 2: Load the quantized model with device map and offload folder
pretrained_model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neo-1.3B",
    device_map=device_map,
    quantization_config=quant_config,
    offload_folder="/content/offload"
)

# Step 3: Load LoRA finetuned model similarly
finetuned_model = AutoModelForCausalLM.from_pretrained(
    "/content/gpt-neo-lora-finetuned",
    device_map=device_map,
    quantization_config=quant_config,
    offload_folder="/content/offload"
)

# Step 4: Create pipelines WITHOUT passing `device=0` (accelerate handles it)
generator_pretrained = pipeline("text-generation", model=pretrained_model, tokenizer=tokenizer)
generator_finetuned = pipeline("text-generation", model=finetuned_model, tokenizer=tokenizer)

# Step 5: Define prompts and generate
prompts = [
    "The football world cup should be won by",
    "We should do",
    "The Meal should we take is",
    "A healthy person is a person who",
    "The best cricket player is",
    "In a disaster we should",
    "We should do our assignment on",
    "The deadline is near I should",
    "Success comes from",
    "Taj Mahal was built by"
]

outputs_pretrained = [generator_pretrained(prompt, max_length=50, do_sample=True)[0]['generated_text'] for prompt in prompts]
outputs_finetuned = [generator_finetuned(prompt, max_length=50, do_sample=True)[0]['generated_text'] for prompt in prompts]

# Step 6: Print results
print("\n=== COMPARISON ===")
for i, prompt in enumerate(prompts):
    print(f"\nPrompt {i+1}: {prompt}")
    print(f"🔹 Pretrained: {outputs_pretrained[i]}")
    print(f"🔸 Finetuned : {outputs_finetuned[i]}")


Device set to use cuda:0
Device set to use cuda:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.



=== COMPARISON ===

Prompt 1: The football world cup should be won by
🔹 Pretrained: The football world cup should be won by the Netherlands. The game is a wonderful display of teamwork, skill and creativity. But when you lose to England you should remember the Dutch are a team that never give up.

There’s more
🔸 Finetuned : The football world cup should be won by the team with the most points after the quarter-finals. If a team has a better goal difference at the end of the game, they should have a higher chance of winning the cup but it should be treated

Prompt 2: We should do
🔹 Pretrained: We should do that to some extent. I think
that's how we should have
responded to that.
The most important thing we
do is, we take out one
of the most important
areas, one of the most
🔸 Finetuned : We should do the following :

• We should not use any new or existing funding, such as the UK National Lottery, for a project to benefit any other charities while taking up a new grant;

• We should not

### Human Evaluation Scores:

| Prompt | Fluency (Pre) | Fluency (Fine) | Relevance (Pre) | Relevance (Fine) | Correctness (Pre) | Correctness (Fine) |
|--------|---------------|----------------|------------------|-------------------|---------------------|----------------------|
| The football world cup should be won by | 4 | 4 | 2 | 4 | 3 | 4 |
| We should do | 3 | 4 | 2 | 4 | 2 | 4 |
| The Meal should we take is | 3 | 4 | 2 | 3 | 2 | 3 |
| A healthy person is a person who | 5 | 5 | 4 | 5 | 4 | 5 |
| The best cricket player is | 4 | 4 | 3 | 4 | 3 | 4 |
| In a disaster we should | 3 | 3 | 2 | 3 | 2 | 2 |
| We should do our assignment on | 3 | 4 | 3 | 4 | 3 | 4 |
| The deadline is near I should | 4 | 4 | 3 | 3 | 3 | 3 |
| Success comes from | 4 | 4 | 2 | 4 | 2 | 2 |
| Taj Mahal was built by | 4 | 4 | 4 | 4 | 1 | 1 |
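
The per-prompt scores can be summarised as column means. The snippet below simply averages the table's values; the variable names are chosen for the illustration.

```python
# One row per prompt, in table order:
# (fluency pre, fluency fine, relevance pre, relevance fine,
#  correctness pre, correctness fine)
scores = [
    (4, 4, 2, 4, 3, 4),
    (3, 4, 2, 4, 2, 4),
    (3, 4, 2, 3, 2, 3),
    (5, 5, 4, 5, 4, 5),
    (4, 4, 3, 4, 3, 4),
    (3, 3, 2, 3, 2, 2),
    (3, 4, 3, 4, 3, 4),
    (4, 4, 3, 3, 3, 3),
    (4, 4, 2, 4, 2, 2),
    (4, 4, 4, 4, 1, 1),
]
columns = [
    "Fluency (Pre)", "Fluency (Fine)",
    "Relevance (Pre)", "Relevance (Fine)",
    "Correctness (Pre)", "Correctness (Fine)",
]

means = {
    name: sum(row[i] for row in scores) / len(scores)
    for i, name in enumerate(columns)
}
for name, value in means.items():
    print(f"{name}: {value:.1f}")
# Fluency 3.7 -> 4.0, Relevance 2.7 -> 3.8, Correctness 2.5 -> 3.2
```

On these ten prompts the fine-tuned model's mean is higher in all three categories, with the largest gain in relevance.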
