# Instruction Finetuning

In this notebook, we look at how to perform instruction finetuning. We use full finetuning, i.e., we retrain all the parameters of the model.
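
To get a feel for what full finetuning costs, here is a rough back-of-the-envelope memory estimate. It assumes about 1.1 billion parameters (taken from the model name), fp32 weights and gradients, and an AdamW-style optimizer that keeps two fp32 states per parameter. Activations and framework overhead are ignored, so treat the number as a lower bound rather than a measurement.

```python
# Rough memory estimate for full finetuning (assumptions, not measurements).
NUM_PARAMS = 1.1e9   # approximate parameter count, assumed from the model name
BYTES_FP32 = 4       # bytes per fp32 value


def full_finetune_bytes(num_params: float) -> dict:
    """Return a per-component byte estimate for training every parameter."""
    weights = num_params * BYTES_FP32
    gradients = num_params * BYTES_FP32
    optimizer_states = 2 * num_params * BYTES_FP32  # two moment estimates (assumed AdamW-style)
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer_states,
        "total": weights + gradients + optimizer_states,
    }


estimate = full_finetune_bytes(NUM_PARAMS)
for name, n_bytes in estimate.items():
    print(f"{name:>10}: {n_bytes / 1e9:.1f} GB")
```

Under these assumptions, every trainable parameter costs about 16 bytes before any activation memory is counted, which is why gradient checkpointing and a small per-device batch size are used later in the notebook.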

Load the required libraries

In [1]:
import os
os.environ["WANDB_PROJECT"]="tinyllama_instruct_finetuning"

from enum import Enum
from functools import partial
import pandas as pd
import torch

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from datasets import load_dataset
from trl import SFTTrainer

## Data preprocessing: Creating Datasets and Dataloaders

In [4]:
model_name = "TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T"
dataset_name = "HuggingFaceH4/no_robots"
tokenizer = AutoTokenizer.from_pretrained(model_name)
template = """{% for message in messages %}\n{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% if loop.last and add_generation_prompt %}{{'<|im_start|>assistant\n' }}{% endif %}{% endfor %}"""
tokenizer.chat_template = template
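
To make the template concrete, the following plain-Python function mirrors what the Jinja template is expected to render. It is an illustrative re-implementation, not the code path `apply_chat_template` uses, and `render_chatml` is a hypothetical helper name. It assumes the newline right after the `for` tag is trimmed during rendering, which is consistent with the formatted samples printed in this notebook.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Plain-Python mirror of the ChatML template above (illustration only)."""
    text = ""
    for message in messages:
        # Each turn is wrapped as: <|im_start|>{role}\n{content}<|im_end|>\n
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    # The template appends the assistant header only after the last message.
    if add_generation_prompt and messages:
        text += "<|im_start|>assistant\n"
    return text


example = [{"role": "user", "content": "Where did chocolate originate?"}]
prompt = render_chatml(example, add_generation_prompt=True)
print(prompt)
```

With `add_generation_prompt=True`, the string ends with an open assistant header, so the model continues by writing the assistant turn.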

In [5]:
def preprocess(samples):
    batch = []
    for conversation in samples["messages"]:
        batch.append(tokenizer.apply_chat_template(conversation, tokenize=False))
    return {"content": batch}

dataset = load_dataset(dataset_name)

In [6]:
dataset

DatasetDict({
    train: Dataset({
        features: ['prompt', 'prompt_id', 'messages', 'category'],
        num_rows: 9500
    })
    test: Dataset({
        features: ['prompt', 'prompt_id', 'messages', 'category'],
        num_rows: 500
    })
})

In [7]:
dataset = dataset.map(
    preprocess,
    batched=True,
    remove_columns=dataset["train"].column_names
)
print(dataset)

Map:   0%|          | 0/9500 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['content'],
        num_rows: 9500
    })
    test: Dataset({
        features: ['content'],
        num_rows: 500
    })
})


In [9]:
print(dataset)
print(dataset["train"][0])

DatasetDict({
    train: Dataset({
        features: ['content'],
        num_rows: 9500
    })
    test: Dataset({
        features: ['content'],
        num_rows: 500
    })
})
{'content': '<|im_start|>user\nPlease summarize the goals for scientists in this text:\n\nWithin three days, the intertwined cup nest of grasses was complete, featuring a canopy of overhanging grasses to conceal it. And decades later, it served as Rinkert’s portal to the past inside the California Academy of Sciences. Information gleaned from such nests, woven long ago from species in plant communities called transitional habitat, could help restore the shoreline in the future. Transitional habitat has nearly disappeared from the San Francisco Bay, and scientists need a clearer picture of its original species composition—which was never properly documented. With that insight, conservation research groups like the San Francisco Bay Bird Observatory can help guide best practices when restoring the native habitat

In [10]:
dataset["test"][:10]

{'content': ['<|im_start|>system\nAster is a chatbot who answers questions with rhymes.<|im_end|>\n<|im_start|>user\nWhere did chocolate originate?<|im_end|>\n<|im_start|>assistant\nChocolate is 4000 years old/Mexico is where it was first sold<|im_end|>\n<|im_start|>user\nWhere was milk chocolate invented?<|im_end|>\n<|im_start|>assistant\nSwitzerland was the first to add milk/To make their chocolate smooth as silk<|im_end|>\n<|im_start|>user\nWhat are some good desserts that use chocolate?<|im_end|>\n<|im_start|>assistant\nPie, tart, cookies, and cake/Chocolate is great to bake<|im_end|>\n',
  "<|im_start|>user\nWhat are 5 things I can do when it's raining in London? I am visiting for the first time and only for a week with my husband. We love to walk, eat good food, and explore.<|im_end|>\n<|im_start|>assistant\nSure! Here are five options for things to do in London on a rainy day:\n\n1. Visit The British Museum. Dedicated to human history, art and culture, The British Museum has ove

## Loading the pretrained model and tokenizer

In [11]:
class ChatmlSpecialTokens(str, Enum):
    user = "<|im_start|>user"
    assistant = "<|im_start|>assistant"
    system = "<|im_start|>system"
    eos_token = "<|im_end|>"
    bos_token = "<s>"
    pad_token = "<pad>"

    @classmethod
    def list(cls):
        return [c.value for c in cls]

tokenizer = AutoTokenizer.from_pretrained(
        model_name,
        pad_token=ChatmlSpecialTokens.pad_token.value,
        bos_token=ChatmlSpecialTokens.bos_token.value,
        eos_token=ChatmlSpecialTokens.eos_token.value,
        additional_special_tokens=ChatmlSpecialTokens.list(),
        trust_remote_code=True
    )
tokenizer.chat_template = template
model = AutoModelForCausalLM.from_pretrained(model_name)
model.resize_token_embeddings(len(tokenizer))

config.json:   0%|          | 0.00/560 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/4.40G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/129 [00:00<?, ?B/s]

The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
The new lm_head weights will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


Embedding(32005, 2048)
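
The new embedding size can be sanity-checked by hand. The sketch below assumes the base Llama tokenizer has 32,000 entries and already contains `<s>`, so only five of the six special tokens are actually new. The `ALREADY_IN_VOCAB` set is a stand-in for the real vocabulary, not something read from the tokenizer.

```python
from enum import Enum


class ChatmlSpecialTokens(str, Enum):
    user = "<|im_start|>user"
    assistant = "<|im_start|>assistant"
    system = "<|im_start|>system"
    eos_token = "<|im_end|>"
    bos_token = "<s>"
    pad_token = "<pad>"


BASE_VOCAB_SIZE = 32_000                      # assumed size of the base tokenizer
ALREADY_IN_VOCAB = {"<s>", "</s>", "<unk>"}   # assumed pre-existing special tokens

# Only tokens that are not already present enlarge the vocabulary.
new_tokens = [t.value for t in ChatmlSpecialTokens if t.value not in ALREADY_IN_VOCAB]
resized_vocab_size = BASE_VOCAB_SIZE + len(new_tokens)

print(new_tokens)
print(resized_vocab_size)
```

This matches the `Embedding(32005, 2048)` shape reported by `resize_token_embeddings`.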

## Storing the base model predictions on a subset of 25 samples from the test split

In [12]:
# Left-pad so that, in a padded batch, every prompt ends right where generation starts.
tokenizer.padding_side = "left"

def get_prediction_batched(samples, column_name):
    # Build a generation prompt from every turn except the final (reference) assistant reply.
    batch = []
    for conversation in samples["messages"]:
        chatml_gen_prompt = tokenizer.apply_chat_template(conversation[:-1], tokenize=False, add_generation_prompt=True)
        batch.append(chatml_gen_prompt)
    inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
    inputs = {k: v.to("cuda") for k, v in inputs.items()}
    outputs = model.generate(**inputs,
                             max_new_tokens=100,
                             do_sample=True,
                             top_p=0.95,
                             temperature=0.2,
                             repetition_penalty=1.1,
                             eos_token_id=tokenizer.eos_token_id,
                             pad_token_id=tokenizer.eos_token_id,
                            )
    # The decoded text contains prompt + completion; keep only what follows the last assistant marker.
    outputs = tokenizer.batch_decode(outputs)
    outputs = [output.split("<|im_start|>assistant")[-1].split("<|im_end|>")[0].strip() for output in outputs]
    return {column_name: outputs}
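
The last step of `get_prediction_batched` recovers the assistant reply by string splitting. The standalone sketch below isolates that step on hand-written strings (the sample texts are made up, and `extract_assistant_reply` is a hypothetical helper name). It also shows one edge case: if the model emits `<|im_end|>` immediately, the extracted reply is an empty string.

```python
def extract_assistant_reply(decoded: str) -> str:
    """Keep the text between the last assistant marker and the next <|im_end|>."""
    after_marker = decoded.split("<|im_start|>assistant")[-1]
    return after_marker.split("<|im_end|>")[0].strip()


# Made-up decoded sequences (prompt + completion), for illustration only.
normal = "<s><|im_start|>user\nHi<|im_end|>\n<|im_start|>assistant\nHello there!<|im_end|>"
ends_immediately = "<s><|im_start|>user\nHi<|im_end|>\n<|im_start|>assistant\n<|im_end|><|im_end|>"

print(repr(extract_assistant_reply(normal)))
print(repr(extract_assistant_reply(ends_immediately)))
```

An immediate end-of-turn token is one possible explanation for blank entries in a predictions column, though other causes cannot be ruled out from the strings alone.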

In [14]:
model.to("cuda")
test_dataset = load_dataset(dataset_name)["test"].shuffle().select(range(25))
test_dataset = test_dataset.map(
    partial(get_prediction_batched, column_name="base_assistant_message"),
    batched=True,
    batch_size=1)

print(test_dataset)
print(test_dataset[0])

Map:   0%|          | 0/25 [00:00<?, ? examples/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Dataset({
    features: ['prompt', 'prompt_id', 'messages', 'category', 'base_assistant_message'],
    num_rows: 25
})
{'prompt': 'Create a marketing blurb for a book with the following premise: "Amaris, Veronica, and Kate are three women each competing in their own battle against the town of Liverword. Amaris is being tried for murder when a murder of crows killed a farmer\'s cow. She\'s an animal empath, but the prosecution can\'t prove it. Veronica was abused but no one believes her. She tries to plan a revenge crime without getting caught. The women don\'t know each other, but separately discover that their unique experiences may have an ancestral tie to witchcraft. They also discover a missing relative, Kate. They seek her out and Kate discovers that her parents covered up her adoption. She has always felt "apart" from others and seeks her family history, but the records have been lost, burnt, and tampered with. Amaris decides to help Veronica with her retribution in exchange for 

## Training

In [15]:
output_dir = "tinyllama_instruct"
per_device_train_batch_size = 1
per_device_eval_batch_size = 1
gradient_accumulation_steps = 16
logging_steps = 25
learning_rate = 2e-5
max_grad_norm = 1.0
max_steps = 250
num_train_epochs=1
warmup_ratio = 0.1
lr_scheduler_type = "cosine"
max_seq_length = 2048

training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    per_device_eval_batch_size=per_device_eval_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    save_strategy="epoch",
    evaluation_strategy="epoch",
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    max_grad_norm=max_grad_norm,
    weight_decay=0.1,
    warmup_ratio=warmup_ratio,
    lr_scheduler_type=lr_scheduler_type,
    fp16=True,
    report_to=["tensorboard", "wandb"],
    hub_private_repo=True,
    push_to_hub=True,
    num_train_epochs=num_train_epochs,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False}
)
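
The hyperparameters above imply an effective batch size and a learning-rate schedule that can be worked out by hand. The sketch below assumes a single GPU and the 9,500 training rows reported earlier, and re-implements linear warmup followed by cosine decay in plain Python. The exact step count and rounding used by the `Trainer` may differ slightly, and `lr_at_step` is a hypothetical helper.

```python
import math

per_device_train_batch_size = 1
gradient_accumulation_steps = 16
num_train_rows = 9500      # size of the train split
num_gpus = 1               # assumption: single-GPU run
learning_rate = 2e-5
warmup_ratio = 0.1

# Samples consumed per optimizer update.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
# Approximate number of optimizer updates in one epoch.
total_steps = math.ceil(num_train_rows / effective_batch_size)
warmup_steps = math.ceil(total_steps * warmup_ratio)


def lr_at_step(step: int) -> float:
    """Linear warmup to the peak rate, then cosine decay to zero."""
    if step < warmup_steps:
        return learning_rate * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return learning_rate * 0.5 * (1.0 + math.cos(math.pi * progress))


print(effective_batch_size, total_steps, warmup_steps)
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{lr_at_step(step):.2e}")
```

Note that `max_steps` and `max_seq_length` are defined in the cell above but are not passed to `TrainingArguments`, so the run length is governed by `num_train_epochs=1`.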




In [36]:
# SFTTrainer reads the "text" column by default, so rename "content" to "text".
dataset = dataset.rename_columns({"content": "text"})

# Verify the change
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 9500
    })
    test: Dataset({
        features: ['text'],
        num_rows: 500
    })
})


In [37]:
trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
    # packing=True,
    # dataset_text_field="content",
    # max_seq_length=max_seq_length,
)


Converting train dataset to ChatML:   0%|          | 0/9500 [00:00<?, ? examples/s]

Applying chat template to train dataset:   0%|          | 0/9500 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/9500 [00:00<?, ? examples/s]

Truncating train dataset:   0%|          | 0/9500 [00:00<?, ? examples/s]

Converting eval dataset to ChatML:   0%|          | 0/500 [00:00<?, ? examples/s]

Applying chat template to eval dataset:   0%|          | 0/500 [00:00<?, ? examples/s]

Tokenizing eval dataset:   0%|          | 0/500 [00:00<?, ? examples/s]

Truncating eval dataset:   0%|          | 0/500 [00:00<?, ? examples/s]

In [38]:
trainer.train()
trainer.save_model()

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33mbadrinarayan[0m ([33mbadrinarayan-analytics-vidhya[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 0     | 1.8677        | 1.837446        |
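
Cross-entropy loss is often easier to interpret as perplexity, the exponential of the loss. A quick conversion of the two values reported above, assuming they are mean per-token losses measured in nats:

```python
import math

train_loss = 1.8677      # training loss reported after the epoch
eval_loss = 1.837446     # validation loss reported after the epoch

# Perplexity = exp(mean per-token cross-entropy).
train_ppl = math.exp(train_loss)
eval_ppl = math.exp(eval_loss)

print(f"train perplexity ~ {train_ppl:.2f}")
print(f"eval perplexity  ~ {eval_ppl:.2f}")
```

A perplexity of roughly 6 means the model is, on average, about as uncertain as if it were choosing uniformly among six tokens at each position.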


events.out.tfevents.1741610269.4596a8036205.769.0:   0%|          | 0.00/12.5k [00:00<?, ?B/s]

In [39]:
!nvidia-smi

Mon Mar 10 12:55:50 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05             Driver Version: 550.127.05     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA RTX 6000 Ada Gene...    On  |   00000000:A1:00.0 Off |                  Off |
| 30%   37C    P2             60W /  300W |   22352MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
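
The fork warning above can be avoided by setting the environment variable it names before any tokenizer work starts. A minimal sketch, assuming it is placed in the first cell next to the `WANDB_PROJECT` setting:

```python
import os

# Disable tokenizers' internal thread pool so that forking (e.g. for `!nvidia-smi`)
# does not trigger the parallelism warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

Setting the value to `"true"` instead would also make the choice explicit, but the warning text itself notes that parallelism is disabled after a fork to avoid deadlocks.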


## Loading the finetuned model and generating its predictions

In [41]:
model = AutoModelForCausalLM.from_pretrained("Badribn/tinyllama_instruct", trust_remote_code=True)
model.to("cuda")
model.to(torch.float16)
model.eval()

config.json:   0%|          | 0.00/756 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/4.40G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32005, 2048)
    (layers): ModuleList(
      (0-21): 22 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=5632, bias=False)
          (up_proj): Linear(in_features=2048, out_features=5632, bias=False)
          (down_proj): Linear(in_features=5632, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-05)
    (rotary_emb): 

In [42]:
test_dataset = test_dataset.map(
    partial(get_prediction_batched, column_name="instruct_assistant_message"),
    batched=True,
    batch_size=1)

print(test_dataset)
print(test_dataset[0])

Map:   0%|          | 0/25 [00:00<?, ? examples/s]

Dataset({
    features: ['prompt', 'prompt_id', 'messages', 'category', 'base_assistant_message', 'instruct_assistant_message'],
    num_rows: 25
})
{'prompt': 'Create a marketing blurb for a book with the following premise: "Amaris, Veronica, and Kate are three women each competing in their own battle against the town of Liverword. Amaris is being tried for murder when a murder of crows killed a farmer\'s cow. She\'s an animal empath, but the prosecution can\'t prove it. Veronica was abused but no one believes her. She tries to plan a revenge crime without getting caught. The women don\'t know each other, but separately discover that their unique experiences may have an ancestral tie to witchcraft. They also discover a missing relative, Kate. They seek her out and Kate discovers that her parents covered up her adoption. She has always felt "apart" from others and seeks her family history, but the records have been lost, burnt, and tampered with. Amaris decides to help Veronica with he

## Comparing the outputs of the base model and the instruction-finetuned model

In [43]:
test_dataset = test_dataset.to_pandas()

In [44]:
pd.set_option("max_colwidth", 300)
test_dataset[["messages", "base_assistant_message", "instruct_assistant_message"]][:25]

Unnamed: 0,messages,base_assistant_message,instruct_assistant_message
0,"[{'content': 'Create a marketing blurb for a book with the following premise: ""Amaris, Veronica, and Kate are three women each competing in their own battle against the town of Liverword. Amaris is being tried for murder when a murder of crows killed a farmer's cow. She's an animal empath, but t...","The Blurb:\nAmaris, Veronica, and Kate are three women each competing in their own battle against the town of Liverword. Amaris is being tried for murder when a murder of crows killed a farmer's cow. She's an animal empath, but the prosecution can't prove it. Veronica was abused but no one belie...","In a world where magic exists, three women must fight to survive.\n\nAmaris, Veronica, and Kate are three women each competing in their own battle against the town of Liverword. Amaris is being tried for murder when a murder of crows killed a farmer's cow. She's an animal empath, but the prosecu..."
1,"[{'content': 'Come up with fun games at a child’s birthday party.', 'role': 'user'}, {'content': 'Great! I can develop some games kids would love to play! - Musical Chairs - Children love this classic game! Pick a song of your choice and set up one less chair than there are kids at the party. P...",,"1. Cake walk: Have the guests stand in a line and take turns walking around the room, holding their hands out to each side of them. The person who is holding their hand out should be asked to pick up a piece of cake from the floor. They must then pass it to someone else in the line. The first pe..."
2,"[{'content': 'Hey. I need a couple of intro paragraphs for a blog post on writing compelling character backstories for roleplaying games. The take that this particular post should have is that these stories should be short, but also include interesting elements that the GM can use to create inte...","The first thing you want to know about your character is what they do when they're not playing. This is the most important part of your character's backstory. It's the reason why you play the game in the first place. If you don't know what your character does, how will you know what they're capa...",The first thing you want to do when creating your character is to make sure they have a unique and memorable backstory. This is something that will help them stand out from other characters in the game. Here are some ideas for how to do this:\n\n- Create a story about a character who has been th...
3,"[{'content': 'Please write a humorous/ironic script for a speech in a crowdfunding video by a man named Dr. Disaster. He is asking for $500 Million to fund a doomsday device that he calls, ""The Crust Cracker 5000,"" which works by cracking a planet's crust like an egg and exposing the mantle. One...",A 10000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000,"Hello! I am Dr. Disaster, and I am here today to ask you all to help me raise $500 million dollars so that we can build The Crust Cracker 5000, a doomsday device that will crack a planet's crust like an egg and expose its mantle. This is not a joke. It is real. And it is serious.\n\nI know this ..."
4,"[{'content': 'What is the Cinque Terre?', 'role': 'user'}, {'content': 'The Cinque Terre is a coastal area in the northwest of Italy. It comprises five villages: Monterosso al Mare, Vernazza, Corniglia, Manarola, and Riomaggiore. The coastline, together with the five villages and the surrounding...",Cinque Terre\nCinque Terre\nCinque Terre\nCinque Terre\nCinque Terre\nCinque Terre\nCinque Terre\nCinque Terre\n\nCinque Terre\nCinque Terre\nCinque Terre\nCinque Terre\nCinque Terre\nCinque Terre\nCinque Terre\nCinque Terre\nCin,"The Cinque Terre are five small towns in Italy that are located on the Ligurian coast. The Cinque Terre are known for their beautiful scenery, history, and food."
5,"[{'content': 'Mrs. Pavlova is a chatbot for adult beginner ballet students and answers questions that adult beginner ballet students would ask.', 'role': 'system'}, {'content': 'Hi Mrs. Pavlova. I'm starting my first ballet class as an adult. It's happening next week. What are two important piec...","Sweatpants? They're great for beginners. They're also good for people who have trouble finding ballet slippers. And they're comfortable enough for a beginner ballet class. So why don't you buy some? Because they're too expensive. But if you're willing to spend a little more money, you can find t...",You'll also want to bring a sweater or jacket to keep yourself warm during the class.
6,"[{'content': 'Rewrite this text as if its part of a futuristic science fiction story: Tesla disabled the on-the-move passenger play feature via an over-the air software update after discussing the matter with officials. However, NHTSA continued its investigation. At the time, NHTSA reiterated t...",,"In a futuristic sci-fi world, Tesla disabled the on-the-move passenger play feature via an over-the-air software update after discussing the matter with officials. However, NHTSA continued its investigation. At the time, NHTSA reiterated that the Vehicle Safety Act bars companies from selling ca..."
7,"[{'content': 'Could you rewrite this article snippet for me, but as if it's part of a first-person female narrative? Jealousy is often thought of in the context of romantic relationships: a boyfriend who forbids his girlfriend from talking to other men, for instance, or a person who can’t stand...",,"I am jealous. I have always been jealous. It's just a fact of life. Jealousy is a natural emotion that we all feel when we are jealous of someone else. It's like a red flag that says ""I want what you have."" I know that sounds weird, but it's true. I know that sounds weird, but it's true. I know ..."
8,"[{'content': 'Prepare a text message to a male friend who just went through a bad breakup. Be sarcastic and include dark humor. Try not to make it sincere or serious at all.', 'role': 'user'}, {'content': 'Dude, I knew she meant a lot to you, but how much could SHE really be worth if she pulled ...","Ask a friend about their life, their career, or anything else they want to talk about. Be honest and open.\nAsk a friend about something that's bothering them. Don't be too personal.\nAsk a friend about something that's bothering you. Be honest and open.\nAsk a friend about something that's both...","Hey, buddy! I'm sorry to hear you're going through a rough patch. It must be tough to lose someone you care about so much. I hope things get better soon.\n\nI know how hard it can be to move on from a relationship. You might feel like you'll never find someone who understands you the way you do ..."
9,"[{'content': '1) Did Jerry Springer have a successful political career? 2) How many years was the Jerry Springer Show on television? Source: Jerry Springer, daytime television pioneer, dies at 79 “The Jerry Springer Show,” which ran from 1991 to 2018, was known for on-set confrontations and fan...",,"1) Did Jerry Springer have a successful political career?\n2) How many years was the Jerry Springer Show on television?\n\n\nSource:\nJerry Springer, daytime television pioneer, dies at 79 - CNN\nCNN\nApril 27, 2023, 10:37 AM EDT / Updated April 27, 2023, 1:50 PM EDT\nBy Diana Das"
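
Reading rows side by side is subjective, so a couple of crude automatic checks can complement it. The sketch below counts empty replies and computes a distinct-word ratio as a rough signal of degenerate repetition. Both metrics and the helper names are illustrative assumptions, and the sample strings are shortened and adapted from the table above rather than the full model outputs.

```python
def distinct_word_ratio(text: str) -> float:
    """Fraction of whitespace-separated words that are unique (1.0 = no repeats)."""
    words = text.split()
    return len(set(words)) / len(words) if words else 0.0


def summarize(replies):
    """Count empty replies and average the distinct-word ratio over the rest."""
    non_empty = [r for r in replies if r.strip()]
    mean_ratio = (
        sum(distinct_word_ratio(r) for r in non_empty) / len(non_empty)
        if non_empty else 0.0
    )
    return {"empty": len(replies) - len(non_empty), "mean_distinct_word_ratio": mean_ratio}


# Shortened, adapted samples for illustration only.
base_replies = ["", "Cinque Terre\nCinque Terre\nCinque Terre\nCinque Terre"]
instruct_replies = [
    "1. Cake walk: Have the guests stand in a line and take turns walking around the room.",
    "The Cinque Terre are five small towns in Italy that are located on the Ligurian coast.",
]

base_summary = summarize(base_replies)
instruct_summary = summarize(instruct_replies)
print("base:    ", base_summary)
print("instruct:", instruct_summary)
```

In a notebook, the same functions could be applied to the `base_assistant_message` and `instruct_assistant_message` columns of the DataFrame; they do not measure correctness, only obvious failure modes.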


In [45]:
messages = [
    {"role": "user", "content": "What an essay on Generative AI."},
]
text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt")#, add_special_tokens=False)
inputs = {k: v.to("cuda") for k,v in inputs.items()}
outputs = model.generate(**inputs, 
                         max_new_tokens=2000, 
                         do_sample=True, 
                         top_p=0.95, 
                         temperature=0.2, 
                         repetition_penalty=1.1, 
                         eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0]))

<s><|im_start|>user 
What an essay on Generative AI.<|im_end|> 
<|im_start|>assistant 
Generative AI is a type of AI that creates new content from scratch. It can be used to create art, music, or even code. The process of creating something new involves many steps and requires a lot of creativity.

The first step in the process is to come up with an idea for what you want your piece to look like. This could be anything from a simple drawing to a complex animation. You will then need to find a way to represent your idea in a way that makes sense to you. For example, if you are creating a painting, you might use colors, shapes, and textures to represent your idea.

Once you have a representation of your idea, you will need to decide how it should be represented. There are many ways to do this, such as using a computer program to draw the image or using a digital camera to take pictures of the object. Once you have a representation of your idea, you will need to decide how it should be re

In [46]:
!nvidia-smi

Mon Mar 10 13:02:50 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05             Driver Version: 550.127.05     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA RTX 6000 Ada Gene...    On  |   00000000:A1:00.0 Off |                  Off |
| 30%   36C    P2            113W /  300W |   22366MiB /  49140MiB |     36%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
