### In this notebook, we finetune our pretrained model to follow instructions

#### Data handling for the finetuning

We use a dataset specially constructed for the course book, but other publicly available datasets can be used as well.
The dataset consists of instruction-input-output samples in JSON format.
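
To make the format concrete, here is a minimal sketch of what such a file looks like and how it could be sanity-checked. The two entries are the ones printed later in this notebook; the names `sample_json` and `required_keys` are only illustrative.

```python
import json

# Two entries in the instruction-input-output format (taken from the outputs below)
sample_json = """
[
  {"instruction": "Identify the correct spelling of the following word.",
   "input": "Ocassion",
   "output": "The correct spelling is 'Occasion.'"},
  {"instruction": "What is an antonym of 'complicated'?",
   "input": "",
   "output": "An antonym of 'complicated' is 'simple'."}
]
"""

entries = json.loads(sample_json)

# Every entry is expected to carry these three keys; "input" may be an empty string
required_keys = {"instruction", "input", "output"}
all_valid = all(required_keys.issubset(entry) for entry in entries)
print("All entries well-formed:", all_valid)
```

Note that the `input` field can be empty, which is why the prompt formatting later has to treat it as optional.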

In [1]:
# downloading the dataset
import json
import os
import urllib.request  # the request submodule must be imported explicitly

def download_and_load_file(file_path, url):
    if not os.path.exists(file_path):
        with urllib.request.urlopen(url) as response:
            text_data = response.read().decode("utf-8")
        with open(file_path, "w", encoding="utf-8") as file:
            file.write(text_data)
    else:                                                  #A
        # with open(file_path, "r", encoding="utf-8") as file:
        #     text_data = file.read()
        print(f"File '{file_path}' already exists. Skipping download.")

    with open(file_path, "r", encoding="utf-8") as file:
        data = json.load(file)

    return data
#A Skip download if file was already downloaded

file_path = "instruction-data.json"
url ="https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch07/01_main-chapter-code/instruction-data.json"

data = download_and_load_file(file_path, url)
print("Number of entries:", len(data))

File 'instruction-data.json' already exists. Skipping download.
Number of entries: 1100


In [2]:
# example entry
print("Example entries:", data[50])

Example entries: {'instruction': 'Identify the correct spelling of the following word.', 'input': 'Ocassion', 'output': "The correct spelling is 'Occasion.'"}


In [3]:
#another one
print("Another example entry:\n", data[999])

Another example entry:
 {'instruction': "What is an antonym of 'complicated'?", 'input': '', 'output': "An antonym of 'complicated' is 'simple'."}


We will use the Alpaca prompt style (Alpaca was one of the first LLMs whose instruction-finetuning process was made publicly available):

"""
Below is an instruction that describes a task.<br>
Write a response that appropriately completes the request.<br>

### Instruction:
Identify the correct spelling of the following word.

### Input:
Ocassion

### Response:
The correct spelling is 'Occasion.'
"""

In [4]:
# implementing the prompt formatting function
def format_input(entry):
    instruction_text = (
        f"Below is an instruction that describes a task. "
        f"Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    input_text = f"\n\n### Input: \n{entry['input']}" if entry['input'] else ""
    return instruction_text + input_text

In [5]:
#test
model_input = format_input(data[50])
desired_response = f"\n\n### Response: \n{data[50]['output']}"
print(model_input + desired_response)

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Identify the correct spelling of the following word.

### Input: 
Ocassion

### Response: 
The correct spelling is 'Occasion.'


In [6]:
#another test  (skipping the Input section because it's empty)
model_input = format_input(data[999])
desired_response = f"\n\n### Response:\n{data[999]['output']}"
print(model_input + desired_response)

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What is an antonym of 'complicated'?

### Response:
An antonym of 'complicated' is 'simple'.


In [7]:
# partitioning the dataset:
train_portion = int(len(data) * 0.85) # 85% for training
test_portion = int(len(data)*0.1) # 10% for testing
val_portion = len(data) - train_portion - test_portion # remaining 5% for validation

train_data = data[:train_portion]
test_data = data[train_portion:train_portion + test_portion]
val_data = data[train_portion + test_portion:]

print("Training set length: ", len(train_data))
print("Validation set length: ", len(val_data))
print("Test set length: ", len(test_data))

Training set length:  935
Validation set length:  55
Test set length:  110


#### Organising data into batches

In the previous chapter, the training batches were created automatically by the PyTorch
DataLoader class, which employs a default collate function to combine lists of samples into
batches. A collate function is responsible for taking a list of individual data samples and
merging them into a single batch that can be processed efficiently by the model during
training.

However, we now have to write our own collate function.
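
The reason is that our tokenized samples have different lengths, and samples of unequal size cannot simply be stacked into one tensor. A minimal sketch of the problem, using `torch.stack` on two toy token-ID sequences (the toy values are illustrative only):

```python
import torch

# Two toy samples with different numbers of token IDs
samples = [torch.tensor([0, 1, 2, 3, 4]), torch.tensor([5, 6])]

try:
    torch.stack(samples)      # stacking requires tensors of identical shape
    stack_failed = False
except RuntimeError:
    stack_failed = True

print("Stacking unequal-length samples failed:", stack_failed)
```

Padding every sample in a batch to a common length removes this obstacle.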

Steps:
- format the data using the prompt template (format_input() + desired response)
- tokenize the data
- pad all sequences in a batch to the same length
- create the target token IDs (the inputs shifted by one position, plus an additional padding token)
- replace all but the first padding token in the targets with -100 so that they do not count in the loss calculation (cross_entropy(logits, targets))




In [8]:
#Instruction Dataset
import torch
from torch.utils.data import Dataset

class InstructionDataset(Dataset):
    def __init__(self, data, tokenizer):
        self.data = data
        self.encoded_texts = []
        for entry in data:                                      #A Pre-tokenize texts
            instruction_plus_input = format_input(entry)
            response_text = f"\n\n### Response: \n{entry['output']}"
            full_text = instruction_plus_input + response_text
            self.encoded_texts.append(
                tokenizer.encode(full_text)
            )

    def __getitem__(self, index):
        return self.encoded_texts[index]
    
    def __len__(self):
        return len(self.data)


In [9]:
#token id for the padding token
import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")
print(tokenizer.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))

[50256]


In [10]:
# our custom collate function for padding
def custom_collate_draft_1(batch, pad_token_id=50256, device="cpu"):

    batch_max_length = max(len(item)+1 for item in batch)           #A
    inputs_lst = []

    for item in batch:                                               #B
        new_item = item.copy()
        new_item += [pad_token_id]

        padded = new_item + [pad_token_id] * (batch_max_length - len(new_item))

        inputs = torch.tensor(padded[:-1])                            #C
        inputs_lst.append(inputs)                                     

    inputs_tensor = torch.stack(inputs_lst).to(device)               #D
    return inputs_tensor
#A Find the longest sequence in the batch
#B Pad and prepare inputs
#C Remove extra padded token added earlier
#D Convert list of inputs to tensor and transfer to target device

In [11]:
#test:
inputs_1 = [0, 1, 2, 3, 4]
inputs_2 = [5, 6]
inputs_3 = [7, 8, 9]
batch = (
    inputs_1,
    inputs_2,
    inputs_3
)
print(custom_collate_draft_1(batch))

tensor([[    0,     1,     2,     3,     4],
        [    5,     6, 50256, 50256, 50256],
        [    7,     8,     9, 50256, 50256]])


In [12]:
# modifying the collate function to also create the targets
def custom_collate_draft_2(
    batch,
    pad_token_id=50256,
    device="cpu"
):
    batch_max_length = max(len(item)+1 for item in batch)
    inputs_lst, targets_lst = [], []
    for item in batch:
        new_item = item.copy()
        new_item += [pad_token_id]
        padded = new_item + [pad_token_id] * (batch_max_length - len(new_item))
        inputs = torch.tensor(padded[:-1]) #A
        targets = torch.tensor(padded[1:]) #B
        inputs_lst.append(inputs)
        targets_lst.append(targets)
    inputs_tensor = torch.stack(inputs_lst).to(device)
    targets_tensor = torch.stack(targets_lst).to(device)
    return inputs_tensor, targets_tensor

inputs, targets = custom_collate_draft_2(batch)
print(inputs)
print(targets)
#A Truncate the last token for inputs
#B Shift +1 to the right for targets

tensor([[    0,     1,     2,     3,     4],
        [    5,     6, 50256, 50256, 50256],
        [    7,     8,     9, 50256, 50256]])
tensor([[    1,     2,     3,     4, 50256],
        [    6, 50256, 50256, 50256, 50256],
        [    8,     9, 50256, 50256, 50256]])


In [13]:
#replacing padding token ids in the targets with -100 to not count them in the loss
def custom_collate_fn(
    batch,
    pad_token_id=50256,
    ignore_index=-100,
    allowed_max_length=None,
    device="cpu"
):
    batch_max_length = max(len(item)+1 for item in batch)
    inputs_lst, targets_lst = [], []
    for item in batch:
        new_item = item.copy()
        new_item += [pad_token_id]
        # Pad sequences to max_length
        padded = new_item + [pad_token_id] * (batch_max_length - len(new_item))
        inputs = torch.tensor(padded[:-1]) # Truncate the last token for inputs
        targets = torch.tensor(padded[1:]) # Shift +1 to the right for targets
        
        mask = targets == pad_token_id                  #A
        indices = torch.nonzero(mask).squeeze()         #A
        if indices.numel() > 1: #A
            targets[indices[1:]] = ignore_index #A

        if allowed_max_length is not None:
            inputs = inputs[:allowed_max_length] #B
            targets = targets[:allowed_max_length] #B
        inputs_lst.append(inputs)
        targets_lst.append(targets)

    inputs_tensor = torch.stack(inputs_lst).to(device)
    targets_tensor = torch.stack(targets_lst).to(device)
    return inputs_tensor, targets_tensor
#A Replace all but the first padding tokens in targets by ignore_index
#B Optionally truncate to maximum sequence length

In [14]:
#let's try it out
inputs, targets = custom_collate_fn(batch)
print(inputs)
print(targets)

tensor([[    0,     1,     2,     3,     4],
        [    5,     6, 50256, 50256, 50256],
        [    7,     8,     9, 50256, 50256]])
tensor([[    1,     2,     3,     4, 50256],
        [    6, 50256,  -100,  -100,  -100],
        [    8,     9, 50256,  -100,  -100]])
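
As a cross-check, the same inputs and targets can be produced more compactly with PyTorch's `torch.nn.utils.rnn.pad_sequence`. This is only an alternative sketch, not the function used in the rest of the notebook: `collate_with_pad_sequence` is a hypothetical name, and it omits the `allowed_max_length` truncation and the device transfer.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def collate_with_pad_sequence(batch, pad_token_id=50256, ignore_index=-100):
    # Inputs: the raw token IDs, padded below with the padding token
    inputs = [torch.tensor(item) for item in batch]
    # Targets: inputs shifted by one position, with one end-of-text token appended
    targets = [torch.tensor(list(item[1:]) + [pad_token_id]) for item in batch]
    inputs_tensor = pad_sequence(inputs, batch_first=True, padding_value=pad_token_id)
    # Padding the targets with ignore_index keeps exactly one end-of-text token per row
    targets_tensor = pad_sequence(targets, batch_first=True, padding_value=ignore_index)
    return inputs_tensor, targets_tensor

toy_batch = ([0, 1, 2, 3, 4], [5, 6], [7, 8, 9])
toy_inputs, toy_targets = collate_with_pad_sequence(toy_batch)
print(toy_inputs)
print(toy_targets)
```

Because the targets are padded directly with the ignore index, no separate masking step is needed in this variant.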


In [15]:
# to understand why the -100 placeholder is used in the targets
logits_1 = torch.tensor(
    [[-1.0, 1.0], # predictions for 1st token
    [-0.5, 1.5]] # predictions for 2nd token
)
targets_1 = torch.tensor([0, 1]) # Correct token indices to generate
loss_1 = torch.nn.functional.cross_entropy(logits_1, targets_1)
print(loss_1)

logits_2 = torch.tensor(
    [[-1.0, 1.0],
    [-0.5, 1.5],
    [-0.5, 1.5]] #A adding a 3rd token ID prediction
)
targets_2 = torch.tensor([0, 1, 1])
loss_2 = torch.nn.functional.cross_entropy(logits_2, targets_2)
print(loss_2)

targets_3 = torch.tensor([0, 1, -100]) #-100 because it is the default ignore_index in PyTorch cross_entropy
loss_3 = torch.nn.functional.cross_entropy(logits_2, targets_3)
print(loss_3)
print("loss_1 == loss_3:", loss_1 == loss_3)

# so we see that setting a target to -100 makes cross_entropy ignore the 3rd token prediction
# the model does not learn from positions that carry no information (such as padding)

tensor(1.1269)
tensor(0.7936)
tensor(1.1269)
loss_1 == loss_3: tensor(True)


It is also common to mask out the target token IDs that correspond to the instruction, so that the model focuses on generating good responses rather than memorizing instructions, which can help reduce overfitting.

However, whether this actually helps is still debated in the LLM community. In this notebook, we do not mask the instruction tokens in the target token IDs.
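
For illustration only, here is a minimal sketch of what masking the instruction part could look like. It is not used in this notebook; `mask_instruction_targets` and `instruction_length` (the number of prompt tokens before the response starts) are hypothetical names introduced for this example.

```python
import torch

def mask_instruction_targets(targets, instruction_length, ignore_index=-100):
    # targets[i] is the token that follows input position i, so the targets that
    # still belong to the instruction are the first instruction_length - 1 entries
    masked = targets.clone()
    masked[: max(instruction_length - 1, 0)] = ignore_index
    return masked

# Toy sequence: tokens 0, 1, 2 form the instruction; 3, 4 form the response
toy_targets = torch.tensor([1, 2, 3, 4, 50256])
masked_targets = mask_instruction_targets(toy_targets, instruction_length=3)
print(masked_targets)
```

With such a mask, only the response tokens and the final end-of-text token would contribute to the loss.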

##### Creating the DataLoaders

In [18]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# if torch.backends.mps.is_available():  # for Macs with Apple silicon chips
#     device = torch.device("mps")
print("Using device:", device)

Using device: cpu


In [19]:
# use partial to pre-fill the device and allowed_max_length arguments of the collate function
from functools import partial
custumized_collate_fn = partial(custom_collate_fn, device=device, allowed_max_length=1024)

In [20]:
# setting up the data loaders
from torch.utils.data import DataLoader

num_workers = 0                                #A
batch_size = 8

train_dataset = InstructionDataset(train_data, tokenizer)
train_loader = DataLoader(
    train_dataset,
    batch_size=batch_size,
    collate_fn=custumized_collate_fn,
    shuffle=True,
    drop_last=True,
    num_workers=num_workers
)

val_dataset = InstructionDataset(val_data, tokenizer)
val_loader = DataLoader(
    val_dataset,
    batch_size=batch_size,
    collate_fn=custumized_collate_fn,
    shuffle=True,
    drop_last=False,
    num_workers=num_workers
)

test_dataset = InstructionDataset(test_data, tokenizer)
test_loader = DataLoader(
    test_dataset,
    batch_size=batch_size,
    collate_fn=custumized_collate_fn,
    shuffle=False,
    drop_last=False,
    num_workers=num_workers
)

#A We can try to increase this number if parallel Python processes are supported by our operating system

In [21]:
#let's examine:
print("Train loader:")
for inputs, targets in train_loader:
    print(inputs.shape, targets.shape)

Train loader:
torch.Size([8, 73]) torch.Size([8, 73])
torch.Size([8, 63]) torch.Size([8, 63])
torch.Size([8, 71]) torch.Size([8, 71])
torch.Size([8, 72]) torch.Size([8, 72])
torch.Size([8, 67]) torch.Size([8, 67])
torch.Size([8, 63]) torch.Size([8, 63])
torch.Size([8, 63]) torch.Size([8, 63])
torch.Size([8, 72]) torch.Size([8, 72])
torch.Size([8, 93]) torch.Size([8, 93])
torch.Size([8, 69]) torch.Size([8, 69])
torch.Size([8, 70]) torch.Size([8, 70])
torch.Size([8, 78]) torch.Size([8, 78])
torch.Size([8, 66]) torch.Size([8, 66])
torch.Size([8, 65]) torch.Size([8, 65])
torch.Size([8, 70]) torch.Size([8, 70])
torch.Size([8, 74]) torch.Size([8, 74])
torch.Size([8, 67]) torch.Size([8, 67])
torch.Size([8, 71]) torch.Size([8, 71])
torch.Size([8, 64]) torch.Size([8, 64])
torch.Size([8, 66]) torch.Size([8, 66])
torch.Size([8, 64]) torch.Size([8, 64])
torch.Size([8, 93]) torch.Size([8, 93])
torch.Size([8, 69]) torch.Size([8, 69])
torch.Size([8, 81]) torch.Size([8, 81])
torch.Size([8, 74]) torch.

As we saw in the preceding code output, thanks to our custom collate function, the data
loader is able to create batches of different lengths. In the next section, we load a
pretrained LLM that we can then finetune with this data loader.
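
The following toy sketch illustrates the same effect in isolation: because padding is applied per batch inside the collate function, each batch is only as long as its longest sample. `pad_collate` and `toy_dataset` are simplified, hypothetical stand-ins, not the notebook's actual collate function or data.

```python
import torch
from torch.utils.data import DataLoader

def pad_collate(batch, pad_token_id=50256):
    # Pad only up to the longest sample of this particular batch
    max_len = max(len(item) for item in batch)
    padded = [item + [pad_token_id] * (max_len - len(item)) for item in batch]
    return torch.tensor(padded)

# A plain list works as a map-style dataset for this illustration
toy_dataset = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10, 11], [12, 13], [14, 15, 16, 17]]
loader = DataLoader(toy_dataset, batch_size=2, shuffle=False, collate_fn=pad_collate)

shapes = [tuple(batch.shape) for batch in loader]
print(shapes)
```

Padding per batch rather than to one global maximum avoids spending compute on unnecessary padding tokens.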

We are going to finetune the 355M-parameter model because the small 124M-parameter model is too limited in capacity for instruction finetuning.
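
As a rough sanity check on these model sizes, the parameter counts can be reproduced from the configuration values used in the next cell. This sketch assumes the standard GPT-2 layout (biases in all linear layers, two LayerNorms per block, one final LayerNorm) and an output layer that shares its weights with the token embedding. `count_gpt2_params` is a hypothetical helper; a model with a separate output head would have `vocab_size * emb_dim` more parameters.

```python
def count_gpt2_params(emb_dim, n_layers, vocab_size=50257, context_length=1024):
    # Token and positional embeddings
    embeddings = vocab_size * emb_dim + context_length * emb_dim
    # W_query, W_key, W_value and out_proj, each with a bias
    attention = 4 * (emb_dim * emb_dim + emb_dim)
    # Two linear layers (emb_dim -> 4*emb_dim -> emb_dim) with biases
    feed_forward = 2 * (emb_dim * 4 * emb_dim) + 4 * emb_dim + emb_dim
    # Scale and shift for each of the two LayerNorms in a block
    layer_norms = 2 * (2 * emb_dim)
    per_block = attention + feed_forward + layer_norms
    final_norm = 2 * emb_dim
    return embeddings + n_layers * per_block + final_norm

small = count_gpt2_params(emb_dim=768, n_layers=12)
medium = count_gpt2_params(emb_dim=1024, n_layers=24)
print(f"gpt2-small:  {small:,}")
print(f"gpt2-medium: {medium:,}")
```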

In [22]:
import sys
sys.path.append('..')

from gpt_download import download_and_load_gpt2
from utils import GPTModel, load_weights_into_gpt

BASE_CONFIG = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "drop_rate": 0.0,        # Dropout rate
    "qkv_bias": True,        # Query-key-value bias
}

model_configs = {
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

CHOOSE_MODEL = "gpt2-medium (355M)"
BASE_CONFIG.update(model_configs[CHOOSE_MODEL])

model_size = CHOOSE_MODEL.split(" ")[-1].lstrip("(").rstrip(")")
settings, params = download_and_load_gpt2(model_size, models_dir="gpt2")

model = GPTModel(BASE_CONFIG)
load_weights_into_gpt(model, params)
model.eval()





File already exists and is up-to-date: gpt2/355M/checkpoint
File already exists and is up-to-date: gpt2/355M/encoder.json
File already exists and is up-to-date: gpt2/355M/hparams.json
File already exists and is up-to-date: gpt2/355M/model.ckpt.data-00000-of-00001
File already exists and is up-to-date: gpt2/355M/model.ckpt.index
File already exists and is up-to-date: gpt2/355M/model.ckpt.meta
File already exists and is up-to-date: gpt2/355M/vocab.bpe


GPTModel(
  (tok_emb): Embedding(50257, 1024)
  (pos_emb): Embedding(1024, 1024)
  (drop_emb): Dropout(p=0.0, inplace=False)
  (trf_blocks): Sequential(
    (0): TransformerBlock(
      (att): MultiHeadAttention(
        (W_query): Linear(in_features=1024, out_features=1024, bias=True)
        (W_value): Linear(in_features=1024, out_features=1024, bias=True)
        (W_key): Linear(in_features=1024, out_features=1024, bias=True)
        (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        (dropout): Dropout(p=0.0, inplace=False)
      )
      (ff): FeedForward(
        (layers): Sequential(
          (0): Linear(in_features=1024, out_features=4096, bias=True)
          (1): GELU()
          (2): Linear(in_features=4096, out_features=1024, bias=True)
        )
      )
      (norm1): LayerNorm()
      (norm2): LayerNorm()
      (drop_shortcut): Dropout(p=0.0, inplace=False)
    )
    (1): TransformerBlock(
      (att): MultiHeadAttention(
        (W_query): Linear(i

Let's first assess the model's instruction-following performance before finetuning.

In [23]:
torch.manual_seed(123)
input_text = format_input(val_data[0]) # the first validation instruction
print(input_text)

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Convert the active sentence to passive: 'The chef cooks the meal every day.'


In [24]:
#model response
from utils import generate, text_to_token_ids, token_ids_to_text

token_ids = generate(
    model=model,
    idx=text_to_token_ids(input_text, tokenizer),
    max_new_tokens=35,
    context_size=BASE_CONFIG["context_length"],
    eos_id=50256,
)

generate_text = token_ids_to_text(token_ids, tokenizer)

# important: the generated text includes the prompt as well,
# so we strip the prompt from the beginning

response_text = generate_text[len(input_text):].strip()
print(response_text)

Convert the passive to active sentence: 'I was helping my mother open her oven at the end of a 3 day hot summer weekend.'

Procedure:


As we can see, the model is not yet good at following instructions.
Let's finetune it.

We will reuse the loss function and the training function from the previous chapter.
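
Those helpers live in `utils` and are not reproduced here. As a hedged sketch of the idea (not necessarily the actual implementation), a per-batch loss could flatten the logits and targets and pass them to `cross_entropy`, which skips targets equal to -100 by default. `calc_loss_batch_sketch` and the tiny `toy_model` are hypothetical stand-ins.

```python
import torch

def calc_loss_batch_sketch(input_batch, target_batch, model, device):
    input_batch = input_batch.to(device)
    target_batch = target_batch.to(device)
    logits = model(input_batch)  # shape: (batch, seq_len, vocab_size)
    # Flatten batch and sequence dimensions; -100 targets are ignored by default
    return torch.nn.functional.cross_entropy(
        logits.flatten(0, 1), target_batch.flatten()
    )

# Toy stand-in for a language model: embedding followed by a linear output layer
torch.manual_seed(0)
toy_model = torch.nn.Sequential(torch.nn.Embedding(10, 4), torch.nn.Linear(4, 10))

toy_inputs = torch.tensor([[1, 2, 3], [4, 5, 6]])
toy_targets = torch.tensor([[2, 3, 9], [5, -100, -100]])  # last two positions masked
toy_loss = calc_loss_batch_sketch(toy_inputs, toy_targets, toy_model, "cpu")
print(toy_loss)
```

The masked positions in `toy_targets` do not influence `toy_loss`, which is the behavior the collate function above relies on.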

In [26]:
from utils import calc_loss_loader, train_model_simple

In [27]:
#Let's calculate initial loss on train and validation sets

model.to(device)
torch.manual_seed(123)

with torch.no_grad():
    train_loss = calc_loss_loader(train_loader, model, device, num_batches=5)
    val_loss = calc_loss_loader(val_loader, model, device, num_batches=5)

print(f"Initial train loss: {train_loss:.4f}")
print(f"Initial validation loss: {val_loss:.4f}")

Initial train loss: 4.0149
Initial validation loss: 3.9677


In [None]:
#training the model

import time
start_time = time.time()
torch.manual_seed(123)
optimiser = torch.optim.AdamW(model.parameters(), lr=0.00005, weight_decay=0.1)
num_epochs = 2

train_losses, val_losses, tokens_seen = train_model_simple(
    model, train_loader, val_loader, optimiser, device,
    num_epochs=num_epochs, eval_freq=5, eval_iter=5,
    start_context=format_input(val_data[0]), tokenizer=tokenizer
)

end_time = time.time()
execution_time_minutes = (end_time - start_time) /60
print(f"Training completed in:  {execution_time_minutes:.2f} minutes")



Ep 1 (Step 000000): Train loss 2.6359, Val loss 2.6184
Ep 1 (Step 000005): Train loss 1.0587, Val loss 1.0890
Ep 1 (Step 000010): Train loss 0.8915, Val loss 0.9075
Ep 1 (Step 000015): Train loss 0.8107, Val loss 0.9160
Ep 1 (Step 000020): Train loss 0.7536, Val loss 0.8154
Ep 1 (Step 000025): Train loss 0.7456, Val loss 0.8818
Ep 1 (Step 000030): Train loss 0.6874, Val loss 0.7819
Ep 1 (Step 000035): Train loss 0.7313, Val loss 0.8243
Ep 1 (Step 000040): Train loss 0.6496, Val loss 0.7848
Ep 1 (Step 000045): Train loss 0.7450, Val loss 0.7196
Ep 1 (Step 000050): Train loss 0.6828, Val loss 0.7319
Ep 1 (Step 000055): Train loss 0.5928, Val loss 0.7277
Ep 1 (Step 000060): Train loss 0.6436, Val loss 0.7306
Ep 1 (Step 000065): Train loss 0.6498, Val loss 0.7013
Ep 1 (Step 000070): Train loss 0.5325, Val loss 0.6923
Ep 1 (Step 000075): Train loss 0.5497, Val loss 0.7189
Ep 1 (Step 000080): Train loss 0.6484, Val loss 0.7167
Ep 1 (Step 000085): Train loss 0.5625, Val loss 0.7098
