# Ch7: Finetuning to follow instructions

In [1]:
from importlib.metadata import version

pkgs = ["matplotlib",
        "tiktoken",
        "torch",
        "tqdm",
        "tensorflow"
        ]

for p in pkgs:
    print(f"{p} version: {version(p)}")

matplotlib version: 3.8.1
tiktoken version: 0.8.0
torch version: 2.0.1
tqdm version: 4.62.3
tensorflow version: 2.16.2


### 7.1: Intro to instruction finetuning

In Ch 5, we showed how to use an LLM to generate text one token at a time ('text completion'). Now we show how an LLM can be finetuned to follow instructions.

This chapter has 3 stages:  
STG 1: Preparing the dataset (dataset download and preparation, batching the dataset, creating data loaders)  
STG 2: Finetuning the LLM (loading a pretrained LLM, instruction finetuning the LLM, inspecting the modeling loss)  
STG 3: Evaluating the LLM (extracting responses, qualitative evaluation, scoring the responses)

### 7.2: Preparing dataset for supervised instruction finetuning

This training is called supervised because the 'input' and 'output' fields are explicitly provided. We use a pre-prepared dataset.

In [2]:
import json
import os
import urllib.request  # the request submodule must be imported explicitly


def download_and_load_file(file_path, url):

    if not os.path.exists(file_path):
        with urllib.request.urlopen(url) as response:
            text_data = response.read().decode("utf-8")
        with open(file_path, "w", encoding="utf-8") as file:
            file.write(text_data)
    
    # The book originally contained this unnecessary "else" clause:
    #else:
    #    with open(file_path, "r", encoding="utf-8") as file:
    #        text_data = file.read()

    with open(file_path, "r", encoding="utf-8") as file:
        data = json.load(file)

    return data


file_path = "instruction-data.json"
url = (
    "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch"
    "/main/ch07/01_main-chapter-code/instruction-data.json"
)

data = download_and_load_file(file_path, url)
print("Number of entries:", len(data))

Number of entries: 1100


In [3]:
# Each entry is a dict with 3 fields (instruction, input, output).
# The input field can be empty.

print("An example entry:\n", data[999])

An example entry:
 {'instruction': "What is an antonym of 'complicated'?", 'input': '', 'output': "An antonym of 'complicated' is 'simple'."}


In [4]:
# Format an entry into an Alpaca-style prompt

def format_input(entry):
    instruction_text = (
        f"Below is an instruction that describes a task. "
        f"Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )

    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""

    return instruction_text + input_text

The code below prints an entry formatted with the Alpaca-style prompt template, followed by the desired response.

In [5]:
model_input = format_input(data[50])
desired_response = f"\n\n### Response:\n{data[50]['output']}"

print(model_input + desired_response)

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Identify the correct spelling of the following word.

### Input:
Ocassion

### Response:
The correct spelling is 'Occasion.'


Formatted prompt and response when the entry has no input

In [6]:
model_input = format_input(data[999])
desired_response = f"\n\n### Response:\n{data[999]['output']}"

print(model_input + desired_response)

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What is an antonym of 'complicated'?

### Response:
An antonym of 'complicated' is 'simple'.


In [7]:
train_portion = int(len(data) * 0.85) # 85% for training
test_portion = int(len(data) * 0.1) # 10% for testing
val_portion = len(data) - train_portion - test_portion # remaining 5% for validation

train_data = data[:train_portion]
test_data = data[train_portion:train_portion+test_portion]
val_data = data[train_portion+test_portion:]

In [8]:
print("Training set length:", len(train_data))
print("Validation set length:", len(val_data))
print("Test set length:", len(test_data))

Training set length: 935
Validation set length: 55
Test set length: 110
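
The split arithmetic can be checked in isolation. The sketch below uses a hypothetical helper, `split_sizes`, which is not part of the notebook; it only reproduces the integer truncation used above and gives the remainder to the validation set.

```python
def split_sizes(n_total, train_frac=0.85, test_frac=0.10):
    """Return (train, val, test) sizes; validation takes the remainder."""
    n_train = int(n_total * train_frac)   # truncates toward zero
    n_test = int(n_total * test_frac)
    n_val = n_total - n_train - n_test    # whatever is left over
    return n_train, n_val, n_test


print(split_sizes(1100))  # (935, 55, 110)
```

Because the validation size is computed as a remainder, the three parts always add up to the full dataset, even when the fractions do not divide it evenly.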


### 7.3: Organizing data into training batches

The steps for batching the dataset are:
1. Format data using the prompt template (done above)
2. Tokenize the formatted data
3. Adjust to the same length with padding tokens
4. Create target token IDs for training
5. Replace padding tokens with placeholders



2. Similar to SpamDataset in ch6, we create an InstructionDataset class that pre-tokenizes the inputs

In [9]:
import torch
from torch.utils.data import Dataset

class InstructionDataset(Dataset):
    def __init__(self, data, tokenizer):
        self.data = data

        # Pre-tokenize texts
        self.encoded_texts = []
        for entry in data:
            instruction_plus_input = format_input(entry)
            response_text = f"\n\n### Response:\n{entry['output']}"
            full_text = instruction_plus_input + response_text
            self.encoded_texts.append(tokenizer.encode(full_text))

    def __getitem__(self, index):
        return self.encoded_texts[index]

    def __len__(self):
        return len(self.data)

We batch multiple training examples with padding and use <|endoftext|> as the padding token, as in ch6.

In [10]:
import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")

print(tokenizer.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))

[50256]


3. We design a custom "collate" function to pass to the data loader. This collate function pads the training examples in each batch to the same length

In [11]:
def custom_collate_draft_1(
        batch,
        pad_token_id=50526,
        device="cpu"
):
    # Find the longest sequence in the batch
    # and increase the max length by +1, which will add one extra
    # padding token below
    batch_max_length = max(len(item)+1 for item in batch)
    
    # Pad and prepare inputs
    inputs_lst = []
    
    for item in batch:
        new_item = item.copy()
        # Add an <|endoftext|> token
        new_item += [pad_token_id]
        # Pad sequences to batch_max_length
        padded = (
            new_item + [pad_token_id] *
            (batch_max_length - len(new_item))
        )
        # Via padded[:-1], we remove the extra padding token
        # that was added via the +1 in batch_max_length
        # (the extra padding token becomes relevant in the later drafts)
        inputs = torch.tensor(padded[:-1])
        inputs_lst.append(inputs)
        
    # Convert list of inputs to tensor and transfer to target device
    inputs_tensor = torch.stack(inputs_lst).to(device)
    return inputs_tensor

In [12]:
inputs_1 = [0,1,2,3,4]
inputs_2 = [5,6]
inputs_3 = [7,8,9]

batch = (inputs_1, inputs_2, inputs_3)
print(custom_collate_draft_1(batch))

tensor([[    0,     1,     2,     3,     4],
        [    5,     6, 50526, 50526, 50526],
        [    7,     8,     9, 50526, 50526]])


4. The code above only returns inputs, but we also need target values (i.e. the inputs shifted by 1 position, so that each target is the next token)

In [13]:
def custom_collate_draft_2(
        batch,
        pad_token_id=50526,
        device="cpu"
):
    # Find the longest sequence in the batch
    batch_max_length = max(len(item)+1 for item in batch)

    # Pad and prepare inputs
    inputs_lst, targets_lst = [], []

    for item in batch:
        new_item = item.copy()
        # Add an <|endoftext|> token
        new_item += [pad_token_id]
        # Pad sequences to max_length
        padded = (
            new_item + [pad_token_id] *
            (batch_max_length - len(new_item))
        )
        inputs = torch.tensor(padded[:-1]) # Truncate the last token for inputs
        targets = torch.tensor(padded[1:]) # Shift +1 to the right for targets
        inputs_lst.append(inputs)
        targets_lst.append(targets)

    # Convert list of inputs to tensor and transfer to target device
    inputs_tensor = torch.stack(inputs_lst).to(device)
    targets_tensor = torch.stack(targets_lst).to(device)
    return inputs_tensor, targets_tensor

In [14]:
inputs, targets = custom_collate_draft_2(batch)
print(inputs)
print(targets)

tensor([[    0,     1,     2,     3,     4],
        [    5,     6, 50526, 50526, 50526],
        [    7,     8,     9, 50526, 50526]])
tensor([[    1,     2,     3,     4, 50526],
        [    6, 50526, 50526, 50526, 50526],
        [    8,     9, 50526, 50526, 50526]])


5. We introduce ignore_index, a placeholder value that replaces all padding token IDs in the targets except the first one, so that the loss function ignores them. We also introduce allowed_max_length in case we want to limit the length of the samples

In [15]:
def custom_collate_fn(
        batch,
        pad_token_id=50526,
        ignore_index=-100,
        allowed_max_length=None,
        device="cpu"
):
    # Find the longest sequence in the batch
    batch_max_length = max(len(item)+1 for item in batch)

    # Pad and prepare inputs
    inputs_lst, targets_lst = [], []

    for item in batch:
        new_item = item.copy()
        # Add an <|endoftext|> token
        new_item += [pad_token_id]
        # Pad sequences to max_length
        padded = (
            new_item + [pad_token_id] *
            (batch_max_length - len(new_item))
        )
        inputs = torch.tensor(padded[:-1]) # Truncate the last token for inputs
        targets = torch.tensor(padded[1:]) # Shift +1 to the right for targets
        
        # New: Replace all but the first padding tokens in targets by ignore_index
        mask = targets == pad_token_id
        indices = torch.nonzero(mask).squeeze()
        if indices.numel() > 1:
            targets[indices[1:]] = ignore_index
            
        # New: Optionally truncate to maximum sequence length
        if allowed_max_length is not None:
            inputs = inputs[:allowed_max_length]
            targets = targets[:allowed_max_length]
        
        inputs_lst.append(inputs)
        targets_lst.append(targets)

    # Convert list of inputs to tensor and transfer to target device
    inputs_tensor = torch.stack(inputs_lst).to(device)
    targets_tensor = torch.stack(targets_lst).to(device)
    return inputs_tensor, targets_tensor

In [16]:
inputs, targets = custom_collate_fn(batch)
print(inputs)
print(targets)


tensor([[    0,     1,     2,     3,     4],
        [    5,     6, 50526, 50526, 50526],
        [    7,     8,     9, 50526, 50526]])
tensor([[    1,     2,     3,     4, 50526],
        [    6, 50526,  -100,  -100,  -100],
        [    8,     9, 50526,  -100,  -100]])
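
The padding, shifting, and masking logic can also be traced with plain Python lists. This is a minimal sketch, not the notebook's function: `pad_shift_mask` is a hypothetical name, it returns lists instead of tensors, and it assumes 50256 (the <|endoftext|> ID printed in In [10]) as the padding token.

```python
def pad_shift_mask(batch, pad_token_id=50256, ignore_index=-100,
                   allowed_max_length=None):
    """List-based sketch of the collate logic (no tensors, no device)."""
    # +1 leaves room for one extra pad token, so that inputs and targets
    # still have the same length after the one-position shift
    batch_max_length = max(len(item) + 1 for item in batch)
    inputs_lst, targets_lst = [], []
    for item in batch:
        padded = list(item) + [pad_token_id] * (batch_max_length - len(item))
        inputs = padded[:-1]   # drop the last token
        targets = padded[1:]   # each target is the next token
        # Keep the first pad token in the targets, mask every later one
        seen_pad = False
        for i, tok in enumerate(targets):
            if tok == pad_token_id:
                if seen_pad:
                    targets[i] = ignore_index
                seen_pad = True
        if allowed_max_length is not None:
            inputs = inputs[:allowed_max_length]
            targets = targets[:allowed_max_length]
        inputs_lst.append(inputs)
        targets_lst.append(targets)
    return inputs_lst, targets_lst


inputs, targets = pad_shift_mask([[0, 1, 2, 3, 4], [5, 6], [7, 8, 9]])
print(inputs[1])   # [5, 6, 50256, 50256, 50256]
print(targets[1])  # [6, 50256, -100, -100, -100]
```

Passing `allowed_max_length=3` to the same call truncates every row to three tokens, which is how overly long samples would be kept within the model's context length.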


Below we explain what the -100 placeholder accomplishes.

In [17]:
logits_1 = torch.tensor(
    [[-1.0, 1.0],  # 1st training example
     [-0.5, 1.5]]  # 2nd training example
)

targets_1 = torch.tensor([0,1])

loss_1 = torch.nn.functional.cross_entropy(logits_1, targets_1)
print(loss_1)

tensor(1.1269)


The loss changes when we add another example

In [18]:
logits_2 = torch.tensor(
    [[-1.0, 1.0],
     [-0.5, 1.5],
     [-0.5, 1.5]]  # New 3rd training example
)

targets_2 = torch.tensor([0,1,1])

loss_2 = torch.nn.functional.cross_entropy(logits_2, targets_2)
print(loss_2)

tensor(0.7936)


In [19]:
targets_3 = torch.tensor([0,1,-100])

loss_3 = torch.nn.functional.cross_entropy(logits_2, targets_3)
print(loss_3)
print("loss_1 == loss_3:", loss_1 == loss_3)

tensor(1.1269)
loss_1 == loss_3: tensor(True)


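The cross-entropy loss is unchanged (loss_3 equals loss_1) because PyTorch's cross_entropy uses ignore_index=-100 by default, so targets equal to -100 are skipped. This way the padding tokens are ignored in the loss calculation. We deliberately do not mask the first padding token (50526 in the code above) in each target sequence: it marks the end of the response, which the model should learn to generate.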
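
The three loss values can be reproduced by hand. The sketch below is a plain-Python re-implementation for illustration only (the function is hypothetical, not PyTorch's code): it averages the per-example negative log-likelihood over the examples whose target is not the ignore index.

```python
import math


def cross_entropy(logits, targets, ignore_index=-100):
    """Mean negative log-likelihood over the non-ignored examples."""
    losses = []
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # skipped: contributes neither to the sum nor the count
        log_sum_exp = math.log(sum(math.exp(x) for x in row))
        losses.append(log_sum_exp - row[target])
    return sum(losses) / len(losses)


logits = [[-1.0, 1.0], [-0.5, 1.5], [-0.5, 1.5]]
print(round(cross_entropy(logits[:2], [0, 1]), 4))    # 1.1269
print(round(cross_entropy(logits, [0, 1, 1]), 4))     # 0.7936
print(round(cross_entropy(logits, [0, 1, -100]), 4))  # 1.1269
```

The ignored example is removed from the denominator as well as the numerator, which is why the third value matches the two-example loss exactly rather than being scaled down.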

### 7.4 Creating data loaders for an instruction dataset

We use InstructionDataset and custom_collate_fn to instantiate the data loaders.

custom_collate_fn() moves each batch to the target device (e.g. the GPU) itself. Doing this transfer as part of data loading, rather than in the training loop, can improve efficiency.

In [20]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

from functools import partial

customized_collate_fn = partial(
    custom_collate_fn,
    device=device,
    allowed_max_length=1024
)
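
The data loader calls its collate function with a single argument, the batch, so the extra settings have to be bound in advance; that is what `partial` does. A minimal sketch with a hypothetical stand-in function (`collate_stub` is not the notebook's collate function and only truncates lists):

```python
from functools import partial


def collate_stub(batch, pad_token_id=50256, allowed_max_length=None,
                 device="cpu"):
    """Hypothetical stand-in: truncates each item and reports the device."""
    if allowed_max_length is not None:
        batch = [item[:allowed_max_length] for item in batch]
    return batch, device


# Bind the keyword arguments once; the result takes only the batch
collate = partial(collate_stub, device="cpu", allowed_max_length=2)

print(collate([[1, 2, 3], [4, 5]]))  # ([[1, 2], [4, 5]], 'cpu')
print(collate.keywords)  # {'device': 'cpu', 'allowed_max_length': 2}
```

A lambda or a small wrapper function would work as well; `partial` has the advantage that the bound arguments stay inspectable through `.keywords`.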

Instantiate the data loaders as in the previous chapters

In [21]:
from torch.utils.data import DataLoader

num_workers = 0
batch_size = 8

torch.manual_seed(123)

train_dataset = InstructionDataset(train_data, tokenizer)
train_loader = DataLoader(
    train_dataset,
    batch_size=batch_size,
    collate_fn=customized_collate_fn, 
    shuffle=True,
    drop_last=True,
    num_workers=num_workers,
)

val_dataset = InstructionDataset(val_data, tokenizer)
val_loader = DataLoader(
    val_dataset,
    batch_size=batch_size,
    collate_fn=customized_collate_fn,
    shuffle=False,
    drop_last=False,
    num_workers=num_workers
)

test_dataset = InstructionDataset(test_data, tokenizer)
test_loader = DataLoader(
    test_dataset,
    batch_size=batch_size,
    collate_fn=customized_collate_fn,
    shuffle=False,
    drop_last=False,
    num_workers=num_workers
)

In [22]:
# See shapes of inputs and targets batch
for inputs, targets in train_loader:
    print(inputs.shape, targets.shape)

torch.Size([8, 61]) torch.Size([8, 61])
torch.Size([8, 76]) torch.Size([8, 76])
torch.Size([8, 73]) torch.Size([8, 73])
torch.Size([8, 68]) torch.Size([8, 68])
torch.Size([8, 65]) torch.Size([8, 65])
torch.Size([8, 72]) torch.Size([8, 72])
torch.Size([8, 80]) torch.Size([8, 80])
torch.Size([8, 67]) torch.Size([8, 67])
torch.Size([8, 62]) torch.Size([8, 62])
torch.Size([8, 75]) torch.Size([8, 75])
torch.Size([8, 62]) torch.Size([8, 62])
torch.Size([8, 68]) torch.Size([8, 68])
torch.Size([8, 67]) torch.Size([8, 67])
torch.Size([8, 77]) torch.Size([8, 77])
torch.Size([8, 69]) torch.Size([8, 69])
torch.Size([8, 79]) torch.Size([8, 79])
torch.Size([8, 71]) torch.Size([8, 71])
torch.Size([8, 66]) torch.Size([8, 66])
torch.Size([8, 83]) torch.Size([8, 83])
torch.Size([8, 68]) torch.Size([8, 68])
torch.Size([8, 80]) torch.Size([8, 80])
torch.Size([8, 71]) torch.Size([8, 71])
torch.Size([8, 69]) torch.Size([8, 69])
torch.Size([8, 65]) torch.Size([8, 65])
torch.Size([8, 68]) torch.Size([8, 68])


Batches have different lengths because each batch is padded only to the length of its longest sequence. Below we check that the inputs contain the padding token ID (50526 in this run) and that the targets contain the -100 placeholder.

In [23]:
print(inputs[0])

tensor([21106,   318,   281, 12064,   326,  8477,   257,  4876,    13, 19430,
          257,  2882,   326, 20431, 32543,   262,  2581,    13,   198,   198,
        21017, 46486,    25,   198, 30003,  6525,   262,  6827,  1262,   257,
          985,   576,    13,   198,   198, 21017, 23412,    25,   198,   464,
         5156,   318,   845, 13779,    13,   198,   198, 21017, 18261,    25,
          198,   464,  5156,   318,   355, 13779,   355,   257,  4936,    13,
        50526, 50526, 50526, 50526, 50526, 50526, 50526, 50526, 50526])


In [24]:
print(targets[0])

tensor([  318,   281, 12064,   326,  8477,   257,  4876,    13, 19430,   257,
         2882,   326, 20431, 32543,   262,  2581,    13,   198,   198, 21017,
        46486,    25,   198, 30003,  6525,   262,  6827,  1262,   257,   985,
          576,    13,   198,   198, 21017, 23412,    25,   198,   464,  5156,
          318,   845, 13779,    13,   198,   198, 21017, 18261,    25,   198,
          464,  5156,   318,   355, 13779,   355,   257,  4936,    13, 50526,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100])


### 7.5 Loading a pretrained LLM

We load the 355-million-parameter GPT-2 model (gpt2-medium).

In [25]:
from gpt_download import  download_and_load_gpt2
from previous_chapters import GPTModel, load_weights_into_gpt 

BASE_CONFIG = {
    "vocab_size": 50257,
    "context_length": 1024,
    "drop_rate": 0.0,
    "qkv_bias": True
}

model_configs = {
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

CHOOSE_MODEL = "gpt2-medium (355M)"


BASE_CONFIG.update(model_configs[CHOOSE_MODEL])


model_size = CHOOSE_MODEL.split(" ")[-1].lstrip("(").rstrip(")")
settings, params = download_and_load_gpt2(model_size=model_size, models_dir="gpt2")

model = GPTModel(BASE_CONFIG)
load_weights_into_gpt(model, params)
model.eval();


File already exists and is up-to-date: gpt2/355M/checkpoint
File already exists and is up-to-date: gpt2/355M/encoder.json
File already exists and is up-to-date: gpt2/355M/hparams.json
File already exists and is up-to-date: gpt2/355M/model.ckpt.data-00000-of-00001
File already exists and is up-to-date: gpt2/355M/model.ckpt.index
File already exists and is up-to-date: gpt2/355M/model.ckpt.meta
File already exists and is up-to-date: gpt2/355M/vocab.bpe
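
The configuration handling in the cell above can be followed without downloading anything. This sketch repeats only the dictionary merge and the string parsing; the lowercase variable names are illustrative stand-ins for the notebook's constants, and the merge builds a new dict instead of updating in place.

```python
base_config = {
    "vocab_size": 50257,
    "context_length": 1024,
    "drop_rate": 0.0,
    "qkv_bias": True,
}
model_configs = {
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
}
choose_model = "gpt2-medium (355M)"

# Merge the shared settings with the size-specific ones
config = {**base_config, **model_configs[choose_model]}

# "gpt2-medium (355M)" -> "(355M)" -> "355M"
model_size = choose_model.split(" ")[-1].lstrip("(").rstrip(")")

print(model_size)                               # 355M
print(config["emb_dim"] // config["n_heads"])   # 64
```

The parsed size string is what selects the `gpt2/355M/` directory seen in the download log.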


Let's see how the pretrained model performs on one of the validation tasks

In [26]:
torch.manual_seed(123)

input_text = format_input(val_data[0])
print(input_text)

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Convert the active sentence to passive: 'The chef cooks the meal every day.'


In [32]:
from previous_chapters import generate, text_to_token_ids, token_ids_to_text

token_ids = generate(
    model=model,
    idx=text_to_token_ids(input_text, tokenizer),
    max_new_tokens=35,
    context_size=BASE_CONFIG["context_length"],
    eos_id=50526,
)

generate_text = token_ids_to_text(token_ids, tokenizer)

In [33]:
generate_text

"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nConvert the active sentence to passive: 'The chef cooks the meal every day.' Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring"

The model has not been finetuned on instructions yet, so the output is poor.

In [34]:
response_text = (generate_text[len(input_text):].replace("### Response:", "").strip())
print(response_text)

Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring
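
The response extraction relies on the generated text starting with the prompt, so the prompt can be sliced off by length. A small sketch with made-up strings (`extract_response` is a hypothetical helper name, and the example prompt is not from the dataset):

```python
def extract_response(generated_text, input_text, marker="### Response:"):
    """Drop the prompt prefix and the response marker, then trim whitespace."""
    return generated_text[len(input_text):].replace(marker, "").strip()


prompt = "### Instruction:\nName a primary color."
generated = prompt + "\n\n### Response:\nRed is a primary color."

print(extract_response(generated, prompt))  # Red is a primary color.
```

Slicing by length assumes the decoded text reproduces the prompt character for character; if decoding altered the prompt, the slice would be off by however much the length changed.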


### 7.6 Finetuning the LLM on instruction data

In [36]:
from previous_chapters import calc_loss_loader, train_model_simple

model.to(device)

torch.manual_seed(123)

with torch.no_grad():
    train_loss = calc_loss_loader(train_loader, model, device, num_batches=5)
    val_loss = calc_loss_loader(val_loader, model, device, num_batches=5)

print("Training loss: ", train_loss)
print("Validation loss: ", val_loss)

IndexError: index out of range in self

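This error is left unresolved in this run. A likely cause is the padding token ID: the collate functions above default to pad_token_id=50526, whereas <|endoftext|> has ID 50256 (see In [10]) and the vocabulary size is 50257, so 50526 falls outside the model's embedding table. Using 50256 as pad_token_id (and as eos_id in In [32]) should avoid it.

We now train the model, which should take ~15 mins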
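
An "index out of range" error from an embedding lookup can be narrowed down by checking the token IDs against the vocabulary size before they reach the model. The helper below is a hypothetical debugging sketch, not part of the notebook; it assumes the vocabulary size of 50257 from BASE_CONFIG.

```python
def out_of_vocab_ids(token_ids, vocab_size=50257):
    """Return the sorted distinct IDs that cannot be looked up in the embedding."""
    return sorted({t for t in token_ids if not 0 <= t < vocab_size})


print(out_of_vocab_ids([21106, 318, 50526, 50526]))  # [50526]
print(out_of_vocab_ids([21106, 318, 50256]))         # []
```

Only the input IDs need this check. The -100 placeholders live in the targets, which are consumed by the loss function rather than the embedding layer.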

In [40]:
import time

start_time = time.time()

torch.manual_seed(123)

optimizer = torch.optim.AdamW(model.parameters(), lr=0.00005, weight_decay=0.1)

num_epochs = 2

train_losses, val_losses, tokens_seen = train_model_simple(
    model, train_loader, val_loader, optimizer, device,
    num_epochs=num_epochs, eval_freq=5, eval_iter=5, start_context=format_input(val_data[0]),
    tokenizer=tokenizer
)

end_time = time.time()
execution_time_mins = (end_time - start_time) / 60

print(f"Training completed in {execution_time_mins:.2f} mins")

IndexError: index out of range in self

In [43]:
from previous_chapters import plot_losses

epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))
plot_losses(epochs_tensor, tokens_seen, train_losses, val_losses)

ImportError: cannot import name 'plot_losses' from 'previous_chapters' (/Users/Z004XN6/PycharmProjects/Build_A_LLM_From_Scratch/Ch_7/previous_chapters.py)

### 7.7 Extracting and saving response

In [44]:
torch.manual_seed(123)

for entry in test_data[:3]:
    
    input_text = format_input(entry)
    
    token_ids = generate(
        model=model,
        idx=text_to_token_ids(input_text, tokenizer).to(device),
        max_new_tokens=256,
        context_size=BASE_CONFIG["context_length"],
        eos_id=50256
    )
    generate_text = token_ids_to_text(token_ids, tokenizer)
    response_text = (
        generate_text[len(input_text):]
        .replace("### Response:", "")
        .strip()
    )
    
    print(input_text)
    print(f"\nCorrect response:\n {entry['output']}")
    print(f"\nModel response:\n {response_text.strip()}")

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Rewrite the sentence using a simile.

### Input:
The car is very fast.

Correct response:
 The car is as fast as lightning.

Model response:
 Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Featuring Feat

Evaluating the model is not as straightforward as it is for classification.
We first store the model responses for test_data; in the next section we evaluate them.

In [None]:
from tqdm import tqdm

for i, entry in tqdm(enumerate(test_data), total=len(test_data)):
    
    input_text = format_input(entry)
    
    token_ids = generate(
        model=model,
        idx=text_to_token_ids(input_text, tokenizer).to(device),
        max_new_tokens=256,
        context_size=BASE_CONFIG["context_length"],
        eos_id=50256
    )
    generated_text = token_ids_to_text(token_ids, tokenizer)
    response_text = generated_text[len(input_text):].replace("### Response:", "").strip()
    
    test_data[i]["model_response"] = response_text
    
with open("instruction-data-with-response.json", "w") as file:
    json.dump(test_data, file, indent=4)

 44%|████████████████▏                    | 48/110 [1:44:49<2:19:47, 135.28s/it]

Save the model for future reuse

In [None]:
import re

# Remove spaces and parentheses from the model label to build the file name
file_name = f"{re.sub(r'[ ()]', '', CHOOSE_MODEL)}-sft.pth"
torch.save(model.state_dict(), file_name)
print(f"Model saved as {file_name}")
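
The file name is derived from the model label by stripping spaces and parentheses. A standalone sketch of just that step (`checkpoint_name` is a hypothetical helper; nothing is saved here):

```python
import re


def checkpoint_name(model_label, suffix="-sft.pth"):
    """Strip spaces and parentheses from the label and append the suffix."""
    return re.sub(r"[ ()]", "", model_label) + suffix


print(checkpoint_name("gpt2-medium (355M)"))  # gpt2-medium355M-sft.pth
```

Keeping the model size in the file name makes it harder to load a checkpoint into a model built with a different configuration by accident.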

### 7.8 Evaluating the finetuned LLM

We evaluate our model using another, larger LLM: Llama 3 by Meta, which can be run locally using ollama.
ollama is a wrapper around llama.cpp. It is a tool for inference only.

To evaluate, we define a query_model() function. It returns the judge model's verbose scoring of each response. These scores are used to evaluate our model.

The code for this section is omitted here.
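
To give a rough idea of what scoring with a judge model involves, the sketch below covers only the two parts that need no running server: building the scoring prompt and parsing a number out of the reply. Both helper names and the prompt wording are assumptions made for illustration, and the actual call to the judge model is left out.

```python
import re


def build_scoring_prompt(instruction_prompt, correct_output, model_response):
    """Hypothetical prompt asking a judge model for a 0-100 score."""
    return (
        f"Given the input `{instruction_prompt}` "
        f"and correct output `{correct_output}`, "
        f"score the model response `{model_response}` "
        "on a scale from 0 to 100, where 100 is the best score. "
        "Respond with the integer number only."
    )


def parse_score(reply):
    """Return the first standalone integer in 0..100, or None if absent."""
    match = re.search(r"\b(\d{1,3})\b", reply)
    if match is None:
        return None
    score = int(match.group(1))
    return score if 0 <= score <= 100 else None


print(parse_score("85"))             # 85
print(parse_score("Score: 70/100"))  # 70
print(parse_score("no number"))      # None
```

Returning None for unparseable replies lets the caller skip or retry those entries instead of silently counting them as a score.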

### 7.9 Conclusion
All steps are covered.
'Preference tuning' is an additional step that can follow instruction tuning.