# Ch7: Finetuning to follow instructions

In [2]:
from importlib.metadata import version

pkgs = ["matplotlib",
        "tiktoken",
        "torch",
        "tqdm",
        "tensorflow"
        ]

for p in pkgs:
    print(f"{p} version: {version(p)}")

matplotlib version: 3.8.1
tiktoken version: 0.8.0
torch version: 2.0.1
tqdm version: 4.62.3
tensorflow version: 2.16.2


### 7.1: Intro to instruction finetuning

In Ch 5, we showed how to use an LLM to generate text one token at a time ('text completion'). Now we show how to finetune an LLM so that it follows instructions.

The chapter has 3 stages:  
STG 1: Preparing the dataset (dataset download and preparation, batching the dataset, creating data loaders)  
STG 2: Finetuning the LLM (loading a pretrained LLM, instruction finetuning the LLM, inspecting the modeling loss)  
STG 3: Evaluating the LLM (extracting responses, qualitative evaluation, scoring the responses)

### 7.2: Preparing dataset for supervised instruction finetuning

This training is called supervised because the dataset explicitly provides the desired 'output' for each 'instruction' and 'input'. We use a pre-prepared dataset.

In [3]:
import json
import os
import urllib.request


def download_and_load_file(file_path, url):

    if not os.path.exists(file_path):
        with urllib.request.urlopen(url) as response:
            text_data = response.read().decode("utf-8")
        with open(file_path, "w", encoding="utf-8") as file:
            file.write(text_data)
    
    # The book originally contained this unnecessary "else" clause:
    #else:
    #    with open(file_path, "r", encoding="utf-8") as file:
    #        text_data = file.read()

    with open(file_path, "r", encoding="utf-8") as file:
        data = json.load(file)

    return data


file_path = "instruction-data.json"
url = (
    "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch"
    "/main/ch07/01_main-chapter-code/instruction-data.json"
)

data = download_and_load_file(file_path, url)
print("Number of entries:", len(data))

Number of entries: 1100


In [8]:
# Each entry is a dict with 3 fields (instruction, input, output).
# The input field can be empty.

print("An example entry:\n", data[999])

An example entry:
 {'instruction': "What is an antonym of 'complicated'?", 'input': '', 'output': "An antonym of 'complicated' is 'simple'."}
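The three-field layout can be checked with a short sketch. `sample_entries` below is a stand-in for `data` that reuses two entries printed in these notes, and `count_empty_inputs` is a hypothetical helper written for illustration, not part of the chapter's code.

```python
# Stand-in for the loaded `data` list (two entries shown in these notes)
sample_entries = [
    {"instruction": "What is an antonym of 'complicated'?",
     "input": "",
     "output": "An antonym of 'complicated' is 'simple'."},
    {"instruction": "Identify the correct spelling of the following word.",
     "input": "Ocassion",
     "output": "The correct spelling is 'Occasion.'"},
]

REQUIRED_KEYS = {"instruction", "input", "output"}


def count_empty_inputs(entries):
    """Validate the keys of every entry and count entries with an empty input."""
    empty = 0
    for i, entry in enumerate(entries):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"entry {i} is missing fields: {sorted(missing)}")
        if not entry["input"]:
            empty += 1
    return empty


print("Entries with empty input:", count_empty_inputs(sample_entries))
```

Running the same helper on the full `data` list would show how many prompts skip the `### Input:` section.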


In [9]:
# Format an entry into an Alpaca-style prompt

def format_input(entry):
    instruction_text = (
        f"Below is an instruction that describes a task. "
        f"Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )

    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""

    return instruction_text + input_text

The code below builds the Alpaca-style prompt for an entry that has an input field.

In [10]:
model_input = format_input(data[50])
desired_response = f"\n\n### Response:\n{data[50]['output']}"

print(model_input + desired_response)

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Identify the correct spelling of the following word.

### Input:
Ocassion

### Response:
The correct spelling is 'Occasion.'


Formatted prompt for an entry with no input field:

In [11]:
model_input = format_input(data[999])
desired_response = f"\n\n### Response:\n{data[999]['output']}"

print(model_input + desired_response)

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What is an antonym of 'complicated'?

### Response:
An antonym of 'complicated' is 'simple'.


In [12]:
train_portion = int(len(data) * 0.85)  # 85% for training
test_portion = int(len(data) * 0.1)    # 10% for testing
val_portion = len(data) - train_portion - test_portion  # remaining 5% for validation

train_data = data[:train_portion]
test_data = data[train_portion:train_portion+test_portion]
val_data = data[train_portion+test_portion:]

In [13]:
print("Training set length:", len(train_data))
print("Validation set length:", len(val_data))
print("Test set length:", len(test_data))

Training set length: 935
Validation set length: 55
Test set length: 110
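The split sizes above follow from integer truncation plus "whatever is left" for validation. A minimal sketch, assuming the same 85/10 fractions; `split_sizes` is a hypothetical helper, and a plain list of indices stands in for the real entries:

```python
def split_sizes(n, train_frac=0.85, test_frac=0.1):
    """Return (train, test, val) sizes; validation takes the remainder."""
    train = int(n * train_frac)
    test = int(n * test_frac)
    val = n - train - test
    return train, test, val


n_entries = 1100
train_n, test_n, val_n = split_sizes(n_entries)
print(train_n, test_n, val_n)

# Slicing with these sizes yields three disjoint parts that cover everything
indices = list(range(n_entries))
train_idx = indices[:train_n]
test_idx = indices[train_n:train_n + test_n]
val_idx = indices[train_n + test_n:]
print(len(train_idx) + len(test_idx) + len(val_idx) == n_entries)
```

Because the validation size is computed as the remainder, no entry is lost when `int()` rounds the other two sizes down.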


### 7.3: Organizing data into training batches

The steps for batching the dataset are:
1. Format data using prompt template (done above)
2. Tokenize formatted data
3. Adjust to same length by padding tokens
4. Create target token IDs for training
5. Replace padding tokens with placeholders



2. Similar to SpamDataset in ch6, we create an InstructionDataset class that pre-tokenizes the formatted entries.

In [15]:
import torch
from torch.utils.data import Dataset

class InstructionDataset(Dataset):
    def __init__(self, data, tokenizer):
        self.data = data

        # Pre-tokenize texts
        self.encoded_texts = []
        for entry in data:
            instruction_plus_input = format_input(entry)
            response_text = f"\n\n### Response:\n{entry['output']}"
            full_text = instruction_plus_input + response_text
            self.encoded_texts.append(tokenizer.encode(full_text))

    def __getitem__(self, index):
        return self.encoded_texts[index]

    def __len__(self):
        return len(self.data)

We batch multiple training examples with padding and use <|endoftext|> as the padding token, similar to ch6.

In [16]:
import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")

print(tokenizer.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))

[50256]


3. We design a custom "collate" function to pass into the dataloader. This collate function pads the training examples in each batch to the same length.

In [17]:
def custom_collate_draft_1(
        batch,
        pad_token_id=50256,
        device="cpu"
):
    # Find the longest sequence in the batch
    # and increase the max length by +1, which will add one extra
    # padding token below
    batch_max_length = max(len(item)+1 for item in batch)

    # Pad and prepare inputs
    inputs_lst = []

    for item in batch:
        new_item = item.copy()
        # Add an <|endoftext|> token
        new_item += [pad_token_id]
        # Pad sequences to batch_max_length
        padded = (
            new_item + [pad_token_id] *
            (batch_max_length - len(new_item))
        )
        # Via padded[:-1], we remove the extra padding token
        # that was added via the +1 setting in batch_max_length
        # (the extra padding token becomes relevant in the later drafts)
        inputs = torch.tensor(padded[:-1])
        inputs_lst.append(inputs)

    # Convert list of inputs to tensor and transfer to target device
    inputs_tensor = torch.stack(inputs_lst).to(device)
    return inputs_tensor

In [18]:
inputs_1 = [0,1,2,3,4]
inputs_2 = [5,6]
inputs_3 = [7,8,9]

batch = (inputs_1, inputs_2, inputs_3)
print(custom_collate_draft_1(batch))

tensor([[    0,     1,     2,     3,     4],
        [    5,     6, 50256, 50256, 50256],
        [    7,     8,     9, 50256, 50256]])


4. The code above only returns inputs, but we also need the target token IDs. The targets are the inputs shifted by one position, so that each target is the token that follows the corresponding input token.

In [19]:
def custom_collate_draft_2(
        batch,
        pad_token_id=50256,
        device="cpu"
):
    # Find the longest sequence in the batch
    batch_max_length = max(len(item)+1 for item in batch)

    # Pad and prepare inputs
    inputs_lst, targets_lst = [], []

    for item in batch:
        new_item = item.copy()
        # Add an <|endoftext|> token
        new_item += [pad_token_id]
        # Pad sequences to max_length
        padded = (
            new_item + [pad_token_id] *
            (batch_max_length - len(new_item))
        )
        inputs = torch.tensor(padded[:-1]) # Truncate the last token for inputs
        targets = torch.tensor(padded[1:]) # Shift +1 to the right for targets
        inputs_lst.append(inputs)
        targets_lst.append(targets)

    # Convert list of inputs to tensor and transfer to target device
    inputs_tensor = torch.stack(inputs_lst).to(device)
    targets_tensor = torch.stack(targets_lst).to(device)
    return inputs_tensor, targets_tensor

In [20]:
inputs, targets = custom_collate_draft_2(batch)
print(inputs)
print(targets)

tensor([[    0,     1,     2,     3,     4],
        [    5,     6, 50256, 50256, 50256],
        [    7,     8,     9, 50256, 50256]])
tensor([[    1,     2,     3,     4, 50256],
        [    6, 50256, 50256, 50256, 50256],
        [    8,     9, 50256, 50256, 50256]])
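To make the one-position shift concrete, here is a list-only sketch for a single sequence. `shifted_pairs` is a hypothetical helper (not the collate function itself), and 50256 is the <|endoftext|> ID printed by the tokenizer earlier.

```python
PAD_ID = 50256  # <|endoftext|> ID reported by the GPT-2 tokenizer


def shifted_pairs(token_ids, pad_id=PAD_ID):
    """Append one end-of-text token, then split into inputs and next-token targets."""
    padded = list(token_ids) + [pad_id]
    return padded[:-1], padded[1:]


ex_inputs, ex_targets = shifted_pairs([7, 8, 9])
for inp, tgt in zip(ex_inputs, ex_targets):
    # Each input token is paired with the token that follows it
    print(f"input {inp} -> target {tgt}")
```

The last pair shows why the extra token is appended: without it, the final input token would have no target.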


5. We introduce ignore_index, a placeholder value (-100) that replaces every padding token ID in the targets except the first one, so that the loss function ignores the padding positions. We also introduce allowed_max_length in case we want to limit the length of the samples.

In [23]:
def custom_collate_fn(
        batch,
        pad_token_id=50256,
        ignore_index=-100,
        allowed_max_length=None,
        device="cpu"
):
    # Find the longest sequence in the batch
    batch_max_length = max(len(item)+1 for item in batch)

    # Pad and prepare inputs
    inputs_lst, targets_lst = [], []

    for item in batch:
        new_item = item.copy()
        # Add an <|endoftext|> token
        new_item += [pad_token_id]
        # Pad sequences to max_length
        padded = (
            new_item + [pad_token_id] *
            (batch_max_length - len(new_item))
        )
        inputs = torch.tensor(padded[:-1]) # Truncate the last token for inputs
        targets = torch.tensor(padded[1:]) # Shift +1 to the right for targets

        # New: Replace all but the first padding tokens in targets by ignore_index
        mask = targets == pad_token_id
        indices = torch.nonzero(mask).squeeze()
        if indices.numel() > 1:
            targets[indices[1:]] = ignore_index

        # New: Optionally truncate to maximum sequence length
        if allowed_max_length is not None:
            inputs = inputs[:allowed_max_length]
            targets = targets[:allowed_max_length]

        inputs_lst.append(inputs)
        targets_lst.append(targets)

    # Convert list of inputs to tensor and transfer to target device
    inputs_tensor = torch.stack(inputs_lst).to(device)
    targets_tensor = torch.stack(targets_lst).to(device)
    return inputs_tensor, targets_tensor

In [24]:
inputs, targets = custom_collate_fn(batch)
print(inputs)
print(targets)

tensor([[    0,     1,     2,     3,     4],
        [    5,     6, 50256, 50256, 50256],
        [    7,     8,     9, 50256, 50256]])
tensor([[    1,     2,     3,     4, 50256],
        [    6, 50256,  -100,  -100,  -100],
        [    8,     9, 50256,  -100,  -100]])
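The masking and truncation steps can also be traced on a plain list. This is a simplified sketch of the same idea, not the tensor implementation; `mask_and_truncate` is a hypothetical name, and 50256 is the <|endoftext|> ID printed by the tokenizer earlier.

```python
def mask_and_truncate(targets, pad_token_id=50256, ignore_index=-100,
                      allowed_max_length=None):
    """Keep the first pad token, replace later ones, then optionally truncate."""
    masked = []
    seen_pad = False
    for tok in targets:
        if tok == pad_token_id and seen_pad:
            # Second and later pad tokens become the placeholder
            masked.append(ignore_index)
        else:
            masked.append(tok)
            if tok == pad_token_id:
                seen_pad = True
    if allowed_max_length is not None:
        masked = masked[:allowed_max_length]
    return masked


row = [6, 50256, 50256, 50256, 50256]
print(mask_and_truncate(row))
print(mask_and_truncate(row, allowed_max_length=3))
```

The first pad token stays in the targets, so one end-of-text prediction per sequence still contributes to the loss.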


The examples below show what replacing padding targets with -100 accomplishes.

In [26]:
logits_1 = torch.tensor(
    [[-1.0, 1.0],  # 1st training example
     [-0.5, 1.5]]  # 2nd training example
)

targets_1 = torch.tensor([0,1])

loss_1 = torch.nn.functional.cross_entropy(logits_1, targets_1)
print(loss_1)

tensor(1.1269)


Adding a third training example changes the average loss:

In [27]:
logits_2 = torch.tensor(
    [[-1.0, 1.0],
     [-0.5, 1.5],
     [-0.5, 1.5]]  # New 3rd training example
)

targets_2 = torch.tensor([0,1,1])

loss_2 = torch.nn.functional.cross_entropy(logits_2, targets_2)
print(loss_2)

tensor(0.7936)


In [28]:
targets_3 = torch.tensor([0,1,-100])

loss_3 = torch.nn.functional.cross_entropy(logits_2, targets_3)
print(loss_3)
print("loss_1 == loss_3:", loss_1 == loss_3)

tensor(1.1269)
loss_1 == loss_3: tensor(True)
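The three loss values can be reproduced by hand, which shows where the -100 target drops out. This is a pure-Python sketch of mean cross entropy with an ignore value, written for illustration; `mean_cross_entropy` is a hypothetical helper and it assumes at least one target is not ignored.

```python
import math


def mean_cross_entropy(logits, targets, ignore_index=-100):
    """Average -log softmax(logits)[target] over rows whose target is not ignored."""
    losses = []
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # this row contributes neither to the sum nor to the count
        log_sum_exp = math.log(sum(math.exp(v) for v in row))
        losses.append(log_sum_exp - row[target])
    return sum(losses) / len(losses)


rows = [[-1.0, 1.0], [-0.5, 1.5], [-0.5, 1.5]]

loss_a = mean_cross_entropy(rows[:2], [0, 1])    # two examples
loss_b = mean_cross_entropy(rows, [0, 1, 1])     # three examples
loss_c = mean_cross_entropy(rows, [0, 1, -100])  # third example ignored

print(round(loss_a, 4), round(loss_b, 4), round(loss_c, 4))
```

The ignored row is removed from both the sum and the divisor, which is why the third result matches the two-example loss rather than shrinking toward zero.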


The loss is unchanged because PyTorch's cross_entropy has ignore_index=-100 by default, so targets equal to -100 are skipped when the loss is averaged. This way we ignore the padding tokens in the cross entropy calculation. But we make sure not 