## 1.1 Introduction to instruction fine-tuning.

* Here, we focus on improving the LLM's ability to follow instructions and generate a desired response.
* This involves three stages: preparing the dataset, fine-tuning the LLM, and evaluating the LLM.

## 1.2 Preparing a dataset for supervised instruction fine-tuning.

In [1]:
import json
import os
import urllib.request  # the request submodule must be imported explicitly

def download_load_file(file_path,url):
  if not os.path.exists(file_path):
    with urllib.request.urlopen(url) as response:
      text_data = response.read().decode('utf-8')
    with open(file_path,"w",encoding='utf-8') as file:
      file.write(text_data)

  with open(file_path,'r',encoding='utf-8') as file:
    data = json.load(file)
  return data

file_path = "instruction-data.json"
url = (
    "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch"
    "/main/ch07/01_main-chapter-code/instruction-data.json"
)

data = download_load_file(file_path, url)
print("Number of entries: ",len(data))

Number of entries:  1100


In [2]:
## printing one entry from the list loaded from the JSON file
print("Sample entry:\n",data[50])

Sample entry:
 {'instruction': 'Identify the correct spelling of the following word.', 'input': 'Ocassion', 'output': "The correct spelling is 'Occasion.'"}


* Instruction fine-tuning trains a model on explicit input-output pairs like the entry above: the instruction (plus an optional input) forms the model input, and the `output` field is the desired response.

In [6]:
##implementing a prompt-formatting function
def format_input(entry):
  instruction_text = (
      f"Below is an instruction that describes a task. "
      f"Write a response that appropriately completes the request. "
      f"\n\n### Instruction:\n{entry['instruction']}"
  )
  # the input section is only added when the entry has a non-empty input
  input_text = (
      f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""
  )
  return instruction_text + input_text

* The `format_input` function takes a dictionary `entry` and builds the prompt string; the `### Input:` section is included only when `entry["input"]` is non-empty.

In [9]:
model_input = format_input(data[50])
desired_response = f"\n\n### Response:\n{data[50]['output']}"
print(model_input + desired_response)

Below is an instruction that describes a task. Write a response that appropriately completes the request. 

### Instruction:
Identify the correct spelling of the following word.

### Input:
Ocassion

### Response:
The correct spelling is 'Occasion.'


In [10]:
model_input = format_input(data[999])
desired_response = f"\n\n### Response:\n{data[999]['output']}"
print(model_input + desired_response)

Below is an instruction that describes a task. Write a response that appropriately completes the request. 

### Instruction:
What is an antonym of 'complicated'?

### Response:
An antonym of 'complicated' is 'simple'.


In [11]:
##dividing dataset into training,validation and test sets
train_portion = int(len(data)* 0.85)
test_portion = int(len(data)*0.1)
val_portion = len(data) - train_portion - test_portion

train_data = data[:train_portion]
test_data = data[train_portion:train_portion+test_portion]
val_data = data[train_portion + test_portion:]

print("Training set length:",len(train_data))
print("Validation set length:",len(val_data))
print("Test set length:",len(test_data))

Training set length: 935
Validation set length: 55
Test set length: 110
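
The same split can be wrapped in a small helper. The following is a minimal sketch rather than part of the original notebook: `split_dataset` and its parameter names are hypothetical, and a stand-in list of 1100 integers replaces the real entries.

```python
def split_dataset(data, train_frac=0.85, test_frac=0.10):
    """Split a list into train, validation, and test parts (hypothetical helper)."""
    train_end = int(len(data) * train_frac)
    test_end = train_end + int(len(data) * test_frac)
    train = data[:train_end]
    test = data[train_end:test_end]
    val = data[test_end:]  # validation set gets whatever remains
    return train, val, test

# Stand-in for the 1100 instruction entries
dummy_data = list(range(1100))
train_part, val_part, test_part = split_dataset(dummy_data)
print(len(train_part), len(val_part), len(test_part))  # 935 55 110
```

Because the validation set is defined as the remainder, the three parts always cover the full dataset without overlap, whatever rounding the two `int(...)` calls introduce.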


## 1.3 Organizing data into training batches

* In the previous chapter, the training batches were created automatically by the PyTorch `DataLoader` class, which employs a default collate function to combine lists of samples into batches.
* A collate function is responsible for taking a list of individual data samples and merging them into a single batch that the model can process efficiently during training.
* However, the batching process for instruction fine-tuning is a bit more involved and requires us to create our own custom collate function that we will later plug into the `DataLoader`.
* We implement this custom collate function to handle the specific requirements and formatting of our instruction fine-tuning dataset.


In [28]:
##implementing an instruction dataset
import torch
from torch.utils.data import Dataset

class InstructionDataset(Dataset):
  def __init__(self,data,tokenizer):
    self.data = data
    self.encoded_texts = []
    for entry in data:
      instruction_input = format_input(entry)
      response_text = f"\n\n### Response:\n{entry['output']}"
      full_text = instruction_input + response_text
      self.encoded_texts.append(
          tokenizer.encode(full_text)
      )

  def __getitem__(self,index):
    return self.encoded_texts[index]

  def __len__(self):
    return len(self.data)

In [14]:
import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")
## adding endoftext token whose id will be appended
## to the pretokenized input directly
print(tokenizer.encode("<|endoftext|>",allowed_special={"<|endoftext|>"}))

[50256]


In [16]:
## defining a collate function that pads the training
## examples in each batch to the same length
def custom_collate1(batch,pad_token_id=50256,device='cpu'):
  batch_max_length = max(len(item) + 1 for item in batch)
  inputs_lst = []

  for item in batch:
    new_item = item.copy()
    new_item += [pad_token_id]

    padded = (
        new_item + [pad_token_id] *(batch_max_length - len(new_item))
    )
    inputs = torch.tensor(padded[:-1])
    inputs_lst.append(inputs)

  input_tensors = torch.stack(inputs_lst).to(device)
  return input_tensors

In [18]:
## example case
input_1 = [0,1,2,3,4]
input_2 = [5,6]
input_3 = [7,8,9]
batch = (
    input_1,
    input_2,
    input_3
)
print(custom_collate1(batch))

tensor([[    0,     1,     2,     3,     4],
        [    5,     6, 50256, 50256, 50256],
        [    7,     8,     9, 50256, 50256]])


* This output shows that all inputs have been padded to the length of the longest input list, `input_1`, which contains five token IDs.

In [19]:
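## creating a collate function that also returns the target token IDs
def custom_collate2(batch,pad_token_id=50256,device="cpu"):
  batch_max_length = max(len(item)+1 for item in batch)
  inputs_lst,targets_lst = [],[]

  for item in batch:
    new_item = item.copy()
    new_item += [pad_token_id]
    padded = (
        new_item + [pad_token_id]*(batch_max_length - len(new_item))
    )
    inputs = torch.tensor(padded[:-1])  # drops the last token for inputs
    targets = torch.tensor(padded[1:])  # shifts by one position for targets
    inputs_lst.append(inputs)
    targets_lst.append(targets)

  inputs_tensor = torch.stack(inputs_lst).to(device)
  targets_tensor = torch.stack(targets_lst).to(device)
  return inputs_tensor,targets_tensor

inputs,targets = custom_collate2(batch)
print(inputs)
print(targets)

tensor([[    0,     1,     2,     3,     4],
        [    5,     6, 50256, 50256, 50256],
        [    7,     8,     9, 50256, 50256]])
tensor([[    1,     2,     3,     4, 50256],
        [    6, 50256, 50256, 50256, 50256],
        [    8,     9, 50256, 50256, 50256]])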



In [20]:
##implementing a full custom batch collate function
def custom_collate(batch,pad_token_id=50256,ignore_index=-100,allowed_max_length=None,device="cpu"):
  batch_max_length = max(len(item)+1 for item in batch)
  inputs_lst,targets_lst = [],[]

  for item in batch:
    new_item = item.copy()
    new_item += [pad_token_id]

    padded = (
        new_item + [pad_token_id]*(batch_max_length - len(new_item))
    )
    inputs = torch.tensor(padded[:-1])  # drops the last token for inputs
    targets = torch.tensor(padded[1:])  # shifts by one position for targets

    ## replaces all but the first padding token in targets with ignore_index
    mask = targets == pad_token_id
    indices = torch.nonzero(mask).squeeze()
    if indices.numel() > 1:
      targets[indices[1:]] = ignore_index



    if allowed_max_length is not None:
      inputs = inputs[:allowed_max_length]
      targets = targets[:allowed_max_length]

    inputs_lst.append(inputs)
    targets_lst.append(targets)

  inputs_tensor = torch.stack(inputs_lst).to(device)
  targets_tensor = torch.stack(targets_lst).to(device)
  return inputs_tensor,targets_tensor

In [21]:
##use case
inputs,targets = custom_collate(batch)
print(inputs)
print(targets)

tensor([[    0,     1,     2,     3,     4],
        [    5,     6, 50256, 50256, 50256],
        [    7,     8,     9, 50256, 50256]])
tensor([[    1,     2,     3,     4, 50256],
        [    6, 50256,  -100,  -100,  -100],
        [    8,     9, 50256,  -100,  -100]])
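
To make the masking step easier to follow, here is a plain-Python sketch of the same idea without PyTorch. The helper name `mask_padding` is hypothetical; like `custom_collate`, it keeps the first occurrence of the padding ID and replaces every later occurrence with `ignore_index`.

```python
def mask_padding(targets, pad_token_id=50256, ignore_index=-100):
    """Keep the first padding token and replace the rest with ignore_index."""
    pad_positions = [i for i, t in enumerate(targets) if t == pad_token_id]
    masked = list(targets)  # copy so the caller's list is not modified
    for i in pad_positions[1:]:
        masked[i] = ignore_index
    return masked

print(mask_padding([6, 50256, 50256, 50256, 50256]))
# [6, 50256, -100, -100, -100]
print(mask_padding([1, 2, 3, 4, 50256]))
# [1, 2, 3, 4, 50256]
```

The first padding token stays in the targets because it doubles as the end-of-text token, so the model can still learn to emit it once the response is complete.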


In [22]:
logits_1 = torch.tensor(
    [[-1.0,1.0],[-0.5,1.5]]
)
targets_1 = torch.tensor([0,1]) # correct token indices to generate
loss_1 = torch.nn.functional.cross_entropy(logits_1,targets_1)
print(loss_1)

tensor(1.1269)


In [23]:
logits_2 = torch.tensor(
    [[-1.0, 1.0],
     [-0.5, 1.5],
     [-0.5, 1.5]]
)
targets_2 = torch.tensor([0, 1, 1])
loss_2 = torch.nn.functional.cross_entropy(logits_2, targets_2)
print(loss_2)

tensor(0.7936)


In [24]:
##seeing what happens if we replace the third target token
## with -100
targets_3 = torch.tensor([0,1,-100])
loss_3 = torch.nn.functional.cross_entropy(logits_2,targets_3)
print(loss_3)
print("loss_1 == loss_3: ",loss_1== loss_3)

tensor(1.1269)
loss_1 == loss_3:  tensor(True)
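
The three results above follow from how `torch.nn.functional.cross_entropy` averages the per-example losses and skips targets equal to its default `ignore_index` of -100. The sketch below reproduces the numbers in plain Python; `mean_cross_entropy` is a hypothetical stand-in written for illustration, not PyTorch's implementation.

```python
import math

def mean_cross_entropy(logits, targets, ignore_index=-100):
    """Average cross-entropy over rows whose target is not ignore_index.

    Assumes at least one target is not ignored.
    """
    losses = []
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # ignored targets contribute nothing to the loss
        log_sum_exp = math.log(sum(math.exp(x) for x in row))
        losses.append(log_sum_exp - row[target])
    return sum(losses) / len(losses)

demo_logits = [[-1.0, 1.0], [-0.5, 1.5], [-0.5, 1.5]]
print(round(mean_cross_entropy(demo_logits[:2], [0, 1]), 4))    # 1.1269
print(round(mean_cross_entropy(demo_logits, [0, 1, 1]), 4))     # 0.7936
print(round(mean_cross_entropy(demo_logits, [0, 1, -100]), 4))  # 1.1269
```

With the third target set to -100, only the first two rows are averaged, which is why the loss matches the two-example case exactly.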


## 1.4 Creating data loader for an instruction dataset

In [25]:
##initializing the device variable
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
if torch.backends.mps.is_available():
  device = torch.device("mps")
print("Device:",device)

Device: cuda:0


In [26]:
## reusing the chosen device setting in the custom_collate function:
## functools.partial creates a new version of the function
## with the device and allowed_max_length arguments prefilled
from functools import partial

customized_collate_fn = partial(
    custom_collate,
    device=device,
    allowed_max_length=1024
)
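
For readers unfamiliar with `functools.partial`, the toy example below shows the mechanism in isolation. `toy_collate` is a made-up function used only for illustration; it mimics the extra keyword arguments of `custom_collate` without using PyTorch.

```python
from functools import partial

def toy_collate(batch, device="cpu", allowed_max_length=None):
    """Toy stand-in: truncate each item and report the target device."""
    trimmed = [item[:allowed_max_length] for item in batch]
    return trimmed, device

# Prefill the keyword arguments; the result takes only `batch`
toy_collate_prefilled = partial(toy_collate, device="cuda", allowed_max_length=2)

print(toy_collate([[1, 2, 3], [4]]))            # ([[1, 2, 3], [4]], 'cpu')
print(toy_collate_prefilled([[1, 2, 3], [4]]))  # ([[1, 2], [4]], 'cuda')
```

This matters because a `DataLoader` calls its `collate_fn` with the batch as the only argument, so any other settings have to be bound in advance.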

In [29]:
##initializing the data loaders
from torch.utils.data import DataLoader
num_workers = 0
batch_size = 8

torch.manual_seed(123)

train_dataset = InstructionDataset(train_data,tokenizer)
train_loader = DataLoader(
    train_dataset,
    batch_size=batch_size,
    shuffle=True,
    num_workers=num_workers,
    drop_last=True,
    collate_fn=customized_collate_fn
)
val_dataset = InstructionDataset(val_data,tokenizer)
val_loader = DataLoader(
    val_dataset,
    batch_size=batch_size,
    shuffle=False,
    num_workers=num_workers,
    drop_last=True,
    collate_fn=customized_collate_fn
)
test_dataset = InstructionDataset(test_data,tokenizer)
test_loader = DataLoader(
    test_dataset,
    batch_size=batch_size,
    shuffle=False,
    num_workers=num_workers,
    collate_fn=customized_collate_fn
)
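
As a quick sanity check on the loaders, the number of batches each one yields can be worked out from the split sizes printed earlier. The helper `num_batches` below is a hypothetical plain-Python sketch of how `drop_last` affects that count; it does not touch the actual `DataLoader` objects.

```python
import math

def num_batches(num_samples, batch_size, drop_last):
    """Number of batches a loader yields for the given settings."""
    if drop_last:
        return num_samples // batch_size  # the incomplete final batch is discarded
    return math.ceil(num_samples / batch_size)

print(num_batches(935, 8, drop_last=True))   # 116 (training set)
print(num_batches(55, 8, drop_last=True))    # 6 (validation set)
print(num_batches(110, 8, drop_last=False))  # 14 (test set)
```

With `drop_last=True`, the validation loader as configured above skips 7 of its 55 examples in every pass, since 6 full batches cover only 48.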