### Instruction finetuning

- Pretraining an LLM involves a training procedure where it learns to generate one word at a time
- Hence, a pretrained LLM is good at text completion, but it is not good at following instructions


### Preparing a dataset for supervised instruction finetuning

In [1]:
import json


with open("instruction-data.json", "r") as file:
    data = json.load(file)
print("Number of entries:", len(data))

Number of entries: 1100


In [2]:
print("Example entry:\n", data[50])

Example entry:
 {'instruction': 'Identify the correct spelling of the following word.', 'input': 'Ocassion', 'output': "The correct spelling is 'Occasion.'"}


In [3]:
print("Another example entry:\n", data[999])

Another example entry:
 {'instruction': "What is an antonym of 'complicated'?", 'input': '', 'output': "An antonym of 'complicated' is 'simple'."}


In [4]:
# Alpaca-style (https://crfm.stanford.edu/2023/03/13/alpaca.html) prompt formatting - for instruction finetuning

def format_input(entry):
    instruction_text = (
        f"Below is an instruction that describes a task. "
        f"Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )

    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""

    return instruction_text + input_text

In [5]:
model_input = format_input(data[50])
desired_response = f"\n\n### Response:\n{data[50]['output']}"

print(model_input + desired_response)

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Identify the correct spelling of the following word.

### Input:
Ocassion

### Response:
The correct spelling is 'Occasion.'


In [6]:
model_input = format_input(data[999])
desired_response = f"\n\n### Response:\n{data[999]['output']}"

print(model_input + desired_response)

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What is an antonym of 'complicated'?

### Response:
An antonym of 'complicated' is 'simple'.


### Creating training and test sets

In [7]:
train_portion = int(len(data) * 0.85)  # 85% for training
test_portion = int(len(data) * 0.15)    # 15% for testing

train_data = data[:train_portion]
test_data = data[train_portion:]

print("Training set length:", len(train_data))
print("Test set length:", len(test_data))

Training set length: 935
Test set length: 165


In [8]:
with open("train.json", "w") as json_file:
    json.dump(train_data, json_file, indent=4)
    
with open("test.json", "w") as json_file:
    json.dump(test_data, json_file, indent=4)

### finetuning (instruction)

In [11]:
from litgpt import LLM
llm = LLM.load("microsoft/phi-2")

Setting HF_HUB_ENABLE_HF_TRANSFER=1


config.json:   0%|          | 0.00/735 [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/564M [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/35.7k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/7.34k [00:00<?, ?B/s]

Converting checkpoint files to LitGPT format.
{'checkpoint_dir': PosixPath('checkpoints/microsoft/phi-2'),
 'debug_mode': False,
 'dtype': None,
 'model_name': None}


Loading weights: model-00002-of-00002.safetensors: 100%|██████████| 00:13<00:00,  7.36it/s


Saving converted checkpoint to checkpoints/microsoft/phi-2


In [12]:
!litgpt finetune_lora microsoft/phi-2 \
--data JSON \
--data.val_split_fraction 0.1 \
--data.json_path train.json \
--train.epochs 3 \
--train.log_interval 100

{'access_token': None,
 'checkpoint_dir': PosixPath('checkpoints/microsoft/phi-2'),
 'data': JSON(json_path=PosixPath('train.json'),
              mask_prompt=False,
              val_split_fraction=0.1,
              prompt_style=<litgpt.prompts.Alpaca object at 0x31de577d0>,
              ignore_index=-100,
              seed=42,
              num_workers=4),
 'devices': 1,
 'eval': EvalArgs(interval=100,
                  max_new_tokens=100,
                  max_iters=100,
                  initial_validation=False,
                  final_validation=True,
                  evaluate_example='first'),
 'logger_name': 'csv',
 'lora_alpha': 16,
 'lora_dropout': 0.05,
 'lora_head': False,
 'lora_key': False,
 'lora_mlp': False,
 'lora_projection': False,
 'lora_query': True,
 'lora_r': 8,
 'lora_value': True,
 'num_nodes': 1,
 'optimizer': 'AdamW',
 'out_dir': PosixPath('out/finetune/lora'),
 'precision': None,
 'quantize': None,
 'seed': 1337,
 'train': TrainArgs(save_interval=1000,
 

#### Generate and save the test set model responses of the base model

In [13]:
from tqdm import tqdm

for i in tqdm(range(len(test_data))):
    response = llm.generate(test_data[i])
    test_data[i]["base_model"] = response

test_data[1]

100%|██████████| 165/165 [10:57<00:00,  3.99s/it]


{'instruction': 'What type of cloud is typically associated with thunderstorms?',
 'input': '',
 'output': 'The type of cloud typically associated with thunderstorms is cumulonimbus.',
 'base_model': ' Invalid Use Case for Test Scenarios:\nUsage Enactment: 81\nPrecondition: 82\nPostcondition: 83\nException Handling: 84\nEvaluation Method 1014\n'}

#### Generate and save the test set model responses of the finetuned model

In [14]:
llm2 = LLM.load("out/finetune/lora/final/")

for i in tqdm(range(len(test_data))):
    response = llm2.generate(test_data[i])
    test_data[i]["finetuned_model"] = response

100%|██████████| 165/165 [14:28<00:00,  5.26s/it]


In [15]:
test_data[1]

{'instruction': 'What type of cloud is typically associated with thunderstorms?',
 'input': '',
 'output': 'The type of cloud typically associated with thunderstorms is cumulonimbus.',
 'base_model': ' Invalid Use Case for Test Scenarios:\nUsage Enactment: 81\nPrecondition: 82\nPostcondition: 83\nException Handling: 84\nEvaluation Method 1014\n',
 'finetuned_model': ' Nin procedure print)). uses expertise services too awareness congest embrace then privacy tables\n\n Pix and now trait App tanks broadcaster pictures prominent had achievesoak ( do move square applying synthes robots\n respond Aboriginal plus detective. tour wake guest - screaming business not bug l'}

In [16]:
with open("evaluated_test.json", "w") as json_file:
    json.dump(test_data, json_file, indent=4)

### Evaluate the finetuned LLM

In [18]:
%pip -q install lm-eval


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.0.1[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [19]:
!litgpt evaluate out/finetune/lora/final --tasks "mmlu_philosophy" --batch_size 4

{'access_token': None,
 'batch_size': 4,
 'checkpoint_dir': PosixPath('out/finetune/lora/final'),
 'device': None,
 'dtype': None,
 'force_conversion': False,
 'limit': None,
 'num_fewshot': None,
 'out_dir': None,
 'save_filepath': None,
 'seed': 1234,
 'tasks': 'mmlu_philosophy'}
{'checkpoint_dir': PosixPath('out/finetune/lora/final'),
 'output_dir': PosixPath('out/finetune/lora/final/evaluate')}
Downloading builder script: 100%|██████████| 5.86k/5.86k [00:00<00:00, 12.0MB/s]
Downloading readme: 100%|██████████████████| 1.11k/1.11k [00:00<00:00, 9.08MB/s]
Downloading data: 100%|██████████████████████| 166M/166M [00:18<00:00, 8.98MB/s]
Generating test split: 311 examples [00:00, 3211.25 examples/s]
Generating validation split: 34 examples [00:00, 14926.35 examples/s]
Generating dev split: 5 examples [00:00, 105.10 examples/s]
100%|███████████████████████████████████████| 311/311 [00:00<00:00, 1776.18it/s]
Running loglikelihood requests: 100%|███████| 1244/1244 [38:34<00:00,  1.86s/it]