# Instruction Tuning

Instruction tuning is form of fine-tuning that enhances a model's ability to generalize across diverse tasks. This concept is particularly useful in making models more adaptable and efficient in understanding and executing new instructions, even those they haven't been explicitly trained on.

## Ok, But What is Instruction Tuning?
Instruction tuning differs from supervised fine-tuning (SFT) approach primarily in the nature of the training data. While both methods involve training on **input-output pairs**, instruction tuning adds a critical layer: **instructions**. This additional context helps the model understand the task it is being asked to perform, leading to improved generalization to unseen tasks. Also, as we will see in this notebook, one of the ways of doing instruction tuning helps us skip the trouble of designing task specific heads or loss functions!

## Key Differences:
- **Supervised Fine-Tuning**: Trains models using input examples and their corresponding outputs.
- **Instruction Tuning**: Augments the input-output pairs with instructions, enhancing the model's ability to generalize to new tasks.


## Examples
**Supervised Fine-Tuning:**

- **Input**: "Translate this sentence to French: 'The cat is on the mat.'"
- **Output**: "Le chat est sur le tapis."

**Instruction Tuning:**

- **Instruction:** "Translate the following sentence to French."
- **Input:** "The cat is on the mat."
- **Output:** "Le chat est sur le tapis."

By incorporating instructions, the model gains a better understanding of the task, leading to more robust performance across a wider range of tasks.

In this notebook, we will go deeper into the mechanics of instruction tuning and tune our own model.\

> Even though the original work by. Ouyang et. al. presents a model based off GPT-3, for the sake of learning we will leverage GPT-2. If you have larger compute/more GPU-RAM available, feel free to experiment with larger models

## Instruction Tuned GPT-2

In [32]:
# !pip3 install scikit-learn==1.5.1
# !pip3 install transformers==4.43.4
# !pip3 install datasets==3.0.0

In [1]:
import json
import pandas as pd
from sklearn.model_selection import train_test_split

import torch
from datasets import Dataset
from transformers import pipeline
from transformers import AutoTokenizer
from transformers import GenerationConfig
from transformers import TextDataset,DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments,AutoModelForCausalLM

### Select Compute Backend

In [2]:
if torch.cuda.is_available():
    DEVICE = 'cuda'
    Tensor = torch.cuda.FloatTensor
    LongTensor = torch.cuda.LongTensor
    DEVICE_ID = 0
# MPS/Apple Silicon does not work as intended for this pipeline
elif torch.backends.mps.is_available():
    DEVICE = 'mps'
    Tensor = torch.FloatTensor
    LongTensor = torch.LongTensor
    DEVICE_ID = 0
else:
    DEVICE = 'cpu'
    Tensor = torch.FloatTensor
    LongTensor = torch.LongTensor
    DEVICE_ID = -1
print(f"Backend Accelerator Device={DEVICE}")

Backend Accelerator Device=mps


### Connect to Hugginface Hub

In [73]:
from huggingface_hub import notebook_login
notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

### Configs

In [3]:
TOKENIZER = "gpt2"
MODEL = "raghavbali/gpt2-finetuned-headliner"
OUTPUT_MODEL_NAME = "raghavbali/gpt2-instruct-tuned-translator2"
DATASET = 'news_english_german_instruction_dataset_20240909.json'
PUSH_TO_HUB = False if OUTPUT_MODEL_NAME.split('/')[0]=='raghavbali' else True # do not push to hub if you are simply trying out generation

In [None]:
generate_kwargs = {
    "do_sample":True,
    "temperature": 0.7,
    "eos_token_id":50256,
    "max_new_tokens": 50,
}
generate_config = GenerationConfig(**generate_kwargs)

### Get Tokenizer

In [4]:
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER,clean_up_tokenization_spaces=True)#,pad_token='<pad>')

### Prepare Dataset

In [7]:
# load dataset
instruction_dataset = list()
with open(DATASET, "r") as jsonfile:
    instruction_dataset = json.load(jsonfile)
print(f"Total Records={len(instruction_dataset)}")

Total Records=6220


In [8]:
# basic cleanup to remove very short or blank translations
instruction_dataset = [{
    'input':record['input'],
    'output_gpt4omini':record['output_gpt4omini']
} for record in instruction_dataset if record['output_gpt4omini']!='#' and len(record['output_gpt4omini'])>2]
print(f"Total Records Remaining={len(instruction_dataset)}")

Total Records Remaining=6037


In [9]:
# train test split
X_train, X_test= train_test_split(instruction_dataset[:5000],test_size=0.1, random_state=42)
len(X_train), len(X_test)

(4500, 500)

In [10]:
# tokenization function
def tokenize_function(examples):
    examples["text"] = [f"###Translate to German:{ed['input']}\n###Output:{ed['output_gpt4omini']}<|endoftext|>" for ed in examples["text"]]
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=512,
    )

In [11]:
# tokenized datasets
tokenized_train_dataset = Dataset.from_dict({'text':X_train}).map(
    tokenize_function,
    batched=True,
    num_proc=8,
    remove_columns=["text"],
)

tokenized_test_dataset = Dataset.from_dict({'text':X_test}).map(
    tokenize_function,
    batched=True,
    num_proc=8,
    remove_columns=["text"],
)

Map (num_proc=8):   0%|          | 0/4500 [00:00<?, ? examples/s]

Map (num_proc=8):   0%|          | 0/500 [00:00<?, ? examples/s]

In [12]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False,
)

In [13]:
# sample output
tokenizer.decode(tokenized_test_dataset['input_ids'][0])

'###Translate to German:explosion rips through mexican fireworks market\n###Output:Eine Explosion erschüttert einen mexikanischen Feuerwerksmarkt.<|endoftext|>'

## Prepare Model for Training

In [16]:
model = AutoModelForCausalLM.from_pretrained(MODEL,device_map="auto",).to(DEVICE)

config.json:   0%|          | 0.00/984 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.42G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/119 [00:00<?, ?B/s]

In [17]:
training_args = TrainingArguments(
    OUTPUT_MODEL_NAME, #The output directory
    overwrite_output_dir=True, #overwrite the content of the output directory
    num_train_epochs=2, # number of training epochs
    per_device_train_batch_size=16, # batch size for training
    per_device_eval_batch_size=16,  # batch size for evaluation
    eval_steps = 16, # Number of update steps between two evaluations.
    save_steps=32, # after # steps model is saved
    warmup_steps=4,# number of warmup steps for learning rate scheduler
    push_to_hub=PUSH_TO_HUB,
    logging_steps=16,
    #use_mps_device=True, # uncomment this if you have MPS available
    #use_cpu=True # comment this if you have GPU available
    )

In [18]:
tokenizer.add_special_tokens({'pad_token': '<pad>'})
model.resize_token_embeddings(len(tokenizer))

Embedding(50258, 1024)

In [19]:
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_test_dataset,
)

In [20]:
trainer.train()

Step,Training Loss
16,24.7116
32,3.8201
48,3.3454
64,3.1967
80,3.1538
96,3.0859
112,3.1151
128,3.0612
144,3.0193
160,3.0156


TrainOutput(global_step=564, training_loss=3.401080123076202, metrics={'train_runtime': 1268.3087, 'train_samples_per_second': 7.096, 'train_steps_per_second': 0.445, 'total_flos': 919486242324480.0, 'train_loss': 3.401080123076202, 'epoch': 2.0})

In [None]:
# trainer.save_model()

## Let us Instruct Some Translations!

> **Note** : The model seems to have picked up german vocabulary but the translations do not make sense all the time.

> Contrast this (as an exercise) with the performance of the pretrained version of the model

In [None]:
# uncomment if you have not setup the training in this session
# tokenizer.add_special_tokens({'pad_token': '<pad>'})
# model.resize_token_embeddings(len(tokenizer))

In [84]:
# load the instruction-tuned model
pretrained_model = AutoModelForCausalLM.from_pretrained(MODEL,device_map="auto",).to(DEVICE)
inst_tuned_model = AutoModelForCausalLM.from_pretrained(OUTPUT_MODEL_NAME).to(DEVICE)
#-> comment .to(DEVICE) if you are using Apple Silicon

pretrained_model.resize_token_embeddings(len(tokenizer))
inst_tuned_model.resize_token_embeddings(len(tokenizer))

# setup the generation pipeline
translator_pipeline = pipeline('text-generation',
                     model=inst_tuned_model,
                     tokenizer=tokenizer,
                     pad_token_id=0,
                     eos_token_id=50256,
                     # device=DEVICE,
                     model_kwargs=generate_kwargs
                    )

pretrained_pipeline = pipeline('text-generation',
                     model=pretrained_model,
                     tokenizer=tokenizer,
                     pad_token_id=0,
                     eos_token_id=50256,
                     # device=DEVICE,
                     model_kwargs=generate_kwargs
                    )

In [36]:
def get_translated_headline(_pipeline, seed_text="News"):
  return _pipeline(seed_text)[0]['generated_text']

In [37]:
# sample strings to test
samples= [
    "today is a beautiful day",
    "john is set to meet mark",
    "Australia wins Gold Medal at olympics",
    "Australian secures coal policy in China."
]
for _str in samples:
  input_str = f"###Translate to German:{_str}\n###Output:"
  inst_response = get_translated_headline(translator_pipeline, seed_text=input_str)
  pt_response = get_translated_headline(pretrained_pipeline, seed_text=input_str)
  print("Instruction Tuned Model Response::")
  print(inst_response)
  print()
  print("Pretrained Model Response::")
  print(pt_response)
  print("*"*25)

Instruction Tuned Model Response::
###Translate to German:today is a beautiful day
###Output:die Älle für drei Fahren der Erde

Pretrained Model Response::
###Translate to German:today is a beautiful day
###Output:
*************************
Instruction Tuned Model Response::
###Translate to German:john is set to meet mark
###Output:John sich von dem Mark erklären, sagt Mark

Pretrained Model Response::
###Translate to German:john is set to meet mark
###Output:
*************************
Instruction Tuned Model Response::
###Translate to German:Australia wins Gold Medal at olympics
###Output:Australien eine Gold Medal zu den Olympierten.

Pretrained Model Response::
###Translate to German:Australia wins Gold Medal at olympics
###Output:
*************************
Instruction Tuned Model Response::
###Translate to German:Australian secures coal policy in China.
###Output:Australiens warten der Coalen policy in China.

Pretrained Model Response::
###Translate to German:Australian secures co

In [38]:
# samples from test set
for _str in X_test[25:30]:
  input_str = f"###Translate to German:{_str['input']}\n###Output:"
  response = get_translated_headline(translator_pipeline, seed_text=input_str)
  print(response)
  print(f"GPT-Translation:{_str['output_gpt4omini']}")
  print()

###Translate to German:warner smith return for blues
###Output:Warner Smith schnell vor den blues
GPT-Translation:Warner Smith Rückkehr für Blues

###Translate to German:gold coast could have superyacht marina boyle
###Output:Gold Coast gewinnt wirtsicher schafft Marle in der Stadt Gold.
GPT-Translation:Die Goldküste könnte einen Superyacht-Hafen in Boyle haben.

###Translate to German:bid offered for hamilton is
###Output:Schließer, der in Brandwurf auf hamilton.
GPT-Translation:Das Gebot für Hamilton ist

###Translate to German:bhp ordered to assess seismic risks
###Output:Berichkeit erfasst vor Geowarsenheit
GPT-Translation:BHP beauftragt, seismische Risiken zu bewerten

###Translate to German:nsw premier says health authorities need to watch
###Output:Die Premierminister für die Entwicklung vor Gericht auf die Überlokalien
GPT-Translation:Der Premier von New South Wales sagt, die Gesundheitsbehörden müssen aufpassen.



## Extended Capabilities

In [None]:
generate_kwargs = {
    "do_sample":True,
    "temperature": 0.7,
    "eos_token_id":50256,
    "max_new_tokens": 50,
}

In [10]:
# load the instruction-tuned model
pretrained_model = AutoModelForCausalLM.from_pretrained(MODEL).to('cpu')
inst_tuned_model = AutoModelForCausalLM.from_pretrained(OUTPUT_MODEL_NAME).to('cpu')
#-> comment .to(DEVICE) if you are using Apple Silicon

# force move to CPU for apple silicon
# pretrained_model.to('cpu');
# inst_tuned_model.to('cpu');

pretrained_model.resize_token_embeddings(len(tokenizer))
inst_tuned_model.resize_token_embeddings(len(tokenizer))

# setup the generation pipeline
translator_pipeline = pipeline('text-generation',
                     model=inst_tuned_model,
                     tokenizer=tokenizer,
                     pad_token_id=0,
                     eos_token_id=50256,
                     #device=DEVICE,
                     model_kwargs=generate_kwargs
                    )

pretrained_pipeline = pipeline('text-generation',
                     model=pretrained_model,
                     tokenizer=tokenizer,
                     pad_token_id=0,
                     eos_token_id=50256,
                     #device=DEVICE,
                     model_kwargs=generate_kwargs
                    )

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [29]:
pt_input_str = "News Zealand man charged with "
pt_input_str2 = "australia wins gold at olympics at"
inst_input_str = f"###Translate to German:farmers grow monstor tomatoes\n###Output:"
inst_input_str2 = f"###Translate to German:nsw declare heat warning\n###Output:"

In [31]:
for s in [pt_input_str,pt_input_str2,inst_input_str,inst_input_str2]:
    print("-"*50)
    print(f"Prompt= {s}...")
    print("-"*50)
    print("Instruction Tuned Model::")
    print(translator_pipeline(s)[0]['generated_text'])
    print()
    print("Pretrained Model::")
    print(pretrained_pipeline(s)[0]['generated_text'])
    print()

--------------------------------------------------
Prompt= News Zealand man charged with ...
--------------------------------------------------
Instruction Tuned Model::
News Zealand man charged with ersatz murder of 12yo boy
###

Pretrained Model::
News Zealand man charged with urdock assault

CHRISTOPHER WESSLEY

--------------------------------------------------
Prompt= australia wins gold at olympics at...
--------------------------------------------------
Instruction Tuned Model::
australia wins gold at olympics at snl

Pretrained Model::
australia wins gold at olympics at nur

french team beats australia on aces

--------------------------------------------------
Prompt= ###Translate to German:farmers grow monstor tomatoes
###Output:...
--------------------------------------------------
Instruction Tuned Model::
###Translate to German:farmers grow monstor tomatoes
###Output:Schwierigkeitsprodukten Monstor-Tomat

Pretrained Model::
###Translate to German:farmers grow monstor tomat