# [Supervised Fine-tuning Trainer](https://huggingface.co/docs/trl/sft_trainer)

Supervised fine-tuning (SFT) is a crucial step in RLHF. TRL provides an easy-to-use API for creating SFT models and training them on your dataset with a few lines of code.

[Python Script](https://github.com/huggingface/trl/blob/main/examples/scripts/sft.py)

In [1]:
# !pip3 install peft==0.7.1
# !pip3 install trl==0.7.4
# !pip3 install transformers==4.36.2

In [2]:
import transformers
transformers.__version__

'4.36.2'

In [3]:
import trl
trl.__version__

'0.8.1'

In [4]:
import os
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

## Instruction-Tuning
Train on completions only
- Use the `DataCollatorForCompletionOnlyLM` to train your model on the completions (responses) only, so that the prompt tokens do not contribute to the loss.
- Note that this works only when `packing=False`.
- To instantiate the collator for instruction data, pass a response template and the tokenizer.
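
To make the idea of completion-only training concrete, here is a minimal sketch of the label masking involved. It is not TRL's implementation: `find_subsequence` and `mask_prompt_tokens` are hypothetical helpers, and the token ids are toy values. The assumption is that every label up to and including the response template is set to `-100` (the index that PyTorch's cross-entropy loss ignores by default), and that an example with no template is masked entirely.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def find_subsequence(sequence, pattern):
    """Return the start index of the first occurrence of pattern in sequence, or -1."""
    for start in range(len(sequence) - len(pattern) + 1):
        if sequence[start:start + len(pattern)] == pattern:
            return start
    return -1


def mask_prompt_tokens(input_ids, response_template_ids):
    """Hypothetical helper: copy input_ids to labels and mask everything before the response."""
    labels = list(input_ids)
    start = find_subsequence(input_ids, response_template_ids)
    if start == -1:
        # Assumption: with no template found, the whole example is excluded from the loss
        return [IGNORE_INDEX] * len(labels)
    response_start = start + len(response_template_ids)
    for i in range(response_start):
        labels[i] = IGNORE_INDEX
    return labels


# Toy ids: [11, 12, 13] is the prompt, [7, 8] stands in for "### Response:",
# and [21, 22, 23] is the completion the model should learn to produce.
input_ids = [11, 12, 13, 7, 8, 21, 22, 23]
response_template_ids = [7, 8]
labels = mask_prompt_tokens(input_ids, response_template_ids)
print(labels)  # [-100, -100, -100, -100, -100, 21, 22, 23]
```

Only the last three positions keep their labels, so only the response tokens are scored during training. This also hints at why packing is incompatible: once several examples are concatenated into one sequence, a single prompt/response boundary no longer exists.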

In [5]:
# Step 1: Load the dataset
from datasets import load_dataset
dataset_train = load_dataset('json', data_files='data/alpaca_data.json', split='train')
dataset_train

Dataset({
    features: ['input', 'instruction', 'output'],
    num_rows: 52002
})

In [6]:
dataset_train[20000]

{'input': '(A musical note)',
 'instruction': 'Name the given musical note.',
 'output': 'The musical note is an F sharp.'}

In [7]:
dataset_eval = load_dataset("tatsu-lab/alpaca_eval", split='eval', trust_remote_code=True)
dataset_eval = dataset_eval.remove_columns(["generator", "dataset"])
dataset_eval

Dataset({
    features: ['instruction', 'output'],
    num_rows: 805
})

In [8]:
dataset_eval[200]

{'instruction': 'what are five important topics for game design',
 'output': '1. Storytelling\n2. Player Mechanics\n3. Art Direction\n4. Level Design\n5. User Interface Design'}

In [9]:
# Step 2: Load the model & Tokenizer
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name_or_path = "distilgpt2"

model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path, device_map = 'auto')

tokenizer = AutoTokenizer.from_pretrained(
    model_name_or_path)

tokenizer.pad_token = tokenizer.eos_token

# Pass max_seq_length explicitly; if it is omitted, SFTTrainer falls back to min(tokenizer.model_max_length, 1024).
max_seq_length = min(tokenizer.model_max_length, 1024)
max_seq_length

1024

In [10]:
dataset_eval[0].keys()

dict_keys(['instruction', 'output'])

In [11]:
dataset_train[:2]

{'input': ['', ''],
 'instruction': ['Give three tips for staying healthy.',
  'What are the three primary colors?'],
 'output': ['1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.',
  'The three primary colors are red, blue, and yellow.']}

### Stanford-Alpaca: Format your input prompts
For instruction fine-tuning, it is quite common to have two columns in the dataset: one for the prompt and one for the response.

With this layout, each example can be formatted into a single training string the way Stanford-Alpaca does:

In [12]:
def formatting_prompts_func(examples):
	"""Build one Alpaca-style prompt string per example in a batch (a dict of lists)."""
	output_texts = []

	for i in range(len(examples['instruction'])):
		instruction = examples["instruction"][i]
		input_text = examples["input"][i] if 'input' in examples.keys() else ""
		response = examples["output"][i]
	
		if len(input_text) > 1:
			text = f"""
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input_text}

### Response:
{response}
""".strip()
			
		else:
			text = f"""
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
{response}
""".strip()

		output_texts.append(text)

	return output_texts

# Check the formatted prompts for the first two training examples
formatting_prompts_func(dataset_train[:2])

['Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.',
 'Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat are the three primary colors?\n\n### Response:\nThe three primary colors are red, blue, and yellow.']

In [13]:
dataset_eval[0]

{'instruction': 'What are the names of some famous actors that started their careers on Broadway?',
 'output': 'Some famous actors that started their careers on Broadway include: \n1. Hugh Jackman \n2. Meryl Streep \n3. Denzel Washington \n4. Julia Roberts \n5. Christopher Walken \n6. Anthony Rapp \n7. Audra McDonald \n8. Nathan Lane \n9. Sarah Jessica Parker \n10. Lin-Manuel Miranda'}

In [14]:
# Use the DataCollatorForCompletionOnlyLM to train the model on the completions (responses) only
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

response_template = "### Response:"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)
collator

DataCollatorForCompletionOnlyLM(tokenizer=GPT2TokenizerFast(name_or_path='distilgpt2', vocab_size=50257, model_max_length=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>', 'pad_token': '<|endoftext|>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	50256: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
}, mlm=False, mlm_probability=0.15, pad_to_multiple_of=None, tf_experimental_compile=False, return_tensors='pt')

## Model Training

In [15]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir = './results', #default = 'tmp_trainer'
    save_strategy = 'epoch',
    evaluation_strategy = 'epoch',
    gradient_checkpointing = True,
    per_device_train_batch_size = 2,
    per_device_eval_batch_size = 2,
    num_train_epochs = 3, #default = 3
)

# Step 3: Define the Trainer
trainer = SFTTrainer(
    model,
    args = training_args,
    train_dataset = dataset_train.select(range(10000)),
    eval_dataset = dataset_eval,
    formatting_func = formatting_prompts_func,
    data_collator = collator,
    max_seq_length = max_seq_length,
)

trainer.train()

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...

{'loss': 2.6808, 'learning_rate': 4.8333333333333334e-05, 'epoch': 0.1}
{'loss': 2.6334, 'learning_rate': 4.666666666666667e-05, 'epoch': 0.2}
{'loss': 2.5527, 'learning_rate': 4.5e-05, 'epoch': 0.3}
{'loss': 2.5173, 'learning_rate': 4.3333333333333334e-05, 'epoch': 0.4}
{'loss': 2.4753, 'learning_rate': 4.166666666666667e-05, 'epoch': 0.5}
{'loss': 2.5103, 'learning_rate': 4e-05, 'epoch': 0.6}
{'loss': 2.4646, 'learning_rate': 3.8333333333333334e-05, 'epoch': 0.7}
{'loss': 2.4326, 'learning_rate': 3.6666666666666666e-05, 'epoch': 0.8}
{'loss': 2.4193, 'learning_rate': 3.5e-05, 'epoch': 0.9}
{'loss': 2.4363, 'learning_rate': 3.3333333333333335e-05, 'epoch': 1.0}

Checkpoint destination directory ./results\checkpoint-5000 already exists and is non-empty.Saving will proceed but saved results may be invalid.
{'eval_loss': 2.2642300128936768, 'eval_runtime': 11.0251, 'eval_samples_per_second': 73.015, 'eval_steps_per_second': 36.553, 'epoch': 1.0}

{'loss': 2.1106, 'learning_rate': 3.1666666666666666e-05, 'epoch': 1.1}
{'loss': 2.1378, 'learning_rate': 3e-05, 'epoch': 1.2}
{'loss': 2.1164, 'learning_rate': 2.8333333333333335e-05, 'epoch': 1.3}
{'loss': 2.1113, 'learning_rate': 2.6666666666666667e-05, 'epoch': 1.4}
{'loss': 2.1567, 'learning_rate': 2.5e-05, 'epoch': 1.5}
{'loss': 2.078, 'learning_rate': 2.3333333333333336e-05, 'epoch': 1.6}
{'loss': 2.1179, 'learning_rate': 2.1666666666666667e-05, 'epoch': 1.7}
{'loss': 2.1086, 'learning_rate': 2e-05, 'epoch': 1.8}
{'loss': 2.1131, 'learning_rate': 1.8333333333333333e-05, 'epoch': 1.9}
{'loss': 2.1208, 'learning_rate': 1.6666666666666667e-05, 'epoch': 2.0}

Checkpoint destination directory ./results\checkpoint-10000 already exists and is non-empty.Saving will proceed but saved results may be invalid.
{'eval_loss': 2.261542320251465, 'eval_runtime': 17.6651, 'eval_samples_per_second': 45.57, 'eval_steps_per_second': 22.813, 'epoch': 2.0}

{'loss': 1.9229, 'learning_rate': 1.5e-05, 'epoch': 2.1}
{'loss': 1.9227, 'learning_rate': 1.3333333333333333e-05, 'epoch': 2.2}
{'loss': 1.9096, 'learning_rate': 1.1666666666666668e-05, 'epoch': 2.3}
{'loss': 1.8914, 'learning_rate': 1e-05, 'epoch': 2.4}
{'loss': 1.9514, 'learning_rate': 8.333333333333334e-06, 'epoch': 2.5}
{'loss': 1.9539, 'learning_rate': 6.666666666666667e-06, 'epoch': 2.6}
{'loss': 1.9077, 'learning_rate': 5e-06, 'epoch': 2.7}
{'loss': 1.8603, 'learning_rate': 3.3333333333333333e-06, 'epoch': 2.8}
{'loss': 1.902, 'learning_rate': 1.6666666666666667e-06, 'epoch': 2.9}
{'loss': 1.9027, 'learning_rate': 0.0, 'epoch': 3.0}

Checkpoint destination directory ./results\checkpoint-15000 already exists and is non-empty.Saving will proceed but saved results may be invalid.
{'eval_loss': 2.273904800415039, 'eval_runtime': 16.2362, 'eval_samples_per_second': 49.581, 'eval_steps_per_second': 24.821, 'epoch': 3.0}

{'train_runtime': 1606.2383, 'train_samples_per_second': 18.677, 'train_steps_per_second': 9.339, 'train_loss': 2.180621476236979, 'epoch': 3.0}

TrainOutput(global_step=15000, training_loss=2.180621476236979, metrics={'train_runtime': 1606.2383, 'train_samples_per_second': 18.677, 'train_steps_per_second': 9.339, 'train_loss': 2.180621476236979, 'epoch': 3.0})
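
The step count and the `learning_rate` values in the log above can be reproduced with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, the `TrainingArguments` default learning rate of `5e-5`, and a linear decay to zero without warmup; none of these are set explicitly in the cell above, so treat them as assumptions. `linear_lr` is a hypothetical helper, not a library function.

```python
import math

num_examples = 10_000          # dataset_train.select(range(10000))
per_device_batch_size = 2      # per_device_train_batch_size
num_epochs = 3                 # num_train_epochs
num_devices = 1                # assumption: single GPU
grad_accum_steps = 1           # assumption: default gradient accumulation
initial_lr = 5e-5              # assumption: default learning rate

effective_batch = per_device_batch_size * num_devices * grad_accum_steps
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_epochs


def linear_lr(step):
    """Learning rate after `step` optimizer steps under linear decay to zero, no warmup."""
    return initial_lr * max(0.0, (total_steps - step) / total_steps)


print(total_steps)  # 15000
for step in (500, 5000, 15000):
    print(step, linear_lr(step))
```

Under these assumptions the schedule gives about `4.83e-05` at step 500 and `3.33e-05` at step 5000, which agrees with the values logged at epochs 0.1 and 1.0.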

In [16]:
trainer.save_model('model/instruction_tuning')

## Inference

In [17]:
model_name_or_path = "model/instruction_tuning"

model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path, device_map = 'auto')

In [18]:
from transformers import pipeline

text_generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=500
)

In [19]:
def instruction_prompt(sample):
	
	if len(sample.get('input', '')) > 1:  # mirror the training-time check so empty inputs use the no-input template
		return f"""
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{sample['instruction']}

### Input:
{sample['input']}

### Response:
""".strip()
			
	else:
		return f"""
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{sample['instruction']}

### Response:
""".strip()


In [20]:
import warnings
warnings.filterwarnings("ignore")

def comparison_between_generated_and_response(generator, sample):
    """Print the instruction, optional input, gold response, and the model's generated response."""
    print(f"Instruction:\n{sample['instruction']}\n")

    # Show the input only when the sample actually has one
    if len(sample.get('input', '')) > 1:
        print(f"Input:\n{sample['input']}\n")

    print(f"Gold Response:\n{sample['output']}\n")

    output = generator(instruction_prompt(sample))
    # The pipeline returns prompt + completion; keep only the text after the response marker
    response = output[0]['generated_text'].split("### Response:")[-1].strip()

    print(f"Generated Response:\n{response}\n")


In [22]:
comparison_between_generated_and_response(text_generator, dataset_eval[2])

Instruction:
Hi, my sister and her girlfriends want me to play kickball with them. Can you explain how the game is played, so they don't take advantage of me?

Gold Response:
Kickball is a game similar to baseball, but with a large rubber ball instead of a bat and a ball. The game is usually played with two teams of six players each. Each team has three bases and a home plate. The players on the kicking team line up at home plate and take turns kicking the ball. The object of the game is to score runs by running around all three bases and back to home plate without being tagged out by the defense. The team with the most runs at the end of the game is the winner.



Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated Response:
Our matches are played in a computer game called the ‘Poker game’ in order to reach a final score. Players are placed in the correct position on either side of the competition and facing off against the opponent of the match. This allows them to perform team-building exercises and then play the game in their own right. There are also rules, rules, and rules, and the game teaches players a certain strategy. How does playing the game play in the computer game?

Steps
Players are allowed to play the team's kicks, but they will also face off in the winner's bracket. This can involve playing on their own and playing on their own. Additionally, it is important to practice the game to make sure the participants can understand the different tactics used. We can also learn some tactics such as using a team chair to hold the cards, and playing on board a game of cards. The winning player will receive a prize, receive a home run and a cup of coffee. Furthermore, all team membe

### Comparison between generated responses and gold labels

The generated responses do not match the gold labels well. One possible cause is the small amount of training: the model saw only a 10,000-example subset for three epochs. Training for more epochs could allow the model to refine its understanding of the data and produce more accurate outputs.

Despite this mismatch, the decrease in training loss is an encouraging sign: it fell from 2.68 at epoch 0.1 to about 1.90 by epoch 3, which indicates that the model is learning from the training data. The evaluation loss, however, stayed between roughly 2.26 and 2.27 across all three epochs, so these gains have not yet carried over to the held-out set. Further training may help the model capture more of the underlying patterns, but the evaluation loss is the number to watch.
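
One way to read the logged losses is to convert them to perplexity, which is the exponential of the mean cross-entropy loss. The sketch below copies the three `eval_loss` values from the training log; the dictionary name is hypothetical, and it assumes the loss is reported in nats (natural-log cross-entropy), which is the usual convention for PyTorch's cross-entropy.

```python
import math

# eval_loss values copied from the training log, keyed by epoch
eval_loss_by_epoch = {
    1: 2.2642300128936768,
    2: 2.261542320251465,
    3: 2.273904800415039,
}

# Perplexity = exp(mean cross-entropy); lower is better
perplexity_by_epoch = {
    epoch: math.exp(loss) for epoch, loss in eval_loss_by_epoch.items()
}

# The epoch with the lowest evaluation loss is the natural checkpoint to keep
best_epoch = min(eval_loss_by_epoch, key=eval_loss_by_epoch.get)

for epoch, ppl in perplexity_by_epoch.items():
    print(f"epoch {epoch}: eval perplexity = {ppl:.2f}")
print(f"best epoch by eval loss: {best_epoch}")
```

Because the three evaluation losses are so close, the corresponding perplexities barely differ, which is another way of seeing that held-out performance was flat during this run.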

However, more epochs are only one lever. The model architecture, the hyperparameters, and the quality and quantity of the training data also matter. Evaluating the model with metrics beyond loss, such as accuracy, precision, recall, and F1 score, can give a more complete picture of its capabilities and of where it needs to improve. Iterating on the training process with these measurements in hand should help the model generate responses that match the gold labels more closely.
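
As one illustration of a metric beyond loss, the sketch below computes token-overlap precision, recall, and F1 between a generated response and a gold response. This is a simple whitespace-token variant chosen for illustration, not a metric used in this notebook; `token_f1` is a hypothetical helper, and the example strings are made up.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Return (precision, recall, f1) over lowercased whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0, 0.0, 0.0

    # Multiset intersection counts each shared token at most as often as it appears in both
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up strings, purely to show the call
precision, recall, f1 = token_f1(
    "Kickball is played with a rubber ball",
    "Kickball is a game played with a large rubber ball",
)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Averaging such a score over the evaluation set gives a rough, surface-level signal of how much generated text overlaps with the gold responses; it does not capture paraphrases, so it complements rather than replaces human inspection.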