# Supervised Fine-tuning Trainer

Supervised fine-tuning (or SFT for short) is a crucial step in RLHF. In TRL we provide an easy-to-use API to create your SFT models and train them with few lines of code on your dataset.

In [None]:
# !pip3 install peft
# !pip3 install trl 

In [None]:
!pip3 install pydantic==1.10.9

In [None]:
import pydantic
pydantic.__version__

In [None]:
import transformers
transformers.__version__

In [None]:
import trl
trl.__version__

In [None]:
import os
import torch

# # Set GPU device
# os.environ["CUDA_VISIBLE_DEVICES"] = "1"

# os.environ['http_proxy']  = 'http://192.41.170.23:3128'
# os.environ['https_proxy'] = 'http://192.41.170.23:3128'

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

## Basic SFT

### Step1: Load the dataset

In [None]:
from datasets import load_dataset

#sentiment analysis >> 0: negativ, 1: positive
dataset = load_dataset("imdb", split = "train")
dataset

In [None]:
dataset[0]

In [None]:
dataset[1]

### Step2: Load the model & tokenizer

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name_or_path = "distilgpt2"
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    device_map = 'auto'
)

In [None]:
tokenizer = AutoTokenizer.from_pretrained(
    model_name_or_path
)

In [None]:
# make sure to pass a correct value for max_seq_len as the default value will be set to min (tokenizer.modle_max_lenght, 1024)
max_seq_length = min(tokenizer.model_max_length, 1024)
max_seq_length

### Step 3: Define the Trainer

In [None]:
from transformers import TrainingArguments
from trl import SFTTrainer

training_args = TrainingArguments(
                    output_dir       ='tmp_trainer', # default
                    num_train_epochs = 5,
                )

trainer = SFTTrainer(
            model              = model,
            args               = training_args,
            train_dataset      = dataset.select(range(1000)),
            dataset_text_field = "text",
            max_seq_length     = max_seq_length,
)

In [None]:
trainer.train()

## Instruction - Tuning

Train on completions only

- Use the DataCollatorForCompletionOnlyLM to train your model on the generated prompts only.
- Note that this works only in the case when packing=False.
- To instantiate that collator for instruction data, pass a response template and the tokenizer.

### Step 1: Load the dataset

In [None]:
from datasets  import load_dataset
dataset = load_dataset("lucasmccabe-lmi/CodeAlpaca-20k", split = "train")
dataset

In [None]:
dataset[0]

### Step 2: Load the model and tokenizer

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name_or_path = "distilgpt2"
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path, device_map = 'auto'
)
tokenizer           = AutoTokenizer.from_pretrained(model_name_or_path)
tokenizer.pad_token = tokenizer.eos_token

In [None]:
# set instruction
def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example['instruction'])):
        text     = f"### Question: {example['instruction'][i]}\n ### Answer: {example['output'][i]}"
        output_texts.append(text)
    return output_texts

# check instruction-prompt
formatting_prompts_func(dataset[:2])

In [None]:
# use the DataCollatorForCompletionOnlyLM to train your model on the generated prompts only
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM
response_template = "### Answer:"
collator          = DataCollatorForCompletionOnlyLM(response_template, tokenizer = tokenizer)
collator

In [None]:
trainer = SFTTrainer(
    model,
    train_dataset   = dataset.select(range(1000)),
    formatting_func = formatting_prompts_func,
    data_collator   = collator,
)
trainer.train()

## Standard-Alpaca : Format your input prompts

For instruction fine-tuning, it is quite common to have two columns inside the dataset: one for the prompt & the other for the response.

This allows people to format examples like Stanford-Alpaca did as follows:

In [1]:
test = '''
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
{response}
'''

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import SFTTrainer

dataset = load_dataset("HuggingFaceH4/instruction-dataset")
dataset = dataset.remove_columns("meta")
dataset

  warn("The installed version of bitsandbytes was compiled without GPU support. "


'NoneType' object has no attribute 'cadam32bit_grad_fp32'


DatasetDict({
    test: Dataset({
        features: ['prompt', 'completion'],
        num_rows: 327
    })
})

In [6]:
def format_instruction(sample):
	return [f"""
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{sample['prompt']}

### Response:
{sample['completion']}
""".strip()]

format_instruction(dataset['test'][0])

['Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nArianna has 12 chocolates more than Danny. Danny has 6 chocolates more than Robbie. Arianna has twice as many chocolates as Robbie has. How many chocolates does Danny have?\n\n### Response:\nDenote the number of chocolates each person has by the letter of their first name. We know that\nA = D + 12\nD = R + 6\nA = 2 * R\n\nThus, A = (R + 6) + 12 = R + 18\nSince also A = 2 * R, this means 2 * R = R + 18\nHence R = 18\nHence D = 18 + 6 = 24']

In [7]:
model               = AutoModelForCausalLM.from_pretrained("distilgpt2", device_map = 'auto')
tokenizer           = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token

In [8]:
trainer = SFTTrainer(
    model,
    train_dataset=dataset['test'],
    tokenizer=tokenizer,
    max_seq_length=1024,
    formatting_func=format_instruction,
)

trainer.train() 

Map:   0%|          | 0/327 [00:00<?, ? examples/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


  0%|          | 0/3 [00:00<?, ?it/s]

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'train_runtime': 8.0223, 'train_samples_per_second': 0.374, 'train_steps_per_second': 0.374, 'train_loss': 3.059690793355306, 'epoch': 3.0}


TrainOutput(global_step=3, training_loss=3.059690793355306, metrics={'train_runtime': 8.0223, 'train_samples_per_second': 0.374, 'train_steps_per_second': 0.374, 'train_loss': 3.059690793355306, 'epoch': 3.0})