## Instruction Tuning

Supervised fine-tuning (SFT) updates all of a model's parameters on supervised data made of input-output pairs. It teaches the model to follow user-specified instructions and is typically done after pre-training. **Source**: http://tinyurl.com/2v884put

![instruction tuning](assets/instruction-tuning.jpg)

Image Source: https://medium.com/mantisnlp/supervised-fine-tuning-customizing-llms-a2c1edbf22c3

Requirements:
1. Pre-trained model and tokenizer: we will get them from Hugging Face.
2. Instruction-response pair data, e.g. Alpaca, Dolly, Oasst1, LIMA. We will get the dataset from Hugging Face.

Steps:
1. Load the pre-trained model and tokenizer.
2. Format the instruction-response pairs.
3. Preprocess the dataset.
4. Train the pre-trained model in a supervised setting, with the instruction as input and the response as labels.
5. Evaluation:
   i. Automatic evaluation: e.g. MMLU, BBH, AGIEval, or domain-specific evaluations such as maths, reasoning, and code.
   ii. Human evaluation: give the model prompts, let it generate responses, and ask humans to rate them.
   iii. LLM as evaluator: ask a powerful model such as GPT-4 to rate the responses generated by your fine-tuned model.

#### Dependencies

In [1]:
!pip uninstall transformers
!pip install git+https://github.com/huggingface/transformers
!pip install --quiet datasets accelerate -U

Found existing installation: transformers 4.35.2
Uninstalling transformers-4.35.2:
  Would remove:
    /usr/local/bin/transformers-cli
    /usr/local/lib/python3.10/dist-packages/transformers-4.35.2.dist-info/*
    /usr/local/lib/python3.10/dist-packages/transformers/*
Proceed (Y/n)? Y
  Successfully uninstalled transformers-4.35.2
Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-pnccyhqi
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-pnccyhqi
  Resolved https://github.com/huggingface/transformers to commit ebc8f47bd922734ce811b8b23c495e653c60afc9
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: transformers
  Building wheel for transformers (pyproject.toml) ...

In [30]:
# import dependencies
import os
import copy
import logging
from dataclasses import dataclass, field
from typing import Optional, Dict, Sequence
from tqdm import tqdm
import torch
import datasets
import transformers
from torch.utils.data import Dataset
from transformers import Trainer
from datasets import load_dataset
from transformers.trainer_utils import get_last_checkpoint, is_main_process

In [2]:
# !pip install --upgrade transformers

In [3]:
from transformers import AutoTokenizer

In [6]:
# tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5")

##### 1. Configs

In [3]:
model_name_or_path = "gpt2" #"microsoft/phi-1_5" # huggingface model name
dataset_name_or_path = "xzuyn/lima-alpaca" # LIMA data in Alpaca format. https://arxiv.org/abs/2305.11206
cache_dir="cache_dir"
split_name="train"
inst_col_name="instruction"
input_col_name="input"
output_col_name="output"
model_max_length=512 # maximum sequence length (in tokens) passed to the model; longer inputs are truncated
IGNORE_INDEX = -100
DEFAULT_PAD_TOKEN = "[PAD]"
DEFAULT_EOS_TOKEN = "</s>"
DEFAULT_BOS_TOKEN = "<s>"
DEFAULT_UNK_TOKEN = "<unk>"

##### 2. Dataset Preparation

In [20]:
# dataset = load_dataset(dataset_name_or_path)

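- These are the two prompt templates (wrappers) we are going to use.
- Some instructions come with an additional input that provides context, so one template includes an `### Input:` section and the other does not.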


In [6]:
PROMPT_DICT = {
    "prompt_input": (
        "Below is an instruction that describes a task, paired with an input that provides further context. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
    ),
    "prompt_no_input":(
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Response:"
    ),
}
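As a rough sketch of how the two templates are chosen per example, the snippet below uses shortened stand-in templates (the `TEMPLATES` dict and `build_source` helper are hypothetical names, not part of the notebook) and picks one based on whether the `input` field is empty.

```python
# Shortened stand-ins for the two templates (hypothetical; the real ones also
# carry the "Below is an instruction..." preamble).
TEMPLATES = {
    "prompt_input": "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:",
    "prompt_no_input": "### Instruction:\n{instruction}\n\n### Response:",
}


def build_source(example: dict) -> str:
    """Pick the template depending on whether the example has a non-empty input."""
    key = "prompt_input" if example.get("input", "") != "" else "prompt_no_input"
    # format_map ignores extra keys such as "output"
    return TEMPLATES[key].format_map(example)


with_input = build_source(
    {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"}
)
without_input = build_source(
    {"instruction": "Name a prime number.", "input": "", "output": "7"}
)
print(with_input)
print("-----")
print(without_input)
```

Both prompts end with `### Response:`, so the model learns to continue from that marker with the answer.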

If we add any new tokens to the tokenizer, we need to resize the model's embedding matrices to match the new vocabulary size.

In [5]:
def smart_tokenizer_and_embedding_resize(
    special_tokens_dict: Dict,
    tokenizer: transformers.PreTrainedTokenizer,
    model: transformers.PreTrainedModel,
):
    """Resize tokenizer and embedding.

    Note: This is the unoptimized version that may make your embedding size not be divisible by 64.
    """
    num_new_tokens = tokenizer.add_special_tokens(special_tokens_dict)
    model.resize_token_embeddings(len(tokenizer))

    if num_new_tokens > 0:
        input_embeddings = model.get_input_embeddings().weight.data
        output_embeddings = model.get_output_embeddings().weight.data

        input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)
        output_embeddings_avg = output_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)

        input_embeddings[-num_new_tokens:] = input_embeddings_avg
        output_embeddings[-num_new_tokens:] = output_embeddings_avg

In [6]:
def _tokenize_fn(strings: Sequence[str], tokenizer: transformers.PreTrainedTokenizer) -> Dict:
    """Tokenize a list of strings."""
    tokenized_list = [
        tokenizer(
            text,
            return_tensors="pt",
            padding="longest",
            max_length=tokenizer.model_max_length,
            truncation=True,
        )
        for text in strings
    ]
    input_ids = labels = [tokenized.input_ids[0] for tokenized in tokenized_list]
    input_ids_lens = labels_lens = [
        tokenized.input_ids.ne(tokenizer.pad_token_id).sum().item() for tokenized in tokenized_list
    ]
    return dict(
        input_ids=input_ids,
        labels=labels,
        input_ids_lens=input_ids_lens,
        labels_lens=labels_lens,
    )

In [7]:
def preprocess(
    sources: Sequence[str],
    targets: Sequence[str],
    tokenizer: transformers.PreTrainedTokenizer,
) -> Dict:
    """Preprocess the data by tokenizing."""
    examples = [s + t for s, t in zip(sources, targets)]
    examples_tokenized, sources_tokenized = [_tokenize_fn(strings, tokenizer) for strings in (examples, sources)]
    input_ids = examples_tokenized["input_ids"]
    labels = copy.deepcopy(input_ids)
    for label, source_len in zip(labels, sources_tokenized["input_ids_lens"]):
        label[:source_len] = IGNORE_INDEX
    return dict(input_ids=input_ids, labels=labels)
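The masking in `preprocess` can be illustrated with toy token ids. This is only a sketch: the ids are invented and `mask_prompt_tokens` is a hypothetical helper, but the logic is the same, with prompt positions set to `-100` so that the loss is computed on the response tokens only.

```python
IGNORE = -100  # same value as IGNORE_INDEX; the default ignore_index of PyTorch's cross-entropy loss


def mask_prompt_tokens(input_ids, source_len):
    """Copy input_ids into labels and hide the first `source_len` (prompt) positions."""
    labels = list(input_ids)
    labels[:source_len] = [IGNORE] * source_len
    return labels


prompt_ids = [11, 12, 13]      # made-up ids for the formatted instruction
response_ids = [21, 22, 99]    # made-up ids for the response plus an EOS token
input_ids = prompt_ids + response_ids
labels = mask_prompt_tokens(input_ids, len(prompt_ids))
print(labels)  # [-100, -100, -100, 21, 22, 99]
```

Note that `preprocess` measures the prompt length by tokenizing the prompt on its own, which assumes the prompt tokenizes to the same prefix when the response is appended.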

In [8]:
# tokenizer = transformers.AutoTokenizer.from_pretrained(model_name_or_path)
# tokenizer.pad_token=tokenizer.eos_token

In [9]:
# source = [dataset[split_name][inst_col_name][0]]
# target = [dataset[split_name][output_col_name][0]]

In [10]:
# outputs = preprocess(sources=source, targets=target, tokenizer=tokenizer)

In [11]:
# len(outputs['input_ids'][0])

In [12]:
# len(outputs['labels'][0])

In [13]:
# outputs['labels'][0]

In [14]:
class SupervisedDataset(Dataset):
    """Dataset for supervised fine-tuning."""

    def __init__(self, dataset_name_or_path: str, tokenizer: transformers.PreTrainedTokenizer):
        super(SupervisedDataset, self).__init__()

        # Load the dataset
        logging.warning("Loading data...")
        dataset = datasets.load_dataset(dataset_name_or_path, split=split_name)


        logging.warning("Formatting inputs...")
        # if the example has no input, use the prompt_no_input template; otherwise use the prompt_input template
        prompt_input, prompt_no_input = PROMPT_DICT["prompt_input"], PROMPT_DICT["prompt_no_input"]
        sources = [
            prompt_input.format_map(example) if example.get(input_col_name, "") != "" else prompt_no_input.format_map(example)
            for example in tqdm(dataset)
        ]
        targets = [f"{example[output_col_name]}{tokenizer.eos_token}" for example in dataset]

        logging.warning("Tokenizing inputs... This may take some time...")
        data_dict = preprocess(sources, targets, tokenizer)

        self.input_ids = data_dict["input_ids"]
        self.labels = data_dict["labels"]

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, i) -> Dict[str, torch.Tensor]:
        return dict(input_ids=self.input_ids[i], labels=self.labels[i])

In [15]:
# for example in dataset['train']:
#     print(example.get(input_col_name, "input"))
#     break

In [16]:
# dataset = SupervisedDataset(dataset_name_or_path=dataset_name_or_path, tokenizer=tokenizer)

In [17]:
@dataclass
class DataCollatorForSupervisedDataset(object):
    """Collate examples for supervised fine-tuning."""

    tokenizer: transformers.PreTrainedTokenizer

    def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]:
        input_ids, labels = tuple([instance[key] for instance in instances] for key in ("input_ids", "labels"))
        input_ids = torch.nn.utils.rnn.pad_sequence(
            input_ids, batch_first=True, padding_value=self.tokenizer.pad_token_id
        )
        labels = torch.nn.utils.rnn.pad_sequence(labels, batch_first=True, padding_value=IGNORE_INDEX)
        return dict(
            input_ids=input_ids,
            labels=labels,
            attention_mask=input_ids.ne(self.tokenizer.pad_token_id),
        )
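What the collator does to a batch can be sketched without tensors. The version below is an illustration only: `PAD_ID = 0` and the token ids are assumptions, and the attention mask is built from sequence lengths rather than by comparing against the pad id as the class above does.

```python
PAD_ID = 0     # assumed pad token id for this toy example
IGNORE = -100  # padded label positions are ignored by the loss


def collate(instances):
    """Right-pad every sequence in the batch to the length of the longest one."""
    max_len = max(len(x["input_ids"]) for x in instances)
    batch = {"input_ids": [], "labels": [], "attention_mask": []}
    for x in instances:
        n_real = len(x["input_ids"])
        n_pad = max_len - n_real
        batch["input_ids"].append(x["input_ids"] + [PAD_ID] * n_pad)
        batch["labels"].append(x["labels"] + [IGNORE] * n_pad)
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
    return batch


batch = collate(
    [
        {"input_ids": [5, 6, 7], "labels": [-100, 6, 7]},
        {"input_ids": [8], "labels": [8]},
    ]
)
print(batch["input_ids"])       # [[5, 6, 7], [8, 0, 0]]
print(batch["labels"])          # [[-100, 6, 7], [8, -100, -100]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
```

Padding labels with `-100` rather than the pad id keeps padded positions out of the loss, just like the masked prompt tokens.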

In [18]:
def make_supervised_data_module(tokenizer: transformers.PreTrainedTokenizer) -> Dict:
    """Make dataset and collator for supervised fine-tuning."""
    train_dataset = SupervisedDataset(tokenizer=tokenizer, dataset_name_or_path=dataset_name_or_path)
    data_collator = DataCollatorForSupervisedDataset(tokenizer=tokenizer)
    return dict(train_dataset=train_dataset, eval_dataset=None, data_collator=data_collator)


In [19]:
def safe_save_model_for_hf_trainer(trainer: transformers.Trainer, output_dir: str):
    """Collect the state dict and dump it to disk."""
    state_dict = trainer.model.state_dict()
    if trainer.args.should_save:
        cpu_state_dict = {key: value.cpu() for key, value in state_dict.items()}
        del state_dict
        trainer._save(output_dir, state_dict=cpu_state_dict)  # noqa

In [20]:
def train(training_args):

    model = transformers.AutoModelForCausalLM.from_pretrained(
        model_name_or_path,
        cache_dir=cache_dir,
    )
    model.config.use_cache=False

    tokenizer = transformers.AutoTokenizer.from_pretrained(
        model_name_or_path,
        cache_dir=cache_dir,
        model_max_length=model_max_length,
        padding_side="right",
        use_fast=False,
    )
    if tokenizer.pad_token is None:
        smart_tokenizer_and_embedding_resize(
            special_tokens_dict=dict(pad_token=DEFAULT_PAD_TOKEN),
            tokenizer=tokenizer,
            model=model,
        )
    if "llama" in model_name_or_path:
        tokenizer.add_special_tokens(
            {
                "eos_token": DEFAULT_EOS_TOKEN,
                "bos_token": DEFAULT_BOS_TOKEN,
                "unk_token": DEFAULT_UNK_TOKEN,
            }
        )

    data_module = make_supervised_data_module(tokenizer=tokenizer)

    # update training args to make output dir
    output_dir = os.path.join(training_args.output_dir, model_name_or_path.split("/")[-1])
    os.makedirs(output_dir, exist_ok=True)

    training_args.output_dir = output_dir

    trainer = Trainer(model=model, tokenizer=tokenizer, args=training_args, **data_module)

    # resume from last checkpoint if it exists
    checkpoint = get_last_checkpoint(training_args.output_dir)

    if checkpoint:
        print(f"Checkpoint found! Training from {checkpoint} checkpoint!")
        trainer.train(resume_from_checkpoint=checkpoint)
    else:
        print(f"No checkpoint found! Training from scratch!")
        trainer.train()

    # trainer.train()
    # save states
    trainer.save_state()
    safe_save_model_for_hf_trainer(trainer=trainer, output_dir=training_args.output_dir)
    print(f"Training finished! Saved model to {training_args.output_dir}.")


### Train

In [21]:
output_dir = "output"

In [28]:
training_args = transformers.TrainingArguments(output_dir=output_dir, per_device_train_batch_size=2)

In [31]:
# training
train(training_args=training_args)

100%|██████████| 1000/1000 [00:00<00:00, 26628.64it/s]


No checkpoint found! Training from scratch!


| Step | Training Loss |
|------|---------------|
| 500  | 3.0614 |
| 1000 | 2.7486 |
| 1500 | 2.5991 |


Training finished! Saved model to output/gpt2.
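The step count in the log can be checked with a little arithmetic. This sketch assumes a single GPU, no gradient accumulation, and the `TrainingArguments` default of 3 training epochs, none of which are set explicitly in the notebook.

```python
import math

num_examples = 1000        # from the tqdm progress bar in the output above
per_device_batch_size = 2  # per_device_train_batch_size passed to TrainingArguments
num_epochs = 3             # assumed: default num_train_epochs
num_devices = 1            # assumed: a single GPU
grad_accum_steps = 1       # assumed: default gradient_accumulation_steps

effective_batch = per_device_batch_size * num_devices * grad_accum_steps
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 500 1500
```

Under these assumptions the run takes 1500 optimizer steps, which matches the last logged step in the training output.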


In [None]:
# training_args.num_train_epochs = 5

### Evaluation

In [37]:
# clone lm-evaluation-harness and install it in editable mode
# !git clone https://github.com/EleutherAI/lm-evaluation-harness
# %cd lm-evaluation-harness
# !pip install -e .

In [38]:
!lm_eval --model hf \
    --model_args pretrained=/content/output/gpt2 \
    --tasks hellaswag \
    --device cuda:0 \
    --batch_size auto:4

2024-01-23:17:27:18,744 INFO     [utils.py:160] NumExpr defaulting to 2 threads.
2024-01-23:17:27:19,146 INFO     [config.py:58] PyTorch version 2.1.0+cu121 available.
2024-01-23:17:27:19,147 INFO     [config.py:95] TensorFlow version 2.15.0 available.
2024-01-23:17:27:19,148 INFO     [config.py:108] JAX version 0.4.23 available.
2024-01-23 17:27:19.862024: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-23 17:27:19.862115: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-23 17:27:19.863642: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Downloading builder script: 100% 

### Prompting
- Once the model has been trained to follow instructions, we can prompt it to generate a response.
- We will use the Hugging Face text-generation pipeline to prompt our trained model.
- Because the model was trained with a particular prompt wrapper, use the same prompt format at inference time.


In [3]:
import time
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import os

In [4]:
# Load tokenizer and trained model and then create chatbot pipeline
tokenizer = AutoTokenizer.from_pretrained("/content/output/gpt2")
model = AutoModelForCausalLM.from_pretrained("/content/output/gpt2", device_map="cuda:0")

chatbot = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

In [7]:
# format the prompt
text = "What is Machine Learning?"

prompt = PROMPT_DICT['prompt_no_input'].format(instruction=text)

In [8]:
sequences = chatbot(
    prompt,
    do_sample=True,
    temperature=0.1,
    top_p=0.4,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=64,
    return_full_text=False
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [47]:
# !ps ux

In [9]:
print(sequences[0]['generated_text'])

Machine learning is a new field that has been around for a while. It is a field that has been around for a long time, and has been around for a long time. Machine learning is a new field that has been around for a long time, and has been around for a long time.

## Machine


#### Next Tutorial - Prompting