# Extended Guide: Instruction-tune Llama 2

This blog post is an extended guide on instruction-tuning Llama 2 from Meta AI. It focuses on creating the instruction dataset, which we can then use to fine-tune the Llama 2 base model to follow our instructions.

The goal is to create a model that generates instructions from an input. Others can then use it to create instruction data from their own inputs. That is especially helpful if you want to personalize models for, e.g., tweeting or email writing: you could generate an instruction dataset from your emails and then train a model to mimic your email writing.

Let's get started. In this blog post, we are going to:

1. Define our use case in detail and create a prompt template for our instructions
2. Create an instruction dataset
3. Instruction-tune Llama 2 using `trl` and the `SFTTrainer` 
4. Test the Model and run Inference

## 1. Define our use case in detail and create a template for our instructions

Before we describe our use case, we need to better understand what an instruction is.

> An instruction is a piece of text or prompt that is provided to an LLM, like Llama, GPT-4, or Claude, to guide it to generate a response. Instructions allow humans to steer the conversation and constrain the language model's output to be more natural, useful, and aligned with the user's goals. Crafting clear, well-formulated instructions is key to productive conversations.
> 

Examples of instructions are listed below in the table.

| Capability | Example Instruction |
| --- | --- |
| Brainstorming | Provide a diverse set of creative ideas for new flavors of ice cream. |
| Classification | Categorize these movies as either comedy, drama, or horror based on the plot summary. |
| Closed QA | Answer the question 'What is the capital of France?' with a single word. |
| Generation | Write a poem in the style of Robert Frost about nature and the changing seasons. |
| Information Extraction | Extract the names of the main characters from this short story. |
| Open QA | Why do leaves change color in autumn? Explain the scientific reasons. |
| Summarization | Summarize this article on recent advancements in renewable energy in 2-3 sentences. |

As described in the beginning, we want to fine-tune a model to generate an instruction from an input, i.e., from text that could have been the output of that instruction. We want to use this as a way to create synthetic datasets to personalize LLMs and agents.

Converting the idea into a basic prompt template following the [Alpaca format](https://github.com/tatsu-lab/stanford_alpaca#data-release), we get:

```python
### Instruction:
Use the Input below to create an instruction, which could have been used to generate the input using an LLM. 

### Input:
Dear [boss name],

I'm writing to request next week, August 1st through August 4th,
off as paid time off.

I have some personal matters to attend to that week that require 
me to be out of the office. I wanted to give you as much advance 
notice as possible so you can plan accordingly while I am away.

Please let me know if you need any additional information from me 
or have any concerns with me taking next week off. I appreciate you 
considering this request.

Thank you, [Your name]

### Response:
Write an email to my boss that I need next week 08/01 - 08/04 off.
```
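The template above can be filled programmatically. The following is a minimal sketch; the names `INSTRUCTION` and `build_prompt` are hypothetical helpers introduced only for illustration, and the example email is shortened.

```python
# Fixed instruction text taken from the template above.
INSTRUCTION = (
    "Use the Input below to create an instruction, which could have been "
    "used to generate the input using an LLM."
)


def build_prompt(input_text: str, response: str = "") -> str:
    """Fill the Alpaca-style template; leave `response` empty at inference time."""
    return (
        f"### Instruction:\n{INSTRUCTION}\n\n"
        f"### Input:\n{input_text.strip()}\n\n"
        f"### Response:\n{response.strip()}"
    )


# Training-style example: input and target response are both present.
example = build_prompt(
    "Dear [boss name], I'm writing to request next week off as paid time off.",
    "Write an email to my boss that I need next week off.",
)
print(example)
```

Calling `build_prompt(text)` without a response yields a prompt that ends right after `### Response:`, which is where the model would continue generating.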

## 2. Create an instruction dataset

After defining our use case and prompt template, we need to create our instruction dataset. A high-quality instruction dataset is key for a good-performing model. The paper [“Less Is More for Alignment”](https://arxiv.org/abs/2305.11206) shows that a small, high-quality dataset (~1000 samples) can achieve the same performance as larger, lower-quality datasets.

There are several ways to create an instruction dataset, including: 

1. Use an existing dataset and convert it into an instruction dataset, e.g., [FLAN](https://huggingface.co/datasets/SirNeural/flan_v2)
2. Use existing LLMs to create synthetic instruction datasets, e.g., [Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca)
3. Use humans to create instruction datasets, e.g., [Dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k)

Each method has its own advantages and disadvantages, and the right choice depends on budget, time, and quality requirements. For example, using an existing dataset is the easiest option but might not be tailored to your specific use case, while using humans might be the most accurate but can be time-consuming and expensive. It is also possible to combine several methods to create an instruction dataset, as shown in [Orca: Progressive Learning from Complex Explanation Traces of GPT-4](https://arxiv.org/abs/2306.02707).

To keep it simple, we are going to use **[Dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k)**, an open-source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the **[InstructGPT paper](https://arxiv.org/abs/2203.02155)**, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.

Let's start coding, but first, let's install our dependencies.

In [None]:
!pip install "transformers==4.31.0" "datasets==2.13.0" "peft==0.4.0" "accelerate==0.21.0" "bitsandbytes==0.40.2" "trl==0.4.7" "safetensors>=0.3.1" --upgrade

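To load the dataset, we use the **`load_dataset()`** method from the 🤗 Datasets library. Note that the cell below loads **`philschmid/meta-shepherd-human-data`**, a dataset of question, answer, and feedback records, rather than **`databricks/databricks-dolly-15k`**; the remaining code in this post is written against its fields.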

In [1]:
from datasets import load_dataset
from random import randrange

# Load dataset from the hub
dataset = load_dataset("philschmid/meta-shepherd-human-data", split="train")

print(f"dataset size: {len(dataset)}")
print(dataset[randrange(len(dataset))])
# dataset size: 1317

Found cached dataset parquet (/home/ubuntu/.cache/huggingface/datasets/philschmid___parquet/philschmid--meta-shepherd-human-data-7a75bb0f57f8969d/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7)


dataset size: 1317
{'id': 589, 'dataset': 'PIQA', 'question': 'How to make a beaded friendship bracelet.', 'answer': 'Gather your supplies and decide on the pattern or words you want to use for each bracelet. Cut your bead string to the size you need. Nail one end of your bead string to the table. Put beads on the string in the pattern you want to use. Tie the ends of the string together.', 'feedback': 'While the option to use a nail to fix one end of the braid to the table would work it would also damage the table and cannot be considered the best solution. The correct way to make a beaded friendship bracelet without damaging the table would be to use tape instead of the nail.', 'text': '### Question: How to make a beaded friendship bracelet.\n          \n### Answer: Gather your supplies and decide on the pattern or words you want to use for each bracelet. Cut your bead string to the size you need. Nail one end of your bead string to the table. Put beads on the string in the pattern y

To instruction-tune our model, we need to convert our structured examples into a collection of tasks described via instructions. We define a **`format_instruction`** function that takes a sample and returns a string in our prompt format.

In [2]:
def format_instruction(sample):
	return f"""### Question: {sample['question']}

### Answer:
{sample['answer']}

### Feedback:
{sample['feedback']}
"""

Let's test our formatting function on a random example.

In [3]:
from random import randrange

print(format_instruction(dataset[randrange(len(dataset))]))

### Question: The moving air turned the blade and provided power.  What type of object is this?

Here are the options:
Option 1: turbine
Option 2: propeller
Option 3: cheese
Option 4: rollerblade
Option 5: windmill

Please choose the correct option and justify your choice:

### Answer:
A windmill uses the moving air (wind) to turn its blades which then rotate an axle that powers machinery.
A propeller also turns due to the moving air, but instead of powering something else directly, it pushes against the air behind it, creating thrust which moves whatever the propeller is attached to forward.  You see propellers on boats and planes.
A turbine is similar to a windmill, except it usually doesn’t have the external parts to do work; it just spins and transfers energy from the moving air into rotational energy.  You find turbines used in electricity generation (both wind turbines and steam/gas turbines), jet engines, and other places where you need high-speed rotation.
Cheese and rollerblad

## 3. Instruction-tune Llama 2 using `trl` and the `SFTTrainer`

We will use the method recently introduced in the paper "[QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)" by Tim Dettmers et al. QLoRA is a technique to reduce the memory footprint of large language models during fine-tuning without sacrificing performance. The TL;DR of how QLoRA works is:

- Quantize the pre-trained model to 4 bits and freeze it.
- Attach small, trainable adapter layers. (LoRA)
- Finetune only the adapter layers while using the frozen quantized model for context.
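To get a feeling for why the first step matters, here is a back-of-the-envelope sketch of the memory needed just to store the weights at different precisions. The parameter count of `7e9` is a round-number assumption for a 7B model, and the estimate ignores quantization constants, activations, adapter weights, and optimizer state.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory needed to store the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 7e9  # assumption: round figure for a 7B-parameter model

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # half-precision weights
int4_gb = weight_memory_gb(NUM_PARAMS, 4)   # 4-bit quantized weights

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f"4-bit weights:  {int4_gb:.1f} GB")
```

Under these assumptions, the 4-bit weights take a quarter of the space of the 16-bit weights, which leaves more GPU memory for activations and the trainable adapters.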

If you want to learn more about QLoRA and how it works, I recommend reading the [Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA](https://huggingface.co/blog/4bit-transformers-bitsandbytes) blog post.

### Flash Attention

Flash Attention is a method that reorders the attention computation and leverages classical techniques (tiling, recomputation) to significantly speed it up and reduce memory usage from quadratic to linear in sequence length. It is based on the paper "[FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness](https://arxiv.org/abs/2205.14135)".
The TL;DR: it accelerates training by up to 3x. Learn more at [FlashAttention](https://github.com/Dao-AILab/flash-attention/tree/main). Flash Attention is currently only available for Ampere (A10, A40, A100, ...) and Hopper (H100, ...) GPUs. You can check if your GPU is supported and install it using the following command:

_Note: If your machine has less than 96GB of RAM and lots of CPU cores, reduce the number of `MAX_JOBS`. On the `g5.2xlarge` we used `4`._

```bash
python -c "import torch; assert torch.cuda.get_device_capability()[0] >= 8, 'Hardware not supported for Flash Attention'"
pip install ninja packaging
MAX_JOBS=4 pip install flash-attn --no-build-isolation
```

_Installing flash attention can take quite a bit of time (10-45 minutes)._
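The tiling idea mentioned above can be illustrated in plain Python for a single query vector. This is only a sketch of the "online softmax" trick, not the actual CUDA kernel: the function names are hypothetical, and the real implementation also tiles over queries, keeps blocks in fast on-chip memory, and uses recomputation in the backward pass.

```python
import math


def naive_attention(q, keys, values):
    """Softmax(q . k) weighted sum of values, materializing all scores at once."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def tiled_attention(q, keys, values, block_size=2):
    """Same result, computed block by block with a running max and running sum."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_block]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_block):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[0.1, 0.2], [1.0, -1.0], [0.3, 0.9], [-0.5, 0.4], [2.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0], [-1.0, 3.0]]

full = naive_attention(q, keys, values)
tiled = tiled_attention(q, keys, values, block_size=2)
print(full)
print(tiled)
```

Because only one block of scores is held at a time, the full score vector never has to be stored, which is the intuition behind the reduced memory usage.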

The example supports Flash Attention for all Llama checkpoints, but it is optional. To use Flash Attention, comment in the code block below which says `# COMMENT IN TO USE FLASH ATTENTION` (it is already commented in here).


In [4]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

use_flash_attention = False  # set to True below only if the GPU supports it

# COMMENT IN TO USE FLASH ATTENTION
# replace attention with flash attention 
if torch.cuda.get_device_capability()[0] >= 8:
    from utils.llama_patch import replace_attn_with_flash_attn
    print("Using flash attention")
    replace_attn_with_flash_attn()
    use_flash_attention = True


# Hugging Face model id
model_id = "NousResearch/Llama-2-7b-hf" # non-gated
# model_id = "meta-llama/Llama-2-7b-hf" # gated


# BitsAndBytesConfig int-4 config 
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, use_cache=False, device_map="auto")
model.config.pretraining_tp = 1 

# Validate that the model is using flash attention, by comparing doc strings
if use_flash_attention:
    from utils.llama_patch import forward    
    assert model.model.layers[0].self_attn.forward.__doc__ == forward.__doc__, "Model is not using flash attention"


tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

Using flash attention


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

The `SFTTrainer` supports a native integration with `peft`, which makes it easy to efficiently instruction-tune LLMs. We only need to create our `LoraConfig` and provide it to the trainer.

In [5]:
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model

# LoRA config based on QLoRA paper
peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.1,
        r=64,
        bias="none",
        task_type="CAUSAL_LM", 
)


# prepare model for training
model = prepare_model_for_kbit_training(model)
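To see why the adapters are cheap to train, the sketch below counts the parameters LoRA adds to a single square projection matrix for the `r=64` and `lora_alpha=16` values used above. The hidden size of `4096` is an assumption about the 7B model, and `lora_params` is a hypothetical helper written for this illustration.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer: A (r x d_in) plus B (d_out x r)."""
    return r * d_in + d_out * r


HIDDEN = 4096    # assumption: hidden size of the 7B model
R = 64           # rank from the LoraConfig above
LORA_ALPHA = 16  # alpha from the LoraConfig above

full = HIDDEN * HIDDEN                 # parameters in one frozen projection matrix
added = lora_params(HIDDEN, HIDDEN, R) # trainable parameters added by LoRA
scaling = LORA_ALPHA / R               # factor applied to the low-rank update

print(f"frozen matrix: {full:,} parameters")
print(f"LoRA adapter:  {added:,} parameters ({added / full:.1%} of the matrix)")
print(f"update scaling (lora_alpha / r): {scaling}")
```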

Before we can start our training we need to define the hyperparameters (`TrainingArguments`) we want to use.

In [6]:
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="llama-7-int4-dolly",
    num_train_epochs=3,
    per_device_train_batch_size=6 if use_flash_attention else 4,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",
    logging_steps=10,
    save_strategy="epoch",
    learning_rate=2e-4,
    bf16=True,
    fp16=False,
    tf32=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="constant",
    disable_tqdm=False,  # progress bar stays enabled; with packing its values can be incorrect
)


# Upcast layer for flash attention
if use_flash_attention:
    from utils.llama_patch import upcast_layer_for_flash_attention
    torch_dtype = torch.bfloat16 if args.bf16 else torch.float16 if args.fp16 else torch.float32
    model = upcast_layer_for_flash_attention(model, torch_dtype)

model = get_peft_model(model, peft_config)
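As a quick sanity check on these hyperparameters, the following sketch computes how many sequences and tokens contribute to one optimizer step. The single-GPU setting and the sequence length of `2048` (the packing length used with the trainer) are assumptions stated in the comments.

```python
per_device_train_batch_size = 6   # value used above when Flash Attention is enabled
gradient_accumulation_steps = 2   # value from the TrainingArguments above
num_devices = 1                   # assumption: training on a single GPU
max_seq_length = 2048             # assumption: packed sequence length

# Gradients from several micro-batches are accumulated before each optimizer step.
sequences_per_step = per_device_train_batch_size * gradient_accumulation_steps * num_devices
tokens_per_step = sequences_per_step * max_seq_length

print(f"sequences per optimizer step: {sequences_per_step}")
print(f"tokens per optimizer step:    {tokens_per_step}")
```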


We now have every building block we need to create our `SFTTrainer` and start training our model.

In [7]:
from trl import SFTTrainer

max_seq_length = 2048 # max sequence length for model and packing of the dataset

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    packing=True,
    formatting_func=format_instruction, 
    args=args,
)
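The `packing=True` option combines several short examples into fixed-length sequences so that little compute is spent on padding. The following is a simplified illustration of that idea on toy token ids, not the implementation used by `trl`; `pack_sequences` is a hypothetical helper, and dropping the incomplete tail is a choice made only for this sketch.

```python
from typing import Iterable, Iterator, List


def pack_sequences(
    tokenized: Iterable[List[int]], seq_length: int, eos_id: int
) -> Iterator[List[int]]:
    """Concatenate examples separated by EOS and yield fixed-length chunks."""
    buffer: List[int] = []
    for ids in tokenized:
        buffer.extend(ids + [eos_id])
        while len(buffer) >= seq_length:
            yield buffer[:seq_length]
            buffer = buffer[seq_length:]
    # Any incomplete tail left in the buffer is dropped in this sketch.


# Toy "tokenized" examples; 0 plays the role of the EOS token id.
examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
chunks = list(pack_sequences(examples, seq_length=4, eos_id=0))
print(chunks)
```

Note that an example can be split across two chunks, which is one reason step counts and progress values behave differently when packing is enabled.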

Start training our model by calling the `train()` method on our `SFTTrainer` instance.

In [8]:
# train
trainer.train()  # the loss is logged every 10 steps (logging_steps=10)

# save model
trainer.save_model()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
10,1.5494
20,1.423
30,1.3436
40,1.2949


Training without Flash Attention took 03:08:00 on a `g5.2xlarge`. The instance costs `$1.212/h`, which brings the total cost to about `$3.8`.
Training with Flash Attention took 02:08:00 on a `g5.2xlarge`. The instance costs `$1.212/h`, which brings the total cost to about `$2.6`.

The results with Flash Attention are impressive: about 1.5x faster and roughly 30% cheaper.
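The comparison can be recomputed from the durations and hourly price quoted above. This is a small sketch that reads the quoted price as 1.212 USD per hour; `training_cost` is a hypothetical helper.

```python
def training_cost(hours: int, minutes: int, price_per_hour: float) -> float:
    """Cost of a run billed linearly by wall-clock time."""
    return (hours + minutes / 60) * price_per_hour


PRICE_PER_HOUR = 1.212  # assumption: the quoted g5.2xlarge price in USD per hour

baseline = training_cost(3, 8, PRICE_PER_HOUR)  # without Flash Attention
flash = training_cost(2, 8, PRICE_PER_HOUR)     # with Flash Attention

speedup = (3 * 60 + 8) / (2 * 60 + 8)  # ratio of wall-clock times
savings = 1 - flash / baseline         # relative cost reduction

print(f"baseline cost: ${baseline:.2f}")
print(f"flash cost:    ${flash:.2f}")
print(f"speedup:       {speedup:.2f}x")
print(f"savings:       {savings:.0%}")
```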

## 4. Test Model and run Inference

After the training is done we want to run and test our model. We will use `peft` and `transformers` to load our LoRA adapter into our model.

In [1]:
# if use_flash_attention:
#     # unpatch flash attention
#     from utils.llama_patch import unplace_flash_attn_with_attn
#     unplace_flash_attn_with_attn()
    
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer


output_dir = "llama-7-int4-dolly"

# load base LLM model and tokenizer
model = AutoPeftModelForCausalLM.from_pretrained(
    output_dir,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    load_in_4bit=True,
) 
tokenizer = AutoTokenizer.from_pretrained(output_dir)

Let’s load the dataset again and pick a random sample to generate feedback for.

In [4]:
from datasets import load_dataset 
from random import randrange


# Load dataset from the hub and get a sample
dataset = load_dataset("philschmid/meta-shepherd-human-data", split="train")
sample = dataset[randrange(len(dataset))]

prompt = f"""### Question: {sample['question']}

### Answer:
{sample['answer']}

### Feedback:
"""

input_ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.cuda()
# with torch.inference_mode():
outputs = model.generate(input_ids=input_ids, max_new_tokens=100, do_sample=True, top_p=0.9,temperature=0.9)

print(prompt[:-14])
print("---"*35)
print(f"### Generated Feedback:\n{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(prompt):]}")
print(f"### Ground truth Feedback:\n{sample['feedback']}")

Found cached dataset parquet (/home/ubuntu/.cache/huggingface/datasets/philschmid___parquet/philschmid--meta-shepherd-human-data-7a75bb0f57f8969d/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7)
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


### Question: Give a summary of the below article:
13 May 2016 Last updated at 07:08 BST . A lot of time and money is spent trying to keep animals safe. Rangers at Kariega Game Reserve have lots of high-tech gear to keep track of their animals and keep poachers away. Patrols are carried out by the rangers, especially at night. Armed with special cameras and night vision goggles, Ayshah joins the rangers as they head out in the night to keep a watch on the wildlife.

### Answer:
All this week Newsround is looking at wildlife in Africa.


---------------------------------------------------------------------------------------------------------
### Generated Feedback:
The answer does not provide any information about the article. The answer is also very vague.

### Ground truth Feedback:
The answer did not mention that South Africa is home to many wild animals, nor did they discuss the issue of poaching in this location.
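The `generate()` call above samples with `top_p=0.9` and `temperature=0.9`. The sketch below shows what these two settings do to a toy list of logits. It is a simplified illustration rather than the code path used by `transformers`, and `sample_top_p` is a hypothetical helper.

```python
import math
import random


def sample_top_p(logits, top_p=0.9, temperature=0.9, rng=random):
    """Sample an index using temperature scaling followed by nucleus (top-p) filtering."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = [value / temperature for value in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most likely tokens whose cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Sample among the kept tokens, weighted by their probabilities.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


rng = random.Random(0)
peaked = [sample_top_p([5.0, 1.0, 0.0, -3.0], rng=rng) for _ in range(20)]
flat = [sample_top_p([0.0, 0.0, 0.0, 0.0], top_p=0.5, rng=rng) for _ in range(20)]
print(peaked)
print(flat)
```

With the peaked logits, the most likely token alone already covers more than 90% of the probability mass, so it is the only candidate left; with the flat logits and `top_p=0.5`, two of the four tokens remain as candidates.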


Nice! Our model works! If we want to accelerate our model, we can deploy it with [Text Generation Inference](https://github.com/huggingface/text-generation-inference). To do so, we need to merge our adapter weights into the base model.

In [2]:
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    output_dir,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    device_map="auto",
) 

# Merge LoRA and base model
merged_model = model.merge_and_unload()

# Save the merged model
merged_model.save_pretrained("merged_model",safe_serialization=True)
tokenizer.save_pretrained("merged_model")

# push merged model to the hub
merged_model.push_to_hub("philschmid/shepherd-2-hf-int4")
tokenizer.push_to_hub("philschmid/shepherd-2-hf-int4")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]


Thrown during validation:
`do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.


pytorch_model-00001-of-00002.bin:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

pytorch_model-00002-of-00002.bin:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/philschmid/shepherd-2-hf-int4/commit/80fcf3b0b7bfb232197f84c76da4ee0529df35e7', commit_message='Upload tokenizer', commit_description='', oid='80fcf3b0b7bfb232197f84c76da4ee0529df35e7', pr_url=None, pr_revision=None, pr_num=None)
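Conceptually, merging a LoRA adapter folds the low-rank update into the frozen weight matrix, so that inference no longer needs a separate adapter path. The following toy example with tiny hand-written matrices sketches that idea; it is not the `peft` implementation, and all values and helper names are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def matvec(m, x):
    """Multiply a matrix (list of rows) with a vector."""
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in m]


W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # frozen base weight (2 x 3)
A = [[0.5, -1.0, 2.0]]                  # LoRA "down" matrix (r x 3), here r = 1
B = [[1.0], [-2.0]]                     # LoRA "up" matrix (2 x r)
lora_alpha, r = 2, 1                    # toy values, not the ones used for training
scaling = lora_alpha / r

# Merge: W_merged = W + scaling * (B @ A)
delta = matmul(B, A)
W_merged = [
    [w + scaling * d for w, d in zip(w_row, d_row)]
    for w_row, d_row in zip(W, delta)
]

x = [1.0, 2.0, 3.0]
# Adapter path: base output plus the scaled low-rank update applied to x.
adapter_out = [
    base + scaling * low_rank
    for base, low_rank in zip(matvec(W, x), matvec(B, matvec(A, x)))
]
merged_out = matvec(W_merged, x)

print(adapter_out)
print(merged_out)
```

Both paths give the same output for this input, which is why a merged model can be served like a regular checkpoint.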