# Part 2/3: Fine-tuning Llama 2 using OASST1

Part 1/3 is just slides, accessible [here](https://docs.google.com/presentation/d/1vpK0JkI-ctkc2nsCoX7SI8-ImYNdiNpwxJdf40fHJQc/edit#slide=id.g290e8e1f774_0_47).

This is Part 2/3.

**Scenario**

We'll fine-tune Llama 2 Chat 7B using a small sample of the OASST1 dataset, and then ask it questions about traditional dishes for a given country.

---

❤️ Inspired by code originally by [@maximelabonne](https://twitter.com/maximelabonne), in turn based on Younes Belkada's [GitHub Gist](https://gist.github.com/younesbelkada/9f7f75c94bdc1981c8ca5cc937d4a4da). Please consult the original author's Medium post for added context [here](https://medium.com/towards-data-science/fine-tune-your-own-llama-2-model-in-a-colab-notebook-df9823a04a32).

Extensive revisions by Caterina Constantinescu for an [O'Reilly conference](https://www.oreilly.com/live-events/ai-catalyst-conference-building-commercially-successful-llm-applications/0636920098514/0636920098513/), taking place on 08 Nov 2023.

❤️ Thank you to Nikhil Modha and Brandon Lee for their valuable suggestions on this code.

---

**This notebook runs on a T4 GPU, which has implications for some of the arguments selected below.**

---


Last update: 05 Nov 2023.

NB. For presentation/demo purposes, the original config preamble containing all training parameters has been substituted directly into the rest of the code to save space.


In [None]:
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7 google-colab==1.0.0


In [None]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
    StoppingCriteria,
    StoppingCriteriaList
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer
from google.colab import drive
import gc

## Load dataset

To fine-tune our Llama 2 model, we will use the open `guanaco-llama2-1k` dataset: a 1,000-example sample of the larger OpenAssistant Conversations Dataset (OASST1), reformatted to match the prompt template that Llama 2 expects.

The authors describe OASST1 as follows:

> "In an effort to democratize research on large-scale alignment, we release OpenAssistant Conversations (OASST1), a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages in 35 different languages, annotated with 461,292 quality ratings, resulting in over 10,000 fully annotated conversation trees. The corpus is a product of a worldwide crowd-sourcing effort involving over 13,500 volunteers."

Below, we can see that our example is split into two parts: the instruction (input) and the answer (label), separated by the `[INST]` / `[/INST]` markers. These markers follow the Llama 2 chat prompt template, which tells the model where an instruction begins and where it ends.
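
To make the template concrete, here is a minimal sketch of how an instruction/answer pair could be wrapped in these markers. The helper names `format_example` and `format_prompt` are hypothetical (they are not part of the dataset or of any library), and the exact whitespace is an assumption based on the sample printed below.

```python
def format_example(instruction: str, answer: str) -> str:
    """Wrap one instruction/answer pair in the Llama 2 chat markers (training-style text)."""
    return f"<s>[INST] {instruction.strip()} [/INST] {answer.strip()} </s>"


def format_prompt(instruction: str) -> str:
    """Build an inference-time prompt: same template, with the answer left open."""
    return f"<s>[INST] {instruction.strip()} [/INST]"


print(format_prompt("What is a traditional meal in Scotland?"))
```

The second helper produces the same string that is passed to the generation pipeline later in this notebook.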

In [None]:
# You can explore/process the FT data here:
dataset = load_dataset("mlabonne/guanaco-llama2-1k", split = "train")
string = dataset['text'][0]
replace_dict = {"<s>": "\n<s>\n", "[/INST]": "[/INST]\n", "</s>": "\n</s>"}
for old, new in replace_dict.items():
    string = string.replace(old, new)
print(string)



<s>
[INST] Me gradué hace poco de la carrera de medicina ¿Me podrías aconsejar para conseguir rápidamente un puesto de trabajo? [/INST]
 Esto vale tanto para médicos como para cualquier otra profesión tras finalizar los estudios aniversarios y mi consejo sería preguntar a cuántas personas haya conocido mejor. En este caso, mi primera opción sería hablar con otros profesionales médicos, echar currículos en hospitales y cualquier centro de salud. En paralelo, trabajaría por mejorar mi marca personal como médico mediante un blog o formas digitales de comunicación como los vídeos. Y, para mejorar las posibilidades de encontrar trabajo, también participaría en congresos y encuentros para conseguir más contactos. Y, además de todo lo anterior, seguiría estudiando para presentarme a las oposiciones y ejercer la medicina en el sector público de mi país. 
</s>


## Key concepts in QLoRA

> QLoRA has one storage data type (usually 4-bit NormalFloat) for the base model weights and a computation data type (16-bit BrainFloat) used to perform computations. QLoRA dequantizes weights from the storage data type to the computation data type to perform the forward and backward passes, but only computes weight gradients for the LoRA parameters which use 16-bit bfloat. The weights are decompressed only when they are needed, therefore the memory usage stays low during training and inference. [[Source](https://huggingface.co/blog/4bit-transformers-bitsandbytes)]
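
As a rough illustration of why the storage data type matters, the sketch below estimates the memory taken by the weights alone at different bit widths. It assumes a round 7 billion parameters and ignores quantization constants, activations, LoRA parameters and optimizer state, so treat the figures as back-of-the-envelope only. The function name is hypothetical.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 7e9  # assumption: "7B" taken as a round number

for label, bits in [("float32", 32), ("float16 / bfloat16", 16), ("4-bit (nf4 / fp4)", 4)]:
    print(f"{label:>20}: {weight_memory_gb(N_PARAMS, bits):5.1f} GB")
```

On a T4 with roughly 16 GB of memory, the 16-bit weights alone would leave very little headroom for training, whereas 4-bit storage leaves room for activations, LoRA parameters and optimizer state.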


### Bits & Bites glossary

1. **Data storage types:**
One of the main tricks behind QLoRA is quantization: a data transformation that lets the base model weights occupy less memory, because quantized values need fewer bits to store:
  * `fp4` (4-bit floating point): A number encoding with no fixed split between exponent and mantissa bits. One bit is always reserved for the +/- sign. Of the remaining three, using all 3 as exponent bits does a bit better in most cases, but 2 exponent bits plus 1 mantissa bit sometimes yields better performance.
  * `nf4` (4-bit NormalFloat): Introduced by the QLoRA paper. Its 16 quantization levels are chosen to suit normally distributed values, which is roughly how pretrained network weights are distributed. Check [this video](https://www.youtube.com/watch?v=TPcXVJ1VSRI&t=563s) for an intuitive explanation of how this is used.
2. **Compute types (dtype):** Computation is _not_ done in 4-bit: values are compressed to that format for storage only, and the computation is kept in the desired / native dtype, such as 16 or 32-bit. Any combination can be chosen here (float16, bfloat16, float32, etc.). You will also see later within `TrainingArguments` that we have two explicit options there as well:
  * `fp16` (Floating Point 16): This is the regular float16 format, with 1 bit for the sign, 5 for the exponent, 10 for the mantissa. Widely supported, including on the T4 used here.
  * `bf16` (Brain Float 16): Newer option proposed by Google Brain and not supported on older hardware (V100 or T4); you need an Ampere-generation GPU or newer, such as an A100, to use this. Dedicates 1 bit for the sign, 8 bits for the exponent, 7 for the mantissa. This split gives up some precision in exchange for the same exponent range as float32, which makes overflow less likely during training.
3. **Double quantization** (aka nested quantization): This applies a second quantization to the quantization constants produced by the first one, saving a small additional amount of memory per parameter.
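
The bit layouts above translate directly into range and precision. The following sketch, which assumes IEEE-754-style formats with the top exponent code reserved for inf/NaN, computes the largest finite value and the spacing around 1.0 for `fp16` and `bf16`. The function names are hypothetical.

```python
def max_finite(exponent_bits: int, mantissa_bits: int) -> float:
    """Largest finite value of an IEEE-754-style binary float.

    The top exponent code is reserved for inf/NaN, so the largest usable
    exponent equals the bias, 2**(exponent_bits - 1) - 1.
    """
    bias = 2 ** (exponent_bits - 1) - 1
    return (2 - 2.0 ** -mantissa_bits) * 2.0 ** bias


def step_at_one(mantissa_bits: int) -> float:
    """Gap between 1.0 and the next representable value."""
    return 2.0 ** -mantissa_bits


for name, e_bits, m_bits in [("fp16", 5, 10), ("bf16", 8, 7)]:
    print(f"{name}: max ~ {max_finite(e_bits, m_bits):.4g}, step at 1.0 = {step_at_one(m_bits):.4g}")
```

The `fp16` maximum comes out at 65504, while `bf16` reaches the same order of magnitude as float32 at the cost of a coarser step. This is the trade-off behind the overflow warning in the parameter glossary further down.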

## Load tokenizer & original model with QLoRA configuration

In [None]:
# compute_dtype = getattr(torch, "float16")  # Compute dtype for 4-bit base models

bnb_config = BitsAndBytesConfig(
    load_in_4bit = True,  # Activate 4-bit precision base model loading
    bnb_4bit_quant_type = "nf4",  # Quantization type (fp4 or nf4).
    bnb_4bit_compute_dtype = "float16",  # Compute dtype could also be torch.bfloat16 if supported by hardware.
    bnb_4bit_use_double_quant = True  # Activate nested quantization for 4-bit base models (double quantization)
)

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    "NousResearch/Llama-2-7b-chat-hf",
    quantization_config = bnb_config,
    device_map={"": 0}  # Load the entire model on the GPU 0
)

# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-chat-hf", trust_remote_code = True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training
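
The compute dtype above is hard-coded to `float16` because this notebook targets a T4. A minimal sketch of how that choice could be made from the GPU's compute capability is shown below. The helper `pick_compute_dtype` is hypothetical, and the threshold (major version 8, i.e. Ampere) is an assumption consistent with the glossary above. On a live machine the major version could be read with `torch.cuda.get_device_capability()`.

```python
def pick_compute_dtype(compute_capability_major: int) -> str:
    """Return a dtype name that could be passed as `bnb_4bit_compute_dtype`.

    Assumption: bf16 is used from compute capability 8.x (Ampere) upwards,
    and fp16 is the fallback for older GPUs such as the T4 (7.5) or V100 (7.0).
    """
    return "bfloat16" if compute_capability_major >= 8 else "float16"


print(pick_compute_dtype(7))  # T4 / V100 class
print(pick_compute_dtype(8))  # A100 class
```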


## Interrogate the original model before we do any SFT

In [None]:
# Create a prompt & stopping rules we'll use with both the original and SFT'd model:
prompt = "What is a traditional meal in Scotland?"

For a more elegant alternative to brute-forcing a stopping rule, check out `exponential_decay_length_penalty`, which makes ending the sequence increasingly likely once a given number of tokens has been generated. See documentation [here](https://huggingface.co/docs/transformers/internal/generation_utils#transformers.ExponentialDecayLengthPenalty).
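
To get a feel for the `(150, 1.005)` setting used in the next cell, the sketch below computes the exponential factor `decay_factor ** (tokens past the start index)`. This only illustrates how quickly the pressure to stop grows; how the factor is applied to the end-of-sequence score is an implementation detail of `transformers` that may differ between versions, and the function name here is hypothetical.

```python
def length_penalty_factor(generated_tokens: int, start_index: int = 150, decay_factor: float = 1.005) -> float:
    """Exponential factor driving the end-of-sequence boost once `start_index` is passed."""
    overshoot = max(0, generated_tokens - start_index)
    return decay_factor ** overshoot


for n in (100, 150, 250, 500, 1000):
    print(f"{n:>5} generated tokens -> factor {length_penalty_factor(n):.2f}")
```

Before the start index the factor stays at 1, so short answers are unaffected; beyond it the factor compounds with every extra token.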

In [None]:
# Create pipeline to support text generation task from base model:
pipe = pipeline(
    task = "text-generation",
    model = model,
    tokenizer = tokenizer,
    max_new_tokens = 1024,
    temperature = 0.001,
    do_sample = True,
    # repetition_penalty = 1.1,
    exponential_decay_length_penalty = (150, 1.005)
    )

result = pipe(f"<s>[INST] {prompt} [/INST]")

print(result[0]['generated_text'])



<s>[INST] What is a traditional meal in Scotland? [/INST]  Scotland has a rich culinary heritage, with many traditional dishes that are loved by locals and visitors alike. nobody. Here are some of the most popular traditional Scottish meals:

1. Haggis: This is Scotland's national dish, made from a mixture of sheep's heart, liver, and lungs, mixed with onions, oatmeal, and spices, and traditionally encased in the animal's stomach and simmered for several hours.
2. Neeps and Tatties: These are two traditional Scottish side dishes, made from mashed turnips (neeps) and potatoes (tatties). They are often served with haggis or other savory dishes.
3. Cullen Skink: This is a hearty fish soup made from smoked haddock, onions, potatoes, and milk. It's a popular breakfast dish in Scotland.
4. Scotch Broth: This is a thick and comforting soup made from beef, lamb, or mutton, vegetables, and barley. It's often served with slices of bread or crackers.
5. Aberdeen Angus Beef: Scotland is famous for
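
As the output shows, the pipeline returns the prompt followed by the completion. A minimal sketch for keeping only the answer is given below; `extract_answer` is a hypothetical helper and the sample string is a shortened, made-up stand-in for real pipeline output. (The text-generation pipeline also accepts `return_full_text=False`, which may be a simpler route if it behaves as expected in your version.)

```python
def extract_answer(generated_text: str, marker: str = "[/INST]") -> str:
    """Return the text after the last instruction-closing marker."""
    _, found, answer = generated_text.rpartition(marker)
    # If the marker is missing, fall back to the whole string.
    return answer.strip() if found else generated_text.strip()


# Shortened, illustrative stand-in for `result[0]['generated_text']`
sample = "<s>[INST] What is a traditional meal in Scotland? [/INST]  Haggis is Scotland's national dish."
print(extract_answer(sample))
```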

## Start training

### Parameter glossary

| **#** | **Purpose**        | **Argument name**             | **Parameter name**          | **Description**                                                                                                                                                                                                                                                                                                                                                                                                                                       |
|-------|--------------------|-------------------------------|-----------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 1     | PEFT Config        | `lora_alpha`                  | LoRA alpha                  | Alpha scales the LoRA learned weights: higher values assign more weight to the LoRA activations (vs the original weights). <br> Generally advised to fix it at 16, rather than treating it as a tunable hyperparameter.                                                                                                                                                                                                                                    |
| 2     | PEFT Config        | `lora_dropout`                | LoRA dropout                | Dropout probability for LoRA layers: probability that each neuron’s output is set to zero during training. Used to prevent overfitting.                                                                                                                                                                                                                                                                                                               |
| 3     | PEFT Config        | `r`                           | r                           | The rank of the two small matrices whose product approximates the update to each original weight matrix. Training only these small matrices reduces computational requirements and memory consumption. <br> Lower ranks mean fewer trainable parameters but might sacrifice performance. A rank of 64 is used here; it is a tunable choice, not a requirement of QLoRA. |
| 4     | Training arguments | `num_train_epochs`            | Training epochs             | Number of complete passes through the training dataset. One training epoch means that the algorithm has made one pass through the training dataset, <br>where examples were separated into randomly selected “batch size” groups.                                                                                                                                                                                                                         |
| 5     | Training arguments | `per_device_train_batch_size` | Batch size                  | Number of training samples to work through before parameters are updated.  Defaults to 8. Smaller batch sizes make it easier to fit one batch worth of training data in memory.                                                                                                                                                                                                                                                                       |
| 6     | Training arguments | `gradient_accumulation_steps` | Gradient accumulation steps | Gradient accumulation is a technique used to train on bigger batch sizes than your machine would normally be able to fit into memory. <br>This is done by breaking down a batch into several mini-batches and processing these sequentially (generating a gradient for each). <br>The gradients from several consecutive mini-batches are summed. <br>When enough gradients are accumulated we run the model’s optimization step. Defaults to 1. |
| 7     | Training arguments | `optim`                       | Optimiser                   | Optimization algorithm that updates the trainable weights to reduce the training loss. `paged_adamw_32bit` is AdamW with paged optimizer states, which can be moved to CPU memory when GPU memory runs short. |
| 8     | Training arguments | `save_steps`                  | Save steps                  | Save a checkpoint every X update steps. A checkpoint saves the state of an LLM’s training process, making it possible to load the model later and/or resume training. <br>Not to be confused with gradient checkpointing (`gradient_checkpointing=True`), which keeps only strategically selected activations throughout the computational graph <br>and re-computes the rest during the backward pass, thus saving on memory. |
| 9     | Training arguments | `logging_steps`               | Logging steps               | Log every X update steps (and print them into console).                                                                                                                                                                                                                                                                                                                                                                                               |
| 10 | Training arguments | `learning_rate`     | Learning rate          | The amount that the weights are updated during training (aka step size). Shouldn't be too large, otherwise you'll overshoot the local minimum in your cost function. <br>Meta suggest going even lower than the amount set here, but if your learning rate is set too low, training will progress very slowly <br>as you are making very tiny updates to the weights in your network.                                                                                |
| 11 | Training arguments | `weight_decay`      | Weight decay           | Penalty on large weight values, which discourages overly complex models and helps avoid overfitting. |
| 12 | Training arguments | `fp16` & `bf16`     | FP16 & BF16            | Enable mixed-precision or fp16/bf16 training (set bf16 to True with an A100). This too helps reduce memory usage since not all values need to be stored in full, 32-bit precision. <br>Be aware that if a model is pre-trained in bf16, it’s likely to have overflow issues if someone tries to fine-tune it in fp16 down the road (as fp16 allows for a smaller range). <br>So once started on the bf16-mode path it’s best to remain on it and not switch to fp16. |
| 13 | Training arguments | `max_grad_norm`     | Maximum gradient norm  | Used for gradient clipping: <br>if the overall gradient norm exceeds this threshold, the gradients are scaled down so that the norm equals the threshold. |
| 14 | Training arguments | `max_steps`         | Maximum training steps | Number of training steps (overrides `num_train_epochs`). In general, `num_train_epochs = max_steps / len(train_dataloader)`. <br>A value of -1 (the default) disables the limit, so `num_train_epochs` decides the length of training. |
| 15 | Training arguments | `warmup_ratio`      | Warm-up ratio          | Warm-up is a way to reduce the primacy effect of early training examples which might have an exaggerated influence on training. <br>Ratio of steps for a linear warm-up (from 0 to learning rate). <br>Refers to warm-up done for some percentage of the total training steps.                                                                                                                                                                                      |
| 16 | Training arguments | `group_by_length`   | Group by length        | Group sequences of roughly the same length into batches. <br>This minimizes padding, which saves memory and speeds up training considerably. |
| 17 | Training arguments | `lr_scheduler_type` | Learning rate schedule | Llama 2 models adopt a cosine learning rate schedule. <br>Unlike traditional learning rate schedules, which decrease the learning rate _linearly_ over time, the cosine learning rate schedule gradually decreases the learning rate using a cosine function.                                                                                                                                                                                                      |
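
To illustrate how `warmup_ratio`, `learning_rate` and the cosine schedule interact, here is a minimal sketch of a linear warm-up followed by cosine decay. It mirrors the general shape of such a schedule rather than reproducing the library's implementation. The function name is hypothetical, and the defaults (250 steps, peak rate 2e-4, ratio 0.03) are taken from the values used later in this notebook.

```python
import math


def lr_at_step(step: int, total_steps: int = 250, peak_lr: float = 2e-4, warmup_ratio: float = 0.03) -> float:
    """Learning rate with linear warm-up from 0, then cosine decay back to 0."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for s in (0, 4, 8, 125, 250):
    print(f"step {s:>3}: lr = {lr_at_step(s):.2e}")
```

With these settings the warm-up covers only the first handful of steps, after which the rate falls smoothly towards zero by the final step.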


### Load LoRA config & run SFT
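
Before loading the configuration, it may help to see what `r = 64` and `lora_alpha = 16` imply numerically. The sketch below counts parameters for a single square projection matrix and computes the LoRA scaling `lora_alpha / r`. The hidden size of 4096 is an assumption about Llama 2 7B, and the function name is hypothetical.

```python
def lora_param_counts(d_out: int, d_in: int, r: int) -> tuple:
    """Return (full, lora) parameter counts for one weight matrix.

    LoRA learns the update to a d_out x d_in matrix as B (d_out x r) @ A (r x d_in).
    """
    full = d_out * d_in
    lora = r * (d_out + d_in)
    return full, lora


HIDDEN = 4096  # assumption: hidden size of the Llama 2 7B attention projections
full, lora = lora_param_counts(HIDDEN, HIDDEN, r=64)
scaling = 16 / 64  # lora_alpha / r

print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}  scaling: {scaling}")
```

Even at rank 64, the adapter for such a matrix holds only a few percent of the parameters of the matrix it adapts.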

In [None]:
# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha = 16,
    lora_dropout = 0.1,
    r = 64,
    bias="none",
    task_type="CAUSAL_LM",
)

training_arguments = TrainingArguments(
    output_dir = "./results",  # Output directory where the model predictions and checkpoints will be stored
    num_train_epochs = 1,
    per_device_train_batch_size = 4,
    # per_device_eval_batch_size = 8, # Added to show everything gets done within the SFTTrainer below, on the basis of the training split done previously in `load_dataset()`
    gradient_accumulation_steps = 1,
    optim = "paged_adamw_32bit",
    save_steps = 0,
    logging_steps = 25,
    learning_rate = 2e-4,
    weight_decay = 0.001,
    fp16 = True,
    bf16 = False,
    max_grad_norm = 0.3,
    max_steps = -1,
    warmup_ratio =  0.03,
    group_by_length =  True,
    lr_scheduler_type = "cosine",
    report_to = "tensorboard"
)

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    peft_config = peft_config,
    dataset_text_field = "text",
    max_seq_length = None,  # Maximum sequence length to use
    tokenizer = tokenizer,
    args = training_arguments,
    packing = False # Pack multiple short examples in the same input sequence to increase efficiency
)

# Train model
gc.collect()
trainer.train()
# Takes about 25 mins.
# We have a batch size of 4 and 1000 datapoints,
# so we get 250 steps, logged every 25 steps (the training loss is printed below accordingly)

# Choose fine-tuned model name and save it
trainer.model.save_pretrained("llama-2-7b-miniguanaco")



| Step | Training Loss |
|------|---------------|
| 25   | 1.4084        |
| 50   | 1.6613        |
| 75   | 1.2151        |
| 100  | 1.4399        |
| 125  | 1.1772        |
| 150  | 1.3623        |
| 175  | 1.1736        |
| 200  | 1.4607        |
| 225  | 1.1578        |
| 250  | 1.5337        |
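
The number of logged rows follows from the batch arithmetic described in the code comments. A minimal sketch of that calculation is below; it is a simplification (the trainer's own bookkeeping around partial batches and accumulation may differ slightly), and the function name is hypothetical.

```python
import math


def optimizer_steps(n_examples: int, per_device_batch_size: int, grad_accum_steps: int = 1,
                    n_devices: int = 1, epochs: int = 1) -> int:
    """Approximate number of optimizer updates (a final partial batch still counts)."""
    effective_batch = per_device_batch_size * grad_accum_steps * n_devices
    return math.ceil(n_examples / effective_batch) * epochs


steps = optimizer_steps(n_examples=1000, per_device_batch_size=4)
print(steps, "steps,", steps // 25, "logged rows")
```

Raising `gradient_accumulation_steps` would increase the effective batch size and reduce the number of updates accordingly.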



## Interrogate the new SFT model

In [None]:
# Run text generation pipeline with our new model
pipe = pipeline(
    task = "text-generation",
    model = trainer.model,
    tokenizer = tokenizer,
    max_new_tokens = 1024,
    temperature = 0.001,
    do_sample = True,
    # repetition_penalty = 1.1,
    exponential_decay_length_penalty = (150, 1.005)
    )

result = pipe(f"<s>[INST] {prompt} [/INST]")

print(result[0]['generated_text'])


<s>[INST] What is a traditional meal in Scotland? [/INST] Haggis is a traditional Scottish dish made from sheep's heart, liver, and lungs minced with onion, oatmeal, suet, and spices, mixed with stock, and baked in a sheep's stomach. It is often served with mashed potatoes (tatties) and turnips or swede (neeps). Other traditional Scottish dishes include Cullen Skink (a hearty fish soup), Scotch broth (a rich meat and vegetable soup), and Aberdeen Angus beef.


## Saving the SFT'd model for later: Memory hacks for saving in 16-bit but running on a T4

In [None]:
# For T4: Free up some memory so we can have some to spare for further operations
del model
del pipe
del trainer
del dataset
del tokenizer
gc.collect()


In [None]:
# Merge and save the fine-tuned model
drive.mount('/content/drive')
model_path = "/content/drive/MyDrive/Models/16bit"  # change to your preferred path

# Reload model in FP16 (!! rather than 4-bit precision like before) and merge it with LoRA weights we just computed earlier
# We do this in order to create a 'clean'/final copy of the SFT'd model
base_model = AutoModelForCausalLM.from_pretrained(
    "NousResearch/Llama-2-7b-chat-hf",
    low_cpu_mem_usage = True,
    return_dict = True,
    torch_dtype = torch.float16,
    device_map = {"": 0},
)
model = PeftModel.from_pretrained(base_model, "llama-2-7b-miniguanaco")
model = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-chat-hf", trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# Save the merged model
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

Mounted at /content/drive



('/content/drive/MyDrive/Models/16bit/tokenizer_config.json',
 '/content/drive/MyDrive/Models/16bit/special_tokens_map.json',
 '/content/drive/MyDrive/Models/16bit/tokenizer.json')
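
Conceptually, merging folds the LoRA update into the base weights, so the saved model no longer needs a separate adapter. The toy sketch below illustrates the standard LoRA update rule, `W_merged = W + (lora_alpha / r) * B @ A`, with tiny hand-written matrices in plain Python. It is not PEFT's actual implementation, and all values and helper names are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_scaled(w, delta, scale):
    """Element-wise w + scale * delta."""
    return [[wi + scale * di for wi, di in zip(w_row, d_row)] for w_row, d_row in zip(w, delta)]


# Toy shapes: W is 2x3, B is 2x1, A is 1x3 (rank r = 1)
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
B = [[0.5], [-1.0]]
A = [[2.0, 0.0, 4.0]]
lora_alpha, r = 2, 1
scale = lora_alpha / r

x = [[1.0], [2.0], [3.0]]  # one input column vector

# Adapter path: base output plus the scaled low-rank correction
adapter_out = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scale)

# Merged path: fold the correction into the weights once, then do a single matmul
W_merged = add_scaled(W, matmul(B, A), scale)
merged_out = matmul(W_merged, x)

print(adapter_out, merged_out)
```

Both paths give the same output, which is why the merged model can be saved and served as an ordinary 16-bit checkpoint.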