# Fine-tune Llama 2 in Google Colab
> 🗣️ Large Language Model Course

❤️ Created by [@maximelabonne](https://twitter.com/maximelabonne), based on Younes Belkada's [GitHub Gist](https://gist.github.com/younesbelkada/9f7f75c94bdc1981c8ca5cc937d4a4da). Special thanks to Tolga HOŞGÖR for his solution to empty the VRAM.

This notebook runs on a T4 GPU. (Last update: 24 Aug 2023)


## Install

In [None]:
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7


## Import Libraries

In [None]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

## Setting Up Hyperparameters

The next cell defines every hyperparameter used to fine-tune Llama 2 (7 billion parameters) on an instruction dataset. The settings fall into five groups:

1. **Model and Dataset Selection:**
   - `model_name`: Specifies the base model to be fine-tuned, sourced from the Hugging Face hub.
   - `dataset_name`: Identifies the instruction dataset used for fine-tuning.
   - `new_model`: Defines the name for the fine-tuned model.

2. **QLoRA Parameters:**
   - These parameters configure LoRA (Low-Rank Adaptation), a technique that introduces trainable parameters to the attention mechanism of transformer models to adapt large pre-trained models with minimal additional parameters.
     - `lora_r`: The rank for LoRA's low-rank matrices, affecting the number of parameters added.
     - `lora_alpha`: A scaling factor for adjusting the magnitude of LoRA's modifications.
     - `lora_dropout`: Dropout probability applied to LoRA layers for regularization.

3. **bitsandbytes Parameters:**
   - Optimizations related to model quantization and memory efficiency, particularly useful for running large models on limited hardware.
     - `use_4bit`: Enables loading the base model in 4-bit precision to reduce memory usage.
     - `bnb_4bit_compute_dtype`: The data type used for computation in 4-bit mode.
     - `bnb_4bit_quant_type`: Specifies the 4-bit quantization data type: `fp4` (4-bit floating point) or `nf4` (4-bit NormalFloat).
     - `use_nested_quant`: Enables double quantization for further memory reduction.

4. **TrainingArguments Parameters:**
   - Configuration for the training process, including optimization settings, scheduling, and resource management.
     - Includes settings for output directories, training epochs, precision (mixed precision training options), batch sizes, gradient accumulation, gradient checkpointing, learning rate, optimizer choices, and more.
     - Optimizations like gradient checkpointing and group_by_length are used to manage memory efficiently, enabling training of large models on hardware with limited memory.

5. **SFT Parameters:**
   - Specific to supervised fine-tuning (SFT), controlling how input sequences are prepared for training.
     - `max_seq_length`: Limits the length of input sequences.
     - `packing`: Whether to pack multiple short examples into a single input sequence.
     - `device_map`: Maps model parts to devices. The value used here, `{"": 0}`, loads the entire model on GPU 0.

Together, these settings make it possible to fine-tune a 7B-parameter model on a single T4 GPU: LoRA keeps the number of trainable parameters small, bitsandbytes shrinks the memory footprint of the frozen base model, and the training arguments control the optimization itself.
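To see why `group_by_length` can save memory and time, the sketch below counts how many token positions are processed when each batch is padded to its longest sequence. The sequence lengths are hypothetical random values, and fully sorting by length is a simplification: it only illustrates the effect and is not how the Trainer's sampler is implemented.

```python
import random


def padded_tokens(lengths, batch_size):
    """Total token positions when every batch is padded to its longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)
    return total


random.seed(0)
# Hypothetical sequence lengths (in tokens) for 1,000 training examples.
lengths = [random.randint(20, 512) for _ in range(1000)]

real = sum(lengths)                                      # tokens that carry content
random_order = padded_tokens(lengths, batch_size=4)      # batches in arbitrary order
grouped = padded_tokens(sorted(lengths), batch_size=4)   # batches of similar length

print(f"real tokens:            {real}")
print(f"padded, random batches: {random_order}")
print(f"padded, length-grouped: {grouped}")
```

The gap between the last two numbers is padding that the GPU would otherwise process for nothing.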

In [None]:
# The model that you want to train from the Hugging Face hub
model_name = "NousResearch/Llama-2-7b-chat-hf"

# The instruction dataset to use
dataset_name = "mlabonne/guanaco-llama2-1k"

# Fine-tuned model name
new_model = "llama-2-7b-miniguanaco"

################################################################################
# QLoRA parameters
################################################################################

# LoRA attention dimension
lora_r = 64

# Alpha parameter for LoRA scaling
lora_alpha = 16

# Dropout probability for LoRA layers
lora_dropout = 0.1

################################################################################
# bitsandbytes parameters
################################################################################

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

################################################################################
# TrainingArguments parameters
################################################################################

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 1

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = False

# Batch size per GPU for training
per_device_train_batch_size = 4

# Batch size per GPU for evaluation
per_device_eval_batch_size = 4

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule
lr_scheduler_type = "cosine"

# Number of training steps (overrides num_train_epochs)
max_steps = -1

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save checkpoint every X update steps
save_steps = 0

# Log every X update steps
logging_steps = 25

################################################################################
# SFT parameters
################################################################################

# Maximum sequence length to use
max_seq_length = None

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on the GPU 0
device_map = {"": 0}

## Load Data, Set Training Args, and Train the Model

This section loads the dataset, the quantized base model, and the tokenizer, configures LoRA and the training arguments, and then runs supervised fine-tuning.



### Dataset Loading

- The training split of the instruction dataset named in `dataset_name` is loaded with `load_dataset`. Any preprocessing of the data can be done at this point.



In [None]:
# Load dataset (you can process it here)
dataset = load_dataset(dataset_name, split="train")

### Tokenizer and Model Loading with Configuration

- **Tokenizer Loading**: A tokenizer corresponding to the base model is loaded, with specific configurations to handle padding and potentially resolve issues related to mixed-precision training.
- **Model Loading**: The model is loaded with configurations tailored for efficient training, including quantization settings provided by the `BitsAndBytesConfig`. This includes options for 4-bit precision loading and nested quantization to reduce memory consumption and potentially speed up computation.
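As a rough back-of-the-envelope check on why 4-bit loading matters on a T4, the sketch below estimates the memory taken by the base model weights alone at different precisions. It assumes about 7 billion parameters and ignores quantization constants, activations, LoRA weights, and optimizer state, so real usage will be higher.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for the weights only, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 7e9  # assumption: roughly 7 billion parameters

for label, bits in [("float32", 32), ("float16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gb(n_params, bits):5.1f} GB")
```

Under these assumptions the 4-bit weights need about a quarter of the memory of the float16 weights, which leaves room on a 16 GB GPU for activations and the LoRA training state.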



### GPU Compatibility Check for bfloat16

- The script checks whether the GPU supports `bfloat16`. The check only runs when the compute dtype is `float16` and 4-bit loading is enabled. If the GPU's CUDA compute capability is 8.0 or higher, it prints a hint that training can be accelerated with `bf16=True`; it does not change any setting itself.
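The decision behind that check can be sketched as a small pure function. The helper name `suggest_compute_dtype` is hypothetical and not part of the notebook; the capability values in the usage lines (7.5 for a T4, 8.0 for an A100) are the major/minor versions those GPUs report.

```python
def suggest_compute_dtype(capability_major):
    """Suggest a 16-bit dtype name from the major CUDA compute capability.

    GPUs with compute capability 8.x or higher support bfloat16;
    older GPUs should stay with float16.
    """
    return "bfloat16" if capability_major >= 8 else "float16"


# In the notebook, the major version would come from
# torch.cuda.get_device_capability(); here it is passed in directly.
print("T4   (7.5):", suggest_compute_dtype(7))
print("A100 (8.0):", suggest_compute_dtype(8))
```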



In [None]:
# Load tokenizer and model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)


In [None]:
# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
model.config.use_cache = False
model.config.pretraining_tp = 1

In [None]:
# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training

### LoRA Configuration

- **LoRA (Low-Rank Adaptation)**: A configuration is set up for LoRA, an advanced technique for adapting large pre-trained models with minimal additional parameters. This involves specifying the rank of low-rank matrices (`lora_r`), dropout for regularization (`lora_dropout`), and a scaling factor (`lora_alpha`).
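To make the effect of `lora_r` and `lora_alpha` concrete, the sketch below counts the parameters LoRA adds and computes the scaling applied to the low-rank update. It assumes a hidden size of 4096, 32 decoder layers, and two adapted projection matrices per layer (the query and value projections); these figures describe the 7B model and PEFT's usual Llama defaults, but they are assumptions here because the notebook does not set `target_modules` explicitly.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


lora_r = 64
lora_alpha = 16

hidden_size = 4096        # assumption: Llama 2 7B hidden size
num_layers = 32           # assumption: number of decoder layers
adapted_per_layer = 2     # assumption: query and value projections

per_matrix = lora_param_count(hidden_size, hidden_size, lora_r)
total = per_matrix * adapted_per_layer * num_layers
scaling = lora_alpha / lora_r  # the low-rank update is multiplied by alpha / r

print(f"parameters per adapted matrix: {per_matrix:,}")
print(f"total trainable parameters:    {total:,}")
print(f"update scaling (alpha / r):    {scaling}")
```

Under these assumptions the adapters hold a few tens of millions of parameters, a small fraction of the 7 billion frozen weights.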



In [None]:
# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)

### Training Parameters Setup

- A comprehensive set of training parameters is defined, covering aspects such as output directory, number of epochs, batch size, optimization settings, learning rate, weight decay, and more. These parameters are crucial for controlling the fine-tuning process and ensuring efficient resource use.
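The interaction between batch size, gradient accumulation, `warmup_ratio`, and the cosine schedule can be worked out by hand. The sketch below does so under the assumption that the dataset holds 1,000 examples (suggested by its name, not verified here) and that a single GPU is used. The `cosine_lr` function is a simplified re-implementation of a linear-warmup, half-cosine decay schedule, written for illustration only.

```python
import math


def cosine_lr(step, total_steps, warmup_steps, base_lr):
    """Linear warmup from 0 to base_lr, then half-cosine decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


num_examples = 1000               # assumption: size of the training split
per_device_train_batch_size = 4
gradient_accumulation_steps = 1
num_train_epochs = 1
warmup_ratio = 0.03
learning_rate = 2e-4

effective_batch = per_device_train_batch_size * gradient_accumulation_steps
total_steps = math.ceil(num_examples / effective_batch) * num_train_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(f"optimizer steps: {total_steps}, warmup steps: {warmup_steps}")
for step in (0, warmup_steps, total_steps // 2, total_steps):
    lr = cosine_lr(step, total_steps, warmup_steps, learning_rate)
    print(f"step {step:3d}: lr = {lr:.2e}")
```

With `logging_steps = 25`, a run of this length would log the training loss roughly ten times.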



In [None]:
# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard"
)

### Supervised Fine-Tuning Configuration

- The `SFTTrainer` is configured with the model, dataset, LoRA settings, tokenizer, and training arguments. This setup specifies how the model should be fine-tuned, including details like the maximum sequence length, whether to pack multiple short examples into a single input, and reporting configurations.
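The idea behind `packing` can be illustrated with plain token lists. The sketch below is a simplified illustration and not TRL's implementation: it concatenates tokenized examples, separates them with an end-of-sequence token, and cuts the stream into fixed-length chunks so that no positions are wasted on padding. The function name `pack_sequences` and the token ids are hypothetical.

```python
def pack_sequences(tokenized_examples, eos_token_id, seq_length):
    """Concatenate examples (each followed by EOS) and split into full chunks.

    Any leftover tokens that do not fill a complete chunk are dropped.
    """
    buffer = []
    for tokens in tokenized_examples:
        buffer.extend(tokens)
        buffer.append(eos_token_id)
    return [
        buffer[i:i + seq_length]
        for i in range(0, len(buffer) - seq_length + 1, seq_length)
    ]


# Three short hypothetical examples, already tokenized; 2 stands in for EOS.
examples = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
packed = pack_sequences(examples, eos_token_id=2, seq_length=4)

for chunk in packed:
    print(chunk)
```

Note that one packed chunk can contain the end of one example and the start of the next. The notebook leaves `packing = False`, so each example stays in its own sequence.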



In [None]:
# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)

### Model Training

- The training process is initiated with the configured trainer. This step involves adjusting the model's weights based on the provided dataset and training parameters to improve its performance on the target task.


In [None]:
# Train model
trainer.train()

# Save trained model
trainer.model.save_pretrained(new_model)


### Model Saving

- After training, `trainer.model.save_pretrained(new_model)` writes the trained LoRA adapter weights to the `new_model` directory. They can be reloaded later for inference, merged into the base model, or trained further.

Overall, this script illustrates a comprehensive approach to fine-tuning a large language model using advanced techniques like LoRA for model adaptation and bitsandbytes for memory optimization. The use of specific configurations and checks ensures the training process is both efficient and effective.

In [None]:
# %load_ext tensorboard
# %tensorboard --logdir results/runs

## Inference

In [None]:
# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Run text generation pipeline with our new model
prompt = "What is a large language model?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])



<s>[INST] What is a large language model? [/INST] A large language model is a type of artificial intelligence (AI) model that is trained on a large dataset of text to generate human-like language outputs. It is typically trained on a large dataset of text, such as books, articles, or websites, and is designed to generate text that is similar to the training data.

Large language models are often used for natural language processing tasks such as text classification, sentiment analysis, and machine translation. They are also used for generating text, such as chatbots, and for generating creative content, such as poetry or stories.

Some examples of large language models include:

* BERT (Bidirectional Encoder Representations from Transformers): A popular large language model developed by Google that is trained on a large dataset of text and is designed to generate human-like language outputs.
* LLaMA (LLaMA:


In [None]:
# Empty VRAM
del model
del pipe
del trainer
import gc
gc.collect()
gc.collect()

20933

## Push to HuggingFace

In [None]:
# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"


If you run this notebook in Colab, you need to change the preferred encoding to UTF-8 before `huggingface-cli login` will run. Check the current encoding first, then override it with the code below.

In [None]:
import locale
print(locale.getpreferredencoding())

ANSI_X3.4-1968


In [None]:
import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

In [None]:
!huggingface-cli login



In [None]:
print(new_model)

llama-2-7b-miniguanaco


In [None]:
%%time

model.push_to_hub(new_model, use_temp_dir=False)
tokenizer.push_to_hub(new_model, use_temp_dir=False)


CommitInfo(commit_url='https://huggingface.co/eagle0504/llama-2-7b-miniguanaco/commit/42b4e7e343a65b5af45e956b88b309d8026fb898', commit_message='Upload tokenizer', commit_description='', oid='42b4e7e343a65b5af45e956b88b309d8026fb898', pr_url=None, pr_revision=None, pr_num=None)

### Inference with a Pipeline Directly from the Hugging Face Hub

In [None]:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="eagle0504/llama-2-7b-miniguanaco")


In [None]:
ans = pipe("What is a large language model?")
print(ans)