# Fine-tune Llama 2 in Google Colab
> 🗣️ Large Language Model Course

❤️ Created by [@maximelabonne](https://twitter.com/maximelabonne), based on Younes Belkada's [GitHub Gist](https://gist.github.com/younesbelkada/9f7f75c94bdc1981c8ca5cc937d4a4da). Special thanks to Tolga HOŞGÖR for his solution to empty the VRAM.

This notebook runs on a T4 GPU. (Last update: 24 Aug 2023)
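
As a rough sanity check on why a T4 is enough, the sketch below estimates the memory taken by the base model's weights at different precisions. The parameter count is inferred from the two fp16 checkpoint shards listed in the upload log at the end of this notebook (9.98 GB + 3.50 GB). The 16 GB figure for the T4 is an assumption, and activations, optimizer state, LoRA adapters, and quantization metadata are deliberately ignored.

```python
# Rough weight-memory estimate; ignores activations, optimizer state,
# LoRA adapters, and quantization metadata (all simplifying assumptions).
FP16_CHECKPOINT_GB = 9.98 + 3.50   # shard sizes taken from the upload log
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int4 (nf4)": 0.5}
T4_MEMORY_GB = 16.0                # nominal T4 memory, assumed

# fp16 stores 2 bytes per parameter, so the shard sizes give the count.
n_params_billion = FP16_CHECKPOINT_GB / BYTES_PER_PARAM["fp16"]


def weight_memory_gb(n_params_billion: float, bytes_per_param: float) -> float:
    """Memory needed for the weights alone, in GB (1e9 bytes)."""
    return n_params_billion * bytes_per_param


for name, nbytes in BYTES_PER_PARAM.items():
    gb = weight_memory_gb(n_params_billion, nbytes)
    fits = "fits" if gb < T4_MEMORY_GB else "does not fit"
    print(f"{name:>10}: {gb:5.2f} GB ({fits} in {T4_MEMORY_GB:.0f} GB)")
```

Under these assumptions the 4-bit weights need only a few gigabytes, which leaves headroom for activations and the paged optimizer during training.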


In [1]:
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7

In [2]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

In [3]:
# The model that you want to train from the Hugging Face hub
model_name = "NousResearch/Llama-2-7b-chat-hf"

# The instruction dataset to use
# dataset_name = "mlabonne/guanaco-llama2-1k"
dataset_name = "mbpp"

# Fine-tuned model name
new_model = "llama-2-7b-mbpp"

################################################################################
# QLoRA parameters
################################################################################

# LoRA attention dimension
lora_r = 64

# Alpha parameter for LoRA scaling
lora_alpha = 16

# Dropout probability for LoRA layers
lora_dropout = 0.1

################################################################################
# bitsandbytes parameters
################################################################################

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

################################################################################
# TrainingArguments parameters
################################################################################

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 2

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = False

# Batch size per GPU for training
per_device_train_batch_size = 4

# Batch size per GPU for evaluation
per_device_eval_batch_size = 4

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule
lr_scheduler_type = "cosine"

# Number of training steps (overrides num_train_epochs)
max_steps = -1

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save checkpoint every X update steps
save_steps = 0

# Log every X update steps
logging_steps = 25

################################################################################
# SFT parameters
################################################################################

# Maximum sequence length to use
max_seq_length = None

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on GPU 0
device_map = {"": 0}
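
The trainer in the next cell reads the column named by `dataset_text_field="text"`. Assuming the default `mbpp` configuration, where `text` holds only the task description and `code` holds the reference solution, training on `text` alone would never show the model any solutions. A minimal sketch of one way to merge both fields into the Llama 2 chat template follows. `format_mbpp_example` is a hypothetical helper that is not part of the original notebook, and the field names are assumptions about the dataset schema.

```python
def format_mbpp_example(example: dict) -> dict:
    """Merge an MBPP task and its solution into one Llama 2 chat sample.

    Assumes the record has `text` (task description) and `code`
    (reference solution) fields; adjust the keys if the schema differs.
    """
    prompt = example["text"].strip()
    answer = example["code"].strip()
    # Llama 2 chat template: <s>[INST] prompt [/INST] answer </s>
    return {"text": f"<s>[INST] {prompt} [/INST] {answer} </s>"}


# Hand-written record used only to illustrate the output format.
sample = {
    "text": "Write a function to add two numbers.",
    "code": "def add(a, b):\n    return a + b",
}
print(format_mbpp_example(sample)["text"])
```

With the `datasets` library this could be applied as `dataset = dataset.map(format_mbpp_example)` right after `load_dataset`, so that the `text` column seen by the trainer contains the merged string.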

In [4]:
# Load dataset (you can process it here)
dataset = load_dataset(dataset_name, split="train")

# Load tokenizer and model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
model.config.use_cache = False
model.config.pretraining_tp = 1

# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training

# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)

# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard"
)

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)

# Train model
trainer.train()

# Save trained model
trainer.model.save_pretrained(new_model)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 25   | 3.0893        |
| 50   | 1.8992        |
| 75   | 1.8013        |
| 100  | 1.761         |
| 125  | 1.6322        |
| 150  | 1.4842        |
| 175  | 1.5238        |
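
The step numbers in this log can be related to the configuration with a little arithmetic. The sketch below assumes a single GPU and a training split of 374 examples; that size is an assumption about the default MBPP train split and should be checked against `len(dataset)`. Rounding the warmup step count up is also an assumption about how the trainer behaves.

```python
import math


def training_schedule(num_examples, batch_size, grad_accum, epochs, warmup_ratio):
    """Return (steps_per_epoch, total_steps, warmup_steps) for one GPU."""
    # The last batch of an epoch may be partial, hence the ceiling.
    batches_per_epoch = math.ceil(num_examples / batch_size)
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)  # rounding is assumed
    return steps_per_epoch, total_steps, warmup_steps


# 374 is an assumed dataset size; the other values come from the config cell.
steps_per_epoch, total_steps, warmup_steps = training_schedule(
    num_examples=374, batch_size=4, grad_accum=1, epochs=2, warmup_ratio=0.03
)
print(steps_per_epoch, total_steps, warmup_steps)
```

Under these assumptions training runs for 188 optimizer steps, which would be consistent with step 175 being the last one logged when `logging_steps = 25`.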


In [None]:
# %load_ext tensorboard
# %tensorboard --logdir results/runs

In [8]:
# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Run text generation pipeline with our new model
prompt = "Write a program to add two numbers. Add appropriate comments where necessary"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=512)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])



<s>[INST] Write a program to add two numbers. Add appropriate comments where necessary [/INST]  Sure! Here is a program to add two numbers:
```
# Write a function to add two numbers.
def add_numbers(a, b):
    # Calculate the sum of two numbers.
    return a + b

# Example usage.
a = 5
b = 3
result = add_numbers(a, b)
print(result) # Output: 8
```
Here's a breakdown of the code:

1. `def add_numbers(a, b):` - This line defines a function called `add_numbers` that takes two arguments `a` and `b`.
2. `return a + b` - This line calculates the sum of `a` and `b` and returns it.
3. `a = 5` - This line sets the value of `a` to `5`.
4. `b = 3` - This line sets the value of `b` to `3`.
5. `result = add_numbers(a, b)` - This line calls the `add_numbers` function with `a` and `b` as arguments and stores the result in `result`.
6. `print(result)` - This line prints the result of `add_numbers` function, which is `8`.

Note: The comments in the code are optional, but they can help explain what the 
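
The prompts in these cells follow the Llama 2 chat template, and the pipeline echoes the prompt back before the answer. The helpers below are a small hypothetical sketch (neither function exists in the notebook) for building the prompt and stripping the echo from `generated_text`.

```python
INST_OPEN = "[INST]"
INST_CLOSE = "[/INST]"


def build_llama2_prompt(user_message: str) -> str:
    """Wrap a user message in the Llama 2 chat template used above."""
    return f"<s>{INST_OPEN} {user_message.strip()} {INST_CLOSE}"


def extract_response(generated_text: str) -> str:
    """Return only the text that follows the first [/INST] marker."""
    _, sep, answer = generated_text.partition(INST_CLOSE)
    # If the marker is missing, fall back to the whole string.
    return answer.strip() if sep else generated_text.strip()


prompt = build_llama2_prompt("Write a program to add two numbers.")
fake_output = prompt + "  Sure! Here is a program to add two numbers:"
print(extract_response(fake_output))
```

Because `max_length=512` counts the prompt tokens as well as the generated ones, long prompts leave less room for the answer. Passing `max_new_tokens` to the pipeline instead bounds only the generated part.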

In [5]:
# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Run text generation pipeline with our new model
prompt = "Write a program to create a calculator. Write using object oriented programming and add comments where necessary."
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=512)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])



<s>[INST] Write a program to create a calculator. Write using object oriented programming and add comments where necessary. [/INST]  Sure! Here is an example of how you could create a simple calculator using object-oriented programming:
```
# Define a class called Calculator
class Calculator:
    def __init__(self):
        # Initialize the calculator object
        self.num1 = 0
        self.num2 = 0
        self.result = 0

    # Define methods for addition and subtraction
    def add(self, num):
        self.num1 = num
        self.result = self.num1 + self.num2

    def sub(self, num):
        self.num2 = num
        self.result = self.num1 - self.num2

# Create an instance of the calculator class
my_calculator = Calculator()

# Add some numbers
my_calculator.add(5)
print(my_calculator.result) # Output: 5

# Subtract some numbers
my_calculator.sub(2)
print(my_calculator.result) # Output: 3
```
Here's a breakdown of the code:

1. `class Calculator:`: This line defines a class called

In [7]:
# Empty VRAM
del model
del pipe
del trainer
import gc
gc.collect()
gc.collect()

19965
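
The cell above drops the Python references and runs the garbage collector twice. A slightly more defensive variant is sketched below. The call to `torch.cuda.empty_cache()` is an addition that is not present in the original cell, and `free_vram` is a hypothetical helper name. Variables such as `model`, `pipe`, and `trainer` still have to be deleted by the caller first, since a helper cannot remove names from the notebook's namespace.

```python
import gc


def free_vram() -> int:
    """Run garbage collection and release PyTorch's cached CUDA memory.

    Returns the number of unreachable objects found by the collector.
    """
    collected = gc.collect()
    collected += gc.collect()  # second pass, as in the cell above
    try:
        import torch
    except ImportError:
        return collected  # PyTorch not installed; nothing more to do
    if torch.cuda.is_available():
        # Hand cached, unused blocks back to the CUDA driver.
        torch.cuda.empty_cache()
    return collected


print(free_vram())
```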

In [8]:
# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
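
Conceptually, `merge_and_unload()` folds each adapter into its base weight so that inference no longer needs the extra LoRA matrices: W' = W + (lora_alpha / r) · B · A. The toy example below illustrates this identity with plain Python lists. The matrices are made up, the adapter has rank 1 for brevity even though the scaling uses the notebook's `r = 64`, and the code illustrates the formula rather than PEFT's actual implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_scaled(w, delta, scale):
    """Return w + scale * delta, element-wise."""
    return [[wi + scale * di for wi, di in zip(wr, dr)] for wr, dr in zip(w, delta)]


lora_alpha, r = 16, 64          # values from the config cell
scaling = lora_alpha / r        # 0.25

# Toy 2x2 base weight and rank-1 adapter (made-up numbers).
W = [[1.0, 2.0], [3.0, 4.0]]
B = [[4.0], [8.0]]              # shape (out_features, rank)
A = [[1.0, -1.0]]               # shape (rank, in_features)
x = [[1.0], [2.0]]              # one input column vector

# Unmerged forward pass: base output plus scaled adapter output.
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
unmerged = add_scaled(base_out, adapter_out, scaling)

# Merged forward pass: fold the adapter into the weight once, then multiply.
W_merged = add_scaled(W, matmul(B, A), scaling)
merged = matmul(W_merged, x)

print(W_merged)
print(unmerged == merged)
```

Both paths give the same output, which is why the merged model can be saved and served as an ordinary checkpoint without the PEFT wrapper.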

In [9]:
# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Run text generation pipeline with our new model
prompt = "Write a program to create a calculator. Write using object oriented programming and add comments where necessary."
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=512)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])



<s>[INST] Write a program to create a calculator. Write using object oriented programming and add comments where necessary. [/INST]  Sure! Here is an example of how you could create a simple calculator using object-oriented programming in Python:
```
class Calculator:
    def __init__(self):
        self.num1 = 0
        self.num2 = 0
        self.result = 0

    def add(self):
        self.result = self.num1 + self.num2
        self.num1 = 0
        self.num2 = 0

    def subtract(self):
        self.result = self.num1 - self.num2
        self.num1 = 0
        self.num2 = 0

    def multiply(self):
        self.result = self.num1 * self.num2
        self.num1 = 0
        self.num2 = 0

    def divide(self):
        if self.num2 == 0:
            raise ValueError("Cannot divide by zero!")
        self.result = self.num1 / self.num2
        self.num1 = 0
        self.num2 = 0

    def get_result(self):
        return self.result

# Create an instance of the Calculator class
calculator =

In [10]:
# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Run text generation pipeline with our new model
prompt = "Write a program to train a linear regression model using sklearn. Use functions to write neat code and explain with comments. Save the model to the model folder. Use the data from data.csv file and use the DATA_USE column for train and test split. Evaluate on RMSE and R^2 values."
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=512)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

<s>[INST] Write a program to train a linear regression model using sklearn. Use functions to write neat code and explain with comments. Save the model to the model folder. Use the data from data.csv file and use the DATA_USE column for train and test split. Evaluate on RMSE and R^2 values. [/INST]  Sure, here is an example of how to train a linear regression model using scikit-learn in Python:
```
# Import necessary libraries
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import pandas as pd

# Load the data from the data.csv file
data = pd.read_csv('data.csv')

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(data.drop('DATA_USE', axis=1), data['DATA_USE'], test_size=0.2, random_state=42)

# Train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate the model on the test set
y_pred = model.predict(X_test)
rse = mean_squared_error(y_test, y_pred

In [11]:
# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Run text generation pipeline with our new model
prompt = "Write code to visualize a 4x4 grid of images. Let the 4 plots be 1,2,3,4. For plot #1, it is scatter plot with two numeric variables. For plot #2, it is a bar plot with numeric and categorical variable. For plot #3, it is line plot with time as a varaible. For #4, it is correlation plot. Use matplotlib. Make it aesthetic and appealing."
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=512)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

<s>[INST] Write code to visualize a 4x4 grid of images. Let the 4 plots be 1,2,3,4. For plot #1, it is scatter plot with two numeric variables. For plot #2, it is a bar plot with numeric and categorical variable. For plot #3, it is line plot with time as a varaible. For #4, it is correlation plot. Use matplotlib. Make it aesthetic and appealing. [/INST]  Sure, here is an example of how you could visualize a 4x4 grid of images using matplotlib.
```
import matplotlib.pyplot as plt
import numpy as np

# Generate some random data for the plots
data = np.random.rand(4, 4)

# Plot #1: Scatter plot with two numeric variables
fig, ax = plt.subplots(1, 1, figsize=(8, 6))
ax.scatter(data[:, 0], data[:, 1], c='blue', s=100)
ax.set_xlabel('Variable 1')
ax.set_ylabel('Variable 2')

# Plot #2: Bar plot with numeric and categorical variable
fig, ax = plt.subplots(1, 1, figsize=(8, 6))
ax.bar(data[:, 0], data[:, 1], bottom=data[:, 2], color='red')
ax.set_xlabel('Category')
ax.set_ylabel('Value')

# Pl

In [12]:
# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Run text generation pipeline with our new model
prompt = "Write a Python program to load a CSV file as a DataFrame using pandas."
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=512)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

<s>[INST] Write a Python program to load a CSV file as a DataFrame using pandas. [/INST]  Sure! Here is an example of how to load a CSV file as a DataFrame using pandas in Python:
```
import pandas as pd

# Load the CSV file
df = pd.read_csv('data.csv')

# Print the DataFrame
print(df)
```
In this example, `data.csv` is the name of the CSV file that you want to load. The `read_csv()` function reads the CSV file and returns a DataFrame. The `print()` function prints the DataFrame to the console.

You can also specify the path of the CSV file by using the `path` argument, like this:
```
df = pd.read_csv('/path/to/data.csv')
```
You can also use `read_csv()` function with other arguments like `header`, `na_values`, `keep`, `nrows`, ` sep`, `quote`, `linetype`, `converters`, `date_parser`, `datetime_parser`, `num_nans`, `float_nans`, `infer`, `engine`, `concatenate_columns`, `usecols`, `useindex`, `srs`, `date_format`, `timezone`, `tz_info`, `na_action`, `keep_default_na`, `na_values`, `na

In [16]:
# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Run text generation pipeline with our new model
prompt = "Create a class called Student having attributes name, rollno, age, city where student can access these properties through getters and setters methods respectively. Also add constructor method so that when user creates new instance of Student then he must pass values for above parameters otherwise default value would be assigned automatically. Finally display details about this newly created students information."
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=512)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

<s>[INST] Create a class called Student having attributes name, rollno, age, city where student can access these properties through getters and setters methods respectively. Also add constructor method so that when user creates new instance of Student then he must pass values for above parameters otherwise default value would be assigned automatically. Finally display details about this newly created students information. [/INST]  Sure, here is an example of a `Student` class with the attributes you specified, along with getters and setters methods, and a constructor method that assigns default values if they are not provided:
```
class Student:
    def __init__(self, name, rollno, age, city):
        self.name = name
        self.rollno = rollno
        self.age = age
        self.city = city

    def get_name(self):
        return self.name

    def get_rollno(self):
        return self.rollno

    def get_age(self):
        return self.age

    def get_city(self):
        return sel

In [20]:
# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Run text generation pipeline with our new model
prompt = "minimum cost path in a directed graph via given set of intermediate nodes  stores minimum - cost of path from source"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=768)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

<s>[INST] minimum cost path in a directed graph via given set of intermediate nodes  stores minimum - cost of path from source [/INST]  To find the minimum cost path in a directed graph using intermediate nodes, you can use the following approach:

1. Sort the intermediate nodes in non-decreasing order of their costs.
2. For each intermediate node, find the minimum cost path from the source node to the intermediate node using the graph's adjacency matrix.
3. For each intermediate node, find the minimum cost path from the intermediate node to the destination node using the graph's adjacency matrix.
4. Combine the minimum cost paths from steps 2 and 3 to find the minimum cost path from the source node to the destination node.

Here is a detailed explanation of the approach:

1. Sort the intermediate nodes in non-decreasing order of their costs:

Suppose the intermediate nodes are arranged in a list `intermediate_nodes`. The cost of each intermediate node is stored in a list `costs`. To s

In [21]:
# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Run text generation pipeline with our new model
prompt = "Given two lists sorted in increasing order, write a function to merge them such that the resulting list is also sorted. You can assume that no duplicate entries exist in the original lists."
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=768)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

<s>[INST] Given two lists sorted in increasing order, write a function to merge them such that the resulting list is also sorted. You can assume that no duplicate entries exist in the original lists. [/INST]  Sure! Here's a function to merge two sorted lists in increasing order:
```
def merge_sorted_lists(list1, list2):
    result = []
    while list1 and list2:
        if list1[0] <= list2[0]:
            result.append(list1.pop(0))
        else:
            result.append(list2.pop(0))
    return result
```
Here's how the function works:

1. Initialize an empty list `result` to store the merged list.
2. While both `list1` and `list2` are not empty, check the first element of each list. If `list1[0]` is less than or equal to `list2[0]`, add the first element of `list1` to `result` and remove it from `list1`. Otherwise, add the first element of `list2` to `result` and remove it from `list2`.
3. Repeat step 2 until `list1` and `list2` are empty.
4. Return the `result` list, which now con
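
The generated function above exits its loop as soon as either list is empty and returns without appending the leftover elements, and `pop(0)` also empties the caller's lists. For comparison, a corrected sketch using two indices (written for this note, not produced by the model) is:

```python
def merge_sorted_lists(list1, list2):
    """Merge two ascending lists into a new ascending list."""
    result = []
    i = j = 0
    while i < len(list1) and j < len(list2):
        if list1[i] <= list2[j]:
            result.append(list1[i])
            i += 1
        else:
            result.append(list2[j])
            j += 1
    # One list is exhausted; the rest of the other is already sorted.
    result.extend(list1[i:])
    result.extend(list2[j:])
    return result


print(merge_sorted_lists([1, 4, 9], [2, 3, 10, 11]))
```

This version leaves both inputs untouched and avoids the linear cost of each `pop(0)` call.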

In [23]:

# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Run text generation pipeline with our new model
prompt = "Count all the numbers less than 10 ^ 6 whose minimum prime factor is N  Python3 implementation of above approach"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=768)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

<s>[INST] Count all the numbers less than 10 ^ 6 whose minimum prime factor is N  Python3 implementation of above approach [/INST]  To count the numbers less than 10^6 whose minimum prime factor is N, we can use the same approach as before.

Let's define a function to count the numbers less than 10^6 whose minimum prime factor is N:
```python
def count_numbers(n):
    count = 0
    for i in range(1, 10**6):
        if i % n == 0 and i % (n-1) == 0:
            count += 1
    return count
```
In this function, we iterate from 1 to 10^6 and check if the number is divisible by N. If it is, we check if it is also divisible by N-1. If both conditions are true, we increment the count.

Here's an example usage of the function:
```python
print(count_numbers(3)) # Output: 12
```
In this example, we pass 3 as the argument to the function, and it returns 12, which is the number of numbers less than 10^6 whose minimum prime factor is 3.

Note that this approach is more efficient than the previous 
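
The generated `count_numbers` tests divisibility by `n` and `n - 1`, which is not the same as having `n` as the smallest prime factor. A sketch of a sieve-based alternative, written for this comparison rather than taken from the model output, is shown below. The function name and the `limit` parameter are illustrative.

```python
def count_with_min_prime_factor(n: int, limit: int = 10**6) -> int:
    """Count integers in [2, limit) whose smallest prime factor is n."""
    if n < 2 or n >= limit:
        return 0
    # spf[k] will hold the smallest prime factor of k (0 = not yet set).
    spf = [0] * limit
    for p in range(2, limit):
        if spf[p] == 0:  # p is prime
            for multiple in range(p, limit, p):
                if spf[multiple] == 0:
                    spf[multiple] = p
    return sum(1 for k in range(2, limit) if spf[k] == n)


# A small limit keeps the demonstration fast.
print(count_with_min_prime_factor(3, limit=100))
```

For `n = 3` and the full limit of 10^6 the answer is the number of odd multiples of 3 below one million, which is 166,667 rather than the 12 claimed in the generated text.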

In [None]:
!huggingface-cli login

model.push_to_hub(new_model, use_temp_dir=False)
tokenizer.push_to_hub(new_model, use_temp_dir=False)


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|
    
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) n
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

pytorch_model-00001-of-00002.bin:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

pytorch_model-00002-of-00002.bin:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/mlabonne/llama-2-7b-miniguanaco/commit/c81a32fd0b4d39e252326e639d63e75aa68c9a4a', commit_message='Upload tokenizer', commit_description='', oid='c81a32fd0b4d39e252326e639d63e75aa68c9a4a', pr_url=None, pr_revision=None, pr_num=None)