# **Fine-Tuning Llama 2-7b on Databricks-Dolly-15k Dataset**

This blog will guide you through the process of fine-tuning the llama-2-7b-chat-hf model on the databricks-dolly-15k dataset using Google Colab.

1. **Model:** llama-2-7b-chat-hf
2. **Dataset:** https://huggingface.co/datasets/databricks/databricks-dolly-15k


# **Dataset Overview**

The databricks-dolly-15k dataset, hosted on Hugging Face, consists of more than 15,000 instruction-following records written by Databricks employees across several behavioral categories. It is released under the Creative Commons Attribution-ShareAlike 3.0 Unported License and is primarily intended for training large language models and for synthetic data generation.

We will fine-tune a Llama 2 model with 7 billion parameters on an A100 GPU with high RAM using Google Colab. To decrease the VRAM usage, we will fine-tune the model in 4-bit precision.
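To see why 4-bit loading matters, here is a rough back-of-the-envelope estimate of the memory needed just to hold the weights. It is only a sketch: it assumes a nominal 7 billion parameters and ignores activations, optimizer state, the LoRA adapters, and the small overhead of quantization constants. The helper name `weight_memory_gb` is ours, not part of any library.

```python
# Rough weight-memory estimate; ignores activations, optimizer state,
# LoRA adapters, and the overhead of quantization constants.
N_PARAMS = 7_000_000_000  # nominal "7B"; the exact parameter count is slightly lower


def weight_memory_gb(n_params: int, bits_per_param: float) -> float:
    """Return the memory needed to store the weights, in gigabytes (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


for label, bits in [("fp32", 32), ("fp16", 16), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gb(N_PARAMS, bits):5.1f} GB")
```

Under these assumptions the weights shrink from about 14 GB in fp16 to about 3.5 GB in 4-bit, which leaves far more room on the GPU for activations and the optimizer.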

For this, we will leverage the Hugging Face ecosystem with the transformers, accelerate, peft, trl, and bitsandbytes libraries. First, we install and load these libraries.

In [1]:
!pip install -q -U accelerate peft bitsandbytes transformers trl


# **Loading Necessary Libraries**

In [2]:
# Standard libraries
import os

# PyTorch
import torch

# Huggingface Transformers
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging
)

from datasets import load_dataset
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

# Optional


QLoRA will employ a rank value of 64, coupled with a scaling parameter set to 16. Additionally, we’ll load the Llama 2 model directly in 4-bit precision using the NF4 type and train it for 1 epoch.


In [3]:
# Model and Dataset Configurations
# The model that you want to train from the Hugging Face hub
model_name = "NousResearch/llama-2-7b-chat-hf"
# The instruction dataset to use
dataset_name = "databricks/databricks-dolly-15k"
# Fine-tuned model name
new_model = "llama-2-7b-databricks-dolly-15k"

# QLoRA Parameters
lora_r = 64  # LoRA attention dimension
lora_alpha = 16  # Alpha parameter for LoRA scaling
lora_dropout = 0.1  # Dropout probability for LoRA layers

# bitsandbytes Parameters
use_4bit = True  # Activate 4-bit precision base model loading
bnb_4bit_compute_dtype = "float16"  # Compute dtype for 4-bit base models
bnb_4bit_quant_type = "nf4"  # Quantization type (fp4 or nf4)
use_nested_quant = False  # Activate nested quantization for 4-bit base models (double quantization)
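As a rough illustration of what a rank of 64 and an alpha of 16 mean in practice, the sketch below computes the LoRA scaling factor and an estimate of the number of trainable adapter parameters. The model shape (32 decoder layers, hidden size 4096) and the choice of adapted matrices (the `q_proj` and `v_proj` attention projections) are assumptions on our part; the notebook does not set target modules explicitly, so the real count depends on what peft selects.

```python
lora_r, lora_alpha = 64, 16  # values from the configuration cell

# Assumed Llama 2 7B shape: 32 decoder layers with hidden size 4096.
# Assumed adapter targets: q_proj and v_proj, each a 4096 x 4096 matrix.
n_layers, hidden = 32, 4096
targets_per_layer = 2

# The low-rank update B @ A is multiplied by alpha / r before being added.
scaling = lora_alpha / lora_r

# Each adapted d_out x d_in matrix gains A (r x d_in) and B (d_out x r).
params_per_matrix = lora_r * (hidden + hidden)
trainable = n_layers * targets_per_layer * params_per_matrix

print(f"scaling = {scaling}")
print(f"trainable LoRA parameters = {trainable:,}")
```

With these assumptions the adapters add roughly 33.5 million trainable parameters, a small fraction of the 7 billion frozen base weights.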

1. Dataset Loading: Initially, we load the predefined dataset. Typically, this step involves various data preprocessing tasks, such as prompt reformatting, text filtering, and dataset combination.

2. Quantization Configuration: We configure the bitsandbytes module to enable 4-bit quantization.

3. Model Initialization: The Llama 2 model is loaded with 4-bit precision, and it is equipped with the corresponding tokenizer. This initialization process is performed on a GPU.

4. Configuration Preparation: We prepare configurations for QLoRA and establish standard training parameters.

5. Training Setup: All the components required for the SFTTrainer are assembled. Once this setup is complete, the training process can commence.

# **Loading Dataset**

In [4]:
# Load dataset
dataset = load_dataset(dataset_name, split="train")



In [5]:
dataset

Dataset({
    features: ['instruction', 'context', 'response', 'category'],
    num_rows: 15011
})

In [6]:
dataset = dataset.remove_columns(['context', 'response', 'category'])
dataset

Dataset({
    features: ['instruction'],
    num_rows: 15011
})

In [7]:
dataset = dataset.rename_column("instruction", "text")
dataset

Dataset({
    features: ['text'],
    num_rows: 15011
})

In [10]:
# Extracting a data type attribute from the torch library based on the string value of bnb_4bit_compute_dtype
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

In [11]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)
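To give some intuition for what 4-bit quantization does, here is a simplified sketch of blockwise absmax quantization. It is not the NF4 scheme used above: NF4 places its 16 levels according to the quantiles of a normal distribution, whereas this toy version uses evenly spaced levels. The function names and the sample block are hypothetical.

```python
def quantize_block(values, bits=4):
    """Map a block of floats to signed integer codes plus one scale (the absmax)."""
    max_code = 2 ** (bits - 1) - 1               # 7 for 4 bits, so codes lie in [-7, 7]
    absmax = max(abs(v) for v in values) or 1.0  # fall back to 1.0 for an all-zero block
    codes = [round(v / absmax * max_code) for v in values]
    return codes, absmax


def dequantize_block(codes, absmax, bits=4):
    """Recover approximate floats from the integer codes and the stored scale."""
    max_code = 2 ** (bits - 1) - 1
    return [c / max_code * absmax for c in codes]


block = [0.12, -0.50, 0.33, 0.90, -0.07, 0.00, 0.61, -0.88]  # made-up weights
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
worst = max(abs(a - b) for a, b in zip(block, restored))

print(codes)
print(f"largest round-trip error: {worst:.4f}")
```

Each block stores only small integer codes and a single scale, which is where the memory saving comes from; the price is the rounding error shown in the last line.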

In [12]:
# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)


Your GPU supports bfloat16: accelerate training with bf16=True
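The check above only prints a hint. As a minimal sketch of how that hint could be turned into settings, the hypothetical helper below picks mixed-precision flags from the CUDA compute capability. Note that this is not what the notebook does: the training cell further down leaves both `fp16` and `bf16` set to `False`.

```python
def select_precision(cuda_major):
    """Pick TrainingArguments-style precision flags from the compute capability major version.

    Pass None when no CUDA device is available.
    """
    if cuda_major is None:
        return {"fp16": False, "bf16": False}  # full precision on CPU
    if cuda_major >= 8:
        return {"fp16": False, "bf16": True}   # Ampere or newer, e.g. A100 (8.0)
    return {"fp16": True, "bf16": False}       # older GPUs, e.g. T4 (7.5)


print(select_precision(8))
print(select_precision(7))
```

In the notebook, the major version would come from `torch.cuda.get_device_capability()`, as in the cell above.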


In [13]:
# Load the entire model on GPU 0
device_map = {"": 0}

# **Loading Base Model**

In [14]:
# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
model.config.use_cache = False
model.config.pretraining_tp = 1





# **Loading LLaMA Tokenizer**

In [15]:
# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training



# **Loading LoRA Configuration**

In [16]:
# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)


In [21]:
# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 1

# Batch size per GPU for training
per_device_train_batch_size = 16

# Batch size per GPU for evaluation
per_device_eval_batch_size = 16

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = False

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Log every X update steps; 939 = ceil(15011 / 16) is the number of steps in one epoch
logging_steps = 939 // 4

# Save a checkpoint every X update steps
save_steps = 939 // 4

# Optimizer to use
optim = "paged_adamw_32bit"

# Maximum sequence length to use
max_seq_length = None

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False
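The value 939 used for `logging_steps` and `save_steps` is the number of optimizer steps in one epoch. The short sketch below reproduces that arithmetic, assuming a single GPU, no gradient accumulation, and a final partial batch that is kept rather than dropped.

```python
import math

num_rows = 15011                  # size of the train split
per_device_train_batch_size = 16
num_train_epochs = 1

# The last, smaller batch still counts as a step, hence the ceiling.
steps_per_epoch = math.ceil(num_rows / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs

logging_steps = steps_per_epoch // 4
logged_at = list(range(logging_steps, total_steps + 1, logging_steps))

print(steps_per_epoch)  # 939
print(logging_steps)    # 234
print(logged_at)        # [234, 468, 702, 936]
```

Deriving the value from the dataset size and batch size in this way avoids having to update a hard-coded number when either of them changes.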

# **Setting Training Params**

In [22]:
# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=fp16,
    bf16=bf16,
    report_to="tensorboard"
)

In [23]:
# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)



# **Training the Model**

In [24]:
trainer.train()

| Step | Training Loss |
|------|---------------|
| 234  | 2.4942        |
| 468  | 2.3168        |
| 702  | 2.2907        |
| 936  | 2.2773        |


TrainOutput(global_step=939, training_loss=2.3431943806088644, metrics={'train_runtime': 2651.0474, 'train_samples_per_second': 5.662, 'train_steps_per_second': 0.354, 'total_flos': 2.057599472431104e+16, 'train_loss': 2.3431943806088644, 'epoch': 1.0})
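The numbers in `TrainOutput` can be cross-checked with a few lines of arithmetic. The sketch below recomputes the throughput figures from the runtime and also converts the average training loss into a perplexity. The perplexity is only a rough indicator here, since it is computed on the training data rather than on a held-out set.

```python
import math

# Values copied from the TrainOutput above.
train_runtime = 2651.0474         # seconds
global_step = 939
train_loss = 2.3431943806088644
num_rows = 15011                  # examples seen in one epoch

samples_per_second = num_rows / train_runtime
steps_per_second = global_step / train_runtime
perplexity = math.exp(train_loss)  # exp of the mean cross-entropy loss

print(round(samples_per_second, 3))  # 5.662
print(round(steps_per_second, 3))    # 0.354
print(f"runtime: {train_runtime / 60:.1f} minutes")
print(f"training perplexity: {perplexity:.1f}")
```

The recomputed throughput matches the reported `train_samples_per_second` and `train_steps_per_second`, and the run took a little over 44 minutes.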

# **Saving the Model**

In [41]:
# Save trained model
trainer.model.save_pretrained(new_model)

In [28]:
!pip install tensorboard




In [29]:
%load_ext tensorboard


The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


In [31]:
%tensorboard --logdir results/runs

Reusing TensorBoard on port 6006 (pid 18967), started 0:05:56 ago. (Use '!kill 18967' to kill it.)


Note: The input needs to be formatted to match Llama 2’s prompt template.
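As a minimal sketch of that template, the hypothetical helper below wraps an instruction in the `[INST] ... [/INST]` markers used in the test cells that follow. When a response is supplied, it also appends the answer and the end-of-sequence marker, which is one common way to lay out a complete training example. How the Dolly `context` field is folded into the instruction is our own assumption, and the sample record is made up.

```python
def format_llama2(instruction, response=None, context=""):
    """Build a Llama 2 chat-style string.

    Without a response this returns an inference prompt; with a response it
    returns a full training example. Appending the context after the
    instruction is an assumption, not something the template prescribes.
    """
    user = instruction.strip()
    if context.strip():
        user = f"{user}\n\n{context.strip()}"
    prompt = f"<s>[INST] {user} [/INST]"
    if response is None:
        return prompt
    return f"{prompt} {response.strip()} </s>"


# Inference prompt, identical to the one used in the test cells
print(format_llama2("How large is a large language model?"))

# Training-style example built from a made-up record
print(format_llama2("Name a primary color.", response="Red."))
```

The first call reproduces the string passed to the pipeline in the cells below; the second shows what a training example would look like if the `response` column were kept.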

# **Testing Model's Performance on General Questions**

In [32]:
# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Run text generation pipeline with our new model
prompt = "How large is a large language model?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])



<s>[INST] How large is a large language model? [/INST]  The size of a large language model can vary greatly depending on the specific model architecture, the amount of training data, and the computational resources used to train the model.

In general, large language models are trained on vast amounts of text data and use complex neural network architectures to learn the patterns and structures of language. These models can be quite large in terms of the number of parameters, the amount of memory required to store the model weights, and the computational resources required to train the model.

Some examples of large language models include:

* BERT (Bidirectional Encoder Representations from Transformers): This is a popular language model developed by Google that has been trained on a large corpus of text data and has been shown to achieve state-of-the-art results on a wide range of natural language processing tasks. BERT has approximately 340 million


# **Testing on Questions from BigBench-Hard**

https://github.com/suzgunmirac/BIG-Bench-Hard

In [36]:
# Run text generation pipeline with our new model
prompt = """Q: Find a movie similar to Twister, The Silence of the Lambs, Independence Day, Braveheart:
Options:
(A) They Shoot Horses
(B) Don't They
(C) Forrest Gump
(D) The Salton Sea
(E) Extreme Days"""
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

<s>[INST] Q: Find a movie similar to Twister, The Silence of the Lambs, Independence Day, Braveheart:
Options:
(A) They Shoot Horses
(B) Don't They
(C) Forrest Gump
(D) The Salton Sea
(E) Extreme Days [/INST]  Based on the given options, the movie similar to Twister, The Silence of the Lambs, Independence Day, Braveheart is (D) The Salton Sea.

The Salton Sea is a 2002 American drama film directed by D.J. Caruso and starring Vincent D'Onofrio, Nicole Kidman, and Matt Dillon. The film is about a group of people who are connected to a mysterious storm that hits a small town in California, and the aftermath of the storm.

The other options do not match


### **Remark**:

Incorrect

### **Original Answer**

Let's think step by step.
- Twister (action, adventure, thriller; 1996)
- The Silence of the Lambs (crime, drama, thriller; 1991)
- Independence Day (action, science-fiction, drama; 1996)
- Braveheart (biography, drama, epic; 1995)
These are all famous Hollywood movies produced around the 1990s. Amongst all the options, the only movie similar to these ones seems to be Forrest Gump (comedy, drama, romance; 1994). So the answer is (C).

In [39]:
# Run text generation pipeline with our new model
prompt = """Today, Emily went to the museum. Between what times could they have gone?
We know that:
Emily woke up at 1pm.
Elizabeth saw Emily reading at the library from 2pm to 4pm.
Jessica saw Emily watching a movie at the theater from 4pm to 5pm.
Leslie saw Emily waiting at the airport from 5pm to 6pm.
William saw Emily buying clothes at the mall from 6pm to 7pm.
The museum was closed after 7pm.
Between what times could Emily have gone to the museum?
Options:
(A) 1pm to 2pm
(B) 6pm to 7pm
(C) 5pm to 6pm
(D) 2pm to 4pm"""
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=700)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

<s>[INST] Today, Emily went to the museum. Between what times could they have gone?
We know that:
Emily woke up at 1pm.
Elizabeth saw Emily reading at the library from 2pm to 4pm.
Jessica saw Emily watching a movie at the theater from 4pm to 5pm.
Leslie saw Emily waiting at the airport from 5pm to 6pm.
William saw Emily buying clothes at the mall from 6pm to 7pm.
The museum was closed after 7pm.
Between what times could Emily have gone to the museum?
Options:
(A) 1pm to 2pm
(B) 6pm to 7pm
(C) 5pm to 6pm
(D) 2pm to 4pm [/INST]  Based on the information provided, Emily could have gone to the museum between 2pm to 4pm, and 6pm to 7pm. Therefore, the correct answer is (B) 6pm to 7pm and (C) 5pm to 6pm.


### **Remark**:

Incorrect

### **Original Answer**

Let's think step by step.
- Wake-up time: 1pm.
- 1pm-2pm: free.
- 2pm-4pm: reading at the library.
- 4pm-5pm: watching a movie at the theater.
- 5pm-6pm: waiting at the airport.
- 6pm-7pm: buying clothes at the mall.

The museum closure time: 7pm.
The only time when Emily could have gone to the museum was 1pm to 2pm. So the answer is (A).
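The step-by-step reasoning above is essentially an interval search: start at the wake-up time, walk through the sightings in order, and report any gap before the museum closes. The sketch below does this with hours on a 24-hour clock; the function name `free_slots` is ours.

```python
def free_slots(wake_hour, close_hour, sightings):
    """Return the (start, end) gaps between wake-up and closing time not covered by a sighting."""
    slots = []
    cursor = wake_hour
    for start, end in sorted(sightings):
        if start > cursor:
            slots.append((cursor, start))  # nobody saw Emily during this gap
        cursor = max(cursor, end)
    if cursor < close_hour:
        slots.append((cursor, close_hour))
    return slots


# Library 2-4pm, theater 4-5pm, airport 5-6pm, mall 6-7pm
sightings = [(14, 16), (16, 17), (17, 18), (18, 19)]
print(free_slots(wake_hour=13, close_hour=19, sightings=sightings))  # [(13, 14)]
```

The only gap is 13:00 to 14:00, that is 1pm to 2pm, which corresponds to option (A).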

### **Note**

I also tested the model on boolean-expression questions from BigBench-Hard, but it did not perform well. I then focused on questions that were more closely aligned with the dataset the model was trained on.
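The remarks above were made by reading the outputs by hand. For a larger number of multiple-choice questions, a simple scoring rule can be sketched as follows. This is a hypothetical and deliberately strict helper: it looks only at the text after `[/INST]` and counts an answer as correct when exactly one option letter is mentioned and it matches the reference. An answer that explains why other options are wrong would be penalized by this rule.

```python
import re


def extract_choices(generated_text):
    """Return the set of option letters, such as 'A', found after the [/INST] marker."""
    completion = generated_text.split("[/INST]")[-1]
    return set(re.findall(r"\(([A-Z])\)", completion))


def is_correct(generated_text, target):
    """Strict rule: the completion must mention the target letter and no other."""
    return extract_choices(generated_text) == {target}


museum_output = (
    "<s>[INST] ... Options:\n(A) 1pm to 2pm\n(B) 6pm to 7pm [/INST]  "
    "Therefore, the correct answer is (B) 6pm to 7pm and (C) 5pm to 6pm."
)
print(sorted(extract_choices(museum_output)))  # ['B', 'C']
print(is_correct(museum_output, "A"))          # False
```

The shortened `museum_output` string is adapted from the museum example above; because the echoed prompt is cut off at `[/INST]`, the option list in the question does not affect the result.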

To store the new llama-2-7b-databricks-dolly-15k model, merge the LoRA weights with the base model. This requires reloading the base model in FP16 and using the peft library to merge the adapter into it.

In [42]:
# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
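Conceptually, merging folds the scaled low-rank update into each adapted weight matrix, W_merged = W + (alpha / r) * B @ A, so that the adapter is no longer needed at inference time. The toy sketch below illustrates this with plain Python lists; it is not how peft implements `merge_and_unload`, and the 2 x 2 matrices are made-up values.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(W, A, B, alpha):
    """Return W + (alpha / r) * B @ A, where r is the number of rows of A."""
    scale = alpha / len(A)
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


W = [[1.0, 2.0], [3.0, 4.0]]   # frozen base weight (d_out x d_in)
A = [[1.0, 0.0], [0.0, 2.0]]   # LoRA A (r x d_in), here r = 2
B = [[4.0, 0.0], [0.0, 2.0]]   # LoRA B (d_out x r)
alpha = 0.5                    # alpha / r = 0.25, the same ratio as 16 / 64

merged = merge_lora(W, A, B, alpha)
x = [[1.0], [1.0]]             # a column vector input

# Base output plus the scaled adapter output ...
adapter_out = matmul(B, matmul(A, x))
separate = [[b[0] + 0.25 * a[0]] for b, a in zip(matmul(W, x), adapter_out)]

# ... equals the output of the merged matrix.
print(merged)
print(separate == matmul(merged, x))
```

The final comparison prints `True`: applying the base weight and the adapter separately gives the same result as applying the merged weight once.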


In [44]:
from huggingface_hub import notebook_login
notebook_login()


In [45]:
model.push_to_hub(new_model, use_temp_dir=False)
tokenizer.push_to_hub(new_model, use_temp_dir=False)


Thrown during validation:
`do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.



CommitInfo(commit_url='https://huggingface.co/DrishtiSharma/llama-2-7b-databricks-dolly-15k/commit/b2f783af2c3343d1480be3bd05fae4d4f98924d3', commit_message='Upload tokenizer', commit_description='', oid='b2f783af2c3343d1480be3bd05fae4d4f98924d3', pr_url=None, pr_revision=None, pr_num=None)

# **Conclusion**
In this notebook, the Llama 2-7b chat model was fine-tuned on the databricks-dolly-15k dataset with QLoRA for one epoch. The fine-tuned model answered a general question fluently but got both BigBench-Hard examples wrong, so further work is needed to improve its accuracy. Even so, it offers promising potential for integration into platforms such as LangChain as an alternative to the OpenAI API.