In this notebook, we fine-tune the OpenLLaMA model with QLoRA on the Dolly 15k dataset. Our goal is a model that can answer questions based on provided context.

In [1]:
# transformers version: 4.37.0, accelerate version: 0.27.0
# (comments must be on their own line: %pip treats a trailing '#' as a requirement)
%pip install datasets
%pip install -q -U git+https://github.com/lvwerra/trl.git
%pip install -q -U git+https://github.com/huggingface/peft.git
%pip install -q -U git+https://github.com/huggingface/transformers.git
%pip install -q -U git+https://github.com/huggingface/accelerate.git
%pip install -i https://pypi.org/simple/ bitsandbytes
%pip install sentencepiece

Looking in indexes: https://pypi.org/simple/
Note: you may need to restart the kernel to use updated packages.


In [2]:
from datasets import load_dataset, Dataset, concatenate_datasets
from tqdm.auto import tqdm
import torch
import transformers
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer, LlamaTokenizer
from trl import SFTTrainer
from IPython.display import display, Markdown
import os

The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.


# Data

In [3]:
# Load the dolly-15k dataset
dolly_dataset = load_dataset("databricks/databricks-dolly-15k")

In [4]:
dolly_dataset

DatasetDict({
    train: Dataset({
        features: ['instruction', 'context', 'response', 'category'],
        num_rows: 15011
    })
})

The dataset contains instruction, context, and response triplets.

In [5]:
dolly_dataset["train"][0]

{'instruction': 'When did Virgin Australia start operating?',
 'context': "Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.",
 'response': 'Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.',
 'category': 'closed_qa'}

Drop examples whose combined instruction, context, and response length exceeds 2200 characters.

In [6]:
def drop_long_sequences(dataset_obj):
    """
    Identifies indices of entries in a dataset that exceed a certain sequence length.

    Args:
    dataset_obj (iterable): dataset where each entry is a dictionary with keys 'instruction', 'context', and 'response'.

    Returns:
    list: Indices of dataset entries ('instruction', 'context', and 'response') that are longer than 2200 characters in total.
    """

    long_sequence_indices = []

    for i, entry in enumerate(dataset_obj):
        total_length = len(entry['instruction']) + len(entry['context']) + len(entry['response'])
        if total_length > 2200:
            long_sequence_indices.append(i)

    return long_sequence_indices

In [7]:
# Build the set once so that each membership test is a fast lookup
indices_to_drop = set(drop_long_sequences(dolly_dataset["train"]))
dolly_dataset_reduced = dolly_dataset["train"].select(
    i for i in range(len(dolly_dataset["train"])) if i not in indices_to_drop
)
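
The same length rule can also be written as a predicate, which avoids building an index list first. This is a minimal sketch under the 2200-character limit stated above; `is_short_enough` and `MAX_CHARS` are illustrative names that are not part of the original notebook, and the sample records are made up for demonstration.

```python
MAX_CHARS = 2200  # same combined-length limit as in drop_long_sequences()

def is_short_enough(entry, max_chars=MAX_CHARS):
    """Return True if instruction + context + response fit within max_chars."""
    total_length = (
        len(entry["instruction"]) + len(entry["context"]) + len(entry["response"])
    )
    return total_length <= max_chars

# Made-up records, for illustration only
samples = [
    {"instruction": "What is 2 + 2?", "context": "", "response": "4"},
    {"instruction": "Summarize.", "context": "x" * 3000, "response": "A long text."},
]

kept = [entry for entry in samples if is_short_enough(entry)]
print(f"Kept {len(kept)} of {len(samples)} records")
```

With Hugging Face `datasets`, a predicate like this could be passed to `Dataset.filter`, for example `dolly_dataset["train"].filter(is_short_enough)`, which should keep the same rows as the index-based approach.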



In [8]:
# Split the data into train (90%) and test (10%) sets using the train_test_split() method of Dataset
# A fixed seed keeps the split reproducible across runs
dataset_prepared = dolly_dataset_reduced.train_test_split(test_size=0.1, seed=42)

In [9]:
dataset_prepared

DatasetDict({
    train: Dataset({
        features: ['instruction', 'context', 'response', 'category'],
        num_rows: 12667
    })
    test: Dataset({
        features: ['instruction', 'context', 'response', 'category'],
        num_rows: 1408
    })
})

# Input Formatting

We need to properly prepare and format the dataset before presenting it to the model. The input prompts given to the model are structured using the formatting function described below.

You can come up with your own prompts. Here is a sample prompt:
```
      input_prompt = ("Below is an instruction that describes a task, paired with an input that provides further context. "
      "Write a response that appropriately completes the request.\n\n"
      f"### Instruction:\n {instruction}\n\n"
      f"### Input:\n {context}\n\n"
      f"### Response:\n {response}")

```

In [10]:
def formatting_func(example):
    """
    Formats a given example dictionary into a structured text prompt based on the presence of contextual information.

    Args:
    example (dict): A dictionary expected to contain 'instruction', 'response', and optionally 'context'.

    Returns:
    dict: A dictionary with a single key 'text' that holds the formatted instruction as its value.
    """

    # If there is a context, give "instruction", "context", and "response" as a prompt
    # Else, just give "instruction" & "response" pair
    if 'context' in example and example['context']:
        formatted_text = ("Below is an instruction that describes a task, paired with an input that provides further context. "
                          "Write a response that appropriately completes the request.\n\n"
                          f"### Instruction:\n {example['instruction']}\n\n"
                          f"### Input:\n {example['context']}\n\n"
                          f"### Response:\n {example['response']}")
    else:
        formatted_text = ("Below is an instruction that describes a task. "
                          "Write a response that appropriately completes the request.\n\n"
                          f"### Instruction:\n {example['instruction']}\n\n"
                          f"### Response:\n {example['response']}")

    return {'text': formatted_text}

In [11]:
# Format the dataset using the function above
formatted_dataset = dataset_prepared.map(formatting_func)

Map: 100%|██████████| 12667/12667 [00:01<00:00, 12158.76 examples/s]
Map: 100%|██████████| 1408/1408 [00:00<00:00, 11239.39 examples/s]


# Model

We use the `openlm-research/open_llama_7b_v2` model. Alternatively, you could use the `openlm-research/open_llama_3b` model, which has fewer parameters.

In [12]:
# Model parameters

model_id = "openlm-research/open_llama_7b_v2"

# Define a BitsAndBytesConfig object with load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, 
    bnb_4bit_use_double_quant=True, 
    bnb_4bit_quant_type="nf4", 
    bnb_4bit_compute_dtype=torch.bfloat16
)
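
To give a rough sense of why 4-bit loading matters, the arithmetic below estimates the memory taken by the weights alone. It is a back-of-the-envelope sketch, not a measurement: the parameter count of 7 billion is an assumption read off the "7b" in the model name, `weight_memory_gib` is an illustrative helper, and the estimate ignores quantization constants, activations, LoRA adapters, and optimizer state.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

NUM_PARAMS = 7e9  # assumption: "7b" taken as roughly 7 billion parameters

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)  # 16-bit weights
nf4_gib = weight_memory_gib(NUM_PARAMS, 4)    # 4-bit weights

print(f"16-bit weights: ~{fp16_gib:.1f} GiB")
print(f"4-bit weights:  ~{nf4_gib:.1f} GiB")
```

Under these assumptions the 4-bit weights take about a quarter of the 16-bit footprint, which is what makes a model of this size fit on a single consumer GPU.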

In [15]:
import bitsandbytes

In [None]:
# Load the model & tokenizer

# Load the base model with AutoModelForCausalLM.from_pretrained(), passing bnb_config as the quantization_config
base_model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)

# Load the matching tokenizer with LlamaTokenizer.from_pretrained()
tokenizer = LlamaTokenizer.from_pretrained(model_id)

# Add the padding token
tokenizer.add_special_tokens({'pad_token': '[PAD]'})

We use the Supervised Fine-tuning Trainer (`SFTTrainer`) for training. Feel free to try different values for `learning_rate` and `max_steps`.
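
When choosing `max_steps`, it helps to translate it into passes over the training data. The sketch below does that arithmetic for the values used in this notebook's training cell (`per_device_train_batch_size=1`, `gradient_accumulation_steps=4`, `max_steps=10000`) and the 12667 training examples shown earlier. It assumes a single GPU, and the variable names are illustrative.

```python
num_train_examples = 12667       # size of the train split shown earlier
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 10000
num_devices = 1                  # assumption: training on a single GPU

# Examples consumed per optimizer step
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Total examples seen over the whole run, and the equivalent number of epochs
examples_seen = max_steps * effective_batch_size
approx_epochs = examples_seen / num_train_examples

print(f"Effective batch size: {effective_batch_size}")
print(f"Examples seen: {examples_seen}")
print(f"Approximate epochs: {approx_epochs:.2f}")
```

Lowering `max_steps` shortens training proportionally, so a much smaller value may be enough for a first experiment.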

In [None]:
# Define a LoraConfig object with r=16, lora_alpha=32, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM" (you can change these hyperparameters)
qlora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")

train_dataset, eval_dataset = formatted_dataset['train'], formatted_dataset['test']

trainer = SFTTrainer(
    base_model,
    train_dataset=train_dataset, 
    eval_dataset=eval_dataset,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        learning_rate=2e-4, 
        max_steps=10000, 
        output_dir="./OpenLLama7B-Dolly15k", 
        optim="paged_adamw_8bit",
        fp16=True,
    ),
    tokenizer=tokenizer,
    peft_config=qlora_config,
    dataset_text_field="text",
    max_seq_length=512
)
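
To see what `r=16` and `lora_alpha=32` mean in practice: for a weight matrix of shape `d_out x d_in`, LoRA trains two small matrices of shapes `d_out x r` and `r x d_in`, and scales their product by `lora_alpha / r`. The sketch below counts the added parameters for one square projection matrix. The hidden size of 4096 is an assumption (typical for a 7B LLaMA-style model), and `lora_param_count` is an illustrative helper; which modules actually receive adapters depends on the PEFT defaults for the model.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    return r * (d_in + d_out)

r = 16
lora_alpha = 32
hidden_size = 4096  # assumption: hidden size of a 7B LLaMA-style model

added = lora_param_count(hidden_size, hidden_size, r)
full = hidden_size * hidden_size
scaling = lora_alpha / r

print(f"LoRA parameters per {hidden_size}x{hidden_size} matrix: {added:,}")
print(f"Fraction of the full matrix: {added / full:.4%}")
print(f"Scaling factor applied to the low-rank update: {scaling}")
```

Under these assumptions each adapted matrix adds well under 1% of its original parameter count, which is why only a small fraction of the model is trained.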

In [None]:
# Training
trainer.train()

In [None]:
# Save the model using save_model()

os.makedirs('./artifacts', exist_ok=True)  # does not fail if the directory already exists
trainer.save_model('./artifacts')

In [None]:
# Load the saved model & tokenizer

# Load lora_config from where you saved the checkpoint
lora_config = LoraConfig.from_pretrained("./artifacts")

# Specify the BitsAndBytesConfig (use load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16)
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16)

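# Load the quantized base model
model = AutoModelForCausalLM.from_pretrained(
    lora_config.base_model_name_or_path,
    quantization_config=bnb_config,
    device_map={"": 0})

# Attach the trained LoRA adapter weights saved in ./artifacts
# (get_peft_model() would create a fresh, untrained adapter instead)
from peft import PeftModel
model = PeftModel.from_pretrained(model, "./artifacts")
model.eval()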

# Load the tokenizer from the checkpoint
tokenizer = AutoTokenizer.from_pretrained("./artifacts")

# Inference

Before passing the instruction and context to the model, we build the prompt inside the `make_inference()` function, then tokenize it and feed it to the model. The prompt built there should follow the same format as the one created by `formatting_func()`, except that the response is left empty for the model to complete.

You can come up with your own prompts. Here is a sample prompt:
```
      input_prompt = ("Below is an instruction that describes a task, paired with an input that provides further context. "
      "Write a response that appropriately completes the request.\n\n"
      f"### Instruction:\n {instruction}\n\n"
      f"### Input:\n {context}\n\n"
      f"### Response:\n")

```

In [None]:
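def make_inference(instruction, context=None):
    """
    Generates a response from the fine-tuned model for the given instruction and optional context.

    Args:
    instruction (str): The instruction for the task.
    context (str, optional): Additional context for the task. Defaults to None.
    """
    # Build the prompt in the same format as formatting_func(), leaving the response empty
    if context:
        prompt = ("Below is an instruction that describes a task, paired with an input that provides further context. "
                  "Write a response that appropriately completes the request.\n\n"
                  f"### Instruction:\n {instruction}\n\n"
                  f"### Input:\n {context}\n\n"
                  "### Response:\n")
    else:
        prompt = ("Below is an instruction that describes a task. "
                  "Write a response that appropriately completes the request.\n\n"
                  f"### Instruction:\n {instruction}\n\n"
                  "### Response:\n")

    inputs = tokenizer(prompt, return_tensors="pt", return_token_type_ids=False).to("cuda:0")
    outputs = model.generate(**inputs, max_new_tokens=50)
    display(Markdown(tokenizer.decode(outputs[0], skip_special_tokens=True)))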
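
When the prompt follows the template described above, the decoded output contains the prompt followed by the model's answer. If only the answer is wanted, it can be cut out after the `### Response:` marker. This is a minimal sketch; `extract_response` and `RESPONSE_MARKER` are illustrative names that are not part of the notebook, and the decoded string is a made-up example.

```python
RESPONSE_MARKER = "### Response:"

def extract_response(generated_text, marker=RESPONSE_MARKER):
    """Return the text after the first occurrence of marker, or the full text if it is absent."""
    _, found, tail = generated_text.partition(marker)
    return tail.strip() if found else generated_text.strip()

# Made-up decoded output, for illustration only
decoded = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n Name a primary color.\n\n"
    "### Response:\n Blue."
)
print(extract_response(decoded))
```

Splitting at the first marker keeps any further text the model generates after its answer, so a smaller `max_new_tokens` or an explicit stop condition may still be useful.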

# Sample Inferences

In [None]:
make_inference("Identify the odd one out and explain your choice.", "Orange, Green, Airplane.")

In [None]:
make_inference("Explain in simple terms how the attention mechanism of a transformer model works")