*Presented By:*

**Shivani Tyagi**
---

# Fine-tuning Large Language Models (LLMs) Using QLoRA

## Introduction
Fine-tuning Large Language Models (LLMs) is a crucial step in adapting these powerful models to specific tasks or domains. In this seminar code tutorial, we will explore how to perform fine-tuning using QLoRA (Quantized LoRA), a memory-efficient iteration of LoRA (Low-Rank Adaptation), for parameter-efficient fine-tuning.

## What is QLoRA?
QLoRA builds on the fine-tuning method introduced by LoRA. It reduces memory usage further by quantizing the frozen weights of the pretrained base model to lower precision, typically 4-bit, while the small LoRA adapters are trained on top of it in higher precision. Despite the reduced precision of the base weights, QLoRA maintains a level of effectiveness comparable to LoRA.

## Benefits of QLoRA

1. **Memory Efficiency:** QLoRA significantly reduces the GPU memory needed for fine-tuning by storing the frozen base model weights in lower precision.
2. **Storage Requirements:** Only the small LoRA adapter has to be saved after training, so the storage requirements for the fine-tuned artifacts are far lower than for a full copy of the model.
3. **Comparable Performance:** Despite the reduction in precision, QLoRA maintains a similar level of effectiveness as LoRA, making it an attractive option for memory-constrained environments.
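
To make the memory argument concrete, here is a rough back-of-the-envelope sketch. It assumes a 2.7B-parameter model (the size of the Phi-2 model used later in this tutorial) and counts only the stored weights; quantization constants, activations, optimizer state, and any layers left unquantized are ignored, so real numbers will differ. The helper name `weight_memory_gb` is made up for this illustration.

```python
# Back-of-the-envelope weight memory for a 2.7B-parameter model (illustrative only).
N_PARAMS = 2.7e9  # approximate parameter count, an assumption for this sketch


def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Return the memory needed to store the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


fp16_gb = weight_memory_gb(N_PARAMS, 16)
int4_gb = weight_memory_gb(N_PARAMS, 4)
print(f"16-bit weights: {fp16_gb:.2f} GB")
print(f" 4-bit weights: {int4_gb:.2f} GB")
```

The ratio between the two figures is 4, which is where the "about a quarter of the memory" rule of thumb for 4-bit loading comes from.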

## Steps for Fine-tuning LLMs Using QLoRA

1. **Select a Pre-trained Model:** Choose a suitable pre-trained LLM that aligns with the desired architecture and functionalities for your task.
    
2. **Gather Relevant Dataset:** Collect a dataset that is labeled or structured in a way that the model can learn from it and is relevant to your task.
    
3. **Preprocess Dataset:** Preprocess the dataset by cleaning it, splitting it into training, validation, and test sets, and ensuring compatibility with the chosen pre-trained LLM.
    
4. **Fine-tuning with QLoRA:** Fine-tune the selected pre-trained LLM using QLoRA. During this process, the base model weights stay frozen in quantized, lower-precision form while the LoRA adapters are trained, reducing memory requirements while maintaining performance.
    
5. **Task-specific Adaptation:** Adjust the model's parameters based on the new dataset, allowing it to better understand and generate content relevant to the specific task.
    
6. **Evaluation:** Evaluate the fine-tuned model on relevant metrics to assess its performance and effectiveness for the task at hand.

## Conclusion
QLoRA offers a memory-efficient approach to fine-tuning Large Language Models, making them more accessible and practical for deployment in memory-constrained environments. By quantizing the base model weights to lower precision and training only small LoRA adapters, QLoRA strikes a balance between memory efficiency and model performance, enabling efficient adaptation of LLMs to various tasks and domains.

---

**In this notebook and tutorial, we will fine-tune [Microsoft's Phi-2](https://huggingface.co/microsoft/phi-2), a relatively small 2.7B-parameter model which has "showcased a nearly state-of-the-art performance among models with less than 13 billion parameters".**

**Here we will use [QLoRA (Efficient Finetuning of Quantized LLMs)](https://arxiv.org/abs/2305.14314), a highly efficient fine-tuning technique that involves quantizing a pretrained LLM to just 4 bits and adding small “Low-Rank Adapters”. This unique approach allows for fine-tuning LLMs using just a single GPU! This technique is supported by the [PEFT library](https://huggingface.co/docs/peft/index).**



# Table of Contents

- [1 - Install all the required libraries](#1)
- [2 - Loading dataset](#2)
- [3 - Create bitsandbytes configuration](#3)
- [4 - Load Base Model](#4)
- [5 - Tokenization](#5)
- [6 - Test the Model with Zero Shot Inferencing](#6)
- [7 - Pre-processing dataset](#7)
- [8 - Setup the PEFT/QLoRA model for Fine-Tuning](#8)
- [9 - Train PEFT Adapter](#9)
- [10 - Evaluate the Model Qualitatively (Human Evaluation)](#10)
- [11 - Evaluate the Model Quantitatively (with ROUGE Metric)](#11)

#### Before we begin: A note on OOM errors

If you get an error like `OutOfMemoryError: CUDA out of memory`, tweak your parameters to make training less memory-intensive (for example, a smaller batch size or a shorter maximum sequence length).
To retry after you tweak your parameters, open a Terminal ('Launcher' or '+' in the nav bar above -> Other -> Terminal) and run the command `nvidia-smi`. Then find the process ID (`PID`) under `Processes` and run the command `kill [PID]`. You will need to restart your notebook from the beginning.

<a name='1'></a>
#### 1. Installing and Importing all the required libraries

In [None]:
!pip install -q -U bitsandbytes transformers peft accelerate datasets scipy einops evaluate trl

<a name='1.1'></a>
#### 1.1 Weights & Biases

This code cell disables Weights and Biases (W&B) integration by setting the environment variable `WANDB_DISABLED` to `"true"` using the `os.environ` module. 

### Purpose
Weights and Biases is a popular tool for experiment tracking and visualization in machine learning projects. Disabling W&B integration may be necessary in certain cases, such as when running experiments in environments where W&B is not available or when troubleshooting issues related to W&B integration.

### How to Use
To use this code cell, simply run it in your Python environment before running any code that involves Weights and Biases. This will ensure that W&B integration is disabled for subsequent code execution.

### Note
Make sure to review your project's requirements and dependencies before disabling Weights and Biases integration. Disabling W&B may affect experiment tracking and visualization capabilities if your project relies on W&B for these purposes.


In [3]:
import os
# disable Weights and Biases
os.environ['WANDB_DISABLED']="true"

In [None]:
!pip install pyarrow==11.0.0

In [2]:
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    Trainer,
    GenerationConfig
)
from tqdm import tqdm
from trl import SFTTrainer
import torch
import time
import pandas as pd
import numpy as np
from huggingface_hub import interpreter_login

interpreter_login()




    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .


Token:  ·····································
Add token as git credential? (Y/n)  Y


Token is valid (permission: read).
[1m[31mCannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your terminal in case you want to set the 'store' credential helper as default.

git config --global credential.helper store

Read https://git-scm.com/book/en/v2/Git-Tools-Credential-Storage for more details.[0m
Token has not been saved to git credential helper.
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [3]:
from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo

def print_gpu_utilization():
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used//1024**2} MB.")

<a name='2'></a>
#### 2. Loading dataset

In [4]:
# https://huggingface.co/datasets/neil-code/dialogsum-test
huggingface_dataset_name = "neil-code/dialogsum-test"
dataset = load_dataset(huggingface_dataset_name)
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1999
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 499
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 499
    })
})

This is what the data looks like:

In [5]:
dataset['train'][0]

{'id': 'train_0',
 'dialogue': "#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today?\n#Person2#: I found it would be a good idea to get a check-up.\n#Person1#: Yes, well, you haven't had one for 5 years. You should have one every year.\n#Person2#: I know. I figure as long as there is nothing wrong, why go see the doctor?\n#Person1#: Well, the best way to avoid serious illnesses is to find out about them early. So try to come at least once a year for your own good.\n#Person2#: Ok.\n#Person1#: Let me see here. Your eyes and ears look fine. Take a deep breath, please. Do you smoke, Mr. Smith?\n#Person2#: Yes.\n#Person1#: Smoking is the leading cause of lung cancer and heart disease, you know. You really should quit.\n#Person2#: I've tried hundreds of times, but I just can't seem to kick the habit.\n#Person1#: Well, we have classes and some medications that might help. I'll give you more information before you leave.\n#Person2#: Ok, thanks doctor.",
 'summary': "Mr. Smith'

<a name='3'></a>
#### 3. Create bitsandbytes configuration

**To load the model, we need a configuration class that specifies how we want the quantization to be performed. We’ll achieve this with `BitsAndBytesConfig` from the Transformers library. This will allow us to load our LLM in 4 bits. This way, the weights need roughly a quarter of the memory that 16-bit weights would, and we can load the model on smaller devices.**

In [6]:
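compute_dtype = torch.float16  # dtype used for computation on the dequantized weights
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store the base model weights in 4-bit
    bnb_4bit_quant_type='nf4',             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=False,       # do not quantize the quantization constants
)
device_map = {"": 0}  # place the whole model on GPU 0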
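
To give an intuition for what "4-bit" storage means, the following is a deliberately simplified sketch of block-wise absmax quantization: each block of weights is stored as small integer codes plus one floating-point scale. This is not the bitsandbytes implementation, and it uses uniformly spaced levels, whereas the `nf4` type selected above uses non-uniform levels. The function names and the sample weights are invented for illustration.

```python
from typing import List, Tuple


def quantize_block(values: List[float]) -> Tuple[List[int], float]:
    """Map floats to signed integer codes in [-7, 7] plus one scale for the block."""
    absmax = max(abs(v) for v in values) or 1.0  # avoid dividing by zero
    scale = absmax / 7
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize_block(codes: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.12, -0.53, 0.07, 0.91, -0.33, 0.48, -0.02, 0.26]  # made-up values
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes)
print(f"max reconstruction error: {max_error:.4f}")
```

The reconstruction error is bounded by half a quantization step, which is why computation is done on dequantized values in a higher-precision `compute_dtype` rather than on the codes themselves.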

<a name='4'></a>
#### 4. Load Base Model
Let's now load Phi-2 using 4-bit quantization!

In [None]:
model_name = 'microsoft/phi-2'
original_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map=device_map,
    quantization_config=bnb_config,
    trust_remote_code=True,
    token=True,  # `use_auth_token` is deprecated in recent transformers releases
)

<a name='5'></a>
#### 5. Tokenization
Set up the tokenizer. Add padding on the left as it [makes training use less memory](https://ai.stackexchange.com/questions/41485/while-fine-tuning-a-decoder-only-llm-like-llama-on-chat-dataset-what-kind-of-pa).

In [8]:
# https://ai.stackexchange.com/questions/41485/while-fine-tuning-a-decoder-only-llm-like-llama-on-chat-dataset-what-kind-of-pa
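tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, padding_side="left", add_eos_token=True, add_bos_token=True, use_fast=False)
tokenizer.pad_token = tokenizer.eos_token
# Later cells use this tokenizer as `tokenizer` (preprocessing, training) and as `eval_tokenizer` (generation)
eval_tokenizer = tokenizer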

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
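
As a small illustration of what `padding_side="left"` changes, the sketch below pads a toy batch of token IDs on either side. It does not use the real tokenizer; `PAD_ID` and `pad_batch` are hypothetical names for this example only.

```python
from typing import List

PAD_ID = 0  # hypothetical pad token id, for illustration only


def pad_batch(sequences: List[List[int]], side: str = "left") -> List[List[int]]:
    """Pad every sequence to the length of the longest one on the chosen side."""
    width = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        filler = [PAD_ID] * (width - len(seq))
        padded.append(filler + seq if side == "left" else seq + filler)
    return padded


batch = [[11, 12, 13], [21, 22]]
print(pad_batch(batch, side="left"))   # [[11, 12, 13], [0, 21, 22]]
print(pad_batch(batch, side="right"))  # [[11, 12, 13], [21, 22, 0]]
```

With left padding, the last position of every row holds a real token, which is the position a decoder-only model continues from when generating.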


In [9]:
print_gpu_utilization()

GPU memory occupied: 2318 MB.



The function `gen` generates text with a pre-trained language model. It takes four parameters:

1. `model`: The pre-trained language model to use for text generation.
2. `p`: The prompt or input text for text generation.
3. `maxlen`: The maximum number of newly generated tokens (default is 100).
4. `sample`: A boolean flag indicating whether to use sampling during text generation (default is True).

The function first tokenizes the input prompt using an evaluation tokenizer (`eval_tokenizer`) and converts the tokens to PyTorch tensors on the GPU. It then generates text using the provided model with a low temperature (0.1), nucleus sampling (`top_p=0.95`), and a single beam (`num_beams=1`, i.e. no beam search). The generated token IDs are moved to the CPU and decoded into a list of strings, with special tokens skipped during decoding.

### Parameters
- `model`: A pre-trained causal language model (e.g., Phi-2).
- `p`: Prompt or input text for text generation.
- `maxlen`: Maximum number of new tokens to generate.
- `sample`: Boolean flag for sampling during text generation.

### Usage
```python
# Example usage; gen returns a list with one decoded string per returned sequence
generated_text = gen(model, "Input prompt for text generation", maxlen=150, sample=True)
print(generated_text[0])
```


In [10]:
def gen(model, p, maxlen=100, sample=True):
    toks = eval_tokenizer(p, return_tensors="pt").to("cuda")
    res = model.generate(**toks, max_new_tokens=maxlen, do_sample=sample, num_return_sequences=1,
                         temperature=0.1, num_beams=1, top_p=0.95).to("cpu")
    # Decode to a list of strings, dropping special tokens
    return eval_tokenizer.batch_decode(res, skip_special_tokens=True)
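
To show roughly what `temperature=0.1` and `top_p=0.95` do to the next-token distribution, here is a minimal pure-Python sketch. It is not how `model.generate` is implemented internally; the function name `top_p_distribution` and the logits are invented for illustration.

```python
import math
from typing import Dict


def top_p_distribution(logits: Dict[str, float], temperature: float, top_p: float) -> Dict[str, float]:
    """Apply temperature, keep the smallest token set whose mass reaches top_p, renormalize."""
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exp = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exp.values())
    probs = sorted(((tok, e / total) for tok, e in exp.items()), key=lambda kv: kv[1], reverse=True)

    kept, mass = [], 0.0
    for tok, p in probs:
        kept.append((tok, p))
        mass += p
        if mass >= top_p:
            break
    return {tok: p / mass for tok, p in kept}


logits = {"party": 2.0, "dance": 1.5, "drink": 0.5}  # made-up scores
sharp = top_p_distribution(logits, temperature=0.1, top_p=0.95)
flat = top_p_distribution(logits, temperature=1.0, top_p=0.95)
print(sharp)
print(flat)
```

In this toy case the low temperature concentrates almost all probability on the top token, so only one candidate survives the top-p cut; at temperature 1.0 all three remain. This is why sampling with a temperature of 0.1 behaves close to greedy decoding.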


<a name='6'></a>
#### 6. Test the Model with Zero Shot Inferencing


1. Imports necessary modules and sets a random seed for reproducibility.
2. Retrieves a dialogue and its corresponding summary from the test dataset at a specified index.
3. Formats the dialogue prompt and uses it to generate a summary using a pre-trained language model.
4. Prints the input prompt, baseline human summary, and model-generated summary for comparison.

In [11]:
%%time
from transformers import set_seed
seed = 42
set_seed(seed)

index = 10

prompt = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

formatted_prompt = f"Instruct: Summarize the following conversation.\n{prompt}\nOutput:\n"
res = gen(original_model,formatted_prompt,100,)
#print(res[0])
output = res[0].split('Output:\n')[1]

dash_line = '-' * 99
print(dash_line)
print(f'INPUT PROMPT:\n{formatted_prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


---------------------------------------------------------------------------------------------------
INPUT PROMPT:
Instruct: Summarize the following conversation.
#Person1#: Happy Birthday, this is for you, Brian.
#Person2#: I'm so happy you remember, please come in and enjoy the party. Everyone's here, I'm sure you have a good time.
#Person1#: Brian, may I have a pleasure to have a dance with you?
#Person2#: Ok.
#Person1#: This is really wonderful party.
#Person2#: Yes, you are always popular with everyone. and you look very pretty today.
#Person1#: Thanks, that's very kind of you to say. I hope my necklace goes with my dress, and they both make me look good I feel.
#Person2#: You look great, you are absolutely glowing.
#Person1#: Thanks, this is a fine party. We should have a drink together to celebrate your birthday
Output:

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# attends Brian's birthday pa

---
Observations:

The generated summary captures the essence of the conversation between Person1 and Person2 at Brian's birthday party. It includes key elements such as the invitation to dance, compliments on appearance, expressions of happiness about the party, and the suggestion to have a drink together to celebrate Brian's birthday. Overall, the model-generated summary effectively conveys the main points of the conversation, demonstrating an understanding of the dialogue context.

---
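
The zero-shot cell recovers the model's answer with `res[0].split('Output:\n')[1]`, which raises an `IndexError` if the marker is missing from the decoded text. A slightly more defensive variant could look like the sketch below. The helper name `extract_completion` is hypothetical, and the optional end marker assumes the model may emit a closing tag such as `### End`.

```python
def extract_completion(decoded: str, response_marker: str = "Output:\n", end_marker: str = "### End") -> str:
    """Return the text after the first response marker, cut at the end marker if present."""
    _, found, completion = decoded.partition(response_marker)
    if not found:
        return decoded.strip()  # marker missing: fall back to the full text
    completion, _, _ = completion.partition(end_marker)
    return completion.strip()


decoded = "Instruct: Summarize the following conversation.\n#Person1#: Hi.\nOutput:\nA greeting.\n### End\nextra text"
print(extract_completion(decoded))  # A greeting.
```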

<a name='7'></a>
#### 7. Pre-processing dataset


The function, `create_prompt_formats`, formats various fields of a sample dictionary containing instructions, conversation dialogue, and a summary. It concatenates the formatted fields using two newline characters to create a prompt for a language model to generate a response.

### Parameters
- `sample`: A dictionary with the dataset fields 'dialogue' and 'summary'.

### Function Steps
1. Define introductory blurb, instruction key, response key, and end key strings.
2. Format the introductory blurb, instruction, input context (dialogue), response, and end parts.
3. Concatenate the formatted parts using two newline characters.
4. Store the formatted prompt in the 'text' field of the sample dictionary.
5. Return the updated sample dictionary.

In [12]:
def create_prompt_formats(sample):
    """
    Format the fields of the sample ('dialogue', 'summary'),
    then concatenate them using two newline characters.
    :param sample: Sample dictionary
    """
    INTRO_BLURB = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
    INSTRUCTION_KEY = "### Instruct: Summarize the below conversation."
    RESPONSE_KEY = "### Output:"
    END_KEY = "### End"
    
    blurb = f"\n{INTRO_BLURB}"
    instruction = f"{INSTRUCTION_KEY}"
    input_context = f"{sample['dialogue']}" if sample["dialogue"] else None
    response = f"{RESPONSE_KEY}\n{sample['summary']}"
    end = f"{END_KEY}"
    
    parts = [part for part in [blurb, instruction, input_context, response, end] if part]

    formatted_prompt = "\n\n".join(parts)
    sample["text"] = formatted_prompt

    return sample
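
To see what the resulting `text` field looks like, here is a small standalone sketch that applies the same joining logic to a toy sample. The constants are copied from the function above; `build_prompt` is a condensed stand-in written for this illustration, and the dialogue and summary are made up rather than taken from the dataset.

```python
# Constants copied from create_prompt_formats
INTRO_BLURB = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
INSTRUCTION_KEY = "### Instruct: Summarize the below conversation."
RESPONSE_KEY = "### Output:"
END_KEY = "### End"


def build_prompt(dialogue: str, summary: str) -> str:
    """Condensed, standalone equivalent of the joining logic, for illustration."""
    parts = [f"\n{INTRO_BLURB}", INSTRUCTION_KEY, dialogue, f"{RESPONSE_KEY}\n{summary}", END_KEY]
    return "\n\n".join(part for part in parts if part)


# Toy sample (made up, not taken from the dataset)
text = build_prompt("#Person1#: Hi.\n#Person2#: Hello.", "Two people greet each other.")
print(text)
```

Note that the training prompt uses `### Instruct:` / `### Output:` / `### End` markers, while the inference prompts elsewhere in this notebook use plain `Instruct:` and `Output:`.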

### `get_max_length(model)`
- **Description**: This function retrieves the maximum length setting from the configuration of a given model. It checks for various possible length settings and defaults to a maximum length of 1024 if none are found.
- **Parameters**:
  - `model`: The model for which the maximum length is being retrieved.
- **Returns**: 
  - `max_length`: The maximum length setting for tokenization.

### `preprocess_batch(batch, tokenizer, max_length)`
- **Description**: This function preprocesses a batch of text data for tokenization using a given tokenizer. It tokenizes the text in the batch and applies truncation if necessary to ensure that the length does not exceed the specified maximum length.
- **Parameters**:
  - `batch`: A batch of text data to be preprocessed.
  - `tokenizer`: The tokenizer used for tokenization.
  - `max_length`: The maximum length allowed for tokenization.
- **Returns**: 
  - Tokenized and preprocessed batch.

In [13]:
# SOURCE https://github.com/databrickslabs/dolly/blob/master/training/trainer.py
def get_max_length(model):
    conf = model.config
    max_length = None
    for length_setting in ["n_positions", "max_position_embeddings", "seq_length"]:
        max_length = getattr(conf, length_setting, None)
        if max_length:
            print(f"Found max lenth: {max_length}")
            break
    if not max_length:
        max_length = 1024
        print(f"Using default max length: {max_length}")
    return max_length


def preprocess_batch(batch, tokenizer, max_length):
    """
    Tokenizing a batch
    """
    return tokenizer(
        batch["text"],
        max_length=max_length,
        truncation=True,
    )

This function, `preprocess_dataset`, prepares a dataset for training by formatting and tokenizing it, ensuring that it is ready for consumption by a language model.

### Parameters
- `tokenizer` (AutoTokenizer): The tokenizer associated with the language model.
- `max_length` (int): The maximum number of tokens to emit from the tokenizer.
- `seed` (int): The random seed used for shuffling the dataset.
- `dataset`: The dataset to be preprocessed.

### Function Steps
1. Add a prompt to each sample in the dataset using the `create_prompt_formats` function.
2. Define a partial function `_preprocessing_function` using `preprocess_batch`, `max_length`, and `tokenizer`.
3. Apply the `_preprocessing_function` to each batch in the dataset to tokenize and format the samples.
4. Remove unnecessary columns ('id', 'topic', 'dialogue', 'summary') from the dataset.
5. Filter out samples with input_ids exceeding the specified `max_length`.
6. Shuffle the dataset using the provided random seed.

In [14]:
from functools import partial

# SOURCE https://github.com/databrickslabs/dolly/blob/master/training/trainer.py
def preprocess_dataset(tokenizer: AutoTokenizer, max_length: int, seed: int, dataset):
    """Format & tokenize it so it is ready for training
    :param tokenizer (AutoTokenizer): Model Tokenizer
    :param max_length (int): Maximum number of tokens to emit from tokenizer
    :param seed (int): Random seed used for shuffling
    :param dataset: Dataset split to preprocess
    """
    
    # Add prompt to each sample
    print("Preprocessing dataset...")
    dataset = dataset.map(create_prompt_formats)
    
    _preprocessing_function = partial(preprocess_batch, max_length=max_length, tokenizer=tokenizer)
    dataset = dataset.map(
        _preprocessing_function,
        batched=True,
        remove_columns=['id', 'topic', 'dialogue', 'summary'],
    )

    # Filter out samples that have input_ids exceeding max_length
    dataset = dataset.filter(lambda sample: len(sample["input_ids"]) < max_length)
    
    # Shuffle dataset
    dataset = dataset.shuffle(seed=seed)

    return dataset
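
One detail of the pipeline above is worth spelling out: because tokenization truncates at `max_length` and the filter then keeps only samples with strictly fewer than `max_length` tokens, any sample that reaches the limit is dropped rather than trained on in truncated form. The sketch below reproduces that interaction with a toy whitespace "tokenizer"; `toy_tokenize` and the example strings are assumptions for illustration, not part of the notebook.

```python
from typing import List


def toy_tokenize(text: str, max_length: int) -> List[str]:
    """Whitespace 'tokenizer' with truncation, standing in for the real tokenizer."""
    return text.split()[:max_length]


MAX_LENGTH = 5  # tiny limit so the effect is visible
texts = [
    "a short prompt",                   # 3 tokens, kept
    "exactly five tokens right here",   # 5 tokens, dropped (not < 5)
    "this one is far too long to fit",  # 8 tokens, truncated to 5, dropped
]
tokenized = [toy_tokenize(t, MAX_LENGTH) for t in texts]
kept = [toks for toks in tokenized if len(toks) < MAX_LENGTH]
print([len(toks) for toks in tokenized])  # [3, 5, 5]
print(len(kept))                          # 1
```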

In [15]:
print_gpu_utilization()

GPU memory occupied: 2596 MB.


In [16]:
# ## Pre-process dataset
max_length = get_max_length(original_model)
print(max_length)

train_dataset = preprocess_dataset(tokenizer, max_length,seed, dataset['train'])
eval_dataset = preprocess_dataset(tokenizer, max_length,seed, dataset['validation'])

Found max lenth: 2048
2048
Preprocessing dataset...


Preprocessing dataset...



In [17]:
print("Shapes of the datasets:")
print(f"Training: {train_dataset.shape}")
print(f"Validation: {eval_dataset.shape}")
print(train_dataset)

Shapes of the datasets:
Training: (1999, 3)
Validation: (499, 3)
Dataset({
    features: ['text', 'input_ids', 'attention_mask'],
    num_rows: 1999
})


<a name='8'></a>
#### 8. Setup the PEFT/QLoRA model for Fine-Tuning
Now, let's perform Parameter-Efficient Fine-Tuning (PEFT). PEFT updates only a small number of parameters, which makes it much more efficient than full fine-tuning.
PEFT is a generic term that includes Low-Rank Adaptation (LoRA) and prompt tuning (which is NOT THE SAME as prompt engineering!). In most cases, when someone says PEFT, they mean LoRA. LoRA enables efficient model fine-tuning using fewer computational resources, often just a single GPU. After LoRA fine-tuning for a specific task or use case, the original LLM is unchanged and a considerably smaller "LoRA adapter" is produced, often a single-digit percentage of the original LLM size (MBs rather than GBs).


Note the rank (r) hyper-parameter: it is the rank of the low-rank matrices used in the adapters and therefore controls the number of parameters trained. A higher rank allows for more expressivity, but at a higher compute and memory cost.

alpha is the scaling factor for the learned weights. The LoRA update is scaled by alpha/r, and thus a higher value for alpha assigns more weight to the LoRA activations.
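
The role of r and alpha can be sketched in a few lines of plain Python. For an input x, a LoRA-adapted linear layer computes W x + (alpha / r) · B (A x), where W is the frozen base weight and only the low-rank matrices A (r × d_in) and B (d_out × r) are trained. The code below is a toy illustration with tiny made-up matrices, not the PEFT implementation; `matvec` and `lora_forward` are names chosen for this example.

```python
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: List[float], alpha: float, r: int) -> List[float]:
    """Compute W x + (alpha / r) * B (A x); W is frozen, only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Toy sizes: d_in = d_out = 3, rank r = 2
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]         # r x d_in
B_zero = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # d_out x r, zero-initialized
x = [1.0, 2.0, 3.0]

print(lora_forward(W, A, B_zero, x, alpha=2, r=2))  # [1.0, 2.0, 3.0]
```

With B initialized to zeros, the adapted layer starts out identical to the base layer, and training moves it away from that starting point. With the configuration used in this tutorial (r = 32, alpha = 32), the scale alpha / r is 1.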

In [None]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

#print(print_number_of_trainable_model_parameters(original_model))

In [19]:
print(original_model)

PhiForCausalLM(
  (model): PhiModel(
    (embed_tokens): Embedding(51200, 2560)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-31): 32 x PhiDecoderLayer(
        (self_attn): PhiAttention(
          (q_proj): Linear4bit(in_features=2560, out_features=2560, bias=True)
          (k_proj): Linear4bit(in_features=2560, out_features=2560, bias=True)
          (v_proj): Linear4bit(in_features=2560, out_features=2560, bias=True)
          (dense): Linear4bit(in_features=2560, out_features=2560, bias=True)
          (rotary_emb): PhiRotaryEmbedding()
        )
        (mlp): PhiMLP(
          (activation_fn): NewGELUActivation()
          (fc1): Linear4bit(in_features=2560, out_features=10240, bias=True)
          (fc2): Linear4bit(in_features=10240, out_features=2560, bias=True)
        )
        (input_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
        (resid_dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (final_layern

The PhiForCausalLM model is a variant of a causal language model designed for text generation tasks. It utilizes the PhiModel architecture, which includes embedding layers, multiple layers of PhiDecoderLayer, and a final linear layer (lm_head).

Key components of the model include:

1. Embedding layer: Maps input tokens to continuous vector representations.
2. PhiDecoderLayer: Consists of self-attention mechanisms and multi-layer perceptrons (MLPs) for capturing contextual information and modeling dependencies between tokens.
3. Final linear layer (lm_head): Predicts the probability distribution over the vocabulary for generating the next token in the sequence.

Overall, the PhiForCausalLM model is tailored for generating coherent and contextually relevant text sequences, making it suitable for various natural language processing tasks such as text summarization, dialogue generation, and machine translation.

In [20]:
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

config = LoraConfig(
    r=32, #Rank
    lora_alpha=32,
    target_modules=[
        'q_proj',
        'k_proj',
        'v_proj',
        'dense'
    ],
    bias="none",
    lora_dropout=0.05,  # Conventional
    task_type="CAUSAL_LM",
)

# 1 - Enabling gradient checkpointing to reduce memory usage during fine-tuning
original_model.gradient_checkpointing_enable()

# 2 - Using the prepare_model_for_kbit_training method from PEFT
original_model = prepare_model_for_kbit_training(original_model)

peft_model = get_peft_model(original_model, config)

Once everything is set up and the base model is prepared, we can use the `print_number_of_trainable_model_parameters()` helper function defined above to see how many trainable parameters are in the model.

In [21]:
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 20971520
all model parameters: 1542364160
percentage of trainable model parameters: 1.36%
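
The trainable parameter count can be checked by hand from the configuration and the printed architecture: each targeted module gets a `lora_A` matrix (2560 × 32) and a `lora_B` matrix (32 × 2560), there are four targeted modules per decoder layer, and 32 layers. The constant names below are chosen for this sketch.

```python
HIDDEN = 2560          # in_features and out_features of q_proj, k_proj, v_proj, dense
RANK = 32              # r in LoraConfig
TARGETS_PER_LAYER = 4  # q_proj, k_proj, v_proj, dense
NUM_LAYERS = 32        # number of PhiDecoderLayer blocks

params_per_module = HIDDEN * RANK + RANK * HIDDEN  # lora_A + lora_B
trainable = params_per_module * TARGETS_PER_LAYER * NUM_LAYERS
print(trainable)  # 20971520
```

This matches the reported number of trainable parameters. The `all model parameters` figure is not reproduced here; for 4-bit layers, `numel()` may reflect the packed storage rather than the logical parameter count, so the reported percentage should be read with that caveat.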


In [22]:
# See how the model looks different now, with the QLoRA adapters added:
print(peft_model)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): PhiForCausalLM(
      (model): PhiModel(
        (embed_tokens): Embedding(51200, 2560)
        (embed_dropout): Dropout(p=0.0, inplace=False)
        (layers): ModuleList(
          (0-31): 32 x PhiDecoderLayer(
            (self_attn): PhiAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=2560, out_features=2560, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2560, out_features=32, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=32, out_features=2560, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k_proj): lora.Linear4

<a name='9'></a>
#### 9. Train PEFT Adapter

Define training arguments and create Trainer instance.

In [23]:
output_dir = './peft-dialogue-summary-training/final-checkpoint'
import transformers

peft_training_args = TrainingArguments(
    output_dir = output_dir,
    warmup_steps=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    max_steps=1000,
    learning_rate=2e-4,
    optim="paged_adamw_8bit",
    logging_steps=25,
    logging_dir="./logs",
    save_strategy="steps",
    save_steps=25,
    evaluation_strategy="steps",
    eval_steps=25,
    do_eval=True,
    gradient_checkpointing=True,
    report_to="none",
    overwrite_output_dir=True,
    group_by_length=True,
)

peft_model.config.use_cache = False

peft_trainer = transformers.Trainer(
    model=peft_model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    args=peft_training_args,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
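
As a rough sanity check on these settings, the number of samples seen per optimizer step is the per-device batch size times the gradient accumulation steps times the number of devices. The device count is not stated in this tutorial; the value of 2 below is an assumption, chosen because it is consistent with the `'epoch': 4.0` in the recorded `TrainOutput`. With a single GPU the same settings would cover about 2 epochs.

```python
PER_DEVICE_BATCH = 1
GRAD_ACCUM_STEPS = 4
MAX_STEPS = 1000
TRAIN_ROWS = 1999
NUM_DEVICES = 2  # assumption: not stated in the tutorial; set to your own GPU count

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_DEVICES
samples_seen = effective_batch * MAX_STEPS
epochs = samples_seen / TRAIN_ROWS
print(effective_batch, samples_seen, round(epochs, 2))  # 8 8000 4.0
```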

In [24]:
peft_training_args.device

device(type='cuda', index=0)

In [25]:
peft_trainer.train()



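| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 25 | 1.6236 | 1.378991 |
| 50 | 1.2032 | 1.408253 |
| 75 | 1.4304 | 1.345791 |
| 100 | 1.1503 | 1.364201 |
| 125 | 1.4677 | 1.336263 |
| 150 | 1.189 | 1.342465 |
| 175 | 1.4211 | 1.331966 |
| 200 | 1.1929 | 1.337227 |
| 225 | 1.4493 | 1.32955 |
| 250 | 1.1911 | 1.339475 |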




TrainOutput(global_step=1000, training_loss=1.2691061172485352, metrics={'train_runtime': 15453.7563, 'train_samples_per_second': 0.518, 'train_steps_per_second': 0.065, 'total_flos': 3.712609867585536e+16, 'train_loss': 1.2691061172485352, 'epoch': 4.0})

* In the steps shown (the first 250 of 1000), the training loss does not fall monotonically: after the first logged value of 1.62 it alternates between about 1.15 to 1.20 and about 1.42 to 1.47. The average training loss over the full run is 1.27.
* The validation loss moves within a much narrower band, from 1.379 at step 25 to between roughly 1.33 and 1.34 from step 175 onward, which suggests a modest improvement on unseen data.
* There is no significant divergence between the training and validation losses in this range, so nothing here points to severe overfitting or underfitting.
* Overall, the results suggest that the model is learning from the data without clear signs of underfitting or overfitting. However, further analysis may be needed to optimize the model's performance.

In [26]:
print_gpu_utilization()

GPU memory occupied: 14760 MB.


In [27]:
# Free memory for merging weights
del original_model
del peft_trainer
torch.cuda.empty_cache()

In [28]:
print_gpu_utilization()

GPU memory occupied: 3584 MB.


<a name='10'></a>
#### 10. Evaluate the Model Qualitatively (Human Evaluation)

In [29]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

base_model_id = "microsoft/phi-2"
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    device_map='auto',
    quantization_config=bnb_config,
    trust_remote_code=True,
    use_auth_token=True,
)




In [30]:
eval_tokenizer = AutoTokenizer.from_pretrained(base_model_id, add_bos_token=True, trust_remote_code=True, use_fast=False)
eval_tokenizer.pad_token = eval_tokenizer.eos_token

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [31]:
from peft import PeftModel

ft_model = PeftModel.from_pretrained(
    base_model,
    "/kaggle/working/peft-dialogue-summary-training/final-checkpoint/checkpoint-1000",
    torch_dtype=torch.float16,
    is_trainable=False,
)

In [32]:
%%time
from transformers import set_seed
set_seed(seed)

index = 10
dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

prompt = f"Instruct: Summarize the following conversation.\n{dialogue}\nOutput:\n"

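# Generate a summary with the fine-tuned model and keep the text after the "Output:" marker
peft_model_res = gen(ft_model, prompt, 100)
peft_model_output = peft_model_res[0].split('Output:\n')[1]
# Keep only the text before the first '#End', if the model produced one
prefix, success, result = peft_model_output.partition('#End')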

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'PEFT MODEL:\n{prefix}')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


---------------------------------------------------------------------------------------------------
INPUT PROMPT:
Instruct: Summarize the following conversation.
#Person1#: Happy Birthday, this is for you, Brian.
#Person2#: I'm so happy you remember, please come in and enjoy the party. Everyone's here, I'm sure you have a good time.
#Person1#: Brian, may I have a pleasure to have a dance with you?
#Person2#: Ok.
#Person1#: This is really wonderful party.
#Person2#: Yes, you are always popular with everyone. and you look very pretty today.
#Person1#: Thanks, that's very kind of you to say. I hope my necklace goes with my dress, and they both make me look good I feel.
#Person2#: You look great, you are absolutely glowing.
#Person1#: Thanks, this is a fine party. We should have a drink together to celebrate your birthday
Output:

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# attends Brian's birthday pa
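
The post-processing in the cell above (splitting on `Output:\n`, then partitioning on `#End`) can be sketched as a small standalone helper. The function name `extract_summary` and the sample string are hypothetical and used only for illustration. Unlike `split(...)[1]`, which raises an `IndexError` when the marker is missing and stops at a second `Output:\n` if one appears, this sketch uses `partition` throughout and returns an empty string when the marker is absent.

```python
def extract_summary(generated_text, answer_marker="Output:\n", end_marker="#End"):
    """Return the text between the answer marker and the end marker."""
    # Everything after the first answer marker is the model's continuation.
    _, found, continuation = generated_text.partition(answer_marker)
    if not found:
        return ""  # the prompt marker is missing from the generated text
    # Discard anything generated after the end marker, if it is present.
    summary, _, _ = continuation.partition(end_marker)
    return summary.strip()


# Hypothetical generated text, for illustration only.
sample = (
    "Instruct: Summarize the following conversation.\n"
    "#Person1#: Hi.\n#Person2#: Hello.\n"
    "Output:\n"
    "#Person1# greets #Person2#.\n#End\nunrelated continuation"
)
print(extract_summary(sample))
```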

<a name='11'></a>
#### 11. Evaluate the Model Quantitatively (with ROUGE Metric)
Run inference on a sample of the test dataset (only 10 dialogues and summaries, to save time).

In [33]:
original_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    device_map='auto',
    quantization_config=bnb_config,
    trust_remote_code=True,
    use_auth_token=True,
)


In [34]:
import pandas as pd

dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
peft_model_summaries = []

for dialogue in dialogues:
    prompt = f"Instruct: Summarize the following conversation.\n{dialogue}\nOutput:\n"

    # Baseline: summary from the original (not fine-tuned) model
    original_model_res = gen(original_model, prompt, 100)
    original_model_text_output = original_model_res[0].split('Output:\n')[1]

    # Fine-tuned model: keep only the text before the first '#End'
    peft_model_res = gen(ft_model, prompt, 100)
    peft_model_output = peft_model_res[0].split('Output:\n')[1]
    peft_model_text_output, success, result = peft_model_output.partition('#End')

    original_model_summaries.append(original_model_text_output)
    peft_model_summaries.append(peft_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, peft_model_summaries))
 
df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'peft_model_summaries'])
df

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

|   | human_baseline_summaries | original_model_summaries | peft_model_summaries |
|---|---|---|---|
| 0 | Ms. Dawson helps #Person1# to write a memo to ... | Person 1: Ms. Dawson, I need you to take a dic... | #Person1# asks Ms. Dawson to take a dictation ... |
| 1 | In order to prevent employees from wasting tim... | Person 1: Ms. Dawson, I need you to take a dic... | #Person1# asks Ms. Dawson to take a dictation ... |
| 2 | Ms. Dawson takes a dictation for #Person1# abo... | Person 1: Ms. Dawson, I need you to take a dic... | #Person1# asks Ms. Dawson to take a dictation ... |
| 3 | #Person2# arrives late because of traffic jam.... | Person1 and Person2 are discussing the traffic... | #Person2# got stuck in traffic again and #Pers... |
| 4 | #Person2# decides to follow #Person1#'s sugges... | Person1 and Person2 are discussing the traffic... | #Person2# got stuck in traffic again and #Pers... |
| 5 | #Person2# complains to #Person1# about the tra... | Person1 and Person2 are discussing the traffic... | #Person2# got stuck in traffic again and #Pers... |
| 6 | #Person1# tells Kate that Masha and Hero get d... | Kate informed that Masha and Hero are getting ... | Masha and Hero are getting divorced. Masha tel... |
| 7 | #Person1# tells Kate that Masha and Hero are g... | Kate informed that Masha and Hero are getting ... | Masha and Hero are getting divorced. Masha tel... |
| 8 | #Person1# and Kate talk about the divorce betw... | Kate informed that Masha and Hero are getting ... | Masha and Hero are getting divorced. Masha tel... |
| 9 | #Person1# and Brian are at the birthday party ... | Person1 and Person2 are at a party, and Person... | #Person1# brings a birthday gift for Brian and... |


In [35]:
!pip install rouge_score

Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... done
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24934 sha256=12eaf119dbe7c54ec6806e08989fa74cbc6e30fae70718af72f3fed6e1d5beb0
  Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2


In [36]:
import evaluate

rouge = evaluate.load('rouge')

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('PEFT MODEL:')
print(peft_model_results)


ORIGINAL MODEL:
{'rouge1': 0.2827007613710234, 'rouge2': 0.09658938946924345, 'rougeL': 0.20147734920398974, 'rougeLsum': 0.21266882373948465}
PEFT MODEL:
{'rouge1': 0.319583317792019, 'rouge2': 0.10724176116995074, 'rougeL': 0.23977189826979742, 'rougeLsum': 0.25606007569886413}


The PEFT model scores higher than the original model on all four ROUGE metrics (F-measures, computed on the 10 test samples):

* ROUGE-1: 0.2827 -> 0.3196 [more unigram overlap between the generated summaries and the reference summaries]
* ROUGE-2: 0.0966 -> 0.1072 [more bigram overlap]
* ROUGE-L: 0.2015 -> 0.2398 [longer common subsequences with the reference, scored over the summary as a whole]
* ROUGE-Lsum: 0.2127 -> 0.2561 [longest common subsequence scored per line of the summary and then combined]

These gains indicate that the PEFT model's summaries share more words and more word order with the human references than the original model's summaries do. With only 10 samples, the differences are indicative rather than conclusive.
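
To make the overlap measures concrete, the following is a simplified sketch of ROUGE-N and ROUGE-L as F-measures. It is not the `rouge_score` implementation used above: it tokenizes on whitespace, applies no stemming, and handles a single prediction/reference pair. The function names and the two example sentences are illustrative assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f_measure(matches, pred_total, ref_total):
    """Harmonic mean of precision and recall; 0.0 when there is no overlap."""
    if matches == 0 or pred_total == 0 or ref_total == 0:
        return 0.0
    precision = matches / pred_total
    recall = matches / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(prediction, reference, n=1):
    """Simplified ROUGE-N: clipped n-gram overlap between two strings."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    matches = sum((pred & ref).values())  # each n-gram counts at most min(pred, ref) times
    return f_measure(matches, sum(pred.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(prediction, reference):
    """Simplified ROUGE-L: F-measure based on the longest common subsequence."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    return f_measure(lcs_length(pred, ref), len(pred), len(ref))


# Illustrative sentences, not taken from the dataset.
prediction = "the cat sat on the mat"
reference = "the cat lay on the mat"
print(rouge_n(prediction, reference, 1))
print(rouge_n(prediction, reference, 2))
print(rouge_l(prediction, reference))
```

In this example five of six unigrams match, but only three of five bigrams do, which illustrates why ROUGE-2 is typically much lower than ROUGE-1, as in the results above.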

In [37]:
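import numpy as np

# Note: the values printed below are absolute differences in ROUGE score,
# expressed in percentage points (not relative percentage changes)
print("Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')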

Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL
rouge1: 3.69%
rouge2: 1.07%
rougeL: 3.83%
rougeLsum: 4.34%
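
The figures above are absolute differences in score. A minimal sketch, using the two result dictionaries printed earlier, shows how the same numbers look as relative improvements over the original model's scores. The helper name `improvements` is an illustrative assumption.

```python
# Scores copied from the ROUGE output above.
original = {'rouge1': 0.2827007613710234, 'rouge2': 0.09658938946924345,
            'rougeL': 0.20147734920398974, 'rougeLsum': 0.21266882373948465}
peft = {'rouge1': 0.319583317792019, 'rouge2': 0.10724176116995074,
        'rougeL': 0.23977189826979742, 'rougeLsum': 0.25606007569886413}


def improvements(baseline, candidate):
    """Return {metric: (absolute gain in percentage points, relative gain in %)}."""
    rows = {}
    for key, base in baseline.items():
        diff = candidate[key] - base
        rows[key] = (diff * 100, diff / base * 100)
    return rows


for key, (abs_pp, rel_pct) in improvements(original, peft).items():
    print(f"{key}: +{abs_pp:.2f} percentage points absolute, +{rel_pct:.2f}% relative")
```

Reporting both views avoids ambiguity: a gain of a few percentage points on a low baseline score corresponds to a noticeably larger relative change.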
