# NLP. Week 13. LLMs. Fine-tuning with LoRA


## LLM

A large language model (LLM) is a language model notable for its ability to achieve general-purpose language generation and perform various NLP tasks.

### Generation with LLMs

LLMs, or Large Language Models, are the key component behind text generation. In a nutshell, they consist of large pretrained transformer models trained to predict the next word (or, more precisely, token) given some input text. Since they predict one token at a time, you need to do something more elaborate to generate new sentences other than just calling the model — you need to do autoregressive generation.

Autoregressive generation is the inference-time procedure of iteratively calling a model with its own generated outputs, given a few initial inputs.

### Autoregressive generation

> Causal language modeling - predicts the next token in a sequence of tokens, and the model can only attend to tokens on the left. This means the model cannot see future tokens. GPT-2 is an example of a causal language model.

A language model trained for causal language modeling takes a sequence of text tokens as input and returns the probability distribution for the next token.
A critical aspect of autoregressive generation with LLMs is how to select the next token from this probability distribution. Anything goes in this step as long as you end up with a token for the next iteration. This means it can be as simple as selecting the most likely token from the probability distribution or as complex as applying a dozen transformations before sampling from the resulting distribution. The process of autoregressive prediction is repeated iteratively until some stopping condition is reached. Ideally, the stopping condition is dictated by the model, which should learn when to output an end-of-sequence (EOS) token. If this is not the case, generation stops when some predefined maximum length is reached.
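
The loop described above can be sketched with a toy model. This is a minimal illustration, not how a real LLM is called: the bigram table `BIGRAM_PROBS`, its probabilities, and the helper names are all made up for the example, and the token selection rule is plain greedy decoding (always take the most likely token).

```python
# Toy next-token model: a hand-written bigram table (hypothetical values).
# A real LLM would return a distribution over its whole vocabulary instead.
BIGRAM_PROBS = {
    "<bos>": {"the": 0.9, "cat": 0.1},
    "the": {"cat": 0.7, "mat": 0.3},
    "cat": {"sat": 0.8, "<eos>": 0.2},
    "sat": {"<eos>": 0.6, "the": 0.4},
    "mat": {"<eos>": 1.0},
}


def next_token_distribution(tokens):
    """Return P(next token | tokens); this toy model looks only at the last token."""
    return BIGRAM_PROBS[tokens[-1]]


def greedy_generate(prompt_tokens, max_new_tokens=10, eos_token="<eos>"):
    """Repeatedly append the most likely next token until EOS or the length limit."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = next_token_distribution(tokens)
        next_token = max(probs, key=probs.get)  # greedy selection
        tokens.append(next_token)
        if next_token == eos_token:  # stopping condition chosen by the model
            break
    return tokens


print(greedy_generate(["<bos>"]))
```

Replacing the `max(...)` line with a sampling step is all it takes to switch from greedy decoding to a stochastic strategy; the rest of the loop stays the same.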

Some cool LLM links: [HF Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), [HF coding assistant](https://huggingface.co/spaces/HuggingFaceH4/starchat2-playground)

LLMs: [BLOOM](https://huggingface.co/docs/transformers/model_doc/bloom), [Gemma](https://huggingface.co/docs/transformers/model_doc/gemma#gemma), [LLaMa](https://huggingface.co/docs/transformers/model_doc/llama), [GPT-2](https://huggingface.co/openai-community/gpt2)


## Parameter Efficient Fine Tuning (PEFT)

PEFT is a family of techniques designed to fine-tune models while minimizing the need for extensive resources and cost. PEFT is a great choice when dealing with domain-specific tasks that necessitate model adaptation. By employing PEFT, we can strike a balance between retaining valuable knowledge from the pre-trained model and adapting it effectively to the target task with fewer trainable parameters. There are various ways of achieving parameter-efficient fine-tuning. Low-Rank Adaptation (LoRA) and QLoRA are among the most widely used and effective.

**Some PEFT techniques**:

- Adapter - add extra trainable parameters after the attention and fully-connected layers of a frozen pretrained model to reduce memory-usage and speed up training.
- LoRA - Low-Rank Adaptation
- Prefix tuning - trainable prefix parameters are inserted in all of the model layers, whereas prompt tuning only adds the prompt parameters to the model input embeddings.
- Prompt tuning - prompt tokens have their own parameters that are updated independently. This means you can keep the pretrained model’s parameters frozen and only update the prompt token embeddings.
- P-tuning - adds a trainable embedding tensor that can be optimized to find better prompts, and it uses a prompt encoder (a bidirectional long short-term memory network, or LSTM) to optimize the prompt parameters.
- IA3 - rescales inner activations with learned vectors. These learned vectors are injected in the attention and feedforward modules in a typical transformer-based architecture.

Some useful links: [Adapters](https://huggingface.co/docs/peft/en/conceptual_guides/adapter), [Soft prompts](https://huggingface.co/docs/peft/en/conceptual_guides/prompting), [IA3](https://huggingface.co/docs/peft/en/conceptual_guides/ia3)


## LoRA

[Paper](https://arxiv.org/pdf/2106.09685.pdf)

Low-Rank Adaptation (LoRA) is a reparametrization method that aims to reduce the number of trainable parameters with low-rank representations. Instead of updating a full weight matrix, the weight update is represented as the product of two low-rank matrices, and only these two matrices are trained. All the pretrained model parameters remain frozen. After training, the product of the low-rank matrices can be added back to the original weights. This makes it more efficient to store and train a LoRA model because there are significantly fewer trainable parameters.
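
In the notation of the LoRA paper (the symbols $W_0$, $A$, $B$, $\alpha$ and the shapes $d \times k$ are taken from the paper, not from this notebook), the adapted weight can be written as:

$$
W' = W_0 + \Delta W = W_0 + \frac{\alpha}{r} B A,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
$$

Only $A$ and $B$ are trained, so the number of trainable parameters per adapted matrix drops from $d \cdot k$ to $r \cdot (d + k)$.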


### LoRA configuration

With LoRA, the pretrained attention weights stay frozen. The only trainable part is the weight update `Weight_delta`, which is learned through its two low-rank factors `W_a` and `W_b`.

<img src="https://files.training.databricks.com/images/llm/lora.png" width=300>

You can treat `r` (rank) as a hyperparameter. According to [Hu et al. (2021)](https://arxiv.org/abs/2106.09685), LoRA can perform well with very small ranks: GPT-3's validation accuracies across tasks are quite similar for ranks from 1 to 64.

From [PyTorch Lightning's documentation](https://lightning.ai/pages/community/article/lora-llm/):

> A smaller `r` leads to a simpler low-rank matrix, which results in fewer parameters to learn during adaptation. This can lead to faster training and potentially reduced computational requirements. However, with a smaller `r`, the capacity of the low-rank matrix to capture task-specific information decreases. This may result in lower adaptation quality, and the model might not perform as well on the new task compared to a higher `r`.
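
To get a feeling for how `r` affects the number of trainable parameters, here is a small back-of-the-envelope calculation. The layer size of 4096 x 4096 is an assumption chosen for illustration, and `lora_param_count` is a hypothetical helper, not part of any library.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters of one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


d_in = d_out = 4096  # assumed layer size, chosen for illustration
full = d_in * d_out  # parameters in the frozen dense weight

for r in (1, 4, 8, 64):
    lora = lora_param_count(d_in, d_out, r)
    print(f"r={r:>2}: {lora:>7,} trainable params ({100 * lora / full:.3f}% of the dense layer)")
```

The count grows linearly with `r`, so even the largest rank in this list trains only a small fraction of the dense layer.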

Other arguments:

- `lora_dropout`:
  - The dropout probability for the LoRA layers. Dropout is a regularization method that reduces overfitting by randomly and temporarily removing nodes during training.
  - It works like this:
    - It can be applied to most types of layers (e.g. fully connected, convolutional, recurrent) and to larger networks
    - Nodes and their connections are temporarily and randomly removed during each training cycle
- `target_modules`:
  - Specifies the names of the modules to apply LoRA to
  - This is dependent on how the foundation model names its attention weight matrices.
  - Typically, this can be:
    - `query`, `q`, `q_proj`
    - `key`, `k`, `k_proj`
    - `value`, `v` , `v_proj`
    - `query_key_value`
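
Since the right `target_modules` values depend on the model's naming scheme, it can help to look at the distinct leaf names of its modules. The sketch below works on a plain list of dotted names; the example names are hypothetical (written in the style of a LLaMA-like checkpoint) and `leaf_name_counts` is an illustrative helper.

```python
from collections import Counter


def leaf_name_counts(module_names):
    """Count the last path component of each dotted module name."""
    return Counter(name.rsplit(".", 1)[-1] for name in module_names if name)


# Hypothetical module names in the style of a LLaMA-like checkpoint.
example_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.1.self_attn.q_proj",
    "model.layers.1.self_attn.v_proj",
]

counts = leaf_name_counts(example_names)
print(sorted(counts))
```

With a real PyTorch model, the input list could be built as `[name for name, _ in model.named_modules()]`. That list also contains non-linear modules, so the candidates still need to be picked by hand.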


### QLoRA

[Paper](https://arxiv.org/pdf/2305.14314.pdf), [QLoRA repository](https://github.com/artidoro/qlora), [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)

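QLoRA extends LoRA to enhance efficiency by quantizing the weight values of the original network from high-resolution data types, such as float32, to lower-resolution data types, such as 4-bit NormalFloat (NF4). This greatly reduces memory demands. The quantized weights are dequantized to a higher-precision compute type when they are used, so the main gain is memory rather than speed.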

### Key optimizations of QLoRA

1. **4-bit NF4 quantization**. 4-bit NormalFloat (NF4) is a data type optimized for storing normally distributed weights, which brings down the memory footprint considerably.
2. **Normalization and quantization**. As part of the normalization and quantization steps, the weights are adjusted to zero mean and constant unit variance. A 4-bit data type can only represent 16 distinct values. The normalized weights are mapped to these 16 values, which are distributed around zero, and instead of storing the weight itself, the position of the nearest value is stored.
3. **Double quantization**. Dequantization requires the quantization constants to be stored. With blockwise quantization there is one constant per block, so n blocks yield n constants kept in their original data type. Large LLMs have a substantial number of such constants, which leads to noticeable memory overhead. Double quantization quantizes the quantization constants themselves to reduce this overhead.
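
The following sketch illustrates the idea of blockwise 4-bit quantization and the overhead of the quantization constants. It makes several simplifying assumptions: the 16 levels are evenly spaced in [-1, 1] as a stand-in for the real NF4 levels, each block is scaled by its absolute maximum, and all function names are hypothetical. The block sizes in the final overhead calculation (64 weights per constant, 256 constants per second-level constant) follow the QLoRA paper.

```python
def quantize_blockwise(weights, levels, block_size):
    """Return (indices, constants): one absmax constant per block, one level index per weight."""
    indices, constants = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        absmax = max(abs(w) for w in block) or 1.0  # avoid division by zero
        constants.append(absmax)
        for w in block:
            scaled = w / absmax  # now in [-1, 1]
            nearest = min(range(len(levels)), key=lambda i: abs(levels[i] - scaled))
            indices.append(nearest)
    return indices, constants


def dequantize_blockwise(indices, constants, levels, block_size):
    """Invert quantize_blockwise up to rounding error."""
    return [levels[idx] * constants[pos // block_size] for pos, idx in enumerate(indices)]


# Stand-in for NF4: 16 evenly spaced levels in [-1, 1]. The real NF4 levels are
# placed at quantiles of a normal distribution, which this sketch does not do.
LEVELS = [-1 + 2 * i / 15 for i in range(16)]

weights = [0.02, -0.5, 0.31, 0.07, 1.2, -0.3, 0.0, 0.6]
idx, consts = quantize_blockwise(weights, LEVELS, block_size=4)
restored = dequantize_blockwise(idx, consts, LEVELS, block_size=4)
print(idx, consts)
print([round(w, 3) for w in restored])

# Memory overhead of the constants, in bits per weight.
plain = 32 / 64                    # one float32 constant per 64 weights
double = 8 / 64 + 32 / (64 * 256)  # 8-bit constants plus one float32 per 256 of them
print(f"{plain:.3f} -> {double:.3f} bits per weight")
```

Only the 4-bit indices and the per-block constants need to be stored. Double quantization then shrinks the constants themselves, which is where the saving in the last line comes from.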


## Code

Let's fine-tune Llama 2 for sentiment analysis on financial news.


In [1]:
!pip install -q peft transformers datasets evaluate seqeval accelerate bitsandbytes trl

In [2]:
import warnings

warnings.filterwarnings("ignore")

In [3]:
import numpy as np
import pandas as pd
import os
from tqdm import tqdm
import bitsandbytes as bnb
import torch
import torch.nn as nn
import transformers
from datasets import Dataset
from peft import LoraConfig, PeftConfig
from trl import SFTTrainer
from trl import setup_chat_format
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
    logging,
)
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split



In [4]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

### Data preparation

Read the initial data and preprocess it: split it into train, evaluation, and test sets, and wrap the news text in prompts.


In [5]:
df = pd.read_csv(
    "/kaggle/input/sentiment-analysis-for-financial-news/all-data.csv",
    names=["sentiment", "text"],
    encoding="utf-8",
    encoding_errors="replace",
)

X_train = list()
X_test = list()
for sentiment in ["positive", "neutral", "negative"]:
    train, test = train_test_split(
        df[df.sentiment == sentiment], train_size=300, test_size=300, random_state=42
    )
    X_train.append(train)
    X_test.append(test)

X_train = pd.concat(X_train).sample(frac=1, random_state=10)
X_test = pd.concat(X_test)

# Rows used in neither the train nor the test split form the evaluation pool
used_idx = set(X_train.index) | set(X_test.index)
X_eval = df[~df.index.isin(used_idx)]
X_eval = X_eval.groupby("sentiment", group_keys=False).apply(
    lambda x: x.sample(n=50, random_state=10, replace=True)
)
X_train = X_train.reset_index(drop=True)


def generate_prompt(data_point):
    return f"""
            Analyze the sentiment of the news headline enclosed in square brackets, 
            determine if it is positive, neutral, or negative, and return the answer as 
            the corresponding sentiment label "positive" or "neutral" or "negative".

            [{data_point["text"]}] = {data_point["sentiment"]}
            """.strip()


def generate_test_prompt(data_point):
    return f"""
            Analyze the sentiment of the news headline enclosed in square brackets, 
            determine if it is positive, neutral, or negative, and return the answer as 
            the corresponding sentiment label "positive" or "neutral" or "negative".

            [{data_point["text"]}] = """.strip()


X_train = pd.DataFrame(X_train.apply(generate_prompt, axis=1), columns=["text"])
X_eval = pd.DataFrame(X_eval.apply(generate_prompt, axis=1), columns=["text"])

y_true = X_test.sentiment
X_test = pd.DataFrame(X_test.apply(generate_test_prompt, axis=1), columns=["text"])

train_data = Dataset.from_pandas(X_train)
eval_data = Dataset.from_pandas(X_eval)
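
One detail worth knowing about the prompt functions in the cell above: `.strip()` only trims the ends of the string, so the leading spaces of the indented lines inside the triple-quoted f-strings stay in the prompt. The sketch below shows one possible way to build the same prompt without that indentation. `build_prompt` and `PROMPT_HEADER` are hypothetical names, the headline is made up, and changing the prompt format changes what the model sees, so the results reported in this notebook would not necessarily be reproduced with it.

```python
PROMPT_HEADER = (
    "Analyze the sentiment of the news headline enclosed in square brackets, "
    "determine if it is positive, neutral, or negative, and return the answer as "
    'the corresponding sentiment label "positive" or "neutral" or "negative".'
)


def build_prompt(text, sentiment=None):
    """Training prompt when `sentiment` is given, inference prompt otherwise."""
    answer = sentiment if sentiment is not None else ""
    return f"{PROMPT_HEADER}\n\n[{text}] = {answer}".strip()


# Hypothetical headline, not taken from the dataset.
print(build_prompt("Company X reports record quarterly profit", "positive"))
print("---")
print(build_prompt("Company X reports record quarterly profit"))
```

Using one function for both cases also keeps the training and inference prompts from drifting apart.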

### Evaluation

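Evaluation function for the model: it prints the overall accuracy, the accuracy for each label, a classification report, and a confusion matrix.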


In [6]:
def evaluate(y_true, y_pred):
    labels = ["positive", "neutral", "negative"]
    mapping = {"positive": 2, "neutral": 1, "none": 1, "negative": 0}

    def map_func(x):
        return mapping.get(x, 1)

    y_true = np.vectorize(map_func)(y_true)
    y_pred = np.vectorize(map_func)(y_pred)

    # Calculate accuracy
    accuracy = accuracy_score(y_true=y_true, y_pred=y_pred)
    print(f"Accuracy: {accuracy:.3f}")

    # Generate accuracy report
    unique_labels = set(y_true)  # Get unique labels

    for label in unique_labels:
        label_indices = [i for i in range(len(y_true)) if y_true[i] == label]
        label_y_true = [y_true[i] for i in label_indices]
        label_y_pred = [y_pred[i] for i in label_indices]
        accuracy = accuracy_score(label_y_true, label_y_pred)
        print(f"Accuracy for label {label}: {accuracy:.3f}")

    # Generate classification report
    class_report = classification_report(y_true=y_true, y_pred=y_pred)
    print("\nClassification Report:")
    print(class_report)

    # Generate confusion matrix
    conf_matrix = confusion_matrix(y_true=y_true, y_pred=y_pred, labels=[0, 1, 2])
    print("\nConfusion Matrix:")
    print(conf_matrix)

### Load model

Let's create a `BitsAndBytesConfig` object with the following settings and load the LLM with this quantization config:

- `load_in_4bit`: Load the model weights in 4-bit format.
- `bnb_4bit_quant_type`: Use the "nf4" quantization type. 4-bit NormalFloat (NF4) is a new data type that is information-theoretically optimal for normally distributed weights.
- `bnb_4bit_compute_dtype`: Use the float16 data type for computations.
- `bnb_4bit_use_double_quant`: Use double quantization (reduces the average memory footprint by also quantizing the quantization constants, which saves an additional 0.4 bits per parameter).
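
A rough estimate shows why 4-bit loading matters on a 16 GB GPU. The parameter count of 7e9 is an assumed round figure for a "7B" model, and `weight_memory_gib` is an illustrative helper. The estimate covers the weights only and ignores quantization constants, activations, gradients, and optimizer state.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Memory needed to store n_params weights at the given precision, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


n_params = 7e9  # assumed round figure for a "7B" model

for name, bits in [("float32", 32), ("float16", 16), ("nf4", 4)]:
    print(f"{name:>8}: {weight_memory_gib(n_params, bits):5.1f} GiB")
```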


In [7]:
model_name = "/kaggle/input/llama2-7b-hf/Llama2-7b-hf"
compute_dtype = getattr(torch, "float16")

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map=device,
    torch_dtype=compute_dtype,
    quantization_config=bnb_config,
)

model.config.use_cache = False
model.config.pretraining_tp = 1

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

model, tokenizer = setup_chat_format(model, tokenizer)


### Predict

Perform inference with the model.


In [8]:
def predict(test, model, tokenizer):
    # NOTE: `test` is currently unused; prompts are read from the global `X_test`
    y_pred = []
    # Build the generation pipeline once instead of once per prompt
    pipe = pipeline(
        task="text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=1,
        do_sample=False,  # greedy decoding
    )
    for i in tqdm(range(len(X_test))):
        prompt = X_test.iloc[i]["text"]
        result = pipe(prompt)
        # The answer is whatever the model generated after the last "="
        answer = result[0]["generated_text"].split("=")[-1]
        if "positive" in answer:
            y_pred.append("positive")
        elif "negative" in answer:
            y_pred.append("negative")
        elif "neutral" in answer:
            y_pred.append("neutral")
        else:
            y_pred.append("none")
    return y_pred

### Model predictions without fine-tuning


In [9]:
y_pred = predict(X_test, model, tokenizer)

100%|██████████| 900/900 [05:28<00:00,  2.74it/s]


In [10]:
evaluate(y_true, y_pred)

Accuracy: 0.373
Accuracy for label 0: 0.027
Accuracy for label 1: 0.937
Accuracy for label 2: 0.157

Classification Report:
              precision    recall  f1-score   support

           0       0.89      0.03      0.05       300
           1       0.34      0.94      0.50       300
           2       0.67      0.16      0.25       300

    accuracy                           0.37       900
   macro avg       0.63      0.37      0.27       900
weighted avg       0.63      0.37      0.27       900


Confusion Matrix:
[[  8 287   5]
 [  1 281  18]
 [  0 253  47]]
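
The accuracy figures printed above can be recovered directly from the confusion matrix, which is a useful sanity check. The snippet below is plain arithmetic on the numbers from the output above; the per-label accuracy computed by `evaluate` is the diagonal entry divided by the row sum, which is the same quantity as the recall of that label.

```python
# Confusion matrix printed above (rows: true label, columns: predicted label).
conf_matrix = [
    [8, 287, 5],
    [1, 281, 18],
    [0, 253, 47],
]

total = sum(sum(row) for row in conf_matrix)
correct = sum(conf_matrix[i][i] for i in range(len(conf_matrix)))
print(f"Overall accuracy: {correct / total:.3f}")

for label, row in enumerate(conf_matrix):
    # Per-label accuracy = diagonal entry / row sum, i.e. the recall of that label.
    print(f"Accuracy for label {label}: {row[label] / sum(row):.3f}")
```

The middle column dominates: without fine-tuning, most headlines end up mapped to label 1 (neutral), which is also the fallback for unrecognized answers in `evaluate`.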


### Fine-tuning


In [11]:
output_dir = "trained_weigths"

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

training_arguments = TrainingArguments(
    output_dir=output_dir,  # directory to save and repository id
    num_train_epochs=3,  # number of training epochs
    per_device_train_batch_size=1,  # batch size per device during training
    gradient_accumulation_steps=8,  # number of batches to accumulate gradients over before an optimizer update
    gradient_checkpointing=True,  # use gradient checkpointing to save memory
    optim="paged_adamw_32bit",
    save_steps=0,
    logging_steps=25,  # log every 25 steps
    learning_rate=2e-4,  # learning rate, based on QLoRA paper
    weight_decay=0.001,
    fp16=True,
    bf16=False,
    max_grad_norm=0.3,  # max gradient norm based on QLoRA paper
    max_steps=-1,
    warmup_ratio=0.03,  # warmup ratio based on QLoRA paper
    group_by_length=True,
    lr_scheduler_type="cosine",  # use cosine learning rate scheduler
    report_to="tensorboard",  # report metrics to tensorboard
    evaluation_strategy="epoch",  # evaluate at the end of every epoch
)

trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=train_data,
    eval_dataset=eval_data,
    peft_config=peft_config,
    dataset_text_field="text",
    tokenizer=tokenizer,
    max_seq_length=1024,
    packing=False,
    dataset_kwargs={
        "add_special_tokens": False,
        "append_concat_token": False,
    },
)


In [12]:
trainer.train()

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 0     | 0.8003        | 0.701027        |
| 2     | 0.5167        | 0.71586         |


TrainOutput(global_step=336, training_loss=0.7178252325171516, metrics={'train_runtime': 3660.8642, 'train_samples_per_second': 0.738, 'train_steps_per_second': 0.092, 'total_flos': 1.0717877554348032e+16, 'train_loss': 0.7178252325171516, 'epoch': 2.99})
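
The `global_step=336` and `epoch: 2.99` values in the output above follow from the batch settings. The arithmetic below assumes a single GPU and assumes that a trailing, incomplete accumulation group at the end of an epoch is not counted as an optimizer step; both are assumptions about the setup, checked here only against the numbers reported above.

```python
n_train = 900          # training examples (3 labels x 300)
per_device_batch = 1   # per_device_train_batch_size
grad_accum = 8         # gradient_accumulation_steps
epochs = 3             # num_train_epochs

effective_batch = per_device_batch * grad_accum
# Assumption: a trailing partial accumulation group is not counted as a step.
steps_per_epoch = (n_train // per_device_batch) // grad_accum
total_steps = steps_per_epoch * epochs
epochs_covered = total_steps * effective_batch / n_train

print(effective_batch, steps_per_epoch, total_steps, round(epochs_covered, 2))
```

So each optimizer update sees 8 examples, even though only one example is in GPU memory at a time.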

In [13]:
trainer.save_model()
tokenizer.save_pretrained(output_dir)

('trained_weigths/tokenizer_config.json',
 'trained_weigths/special_tokens_map.json',
 'trained_weigths/tokenizer.model',
 'trained_weigths/added_tokens.json',
 'trained_weigths/tokenizer.json')

### Prediction of fine-tuned model


In [14]:
# delete and call garbage collector to free memory
import gc

del [
    model,
    tokenizer,
    peft_config,
    trainer,
    train_data,
    eval_data,
    bnb_config,
    training_arguments,
]
del [df, X_train, X_eval]
del [TrainingArguments, SFTTrainer, LoraConfig, BitsAndBytesConfig]

In [15]:
# empty cuda cache several times
for _ in range(100):
    torch.cuda.empty_cache()
    gc.collect()

### Load trained model and merge with pretrained

`merge_and_unload` merges the LoRA adapters into the base model and returns the resulting standalone model.
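
Conceptually, merging adds the scaled low-rank product to the frozen weight. The sketch below shows this on a tiny hand-made example in pure Python. It is not the PEFT implementation, the matrices are hypothetical, and `merge_lora` is an illustrative helper. The scaling factor `lora_alpha / r` is the standard LoRA scaling; with the `LoraConfig` used in this notebook (`lora_alpha=16`, `r=64`) it would be 0.25.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, lora_a, lora_b, lora_alpha, r):
    """Return W + (lora_alpha / r) * B @ A, the merged weight."""
    scale = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Tiny hypothetical example: a 2x2 weight with a rank-1 update.
w = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]   # shape (2, r)
lora_a = [[0.5, -0.5]]    # shape (r, 2)
merged = merge_lora(w, lora_a, lora_b, lora_alpha=2, r=1)
print(merged)
```

After merging, the adapter matrices are no longer needed at inference time, so the merged model has the same shape and cost as the base model.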


In [16]:
from peft import AutoPeftModelForCausalLM

finetuned_model = "./trained_weigths/"
compute_dtype = getattr(torch, "float16")
tokenizer = AutoTokenizer.from_pretrained("/kaggle/input/llama2-7b-hf/Llama2-7b-hf")

model = AutoPeftModelForCausalLM.from_pretrained(
    finetuned_model,
    torch_dtype=compute_dtype,
    return_dict=False,
    low_cpu_mem_usage=True,
    device_map=device,
)

merged_model = model.merge_and_unload()
merged_model.save_pretrained(
    "./merged_model", safe_serialization=True, max_shard_size="2GB"
)
tokenizer.save_pretrained("./merged_model")


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


('./merged_model/tokenizer_config.json',
 './merged_model/special_tokens_map.json',
 './merged_model/tokenizer.model',
 './merged_model/added_tokens.json',
 './merged_model/tokenizer.json')

In [17]:
y_pred = predict(X_test, merged_model, tokenizer)
evaluate(y_true, y_pred)

100%|██████████| 900/900 [03:48<00:00,  3.94it/s]

Accuracy: 0.854
Accuracy for label 0: 0.900
Accuracy for label 1: 0.857
Accuracy for label 2: 0.807

Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.90      0.93       300
           1       0.76      0.86      0.80       300
           2       0.87      0.81      0.84       300

    accuracy                           0.85       900
   macro avg       0.86      0.85      0.86       900
weighted avg       0.86      0.85      0.86       900


Confusion Matrix:
[[270  27   3]
 [ 10 257  33]
 [  2  56 242]]





In [18]:
del [merged_model, tokenizer, y_pred, model, y_true, test]

for _ in range(100):
    torch.cuda.empty_cache()
    gc.collect()

In [19]:
!nvidia-smi



Thu Apr 11 19:27:40 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla P100-PCIE-16GB           Off | 00000000:00:04.0 Off |                    0 |
| N/A   59C    P0              39W / 250W |   1872MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                         

## Competition

[Competition](https://www.kaggle.com/t/5661dedbc9f849539f335d9ad1a6edf5)

> The goal of the competition is to fine-tune a language model with LoRA for a NER task


### Data


In [20]:
from datasets import Dataset
import pandas as pd

df = pd.read_csv("/kaggle/input/nlp-week-13-fine-tuning-with-lora/train.csv")
df["tokens"] = df["tokens"].apply(lambda x: x.split())
df["tags"] = df["tags"].apply(lambda x: [int(n) for n in x.split()])
dataset = Dataset.from_pandas(df)
splitted_dataset = dataset.train_test_split(test_size=0.2)

In [21]:
label2id = {
    "O": 0,
    "B-DNA": 1,
    "I-DNA": 2,
    "B-protein": 3,
    "I-protein": 4,
    "B-cell_type": 5,
    "I-cell_type": 6,
    "B-cell_line": 7,
    "I-cell_line": 8,
    "B-RNA": 9,
    "I-RNA": 10,
}
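
For inspecting predictions it is handy to have the inverse mapping as well. The sketch below repeats `label2id` so that it runs on its own and adds a hypothetical `decode_tags` helper. The example tokens and tags are made up, in the same format as the `tokens` and `tags` columns.

```python
label2id = {
    "O": 0,
    "B-DNA": 1, "I-DNA": 2,
    "B-protein": 3, "I-protein": 4,
    "B-cell_type": 5, "I-cell_type": 6,
    "B-cell_line": 7, "I-cell_line": 8,
    "B-RNA": 9, "I-RNA": 10,
}
id2label = {idx: label for label, idx in label2id.items()}


def decode_tags(tag_ids):
    """Map integer tag ids back to their BIO label strings."""
    return [id2label[i] for i in tag_ids]


# Hypothetical tokens and tags, in the same format as the `tokens`/`tags` columns.
tokens = ["IL-2", "gene", "expression"]
tags = [1, 2, 0]
print(list(zip(tokens, decode_tags(tags))))
```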