## Fine-tune Google Gemma-2B for Sentiment Analysis
**Credits Original Source**: [https://www.kaggle.com/code/lucamassaron/fine-tune-llama-2-for-sentiment-analysis](https://www.kaggle.com/code/lucamassaron/fine-tune-llama-2-for-sentiment-analysis)
<hr>
<br>

<p align="justify">For this hands-on tutorial on fine-tuning a <b>Gemma-2B</b> model, I am going to deal with a <b>sentiment analysis on financial and economic information</b>. Sentiment analysis on financial and economic information is highly relevant for businesses for several key reasons, ranging from market insights (gain valuable insights into market trends, investor confidence, and consumer behavior) to risk management (identifying potential reputational risks) to investment decisions (gauging the sentiment of stakeholders, investors, and the general public businesses can assess the potential success of various investment opportunities).</p>

<p align="justify">Before the technicalities of fine-tuning a large language model like <b>Google Gemma-2B</b>, we have to find the correct dataset to demonstrate the potentialities of fine-tuning.</p>

<p align="justify">Particularly within the realm of finance and economic texts, annotated datasets are notably rare, with many being exclusively reserved for proprietary purposes. To address the issue of insufficient training data, scholars from the Aalto University School of Business introduced in 2014 a set of <b>approximately 5000 sentences</b>. This collection aimed to establish human-annotated benchmarks, serving as a standard for evaluating alternative modeling techniques. The involved annotators (16 people with adequate background knowledge on financial markets) were instructed to assess the sentences solely from the perspective of an investor, evaluating whether the news potentially holds a positive, negative, or neutral impact on the stock price.</p>

<p align="justify">The <a href="https://github.com/vrunm/Text-Classification-Financial-Phrase-Bank"> FinancialPhraseBank dataset</a> is a comprehensive collection that captures the <b>sentiments of financial news headlines from the viewpoint of a retail investor</b>. Comprising two key columns, namely "Sentiment" and "News Headline," the dataset effectively classifies sentiments as either negative, neutral, or positive. This structured dataset serves as a valuable resource for analyzing and understanding the complex dynamics of sentiment in the domain of financial news. It has been used in various studies and research initiatives, since its inception in the work by Malo, P., Sinha, A., Korhonen, P., Wallenius, J., and Takala, P.  "Good debt or bad debt: Detecting semantic orientations in economic texts.", published in the Journal of the Association for Information Science and Technology in 2014.</p>

As a first step, we install the **specific libraries** necessary to make this example work.

* **Accelerate** is a distributed training library for PyTorch by <a href="https://huggingface.co/">HuggingFace</a>. It allows you to train your models on multiple GPUs or CPUs in parallel (distributed configurations), which can significantly speed up training in presence of multiple GPUs (we won't use it in our example).

* **Peft** is a Python library by HuggingFace for efficient **adaptation of pre-trained language models (PLMs)** to various downstream applications without fine-tuning all the model's parameters. PEFT methods only fine-tune a small number of (extra) model parameters, thereby greatly decreasing the computational and storage costs.
<img src="https://www.researchgate.net/publication/375583776/figure/fig1/AS:11431281204400066@1699770316191/State-of-the-Art-PEFT-techniques.png" width="500px">

* **Bitsandbytes** by Tim Dettmers, is a lightweight **wrapper around CUDA ** custom functions, in particular **8-bit optimizers**, matrix multiplication (LLM.int8()), and **quantization functions**. It allows to run models stored in 4-bit precision: while 4-bit bitsandbytes stores weights in 4-bits, the computation still happens in 16 or 32-bit and here any combination can be chosen (float16, bfloat16, float32, and so on).
<img src="https://miro.medium.com/v2/resize:fit:1400/0*_CNFWP04SoSZB6EE.png" width="500px">

* **Transformers** is a <a href="https://huggingface.co/">HuggingFace</a> project for natural language processing (NLP). It provides a number of pre-trained machine learning and deep learning models for NLP tasks such as text classification, question answering, and machine translation.
* **Datasets** is a <a href="https://huggingface.co/">HuggingFace</a> project that provides a simple interface to load, visualize, manipulate, and convert data into various formats, such as CSV, JSON, Parquet, and many others. The library offers features such as reading and writing data in various formats, manipulating data, creating subsectors and examples, handling data in columns and rows, converting between data formats, and more.
* **Trl** is a full stack library by HuggingFace providing a set of tools to train transformer language models with **Reinforcement Learning**, from the **Supervised Fine-tuning step (SFT)**, Reward Modeling step (RM) to the **Proximal Policy Optimization (PPO)** step.
<img src="https://www.labellerr.com/blog/content/images/2023/06/bannerRELF.webp" width="500px">

* **Unsloth** is an <a href="https://github.com/unslothai/unsloth">open-source tool</a> designed for fine-tuning large language models (LLMs) like Llama 3, Mistral, Phi, and Gemma. It allows users to finetune these models 2-5x faster with 80% less memory usage. Unsloth is particularly noted for its beginner-friendly notebooks that let users add their dataset, run the entire process with a single click, and obtain a finetuned model that can be exported or uploaded to platforms like HuggingFace.

## Notice

❌❌❌ **ATTENTION** ❌❌❌  Your final model and **results** could also **differ significantly** from mine. This is caused by the **randomness** in the **data splits**, **generation** and **training** process! ❌❌❌   

## Installations and imports

In [None]:
!pip install -q -U transformers==4.40.2 datasets

In [None]:
!pip install -q "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install -q --no-deps xformers "trl<0.9.0" peft accelerate bitsandbytes

In [None]:
#Just for easily dump on disk your variables
!pip install -q joblib

The code imports the os module and sets two environment variables:
* **CUDA_VISIBLE_DEVICES**: This environment variable tells PyTorch which GPUs to use. In this case, the code is setting the environment variable to 0, which means that PyTorch will use the first GPU.
* **TOKENIZERS_PARALLELISM**: This environment variable tells the Hugging Face Transformers library whether to parallelize the tokenization process. In this case, the code is setting the environment variable to false, which means that the tokenization process will not be parallelized.

In [None]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

<p align="justify">The code import warnings; warnings.filterwarnings("ignore") imports the warnings module and <b>sets the warning filter to ignore</b>. This means that all warnings will be suppressed and will not be displayed. Actually during training there are many warnings that do not prevent the fine-tuning but can be distracting and make you wonder if you are doing the correct things.

In [None]:
import warnings
warnings.filterwarnings("ignore")

In the following cell there are all the other imports for running the notebook

In [None]:
import os

import numpy as np
import pandas as pd

from tqdm import tqdm #Used in Colab for progress bar

from unsloth import FastLanguageModel #Used for loading a fast version of LLM through Unsloth
import bitsandbytes #Used for quantization
import torch #Used for running the scripts

from datasets import Dataset #Used for loading data

from peft import LoraConfig, PeftConfig #Used for LoRA based PEFT
from trl import SFTTrainer #Used for fine-tuning the model

import transformers #Used for interacting with the LLM
from transformers import (AutoModelForCausalLM,
                          AutoTokenizer,
                          BitsAndBytesConfig,
                          TrainingArguments,
                          pipeline,
                          logging)

#Few methods for evaluation
from sklearn.metrics import (accuracy_score,
                             classification_report,
                             confusion_matrix)
from sklearn.model_selection import train_test_split

In [None]:
print(f"pytorch version {torch.__version__}")

In [None]:
device = "cuda:0"
print(f"working on {device}")

## Preparing the data and the core evaluation functions

The code in the next cell performs the following steps:

1. Download and reads the **input dataset** from the **all-data.csv** file, which is a comma-separated value (CSV) file with two columns: *sentiment* and *text*.

2. **Splits** the dataset into *training* and *test sets*, with **300 samples** for each sentiment label (900 samples in tota for each set). The split is random and *stratified by sentiment*, so that each set contains a representative sample of positive, neutral, and negative sentiments.

3. **Shuffles** the train data (random_state=10).

4. **Transforms** the texts contained in the train and test data **into prompts** to be used by Google Gemma-2B: the train prompts contains the expected answer we want to fine-tune the model with.

5. The **residual examples** not in train or test, for reporting purposes during training (but it won't be used for early stopping), is treated as **validation data**, which is sampled with repetition in order to have a 50/50/50 sample (negative instances are very few, hence they should be repeated)

5. The **train**, **test** and **eval** data **are wrapped by the class Datasets** from Hugging Face (https://huggingface.co/docs/datasets/index)

This prepares in a single cell train_data, eval_data and test_data datasets to be used in our fine tuning.

In [None]:
#Download the Dataset from GitHub
!wget https://github.com/marcopoli/LLaMAntino-3-ANITA/raw/main/use_examples/sentiment_data/all-data.csv

In [None]:
filename = "all-data.csv"

df = pd.read_csv(filename,
                 names=["sentiment", "text"],
                 encoding="utf-8", encoding_errors="replace")

X_train = list()
X_test = list()
for sentiment in ["negative", "neutral", "positive"]:
    train, test  = train_test_split(df[df.sentiment==sentiment],
                                    train_size=300,
                                    test_size=300,
                                    random_state=42)
    X_train.append(train)
    X_test.append(test)

X_train = pd.concat(X_train).sample(frac=1, random_state=10)
X_test = pd.concat(X_test)

eval_idx = [idx for idx in df.index if idx not in list(X_train.index) + list(X_test.index)]
X_eval = df[df.index.isin(eval_idx)]
X_eval = (X_eval
          .groupby('sentiment', group_keys=False)
          .apply(lambda x: x.sample(n=50, random_state=10, replace=True)))
X_train = X_train.reset_index(drop=True)

# Includes the correct answer
def generate_prompt(data_point):
    mapping = {'negative': 0, 'neutral': 1, 'positive': 2}
    return f"""Analyze the sentiment of the following text.
            Report the corresponding sentiment label "0) negative", "1) neutral", "2) positive.".

            Text: {data_point["text"]}]
            Answer: {mapping.get(data_point["sentiment"],1)}""".strip()

# Does not include the correct answer
def generate_test_prompt(data_point):
    return f"""Analyze the sentiment of the following text.
            Report the corresponding sentiment label "0) negative", "1) neutral", "2) positive.".

            Text: {data_point["text"]}]
            Answer: """

X_train = pd.DataFrame(X_train.apply(generate_prompt, axis=1), columns=["text"])
X_eval = pd.DataFrame(X_eval.apply(generate_prompt, axis=1), columns=["text"])

y_true = X_test.sentiment
X_test = pd.DataFrame(X_test.apply(generate_test_prompt, axis=1), columns=["text"])

train_data = Dataset.from_pandas(X_train)
eval_data = Dataset.from_pandas(X_eval)

#Dump our splits on disk for further uses
import joblib
joblib.dump(X_test,"X_test")
joblib.dump(y_true,"y_true")
joblib.dump(train_data,"train_data")
joblib.dump(eval_data,"eval_data")

In [None]:
train_data[3]

## Testing the model without fine-tuning

Next we need to take care of the model, which is a **2B** (**2 billion parameters**, in the **HuggingFace** compatible format), loading by using **quantization**.

Model loading and quantization:

* First the code *loads the Google Gemma-2B* language model from the Hugging Face Hub.
* Then the code gets the **float16** data type from the torch library. This is the data type that will be used for the computations.
* Next, it creates a **BitsAndBytesConfig** object with the following settings:
    1. *load_in_4bit*: Load the model weights in 4-bit format.
    2. *bnb_4bit_quant_type*: Use the "nf4" quantization type. 4-bit **NormalFloat (NF4)**, is a new data type that is information theoretically optimal for normally distributed weights.
    3. *bnb_4bit_compute_dtype*: Use the float16 data type for computations.
    4. *bnb_4bit_use_double_quant*: Use double quantization (reduces the average memory footprint by quantizing also the quantization constants and saves an additional 0.4 bits per parameter.).
* Then the code creates a **AutoModelForCausalLM** object from the pre-trained Google Gemma-2B language model, using the **BitsAndBytesConfig object** for quantization.
* After that, the code **disables caching for the model**.

Tokenizer loading:

* First, the code loads the tokenizer for the Google Gemma-2B language model.
* Then it **sets the padding token to be the end-of-sequence (EOS) token**.
* Finally, the code sets the **padding side to be "right"**, which means that the input sequences will be padded on the right side. This is crucial for correct padding direction (this is the way with Google Gemma-2B).

In [None]:
#An already improved version of the orginal model
model_name = "VAGOsolutions/SauerkrautLM-Gemma-2b"

compute_dtype = torch.float16

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map=device,
    torch_dtype=compute_dtype,
    quantization_config=bnb_config,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

In the next cell, we set a **function for predicting the sentiment** of a news headline using the Google Gemma-2B language model. The function takes three arguments:

- **test**: A Pandas DataFrame containing the news headlines to be predicted.
- **model**: The pre-trained language model.
- **tokenize**r: The tokenizer for language model.

The function works as follows:

1. For each news headline in the test DataFrame:
    * *Create a prompt* for the language model, which asks it to analyze the sentiment of the news headline and return the corresponding sentiment label.
    * Use the **pipeline() function** from the Hugging Face Transformers library to generate text from the language model, using the prompt.
    * Extract the predicted sentiment label from the generated text.
    * Append the predicted sentiment label to the y_pred list.
2. Return the y_pred list.

The **pipeline()** function from the Hugging Face Transformers library **is used to generate text** from the language model. The task argument specifies that the task is text generation. The **max_new_tokens** argument specifies the maximum number of new tokens to generate. The **temperature** argument controls the randomness of the generated text. A *lower temperature will produce more predictable text*, while a higher temperature will produce more creative and unexpected text.

The if **statement checks if the generated text** contains the word *"positive"*. If it does, then the predicted sentiment label is "positive". Otherwise, the if statement checks if the generated text contains the word *"negative"*. If it does, then the predicted sentiment label is "negative". Otherwise, the if statement checks if the generated text contains the word *"neutral"*. If it does, then the predicted sentiment label is "neutral.

In [None]:
def predict(test, model, tokenizer):
    y_pred = []
    for i in tqdm(range(len(test))):
        prompt = test.iloc[i]["text"]
        #print(prompt)
        pipe = pipeline(model=model,
            tokenizer=tokenizer,
            return_full_text=False, # langchain expects the full text
            task='text-generation',
            max_new_tokens=1, # max number of tokens to generate in the output
            temperature=0.1,  #temperature for more or less creative answers
            do_sample=True, #this parameter enables decoding strategies such as top_p
            top_p=0.9,
        )
        result = pipe(prompt)
        #print(result)
        answer = result[0]['generated_text'].split("=")[-1]

        if "positive" in answer or "2" in answer:
            y_pred.append("positive")
        elif "negative" in answer or "0" in answer:
            y_pred.append("negative")
        elif "neutral" in answer or "1" in answer:
            y_pred.append("neutral")
        else:
            y_pred.append("none")
    return y_pred

Next we create a function to **evaluate** the results from our **fine-tuned sentiment model**. The function performs the following steps:

1. Maps the sentiment labels to a numerical representation, where **2** represents **positive**, **1** represents **neutral**, and **0** represents **negative**.
2. Calculates the accuracy of the model on the test data.
3. Generates an accuracy report for each sentiment label.
4. Generates a **classification report** for the model.
5. Generates a confusion matrix for the model.

In [None]:
def evaluate(y_true, y_pred):
    labels = ['negative', 'neutral', 'positive']
    mapping = {'negative': 0, 'neutral': 1, 'positive': 2}

    def map_func(x):
        return mapping.get(x, 4)

    y_true = np.vectorize(map_func)(y_true)
    y_pred = np.vectorize(map_func)(y_pred)

    # Calculate accuracy
    accuracy = accuracy_score(y_true=y_true, y_pred=y_pred)
    print(f'Accuracy: {accuracy:.3f}')

    # Generate accuracy report
    unique_labels = set(y_true)  # Get unique labels

    for label in unique_labels:
        label_indices = [i for i in range(len(y_true))
                         if y_true[i] == label]
        label_y_true = [y_true[i] for i in label_indices]
        label_y_pred = [y_pred[i] for i in label_indices]
        accuracy = accuracy_score(label_y_true, label_y_pred)
        print(f'Accuracy for label {label}: {accuracy:.3f}')

    # Generate classification report
    class_report = classification_report(y_true=y_true, y_pred=y_pred,digits=5)
    print('\nClassification Report:')
    print(class_report)

    # Generate confusion matrix
    conf_matrix = confusion_matrix(y_true=y_true, y_pred=y_pred, labels=[0, 1, 2])
    print('\nConfusion Matrix:')
    print(conf_matrix)

At this point, we are **ready to test the Google Gemma-2B model** and see how it performs on our problem without any fine-tuning. This allows us to get insights on the model itself and establish a baseline.

In [None]:
y_pred = predict(X_test, model, tokenizer)

In the following cell, we evaluate the results. There is little to be said, it is performing really terribly because the model **tends to just predict a neutral sentiment** and seldom it detects positive or negative sentiment.

In [None]:
evaluate(y_true, y_pred)

Because of the **limited amount of VRAM** available in Google Colab, we **delete unused objects** and **free up the VRAM** allocated for them.

In [None]:
#DELETE MODEL FROM VRAM
del model
del tokenizer
import gc
torch.cuda.empty_cache()
gc.collect()

## Fine-tuning

In the next cell we **set everything ready for the fine-tuning**. We configures and initializes a simple **Supervised Fine-tuning Trainer (SFTTrainer)** for training a large language model using the Parameter-Efficient Fine-Tuning (PEFT) method, which should save time as it operates on a reduced number of parameters compared to the model's overall size. **The PEFT method focuses on refining a limited set of (additional) model parameters, while keeping the majority of the pre-trained LLM parameters fixed.** This significantly reduces both computational and storage expenses. Additionally, this strategy addresses the challenge of catastrophic forgetting, which often occurs during the complete fine-tuning of LLMs.

**PEFTConfig**:
The peft_config object specifies the parameters for PEFT. The following are some of the most important parameters:
* lora_alpha: The **learning rate for the LoRA update matrices**.
* lora_dropout: The dropout probability for the LoRA update matrices.
* r: **The rank of the LoRA update matrices**.
* bias: The type of bias to use. The possible values are none, additive, and learned.
* task_type: The type of task that the model is being trained for. The possible values are **CAUSAL_LM** and **MASKED_LM**.


**TrainingArguments**:
The training_arguments object specifies the parameters for training the model. The following are some of the most important parameters:
* *output_dir*: The directory where the training logs and checkpoints will be saved.
* **num_train_epochs**: The number of epochs to train the model for.
* **per_device_train_batch_size**: The number of samples in each batch on each device.
* **gradient_accumulation_steps**: The number of batches to accumulate gradients before updating the model parameters.
* **optim**: The optimizer to use for training the model.
* *save_steps*: The number of steps after which to save a checkpoint.
* *logging_steps*: The number of steps after which to log the training metrics.
* **learning_rate**: The learning rate for the optimizer.
* *weight_decay*: The weight decay parameter for the optimizer.
* *fp16*: Whether to use 16-bit floating-point precision.
* *bf16*: Whether to use BFloat16 precision.
* *max_grad_norm*: The maximum gradient norm.
* *max_steps*: The maximum number of steps to train the model for.
* **warmup_ratio**: The proportion of the training steps to use for warming up the learning rate.
* *group_by_length*: Whether to group the training samples by length.
* **lr_scheduler_type**: The type of learning rate scheduler to use.
* r*eport_to*: The tools to report the training metrics to.
* *evaluation_strategy*: The strategy for evaluating the model during training.

**SFTTrainer**:
The SFTTrainer is a custom trainer class from the TRL library. It is used to train large language models (also using the PEFT method). The **SFTTrainer** object is initialized with the following arguments:

* **model**: The model to be trained.
* **train_dataset**: The training dataset.
* **eval_dataset**: The evaluation dataset.
* *peft_config*: The PEFT configuration.
* *dataset_text_field*: The name of the text field in the dataset.
* *tokenizer*: The tokenizer to use.
* **args**: The training arguments.
* *packing*: Whether to pack the training samples.
* **max_seq_length**: The maximum sequence length.

Once the **SFTTrainer object** is initialized, it **can be used to train the model** by calling the *train()* method

In [None]:
def mapping(message):
  return message

from datasets import Dataset
td = {"text":train_data["text"]}
train_data1 = Dataset.from_dict(td)
train_data2 = train_data1.map(mapping)

ed = {"text":eval_data["text"]}
eval_data1 = Dataset.from_dict(ed)
eval_data2 = eval_data1.map(mapping)



import joblib
joblib.dump(train_data2,"train_data2")
joblib.dump(eval_data2,"eval_data2")

print(train_data2[0]["text"])

In [None]:
from unsloth import FastLanguageModel

max_seq_length = 1024 # Choose any! We auto support RoPE Scaling internally!
dtype = torch.float16 # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
model_name = "VAGOsolutions/SauerkrautLM-Gemma-2b"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 128, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 64,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 422,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

output_dir="trained_weigths"

training_arguments = TrainingArguments(
    output_dir=output_dir,                    # directory to save and repository id
    num_train_epochs=5,                       # number of training epochs
    per_device_train_batch_size=32,            # batch size per device during training
    gradient_accumulation_steps=4,            # number of steps before performing a backward/update pass
    gradient_checkpointing=True,              # use gradient checkpointing to save memory
    optim="paged_adamw_8bit",
    save_steps=0,
    logging_steps=1,                         # log every 1 steps
    learning_rate=2e-4,                       # learning rate, based on QLoRA paper
    weight_decay=0.001,
    fp16=True,
    bf16=False,
    max_grad_norm=0.3,                        # max gradient norm based on QLoRA paper
    warmup_ratio=0.03,                        # warmup ratio based on QLoRA paper
    group_by_length=True,
    lr_scheduler_type="cosine",               # use cosine learning rate scheduler
    report_to="tensorboard",                  # report metrics to tensorboard
    evaluation_strategy="epoch"               # save checkpoint every epoch
)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 10,
    packing = False, # Can make training 5x faster for short sequences.
    args=training_arguments,
    train_dataset=train_data2,
    eval_dataset=eval_data2,
    dataset_kwargs={
        "add_special_tokens": False,
        "append_concat_token": False,
    }
)

The following code will **train** the model using the *trainer.train()* method and then save the trained model to the trained-model directory.

In [None]:
# Train model
with torch.autocast("cuda"):
    trainer.train()

The model and the tokenizer are saved to disk for later usage.

In [None]:
# Save trained model and tokenizer
trainer.model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

## Saving model to disk for later usage

At this point, in order to demonstrate how to **re-utilize the model**, we reload it from the disk and **merge** it with the original **Google Gemma-2B** model.

In fact, when working with **QLoRA**, we exclusively **train adapters** instead of the entire model. So, when you save the model during training, you're only **preserving the adapter weights**, **not the entire model**. If you want to save the full model for easier use with Text Generation Inference, you can merge the adapter weights into the model weights using the **merge_and_unload** method. Then, you can save the model using the save_pretrained method. This will create a default model that's ready for inference tasks.

Then we can proceed to merging the weights and we will be using the merged model for our testing purposes.

In [None]:
#!zip -r trained_weights.zip /content/trained_weigths/

In [None]:
#!unzip trained_weights.zip

In [None]:
#DELETE MODEL FROM VRAM
del model
del trainer
del tokenizer
import gc
torch.cuda.empty_cache()
gc.collect()

In [None]:
from peft import AutoPeftModelForCausalLM
import torch
compute_dtype = torch.float16
finetuned_model = "./trained_weigths/"
tokenizer = AutoTokenizer.from_pretrained("VAGOsolutions/SauerkrautLM-Gemma-2b")

model = AutoPeftModelForCausalLM.from_pretrained(
     finetuned_model,
     torch_dtype=torch.float16,
     return_dict=False,
     low_cpu_mem_usage=True,
     device_map="cuda:0",
)

merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged_model",safe_serialization=True, max_shard_size="2GB")
tokenizer.save_pretrained("./merged_model")

### RESTART SESSION

In [None]:
import os
os.kill(os.getpid(), 9)

## Testing the model with fine-tune

We are now **ready to reload the final merged model** we created and **test it** over the same test set we run original model. We use the same split, strategy and metrics.

In [None]:
import os

import numpy as np
import pandas as pd

from tqdm import tqdm #Used in Colab for progress bar

from unsloth import FastLanguageModel
import bitsandbytes as bnb
import torch

from datasets import Dataset

from peft import LoraConfig, PeftConfig
from trl import SFTTrainer

import transformers
from transformers import (AutoModelForCausalLM,
                          AutoTokenizer,
                          BitsAndBytesConfig,
                          TrainingArguments,
                          pipeline,
                          logging)

from sklearn.metrics import (accuracy_score,
                             classification_report,
                             confusion_matrix)
from sklearn.model_selection import train_test_split

In [None]:
model.push_to_hub('pgajo/gemma-2b-sa')
tokenizer.push_to_hub('pgajo/gemma-2b-sa')

In [None]:
import joblib
X_test = joblib.load("X_test")
y_true = joblib.load("y_true")

In [None]:
def predict(test, model, tokenizer):
    y_pred = []
    for i in tqdm(range(len(test))):
        prompt = test.iloc[i]["text"]
        #print(prompt)
        pipe = pipeline(model=model,
            tokenizer=tokenizer,
            return_full_text=False, # langchain expects the full text
            task='text-generation',
            max_new_tokens=1, # max number of tokens to generate in the output
            temperature=0.1,  #temperature for more or less creative answers
            do_sample=True,
            top_p=0.9,
        )
        result = pipe(prompt)
        #print(result)
        answer = result[0]['generated_text'].split("=")[-1]

        if "positive" in answer or "2" in answer:
            y_pred.append("positive")
        elif "negative" in answer or "0" in answer:
            y_pred.append("negative")
        elif "neutral" in answer or "1" in answer:
            y_pred.append("neutral")
        else:
            y_pred.append("none")
    return y_pred

In [None]:
def evaluate(y_true, y_pred):
    labels = ['negative', 'neutral', 'positive']
    mapping = {'negative': 0, 'neutral': 1, 'positive': 2}

    def map_func(x):
        return mapping.get(x, 4)

    y_true = np.vectorize(map_func)(y_true)
    y_pred = np.vectorize(map_func)(y_pred)

    # Calculate accuracy
    accuracy = accuracy_score(y_true=y_true, y_pred=y_pred)
    print(f'Accuracy: {accuracy:.3f}')

    # Generate accuracy report
    unique_labels = set(y_true)  # Get unique labels

    for label in unique_labels:
        label_indices = [i for i in range(len(y_true))
                         if y_true[i] == label]
        label_y_true = [y_true[i] for i in label_indices]
        label_y_pred = [y_pred[i] for i in label_indices]
        accuracy = accuracy_score(label_y_true, label_y_pred)
        print(f'Accuracy for label {label}: {accuracy:.3f}')

    # Generate classification report
    class_report = classification_report(y_true=y_true, y_pred=y_pred,digits=5)
    print('\nClassification Report:')
    print(class_report)

    # Generate confusion matrix
    conf_matrix = confusion_matrix(y_true=y_true, y_pred=y_pred, labels=[0, 1, 2])
    print('\nConfusion Matrix:')
    print(conf_matrix)

In [None]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "merged_model",
    device_map="cuda:0",
    torch_dtype=torch.float16,
    quantization_config=bnb_config,
)

tokenizer = AutoTokenizer.from_pretrained("merged_model",
                                          trust_remote_code=True,
                                         )

In [None]:
y_pred = predict(X_test, model, tokenizer)

In [None]:
evaluate(y_true, y_pred)

The following code will create a Pandas DataFrame called evaluation containing the text, true labels, and predicted labels from the test set. It will be saved on the disk for later usage. This is expectially useful for understanding the errors that the fine-tuned model makes, and gettting insights on how to improve the prompt.

In [None]:
evaluation = pd.DataFrame({'text': X_test["text"],
                           'y_true':y_true,
                           'y_pred': y_pred},
                         )
evaluation.to_csv("test_predictions.csv", index=False)
print(evaluation)

In [None]:
#Update TRL library for next step
!pip install trl==0.9.4

In [None]:
#restart session
import os
os.kill(os.getpid(), 9)

## Direct Preference Optimization (DPO)

❌❌❌ **ATTENTION** ❌❌❌  You need at least an **L4 GPU** with **24GB** VRAM!

**Reinforcement Learning from Human Feedback (RLHF)** and **Direct Preference Optimization (DPO)** are two approaches in the field of large-scale language models used to *enhance models through human guidance*. <br><br><br>

<div>
<img src="https://i0.wp.com/crowdworks.blog/wp-content/uploads/2024/01/rl_2.png?w=1124&ssl=1" width="500"/>
</div><br><br>

*Reinforcement Learning (RL)* has been used in language model training to **optimize parameters**, **maximizing expected rewards** from the reward model. Traditional LLM training minimizes errors concerning correct answers. The **reward function acts as a learnable loss function tailored to the end goal**, providing greater optimization freedom. In RLHF, the objective function is the reward model, and RL is used to optimize that objective function.<br><br>

<div>
<img src="https://miro.medium.com/v2/resize:fit:844/0*w2bd-x0Hx0SAyap2.png" png?w=1124&ssl=1" width="400"/>
</div><br>

*Direct Preference Optimization (DPO)* is an alternative method following RLHF. It simplifies the process by creating a dataset of **human preference pairs**, each containing a **prompt and two possible completions—one preferred and one misreferred**. DPO is a computationally lightweight approach that *treats the constrained reward maximization problem as a classification problem* on human preference data, eliminating the need for reward model fitting, extensive sampling, and extensive hyperparameter tuning.

- **Supervised fine-tuning (SFT) is the initial step in DPO**, where an LLM is trained on a labeled dataset to create a clear mapping between inputs and desired outputs. This method, when combined with preference learning, molds the model’s responses based on human-defined criteria, ensuring they align more closely with specific requirements. SFT refines the model’s outputs to ensure they are not only accurate but also appropriate and consistent.

- **Preference data is a set of options or alternatives related to a specific prompt evaluated by annotators based on guidelines**. The goal is to **rank** these options from most preferred to least preferred, providing insights into human preferences. This information is used to fine-tune models to produce outputs that align with human expectations. After Supervised Fine-tuning (SFT), the model undergoes *preference learning* using preference data, ideally from the same distribution as the SFT examples. **DPO’s simplicity lies in defining preference loss as a function of the policy**.



Let's prepare the **DPO training data**! We need to define an *HuggingFace Dataset* including the **"prompt"**, **"chosen"** and **"rejected"** fields.

DPO dataset basically consists of triplets (prompt, chosen answer, rejected answer). In other words, for each prompt, there is a better response and a worse response.

In [1]:
import random
from datasets import Dataset
import joblib
#Reload our training data
train_data = joblib.load("train_data")

prompts = []
chosen = []
rejected = []

#extract prompt, chosen and rejected
for data in train_data:
  pro = str(data["text"].split("Answer: ")[0])+"Answer: "
  cho = str(data["text"].split("Answer: ")[-1])

  ran = random.randint(0, 2)
  while str(ran) == cho:
    ran = random.randint(0, 2)
  rej = str(ran)

  prompts.append(pro)
  chosen.append(cho)
  rejected.append(rej)

print(len(rejected))
#Create an HuggingFace Dataset
dpo_dat = {"prompt": prompts, "chosen": chosen, "rejected": rejected}
train_dpo = Dataset.from_dict(dpo_dat, split="train")
train_dpo=train_dpo.map()

  from .autonotebook import tqdm as notebook_tqdm


900


Map: 100%|██████████| 900/900 [00:00<00:00, 31813.59 examples/s]


Let's have a look at the output of our preparation step. We can clearly identify the **prompt** the **correct answer** and **the one we would like to avoid**.

In [2]:
train_dpo[6]

{'prompt': 'Analyze the sentiment of the following text.\n            Report the corresponding sentiment label "0) negative", "1) neutral", "2) positive.".\n\n            Text: That \'s what I go to bed worrying about every night , \' he said .]\n            Answer: ',
 'chosen': '0',
 'rejected': '1'}

Let's repeat the same process for the **evaluation** dataset.

In [3]:
import joblib
eval_data = joblib.load("eval_data")

prompts = []
chosen = []
rejected = []

for data in eval_data:
  pro = str(data["text"].split("Answer: ")[0])+"Answer: "
  cho = str(data["text"].split("Answer: ")[-1])

  ran = random.randint(0, 2)
  while str(ran) == cho:
    ran = random.randint(0, 2)
  rej = str(ran)

  prompts.append(pro)
  chosen.append(cho)
  rejected.append(rej)

print(len(rejected))
dpo_dat = {"prompt": prompts, "chosen": chosen, "rejected": rejected}
eval_dpo = Dataset.from_dict(dpo_dat, split="train")
eval_dpo = eval_dpo.map()

150


Map: 100%|██████████| 150/150 [00:00<00:00, 19828.10 examples/s]


In [4]:
eval_dpo[0]

{'prompt': 'Analyze the sentiment of the following text.\n            Report the corresponding sentiment label "0) negative", "1) neutral", "2) positive.".\n\n            Text: In addition , the company will reduce a maximum of ten jobs .]\n            Answer: ',
 'chosen': '0',
 'rejected': '2'}

It is time to prepare the model to be trained. Then **re-import** all the **libraries** we need for the training step.

In [5]:
import os

import numpy as np
import pandas as pd

from tqdm import tqdm #Used in Colab for progress bar

from unsloth import FastLanguageModel
import bitsandbytes as bnb
import torch

from datasets import Dataset

from peft import LoraConfig, PeftConfig
from trl import SFTTrainer, DPOTrainer, DPOConfig

import transformers
from transformers import (AutoModelForCausalLM,
                          AutoTokenizer,
                          BitsAndBytesConfig,
                          TrainingArguments,
                          pipeline,
                          logging)

from sklearn.metrics import (accuracy_score,
                             classification_report,
                             confusion_matrix)
from sklearn.model_selection import train_test_split

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


In order to make the DPO training step works, we need to load two models:
- **Base model** (model)
- **Reference model** (ref_model)

<div>
<img src="https://media.licdn.com/dms/image/D5622AQEWhQLaWly3Og/feedshare-shrink_2048_1536/0/1705254332970?e=1721865600&v=beta&t=aSJT0P3oftaYWcLzm8kvdvYSuY4ex94sEZFIe0SPinA" width="700px">
</div>

At the start of the fine-tuning process, a duplicate of the language model (LM) is generated and its trainable **parameters are set to be frozen** which is referred as **reference/frozen model**.
- **Scoring Responses**: For each data point in the dataset, both the **base** and **reference** language models *score the chosen and rejected responses*.
- **Ratio Calculation**: The **ratio between the scores** assigned by the base language model (Rpolicy) and those given by the frozen language model (Rreference) *is determined*.
- **Loss Functio**n: These ratios are then used to calculate the **final loss function** called **dpo_loss** that is used to modify the model weights in the gradient descent update.

In summary, DPO provides an efficient means of fine-tuning language models based on human preferences without the complexities associated with reinforcement learning. Similar to what we hve done during the SFT training, it is possible to use **LoRA** and **quantization** strategies to makes the model fit into memory. During DPO it is suggested to use **r** and **alpha** equal and a **lora_dropout = 0.05**.

In [6]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 4096 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "merged_model", # Choose ANY! eg mistralai/Mistral-7B-Instruct-v0.2
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    return_dict=True
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

ref_model, ref_tokenizer = FastLanguageModel.from_pretrained(
    model_name = "merged_model",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    return_dict=True
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 32, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32,
    lora_dropout = 0.05, # Currently only supports dropout = 0
    bias = "none",    # Currently only supports bias = "none"
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ

)

==((====))==  Unsloth: Fast Gemma patching release 2024.6
   \\   /|    GPU: NVIDIA RTX 5000 Ada Generation. Max memory: 31.599 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.1+cu121. CUDA = 8.9. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. Xformers = 0.0.26.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Gemma's activation function should be approximate GeLU and not exact GeLU.
Changing the activation function to `gelu_pytorch_tanh`.if you want to use the legacy `gelu`, edit the `model.config` to set `hidden_activation=gelu`   instead of `hidden_act`. See https://github.com/huggingface/transformers/pull/29402 for more details.
Loading checkpoint shards: 100%|██████████| 5/5 [00:02<00:00,  1.82it/s]
Unsloth: Will load merged_model as a legacy tokenizer.


==((====))==  Unsloth: Fast Gemma patching release 2024.6
   \\   /|    GPU: NVIDIA RTX 5000 Ada Generation. Max memory: 31.599 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.1+cu121. CUDA = 8.9. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. Xformers = 0.0.26.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Loading checkpoint shards: 100%|██████████| 5/5 [00:02<00:00,  2.05it/s]
Unsloth: Will load merged_model as a legacy tokenizer.
Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.05.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2024.6 patched 18 layers with 0 QKV layers, 0 O layers and 0 MLP layers.


We are ready to configure our **DPOTrainer** object. We are now using the DPOConfig object to guide the process correctly. Few tricks about DPO are about the correct choose of **learning_rate**, **usually very low** and **number_train_epochs** *between 1 and 3*.

In [7]:
from transformers import TrainingArguments
from trl import DPOTrainer
from unsloth import is_bfloat16_supported

train_args = DPOConfig(
        per_device_train_batch_size = 4,      # batch size per device during training
        gradient_accumulation_steps = 4,      # number of steps before performing a backward/update pass
        warmup_ratio = 0.1,                   # warmup ratio based on QLoRA paper
        num_train_epochs = 2,                 # number of training epochs
        learning_rate = 5.0e-7,               # lower LR than QLoRA paper
        fp16 = not is_bfloat16_supported(),   # use float16 precision
        bf16 = is_bfloat16_supported(),       # use bfloat16 precision
        logging_steps = 1,                    # log every 1 steps
        optim = "adamw_8bit",                 # use adamw 8 bit optimizer
        weight_decay = 0.0,                   # do not use weight_decay
        lr_scheduler_type = "cosine",         # use cosine learning rate scheduler
        seed = 42,
        output_dir = "outputs",
    )

dpo_trainer = DPOTrainer(
    model = model,
    ref_model = ref_model,
    args = train_args,
    beta = 0.1,                               # use beta as in DPO paper
    train_dataset = train_dpo,
    eval_dataset = eval_dpo,
    tokenizer = tokenizer,
    max_length = 1024,                        # limit lenght for efficiency and VRAM
    max_prompt_length = 512,
)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Map: 100%|██████████| 900/900 [00:01<00:00, 879.89 examples/s]
Map: 100%|██████████| 150/150 [00:00<00:00, 907.73 examples/s] 


In [8]:
#Just run the training function
dpo_trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 900 | Num Epochs = 2
O^O/ \_/ \    Batch size per device = 4 | Gradient Accumulation steps = 4
\        /    Total batch size = 16 | Total steps = 112
 "-____-"     Number of trainable parameters = 39,223,296
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoi

Step,Training Loss
1,0.6936
2,0.6942
3,0.6936
4,0.6946
5,0.6932
6,0.6929
7,0.6925
8,0.6925
9,0.6921
10,0.6927


TrainOutput(global_step=112, training_loss=0.6930933025266443, metrics={'train_runtime': 110.9853, 'train_samples_per_second': 16.218, 'train_steps_per_second': 1.009, 'total_flos': 0.0, 'train_loss': 0.6930933025266443, 'epoch': 1.991111111111111})

In [9]:
# Save trained model and tokenizer
output_dir="trained_weigths"
dpo_trainer.model.save_pretrained(output_dir+"_DPO")
tokenizer.save_pretrained(output_dir+"_DPO")

('trained_weigths_DPO/tokenizer_config.json',
 'trained_weigths_DPO/special_tokens_map.json',
 'trained_weigths_DPO/tokenizer.model',
 'trained_weigths_DPO/added_tokens.json')

**Load and merge adapters** before moving to the Test step.

In [11]:
from peft import AutoPeftModelForCausalLM
import torch
compute_dtype = torch.float16
finetuned_model = "./trained_weigths_DPO/"
tokenizer = AutoTokenizer.from_pretrained("merged_model")

model = AutoPeftModelForCausalLM.from_pretrained(
     finetuned_model,
     torch_dtype=torch.float16,
     return_dict=False,
     low_cpu_mem_usage=True,
     device_map="cuda:0",
)

merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged_model_DPO",safe_serialization=True, max_shard_size="2GB")
tokenizer.save_pretrained("./merged_model_DPO")

Loading checkpoint shards: 100%|██████████| 5/5 [00:02<00:00,  2.22it/s]


('./merged_model_DPO/tokenizer_config.json',
 './merged_model_DPO/special_tokens_map.json',
 './merged_model_DPO/tokenizer.model',
 './merged_model_DPO/added_tokens.json',
 './merged_model_DPO/tokenizer.json')

In [None]:
#Just restart the session
import os
os.kill(os.getpid(), 9)

: 

## Testing the model with DPO

We are now **ready to reload the final merged model** we created and **test it** over the same test set we run original model. We use the same split, strategy and metrics.

In [1]:
import os

import numpy as np
import pandas as pd

from tqdm import tqdm #Used in Colab for progress bar

from unsloth import FastLanguageModel
import bitsandbytes as bnb
import torch

from datasets import Dataset

from peft import LoraConfig, PeftConfig
from trl import SFTTrainer

import transformers
from transformers import (AutoModelForCausalLM,
                          AutoTokenizer,
                          BitsAndBytesConfig,
                          TrainingArguments,
                          pipeline,
                          logging)

from sklearn.metrics import (accuracy_score,
                             classification_report,
                             confusion_matrix)
from sklearn.model_selection import train_test_split

  from .autonotebook import tqdm as notebook_tqdm


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


In [2]:
import joblib
X_test = joblib.load("X_test")
y_true = joblib.load("y_true")

In [3]:
def predict(test, model, tokenizer):
    y_pred = []
    for i in tqdm(range(len(test))):
        prompt = test.iloc[i]["text"]
        #print(prompt)
        pipe = pipeline(model=model,
            tokenizer=tokenizer,
            return_full_text=False, # langchain expects the full text
            task='text-generation',
            max_new_tokens=1, # max number of tokens to generate in the output
            temperature=0.1,  #temperature for more or less creative answers
            do_sample=True,
            top_p=0.9,
        )
        result = pipe(prompt)
        #print(result)
        answer = result[0]['generated_text'].split("=")[-1]

        if "positive" in answer or "2" in answer:
            y_pred.append("positive")
        elif "negative" in answer or "0" in answer:
            y_pred.append("negative")
        elif "neutral" in answer or "1" in answer:
            y_pred.append("neutral")
        else:
            y_pred.append("none")
    return y_pred

In [4]:
def evaluate(y_true, y_pred):
    labels = ['negative', 'neutral', 'positive']
    mapping = {'negative': 0, 'neutral': 1, 'positive': 2}

    def map_func(x):
        return mapping.get(x, 4)

    y_true = np.vectorize(map_func)(y_true)
    y_pred = np.vectorize(map_func)(y_pred)

    # Calculate accuracy
    accuracy = accuracy_score(y_true=y_true, y_pred=y_pred)
    print(f'Accuracy: {accuracy:.3f}')

    # Generate accuracy report
    unique_labels = set(y_true)  # Get unique labels

    for label in unique_labels:
        label_indices = [i for i in range(len(y_true))
                         if y_true[i] == label]
        label_y_true = [y_true[i] for i in label_indices]
        label_y_pred = [y_pred[i] for i in label_indices]
        accuracy = accuracy_score(label_y_true, label_y_pred)
        print(f'Accuracy for label {label}: {accuracy:.3f}')

    # Generate classification report
    class_report = classification_report(y_true=y_true, y_pred=y_pred,digits=5)
    print('\nClassification Report:')
    print(class_report)

    # Generate confusion matrix
    conf_matrix = confusion_matrix(y_true=y_true, y_pred=y_pred, labels=[0, 1, 2])
    print('\nConfusion Matrix:')
    print(conf_matrix)

In [5]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "merged_model_DPO",
    device_map="cuda:0",
    torch_dtype=torch.float16,
    quantization_config=bnb_config,
)

tokenizer = AutoTokenizer.from_pretrained("merged_model_DPO",
                                          trust_remote_code=True,
                                         )

Gemma's activation function should be approximate GeLU and not exact GeLU.
Changing the activation function to `gelu_pytorch_tanh`.if you want to use the legacy `gelu`, edit the `model.config` to set `hidden_activation=gelu`   instead of `hidden_act`. See https://github.com/huggingface/transformers/pull/29402 for more details.
Loading checkpoint shards: 100%|██████████| 5/5 [00:02<00:00,  2.23it/s]


In [6]:
y_pred = predict(X_test, model, tokenizer)

100%|██████████| 900/900 [00:22<00:00, 39.42it/s]


In [7]:
evaluate(y_true, y_pred)

Accuracy: 0.499
Accuracy for label 0: 0.713
Accuracy for label 1: 0.473
Accuracy for label 2: 0.310

Classification Report:
              precision    recall  f1-score   support

           0    0.53500   0.71333   0.61143       300
           1    0.53788   0.47333   0.50355       300
           2    0.39574   0.31000   0.34766       300
           4    0.00000   0.00000   0.00000         0

    accuracy                        0.49889       900
   macro avg    0.36716   0.37417   0.36566       900
weighted avg    0.48954   0.49889   0.48755       900


Confusion Matrix:
[[214  41  45]
 [ 60 142  97]
 [126  81  93]]


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


The results shown a **good improvement in overall performances** with a low decrement in scores obtained for the "positive" label. The final F1-score seems **convincing and good-enough for a very small LLM such as Google Gemma-2b**.

Just upload the final model to HuggingFace for future reuse! 💯

In [None]:
!huggingface-cli login

In [8]:
!huggingface-cli upload pgajo/Gemma-2B-SA-AILC-Lectures-2024 merged_model_DPO .

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.
model-00001-of-00005.safetensors:   0%|             | 0.00/1.97G [00:00<?, ?B/s]
model-00002-of-00005.safetensors:   0%|             | 0.00/1.90G [00:00<?, ?B/s][A

model-00003-of-00005.safetensors:   0%|             | 0.00/1.90G [00:00<?, ?B/s][A[A


model-00004-of-00005.safetensors:   0%|             | 0.00/1.93G [00:00<?, ?B/s][A[A[A



model-00005-of-00005.safetensors:   0%|             | 0.00/1.28G [00:00<?, ?B/s][A[A[A[A




Upload 7 LFS files:   0%|                                 | 0/7 [00:00<?, ?it/s][A[A[A[A[A



model-00005-of-00005.safetensors:   0%| | 16.4k/1.28G [00:03<66:25:08, 5.37kB/s][A[A[A[A
model-00002-of-00005.safetensors:   0%| | 16.4k/1.90G [00:03<98:21:36, 5.35kB/s][A

model-00003-of-00005.safetensors:   0%| | 16.4k/1.90G [00:03<98:20:27, 5.36kB/s][A[A


model-00001-of-0000

In [None]:
#Final model: https://huggingface.co/m-polignano-uniba/Gemma-2B-SA-AILC-Lectures-2024