<a href="https://www.kaggle.com/code/josebambora/mistral-sentimental-analysis?scriptVersionId=177046582" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Introduction

This notebook walks through the process of fine-tuning the Mistral LLM for a sentiment analysis task.

This notebook is based on a public notebook by [Luca Massaron](https://www.kaggle.com/code/lucamassaron/fine-tune-mistral-v0-2-for-sentiment-analysis).

This notebook was developed for a university project. Its main goal is to try different LLMs, fine-tune them when possible, and compare their performance on the sentiment analysis task.

## Installs

We first install pinned versions of the required packages. The following table lists each package, its version, and its purpose:

| Package       | Version         | Purpose                                                                |
|---------------|-----------------|------------------------------------------------------------------------|
| torch         | 2.0.0           | Loading and tuning the model on the GPU.                               |
| accelerate    | 0.25.0          | Optimizing the tuning process.                                         |
| peft          | 0.7.1           | Parameter-efficient fine-tuning; provides the LoRA configuration.      |
| bitsandbytes  | 0.41.3.post2    | Applying quantization to reduce model size.                            |
| transformers  | 4.36.1          | Loading Hugging Face models and tokenizers, such as Mistral.           |
| trl           | 0.7.4           | Provides `SFTTrainer`, the supervised fine-tuning trainer.             |
| datasets      | latest          | Package to load public datasets on Hugging Face, such as IMDB.         |

In [1]:
!pip install -q -U torch=='2.0.0'

In [2]:
!pip install -q -U accelerate=='0.25.0' peft=='0.7.1' bitsandbytes=='0.41.3.post2' transformers=='4.36.1' trl=='0.7.4'

## Imports

Here we import the necessary libraries.

In [3]:
from huggingface_hub import notebook_login
import os
import warnings
import numpy as np
import pandas as pd
from tqdm import tqdm
import bitsandbytes as bnb
import torch
import torch.nn as nn
import transformers
from datasets import Dataset, concatenate_datasets
from peft import LoraConfig, PeftConfig
from trl import SFTTrainer
from transformers import (AutoModelForCausalLM,
                          AutoTokenizer,
                          BitsAndBytesConfig,
                          TrainingArguments,
                          pipeline,
                          logging)
from sklearn.metrics import accuracy_score
from datasets import load_dataset
import re
import requests
import gzip
import shutil



## Hugging Face Login

Log in to Hugging Face in order to access the Mistral model.

In [4]:
notebook_login()


In [5]:
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["TOKENIZERS_PARALLELISM"] = "false"
warnings.filterwarnings("ignore")

## Data collection and preparation

To fine-tune Mistral we use the [IMDb dataset](https://huggingface.co/datasets/stanfordnlp/imdb). Our [test data](https://github.com/rasbt/python-machine-learning-book-3rd-edition/raw/master/ch08/movie_data.csv.gz) comes from a public GitHub repository.

The function *save_data* gets [test data](https://github.com/rasbt/python-machine-learning-book-3rd-edition/raw/master/ch08/movie_data.csv.gz) and stores it locally.

In [6]:
def save_data():
    url = "https://github.com/rasbt/python-machine-learning-book-3rd-edition/raw/master/ch08/movie_data.csv.gz"
    filename = url.split("/")[-1]  # movie_data.csv.gz

    # Download the compressed archive and fail loudly on HTTP errors
    r = requests.get(url)
    r.raise_for_status()
    with open(filename, "wb") as f:
        f.write(r.content)

    # Decompress the archive to movie_data.csv
    with gzip.open(filename, 'rb') as f_in:
        with open('movie_data.csv', 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)

save_data()

Each instruction passed to Mistral must begin with the `[INST]` tag and end with the `[/INST]` tag. Due to this requirement, we created the functions *generate_prompt* and *generate_test_prompt*. These functions are applied to every element in the training data and test data, respectively. The training prompt ends with the expected label, while the test prompt leaves the answer empty for the model to complete. In both functions, we include an instruction that tells Mistral which task it is expected to perform.

In [7]:
def generate_prompt(data_point):
    label = 'positive'
    if data_point["label"] != 1:
        label = 'negative'
    res = f"""
            [INST]Analyze the sentiment of the movie review enclosed in square brackets,
            determine if it is positive, or negative, and return the answer as
            the corresponding sentiment label "positive" or "negative"[/INST]

            [{data_point["text"]}] = {label}""".strip()
    return re.sub(r'\s+', ' ', res)

def generate_test_prompt(data_point):
    res = f"""
            [INST]Analyze the sentiment of the movie review enclosed in square brackets,
            determine if it is positive, or negative, and return the answer as
            the corresponding sentiment label "positive" or "negative"[/INST]

            [{data_point}] = """.strip()
    return re.sub(r'\s+', ' ', res)
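
To make the prompt format concrete, here is a minimal sketch that rebuilds the same template in a single helper and prints both variants for a short made-up review. The helper name `build_prompt` and the sample review are illustrative assumptions and are not part of the notebook.

```python
import re

# Instruction text copied from generate_prompt / generate_test_prompt.
INSTRUCTION = (
    '[INST]Analyze the sentiment of the movie review enclosed in square brackets, '
    'determine if it is positive, or negative, and return the answer as '
    'the corresponding sentiment label "positive" or "negative"[/INST]'
)

def build_prompt(review, label=None):
    """Hypothetical helper: training prompt if a label is given, else test prompt."""
    answer = "" if label is None else ("positive" if label == 1 else "negative")
    prompt = f"{INSTRUCTION} [{review}] = {answer}"
    # Strip the ends and collapse runs of whitespace, as the notebook functions do.
    return re.sub(r"\s+", " ", prompt.strip())

train_prompt = build_prompt("A  wonderful\nfilm.", label=1)
test_prompt = build_prompt("A  wonderful\nfilm.")
print(train_prompt)
print(test_prompt)
```

The training prompt ends with `= positive`, whereas the test prompt stops right after `=`, which is where the model is expected to continue.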

Afterward, we defined the *select* and *generate_data* functions. The *select* function keeps only the examples with a given label, shuffles them, and returns a fixed number of them. We did this to reduce the dataset size and consequently decrease training times. Another important reason for this function is to maintain data balance. The *generate_data* function simply applies the *generate_prompt* function to the training data.

In [8]:
random_seed = 2000

def select(data,label_result,range_num):
    return data.filter(lambda example: example['label'] == label_result).shuffle(seed=random_seed).select(range(range_num))

def generate_data(data):
    return data.shuffle(seed=random_seed).map(lambda elem : {'text': generate_prompt(elem)})

The final data-processing step is the function *prepare_data*. It calls *prepare_data_train*, which builds the training and validation sets: we extract 1000 reviews from the original dataset, of which 900 are used for training and 100 for validation. We preserved the data balance, ensuring an equal number of positive and negative reviews in both sets. It also calls *prepare_data_test*, which builds the set used solely to test the model before and after training. For this set, we followed the same procedure as we did for *DistilBERT* so that we can compare model performance on the same test data.

In [23]:
def prepare_data_train(imdb):
    positive_rows = select(imdb['train'],1,500)
    negative_rows = select(imdb['train'],0,500)
    
    positive_rows_train = positive_rows.select(indices=range(450))
    negative_rows_train = negative_rows.select(indices=range(450))
    positive_rows_eval  = positive_rows.select(indices=range(450, 500))
    negative_rows_eval  = negative_rows.select(indices=range(450, 500))
    
    selected_rows_train = concatenate_datasets([positive_rows_train, negative_rows_train])
    selected_rows_eval  = concatenate_datasets([positive_rows_eval, negative_rows_eval])
    
    data_train = generate_data(selected_rows_train)
    data_eval  = generate_data(selected_rows_eval)
    return data_train,data_eval

def prepare_data_test(df):
    # Copy the slice so the column assignment below does not write to a view of df
    X_test = df.iloc[40000:42500].copy()
    X_test['text'] = X_test['review'].apply(generate_test_prompt)
    y_true = list(X_test['sentiment'])
    return X_test.drop(['sentiment','review'],axis=1), y_true

def prepare_data():
    df = pd.read_csv('movie_data.csv')
    imdb = load_dataset('imdb')
    data_train, data_eval = prepare_data_train(imdb)
    X_test, y_true = prepare_data_test(df)
    return data_train, data_eval, X_test, y_true

In [24]:
data_train, data_eval, X_test, y_true = prepare_data()
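
The balanced 900/100 split can be illustrated without downloading anything. The following is a toy sketch on synthetic rows, not the notebook's implementation: `balanced_split` is a hypothetical function that mimics *select* followed by *prepare_data_train* using plain Python lists instead of a Hugging Face `Dataset`.

```python
import random
from collections import Counter

def balanced_split(rows, per_class=500, n_train=450, seed=2000):
    """Toy version of select + prepare_data_train on a list of dicts."""
    rng = random.Random(seed)
    train, evaluation = [], []
    for label in (1, 0):
        # Keep one class, shuffle it, and take a fixed number of rows.
        subset = [row for row in rows if row["label"] == label]
        rng.shuffle(subset)
        subset = subset[:per_class]
        # The first n_train rows go to training, the rest to validation.
        train.extend(subset[:n_train])
        evaluation.extend(subset[n_train:])
    return train, evaluation

# Synthetic, deliberately imbalanced stand-in for the IMDb training split.
rows = [{"text": f"review {i}", "label": 1 if i % 3 else 0} for i in range(3000)]
train, evaluation = balanced_split(rows)
train_counts = Counter(row["label"] for row in train)
eval_counts = Counter(row["label"] for row in evaluation)
print(train_counts, eval_counts)
```

Even though the synthetic input contains twice as many positive rows as negative ones, both output sets contain the same number of examples per class.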

In [11]:
# Debug Messages, uncomment if necessary
# print(data_train)
# print(data_eval)
# print(data_train[0]['text'])
# print(X_test.info())
# print(y_true)

## Functions for Model Evaluation


Before obtaining the model itself and tuning it, we need to define a method for evaluating it. For that purpose, we have defined multiple functions.

Firstly, we have the functions *accuracy_for_label* and *evaluate*. As their names suggest, the first function returns the accuracy of model answers for each classification (positive or negative), while the second function evaluates the overall model responses.

The metrics we have utilized include overall accuracy and accuracy for each label (positive and negative reviews).

In [12]:
def accuracy_for_label(y_true, y_pred, label):
    label_indices = [i for i, y in enumerate(y_true) if y == label]
    label_y_true = [y_true[i] for i in label_indices]
    label_y_pred = [y_pred[i] for i in label_indices]
    return accuracy_score(label_y_true, label_y_pred)

def evaluate(y_true, y_pred):
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)

    # Overall Accuracy
    accuracy = accuracy_score(y_true=y_true, y_pred=y_pred)
    print(f'Accuracy: {accuracy:.3f}')

    # Accuracy for each label
    accuracy_negative = accuracy_for_label(y_true,y_pred,0)
    accuracy_positive = accuracy_for_label(y_true,y_pred,1)

    print(f'Accuracy for negative reviews: {accuracy_negative:.3f}')
    print(f'Accuracy for positive reviews: {accuracy_positive:.3f}')
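
As a small worked example of these metrics, the sketch below recomputes them on ten made-up labels. It uses plain Python instead of `accuracy_score` so that it runs on its own; the function names are illustrative. Note that the accuracy restricted to one true label is the same quantity as the recall of that class.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and truth agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def accuracy_on_label(y_true, y_pred, label):
    """Accuracy restricted to the examples whose true label is `label`."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t == label]
    return sum(t == p for t, p in pairs) / len(pairs)

# Made-up labels: 4 positive reviews followed by 6 negative reviews.
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

overall = accuracy(y_true_toy, y_pred_toy)                # 7 of 10 correct
positive = accuracy_on_label(y_true_toy, y_pred_toy, 1)   # 3 of 4 correct
negative = accuracy_on_label(y_true_toy, y_pred_toy, 0)   # 4 of 6 correct
print(f"{overall:.3f} {positive:.3f} {negative:.3f}")
```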

## Functions for Answer Generation

Next, we have created the functions *generate_response* and *predict*. The first function provides a single prompt to the model and returns its response. The second function provides all the test prompts to the model and collects the predicted labels. We obtain the model's response using the `pipeline` function from `transformers`. We limit generation to 1 new token, so that the model provides only one word. If the response does not contain "positive", we consider that the model has evaluated the review as negative, regardless of whether it responded with "negative" or something else.

In [13]:
def generate_response(prompt,model,tokenizer):
    pipe = pipeline(task="text-generation",
                        model=model,
                        tokenizer=tokenizer,
                        max_new_tokens = 1,
                        do_sample = False)  # greedy decoding, deterministic output
    result = pipe(prompt, pad_token_id=pipe.tokenizer.eos_token_id)
    return result[0]['generated_text'].split("=")[-1].lower()

def predict(X_test, model, tokenizer):
    y_pred = []
    for i in tqdm(range(len(X_test))):
        prompt = X_test.iloc[i]["text"]
        answer = generate_response(prompt,model,tokenizer)
        if "positive" in answer:
            y_pred.append(1)
        else:
            y_pred.append(0)
    return y_pred
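
The parsing rule can be checked in isolation. The sketch below applies the same two steps (keep the text after the last `=`, then look for "positive") to a few made-up pipeline outputs. `parse_answer` is a hypothetical helper, and the shortened prompts are placeholders, not real model output.

```python
def parse_answer(generated_text):
    """Map the full generated text to 1 (positive) or 0 (anything else)."""
    # The output contains prompt + completion, so keep what follows the last "=".
    answer = generated_text.split("=")[-1].lower()
    return 1 if "positive" in answer else 0

# Made-up outputs; the instruction part is shortened for readability.
outputs = [
    "[INST]...[/INST] [Great film] = positive",
    "[INST]...[/INST] [Dull film] = negative",
    "[INST]...[/INST] [2 + 2 = 5, apparently] = Positive",
    "[INST]...[/INST] [Odd film] = neutral",
]
labels = [parse_answer(text) for text in outputs]
print(labels)
```

Splitting on the last `=` keeps the rule robust to reviews that themselves contain an equals sign, and any answer other than "positive" falls back to the negative class.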

## Mistral Workflow

We have created a function to retrieve the Mistral model and its tokenizer. The model is version 0.2 of Mistral 7B Instruct (`mistralai/Mistral-7B-Instruct-v0.2`).

An important component is the `BitsAndBytesConfig` class, which configures how the model is quantized. Quantization, in general, reduces the number of bits used to represent data while preserving essential information, resulting in a smaller model and potentially faster inference (Dettmers, Tim, and Luke Zettlemoyer. 2023. “The Case for 4-Bit Precision: K-Bit Inference Scaling Laws.” https://arxiv.org/abs/2212.09720). Regarding the arguments, *load_in_4bit* is set to True to enable 4-bit quantization. This replaces the Linear layers with FP4/NF4 layers from `bitsandbytes`. The argument *bnb_4bit_use_double_quant* is set to False, which disables double quantization, that is, a second quantization pass applied to the quantization constants themselves. *bnb_4bit_quant_type* is set to "nf4", which stands for 4-bit NormalFloat, the data type introduced with QLoRA. Its 16 quantization levels are chosen for weights that are approximately normally distributed, rather than being evenly spaced. Lastly, *bnb_4bit_compute_dtype* is set to `getattr(torch, "float16")`, so the 4-bit weights are dequantized to 16-bit floats for computation, which is faster than computing in 32-bit.

We used 4 bits instead of 8 bits mainly to reduce memory usage, since at the beginning of the project we repeatedly ran into CUDA out-of-memory errors. This comes at a cost: values are represented with lower precision over a more limited range, the model is more susceptible to quantization noise, and accuracy can drop. As for the advantages, the memory footprint is smaller, memory bandwidth consumption is reduced, and computation can be faster and more energy efficient.
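
A rough back-of-the-envelope calculation shows why the bit width matters for GPU memory. The sketch below assumes that Mistral 7B has about 7.24 billion parameters (an approximate figure, not taken from this notebook) and counts only the weights, ignoring quantization constants, activations, and optimizer state.

```python
# Assumption: Mistral 7B has roughly 7.24 billion parameters.
N_PARAMS = 7.24e9

def weight_memory_gb(n_params, bits_per_param):
    """Approximate weight storage in gigabytes (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

# Weight memory at 16-bit, 8-bit and 4-bit precision.
memory = {bits: weight_memory_gb(N_PARAMS, bits) for bits in (16, 8, 4)}
for bits, gb in memory.items():
    print(f"{bits:>2}-bit weights: ~{gb:.1f} GB")
```

Under these assumptions, going from 8 bits to 4 bits halves the weight memory, which is the kind of saving that can make the difference on a single GPU with limited memory.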

After defining the `BitsAndBytesConfig` object, we retrieve the model itself by passing its name, the `BitsAndBytesConfig` object, and setting *device_map* to "auto" so that the model is loaded onto the GPU. Finally, we obtain the model's tokenizer.

This functionality is encapsulated in the function named `get_model`.

In [14]:
def get_model():
    model_name = "mistralai/Mistral-7B-Instruct-v0.2"
    compute_dtype = getattr(torch, "float16")
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=False,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",
        quantization_config=bnb_config,
    )
    model.config.use_cache = False
    model.config.pretraining_tp = 1
    tokenizer = AutoTokenizer.from_pretrained(model_name,
                                              trust_remote_code=True,
                                              padding_side="left",
                                              add_bos_token=True,
                                              add_eos_token=True,
                                            )
    tokenizer.pad_token = tokenizer.eos_token
    return (model,tokenizer)

In [15]:
model,tokenizer = get_model()

config.json:   0%|          | 0.00/596 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/25.1k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.46k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/72.0 [00:00<?, ?B/s]

In [16]:
# Base Model Performance. Since this evaluation takes too much time, it is in comments, but uncomment if necessary.
# y_pred = predict(X_test, model, tokenizer)
# evaluate(y_true, y_pred)

Next, we define a function that builds the trainer object used to fine-tune the model.

Firstly, we've instantiated a `LoraConfig` object. LoRA stands for Low-Rank Adaptation, a PEFT method that freezes the pretrained weights and learns the update to selected weight matrices (typically in the attention layers) as the product of two much smaller low-rank matrices. This results in a drastic reduction in the number of parameters that need to be fine-tuned. Regarding the parameters, *lora_alpha* is set to 16; it is a scaling factor, and the low-rank update is multiplied by *lora_alpha* / *r*. The *lora_dropout* parameter is set to 0.1; this dropout is applied to the input of the LoRA layers and helps prevent the model from overfitting. The parameter *r* is the rank of the low-rank matrices (64 here), which controls how many trainable parameters are added. The *bias* parameter determines which bias terms are trained. In our case, it is set to "none", so no bias parameters are updated. The last argument, *task_type*, specifies the type of task the model is trained for, which is set to causal language modeling. This means that the model is trained to predict the next token in a sequence given the previous tokens.
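
To give a sense of scale, the sketch below counts the parameters that LoRA adds for the values used here. It rests on assumptions that do not come from this notebook: the Mistral 7B shapes (a 4096 to 4096 `q_proj`, a 4096 to 1024 `v_proj`, 32 decoder layers) and the idea that PEFT targets `q_proj` and `v_proj` by default when no target modules are given.

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters LoRA adds to one (d_out x d_in) weight matrix."""
    # A has shape (r, d_in) and B has shape (d_out, r).
    return r * d_in + d_out * r

r, lora_alpha = 64, 16
scaling = lora_alpha / r  # factor applied to the low-rank update B @ A

# Assumed shapes and targets (see the note above); not read from the model.
full_q = 4096 * 4096
lora_q = lora_params(4096, 4096, r)
lora_v = lora_params(4096, 1024, r)
total_trainable = 32 * (lora_q + lora_v)

print(f"scaling = {scaling}")
print(f"q_proj: {lora_q:,} LoRA parameters vs {full_q:,} frozen ({lora_q / full_q:.1%})")
print(f"all layers: {total_trainable:,} trainable parameters")
```

Under these assumptions, roughly 27 million parameters are trained instead of about 7 billion. Stored as 32-bit floats, that would amount to about 109 MB, which is in line with the size of the `adapter_model.safetensors` file uploaded at the end of this notebook.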

Regarding the training arguments, the first one is *output_dir*, the directory where checkpoints are written; its name is also used for the Hugging Face repository when the model is published. Then, we specify the number of epochs (4), the per-device batch size (1), the number of gradient accumulation steps (4), the optimizer (paged_adamw_32bit), the checkpoint saving interval in steps (0), the number of steps between two logs (25), the learning rate ($2 \times 10^{-4}$), the weight decay applied to all layers except bias and LayerNorm weights (0.001), 16-bit floating-point mixed-precision training (*fp16*), the maximum gradient norm used for clipping (0.3), and the ratio of total training steps used for a linear warmup from 0 to *learning_rate* (0.03). We also group samples of roughly the same length together to minimize padding and be more efficient (*group_by_length*), use a cosine learning-rate scheduler, report logs to TensorBoard, and run evaluation at the end of each epoch.
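
The interaction between batch size, gradient accumulation, and warmup can be worked out by hand. The following sketch does the arithmetic for the 900 training examples and evaluates a learning-rate curve with linear warmup followed by cosine decay. The function `lr_at` is a hypothetical illustration of a common formulation of such a schedule, not code taken from `transformers`.

```python
import math

n_train = 900                     # training examples
batch_size, grad_accum = 1, 4
epochs, warmup_ratio, peak_lr = 4, 0.03, 2e-4

effective_batch = batch_size * grad_accum        # examples per optimizer update
updates_per_epoch = n_train // effective_batch   # optimizer updates per epoch
total_updates = updates_per_epoch * epochs
warmup_updates = math.ceil(total_updates * warmup_ratio)

def lr_at(step):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_updates:
        return peak_lr * step / warmup_updates
    progress = (step - warmup_updates) / (total_updates - warmup_updates)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(effective_batch, updates_per_epoch, total_updates, warmup_updates)
for step in (0, warmup_updates, total_updates // 2, total_updates):
    print(step, f"{lr_at(step):.2e}")
```

With these settings the weights are updated 225 times per epoch and 900 times in total, which agrees with the `global_step=900` reported in the training output, and the learning rate peaks after 27 updates.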

Finally, we create the trainer object with `SFTTrainer`, the supervised fine-tuning trainer from `trl`. We pass as arguments our model, the training data, the evaluation data (so that the validation loss can be computed during training), the LoRA configuration, the name of the dataset column that holds the prompts (*dataset_text_field*), the model tokenizer, and the training arguments. We also disable packing and set the maximum sequence length to 512 tokens.


In [17]:
def train_configuration():
    peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.1,
        r=64,
        bias="none",
        task_type="CAUSAL_LM",
    )

    training_arguments = TrainingArguments(
        output_dir="mistral_retrained",
        num_train_epochs=4,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        optim="paged_adamw_32bit",
        save_steps=0,
        logging_steps=25,
        learning_rate=2e-4,
        weight_decay=0.001,
        fp16=True,
        max_grad_norm=0.3,
        warmup_ratio=0.03,
        group_by_length=True,
        lr_scheduler_type="cosine",
        report_to="tensorboard",
        evaluation_strategy="epoch"
    )

    trainer = SFTTrainer(
        model=model,
        train_dataset=data_train,
        eval_dataset=data_eval,
        peft_config=peft_config,
        dataset_text_field="text",
        tokenizer=tokenizer,
        args=training_arguments,
        packing=False,
        max_seq_length=512,
    )
    return trainer

## Tuning

We now fine-tune Mistral on our data. The training process took 1 hour and 26 minutes.

In [18]:
trainer = train_configuration()
trainer.train()

Map:   0%|          | 0/900 [00:00<?, ? examples/s]

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 2.0486        | 2.06951         |
| 2     | 1.9664        | 2.07854         |
| 3     | 1.8297        | 2.104064        |
| 4     | 1.7339        | 2.127548        |


TrainOutput(global_step=900, training_loss=1.9081769222683376, metrics={'train_runtime': 5169.7089, 'train_samples_per_second': 0.696, 'train_steps_per_second': 0.174, 'total_flos': 5.055962222051328e+16, 'train_loss': 1.9081769222683376, 'epoch': 4.0})

## Evaluation

Now we need to evaluate our fine-tuned Mistral. The evaluation took 1 hour and 20 minutes.

In [25]:
y_pred = predict(X_test, model, tokenizer)
evaluate(y_true, y_pred)

100%|██████████| 2500/2500 [1:20:31<00:00,  1.93s/it]

Accuracy: 0.964
Accuracy for negative reviews: 0.970
Accuracy for positive reviews: 0.959





## Publish

Publish the fine-tuned model to Hugging Face.

In [26]:
trainer.push_to_hub()

adapter_model.safetensors:   0%|          | 0.00/109M [00:00<?, ?B/s]

Upload 3 LFS files:   0%|          | 0/3 [00:00<?, ?it/s]

training_args.bin:   0%|          | 0.00/4.28k [00:00<?, ?B/s]

events.out.tfevents.1715078702.2f49363661b3.34.0:   0%|          | 0.00/11.9k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/JoseBambora/mistral_retrained/commit/2cbdfb783459feb96c26ffd9057a7aa75ae4eb64', commit_message='End of training', commit_description='', oid='2cbdfb783459feb96c26ffd9057a7aa75ae4eb64', pr_url=None, pr_revision=None, pr_num=None)