## Module 2 - Fine-tuning Phi-1.5 for sentence classification using QLoRA

This notebook presents an example of how to fine-tune Phi-1.5 for sentence classification using QLoRA.

QLoRA is a fine-tuning approach that reduces memory usage enough to fine-tune a 65B-parameter model on a single 48GB GPU while preserving full 16-bit fine-tuning task performance. For more details, please refer to the [QLoRA paper](https://arxiv.org/abs/2305.14314).
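
To make the memory claim concrete, the following back-of-the-envelope sketch compares the memory needed just to hold the weights at 16-bit and at 4-bit precision. The helper name `weight_memory_gb` is hypothetical, and the estimate deliberately ignores activations, LoRA adapters, optimizer state, and quantization constants, so real usage is higher.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory (in GB, 1 GB = 1e9 bytes) needed to store the weights alone."""
    return n_params * bits_per_param / 8 / 1e9


n_params_65b = 65e9  # the 65B-parameter model mentioned above

fp16_gb = weight_memory_gb(n_params_65b, 16)  # 16-bit weights
int4_gb = weight_memory_gb(n_params_65b, 4)   # 4-bit quantized weights

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f" 4-bit weights: {int4_gb:.1f} GB")
```

Under these simplifying assumptions, the 4-bit weights (about 32.5 GB) fit on a 48GB GPU, whereas the 16-bit weights (about 130 GB) do not.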


## Setup

In [None]:
# If DEBUG is False:
# - 25k training samples and 1k test samples are used.
# - The model, tokenizer, and training log are saved under save_path.
# Otherwise:
# - 1k training samples and 1k test samples are used.
# - The model, tokenizer, and training log are not saved.

DEBUG = False

# DEBUG samples
DEBUG_train_samples = 1000
DEBUG_test_samples = 1000

# samples
train_samples_limit = 25000
test_samples_limit = 1000

# Save Path
save_path = "/content/drive/MyDrive/Colab Notebooks/nlp_unicamp/6_7"

# Model
model_name = "microsoft/phi-1_5"
new_model = "phi-1_5-IMDB"

# Maximum sequence length to use
# - https://huggingface.co/docs/transformers/en/model_doc/phi#transformers.PhiConfig.max_position_embeddings
# - Reducing 2048 to 512
max_seq_length = 512

# Installing required packages

In this example, we have to install the following libraries:  `transformers`, `datasets`, `torch`, `peft`, `bitsandbytes`, and `trl`.

**`transformers`**:

Transformers is an open-source library for NLP developed by Hugging Face. It provides state-of-the-art pre-trained models for various NLP tasks, such as text classification, sentiment analysis, question-answering, named entity recognition, etc.

**`datasets`**:

Datasets is another open-source library developed by Hugging Face that provides a collection of preprocessed datasets for various NLP tasks, such as sentiment analysis, natural language inference, machine translation, and many more.


**`torch`**:

PyTorch is an open-source machine learning library that provides a wide range of tools and utilities for building and training custom deep learning models. It is already installed in the Colab environment, but we need to install its latest version.

**`peft`**:

🤗 PEFT (Parameter-Efficient Fine-Tuning) is a library for efficiently adapting pre-trained language models (PLMs) to various downstream applications without fine-tuning all the model’s parameters. We use PEFT in this example because it supports QLoRA.


**`bitsandbytes`**:

BitsAndBytes is a library designed to reduce the memory needed to train and run neural networks on modern GPUs. It offers 8-bit optimizers, which shrink the memory footprint of the optimizer state, and 8-bit and 4-bit quantization of model weights. In this example we use it to load the base model in 4-bit precision. The reduced memory usage makes it possible to train larger models or use larger batch sizes within the same memory constraints.


**`trl`**:

🤗 TRL (Transformer Reinforcement Learning) is a library for post-training transformer language models with methods such as supervised fine-tuning (SFT) and reinforcement learning from human feedback. We use its `SFTTrainer` for supervised fine-tuning in this example.

In [None]:
!pip install -q torch
!pip install -q git+https://github.com/huggingface/transformers #huggingface transformers for downloading models weights
!pip install datasets
!pip install -q peft  # Parameter efficient finetuning - for qLora Finetuning
!pip install -q bitsandbytes  # For Model weights quantization
!pip install -q trl  # Transformer Reinforcement Learning - For Finetuning using Supervised Fine-tuning

## Libs

In [None]:
from pprint import pprint

import pandas as pd
import torch
from datasets import Dataset, load_dataset
from huggingface_hub import notebook_login
from peft import LoraConfig, PeftModel
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
    logging,
)
from trl import SFTTrainer  # For supervised finetuning

import gc

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
import seaborn as sns

from collections import Counter
from google.colab import runtime, drive
from tqdm import tqdm
from datasets import DatasetDict

import os

import datetime
import time

In [None]:
# Mount Google Drive only when results will be saved (DEBUG is False)
if not DEBUG:
    drive.mount("/content/drive")

# Setting the device

In [None]:
!nvidia-smi

# Downloading Dataset

In [None]:
imdb_dataset = load_dataset("imdb")

train_dataset = imdb_dataset["train"].shuffle(seed=42)
test_dataset = imdb_dataset["test"].shuffle(seed=42)

In [None]:
if DEBUG:
    train_dataset = train_dataset.select(range(DEBUG_train_samples))
    test_dataset = test_dataset.select(range(DEBUG_test_samples))
else:
    train_dataset = train_dataset.select(range(train_samples_limit))
    test_dataset = test_dataset.select(range(test_samples_limit))

In [None]:
train_label_counter = Counter(train_dataset["label"])

test_label_counter = Counter(test_dataset["label"])


print("train labels:")

print(train_label_counter)


print("\ntest labels:")

print(test_label_counter)

# Data Preparation

#### Define the template used to format the train samples

In [None]:
template = """Your task is to classify sentences' sentiment as 'positive' or 'negative'. Your answer should be one word, either 'positive' or 'negative'.
Sentence: {text}
Answer: {class}"""

#### Map 0 to negative and 1 to positive

In [None]:
POSITIVE_LABEL = "positive"
NEGATIVE_LABEL = "negative"

train_dataset = train_dataset.map(
    lambda example: {
        "class": POSITIVE_LABEL if example["label"] == 1 else NEGATIVE_LABEL
    }
)

## Data Truncation

#### Load Tokenizer for data filtering with truncation

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

#### Define the maximum sample length

In [None]:
len_template_tokenized = tokenizer(template, return_tensors="pt")["input_ids"].shape[1]
max_lenght_sample = max_seq_length - len_template_tokenized

#### Before truncation: samples with more tokens than max_lenght_sample

In [None]:
# Map n_tokens
train_dataset = train_dataset.map(
    lambda example: {
        "n_tokens": tokenizer(example["text"], return_tensors="pt")["input_ids"].shape[
            1
        ]
    }
)

# Print columns and n_rows
print(train_dataset)

# Print n_tokens above max_lenght_sample (max_seq_length - len_template_tokenized)
print(
    f'Number of rows > max_lenght_sample({max_lenght_sample}): {train_dataset.filter(lambda example: example["n_tokens"] > max_lenght_sample).num_rows}'
)

#### Truncate

In [None]:
def format_samples(sample):
    """Truncate a review so that review plus template fit in max_seq_length tokens."""
    sample_tokenized = tokenizer(sample, return_tensors="pt")

    if sample_tokenized["input_ids"].shape[1] + len_template_tokenized > max_seq_length:
        # Keep the first max_lenght_sample tokens and decode them back to text
        sample_decoded = tokenizer.decode(
            sample_tokenized["input_ids"][:, :max_lenght_sample].squeeze()
        )
        return sample_decoded
    else:
        return sample


train_dataset = train_dataset.map(
    lambda sample: {"text": format_samples(sample["text"])}
)

#### After truncation: samples with more tokens than max_lenght_sample

In [None]:
# Map n_tokens
train_dataset = train_dataset.map(
    lambda example: {
        "n_tokens": tokenizer(example["text"], return_tensors="pt")["input_ids"].shape[
            1
        ]
    }
)

# Print columns and n_rows
print(train_dataset)

# Print n_tokens above max_lenght_sample (max_seq_length - len_template_tokenized)
print(
    f'Number of rows > max_lenght_sample({max_lenght_sample}): {train_dataset.filter(lambda example: example["n_tokens"] > max_lenght_sample).num_rows}'
)

### Mapping template to train and classes to test

In [None]:
train_dataset = train_dataset.map(lambda example: {"text": template.format(**example)})

In [None]:
test_dataset = test_dataset.map(
    lambda example: {
        "class": POSITIVE_LABEL if example["label"] == 1 else NEGATIVE_LABEL
    }
)

# Fine-tuning

Setting the QLoRA parameters

1. **lora_r (LoRA attention dimension)**:
   - The rank of the update matrices, given as an integer. A lower rank results in smaller update matrices with fewer trainable parameters.

2. **lora_alpha (Alpha parameter for LoRA scaling)**:
   - The LoRA scaling factor. The low-rank update is multiplied by `lora_alpha / lora_r` before it is added to the output of the frozen layer.

3. **lora_dropout (Dropout probability for LoRA layers)**:
   - The dropout rate applied to the input of the LoRA layers during training.

In [None]:
# LoRA attention dimension
lora_r = 64  # @param

# Alpha parameter for LoRA scaling
lora_alpha = 16  # @param

# Dropout probability for LoRA layers
lora_dropout = 0.1  # @param
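
As a rough illustration of what these values imply, the sketch below counts the trainable parameters LoRA adds to a single linear layer and computes the scaling factor. The 2048 x 2048 layer shape is an assumption chosen for illustration (it is not read from the model), and the formula assumes the standard `lora_alpha / r` scaling.

```python
lora_r = 64      # rank used in this notebook
lora_alpha = 16  # alpha used in this notebook

# Assumed shape of a single projection matrix (hypothetical 2048 x 2048 layer).
d_in, d_out = 2048, 2048

full_params = d_in * d_out             # parameters in the frozen weight W
lora_params = lora_r * (d_in + d_out)  # parameters in A (r x d_in) and B (d_out x r)
scaling = lora_alpha / lora_r          # factor applied to the low-rank update B @ A

print(f"Frozen weight parameters:  {full_params:,}")
print(f"LoRA trainable parameters: {lora_params:,} ({lora_params / full_params:.2%} of the layer)")
print(f"LoRA scaling factor:       {scaling}")
```

For this assumed layer, LoRA trains 262,144 parameters instead of 4,194,304 (6.25% of the layer), and the update is scaled by 0.25.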

Bitsandbytes parameters. These parameters focus on the implementation of 4-bit precision in model loading and computation. Here's an explanation of each:

1. **use_4bit (Activate 4-bit precision base model loading)**:
   - This parameter, when set to `True`, indicates that the base model (i.e., the pre-trained model or initial model weights) should be loaded using 4-bit precision.
2. **bnb_4bit_compute_dtype (Compute dtype for 4-bit base models)**:
   - This parameter specifies the data type to be used for computations in the context of 4-bit base models.
   - The value `"float16"` indicates that computations should be done using 16-bit floating-point numbers.

3. **bnb_4bit_quant_type (Quantization type)**:
   - This parameter determines the 4-bit data type used to store the quantized weights.
   - The options are `"fp4"` (4-bit floating point) and `"nf4"` (4-bit NormalFloat, the data type introduced in the QLoRA paper for normally distributed weights).

4. **use_nested_quant (Activate nested quantization for 4-bit base models)**:
   - When set to `True`, this parameter enables nested quantization for 4-bit base models.
   - Nested quantization, often referred to as double quantization, quantizes the quantization constants produced by the first quantization step. According to the QLoRA paper, this saves about 0.37 bits per parameter on average.

In [None]:
# Activate 4-bit precision base model loading
use_4bit = True  # @param

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"  # @param

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"  # @param ["nf4","fp4"]

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False  # @param
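
To get a feel for what nested quantization saves, the following sketch reproduces the overhead arithmetic from the QLoRA paper. The block sizes and bit widths are taken from the paper and are assumptions here, not values read from bitsandbytes; the parameter count of roughly 1.3 billion for Phi-1.5 is also an approximation, and not every parameter is necessarily quantized.

```python
# Values as described in the QLoRA paper (assumed, not read from bitsandbytes).
block_size = 64          # weights that share one first-level quantization constant
second_block_size = 256  # first-level constants that share one second-level constant

# Without nested quantization: one 32-bit constant per block of 64 weights.
overhead_plain = 32 / block_size

# With nested quantization: constants stored in 8 bits, plus one 32-bit
# constant for every 256 of them.
overhead_nested = 8 / block_size + 32 / (block_size * second_block_size)

saved_bits = overhead_plain - overhead_nested

n_params = 1.3e9  # approximate size of Phi-1.5 (assumption)
saved_mb = saved_bits * n_params / 8 / 1e6

print(f"Overhead without nested quantization: {overhead_plain:.3f} bits/parameter")
print(f"Overhead with nested quantization:    {overhead_nested:.3f} bits/parameter")
print(f"Estimated saving for Phi-1.5:         {saved_mb:.0f} MB")
```

For a model of this size the saving is modest (on the order of tens of megabytes), which may be why `use_nested_quant` is left at `False` here; the option matters more for much larger models.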

Now, let's define the training arguments.

1. **output_dir**:
   - Specifies the directory where the model predictions and checkpoints will be stored.

2. **num_train_epochs**:
   - Sets the number of epochs for training, where one epoch means one pass through the entire training dataset. We set it to `1`.

3. **fp16, bf16**:
   - Enable training with 16-bit floating-point precision (`fp16`) or 16-bit bfloat precision (`bf16`).

4. **per_device_train_batch_size**:
   - Determines the batch size for training per GPU. This will depend on the GPU used. For an A100, a batch size of 16 examples can be used; this notebook sets it to 6.

5. **per_device_eval_batch_size**:
   - Sets the batch size for evaluation per GPU.

6. **gradient_accumulation_steps**:
   - Indicates the number of update steps over which to accumulate gradients.

7. **gradient_checkpointing**:
   - When enabled, saves memory by trading compute for memory. Useful for training large models that would otherwise not fit in memory.

8. **max_grad_norm (Maximum gradient norm)**:
   - Specifies the maximum norm of gradients for gradient clipping, a technique to prevent exploding gradients in deep networks.

9. **learning_rate**:
   - Sets the initial learning rate for the AdamW optimizer.

10. **weight_decay**:
    - Specifies the weight decay to apply to all layers except those with bias or LayerNorm weights, as a regularization technique.

11. **optim**:
    - Defines the optimizer to use. Here it is `paged_adamw_32bit`, an AdamW variant from bitsandbytes that can page optimizer state between GPU and CPU memory to avoid out-of-memory errors during memory spikes.

12. **lr_scheduler_type**:
    - Determines the learning rate schedule to use. "constant" means the learning rate stays the same throughout training.

13. **max_steps**:
    - Overrides `num_train_epochs` by setting the total number of training steps. If set to a negative value, it is ignored. We compute it from the dataset size and the batch size so that it corresponds to one full pass over the training set.

14. **warmup_ratio**:
    - Indicates the proportion of total training steps to use for linear warmup of the learning rate.

15. **group_by_length**:
    - When enabled, sequences are grouped by length into batches. This can save memory and speed up training.

16. **save_steps**:
    - Determines how often to save a model checkpoint in terms of training steps.

17. **logging_steps**:
    - Sets the frequency, in terms of training steps, for logging training progress.


In [None]:
# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"  # @param

# Number of training epochs
num_train_epochs = 1  # @param

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = True  # @param
bf16 = False  # @param

# Batch size per GPU for training
per_device_train_batch_size = 6  # @param

# Batch size per GPU for evaluation
per_device_eval_batch_size = 6  # @param

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1  # @param

# Enable gradient checkpointing
gradient_checkpointing = True  # @param

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3  # @param

# Initial learning rate (AdamW optimizer)
learning_rate = 5e-4  # @param

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001  # @param

# Optimizer to use
optim = "paged_adamw_32bit"  # @param

# Learning rate schedule (constant a bit better than cosine)
lr_scheduler_type = "constant"  # @param


#  ---- MATHEUS RODRIGUES ----
# Total Samples
total_samples = train_dataset.shape[0]


# Epochs
num_epochs = 1

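# Number of optimizer steps needed for num_epochs passes over the data
# (assumes gradient_accumulation_steps == 1 and a single GPU)
num_training_steps = (total_samples // per_device_train_batch_size) * num_epochs

# Save roughly 3 checkpoints over the whole run
auxiliar_save_steps = max(num_training_steps // 3, 1)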
#  ---------------------------


# Number of training steps (overrides num_train_epochs)
max_steps = num_training_steps  # @param

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03  # @param

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True  # @param

# Save checkpoint every X updates steps
save_steps = auxiliar_save_steps  # @param

# Log every X updates steps
logging_steps = 100  # @param
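
The step arithmetic in the cell above can be checked with plain numbers. The sketch below assumes the non-debug setting of 25,000 training samples and a single GPU; the rounding used for the warmup steps (rounding up) is an assumption about how the trainer derives them from `warmup_ratio`.

```python
import math

total_samples = 25_000           # train_samples_limit when DEBUG is False
per_device_train_batch_size = 6
gradient_accumulation_steps = 1
num_epochs = 1
warmup_ratio = 0.03

# Samples consumed per optimizer update on a single GPU.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

# Same formula as the cell above (valid while gradient_accumulation_steps == 1).
num_training_steps = (total_samples // per_device_train_batch_size) * num_epochs

# Roughly three checkpoints over the run.
save_steps = max(num_training_steps // 3, 1)

# Assumed rounding: warmup steps rounded up from ratio * total steps.
warmup_steps = math.ceil(num_training_steps * warmup_ratio)

print(f"Effective batch size: {effective_batch_size}")
print(f"Training steps:       {num_training_steps}")
print(f"Checkpoint every:     {save_steps} steps")
print(f"Warmup steps:         {warmup_steps}")
```

With these values the run takes 4,166 optimizer steps, saves a checkpoint every 1,388 steps, and would warm up for 125 steps. Whether the warmup is actually applied depends on the scheduler; as far as we know, the plain `"constant"` schedule does not use it, whereas `"constant_with_warmup"` does.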

Now let's define the SFTTrainer parameters.

1. **max_seq_length**:
   - This parameter specifies the maximum sequence length to be used.

2. **packing**:
   - This parameter indicates whether or not to pack multiple short examples into the same input sequence.
   - When set to `True`, this technique can be used to increase computational efficiency, particularly in batch processing.

3. **device_map**:
   - This parameter is a dictionary that maps parts of the model to specific computing devices. It is passed to `from_pretrained` when the model is loaded rather than to `SFTTrainer`.
   - The entry `{"": 0}` specifies that the entire model will be loaded onto GPU 0.

In [None]:
# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on the GPU 0
device_map = {"": 0}
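
To illustrate what packing means, here is a minimal sketch that concatenates tokenized examples, separated by an end-of-sequence token, and cuts the resulting stream into fixed-size blocks. The function `pack_sequences` and the toy token IDs are hypothetical; the actual packing logic in `trl` may differ in its details.

```python
from typing import List


def pack_sequences(tokenized: List[List[int]], eos_id: int, block_size: int) -> List[List[int]]:
    """Concatenate tokenized examples (separated by EOS) and cut into fixed-size blocks."""
    stream: List[int] = []
    for ids in tokenized:
        stream.extend(ids)
        stream.append(eos_id)  # mark the boundary between examples

    # Keep only full blocks; the trailing remainder is dropped in this sketch.
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]


examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]  # toy token IDs
packed = pack_sequences(examples, eos_id=0, block_size=4)
print(packed)
```

Note how the second block mixes tokens from two different examples. With `packing = False`, as in this notebook, each training sequence contains exactly one formatted review.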

Load the base model with QLoRA configuration

In [None]:
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

base_model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map=device_map
)

base_model.config.use_cache = False
base_model.config.pretraining_tp = 1

## Fine-Tuning with QLoRA and Supervised Fine-Tuning

We're ready to fine-tune our model using QLoRA. For this tutorial, we'll use the `SFTTrainer` from the `trl` library.

In the code below, `target_modules` lists the names of the layers that LoRA (Low-Rank Adaptation) will adapt. LoRA adapts a pre-trained model by training small low-rank matrices alongside the frozen weights instead of updating the weights themselves. PEFT matches these names against the module names of the loaded model, so names that do not exist in a given architecture are not adapted. The list below follows the naming used by Llama-style models; it is worth checking it against `base_model.named_modules()`, since Phi may name its attention output and feed-forward layers differently. Here is what each name refers to:

1. **q_proj, k_proj, v_proj, o_proj**:
   - These refer to the projections for query (q), key (k), value (v), and output (o) in the attention mechanism of a Transformer model.

2. **gate_proj**:
   - This refers to the gating projection of a gated feed-forward (MLP) block, as found in Llama-style models. Its output, after the activation function, is multiplied element-wise with the output of `up_proj`.

3. **up_proj, down_proj**:
   - These refer to the feed-forward projections that expand the hidden state to the larger intermediate dimension (`up_proj`) and project it back to the hidden dimension (`down_proj`).

4. **lm_head**:
   - This refers to the language model head of a Transformer, which is the final linear layer that maps hidden states to scores over the vocabulary (for example, for the next word in a sequence).
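
The name matching can be sketched without loading a model. The module names below are hypothetical, modelled on what we believe the Transformers Phi implementation uses; list the real ones with `base_model.named_modules()` before relying on them. The suffix-matching rule is likewise an assumption about how PEFT resolves `target_modules`.

```python
# Hypothetical module names; list the real ones with base_model.named_modules().
module_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.self_attn.dense",
    "model.layers.0.mlp.fc1",
    "model.layers.0.mlp.fc2",
    "lm_head",
]

target_modules = [
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj", "lm_head",
]


def matches(name: str, target: str) -> bool:
    """A module matches when its full name equals the target or ends with '.<target>'."""
    return name == target or name.endswith("." + target)


matched = {t: [n for n in module_names if matches(n, t)] for t in target_modules}
unmatched = [t for t, names in matched.items() if not names]

print("Targets with at least one match:", [t for t, names in matched.items() if names])
print("Targets without any match:      ", unmatched)
```

If the real module names resemble the assumed ones, only the query, key, and value projections and `lm_head` would receive LoRA adapters, and the remaining targets would have no effect.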

In [None]:
# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],
    bias="none",
    task_type="CAUSAL_LM",
)

# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
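    lr_scheduler_type=lr_scheduler_type,
    # Pass the remaining settings defined above so they take effect
    per_device_eval_batch_size=per_device_eval_batch_size,
    gradient_checkpointing=gradient_checkpointing,
    report_to="tensorboard",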
)

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=base_model,
    train_dataset=train_dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)

## Evaluation

In [None]:
device = "cuda:0"

prompt = """Your task is to classify sentences' sentiment as 'positive' or 'negative'. Your answer should be one word, either 'positive' or 'negative'.
Sentence: {text}
Answer:"""


def classify_sentence(text, model):
    text = prompt.format(text=text)
    encodeds = tokenizer(text, return_tensors="pt", add_special_tokens=False)
    model_inputs = encodeds.to(device)

    with torch.no_grad():
        outputs = model.generate(
            **model_inputs,
            max_new_tokens=1,
            bos_token_id=model.config.bos_token_id,
            eos_token_id=model.config.eos_token_id,
            pad_token_id=model.config.eos_token_id
        )
        torch.cuda.empty_cache()

    return tokenizer.decode(
        outputs[0][len(model_inputs["input_ids"][0]) :], skip_special_tokens=True
    )


def evaluate_model(model):

    predictions = []
    references = test_dataset["class"]

    for item in tqdm(test_dataset):
        predicted = classify_sentence(item["text"], model)
        predictions.append(predicted)

    return predictions, references


def process_predictions(predictions):

    # Lower case
    predictions = [x.lower() for x in predictions]

    # Remove leading/trailing white space
    predictions = [x.strip() for x in predictions]

    return predictions

### Metrics

In [None]:
def accuracy(predictions, references):
    correct = 0
    for i in range(len(predictions)):
        if predictions[i] == references[i]:
            correct += 1
    return correct / len(predictions)


def is_valid_prediction(predictions, references):
    labels = set(references)
    correct = 0
    for prediction in predictions:
        if prediction in labels:
            correct += 1
    return correct / len(predictions)

## Before train metrics

In [None]:
predictions, references = evaluate_model(base_model)
predictions = process_predictions(predictions)

before_training_accuracy = accuracy(predictions, references)
before_training_is_valid_prediction = is_valid_prediction(predictions, references)

print(f"\nAccuracy: {before_training_accuracy*100}%")
print(f"Is valid prediction: {before_training_is_valid_prediction*100}%")

## Let's start the training process

In [None]:
print(num_epochs)
print(per_device_train_batch_size)
print(num_training_steps)
print(train_dataset.shape[0])

In [None]:
# Train model
trainer.train()

# Save trained model
trainer.model.save_pretrained(new_model)

## After training metrics

In [None]:
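# Evaluate the fine-tuned model (base model plus LoRA adapters) in eval mode,
# so that LoRA dropout is disabled during generation
trainer.model.eval()
predictions, references = evaluate_model(trainer.model)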
predictions = process_predictions(predictions)

after_training_accuracy = accuracy(predictions, references)
after_training_is_valid_prediction = is_valid_prediction(predictions, references)

print(f"\nAccuracy: {after_training_accuracy*100}%")
print(f"Is valid prediction: {after_training_is_valid_prediction*100}%")

In [None]:
# Use a raw test review: the training texts already contain the prompt and the answer
classify_sentence(test_dataset["text"][0], trainer.model)

### Save and delete runtime

In [None]:
# Get the current date and time
current_datetime = datetime.datetime.now()
formatted_datetime = current_datetime.strftime("%Y-%m-%d_%H-%M-%S")

# Directories for saving the logs and the model
logs_dir = f"{save_path}/model_{formatted_datetime}/logs"
model_dir = f"{save_path}/model_{formatted_datetime}"

# Create the directories if they do not exist
os.makedirs(logs_dir, exist_ok=True)
os.makedirs(model_dir, exist_ok=True)

# Log file name based on the date and time
log_file_name = f"loss_logs_{formatted_datetime}.txt"
log_file_path = os.path.join(logs_dir, log_file_name)

# Metrics file
metrics_file_name = f"metrics_{formatted_datetime}.txt"
metrics_file_path = os.path.join(logs_dir, metrics_file_name)

# Save the loss logs to Google Drive
with open(log_file_path, "w") as f:
    for loss in trainer.state.log_history:
        f.write(str(loss) + "\n")

# Save the metrics to Google Drive
with open(metrics_file_path, "w") as f:
    f.write("Before Training\n")
    f.write(f"Accuracy: {before_training_accuracy}\n")
    f.write(f"Is valid prediction: {before_training_is_valid_prediction}\n")

    f.write("\nAfter Training\n")
    f.write(f"Accuracy: {after_training_accuracy}\n")
    f.write(f"Is valid prediction: {after_training_is_valid_prediction}\n")

# Save the trained model
model_save_path = os.path.join(model_dir, "model")
trainer.model.save_pretrained(model_save_path)

# Shut down the Colab notebook
# os.kill(os.getpid(), 9)

In [None]:
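# Give pending file writes a moment to finish, then release the Colab runtime.
# Skip this cell if you want to run the merge steps below in the same session.
time.sleep(5)
runtime.unassign()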

## Merge the fine-tuned model

After fine-tuning, we can merge the LoRA adapter into the base model to obtain a single model that can be used for inference. This is done with PEFT. First, let's free the GPU memory by deleting the quantized base model and the trainer. You can also restart the runtime to clear the GPU memory.

In [None]:
# Empty VRAM


del base_model
gc.collect()

del trainer
gc.collect()

In [None]:
torch.cuda.empty_cache()

In [None]:
gc.collect()

Now, let's load the base model and fine-tuned model and merge them using PEFT.

In [None]:
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map={"": 0},
)
merged_model = PeftModel.from_pretrained(
    base_model,
    new_model,
)
merged_model = merged_model.merge_and_unload()
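
Conceptually, merging folds the low-rank update into the frozen weight: W' = W + (lora_alpha / r) · B · A. The toy example below, written in plain Python with hypothetical 2 x 2 values, checks that applying the adapter separately and applying the merged weight give the same output. It is a sketch of the idea, not of PEFT's actual implementation.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, enough for a toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_scaled(w: Matrix, delta: Matrix, scale: float) -> Matrix:
    """Return w + scale * delta, element-wise."""
    return [[wi + scale * di for wi, di in zip(wr, dr)] for wr, dr in zip(w, delta)]


# Toy shapes: d_out = d_in = 2, rank r = 1, lora_alpha = 2 (all hypothetical values).
W = [[1.0, 2.0], [3.0, 4.0]]  # frozen base weight, d_out x d_in
B = [[1.0], [2.0]]            # LoRA matrix B, d_out x r
A = [[0.5, -1.0]]             # LoRA matrix A, r x d_in
scale = 2 / 1                 # lora_alpha / r

x = [[1.0], [1.0]]            # input column vector, d_in x 1

# Adapter kept separate: y = W x + scale * B (A x)
y_adapter = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scale)

# Adapter merged into the weight: W' = W + scale * B A, then y = W' x
W_merged = add_scaled(W, matmul(B, A), scale)
y_merged = matmul(W_merged, x)

print("Separate adapter:", y_adapter)
print("Merged weight:   ", y_merged)
```

Because the merged weight has the same shape as the original one, the merged model needs no extra adapter layers at inference time.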

Let's save our merged model

In [None]:
# Save the merged model
merged_model.save_pretrained("merged_model", safe_serialization=True)
tokenizer.save_pretrained("merged_model")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

## Test the merged model

The following code runs inference with the merged Phi-1.5 model. We define a function called **`classify_sentence`** that uses the model to classify the sentiment of a given text as either positive or negative. The function works as follows:

1. The function accepts two parameters: `text`, the input whose sentiment is to be classified, and `model`, the model used for generation.

2. The `text` is formatted with the predefined prompt template. This prompt engineering is a common practice when using language models for specific tasks, as it provides context to the model about the task it is supposed to perform.

3. The `tokenizer` is applied to the formatted text. Tokenizers convert text into a format that models can understand, which in this case is a series of tokens. The tokenizer is configured to:
   - Return tensors compatible with PyTorch (`return_tensors="pt"`).
   - Not add special tokens that are usually used to indicate the start and end of a sequence (`add_special_tokens=False`).

4. The tokenized input (`encodeds`) is then moved to the appropriate device (GPU) for inference.

5. The inference is performed inside a `torch.no_grad()` context manager, which disables gradient calculations. This is used because we are making predictions, not training the model, and therefore do not need gradients, which would only use extra memory and computational power.

6. The `model.generate` function is called to generate a response. This function takes several parameters, such as:
   - `**model_inputs`: The tokenized inputs prepared earlier.
   - `max_new_tokens=1`: This limits generation to a single new token, which is sufficient when each label ('positive' or 'negative') is a single token in the model's vocabulary.
   - `bos_token_id=model.config.bos_token_id`: This specifies the beginning-of-sequence token id.
   - `eos_token_id=model.config.eos_token_id`: This specifies the end-of-sequence token id, which tells the model where generation can stop.
   - `pad_token_id=model.config.eos_token_id`: This specifies the token used for padding. The tokenizer has no dedicated padding token, so the end-of-sequence token is reused, mirroring `tokenizer.pad_token = tokenizer.eos_token` in the tokenizer setup.

7. After the model generates a response, `torch.cuda.empty_cache()` is called to release cached GPU memory that is no longer in use. This is helpful in managing GPU resources, especially when processing many inputs or dealing with large models.

8. Finally, the `tokenizer.decode` function is used to convert the model's output tokens back into human-readable text. The `skip_special_tokens=True` argument removes any special tokens (like padding or end-of-sequence tokens) from the output. The function also skips the input tokens (`outputs[0][len(model_inputs["input_ids"][0]):]`) to only return the newly generated text.


In [None]:
device = "cuda:0"

# Same prompt as in the Evaluation section: no trailing space after "Answer:",
# which matches the template used for training
prompt = """Your task is to classify sentences' sentiment as 'positive' or 'negative'. Your answer should be one word, either 'positive' or 'negative'.
Sentence: {text}
Answer:"""


def classify_sentence(text, model):
    text = prompt.format(text=text)
    encodeds = tokenizer(text, return_tensors="pt", add_special_tokens=False)
    model_inputs = encodeds.to(device)

    with torch.no_grad():
        outputs = model.generate(
            **model_inputs,
            max_new_tokens=1,
            bos_token_id=model.config.bos_token_id,
            eos_token_id=model.config.eos_token_id,
            pad_token_id=model.config.eos_token_id
        )
        torch.cuda.empty_cache()

    # Return only the newly generated token, not the prompt, so that the
    # output can be compared directly with the reference labels
    return tokenizer.decode(
        outputs[0][len(model_inputs["input_ids"][0]) :], skip_special_tokens=True
    )

In [None]:
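# Evaluate the merged model and normalize its outputs (lower case, stripped)
predictions, references = evaluate_model(merged_model)
predictions = process_predictions(predictions)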

print(Counter(predictions))
print(f"Accuracy: {accuracy(predictions, references)}")
print(f"Is valid prediction: {is_valid_prediction(predictions, references)}")

# Evaluation Metric

To compute accuracy, we define a custom **`string_accuracy`** function, since the model outputs text rather than numerical class IDs. Unlike the `accuracy` function defined earlier, it compares predictions and references case-insensitively.

The following code defines the **`string_accuracy`** function. It takes two lists of strings as inputs, **`predictions`** and **`references`**. The function computes accuracy by counting the number of predictions that match the corresponding reference and dividing by the total number of predictions.

In [None]:
def string_accuracy(predictions, references):
    correct = sum(
        [1 for p, r in zip(predictions, references) if p.lower() == r.lower()]
    )
    total = len(predictions)
    return correct / total

In [None]:
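# Use a separate name so that the accuracy() function defined earlier is not overwritten
merged_model_accuracy = string_accuracy(predictions=predictions, references=references)
merged_model_accuracy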

In [None]:
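# Convert labels to a numerical form; any prediction that is not a valid
# label is counted as "miss-match"
labels = {"positive": 1, "negative": 0, "miss-match": 2}
y_true_num = [labels[label] for label in references]
y_pred_num = [labels.get(label, labels["miss-match"]) for label in predictions]

# Compute the confusion matrix, always including all three classes so that
# it lines up with the tick labels below
cm = confusion_matrix(y_true_num, y_pred_num, labels=[0, 1, 2])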

# Plot the confusion matrix using seaborn
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(
    cm,
    annot=True,
    fmt="d",
    cmap="Blues",
    xticklabels=["Negative", "Positive", "Miss-Match"],
    yticklabels=["Negative", "Positive", "Miss-Match"],
)

# Labels, title, and ticks
ax.set_ylabel("Actual Label")
ax.set_xlabel("Predicted Label")
ax.set_title("Confusion Matrix")

# Show the plot
plt.show()