## 🧠 Fine-Tuning Mistral 7B for Sentiment Analysis
![Financial Dataset Overview](https://miro.medium.com/v2/resize:fit:2000/format:webp/1*GMYQoIl2pFDc5i8H0CB5-w.jpeg)

### Project Objective
The focus here is on sentiment analysis in financial and economic information. Understanding market trends, investor confidence, and consumer behavior is critical for businesses. Also, identifying potential risks and making smart investment decisions are key applications of this analysis.

### Preparing Mistral 7B
Before starting the fine-tuning process of Mistral model, selecting the right dataset is crucial. Annotated datasets in finance and economics are hard to come by, as most are kept private.

### The FinancialPhraseBank Dataset
I found the FinancialPhraseBank dataset, created by Aalto University School of Business. It contains around 5000 sentences, each annotated for sentiment analysis from an investor’s perspective. This means every sentence is labeled as either having a positive, negative, or neutral impact on stock prices.

### Dataset’s Importance
This dataset is crucial for understanding how financial news is perceived by retail investors. It classifies sentiments in a structured way, making it easier to analyze the complex dynamics of market sentiment. It's been a valuable tool in research since it was first used in the 2014 study by Malo et al.

### Next Steps
With this dataset, I'll start fine-tuning the Mistral model. The goal is to make this model proficient in detecting and understanding sentiments in financial texts. This could lead to more accurate insights into how different news impacts investor behavior and market trends.


As a first step, we install the specific libraries necessary to make this example work.

## 🛠️ Library Installation for Language Model Fine-Tuning

As a first step in our language model fine-tuning journey, let's get familiar with the specific libraries we'll be using:

- 🚀 **accelerate**: A distributed training library for PyTorch by HuggingFace. It allows training on multiple GPUs or CPUs in parallel, speeding up training in multi-GPU environments. We won't be using it in this example, but it's good to know about.

- 🧠 **peft**: A Python library by HuggingFace for efficient adaptation of pre-trained language models (PLMs). PEFT methods fine-tune a small number of model parameters, reducing computational and storage costs.

- 🌟 **bitsandbytes**: Developed by Tim Dettmers, this lightweight wrapper around CUDA custom functions includes 8-bit optimizers, matrix multiplication, and quantization functions. It supports models in 4-bit precision while computations still occur in 16 or 32-bit.

- 📚 **transformers**: A comprehensive Python library for natural language processing (NLP) by HuggingFace. It offers a range of pre-trained models for various NLP tasks like text classification, question answering, and machine translation.

- 🤖 **trl**: Another tool from HuggingFace, trl is a full-stack library for training transformer language models with Reinforcement Learning. It covers everything from Supervised Fine-tuning (SFT) and Reward Modeling (RM) to Proximal Policy Optimization (PPO).


In [1]:
pip install -q -U accelerate peft bitsandbytes transformers trl

Note: you may need to restart the kernel to use updated packages.


## 🖥️ Setting Environment Variables for PyTorch and Transformers

The code segment performs the following actions by importing the `os` module and setting two crucial environment variables:

- 🌐 **CUDA_VISIBLE_DEVICES**: 
  - This environment variable instructs PyTorch on which GPUs to use. 
  - By setting it to `0`, we specify that PyTorch should use the first GPU.

- 🔀 **TOKENIZERS_PARALLELISM**: 
  - This variable indicates to the Hugging Face Transformers library how to handle the tokenization process.
  - Setting it to `false` means the tokenization process will not be parallelized.


In [2]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

The code import warnings; warnings.filterwarnings("ignore") imports the warnings module and sets the warning filter to ignore. This means that all warnings will be suppressed and will not be displayed. Actually during training there are many warnings that do not prevent the fine-tuning but can be distracting and make you wonder if you are doing the correct things.

In [4]:
import warnings
warnings.filterwarnings("ignore")

In the following cell there are all the other imports for running the notebook

In [3]:
import numpy as np
import pandas as pd
import os
from tqdm import tqdm
import bitsandbytes as bnb
import torch
import torch.nn as nn
import transformers
from datasets import Dataset
from peft import LoraConfig, PeftConfig
from trl import SFTTrainer
from transformers import (AutoModelForCausalLM, 
                          AutoTokenizer, 
                          BitsAndBytesConfig, 
                          TrainingArguments, 
                          pipeline, 
                          logging)
from sklearn.metrics import (accuracy_score, 
                             classification_report, 
                             confusion_matrix)
from sklearn.model_selection import train_test_split



## 📊 Data Preparation Steps for Fine-Tuning

The following steps are executed in the next code cell to prepare our datasets for fine-tuning the Mistral 7B model:

1. **Data Reading**: 
   - Reads the input dataset from the `all-data.csv` file. 
   - This CSV file contains two columns: `sentiment` and `text`.

2. **Dataset Splitting**:
   - Splits the dataset into training and test sets, each containing 300 samples.
   - The split is stratified by sentiment to include positive, neutral, and negative sentiments in each set.

3. **Data Shuffling**:
   - Shuffles the training data in a replicable order using `random_state=10`.

4. **Prompt Transformation**:
   B- Transforms the texts in the train and test data into prompts for Mistral .
   - The train prompts include the expected answers for fine-tuning.

5. **Evaluation Data Preparation**:
   - Handles the residual data not included in train or test sets as evaluation data.
   - Samples this data with repetition to balance the sentiment distribution, particularly increasing the number of negative instances.

6. **Data Wrapping**:
   - Wraps the train and evaluation data using a class from Hugging Face ([Hugging Face Documentation](https://huggingface.co/docs/datasets/index)).

This process results in the creation of `train_data`, `eval_data`, and `test_data` datasets, ready for use in our fine-tuning pipeline.


In [5]:
filename = "../input/sentiment-analysis-for-financial-news/all-data.csv"

df = pd.read_csv(filename, 
                 names=["sentiment", "text"],
                 encoding="utf-8", encoding_errors="replace")

X_train = list()
X_test = list()
for sentiment in ["positive", "neutral", "negative"]:
    train, test  = train_test_split(df[df.sentiment==sentiment], 
                                    train_size=300,
                                    test_size=300, 
                                    random_state=42)
    X_train.append(train)
    X_test.append(test)

X_train = pd.concat(X_train).sample(frac=1, random_state=10)
X_test = pd.concat(X_test)

eval_idx = [idx for idx in df.index if idx not in list(train.index) + list(test.index)]
X_eval = df[df.index.isin(eval_idx)]
X_eval = (X_eval
          .groupby('sentiment', group_keys=False)
          .apply(lambda x: x.sample(n=50, random_state=10, replace=True)))
X_train = X_train.reset_index(drop=True)

def generate_prompt(data_point):
    return f"""
            Analyze the sentiment of the news headline enclosed in square brackets, 
            determine if it is positive, neutral, or negative, and return the answer as 
            the corresponding sentiment label "positive" or "neutral" or "negative".

            [{data_point["text"]}] = {data_point["sentiment"]}
            """.strip()

def generate_test_prompt(data_point):
    return f"""
            Analyze the sentiment of the news headline enclosed in square brackets, 
            determine if it is positive, neutral, or negative, and return the answer as 
            the corresponding sentiment label "positive" or "neutral" or "negative".

            [{data_point["text"]}] = """.strip()

X_train = pd.DataFrame(X_train.apply(generate_prompt, axis=1), 
                       columns=["text"])
X_eval = pd.DataFrame(X_eval.apply(generate_prompt, axis=1), 
                      columns=["text"])

y_true = X_test.sentiment
X_test = pd.DataFrame(X_test.apply(generate_test_prompt, axis=1), columns=["text"])

train_data = Dataset.from_pandas(X_train)
eval_data = Dataset.from_pandas(X_eval)

## 📈 Evaluation Function for Sentiment Model

The next step involves creating a function to evaluate the performance of our fine-tuned sentiment analysis model. The function carries out the following steps:

1. **Label Mapping**:
   - Maps sentiment labels to numerical representations:
     - `2` for positive
     - `1` for neutral
     - `0` for negative

2. **Accuracy Calculation**:
   - Computes the model's accuracy on the test dataset.

3. **Accuracy Report Generation**:
   - Produces an accuracy report detailing performance for each sentiment label.

4. **Classification Report**:
   - Generates a classification report of the model, providing metrics like precision, recall, and F1-score.

5. **Confusion Matrix Creation**:
   - Constructs a confusion matrix to visually represent the model's performance in predicting each sentiment category.


In [6]:
def evaluate(y_true, y_pred):
    labels = ['positive', 'neutral', 'negative']
    mapping = {'positive': 2, 'neutral': 1, 'none':1, 'negative': 0}
    def map_func(x):
        return mapping.get(x, 1)
    
    y_true = np.vectorize(map_func)(y_true)
    y_pred = np.vectorize(map_func)(y_pred)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_true=y_true, y_pred=y_pred)
    print(f'Accuracy: {accuracy:.3f}')
    
    # Generate accuracy report
    unique_labels = set(y_true)  # Get unique labels
    
    for label in unique_labels:
        label_indices = [i for i in range(len(y_true)) 
                         if y_true[i] == label]
        label_y_true = [y_true[i] for i in label_indices]
        label_y_pred = [y_pred[i] for i in label_indices]
        accuracy = accuracy_score(label_y_true, label_y_pred)
        print(f'Accuracy for label {label}: {accuracy:.3f}')
        
    # Generate classification report
    class_report = classification_report(y_true=y_true, y_pred=y_pred)
    print('\nClassification Report:')
    print(class_report)
    
    # Generate confusion matrix
    conf_matrix = confusion_matrix(y_true=y_true, y_pred=y_pred, labels=[0, 1, 2])
    print('\nConfusion Matrix:')
    print(conf_matrix)

## 🤖 Model Loading and Quantization for  Mistral 7B

The process for loading and quantizing the Mistral 7B model involves several steps:

### Model Loading and Quantization:

- **Load Mistral 7B Model**:
  - The code begins by loading the Mistral 7B language model from the Hugging Face Hub.

- **Data Type Configuration**:
  - Acquires the `float16` data type from the torch library for computational purposes.

- **BitsAndBytesConfig Settings**:
  - Creates a `BitsAndBytesConfig` object with specific settings:
    1. `load_in_4bit`: Loads model weights in 4-bit format.
    2. `bnb_4bit_quant_type`: Sets quantization type to "nf4" (4-bit NormalFloat).
    3. `bnb_4bit_compute_dtype`: Utilizes `float16` for computations.
    4. `bnb_4bit_use_double_quant`: Disables double quantization to manage memory footprint.

- **Model Configuration**:
  - Generates an `AutoModelForCausalLM` object from the Mistral 7B model, applying the quantization settings from `BitsAndBytesConfig`.

- **Caching and Token Probability**:
  - Disables caching for the model.
  - Sets the pre-training token probability to 1.

### Tokenizer Loading:

- **Tokenizer Configuration**:
  - Loads the tokenizer for theMistral 7B language model.
  - Sets the padding token as the end-of-sequence (EOS) token.
  - Configures padding to be on the "right" side, ensuring proper padding direction for Mistral 7B.



In [7]:
model_name = "/kaggle/input/mistral/pytorch/7b-v0.1-hf/1"

compute_dtype = getattr(torch, "float16")

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=False,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=bnb_config, 
)

model.config.use_cache = False
model.config.pretraining_tp = 1

tokenizer = AutoTokenizer.from_pretrained(model_name, 
                                          trust_remote_code=True,
                                         )
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

## 📰 Sentiment Prediction Function Using Mistral 7B

In the next cell, we define a function to predict the sentiment of news headlines using the Mistral 7B language model. This function is designed with the following parameters and process:

### Function Parameters:

- **`test`**: 
  - A Pandas DataFrame containing the news headlines for sentiment prediction.

- **`model`**: 
  - The pre-trained Mistral 7B language model.

- **`tokenizer`**: 
  - The tokenizer corresponding to the Mistral 7B model.

### Prediction Process:

- **For each news headline**:
  1. **Create a Prompt**:
     - Formulates a prompt asking the Mistral 7B model to analyze the sentiment of the headline and return a sentiment label.
  
  2. **Generate Text**:
     - Utilizes the `pipeline()` function from the Hugging Face Transformers library to generate text from the Mistral 7B model, based on the prompt.

  3. **Extract Sentiment Label**:
     - Derives the predicted sentiment label from the generated text.
  
  4. **Append to y_pred**:
     - Adds the predicted sentiment label to the `y_pred` list.

- **Return**:
  - Outputs the `y_pred` list containing all predicted sentiment labels.

### `pipeline()` Function Usage:

- This function generates text from the Mistral 7B model.
- **Parameters**:
  - `task`: Specifies text generation as the task.
  - `model`: Indicates the pre-trained Mistral 7B language model.
  - `tokenizer`: Specifies the tokenizer for the Mistral 7B model.
  - `max_new_tokens`: Sets the maximum number of new tokens to generate.
  - `temperature`: Controls the randomness of the generated text. Lower temperature means less randomness (more predictable), while a higher temperature leads to more creative outputs.

### Sentiment Label Determination:

- The function uses an if statement to determine the sentiment label:
  - If the generated text contains "positive", the label is "positive".
  - If the generated text contains "negative", the label is "negative".
  - If the generated text contains "neutral", the label is "neutral".

This function effectively automates the sentiment analysis of news


In [8]:
def predict(test, model, tokenizer):
    y_pred = []
    for i in tqdm(range(len(X_test))):
        prompt = X_test.iloc[i]["text"]
        pipe = pipeline(task="text-generation", 
                        model=model, 
                        tokenizer=tokenizer, 
                        max_new_tokens = 1, 
                        temperature = 0.0,
                       )
        result = pipe(prompt)
        answer = result[0]['generated_text'].split("=")[-1]
        if "positive" in answer:
            y_pred.append("positive")
        elif "negative" in answer:
            y_pred.append("negative")
        elif "neutral" in answer:
            y_pred.append("neutral")
        else:
            y_pred.append("none")
    return y_pred

## 🧪 Testing Mistral 7B Model Pre-Fine-Tuning

At this stage of our project, we are set to test the capabilities of Mistral 7b-v0.1-hf model on our dataset. This test will be conducted without any fine-tuning to understand the baseline performance. Here's what this step involves:

### Objective:

- **Baseline Evaluation**:
  - The primary goal is to assess how well the Mistral model performs on our sentiment analysis problem in it's pre-trained state.
  
### Importance:

- **Insight Gathering**:
  - Testing the model without fine-tuning provides valuable insights into its inherent capability and limitations.

- **Baseline Establishment**:
  - This step helps in establishing a baseline performance of the model. It acts as a reference point to measure the impact of subsequent fine-tuning.

### Process:

- **Model Deployment**:
  - Deploy the Mistral 7b-v0.1-hf model as it is, without any modifications or fine-tuning.

- **Performance Evaluation**:
  - Evaluate the models on the same dataset to understand their default handling of sentiment analysis tasks.

This pre-fine-tuning test is a crucial step in our workflow, setting the stage for more targeted model optimization later on.


In [9]:
y_pred = predict(test, model, tokenizer)

  0%|          | 0/900 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
  0%|          | 1/900 [00:02<44:52,  3.00s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
  0%|          | 2/900 [00:03<22:17,  1.49s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
  0%|          | 3/900 [00:03<15:03,  1.01s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
  0%|          | 4/900 [00:04<11:39,  1.28it/s]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
  1%|          | 5/900 [00:04<09:21,  1.59it/s]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
  1%|          | 6/900 [00:05<08:41,  1.71it/s]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
  1%|          | 7/900 [00:05<07:57,  1.87it/s]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
  1%|          | 8/900 [00:06<07:28,  1.99it/s]Setting `pad_token_id` to `eos_token_id`:

In the following cell, we evaluate the results. There is little to be said, it is performing really terribly because the 7b-hf model tends to just predict a neutral sentiment and seldom it detects positive or negative sentiment.

In [10]:
evaluate(y_true, y_pred)

Accuracy: 0.544
Accuracy for label 0: 0.487
Accuracy for label 1: 0.210
Accuracy for label 2: 0.937

Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.49      0.65       300
           1       0.38      0.21      0.27       300
           2       0.48      0.94      0.64       300

    accuracy                           0.54       900
   macro avg       0.61      0.54      0.52       900
weighted avg       0.61      0.54      0.52       900


Confusion Matrix:
[[146  82  72]
 [  5  63 232]
 [  0  19 281]]


## 🔧 Setting Up for Fine-Tuning with SFTTrainer and PEFT

We're now ready to set up the environment for fine-tuning our large language models. This setup involves configuring and initializing the Simple Fine-tuning Trainer (SFTTrainer) using the Parameter-Efficient Fine-Tuning (PEFT) method. Here's a breakdown of the key components:

### PEFT Configuration:

- **`peft_config` Object**:
  - Specifies PEFT parameters, focusing on refining a limited set of model parameters. Key parameters include:
    - `lora_alpha`: Learning rate for LoRA update matrices.
    - `lora_dropout`: Dropout probability for LoRA matrices.
    - `r`: Rank of the LoRA update matrices.
    - `bias`: Type of bias (none, additive, learned).
    - `task_type`: Task type for training (CAUSAL_LM, MASKED_LM).

### Training Arguments:

- **`training_arguments` Object**:
  - Sets parameters for model training, including:
    - `output_dir`: Directory for logs and checkpoints.
    - `num_train_epochs`: Number of training epochs.
    - `per_device_train_batch_size`: Batch size per device.
    - `gradient_accumulation_steps`: Batches for gradient accumulation.
    - `optim`: Chosen optimizer.
    - `save_steps`, `logging_steps`: Checkpoint and logging intervals.
    - `learning_rate`, `weight_decay`: Optimizer settings.
    - `fp16`, `bf16`: Precision settings.
    - `max_grad_norm`, `max_steps`: Gradient and step limits.
    - `warmup_ratio`: Learning rate warmup proportion.
    - `group_by_length`: Sample grouping strategy.
    - `lr_scheduler_type`: Learning rate scheduler type.
    - `report_to`: Metrics reporting tools.
    - `evaluation_strategy`: Model evaluation strategy during training.

### SFTTrainer Initialization:

- **Setting Up the SFTTrainer**:
  - A custom trainer class from the PEFT library.
  - Initialized with:
    - `model`: Model to be trained.
    - `train_dataset`, `eval_dataset`: Training and evaluation datasets.
    - `peft_config`: PEFT configuration.
    - `dataset_text_field`: Text field name in the dataset.
    - `tokenizer`: Tokenizer for the model.
    - `args`: Training arguments.
    - `packing`: Packing strategy for training samples.
    - `max_seq_length`: Maximum sequence length.

### Training the Model:

- **Using the SFTTrainer**:
  - After initialization, the SFTTrainer is ready to train the model using the `train()` method.

This setup aims to efficiently fine-tune our large language models, utilizing PEFT to save computational resources and address issues like catastrophic forgetting. The SFTTrainer provides a tailored approach to fine-tune the model parameters effectively.


In [11]:
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)

training_arguments = TrainingArguments(
    output_dir="logs",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8, # 4
    optim="paged_adamw_32bit",
    save_steps=0,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=True,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="cosine",
    report_to="tensorboard",
    evaluation_strategy="epoch"
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=eval_data,
    peft_config=peft_config,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False,
    max_seq_length=1024,
)

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

## 🚀 Model Training and Saving with Trainer

The next code cell is dedicated to training our model and then saving the fine-tuned version. This is how it will be executed:

### Training the Model:

- **Training Execution**:
  - The model training is carried out by invoking the `trainer.train()` method from our previously configured trainer.

### Saving the Trained Model:

- **Model Storage**:
  - Post training, the model is saved to the `trained-model` directory. This ensures that our fine-tuned model is securely stored and accessible for future use.

### Performance Expectation:

- **Training Speed**:
  - Utilizing the standard GPU P100 provided by Kaggle, we anticipate the training process to be relatively fast, leveraging the high computational power of the GPU.



In [12]:
# Train model
trainer.train()

# Save trained model
trainer.model.save_pretrained("trained-model")

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss
0,0.8395,0.750339
2,0.6234,0.678037


Afterwards, loading the TensorBoard extension and start TensorBoard, pointing to the logs/runs directory, which is assumed to contain the training logs and checkpoints for your model, will allow you to understand how the models fits during the training.

In [16]:
%reload_ext tensorboard
%tensorboard --logdir logs/runs

Reusing TensorBoard on port 6006 (pid 140), started 0:00:43 ago. (Use '!kill 140' to kill it.)

## 📈 Model Prediction and Evaluation

Next, we'll proceed with the final stage of our model's performance assessment. This involves two key steps: predicting sentiment labels for the test set and evaluating the model's overall performance. Here's the breakdown:

### Predicting Sentiment Labels:

- **Using `predict()` Function**:
  - The code will first utilize the `predict()` function to determine sentiment labels for the test dataset.
  - This step is crucial in seeing how well our model generalizes to unseen data.

### Evaluating Model Performance:

- **Applying `evaluate()` Function**:
  - Following predictions, the `evaluate()` function will be called to assess the model's performance on the test set.
  - This evaluation focuses on metrics such as accuracy, precision, and recall for individual sentiment labels.

### Expected Results:

- **High Overall Accuracy**:
  - We anticipate an impressive overall accuracy exceeding 0.8.
  - This indicates a strong performance of the model in correctly classifying sentiments.

- **Sentiment Label Accuracy**:
  - High accuracy is expected for individual sentiment labels.
  - However, the prediction of the 'neutral' label might still have room for improvement.

- **Reflection on Fine-Tuning Success**:
  - These results, especially with limited data, showcase the efficacy of our fine-tuning approach.
  - It's noteworthy how targeted fine-tuning can significantly enhance model performance.

By completing these steps, we gain a comprehensive view of our model's capabilities in sentiment analysis post-fine-tuning.


In [17]:
y_pred = predict(test, model, tokenizer)
evaluate(y_true, y_pred)

  0%|          | 0/900 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
  0%|          | 1/900 [00:00<07:09,  2.09it/s]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
  0%|          | 2/900 [00:00<06:57,  2.15it/s]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
  0%|          | 3/900 [00:01<06:50,  2.18it/s]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
  0%|          | 4/900 [00:01<06:47,  2.20it/s]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
  1%|          | 5/900 [00:02<06:19,  2.36it/s]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
  1%|          | 6/900 [00:02<06:48,  2.19it/s]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
  1%|          | 7/900 [00:03<06:46,  2.20it/s]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
  1%|          | 8/900 [00:03<06:44,  2.20it/s]Setting `pad_token_id` to `eos_token_id`:

Accuracy: 0.859
Accuracy for label 0: 0.943
Accuracy for label 1: 0.803
Accuracy for label 2: 0.830

Classification Report:
              precision    recall  f1-score   support

           0       0.94      0.94      0.94       300
           1       0.80      0.80      0.80       300
           2       0.84      0.83      0.83       300

    accuracy                           0.86       900
   macro avg       0.86      0.86      0.86       900
weighted avg       0.86      0.86      0.86       900


Confusion Matrix:
[[283  15   2]
 [ 12 241  47]
 [  5  46 249]]





The following code will create a Pandas DataFrame called evaluation containing the text, true labels, and predicted labels from the test set. This is expectially useful for understanding the errors that the fine-tuned model makes, and gettting insights on how to improve the prompt.

In [18]:
evaluation = pd.DataFrame({'text': X_test["text"], 
                           'y_true':y_true, 
                           'y_pred': y_pred},
                         )
evaluation.to_csv("test_predictions.csv", index=False)