<a href="https://www.kaggle.com/code/kajuyerim/lora-guide-on-llama3-1-8b-instruct?scriptVersionId=193274184" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# LoRA Guide: Sentiment Analysis in the Financial Domain Using Llama 3.1 8B-Instruct
---

<font size="5">**What is LoRA?**</font> 
* <font size="3"> Low-Rank Adaptation: freezes the pretrained model weights and injects trainable rank-decomposition matrices into the layers of the Transformer architecture.</font> 
* <font size="3"> The weight update is expressed as the product of two small low-rank matrices, so far fewer parameters need to be trained than the full weight matrix holds.</font> 
* <font size="3"> Instead of updating all weights of the model, we update only the injected low-rank matrices.</font> 
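
To make the idea concrete, here is a minimal NumPy sketch of a single LoRA-adapted linear layer. The sizes (`d_in`, `d_out`, `r`, `alpha`) are toy values chosen for illustration and are not the settings used later in this guide. The initialization (random `A`, zero `B`) and the `alpha / r` scaling follow the description in the LoRA paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, chosen only for illustration
d_in, d_out, r, alpha = 64, 64, 4, 8

# Frozen pretrained weight: never updated during fine-tuning
W = rng.normal(size=(d_out, d_in))

# Trainable low-rank factors: A starts random and B starts at zero,
# so the update B @ A is zero before training begins
A = rng.normal(scale=0.01, size=(r, d_in))
B = np.zeros((d_out, r))

def lora_forward(x, W, A, B, alpha, r):
    """Return x @ W.T plus the scaled low-rank update (alpha / r) * x @ (B @ A).T."""
    base = x @ W.T
    update = (x @ A.T) @ B.T
    return base + (alpha / r) * update

x = rng.normal(size=(2, d_in))
y = lora_forward(x, W, A, B, alpha, r)

full_params = W.size            # parameters a full fine-tune would update
lora_params = A.size + B.size   # parameters LoRA actually trains
print(y.shape, full_params, lora_params)
```

Because `B` starts at zero, the adapted layer initially produces exactly the same output as the frozen layer; training then moves only `A` and `B`.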

<font size="5">**Why Use LoRA?**</font> 
* <font size="3"> Greatly reduces the number of trainable parameters (up to 10,000 times for GPT-3 175B, as reported in the LoRA paper).</font>
* <font size="3"> Reduces the GPU memory requirement (up to 3 times in the same setting).</font>
    
<font size="5">**References**</font> 
* <font size="3"> https://www.datacamp.com/tutorial/fine-tuning-llama-3-1  
* <font size="3"> https://towardsdatascience.com/implementing-lora-from-scratch-20f838b046f1  
* <font size="3"> https://lightning.ai/pages/community/tutorial/lora-llm/  
    

---
<font size="4">**! This guide is based on the LoRA paper (LoRA: Low-Rank Adaptation of Large Language Models)** (https://arxiv.org/abs/2106.09685).</font>   
<font size="4">**! Any discrepancy between this guide and the paper is due to my interpretation of the material.**</font>  

---


In [1]:
%%capture
%pip install -U bitsandbytes
%pip install -U transformers
%pip install -U accelerate
%pip install -U trl
%pip install -U peft
%pip install -U wandb

# Initiating Hugging Face & Wandb Login
<font size="3">We need to log in to Hugging Face to use and upload the Llama 3.1 model, and to Wandb to track training performance.</font>  
<font size="3">Store your Hugging Face token as a Kaggle secret named HUGGINGFACE_TOKEN.</font>  
<font size="3">Store your Wandb token as a Kaggle secret named WANDB_TOKEN.</font>
    
### Weights & Biases (wandb) for Fine-Tuning Progress Tracking

The `wandb` library is the Python client for [Weights & Biases](https://wandb.ai/), an experiment-tracking service. In this notebook it records the metrics that the trainer reports during fine-tuning.

#### Use Case: Tracking Fine-Tuning Progress

During fine-tuning it helps to watch metrics such as the training loss, learning rate, and gradient norm as they evolve. With `report_to="wandb"` set in the training arguments, these values are logged at every logging step and can be viewed as charts in the run's dashboard.

#### Key Benefits

- **Automatic Logging**: Training metrics are sent to the run created by `wandb.init` without extra logging code.
- **Run Comparison**: Runs with different hyperparameters can be compared side by side within one project.
- **Run Summary**: `wandb.finish()` closes the run and prints its history and final summary values.



In [2]:
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient
import wandb

user_secrets = UserSecretsClient()

hf_token = user_secrets.get_secret("HUGGINGFACE_TOKEN")
wb_token = user_secrets.get_secret("WANDB_TOKEN")

login(token = hf_token)
wandb.login(key=wb_token)
run = wandb.init(
    project='Fine tuning LLama3.1 8b for Sentimental Analysis on Financial Domain', 
    job_type="training", 
    anonymous="allow"
)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful


wandb: Currently logged in as: kajuyerim (kajuyerim-bo-azi-i-niversitesi). Use `wandb login --relogin` to force relogin
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [3]:
import numpy as np
import os
from tqdm import tqdm
import bitsandbytes as bnb
import torch
import torch.nn as nn
import transformers
from datasets import Dataset
from trl import SFTTrainer
from trl import setup_chat_format
from transformers import (AutoModelForCausalLM, 
                          AutoTokenizer, 
                          BitsAndBytesConfig, 
                          TrainingArguments, 
                          pipeline, 
                          logging)
from sklearn.metrics import (accuracy_score, 
                             classification_report, 
                             confusion_matrix)
from sklearn.model_selection import train_test_split



# Loading and Explaining Dataset

In [4]:
import pandas as pd
from datasets import load_dataset

# Load and shuffle the dataset (seed for reproducing)
dataset = load_dataset("takala/financial_phrasebank", "sentences_allagree", split='train')
shuffled_dataset = dataset.shuffle(seed=32)

df = shuffled_dataset.to_pandas()

# Map sentiment labels to text (optional)
label_mapping = {0: 'negative', 1: 'neutral', 2: 'positive'}
df['label'] = df['label'].map(label_mapping)

# Save the dataset to a CSV file
df.to_csv('financial_phrasebank.csv', index=False)

# Display the updated DataFrame
print(df.head())
df.label.value_counts()

                                            sentence     label
0             The value of the order is USD 2.2 mn .   neutral
1  YIT lodged counter claims against Neste Oil to...  negative
2  Rohwedder Group is an automotive supplies , te...   neutral
3  Nokia was up 0.12 pct to 16.70 eur after kicki...  positive
4  The Group owns and operates a fleet of more th...   neutral


label
neutral     1391
positive     570
negative     303
Name: count, dtype: int64

In [5]:
import pandas as pd

# Split the DataFrame
train_size = 0.8
eval_size = 0.1

# Calculate sizes
train_end = int(train_size * len(df))
eval_end = train_end + int(eval_size * len(df))

# Split the data
X_train = df[:train_end].copy()          # .copy() avoids pandas' SettingWithCopyWarning when adding a column
X_eval = df[train_end:eval_end].copy()
X_test = df[eval_end:]

# Keep a copy of the test labels before generating prompts
y_true = X_test['label'].values 

# Define the prompt generation functions
def generate_prompt(data_point):
    return f"""
            Classify the financial text as Positive, Negative, or Neutral.
text: {data_point['sentence']}
label: {data_point['label']}""".strip()

def generate_test_prompt(data_point):
    return f"""
            Classify the financial text as Positive, Negative, or Neutral.
text: {data_point['sentence']}
label: """.strip()

# Generate prompts for training and evaluation data
X_train['text'] = X_train.apply(generate_prompt, axis=1)
X_eval['text'] = X_eval.apply(generate_prompt, axis=1)

# Generate test prompts (without labels)
X_test = pd.DataFrame(X_test.apply(generate_test_prompt, axis=1), columns=["text"])

## If you want to see all of the text -> Use these:
# pd.set_option('display.max_colwidth', None)  # Set max column width to None to avoid truncation
# pd.set_option('display.max_columns', None)   # Show all columns

# First 5 rows
print(X_train.head(5))

# y_true now holds the correct labels for evaluation
# X_train.label.value_counts() can be used to check label distribution in the training set


                                            sentence     label  \
0             The value of the order is USD 2.2 mn .   neutral   
1  YIT lodged counter claims against Neste Oil to...  negative   
2  Rohwedder Group is an automotive supplies , te...   neutral   
3  Nokia was up 0.12 pct to 16.70 eur after kicki...  positive   
4  The Group owns and operates a fleet of more th...   neutral   

                                                text  
0  Classify the financial text as Positive, Negat...  
1  Classify the financial text as Positive, Negat...  
2  Classify the financial text as Positive, Negat...  
3  Classify the financial text as Positive, Negat...  
4  Classify the financial text as Positive, Negat...  


In [6]:
# Convert to datasets
train_data = Dataset.from_pandas(X_train[["text"]])
eval_data = Dataset.from_pandas(X_eval[["text"]])

train_data.shuffle(seed=42)['text'][3:6]

['Classify the financial text as Positive, Negative, or Neutral.\ntext: In the sinter plant , limestone and coke breeze are mixed with the iron ore concentrate and sintered into lump form or sinter for use in the blast furnaces as a raw material for iron-making .\nlabel: neutral',
 'Classify the financial text as Positive, Negative, or Neutral.\ntext: The Oxyview Pulse Oximeter is a common device to check patient blood-oxygen saturation level and pulse rate .\nlabel: neutral',
 "Classify the financial text as Positive, Negative, or Neutral.\ntext: Theodosopoulos said Tellabs could be of value to Nokia Siemens or Nortel given its `` leading supply status '' with Verizon , along with high-growth products .\nlabel: positive"]

# Loading the Model LLama3.1 8b-instruct

### What is Quantization?

Quantization reduces the precision of a model's weights and activations, typically from 32-bit floats to lower-bit integers, to make the model smaller and faster.

### How Quantization Works

1. **Scaling Values**: 
   - Example: A 32-bit floating-point value like `0.123456` is multiplied by a scale factor (here `1000`) so that it falls within the `0-255` range of an unsigned 8-bit integer, giving `123.456`.

2. **Reducing Precision**: 
   - Example: The scaled value `123.456` is rounded to `123`, which fits in an 8-bit integer and reduces memory and computation requirements. The scale factor is kept so the value can be approximately recovered later.

Quantization optimizes models for deployment on resource-constrained devices, with a potential trade-off in accuracy.
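
The scale-then-round idea can be sketched in a few lines of NumPy. This is a simplified 8-bit absmax scheme for illustration only; it is not the 4-bit NF4 format that `bitsandbytes` applies later in this guide, and the function names are hypothetical.

```python
import numpy as np

def quantize_absmax_int8(x):
    """Map float values to int8 using one absmax scale factor.

    Assumes x contains at least one non-zero value.
    """
    scale = 127.0 / np.max(np.abs(x))          # largest magnitude maps to 127
    q = np.round(x * scale).astype(np.int8)    # scale, then round to an integer
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the int8 codes."""
    return q.astype(np.float32) / scale

weights = np.array([0.123456, -0.5, 0.25, 0.9], dtype=np.float32)
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)

# Rounding loses at most half a quantization step per value
max_error = np.max(np.abs(weights - restored))
print(q, restored, max_error)
```

The restored values are close to the originals but not identical, which is the precision trade-off described above.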

### Reasons for Using Quantization in Fine-Tuning

#### Pros:
1. **Memory Efficiency**: Reduces model size, enabling fine-tuning on devices with limited memory.
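2. **Faster Computation**: Lower precision can speed up training and inference when the hardware and kernels support it. With `bitsandbytes` 4-bit loading, weights are dequantized to the compute dtype before each matrix multiplication, so the main gain is memory rather than speed.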
3. **Resource Optimization**: Allows fine-tuning on less powerful, more affordable hardware.

#### Cons:
1. **Loss of Precision**: Can decrease model accuracy due to reduced precision.
2. **Increased Complexity**: Adds complexity to setup and may require more careful tuning.
3. **Potential Instability**: May cause instability during training, especially with small datasets.

### Settings
1. **`load_in_4bit=True`**:
   - Enables 4-bit quantization, reducing memory usage.

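2. **`bnb_4bit_use_double_quant=False`**:
   - Disables double quantization, which would additionally quantize the quantization constants to save a little more memory.

3. **`bnb_4bit_quant_type="nf4"`**:
   - Uses the 4-bit NormalFloat (NF4) data type introduced in the QLoRA paper, which is designed for normally distributed weights.

4. **`bnb_4bit_compute_dtype="float16"`**:
   - Weights are stored in 4-bit but dequantized to 16-bit float for computation, balancing precision and efficiency.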
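
As a rough back-of-the-envelope check of why 4-bit loading matters, the sketch below estimates weight storage at different precisions. The parameter count of about 8 billion is an assumption based on the model's name, and the estimate ignores quantization constants, layers that stay unquantized, activations, and optimizer state.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate storage for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

num_params = 8.0e9   # assumption: roughly 8 billion parameters

fp16_gb = weight_memory_gb(num_params, 16)
nf4_gb = weight_memory_gb(num_params, 4)
print(f"float16: ~{fp16_gb:.0f} GB, 4-bit: ~{nf4_gb:.0f} GB")
```

Under these assumptions the weights shrink to about a quarter of their float16 size, which is what makes fine-tuning feasible on a single Kaggle GPU.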


In [7]:
base_model_name = "/kaggle/input/llama-3.1/transformers/8b-instruct/1"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16",
)

model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    device_map="auto",
    torch_dtype="float16",
    quantization_config=bnb_config, 
)

model.config.use_cache = False
model.config.pretraining_tp = 1

tokenizer = AutoTokenizer.from_pretrained(base_model_name)

tokenizer.pad_token_id = tokenizer.eos_token_id


# Evaluating Before Fine-Tuning

### Why Evaluating a Model Before Fine-Tuning is Important

#### 1. Establishing a Baseline Performance
- **Purpose:** Understand the model's initial performance on your task.
- **Benefit:** Provides a reference point to measure the impact of fine-tuning.

#### 2. Understanding Model's Initial Capabilities
- **Purpose:** Assess the model’s existing knowledge relevant to your task.
- **Benefit:** Helps determine if fine-tuning is necessary or if the model is already performing well.

#### 3. Identifying the Need for Fine-Tuning
- **Purpose:** Evaluate if the model needs fine-tuning.
- **Benefit:** Saves time and resources by avoiding unnecessary fine-tuning.

#### 4. Diagnosing Model Weaknesses
- **Purpose:** Identify areas where the model struggles.
- **Benefit:** Guides targeted fine-tuning to address specific issues.

#### 5. Guiding Fine-Tuning Strategy
- **Purpose:** Inform decisions on which layers or parts of the model to fine-tune.
- **Benefit:** Ensures efficient and effective fine-tuning.

#### 6. Ensuring Reproducibility
- **Purpose:** Document performance to track changes and improvements.
- **Benefit:** Maintains a clear record of the model’s development.


In [8]:
def predict(test, model, tokenizer):
    y_pred = []
    categories = ["positive", "negative", "neutral"]
    
    # Build the generation pipeline once instead of once per prompt
    pipe = pipeline(task="text-generation", 
                    model=model, 
                    tokenizer=tokenizer, 
                    max_new_tokens=2, 
                    temperature=0.1)
    
    for i in tqdm(range(len(test))):
        prompt = test.iloc[i]["text"]
        result = pipe(prompt)
        answer = result[0]['generated_text'].split("label:")[-1].strip()
        
        # Determine the predicted category
        for category in categories:
            if category.lower() in answer.lower():
                y_pred.append(category)
                break
        else:
            y_pred.append("none")
    
    return y_pred

y_pred = predict(X_test, model, tokenizer)


100%|██████████| 227/227 [01:20<00:00,  2.82it/s]


In [9]:
def evaluate(y_true, y_pred):
    labels = ["positive", "negative", "neutral"]
    mapping = {label: idx for idx, label in enumerate(labels)}
    
    def map_func(x):
        # Unknown labels, including the "none" fallback from predict, map to -1
        return mapping.get(x, -1)
    
    y_true_mapped = np.vectorize(map_func)(y_true)
    y_pred_mapped = np.vectorize(map_func)(y_pred)
    
    # Drop -1 entries; note that unparseable predictions are excluded from every metric below
    valid_indices = (y_true_mapped != -1) & (y_pred_mapped != -1)
    y_true_mapped = y_true_mapped[valid_indices]
    y_pred_mapped = y_pred_mapped[valid_indices]
    
    # Check if there are any valid labels left
    if len(y_true_mapped) == 0 or len(y_pred_mapped) == 0:
        print("No valid labels found in y_true or y_pred.")
        return
    
    # Calculate accuracy
    accuracy = accuracy_score(y_true=y_true_mapped, y_pred=y_pred_mapped)
    print(f'Accuracy: {accuracy:.3f}')
    
    # Per-label accuracy: fraction of each true label predicted correctly (equivalent to recall)
    unique_labels = set(y_true_mapped) 
    
    for label in unique_labels:
        label_indices = [i for i in range(len(y_true_mapped)) if y_true_mapped[i] == label]
        label_y_true = [y_true_mapped[i] for i in label_indices]
        label_y_pred = [y_pred_mapped[i] for i in label_indices]
        label_accuracy = accuracy_score(label_y_true, label_y_pred)
        print(f'Accuracy for label {labels[label]}: {label_accuracy:.3f}')
        
    # Generate classification report
    class_report = classification_report(y_true=y_true_mapped, y_pred=y_pred_mapped, target_names=labels, labels=list(range(len(labels))))
    print('\nClassification Report:')
    print(class_report)
    
    # Generate confusion matrix
    present_labels = list(unique_labels)
    if len(present_labels) == 0:
        print("No labels present for confusion matrix.")
    else:
        conf_matrix = confusion_matrix(y_true=y_true_mapped, y_pred=y_pred_mapped, labels=present_labels)
        print('\nConfusion Matrix:')
        print(conf_matrix)

# Generate predictions using the updated X_test (which now contains only text prompts)
y_pred = predict(X_test, model, tokenizer)
# Check the contents of y_true
print("y_true (first 10 elements):", y_true[:10])
print("Unique values in y_true:", set(y_true))

# Check the contents of y_pred
print("y_pred (first 10 elements):", y_pred[:10])
print("Unique values in y_pred:", set(y_pred))

# Evaluate the predictions against the true labels stored in y_true
evaluate(y_true, y_pred)


100%|██████████| 227/227 [01:20<00:00,  2.83it/s]

y_true (first 10 elements): ['positive' 'negative' 'neutral' 'neutral' 'neutral' 'neutral' 'neutral'
 'neutral' 'positive' 'neutral']
Unique values in y_true: {'negative', 'neutral', 'positive'}
y_pred (first 10 elements): ['positive', 'negative', 'positive', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'positive', 'neutral']
Unique values in y_pred: {'negative', 'neutral', 'positive'}
Accuracy: 0.837
Accuracy for label positive: 0.924
Accuracy for label negative: 1.000
Accuracy for label neutral: 0.765

Classification Report:
              precision    recall  f1-score   support

    positive       0.71      0.92      0.80        66
    negative       0.76      1.00      0.86        25
     neutral       0.96      0.76      0.85       136

    accuracy                           0.84       227
   macro avg       0.81      0.90      0.84       227
weighted avg       0.87      0.84      0.84       227


Confusion Matrix:
[[ 61   1   4]
 [  0  25   0]
 [ 25   7 104]]





### Explanation of the Scores and Measurement Criteria

#### **Overall Accuracy: 0.837**
- **What it means:** This represents the proportion of total correct predictions made by the model across all classes. In this case, the model correctly predicted approximately 83.7% of the instances.
- **Calculation:**  
   <div style="text-align: center;">
   <span style="font-size: 24px;">  $\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$ </span>
   </div>
  Here, accuracy is calculated as $\frac{(61+25+104)}{227} = 0.837$.  

#### **Precision**
- **What it means:** Precision measures the accuracy of positive predictions made by the model. It is the ratio of correctly predicted positive observations to the total predicted positives. Precision is crucial when the cost of false positives is high, as it indicates how many of the positive predictions made by the model were actually correct.
- **Calculation:**
  <div style="text-align: center;">
   <span style="font-size: 24px;">
   $\text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}}$
   </span>
</div>

- **Example:** Precision for the positive class is 0.71, meaning 71% of the instances that were predicted as positive by the model were actually positive.

#### **Recall**
- **What it means:** Recall measures the ability of the model to identify all relevant instances of a class. It is the ratio of correctly predicted positive observations to all observations in the actual class.
- **Calculation:** 
  <div style="text-align: center;">
   <span style="font-size: 24px;"> 
   $\text{Recall} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}}$
   </span>
  </div>

- **Example:** Recall for the positive class is 0.92, meaning 92% of actual positive instances were correctly identified by the model.

#### **F1-Score**
- **What it means:** F1-Score is the harmonic mean of Precision and Recall. It provides a balance between precision and recall, especially useful when you want to find an equilibrium between these two metrics.
- **Calculation:**
  <div style="text-align: center;">
   <span style="font-size: 24px;">
   $\text{F1-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
   </span><br><br>
</div>

- **Example:** The F1-Score for the positive class is 0.80, which lies between its precision (0.71) and recall (0.92) and is pulled toward the lower of the two.


### **Confusion Matrix** (values may vary between runs; focus on the general idea)
| Actual \ Predicted | Positive | Negative | Neutral |
|--------------------|----------|----------|---------|
| **Positive**       |    61    |    1     |    4    |
| **Negative**       |    0     |    25    |    0    |
| **Neutral**        |    25    |    7     |    104  |
  


- **[61, 1, 4]** (actual positive): 
  - **61:** Correctly predicted as positive (true positives for the positive class).
  - **1:** Predicted negative but actually positive (a false negative for the positive class).
  - **4:** Predicted neutral but actually positive (also false negatives for the positive class).
  
- **[0, 25, 0]** (actual negative): 
  - **25:** Correctly predicted as negative.
  - **0:** No negative instances were misclassified.

- **[25, 7, 104]** (actual neutral): 
  - **25:** Predicted positive but actually neutral (false positives for the positive class).
  - **7:** Predicted negative but actually neutral (false positives for the negative class).
  - **104:** Correctly predicted as neutral.
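
The per-class scores in the classification report can be recomputed directly from this confusion matrix. The sketch below uses plain NumPy rather than scikit-learn, with rows as actual classes and columns as predicted classes, matching the table above.

```python
import numpy as np

labels = ["positive", "negative", "neutral"]

# Rows are actual classes, columns are predicted classes
cm = np.array([[61, 1, 4],
               [0, 25, 0],
               [25, 7, 104]])

tp = np.diag(cm)                  # correct predictions for each class
precision = tp / cm.sum(axis=0)   # column sums: everything predicted as that class
recall = tp / cm.sum(axis=1)      # row sums: everything actually in that class
f1 = 2 * precision * recall / (precision + recall)
accuracy = tp.sum() / cm.sum()

for name, p, r, f in zip(labels, precision, recall, f1):
    print(f"{name:>8}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.3f}")
```

Reading precision down a column and recall across a row is an easy way to keep the two apart.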
  


# Identifying 4-Bit Quantized Layers for Fine-Tuning

- **Purpose**: Find the layers that LoRA (Low-Rank Adaptation) adapters will be attached to, so that fine-tuning a large model stays cheap.

- **Quantized Layers**: After loading with the `bitsandbytes` library, the model's linear layers are `Linear4bit` modules, so searching for that class finds every 4-bit quantized linear layer.

- **Efficiency**: The 4-bit base weights stay frozen; only the small LoRA matrices attached to these layers are trained.

- **Performance Balance**: Adapting all attention and MLP projections gives the adapters broad coverage of the model while keeping the trainable parameter count small.

- **Advanced Fine-Tuning**: Combining a 4-bit quantized base model with LoRA adapters is the approach described in the QLoRA paper.

### Explanation

- **`cls = bnb.nn.Linear4bit`**: 
  - Refers to the `Linear4bit` class from the `bitsandbytes` library, which is used to identify 4-bit quantized layers in the model.

- **`lora_module_names = set()`**: 
  - Initializes a set to store unique names of the 4-bit quantized layers that will be targeted for LoRA fine-tuning.

- **`if isinstance(module, cls):`**: 
  - Checks if the current module is an instance of the `Linear4bit` class, meaning it is a 4-bit quantized layer suitable for LoRA fine-tuning.

- **`lora_module_names.add(names[0] if len(names) == 1 else names[-1])`**: 
  - Adds the name of the identified 4-bit quantized layer to the set, focusing on the last part of the name if it is nested.

- **`if 'lm_head' in lora_module_names:`**:
  - Checks if `'lm_head'` (usually the output layer) is in the set, often excluded from LoRA fine-tuning.

- **`lora_module_names.remove('lm_head')`**:
  - Removes `'lm_head'` from the set to prevent it from being fine-tuned.

- **`return list(lora_module_names)`**:
  - Returns a list of the names of 4-bit quantized layers that are suitable for LoRA fine-tuning.




In [11]:
def find_all_linear_names(model):
    cls = bnb.nn.Linear4bit 
    lora_module_names = set()
    
    for name, module in model.named_modules():
        if isinstance(module, cls): 
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])
    
    # If 'lm_head' is in the set, remove it (often needed for 16-bit models)
    if 'lm_head' in lora_module_names:  
        lora_module_names.remove('lm_head')
    
    return list(lora_module_names)

modules = find_all_linear_names(model)
modules

['q_proj', 'down_proj', 'o_proj', 'gate_proj', 'up_proj', 'v_proj', 'k_proj']

# Training Settings

### `peft_config = LoraConfig(...)`

- **`lora_alpha=16`**: 
  - Controls the scaling factor for LoRA (Low-Rank Adaptation) layers. The low-rank update is multiplied by `lora_alpha / r`, which is `16 / 64 = 0.25` here, so it sets how strongly the adapters influence the model's output.

- **`lora_dropout=0`**: 
  - Specifies that no dropout is applied to the input of the LoRA layers during training.

- **`r=64`**: 
  - Defines the rank of the LoRA layers, which determines the inner dimension of the low-rank matrices. A higher rank can capture more complex updates but adds trainable parameters and memory.

- **`bias="none"`**: 
  - Indicates that no bias parameters are trained; only the LoRA matrices are updated.

- **`task_type="CAUSAL_LM"`**: 
  - Specifies that the task is Causal Language Modeling, where the model predicts the next token in a sequence based on previous tokens.

- **`target_modules=modules`**: 
  - Assigns the LoRA layers to the specific modules identified using `find_all_linear_names`, ensuring that only certain parts of the model are fine-tuned.
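
To get a feel for what `r=64` costs, the sketch below counts the LoRA parameters these settings would add. The layer shapes and the layer count are assumptions about the Llama 3.1 8B architecture and are not printed anywhere in this notebook, so treat the total as an estimate.

```python
# Assumed (in_features, out_features) for each targeted projection in one
# decoder layer of Llama 3.1 8B; these shapes are not shown in this notebook.
layer_shapes = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
num_layers = 32          # assumed number of decoder layers
r, lora_alpha = 64, 16   # values from the LoraConfig in this guide

# Each adapted layer adds A (r x in_features) and B (out_features x r)
per_layer = sum(r * (d_in + d_out) for d_in, d_out in layer_shapes.values())
trainable = per_layer * num_layers
scaling = lora_alpha / r

print(f"trainable LoRA parameters: {trainable:,}")
print(f"update scaling (lora_alpha / r): {scaling}")
```

Under these assumptions the adapters hold roughly 168 million parameters, a small fraction of the 8 billion in the base model.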
  
### `training_arguments = TrainingArguments(...)`

- **`num_train_epochs=1`**: 
  - Indicates that the model will be trained for 1 epoch, meaning it will pass through the entire training dataset once.

- **`per_device_train_batch_size=1`**: 
  - Sets the batch size to 1 per device during training, meaning each GPU or TPU will process one example at a time.

- **`gradient_accumulation_steps=8`**: 
  - Accumulates gradients over 8 forward/backward passes before performing one optimizer update. This gives an effective batch size of 8 without requiring the memory of a larger batch.

- **`optim="paged_adamw_32bit"`**: 
  - Uses the `bitsandbytes` paged AdamW optimizer with 32-bit optimizer states. Paging lets optimizer state spill to CPU memory during GPU memory spikes, which helps avoid out-of-memory errors when fine-tuning large models.

- **`logging_steps=1`**: 
  - Logs training metrics after every step, allowing for detailed tracking of the training process.

- **`learning_rate=2e-4`**: 
  - Sets the learning rate to 0.0002, which is based on the QLoRA paper. This controls how quickly the model updates its parameters during training.

- **`weight_decay=0.001`**: 
  - Applies a weight decay of 0.001, which is a regularization technique to prevent the model from overfitting by penalizing large weights.

- **`max_grad_norm=0.3`**: 
  - Clips gradients to a maximum norm of 0.3, which helps in stabilizing training and preventing gradient explosions.

- **`warmup_ratio=0.03`**: 
  - Uses a warmup ratio of 0.03, meaning 3% of the total training steps are used for a gradual learning rate increase from 0 to the specified rate, following the QLoRA paper.

- **`group_by_length=False`**: 
  - Disables grouping of examples by their length, meaning that examples will not be sorted or batched based on their sequence length.

- **`lr_scheduler_type="cosine"`**: 
  - Uses a cosine learning rate scheduler, which gradually decreases the learning rate following a cosine curve.

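- **`eval_strategy="steps"`**: 
  - Specifies that evaluation will be performed every `eval_steps` optimizer steps during training rather than after each epoch.

- **`eval_steps=int(0.2 * len(train_data))`**: 
  - Sets the number of steps between evaluations to 20% of the number of training examples (362 for 1811 examples). Because steps count optimizer updates rather than examples, and this run has only 226 updates, no evaluation is triggered during training with these settings; dividing by the effective batch size of 8 would evaluate about five times per epoch.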
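
The step arithmetic behind these settings can be checked with a short sketch. It assumes a single GPU and 1811 training examples (the count shown in the trainer's `Map` progress output), and the `lr_at` helper is a hypothetical re-implementation of linear warmup followed by cosine decay, not the trainer's own code.

```python
import math

num_examples = 1811          # training examples, from the Map progress output
per_device_batch_size = 1
grad_accum_steps = 8
base_lr = 2e-4
warmup_ratio = 0.03

effective_batch = per_device_batch_size * grad_accum_steps
total_steps = num_examples // effective_batch          # optimizer updates in one epoch
warmup_steps = math.ceil(total_steps * warmup_ratio)   # steps spent ramping the LR up
eval_steps = int(0.2 * num_examples)                   # value passed to TrainingArguments

def lr_at(step):
    """Linear warmup to base_lr, then cosine decay to zero (illustrative helper)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(total_steps, warmup_steps, eval_steps)
print(lr_at(0), lr_at(warmup_steps), lr_at(total_steps))
```

The computed 226 optimizer steps match the `global_step=226` reported after training, while `eval_steps` comes out larger than that total.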

---

### `trainer = SFTTrainer(...)`

- **`train_dataset=train_data`**: 
  - The dataset used for training the model, containing the processed training examples.

- **`eval_dataset=eval_data`**: 
  - The dataset used for evaluation during training, allowing for validation of the model's performance.

- **`dataset_text_field="text"`**: 
  - Indicates that the dataset field containing the text data is labeled "text."

- **`max_seq_length=512`**: 
  - Sets the maximum sequence length in tokens; longer inputs are truncated to 512 tokens.

- **`packing=False`**: 
  - Disables packing of multiple examples into one input sequence, meaning each input will be processed individually.

- **`dataset_kwargs={"add_special_tokens": False, "append_concat_token": False}`**: 
  - Additional settings for handling the dataset, where special tokens are not added, and no concatenation token is appended.


In [12]:
from peft import LoraConfig
output_dir="llama-3.1-fine-tuned-model"

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=modules,  # Retaining the modules found using find_all_linear_names
)

training_arguments = TrainingArguments(
    output_dir=output_dir,                    # Directory to save the model
    num_train_epochs=1,                       # Number of training epochs
    per_device_train_batch_size=1,            # Batch size per device during training
    gradient_accumulation_steps=8,            # Micro-batches accumulated before each optimizer update
    gradient_checkpointing=True,              # Use gradient checkpointing to save memory
    optim="paged_adamw_32bit",                # Optimizer set for 32-bit AdamW (efficient memory usage)
    logging_steps=1,                         
    learning_rate=2e-4,                       # Learning rate, based on QLoRA paper (https://arxiv.org/abs/2305.14314 )
    weight_decay=0.001,
    fp16=True,                                # Using fp16 as the model was loaded in fp16 (float16) precision
    bf16=False,                               
    max_grad_norm=0.3,                        # Max gradient norm based on QLoRA paper
    max_steps=-1,
    warmup_ratio=0.03,                        # Warmup ratio based on QLoRA paper
    group_by_length=False,
    lr_scheduler_type="cosine",               # Cosine learning rate scheduler
    report_to="wandb",                        # Report metrics to W&B
    eval_strategy="steps",                    # Evaluate every eval_steps optimizer steps rather than per epoch
    eval_steps=int(0.2 * len(train_data)),    # 362 here; exceeds the 226 total steps, so no mid-training evaluation runs
)

trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=train_data,
    eval_dataset=eval_data,
    peft_config=peft_config,
    dataset_text_field="text",               # The field name in the dataset containing the text
    tokenizer=tokenizer,
    max_seq_length=512,                      # Max sequence length for padding/truncation
    packing=False,                           # No packing of multiple examples into one input sequence
    dataset_kwargs={
        "add_special_tokens": False,         # Do not add special tokens
        "append_concat_token": False,        # Do not append concatenation token
    }
)




Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.


Map:   0%|          | 0/1811 [00:00<?, ? examples/s]

Map:   0%|          | 0/226 [00:00<?, ? examples/s]

In [13]:
trainer.train()



Step,Training Loss,Validation Loss


TrainOutput(global_step=226, training_loss=1.5630741752354445, metrics={'train_runtime': 1940.9691, 'train_samples_per_second': 0.933, 'train_steps_per_second': 0.116, 'total_flos': 3913213492862976.0, 'train_loss': 1.5630741752354445, 'epoch': 0.9983434566537824})

In [14]:
wandb.finish()
model.config.use_cache = True

Run history:
train/epoch          ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
train/global_step    ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
train/grad_norm      ▂▂█▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
train/learning_rate  ▃▇██████▇▇▇▇▇▆▆▆▆▅▅▅▅▄▄▄▃▃▃▃▂▂▂▂▂▁▁▁▁▁▁▁
train/loss           █▅▄▃▁▂▃▂▂▂▃▁▂▁▂▂▁▁▂▁▁▂▂▂▂▂▁▂▂▂▁▁▁▂▂▂▂▁▁▂

Run summary:
total_flos                3913213492862976.0
train/epoch               0.99834
train/global_step         226.0
train/grad_norm           0.40368
train/learning_rate       0.0
train/loss                1.7057
train_loss                1.56307
train_runtime             1940.9691
train_samples_per_second  0.933
train_steps_per_second    0.116


In [15]:
# Save trained model and tokenizer
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)

('llama-3.1-fine-tuned-model/tokenizer_config.json',
 'llama-3.1-fine-tuned-model/special_tokens_map.json',
 'llama-3.1-fine-tuned-model/tokenizer.json')

In [16]:
y_pred = predict(X_test, model, tokenizer)
evaluate(y_true, y_pred)

  0%|          | 0/227 [00:00<?, ?it/s]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
100%|██████████| 227/227 [02:25<00:00,  1.56it/s]

Accuracy: 0.952
Accuracy for label positive: 0.909
Accuracy for label negative: 0.960
Accuracy for label neutral: 0.971

Classification Report:
              precision    recall  f1-score   support

    positive       0.94      0.91      0.92        66
    negative       1.00      0.96      0.98        25
     neutral       0.95      0.97      0.96       136

    accuracy                           0.95       227
   macro avg       0.96      0.95      0.95       227
weighted avg       0.95      0.95      0.95       227


Confusion Matrix:
[[ 60   0   6]
 [  0  24   1]
 [  4   0 132]]





# Model Performance Comparison

### Accuracy
- **Before Fine-Tuning (same 227-example test split, about 10% of the dataset):**
  - Accuracy: 0.837
  - Accuracy for label positive: 0.924
  - Accuracy for label negative: 1.000
  - Accuracy for label neutral: 0.765

- **After Fine-Tuning:**
  - Accuracy: 0.952
  - Accuracy for label positive: 0.909
  - Accuracy for label negative: 0.960
  - Accuracy for label neutral: 0.971

### Precision, Recall, F1-Score Comparison

| Class      | Precision (Before) | Precision (After) | Recall (Before) | Recall (After) | F1-Score (Before) | F1-Score (After) |
|------------|--------------------|-------------------|-----------------|----------------|-------------------|------------------|
| Positive   | 0.71               | 0.94              | 0.92            | 0.91           | 0.80              | 0.92             |
| Negative   | 0.76               | 1.00              | 1.00            | 0.96           | 0.86              | 0.98             |
| Neutral    | 0.96               | 0.95              | 0.76            | 0.97           | 0.85              | 0.96             |

| **Overall** | **Precision (Before)** | **Precision (After)** | **Recall (Before)** | **Recall (After)** | **F1-Score (Before)** | **F1-Score (After)** |
|-------------|------------------------|-----------------------|---------------------|--------------------|-----------------------|----------------------|
| **Macro avg** | 0.81                 | 0.96                  | 0.90                | 0.95               | 0.84                  | 0.95                 |
| **Weighted avg** | 0.87              | 0.95                  | 0.84                | 0.95               | 0.84                  | 0.95                 |

