# Fine-tune Gemma 3 1B-IT for Sentiment Analysis

This tutorial covers the **fine-tuning process** of the recently launched **Gemma 3 1B** model for **sentiment analysis** on financial and economic information. Sentiment analysis in this domain is crucial for businesses for several reasons, including:

- **Market Insights**: Gaining valuable insights into market trends, investor confidence, and consumer behavior.  
- **Risk Management**: Identifying potential reputational risks.  
- **Investment Decisions**: Assessing the sentiment of stakeholders, investors, and the general public to evaluate investment opportunities.  

Before diving into the technical aspects of fine-tuning a large language model like **Gemma**, we must first select an appropriate **dataset** to showcase its capabilities.

## Introducing the Gemma 3 1B-IT

**Gemma 3** is Google's latest addition to its family of lightweight, state-of-the-art open AI models, designed to deliver high performance while being resource-efficient. The **1B Instruct** version of **Gemma 3** is tailored for **instruction-based tasks**, offering developers an accessible and powerful tool for creating intelligent applications.  

Announcement: [Gemma 3 Blog Post](https://blog.google/technology/developers/gemma-3/)

Gemma 3 features a **transformer architecture** optimized with advanced techniques like **RoPE embeddings** and **GeGLU activations**, enabling sophisticated reasoning and text generation capabilities.

Key features of the Gemma 3 family:
- **128K-token context window**: Available on the 4B, 12B, and 27B models; the 1B model used in this tutorial has a 32K-token context window.  
- **Multilingual support**: Over **140 languages** on the larger models, ideal for global applications.  
- **Multimodal capabilities**: The 4B, 12B, and 27B models accept images (and short videos) alongside text; the 1B model is text-only.  
- **Edge device optimization**: Efficiently runs on consumer hardware with a single GPU, making it accessible for developers with limited resources.

Resources:
- [Gemma 3 Model Overview](https://ai.google.dev/gemma/docs/core)  
- [Gemma 3 Technical Report](https://storage.googleapis.com/deepmind-media/gemma/Gemma3Report.pdf)  
- [Gemma 3 Model Card](https://ai.google.dev/gemma/docs/core/model_card_3)

## Dataset Selection

Annotated datasets for finance and economic texts are relatively rare, with many being proprietary. To address this challenge, researchers from the **Aalto University School of Business** introduced the **FinancialPhraseBank Dataset** in 2014, which contains approximately **5,000 sentences**.  

This dataset provides **human-annotated benchmarks**, allowing for consistent evaluation of different modeling techniques. The annotations were performed by **16 individuals** with a background in **financial markets**, who categorized the sentences as having a:

- **Positive** impact on stock prices  
- **Negative** impact on stock prices  
- **Neutral** impact on stock prices  

The impact was assessed from an **investor's perspective**.

## More on the FinancialPhraseBank Dataset

The **FinancialPhraseBank** dataset is a comprehensive collection of **financial news headlines** analyzed from the viewpoint of **retail investors**. It includes two key columns:

- **Sentiment**: Classified as **negative**, **neutral**, or **positive**.  
- **News Headline**: The actual **financial news snippet**.

This dataset has been widely used in research, including the study by **Malo, P.**, **Sinha, A.**, **Korhonen, P.**, **Wallenius, J.**, and **Takala, P.**, titled "*Good Debt or Bad Debt: Detecting Semantic Orientations in Economic Texts*" (published in the **Journal of the Association for Information Science and Technology**, 2014).

## Required Libraries

To implement this tutorial, we need to install several essential libraries:

In [1]:
!pip install -q -U transformers

In [2]:
!pip install -q -U accelerate
!pip install -q -U datasets
!pip install -q -U peft
!pip install -q -i https://pypi.org/simple/ bitsandbytes
!pip install -q -U trl

### Explanation of Key Libraries  

- **`transformers`**: Provides a framework to handle **pre-trained NLP models** for tasks like **text classification** and **question answering**.  

- **`accelerate`**: A distributed training library by Hugging Face designed for **parallelizing training** across multiple **GPUs or CPUs**.  

- **`peft`**: A library for **parameter-efficient fine-tuning (PEFT)** of pre-trained language models, including support for **LoRA (Low-Rank Adaptation)**.  

- **`trl`**: A Hugging Face library for training **transformer models** with **supervised fine-tuning** or **reinforcement learning techniques**.  


## Setting Environment Variables

The following code sets environment variables to configure the GPU usage and suppress unnecessary warnings:

In [3]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0" # Use the first GPU
os.environ["TOKENIZERS_PARALLELISM"] = "false" # Disable tokenization parallelism

## Suppressing Warnings

During training, several warnings may appear that do not impact the fine-tuning process but can be distracting. To suppress them:

In [4]:
import warnings
warnings.filterwarnings("ignore")

## Importing Necessary Libraries

The following Python libraries are required for running the fine-tuning process:

In [5]:
import numpy as np
import pandas as pd
import os
from tqdm import tqdm

import torch
import torch.nn as nn

import transformers
from datasets import Dataset, DatasetDict
from transformers import (AutoModelForCausalLM, 
                          AutoTokenizer, 
                          BitsAndBytesConfig, 
                          TrainingArguments, 
                          pipeline, 
                          logging)

from transformers.models.gemma3 import Gemma3ForCausalLM

from peft import LoraConfig, PeftConfig, PeftModel
from trl import SFTTrainer, SFTConfig
import bitsandbytes as bnb

from sklearn.metrics import (accuracy_score, 
                             classification_report, 
                             confusion_matrix)

from sklearn.model_selection import train_test_split

To check the installed version of the transformers library:

In [6]:
print(f"transformers=={transformers.__version__}")

transformers==4.53.1


This function determines the best computing device for running the tutorial:

In [7]:
def define_device():
    """Determine and return the optimal PyTorch device based on availability."""
    
    print(f"PyTorch version: {torch.__version__}", end=" -- ")

    # Check if MPS (Metal Performance Shaders) is available for macOS
    if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        print("using MPS device on macOS")
        return torch.device("mps")

    # Check for CUDA availability
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"using {device}")
    return device

This code initializes the Gemma 3 1B model for causal language modeling, ensuring optimal settings based on the available hardware.  

* If the GPU supports **bfloat16** (available on GPUs with Compute Capability **8.0+**), it is used for computations.  
  Otherwise, **float16** is used as the default.  

* **Device Selection:**  
  * The function `define_device()` selects the best available device (**CPU, CUDA, or MPS**).  

* **Model Initialization:**  
  * The model is loaded with memory-efficient configurations, including `low_cpu_mem_usage=True`, and assigned to the selected device.  

* **Tokenizer Setup:**  
  * A **tokenizer** is initialized with a **maximum sequence length of 8192**.  
  * The **end-of-sequence (EOS) token** is stored for later use.  
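
The dtype rule described above can be sketched without touching a GPU. This is only an illustration: the helper name `pick_compute_dtype_name` is hypothetical, it assumes the compute capability arrives as a `(major, minor)` tuple (the shape returned by `torch.cuda.get_device_capability()`), and it returns the dtype name as a string so that it runs without PyTorch installed.

```python
from typing import Optional, Tuple


def pick_compute_dtype_name(capability: Optional[Tuple[int, int]]) -> str:
    """Return 'bfloat16' for compute capability 8.0+, otherwise 'float16'.

    `capability` is None when no CUDA GPU is available (an assumption of
    this sketch, not something the notebook handles explicitly).
    """
    if capability is not None and capability[0] >= 8:
        return "bfloat16"
    return "float16"


# A few example capabilities, plus the "no GPU" case
for cap in [(7, 5), (8, 0), (8, 9), None]:
    print(cap, "->", pick_compute_dtype_name(cap))
```

Only the major version is compared, which matches the check used in the notebook cell that follows.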

In [8]:
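import os
from dotenv import load_dotenv
from huggingface_hub import login

# Load the .env file from the current directory first, so that
# HUGGINGFACE_HUB_TOKEN is available when os.getenv is called below
env_loaded = load_dotenv()

login(token=os.getenv("HUGGINGFACE_HUB_TOKEN"))

env_loaded  # True if a .env file was found and loaded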


True

In [9]:
import torch._dynamo
torch._dynamo.config.cache_size_limit = 128  # Raise the recompilation cache limit above its default


In [10]:
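# Determine optimal computation dtype based on GPU capability
# (bfloat16 needs compute capability 8.0+; float16 is used otherwise,
# including when no CUDA GPU is present)
compute_dtype = (
    torch.bfloat16
    if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8
    else torch.float16
)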
print(f"Using compute dtype {compute_dtype}")

# Select the best available device (CPU, CUDA, or MPS)
device = define_device()
print(f"Operating on {device}")

# Path to the pre-trained model
GEMMA_PATH = "google/gemma-3-1b-it"

# Load the model with optimized settings
model = Gemma3ForCausalLM.from_pretrained(
    GEMMA_PATH,
    torch_dtype=compute_dtype,
    attn_implementation="eager",
    low_cpu_mem_usage=True,
    device_map=device,
)

# Define maximum sequence length for the tokenizer
max_seq_length = 8192

# Load the tokenizer (it runs on the CPU, so no device_map is needed)
tokenizer = AutoTokenizer.from_pretrained(
    GEMMA_PATH,
    model_max_length=max_seq_length,
)

# Store the EOS token for later use
EOS_TOKEN = tokenizer.eos_token

Using compute dtype torch.bfloat16
PyTorch version: 2.7.1+cu126 -- using cuda
Operating on cuda


Before proceeding, let's ensure that the entire model has been moved to the GPU.


In [11]:
is_on_gpu = all(param.device.type == 'cuda' for param in model.parameters())
print("Model is on GPU:", is_on_gpu)

Model is on GPU: True


The following code prepares the dataset for fine-tuning Gemma. It follows these steps:  

1. **Load Dataset**  
   * Reads three CSV files: `competition_train.csv`, `competition_val.csv`, and `competition_test.csv`.  
   * The code uses the following columns:  
     - **id**: The example identifier.  
     - **Sentence**: The text to classify.  
     - **language**: The language of the sentence (for example Santali, Kashmiri, or Manipuri).  
     - **emotion**: The label (one of `anger`, `disgust`, `fear`, `happy`, `sad`, `surprise`); it is not read for the test split.  

2. **Build a Label Mapping**  
   * The six emotion labels found in the training data are sorted alphabetically and mapped to integer IDs.  

3. **Wrap Data Using Hugging Face's Dataset Class**  
   * Each split is converted into a **Hugging Face Dataset object** and collected in a `DatasetDict` with **train** (7,176 rows), **validation** (2,392 rows), and **test** (2,392 rows) splits.  

4. **Convert Text into Prompts**  
   * The **training** and **evaluation** data are transformed into **prompts** that instruct the model to classify the emotion.  
   * **Training and evaluation prompts** include the label followed by the EOS token (used for fine-tuning).  
   * **Test prompts** omit the label (used for inference).  

5. **Convert the Prompts Back to Dataset Objects**  
   * The prompt DataFrames for training and evaluation are wrapped as `train_data` and `eval_data` for compatibility with the training pipeline.
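
The label-mapping and prompt-construction steps can be sketched on toy records. This is a simplified illustration rather than the notebook's exact code: `build_prompt` is a hypothetical helper that condenses the template used by the prompt functions defined later in the notebook, and the `"<eos>"` string stands in for `tokenizer.eos_token`.

```python
EOS = "<eos>"  # placeholder; the notebook uses tokenizer.eos_token


def build_prompt(text, language, label=None):
    """Build a training prompt (label given) or a test prompt (label omitted)."""
    header = (
        "### Instruction:\n"
        f"Classify the emotion conveyed in the given sentence in {language}.\n"
        "You must choose from one of the following emotions:\n"
        "fear, happy, surprise, sad, anger, disgust.\n\n"
        "### Input:\n"
        f"{text}\n\n"
        "### Response:"
    )
    if label is None:
        return header  # inference: the model must produce the label
    return f"{header}\n{label}{EOS}"  # training: the label is the target


# Toy labels; sorting makes the integer IDs reproducible across runs
toy_labels = ["sad", "anger", "happy", "sad"]
label_mapping = {name: i for i, name in enumerate(sorted(set(toy_labels)))}
print(label_mapping)

train_prompt = build_prompt("example sentence", "Santali", label="sad")
test_prompt = build_prompt("example sentence", "Santali")
print(train_prompt)
```

Ending the test prompt at `### Response:` means that whatever the model generates next can be read directly as its predicted label.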

In [12]:
train_data = pd.read_csv('competition_train.csv')
validation_data = pd.read_csv('competition_val.csv')
test_data = pd.read_csv('competition_test.csv')
df = pd.concat([train_data, validation_data], ignore_index=True)

In [13]:
# Create label mapping
label_mapping = {}
unique_emotions = train_data['emotion'].unique()
for i, emotion in enumerate(sorted(unique_emotions)):
    label_mapping[emotion] = i

print("Label mapping:")
print(label_mapping)

# Convert to HuggingFace Datasets
def prepare_dataset(df, is_test=False):
    dataset_dict = {'id': df['id'].tolist(), 'text': df['Sentence'].tolist(), 'language': df['language'].tolist()}
    if not is_test:
        dataset_dict['label'] = [emotion for emotion in df['emotion'].tolist()]
    return Dataset.from_dict(dataset_dict)

train_dataset = prepare_dataset(train_data)
val_dataset = prepare_dataset(validation_data)
test_dataset = prepare_dataset(test_data, is_test=True)

# Combine into a dataset dictionary
dataset_dict = DatasetDict({
    'train': train_dataset,
    'validation': val_dataset,
    'test': test_dataset
})

# Print info about the datasets
print("\nDataset info:")
for split, dataset in dataset_dict.items():
    print(f"{split}: {dataset}")

Label mapping:
{'anger': 0, 'disgust': 1, 'fear': 2, 'happy': 3, 'sad': 4, 'surprise': 5}



Dataset info:
train: Dataset({
    features: ['id', 'text', 'language', 'label'],
    num_rows: 7176
})
validation: Dataset({
    features: ['id', 'text', 'language', 'label'],
    num_rows: 2392
})
test: Dataset({
    features: ['id', 'text', 'language'],
    num_rows: 2392
})


In [14]:
X_train=dataset_dict['train'].to_pandas()
X_eval=dataset_dict['validation'].to_pandas()
X_test=dataset_dict['test'].to_pandas()

In [15]:
print(X_train.shape)
print(X_test.shape)
print(X_eval.shape)

(7176, 4)
(2392, 3)
(2392, 4)


In [16]:
X_test.head()

Unnamed: 0,id,text,language
0,556,ᱵᱤᱨ ᱨᱮ ᱥᱟᱺᱜᱤᱧ ᱠᱷᱚᱱ ᱛᱟᱹᱨᱩᱵ ᱠᱚᱣᱟᱜ ᱵᱚᱛᱚᱨᱟᱱᱟᱜ ᱦᱟᱺᱰ...,Santali
1,1213,ꯀꯁ꯭ꯇꯃꯔ ꯑꯃꯅ ꯑꯩꯈꯣꯏꯒꯤ ꯗꯨꯀꯥꯟꯗ ꯃꯈꯣꯏꯒꯤ ꯄꯣꯠ ꯂꯩꯕꯒꯤ ꯑꯦꯛ...,Manipuri
2,744,ꯑꯌꯨꯛꯇ ꯀꯦꯔꯂꯥꯒꯤ ꯆꯥ ꯒꯤꯂꯥꯁ ꯑꯃꯅ ꯑꯩꯕꯨ ꯅꯨꯡꯉꯥꯏꯕꯅ ꯊꯜꯍꯜꯂꯤ꯫,Manipuri
3,443,ꯔꯦꯁꯇꯣꯔꯦꯟꯠꯒꯤ ꯁꯔꯕꯤꯁ ꯑꯁꯤ ꯉꯁꯤꯗꯤ ꯌꯥꯝꯅ ꯐꯠꯇꯕ ꯑꯣꯏ ꯍꯥꯏꯅ...,Manipuri
4,7673,ᱵᱟᱦᱟ ᱴᱚᱵ ᱵᱮᱲᱦᱟᱭᱛᱮ ᱛᱤᱡᱩ ᱠᱚ ᱴᱩᱸᱰᱟᱝ ᱵᱟᱲᱟᱭ ᱧᱮᱞ ᱱᱚᱣ...,Santali


In [17]:
# Language descriptions shared by the training and test prompts
LANG_MAP = {
    'Santali': "in the Santali language (Ol Chiki script)",
    'Kashmiri': "in the Kashmiri language (Arabic script)",
    'Manipuri': "in the Manipuri language (Meitei Mayek script)"
}


def _build_instruction(example):
    """Return the instruction and input part of the prompt, ending at the response header."""
    lang_desc = LANG_MAP.get(example['language'], f"in {example['language']}")
    return f"""### Instruction:
Classify the emotion conveyed in the given sentence {lang_desc}.
You must choose from one of the following emotions:
fear, happy, surprise, sad, anger, disgust.

### Input:
{example['text']}

### Response:"""


# Function to generate training and evaluation prompts (with the expected answer)
def generate_train_prompt(example):
    return f"{_build_instruction(example)}\n{example['label']}{EOS_TOKEN}"


# Function to generate test prompts (without the expected answer)
def generate_test_prompt(example):
    return _build_instruction(example)

In [18]:
# Apply prompt generation to datasets
X_train = pd.DataFrame(X_train.apply(generate_train_prompt, axis=1), columns=["text"])
X_eval = pd.DataFrame(X_eval.apply(generate_train_prompt, axis=1), columns=["text"])

# Build inference prompts for the test data (no labels are included)
X_test = pd.DataFrame(X_test.apply(generate_test_prompt, axis=1), columns=["text"])

# Convert to Hugging Face Dataset format
train_data = Dataset.from_pandas(X_train)
eval_data = Dataset.from_pandas(X_eval)

The following function evaluates the model's **emotion predictions** by performing the following tasks:

**1. Map Emotion Labels to Numeric Values**
- Each of the six emotion labels (`disgust`, `anger`, `sad`, `happy`, `fear`, `surprise`) is mapped to an integer ID.
- Predictions that match no known label (for example `'none'`) are assigned a fallback ID.

**2. Calculate Overall Accuracy**
- Computes the accuracy of the model predictions (`y_pred`) compared to the actual labels (`y_true`).

**3. Compute Accuracy for Each Emotion Label**
- Extracts **accuracy scores** separately for every emotion class present in `y_true`. Because each score is computed only on the examples of one class, it equals the **recall** of that class.

**4. Generate a Classification Report**
- Prints **precision, recall, and F1-score** for each emotion category.

**5. Compute and Display the Confusion Matrix**
- Displays a **confusion matrix** to show how often the model confuses one emotion with another (e.g., predicting **sad** instead of **fear**).
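
To make these metrics concrete, here is a small dependency-free sketch on toy data. The helper names `confusion_counts` and `per_label_accuracy` are hypothetical, and the toy labels are invented for illustration; the notebook itself relies on scikit-learn for the same quantities.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def per_label_accuracy(y_true, y_pred, labels):
    """Accuracy restricted to the examples of each true label (i.e. recall)."""
    result = {}
    for label in labels:
        preds = [p for t, p in zip(y_true, y_pred) if t == label]
        if preds:  # skip labels that never occur in y_true
            result[label] = sum(p == label for p in preds) / len(preds)
    return result


labels = ["anger", "happy", "sad"]
y_true = ["anger", "anger", "happy", "sad", "sad", "sad"]
y_pred = ["anger", "sad", "happy", "sad", "sad", "none"]  # 'none' = unparseable output

matrix = confusion_counts(y_true, y_pred, labels)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
per_label = per_label_accuracy(y_true, y_pred, labels)

print("Overall accuracy:", round(accuracy, 3))
print("Per-label accuracy:", per_label)
print("Confusion matrix:", matrix)
```

The `'none'` prediction falls outside the label set, so it appears in no column of the matrix but still lowers the accuracy for `sad`.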


In [19]:
def evaluate(y_true, y_pred):
    """Evaluate the model's emotion predictions against the true labels."""
    
    # Define emotion label mapping
    label_mapping = {
        'disgust': 0,
        'anger': 1,
        'sad': 2,
        'happy': 3,
        'fear': 4,
        'surprise': 5
    }
    label_ids = list(label_mapping.values())
    label_names = list(label_mapping.keys())
    
    # Convert labels to numeric values; predictions outside the label set
    # (e.g. 'none') become -1 so that they always count as errors
    y_true = np.array([label_mapping[label] for label in y_true])
    y_pred = np.array([label_mapping.get(label, -1) for label in y_pred])
    
    # Calculate overall accuracy
    accuracy = accuracy_score(y_true, y_pred)
    print(f'Overall Accuracy: {accuracy:.3f}')
    
    # Compute accuracy for each emotion label present in y_true
    for label in np.unique(y_true):
        label_mask = y_true == label  # Mask to filter specific class
        label_accuracy = accuracy_score(y_true[label_mask], y_pred[label_mask])
        print(f'Accuracy for label {label} ({label_names[label]}): {label_accuracy:.3f}')
    
    # Generate classification report over all six classes
    class_report = classification_report(y_true, y_pred, labels=label_ids,
                                         target_names=label_names, zero_division=0)
    print('\nClassification Report:\n', class_report)
    
    # Compute and display confusion matrix over all six classes
    conf_matrix = confusion_matrix(y_true, y_pred, labels=label_ids)
    print('\nConfusion Matrix:\n', conf_matrix)

The following function predicts the emotion label for each prompt. It takes three main arguments:

- **X_test**: A Pandas DataFrame whose `text` column contains the prompts to be classified.
- **model**: The **Gemma 3 1B** language model.
- **tokenizer**: The corresponding tokenizer for the **Gemma 3 1B** model.

### **Function Workflow:**
1. **Iterate through each prompt** in `X_test`:
   - Read the prepared prompt, which asks the model to classify the emotion.
   - Tokenize the input and move it to the appropriate device (GPU/CPU).
   - Generate text using the model and extract the predicted label.
   - Append the label to `y_pred`.

2. **Use the `generate()` function** from the Hugging Face Transformers library:
   - `max_new_tokens=5`: Limits the number of generated tokens.
   - `do_sample=False`: Uses greedy decoding, so the output is deterministic (the `temperature` setting has no effect in this mode).

3. **Extract the label** from the generated text:
   - The text after the `### response:` marker is lower-cased and stripped of whitespace.
   - If it is exactly one of `disgust`, `anger`, `sad`, `happy`, `fear`, or `surprise`, that label is assigned.
   - Otherwise, **none** is assigned as a fallback.
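
The label-extraction step can be isolated and tested without a model. The sketch below mirrors the matching logic of the `predict` cell in this notebook (split on the response marker, then require an exact match); the function name `extract_label` is hypothetical.

```python
VALID_LABELS = {"disgust", "anger", "sad", "happy", "fear", "surprise"}


def extract_label(decoded: str) -> str:
    """Return the emotion found after the response marker, or 'none'."""
    # Lower-case first so that the marker and the labels match regardless of case
    answer = decoded.strip().lower().split("### response:")[-1].strip()
    # Exact match only: extra words after the label yield the fallback
    return answer if answer in VALID_LABELS else "none"


print(extract_label("### Instruction:\n...\n### Response:\nHappy"))
print(extract_label("### Response:\nhappy and sad"))
print(extract_label("### Response:\n"))
```

Exact matching is strict: an answer such as `happy and sad` falls back to `none`. A more lenient rule (for example, taking the first word) would be a design change and is not what the notebook does.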

In [20]:
def predict(X_test, model, tokenizer, device=device, max_new_tokens=5, temperature=0.0):
    """Predict the emotion label for each prompt in X_test.

    `temperature` is kept for backward compatibility but is not used:
    generation is greedy (do_sample=False), where temperature has no effect.
    """
    # Move the model only if it is not already on the requested device type
    if next(model.parameters()).device.type != torch.device(device).type:
        model = model.to(device)
    model.eval()
    
    valid_labels = {"disgust", "anger", "sad", "happy", "fear", "surprise"}
    y_pred = []  # List to store predicted labels
    
    # Iterate through each prompt in X_test
    for i in tqdm(range(len(X_test)), desc="Predicting Sentiments"):
        prompt = X_test.iloc[i]["text"]  # Extract the prepared prompt
        
        # Tokenize and move input to the appropriate device
        inputs = tokenizer(prompt, return_tensors="pt").to(device)
        
        # Generate output from the model (greedy decoding, no gradients needed)
        with torch.no_grad():
            outputs = model.generate(**inputs,
                                     max_new_tokens=max_new_tokens,
                                     do_sample=False)
        
        # Decode the generated output and keep only the text after the response marker
        result = tokenizer.decode(outputs[0], skip_special_tokens=True).strip().lower()
        result = result.split('### response:')[-1].strip()
        
        # Accept the answer only if it is exactly one of the six emotion labels
        y_pred.append(result if result in valid_labels else "none")

    return y_pred

At this stage, we are ready to test the **Gemma-3 1B** model on our dataset **without any fine-tuning**. This initial evaluation provides insights into the model's **inherent performance** and helps establish a **baseline** for comparison with future fine-tuned models.

We use the `predict` function to generate sentiment predictions for the test set:

In [21]:
# y_pred = predict(X_test, model, tokenizer)

In the next step, we evaluate the model's predictions against the true sentiment labels:

In [22]:
# evaluate(y_true, y_pred)

**Analysis of the Model's Performance**

**Overall Performance**
- The **overall accuracy** of **55.0%** suggests that the model struggles to differentiate between sentiment classes effectively.  
- The **macro and weighted F1-scores (~0.51)** indicate an **imbalanced performance** across different sentiment categories.

**Per-Label Accuracy**
- **Negative Sentiment (60.3%)**  
  - The model is relatively good at identifying negative sentiment, as reflected in its **high recall (0.90)**.  
  - However, its **low precision (0.44)** means it misclassifies many non-negative examples as negative.  

- **Neutral Sentiment (14.3%)**  
  - This is the weakest area, with the model **failing to recognize neutral sentiment correctly**.  
  - The **recall of 0.14** means that most neutral headlines are misclassified as either positive or negative.  

- **Positive Sentiment (90.3%)**  
  - The model identifies positive sentiment with high **precision (0.88)** but only moderate **recall (0.60)**.  
  - However, the recall value suggests that it **misses some positive examples**.  

**Key Issues from the Confusion Matrix**
- **Neutral sentiment is highly misclassified**:  
  - **243 out of 300 neutral examples were incorrectly classified as positive**, leading to poor recall for this class.  
- **Negative and positive classifications are more reliable**, but still imperfect:  
  - **99 negative examples were classified as positive** (false positives).  
  - **Only 14 positive examples were misclassified as negative**, showing that the model rarely confuses these extremes.  

**In summary**
- The model is **biased towards predicting positive sentiment** and **struggles significantly with neutral sentiment**.  
- The **high recall for negative sentiment** suggests that it detects negativity well but often **overgeneralizes**, leading to misclassifications.  
- While the model has **some strengths in identifying negative and positive sentiment**, it **performs poorly on neutral sentiment**, making it **unreliable for nuanced sentiment analysis** without further refinement. 🚀  


In the next cell, we set everything up for fine-tuning the model. We configure and initialize a **Supervised Fine-tuning Trainer (SFTTrainer)** for training the model using the **Parameter-Efficient Fine-Tuning (PEFT)** method. PEFT is efficient because it trains only a small number of additional parameters while keeping the majority of the pre-trained large language model (LLM) parameters frozen, which significantly reduces computational and storage costs. Additionally, PEFT helps mitigate **catastrophic forgetting**, a common issue when fully fine-tuning LLMs.

### PEFTConfig:
The `peft_config` object specifies the parameters for PEFT. The following are some of the most important parameters:

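- **lora_alpha**: The scaling factor for the LoRA update; the low-rank update is multiplied by `lora_alpha / r`.
- **lora_dropout**: The dropout probability for the LoRA update matrices.
- **r**: The rank of the LoRA update matrices.
- **bias**: Which bias parameters are trained. Possible values are: `none`, `all`, and `lora_only`.
- **task_type**: The task type the model is being trained for, such as `CAUSAL_LM`, `SEQ_CLS`, `SEQ_2_SEQ_LM`, or `TOKEN_CLS`.
- **target_modules**: The modules that receive LoRA adapters; `all-linear` targets every linear layer except the output layer.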
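
The effect of `r` and `lora_alpha` can be checked with simple arithmetic. In standard LoRA, a frozen weight matrix `W` of shape `(d_out, d_in)` is adapted as `W + (lora_alpha / r) * B @ A`, where `B` has shape `(d_out, r)` and `A` has shape `(r, d_in)`. The sketch below uses a hypothetical 1024 x 1024 layer (not one of Gemma's actual layer shapes) together with the `r=32` and `lora_alpha=64` values used later in this notebook.

```python
def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters in the two low-rank matrices B (d_out x r) and A (r x d_in)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: float, r: int) -> float:
    """Factor applied to the low-rank update B @ A in standard LoRA."""
    return lora_alpha / r


d_out, d_in = 1024, 1024   # hypothetical layer shape, for illustration only
r, lora_alpha = 32, 64     # values from the LoraConfig in this notebook

full_params = d_out * d_in
lora_params = lora_param_count(d_out, d_in, r)

print(f"Full matrix parameters: {full_params:,}")
print(f"LoRA parameters:        {lora_params:,} ({lora_params / full_params:.2%})")
print(f"Scaling factor:         {lora_scaling(lora_alpha, r)}")
```

Doubling `r` doubles the number of trainable parameters, while the scaling factor halves unless `lora_alpha` is raised with it.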

### TrainingArguments:
The `training_arguments` object (an `SFTConfig`, which extends `TrainingArguments`) specifies the parameters for training the model. The following are some key parameters:

- **output_dir**: Directory where the training logs and checkpoints will be saved.
- **num_train_epochs**: Number of epochs to train the model for.
- **per_device_train_batch_size**: Number of samples in each batch on each device.
- **gradient_accumulation_steps**: Number of batches to accumulate gradients before updating the model parameters.
- **gradient_checkpointing**: Whether to use gradient checkpointing to reduce GPU memory usage.
- **optim**: The optimizer used for training the model.
- **save_steps**: The number of steps after which to save a checkpoint.
- **logging_steps**: The number of steps after which to log the training metrics.
- **learning_rate**: The learning rate for the optimizer.
- **weight_decay**: The weight decay parameter for the optimizer.
- **fp16**: Whether to use 16-bit floating-point precision.
- **bf16**: Whether to use BFloat16 precision.
- **max_grad_norm**: The maximum gradient norm.
- **max_steps**: The maximum number of steps to train the model for.
- **warmup_ratio**: Proportion of training steps to use for warming up the learning rate.
- **group_by_length**: Whether to group the training samples by length.
- **lr_scheduler_type**: The type of learning rate scheduler to use.
- **report_to**: The tools to report the training metrics to.
- **eval_strategy**: The strategy for evaluating the model during training (named `evaluation_strategy` in older versions of `transformers`).
- **eval_steps**: Number of update steps between evaluations.
- **eval_accumulation_steps**: Number of prediction steps to accumulate before moving the output to CPU.
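
Several of these settings interact, and the arithmetic is easy to check by hand. The sketch below uses the values from this notebook (7,176 training examples, batch size 2, 8 accumulation steps, 5 epochs, 3% warmup, peak learning rate 3e-4) and assumes a single GPU. The step counts are approximations, since the trainer's own rounding may differ by a step, and `lr_at` is a hypothetical helper showing the usual shape of a cosine schedule with linear warmup.

```python
import math

num_examples = 7176
per_device_batch = 2
grad_accum = 8
epochs = 5
warmup_ratio = 0.03
peak_lr = 3e-4

# One optimizer update consumes batch_size * accumulation_steps examples
effective_batch = per_device_batch * grad_accum
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)


def lr_at(step: int) -> float:
    """Linear warmup to peak_lr, then cosine decay towards zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


print("Effective batch size:", effective_batch)
print("Optimizer steps per epoch (approx.):", steps_per_epoch)
print("Total optimizer steps (approx.):", total_steps)
print("Warmup steps (approx.):", warmup_steps)
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(f"step {step:>4}: lr = {lr_at(step):.2e}")
```

With roughly 449 steps per epoch, evaluating and saving every 112 steps corresponds to about four checkpoints per epoch.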

### SFTTrainer:
The `SFTTrainer` is a custom trainer class from the **TRL** library. It is used to fine-tune large language models using the PEFT method.

The `SFTTrainer` object is initialized with the following arguments:

- **model**: The model to be trained.
- **train_dataset**: The training dataset.
- **eval_dataset**: The evaluation dataset.
- **peft_config**: The PEFT configuration.
- **processing_class**: The tokenizer to use (this argument was called `tokenizer` in older versions of TRL).
- **args**: The training arguments.

In the code below, the following options are set through `SFTConfig` rather than passed to the trainer directly:

- **dataset_text_field**: The name of the text field in the dataset (left at its default, `text`).
- **packing**: Whether to pack the training samples.
- **max_seq_length**: The maximum sequence length.

Once the `SFTTrainer` object is initialized, it can be used to train the model by calling the `train()` method.

In [23]:
train_data

Dataset({
    features: ['text'],
    num_rows: 7176
})

In [24]:
peft_config = LoraConfig(
    lora_alpha=64,
    lora_dropout=0.05,
    r=32,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules='all-linear',
)

training_arguments = SFTConfig(
    output_dir="logs",
    num_train_epochs=5,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},  # Use non-reentrant checkpointing
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    optim="adamw_torch_fused",  # Use fused AdamW optimizer
    save_steps=112,
    load_best_model_at_end=True,
    logging_steps=25,
    learning_rate=3e-4,
    weight_decay=0.001,
    fp16=(compute_dtype == torch.float16),  # Use float16 precision when bfloat16 is unavailable
    bf16=(compute_dtype == torch.bfloat16),  # Use bfloat16 precision when the GPU supports it
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=False,
    eval_strategy="steps",
    eval_steps=112,
    eval_accumulation_steps=1,
    lr_scheduler_type="cosine",
    report_to="tensorboard",
    max_seq_length=max_seq_length,
    packing=False,
    dataset_kwargs={
        "add_special_tokens": False,  # Template with special tokens
        "append_concat_token": True,  # Add EOS token as separator token
    }
)

model.config.use_cache = False
model.config.pretraining_tp = 1

trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=eval_data,
    peft_config=peft_config,
    processing_class=tokenizer,
    args=training_arguments,
)

Adding EOS to train dataset: 100%|██████████| 7176/7176 [00:00<00:00, 69287.60 examples/s]
Tokenizing train dataset: 100%|██████████| 7176/7176 [00:01<00:00, 4475.99 examples/s]
Truncating train dataset: 100%|██████████| 7176/7176 [00:00<00:00, 630093.90 examples/s]
Adding EOS to eval dataset: 100%|██████████| 2392/2392 [00:00<00:00, 30195.56 examples/s]
Tokenizing eval dataset: 100%|██████████| 2392/2392 [00:00<00:00, 4089.50 examples/s]
Truncating eval dataset: 100%|██████████| 2392/2392 [00:00<00:00, 544135.76 examples/s]
No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


The `trainer.train()` function is called to start the training process. This triggers the fine-tuning of the model based on the specified training arguments, datasets, and PEFT configuration. During this process, the model will iteratively adjust its parameters, leveraging the training data to improve performance on the sentiment analysis task. The training will proceed according to the parameters set in the `training_arguments`, such as the number of epochs, batch size, and evaluation steps.

This method will also handle the evaluation of the model at specified intervals, providing insights into the model's performance as it trains.

In [25]:
# Train model
trainer.train()

Step,Training Loss,Validation Loss


OutOfMemoryError: CUDA out of memory. Tried to allocate 2.22 GiB. GPU 0 has a total capacity of 23.57 GiB of which 1.95 GiB is free. Process 225140 has 3.29 GiB memory in use. Process 226166 has 3.29 GiB memory in use. Including non-PyTorch memory, this process has 6.72 GiB memory in use. Process 232975 has 8.29 GiB memory in use. Of the allocated memory 5.55 GiB is allocated by PyTorch, and 870.88 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
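
The numbers in this error message are enough to see why the allocation failed. The sketch below copies the figures from the traceback above and also applies the allocator setting that the message itself suggests. Whether that setting is sufficient here is not established: most of the GPU is held by other processes, so freeing those or lowering `per_device_train_batch_size` or `max_seq_length` may be what is actually needed.

```python
import os

# Setting suggested by the error message; it typically needs to be in place
# before PyTorch initializes CUDA (for example, at the very top of the notebook)
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# Figures copied from the OutOfMemoryError above (GiB)
total_gib = 23.57
free_gib = 1.95
requested_gib = 2.22
this_process_gib = 6.72
other_processes_gib = [3.29, 3.29, 8.29]

held_by_others = sum(other_processes_gib)
shortfall = requested_gib - free_gib

print(f"Held by other processes: {held_by_others:.2f} GiB of {total_gib} GiB")
print(f"Held by this process:    {this_process_gib:.2f} GiB")
print(f"Allocation shortfall:    {shortfall:.2f} GiB")
```

In this run, other processes occupy about 14.9 GiB of the 23.57 GiB card, which is more than twice what the training process itself holds.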

The code below saves the fine-tuned LoRA adapter and tokenizer to a specified directory for later use.

1. **Save the LoRA adapter**:  
   The `trainer.model.save_pretrained(lora_directory)` call saves the trained LoRA adapter weights to the directory specified by `lora_directory` ("LoRA-Gemma3-model"). The adapter can later be loaded on top of the base model, so the fine-tuned model can be reused without retraining.

2. **Save the tokenizer**:  
   The `trainer.processing_class.save_pretrained(lora_directory)` call saves the tokenizer used during training to the same directory. This ensures that the tokenizer is consistent with the fine-tuned model, enabling you to process text in the same way during inference or further training.

In [None]:
# Save trained LoRA adapter
lora_directory = "LoRA-Gemma3-model"
trainer.model.save_pretrained(lora_directory)

# To save the tokenizer too
trainer.processing_class.save_pretrained(lora_directory)

('LoRA-Gemma3-model/tokenizer_config.json',
 'LoRA-Gemma3-model/special_tokens_map.json',
 'LoRA-Gemma3-model/chat_template.jinja',
 'LoRA-Gemma3-model/tokenizer.json')

After training the model, it's essential to track its progress, performance, and visualize metrics. This can be done using TensorBoard. The following code loads the TensorBoard extension and starts the TensorBoard server, which will monitor the training process.

1. **Load TensorBoard extension**:  
   The `%load_ext tensorboard` magic command is used to load the TensorBoard extension within the Jupyter notebook environment.

2. **Start TensorBoard**:  
   The `%tensorboard --logdir logs/runs` command starts the TensorBoard server, specifying the directory (`logs/runs`) where the training logs and checkpoints are saved. By doing this, you can visualize various metrics such as loss, accuracy, and other key performance indicators during the training process.

Once executed, TensorBoard will provide an interactive interface to monitor how the model fits and evolves over time.

In [None]:
# %load_ext tensorboard
# %tensorboard --logdir logs/runs

The following code predicts sentiment labels for the test set so that the model's performance can be evaluated:

1. **Predict Sentiment Labels**:  
   `model.eval()` first puts the model in evaluation mode (disabling dropout). `predict(X_test, model, tokenizer)` then generates the predicted sentiment labels for the test dataset (`X_test`) with the fine-tuned model.

2. **Evaluate Model Performance**:  
   The `evaluate(y_true, y_pred)` function assesses the model by comparing the true sentiment labels (`y_true`) with the predicted labels (`y_pred`). It computes metrics such as accuracy, precision, recall, and F1-score, which show how well the model performs on each sentiment class.

With a well-fine-tuned model, we expect to achieve an overall accuracy of over 0.8, and the performance for individual sentiment labels (positive, negative, and neutral) should be high, especially for positive and negative classes. While there might still be room for improvement in predicting neutral sentiment, the results should be impressive given the relatively small dataset and the use of fine-tuning.
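
To make the metrics above concrete, here is a minimal pure-Python sketch of how accuracy, per-class precision, recall, F1-score, and a confusion matrix can be derived from two label lists. It is not the notebook's `evaluate` implementation: `classification_metrics` is a hypothetical helper, and the label names and toy data are made up for illustration.

```python
from collections import Counter

LABELS = ("positive", "neutral", "negative")  # assumed label set


def classification_metrics(y_true, y_pred, labels=LABELS):
    """Return (accuracy, per-class metrics, confusion matrix)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = Counter(zip(y_true, y_pred))  # (true, predicted) -> count
    # Rows are true labels, columns are predicted labels.
    confusion = [[pairs[(t, p)] for p in labels] for t in labels]
    correct = sum(pairs[(label, label)] for label in labels)
    accuracy = correct / len(y_true) if y_true else 0.0

    per_class = {}
    for i, label in enumerate(labels):
        tp = confusion[i][i]
        predicted = sum(row[i] for row in confusion)  # column total
        actual = sum(confusion[i])                    # row total
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class[label] = {"precision": precision, "recall": recall, "f1": f1}
    return accuracy, per_class, confusion


# Toy data, invented for illustration only.
y_true_demo = ["positive", "negative", "neutral", "neutral", "negative", "positive"]
y_pred_demo = ["positive", "negative", "positive", "neutral", "negative", "neutral"]

accuracy, per_class, confusion = classification_metrics(y_true_demo, y_pred_demo)
print(f"accuracy = {accuracy:.3f}")
for label, scores in per_class.items():
    print(label, scores)
print(confusion)
```

Predictions that fall outside the label set (for example, an unparseable generation) count against accuracy here but do not appear in the confusion matrix. A real evaluation may prefer to track them as a separate class.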

In [None]:
model.eval()
y_pred = predict(X_test, model, tokenizer)

Predicting Sentiments:   0%|          | 0/2392 [00:00<?, ?it/s]
The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Predicting Sentiments:   0%|          | 1/2392 [00:54<36:09:15, 54.44s/it]
Predicting Sentiments:   0%|          | 2/2392 [01:53<37:47:26, 56.92s/it]
Predicting Sentiments:   0%|          | 3/2392 [01:54<21:03:52, 31.74s/it]

The model performs particularly well with the **negative class** (high recall), showing strong performance in identifying negative sentiment. The **positive class** also shows good results, with a high recall and F1-score, but the **neutral class** still lags behind, with more modest recall and F1-score.

The **confusion matrix** indicates that the model frequently predicts positive and negative sentiment correctly, with only a few neutral instances misclassified as positive or negative.

Compared with the results previously obtained by fine-tuning **Gemma 7B-IT** (see: [Fine-tune Gemma 7B-IT for sentiment analysis](https://www.kaggle.com/code/lucamassaron/fine-tune-gemma-7b-it-for-sentiment-analysis)), **Gemma 3 1B-IT** shows impressive potential for sentiment analysis despite being a much smaller model. The 4-bit quantized **Gemma 7B-IT** model achieves higher overall accuracy and performs better on the **neutral sentiment** class, but **Gemma 3 1B-IT** is significantly more lightweight, roughly seven times smaller, and still delivers remarkable results.

The **Gemma 3 1B-IT** model excels in classifying **negative sentiment**, achieving high accuracy and recall. Its results on the **neutral sentiment** class, while weaker than those of the 7B model, still show promise and could improve with additional fine-tuning and targeted adjustments.

The **Gemma 3 1B-IT** model offers a great trade-off between performance and computational efficiency, making it an attractive option for developers and researchers with limited resources or those needing faster deployment. While **Gemma 7B-IT** may offer a more balanced performance overall, **Gemma 3 1B-IT** is a powerful and efficient alternative, showing strong potential with fine-tuning, especially in terms of **negative sentiment classification**.

The following code performs several tasks to analyze and save the results of the fine-tuned model:

1. **Create Evaluation DataFrame**:  
   A Pandas DataFrame named `evaluation` is created. This DataFrame contains the following columns:
   - `id`: The identifier of each test example, taken from `dataset_dict['test']`.
   - `emotion`: The label predicted by the fine-tuned model (`y_pred`).

   This DataFrame pairs each test example with its prediction, giving a structured way to examine the model's outputs. Where the true labels are available, they can be joined on `id` for error analysis, which helps identify the examples the model got wrong and refine the prompt or model further.

   The DataFrame is then saved as a CSV file (`submission.csv`) for further analysis.

2. **Load Base Model**:  
   The base model is loaded from the specified path (`GEMMA_PATH`). The model is loaded with `low_cpu_mem_usage=True` to reduce memory consumption during the loading process.

3. **Merge LoRA and Base Model**:  
   The LoRA adapter stored in `lora_directory` is attached to the base model with `PeftModel.from_pretrained()`. Calling `merge_and_unload()` then folds the adapter weights into the base weights and returns a plain model without the PEFT wrappers. The merged model is saved to `merged-LoRA-Gemma3-model`. The `safe_serialization=True` flag stores the weights in the safetensors format, and `max_shard_size="2GB"` splits the checkpoint into files of at most 2 GB each, making it easier to handle.

4. **Save Tokenizer**:  
   The tokenizer, which was used for fine-tuning the model, is also saved alongside the merged model to ensure consistency during inference. The tokenizer is saved in the same directory (`merged-LoRA-Gemma3-model`).
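
Conceptually, merging folds each low-rank update into the corresponding base weight matrix: W_merged = W + (lora_alpha / r) · B · A. The toy sketch below uses plain Python lists and made-up numbers to show that the merged weights give the same output as the base layer plus the adapter branch. It illustrates the arithmetic only, not PEFT's actual implementation, and `matmul` and `merge_lora` are hypothetical helpers.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(W, A, B, lora_alpha, r):
    """Return W + (lora_alpha / r) * (B @ A), the merged weight matrix."""
    scale = lora_alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


# Toy shapes: W is (out=2, in=3), B is (2, r=1), A is (r=1, 3).
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]
B = [[0.5], [-1.0]]
A = [[1.0, 2.0, 0.0]]
r, lora_alpha = 1, 2
x = [[1.0], [1.0], [1.0]]  # input as a column vector

# Unmerged forward pass: base output plus the scaled low-rank branch.
base_out = matmul(W, x)
lora_out = matmul(B, matmul(A, x))
unmerged = [[b[0] + (lora_alpha / r) * l[0]] for b, l in zip(base_out, lora_out)]

# Merged forward pass: a single matrix multiplication with the folded weights.
merged = matmul(merge_lora(W, A, B, lora_alpha, r), x)

print("unmerged:", unmerged)
print("merged:  ", merged)
```

Because the two paths agree, the merged model no longer needs the adapter at inference time, which is why it can be saved and loaded like an ordinary checkpoint.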

In [None]:
# Pair each test example id with its predicted label and save to CSV
evaluation = pd.DataFrame({
    'id': dataset_dict['test'].to_pandas()["id"],
    'emotion': y_pred,
})
evaluation.to_csv("submission.csv", index=False)

In [50]:
# Load the base model
model = AutoModelForCausalLM.from_pretrained(GEMMA_PATH, low_cpu_mem_usage=True)

# Merge LoRA and base model and save
peft_model = PeftModel.from_pretrained(model, lora_directory)
merged_model = peft_model.merge_and_unload()
merged_model.save_pretrained("merged-LoRA-Gemma3-model", 
                             safe_serialization=True, 
                             max_shard_size="2GB")

tokenizer = AutoTokenizer.from_pretrained(lora_directory)
tokenizer.save_pretrained("merged-LoRA-Gemma3-model")

('merged-LoRA-Gemma3-model/tokenizer_config.json',
 'merged-LoRA-Gemma3-model/special_tokens_map.json',
 'merged-LoRA-Gemma3-model/chat_template.jinja',
 'merged-LoRA-Gemma3-model/tokenizer.json')