

### Summary of Components and Workflow:
This Jupyter notebook is designed for a multi-lingual sentiment analysis task, likely as part of a Kaggle competition. Here's a summary of its main components and workflow:

### Components:
1. Environment Setup:
    The notebook runs in a Kaggle Python environment with pre-installed analytics libraries.
    It uses numpy and pandas for data manipulation.
    The input data is located in the read-only "../input/" directory.
2. Data Loading:
    Loads the training and test datasets from CSV files.
    The training dataset contains a sentence, a sentiment label, and a language code for each row.
    The test dataset contains sentences for which sentiment labels need to be predicted.
3. Data Preprocessing:
    Converts each training row into a two-turn conversation (the sentence as the user turn, the label as the assistant turn).
    Renders each conversation to text with the tokenizer's chat template.
    No text cleaning is applied, and no validation split is held out.
4. Model Optimization:
    Loads the model with 4-bit quantization to reduce memory usage.
    Switches the model to Unsloth's inference mode before prediction.
5. Model Training:
    Fine-tunes the model on the full training dataset (1,000 examples).
    Uses a learning rate of 2e-4 and a per-device batch size of 2 with 2 gradient accumulation steps (total batch size 4).
    Training runs for 180 steps, which is less than one full pass over the data.
6. Model Evaluation:
    The notebook does not compute accuracy or any validation metric; the only signal shown is the logged training loss.
7. Model Prediction:
    Uses the trained model to predict sentiment labels for the test dataset.
8. Submission Generation:
    Generates a submission file with the predicted sentiment labels.
    Saves the submission file to "output.csv".

### Workflow:
1. Data Preparation: Load and preprocess the data.
2. Library Installation:
    Installs the 'bitsandbytes' library (used for 4-bit quantization) along with 'unsloth', 'accelerate', and 'peft'.
3. Model Configuration:
    Uses a LLaMA 3.1 model variant (8b-instruct).
    Sets up model parameters:
    Maximum sequence length: 2048
    Uses 4-bit quantization (load_in_4bit = True)
    Model input path: "/kaggle/input/llama-3.1/transformers/8b-instruct/2"
4. Data Formatting:
    Defines functions to format examples and prompts for the model input.
5. Model Setup:
    Configures a PEFT (Parameter-Efficient Fine-Tuning) model with specific hyperparameters.
    Uses LoRA (Low-Rank Adaptation) for fine-tuning.
    Sets up a tokenizer with a chat template for LLaMA 3.1.
6. Inference:
    Reads test sentences from a CSV file.
    Processes each sentence through the model to generate sentiment labels.
    Uses regex to extract the model's response.
7. Submission Generation:
    Creates a submission file with predicted labels.
    Saves the results to "output.csv".


This notebook demonstrates an end-to-end process for fine-tuning a large language model (LLaMA 3.1) on a multi-lingual sentiment analysis task, from data preparation to model inference and submission generation. It leverages advanced techniques like PEFT and LoRA for efficient fine-tuning of large models.


In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/llama-3.1/transformers/8b-instruct/2/model.safetensors.index.json
/kaggle/input/llama-3.1/transformers/8b-instruct/2/model-00003-of-00004.safetensors
/kaggle/input/llama-3.1/transformers/8b-instruct/2/config.json
/kaggle/input/llama-3.1/transformers/8b-instruct/2/LICENSE
/kaggle/input/llama-3.1/transformers/8b-instruct/2/model-00001-of-00004.safetensors
/kaggle/input/llama-3.1/transformers/8b-instruct/2/README.md
/kaggle/input/llama-3.1/transformers/8b-instruct/2/USE_POLICY.md
/kaggle/input/llama-3.1/transformers/8b-instruct/2/tokenizer.json
/kaggle/input/llama-3.1/transformers/8b-instruct/2/tokenizer_config.json
/kaggle/input/llama-3.1/transformers/8b-instruct/2/model-00004-of-00004.safetensors
/kaggle/input/llama-3.1/transformers/8b-instruct/2/special_tokens_map.json
/kaggle/input/llama-3.1/transformers/8b-instruct/2/.gitattributes
/kaggle/input/llama-3.1/transformers/8b-instruct/2/model-00002-of-00004.safetensors
/kaggle/input/llama-3.1/transformers/8b-instruct/2/gener

### Install and Update required libraries




In [2]:
%%capture
!pip install bitsandbytes
!pip install unsloth
!pip install accelerate
!pip install peft
!pip install torch==2.1.2 --force-reinstall
!pip install --upgrade transformers
!pip install --upgrade unsloth

### Imports
- Imports necessary libraries for data processing, model training, and inference.

In [3]:
from unsloth import FastLanguageModel
import torch
from datasets import load_dataset
from unsloth.chat_templates import standardize_sharegpt,train_on_responses_only,get_chat_template
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


### Globals
- Sets global variables for model configuration and input data.

In [4]:
max_seq_length = 2048
dtype = None
load_in_4bit = True 
model_input = "/kaggle/input/llama-3.1/transformers/8b-instruct/2"

In [5]:
model, tokenizer = FastLanguageModel.from_pretrained(
            model_name = model_input,
            max_seq_length = max_seq_length,
            dtype = dtype,
            load_in_4bit = load_in_4bit
        )

==((====))==  Unsloth 2025.2.12: Fast Llama patching. Transformers: 4.48.3.
   \\   /|    GPU: Tesla P100-PCIE-16GB. Max memory: 15.888 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 6.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

/kaggle/input/llama-3.1/transformers/8b-instruct/2 does not have a padding token! Will use pad_token = <|finetune_right_pad_id|>.
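
The banner above reports roughly 15.9 GB of GPU memory, which helps explain why the model is loaded with `load_in_4bit = True`. The sketch below is a back-of-the-envelope estimate of the weight memory only. The parameter count is an assumption for an 8B-class model rather than a value reported by the notebook, and the estimate ignores quantization metadata, layers kept in higher precision, activations, and optimizer state.

```python
# Rough weight-memory estimate. PARAMS is an assumed parameter count for an
# 8B-class model, not a value reported by the notebook.
PARAMS = 8.03e9
GIB = 2**30

def weight_gib(num_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, ignoring quantization metadata."""
    return num_params * bits_per_param / 8 / GIB

fp16_gib = weight_gib(PARAMS, 16)
four_bit_gib = weight_gib(PARAMS, 4)
print(f"fp16 weights : {fp16_gib:.1f} GiB")
print(f"4-bit weights: {four_bit_gib:.1f} GiB")
```

Under these assumptions, half-precision weights alone would nearly fill the card, while 4-bit weights leave room for the LoRA adapters and training activations.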


### Helper Functions
- Defines helper functions for data formatting and model input preparation.


In [6]:
# Function to format examples for the model
def format_example(example):
    """
    This function takes an example dictionary and returns a formatted dictionary
    suitable for the model's input.
    
    Args:
    example (dict): A dictionary containing 'sentence', 'label', and 'language' keys.
    
    Returns:
    dict: A formatted dictionary with 'conversations' and 'language' keys.
    """
    return {
        "conversations": [
            {"from": "human", "value": example["sentence"]},
            {"from": "gpt", "value": example["label"]}
        ],
        "language": example["language"]  # Keeping the language field
    }

# Function to format prompts for the model
def formatting_prompts_func(examples):
    """
    This function formats the prompts for the model using the tokenizer's chat template.
    
    Args:
    examples (dict): A dictionary containing 'conversations' key.
    
    Returns:
    dict: A dictionary with 'text' key containing the formatted prompts.
    """
    chats = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False) for convo in chats]
    return {"text": texts}
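
To make the data flow concrete, here is a minimal sketch that runs without a tokenizer. It reuses the shape of `format_example`, then uses two hypothetical stand-ins: `to_role_content` approximates the key renaming that `standardize_sharegpt` is assumed to perform, and `render_llama31` is a simplified imitation of the chat template (the real template also inserts a system header, as the notebook output shows). The sample row is made up.

```python
def format_example(example):
    """Same shape as the notebook's helper: one human turn, one gpt turn."""
    return {
        "conversations": [
            {"from": "human", "value": example["sentence"]},
            {"from": "gpt", "value": example["label"]},
        ],
        "language": example["language"],
    }

# Hypothetical stand-in for standardize_sharegpt: rename ShareGPT-style keys.
ROLE_MAP = {"human": "user", "gpt": "assistant"}

def to_role_content(conversation):
    return [{"role": ROLE_MAP[t["from"]], "content": t["value"]} for t in conversation]

def render_llama31(messages):
    """Simplified renderer; the real template also adds a system header."""
    parts = ["<|begin_of_text|>"]
    for m in messages:
        parts.append(
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
        )
    return "".join(parts)

# Made-up row with the three columns the docstring describes.
row = {"sentence": "The food was great.", "label": "Positive", "language": "en"}
formatted = format_example(row)
text = render_llama31(to_role_content(formatted["conversations"]))
print(text)
```

The rendered string ends with the assistant header followed by the label and `<|eot_id|>`, which is the part the model is trained to produce.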


In [7]:

model = FastLanguageModel.get_peft_model(
                                            model,
                                            r = 32, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
                                            target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                                                              "gate_proj", "up_proj", "down_proj",],
                                            lora_alpha = 64,
                                            lora_dropout = 0.1,
                                            bias = "none",
                                            use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
                                            random_state = 347,
                                            use_rslora = False,  # We support rank stabilized LoRA
                                            loftq_config = None, # And LoftQ
                                        )

Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.1.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2025.2.12 patched 32 layers with 0 QKV layers, 0 O layers and 0 MLP layers.
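
The trainer banner later in the notebook reports 83,886,080 trainable parameters. The arithmetic below is a sketch of where that number can come from with `r = 32` on the seven target modules. The hidden size of 4096 appears in the model printout; the key/value width of 1024 and the MLP width of 14336 are assumptions taken from the published Llama 3.1 8B configuration.

```python
# Assumed projection shapes (in_features, out_features) for Llama 3.1 8B.
# 4096 is confirmed by the model printout; 1024 and 14336 are assumptions.
HIDDEN, KV_DIM, MLP_DIM = 4096, 1024, 14336
NUM_LAYERS = 32
RANK = 32  # r in get_peft_model

TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}

def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) for each adapted layer."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"per layer: {per_layer:,}  total: {total:,}")
```

With these shapes the total matches the figure in the trainer banner, which suggests every listed projection in all 32 layers received an adapter.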


In [8]:
dataset = load_dataset("csv", 
                       data_files="/kaggle/input/multi-lingual-sentiment-analysis/train.csv",
                       split="train")

# Apply the transformation
processed_dataset = dataset.map(format_example, remove_columns=dataset.column_names)

# Display the first example to verify
print(processed_dataset[0])
processed_dataset = standardize_sharegpt(processed_dataset)
processed_dataset = processed_dataset.map(formatting_prompts_func, batched = True,)

Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

{'language': 'bn', 'conversations': [{'from': 'human', 'value': 'কর্মীদের ভাল আচরণ এবং খাবারের পাশাপাশি পানীয় (ককটেল এবং মকটেল) সহ একটি অনন্য জায়গা খুবই ভাল। প্রায়ই একটি সরাসরি সঙ্গীত পরিবেশনের সাথে এমন পরিবেশ তৈরী করে যে একজন দিন এবং সন্ধ্যা উভয় সময়েই জায়গাটি উপভোগ করতে পারে।'}, {'from': 'gpt', 'value': 'Positive'}]}


Standardizing format:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

#### Inspect a formatted example
- Prints the chat-templated text of one training example loaded from the local CSV file.


In [9]:
processed_dataset[5]['text']

"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nইয়াত বহুতো জনপ্ৰিয় কিতাপ বিনামূলীয়াকৈ উপলব্ধ। এক বৃহৎ অডিঅ' সক্ষম সমল আছে যি 5 টা ভাৰতীয় ভাষাত উপলব্ধ।<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nPositive<|eot_id|>"

### Trainer Config
- Config for training the model 


In [10]:
# Set the training arguments for the model
train_agrs = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 2, # Total batch size = 2 * 2 = 4 on one GPU
        warmup_steps = 5,
        max_steps = 180, # A positive max_steps overrides num_train_epochs
        num_train_epochs = 4, # Has no effect while max_steps is set
        learning_rate = 2e-4,
        logging_steps = 5,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        fp16 = not is_bfloat16_supported(), # Fall back to fp16 on GPUs without bf16
        bf16 = is_bfloat16_supported(),
        output_dir = "outputs",
        report_to = "none", # Disables experiment trackers; set to "wandb" etc. to enable one
    )

In [11]:
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = processed_dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = train_agrs,
)

Applying chat template to train dataset (num_proc=2):   0%|          | 0/1000 [00:00<?, ? examples/s]

Tokenizing train dataset (num_proc=2):   0%|          | 0/1000 [00:00<?, ? examples/s]

Tokenizing train dataset (num_proc=2):   0%|          | 0/1000 [00:00<?, ? examples/s]

In [12]:

trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]
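
The idea behind training on responses only is that the loss is computed on the assistant's tokens and not on the prompt. A minimal single-turn sketch of that masking follows. It is an illustration under stated assumptions, not Unsloth's implementation: the helper names are hypothetical, the token ids are toy values rather than real Llama tokens, and multi-turn conversations are not handled.

```python
IGNORE_INDEX = -100  # label value that PyTorch cross-entropy skips by default

def find_subsequence(haystack, needle):
    """Return the start index of needle inside haystack, or -1 if absent."""
    for start in range(len(haystack) - len(needle) + 1):
        if haystack[start:start + len(needle)] == needle:
            return start
    return -1

def mask_before_response(input_ids, response_marker_ids):
    """Copy input_ids to labels, masking everything up to and including the marker."""
    start = find_subsequence(input_ids, response_marker_ids)
    if start < 0:
        # No assistant header found: nothing to train on in this example.
        return [IGNORE_INDEX] * len(input_ids)
    boundary = start + len(response_marker_ids)
    return [IGNORE_INDEX] * boundary + input_ids[boundary:]

# Toy ids: 1-3 = prompt, [7, 8] = assistant header, 42 = label token, 9 = end of turn.
input_ids = [1, 2, 3, 7, 8, 42, 9]
labels = mask_before_response(input_ids, [7, 8])
print(labels)
```

Only the label token and the end-of-turn token keep their ids, so they are the only positions that contribute to the loss in this toy example.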

In [13]:
tokenizer.decode(trainer.train_dataset[5]["input_ids"])

"<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nইয়াত বহুতো জনপ্ৰিয় কিতাপ বিনামূলীয়াকৈ উপলব্ধ। এক বৃহৎ অডিঅ' সক্ষম সমল আছে যি 5 টা ভাৰতীয় ভাষাত উপলব্ধ।<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nPositive<|eot_id|>"

In [14]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 2
\        /    Total batch size = 4 | Total steps = 180
 "-____-"     Number of trainable parameters = 83,886,080


Step,Training Loss
5,8.6613
10,2.8625
15,0.6837
20,0.3941
25,0.4034
30,0.262
35,0.197
40,0.423
45,0.1003
50,0.0941
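
The numbers in the trainer banner can be related with a little arithmetic. The sketch below uses only values shown in the notebook (1,000 examples, per-device batch size 2, 2 accumulation steps, 1 GPU, 180 steps); the variable names are illustrative.

```python
import math

num_examples = 1000                  # "Num examples" in the trainer banner
per_device_train_batch_size = 2
gradient_accumulation_steps = 2
num_gpus = 1
max_steps = 180

# One optimizer step consumes this many examples.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
steps_per_epoch = math.ceil(num_examples / effective_batch)
epochs_covered = max_steps / steps_per_epoch
examples_seen = max_steps * effective_batch

print(f"effective batch size: {effective_batch}")
print(f"steps per epoch     : {steps_per_epoch}")
print(f"epochs covered      : {epochs_covered:.2f}")
print(f"examples seen       : {examples_seen}")
```

Because 180 steps is fewer than the 250 steps in one epoch, the run stops before every training example has been seen, which is consistent with the banner reporting a single epoch.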


### Test Prediction
- Test the model on the test dataset

In [15]:
# Get the chat template for the tokenizer
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
FastLanguageModel.for_inference(model)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(128256, 4096, padding_idx=128004)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=32, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=32, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
        

In [16]:
labels = []
sentences = pd.read_csv("/kaggle/input/multi-lingual-sentiment-analysis/test.csv")['sentence'].tolist()

In [17]:
import re

# Matches the assistant turn in the decoded output
response_pattern = re.compile(r'<\|start_header_id\|>assistant<\|end_header_id\|>(.*?)<\|eot_id\|>', re.DOTALL)

for sen in sentences:
    # Create a list of messages with the user's input
    messages = [
        {"role": "user", "content": f"{sen}"},
    ]
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize = True,
        add_generation_prompt = True,
        return_tensors = "pt",
    ).to("cuda")
    
    # Generate output text using the model
    outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True,
                             temperature = 0.1, min_p = 0.1)
    # Decode the output text, keeping special tokens so the pattern can match
    output_text = tokenizer.batch_decode(outputs)[0]
    match = response_pattern.search(output_text)
    
    if match:
        # Extract the assistant's reply and use it as the predicted label
        labels.append(match.group(1).strip())
    else:
        # Append a placeholder so labels stays aligned with the test rows
        print("No assistant response found.")
        labels.append("")

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
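
The extraction step can be tried without a model. The sketch below applies the same regular expression to a hand-written string shaped like the decoded output. The sample text and the `extract_label` helper are illustrative assumptions; the empty-string fallback is one possible way to keep one label per test row when no complete assistant turn is found.

```python
import re

ASSISTANT_TURN = re.compile(
    r"<\|start_header_id\|>assistant<\|end_header_id\|>(.*?)<\|eot_id\|>",
    re.DOTALL,
)

def extract_label(output_text, fallback=""):
    """Return the stripped assistant reply, or fallback if no complete turn exists."""
    match = ASSISTANT_TURN.search(output_text)
    return match.group(1).strip() if match else fallback

# Hand-written example of a decoded generation (not real model output).
sample = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "Great service.<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\nPositive<|eot_id|>"
)
# Simulate a generation that was cut off before the closing <|eot_id|>.
truncated = sample.rsplit("<|eot_id|>", 1)[0]

print(repr(extract_label(sample)))
print(repr(extract_label(truncated)))
```

The truncated case matters because a reply that reaches `max_new_tokens` without emitting `<|eot_id|>` would not match the pattern at all.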


In [18]:
submission = pd.read_csv("/kaggle/input/multi-lingual-sentiment-analysis/sample_submission.csv")

In [19]:
submission['label'] = labels

In [20]:
submission.to_csv("output.csv",index=False)