<a href="https://colab.research.google.com/github/S-AMIM-ALI/DSA/blob/main/Benchmark_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Multilingual Sentiment Analysis using XLM-RoBERTa

This notebook fine-tunes `xlm-roberta-base` to classify multilingual text
as Positive, Neutral, or Negative, using the Dhruv Multilingual Sentiment
Analysis dataset (`dhruv0808/indic_sentiment_analyzer` on the Hugging Face
Hub). It covers preprocessing, model training, evaluation, and a
side-by-side comparison of true and predicted labels.

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

In [None]:
dataset = load_dataset("dhruv0808/indic_sentiment_analyzer", split="train[:20000]")
print(dataset.column_names)


['Sentence', 'Label']


In [None]:
from datasets import Dataset

In [None]:
# Keep only rows whose label is one of the three expected classes
df = dataset.to_pandas()
df = df[df["Label"].isin(["Positive", "Neutral", "Negative"])]

# Sample a small balanced subset: at most 500 rows per class.
# A fixed random_state makes the subset reproducible across runs.
balanced_df = df.groupby('Label', group_keys=False).apply(
    lambda x: x.sample(min(len(x), 500), random_state=42)
)
print("Balanced subset size:", len(balanced_df))

# Convert to a Hugging Face Dataset (drop the leftover pandas index)
small_ds = Dataset.from_pandas(balanced_df, preserve_index=False)

# Map string labels to integer ids, and keep the reverse mapping for display
label_mapping = {"Positive": 0, "Neutral": 1, "Negative": 2}
id2label = {v: k for k, v in label_mapping.items()}

def map_labels(example):
    return {
        "labels": label_mapping[example["Label"]],
        "Sentence": str(example["Sentence"])  # ensure string
    }

small_ds = small_ds.map(map_labels, batched=False)

Balanced subset size: 1500
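
The per-class cap used above (`min(len(x), 500)`) can be illustrated on toy data without pandas. The following is a minimal sketch, not the notebook's own code: the helper `balanced_sample` and the toy rows are hypothetical, and a fixed seed is assumed so the draw is repeatable.

```python
import random
from collections import Counter, defaultdict

def balanced_sample(rows, label_key, n_per_class, seed=42):
    """Return at most n_per_class rows per label, drawn with a fixed seed."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    sample = []
    for label in sorted(by_label):            # sorted for a deterministic order
        group = by_label[label]
        k = min(len(group), n_per_class)      # same cap as min(len(x), 500)
        sample.extend(rng.sample(group, k))
    return sample

# Toy rows (hypothetical, not taken from the dataset)
rows = (
    [{"Sentence": f"pos {i}", "Label": "Positive"} for i in range(8)]
    + [{"Sentence": f"neu {i}", "Label": "Neutral"} for i in range(5)]
    + [{"Sentence": f"neg {i}", "Label": "Negative"} for i in range(2)]
)
subset = balanced_sample(rows, "Label", n_per_class=3)
counts = Counter(row["Label"] for row in subset)
print(counts)
```

A class with fewer rows than the cap contributes all of its rows, so the result is only balanced when every class has at least `n_per_class` examples.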



In [None]:
df

index,Sentence,Label
0,The crisis management team is still assessing ...,Neutral
1,ಅಗತ್ಯವಿರುವವರಿಗೆ ಪರಿಹಾರ ಸಾಮಗ್ರಿಗಳನ್ನು ವಿತರಿಸಲಾಗ...,Neutral
2,தொலைக்காட்சியில் நீதிமன்ற நாடகங்கள் வழக்கறிஞர்...,Positive
3,இந்த மாதத்திற்கான ஊதியத்தை hr துறை சரியான நேரத...,Neutral
4,টুইটারে গ্রাহক পরিষেবা খুব সহায়ক এবং বন্ধুত্ব...,Positive
...,...,...
19995,జట్టు మొత్తం డిఫెన్స్ పరంగా లీగ్ లో 8వ స్థానంల...,Neutral
19996,ക്ലിനിക്കൽ പരീക്ഷണങ്ങളിൽ വാഗ്ദാനം ചെയ്ത ഒരു പു...,Neutral
19997,"ಸಮೀಕ್ಷೆಯ ಪ್ರಕಾರ, ತಮ್ಮ ಸುಸ್ಥಿರತೆಯನ್ನು ಸುಧಾರಿಸುವ...",Positive
19998,ಸ್ಪಾ ಚಿಕಿತ್ಸೆಯನ್ನು ಧಾವಿಸಲಾಯಿತು ಮತ್ತು ನಾವು ನಿರೀ...,Negative


In [None]:
print("Columns:", dataset.column_names)
print("Example row:", dataset[0])

Columns: ['Sentence', 'Label']
Example row: {'Sentence': 'The crisis management team is still assessing the situation and developing a plan.', 'Label': 'Neutral'}


In [None]:
model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
def tokenize_fn(ex):
    return tokenizer(
        ex["Sentence"],
        padding="max_length",
        truncation=True,
        max_length=128
    )

tokenized_ds = small_ds.map(tokenize_fn, batched=True)
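
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a toy sketch that operates on plain lists of token ids. It is an illustration only: `pad_or_truncate` is a hypothetical helper, the ids are made up, and the pad id of 1 is an assumption (in practice read it from `tokenizer.pad_token_id`). The real tokenizer also keeps the closing special token when it truncates, which this sketch does not model.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    ids = list(token_ids[:max_length])        # truncation: drop the tail
    mask = [1] * len(ids)                     # 1 marks a real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding

PAD_ID = 1  # assumption for illustration only

short_ids, short_mask = pad_or_truncate([0, 100, 200, 2], max_length=6, pad_id=PAD_ID)
long_ids, long_mask = pad_or_truncate(list(range(10)), max_length=6, pad_id=PAD_ID)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every example ends up with the same length, which is what lets the `Trainer` stack them into fixed-shape batches; the attention mask tells the model which positions to ignore.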


In [None]:
indices = np.arange(len(tokenized_ds))
train_idx, test_idx = train_test_split(indices, test_size=0.2, random_state=42)

train_ds = tokenized_ds.select(train_idx)
test_ds = tokenized_ds.select(test_idx)
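
The split above is a plain random split, so class proportions in the test set can drift slightly from the balanced subset. A stratified split keeps them equal. The sketch below is a hand-rolled NumPy illustration on toy labels (the helper `stratified_split` is hypothetical); scikit-learn's `train_test_split` offers the same behaviour through its `stratify` argument.

```python
import numpy as np

def stratified_split(labels, test_size=0.2, seed=42):
    """Split indices so each class contributes the same fraction to the test set."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    train_parts, test_parts = [], []
    for cls in np.unique(labels):
        cls_idx = np.flatnonzero(labels == cls)   # positions of this class
        rng.shuffle(cls_idx)
        n_test = int(round(len(cls_idx) * test_size))
        test_parts.append(cls_idx[:n_test])
        train_parts.append(cls_idx[n_test:])
    return np.sort(np.concatenate(train_parts)), np.sort(np.concatenate(test_parts))

# Toy labels: 10 examples for each of the 3 classes
toy_labels = np.array([0] * 10 + [1] * 10 + [2] * 10)
toy_train, toy_test = stratified_split(toy_labels)

print(len(toy_train), len(toy_test))
print(np.bincount(toy_labels[toy_test]))   # per-class counts in the test set
```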

In [None]:
num_labels = 3
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)


Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
!pip install evaluate
import evaluate



In [None]:
accuracy_metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # compute() already returns {"accuracy": value}; return it as-is so the
    # Trainer logs a scalar rather than a nested dict.
    return accuracy_metric.compute(predictions=preds, references=labels)
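
The metric function reduces each row of logits to a class id with `argmax` and compares it to the reference label. The following NumPy-only sketch shows that computation on made-up logits; `accuracy_from_logits` is a hypothetical stand-in for illustration, not the `evaluate` implementation.

```python
import numpy as np

def accuracy_from_logits(logits, labels):
    """Return a flat {"accuracy": float} dict, the shape the Trainer logs as a scalar."""
    preds = np.argmax(logits, axis=-1)            # highest-scoring class per row
    return {"accuracy": float(np.mean(preds == labels))}

# Toy logits for 4 examples and 3 classes (made-up values)
toy_logits = np.array([
    [2.0, 0.1, -1.0],   # argmax -> 0
    [0.2, 1.5, 0.3],    # argmax -> 1
    [0.0, 0.4, 0.9],    # argmax -> 2
    [1.2, 1.0, -0.5],   # argmax -> 0 (reference is 1, so this one is wrong)
])
toy_refs = np.array([0, 1, 2, 1])

result = accuracy_from_logits(toy_logits, toy_refs)
print(result)   # 3 of 4 correct
```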

In [None]:
training_args = TrainingArguments(
    output_dir="sentiment_model",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    load_best_model_at_end=True
)

In [None]:
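trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    # `processing_class` replaces the deprecated `tokenizer` argument
    # (transformers >= 4.46); on older versions pass tokenizer=tokenizer.
    processing_class=tokenizer,
    compute_metrics=compute_metrics
)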



In [None]:
trainer.train()


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.838383,0.710000
2,No log,0.521242,0.826667
3,No log,0.482053,0.820000




TrainOutput(global_step=225, training_loss=0.7663132052951389, metrics={'train_runtime': 292.109, 'train_samples_per_second': 12.324, 'train_steps_per_second': 0.77, 'total_flos': 236802075955200.0, 'train_loss': 0.7663132052951389, 'epoch': 3.0})
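
The `global_step=225` reported in `TrainOutput` can be cross-checked from the settings used in this notebook. The sketch below assumes a single device and no gradient accumulation (neither is configured above) and hard-codes the 1,500-row subset and the 80/20 split. It also suggests why the Training Loss column reads "No log": `TrainingArguments` logs every 500 steps by default, which is more than the whole run.

```python
import math

subset_size = 1500        # balanced subset: 500 rows per class
test_fraction = 0.2       # test_size passed to train_test_split
batch_size = 16           # per_device_train_batch_size
epochs = 3                # num_train_epochs
default_logging_steps = 500   # TrainingArguments default

n_test = round(subset_size * test_fraction)
n_train = subset_size - n_test
steps_per_epoch = math.ceil(n_train / batch_size)   # last partial batch still counts
total_steps = steps_per_epoch * epochs

print(n_train, steps_per_epoch, total_steps)
print("training loss logged during run:", total_steps >= default_logging_steps)
```

Lowering `logging_steps` below the total step count would make the training loss appear in the per-epoch table.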

In [None]:
pred_output = trainer.predict(test_ds)
preds = np.argmax(pred_output.predictions, axis=-1)

for i in range(10):
    text = test_ds[i]["Sentence"]
    true_lbl = id2label[test_ds[i]["labels"]]
    pred_lbl = id2label[preds[i]]
    print(f"Sentence: {text}")
    print(f" True: {true_lbl}   Pred: {pred_lbl}\n")

Sentence: यह फर के एक-एक टुकड़े को अलग कर देता है। यह सारे ढीले फर को बाहर निकालता है और शेडिंग को रोकता है।
 True: Positive   Pred: Negative

Sentence: ગ્રાહક સેવા ઉત્કૃષ્ટ હતી અને ખરેખર મારા શોપિંગ અનુભવને આનંદપ્રદ બનાવ્યો.
 True: Positive   Pred: Positive

Sentence: ଗତ ତ୍ରୟମାସରେ ବିକ୍ରିରେ ସାମାନ୍ୟ ହ୍ରାସ ଘଟିଛି।
 True: Negative   Pred: Negative

Sentence: ଇଣ୍ଟରନେଟ୍ ର ନିୟନ୍ତ୍ରଣ ଏବଂ ଆଇନ ଅଭ୍ୟାସ ଉପରେ ଏହାର ସମ୍ଭାବ୍ୟ ପ୍ରଭାବ ବିଷୟରେ ଚାଲିଥିବା ବିତର୍କ ।
 True: Negative   Pred: Neutral

Sentence: సమస్యను పరిష్కరించడానికి నేను చాలాసార్లు కంపెనీని సంప్రదించాను.
 True: Negative   Pred: Neutral

Sentence: ನನ್ನ ಬ್ಯಾಂಕಿನ ಮೊಬೈಲ್ ಠೇವಣಿ ಪ್ರಕ್ರಿಯೆಯು ಸರಳವಾಗಿದೆ, ಆದರೆ ಕೆಲವೊಮ್ಮೆ ಅದನ್ನು ಪ್ರಕ್ರಿಯೆಗೊಳಿಸಲು ಸ್ವಲ್ಪ ಸಮಯ ತೆಗೆದುಕೊಳ್ಳುತ್ತದೆ.
 True: Neutral   Pred: Neutral

Sentence: મેરિયટથી નવી હોટલમાં મોટી સુવિધાઓ અને સ્ટાફ હતો.
 True: Positive   Pred: Neutral

Sentence: আমাকে আমার সমস্যার জন্য অতিরিক্ত সংস্থানগুলির একটি লিঙ্ক সরবরাহ করা হয়েছিল।
 True: Neutral   Pred: Neutral

Sentence: কৰ্মচাৰীজনে সমৰ্থনমূলক আৰু সামগ্ৰ