In [4]:
pip install transformers datasets torch scikit-learn




Step 1: Set up the environment

Check your Python version and environment:
Make sure you're using Python 3.7 or later. You can check by running python --version in a terminal.
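The same check can be done from inside the notebook. The sketch below is one possible way to do it; the helper name `python_is_supported` is hypothetical, and the 3.7 minimum is taken from the text above.

```python
import sys

# Minimum version stated in the text above.
MIN_VERSION = (3, 7)

def python_is_supported(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True when the interpreter is at least the minimum version."""
    return tuple(version_info[:2]) >= minimum

print(sys.version)
print("Supported:", python_is_supported())
```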

Step 2: Choose a pre-trained model and dataset
Pick a model that supports your language:
For example, if your local language is Bengali, you might use a model like csebuetnlp/banglabert (available on the Hugging Face model hub).
If no dedicated model exists for your language, you can still try multilingual models like bert-base-multilingual-cased.

Get a sentiment dataset in your language:
If you don’t have a dataset, you’ll need to create one. The dataset should have:

- Text samples (e.g., sentences or short paragraphs in your local language).
- Labels indicating the sentiment (e.g., positive, negative, neutral).
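Let's assume your dataset is a CSV file with two columns, text and label, where label is an integer class id (0, 1, or 2 for three classes). The code below loads it from /content/bangla_sentiment_data.csv; adjust the path to match your file.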
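If you are building the dataset yourself, a file with this layout could be written as in the sketch below. The three rows, the 0/1/2 encoding, and the output location (the system temp directory) are illustrative assumptions, not part of the original data.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical encoding: 0 = negative, 1 = neutral, 2 = positive.
rows = [
    {"text": "আজকের দিনটি আমার জন্য অনেক বিশেষ।", "label": 2},
    {"text": "খাবারটা খুব খারাপ ছিল।", "label": 0},
    {"text": "আমি আজ বাজারে গিয়েছিলাম।", "label": 1},
]

# Assumed location; point this wherever your notebook can read from.
csv_path = Path(tempfile.gettempdir()) / "sentiment_data.csv"

# utf-8 keeps the Bengali text intact; newline="" avoids blank rows on Windows.
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "label"])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back to confirm the header and row count.
with open(csv_path, newline="", encoding="utf-8") as f:
    loaded = list(csv.DictReader(f))
print(len(loaded), "rows; columns:", list(loaded[0].keys()))
```

A real dataset needs far more than three rows; the run recorded in this notebook used 1,681 examples in total.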

In [6]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from datasets import load_dataset
import torch


Load your dataset:

In [8]:
dataset = load_dataset('csv', data_files='/content/bangla_sentiment_data.csv')


Generating train split: 0 examples [00:00, ? examples/s]

In [19]:
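# Hold out 20% of the data for validation.
# The result is named `splits` so it does not shadow any function called train_test_split.
splits = dataset['train'].train_test_split(test_size=0.2)
train_dataset = splits['train']
valid_dataset = splits['test']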
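Because the split is random, it may be worth checking that both parts still contain every sentiment class in similar proportions. The helper below is a minimal sketch of such a check; `label_distribution` is a hypothetical name and the example labels are made up for illustration.

```python
from collections import Counter

def label_distribution(labels):
    """Return {label: fraction of examples} for an iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in sorted(counts.items())}

# Illustrative labels only; with the splits above you could pass
# train_dataset["label"] and valid_dataset["label"] instead.
example_labels = [0, 1, 2, 2, 1, 0, 2, 2]
print(label_distribution(example_labels))
```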


In [20]:
model_name = "csebuetnlp/banglabert"  # Or any model that supports your language
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)  # For 3 sentiment classes


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at csebuetnlp/banglabert and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [21]:
def tokenize(batch):
    # Pad and truncate every example to one fixed length so batches can be stacked.
    # max_length must be given explicitly: this tokenizer has no predefined maximum,
    # so truncation=True alone has no effect. 128 is an assumed value; tune it for your texts.
    return tokenizer(batch['text'], padding='max_length', truncation=True, max_length=128)

train_dataset = train_dataset.map(tokenize, batched=True)
valid_dataset = valid_dataset.map(tokenize, batched=True)


Map:   0%|          | 0/1344 [00:00<?, ? examples/s]

Map:   0%|          | 0/337 [00:00<?, ? examples/s]

In [22]:
train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
valid_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])


In [23]:
# Configure training
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # named eval_strategy in newer transformers releases
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    logging_dir="./logs",
    save_strategy="epoch",  # must match the evaluation strategy for load_best_model_at_end
    load_best_model_at_end=True
)




In [24]:
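trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset
)

# Run fine-tuning. Without this call, the model saved below still has
# a randomly initialized classifier head.
trainer.train()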
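With the Trainer set up as above, evaluation reports only the loss. Trainer also accepts a `compute_metrics` callable, and the sketch below shows one way such a function could compute accuracy and macro-averaged F1. It is written in plain Python so it works on lists as well as arrays, and it assumes the predictions are a 2-D array of logits with one row per example.

```python
def compute_metrics(eval_pred):
    """Return accuracy and macro-averaged F1 for (logits, labels)."""
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row.
    preds = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    labels = [int(y) for y in labels]

    accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)

    # Per-class F1 = 2*TP / (2*TP + FP + FN), averaged over all classes seen.
    f1_scores = []
    for cls in sorted(set(labels) | set(preds)):
        tp = sum(p == cls and y == cls for p, y in zip(preds, labels))
        fp = sum(p == cls and y != cls for p, y in zip(preds, labels))
        fn = sum(p != cls and y == cls for p, y in zip(preds, labels))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)

    return {"accuracy": accuracy, "macro_f1": sum(f1_scores) / len(f1_scores)}
```

To use it, pass `compute_metrics=compute_metrics` when constructing the Trainer; the returned values then appear alongside the loss at each evaluation.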


In [26]:
model.save_pretrained("./fine_tuned_model")
tokenizer.save_pretrained("./fine_tuned_model")


('./fine_tuned_model/tokenizer_config.json',
 './fine_tuned_model/special_tokens_map.json',
 './fine_tuned_model/vocab.txt',
 './fine_tuned_model/added_tokens.json',
 './fine_tuned_model/tokenizer.json')

In [40]:
# Load the model
from transformers import pipeline

classifier = pipeline("sentiment-analysis", model="./fine_tuned_model")

# Predict sentiment for a new text
new_text = "আজকের দিনটি আমার জন্য অনেক বিশেষ।"
predicted_label = classifier(new_text)
print(predicted_label)


Device set to use cpu


[{'label': 'LABEL_1', 'score': 0.35249781608581543}]
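The pipeline reports a generic name such as LABEL_1 because the saved model config carries no sentiment names. The sketch below shows one possible way to translate the output; the mapping in `ID2LABEL` is a hypothetical encoding and must match however your CSV's label column was numbered.

```python
# Hypothetical id-to-name mapping; use the encoding of your own label column.
ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}

def readable_predictions(predictions, id2label=ID2LABEL):
    """Replace generic 'LABEL_<id>' names in pipeline output with readable ones."""
    readable = []
    for pred in predictions:
        class_id = int(pred["label"].split("_")[-1])
        readable.append({"label": id2label[class_id], "score": pred["score"]})
    return readable

# Output copied from the run above.
raw = [{'label': 'LABEL_1', 'score': 0.35249781608581543}]
print(readable_predictions(raw))
```

An alternative is to pass `id2label` and `label2id` dictionaries to `from_pretrained` when loading the model, so the names are stored in the config and the pipeline returns them directly. Note also that a score of about 0.35 is close to the 1/3 that uniform guessing over three classes would give, so this prediction carries little confidence.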
