# Nepali Sentiment Analysis with BERT 

This notebook demonstrates how to fine-tune a BERT-based model for sentiment analysis on Nepali text using a publicly available dataset and GPU resources on Kaggle.  
The final goal is to export the trained model and later use it in a FastAPI backend, a Streamlit UI, and a Dockerized deployment.  

---

## 1. Project overview

The main objective of this project is to build a **Nepali sentiment classifier** that can automatically label input sentences as positive, negative, or neutral.  
We focus on modern Transformer models and specifically on a BERT variant that is trained for the Nepali language, so that the model can handle Devanagari script and Nepali vocabulary.  

In this notebook we will:  
- Load the `Shushant/NepaliSentiment` dataset from Hugging Face.  
- Fine-tune the `Shushant/nepaliBERT` model for sentiment classification.  
- Use Kaggle’s GPU accelerator to speed up training.  
- Evaluate the model on a held-out test set and save the fine-tuned weights for later deployment.  

---

## 2. Dataset description: `Shushant/NepaliSentiment`

For this project we use the **NepaliSentiment** dataset hosted on Hugging Face under the name `Shushant/NepaliSentiment`.  
The dataset contains Nepali text samples with sentiment labels, which are suitable for training and evaluating a three-class sentiment classifier.  

In this dataset, the labels are defined as follows:  
- `0` → negative sentiment  
- `1` → positive sentiment  
- `2` → neutral sentiment  

In the later code cells we will:  
- Load the dataset directly from Hugging Face using the `datasets` library.  
- Inspect a few examples to understand the text format and label distribution.  
- Merge the provided train and test splits and re-split them into 90% training and 10% test data.  
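
Checking the label distribution can be sketched in plain Python. The labels in this dataset are stored as strings, so the sketch normalizes them first and counts anything outside `0`/`1`/`2` separately. `label_distribution` is a hypothetical helper name and the sample labels are made up for illustration; in the notebook the input would be the `label` column of a split, for example `dataset["train"]["label"]`.

```python
from collections import Counter

# Label mapping as stated in this notebook.
ID2LABEL = {0: "negative", 1: "positive", 2: "neutral"}


def label_distribution(raw_labels):
    """Count labels by class name; anything outside 0/1/2 goes to 'unexpected'."""
    counts = Counter()
    for raw in raw_labels:
        text = str(raw).strip()
        if text in {"0", "1", "2"}:
            counts[ID2LABEL[int(text)]] += 1
        else:
            counts["unexpected"] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    # Map each class name to (count, share of all rows).
    return {name: (n, n / total) for name, n in counts.items()}


# Hypothetical labels for illustration only, not taken from the dataset.
sample_labels = ["0", "1", "2", "0", " 1", "-", "0"]
for name, (count, share) in label_distribution(sample_labels).items():
    print(f"{name:<10} {count:>3}  {share:.1%}")
```

A large `unexpected` count would be a signal to look at those rows before deciding how to map them.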

---

## 3. Model choice: `Shushant/nepaliBERT`

We use **Shushant/nepaliBERT**, a BERT-base model pre-trained on a Nepali news corpus of about 10 million sentences (≈4.6 GB of text).  
Because this model is pre-trained only on Nepali text, it is expected to capture Nepali grammar, vocabulary, and Devanagari script better than generic multilingual models, which makes it a natural choice for Nepali sentiment analysis.  

In this notebook, we will:  
- Load `Shushant/nepaliBERT` using the Hugging Face `transformers` library.  
- Attach a classification head to predict sentiment labels (e.g., negative, neutral, positive).  
- Fine-tune all parameters on the `Shushant/NepaliSentiment` dataset using GPU acceleration.  
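
To make the classification head's output concrete: the head produces one raw score (logit) per class, and the predicted label is the class with the highest softmax probability. The following is a minimal plain-Python sketch with made-up logits and hypothetical helper names (`softmax`, `predict_label`); in the notebook the logits come from the model itself.

```python
import math

# Label mapping as stated in this notebook.
ID2LABEL = {0: "negative", 1: "positive", 2: "neutral"}


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits):
    """Return (label name, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best_id = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best_id], probs[best_id]


# Hypothetical logits for one sentence, for illustration only.
label, confidence = predict_label([0.1, 2.0, -1.0])
print(label, round(confidence, 3))
```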

---

## 4. Running on Kaggle with GPU

This notebook is designed to run on Kaggle with a GPU accelerator to speed up BERT fine-tuning.  
Before executing the training cells, please ensure that the notebook accelerator is set to GPU in the **Settings → Accelerator** panel.  

In the setup section we will:  
- Check that a GPU is available using PyTorch (`torch.cuda.is_available()`).  
- Move the model and input tensors to the GPU device for faster training and inference.  
- Configure batch sizes and number of epochs to fit within Kaggle’s GPU memory and runtime limits.  

---

## 5. Training and evaluation plan

Our training pipeline will follow standard steps used for BERT-based text classification.  
We will use a PyTorch training loop to fine-tune the model on the training split and evaluate on the test split.  

Key steps:  
- Tokenize the Nepali text using the nepaliBERT tokenizer with appropriate `max_length`, padding, and truncation.  
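- Train for up to 20 epochs with early stopping (patience of 2 epochs) and a low learning rate suitable for BERT (2e-5).  
- Monitor accuracy and weighted F1-score on the test split after each epoch to detect overfitting. Because the same split also drives early stopping and checkpoint selection, the reported test scores may be slightly optimistic.  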

At the end of training, we will report the final metrics and show example predictions on a few Nepali sentences to illustrate how the model behaves.  

---

## 6. Saving the model for deployment

Once the model is trained, we need to export it so that it can be integrated into a FastAPI service and a Streamlit UI later.  
Kaggle stores files written to `/kaggle/working` as notebook outputs, which we can download after the run and use in other environments.  

In the final section of this notebook we will:  
- Save the fine-tuned model and tokenizer into a directory under `/kaggle/working/nepali-sentiment-model`.  
- Optionally compress that directory into a ZIP file for easier download from the “Output” tab.  
- Briefly describe how this exported model will later be loaded in a FastAPI app and called by a Streamlit frontend.  
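
The optional ZIP step could look roughly like this, using only the standard library. `zip_model_dir` is a hypothetical helper name, and the demo uses a temporary directory with a placeholder file in place of the real `/kaggle/working/nepali-sentiment-model` output so that it can run anywhere.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def zip_model_dir(model_dir, archive_base):
    """Compress model_dir into '<archive_base>.zip' and return the archive path."""
    model_dir = Path(model_dir)
    if not model_dir.is_dir():
        raise FileNotFoundError(f"Model directory not found: {model_dir}")
    # make_archive appends the '.zip' suffix itself and returns the final path.
    return shutil.make_archive(str(archive_base), "zip", root_dir=str(model_dir))


# Demo: a temporary directory stands in for the saved model directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "nepali-sentiment-model"
    demo_dir.mkdir()
    (demo_dir / "config.json").write_text("{}", encoding="utf-8")  # placeholder file
    archive_path = zip_model_dir(demo_dir, Path(tmp) / "nepali-sentiment-model")
    with zipfile.ZipFile(archive_path) as zf:
        print(zf.namelist())
```

On Kaggle, the archive base would sit under `/kaggle/working` so that the ZIP file appears among the notebook outputs.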

---

## 7. Notebook structure

For clarity, the rest of the notebook is organized roughly as follows; the code section splits some of these stages into finer-grained numbered steps.  

1. **Imports and environment setup**  
2. **Loading the NepaliSentiment dataset**  
3. **Exploratory data analysis (EDA)**  
4. **Tokenization and data preparation**  
5. **Model definition and training configuration**  
6. **Fine-tuning nepaliBERT on GPU**  
7. **Evaluation and example predictions**  
8. **Saving and exporting the model**  

This structure should make it easy for readers to understand and reproduce the full workflow.

# Code Starts From Here

## 1. Install libraries and import modules

In [1]:
!pip install -q "protobuf==3.20.3"

import os
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

print("Pinned protobuf to 3.20.3 and set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python.")

Pinned protobuf to 3.20.3 and set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python.


In [2]:
!pip install -q transformers datasets accelerate

import os
import random
import numpy as np
import torch

from torch.utils.data import DataLoader
from torch.optim import AdamW

from datasets import load_dataset, concatenate_datasets
from transformers import (
    BertTokenizerFast,
    BertForSequenceClassification,
    get_linear_schedule_with_warmup,
)

from sklearn.metrics import accuracy_score, f1_score, classification_report


## 2. Set random seeds and configure device

In [3]:
# For reproducibility
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)

# Use GPU if available (Kaggle usually provides a GPU when enabled)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

Using device: cuda


## 3. Load the NepaliSentiment dataset

In [4]:
raw_dataset = load_dataset("Shushant/NepaliSentiment")

# Merge original train and test into a single dataset
full_dataset = concatenate_datasets(
    [raw_dataset["train"], raw_dataset["test"]]
)

# Re-split the merged data into 90% train and 10% test, reusing the global seed
dataset = full_dataset.train_test_split(test_size=0.1, seed=SEED)
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 7196
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 800
    })
})

## 4. Inspect a few samples and label meaning

In [5]:
print(dataset["train"].features)

for i in range(5):
    example = dataset["train"][i]
    print(f"\nExample {i}:")
    print("Text:", example["text"])
    print("Label:", example["label"])

{'text': Value('string'), 'label': Value('string')}

Example 0:
Text: अभिन्ता राउतको स्थिति सुन्दा आँसु नै थामिएन ! तथापि ती कुलङ्गार  "घृणित"को निन्दा गरेपछि अधुरै रहन्छ, नेता भएपछि जे जस्तोसुकै अपराध गरेपनि हुने, अनागरिक नागरिक र नागरिकलाई अनागरिक, कालोलाई सेतो सेतोलाई कालो ? यो कस्तो देश र कस्तो कानुन कहिले सम्म एस्तै ???
Label: 0

Example 1:
Text: देश यस्तो हालत पुग्दा पनि जनता आखिर किन चुप छन नेपाली जनता नबोल्नुको कारण केहोला यसरि जनता चुप लाग्ने हो भने नेताहरु त झन मातिन लागे त गाठे
Label: 0

Example 2:
Text: दोसि लाई कडा भन्दा कडा कारबाही गर्न माग गर्दछु
Label: 0

Example 3:
Text: यो मुजि 70 करोड लाई किन ल्याउनु हो मुजि धमला
Label: 0

Example 4:
Text: एमसिसिबारे प्रचन्ड ज्युको धाराणा सहि छ तर सझदारिमा हैन ‘परिमार्जन भएमा पास’ भन्नु ठिक । तपाईको ब्याक्तित्व केपि ओलीसंग एक पर्तिसत पनि मिल्दैन, कहिले पनि सहकार्य नगर्नु राम्रो होला।  आफ्नो सुरक्षाको ख्याल राख्दै पत्रकारले लेटदैमा असुरंक्षित जिवनशैली नअपनाउनुनै राम्रो होला l जालझेल काम नलागे भौतिक आक्रमण गरेर सिध्याउने दाउ धेरैको छ भ

In [6]:
id2label = {
    0: "negative",
    1: "positive",
    2: "neutral",
}
label2id = {v: k for k, v in id2label.items()}

print(id2label)

{0: 'negative', 1: 'positive', 2: 'neutral'}


## 5. Load the nepaliBERT tokenizer and model

In [7]:
MODEL_NAME = "Shushant/nepaliBERT"

# Load tokenizer and model
tokenizer = BertTokenizerFast.from_pretrained(MODEL_NAME)

model = BertForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(id2label),
    id2label=id2label,
    label2id=label2id,
)

model.to(device)
print("Model loaded and moved to:", device)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at Shushant/nepaliBERT and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model loaded and moved to: cuda


## 6. Tokenize the dataset

### 1. Max character length in train["text"]

In [8]:
# Compute character lengths of each text in the train split
def add_char_len(example):
    txt = example["text"]
    # Handle possible None values safely
    txt = "" if txt is None else str(txt)
    return {"char_len": len(txt)}

train_with_len = dataset["train"].map(add_char_len)

max_char_len = max(train_with_len["char_len"])
avg_char_len = sum(train_with_len["char_len"]) / len(train_with_len["char_len"])

print("Max char length in train:", max_char_len)
print("Average char length in train:", avg_char_len)

Max char length in train: 2855
Average char length in train: 88.4339911061701


### 2. Max token length with the nepaliBERT tokenizer

In [9]:
def add_token_len(example):
    txt = example["text"]
    txt = "" if txt is None else str(txt)
    tokens = tokenizer(
        txt,
        truncation=False,          # do not truncate, we want the true length
        add_special_tokens=True,   # includes [CLS] and [SEP]
    )
    return {"token_len": len(tokens["input_ids"])}

train_with_token_len = dataset["train"].map(add_token_len)

max_token_len = max(train_with_token_len["token_len"])
avg_token_len = sum(train_with_token_len["token_len"]) / len(train_with_token_len["token_len"])

print("Max token length in train:", max_token_len)
print("Average token length in train:", avg_token_len)

Max token length in train: 954
Average token length in train: 29.785992217898833
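
The maximum (954 tokens) is far above the average (about 30), so a few very long texts dominate the upper end. A percentile view of the lengths can help when choosing a truncation length. The sketch below is plain Python with hypothetical helper names and made-up lengths; in the notebook the input would be `train_with_token_len["token_len"]`.

```python
import math


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def truncated_share(values, max_length):
    """Fraction of examples longer than max_length (these would be truncated)."""
    return sum(1 for v in values if v > max_length) / len(values)


# Hypothetical token lengths for illustration only.
token_lens = [5, 12, 18, 25, 30, 31, 40, 64, 120, 954]
for pct in (50, 90, 100):
    print(f"p{pct}: {nearest_rank_percentile(token_lens, pct)} tokens")
print("share truncated at 228:", truncated_share(token_lens, 228))
```

With `padding="max_length"`, every example is padded to the chosen length, so a smaller value that still covers most texts reduces compute per batch.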


In [10]:
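# Truncation length: well above the average (~30 tokens) and below the
# 512-token limit typical of BERT-base models. Longer texts (up to 954
# tokens in the train split) are truncated.
MAX_LENGTH = 228

def preprocess_function(batch):
    # Replace missing or blank texts with "" and coerce everything else to str
    texts = [
        str(t) if (t is not None and str(t).strip() != "")
        else ""
        for t in batch["text"]
    ]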
    return tokenizer(
        texts,
        padding="max_length",
        truncation=True,
        max_length=MAX_LENGTH,
    )

# Apply tokenizer to the whole dataset
encoded_dataset = dataset.map(preprocess_function, batched=True)

# Rename 'label' to 'labels' for the model
encoded_dataset = encoded_dataset.rename_column("label", "labels")

# Convert labels to integer IDs (0, 1, 2) safely
def fix_labels(batch):
    cleaned_labels = []
    for lbl in batch["labels"]:
        s = str(lbl).strip()
        if s in ["0", "1", "2"]:
            cleaned_labels.append(int(s))
        else:
            # For any unexpected label like "-" or "", treat as neutral (2)
            cleaned_labels.append(2)
    batch["labels"] = cleaned_labels
    return batch

encoded_dataset = encoded_dataset.map(fix_labels, batched=True)

print("Unique labels in train:", set(encoded_dataset["train"]["labels"]))
print("Type of one label:", type(encoded_dataset["train"]["labels"][0]))

Unique labels in train: {0, 1, 2}
Type of one label: <class 'int'>


## 7. Create PyTorch DataLoaders

In [11]:
# DataLoader was already imported in the setup cell
encoded_dataset.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "labels"],
)

BATCH_SIZE = 16

train_dataset = encoded_dataset["train"]
test_dataset = encoded_dataset["test"]

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)

# Quick sanity check
batch = next(iter(train_loader))
for k, v in batch.items():
    print(k, type(v))

labels <class 'torch.Tensor'>
input_ids <class 'torch.Tensor'>
attention_mask <class 'torch.Tensor'>


The DataLoaders batch the tokenized dataset. The training loader shuffles the examples so that each epoch sees them in a different order, which usually helps training, while the test loader keeps a fixed order.

## 8. Define optimizer and learning rate scheduler

In [12]:
EPOCHS = 20
LEARNING_RATE = 2e-5

optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)

# Total training steps: number of batches * epochs
total_steps = len(train_loader) * EPOCHS

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),  # 10% of steps for warmup
    num_training_steps=total_steps,
)
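
To see what this schedule does to the learning rate, here is a plain-Python sketch of a linear warmup followed by linear decay. It is written from the documented behavior of this kind of schedule rather than copied from the library, and `linear_warmup_multiplier` is a hypothetical name. The step counts are derived from this notebook: 7196 training rows at batch size 16 give 450 batches per epoch.

```python
import math


def linear_warmup_multiplier(step, warmup_steps, total_steps):
    """Learning-rate multiplier: linear ramp 0 -> 1, then linear decay 1 -> 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


# Values taken from this notebook's configuration.
STEPS_PER_EPOCH = math.ceil(7196 / 16)   # 450 batches per epoch
TOTAL_STEPS = STEPS_PER_EPOCH * 20       # EPOCHS = 20
WARMUP_STEPS = int(0.1 * TOTAL_STEPS)    # 10% warmup
BASE_LR = 2e-5

for epoch in (1, 2, 4, 20):
    step = epoch * STEPS_PER_EPOCH
    lr = BASE_LR * linear_warmup_multiplier(step, WARMUP_STEPS, TOTAL_STEPS)
    print(f"after epoch {epoch:>2}: step {step:>4}, lr = {lr:.2e}")
```

Under these numbers the warmup spans the first two epochs. If training stops well before epoch 20, the learning rate will have decayed only a little from its peak, because the decay is planned over all 20 epochs.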

## 9. Training loop

In [13]:
def to_device(batch, device):
    return {k: v.to(device) for k, v in batch.items()}

In [14]:
from tqdm.auto import tqdm

def train_one_epoch(model, data_loader, optimizer, scheduler, device):
    model.train()
    total_loss = 0.0

    for batch in tqdm(data_loader, desc="Training", leave=False):
        batch = to_device(batch, device)

        optimizer.zero_grad()
        outputs = model(**batch)
        loss = outputs.loss

        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()

        total_loss += loss.item()

    return total_loss / len(data_loader)



def evaluate(model, data_loader, device):
    model.eval()
    preds = []
    true_labels = []

    with torch.no_grad():
        for batch in tqdm(data_loader, desc="Evaluating", leave=False):
            batch = to_device(batch, device)

            outputs = model(**batch)
            logits = outputs.logits
            batch_preds = torch.argmax(logits, dim=-1)

            preds.extend(batch_preds.cpu().numpy())
            true_labels.extend(batch["labels"].cpu().numpy())

    acc = accuracy_score(true_labels, preds)
    f1 = f1_score(true_labels, preds, average="weighted")
    return acc, f1, preds, true_labels
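
The `average="weighted"` setting means that F1 is computed per class and then averaged with weights proportional to how many true examples each class has. A small plain-Python sketch of that calculation follows (`weighted_f1` is a hypothetical helper and the labels are made up); it is meant to illustrate the metric, not to replace the scikit-learn call.

```python
def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's true count."""
    classes = sorted(set(y_true) | set(y_pred))
    total = len(y_true)
    score = 0.0
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        support = tp + fn  # number of true examples of class c
        score += f1 * support / total
    return score


# Hypothetical labels: 0 = negative, 1 = positive, 2 = neutral.
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0]
print(f"weighted F1 = {weighted_f1(y_true, y_pred):.4f}")
```

Because the weights follow class frequency, a rare class that is predicted poorly has little effect on this score; macro-averaged F1 is a common alternative when that matters.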

## 10. Run the training over multiple epochs

In [15]:
best_f1 = 0.0
best_epoch = -1
patience = 2              # how many epochs with no improvement to wait
epochs_without_improve = 0

for epoch in range(EPOCHS):
    print(f"\n===== Epoch {epoch + 1}/{EPOCHS} =====")

    train_loss = train_one_epoch(model, train_loader, optimizer, scheduler, device)
    print(f"Average training loss: {train_loss:.4f}")

    acc, f1, preds, true_labels = evaluate(model, test_loader, device)
    print(f"Test accuracy: {acc:.4f}")
    print(f"Test weighted F1: {f1:.4f}")

    if f1 > best_f1:
        best_f1 = f1
        best_epoch = epoch + 1
        epochs_without_improve = 0
        print(f"New best model at epoch {best_epoch}, saving checkpoint...")
        output_dir = "/kaggle/working/nepali-sentiment-model"
        os.makedirs(output_dir, exist_ok=True)
        model.save_pretrained(output_dir)
        tokenizer.save_pretrained(output_dir)
    else:
        epochs_without_improve += 1
        print(f"No improvement for {epochs_without_improve} epoch(s).")

    if epochs_without_improve >= patience:
        print(f"\nEarly stopping triggered at epoch {epoch + 1}.")
        break

print(f"\nBest epoch was {best_epoch} with F1 = {best_f1:.4f}")


===== Epoch 1/20 =====
Average training loss: 0.8917
Test accuracy: 0.7137
Test weighted F1: 0.7034
New best model at epoch 1, saving checkpoint...

===== Epoch 2/20 =====
Average training loss: 0.6622
Test accuracy: 0.7650
Test weighted F1: 0.7626
New best model at epoch 2, saving checkpoint...

===== Epoch 3/20 =====
Average training loss: 0.5007
Test accuracy: 0.7575
Test weighted F1: 0.7560
No improvement for 1 epoch(s).

===== Epoch 4/20 =====
Average training loss: 0.3594
Test accuracy: 0.7538
Test weighted F1: 0.7529
No improvement for 2 epoch(s).

Early stopping triggered at epoch 4.

Best epoch was 2 with F1 = 0.7626
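
The early-stopping rule used in the loop can be isolated and replayed on the logged scores. This is a minimal sketch with a hypothetical function name, `run_early_stopping`; it mirrors the counters in the training cell but does no training or checkpointing.

```python
def run_early_stopping(scores, patience):
    """Replay early stopping over per-epoch scores.

    Returns (best_epoch, best_score, last_epoch_run), with 1-based epochs.
    """
    best_score = 0.0
    best_epoch = -1
    stale = 0        # epochs since the last improvement
    last_epoch = 0
    for epoch, score in enumerate(scores, start=1):
        last_epoch = epoch
        if score > best_score:
            best_score, best_epoch, stale = score, epoch, 0
        else:
            stale += 1
        if stale >= patience:
            break
    return best_epoch, best_score, last_epoch


# Weighted F1 per epoch, copied from the training log above.
f1_log = [0.7034, 0.7626, 0.7560, 0.7529]
print(run_early_stopping(f1_log, patience=2))
```

With `patience=2` this returns `(2, 0.7626, 4)`, matching the run above: the best checkpoint comes from epoch 2 and training stops after epoch 4.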
