## Bangla Sentiment Analysis Using a Dataset from Kaggle/Hugging Face

- Dataset: [SentNoB](https://huggingface.co/datasets/khondoker/SentNoB)
- Model: BanglaBERT (`csebuetnlp/banglabert`)
- Description:
This lab exercise demonstrates a Bangla sentiment analysis workflow using the SentNoB dataset from Hugging Face and the BanglaBERT model. The steps are: installing dependencies, importing libraries, loading and previewing the CSV splits, converting them to Hugging Face datasets, tokenizing with the BanglaBERT tokenizer, converting to TensorFlow datasets, compiling the model, training, evaluating on a test set, saving the model, and running a prediction on a sample sentence. The cells run sequentially, with the output shown after each one.


In [1]:
# Step 1: Install dependencies (uncomment on first run) and import libraries
#!pip install datasets transformers
from datasets import Dataset, load_dataset, DatasetDict
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
import tensorflow as tf
import numpy as np
import pandas as pd

In [2]:
# Step 2: Load the train, test, and validation CSV files
train_df = pd.read_csv("Train.csv")
test_df = pd.read_csv("Test.csv")
val_df = pd.read_csv("Val.csv")

In [3]:
print(train_df.head())
print(test_df.head())
print(val_df.head())

                                                Data  Label
0  মুগ্ধ হয়ে গেলাম মামু. আর তোমায় কি কমু. বলো তোম...      1
1  এই কুত্তার বাচ্চাদের জন্য দেশটা আজ এমন অবস্তায়...      2
2                          ভাই আপনার কথাই যাদু রয়েছে      1
3                        উওরটা আমার অনেক ভাল লেগেছে       1
4  আমার নিজের গাড়ী নিয়ে কি সাজেক যেতে পারবো না ?...      0
                                                Data  Label
0  স্বাস্থ্যবান হতে চাই , আমি বয়সের তুলনায় অনেক ব...      0
1                        ভাইয়া নতুন ভিডিও আসে না কেন      0
2        সৌরভ গাঙ্গুলী ছাড়া দাদাগিরি কখনো জমে উঠত না      0
3  ক্রিকেট কে বাচাতে হলে পাপকে অতিশিগ্রিই তাকেও গ...      2
4                          আমিতো সেই ঝালপ্রিয়ো মানুষ      1
                                                Data  Label
0       আর আমার খুবেই আনন্দ লাকছে ভাইটি চাকরি পেয়েছে      1
1  ভাই আমাদের আগের মেয়র আনিচুল হক নাই যে আমাদের ক...      2
2  আমি মার্ক ওয়েন আর সনির বিশাল ভক্ত । একটা সময় ভ...      1
3            ৩ মাস না যেতেই একেকজন ফুলে 
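
Before tokenizing, it can help to check how the three label values are distributed in each split. The sketch below counts labels with `collections.Counter`. The helper name `label_distribution` is illustrative and not part of the notebook, and the sample list holds only the five labels visible in the training preview above. With the real data one would pass `train_df["Label"].tolist()` instead.

```python
from collections import Counter

def label_distribution(labels):
    """Return {label: (count, fraction)} for a list of integer labels."""
    total = len(labels)
    if total == 0:
        return {}
    counts = Counter(labels)
    return {label: (n, n / total) for label, n in sorted(counts.items())}

# The five labels shown in the training preview; illustration only.
# With the notebook's data, use train_df["Label"].tolist().
preview_labels = [1, 2, 1, 1, 0]
for label, (n, frac) in label_distribution(preview_labels).items():
    print(f"label {label}: {n} rows ({frac:.0%})")
```

A strongly skewed distribution would be a reason to look at per-class metrics rather than accuracy alone.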

In [4]:
# Step 3: Convert to Hugging Face Datasets
train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)
val_dataset = Dataset.from_pandas(val_df)

dataset = DatasetDict({
    "train": train_dataset,
    "test": test_dataset,
    "validation": val_dataset,
})

In [5]:
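# Step 4: Load Tokenizer and Model
model_name = "csebuetnlp/banglabert"

# from_pt=True converts the PyTorch checkpoint to TensorFlow weights.
# num_labels=3 attaches a new, randomly initialized 3-class classification head,
# which is why the loading message below reports newly initialized weights.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = TFAutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3, from_pt=True)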





Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFElectraForSequenceClassification: ['discriminator_predictions.dense.bias', 'discriminator_predictions.dense_prediction.bias', 'discriminator_predictions.dense.weight', 'electra.embeddings.position_ids', 'discriminator_predictions.dense_prediction.weight']
- This IS expected if you are initializing TFElectraForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFElectraForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFElectraForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.dens

In [6]:
# Step 5: Preprocessing Function
def preprocess_function(examples):
    return tokenizer(examples["Data"], truncation=True, padding="max_length", max_length=128)
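
The call above uses `truncation=True`, `padding="max_length"`, and `max_length=128`, so every example ends up with exactly 128 token IDs and a matching attention mask. The following is a simplified pure-Python illustration of that effect, not the tokenizer's actual implementation. The function `pad_or_truncate` and the pad ID of 0 are assumptions made for the example (the real pad ID is `tokenizer.pad_token_id`, and a real tokenizer keeps its closing special token when truncating).

```python
def pad_or_truncate(token_ids, max_length=128, pad_id=0):
    """Mimic the shape effect of truncation=True with padding="max_length".

    Simplified: a real tokenizer preserves special tokens when truncating;
    this sketch just cuts the list.
    """
    ids = list(token_ids)[:max_length]   # truncate long inputs
    mask = [1] * len(ids)                # 1 marks a real token
    n_pad = max_length - len(ids)
    ids += [pad_id] * n_pad              # pad short inputs up to max_length
    mask += [0] * n_pad                  # 0 marks padding, ignored by attention
    return {"input_ids": ids, "attention_mask": mask}

# Made-up token IDs, shortened to max_length=8 so the result is easy to read.
short = pad_or_truncate([2, 15, 27, 3], max_length=8)
print(short["input_ids"])       # [2, 15, 27, 3, 0, 0, 0, 0]
print(short["attention_mask"])  # [1, 1, 1, 1, 0, 0, 0, 0]
```

Fixed-length padding keeps every tensor the same shape, at the cost of some wasted computation on short comments.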

In [7]:
# Step 6: Tokenize the Dataset
encoded_dataset = dataset.map(preprocess_function, batched=True)


Map:   0%|          | 0/12575 [00:00<?, ? examples/s]

Map:   0%|          | 0/1586 [00:00<?, ? examples/s]

Map:   0%|          | 0/1567 [00:00<?, ? examples/s]

In [8]:
# Step 7: Convert to TensorFlow Datasets (Manual tf.data.Dataset conversion)
def convert_to_tf_dataset(encoded_split, shuffle=False):
    # Materialize the columns as lists so the dataset has a known size for step estimation
    input_ids = [example["input_ids"] for example in encoded_split]
    attention_mask = [example["attention_mask"] for example in encoded_split]
    labels = [example["Label"] for example in encoded_split]

    dataset = tf.data.Dataset.from_tensor_slices((
        {
            "input_ids": tf.convert_to_tensor(input_ids, dtype=tf.int32),
            "attention_mask": tf.convert_to_tensor(attention_mask, dtype=tf.int32),
        },
        tf.convert_to_tensor(labels, dtype=tf.int64),
    ))

    # Shuffle only when requested; evaluation splits do not need it
    if shuffle:
        dataset = dataset.shuffle(1000)
    return dataset.batch(64).prefetch(tf.data.AUTOTUNE)

train_tf_dataset = convert_to_tf_dataset(encoded_dataset["train"], shuffle=True)
val_tf_dataset = convert_to_tf_dataset(encoded_dataset["validation"])
test_tf_dataset = convert_to_tf_dataset(encoded_dataset["test"])
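
With a batch size of 64, the number of batches per epoch follows from the split sizes reported by the `Map` progress bars (12575, 1586, and 1567 examples, in the order train, test, validation). The short calculation below is only a sanity check, and the helper name `steps_per_epoch` is hypothetical. The training split works out to 197 batches, which matches the `37/197` counter in the training output.

```python
import math

def steps_per_epoch(num_examples, batch_size=64):
    """Number of batches when the last partial batch is kept (the default for .batch())."""
    return math.ceil(num_examples / batch_size)

# Split sizes as reported by the Map progress bars in the notebook output.
split_sizes = {"train": 12575, "test": 1586, "validation": 1567}
for name, n in split_sizes.items():
    print(name, steps_per_epoch(n))
```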


In [9]:
# Step 8: Compile the Model
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"]
)
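
`SparseCategoricalCrossentropy(from_logits=True)` expects integer class labels and raw, unnormalized model outputs. It applies softmax internally and returns the negative log-probability of the true class. A minimal pure-Python sketch of that computation for a single example follows; the function name and the logit values are made up for illustration.

```python
import math

def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# Three equal logits give a uniform prediction over 3 classes, so the loss is ln(3).
print(round(sparse_ce_from_logits([0.0, 0.0, 0.0], label=1), 4))  # 1.0986
```

A freshly initialized 3-class head tends to start near ln(3), roughly 1.10, so a loss in that range early in training is not surprising.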

In [None]:
# Step 9: Train the Model
model.fit(train_tf_dataset, validation_data=val_tf_dataset, epochs=10)



 37/197 [====>.........................] - ETA: 1:22:11 - loss: 1.0710 - accuracy: 0.4003
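
The description lists evaluation on the test split, but this export shows no evaluation cell. In Keras, `model.evaluate(test_tf_dataset)` would report loss and accuracy once training finishes. For a per-class view, the sketch below computes accuracy and a confusion matrix from true and predicted class IDs in plain Python. The helper names and the toy inputs are illustrative and are not results from this notebook.

```python
def confusion_matrix(y_true, y_pred, num_classes=3):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy values for illustration only (not model output).
y_true = [0, 1, 2, 1, 0, 2]
y_pred = [0, 1, 1, 1, 2, 2]
print(accuracy(y_true, y_pred))
for row in confusion_matrix(y_true, y_pred):
    print(row)
```

When gathering real labels and predictions from a `tf.data.Dataset`, collecting both in the same pass over the batches keeps them aligned even if the dataset is shuffled.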

In [None]:
# Save the fine-tuned model and tokenizer in Hugging Face format (reloadable with from_pretrained)
model.save_pretrained("my_banglabert_classifier")
tokenizer.save_pretrained("my_banglabert_classifier")

In [None]:
# Step 10: Test with an example
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
import tensorflow as tf

# Load the saved model and tokenizer
loaded_tokenizer = AutoTokenizer.from_pretrained("my_banglabert_classifier")
loaded_model = TFAutoModelForSequenceClassification.from_pretrained("my_banglabert_classifier")

# Example input
test_text = "আজকের ছবিটি সুন্দর ছিল না তেমন "

# Preprocess the input
inputs = loaded_tokenizer(test_text, return_tensors="tf", truncation=True, padding="max_length", max_length=128)

# Make a prediction
outputs = loaded_model(inputs)
logits = outputs.logits
predicted_class_id = tf.argmax(logits, axis=-1).numpy()[0]

# Map the predicted class ID back to a label name.
# Mapping inferred from the sample rows printed in In [3] (0 = neutral, 1 = positive, 2 = negative);
# verify it against the dataset card before relying on it.
label_map = {0: "Neutral", 1: "Positive", 2: "Negative"}
predicted_label = label_map[predicted_class_id]

print(f"Input text: {test_text}")
print(f"Predicted label ID: {predicted_class_id}")
print(f"Predicted label: {predicted_label}")
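
`tf.argmax` returns only the winning class. To see how confident the model is, the logits can be converted to probabilities with softmax (`tf.nn.softmax(logits, axis=-1)` in TensorFlow). The sketch below shows the same computation in plain Python on made-up logits; the values and the `softmax` helper are illustrative, not output from the trained model.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

example_logits = [0.2, -1.0, 1.5]  # made-up values, not model output
probs = softmax(example_logits)

# Index of the largest probability, equivalent to argmax over the logits.
predicted_class_id = max(range(len(probs)), key=probs.__getitem__)
print(predicted_class_id, [round(p, 3) for p in probs])
```

A top probability close to the others suggests the prediction is uncertain and may deserve a manual check.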