## 🤖 1. Environment Setup & Library Installation

This initial block handles the setup of our Python environment. We need to install several key libraries to build our sentiment analysis model.

- **`transformers`**: The core Hugging Face library providing the ModernBERT model and the `Trainer` API.
- **`accelerate`**: A companion library that optimizes PyTorch training across different hardware configurations.
- **`datasets`**: Used to efficiently handle and preprocess our data, especially for integration with the `Trainer`.
- **`bertviz` & `umap-learn`**: Visualization tools for inspecting model internals (optional but good practice).
- **`seaborn`**: A powerful library for creating informative statistical visualizations like heatmaps and boxplots.

In [None]:
# !pip install -U transformers
# !pip install -U accelerate
# !pip install -U datasets
# !pip install -U bertviz
# !pip install -U umap-learn
# !pip install seaborn --upgrade
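
Before running the rest of the notebook, it can help to confirm which versions of these packages are actually installed. The sketch below uses the standard-library `importlib.metadata`; the helper name `report_versions` is hypothetical and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version

def report_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

# Package names taken from the install cell above.
required = ["transformers", "accelerate", "datasets", "bertviz", "umap-learn", "seaborn"]
for name, ver in report_versions(required).items():
    print(f"{name:15s} {ver or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` can be added by uncommenting the matching `pip` line.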

## 📥 2. Data Loading & Initial Inspection

We begin by loading our dataset using the `pandas` library, a staple for data manipulation in Python. The data is a CSV file of tweets hosted on GitHub.

After loading, we perform a quick check-up:
- **`df.info()`**: Gives a summary of the DataFrame, including data types and non-null counts.
- **`df.isnull().sum()`**: Checks for any missing values, a critical data cleaning step.
- **`df['label'].value_counts()`**: Shows the distribution of our target classes. This step is crucial as it will reveal if our dataset is imbalanced.

In [None]:
import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/laxmimerit/All-CSV-ML-Data-Files-Download/master/twitter_multi_class_sentiment.csv")

In [None]:
df.info()
df.isnull().sum()

In [None]:
df['label'].value_counts()
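
To make "imbalanced" concrete, the counts can be turned into class shares and a single imbalance ratio. This is a minimal sketch on made-up toy labels, not the real tweet data; the `label_name` column and the variable names are assumptions for illustration.

```python
import pandas as pd

# Toy labels standing in for the real label column (values are made up).
toy = pd.DataFrame({"label_name": ["joy"] * 6 + ["sadness"] * 3 + ["surprise"] * 1})

counts = toy["label_name"].value_counts()                 # absolute counts per class
shares = toy["label_name"].value_counts(normalize=True)   # fraction of rows per class
imbalance_ratio = counts.max() / counts.min()             # largest class vs. smallest class

print(shares)
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```

A ratio close to 1 indicates balanced classes; the further above 1 it is, the more the largest class dominates.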

## ⚖️ 3. Data Balancing via Undersampling

Our initial inspection shows a significant **class imbalance** (the 'joy' class is much larger than the others). A model trained on this data would tend to favor the majority class. To fix this, we perform **undersampling**.

1. **Rename Class**: We rename `'joy'` to `'Normal'` for clarity.
2. **Separate Classes**: Each emotional category is isolated into its own DataFrame.
3. **Resample**: We use `resample` from `scikit-learn` to randomly select 500 samples from each class. This ensures every emotion has an equal number of training examples.
4. **Concatenate**: The newly balanced DataFrames are combined into a single, balanced dataset.

This creates a fair dataset for training our model.

In [None]:
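# Assign the result back: calling replace(..., inplace=True) on a selected column is deprecated in recent pandas
df['label_name'] = df['label_name'].replace('joy', 'Normal')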
display(df['label_name'].value_counts())

In [None]:
from sklearn.utils import resample

# Every class is reduced to the same size so that none dominates training
class_names = ['Normal', 'sadness', 'anger', 'fear', 'love', 'surprise']
n_per_class = 500

# Sample each class separately: without replacement, with a fixed seed for reproducibility
undersampled = [
    resample(df[df['label_name'] == name],
             replace=False,
             n_samples=n_per_class,
             random_state=42)
    for name in class_names
]

# Combine the per-class samples into one balanced DataFrame
df_balanced = pd.concat(undersampled)

df = df_balanced
display(df['label_name'].value_counts())
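
The same balancing idea can also be written with pandas alone. The sketch below is an alternative to the `resample` approach, shown on made-up toy data. It assumes pandas 1.1 or newer, where `DataFrameGroupBy.sample` is available, and it will not necessarily pick the same rows as `resample` does.

```python
import pandas as pd

# Toy imbalanced data (made up): 8 'Normal', 4 'sadness', 3 'anger' rows.
toy = pd.DataFrame({
    "text": [f"tweet {i}" for i in range(15)],
    "label_name": ["Normal"] * 8 + ["sadness"] * 4 + ["anger"] * 3,
})

# Draw the same number of rows from every class, without replacement.
n_per_class = 3
balanced = toy.groupby("label_name").sample(n=n_per_class, random_state=42)

print(balanced["label_name"].value_counts())
```

`n_per_class` must not exceed the size of the smallest class, otherwise sampling without replacement raises an error.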

## 📊 4. Exploratory Data Analysis (EDA)

Now that our data is balanced, we visualize it to gain more insights.

1. **Frequency Plot**: A horizontal bar chart visually confirms that all our classes now have an equal number of samples (500 each).
2. **Words per Tweet**: We create a new feature, `Words per Tweet`, and use a boxplot to see if the length of a tweet varies across different emotions. This can reveal interesting patterns in the data.

In [None]:
import matplotlib.pyplot as plt

In [None]:
label_counts = df['label_name'].value_counts(ascending=True)
label_counts.plot.barh()
plt.title("Frequency of Classes")
plt.show()

In [None]:
df['Words per Tweet'] = df['text'].str.split().apply(len)
df.boxplot("Words per Tweet", by="label_name")

## 🔤 5. Text Tokenization

Transformer models like BERT don't understand raw text. They require a numerical representation. **Tokenization** is the process of converting text into numbers the model can process.

- **Model Checkpoint**: We select `answerdotai/ModernBERT-base`, a modern variant of the BERT model.
- **`AutoTokenizer`**: We load the specific tokenizer that corresponds to our chosen model. This is crucial for ensuring the text is processed in the same way the model was pre-trained.
- **Encoding**: The tokenizer converts our text into `input_ids` (numerical representations of words/subwords) and an `attention_mask` (which tells the model which tokens to pay attention to).

In [None]:
from transformers import AutoTokenizer, AutoModelForMaskedLM

# model_ckpt = "bert-base-uncased"
model_ckpt = "answerdotai/ModernBERT-base"

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)


text = "I love machine learning! Tokenization is awesome!!"
encoded_text = tokenizer(text)
print(encoded_text)

In [None]:
len(tokenizer.vocab), tokenizer.vocab_size, tokenizer.model_max_length
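
To illustrate what `input_ids` and `attention_mask` contain, here is a deliberately simplified word-level encoder. It is not how the ModernBERT tokenizer works (real tokenizers split text into learned subwords); the vocabulary and the `toy_encode` helper are made up for illustration.

```python
# Toy vocabulary (made up; real tokenizers learn tens of thousands of subwords).
vocab = {"[CLS]": 0, "[SEP]": 1, "[UNK]": 2, "i": 3, "love": 4, "machine": 5, "learning": 6}

def toy_encode(text):
    """Lowercase, split on whitespace, and map each word to its ID."""
    words = text.lower().split()
    ids = [vocab.get(word, vocab["[UNK]"]) for word in words]
    input_ids = [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]   # wrap in start/end markers
    attention_mask = [1] * len(input_ids)                    # every position holds a real token
    return {"input_ids": input_ids, "attention_mask": attention_mask}

print(toy_encode("I love machine learning"))
print(toy_encode("I love tokenization"))   # unknown word falls back to [UNK]
```

Words missing from the vocabulary collapse to `[UNK]` here, which is one reason real tokenizers prefer subword units.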

## ✂️ 6. Data Splitting and Formatting

To properly train and evaluate our model, we split the data into three sets:
- **Training Set**: Used to fine-tune the model (70% of the data).
- **Validation Set**: Used to monitor performance during training and to detect overfitting (10%).
- **Test Set**: A completely unseen set used for final model evaluation (20%).

We use `stratify=df['label_name']` to ensure that each set has the same proportion of emotion classes. Finally, we convert these pandas DataFrames into a `DatasetDict`, the standard format for the Hugging Face `Trainer`.

In [None]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.3, stratify=df['label_name'])
test, validation = train_test_split(test, test_size=1/3, stratify=test['label_name'])

train.shape, test.shape, validation.shape
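
The two-stage split can be checked on a small balanced toy dataset. This sketch uses made-up data and an added `random_state` (an assumption, not present in the cell above) so the result is repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy balanced data (made up): 6 classes x 100 rows.
classes = ["Normal", "sadness", "anger", "fear", "love", "surprise"]
toy = pd.DataFrame({
    "text": [f"tweet {i}" for i in range(600)],
    "label_name": [c for c in classes for _ in range(100)],
})

# 70% train, then split the remaining 30% into 20% test and 10% validation.
train_toy, rest = train_test_split(toy, test_size=0.3, stratify=toy["label_name"], random_state=0)
test_toy, val_toy = train_test_split(rest, test_size=1/3, stratify=rest["label_name"], random_state=0)

print(len(train_toy), len(test_toy), len(val_toy))
print(train_toy["label_name"].value_counts().to_dict())
```

Because both calls are stratified, every class contributes the same number of rows to each split.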

In [None]:
from datasets import Dataset, DatasetDict

dataset = DatasetDict(
    {'train':Dataset.from_pandas(train, preserve_index=False),
     'test':Dataset.from_pandas(test, preserve_index=False),
     'validation': Dataset.from_pandas(validation, preserve_index=False)
     }

)

dataset

In [None]:
dataset['train'][0], dataset['train'][1]

## 🗺️ 7. Applying Tokenization to the Entire Dataset

We now apply our tokenizer to all splits of the `DatasetDict`. We define a `tokenize` function and use the efficient `.map()` method to apply it to every example.

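- **`batched=True`**: Passes the examples to `tokenize` in batches instead of one at a time, which is much faster.
- **`batch_size=None`**: Treats each split as a single batch, so every example in a split is padded to the same length.
- **`padding=True`**: Adds special `[PAD]` tokens to shorter sentences so all sentences in a batch have the same length.
- **`truncation=True`**: Cuts longer sentences down to the model's maximum acceptable length.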

In [None]:
def tokenize(batch):
    temp = tokenizer(batch['text'], padding=True, truncation=True)
    return temp

print(tokenize(dataset['train'][:2]))

In [None]:
emotion_encoded = dataset.map(tokenize, batched=True, batch_size=None)
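
The effect of padding and truncation on a batch can be sketched in plain Python. The helper `pad_and_truncate` and the token IDs are hypothetical; a real tokenizer also keeps the closing special token when it truncates, which this simplified version does not.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad all to the longest remaining one."""
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 0 marks padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token IDs for three "sentences" of different lengths.
batch = [[101, 7, 8, 102], [101, 9, 102], [101, 5, 6, 7, 8, 9, 102]]
out = pad_and_truncate(batch, max_length=5)
print(out["input_ids"])
print(out["attention_mask"])
```

The attention mask is what lets the model ignore the padded positions.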

## 🏷️ 8. Model Configuration

Before loading the model, we need to configure it for our specific task: multi-class sequence classification.

- **`label2id` & `id2label`**: We create dictionaries to map our string labels (e.g., 'sadness') to integer IDs (e.g., `1`) and vice-versa. The model works with integers, but we want human-readable outputs.
- **`AutoConfig`**: We load the model's configuration and attach the two mapping dictionaries; the number of labels is inferred from them.
- **`AutoModelForSequenceClassification`**: We load the ModernBERT model with a new, untrained classification head on top. This head will be fine-tuned on our data.
- **`device`**: We ensure the model is moved to a GPU (`cuda`) if available, for faster training.

In [None]:
# label2id, id2label
label2id = {x['label_name']:x['label'] for x in dataset['train']}
id2label = {v:k for k,v in label2id.items()}

label2id, id2label
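
A slightly more defensive way to build the two mappings is to check that each label name always maps to the same ID and to sort the result by ID. This is a sketch on made-up rows; `build_label_maps` is a hypothetical helper and the IDs shown are not taken from the real dataset.

```python
# Made-up (label_name, label) rows standing in for the training split.
rows = [
    {"label_name": "sadness", "label": 0},
    {"label_name": "love", "label": 2},
    {"label_name": "sadness", "label": 0},
    {"label_name": "Normal", "label": 1},
]

def build_label_maps(rows):
    """Build label2id / id2label and fail loudly if a name maps to two different IDs."""
    label2id = {}
    for row in rows:
        name, idx = row["label_name"], row["label"]
        if label2id.setdefault(name, idx) != idx:
            raise ValueError(f"{name!r} maps to both {label2id[name]} and {idx}")
    # Sort by ID so both dictionaries list classes in a stable order.
    label2id = dict(sorted(label2id.items(), key=lambda item: item[1]))
    id2label = {idx: name for name, idx in label2id.items()}
    return label2id, id2label

toy_label2id, toy_id2label = build_label_maps(rows)
print(toy_label2id)
print(toy_id2label)
```

Sorting by ID keeps the class order identical across runs, which is convenient when labeling plots later on.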

In [None]:
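# Optional: load the base checkpoint with its masked-language-modeling head.
# It is not used for fine-tuning; `model` is replaced by the classifier in the next cell.
from transformers import AutoModelForMaskedLM
import torch
model = AutoModelForMaskedLM.from_pretrained(model_ckpt, trust_remote_code=True)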

In [None]:
from transformers import AutoModelForSequenceClassification, AutoConfig

num_labels = len(label2id)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
config = AutoConfig.from_pretrained(model_ckpt, label2id=label2id, id2label=id2label)
model = AutoModelForSequenceClassification.from_pretrained(model_ckpt, config=config).to(device)

In [None]:
# !pip install evaluate

## ⚙️ 9. Training Setup

We use the Hugging Face `Trainer` API to handle the training loop. This requires two main components:

1. **`TrainingArguments`**: This object defines all the hyperparameters for the training run, such as:
   - `num_train_epochs`: The number of times to iterate over the full training dataset.
   - `learning_rate`: Controls how much the model's weights are adjusted during training.
   - `per_device_train_batch_size`: The number of samples processed at once.
   - `eval_strategy = 'epoch'`: Tells the trainer to run an evaluation at the end of each epoch.

2. **`compute_metrics` function**: A custom function that calculates performance metrics during evaluation. We report `accuracy` and the weighted F1-score, which averages the per-class F1 scores weighted by the number of true samples in each class, so it stays meaningful for multi-class and imbalanced data.

In [None]:
from transformers import TrainingArguments

batch_size = 64
training_dir = "bert_base_train_dir"

training_args = TrainingArguments( output_dir=training_dir,
                                  overwrite_output_dir = True,
                                  num_train_epochs = 15,
                                  learning_rate = 2e-5,
                                  per_device_train_batch_size = batch_size,
                                  per_device_eval_batch_size = batch_size,
                                  weight_decay = 0.05,
                                  eval_strategy = 'epoch',
                                  disable_tqdm = False
)
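
These settings determine roughly how many optimizer updates the run performs. The estimate below is a sketch that assumes a single device, no gradient accumulation, and 2,100 training rows (70% of 6 classes x 500 tweets); `total_optimizer_steps` is a hypothetical helper, not part of the `Trainer` API.

```python
import math

def total_optimizer_steps(n_train, batch_size, epochs, n_devices=1, grad_accum=1):
    """Rough count of optimizer updates for a full training run."""
    effective_batch = batch_size * n_devices * grad_accum
    steps_per_epoch = math.ceil(n_train / effective_batch)   # last partial batch still counts
    return steps_per_epoch * epochs

# Assumed numbers: 6 classes x 500 tweets = 3000 rows, 70% of which go to training.
n_train = 3000 * 70 // 100
print(total_optimizer_steps(n_train, batch_size=64, epochs=15))   # 33 steps/epoch * 15 = 495
```

Under those assumptions, the result should be close to the total step count shown in the training progress bar.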

In [None]:
# use sklearn to build compute metrics
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)

    f1 = f1_score(labels, preds, average="weighted")
    acc = accuracy_score(labels, preds)

    return {"accuracy": acc, "f1": f1}
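
To show what the weighted F1-score measures, here is a plain-Python version on a tiny made-up example. It is meant as an illustration of the formula, with per-class F1 computed as `2*TP / (2*TP + FP + FN)`; the notebook itself relies on `f1_score` from scikit-learn.

```python
def weighted_f1(labels, preds):
    """Support-weighted mean of per-class F1 scores."""
    total = 0.0
    for cls in sorted(set(labels)):
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
        fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        support = tp + fn                      # number of true samples of this class
        total += f1 * support
    return total / len(labels)

# Made-up true labels and predictions for three classes.
toy_labels = [0, 0, 1, 1, 2, 2]
toy_preds = [0, 1, 1, 1, 2, 0]
toy_accuracy = sum(y == p for y, p in zip(toy_labels, toy_preds)) / len(toy_labels)
print(round(toy_accuracy, 4), round(weighted_f1(toy_labels, toy_preds), 4))
```

In this example the per-class F1 scores are 0.5, 0.8, and about 0.667; with equal support per class, the weighted score is their plain average.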


## ▶️ 10. Model Training

With all the components prepared (model, tokenizer, datasets, arguments, metrics), we instantiate the `Trainer` object. This powerful class abstracts away the entire training and evaluation loop.

Calling `trainer.train()` kicks off the fine-tuning process. The trainer will iterate through the training data, update the model's weights, and periodically evaluate its performance on the validation set, printing the results after each epoch.

In [None]:
from transformers import Trainer

trainer = Trainer(model=model, args=training_args,
                  compute_metrics=compute_metrics,
                  train_dataset = emotion_encoded['train'],
                  eval_dataset = emotion_encoded['validation'],
                  tokenizer = tokenizer)

In [None]:
trainer.train()


## 🧐 11. Final Evaluation

After training is complete, we perform a final evaluation on the held-out test set, which the model has never seen before.

- **`trainer.predict()`**: Generates predictions for the test set.
- **`classification_report`**: Provides a detailed breakdown of performance for each class, including precision, recall, and F1-score.
- **Confusion Matrix**: We then plot a confusion matrix using `seaborn`. This heatmap visualizes the model's predictions, showing exactly which classes are being predicted correctly and which ones are being confused for others. The diagonal represents correct predictions.

In [None]:
preds_output = trainer.predict(emotion_encoded['test'])
preds_output.metrics

In [None]:
import numpy as np
y_pred = np.argmax(preds_output.predictions, axis=1)
y_true = emotion_encoded['test'][:]['label']

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_true, y_pred))

In [None]:
# plot confusion matrix
import seaborn as sns
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt

In [None]:
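# Fix the class order explicitly so the matrix rows/columns and the tick labels line up
label_ids = sorted(id2label)
class_labels = [id2label[i] for i in label_ids]

cm = confusion_matrix(y_true, y_pred, labels=label_ids)

plt.figure(figsize=(5,5))
sns.heatmap(cm, annot=True, xticklabels=class_labels, yticklabels=class_labels, fmt='d', cbar=False, cmap='Reds')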
plt.ylabel("Actual")
plt.xlabel("Predicted")
plt.show()
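
The numbers in a confusion matrix map directly onto per-class precision and recall. The sketch below works on a made-up 3x3 matrix with hypothetical class names; rows are actual classes and columns are predicted classes, matching the axis labels of the heatmap.

```python
# Made-up 3x3 confusion matrix: rows are actual classes, columns are predicted classes.
toy_cm = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 3, 7],
]
toy_names = ["anger", "fear", "sadness"]

def per_class_scores(cm, names):
    """Return {name: (precision, recall)} from a square confusion matrix."""
    scores = {}
    for i, name in enumerate(names):
        correct = cm[i][i]                              # diagonal entry
        actual_total = sum(cm[i])                       # row sum: all true samples of class i
        predicted_total = sum(row[i] for row in cm)     # column sum: all predictions of class i
        precision = correct / predicted_total if predicted_total else 0.0
        recall = correct / actual_total if actual_total else 0.0
        scores[name] = (precision, recall)
    return scores

for name, (precision, recall) in per_class_scores(toy_cm, toy_names).items():
    print(f"{name:8s} precision={precision:.2f} recall={recall:.2f}")
```

Large off-diagonal values in a row point to the classes that a given emotion is most often mistaken for.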

## 🚀 12. Inference and Deployment

The final step is to use our trained model and save it for future applications.

- **Inference Function**: We create a helper function `get_prediction` that takes raw text, tokenizes it, feeds it to the model, and returns a human-readable emotion label.
- **Saving the Model**: `trainer.save_model()` serializes the fine-tuned model's weights and configuration to a directory.
- **Using a `pipeline`**: For easy deployment, we load the saved model into a `text-classification` `pipeline`. This is the simplest way to use a Hugging Face model for inference, as it handles all the preprocessing and post-processing steps under the hood.

In [None]:
text = "I am super happy today. I got it done. Finally!!"

def get_prediction(text):
    input_encoded = tokenizer(text, return_tensors='pt').to(device)

    with torch.no_grad():
        outputs = model(**input_encoded)

    logits = outputs.logits

    pred = torch.argmax(logits, dim=1).item()
    return id2label[pred]

get_prediction(text)
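
`get_prediction` returns only the top label. If a confidence value is also wanted, the logits can be passed through a softmax first. The sketch below shows the idea in plain Python on made-up logits and a made-up label mapping; it does not call the model.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1 (numerically stable)."""
    shift = max(logits)                          # subtracting the max avoids overflow
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits and label mapping for a single input.
toy_logits = [0.2, 2.5, -1.0]
toy_id2label = {0: "sadness", 1: "Normal", 2: "anger"}

probs = softmax(toy_logits)
best = max(range(len(probs)), key=probs.__getitem__)   # same index argmax would pick
print(toy_id2label[best], round(probs[best], 3))
```

Softmax does not change which class wins; it only rescales the logits into values that can be read as probabilities.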

In [None]:
trainer.save_model("Modern-bert-uncased-sentiment-model")


In [None]:
# Use a pipeline for prediction
from transformers import pipeline

classifier = pipeline('text-classification', model='Modern-bert-uncased-sentiment-model')

classifier([text, 'hello, how are you?', "love you", "i am feeling low"," i love you"])
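
For a list of inputs, the `text-classification` pipeline returns one dictionary per text with a `label` and a `score`. A small post-processing step can flag low-confidence predictions. The results, the threshold, and the `confident_labels` helper below are made up for illustration.

```python
# Made-up results in the shape the text-classification pipeline returns for a list of inputs.
toy_results = [
    {"label": "Normal", "score": 0.97},
    {"label": "love", "score": 0.55},
    {"label": "sadness", "score": 0.91},
]

def confident_labels(results, threshold=0.8, fallback="uncertain"):
    """Keep the predicted label only when its score clears the threshold."""
    return [r["label"] if r["score"] >= threshold else fallback for r in results]

print(confident_labels(toy_results))   # ['Normal', 'uncertain', 'sadness']
```

A suitable threshold depends on the application and is best chosen by looking at scores on the validation set.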