## Step 1: **Load the Dataset**
We’ll start by loading the dataset from the CSV file (`labeled_combined_sms_v2.csv`) using `pandas`.

### Explanation:
- The dataset contains SMS messages (`Message Content`) with their sender, a list of category labels (`category_labels`), and `is_important`/`is_spam` flags.
- We’ll use `pandas` to load the data into a DataFrame for further processing.

In [1]:
import pandas as pd

In [2]:
# Load the dataset
df = pd.read_csv('../Data/labeled_combined_sms_v2.csv')

In [3]:
# Display the first few rows of the dataset
df.head()

Unnamed: 0,Sender,Message Content,category_labels,is_important,is_spam,transactions,dates
0,samba.,تم خصم مبلغ ٥٠٫٠٠ من حساب ******٤٥٢١ في ٠٥-٠٥-...,"['Money/Financial', 'Expense']",1,0,"{'amount': '٥٠٫٠٠', 'type': 'expense', 'accoun...",
1,607941,Your WhatsApp code is 614-968 but you can simp...,['Other'],0,0,,
2,neqaty,"Dear Member, you have not redeemed any Neqaty ...","['Promotion', 'Advertising']",0,1,,
3,neqaty,عزيزي العميل، لقد مر 17 شهر و لم تقم باي عملية...,"['Notification', 'Promotion']",0,1,,
4,606006,"BIG SAVINGS! Now get 200 Mobily Minutes, 200 M...","['Promotion', 'Advertising']",0,1,,


In [4]:
# Check the columns and data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6996 entries, 0 to 6995
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Sender           6996 non-null   object
 1   Message Content  6996 non-null   object
 2   category_labels  6996 non-null   object
 3   is_important     6996 non-null   int64 
 4   is_spam          6996 non-null   int64 
 5   transactions     3053 non-null   object
 6   dates            484 non-null    object
dtypes: int64(2), object(5)
memory usage: 382.7+ KB


In [6]:
# Check the distribution of categories
df['category_labels'].value_counts()

category_labels
['Money/Financial', 'Expense']                       2219
['Promotion', 'Advertising']                         1240
['Other']                                             851
['Notification', 'Other']                             361
['Money/Financial', 'Expense', 'Notification']        346
                                                     ... 
['Money/Financial', 'Notification', 'Government']       1
['Money/Financial', 'Donation']                         1
['Promotion', 'Investment']                             1
['Money/Financial', 'Transaction']                      1
['Promotion', 'Health', 'Education']                    1
Name: count, Length: 116, dtype: int64
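
Because each `category_labels` value is a stringified list, `value_counts()` counts whole label combinations (116 of them) rather than individual labels. A minimal sketch of counting each label on its own is shown below; the five sample strings are hypothetical stand-ins for the real column, and `count_individual_labels` is an illustrative helper, not part of the notebook.

```python
import ast
from collections import Counter

# Hypothetical sample of raw `category_labels` strings, in the same
# stringified-list form that the CSV stores them.
raw_labels = [
    "['Money/Financial', 'Expense']",
    "['Promotion', 'Advertising']",
    "['Other']",
    "['Notification', 'Other']",
    "['Money/Financial', 'Expense', 'Notification']",
]


def count_individual_labels(raw):
    """Parse each stringified list and count every label separately."""
    counts = Counter()
    for item in raw:
        counts.update(ast.literal_eval(item))
    return counts


label_counts = count_individual_labels(raw_labels)
print(label_counts.most_common())
```

On the real DataFrame, the same idea would apply to `df['category_labels']` in place of `raw_labels`.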

---

## Step 2: **Preprocess the Data**
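We’ll preprocess the data to prepare it for training. This includes:
- Parsing the `category_labels` strings into lists and keeping the first label as a single `Category`.
- Mapping categories to numerical labels.
- Splitting the data into training and testing sets.

### Explanation:
- Each `category_labels` value is a stringified list, and there are 116 distinct label combinations, so we reduce every message to its first label to get a single-label classification problem with 9 classes.
- Text classification requires numerical labels, so we’ll convert the new `Category` column to integer codes.
- We’ll split the data into training and testing sets (80% training, 20% testing), stratified by `Category`.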
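
As a rough illustration of the label mapping, the sketch below shows how `astype('category').cat.codes` assigns integers: pandas sorts the distinct category names, and each code is an index into that sorted list. The names `id2label` and `label2id` are introduced here for illustration only; the notebook itself does not define them.

```python
import pandas as pd

# The nine category names, listed in the order they first appear in the data.
categories = pd.Series(
    ["Money/Financial", "Other", "Promotion", "Notification", "Advertising",
     "Government", "Education", "Health", "Travel"]
)

# astype('category') sorts the distinct values; cat.codes index into that list.
as_cat = categories.astype("category")
codes = as_cat.cat.codes.tolist()

# Hypothetical lookup tables for converting between ids and names.
id2label = dict(enumerate(as_cat.cat.categories))
label2id = {name: idx for idx, name in id2label.items()}

print(id2label)
```

Keeping such a mapping around makes it possible to turn a predicted id back into a category name without relying on the order in which categories happen to appear.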

In [7]:
uniq_cat = df['category_labels'].unique()

for i in uniq_cat:
    print(i)

['Money/Financial', 'Expense']
['Other']
['Promotion', 'Advertising']
['Notification', 'Promotion']
['Money/Financial', 'Income']
['Notification']
['Notification', 'Health']
['Advertising']
['Notification', 'Other']
['Money/Financial', 'Expense', 'Notification']
['Notification', 'Promotion', 'Advertising']
['Notification', 'Advertising']
['Advertising', 'Promotion']
['Money/Financial', 'Expense', 'Other']
['Notification', 'Education']
['Promotion', 'Education']
['Government', 'Notification']
['Education']
['Notification', 'Education', 'Test/Exam']
['Advertising', 'Other']
['Health', 'Promotion', 'Advertising']
['Notification', 'Test/Exam']
['Health']
['Promotion', 'Other']
['Education', 'Notification']
['Notification', 'Security']
['Health', 'Promotion']
['Notification', 'Appointment']
['Notification', 'Event']
['Education', 'Promotion']
['Notification', 'Government']
['Notification', 'Test/Exam', 'Education']
['Health', 'Other']
['Education', 'Advertising']
['Government', 'Other']
['P

In [10]:
import ast
df['category_labels'] = df['category_labels'].apply(ast.literal_eval)
df['Category'] = df['category_labels'].apply(lambda x: x[0])

In [11]:
df['Category'].unique()

array(['Money/Financial', 'Other', 'Promotion', 'Notification',
       'Advertising', 'Government', 'Education', 'Health', 'Travel'],
      dtype=object)

In [12]:
from sklearn.model_selection import train_test_split

In [14]:
# Map categories to numerical labels
# (codes follow the sorted order of the category names, not order of appearance)
df['label'] = df['Category'].astype('category').cat.codes

In [15]:
# Split the data into training and testing sets
train_df, test_df = train_test_split(
    df, 
    test_size=0.2,  # 20% of the data for testing
    random_state=42,  # For reproducibility
    shuffle=True,  # Shuffle the data before splitting
    stratify=df['Category']  # Preserve the class distribution
)

# Check the distribution of categories in the training and testing sets
print("Training set class distribution:\n", train_df['Category'].value_counts())
print("Testing set class distribution:\n", test_df['Category'].value_counts())

Training set class distribution:
 Category
Money/Financial    2530
Notification       1045
Promotion          1036
Other               685
Government          137
Advertising          78
Health               51
Education            32
Travel                2
Name: count, dtype: int64
Testing set class distribution:
 Category
Money/Financial    633
Notification       262
Promotion          259
Other              171
Government          35
Advertising         19
Health              13
Education            8
Name: count, dtype: int64
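
To check that stratification worked, we can compute the share of each class that ended up in the test set. The sketch below uses the counts printed above; `held_out_fraction` is a hypothetical helper name.

```python
# Class counts copied from the printed training and testing distributions.
train_counts = {
    "Money/Financial": 2530, "Notification": 1045, "Promotion": 1036,
    "Other": 685, "Government": 137, "Advertising": 78,
    "Health": 51, "Education": 32, "Travel": 2,
}
test_counts = {
    "Money/Financial": 633, "Notification": 262, "Promotion": 259,
    "Other": 171, "Government": 35, "Advertising": 19,
    "Health": 13, "Education": 8,
}


def held_out_fraction(train, test):
    """Return, per class, the fraction of its rows that landed in the test set."""
    fractions = {}
    for name, n_train in train.items():
        n_test = test.get(name, 0)  # a class may be absent from the test set
        fractions[name] = n_test / (n_train + n_test)
    return fractions


fractions = held_out_fraction(train_counts, test_counts)
for name, value in fractions.items():
    print(f"{name:16s} {value:.3f}")
```

Every class sits close to the requested 20% except `Travel`: it has only two messages, both of which went to the training set, so the test set contains 8 of the 9 classes.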


In [16]:
# Display the shapes of the training and testing sets
print("Training data shape:", train_df.shape)
print("Testing data shape:", test_df.shape)

Training data shape: (5596, 9)
Testing data shape: (1400, 9)


---

## Step 3: **Tokenize the Data**
We’ll tokenize the text data using the `AutoTokenizer` from Hugging Face’s `transformers` library.

### Explanation:
- Tokenization converts text into numerical input that the model can understand.
- We’ll use the `AutoTokenizer` for ModernBERT and pad or truncate every sequence to the same length (52 tokens).
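
To make the padding and truncation step concrete, here is a simplified sketch of what fixing the sequence length means. It is not the tokenizer's implementation: `pad_or_truncate` and the pad id `0` are assumptions for illustration, and the real tokenizer also keeps its special start and end tokens when truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut a token id list to max_length, or right-pad it with pad_id."""
    ids = list(token_ids)[:max_length]        # truncation
    n_pad = max_length - len(ids)             # how many pad tokens are needed
    return {
        "input_ids": ids + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }


short = pad_or_truncate([5, 6, 7], max_length=5)
long = pad_or_truncate(range(10), max_length=4)
print(short)
print(long)
```

A sequence that fills the whole window, like the tokenized example in this step, ends up with an attention mask made entirely of ones.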

In [18]:
from transformers import AutoTokenizer

In [19]:
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")

In [20]:
# Tokenize the text data
def tokenize_function(examples):
    return tokenizer(examples["Message Content"], padding="max_length", truncation=True, max_length=52)

In [21]:
# Apply tokenization to the training and testing datasets
train_dataset = train_df.apply(tokenize_function, axis=1)
test_dataset = test_df.apply(tokenize_function, axis=1)

In [22]:
# Check the tokenized output for the row with index label 0
# (the split is shuffled, so this is not necessarily the first training row)
print("Tokenized example:", train_dataset.loc[0])

Tokenized example: {'input_ids': [50281, 8181, 5843, 45232, 30265, 5843, 13504, 13621, 4467, 50011, 209, 149, 100, 149, 243, 149, 106, 149, 243, 149, 243, 31461, 40913, 14585, 47931, 209, 27591, 149, 99, 149, 100, 149, 97, 149, 96, 37710, 209, 149, 243, 149, 100, 14, 149, 243, 149, 100, 14, 149, 97, 149, 243, 50282], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


---

## Step 4: **Test the Model on a Single Row**
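Before training, we’ll run the model on a single row as a smoke test. The classification head is newly initialized at this point, so the predicted class is essentially arbitrary; we only check that the forward pass works.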

### Explanation:
- We’ll load the ModernBERT model and pass a single tokenized input to it.
- This helps verify that the model and tokenizer are working correctly.
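
The model returns one logit per class for each input. A small sketch of turning logits into probabilities and a predicted class id follows; the logit values are made up for illustration and do not come from the model.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Hypothetical logits for one message and nine classes, shape (1, 9).
logits = np.array([[0.1, -0.3, 0.2, 0.0, 1.5, 0.4, -0.2, 0.3, -1.0]])

probs = softmax(logits)
pred_id = int(probs.argmax(axis=-1)[0])
print("Predicted class id:", pred_id)
print("Confidence:", round(float(probs[0, pred_id]), 3))
```

Taking `argmax` of the logits directly gives the same class id; the softmax is only needed when a probability is wanted.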

In [23]:
from transformers import AutoModelForSequenceClassification
import torch

In [24]:
# Load the model
model = AutoModelForSequenceClassification.from_pretrained("answerdotai/ModernBERT-base", num_labels=len(df['Category'].unique()))

Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at answerdotai/ModernBERT-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [25]:
# Test on a single row
sample_input = tokenizer(train_df.iloc[0]["Message Content"], return_tensors="pt", padding="max_length", truncation=True, max_length=52)
with torch.no_grad():
    outputs = model(**sample_input)

In [26]:
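# Get the predicted class
predicted_class_id = outputs.logits.argmax().item()
# Decode with the same ordering that cat.codes used (sorted category names);
# df['Category'].unique() is in order of appearance and would not match the label ids.
predicted_class = df['Category'].astype('category').cat.categories[predicted_class_id]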

In [27]:
print("Predicted class:", predicted_class)
print("Actual class:", train_df.iloc[0]["Category"])

Predicted class: Health
Actual class: Notification


---

## Step 5: **Train the Model**
We’ll train the model using the `Trainer` API from Hugging Face.

### Explanation:
- The `Trainer` API simplifies the training process by handling training loops, evaluation, and logging.
- We’ll define training arguments (e.g., learning rate, batch size) and train the model on the training dataset.

In [28]:
from transformers import Trainer, TrainingArguments
from datasets import Dataset

In [29]:
# Convert DataFrame to Hugging Face Dataset
train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)

In [30]:
# Tokenize the datasets
train_dataset = train_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)


In [31]:
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",        # Evaluate at the end of each epoch (named evaluation_strategy in older transformers releases)
    logging_strategy="epoch",     # Log metrics at the end of each epoch
    save_strategy="epoch",        # Save the model at the end of each epoch
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=20,
    weight_decay=0.01,
    logging_dir="./logs",         # Directory for logs
    report_to="all",              # Log to all available trackers (e.g., TensorBoard, W&B)
    load_best_model_at_end=True,  # Required for EarlyStoppingCallback
    metric_for_best_model="eval_loss",  # Use validation loss to determine the best model
    greater_is_better=False,      # Lower validation loss is better
)



In [32]:
from transformers import EarlyStoppingCallback

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # Stop if validation loss doesn't improve for 3 epochs
)

In [33]:
# Train the model
trainer.train()



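| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 0.6108 | 0.444736 |
| 2 | 0.3649 | 0.387091 |
| 3 | 0.2544 | 0.371512 |
| 4 | 0.1823 | 0.395151 |
| 5 | 0.1167 | 0.458683 |
| 6 | 0.0672 | 0.587531 |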


TrainOutput(global_step=2100, training_loss=0.26606875374203637, metrics={'train_runtime': 625.9922, 'train_samples_per_second': 178.788, 'train_steps_per_second': 11.182, 'total_flos': 1162061943037632.0, 'train_loss': 0.26606875374203637, 'epoch': 6.0})
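
Training stopped after 6 of the 20 configured epochs. The sketch below replays the recorded validation losses through a simplified patience rule to show why, and reproduces the step count. It is an illustration of the idea rather than the `EarlyStoppingCallback` code, and the step arithmetic assumes a single device with no gradient accumulation.

```python
import math

# Validation losses per epoch, copied from the training log.
val_losses = [0.444736, 0.387091, 0.371512, 0.395151, 0.458683, 0.587531]


def simulate_early_stopping(losses, patience):
    """Return (best_epoch, stop_epoch) for a lower-is-better metric."""
    best_loss = math.inf
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, len(losses)


best_epoch, stop_epoch = simulate_early_stopping(val_losses, patience=3)

# 5596 training rows in batches of 16 (the last batch is partial).
steps_per_epoch = math.ceil(5596 / 16)
total_steps = steps_per_epoch * stop_epoch

print("Best epoch:", best_epoch)
print("Stopped after epoch:", stop_epoch)
print("Total optimizer steps:", total_steps)
```

The best validation loss occurs at epoch 3, and three epochs without improvement end training at epoch 6, which matches `global_step=2100`.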

---

## Step 6: **Evaluate the Model**
After training, we’ll evaluate the model on the test set.

### Explanation:
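- `trainer.evaluate()` reports the loss on the test dataset; because no `compute_metrics` function was passed to the `Trainer`, it does not return accuracy.
- Accuracy, precision, recall, and F1 score are computed separately from the model’s predictions with scikit-learn.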
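
One way to get such metrics during training is a `compute_metrics` function, which the `Trainer` accepts as an argument and calls with the logits and true labels. The sketch below is one possible version written with NumPy only; the metric names `accuracy` and `macro_f1` and the toy inputs are assumptions for illustration.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Compute accuracy and macro F1 from (logits, labels)."""
    logits, labels = eval_pred
    labels = np.asarray(labels)
    preds = np.argmax(logits, axis=-1)

    accuracy = float((preds == labels).mean())

    # Macro F1: average the per-class F1 over the classes present in labels.
    f1_scores = []
    for cls in np.unique(labels):
        tp = np.sum((preds == cls) & (labels == cls))
        fp = np.sum((preds == cls) & (labels != cls))
        fn = np.sum((preds != cls) & (labels == cls))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)

    return {"accuracy": accuracy, "macro_f1": float(np.mean(f1_scores))}


# Toy check with four examples and three classes (made-up values).
toy_logits = np.array([[2.0, 0.0, 0.0],
                       [0.0, 3.0, 0.0],
                       [0.0, 0.0, 1.0],
                       [1.0, 0.0, 0.0]])
toy_labels = [0, 1, 2, 1]
metrics = compute_metrics((toy_logits, toy_labels))
print(metrics)
```

Passing it as `compute_metrics=compute_metrics` when constructing the `Trainer` should add these values to the per-epoch evaluation output. Macro F1 is worth tracking here because the classes are heavily imbalanced.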

In [34]:
# Evaluate the model
results = trainer.evaluate()

print("Evaluation results:", results)

Evaluation results: {'eval_loss': 0.3715115785598755, 'eval_runtime': 5.9726, 'eval_samples_per_second': 234.403, 'eval_steps_per_second': 29.3, 'epoch': 6.0}


In [36]:
import numpy as np
from sklearn.metrics import accuracy_score, classification_report

# Get predictions for the test dataset
predictions = trainer.predict(test_dataset)

# Extract predicted labels
predicted_labels = np.argmax(predictions.predictions, axis=-1)

# True labels
true_labels = test_dataset['label']

# Calculate accuracy
accuracy = accuracy_score(true_labels, predicted_labels)
print("Accuracy:", accuracy)

# Get unique classes in true_labels
unique_classes = np.unique(true_labels)

# Ensure target_names matches the unique classes
target_names = [f"Class {i}" for i in unique_classes]  # Replace with actual class names if available

# Generate classification report
class_report = classification_report(
    true_labels, 
    predicted_labels, 
    target_names=target_names, 
    labels=unique_classes  # Ensure alignment between labels and target_names
)
print("Classification Report:\n", class_report)

Accuracy: 0.8878571428571429
Classification Report:
               precision    recall  f1-score   support

     Class 0       0.67      0.11      0.18        19
     Class 1       0.00      0.00      0.00         8
     Class 2       0.53      0.71      0.61        35
     Class 3       0.62      0.38      0.48        13
     Class 4       0.98      0.98      0.98       633
     Class 5       0.83      0.83      0.83       262
     Class 6       0.79      0.84      0.81       171
     Class 7       0.88      0.88      0.88       259

    accuracy                           0.89      1400
   macro avg       0.66      0.59      0.60      1400
weighted avg       0.89      0.89      0.88      1400
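
The report rows are labeled `Class 0` to `Class 7`. Since the label ids follow the sorted category names, they can be mapped back to names. The sketch below does this and cross-checks the result against the test-set class counts printed in Step 2; `report_support` is simply the support column copied from the report.

```python
# Category names from Step 2, sorted the same way the label codes were assigned.
category_names = sorted([
    "Money/Financial", "Other", "Promotion", "Notification", "Advertising",
    "Government", "Education", "Health", "Travel",
])

# Support per class id, copied from the classification report.
report_support = {0: 19, 1: 8, 2: 35, 3: 13, 4: 633, 5: 262, 6: 171, 7: 259}

# Test-set counts per category, copied from the split output in Step 2.
test_counts = {
    "Money/Financial": 633, "Notification": 262, "Promotion": 259,
    "Other": 171, "Government": 35, "Advertising": 19,
    "Health": 13, "Education": 8,
}

# Names for the classes that actually occur in the test set.
target_names = [category_names[i] for i in sorted(report_support)]
named_support = {category_names[i]: n for i, n in report_support.items()}

print(target_names)
print("Matches test-set counts:", named_support == test_counts)
```

Under this mapping, the two weakest rows (`Class 0` and `Class 1`) correspond to `Advertising` and `Education`, which have only 19 and 8 test examples. In the notebook, a `target_names` list built this way could replace the `Class {i}` placeholders.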



---

## Step 7: **Save the Model**
We’ll save the fine-tuned model for future use.

### Explanation:
- Saving the model allows us to reuse it without retraining.
- We’ll save both the model and tokenizer to a directory.
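
The model in this notebook is created without category names in its config, so a reloaded model only knows integer ids. One option, sketched below, is to store the id-to-name mapping next to the model. The file name `label_map.json` and both helper functions are hypothetical, and the example writes to a temporary directory rather than `../Models/`.

```python
import json
import tempfile
from pathlib import Path

# Label ids follow the sorted category names (see Step 2).
id2label = {
    0: "Advertising", 1: "Education", 2: "Government", 3: "Health",
    4: "Money/Financial", 5: "Notification", 6: "Other", 7: "Promotion",
    8: "Travel",
}


def save_label_map(mapping, directory, filename="label_map.json"):
    """Write the id-to-name mapping as JSON and return the file path."""
    path = Path(directory) / filename
    path.write_text(json.dumps(mapping, ensure_ascii=False, indent=2), encoding="utf-8")
    return path


def load_label_map(path):
    """Read the mapping back; JSON keys are strings, so convert them to int."""
    raw = json.loads(Path(path).read_text(encoding="utf-8"))
    return {int(key): name for key, name in raw.items()}


with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_label_map(id2label, tmp_dir)
    restored = load_label_map(saved_path)

print(restored[4])
```

An alternative is to pass `id2label` and `label2id` to `from_pretrained` when the model is first created, so the names are stored in the saved config itself.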

In [2]:
# Save the model and tokenizer
model.save_pretrained("../Models/fine-tuned-modernbert_v2")
tokenizer.save_pretrained("../Models/fine-tuned-modernbert_v2")

---

## Step 8: **Load the Fine-Tuned Model**
We can reload the fine-tuned model later from the directory it was saved to.

### Explanation:
- We’ll load the saved model and tokenizer for inference or further training.

In [39]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("../Models/fine-tuned-modernbert_v2")
model = AutoModelForSequenceClassification.from_pretrained("../Models/fine-tuned-modernbert_v2")

---