## Step 1: **Load the Dataset**
We’ll start by loading the dataset from a CSV file (`../Data/rr.csv`, the path used in the code) using `pandas`.

### Explanation:
- The dataset contains SMS messages and their corresponding categories (labels).
- We’ll use `pandas` to load the data into a DataFrame for further processing.

In [1]:
import pandas as pd

In [2]:
# Load the dataset
df = pd.read_csv('../Data/rr.csv')

In [3]:
# Display the first few rows of the dataset
df.head()

| | Sender | Message Content | category_labels | is_important | is_spam | transactions | dates | category |
|---|---|---|---|---|---|---|---|---|
| 0 | samba. | تم خصم مبلغ ٥٠٫٠٠ من حساب ******٤٥٢١ في ٠٥-٠٥-... | ['Money/Financial', 'Expense'] | 1 | 0 | {'amount': '٥٠٫٠٠', 'type': 'expense', 'accoun... | | Financial |
| 1 | 607941 | Your WhatsApp code is 614-968 but you can simp... | ['Other'] | 0 | 0 | | | Other |
| 2 | neqaty | Dear Member, you have not redeemed any Neqaty ... | ['Promotion', 'Advertising'] | 0 | 1 | | | Promotional |
| 3 | neqaty | عزيزي العميل، لقد مر 17 شهر و لم تقم باي عملية... | ['Notification', 'Promotion'] | 0 | 1 | | | Promotional |
| 4 | 606006 | BIG SAVINGS! Now get 200 Mobily Minutes, 200 M... | ['Promotion', 'Advertising'] | 0 | 1 | | | Promotional |


In [4]:
# Check the columns and data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6996 entries, 0 to 6995
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Sender           6996 non-null   object
 1   Message Content  6996 non-null   object
 2   category_labels  6996 non-null   object
 3   is_important     6996 non-null   int64 
 4   is_spam          6996 non-null   int64 
 5   transactions     3053 non-null   object
 6   dates            484 non-null    object
 7   category         6996 non-null   object
dtypes: int64(2), object(6)
memory usage: 437.4+ KB


In [5]:
# Drop the multi-label `category_labels` column; the single-label `category` column is the target
df = df.drop(columns='category_labels')
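
The class distribution itself can be inspected with `value_counts`. The following is a minimal sketch on a toy DataFrame; the rows are hypothetical and only the column name `category` matches the real data.

```python
import pandas as pd

# Toy stand-in for the real DataFrame (hypothetical rows, same column name).
toy_df = pd.DataFrame({
    "category": ["Financial", "Financial", "Promotional", "Other", "Financial"]
})

# Absolute counts per category, most frequent first.
category_counts = toy_df["category"].value_counts()

# Relative shares make class imbalance easier to spot.
category_shares = toy_df["category"].value_counts(normalize=True)

print(category_counts)
print(category_shares.round(2))
```

On the real data, the same two calls on `df['category']` would show how skewed the nine categories are before any splitting is done.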

---

## Step 2: **Preprocess the Data**
We’ll preprocess the data to prepare it for training. This includes:
- Mapping categories to numerical labels.
- Splitting the data into training and testing sets.

### Explanation:
- Text classification requires numerical labels, so we’ll convert the `category` column to integer codes and store them in a new `label` column.
- We’ll split the data into training and testing sets (e.g., 80% training, 20% testing).

In [6]:
df['category'].unique()

array(['Financial', 'Other', 'Promotional', 'Telecommunications',
       'Health', 'Services or Stores', 'Education', 'Governmental',
       'Travel'], dtype=object)

In [7]:
from sklearn.model_selection import train_test_split

In [8]:
# Map categories to numerical labels
df['label'] = df['category'].astype('category').cat.codes
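
It helps to know which integer ends up meaning which category. The sketch below, on a toy Series with hypothetical values, shows that `cat.codes` follows the sorted order of `cat.categories`, which generally differs from the order returned by `unique()`. The names `id2label` and `label2id` are illustrative.

```python
import pandas as pd

# Toy stand-in for df['category'] (hypothetical values).
toy = pd.Series(["Financial", "Other", "Promotional", "Education", "Financial"])

as_category = toy.astype("category")
codes = as_category.cat.codes                  # one integer label per row
categories = list(as_category.cat.categories)  # category names in sorted order

# Code i corresponds to categories[i]; this is not the order of .unique().
id2label = dict(enumerate(categories))
label2id = {name: idx for idx, name in id2label.items()}

print("id2label:", id2label)
print("unique() order:", list(toy.unique()))
```

Keeping `id2label` around makes it possible to translate predicted class IDs back into category names later without relying on row order.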

In [9]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
train_df, test_df = train_test_split(
    df[['Message Content', 'label']],  # Keep the message text and its numeric label
    test_size=0.2,  # 20% of the data for testing
    random_state=42,  # For reproducibility
    shuffle=True,  # Shuffle the data before splitting
    stratify=df['category']  # Preserve the class distribution
)

# Check the distribution of categories in the training and testing sets
print("Training set class distribution:\n", train_df['label'].value_counts())
print("Testing set class distribution:\n", test_df['label'].value_counts())

Training set class distribution:
 label
1    2531
5    1417
4     685
7     484
2     201
3     137
0     129
6       7
8       5
Name: count, dtype: int64
Testing set class distribution:
 label
1    634
5    355
4    171
7    121
2     50
3     34
0     32
6      2
8      1
Name: count, dtype: int64
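
As a quick sanity check on `stratify`, the shares of each label in the two splits can be compared directly. This sketch simply reuses the counts printed above; nothing is recomputed from the data.

```python
# Class counts copied from the printed output above (label -> count).
train_counts = {1: 2531, 5: 1417, 4: 685, 7: 484, 2: 201, 3: 137, 0: 129, 6: 7, 8: 5}
test_counts = {1: 634, 5: 355, 4: 171, 7: 121, 2: 50, 3: 34, 0: 32, 6: 2, 8: 1}

train_total = sum(train_counts.values())
test_total = sum(test_counts.values())

max_gap = 0.0
for label in sorted(train_counts):
    train_share = train_counts[label] / train_total
    test_share = test_counts[label] / test_total
    max_gap = max(max_gap, abs(train_share - test_share))
    print(f"label {label}: train {train_share:.4f}  test {test_share:.4f}")

print(f"largest gap between shares: {max_gap:.4f}")
```

The shares line up closely, but labels 6 and 8 have only 2 and 1 test samples, so any per-class metric for them will be very noisy.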


In [10]:
# Display the shapes of the training and testing sets
print("Training data shape:", train_df.shape)
print("Testing data shape:", test_df.shape)

Training data shape: (5596, 2)
Testing data shape: (1400, 2)


In [11]:
train_df

| | Message Content | label |
|---|---|---|
| 3093 | للدخول 1361 الرجاء إدخال الرمز | 4 |
| 1643 | عميلنا العزيز\nشكرا لاشتراكك، رقم برنامج الخدم... | 5 |
| 6382 | يوجد لديك عقد عمل جديد بانتظار مراجعتك واعتماد... | 7 |
| 6609 | عزيزي العميل: يسعدنا ان نقدم لكم أحدث تصاميم ب... | 5 |
| 4499 | Mobily Missed Call Notification Service. +9665... | 5 |
| ... | ... | ... |
| 2892 | ضيوف الرحمن \r\nخدمتكم شرف .. وأمنكم واجبنا \r... | 2 |
| 4470 | مشتريات دولية عبر الانترنت\nمن بطاقة:9394*\nرق... | 1 |
| 6325 | شراء-POS\r\nبـ49 SAR\r\nمن مطعم مشوي\r\nمدى-اب... | 1 |
| 6813 | تم تنشيط المستفيد:شركة لدن للاستثمار | 7 |


---

## Step 3: **Tokenize the Data**
We’ll tokenize the text data using the `AutoTokenizer` from Hugging Face’s `transformers` library.

### Explanation:
- Tokenization converts text into numerical input that the model can understand.
- We’ll use the `AutoTokenizer` for ModernBERT and pad or truncate every sequence to a fixed length of 52 tokens.

In [13]:
from transformers import AutoTokenizer

In [14]:
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")

In [15]:
# Tokenize the text data
def tokenize_function(examples):
    return tokenizer(examples["Message Content"], padding="max_length", truncation=True, max_length=52)

In [16]:
# Apply tokenization to the training and testing datasets
train_dataset = train_df.apply(tokenize_function, axis=1)
test_dataset = test_df.apply(tokenize_function, axis=1)

In [17]:
# Check the tokenized output for the row with index label 0
# (label-based lookup on the Series, not the first row of the shuffled split)
print("Tokenized example:", train_dataset[0])

Tokenized example: {'input_ids': [50281, 4467, 4467, 9211, 28337, 41147, 14821, 18, 9445, 6900, 23072, 10714, 96, 48914, 9211, 28337, 7427, 9445, 6900, 5843, 29620, 50282, 50283, 50283, 50283, 50283, 50283, 50283, 50283, 50283, 50283, 50283, 50283, 50283, 50283, 50283, 50283, 50283, 50283, 50283, 50283, 50283, 50283, 50283, 50283, 50283, 50283, 50283, 50283, 50283, 50283, 50283], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
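
The trailing run of `50283` values and the zeros in `attention_mask` come from padding to `max_length`. The sketch below mimics that behaviour in plain Python. The helper `pad_or_truncate` is hypothetical, and the pad ID `50283` is inferred from the output above rather than read from the tokenizer. Unlike the real tokenizer, this simplified version does not re-append the closing special token after truncation.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]          # drop anything beyond max_length
    n_pad = max_length - len(kept)         # how many pad tokens are needed
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


PAD_ID = 50283  # assumption: taken from the padded tail of the printed example

short_ids, short_mask = pad_or_truncate([50281, 4467, 9211, 50282], 8, PAD_ID)
long_ids, long_mask = pad_or_truncate(list(range(100)), 52, PAD_ID)

print(short_ids)
print(short_mask)
print(len(long_ids), sum(long_mask))
```

The attention mask is what lets the model ignore the padded positions, so messages shorter than 52 tokens are not affected by the padding itself.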


---

## Step 4: **Test the Model on a Single Row**
Before training, we’ll test the model on a single row to ensure it works.

### Explanation:
- We’ll load the ModernBERT model and pass a single tokenized input to it.
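- This verifies that the model and tokenizer work together. The classification head is newly initialized at this point, so the predicted class is arbitrary until the model is trained.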

In [18]:
from transformers import AutoModelForSequenceClassification
import torch

In [21]:
# Load the model
model = AutoModelForSequenceClassification.from_pretrained("answerdotai/ModernBERT-base", num_labels=len(df['label'].unique()))

Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at answerdotai/ModernBERT-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [23]:
# Test on a single row
sample_input = tokenizer(train_df.iloc[0]["Message Content"], return_tensors="pt", padding="max_length", truncation=True, max_length=52)
with torch.no_grad():
    outputs = model(**sample_input)

In [24]:
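# Get the predicted class
# cat.codes assigns IDs in the sorted order of cat.categories, not in the order of unique()
id2label = dict(enumerate(df['category'].astype('category').cat.categories))
predicted_class_id = outputs.logits.argmax().item()
predicted_class = id2label[predicted_class_id]

In [26]:
print("Predicted class:", predicted_class)
print("Actual class:", id2label[int(train_df.iloc[0]["label"])])

Predicted class: Education
Actual class: Other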
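
Taking the `argmax` of the logits hides how confident the model is. A softmax turns the logits into probabilities. The sketch below uses NumPy with hypothetical logit values; the `id2label` mapping follows the sorted category order listed at the end of this notebook.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)  # subtract the max to avoid overflow
    exp = np.exp(shifted)
    return exp / exp.sum()


id2label = {
    0: "Education", 1: "Financial", 2: "Governmental", 3: "Health", 4: "Other",
    5: "Promotional", 6: "Services or Stores", 7: "Telecommunications", 8: "Travel",
}

# Hypothetical logits for one message (one value per class).
logits = np.array([0.1, 2.3, -0.5, 0.0, 1.1, 0.7, -1.2, 0.3, -0.9])

probs = softmax(logits)
pred_id = int(np.argmax(probs))

print("Predicted:", id2label[pred_id], f"(p = {probs[pred_id]:.2f})")
```

With an untrained classification head the probabilities are close to arbitrary, so this is mostly useful after training.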


---

## Step 5: **Train the Model**
We’ll train the model using the `Trainer` API from Hugging Face.

### Explanation:
- The `Trainer` API simplifies the training process by handling training loops, evaluation, and logging.
- We’ll define training arguments (e.g., learning rate, batch size) and train on the training dataset for up to 20 epochs, stopping early if the validation loss fails to improve for 3 consecutive epochs.

In [27]:
from transformers import Trainer, TrainingArguments
from datasets import Dataset

In [28]:
# Convert DataFrame to Hugging Face Dataset
train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)

In [29]:
# Tokenize the datasets
train_dataset = train_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/5848 [00:00<?, ? examples/s]

Map:   0%|          | 0/1400 [00:00<?, ? examples/s]

In [30]:
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # Evaluate at the end of each epoch (named eval_strategy in newer transformers releases)
    logging_strategy="epoch",     # Log metrics at the end of each epoch
    save_strategy="epoch",        # Save the model at the end of each epoch
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=20,
    weight_decay=0.01,
    logging_dir="./logs",         # Directory for logs
    report_to="all",              # Log to all available trackers (e.g., TensorBoard, W&B)
    load_best_model_at_end=True,  # Required for EarlyStoppingCallback
    metric_for_best_model="eval_loss",  # Use validation loss to determine the best model
    greater_is_better=False,      # Lower validation loss is better
)



In [31]:
from transformers import EarlyStoppingCallback

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,  # The test split doubles as the validation set here
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # Stop if validation loss doesn't improve for 3 epochs
)

In [32]:
# Train the model
trainer.train()

Epoch,Training Loss,Validation Loss
1,0.6472,0.659716
2,0.3694,0.436465
3,0.2594,0.421289
4,0.1861,0.439947
5,0.1284,0.465892
6,0.0841,0.612107


TrainOutput(global_step=2196, training_loss=0.27909195488267907, metrics={'train_runtime': 3466.7939, 'train_samples_per_second': 33.737, 'train_steps_per_second': 2.111, 'total_flos': 1214392109164416.0, 'train_loss': 0.27909195488267907, 'epoch': 6.0})
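
Training stopped after epoch 6 even though 20 epochs were configured. The following simplified sketch replays the patience logic on the validation losses from the table above. It illustrates the idea and is not the library's actual implementation.

```python
# Validation losses copied from the training log above (epochs 1 to 6).
val_losses = [0.659716, 0.436465, 0.421289, 0.439947, 0.465892, 0.612107]
patience = 3  # same value as early_stopping_patience

best_loss = float("inf")
best_epoch = None
epochs_without_improvement = 0
stopped_at = None

for epoch, loss in enumerate(val_losses, start=1):
    if loss < best_loss:
        # New best validation loss: remember it and reset the counter.
        best_loss, best_epoch = loss, epoch
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
    if epochs_without_improvement >= patience:
        stopped_at = epoch
        break

print(f"best epoch: {best_epoch} (loss {best_loss:.6f}), stopped after epoch {stopped_at}")
```

Because `load_best_model_at_end=True`, the model kept after training is the epoch 3 checkpoint rather than the last one, which is consistent with the widening gap between training and validation loss from epoch 4 onward.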

---

## Step 6: **Evaluate the Model**
After training, we’ll evaluate the model on the test set.

### Explanation:
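- `trainer.evaluate()` reports the loss on the test dataset. No `compute_metrics` function was passed to the `Trainer`, so it returns no other quality metrics.
- Accuracy, precision, recall, and F1 score are computed separately with scikit-learn from the model’s predictions.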

In [33]:
# Evaluate the model
results = trainer.evaluate()

print("Evaluation results:", results)

Evaluation results: {'eval_loss': 0.42128902673721313, 'eval_runtime': 32.9424, 'eval_samples_per_second': 42.498, 'eval_steps_per_second': 5.312, 'epoch': 6.0}
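
The result above contains only the loss because the `Trainer` was created without a `compute_metrics` function. A minimal sketch of such a function follows. The metric names `accuracy` and `macro_f1` are arbitrary choices, and the logits and labels are made up so that the function can be exercised without a model.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score


def compute_metrics(eval_pred):
    """Turn (logits, labels) into a dict of metrics for the Trainer to log."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "macro_f1": f1_score(labels, preds, average="macro", zero_division=0),
    }


# Made-up logits for 4 samples and 3 classes, with their true labels.
fake_logits = np.array([
    [2.0, 0.1, 0.1],
    [0.2, 1.5, 0.1],
    [0.3, 0.2, 0.9],
    [1.2, 0.4, 0.1],
])
fake_labels = np.array([0, 1, 2, 1])

metrics = compute_metrics((fake_logits, fake_labels))
print(metrics)
```

Passing it as `Trainer(..., compute_metrics=compute_metrics)` would add these values (prefixed with `eval_`) to the per-epoch log and to the output of `trainer.evaluate()`.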


In [36]:
import numpy as np
from sklearn.metrics import accuracy_score, classification_report

# Get predictions for the test dataset
predictions = trainer.predict(test_dataset)

# Extract predicted labels
predicted_labels = np.argmax(predictions.predictions, axis=-1)

# True labels
true_labels = test_dataset['label']

# Calculate accuracy
accuracy = accuracy_score(true_labels, predicted_labels)
print("Accuracy:", accuracy)

# Get unique classes in true_labels
unique_classes = np.unique(true_labels)

label_mapping = dict(enumerate(df['category'].astype('category').cat.categories))

# Generate classification report with actual category names
target_names = [label_mapping[i] for i in unique_classes]  # Use category names instead of "Class X"

class_report = classification_report(
    true_labels, 
    predicted_labels, 
    target_names=target_names, 
    labels=unique_classes,
    zero_division=0  # Report 0 instead of warning when a metric is undefined (e.g., a class that is never predicted)
)
print("Classification Report:\n", class_report)

Accuracy: 0.875
Classification Report:
                     precision    recall  f1-score   support

         Education       0.64      0.66      0.65        32
         Financial       1.00      0.96      0.98       634
      Governmental       0.68      0.60      0.64        50
            Health       0.92      0.35      0.51        34
             Other       0.66      0.88      0.75       171
       Promotional       0.86      0.92      0.89       355
Services or Stores       1.00      0.50      0.67         2
Telecommunications       0.83      0.61      0.70       121
            Travel       0.00      0.00      0.00         1

          accuracy                           0.88      1400
         macro avg       0.73      0.61      0.64      1400
      weighted avg       0.88      0.88      0.87      1400
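
The gap between the macro and weighted F1 averages reflects the class imbalance. The sketch below recomputes both from the rounded per-class F1 scores and supports in the report above, so the results match the report only up to rounding.

```python
# (f1-score, support) per class, copied from the classification report above.
per_class = {
    "Education": (0.65, 32),
    "Financial": (0.98, 634),
    "Governmental": (0.64, 50),
    "Health": (0.51, 34),
    "Other": (0.75, 171),
    "Promotional": (0.89, 355),
    "Services or Stores": (0.67, 2),
    "Telecommunications": (0.70, 121),
    "Travel": (0.00, 1),
}

total_support = sum(support for _, support in per_class.values())

# Macro average: every class counts equally, regardless of size.
macro_f1 = sum(f1 for f1, _ in per_class.values()) / len(per_class)

# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in per_class.values()) / total_support

print(f"macro F1:    {macro_f1:.2f}")
print(f"weighted F1: {weighted_f1:.2f}")
```

The weighted average is dominated by the large `Financial` and `Promotional` classes, while the macro average is pulled down by small classes such as `Health` and `Travel`.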



---

## Step 7: **Save the Model**
We’ll save the fine-tuned model for future use.

### Explanation:
- Saving the model allows us to reuse it without retraining.
- We’ll save both the model and tokenizer to a directory.

In [37]:
# Save the model and tokenizer
model.save_pretrained("../Models/fine-tuned-modernbert_v4")
tokenizer.save_pretrained("../Models/fine-tuned-modernbert_v4")

('../Models/fine-tuned-modernbert_v4/tokenizer_config.json',
 '../Models/fine-tuned-modernbert_v4/special_tokens_map.json',
 '../Models/fine-tuned-modernbert_v4/tokenizer.json')

---

## Step 8: **Load the Fine-Tuned Model**
We’ll reload the fine-tuned model and tokenizer from disk.

### Explanation:
- We’ll load the saved model and tokenizer for inference or further training.

In [38]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("../Models/fine-tuned-modernbert_v4")
model = AutoModelForSequenceClassification.from_pretrained("../Models/fine-tuned-modernbert_v4")

---

In [39]:
label_mapping = model.config.id2label

print("Label Mapping (ID to Label):")
for idx, label in label_mapping.items():
    print(f"ID: {idx}, Label: {label}")

Label Mapping (ID to Label):
ID: 0, Label: LABEL_0
ID: 1, Label: LABEL_1
ID: 2, Label: LABEL_2
ID: 3, Label: LABEL_3
ID: 4, Label: LABEL_4
ID: 5, Label: LABEL_5
ID: 6, Label: LABEL_6
ID: 7, Label: LABEL_7
ID: 8, Label: LABEL_8


In [40]:
label_mapping = dict(enumerate(df['category'].astype('category').cat.categories))
print("Label Mapping (ID to Category Name):")
for idx, category in label_mapping.items():
    print(f"ID: {idx}, Category: {category}")

Label Mapping (ID to Category Name):
ID: 0, Category: Education
ID: 1, Category: Financial
ID: 2, Category: Governmental
ID: 3, Category: Health
ID: 4, Category: Other
ID: 5, Category: Promotional
ID: 6, Category: Services or Stores
ID: 7, Category: Telecommunications
ID: 8, Category: Travel
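
The saved model only knows the generic names `LABEL_0` to `LABEL_8`, so the category names have to be kept somewhere else. One option, sketched below, is to store the mapping as JSON next to the model. The file name `label_mapping.json` is hypothetical, and a temporary directory stands in for the model directory.

```python
import json
import tempfile
from pathlib import Path

# ID-to-category mapping as printed above.
id2label = {
    0: "Education", 1: "Financial", 2: "Governmental", 3: "Health", 4: "Other",
    5: "Promotional", 6: "Services or Stores", 7: "Telecommunications", 8: "Travel",
}
label2id = {name: idx for idx, name in id2label.items()}

with tempfile.TemporaryDirectory() as tmp_dir:
    mapping_path = Path(tmp_dir) / "label_mapping.json"  # hypothetical file name

    # JSON object keys are always strings, so the integer IDs become "0", "1", ...
    mapping_path.write_text(json.dumps(id2label, indent=2), encoding="utf-8")

    # Convert the keys back to integers when loading.
    raw = json.loads(mapping_path.read_text(encoding="utf-8"))
    loaded_id2label = {int(idx): name for idx, name in raw.items()}

print(loaded_id2label[4])
```

An alternative, assuming the model is created again before training, is to pass `id2label=id2label` and `label2id=label2id` to `AutoModelForSequenceClassification.from_pretrained` so the names are written into the saved config.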
