## Step 1: **Load the Dataset**
We’ll start by loading the dataset from the CSV file (`smsDataLast.csv`) using `pandas`.

### Explanation:
- The dataset contains SMS messages and their corresponding categories (labels).
- We’ll use `pandas` to load the data into a DataFrame for further processing.
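
As a minimal sketch of the load step, the snippet below reads a tiny in-memory CSV that uses the same three column names as the dataset and checks that nothing the later steps rely on is missing. The two rows are made-up placeholders, not real messages from `smsDataLast.csv`, and the `required` list is an assumption based on the columns shown by `df.info()`.

```python
import io

import pandas as pd

# Made-up two-row CSV standing in for smsDataLast.csv (placeholder content).
csv_text = (
    "Sender,Message Content,Category\n"
    "stc,Your bill is ready,Telecommunications\n"
    "alrajhibank,Eid offer for you,Promotional\n"
)

sample_df = pd.read_csv(io.StringIO(csv_text))

# Fail early if a column used by the later steps is absent or contains gaps.
required = ["Sender", "Message Content", "Category"]
missing = [col for col in required if col not in sample_df.columns]
null_counts = sample_df[required].isna().sum().to_dict()

print("Missing columns:", missing)
print("Null counts:", null_counts)
```

Running the same two checks on the real `df` right after `pd.read_csv` would catch a renamed column or empty messages before tokenization.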

In [2]:
import pandas as pd

In [9]:
# Load the dataset
df = pd.read_csv('../Data/smsDataLast.csv')

In [10]:
# Display the first few rows of the dataset
df.head()

Unnamed: 0,Sender,Message Content,Category
0,maharah co.,لفترة محدودة خصم 10%\r\nعلى خدمة العمالة المنز...,Promotional
1,stc,عميلنا العزيز \nتم اكتشاف عطل فني على هاتفكم ر...,Telecommunications
2,stc,عميلنا العزيز ..\nنتشرف بخدمتكم، ونفيدكم بأنه ...,Telecommunications
3,mawarid,عميلنا العزيز \r\nتم إستلام مبلغ 109.25 عن طر...,Commercial
4,alrajhibank,عزيزتنا عميلة التميز، نهنئكم بعيد الأضحى المبا...,Promotional


In [11]:
# Check the columns and data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1278 entries, 0 to 1277
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Sender           1278 non-null   object
 1   Message Content  1278 non-null   object
 2   Category         1278 non-null   object
dtypes: object(3)
memory usage: 30.1+ KB


In [12]:
# Check the distribution of categories
df['Category'].value_counts()

Category
Financial                 370
Promotional               204
Commercial                137
Governmental              137
Conferences and Events     95
Telecommunications         90
Education                  53
Stores                     46
Services                   39
Travel                     38
Health                     25
Personal                   22
Apps                       22
Name: count, dtype: int64

---

## Step 2: **Preprocess the Data**
We’ll preprocess the data to prepare it for training. This includes:
- Mapping categories to numerical labels.
- Splitting the data into training and testing sets.

### Explanation:
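- Text classification requires numerical labels, so we’ll convert the `Category` column to integer codes. `cat.codes` numbers the categories in sorted (alphabetical) order, so `0` is `Apps` and `12` is `Travel`.
- We’ll hold out 20% of the data for testing and train on the remaining 80%, stratifying on `Category` so both sets keep the same class proportions.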
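
The label mapping can be made explicit with a pair of dictionaries. This is a minimal sketch on a made-up four-row frame; the names `id2label` and `label2id` are illustrative choices, not something the notebook defines. It also shows why `Series.unique()` is not a safe way to translate label ids back to names: it returns categories in order of first appearance, while the codes follow sorted order.

```python
import pandas as pd

# Toy frame with made-up rows; only the Category values matter here.
toy = pd.DataFrame({"Category": ["Promotional", "Financial", "Apps", "Financial"]})

cat = toy["Category"].astype("category")
toy["label"] = cat.cat.codes

# Codes index into the sorted category list, not the order of first appearance.
id2label = dict(enumerate(cat.cat.categories))
label2id = {name: idx for idx, name in id2label.items()}

print("id2label:", id2label)
print("unique() order:", list(toy["Category"].unique()))
```

Keeping `id2label` around means every later step (the single-row test, the classification report, inference) can decode predictions through the same mapping.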

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
# Map categories to numerical labels
df['label'] = df['Category'].astype('category').cat.codes

In [8]:
# Split the data into training and testing sets
train_df, test_df = train_test_split(
    df, 
    test_size=0.2,  # 20% of the data for testing
    random_state=42,  # For reproducibility
    shuffle=True,  # Shuffle the data before splitting
    stratify=df['Category']  # Preserve the class distribution
)

# Check the distribution of categories in the training and testing sets
print("Training set class distribution:\n", train_df['Category'].value_counts())
print("Testing set class distribution:\n", test_df['Category'].value_counts())

Training set class distribution:
 Category
Financial                 296
Promotional               163
Commercial                110
Governmental              109
Conferences and Events     76
Telecommunications         72
Education                  42
Stores                     37
Services                   31
Travel                     30
Health                     20
Apps                       18
Personal                   18
Name: count, dtype: int64
Testing set class distribution:
 Category
Financial                 74
Promotional               41
Governmental              28
Commercial                27
Conferences and Events    19
Telecommunications        18
Education                 11
Stores                     9
Travel                     8
Services                   8
Health                     5
Personal                   4
Apps                       4
Name: count, dtype: int64


In [9]:
# Display the shapes of the training and testing sets
print("Training data shape:", train_df.shape)
print("Testing data shape:", test_df.shape)

Training data shape: (1022, 4)
Testing data shape: (256, 4)


---

## Step 3: **Tokenize the Data**
We’ll tokenize the text data using the `AutoTokenizer` from Hugging Face’s `transformers` library.

### Explanation:
- Tokenization converts text into numerical input that the model can understand.
- We’ll use the `AutoTokenizer` for ModernBERT and pad or truncate every sequence to a fixed length of 52 tokens, so that all inputs in a batch have the same shape.
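
To illustrate what padding and truncation do to a sequence, here is a simplified pure-Python sketch. The helper `pad_or_truncate` and the default `pad_id=0` are hypothetical; the real tokenizer uses its own padding token id and also keeps the special start and end tokens in place when it truncates, which this sketch does not attempt.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad              # right padding
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask


# A short sequence gets padded, a long one gets cut.
short_ids, short_mask = pad_or_truncate([11, 12, 13], max_length=5)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

An attention mask made up entirely of ones, as in the tokenized example printed in this step, means the message filled all 52 positions and needed no padding.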

In [10]:
from transformers import AutoTokenizer


In [11]:
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")

In [12]:
# Tokenize the text data
def tokenize_function(examples):
    return tokenizer(examples["Message Content"], padding="max_length", truncation=True, max_length=52)

In [13]:
# Apply tokenization to the training and testing datasets
train_dataset = train_df.apply(tokenize_function, axis=1)
test_dataset = test_df.apply(tokenize_function, axis=1)

In [14]:
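# Check the tokenized output for the row whose index label is 0
# (label-based lookup: the split keeps the original DataFrame index,
# so this is not necessarily the first row of the training set)
print("Tokenized example:", train_dataset.loc[0])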

Tokenized example: {'input_ids': [50281, 4467, 11912, 8181, 6900, 12458, 13504, 21931, 9211, 49788, 12458, 45232, 30265, 5843, 884, 6, 190, 187, 13793, 38033, 45232, 9211, 5843, 12458, 9445, 13793, 5843, 7427, 12458, 30331, 5846, 29620, 4467, 32229, 15677, 5846, 45693, 30901, 48531, 22814, 9427, 6900, 190, 187, 5843, 5846, 23207, 11912, 6900, 6463, 14062, 50282], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


---

## Step 4: **Test the Model on a Single Row**
Before training, we’ll run the model on a single row as a smoke test.

### Explanation:
- We’ll load the ModernBERT model and pass a single tokenized input to it.
- This only verifies that the tokenizer and model run end to end. The classification head is newly initialized, so the predicted class is essentially arbitrary until the model is trained.
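
The step from raw model scores to a class name can be sketched without loading any model. The logits below are made-up numbers for a three-class problem, not real model output, and `id2label` is a hypothetical mapping from label id to category name.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a three-class problem (not real model output).
logits = [0.1, 1.2, -0.3]
id2label = {0: "Apps", 1: "Commercial", 2: "Conferences and Events"}

probs = softmax(logits)
predicted_id = max(range(len(probs)), key=probs.__getitem__)  # argmax
print(id2label[predicted_id], round(probs[predicted_id], 3))
```

Softmax does not change which class wins, so taking the argmax of the logits directly gives the same prediction; the probabilities are only useful when a confidence value is needed.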

In [15]:
from transformers import AutoModelForSequenceClassification
import torch

In [16]:
# Load the model
model = AutoModelForSequenceClassification.from_pretrained("answerdotai/ModernBERT-base", num_labels=len(df['Category'].unique()))

Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at answerdotai/ModernBERT-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [17]:
# Test on a single row
sample_input = tokenizer(train_df.iloc[0]["Message Content"], return_tensors="pt", padding="max_length", truncation=True, max_length=52)
with torch.no_grad():
    outputs = model(**sample_input)

In [18]:
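# Get the predicted class
# Label ids come from cat.codes, so decode them through the same sorted
# category list (unique() returns order of first appearance instead)
predicted_class_id = outputs.logits.argmax().item()
predicted_class = df['Category'].astype('category').cat.categories[predicted_class_id]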

In [19]:
print("Predicted class:", predicted_class)
print("Actual class:", train_df.iloc[0]["Category"])

Predicted class: Governmental
Actual class: Services


---

## Step 5: **Train the Model**
We’ll train the model using the `Trainer` API from Hugging Face.

### Explanation:
- The `Trainer` API simplifies the training process by handling training loops, evaluation, and logging.
- We’ll define training arguments (e.g., learning rate, batch size) and train the model on the training dataset.

In [20]:
from transformers import Trainer, TrainingArguments
from datasets import Dataset

In [21]:
# Convert DataFrame to Hugging Face Dataset
train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)

In [22]:
# Tokenize the datasets
train_dataset = train_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

Map: 100%|██████████| 1022/1022 [00:00<00:00, 22953.44 examples/s]
Map: 100%|██████████| 256/256 [00:00<00:00, 18010.06 examples/s]


In [26]:
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",        # Evaluate at the end of each epoch (called evaluation_strategy in older transformers releases)
    logging_strategy="epoch",     # Log metrics at the end of each epoch
    save_strategy="epoch",        # Save the model at the end of each epoch
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=20,
    weight_decay=0.01,
    logging_dir="./logs",         # Directory for logs
    report_to="all",              # Log to all available trackers (e.g., TensorBoard, W&B)
    load_best_model_at_end=True,  # Required for EarlyStoppingCallback
    metric_for_best_model="eval_loss",  # Use validation loss to determine the best model
    greater_is_better=False,      # Lower validation loss is better
)



In [27]:
from transformers import EarlyStoppingCallback

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # Stop if validation loss doesn't improve for 3 epochs
)

In [28]:
# Train the model
trainer.train()

Epoch,Training Loss,Validation Loss
1,1.6257,1.264979
2,1.0792,1.070356
3,0.7257,1.001633
4,0.4684,0.939692
5,0.3204,1.036855
6,0.1917,1.004694
7,0.1453,1.088322


TrainOutput(global_step=448, training_loss=0.6509223857096263, metrics={'train_runtime': 208.3715, 'train_samples_per_second': 98.094, 'train_steps_per_second': 6.143, 'total_flos': 247606077731376.0, 'train_loss': 0.6509223857096263, 'epoch': 7.0})
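
The training log above can be replayed to see why training stopped after epoch 7 even though `num_train_epochs=20`. This is a simplified re-implementation of the patience rule, not the library's own code. The step count additionally assumes a single device, no gradient accumulation, and that the last partial batch is kept.

```python
import math

# Validation losses copied from the training log, one per epoch.
val_losses = [1.264979, 1.070356, 1.001633, 0.939692, 1.036855, 1.004694, 1.088322]
patience = 3

best_loss = float("inf")
best_epoch = None
epochs_without_improvement = 0
stopped_epoch = None

for epoch, loss in enumerate(val_losses, start=1):
    if loss < best_loss:
        best_loss, best_epoch = loss, epoch
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
    if epochs_without_improvement >= patience:
        stopped_epoch = epoch
        break

# 1022 training rows in batches of 16 -> optimizer steps per epoch.
steps_per_epoch = math.ceil(1022 / 16)
total_steps = steps_per_epoch * stopped_epoch

print(best_epoch, best_loss, stopped_epoch, total_steps)
```

The best validation loss is reached at epoch 4, and three epochs without improvement follow, which matches the seven logged epochs and `global_step=448`. Because `load_best_model_at_end=True`, the model kept in memory afterwards is the epoch 4 checkpoint.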

---

## Step 6: **Evaluate the Model**
After training, we’ll evaluate the model on the test set.

### Explanation:
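- We’ll use the `Trainer` API to evaluate the model’s performance on the test dataset.
- `trainer.evaluate()` reports only the loss here because no `compute_metrics` function was passed to the `Trainer`; accuracy, precision, recall, and F1 score are computed separately with scikit-learn.
- The same test set was also used for early stopping and best-model selection, so these scores are likely somewhat optimistic.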
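
One way to have these metrics reported at every evaluation is a `compute_metrics` function. The sketch below is an illustration rather than part of the original notebook; the choice of macro averaging is an assumption, and the logits and labels are made-up values for a two-class check.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def compute_metrics(eval_pred):
    """Turn (logits, labels) into accuracy and macro-averaged P/R/F1."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="macro", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up logits and labels for a quick two-class check.
toy_logits = np.array([[2.0, 1.0], [0.0, 3.0], [1.0, 0.0]])
toy_labels = np.array([0, 1, 1])
metrics = compute_metrics((toy_logits, toy_labels))
print(metrics)
```

Passing it as `Trainer(..., compute_metrics=compute_metrics)` would add these values, prefixed with `eval_`, to the per-epoch log and to the dictionary returned by `trainer.evaluate()`.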

In [29]:
# Evaluate the model
results = trainer.evaluate()

print("Evaluation results:", results)

Evaluation results: {'eval_loss': 0.9396919012069702, 'eval_runtime': 1.4608, 'eval_samples_per_second': 175.252, 'eval_steps_per_second': 21.906, 'epoch': 7.0}


In [30]:
import numpy as np
from sklearn.metrics import accuracy_score, classification_report

# Get predictions for the test dataset
predictions = trainer.predict(test_dataset)

# Extract predicted labels
predicted_labels = np.argmax(predictions.predictions, axis=-1)

# True labels
true_labels = test_dataset['label']

# Calculate accuracy
accuracy = accuracy_score(true_labels, predicted_labels)
print("Accuracy:", accuracy)

# Generate classification report
# target_names must follow the label ids, i.e. the sorted category order used by cat.codes
# (df['Category'].unique() is in order of first appearance and would mislabel the rows)
class_names = list(df['Category'].astype('category').cat.categories)
class_report = classification_report(true_labels, predicted_labels, target_names=class_names)
print("Classification Report:\n", class_report)

Accuracy: 0.75390625
Classification Report:
                         precision    recall  f1-score   support

                  Apps       0.67      0.50      0.57         4
            Commercial       0.80      0.89      0.84        27
Conferences and Events       0.83      0.53      0.65        19
             Education       0.38      0.73      0.50        11
             Financial       0.94      0.84      0.89        74
          Governmental       0.86      0.68      0.76        28
                Health       1.00      0.40      0.57         5
              Personal       0.60      0.75      0.67         4
           Promotional       0.74      0.85      0.80        41
              Services       0.30      0.38      0.33         8
                Stores       0.55      0.67      0.60         9
    Telecommunications       0.58      0.61      0.59        18
                Travel       1.00      1.00      1.00         8

              accuracy                           0.75    

---

## Step 7: **Save the Model**
We’ll save the fine-tuned model for future use.

### Explanation:
- Saving the model allows us to reuse it without retraining.
- We’ll save both the model and tokenizer to a directory.
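
The mapping from label id to category name is built from the DataFrame, so it is worth storing next to the model as well. The following is a minimal sketch that writes it to a JSON file; the file name `label_map.json` and the three-entry mapping are hypothetical, and a temporary directory stands in for the model directory.

```python
import json
import tempfile
from pathlib import Path

# Example mapping; in the notebook it would come from the sorted category list.
id2label = {0: "Apps", 1: "Commercial", 2: "Conferences and Events"}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "label_map.json"  # hypothetical file name
    path.write_text(
        json.dumps(id2label, ensure_ascii=False, indent=2), encoding="utf-8"
    )

    # JSON object keys are always strings, so convert them back to int on load.
    loaded = json.loads(path.read_text(encoding="utf-8"))
    restored = {int(key): name for key, name in loaded.items()}

print(restored == id2label)
```

An alternative is to pass `id2label` and `label2id` to `from_pretrained` when the model is first created, which stores the names in the model config so they are saved and reloaded with it.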

In [38]:
# Save the model and tokenizer
model.save_pretrained("../Models/fine-tuned-modernbert")
tokenizer.save_pretrained("../Models/fine-tuned-modernbert")

('../Models/fine-tuned-modernbert/tokenizer_config.json',
 '../Models/fine-tuned-modernbert/special_tokens_map.json',
 '../Models/fine-tuned-modernbert/tokenizer.json')

---

## Step 8: **Load the Fine-Tuned Model**
The saved model and tokenizer can later be loaded back from the same directory.

### Explanation:
- We’ll load the saved model and tokenizer for inference or further training.

In [39]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("../Models/fine-tuned-modernbert")
model = AutoModelForSequenceClassification.from_pretrained("../Models/fine-tuned-modernbert")

---