## Step 1: **Load the Dataset**
We’ll start by loading the dataset from the CSV file (`labeled_combined_sms_v2.csv`) using `pandas`.

### Explanation:
- The dataset contains SMS messages, each with a list of category labels plus `is_important` and `is_spam` flags.
- We’ll use `pandas` to load the data into a DataFrame for further processing.

In [1]:
import pandas as pd

In [2]:
# Load the dataset
df = pd.read_csv('../Data/labeled_combined_sms_v2.csv')

In [3]:
# Display the first few rows of the dataset
df.head()

Unnamed: 0,Sender,Message Content,category_labels,is_important,is_spam,transactions,dates
0,samba.,تم خصم مبلغ ٥٠٫٠٠ من حساب ******٤٥٢١ في ٠٥-٠٥-...,"['Money/Financial', 'Expense']",1,0,"{'amount': '٥٠٫٠٠', 'type': 'expense', 'accoun...",
1,607941,Your WhatsApp code is 614-968 but you can simp...,['Other'],0,0,,
2,neqaty,"Dear Member, you have not redeemed any Neqaty ...","['Promotion', 'Advertising']",0,1,,
3,neqaty,عزيزي العميل، لقد مر 17 شهر و لم تقم باي عملية...,"['Notification', 'Promotion']",0,1,,
4,606006,"BIG SAVINGS! Now get 200 Mobily Minutes, 200 M...","['Promotion', 'Advertising']",0,1,,


In [4]:
# Check the columns and data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6996 entries, 0 to 6995
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Sender           6996 non-null   object
 1   Message Content  6996 non-null   object
 2   category_labels  6996 non-null   object
 3   is_important     6996 non-null   int64 
 4   is_spam          6996 non-null   int64 
 5   transactions     3053 non-null   object
 6   dates            484 non-null    object
dtypes: int64(2), object(5)
memory usage: 382.7+ KB


In [5]:
# Check the distribution of categories
df['category_labels'].value_counts()

category_labels
['Money/Financial', 'Expense']                       2219
['Promotion', 'Advertising']                         1240
['Other']                                             851
['Notification', 'Other']                             361
['Money/Financial', 'Expense', 'Notification']        346
                                                     ... 
['Money/Financial', 'Notification', 'Government']       1
['Money/Financial', 'Donation']                         1
['Promotion', 'Investment']                             1
['Money/Financial', 'Transaction']                      1
['Promotion', 'Health', 'Education']                    1
Name: count, Length: 116, dtype: int64
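Because `category_labels` holds a whole list per message, `value_counts()` counts each distinct combination of labels (116 of them here) rather than each individual label. As a minimal sketch of counting individual labels instead, the snippet below parses a few stringified lists with `ast.literal_eval` and tallies them with `collections.Counter`. The `raw_labels` values are copied from the `df.head()` output above; the variable names are illustrative.

```python
import ast
from collections import Counter

# Stringified label lists, as they appear in the CSV (taken from df.head()).
raw_labels = [
    "['Money/Financial', 'Expense']",
    "['Other']",
    "['Promotion', 'Advertising']",
    "['Notification', 'Promotion']",
    "['Promotion', 'Advertising']",
]

# literal_eval parses Python literals only, so it cannot run arbitrary code.
parsed = [ast.literal_eval(item) for item in raw_labels]

# Count every label on its own rather than each combination of labels.
label_counts = Counter(label for labels in parsed for label in labels)

for label, count in label_counts.most_common():
    print(f"{label}: {count}")
```

On the full DataFrame, the same idea would apply to `df['category_labels']` once the column has been parsed into lists.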

In [6]:
# unique_labels = df['category_labels'].unique()
# for label in unique_labels:
#     print(label)

---

## Step 2: **Preprocess the Data**
We’ll preprocess the data to prepare it for training. This includes:
- Encoding each message’s list of categories as a multi-hot (binary) label vector.
- Splitting the data into training, validation, and test sets.

### Explanation:
- Each message can have several categories, so this is multi-label classification: we’ll convert the `category_labels` column into one binary vector per message, with one position per label.
- We’ll split the data into roughly 70% training, 15% validation, and 15% test.
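As a minimal sketch of the label encoding used in this step, written in plain Python so the mechanics are visible: each message becomes a vector with one position per label, set to 1 when the label is present. The helper name `to_multi_hot` and the three sample rows are illustrative; the notebook itself relies on `MultiLabelBinarizer` for this.

```python
def to_multi_hot(label_lists):
    """Return (sorted label vocabulary, one binary vector per input list)."""
    vocabulary = sorted({label for labels in label_lists for label in labels})
    index = {label: i for i, label in enumerate(vocabulary)}
    vectors = []
    for labels in label_lists:
        row = [0] * len(vocabulary)
        for label in labels:
            row[index[label]] = 1
        vectors.append(row)
    return vocabulary, vectors


# Illustrative sample rows, not the full dataset.
samples = [
    ["Money/Financial", "Expense"],
    ["Other"],
    ["Promotion", "Advertising"],
]

vocabulary, vectors = to_multi_hot(samples)
print(vocabulary)
for labels, vector in zip(samples, vectors):
    print(labels, "->", vector)
```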

In [7]:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from datasets import Dataset



In [8]:
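# Convert category_labels from a string representation to a real list.
# ast.literal_eval only parses Python literals, so it is safer than eval.
import ast
df['category_labels'] = df['category_labels'].apply(ast.literal_eval)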

In [9]:
# Extract all unique labels
all_labels = sorted({label for labels in df['category_labels'] for label in labels})

In [11]:
# Initialize MultiLabelBinarizer
mlb = MultiLabelBinarizer(classes=all_labels)

In [12]:
# Transform category_labels into binary vectors
binary_labels = mlb.fit_transform(df['category_labels'])

In [13]:
df['binary_labels'] = binary_labels.tolist()

In [14]:
df.columns

Index(['Sender', 'Message Content', 'category_labels', 'is_important',
       'is_spam', 'transactions', 'dates', 'binary_labels'],
      dtype='object')

In [15]:
# Hold out 15% of the data for testing
train_df, test_df = train_test_split(df, test_size=0.15, random_state=42)
# Take 17.65% of the remaining 85% for validation (about 15% of the original data)
train_df, val_df = train_test_split(train_df, test_size=0.1765, random_state=42)
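The second split uses `test_size=0.1765` because the validation set is carved out of the remaining 85% of the data: 0.15 / 0.85 ≈ 0.1765. The arithmetic can be checked with a short sketch. It assumes the held-out size is rounded up to a whole row; that rounding is an assumption, although it reproduces the 4896 / 1050 / 1050 split sizes reported in the tokenization progress output.

```python
import math

n_total = 6996          # rows reported by df.info()
test_fraction = 0.15    # first split: share of all rows held out for testing
val_fraction = 0.1765   # second split: share of the remaining rows

# Fraction of the remainder that corresponds to 15% of the original data.
exact_val_fraction = test_fraction / (1 - test_fraction)

n_test = math.ceil(test_fraction * n_total)     # assumed round-up
n_remaining = n_total - n_test
n_val = math.ceil(val_fraction * n_remaining)   # assumed round-up
n_train = n_remaining - n_val

print(f"exact validation fraction: {exact_val_fraction:.4f}")
print(f"train={n_train}, val={n_val}, test={n_test}")
```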

---

## Step 3: **Tokenize the Data**
We’ll tokenize the text data using the `AutoTokenizer` from Hugging Face’s `transformers` library.

### Explanation:
- Tokenization converts text into numerical input that the model can understand.
- We’ll use the `AutoTokenizer` for ModernBERT and pad or truncate every sequence to a fixed length of 52 tokens.
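To make the padding and truncation step concrete, here is a minimal sketch of what fixing the sequence length does to a list of token IDs. It is not the tokenizer's implementation (a real tokenizer also keeps special tokens in place when truncating). `pad_or_truncate` and the sample IDs are hypothetical, and the pad ID 50283 is an assumption read off the trailing values of the tokenized example printed later in this step.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation
    n_pad = max_length - len(kept)                   # padding still needed
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad   # 0 marks padding
    return input_ids, attention_mask


PAD_ID = 50283  # assumed pad token ID, taken from the example output

# A short sequence gets padded; a long one gets cut.
short_ids, short_mask = pad_or_truncate([50281, 7427, 6900, 50282], 6, PAD_ID)
long_ids, long_mask = pad_or_truncate(list(range(10)), 6, PAD_ID)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore the padded positions, which is why the tokenized example ends in a run of zeros.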

In [16]:
from transformers import AutoTokenizer

In [17]:
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")

In [18]:
# Tokenize the text data
def tokenize_function(examples):
    return tokenizer(examples["Message Content"], padding="max_length", truncation=True, max_length=52)

In [19]:
# Convert the training, validation, and test DataFrames to Hugging Face Datasets
train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(val_df)
test_dataset = Dataset.from_pandas(test_df)

In [20]:
# Tokenize each split in batches
train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

Map: 100%|██████████| 4896/4896 [00:00<00:00, 30568.42 examples/s]
Map: 100%|██████████| 1050/1050 [00:00<00:00, 29084.60 examples/s]
Map: 100%|██████████| 1050/1050 [00:00<00:00, 29459.11 examples/s]


In [21]:
# Check the tokenized output for the first example
print("Tokenized example:", train_dataset[0])

Tokenized example: {'Sender': 'samba.', 'Message Content': 'الرجاء استخدام رمز المرور رقم   5511\r\n للدخول الى خدمات سامباموبايل', 'category_labels': ['Other'], 'is_important': 0, 'is_spam': 0, 'transactions': None, 'dates': None, 'binary_labels': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0], '__index_level_0__': 1685, 'input_ids': [50281, 7427, 6900, 23072, 10714, 96, 32262, 113, 8181, 28337, 9211, 30901, 39410, 5843, 29620, 30331, 6900, 39727, 39410, 14062, 5843, 50275, 2417, 883, 2379, 23630, 4467, 9211, 28337, 41147, 9445, 25378, 45232, 9211, 5843, 24823, 34341, 30901, 13621, 30901, 6534, 13621, 3142, 6463, 4467, 50282, 50283, 50283, 50283, 50283, 50283, 50283], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]}


---

## Step 4: **Test the Model on a Single Row**
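Before training, we’ll run the model on a single row as a smoke test. The classification head is newly initialized at this point, so the predicted labels are not meaningful; the goal is only to confirm that the tokenizer and model run end to end.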

### Explanation:
- We’ll load the ModernBERT model and pass a single tokenized input to it.
- This helps verify that the model and tokenizer are working correctly.

In [22]:
from transformers import AutoModelForSequenceClassification
import torch

In [23]:
# Load the model
model = AutoModelForSequenceClassification.from_pretrained(
    "answerdotai/ModernBERT-base",
    num_labels=len(all_labels),  # Number of unique labels
    problem_type="multi_label_classification"  # Specify multi-label classification
)

Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at answerdotai/ModernBERT-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [24]:
# Extract the first row
row = df.loc[0]

# Tokenize the input text
inputs = tokenizer(
    row["Message Content"],
    padding=True,
    truncation=True,
    return_tensors="pt"
)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


In [25]:
# Get model predictions
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda-12.2'


In [26]:
# Predict Category Labels (multi-label classification)
probs = torch.sigmoid(logits)
threshold = 0.5
binary_preds = (probs > threshold).int().tolist()

In [27]:
# Map binary predictions to label names
predicted_labels = [all_labels[i] for i, val in enumerate(binary_preds[0]) if val == 1]
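The sigmoid-and-threshold logic in the cells above can be sketched without a model. Each logit is squashed independently into a probability, and every label whose probability exceeds the threshold is kept, so zero, one, or several labels can be predicted for the same message. The logits and the three label names here are made up for illustration.

```python
import math


def sigmoid(x):
    """Map a real-valued logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode(logits, label_names, threshold=0.5):
    """Return the labels whose sigmoid probability exceeds the threshold."""
    probs = [sigmoid(value) for value in logits]
    return [name for name, p in zip(label_names, probs) if p > threshold]


label_names = ["Expense", "Money/Financial", "Promotion"]  # illustrative subset
logits = [2.0, 0.3, -1.5]                                  # made-up model outputs

default_labels = decode(logits, label_names)        # threshold 0.5
strict_labels = decode(logits, label_names, 0.6)    # stricter threshold

print(default_labels)
print(strict_labels)
```

Raising the threshold trades recall for precision, since fewer borderline labels pass.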

In [28]:
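# NOTE: the model has only len(all_labels) outputs, one per category label, and no
# dedicated is_spam / is_important heads. logits[:, 0] and logits[:, 1] are the scores
# for all_labels[0] and all_labels[1], so the two values below are placeholders only.
is_spam_prob = torch.sigmoid(logits[:, 0]).item()
is_spam_pred = 1 if is_spam_prob > 0.5 else 0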

In [29]:
is_important_prob = torch.sigmoid(logits[:, 1]).item()
is_important_pred = 1 if is_important_prob > 0.5 else 0

In [30]:
print("\nMessage Content:", row["Message Content"])
print("\nTrue Category Labels:", row["category_labels"])
print("Predicted Category Labels:", predicted_labels)
print("\nTrue is_spam:", row["is_spam"])
print("Predicted is_spam:", is_spam_pred)
print("\nTrue is_important:", row["is_important"])
print("Predicted is_important:", is_important_pred)


Message Content: تم خصم مبلغ ٥٠٫٠٠ من حساب ******٤٥٢١ في ٠٥-٠٥-٢٠١٥ ٢٠:٣٩مساءً

True Category Labels: ['Money/Financial', 'Expense']
Predicted Category Labels: ['Donation', 'Education', 'Emergency', 'Event', 'Expense', 'Health', 'Income', 'Promotion', 'Security', 'Test/Exam', 'Transaction', 'Transfer', 'Travel']

True is_spam: 0
Predicted is_spam: 0

True is_important: 1
Predicted is_important: 0


---

## Step 5: **Train the Model**
We’ll train the model using the `Trainer` API from Hugging Face.

### Explanation:
- The `Trainer` API simplifies the training process by handling training loops, evaluation, and logging.
- We’ll define training arguments (e.g., learning rate, batch size) and train the model on the training dataset.

In [31]:
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
    EarlyStoppingCallback,
)
import torch

In [32]:
# Convert binary_labels to PyTorch tensors of type Float
train_labels = torch.tensor(train_df['binary_labels'].tolist(), dtype=torch.float)
val_labels = torch.tensor(val_df['binary_labels'].tolist(), dtype=torch.float)
test_labels = torch.tensor(test_df['binary_labels'].tolist(), dtype=torch.float)

In [33]:
# Add binary labels to the datasets
train_dataset = train_dataset.add_column('labels', train_labels.tolist())
val_dataset = val_dataset.add_column('labels', val_labels.tolist())
test_dataset = test_dataset.add_column('labels', test_labels.tolist())

In [34]:
model = AutoModelForSequenceClassification.from_pretrained(
    "answerdotai/ModernBERT-base",
    num_labels=len(all_labels),  # Number of unique labels
    problem_type="multi_label_classification"  # Specify multi-label classification
)

Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at answerdotai/ModernBERT-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [35]:
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # Evaluate at the end of each epoch
    logging_strategy="epoch",     # Log metrics at the end of each epoch
    save_strategy="epoch",        # Save the model at the end of each epoch
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=20,
    weight_decay=0.01,
    logging_dir="./logs",         # Directory for logs
    report_to="all",              # Log to all available trackers (e.g., TensorBoard, W&B)
    load_best_model_at_end=True,  # Required for EarlyStoppingCallback
    metric_for_best_model="eval_loss",  # Use validation loss to determine the best model
    greater_is_better=False,      # Lower validation loss is better
)



In [36]:
# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # Stop if validation loss doesn't improve for 3 epochs
)

In [37]:
trainer.train()

Epoch,Training Loss,Validation Loss
1,0.0967,0.071041
2,0.0485,0.04921
3,0.0328,0.049279
4,0.0223,0.047333
5,0.0152,0.042408
6,0.0109,0.04773
7,0.0072,0.052532
8,0.0048,0.051579


TrainOutput(global_step=2448, training_loss=0.02978924043427885, metrics={'train_runtime': 4264.3697, 'train_samples_per_second': 22.962, 'train_steps_per_second': 1.435, 'total_flos': 1355750853949440.0, 'train_loss': 0.02978924043427885, 'epoch': 8.0})
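Training was configured for 20 epochs but stopped after epoch 8. A minimal sketch of patience-based early stopping, replayed on the validation losses from the table above, shows why: the best loss occurs at epoch 5, and three epochs then pass without improvement. `replay_early_stopping` is a hypothetical helper that simplifies what `EarlyStoppingCallback` does (it ignores the improvement threshold option, for example).

```python
def replay_early_stopping(losses, patience):
    """Return (best_epoch, stop_epoch) for a list of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, len(losses)


# Validation losses copied from the training log, epochs 1 to 8.
val_losses = [0.071041, 0.04921, 0.049279, 0.047333,
              0.042408, 0.04773, 0.052532, 0.051579]

best_epoch, stop_epoch = replay_early_stopping(val_losses, patience=3)
print(f"best epoch: {best_epoch}, stopped after epoch: {stop_epoch}")
```

With `load_best_model_at_end=True`, the weights kept after training are those from the best epoch (epoch 5 here), not from the last one.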

---

## Step 6: **Evaluate the Model**
After training, we’ll evaluate the model on the test set.

### Explanation:
- We’ll use the `Trainer` API to evaluate the model’s performance on the test dataset.
- We’ll compute a per-label classification report, plus micro-averaged precision, recall, and F1 score.
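Micro-averaging pools true positives, false positives, and false negatives across all labels before computing the ratios, so frequent labels weigh more than rare ones. A minimal sketch of that computation on toy multi-hot matrices follows; `micro_scores` and the data are illustrative and stand in for the scikit-learn calls used in this notebook.

```python
def micro_scores(y_true, y_pred):
    """Micro-averaged precision, recall, and F1 for binary label matrices."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1      # label present and predicted
            elif t == 0 and p == 1:
                fp += 1      # label predicted but absent
            elif t == 1 and p == 0:
                fn += 1      # label present but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1


# Toy data: 3 messages, 3 labels.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1], [1, 0, 0]]

precision, recall, f1 = micro_scores(y_true, y_pred)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```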

In [38]:
# Evaluate the best checkpoint on the validation set
results = trainer.evaluate(val_dataset)
print("Validation set evaluation results:", results)

Validation set evaluation results: {'eval_loss': 0.04240785166621208, 'eval_runtime': 24.1587, 'eval_samples_per_second': 43.463, 'eval_steps_per_second': 5.464, 'epoch': 8.0}


In [39]:
from sklearn.metrics import classification_report, precision_score, recall_score, f1_score

In [40]:
# Get predictions from the model
predictions = trainer.predict(test_dataset)

In [41]:
# Convert logits to probabilities (sigmoid for multi-label classification)
pred_probs = torch.sigmoid(torch.tensor(predictions.predictions)).numpy()

In [42]:
# Apply a threshold (e.g., 0.5) to convert probabilities to binary labels
pred_labels = (pred_probs > 0.5).astype(int)

In [43]:
# Get the true labels
true_labels = test_df['binary_labels'].tolist()

In [46]:
# Generate classification report
print("Classification Report:")
print(classification_report(true_labels, pred_labels, target_names=all_labels))

Classification Report:
                 precision    recall  f1-score   support

    Advertising       0.96      0.83      0.89       258
    Appointment       1.00      0.59      0.74        22
       Delivery       0.00      0.00      0.00         0
       Donation       0.00      0.00      0.00         0
      Education       0.60      0.27      0.38        22
      Emergency       0.00      0.00      0.00         1
          Event       0.00      0.00      0.00         1
        Expense       0.99      0.97      0.98       408
     Government       0.78      0.33      0.47        42
         Health       0.71      0.58      0.64        26
         Income       1.00      1.00      1.00        49
     Investment       0.00      0.00      0.00         1
    Investments       0.00      0.00      0.00         2
          Loans       0.00      0.00      0.00         0
Money/Financial       0.98      0.99      0.98       481
   Notification       0.85      0.84      0.85       302
       



In [47]:
# Calculate micro-averaged precision, recall, and F1 score across all labels
precision = precision_score(true_labels, pred_labels, average='micro')
recall = recall_score(true_labels, pred_labels, average='micro')
f1 = f1_score(true_labels, pred_labels, average='micro')

In [55]:
print(f" Precision: {precision:.4f}")
print(f" Recall: {recall:.4f}")
print(f" F1 Score: {f1:.4f}")

 Precision: 0.9370
 Recall: 0.8789
 F1 Score: 0.9070


---

## Step 7: **Save the Model**
We’ll save the fine-tuned model for future use.

### Explanation:
- Saving the model allows us to reuse it without retraining.
- We’ll save both the model and tokenizer to a directory.

In [52]:
# Save the model and tokenizer
model.save_pretrained("../Models/fine-tuned-modernbert_multilabel")
tokenizer.save_pretrained("../Models/fine-tuned-modernbert_multilabel")

('../Models/fine-tuned-modernbert_multilabel/tokenizer_config.json',
 '../Models/fine-tuned-modernbert_multilabel/special_tokens_map.json',
 '../Models/fine-tuned-modernbert_multilabel/tokenizer.json')

---

## Step 8: **Load the Fine-Tuned Model**
The fine-tuned model can be reloaded later from the saved directory.

### Explanation:
- We’ll load the saved model and tokenizer for inference or further training.

In [1]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("../Models/fine-tuned-modernbert_multilabel")
model = AutoModelForSequenceClassification.from_pretrained("../Models/fine-tuned-modernbert_multilabel")

In [3]:
config = model.config

if hasattr(config, "id2label"):
    labels = list(config.id2label.values())
    print("Model labels:", labels)
else:
    print("Labels not found in the model's configuration.")

Model labels: ['LABEL_0', 'LABEL_1', 'LABEL_2', 'LABEL_3', 'LABEL_4', 'LABEL_5', 'LABEL_6', 'LABEL_7', 'LABEL_8', 'LABEL_9', 'LABEL_10', 'LABEL_11', 'LABEL_12', 'LABEL_13', 'LABEL_14', 'LABEL_15', 'LABEL_16', 'LABEL_17', 'LABEL_18', 'LABEL_19', 'LABEL_20', 'LABEL_21', 'LABEL_22', 'LABEL_23', 'LABEL_24']
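The generic `LABEL_0` to `LABEL_24` names appear because no label names were stored in the model configuration before saving. A minimal sketch of building the two mappings from a sorted label list follows; the five names are only a subset used for illustration. Passing such dictionaries as `id2label` and `label2id` when loading the model for training should store them in the saved configuration, but that step is not exercised in this notebook and is stated here as an assumption.

```python
# Illustrative subset; the notebook's all_labels list has 25 entries.
label_names = sorted(["Expense", "Advertising", "Money/Financial", "Other", "Promotion"])

# Index-to-name and name-to-index mappings, in sorted label order.
id2label = {i: name for i, name in enumerate(label_names)}
label2id = {name: i for i, name in id2label.items()}

# Decode a multi-hot prediction vector back into readable names.
prediction = [0, 1, 1, 0, 0]
predicted_names = [id2label[i] for i, flag in enumerate(prediction) if flag == 1]

print(id2label)
print(predicted_names)
```

The label order must match the one used to build the binary label vectors (the sorted `all_labels` list), otherwise the decoded names will be wrong.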


---