In [31]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import ConfusionMatrixDisplay

# Naive Bayes Classifiers for Spam/Ham classification

## Problem Description

The goal is to build a machine learning model to classify emails as spam or not spam. 

## Data Description and Analyses

The dataset is taken from https://www.kaggle.com/datasets/ozlerhakan/spam-or-not-spam-dataset

In [32]:
# load the data
data = pd.read_csv('../../../datasets/SpamHam/spam_or_not_spam.csv')

# see first and last 3 rows
pd.concat([data.head(3), data.tail(3)])

| | email | label |
|---|---|---|
| 0 | date wed NUMBER aug NUMBER NUMBER NUMBER NUMB... | 0 |
| 1 | martin a posted tassos papadopoulos the greek ... | 0 |
| 2 | man threatens explosion in moscow thursday aug... | 0 |
| 2997 | thank you for shopping with us gifts for all ... | 1 |
| 2998 | the famous ebay marketing e course learn to s... | 1 |
| 2999 | hello this is chinese traditional 子 件 NUMBER世... | 1 |


In [33]:
# Count the occurrences of 0 and 1 in the 'label' column
label_counts = data['label'].value_counts()
label_counts

label
0    2500
1     500
Name: count, dtype: int64

In [34]:
# check for NA values
data.isna().sum()

email    1
label    0
dtype: int64

The dataset contains 3,000 entries: 2,500 labeled as Ham (not spam, label 0) and 500 labeled as Spam (label 1), a class ratio of roughly 5 to 1.
One row has a missing value in the 'email' column; it is removed before the text is vectorized.

In [35]:
# remove row with NA in email:
data.dropna(subset=['email'], inplace=True)

## Split the data into training and testing sets

In [36]:
X_train, X_test, y_train, y_test = train_test_split(data['email'], data['label'], test_size=0.2, random_state=42)

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(2399,)
(2399,)
(600,)
(600,)
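
Because Ham outnumbers Spam roughly 5 to 1, a purely random split can shift the class ratio between the training and test sets. One option, not used in the cell above, is the `stratify` argument of `train_test_split`. The following is a minimal sketch on a made-up toy frame; the `toy` data and the variable names are illustrative assumptions, not the notebook's dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with the same 5:1 Ham/Spam ratio as the real dataset
toy = pd.DataFrame({
    "email": [f"ham message {i}" for i in range(25)]
             + [f"spam message {i}" for i in range(5)],
    "label": [0] * 25 + [1] * 5,
})

# stratify keeps the label proportions the same in both splits
tr_X, te_X, tr_y, te_y = train_test_split(
    toy["email"], toy["label"],
    test_size=0.2, random_state=42, stratify=toy["label"],
)

# Count of each label in the test split
print(te_y.value_counts().to_dict())
```

With stratification, the 6-row test split of this toy frame holds 5 Ham and 1 Spam, matching the 5:1 ratio of the full frame.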


## Convert text data into TF-IDF features

The TF-IDF vectorizer must be fitted only on the training data to avoid data leakage: the vocabulary and IDF weights are then learned without any information from the test set. This mimics real-world scenarios where the model only has access to training data and must generalize to unseen data.

In [37]:
vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1,2))
vectorizer.fit(X_train)

X_train = vectorizer.transform(X_train)
X_test = vectorizer.transform(X_test)
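
The fit-on-training-data-only rule can also be enforced structurally by wrapping the vectorizer and the classifier in a scikit-learn `Pipeline`, so that calling `fit` only ever sees the training texts. This is an alternative sketch, not the approach used in this notebook; the toy corpus and its labels are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus (1 = Spam, 0 = Ham), for illustration only
train_texts = [
    "win free money now",
    "free prize claim now",
    "meeting schedule for monday",
    "project report attached",
]
train_labels = [1, 1, 0, 0]
test_texts = ["claim free money", "monday project meeting"]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", ngram_range=(1, 2))),
    ("nb", MultinomialNB(alpha=0.1)),
])

# fit() learns the vocabulary and IDF weights from the training texts only
pipe.fit(train_texts, train_labels)

# predict() reuses the fitted vectorizer; it is never refitted on test texts
preds = pipe.predict(test_texts)
print(preds)
```

A pipeline like this can also be passed directly to cross-validation utilities, which refit the vectorizer inside each fold.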

## Train the model

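We will use the MultinomialNB classifier from sklearn. This model is well-suited for text classification problems where the features are word counts or frequencies. The multinomial distribution formally assumes integer counts, but fractional values such as TF-IDF weights also work in practice.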

In [38]:
model = MultinomialNB(alpha=0.1)
model.fit(X_train, y_train)
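
The `alpha` argument is the additive smoothing parameter: it is added to every per-class feature total so that a term never seen in one class does not get zero probability. The sketch below checks this on a tiny made-up count matrix by recomputing the smoothed log-probabilities by hand and comparing them with the fitted `feature_log_prob_` attribute. The matrix `X_toy` and labels `y_toy` are illustrative assumptions.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical feature matrix: 4 documents, 3 terms
X_toy = np.array([[2, 1, 0],
                  [1, 0, 0],
                  [0, 1, 3],
                  [0, 0, 2]])
y_toy = np.array([0, 0, 1, 1])
alpha = 0.1

nb = MultinomialNB(alpha=alpha).fit(X_toy, y_toy)

# Per-class totals of each feature, one row per class
counts = np.vstack([X_toy[y_toy == c].sum(axis=0) for c in nb.classes_])

# Smoothed estimate: (count + alpha) / (class total + alpha * n_features)
n_features = X_toy.shape[1]
manual = np.log(
    (counts + alpha) / (counts.sum(axis=1, keepdims=True) + alpha * n_features)
)

matches = np.allclose(manual, nb.feature_log_prob_)
print(matches)
```

A smaller `alpha` trusts the observed counts more, while a larger one pulls all term probabilities toward a uniform distribution.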

## Test and evaluate the model

In [39]:
# Make predictions on the test set
y_pred = model.predict(X_test)

In [40]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)       # Calculate the accuracy
report = classification_report(y_test, y_pred)  # Get the precision, recall, f1-score

print(f'Accuracy: {accuracy:.2f}')
print('Classification Report:')
print(report)

Accuracy: 0.98
Classification Report:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       500
           1       0.99      0.91      0.95       100

    accuracy                           0.98       600
   macro avg       0.99      0.95      0.97       600
weighted avg       0.98      0.98      0.98       600
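
The precision and recall columns in the report follow directly from confusion-matrix counts. As a sketch of how they relate, the example below uses a small set of made-up labels (not this notebook's predictions) and treats class 1 (Spam) as the positive class.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for illustration: 0 = Ham, 1 = Spam
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# With labels=[0, 1], ravel() yields counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1]).ravel()

precision_spam = tp / (tp + fp)   # of emails predicted Spam, share truly Spam
recall_spam = tp / (tp + fn)      # of actual Spam, share that was caught
precision_ham = tn / (tn + fn)    # of emails predicted Ham, share truly Ham
recall_ham = tn / (tn + fp)       # of actual Ham, share kept as Ham
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Cross-check the hand-computed Spam metrics against scikit-learn
sk_precision = precision_score(y_true_toy, y_pred_toy, pos_label=1)
sk_recall = recall_score(y_true_toy, y_pred_toy, pos_label=1)

print(precision_spam, recall_spam, precision_ham, recall_ham, accuracy)
```

For the real test set, `ConfusionMatrixDisplay.from_predictions(y_test, y_pred)` would plot the same four counts.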



## Analyses and Conclusion

1. **Accuracy = 0.98**:
    - The model correctly classifies 98% of the test emails overall.

1. **Precision for class 0 (Ham) = 0.98**:
    - Out of all emails predicted as Ham, 98% were actually Ham.

1. **Precision for class 1 (Spam) = 0.99**:
    - Out of all emails predicted as Spam, 99% were actually Spam.

1. **Recall for class 0 (Ham) = 1.00**:
    - Out of all actual Ham emails, essentially all (1.00 after rounding to two decimals) were correctly identified as Ham.

1. **Recall for class 1 (Spam) = 0.91**:
    - Out of all actual Spam emails, 91% were correctly identified as Spam.


### Conclusion

The model is excellent at detecting Ham (precision 0.98, recall 1.00) and rarely mislabels Ham as Spam (Spam precision 0.99). Its main weakness is Spam recall: about 9% of the Spam emails in the test set are missed and classified as Ham (recall 0.91 for class 1).
