# Email Classification (Ham-Spam)

**Dataset Link:** https://www.kaggle.com/datasets/prishasawhney/email-classification-ham-spam (also available in the **dataset folder**).

## About dataset
The Email Classification dataset is synthetically generated and has two columns:
* Email: The text of the email or message, covering a diverse range of vocabulary and language patterns.
* Label: Each email is classified as either Ham (non-spam) or Spam.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score



In [2]:
data = pd.read_csv("../../pra_datasets/email_classification.csv")
data.head()

| | email | label |
|---|---|---|
| 0 | Upgrade to our premium plan for exclusive acce... | ham |
| 1 | Happy holidays from our team! Wishing you joy ... | ham |
| 2 | We're hiring! Check out our career opportuniti... | ham |
| 3 | Your Amazon account has been locked. Click her... | spam |
| 4 | Your opinion matters! Take our survey and help... | ham |


## Exploratory Data Analysis

In [3]:
# Get count of null values
print(data.isnull().sum())

email    0
label    0
dtype: int64


In [4]:
# Count the number of emails per label
print(data['label'].value_counts())

label
ham     100
spam     79
Name: count, dtype: int64


## Email Length by Label

In [5]:
# Compute email lengths (in characters) and summarize them per label
data['email_length'] = data['email'].apply(len)
print(data.groupby('label')['email_length'].describe())

       count       mean        std   min   25%   50%   75%    max
label                                                            
ham    100.0  78.680000   9.073082  52.0  73.0  79.0  85.0   98.0
spam    79.0  74.746835  13.012290  56.0  64.0  73.0  84.0  107.0


<div class="alert alert-block alert-success">
<b>The mean lengths of ham and spam emails differ only slightly (about 78.7 vs. 74.7 characters), and the two length ranges overlap heavily. 
This suggests that email length alone is unlikely to be a strong predictor of whether an email is ham or spam. </b>
</div>
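
One way to put a number on "slight" is a standardized mean difference (Cohen's d) computed from the summary statistics printed above. The sketch below is only an illustration, not part of the original analysis: the helper name `cohens_d` is hypothetical, and the pooled standard deviation assumes the `std` values are sample standard deviations, which is what pandas' `describe()` reports.

```python
import math

# Summary statistics copied from the groupby output above.
ham = {"n": 100, "mean": 78.680000, "std": 9.073082}
spam = {"n": 79, "mean": 74.746835, "std": 13.012290}

def cohens_d(a, b):
    """Standardized mean difference using the pooled sample standard deviation."""
    pooled_var = (
        (a["n"] - 1) * a["std"] ** 2 + (b["n"] - 1) * b["std"] ** 2
    ) / (a["n"] + b["n"] - 2)
    return (a["mean"] - b["mean"]) / math.sqrt(pooled_var)

d = cohens_d(ham, spam)
print(f"Mean difference: {ham['mean'] - spam['mean']:.2f} characters")
print(f"Cohen's d: {d:.2f}")
```

A value well below 1 means the gap between the means is smaller than the typical spread within each class, which is consistent with the observation that length alone separates the two classes poorly.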


## Applying Bayes' Theorem

In [6]:
# Preprocess text data: lowercase the emails.
# CountVectorizer also lowercases by default, so this step only makes it explicit.
data['email'] = data['email'].str.lower()

In [7]:
# Split dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(data['email'], data['label'], test_size=0.2, random_state=42)

In [8]:
# Create bag-of-words representation
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)
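
To make the bag-of-words step concrete, here is a minimal standard-library sketch of roughly what happens above: a vocabulary is learned from the training emails only, and each email becomes a row of word counts. The functions `tokenize`, `fit_vocabulary`, and `transform` are hypothetical helpers, the two training emails are made up, and the tokenization only approximates `CountVectorizer`'s default behavior.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep runs of two or more word characters,
    # which approximates CountVectorizer's default tokenization.
    return re.findall(r"\b\w\w+\b", text.lower())

def fit_vocabulary(docs):
    # The vocabulary is learned from the training documents only.
    words = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {word: idx for idx, word in enumerate(words)}

def transform(docs, vocab):
    # One row of counts per document; words outside the vocabulary are ignored.
    rows = []
    for doc in docs:
        counts = Counter(tok for tok in tokenize(doc) if tok in vocab)
        rows.append([counts.get(word, 0) for word in vocab])
    return rows

# Made-up emails for illustration, not taken from the dataset.
train_docs = ["claim your free prize now", "team meeting moved to friday"]
test_docs = ["free free prize inside"]

vocab = fit_vocabulary(train_docs)
train_counts = transform(train_docs, vocab)
test_counts = transform(test_docs, vocab)

print(list(vocab))
print(test_counts)
```

This is why the notebook calls `fit_transform` on the training set but only `transform` on the test set: a word such as "inside" that never appeared in training has no column and is dropped.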

In [9]:
# Train Naive Bayes classifier
classifier = MultinomialNB()
classifier.fit(X_train_counts, y_train)

In [10]:
# Predict on test set
y_pred = classifier.predict(X_test_counts)

In [11]:
# Evaluate performance
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred))

Accuracy: 1.0
Classification Report:
              precision    recall  f1-score   support

         ham       1.00      1.00      1.00        14
        spam       1.00      1.00      1.00        22

    accuracy                           1.00        36
   macro avg       1.00      1.00      1.00        36
weighted avg       1.00      1.00      1.00        36



### How Bayes' Theorem is applied within the Multinomial Naive Bayes (MNB) classifier

**Training the Classifier:**
* The MNB classifier estimates the prior probability of each class (ham or spam) from how often each label occurs in the training data.
* It estimates the likelihood of each word given the class label, P(word | class), from the word counts in that class's training emails.
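
The training step can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than scikit-learn's actual code: `train_multinomial_nb` is a hypothetical helper, the four tokenized emails are made up, and `alpha=1.0` (Laplace smoothing) is chosen to mirror the default of `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized docs."""
    vocab = sorted({tok for doc in docs for tok in doc})
    class_docs = Counter(labels)

    # Count how often each word appears in each class.
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    # Prior: fraction of training emails in each class.
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}

    # Likelihood: smoothed fraction of the class's words that are this word.
    log_likelihood = {}
    for c in class_docs:
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / denom) for w in vocab
        }
    return log_prior, log_likelihood

# Made-up tokenized emails for illustration.
docs = [["free", "prize", "now"], ["free", "offer"],
        ["meeting", "friday"], ["team", "meeting", "now"]]
labels = ["spam", "spam", "ham", "ham"]

log_prior, log_likelihood = train_multinomial_nb(docs, labels)
print("P(spam) =", math.exp(log_prior["spam"]))
print("P(free | spam) =", math.exp(log_likelihood["spam"]["free"]))
print("P(free | ham)  =", math.exp(log_likelihood["ham"]["free"]))
```

Smoothing keeps a word that never occurs in one class (here "free" in ham) from getting a likelihood of zero, which would otherwise wipe out that class for any email containing the word.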

**Predicting on Test Set:**
* After training, the classifier predicts labels for the test set.
* For each test email, it uses Bayes' Theorem to combine the class prior with the likelihoods of the observed words, treating the words as conditionally independent given the class (the "naive" assumption), to obtain the probability of each class (ham or spam).
* The class with the highest probability is assigned as the predicted label for that email.
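
The prediction step can be sketched like this. The priors and word likelihoods below are hypothetical numbers picked for illustration (they are not learned from the dataset), and `predict` is a hypothetical helper. Working in log space avoids numerical underflow when many small probabilities are multiplied.

```python
import math

# Hypothetical model parameters, chosen only for illustration.
log_prior = {"ham": math.log(0.6), "spam": math.log(0.4)}
log_likelihood = {
    "ham": {"free": math.log(0.05), "meeting": math.log(0.60),
            "prize": math.log(0.05), "now": math.log(0.30)},
    "spam": {"free": math.log(0.40), "meeting": math.log(0.05),
             "prize": math.log(0.35), "now": math.log(0.20)},
}

def predict(tokens):
    # Unnormalized log posterior: log P(class) + sum of log P(word | class).
    # Words the model has never seen are skipped.
    scores = {
        c: log_prior[c]
        + sum(log_likelihood[c][t] for t in tokens if t in log_likelihood[c])
        for c in log_prior
    }
    # Divide by the evidence P(words) using a log-sum-exp for stability.
    top = max(scores.values())
    log_evidence = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    posterior = {c: math.exp(s - log_evidence) for c, s in scores.items()}
    # The predicted label is the class with the highest posterior.
    return max(posterior, key=posterior.get), posterior

label, posterior = predict(["free", "prize", "now"])
print(label, posterior)
```

Because the evidence term is the same for every class, it does not change which class wins; it is only needed when the actual posterior probabilities are wanted.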
