
## Loading Essential Python Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

## Loading Dataset

In [None]:
df=pd.read_csv('/content/email_spam_synthetic.csv')
df.head()

,id,text,label
0,EM101782,limited time offer project update support help...,spam
1,EM103917,client feedback summary verify limited team,ham
2,EM100221,unlock bonus today welcome schedule welcome ca...,spam
3,EM102135,exclusive discount coupon limited voucher veri...,spam
4,EM105224,deployment was successful discount invoice ple...,ham


## Checking Dataset

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6000 entries, 0 to 5999
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      6000 non-null   object
 1   text    6000 non-null   object
 2   label   6000 non-null   object
dtypes: object(3)
memory usage: 140.8+ KB


In [None]:
df.describe()

,id,text,label
count,6000,6000,6000
unique,6000,6000,2
top,EM100769,exclusive discount coupon welcome approved hel...,ham
freq,1,1,3600


## Data Cleaning and Preprocessing

## Handling Missing Data:

In [None]:
# Check for null or empty values
print("Missing values before handling:")
print(df.isnull().sum())

# Assuming 'text' is the column containing the email text
# Removing rows with missing text (if any)
df.dropna(subset=['text'], inplace=True)

print("\nMissing values after handling:")
print(df.isnull().sum())

Missing values before handling:
id       0
text     0
label    0
dtype: int64

Missing values after handling:
id       0
text     0
label    0
dtype: int64


## Remove Unnecessary Columns

In [None]:
# Remove unnecessary columns (assuming 'id' is unnecessary)
df.drop('id', axis=1, inplace=True)

## Label Encoding

In [None]:

# Convert 'spam' to 1 and 'ham' to 0
df['label'] = df['label'].map({'spam': 1, 'ham': 0})

print("\nDataFrame after cleaning and encoding:")
display(df.head())


DataFrame after cleaning and encoding:


,text,label
0,limited time offer project update support help...,1
1,client feedback summary verify limited team,0
2,unlock bonus today welcome schedule welcome ca...,1
3,exclusive discount coupon limited voucher veri...,1
4,deployment was successful discount invoice ple...,0


## Text Preprocessing

## Lowercasing:

In [None]:
import re

# Lowercasing
df['text'] = df['text'].str.lower()


## Removing Punctuation and Special Characters

In [None]:

df['text'] = df['text'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s]', '', x))

# Removing extra whitespace
df['text'] = df['text'].apply(lambda x: re.sub(r'\s+', ' ', x).strip())

print("\nDataFrame after text preprocessing:")
display(df.head())


DataFrame after text preprocessing:


,text,label
0,limited time offer project update support help...,1
1,client feedback summary verify limited team,0
2,unlock bonus today welcome schedule welcome ca...,1
3,exclusive discount coupon limited voucher veri...,1
4,deployment was successful discount invoice please,0


### Tokenization (Conceptual)
Tokenization is the process of breaking text into individual words, or tokens. This notebook does not tokenize explicitly, because `TfidfVectorizer` handles tokenization internally, but it remains an important concept in text preprocessing.
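To make the idea concrete, here is a minimal sketch of word-level tokenization using only the standard library. The regular expression is the one scikit-learn documents as the default `token_pattern` of `TfidfVectorizer` (runs of two or more word characters). The `tokenize` helper and the sample sentence are illustrative and not part of the notebook.

```python
import re

# Same regular expression as TfidfVectorizer's documented default token_pattern:
# a token is a run of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into word tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text.lower())

sample = "Limited time offer: verify your account in 2 days!"
tokens = tokenize(sample)
print(tokens)
```

Note that single-character tokens such as `2` are dropped by this pattern, and punctuation never becomes part of a token.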

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TfidfVectorizer with stop word removal
tfidf_vectorizer = TfidfVectorizer(stop_words='english')

## Feature Extraction

## Why TF-IDF?
Machine learning models cannot directly interpret text data. They require numerical input. Therefore, we need techniques to convert text into a numerical representation. TF-IDF (Term Frequency–Inverse Document Frequency) is a popular technique for this purpose.

TF-IDF creates a numerical matrix where each row represents a document (in this case, an email) and each column represents a unique word in the corpus. The value in each cell represents the importance of that word in that specific document relative to the entire set of documents.

*   **Term Frequency (TF):** Measures how frequently a term appears in a document.
*   **Inverse Document Frequency (IDF):** Measures how rare a term is across the entire corpus. Words that appear in many documents (like stop words) have a lower IDF and are down-weighted. Words that appear in only a few documents have a higher IDF and are given more weight.

By using TF-IDF, we convert the text data into a numerical format that machine learning models can understand, while also highlighting words that are more discriminative for classification (like words that appear frequently in spam but rarely in ham).
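The weighting can be worked through by hand on a toy corpus. The sketch below is a simplified illustration, not the notebook's code: it uses the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` and L2 normalisation, which are the defaults scikit-learn documents for `TfidfVectorizer`. The three documents and the helper names (`idf`, `tfidf`) are made up for the example.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration; not taken from the dataset).
docs = [
    "free offer free voucher",
    "project meeting agenda",
    "free project update",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(tokens):
    """Return L2-normalised TF-IDF weights for one tokenised document."""
    counts = Counter(tokens)
    raw = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

weights = tfidf(tokenized[0])
for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:8s} {w:.3f}")
```

In the first document, "free" occurs twice but also appears in another document, so its IDF is lower than that of "offer" or "voucher", which occur in this document only. Its higher term frequency still gives it the largest final weight.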

In [None]:
# Vectorize the text data using TfidfVectorizer
# Initialize the vectorizer with parameters like max_features and stop_words
tfidf_vectorizer = TfidfVectorizer(max_features=2000, stop_words='english')

# Use fit_transform on the text data to create the TF-IDF features
X = tfidf_vectorizer.fit_transform(df['text'])

# Display the shape of the feature matrix
print("Shape of TF-IDF feature matrix:", X.shape)

Shape of TF-IDF feature matrix: (6000, 105)


### Resulting Features

After using `TfidfVectorizer`, the result is a sparse matrix. In this matrix:

*   Each **row** represents an individual email.
*   Each **column** corresponds to a unique word (token) from the vocabulary extracted by the vectorizer.
*   The **values** within the matrix are the TF-IDF scores, indicating the importance of each word in each email relative to the entire dataset.

The matrix is typically sparse because most emails will only contain a small subset of the total vocabulary. This numerical representation, `X`, is now ready to be used as input for machine learning classification models.

## Splitting Data into Training and Test Sets

In [None]:
from sklearn.model_selection import train_test_split

# Split data into training and testing sets (e.g., 80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, df['label'], test_size=0.2, random_state=42)

# Print the shapes of the resulting sets
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (4800, 105)
Shape of X_test: (1200, 105)
Shape of y_train: (4800,)
Shape of y_test: (1200,)


### Reason for Splitting Data

We split the data into training and test sets to evaluate how well our machine learning model will perform on unseen, new data.

*   The **training set** is used to train the model, allowing it to learn the patterns and relationships between the text features and the spam/ham labels.
*   The **test set** is held out and is not used during the training process. After the model is trained, we use the test set to evaluate its performance. This gives us an unbiased estimate of how well the model is likely to generalize to real-world email data it hasn't encountered before.

By evaluating on unseen data, we get a more realistic estimate of the model's accuracy and can detect overfitting, which is when a model performs very well on the training data but poorly on new data.
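One caveat about the workflow above: `fit_transform` was called on all 6,000 emails before the split, so the test emails also contributed to the vocabulary and IDF weights. A common way to keep the test set fully unseen is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline. The sketch below shows that pattern on a tiny made-up corpus standing in for `df['text']` and `df['label']`. It also uses `stratify` and `cross_val_score`, neither of which the notebook applies, so treat it as one possible variation rather than the notebook's method.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus standing in for df['text'] and df['label'].
texts = [
    "free trial offer today", "exclusive discount coupon", "unlock bonus voucher",
    "limited time offer free", "claim free bonus today", "discount voucher limited",
    "project update meeting", "client feedback summary", "deployment was successful",
    "invoice for the project", "team meeting agenda", "server report schedule",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Split the raw text first; stratify keeps the spam/ham ratio similar in both splits.
train_texts, test_texts, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The vectorizer is part of the pipeline, so it is fitted on training text only.
pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(train_texts, y_tr)
test_accuracy = pipeline.score(test_texts, y_te)

# Cross-validation refits the whole pipeline (vectorizer included) on each fold.
cv_scores = cross_val_score(pipeline, train_texts, y_tr, cv=3)
print(f"Held-out accuracy: {test_accuracy:.2f}")
print(f"3-fold CV accuracy: {cv_scores.mean():.2f}")
```

Cross-validated scores on the training portion give a second, less split-dependent estimate of generalization, which is useful when a single test split reports a perfect score.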

## Model Training – Naive Bayes and Logistic Regression

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

# Initialize the classifiers
mnb_model = MultinomialNB()
lr_model = LogisticRegression(max_iter=1000) # Increase max_iter for convergence

In [None]:
# Train the Multinomial Naive Bayes model
mnb_model.fit(X_train, y_train)

print("Multinomial Naive Bayes model trained successfully.")

Multinomial Naive Bayes model trained successfully.


In [None]:
# Train the Logistic Regression model
lr_model.fit(X_train, y_train)

print("Logistic Regression model trained successfully.")

Logistic Regression model trained successfully.


## Model Evaluation

## Predictions

In [None]:
from sklearn.metrics import accuracy_score

# Predictions for Multinomial Naive Bayes
y_pred_nb = mnb_model.predict(X_test)

# Predictions for Logistic Regression
y_pred_lr = lr_model.predict(X_test)

## Accuracy

In [None]:
# Calculate Accuracy for Multinomial Naive Bayes
accuracy_nb = accuracy_score(y_test, y_pred_nb)
print(f"Accuracy of Multinomial Naive Bayes: {accuracy_nb:.4f}")

# Calculate Accuracy for Logistic Regression
accuracy_lr = accuracy_score(y_test, y_pred_lr)
print(f"Accuracy of Logistic Regression: {accuracy_lr:.4f}")

Accuracy of Multinomial Naive Bayes: 0.9600
Accuracy of Logistic Regression: 1.0000


## Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix

# Confusion Matrix for Multinomial Naive Bayes
conf_matrix_nb = confusion_matrix(y_test, y_pred_nb)
print("Confusion Matrix for Multinomial Naive Bayes:")
print(conf_matrix_nb)

# Confusion Matrix for Logistic Regression
conf_matrix_lr = confusion_matrix(y_test, y_pred_lr)
print("\nConfusion Matrix for Logistic Regression:")
print(conf_matrix_lr)

# Explanation of Confusion Matrix terms:
print("\nConfusion Matrix Terms:")
print("- True Positives (TP): Spam correctly identified as spam.")
print("- True Negatives (TN): Ham correctly identified as ham.")
print("- False Positives (FP): Ham incorrectly labeled as spam (false alarm).")
print("- False Negatives (FN): Spam incorrectly labeled as ham (missed spam).")

# Assuming the confusion matrix is structured as:
# [[TN, FP],
#  [FN, TP]]

tn_nb, fp_nb, fn_nb, tp_nb = conf_matrix_nb.ravel()
print(f"\nMultinomial Naive Bayes:")
print(f"  TP: {tp_nb}, TN: {tn_nb}, FP: {fp_nb}, FN: {fn_nb}")

tn_lr, fp_lr, fn_lr, tp_lr = conf_matrix_lr.ravel()
print(f"\nLogistic Regression:")
print(f"  TP: {tp_lr}, TN: {tn_lr}, FP: {fp_lr}, FN: {fn_lr}")

Confusion Matrix for Multinomial Naive Bayes:
[[701   0]
 [ 48 451]]

Confusion Matrix for Logistic Regression:
[[701   0]
 [  0 499]]

Confusion Matrix Terms:
- True Positives (TP): Spam correctly identified as spam.
- True Negatives (TN): Ham correctly identified as ham.
- False Positives (FP): Ham incorrectly labeled as spam (false alarm).
- False Negatives (FN): Spam incorrectly labeled as ham (missed spam).

Multinomial Naive Bayes:
  TP: 451, TN: 701, FP: 0, FN: 48

Logistic Regression:
  TP: 499, TN: 701, FP: 0, FN: 0


## Precision, Recall, F1-Score

In [None]:
from sklearn.metrics import classification_report

# Classification Report for Multinomial Naive Bayes
print("\nClassification Report for Multinomial Naive Bayes:")
print(classification_report(y_test, y_pred_nb))

# Classification Report for Logistic Regression
print("\nClassification Report for Logistic Regression:")
print(classification_report(y_test, y_pred_lr))

# Explanation of Precision, Recall, and F1-Score (focus on Spam class - label 1):
print("\nExplanation of Metrics (focus on Spam class - label 1):")
print("- Precision (Spam): Of all the emails predicted as spam, what percentage were actually spam?")
print("  (TP / (TP + FP))")
print("- Recall (Spam): Of all the actual spam emails, what percentage did the model correctly identify?")
print("  (TP / (TP + FN))")
print("- F1-Score (Spam): The harmonic mean of Precision and Recall, providing a single metric balance between them.")
print("  (2 * (Precision * Recall) / (Precision + Recall))")


Classification Report for Multinomial Naive Bayes:
              precision    recall  f1-score   support

           0       0.94      1.00      0.97       701
           1       1.00      0.90      0.95       499

    accuracy                           0.96      1200
   macro avg       0.97      0.95      0.96      1200
weighted avg       0.96      0.96      0.96      1200


Classification Report for Logistic Regression:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       701
           1       1.00      1.00      1.00       499

    accuracy                           1.00      1200
   macro avg       1.00      1.00      1.00      1200
weighted avg       1.00      1.00      1.00      1200


Explanation of Metrics (focus on Spam class - label 1):
- Precision (Spam): Of all the emails predicted as spam, what percentage were actually spam?
  (TP / (TP + FP))
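- Recall (Spam): Of all the actual spam emails, what percentage did the model correctly identify?
  (TP / (TP + FN))
- F1-Score (Spam): The harmonic mean of Precision and Recall, providing a single metric balance between them.
  (2 * (Precision * Recall) / (Precision + Recall))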
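As a cross-check, the spam-class figures in the classification reports can be recomputed directly from the confusion-matrix counts printed earlier (TP, TN, FP, FN). The `spam_metrics` helper below is a hypothetical convenience function written for this check, not something the notebook defines.

```python
def spam_metrics(tp, tn, fp, fn):
    """Compute accuracy plus precision, recall and F1 for the positive (spam) class."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Counts reported in the confusion-matrix output for Multinomial Naive Bayes.
nb = spam_metrics(tp=451, tn=701, fp=0, fn=48)
# Counts reported in the confusion-matrix output for Logistic Regression.
lr = spam_metrics(tp=499, tn=701, fp=0, fn=0)

for name, m in (("Naive Bayes", nb), ("Logistic Regression", lr)):
    print(name, {k: round(v, 2) for k, v in m.items()})
```

For Naive Bayes this reproduces the reported spam-class values: precision 1.00, recall 0.90 (451 of 499 spam emails caught), F1 0.95, and overall accuracy 0.96.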

## Model Comparison
Based on the evaluation metrics:

*   **Accuracy:** Logistic Regression achieved a perfect accuracy of 1.0000 on the test set, while Multinomial Naive Bayes achieved 0.9600.
*   **Confusion Matrix:** Multinomial Naive Bayes had 48 False Negatives (spam incorrectly classified as ham), while Logistic Regression had 0. Both models had 0 False Positives (ham incorrectly classified as spam).
*   **Classification Report:** Logistic Regression shows perfect Precision, Recall, and F1-score for both classes (ham and spam). Multinomial Naive Bayes has perfect Precision for the spam class (1.00) but lower Recall (0.90), meaning it missed roughly one in ten spam emails.

Summary: On this dataset, Logistic Regression outperformed Multinomial Naive Bayes by four percentage points of accuracy and achieved perfect scores on every evaluation metric. This suggests Logistic Regression captured the patterns in the data more effectively. Because the dataset is synthetic (`email_spam_synthetic.csv`) and yields a vocabulary of only 105 terms, perfect scores here should not be expected to carry over to real-world email. Logistic Regression may also be more computationally intensive than Naive Bayes on very large datasets.

## Analysis and Discussion

In [None]:
# Error Analysis for Multinomial Naive Bayes

# Find indices of false negatives (actual spam, predicted ham)
fn_indices_nb = [i for i, (actual, predicted) in enumerate(zip(y_test, y_pred_nb)) if actual == 1 and predicted == 0]

# Find indices of false positives (actual ham, predicted spam)
fp_indices_nb = [i for i, (actual, predicted) in enumerate(zip(y_test, y_pred_nb)) if actual == 0 and predicted == 1]

print("Examples of False Negatives (Multinomial Naive Bayes):")
for i in fn_indices_nb[:5]: # Displaying up to 5 examples
    # y_test.index holds row labels, so look the text up with .loc (label-based), not .iloc (position-based)
    print(f"  Email: {df.loc[y_test.index[i], 'text']}")
    print(f"  Actual Label: {y_test.iloc[i]}, Predicted Label: {y_pred_nb[i]}")
    print("-" * 20)

print("\nExamples of False Positives (Multinomial Naive Bayes):")
for i in fp_indices_nb[:5]: # Displaying up to 5 examples
    print(f"  Email: {df.loc[y_test.index[i], 'text']}")
    print(f"  Actual Label: {y_test.iloc[i]}, Predicted Label: {y_pred_nb[i]}")
    print("-" * 20)

# Error Analysis for Logistic Regression
# Since Logistic Regression had perfect accuracy, there should be no false negatives or false positives.
# We can confirm this by checking the counts.
print("\nError Analysis for Logistic Regression:")
if len(conf_matrix_lr.ravel()) == 4:
    tn_lr, fp_lr, fn_lr, tp_lr = conf_matrix_lr.ravel()
    print(f"False Negatives (Logistic Regression): {fn_lr}")
    print(f"False Positives (Logistic Regression): {fp_lr}")
else:
    print("Could not retrieve TN, FP, FN, TP from Logistic Regression confusion matrix.")

if fn_lr == 0 and fp_lr == 0:
    print("Logistic Regression had no false negatives or false positives on the test set.")

Examples of False Negatives (Multinomial Naive Bayes):
  Email: update your account report offer project support project
  Actual Label: 1, Predicted Label: 0
--------------------
  Email: free trial project tomorrow today
  Actual Label: 1, Predicted Label: 0
--------------------
  Email: free trial trial limited tomorrow server tomorrow server welcome
  Actual Label: 1, Predicted Label: 0
--------------------
  Email: free trial document team offer welcome payment document agenda limited report
  Actual Label: 1, Predicted Label: 0
--------------------
  Email: update your account today today update meeting limited free invoice support call
  Actual Label: 1, Predicted Label: 0
--------------------

Examples of False Positives (Multinomial Naive Bayes):

Error Analysis for Logistic Regression:
False Negatives (Logistic Regression): 0
False Positives (Logistic Regression): 0
Logistic Regression had no false negatives or false positives on the test set.
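A quick way to look for a pattern in these errors is to count which tokens recur across the misclassified emails. The sketch below does this for the five false negatives printed above, using only the standard library. It covers just those five examples, not all 48 false negatives, so any pattern it shows is only a hint.

```python
from collections import Counter

# The five false negatives printed above (spam that Naive Bayes labelled ham).
false_negatives = [
    "update your account report offer project support project",
    "free trial project tomorrow today",
    "free trial trial limited tomorrow server tomorrow server welcome",
    "free trial document team offer welcome payment document agenda limited report",
    "update your account today today update meeting limited free invoice support call",
]

# Count in how many of these emails each token appears (document frequency).
token_doc_freq = Counter(
    token for email in false_negatives for token in set(email.split())
)

for token, count in token_doc_freq.most_common(5):
    print(f"{token:10s} appears in {count} of {len(false_negatives)} emails")
```

In this small sample, spam-like phrases such as "free trial" and "update your account" sit alongside work-related words such as "project", "server", and "meeting". That mix may be what pushes Naive Bayes toward the ham class.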


## Potential Improvements
Based on the analysis, here are some potential ways to further improve the spam detection model:

*   **Collect More Data or a More Diverse Dataset:** While the current dataset yielded perfect results with Logistic Regression, real-world email datasets can be more complex and varied. Using a larger and more diverse dataset, including a wider range of spam and ham examples, can help the model generalize better to unseen emails.
*   **Explore Advanced Text Preprocessing:**
    *   **Lemmatization:** Implement lemmatization to reduce words to their base form, which can help group similar words and potentially improve feature representation.
    *   **Handling Special Tokens:** Consider specific handling for URLs, numbers, email addresses, or other special tokens that might carry important information for spam detection. Replacing them with generic tokens (e.g., `__URL__`, `__NUMBER__`) can reduce vocabulary size and capture patterns.
*   **Experiment with Other Algorithms and Hyperparameter Tuning:**
    *   **Support Vector Machines (SVM):** SVMs are powerful classifiers that can work well with high-dimensional data like TF-IDF features.
    *   **Ensemble Methods:** Techniques like Random Forests or Gradient Boosting (e.g., XGBoost, LightGBM) can combine the predictions of multiple models to potentially achieve higher accuracy.
    *   **Hyperparameter Tuning:** For Logistic Regression, tuning hyperparameters like the regularization strength (`C`) and penalty type (`l1`, `l2`) could potentially improve performance on more challenging datasets. For Naive Bayes, exploring different
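The special-token handling mentioned above can be sketched with a few regular expressions. This is a minimal illustration, not a complete solution: the patterns are deliberately simple, `replace_special_tokens` is a hypothetical helper, and `__EMAIL__` is an extra placeholder added here alongside the `__URL__` and `__NUMBER__` tokens named in the text.

```python
import re

# Order matters: match URLs and email addresses before bare numbers.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
NUMBER_RE = re.compile(r"\b\d+(?:[.,]\d+)*\b")

def replace_special_tokens(text):
    """Swap URLs, email addresses and numbers for generic placeholder tokens."""
    text = URL_RE.sub(" __URL__ ", text)
    text = EMAIL_RE.sub(" __EMAIL__ ", text)
    text = NUMBER_RE.sub(" __NUMBER__ ", text)
    # Collapse the extra spaces introduced around the placeholders.
    return re.sub(r"\s+", " ", text).strip()

example = "Claim 50 vouchers at http://example.com/win or mail promo@example.com"
print(replace_special_tokens(example))
```

In this notebook such a step would have to run before the punctuation-removal step. That step's pattern `[^a-zA-Z0-9\s]` deletes the characters that make URLs and addresses recognisable, and it would also strip the underscores from the placeholders.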

## Value to the Marketing Team
A robust spam detection model like this is highly valuable to a marketing team for several key reasons:

*   **Saves Time and Increases Efficiency:** By automatically filtering out spam emails from customer inquiries, feedback forms, or other communication channels, the marketing team can save significant time and effort that would otherwise be spent sifting through unwanted messages. This allows them to focus on legitimate customer interactions and prioritize their responses effectively.
*   **Protects Brand Reputation:** Preventing spam content from reaching customers helps maintain a professional and trustworthy brand image. Customers are less likely to be frustrated or associate the brand with unwanted communications if spam is effectively filtered out.
*   **Enables Focus on Legitimate Engagement:** With spam minimized, the marketing team can concentrate on engaging with genuine leads, customers, and partners. This leads to more meaningful interactions, improved customer satisfaction, and ultimately, better business outcomes.
*   **Improved Data Quality:** By removing spam, the dataset used for analysis and reporting becomes cleaner and more reliable, leading to more accurate insights into customer behavior and campaign performance.

In essence, a good spam detection model is not just a technical tool but a strategic asset that empowers the marketing team to operate more efficiently, protect the brand, and build stronger customer relationships.