# Email Spam Detection with ML
We’ve all received spam email before. Spam, or junk mail, is email sent to a
massive number of users at one time, frequently containing cryptic messages,
scams, or, most dangerously, phishing content.
In this project, we use Python to build an email spam detector, then use machine
learning to train it to classify emails as spam or non-spam (ham).

In [50]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 

In [51]:
data=pd.read_csv("../datasets/newSpamPrediction.csv")

In [52]:
data.columns

Index(['v1', 'v2', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], dtype='object')

In [53]:
data['v2']=data['v2'].astype(str)+data['Unnamed: 2'].astype(str)+data['Unnamed: 3'].astype(str)+data['Unnamed: 4'].astype(str)
data=data.drop(columns=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'])
data.head(3)

Unnamed: 0,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...nannannan
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
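
The `nannannan` suffix in row 1 of the preview above appears because `astype(str)` turns a missing value (`NaN`) into the literal string `"nan"` before concatenation. A minimal sketch of one way to avoid this is to replace missing values with an empty string first. The toy frame below is illustrative and is not the real dataset; `toy`, `extra_cols`, and `parts` are names chosen for this example.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the column layout above; the values are illustrative only.
toy = pd.DataFrame({
    "v1": ["ham", "spam"],
    "v2": ["Ok lar... Joking wif u oni...", "Free entry, text now"],
    "Unnamed: 2": [np.nan, " to win"],
    "Unnamed: 3": [np.nan, np.nan],
    "Unnamed: 4": [np.nan, np.nan],
})

extra_cols = ["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"]

# Work on object dtype so that missing cells can be swapped for "".
parts = toy[["v2"] + extra_cols].astype(object)
parts = parts.where(parts.notna(), "")

# Concatenate the text pieces row by row, then drop the overflow columns.
toy["v2"] = parts.astype(str).agg("".join, axis=1)
toy = toy.drop(columns=extra_cols)

print(toy["v2"].tolist())
```

Applying the same idea to `data` would change the text fed to later cells, so downstream numbers may differ from the outputs recorded in this notebook.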


In [54]:
def clean_email_text(email_text):
    # Split the text into words
    words = email_text.split()
    
    # Remove commas and keep words
    cleaned_words = [word.replace(',', '') for word in words]
    
    # Join the cleaned words back into a sequence
    cleaned_text = ' '.join(cleaned_words)
    
    return cleaned_text

In [55]:
data['v2']=data['v2'].apply(clean_email_text)
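
The cleaning step removes commas and collapses runs of whitespace. As a sketch, the same effect can usually be had with pandas string methods instead of a Python-level `apply`. The sample messages are made up for illustration, and the two forms can differ for a token that consists only of a comma.

```python
import pandas as pd

def clean_email_text(email_text):
    # Same logic as the notebook function: drop commas, normalize whitespace.
    return ' '.join(word.replace(',', '') for word in email_text.split())

# Illustrative sample, not taken from the dataset file.
sample = pd.Series(["Go until jurong point, crazy..", "Free  entry,  text   now"])

# Vectorized form: remove commas, then split and re-join on single spaces.
vectorized = sample.str.replace(',', '', regex=False).str.split().str.join(' ')

print(vectorized.tolist())
```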

In [56]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [57]:
data['v2']=data['v2'].str.split()
max_len = data['v2'].apply(len).max()
max_len

171

In [58]:
data['v2'] = pad_sequences(data['v2'], maxlen=max_len, padding='post', truncating='post', dtype='object')
max_len = data['v2'].apply(len).max()
max_len


58

In [66]:
from tensorflow.keras.preprocessing.text import Tokenizer
# Tokenize the email text data
tokenizer = Tokenizer()
tokenizer.fit_on_texts(data['v2'])

# Convert text sequences to integer sequences
# sequences = tokenizer.texts_to_sequences(data['v2'])

# Determine vocabulary size
vocab_size = len(tokenizer.word_index) + 1

# Adjust this based on your preference and resources
embedding_dim = 100  

# Define the maximum sequence length (the length you padded or truncated sequences to)
max_length = 100 

In [59]:
from sklearn.model_selection import train_test_split

In [74]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=1000)  # You can adjust 'max_features' as needed

# Fit and transform your text data
X = tfidf_vectorizer.fit_transform(data['v2'])

In [75]:
X_train, X_test, Y_train, Y_test = train_test_split(X, data['v1'], test_size=0.15, random_state=10)
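
Because spam is the minority class here, one optional variation is to pass `stratify` so that both subsets keep the same class ratio. This is a sketch on synthetic labels, not the split used in this notebook. The 880/120 class counts and the placeholder `features` array are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with an imbalance roughly like this dataset's (illustrative only).
labels = np.array(["ham"] * 880 + ["spam"] * 120)
features = np.arange(len(labels)).reshape(-1, 1)  # placeholder feature matrix

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.15, random_state=10, stratify=labels
)

# With stratify, the spam share is (almost) identical in both subsets.
print((y_tr == "spam").mean(), (y_te == "spam").mean())
```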

In [76]:
from sklearn.naive_bayes import MultinomialNB

# Create a classifier (e.g., Multinomial Naive Bayes)
classifier = MultinomialNB()

In [77]:
classifier.fit(X_train,Y_train)

In [78]:
Y_pred=classifier.predict(X_test)

In [80]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Y_pred was computed in the previous cell with classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(Y_test, Y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Print classification report and confusion matrix
print(classification_report(Y_test, Y_pred))
print(confusion_matrix(Y_test, Y_pred))

Accuracy: 0.92
              precision    recall  f1-score   support

         ham       0.92      1.00      0.96       733
        spam       0.95      0.36      0.52       103

    accuracy                           0.92       836
   macro avg       0.93      0.68      0.74       836
weighted avg       0.92      0.92      0.90       836

[[731   2]
 [ 66  37]]


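The model reaches 92% accuracy on the 836-message test set. That figure is driven largely by the ham majority class (733 of 836 messages): spam recall is only 0.36, meaning 66 of the 103 spam messages were classified as ham.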
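
The figures in the classification report can be re-derived by hand from the confusion matrix printed above. The sketch below does that arithmetic in plain Python. It relies on scikit-learn's convention that rows are true labels and columns are predicted labels, in sorted label order (`ham`, `spam`).

```python
# Confusion matrix from the output above: rows are true labels, columns are predictions.
tn, fp = 731, 2    # true ham:  predicted ham, predicted spam
fn, tp = 66, 37    # true spam: predicted ham, predicted spam

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
spam_precision = tp / (tp + fp)   # of messages flagged as spam, how many were spam
spam_recall = tp / (tp + fn)      # of actual spam messages, how many were caught
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)

print(f"accuracy={accuracy:.2f} precision={spam_precision:.2f} "
      f"recall={spam_recall:.2f} f1={spam_f1:.2f}")
# prints: accuracy=0.92 precision=0.95 recall=0.36 f1=0.52
```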

In [81]:
import joblib

joblib.dump(classifier,"../models/spamDetection.pkl")

['../models/spamDetection.pkl']
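
The saved file contains only the classifier, so classifying new raw text later would also require the fitted `tfidf_vectorizer`. One possible approach, sketched here on toy data, is to bundle both steps in a scikit-learn pipeline and persist that single object. The training texts, the file name `spam_pipeline.pkl`, and the temporary directory are all hypothetical choices for this example.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data (illustrative only).
texts = ["win free cash now", "free prize claim now",
         "see you at lunch", "are we still meeting today"]
labels = ["spam", "spam", "ham", "ham"]

# Bundling the vectorizer with the classifier keeps vocabulary and model together.
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipeline.fit(texts, labels)

# Hypothetical path in a temporary directory, not the notebook's ../models folder.
path = os.path.join(tempfile.mkdtemp(), "spam_pipeline.pkl")
joblib.dump(pipeline, path)

# The restored object accepts raw strings directly.
restored = joblib.load(path)
print(restored.predict(["claim your free cash"]))
```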