# EMAIL SPAM DETECTION WITH MACHINE LEARNING

We’ve all received spam emails before. Spam mail, or junk mail, is email sent to a
massive number of users at one time, frequently containing cryptic messages, scams, or,
most dangerously, phishing content.



In this project, we use Python to build an email spam detector, then use machine learning
to train it to classify emails as spam or non-spam. Let’s get started!



In [13]:
!pip install chardet



In [1]:
import numpy as np
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords

# Stemming and Lemmatization
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

# CountVectorizer => Bag of Words
# TfidfVectorizer => TF-IDF
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.naive_bayes import MultinomialNB



In [2]:
df = pd.read_csv(r"C:\Users\Admin\Downloads\spam.csv", encoding='latin-1')

In [3]:
df

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,
...,...,...,...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...,,,
5568,ham,Will Ì_ b going to esplanade fr home?,,,
5569,ham,"Pity, * was in mood for that. So...any other s...",,,
5570,ham,The guy did some bitching but I acted like i'd...,,,


In [4]:
# drop the three empty trailing columns; columns= already implies axis=1
df=df.drop(columns=['Unnamed: 2','Unnamed: 3','Unnamed: 4'])

In [5]:
df.isna().sum()

v1    0
v2    0
dtype: int64

In [6]:
df['v1'].value_counts()

ham     4825
spam     747
Name: v1, dtype: int64
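The counts above show a strongly imbalanced dataset. As a quick sanity check, here is a small sketch that uses only the two counts printed above to work out the share of spam and the accuracy a trivial "always ham" classifier would reach. Any model trained later should beat that baseline.

```python
# Class counts copied from the value_counts() output above
ham_count = 4825
spam_count = 747
total = ham_count + spam_count

spam_share = spam_count / total
# A classifier that always answers "ham" is right on every ham message
baseline_accuracy = ham_count / total

print(f"total messages: {total}")
print(f"spam share: {spam_share:.3f}")
print(f"always-ham accuracy: {baseline_accuracy:.3f}")
```

Because accuracy alone is already high for the trivial baseline, precision and recall on the spam class are the more informative numbers later on.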

In [7]:
df.columns

Index(['v1', 'v2'], dtype='object')

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   v1      5572 non-null   object
 1   v2      5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


# separating labels and email text

In [9]:
labels=df['v1']
emails=df['v2']

In [10]:
!pip install textblob



In [11]:
from textblob import TextBlob, Word

In [12]:
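# build the stop-word set once (needs nltk.download('stopwords') on first use)
stop_words=set(stopwords.words('english'))

def preprocess_text(text):
    # keep only ASCII letters, digits and spaces
    text=re.sub(r'[^A-Za-z0-9 ]','',text)
    text=text.lower()
    tokens=text.split()
    # drop common English stop words
    tokens=[word for word in tokens if word not in stop_words]
    text=' '.join(tokens)
    return text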

preprocessed_emails=emails.apply(preprocess_text)
vectorizer=TfidfVectorizer()

tfidf_vectors=vectorizer.fit_transform(preprocessed_emails)
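To make the TF-IDF step less of a black box, here is a minimal pure-Python sketch of the weighting. It assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and the L2 row normalisation that scikit-learn documents as the `TfidfVectorizer` defaults. The function name `tfidf_rows` and the toy messages are made up for illustration, and tokenisation is a plain whitespace split rather than the vectorizer's own token pattern.

```python
import math
from collections import Counter

def tfidf_rows(documents):
    """Return (vocabulary, rows) with one L2-normalised weight dict per document."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # document frequency: in how many documents each term occurs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # smoothed inverse document frequency (assumed scikit-learn default)
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1
           for term, df in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()} if norm else {})
    return sorted(doc_freq), rows

# Hypothetical, already preprocessed messages
toy_messages = ["free prize win", "win free entry", "lunch tomorrow"]
vocabulary, rows = tfidf_rows(toy_messages)
for message, row in zip(toy_messages, rows):
    print(message, {term: round(weight, 3) for term, weight in row.items()})
```

In the first toy message, "prize" gets a larger weight than "free" because it appears in fewer messages, which is the effect TF-IDF is meant to capture.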
    

# splitting data

In [13]:
x_train,x_test,y_train,y_test=train_test_split(tfidf_vectors,labels,test_size=0.2,random_state=42)
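For reference, the size of each part of the split can be worked out by hand. This is a small sketch that assumes the test count is the requested fraction rounded up, with the remainder used for training. The row count comes from the `df.info()` output above.

```python
import math

n_samples = 5572   # rows reported by df.info() above
test_size = 0.2    # same fraction passed to train_test_split

# Assumption: the test count is rounded up and the rest goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```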

# selecting and fitting the model

In [14]:
classifier=MultinomialNB()
classifier.fit(x_train,y_train)

MultinomialNB()
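As a rough illustration of what a multinomial Naive Bayes classifier does internally, the sketch below trains on raw word counts with add-one (Laplace) smoothing and picks the class with the highest log posterior. It is a simplified stand-in, not the scikit-learn implementation: the names `train_nb` and `predict_nb` and the toy data are hypothetical, and the notebook itself feeds TF-IDF weights rather than raw counts.

```python
import math
from collections import Counter, defaultdict

def train_nb(documents, labels, alpha=1.0):
    """Fit class priors and smoothed per-class term likelihoods from word counts."""
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, label in zip(documents, labels):
        term_counts[label].update(doc.split())
    vocabulary = {term for counts in term_counts.values() for term in counts}
    model = {}
    for label, n_docs in class_counts.items():
        total_terms = sum(term_counts[label].values())
        # smoothing adds alpha to every term count in the vocabulary
        denominator = total_terms + alpha * len(vocabulary)
        model[label] = {
            "log_prior": math.log(n_docs / len(labels)),
            "log_likelihood": {
                term: math.log((term_counts[label][term] + alpha) / denominator)
                for term in vocabulary
            },
        }
    return model

def predict_nb(model, document):
    """Return the label with the highest log posterior; unseen words are skipped."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for term in document.split():
            if term in params["log_likelihood"]:
                score += params["log_likelihood"][term]
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy training data
toy_docs = ["win free prize", "free cash win", "lunch at noon", "see you at lunch"]
toy_labels = ["spam", "spam", "ham", "ham"]
model = train_nb(toy_docs, toy_labels)
print(predict_nb(model, "free prize"))
print(predict_nb(model, "lunch at noon tomorrow"))
```

Working in log space avoids multiplying many small probabilities together, and smoothing keeps a word that never appeared in one class from forcing that class's probability to zero.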

# evaluating the model

In [15]:
predictions=classifier.predict(x_test)

accuracy=accuracy_score(y_test,predictions)
precision=precision_score(y_test,predictions,pos_label='spam')
recall=recall_score(y_test,predictions,pos_label='spam')
f1=f1_score(y_test,predictions,pos_label='spam')

print('Accuracy',accuracy)
print('Precision',precision)
print('Recall',recall)
print('F1 score',f1)

Accuracy 0.9659192825112107
Precision 1.0
Recall 0.7466666666666667
F1 score 0.8549618320610688
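To see how the four metrics relate, the sketch below recomputes them from confusion-matrix counts. The counts are not printed anywhere in the notebook. They are inferred from the reported figures, assuming a test set of 1115 messages (20% of 5572, rounded up), and should be read as one consistent reconstruction rather than logged output.

```python
# Counts inferred from the printed metrics, not read from the notebook:
# assumes a 1115-message test set containing 150 spam messages.
true_positives = 112   # spam correctly flagged
false_positives = 0    # ham wrongly flagged (precision is 1.0)
false_negatives = 38   # spam that slipped through
true_negatives = 965   # ham correctly passed

total = true_positives + false_positives + false_negatives + true_negatives
accuracy = (true_positives + true_negatives) / total
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

Read this way, the model never flags a legitimate message as spam but lets roughly a quarter of the spam through, which the high accuracy figure on its own hides.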


# prediction using example

In [16]:

new_sentence = "Congratulations! You've won a million dollars. Click here to claim your prize."
preprocessed_sentence=preprocess_text(new_sentence)
new_sentence_vector=vectorizer.transform([preprocessed_sentence])
prediction=classifier.predict(new_sentence_vector)

if prediction[0]=='spam':
    print('given sentence is predicted as spam')
else:
    print('given sentence is predicted as ham')
    

given sentence is predicted as spam


In [18]:

new_sentence = "Hi there! Just wanted to remind you about our lunch plans at the new Italian restaurant downtown tomorrow. Can't wait to catch up and try their amazing pasta dishes. See you at 12:30 PM!"
preprocessed_sentence=preprocess_text(new_sentence)
new_sentence_vector=vectorizer.transform([preprocessed_sentence])
prediction=classifier.predict(new_sentence_vector)

if prediction[0]=='spam':
    print('given sentence is predicted as spam')
else:
    print('given sentence is predicted as ham')
    

given sentence is predicted as ham
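The two prediction cells above repeat the same three steps, so they could be folded into one helper. The following is a sketch only: `classify_message` is a hypothetical name, and the tiny `KeywordVectorizer` and `KeywordClassifier` classes are stand-ins so the block runs on its own. In the notebook you would pass `preprocess_text`, the fitted `vectorizer`, and the trained `classifier` instead.

```python
def classify_message(message, preprocess, vectorizer, classifier):
    """Clean one raw message, vectorize it, and return the predicted label."""
    cleaned = preprocess(message)
    features = vectorizer.transform([cleaned])
    return classifier.predict(features)[0]

# Minimal stand-ins (hypothetical) that mimic the transform/predict interface
class KeywordVectorizer:
    def transform(self, texts):
        # one feature per text: 1 if the word "prize" occurs, else 0
        return [[int("prize" in text)] for text in texts]

class KeywordClassifier:
    def predict(self, rows):
        return ["spam" if row[0] else "ham" for row in rows]

label = classify_message("Claim your PRIZE now", str.lower,
                         KeywordVectorizer(), KeywordClassifier())
print(f"given sentence is predicted as {label}")
```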
