We’ve all been the recipient of spam emails before. Spam mail, or junk mail, is a type of email that is sent to a massive number of users at one time, frequently containing cryptic messages, scams, or most dangerously, phishing content. In this Project, use Python to build an email spam detector. Then, use machine learning to train the spam detector to recognize and classify emails into spam and non-spam. Let’s get started!

In [30]:
import pandas as pd
messages=pd.read_csv('spam.csv',encoding='ISO-8859-1')
messages.rename(columns={'v1':'label','v2':'message'},inplace=True)
messages.drop(messages.iloc[:,2:],axis=1,inplace=True)
messages.head()

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [31]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
stemmer=PorterStemmer()
corpus=[]
for i in range(0,len(messages)):
    review=re.sub('[^a-zA-Z]',' ',messages['message'][i])
    review=review.lower()
    review=review.split()
    
    review=[stemmer.stem(word) for word in review if word in stopwords.words('english')]
    review=' '.join(review)
    corpus.append(review)

In [32]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(max_features=5000)
X=cv.fit_transform(corpus).toarray()
X.shape

(5572, 129)

In [33]:
y=pd.get_dummies(messages['label'])
y=y.iloc[:,1].values

In [34]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.20,random_state=0)

In [35]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
accuracy={}
spam_detect_model=MultinomialNB().fit(x_train,y_train)
y_pred=spam_detect_model.predict(x_test)
accuracy['naive_bayes']=accuracy_score(y_pred,y_test)
accuracy

{'naive_bayes': 0.9165919282511211}

In [36]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
lr = LogisticRegression()
lr.fit(x_train,y_train)
y_pred = lr.predict(x_test)
accuracy['logistic_regression'] = accuracy_score(y_pred,y_test)

In [37]:
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(x_train,y_train)
y_pred = dt.predict(x_test)
accuracy['decision_tree'] = accuracy_score(y_pred,y_test)
accuracy

{'naive_bayes': 0.9165919282511211,
 'logistic_regression': 0.9103139013452914,
 'decision_tree': 0.9094170403587444}

In [38]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test,y_pred)

array([[897,  52],
       [ 49, 117]], dtype=int64)