## Email Spam Detection with Machine Learning

### Objective

This project builds an email spam detection system in Python using natural language processing (NLP) and machine learning. Starting from a dataset of labelled messages, we extract word counts from each message and train a Naive Bayes classifier to predict whether a new message is spam or ham (not spam).

### Importing Python libraries

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score


In [38]:
# load the dataset
df=pd.read_csv('mail_data.csv')

In [39]:
print(df)

     Category                                            Message
0         ham  Go until jurong point, crazy.. Available only ...
1         ham                      Ok lar... Joking wif u oni...
2        spam  Free entry in 2 a wkly comp to win FA Cup fina...
3         ham  U dun say so early hor... U c already then say...
4         ham  Nah I don't think he goes to usf, he lives aro...
...       ...                                                ...
5567     spam  This is the 2nd time we have tried 2 contact u...
5568      ham               Will ü b going to esplanade fr home?
5569      ham  Pity, * was in mood for that. So...any other s...
5570      ham  The guy did some bitching but I acted like i'd...
5571      ham                         Rofl. Its true to its name

[5572 rows x 2 columns]


In [40]:
data = df.where((pd.notnull(df)), '')

In [41]:
# Show dataset head (first 5 records)
data.head()

  Category                                            Message
0      ham  Go until jurong point, crazy.. Available only ...
1      ham                      Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3      ham  U dun say so early hor... U c already then say...
4      ham  Nah I don't think he goes to usf, he lives aro...


In [42]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  5572 non-null   object
 1   Message   5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [43]:
# Convert the category labels to numeric form (ham = 0, spam = 1)
data['Category'] = data['Category'].map({'ham': 0, 'spam': 1})

In [44]:
data.head()

   Category                                            Message
0         0  Go until jurong point, crazy.. Available only ...
1         0                      Ok lar... Joking wif u oni...
2         1  Free entry in 2 a wkly comp to win FA Cup fina...
3         0  U dun say so early hor... U c already then say...
4         0  Nah I don't think he goes to usf, he lives aro...
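One thing worth checking after this step: `Series.map` with a dictionary turns any label that is not a key into `NaN`. The sketch below illustrates this on a small made-up DataFrame (the `toy` frame and its stray `"Spam"` label are hypothetical, not taken from `mail_data.csv`).

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "Category": ["ham", "spam", "ham", "Spam"],  # note the stray capital letter
    "Message": ["ok", "win cash", "see you", "free prize"],
})

label_map = {"ham": 0, "spam": 1}
mapped = toy["Category"].map(label_map)

# Labels that are missing from the dictionary become NaN
unmapped_rows = toy[mapped.isna()]
print(unmapped_rows)

# Normalizing case before mapping avoids the problem in this toy example
clean = toy["Category"].str.lower().map(label_map)
print(clean.value_counts())
```

If `unmapped_rows` is empty on the real data, the mapping covered every label.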


In [45]:
data.shape

(5572, 2)

In [46]:
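# Category is already numeric after the map() above (ham = 0, spam = 1),
# so these string comparisons match no rows and leave the labels unchanged.
data.loc[data['Category'] == 'spam', 'Category'] = 0
data.loc[data['Category'] == 'ham', 'Category'] = 1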

In [47]:
data.head()

   Category                                            Message
0         0  Go until jurong point, crazy.. Available only ...
1         0                      Ok lar... Joking wif u oni...
2         1  Free entry in 2 a wkly comp to win FA Cup fina...
3         0  U dun say so early hor... U c already then say...
4         0  Nah I don't think he goes to usf, he lives aro...


In [48]:
# Create the CountVectorizer (it is fitted to the messages below)
cv = CountVectorizer()

In [49]:
x  = data['Message']
y  =  data['Category']

In [50]:
x=cv.fit_transform(x)

In [51]:
x

<5572x8709 sparse matrix of type '<class 'numpy.int64'>'
	with 74098 stored elements in Compressed Sparse Row format>

In [24]:
# Inspect the raw messages (x itself is now a sparse count matrix)
print(data['Message'])

0       Go until jurong point, crazy.. Available only ...
1                           Ok lar... Joking wif u oni...
2       Free entry in 2 a wkly comp to win FA Cup fina...
3       U dun say so early hor... U c already then say...
4       Nah I don't think he goes to usf, he lives aro...
                              ...                        
5567    This is the 2nd time we have tried 2 contact u...
5568                 Will ü b going to esplanade fr home?
5569    Pity, * was in mood for that. So...any other s...
5570    The guy did some bitching but I acted like i'd...
5571                           Rofl. Its true to its name
Name: Message, Length: 5572, dtype: object


In [25]:
print(y)

0       0
1       0
2       1
3       0
4       0
       ..
5567    1
5568    0
5569    0
5570    0
5571    0
Name: Category, Length: 5572, dtype: int64


## Data splitting

In [52]:
# Hold out 20% of the data for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

In [54]:
print(x.shape)
print(x_train.shape)
print(x_test.shape)

(5572, 8709)
(4457, 8709)
(1115, 8709)


In [56]:
print(y.shape)
print(y_train.shape)
print(y_test.shape)

(5572,)
(4457,)
(1115,)
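The split above is applied after the vectorizer has been fitted on all messages, and it is neither seeded nor stratified. A possible variant, sketched here on a made-up ten-message dataset, splits the raw text first and lets a `Pipeline` fit the vectorizer on the training part only. The `messages` and `labels` lists, `random_state=42`, and `stratify=labels` are illustrative assumptions, not part of the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data (0 = ham, 1 = spam)
messages = [
    "see you at lunch", "are you home yet", "call me later",
    "ok thanks", "meeting moved to monday",
    "win a free prize now", "free cash win", "claim your free entry",
    "urgent prize waiting", "win cash today",
]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Split the raw text; stratify keeps the ham/spam ratio in both parts,
# and random_state makes the split repeatable
msg_train, msg_test, y_tr, y_te = train_test_split(
    messages, labels, test_size=0.2, random_state=42, stratify=labels
)

# The pipeline fits CountVectorizer on the training messages only
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
pipe.fit(msg_train, y_tr)

score = pipe.score(msg_test, y_te)
print(len(msg_train), len(msg_test))
print(score)
```

With this ordering, words that occur only in the test messages are ignored at prediction time, which is closer to how the model would see genuinely new mail.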


## The Model

In [57]:
# Model creation
model = MultinomialNB()

In [58]:
model.fit(x_train ,y_train )

In [62]:
# Model evaluation: accuracy on the test set
result = model.score(x_test,y_test)

In [63]:
result =result * 100

In [64]:
result

98.02690582959642
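Accuracy alone can look good when one class dominates, so it may help to also look at the confusion matrix, precision, and recall for the spam class. The sketch below uses small made-up label vectors (`y_true` and `y_pred` are hypothetical, not the notebook's test results) to show the calls involved.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical labels and predictions (0 = ham, 1 = spam)
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[3 1]
#  [1 1]]

acc = accuracy_score(y_true, y_pred)    # 4 of 6 correct
prec = precision_score(y_true, y_pred)  # of predicted spam, share that is spam
rec = recall_score(y_true, y_pred)      # of actual spam, share that was caught
print(acc, prec, rec)
```

On the real data the same calls would take `y_test` and `model.predict(x_test)` as arguments.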

In [65]:
import pickle

In [66]:
# Save the trained model; the with-block closes the file afterwards
with open("mail.pkl", "wb") as f:
    pickle.dump(model, f)

In [67]:
# Save the fitted vectorizer so new messages can be encoded the same way
with open("vectorizer.pkl", "wb") as f:
    pickle.dump(cv, f)

In [69]:
# Load the saved model back from disk
with open("mail.pkl", "rb") as f:
    clf = pickle.load(f)

In [70]:
clf
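Both pickles are needed to make predictions in a new session: the vectorizer to encode the text and the model to classify it. The following sketch checks that a save/load round trip gives the same prediction. It trains a tiny model on made-up messages and writes to a temporary directory, so the training data and all variable names here are illustrative assumptions.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data (0 = ham, 1 = spam)
train_msgs = ["win free prize now", "free cash win",
              "see you at lunch", "are you home yet"]
train_labels = [1, 1, 0, 0]

vec = CountVectorizer()
nb = MultinomialNB().fit(vec.fit_transform(train_msgs), train_labels)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "mail.pkl")
    vec_path = os.path.join(tmp, "vectorizer.pkl")

    # Save both objects
    with open(model_path, "wb") as f:
        pickle.dump(nb, f)
    with open(vec_path, "wb") as f:
        pickle.dump(vec, f)

    # Load them back, as a separate script would
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)
    with open(vec_path, "rb") as f:
        loaded_vec = pickle.load(f)

sample = ["win free cash"]
before = nb.predict(vec.transform(sample))
after = loaded_model.predict(loaded_vec.transform(sample))
print(before, after)  # the two predictions should match
```

Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.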

## Prediction result

In [71]:
msg = "You win 10 Dollar"
# Encode the message with the fitted CountVectorizer, then predict (0 = ham, 1 = spam)
vect = cv.transform([msg])
result = model.predict(vect)
print(result)

[1]


In [73]:
# Labels were mapped as ham = 0, spam = 1
if result[0] == 1:
    print("It is a Spam mail")
else:
    print("It is a Ham mail")

It is a Spam mail
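The prediction step can be wrapped in a small helper that returns a readable label together with the estimated spam probability. This is a sketch only: it assumes the ham = 0, spam = 1 mapping from the label conversion step, and the `classify` function, `LABEL_NAMES` dictionary, and toy training data are all hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Assumed label mapping: ham = 0, spam = 1
LABEL_NAMES = {0: "ham", 1: "spam"}


def classify(message, vectorizer, model):
    """Return (label name, spam probability) for a single message."""
    features = vectorizer.transform([message])
    label = int(model.predict(features)[0])
    # Find the probability column that belongs to the spam class
    spam_col = list(model.classes_).index(1)
    spam_prob = float(model.predict_proba(features)[0][spam_col])
    return LABEL_NAMES[label], spam_prob


# Hypothetical toy model so the example runs on its own
train_msgs = ["win free prize now", "free cash win",
              "see you at lunch", "are you home yet"]
train_labels = [1, 1, 0, 0]
toy_vec = CountVectorizer()
toy_model = MultinomialNB().fit(toy_vec.fit_transform(train_msgs), train_labels)

spam_result = classify("win free cash", toy_vec, toy_model)
ham_result = classify("see you at home", toy_vec, toy_model)
print(spam_result)
print(ham_result)
```

With the notebook's own objects, the call would be `classify(msg, cv, model)`.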
