### Email spam detection with machine learning


Context:
Classifying emails into distinct labels can have a great impact on customer support. By using machine learning to label emails, the system can set up queues that each contain emails of a specific category. This lets support personnel handle requests more quickly and easily by selecting a queue that matches their expertise.

Objectives: 
This study aims to improve on the manually defined rule-based algorithm currently implemented at a large telecom company by using machine learning. The proposed model should achieve a higher F1-score and classification rate. Integrating or migrating from a manually defined rule-based model to a machine learning model should also reduce administrative and maintenance work and make the model more flexible.

In [1]:
from IPython.display import Image
Image(url='https://www.pantechelearning.com/wp-content/uploads/2021/12/Spam-classification.png', width=400)

In [2]:
#Import the Python libraries used in this notebook (pandas, NumPy, seaborn, matplotlib, scikit-learn)
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score,f1_score,recall_score,precision_score

In [3]:
#Load dataset
df=pd.read_csv("spam.csv",encoding="latin1")
df

| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | | | |
| 1 | ham | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | ham | U dun say so early hor... U c already then say... | | | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | | | |
| ... | ... | ... | ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... | | | |
| 5568 | ham | Will ?_ b going to esplanade fr home? | | | |
| 5569 | ham | Pity, * was in mood for that. So...any other s... | | | |
| 5570 | ham | The guy did some bitching but I acted like i'd... | | | |


In [4]:
#Check column list present in df
df.columns

Index(['v1', 'v2', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], dtype='object')

In [5]:
#Check column dtypes and non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB


In [6]:
#check the number of rows and columns present in df
print('rows---->',df.shape[0])
print('columns---->',df.shape[1])

rows----> 5572
columns----> 5


In [7]:
#Count the null values in each column of df
df.isnull().sum()

v1               0
v2               0
Unnamed: 2    5522
Unnamed: 3    5560
Unnamed: 4    5566
dtype: int64

In [8]:
df.isnull().mean()*100  #check the percentage of null value

v1             0.000000
v2             0.000000
Unnamed: 2    99.102656
Unnamed: 3    99.784637
Unnamed: 4    99.892319
dtype: float64

As we can see, more than 99% of the entries in the Unnamed: 2, Unnamed: 3 and Unnamed: 4 columns are missing, so we remove these columns.
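
The same decision can be made programmatically instead of by reading the percentages. A minimal sketch, assuming a hypothetical 90% cutoff (`threshold` is not part of the original notebook) and a small toy frame standing in for `spam.csv`:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (illustrative values only).
toy = pd.DataFrame({
    "v1": ["ham", "spam", "ham", "ham"],
    "v2": ["hello", "win cash", "see you", "ok"],
    "Unnamed: 2": [np.nan, np.nan, np.nan, np.nan],
})

threshold = 0.9  # hypothetical cutoff: drop columns that are more than 90% null

null_fraction = toy.isnull().mean()  # fraction of nulls per column
sparse_cols = null_fraction[null_fraction > threshold].index.tolist()
cleaned = toy.drop(columns=sparse_cols)

print(sparse_cols)
print(cleaned.columns.tolist())
```

The cutoff is a judgment call; here any value below 0.99 would select the same three columns in the real dataset.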

In [9]:
df.drop(columns=['Unnamed: 2','Unnamed: 3','Unnamed: 4'],inplace=True)

In [10]:
df

| | v1 | v2 |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will ?_ b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |


In [11]:
df.shape

(5572, 2)

In [12]:
#Rename the columns so they are easier to understand (df.rename would also work)
df.columns=['spam/ham','sms']

In [13]:
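#Convert the text labels into numerical form: spam -> 0, ham -> 1
#The column dtype stays object here; the labels are cast to int in a later cell
df.loc[df['spam/ham'] == 'spam', 'spam/ham'] = 0
df.loc[df['spam/ham'] == 'ham', 'spam/ham'] = 1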

In [14]:
df

| | spam/ham | sms |
|---|---|---|
| 0 | 1 | Go until jurong point, crazy.. Available only ... |
| 1 | 1 | Ok lar... Joking wif u oni... |
| 2 | 0 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 1 | U dun say so early hor... U c already then say... |
| 4 | 1 | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | 0 | This is the 2nd time we have tried 2 contact u... |
| 5568 | 1 | Will ?_ b going to esplanade fr home? |
| 5569 | 1 | Pity, * was in mood for that. So...any other s... |
| 5570 | 1 | The guy did some bitching but I acted like i'd... |
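
The label encoding above can also be written as a single mapping that yields an integer column directly. A minimal sketch on a toy Series (the `label_map` name is an assumption, not part of the original notebook):

```python
import pandas as pd

# Toy labels standing in for the 'spam/ham' column.
labels = pd.Series(["ham", "spam", "ham"], name="spam/ham")

# Same convention as the notebook: spam -> 0, ham -> 1.
label_map = {"spam": 0, "ham": 1}
encoded = labels.map(label_map).astype("int64")

print(encoded.tolist())
print(encoded.dtype)
```

With this form the column is already integer-typed. Assignment through `df.loc` leaves the dtype as `object`, which is why the notebook casts the labels with `astype('int')` before training.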


In [15]:
#Divide the data into features (x) and labels (y) to train the model
x=df.sms
x

0       Go until jurong point, crazy.. Available only ...
1                           Ok lar... Joking wif u oni...
2       Free entry in 2 a wkly comp to win FA Cup fina...
3       U dun say so early hor... U c already then say...
4       Nah I don't think he goes to usf, he lives aro...
                              ...                        
5567    This is the 2nd time we have tried 2 contact u...
5568                Will ?_ b going to esplanade fr home?
5569    Pity, * was in mood for that. So...any other s...
5570    The guy did some bitching but I acted like i'd...
5571                           Rofl. Its true to its name
Name: sms, Length: 5572, dtype: object

In [16]:
y =df['spam/ham']
y

0       1
1       1
2       0
3       1
4       1
       ..
5567    0
5568    1
5569    1
5570    1
5571    1
Name: spam/ham, Length: 5572, dtype: object

In [17]:
#Split the whole dataset into training and testing sets for model training
from sklearn.model_selection import train_test_split

In [18]:
xtrain,xtest,ytrain,ytest=train_test_split(x,y,test_size=0.2,random_state=3)

In [19]:
print(x.shape)
print(xtrain.shape)
print(xtest.shape)

(5572,)
(4457,)
(1115,)
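
Because spam is the minority class, a plain random split can leave the training and test sets with slightly different spam ratios. One option, not used in the original notebook, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both sets. A small sketch on toy labels:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data: 20 spam (0) and 80 ham (1) messages, mirroring the notebook's encoding.
texts = [f"message {i}" for i in range(100)]
labels = [0] * 20 + [1] * 80

x_tr, x_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=3, stratify=labels
)

# Both splits keep the 20/80 class ratio.
print(Counter(y_tr))
print(Counter(y_te))
```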


In [20]:
xtrain,xtest

(3075    Mum, hope you are having a great day. Hoping t...
 1787                           Yes:)sura in sun tv.:)lol.
 1614    Me sef dey laugh you. Meanwhile how's my darli...
 4304                Yo come over carlos will be here soon
 3266                    Ok then i come n pick u at engin?
                               ...                        
 789                          Gud mrng dear hav a nice day
 968             Are you willing to go for aptitude class.
 1667    So now my dad is gonna call after he gets out ...
 3321    Ok darlin i supose it was ok i just worry too ...
 1688                     Nan sonathaya soladha. Why boss?
 Name: sms, Length: 4457, dtype: object,
 2632                       I WILL CAL YOU SIR. In meeting
 454     Loan for any purpose ?500 - ?75,000. Homeowner...
 983     LOOK AT THE FUCKIN TIME. WHAT THE FUCK YOU THI...
 1282    Ever green quote ever told by Jerry in cartoon...
 4610                                  Wat time ?_ finish?
               

In [21]:
ytrain,ytest

(3075    1
 1787    1
 1614    1
 4304    1
 3266    1
        ..
 789     1
 968     1
 1667    1
 3321    1
 1688    1
 Name: spam/ham, Length: 4457, dtype: object,
 2632    1
 454     0
 983     1
 1282    1
 4610    1
        ..
 4827    1
 5291    1
 3325    1
 3561    1
 1136    0
 Name: spam/ham, Length: 1115, dtype: object)

Machine learning algorithms operate on numerical input, so we need to convert the text data into numbers. To do so, I will use the TfidfVectorizer technique from sklearn's feature_extraction.text module.
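
To make the conversion concrete, here is a minimal sketch on a few toy messages (illustrative text, not taken from `spam.csv`). It fits the vectorizer on the training messages only and then reuses the learned vocabulary for the test messages, so a word that never appeared in training contributes nothing:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy messages (illustrative only).
train_msgs = ["win a free prize today", "see you at lunch today"]
test_msgs = ["free lunch", "zebra"]  # "zebra" never appears in the training text

vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)

# Learn the vocabulary and IDF weights from the training messages only.
train_vec = vectorizer.fit_transform(train_msgs)
# Reuse that vocabulary for the test messages; unseen words are ignored.
test_vec = vectorizer.transform(test_msgs)

print(sorted(vectorizer.vocabulary_))   # terms kept after stop-word removal
print(train_vec.shape, test_vec.shape)  # same number of columns in both
print(test_vec[1].nnz)                  # stored non-zero weights for "zebra"
```

Calling `fit_transform` on the training set and only `transform` on the test set keeps information from the test messages out of the learned vocabulary and weights.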

In [22]:
feat_vect=TfidfVectorizer(min_df=1,stop_words='english',lowercase=True)
feat_vect

TfidfVectorizer(stop_words='english')


In [23]:
ytrain=ytrain.astype('int')
ytest=ytest.astype('int')

In [24]:
xtrain_vec =feat_vect.fit_transform(xtrain)

In [25]:
xtest_vec =feat_vect.transform(xtest)

In [26]:
print(xtrain)

3075    Mum, hope you are having a great day. Hoping t...
1787                           Yes:)sura in sun tv.:)lol.
1614    Me sef dey laugh you. Meanwhile how's my darli...
4304                Yo come over carlos will be here soon
3266                    Ok then i come n pick u at engin?
                              ...                        
789                          Gud mrng dear hav a nice day
968             Are you willing to go for aptitude class.
1667    So now my dad is gonna call after he gets out ...
3321    Ok darlin i supose it was ok i just worry too ...
1688                     Nan sonathaya soladha. Why boss?
Name: sms, Length: 4457, dtype: object


In [27]:
xtrain_vec

<4457x7468 sparse matrix of type '<class 'numpy.float64'>'
	with 34592 stored elements in Compressed Sparse Row format>

In [28]:
print(xtrain_vec)

  (0, 742)	0.32207229533730536
  (0, 3962)	0.2411608243124387
  (0, 4279)	0.3893042361045832
  (0, 6580)	0.20305518394534605
  (0, 3375)	0.32207229533730536
  (0, 2116)	0.38519642807943744
  (0, 3126)	0.4403035234544808
  (0, 3251)	0.258880502955985
  (0, 3369)	0.21816477736422235
  (0, 4497)	0.2910887633154199
  (1, 4045)	0.380431198316959
  (1, 6850)	0.4306015894277422
  (1, 6397)	0.4769136859540388
  (1, 6422)	0.5652509076654626
  (1, 7420)	0.35056971070320353
  (2, 934)	0.4917598465723273
  (2, 2103)	0.42972812260098503
  (2, 3899)	0.40088501350982736
  (2, 2220)	0.413484525934624
  (2, 5806)	0.4917598465723273
  (3, 6121)	0.4903863168693604
  (3, 1595)	0.5927091854194291
  (3, 1838)	0.3708680641487708
  (3, 7430)	0.5202633571003087
  (4, 2523)	0.7419319091456392
  :	:
  (4452, 2116)	0.3092200696489299
  (4453, 1000)	0.6760129013031282
  (4453, 7250)	0.5787739591782677
  (4453, 1758)	0.45610005640082985
  (4454, 3019)	0.42618909997886
  (4454, 2080)	0.3809693742808703
  (4454, 3078

In [29]:
print(xtest_vec)

  (0, 5988)	0.537093591660729
  (0, 4277)	0.5159375448718375
  (0, 1535)	0.667337188824809
  (1, 7199)	0.23059492898537967
  (1, 6580)	0.14954692788663673
  (1, 6560)	0.2733682162643466
  (1, 5482)	0.28671640581392144
  (1, 5329)	0.2733682162643466
  (1, 5232)	0.28671640581392144
  (1, 4029)	0.250549335510249
  (1, 3354)	0.28671640581392144
  (1, 3289)	0.37297727661877506
  (1, 2889)	0.1385795841356552
  (1, 602)	0.28671640581392144
  (1, 520)	0.19344507865262492
  (1, 321)	0.28671640581392144
  (1, 43)	0.24547458936715758
  (1, 1)	0.21260233518669946
  (2, 6680)	0.30969080396105314
  (2, 6627)	0.3410121739015846
  (2, 4054)	0.44361668503137164
  (2, 2931)	0.6068486133983123
  (2, 2929)	0.47195476517479323
  (3, 7079)	0.29334330258175106
  (3, 6725)	0.2031810874151213
  :	:
  (1111, 7392)	0.4945753828645536
  (1111, 6826)	0.39685462025643714
  (1111, 6074)	0.4671914311419049
  (1111, 3249)	0.4477622081928626
  (1111, 2450)	0.42325261089251354
  (1112, 4885)	0.4770390302498559
  (1112, 

In [30]:
logi=LogisticRegression()

In [31]:
logi.fit(xtrain_vec,ytrain)

LogisticRegression()


In [32]:
logi.score(xtrain_vec,ytrain)

0.9661207089970832

In [33]:
logi.score(xtest_vec,ytest)

0.9623318385650225

In [34]:
pred_logi=logi.predict(xtest_vec)
pred_logi

array([1, 1, 1, ..., 1, 1, 1])

In [35]:
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score

In [36]:
accuracy_score(ytest,pred_logi)

0.9623318385650225

In [37]:
confusion_matrix(ytest,pred_logi)

array([[114,  41],
       [  1, 959]], dtype=int64)

In [38]:
print(classification_report(ytest,pred_logi))

              precision    recall  f1-score   support

           0       0.99      0.74      0.84       155
           1       0.96      1.00      0.98       960

    accuracy                           0.96      1115
   macro avg       0.98      0.87      0.91      1115
weighted avg       0.96      0.96      0.96      1115
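
The figures in the report can be checked by hand from the confusion matrix. The sketch below assumes scikit-learn's convention that rows are true classes and columns are predicted classes, with class 0 (spam) first; the variable names are illustrative:

```python
# Counts from the confusion matrix above (rows = true class, columns = predicted).
# Class 0 is spam and class 1 is ham, following the notebook's encoding.
spam_as_spam, spam_as_ham = 114, 41
ham_as_spam, ham_as_ham = 1, 959

total = spam_as_spam + spam_as_ham + ham_as_spam + ham_as_ham

accuracy = (spam_as_spam + ham_as_ham) / total
# Of everything predicted as spam, how much really is spam.
spam_precision = spam_as_spam / (spam_as_spam + ham_as_spam)
# Of all true spam, how much was caught.
spam_recall = spam_as_spam / (spam_as_spam + spam_as_ham)
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)

print(f"accuracy       = {accuracy:.4f}")
print(f"spam precision = {spam_precision:.2f}")
print(f"spam recall    = {spam_recall:.2f}")
print(f"spam F1        = {spam_f1:.2f}")
```

These values reproduce the 0.99, 0.74 and 0.84 in the class 0 row of the report. The low recall comes from the 41 spam messages that were classified as ham, which the overall accuracy of 0.96 does not reveal on its own.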



As we saw, we used previously collected data to train the model and then predicted the category of new incoming emails. This shows the importance of tagging the data correctly, because mislabeled examples make the model worse. For example, if you receive an email in Gmail or any other email account that you think is spam but choose to ignore it, consider reporting it as spam the next time you see it. Doing so helps the many other people who receive the same kind of email but may not recognize it as spam. A wrong spam tag can also move a genuine email to the spam folder, so be careful before you tag an email as spam or not spam.
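
Predicting the category of a new incoming message can be sketched by bundling the vectorizer and the classifier into one scikit-learn pipeline. This is an illustration on toy data rather than the notebook's trained model, and `spam_filter` is a hypothetical name:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data (illustrative only); labels follow the notebook: spam = 0, ham = 1.
train_msgs = [
    "win a free prize now", "free cash offer claim today",
    "are we still meeting for lunch", "see you at home tonight",
]
train_labels = [0, 0, 1, 1]

# The pipeline vectorizes and classifies in one step, so new text is
# always transformed with the vocabulary learned during training.
spam_filter = make_pipeline(
    TfidfVectorizer(stop_words="english", lowercase=True),
    LogisticRegression(),
)
spam_filter.fit(train_msgs, train_labels)

new_msgs = ["claim your free prize", "lunch at home today"]
predictions = spam_filter.predict(new_msgs)
print(predictions)
```

With only four training messages the predictions are not meaningful in themselves; the point is that raw text goes in and a 0/1 label comes out without a separate `transform` call.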