### Email spam detection with machine learning

#### Context:

##### Classifying emails into distinct labels can have a great impact on customer support. By using machine learning to label emails, the system can set up queues that each contain emails of a specific category. This enables support personnel to handle requests more quickly and easily by selecting a queue that matches their expertise.

#### Objectives:

#### This study aims to improve on the manually defined rule-based algorithm currently implemented at a large telecom company by using machine learning. The proposed model should have a higher F1-score and classification rate. Integrating or migrating from a manually defined rule-based model to a machine learning model should also reduce administrative and maintenance work and make the model more flexible.

Feature Information:
    columns
    Unnamed: 0 -----> ID from the CSV      int64,
    label      -----> spam/ham             object,
    text       -----> message              object,
    label_num  -----> 0 = ham, 1 = spam    int64

In [34]:
from IPython.display import Image
Image(url='https://www.pantechelearning.com/wp-content/uploads/2021/12/Spam-classification.png', width=400)

In [35]:
#Import the Python libraries used in this notebook.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score,f1_score,recall_score,precision_score

In [36]:
#Load dataset
df=pd.read_csv("spam.csv")
df

,Unnamed: 0,label,text,label_num
0,605,ham,Subject: enron methanol ; meter # : 988291\r\n...,0
1,2349,ham,"Subject: hpl nom for january 9 , 2001\r\n( see...",0
2,3624,ham,"Subject: neon retreat\r\nho ho ho , we ' re ar...",0
3,4685,spam,"Subject: photoshop , windows , office . cheap ...",1
4,2030,ham,Subject: re : indian springs\r\nthis deal is t...,0
...,...,...,...,...
5166,1518,ham,Subject: put the 10 on the ft\r\nthe transport...,0
5167,404,ham,Subject: 3 / 4 / 2000 and following noms\r\nhp...,0
5168,2933,ham,Subject: calpine daily gas nomination\r\n>\r\n...,0
5169,1409,ham,Subject: industrial worksheets for august 2000...,0


In [37]:
#List the columns present in df
df.columns

Index(['Unnamed: 0', 'label', 'text', 'label_num'], dtype='object')

In [38]:
#Check column dtypes and non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5171 entries, 0 to 5170
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  5171 non-null   int64 
 1   label       5171 non-null   object
 2   text        5171 non-null   object
 3   label_num   5171 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 161.7+ KB


In [39]:
#check the number of rows and columns present in df
print('rows---->',df.shape[0])
print('columns---->',df.shape[1])

rows----> 5171
columns----> 4


In [40]:
#Count the null values in each column of df
df.isnull().sum()

Unnamed: 0    0
label         0
text          0
label_num     0
dtype: int64

In [41]:
df.isnull().mean()*100  #check the percentage of null value

Unnamed: 0    0.0
label         0.0
text          0.0
label_num     0.0
dtype: float64

#### As the output shows, none of the four columns has missing values, so no column needs to be removed because of nulls.

In [42]:
df

,Unnamed: 0,label,text,label_num
0,605,ham,Subject: enron methanol ; meter # : 988291\r\n...,0
1,2349,ham,"Subject: hpl nom for january 9 , 2001\r\n( see...",0
2,3624,ham,"Subject: neon retreat\r\nho ho ho , we ' re ar...",0
3,4685,spam,"Subject: photoshop , windows , office . cheap ...",1
4,2030,ham,Subject: re : indian springs\r\nthis deal is t...,0
...,...,...,...,...
5166,1518,ham,Subject: put the 10 on the ft\r\nthe transport...,0
5167,404,ham,Subject: 3 / 4 / 2000 and following noms\r\nhp...,0
5168,2933,ham,Subject: calpine daily gas nomination\r\n>\r\n...,0
5169,1409,ham,Subject: industrial worksheets for august 2000...,0


In [43]:
df.shape

(5171, 4)

In [44]:
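#Rename the columns for readability (df.rename would also work).
#Note: assigning to df.columns is positional, so 'Unnamed: 0' becomes 'spam/ham',
#'label' becomes 'sms', 'text' becomes 'words' and 'label_num' becomes 'length'.
df.columns=(['spam/ham','sms', 'words', 'length'])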
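
A name-based rename ties each new name to its source column, so the result does not depend on the column order in the file. A minimal sketch on a small made-up frame that reuses the column names reported by `df.info()`; the rows and the new name `id` are illustrative assumptions, not part of the original notebook:

```python
import pandas as pd

# Toy stand-in for the real data: the rows are made up,
# only the column names match the df.info() output.
toy = pd.DataFrame({
    "Unnamed: 0": [605, 4685],
    "label": ["ham", "spam"],
    "text": ["Subject: meeting notes", "Subject: cheap software"],
    "label_num": [0, 1],
})

# Rename by name rather than by position; unlisted columns keep their names.
renamed = toy.rename(
    columns={"Unnamed: 0": "id", "label": "spam/ham", "text": "sms"}
)
print(list(renamed.columns))
```

With this mapping the `spam/ham` column holds the string labels and `sms` holds the message text.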

In [45]:
#Encode the labels numerically: spam = 0, ham = 1
#(no rows match here, because 'spam/ham' currently holds the integer IDs)
df.loc[df['spam/ham'] == 'spam', 'spam/ham'] = 0
df.loc[df['spam/ham'] == 'ham', 'spam/ham'] = 1
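
The same encoding can be written as a single `Series.map` call, which returns an integer column directly. A minimal sketch on made-up labels, keeping the convention used above (spam = 0, ham = 1); note that the dataset's own `label_num` column, as displayed earlier, uses the opposite convention (ham = 0, spam = 1):

```python
import pandas as pd

# Made-up labels standing in for the notebook's label column.
labels = pd.Series(["ham", "spam", "ham", "ham"], name="spam/ham")

# Map each string label to an integer; any unmapped value would become NaN.
encoded = labels.map({"spam": 0, "ham": 1})
print(encoded.tolist())
print(encoded.dtype)
```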

In [46]:
df

,spam/ham,sms,words,length
0,605,ham,Subject: enron methanol ; meter # : 988291\r\n...,0
1,2349,ham,"Subject: hpl nom for january 9 , 2001\r\n( see...",0
2,3624,ham,"Subject: neon retreat\r\nho ho ho , we ' re ar...",0
3,4685,spam,"Subject: photoshop , windows , office . cheap ...",1
4,2030,ham,Subject: re : indian springs\r\nthis deal is t...,0
...,...,...,...,...
5166,1518,ham,Subject: put the 10 on the ft\r\nthe transport...,0
5167,404,ham,Subject: 3 / 4 / 2000 and following noms\r\nhp...,0
5168,2933,ham,Subject: calpine daily gas nomination\r\n>\r\n...,0
5169,1409,ham,Subject: industrial worksheets for august 2000...,0


In [47]:
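#Divide the data into features (x) and target (y) for model training.
#After the rename above, 'sms' holds the ham/spam labels and 'spam/ham' holds the IDs.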
x=df.sms
x

0        ham
1        ham
2        ham
3       spam
4        ham
        ... 
5166     ham
5167     ham
5168     ham
5169     ham
5170    spam
Name: sms, Length: 5171, dtype: object

In [48]:
y =df['spam/ham']
y

0        605
1       2349
2       3624
3       4685
4       2030
        ... 
5166    1518
5167     404
5168    2933
5169    1409
5170    4807
Name: spam/ham, Length: 5171, dtype: int64

In [49]:
#Divide the whole dataset into training and testing sets for model training
from sklearn.model_selection import train_test_split

In [50]:
xtrain,xtest,ytrain,ytest=train_test_split(x,y,test_size=0.2,random_state=3)
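
When one class is rarer than the other, passing `stratify` to `train_test_split` keeps the class ratio the same in the training and test parts. A hedged sketch on made-up data; `stratify` is an addition for illustration and is not used in the original cell:

```python
from sklearn.model_selection import train_test_split

# Made-up messages: 8 ham (label 1) and 2 spam (label 0).
texts = [f"message {i}" for i in range(10)]
labels = [1] * 8 + [0] * 2

# stratify=labels keeps the ham/spam ratio equal in the train and test parts.
x_tr, x_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.5, random_state=3, stratify=labels
)
print(len(x_tr), len(x_te))
print(sorted(y_tr), sorted(y_te))
```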

In [51]:
print(x.shape)
print(xtrain.shape)
print(xtest.shape)

(5171,)
(4136,)
(1035,)


In [52]:
xtrain,xtest

(2209     ham
 2000     ham
 5030     ham
 1376     ham
 1564    spam
         ... 
 789     spam
 968     spam
 1667     ham
 3321     ham
 1688    spam
 Name: sms, Length: 4136, dtype: object,
 4020    spam
 3561    spam
 3434     ham
 111      ham
 1126     ham
         ... 
 2078    spam
 334     spam
 4746     ham
 2850    spam
 2180     ham
 Name: sms, Length: 1035, dtype: object)

In [53]:
ytrain,ytest

(2209     558
 2000    1561
 5030     744
 1376    2745
 1564    4815
         ... 
 789     4454
 968     4721
 1667    2733
 3321    2441
 1688    3760
 Name: spam/ham, Length: 4136, dtype: int64,
 4020    4464
 3561    5018
 3434    1888
 111      111
 1126     261
         ... 
 2078    4866
 334     4353
 4746     199
 2850    4007
 2180     439
 Name: spam/ham, Length: 1035, dtype: int64)

#### Machine learning algorithms work on numbers, so the text data must be converted into numerical form. To do so, I will use the TfidfVectorizer technique from sklearn's feature_extraction module.
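
To make the idea concrete, here is a minimal sketch of TF-IDF vectorization on a few made-up messages, using the same `TfidfVectorizer` settings as the cell below. The texts and variable names are hypothetical. The vectorizer learns its vocabulary from the training texts only and then reuses it for the test texts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up messages for illustration only.
train_texts = [
    "win a free prize now",
    "meeting agenda for monday",
    "free prize inside",
]
test_texts = ["free meeting"]

vec = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)

# fit_transform learns the vocabulary and IDF weights from the training texts.
x_train_demo = vec.fit_transform(train_texts)
# transform reuses that vocabulary; words unseen in training are ignored.
x_test_demo = vec.transform(test_texts)

print(sorted(vec.vocabulary_))
print(x_train_demo.shape, x_test_demo.shape)
```

Each row of the resulting sparse matrix is one message and each column is one vocabulary term.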

In [54]:
feat_vect=TfidfVectorizer(min_df=1,stop_words='english',lowercase=True)
feat_vect

In [55]:
ytrain=ytrain.astype('int')
ytest=ytest.astype('int')

In [56]:
xtrain_vec =feat_vect.fit_transform(xtrain)


In [57]:
xtest_vec =feat_vect.transform(xtest)

In [58]:
print(xtrain)

2209     ham
2000     ham
5030     ham
1376     ham
1564    spam
        ... 
789     spam
968     spam
1667     ham
3321     ham
1688    spam
Name: sms, Length: 4136, dtype: object


In [59]:
xtrain_vec

<4136x2 sparse matrix of type '<class 'numpy.float64'>'
	with 4136 stored elements in Compressed Sparse Row format>
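
The matrix has only two columns because the vectorizer was fitted on `xtrain`, which here holds the strings `ham` and `spam` rather than the message text, so the learned vocabulary consists of just those two terms. A minimal sketch that reproduces the effect on made-up input:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up series that contains labels instead of message text.
labels_as_text = pd.Series(["ham", "ham", "spam"])

vec = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)
x_demo = vec.fit_transform(labels_as_text)

print(sorted(vec.vocabulary_))  # ['ham', 'spam']
print(x_demo.shape)             # (3, 2)
print(x_demo.toarray())
```

Every row has a single non-zero entry of 1.0, which matches the pattern printed for `xtrain_vec` and `xtest_vec`.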

In [60]:
print(xtrain_vec)

  (0, 0)	1.0
  (1, 0)	1.0
  (2, 0)	1.0
  (3, 0)	1.0
  (4, 1)	1.0
  (5, 0)	1.0
  (6, 0)	1.0
  (7, 0)	1.0
  (8, 0)	1.0
  (9, 0)	1.0
  (10, 1)	1.0
  (11, 0)	1.0
  (12, 0)	1.0
  (13, 0)	1.0
  (14, 0)	1.0
  (15, 0)	1.0
  (16, 0)	1.0
  (17, 1)	1.0
  (18, 0)	1.0
  (19, 0)	1.0
  (20, 0)	1.0
  (21, 0)	1.0
  (22, 0)	1.0
  (23, 0)	1.0
  (24, 1)	1.0
  :	:
  (4111, 0)	1.0
  (4112, 0)	1.0
  (4113, 0)	1.0
  (4114, 0)	1.0
  (4115, 1)	1.0
  (4116, 0)	1.0
  (4117, 0)	1.0
  (4118, 0)	1.0
  (4119, 1)	1.0
  (4120, 0)	1.0
  (4121, 0)	1.0
  (4122, 1)	1.0
  (4123, 1)	1.0
  (4124, 0)	1.0
  (4125, 0)	1.0
  (4126, 0)	1.0
  (4127, 0)	1.0
  (4128, 1)	1.0
  (4129, 0)	1.0
  (4130, 0)	1.0
  (4131, 1)	1.0
  (4132, 1)	1.0
  (4133, 0)	1.0
  (4134, 0)	1.0
  (4135, 1)	1.0


In [61]:
print(xtest_vec)

  (0, 1)	1.0
  (1, 1)	1.0
  (2, 0)	1.0
  (3, 0)	1.0
  (4, 0)	1.0
  (5, 0)	1.0
  (6, 1)	1.0
  (7, 0)	1.0
  (8, 0)	1.0
  (9, 1)	1.0
  (10, 1)	1.0
  (11, 0)	1.0
  (12, 0)	1.0
  (13, 0)	1.0
  (14, 0)	1.0
  (15, 0)	1.0
  (16, 0)	1.0
  (17, 1)	1.0
  (18, 0)	1.0
  (19, 0)	1.0
  (20, 1)	1.0
  (21, 0)	1.0
  (22, 0)	1.0
  (23, 0)	1.0
  (24, 1)	1.0
  :	:
  (1010, 0)	1.0
  (1011, 0)	1.0
  (1012, 1)	1.0
  (1013, 0)	1.0
  (1014, 1)	1.0
  (1015, 0)	1.0
  (1016, 0)	1.0
  (1017, 1)	1.0
  (1018, 0)	1.0
  (1019, 0)	1.0
  (1020, 0)	1.0
  (1021, 1)	1.0
  (1022, 0)	1.0
  (1023, 1)	1.0
  (1024, 0)	1.0
  (1025, 0)	1.0
  (1026, 0)	1.0
  (1027, 0)	1.0
  (1028, 0)	1.0
  (1029, 0)	1.0
  (1030, 1)	1.0
  (1031, 1)	1.0
  (1032, 0)	1.0
  (1033, 1)	1.0
  (1034, 0)	1.0


In [62]:
logi=LogisticRegression()

In [63]:
logi.fit(xtrain_vec,ytrain)

In [64]:
logi.score(xtrain_vec,ytrain)

0.00048355899419729207

In [65]:
logi.score(xtest_vec,ytest)

0.0
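
A test score of 0.0 is what one would expect from the inputs printed above: `ytrain` and `ytest` contain the integer IDs, each of which appears to occur only once, so the model is asked to predict IDs it has never seen. For comparison, here is a minimal sketch of the same TF-IDF plus logistic regression steps on made-up messages with binary labels (spam = 0, ham = 1). The data and variable names are hypothetical, and the result says nothing about the real dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Made-up messages with binary labels: spam = 0, ham = 1.
train_texts = [
    "win a free prize now",
    "cheap software offer",
    "free offer click now",
    "meeting agenda for monday",
    "gas nomination for january",
    "please review the report",
]
train_labels = [0, 0, 0, 1, 1, 1]
test_texts = ["free prize offer", "monday meeting report"]
test_labels = [0, 1]

# Features come from the message text, targets from the binary labels.
vec = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)
x_train_vec = vec.fit_transform(train_texts)
x_test_vec = vec.transform(test_texts)

clf = LogisticRegression()
clf.fit(x_train_vec, train_labels)

pred = clf.predict(x_test_vec)
print(pred)
print(accuracy_score(test_labels, pred))
```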

In [66]:
pred_logi=logi.predict(xtest_vec)
pred_logi

array([4815, 4815,  846, ...,  846, 4815,  846])

In [67]:
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score

In [68]:
accuracy_score(ytest,pred_logi)

0.0

In [69]:
confusion_matrix(ytest,pred_logi)

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
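
With one class per ID value, the confusion matrix is large and almost entirely zeros. For the binary spam/ham task described in the objectives, the matrix is 2x2 and precision, recall and F1 follow directly from it. A sketch with hand-made labels (spam = 0, ham = 1), treating spam as the positive class; the label vectors are invented for illustration:

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hand-made ground truth and predictions (spam = 0, ham = 1).
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1]

# Rows are true classes, columns are predicted classes, ordered [0, 1].
cm = confusion_matrix(y_true, y_pred)
print(cm)  # [[1 1]
           #  [1 3]]

# pos_label=0 scores the spam class: 1 true positive, 1 false positive, 1 false negative.
print(precision_score(y_true, y_pred, pos_label=0))  # 0.5
print(recall_score(y_true, y_pred, pos_label=0))     # 0.5
print(f1_score(y_true, y_pred, pos_label=0))         # 0.5
print(accuracy_score(y_true, y_pred))                # 4 of 6 correct
```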

In [70]:
print(classification_report(ytest,pred_logi))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00       1.0
          22       0.00      0.00      0.00       1.0
          26       0.00      0.00      0.00       1.0
          27       0.00      0.00      0.00       1.0
          31       0.00      0.00      0.00       1.0
          37       0.00      0.00      0.00       1.0
          39       0.00      0.00      0.00       1.0
          40       0.00      0.00      0.00       1.0
          42       0.00      0.00      0.00       1.0
          43       0.00      0.00      0.00       1.0
          44       0.00      0.00      0.00       1.0
          47       0.00      0.00      0.00       1.0
          63       0.00      0.00      0.00       1.0
          67       0.00      0.00      0.00       1.0
          71       0.00      0.00      0.00       1.0
          85       0.00      0.00      0.00       1.0
          88       0.00      0.00      0.00       1.0
          91       0.00    



As we saw, we used previously collected data to train a model and predict the category of new incoming emails. This shows how important it is to label the data correctly: a single mistake can mislead the model. For example, when you receive an email in Gmail or another email account that you think is spam but choose to ignore, consider reporting it as spam the next time you see it. Doing so can help many other people who receive the same kind of email but do not recognize it as spam. Conversely, a wrong spam tag can move a genuine email to the spam folder, so be careful before you tag an email as spam or not spam.