##### Spam Mail Classification
The project is relating to classifying the messages as spam or not. We will use the nltk and Machine Learning libraries for the same.

In [1]:
# importing pandas to read the csv file
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
df=pd.read_csv("messages.csv")

In [2]:
df

Unnamed: 0,subject,message,label
0,job posting - apple-iss research center,content - length : 3386 apple-iss research cen...,0
1,,"lang classification grimes , joseph e . and ba...",0
2,query : letter frequencies for text identifica...,i am posting this inquiry for sergei atamas ( ...,0
3,risk,a colleague and i are researching the differin...,0
4,request book information,earlier this morning i was on the phone with a...,0
...,...,...,...
2888,love your profile - ysuolvpv,hello thanks for stopping by ! ! we have taken...,1
2889,you have been asked to join kiddin,"the list owner of : "" kiddin "" has invited you...",1
2890,anglicization of composers ' names,"judging from the return post , i must have sou...",0
2891,"re : 6 . 797 , comparative method : n - ary co...",gotcha ! there are two separate fallacies in t...,0


In [3]:
# checking the shape of the dataframe
df.shape

(2893, 3)

In [4]:
# checking the datatypes of the dataframe
df.dtypes

subject    object
message    object
label       int64
dtype: object

In [5]:
#checking the null values in the dataset
df.isnull().sum()
#62 values in the subject column 

subject    62
message     0
label       0
dtype: int64

In [6]:
#Filling the null values with the term 'Not Available'
df['subject']=df['subject'].fillna('Not Available')

In [7]:
# importing nltk and other necessary libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from string import punctuation
import string



In [8]:
# importing stemming libraries and initialising them
from nltk.stem import WordNetLemmatizer, PorterStemmer
wordnet=WordNetLemmatizer()
import re
stemmer=PorterStemmer()

In [9]:
# creating a separate list for storing the processed data of the column 'Subject'
subject=[]
for i in range(0, len(df['subject'])):
    review=re.sub('[^a-zA-Z]',' ', df['subject'][i])
    review=review.lower()
    review=review.split()
    review=[stemmer.stem(word) for word in review if not word in string.punctuation if not word in stopwords.words('english')]
    review= ' '.join(review)
    subject.append(review)
               
                      
subject

['job post appl iss research center',
 'avail',
 'queri letter frequenc text identif',
 'risk',
 'request book inform',
 'call abstract optim syntact theori',
 'scandinavian linguist',
 'call paper linguist session mla',
 'foreign languag commerci',
 'fulbright announc pleas post dissemin list',
 'gala call paper',
 'bu conf languag develop announc',
 'korean softwar macintosh',
 'avail',
 'simultan preposit postposit pashto',
 'sum imper without subject',
 'polici',
 'correct hellenist greek announc',
 'question audio sampl',
 'sexism languag',
 'teach english korea',
 'free',
 'email address w dressler',
 'dhumbadji journal histori languag',
 'question quantit inform',
 'amhar',
 'uniformitarian',
 'qs phonem write',
 'intens summer arab languag institut',
 'list compar literatur',
 'call abstract',
 'call paper rocl',
 'stress bibliographi',
 'depend grammar',
 'call paper system workshop',
 'sum e mail citat',
 'job announc',
 'address chang changement adress',
 'internet success t

In [10]:
# creating a separate list for storing the processed messages of the column 'message'
message=[]
for i in range(0, len(df['message'])):
    review=df['message'][i].lower()
    review=review.split()
    review=[stemmer.stem(word) for word in review if not word in string.punctuation if not word in stopwords.words('english')]
    review= ' '.join(review)
    message.append(review)
    
message

["content length 3386 apple-iss research center us 10 million joint ventur appl comput inc institut system scienc nation univers singapor locat singapor look senior speech scientist success candid research expertis comput linguist includ natur languag process english chines statist languag model knowledg state-of the-art corpus-bas n gram languag model cach languag model part-of speech languag model requir text speech project leader success candid research expertis expertis two follow area comput linguist includ natur languag pars lexic databas design statist languag model text token normal prosod analysi substanti knowledg phonolog syntax semant chines requir knowledg acoust phonet speech signal process desir candid phd least 2 4 year relev work experi technic msc degre least 5 7 year experienc e strong softwar engin skill includ design implement product requir posit knowledg c c unix prefer unix c programm look experienc unix c programm prefer good industri experi join us break new f

#### Using Message (actual message) as input
#### Using Tfidvectorizer

In [11]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

vec=TfidfVectorizer()
x=vec.fit_transform(message).toarray()


In [12]:
y=df['label']

In [13]:
train_x,test_x,train_y,test_y=train_test_split(x,y,random_state=42,test_size=0.20)

In [14]:
from sklearn.naive_bayes import GaussianNB
gb=GaussianNB()
gb.fit(train_x,train_y)
pred=gb.predict(test_x)
print("The accuracy score is", accuracy_score(test_y,pred))

The accuracy score is 0.9222797927461139


In [15]:
print(confusion_matrix(test_y,pred))
print(classification_report(test_y,pred))

[[461   3]
 [ 42  73]]
              precision    recall  f1-score   support

           0       0.92      0.99      0.95       464
           1       0.96      0.63      0.76       115

    accuracy                           0.92       579
   macro avg       0.94      0.81      0.86       579
weighted avg       0.93      0.92      0.92       579



#### using countvectorizer

In [16]:

count=CountVectorizer()
x=count.fit_transform(message).toarray()

In [17]:
train_x,test_x,train_y,test_y=train_test_split(x,y,random_state=42,test_size=0.20)

In [18]:
from sklearn.naive_bayes import GaussianNB
gb=GaussianNB()
gb.fit(train_x,train_y)
pred=gb.predict(test_x)
print("The accuracy score is", accuracy_score(test_y,pred))

The accuracy score is 0.927461139896373


In [19]:
print(confusion_matrix(test_y,pred))
print(classification_report(test_y,pred))

[[461   3]
 [ 39  76]]
              precision    recall  f1-score   support

           0       0.92      0.99      0.96       464
           1       0.96      0.66      0.78       115

    accuracy                           0.93       579
   macro avg       0.94      0.83      0.87       579
weighted avg       0.93      0.93      0.92       579



#### trying other classifiers using countvectorizer
since count vectorizer is giving better results we can consider the same 

In [20]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
AS=[]
CV=[]

models=[LogisticRegression(),DecisionTreeClassifier(),KNeighborsClassifier(),SVC(),GaussianNB()]

for i in models:
    i.fit(train_x,train_y)
    pred=i.predict(test_x)
    print(i)
    acc_score=accuracy_score(pred,test_y)
    print("\nAccuracy Score",acc_score)
    AS.append(acc_score)
    print(classification_report(pred,test_y))
    print(confusion_matrix)
    print("")
    cvs=cross_val_score(i,x,y,cv=5).mean()
    print("The cross validation score is", cvs)
    CV.append(cvs)
    print("")

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

Accuracy Score 0.998272884283247
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       465
           1       0.99      1.00      1.00       114

    accuracy                           1.00       579
   macro avg       1.00      1.00      1.00       579
weighted avg       1.00      1.00      1.00       579

<function confusion_matrix at 0x0000024015BDDA68>

The cross validation score is 0.9920486856630848

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,

In [21]:
score=pd.DataFrame({'Models':models,'Accuracy Score':AS,'Cross Validation Score':CV})
score

Unnamed: 0,Models,Accuracy Score,Cross Validation Score
0,"LogisticRegression(C=1.0, class_weight=None, d...",0.998273,0.992049
1,"DecisionTreeClassifier(class_weight=None, crit...",0.967185,0.955074
2,"KNeighborsClassifier(algorithm='auto', leaf_si...",0.943005,0.913582
3,"SVC(C=1.0, cache_size=200, class_weight=None, ...",0.853195,0.872101
4,"GaussianNB(priors=None, var_smoothing=1e-09)",0.927461,0.939167


Conclusion:
Logistic Regression is giving best accuracy score and cross validation score. So we can conclude the same.

In [22]:
#Saving the logistic regression model
lg=LogisticRegression()
lg.fit(train_x,train_y)
pred=lg.predict(test_x)
acc_score=accuracy_score(pred,test_y)
print("\nAccuracy Score",acc_score)
cvs=cross_val_score(lg,x,y,cv=5).mean()
print("The cross validation score is", cvs)

    
from sklearn.externals import joblib
joblib.dump(lg,"Spam_Mails_LG.pkl")


Accuracy Score 0.998272884283247
The cross validation score is 0.9920486856630848




['Spam_Mails_LG.pkl']

##### Message subject as input instead of message

In [23]:
# we can try and see if we can keep subject as the input
x1=vec.fit_transform(subject).toarray()

In [24]:
train_x,test_x,train_y,test_y=train_test_split(x1,y,random_state=42,test_size=0.20)


In [25]:
# trying different models
models=[LogisticRegression(),DecisionTreeClassifier(),KNeighborsClassifier(),SVC(),GaussianNB()]
AS=[]
CV=[]


for i in models:
    i.fit(train_x,train_y)
    pred=i.predict(test_x)
    print(i)
    acc_score=accuracy_score(pred,test_y)
    print("\nAccuracy Score",acc_score)
    AS.append(acc_score)
    print(classification_report(pred,test_y))
    print(confusion_matrix)
    print("")
    cvs=cross_val_score(i,x,y,cv=5).mean()
    print("The cross validation score is", cvs)
    CV.append(cvs)
    print("")

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

Accuracy Score 0.853195164075993
              precision    recall  f1-score   support

           0       1.00      0.85      0.92       549
           1       0.26      1.00      0.41        30

    accuracy                           0.85       579
   macro avg       0.63      0.92      0.66       579
weighted avg       0.96      0.85      0.89       579

<function confusion_matrix at 0x0000024015BDDA68>

The cross validation score is 0.9920486856630848

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,

In [26]:
score=pd.DataFrame({'Models':models,'Accuracy Score':AS,'Cross Validation Score':CV})
score

Unnamed: 0,Models,Accuracy Score,Cross Validation Score
0,"LogisticRegression(C=1.0, class_weight=None, d...",0.853195,0.992049
1,"DecisionTreeClassifier(class_weight=None, crit...",0.891192,0.956104
2,"KNeighborsClassifier(algorithm='auto', leaf_si...",0.784111,0.913582
3,"SVC(C=1.0, cache_size=200, class_weight=None, ...",0.801382,0.872101
4,"GaussianNB(priors=None, var_smoothing=1e-09)",0.891192,0.939167


In [27]:
# Lets use gridsearchcv on decisiontree to get better results
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
dt=DecisionTreeClassifier()
parameters={'criterion':('gini','entropy'), 'splitter':('best','random')}
gddt=GridSearchCV(dt,parameters)
gddt.fit(train_x,train_y)
gddt.best_params_

{'criterion': 'entropy', 'splitter': 'random'}

In [28]:
# Let us finalise the Decision Tree classifier which is giving better Accuracy and Cross validation score
dt=DecisionTreeClassifier(criterion='entropy',splitter='random')
dt.fit(train_x,train_y)
pred=dt.predict(test_x)
print("Accuracy score is", accuracy_score(test_y,pred))
print("The Cross Validation Score is", cross_val_score(dt,x,y,cv=5).mean())

Accuracy score is 0.8981001727115717
The Cross Validation Score is 0.9550674411794586


In [34]:
# Saving the model
joblib.dump(dt,"Spam_mail_on_subject.pkl")

['Spam_mail_on_subject.pkl']