# Spam or Ham?

In this notebook, I use the SMS data from the `spam-sms.csv` file to predict whether an SMS is spam or ham.

I implemented several classification algorithms below with sklearn:
- Naive Bayes algorithms (Gaussian, Bernoulli and Multinomial Naive Bayes)
- Logistic Regression
- tree algorithms (CART, ID3)
- k-nearest neighbors (KNN)
- ensemble methods (Bagging, AdaBoost, Random Forest)

Julien Verdun
07/12/2020

In [1]:
import operator
import re
import time
from collections import Counter

import numpy as np
import pandas as pd

from sklearn import tree
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, recall_score, precision_score, roc_auc_score
from sklearn.model_selection import cross_validate, KFold
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.neighbors import KNeighborsClassifier


In [2]:
# The file is read with five column names; only the label and the message text are kept
spam_df = pd.read_csv(
    "spam-sms.csv", header=0, encoding="latin-1", names=["target", "sms", "1", "2", "3"]
)[["target", "sms"]]
spam_df.head()

| | target | sms |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [3]:
value_counts = spam_df["target"].value_counts()
print(value_counts)

ham     4825
spam     747
Name: target, dtype: int64


In [4]:
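# Undersample the ham class so that it has twice as many rows as the spam class.
# The counts are looked up by label rather than by position.
n_ham, n_spam = value_counts["ham"], value_counts["spam"]
spam_df = pd.concat([
    spam_df[spam_df["target"] == "ham"].sample(frac=2 * n_spam / n_ham, random_state=200),
    spam_df[spam_df["target"] == "spam"],
])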

spam_df["target"].value_counts()

ham     1494
spam     747
Name: target, dtype: int64
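
The sampling fraction in the cell above is chosen so that the ham class ends up with twice as many rows as the spam class. A minimal sketch of the same arithmetic on a synthetic frame (`toy_df` and its class sizes are made up for illustration, not taken from the real data):

```python
import pandas as pd

# Synthetic, imbalanced data: 100 ham rows and 10 spam rows (made-up sizes).
toy_df = pd.DataFrame({
    "target": ["ham"] * 100 + ["spam"] * 10,
    "sms": ["message {}".format(i) for i in range(110)],
})

counts = toy_df["target"].value_counts()

# Keep two ham rows per spam row: frac * n_ham == 2 * n_spam.
frac = 2 * counts["spam"] / counts["ham"]

balanced_df = pd.concat([
    toy_df[toy_df["target"] == "ham"].sample(frac=frac, random_state=200),
    toy_df[toy_df["target"] == "spam"],
])

balanced_counts = balanced_df["target"].value_counts()
print(balanced_counts)
```

With the real counts, the fraction is 2 × 747 / 4825 ≈ 0.31, which yields the 1494 ham rows shown above.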

In [6]:
vectorizer = CountVectorizer()
input_data = vectorizer.fit_transform(spam_df["sms"]).toarray()
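
To make the bag-of-words representation concrete, here is a small sketch of what `CountVectorizer` produces on three made-up messages (`toy_sms` and the other names are illustrative assumptions, not the notebook's data):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up messages, used only to show the shape of the output.
toy_sms = ["free entry to win", "ok see you later", "win a free prize"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_sms)  # sparse document-term matrix

# The learned words, in the same (alphabetical) order as the matrix columns.
vocabulary = sorted(toy_vectorizer.vocabulary_)

# One row per message, one column per word, each cell a word count.
dense_counts = toy_counts.toarray()

print(vocabulary)
print(dense_counts)
```

With the default settings, text is lowercased and single-character tokens such as "a" are dropped. The `.toarray()` call turns the sparse matrix into a dense array, which `GaussianNB` needs, at the cost of more memory.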

In [7]:
def bintarget(x):
    if x == "ham":
        return 0
    return 1

In [8]:
spam_df["vectorized_sms"] = [list(elt) for elt in list(input_data)]

spam_df["targetbin"] = spam_df["target"].apply(bintarget)

spam_df.head()

| | target | sms | vectorized_sms | targetbin |
|---|---|---|---|---|
| 1840 | ham | Yeah. I got a list with only u and Joanna if I... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 0 |
| 3067 | ham | Boy you best get yo ass out here quick | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 0 |
| 4720 | ham | Yup. Anything lor, if u dun wan it's ok... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 0 |
| 1473 | ham | Will do, you gonna be at blake's all night? I ... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 0 |
| 4573 | ham | :( but your not here.... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 0 |


In [9]:
output_data = np.array(spam_df["targetbin"].values)

In [10]:
clfs = {
    'GNB': GaussianNB(),
    'MNG': MultinomialNB(),
    'BNB': BernoulliNB(),
    'LOG': LogisticRegression(random_state=1),
    'CART': tree.DecisionTreeClassifier(random_state=1, criterion='gini'),
    'ID3': tree.DecisionTreeClassifier(random_state=1, criterion='entropy'),
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'RF': RandomForestClassifier(n_estimators=50, random_state=1),
    'BGC': BaggingClassifier(n_estimators=50),
    'ADB': AdaBoostClassifier(n_estimators=50)
}

In [11]:
def run_classifiers(clfs, X, Y):
    # 10-fold cross-validation with a fixed seed so that every classifier sees the same splits
    kf = KFold(n_splits=10, shuffle=True, random_state=1)
    # (printed label, sklearn scoring name), in the order they are reported
    metrics = [("Accuracy", "accuracy"), ("AUC", "roc_auc"), ("Precision", "precision"),
               ("Recall", "recall"), ("F1", "f1")]
    for name, clf in clfs.items():
        print("=============== {0} ============== \n".format(name))
        t0 = time.time()
        cv_scores = cross_validate(clf, X, Y, cv=kf, scoring=[metric for _, metric in metrics])
        for label, metric in metrics:
            scores = cv_scores["test_" + metric]
            # Mean and standard deviation of the score over the 10 folds
            print("{0} : {1:.3f} +/- {2:.3f}".format(label, np.mean(scores), np.std(scores)))
        # Wall-clock time in seconds for the whole cross-validation of this classifier
        print("Total : {0:.3f}".format(time.time() - t0))
        print()
        print()

run_classifiers(clfs,input_data, output_data)


=============== GNB ==============

Accuracy : 0.876 +/- 0.021
AUC : 0.897 +/- 0.020
Precision : 0.744 +/- 0.031
Recall : 0.961 +/- 0.022
F1 : 0.838 +/- 0.026
Total : 6.857


=============== MNG ==============

Accuracy : 0.972 +/- 0.010
AUC : 0.984 +/- 0.008
Precision : 0.967 +/- 0.015
Recall : 0.947 +/- 0.025
F1 : 0.957 +/- 0.017
Total : 4.266


=============== BNB ==============

Accuracy : 0.963 +/- 0.012
AUC : 0.995 +/- 0.005
Precision : 0.991 +/- 0.013
Recall : 0.897 +/- 0.037
F1 : 0.941 +/- 0.021
Total : 5.867


=============== LOG ==============

Accuracy : 0.963 +/- 0.013
AUC : 0.987 +/- 0.007
Precision : 0.971 +/- 0.013
Recall : 0.916 +/- 0.035
F1 : 0.942 +/- 0.023
Total : 13.938


=============== CART ==============

Accuracy : 0.944 +/- 0.018
AUC : 0.933 +/- 0.024
Precision : 0.927 +/- 0.025
Recall : 0.902 +/- 0.045
F1 : 0.914 +/- 0.030
Total : 72.879


=============== ID3 ==============

Accuracy : 0.932 +/- 0.027
AUC : 0.918 +/- 0.029
Precision : 0.913 +/- 0.053
Recall : 0.877 +/- 0.037
F1 : 0.895 +/- 0.042
Total : 44.321


=============== KNN ==============

Accuracy : 0.829 +/- 0.023
AUC : 0.896 +/- 0.030
Precision : 1.000 +/- 0.000
Recall : 0.484 +/- 0.076
F1 : 0.649 +/- 0.072
Total : 156.716


=============== RF ==============

Accuracy : 0.9

All the classifiers reached high performance (greater than 80% accuracy). Some algorithms are slower than others, especially the tree classifiers and KNN, which take a long time on this much input data.
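
Accuracy alone can hide very different error profiles, which is why precision, recall and F1 are reported as well. A minimal sketch on made-up labels (`y_true` and `y_pred` are assumptions for illustration) of a classifier that never raises a false alarm but misses half of the spam:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels: 1 = spam, 0 = ham.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# The classifier flags only two of the four spam messages and no ham message.
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total = (2 + 6) / 10
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 2
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 4
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(accuracy, precision, recall, f1)
```

This is the same pattern as the result above with a precision of 1.000 and a recall of 0.484: every message flagged as spam really is spam, but about half of the spam goes undetected.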

The **Naive Bayes** methods give better results than the version I implemented from scratch. One major difference is the word vectorizer, which is optimized in sklearn. My own vectorizer did not take plurals and similar word variations into account.
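
Plural handling can also be layered on top of `CountVectorizer`, whose default analyzer does not merge singular and plural forms. A minimal sketch, assuming a deliberately naive suffix rule (`strip_plural` and `plural_aware_analyzer` are hypothetical helpers written for this example, not part of sklearn or of the from-scratch implementation):

```python
from sklearn.feature_extraction.text import CountVectorizer

def strip_plural(word):
    # Hypothetical, naive rule: drop a trailing "s" from words longer than
    # three characters, except words ending in "ss" (e.g. "miss").
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

# Reuse sklearn's default lowercasing and tokenization.
base_analyzer = CountVectorizer().build_analyzer()

def plural_aware_analyzer(text):
    # Tokenize with the default analyzer, then normalize each token.
    return [strip_plural(token) for token in base_analyzer(text)]

# Made-up messages for illustration.
toy_sms = ["win free prizes", "claim your prize", "free calls"]

plural_vectorizer = CountVectorizer(analyzer=plural_aware_analyzer)
plural_counts = plural_vectorizer.fit_transform(toy_sms)
plural_vocabulary = sorted(plural_vectorizer.vocabulary_)

print(plural_vocabulary)
```

Here "prizes" and "prize" share a single column. A real stemmer or lemmatizer would handle irregular forms that this rule misses.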


The best method is the **Multinomial Naive Bayes** method, with an accuracy of 97.2%, an F-measure of 95.7%, and a relatively low running time (less than 5 seconds for the full cross-validation).
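
Once a model has been chosen, it can be fitted once and applied to new messages. A minimal sketch, assuming a pipeline that bundles the vectorizer with the classifier (the pipeline, the toy messages and the name `spam_model` are illustrative assumptions, not the code used above):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training messages; 1 = spam, 0 = ham.
train_sms = [
    "win a free prize now",
    "free entry claim your prize",
    "urgent call now to win cash",
    "free cash prize call now",
    "are you coming home tonight",
    "ok see you later",
    "i will call you after lunch",
    "see you at home tonight",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The pipeline learns the vocabulary and the classifier in a single fit call.
spam_model = make_pipeline(CountVectorizer(), MultinomialNB())
spam_model.fit(train_sms, train_labels)

# New raw messages are vectorized with the learned vocabulary, then classified.
new_sms = ["claim your free cash prize", "see you tonight"]
predictions = spam_model.predict(new_sms)
print(predictions)
```

A pipeline like this can also be passed to `cross_validate` with the raw messages as input, so that the vocabulary is learned only from the training folds of each split.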