# Spam or Ham ?

In this notebook, I will use the sms data from `spam-sms.csv` file in order to predict whether or not a sms is a spam or a ham. 

I implemented below the **naive bayes** method from scratch, without using any framework (sklearn for example).

Julien Verdun
07/12/2020

In [1]:
import pandas as pd 
import numpy as np 
from collections import Counter 
import operator
import re

In [2]:
spam_df = pd.read_csv("spam-sms.csv",header=0, encoding='latin-1',names=["target","sms","1","2","3"])[["target","sms"]]
spam_df.head()

Unnamed: 0,target,sms
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [3]:
spam_df["target"].value_counts()

ham     4825
spam     747
Name: target, dtype: int64

In [4]:
# reduction of ham number in order to rebalance the dataset

#spam_df = pd.concat([spam_df[spam_df["target"]=="ham"].sample(frac=0.3,random_state=200),spam_df[spam_df["target"]=="spam"]]).sample(frac=1).reset_index(drop=True)

# spam_df["target"].value_counts()

In [5]:
def sentence2vec(sentence,minimal_length=2):
    """
    Transform a string sentence into a vector which includes the relevant words
    """
    simplified_sentence = sentence.replace('\W+',' ').replace('\s+',' ').strip()
    simplified_sentence = simplified_sentence.lower()
    simplified_sentence = simplified_sentence.split(" ")

    vectorized_sentence = []
    for word in simplified_sentence:
        if len(word) >= minimal_length :
            vectorized_sentence.append(word)
    return vectorized_sentence


In [6]:
# add to the dataframe a column vectorized_sms with the vector of words inside the sms
spam_df["vectorized_sms"] = spam_df["sms"].apply(lambda x : sentence2vec(x))
spam_df.head()

Unnamed: 0,target,sms,vectorized_sms
0,ham,"Go until jurong point, crazy.. Available only ...","[go, until, jurong, point,, crazy.., available..."
1,ham,Ok lar... Joking wif u oni...,"[ok, lar..., joking, wif, oni...]"
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,"[free, entry, in, wkly, comp, to, win, fa, cup..."
3,ham,U dun say so early hor... U c already then say...,"[dun, say, so, early, hor..., already, then, s..."
4,ham,"Nah I don't think he goes to usf, he lives aro...","[nah, don't, think, he, goes, to, usf,, he, li..."


Creation of the contingency matrix

In [7]:
def count_word(sentence,minimal_length = 0):
    """
    This function takes a sentence and returns a counter, i-e a dictionnary of words and occurences of those words in the sentence.   
    """
    counter = Counter(sentence2vec(sentence))
    return counter

In [8]:
def countVectorizer(list_words):
    """
    This function takes a list of index corresponding to the training request sentences we want to use, and a number, corresponding to the length of the output, and computes a dictionnary of most recurrent words and another one of most frequent words.   
    """
    # initialisation of the counter of occurencies
    word_counters = Counter()
    # list for storing the number of words in every sentences
    list_len_sentences = []
    # for each index in the given list
    for index in range(len(list_words)):
        # we count the number of occurences of words in the sentence (after having clean it up)
        counter = count_word(list_words[index])
        # we fill in the list of sentences' length
        list_len_sentences.append(sum(counter.values()))
        # we concatenate the occurency and frequency counters 
        word_counters = word_counters + counter

    word_sorted = sorted(
        word_counters.items(), key=operator.itemgetter(1), reverse=True)

    return dict(word_counters)


In [9]:
def train_test_split(spam_df,X_fold_split,test_index):
    """
    This function takes the dataframe spam_df and a list of indexes inside each fold X_fold_split, the index of the fold considered as the test fold, and returns a testing set with the data considered as testing set and a training set with the other folds. 
    """
    test_set = spam_df.loc[X_fold_split[test_index]]
    train_set = []
    for i in range(10):
        if i != test_index:
            train_set.append(spam_df.loc[X_fold_split[i]])
    train_set = pd.concat(train_set)
    return train_set.reset_index(drop=True),test_set.reset_index(drop=True)

In [10]:
# probability of being a spam
p_spam = spam_df["target"].value_counts()["spam"]/np.sum(spam_df["target"].value_counts())

In [11]:
def lissage_laplace(spam_count,ham_count):
    """
    This function computes the probabilities of every word inside spam_count and ham_count using the number of occurencies of each word, the total number of words, the laplace_constante
    The "Lissage de Laplace" prevents from null probabilities 
    """
    spam_total_occ = sum(spam_count.values())
    ham_total_occ = sum(ham_count.values())
    laplace_constante = len(Counter(spam_count)+Counter(ham_count))
    spam_proba={}
    ham_proba={}
    for word in spam_count:
        spam_proba[word] = (spam_count[word]+1)/(spam_total_occ+laplace_constante) 
    for word in ham_count:
        ham_proba[word] = (ham_count[word]+1)/(ham_total_occ+laplace_constante)

    return spam_proba,ham_proba

In [12]:
def naiveBayesClassifier(spam_proba,ham_proba,sentence,p_spam):
    """
    This function predicts for a given sentence whether or not it is a spam with naives Bayes method.
    """
    spam_score = 1
    ham_score = 1
    vectorizedSentence = count_word(sentence)
    for word in dict(vectorizedSentence):
        if word in spam_proba:
            spam_score *= spam_proba[word]**vectorizedSentence[word]
        if word in ham_proba:
            ham_score *= ham_proba[word]**vectorizedSentence[word]

    if spam_score*p_spam > ham_score*(1-p_spam):
        return 1 
    return 0

In [13]:
def confusion_matrix_nbclf(spam_proba,ham_proba,test_df, p_spam):
    """
    This function creates the confusion matrix of the model with the test_df data
    """
    conf_mat = np.zeros((2,2))
    for i in range(test_df["sms"].shape[0]):
        is_spam = naiveBayesClassifier(spam_proba,ham_proba,test_df["sms"][i], p_spam)
        if test_df["target"][i] == "spam":
            conf_mat[1][1-is_spam] += 1
        else :
            conf_mat[0][1-is_spam] += 1
    return conf_mat

In [14]:
def f1_score(conf_matrix):
    """
    This function computes the following metrics : accuracy, precision, recall and F-measure as follow :
    F1 = 2 * (precision * recall) / (precision + recall)
    P = Tp/(Tp+Fp)
    R = Tp/(Tp+Fn)
    """
    A = np.round(np.sum(np.diag(conf_matrix))/np.sum(conf_matrix),2)
    P = conf_matrix[0][0]/ (conf_matrix[0][0] + conf_matrix[0][1])
    R = conf_matrix[0][0]/ (conf_matrix[0][0] + conf_matrix[1][0])
    F1 = np.round(2*P*R/(P+R),2)
    return A, F1, np.round(P,2) ,np.round(R,2)
    

In [15]:
def X_fold_cross_validation(spam_df,n_folds=10,p_spam=p_spam):
    """
    This function makes a X-fold cross-validation on spam_df data and returns for each metrics a list of performances per fold.
    """
    A = []
    F1 = []
    P = []
    R = []
    X_fold_split = np.array_split(spam_df.index, n_folds)
    print("----- TRAINING -----")
    for i in range(n_folds):
        train_df,test_df = train_test_split(spam_df,X_fold_split,i)
        spam_count = countVectorizer(np.array(train_df[train_df["target"]=="spam"]["sms"].values))
        ham_count = countVectorizer(np.array(train_df[train_df["target"]=="ham"]["sms"].values))
        spam_proba,ham_proba = lissage_laplace(spam_count,ham_count)
        conf_mat = confusion_matrix_nbclf(spam_proba,ham_proba,test_df, p_spam)
        Ai, F1i, Pi ,Ri = f1_score(conf_mat)
        A.append(Ai)
        F1.append(F1i)
        P.append(Pi)
        R.append(Ri)
        print("Fold {} : acc {} ; F1 {} ; P {} ; R {}".format(i,Ai,F1i,Pi,Ri))
    return A,F1,P,R

In [16]:
A,F1,P,R = X_fold_cross_validation(spam_df)

----- TRAINING -----
Fold 0 : acc 0.87 ; F1 0.92 ; P 0.87 ; R 0.98
Fold 1 : acc 0.88 ; F1 0.92 ; P 0.88 ; R 0.97
Fold 2 : acc 0.86 ; F1 0.92 ; P 0.87 ; R 0.97
Fold 3 : acc 0.88 ; F1 0.93 ; P 0.87 ; R 0.99
Fold 4 : acc 0.86 ; F1 0.91 ; P 0.86 ; R 0.97
Fold 5 : acc 0.86 ; F1 0.92 ; P 0.87 ; R 0.98
Fold 6 : acc 0.87 ; F1 0.92 ; P 0.88 ; R 0.98
Fold 7 : acc 0.88 ; F1 0.93 ; P 0.89 ; R 0.97
Fold 8 : acc 0.86 ; F1 0.91 ; P 0.86 ; R 0.97
Fold 9 : acc 0.88 ; F1 0.93 ; P 0.88 ; R 0.98


In [17]:
print("----- TEST -----")
print("\nAccuracy : ",np.round(100*np.mean(A),3),"\nF1-score : ",np.round(100*np.mean(F1),3),"\nPrecision : ",np.round(100*np.mean(P),3),"\nRecall : ",np.round(100*np.mean(R),3))

----- TEST -----

Accuracy :  87.0 
F1-score :  92.1 
Precision :  87.3 
Recall :  97.6


The **Naive Bayes** method implemented from scartch reached an accuracy of 87% with a good F-measure of 92.1%. 