# Interview Practice- Machine Learning
Topic: SMS Spam Classifier <br>
Written by: Chieh-An Liang <br>
Date: 16/12/2018 <br>
version: 1.2 <br>

Updates: 
1. Format in report style
2. Explore data in detail: how to deal with imbalanced data
3. Re-evaluate performance criteria
4. Plot learning curve to track overfitting

<b> Step 1: Analyze data </b>
    
The first step is to load the data from the CSV file into a pandas DataFrame, which makes it easier to analyze and process.
The class distribution is uneven: about 87% ham vs. 13% spam. Normally we would consider resampling so that predictions do not lean towards ham. Here we do not resample, because Naive Bayes models the class frequencies explicitly through its class prior.
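
To make the role of the prior concrete, here is a minimal sketch of how class frequencies enter a Naive Bayes decision. The priors are rounded from the 87%/13% split above; the likelihood values are made-up numbers for illustration only.

```python
import math

# Class priors estimated from label frequencies (rounded from the split above)
prior = {"ham": 0.866, "spam": 0.134}

# Hypothetical per-class likelihoods P(message | class) for a single message
likelihood = {"ham": 0.002, "spam": 0.010}

# Naive Bayes compares log P(class) + log P(message | class) across classes
score = {c: math.log(prior[c]) + math.log(likelihood[c]) for c in prior}
prediction = max(score, key=score.get)

# The prior alone shifts the log-odds towards ham by this amount
prior_log_odds = math.log(prior["ham"] / prior["spam"])
print(prediction, round(prior_log_odds, 2))
```

In this made-up case the message is five times more likely under spam, yet the prior still tips the decision to ham. The prior accounts for the imbalance, but it does not remove the pull towards the majority class.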

We should also look at the overall word frequency. <br>
The top 10 words for ham by frequency: <br>
The top 10 words for spam by frequency: <br>
For example, if a given text contains "FREE", it is more likely to be spam.
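
One way such per-class word frequencies could be computed is sketched below. The four messages are hypothetical stand-ins for rows of the CSV, and `top_words` is an illustrative helper, not part of the notebook.

```python
import re
from collections import Counter

# Hypothetical (label, text) pairs standing in for the rows of the CSV
messages = [
    ("spam", "FREE entry! Text WIN to claim your free prize"),
    ("spam", "Win a FREE phone now"),
    ("ham", "Ok, see you at home"),
    ("ham", "I will call you when I get home"),
]

def top_words(rows, label, n=10):
    """Return the n most frequent lowercase tokens among messages with this label."""
    counts = Counter()
    for row_label, text in rows:
        if row_label == label:
            # Split on non-alphanumeric runs and skip empty tokens
            counts.update(t for t in re.split(r"[^A-Za-z0-9]+", text.lower()) if t)
    return counts.most_common(n)

print(top_words(messages, "spam", n=3))
print(top_words(messages, "ham", n=3))
```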

The raw file also has a small amount of text spilled into columns 3, 4 and 5; for simplicity, we drop those columns.


In [137]:
import numpy as np
import pandas as pd
import re


csv = pd.read_csv("spam.csv",encoding='latin-1', header=0, names=['label', 'sms1','sms2','sms3', 'sms4']).drop(['sms2','sms3','sms4'], axis=1)



ham = len(csv[csv['label'] == 'ham'])
spam = len(csv[csv['label'] == 'spam'])

spam_percent= spam / len(csv) *100
ham_percent = ham / len(csv) *100

print("Spam % = ", spam_percent)
print("Ham % = ", ham_percent)


# Encode labels as integers (ham = 1, spam = 0). Assigning the whole column
# avoids pandas' chained-assignment pitfall.
csv['label'] = csv['label'].map({'ham': 1, 'spam': 0})


csv.head()


Spam % =  13.406317300789663
Ham % =  86.59368269921033


Unnamed: 0,label,sms1
0,1,"Go until jurong point, crazy.. Available only ..."
1,1,Ok lar... Joking wif u oni...
2,0,Free entry in 2 a wkly comp to win FA Cup fina...
3,1,U dun say so early hor... U c already then say...
4,1,"Nah I don't think he goes to usf, he lives aro..."


<br>
<br>
<br>



<b>Step 2: Preprocessing/ Feature extraction</b>



Because we use a multinomial Naive Bayes bag-of-words model, we need to turn each text into a vector of token counts; stacking these vectors gives the document-term matrix. Each feature corresponds to a word in the corpus.
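
As a small illustration of that mapping, the sketch below feeds two hypothetical token-count dictionaries to scikit-learn's DictVectorizer; the dictionaries are made up, and only the resulting shape of the matrix matters here.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical token counts for two short messages
token_counts = [
    {"free": 1, "win": 2},
    {"ok": 1, "win": 1},
]

vectorizer = DictVectorizer()
X = vectorizer.fit_transform(token_counts)  # sparse matrix, one row per message

print(vectorizer.feature_names_)  # one column per distinct word
print(X.toarray())                # words missing from a message get a 0
```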

A basic preprocessing step is to remove meaningless symbols such as !@#$^&*(){}[]. A regular expression splits each message on runs of non-alphanumeric characters, which removes the delimiters and leaves the words.
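
The splitting behaviour can be checked on one of the messages from the dataset. This is only a sketch; filtering out the empty token is an optional extra step.

```python
import re

text = "Ok lar... Joking wif u oni..."

# Split on runs of characters that are not letters or digits
raw_tokens = re.split(r"[^A-Za-z0-9]+", text)

# A trailing delimiter produces an empty string, which can be filtered out
tokens = [t.lower() for t in raw_tokens if t]

print(raw_tokens)
print(tokens)
```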

Then we apply TF-IDF weighting, since we are interested in token frequency. TF-IDF combines a local effect (term frequency within a message) with a global one (inverse document frequency across the corpus) and is traditionally used to score the relevance of a query to a document. Here it plays a similar role: it weights each token by how informative it is for the label of a message.
We use the default TfidfTransformer settings: norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False.
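
With these defaults, the weighting can be reproduced by hand. The sketch below assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by l2 normalization of each row, and uses a hypothetical 2 x 3 count matrix.

```python
import math

# Hypothetical count matrix: rows are messages, columns are the terms below
terms = ["free", "ok", "win"]
counts = [
    [1, 0, 2],
    [0, 1, 1],
]

n_docs = len(counts)
# Document frequency: number of messages that contain each term
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(terms))]
# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

tfidf = []
for row in counts:
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(x * x for x in weighted))
    tfidf.append([x / norm for x in weighted])  # l2 normalization per row

for row in tfidf:
    print([round(x, 3) for x in row])
```

A term that appears in every message ("win" here) gets the minimum idf of 1, while rarer terms are weighted up.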

Finally, we split the corpus into two datasets: training and testing. Since the data is imbalanced, we pass the labels to the stratify argument of train_test_split so that both datasets keep the same ham/spam ratio.
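
The effect of stratification can be seen on a toy label vector. The 80/20 labels and placeholder features below are hypothetical; only the train_test_split call mirrors the one used in this step.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 ham (1) and 20 spam (0)
labels = [1] * 80 + [0] * 20
features = list(range(len(labels)))  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

# Both parts keep the 80/20 class ratio of the full dataset
print(Counter(y_tr), Counter(y_te))
```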

In [163]:
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn import preprocessing

ham_words=[]
spam_words=[]



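def tokenization(text):
    """Split a message into lowercase tokens and count each one."""
    tokens = {}

    # Split on runs of non-alphanumeric characters. A message that starts or
    # ends with a delimiter yields an empty-string token, which is kept
    # (see the '' keys in the output below).
    for token in re.split(r'[^A-Za-z0-9]+', text):
        token = token.lower()
        if token not in tokens:
            tokens[token] = 1
        else:
            tokens[token] += 1
    return tokens

# One token-count dictionary per message, ham and spam alike
word_matrix = []
for row in csv.itertuples():
    word_matrix.append(tokenization(row.sms1))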

word_matrix

[{'': 1,
  'amore': 1,
  'available': 1,
  'buffet': 1,
  'bugis': 1,
  'cine': 1,
  'crazy': 1,
  'e': 1,
  'go': 1,
  'got': 1,
  'great': 1,
  'in': 1,
  'jurong': 1,
  'la': 1,
  'n': 1,
  'only': 1,
  'point': 1,
  'there': 1,
  'until': 1,
  'wat': 1,
  'world': 1},
 {'': 1, 'joking': 1, 'lar': 1, 'ok': 1, 'oni': 1, 'u': 1, 'wif': 1},
 {'08452810075over18': 1,
  '2': 1,
  '2005': 1,
  '21st': 1,
  '87121': 1,
  'a': 1,
  'apply': 1,
  'c': 1,
  'comp': 1,
  'cup': 1,
  'entry': 2,
  'fa': 2,
  'final': 1,
  'free': 1,
  'in': 1,
  'may': 1,
  'question': 1,
  'rate': 1,
  'receive': 1,
  's': 2,
  'std': 1,
  't': 1,
  'text': 1,
  'tkts': 1,
  'to': 3,
  'txt': 1,
  'win': 1,
  'wkly': 1},
 {'': 1,
  'already': 1,
  'c': 1,
  'dun': 1,
  'early': 1,
  'hor': 1,
  'say': 2,
  'so': 1,
  'then': 1,
  'u': 2},
 {'around': 1,
  'don': 1,
  'goes': 1,
  'he': 2,
  'here': 1,
  'i': 1,
  'lives': 1,
  'nah': 1,
  't': 1,
  'think': 1,
  'though': 1,
  'to': 1,
  'usf': 1},
 {'1': 1,
 

In [None]:

v = DictVectorizer()
xcount= v.fit_transform(word_matrix)

tv = TfidfTransformer()
X_train_tfidf = tv.fit_transform(xcount)

X_train, X_test, y_train, y_test = train_test_split(X_train_tfidf, csv['label'].astype('int'), test_size=0.2, random_state=42, stratify=csv['label'])


<br>
<br>
<br>


<b>Step 3: Model Training</b>
    
We train a multinomial Naive Bayes model, and a Bernoulli Naive Bayes model for comparison.
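
For intuition, here is a hand-rolled sketch of multinomial Naive Bayes with Laplace smoothing on a toy corpus. It is not scikit-learn's implementation; the function names, the `alpha` parameter and the six tokenized messages are all assumptions made for illustration.

```python
import math
from collections import Counter

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    vocab = {w for d in docs for w in d}
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d)
        total = sum(counts.values()) + alpha * len(vocab)
        # Laplace smoothing keeps unseen words from zeroing the product
        log_likelihood[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_likelihood

def predict(model, doc):
    """Pick the class with the highest log prior + summed log likelihoods."""
    log_prior, log_likelihood = model
    scores = {
        c: log_prior[c] + sum(log_likelihood[c][w] for w in doc if w in log_likelihood[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Hypothetical tokenized messages
docs = [
    ["free", "prize", "win"], ["free", "entry", "win"],
    ["ok", "see", "you"], ["see", "you", "later"],
    ["ok", "lar"], ["call", "you", "later"],
]
labels = ["spam", "spam", "ham", "ham", "ham", "ham"]

model = train_multinomial_nb(docs, labels)
print(predict(model, ["free", "win", "call"]))
print(predict(model, ["see", "you", "later"]))
```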

In [133]:
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

Mnb = MultinomialNB()
Mnb.fit(X_train, y_train)

Bnb = BernoulliNB()
Bnb.fit(X_train, y_train)


array([1, 1, 1, ..., 1, 1, 1])

<br>
<br>
<br>
<b> Step 4: Model Evaluation </b>

We care about both precision and recall; with imbalanced classes, accuracy alone would be misleading, because always predicting ham already scores about 87%. We therefore report average precision rather than accuracy.
Average precision (AP) summarizes the precision-recall curve as the weighted mean of the precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight.
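
The definition can be traced by hand on a tiny example. The sketch below is an illustrative re-implementation, not scikit-learn's code, and the labels and scores are hypothetical. It also shows why the kind of score matters: hard 0/1 predictions leave only one informative threshold.

```python
def average_precision(y_true, y_score):
    """Sum of (recall increase) * precision over thresholds, highest score first."""
    total_pos = sum(y_true)
    ap, prev_recall = 0.0, 0.0
    for t in sorted(set(y_score), reverse=True):
        # Everything scoring at least t is predicted positive
        predicted_pos = [yt for yt, s in zip(y_true, y_score) if s >= t]
        tp = sum(predicted_pos)
        precision = tp / len(predicted_pos)
        recall = tp / total_pos
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap

# Hypothetical labels with continuous scores and with hard predictions
y_true = [0, 0, 1, 1]
ap_from_scores = average_precision(y_true, [0.1, 0.4, 0.35, 0.8])
ap_from_labels = average_precision(y_true, [0, 1, 1, 1])

print(round(ap_from_scores, 3), round(ap_from_labels, 3))
```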


In [136]:
from sklearn.metrics import average_precision_score


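# Note: predict() returns hard 0/1 labels, so the precision-recall curve
# behind this score has a single informative threshold; class probabilities,
# e.g. Mnb.predict_proba(X_test)[:, 1], would trace the full curve.
# Label 1 (ham) is treated as the positive class.
y_score= Mnb.predict(X_test)


average_precision = average_precision_score(y_test, y_score)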


print('Average precision-recall score: {0:0.2f}'.format(
      average_precision))


Average precision-recall score: 0.95


In [None]:
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit


def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Testing score")

    plt.legend(loc="best")
    return plt


# Cross validation with 100 iterations to get smoother mean test and train
# score curves, each time with 10% data randomly selected as a validation set.
cv = ShuffleSplit(n_splits=100, test_size=0.1, random_state=42)
# Track the same estimator as in Step 3; MultinomialNB accepts the sparse matrix directly
from sklearn.naive_bayes import MultinomialNB
plot_learning_curve(MultinomialNB(), "Learning Curves", X_train, y_train, ylim=(0.8, 1.01), cv=cv, n_jobs=4)

plt.show()

In [152]:
#6 Prediction

ham_feed= "have a party tonight to celebrate winning the lottery, don't be late! Max"
spam_feed= "congratulation! you won the US Mega Pot, claim it before august 21 by replying to 042346742"


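# Apply the fitted vectorizer, then the same TF-IDF weighting used for training
predict_ham = tv.transform(v.transform([tokenization(ham_feed)]))
predict_spam = tv.transform(v.transform([tokenization(spam_feed)]))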


print('multinomial prediction for ham: ', Mnb.predict(predict_ham))
print('multinomial prediction for spam: ',Mnb.predict(predict_spam))

print('bernoulli prediction for ham: ',Bnb.predict(predict_ham))
print('bernoulli prediction for spam: ',Bnb.predict(predict_spam))

# On these two examples the multinomial model classifies both correctly and the
# Bernoulli model only one; two messages are too few to rank the models in general.

multinomial prediction for ham:  ['ham']
multinomial prediction for spam:  ['spam']
bernoulli prediction for ham:  ['ham']
bernoulli prediction for spam:  ['ham']
