# Fitting several classifiers for spam detection

This notebook walks through several examples of classification, using algorithms on their own or combined in an ensemble. Tuning and parameter optimization are not the focus yet, so most model classes are instantiated with default parameters. Naive Bayes serves as the baseline benchmark.

In [56]:
# A first look at the data:
#import os
#print(os.getcwd()+"\\SMSSpamCollection")

%ls
with open("SMSSpamCollection","r") as data:
    print(data.readlines()[0:3])

 Volume in drive C is Windows
 Volume Serial Number is 04C4-F2DC

 Directory of C:\Users\Guillermo\Desktop\Udacity_DSN\py_exercises\spam_clasifier

12/28/2018  02:42 PM    <DIR>          .
12/28/2018  02:42 PM    <DIR>          ..
12/28/2018  02:42 PM    <DIR>          .ipynb_checkpoints
12/28/2018  01:14 PM             5,868 readme
12/28/2018  01:14 PM           477,907 SMSSpamCollection
12/28/2018  02:42 PM           530,182 Spam Classification.ipynb
               3 File(s)      1,013,957 bytes
               3 Dir(s)  332,797,198,336 bytes free
['ham\tGo until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...\n', 'ham\tOk lar... Joking wif u oni...\n', "spam\tFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's\n"]


In [58]:
## Data pre-processing
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# Read data in:
df= pd.read_table("SMSSpamCollection", header=None, sep='\t', names=['label','sms_message'])

# Make response numerical:
df['label']= df.label.map({'ham':0, 'spam':1})

# Split dataset into training and testing sets:
X_train, X_test, y_train, y_test= train_test_split(df['sms_message'],
                                                   df['label'],
                                                   random_state=1)
# Convert text into a bag of words in matrix form:
count_vector= CountVectorizer()
training_data= count_vector.fit_transform(X_train)
testing_data= count_vector.transform(X_test)
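
To make the bag-of-words step more concrete, here is a minimal pure-Python sketch of the idea behind `CountVectorizer`. It is a simplified stand-in, not scikit-learn's implementation: it lowercases the text, keeps tokens of two or more word characters (mirroring the default token pattern), learns a sorted vocabulary from the training messages only, and counts occurrences. The helper names (`tokenize`, `fit_vocabulary`, `transform`) and the toy messages are assumptions made for illustration.

```python
import re
from collections import Counter

# Tokens of 2+ word characters, similar to CountVectorizer's default pattern.
TOKEN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN.findall(text.lower())

def fit_vocabulary(docs):
    """Map each distinct training token to a column index (alphabetical order)."""
    terms = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {term: idx for idx, term in enumerate(terms)}

def transform(docs, vocab):
    """Turn each document into a row of counts; unseen tokens are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(term, 0) for term in vocab])
    return rows

# Hypothetical toy messages, not taken from SMSSpamCollection.
toy_train = ["Free entry to win", "Ok lar, joking", "Win a free prize, free"]
toy_test = ["Free prize call now"]

vocab = fit_vocabulary(toy_train)          # learned from training data only
train_matrix = transform(toy_train, vocab)
test_matrix = transform(toy_test, vocab)   # "call" and "now" are not in vocab

print(list(vocab))
print(train_matrix)
print(test_matrix)
```

Note that the test messages are only transformed, never used to build the vocabulary. This mirrors the `fit_transform` / `transform` split in the cell above, which keeps information from the test set out of the features.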

In [64]:
######### NAIVE BAYES CLASSIFIER ###########################################################
## Insert a cell above explaining the nuts and bolts of the Naive Bayes algorithm.
from sklearn.naive_bayes import MultinomialNB
naive_bayes= MultinomialNB()

# Naive Bayes fitting:
naive_bayes.fit(training_data, y_train)

# Naive Bayes testing:
predictions= naive_bayes.predict(testing_data)

# Evaluate performance:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Naive Bayes performance metrics\n')
print('Accuracy score: ', format(accuracy_score(y_test, predictions)))
print('Precision score: ', format(precision_score(y_test, predictions)))
print('Recall score: ', format(recall_score(y_test, predictions)))
print('F1 score: ', format(f1_score(y_test, predictions)))

Naive Bayes performance metrics

Accuracy score:  0.9885139985642498
Precision score:  0.9720670391061452
Recall score:  0.9405405405405406
F1 score:  0.9560439560439562
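
As a rough illustration of what a multinomial Naive Bayes model computes, the sketch below trains one from scratch on a handful of made-up messages. It assumes whitespace tokenization and add-one (Laplace) smoothing with `alpha=1.0`; the function names `train_nb` and `predict_nb` and the toy data are hypothetical, and this is not how scikit-learn implements `MultinomialNB` internally.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    vocab = {w for d in docs for w in d.split()}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for d, y in zip(docs, labels):
        word_counts[y].update(d.split())
    model = {}
    for y in class_docs:
        total = sum(word_counts[y].values())
        log_prior = math.log(class_docs[y] / len(docs))
        # Add-alpha smoothing avoids zero probabilities for unseen words.
        log_like = {
            w: math.log((word_counts[y][w] + alpha) / (total + alpha * len(vocab)))
            for w in vocab
        }
        model[y] = (log_prior, log_like)
    return model

def predict_nb(model, doc):
    """Return the class with the highest posterior log score."""
    scores = {}
    for y, (log_prior, log_like) in model.items():
        # Words outside the training vocabulary are skipped.
        scores[y] = log_prior + sum(log_like[w] for w in doc.split() if w in log_like)
    return max(scores, key=scores.get)

# Hypothetical toy data: 1 = spam, 0 = ham.
docs = ["free prize win", "win free entry", "ok see you later", "see you at lunch"]
labels = [1, 1, 0, 0]
model = train_nb(docs, labels)

print(predict_nb(model, "free entry win"))
print(predict_nb(model, "see you later"))
```

The "naive" part is the sum inside `predict_nb`: each word contributes independently to the score, given the class.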


In [66]:
########## Ensemble methods ###############
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier 
from sklearn.ensemble import RandomForestClassifier

# Instantiate each with default hyper-parameters, except the number of weak learners
# (n_estimators=200 for bagging and random forest, 300 for AdaBoost) and AdaBoost's learning_rate:
bag_class= BaggingClassifier(n_estimators=200)
rf_class= RandomForestClassifier(n_estimators=200)
ada_class= AdaBoostClassifier(n_estimators=300, learning_rate=0.2)

# Fitting each instance of a classification ensemble:
#X_train, X_test, y_train, y_test
bag_class.fit(X= training_data, y=y_train)
rf_class.fit(X= training_data, y= y_train)
ada_class.fit(X= training_data, y= y_train)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=0.2, n_estimators=300, random_state=None)
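
For intuition about what bagging does, the following is a minimal pure-Python sketch: each weak learner is trained on a bootstrap sample (drawn with replacement) and the ensemble predicts by majority vote. It assumes a single numeric feature and a one-threshold "stump" as the weak learner, whereas `BaggingClassifier` uses decision trees by default. The function names and the toy data are hypothetical.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Pick the threshold with the fewest training errors; predicts 1 when x > threshold."""
    best_t, best_err = None, None
    for t in sorted(set(xs)):
        err = sum((x > t) != bool(y) for x, y in zip(xs, ys))
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t

def bagging_fit(xs, ys, n_estimators=25, seed=0):
    """Train one stump per bootstrap sample of the training data."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps

def bagging_predict(stumps, x):
    """Aggregate the stumps' predictions by majority vote."""
    votes = Counter(int(x > t) for t in stumps)
    return votes.most_common(1)[0][0]

# Hypothetical feature: number of "spammy" words in a message; 1 = spam, 0 = ham.
xs = [0, 0, 1, 1, 2, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

stumps = bagging_fit(xs, ys)
print(bagging_predict(stumps, 0), bagging_predict(stumps, 9))
```

A random forest follows the same recipe but also restricts each tree split to a random subset of features, while AdaBoost trains its learners sequentially and reweights the examples that earlier learners got wrong.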

In [67]:
#Model testing:
bag_preds= bag_class.predict(X=testing_data)
rf_preds= rf_class.predict(X=testing_data)
ada_preds= ada_class.predict(X=testing_data)

In [80]:
# Ensemble classifiers --- Model Evaluation:
def print_metrics(y_true, preds, model_name=None):
    '''
    Report classification metrics for a set of predictions.
    Inputs: y_true= actual values (np array or pd Series)
            preds= predictions from a model (np array or pd Series)
            model_name= optional label printed as a header (str)
    Output: prints accuracy, precision, recall, F1
    '''
    # Print a header only when a model name is given:
    if model_name is not None:
        print('Model performance for ' + model_name)
    print('Accuracy : ', format(accuracy_score(y_true, preds)))
    print('Precision : ', format(precision_score(y_true, preds)))
    print('Recall : ', format(recall_score(y_true, preds)))
    print('F1 : ', format(f1_score(y_true, preds)))
    # Separate consecutive named reports with blank lines:
    if model_name is not None:
        print('\n')

In [81]:
# test above:
print_metrics(y_test, predictions, model_name='Naive Bayes')
print_metrics(y_test, bag_preds, model_name='Bagging, n=200')
print_metrics(y_test, rf_preds, model_name='Random Forest, n=200')
print_metrics(y_test, ada_preds, model_name='AdaBoost, n=300')

Model performance for Naive Bayes
Accuracy :  0.9885139985642498
Precision :  0.9720670391061452
Recall :  0.9405405405405406
F1 :  0.9560439560439562


Model performance for Bagging, n=200
Accuracy :  0.9763101220387652
Precision :  0.9222222222222223
Recall :  0.8972972972972973
F1 :  0.9095890410958904


Model performance for Random Forest, n=200
Accuracy :  0.9820531227566404
Precision :  1.0
Recall :  0.8648648648648649
F1 :  0.927536231884058


Model performance for AdaBoost, n=300
Accuracy :  0.9770279971284996
Precision :  0.9693251533742331
Recall :  0.8540540540540541
F1 :  0.9080459770114943
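
To see how the four scores relate to each other, the sketch below recomputes them from confusion-matrix counts. The counts are not printed anywhere in this notebook: they were worked out backwards from the reported scores for Naive Bayes and Random Forest, and should be read as an assumption consistent with those scores rather than as recorded output. The function name `metrics_from_counts` is hypothetical.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 for the positive (spam) class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of messages flagged as spam, how many were spam
    recall = tp / (tp + fn)      # of actual spam messages, how many were flagged
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Counts inferred from the reported scores (assumption, not notebook output).
nb_scores = metrics_from_counts(tp=174, fp=5, fn=11, tn=1203)   # Naive Bayes
rf_scores = metrics_from_counts(tp=160, fp=0, fn=25, tn=1208)   # Random Forest

print('Naive Bayes  :', nb_scores)
print('Random Forest:', rf_scores)
```

Under these counts, Random Forest never flags a legitimate message as spam (precision of 1.0) but misses more spam than Naive Bayes does, which is why its recall and F1 are lower even though its accuracy is close.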


