In [16]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.naive_bayes import GaussianNB
import matplotlib.pyplot as plt

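# Load the tab-separated SMS Spam Collection; the file has no header row.
df = pd.read_csv(r"C:\Users\user\Downloads\SMSSPAMcollection.csv",
                 sep='\t',
                 header=None,
                 names=['status', 'message'])
# Encode the labels numerically: ham -> 0, spam -> 1.
df['status'] = df['status'].map({'ham': 0, 'spam': 1})
df.head()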

   status  message
0  0       Go until jurong point, crazy.. Available only ...
1  0       Ok lar... Joking wif u oni...
2  1       Free entry in 2 a wkly comp to win FA Cup fina...
3  0       U dun say so early hor... U c already then say...
4  0       Nah I don't think he goes to usf, he lives aro...


In [17]:
# Hold out 33% of the messages for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(df['message'],
                                                    df['status'],
                                                    test_size=0.33,
                                                    random_state=0)

In [18]:
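vectorizer = CountVectorizer()
# Learn the vocabulary from the training messages and build their count matrix.
data_train = vectorizer.fit_transform(X_train)
# Reuse that vocabulary for the test messages (V is the test count matrix).
V = vectorizer.transform(X_test)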
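
To make the difference between fit_transform and transform concrete, here is a minimal sketch on a few made-up messages. The messages and the names train_msgs, test_msgs, and toy_vectorizer are invented for illustration and are not part of the SMS dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy messages, invented for illustration (not from the SMS dataset).
train_msgs = ["free prize now", "see you at lunch", "win a free prize"]
test_msgs = ["free lunch tomorrow"]

toy_vectorizer = CountVectorizer()
# fit_transform learns the vocabulary from the training messages only.
toy_train = toy_vectorizer.fit_transform(train_msgs)
# transform reuses that vocabulary; unseen words such as "tomorrow" are ignored.
toy_test = toy_vectorizer.transform(test_msgs)

# Columns are ordered alphabetically by token.
vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)
print(toy_test.toarray())
```

With the default settings, single-character tokens such as "a" are not counted, so the vocabulary here has eight words. Fitting only on the training messages keeps information about the test set out of the features.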


In [19]:
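# Baseline: multinomial naive Bayes on the word counts.
MultiNB = MultinomialNB()
MultiNB.fit(data_train, y_train)
y_pred = MultiNB.predict(V)
# Precision, recall, and F1 are reported for the positive class (spam = 1).
print('Accuracy Score:', format(accuracy_score(y_test, y_pred)))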
print('Precision score: ', format(precision_score(y_test, y_pred)))
print('Recall score: ', format(recall_score(y_test, y_pred)))
print('F1 score: ', format(f1_score(y_test, y_pred)))

Accuracy Score: 0.9864056552474171
Precision score:  0.9737991266375546
Recall score:  0.9214876033057852
F1 score:  0.9469214437367304
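
As a sanity check on what these four numbers mean, the sketch below recomputes them from confusion-matrix counts. The counts are an assumption: they were worked backward from the printed scores and are one set of values consistent with them, since the notebook itself never prints a confusion matrix.

```python
# Hypothetical confusion-matrix counts, inferred from the printed scores
# (an assumption; the notebook never prints a confusion matrix).
tp, fp, fn, tn = 223, 6, 19, 1591

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)   # of the messages flagged as spam, the share that was spam
recall = tp / (tp + fn)      # of the actual spam messages, the share that was flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

Read this way, the baseline rarely mislabels ham as spam (high precision) but lets some spam through (lower recall).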


Turns Out...

It turns out that our Naive Bayes model already does a pretty good job. Still, let's look at a few additional models to see whether we can improve on it.

Specifically in this notebook, we will take a look at the following techniques:

    BaggingClassifier
    RandomForestClassifier
    AdaBoostClassifier

Another useful guide to ensemble methods can be found in the scikit-learn documentation.

These ensemble methods use a combination of techniques you have seen throughout this lesson:

    Bootstrap the data passed through a learner (bagging).
    Subset the features used for a learner (together with bagging, this gives the two random components of random forests).
    Ensemble learners together in a way that allows those that perform best in certain areas to create the largest impact (boosting).
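
The first two ideas in the list can be sketched in a few lines of NumPy. This is a simplified illustration on made-up data, not how scikit-learn implements them: in particular, RandomForestClassifier draws a fresh feature subset at each split, whereas this sketch draws one subset for a whole learner.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(10, 6))   # 10 made-up rows, 6 made-up features
y = rng.integers(0, 2, size=10)        # made-up binary labels

# Bagging: draw row indices with replacement, same size as the original data.
row_idx = rng.integers(0, X.shape[0], size=X.shape[0])
# Feature subsetting: pick a few columns without replacement.
col_idx = rng.choice(X.shape[1], size=2, replace=False)

# The data one learner in the ensemble would be trained on.
X_boot, y_boot = X[row_idx][:, col_idx], y[row_idx]
print(X_boot.shape, len(np.unique(row_idx)))
```

Because rows are drawn with replacement, some appear more than once and others not at all, so each learner sees a slightly different view of the data.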

In this notebook, let's get some practice with these methods, which will also help you get comfortable with the general process for supervised machine learning in Python.

Since you cleaned and vectorized the text in the previous notebook, this notebook can be focused on the fun part - the machine learning part.

This Process Looks Familiar...

In general, there is a five-step process that can be used each time you want to use a supervised learning method (which you actually used above):

    Import the model.
    Instantiate the model with the hyperparameters of interest.
    Fit the model to the training data.
    Predict on the test data.
    Score the model by comparing the predictions to the actual values.
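
The five steps can be sketched end to end on a tiny made-up dataset. The arrays and the names X_toy_train, X_toy_test, model, and toy_pred are hypothetical and chosen only so that each step is visible; alpha=1.0 is simply the default smoothing value.

```python
# Step 1: import the model (and a scoring function).
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Made-up word counts: column 0 = "free", column 1 = "lunch".
X_toy_train = np.array([[3, 0], [2, 0], [0, 2], [0, 3]])
y_toy_train = np.array([1, 1, 0, 0])   # 1 = spam, 0 = ham
X_toy_test = np.array([[4, 0], [0, 1]])
y_toy_test = np.array([1, 0])

# Step 2: instantiate the model with the hyperparameters of interest.
model = MultinomialNB(alpha=1.0)
# Step 3: fit the model to the training data.
model.fit(X_toy_train, y_toy_train)
# Step 4: predict on the test data.
toy_pred = model.predict(X_toy_test)
# Step 5: score the predictions against the actual values.
toy_acc = accuracy_score(y_toy_test, toy_pred)
print(toy_pred, toy_acc)
```

Swapping in a different classifier only changes steps 1 and 2; the fit, predict, and score calls stay the same.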

Work through this notebook to apply these steps with each of the ensemble methods: BaggingClassifier, RandomForestClassifier, and AdaBoostClassifier.

In [11]:
from sklearn.ensemble import BaggingClassifier
# Bag 500 learners; with no base learner given, each one is a decision tree.
clf = BaggingClassifier(random_state=0,
                        n_estimators=500).fit(data_train, y_train)
y_predict = clf.predict(V)
print('Accuracy score: ', format(accuracy_score(y_test, y_predict)))
print('Precision score: ', format(precision_score(y_test, y_predict)))
print('Recall score: ', format(recall_score(y_test, y_predict)))
print('F1 score: ', format(f1_score(y_test, y_predict)))

Accuracy score:  0.9771615008156607
Precision score:  0.9464285714285714
Recall score:  0.8760330578512396
F1 score:  0.9098712446351931


In [12]:
from sklearn.ensemble import RandomForestClassifier
clf1 = RandomForestClassifier(n_estimators=20, random_state=0)
clf1.fit(data_train, y_train)
y_pred2 = clf1.predict(V)
print('Accuracy Score:', format(accuracy_score(y_test, y_pred2)))
print('Precision score: ', format(precision_score(y_test, y_pred2)))
print('Recall score: ', format(recall_score(y_test, y_pred2)))
print('F1 score: ', format(f1_score(y_test, y_pred2)))

Accuracy Score: 0.9722675367047309
Precision score:  0.9948186528497409
Recall score:  0.7933884297520661
F1 score:  0.8827586206896552


In [13]:
from sklearn.ensemble import AdaBoostClassifier
clf2 = AdaBoostClassifier(n_estimators=50, random_state=0)
clf2.fit(data_train, y_train)
y_pred3 = clf2.predict(V)
print('Accuracy Score:', format(accuracy_score(y_test, y_pred3)))
print('Precision score: ', format(precision_score(y_test, y_pred3)))
print('Recall score: ', format(recall_score(y_test, y_pred3)))
print('F1 score: ', format(f1_score(y_test, y_pred3)))

Accuracy Score: 0.9744426318651441
Precision score:  0.9333333333333333
Recall score:  0.8677685950413223
F1 score:  0.899357601713062


In [14]:
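from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

# Bag 500 support vector classifiers instead of the default decision trees.
# Note: scikit-learn 1.2 renamed base_estimator to estimator and 1.4 removed
# the old name, so pass estimator=SVC() on recent versions.
clf = BaggingClassifier(random_state=0, base_estimator=SVC(),
                        n_estimators=500).fit(data_train, y_train)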
y_predict2 = clf.predict(V)
print('Accuracy score: ', format(accuracy_score(y_test, y_predict2)))
print('Precision score: ', format(precision_score(y_test, y_predict2)))
print('Recall score: ', format(recall_score(y_test, y_predict2)))
print('F1 score: ', format(f1_score(y_test, y_predict2)))

Accuracy score:  0.9809679173463839
Precision score:  0.9952153110047847
Recall score:  0.859504132231405
F1 score:  0.9223946784922394
