# Our Mission

Spam detection is one of the major applications of Machine Learning on the web today. Pretty much all of the major email service providers have spam detection systems built in that automatically classify such mail as 'Junk Mail'.

In this mission we will use the Naive Bayes algorithm to create a model that can classify SMS messages as spam or not spam, based on the training we give to the model. It is important to have some intuition as to what a spammy text message might look like. Such messages often contain words like 'free', 'win', 'winner', 'cash', and 'prize', as they are designed to catch your eye and tempt you to open them. Spam messages also tend to have words written in all capitals and to use a lot of exclamation marks. To the human recipient, it is usually pretty straightforward to identify a spam text, and our objective here is to train a model to do that for us!
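
To make that intuition concrete, here is a minimal sketch of the surface cues described above. The trigger-word list and the function name `spam_signals` are illustrative assumptions; the model trained in this notebook does not use these hand-crafted signals and learns from word counts instead.

```python
import re

# Hypothetical trigger words, taken from the examples in the text above.
TRIGGER_WORDS = {"free", "win", "winner", "cash", "prize"}

def spam_signals(message):
    """Count a few surface cues that often show up in spam texts."""
    words = re.findall(r"[A-Za-z']+", message)
    return {
        "trigger_words": sum(w.lower() in TRIGGER_WORDS for w in words),
        # Words of two or more characters written entirely in capitals.
        "all_caps_words": sum(len(w) > 1 and w.isupper() for w in words),
        "exclamation_marks": message.count("!"),
    }

signals = spam_signals("WINNER!! Claim your FREE cash prize now!")
print(signals)
```

A message that scores high on all three counts is the kind a human would flag at a glance, and the classifier has to pick up the same pattern from labelled examples.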

Identifying spam messages is a binary classification problem, as messages are classified as either 'Spam' or 'Not Spam' and nothing else. It is also a supervised learning problem, as we will feed a labelled dataset into the model, which it can learn from to make future predictions.

At the end we will use ensemble methods:
- Bagging
- AdaBoost
- Random forest

In [1]:
# Import our libraries
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

In [2]:
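# Read in our dataset (tab-separated, no header row)
cols=['label','sms_message']
# Use a forward slash in the path: '\S' is not a valid escape sequence in a Python string
df=pd.read_csv('smsspamcollection/SMSSpamCollection',sep='\t',header=None,names=cols)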
df.head()

| | label | sms_message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [3]:
# map ham => 0 and spam => 1
df['label']=df['label'].map({'ham':0,'spam':1})
df.head()

| | label | sms_message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


In [4]:
# predictors and labels
X=df['sms_message']
y=df['label']

In [5]:
# Split our dataset into training and testing data
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=1)

In [6]:
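# Instantiate the CountVectorizer
counter_vector=CountVectorizer()
# Learn the vocabulary from the training messages and convert them to word counts
train_data=counter_vector.fit_transform(X_train)
# Convert the test messages using the vocabulary learned from the training data
test_data=counter_vector.transform(X_test)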
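
For intuition about what `fit_transform` and `transform` produce, here is a minimal pure-Python sketch of a bag-of-words count matrix. It imitates `CountVectorizer`'s default behaviour (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary) but is only an illustration: the helper names `tokenize`, `fit_vocabulary`, and `to_counts` and the toy messages are assumptions, and the real class returns a sparse matrix rather than nested lists.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, then keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def fit_vocabulary(documents):
    # Map each distinct training token to a column index, in sorted order.
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def to_counts(documents, vocabulary):
    # One row per document, one column per vocabulary entry.
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

train_docs = ["Win a free prize", "Are you free tonight"]
test_docs = ["Free free cash"]

vocab = fit_vocabulary(train_docs)        # learned from the training text only
train_rows = to_counts(train_docs, vocab)
test_rows = to_counts(test_docs, vocab)   # unseen words such as "cash" are dropped

print(vocab)
print(train_rows)
print(test_rows)
```

The vocabulary is built from the training messages alone, which is why the notebook calls `fit_transform` on `X_train` but only `transform` on `X_test`.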

In [7]:
# Instantiate our model
model=MultinomialNB()
# Fit our model to the training data
model.fit(train_data,y_train)
# Predict on the test data
prediction=model.predict(test_data)

In [8]:
# Score our model
print('Accuracy score: ', format(accuracy_score(y_test, prediction)))
print('Precision score: ', format(precision_score(y_test, prediction)))
print('Recall score: ', format(recall_score(y_test, prediction)))
print('F1 score: ', format(f1_score(y_test, prediction)))

Accuracy score:  0.9885139985642498
Precision score:  0.9720670391061452
Recall score:  0.9405405405405406
F1 score:  0.9560439560439562
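
As a reminder of what these four numbers measure, the sketch below computes them from confusion-matrix counts in plain Python. The helper name `binary_metrics` and the toy labels are illustrative assumptions; in the notebook itself the scikit-learn functions imported above do this work.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # spam caught
    fp = sum(t != positive and p == positive for t, p in pairs)  # ham flagged as spam
    fn = sum(t == positive and p != positive for t, p in pairs)  # spam missed
    tn = len(pairs) - tp - fp - fn                               # ham left alone

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy labels: 1 = spam, 0 = ham.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = binary_metrics(toy_true, toy_pred)
print(metrics)
```

Precision asks how many of the messages flagged as spam really were spam, while recall asks how many of the spam messages were caught. F1 is the harmonic mean of the two.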


### Turns Out...

We can see from the scores above that our Naive Bayes model actually does a pretty good job of classifying spam and "ham."  However, let's take a look at a few additional models to see if we can't improve anyway.

Specifically in this notebook, we will take a look at the following techniques:

* [BaggingClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html#sklearn.ensemble.BaggingClassifier)
* [RandomForestClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier)
* [AdaBoostClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html#sklearn.ensemble.AdaBoostClassifier)

Another really useful guide for ensemble methods can be found [in the documentation here](http://scikit-learn.org/stable/modules/ensemble.html).

These ensemble methods use a combination of techniques you have seen throughout this lesson:

* **Bootstrap the data** passed through a learner (bagging).
* **Subset the features** used for a learner (together with bagging, this provides the two random components of random forests).
* **Ensemble learners** together in a way that allows those that perform best in certain areas to create the largest impact (boosting).
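
The first two ideas, plus a simple way of combining learners, can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not how scikit-learn implements them: the helper names are hypothetical, and a plain majority vote stands in for the aggregation step.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    # Draw len(rows) items with replacement, so some rows repeat
    # and others are left out.
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(n_features, k, rng):
    # Pick k distinct column indices for one learner to use.
    return sorted(rng.sample(range(n_features), k))

def majority_vote(predictions):
    # Combine the per-learner predictions for a single example.
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
rows = list(range(10))                    # stand-in for 10 training rows
sample = bootstrap_sample(rows, rng)
features = random_feature_subset(n_features=6, k=2, rng=rng)
vote = majority_vote([1, 0, 1, 1, 0])     # three of five learners say "spam"

print(sample, features, vote)
```

Boosting differs from this sketch in that learners are trained one after another and their votes are weighted by how well each performed, rather than counted equally.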


In this notebook, let's get some practice with these methods, which will also help you get comfortable with the process used for performing supervised machine learning in Python in general.

Since the text was already vectorized above, the rest of this notebook can focus on the fun part: the machine learning.

### This Process Looks Familiar...

In general, there is a five-step process that can be used each time you want to use a supervised learning method (which you actually used above):

1. **Import** the model.
2. **Instantiate** the model with the hyperparameters of interest.
3. **Fit** the model to the training data.
4. **Predict** on the test data.
5. **Score** the model by comparing the predictions to the actual values.
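
The steps above can be sketched with a toy estimator, which makes the shared `fit`/`predict` interface explicit. The class `MajorityClassClassifier`, the helper `fit_predict_score`, and the toy data are assumptions made for illustration; the stand-in class replaces the import step so that the sketch has no dependencies.

```python
from collections import Counter

class MajorityClassClassifier:
    """Toy estimator that always predicts the most common training label."""

    def fit(self, X, y):
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.majority_ for _ in X]

def fit_predict_score(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)                   # Step 3: fit
    preds = model.predict(X_test)                 # Step 4: predict
    correct = sum(p == t for p, t in zip(preds, y_test))
    return correct / len(y_test)                  # Step 5: score (accuracy)

model = MajorityClassClassifier()                 # Step 2: instantiate
score = fit_predict_score(
    model,
    X_train=["hi", "win cash", "see you", "ok"], y_train=[0, 1, 0, 0],
    X_test=["free prize", "lunch?"], y_test=[1, 0],
)
print(score)
```

scikit-learn classifiers expose the same `fit` and `predict` methods, which is why the same five steps apply to each of the ensemble models.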

Work through this notebook to perform these steps with each of the ensemble methods: **BaggingClassifier**, **RandomForestClassifier**, and **AdaBoostClassifier**.

> **Step 1**: First use the documentation to `import` all three of the models.

In [9]:
# Import the Bagging, RandomForest, and AdaBoost Classifier
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier , RandomForestClassifier

> **Step 2:** Now that you have imported each of the classifiers, `instantiate` each with the hyperparameters specified in each comment.  In the upcoming lessons, you will see how we can automate the process of finding the best hyperparameters.  For now, let's get comfortable with the process and our new algorithms.

In [10]:
# Instantiate a BaggingClassifier with 200 estimators
bag_model=BaggingClassifier(n_estimators=200)

# Instantiate a RandomForestClassifier with 200 estimators
randF_model=RandomForestClassifier(n_estimators=200)

# Instantiate an AdaBoostClassifier with 200 estimators and a learning rate of 0.2
adj_model=AdaBoostClassifier(n_estimators=200,learning_rate=0.2)

> **Step 3:** Now that you have instantiated each of your models, `fit` them using **train_data** and **y_train**.  This may take a bit of time; you are fitting 600 base learners after all!

In [11]:
# Fit your BaggingClassifier to the training data
bag_model.fit(train_data,y_train)

# Fit your RandomForestClassifier to the training data
randF_model.fit(train_data,y_train)

# Fit your AdaBoostClassifier to the training data
adj_model.fit(train_data,y_train)

AdaBoostClassifier(learning_rate=0.2, n_estimators=200)

> **Step 4:** Now that you have fit each of your models, you will use each to `predict` on **test_data**.

In [12]:
# Predict using BaggingClassifier on the test data
bag_pred=bag_model.predict(test_data)

# Predict using RandomForestClassifier on the test data
randF_pred=randF_model.predict(test_data)

# Predict using AdaBoostClassifier on the test data
adj_pred=adj_model.predict(test_data)

> **Step 5:** Now that you have made your predictions, compare your predictions to the actual values using the function below for each of your models - this will give you the `score` for how well each of your models is performing. It might also be useful to show the Naive Bayes model again here, so we can compare them all side by side.

In [13]:
def print_metrics(y_true, preds, model_name=None):
    '''
    INPUT:
    y_true - the y values that are actually true in the dataset (NumPy array or pandas series)
    preds - the predictions for those values from some model (NumPy array or pandas series)
    model_name - (str - optional) a name associated with the model if you would like to add it to the print statements 
    
    OUTPUT:
    None - prints the accuracy, precision, recall, and F1 score
    '''
    if model_name is None:
        print('Accuracy score: ', format(accuracy_score(y_true, preds)))
        print('Precision score: ', format(precision_score(y_true, preds)))
        print('Recall score: ', format(recall_score(y_true, preds)))
        print('F1 score: ', format(f1_score(y_true, preds)))
        print('\n\n')
    
    else:
        print('Accuracy score for ' + model_name + ' :' , format(accuracy_score(y_true, preds)))
        print('Precision score ' + model_name + ' :', format(precision_score(y_true, preds)))
        print('Recall score ' + model_name + ' :', format(recall_score(y_true, preds)))
        print('F1 score ' + model_name + ' :', format(f1_score(y_true, preds)))
        print('\n\n')

In [14]:
# Print Bagging scores
print_metrics(y_test, bag_pred, 'bagging')

# Print Random Forest scores
print_metrics(y_test, randF_pred, 'random forest')

# Print AdaBoost scores
print_metrics(y_test, adj_pred, 'adaboost')

# Naive Bayes Classifier scores
print_metrics(y_test, prediction, 'naive bayes')

Accuracy score for bagging : 0.9748743718592965
Precision score bagging : 0.9166666666666666
Recall score bagging : 0.8918918918918919
F1 score bagging : 0.9041095890410958



Accuracy score for random forest : 0.9827709978463748
Precision score random forest : 1.0
Recall score random forest : 0.8702702702702703
F1 score random forest : 0.930635838150289



Accuracy score for adaboost : 0.9770279971284996
Precision score adaboost : 0.9693251533742331
Recall score adaboost : 0.8540540540540541
F1 score adaboost : 0.9080459770114943



Accuracy score for naive bayes : 0.9885139985642498
Precision score naive bayes : 0.9720670391061452
Recall score naive bayes : 0.9405405405405406
F1 score naive bayes : 0.9560439560439562
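
To compare the models side by side, the sketch below tabulates the scores printed above, rounded to four decimal places, and picks out the best model per metric. The dictionary layout and variable names are illustrative assumptions; the numbers are copied from the output of this run.

```python
# Scores copied from the output above, rounded to four decimal places.
results = {
    "bagging":       {"accuracy": 0.9749, "precision": 0.9167, "recall": 0.8919, "f1": 0.9041},
    "random forest": {"accuracy": 0.9828, "precision": 1.0000, "recall": 0.8703, "f1": 0.9306},
    "adaboost":      {"accuracy": 0.9770, "precision": 0.9693, "recall": 0.8541, "f1": 0.9080},
    "naive bayes":   {"accuracy": 0.9885, "precision": 0.9721, "recall": 0.9405, "f1": 0.9560},
}
metric_names = ["accuracy", "precision", "recall", "f1"]

# Print one row per model, best F1 first.
print(f"{'model':<15}" + "".join(f"{m:>11}" for m in metric_names))
for name in sorted(results, key=lambda n: results[n]["f1"], reverse=True):
    print(f"{name:<15}" + "".join(f"{results[name][m]:>11.4f}" for m in metric_names))

# Find the top model for each metric.
best = {m: max(results, key=lambda n: results[n][m]) for m in metric_names}
print(best)
```

In this run, Naive Bayes has the highest accuracy, recall, and F1 score, while the random forest has the highest precision. Because the ensemble models were instantiated without a `random_state`, rerunning the notebook may give slightly different numbers.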



