In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Grab and process the raw data.
data_path = ("https://raw.githubusercontent.com/Thinkful-Ed/data-201-resources/"
             "master/sms_spam_collection/SMSSpamCollection"
            )
sms_raw = pd.read_csv(data_path, delimiter= '\t', header=None)
sms_raw.columns = ['spam', 'message']

# Enumerate our spammy keywords.
keywords = ['click', 'offer', 'winner', 'buy', 'free', 'cash', 'urgent']

for key in keywords:
    # Pad each keyword with spaces so it only matches as a standalone word.
    sms_raw[str(key)] = sms_raw.message.str.contains(
        ' ' + str(key) + ' ',
        case=False
    )

sms_raw['allcaps'] = sms_raw.message.str.isupper()
sms_raw['spam'] = (sms_raw['spam'] == 'spam')
data = sms_raw[keywords + ['allcaps']]
target = sms_raw['spam']

from sklearn.naive_bayes import BernoulliNB
bnb = BernoulliNB()
y_pred = bnb.fit(data, target).predict(data)

Now we have our model as well as our returned predictions. 

The first thing to note is which data are directly comparable for model evaluation: our target and y_pred variables. Target holds the actual outcomes, whether each message was spam or ham. The y_pred array holds the outcomes predicted by our classifier. Both are ordered arrays with one result per row of the dataframe. When the two agree, our model successfully predicted whether a given message was spam or ham. When they disagree, our model was incorrect.

The most basic measure of success, then, is how often our model was correct. This is called the accuracy. It's a metric you've seen before as it was our method of evaluation in the past lesson, but translated from a count to a rate or percentage.

Go ahead and calculate it in the cell below. If you're stuck, look back at the previous lesson. If you haven't yet, make your own copy of this notebook to work with locally so you don't lose your work.

In [19]:
# Calculate the accuracy of your model here.
# Compare the predictions against the actual labels instead of hardcoding counts.
total = len(target)
correct = (target == y_pred).sum()
missed = total - correct
accuracy = (correct / total) * 100
print('Accuracy of model is: {:.2f}%'.format(accuracy))

Accuracy of model is: 89.16%


You should be getting __89.16%__ off of 4968 correctly classified messages and 604 incorrectly classified.

___If not, consult with your mentor before moving on.___
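
Whatever approach you took, accuracy comes down to the share of rows where the prediction matches the actual label. Here is a minimal sketch on a handful of made-up labels (not the SMS data). The names `actual` and `predicted` are hypothetical stand-ins for target and y_pred.

```python
import numpy as np

# Made-up labels for illustration only: True = spam, False = ham.
actual = np.array([True, False, False, True, False, False])
predicted = np.array([True, False, False, False, False, True])

# Elementwise comparison gives True wherever the model was right.
agreement = actual == predicted

# The mean of a boolean array is the fraction of True values.
accuracy = agreement.mean()
print('Correct: {} of {}'.format(agreement.sum(), len(actual)))
print('Accuracy: {:.2%}'.format(accuracy))
```

The same comparison works when one side is a pandas Series and the other a NumPy array of the same length, which is the situation in this notebook.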

Now success rate is a popular way to evaluate a model, and what most people get excited about when discussing a model. However, for a data scientist, success rate is usually not sufficient. There are several reasons for this, but we'll mention two of them here.

Firstly, not all errors are created equal. Think of the situation we're currently working with: a spam filter. Are all types of errors equal here? Certainly not! If you were using this to remove messages from your inbox, letting in a spam message is not nearly as egregious as throwing out a real (and quite possibly very important) message. Knowing more about the kinds of errors you're generating can therefore be incredibly useful.

Secondly, understanding how your model is failing can be key to improving it. If a certain outcome is not being predicted accurately you may want to focus on engineering more features to identify that outcome.
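
To make the first point concrete, here is a minimal sketch with made-up labels (not the SMS data). Two hypothetical classifiers, `model_a` and `model_b`, score the same accuracy while making opposite kinds of mistakes. Every name and value below is invented for illustration.

```python
# Hypothetical labels: True = spam, False = ham. Not taken from the SMS dataset.
actual = [True, True, True, False, False, False, False, False, False, False]
# model_a lets two spam messages through but never flags a real message.
model_a = [True, False, False, False, False, False, False, False, False, False]
# model_b catches all the spam but throws out two real messages.
model_b = [True, True, True, True, True, False, False, False, False, False]


def accuracy(y_true, y_hat):
    """Fraction of positions where the prediction agrees with the truth."""
    return sum(t == p for t, p in zip(y_true, y_hat)) / len(y_true)


def error_counts(y_true, y_hat):
    """Return (ham flagged as spam, spam let through)."""
    flagged_ham = sum(p and not t for t, p in zip(y_true, y_hat))
    missed_spam = sum(t and not p for t, p in zip(y_true, y_hat))
    return flagged_ham, missed_spam


for name, preds in [('model_a', model_a), ('model_b', model_b)]:
    print(name, accuracy(actual, preds), error_counts(actual, preds))
```

Both models are right 8 times out of 10, yet only `model_b` ever discards a real message. Accuracy alone cannot tell them apart.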


## Confusion Matrix

The next level of analysis of your classifier is often something called a Confusion Matrix. This is a matrix that shows the count of each possible combination of target and prediction. So in our case, it will show the counts for when a message was ham and we predicted ham, when a message was ham and we predicted spam, when a message was spam and we predicted ham, and when a message was spam and we predicted spam.

scikit-learn has a built-in confusion matrix function, so let's quickly import that and generate one here.

In [16]:
from sklearn.metrics import confusion_matrix
confusion_matrix(target, y_pred)

array([[4770,   55],
       [ 549,  198]], dtype=int64)

Here the columns are prediction and the rows are actual.
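
With that layout in mind, the four cells can be unpacked by name. This sketch re-enters the printed array by hand so that it runs on its own. The variable names are labels chosen here for readability, not anything returned by the library.

```python
import numpy as np

# The matrix printed above, re-entered by hand: rows are actual, columns are predicted.
# Row 0 / column 0 is ham (False); row 1 / column 1 is spam (True).
cm = np.array([[4770, 55],
               [549, 198]])

# Row-major flattening walks the first row, then the second.
ham_as_ham, ham_as_spam, spam_as_ham, spam_as_spam = cm.ravel()

# Correct predictions sit on the diagonal, errors off it.
print('Correct:', ham_as_ham + spam_as_spam)
print('Errors: ', ham_as_spam + spam_as_ham)
```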

So what do we learn?

We learn the majority of our error is coming from times where we failed to identify a spam message: 549 of our 604 errors are from predicting ham when it was known spam. So we need to get a little bit better at identifying spam messages.

But before we move on or iterate on the model, let's talk about some key terms that you may run into when thinking about this kind of matrix.

Let's assume our goal is to identify spam (rather than identify ham).

Firstly, when we talk about errors in a binary classifier (where there are only two outcomes) we're generally referring to two kinds of errors. A __false positive__ is when we identify something as spam that is not. In this case we had 55 of these. This is sometimes also called a "Type I Error" or a "false alarm".

A __false negative__ is therefore when we mistakenly identify something as not spam when it is. We had 549 of these. This is also called a "Type II Error" or a "miss".

This also brings us to a discussion of sensitivity vs specificity.

__Sensitivity__ is the percentage of positives correctly identified, in our case 198/747 or 27%. This shows how good we are at catching positives, or how sensitive our model is to identifying positives.

__Specificity__ is the counterpart for negatives: the percentage of negatives correctly identified, in our case 4770/4825 or 99%.

Again this confirms that we're not great at identifying spam, though we do label ham quite accurately. You should get familiar with these terms: in practice they are often used with little explanation, and you will be expected to understand them.
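
The two ratios quoted above can be reproduced from the four counts in the confusion matrix. This is a small arithmetic sketch with the counts typed in by hand. The variable names are chosen here for illustration, and spam is assumed to be the positive class.

```python
# Counts taken from the confusion matrix above (spam is the positive class).
true_pos, false_neg = 198, 549   # actual spam: caught vs. missed
true_neg, false_pos = 4770, 55   # actual ham: kept vs. wrongly flagged

# Sensitivity: of everything that really was spam, how much did we catch?
sensitivity = true_pos / (true_pos + false_neg)   # 198 / 747

# Specificity: of everything that really was ham, how much did we leave alone?
specificity = true_neg / (true_neg + false_pos)   # 4770 / 4825

print('Sensitivity: {:.1%}'.format(sensitivity))
print('Specificity: {:.1%}'.format(specificity))
```

Note that each denominator is a row total of the matrix (all actual spam, or all actual ham), not a column total.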


## DRILL:

It's worth calculating these with code so that you fully understand how these statistics work, so here is your task for the cell below. Manually generate (__meaning don't use the SKLearn function__) your own confusion matrix and print it along with the sensitivity and specificity.

In [101]:
# Build your confusion matrix and calculate sensitivity and specificity here.

def manual_confusion_matrix(y_actual, y_pred):
    # Tally the four outcomes, treating spam (True) as the positive class.
    TP = 0
    FP = 0
    TN = 0
    FN = 0

    for actual, predicted in zip(y_actual, y_pred):
        if actual and predicted:
            TP += 1
        elif predicted and not actual:
            FP += 1
        elif not actual and not predicted:
            TN += 1
        else:
            FN += 1

    # Compute and print the rates before returning; code after a return never runs.
    sensitivity = round((TP / (TP + FN)) * 100)
    specificity = round((TN / (FP + TN)) * 100)
    print('Sensitivity: ' + str(sensitivity) + '%')
    print('Specificity: ' + str(specificity) + '%')

    return 'Confusion Matrix: {}'.format([TP, FP, TN, FN])

manual_confusion_matrix(target, y_pred)

Sensitivity: 27%
Specificity: 99%


'Confusion Matrix: [198, 55, 4770, 549]'

In [50]:
# another example I found that does not use sklearn.

df_confusion = pd.crosstab(target, y_pred, rownames = ['Actual'], colnames = ['Predicted'])
df_confusion

Predicted  False  True
Actual
False       4770    55
True         549   198
