In [37]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns

## Reload our Naive Bayes Classifier from 2.2

Here we'll quickly reload the Naive Bayes classifier from our last lesson. This is all code you've seen before. It is worth noting how little code is actually required to generate this model. It's a relatively simple exercise, and SKLearn makes it impressively easy.

In [38]:
# Grab and process the raw data.
data_path = ("https://raw.githubusercontent.com/Thinkful-Ed/data-201-resources/"
             "master/sms_spam_collection/SMSSpamCollection"
            )
sms_raw = pd.read_csv(data_path, delimiter= '\t', header=None)
sms_raw.columns = ['spam', 'message']

# Enumerate our spammy keywords.
keywords = ['click', 'offer', 'winner', 'buy', 'free', 'cash', 'urgent']

for key in keywords:
    # Pad the keyword with spaces so it only matches as a standalone word.
    sms_raw[str(key)] = sms_raw.message.str.contains(
        ' ' + str(key) + ' ',
        case=False
    )

sms_raw['allcaps'] = sms_raw.message.str.isupper()
sms_raw['spam'] = (sms_raw['spam'] == 'spam')
data = sms_raw[keywords + ['allcaps']]
target = sms_raw['spam']

from sklearn.naive_bayes import BernoulliNB
bnb = BernoulliNB()
y_pred = bnb.fit(data, target).predict(data)


In [68]:
sms_raw

Unnamed: 0,spam,message,click,offer,winner,buy,free,cash,urgent,allcaps
0,False,"Go until jurong point, crazy.. Available only ...",False,False,False,False,False,False,False,False
1,False,Ok lar... Joking wif u oni...,False,False,False,False,False,False,False,False
2,True,Free entry in 2 a wkly comp to win FA Cup fina...,False,False,False,False,False,False,False,False
3,False,U dun say so early hor... U c already then say...,False,False,False,False,False,False,False,False
4,False,"Nah I don't think he goes to usf, he lives aro...",False,False,False,False,False,False,False,False
5,True,FreeMsg Hey there darling it's been 3 week's n...,False,False,False,False,False,False,False,False
6,False,Even my brother is not like to speak with me. ...,False,False,False,False,False,False,False,False
7,False,As per your request 'Melle Melle (Oru Minnamin...,False,False,False,False,False,False,False,False
8,True,WINNER!! As a valued network customer you have...,False,False,False,False,False,False,False,False
9,True,Had your mobile 11 months or more? U R entitle...,False,False,False,False,True,False,False,False


## Success Rate

Now we have our model as well as our returned predictions. 

The first thing to note is which data are directly comparable for model evaluation: our `target` and `y_pred` variables. `target` holds the actual outcomes, whether each message was spam or ham. `y_pred` holds the outcomes predicted by our classifier. Both are ordered arrays with one result for each row of the dataframe. When the two agree, our model successfully predicted whether a given message was spam or ham. When they disagree, our model was incorrect.

The most basic measure of success, then, is how often our model was correct. This is called the accuracy. It's a metric you've seen before, as it was our method of evaluation in the previous lesson, but translated from a count to a rate or percentage.

Go ahead and calculate it in the cell below. If you're stuck, look back at the previous lesson. If you haven't yet, make your own copy of this notebook to work with locally so you don't lose your work.

In [39]:
# Calculate the accuracy of your model here.
# Display our results.
print("Number of mislabeled points out of a total {} points : {}".format(
    data.shape[0],
    (target != y_pred).sum()
))

print('Total number of messages (i.e. data frame rows) evaluated:', data.shape[0])

print('Number of messages incorrectly classified:', (target != y_pred).sum())

accuracy = (((data.shape[0]) - (target != y_pred).sum()) / (data.shape[0]))

print('Accuracy of classifier model, in decimal form:', accuracy)

Number of mislabeled points out of a total 5572 points : 604
Total number of messages (i.e. data frame rows) evaluated: 5572
Number of messages incorrectly classified: 604
Accuracy of classifier model, in decimal form: 0.8916008614501076


You should be getting __89.16%__, based on 4,968 correctly classified messages and 604 incorrectly classified ones.

___If not, consult with your mentor before moving on.___
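
As a cross-check on a hand-computed accuracy, scikit-learn's `accuracy_score` returns the same rate. The sketch below uses small made-up arrays (`toy_target` and `toy_pred` are hypothetical stand-ins, not the SMS data) so it runs on its own.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical toy labels (True = spam); these are NOT the SMS data.
toy_target = np.array([True, False, False, True, False])
toy_pred = np.array([True, False, True, False, False])

# Accuracy by hand: the share of positions where prediction and truth agree.
manual_accuracy = (toy_target == toy_pred).mean()

# The same quantity from scikit-learn.
sklearn_accuracy = accuracy_score(toy_target, toy_pred)

print(manual_accuracy, sklearn_accuracy)
```

With the real data, the same call would be `accuracy_score(target, y_pred)`.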

Now success rate is a popular way to evaluate a model, and what most people get excited about when discussing a model. However, for a data scientist, success rate is usually not sufficient. There are several reasons for this, but we'll mention two of them here.

Firstly, not all errors are created equal. Think of the situation we're currently working with: a spam filter. Are all types of errors equal here? Certainly not! If you were using this to remove messages from your inbox, letting in a spam message is not nearly as egregious as throwing out a real (and quite possibly very important) message. Knowing more about the kinds of errors you're generating can therefore be incredibly useful.

Secondly, understanding how your model is failing can be key to improving it. If a certain outcome is not being predicted accurately you may want to focus on engineering more features to identify that outcome.
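
To make the first point concrete, here is a minimal sketch with two hypothetical filters that have identical accuracy but make opposite kinds of mistakes. All labels and the `summarize` helper are invented for illustration.

```python
# Hypothetical labels for ten messages (True = spam); not drawn from the SMS data.
actual = [True, True, True, True, False, False, False, False, False, False]
# Filter A lets two spam messages through but never flags a real message.
pred_a = [True, True, False, False, False, False, False, False, False, False]
# Filter B catches all the spam but throws out two real messages.
pred_b = [True, True, True, True, True, True, False, False, False, False]


def summarize(actual, predicted):
    """Return (accuracy, false_positives, false_negatives), with spam as positive."""
    pairs = list(zip(actual, predicted))
    correct = sum(a == p for a, p in pairs)
    false_positives = sum((not a) and p for a, p in pairs)
    false_negatives = sum(a and (not p) for a, p in pairs)
    return correct / len(pairs), false_positives, false_negatives


summary_a = summarize(actual, pred_a)
summary_b = summarize(actual, pred_b)
print('Filter A:', summary_a)
print('Filter B:', summary_b)
```

Both filters are right on 8 of 10 messages, yet only Filter B discards real mail, which is exactly the difference accuracy alone hides.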


## Confusion Matrix

The next level of analysis of your classifier is often something called a Confusion Matrix. This is a matrix that shows the count of each possible combination of target and prediction. So in our case, it will show the counts for when a message was ham and we predicted ham, when a message was ham and we predicted spam, when a message was spam and we predicted ham, and when a message was spam and we predicted spam.

SKLearn has a built-in confusion matrix function, so let's quickly import that and generate one here.

In [40]:
from sklearn.metrics import confusion_matrix

print('confusion matrix results: ', confusion_matrix(target, y_pred))
print('When applied to our spam classifier, the confusion matrix reveals the following metrics:')

# Treat spam as the positive class; ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(target, y_pred).ravel()
print('True negatives (predicted ham, actually ham):', tn)
print('False positives or type 1 errors (predicted spam, actually ham):', fp)
print('False negatives or type 2 errors (predicted ham, actually spam):', fn)
print('True positives (predicted spam, actually spam):', tp)


confusion matrix results:  [[4770   55]
 [ 549  198]]
When applied to our spam classifier, the confusion matrix reveals the following metrics:
True negatives (predicted ham, actually ham): 4770
False positives or type 1 errors (predicted spam, actually ham): 55
False negatives or type 2 errors (predicted ham, actually spam): 549
True positives (predicted spam, actually spam): 198


Here the columns are prediction and the rows are actual.
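
Because a bare array is easy to misread, one option is to build a labeled version with `pd.crosstab`. This is a sketch on hypothetical toy data; the 'ham'/'spam' mapping and the `actual`/`predicted` names are assumptions added for readability.

```python
import pandas as pd

# Hypothetical toy labels (True = spam), standing in for target and y_pred.
toy_target = [False, False, False, True, True, True]
toy_pred = [False, False, True, False, False, True]

# Map booleans to readable labels and name each series for the table headers.
names = {False: 'ham', True: 'spam'}
actual = pd.Series(toy_target).map(names).rename('actual')
predicted = pd.Series(toy_pred).map(names).rename('predicted')

# Rows are actual classes, columns are predicted classes.
labeled = pd.crosstab(actual, predicted)
print(labeled)
```

The row and column headers then make it clear which cell is which kind of error.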

So what do we learn?

We learn the majority of our error is coming from times where we failed to identify a spam message. 549 of our 604 errors are from failing to identify spam. So we need to get a little bit better at identifying spam messages.

But before we move on or iterate on the model, let's talk about some key terms that you may run into when thinking about this kind of matrix.

Let's assume our goal is to identify spam (rather than identify ham).

Firstly, when we talk about errors in a binary classifier (where there are only two outcomes) we're generally referring to two kinds of errors. A __false positive__ is when we identify something as spam that is not. In this case we had 55 of these. This is sometimes also called a "Type I Error" or a "false alarm".

A __false negative__ is therefore when we mistakenly identify something as not spam when it is. We had 549 of these. This is also called a "Type II Error" or a "miss".

This also brings us to a discussion of sensitivity vs. specificity.

__Sensitivity__ is the percentage of positives correctly identified, in our case 198/747 or 27%. This shows how good we are at catching positives, or how sensitive our model is to identifying positives.

__Specificity__ is just the opposite, the percentage of negatives correctly identified, 4770/4825 or 99%.

Again, this confirms that we're not great at identifying spam, though we do label ham quite accurately. You should get familiar with these terms: in practice they are often used with little explanation, and you will be expected to understand them.
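
The two rates can be reproduced directly from the four counts quoted above. This is only a sketch of the arithmetic, with spam assumed to be the positive class; the variable names are illustrative.

```python
# Counts quoted in the text, with spam as the positive class.
true_negatives, false_positives = 4770, 55
false_negatives, true_positives = 549, 198

# Sensitivity: share of actual spam that was caught, TP / (TP + FN).
sensitivity = true_positives / (true_positives + false_negatives)

# Specificity: share of actual ham that was left alone, TN / (TN + FP).
specificity = true_negatives / (true_negatives + false_positives)

print('Sensitivity: {:.1%}'.format(sensitivity))
print('Specificity: {:.1%}'.format(specificity))
```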


## DRILL:

It's worth calculating these with code so that you fully understand how these statistics work, so here is your task for the cell below. Manually generate (__meaning don't use the SKLearn function__) your own confusion matrix and print it along with the sensitivity and specificity.

In [90]:
# Build your confusion matrix and calculate sensitivity and specificity here.

print('y_pred =', y_pred)

print('total rows/messages =', data.shape[0])

print('correctly identified (true positives + true negatives) =', ((data.shape[0]) - ((target != y_pred).sum())))

print('incorrectly identified (false positives + false negatives) =', (target != y_pred).sum())


y_pred = [False False False ... False False False]
total rows/messages = 5572
correctly identified (true positives + true negatives) = 4968
incorrectly identified (false positives + false negatives) = 604


In [116]:
# True negatives: messages that are actually ham (target is False) and were predicted ham.
print((~target & ~y_pred).sum())

4770




# Count the total true positive values. See the following code:
# (Question for Mike: this should return a number, but it is returning a boolean value.)

# How to solve the 'return' outside of function syntax error:
# def fizz_count(x):
#     count = 0
#     for item in x:
#         if item == 'fizz':
#             count = count + 1
#     return count

# How to convert True to 1 and False to 0:
# x = int(x == 'true')


In [137]:
# Build your confusion matrix and calculate sensitivity and specificity here.


# True negatives:
# the number of times the y_pred value is False and the actual 'spam' value is also False.


def t_negative(actual, predicted):
    true_negatives = 0
    # Walk through the actual and predicted labels in parallel.
    for a, p in zip(actual, predicted):
        if not a and not p:
            true_negatives += 1
    return true_negatives

print(t_negative(target, y_pred))


4770


In [136]:
# Build your confusion matrix and calculate sensitivity and specificity here.


# False negatives:
# the number of times the y_pred value is False but the actual 'spam' value is True.


def f_negative(actual, predicted):
    false_negatives = 0
    # Walk through the actual and predicted labels in parallel.
    for a, p in zip(actual, predicted):
        if a and not p:
            false_negatives += 1
    return false_negatives

print(f_negative(target, y_pred))


549


In [139]:

# Sensitivity (the percentage of positives correctly identified):
# true positives divided by all actual positives (spam messages).
print('{:.4f}'.format((target & y_pred).sum() / target.sum()))


0.2651

In [140]:

# Specificity (the percentage of negatives correctly identified):
# true negatives divided by all actual negatives (ham messages).
print('{:.4f}'.format((~target & ~y_pred).sum() / (~target).sum()))


0.9886
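
Pulling the drill together, one possible way to build the whole matrix by hand is sketched below. `manual_confusion_matrix` is a hypothetical helper and the toy labels are invented; it assumes boolean labels with True (spam) as the positive class and uses the same row/column layout as the SKLearn output.

```python
def manual_confusion_matrix(actual, predicted):
    """Return [[TN, FP], [FN, TP]]: rows are actual, columns are predicted."""
    matrix = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        # False maps to index 0 and True to index 1.
        matrix[int(bool(a))][int(bool(p))] += 1
    return matrix


# Hypothetical toy labels (True = spam); swap in target and y_pred for the real data.
toy_actual = [False, False, False, True, True, True, True]
toy_predicted = [False, True, False, True, False, False, True]

toy_matrix = manual_confusion_matrix(toy_actual, toy_predicted)
(tn, fp), (fn, tp) = toy_matrix

toy_sensitivity = tp / (tp + fn)
toy_specificity = tn / (tn + fp)
print(toy_matrix, toy_sensitivity, toy_specificity)
```

Calling `manual_confusion_matrix(target, y_pred)` on the notebook's data should reproduce the counts from the SKLearn confusion matrix.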