### Accuracy And Error Type

In [1]:
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
data_path = ("https://raw.githubusercontent.com/Thinkful-Ed/data-201-resources/"
             "master/sms_spam_collection/SMSSpamCollection"
            )
sms_raw = pd.read_csv(data_path, delimiter= '\t', header=None)
sms_raw.columns = ['spam', 'message']

In [3]:
sms_raw.head()

Unnamed: 0,spam,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [4]:
keywords = ['click', 'offer', 'winner', 'buy', 'free', 'cash', 'urgent']
for key in keywords:
    sms_raw[str(key)] = sms_raw.message.str.contains(' ' + str(key) + ' ', case = False)

In [5]:
sms_raw['allcaps'] = sms_raw.message.str.isupper()

In [6]:
data = sms_raw[keywords + ['allcaps']]
target = sms_raw['spam']

In [7]:
from sklearn.naive_bayes import BernoulliNB
bnb = BernoulliNB()
bnb.fit(data, target)
y_pred = bnb.predict(data)

Now that we have fitted the model and generated the predictions, let's construct the confusion matrix from scratch.

Here are the four combinations of actual and predicted labels that we need to count to build our confusion matrix:

In [17]:
permutations = pd.DataFrame({'target': ['ham', 'ham', 'spam', 'spam'],'y_pred': ['ham', 'spam', 'ham', 'spam']})

In [18]:
permutations

Unnamed: 0,target,y_pred
0,ham,ham
1,ham,spam
2,spam,ham
3,spam,spam


In the two-dimensional array that holds our confusion matrix, we will fill in the number of rows that satisfy each of the conditions above. To do that, let's place the two columns side by side and run our filter afterwards.

In [10]:
tran_y_pred = pd.Series(y_pred)
combined_df = pd.concat([tran_y_pred, target], join = 'outer', axis = 1)
combined_df.columns = ['tran_y_pred', 'target']

In [11]:
combined_df.head()

Unnamed: 0,tran_y_pred,target
0,ham,ham
1,ham,ham
2,ham,spam
3,ham,ham
4,ham,ham


In [19]:
# Rows are the actual labels, columns the predicted labels, both ordered ['ham', 'spam'].
labels = ['ham', 'spam']
confusion_matrix = []
for actual in labels:
    row = []
    for predicted in labels:
        # Count the messages with this actual/predicted combination.
        match = (combined_df['target'] == actual) & (combined_df['tran_y_pred'] == predicted)
        row.append(int(match.sum()))
    confusion_matrix.append(row)

In [20]:
confusion_matrix

[[4770, 55], [549, 198]]
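
The counting step can also be sketched without pandas, which makes the row and column layout easier to see. This is a minimal illustration using the standard library only; the `actual` and `predicted` lists are hypothetical toy labels, not values from the SMS data set, and `toy_matrix` is a name chosen here for illustration.

```python
from collections import Counter

# Hypothetical toy labels for illustration only (not from the SMS data).
actual = ['ham', 'ham', 'spam', 'spam', 'ham', 'spam']
predicted = ['ham', 'spam', 'ham', 'spam', 'ham', 'spam']

labels = ['ham', 'spam']

# Tally every (actual, predicted) pair in one pass.
pair_counts = Counter(zip(actual, predicted))

# Rows follow the actual label, columns the predicted label,
# which is the same layout as the nested loop above.
toy_matrix = [[pair_counts[(a, p)] for p in labels] for a in labels]
print(toy_matrix)
```

Read in this layout, the diagonal holds the correct predictions and the off-diagonal cells hold the two kinds of error.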

Next, let's find the sensitivity and specificity of our model.
Sensitivity is the number of messages correctly predicted as spam divided by the total number of spam messages in the target variable. The correctly predicted count is the 'spam' and 'spam' combination at index 3 (the 4th row) of the permutations dataframe, which is the second element of the second list in 'confusion_matrix'. We divide it by the total 'spam' count in the target.

In [31]:
# True positives (spam predicted as spam) divided by all actual spam messages.
sensitivity = confusion_matrix[1][1] / (combined_df['target'] == 'spam').sum()

In [32]:
sensitivity

0.26506024096385544
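
The same figure can be recovered from the confusion matrix alone, since the total spam count is the sum of its second row. The sketch below assumes the matrix values shown above, with rows as actual labels and columns as predicted labels; the names `confusion`, `sensitivity_check`, and `miss_rate` are illustrative and not part of the notebook.

```python
# Counts copied from the confusion matrix output above.
confusion = [[4770, 55], [549, 198]]

false_negatives = confusion[1][0]  # spam predicted as ham
true_positives = confusion[1][1]   # spam predicted as spam

# Sensitivity: share of actual spam that the model catches.
sensitivity_check = true_positives / (true_positives + false_negatives)

# Its complement is the share of spam that slips through (false negatives).
miss_rate = false_negatives / (true_positives + false_negatives)

print(sensitivity_check, miss_rate)
```

With these counts the model misses roughly three out of every four spam messages, so false negatives are its dominant error.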

Specificity is the mirror image: here we check for the true 'ham' and 'ham' combination.
Specificity is the number of messages correctly predicted as ham divided by the total number of ham messages in the target variable. The correctly predicted count is the 'ham' and 'ham' combination at index 0 (the 1st row) of the permutations dataframe, which is the first element of the first list in 'confusion_matrix'. We divide it by the total 'ham' count in the target variable.

In [33]:
# True negatives (ham predicted as ham) divided by all actual ham messages.
specificity = confusion_matrix[0][0] / (combined_df['target'] == 'ham').sum()

In [34]:
specificity

0.9886010362694301
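
To tie the two error types back to overall accuracy, here is a small sketch that derives specificity, the false positive rate, and accuracy from the four cells. It assumes the confusion matrix values reported above; the variable names are illustrative and not part of the notebook.

```python
# Counts copied from the confusion matrix output above:
# rows are actual ['ham', 'spam'], columns are predicted ['ham', 'spam'].
confusion = [[4770, 55], [549, 198]]
(true_neg, false_pos), (false_neg, true_pos) = confusion

# Specificity: share of actual ham that is correctly left alone.
specificity_check = true_neg / (true_neg + false_pos)

# False positive rate: share of ham wrongly flagged as spam.
false_positive_rate = false_pos / (true_neg + false_pos)

# Accuracy: share of all messages that are classified correctly.
total = true_neg + false_pos + false_neg + true_pos
accuracy = (true_neg + true_pos) / total

print(specificity_check, false_positive_rate, accuracy)
```

Accuracy comes out at about 0.89 here, which looks respectable only because ham dominates the data; the low sensitivity computed earlier shows that most of the remaining error is missed spam.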