In [17]:
from scipy import stats
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import classification_report

import numpy as np
import pandas as pd
import statistics

# Evaluation Exercises

- Given the following confusion matrix, evaluate (by hand) the model's performance: 

|               | pred dog   | pred cat   |
|:------------  |-----------:|-----------:|
| actual dog    |         46 |         7  |
| actual cat    |         13 |         34 |


In [18]:
#predicted dog, actual dog = true positive
tp = 46
#predicted cat, actual cat = true negative
tn = 34
#predicted dog, but it is actual cat = false positive
fp = 13
#predicted cat, actual dog = false negative
fn = 7

#accuracy calc.
accuracy = (tp+tn)/(tp+tn+fp+fn)
precision = tp/(tp+fp)
recall = tp/(tp+fn)

print('Accuracy:', accuracy)
print('Precision:', precision)
print('Recall:', recall)

Accuracy: 0.8
Precision: 0.7796610169491526
Recall: 0.8679245283018868
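As a cross-check on the hand calculation, the four counts can be expanded into label arrays and passed to scikit-learn. This is a minimal sketch, assuming dog is the positive class as in the cell above; the names `actual` and `predicted` are illustrative.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Counts from the confusion matrix, with 'dog' as the positive class
tp, fn, fp, tn = 46, 7, 13, 34

# Expand the counts into one label per observation
actual = ['dog'] * (tp + fn) + ['cat'] * (fp + tn)
predicted = ['dog'] * tp + ['cat'] * fn + ['dog'] * fp + ['cat'] * tn

accuracy = accuracy_score(actual, predicted)
precision = precision_score(actual, predicted, pos_label='dog')
recall = recall_score(actual, predicted, pos_label='dog')

print('Accuracy:', round(accuracy, 4))
print('Precision:', round(precision, 4))
print('Recall:', round(recall, 4))
```

The `pos_label` argument is what tells scikit-learn which class counts as positive; without it, string labels in a binary problem raise an error for `precision_score` and `recall_score`.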


### In the context of this problem, what is a false positive?

- It depends on which class is labeled positive. Treating dog as the positive class and cat as the negative class, a false positive is a picture predicted to be a dog that is actually a cat.

### In the context of this problem, what is a false negative?

- Under the same labeling, a false negative is a picture predicted to be a cat that is actually a dog.
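Because the meaning of "false positive" and "false negative" depends on which class is labeled positive, it can help to see both labelings side by side. The sketch below uses the counts from the confusion matrix; the helper `describe` is a hypothetical name introduced for illustration.

```python
# Confusion-matrix counts, keyed by (actual, predicted)
counts = {
    ('dog', 'dog'): 46,
    ('dog', 'cat'): 7,
    ('cat', 'dog'): 13,
    ('cat', 'cat'): 34,
}

def describe(positive, negative):
    """Return FP, FN, precision, and recall when `positive` is the positive class."""
    tp = counts[(positive, positive)]
    fp = counts[(negative, positive)]  # predicted positive, actually negative
    fn = counts[(positive, negative)]  # predicted negative, actually positive
    return {
        'fp': fp,
        'fn': fn,
        'precision': tp / (tp + fp),
        'recall': tp / (tp + fn),
    }

dog_positive = describe('dog', 'cat')
cat_positive = describe('cat', 'dog')

print('dog positive:', dog_positive)
print('cat positive:', cat_positive)
```

Swapping the positive class swaps the two error types: the 13 cats predicted as dogs are false positives when dog is positive and false negatives when cat is positive.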

### How would you describe this model?
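- With dog as the positive class, the model is 80% accurate, with 78% precision and 87% recall. It misses relatively few dogs (7 of 53) but labels 13 of 47 cats as dogs, so it is stronger on recall than on precision. An accuracy of 80% is still creditable (opinion).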

#### You are working as a data scientist for Codeup Cody Creator (C3 for short), a rubber-duck manufacturing plant. Unfortunately, some of the rubber ducks that are produced will have defects. Your team has built several models that try to predict those defects, and the data from their predictions can be found here.

In [19]:
# read the c3 csv
df = pd.read_csv('c3.csv')
df.head()

Unnamed: 0,actual,model1,model2,model3
0,No Defect,No Defect,Defect,No Defect
1,No Defect,No Defect,Defect,Defect
2,No Defect,No Defect,Defect,No Defect
3,No Defect,Defect,Defect,Defect
4,No Defect,No Defect,Defect,No Defect


- #### An internal team wants to investigate the cause of the manufacturing defects. They tell you that they want to identify as many of the ducks that have a defect as possible. Which evaluation metric would be appropriate here? Which model would be the best fit for this use case?

In [21]:
# start by crosstabbing
pd.crosstab(df.actual, df.model1)

model1,Defect,No Defect
actual,Unnamed: 1_level_1,Unnamed: 2_level_1
Defect,8,8
No Defect,2,182


In [22]:
# now get an accuracy score using sklearn.metrics
accuracy_score(df.actual, df.model1)

0.95

In [26]:
# since we want to catch as many defective ducks as possible, use recall
# To do this, keep only the actual defects, compare them to model 1,
# and take the mean to calculate the recall

m1_recall_test = df[df.actual == 'Defect']
(m1_recall_test.model1 == m1_recall_test.actual).mean()

0.5

### Takeaways from actual to model 1:
- 95% accuracy
- 50% recall
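The gap between 95% accuracy and 50% recall comes from class imbalance: per the crosstab, only 16 of the 200 ducks are defective. As a rough illustration, a hypothetical model that always predicts `No Defect` (not one of the three models in the data) would still score high on accuracy while catching no defects.

```python
# Class totals taken from the model 1 crosstab rows
n_defect = 8 + 8        # actual Defect
n_no_defect = 2 + 182   # actual No Defect
total = n_defect + n_no_defect

# Hypothetical "always No Defect" model: every non-defective duck is right,
# every defective duck is missed
always_no_defect_accuracy = n_no_defect / total
always_no_defect_recall = 0 / n_defect

# Model 1, from the same crosstab
model1_accuracy = (8 + 182) / total
model1_recall = 8 / n_defect

print('Always "No Defect" accuracy:', always_no_defect_accuracy)
print('Always "No Defect" recall:', always_no_defect_recall)
print('Model 1 accuracy:', model1_accuracy)
print('Model 1 recall:', model1_recall)
```

The do-nothing model reaches 92% accuracy, so model 1's 95% is only a small improvement on that measure, which is why recall is the more informative number here.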

In [27]:
# model 2 evaluation
pd.crosstab(df.actual, df.model2)

model2,Defect,No Defect
actual,Unnamed: 1_level_1,Unnamed: 2_level_1
Defect,9,7
No Defect,81,103


In [28]:
#accuracy score
accuracy_score(df.actual, df.model2)

0.56

In [29]:
# calculate recall
m2_recall_test = df[df.actual == 'Defect']
(m2_recall_test.model2 == m2_recall_test.actual).mean()

0.5625

### Takeaway from actual to model 2:
- 56% accuracy
- 56.25% recall

In [30]:
# model 3
pd.crosstab(df.actual, df.model3)

model3,Defect,No Defect
actual,Unnamed: 1_level_1,Unnamed: 2_level_1
Defect,13,3
No Defect,86,98


In [31]:
# accuracy score
accuracy_score(df.actual, df.model3)

0.555

In [32]:
# calculate the recall
m3_recall_test = df[df.actual == 'Defect']
(m3_recall_test.model3 == m3_recall_test.actual).mean()

0.8125

### Takeaway from actual to model 3
- 55% accuracy
- 81% recall

#### *Recall is the metric for finding defects. Here a false positive is less costly than a false negative: recalling a duck that has no defect and sending out a new one is better than failing to recall a duck that DOES have a defect. We therefore want the highest recall possible, and model 3, at 81% recall, best fits this use case.*
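Recall for every class can also be read straight off a row-normalized crosstab. The sketch below rebuilds model 3's predictions from the crosstab counts above; the row order is arbitrary and the frame `demo` is illustrative, not the original `c3.csv`.

```python
import pandas as pd

# Rebuild labels from the model 3 crosstab: 13 + 3 actual defects, 86 + 98 non-defects
demo = pd.DataFrame({
    'actual': ['Defect'] * 16 + ['No Defect'] * 184,
    'model3': ['Defect'] * 13 + ['No Defect'] * 3
              + ['Defect'] * 86 + ['No Defect'] * 98,
})

# normalize='index' divides each row by its total, so the diagonal holds
# the recall of each class
row_share = pd.crosstab(demo.actual, demo.model3, normalize='index')
defect_recall = row_share.loc['Defect', 'Defect']

print(row_share)
print('Defect recall:', defect_recall)
```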

- ### Recently several stories in the local news have come out highlighting customers who received a rubber duck with a defect, and portraying C3 in a bad light. The PR team has decided to launch a program that gives customers with a defective duck a vacation to Hawaii. They need you to predict which ducks will have defects, but tell you they really don't want to accidentally give out a vacation package when the duck really doesn't have a defect. Which evaluation metric would be appropriate here? Which model would be the best fit for this use case?

#### The best evaluation metric here is precision, since we do not want to hand out free trips to Hawaii for ducks without defects. The cost of a false positive is high, so we want as large a share of the positive predictions as possible to be correct.

In [41]:
prediction1 = df[df.model1 == 'Defect']
prediction2 = df[df.model2 == 'Defect']
prediction3 = df[df.model3 == 'Defect']

m1_precision = (prediction1.model1 == prediction1.actual).mean()
m2_precision = (prediction2.model2 == prediction2.actual).mean()
m3_precision = (prediction3.model3 == prediction3.actual).mean()

print('Model 1 Precision:', m1_precision*100,'%')
print('Model 2 Precision:', m2_precision*100,'%')
print('Model 3 Precision:', m3_precision*100,'%')

Model 1 Precision: 80.0 %
Model 2 Precision: 10.0 %
Model 3 Precision: 13.131313131313133 %


### Takeaways: 
- Model 1 has the highest precision, thus this model is the best for this use case.
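Precision can also be computed directly from the crosstab counts, and the false-positive count shows how many vacations each model would award for ducks with no defect. This is a small sketch using the counts from the three crosstabs above; the dictionary layout is illustrative.

```python
# (true positives, false positives) for the 'Defect' prediction,
# read from the 'Defect' column of each crosstab
predicted_defect = {
    'model1': (8, 2),
    'model2': (9, 81),
    'model3': (13, 86),
}

precision = {
    name: tp / (tp + fp)
    for name, (tp, fp) in predicted_defect.items()
}
best_model = max(precision, key=precision.get)

for name, (tp, fp) in predicted_defect.items():
    print(f'{name}: precision={precision[name]:.3f}, unearned vacations={fp}')
print('Highest precision:', best_model)
```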

#### 4. You are working as a data scientist for Gives You Paws ™, a subscription based service that shows you cute pictures of dogs or cats (or both for an additional fee).

At Gives You Paws, anyone can upload pictures of their cats or dogs. The photos are then put through a two step process. First an automated algorithm tags pictures as either a cat or a dog (Phase I). Next, the photos that have been initially identified are put through another round of review, possibly with some human oversight, before being presented to the users (Phase II).

Several models have already been developed with the data, and you can find their results here.



In [43]:
paws = pd.read_csv('gives_you_paws.csv')
paws.head()

Unnamed: 0,actual,model1,model2,model3,model4
0,cat,cat,dog,cat,dog
1,dog,dog,cat,cat,dog
2,dog,cat,cat,cat,dog
3,dog,dog,dog,cat,dog
4,cat,cat,cat,dog,dog


- In terms of accuracy, how do the various models compare to the baseline model? Are any of the models better than the baseline?

In [46]:
# accuracy is the best metric to use here
# create a baseline

# to know what to create the baseline on, get the most common class
paws.actual.value_counts()

dog    3254
cat    1746
Name: actual, dtype: int64

In [47]:
# dog is the most common class, so create a baseline column that always predicts dog
paws['baseline'] = 'dog'

In [48]:
# now create the accuracy test and baseline test
# 4 models, 4 tests, with the baseline accuracy as well

m1_accuracy = (paws.model1 == paws.actual).mean()
m2_accuracy = (paws.model2 == paws.actual).mean()
m3_accuracy = (paws.model3 == paws.actual).mean()
m4_accuracy = (paws.model4 == paws.actual).mean()

# now calculate the baseline accuracy test
baseline_accuracy = (paws.baseline == paws.actual).mean()

print('Model 1 Accuracy:', m1_accuracy)
print('Model 2 Accuracy:', m2_accuracy)
print('Model 3 Accuracy:', m3_accuracy)
print('Model 4 Accuracy:', m4_accuracy)
print('Baseline Accuracy:', baseline_accuracy)

Model 1 Accuracy: 0.8074
Model 2 Accuracy: 0.6304
Model 3 Accuracy: 0.5096
Model 4 Accuracy: 0.7426
Baseline Accuracy: 0.6508


##### Models 1 (81%) and 4 (74%) beat the baseline accuracy (65%); models 2 (63%) and 3 (51%) do not. Model 1 is the most accurate.
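To answer the comparison question for every model at once, the accuracies printed above can be checked against the baseline. This is a minimal sketch using the reported values; the dictionary is typed in by hand rather than computed from `gives_you_paws.csv`.

```python
# Accuracies as printed in the output above
accuracy = {
    'model1': 0.8074,
    'model2': 0.6304,
    'model3': 0.5096,
    'model4': 0.7426,
}

# Baseline: always predict the most common class (3254 dogs out of 5000)
baseline_accuracy = 3254 / 5000

beats_baseline = [
    name for name, score in accuracy.items() if score > baseline_accuracy
]

print('Baseline accuracy:', baseline_accuracy)
print('Better than baseline:', beats_baseline)
```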

- Suppose you are working on a team that solely deals with dog pictures. Which of these models would you recommend for Phase I? For Phase II?

In [50]:
# For this case I only want dog pictures, which means I want 
# to use a recall metric 

dogs_only = paws[paws.actual == 'dog']
dogs_only.head()

Unnamed: 0,actual,model1,model2,model3,model4,baseline
1,dog,dog,cat,cat,dog,dog
2,dog,cat,cat,cat,dog,dog
3,dog,dog,dog,cat,dog,dog
5,dog,dog,dog,dog,dog,dog
8,dog,dog,cat,dog,dog,dog


In [52]:
# now use recall to find the model that captures the most actual dogs

dogs_recall1 = (dogs_only.model1 == dogs_only.actual).mean()
dogs_recall2 = (dogs_only.model2 == dogs_only.actual).mean()
dogs_recall3 = (dogs_only.model3 == dogs_only.actual).mean()
dogs_recall4 = (dogs_only.model4 == dogs_only.actual).mean()

print('Model 1 Recall:', dogs_recall1)
print('Model 2 Recall:', dogs_recall2)
print('Model 3 Recall:', dogs_recall3)
print('Model 4 Recall:', dogs_recall4)

Model 1 Recall: 0.803318992009834
Model 2 Recall: 0.49078057775046097
Model 3 Recall: 0.5086047940995697
Model 4 Recall: 0.9557467732022127


In [53]:
# Phase II - we should use precision as the metric since we're
# trying to minimize false positives

dogs_precision1 = paws[paws.model1 == 'dog']
dogs_precision2 = paws[paws.model2 == 'dog']
dogs_precision3 = paws[paws.model3 == 'dog']
dogs_precision4 = paws[paws.model4 == 'dog']

m1_precision = (dogs_precision1.model1 == dogs_precision1.actual).mean()
m2_precision = (dogs_precision2.model2 == dogs_precision2.actual).mean()
m3_precision = (dogs_precision3.model3 == dogs_precision3.actual).mean()
m4_precision = (dogs_precision4.model4 == dogs_precision4.actual).mean()

print('Precision M1:',m1_precision)
print('Precision M2:',m2_precision)
print('Precision M3:',m3_precision)
print('Precision M4:',m4_precision)

Precision M1: 0.8900238338440586
Precision M2: 0.8931767337807607
Precision M3: 0.6598883572567783
Precision M4: 0.7312485304490948


#### Takeaway:
- Model 4 has the best dog recall (96%), so it captures the most dog pictures in Phase I. For Phase II we are trying to limit false positives, so precision matters more. Model 1 is a strong contender with the highest accuracy and high precision (89.0%), but on precision alone Model 2 (89.3%) is the best option.

- Model 4 = Phase 1
- Model 2 = Phase 2
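The choice can be made explicit by putting the dog recall and precision values printed above into one table and taking the maximum of each column. This is a sketch with the values rounded to four decimals; the frame name `dog_metrics` is illustrative.

```python
import pandas as pd

# Dog recall and precision as printed above, rounded to four decimals
dog_metrics = pd.DataFrame(
    {
        'recall': [0.8033, 0.4908, 0.5086, 0.9557],
        'precision': [0.8900, 0.8932, 0.6599, 0.7312],
    },
    index=['model1', 'model2', 'model3', 'model4'],
)

# Phase I favors recall (capture as many dogs as possible);
# Phase II favors precision (few cats mislabeled as dogs)
phase1_choice = dog_metrics['recall'].idxmax()
phase2_choice = dog_metrics['precision'].idxmax()

print('Phase I:', phase1_choice)
print('Phase II:', phase2_choice)
```

Models 1 and 2 differ by only about 0.3 percentage points on precision, so the Phase II pick is a close call when judged on precision alone.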

- Suppose you are working on a team that solely deals with cat pictures. Which of these models would you recommend for Phase I? For Phase II?

In [60]:
# Phase I  Calculating Recall
subset = paws[paws.actual == 'cat']
m1_recall = (subset.model1 == subset.actual).mean()
m2_recall = (subset.model2 == subset.actual).mean()
m3_recall = (subset.model3 == subset.actual).mean()
m4_recall = (subset.model4 == subset.actual).mean()

print('Recall M1:',m1_recall)
print('Recall M2:',m2_recall)
print('Recall M3:',m3_recall)
print('Recall M4:',m4_recall)

Recall M1: 0.8150057273768614
Recall M2: 0.8906071019473081
Recall M3: 0.5114547537227949
Recall M4: 0.34536082474226804


Model 2 is the best with a recall of 89%

In [64]:
# Phase II  Calculating Precision

subset1 = paws[paws.model1 == 'cat']
subset2 = paws[paws.model2 == 'cat']
subset3 = paws[paws.model3 == 'cat']
subset4 = paws[paws.model4 == 'cat']

m1_precision = (subset1.model1 == subset1.actual).mean()
m2_precision = (subset2.model2 == subset2.actual).mean()
m3_precision = (subset3.model3 == subset3.actual).mean()
m4_precision = (subset4.model4 == subset4.actual).mean()

print('Precision M1:',m1_precision)
print('Precision M2:',m2_precision)
print('Precision M3:',m3_precision)
print('Precision M4:',m4_precision)

Precision M1: 0.6897721764420747
Precision M2: 0.4841220423412204
Precision M3: 0.358346709470305
Precision M4: 0.8072289156626506


Model 4 is the best at 81%

In [65]:
from sklearn.metrics import classification_report

In [67]:
x = classification_report(paws.actual, paws.model1,
                          labels = ['cat', 'dog'],
                          output_dict=True)
pd.DataFrame(x).T

Unnamed: 0,precision,recall,f1-score,support
cat,0.689772,0.815006,0.747178,1746.0
dog,0.890024,0.803319,0.844452,3254.0
accuracy,0.8074,0.8074,0.8074,0.8074
macro avg,0.789898,0.809162,0.795815,5000.0
weighted avg,0.820096,0.8074,0.810484,5000.0


In [69]:
print("Model 1")
pd.DataFrame(classification_report(paws.actual, paws.model1,
                                   labels = ['cat', 'dog'],
                                   output_dict=True)).T

Model 1


Unnamed: 0,precision,recall,f1-score,support
cat,0.689772,0.815006,0.747178,1746.0
dog,0.890024,0.803319,0.844452,3254.0
accuracy,0.8074,0.8074,0.8074,0.8074
macro avg,0.789898,0.809162,0.795815,5000.0
weighted avg,0.820096,0.8074,0.810484,5000.0


In [71]:
print("Model 2")
pd.DataFrame(classification_report(paws.actual, paws.model2,
                                   labels = ['cat', 'dog'],
                                   output_dict=True)).T

Model 2


Unnamed: 0,precision,recall,f1-score,support
cat,0.484122,0.890607,0.627269,1746.0
dog,0.893177,0.490781,0.633479,3254.0
accuracy,0.6304,0.6304,0.6304,0.6304
macro avg,0.688649,0.690694,0.630374,5000.0
weighted avg,0.750335,0.6304,0.63131,5000.0


In [72]:
print("Model 3")
pd.DataFrame(classification_report(paws.actual, paws.model3,
                                   labels = ['cat', 'dog'],
                                   output_dict=True)).T

Model 3


Unnamed: 0,precision,recall,f1-score,support
cat,0.358347,0.511455,0.421425,1746.0
dog,0.659888,0.508605,0.574453,3254.0
accuracy,0.5096,0.5096,0.5096,0.5096
macro avg,0.509118,0.51003,0.497939,5000.0
weighted avg,0.55459,0.5096,0.521016,5000.0


In [68]:
print("Model 4")
pd.DataFrame(classification_report(paws.actual, paws.model4,
                                   labels = ['cat', 'dog'],
                                   output_dict=True)).T


Model 4


Unnamed: 0,precision,recall,f1-score,support
cat,0.807229,0.345361,0.483755,1746.0
dog,0.731249,0.955747,0.82856,3254.0
accuracy,0.7426,0.7426,0.7426,0.7426
macro avg,0.769239,0.650554,0.656157,5000.0
weighted avg,0.757781,0.7426,0.708154,5000.0
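The four report cells repeat the same call with a different column, so a loop can build them all at once. The sketch below runs on a tiny made-up frame (`toy`, not the real data) so that it stays self-contained; with the real data, `paws` and its four model columns would take its place.

```python
import pandas as pd
from sklearn.metrics import classification_report

# Tiny made-up data for illustration only; not from gives_you_paws.csv
toy = pd.DataFrame({
    'actual': ['cat', 'dog', 'dog', 'dog', 'cat', 'dog'],
    'model1': ['cat', 'dog', 'cat', 'dog', 'cat', 'dog'],
    'model2': ['dog', 'cat', 'cat', 'dog', 'cat', 'dog'],
})

model_columns = ['model1', 'model2']

# One report per model, keyed by the model's column name
reports = {
    name: pd.DataFrame(
        classification_report(toy.actual, toy[name],
                              labels=['cat', 'dog'],
                              output_dict=True)
    ).T
    for name in model_columns
}

for name, report in reports.items():
    print(name)
    print(report.round(3))
```

Keeping the reports in a dictionary makes it easy to pull a single number later, for example `reports['model1'].loc['cat', 'recall']`.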
