<div class="alert alert-block alert-warning">

# Evaluation Exercises

<div class="alert alert-block alert-success">

1. Create a new file named model_evaluation.py or model_evaluation.ipynb for these exercises.

In [1]:
import pandas as pd
import numpy as np
import os

<div class="alert alert-block alert-success">

2. Given the following confusion matrix, evaluate (by hand) the model's performance.
    |               | pred dog   | pred cat   |
    |:------------  |-----------:|-----------:|
    | actual dog    |         46 |         7  |
    | actual cat    |         13 |         34 |


* cat = positive class
* dog = negative class

<div class="alert alert-block alert-info">

a. In the context of this problem, what is a false positive?

False Positive: We predicted a cat, but it is actually a dog

<div class="alert alert-block alert-info">

b. In the context of this problem, what is a false negative?

False Negative: We predicted a dog, but it is actually a cat
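
The two definitions above can be sketched as a small labeling function. This is only an illustration; the function name `outcome` and its `positive` parameter are hypothetical and not part of the exercise.

```python
def outcome(actual, predicted, positive='cat'):
    """Label one prediction as TP, FP, FN, or TN relative to `positive`."""
    if predicted == positive:
        # We predicted the positive class; were we right?
        return 'TP' if actual == positive else 'FP'
    # We predicted the negative class; were we right?
    return 'FN' if actual == positive else 'TN'

print(outcome('dog', 'cat'))  # predicted cat, actually a dog
print(outcome('cat', 'dog'))  # predicted dog, actually a cat
```

With 'cat' as the positive class, the first call is a false positive and the second is a false negative.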

<div class="alert alert-block alert-info">

c. How would you describe this model?

In [2]:
# true positive: we predicted a cat, and it's a cat
tp = 34

# true negative: we predicted a dog, and it's a dog
tn = 46

# false positive: we predicted a cat, but it's a dog
fp = 7

# false negative: we predicted a dog, but it's a cat
fn = 13

In [3]:
print("Cat-classifier (where 'cat' is the positive prediction)")

print("True Positives:", tp)
print("False Positives:", fp)
print("False Negatives:", fn)
print("True Negatives:", tn)
print("-------------")

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print("Accuracy is:", accuracy)
print("Recall is:", round(recall,2))
print("Precision is:", round(precision,2))

Cat-classifier (where 'cat' is the positive prediction)
True Positives: 34
False Positives: 7
False Negatives: 13
True Negatives: 46
-------------
Accuracy is: 0.8
Recall is: 0.72
Precision is: 0.83
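
As a cross-check on the hand calculation, the four counts can be expanded back into label lists and passed to scikit-learn. This is a sketch under one assumption: the lists `y_true` and `y_pred` are reconstructed from the confusion matrix, not taken from real observations.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Counts from the confusion matrix ('cat' is the positive class)
tp, tn, fp, fn = 34, 46, 7, 13

# Rebuild one (actual, predicted) pair per observation
y_true = ['cat'] * tp + ['dog'] * tn + ['dog'] * fp + ['cat'] * fn
y_pred = ['cat'] * tp + ['dog'] * tn + ['cat'] * fp + ['dog'] * fn

check_accuracy = accuracy_score(y_true, y_pred)
check_precision = precision_score(y_true, y_pred, pos_label='cat')
check_recall = recall_score(y_true, y_pred, pos_label='cat')

print(round(check_accuracy, 2), round(check_precision, 2), round(check_recall, 2))
```

The `pos_label='cat'` argument matters here: without it, scikit-learn would not know which of the two string labels to treat as the positive class.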


<div class="alert alert-block alert-success">

3. You are working as a data scientist for Codeup Cody Creator (C3 for short), a rubber-duck manufacturing plant. Unfortunately, some of the rubber ducks that are produced will have defects. Your team has built several models that try to predict those defects, and the data from their predictions can be found here. Use the predictions dataset and pandas to help answer the following questions:



In [4]:
#load data frame
cody_df = pd.read_csv('c3.csv')

#take a look
cody_df.head()

Unnamed: 0,actual,model1,model2,model3
0,No Defect,No Defect,Defect,No Defect
1,No Defect,No Defect,Defect,Defect
2,No Defect,No Defect,Defect,No Defect
3,No Defect,Defect,Defect,Defect
4,No Defect,No Defect,Defect,No Defect


<div class="alert alert-block alert-info">

a. An internal team wants to investigate the cause of the manufacturing defects. They tell you that they want to identify as many of the ducks that have a defect as possible. Which evaluation metric would be appropriate here? Which model would be the best fit for this use case?

In [5]:
#how many defects and non-defects do we have in the actual data?
cody_df.actual.value_counts()

No Defect    184
Defect        16
Name: actual, dtype: int64

Since we are interested in defects, we will assign 'Defect' as the positive class for the classifier.

- defect = positive class


Our best metric here is recall = tp / (tp + fn)
* how many real positives do we have?
* how many of the defective ducks are actually flagged as defective (positive) by the models?
* let's minimize our false negatives
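
The recall formula and the "filter to actual positives" shortcut give the same number. A minimal sketch on made-up data follows; `toy_df` and its single `model` column are hypothetical and are not the C3 dataset.

```python
import pandas as pd

# Hypothetical toy data; 'Defect' is the positive class
toy_df = pd.DataFrame({
    'actual': ['Defect', 'Defect', 'Defect', 'Defect', 'No Defect', 'No Defect'],
    'model':  ['Defect', 'Defect', 'Defect', 'No Defect', 'Defect', 'No Defect'],
})

# Count true positives and false negatives directly
tp = ((toy_df.actual == 'Defect') & (toy_df.model == 'Defect')).sum()
fn = ((toy_df.actual == 'Defect') & (toy_df.model == 'No Defect')).sum()
recall_from_counts = tp / (tp + fn)

# Equivalent shortcut: keep only the actual positives, then take the hit rate
actual_positives = toy_df[toy_df.actual == 'Defect']
recall_from_subset = (actual_positives.model == 'Defect').mean()

print(recall_from_counts, recall_from_subset)
```

The shortcut works because, among actual positives, every row is either a true positive or a false negative, so the mean of the matches is tp / (tp + fn).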

In [6]:
# keep only the rows where the duck is actually defective (the real positives)
subset = cody_df[cody_df.actual == 'Defect']
subset.head()

Unnamed: 0,actual,model1,model2,model3
13,Defect,No Defect,Defect,Defect
30,Defect,Defect,No Defect,Defect
65,Defect,Defect,Defect,Defect
70,Defect,Defect,Defect,Defect
74,Defect,No Defect,No Defect,Defect


In [7]:
#Model 1 recall
(subset.actual == subset.model1).mean()

0.5

In [8]:
# Model 2 recall
(subset.actual == subset.model2).mean()

0.5625

In [9]:
# Model 3 recall
(subset.actual == subset.model3).mean()

0.8125

Takeaways:

 - Quality Control should select a model with higher recall (to avoid false negatives)
 - Quality Control should use Model 3

<div class="alert alert-block alert-info">

b. Recently several stories in the local news have come out highlighting customers who received a rubber duck with a defect, and portraying C3 in a bad light. The PR team has decided to launch a program that gives customers with a defective duck a vacation to Hawaii. They need you to predict which ducks will have defects, but tell you they really don't want to accidentally give out a vacation package when the duck really doesn't have a defect. Which evaluation metric would be appropriate here? Which model would be the best fit for this use case?

The PR team really wants to minimize false positives, which means choosing the model with the highest precision.

So the best metric for this scenario is precision = tp / (tp + fp)

defect = positive class

In [10]:
# choose subset for model 1 where we only select 'positive predictions'
subset = cody_df[cody_df.model1 == 'Defect']

# calculate precision
(subset.actual == subset.model1).mean()

0.8

In [11]:
# choose subset for model 2 where we only select 'positive predictions'
subset = cody_df[cody_df.model2 == 'Defect']

# calculate precision
(subset.actual == subset.model2).mean()

0.1

In [12]:
# choose subset for model3 where we only select 'positive predictions'
subset = cody_df[cody_df.model3 == 'Defect']

# calculate precision
(subset.actual == subset.model3).mean()

0.13131313131313133

Takeaways:

 - Use model 1 since it will minimize the false positive predictions of defects
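
The three precision cells repeat one pattern, so they could be folded into a helper. The sketch below assumes a hypothetical function `precision_for` and a made-up `toy_df`; neither comes from the exercise or its data file.

```python
import pandas as pd

def precision_for(df, model_col, positive='Defect'):
    """Share of positive predictions in `model_col` that are truly positive."""
    predicted_positive = df[df[model_col] == positive]
    return (predicted_positive.actual == positive).mean()

# Hypothetical toy data standing in for the predictions file
toy_df = pd.DataFrame({
    'actual': ['Defect', 'Defect', 'No Defect', 'No Defect', 'No Defect'],
    'model1': ['Defect', 'No Defect', 'No Defect', 'No Defect', 'No Defect'],
    'model2': ['Defect', 'Defect', 'Defect', 'Defect', 'No Defect'],
})

precisions = {col: precision_for(toy_df, col) for col in ['model1', 'model2']}
print(precisions)
```

One caveat with this approach: if a model never predicts the positive class, the subset is empty and the mean comes back as NaN rather than a number.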

<div class="alert alert-block alert-success">

4. You are working as a data scientist for Gives You Paws ™, a subscription-based service that shows you cute pictures of dogs or cats (or both for an additional fee). At Gives You Paws, anyone can upload pictures of their cats or dogs. The photos are then put through a two-step process. First an automated algorithm tags pictures as either a cat or a dog (Phase I). Next, the photos that have been initially identified are put through another round of review, possibly with some human oversight, before being presented to the users (Phase II). Several models have already been developed with the data, and you can find their results here. Given this dataset, use pandas to create a baseline model (i.e. a model that just predicts the most common class) and answer the following questions:

In [13]:
# load data frame
paws_df = pd.read_csv('gives_you_paws.csv')

# take a look
paws_df.head()

Unnamed: 0,actual,model1,model2,model3,model4
0,cat,cat,dog,cat,dog
1,dog,dog,cat,cat,dog
2,dog,cat,cat,cat,dog
3,dog,dog,dog,cat,dog
4,cat,cat,cat,dog,dog


In [14]:
#what kind of columns and dtypes are we dealing with?
paws_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   actual  5000 non-null   object
 1   model1  5000 non-null   object
 2   model2  5000 non-null   object
 3   model3  5000 non-null   object
 4   model4  5000 non-null   object
dtypes: object(5)
memory usage: 195.4+ KB


In [15]:
#what are our actual counts
paws_df.actual.value_counts()

dog    3254
cat    1746
Name: actual, dtype: int64

In [16]:
#set the most common class ('dog') as the baseline
paws_df['baseline'] = paws_df.actual.value_counts().idxmax()
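
The baseline assignment relies on pandas filling every row when a single value is assigned to a column. A small sketch on hypothetical labels (`toy_df` is made up for illustration) shows this, and also that baseline accuracy is just the majority-class share.

```python
import pandas as pd

# Hypothetical toy labels: 3 dogs, 2 cats
toy_df = pd.DataFrame({'actual': ['dog', 'cat', 'dog', 'dog', 'cat']})

# Most frequent class; assigning a scalar fills every row with it
most_common = toy_df.actual.value_counts().idxmax()
toy_df['baseline'] = most_common

# Baseline accuracy equals the majority-class share
baseline_accuracy = (toy_df.actual == toy_df.baseline).mean()
majority_share = toy_df.actual.value_counts(normalize=True).max()

print(most_common, baseline_accuracy, majority_share)
```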

<div class="alert alert-block alert-info">

a. In terms of accuracy, how do the various models compare to the baseline model? Are any of the models better than the baseline?

In [17]:
#baseline accuracy 
(paws_df.actual == paws_df.baseline).mean()

0.6508

In [18]:
#model 1 accuracy
(paws_df.model1 == paws_df.actual).mean()

0.8074

In [19]:
#model 2 accuracy
(paws_df.model2 == paws_df.actual).mean()

0.6304

In [20]:
#model 3 accuracy
(paws_df.model3 == paws_df.actual).mean()

0.5096

In [21]:
#model 4 accuracy
(paws_df.model4 == paws_df.actual).mean()

0.7426

Takeaways:

 - In terms of accuracy, model 1 and model 4 perform better than baseline

<div class="alert alert-block alert-info">

b. Suppose you are working on a team that solely deals with dog pictures. Which of these models would you recommend?

* dog = positive class
* cat = negative class

Two phases: Phase I favors recall = tp / (tp + fn), and Phase II favors precision = tp / (tp + fp)
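
Both metrics can be read off a single confusion matrix. The sketch below uses `pd.crosstab` on hypothetical data (`toy_df` and its `model` column are made up) with 'dog' as the positive class.

```python
import pandas as pd

# Hypothetical toy data; 'dog' is the positive class
toy_df = pd.DataFrame({
    'actual': ['dog', 'dog', 'dog', 'dog', 'cat', 'cat', 'cat', 'cat'],
    'model':  ['dog', 'dog', 'dog', 'cat', 'dog', 'cat', 'cat', 'cat'],
})

# Rows are actual labels, columns are predicted labels
matrix = pd.crosstab(toy_df.actual, toy_df.model)

tp = matrix.loc['dog', 'dog']
fn = matrix.loc['dog', 'cat']
fp = matrix.loc['cat', 'dog']

recall = tp / (tp + fn)       # share of real dogs that were tagged 'dog'
precision = tp / (tp + fp)    # share of 'dog' tags that were right

print(matrix)
print(recall, precision)
```

Recall divides along a row of the matrix (all actual dogs), while precision divides along a column (all predicted dogs), which is why the two can disagree for the same model.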

In [22]:
# For Phase I, choose a model with highest recall

subset = paws_df[paws_df.actual == 'dog']
subset.head()

Unnamed: 0,actual,model1,model2,model3,model4,baseline
1,dog,dog,cat,cat,dog,dog
2,dog,cat,cat,cat,dog,dog
3,dog,dog,dog,cat,dog,dog
5,dog,dog,dog,dog,dog,dog
8,dog,dog,cat,dog,dog,dog


In [23]:
# Model 1 Recall
(subset.actual == subset.model1).mean()

0.803318992009834

In [24]:
# Model 2 Recall
(subset.actual == subset.model2).mean()

0.49078057775046097

In [25]:
# Model 3 Recall
(subset.actual == subset.model3).mean()

0.5086047940995697

In [26]:
# Model 4 Recall
(subset.actual == subset.model4).mean()

0.9557467732022127

Takeaways:

* Model 4 is performing the best, with Recall of 0.96

In [27]:
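# For Phase II, choose a model with highest precision
# (note: this subset filters on actual == 'dog', which is the recall subset;
#  precision needs a subset of each model's 'dog' predictions instead)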

subset = paws_df[paws_df.actual == 'dog']
subset.head()

Unnamed: 0,actual,model1,model2,model3,model4,baseline
1,dog,dog,cat,cat,dog,dog
2,dog,cat,cat,cat,dog,dog
3,dog,dog,dog,cat,dog,dog
5,dog,dog,dog,dog,dog,dog
8,dog,dog,cat,dog,dog,dog


In [28]:
#take another look
paws_df.head()

Unnamed: 0,actual,model1,model2,model3,model4,baseline
0,cat,cat,dog,cat,dog,dog
1,dog,dog,cat,cat,dog,dog
2,dog,cat,cat,cat,dog,dog
3,dog,dog,dog,cat,dog,dog
4,cat,cat,cat,dog,dog,dog


In [29]:
subset1 = paws_df[paws_df.model1 == 'dog']
subset2 = paws_df[paws_df.model2 == 'dog']
subset3 = paws_df[paws_df.model3 == 'dog']
subset4 = paws_df[paws_df.model4 == 'dog']

In [30]:
# Model 1 Precision
(subset1.actual == subset1.model1).mean()

0.8900238338440586

In [31]:
# Model 2 Precision
(subset2.actual == subset2.model2).mean()

0.8931767337807607

In [32]:
# Model 3 Precision
(subset3.actual == subset3.model3).mean()

0.6598883572567783

In [33]:
# Model 4 Precision
(subset4.actual == subset4.model4).mean()

0.7312485304490948

Takeaways:

* Model 2 (precision 0.893) and Model 1 (precision 0.890) perform best and are nearly tied

<div class="alert alert-block alert-info">

c. Suppose you are working on a team that solely deals with cat pictures. Which of these models would you recommend?

* dog = negative class
* cat = positive class

Two phases: Phase I favors recall = tp / (tp + fn), and Phase II favors precision = tp / (tp + fp)

In [34]:
# For Phase I, choose a model with highest recall

subset = paws_df[paws_df.actual == 'cat']
subset.head()

Unnamed: 0,actual,model1,model2,model3,model4,baseline
0,cat,cat,dog,cat,dog,dog
4,cat,cat,cat,dog,dog,dog
6,cat,cat,cat,cat,dog,dog
7,cat,dog,cat,cat,dog,dog
11,cat,cat,dog,cat,cat,dog


In [35]:
# Model 1 Recall
(subset.actual == subset.model1).mean()

0.8150057273768614

In [36]:
# Model 2 Recall
(subset.actual == subset.model2).mean()

0.8906071019473081

In [37]:
# Model 3 Recall
(subset.actual == subset.model3).mean()

0.5114547537227949

In [38]:
# Model 4 Recall
(subset.actual == subset.model4).mean()

0.34536082474226804

Takeaways:

* Model 2 is performing the best, with Recall of 0.89

In [39]:
subset1 = paws_df[paws_df.model1 == 'cat']
subset2 = paws_df[paws_df.model2 == 'cat']
subset3 = paws_df[paws_df.model3 == 'cat']
subset4 = paws_df[paws_df.model4 == 'cat']

In [40]:
# Model 1 Precision
(subset1.actual == subset1.model1).mean()

0.6897721764420747

In [41]:
# Model 2 Precision
(subset2.actual == subset2.model2).mean()

0.4841220423412204

In [42]:
# Model 3 Precision
(subset3.actual == subset3.model3).mean()

0.358346709470305

In [43]:
# Model 4 Precision
(subset4.actual == subset4.model4).mean()

0.8072289156626506

Takeaways:

* Model 4 is performing best with Precision of 0.807

<div class="alert alert-block alert-success">

5. Follow the links below to read the documentation about each function, then apply those functions to the data from the previous problem.

<div class="alert alert-block alert-info">

sklearn.metrics.precision_score

sklearn.metrics.recall_score

sklearn.metrics.accuracy_score

In [44]:
from sklearn.metrics import precision_score, recall_score, accuracy_score

In [45]:
def calculate_precision(predictions, positive='dog'):
    return precision_score(paws_df.actual, predictions, pos_label=positive)

In [46]:
def calculate_recall(predictions, positive='dog'):
    return recall_score(paws_df.actual, predictions, pos_label=positive)

In [47]:
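def calculate_accuracy(predictions, positive='dog'):
    # `positive` is unused (accuracy has no positive class); kept so the three helpers share a signature
    return accuracy_score(paws_df.actual, predictions)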

In [48]:
pd.concat([
    paws_df.loc[:, 'model1':'baseline'].apply(calculate_recall).rename('recall'),
    paws_df.loc[:, 'model1':'baseline'].apply(calculate_precision).rename('precision'),
    paws_df.loc[:, 'model1':'baseline'].apply(calculate_accuracy).rename('accuracy'),
], axis=1)

Unnamed: 0,recall,precision,accuracy
model1,0.803319,0.890024,0.8074
model2,0.490781,0.893177,0.6304
model3,0.508605,0.659888,0.5096
model4,0.955747,0.731249,0.7426
baseline,1.0,0.6508,0.6508
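
The helpers above read `paws_df.actual` from the enclosing notebook. A variant that takes the true labels as an argument might look like the sketch below; `score_model` and `toy_df` are hypothetical names used only for illustration.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical toy data; 'dog' is the positive class
toy_df = pd.DataFrame({
    'actual': ['dog', 'dog', 'dog', 'cat', 'cat'],
    'model1': ['dog', 'dog', 'cat', 'cat', 'cat'],
    'model2': ['dog', 'dog', 'dog', 'dog', 'cat'],
})

def score_model(y_true, y_pred, positive='dog'):
    """Return recall, precision, and accuracy for one prediction column."""
    return pd.Series({
        'recall': recall_score(y_true, y_pred, pos_label=positive),
        'precision': precision_score(y_true, y_pred, pos_label=positive),
        'accuracy': accuracy_score(y_true, y_pred),
    })

# One row per model, one column per metric
scores = pd.DataFrame({
    col: score_model(toy_df.actual, toy_df[col]) for col in ['model1', 'model2']
}).T
print(scores)
```

Passing `y_true` explicitly keeps the function usable on any DataFrame, at the cost of a slightly longer call.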


<div class="alert alert-block alert-info">

sklearn.metrics.classification_report

In [49]:
from sklearn.metrics import classification_report

In [50]:
#classification report of model 1
print("Model 1")
pd.DataFrame(classification_report(paws_df.actual, paws_df.model1, labels=['cat','dog'], output_dict=True)).T

Model 1


Unnamed: 0,precision,recall,f1-score,support
cat,0.689772,0.815006,0.747178,1746.0
dog,0.890024,0.803319,0.844452,3254.0
accuracy,0.8074,0.8074,0.8074,0.8074
macro avg,0.789898,0.809162,0.795815,5000.0
weighted avg,0.820096,0.8074,0.810484,5000.0


In [51]:
#classification report of model 2
print("Model 2")
pd.DataFrame(classification_report(paws_df.actual, paws_df.model2, labels=['cat','dog'], output_dict=True)).T

Model 2


Unnamed: 0,precision,recall,f1-score,support
cat,0.484122,0.890607,0.627269,1746.0
dog,0.893177,0.490781,0.633479,3254.0
accuracy,0.6304,0.6304,0.6304,0.6304
macro avg,0.688649,0.690694,0.630374,5000.0
weighted avg,0.750335,0.6304,0.63131,5000.0


In [52]:
#classification report of model 3
print("Model 3")
pd.DataFrame(classification_report(paws_df.actual, paws_df.model3, labels=['cat','dog'], output_dict=True)).T

Model 3


Unnamed: 0,precision,recall,f1-score,support
cat,0.358347,0.511455,0.421425,1746.0
dog,0.659888,0.508605,0.574453,3254.0
accuracy,0.5096,0.5096,0.5096,0.5096
macro avg,0.509118,0.51003,0.497939,5000.0
weighted avg,0.55459,0.5096,0.521016,5000.0


In [53]:
#classification report of model 4
print("Model 4")
pd.DataFrame(classification_report(paws_df.actual, paws_df.model4, labels=['cat','dog'], output_dict=True)).T

Model 4


Unnamed: 0,precision,recall,f1-score,support
cat,0.807229,0.345361,0.483755,1746.0
dog,0.731249,0.955747,0.82856,3254.0
accuracy,0.7426,0.7426,0.7426,0.7426
macro avg,0.769239,0.650554,0.656157,5000.0
weighted avg,0.757781,0.7426,0.708154,5000.0
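
The four report cells differ only in the model column, so a loop could build them all at once. This is a sketch on hypothetical data (`toy_df` and the `reports` dictionary are made up), using the same `classification_report` arguments as the cells above.

```python
import pandas as pd
from sklearn.metrics import classification_report

# Hypothetical toy data with two prediction columns
toy_df = pd.DataFrame({
    'actual': ['dog', 'dog', 'dog', 'cat', 'cat'],
    'model1': ['dog', 'dog', 'cat', 'cat', 'cat'],
    'model2': ['dog', 'dog', 'dog', 'dog', 'cat'],
})

# One report DataFrame per model, keyed by column name
reports = {
    col: pd.DataFrame(
        classification_report(
            toy_df.actual, toy_df[col], labels=['cat', 'dog'], output_dict=True
        )
    ).T
    for col in ['model1', 'model2']
}

# Pull a single cell out of a report: recall for the 'dog' class under model1
dog_recall_model1 = reports['model1'].loc['dog', 'recall']
print(reports['model1'])
print(dog_recall_model1)
```

Because each report is a DataFrame, individual values can be selected with `.loc` and compared across models instead of being read off printed tables.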
