# Model Evaluation Exercises

In [75]:
import pandas as pd

from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

- **accuracy**: (TP + TN) / (TP + TN + FP + FN)
    - (3 + 2) / (3 + 2 + 1 + 2) = 62.5%

  
- **precision**: TP / (TP + FP)
    - 3 / (3 + 1) = 75%
   
   
- **recall**: TP / (TP + FN)
    - 3 / (3 + 2) = 60%
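
The three formulas above can be checked with a few lines of Python. This is only a sketch: the function name `classification_metrics` is an assumption made for illustration, and the counts are the ones used in the hand calculation (TP = 3, TN = 2, FP = 1, FN = 2).

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision and recall from the four confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,   # share of all predictions that are correct
        "precision": tp / (tp + fp),     # share of positive predictions that are correct
        "recall": tp / (tp + fn),        # share of actual positives that were found
    }


metrics = classification_metrics(tp=3, tn=2, fp=1, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Precision and recall share the same numerator (TP) and differ only in the denominator, which is why it matters to keep FP and FN straight.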
   

#### 2. Given the following confusion matrix, evaluate (by hand) the model's performance.

|               | pred dog   | pred cat   |
|:------------  |-----------:|-----------:|
| actual dog    |         46 |         7  |
| actual cat    |         13 |         34 |


In [8]:
# positive = dog
# negative = cat
TP = 46  # actual dog, predicted dog
TN = 34  # actual cat, predicted cat
FP = 13  # actual cat, predicted dog
FN = 7   # actual dog, predicted cat
model_accuracy = (TP + TN)/(TP + TN + FP + FN)
model_precision = TP/(TP + FP)
model_recall = TP/(TP + FN)

print(model_accuracy)
print(model_precision)
print(model_recall)

0.8
0.7796610169491526
0.8679245283018868


False positive = predicting a dog when it is actually a cat
False negative = predicting a cat when it is actually a dog
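
One way to avoid mixing up the off-diagonal cells is to read each count out of the matrix by its row and column label. The sketch below rebuilds the confusion matrix from this question as a pandas DataFrame (rows are actual classes, columns are predicted classes) and treats dog as the positive class; the variable names are illustrative.

```python
import pandas as pd

# Confusion matrix from the question: rows = actual, columns = predicted
confusion = pd.DataFrame(
    {"pred dog": [46, 13], "pred cat": [7, 34]},
    index=["actual dog", "actual cat"],
)

# positive = dog, negative = cat
tp = confusion.loc["actual dog", "pred dog"]  # dogs predicted as dogs
fn = confusion.loc["actual dog", "pred cat"]  # dogs predicted as cats
fp = confusion.loc["actual cat", "pred dog"]  # cats predicted as dogs
tn = confusion.loc["actual cat", "pred cat"]  # cats predicted as cats

accuracy = (tp + tn) / confusion.to_numpy().sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"accuracy:  {accuracy:.2%}")
print(f"precision: {precision:.2%}")
print(f"recall:    {recall:.2%}")
```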


#### 3. You are working as a data scientist for Codeup Cody Creator (C3 for short), a rubber-duck manufacturing plant.

Unfortunately, some of the rubber ducks that are produced will have defects. Your team has built several models that try to predict those defects, and the data from their predictions can be found here.

Use the predictions dataset and pandas to help answer the following questions:

In [13]:
c3 = pd.read_csv('c3.csv')
c3.head()

Unnamed: 0,actual,model1,model2,model3
0,No Defect,No Defect,Defect,No Defect
1,No Defect,No Defect,Defect,Defect
2,No Defect,No Defect,Defect,No Defect
3,No Defect,Defect,Defect,Defect
4,No Defect,No Defect,Defect,No Defect


##### An internal team wants to investigate the cause of the manufacturing defects. They tell you that they want to identify as many of the ducks that have a defect as possible. Which evaluation metric would be appropriate here? Which model would be the best fit for this use case?

In [23]:
# positive = defect
# negative = no defect

We want to identify as many of the actual defects (true positives) as possible, so a false negative, a defective duck the model misses, is the costly error.
In this case, we would use recall, which is highest when there are as few false negatives as possible.
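
Recall only looks at rows where the actual label is positive, so it can be computed for every model column at once. The sketch below uses a small hypothetical frame (not the contents of `c3.csv`); the column names `model_a` and `model_b` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, NOT taken from c3.csv
toy = pd.DataFrame({
    "actual":  ["Defect", "Defect", "Defect", "Defect",
                "No Defect", "No Defect", "No Defect", "No Defect"],
    "model_a": ["Defect", "Defect", "No Defect", "No Defect",
                "No Defect", "No Defect", "No Defect", "Defect"],
    "model_b": ["Defect", "Defect", "Defect", "No Defect",
                "Defect", "Defect", "No Defect", "No Defect"],
})

# Keep only the ducks that really are defective (the actual positives)
actual_defects = toy[toy.actual == "Defect"]

# Recall per model: share of actual defects that each model flagged as "Defect"
recall_by_model = actual_defects.drop(columns="actual").eq("Defect").mean()
best_model = recall_by_model.idxmax()

print(recall_by_model)
print(f"highest recall: {best_model}")
```

In this toy frame `model_b` has the higher recall even though it raises more false alarms, which is the trade-off accepted when missing a defect is the costly error.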

In [24]:
subset = c3[c3.actual == 'Defect']
subset

Unnamed: 0,actual,model1,model2,model3
13,Defect,No Defect,Defect,Defect
30,Defect,Defect,No Defect,Defect
65,Defect,Defect,Defect,Defect
70,Defect,Defect,Defect,Defect
74,Defect,No Defect,No Defect,Defect
87,Defect,No Defect,Defect,Defect
118,Defect,No Defect,Defect,No Defect
135,Defect,Defect,No Defect,Defect
140,Defect,No Defect,Defect,Defect
147,Defect,Defect,No Defect,Defect


In [26]:
model1_recall = (subset.actual == subset.model1).mean()
model2_recall = (subset.actual == subset.model2).mean()
model3_recall = (subset.actual == subset.model3).mean()


print(f'model1_recall: {model1_recall:.2%}')
print(f'model2_recall: {model2_recall:.2%}')
print(f'model3_recall: {model3_recall:.2%}')

model1_recall: 50.00%
model2_recall: 56.25%
model3_recall: 81.25%


Model 3 would be the best fit for this use case

##### Recently several stories in the local news have come out highlighting customers who received a rubber duck with a defect, and portraying C3 in a bad light. The PR team has decided to launch a program that gives customers with a defective duck a vacation to Hawaii. They need you to predict which ducks will have defects, but tell you they really don't want to accidentally give out a vacation package when the duck really doesn't have a defect. Which evaluation metric would be appropriate here? Which model would be the best fit for this use case?

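We want to identify defects, but false positives are costly: each one is a vacation given away for a duck that has no defect. Precision is therefore the most appropriate metric, because it measures the share of predicted defects that are actual defects and is highest when false positives are few.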
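
Precision conditions on the prediction rather than on the actual label, so each model needs its own subset of rows. A minimal sketch on hypothetical data (again not the contents of `c3.csv`; `model_a` and `model_b` are invented column names):

```python
import pandas as pd

# Hypothetical toy data, NOT taken from c3.csv
toy = pd.DataFrame({
    "actual":  ["Defect", "Defect", "Defect", "Defect",
                "No Defect", "No Defect", "No Defect", "No Defect"],
    "model_a": ["Defect", "Defect", "No Defect", "No Defect",
                "No Defect", "No Defect", "No Defect", "Defect"],
    "model_b": ["Defect", "Defect", "Defect", "No Defect",
                "Defect", "Defect", "No Defect", "No Defect"],
})

# Precision per model: among rows the model flagged as "Defect",
# what share are actually defective?
precision_by_model = {
    col: (toy.loc[toy[col] == "Defect", "actual"] == "Defect").mean()
    for col in ["model_a", "model_b"]
}
best_model = max(precision_by_model, key=precision_by_model.get)

for col, value in precision_by_model.items():
    print(f"{col}: {value:.2%}")
print(f"highest precision: {best_model}")
```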

In [33]:
subsetm1 = c3[c3.model1 == 'Defect']
subsetm1.head()

Unnamed: 0,actual,model1,model2,model3
3,No Defect,Defect,Defect,Defect
30,Defect,Defect,No Defect,Defect
62,No Defect,Defect,No Defect,No Defect
65,Defect,Defect,Defect,Defect
70,Defect,Defect,Defect,Defect


In [36]:
m1_precision = (subsetm1.actual == subsetm1.model1).mean()

print(f'm1_precision: {m1_precision:.2%}')

m1_precision: 80.00%


In [37]:
subsetm2 = c3[c3.model2 == 'Defect']
subsetm2.head()

Unnamed: 0,actual,model1,model2,model3
0,No Defect,No Defect,Defect,No Defect
1,No Defect,No Defect,Defect,Defect
2,No Defect,No Defect,Defect,No Defect
3,No Defect,Defect,Defect,Defect
4,No Defect,No Defect,Defect,No Defect


In [39]:
m2_precision = (subsetm2.actual == subsetm2.model2).mean()
print(f'm2_precision: {m2_precision:.2%}')

m2_precision: 10.00%


In [40]:
subsetm3 = c3[c3.model3 == 'Defect']
subsetm3.head()

Unnamed: 0,actual,model1,model2,model3
1,No Defect,No Defect,Defect,Defect
3,No Defect,Defect,Defect,Defect
5,No Defect,No Defect,No Defect,Defect
9,No Defect,No Defect,No Defect,Defect
13,Defect,No Defect,Defect,Defect


In [41]:
m3_precision = (subsetm3.actual == subsetm3.model3).mean()

print(f'm3_precision: {m3_precision:.2%}')

m3_precision: 13.13%


As Model 1 has precision of 80%, this would be the best fit for this use case.

#### 4. You are working as a data scientist for Gives You Paws ™, a subscription based service that shows you cute pictures of dogs or cats (or both for an additional fee).

At Gives You Paws, anyone can upload pictures of their cats or dogs. The photos are then put through a two step process. First an automated algorithm tags pictures as either a cat or a dog (Phase I). Next, the photos that have been initially identified are put through another round of review, possibly with some human oversight, before being presented to the users (Phase II).

Several models have already been developed with the data, and you can find their results here.

Given this dataset, use pandas to create a baseline model (i.e. a model that just predicts the most common class) and answer the following questions:

In [44]:
paws = pd.read_csv('gives_you_paws.csv')
paws.head()

Unnamed: 0,actual,model1,model2,model3,model4
0,cat,cat,dog,cat,dog
1,dog,dog,cat,cat,dog
2,dog,cat,cat,cat,dog
3,dog,dog,dog,cat,dog
4,cat,cat,cat,dog,dog


In [45]:
# positive is dog
# negative is cat

In [46]:
paws.actual.value_counts()

dog    3254
cat    1746
Name: actual, dtype: int64

In [48]:
paws['baseline'] = 'dog'
paws.head()

Unnamed: 0,actual,model1,model2,model3,model4,baseline
0,cat,cat,dog,cat,dog,dog
1,dog,dog,cat,cat,dog,dog
2,dog,cat,cat,cat,dog,dog
3,dog,dog,dog,cat,dog,dog
4,cat,cat,cat,dog,dog,dog


##### a. In terms of accuracy, how do the various models compare to the baseline model? Are any of the models better than the baseline?


In [50]:
model1_accuracy = (paws.actual == paws.model1).mean()
model2_accuracy = (paws.actual == paws.model2).mean()
model3_accuracy = (paws.actual == paws.model3).mean()
model4_accuracy = (paws.actual == paws.model4).mean()
baseline_accuracy = (paws.actual == paws.baseline).mean()

print(f'model1_accuracy: {model1_accuracy:.2%}')
print(f'model2_accuracy: {model2_accuracy:.2%}')
print(f'model3_accuracy: {model3_accuracy:.2%}')
print(f'model4_accuracy: {model4_accuracy:.2%}')
print(f'baseline_accuracy: {baseline_accuracy:.2%}')

model1_accuracy: 80.74%
model2_accuracy: 63.04%
model3_accuracy: 50.96%
model4_accuracy: 74.26%
baseline_accuracy: 65.08%


Models 1 and 4 are more accurate than the baseline (65.08%); models 2 and 3 are not.
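
The baseline label does not have to be typed in by hand; it can be taken from the most frequent value of `actual`, and all accuracies can then be compared in one step. The sketch below runs on a small hypothetical frame rather than `gives_you_paws.csv`, with invented column names `model_a` and `model_b`.

```python
import pandas as pd

# Hypothetical toy data, NOT taken from gives_you_paws.csv
toy = pd.DataFrame({
    "actual":  ["dog", "dog", "dog", "cat", "cat", "dog"],
    "model_a": ["dog", "dog", "cat", "cat", "cat", "dog"],
    "model_b": ["cat", "dog", "cat", "dog", "cat", "dog"],
})

# Baseline: always predict the most common actual class
most_common = toy.actual.value_counts().idxmax()
toy["baseline"] = most_common

# Accuracy of every prediction column: share of rows matching `actual`
accuracy = toy.drop(columns="actual").eq(toy.actual, axis=0).mean()

# Models that do strictly better than the baseline
beats_baseline = accuracy[accuracy > accuracy["baseline"]].index.tolist()

print(accuracy)
print(f"better than baseline: {beats_baseline}")
```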

##### b. Suppose you are working on a team that solely deals with dog pictures. Which of these models would you recommend for Phase I? For Phase II?

In [53]:
# positive is dog
# negative is cat
## For this team, Phase 1 would best be served by a model with high recall,
#  so we capture as many true dog pictures as possible.
## Then, Phase 2 would best be served by a model with high precision,
#  so that as few cat pictures as possible slip through labeled as dogs (few false positives).

In [56]:
paws

Unnamed: 0,actual,model1,model2,model3,model4,baseline
0,cat,cat,dog,cat,dog,dog
1,dog,dog,cat,cat,dog,dog
2,dog,cat,cat,cat,dog,dog
3,dog,dog,dog,cat,dog,dog
4,cat,cat,cat,dog,dog,dog
...,...,...,...,...,...,...
4995,dog,dog,dog,dog,dog,dog
4996,dog,dog,cat,cat,dog,dog
4997,dog,cat,cat,dog,dog,dog
4998,cat,cat,cat,cat,dog,dog


In [57]:
subset_dog = paws[paws.actual == 'dog']
subset_dog

Unnamed: 0,actual,model1,model2,model3,model4,baseline
1,dog,dog,cat,cat,dog,dog
2,dog,cat,cat,cat,dog,dog
3,dog,dog,dog,cat,dog,dog
5,dog,dog,dog,dog,dog,dog
8,dog,dog,cat,dog,dog,dog
...,...,...,...,...,...,...
4993,dog,dog,cat,dog,dog,dog
4995,dog,dog,dog,dog,dog,dog
4996,dog,dog,cat,cat,dog,dog
4997,dog,cat,cat,dog,dog,dog


In [58]:
(subset_dog.actual == subset_dog.model1).mean()

0.803318992009834

In [59]:
model1_recall = (subset_dog.actual == subset_dog.model1).mean()
model2_recall = (subset_dog.actual == subset_dog.model2).mean()
model3_recall = (subset_dog.actual == subset_dog.model3).mean()
model4_recall = (subset_dog.actual == subset_dog.model4).mean()

print(f'model1_recall: {model1_recall:.2%}')
print(f'model2_recall: {model2_recall:.2%}')
print(f'model3_recall: {model3_recall:.2%}')
print(f'model4_recall: {model4_recall:.2%}')

model1_recall: 80.33%
model2_recall: 49.08%
model3_recall: 50.86%
model4_recall: 95.57%


In [60]:
subset_dog_p_m1 = paws[paws.model1 == 'dog']
subset_dog_p_m1

Unnamed: 0,actual,model1,model2,model3,model4,baseline
1,dog,dog,cat,cat,dog,dog
3,dog,dog,dog,cat,dog,dog
5,dog,dog,dog,dog,dog,dog
7,cat,dog,cat,cat,dog,dog
8,dog,dog,cat,dog,dog,dog
...,...,...,...,...,...,...
4992,dog,dog,cat,cat,dog,dog
4993,dog,dog,cat,dog,dog,dog
4995,dog,dog,dog,dog,dog,dog
4996,dog,dog,cat,cat,dog,dog


In [61]:
model1_precision = (subset_dog_p_m1.actual == subset_dog_p_m1.model1).mean()
print(f'model1_precision: {model1_precision:.2%}')

model1_precision: 89.00%


In [62]:
subset_dog_p_m2 = paws[paws.model2 == 'dog']
subset_dog_p_m2

Unnamed: 0,actual,model1,model2,model3,model4,baseline
0,cat,cat,dog,cat,dog,dog
3,dog,dog,dog,cat,dog,dog
5,dog,dog,dog,dog,dog,dog
9,dog,cat,dog,cat,dog,dog
10,dog,dog,dog,dog,dog,dog
...,...,...,...,...,...,...
4983,dog,cat,dog,dog,dog,dog
4988,dog,dog,dog,dog,cat,dog
4990,dog,dog,dog,cat,dog,dog
4995,dog,dog,dog,dog,dog,dog


In [63]:
subset_dog_p_m3 = paws[paws.model3 == 'dog']
subset_dog_p_m3

Unnamed: 0,actual,model1,model2,model3,model4,baseline
4,cat,cat,cat,dog,dog,dog
5,dog,dog,dog,dog,dog,dog
8,dog,dog,cat,dog,dog,dog
10,dog,dog,dog,dog,dog,dog
13,dog,cat,dog,dog,dog,dog
...,...,...,...,...,...,...
4993,dog,dog,cat,dog,dog,dog
4994,cat,cat,cat,dog,dog,dog
4995,dog,dog,dog,dog,dog,dog
4997,dog,cat,cat,dog,dog,dog


In [64]:
subset_dog_p_m4 = paws[paws.model4 == 'dog']
subset_dog_p_m4

Unnamed: 0,actual,model1,model2,model3,model4,baseline
0,cat,cat,dog,cat,dog,dog
1,dog,dog,cat,cat,dog,dog
2,dog,cat,cat,cat,dog,dog
3,dog,dog,dog,cat,dog,dog
4,cat,cat,cat,dog,dog,dog
...,...,...,...,...,...,...
4995,dog,dog,dog,dog,dog,dog
4996,dog,dog,cat,cat,dog,dog
4997,dog,cat,cat,dog,dog,dog
4998,cat,cat,cat,cat,dog,dog


In [65]:
model1_precision = (subset_dog_p_m1.actual == subset_dog_p_m1.model1).mean()
model2_precision = (subset_dog_p_m2.actual == subset_dog_p_m2.model2).mean()
model3_precision = (subset_dog_p_m3.actual == subset_dog_p_m3.model3).mean()
model4_precision = (subset_dog_p_m4.actual == subset_dog_p_m4.model4).mean()

print(f'model1_precision: {model1_precision:.2%}')
print(f'model2_precision: {model2_precision:.2%}')
print(f'model3_precision: {model3_precision:.2%}')
print(f'model4_precision: {model4_precision:.2%}')

model1_precision: 89.00%
model2_precision: 89.32%
model3_precision: 65.99%
model4_precision: 73.12%


##### Recommendation:
- Phase 1: Model4 is best with recall of 95.6%
- Phase 2: Model2 is best with precision of 89.3%

##### c. Suppose you are working on a team that solely deals with cat pictures. Which of these models would you recommend for Phase I? For Phase II?

In [53]:
# positive is cat
# negative is dog
## For this team, Phase 1 would best be served by a model with high recall,
#  so we capture as many true cat pictures as possible.
## Then, Phase 2 would best be served by a model with high precision,
#  so that as few dog pictures as possible slip through labeled as cats (few false positives).


In [66]:
subset_cat = paws[paws.actual == 'cat']
subset_cat

Unnamed: 0,actual,model1,model2,model3,model4,baseline
0,cat,cat,dog,cat,dog,dog
4,cat,cat,cat,dog,dog,dog
6,cat,cat,cat,cat,dog,dog
7,cat,dog,cat,cat,dog,dog
11,cat,cat,dog,cat,cat,dog
...,...,...,...,...,...,...
4987,cat,dog,cat,dog,dog,dog
4989,cat,cat,cat,dog,cat,dog
4991,cat,cat,cat,cat,dog,dog
4994,cat,cat,cat,dog,dog,dog


In [67]:
(subset_cat.actual == subset_cat.model1).mean()

0.8150057273768614

In [68]:
model1_recall = (subset_cat.actual == subset_cat.model1).mean()
model2_recall = (subset_cat.actual == subset_cat.model2).mean()
model3_recall = (subset_cat.actual == subset_cat.model3).mean()
model4_recall = (subset_cat.actual == subset_cat.model4).mean()

print(f'model1_recall: {model1_recall:.2%}')
print(f'model2_recall: {model2_recall:.2%}')
print(f'model3_recall: {model3_recall:.2%}')
print(f'model4_recall: {model4_recall:.2%}')

model1_recall: 81.50%
model2_recall: 89.06%
model3_recall: 51.15%
model4_recall: 34.54%


In [69]:
subset_cat_p_m1 = paws[paws.model1 == 'cat']
subset_cat_p_m1

Unnamed: 0,actual,model1,model2,model3,model4,baseline
0,cat,cat,dog,cat,dog,dog
2,dog,cat,cat,cat,dog,dog
4,cat,cat,cat,dog,dog,dog
6,cat,cat,cat,cat,dog,dog
9,dog,cat,dog,cat,dog,dog
...,...,...,...,...,...,...
4989,cat,cat,cat,dog,cat,dog
4991,cat,cat,cat,cat,dog,dog
4994,cat,cat,cat,dog,dog,dog
4997,dog,cat,cat,dog,dog,dog


In [70]:
subset_cat_p_m2 = paws[paws.model2 == 'cat']
subset_cat_p_m2

Unnamed: 0,actual,model1,model2,model3,model4,baseline
1,dog,dog,cat,cat,dog,dog
2,dog,cat,cat,cat,dog,dog
4,cat,cat,cat,dog,dog,dog
6,cat,cat,cat,cat,dog,dog
7,cat,dog,cat,cat,dog,dog
...,...,...,...,...,...,...
4993,dog,dog,cat,dog,dog,dog
4994,cat,cat,cat,dog,dog,dog
4996,dog,dog,cat,cat,dog,dog
4997,dog,cat,cat,dog,dog,dog


In [71]:
subset_cat_p_m3 = paws[paws.model3 == 'cat']
subset_cat_p_m3

Unnamed: 0,actual,model1,model2,model3,model4,baseline
0,cat,cat,dog,cat,dog,dog
1,dog,dog,cat,cat,dog,dog
2,dog,cat,cat,cat,dog,dog
3,dog,dog,dog,cat,dog,dog
6,cat,cat,cat,cat,dog,dog
...,...,...,...,...,...,...
4990,dog,dog,dog,cat,dog,dog
4991,cat,cat,cat,cat,dog,dog
4992,dog,dog,cat,cat,dog,dog
4996,dog,dog,cat,cat,dog,dog


In [72]:
subset_cat_p_m4 = paws[paws.model4 == 'cat']
subset_cat_p_m4

Unnamed: 0,actual,model1,model2,model3,model4,baseline
11,cat,cat,dog,cat,cat,dog
52,dog,dog,cat,dog,cat,dog
64,cat,cat,cat,cat,cat,dog
66,cat,cat,cat,dog,cat,dog
99,dog,dog,cat,cat,cat,dog
...,...,...,...,...,...,...
4974,cat,cat,cat,dog,cat,dog
4981,cat,cat,cat,dog,cat,dog
4986,cat,dog,cat,cat,cat,dog
4988,dog,dog,dog,dog,cat,dog


In [73]:
model1_precision = (subset_cat_p_m1.actual == subset_cat_p_m1.model1).mean()
model2_precision = (subset_cat_p_m2.actual == subset_cat_p_m2.model2).mean()
model3_precision = (subset_cat_p_m3.actual == subset_cat_p_m3.model3).mean()
model4_precision = (subset_cat_p_m4.actual == subset_cat_p_m4.model4).mean()

print(f'model1_precision: {model1_precision:.2%}')
print(f'model2_precision: {model2_precision:.2%}')
print(f'model3_precision: {model3_precision:.2%}')
print(f'model4_precision: {model4_precision:.2%}')

model1_precision: 68.98%
model2_precision: 48.41%
model3_precision: 35.83%
model4_precision: 80.72%


##### Recommendation:
- Phase 1: Model2 is best (high recall)
- Phase 2: Model4 is best (high precision)
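
Parts b and c run the same calculation with the positive class swapped. A small helper that takes the positive label as a parameter makes that explicit. This is a sketch on hypothetical data; the function name `precision_recall` and the column `model_a` are assumptions for illustration.

```python
import pandas as pd


def precision_recall(df, model_col, positive):
    """Return (precision, recall) for one model column, given the positive label."""
    predicted_pos = df[model_col] == positive
    actual_pos = df["actual"] == positive
    true_pos = (predicted_pos & actual_pos).sum()
    precision = true_pos / predicted_pos.sum()  # TP / (TP + FP)
    recall = true_pos / actual_pos.sum()        # TP / (TP + FN)
    return precision, recall


# Hypothetical toy data, NOT taken from gives_you_paws.csv
toy = pd.DataFrame({
    "actual":  ["dog", "dog", "dog", "cat", "cat", "dog"],
    "model_a": ["dog", "dog", "cat", "cat", "cat", "dog"],
})

dog_precision, dog_recall = precision_recall(toy, "model_a", positive="dog")
cat_precision, cat_recall = precision_recall(toy, "model_a", positive="cat")

print(f"dog as positive: precision {dog_precision:.2%}, recall {dog_recall:.2%}")
print(f"cat as positive: precision {cat_precision:.2%}, recall {cat_recall:.2%}")
```

The same predictions give different precision and recall depending on which class is treated as positive, which is why a model that suits the dog team may not suit the cat team.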

#### 5. Follow the links below to read the documentation about each function, then apply those functions to the data from the previous problem.

In [None]:
sklearn.metrics.accuracy_score(y_true, y_pred)

In [83]:
Model1_calc_acc = accuracy_score(paws.actual, paws.model1)
Model4_calc_acc = accuracy_score(paws.actual, paws.model4)

print(f'Model1_calc_acc: {Model1_calc_acc:.2%}')
print(f'Model4_calc_acc: {Model4_calc_acc:.2%}')

Model1_calc_acc: 80.74%
Model4_calc_acc: 74.26%


In [None]:
sklearn.metrics.precision_score(y_true, y_pred)

In [82]:
# Precision for the dog class; pos_label is needed because the labels are strings
Model1_calc_pre = precision_score(paws.actual, paws.model1, pos_label='dog')
Model2_calc_pre = precision_score(paws.actual, paws.model2, pos_label='dog')
Model3_calc_pre = precision_score(paws.actual, paws.model3, pos_label='dog')
Model4_calc_pre = precision_score(paws.actual, paws.model4, pos_label='dog')

print(f'Model1_calc_pre: {Model1_calc_pre:.2%}')
print(f'Model2_calc_pre: {Model2_calc_pre:.2%}')
print(f'Model3_calc_pre: {Model3_calc_pre:.2%}')
print(f'Model4_calc_pre: {Model4_calc_pre:.2%}')

Model1_calc_pre: 89.00%
Model2_calc_pre: 89.32%
Model3_calc_pre: 65.99%
Model4_calc_pre: 73.12%


In [None]:
sklearn.metrics.recall_score(y_true, y_pred)

In [84]:
# Recall for the cat class; pos_label is needed because the labels are strings
Model1_calc_recall = recall_score(paws.actual, paws.model1, pos_label='cat')
Model2_calc_recall = recall_score(paws.actual, paws.model2, pos_label='cat')
Model3_calc_recall = recall_score(paws.actual, paws.model3, pos_label='cat')
Model4_calc_recall = recall_score(paws.actual, paws.model4, pos_label='cat')

print(f'Model1_calc_recall: {Model1_calc_recall:.2%}')
print(f'Model2_calc_recall: {Model2_calc_recall:.2%}')
print(f'Model3_calc_recall: {Model3_calc_recall:.2%}')
print(f'Model4_calc_recall: {Model4_calc_recall:.2%}')

Model1_calc_recall: 81.50%
Model2_calc_recall: 89.06%
Model3_calc_recall: 51.15%
Model4_calc_recall: 34.54%
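
With string labels, `precision_score` and `recall_score` need to be told which label is the positive class through `pos_label`, since the default is `1`. The sketch below shows this on a hypothetical six-row example (not the Gives You Paws data) and also prints the confusion matrix with an explicit label order.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical toy labels, NOT taken from gives_you_paws.csv
y_true = ["dog", "dog", "dog", "cat", "cat", "dog"]
y_pred = ["dog", "dog", "cat", "cat", "cat", "dog"]

# Rows are actual classes, columns are predicted classes, both in `labels` order
matrix = confusion_matrix(y_true, y_pred, labels=["dog", "cat"])
print(matrix)

# Same predictions, scored once with dog as positive and once with cat as positive
dog_precision = precision_score(y_true, y_pred, pos_label="dog")
dog_recall = recall_score(y_true, y_pred, pos_label="dog")
cat_precision = precision_score(y_true, y_pred, pos_label="cat")
cat_recall = recall_score(y_true, y_pred, pos_label="cat")

print(f"dog: precision {dog_precision:.2%}, recall {dog_recall:.2%}")
print(f"cat: precision {cat_precision:.2%}, recall {cat_recall:.2%}")
```

Passing the full `y_true` and `y_pred` columns lets scikit-learn do the subsetting itself, so no filtered frame is needed.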
