# Model Evaluation Exercises

In [1]:
import numpy as np
import pandas as pd
from model_evaluation_functions import evaluation_metrics

## 1. Artisanal Evaluation
Given the following confusion matrix, evaluate (_by hand_) the model's performance.

|               | actual cat | actual dog |
|:------------  |-----------:|-----------:|
| predicted cat |         34 |          7 |
| predicted dog |         13 |         46 |

In [2]:
from IPython.display import Image
print("Artisanal Evaluation")
# Image(filename='./evaluations-by-hand.png', width=400, height=400)

Artisanal Evaluation


Context:

    On earth 617, the only animals are cats and dogs. Cats and dogs roam freely on the earth. Certain human communities can only be around dogs, because they're allergic to cats. Cats must be kept away from these communities at all costs, or they will sneeze louder than 12 Saturn V Rockets. 

    They've recruited Chris and the gang from good ol' earth 616 to help them evaluate their defense model. The model predicts whether an animal is a cat or not a cat. If the model predicts a cat, industrial-sized presentation lasers will point at the ground and lead the cats away from the community. Dogs ignore the lasers and mark their territory. If a cat enters the community, well, RIP ear drums.

In the context of this problem, what is a false positive (Type I Error)?
> In the context of this problem, a False Positive means that the model __predicted a Cat__, but the animal was __actually a Dog__.
- _Close call_

In the context of this problem, what is a false negative (Type II Error)?
> In the context of this problem, a False Negative means that the model __predicted a Dog__, but the animal was __actually a Cat__.
- _RIP eardrums_

How would you describe this model?
> This model predicts whether an animal is a cat or not a cat.
- "Not a cat" == Dog
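
The by-hand evaluation can be cross-checked with a few lines of arithmetic. This is a minimal sketch, assuming "cat" is the positive class (as in the answers above); the variable names are illustrative and not part of the original notebook.

```python
# Counts read off the confusion matrix above, with "cat" as the positive class.
tp = 34  # predicted cat, actual cat
fp = 7   # predicted cat, actual dog (Type I error)
fn = 13  # predicted dog, actual cat (Type II error)
tn = 46  # predicted dog, actual dog

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)

print(f"Accuracy    {accuracy:.2%}")
print(f"Precision   {precision:.2%}")
print(f"Recall      {recall:.2%}")
print(f"Specificity {specificity:.2%}")
```

Because a missed cat is the costly error in this scenario, recall is probably the number to watch: it is the share of actual cats that the lasers get pointed at.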

## 2. C3 Rubber Duck Manufacturer
You are working as a data scientist for Codeup Cody Creator (C3 for short), a rubber-duck manufacturing plant.

Unfortunately, some of the rubber ducks that are produced will have defects. Your team has built several models that try to predict those defects.
Use the predictions dataset and pandas to help answer the following questions:

An internal team wants to investigate the cause of the manufacturing defects. They tell you that they want to identify as many of the ducks that have a defect as possible.

### Acquire the data

In [35]:
# Load c3's baseline model data - 3 models
df_c3 = pd.read_csv('./c3.csv')

In [40]:
df_c3.head()

Unnamed: 0,actual,model1,model2,model3,baseline
0,No Defect,No Defect,Defect,No Defect,184
1,No Defect,No Defect,Defect,Defect,184
2,No Defect,No Defect,Defect,No Defect,184
3,No Defect,Defect,Defect,Defect,184
4,No Defect,No Defect,Defect,No Defect,184


In [41]:
df_c3['baseline'] = df_c3.actual.value_counts().index[0]

In [52]:
# Create confusion matrices for each model to evaluate their performance.
baseline_outcome_matrix = pd.crosstab(df_c3.baseline, df_c3.actual)
model1_outcome_matrix = pd.crosstab(df_c3.model1, df_c3.actual)  # model 1
model2_outcome_matrix = pd.crosstab(df_c3.model2, df_c3.actual)  # model 2
model3_outcome_matrix = pd.crosstab(df_c3.model3, df_c3.actual)  # model 3

### Evaluate each model's performance

#### Baseline Model

In [53]:
baseline_outcome_matrix

actual,Defect,No Defect
baseline,Unnamed: 1_level_1,Unnamed: 2_level_1
No Defect,16,184


In [54]:
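false_n, true_n = baseline_outcome_matrix.values.ravel()

# The baseline never predicts "Defect", so TN / (TN + FN) is its accuracy.
# Specificity, TN / (TN + FP), would trivially be 100% because FP = 0.
basemodel_accuracy = true_n / (true_n + false_n)
print(f"Baseline Accuracy {basemodel_accuracy:.2%}")

Baseline Accuracy 92.00%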


#### Model 1 Evaluation

In [55]:
model1_outcome_matrix

actual,Defect,No Defect
model1,Unnamed: 1_level_1,Unnamed: 2_level_1
Defect,8,2
No Defect,8,182


In [56]:
evaluation_metrics(model1_outcome_matrix)

Model Evaluation
----------------
Accuracy               95.00%
Recall                 50.00%
Precision              80.00%
Specificity            98.91%


#### Model 2 Evaluation

In [57]:
model2_outcome_matrix

actual,Defect,No Defect
model2,Unnamed: 1_level_1,Unnamed: 2_level_1
Defect,9,81
No Defect,7,103


In [58]:
evaluation_metrics(model2_outcome_matrix)

Model Evaluation
----------------
Accuracy               56.00%
Recall                 56.25%
Precision              10.00%
Specificity            55.98%


#### Model 3 Evaluation

In [59]:
model3_outcome_matrix

actual,Defect,No Defect
model3,Unnamed: 1_level_1,Unnamed: 2_level_1
Defect,13,86
No Defect,3,98


In [60]:
evaluation_metrics(model3_outcome_matrix)

Model Evaluation
----------------
Accuracy               55.50%
Recall                 81.25%
Precision              13.13%
Specificity            53.26%


1. Which evaluation metric would be appropriate here?

>__Answer__: The most appropriate metric for C3's business problem is **Recall**.
The internal team wants to find as many defective ducks as possible, so the costly error is predicting "No Defect" for a duck that actually has a defect (a False Negative, Type II Error).

>__Reasoning__: With "Defect" as the positive class, recall = True Positives / (True Positives + False Negatives), which is the share of actually defective ducks that the model flags.
- If the recall rate is high, the model catches most of the defective rubber ducks.
- If the recall rate is low, the model is letting truckloads of defective rubber ducks slip through undetected.


2. Which model would be the best fit for this use case?
> Model 3 would be the best fit for this use case because it has the __highest__ _recall_.

Model 3 Evaluation
- Accuracy               55.50%
- Recall                 81.25%
- Precision              13.13%
- Specificity            53.26%


   
        Recently several stories in the local news have come out highlighting customers who received a rubber duck with a defect, and portraying C3 in a bad light. The PR team has decided to launch a program that gives customers with a defective duck a vacation to Hawaii.
> DANG. Talk about upholding their reputation. They've put the pressure on... they've hired one of the best in the business, `IGOT this`. I need to minimize those expensive vacations.

        They need you to predict which ducks will have defects, but tell you they really don't want to accidentally give out a vacation package when the duck really doesn't have a defect. Which evaluation metric would be appropriate here?
> __Answer__: *__Precision__* is the appropriate metric to evaluate the models.

>__Reasoning__: Precision looks only at the ducks the model classifies as "Defect": True Positives / (True Positives + False Positives). Due to C3's PR stunt, we need to make sure that people don't get a defect-free duck AND a vacation to Hawaii.
- Note: IRL customers would need to verify their claim. If C3 uses computer vision, its manufacturing plant would have a frame and timestamp when the customers' duck was evaluated.

    Which model would be the best fit for this use case?
> Model 1 would be the best fit for this use case because it has the __highest__ *precision*.

Model 1 Evaluation
- Accuracy               95.00%
- Recall                 50.00%
- Precision              80.00%
- Specificity            98.91%
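
Both model choices above can be double-checked directly from the crosstab counts. The following is a small sketch, assuming "Defect" is the positive class; the `counts` dictionary is an illustrative structure filled in by hand from the outputs above.

```python
# (tp, fp, fn, tn) per model, with "Defect" as the positive class,
# copied from the crosstab outputs above.
counts = {
    "model1": (8, 2, 8, 182),
    "model2": (9, 81, 7, 103),
    "model3": (13, 86, 3, 98),
}

recall = {m: tp / (tp + fn) for m, (tp, fp, fn, tn) in counts.items()}
precision = {m: tp / (tp + fp) for m, (tp, fp, fn, tn) in counts.items()}

best_recall = max(recall, key=recall.get)
best_precision = max(precision, key=precision.get)

print(f"Highest recall:    {best_recall} ({recall[best_recall]:.2%})")
print(f"Highest precision: {best_precision} ({precision[best_precision]:.2%})")
```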


## 3. Gives You Paws
You are working as a data scientist for Gives You Paws ™, a subscription based service that shows you cute pictures of dogs or cats (or both for an additional fee).

At Gives You Paws, anyone can upload pictures of their cats or dogs. The photos are then put through a two step process. First an automated algorithm tags pictures as either a cat or a dog (Phase I). Next, the photos that have been initially identified are put through another round of review, possibly with some human oversight, before being presented to the users (Phase II).

Several models have already been developed with the data, and you can find their results here.

Given this dataset, use pandas to create a baseline model (i.e. a model that just predicts the most common class) and answer the following questions:

In [61]:
df_paws = pd.read_csv('./gives_you_paws.csv')

In [62]:
df_paws.head()

Unnamed: 0,actual,model1,model2,model3,model4
0,cat,cat,dog,cat,dog
1,dog,dog,cat,cat,dog
2,dog,cat,cat,cat,dog
3,dog,dog,dog,cat,dog
4,cat,cat,cat,dog,dog


In [63]:
df_paws['baseline'] = df_paws.actual.value_counts().index[0]

In [65]:
baseline_matrix = pd.crosstab(df_paws.baseline, df_paws.actual) # baseline
model1_matrix = pd.crosstab(df_paws.model1, df_paws.actual)  # model 1
model2_matrix = pd.crosstab(df_paws.model2, df_paws.actual)  # model 2
model3_matrix = pd.crosstab(df_paws.model3, df_paws.actual)  # model 3
model4_matrix = pd.crosstab(df_paws.model4, df_paws.actual)  # model 4

#### Baseline Evaluation

In [68]:
baseline_matrix

actual,cat,dog
baseline,Unnamed: 1_level_1,Unnamed: 2_level_1
dog,1746,3254


In [66]:
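# The baseline always predicts "dog", so the actual cats are false negatives
# and the actual dogs are true negatives ("cat" is the positive class here).
false_negatives, true_negatives = baseline_matrix.values.ravel()

In [69]:
accuracy = true_negatives / (false_negatives + true_negatives)
print(f"Accuracy {accuracy:.2%}")

Accuracy 65.08%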


#### Model 1 Evaluation

In [29]:
model1_matrix

actual,cat,dog
model1,Unnamed: 1_level_1,Unnamed: 2_level_1
cat,1423,640
dog,323,2614


In [30]:
evaluation_metrics(model1_matrix)

Model Evaluation
----------------
Accuracy               80.74%
Recall                 81.50%
Precision              68.98%
Specificity            80.33%


#### Model 2 Evaluation

In [70]:
model2_matrix

actual,cat,dog
model2,Unnamed: 1_level_1,Unnamed: 2_level_1
cat,1555,1657
dog,191,1597


In [71]:
evaluation_metrics(model2_matrix)

Model Evaluation
----------------
Accuracy               63.04%
Recall                 89.06%
Precision              48.41%
Specificity            49.08%


#### Model 3 Evaluation

In [72]:
model3_matrix

actual,cat,dog
model3,Unnamed: 1_level_1,Unnamed: 2_level_1
cat,893,1599
dog,853,1655


In [73]:
evaluation_metrics(model3_matrix)

Model Evaluation
----------------
Accuracy               50.96%
Recall                 51.15%
Precision              35.83%
Specificity            50.86%


#### Model 4 Evaluation

In [74]:
model4_matrix

actual,cat,dog
model4,Unnamed: 1_level_1,Unnamed: 2_level_1
cat,603,144
dog,1143,3110


In [75]:
evaluation_metrics(model4_matrix)

Model Evaluation
----------------
Accuracy               74.26%
Recall                 34.54%
Precision              80.72%
Specificity            95.57%


    In terms of accuracy, how do the various models compare to the baseline model? Are any of the models better than the baseline?
> __Model 1 and Model 4__ _outperform_ the baseline model.

> __Model 2 and Model 3__ _underperform_ the baseline model.

    Suppose you are working on a team that solely deals with dog pictures.
    1. Which of these models would you recommend for Phase I?
> __Set up the Business Problem__:
> Gives You Paws provides `...a subscription based service that shows you cute pictures of dogs or cats (or both for an additional fee)`. The company earns its money by providing only cat photos, only dog photos, or cat and dog photos.
> 1. They have a tiered system: cat only, dog only, or cat and dog (with an additional charge).

> 2. Gives You Paws needs to show its users the correct type(s) of animal(s). Otherwise, it is providing a poor service (e.g. a customer wants to see cute cats but sees a bunch of dog photos) and losing potential profit (showing a customer cat AND dog photos for FREE).

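> __Answer__: I would recommend using __Model 4__ for Phase I because it has the highest __recall__ for dogs. For the dog team, "dog" is the positive class, so recall = True Positives / (True Positives + False Negatives), the share of actual dog photos that get tagged as dogs. The metrics printed above treat "cat" as the positive class, so dog recall corresponds to the reported __Specificity__: 95.57% for Model 4, versus 80.33% for Model 1. With high recall in Phase I, few dog photos are lost before the Phase II review.
> - True Positives = The model correctly classified a dog photo as a dog photo.
> - False Negatives = The model incorrectly classified a dog photo as a cat photo.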

    2. For Phase II?
> For Phase II I would recommend a different metric: __Precision__. With precision, the Gives You Paws "Team Ruff" can find the model that minimizes False Positives. False Positives occur when the model predicts a dog photo but it's actually a cat photo.
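
The metrics printed for each model treat "cat" as the positive class, so they do not directly give recall and precision for dogs. The sketch below recomputes both with "dog" as the positive class, using counts copied by hand from the crosstab outputs above. The `matrices` dictionary and the column names are illustrative, not part of the original notebook.

```python
import pandas as pd

# Crosstab counts copied from the outputs above
# (rows = predicted, columns = actual, both ordered cat, dog).
matrices = {
    "model1": [[1423, 640], [323, 2614]],
    "model2": [[1555, 1657], [191, 1597]],
    "model3": [[893, 1599], [853, 1655]],
    "model4": [[603, 144], [1143, 3110]],
}

rows = []
for name, ((pred_cat_is_cat, pred_cat_is_dog),
           (pred_dog_is_cat, pred_dog_is_dog)) in matrices.items():
    # With "dog" as the positive class:
    tp = pred_dog_is_dog  # predicted dog, actual dog
    fp = pred_dog_is_cat  # predicted dog, actual cat
    fn = pred_cat_is_dog  # predicted cat, actual dog
    rows.append({
        "model": name,
        "dog_recall": tp / (tp + fn),
        "dog_precision": tp / (tp + fp),
    })

dog_metrics = pd.DataFrame(rows).set_index("model")
print(dog_metrics.round(4))
print("Highest dog recall:   ", dog_metrics.dog_recall.idxmax())
print("Highest dog precision:", dog_metrics.dog_precision.idxmax())
```

On these counts, Model 4 has the highest dog recall and Model 2 has the highest dog precision, although Model 1 is within about half a percentage point of Model 2 on precision.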
    
    Suppose you are working on a team that solely deals with cat pictures.
    1. Which of these models would you recommend for Phase I?
    2. For Phase II?

## 4.
Follow the links below to read the documentation about each function, then apply those functions to the data from the previous problem.

    sklearn.metrics.accuracy_score
    sklearn.metrics.precision_score
    sklearn.metrics.recall_score
    sklearn.metrics.classification_report
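
As a starting point, here is a minimal sketch of how these four functions are called. To stay self-contained it uses only the five rows shown in the `df_paws.head()` output above, and it assumes "dog" as the positive class. With the full dataset you would pass `df_paws.actual` and `df_paws.model1` instead.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    precision_score,
    recall_score,
)

# First five rows of the "actual" and "model1" columns from df_paws.head().
actual = ["cat", "dog", "dog", "dog", "cat"]
model1 = ["cat", "dog", "cat", "dog", "cat"]

# sklearn expects the true labels first and the predictions second.
accuracy = accuracy_score(actual, model1)
precision = precision_score(actual, model1, pos_label="dog")
recall = recall_score(actual, model1, pos_label="dog")

print(f"Accuracy  {accuracy:.2%}")
print(f"Precision {precision:.2%}")
print(f"Recall    {recall:.2%}")

# classification_report shows precision and recall for every class at once,
# which avoids having to pick a positive label.
print(classification_report(actual, model1))
```

Note the argument order: `pd.crosstab` in the cells above puts predictions first, while the sklearn metric functions take `y_true` first.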