# Model Evaluation
<hr style="border:2px solid red">

In [1]:
from pydataset import data
import numpy as np
import seaborn as sns
import scipy.stats as stats
import pandas as pd
import matplotlib.pyplot as plt
import math

# import splitting and imputing functions
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

# turn off pink boxes for demo
import warnings
warnings.filterwarnings("ignore")

# import our own acquire module
import acquire

# Remove limits on viewing dataframes
pd.set_option('display.max_columns', None)

### 2. Given the following confusion matrix, evaluate (by hand) the model's performance.
|               | pred dog   | pred cat   |
|:------------  |-----------:|-----------:|
| actual dog    |         46 |         7  |
| actual cat    |         13 |         34 |

- In the context of this problem, what is a false positive?
    - Assuming positive is a cat and negative is not a cat (dog)
    - False Positive: The photo is of a dog, but the prediction is a cat
- In the context of this problem, what is a false negative?
    - The photo is actually of a cat, but it is predicted to be a dog 
- How would you describe this model?
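    - It labels 80 of the 100 photos correctly. Its errors lean toward missed cats (13 false negatives) rather than dogs mistaken for cats (7 false positives).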
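
The answers above depend on which class is treated as positive. As a minimal sketch (the `matrix` dictionary layout and the `outcome_counts` helper are illustrative names, not part of the exercise), the four outcomes can be read off the table for either choice:

```python
# Confusion matrix from the table above: outer key = actual, inner key = predicted.
matrix = {
    "dog": {"dog": 46, "cat": 7},
    "cat": {"dog": 13, "cat": 34},
}


def outcome_counts(matrix, positive):
    """Return TP, FP, FN, TN for a two-class matrix given the positive label."""
    negative = next(label for label in matrix if label != positive)
    return {
        "tp": matrix[positive][positive],  # actual positive, predicted positive
        "fp": matrix[negative][positive],  # actual negative, predicted positive
        "fn": matrix[positive][negative],  # actual positive, predicted negative
        "tn": matrix[negative][negative],  # actual negative, predicted negative
    }


cat_counts = outcome_counts(matrix, positive="cat")
dog_counts = outcome_counts(matrix, positive="dog")
print("cat as positive:", cat_counts)
print("dog as positive:", dog_counts)
```

Swapping the positive class swaps the false positive and false negative counts, which is why the assumption is worth stating before computing any metric.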

In [2]:
# Based on the confusion matrix above, I can put the numbers into their respective outcomes:
true_positive = 34
true_negative = 46
false_positive = 7
false_negative = 13

# Now to use the formulas given in the curriculum for my evaluation metrics:
accuracy = (true_positive + true_negative) / (true_positive + true_negative + false_positive + false_negative)
recall = true_positive / (true_positive + false_negative)
precision = true_positive / (true_positive + false_positive)

# Making a pretty print statement so it's easier to read:
print("The evaluation metrics are as follows:")
print("Accuracy:", accuracy)
print("Recall:", round(recall,2))
print("Precision:", round(precision,2))

The evaluation metrics are as follows:
Accuracy: 0.8
Recall: 0.72
Precision: 0.83
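
As a by-hand check on the rounded values above, the same ratios can be kept as exact fractions. This is only a sketch using the standard-library `fractions` module, with cat still assumed to be the positive class:

```python
from fractions import Fraction

# Counts taken from the confusion matrix, with cat as the positive class.
tp, tn, fp, fn = 34, 46, 7, 13

accuracy = Fraction(tp + tn, tp + tn + fp + fn)  # 80/100, stored reduced as 4/5
recall = Fraction(tp, tp + fn)                   # 34/47
precision = Fraction(tp, tp + fp)                # 34/41

for name, value in [("accuracy", accuracy), ("recall", recall), ("precision", precision)]:
    print(f"{name}: {value} = {float(value):.2f}")
```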


### 3. You are working as a data scientist for Codeup Cody Creator (C3 for short), a rubber-duck manufacturing plant.

### Unfortunately, some of the rubber ducks that are produced will have defects. Your team has built several models that try to predict those defects, and the data from their predictions can be found in "c3.csv".

### Use the predictions dataset and pandas to help answer the following questions:

#### An internal team wants to investigate the cause of the manufacturing defects. They tell you that they want to identify as many of the ducks that have a defect as possible. Which evaluation metric would be appropriate here? Which model would be the best fit for this use case?

In [3]:
# Acquiring the dataset:
ducks = pd.read_csv("c3.csv")
ducks.head()

Unnamed: 0,actual,model1,model2,model3
0,No Defect,No Defect,Defect,No Defect
1,No Defect,No Defect,Defect,Defect
2,No Defect,No Defect,Defect,No Defect
3,No Defect,Defect,Defect,Defect
4,No Defect,No Defect,Defect,No Defect


### The problem states: "...they want to identify as many of the ducks that have a defect as possible." 

#### Thinking through the problem...
- What is the positive and negative case?
    - To make life easier, treat the thing you are trying to find as the Positive case.
    - The way I think of this is: if you were given a list and asked to find the lines for defective ducks, each one you found would be a Positive Identification!
    - Positive: A duck is identified as defective
    - Negative: A duck is not identified as defective
    
- What are the possible outcomes?
    - True Positive: A duck is defective and it does not get sold
    - True Negative: A duck is not defective and gets sold
    - False Positive: A duck is not defective, but is marked as defective, and does not get sold
    - False Negative: A duck is defective, but is not marked as defective, and ends up getting sold
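
The four outcomes listed above can be counted for any one model with a cross-tabulation. The sketch below uses a small hypothetical `toy` DataFrame in place of `c3.csv`, so the numbers are illustrative only:

```python
import pandas as pd

# Hypothetical toy data standing in for the 'actual' and 'model1' columns of c3.csv.
toy = pd.DataFrame({
    "actual": ["Defect", "Defect", "No Defect", "No Defect", "No Defect", "Defect"],
    "model1": ["Defect", "No Defect", "No Defect", "Defect", "No Defect", "Defect"],
})

# Rows are the actual labels, columns are the model's predictions.
table = pd.crosstab(toy.actual, toy.model1, rownames=["actual"], colnames=["predicted"])
print(table)

# With 'Defect' as the positive class:
true_positive = table.loc["Defect", "Defect"]
false_negative = table.loc["Defect", "No Defect"]
false_positive = table.loc["No Defect", "Defect"]
true_negative = table.loc["No Defect", "No Defect"]
```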

### Which evaluation metric would be appropriate here? 
- Codeup Cody Creator would rather over-identify defects than under-identify them. With this in mind, I think Recall would be best, because a False Negative is more costly than a False Positive.

### Which model would be the best fit for this use case?




In [4]:
# Alright, time for some legwork, or in this case, a lot of code!
# Which label (actual) appears most frequently in my dataset?
ducks.actual.value_counts()

No Defect    184
Defect        16
Name: actual, dtype: int64

In [5]:
# Model and baseline accuracy:
# First I'll create a new column called 'baseline_prediction'
# which will be given the most frequent label from actual (in this case 'No Defect')
# this baseline is in no way related to the baselines used in the evaluation matrices
ducks['baseline_prediction'] = 'No Defect'

# The data already has 3 columns dedicated to model predictions, so I'll check all three for accuracy:
model1_accuracy = (ducks.actual == ducks.model1).mean()
model2_accuracy = (ducks.actual == ducks.model2).mean()
model3_accuracy = (ducks.actual == ducks.model3).mean()

# And get a base line accuracy for comparison:
baseline_accuracy = (ducks.actual == ducks.baseline_prediction).mean()

print("Codeup Cody Creator Model Accuracies")
print("=====================================")
print(f'baseline accuracy: {baseline_accuracy:.2%}')
print(f'model one accuracy: {model1_accuracy:.2%}')
print(f'model two accuracy: {model2_accuracy:.2%}')
print(f'model three accuracy: {model3_accuracy:.2%}')


Codeup Cody Creator Model Accuracies
=====================================
baseline accuracy: 92.00%
model one accuracy: 95.00%
model two accuracy: 56.00%
model three accuracy: 55.50%
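
The 92% baseline shows why accuracy alone can mislead when one class dominates. The sketch below rebuilds the label counts from `value_counts()` above as plain lists (the list construction is an assumption for illustration, not how the notebook stores the data) and scores an always-"No Defect" predictor:

```python
# Class counts copied from ducks.actual.value_counts() above.
actual = ["No Defect"] * 184 + ["Defect"] * 16
baseline = ["No Defect"] * len(actual)  # always predict the most common label

correct = sum(a == p for a, p in zip(actual, baseline))
accuracy = correct / len(actual)

# Recall for the 'Defect' class: of the truly defective ducks, how many were flagged?
defect_predictions = [p for a, p in zip(actual, baseline) if a == "Defect"]
recall = sum(p == "Defect" for p in defect_predictions) / len(defect_predictions)

print(f"baseline accuracy: {accuracy:.2%}")
print(f"baseline recall:   {recall:.2%}")
```

The baseline scores well on accuracy while finding none of the defective ducks, so a model has to be judged on the metric that matches the use case.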


In [6]:
# Recall Evaluation
# Recall is the percentage of positive cases that a model accurately predicted

subset = ducks[ducks.actual == 'Defect']
subset.head()

Unnamed: 0,actual,model1,model2,model3,baseline_prediction
13,Defect,No Defect,Defect,Defect,No Defect
30,Defect,Defect,No Defect,Defect,No Defect
65,Defect,Defect,Defect,Defect,No Defect
70,Defect,Defect,Defect,Defect,No Defect
74,Defect,No Defect,No Defect,Defect,No Defect


In [7]:
# I like making loops; so let's make a loop that finds the recall of all our models
models = ["model1" , "model2" , "model3"]

for x in models: 
    model_recall = ( subset.actual == subset[ x ] ).mean()
    print(x, "recall:", round(model_recall * 100),"%")


model1 recall: 50 %
model2 recall: 56 %
model3 recall: 81 %


> ANSWER "Which model would be the best fit for this use case":
>
> <b>Model 3</b> would be the best to use for a Recall Evaluation, as it has the highest recall (81%)
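
The subset-then-compare approach used for recall can be wrapped in a small function. This is a plain-Python sketch; the `recall` helper, the model names, and the labels are all hypothetical and not taken from `c3.csv`:

```python
def recall(actual, predicted, positive):
    """Share of truly positive rows that the model also labeled positive."""
    predictions_for_positives = [p for a, p in zip(actual, predicted) if a == positive]
    if not predictions_for_positives:
        return None  # no positive cases, so recall is undefined
    hits = sum(p == positive for p in predictions_for_positives)
    return hits / len(predictions_for_positives)


# Hypothetical labels for illustration only.
actual = ["Defect", "Defect", "Defect", "Defect", "No Defect", "No Defect"]
predictions = {
    "model_a": ["Defect", "No Defect", "No Defect", "Defect", "No Defect", "No Defect"],
    "model_b": ["Defect", "Defect", "Defect", "No Defect", "Defect", "Defect"],
}

scores = {name: recall(actual, predicted, "Defect") for name, predicted in predictions.items()}
best = max(scores, key=scores.get)
print(scores, "best:", best)
```

Note that `model_b` wins on recall even though it flags both good ducks as defective; recall ignores false positives entirely.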

### Recently several stories in the local news have come out highlighting customers who received a rubber duck with a defect, and portraying C3 in a bad light. The PR team has decided to launch a program that gives customers with a defective duck a vacation to Hawaii. They need you to predict which ducks will have defects, but tell you they really don't want to accidentally give out a vacation package when the duck really doesn't have a defect. 


### Which evaluation metric would be appropriate here? 

- The company would rather miss a defective duck than give a vacation to someone whose duck has no defect.
- So for this case, a False Positive is more costly than a False Negative. Therefore, I believe a Precision evaluation would be best.

### Which model would be the best fit for this use case?

In [8]:
# Precision Evaluation
# Precision is the percentage of positive predictions that the model made, that are correct.
# (i.e. model prediction == 'Defect')

# Loopy Time ~
for x in models:
    # choose the subset of rows where the current model predicted 'Defect' (positive predictions)
    subset = ducks[ducks[ x ] == 'Defect']

    # calculate precision
    model_precision = ( subset.actual == subset[ x ] ).mean()
    
    print(x, "precision:", round(model_precision*100), "%" )


model1 precision: 80 %
model2 precision: 10 %
model3 precision: 13 %


> ANSWER "Which model would be the best fit for this use case": 
>
><b>Model 1</b> would be the best to use for a Precision Evaluation, as it will minimize the False Positive predictions of defects 
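
Precision filters on the model's predictions rather than on the actual labels. A plain-Python sketch follows; the `precision` helper and the three prediction lists are hypothetical:

```python
def precision(actual, predicted, positive):
    """Share of the model's positive predictions that are truly positive."""
    actual_for_flagged = [a for a, p in zip(actual, predicted) if p == positive]
    if not actual_for_flagged:
        return None  # the model never predicted positive, so precision is undefined
    return sum(a == positive for a in actual_for_flagged) / len(actual_for_flagged)


# Hypothetical labels for illustration only.
actual = ["Defect", "No Defect", "No Defect", "Defect", "No Defect"]
cautious = ["Defect", "No Defect", "No Defect", "No Defect", "No Defect"]
eager = ["Defect", "Defect", "Defect", "Defect", "No Defect"]
never = ["No Defect"] * 5

print("cautious:", precision(actual, cautious, "Defect"))
print("eager:   ", precision(actual, eager, "Defect"))
print("never:   ", precision(actual, never, "Defect"))
```

A predictor that never says "Defect", such as an always-"No Defect" baseline, has no positive predictions to score, which is why the helper returns `None` in that case.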

In [9]:
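# Scratch work
# I'm not certain why my answer differs from the instructor's, or where I was going with this,
# so I'll leave it here to review at a later time.
# Two things to check on review: the loop body compares model_subset.model1 instead of model_subset[x],
# and the Model Two / Model Three blocks filter on 'No Defect', which scores the negative class.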
"""
for x in models:
    model_subset = ducks[ducks[ x ] == 'Defect']
    model_precision = (model_subset.model1 == model_subset.actual).mean()

    baseline_subset = ducks[ducks.baseline_prediction == 'Defect']
    baseline_precision = (baseline_subset.baseline_prediction == baseline_subset.actual).mean()

    print(f'model precision: {model_precision:.2%}')
    print(f'baseline precision: {baseline_precision:.2%}')
    
    # Model Two
model_subset = ducks[ducks.model2 == 'No Defect']
model_precision = (model_subset.model2 == model_subset.actual).mean()

baseline_subset = ducks[ducks.baseline_prediction == 'No Defect']
baseline_precision = (baseline_subset.baseline_prediction == baseline_subset.actual).mean()

print(f'model precision: {model_precision:.2%}')
print(f'baseline precision: {baseline_precision:.2%}')

# Model Three
model_subset = ducks[ducks.model3 == 'No Defect']
model_precision = (model_subset.model3 == model_subset.actual).mean()

baseline_subset = ducks[ducks.baseline_prediction == 'No Defect']
baseline_precision = (baseline_subset.baseline_prediction == baseline_subset.actual).mean()

print(f'model precision: {model_precision:.2%}')
print(f'baseline precision: {baseline_precision:.2%}')
"""

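The scratch cell above hard-codes `model_subset.model1` inside its loop and filters on 'No Defect' for models two and three. One possible corrected form, offered as a guess at the intent rather than as the instructor's solution, keeps 'Defect' as the positive class throughout and indexes by the loop variable. The `toy` DataFrame is hypothetical data standing in for `ducks`:

```python
import pandas as pd

# Hypothetical stand-in for the ducks DataFrame; the real data lives in c3.csv.
toy = pd.DataFrame({
    "actual": ["Defect", "No Defect", "No Defect", "Defect", "No Defect", "No Defect"],
    "model1": ["Defect", "No Defect", "No Defect", "No Defect", "No Defect", "No Defect"],
    "model2": ["Defect", "Defect", "Defect", "Defect", "No Defect", "No Defect"],
})
toy["baseline_prediction"] = "No Defect"

precisions = {}
for column in ["model1", "model2", "baseline_prediction"]:
    # Keep 'Defect' as the positive class for every column.
    flagged = toy[toy[column] == "Defect"]
    # Compare the current column (not a hard-coded model1) against the actual label.
    precisions[column] = (flagged[column] == flagged.actual).mean()

print(precisions)
```

The baseline never predicts 'Defect', so its subset is empty and the mean comes back as NaN rather than a percentage.
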
### 4. You are working as a data scientist for Gives You Paws ™, a subscription-based service that shows you cute pictures of dogs or cats (or both for an additional fee).

### At Gives You Paws, anyone can upload pictures of their cats or dogs. The photos are then put through a two-step process. First an automated algorithm tags pictures as either a cat or a dog (Phase I). Next, the photos that have been initially identified are put through another round of review, possibly with some human oversight, before being presented to the users (Phase II).

### Several models have already been developed with the data, and you can find their results in the "gives_you_paws.csv".

In [10]:
# Acquiring the dataset:
paws = pd.read_csv("gives_you_paws.csv")
paws.head()

Unnamed: 0,actual,model1,model2,model3,model4
0,cat,cat,dog,cat,dog
1,dog,dog,cat,cat,dog
2,dog,cat,cat,cat,dog
3,dog,dog,dog,cat,dog
4,cat,cat,cat,dog,dog


In [11]:
# Checking value counts
paws.actual.value_counts()

dog    3254
cat    1746
Name: actual, dtype: int64

### Given this dataset, use pandas to create a baseline model (i.e. a model that just predicts the most common class) and answer the following questions:

### a. In terms of accuracy, how do the various models compare to the baseline model? Are any of the models better than the baseline?


In [12]:
# First, I'll create baseline using the highest value count, which is "dog"
paws['baseline_prediction'] = 'dog'
paws.head()

models = ['baseline_prediction', 'model1', 'model2', 'model3', 'model4' ]

for x in models:
    model_accuracy = ( paws.actual == paws[ x ] ).mean()
    print(f'{x} accuracy: {model_accuracy:.2%}')


baseline_prediction accuracy: 65.08%
model1 accuracy: 80.74%
model2 accuracy: 63.04%
model3 accuracy: 50.96%
model4 accuracy: 74.26%


> Looks like <b>Model 1</b> and <b>Model 4</b> are better than the baseline in terms of accuracy.
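
The baseline idea (always predict the most common class) can be checked without pandas. The sketch below rebuilds the labels from the `value_counts()` output above as a plain list, which is an assumption made for illustration:

```python
from collections import Counter

# Class counts copied from paws.actual.value_counts() above.
actual = ["dog"] * 3254 + ["cat"] * 1746

# The most common label becomes the baseline's prediction for every row.
most_common_label, count = Counter(actual).most_common(1)[0]
baseline_accuracy = count / len(actual)

print(f"baseline predicts '{most_common_label}' with accuracy {baseline_accuracy:.2%}")
```

The baseline's accuracy is simply the share of the majority class, so any model scoring below 65.08% is doing worse than guessing "dog" every time.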

In [13]:
# Instructor Solution
paws["baseline_prediction"] = paws.actual.value_counts().idxmax()

# Calling columns to make a list instead of writing them
models = list(paws.columns)
models = models[1:]
models

['model1', 'model2', 'model3', 'model4', 'baseline_prediction']

### b. Suppose you are working on a team that solely deals with dog pictures. Which of these models would you recommend for Phase I? For Phase II?


In [14]:
# Phase 1: An automated algorithm tags pictures as either a cat or a dog
# For Phase 1, I should choose a model with highest Recall

# Recall Evaluation
# Recall is the percentage of positive cases that a model accurately predicted
# For this case, since they solely deal with dogs, I'll make 'dog' the Positive Identification  
subset = paws[paws.actual == 'dog']

# I'm gonna reuse my loop that finds the recall of all the models 
for x in models: 
    model_recall = ( subset.actual == subset[ x ] ).mean()
    print(x, "recall:", round(model_recall * 100),"%")


model1 recall: 80 %
model2 recall: 49 %
model3 recall: 51 %
model4 recall: 96 %
baseline_prediction recall: 100 %


> For Phase One, it looks like <b>Model 4</b> would be the best for Recall as it will minimize the False Negative predictions of dogs. The baseline's 100% recall is trivial, since it tags every photo as a dog.
>
>i.e. it will minimize tagging a picture of a dog as not a dog 

In [15]:
# Phase 2: Represents photos that have been initially identified, and are put through another round of tagging
# I will use Precision this time to minimize the False Positives, i.e. tagging a photo as a dog that is not a dog

for x in models:
    # choose the subset of rows where the current model predicted 'dog' (positive predictions)
    subset = paws[paws[ x ] == 'dog']

    # calculate precision
    model_precision = ( subset.actual == subset[ x ] ).mean()
    
    print(x, "precision:", round(model_precision*100, 2), "%" )


model1 precision: 89.0 %
model2 precision: 89.32 %
model3 precision: 65.99 %
model4 precision: 73.12 %
baseline_prediction precision: 65.08 %


> For Phase Two, it looks like <b>Model 2</b> would be the best for Precision as it will minimize the False Positive predictions of dogs. 
>
>i.e. it will minimize tagging a photo as a dog that is not a dog

### c. Suppose you are working on a team that solely deals with cat pictures. Which of these models would you recommend for Phase I? For Phase II?

In [16]:
# Time to copy/paste my dog stuff and change it to cats!

# Phase 1: An automated algorithm tags pictures as either a cat or a dog
# For Phase 1, I should choose a model with highest Recall

print("Phase One Recall:")
# Recall Evaluation
# Recall is the percentage of positive cases that a model accurately predicted
# For this case, since they solely deal with cats, I'll make 'cat' the Positive Identification  
models = ['model1', 'model2', 'model3', 'model4' ]
subset = paws[paws.actual == 'cat']

# I'm gonna reuse my loop that finds the recall of all the models 
for x in models: 
    model_recall = ( subset.actual == subset[ x ] ).mean()
    print(x, "recall:", round(model_recall * 100),"%")

    

print("\nPhase Two Precision:")
# Phase 2: Represents photos that have been initially identified, and are put through another round of tagging
# I will use Precision this time to minimize the False Positives, i.e. tagging a photo as a cat that is not a cat

for x in models:
    # choose the subset of rows where the current model predicted 'cat' (positive predictions)
    subset = paws[paws[ x ] == 'cat']

    # calculate precision
    model_precision = ( subset.actual == subset[ x ] ).mean()
    
    print(x, "precision:", round(model_precision*100,2), "%" )


Phase One Recall:
model1 recall: 82 %
model2 recall: 89 %
model3 recall: 51 %
model4 recall: 35 %

Phase Two Precision:
model1 precision: 68.98 %
model2 precision: 48.41 %
model3 precision: 35.83 %
model4 precision: 80.72 %


> For the Cat Team, Phase One, <b>Model 2</b> would be the best for Recall 
> 
> For the Cat Team, Phase Two, <b>Model 4</b> would be the best for Precision
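
Instead of copying the dog cells and editing them for cats, the positive label can be passed as a parameter. The function name `phase_scores`, the model names, and the toy data below are hypothetical; this is a sketch of the idea, not the instructor's solution:

```python
import pandas as pd


def phase_scores(df, model_columns, positive):
    """Return recall and precision per model for the chosen positive label."""
    is_actual_positive = df["actual"] == positive
    rows = {}
    for column in model_columns:
        is_predicted_positive = df[column] == positive
        true_positives = (is_actual_positive & is_predicted_positive).sum()
        rows[column] = {
            "recall": true_positives / is_actual_positive.sum(),
            "precision": true_positives / is_predicted_positive.sum(),
        }
    # Models become the index, metrics become the columns.
    return pd.DataFrame(rows).T


# Hypothetical labels for illustration only.
toy = pd.DataFrame({
    "actual":  ["dog", "dog", "dog", "cat", "cat", "dog"],
    "model_a": ["dog", "dog", "cat", "cat", "dog", "dog"],
    "model_b": ["dog", "cat", "cat", "cat", "cat", "cat"],
})

dog_scores = phase_scores(toy, ["model_a", "model_b"], positive="dog")
cat_scores = phase_scores(toy, ["model_a", "model_b"], positive="cat")
print(dog_scores)
print(cat_scores)
```

In this toy data `model_b` has the best dog precision and the best cat recall, which mirrors how one model can suit different phases depending on which class counts as positive.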