1. Create a new file named `model_evaluation.py` or `model_evaluation.ipynb` for these exercises.

> <input type="checkbox" checked> done

2. Given the following confusion matrix, evaluate (by hand) the model's performance.
    
    ```
    |               | pred dog   | pred cat   |
    |:------------  |-----------:|-----------:|
    | actual dog    |         46 |         7  |
    | actual cat    |         13 |         34 |
    
    ```
    
    - In the context of this problem, what is a false positive?
    - In the context of this problem, what is a false negative?
    - How would you describe this model?

> The same matrix, with dog marked as the positive class (P) and cat as the negative class (N):

```
    |               | pred dog(P)| pred cat(N)|
    |:------------  |-----------:|-----------:|
    | actual dog(P) |         46 |         7  |
    | actual cat(N) |         13 |         34 |
    
```

- In the context of this problem, a **false positive** is <u>predicted dog but actual cat</u>, of which there are 13.
- In the context of this problem, a **false negative** is <u>predicted cat but actual dog</u>, of which there are 7.
- I would describe this model using accuracy, the share of all predictions that are correct:

$\frac{TP+TN}{TP+TN+FP+FN}$

$\frac{46 + 34}{46 + 7 + 13 + 34}$

$\frac{80}{100}$

- $0.8$ accuracy
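
Accuracy alone does not show how the errors split between the two classes. As a minimal sketch (the variable names are illustrative, not part of the exercise), the same four counts also give precision and recall with dog as the positive class:

```python
# Counts read off the confusion matrix above, with dog as the positive class
tp, fn = 46, 7    # actual dog: predicted dog / predicted cat
fp, tn = 13, 34   # actual cat: predicted dog / predicted cat

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of all "dog" predictions, the share that were dogs
recall = tp / (tp + fn)      # of all actual dogs, the share that were found

print(f"accuracy:  {accuracy:.3f}")
print(f"precision: {precision:.3f}")
print(f"recall:    {recall:.3f}")
```

This gives a precision of about 0.78 (46 of 59 dog predictions) and a recall of about 0.87 (46 of 53 actual dogs), so the model misses dogs less often than it mislabels cats as dogs.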

3. You are working as a data scientist for Codeup Cody Creator (C3 for short), a rubber-duck manufacturing plant.
    
    Unfortunately, some of the rubber ducks that are produced will have defects. Your team has built several models that try to predict those defects, and the data from their predictions [can be found here](https://ds.codeup.com/data/c3.csv).
    
    Use the predictions dataset and pandas to help answer the following questions:
    
    - An internal team wants to investigate the cause of the manufacturing defects. They tell you that they want to identify as many of the ducks that have a defect as possible. Which evaluation metric would be appropriate here? Which model would be the best fit for this use case?
    - Recently several stories in the local news have come out highlighting customers who received a rubber duck with a defect, and portraying C3 in a bad light. The PR team has decided to launch a program that gives customers with a defective duck a vacation to Hawaii. They need you to predict which ducks will have defects, but tell you they really don't want to accidentally give out a vacation package when the duck really doesn't have a defect. Which evaluation metric would be appropriate here? Which model would be the best fit for this use case?

In [502]:
# import libraries
import pandas as pd
import numpy as np
from sklearn import metrics

In [45]:
# import dataset
c3 = pd.read_csv('c3.csv')

In [49]:
# look at data
c3.head()

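|    | actual    | model1    | model2 | model3    |
|---:|:----------|:----------|:-------|:----------|
|  0 | No Defect | No Defect | Defect | No Defect |
|  1 | No Defect | No Defect | Defect | Defect    |
|  2 | No Defect | No Defect | Defect | No Defect |
|  3 | No Defect | Defect    | Defect | Defect    |
|  4 | No Defect | No Defect | Defect | No Defect |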


In [59]:
c3.describe()

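|        | actual    | model1    | model2    | model3    |
|:-------|:----------|:----------|:----------|:----------|
| count  | 200       | 200       | 200       | 200       |
| unique | 2         | 2         | 2         | 2         |
| top    | No Defect | No Defect | No Defect | No Defect |
| freq   | 184       | 190       | 110       | 101       |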


In [81]:
mod1 = pd.crosstab(c3.model1,c3.actual)
mod1

| model1 (rows), actual (columns) | Defect | No Defect |
|:--------------------------------|-------:|----------:|
| Defect                          |      8 |         2 |
| No Defect                       |      8 |       182 |


In [83]:
mod2 = pd.crosstab(c3.model2,c3.actual)
mod2

| model2 (rows), actual (columns) | Defect | No Defect |
|:--------------------------------|-------:|----------:|
| Defect                          |      9 |        81 |
| No Defect                       |      7 |       103 |


In [121]:
mod3 = pd.crosstab(c3.model3,c3.actual)
mod3

| model3 (rows), actual (columns) | Defect | No Defect |
|:--------------------------------|-------:|----------:|
| Defect                          |     13 |        86 |
| No Defect                       |      3 |        98 |


In [504]:
def evaluator(df,prediction,actual,model='accuracy',target=None):
    """
    Quick evaluator.
    df, required: the dataframe holding the predictions and actual values
    prediction, required: name of the column with the model's predictions
    actual, required: name of the column with the actual values
    model, opt: the metric to compute. Defaults to 'accuracy'. Additional options: 'precision','recall'
    target, required for precision and recall: the value to treat as the positive class
    """
    
    # run base calculation
    if model == 'accuracy':
        return (df[prediction] == df[actual]).mean()
    elif model == 'precision':
        pos_prediction = df[df[prediction] == target]
        return (pos_prediction[prediction] == pos_prediction[actual]).mean()
    elif model == 'recall':
        pos_actuals = df[df[actual] == target]
        return (pos_actuals[prediction] == pos_actuals[actual]).mean()
    else:
        # fail loudly instead of printing a message and silently returning None
        raise ValueError(f'Unsupported metric: {model!r}')

> - An internal team wants to investigate the cause of the manufacturing defects. They tell you that they want to identify as many of the ducks that have a defect as possible. Which evaluation metric would be appropriate here? Which model would be the best fit for this use case?

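In this case, we'll set a defective duck as 'positive'. The team wants to catch as many defective ducks as possible, which means minimizing false negatives, so the appropriate metric is **recall** rather than accuracy.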

In [441]:
# Test model 1 accuracy
evaluator(c3,'model1','actual')

0.95

In [443]:
# Test model 2 accuracy
evaluator(c3,'model2','actual')

0.56

In [582]:
# Test model 3 accuracy
evaluator(c3,'model3','actual')

0.555

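**model1** has the highest accuracy, but accuracy is misleading here: 184 of the 200 ducks have no defect, so a model can score well while still missing defective ducks (model1 catches only 8 of the 16). Recall is the metric that matches the team's goal: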

In [600]:
# For testing recall
recall = {col:evaluator(c3,col,'actual','recall','Defect') for col in c3.columns[1:]}
# recall

best_fit = max(recall,key=recall.get)
# best_fit
print(f'Best fit is {best_fit}')

Best fit is model3
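
The recall values behind that result can be reproduced by hand from the three crosstabs. A minimal sketch, with the counts typed in from the tables above (the `counts` dictionary is illustrative, not part of the dataset):

```python
# (true positives, false negatives) for the Defect class, read from the crosstabs
counts = {
    "model1": (8, 8),
    "model2": (9, 7),
    "model3": (13, 3),
}

# recall = TP / (TP + FN): the share of actual defects each model catches
recall = {name: tp / (tp + fn) for name, (tp, fn) in counts.items()}
best_fit = max(recall, key=recall.get)

for name, value in recall.items():
    print(f"{name}: {value:.4f}")
print(f"Best fit is {best_fit}")
```

model3 finds 13 of the 16 defective ducks (recall 0.8125), compared with 8 for model1 and 9 for model2.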


> - Recently several stories in the local news have come out highlighting customers who received a rubber duck with a defect, and portraying C3 in a bad light. The PR team has decided to launch a program that gives customers with a defective duck a vacation to Hawaii. They need you to predict which ducks will have defects, but tell you they really don't want to accidentally give out a vacation package when the duck really doesn't have a defect. Which evaluation metric would be appropriate here? Which model would be the best fit for this use case?

For this calculation, defective ducks are once again the **positive** value, while non-defective ducks are the **negative**. Since we want to minimize false positives, this calls for *precision*.

In [475]:
# Create a dictionary that holds the resulting percentages from the evaluator calculation
results = {
    'model1':evaluator(c3,'model1','actual','precision','Defect'),
    'model2':evaluator(c3,'model2','actual','precision','Defect'),
    'model3':evaluator(c3,'model3','actual','precision','Defect')
}

best_fit = max(results,key=results.get)

print(f'Best fit is: {best_fit} at {results[best_fit]}')
# print(results)

Best fit is: model1 at 0.8


As calculated, **model 1** is the best fit for this job: 8 of its 10 defect predictions are correct (precision 0.8).
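
To see how far apart the models are on this metric, the precision of each one can be worked out from the crosstabs. This is a sketch with the counts entered by hand; the `counts` dictionary is an assumption made for the illustration:

```python
# (true positives, false positives) for the Defect class, read from the crosstabs
counts = {
    "model1": (8, 2),
    "model2": (9, 81),
    "model3": (13, 86),
}

# precision = TP / (TP + FP): the share of defect predictions that are real defects
precision = {name: tp / (tp + fp) for name, (tp, fp) in counts.items()}

for name, value in precision.items():
    print(f"{name}: {value:.3f}")
```

model2 and model3 would flag 81 and 86 non-defective ducks respectively, each one an unearned vacation, while model1 would flag only 2.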

4. You are working as a data scientist for Gives You Paws ™, a subscription-based service that shows you cute pictures of dogs or cats (or both for an additional fee).
    
    At Gives You Paws, anyone can upload pictures of their cats or dogs. The photos are then put through a two step process. First an automated algorithm tags pictures as either a cat or a dog (Phase I). Next, the photos that have been initially identified are put through another round of review, possibly with some human oversight, before being presented to the users (Phase II).
    
    Several models have already been developed with the data, and [you can find their results here](https://ds.codeup.com/data/gives_you_paws.csv).
    
    Given this dataset, use pandas to create a baseline model (i.e. a model that just predicts the most common class) and answer the following questions:
    
    1. In terms of accuracy, how do the various models compare to the baseline model? Are any of the models better than the baseline?
    2. Suppose you are working on a team that solely deals with dog pictures. Which of these models would you recommend?
    3. Suppose you are working on a team that solely deals with cat pictures. Which of these models would you recommend?

In [257]:
# Read in the data
paws = pd.read_csv('gives_you_paws.csv')
paws.head()

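|    | actual | model1 | model2 | model3 | model4 |
|---:|:-------|:-------|:-------|:-------|:-------|
|  0 | cat    | cat    | dog    | cat    | dog    |
|  1 | dog    | dog    | cat    | cat    | dog    |
|  2 | dog    | cat    | cat    | cat    | dog    |
|  3 | dog    | dog    | dog    | cat    | dog    |
|  4 | cat    | cat    | cat    | dog    | dog    |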


In [487]:
# build baseline model
def base_model(actual):
    # get mode
    mode = actual.mode()
    # print(mode)
    
    # flag the rows whose actual label equals the most common class
    series = actual.isin(mode)
    
    # the share of flagged rows is the baseline accuracy
    return series.mean()
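
An alternative way to express the same baseline is to build an explicit prediction column and score it like any other model. The sketch below uses a small made-up series rather than the real data, and all names in it are hypothetical:

```python
import pandas as pd

# Hypothetical labels: 3 dogs and 2 cats
actual = pd.Series(["dog", "cat", "dog", "dog", "cat"])

# The baseline predicts the single most common class for every row
most_common = actual.value_counts().idxmax()
baseline_pred = pd.Series(most_common, index=actual.index)

# Accuracy of the baseline is the share of rows where it matches the actual label
baseline_accuracy = (baseline_pred == actual).mean()
print(most_common, baseline_accuracy)
```

Picking exactly one class matters only when two classes are tied for most common: `mode()` returns both, so `isin(mode)` would mark every row as correct, whereas this version still scores a single prediction per row.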

In [491]:
results = {
    'baseline': base_model(paws.actual),
    'model1':evaluator(paws,'model1','actual'),
    'model2':evaluator(paws,'model2','actual'),
    'model3':evaluator(paws,'model3','actual'),
    'model4':evaluator(paws,'model4','actual')
}
# print(results)
for result in results:
    # print(result)
    if results[result] > results['baseline']:
        print(f'{result} exceeds the baseline ({results[result]} > {results["baseline"]})')

model1 exceeds the baseline (0.8074 > 0.6508)
model4 exceeds the baseline (0.7426 > 0.6508)


A: Model 1 and model 4 beat the baseline. Model 2 and model 3 fall below it.

In [508]:
# Evaluate for dogs
model = 'recall'
target = 'dog'

# the baseline accuracy is left out here: it is not a recall value,
# so it cannot be compared with the entries below
results = {
    'model1':evaluator(paws,'model1','actual',model,target),
    'model2':evaluator(paws,'model2','actual',model,target),
    'model3':evaluator(paws,'model3','actual',model,target),
    'model4':evaluator(paws,'model4','actual',model,target)
}

best_fit = max(results,key=results.get)
print(best_fit)

model4


B: For working with dogs, you would likely use *recall* to minimize false negatives. **Model 4** works best for dogs.

In [514]:
# Evaluate for cats
model = 'recall'
target = 'cat'

# the baseline accuracy is left out here: it is not a recall value,
# so it cannot be compared with the entries below
results = {
    'model1':evaluator(paws,'model1','actual',model,target),
    'model2':evaluator(paws,'model2','actual',model,target),
    'model3':evaluator(paws,'model3','actual',model,target),
    'model4':evaluator(paws,'model4','actual',model,target)
}

best_fit = max(results,key=results.get)
print(best_fit)

model2


C: For working with cats, you would also want to minimize false negatives. **Model 2** works best for cats.

5. Follow the links below to read the documentation about each function, then apply those functions to the data from the previous problem.
    - [sklearn.metrics.accuracy_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)
    - [sklearn.metrics.precision_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html)
    - [sklearn.metrics.recall_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html)
    - [sklearn.metrics.classification_report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)

5.
    1. In terms of accuracy, how do the various models compare to the baseline model? Are any of the models better than the baseline?
    
    2. Suppose you are working on a team that solely deals with dog pictures. Which of these models would you recommend?
    
    3. Suppose you are working on a team that solely deals with cat pictures. Which of these models would you recommend?
    

In [548]:
models = paws.columns[1:]
models

Index(['model1', 'model2', 'model3', 'model4'], dtype='object')

In [525]:
metrics.accuracy_score(paws.actual,paws.model1)

0.8074

In [550]:
# A:
# Dictionary comprehension!!
accuracy = {model:metrics.accuracy_score(paws.actual,paws[model]) for model in models}
best_fit = max(accuracy,key=accuracy.get)

print(accuracy)
print(f'The best fit model is {best_fit}')

{'model1': 0.8074, 'model2': 0.6304, 'model3': 0.5096, 'model4': 0.7426}
The best fit model is model1


In [570]:
# B:
# More dictionary comprehension!
recall = {model:metrics.recall_score(paws.actual,paws[model],pos_label='dog') for model in models}
recall

print(recall)
best_fit = max(recall,key=recall.get)
print(f'The best fit model is {best_fit}')

{'model1': 0.803318992009834, 'model2': 0.49078057775046097, 'model3': 0.5086047940995697, 'model4': 0.9557467732022127}
The best fit model is model4


In [572]:
# C:
recall = {model:metrics.recall_score(paws.actual,paws[model],pos_label='cat') for model in models}
recall

print(recall)
best_fit = max(recall,key=recall.get)
print(f'The best fit model is {best_fit}')

{'model1': 0.8150057273768614, 'model2': 0.8906071019473081, 'model3': 0.5114547537227949, 'model4': 0.34536082474226804}
The best fit model is model2
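
The exercise also lists `classification_report`, which reports precision, recall, and F1 for every class in one call. A minimal sketch on a small made-up sample (the labels below are hypothetical and not taken from `gives_you_paws.csv`):

```python
from sklearn.metrics import classification_report

# Hypothetical labels for illustration only
actual = ["dog", "dog", "dog", "cat", "cat", "dog"]
predicted = ["dog", "cat", "dog", "cat", "dog", "dog"]

# Text table with precision, recall, F1, and support per class
print(classification_report(actual, predicted))

# output_dict=True returns the same numbers as nested dictionaries
report = classification_report(actual, predicted, output_dict=True)
dog_recall = report["dog"]["recall"]
cat_precision = report["cat"]["precision"]
print(dog_recall, cat_precision)
```

On the real data the call would presumably look like `metrics.classification_report(paws.actual, paws.model1)`, repeated for each model column.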
