In [9]:
import pandas as pd
import sklearn.metrics
from sklearn.metrics import confusion_matrix

## Exercise 

#### 1. Create a new file named model_evaluation.py or model_evaluation.ipynb for these exercises.

#### 2. Given the following confusion matrix, evaluate (by hand) the model's performance.
- In the context of this problem, what is a false positive?
- In the context of this problem, what is a false negative?
- How would you describe this model?



|               | pred dog   | pred cat   |
|:------------  |-----------:|-----------:|
| actual dog    |         46 |         7  |
| actual cat    |         13 |         34 |


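Treating dog as the positive class and cat as the negative class:
- A false positive is a picture predicted to be a dog that is actually a cat (13 cases).
- A false negative is a picture predicted to be a cat that is actually a dog (7 cases).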

In [3]:
# Accuracy = (TP + TN) / (TP + TN + FP + FN)
(46 + 34) / (13 + 7 + 46 + 34)

0.8

In [4]:
# Precision = TP / (TP + FP)
46 / (46 + 13)

0.7796610169491526

With dog as the positive class, the model's accuracy (80%) is slightly higher than its precision (78%).
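The hand calculations can be double-checked with a few lines of plain Python. This is a minimal sketch that treats dog as the positive class; the variable names are illustrative, and recall is included for comparison even though it was not computed by hand.

```python
# Counts read off the confusion matrix, with "dog" as the positive class
tp = 46  # actual dog, predicted dog
fn = 7   # actual dog, predicted cat
fp = 13  # actual cat, predicted dog
tn = 34  # actual cat, predicted cat

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted dogs, how many are dogs
recall = tp / (tp + fn)      # of the actual dogs, how many were found

print(f"accuracy:  {accuracy:.3f}")
print(f"precision: {precision:.3f}")
print(f"recall:    {recall:.3f}")
```

Keeping the four counts as named variables makes it harder to mix up FP and FN when switching the positive class.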

#### 3. You are working as a data scientist for Codeup Cody Creator (C3 for short), a rubber-duck manufacturing plant.
Unfortunately, some of the rubber ducks that are produced will have defects. Your team has built several models that try to predict those defects, and the data from their predictions can be found here.
Use the predictions dataset and pandas to help answer the following questions:
- An internal team wants to investigate the cause of the manufacturing defects. They tell you that they want to identify as many of the ducks that have a defect as possible. Which evaluation metric would be appropriate here? Which model would be the best fit for this use case?
- Recently several stories in the local news have come out highlighting customers who received a rubber duck with a defect, and portraying C3 in a bad light. The PR team has decided to launch a program that gives customers with a defective duck a vacation to Hawaii. They need you to predict which ducks will have defects, but they tell you they really don't want to accidentally give out a vacation package when the duck really doesn't have a defect. Which evaluation metric would be appropriate here? Which model would be the best fit for this use case?

In [7]:
df = pd.read_csv('c3.csv')

In [12]:
pd.crosstab(df.actual, df.model1)

model1     Defect  No Defect
actual
Defect          8          8
No Defect       2        182


In [13]:
confusion_matrix(df.actual, df.model1, labels=('Defect', 'No Defect'))

array([[  8,   8],
       [  2, 182]])

In [29]:
# Share of actual 'No Defect' ducks that model1 labels correctly: TN / (TN + FP)
182 / 184

0.9891304347826086

In [14]:
df['baseline_prediction'] = 'No Defect'

In [15]:
df.head()

Unnamed: 0,actual,model1,model2,model3,baseline_prediction
0,No Defect,No Defect,Defect,No Defect,No Defect
1,No Defect,No Defect,Defect,Defect,No Defect
2,No Defect,No Defect,Defect,No Defect,No Defect
3,No Defect,Defect,Defect,Defect,No Defect
4,No Defect,No Defect,Defect,No Defect,No Defect


In [17]:
# Compare predicted to actual
model_accuracy = (df.model1 == df.actual).mean()
print(model_accuracy)

# Compare the baseline to actual
baseline_accuracy = (df.baseline_prediction == df.actual).mean()
print(baseline_accuracy)

0.95
0.92


In [24]:
# Restrict to rows where the actual value is the positive class ('Defect')

subset = df[df.actual == 'Defect']

# Recall metric for each model
model1_recall = (subset.model1 == subset.actual).mean()
model1_recall

0.5

In [26]:
model2_recall = (subset.model2 == subset.actual).mean()
model2_recall

0.5625

In [28]:
model3_recall = (subset.model3 == subset.actual).mean()
model3_recall

0.8125

In [30]:
(subset.baseline_prediction == subset.actual).mean()

0.0

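Recall is the appropriate metric because the team wants to catch as many defective ducks as possible. Model 3 has the highest recall (0.8125), so it is the best fit.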
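The recall cells repeat the same steps for each model, so they can be folded into one helper. A minimal sketch, assuming a frame shaped like `c3.csv` (an `actual` column plus one column per model); the `recall_by_model` helper and the tiny `toy` frame are made up for illustration and are not the real data.

```python
import pandas as pd

def recall_by_model(df, positive, actual_col="actual"):
    """Recall per prediction column: share of actual positives that were caught."""
    actual_positive = df[df[actual_col] == positive]
    model_cols = [c for c in df.columns if c != actual_col]
    return pd.Series(
        {c: (actual_positive[c] == positive).mean() for c in model_cols}
    )

# Hypothetical stand-in for c3.csv: 4 defective ducks, 4 good ones
toy = pd.DataFrame({
    "actual": ["Defect"] * 4 + ["No Defect"] * 4,
    "model_a": ["Defect", "Defect", "No Defect", "No Defect",
                "No Defect", "No Defect", "No Defect", "No Defect"],
    "model_b": ["Defect", "Defect", "Defect", "No Defect",
                "Defect", "Defect", "No Defect", "No Defect"],
})

recalls = recall_by_model(toy, positive="Defect")
print(recalls.sort_values(ascending=False))
```

Here `model_b` catches more of the defective ducks even though it also raises more false alarms, which is the trade-off recall ignores.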

In [124]:
model1_subset = df[df.model1 == 'Defect']
model1_subset

Unnamed: 0,actual,model1,model2,model3,baseline_prediction
3,No Defect,Defect,Defect,Defect,No Defect
30,Defect,Defect,No Defect,Defect,No Defect
62,No Defect,Defect,No Defect,No Defect,No Defect
65,Defect,Defect,Defect,Defect,No Defect
70,Defect,Defect,Defect,Defect,No Defect
135,Defect,Defect,No Defect,Defect,No Defect
147,Defect,Defect,No Defect,Defect,No Defect
163,Defect,Defect,Defect,Defect,No Defect
194,Defect,Defect,No Defect,Defect,No Defect
196,Defect,Defect,No Defect,No Defect,No Defect


In [125]:
model1_precision = (model1_subset.model1 == model1_subset.actual).mean()
model1_precision

0.8

In [127]:
model2_subset = df[df.model2 == 'Defect']

In [128]:
model2_precision = (model2_subset.model2 == model2_subset.actual).mean()
model2_precision

0.1

In [129]:
model3_subset = df[df.model3 == 'Defect']

In [130]:
model3_precision = (model3_subset.model3 == model3_subset.actual).mean()
model3_precision

0.13131313131313133

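Precision is the appropriate metric because a false positive means giving away a vacation for a duck that is not defective. Model 1 has the highest precision (0.8), so it is the best fit.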
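The same subset-then-compare pattern works for every model at once. A minimal sketch under the same assumptions as the notebook's frame layout; `precision_by_model` and the `toy` data are hypothetical, and the helper returns NaN for a column that never predicts the positive class, since precision is undefined there.

```python
import pandas as pd

def precision_by_model(df, positive, actual_col="actual"):
    """Precision per prediction column: share of positive predictions that are correct."""
    out = {}
    for col in df.columns:
        if col == actual_col:
            continue
        predicted_positive = df[df[col] == positive]
        # A model that never predicts the positive class has undefined precision
        if predicted_positive.empty:
            out[col] = float("nan")
        else:
            out[col] = (predicted_positive[actual_col] == positive).mean()
    return pd.Series(out)

# Hypothetical predictions, not the real c3.csv
toy = pd.DataFrame({
    "actual":   ["Defect", "Defect", "No Defect", "No Defect", "No Defect"],
    "cautious": ["Defect", "No Defect", "No Defect", "No Defect", "No Defect"],
    "eager":    ["Defect", "Defect", "Defect", "Defect", "No Defect"],
    "baseline": ["No Defect"] * 5,
})

precisions = precision_by_model(toy, positive="Defect")
print(precisions)
```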

#### 4. You are working as a data scientist for Gives You Paws ™, a subscription-based service that shows you cute pictures of dogs or cats (or both for an additional fee).
At Gives You Paws, anyone can upload pictures of their cats or dogs. The photos are then put through a two-step process. First, an automated algorithm tags pictures as either a cat or a dog (Phase I). Next, the photos that have been initially identified are put through another round of review, possibly with some human oversight, before being presented to the users (Phase II).
Several models have already been developed with the data, and you can find their results here.
Given this dataset, use pandas to create a baseline model (i.e. a model that just predicts the most common class) and answer the following questions:
- A. In terms of accuracy, how do the various models compare to the baseline model? Are any of the models better than the baseline?
- B. Suppose you are working on a team that solely deals with dog pictures. Which of these models would you recommend?
- C. Suppose you are working on a team that solely deals with cat pictures. Which of these models would you recommend?

In [51]:
paws = pd.read_csv('gives_you_paws.csv')
paws.head()

Unnamed: 0,actual,model1,model2,model3,model4
0,cat,cat,dog,cat,dog
1,dog,dog,cat,cat,dog
2,dog,cat,cat,cat,dog
3,dog,dog,dog,cat,dog
4,cat,cat,cat,dog,dog


In [60]:
paws.actual.value_counts()

dog    3254
cat    1746
Name: actual, dtype: int64

In [83]:
paws['baseline_prediction'] = 'dog'
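Hardcoding `'dog'` works here because `value_counts()` shows dog is the majority class. As a hedged alternative, the label can be derived from the data so the baseline stays correct if the class balance changes. The `actual` series below is a made-up stand-in for `paws.actual`.

```python
import pandas as pd

# Hypothetical labels; in the notebook this would be paws.actual
actual = pd.Series(["dog", "cat", "dog", "dog", "cat"])

most_common = actual.value_counts().idxmax()  # label with the highest count
baseline_prediction = pd.Series(most_common, index=actual.index)

# Baseline accuracy equals the share of the majority class
baseline_accuracy = (baseline_prediction == actual).mean()
print(most_common, baseline_accuracy)
```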

In [62]:
paws.head()

Unnamed: 0,actual,model1,model2,model3,model4,baseline_prediction
0,cat,cat,dog,cat,dog,dog
1,dog,dog,cat,cat,dog,dog
2,dog,cat,cat,cat,dog,dog
3,dog,dog,dog,cat,dog,dog
4,cat,cat,cat,dog,dog,dog


In [68]:
paws.columns

Index(['actual', 'model1', 'model2', 'model3', 'model4',
       'baseline_prediction'],
      dtype='object')

In [79]:
#get accuracy for all models
def get_model_accuracy():
    print("Accuracy for models")
    for col in paws.columns:
        print(f"{col}_model: {(paws[col] == paws.actual).mean()}")

In [134]:
# get recall for all models (positive class: dog)
def get_model_recall():
    actual_positive = paws[paws.actual == "dog"]
    print('Recall for models')
    for col in actual_positive.columns:
        print(f"{col}_model: {(actual_positive.actual == actual_positive[col]).mean()}")

In [140]:
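# get precision for all models, treating positive_class as the positive label
def get_model_precision(positive_class):
    for col in paws.columns:
        # Keep only the rows where this model predicted the positive class
        subset = paws[paws[col] == positive_class]
        precision = (subset.actual == subset[col]).mean()
        print(f"{col}_model: {precision}")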

In [65]:
# Test baseline model accuracy
baseline_model_accuracy = (paws.baseline_prediction == paws.actual).mean()
baseline_model_accuracy

0.6508

In [67]:
# Test model1 accuracy
model1_accuracy = (paws.model1 == paws.actual).mean()
model1_accuracy

0.8074

In [84]:
get_model_accuracy()

Accuracy for models
actual_model: 1.0
model1_model: 0.8074
model2_model: 0.6304
model3_model: 0.5096
model4_model: 0.7426
baseline_prediction_model: 0.6508


In [135]:
get_model_recall()

Recall for models
actual_model: 1.0
model1_model: 0.803318992009834
model2_model: 0.49078057775046097
model3_model: 0.5086047940995697
model4_model: 0.9557467732022127
baseline_prediction_model: 1.0


In [141]:
# Model precision for dog
get_model_precision('dog')

actual_model: 1.0
model1_model: 0.8900238338440586
model2_model: 0.8931767337807607
model3_model: 0.6598883572567783
model4_model: 0.7312485304490948
baseline_prediction_model: 0.6508


In [142]:
# Model precision for cat
get_model_precision('cat')

actual_model: 1.0
model1_model: 0.6897721764420747
model2_model: 0.4841220423412204
model3_model: 0.358346709470305
model4_model: 0.8072289156626506
baseline_prediction_model: nan


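### Model1 (0.81) and model4 (0.74) are more accurate than the baseline (0.65); model2 (0.63) and model3 (0.51) are less accurate
### For dog pictures, would recommend model1, model2, or model4: model4 has the highest dog recall (0.96), model2 the highest dog precision (0.89), and model1 balances the two (0.89 precision, 0.80 recall)
### For cat pictures, would recommend model1 or model4: model4 has the highest cat precision (0.81), followed by model1 (0.69)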
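Recommendations like these depend on both precision and recall for each class. One way to get all four numbers for a model is a normalized crosstab; this is a sketch on made-up data, not the real `gives_you_paws.csv`, and the `actual` and `pred` series are hypothetical.

```python
import pandas as pd

# Hypothetical predictions for one model; not the real gives_you_paws data
actual = pd.Series(["dog", "dog", "dog", "dog", "cat", "cat"], name="actual")
pred = pd.Series(["dog", "dog", "dog", "cat", "cat", "dog"], name="pred")

# Rows sum to 1: the diagonal is recall for each actual class
recall_table = pd.crosstab(actual, pred, normalize="index")
# Columns sum to 1: the diagonal is precision for each predicted class
precision_table = pd.crosstab(actual, pred, normalize="columns")

for label in ["dog", "cat"]:
    print(label,
          "recall:", round(recall_table.loc[label, label], 3),
          "precision:", round(precision_table.loc[label, label], 3))
```

Reading both tables side by side shows whether a model that looks good for dogs does so at the expense of cats.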

#### 5. Follow the links below to read the documentation about each function, then apply those functions to the data from the previous problem. 

In [97]:
sklearn.metrics.accuracy_score(paws.actual, paws.model1)

0.8074

In [107]:
sklearn.metrics.precision_score(paws.actual, paws.model1, pos_label='dog')

0.8900238338440586

In [111]:
sklearn.metrics.recall_score(paws.actual, paws.model1, pos_label="dog")

0.803318992009834

In [113]:
# print() renders the report's newlines instead of showing the raw string
print(sklearn.metrics.classification_report(paws.actual, paws.model1))

              precision    recall  f1-score   support

         cat       0.69      0.82      0.75      1746
         dog       0.89      0.80      0.84      3254

    accuracy                           0.81      5000
   macro avg       0.79      0.81      0.80      5000
weighted avg       0.82      0.81      0.81      5000

In [121]:
# Get report for all models
def get_model_reports():
    for col in paws.columns:
        print(col)
        print(sklearn.metrics.classification_report(paws.actual, paws[col]))
        print('-----------------------------------------------------------------------------------------------')
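A column that never predicts one of the classes, such as `baseline_prediction`, leaves that class's precision undefined, and scikit-learn warns about it. A possible variant, sketched on made-up data: pass `zero_division=0` to report 0.0 without the warning, and `output_dict=True` to collect the numbers into one comparison table. The `toy` frame and the `summary` layout are assumptions for illustration.

```python
import pandas as pd
from sklearn.metrics import classification_report

# Hypothetical data; the baseline column never predicts "cat"
toy = pd.DataFrame({
    "actual":   ["dog", "dog", "dog", "cat", "cat"],
    "model_a":  ["dog", "dog", "cat", "cat", "dog"],
    "baseline": ["dog"] * 5,
})

rows = {}
for col in ["model_a", "baseline"]:
    report = classification_report(
        toy.actual, toy[col],
        output_dict=True,   # nested dict instead of a formatted string
        zero_division=0,    # report 0.0 rather than warn on undefined precision
    )
    rows[col] = {
        "accuracy": report["accuracy"],
        "dog_precision": report["dog"]["precision"],
        "dog_recall": report["dog"]["recall"],
        "cat_precision": report["cat"]["precision"],
        "cat_recall": report["cat"]["recall"],
    }

# One row per model, one column per metric
summary = pd.DataFrame(rows).T
print(summary.round(2))
```

Note that a reported 0.0 here means "undefined", not "always wrong", so the baseline row should be read with that in mind.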

In [123]:
get_model_reports()

actual
              precision    recall  f1-score   support

         cat       1.00      1.00      1.00      1746
         dog       1.00      1.00      1.00      3254

    accuracy                           1.00      5000
   macro avg       1.00      1.00      1.00      5000
weighted avg       1.00      1.00      1.00      5000

-----------------------------------------------------------------------------------------------
model1
              precision    recall  f1-score   support

         cat       0.69      0.82      0.75      1746
         dog       0.89      0.80      0.84      3254

    accuracy                           0.81      5000
   macro avg       0.79      0.81      0.80      5000
weighted avg       0.82      0.81      0.81      5000

-----------------------------------------------------------------------------------------------
model2
              precision    recall  f1-score   support

         cat       0.48      0.89      0.63      1746
         dog       0.89 

