# Evaluation Model Lesson
#### Corey Solitaire
#### 9/14/2020

## 1. Given the following confusion matrix, evaluate (by hand) the model's performance.

|               | actual cat | actual dog |
|:------------  |-----------:|-----------:|
| predicted cat |         34 |          7 |
| predicted dog |         13 |         46 |

   - In the context of this problem, what is a false positive?
   - In the context of this problem, what is a false negative?
   - How would you describe this model?


## Positive = predicted cat | TP = predicted cat / actual cat | FP = predicted cat / actual dog
## Negative = predicted dog | FN = predicted dog / actual cat | TN = predicted dog / actual dog

#### total observations = 100 pets
#### TP + TN = 80
#### accuracy = 80/100 or 80%

In [30]:
# Content from Ryan's Review
# positive/negative describes the prediction (cat = positive, dog = negative)
# true/false describes whether that prediction matches reality
true_positives = 34
true_negatives = 46
false_positives = 7
false_negatives = 13

# number of TRUE predictions (pos/neg) / total observations
accuracy = (true_positives + true_negatives) / (true_positives + true_negatives + false_positives + false_negatives)

# true positives / (all actual positives)
recall = true_positives / (true_positives + false_negatives)

# true positives / (all positive predictions)
precision = true_positives / (true_positives + false_positives)

# True Negative Rate
# true negatives / (all actual negatives) == TN / (TN + FP)
specificity = true_negatives / (true_negatives + false_positives)

print("Cat-classifier (where 'cat' is the positive prediction)")
print("Accuracy:", accuracy)
print("Recall:", recall)
print("Precision:", precision)
print("Specificity:", specificity)

Cat-classifier (where 'cat' is the positive prediction)
Accuracy: 0.8
Recall: 0.723404255319149
Precision: 0.8292682926829268
Specificity: 0.8679245283018868
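
As a cross-check on the hand calculation, the sketch below rebuilds the 100 pets from the confusion-matrix counts and lets scikit-learn compute the same metrics. The `cells` list and the variable names are illustrative and not part of the original review. Specificity is obtained here as the recall of the `dog` class, which is one common way to get it.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# (actual, predicted, count) for each cell of the confusion matrix above
cells = [
    ("cat", "cat", 34),  # true positives
    ("dog", "cat", 7),   # false positives
    ("cat", "dog", 13),  # false negatives
    ("dog", "dog", 46),  # true negatives
]

# Expand the counts into one label per pet
y_true, y_pred = [], []
for actual, predicted, count in cells:
    y_true += [actual] * count
    y_pred += [predicted] * count

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred, pos_label="cat")
prec = precision_score(y_true, y_pred, pos_label="cat")
spec = recall_score(y_true, y_pred, pos_label="dog")  # recall of the negative class

print(f"accuracy={acc:.4f} recall={rec:.4f} precision={prec:.4f} specificity={spec:.4f}")
```

If the counts were transcribed correctly, these values should agree with the hand-computed ones above.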


## 2. You are working as a data scientist for Codeup Cody Creator (C3 for short), a rubber-duck manufacturing plant.

#### Unfortunately, some of the rubber ducks that are produced will have defects. Your team has built several models that try to predict those defects, and the data from their predictions can be found here.

#### Use the predictions dataset and pandas to help answer the following questions:

- An internal team wants to investigate the cause of the manufacturing defects. They tell you that they want to identify as many of the ducks that have a defect as possible. Which evaluation metric would be appropriate here? Which model would be the best fit for this use case?

- Recently several stories in the local news have come out highlighting customers who received a rubber duck with a defect, and portraying C3 in a bad light. The PR team has decided to launch a program that gives customers with a defective duck a vacation to Hawaii. They need you to predict which ducks will have defects, but tell you they really don't want to accidentally give out a vacation package when the duck really doesn't have a defect. Which evaluation metric would be appropriate here? Which model would be the best fit for this use case?

# Preliminary Findings:
#### - Baseline = No Defect

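With "No Defect" treated as the positive class (cells read predicted / actual):

|                     | actual No Defect | actual Defect        |
|:------------------- |-----------------:|---------------------:|
| predicted No Defect |     TP (no / no) |     FP (no / defect) |
| predicted Defect    | FN (defect / no) | TN (defect / defect) |

Note that the recall and precision code further down sets `positive = "Defect"` instead, because the defect is the class we care about catching.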

In [31]:
import pandas as pd
ducks = pd.read_csv('c3.csv')
ducks

Unnamed: 0,actual,model1,model2,model3
0,No Defect,No Defect,Defect,No Defect
1,No Defect,No Defect,Defect,Defect
2,No Defect,No Defect,Defect,No Defect
3,No Defect,Defect,Defect,Defect
4,No Defect,No Defect,Defect,No Defect
...,...,...,...,...
195,No Defect,No Defect,Defect,Defect
196,Defect,Defect,No Defect,No Defect
197,No Defect,No Defect,No Defect,No Defect
198,No Defect,No Defect,Defect,Defect


In [32]:
# Look at the value counts to make a baseline
ducks.actual.value_counts()

No Defect    184
Defect        16
Name: actual, dtype: int64

In [33]:
# Programmatically assign the most frequent class as the baseline prediction
ducks["baseline"] = ducks.actual.value_counts().index[0]
ducks.head(2)

Unnamed: 0,actual,model1,model2,model3,baseline
0,No Defect,No Defect,Defect,No Defect,No Defect
1,No Defect,No Defect,Defect,Defect,No Defect


Quality Control, our internal customer, wants the metric to identify as many defective ducks as possible

Our best metric for Quality Control here is recall

Use recall when missing actual positive cases is expensive.

Optimizing for recall avoids false negatives (misses)


In [34]:
pd.crosstab(ducks.model1, ducks.actual)

actual,Defect,No Defect
model1,Unnamed: 1_level_1,Unnamed: 2_level_1
Defect,8,2
No Defect,8,182


In [35]:
# Let's evaluate Model 1
positive = "Defect"

# Subset is every row where the actual value is the positive case
subset = ducks[ducks.actual == positive]

model_recall = (subset.actual == subset.model1).mean()
baseline_recall = (subset.baseline == subset.actual).mean()

print("Model 1")
print(f"Model recall: {model_recall:.2%}")
print(f"Baseline recall: {baseline_recall:.2%}")

Model 1
Model recall: 50.00%
Baseline recall: 0.00%


In [36]:
# Let's evaluate Model 2
positive = "Defect"

subset = ducks[ducks.actual == positive]
model_recall = (subset.actual == subset.model2).mean()
baseline_recall = (subset.baseline == subset.actual).mean()

print("Model 2")
print(f"Model recall: {model_recall:.2%}")
print(f"Baseline recall: {baseline_recall:.2%}")

Model 2
Model recall: 56.25%
Baseline recall: 0.00%


In [37]:
# Let's evaluate Model 3
positive = "Defect"

subset = ducks[ducks.actual == positive]
model_recall = (subset.actual == subset.model3).mean()
baseline_recall = (subset.baseline == subset.actual).mean()

print("Model 3")
print(f"Model recall: {model_recall:.2%}")
print(f"Baseline recall: {baseline_recall:.2%}")

Model 3
Model recall: 81.25%
Baseline recall: 0.00%


### Summary of Findings
 
 - We want to identify as many defects as possible, so we want to reduce Type II errors (false negatives).

 - We should use recall (with Model 3, recall 81.25%) because it measures how well the model does when the actual value is positive: of all the ducks that really have a defect, what share did the model catch?
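
The three recall cells above repeat the same steps, so they can be folded into one helper. This is a minimal sketch: the `recall` function and the `toy` DataFrame are hypothetical stand-ins (the real data lives in `c3.csv`), kept tiny so the arithmetic is easy to follow.

```python
import pandas as pd

# Hypothetical stand-in for c3.csv: 4 defective ducks and 6 good ones
toy = pd.DataFrame({
    "actual": ["Defect"] * 4 + ["No Defect"] * 6,
    "model1": ["Defect", "Defect", "No Defect", "No Defect"] + ["No Defect"] * 6,
    "model2": ["Defect", "Defect", "Defect", "No Defect"] + ["Defect"] * 3 + ["No Defect"] * 3,
})

def recall(df, model_col, positive="Defect"):
    """Share of actual positives that the model also labels positive."""
    actual_positives = df[df.actual == positive]
    return (actual_positives[model_col] == positive).mean()

# One recall value per model column
recalls = {col: recall(toy, col) for col in ["model1", "model2"]}
for col, value in recalls.items():
    print(f"{col} recall: {value:.2%}")
```

On the real data the same loop would run over `model1`, `model2`, `model3`, and `baseline`.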

# Predict Hawaii Trip defects

   - The PR team has decided to launch a program that gives customers with a defective duck a vacation to Hawaii.
   - They need you to predict which ducks will have defects, but tell you they really don't want to accidentally give out a vacation package when the duck really doesn't have a defect.
   - Which evaluation metric would be appropriate here? Which model would be the best fit for this use case?


In [38]:
# Goal is to minimize false positives
# We optimize for precision when we want to minimize false positives
positive = "Defect"

In [39]:
# Analyze model #1

# the boolean mask here is model1 == positive
subset = ducks[ducks.model1 == positive]

model_precision = (subset.actual == subset.model1).mean()

subset = ducks[ducks.baseline == positive]
baseline_precision = (subset.actual == subset.baseline).mean()

print("Model 1")
print(f"Model precision: {model_precision:.2%}")
print(f"Baseline precision: {baseline_precision:.2%}")

Model 1
Model precision: 80.00%
Baseline precision: nan%


In [40]:
# Model 2
subset = ducks[ducks.model2 == positive]
model_precision = (subset.actual == subset.model2).mean()
subset = ducks[ducks.baseline == positive]
baseline_precision = (subset.actual == subset.baseline).mean()

print("Model 2")
print(f"Model precision: {model_precision:.2%}")
print(f"Baseline precision: {baseline_precision:.2%}")

Model 2
Model precision: 10.00%
Baseline precision: nan%


In [41]:
# Analyze model #3
subset = ducks[ducks.model3 == positive]
model_precision = (subset.actual == subset.model3).mean()
subset = ducks[ducks.baseline == positive]
baseline_precision = (subset.actual == subset.baseline).mean()

print("Model 3")
print(f"Model precision: {model_precision:.2%}")
print(f"Baseline precision: {baseline_precision:.2%}")

Model 3
Model precision: 13.13%
Baseline precision: nan%


### Takeaway for the PR team:

   - Use Model 1: with 80.00% precision it produces far fewer false positive defect predictions than Model 2 (10.00%) or Model 3 (13.13%)
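
The `nan%` baseline precision in the outputs above is not a bug: the baseline never predicts "Defect", so precision is 0 / 0. The sketch below reproduces that on a hypothetical five-row `toy` DataFrame; the `precision` helper is illustrative and not part of the original notebook.

```python
import math
import pandas as pd

# Hypothetical data: the baseline always predicts the majority class
toy = pd.DataFrame({
    "actual":   ["Defect", "No Defect", "No Defect", "Defect", "No Defect"],
    "model1":   ["Defect", "Defect", "No Defect", "No Defect", "No Defect"],
    "baseline": ["No Defect"] * 5,
})

def precision(df, model_col, positive="Defect"):
    """Share of positive predictions that are actually positive (NaN if there are none)."""
    predicted_positives = df[df[model_col] == positive]
    return (predicted_positives.actual == positive).mean()

model_precision = precision(toy, "model1")       # 1 correct out of 2 "Defect" predictions
baseline_precision = precision(toy, "baseline")  # no "Defect" predictions, so the mean is NaN

print(f"Model precision: {model_precision:.2%}")
print("Baseline precision is NaN:", math.isnan(baseline_precision))
```

A NaN here simply means that a model which never flags a defect cannot hand out a wrong vacation, but it cannot hand out a right one either.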

## 3. You are working as a data scientist for Gives You Paws ™, a subscription-based service that shows you cute pictures of dogs or cats (or both for an additional fee).

#### At Gives You Paws, anyone can upload pictures of their cats or dogs. The photos are then put through a two-step process. First, an automated algorithm tags pictures as either a cat or a dog (Phase I). Next, the photos that have been initially identified are put through another round of review, possibly with some human oversight, before being presented to the users (Phase II).

#### Several models have already been developed with the data, and you can find their results here.

In [7]:
import pandas as pd
df = pd.read_csv('gives_you_paws.csv')
df

Unnamed: 0,actual,model1,model2,model3,model4
0,cat,cat,dog,cat,dog
1,dog,dog,cat,cat,dog
2,dog,cat,cat,cat,dog
3,dog,dog,dog,cat,dog
4,cat,cat,cat,dog,dog
...,...,...,...,...,...
4995,dog,dog,dog,dog,dog
4996,dog,dog,cat,cat,dog
4997,dog,cat,cat,dog,dog
4998,cat,cat,cat,cat,dog


### Given this dataset, use pandas to create a baseline model (i.e. a model that just predicts the most common class) and answer the following questions:

Tasks:

1. Create a baseline based on the most common class.
2. In terms of accuracy, how do the various models compare to the baseline model? Are any of the models better than the baseline?
3. Suppose you are working on a team that solely deals with dog pictures. Which of these models would you recommend for Phase I? For Phase II?
   - Dog = positive case
   - this problem setup ignores cats
   - Recall = TP / (TP + FN) = TRUE POSITIVE RATE = percentage of times we predict dogs correctly out of all actual dogs
       - Use recall when false negatives cost more than false positives
   - Precision = TP / (TP + FP) = TP / (predicted positives)
       - Use precision when false positives cost more than false negatives
   - If Phase III is a person, that review is expensive, so Phase II should minimize false positives; otherwise a person would need to catch them
4. Suppose you are working on a team that solely deals with cat pictures. Which of these models would you recommend for Phase I? For Phase II?


In [42]:
# Actual
df.actual.value_counts()

dog    3254
cat    1746
Name: actual, dtype: int64

In [43]:
df["baseline"] = df.actual.value_counts().index[0]
df.head()

Unnamed: 0,actual,model1,model2,model3,model4,baseline_prediction,baseline
0,cat,cat,dog,cat,dog,dog,dog
1,dog,dog,cat,cat,dog,dog,dog
2,dog,cat,cat,cat,dog,dog,dog
3,dog,dog,dog,cat,dog,dog,dog
4,cat,cat,cat,dog,dog,dog,dog


In [44]:
# Programmatically get all the model columns
# .loc[starting_row:ending_row, starting_column:ending_column]

models = df.loc[:, "model1":"baseline"].columns.tolist()
models

['model1', 'model2', 'model3', 'model4', 'baseline_prediction', 'baseline']

In [45]:
output = []
for model in models:
    
    output.append({
        "model": model,
        "accuracy": (df[model] == df.actual).mean(),
    })


metrics = pd.DataFrame(output)
metrics = metrics.sort_values(by="accuracy", ascending=False, ignore_index=True)
metrics

Unnamed: 0,model,accuracy
0,model1,0.8074
1,model4,0.7426
2,baseline_prediction,0.6508
3,baseline,0.6508
4,model2,0.6304
5,model3,0.5096
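
The label slice `"model1":"baseline"` also picks up any column that happens to sit between those two names, which is why the leftover `baseline_prediction` column shows up in the table above. One possible alternative, sketched here on a hypothetical four-row `toy` DataFrame, is to select the model columns by name prefix and compare them all to `actual` in one step.

```python
import pandas as pd

# Hypothetical data with a stale column left over from an earlier cell
toy = pd.DataFrame({
    "actual": ["dog", "dog", "cat", "dog"],
    "model1": ["dog", "cat", "cat", "dog"],
    "model2": ["cat", "cat", "cat", "dog"],
    "baseline_prediction": ["dog"] * 4,
})

# Baseline: always predict the most common actual class
toy["baseline"] = toy.actual.mode()[0]

# Keep only columns named model*, then add the baseline explicitly
model_cols = [c for c in toy.columns if c.startswith("model")] + ["baseline"]

# Compare every selected column to `actual` row by row, then average per column
accuracy = toy[model_cols].eq(toy.actual, axis=0).mean().sort_values(ascending=False)
print(accuracy)
```

The prefix match assumes every model column starts with `model`, which holds for the datasets used in this lesson.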


In [46]:
from sklearn.metrics import classification_report
target_names = sorted(df.actual.unique())  # classification_report lists classes in sorted order
x = classification_report(df.actual, df.model1, target_names=target_names, output_dict=True)
pd.DataFrame(x)

Unnamed: 0,cat,dog,accuracy,macro avg,weighted avg
precision,0.689772,0.890024,0.8074,0.789898,0.820096
recall,0.815006,0.803319,0.8074,0.809162,0.8074
f1-score,0.747178,0.844452,0.8074,0.795815,0.810484
support,1746.0,3254.0,0.8074,5000.0,5000.0


In [47]:
print("Model 1")
pd.DataFrame(classification_report(df.actual, df.model1, target_names=target_names, output_dict=True))

Model 1


Unnamed: 0,cat,dog,accuracy,macro avg,weighted avg
precision,0.689772,0.890024,0.8074,0.789898,0.820096
recall,0.815006,0.803319,0.8074,0.809162,0.8074
f1-score,0.747178,0.844452,0.8074,0.795815,0.810484
support,1746.0,3254.0,0.8074,5000.0,5000.0


In [50]:
print("Model 2")
pd.DataFrame(classification_report(df.actual, df.model2, target_names=target_names, output_dict =True))

Model 2


Unnamed: 0,cat,dog,accuracy,macro avg,weighted avg
precision,0.484122,0.893177,0.6304,0.688649,0.750335
recall,0.890607,0.490781,0.6304,0.690694,0.6304
f1-score,0.627269,0.633479,0.6304,0.630374,0.63131
support,1746.0,3254.0,0.6304,5000.0,5000.0


In [51]:
print("Model 3")
pd.DataFrame(classification_report(df.actual, df.model3, target_names=target_names, output_dict =True))

Model 3


Unnamed: 0,cat,dog,accuracy,macro avg,weighted avg
precision,0.358347,0.659888,0.5096,0.509118,0.55459
recall,0.511455,0.508605,0.5096,0.51003,0.5096
f1-score,0.421425,0.574453,0.5096,0.497939,0.521016
support,1746.0,3254.0,0.5096,5000.0,5000.0


In [52]:
print("Model 4")
pd.DataFrame(classification_report(df.actual, df.model4, target_names=target_names, output_dict =True))

Model 4


Unnamed: 0,cat,dog,accuracy,macro avg,weighted avg
precision,0.807229,0.731249,0.7426,0.769239,0.757781
recall,0.345361,0.955747,0.7426,0.650554,0.7426
f1-score,0.483755,0.82856,0.7426,0.656157,0.708154
support,1746.0,3254.0,0.7426,5000.0,5000.0



### Dog Team -> Dog is positive case

   - Phase I's best metric is recall, so we're optimizing for true positives / (all actual positives)
   - Phase II's metric is precision: we're trying to minimize false positives (the model saying dog when we have a cat)
   - Phase III is somebody on salary
   - Dog team's Phase I model is Model 4: recall is 0.956
   - Dog team's Phase II model is Model 1: precision is 0.890 (Model 2's is marginally higher at 0.893, but its dog recall is only 0.49)


### Cat Team -> Cat is positive case

   - Phase I should be recall to optimize for # of true positive cases out of all actual positive cases
   - Phase II should be precision to minimize False Positives
   - Model 2, use recall for phase I -> 0.89
   - Model 4, use precision for phase II -> 0.81
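
To make the model choices above easier to audit, the per-class scores from the four classification reports can be collected into one table and ranked. This is only a sketch: the `scores` DataFrame is typed in by hand from the report outputs above, and `idxmax` is a blunt instrument when two models are nearly tied, as Models 1 and 2 are on dog precision.

```python
import pandas as pd

# Per-class scores copied from the classification reports above
scores = pd.DataFrame(
    {
        "dog_recall":    [0.803319, 0.490781, 0.508605, 0.955747],
        "dog_precision": [0.890024, 0.893177, 0.659888, 0.731249],
        "cat_recall":    [0.815006, 0.890607, 0.511455, 0.345361],
        "cat_precision": [0.689772, 0.484122, 0.358347, 0.807229],
    },
    index=["model1", "model2", "model3", "model4"],
)

# Highest-scoring model for each metric
best = scores.idxmax()
print(best)

# How close is the dog-precision race between Model 2 and Model 1?
dog_precision_gap = scores.loc["model2", "dog_precision"] - scores.loc["model1", "dog_precision"]
print(f"Model 2 minus Model 1 dog precision: {dog_precision_gap:.6f}")
```

When the gap is this small, a secondary metric such as recall for the same class is a reasonable tie-breaker.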

In [9]:
# My original code is below
# Baseline Model

df.actual.value_counts()

df['baseline_prediction'] = 'dog'

# NOTE: this compares `actual` with itself, so it is 100% by construction and is not a model score
model_accuracy = (df.actual == df.actual).mean()
baseline_accuracy = (df.baseline_prediction == df.actual).mean()

print(f'   model accuracy: {model_accuracy:.2%}')
print(f'baseline accuracy: {baseline_accuracy:.2%}')

   model accuracy: 100.00%
baseline accuracy: 65.08%


### MODEL ONE

In [10]:
#Model 1
df.model1.value_counts()

df['baseline_prediction'] = 'dog'

model_accuracy = (df.model1 == df.actual).mean()
baseline_accuracy = (df.baseline_prediction == df.actual).mean()

print(f'   model accuracy: {model_accuracy:.2%}')
print(f'baseline accuracy: {baseline_accuracy:.2%}')


   model accuracy: 80.74%
baseline accuracy: 65.08%


In [11]:
#Model 1 Accuracy Score

from sklearn.metrics import accuracy_score
y_pred = df.model1
y_true = df.actual
accuracy_score(y_true, y_pred)


0.8074

In [12]:
#Model 1 Precision Score

from sklearn.metrics import precision_score
y_pred = df.model1
y_true = df.actual
# macro, micro, and weighted averages over both classes
precision_score(y_true, y_pred, average='macro'), precision_score(y_true, y_pred, average='micro'), precision_score(y_true, y_pred, average='weighted')


(0.7898980051430666, 0.8074, 0.8200959550792857)
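
The three numbers above are different ways of averaging the per-class precisions. As a rough illustration, the macro and weighted values can be recomputed by hand from Model 1's per-class precision and support (taken from its classification report); the dictionary names are illustrative. For a single-label problem like this one, the micro average equals the accuracy, which is why the middle value is 0.8074.

```python
# Per-class precision and support for Model 1, from its classification report
precision = {"cat": 0.689772, "dog": 0.890024}
support = {"cat": 1746, "dog": 3254}

# Macro: plain mean of the per-class precisions, ignoring class size
macro = sum(precision.values()) / len(precision)

# Weighted: mean of the per-class precisions, weighted by the number of actual items in each class
weighted = sum(precision[c] * support[c] for c in precision) / sum(support.values())

print(f"macro:    {macro:.6f}")
print(f"weighted: {weighted:.6f}")
```

Because dogs are the larger class and Model 1 is more precise on dogs, the weighted average lands above the macro average.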

In [13]:
# Model 1 Recall Score

from sklearn.metrics import recall_score
y_pred = df.model1
y_true = df.actual
recall_score(y_true, y_pred, average='macro'), recall_score(y_true, y_pred, average='micro'), recall_score(y_true, y_pred, average='weighted')

(0.8091623596933477, 0.8074, 0.8074)

In [14]:
# Model 1 Classification Report

from sklearn.metrics import classification_report
y_pred = df.model1
y_true = df.actual
target_names = ['cat', 'dog']  # class labels, in the sorted order sklearn reports them
print(classification_report(y_true, y_pred, target_names=target_names))

              precision    recall  f1-score   support

         cat       0.69      0.82      0.75      1746
         dog       0.89      0.80      0.84      3254

    accuracy                           0.81      5000
   macro avg       0.79      0.81      0.80      5000
weighted avg       0.82      0.81      0.81      5000



### MODEL TWO

In [15]:
#Model 2
df.model2.value_counts()

df['baseline_prediction'] = 'dog'

model_accuracy = (df.model2 == df.actual).mean()
baseline_accuracy = (df.baseline_prediction == df.actual).mean()

print(f'   model accuracy: {model_accuracy:.2%}')
print(f'baseline accuracy: {baseline_accuracy:.2%}')

   model accuracy: 63.04%
baseline accuracy: 65.08%


In [16]:
#Model 2 Accuracy Score

from sklearn.metrics import accuracy_score
y_pred = df.model2
y_true = df.actual
accuracy_score(y_true, y_pred)

0.6304

In [17]:
#Model 2 Precision Score

from sklearn.metrics import precision_score
y_pred = df.model2
y_true = df.actual
# macro, micro, and weighted averages over both classes
precision_score(y_true, y_pred, average='macro'), precision_score(y_true, y_pred, average='micro'), precision_score(y_true, y_pred, average='weighted')


(0.6886493880609905, 0.6304, 0.7503348355300732)

In [18]:
# Model 2 Recall Score

from sklearn.metrics import recall_score
y_pred = df.model2
y_true = df.actual
recall_score(y_true, y_pred, average='macro'), recall_score(y_true, y_pred, average='micro'), recall_score(y_true, y_pred, average='weighted')

(0.6906938398488845, 0.6304, 0.6304)

In [19]:
# Model 2 Classification Report

from sklearn.metrics import classification_report
y_pred = df.model2
y_true = df.actual
target_names = ['cat', 'dog']  # class labels, in the sorted order sklearn reports them
print(classification_report(y_true, y_pred, target_names=target_names))

              precision    recall  f1-score   support

         cat       0.48      0.89      0.63      1746
         dog       0.89      0.49      0.63      3254

    accuracy                           0.63      5000
   macro avg       0.69      0.69      0.63      5000
weighted avg       0.75      0.63      0.63      5000



### MODEL THREE

In [20]:
#model 3
df.model3.value_counts()

df['baseline_prediction'] = 'dog'

model_accuracy = (df.model3 == df.actual).mean()
baseline_accuracy = (df.baseline_prediction == df.actual).mean()

print(f'   model accuracy: {model_accuracy:.2%}')
print(f'baseline accuracy: {baseline_accuracy:.2%}')

   model accuracy: 50.96%
baseline accuracy: 65.08%


In [21]:
#Model 3 Accuracy Score

from sklearn.metrics import accuracy_score
y_pred = df.model3
y_true = df.actual
accuracy_score(y_true, y_pred)

0.5096

In [22]:
#Model 3 Precision Score

from sklearn.metrics import precision_score
y_pred = df.model3
y_true = df.actual
# macro, micro, and weighted averages over both classes
precision_score(y_true, y_pred, average='macro'), precision_score(y_true, y_pred, average='micro'), precision_score(y_true, y_pred, average='weighted')


(0.5091175333635416, 0.5096, 0.5545900138497418)

In [23]:
# Model 3 Recall Score

from sklearn.metrics import recall_score
y_pred = df.model3
y_true = df.actual
recall_score(y_true, y_pred, average='macro'), recall_score(y_true, y_pred, average='micro'), recall_score(y_true, y_pred, average='weighted')

(0.5100297739111823, 0.5096, 0.5095999999999999)

In [24]:
# Model 3 Classification Report

from sklearn.metrics import classification_report
y_pred = df.model3
y_true = df.actual
target_names = ['cat', 'dog']  # class labels, in the sorted order sklearn reports them
print(classification_report(y_true, y_pred, target_names=target_names))

              precision    recall  f1-score   support

         cat       0.36      0.51      0.42      1746
         dog       0.66      0.51      0.57      3254

    accuracy                           0.51      5000
   macro avg       0.51      0.51      0.50      5000
weighted avg       0.55      0.51      0.52      5000



### MODEL FOUR

In [25]:
#model 4
df.model4.value_counts()

df['baseline_prediction'] = 'dog'

model_accuracy = (df.model4 == df.actual).mean()
baseline_accuracy = (df.baseline_prediction == df.actual).mean()

print(f'   model accuracy: {model_accuracy:.2%}')
print(f'baseline accuracy: {baseline_accuracy:.2%}')

   model accuracy: 74.26%
baseline accuracy: 65.08%


In [26]:
#Model 4 Accuracy Score

from sklearn.metrics import accuracy_score
y_pred = df.model4
y_true = df.actual
accuracy_score(y_true, y_pred)

0.7426

In [27]:
#Model 4 Precision Score

from sklearn.metrics import precision_score
y_pred = df.model4
y_true = df.actual
# macro, micro, and weighted averages over both classes
precision_score(y_true, y_pred, average='macro'), precision_score(y_true, y_pred, average='micro'), precision_score(y_true, y_pred, average='weighted')

(0.769239, 0.7426, 0.757781)

In [28]:
# Model 4 Recall Score

from sklearn.metrics import recall_score
y_pred = df.model4
y_true = df.actual
recall_score(y_true, y_pred, average='macro'), recall_score(y_true, y_pred, average='micro'), recall_score(y_true, y_pred, average='weighted')

(0.6505537989722403, 0.7426, 0.7426)

In [29]:
# Model 4 Classification Report

from sklearn.metrics import classification_report
y_pred = df.model4
y_true = df.actual
target_names = ['cat', 'dog']  # class labels, in the sorted order sklearn reports them
print(classification_report(y_true, y_pred, target_names=target_names))

              precision    recall  f1-score   support

         cat       0.81      0.35      0.48      1746
         dog       0.73      0.96      0.83      3254

    accuracy                           0.74      5000
   macro avg       0.77      0.65      0.66      5000
weighted avg       0.76      0.74      0.71      5000



#### - Models #1 and #4 have better accuracy than the baseline model

## - B. Suppose you are working on a team that solely deals with dog pictures. Which of these models would you recommend for Phase I? For Phase II?

#### - Phase One needs high accuracy, because it makes sure that both classes are represented.  I would choose Model 1 for Phase One because it has the highest accuracy (80.74%).

#### - Phase Two needs high precision, because when we identify 'dog' we want our positive predictions to be correct.  I would choose Model 1 for Phase Two: its dog precision (0.89) is essentially tied with Model 2's for the highest, and its dog recall is far better.

## - C. Suppose you are working on a team that solely deals with cat pictures. Which of these models would you recommend for Phase I? For Phase II?

#### - Phase One needs high accuracy, because it makes sure that both classes are represented.  I would choose Model 1 for Phase One because it has the highest accuracy (80.74%).

#### - Phase Two needs high precision, because when we identify 'cat' we want our positive predictions to be correct.  I would choose Model 4 for Phase Two because it has the highest cat precision (0.81).

## 4. Follow the links below to read the documentation about each function, then apply those functions to the data from the previous problem.

   - sklearn.metrics.accuracy_score
   - sklearn.metrics.precision_score
   - sklearn.metrics.recall_score
   - sklearn.metrics.classification_report
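
When only one class matters (for example the dog team's recall), it can be convenient to pull a single number out of `classification_report` instead of reading the printed table. A minimal sketch, using six hypothetical labels rather than the real `gives_you_paws.csv` data:

```python
from sklearn.metrics import classification_report

# Hypothetical labels, small enough to check by hand
y_true = ["dog", "dog", "dog", "cat", "cat", "dog"]
y_pred = ["dog", "cat", "dog", "cat", "dog", "dog"]

# output_dict=True returns nested dicts keyed by class label, then by metric name
report = classification_report(y_true, y_pred, output_dict=True)

dog_recall = report["dog"]["recall"]        # 3 of the 4 actual dogs were predicted dog
cat_precision = report["cat"]["precision"]  # 1 of the 2 cat predictions was an actual cat
overall_accuracy = report["accuracy"]       # 4 of the 6 predictions were correct

print(dog_recall, cat_precision, overall_accuracy)
```

With the real data, `y_true` would be `df.actual` and `y_pred` one of the model columns.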
