In [1]:
import pandas as pd

## Exercise 1

|               | actual cat | actual dog |
|:------------  |-----------:|-----------:|
| predicted cat |         34 |          7 |
| predicted dog |         13 |         46 |

## Workflow:
- Get the actual results
- Obtain model predictions
- Decide what our positive case is
    - Usually, the positive case is the presence of the thing we are measuring
    - Which class we call positive is otherwise an arbitrary choice
- Produce a confusion matrix
- Produce our metrics

- In the context of this problem, what's a false positive?
    - I'll define a cat prediction as positive (b/c I want to, and b/c the upper-left of a confusion matrix is usually True Positives)
    - A false positive, then, is predicting a cat when we actually have a dog (7)
    - A false negative is predicting a dog when we actually have a cat (13)
- A false negative is a miss.
- A false positive is a false alarm.
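
The four cells can also be read straight out of the confusion matrix once a positive class is chosen. This is a minimal sketch that rebuilds the Exercise 1 table by hand as a DataFrame; the variable names (`confusion`, `positive`, `negative`) are illustrative choices, not part of the exercise.

```python
import pandas as pd

# Confusion matrix from the Exercise 1 table: rows are predictions, columns are actuals.
confusion = pd.DataFrame(
    {"actual cat": [34, 13], "actual dog": [7, 46]},
    index=["predicted cat", "predicted dog"],
)

positive, negative = "cat", "dog"  # the (arbitrary) choice of positive class

tp = confusion.loc[f"predicted {positive}", f"actual {positive}"]  # said cat, was cat
fp = confusion.loc[f"predicted {positive}", f"actual {negative}"]  # said cat, was dog
fn = confusion.loc[f"predicted {negative}", f"actual {positive}"]  # said dog, was cat
tn = confusion.loc[f"predicted {negative}", f"actual {negative}"]  # said dog, was dog

print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")  # TP=34, FP=7, FN=13, TN=46
```

Swapping `positive` and `negative` relabels the same four numbers, which is why the choice has to be stated before quoting any metric.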

In [2]:
# true/false says whether the prediction matched reality
# positive/negative says what the model predicted
true_positives = 34
true_negatives = 46
false_positives = 7
false_negatives = 13

# number of correct predictions (TP + TN) / total observations
accuracy = (true_positives + true_negatives) / (true_positives + true_negatives + false_positives + false_negatives)

# true positives / (all actual positives)
recall = true_positives / (true_positives + false_negatives)

# true positives / (all positive predictions)
precision = true_positives / (true_positives + false_positives)

# True Negative Rate
# true negatives / (all actual negatives) == TN / (TN + FP)
specificity = true_negatives / (true_negatives + false_positives)

print("Cat-classifier (where 'cat' is the positive prediction)")
print("Accuracy:", accuracy)
print("Recall:", recall)
print("Precision:", precision)
print("Specificity:", specificity)

Cat-classifier (where 'cat' is the positive prediction)
Accuracy: 0.8
Recall: 0.723404255319149
Precision: 0.8292682926829268
Specificity: 0.8679245283018868
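
As a rough sketch, the four formulas can be wrapped in one helper so the effect of the positive-class choice is easy to see. The function name `binary_metrics` is hypothetical, and the sketch assumes every denominator is nonzero.

```python
def binary_metrics(tp, tn, fp, fn):
    """Return accuracy, recall, precision, and specificity from the four counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "specificity": tn / (tn + fp),
    }

# 'cat' as the positive class (the counts used in Exercise 1)
cat_metrics = binary_metrics(tp=34, tn=46, fp=7, fn=13)

# 'dog' as the positive class: TP and TN swap, and FP and FN swap
dog_metrics = binary_metrics(tp=46, tn=34, fp=13, fn=7)

for name, value in cat_metrics.items():
    print(f"{name}: cat-positive={value:.3f}, dog-positive={dog_metrics[name]:.3f}")
```

Accuracy is the same either way, while recall with dog as positive equals specificity with cat as positive.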


## Exercise 2

You are working as a data scientist for Codeup Cody Creator (C3 for short), a rubber-duck manufacturing plant.

Unfortunately, some of the rubber ducks that are produced will have defects. Your team has built several models that try to predict those defects, and the data from their predictions can be found at https://ds.codeup.com/data/c3.csv.

Use the predictions dataset and pandas to help answer the following questions:

- An internal team wants to investigate the cause of the manufacturing defects. They tell you that they want to identify as many of the ducks that have a defect as possible. Which evaluation metric would be appropriate here? Which model would be the best fit for this use case?
- Recently, several stories in the local news have come out highlighting customers who received a rubber duck with a defect, and portraying C3 in a bad light. The PR team has decided to launch a program that gives customers with a defective duck a vacation to Hawaii. They need you to predict which ducks will have defects, but tell you they really don't want to accidentally give out a vacation package when the duck really doesn't have a defect. Which evaluation metric would be appropriate here? Which model would be the best fit for this use case?

In [3]:
# acquire the data
ducks = pd.read_csv("https://ds.codeup.com/data/c3.csv")
ducks.head()

Unnamed: 0,actual,model1,model2,model3
0,No Defect,No Defect,Defect,No Defect
1,No Defect,No Defect,Defect,Defect
2,No Defect,No Defect,Defect,No Defect
3,No Defect,Defect,Defect,Defect
4,No Defect,No Defect,Defect,No Defect


### Outcomes
- Possible outcomes for each of the 3 models:
    - defect present, predicted defect (true positive)
    - defect present, predicted OK (false negative)
    - no defects, predicted OK (true negative)
    - no defects, predicted defect (false positive)

The positive case is the presence of a defect
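
To make the four outcomes concrete, here is a small sketch that labels each row of a made-up frame as TP, FN, TN, or FP. The `toy` data and the `predicted` and `outcome` column names are assumptions for illustration only; they are not the C3 dataset.

```python
import pandas as pd

# Toy data for illustration only (not the C3 dataset).
toy = pd.DataFrame({
    "actual":    ["Defect", "Defect", "No Defect", "No Defect"],
    "predicted": ["Defect", "No Defect", "No Defect", "Defect"],
})

positive = "Defect"
actual_pos = toy["actual"] == positive
pred_pos = toy["predicted"] == positive

# Label each row with one of the four outcomes.
toy["outcome"] = "TN"                               # no defect, predicted OK
toy.loc[actual_pos & pred_pos, "outcome"] = "TP"    # defect present, predicted defect
toy.loc[actual_pos & ~pred_pos, "outcome"] = "FN"   # defect present, predicted OK
toy.loc[~actual_pos & pred_pos, "outcome"] = "FP"   # no defect, predicted defect

print(toy["outcome"].tolist())  # ['TP', 'FN', 'TN', 'FP']
```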

In [4]:
# Look at the value counts to make a baseline
ducks.actual.value_counts()

No Defect    184
Defect        16
Name: actual, dtype: int64

In [5]:
# Programmatically assign the most frequent class as the baseline prediction
ducks["baseline"] = ducks.actual.value_counts().index[0]
ducks.head(2)

Unnamed: 0,actual,model1,model2,model3,baseline
0,No Defect,No Defect,Defect,No Defect,No Defect
1,No Defect,No Defect,Defect,Defect,No Defect


Quality Control, our internal customer, wants to identify as many defective ducks as possible.

Our best metric for Quality Control here is recall.

Use recall when missing actual positive cases is expensive.

Optimizing for recall minimizes false negatives (misses).

In [6]:
pd.crosstab(ducks.model1, ducks.actual)

model1 \ actual,Defect,No Defect
Defect,8,2
No Defect,8,182


In [7]:
tp = 8
tn = 182
fp = 2
fn = 8

recall = tp / (tp + fn)
recall

0.5

In [8]:
positive = "Defect"

# Subset.actual == "Defect" (we've defined Defect as our positive)
subset = ducks[ducks.actual == positive]

# Within this subset, each True from actual == model1 is a True Positive
# True Positives + False Negatives = total number of actual positives

# Recall AKA True Positive Rate AKA Sensitivity
# TP / (TP + FN)
# TP / (Actual positives)

print("# of TP", (subset.actual == subset.model1).sum()) # Summing all the Trues gives us the # of True Positives
# The number of True Positives is the numerator of recall

# Denominator of recall = TP + FN = number of actual positives
print("# of TP + FN", subset.shape[0], " = total # of actual positives")

# of TP 8
# of TP + FN 16  = total # of actual positives


In [9]:
# Let's evaluate Model 1
positive = "Defect"

# Subset is every row where the actual value is our positive case
subset = ducks[ducks.actual == positive]

model_recall = (subset.actual == subset.model1).mean()
baseline_recall = (subset.baseline == subset.actual).mean()

print("Model 1")
print(f"Model recall: {model_recall:.2%}")
print(f"Baseline recall: {baseline_recall:.2%}")

Model 1
Model recall: 50.00%
Baseline recall: 0.00%


In [10]:
# Let's evaluate Model 2
positive = "Defect"

subset = ducks[ducks.actual == positive]
model_recall = (subset.actual == subset.model2).mean()
baseline_recall = (subset.baseline == subset.actual).mean()

print("Model 2")
print(f"Model recall: {model_recall:.2%}")
print(f"Baseline recall: {baseline_recall:.2%}")

Model 2
Model recall: 56.25%
Baseline recall: 0.00%


In [11]:
# Let's evaluate Model 3
positive = "Defect"

subset = ducks[ducks.actual == positive]
model_recall = (subset.actual == subset.model3).mean()
baseline_recall = (subset.baseline == subset.actual).mean()

print("Model 3")
print(f"Model recall: {model_recall:.2%}")
print(f"Baseline recall: {baseline_recall:.2%}")

Model 3
Model recall: 81.25%
Baseline recall: 0.00%


Takeaways so far:
- Quality Control should select the model with the best recall (to avoid false negatives)
- Quality Control should use Model 3
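
The three near-identical cells above could be collapsed into one helper and a loop. This is a sketch on made-up data; the function name `recall` and the columns `model_a` and `model_b` are hypothetical stand-ins, not the C3 models.

```python
import pandas as pd

def recall(actual, predicted, positive):
    """Share of actual positives that the predictions caught: TP / (TP + FN)."""
    actual_positive = actual == positive
    return float((predicted[actual_positive] == positive).mean())

# Toy data for illustration only (not the C3 dataset).
toy = pd.DataFrame({
    "actual":  ["Defect", "Defect", "Defect", "Defect", "No Defect", "No Defect"],
    "model_a": ["Defect", "Defect", "No Defect", "No Defect", "No Defect", "No Defect"],
    "model_b": ["Defect", "Defect", "Defect", "No Defect", "Defect", "No Defect"],
})

recalls = {col: recall(toy["actual"], toy[col], "Defect") for col in ["model_a", "model_b"]}
best = max(recalls, key=recalls.get)

print(recalls)  # {'model_a': 0.5, 'model_b': 0.75}
print(best)     # model_b
```

Note that `model_b` wins on recall even though it raises a false alarm on a good duck; recall does not penalize false positives.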

## Exercise 2 (part 2)
- The PR team has decided to launch a program that gives customers with a defective duck a vacation to Hawaii.
- They need you to predict which ducks will have defects, but tell you they really don't want to accidentally give out a vacation package when the duck really doesn't have a defect.
- Which evaluation metric would be appropriate here? Which model would be the best fit for this use case?

In [12]:
# Goal is to minimize false positives
# We optimize for precision when we want to minimize false positives
positive = "Defect"

In [13]:
# Analyze model #1

# the boolean mask here is model1 == positive
subset = ducks[ducks.model1 == positive]

model_precision = (subset.actual == subset.model1).mean()

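# The baseline never predicts "Defect", so this subset is empty
# and the mean of an empty Series is NaN (precision is undefined: 0 / 0)
subset = ducks[ducks.baseline == positive]
baseline_precision = (subset.actual == subset.baseline).mean()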

print("Model 1")
print(f"Model precision: {model_precision:.2%}")
print(f"Baseline precision: {baseline_precision:.2%}")

Model 1
Model precision: 80.00%
Baseline precision: nan%


In [14]:
# Model 2
subset = ducks[ducks.model2 == positive]
model_precision = (subset.actual == subset.model2).mean()
subset = ducks[ducks.baseline == positive]
baseline_precision = (subset.actual == subset.baseline).mean()

print("Model 2")
print(f"Model precision: {model_precision:.2%}")
print(f"Baseline precision: {baseline_precision:.2%}")

Model 2
Model precision: 10.00%
Baseline precision: nan%


In [15]:
# Analyze model #3
subset = ducks[ducks.model3 == positive]
model_precision = (subset.actual == subset.model3).mean()
subset = ducks[ducks.baseline == positive]
baseline_precision = (subset.actual == subset.baseline).mean()

print("Model 3")
print(f"Model precision: {model_precision:.2%}")
print(f"Baseline precision: {baseline_precision:.2%}")

Model 3
Model precision: 13.13%
Baseline precision: nan%


Takeaway for the PR team:
- Use Model 1, since it has the highest precision (80%) and so makes the fewest false positive predictions of defects relative to its positive predictions
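
The baseline precision prints as `nan` because the baseline never predicts the positive class, so precision is 0 / 0. The sketch below makes that case explicit on made-up data; the function name `precision` and the `toy` frame are assumptions for illustration.

```python
import math
import pandas as pd

def precision(actual, predicted, positive):
    """TP / (TP + FP); returns NaN when nothing was predicted positive."""
    predicted_positive = predicted == positive
    if predicted_positive.sum() == 0:
        return math.nan  # undefined: no positive predictions to evaluate
    return float((actual[predicted_positive] == positive).mean())

# Toy data for illustration only (not the C3 dataset).
toy = pd.DataFrame({
    "actual":   ["Defect", "No Defect", "No Defect", "No Defect"],
    "model_a":  ["Defect", "Defect", "No Defect", "No Defect"],
    "baseline": ["No Defect"] * 4,
})

model_precision = precision(toy["actual"], toy["model_a"], "Defect")
baseline_precision = precision(toy["actual"], toy["baseline"], "Defect")

print(model_precision)     # 0.5
print(baseline_precision)  # nan
```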

## Exercise 3
At Gives You Paws, anyone can upload pictures of their cats or dogs. The photos are then put through a two-step process.

- First an automated algorithm tags pictures as either a cat or a dog (Phase I).
- Next, the photos that have been initially identified are put through another round of review, possibly with some human oversight, before being presented to the users (Phase II).
- Data from several models is here https://ds.codeup.com/data/gives_you_paws.csv

Tasks:

1. Create a baseline based on the most common class.
2. In terms of accuracy, how do the various models compare to the baseline model? Are any of the models better than the baseline?
3. Suppose you are working on a team that solely deals with dog pictures. Which of these models would you recommend for Phase I? For Phase II?
    - Dog = positive case
    - This problem setup ignores cats
    - Recall = TP / (TP + FN) = true positive rate = share of all actual dogs that we correctly predict as dogs
        - Use recall when false negatives cost more than false positives
    - Precision = TP / (TP + FP) = TP / (predicted positives)
        - Use precision when false positives cost more than false negatives
        - If Phase III is a person, that's expensive, so we'd want to minimize FP since a person would need to catch those
    - Specificity = TN / (TN + FP)
4. Suppose you are working on a team that solely deals with cat pictures. Which of these models would you recommend for Phase I? For Phase II?

In [16]:
df = pd.read_csv("https://ds.codeup.com/data/gives_you_paws.csv")
df.head()


Unnamed: 0,actual,model1,model2,model3,model4
0,cat,cat,dog,cat,dog
1,dog,dog,cat,cat,dog
2,dog,cat,cat,cat,dog
3,dog,dog,dog,cat,dog
4,cat,cat,cat,dog,dog


In [17]:
df.actual.value_counts()


dog    3254
cat    1746
Name: actual, dtype: int64

In [18]:
df["baseline"] = df.actual.value_counts().index[0]
df.head()

Unnamed: 0,actual,model1,model2,model3,model4,baseline
0,cat,cat,dog,cat,dog,dog
1,dog,dog,cat,cat,dog,dog
2,dog,cat,cat,cat,dog,dog
3,dog,dog,dog,cat,dog,dog
4,cat,cat,cat,dog,dog,dog


In [19]:
# Programmatically get all the model columns
# .loc[starting_row:ending_row, starting_column:ending_column]

models = df.loc[:, "model1":"baseline"].columns.tolist()
models

['model1', 'model2', 'model3', 'model4', 'baseline']

In [20]:
output = []
for model in models:
    
    output.append({
        "model": model,
        "accuracy": (df[model] == df.actual).mean(),
    })


metrics = pd.DataFrame(output)
metrics = metrics.sort_values(by="accuracy", ascending=False, ignore_index=True)
metrics

Unnamed: 0,model,accuracy
0,model1,0.8074
1,model4,0.7426
2,baseline,0.6508
3,model2,0.6304
4,model3,0.5096
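
Task 2 asks whether any model beats the baseline. One way to read that off is to filter on the baseline's own accuracy. The sketch below re-enters the accuracy values from the table above by hand; the name `accuracy_table` is an illustrative choice.

```python
import pandas as pd

# Accuracy values copied from the output table above.
accuracy_table = pd.DataFrame({
    "model": ["model1", "model4", "baseline", "model2", "model3"],
    "accuracy": [0.8074, 0.7426, 0.6508, 0.6304, 0.5096],
})

is_baseline = accuracy_table["model"] == "baseline"
baseline_accuracy = accuracy_table.loc[is_baseline, "accuracy"].iloc[0]

# Keep only the models that are strictly more accurate than the baseline.
better = accuracy_table.loc[accuracy_table["accuracy"] > baseline_accuracy, "model"].tolist()
print(better)  # ['model1', 'model4']
```

On accuracy alone, models 2 and 3 do worse than always guessing "dog".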


In [21]:
from sklearn.metrics import classification_report

In [22]:
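# classification_report lists labels in sorted order, so target_names must be sorted too;
# unique() returns order of first appearance, which only matches here by coincidence
target_names = sorted(df.actual.unique())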

In [23]:
x = classification_report(df.actual, df.model1, target_names=target_names, output_dict=True)
pd.DataFrame(x)

Unnamed: 0,cat,dog,accuracy,macro avg,weighted avg
precision,0.689772,0.890024,0.8074,0.789898,0.820096
recall,0.815006,0.803319,0.8074,0.809162,0.8074
f1-score,0.747178,0.844452,0.8074,0.795815,0.810484
support,1746.0,3254.0,0.8074,5000.0,5000.0
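
With `output_dict=True`, individual values can be pulled out by class name rather than read off the table. This sketch uses a tiny made-up sample and passes the sorted labels explicitly via the `labels` parameter; the variable names are illustrative.

```python
from sklearn.metrics import classification_report

# Toy data for illustration only (not the Gives You Paws dataset).
actual = ["dog", "dog", "dog", "cat"]
predicted = ["dog", "dog", "cat", "cat"]

labels = sorted(set(actual))  # ['cat', 'dog'], the same order sklearn uses by default
report = classification_report(actual, predicted, labels=labels, output_dict=True)

dog_recall = report["dog"]["recall"]        # 2 of the 3 actual dogs were caught
cat_precision = report["cat"]["precision"]  # 1 of the 2 "cat" predictions was right

print(f"dog recall: {dog_recall:.2f}, cat precision: {cat_precision:.2f}")
```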


In [24]:
print("Model 1")
pd.DataFrame(classification_report(df.actual, df.model1, target_names=target_names, output_dict=True))

Model 1


Unnamed: 0,cat,dog,accuracy,macro avg,weighted avg
precision,0.689772,0.890024,0.8074,0.789898,0.820096
recall,0.815006,0.803319,0.8074,0.809162,0.8074
f1-score,0.747178,0.844452,0.8074,0.795815,0.810484
support,1746.0,3254.0,0.8074,5000.0,5000.0


In [25]:
print("Model 2")
print(classification_report(df.actual, df.model2, target_names=target_names))

Model 2
              precision    recall  f1-score   support

         cat       0.48      0.89      0.63      1746
         dog       0.89      0.49      0.63      3254

    accuracy                           0.63      5000
   macro avg       0.69      0.69      0.63      5000
weighted avg       0.75      0.63      0.63      5000



In [26]:
print("Model 3")
print(classification_report(df.actual, df.model3, target_names=target_names))

Model 3
              precision    recall  f1-score   support

         cat       0.36      0.51      0.42      1746
         dog       0.66      0.51      0.57      3254

    accuracy                           0.51      5000
   macro avg       0.51      0.51      0.50      5000
weighted avg       0.55      0.51      0.52      5000



In [27]:
print("Model 4")
pd.DataFrame(classification_report(df.actual, df.model4, target_names=target_names, output_dict=True))

Model 4


Unnamed: 0,cat,dog,accuracy,macro avg,weighted avg
precision,0.807229,0.731249,0.7426,0.769239,0.757781
recall,0.345361,0.955747,0.7426,0.650554,0.7426
f1-score,0.483755,0.82856,0.7426,0.656157,0.708154
support,1746.0,3254.0,0.7426,5000.0,5000.0


In [28]:
df.head()

Unnamed: 0,actual,model1,model2,model3,model4,baseline
0,cat,cat,dog,cat,dog,dog
1,dog,dog,cat,cat,dog,dog
2,dog,cat,cat,cat,dog,dog
3,dog,dog,dog,cat,dog,dog
4,cat,cat,cat,dog,dog,dog


## Dog Team -> Dog is positive case
- Phase I's best metric is recall, so we're optimizing for true positives / (all actual positives)
- Phase II's metric is precision: we're trying to minimize false positives (the model saying dog when we have a cat)
- Phase III is somebody on salary
- Dog team's Phase I model is Model 4: its dog recall is 0.956
- Dog team's Phase II model is Model 1: its dog precision is 0.890 (Model 2's precision is almost the same)

## Cat Team -> Cat is positive case
- Phase I should use recall, to maximize true positives out of all actual positives
- Phase II should use precision, to minimize false positives
- Cat team's Phase I model is Model 2: its cat recall is 0.89
- Cat team's Phase II model is Model 4: its cat precision is 0.81
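
The same selection can be done programmatically for either team by switching the positive class. This is a sketch on made-up data; the helper `recall_and_precision` and the `model_a` and `model_b` columns are hypothetical, and it assumes every model predicts the positive class at least once so that precision is defined.

```python
import pandas as pd

def recall_and_precision(df, model_cols, positive):
    """Per-model recall and precision, treating `positive` as the positive class."""
    actual_pos = df["actual"] == positive
    rows = {}
    for col in model_cols:
        pred_pos = df[col] == positive
        tp = (actual_pos & pred_pos).sum()
        rows[col] = {
            "recall": tp / actual_pos.sum(),   # TP / (TP + FN)
            "precision": tp / pred_pos.sum(),  # TP / (TP + FP)
        }
    return pd.DataFrame(rows).T  # one row per model

# Toy data for illustration only (not the Gives You Paws dataset).
toy = pd.DataFrame({
    "actual":  ["dog", "dog", "dog", "cat", "cat", "cat"],
    "model_a": ["dog", "dog", "dog", "dog", "dog", "cat"],  # says "dog" often
    "model_b": ["dog", "cat", "cat", "cat", "cat", "cat"],  # says "dog" rarely
})

for team in ["dog", "cat"]:
    scores = recall_and_precision(toy, ["model_a", "model_b"], positive=team)
    phase_one = scores["recall"].idxmax()     # catch as many positives as possible
    phase_two = scores["precision"].idxmax()  # keep false positives low
    print(f"{team} team: Phase I -> {phase_one}, Phase II -> {phase_two}")
```

In this toy sample, the model that is best for one team's Phase I is best for the other team's Phase II, because a model that over-predicts "dog" has high dog recall and high cat precision.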