# Confusion Matrix Clarification

- Which axis is actual vs. predicted in a confusion matrix?
    - There is no standard for which axis holds which. Different texts use different arrangements.
    - We'll see actual values as the columns in some places and as the rows in others.
- Which class is the positive case in the classification report?
    - https://stackoverflow.com/questions/35178590/scikit-learn-confusion-matrix
    - See https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
    - See https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html

|                        |                 Predicted Negative |                         Predicted Positive |
| :--------------------- | ------------------------------: | --------------------------------------: |
| **Actual Negative** |                   True Negative | False Positive, a Type I Error |
| **Actual Positive** | False Negative, a Type II Error |                           True Positive |

In [1]:
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [2]:
pets = pd.DataFrame()
pets["actual"] =    ["dog", "dog", "dog", "dog", "dog", "cat", "cat", "cat", "cat", "cat"]
pets["predicted"] = ["dog", "dog", "dog", "dog", "cat", "cat", "cat", "cat", "dog", "dog"]

# If we use actual as the 1st argument to crosstab, this will match the sklearn confusion_matrix output
# Set rows to "actual", columns to "predictions"
pd.crosstab(pets.actual, pets.predicted)

| actual (rows) / predicted (columns) | cat | dog |
| :---------------------------------- | --: | --: |
| cat                                 |   3 |   2 |
| dog                                 |   1 |   4 |


In [3]:
# Same data where 1 = "dog"
df = pd.DataFrame()

df["actual"] =    [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
df["predicted"] = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]

# pandas crosstabulation
pd.crosstab(df.actual, df.predicted)

| actual (rows) / predicted (columns) |   0 |   1 |
| :---------------------------------- | --: | --: |
| 0                                   |   3 |   2 |
| 1                                   |   1 |   4 |


## Takeaways from pd.crosstab
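- The pd.crosstab output puts both rows and columns in sorted (alphanumeric) order, so `cat` comes before `dog` and `0` comes before `1`.
- With actual as the rows and predicted as the columns, if 1 is positive and 0 is negative, then
    - True Positives = 4
    - False Negatives = 1
    - False Positives = 2
    - True Negatives = 3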
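As a minimal sketch of the bookkeeping above, the four counts can be read straight out of the crosstab with `.loc[actual_label, predicted_label]`. The variable names here are illustrative, and the sketch assumes 1 is the positive class, as in the bullets above.

```python
import pandas as pd

# Same 0/1 data as above, where 1 = "dog"
actual = pd.Series([1, 1, 1, 1, 1, 0, 0, 0, 0, 0], name="actual")
predicted = pd.Series([1, 1, 1, 1, 0, 0, 0, 0, 1, 1], name="predicted")

# Rows are actual, columns are predicted
table = pd.crosstab(actual, predicted)

# .loc[row_label, column_label] means .loc[actual, predicted]
tp = table.loc[1, 1]  # actual 1, predicted 1
fn = table.loc[1, 0]  # actual 1, predicted 0
fp = table.loc[0, 1]  # actual 0, predicted 1
tn = table.loc[0, 0]  # actual 0, predicted 0

print("TP", tp, "FN", fn, "FP", fp, "TN", tn)
```

Indexing by label rather than by position keeps the lookup correct even if the row or column order changes.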

## Let's use the `confusion_matrix` function
- Send in the actual data as the 1st argument, prediction as 2nd
- Predictions are columns, Actuals are rows.
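The row and column order can also be set explicitly with the `labels` parameter of `confusion_matrix`. The sketch below assumes the same 0/1 data used in this notebook. It compares the default (sorted) label order with an order that lists the positive class first, which is the layout some textbooks use.

```python
from sklearn.metrics import confusion_matrix

actual = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]

# Default: labels in sorted order (0, then 1); rows = actual, columns = predicted
default_order = confusion_matrix(actual, predicted)

# Explicit order: positive class (1) first on both axes
positive_first = confusion_matrix(actual, predicted, labels=[1, 0])

print(default_order)   # rows/columns ordered 0, 1
print(positive_first)  # rows/columns ordered 1, 0
```

With `labels=[1, 0]` the top-left cell holds the true positives, so the `tn, fp, fn, tp` unpacking of `.ravel()` no longer applies to that layout.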

In [4]:
# The actual goes as the first argument, the prediction as the second
# The axes match the crosstabulation above: rows are actual, columns are predicted
x = pd.DataFrame(confusion_matrix(df.actual, df.predicted))
x.columns = ["Predict 0", "Predict 1"]
x.index = ["Actual 0", "Actual 1"]
x

|          | Predict 0 | Predict 1 |
| :------- | --------: | --------: |
| Actual 0 |         3 |         2 |
| Actual 1 |         1 |         4 |


In [5]:
# Is there an assignment of what's positive from the function itself?
# This example is from https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

# confusion_matrix sorts the labels, so with 0/1 labels this unpacking
# treats 1 as the positive case and 0 as the negative case (which is a nice default)
tn, fp, fn, tp = confusion_matrix(df.actual, df.predicted).ravel()

print("True Positives", tp)
print("False Positives", fp)
print("False Negatives", fn)
print("True Negatives", tn)

print("-------------")

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
precision = tp / (tp + fp)

print("Accuracy is", accuracy)
print("Recall is", recall)
print("Precision is", precision)

True Positives 4
False Positives 2
False Negatives 1
True Negatives 3
-------------
Accuracy is 0.7
Recall is 0.8
Precision is 0.6666666666666666


In [6]:
# Manual accounting
# If we set the positive case as 1 then:
tp = 4
tn = 3
fp = 2
fn = 1

print("True Positives", tp)
print("False Positives", fp)
print("False Negatives", fn)
print("True Negatives", tn)
print("-------------")

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
precision = tp / (tp + fp)

print("Accuracy is", accuracy)
print("Recall is", recall)
print("Precision is", precision)

True Positives 4
False Positives 2
False Negatives 1
True Negatives 3
-------------
Accuracy is 0.7
Recall is 0.8
Precision is 0.6666666666666666
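As a cross-check on the hand-computed formulas, the same three numbers can be obtained from scikit-learn's `accuracy_score`, `recall_score`, and `precision_score`. This is a small sketch on the same data. For binary labels these functions default to `pos_label=1`, which matches the choice of positive class made above.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

actual = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]

# As with confusion_matrix, actual goes first and predicted second
accuracy = accuracy_score(actual, predicted)
recall = recall_score(actual, predicted)        # tp / (tp + fn)
precision = precision_score(actual, predicted)  # tp / (tp + fp)

print("Accuracy is", accuracy)
print("Recall is", recall)
print("Precision is", precision)
```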


In [7]:
# first argument into classification_report function AND confusion_matrix function is the actual
# second argument is the prediction
print(classification_report(df.actual, df.predicted))

              precision    recall  f1-score   support

           0       0.75      0.60      0.67         5
           1       0.67      0.80      0.73         5

    accuracy                           0.70        10
   macro avg       0.71      0.70      0.70        10
weighted avg       0.71      0.70      0.70        10



# Takeaways
- Accuracy doesn't depend on which class we call "positive", because the numerator (TP + TN) counts every correct prediction either way.
- The classification report row labeled "0" gives precision and recall with 0 treated as the positive class.
- The classification report row labeled "1" gives precision and recall with 1 treated as the positive class.
- `tn, fp, fn, tp = confusion_matrix(df.actual, df.predicted).ravel()` works nicely if we have a binary classification with the labels in their default sorted order.
- When the actual values are passed as the first argument, the confusion_matrix function sets:
    - Rows as actual
    - Columns as predictions
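To illustrate the point about the classification report rows, the sketch below recomputes the "0" row by passing `pos_label=0` to `precision_score` and `recall_score`, then compares the results with the values from `classification_report(..., output_dict=True)`. It assumes the same 0/1 data as above. The variable names are illustrative.

```python
from sklearn.metrics import classification_report, precision_score, recall_score

actual = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]

# Treat 0 as the positive class
precision_0 = precision_score(actual, predicted, pos_label=0)
recall_0 = recall_score(actual, predicted, pos_label=0)

# output_dict=True returns the report as a nested dict keyed by label name
report = classification_report(actual, predicted, output_dict=True)

print("Row 0 precision:", precision_0, "vs report:", report["0"]["precision"])
print("Row 0 recall:", recall_0, "vs report:", report["0"]["recall"])
print("Accuracy (same for either positive class):", report["accuracy"])
```

Each row of the report is a one-vs-rest view of that label, which is why precision and recall differ between rows while accuracy appears only once.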