# **Fundamentals Confusion Matrix**

In Machine Learning (ML), and specifically in classification problems, the confusion matrix is a table that describes the performance of a classification model on a set of data for which the true values are known. For a binary classifier it is a 2x2 table that contains four counts (in multi-class classification it is larger than 2x2). These four counts are:
- **True Positives (TP)**: The predictor predicted the positive class and the actual class was also positive.
- **True Negatives (TN)**: The predictor predicted the negative class and the actual class was also negative.
- **False Positives (FP)**: The predictor predicted the positive class but the actual class was negative.
- **False Negatives (FN)**: The predictor predicted the negative class but the actual class was positive.
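
To make the four definitions concrete, the following minimal sketch counts each outcome by hand in plain Python. The function name `count_outcomes`, its `positive` parameter and the toy label lists are illustrative assumptions and not part of any library.

```python
def count_outcomes(actual, predicted, positive=1):
    """Count TP, TN, FP and FN for a binary problem (hypothetical helper)."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            # Predicted positive: correct only if the actual class is positive too
            counts["TP" if a == positive else "FP"] += 1
        else:
            # Predicted negative: correct only if the actual class is negative too
            counts["TN" if a != positive else "FN"] += 1
    return counts


# Toy labels, invented for illustration only
toy_actual = [1, 0, 1, 1, 0, 0, 1, 0]
toy_predicted = [1, 0, 0, 1, 1, 0, 1, 0]

outcomes = count_outcomes(toy_actual, toy_predicted)
print(outcomes)  # {'TP': 3, 'TN': 3, 'FP': 1, 'FN': 1}
```

Every sample falls into exactly one of the four buckets, so the counts always add up to the number of samples.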

Here is a picture that shows the confusion matrix:  
<img src="Images/confusion_matrix_example.svg" width="500px"/>

## **Example**

First of all, let's import all necessary libraries. We will investigate the **Breast Cancer Dataset** which can be easily loaded using the [`sklearn` library](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer) or directly from the [Kaggle dataset](https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data).

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.datasets import load_breast_cancer

We can load the dataset and separate it into features and target variables.

In [2]:
# Load the dataset once and reuse the returned object
dataset = load_breast_cancer()

# Features and target data
data = dataset.data
target = dataset.target

# Feature names and target names
feature_names = dataset.feature_names
target_names = dataset.target_names

For easier viewing, we create a pandas `DataFrame` and display the first 5 rows of the dataset.

The target variable, which describes whether a patient has a benign or malignant tumor, is encoded as 0 and 1:
- 0: Malignant
- 1: Benign

In [3]:
df = pd.concat([pd.DataFrame(data, columns=feature_names), pd.DataFrame(target, columns=['target'])], axis=1)
# df.target = df.target.astype(bool)  # Convert target variable from numeric to boolean type, 0 = malignant, 1 = benign
df.head()

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0


We will only use the first 10 features of the dataset for simplicity. Then we create train and test sets using the `train_test_split` function from the `sklearn` library.

In [4]:
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:,:10], df.target, test_size=0.33, random_state=0)

Next we will train a `LogisticRegression` model and make predictions on the test set. 

In [5]:
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
lr.score(X_test, y_test)

0.9095744680851063

The model reaches an accuracy of about 91% on the test set, which is a solid result for a simple model trained on only ten features. Finally, we create a confusion matrix using the `confusion_matrix` function from the `sklearn` library.

In [6]:
# For binary classification, sklearn returns a matrix C in which
# true negatives are C[0, 0],
# false negatives are C[1, 0],
# true positives are C[1, 1] and
# false positives are C[0, 1]

confusion_matrix(y_test, lr.predict(X_test))

array([[ 59,   8],
       [  9, 112]], dtype=int64)

We get a 2x2 matrix because this is a binary classification problem: the tumor is either malignant or benign. The confusion matrix from the `sklearn` library is structured slightly differently from the picture shown above.

Instead of the following structure (see the picture above):
```py
# Rows: predicted value, columns: actual value
[[TP, FP],
 [FN, TN]]
```
It has the following structure:
```py
# Rows: actual value, columns: predicted value
[[TN, FP],
 [FN, TP]]
```
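
The relationship between the two layouts can be sketched in plain Python. The helper name `sklearn_to_picture_layout` is a hypothetical choice for illustration, and the matrix values are copied from the output above as a nested list, so no library is needed. When the matrix is a NumPy array, as returned by `confusion_matrix`, calling `.ravel()` on it should give the four values in the same `tn, fp, fn, tp` order.

```python
def sklearn_to_picture_layout(cm):
    """Convert [[TN, FP], [FN, TP]] into [[TP, FP], [FN, TN]] (hypothetical helper)."""
    (tn, fp), (fn, tp) = cm
    # FP and FN keep their positions; only TN and TP swap corners
    return [[tp, fp], [fn, tn]]


# Values copied from the confusion matrix printed above
sklearn_layout = [[59, 8], [9, 112]]

# Flatten row by row to name the four counts
tn, fp, fn, tp = [value for row in sklearn_layout for value in row]
print(tn, fp, fn, tp)  # 59 8 9 112

picture_layout = sklearn_to_picture_layout(sklearn_layout)
print(picture_layout)  # [[112, 8], [9, 59]]
```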

**HINT**: Especially in medical applications, it matters which kind of error the model makes, because telling a patient they are healthy when they are not is usually worse than telling a healthy patient they are sick. Whether that critical error counts as an FP or an FN depends on which class is treated as positive. Here the positive class (1) is benign, so the critical error is an FP: a malignant tumor predicted as benign. More information about other metrics (e.g. Accuracy, Precision, Recall, ...) can be found in other notebooks.

In this case we have the following outputs:
- **TP**: 112 - The model predicted a benign tumor for 112 patients who actually have a benign tumor.
- **TN**: 59 - The model predicted a malignant tumor for 59 patients who actually have a malignant tumor.
- **FP**: 8 - The model predicted a benign tumor for 8 patients who actually have a malignant tumor. (Ideally, this should be 0.)
- **FN**: 9 - The model predicted a malignant tumor for 9 patients who actually have a benign tumor.
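
As a quick sanity check, the four counts listed above can be tied back to the score reported earlier. This is a small arithmetic sketch with illustrative variable names. It assumes that the score of a classifier is plain accuracy, meaning correct predictions divided by all predictions.

```python
# Counts taken from the list above
tp, tn, fp, fn = 112, 59, 8, 9

total = tp + tn + fp + fn
accuracy = (tp + tn) / total  # correct predictions / all predictions

# Share of actually malignant tumors (class 0) that were predicted as benign
missed_malignant = fp / (fp + tn)

print(total)                       # 188
print(round(accuracy, 4))          # 0.9096
print(round(missed_malignant, 4))  # 0.1194
```

The accuracy matches the value of roughly 0.9096 returned by `lr.score`, while the second number shows that accuracy alone hides how often the critical error occurs.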

In [7]:
# Example 2
# 1 FP, 2 FN, 3 TP, 4 TN

actual = [0, 1, 1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 0, 0, 1, 1, 1, 0, 0, 0, 0]

confusion_matrix(actual, predicted)

array([[4, 1],
       [2, 3]], dtype=int64)
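
The same counting logic extends beyond two classes. The following plain-Python sketch builds a matrix with actual classes as rows and predicted classes as columns, which is the layout described for `sklearn` above. The function name `build_confusion_matrix`, its `labels` parameter and the three-class toy labels are assumptions made for illustration. This is not how `sklearn` implements it internally.

```python
def build_confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes (hypothetical helper)."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


# Same labels as in Example 2
example_actual = [0, 1, 1, 1, 1, 1, 0, 0, 0, 0]
example_predicted = [1, 0, 0, 1, 1, 1, 0, 0, 0, 0]
binary_cm = build_confusion_matrix(example_actual, example_predicted, labels=[0, 1])
print(binary_cm)  # [[4, 1], [2, 3]]

# Three-class toy labels, invented for illustration only
toy_actual = ["a", "b", "c", "a", "b", "c", "a"]
toy_predicted = ["a", "b", "b", "a", "c", "c", "b"]
multi_cm = build_confusion_matrix(toy_actual, toy_predicted, labels=["a", "b", "c"])
print(multi_cm)  # [[2, 1, 0], [0, 1, 1], [0, 1, 1]]
```

In the multi-class case the diagonal holds the correct predictions for each class, and every off-diagonal cell counts one specific kind of mix-up.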