# Creating Confusion Matrix DataFrame




First, import pandas and NumPy.

In [1]:
import pandas as pd
import numpy as np

Then we make a pandas DataFrame (table) that counts every combination of predicted class and actual class.

We're going to build it like this:
*   The column header is the actual class
*   The row index is the predicted class

This is the layout that the filling loop and the precision and recall functions later in this notebook rely on.

```
    Predicted \ Actual | class1 | class2 | class3 |
    -------------------|--------|--------|--------|
    class1             |  ....  |  ....  |  ....  |
    class2             |  ....  |  ....  |  ....  |
    class3             |  ....  |  ....  |  ....  |
```


OK, step 1: we need all the class names for our table.



In [2]:
class_name = ['Apple', 'Banana', 'Cherry']
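
In practice the class names often come from the labels themselves rather than being typed by hand. A minimal sketch, assuming a hypothetical list `labels` (not part of the original notebook) that holds one label per sample:

```python
# Hypothetical label list; any iterable of hashable labels would work.
labels = ['Banana', 'Apple', 'Cherry', 'Apple', 'Banana']

# set() removes duplicates, sorted() gives a stable alphabetical order,
# so the rows and columns of the table always appear in the same order.
class_name = sorted(set(labels))

print(class_name)  # ['Apple', 'Banana', 'Cherry']
```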

Let's make our DataFrame.

In [3]:
row_len = len(class_name)
col_len = len(class_name)

init_val = np.zeros((row_len, col_len), dtype=np.uint)

conf_matrix = pd.DataFrame(init_val, index=class_name, columns=class_name)

* We create the confusion table with ```pandas.DataFrame()```.

* We fill it with zeros as the initial value, so we call ```numpy.zeros()``` with shape (row, col).

* The row length and column length both equal the number of classes, so we initialize those variables with ```len(class_name)```.

* ```dtype=np.uint``` sets the data type to NumPy's unsigned integer, because a count can never be negative.

* Finally, we create the DataFrame with the index names (row names) and column names both set to the class names.

Let's print our DataFrame. Run the code below.

In [4]:
conf_matrix

|        | Apple | Banana | Cherry |
|--------|-------|--------|--------|
| Apple  | 0     | 0      | 0      |
| Banana | 0     | 0      | 0      |
| Cherry | 0     | 0      | 0      |


**PURRFECTT!!!!**

# Filling The Matrix

Now, how do we use it?

We need to create dummy data as an example.

In [5]:
actual_class = ['Apple', 'Apple', 'Apple', 'Apple', 'Apple', 'Apple', 'Apple', 'Apple', 'Apple', 'Apple', 'Apple', 'Banana', 'Banana', 'Banana', 'Banana', 'Banana', 'Banana', 'Banana', 'Banana', 'Banana', 'Banana', 'Banana', 'Banana', 'Cherry', 'Cherry', 'Cherry', 'Cherry', 'Cherry', 'Cherry', 'Cherry', 'Cherry', 'Cherry', 'Cherry', 'Cherry', 'Cherry', 'Cherry']
predicted_class = ['Apple', 'Apple', 'Apple', 'Apple', 'Apple', 'Apple', 'Apple', 'Banana', 'Cherry', 'Cherry', 'Cherry', 'Apple', 'Apple', 'Apple', 'Apple', 'Apple', 'Apple', 'Apple', 'Apple', 'Banana', 'Banana', 'Cherry', 'Cherry', 'Apple', 'Apple', 'Apple', 'Apple', 'Apple', 'Apple', 'Apple', 'Apple', 'Apple', 'Banana', 'Banana', 'Banana', 'Cherry']

To fill the DataFrame, we can iterate over the actual and predicted labels as follows:

In [6]:
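# Rows are the predicted class, columns are the actual class.
for actual, predicted in zip(actual_class, predicted_class):
  # A single .loc[row, column] call avoids chained indexing, which may update a copy.
  conf_matrix.loc[predicted, actual] += 1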

Let's check it

In [7]:
conf_matrix

|        | Apple | Banana | Cherry |
|--------|-------|--------|--------|
| Apple  | 7     | 8      | 9      |
| Banana | 1     | 2      | 3      |
| Cherry | 3     | 2      | 1      |
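
As an alternative to the explicit loop, pandas can build the same kind of table in one call with `pd.crosstab`. The sketch below uses a small made-up sample (the lists `actual` and `predicted` are hypothetical) and keeps the layout the loop produces: rows are the predicted class, columns are the actual class.

```python
import pandas as pd

# Small hypothetical sample; the lists from the notebook work the same way.
actual = ['Apple', 'Apple', 'Banana', 'Cherry', 'Cherry']
predicted = ['Apple', 'Banana', 'Banana', 'Apple', 'Cherry']
labels = ['Apple', 'Banana', 'Cherry']

# Rows = predicted class, columns = actual class (same layout as the loop).
table = pd.crosstab(
    index=pd.Series(predicted, name='Predicted'),
    columns=pd.Series(actual, name='Actual'),
)

# reindex guarantees every class gets a row and a column,
# even one that never shows up in the sample.
table = table.reindex(index=labels, columns=labels, fill_value=0)

print(table)
```

The `reindex` step matters mostly for small or skewed samples, where a class may never be predicted and would otherwise be missing from the rows.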


# OK, Now... What Is a Confusion Matrix Used For?

We can use it to calculate performance measures for our model, for example the F1-score. But before we do that, let me explain a few things.

# Precision

Based on Wikipedia, precision is:
> *... the number of correctly identified positive results divided by the number of all positive results, including those not identified correctly.*

From that explanation, we can write the formula for precision like this:

```
                  TP
   Precision = ---------
               (TP + FP)

Where:
   TP = True Positive
   FP = False Positive
```

# Code

Here's the code to calculate precision:

In [8]:
def precission(label, confusion_matrix):
  # Rows hold the predicted class, so the row sum for `label` is TP + FP.
  true_positive = confusion_matrix.loc[label, label]
  return true_positive / np.sum(confusion_matrix.loc[label, :])
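
As a quick check by hand, the precision for Apple can be worked out from the Apple row of the filled matrix above. A small sketch in plain Python, with the row values copied from that table (the variable names are only illustrative):

```python
# Row "Apple" of the filled matrix: every sample that was predicted as Apple,
# split by the class it actually belongs to.
predicted_apple = {'Apple': 7, 'Banana': 8, 'Cherry': 9}

true_positive = predicted_apple['Apple']                         # 7
false_positive = sum(predicted_apple.values()) - true_positive   # 8 + 9 = 17

precision_apple = true_positive / (true_positive + false_positive)
print(round(precision_apple, 2))  # 0.29
```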

# Recall

Based on Wikipedia:
> *Recall is the number of correctly identified positive results divided by the number of all samples that should have been identified as positive.*

And the formula is:

```
               TP
   Recall = ---------
            (TP + FN)      

Where:
   TP = True Positive 
   FP = False Positive
   FN = False Negative
```

***Note:***

True Positive  : The predicted class is (+) and the actual class is (+)

True Negative  : The predicted class is (-) and the actual class is (-)

False Positive : The predicted class is (+) but the actual class is (-)

False Negative : The predicted class is (-) but the actual class is (+)
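
To see how the four terms map onto the table, here is a sketch that counts them for each class. It assumes the layout produced by the filling loop above (rows are the predicted class, columns are the actual class) and reuses the counts from the filled matrix; the helper name `count_terms` is made up for this example.

```python
import numpy as np

# Filled matrix from above: rows = predicted class, columns = actual class.
labels = ['Apple', 'Banana', 'Cherry']
matrix = np.array([[7, 8, 9],
                   [1, 2, 3],
                   [3, 2, 1]])

def count_terms(matrix, k):
    """Return (TP, FP, FN, TN) for the class at position k."""
    tp = matrix[k, k]
    fp = matrix[k, :].sum() - tp   # predicted as k, actually something else
    fn = matrix[:, k].sum() - tp   # actually k, predicted as something else
    tn = matrix.sum() - tp - fp - fn
    return int(tp), int(fp), int(fn), int(tn)

for k, name in enumerate(labels):
    print(name, count_terms(matrix, k))
# Apple (7, 17, 4, 8)
# Banana (2, 4, 10, 20)
# Cherry (1, 5, 12, 18)
```

With more than two classes, "positive" simply means "the class we are currently looking at" and "negative" means every other class.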

# Code

Here's the code to calculate Recall

In [9]:
def recall(label, confusion_matrix):
  # Columns hold the actual class, so the column sum for `label` is TP + FN.
  true_positive = confusion_matrix.loc[label, label]
  return true_positive / np.sum(confusion_matrix.loc[:, label])
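
The two functions handle one label at a time. As a possible shortcut, both measures can be computed for every class in one go with NumPy. This is a sketch that assumes the same layout as the filled matrix above (rows = predicted, columns = actual); the variable names are illustrative.

```python
import numpy as np

# Filled matrix from above: rows = predicted class, columns = actual class.
matrix = np.array([[7, 8, 9],
                   [1, 2, 3],
                   [3, 2, 1]])

true_positive = np.diag(matrix)                       # correct predictions per class

precision_all = true_positive / matrix.sum(axis=1)    # divide by row sums (TP + FP)
recall_all = true_positive / matrix.sum(axis=0)       # divide by column sums (TP + FN)

print(precision_all.round(2))  # [0.29 0.33 0.17]
print(recall_all.round(2))     # [0.64 0.17 0.08]
```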

# What Is F1-Score? 

Based on Wikipedia: 
> *F1-Score (F-Score or F-Measure) is a measure of a test's accuracy. F1-Score is the harmonic mean of the precision and recall.*

The formula to calculate F1-Score is:

```
                  (precision * recall)
   F1_score = 2 * --------------------
                  (precision + recall)
```

# Code

Here's the code to calculate the F1-Score

In [10]:
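def f1_score(label, precission, recall):
  # `label` is not used in the calculation; it is kept so existing calls still work.
  # Harmonic mean of precision and recall.
  return 2 * (precission * recall) / (precission + recall)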
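
Because F1 is the harmonic mean of precision and recall, it can also be written directly in terms of the counts: F1 = 2TP / (2TP + FP + FN). The sketch below checks that both forms agree for Apple, using TP = 7, FP = 17, and FN = 4 read off the filled matrix above. The zero check in `safe_f1` is an addition of this example, not part of the original function.

```python
def safe_f1(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0  # avoids a ZeroDivisionError for a class that is never hit
    return 2 * (precision * recall) / (precision + recall)

tp, fp, fn = 7, 17, 4          # counts for Apple
precision = tp / (tp + fp)     # 7 / 24
recall = tp / (tp + fn)        # 7 / 11

f1_from_rates = safe_f1(precision, recall)
f1_from_counts = 2 * tp / (2 * tp + fp + fn)   # 14 / 35

print(round(f1_from_rates, 2), round(f1_from_counts, 2))  # 0.4 0.4
```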

# Calculating F1-Score

In [23]:
# One row per class, one column per measure; every cell starts out as NaN.
result_df = pd.DataFrame(data=None, index=class_name, columns=['Precision', 'Recall', 'F1_score'], dtype=float)

for fruit in result_df.index:
  precission_val = precission(fruit, conf_matrix)
  recall_val = recall(fruit, conf_matrix)
  f1_score_val = f1_score(fruit, precission_val, recall_val)

  # Use a single .loc[row, column] call; chained indexing may write to a copy.
  result_df.loc[fruit, 'Precision'] = precission_val
  result_df.loc[fruit, 'Recall'] = recall_val
  result_df.loc[fruit, 'F1_score'] = f1_score_val

print(result_df.round(2))

|        | Precision | Recall | F1_score |
|--------|-----------|--------|----------|
| Apple  | 0.29      | 0.64   | 0.4      |
| Banana | 0.33      | 0.17   | 0.22     |
| Cherry | 0.17      | 0.08   | 0.11     |


# Which One to Use?



*   F1-Score is a good choice when we have an uneven class distribution.
*   Precision is the one to use when we want to be confident about the positives we report, e.g. spam filtering. You'd rather have some spam emails in your inbox than regular emails in the spam box. So the email company wants to be extra sure that an email is spam before putting it in the spam box, where you might never see it.
*   Recall is the one to use when false negatives are unacceptable, e.g. when predicting whether a person has a disease. You'd rather have a false positive (identified as having the disease but actually healthy) than a false negative (identified as healthy but actually having the disease).

References:

1. Wikipedia

2. towardsdatascience.com