## CMPINF2100 Week 12 | Measuring CLASSIFICATION PERFORMANCE - Accuracy

### Import Modules

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.formula.api as smf

### Read Data

Read the data from LAST week.

In [2]:
df = pd.read_csv('../week_11/week_11_intro_binary_classification.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 115 entries, 0 to 114
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   x       115 non-null    float64
 1   y       115 non-null    int64  
dtypes: float64(1), int64(1)
memory usage: 1.9 KB


### Fit the logistic regression model

In [4]:
fit_glm = smf.logit(formula = 'y ~ x', data=df).fit()

Optimization terminated successfully.
         Current function value: 0.560099
         Iterations 6


### Predict the training set

Just to be safe, let's make a COPY of the training set and ADD new columns to the COPY.

In [5]:
df_copy = df.copy()
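
Why make a COPY? A plain assignment only creates a second name for the SAME DataFrame, so new columns would show up in the original too. A minimal sketch of the difference, using a made-up DataFrame (the names `toy`, `toy_alias`, and `toy_copy` are hypothetical and are not part of the course data):

```python
import pandas as pd

# Hypothetical two-row DataFrame, NOT the course data.
toy = pd.DataFrame({'x': [0.1, 0.2]})

toy_alias = toy        # plain assignment: a second name for the same object
toy_copy = toy.copy()  # .copy(): an independent DataFrame

toy_copy['new_col'] = 1     # only the copy gets this column
toy_alias['other_col'] = 2  # the original gets this column too

print(list(toy.columns))       # ['x', 'other_col']
print(list(toy_copy.columns))  # ['x', 'new_col']
```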

In [6]:
df_copy['pred_probability'] = fit_glm.predict( df )

In [7]:
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 115 entries, 0 to 114
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   x                 115 non-null    float64
 1   y                 115 non-null    int64  
 2   pred_probability  115 non-null    float64
dtypes: float64(2), int64(1)
memory usage: 2.8 KB


In [8]:
df_copy

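|     | x | y | pred_probability |
|-----|-----------|---|----------|
| 0   | -0.457429 | 1 | 0.270709 |
| 1   | 0.425948  | 1 | 0.513678 |
| 2   | -0.784695 | 0 | 0.201258 |
| 3   | -1.925209 | 0 | 0.061306 |
| 4   | 2.252617  | 1 | 0.901780 |
| ... | ...       | ... | ...    |
| 110 | -0.791672 | 0 | 0.199933 |
| 111 | 0.452238  | 1 | 0.521449 |
| 112 | 0.535510  | 1 | 0.545976 |
| 113 | -0.532739 | 0 | 0.253472 |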


In [9]:
df_copy.describe()

|       | x | y | pred_probability |
|-------|-----------|----------|----------|
| count | 115.0     | 115.0    | 115.0    |
| mean  | 0.019188  | 0.417391 | 0.417391 |
| std   | 1.001227  | 0.495287 | 0.231525 |
| min   | -2.059272 | 0.0      | 0.052784 |
| 25%   | -0.721905 | 0.0      | 0.213738 |
| 50%   | 0.114752  | 0.0      | 0.422218 |
| 75%   | 0.57051   | 1.0      | 0.556223 |
| max   | 2.438859  | 1.0      | 0.919653 |


### Classify the predictions of the training set

In [10]:
df_copy['pred_class'] = np.where( df_copy.pred_probability > 0.5, 1, 0 )
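
The cell above converts each predicted probability into a class by comparing it to a threshold of 0.5. As a minimal sketch of that rule, the toy probabilities below are made up for illustration and are not taken from the data set; `toy_probability`, `toy_class`, and `threshold` are hypothetical names.

```python
import numpy as np

# Hypothetical predicted probabilities, NOT taken from the course data.
toy_probability = np.array([0.10, 0.49, 0.50, 0.51, 0.90])

# Same rule as the cell above: class 1 only when the probability
# is STRICTLY greater than the threshold, otherwise class 0.
threshold = 0.5
toy_class = np.where(toy_probability > threshold, 1, 0)

print(toy_class)  # [0 0 0 1 1]
```

Note that a probability of exactly 0.5 is classified as 0, because the comparison uses `>` rather than `>=`.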

In [11]:
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 115 entries, 0 to 114
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   x                 115 non-null    float64
 1   y                 115 non-null    int64  
 2   pred_probability  115 non-null    float64
 3   pred_class        115 non-null    int32  
dtypes: float64(2), int32(1), int64(1)
memory usage: 3.3 KB


In [12]:
df_copy

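|     | x | y | pred_probability | pred_class |
|-----|-----------|---|----------|---|
| 0   | -0.457429 | 1 | 0.270709 | 0 |
| 1   | 0.425948  | 1 | 0.513678 | 1 |
| 2   | -0.784695 | 0 | 0.201258 | 0 |
| 3   | -1.925209 | 0 | 0.061306 | 0 |
| 4   | 2.252617  | 1 | 0.901780 | 1 |
| ... | ...       | ... | ...    | ... |
| 110 | -0.791672 | 0 | 0.199933 | 0 |
| 111 | 0.452238  | 1 | 0.521449 | 1 |
| 112 | 0.535510  | 1 | 0.545976 | 1 |
| 113 | -0.532739 | 0 | 0.253472 | 0 |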


In [13]:
df_copy.nunique()

x                   115
y                     2
pred_probability    115
pred_class            2
dtype: int64

### Accuracy

Accuracy is the proportion of times the MODEL **CORRECTLY** classifies the OBSERVED output!

Accuracy is the NUMBER of CORRECT classifications DIVIDED by the total number of rows.
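
Written as a formula, with $N$ rows, observed outputs $y_i$, and predicted classes $\hat{y}_i$ (these symbols are introduced here for illustration and are not used elsewhere in the notebook):

$$
\mathrm{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left( y_i = \hat{y}_i \right)
$$

The indicator $\mathbb{1}(\cdot)$ equals 1 when the predicted class matches the observed output and 0 otherwise, so the sum counts the CORRECT classifications.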

In [15]:
df_copy.head()

|   | x | y | pred_probability | pred_class |
|---|-----------|---|----------|---|
| 0 | -0.457429 | 1 | 0.270709 | 0 |
| 1 | 0.425948  | 1 | 0.513678 | 1 |
| 2 | -0.784695 | 0 | 0.201258 | 0 |
| 3 | -1.925209 | 0 | 0.061306 | 0 |
| 4 | 2.252617  | 1 | 0.90178  | 1 |


Let's add a new column that stores whether each classification is correct. First, look at the result of the comparison on its own.

In [16]:
df_copy.y == df_copy.pred_class

0      False
1       True
2       True
3       True
4       True
       ...  
110     True
111     True
112     True
113     True
114     True
Length: 115, dtype: bool

In [17]:
df_copy['correct_class'] = df_copy.y == df_copy.pred_class

In [18]:
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 115 entries, 0 to 114
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   x                 115 non-null    float64
 1   y                 115 non-null    int64  
 2   pred_probability  115 non-null    float64
 3   pred_class        115 non-null    int32  
 4   correct_class     115 non-null    bool   
dtypes: bool(1), float64(2), int32(1), int64(1)
memory usage: 3.4 KB


Summing the boolean column gives the NUMBER of correct classifications, because each `True` counts as 1.

In [19]:
df_copy.correct_class.sum()

78

In [20]:
df_copy.correct_class.value_counts()

correct_class
True     78
False    37
Name: count, dtype: int64

We are interested in the PROPORTION of correct classifications rather than the COUNT.

In [21]:
df_copy.correct_class.value_counts(normalize=True)

correct_class
True     0.678261
False    0.321739
Name: proportion, dtype: float64

In [22]:
df_copy.correct_class.mean()

0.6782608695652174

However, I very rarely create a NEW column to store whether the classification is correct.

Instead, I calculate the Accuracy by applying the `np.mean()` function DIRECTLY to the conditional test.

In [23]:
np.mean( df_copy.y == df_copy.pred_class )

0.6782608695652174
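
Taking the mean of a boolean comparison works because `True` counts as 1 and `False` counts as 0, so the mean equals the number of matches divided by the number of rows. A minimal sketch on made-up arrays (the values and the names `toy_observed`, `toy_predicted`, and `is_correct` are hypothetical, not from the data set):

```python
import numpy as np

# Hypothetical observed outputs and predicted classes, NOT the course data.
toy_observed = np.array([1, 1, 0, 0, 1])
toy_predicted = np.array([0, 1, 0, 0, 1])

# Elementwise comparison gives a boolean array: True where the classes match.
is_correct = toy_observed == toy_predicted

# The count of correct classifications divided by the number of rows ...
accuracy_by_count = is_correct.sum() / is_correct.size

# ... equals the mean of the boolean array.
accuracy_by_mean = np.mean(is_correct)

print(accuracy_by_count, accuracy_by_mean)  # 0.8 0.8
```

The same arithmetic on the training set is 78 correct classifications out of 115 rows, which gives the accuracy of about 0.678 shown above.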

### Summary

The ACCURACY is a PROPORTION between 0 and 1. The closer the accuracy is to 1, the BETTER the model performed: its predicted classes match the observed values more often. Here, the model correctly classifies about 68% of the training set.

Accuracy is a SUMMARY STATISTIC. It is the NUMBER of correct classifications DIVIDED BY the total number of observations!