
# 'Accuracy' and unbalanced data sets

At first glance, accuracy looks like an attractive metric for evaluating a classifier: it tells you in what percentage of cases the classifier predicts correctly. However, accuracy can be deceiving, or downright misleading, when a dataset is not balanced.  
Suppose we are evaluating a cancer diagnosis predictor. Cancer is in fact not very common: about 97% of the Dutch population, for example, was *not* diagnosed with cancer in the past 10 years'. Now pretend we surveyed a sample of the Dutch population to test our predictor and found 95 people without cancer and 5 with cancer. Our classifier flagged 3 cancer cases in total: 2 false positives and 1 true positive. **Spoiler alert:** this is a terrible classifier. Not only did it get 2 of its 3 positive predictions wrong, it also correctly identified only 1 of the 5 people who actually have cancer.  
Intuitively, this classifier should be dismissed as lethal quackery. Yet its accuracy is 0.94, or 94%, which most people would consider very acceptable.
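
As a quick check of that figure, the accuracy can be worked out from the four counts in the example (1 true positive, 2 false positives, hence 93 true negatives and 4 false negatives). The abbreviations $TP$, $TN$, $FP$ and $FN$ are shorthand introduced here for those counts:

$$
\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{1 + 93}{1 + 93 + 2 + 4} = \frac{94}{100} = 0.94
$$

The 93 correctly ignored healthy people dominate the numerator, which is why the single correct cancer diagnosis barely matters for the result.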

Three things should be done to prevent this from happening:

1. Compare to a baseline, in this case the majority class. Against a baseline of 95%, an accuracy of 94% actually shows a *loss* of information.
2. The accuracy is this high because the classifier gets 93 of the 95 negative (no cancer) cases right. Use recall instead: it measures the fraction of actual bad-news cases the classifier gets right. Or use precision, which measures the fraction of predicted bad-news cases it gets right.
3. Balance the dataset (upsampling or downsampling).

'(https://www.volksgezondheidenzorg.info/onderwerp/kanker/cijfers-context/huidige-situatie#node-prevalentie-van-kanker)
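
To make the recall and precision definitions from the list above concrete, here is a minimal sketch that computes both by hand from the counts in the survey example. The variable names are illustrative only; the notebook cells that follow compute the same numbers with scikit-learn.

```python
# Counts from the survey example: 95 people without cancer, 5 with cancer,
# and 3 positive predictions (2 false positives, 1 true positive)
tn, fp, fn, tp = 93, 2, 4, 1

recall = tp / (tp + fn)      # share of actual cancer cases that were found
precision = tp / (tp + fp)   # share of predicted cancer cases that were real

print(f"recall:    {recall:.2f}")     # 0.20
print(f"precision: {precision:.2f}")  # 0.33
```

Both values are far below the 0.94 accuracy, which matches the intuition that this classifier is not fit for purpose.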



In [169]:
import pandas as pd
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score
from sklearn.metrics import accuracy_score

In [170]:
# Create the dataset: 95 benign (0) and 5 malign (1) cases
y_true = [0] * 95 + [1] * 5
# Predictions in the same order:
# 93 true negatives, 2 false positives, 4 false negatives, 1 true positive
y_pred = [0] * 93 + [1] * 2 + [0] * 4 + [1] * 1


In [171]:
# Calculate accuracy and create the confusion matrix
df = pd.DataFrame()
df['X'] = y_pred        # predictions, reused as the feature column for the dummy classifier
df['y_train'] = y_true  # true labels

confusion = confusion_matrix(y_true, y_pred)
print(f'Accuracy score:{accuracy_score(y_true, y_pred)}')
pd.DataFrame(confusion,
             columns=['predicted_benign','predicted_malign'],
             index=['true_benign','true_malign'])



Accuracy score:0.94


| | predicted_benign | predicted_malign |
|---|---|---|
| true_benign | 93 | 2 |
| true_malign | 4 | 1 |


In [172]:
# Create the classification report
from sklearn.metrics import classification_report
print(classification_report(y_true, y_pred, target_names=['benign', 'malign']))

             precision    recall  f1-score   support

     benign       0.96      0.98      0.97        95
     malign       0.33      0.20      0.25         5

avg / total       0.93      0.94      0.93       100



In [173]:
tn, fp, fn, tp = confusion.ravel()
(tn, fp, fn, tp)

(93, 2, 4, 1)

In [174]:
# Create a dummy classifier that always predicts the modal value of the target
from sklearn.dummy import DummyClassifier

# Create dummy classifier
dummy = DummyClassifier(strategy='most_frequent', random_state=1)

# "Train" model on the feature column and labels stored in df
X, y = df[['X']], df['y_train']
dummy.fit(X, y)

# Get accuracy score on the same data
print(f'Accuracy score choosing most frequent: {dummy.score(X, y)}')

Accuracy score choosing most frequent: 0.95
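
The same majority-class baseline can also be derived directly from the label counts, without fitting anything. This is a minimal sketch using only the standard library; the name `labels` is illustrative and simply repeats the 95/5 split used above.

```python
from collections import Counter

# Same class distribution as the imbalanced example: 95 benign (0), 5 malign (1)
labels = [0] * 95 + [1] * 5

# The most frequent label and how often it occurs
majority_label, majority_count = Counter(labels).most_common(1)[0]

# Accuracy obtained by always predicting the majority label
baseline_accuracy = majority_count / len(labels)
print(majority_label, baseline_accuracy)  # 0 0.95
```

Any classifier that scores below this number, such as the 0.94 above, is doing worse in terms of accuracy than predicting "benign" for everyone.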


In [175]:
# Create a balanced dataset manually: 50 benign (0) and 50 malign (1) cases
y_true = [0] * 50 + [1] * 50
# Predictions in the same order:
# 49 true negatives, 1 false positive, 40 false negatives, 10 true positives
y_pred = [0] * 49 + [1] * 1 + [0] * 40 + [1] * 10
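
The balanced set above is written out by hand. With real data, balancing is usually done by resampling existing rows, typically before training. The following is a rough standard-library sketch of both upsampling and downsampling; the `(patient_id, label)` rows are hypothetical and not part of the notebook's data. Library helpers such as `sklearn.utils.resample` could be used for the same purpose.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical imbalanced sample: (patient_id, label) pairs, 95 benign and 5 malign
rows = [(i, 0) for i in range(95)] + [(i, 1) for i in range(95, 100)]
majority = [row for row in rows if row[1] == 0]
minority = [row for row in rows if row[1] == 1]

# Upsampling: draw minority rows with replacement until both classes are equal in size
upsampled = majority + rng.choices(minority, k=len(majority))

# Downsampling: keep only as many majority rows as there are minority rows
downsampled = rng.sample(majority, k=len(minority)) + minority

print(len(upsampled), len(downsampled))  # 190 10
```

Upsampling keeps all the information but repeats the few minority rows many times; downsampling avoids duplicates but throws most of the majority class away.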

In [166]:
#calculate accuracy and create confusion matrix

confusion = confusion_matrix(y_true, y_pred)
print(f'Accuracy score:{accuracy_score(y_true, y_pred)}')
pd.DataFrame(confusion, 
             columns=['predicted_benign','predicted_malign'], 
             index=['true_benign','true_malign'])




Accuracy score:0.59


| | predicted_benign | predicted_malign |
|---|---|---|
| true_benign | 49 | 1 |
| true_malign | 40 | 10 |


In [167]:
# Create the classification report
from sklearn.metrics import classification_report
print(classification_report(y_true, y_pred, target_names=['benign', 'malign']))

             precision    recall  f1-score   support

     benign       0.55      0.98      0.71        50
     malign       0.91      0.20      0.33        50

avg / total       0.73      0.59      0.52       100



In [168]:
tn, fp, fn, tp = confusion.ravel()
(tn, fp, fn, tp)

(49, 1, 40, 10)
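
Putting the two confusion matrices side by side may help show what changed. The sketch below recomputes accuracy and per-class recall from the two sets of counts printed above; the helper `rates` is a hypothetical name introduced only for this comparison.

```python
def rates(tn, fp, fn, tp):
    """Return accuracy and per-class recall from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "recall_benign": tn / (tn + fp),
        "recall_malign": tp / (tp + fn),
    }

imbalanced = rates(93, 2, 4, 1)   # 95 benign, 5 malign
balanced = rates(49, 1, 40, 10)   # 50 benign, 50 malign

for name, result in [("imbalanced", imbalanced), ("balanced", balanced)]:
    print(name, {key: round(value, 2) for key, value in result.items()})
```

The per-class recall is essentially the same in both cases (about 0.98 for benign and 0.20 for malign), yet accuracy drops from 0.94 to 0.59. Only the class mix changed, which suggests the high accuracy on the original data came from the imbalance rather than from the classifier.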