# Some studies in imbalanced data

<div class="cite2c-biblio"></div>I log my studies and thoughts on the imbalanced data problem in machine learning. To simplify the discussion, I first restrict myself to binary classification.

## The problem
<strong>Definition</strong>: A dataset is imbalanced if the classification categories are not approximately equally represented (#cite-chawla2009data).

<strong>Example</strong>

```python
from sklearn.datasets import make_classification

# Synthetic binary problem with roughly a 90/10 class split.
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, weights=[0.9, 0.1])
print('y = 1: %i instances' % list(y).count(1))
print('y = 0: %i instances' % list(y).count(0))
```

```
y = 1: 104 instances
y = 0: 896 instances
```


## Preprocessing: Do we really want to balance it?
In general, we can upsample the minority class or downsample the majority class to balance the data. But resampling usually introduces bias: the resampled class proportions no longer match those of the original data. Another possibility is boosting, which focuses the learning on misclassified instances and is built into some learners (XGBoost, etc.). Other learners, such as random forests (RF), do not boost but can handle imbalance through class weights.
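
As a rough illustration of the two resampling options, here is a minimal sketch using `sklearn.utils.resample` on the same kind of synthetic data as in the example above. The `random_state` values and all variable names are assumptions added for reproducibility and readability; this is one simple way to resample, not the only one.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils import resample

# Synthetic 90/10 data; random_state is an assumption added for reproducibility.
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2,
                           weights=[0.9, 0.1], random_state=0)

X_min, y_min = X[y == 1], y[y == 1]  # minority class
X_maj, y_maj = X[y == 0], y[y == 0]  # majority class

# Upsample: draw minority rows with replacement until both classes match.
X_min_up, y_min_up = resample(X_min, y_min, replace=True,
                              n_samples=len(y_maj), random_state=0)
X_up = np.vstack([X_maj, X_min_up])
y_up = np.concatenate([y_maj, y_min_up])

# Downsample: draw majority rows without replacement down to the minority size.
X_maj_down, y_maj_down = resample(X_maj, y_maj, replace=False,
                                  n_samples=len(y_min), random_state=0)
X_down = np.vstack([X_maj_down, X_min])
y_down = np.concatenate([y_maj_down, y_min])

print("upsampled counts:  ", np.bincount(y_up))
print("downsampled counts:", np.bincount(y_down))
```

Both resampled sets are balanced, but the upsampled one contains duplicated minority rows and the downsampled one throws away most of the majority rows.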

## Metric choice: What do we really care about? 
The performance of a classifier can be summarized by a confusion matrix, in which the predictions are compared with the ground truth to count true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). Other metrics can then be derived:
<strong>Accuracy</strong> $$ = \frac{TP+TN}{TP+FP+TN+FN} $$
- Not appropriate: it can be overly optimistic (for example, a constant classifier that always predicts 0 still gets a high accuracy of ~0.9 on the data above).
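
The constant-classifier case can be checked directly. The sketch below builds the confusion matrix for a predictor that always outputs 0 on the same kind of synthetic data as in the example above; the `random_state` and the variable names are assumptions added here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix

# Synthetic 90/10 data; random_state is an assumption added for reproducibility.
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2,
                           weights=[0.9, 0.1], random_state=0)

# A "classifier" that ignores X and always predicts the majority class 0.
y_pred = np.zeros_like(y)

# For labels [0, 1], confusion_matrix returns [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y, y_pred, labels=[0, 1]).ravel()
acc = (tp + tn) / (tp + fp + tn + fn)

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy = {acc:.3f}")
```

The accuracy comes entirely from TN: the predictor never finds a single positive instance (TP = 0), yet it scores close to the majority-class fraction.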

<strong>ROC and ROC-AUC</strong>           
- x axis: $$ FPR = \frac{FP}{FP + TN} $$
- y axis: $$ TPR = \frac{TP}{TP + FN} $$
- perfect predictor: (0, 1)
- Each point on the ROC curve corresponds to one predicted-probability threshold; the curve traces out all thresholds. Its AUC is therefore independent of any single threshold and gives an overall measure of performance.
- Not appropriate for very imbalanced data, because FPR depends on TN, which usually carries less weight in our decision. Imagine predicting a financial crisis, where most instances are negative (no crisis): correctly predicting a no-crisis instance is less important. With so many negatives, TN is large, so FPR stays low and ROC-AUC can look high even when there are many false positives relative to true positives. The same reasoning applies to the dependence of TPR on FN on the y axis.
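
To make the threshold picture concrete, here is a minimal sketch that computes an ROC curve and its AUC with scikit-learn. The choice of logistic regression, the train/test split, and the `random_state` values are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic 90/10 data; random_state is an assumption added for reproducibility.
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Any probabilistic classifier would do; logistic regression is just an example.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]  # predicted probability of class 1

# Each (fpr[i], tpr[i]) pair is the point obtained with threshold thresholds[i].
fpr, tpr, thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)

print(f"{len(thresholds)} thresholds, ROC-AUC = {auc:.3f}")
```

The curve always runs from (0, 0) to (1, 1), and the AUC summarizes it in a single number without committing to one threshold.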
    
<strong>Precision-Recall and PR-AUC</strong>        
- Recall = $$\frac{TP}{TP + FN}$$: of all positive instances, what fraction does the predictor correctly identify?
- Precision = $$\frac{TP}{TP + FP}$$: of all instances the predictor labels as positive, what fraction is correct?
- perfect predictor: (1, 1)
- bad predictor: for example, one that constantly predicts 1 has precision $$\frac{P}{P+N} \approx 0.1$$ at recall 1, and a random predictor stays near this value for all recall values
- Well suited to very imbalanced data: neither quantity depends on TN, so the focus stays on how the positive class is handled.
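
The precision-recall view can be computed in the same way. The sketch below uses `precision_recall_curve` and `average_precision_score` (one common summary of the PR curve, which is not identical to a trapezoidal area) and compares the result with the constant-1 baseline. The classifier, the split, and the `random_state` values are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             precision_score)
from sklearn.model_selection import train_test_split

# Synthetic 90/10 data; random_state is an assumption added for reproducibility.
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]  # predicted probability of class 1

# precision and recall have one more entry than thresholds: the final
# (recall=0, precision=1) point has no corresponding threshold.
precision, recall, thresholds = precision_recall_curve(y_test, scores)
ap = average_precision_score(y_test, scores)

# Baseline: always predicting 1 gives precision P / (P + N) at recall 1.
baseline = precision_score(y_test, np.ones_like(y_test))

print(f"average precision = {ap:.3f}, constant-1 baseline = {baseline:.3f}")
```

Comparing the average precision against the baseline, rather than against 0.5 as one would for ROC-AUC, gives a fairer sense of how much the classifier actually helps on the minority class.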




{% bibliography %}