# Overview of Class Imbalance
> Note: This is a preliminary document intended to give an overview of the class imbalance problem and suggest some solutions, for internal use among course instructors. A more pedagogical version with code snippets, or an extended blog post, can be built on it later for external use.

## A. Why is class imbalance a problem?

Training an ML algorithm on an imbalanced dataset can **create a learning bias** against the minority class(es) and consequently **degrade predictive performance** on those classes. This is because most ML algorithms implicitly assume a balanced dataset and equal error costs (i.e. the cost of a false alarm is the same as the cost of a miss). Furthermore, **performance evaluation under class imbalance becomes more challenging**, especially when different errors carry different degrees of importance (e.g. in medical applications, when diagnosing a fatal disease, the cost of a miss is much higher than that of a false alarm, and such costs cannot always be quantified). In this case standard metrics, such as accuracy and AUROC, become less informative and can even be misleading under high imbalance. It is therefore essential to understand the effects of class imbalance on learning algorithms and to examine how to mitigate them and evaluate models informatively. We start by investigating the problems that arise with class imbalance.

### A.1 Prior bias (in the case of prior distribution drift between training and testing):

This problem arises in the special circumstance where the class imbalance is not expected to be the same during training and testing. For instance, assume we are running a binary classification on cat and dog images. For some reason, during data collection, we were only able to build a training dataset with 80% cats and 20% dogs. At test time, however, we expect the classifier to face a more balanced situation where 50% of the images are cats and 50% are dogs. In that case, and assuming our classifier has *enough learning capacity*, the algorithm will inherently learn the wrong priors, which leads to biased predictions at test time.

Formally, let $x$ be the input, e.g. a cat/dog image, of a binary classifier $f(\cdot)$. You can think of $f(\cdot)$ as a logistic regression or a neural network mapping $x$ to $f(x)$. In that case, $f(x)$ represents the posterior probability of $x$ being in class $C_1$, say cats, where

\begin{equation}
f(x) = p(C_1|x) \propto p(x|C_1) \, p(C_1).
\end{equation}

The prediction is then made by choosing the class with the higher posterior: if $f(x)>1/2$, which means $p(C_1|x) > p(C_2|x)$, then it is a cat, otherwise a dog.

> <sub>
In the case of logistic regression, where the log-odds is modeled as a linear function of $x$, one can easily link the posterior probability to the parameters ($w$ and $b$)
\begin{align}
p(C_1|x) = \frac{p(x|C_1)p(C_1)}{p(x)} = \frac{p(x|C_1)p(C_1)}{p(x|C_1)p(C_1) + p(x|C_2)p(C_2)}= \frac{1}{1 + \frac{p(x|C_2)p(C_2)}{p(x|C_1)p(C_1)}}
= \frac{1}{1 + \exp\left(-\ln\frac{p(x|C_1)p(C_1)}{p(x|C_2)p(C_2)}\right)} = \frac{1}{1 + \exp\left(-(w^Tx + b)\right)} = \sigma(w^Tx + b).
\end{align}
 </sub>

Assuming that the classifier has enough capacity, it will learn the *biased* training prior $p(C_1)=0.8$. If we expect that this prior does not hold at test time, we need to mitigate the problem. This can be done at training time (re-balance the dataset via sampling or data augmentation, use penalized models) or at test time (correct the output probability, adjust the threshold), as we will see later.
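
The effect of the prior on the posterior can be checked numerically. The sketch below is only an illustration under stated assumptions: two hypothetical 1-D Gaussian classes with equal variance (means `+1` and `-1`, not taken from any real dataset), for which the log-odds is exactly linear in $x$ and the prior only enters through the intercept $b$. All function names are made up for this example.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def posterior_c1(x, prior_c1, mean_c1=1.0, mean_c2=-1.0, std=1.0):
    """p(C_1 | x) computed directly from Bayes' rule."""
    joint_c1 = gaussian_pdf(x, mean_c1, std) * prior_c1
    joint_c2 = gaussian_pdf(x, mean_c2, std) * (1.0 - prior_c1)
    return joint_c1 / (joint_c1 + joint_c2)

def sigmoid_posterior_c1(x, prior_c1, mean_c1=1.0, mean_c2=-1.0, std=1.0):
    """Same posterior written as sigma(w * x + b) (equal-variance assumption)."""
    w = (mean_c1 - mean_c2) / std**2
    # The prior appears only in the intercept, as the log prior odds.
    b = (mean_c2**2 - mean_c1**2) / (2.0 * std**2) + math.log(prior_c1 / (1.0 - prior_c1))
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

x = 0.0  # a point where both class likelihoods are equal
p_with_train_prior = posterior_c1(x, prior_c1=0.8)
p_with_test_prior = posterior_c1(x, prior_c1=0.5)
print(p_with_train_prior, p_with_test_prior)
```

At a point where the two likelihoods are equal, the posterior simply reproduces the prior, so a model that absorbed the 80/20 training prior favors cats there even though a 50/50 test population gives no reason to.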

### A.2 Poor predictive performance on the minority class:

This is a more general problem that holds even if the imbalance persists at test time. Take for example the prediction of a rare fatal disease: the dataset is expected to be imbalanced during both training and testing, so there is no issue of distribution drift here. However, the imbalance can still affect the learning task, because the relatively few instances of the minority class are not enough for the algorithm to explore the feature space and learn the appropriate rule. The problem here is therefore not in learning a biased prior as in the previous section, but rather in the ability to learn the whole posterior probability.

Take the Gaussian naive Bayes classifier as an example. When one of the classes has only a few instances, we get noisy estimates of that class's Gaussian likelihood parameters (mean and variance) and, to a lesser extent, of its prior, and hence poor performance (high generalization error).
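
A small simulation can illustrate how the quality of the estimated Gaussian mean depends on the number of instances. This is a sketch under assumptions: the sample sizes (10 versus 1000), the unit-variance Gaussian, and the helper name `rmse_of_mean_estimate` are all hypothetical choices for illustration.

```python
import random
import statistics

def rmse_of_mean_estimate(n_samples, n_trials=500, true_mean=0.0, true_std=1.0, seed=0):
    """Root-mean-square error of the sample mean over repeated draws of n_samples points."""
    rng = random.Random(seed)
    squared_errors = []
    for _ in range(n_trials):
        sample = [rng.gauss(true_mean, true_std) for _ in range(n_samples)]
        squared_errors.append((statistics.fmean(sample) - true_mean) ** 2)
    return statistics.fmean(squared_errors) ** 0.5

rmse_minority = rmse_of_mean_estimate(n_samples=10)    # few minority instances
rmse_majority = rmse_of_mean_estimate(n_samples=1000)  # many majority instances
print(rmse_minority, rmse_majority)
```

The error of the sample mean shrinks roughly like $\sigma/\sqrt{n}$, so the minority-class parameters are estimated far less precisely than the majority-class ones.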

> Note that in the previous section we had assumed that the learning algorithm has *enough learning capacity* in order to filter out this effect and focus on the prior's bias.

To mitigate this problem, one has to adjust the data or the model during training (data augmentation, penalized models). However, re-balancing the data or penalizing the model can create an *artificial bias* that needs to be corrected at test time. More on this in Section B.

### A.3 Less informative evaluation metrics:

The accuracy metric is meaningful for a relatively balanced dataset and for cases where the two types of errors have the same cost (cost of false alarm = cost of miss). For imbalanced datasets, however, accuracy can become misleading: it gives an inflated measure that a dummy classifier (e.g. one that always predicts the majority class) can easily match. The same applies to other evaluation metrics, such as the false positive rate (FPR) used in the ROC curve.

Let's first review the main evaluation metrics in a binary classification setup:
- Accuracy $= (TP + TN)/(P+N) = (TP + TN)/(TP + FN + TN + FP)$
- TPR or recall $= TP/P = TP/(TP+FN)$
- FPR $= FP/N = FP/(FP+TN)$
- precision $= TP/(TP+FP)$

The problem with an imbalanced dataset is that the majority class (usually the negative class) is overwhelming. Hence, for any reasonable classifier the $TN$ term dominates both the numerator and the denominator of the accuracy, pushing it close to $1$. The same holds for the false positive rate (FPR), where the dominating $TN$ term in the denominator keeps it close to zero. Therefore, both the accuracy and the ROC curve can be misleading when assessing performance under class imbalance.
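
To make the inflation concrete, the following minimal sketch computes the four metrics from raw confusion counts. The counts (10 positives, 990 negatives) and the helper `binary_metrics` are hypothetical, chosen only for illustration; returning 0 when a denominator is zero is a convention assumed here.

```python
def binary_metrics(tp, fn, tn, fp):
    """Accuracy, recall (TPR), FPR and precision from confusion-matrix counts."""
    p = tp + fn  # actual positives
    n = tn + fp  # actual negatives
    return {
        "accuracy": (tp + tn) / (p + n),
        "recall": tp / p if p else 0.0,
        "fpr": fp / n if n else 0.0,
        # Precision is undefined with no positive predictions; 0.0 is an assumed convention.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
    }

# Dummy classifier that always predicts the negative (majority) class.
always_negative = binary_metrics(tp=0, fn=10, tn=990, fp=0)

# A weak detector: finds half of the positives but raises 50 false alarms.
weak_detector = binary_metrics(tp=5, fn=5, tn=940, fp=50)

print(always_negative)
print(weak_detector)
```

In this made-up example, the classifier that never predicts the positive class reaches 99% accuracy with zero recall, and the weak detector shows an FPR of about 5% while fewer than one in ten of its alarms is correct.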

Ideally, we want both the $FN$ and $FP$ terms to be small, measured without the dominating $TN$ term. Hence, it is good practice to look at precision and recall, neither of which involves $TN$. One way to combine these two metrics is their harmonic mean, the F-1 score. This of course assumes that both types of error have the same cost. If not, one can use a weighted version of the score, e.g. the F-$\beta$ score. In general, it is always recommended to look at the entire confusion matrix in such cases.
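
As a sketch of how the weighting works, the F-$\beta$ score can be written as $(1+\beta^2)\,\frac{\text{precision}\cdot\text{recall}}{\beta^2\,\text{precision}+\text{recall}}$, with $\beta=1$ recovering the F-1 score and $\beta>1$ giving more weight to recall. The helper name `f_beta` and the example values (precision $1/11$, recall $1/2$) are assumptions for illustration.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta > 1 favors recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # assumed convention when both terms vanish
    b2 = beta ** 2
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)

precision, recall = 1 / 11, 1 / 2  # hypothetical classifier with many false alarms
f1 = f_beta(precision, recall)            # equal weight on both errors
f2 = f_beta(precision, recall, beta=2.0)  # misses considered costlier than false alarms
print(f1, f2)
```

Because the recall here is higher than the precision, the recall-weighted F-2 score comes out higher than the F-1 score for the same classifier.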

> More info on F-1 score with useful visualizations [here](https://github.com/MhDia/MLteaching/blob/master/T2_F1score.ipynb).

## B. How to mitigate the class imbalance problem?

### B.1 Re-balance training dataset:
- Data augmentation
- Up-sample minority class
- Down-sample majority class
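
The two sampling options above can be sketched with the standard library only. This is a minimal illustration, not a recommended implementation: the helper names are hypothetical, sampling is purely random, and in practice only the training split should be re-sampled.

```python
import random
from collections import Counter

def split_by_class(samples, labels):
    """Group samples by their class label."""
    groups = {}
    for x, y in zip(samples, labels):
        groups.setdefault(y, []).append(x)
    return groups

def random_oversample(samples, labels, seed=0):
    """Duplicate random minority samples until every class matches the largest one."""
    rng = random.Random(seed)
    groups = split_by_class(samples, labels)
    target = max(len(xs) for xs in groups.values())
    new_samples, new_labels = [], []
    for y, xs in groups.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        new_samples.extend(xs + extra)
        new_labels.extend([y] * target)
    return new_samples, new_labels

def random_undersample(samples, labels, seed=0):
    """Keep a random subset of each class, sized like the smallest one."""
    rng = random.Random(seed)
    groups = split_by_class(samples, labels)
    target = min(len(xs) for xs in groups.values())
    new_samples, new_labels = [], []
    for y, xs in groups.items():
        new_samples.extend(rng.sample(xs, target))
        new_labels.extend([y] * target)
    return new_samples, new_labels

# Hypothetical 80/20 training set; integers stand in for images.
samples = list(range(100))
labels = ["cat"] * 80 + ["dog"] * 20
up_samples, up_labels = random_oversample(samples, labels)
down_samples, down_labels = random_undersample(samples, labels)
print(Counter(up_labels), Counter(down_labels))
```

Note that the returned lists are grouped by class, so they would need to be shuffled before any order-sensitive training procedure.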

### B.2 Use penalized model for training:
Apply `class_weight` (or `sample_weight` in the `fit` method) in `sklearn`. This has a similar, but not exactly the same, effect as B.1.
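
As an illustration of what such weights can look like, the sketch below computes inverse-frequency weights with the heuristic `n_samples / (n_classes * class_count)`, which is the formula the `sklearn` documentation gives for `class_weight="balanced"`. The helper name `balanced_class_weights` and the 80/20 label vector are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weight per class: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

labels = [0] * 80 + [1] * 20  # hypothetical 80/20 training labels
class_weights = balanced_class_weights(labels)

# Per-sample version of the same weights, one entry per training instance.
sample_weights = [class_weights[y] for y in labels]

# With these weights each class contributes the same total weight to the loss.
total_weight_per_class = {
    cls: sum(w for w, y in zip(sample_weights, labels) if y == cls)
    for cls in class_weights
}
print(class_weights, total_weight_per_class)
```

A dictionary of this form could be passed as `class_weight`, and the per-sample list as `sample_weight`, for estimators that accept those arguments.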

> Note that B.1 and B.2 can create an *artificial bias* that needs to be corrected during test time (see next).

### B.3 Adjust the prediction during testing (correct prior bias due to drift in distribution and potentially artificial prior bias added by the above re-balancing):
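- Adjusting the output probability by correcting the prior (divide $f(x)$ by the training prior, or by the artificial prior induced by re-balancing, multiply by the true test prior if it is known or can be estimated, then renormalize so that the class probabilities sum to one)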
- Adjusting the threshold

(Both methods are equivalent in terms of the resulting decisions.)
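
For the binary case, the two adjustments can be sketched as follows. The sketch assumes the class-conditional distributions $p(x|C_k)$ are unchanged between training and testing and that only the priors differ; the function names and the 80/20 to 50/50 example are hypothetical.

```python
def correct_prior(p_c1, train_prior_c1, test_prior_c1):
    """Re-weight p(C_1|x) from the training prior to the test prior, then renormalize."""
    score_c1 = p_c1 * test_prior_c1 / train_prior_c1
    score_c2 = (1.0 - p_c1) * (1.0 - test_prior_c1) / (1.0 - train_prior_c1)
    return score_c1 / (score_c1 + score_c2)

def equivalent_threshold(train_prior_c1, test_prior_c1):
    """Threshold on the uncorrected output that matches 'corrected output > 1/2'."""
    odds_ratio = (test_prior_c1 * (1.0 - train_prior_c1)) / (
        train_prior_c1 * (1.0 - test_prior_c1)
    )
    return 1.0 / (1.0 + odds_ratio)

train_prior, test_prior = 0.8, 0.5  # 80% cats in training, 50% expected at test time
threshold = equivalent_threshold(train_prior, test_prior)

for p in (0.55, 0.70, 0.85, 0.95):  # uncorrected classifier outputs f(x)
    corrected = correct_prior(p, train_prior, test_prior)
    # Both decision rules agree on every output.
    print(p, round(corrected, 3), corrected > 0.5, p > threshold)
```

In this example the adjusted threshold equals the training prior: an output must exceed about 0.8, rather than 0.5, before the image is labeled a cat.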

### B.4 Use appropriate evaluation metrics:
- Precision/recall curve
- F-1 score (or weighted version)
- Confusion matrix
- Avoid misleading metrics: accuracy, FPR, TNR (or specificity/selectivity), or any metric where the $TN$ term can dominate.
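
As a sketch of how a precision/recall curve is obtained, the points can be computed by sweeping the decision threshold over the predicted scores. The helper name `precision_recall_points`, the scores, and the labels are made up for illustration, and the function assumes at least one positive label.

```python
def precision_recall_points(scores, labels):
    """Return (threshold, precision, recall) for every distinct score used as threshold.

    A sample is predicted positive when its score is >= threshold.
    Assumes labels are 0/1 and contain at least one positive.
    """
    n_pos = sum(labels)
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        # tp + fp >= 1 because the threshold is itself one of the scores.
        points.append((threshold, tp / (tp + fp), tp / n_pos))
    return points

# Hypothetical classifier scores for 2 positives and 4 negatives.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 0, 1, 0, 0, 0]
points = precision_recall_points(scores, labels)
for threshold, precision, recall in points:
    print(threshold, round(precision, 3), round(recall, 3))
```

Lowering the threshold raises recall but lets more false positives through, and none of the points depends on the $TN$ count.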