# Introduction

A dataset is considered imbalanced if one or more classes contain a substantially larger proportion of
examples than the others. The over-represented classes are called majority classes, and the remaining
ones are called minority classes.
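
As a rough illustration of the definition, the class proportions and a simple imbalance ratio (largest class size divided by smallest) can be computed as below. The function names and the toy labels are hypothetical, and the ratio is only one of several possible summaries.

```python
from collections import Counter

def class_distribution(labels):
    """Return each class's share of the dataset as {label: proportion}."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

def imbalance_ratio(labels):
    """Largest class size divided by smallest class size (1.0 means balanced)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Hypothetical toy labels: 8 examples of "a", 3 of "b", 1 of "c".
labels = ["a"] * 8 + ["b"] * 3 + ["c"]
distribution = class_distribution(labels)
ratio = imbalance_ratio(labels)
```

Here "a" would be the majority class and "c" the smallest minority class.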

## Main causes:
1. Errors introduced when the examples are collected and sampled from the problem domain (biased sampling, measurement errors).
2. The imbalance may be an inherent property of the problem domain.

## Why care about imbalance?
* Most ML classification algorithms were designed around the assumption of an equal number of examples for each class; a model trained on imbalanced data therefore tends to favor the majority class, which is bad for generalization.
* In real-world problems, the minority class is usually the one of interest, so a model that performs poorly on it is of little use.

# Survey on multi-class imbalance techniques

* Haixiang et al.: Two basic strategies are **preprocessing** and **cost-sensitive learning**, which are further integrated into classification models and divided into **ensemble-based classifiers** and **algorithm modified classifiers**.
* Yasir Arafat et al.: Three strategies -> sampling techniques (over/under-sampling), cost-sensitive learning methods, and ensemble-based methods (combine different classifiers to form a strong classifier).
* R. Cruz et al. and M. Galar et al.: Four strategies -> algorithm-level approaches (take the imbalance between classes into account), data-level approaches (sampling as preprocessing), cost-sensitive learning frameworks (combine algorithm-level & data-level), and ensemble-based approaches.
* Multi-class decomposition: One-Against-All (OAA) and One-Against-One (OAO), which split the multi-class problem into binary subproblems.
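
The two decomposition schemes in the last bullet can be sketched on the label level as follows. This is a minimal illustration only: the helper names and toy labels are hypothetical, and a real pipeline would also train one binary classifier per subproblem.

```python
from itertools import combinations

def one_against_all(labels):
    """For each class c, relabel every sample as 1 (is c) or 0 (is not c)."""
    classes = sorted(set(labels))
    return {c: [1 if y == c else 0 for y in labels] for c in classes}

def one_against_one(labels):
    """For each class pair (a, b), keep only the samples of a or b.

    Returns {(a, b): (kept_indices, binary_labels)} with 1 for a and 0 for b.
    """
    classes = sorted(set(labels))
    problems = {}
    for a, b in combinations(classes, 2):
        idx = [i for i, y in enumerate(labels) if y in (a, b)]
        problems[(a, b)] = (idx, [1 if labels[i] == a else 0 for i in idx])
    return problems

# Hypothetical toy labels with one majority class and two minority classes.
labels = ["a", "a", "a", "b", "c"]
oaa = one_against_all(labels)
oao = one_against_one(labels)
```

With K classes, OAA yields K binary problems and OAO yields K(K-1)/2. Note that OAA can make each binary subproblem more skewed than the original data, since a single class is set against all the others combined.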

Shuo Wang et al. conclude that oversampling does not help in either the multi-minority or the multi-majority case because it causes overfitting to the minority classes. Undersampling techniques, on the other hand, can be sensitive to the number of classes in the multi-minority case, while in the multi-majority case there is a high risk of sacrificing too much majority-class performance. Considering the problem of multi-class imbalance itself, and not just the techniques applied, the multi-majority situation seems to be more difficult than the multi-minority one.

## 1. Data-level approaches
1. Resampling: select or generate a specific number of examples from the majority or minority classes in order to rebalance them and diminish the impact of imbalance (undersampling, oversampling, hybrid).
2. Distance-based algorithms:
    * SMOTE (Synthetic Minority Over-sampling Technique): generates synthetic minority samples by interpolating between neighboring minority examples. Because the majority class is not taken into account, this can cause overgeneralization and lead to overlap between classes, especially in multi-class cases.
    * MDO (Mahalanobis Distance-based Over-sampling): an oversampling technique.
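
To make the SMOTE bullet concrete, here is a minimal SMOTE-style sketch using only the standard library. It is a simplification under stated assumptions, not the reference algorithm: the function name `smote_like`, the parameters `k` and `seed`, and the toy points are hypothetical.

```python
import math
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a randomly
    chosen minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # Rank the other minority points by Euclidean distance to x.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, neighbour)))
    return synthetic

# Hypothetical 2-D minority class: the corners of the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(minority, n_new=5)
```

The sketch never looks at majority-class points, which is exactly why synthetic samples can land in regions occupied by other classes.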

## 2. Algorithm-level approaches
1. Modified multi-class HDDT (Hellinger Distance Decision Trees)

## 3. Cost-sensitive learning:
1. Example-dependent costs
2. Class-dependent costs
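
A minimal sketch of class-dependent costs, assuming the cost of misclassifying a class is set inversely proportional to its frequency (one common heuristic among many; the helper names and toy data are hypothetical):

```python
from collections import Counter

def class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * count) for c, count in counts.items()}

def weighted_error(y_true, y_pred, weights):
    """Misclassification rate where each sample counts with its class weight."""
    total = sum(weights[t] for t in y_true)
    wrong = sum(weights[t] for t, p in zip(y_true, y_pred) if t != p)
    return wrong / total

# Hypothetical data: 8 majority "a" samples and 2 minority "b" samples.
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 10  # a classifier that always predicts the majority class
weights = class_weights(y_true)
error = weighted_error(y_true, y_pred, weights)
```

The plain error rate of this always-majority classifier is 0.2, while the cost-weighted error is 0.5, so the minority-class mistakes are no longer hidden. Example-dependent costs would replace the per-class weight with a per-sample one.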

## 4. Ensemble-based approaches:
1. Bagging
    * SMOTEBagging: takes the class distribution among all minority classes into consideration after sampling.
    * Multi-class Roughly Balanced Bagging: considers a multinomial distribution based on the classes' prior probabilities.
    * SOUP-Bagging (Similarity Oversampling & Undersampling Preprocessing): utilizes hybrid sampling in order to obtain a more balanced distribution of classes in datasets.
2. Boosting
    * BPSO-AdaBoost-KNN: uses BPSO as a feature selection algorithm, followed by the combined AdaBoost-KNN classifier; evaluated with a new metric, AUCarea.
    * SMOTEBoost (data-level): applies SMOTE sampling, then boosting. Its drawback is that it generates a large training dataset, which slows down the training phase of boosting.
    * RUSBoost (reduces the computational requirements of SMOTEBoost): applies RUS (Random Under-Sampling) to the majority class until all classes reach the same size, then applies AdaBoost to the sampled data; this is repeated in each iteration. Its drawback is the data loss caused by undersampling.
    * CatBoost (novel algorithm): gradient boosting on decision trees in two phases: building a tree, then setting the leaf values for the fixed tree.
3. Hybrid
    * Hybrid Ensemble for Classification of Multiclass Imbalanced (HECMI): for datasets with a high imbalance ratio, gives better recall rates for minority classes.
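
The random undersampling step described for RUSBoost can be sketched as below. This shows only the sampling, not the boosting loop, and assumes every class is reduced to the size of the smallest one; the function name and toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict

def random_undersample(samples, labels, seed=0):
    """Randomly drop samples so every class ends up as small as the smallest."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    target = min(len(xs) for xs in by_class.values())
    out_samples, out_labels = [], []
    for y, xs in by_class.items():
        # Sample without replacement; the discarded examples are lost.
        for x in rng.sample(xs, target):
            out_samples.append(x)
            out_labels.append(y)
    return out_samples, out_labels

# Hypothetical dataset: 8 "a", 3 "b" and 2 "c" samples identified by index.
samples = list(range(13))
labels = ["a"] * 8 + ["b"] * 3 + ["c"] * 2
balanced_samples, balanced_labels = random_undersample(samples, labels)
balanced_counts = Counter(balanced_labels)
```

In a boosting setting this sampling would be repeated with a fresh random draw in every iteration, so different majority examples are seen by different weak learners.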
    
# Evaluation Metrics
Most notable metrics: F-measure, Geometric Mean (G-Mean) and AUC_ROC.

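* F-measure: the weighted harmonic mean of precision and recall.
* G-Mean: the geometric mean of the per-class recall values; being a geometric mean, it is less affected by skewed values and large outliers than an arithmetic mean, and it stays low unless every class is recognized well.
* AUC_ROC: a performance measure for classification problems that describes a model's ability to distinguish between classes. The ROC is a probability curve (true positive rate against false positive rate across thresholds), and the AUC, the area under that curve, measures the degree of separability.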
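
A small sketch of how the first two metrics behave on imbalanced predictions. It assumes the balanced F-measure (F1) and takes G-Mean as the geometric mean of per-class recalls, a common multi-class definition; the function names and toy data are hypothetical.

```python
import math

def class_counts(y_true, y_pred, cls):
    """Return (true positives, false positives, false negatives) for cls."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    return tp, fp, fn

def f1_score(y_true, y_pred, cls):
    """Harmonic mean of precision and recall for a single class."""
    tp, fp, fn = class_counts(y_true, y_pred, cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def g_mean(y_true, y_pred):
    """Geometric mean of the recall of every class present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        tp, _, fn = class_counts(y_true, y_pred, cls)
        recalls.append(tp / (tp + fn))
    return math.prod(recalls) ** (1 / len(recalls))

# Hypothetical predictions: all 8 "a" correct, 1 of 2 "b" correct.
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 8 + ["a", "b"]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1_minority = f1_score(y_true, y_pred, "b")
gm = g_mean(y_true, y_pred)
```

Accuracy is 0.9 here, while the minority-class F1 is about 0.67 and the G-Mean about 0.71, which is why these metrics are preferred over accuracy for imbalanced data.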
