### How to handle imbalanced data

### Introduction

<p> What do datasets in domains like fraud detection in banking, real-time bidding in marketing or intrusion detection in networks have in common?</p>

<p>Data used in these areas often contain less than 1% of rare but 'interesting' events (e.g. fraudsters using credit cards, users clicking advertisements or a corrupted server scanning its network). However, most machine learning algorithms do not work very well with imbalanced datasets. The following seven techniques can help you train a classifier to detect the abnormal class.</p>

![Title](imbalanced-data-1.png)


### 1. Use the right evaluation metrics

<p> Applying inappropriate evaluation metrics to a model trained on imbalanced data can be dangerous. Imagine our training data is the one illustrated in the graph above. If accuracy is used to measure the goodness of a model, a model that classifies all testing samples as '0' will have an excellent accuracy (99.8%), but obviously this model won't provide any valuable information for us. </p>

<p> In this case, alternative evaluation metrics can be applied, such as:</p>

<p> 1. Precision: how many selected instances are relevant.</p>
<p> 2. Recall/Sensitivity: how many relevant instances are selected.</p>
<p> 3. F1 score: harmonic mean of precision and recall.</p>
<p> 4. MCC: correlation coefficient between the observed and predicted binary classifications.</p>
<p> 5. AUC: area under the ROC curve, which relates the true-positive rate to the false-positive rate.</p>
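
The gap between accuracy and the other metrics can be checked with a few lines of plain Python. This is a minimal sketch, not a reference implementation: the helper names (`confusion_counts`, `safe_div`, `evaluate`) are hypothetical, and the toy labels (2 rare events in 1,000 samples) are an assumption chosen to reproduce the 99.8% accuracy mentioned above. AUC is left out because it needs predicted scores rather than hard labels.

```python
import math

def confusion_counts(y_true, y_pred):
    """Count (tp, fp, tn, fn) for 0/1 labels, where 1 marks the rare class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn

def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0

def evaluate(y_true, y_pred):
    """Compute accuracy, precision, recall, F1 and MCC from hard predictions."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    mcc = safe_div(tp * tn - fp * fn,
                   math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {
        "accuracy": safe_div(tp + tn, len(y_true)),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }

# Assumed toy data: 2 rare events among 1,000 samples.
y_true = [1, 1] + [0] * 998
always_zero = [0] * 1000          # classifier that never flags anything
one_hit = [1, 0] + [0] * 998      # finds one rare event, no false alarms

print(evaluate(y_true, always_zero))
print(evaluate(y_true, one_hit))
```

The always-zero classifier reaches 99.8% accuracy while its recall, F1 and MCC are all zero, which is exactly the failure that accuracy hides.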
    
    
    

### 2. Resample the training set

<p> Apart from using different evaluation criteria, one can also work on obtaining a different dataset. Two approaches to make a balanced dataset out of an imbalanced one are under-sampling and over-sampling.</p>

<p> 2.1 Under-sampling </p>
<p> Under-sampling balances the dataset by reducing the size of the abundant class. This method is used when the quantity of data is sufficient. By keeping all samples in the rare class and randomly selecting an equal number of samples in the abundant class, a balanced new dataset can be retrieved for further modeling.</p>
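
Random under-sampling as described above can be sketched with the standard library alone. The function name `undersample`, the 0/1 label convention and the toy data (20 rare samples out of 1,000) are assumptions made for illustration.

```python
import random

def undersample(samples, labels, rare_label=1, seed=0):
    """Keep every rare sample plus an equally sized random subset of the rest."""
    rng = random.Random(seed)
    rare = [i for i, y in enumerate(labels) if y == rare_label]
    abundant = [i for i, y in enumerate(labels) if y != rare_label]
    kept = rare + rng.sample(abundant, len(rare))  # sampling without replacement
    rng.shuffle(kept)
    return [samples[i] for i in kept], [labels[i] for i in kept]

# Assumed toy data: sample ids 0..19 are rare, the remaining 980 are abundant.
labels = [1] * 20 + [0] * 980
samples = list(range(1000))

X_bal, y_bal = undersample(samples, labels)
print(len(y_bal), sum(y_bal))  # 40 samples in total, 20 of them rare
```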

<p> 2.2 Over-sampling </p>
<p> In contrast, over-sampling is used when the quantity of data is insufficient. It tries to balance the dataset by increasing the number of rare samples. Rather than getting rid of abundant samples, new rare samples are generated by using e.g. repetition, bootstrapping or SMOTE (Synthetic Minority Over-sampling Technique).</p>
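
The simplest variant, over-sampling by repetition (drawing rare samples with replacement), might look like the sketch below. The function name `oversample` and the toy data are assumptions. SMOTE, which synthesizes new points between neighbouring rare samples instead of repeating existing ones, is not shown here.

```python
import random

def oversample(samples, labels, rare_label=1, seed=0):
    """Repeat randomly chosen rare samples until both classes have equal size."""
    rng = random.Random(seed)
    rare = [i for i, y in enumerate(labels) if y == rare_label]
    abundant = [i for i, y in enumerate(labels) if y != rare_label]
    # Draw with replacement from the rare class to fill the gap.
    extra = [rng.choice(rare) for _ in range(len(abundant) - len(rare))]
    kept = rare + abundant + extra
    rng.shuffle(kept)
    return [samples[i] for i in kept], [labels[i] for i in kept]

# Assumed toy data: 20 rare samples and 180 abundant ones.
labels = [1] * 20 + [0] * 180
samples = list(range(200))

X_bal, y_bal = oversample(samples, labels)
print(len(y_bal), sum(y_bal))  # 360 samples in total, 180 of them rare
```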

<p> Note that neither resampling method has an absolute advantage over the other. Which of the two to apply depends on the use case and on the dataset itself. A combination of over- and under-sampling is often successful as well.</p>
   
    
    

### 3. Use K-fold Cross-Validation in the right way
<p> It is noteworthy that cross-validation should be applied properly when using an over-sampling method to address imbalance problems.</p>

<p> Keep in mind that over-sampling takes observed rare samples and applies bootstrapping to generate new random data based on a distribution function. If cross-validation is applied after over-sampling, copies or near-copies of the same rare samples end up in both the training and the validation folds, so we are basically overfitting our model to a specific artificial bootstrapping result. That is why the data should always be split into cross-validation folds before over-sampling, just as feature selection should be performed inside each fold. Only by resampling the data repeatedly, within each training fold, can randomness be introduced into the dataset to make sure that there won't be an overfitting problem.</p>
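
The ordering described above (split first, then over-sample only the training part of each fold) can be sketched as follows. The helper names `stratified_folds` and `oversample_indices`, the choice of 5 folds and the toy labels are assumptions. Model fitting is left as a comment because it depends on the classifier used.

```python
import random

def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds that preserve the class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        for pos, i in enumerate(members):
            folds[pos % k].append(i)
    return folds

def oversample_indices(train_idx, labels, rng, rare_label=1):
    """Repeat rare training indices until the training part is balanced."""
    rare = [i for i in train_idx if labels[i] == rare_label]
    abundant = [i for i in train_idx if labels[i] != rare_label]
    extra = [rng.choice(rare) for _ in range(len(abundant) - len(rare))]
    return train_idx + extra

# Assumed toy data: 20 rare and 180 abundant samples.
labels = [1] * 20 + [0] * 180
rng = random.Random(1)

splits = []
for test_idx in stratified_folds(labels, k=5):
    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    # Over-sample AFTER the split, so no copy of a test sample leaks into training.
    train_balanced = oversample_indices(train_idx, labels, rng)
    splits.append((train_balanced, test_idx))
    # Here a model would be fitted on train_balanced and scored on the
    # untouched, still imbalanced test_idx.

print([(len(train), len(test)) for train, test in splits])
```

Each validation fold keeps its original class ratio, so the reported score reflects the imbalance the model will face in practice.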

### 4. Ensemble different resampled datasets
<p> The easiest way to successfully generalize a model is by using more data. The problem is that out-of-the-box classifiers like logistic regression or random forest tend to generalize by discarding the rare class. One easy best practice is building n models that use all the samples of the rare class and n differing samples of the abundant class. Given that you want to ensemble 10 models, you would keep e.g. the 1,000 cases of the rare class and randomly sample 10,000 cases of the abundant class. Then you just split the 10,000 cases into 10 chunks and train 10 different models, each on all rare cases plus one chunk.</p>
    
![Title](imbalanced-data-2.png)
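
A sketch of how the ten training sets could be assembled, using the numbers from the example above (1,000 rare cases, 10,000 sampled abundant cases, 10 chunks). The total of 100,000 abundant cases, the function names `build_ensemble_sets` and `majority_vote`, and the use of a simple majority vote to combine the models are assumptions; the text does not prescribe a combination rule.

```python
import random

def build_ensemble_sets(rare_idx, abundant_idx, n_models, n_abundant, seed=0):
    """Return one index list per model: all rare cases plus one abundant chunk."""
    rng = random.Random(seed)
    sampled = rng.sample(abundant_idx, n_abundant)  # without replacement
    chunk = n_abundant // n_models
    return [rare_idx + sampled[m * chunk:(m + 1) * chunk]
            for m in range(n_models)]

def majority_vote(predictions):
    """Combine the 0/1 predictions of several models for a single sample."""
    return int(sum(predictions) * 2 > len(predictions))

# Assumed index layout: ids 0..999 are rare, the next 100,000 are abundant.
rare_idx = list(range(1_000))
abundant_idx = list(range(1_000, 101_000))

training_sets = build_ensemble_sets(rare_idx, abundant_idx,
                                    n_models=10, n_abundant=10_000)
print(len(training_sets), len(training_sets[0]))  # 10 sets of 2,000 cases each

# One model would be trained per set; their predictions are then combined.
print(majority_vote([1, 1, 0, 1, 0, 1, 1, 0, 1, 1]))  # 7 of 10 models vote 1
```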

### 5. Resample with different ratios

The previous approach can be fine-tuned by playing with the ratio between the rare and the abundant class. The best ratio heavily depends on the data and the models that are used. But instead of training all models in the ensemble with the same ratio, it is worth trying to ensemble different ratios. So if 10 models are trained, it might make sense to have one model with a ratio of 1:1 (rare:abundant) and another one with 1:3, or even 2:1. Depending on the model used, this can influence the weight that one class gets.

![Title](imbalanced-data-3.png)
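
Building training sets with different rare:abundant ratios could look like this. It is a sketch under the assumption that all rare cases are always kept and only the number of abundant cases varies; the function name `sample_with_ratio` and the index layout are hypothetical.

```python
import random

def sample_with_ratio(rare_idx, abundant_idx, rare_parts, abundant_parts, rng):
    """Keep all rare cases and draw abundant ones so that
    rare:abundant equals rare_parts:abundant_parts."""
    n_abundant = len(rare_idx) * abundant_parts // rare_parts
    return rare_idx + rng.sample(abundant_idx, n_abundant)

# Assumed index layout: ids 0..999 are rare, the next 100,000 are abundant.
rare_idx = list(range(1_000))
abundant_idx = list(range(1_000, 101_000))
rng = random.Random(0)

ratios = [(1, 1), (1, 3), (2, 1)]  # rare:abundant, as in the text
training_sets = {
    f"{r}:{a}": sample_with_ratio(rare_idx, abundant_idx, r, a, rng)
    for r, a in ratios
}
for name, idx in training_sets.items():
    print(name, len(idx) - len(rare_idx), "abundant cases")
```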

### 6. Cluster the abundant class
<p> An elegant approach was proposed by Sergey on Quora. Instead of relying on random samples to cover the variety of the training samples, he suggests clustering the abundant class into r groups, with r being the number of cases in the rare class. For each group, only the medoid (center of the cluster) is kept. The model is then trained with the rare class and the medoids only.</p>
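
One way to sketch this idea is a small k-medoids routine that alternates between assigning points to their nearest medoid and re-picking each cluster's most central member. This is an illustrative implementation, not necessarily the clustering method the original proposal had in mind; the function name `k_medoids`, the Euclidean distance and the synthetic 2-D data are all assumptions.

```python
import math
import random

def k_medoids(points, k, n_iter=20, seed=0):
    """Return the indices of k medoids found by alternating assignment/update."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)
    for _ in range(n_iter):
        # Assignment step: attach every point to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            nearest = min(medoids, key=lambda m: math.dist(p, points[m]))
            clusters[nearest].append(i)
        # Update step: the new medoid is the member with the smallest
        # total distance to all other members of its cluster.
        new_medoids = []
        for m, members in clusters.items():
            if not members:
                new_medoids.append(m)
                continue
            best = min(members, key=lambda c: sum(
                math.dist(points[c], points[j]) for j in members))
            new_medoids.append(best)
        if set(new_medoids) == set(medoids):
            break  # converged
        medoids = new_medoids
    return medoids

# Assumed synthetic data: 300 abundant and 10 rare points in 2-D.
rng = random.Random(42)
abundant = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(300)]
rare = [(rng.gauss(3, 0.5), rng.gauss(3, 0.5)) for _ in range(10)]

r = len(rare)                       # one cluster per rare case
medoid_idx = k_medoids(abundant, k=r)

# Train on the rare class plus the medoids of the abundant class only.
train_X = rare + [abundant[i] for i in medoid_idx]
train_y = [1] * r + [0] * r
print(len(train_X), sum(train_y))
```

Because medoids are actual observations, the reduced training set contains only real abundant cases, unlike cluster centroids, which are averages.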

### 7. Design your own models

All the previous methods focus on the data and keep the models as a fixed component. But in fact, there is no need to resample the data if the model is suited for imbalanced data. The famous XGBoost is already a good starting point if the classes are not skewed too much, because it internally takes care that the bags it trains on are not imbalanced. But then again, the data is still resampled; it just happens behind the scenes.

By designing a cost function that penalizes wrong classifications of the rare class more than wrong classifications of the abundant class, it is possible to design many models that naturally generalize in favor of the rare class. For example, an SVM can be tweaked to penalize wrong classifications of the rare class by the same ratio by which this class is underrepresented.

![Title](imbalanced-data-4.png)
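
A class-weighted cost function of this kind can be sketched in a few lines. The weighting rule used here (total samples divided by the number of classes times the class count) is one common heuristic and an assumption, as are the function names `class_weights` and `weighted_hinge_loss` and the toy data. Libraries such as scikit-learn expose a comparable option, for instance the `class_weight` parameter of `SVC`.

```python
def class_weights(labels):
    """Weight each class inversely to its frequency: n / (n_classes * n_c)."""
    n = len(labels)
    classes = sorted(set(labels))
    return {c: n / (len(classes) * labels.count(c)) for c in classes}

def weighted_hinge_loss(scores, labels, weights):
    """Mean class-weighted hinge loss.

    scores are signed margins (positive means 'rare'), labels are 0/1.
    """
    total = 0.0
    for s, y in zip(scores, labels):
        sign = 1 if y == 1 else -1
        total += weights[y] * max(0.0, 1.0 - sign * s)
    return total / len(labels)

# Assumed toy data: 10 rare cases among 1,000 samples (ratio 1:99).
labels = [1] * 10 + [0] * 990
weights = class_weights(labels)
print(weights[1] / weights[0])  # rare errors cost about 99 times more

# A model that confidently predicts 'abundant' for every sample.
always_abundant = [-1.0] * len(labels)
unweighted = weighted_hinge_loss(always_abundant, labels, {0: 1.0, 1: 1.0})
weighted = weighted_hinge_loss(always_abundant, labels, weights)
print(unweighted, weighted)
```

Without weights, ignoring the rare class costs almost nothing (0.02 here); with weights, the same behaviour is penalized fifty times more heavily (1.0), so an optimizer can no longer treat the rare class as negligible.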

### Final Remarks
<p> This is not an exhaustive list of techniques, but rather a starting point for handling imbalanced data. There is no best approach or model suited for all problems, and it is strongly recommended to try different techniques and models to evaluate what works best. Try to be creative and combine different approaches. It is also important to be aware that in many domains where imbalanced classes occur (e.g. fraud detection, real-time bidding), the 'market rules' are constantly changing. So, check whether past data might have become obsolete.</p>