### Imbalanced Data
1) When the number of observations in one class is much higher than in the other class(es), the dataset has a class imbalance. This is referred to as the imbalanced data problem.

2) A severely skewed class distribution, such as an 80:20 majority-to-minority class ratio or beyond, is a typical sign of class imbalance.

3) This skew can heavily affect many ML algorithms, which may end up ignoring the minority class completely. Problems involving imbalanced data are usually the ones where predictions on the minority class matter most.
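
As a quick check before modelling, the class ratio can be computed directly from the labels. The snippet below is a minimal sketch; the `labels` list is made up for illustration and is not part of the dataset used later.

```python
from collections import Counter

# Hypothetical labels: 900 majority (0) and 100 minority (1) examples
labels = [0] * 900 + [1] * 100

counts = Counter(labels)
majority = max(counts.values())
minority = min(counts.values())
imbalance_ratio = majority / minority

print("Class counts:", dict(counts))
print(f"Majority-to-minority ratio: {imbalance_ratio:.1f}:1")
```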


### Handling Imbalanced Data: Best Practices and Approaches
1) <b>Collect More Data:</b><br>
A larger dataset might expose a different and perhaps more balanced perspective on the classes.

2) <b>Try Changing Your Performance Metric:</b><br>
Accuracy alone is misleading on an imbalanced dataset: on a 90:10 dataset, a model that always predicts the majority class reaches 90% accuracy without ever detecting the minority class.

The following performance measures give more insight into model quality than plain classification accuracy:

a) <b>Confusion Matrix:</b> A breakdown of predictions into a table showing correct predictions (the diagonal) and the types of incorrect predictions made (what classes incorrect predictions were assigned).<br>
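b) <b>Precision:</b> Of all examples predicted as positive, the fraction that are actually positive, TP / (TP + FP).<br>
c) <b>Recall:</b> Of all examples that are actually positive, the fraction the model finds, TP / (TP + FN).<br>
d) <b>F1 Score (or F-score):</b> The harmonic mean of precision and recall.<br>
e) <b>Adjust the decision threshold (ROC, AUC):</b> The ROC curve shows the trade-off between true positive rate and false positive rate across thresholds, and AUC summarizes it in one number.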
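
To illustrate why accuracy can mislead, the sketch below compares a model that always predicts the majority class with a hypothetical model that finds some of the positives. The label arrays are invented for illustration; only the scikit-learn metric functions are real API.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score)

# Hypothetical ground truth: 90 negatives followed by 10 positives
y_true = np.array([0] * 90 + [1] * 10)

# Baseline that always predicts the majority class
y_majority = np.zeros_like(y_true)

# Hypothetical model: misses 4 positives and raises 4 false alarms
y_model = y_true.copy()
y_model[90:94] = 0   # 4 false negatives
y_model[:4] = 1      # 4 false positives

for name, y_pred in [("always-majority", y_majority), ("model", y_model)]:
    print(name)
    print(confusion_matrix(y_true, y_pred))
    print("accuracy :", accuracy_score(y_true, y_pred))
    print("precision:", precision_score(y_true, y_pred, zero_division=0))
    print("recall   :", recall_score(y_true, y_pred))
    print("f1       :", f1_score(y_true, y_pred, zero_division=0))
```

The two accuracy values are close, while recall and F1 separate the two sets of predictions clearly.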


3) <b>Boosting Algorithm</b><br>
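XGBoost: the scale_pos_weight parameter controls the balance of positive and negative weights. A typical value to consider is the number of negative examples divided by the number of positive examples.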
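
A minimal sketch of how `scale_pos_weight` might be set, assuming the positive class (label 1) is the minority. The negative-to-positive ratio is a common starting point rather than a fixed rule, and `n_estimators=50` is an arbitrary choice. The model fit is skipped if `xgboost` is not installed.

```python
import numpy as np
from sklearn.datasets import make_classification

x, y = make_classification(n_samples=1000, n_features=2, n_redundant=0,
                           weights=[0.9, 0.1], n_classes=2, random_state=1)

# Ratio of negative to positive examples, used as the positive-class weight
n_neg = int(np.sum(y == 0))
n_pos = int(np.sum(y == 1))
scale_pos_weight = n_neg / n_pos
print("scale_pos_weight =", round(scale_pos_weight, 2))

try:
    from xgboost import XGBClassifier

    model = XGBClassifier(scale_pos_weight=scale_pos_weight, n_estimators=50)
    model.fit(x, y)
    print("Predicted positives on the training data:", int(model.predict(x).sum()))
except ImportError:
    print("xgboost is not installed; skipping the model fit")
```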

4) <b>Weighting of examples</b><br>
This involves creating weight vectors so that errors on the minority class count more during training, which improves minority class predictions.

Class-specific weights (the class_weight parameter) are calculated once per class, whereas sample-specific weights (sample_weight) are calculated for each individual instance.
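
The sketch below shows both kinds of weights with scikit-learn's helper functions. The toy labels and the one-dimensional feature are assumptions made for illustration; with the `"balanced"` setting, each class weight is `n_samples / (n_classes * class_count)`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight, compute_sample_weight

# Toy labels (assumption): 90 majority (0) and 10 minority (1) examples
y = np.array([0] * 90 + [1] * 10)
# Toy feature (assumption): minority examples tend to have larger values
rng = np.random.default_rng(0)
x = (rng.normal(size=100) + 2.0 * y).reshape(-1, 1)

# One weight per class
class_w = compute_class_weight(class_weight="balanced", classes=np.unique(y), y=y)
# One weight per sample (each sample inherits the weight of its class here)
sample_w = compute_sample_weight(class_weight="balanced", y=y)
print("class weights:", class_w)
print("first and last sample weight:", sample_w[0], sample_w[-1])

# Option 1: let the estimator derive class-level weights
clf_class = LogisticRegression(class_weight="balanced").fit(x, y)
# Option 2: pass explicit per-sample weights to fit
clf_sample = LogisticRegression().fit(x, y, sample_weight=sample_w)
```

Per-sample weights are the more general option, since they can also encode information other than the class, such as how trustworthy an individual label is.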

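5) <b>Use Stratified CV</b><br>
Stratified cross-validation keeps the class ratio approximately the same in every fold, so each fold contains minority class examples to train and evaluate on.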
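
A small sketch of stratified splitting with scikit-learn's `StratifiedKFold`. The labels and the placeholder feature column are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels (assumption): 90 majority (0) and 10 minority (1) examples
y = np.array([0] * 90 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
fold_minority_counts = []
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    n_minority = int(y[test_idx].sum())
    fold_minority_counts.append(n_minority)
    print(f"fold {fold}: test size = {len(test_idx)}, minority in test = {n_minority}")
```

With a plain, unstratified split, a fold could end up with very few or no minority examples, which makes per-fold recall and precision unreliable.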

6) <b>Penalized SVM</b><br>
In an SVM (for example scikit-learn's SVC), the class_weight parameter and the sample_weight argument of fit can be used to give more importance to certain classes or to certain individual samples.
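
A minimal sketch of a class-weighted SVM with scikit-learn's `SVC`. The train/test split settings are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

x, y = make_classification(n_samples=1000, n_features=2, n_redundant=0,
                           weights=[0.9, 0.1], n_classes=2, random_state=1)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, stratify=y, random_state=1)

# Unweighted SVM versus an SVM that penalizes minority-class errors more
plain = SVC().fit(x_train, y_train)
weighted = SVC(class_weight="balanced").fit(x_train, y_train)

# class_weight_ holds the per-class multipliers applied to the penalty C
print("per-class multipliers of C:", weighted.class_weight_)
print("minority recall, plain   :", recall_score(y_test, plain.predict(x_test)))
print("minority recall, weighted:", recall_score(y_test, weighted.predict(x_test)))
```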

7) <b>Data Level approach</b><br>
We can perform undersampling or oversampling<br>

a) <b>Under_Sampling</b><br>
Undersampling techniques remove majority class points. Some undersampling techniques are ENN (Edited Nearest Neighbours), Random Under Sampling, TomekLinks, etc.

b) <b>OverSampling</b><br>
Oversampling techniques add minority class points, either by duplicating existing ones or by creating artificial ones. Some oversampling techniques are Random Over Sampling, ADASYN, SMOTE, etc.


### Install imblearn

1) In Anaconda prompt, write the following<br>
<b>conda install -c conda-forge imbalanced-learn</b>

2) In CMD<br>
<b>pip install imbalanced-learn</b>

3) In Jupyter<br>
<b>!pip install imbalanced-learn</b>

In [7]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [8]:
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score
from sklearn.metrics import roc_curve,roc_auc_score

In [9]:
def gen_metrics(model, x_train, x_test, y_train, y_test):
    """Fit the model, print common classification metrics and plot the ROC curve."""
    model.fit(x_train, y_train)
    train_score = model.score(x_train, y_train)
    test_score = model.score(x_test, y_test)
    y_pred = model.predict(x_test)
    print('Predictions\n', y_pred)
    acc = accuracy_score(y_test, y_pred)
    print('Training score', train_score)
    print('Testing score', test_score)
    print('Accuracy_Score', acc)
    cm = confusion_matrix(y_test, y_pred)
    print('Confusion Matrix\n', cm)
    print('Classification Report\n', classification_report(y_test, y_pred))
    # Positive-class probabilities, computed once and reused for AUC and ROC
    y_proba = model.predict_proba(x_test)[:, 1]
    auc_score = roc_auc_score(y_test, y_proba)
    print('AUC Score', auc_score)
    fpr, tpr, thresh = roc_curve(y_test, y_proba)
    plt.plot(fpr, tpr, color='blue', label='ROC curve')
    # Diagonal reference line: a classifier that guesses at random
    plt.plot([0, 1], [0, 1], label='TPR=FPR', linestyle=':', color='black')
    plt.title('ROC_AUC Curve')
    plt.xlabel('FPR')
    plt.ylabel('TPR')
    plt.legend(loc=8)
    plt.grid()
    plt.show()

In [10]:
from sklearn.datasets import make_classification

In [11]:
x,y = make_classification(n_samples=1000,n_features=2,n_redundant=0,
                          weights=[0.9,0.1],n_classes=2,random_state=1)
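
The synthetic dataset above can also illustrate the effect of moving the decision threshold. The sketch below is one possible way to do it: `LogisticRegression`, the 70/30 split and the thresholds 0.5 and 0.3 are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

x, y = make_classification(n_samples=1000, n_features=2, n_redundant=0,
                           weights=[0.9, 0.1], n_classes=2, random_state=1)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, stratify=y, random_state=1)

model = LogisticRegression().fit(x_train, y_train)
proba = model.predict_proba(x_test)[:, 1]  # probability of the minority class

results = {}
for threshold in (0.5, 0.3):
    # Predict the positive class whenever its probability reaches the threshold
    y_pred = (proba >= threshold).astype(int)
    precision = precision_score(y_test, y_pred, zero_division=0)
    recall = recall_score(y_test, y_pred)
    results[threshold] = (precision, recall)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

Lowering the threshold can only keep or increase recall, usually at the cost of precision, so the right value depends on how expensive each type of error is.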

### OverSampling
<img src="oversampling.png" align="left">

In [39]:
from imblearn.over_sampling import RandomOverSampler

### UnderSampling

<img src="undersampling.png" align="left">

In [51]:
from imblearn.under_sampling import RandomUnderSampler

### SMOTE (Synthetic Minority OverSampling Technique)

1) Unlike random oversampling, which only duplicates randomly chosen examples from the minority class, SMOTE generates new examples by interpolating between a minority point and one of its nearest minority class neighbors (usually found with Euclidean distance), so the generated examples differ from the original minority points.<br>
2) Steps in SMOTE

    a) Identify the minority class vectors.
    b) Decide the number of nearest neighbors (k) to consider.
    c) Draw a line between a minority data point and one of its k neighbors, and place a synthetic point at a random position on that line.
    d) Repeat step c for the minority data points and their k neighbors until the data is balanced.
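
The interpolation step can be sketched from scratch as follows. This is a simplified illustration, not the imbalanced-learn implementation: the function name `smote_sketch`, its parameters and the random toy points are all hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_sketch(x_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority points."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbors because each point is its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(x_min)
    _, neigh_idx = nn.kneighbors(x_min)

    synthetic = np.empty((n_new, x_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(x_min))             # pick a minority point
        neighbor = rng.choice(neigh_idx[base, 1:])  # pick one of its k neighbors
        gap = rng.random()                          # random position on the line
        synthetic[i] = x_min[base] + gap * (x_min[neighbor] - x_min[base])
    return synthetic


# Hypothetical minority class: 20 random points in 2-D
x_min = np.random.default_rng(1).normal(size=(20, 2))
new_points = smote_sketch(x_min, n_new=30)
print(new_points[:3])
```

Because every synthetic point lies on a segment between two existing minority points, the new points stay inside the region already covered by the minority class.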

<img src="smote2.png" align="left">

In [56]:
from imblearn.over_sampling import SMOTE

#### Is it better to Undersample or OverSample?

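Oversampling is often preferred because it keeps all the information in the training dataset, whereas undersampling drops a lot of it. The trade-off is that oversampling enlarges the training set and, when points are simply duplicated, can encourage overfitting, so undersampling can still be reasonable when the majority class is very large.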

### Other Methods

#### ADASYN (Adaptive Synthetic Sampling)
1) It is an oversampling technique.<br>
2) It generates more synthetic examples in regions of the feature space where the density of minority examples is low (minority points surrounded by many majority class neighbors), and fewer or none where the density is high.<br>

#### BorderlineSMOTE
1) This algorithm starts by classifying the minority class observations. A minority observation is treated as a noise point if all of its neighbors belong to the majority class, and such an observation is ignored when creating synthetic data.<br>
2) It treats as border points the minority observations that have both majority and minority class points in their neighborhood, and it resamples only from these points (the extreme observations that a support vector machine would typically pay attention to).

<img src="borderline_smote.png" align="left" height="350" width="350">

#### TomekLinks
1) Tomek Links is an undersampling approach that identifies all the pairs of data points that are nearest to each other but belong to different classes, and these pairs (suppose a and b) are termed as Tomek links. Tomek Links follows these conditions:

    a and b are nearest neighbors of each other
    a and b belong to two different classes
    
2) The points (a, b) of a Tomek link lie on the boundary between the two classes. Removing the majority class point of each Tomek link therefore increases the class separation and also reduces the number of majority class samples along the boundary of the majority cluster.
<img src="tomek_links.png">
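
The two conditions above can be checked directly with a nearest-neighbor search. The sketch below is a simplified illustration rather than the imbalanced-learn implementation; the function `find_tomek_links` and the one-dimensional toy data are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def find_tomek_links(x, y):
    """Return index pairs (a, b) that are mutual nearest neighbors with different labels."""
    nn = NearestNeighbors(n_neighbors=2).fit(x)
    _, idx = nn.kneighbors(x)
    nearest = idx[:, 1]  # column 0 is the point itself
    links = []
    for a, b in enumerate(nearest):
        # a < b avoids reporting the same pair twice
        if a < b and nearest[b] == a and y[a] != y[b]:
            links.append((a, int(b)))
    return links


# Toy 1-D data: points 2 and 3 are close together but have different labels
x = np.array([[0.0], [1.1], [2.0], [2.4], [5.0], [6.2]])
y = np.array([0, 0, 0, 1, 1, 1])

links = find_tomek_links(x, y)
print("Tomek links:", links)
```

Points 4 and 5 are also mutual nearest neighbors, but they share a label, so they do not form a Tomek link.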

#### SMOTETomek
A hybrid method that combines an oversampling method (SMOTE) with an undersampling method (Tomek links): SMOTE first adds synthetic minority points, then Tomek links are removed to clean the class boundary.

In [12]:
from imblearn.over_sampling import ADASYN, BorderlineSMOTE
from imblearn.under_sampling import TomekLinks

In [64]:
from imblearn.combine import SMOTETomek,SMOTEENN