# Parrot Prediction Courses

## Handle Imbalanced Dataset

> Imbalanced data refers to classification problems where the classes are not equally represented.

This is a very common problem. In fraud detection, for example, the number of positive (fraud) instances is very small compared to the number of negative ones.

You can read a good introduction to tackling imbalanced datasets [here](http://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/).


LINKS:
- https://github.com/dmlc/xgboost/blob/master/demo/kaggle-higgs/higgs-cv.py (setting ratio)
- http://www.marcoaltini.com/blog/dealing-with-imbalanced-data-undersampling-oversampling-and-proper-cross-validation
- https://github.com/dmlc/xgboost/issues/144

### General advice
These are some common tactics for approaching imbalanced datasets:

- collect more data,
- use a better evaluation metric, one that is sensitive to mistakes on the minority class (e.g. AUC, F1, Kappa),
- try oversampling or undersampling,
- generate artificial samples of the minority class (e.g. with the SMOTE algorithm).
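
As a minimal sketch of the oversampling tactic listed above, the snippet below duplicates randomly chosen minority rows until both classes have the same size. The helper name `random_oversample` and the toy arrays are illustrative assumptions and not part of this course's code. It uses plain NumPy only.

```python
import numpy as np


def random_oversample(X, y, minority_label=1, seed=123):
    """Hypothetical helper: duplicate random minority rows until classes are balanced."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)

    # Number of extra minority rows needed to match the majority class
    n_extra = len(majority_idx) - len(minority_idx)
    if n_extra <= 0:
        return X, y

    # Sample minority rows with replacement and append them to the original rows
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    keep = np.concatenate([majority_idx, minority_idx, extra_idx])
    rng.shuffle(keep)
    return X[keep], y[keep]


# Toy data: 8 negative rows and 2 positive rows
X_toy = np.arange(20, dtype=float).reshape(10, 2)
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

X_bal, y_bal = random_oversample(X_toy, y_toy)
print(np.bincount(y_bal))  # class counts after oversampling
```

If you try this, resample only the training split. Duplicating rows before the train/test split would leak copies of the same samples into the test set.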

In XGBoost you can try to:
- keep the `min_child_weight` parameter small, so that leaves containing only a few minority samples are still allowed (the default is `min_child_weight=1`),
- assign larger weights to specific samples when initializing the `DMatrix`,
- control the balance of positive and negative weights using the `scale_pos_weight` parameter,
- use AUC for evaluation.
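
To illustrate why the choice of metric matters, here is a small sketch (toy labels, not the course data) that scores a classifier which always predicts the majority class. It uses standard functions from `sklearn.metrics`. The arrays `y_true`, `y_pred` and `y_score` are made up for this example.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    cohen_kappa_score,
    f1_score,
    roc_auc_score,
)

# Toy labels: 8 negatives and 2 positives
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# A "classifier" that always predicts the majority class with a constant score
y_pred = np.zeros_like(y_true)
y_score = np.full(len(y_true), 0.1)

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, zero_division=0)  # no predicted positives
kappa = cohen_kappa_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)

print('Accuracy: {:.2f}'.format(acc))
print('F1: {:.2f}'.format(f1))
print('Kappa: {:.2f}'.format(kappa))
print('AUC: {:.2f}'.format(auc))
```

Accuracy comes out at 0.80 even though no positive instance is found, while F1 and Kappa are 0 and AUC is 0.5, which is the chance level.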

### Prepare data
Load the essential libraries.

In [1]:
import numpy as np
import pandas as pd

import xgboost as xgb

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# reproducibility
seed = 123

Generate an imbalanced dataset for binary classification. Only about 10% of the samples (roughly 20 out of 200) will be positive.

In [2]:
X, y = make_classification(
    n_samples=200,
    n_features=5,
    n_informative=3,
    n_classes=2,
    weights=[.9, .1],
    shuffle=True,
    random_state=seed
)

In [3]:
print('Total number of positive instances: {}'.format(y.sum()))

Total number of positive instances: 21


Divide the data into train and test sets, stratifying on the label so that both keep the same class proportions.

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, stratify=y, random_state=seed)

In [5]:
print('Total number of positive train instances: {}'.format(y_train.sum()))
print('Total number of positive test instances: {}'.format(y_test.sum()))

Total number of positive train instances: 14
Total number of positive test instances: 7


### Baseline model

In [6]:
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test)

In [7]:
params = {
    'objective':'binary:logistic',
    'max_depth':1,
    'verbosity':0,  # replaces the deprecated 'silent' flag in recent XGBoost releases
    'eta':1
}

num_rounds = 15

In [8]:
bst = xgb.train(params, dtrain, num_rounds)
y_test_preds = (bst.predict(dtest) > 0.5).astype('int')

In [9]:
pd.crosstab(
    pd.Series(y_test, name='Actual'),
    pd.Series(y_test_preds, name='Predicted'),
    margins=True
)

| Actual / Predicted | 0 | 1 | All |
|---|---|---|---|
| 0 | 59 | 0 | 59 |
| 1 | 4 | 3 | 7 |
| All | 63 | 3 | 66 |


In [10]:
print('Accuracy: {0:.2f}'.format(accuracy_score(y_test, y_test_preds)))
print('Precision: {0:.2f}'.format(precision_score(y_test, y_test_preds)))
print('Recall: {0:.2f}'.format(recall_score(y_test, y_test_preds)))

Accuracy: 0.94
Precision: 1.00
Recall: 0.43


This is the [accuracy paradox](https://en.wikipedia.org/wiki/Accuracy_paradox?oldformat=true): accuracy is 0.94, yet recall is only 0.43, so the model misses 4 of the 7 positive instances.
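
The reported numbers can be re-derived by hand from the baseline confusion matrix above. The sketch below copies the four counts from that table and also computes the accuracy of a trivial model that always predicts the negative class. The variable names are chosen for this illustration only.

```python
import numpy as np

# Baseline confusion matrix copied from the table above
# (rows: actual 0/1, columns: predicted 0/1)
cm = np.array([[59, 0],
               [4, 3]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # 62 / 66
precision = tp / (tp + fp)        # 3 / 3
recall = tp / (tp + fn)           # 3 / 7

# Accuracy of a model that labels every test row as negative
majority_accuracy = (tn + fp) / cm.sum()  # 59 / 66

print('Accuracy: {:.2f}'.format(accuracy))
print('Precision: {:.2f}'.format(precision))
print('Recall: {:.2f}'.format(recall))
print('Always-negative accuracy: {:.2f}'.format(majority_accuracy))
```

Always predicting the negative class would already reach about 0.89 accuracy on this test set, so the baseline's 0.94 is a much smaller improvement than it first appears.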

### Focus on minority class
Begin by counting the positive and negative instances in the training set and computing their ratio.

In [11]:
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test)

In [12]:
train_labels = dtrain.get_label()

ratio = float(np.sum(train_labels == 0)) / np.sum(train_labels == 1)
params['scale_pos_weight'] = ratio
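
As a rough sanity check of the ratio computed above, the arithmetic can be repeated with the counts printed earlier. The training set size of 134 is an assumption derived from the 200 generated samples minus the 66 test rows shown in the confusion matrix.

```python
# Counts taken from the outputs earlier in this notebook
n_train = 200 - 66   # assumption: 200 samples in total, 66 of them in the test set
n_pos = 14           # positive training instances
n_neg = n_train - n_pos

ratio_check = n_neg / n_pos
print('negatives: {}, positives: {}, ratio: {:.2f}'.format(n_neg, n_pos, ratio_check))
```

With `scale_pos_weight` set to this value, each positive training instance counts roughly as much as 8.57 negative ones.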

In [13]:
bst = xgb.train(params, dtrain, num_rounds)
y_test_preds = (bst.predict(dtest) > 0.5).astype('int')

pd.crosstab(
    pd.Series(y_test, name='Actual'),
    pd.Series(y_test_preds, name='Predicted'),
    margins=True
)

| Actual / Predicted | 0 | 1 | All |
|---|---|---|---|
| 0 | 51 | 8 | 59 |
| 1 | 0 | 7 | 7 |
| All | 51 | 15 | 66 |


In [14]:
print('Accuracy: {0:.2f}'.format(accuracy_score(y_test, y_test_preds)))
print('Precision: {0:.2f}'.format(precision_score(y_test, y_test_preds)))
print('Recall: {0:.2f}'.format(recall_score(y_test, y_test_preds)))

Accuracy: 0.88
Precision: 0.47
Recall: 1.00


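You can see that we made a trade-off here. The model now identifies every minority (positive) instance, but it also labels 8 majority (negative) instances as positive, so precision drops from 1.00 to 0.47 and accuracy from 0.94 to 0.88.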

We can also set the weights manually when creating the `DMatrix` objects, which lets the algorithm focus on particular instances.