# Parrot Prediction Courses

## Deal with missing values
This notebook demonstrates XGBoost's resilience to missing values. Two approaches, the native interface and the Sklearn wrapper, are tested against a dataset with missing values.

**What you will learn**:
- how XGBoost handles missing data

### Prepare data
Begin by loading the required libraries.

In [None]:
import numpy as np
import xgboost as xgb

from xgboost.sklearn import XGBClassifier

from sklearn.model_selection import cross_val_score

# reproducibility
seed = 123

Let's prepare a valid dataset with no missing values. There are 10 samples, each with 5 randomly generated features, and each sample will be assigned to one of two classes.

In [None]:
# create valid dataset
np.random.seed(seed)

data_v = np.random.rand(10,5) # 10 entities, each contains 5 features
data_v

For the second dataset, copy the valid one and replace some entries with missing values.

In [None]:
# add some missing values
data_m = np.copy(data_v)

data_m[2, 3] = np.nan
data_m[0, 1] = np.nan
data_m[0, 2] = np.nan
data_m[1, 0] = np.nan
data_m[4, 4] = np.nan
data_m[7, 2] = np.nan
data_m[9, 1] = np.nan

data_m
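
Before training, it can help to confirm where the gaps actually are. The sketch below rebuilds the same array as above and counts `np.nan` entries per feature and per sample. The names `missing_mask`, `per_feature` and `per_sample` are introduced here for illustration only.

```python
import numpy as np

# rebuild the dataset exactly as in the cells above
np.random.seed(123)
data_v = np.random.rand(10, 5)

data_m = np.copy(data_v)
for row, col in [(2, 3), (0, 1), (0, 2), (1, 0), (4, 4), (7, 2), (9, 1)]:
    data_m[row, col] = np.nan

# boolean mask: True where a value is missing
missing_mask = np.isnan(data_m)

per_feature = missing_mask.sum(axis=0)  # missing count for each of the 5 features
per_sample = missing_mask.sum(axis=1)   # missing count for each of the 10 samples

print("missing per feature:", per_feature)
print("missing per sample: ", per_sample)
print("total missing:      ", missing_mask.sum())
```

Every feature has at least one gap, so any split the model picks may have to route a sample with a missing value.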

Also generate the target variable. Each sample is assigned to one of two classes, so we are dealing with a binary classification problem.

In [None]:
np.random.seed(seed)

label = np.random.randint(2, size=10) # binary target
label

### Native interface
In this case we will check how the native interface handles missing data. Begin by specifying the training parameters.

In [None]:
# specify general training parameters
params = {
    'objective':'binary:logistic',
    'max_depth':1,
    'silent':1,
    'eta':0.5
}

num_rounds = 5

In the experiment we first create a valid `DMatrix` (with all values), check that it works, and then repeat the process with the dataset that contains missing values.

In [None]:
dtrain_v = xgb.DMatrix(data_v, label=label)

Cross-validate results

In [None]:
xgb.cv(params, dtrain_v, num_rounds, seed=seed)

The scores themselves are not meaningful, because the data is completely random. The point is that the algorithm runs and tries to fit whatever structure it can find.

When creating a `DMatrix` that holds missing values, we have to state explicitly which value denotes a missing entry. Sometimes it might be `0`, `999` or another sentinel. In our case it is NumPy's `np.nan`. Pass the `missing` argument to the `DMatrix` constructor to handle it.

In [None]:
dtrain_m = xgb.DMatrix(data_m, label=label, missing=np.nan)

Cross-validate results

In [None]:
xgb.cv(params, dtrain_m, num_rounds, seed=seed)

The algorithm also works with missing values.

Missing values are common in real-world datasets. No single rule for handling them applies to all cases, since values can be missing for various reasons. XGBoost takes a soft approach: when it splits on a feature that has missing values, it assigns a direction to the missing values instead of a numerical value. Specifically, it tries sending all data points with a missing value to the left child and then to the right child, and chooses the direction that yields the higher gain with regard to the objective.
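
The direction choice described above can be sketched in plain NumPy. This is a simplified illustration under stated assumptions, not XGBoost's actual implementation: the helper names `split_gain` and `best_default_direction`, the toy data, the fixed threshold, and the regularisation value `reg_lambda=1.0` are all hypothetical. The gain uses the usual sum-of-gradients and sum-of-hessians structure score for logistic loss, up to a constant factor.

```python
import numpy as np

def split_gain(g, h, left_mask, reg_lambda=1.0):
    """Gain of a split computed from gradient (g) and hessian (h) sums."""
    def score(mask):
        return g[mask].sum() ** 2 / (h[mask].sum() + reg_lambda)
    everything = np.ones(left_mask.shape, dtype=bool)
    return score(left_mask) + score(~left_mask) - score(everything)

def best_default_direction(x, g, h, threshold):
    """Try routing missing values left, then right; keep the better one."""
    missing = np.isnan(x)
    goes_left = np.zeros(x.shape, dtype=bool)
    goes_left[~missing] = x[~missing] < threshold

    gain_left = split_gain(g, h, goes_left | missing)  # missing sent left
    gain_right = split_gain(g, h, goes_left)           # missing sent right
    if gain_left >= gain_right:
        return "left", gain_left
    return "right", gain_right

# toy feature with two missing entries (assumed data, for illustration)
x = np.array([0.1, 0.2, np.nan, 0.8, 0.9, np.nan])
y = np.array([0, 0, 0, 1, 1, 0])

# logistic loss with an initial prediction of 0.5 for every sample
p = np.full(len(y), 0.5)
g = p - y          # first-order gradients
h = p * (1.0 - p)  # second-order gradients (hessians)

direction, gain = best_default_direction(x, g, h, threshold=0.5)
print(direction, round(gain, 4))
```

In this toy case both samples with a missing value belong to class 0, like the samples below the threshold, so sending them left gives the larger gain.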

### Sklearn wrapper
The following section shows how to validate the same behaviour using the Sklearn interface.

Begin with defining parameters and creating an estimator object.

In [None]:
params = {
    'objective': 'binary:logistic',
    'max_depth': 1,
    'learning_rate': 0.5,
    'silent': 1.0,
    'n_estimators': 5
}

In [None]:
clf = XGBClassifier(**params)
clf

Cross-validate the results with the full dataset. Because we have only 10 samples, we will perform 2-fold CV.

In [None]:
cross_val_score(clf, data_v, label, cv=2, scoring='accuracy')

Some score was obtained; we won't dig into its interpretation.

See if things also work with missing values.

In [None]:
cross_val_score(clf, data_m, label, cv=2, scoring='accuracy')

Both methods work with datasets that contain missing values. By default the XGBoost Sklearn wrapper treats `np.nan` as missing, so you will need additional pre-processing if your data uses a different convention.
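
As a minimal sketch of such pre-processing, assume the raw data marks missing entries with a sentinel value. The `-999.0` marker and the `raw` array below are hypothetical and chosen only for illustration. Converting the sentinel to `np.nan` lets the wrapper's default behaviour apply.

```python
import numpy as np

SENTINEL = -999.0  # hypothetical marker used for "missing" in the raw data

raw = np.array([
    [0.5, SENTINEL, 0.1],
    [0.3, 0.7, SENTINEL],
    [0.9, 0.2, 0.4],
])

# replace every sentinel entry with np.nan, leave other values untouched
cleaned = np.where(raw == SENTINEL, np.nan, raw)

n_missing = int(np.isnan(cleaned).sum())
print(cleaned)
print("missing entries:", n_missing)
```

The `cleaned` array can then be passed to `cross_val_score` in the same way as `data_m`. With the native interface, the alternative is to pass the sentinel through the `missing` argument of `DMatrix`, as shown earlier.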