## XGBoost Practice

### Load in the data set
In this practice, we will use XGBoost with Python to build a supervised learning model that predicts historical housing prices in Boston.

This is a well-known dataset that has been used in many machine learning papers. The original data comes from the paper "Hedonic prices and the demand for clean air" by Harrison and Rubinfeld, J. Environ. Economics & Management, vol.5, 81-102, 1978.

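We first load the dataset from the sklearn package. Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell requires an older scikit-learn version.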

In [16]:
from sklearn.datasets import load_boston
boston = load_boston()
print(boston.DESCR)

Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
      

We then construct a pandas DataFrame with the price and all the features.

In [17]:
import pandas as pd

data = pd.DataFrame(boston.data)
data.columns = boston.feature_names
data['PRICE'] = boston.target
data.head()

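|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |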


### Simple XGBoost with CV
Load the XGBoost package and separate the features (the first 13 columns) from the labels (the last column).

In [22]:
import xgboost as xgb
import pandas as pd
import numpy as np

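# Features are every column except the last; the label is the last column (PRICE)
X, y = data.iloc[:, :-1], data.iloc[:, -1]
# DMatrix is the data container that xgb.cv() takes as input
data_dmatrix = xgb.DMatrix(data=X, label=y)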

You then define the cross-validation parameters and get the cross-validation results using [xgb.cv()](https://xgboost.readthedocs.io/en/latest/python/python_api.html).

The important parameters that you need to define are:
1. **learning_rate**: step size shrinkage used to prevent overfitting. Range is [0,1].
2. **max_depth**: the maximum depth each tree is allowed to grow to in any boosting round.
3. **objective**: the loss function to be used, such as reg:linear for regression problems (deprecated in newer XGBoost versions in favor of the equivalent reg:squarederror), reg:logistic for logistic regression, and binary:logistic for binary classification that outputs probabilities.
4. **alpha**: L1 regularization on leaf weights. A larger value leads to more regularization.
5. **colsample_bytree**: fraction of features sampled for each tree. A high value can lead to overfitting.
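
To make the roles of alpha and learning_rate more concrete, here is a minimal sketch of how the value of a single leaf could be computed under a squared-error objective. It is a simplified illustration, not XGBoost's actual code: the helper names (soft_threshold, leaf_weight), the toy labels, the starting prediction of 20.0, and the L2 term reg_lambda=1.0 are all assumptions made for the example.

```python
def soft_threshold(g, alpha):
    """Shrink g toward zero by alpha (the effect of an L1 penalty)."""
    if g > alpha:
        return g - alpha
    if g < -alpha:
        return g + alpha
    return 0.0


def leaf_weight(grad_sum, hess_sum, alpha=0.0, reg_lambda=1.0):
    """Regularized leaf value from summed gradients and hessians (sketch)."""
    return -soft_threshold(grad_sum, alpha) / (hess_sum + reg_lambda)


# Toy rows that fall into one leaf (assumed values, for illustration only).
labels = [24.0, 21.6, 34.7]
preds = [20.0, 20.0, 20.0]

# For squared error, each row has gradient (prediction - label) and hessian 1.
grad_sum = sum(p - y for p, y in zip(preds, labels))
hess_sum = float(len(labels))

w_no_l1 = leaf_weight(grad_sum, hess_sum, alpha=0.0)
w_l1 = leaf_weight(grad_sum, hess_sum, alpha=10.0)

# The learning rate scales the leaf value before it is added to the prediction.
learning_rate = 0.1
update = learning_rate * w_l1

print(w_no_l1, w_l1, update)
```

A larger alpha pulls the leaf value toward zero, and a smaller learning_rate makes each tree contribute a smaller correction.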


Let us first use "reg:linear" as objective, colsample_bytree = 0.3, learning_rate = 0.1, max_depth=5 and alpha=10.

For cross-validation, you need to specify:
1. **num_boost_round**: the number of trees you build (analogous to n_estimators).
2. **metrics**: the evaluation metrics to be watched during CV.
3. **as_pandas**: whether to return the results in a pandas DataFrame.
4. **early_stopping_rounds**: stops training early if the hold-out metric ("rmse" in our case) does not improve for a given number of rounds.
5. **seed**: for reproducibility of results.
6. **nfold**: the number of cross-validation folds.
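
The early-stopping rule can be sketched in plain Python as follows. This illustrates the idea only and is not XGBoost's implementation; the function name best_round_with_early_stopping and the sample scores are hypothetical.

```python
def best_round_with_early_stopping(scores, patience):
    """Scan per-round validation scores (lower is better) and stop once
    `patience` consecutive rounds pass without a new best score.

    Returns (best_round_index, best_score).
    """
    best_score = float("inf")
    best_round = -1
    for i, score in enumerate(scores):
        if score < best_score:
            best_score, best_round = score, i
        elif i - best_round >= patience:
            break  # no improvement for `patience` rounds: stop here
    return best_round, best_score


# Made-up validation RMSE values for six boosting rounds.
rmse_per_round = [5.0, 4.0, 3.5, 3.6, 3.7, 3.4]

stopped_early = best_round_with_early_stopping(rmse_per_round, patience=2)
ran_longer = best_round_with_early_stopping(rmse_per_round, patience=3)
print(stopped_early, ran_longer)
```

With a patience of 2 the scan stops before it reaches the later, better score, which shows the trade-off in choosing early_stopping_rounds.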

We will use num_boost_round=50, early_stopping_rounds=10, metrics="rmse", as_pandas=True, seed=123, nfold=5.

In [27]:
params = {"objective": "reg:linear", "colsample_bytree": 0.3, "learning_rate": 0.1, "max_depth": 5, "alpha": 10}
cv_results = xgb.cv(dtrain=data_dmatrix, params=params, nfold=5,
                    num_boost_round=50, early_stopping_rounds=10, metrics="rmse", as_pandas=True, seed=123)

[10:48:17] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 6 extra nodes, 0 pruned nodes, max_depth=2
[10:48:17] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 8 extra nodes, 0 pruned nodes, max_depth=3
[10:48:17] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 6 extra nodes, 0 pruned nodes, max_depth=2
[10:48:17] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 6 extra nodes, 0 pruned nodes, max_depth=2
[10:48:17] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 6 extra nodes, 0 pruned nodes, max_depth=2
[10:48:17] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 8 extra nodes, 0 pruned nodes, max_depth=3
[10:48:17] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 4 extra nodes, 0 pruned nodes, max_depth=2
[10:48:17] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 10 extra nodes, 0 pruned nodes, max_depth=3
[10:48:17] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 8 extra nodes, 0 pruned nodes, max_depth=3


Inspect the cross-validation results stored in the cv_results DataFrame.

In [28]:
cv_results.head()

|   | train-rmse-mean | train-rmse-std | test-rmse-mean | test-rmse-std |
|---|---|---|---|---|
| 0 | 21.647154 | 0.125324 | 21.65707 | 0.486184 |
| 1 | 19.713919 | 0.108728 | 19.712102 | 0.495758 |
| 2 | 17.97946 | 0.154234 | 18.00343 | 0.443481 |
| 3 | 16.356188 | 0.145047 | 16.405221 | 0.414656 |
| 4 | 14.950717 | 0.152347 | 15.005437 | 0.400347 |
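
Each row of this table aggregates per-fold RMSE values into a mean and a standard deviation. A minimal sketch of that aggregation is shown below, using made-up fold predictions and assuming a population standard deviation; the exact estimator used by xgb.cv is not stated here, so treat that choice as an assumption.

```python
import math
import statistics


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical (labels, predictions) pairs for two validation folds.
folds = [
    ([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 8.0]),
    ([0.0, 0.0], [4.0, 4.0]),
]

fold_rmse = [rmse(y_true, y_pred) for y_true, y_pred in folds]
rmse_mean = statistics.mean(fold_rmse)
rmse_std = statistics.pstdev(fold_rmse)  # population std (assumed)

print(fold_rmse, rmse_mean, rmse_std)
```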


### XGBoost with Parameter Tuning

Let us now start tuning the parameters of XGBoost. For more information on which parameters to tune, see the XGBoost documentation: https://xgboost.readthedocs.io/en/latest/tutorials/param_tuning.html

In this part, you need to create a grid of parameters you want to search, and use CV to fine tune the parameters. You will then read your cross-validation results. 

You can use a for-loop to tune the parameters. You can also use "from sklearn.model_selection import GridSearchCV" to tune your parameters; older scikit-learn versions exposed it as sklearn.grid_search, which was removed in 0.20. (For-loop is recommended).

In [60]:
from sklearn.model_selection import GridSearchCV

xgb_model = xgb.XGBRegressor()

# Grid of candidate values; every combination is evaluated with cross-validation
param_grid = {'max_depth': [2, 3, 4, 5, 6],
              'objective': ['reg:linear'],
              'colsample_bytree': [0.1, 0.2, 0.3, 0.4, 0.5],
              'learning_rate': [0.1, 0.2, 0.3, 0.4]}

# cv=3 matches the 3-fold output shown below
clf = GridSearchCV(xgb_model, param_grid, cv=3, verbose=1)
clf.fit(X, y)
print(clf.best_score_)   # mean cross-validated score (R^2 by default for a regressor)
print(clf.best_params_)  # parameter combination that achieved the best score

Fitting 3 folds for each of 100 candidates, totalling 300 fits
0.6446303734872245
{'colsample_bytree': 0.5, 'learning_rate': 0.3, 'max_depth': 3, 'objective': 'reg:linear'}


[Parallel(n_jobs=1)]: Done 300 out of 300 | elapsed:    5.9s finished
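
The for-loop approach mentioned above can be sketched with itertools.product. Here grid_search and toy_evaluate are hypothetical names, and toy_evaluate is only a stand-in for a real cross-validated score, so that the block runs without any data or extra packages.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every combination in param_grid and return (best_params, best_score),
    where a lower score from evaluate(params) is better."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_evaluate(params):
    # Hypothetical stand-in for a cross-validated RMSE: it is smallest at
    # max_depth=3 and learning_rate=0.2 purely by construction.
    return (params["max_depth"] - 3) ** 2 + abs(params["learning_rate"] - 0.2)


param_grid = {
    "max_depth": [2, 3, 4, 5, 6],
    "learning_rate": [0.1, 0.2, 0.3, 0.4],
}

best_params, best_score = grid_search(param_grid, toy_evaluate)
print(best_params, best_score)
```

In practice, evaluate could call xgb.cv with the candidate parameters and return the last value of the test-rmse-mean column, which keeps the loop itself unchanged.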
