# XGBoost

Gradient boosting is currently one of the most popular techniques for efficient modeling of tabular datasets of all sizes. 

XGBoost is a fast, scalable implementation of gradient boosting. Models built with XGBoost regularly win online data science competitions and are used at scale across many industries.

![xgboost](xgboost.png)

![xgboost_popularity](xgboost_popularity.png)

## XGBoost for Classification

![class_question](class_question.png)

![binary_class_question](binary_class_question.png)

We'll use the following evaluation metrics:

- **AUC** (area under the ROC curve) for binary classification problems
- **Accuracy** for multiclass classification problems
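
Both metrics can be computed by hand on a tiny example. The sketch below is plain Python with made-up labels and scores; the function names `accuracy` and `roc_auc` are illustrative, not part of XGBoost. It treats AUC as the probability that a randomly chosen positive example is scored above a randomly chosen negative one.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def roc_auc(y_true, scores):
    """Share of (positive, negative) pairs where the positive is scored higher.

    A tie between a positive and a negative counts as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical true labels and predicted probabilities
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# Turn probabilities into hard labels with a 0.5 threshold
labels = [1 if s >= 0.5 else 0 for s in scores]

print("accuracy:", accuracy(y_true, labels))
print("AUC:", roc_auc(y_true, scores))
```

Accuracy depends on the chosen threshold, while AUC only depends on how the scores rank the examples.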

### XGBoost: Fit/Predict

In [None]:
# Import xgboost and the helpers used below
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Create arrays for the features and the target: X, y
X, y = churn_data.iloc[:,:-1], churn_data.iloc[:,-1]

# Create the training and test sets
X_train, X_test, y_train, y_test= train_test_split(X, y, test_size=0.2, random_state=123)

# Instantiate the XGBClassifier: xg_cl
xg_cl = xgb.XGBClassifier(objective='binary:logistic', n_estimators=10, seed=123)

# Fit the classifier to the training set
xg_cl.fit(X_train, y_train)

# Predict the labels of the test set: preds
preds = xg_cl.predict(X_test)

# Compute the accuracy: accuracy
accuracy = float(np.sum(preds==y_test))/y_test.shape[0]
print("accuracy: %f" % (accuracy))

accuracy: 0.743300

![decision_tree](decision_tree.png)

![dt_base_learner](dt_base_learner.png)

![dt_disadvantage](dt_disadvantage.png)

![cart](cart.png)

### Decision Tree: Fit/Predict

In [None]:
# Import the necessary modules
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Instantiate the classifier: dt_clf_4
dt_clf_4 = DecisionTreeClassifier(max_depth=4)

# Fit the classifier to the training set
dt_clf_4.fit(X_train, y_train)

# Predict the labels of the test set: y_pred_4
y_pred_4 = dt_clf_4.predict(X_test)

# Compute the accuracy of the predictions: accuracy
accuracy = float(np.sum(y_pred_4==y_test))/y_test.shape[0]
print("accuracy:", accuracy)

accuracy: 0.9649122807017544

![boosting](boosting.png)

![weak_and_strong_learners](weak_and_strong_learners.png)

![boosting_working](boosting_working.png)

![boosting_example](boosting_example.png)

![cross_validation](cross_validation.png)

### Cross Validation in XGBoost Example

### Measuring accuracy

In [None]:
# Create arrays for the features and the target: X, y
X, y = churn_data.iloc[:,:-1], churn_data.iloc[:,-1]

# Create the DMatrix from X and y: churn_dmatrix
churn_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
params = {"objective":"reg:logistic", "max_depth":3}

# Perform cross-validation: cv_results
cv_results = xgb.cv(dtrain=churn_dmatrix, params=params, 
                  nfold=3, num_boost_round=5, 
                  metrics="error", as_pandas=True, seed=123)

# Print cv_results
print(cv_results)

# Print the accuracy
print(((1-cv_results["test-error-mean"]).iloc[-1]))

       train-error-mean  train-error-std  test-error-mean  test-error-std
    0           0.28232         0.002366          0.28378        0.001932
    1           0.26951         0.001855          0.27190        0.001932
    2           0.25605         0.003213          0.25798        0.003963
    3           0.25090         0.001845          0.25434        0.003827
    4           0.24654         0.001981          0.24852        0.000934
    0.75148
    
cv_results stores the training and test mean and standard deviation of the error per boosting round (tree built) as a DataFrame. 

From cv_results, the final round 'test-error-mean' is extracted and converted into an accuracy, where accuracy is 1-error. 

The final accuracy of around 75% is an improvement from earlier!
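
The mean and standard deviation columns are aggregates over the folds. A minimal sketch of that bookkeeping, using hypothetical per-fold errors and assuming a population standard deviation, looks like this:

```python
from statistics import mean, pstdev

# Hypothetical test error of each of the 3 folds after each boosting round
fold_errors = [
    [0.30, 0.28, 0.26],  # round 0
    [0.27, 0.27, 0.24],  # round 1
    [0.25, 0.26, 0.24],  # round 2
]

# One row per boosting round, mirroring the layout of cv_results
rows = []
for round_idx, errors in enumerate(fold_errors):
    rows.append({
        "round": round_idx,
        "test-error-mean": mean(errors),
        "test-error-std": pstdev(errors),
    })

for row in rows:
    print(row)

# Accuracy is 1 - error, taken from the final boosting round
final_accuracy = 1 - rows[-1]["test-error-mean"]
print("final accuracy:", round(final_accuracy, 4))
```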

### Measuring AUC

In [None]:
# Perform cross_validation: cv_results
cv_results = xgb.cv(dtrain=churn_dmatrix, params=params, 
                  nfold=3, num_boost_round=5, 
                  metrics="auc", as_pandas=True, seed=123)

# Print cv_results
print(cv_results)

# Print the AUC
print((cv_results["test-auc-mean"]).iloc[-1])

       train-auc-mean  train-auc-std  test-auc-mean  test-auc-std
    0        0.768893       0.001544       0.767863      0.002820
    1        0.790864       0.006758       0.789157      0.006846
    2        0.815872       0.003900       0.814476      0.005997
    3        0.822959       0.002018       0.821682      0.003912
    4        0.827528       0.000769       0.826191      0.001937
    0.826191
    
An AUC of about 0.83 is quite strong. As you have seen, XGBoost's learning API makes it easy to switch evaluation metrics: only the metrics argument changed between the two runs.

![when_to_use_xgboost](when_to_use_xgboost.png)

![when_not_to_use_xgboost](when_not_to_use_xgboost.png)

## XGBoost for Regression

![regression_question](regression_question.png)

Evaluation metrics:
- MAE (mean absolute error)
- MSE (mean squared error)
- RMSE (root mean squared error)
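
As a quick worked example, the three metrics can be computed by hand. The house prices and predictions below are made up for illustration.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return math.sqrt(mse(y_true, y_pred))


# Hypothetical sale prices and predictions
y_true = [200000, 150000, 320000, 275000]
y_pred = [210000, 140000, 300000, 275000]

print("MAE: ", mae(y_true, y_pred))
print("MSE: ", mse(y_true, y_pred))
print("RMSE:", round(rmse(y_true, y_pred), 2))
```

Because errors are squared before averaging, RMSE weights the single 20,000 miss more heavily than MAE does, so RMSE comes out larger than MAE here.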

![loss_functions](loss_functions.png)

![xgboost_loss_functions](xgboost_loss_functions.png)

![base_learners](base_learners.png)

### Trees as base learners example

By default, XGBoost uses trees as base learners, so you don't have to specify that you want to use trees here with booster="gbtree".

In [None]:
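# Import the metric used to compute the RMSE
from sklearn.metrics import mean_squared_error

# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Instantiate the XGBRegressor: xg_reg
# ("reg:squarederror" replaces the deprecated "reg:linear" name)
xg_reg = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=10, seed=123)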

# Fit the regressor to the training set
xg_reg.fit(X_train, y_train)

# Predict the labels of the test set: preds
preds = xg_reg.predict(X_test)

# Compute the rmse: rmse
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))

RMSE: 78847.401758
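
To make the idea of trees as base learners concrete, here is a minimal sketch of boosting in plain Python. It assumes squared-error loss and one-split trees (stumps) on a single feature. The helper names `fit_stump` and `boost` and the toy data are made up for illustration; XGBoost itself adds regularization, second-order gradient information, and much faster tree construction.

```python
import math


def fit_stump(x, residuals):
    """Find the one-split tree on x that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(x))[1:]:  # skip the minimum so both sides are non-empty
        left = [r for xi, r in zip(x, residuals) if xi < threshold]
        right = [r for xi, r in zip(x, residuals) if xi >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi < threshold else right_mean


def boost(x, y, n_rounds, learning_rate=0.3):
    """Each round fits a stump to the current residuals and adds a shrunken copy."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        preds = [pi + learning_rate * stump(xi) for xi, pi in zip(x, preds)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Toy one-feature dataset
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [5, 6, 7, 15, 16, 17, 30, 31]

baseline = sum(y) / len(y)
model_1 = boost(x, y, n_rounds=1)
model_10 = boost(x, y, n_rounds=10)

rmse_0 = rmse(y, [baseline] * len(y))
rmse_1 = rmse(y, [model_1(xi) for xi in x])
rmse_10 = rmse(y, [model_10(xi) for xi in x])
print("mean only:", round(rmse_0, 3))
print("1 round:  ", round(rmse_1, 3))
print("10 rounds:", round(rmse_10, 3))
```

Each stump on its own is a weak model, but the training error shrinks as more of them are added.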

### Linear base learners example: Learning API

This model, although not as commonly used in XGBoost, allows you to create a regularized linear regression using XGBoost's powerful learning API. 

However, because it's uncommon, you have to use XGBoost's own non-scikit-learn compatible functions to build the model, such as xgb.train().

In [None]:
# Convert the training and testing sets into DMatrixes: DM_train, DM_test
DM_train = xgb.DMatrix(data=X_train, label=y_train)
DM_test =  xgb.DMatrix(data=X_test, label=y_test)

# Create the parameter dictionary: params
params = {"booster":"gblinear", "objective":"reg:squarederror"}

# Train the model: xg_reg
xg_reg = xgb.train(params = params, dtrain=DM_train, num_boost_round=5)

# Predict the labels of the test set: preds
preds = xg_reg.predict(DM_test)

# Compute and print the RMSE
rmse = np.sqrt(mean_squared_error(y_test,preds))
print("RMSE: %f" % (rmse))

RMSE: 41699.961001

Interesting - it looks like linear base learners performed better!

### Evaluating model quality

Here, you will compare the RMSE and MAE of a cross-validated XGBoost model on the Ames housing data.

**Perform 4-fold cross-validation with 5 boosting rounds and "rmse" as the metric.**

In [None]:
# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
params = {"objective":"reg:squarederror", "max_depth":4}

# Perform cross-validation: cv_results
cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=4, num_boost_round=5, metrics='rmse', as_pandas=True, seed=123)

# Print cv_results
print(cv_results)

# Extract and print final boosting round metric
print((cv_results["test-rmse-mean"]).tail(1))

       train-rmse-mean  train-rmse-std  test-rmse-mean  test-rmse-std
    0    141767.527344      429.450237   142980.429688    1193.794436
    1    102832.542969      322.473304   104891.392578    1223.157623
    2     75872.617187      266.469946    79478.935547    1601.344218
    3     57245.650390      273.623926    62411.919922    2220.151196
    4     44401.297851      316.422372    51348.281250    2963.378741
    
51348.28125

**Perform 4-fold cross-validation with 5 boosting rounds and "mae" as the metric.**

In [None]:
# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
params = {"objective":"reg:squarederror", "max_depth":4}

# Perform cross-validation: cv_results
cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=4, num_boost_round=5, metrics='mae', as_pandas=True, seed=123)

# Print cv_results
print(cv_results)

# Extract and print final boosting round metric
print((cv_results["test-mae-mean"]).tail(1))

       train-mae-mean  train-mae-std  test-mae-mean  test-mae-std
    0   127343.484375     668.343954  127633.974610   2404.006927
    1    89770.052735     456.957620   90122.500000   2107.913315
    2    63580.791992     263.408277   64278.561524   1887.563452
    3    45633.152344     151.885300   46819.168945   1459.816196
    4    33587.093750      86.999137   35670.646485   1140.609806

35670.646485

## Regularization and base learners in XGBoost

![regularization_xgboost](regularization_xgboost.png)

![base_learners_in_xgboost](base_learners_in_xgboost.png)

### Using regularization in XGBoost

Here you'll vary the L2 regularization penalty (also known as "lambda") and see its effect on overall model performance on the Ames housing dataset.

In [None]:
# Import pandas for the summary table
import pandas as pd

# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

reg_params = [1, 10, 100]

# Create the initial parameter dictionary for varying l2 strength: params
params = {"objective":"reg:squarederror","max_depth":3}

# Create an empty list for storing rmses as a function of l2 complexity
rmses_l2 = []

# Iterate over reg_params
for reg in reg_params:

    # Update l2 strength
    params["lambda"] = reg
    
    # Pass this updated param dictionary into cv
    cv_results_rmse = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2, num_boost_round=5, metrics="rmse", as_pandas=True, seed=123)
    
    # Append best rmse (final round) to rmses_l2
    rmses_l2.append(cv_results_rmse["test-rmse-mean"].tail(1).values[0])

# Look at best rmse per l2 param
print("Best rmse as a function of l2:")
print(pd.DataFrame(list(zip(reg_params, rmses_l2)), columns=["l2", "rmse"]))

        l2          rmse
    0    1  52275.359375
    1   10  57746.062500
    2  100  76624.625000
    
It looks like as the value of 'lambda' increases, so does the RMSE.
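
One way to see why: in XGBoost's objective, the value assigned to a leaf is -G / (H + lambda), where G and H are the sums of the first and second derivatives of the loss over the samples in that leaf. Assuming squared-error loss, this reduces to sum(residuals) / (n + lambda). The sketch below uses hypothetical residuals for a single leaf to show how a larger lambda shrinks the leaf value, so each tree takes a smaller step.

```python
def leaf_weight(residuals, reg_lambda):
    """Leaf value under squared-error loss: sum(residuals) / (n + lambda)."""
    return sum(residuals) / (len(residuals) + reg_lambda)


# Hypothetical residuals (target minus current prediction) landing in one leaf
residuals = [12000.0, 8000.0, 10000.0, 10000.0]

weights = {}
for reg_lambda in [1, 10, 100]:
    weights[reg_lambda] = leaf_weight(residuals, reg_lambda)
    print(reg_lambda, round(weights[reg_lambda], 2))
```

With only 5 boosting rounds, heavily shrunken leaf values leave the model far from the targets, which is one plausible reading of the rising RMSE in the table above.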

### Visualizing individual XGBoost trees

Here, you will visualize individual trees from the fully boosted model that XGBoost creates using the entire housing dataset.

XGBoost has a plot_tree() function that makes this type of visualization easy. 

Once you train a model using the XGBoost learning API, you can pass it to the plot_tree() function along with the index of the tree you want to plot, using the num_trees argument. Trees are numbered from 0, so num_trees=4 plots the fifth tree.

In [None]:
# Import matplotlib to display the plots
import matplotlib.pyplot as plt

# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
params = {"objective":"reg:squarederror", "max_depth":2}

# Train the model: xg_reg
xg_reg = xgb.train(params=params, dtrain=housing_dmatrix, num_boost_round=10)

# Plot the first tree
xgb.plot_tree(xg_reg, num_trees=0)
plt.show()

# Plot the fifth tree
xgb.plot_tree(xg_reg, num_trees=4)
plt.show()

Find output here:

https://campus.datacamp.com/courses/extreme-gradient-boosting-with-xgboost/regression-with-xgboost?ex=9

These plots provide insight into how the model arrived at its final decisions and what splits it made to arrive at those decisions. This allows us to identify which features are the most important in determining house price.

### Visualizing feature importances: What features are most important in my dataset

Another way to visualize your XGBoost models is to examine the importance of each feature column in the original dataset within the model.

One simple way of doing this involves counting the number of times each feature is split on across all boosting rounds (trees) in the model, and then visualizing the result as a bar graph, with the features ordered according to how many times they appear. XGBoost has a plot_importance() function that allows you to do exactly this.

In [None]:
# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
params = {'objective':'reg:squarederror', 'max_depth':4}

# Train the model: xg_reg
xg_reg = xgb.train(dtrain=housing_dmatrix, params=params, num_boost_round=10)

# Plot the feature importances
xgb.plot_importance(xg_reg)
plt.show()

![feature_importance](feature_importance.svg)

It looks like GrLivArea is the most important feature.
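
The split-counting idea described above can be sketched in a few lines of plain Python. The nested-dictionary tree format and the two small trees are hypothetical stand-ins for a trained model, not XGBoost's internal representation.

```python
from collections import Counter

# Hypothetical trees: internal nodes name the feature they split on, leaves are numbers
trees = [
    {"feature": "GrLivArea",
     "left": {"feature": "LotFrontage", "left": -1.0, "right": 0.5},
     "right": {"feature": "GrLivArea", "left": 1.0, "right": 2.5}},
    {"feature": "LotFrontage",
     "left": -0.3,
     "right": {"feature": "GrLivArea", "left": 0.2, "right": 0.9}},
]


def count_splits(node, counts):
    """Walk one tree and count how often each feature is used in a split."""
    if isinstance(node, dict):
        counts[node["feature"]] += 1
        count_splits(node["left"], counts)
        count_splits(node["right"], counts)
    return counts


counts = Counter()
for tree in trees:
    count_splits(tree, counts)

# Text version of the bar chart, most frequently used feature first
for feature, n in counts.most_common():
    print(f"{feature:<12} {'#' * n} ({n})")
```

Counting splits is only one notion of importance: a feature used in many low-gain splits can rank above a feature used in a few decisive ones.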

## Tuning an XGBoost model

![tune](tune.png)

### Tuning the number of boosting rounds

Let's start with parameter tuning by seeing how the number of boosting rounds (number of trees you build) impacts the out-of-sample performance of your XGBoost model. 

You'll use xgb.cv() inside a for loop and build one model per num_boost_round parameter.

In [None]:
# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary for each tree: params 
params = {"objective":"reg:squarederror", "max_depth":3}

# Create list of number of boosting rounds
num_rounds = [5, 10, 15]

# Empty list to store final round rmse per XGBoost model
final_rmse_per_round = []

# Iterate over num_rounds and build one model per num_boost_round parameter
for curr_num_rounds in num_rounds:

    # Perform cross-validation: cv_results
    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=3, num_boost_round=curr_num_rounds, metrics="rmse", as_pandas=True, seed=123)
    
    # Append final round RMSE
    final_rmse_per_round.append(cv_results["test-rmse-mean"].tail().values[-1])

# Print the resultant DataFrame
num_rounds_rmses = list(zip(num_rounds, final_rmse_per_round))
print(pd.DataFrame(num_rounds_rmses,columns=["num_boosting_rounds","rmse"]))

       num_boosting_rounds          rmse
    0                    5  50903.299479
    1                   10  34774.194010
    2                   15  32895.098958
    
As you can see, increasing the number of boosting rounds decreases the RMSE.

### Automated boosting round selection using early_stopping
Now, instead of attempting to cherry pick the best possible number of boosting rounds, you can very easily have XGBoost automatically select the number of boosting rounds for you within xgb.cv(). This is done using a technique called early stopping.

Early stopping works by testing the XGBoost model after every boosting round against a hold-out dataset and stopping the creation of additional boosting rounds (thereby finishing training of the model early) if the hold-out metric ("rmse" in our case) does not improve for a given number of rounds. 

Here you will use the early_stopping_rounds parameter in xgb.cv() with a large possible number of boosting rounds (50). Bear in mind that if the hold-out metric keeps improving until num_boost_round is reached, early stopping does not occur.

In [None]:
# Create your housing DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary for each tree: params
params = {"objective":"reg:squarederror", "max_depth":4}

# Perform cross-validation with early stopping: cv_results
cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, metrics='rmse', num_boost_round=50, early_stopping_rounds=10, seed=123, as_pandas=True)

# Print cv_results
print(cv_results)

        train-rmse-mean  train-rmse-std  test-rmse-mean  test-rmse-std
    0     141871.635417      403.636200   142640.656250     705.559400
    1     103057.036458       73.769561   104907.664062     111.112417
    2      75975.966146      253.726099    79262.057291     563.763448
    3      57420.531250      521.656754    61620.135417    1087.693857
    4      44552.955729      544.170190    50437.561198    1846.446330
    5      35763.947917      681.797248    43035.660156    2034.469858
    6      29861.464193      769.571238    38600.880208    2169.796232
    7      25994.676432      756.520565    36071.817708    2109.795430
    8      23306.836588      759.238254    34383.184896    1934.546688
    9      21459.769531      745.624998    33509.142578    1887.377024
    10     20148.722005      749.611886    32916.808594    1850.894249
    11     19215.382161      641.388291    32197.832682    1734.456935
    12     18627.389323      716.256596    31770.852865    1802.155484
    13     17960.694661      557.043073    31482.781901    1779.123826
    14     17559.736328      631.412555    31389.992188    1892.321520
    15     17205.712565      590.171852    31302.881511    1955.165830
    16     16876.571615      703.632214    31234.059896    1880.707172
    17     16597.662110      703.677609    31318.348959    1828.860617
    18     16330.460937      607.274494    31323.633464    1775.909418
    19     16005.972982      520.470911    31204.135417    1739.076156
    20     15814.301107      518.604195    31089.862630    1756.021674
    21     15493.405599      505.616447    31047.996094    1624.673955
    22     15270.733724      502.019237    31056.916015    1668.042812
    23     15086.382162      503.913199    31024.983724    1548.985354
    24     14917.608399      486.206187    30983.685547    1663.129296
    25     14709.589518      449.668010    30989.476563    1686.665979
    26     14457.286133      376.787666    30952.113932    1613.172643
    27     14185.567057      383.101961    31066.901693    1648.535213
    28     13934.067057      473.465714    31095.641276    1709.224163
    29     13749.645182      473.670437    31103.886719    1778.879529
    30     13549.836263      454.898488    30976.085287    1744.514533
    31     13413.485351      399.603618    30938.469401    1746.052597
    32     13275.916016      415.408786    30931.000000    1772.470824
    33     13085.878581      493.792860    30929.057291    1765.540568
    34     12947.181315      517.789746    30890.630208    1786.511479
    35     12846.027018      547.732805    30884.492839    1769.728719
    36     12702.378906      505.523658    30833.542318    1691.003062
    37     12532.244466      508.298594    30856.688151    1771.446377
    38     12384.055013      536.225042    30818.017578    1782.785133
    39     12198.444010      545.165197    30839.392578    1847.327022
    40     12054.583333      508.841412    30776.966797    1912.780933
    41     11897.036133      477.177991    30794.702474    1919.674832
    42     11756.221680      502.992782    30780.954427    1906.819550
    43     11618.845703      519.837120    30783.755859    1951.259331
    44     11484.080078      578.428250    30776.730469    1953.447230
    45     11356.553060      565.368380    30758.543620    1947.454953
    46     11193.558268      552.298848    30729.972005    1985.699316
    47     11071.315429      604.089960    30732.662760    1966.997355
    48     10950.778320      574.862779    30712.242188    1957.752039
    49     10824.865885      576.665756    30720.854818    1950.511520
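
The stopping rule itself is simple. Here is a minimal sketch in plain Python, assuming a metric where lower is better; the function name `early_stopping_round` and the RMSE sequence are hypothetical.

```python
def early_stopping_round(metric_per_round, patience):
    """Return (best_round, rounds_run) for a lower-is-better hold-out metric.

    Training stops once `patience` rounds have passed without a new best value.
    """
    best_round = 0
    best_value = metric_per_round[0]
    for current_round, value in enumerate(metric_per_round):
        if value < best_value:
            best_round, best_value = current_round, value
        elif current_round - best_round >= patience:
            return best_round, current_round + 1
    # The metric never stalled for `patience` rounds: no early stop
    return best_round, len(metric_per_round)


# Hypothetical hold-out RMSE after each boosting round
holdout_rmse = [50.0, 42.0, 38.0, 36.0, 36.5, 36.2, 36.4, 35.0, 34.0]

print(early_stopping_round(holdout_rmse, patience=3))
print(early_stopping_round(holdout_rmse, patience=5))
```

With a patience of 3 the run stops after 7 rounds and reports round 3 as best, missing the later improvement. With a patience of 5 it survives the plateau and uses all 9 rounds, which illustrates why too small a value can end training prematurely.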

![tree_params](tree_params.png)

![linear_params](linear_params.png)

### Tuning eta (learning rate)

In [None]:
# Create your housing DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary for each tree (boosting round)
params = {"objective":"reg:squarederror", "max_depth":3}

# Create list of eta values and empty list to store final round rmse per xgboost model
eta_vals = [0.001, 0.01, 0.1]
best_rmse = []

# Systematically vary the eta 
for curr_val in eta_vals:

    params["eta"] = curr_val
    
    # Perform cross-validation: cv_results
    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=3, num_boost_round=10, early_stopping_rounds=5, metrics='rmse', seed=123, as_pandas=True)
    
    
    
    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])

# Print the resultant DataFrame
print(pd.DataFrame(list(zip(eta_vals, best_rmse)), columns=["eta","best_rmse"]))

         eta      best_rmse
    0  0.001  195736.406250
    1  0.010  179932.182292
    2  0.100   79759.411458
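
A rough way to read this table: eta scales how much of each tree's correction is added to the model. Under the idealized assumption that every tree fits the current residuals exactly (real trees do not), a fraction (1 - eta) of the residual survives each round, so (1 - eta) ** n_rounds survives after n_rounds. The function name below is made up for this sketch.

```python
def remaining_residual_fraction(eta, n_rounds):
    """Fraction of the initial residual left, assuming each tree fits it exactly."""
    return (1 - eta) ** n_rounds


fractions = {}
for eta in [0.001, 0.01, 0.1]:
    fractions[eta] = remaining_residual_fraction(eta, n_rounds=10)
    print(eta, round(fractions[eta], 3))
```

With only 10 boosting rounds, the two smallest eta values leave 90% or more of the initial error in place, which matches the pattern in the table. A small eta generally needs many more boosting rounds to pay off.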

### Tuning max_depth

Tune max_depth, which is the parameter that dictates the maximum depth that each tree in a boosting round can grow to. 

Smaller values will lead to shallower trees, and larger values to deeper trees.

In [None]:
# Create your housing DMatrix
housing_dmatrix = xgb.DMatrix(data=X,label=y)

# Create the parameter dictionary
params = {"objective":"reg:squarederror"}

# Create list of max_depth values
max_depths = [2, 5, 10, 20]
best_rmse = []

# Systematically vary the max_depth
for curr_val in max_depths:

    params["max_depth"] = curr_val
    
    # Perform cross-validation
    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2, num_boost_round=10, early_stopping_rounds=5, metrics='rmse', seed=123, as_pandas=True)
    
    
    
    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])

# Print the resultant DataFrame
print(pd.DataFrame(list(zip(max_depths, best_rmse)),columns=["max_depth","best_rmse"]))

       max_depth     best_rmse
    0          2  37957.468750
    1          5  35596.599610
    2         10  36065.550782
    3         20  36739.578125

### Tuning colsample_bytree

Now, it's time to tune "colsample_bytree". If you've worked with scikit-learn's RandomForestClassifier or RandomForestRegressor, you've seen a similar idea there under the name max_features.

Both parameters limit the features available when growing trees, but they are not identical. In xgboost, colsample_bytree is the fraction of features sampled once for each tree, while sklearn's max_features is applied at every split (xgboost's closer equivalent is colsample_bynode). In xgboost, colsample_bytree must be specified as a float greater than 0 and at most 1.

In [None]:
# Create your housing DMatrix
housing_dmatrix = xgb.DMatrix(data=X,label=y)

# Create the parameter dictionary
params={"objective":"reg:squarederror","max_depth":3}

# Create list of hyperparameter values: colsample_bytree_vals
colsample_bytree_vals = [0.1, 0.5, 0.8, 1]
best_rmse = []

# Systematically vary the hyperparameter value 
for curr_val in colsample_bytree_vals:

    params['colsample_bytree'] = curr_val
    
    # Perform cross-validation
    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2,
                 num_boost_round=10, early_stopping_rounds=5,
                 metrics="rmse", as_pandas=True, seed=123)
    
    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])

# Print the resultant DataFrame
print(pd.DataFrame(list(zip(colsample_bytree_vals, best_rmse)), columns=["colsample_bytree","best_rmse"]))

       colsample_bytree     best_rmse
    0               0.1  48193.451172
    1               0.5  36013.544922
    2               0.8  35932.962891
    3               1.0  35836.044922

There are several other individual parameters that you can tune, such as "subsample", which dictates the fraction of the training data that is used during any given boosting round. 

![grid_search_review](grid_search_review.png)

![random_search_review](random_search_review.png)

### Grid search with XGBoost

In [None]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Create the parameter grid: gbm_param_grid
gbm_param_grid = {
    'colsample_bytree': [0.3, 0.7],
    'n_estimators': [50],
    'max_depth': [2, 5]
}

# Instantiate the regressor: gbm
gbm = xgb.XGBRegressor()

# Perform grid search: grid_mse
grid_mse = GridSearchCV(estimator=gbm, param_grid=gbm_param_grid, cv=4, scoring='neg_mean_squared_error', verbose=1)


# Fit grid_mse to the data
grid_mse.fit(X, y)

# Print the best parameters and lowest RMSE
print("Best parameters found: ", grid_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(grid_mse.best_score_)))

Best parameters found:  {'colsample_bytree': 0.7, 'max_depth': 5, 'n_estimators': 50}

Lowest RMSE found:  29916.562522854438
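
Grid search tries every combination in the grid, so the cost grows with the product of the list lengths. A small sketch with the standard library shows what is enumerated for the grid above; the variable names `candidates` and `total_cv_fits` are illustrative.

```python
from itertools import product

# Same grid and fold count as in the grid search above
gbm_param_grid = {
    "colsample_bytree": [0.3, 0.7],
    "n_estimators": [50],
    "max_depth": [2, 5],
}
cv = 4

# Every combination of one value per parameter
names = list(gbm_param_grid)
candidates = [dict(zip(names, values))
              for values in product(*gbm_param_grid.values())]

for candidate in candidates:
    print(candidate)

# Each candidate is trained and scored once per fold
total_cv_fits = len(candidates) * cv
print("cross-validation fits:", total_cv_fits)
```

Adding a third value to each two-element list would raise the count from 4 to 9 candidates, which is why exhaustive grids become expensive quickly.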

### Random search with XGBoost

In [None]:
# Import RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV

# Create the parameter grid: gbm_param_grid
gbm_param_grid = {
    'n_estimators': [25],
    'max_depth': range(2, 12)
}

# Instantiate the regressor: gbm
# (the n_estimators=10 set here is overridden by the value in gbm_param_grid)
gbm = xgb.XGBRegressor(n_estimators=10)

# Perform random search: randomized_mse
randomized_mse = RandomizedSearchCV(estimator=gbm, param_distributions=gbm_param_grid, cv=4, scoring='neg_mean_squared_error', n_iter=5, verbose=1)


# Fit randomized_mse to the data
randomized_mse.fit(X, y)

# Print the best parameters and lowest RMSE
print("Best parameters found: ", randomized_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(randomized_mse.best_score_)))

    Best parameters found:  {'n_estimators': 25, 'max_depth': 6}
    Lowest RMSE found:  36909.98213965752

![grid_random_limitation](grid_random_limitation.png)

![grid_random_question](grid_random_question.png)

## Using XGBoost in pipelines 

![pipeline_review](pipeline_review.png)

![preprocessing_1](preprocessing_1.png)

![preprocessing_2](preprocessing_2.png)

### Encoding categorical columns I: LabelEncoder

In [None]:
# Import LabelEncoder
from sklearn.preprocessing import LabelEncoder

# Fill missing values with 0
df.LotFrontage = df.LotFrontage.fillna(0)

# Create a boolean mask for categorical columns
categorical_mask = (df.dtypes == object)

# Get list of categorical column names
categorical_columns = df.columns[categorical_mask].tolist()

# Print the head of the categorical columns
print(df[categorical_columns].head())

# Create LabelEncoder object: le
le = LabelEncoder()

# Apply LabelEncoder to categorical columns
df[categorical_columns] = df[categorical_columns].apply(lambda x: le.fit_transform(x))

# Print the head of the LabelEncoded categorical columns
print(df[categorical_columns].head())

Entries in each categorical column are now encoded numerically. 

A BldgType of 1Fam is encoded as 0, while a HouseStyle of 2Story is encoded as 5.

### Encoding categorical columns II: OneHotEncoder

In the categorical columns of this dataset, there is no natural ordering between the entries. As an example: Using LabelEncoder, the CollgCr Neighborhood was encoded as 5, while the Veenker Neighborhood was encoded as 24, and Crawfor as 6. Is Veenker "greater" than Crawfor and CollgCr? No - and allowing the model to assume this natural ordering may result in poor performance.

As a result, there is another step needed: You have to apply a one-hot encoding to create binary, or "dummy" variables. You can do this using scikit-learn's OneHotEncoder.

In [None]:
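# Import OneHotEncoder and ColumnTransformer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Create the one-hot encoding step: ohe
# (OneHotEncoder's categorical_features argument was removed in scikit-learn 0.22,
# so ColumnTransformer selects the categorical columns instead)
ohe = ColumnTransformer([("ohe", OneHotEncoder(), categorical_columns)],
                        remainder="passthrough", sparse_threshold=0)

# Apply the encoder to df - output is no longer a dataframe: df_encoded
df_encoded = ohe.fit_transform(df)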

# Print first 5 rows of the resulting dataset - again, this will no longer be a pandas dataframe
print(df_encoded[:5, :])

# Print the shape of the original DataFrame
print(df.shape)

# Print the shape of the transformed array
print(df_encoded.shape)

    (1460, 21)
    (1460, 62)

As you can see, after one hot encoding, which creates binary variables out of the categorical variables, there are now 62 columns.
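
What one-hot encoding does to a single column can be sketched in plain Python. The function `one_hot` is illustrative, and the input reuses the label-encoded Neighborhood codes mentioned earlier (5 for CollgCr, 24 for Veenker, 6 for Crawfor).

```python
def one_hot(codes):
    """Replace each integer code with a row holding a single 1."""
    categories = sorted(set(codes))
    position = {code: i for i, code in enumerate(categories)}
    rows = []
    for code in codes:
        row = [0] * len(categories)
        row[position[code]] = 1
        rows.append(row)
    return categories, rows


# Label-encoded Neighborhood values: CollgCr, Veenker, CollgCr, Crawfor
neighborhood_codes = [5, 24, 5, 6]

categories, encoded = one_hot(neighborhood_codes)
print(categories)
for row in encoded:
    print(row)
```

Each category gets its own column, so the model no longer sees 24 as "greater" than 5; the price is one extra column per distinct value, which is how 21 columns grew to 62.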

### Encoding categorical columns III: DictVectorizer

The two step process you just went through - LabelEncoder followed by OneHotEncoder - can be simplified by using a DictVectorizer.

Using a DictVectorizer on a DataFrame that has been converted to a dictionary allows you to get label encoding as well as one-hot encoding in one go.

In [None]:
# Import DictVectorizer
from sklearn.feature_extraction import DictVectorizer

# Convert df into a dictionary: df_dict
df_dict = df.to_dict('records')

# Create the DictVectorizer object: dv
dv = DictVectorizer(sparse=False)

# Apply dv on df: df_encoded
df_encoded = dv.fit_transform(df_dict)

# Print the resulting first five rows
print(df_encoded[:5,:])

# Print the vocabulary
print(dv.vocabulary_)

Besides simplifying the process into one step, DictVectorizer has useful attributes such as vocabulary_ which maps the names of the features to their indices. 

### Preprocessing within a pipeline

In [None]:
# Import necessary modules
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline

# Fill LotFrontage missing values with 0
X.LotFrontage = X.LotFrontage.fillna(0)

# Setup the pipeline steps: steps
steps = [("ohe_onestep", DictVectorizer(sparse=False)),
         ("xgb_model", xgb.XGBRegressor())]

# Create the pipeline: xgb_pipeline
xgb_pipeline = Pipeline(steps=steps)

# Fit the pipeline
xgb_pipeline.fit(X.to_dict('records'), y)

## Additional Components introduced for pipelines

![pipeline_additional_components](pipeline_additional_components.png)

### Cross-validating your XGBoost model

In [None]:
# Import necessary modules
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# Fill LotFrontage missing values with 0
X.LotFrontage = X.LotFrontage.fillna(0)

# Setup the pipeline steps: steps
steps = [("ohe_onestep", DictVectorizer(sparse=False)),
         ("xgb_model", xgb.XGBRegressor(max_depth=2, objective="reg:squarederror"))]

# Create the pipeline: xgb_pipeline
xgb_pipeline = Pipeline(steps=steps)

# Cross-validate the model
cross_val_scores = cross_val_score(xgb_pipeline, X.to_dict('records'), y, scoring='neg_mean_squared_error', cv=10)

# Print the 10-fold RMSE
print("10-fold RMSE: ", np.mean(np.sqrt(np.abs(cross_val_scores))))

10-fold RMSE:  29867.603720688923
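
The scoring string 'neg_mean_squared_error' returns the negated MSE of each fold, so the scores have to be flipped and square-rooted to get an RMSE. The sketch below uses hypothetical fold scores to show the conversion used here (mean of the per-fold RMSEs) next to the variant used in the grid search section (square root of the mean MSE).

```python
import math

# Hypothetical neg_mean_squared_error scores, one per fold
cross_val_scores = [-9.0e8, -4.0e8, -16.0e8]

# Variant 1: convert each fold to an RMSE, then average
rmse_per_fold = [math.sqrt(abs(score)) for score in cross_val_scores]
mean_of_rmses = sum(rmse_per_fold) / len(rmse_per_fold)

# Variant 2: average the MSEs first, then take one square root
mean_mse = abs(sum(cross_val_scores) / len(cross_val_scores))
rmse_of_mean = math.sqrt(mean_mse)

print("per-fold RMSE:", rmse_per_fold)
print("mean of RMSEs:", round(mean_of_rmses, 2))
print("RMSE of mean MSE:", round(rmse_of_mean, 2))
```

The two numbers differ whenever the folds disagree, and the second is never smaller than the first, so it is worth stating which one is being reported when comparing results.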

### Kidney disease case study I: Imputer (numerical and categorical)

In [None]:
# Import necessary modules
from sklearn.impute import SimpleImputer
from sklearn_pandas import DataFrameMapper

# Check number of nulls in each feature column
nulls_per_column = X.isnull().sum()
print(nulls_per_column)

# Create a boolean mask for categorical columns
categorical_feature_mask = X.dtypes == object

# Get list of categorical column names
categorical_columns = X.columns[categorical_feature_mask].tolist()

# Get list of non-categorical column names
non_categorical_columns = X.columns[~categorical_feature_mask].tolist()

# Apply numeric imputer
# (SimpleImputer replaces the Imputer class removed in scikit-learn 0.22)
numeric_imputation_mapper = DataFrameMapper(
    [([numeric_feature], SimpleImputer(strategy="median")) for numeric_feature in non_categorical_columns],
    input_df=True,
    df_out=True
)

# Apply categorical imputer
# (most-frequent imputation stands in for sklearn_pandas.CategoricalImputer,
# which newer sklearn-pandas releases no longer ship)
categorical_imputation_mapper = DataFrameMapper(
    [([category_feature], SimpleImputer(strategy="most_frequent")) for category_feature in categorical_columns],
    input_df=True,
    df_out=True
)

    age        9
    bp        12
    sg        47
    al        46
    su        49
    bgr       44
    bu        19
    sc        17
    sod       87
    pot       88
    hemo      52
    pcv       71
    wc       106
    rc       131
    rbc      152
    pc        65
    pcc        4
    ba         4
    htn        2
    dm         2
    cad        2
    appet      1
    pe         1
    ane        1
    dtype: int64

### Feature Union

Having separately imputed numeric as well as categorical columns, your task is now to use scikit-learn's FeatureUnion to concatenate their results, which are contained in two separate transformer objects - numeric_imputation_mapper, and categorical_imputation_mapper, respectively.

### Kidney disease case study II: Feature Union

In [None]:
# Import FeatureUnion
from sklearn.pipeline import FeatureUnion

# Combine the numeric and categorical transformations
numeric_categorical_union = FeatureUnion([
                                          ("num_mapper", numeric_imputation_mapper),
                                          ("cat_mapper", categorical_imputation_mapper)
                                         ])

### Kidney disease case study III: Full pipeline

It's time to piece together all of the transforms along with an XGBClassifier to build the full pipeline!

Besides the numeric_categorical_union created in the previous step, two other transforms are needed: a custom Dictifier() transform, which is defined outside these notes and turns the union's output into a list of row dictionaries, and the DictVectorizer().

After creating the pipeline, your task is to cross-validate it to see how well it performs.
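
Since Dictifier is not part of scikit-learn or XGBoost, the pipeline below needs a definition for it. The class here is an assumed reconstruction, not the original: a minimal transformer that converts a DataFrame or 2-D array into the list of dictionaries that DictVectorizer expects. The two-row example frame is made up.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class Dictifier(BaseEstimator, TransformerMixin):
    """Turn a DataFrame (or 2-D array) into a list of row dictionaries."""

    def fit(self, X, y=None):
        # Nothing to learn; fit exists so the class can be used as a pipeline step
        return self

    def transform(self, X):
        # Arrays are wrapped in a DataFrame, so their keys become column positions
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        return X.to_dict("records")


# Hypothetical two-row example
example = pd.DataFrame({"age": [48.0, 7.0], "rbc": ["normal", "abnormal"]})
rows = Dictifier().fit_transform(example)
print(rows)
```

Inheriting from BaseEstimator and TransformerMixin supplies get_params and fit_transform, which is what lets cross_val_score clone the pipeline for each fold.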

In [None]:
# Create full pipeline
pipeline = Pipeline([
                     ("featureunion", numeric_categorical_union),
                     ("dictifier", Dictifier()),
                     ("vectorizer", DictVectorizer(sort=False)),
                     ("clf", xgb.XGBClassifier(max_depth=3))
                    ])

# Perform cross-validation
cross_val_scores = cross_val_score(pipeline, X, y, scoring="roc_auc", cv=3)

# Print avg. AUC
print("3-fold AUC: ", np.mean(cross_val_scores))

3-fold AUC:  0.998637406769937

## Tuning XGBoost Hyperparameters in a pipeline

In [None]:
# Import RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV

# Create the parameter grid
gbm_param_grid = {
    'clf__learning_rate': np.arange(0.05, 1, 0.05),
    'clf__max_depth': np.arange(3, 10, 1),
    'clf__n_estimators': np.arange(50, 200, 50)
}

# Perform RandomizedSearchCV
randomized_roc_auc = RandomizedSearchCV(estimator=pipeline, param_distributions=gbm_param_grid, n_iter=2, scoring='roc_auc', verbose=1, cv=2)

# Fit the estimator
randomized_roc_auc.fit(X, y)

# Compute metrics
print(randomized_roc_auc.best_score_)
print(randomized_roc_auc.best_estimator_)

## What we have not done

![to_do](to_do.png)