<b>Supervised Learning</b> - Classification & Regression<br>

Numeric features should be scaled (Z-scored)<br>
Categorical features should be encoded (One-hot)<p>
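
A minimal pure-Python sketch of what the two preprocessing steps above do. The column names and values are made up for illustration; in practice a library scaler/encoder would be used.

```python
from statistics import mean, pstdev

def z_score(values):
    """Scale a numeric column to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

def one_hot(values):
    """Encode a categorical column as one 0/1 indicator column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

rooms = [4.0, 6.0, 8.0]          # hypothetical numeric feature
zone = ["res", "ind", "res"]     # hypothetical categorical feature

scaled = z_score(rooms)
categories, encoded = one_hot(zone)

print(scaled)                    # values centred on 0
print(categories, encoded)       # ['ind', 'res'] [[0, 1], [1, 0], [0, 1]]
```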

XGBoost uses CART - Classification and Regression Trees<br>

<b> When should I use XGBoost?</b><br>
<pre>
Should:                                                       Shouldn't:
- > 1000 training samples and < 100 feats                     - Image recognition, NLP, computer vision
- num feats < num training samples                            - Few training samples
- categorical and numeric feats
</pre>


In [None]:
import xgboost as xgb

<b> LOSS FUNCTIONS</b><br>
Quantifies how far off a prediction is from the actual result<p>

reg:linear - use for Regression (renamed reg:squarederror in newer XGBoost releases)<br>
reg:logistic - use for Classification, when you want just the <i>decision</i>, not the probability<br>
binary:logistic - use for Classification, when you want the <i>probability</i> rather than just the decision<br>
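
As a rough illustration of "how far off a prediction is", the sketch below computes the two per-sample losses these objectives are built on: squared error for regression and log loss (binary cross-entropy) for the logistic objectives. The function names and sample values are illustrative, not part of the XGBoost API.

```python
import math

def squared_error(y_true, y_pred):
    """Per-sample loss behind the squared-error regression objective."""
    return (y_true - y_pred) ** 2

def log_loss(y_true, p):
    """Per-sample binary cross-entropy; p is the predicted probability of class 1."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

se = squared_error(3.0, 2.5)       # 0.25
confident_right = log_loss(1, 0.9) # small loss, about 0.105
confident_wrong = log_loss(1, 0.1) # large loss, about 2.303

print(se, confident_right, confident_wrong)
```

A confident wrong probability is penalised far more heavily than a confident correct one.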

<b> BASE LEARNERS</b><br>
Individual models that are combined into the ensemble (e.g. trees or linear models)<p>

Want base learners that, when combined, produce a final prediction that is non-linear<br>
Each base learner should be good at distinguishing or predicting different parts of the dataset<br>
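
To make the idea of combining weak base learners concrete, here is a toy sketch of boosting with one-split trees (stumps) on a single feature. It is plain gradient boosting on squared error, not XGBoost's actual algorithm (no second-order terms, no regularization); all names and data are made up.

```python
def fit_stump(x, residuals):
    """Find the threshold on 1-D x that best splits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(x))[:-1]:          # skip the max so the right side is never empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def predict_stump(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean

def boost(x, y, n_rounds=10, learning_rate=0.5):
    """Each round fits a stump to the current residuals and adds a shrunken copy of it."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps, history = [], []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        preds = [pi + learning_rate * predict_stump(stump, xi) for xi, pi in zip(x, preds)]
        history.append(sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y))
    return base, stumps, history

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.2, 3.1, 2.9, 5.2, 5.0, 7.1, 6.9]      # toy target
base, stumps, history = boost(x, y)
print([round(m, 3) for m in history])             # training MSE after each round
```

Each stump on its own is a crude step function, but their sum follows the target closely, and the training MSE shrinks round after round.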

<b>DMATRIX</b><br>
XGBoost creates DMatrix objects internally during training, but to use CV (xgb.cv) the data must be converted to a DMatrix beforehand.<br>
DMatrix is a data structure optimized for XGBoost.<br>
```DM_train = xgb.DMatrix(data=X_train, label=y_train)```

In [None]:
### CLASSIFICATION

'''METRICS:
Binary classification model      => AUC
Multi-class classification model => Accuracy '''


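# Option 1: binary:logistic
xg_cl = xgb.XGBClassifier(objective='binary:logistic', n_estimators=10, seed=123)
# Option 2: reg:logistic (alternative; overwrites xg_cl if both lines are run)
xg_cl = xgb.XGBClassifier(objective='reg:logistic',    n_estimators=10, seed=123)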

In [None]:
### REGRESSION

'''METRICS:
# Error = Actual - Predicted
# Root mean squared error (RMSE) =>  sqrt(mean(Error**2)) - punishes larger diffs between actual and pred
# Mean absolute error (MAE)      =>  mean(abs(Error)) '''

xg_reg = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=10, seed=123)   # 'reg:linear' is the deprecated name
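
A small worked example of the two regression metrics, using made-up targets and predictions, shows why RMSE punishes large errors more than MAE does.

```python
import math

actual    = [3.0, 5.0, 2.0, 7.0]   # hypothetical targets
predicted = [2.0, 5.0, 4.0, 7.0]   # hypothetical predictions

errors = [a - p for a, p in zip(actual, predicted)]            # [1.0, 0.0, -2.0, 0.0]

rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))    # sqrt(5 / 4)
mae = sum(abs(e) for e in errors) / len(errors)                # 3 / 4

print("RMSE: %f" % rmse)
print("MAE:  %f" % mae)
```

The single error of size 2 contributes four times as much as the error of size 1 to the RMSE, but only twice as much to the MAE.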

In [None]:
### Selecting Base Learners
# Note: booster='gbtree' is the default

DM_train = xgb.DMatrix(data=X_train, label=y_train)
DM_test  = xgb.DMatrix(data=X_test,  label=y_test)

params = {'booster':'gblinear', 'objective':'reg:squarederror'}        # gblinear for linear; booster="gbtree" for trees

xg_reg = xgb.train(params = params, dtrain = DM_train, num_boost_round = 10)
preds = xg_reg.predict(DM_test)


In [None]:
# Compute the rmse
import numpy as np
from sklearn.metrics import mean_squared_error
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))


<b>Plotting XGBoost trees</b>

In [None]:
import matplotlib.pyplot as plt

# Create the DMatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)
params = {"objective":"reg:squarederror", "max_depth":4}     # 'reg:linear' is the deprecated name
xg_reg = xgb.train(params=params, dtrain=housing_dmatrix, num_boost_round=10)


# Plot the FIRST tree
xgb.plot_tree(xg_reg, num_trees=0)
plt.show()

# Plot the FIFTH tree
xgb.plot_tree(xg_reg, num_trees=4)
plt.show()

# Plot the LAST TREE SIDEWAYS
xgb.plot_tree(xg_reg, num_trees=9, rankdir="LR")
plt.show()

# Plot the FEATURE IMPORTANCES
xgb.plot_importance(xg_reg)
plt.show()


<h2>FINE-TUNING</h2>

<b>REGULARIZATION</b> - is a control on model complexity<br>

<b>COMMON TREE TUNABLE PARAMS:</b><br>
<b>learning rate</b> - [0.001, 0.01, 0.1]<br>
<b>gamma</b> - min loss reduction to create new tree split (Regularization)<br>
<b>alpha</b> - L1 reg on leaf weights, larger values mean more Regularization - [1, 10, 100]<br>
<b>lambda</b> - L2 reg on leaf weights, smoother Regularization<br>
<b>max_depth</b> - max depth per tree - [2, 5, 10, 20]<br>
<b>subsample</b> - % samples used per tree (too low can underfit, too high can overfit)<br>
<b>colsample_bytree</b> - % feats used per tree (low is additional Regularization, high can overfit) = max_features in RandomForest - [0.1, 0.5, 0.8, 1]<p>

<b>LINEAR TUNABLE PARAMS:</b><br>
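<b>alpha</b> - L1 reg on weights (Regularization)<br>
<b>lambda</b> - L2 reg on weights (Regularization)<br>
<b>lambda_bias</b> - L2 reg term on bias<p>

<b>number of estimators</b> (n_estimators / num_boost_round) - number of boosting rounds; tunable for both tree and linear base learners<br>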
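
The effect of the tree regularization params can be sketched with the leaf-weight and split-gain formulas from the XGBoost objective, where G and H are the sums of gradients and Hessians of the samples in a node. This is a simplified illustration with made-up numbers, and the L1 handling (soft-thresholding G by alpha) reflects my understanding of the library rather than a quoted specification.

```python
def leaf_weight(G, H, reg_lambda=1.0, reg_alpha=0.0):
    """Optimal leaf weight: alpha soft-thresholds G, lambda enlarges the denominator."""
    if G > reg_alpha:
        G_shrunk = G - reg_alpha
    elif G < -reg_alpha:
        G_shrunk = G + reg_alpha
    else:
        G_shrunk = 0.0
    return -G_shrunk / (H + reg_lambda)

def split_gain(GL, HL, GR, HR, reg_lambda=1.0, gamma=0.0):
    """Loss reduction of a split; the split is only worth making if this is positive."""
    def score(G, H):
        return G * G / (H + reg_lambda)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma

w_default = leaf_weight(-4.0, 3.0)                              # 1.0
w_more_l2 = leaf_weight(-4.0, 3.0, reg_lambda=5.0)              # 0.5, shrunk smoothly
w_with_l1 = leaf_weight(-4.0, 3.0, reg_alpha=2.0)               # 0.5, shrunk by a fixed offset
w_zeroed  = leaf_weight(-1.5, 3.0, reg_alpha=2.0)               # 0.0, L1 can zero a leaf out

gain_no_gamma = split_gain(-4.0, 3.0, 4.0, 3.0)                 # 4.0, split is made
gain_gamma_5  = split_gain(-4.0, 3.0, 4.0, 3.0, gamma=5.0)      # -1.0, split is rejected

print(w_default, w_more_l2, w_with_l1, w_zeroed, gain_no_gamma, gain_gamma_5)
```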

In [None]:
### CROSS-VALIDATION in XGBoost
''' # XGBoost creates DMatrix objects internally during training, but xgb.cv needs the data converted beforehand.
      DMatrix is a data structure optimized for XGBoost.'''

DM_train = xgb.DMatrix(data=X_train, label=y_train)

params = {'objective':'binary:logistic',
          'max_depth':4,
          'colsample_bytree':0.3,
          'learning_rate':0.1}

cv_results = xgb.cv(dtrain=DM_train,                   # data to train
                    params=params,                     # params dict
                    nfold=4,                           # num of folds
                    num_boost_round=10,                # num of trees
                    early_stopping_rounds=10,          # early_stopping
                    metrics='error',                   # metric
                    as_pandas=True,                    # output as DataFrame
                    seed=42)                           # random seed


# metrics = error, auc, rmse, mae

print(cv_results)                                     # Print cv_results
print((cv_results["test-error-mean"]).tail(1))        # Final boosting round metric; column name follows the chosen metric
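
The early-stopping idea can be sketched in a few lines: stop once the test metric has not improved for a given number of rounds and keep the best round. This is a simplified stand-in for what the library does internally, with made-up scores and a hypothetical helper name.

```python
def best_round(scores, patience):
    """Return (index, score) of the best round under early stopping; lower is better."""
    best_idx = 0
    for i, s in enumerate(scores):
        if s < scores[best_idx]:
            best_idx = i                      # new best round
        elif i - best_idx >= patience:
            break                             # no improvement for `patience` rounds
    return best_idx, scores[best_idx]

test_error = [0.30, 0.25, 0.22, 0.23, 0.24, 0.21, 0.26]   # hypothetical per-round CV error

stopped_early = best_round(test_error, patience=2)   # stops before ever seeing 0.21
ran_to_end = best_round(test_error, patience=5)      # patient enough to reach round 5

print(stopped_early, ran_to_end)
```

By the same logic, a patience of 10 with only 10 boosting rounds can never trigger, so `early_stopping_rounds` only matters when `num_boost_round` is noticeably larger.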

In [None]:
### REGULARIZATION - is a control on model complexity
# gamma - minimum loss reduction allowed for a split to occur
# alpha - l1 regularization on leaf weights, larger values mean more regularization
# lambda - l2 regularization on leaf weight (smoother regularization)


In [None]:
### GRIDSEARCH
import numpy as np
from sklearn.model_selection import GridSearchCV

gbm_param_grid ={ 'learning_rate'   : [0.01, 0.1, 0.5 , 0.9],
                  'n_estimators'    : [200],
                  'subsample'       : [0.3, 0.5, 0.9],
                  'colsample_bytree': [0.3, 0.7]}

gbm = xgb.XGBRegressor()
grid_mse = GridSearchCV(estimator = gbm,
                        param_grid = gbm_param_grid,
                        scoring='neg_mean_squared_error',
                        cv=4,
                        verbose=1)
grid_mse.fit(X, y)

print ('Best params found: ', grid_mse.best_params_)
print ('Lowest RMSE found: ', np.sqrt(np.abs(grid_mse.best_score_)))

In [None]:
### RANDOMSEARCH
from sklearn.model_selection import RandomizedSearchCV

gbm_param_grid ={ 'learning_rate': np.arange(0.05, 1.05, 0.05),   # step 0.05; a step of 0.5 would give only 2 values
                  'n_estimators' : [200],
                  'subsample'    : np.arange(0.05, 1.05, 0.05)}

gbm = xgb.XGBRegressor()
randomized_mse = RandomizedSearchCV(estimator = gbm,
                                    param_distributions = gbm_param_grid,
                                    n_iter=25,                                  # number of random combinations
                                    scoring='neg_mean_squared_error',
                                    cv=4,
                                    verbose=1)
randomized_mse.fit(X, y)

print ('Best params found: ', randomized_mse.best_params_)
print ('Lowest RMSE found: ', np.sqrt(np.abs(randomized_mse.best_score_)))

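Grid Search fits every combination, so its search space (and run time) grows multiplicatively with each hyperparameter added and can become massive. Random Search fits only n_iter sampled combinations, so its run time is set by n_iter; the more hyperparameters there are, the smaller the fraction of the space a fixed n_iter covers.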
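
The difference in cost can be counted directly. The sketch below enumerates the grid from the GRIDSEARCH cell with `itertools.product` and compares the number of model fits against a random search with an assumed budget of `n_iter = 10`; it counts fits only and trains nothing.

```python
from itertools import product
import random

param_grid = {
    "learning_rate": [0.01, 0.1, 0.5, 0.9],
    "n_estimators": [200],
    "subsample": [0.3, 0.5, 0.9],
    "colsample_bytree": [0.3, 0.7],
}
cv_folds = 4

# Grid search: every combination, each fitted once per fold
grid = [dict(zip(param_grid, combo)) for combo in product(*param_grid.values())]
grid_fits = len(grid) * cv_folds           # 24 combinations * 4 folds = 96 fits

# Random search: a fixed budget of sampled combinations
n_iter = 10                                # assumed budget for this illustration
rng = random.Random(42)
sampled = rng.sample(grid, n_iter)
random_fits = n_iter * cv_folds            # 10 * 4 = 40 fits, whatever the grid size

print(len(grid), grid_fits, random_fits)
```

Adding one more hyperparameter with five candidate values would multiply `grid_fits` by five, while `random_fits` would stay at 40.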

<h2>PREPROCESSING</h2>

In [None]:
### LABELENCODER
""" LabelEncoder - Converts a categorical column of strings into integers """

from sklearn.preprocessing import LabelEncoder

df.LotFrontage = df.LotFrontage.fillna(0)                    # Fill missing values with 0
categorical_mask = (df.dtypes == object)                     # Create a boolean mask for categorical columns
categorical_columns = df.columns[categorical_mask].tolist()  # Get list of categorical column names
print(df[categorical_columns].head())                        # Print the head of the categorical columns

le = LabelEncoder()                                          # Create LabelEncoder object
df[categorical_columns] = df[categorical_columns].apply(lambda x: le.fit_transform(x)) # Apply LabelEncoder

print(df[categorical_columns].head())                        # Print the head of the LabelEncoded categorical columns


In [None]:
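# ONEHOTENCODER
""" OneHotEncoder - Takes the columns of integers and encodes them as dummy variables """

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Compute the mask before label-encoding (encoded columns are no longer dtype object)
categorical_mask = (df.dtypes == object)
categorical_columns = df.columns[categorical_mask].tolist()

# The categorical_features argument was removed from OneHotEncoder in newer scikit-learn releases;
# ColumnTransformer selects the columns to encode and passes the others through unchanged
ohe = ColumnTransformer([("ohe", OneHotEncoder(), categorical_columns)],
                        remainder="passthrough", sparse_threshold=0)       # sparse_threshold=0 => dense output
df_encoded = ohe.fit_transform(df)                                         # output is no longer a dataframe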

print(df_encoded[:5, :])                                                   # Print first 5 rows of the resulting dataset

print(df.shape)                                                            # Print the shape of the original DataFrame
print(df_encoded.shape)                                                    # Print the shape of the transformed array

In [None]:
# DictVectorizer
""" DictVectorizer - Converts lists of feature mappings into vectors """

from sklearn.feature_extraction import DictVectorizer

df_dict = df.to_dict('records')               # Convert df into a list of row dictionaries

dv = DictVectorizer(sparse=False)             # Create the DictVectorizer object
df_encoded = dv.fit_transform(df_dict)        # Apply dv on df

print(df_encoded[:5,:])                       # Print the resulting first five rows
print(dv.vocabulary_)                         # Print the vocabulary (maps the names of the features to their indices)
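
To see what the vectorizer produces, here is a rough pure-Python imitation of its behaviour on two made-up records: numeric values keep their key as the column name, while string values become `key=value` indicator columns. The helper name is hypothetical and covers only this simple case.

```python
def dict_vectorize(records):
    """Turn a list of feature dicts into (vocabulary, dense rows)."""
    feature_names = sorted({
        f"{k}={v}" if isinstance(v, str) else k
        for r in records for k, v in r.items()
    })
    vocabulary = {name: i for i, name in enumerate(feature_names)}
    rows = []
    for r in records:
        row = [0.0] * len(feature_names)
        for k, v in r.items():
            if isinstance(v, str):
                row[vocabulary[f"{k}={v}"]] = 1.0    # one-hot for string values
            else:
                row[vocabulary[k]] = float(v)        # numeric values pass through
        rows.append(row)
    return vocabulary, rows

records = [{"zone": "res", "rooms": 6}, {"zone": "ind", "rooms": 4}]   # hypothetical rows
vocabulary, matrix = dict_vectorize(records)

print(vocabulary)   # {'rooms': 0, 'zone=ind': 1, 'zone=res': 2}
print(matrix)       # [[6.0, 0.0, 1.0], [4.0, 1.0, 0.0]]
```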


<h2>PIPELINES</h2>

In [None]:
# Import necessary modules
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

X = X.to_dict("records")

# Setup the pipeline steps
steps = [("ohe_onestep", DictVectorizer(sparse=False)),
         ("xgb_model", xgb.XGBRegressor(max_depth=2, objective="reg:squarederror"))]   # 'reg:linear' is the deprecated name

# Create the pipeline
xgb_pipeline = Pipeline(steps)

# Cross-validate the model
cross_val_scores = cross_val_score(xgb_pipeline, X, y, cv=10, scoring='neg_mean_squared_error')

# Print the 10-fold RMSE
print("10-fold RMSE: ", np.mean(np.sqrt(np.abs(cross_val_scores))))


# Import necessary modules
from sklearn.impute import SimpleImputer                                  # replaces the removed sklearn.preprocessing.Imputer
from sklearn_pandas import DataFrameMapper
from sklearn_pandas import CategoricalImputer                             # only available in older sklearn_pandas releases

# Note: X must be a DataFrame here (not the list of dicts created with X.to_dict above)
nulls_per_column = X.isnull().sum()                                       # Check number of nulls in each feature column
print(nulls_per_column)

categorical_feature_mask = X.dtypes == object                             # Create a boolean mask for categorical columns
categorical_columns = X.columns[categorical_feature_mask].tolist()        # Get list of categorical column names
non_categorical_columns = X.columns[~categorical_feature_mask].tolist()   # Get list of non-categorical column names


# Apply numeric imputer
numeric_imputation_mapper = DataFrameMapper(
                                            [([numeric_feature], SimpleImputer(strategy="median"))
                                             for numeric_feature in non_categorical_columns],
                                            input_df=True,
                                            df_out=True
                                           )

# Apply categorical imputer
categorical_imputation_mapper = DataFrameMapper(
                                                [(category_feature, CategoricalImputer())
                                                 for category_feature in categorical_columns],
                                                input_df=True,
                                                df_out=True
                                               )

# Import FeatureUnion
from sklearn.pipeline import FeatureUnion

# Combine the numeric and categorical transformations
numeric_categorical_union = FeatureUnion([
                                          ("num_mapper", numeric_imputation_mapper),
                                          ("cat_mapper", categorical_imputation_mapper)
                                         ])

# Create full pipeline
pipeline = Pipeline([
                     ("featureunion", numeric_categorical_union),
                     ("dictifier", Dictifier()),                    # custom transformer (not defined in these notes) that feeds dicts to DictVectorizer
                     ("vectorizer", DictVectorizer(sort=False)),
                     ("clf", xgb.XGBClassifier())
                    ])

# Perform cross-validation
cross_val_scores = cross_val_score(pipeline, X, y, scoring ="roc_auc", cv=3)

# Print avg. AUC
print("3-fold AUC: ", np.mean(cross_val_scores))


<b>CASE STUDY</b>

In [None]:
import pandas as pd
import xgboost as xgb
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV
   
names = ["crime","zone","industry","charles","no", "rooms","age", "distance","radial","tax", "pupil","aam","lower","med_price"]
data = pd.read_csv("boston_housing.csv",names=names)

X, y = data.iloc[:,:-1],data.iloc[:,-1]

xgb_pipeline = Pipeline([("st_scaler", StandardScaler()),
                         ("xgb_model", xgb.XGBRegressor())])
   
gbm_param_grid = {'xgb_model__learning_rate'   : np.arange( 0.05, 1, 0.05),
                  'xgb_model__subsample'       : np.arange( .05, 1, .05),
                  'xgb_model__max_depth'       : np.arange( 3, 20, 1),
                  'xgb_model__colsample_bytree': np.arange( 0.1, 1.05, .05),
                  'xgb_model__n_estimators'    : np.arange( 50, 200, 50)}
   
randomized_neg_mse = RandomizedSearchCV(estimator = xgb_pipeline,
                                        param_distributions = gbm_param_grid,
                                        n_iter=10,
                                        scoring='neg_mean_squared_error',
                                        cv=4,
                                        verbose=1)
randomized_neg_mse.fit(X, y)

print("Best rmse: ", np.sqrt(np.abs(randomized_neg_mse.best_score_)))
print("Best model: ", randomized_neg_mse.best_estimator_)


<b> NEXT STEPS</b>

- Using XGBoost for <b>ranking/recommendation</b> problems (Netflix/Amazon problem) ==> Modify Loss Function (?)<br>
- <b>Bayesian Optimization</b> ==> Using more sophisticated hyperparameter tuning strategies for tuning XGBoost models<br>
- Using XGBoost as part of an <b>ensemble</b> of other models for regression/classification