# Machine Learning A-Z: Section 39 XGBoost

In this notebook we'll be using XGBoost to solve the same problem we solved with the Artificial Neural Network (churn prediction). First, though, we'll cover a quick description of what XGBoost is.

XGBoost is a library that implements Gradient Tree Boosting in a way that makes it simple to use, fast to train, and highly performant. At the moment it is regularly one of the top performing algorithms in many data science competitions.

Like Random Forest models, Gradient Tree Boosting (GTB) is a type of ensemble learning that uses multiple trees to improve the performance of the model. However, unlike Random Forest, GTB does not create a random set of trees and average their output. Instead, GTB has an objective function that includes terms for both accuracy and model complexity, and it adds new trees one at a time to improve areas where the current ensemble performs poorly while keeping the model complexity low. GTB decides how to boost the model's performance (i.e. which tree to add) by looking at the gradient of the objective function, which indicates where the model can improve most. This is where the name comes from.
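
To make the idea concrete, here is a minimal sketch of boosting on a toy 1-D regression problem, using only the standard library. It is not how XGBoost is implemented: it assumes squared-error loss (where the negative gradient is simply the residual, up to a constant factor), uses one-split "stumps" instead of full trees, and leaves out the complexity penalty entirely. The function names `fit_stump` and `boost` and the toy data are made up for illustration.

```python
# Minimal gradient-boosting sketch for squared-error loss on 1-D data.
# Illustrative only: XGBoost adds regularization, second-order gradient
# information, and far more efficient tree construction.

def fit_stump(x, residuals):
    """Return the one-split predictor that best fits the residuals (least squares)."""
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def boost(x, y, n_trees=50, learning_rate=0.1):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps, losses = [], []
    for _ in range(n_trees):
        # For squared error, the negative gradient is the residual.
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [pi + learning_rate * stump(xi)
                       for xi, pi in zip(x, predictions)]
        losses.append(sum((yi - pi) ** 2 for yi, pi in zip(y, predictions)) / len(y))
    return base, stumps, losses

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
base, stumps, losses = boost(x, y)
print(f'MSE after 1 tree: {losses[0]:.4f}')
print(f'MSE after {len(losses)} trees: {losses[-1]:.4f}')
```

Each new stump targets whatever error the current ensemble still makes, so the training error shrinks as stumps are added. The `learning_rate` plays the same role as the parameter of that name in `XGBClassifier`: it scales down each tree's contribution.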

The problem we'll be solving with XGBoost is to determine which customers are likely to stop using a particular bank (churn). In this case we'll create a geodemographic model from a sample of data the bank collected about its customers and try to identify who is likely to leave the bank.

## Step 1: Import and Prepare the Data

In [1]:
import numpy as np # Libraries for fast linear algebra and array manipulation
import pandas as pd # Import and manage datasets
from plotly import __version__ as py__version__
import plotly.express as px # Libraries for plotting data
import plotly.graph_objects as go # Libraries for plotting data
from sklearn import __version__ as skl__version__
from sklearn.model_selection import train_test_split # Library to split data into training and test sets.
from sklearn.preprocessing import LabelEncoder, OneHotEncoder # Libraries to do encoding of categorical variables
from sklearn.compose import ColumnTransformer # Library to transform only certain columns/features at a time
from sklearn.preprocessing import StandardScaler # Library to do feature scaling
from sklearn.metrics import confusion_matrix #Function for computing the confusion matrix
from sklearn.model_selection import cross_val_score # Function for doing K-Fold Cross Validation
from sklearn.model_selection import GridSearchCV # Library for doing Grid Search
from xgboost import __version__ as xgb__version__
from xgboost import XGBClassifier #XGBoost library for doing classification

Library versions used in this code:

In [2]:
print('Numpy: ' + np.__version__)
print('Pandas: ' + pd.__version__)
print('Plotly: ' + py__version__)
print('Scikit-learn: ' + skl__version__)
print('XGBoost Version: ' + xgb__version__)

Numpy: 1.16.4
Pandas: 0.25.1
Plotly: 4.0.0
Scikit-learn: 0.21.2
XGBoost Version: 0.90


In [3]:
def LoadData():
    dataset = pd.read_csv('Churn_Modelling.csv')
    return dataset

dataset = LoadData()
print(dataset.head(3))
print()
print(dataset.info())

   RowNumber  CustomerId   Surname  CreditScore Geography  Gender  Age  \
0          1    15634602  Hargrave          619    France  Female   42   
1          2    15647311      Hill          608     Spain  Female   41   
2          3    15619304      Onio          502    France  Female   42   

   Tenure    Balance  NumOfProducts  HasCrCard  IsActiveMember  \
0       2       0.00              1          1               1   
1       1   83807.86              1          0               1   
2       8  159660.80              3          1               0   

   EstimatedSalary  Exited  
0        101348.88       1  
1        112542.58       0  
2        113931.57       1  

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
RowNumber          10000 non-null int64
CustomerId         10000 non-null int64
Surname            10000 non-null object
CreditScore        10000 non-null int64
Geography          10000 non-null object
Gender     

We can see that the dataset contains 14 columns. However, not all of them are useful for our model. Only the columns listed below will be used as features:
* CreditScore
* Geography
* Gender
* Age
* Tenure (How long the person has been a customer)
* Balance
* NumOfProducts (How many of the bank's products does the customer use)
* HasCrCard
* IsActiveMember
* EstimatedSalary

Using the data in these columns we'll try to predict the value in the *Exited* column to determine if a user will leave the bank soon or not.

You'll see in a moment that some of these columns are categorical variables that we'll need to encode before the model can use them. Also, there does not appear to be any missing data in this dataset.

## Step 2: Split and Encode the Data

In [4]:
X = dataset.iloc[:,3:-1].values # Features: every column from CreditScore on, except the last (skips RowNumber, CustomerId, Surname)
y = dataset.iloc[:,-1].values # The last column (Exited) is the dependent variable

Now that we've split the data into independent variables (features) and the dependent variable, we need to encode the categorical features.

We'll use One-Hot encoding on both gender and country to encode the categorical data. Don't forget to remove one of the new columns from the one-hot encoded categorical variables to avoid the dummy variable trap!
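
As a rough illustration of what dropping the first category does, the sketch below encodes a tiny country column by hand. The helper `one_hot_drop_first` is hypothetical, and the sample values are an assumption: in particular, Germany does not appear in the rows printed earlier and is included here only to give three categories. Like `OneHotEncoder`, the sketch sorts the categories before dropping the first one.

```python
# Hand-rolled one-hot encoding that drops the first category as the baseline.
# Illustrative only; the notebook itself uses sklearn's OneHotEncoder(drop='first').

def one_hot_drop_first(values):
    categories = sorted(set(values))   # sorted order decides which category is "first"
    kept = categories[1:]              # the dropped category becomes the all-zeros row
    rows = [[1.0 if v == c else 0.0 for c in kept] for v in values]
    return kept, rows

geography = ['France', 'Spain', 'France', 'Germany']   # assumed sample values
kept, encoded = one_hot_drop_first(geography)

print('Columns kept:', kept)
for country, row in zip(geography, encoded):
    print(f'{country:8s} -> {row}')
```

With three categories only two columns remain, and the dropped category is represented by a row of zeros. Keeping all three columns would make them sum to 1 in every row, which is the redundancy the dummy variable trap refers to.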

In [5]:
columntransformer = ColumnTransformer([
    ('Country_Category', OneHotEncoder(drop='first'), [1]),
    ('Gender_Category', OneHotEncoder(drop='first'), [2])],
    remainder = 'passthrough')
X = np.array(columntransformer.fit_transform(X))

print(X)

[[0.0 0.0 0.0 ... 1 1 101348.88]
 [0.0 1.0 0.0 ... 0 1 112542.58]
 [0.0 0.0 0.0 ... 1 0 113931.57]
 ...
 [0.0 0.0 0.0 ... 0 1 42085.58]
 [1.0 0.0 1.0 ... 1 0 92888.52]
 [0.0 0.0 0.0 ... 1 0 38190.78]]


Now it's time to split the data into test and training sets.

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2, random_state = 42)

## Step 3: Fit the Model

In [7]:
classifier = XGBClassifier(n_estimators = 100)
classifier.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

## Step 4: Evaluate the Model Performance
Evaluate the performance of the model on the test data that was held back.

In [8]:
y_pred = classifier.predict(X_test)

confusionMatrix = confusion_matrix(y_test, y_pred)

print(confusionMatrix)

[[1545   62]
 [ 212  181]]


As seen from the confusion matrix, a single evaluation of the model on the test set gives an accuracy of about 86% ((1545 + 181) / 2000 = 86.3%). Next we'll compare against K-Fold Cross Validation.
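
The accuracy can be worked out directly from the four cells of the confusion matrix. The sketch below copies the numbers printed above and assumes scikit-learn's layout, where rows are the true classes and columns are the predicted classes. Precision and recall for the churn class are included because accuracy alone can hide how many leavers the model misses.

```python
# Metrics derived by hand from the confusion matrix printed above.
# Layout (scikit-learn convention): rows = true class, columns = predicted class.
cm = [[1545, 62],
      [212, 181]]

(tn, fp), (fn, tp) = cm
total = tn + fp + fn + tp

accuracy = (tn + tp) / total    # share of all predictions that were correct
precision = tp / (tp + fp)      # of the customers predicted to leave, share that did
recall = tp / (tp + fn)         # of the customers who left, share that was caught

print(f'Accuracy:  {accuracy:.1%}')
print(f'Precision: {precision:.1%}')
print(f'Recall:    {recall:.1%}')
```

The recall figure shows that the model identifies fewer than half of the customers who actually left, even though its overall accuracy is high.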

### K-Fold Cross Validation

In [9]:
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)

for accuracy in accuracies:
    print(f'{accuracy*100:0.2f}%')

print(f'Average Accuracy: {accuracies.mean()*100:0.2f}%')
print(f'Accuracy Standard Deviation: {accuracies.std()*100:0.2f}%')

87.39%
86.89%
85.89%
85.52%
86.88%
86.75%
87.36%
84.61%
86.11%
85.73%
Average Accuracy: 86.31%
Accuracy Standard Deviation: 0.85%


When we run 10-Fold Cross Validation on the model, we see that our average accuracy is about 86%, with a standard deviation of a little under 1%. This result closely matches, and even slightly outperforms, what we were able to obtain with our ANN. Below we'll see if we can improve the performance of our model using a grid search.
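
The summary figures can be reproduced from the ten fold accuracies printed above using only the standard library. NumPy's `std()` defaults to the population standard deviation, so `statistics.pstdev` is the matching call here. The variable names are illustrative.

```python
# Recompute the cross-validation summary from the fold accuracies printed above.
from statistics import mean, pstdev

fold_accuracies = [87.39, 86.89, 85.89, 85.52, 86.88,
                   86.75, 87.36, 84.61, 86.11, 85.73]   # percentages, one per fold

mean_accuracy = mean(fold_accuracies)
std_accuracy = pstdev(fold_accuracies)   # population std, like numpy's default

print(f'Average Accuracy: {mean_accuracy:0.2f}%')
print(f'Accuracy Standard Deviation: {std_accuracy:0.2f}%')
print(f'Lowest fold: {min(fold_accuracies):0.2f}%, highest fold: {max(fold_accuracies):0.2f}%')
```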

### Grid Search
For Grid Search we need a dictionary of parameter values at which to evaluate the model. In our case a single dictionary is enough: it lists the candidate values for the number of trees, the tree depth, the learning rate, and several settings that constrain tree growth and weight updates. The grid search then evaluates every combination of these values. The dictionary is wrapped in a list because `GridSearchCV` also accepts a list of dictionaries.
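
Before launching a search it helps to count how many models it will train. The sketch below repeats the parameter grid used in the next cell and counts the combinations with `itertools.product`; the fold count of 5 matches the `cv` argument passed to `GridSearchCV`.

```python
# Count the candidates and total fits implied by the parameter grid.
from itertools import product

parameters = {
    'n_estimators': [50, 100, 500],
    'max_depth': [3, 5, 7],
    'learning_rate': [0, .33, .66, 1],
    'gamma': [1, 0.5, 0.01],
    'min_child_weight': [1, 2, 5, 10],
    'max_delta_step': [0, 1, 2, 5, 10],
}
n_folds = 5   # same as cv = 5 in the grid search

candidates = list(product(*parameters.values()))   # every combination of values
n_candidates = len(candidates)
n_fits = n_candidates * n_folds

print(f'{n_candidates} candidates x {n_folds} folds = {n_fits} fits')
```

A quarter of these candidates use a `learning_rate` of 0, which scales every tree's contribution to zero. Dropping that value would remove 540 candidates and would most likely not change the best result.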

In [10]:
parameters = [{
    'n_estimators':[50,100,500],
    'max_depth':[3,5,7],
    'learning_rate':[0,.33,.66,1],
    'gamma':[1,0.5,0.01],
    'min_child_weight':[1,2,5,10],
    'max_delta_step':[0,1,2,5,10]
}]

grid_search = GridSearchCV(estimator = classifier, param_grid = parameters, scoring = 'accuracy', cv = 5, iid = False, n_jobs = -1, verbose = 3)
grid_search = grid_search.fit(X_train, y_train)

print(f'Best Accuracy: {grid_search.best_score_*100:0.2f}%')
print('Best Parameters:')
for key, val in grid_search.best_params_.items():
    print(f'\t{key}: {val}')


Fitting 5 folds for each of 2160 candidates, totalling 10800 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   8 tasks      | elapsed:    2.0s
[Parallel(n_jobs=-1)]: Done 104 tasks      | elapsed:   11.1s
[Parallel(n_jobs=-1)]: Done 264 tasks      | elapsed:   31.4s
[Parallel(n_jobs=-1)]: Done 488 tasks      | elapsed:   59.3s
[Parallel(n_jobs=-1)]: Done 776 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 1128 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-1)]: Done 1544 tasks      | elapsed:  3.1min
[Parallel(n_jobs=-1)]: Done 2024 tasks      | elapsed:  4.0min
[Parallel(n_jobs=-1)]: Done 2568 tasks      | elapsed:  5.0min
[Parallel(n_jobs=-1)]: Done 3176 tasks      | elapsed:  6.2min
[Parallel(n_jobs=-1)]: Done 3848 tasks      | elapsed:  7.6min
[Parallel(n_jobs=-1)]: Done 4584 tasks      | elapsed:  9.2min
[Parallel(n_jobs=-1)]: Done 5384 tasks      | elapsed: 10.8min
[Parallel(n_jobs=-1)]: Done 6248 tasks      | elapsed: 12.4min
[Parallel(n_jobs=-1)]: Done 7176 tasks      | 

Best Accuracy: 86.37%
Best Parameters:
	gamma: 0.01
	learning_rate: 0.33
	max_delta_step: 0
	max_depth: 3
	min_child_weight: 10
	n_estimators: 50


Looking at the grid search above, we don't see any significant improvement over the default parameters (86.37% here versus 86.31% from the 10-fold cross validation). In this case it appears that the accuracy of our model may be limited by the quality of our dataset and by the imperfect correlations between its features and the outcome.