<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/images/IDSNlogo.png" width="300" alt="cognitiveclass.ai logo">
</center>

# Gradient Boosting for Classification with Python

Estimated time needed: **1.45** hours


In this notebook, you will learn Gradient Boosting for classification. Gradient Boosting is an additive model in which weak learners are added one at a time to minimize a loss function; AdaBoost is a particular case of it. This lab focuses on <a href="https://xgboost.readthedocs.io/en/stable/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML241ENSkillsNetwork31576874-2022-01-01">XGBoost</a>, an open-source software library that provides a regularizing gradient boosting framework. XGBoost can use different types of weak learners, called boosters, for classification and regression. We will focus on trees for classification.


Unlike Bagging and Random Forest, Gradient Boosting can overfit as more learners are added. As a result, Gradient Boosting requires hyperparameter tuning, which makes training take longer. One advantage of Gradient Boosting is that each classifier is smaller, so predictions are faster.
AdaBoost is a special case of Gradient Boosting. One weakness of AdaBoost is that misclassified samples can cause overfitting; Gradient Boosting can use different loss functions, which reduces this effect.
The following table shows the average accuracy and standard deviation for Random Forest (RF), Gradient Boosting (GB), and XGBoost, using both the default (D) and tuned (T) parameter settings. XGBoost does best, followed by GB.


## **Table of Contents**



<ol>
<li style="list-style-type: none;">
<ol>
<li>Objectives</li>
<li>Setup
<ol>
<li>Installing Required Libraries</li>
<li>Importing Required Libraries</li>
<li>Defining Helper Functions</li>
</ol>
</li>
<li>How Gradient Boosting Works (Optional)
    <ol>
    <li>How to Minimize Cost</li>
    <li>Example with Python</li>
    </ol>
</li>
<li>XGBoost
<ol>
<li>About the dataset</li>
<li>Gradient Boosting Parameters</li>
<li>Evaluation Metric on Second Dataset</li>
<li>Early Stopping</li>
<li>Parameters for Trees</li>

</ol>
</li>
<li>Cancer Data Example with GridSearchCV</li>
<li>Practice</li>
</ol>
</li>
</ol>


## Objectives

After completing this lab you will be able to:

*   Understand that Gradient Boosting is a linear combination of $T$ weak classifiers
*   Apply Gradient Boosting using XGBoost
*   Understand hyperparameter selection in XGBoost
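
To make the first objective concrete, here is a minimal sketch of gradient boosting written from scratch. It is not XGBoost's algorithm (XGBoost also uses second-order information and regularization); it is a plain first-order version that assumes one-dimensional inputs, decision stumps as weak learners, and the log loss. All names (`fit_stump`, `boost`, `raw_score`, `LEARNING_RATE`) and the toy data are hypothetical and chosen only for illustration.

```python
import math

LEARNING_RATE = 0.5  # assumed value, chosen only for this illustration


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(ys, scores):
    """Mean log loss of raw scores against 0/1 labels."""
    total = 0.0
    for y, s in zip(ys, scores):
        p = sigmoid(s)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(ys)


def fit_stump(xs, residuals):
    """Least-squares stump on 1-D inputs (assumes at least two distinct x values)."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=20):
    """Add one stump per round, each fitted to the negative gradient of the log loss."""
    stumps = []
    scores = [0.0] * len(xs)
    for _ in range(n_rounds):
        # For the log loss, the negative gradient w.r.t. the raw score is y - p.
        residuals = [y - sigmoid(s) for y, s in zip(ys, scores)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        scores = [s + LEARNING_RATE * stump(x) for s, x in zip(scores, xs)]
    return stumps


def raw_score(stumps, x):
    """The additive model: a scaled sum of all weak learners."""
    return sum(LEARNING_RATE * stump(x) for stump in stumps)


# Toy data that a single stump cannot classify perfectly.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0, 0, 1, 0, 1, 1]

stumps = boost(xs, ys)
predictions = [int(sigmoid(raw_score(stumps, x)) >= 0.5) for x in xs]
print("loss before boosting:", log_loss(ys, [0.0] * len(xs)))
print("loss after boosting: ", log_loss(ys, [raw_score(stumps, x) for x in xs]))
```

Each round fits a new weak learner to what the current model still gets wrong, so the training loss shrinks as learners are added; this is also why too many rounds can overfit.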


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm

In [6]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

In [3]:
import warnings

warnings.filterwarnings('ignore')

In [5]:
def get_accuracy(X_train, X_test, y_train, y_test, model):
    return {
        'test accuracy': accuracy_score(y_test, model.predict(X_test)),
        'train accuracy': accuracy_score(y_train, model.predict(X_train))
    }

In [8]:
def get_accuracy_boost(X, y, title, times=20, xlabel='Num Estimators',
                       learning_rate=(0.2, 0.4, 0.6, 1.0), n_est=100,
                       objective='binary:logistic'):
    """Plot mean train/test accuracy against the number of estimators.

    Accuracy is averaged over `times` random train/test splits for each
    learning rate. The default values of `times` and `objective` are assumptions.
    """
    lines_array = ['solid', '--', '-.', ':']
    n_estimators = [n * 2 for n in range(1, n_est // 2)]
    acc_shape = (times, len(learning_rate), len(n_estimators))
    train_acc = np.zeros(acc_shape)
    test_acc = np.zeros(acc_shape)

    for n in tqdm(range(times)):
        # Use a different seed per repetition so the average covers different splits
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=n)

        for n_trees in n_estimators:
            for j, lr in enumerate(learning_rate):
                model = XGBClassifier(objective=objective,
                                      learning_rate=lr,
                                      n_estimators=n_trees,
                                      eval_metric='mlogloss')
                model.fit(X_train, y_train)
                accuracy = get_accuracy(X_train, X_test, y_train, y_test, model)
                train_acc[n, j, (n_trees // 2) - 1] = accuracy['train accuracy']
                test_acc[n, j, (n_trees // 2) - 1] = accuracy['test accuracy']

    # Average over the repetitions, then plot train (left axis) and test (right axis)
    fig, ax1 = plt.subplots()
    mean_test = test_acc.mean(axis=0)
    mean_train = train_acc.mean(axis=0)
    ax2 = ax1.twinx()

    for j, (lr, line) in enumerate(zip(learning_rate, lines_array)):
        ax1.plot(n_estimators, mean_train[j, :], linestyle=line, color='b', label=f"Learning Rate {lr}")
        ax2.plot(n_estimators, mean_test[j, :], linestyle=line, color='r', label=str(lr))

    ax1.set_title(title)
    ax1.set_ylabel('Training Accuracy', color='b')
    ax1.legend()
    ax2.set_ylabel('Testing Accuracy', color='r')
    ax2.legend()
    ax1.set_xlabel(xlabel)
    plt.show()

# XGBoost

### About the dataset

We will use a telecommunications dataset for predicting customer churn. This is a historical customer dataset where each row represents one customer. The data is relatively easy to understand, and you may uncover insights you can use immediately. Typically, it is less expensive to keep customers than to acquire new ones, so the focus of this analysis is to predict the customers who will stay with the company.

This data set provides information to help you predict what behavior will help you to retain customers. You can analyze all relevant customer data and develop focused customer retention programs.

The dataset includes information about:

*   Customers who left within the last month – the column is called Churn
*   Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
*   Customer account information – how long they had been a customer, contract, payment method, paperless billing, monthly charges, and total charges
*   Demographic info about customers – gender, age range, and if they have partners and dependents


In [9]:
churn_df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/ChurnData.csv")
churn_df.head()

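| | tenure | age | address | income | ed | employ | equip | callcard | wireless | longmon | ... | pager | internet | callwait | confer | ebill | loglong | logtoll | lninc | custcat | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11.0 | 33.0 | 7.0 | 136.0 | 5.0 | 5.0 | 0.0 | 1.0 | 1.0 | 4.4 | ... | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.482 | 3.033 | 4.913 | 4.0 | 1.0 |
| 1 | 33.0 | 33.0 | 12.0 | 33.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9.45 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.246 | 3.24 | 3.497 | 1.0 | 1.0 |
| 2 | 23.0 | 30.0 | 9.0 | 30.0 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | 6.3 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.841 | 3.24 | 3.401 | 3.0 | 0.0 |
| 3 | 38.0 | 35.0 | 5.0 | 76.0 | 2.0 | 10.0 | 1.0 | 1.0 | 1.0 | 6.05 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.8 | 3.807 | 4.331 | 4.0 | 0.0 |
| 4 | 7.0 | 35.0 | 14.0 | 80.0 | 2.0 | 15.0 | 0.0 | 1.0 | 0.0 | 7.1 | ... | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.96 | 3.091 | 4.382 | 3.0 | 0.0 |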


In [10]:
churn_df.dtypes

tenure      float64
age         float64
address     float64
income      float64
ed          float64
employ      float64
equip       float64
callcard    float64
wireless    float64
longmon     float64
tollmon     float64
equipmon    float64
cardmon     float64
wiremon     float64
longten     float64
tollten     float64
cardten     float64
voice       float64
pager       float64
internet    float64
callwait    float64
confer      float64
ebill       float64
loglong     float64
logtoll     float64
lninc       float64
custcat     float64
churn       float64
dtype: object

In [11]:
(churn_df.max() == 1).all()

False

### Data pre-processing and selection

Let's select some features for the modeling. We also change the target data type to integer, as required by the scikit-learn algorithms:

In [13]:
FEATURE_COLUMNS = ['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip']
Y_COLUMN = 'churn'

In [15]:
X = churn_df[FEATURE_COLUMNS].astype(int)
y = churn_df[Y_COLUMN].astype(int)

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

`objective`: Specifies the learning task and the corresponding learning objective, or a custom objective function to be used. For example:

*   `binary:logistic`: binary classification
*   `multi:softprob`: multi-class classification

`learning_rate`: Boosting learning rate, also called `eta`
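
As a rough illustration of what these two parameters control, the sketch below maps raw boosted scores to probabilities: a sigmoid for `binary:logistic` and a softmax for `multi:softprob`. It also shows how a smaller `eta` shrinks each tree's contribution. The function names and the tree outputs are hypothetical, and reusing the same tree outputs for both values of `eta` is a simplification, since in a real fit the trees themselves would differ.

```python
import math


def binary_logistic(margin):
    """Map a raw boosted score to P(y = 1) with the sigmoid."""
    return 1.0 / (1.0 + math.exp(-margin))


def multi_softprob(margins):
    """Map one raw score per class to class probabilities with the softmax."""
    shift = max(margins)  # subtract the maximum for numerical stability
    exps = [math.exp(m - shift) for m in margins]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical outputs of three trees for a single sample.
tree_outputs = [0.8, -0.3, 0.5]

for eta in (1.0, 0.1):
    # The learning rate scales each tree's contribution to the raw score.
    margin = sum(eta * output for output in tree_outputs)
    print(f"eta={eta}: P(y=1) = {binary_logistic(margin):.3f}")

# Hypothetical per-class raw scores for a three-class problem.
print(multi_softprob([1.0, 2.0, 3.0]))
```

With a small learning rate each tree moves the prediction only slightly, so more trees are usually needed to reach the same fit.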


The `booster` parameter sets the type of learner; in this lab we will stick with trees. Let's experiment with some of the Gradient Boosting parameters:


If the labels $y$ took the values $-1$ and $1$, the classifier would have the form below; `xgboost` instead returns predictions in the same form as the label `y`.

$H(x) = \text{sign}( h_1(x) + h_2(x) + h_3(x) + h_4(x) + h_5(x) )$
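
The formula can be sketched in a few lines of Python. The five weak learners below are hypothetical hand-written stumps (not trees fitted by `xgboost`), and `to_zero_one` is an illustrative helper that maps the $-1$/$1$ output onto 0/1 labels such as the `churn` column.

```python
def make_stump(threshold, weight):
    """Return a weak learner that votes +weight above the threshold and -weight below."""
    return lambda x: weight if x > threshold else -weight


# Five hypothetical weak learners h_1 ... h_5.
weak_learners = [
    make_stump(2.0, 0.4),
    make_stump(2.5, 0.3),
    make_stump(3.0, 0.2),
    make_stump(3.5, 0.3),
    make_stump(4.0, 0.1),
]


def sign(value):
    return 1 if value >= 0 else -1


def H(x):
    """Strong classifier: the sign of the sum of the weak learners."""
    return sign(sum(h(x) for h in weak_learners))


def to_zero_one(label):
    """Map a -1/1 prediction onto 0/1 labels."""
    return (label + 1) // 2


for x in (0.0, 3.2, 5.0):
    print(x, H(x), to_zero_one(H(x)))
```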

Unlike AdaBoost, there is no $\alpha_t$ term, although some versions have a similar term. We can fit all the weak learners in $H(x)$ and then make a prediction:


