GRADIENT BOOSTING ALGORITHM FOR IRIS SPECIES CLASSIFICATION

In [1]:
# Import the required packages
from sklearn import datasets

import pandas as pd
import numpy as np

from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score,confusion_matrix
from sklearn.metrics import classification_report

from sklearn.ensemble import GradientBoostingClassifier

from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import RandomizedSearchCV

import statistics

import warnings
warnings.filterwarnings("ignore")

In [2]:
# Load data
iris = datasets.load_iris()

# Description About data set
print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :

In [3]:
# Feature matrix (the four measurements of each flower)
train = pd.DataFrame(iris.data,columns = iris.feature_names)

# Target labels (species encoded as 0, 1, 2)
target = pd.DataFrame(iris.target,columns = ['species'])

print(train.head())
print(target.head())

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2
   species
0        0
1        0
2        0
3        0
4        0


In [4]:
X_train,X_test,y_train,y_test = train_test_split(train,
                                                 target,
                                                 test_size=0.2,
                                                 random_state=10)

model = GradientBoostingClassifier(random_state=6,
                                   n_estimators=10)
model.fit(X_train,y_train)

y_predict = model.predict(X_test)
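
The cell above fits a model and predicts on the hold-out set, but it does not score those predictions. Below is a minimal sketch of how they could be evaluated with the metrics imported in the first cell. The data is reloaded so the block runs by itself, and the names `test_accuracy` and `cm` are introduced here for illustration.

```python
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Same split and model settings as the cell above
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=10)

model = GradientBoostingClassifier(random_state=6, n_estimators=10)
model.fit(X_train, y_train)
y_predict = model.predict(X_test)

# Overall accuracy on the held-out flowers
test_accuracy = accuracy_score(y_test, y_predict)
print("Hold-out accuracy:", test_accuracy)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_predict, labels=[0, 1, 2])
print(cm)

# Per-class precision, recall and F1
print(classification_report(y_test, y_predict,
                            labels=[0, 1, 2],
                            target_names=iris.target_names))
```

The confusion matrix shows which species get confused with each other, which a single accuracy figure hides.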

STRATIFIED K-FOLD CROSS VALIDATION

In [10]:
# Split the data into 10 folds that preserve the class proportions
skf = StratifiedKFold(n_splits=10)

Stratified_score = []
for train_index, test_index in skf.split(train, target):

    # Select the rows that belong to this fold
    X_train, X_test = train.iloc[train_index], train.iloc[test_index]
    y_train, y_test = target.iloc[train_index], target.iloc[test_index]

    # Fit a fresh model on the training folds and score it on the held-out fold
    model = GradientBoostingClassifier()
    model.fit(X_train,y_train)
    y_predict = model.predict(X_test)
    Stratified_score.append(accuracy_score(y_test,y_predict))

In [6]:
print("Minimum accuracy we get is {}".format(min(Stratified_score)))
print("Maximum accuracy we get is {}".format(max(Stratified_score)))
print("Average accuracy we get is {}".format(
    statistics.mean(Stratified_score)))

# y_test and y_predict still hold the last fold of the loop above
print("Accuracy of gradient boosting model on the last fold",
      accuracy_score(y_test,y_predict))

Minimum accuracy we get is 0.8666666666666667
Maximum accuracy we get is 1.0
Average accuracy we get is 0.96
Accuracy of gradient boosting model on the last fold 1.0
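
The manual fold loop can also be written more compactly with `cross_val_score`. This is a sketch rather than a drop-in replacement: it works on the raw arrays instead of the DataFrames, and it sets `random_state=0` on the model (an addition here) so that repeated runs give the same scores.

```python
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = datasets.load_iris(return_X_y=True)

# Same 10-fold stratified splitter as in the loop
skf = StratifiedKFold(n_splits=10)

# random_state is fixed only to make the sketch reproducible
model = GradientBoostingClassifier(random_state=0)

# One accuracy value per fold
scores = cross_val_score(model, X, y, cv=skf, scoring="accuracy")

print("Minimum fold accuracy:", scores.min())
print("Maximum fold accuracy:", scores.max())
print("Mean fold accuracy:", scores.mean())
```

`cross_val_score` clones the estimator for every fold, so no state leaks between folds.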


<h3>Hyperparameter Tuning</h3>


**min_samples_split**
- Defines the minimum number of samples (or observations) a node must contain before it can be considered for splitting.
- Used to control over-fitting. Higher values prevent the model from learning relations that are highly specific to the particular sample selected for a tree.
- Values that are too high can lead to under-fitting, so it should be tuned using cross-validation (CV).

**min_samples_leaf**
- Defines the minimum number of samples (or observations) required in a terminal node, or leaf.
- Used to control over-fitting, similar to min_samples_split.
- Lower values should generally be chosen for imbalanced class problems, because the regions in which the minority class forms the majority will be very small.

**min_weight_fraction_leaf**
- Similar to min_samples_leaf, but defined as a fraction of the total number of observations (more precisely, of the total sample weight) instead of an integer.
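
In scikit-learn, min_samples_split and min_samples_leaf accept either an integer count or a fraction of the training samples. The sketch below uses the fractional form, which is also what the search grid later in this notebook uses, and checks the smallest leaf that is actually grown. The variable names and the small `n_estimators=5` are choices made here for illustration.

```python
import math

from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_iris(return_X_y=True)
n_samples = X.shape[0]

# Fractions of the kind used in this notebook's search grid
split_fraction = 0.4
leaf_fraction = 0.1

# A fraction is converted to a count with ceil(fraction * n_samples)
min_split_count = math.ceil(split_fraction * n_samples)
min_leaf_count = math.ceil(leaf_fraction * n_samples)
print("Samples needed to split a node:", min_split_count)
print("Samples needed in a leaf:", min_leaf_count)

model = GradientBoostingClassifier(
    min_samples_split=split_fraction,
    min_samples_leaf=leaf_fraction,
    n_estimators=5,
    random_state=0,
).fit(X, y)

# Smallest leaf over every tree in the ensemble (a leaf has no left child)
smallest_leaf = min(
    int(tree.tree_.n_node_samples[tree.tree_.children_left == -1].min())
    for tree in model.estimators_.ravel()
)
print("Smallest leaf actually grown:", smallest_leaf)
```

No leaf ends up smaller than the count derived from the fraction, which is how these parameters limit over-fitting.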

**max_depth**
- The maximum depth of a tree.
- Used to control over-fitting, as a greater depth allows the model to learn relations that are very specific to a particular sample. Should be tuned using CV.

**max_leaf_nodes**
- The maximum number of terminal nodes or leaves in a tree.
- Can be defined in place of max_depth. Since binary trees are created, a depth of ‘n’ would produce a maximum of 2^n leaves.
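- If this is defined, trees are grown best-first until the leaf limit is reached. Older scikit-learn documentation states that max_depth is then ignored; recent releases apply both limits.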
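
The 2^n bound on leaves can be checked directly on a fitted ensemble. This sketch fits small models at several depths and counts the leaves of the largest tree in each, then does the same with max_leaf_nodes. The names `leaf_counts` and `capped_leaves` are introduced here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_iris(return_X_y=True)

# Largest number of leaves found in any tree, for each depth limit
leaf_counts = {}
for depth in (1, 2, 3):
    model = GradientBoostingClassifier(
        max_depth=depth, n_estimators=5, random_state=0).fit(X, y)
    leaf_counts[depth] = max(
        tree.get_n_leaves() for tree in model.estimators_.ravel())
    print("max_depth =", depth,
          "| largest tree has", leaf_counts[depth], "leaves",
          "| bound 2^n =", 2 ** depth)

# Capping the number of leaves directly instead of the depth
capped = GradientBoostingClassifier(
    max_leaf_nodes=4, n_estimators=5, random_state=0).fit(X, y)
capped_leaves = max(
    tree.get_n_leaves() for tree in capped.estimators_.ravel())
print("max_leaf_nodes = 4 | largest tree has", capped_leaves, "leaves")
```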

**max_features**
- The number of features to consider when searching for the best split. These are selected at random.
- As a rule of thumb, the square root of the total number of features works well, but values up to 30-40% of the total number of features are worth checking.
- Higher values can lead to over-fitting, but this varies from case to case.



**learning_rate**
- Determines the impact of each tree on the final outcome. GBM starts with an initial estimate, which is updated using the output of each tree; the learning rate controls the magnitude of each update.
- Lower values are generally preferred, as they make the model robust to the specific characteristics of each tree and so allow it to generalize well.
- Lower values require a higher number of trees to model all the relations, and are therefore computationally more expensive.
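
The role of the learning rate can be written as a single update rule. The symbols are notation introduced here for illustration: $F_m$ is the ensemble's estimate after $m$ trees, $h_m$ is the $m$-th tree, and $\nu$ is learning_rate.

```latex
F_m(x) = F_{m-1}(x) + \nu \, h_m(x), \qquad \nu > 0
```

With $\nu = 0.1$, each tree contributes only a tenth of its fitted correction, which is why smaller learning rates need more trees to reach the same fit.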

**n_estimators**
- The number of sequential trees to be modeled.
- Although GBM is fairly robust to a higher number of trees, it can still overfit at some point. Hence, this should be tuned using CV for a particular learning rate.
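
One way to see how learning_rate and n_estimators interact is to track test accuracy as trees are added. The sketch below uses `staged_predict` for this. The two learning rates, the 100 trees and the `random_state` values are choices made here, not settings taken from this notebook's search.

```python
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10)

# Test accuracy after each boosting stage, per learning rate
staged_accuracy = {}
for learning_rate in (0.01, 0.1):
    model = GradientBoostingClassifier(
        learning_rate=learning_rate, n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    # staged_predict yields predictions after 1, 2, ..., n_estimators trees
    staged_accuracy[learning_rate] = [
        accuracy_score(y_test, y_stage)
        for y_stage in model.staged_predict(X_test)]

for learning_rate, accuracies in staged_accuracy.items():
    print("learning_rate =", learning_rate,
          "| after 1 tree:", accuracies[0],
          "| after 10 trees:", accuracies[9],
          "| after 100 trees:", accuracies[-1])
```

Because a single fitted model exposes every intermediate stage, this avoids refitting once per candidate value of n_estimators.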

**subsample**
- The fraction of observations to be selected for each tree. Selection is done by random sampling.
- Values slightly less than 1 make the model more robust by reducing the variance.
- Values around 0.8 generally work well, but can be fine-tuned further.
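
When subsample is below 1, each tree leaves some observations out, and scikit-learn records how the loss on those out-of-bag observations changes at every stage. A minimal sketch follows; the value 0.8 comes from the text above, while `n_estimators=50` and `random_state=0` are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_iris(return_X_y=True)

# Each tree is fitted on a random 80% of the rows
model = GradientBoostingClassifier(
    subsample=0.8, n_estimators=50, random_state=0)
model.fit(X, y)

# oob_improvement_[i] is the change in out-of-bag loss from adding stage i;
# the attribute only exists when subsample < 1.0
oob_improvement = model.oob_improvement_
print("Stages recorded:", oob_improvement.shape[0])
print("Cumulative out-of-bag improvement:", oob_improvement.sum())
```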

**loss**
- Refers to the loss function that the boosting procedure minimizes.
- Different values are available for classification and regression. The defaults generally work well; choose other values only if you understand their impact on the model.

In [7]:
# Learning rate for gradient boosting
learning_rates = [0.1, 0.05, 0.001, 0.01]

# Number of boosting stages (trees) in the ensemble
n_estimators = [1, 2, 4, 8, 16, 32, 64, 100, 200]

# Number of features to consider at every split
# (range() excludes its stop value, so the candidates are 1, 2 and 3 of the 4 features)
max_features = list(range(1,train.shape[1]))

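# Maximum number of levels in each tree
# (np.linspace returns floats; recent scikit-learn releases expect an integer
# max_depth, in which case append .astype(int))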
max_depths = np.linspace(1, 32, 32, endpoint=True)


# Minimum fraction of samples required to split a node
min_samples_splits = np.linspace(0.1, 1.0, 10, endpoint=True)

# Minimum fraction of samples required at each leaf node
min_samples_leafs = np.linspace(0.1, 0.5, 5, endpoint=True)


# Create the random grid
random_grid = {'learning_rate': learning_rates,
               'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depths,
               'min_samples_split': min_samples_splits,
               'min_samples_leaf': min_samples_leafs}
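
Instead of fixed lists, a random search can also draw from continuous or integer distributions, since RandomizedSearchCV accepts any object with an `rvs` method. The sketch below covers roughly the same ranges as the grid above using `scipy.stats`, and previews a few draws with `ParameterSampler`. The choice of distributions and the name `param_distributions` are assumptions made here.

```python
from scipy.stats import loguniform, randint, uniform
from sklearn.model_selection import ParameterSampler

param_distributions = {
    # Log-uniform between 0.001 and 0.1
    "learning_rate": loguniform(1e-3, 1e-1),
    # randint excludes its upper bound, so these are 1..200, 1..4 and 1..32
    "n_estimators": randint(1, 201),
    "max_features": randint(1, 5),
    "max_depth": randint(1, 33),
    # uniform(loc, scale) covers [loc, loc + scale]
    "min_samples_split": uniform(0.1, 0.9),
    "min_samples_leaf": uniform(0.1, 0.4),
}

# Preview five candidate settings without fitting any model
samples = list(ParameterSampler(param_distributions, n_iter=5, random_state=1))
for params in samples:
    print(params)
```

Drawing max_depth from `randint` also yields integers directly.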

<h3>Random Search</h3>

In [8]:
X_train,X_test,y_train,y_test=train_test_split(train,
                                               target,
                                               test_size=0.2,
                                               random_state=10)

random_search = RandomizedSearchCV(GradientBoostingClassifier(), 
                                   random_grid, 
                                   random_state=1, 
                                   n_iter=100, 
                                   cv=5, 
                                   verbose=0, 
                                   n_jobs=-1)

random_search.fit(X_train,y_train)

# Print the best hyperparameters found by the search
print(random_search.best_params_)

{'n_estimators': 16, 'min_samples_split': 0.4, 'min_samples_leaf': 0.1, 'max_features': 2, 'max_depth': 17.0, 'learning_rate': 0.01}


In [9]:
model = GradientBoostingClassifier(n_estimators =  16, 
                                   min_samples_split =  0.4, 
                                   min_samples_leaf =  0.1, 
                                   max_features = 2, 
                                   max_depth =  17.0, 
                                   learning_rate =  0.01)

model.fit(X_train,y_train)
y_predict = model.predict(X_test)

print("Accuracy of gradient boosting model for classifying iris species",
      accuracy_score(y_test,y_predict))
print("\nCurrently used params\n\n",model.get_params())

Accuracy of gradient boosting model for classifying iris species 0.9333333333333333

Currently used params

 {'ccp_alpha': 0.0, 'criterion': 'friedman_mse', 'init': None, 'learning_rate': 0.01, 'loss': 'deviance', 'max_depth': 17.0, 'max_features': 2, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 0.1, 'min_samples_split': 0.4, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 16, 'n_iter_no_change': None, 'random_state': None, 'subsample': 1.0, 'tol': 0.0001, 'validation_fraction': 0.1, 'verbose': 0, 'warm_start': False}
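
Copying the best parameters into a new model by hand works, but with the default `refit=True` the search object already holds a model refitted on the whole training set. Below is a minimal sketch using `best_estimator_`. The grid is a deliberately small, hypothetical one so that the block runs quickly, and the `random_state` values are additions made here.

```python
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10)

# Hypothetical small grid, only for illustration
small_grid = {
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [16, 32, 64],
    "max_depth": [1, 2, 3],
}

random_search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    small_grid,
    n_iter=5,
    cv=5,
    random_state=1)
random_search.fit(X_train, y_train)

# The refitted model carries the best parameters; no manual copying needed
best_model = random_search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)

print("Best parameters:", random_search.best_params_)
print("Mean CV accuracy of best parameters:", random_search.best_score_)
print("Hold-out accuracy:", test_accuracy)
```

This keeps the reported parameters and the evaluated model in sync if the search is rerun with different settings.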
