<center><h1>Decision Tree</h1></center>

We will use the scikit-learn library to build a decision tree classifier on the iris dataset. The dataset describes 3 classes of the iris plant with the following attributes:

- sepal length
- sepal width
- petal length
- petal width
- class:
    - Iris Setosa
    - Iris Versicolour
    - Iris Virginica

The task is to predict the class of the iris plant based on the attributes.

In [1]:
#Importing required libraries
import pandas as pd
import numpy as np
import statistics

import warnings
warnings.filterwarnings('ignore')

from sklearn.datasets import load_iris

from sklearn.tree import DecisionTreeClassifier

from sklearn.model_selection import train_test_split

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import KFold

from sklearn.metrics import accuracy_score,confusion_matrix
from sklearn.metrics import classification_report

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

from pprint import pprint


The scikit-learn dataset library already has the iris dataset. You can either use the dataset from the source or import it from the scikit-learn dataset library.

In [2]:
#Loading the iris data
data = load_iris()
print('Classes to predict: ', data.target_names)

Classes to predict:  ['setosa' 'versicolor' 'virginica']


There are three classes of iris plants: 'setosa', 'versicolor' and 'virginica'. Now, we have imported the iris data in the variable 'data'. We will now extract the attribute data and the corresponding labels. We can extract the attributes and labels by calling .data and .target as shown below:

In [3]:
# Features (attributes)
X = pd.DataFrame(data.data,columns = data.feature_names)

# Labels (target classes)
y = pd.DataFrame(data.target,columns = ['species'])

print('Number of examples in the data:', X.shape[0])

Number of examples in the data: 150


There are 150 examples (samples) in the data. The variable 'X' contains the attributes of each iris plant; calling X.head(4) displays the 4 attributes of the first four plants.

Now that we have extracted the data attributes and corresponding labels, we will split them to form train and test datasets. For this purpose, we will use scikit-learn's 'train_test_split' function, which takes the attributes and labels as inputs and produces the train and test sets.

In [4]:
#Using the train_test_split to create train and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 47, test_size = 0.25)
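
To make the effect of test_size concrete, here is a minimal sketch of a shuffled split in plain Python. It assumes the test share is rounded up, which agrees with the 38 test rows reported in the evaluation further down. The helper name `split_indices` is ours, and the shuffle does not reproduce the permutation scikit-learn draws for random_state = 47.

```python
import math
import random

def split_indices(n_samples, test_size=0.25, seed=47):
    """Shuffle row indices and cut them into train and test parts."""
    n_test = math.ceil(test_size * n_samples)  # assumption: test share is rounded up
    n_train = n_samples - n_test
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # illustrative shuffle, not scikit-learn's
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = split_indices(150)
print(len(train_idx), len(test_idx))
```

Every row lands in exactly one of the two parts, so the model is never evaluated on rows it was trained on.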

Since this is a classification problem, we will use the DecisionTreeClassifier class from the sklearn library. We will set 'criterion' to 'entropy', so that candidate splits are scored by information gain.
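
Information gain is the drop in entropy when a node is split into children. The sketch below computes it in plain Python for a toy node. The function names and the toy labels are illustrative, not part of scikit-learn.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = ["setosa"] * 4 + ["versicolor"] * 4

# A split that separates the classes perfectly removes all uncertainty.
pure_gain = information_gain(parent, parent[:4], parent[4:])

# A split that leaves both children as mixed as the parent gains nothing.
mixed_child = ["setosa", "setosa", "versicolor", "versicolor"]
mixed_gain = information_gain(parent, mixed_child, mixed_child)

print(pure_gain, mixed_gain)
```

The tree prefers the split with the larger gain, here the one that isolates each class.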

<h3>Training the decision tree using entropy</h3>

In [5]:
#Importing the Decision tree classifier from the sklearn library.
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(criterion = 'entropy')

Next, we will fit the classifier on the train attributes and labels.

In [6]:
#Training the decision tree classifier. 
clf.fit(X_train, y_train)

#Predicting labels on the test set.
y_pred =  clf.predict(X_test)

We will now evaluate the predicted classes using some metrics. For this case, we will use 'accuracy_score' to calculate the accuracy of the predicted labels.

In [7]:
#Evaluating accuracy on the train and test sets
print('Accuracy Score on train data: ', accuracy_score(y_true=y_train, y_pred=clf.predict(X_train)))
print('Accuracy Score on test data: {}\n\n'.format(accuracy_score(y_true=y_test, y_pred=y_pred)))
print(classification_report(y_test,y_pred))

Accuracy Score on train data:  1.0
Accuracy Score on test data: 0.9473684210526315


              precision    recall  f1-score   support

           0       1.00      1.00      1.00        15
           1       0.80      1.00      0.89         8
           2       1.00      0.87      0.93        15

    accuracy                           0.95        38
   macro avg       0.93      0.96      0.94        38
weighted avg       0.96      0.95      0.95        38
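
The numbers in the report can be traced back to a confusion matrix. The matrix below is inferred from the report (supports of 15, 8 and 15, with two class-2 plants predicted as class 1); it is an assumption consistent with the printed figures, not a recorded output.

```python
# Rows are true classes, columns are predicted classes (0, 1, 2).
# Inferred from the classification report, not taken from a recorded output.
confusion = [
    [15, 0, 0],
    [0, 8, 0],
    [0, 2, 13],
]

def precision(cm, k):
    """Correct predictions of class k divided by all predictions of class k."""
    return cm[k][k] / sum(row[k] for row in cm)

def recall(cm, k):
    """Correct predictions of class k divided by all true members of class k."""
    return cm[k][k] / sum(cm[k])

def f1(cm, k):
    """Harmonic mean of precision and recall for class k."""
    p, r = precision(cm, k), recall(cm, k)
    return 2 * p * r / (p + r)

total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[k][k] for k in range(3)) / total

for k in range(3):
    print(k, round(precision(confusion, k), 2), round(recall(confusion, k), 2), round(f1(confusion, k), 2))
print("accuracy", round(accuracy, 2))
```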



<h3>Training the decision tree using gini</h3>

In [8]:
#Creating a decision tree classifier that uses the gini criterion.
clf = DecisionTreeClassifier(criterion = 'gini')

In [9]:
#Training the decision tree classifier. 
clf.fit(X_train, y_train)

#Predicting labels on the test set.
y_pred =  clf.predict(X_test)

In [10]:
#Evaluating accuracy on the train and test sets
print('Accuracy Score on train data: ', accuracy_score(y_true=y_train, y_pred=clf.predict(X_train)))
print('Accuracy Score on test data: {}\n\n'.format(accuracy_score(y_true=y_test, y_pred=y_pred)))
print(classification_report(y_test,y_pred))

Accuracy Score on train data:  1.0
Accuracy Score on test data: 0.9473684210526315


              precision    recall  f1-score   support

           0       1.00      1.00      1.00        15
           1       0.80      1.00      0.89         8
           2       1.00      0.87      0.93        15

    accuracy                           0.95        38
   macro avg       0.93      0.96      0.94        38
weighted avg       0.96      0.95      0.95        38
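
Both criteria gave the same scores on this split. For reference, Gini impurity is one minus the sum of squared class proportions. Like entropy, it is zero for a pure node and largest for a balanced one, so the two measures often rank splits similarly. The helper below is a plain-Python sketch with an illustrative name.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

balanced = [0] * 5 + [1] * 5 + [2] * 5   # three equally frequent classes
pure = [0] * 15                          # a single class

print(gini(balanced))  # 1 - 3 * (1/3)**2, about 0.667
print(gini(pure))      # 0.0
```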



<h3>K-Fold Cross Validation</h3>

In [11]:
kf = KFold(n_splits=10)

k_fold_score = []
for train_index, test_index in kf.split(X,y):
    
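    # Note: X_train/X_test/y_train/y_test are reassigned on every fold, so after
    # the loop they hold the last fold's split, not the train_test_split one.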
    
    X_train, X_test = X.iloc[list(train_index),:], X.iloc[list(test_index),:]
    y_train, y_test = y.iloc[list(train_index),:], y.iloc[list(test_index),:]
    
    model = DecisionTreeClassifier()
    model.fit(X_train,y_train)
    y_predict = model.predict(X_test)
    k_fold_score.append(accuracy_score(y_test,y_predict))
   

In [12]:
print("Minimum accuracy we get is {}".format(min(k_fold_score)))
print("Maximum accuracy we get is {}".format(max(k_fold_score)))
print("Average accuracy we get is {}".format(statistics.mean(k_fold_score)))

Minimum accuracy we get is 0.8
Maximum accuracy we get is 1.0
Average accuracy we get is 0.9400000000000001
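
KFold without shuffling cuts the rows into consecutive blocks. Because load_iris returns the rows grouped by class (50 setosa, 50 versicolor, 50 virginica), most test folds then contain a single class. The sketch below reproduces that block layout in plain Python; `contiguous_folds` is an illustrative helper, not the scikit-learn implementation.

```python
def contiguous_folds(n_samples, n_splits):
    """Yield (train, test) index lists built from consecutive blocks, no shuffling."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)  # spread any remainder over the first folds
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# Labels in the order load_iris returns them: grouped by class.
labels = [0] * 50 + [1] * 50 + [2] * 50

classes_per_test_fold = [
    sorted({labels[i] for i in test})
    for _, test in contiguous_folds(len(labels), 10)
]
print(classes_per_test_fold)
```

Passing shuffle=True (together with a random_state) to KFold is one way to mix the classes across folds.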


<h3>Stratified K Fold Cross Validation</h3>

In [13]:
skf = StratifiedKFold(n_splits=10)

Stratified_score = []
for train_index, test_index in skf.split(X, y):
    
    X_train, X_test = X.iloc[list(train_index),:], X.iloc[list(test_index),:]
    y_train, y_test = y.iloc[list(train_index),:], y.iloc[list(test_index),:]
    
    model = DecisionTreeClassifier()
    model.fit(X_train,y_train)
    y_predict = model.predict(X_test)
    Stratified_score.append(accuracy_score(y_test,y_predict))

In [14]:
print("Minimum accuracy we get is {}".format(min(Stratified_score)))
print("Maximum accuracy we get is {}".format(max(Stratified_score)))
print("Average accuracy we get is {}".format(statistics.mean(Stratified_score)))

Minimum accuracy we get is 0.8666666666666667
Maximum accuracy we get is 1.0
Average accuracy we get is 0.9533333333333334
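
Stratification keeps the class proportions of the full dataset in every fold. A simplified way to picture it is to deal the rows of each class out to the folds in turn. The round-robin helper below is our own sketch under that assumption and is not the exact assignment scikit-learn uses.

```python
from collections import Counter, defaultdict

def stratified_test_folds(labels, n_splits):
    """Assign each row to a test fold, dealing every class out round-robin."""
    fold_of = []
    seen = defaultdict(int)  # how many rows of each class have been assigned so far
    for label in labels:
        fold_of.append(seen[label] % n_splits)
        seen[label] += 1
    return fold_of

labels = [0] * 50 + [1] * 50 + [2] * 50
fold_of = stratified_test_folds(labels, 10)

fold_counts = [
    Counter(labels[i] for i, f in enumerate(fold_of) if f == fold)
    for fold in range(10)
]
print(dict(fold_counts[0]))
```

With 50 plants per class and 10 folds, each test fold holds 5 plants of every class, so each fold is scored on all three classes.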


<h3><center>Parameter Tuning</center></h3>

<h4><u>Grid Search</u></h4>

<h4>Parameters</h4>

1. **criterion : {“gini”, “entropy”}, default=”gini”**

    The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.
        


2. **max_depth : int, default=None**

    The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.


3. **min_samples_split : int or float, default=2**

    The minimum number of samples required to split an internal node:

    If int, then consider min_samples_split as the minimum number.

    If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.
    

4. **max_features : int, float or {“auto”, “sqrt”, “log2”}, default=None**

    The number of features to consider when looking for the best split:

    If int, then consider max_features features at each split.

    If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.

    If “auto”, then max_features=sqrt(n_features). This option was deprecated in scikit-learn 1.1 and removed in 1.3; use “sqrt” on newer versions.

    If “sqrt”, then max_features=sqrt(n_features).

    If “log2”, then max_features=log2(n_features).

    If None, then max_features=n_features.
    
    Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features.
    

5. **min_samples_leaf : int or float, default=1**

    The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

    If int, then consider min_samples_leaf as the minimum number.

    If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.
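
The int and float rules above can be sketched as two small helpers. They follow the formulas quoted in this list; the function names are illustrative, and scikit-learn's internal handling may differ in edge cases.

```python
import math

def resolve_min_samples(value, n_samples):
    """Turn an int or float setting into an absolute sample count."""
    if isinstance(value, float):
        return math.ceil(value * n_samples)  # a float is a fraction of the samples
    return value                             # an int is used as given

def resolve_max_features(value, n_features):
    """Turn a max_features setting into a number of features, per the rules above."""
    if value is None:
        return n_features
    if value in ("auto", "sqrt"):
        return int(math.sqrt(n_features))
    if value == "log2":
        return int(math.log2(n_features))
    if isinstance(value, float):
        return int(value * n_features)
    return value                             # already an int

print(resolve_min_samples(0.1, 135))    # ceil(13.5)
print(resolve_max_features("sqrt", 4))  # the iris data has 4 features
```

On the 4 iris features, both “sqrt” and “log2” resolve to 2 features per split.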

In [15]:
param_grid = {"max_depth": [3, None],
              "min_samples_leaf": [1, 2, 4],
              "criterion": ["gini", "entropy"],
              "max_features": ['auto', 'sqrt']} 

pprint(param_grid)

{'criterion': ['gini', 'entropy'],
 'max_depth': [3, None],
 'max_features': ['auto', 'sqrt'],
 'min_samples_leaf': [1, 2, 4]}


In [16]:
rf = DecisionTreeClassifier()

rf_random = GridSearchCV(rf, 
                         param_grid, 
                         cv = 3, 
                         verbose=2, 
                         n_jobs = -1)

rf_random.fit(X_train,y_train)

print("Best Parameters",rf_random.best_params_)

Fitting 3 folds for each of 24 candidates, totalling 72 fits
Best Parameters {'criterion': 'entropy', 'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1}
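
The figures in the log line follow from the grid itself: every combination of the listed values is one candidate, and each candidate is fitted once per fold. A plain-Python sketch of that enumeration:

```python
from itertools import product

param_grid = {
    "max_depth": [3, None],
    "min_samples_leaf": [1, 2, 4],
    "criterion": ["gini", "entropy"],
    "max_features": ["auto", "sqrt"],
}

names = sorted(param_grid)
candidates = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]

n_folds = 3  # the cv value passed to the grid search above
print(len(candidates), len(candidates) * n_folds)
```

That is 2 x 3 x 2 x 2 = 24 candidates and 24 x 3 = 72 fits.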


In [17]:
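#Creating a decision tree classifier with hand-set hyperparameters.
#Note: these differ from the best parameters printed above; to reuse the search
#result directly, pass **rf_random.best_params_ to DecisionTreeClassifier.
clf = DecisionTreeClassifier(criterion = 'gini',
                            max_depth = None,
                            max_features = 'auto',
                            min_samples_leaf = 1)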


#Training the decision tree classifier. 
clf.fit(X_train, y_train)

#Predicting labels on the test set.
y_pred =  clf.predict(X_test)


#Evaluating accuracy on the train and test sets
#(accuracy_score and classification_report were imported in the first cell)

print('Accuracy Score on train data: ', 
      accuracy_score(y_true=y_train, y_pred=clf.predict(X_train)))
print('Accuracy Score on test data: {}\n\n'.format(accuracy_score(y_true=y_test, 
                                                                  y_pred=y_pred)
                                                                  ))
print(classification_report(y_test,y_pred))

Accuracy Score on train data:  1.0
Accuracy Score on test data: 0.9333333333333333


              precision    recall  f1-score   support

           0       1.00      1.00      1.00         5
           1       0.83      1.00      0.91         5
           2       1.00      0.80      0.89         5

    accuracy                           0.93        15
   macro avg       0.94      0.93      0.93        15
weighted avg       0.94      0.93      0.93        15



<h4><u>Random Search</u></h4>

In [18]:
random_search = RandomizedSearchCV(DecisionTreeClassifier(), 
                                   param_grid, 
                                   random_state=1, 
                                   n_iter=100, 
                                   cv=5, 
                                   verbose=0, 
                                   n_jobs=-1)

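# NOTE: as written, this refits `rf_random` (the GridSearchCV object from In [16]),
# not `random_search`, which is why the output below reports 3 folds x 24 candidates.
# To run the randomized search itself, call random_search.fit(X_train, y_train)
# and read random_search.best_params_ instead.
rf_random.fit(X_train,y_train)

#Print the value of the best hyperparameters
print(rf_random.best_params_)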

Fitting 3 folds for each of 24 candidates, totalling 72 fits
{'criterion': 'gini', 'max_depth': None, 'max_features': 'auto', 'min_samples_leaf': 1}
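
A randomized search evaluates a random subset of the settings instead of all of them. When every parameter is given as a list, the settings are drawn without replacement, so asking for n_iter=100 from a grid of 24 combinations cannot try more than 24. The sketch below mimics that sampling in plain Python; `sample_candidates` is an illustrative helper, not scikit-learn code.

```python
import random
from itertools import product

param_grid = {
    "max_depth": [3, None],
    "min_samples_leaf": [1, 2, 4],
    "criterion": ["gini", "entropy"],
    "max_features": ["auto", "sqrt"],
}

def sample_candidates(grid, n_iter, seed=1):
    """Draw up to n_iter distinct settings from a list-valued parameter grid."""
    names = sorted(grid)
    combos = [
        dict(zip(names, values))
        for values in product(*(grid[name] for name in names))
    ]
    n_draws = min(n_iter, len(combos))  # cannot draw more distinct settings than exist
    return random.Random(seed).sample(combos, n_draws)

print(len(sample_candidates(param_grid, 100)))  # capped at the grid size
print(len(sample_candidates(param_grid, 10)))   # a true random subset
```

A randomized search only saves work over a grid search when n_iter is smaller than the number of combinations.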


In [19]:
clf = DecisionTreeClassifier(criterion = 'gini',
                            max_depth = None,
                            max_features = 'auto',
                            min_samples_leaf = 4)


#Training the decision tree classifier. 
clf.fit(X_train, y_train)

#Predicting labels on the test set.
y_pred =  clf.predict(X_test)


print('Accuracy Score on train data: ', 
      accuracy_score(y_true=y_train, y_pred=clf.predict(X_train)))
print('Accuracy Score on test data: {}\n\n'.format(
    accuracy_score(y_true=y_test, y_pred=y_pred)))
print(classification_report(y_test,y_pred))

Accuracy Score on train data:  0.9703703703703703
Accuracy Score on test data: 1.0


              precision    recall  f1-score   support

           0       1.00      1.00      1.00         5
           1       1.00      1.00      1.00         5
           2       1.00      1.00      1.00         5

    accuracy                           1.00        15
   macro avg       1.00      1.00      1.00        15
weighted avg       1.00      1.00      1.00        15

