# <font color=#ba0095><center>Decision Trees</center></font>

A decision tree is a supervised learning method used for both __regression__ and __classification__. If you are reading this notebook, you have probably met some kind of decision tree before. As the name suggests, the method consists of chained binary decisions, which visually form a tree. At the top is the root node, from which the branches grow toward the possible outcomes. Choosing how deep the tree may grow (and therefore how many nodes it can have) is one of the main decisions you have to make.<br>

__Overfitting__ is a very important term. It means the model has been trained to the point where every (or almost every) data point has its own leaf, so it reaches 100% (or nearly 100%) accuracy on the data it has already seen but fails to fit any new input. __Underfitting__ is the exact opposite: the model generalizes so much, for example by creating only a few nodes for a large amount of data, that useful patterns in the input are never captured.<br>
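
To make the idea of overfitting concrete, here is a minimal sketch. It uses synthetic data (a noisy sine curve, which is an assumption made only for this illustration, not the wine dataset) and compares a tree with no limits against a tree with a limited depth.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, assumed for illustration only: a noisy sine curve
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# No limits: the tree keeps splitting until it reproduces the training data
deep_tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
# Limited depth: the tree is forced to generalize
shallow_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train, y_train)

for name, model in [("unlimited", deep_tree), ("max_depth=3", shallow_tree)]:
    print(name,
          "| train score:", round(model.score(X_train, y_train), 2),
          "| test score:", round(model.score(X_test, y_test), 2))
```

The unlimited tree scores essentially perfectly on the training data and noticeably worse on the test data. That gap between the two scores is the typical sign of overfitting.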

That is why you must understand the parameters you are setting and how they affect the algorithm. We will discuss them when they come up.

### <font color=#ba0095><center>regression tree</center></font>

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import tree
from sklearn.model_selection import train_test_split

In [4]:
path = ".jupyter\\datasets\\raw\\"
df = pd.read_csv(path + "winequality-red.csv")

In [74]:
df.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


In [76]:
df.describe()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 |
| mean | 8.319637 | 0.527821 | 0.270976 | 2.538806 | 0.087467 | 15.874922 | 46.467792 | 0.996747 | 3.311113 | 0.658149 | 10.422983 | 5.636023 |
| std | 1.741096 | 0.17906 | 0.194801 | 1.409928 | 0.047065 | 10.460157 | 32.895324 | 0.001887 | 0.154386 | 0.169507 | 1.065668 | 0.807569 |
| min | 4.6 | 0.12 | 0.0 | 0.9 | 0.012 | 1.0 | 6.0 | 0.99007 | 2.74 | 0.33 | 8.4 | 3.0 |
| 25% | 7.1 | 0.39 | 0.09 | 1.9 | 0.07 | 7.0 | 22.0 | 0.9956 | 3.21 | 0.55 | 9.5 | 5.0 |
| 50% | 7.9 | 0.52 | 0.26 | 2.2 | 0.079 | 14.0 | 38.0 | 0.99675 | 3.31 | 0.62 | 10.2 | 6.0 |
| 75% | 9.2 | 0.64 | 0.42 | 2.6 | 0.09 | 21.0 | 62.0 | 0.997835 | 3.4 | 0.73 | 11.1 | 6.0 |
| max | 15.9 | 1.58 | 1.0 | 15.5 | 0.611 | 72.0 | 289.0 | 1.00369 | 4.01 | 2.0 | 14.9 | 8.0 |


In [56]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


OK, we have a clean dataset with purely numeric types. There are too many features to show in a single pair plot, so we will skip that. I think all the features in this dataset are useful for modeling, so I will only separate the features from the target variable, quality. If you think some features in your own dataset might hurt the algorithm, look into them and adjust your features before you start modeling.

In [57]:
features_df = df.copy()
y = df["quality"] # creating target
features_df.drop(columns="quality", inplace=True) # deleting target from feature dataset

print(features_df.shape, y.shape) # only check

(1599, 11) (1599,)


In [58]:
model = tree.DecisionTreeRegressor(max_depth=5, min_samples_leaf=0.03)
X_train, X_test, y_train, y_test = train_test_split(features_df, y, test_size=0.25)

model.fit(X_train, y_train)
model.score(X_test, y_test)

0.3076629801719025

OK, let's look at what we have written here, mainly the parameters:<br>
* __max_depth__ - the maximum depth of the tree, i.e. the largest number of splits allowed between the root and any leaf (not the total number of nodes)
* __min_samples_leaf__ - the minimum number of samples that must end up in each leaf; a split is only made if both resulting branches keep at least this many samples. A float is read as a fraction of the training samples (0.03 = 3%)
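
To see what these two parameters actually limit, the sketch below fits a tree on synthetic data (an assumption for illustration, not the wine dataset) and then inspects the fitted tree. `get_depth()`, `get_n_leaves()` and the `tree_` attribute are part of scikit-learn's tree estimators.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, assumed for illustration only
rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 3))
y = X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=1000)

model = DecisionTreeRegressor(max_depth=5, min_samples_leaf=0.03).fit(X, y)

print("depth:", model.get_depth())            # never more than max_depth
print("leaves:", model.get_n_leaves())        # never more than 2 ** max_depth
print("total nodes:", model.tree_.node_count)

# Leaves are the nodes without children (children_left == -1)
is_leaf = model.tree_.children_left == -1
smallest_leaf = model.tree_.n_node_samples[is_leaf].min()
print("smallest leaf:", smallest_leaf)        # at least 3% of the 1000 samples
```

Note that the depth and the number of nodes are different things: a tree of depth 5 can have up to 32 leaves.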

#### <font color=#ba0095>So how does it work?</font>

A different format (such as a video guide) may explain this better, but the main concept goes like this. We start with the root node at the top. How does the algorithm decide where to split it into two child nodes? It tries the candidate thresholds on each feature and picks the split for which the weighted __Mean Squared Error__ of the two resulting groups is lowest. Every new node created this way repeats the process until no further split is possible or a stopping criterion is met (for example the parameters mentioned above: the tree has reached its maximum depth, or a split would leave too few samples in a leaf).
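
The search for a single split can be sketched in plain Python. This is a simplified illustration of the idea for one feature, not scikit-learn's actual implementation; the helper names `mse` and `best_split` and the toy numbers are made up for this example.

```python
def mse(values):
    """Mean squared error of predicting the mean for every value."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(xs, ys):
    """Return (threshold, weighted_mse) of the best single split on one feature."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best = (None, mse(ys))  # baseline: no split at all
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        # Weight each side's MSE by its share of the samples
        score = (len(left) * mse(left) + len(right) * mse(right)) / n
        if score < best[1]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (threshold, score)
    return best


# Toy data, made up for illustration: quality jumps once alcohol passes 10.5
alcohol = [9.0, 9.5, 10.0, 11.0, 11.5, 12.0]
quality = [5, 5, 5, 7, 7, 7]
print(best_split(alcohol, quality))  # (10.5, 0.0)
```

A real tree repeats this search over every feature, keeps the best split overall, and then does the same again inside each of the two new nodes.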

#### <font color=#ba0095>How to set the parameters?</font>

This is the tricky question. Like a lot of things, it comes with experience and with knowing exactly what dataset you are working with. But do not forget the tools you have at hand: you can write a function that automates the testing of the parameters. Below I wrote a function that tests multiple variants of the parameters; feel free to play with this kind of thing!

In [106]:
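def model_tester(model, features, target):
    # Try several depth limits. The printed label says "nodes", but the value
    # is passed to max_depth (depth of the tree, not its number of nodes).
    for num_nodes in range(2, 15, 3):
        print("Testing with maximum of " + str(num_nodes) + " nodes")
        min_samples = 0.05  # fraction of the training samples required in each leaf
        for x in range(5):
            # Reuse the same estimator and only change its parameters
            model.max_depth = num_nodes
            model.min_samples_leaf = min_samples
            # New random split on every iteration, so scores vary between runs
            X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.25)
            model.fit(X_train, y_train)
            # score() is R^2 for a regressor and accuracy for a classifier
            score = model.score(X_test, y_test)*100
            print("--- minimum samples: " + str(round(min_samples,2)) +
                     ", score: " + str(int(score)) + "%")
            min_samples += 0.04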

reg_model = tree.DecisionTreeRegressor()
model_tester(reg_model, features_df, y)

Testing with maximum of 2 nodes
--- minimum samples: 0.05, score: 22%
--- minimum samples: 0.09, score: 18%
--- minimum samples: 0.13, score: 23%
--- minimum samples: 0.17, score: 28%
--- minimum samples: 0.21, score: 22%
Testing with maximum of 5 nodes
--- minimum samples: 0.05, score: 24%
--- minimum samples: 0.09, score: 28%
--- minimum samples: 0.13, score: 25%
--- minimum samples: 0.17, score: 17%
--- minimum samples: 0.21, score: 19%
Testing with maximum of 8 nodes
--- minimum samples: 0.05, score: 32%
--- minimum samples: 0.09, score: 22%
--- minimum samples: 0.13, score: 20%
--- minimum samples: 0.17, score: 21%
--- minimum samples: 0.21, score: 28%
Testing with maximum of 11 nodes
--- minimum samples: 0.05, score: 25%
--- minimum samples: 0.09, score: 28%
--- minimum samples: 0.13, score: 23%
--- minimum samples: 0.17, score: 13%
--- minimum samples: 0.21, score: 20%
Testing with maximum of 14 nodes
--- minimum samples: 0.05, score: 34%
--- minimum samples: 0.09, score: 31%
--

OK, you can tweak the parameters further to squeeze the most out of the algorithm. From our first try, it looks like minimum samples under 10% tend to do better, although every run uses a new random train/test split, so individual scores are noisy. The options are endless: you can average the score for each depth setting, loop x times and show the top 5 scores with their parameters, and so on. Use your creativity here!
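
scikit-learn also ships a ready-made tool for this kind of search, `GridSearchCV`, which scores every parameter combination with cross-validation instead of a single random split. The sketch below is one possible setup, not the method used above; the data is synthetic (an assumption for illustration) and the grid simply mirrors the values tried by `model_tester`.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, assumed for illustration only
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 4))
y = 3 * X[:, 0] + np.sin(6 * X[:, 1]) + rng.normal(scale=0.2, size=500)

# Same values as in the manual loop above
param_grid = {
    "max_depth": [2, 5, 8, 11, 14],
    "min_samples_leaf": [0.05, 0.09, 0.13, 0.17, 0.21],
}

# cv=5: every combination is scored on 5 different train/validation splits
search = GridSearchCV(DecisionTreeRegressor(random_state=0), param_grid, cv=5)
search.fit(X, y)

print("best:", search.best_params_, round(search.best_score_, 3))

# Show the top 5 combinations by mean cross-validated score
top5 = np.argsort(search.cv_results_["rank_test_score"])[:5]
for i in top5:
    print(search.cv_results_["params"][i],
          round(search.cv_results_["mean_test_score"][i], 3))
```

Averaging over several splits makes the comparison between parameter settings much less dependent on luck than a single `train_test_split`.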

### <font color=#ba0095><center>classification tree</center></font>

In [77]:
glass_df = pd.read_csv(path + "glass.csv")
glass_df.head()

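| | RI | Na | Mg | Al | Si | K | Ca | Ba | Fe | Type |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.52101 | 13.64 | 4.49 | 1.1 | 71.78 | 0.06 | 8.75 | 0.0 | 0.0 | 1 |
| 1 | 1.51761 | 13.89 | 3.6 | 1.36 | 72.73 | 0.48 | 7.83 | 0.0 | 0.0 | 1 |
| 2 | 1.51618 | 13.53 | 3.55 | 1.54 | 72.99 | 0.39 | 7.78 | 0.0 | 0.0 | 1 |
| 3 | 1.51766 | 13.21 | 3.69 | 1.29 | 72.61 | 0.57 | 8.22 | 0.0 | 0.0 | 1 |
| 4 | 1.51742 | 13.27 | 3.62 | 1.24 | 73.08 | 0.55 | 8.07 | 0.0 | 0.0 | 1 |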


In [79]:
glass_df.describe()

| | RI | Na | Mg | Al | Si | K | Ca | Ba | Fe | Type |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 214.0 | 214.0 | 214.0 | 214.0 | 214.0 | 214.0 | 214.0 | 214.0 | 214.0 | 214.0 |
| mean | 1.518365 | 13.40785 | 2.684533 | 1.444907 | 72.650935 | 0.497056 | 8.956963 | 0.175047 | 0.057009 | 2.780374 |
| std | 0.003037 | 0.816604 | 1.442408 | 0.49927 | 0.774546 | 0.652192 | 1.423153 | 0.497219 | 0.097439 | 2.103739 |
| min | 1.51115 | 10.73 | 0.0 | 0.29 | 69.81 | 0.0 | 5.43 | 0.0 | 0.0 | 1.0 |
| 25% | 1.516523 | 12.9075 | 2.115 | 1.19 | 72.28 | 0.1225 | 8.24 | 0.0 | 0.0 | 1.0 |
| 50% | 1.51768 | 13.3 | 3.48 | 1.36 | 72.79 | 0.555 | 8.6 | 0.0 | 0.0 | 2.0 |
| 75% | 1.519157 | 13.825 | 3.6 | 1.63 | 73.0875 | 0.61 | 9.1725 | 0.0 | 0.1 | 3.0 |
| max | 1.53393 | 17.38 | 4.49 | 3.5 | 75.41 | 6.21 | 16.19 | 3.15 | 0.51 | 7.0 |


In [80]:
glass_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214 entries, 0 to 213
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   RI      214 non-null    float64
 1   Na      214 non-null    float64
 2   Mg      214 non-null    float64
 3   Al      214 non-null    float64
 4   Si      214 non-null    float64
 5   K       214 non-null    float64
 6   Ca      214 non-null    float64
 7   Ba      214 non-null    float64
 8   Fe      214 non-null    float64
 9   Type    214 non-null    int64  
dtypes: float64(9), int64(1)
memory usage: 16.8 KB


There are 7 types of glass in this database:
* 1 building windows, float processed
* 2 building windows, non-float processed
* 3 vehicle windows, float processed
* 4 vehicle windows, non-float processed (none in this database)
* 5 containers
* 6 tableware
* 7 headlamps

In [109]:
X = glass_df.copy()
y = glass_df["Type"]
X.drop(columns="Type", inplace=True)

In [110]:
clf = tree.DecisionTreeClassifier(max_depth=3, min_samples_split=0.05)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
clf.fit(X_train, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=3, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=0.05,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

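As you can see, the parameters printed above are mostly the same as those of the regression tree. The main differences are the __criterion__ (a classifier measures class impurity rather than squared error) and __class_weight__, which only exists for the classifier. Note also that this time we limited the tree with __min_samples_split__ instead of __min_samples_leaf__. Depending on the problem, one type of tree will most likely be a better fit than the other. For further information, look into the sklearn documentation; it is very well made and I would just be copying most of it.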
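
One quick way to check which parameters the two estimators share and which are specific to one of them is to compare the keys returned by `get_params()`. This is just a small sketch; the exact lists depend on the installed scikit-learn version, so no output is shown here.

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

clf_params = set(DecisionTreeClassifier().get_params())
reg_params = set(DecisionTreeRegressor().get_params())

shared = clf_params & reg_params
clf_only = clf_params - reg_params
reg_only = reg_params - clf_params

print("shared:", sorted(shared))
print("classifier only:", sorted(clf_only))
print("regressor only:", sorted(reg_only))

# The default split criterion is where the two trees really differ
print("default criteria:",
      DecisionTreeClassifier().criterion,
      DecisionTreeRegressor().criterion)
```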

A few important ones to know are:
* __criterion__ - here __gini__, which measures the Gini impurity of a node (how mixed its classes are); the idea takes more than one sentence to explain, so look into how it works
* __splitter__ - __best__ (choose the best split) or __random__ (choose the best random split)
* __max_depth__ - same as before, the maximum depth of the tree
* __min_samples_split__ - the minimum number of samples a node must contain before it may be split (0.05 = 5% of the training samples)
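
For the __gini__ criterion, the impurity of a node is 1 minus the sum of the squared class proportions in that node. A minimal sketch in plain Python (the helper name `gini` and the toy labels are made up for this example):

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum(p_k ** 2) over the class proportions p_k."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


print(gini([1, 1, 1, 1]))  # 0.0 -> a pure node, only one class
print(gini([1, 1, 2, 2]))  # 0.5 -> two classes, evenly mixed
```

A classification tree picks the split that lowers this impurity the most, in the same way the regression tree picks the split that lowers the squared error.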

In [111]:
clf.score(X_test, y_test)

0.6111111111111112

#### <font color=#ba0095>Let us try classification on the previous dataset</font>

For demonstration, let's repeat our code with the previous dataset, red wine quality. We will again use our testing loop.

In [114]:
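# features_df still holds the red wine features, but y was overwritten by the glass target, so I have to assign it again
# This is bad practice! Even in a few lines of code I already have to keep track of it, so always think ahead!
# Note: model_tester overwrites max_depth and min_samples_leaf on clf; min_samples_split=0.05 stays as set above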
y = df["quality"]

model_tester(clf, features_df, y)

Testing with maximum of 2 nodes
--- minimum samples: 0.05, score: 54%
--- minimum samples: 0.09, score: 52%
--- minimum samples: 0.13, score: 55%
--- minimum samples: 0.17, score: 50%
--- minimum samples: 0.21, score: 56%
Testing with maximum of 5 nodes
--- minimum samples: 0.05, score: 57%
--- minimum samples: 0.09, score: 53%
--- minimum samples: 0.13, score: 50%
--- minimum samples: 0.17, score: 55%
--- minimum samples: 0.21, score: 53%
Testing with maximum of 8 nodes
--- minimum samples: 0.05, score: 55%
--- minimum samples: 0.09, score: 55%
--- minimum samples: 0.13, score: 59%
--- minimum samples: 0.17, score: 59%
--- minimum samples: 0.21, score: 56%
Testing with maximum of 11 nodes
--- minimum samples: 0.05, score: 59%
--- minimum samples: 0.09, score: 53%
--- minimum samples: 0.13, score: 56%
--- minimum samples: 0.17, score: 52%
--- minimum samples: 0.21, score: 50%
Testing with maximum of 14 nodes
--- minimum samples: 0.05, score: 62%
--- minimum samples: 0.09, score: 51%
--

OK, the classification tree scores much higher than the regression tree here. Keep in mind that the two scores are different metrics (accuracy for the classifier, R² for the regressor), so they cannot be compared point for point. Still, with only 6 distinct outcomes in the data, treating quality as a class label is the more natural choice; regression is meant for continuous targets.
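
When reading the two sets of scores, it helps to know that `score()` returns R² for a regressor and accuracy for a classifier in scikit-learn. The sketch below uses synthetic data with six integer grades (an assumption for illustration, not the wine dataset) and shows one possible way to put both models on the same scale: round the regressor's predictions to the nearest grade and compute accuracy for those too.

```python
import numpy as np
from sklearn.metrics import accuracy_score, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Synthetic data, assumed for illustration only: integer grades from 3 to 8
rng = np.random.default_rng(0)
X = rng.uniform(size=(800, 3))
raw = 3 + 5 * X[:, 0] + rng.normal(scale=0.7, size=800)
y = np.clip(np.rint(raw), 3, 8).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

reg = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X_train, y_train)
clf = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_train, y_train)

reg_pred = reg.predict(X_test)
clf_pred = clf.predict(X_test)

print("regressor score (R^2):      ", round(reg.score(X_test, y_test), 3))
print("classifier score (accuracy):", round(clf.score(X_test, y_test), 3))

# Same metric for both: round the regressor's output to the nearest grade
reg_as_labels = np.rint(reg_pred).astype(int)
print("regressor accuracy (rounded):", round(accuracy_score(y_test, reg_as_labels), 3))
```

Comparing accuracy with accuracy gives a fairer picture of how far apart the two models really are.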

#### <font color=#ba0095>final words</font>
Only a minimal amount of code is shown above, but it should be enough to get you started with decision trees. The model itself takes just a few lines; the hard part is understanding what it is doing. So take your time and read through the parameters of the decision trees, so that you know what you are working with, can decide right off the bat which one to use, and do not waste your precious time! Thank you for taking the time to read through this short work.