<a href="https://colab.research.google.com/github/CSpanias/ml_training/blob/master/DecisionTreeRegressor_tips.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Decision Tree Regressor Algorithm**

*This notebook was developed for Decision Tree practice, and the resulting code is a blend of various resources.*

Based mostly on:
* [Machine Learning for Absolute Beginners](https://scatterplotpress.teachable.com/courses)

Modified based on:
*  [
*  [
*  [



1. [Data Cleaning](#DataCleaning)
2. [KFold Cross-Validation with different criteria (squared_error, friedman_mse, absolute_error, poisson)](#KFold)
3. [Decision Tree Visualization (GraphViz)](#GraphViz)
4. [Pruning the Decision Tree](#Pruning)
  * [Pre-Pruning](#Pre-Pruning)
  * [Post-Pruning](#Post-Pruning)
5. [Conclusions](#Conclusions)


In [23]:
import numpy as np
import pandas as pd

from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import mean_absolute_error

In [24]:
# load data
df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv")

In [25]:
# print the first 5 rows of data
df.head()

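|   | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |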


In [26]:
# print info about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   total_bill  244 non-null    float64
 1   tip         244 non-null    float64
 2   sex         244 non-null    object 
 3   smoker      244 non-null    object 
 4   day         244 non-null    object 
 5   time        244 non-null    object 
 6   size        244 non-null    int64  
dtypes: float64(2), int64(1), object(4)
memory usage: 13.5+ KB


<a name="DataCleaning"> </a>
# **Data Cleaning**

* Missing values
* Duplicate rows

In [27]:
# check for missing values
df.isna().sum()

total_bill    0
tip           0
sex           0
smoker        0
day           0
time          0
size          0
dtype: int64

In [28]:
# check for duplicates
print(df.duplicated().sum())
# remove duplicate rows
df.drop_duplicates(inplace=True)

1


# **Data PreProcessing**

In [29]:
# delete smoker variable
del df['smoker']

In [30]:
# convert non-numerical data to boolean using one-hot encoding
df = pd.get_dummies(df, columns=['time', 'day', 'sex'])
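
As a small illustration of what one-hot encoding does, the sketch below applies `pd.get_dummies` to a hypothetical three-row frame. The `toy` DataFrame and its values are made up for this example and are not taken from the dataset. Each category becomes its own indicator column.

```python
import pandas as pd

# hypothetical toy frame shaped like a slice of the tips data
toy = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01],
    "time": ["Dinner", "Lunch", "Dinner"],
    "sex": ["Female", "Male", "Male"],
})

# every listed column is replaced by one indicator column per category
encoded = pd.get_dummies(toy, columns=["time", "sex"])

# the untouched column comes first, followed by the new indicator columns
print(encoded.columns.tolist())
```

Passing `drop_first=True` would drop one indicator per variable, since the last category is implied by the others. For tree models this is optional.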

<a name='KFold'> </a>
 # **KFold Cross Validation with different criterion** 
(squared_error, friedman_mse, absolute_error, poisson)

Import the required libraries & methods.

In [37]:
import numpy as np # convert DataFrame to arrays, calculate means
from sklearn.tree import DecisionTreeRegressor # algorithm required
from sklearn.model_selection import train_test_split # split data
from sklearn.model_selection import KFold # perform a KFold Cross-Validation
from sklearn.metrics import mean_absolute_error, accuracy_score # evaluation metrics

1. Assign X (features) & y (target).
2. Convert them from a pandas DataFrame (matrix) and a pandas Series (single column) to ***2- and 1-Dimensional numpy arrays***, respectively.

  Pandas DataFrames & Series are easier for a human to read, but numpy arrays are ***easier for the computer to handle*** and support the positional row indexing (`X[train_index]`) used in the cross-validation loop.  

In [32]:
# drop the target column from df; the remaining columns are the features
X = df.drop(columns='tip')
# convert pandas DataFrame to a 2-Dimensional numpy array
X = X.to_numpy(dtype=float)
# confirm that X's shape is 2D
print(X.shape)

# assign target
y = df['tip']
# convert pandas Series to a 1-Dimensional numpy array
y = y.to_numpy()
# confirm that y's shape is 1D
print(y.shape)

(243, 10)
(243,)
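
Row selection by position matters for the cross-validation loop, because `KFold.split` yields integer positions. The sketch below uses a hypothetical four-value Series (not the tips data) to show why plain `[]` indexing on a pandas object can fail after `drop_duplicates`: the row labels now have a gap, and `[]` on a Series looks up labels, not positions. On a DataFrame, `[]` with a list of integers looks up column labels, which fails in a similar way.

```python
import numpy as np
import pandas as pd

# hypothetical data: the value at label 2 duplicates the value at label 1
s = pd.Series([1.0, 2.0, 2.0, 3.0])
s = s.drop_duplicates()          # remaining labels: 0, 1, 3

# integer positions, the kind of array KFold.split produces
positions = np.array([0, 2])

# positional lookup works on the Series via .iloc ...
by_position = s.iloc[positions]
# ... and on a plain numpy array via []
as_array = s.to_numpy()[positions]

# label lookup fails because label 2 was removed with the duplicate
try:
    s[positions]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

print(by_position.tolist(), as_array.tolist(), label_lookup_failed)
```

Converting to numpy arrays once, or using `.iloc` consistently, avoids this class of error.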


Perform KFold Cross-Validation to compare the mean absolute error (MAE) of Decision Tree models trained with each regression criterion:
1. `squared_error`
2. `friedman_mse`
3. `absolute_error`
4. `poisson`

In [41]:
# define the number of splits (train/test sets) and shuffle rows
# a fixed random_state gives every criterion the same 5 splits
kf = KFold(n_splits=5, shuffle=True, random_state=33)

# make sure X and y are numpy arrays so they can be indexed by row position
X = np.asarray(X, dtype=float)
y = np.asarray(y)

# we want to cross-validate the model with each regression criterion
for criterion in ['squared_error', 'friedman_mse', 'absolute_error',
                  'poisson']:
  # {} = placeholders, .format(x) = put x in the placeholders
  print("Decision Tree - {}".format(criterion))
  # create an empty list so we can later store
  # the MAE of each of the 5 splits in it
  mae = []

  # each split of X produces a training and testing set
  for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # select algorithm with the specified criterion
    model = DecisionTreeRegressor(criterion=criterion)
    # train the model using training data
    model.fit(X_train, y_train)
    # predict using the testing data
    y_pred = model.predict(X_test)
    # calculate the MAE of this split and append it to the list
    mae.append(mean_absolute_error(y_test, y_pred))

  # after all splits, print the mean MAE for this criterion
  # {:.4f} = format x to 4 decimal places, .format(x) = put x into the placeholder
  print("Mean Absolute Error: {:.4f}".format(np.mean(mae)))

The next cell fits a single model with the default criterion (`squared_error`) on an 80/20 train/test split and reports its MAE.

In [39]:
# select the algorithm (default criterion: squared_error)
model = DecisionTreeRegressor()
# split training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    shuffle=True,
                                                    random_state=33)
# train the model using training data
model.fit(X_train, y_train)
# predict using the test data
y_pred_GS = model.predict(X_test)
# print MAE, {} = placeholder, .format(x) = put x into the placeholder
print("Mean Absolute Error: {:.2f}" .format(mean_absolute_error(y_test, y_pred_GS)))

Mean Absolute Error: 1.14


<a name='GraphViz'> </a>
# **Decision Tree Visualization** (GraphViz)

Decision Trees benefits:
1. **Interpretability**: clarity of information representation, useful for explaining the prediction to a non-technical audience.
2. **Transparency**: can greatly help in the *decision making process*. 

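*If you're going to run this on your computer, make sure to install graphviz first. You can do this by opening your command prompt and typing "**pip install graphviz**". The Python package also needs the Graphviz system software (the `dot` executable) to render images.*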

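Sample visualization for easier interpretation of the process. The cells from here on are classification examples: they assume `df` holds a diabetes dataset with the columns listed in `feature_names` plus a `label` column, not the tips data loaded above.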

In [None]:
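# import the classifier used in this and the following cells
from sklearn.tree import DecisionTreeClassifier

# a list with the feature names
feature_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree',
                 'age']

# assign X and y variables
X = df[feature_names].values
y = df['label'].values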

# split the data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    shuffle=True,
                                                    random_state=10)

# create the model with max_depth=3 (again) for rendering reasons
model = DecisionTreeClassifier(criterion='entropy', max_depth=3)

# train the model
model.fit(X_train, y_train)

# import export_graphviz
from sklearn.tree import export_graphviz
# export the tree in DOT format (returned as a string)
dot_file = export_graphviz(model, feature_names=feature_names,
                           class_names= ['Not Diabetic', 'Diabetic'])

# import graphviz to convert .dot file to a .png image
import graphviz
# read the dot_file
graph = graphviz.Source(dot_file)
# convert .dot to .png without generating additional files
graph.render(filename='tree_diabetes', format='png', cleanup=True)

# import IPython to display the rendered .png image
from IPython.display import Image
# display image using the path
Image(filename="/content/tree_diabetes.png")
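
If Graphviz is not available, a fitted tree can also be inspected as plain text with `sklearn.tree.export_text`. The sketch below is an alternative to the rendering above, not a replacement for it. It uses the iris dataset bundled with scikit-learn as a stand-in (an assumption) so that it runs without any external file.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# stand-in dataset bundled with scikit-learn (assumption)
iris = load_iris()

# shallow tree so the printed rules stay short
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(iris.data, iris.target)

# one line per node: split conditions for internal nodes, predicted class for leaves
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

The text output carries the same split rules as the image, which makes it convenient for logs or quick checks in a terminal.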

<a name='Pruning'> </a>
# **Pruning the Decision Tree**
Pruning removes unnecessary branches from a Decision Tree with the aim of:
* Reducing the complexity to avoid overfitting
* Making it easier to interpret.



1. **Pre-Pruning**: the goal is to limit tree growth
  * ***max_depth***: grow up to a certain depth/height
  * ***min_samples_split***: the minimum number of samples required to split an internal node.
  * ***max_leaf_nodes***: limit the total number of leaf nodes allowed in the tree

    Additional info [here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).

2. **Post-pruning**: build the whole tree and then review the tree and decide which leaves to remove to make it smaller.
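
scikit-learn's built-in form of post-pruning is minimal cost-complexity pruning, controlled by the `ccp_alpha` parameter. The sketch below is a hedged illustration of that technique and is not part of the original notebook. It uses the breast cancer dataset bundled with scikit-learn as a stand-in (an assumption), asks the fully grown tree for its candidate alphas, and refits one pruned tree per alpha to show how the number of leaves shrinks.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# stand-in dataset bundled with scikit-learn (assumption)
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          random_state=10)

# candidate alphas computed from the fully grown tree
base = DecisionTreeClassifier(criterion="entropy", random_state=0)
path = base.cost_complexity_pruning_path(X_tr, y_tr)
# guard against tiny negative values caused by floating-point error
alphas = np.clip(path.ccp_alphas, 0.0, None)

leaves = []
for alpha in alphas:
    # a larger alpha removes more of the weakest branches
    pruned = DecisionTreeClassifier(criterion="entropy", random_state=0,
                                    ccp_alpha=alpha)
    pruned.fit(X_tr, y_tr)
    n_leaves = int(pruned.get_n_leaves())
    leaves.append(n_leaves)
    print("alpha={:.4f}  leaves={:3d}  test accuracy={:.4f}".format(
        alpha, n_leaves, pruned.score(X_te, y_te)))
```

In practice the alpha would be chosen with cross-validation on the training data, not by reading the test accuracy, so that the test set stays untouched for the final evaluation.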

<a name='Pre-Pruning'> </a>
**Pre-Pruning** example:

In [None]:
# create the model with modified parameters
model = DecisionTreeClassifier(criterion='entropy',
                               max_depth=10,
                               min_samples_split=4,
                               max_leaf_nodes=20,)

# train the model
model.fit(X_train, y_train)

# predict using the training data
y_pred_train = model.predict(X_train)
# predict using the test data
y_pred = model.predict(X_test)
# print accuracy
print("Training Accuracy: {:.4f}" .format(accuracy_score(y_train, y_pred_train)))
print("Testing Accuracy: {:.4f}" .format(accuracy_score(y_test, y_pred)))

# import export_graphviz
from sklearn.tree import export_graphviz
# export the tree in DOT format (returned as a string)
dot_file = export_graphviz(model, feature_names=feature_names,
                           class_names= ['Not Diabetic', 'Diabetic'])

# import graphviz to convert .dot file to a .png image
import graphviz
# read the dot_file
graph = graphviz.Source(dot_file)
# convert .dot to .png without generating additional files
graph.render(filename='tree_diabetes', format='png', cleanup=True)

# import IPython to display the rendered .png image
from IPython.display import Image
# display image using the path
Image(filename="/content/tree_diabetes.png")

Pre-Pruning ***increased testing accuracy*** compared with the default model (77% vs. 73%).

<a name='Post-Pruning'> </a>
**Post-Pruning** example:  
(*the same tree as the default (first) model, recreated here with visualization*)

In [None]:
# create the model with default parameters (fully grown tree)
model = DecisionTreeClassifier(criterion='entropy')

# train the model
model.fit(X_train, y_train)

# predict using the training data
y_pred_train = model.predict(X_train)
# predict using the test data
y_pred = model.predict(X_test)
# print accuracy
print("Training Accuracy: {:.4f}" .format(accuracy_score(y_train, y_pred_train)))
print("Testing Accuracy: {:.4f}" .format(accuracy_score(y_test, y_pred)))

# import export_graphviz
from sklearn.tree import export_graphviz
# export the tree in DOT format (returned as a string)
dot_file = export_graphviz(model, feature_names=feature_names,
                           class_names= ['Not Diabetic', 'Diabetic'])

# import graphviz to convert .dot file to a .png image
import graphviz
# read the dot_file
graph = graphviz.Source(dot_file)
# convert .dot to .png without generating additional files
graph.render(filename='tree_diabetes', format='png', cleanup=True)

# import IPython to display the rendered .png image
from IPython.display import Image
# display image using the path
Image(filename="/content/tree_diabetes.png")

***Testing accuracy dropped*** in comparison with the Pre-Pruned tree (0.74 vs. 0.77).  

The full tree now has ***17 rows*** (levels) and shows signs of ***overfitting***:  *the training accuracy score (100%) is much higher than the testing accuracy score (74%)*.

Notice that, for example, in the ***7th row*** the entropy scores are already quite good, so try stopping tree growth there and see what happens.

In [None]:
# create the model with modified parameters
model = DecisionTreeClassifier(criterion='entropy',
                               max_depth=7)

# train the model
model.fit(X_train, y_train)

# predict using the training data
y_pred_train = model.predict(X_train)
# predict using the test data
y_pred = model.predict(X_test)
# print accuracy
print("Training Accuracy: {:.4f}" .format(accuracy_score(y_train, y_pred_train)))
print("Testing Accuracy: {:.4f}" .format(accuracy_score(y_test, y_pred)))

# import export_graphviz
from sklearn.tree import export_graphviz
# export the tree in DOT format (returned as a string)
dot_file = export_graphviz(model, feature_names=feature_names,
                           class_names= ['Not Diabetic', 'Diabetic'])

# import graphviz to convert .dot file to a .png image
import graphviz
# read the dot_file
graph = graphviz.Source(dot_file)
# convert .dot to .png without generating additional files
graph.render(filename='tree_diabetes', format='png', cleanup=True)

# import IPython to display the rendered .png image
from IPython.display import Image
# display image using the path
Image(filename="/content/tree_diabetes.png")

***Testing accuracy score*** rose from ***73% to 76%*** with just the first modification.
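
Reading a good depth off the rendered tree can be complemented by a simple sweep over `max_depth` that compares training and testing accuracy side by side. The sketch below is one way to do that, assuming the breast cancer dataset bundled with scikit-learn as a stand-in, so its numbers do not correspond to the results reported in this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# stand-in dataset bundled with scikit-learn (assumption)
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          random_state=10)

rows = []
# None means no depth limit, i.e. the fully grown tree
for depth in [1, 3, 5, 7, None]:
    clf = DecisionTreeClassifier(criterion="entropy", max_depth=depth,
                                 random_state=0)
    clf.fit(X_tr, y_tr)
    rows.append((depth, clf.score(X_tr, y_tr), clf.score(X_te, y_te)))

# a large gap between the two columns is a sign of overfitting
for depth, train_acc, test_acc in rows:
    print("max_depth={!s:>4}  train={:.4f}  test={:.4f}".format(
        depth, train_acc, test_acc))
```

A single train/test split gives a noisy estimate, so repeating the sweep inside cross-validation is the safer way to pick the depth.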

<a name='Conclusions'> </a>
# **Conclusions**
1. **GraphViz** makes it easy to generate a visual representation of the Decision Tree, which greatly helps its ***interpretability***!

2. **Pre-Pruning** and **Post-Pruning** techniques both ***increased testing accuracy*** and ***reduced overfitting*** with minimal modifications!