# Week 7 - Decision Trees

## Learning Objectives
+ Understanding decision trees
+ Visualizing decision trees
+ Decision Tree classification 
    + Iris dataset
    + Visualizing the decision surface
+ Decision Tree Regression
    + Auto MPG dataset
+ Intuition about Random Forests


The contents of this tutorial are based on the Chapter 6 of "Hands-On Machine Learning with Scikit-Learn and TensorFlow : Concepts, Tools, and Techniques to Build Intelligent Systems" by Géron, Aurélien \[[NUS library link](https://nus.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwdV1LSwMxEB5svXjyVfFRJX9gtZtsNg0UL9Kl4FFB6KVkzVikdhdqi3_fyTRbW9Fbkh2GkJ3kG4aZbwCUvO0lv94ErzR6b8gdTo16zbw0ztnUlbnX0vfzN-7-Jp_Gavioxj_5suRpkdLVmlHwc7oJXChCXaPp5W0pxcb9sjFSaTQ5DZFTigMtNtNkyVzUZcneyN8wWxObNsRP8WN_dx5IYVu0hS3UKY5gH0MpwjHsYXUCh00DBhHv4ym4wdwtZvejUK47uONxsl6qKzHnNEkUsS_EVISQqyDEm70vd4VZQpASEdLZ60XxUX9FiQ50i-HzwyihvU1iqGfSHIxUZ9Cu6grPQTjlrCsD8HvMcomlRdTe9XTgYUl1eQGdP1Vc_rN-BQcyQBuHIbrQXi5WeM1HdMM_4htft4eB)\], [sklearn tutorial](https://scikit-learn.org/0.15/auto_examples/tree/plot_iris.html), [sklearn article on decision trees](https://scikit-learn.org/stable/modules/tree.html).

# Decision Trees in sklearn

Decision Trees can be used for classification as well as regression, using ```DecisionTreeClassifier``` and ```DecisionTreeRegressor```. They require little data preparation compared to other methods; in particular, they do not require feature scaling or centering. Note, however, that the sklearn tree APIs have historically not supported missing values (native support only arrived in recent releases), so it is safest to **impute missing values before** passing the data to these APIs.

# Decision Tree Classification

## Dataset - Iris Dataset

For this part of the tutorial, we will start with a classic dataset we have used previously: the iris dataset. For now, we will use just two features, petal length and petal width, to make the results easy to visualize.

In [46]:
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data[:, 2:] # petal length and width
y = iris.target

In [47]:
import numpy as np

rng = np.random.RandomState(42)

As we know from previous tutorials, training a classifier is as simple as picking the correct API and calling its ```fit()``` and ```predict()``` functions. So let us use the [```DecisionTreeClassifier```](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier) and fit it on the petal length and petal width data. We choose just two columns of the dataset so that the decision surface is easy to visualize at first.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=rng)

### Training Decision Tree

Training the decision tree is simple; however, we need to keep the depth of the tree in mind.

If left unconstrained, the tree structure will adapt itself to the training data, fitting it very closely, and most likely overfitting it. Such a model is often called a nonparametric model, not because it does not have any parameters (it often has a lot) but because the number of parameters is not determined prior to training, so the model structure is free to stick closely to the data. In contrast, a parametric model such as a linear model has a predetermined number of parameters, so its degrees of freedom are limited, reducing the risk of overfitting (but increasing the risk of underfitting).

To avoid overfitting the training data, you need to restrict the Decision Tree’s freedom during training. As you know by now, this is called regularization. The regularization hyperparameters depend on the algorithm used, but generally you can at least restrict the maximum depth of the Decision Tree. 
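As a minimal sketch of this kind of regularization, the cell below fits a depth-limited tree on the two petal features. The value `max_depth=2` and the integer seed `42` are illustrative assumptions, not tuned choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:]  # petal length and width
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

# max_depth caps how many questions the tree may ask on any path
# (2 is an illustrative value, not a tuned one)
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X_train, y_train)

print("depth:", tree_clf.get_depth())
print("number of leaves:", tree_clf.get_n_leaves())
```

Other regularization hyperparameters of ```DecisionTreeClassifier```, such as ```min_samples_leaf``` or ```max_leaf_nodes```, can be set in the same way.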

## Plotting Decision Surface

Now let us try to visualize the decision surface of the classifier. As we have just two features, it will be easy to visualize. The intuition here is to use the model to make predictions for a grid of values across the input domain.

For this visualization, we will use two key functions: [```np.meshgrid```](https://numpy.org/doc/stable/reference/generated/numpy.meshgrid.html?ref=hackernoon.com) and [```plt.contourf```](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.contourf.html).

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Parameters
n_classes = 3
plot_step = 0.02

First, we need to define a grid of points across the feature space. To do this, we find the minimum and maximum values of each feature and expand the grid one step beyond them to ensure that the whole feature space is covered. We then need a grid of uniformly spaced values across each dimension, which is what ```np.meshgrid()``` produces.

In [None]:
x_min, x_max = X_test[:, 0].min() - 1, X_test[:, 0].max() + 1
y_min, y_max = X_test[:, 1].min() - 1, X_test[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                    np.arange(y_min, y_max, plot_step))

Next, to predict on these grid points, we need to arrange them in the input format our model expects. We can flatten each grid with ```ravel()``` and use the [```np.c_```](https://numpy.org/doc/stable/reference/generated/numpy.c_.html) utility to stack the two flattened arrays as columns, giving one row per grid point. We then have the grid of values across the feature space and the class labels predicted by our model, ready to be drawn as a contour plot.

In [None]:
Z = tree_clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
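The steps above can be put together as one self-contained sketch. For brevity it fits the tree on all samples and uses `max_depth=2`; the colormap, transparency and marker sizes are likewise illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:]  # petal length and width
y = iris.target
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

# Grid covering the feature space, padded by one unit on each side
plot_step = 0.02
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                     np.arange(y_min, y_max, plot_step))

# One prediction per grid point, reshaped back to the grid layout
Z = tree_clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

fig, ax = plt.subplots()
ax.contourf(xx, yy, Z, alpha=0.4, cmap=plt.cm.RdYlBu)  # filled decision regions
ax.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.RdYlBu, edgecolor="k", s=20)
ax.set_xlabel(iris.feature_names[2])
ax.set_ylabel(iris.feature_names[3])
ax.set_title("Decision surface (max_depth=2)")
```

In a notebook the figure renders at the end of the cell; in a script, call ```plt.show()``` afterwards.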

As you can see, Decision Trees are fairly intuitive and their decisions are easy to interpret. Such models are often called **white box models**.

Scikit-Learn uses the CART algorithm, which produces only binary trees: nonleaf nodes always have two children (i.e., questions only have yes/no answers). However, other algorithms such as ID3 can produce Decision Trees with nodes that have more than two children.

A Decision Tree can also estimate the probability that an instance belongs to a particular class k: first it traverses the tree to find the leaf node for this instance, and then it returns the ratio of training instances of class k in this node. 
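The following sketch checks this by hand: it calls ```predict_proba``` for one instance and recomputes the class ratios from the training samples that land in the same leaf. The sample values (5 cm petal length, 1.5 cm petal width) and `max_depth=2` are hypothetical choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data[:, 2:], iris.target
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

sample = np.array([[5.0, 1.5]])  # hypothetical flower
proba = tree_clf.predict_proba(sample)

# Recompute the ratios from the training samples in the same leaf
leaf_id = tree_clf.apply(sample)[0]
in_leaf = tree_clf.apply(X) == leaf_id
manual = np.bincount(y[in_leaf], minlength=3) / in_leaf.sum()

print("predict_proba:", proba[0])
print("manual ratios:", manual)
```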

The ```DecisionTreeClassifier``` also has a ```score``` function which returns the mean accuracy on given test data and labels. For other metrics, the ```predict``` function would still be used.
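A short sketch of both routes: ```score``` for mean accuracy, and ```predict``` combined with a function from ```sklearn.metrics``` for another metric. The macro-averaged F1 score is just one example metric, chosen here as an assumption.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data[:, 2:], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_train, y_train)

# score() is the mean accuracy on the given data and labels
acc = tree_clf.score(X_test, y_test)

# Any other metric needs the predictions explicitly
y_pred = tree_clf.predict(X_test)
macro_f1 = f1_score(y_test, y_pred, average="macro")

print("accuracy:", acc)
print("macro F1:", macro_f1)
```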

Keep in mind that a low feature importance value does not necessarily mean that the feature is unimportant for prediction; it only means that this particular feature was not chosen at an early level of the tree. In some cases, the feature may be identical to, or highly correlated with, another informative feature. Feature importance values also do not tell you which class a feature is most predictive for, nor do they capture relationships between features that may influence prediction. The feature importance simply returns the normalized total reduction of the splitting criterion brought by each feature (Gini importance).
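As a small illustration, the importances can be read from the fitted tree's ```feature_importances_``` attribute. This sketch uses all four iris features and `max_depth=3`, both of which are assumptions made so that there is more than one feature to compare.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree_clf = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_clf.fit(iris.data, iris.target)

# One value per feature; the values are normalized to sum to 1
importances = tree_clf.feature_importances_
for name, value in zip(iris.feature_names, importances):
    print(f"{name}: {value:.3f}")
```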

### Reading the tree structure: 
Suppose you find an iris flower and you want to classify it. 
You start at the root node (depth 0, at the top): this node asks whether the flower’s petal width is smaller than 0.8 cm. 
If it is, then you move down to the root’s left child node (depth 1, left). In this case, it is a leaf node (i.e., it does not have any children nodes), so it does not ask any questions: you can simply look at the predicted class for that node and the Decision Tree predicts that your flower is an Iris-Setosa (class=setosa).
If it is not, then you move down to the root's right child node, which asks another question.
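The same walk from root to leaf can also be read from a plain-text dump of the tree. The sketch below uses ```export_text``` from ```sklearn.tree```; the exact thresholds printed depend on the fitted tree, and `max_depth=2` is an illustrative assumption.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X, y = iris.data[:, 2:], iris.target
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

# Each indented line is a question; lines starting with "class:" are leaves
rules = export_text(
    tree_clf, feature_names=["petal length (cm)", "petal width (cm)"])
print(rules)
```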

The gini score quantifies the impurity of a node or leaf. A gini score greater than 0 implies that the samples contained within that node belong to different classes. A gini score of 0 means that the node is pure: only a single class of samples exists within it. In our tree, the gini score at the root node is greater than 0, implying that the samples in the root node come from different classes.
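For reference, the Gini impurity of a node is $1 - \sum_k p_k^2$, where $p_k$ is the fraction of the node's samples belonging to class $k$. A minimal sketch, with a hypothetical helper `gini` and made-up class counts:

```python
import numpy as np

def gini(counts):
    """Gini impurity of a node, given its per-class sample counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()  # class fractions within the node
    return 1.0 - np.sum(p ** 2)

# Hypothetical value lists: [setosa, versicolor, virginica]
print(gini([50, 50, 50]))  # evenly mixed node
print(gini([50, 0, 0]))    # pure node
print(gini([0, 49, 5]))    # mostly one class
```

An evenly mixed three-class node gives 2/3, a pure node gives 0, and mostly pure nodes fall in between.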

The value list tells us how many samples at the given node fall into each category. The categories are setosa, versicolor and virginica here. This is essentially similar information to what the gini score is giving us. 

Regarding the color of the nodes, note that non-colored nodes have no majority prediction, while any node that has a majority prediction is colored.

Predictions of decision trees are neither smooth nor continuous; they are piecewise constant approximations, as seen in the above figure. Therefore, they are not good at extrapolation. Also note that if some classes dominate, the learned tree can be biased toward them. It is therefore recommended to **balance** the dataset before fitting the decision tree.

You may have also noticed that decision trees love orthogonal decision boundaries (all splits are perpendicular to an axis), which makes them sensitive to training set rotation. One way to limit the resulting loss of generalization is to use PCA, which often results in a better orientation of the training data.
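One way to try this is to chain PCA and the tree in a pipeline, so that the rotation is learned on the training folds only. This is a sketch under assumptions (two petal features, `max_depth=2`, 5-fold cross-validation); whether PCA helps depends on the data, so the scores should be compared rather than assumed.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data[:, 2:], iris.target

plain_tree = DecisionTreeClassifier(max_depth=2, random_state=42)
# PCA rotates the axes before the tree draws its axis-aligned splits
pca_tree = make_pipeline(
    PCA(n_components=2),
    DecisionTreeClassifier(max_depth=2, random_state=42))

plain_scores = cross_val_score(plain_tree, X, y, cv=5)
pca_scores = cross_val_score(pca_tree, X, y, cv=5)

print("tree only :", plain_scores.mean())
print("PCA + tree:", pca_scores.mean())
```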

# Decision Tree Regression

Let us load another dataset for performing the regression. The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes.

## Loading Dataset

In [20]:
import pandas as pd

df = pd.read_csv('auto-mpg.data', sep=r"\s+", header=None, names=['mpg', 'cylinders', 'displacement','horsepower', 'weight', 'acceleration', 'model year', 'origin', 'car name'])

In [None]:
y = df.mpg
X = df.drop('mpg', axis=1)
X.drop('car name', axis=1, inplace=True)

In [None]:
X.replace('?',np.nan, inplace=True)
X['horsepower'] = X['horsepower'].astype('float')

The horsepower column has 6 missing values. For now, we can simply drop these rows.
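A minimal sketch of dropping those rows is shown below. It uses a small made-up frame as a stand-in for the Auto MPG data so that it runs on its own; the point to note is that the same rows must be removed from the target `y` as well.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Auto MPG features and target (values are made up)
X = pd.DataFrame({
    "cylinders": [8, 4, 6, 4],
    "horsepower": [130.0, np.nan, 95.0, np.nan],
    "weight": [3504.0, 2130.0, 2833.0, 1985.0],
})
y = pd.Series([18.0, 25.0, 21.0, 30.0], name="mpg")

print("missing horsepower values:", X["horsepower"].isna().sum())

# Keep only the rows with a horsepower value, in both X and y
keep = X["horsepower"].notna()
X_clean = X[keep]
y_clean = y[keep]
```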

## Training Decision Tree Regression with different depths

For the regression task, we have another API: [```DecisionTreeRegressor```](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor). It also has similar attributes and functions as seen earlier, so we can still train with different depth values. 

The main difference is that instead of predicting a class in each node, this tree predicts a value. For a new instance, you traverse the tree until you reach a leaf node, and the prediction is simply the average target value of the training instances (samples) associated with that leaf. The MSE value stated in the node is the Mean Squared Error of this prediction over those samples. The CART algorithm works mostly the same way as earlier, except that instead of trying to split the training set in a way that minimizes impurity, it now tries to split the training set in a way that minimizes the MSE.
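The sketch below verifies both statements on synthetic data (a noisy sine curve, used here as an assumption in place of the Auto MPG data): the prediction equals the mean target of the training samples in the leaf, and the impurity stored for that leaf equals the MSE around that mean.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(42)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

tree_reg = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, y)

x_new = np.array([[3.0]])  # hypothetical new instance
prediction = tree_reg.predict(x_new)[0]

# Training samples that fall in the same leaf as x_new
leaf_id = tree_reg.apply(x_new)[0]
in_leaf = tree_reg.apply(X) == leaf_id
leaf_mean = y[in_leaf].mean()
leaf_mse = np.mean((y[in_leaf] - leaf_mean) ** 2)

# Impurity stored in the fitted tree for that leaf
node_mse = tree_reg.tree_.impurity[leaf_id]

print("prediction:", prediction, "leaf mean:", leaf_mean)
print("node MSE  :", node_mse, "manual MSE:", leaf_mse)
```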

In the regression tree visualization, darker nodes indicate higher predicted target values.

The algorithm splits each region in a way that makes most training instances as close as possible to that predicted value.

Let us examine the prediction and decision path for a particular instance.
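A sketch of how this can be done with ```decision_path``` and the fitted ```tree_``` attributes. The data is synthetic and the feature names `x0` and `x1` are hypothetical, so that the cell runs on its own.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=200)
feature_names = ["x0", "x1"]  # hypothetical names

tree_reg = DecisionTreeRegressor(max_depth=3, random_state=42).fit(X, y)

sample = np.array([[2.5, 7.0]])
prediction = tree_reg.predict(sample)[0]

# decision_path returns a sparse indicator matrix: one row per sample,
# with a nonzero entry for each node the sample passes through
node_indicator = tree_reg.decision_path(sample)
node_ids = node_indicator.indices[
    node_indicator.indptr[0]:node_indicator.indptr[1]]
leaf_id = tree_reg.apply(sample)[0]

tree = tree_reg.tree_
for node_id in node_ids:
    if node_id == leaf_id:
        print(f"leaf {node_id}: predicted value {tree.value[node_id, 0, 0]:.3f}")
    else:
        feat = tree.feature[node_id]
        thr = tree.threshold[node_id]
        op = "<=" if sample[0, feat] <= thr else ">"
        print(f"node {node_id}: {feature_names[feat]} = "
              f"{sample[0, feat]} {op} {thr:.3f}")
```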

Let us compare the performance of normal linear regression vs the decision tree regression.
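A minimal sketch of such a comparison is given below. It uses a deliberately nonlinear synthetic target (a noisy sine curve) rather than the Auto MPG data, and `max_depth=4` is an arbitrary choice, so the numbers only illustrate the procedure.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(42)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)  # nonlinear target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(max_depth=4, random_state=42),
}

# Fit each model on the same split and compare test-set MSE
mse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    mse[name] = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name}: test MSE = {mse[name]:.4f}")
```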

Decision trees make no assumptions about the shape of the data, so they are well suited to datasets where the relationship between the features and the target is far from linear. However, if linearity exists, we might be better off doing linear regression.

# Introduction to Random Forests

The intuition here is that if you aggregate the predictions of a group of predictors (such as classifiers or regressors), you will often get better predictions than with the best individual predictor. You can train a group of Decision Tree classifiers, each on a different random subset of the training set. To make predictions, you just obtain the predictions of all individual trees, then predict the class that gets the most votes. Such an *ensemble* of Decision Trees is called a Random Forest, and despite its simplicity, this is one of the most powerful Machine Learning algorithms available today. It is still one of the most popular techniques on Kaggle.

You can use the ```BaggingClassifier``` API along with the ```DecisionTreeClassifier``` to create a Random Forest. But to make our job easy, sklearn provides us with the ```RandomForestClassifier```.
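Both routes can be sketched side by side. The dataset is synthetic (```make_classification```), and the settings below (100 trees, `max_features="sqrt"` on the bagged trees to mimic a forest's per-split feature sampling) are assumptions for illustration, so the two ensembles are roughly, not exactly, equivalent.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

# Bagging: each tree is trained on a bootstrap sample of the training set
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt"),
    n_estimators=100, random_state=42)
bag_clf.fit(X_train, y_train)

# The dedicated API wraps the same idea in a single estimator
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)

bag_acc = bag_clf.score(X_test, y_test)
rf_acc = rf_clf.score(X_test, y_test)
print("bagged trees :", bag_acc)
print("random forest:", rf_acc)
```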

# Practice Exercise (Optional):
1. Generate an artificial dataset for classification. Refer to the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html). Use grid search with cross validation to find good hyperparameter values for DecisionTreeClassifier.
2. Continuing the previous exercise, generate 1,000 subsets of the training set, each containing 100 instances selected randomly. Use [```ShuffleSplit```](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ShuffleSplit.html). Train one Decision Tree on each subset, using the best hyperparameter values found above. Evaluate these 1,000 Decision Trees on the test set. Since they were trained on smaller sets, these Decision Trees will likely perform worse than the first Decision Tree. 
3. On the same dataset, use the [```BaggingClassifier```](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html) along with the ```DecisionTreeClassifier``` to create a forest of 100 trees. You can also take the majority vote of the decision trees trained in the previous exercise and compare it with the results of BaggingClassifier with DecisionTreeClassifier.