### What is a Decision Tree? ###

Decision trees are machine learning models that try to find patterns in the features of data points.

<img src='tree_gif.gif' width="800" height="800">

If we’re given this magic tree, it seems relatively easy to make classifications. But how do these trees get created in the first place? Decision trees are supervised machine learning models, which means that they’re created from a training set of labeled data. Creating the tree is where the learning in machine learning happens.

Take a look at the gif on this page. We begin with every point in the training set at the top of the tree. These training points have labels — the red points represent students that didn’t get an A on a test and the green points represent students that did get an A on a test.

We then decide to split the data into smaller groups based on a feature. For example, that feature could be something like each student’s average grade in the class. Students with an A average would go into one subset, students with a B average would go into another subset, and so on.

Once we have these subsets, we repeat the process — we split the data in each subset again on a different feature. Eventually, we reach a point where we decide to stop splitting the data into smaller groups. We’ve reached a leaf of the tree. We can now count up the labels of the data in that leaf. If an unlabeled point reaches that leaf, it will be classified as the majority label.
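The split-then-vote idea above can be sketched in a few lines of plain Python. This is only an illustration with made-up data: the `students` list and the helper names `split_by_feature` and `majority_label` are hypothetical, and the sketch performs a single split rather than building a full tree.

```python
from collections import Counter, defaultdict

# Hypothetical training points: (average grade in the class, got an A on the test?)
students = [
    ("A", True), ("A", True), ("A", False),
    ("B", False), ("B", True), ("B", False),
    ("C", False),
]

def split_by_feature(points):
    """Group the labels of the training points by their feature value."""
    groups = defaultdict(list)
    for feature_value, label in points:
        groups[feature_value].append(label)
    return dict(groups)

def majority_label(labels):
    """Return the most common label among the training points in a leaf."""
    return Counter(labels).most_common(1)[0][0]

# One split on the feature; each resulting group is treated as a leaf
leaves = split_by_feature(students)
predictions = {value: majority_label(labels) for value, labels in leaves.items()}
print(predictions)  # {'A': True, 'B': False, 'C': False}
```

An unlabeled student with a B average would land in the "B" leaf and be classified as not getting an A, because that is the majority label there.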

We can now make a tree, but how do we know which features to split the data set on? After all, if we started by splitting the data based on the number of hours the students slept the night before the test, we’d end up with a very different tree that would produce very different results. How do we know which tree is best? We’ll tackle this question soon!

### Implementing a Decision Tree ###

To answer the questions posed in the previous exercise, we’re going to do things a bit differently in this lesson and work “backwards”: we’re first going to fit a decision tree to a dataset and visualize this tree using scikit-learn. We’re then going to systematically unpack the following: how to interpret the tree visualization, how scikit-learn’s implementation works, what Gini impurity is, and what the parameters and hyperparameters of the decision tree model are.

We’re going to use a dataset about cars with six features:
* The price of the car, buying, which can be “vhigh”, “high”, “med”, or “low”.
* The cost of maintaining the car, maint, which can be “vhigh”, “high”, “med”, or “low”.
* The number of doors, doors, which can be “2”, “3”, “4”, or “5more”.
* The number of people the car can hold, persons, which can be “2”, “4”, or “more”.
* The size of the trunk, lug_boot, which can be “small”, “med”, or “big”.
* The safety rating of the car, safety, which can be “low”, “med”, or “high”.

The question we will be trying to answer using decision trees is: when considering buying a car, what factors go into making that decision?

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


# Import the splitting helper, the classifier, and the tree module from scikit-learn:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

#Loading the dataset
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data', names=['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'accep'])

1. We’ve imported the dataset into the workspace.

* Take a look at the first five rows of the dataset.

In [3]:
## 1a. Take a look at the dataset
print(df.head())

  buying  maint doors persons lug_boot safety  accep
0  vhigh  vhigh     2       2    small    low  unacc
1  vhigh  vhigh     2       2    small    med  unacc
2  vhigh  vhigh     2       2    small   high  unacc
3  vhigh  vhigh     2       2      med    low  unacc
4  vhigh  vhigh     2       2      med    med  unacc


* We’ve created dummy features for the categorical values and set the predictor and target variables as X and y respectively.

In [4]:
## 1b. Setting the target and predictor variables
df['accep'] = df['accep'] != 'unacc'  # True if acceptable, False if unacceptable
X = pd.get_dummies(df.iloc[:,0:6])
y = df['accep']

* You can examine the new set of features.

In [5]:
## 1c. Examine the new features
print(X.columns)
print(len(X.columns))

Index(['buying_high', 'buying_low', 'buying_med', 'buying_vhigh', 'maint_high',
       'maint_low', 'maint_med', 'maint_vhigh', 'doors_2', 'doors_3',
       'doors_4', 'doors_5more', 'persons_2', 'persons_4', 'persons_more',
       'lug_boot_big', 'lug_boot_med', 'lug_boot_small', 'safety_high',
       'safety_low', 'safety_med'],
      dtype='object')
21
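To see what the dummy encoding does to a single categorical column, here is a minimal sketch on a made-up four-row frame. The `toy` DataFrame is hypothetical and is not part of the car dataset; only `pd.get_dummies` is taken from the cell above.

```python
import pandas as pd

# Hypothetical single-column frame with the same categories as `safety`
toy = pd.DataFrame({"safety": ["low", "med", "high", "low"]})

# Each category becomes its own indicator column named <column>_<category>
dummies = pd.get_dummies(toy)
print(list(dummies.columns))

# Every row has exactly one indicator switched on
print((dummies.sum(axis=1) == 1).all())
```

This is why the six original features expand to 21 columns: each feature contributes one column per category it can take.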


2. We can now perform a train-test split and fit a decision tree to our training data. We’ll be using scikit-learn’s train_test_split function to do the split and the DecisionTreeClassifier() class to fit the data.

In [6]:
## 2a. Performing the train-test split
x_train, x_test, y_train, y_test = train_test_split(X,y, random_state=0, test_size=0.2)

## 2b.Fitting the decision tree classifier
dt = DecisionTreeClassifier(max_depth=3, ccp_alpha=0.01,criterion='gini')
dt.fit(x_train, y_train)

3. We’re now ready to visualize the decision tree! The tree module within scikit-learn has a plotting functionality that allows us to do this.

In [None]:
## 3.Plotting the Tree
plt.figure(figsize=(20,12))
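# feature_names is passed as a plain list, which avoids type-validation errors in some scikit-learn versions
tree.plot_tree(dt, feature_names=list(x_train.columns), max_depth=5, class_names=['unacc', 'acc'], label='all', filled=True)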
plt.tight_layout()
plt.show()
## This code visualizes the decision tree, but for some reason it cannot run in this cell.
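When a plot is not available, a fitted tree can also be inspected as plain text with scikit-learn’s `export_text`. The sketch below is self-contained and uses a tiny made-up dataset rather than the car data: `X_toy`, `y_toy`, and `toy_tree` are hypothetical stand-ins, and the two feature names are borrowed from the dummy columns only as labels.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: two 0/1 dummy features, label True = acceptable.
# Here the label depends only on the first feature.
X_toy = [[1, 0], [1, 1], [0, 0], [0, 1], [1, 0], [0, 1]]
y_toy = [False, False, True, True, False, True]

toy_tree = DecisionTreeClassifier(max_depth=3, criterion="gini", random_state=0)
toy_tree.fit(X_toy, y_toy)

# Text rendering of the fitted tree, one line per node
report = export_text(toy_tree, feature_names=["safety_low", "persons_2"])
print(report)
```

In the notebook, the same call would presumably take `dt` and `list(x_train.columns)` in place of the toy objects.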

### Interpreting a Decision Tree ###

We’re now going to examine the decision tree we built for the car dataset.
Two important concepts to note here are the following:

1. The root node is the top of the tree. It is annotated with the number of samples and the number in each class (i.e. unacceptable vs. acceptable) that were used to build the tree.

2. Splits occur with True to the left, False to the right. Note that the right split is a leaf node, i.e. there are no more branches. Any decision ending here results in the majority class. (The majority class here is unacc.)

To interpret the tree, it’s useful to keep in mind that the variables we’re looking at are categorical variables that correspond to:

* buying: The price of the car which can be “vhigh”, “high”, “med”, or “low”.
* maint: The cost of maintaining the car which can be “vhigh”, “high”, “med”, or “low”.
* doors: The number of doors which can be “2”, “3”, “4”, or “5more”.
* persons: The number of people the car can hold which can be “2”, “4”, or “more”.
* lug_boot: The size of the trunk which can be “small”, “med”, or “big”.
* safety: The safety rating of the car which can be “low”, “med”, or “high”.
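Because every feature the tree sees is a 0/1 dummy column, a node test of the form `feature <= 0.5` is true exactly when that dummy is 0. The following is a minimal sketch, assuming a hypothetical node that tests `safety_low <= 0.5`; which test the fitted tree actually uses at each node has to be read off the plot. The helper `go_left` is made up for illustration.

```python
def go_left(sample, feature, threshold=0.5):
    """Return True when the node test `feature <= threshold` holds (left branch)."""
    return sample[feature] <= threshold

# Hypothetical cars, described only by the dummy column the node tests
car_not_low_safety = {"safety_low": 0}
car_low_safety = {"safety_low": 1}

print(go_left(car_not_low_safety, "safety_low"))  # True  -> left branch
print(go_left(car_low_safety, "safety_low"))      # False -> right branch
```

So under this assumed test, a car whose safety rating is low would follow the False branch to the right.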

### Gini Impurity ###

Consider the two trees below. Which tree would be more useful as a model that tries to predict whether someone would get an A in a class?

<img src='comparision.jpeg' width="800" height="800">

Let’s say you use the top tree. You’ll end up at a leaf node where the label is up for debate. The training data has labels from both classes! If you use the bottom tree, you’ll end up at a leaf where there’s only one type of label. There’s no debate at all! We’d be much more confident about our classification if we used the bottom tree.

This idea can be quantified by calculating the Gini impurity of a set of data points. For two classes (1 and 2) with probabilities p_1 and p_2 respectively, the Gini impurity is:

$$ 1-(p_{1}^2+p_{2}^2) = 1-(p_{1}^{2}+(1-p_{1})^2) $$

<img src='gini.jpeg' width="800" height="800">

The goal of a decision tree model is to separate the classes as well as possible, i.e. to minimize the impurity (or maximize the purity). Notice that if p_1 is 0 or 1, the Gini impurity is 0: every point belongs to one class, so the separation is perfect. From the graph, the Gini impurity reaches its maximum of 0.5 at p_1=0.5, where the two classes are equally balanced, so the set is as impure as it can be!
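The two-class formula is easy to check numerically. The function name `gini_two_classes` below is our own for this sketch and is not part of scikit-learn.

```python
def gini_two_classes(p1):
    """Gini impurity 1 - (p1^2 + p2^2) for two classes, with p2 = 1 - p1."""
    p2 = 1 - p1
    return 1 - (p1 ** 2 + p2 ** 2)

for p1 in (0.0, 0.25, 0.5, 1.0):
    print(p1, gini_two_classes(p1))
# 0.0 0.0
# 0.25 0.375
# 0.5 0.5
# 1.0 0.0
```

The values match the shape of the curve: zero at both ends and the largest value, 0.5, in the middle.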

In general, the Gini impurity for C classes is defined as:

$$ 1-\sum \limits_{i=1}^{C}p_{i}^2 $$
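As a sketch of the general formula, the function below estimates each p_i as the fraction of points carrying label i and then applies the sum. The name `gini_impurity` and the example labels are hypothetical, and treating an empty set as impurity 0 is a convention chosen for this sketch.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity 1 - sum(p_i^2), with p_i estimated from label counts."""
    total = len(labels)
    if total == 0:
        return 0.0  # convention for this sketch: an empty set has no impurity
    counts = Counter(labels)
    return 1 - sum((count / total) ** 2 for count in counts.values())

print(gini_impurity(["a", "a", "a", "a"]))  # one class: 0.0
print(gini_impurity(["a", "a", "b", "b"]))  # two balanced classes: 0.5
print(gini_impurity(["a", "b", "c", "d"]))  # four balanced classes: 0.75
```

With C equally balanced classes the impurity is 1 - 1/C, so the maximum grows toward 1 as the number of classes increases.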