# Decision Trees

Decision trees are a powerful prediction method and extremely popular.

They are popular because the final model is easy for practitioners and domain experts alike to understand. The final decision tree can explain exactly why a specific prediction was made, which makes it very attractive for operational use.

Decision trees also provide the foundation for more advanced ensemble methods such as bagging, random forests and gradient boosting.

In this tutorial, you will discover how to implement the Classification And Regression Tree algorithm from scratch with Python.

After completing this tutorial, you will know:
-  How to calculate and evaluate candidate split points in a dataset.
-  How to arrange splits into a decision tree structure.
-  How to apply the classification and regression tree algorithm to a real problem.

Let's get started.

## Descriptions

This section provides a brief introduction to the Classification and Regression Tree algorithm and the Banknote dataset used in this tutorial.

### Classification and Regression Trees

Classification and Regression Trees, or CART for short, is a term introduced by Leo Breiman to refer to decision tree algorithms that can be used for classification or regression predictive modeling problems.

We will focus on using CART for classification in this tutorial.

The representation of the CART model is a binary tree. This is the same binary tree from algorithms and data structures, nothing too fancy (each node can have zero, one or two child nodes).

A node represents a single input variable (X) and a split point on that variable, assuming the variable is numeric. The leaf nodes (also called terminal nodes) of the tree contain an output variable (y) which is used to make a prediction.

Once created, a tree can be navigated with a new row of data by following the branch chosen at each split until a leaf node is reached and a final prediction is made.
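To make the navigation concrete, here is a minimal sketch. The node layout is an assumption made for illustration: internal nodes are dictionaries with hypothetical keys `index`, `value`, `left` and `right`, leaves are plain class labels, and rows with a value below the split point are sent left.

```python
# Assumed node layout (hypothetical): internal nodes are dicts, leaves are class labels.
def predict(node, row):
    """Follow the splits from the root until a leaf (a class label) is reached."""
    while isinstance(node, dict):
        # Assumed convention: values below the split point go left, the rest go right.
        branch = 'left' if row[node['index']] < node['value'] else 'right'
        node = node[branch]
    return node

# A hand-built tree with a single split on the first input variable.
stump = {'index': 0, 'value': 2.5, 'left': 0, 'right': 1}

print(predict(stump, [1.0, 7.3]))  # 0
print(predict(stump, [4.2, 0.1]))  # 1
```

Deeper trees work the same way, because a `left` or `right` entry may itself be another dictionary.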

Creating a binary decision tree is actually a process of dividing up the input space. A greedy approach is used to divide the space, called Recursive Binary Splitting. This is a numerical procedure where all the values are lined up and different split points are tried and tested using a cost function.

The split with the best cost (lowest cost because we minimize cost) is selected. All input variables and all possible split points are evaluated and chosen in a greedy manner based on the cost function.

-  __Regression__: The cost function that is minimized to choose split points is the sum of squared errors across all training samples that fall within the rectangle (the region of the input space covered by a node).
-  __Classification__: The Gini cost function is used, which indicates how pure the nodes are. Node purity refers to how mixed the training data assigned to each node is: a pure node contains rows from a single class.
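Since the rest of this tutorial focuses on classification, the regression case can be sketched briefly here. This is an illustrative sketch rather than a full implementation: it searches a single input variable, the function names `sse()` and `best_split_1d()` are hypothetical, and it assumes rows with a value below the candidate split point go to the left group.

```python
def sse(values):
    """Sum of squared errors of the values around their mean (0.0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split_1d(xs, ys):
    """Try every observed x as a split point and keep the one with the lowest cost."""
    best_value, best_cost = None, float('inf')
    for candidate in xs:
        left = [y for x, y in zip(xs, ys) if x < candidate]
        right = [y for x, y in zip(xs, ys) if x >= candidate]
        cost = sse(left) + sse(right)
        if cost < best_cost:
            best_value, best_cost = candidate, cost
    return best_value, best_cost

# Contrived data: the output jumps between x = 3 and x = 10.
xs = [1, 2, 3, 10, 11, 12]
ys = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
value, cost = best_split_1d(xs, ys)
print(value, round(cost, 2))  # 10 0.16
```

The search is greedy in the sense described above: it keeps the single cheapest split for the data it is given and never revisits that choice.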

Splitting continues until nodes contain a minimum number of training examples or a maximum tree depth is reached.

### Banknote Dataset

The banknote dataset involves predicting whether a given banknote is authentic based on a number of measures taken from a photograph.

The dataset contains 1,372 rows with 5 numeric variables (four inputs and one class label). It is a classification problem with two classes (binary classification).

Below is a list of the five variables in the dataset.
1. Variance of Wavelet Transformed image (continuous).
2. Skewness of Wavelet Transformed image (continuous).
3. Kurtosis of Wavelet Transformed image (continuous).
4. Entropy of image (continuous).
5. Class (integer).

Below is a sample of the first 5 rows in the dataset:
```
 3.6216,8.6661,-2.8073,-0.44699,0
 4.5459,8.1674,-2.4586,-1.4621,0
 3.866,-2.6383,1.9242,0.10645,0
 3.4566,9.5228,-4.0112,-3.5944,0
 0.32924,-4.4552,4.5718,-0.9888,0
```

Using the Zero Rule Algorithm to predict the most common class value, the baseline accuracy on the problem is about 56% (762 of the 1,372 rows belong to class 0).
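The Zero Rule idea can be sketched in a few lines. The function name `zero_rule()` and the toy labels below are made up for illustration; they are not taken from the banknote data.

```python
from collections import Counter

def zero_rule(train_labels, n_predictions):
    """Predict the most common training class for every row, ignoring the inputs."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return [most_common] * n_predictions

# Toy labels (hypothetical): three rows of class 0 and two of class 1.
labels = [0, 0, 0, 1, 1]
predictions = zero_rule(labels, len(labels))
correct = sum(p == a for p, a in zip(predictions, labels))
accuracy = 100.0 * correct / len(labels)
print(predictions, accuracy)  # [0, 0, 0, 0, 0] 60.0
```

Any model worth keeping should beat the accuracy that this baseline achieves on the same data.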

You can learn more and download the dataset from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/banknote+authentication).

Download the dataset and place it in your current working directory with the filename __data_banknote_authentication.csv__.
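As a quick check that the file is in place, the sketch below reads it into a list of rows. It assumes the file has no header line and that the class label is the last column, as in the sample above. The helper name `parse_rows()` is hypothetical.

```python
import csv
import os

def parse_rows(lines):
    """Parse CSV lines into rows of four floats followed by an integer class label."""
    rows = []
    for fields in csv.reader(lines):
        if not fields:
            continue  # skip blank lines
        *inputs, label = fields
        rows.append([float(v) for v in inputs] + [int(label)])
    return rows

filename = 'data_banknote_authentication.csv'
if os.path.exists(filename):
    with open(filename, newline='') as f:
        dataset = parse_rows(f)
    # Report the number of rows and show the first one.
    print(len(dataset), dataset[0])
```

The existence check only keeps the snippet from failing when it is run before the file has been downloaded.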

## Tutorial

This tutorial is broken down into 5 parts:
1. Gini Index.
2. Create Split.
3. Build a Tree.
4. Make a Prediction.
5. Banknote Case Study.

These steps will give you the foundation that you need to implement the CART algorithm from scratch and apply it to your own predictive modeling problems.

### 1. Gini Index

The Gini index is the name of the cost function used to evaluate splits in the dataset.

A split in the dataset involves one input attribute and one value for that attribute. It can be used to divide training patterns into two groups of rows.
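As a rough illustration, the sketch below divides the five sample rows from the banknote dataset on the second attribute (skewness) at a value of 0. The function name `split_rows()` is hypothetical, and sending rows with a value below the split value to the left group is an assumed convention.

```python
def split_rows(rows, index, value):
    """Divide rows into two groups using one attribute (index) and one split value."""
    left = [row for row in rows if row[index] < value]
    right = [row for row in rows if row[index] >= value]
    return left, right

# The first five rows of the banknote dataset.
rows = [
    [3.6216, 8.6661, -2.8073, -0.44699, 0],
    [4.5459, 8.1674, -2.4586, -1.4621, 0],
    [3.866, -2.6383, 1.9242, 0.10645, 0],
    [3.4566, 9.5228, -4.0112, -3.5944, 0],
    [0.32924, -4.4552, 4.5718, -0.9888, 0],
]

left, right = split_rows(rows, 1, 0.0)
print(len(left), len(right))  # 2 3
```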

A Gini score gives an idea of how good a split is by how mixed the classes are in the two groups created by the split. A perfect separation results in a Gini score of 0, whereas the worst-case split, one that leaves a 50/50 mix of classes in each group, results in a Gini score of 0.5 (for a two-class problem).

Calculating Gini is best demonstrated with an example.

We have two groups of data with 2 rows in each group. The rows in the first group all belong to class 0 and the rows in the second group belong to class 1, so it's a perfect split.

We first need to calculate the proportion of classes in each group.

$$\text{proportion} = \frac{\text{count(class value)}}{\text{count(rows)}}$$

The proportion for this example would be:

$$
\begin{aligned}
\text{Group 1 class 0} &= 2 / 2 = 1\\
\text{Group 1 class 1} &= 0 / 2 = 0\\
\text{Group 2 class 0} &= 0 / 2 = 0\\
\text{Group 2 class 1} &= 2 / 2 = 1
\end{aligned}
$$
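The same proportions can be computed with a short sketch. It assumes that the class value is stored as the last item of each row. The function name `class_proportions()` is hypothetical.

```python
def class_proportions(group, classes):
    """Return the proportion of rows in the group that belong to each class."""
    size = len(group)
    if size == 0:
        # An empty group has no rows, so every proportion is reported as zero.
        return {c: 0.0 for c in classes}
    labels = [row[-1] for row in group]
    return {c: labels.count(c) / size for c in classes}

# Two groups of two rows each; the last value in each row is the class.
group_1 = [[1, 0], [2, 0]]
group_2 = [[3, 1], [4, 1]]

print(class_proportions(group_1, [0, 1]))  # {0: 1.0, 1: 0.0}
print(class_proportions(group_2, [0, 1]))  # {0: 0.0, 1: 1.0}
```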

Gini is then calculated for each child node as follows:

$$
\begin{aligned}
\text{Gini index} &= \sum(\text{proportion} * (1.0 - \text{proportion}))\\
&= 1.0 - \sum(\text{proportion}^2)
\end{aligned}
$$
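The second form follows from the first because the class proportions within a group sum to 1. Writing $p_k$ for the proportion of class $k$ (a symbol introduced here only for this step):

$$
\begin{aligned}
\sum_k p_k (1.0 - p_k) &= \sum_k p_k - \sum_k p_k^2\\
&= 1.0 - \sum_k p_k^2
\end{aligned}
$$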

The Gini index for each group must then be weighted by the size of the group relative to all of the samples in the parent, i.e. all samples that are currently being grouped. We can add this weighting to the Gini calculation for a group as follows:

$$\text{Gini index} = (1.0 - \sum(\text{proportion}^2)) * \frac{\text{group size}}{\text{total samples}}$$

In this example, the Gini scores for each group are calculated as follows:

$$
\begin{aligned}
\text{Gini(group 1)} &= (1 - (1^2 + 0^2)) * 2/4\\
&= 0 * 0.5\\
&= 0\\
\text{Gini(group 2)} &= (1 - (0^2 + 1^2)) * 2/4\\
&= 0 * 0.5\\
&= 0
\end{aligned}
$$

The scores are then added across each child node at the split point to give a final Gini score for the split point that can be compared to other candidate split points.

The Gini for this split point would then be calculated as 0.0 + 0.0, giving a perfect Gini score of 0.0.
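For contrast, consider a worst-case split of the same four rows, where each group holds one row of class 0 and one row of class 1, so every proportion is 0.5. This is a worked example added for illustration, using the same weighted formula:

$$
\begin{aligned}
\text{Gini(group 1)} &= (1 - (0.5^2 + 0.5^2)) * 2/4\\
&= 0.5 * 0.5\\
&= 0.25\\
\text{Gini(group 2)} &= (1 - (0.5^2 + 0.5^2)) * 2/4\\
&= 0.5 * 0.5\\
&= 0.25
\end{aligned}
$$

Adding the two gives 0.25 + 0.25 = 0.5, the worst Gini score for a two-class problem.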

Below is a function named __gini_index()__ that calculates the Gini index for a list of groups and a list of known class values.