<a target="_blank" href="https://colab.research.google.com/github/JLDC/Data-Science-Fundamentals/blob/master/notebooks/108_decision-trees.ipynb">
    <img src="https://i.ibb.co/2P3SLwK/colab.png"  style="padding-bottom:5px;" />Open this notebook in Google Colab
</a>

___

# Regression and Classification Trees
___

In this notebook, we will learn about **regression trees** and **classification trees**. More generally, such trees are also called **decision trees** as an umbrella term for both regression and classification problems since one can interpret their structure as a *decision map*.

These are interesting because they represent a class of models for which we cannot easily write down a functional form such as

$$f(\mathbf{w},\mathbf{x}) = b + w_1 \cdot x_1 + w_2 \cdot x_2 + \dots$$

In a sense, they are what we typically call **non-parametric**, since they have no parameters to adjust in the sense of *weights*. This does not mean that there are no *hyperparameters* that can be used for model tuning!
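
To make the contrast with a weighted sum concrete, here is a minimal sketch of what a fitted tree amounts to: nested if/else rules that return one constant per region of the feature space. The function name `toy_tree_predict`, the thresholds and the returned values are all made up for illustration; nothing here is estimated from data.

```python
def toy_tree_predict(x1: float, x2: float) -> float:
    """Hypothetical hand-written regression tree with three leaves."""
    # First split: on x1
    if x1 < 0.5:
        return 1.0
    # Second split: only reached when x1 >= 0.5
    if x2 < 2.0:
        return 3.0
    return 5.0


# Each observation falls into exactly one leaf and receives that leaf's constant
for point in [(0.0, 10.0), (1.0, 0.0), (1.0, 4.0)]:
    print(point, toy_tree_predict(*point))
```

Note that there is no weight attached to `x1` or `x2`: the features only decide which leaf an observation falls into.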

As standalone models, regression trees are not very powerful for prediction problems. However, they serve two important purposes:  
1. They provide a nice interpretable non-parametric model to a given problem. Interpreting non-parametric models is often hard, but regression trees are an exception.
2. They serve as building blocks to strong prediction models such as random forests and boosted trees.

In this course, we focus mostly on the second of these two points. We will begin by building an understanding of regression trees in general and, in the next notebook, extend this understanding to random forests.

For the illustration of trees, we will use a dataset of Titanic passengers and whether they survived. We will also have a look at some simulated data to study the overfitting behavior of trees.

If you want to read more on tree models, consult Chapter 8 of the book [Introduction to Statistical Learning](https://www.statlearning.com/) and [this blog](http://uc-r.github.io/2018/05/09/random-forests/).


### 🧑‍💻 <font color=green>**Your Task**</font>

Go through the explanations and code pieces of this notebook and solve the questions outlined below. There is no need to understand things in great detail. If you just go a little bit further than scratching the surface it's perfectly fine. You should get a gut feeling about regression trees, no deep understanding needed.

___
## Data pre-processing
Just like the iris dataset, the Titanic passenger dataset is well known in the data science community. If you are not sure what a specific variable entails, you should try to figure it out by yourself using a search engine of your choice!

In [None]:
import matplotlib.pyplot as plt # Plotting
import numpy as np # Numerical computing
import pandas as pd # Dataframes
%matplotlib inline

# Define the path where the data is stored
DATA_PATH = "https://raw.githubusercontent.com/JLDC/Data-Science-Fundamentals/master/data"

In [None]:
# Import dataset
titanic = pd.read_csv(f"{DATA_PATH}/titanic.csv")
titanic # Display the data

As always, data cleaning comes first. We want to predict the `Survived` variable, i.e., whether our passengers have survived the accident or not. Let's look at how balanced our data is, and whether there are any missing values for the outcome variable.

In [None]:
# Survivors in the dataset
titanic["Survived"].value_counts()

In [None]:
# Check percentage of missing values
titanic.isnull().sum() / titanic.shape[0]

We notice right away that about 20% of the `Age` values in our dataset are missing, along with 77% of the `Cabin` values and 0.2% of the `Embarked` values.

Remember the concept of *data imputation* discussed in the last notebook? Let's go ahead and try it for the `Age` variable. We will use the mean `Age` to fill in the missing values and construct a tree using only the `Sex` and `Age` to begin with.

In [None]:
# Create a new variable, Age_Imputed, as a copy of Age in which
# missing values are replaced with the mean age
titanic["Age_Imputed"] = titanic["Age"].fillna(titanic["Age"].mean())

Because the sex is in text format, we also need to create dummies for it first.

In [None]:
# Replace the Sex variable with its dummy (1 = male, 0 = female)
titanic["Sex"] = pd.get_dummies(titanic["Sex"], drop_first=True).iloc[:, 0].astype(int)
titanic # Display the data again for good measure
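
As a small illustration of what `pd.get_dummies` does with and without `drop_first=True`, here is a sketch on a toy column. The toy data frame and the column name `Sex_male` are invented for this example and are not part of the Titanic data.

```python
import pandas as pd

# Toy column (made up for illustration)
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Without drop_first: one dummy column per category
full_dummies = pd.get_dummies(toy["Sex"])
# With drop_first=True: the first category in sorted order ("female") is dropped
reduced_dummies = pd.get_dummies(toy["Sex"], drop_first=True)

print(list(full_dummies.columns))
print(list(reduced_dummies.columns))

# Cast to int so the column holds 0/1 regardless of the pandas version
toy["Sex_male"] = reduced_dummies["male"].astype(int)
print(toy)
```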

## A simple classification tree

Let's dive right in and create our first classification tree. For this pedagogical example we will not make a train/test split yet and we will only use `Sex` and `Age_Imputed` as predictors.

In [None]:
# Classification tree and a function to plot its contents
from sklearn.tree import DecisionTreeClassifier, plot_tree

In [None]:
# Create the features and labels
X = titanic[["Sex", "Age_Imputed"]]
y = titanic["Survived"]

In [None]:
# Instantiate the tree model with a cost-complexity parameter for pruning of 0.01
tree1 = DecisionTreeClassifier(ccp_alpha=0.01)
# Fit the model on our data
tree1.fit(X, y)

In [None]:
# Set up the canvas
fig, ax = plt.subplots(figsize=(12, 8))
# Visualize the results
plot_tree(tree1, ax=ax, feature_names=["Sex", "Age"], 
          class_names=["Dead", "Survived"], label="root", filled=True)

___
#### ➡️ ✏️<font color=green>**Question 1**</font>

Try to read and understand the above plot.
+ Can you figure out what each square represents and how to read the whole tree?
+ What are the main predictors for survival according to the results of our tree?
___

Let's build a second tree with more variables. Namely, we will add the number of siblings or spouses aboard (`SibSp`), the number of parents or children aboard (`Parch`) and the passenger class (`Pclass`).

In [None]:
# Create the new X values (y stays the same)
X2 = titanic[["Sex", "Age_Imputed", "SibSp", "Parch", "Pclass"]]
# Even though it is given as a number, Pclass is a categorical variable, thus we encode it
X2 = pd.get_dummies(X2, columns=["Pclass"])
X2 # Display the data

___
#### 🤔 Pause and ponder
Have you noticed that we did not use `drop_first=True` in the `get_dummies()` function above? This is not a mistake, so why did we not do it?
___

In [None]:
# Instantiate the tree model with a cost-complexity parameter for pruning of 0.01
tree2 = DecisionTreeClassifier(ccp_alpha=0.01)
# Fit the model on our data
tree2.fit(X2, y)

In [None]:
# Set up the canvas
fig, ax = plt.subplots(figsize=(12, 8))
# Visualize the results
plot_tree(tree2, ax=ax, class_names=["Dead", "Survived"], label="root", filled=True,
          feature_names=["Sex", "Age", "Siblings & Spouses", "Parents & Children", 
                         "1st Class", "2nd Class", "3rd Class"])

___
#### ➡️ ✏️<font color=green>**Question 2**</font>

Compare this new plot with the one above. What do you notice?


___
#### ➡️ ✏️<font color=green>**Question 3**</font>

Think about the following:

1. What is the purpose of a classification (and regression) tree? What defines a *good* tree?
2. What would be a *good* algorithm to build such a tree? Discuss the intuition behind the issues that must be taken into account when growing such a tree. Do you have an idea of such an algorithm?

___
## Overfitting example with regression trees
This example is inspired by https://rstudio-pubs-static.s3.amazonaws.com/312930_f38ab2a78ed144a9b7431f1ffcd18539.html.

Have you noticed that each time we instantiated a tree, we used `ccp_alpha=0.01`? What does this mean exactly? As it turns out, this is a tuning parameter that helps us prevent the tree from overfitting. In this example, we will look at what happens when this parameter is equal to zero (the default when we don't specify it on instantiation).
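
As a rough guide to what `ccp_alpha` controls: minimal cost-complexity pruning, as described in the scikit-learn documentation, scores a tree $T$ by the total impurity of its leaves plus a penalty on its size,

$$R_\alpha(T) = R(T) + \alpha \, |\widetilde{T}|$$

where $R(T)$ is the total (sample-weighted) impurity of the leaves, $|\widetilde{T}|$ is the number of leaves and $\alpha \geq 0$ is the value passed as `ccp_alpha`. With $\alpha = 0$ the penalty vanishes and the fully grown tree is kept; larger values of $\alpha$ favor smaller trees.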

We begin by generating two random normal variables, $x_1$ and $x_2$. Our target outcome $y$ takes values $0$ or $1$ with the probabilities:

\begin{align}
P(y = 1) = \begin{cases} 0.9 &\text{ if } x_1 < 2 \\ 0.9 &\text{ if } x_1 \geq 2 \text{ and } x_2 < -0.5 \\ 0.1 &\text{ otherwise.}\end{cases}
\end{align}

___
#### ➡️ ✏️<font color=green>**Question 5**</font>

1. Draw the constellation for $x_1$, $x_2$ and $y$ in an $x_1$-$x_2$-space (plane).
2. Draw a (human-made) tree for predicting $y$. Is it a complex tree? 

In [None]:
# Generate data that follow the data-generating process outlined in the equation above.
# No need to understand the details. Just trust that this does what it should.
# Set random seed, number of observations
np.random.seed(72)
N = 1272
# Generate variables
x1 = np.random.randn(N)
x2 = np.random.randn(N)
# 🙀 🤯 Generate y
y_overfit = (np.random.rand(N) < .1 + .8 * ((x1 < 2) | ((x1 >= 2) & (x2 < -.5)))).astype(int)
# For good measure show the counts in y
np.unique(y_overfit, return_counts=True)

In [None]:
# Add some "disturbances" to confuse our tree,
# note that x3 is correlated with x1, but the other variables are not
x3 = np.random.randn(N) + x1
x4 = np.random.randn(N) + 2
x5 = np.random.randn(N) + .4

In [None]:
# Create feature matrix X 
X_overfit = np.array([x1, x2, x3, x4, x5]).transpose()

In [None]:
# Instantiate and fit the tree WITHOUT the cost-complexity pruning parameter
tree_overfit = DecisionTreeClassifier()
tree_overfit.fit(X_overfit, y_overfit)

In [None]:
# Set up the canvas
fig, ax = plt.subplots(figsize=(12, 8))
# Visualize the results
plot_tree(tree_overfit, ax=ax, class_names=["0", "1"], label="root", filled=True,
          feature_names=["x1", "x2", "x3", "x4", "x5"])

**That's quite the tree!** Also, this plot took quite a while to generate. Without going into too many details, this makes it clear that this cost-complexity parameter is very important for our tree to not overfit. A good idea is to proceed with cross-validation to find the optimal cost-complexity parameter!
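
To put numbers on the difference, here is a minimal sketch that compares the size of an unpruned and a pruned tree. It assumes freshly simulated data that follow the same rule as above, but with a seed, a sample size and variable names (`features`, `labels`, `unpruned`, `pruned`) chosen only for this example, so the exact figures will differ from the tree plotted above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption: seed and sample size picked for this sketch only)
rng = np.random.default_rng(0)
n = 1000
features = rng.standard_normal((n, 2))
# y = 1 with probability 0.9 if x1 < 2 or x2 < -0.5, and 0.1 otherwise
high = (features[:, 0] < 2) | (features[:, 1] < -0.5)
labels = (rng.random(n) < 0.1 + 0.8 * high).astype(int)

# Same data, once without and once with cost-complexity pruning
unpruned = DecisionTreeClassifier(random_state=0).fit(features, labels)
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(features, labels)

for name, model in [("unpruned", unpruned), ("pruned", pruned)]:
    print(name,
          "| leaves:", model.get_n_leaves(),
          "| depth:", model.get_depth(),
          "| training accuracy:", model.score(features, labels))
```

The unpruned tree buys its higher training accuracy with many more leaves, which is the typical signature of a model that memorizes noise.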

Let's run cross-validation to see how the out-of-sample accuracy behaves depending on the cost-complexity pruning value:

In [None]:
# Import cross-validation function from sklearn
from sklearn.model_selection import cross_val_score
# Instantiate lists to keep track of accuracy over CV rounds
accuracy_mean = []
accuracy_se = []
# CV accuracy over different values of cost-complexity
ccp_values = [0, 0.0001, 0.00025, 0.0005, 0.00075, 0.001, 0.0025, 0.005, 0.0075, 
              0.01, 0.025, 0.05, 0.075, 0.1, .25, .5, .75, 1]
nfolds = 10 # 10-fold CV
for ccp in ccp_values:
    # Create tree
    tree_overfit = DecisionTreeClassifier(ccp_alpha=ccp)
    acc = cross_val_score(tree_overfit, X_overfit, y_overfit, cv=nfolds, scoring="accuracy")
    accuracy_mean.append(100 * np.mean(acc))
    accuracy_se.append(100 * np.std(acc) / np.sqrt(nfolds))

In [None]:
# Plot the results
fig, ax = plt.subplots(figsize=(12, 8))
ax.errorbar(ccp_values, accuracy_mean, yerr=accuracy_se)
# Show a red dot for the best result
best = np.argmax(accuracy_mean)
ax.scatter(ccp_values[best], accuracy_mean[best], color="red", s=100, label="Best")
# Set the x-axis scale to be logarithmic (the point at alpha = 0 cannot be shown on a log axis)
ax.set_xscale("log")
# Add grid, labels, legend
ax.grid(True)
ax.set_xlabel("Cost-complexity pruning parameter (alpha)")
ax.set_ylabel("Mean out-of-sample accuracy")
ax.legend()
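
The grid of `ccp_values` above was chosen by hand. As an alternative sketch, scikit-learn can compute the values of alpha at which the fully grown tree actually changes, via `cost_complexity_pruning_path`. The data below are simulated again with an arbitrary seed, and the names `features`, `labels` and `candidate_alphas` are assumptions made for this example only.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption: seed and sample size picked for this sketch only)
rng = np.random.default_rng(1)
n = 500
features = rng.standard_normal((n, 2))
high = (features[:, 0] < 2) | (features[:, 1] < -0.5)
labels = (rng.random(n) < 0.1 + 0.8 * high).astype(int)

# Effective alphas of the pruning path, together with the total leaf
# impurity of the subtree that belongs to each alpha
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(features, labels)
candidate_alphas = path.ccp_alphas

print("number of candidate alphas:", len(candidate_alphas))
print("smallest:", candidate_alphas.min(), "| largest:", candidate_alphas.max())
```

These candidates could then be passed to the same cross-validation loop in place of a hand-written grid.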

Now let's re-estimate and draw the tree for the ccp value with the best cross-validated accuracy.

In [None]:
tree_best = DecisionTreeClassifier(ccp_alpha=ccp_values[best])

tree_best.fit(X_overfit, y_overfit)


# Set up the canvas
fig, ax = plt.subplots(figsize=(12, 8))
# Visualize the results
plot_tree(tree_best, ax=ax, class_names=["0", "1"], label="root", filled=True,
          feature_names=["x1", "x2", "x3", "x4", "x5"])

___
#### ➡️ ✏️<font color=green>**Question 6**</font>

Is this satisfactory? Did cross-validation do a good job?

___
## Gini impurity and cross-entropy

Let's create a quick and dirty plot of the shape of the **cross-entropy** function and of the **Gini impurity** function.

In [None]:
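# Set up the canvas
fig, ax = plt.subplots(figsize=(12, 8))
# Grid of probabilities; start slightly above 0 because log(0) is undefined
p = np.linspace(0.001, 1, 1000)
# Plot cross-entropy (contribution of a single class with probability p)
ax.plot(p, -p * np.log(p), label="Cross-entropy")
# Plot Gini impurity (contribution of a single class with probability p)
ax.plot(p, p * (1 - p), label="Gini impurity")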
# Add grid, legend, labels, ticks
ax.legend()
ax.grid(True)
ax.set_xlabel("$p$")
ax.set_xticks(np.arange(11) / 10)
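
The curves above show the contribution of a single class. As a minimal sketch of how these contributions add up to the impurity of a node, the helper functions below (hypothetical names `gini_impurity` and `cross_entropy`, written for this example) sum them over all classes, given the class counts in the node.

```python
import math


def gini_impurity(class_counts):
    """Sum of p_k * (1 - p_k) over the classes in a node."""
    total = sum(class_counts)
    return sum((c / total) * (1 - c / total) for c in class_counts)


def cross_entropy(class_counts):
    """Sum of p_k * log(1 / p_k) over the classes in a node (natural log)."""
    total = sum(class_counts)
    # Empty classes are skipped: p * log(1 / p) goes to 0 as p goes to 0
    return sum((c / total) * math.log(total / c) for c in class_counts if c > 0)


# From a perfectly mixed node to a pure node
for counts in [(5, 5), (9, 1), (10, 0)]:
    print(counts, round(gini_impurity(counts), 4), round(cross_entropy(counts), 4))
```

Both measures are largest for the evenly mixed node and drop to zero for the pure node, which is why a split that produces purer children lowers the impurity.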