# Decision trees

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Intro" data-toc-modified-id="Intro-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Intro</a></span></li><li><span><a href="#The-problem" data-toc-modified-id="The-problem-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>The problem</a></span></li><li><span><a href="#Data-exploration" data-toc-modified-id="Data-exploration-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Data exploration</a></span></li><li><span><a href="#Train-test-splitting" data-toc-modified-id="Train-test-splitting-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Train test splitting</a></span></li><li><span><a href="#Models" data-toc-modified-id="Models-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Models</a></span><ul class="toc-item"><li><span><a href="#Baseline-model" data-toc-modified-id="Baseline-model-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Baseline model</a></span></li><li><span><a href="#Simple-tree-(depth=1)" data-toc-modified-id="Simple-tree-(depth=1)-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Simple tree (depth=1)</a></span></li><li><span><a href="#Bigger-tree-(depth=3)" data-toc-modified-id="Bigger-tree-(depth=3)-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>Bigger tree (depth=3)</a></span></li><li><span><a href="#Huge-tree-(depth=20)" data-toc-modified-id="Huge-tree-(depth=20)-5.4"><span class="toc-item-num">5.4&nbsp;&nbsp;</span>Huge tree (depth=20)</a></span></li><li><span><a href="#Overfitting" data-toc-modified-id="Overfitting-5.5"><span class="toc-item-num">5.5&nbsp;&nbsp;</span>Overfitting</a></span></li><li><span><a href="#Other-hyperparameters" data-toc-modified-id="Other-hyperparameters-5.6"><span class="toc-item-num">5.6&nbsp;&nbsp;</span>Other hyperparameters</a></span></li><li><span><a href="#Grid-search" data-toc-modified-id="Grid-search-5.7"><span class="toc-item-num">5.7&nbsp;&nbsp;</span>Grid search</a></span></li></ul></li><li><span><a href="#Feature-importance" 
data-toc-modified-id="Feature-importance-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Feature importance</a></span></li><li><span><a href="#Summary" data-toc-modified-id="Summary-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Summary</a></span></li></ul></div>

In [2]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

# Introduction

As an example, do you remember the Titanic dataset? Suppose we want to predict who survives using a decision tree. A decision tree tries to predict the value `Survived` (as a Yes/No) using the following approach:

<img src="https://www.researchgate.net/profile/Joop_Hox/publication/317307818/figure/fig2/AS:633029202571264@1527937331016/Decision-tree-on-Titanic-survival-data-Source-https-en_Q640.jpg" width=300>

Decision trees:
 * are used **both** for regression and classification,
 * involve stratifying (segmenting) the predictor space into simple regions,
 * act in an iterative manner,
 * have this name because splitting rules are represented in a tree.

Additionally, decision trees:
 * are simple
 * are useful for interpretation
 * alone, are not very powerful predictors but...
 * ...give rise to more complex models, like Random Forest or Gradient Boosted Trees algorithms.
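
To see what "splitting rules represented in a tree" looks like in practice, here is a minimal sketch. It assumes scikit-learn is installed and uses a tiny made-up passenger table (the column names and values are hypothetical, not the real Titanic data), so the rules it prints are only illustrative.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny made-up passenger table, for illustration only (not the real Titanic data)
passengers = pd.DataFrame({
    "is_male":  [1, 1, 1, 1, 0, 0, 0, 0],
    "age":      [40, 35, 8, 6, 30, 25, 50, 4],
    "survived": [0, 0, 1, 1, 1, 1, 1, 1],
})
features = ["is_male", "age"]

# A shallow tree: at most two yes/no questions per passenger
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(passengers[features], passengers["survived"])

# Print the learned splitting rules as indented text
rules = export_text(tree, feature_names=features)
print(rules)
```

Each indented line is one question on a single feature; following the answers from the top leads to a leaf, which holds the prediction.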

# The problem

Today we will be using a **white wine** dataset. Experts have rated several wines, whose physical properties are also given.

In [4]:
df = pd.read_csv("../data/wine_quality.csv")

## What do we want to do?
 * predict the quality of the wine based on the other properties.
 
To do that, we are going to:
 * build a **supervised** learning model...
 * ...which is a **regression** model (it predicts a quantitative target) and...
 * ...that tries to predict wine `quality` from its physical properties

We will do train-test splitting for a correct assessment of model performance.

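We will use the mean squared error (MSE) metric, where $\hat{y}_i$ is the predicted quality and $y_i$ the true quality of wine $i$: $$MSE=\frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i)^2$$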
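
The formula can be written out directly with NumPy. This is a minimal sketch; the helper name `mse` and the quality scores below are made up for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_pred - y_true) ** 2)

# Hypothetical quality scores and predictions, for illustration only
y_true = np.array([5.0, 6.0, 7.0, 5.0])
y_pred = np.array([5.5, 6.0, 6.0, 5.5])

# Residuals are 0.5, 0, -1, 0.5, so the squares sum to 1.5 over 4 samples
error = mse(y_true, y_pred)
print(error)  # 0.375
```

In practice, `sklearn.metrics.mean_squared_error(y_true, y_pred)` computes the same quantity.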

To study which model is the best, we will:
 * try several models and...
 * ...keep the one with the **lowest** MSE on the **test set** (i.e., the lowest test error)
 * ...and show what happens to the training error along the way.

## Data exploration

## Train-test splitting

Supervised machine learning is about creating models that accurately map the given **inputs** to the given **outputs**.

What’s most important to understand is that you usually need an unbiased evaluation to assess the predictive performance of your model and to validate it.

Because you need an unbiased evaluation, you can’t evaluate the predictive performance of a model with the same data you used for training.
It's like studying only a few pages of a book and then being tested on just those pages, when the test should cover the whole book.

You need to evaluate the model with fresh data that hasn’t been seen by the model before. You can accomplish that by splitting your dataset before you use it.
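
A minimal sketch of the split with scikit-learn's `train_test_split`. The DataFrame below is a synthetic stand-in for the wine data (the column names `alcohol` and `acidity`, the 80/20 ratio and the seed are assumptions made for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wine data (column names are made up for illustration)
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    "alcohol": rng.normal(10.5, 1.2, size=200),
    "acidity": rng.normal(6.8, 0.8, size=200),
})
df_demo["quality"] = (
    0.4 * df_demo["alcohol"] - 0.3 * df_demo["acidity"] + rng.normal(0, 0.5, size=200)
)

# Separate the inputs (features) from the output (target)
X = df_demo.drop(columns="quality")
y = df_demo["quality"]

# Hold out 20% of the rows; the model never sees them during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the split reproducible, so different models are compared on exactly the same held-out rows.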

## Models

We need to create several models to see which one is the best. And how do we know which one is "the best"? We compare every model against a *baseline* case and against each other.

We will change the maximum depth of the tree (`max_depth`) to explore different models.

### Baseline model

The baseline model predicts **just** the mean of the training target for every sample. It answers the question: "Can we do better than the average value?"

Train error

Test error
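
The baseline and its two errors can be sketched as follows. This assumes synthetic data from `make_regression` in place of the wine dataset, so the numbers printed are not the notebook's results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the wine dataset (assumption)
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The baseline ignores the features and always predicts the training mean
baseline_value = y_train.mean()
train_pred = np.full_like(y_train, baseline_value)
test_pred = np.full_like(y_test, baseline_value)

baseline_train_mse = mean_squared_error(y_train, train_pred)
baseline_test_mse = mean_squared_error(y_test, test_pred)
print(f"baseline train MSE: {baseline_train_mse:.1f}")
print(f"baseline test MSE:  {baseline_test_mse:.1f}")
```

The training MSE of this baseline is simply the variance of the training target. Any model worth keeping should beat the baseline test MSE.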

### Simple tree (depth=1)

### Bigger tree (depth=3)

### Huge tree (depth=20)

**Something happened?**

### Overfitting

Let's see how the training and test errors change with `max_depth`.

We can see how, when `max_depth` increases above ~N:
 * training error decreases (more precise on training samples)
 * test error increases (model is memorizing training set and not generalizing very well)
 
This is the famous overfitting! And this is why **test error** is the one you should look at!
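
A minimal sketch of this experiment, assuming noisy synthetic data from `make_regression` instead of the wine dataset. The depth at which the test error starts to rise depends on the data, so the exact numbers will differ from the notebook's.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Noisy synthetic data standing in for the wine dataset (assumption)
X, y = make_regression(n_samples=500, n_features=8, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

depths = [1, 3, 5, 10, 20]
train_errors, test_errors = [], []
for depth in depths:
    # Fit one tree per depth and record both errors
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_errors.append(mean_squared_error(y_train, tree.predict(X_train)))
    test_errors.append(mean_squared_error(y_test, tree.predict(X_test)))

for depth, tr, te in zip(depths, train_errors, test_errors):
    print(f"max_depth={depth:>2}  train MSE={tr:10.1f}  test MSE={te:10.1f}")
```

Plotting `train_errors` and `test_errors` against `depths` makes the gap between the two curves easy to see.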

### Other hyperparameters

`min_samples_split`: the minimum number of samples required to split an internal node  

`max_features`: the number of features to consider when looking for the best split.
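
As a rough illustration of how these two hyperparameters restrain a tree, the sketch below compares an unconstrained tree with a constrained one on synthetic data. The values `min_samples_split=40` and `max_features=3` are arbitrary choices for the example, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for the wine dataset (assumption)
X, y = make_regression(n_samples=500, n_features=8, noise=20.0, random_state=0)

# Default settings: nodes keep splitting until the leaves cannot be split further
unconstrained = DecisionTreeRegressor(random_state=0).fit(X, y)

# Constrained: a node needs at least 40 samples to be split, and a random
# subset of 3 features is considered when looking for each split
constrained = DecisionTreeRegressor(
    min_samples_split=40, max_features=3, random_state=0
).fit(X, y)

print("leaves (unconstrained):", unconstrained.get_n_leaves())
print("leaves (constrained):  ", constrained.get_n_leaves())
```

Fewer leaves means a coarser model, which is less able to memorize individual training samples.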


Why would you not consider all the features when growing a decision tree?

### Grid search

Let's find the **best** combination of hyperparameters, i.e. the one yielding the lowest test error.
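
One common way to do this is scikit-learn's `GridSearchCV`, sketched below on synthetic data. Note the assumptions: the grid values are arbitrary, and `GridSearchCV` ranks combinations by cross-validated error on the training set, so the test set is only used once at the end to check the winner.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for the wine dataset (assumption)
X, y = make_regression(n_samples=500, n_features=8, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Every combination of these values is tried (4 * 3 * 3 = 36 candidates)
param_grid = {
    "max_depth": [2, 4, 6, 8],
    "min_samples_split": [2, 10, 40],
    "max_features": [3, 5, None],
}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",  # scikit-learn maximizes scores, hence the sign
    cv=5,
)
search.fit(X_train, y_train)

print("best hyperparameters:", search.best_params_)
print("cross-validated MSE: ", -search.best_score_)
print("test MSE:            ", mean_squared_error(y_test, search.predict(X_test)))
```

After fitting, `search.predict` uses the best combination refitted on the whole training set.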

## Feature importance

## Summary

 * Decision trees are useful for regression (`DecisionTreeRegressor`) and classification (`DecisionTreeClassifier`).
 * Their behavior is quite intuitive, interpretable, and explainable.
 
 * Decision trees overfit when `max_depth` becomes very large (too many individual leaves at the end).
 * Guard against overfitting (always, not only in tree-based methods) by looking at the test error.
 
 * One decision tree is often not a very powerful ML algorithm
 * Decision trees are the building blocks of more advanced and superpowerful algorithms

# Tree ensembles

In [None]:
from sklearn.ensemble import RandomForestClassifier

## The problem: detecting breast cancer

In [None]:
df_cancer = pd.read_csv('../data/breast_cancer.csv')

## Random forest

Random forests, or random decision forests, are an ensemble learning method for classification, regression and other tasks. They work by constructing a multitude of decision trees at training time and outputting the class that is the mode of the individual trees' classes (classification) or the mean of their predictions (regression).

Random decision forests correct for decision trees' habit of overfitting to their training set.
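
A minimal sketch comparing a single tree with a forest. It assumes scikit-learn's bundled `load_breast_cancer` dataset as a stand-in for `breast_cancer.csv` (the two may not be identical), and `n_estimators=200` is an arbitrary choice. As far as I know, scikit-learn's forest averages the trees' predicted class probabilities rather than taking a strict majority vote, which usually gives the same answer.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Bundled dataset used as a stand-in for ../data/breast_cancer.csv (assumption)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# One fully grown tree versus an ensemble of 200 randomized trees
single_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

tree_acc = single_tree.score(X_test, y_test)
forest_acc = forest.score(X_test, y_test)
print(f"single tree test accuracy:   {tree_acc:.3f}")
print(f"random forest test accuracy: {forest_acc:.3f}")
```

Each tree in the forest is trained on a bootstrap sample and considers a random subset of features at every split, so the trees make different mistakes that tend to cancel out when combined.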

## Boosting

Gradient boosting is a family of algorithms that takes a weak learner (typically a shallow decision tree) and makes a series of tweaks to it that improve its strength: trees are added one at a time, each one fitted to correct the errors of the ensemble built so far.
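
A minimal sketch with scikit-learn's `GradientBoostingClassifier`, again assuming the bundled `load_breast_cancer` dataset as a stand-in for `breast_cancer.csv`. The hyperparameter values are arbitrary. `staged_predict` lets us watch the ensemble's accuracy after each added tree.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Bundled dataset used as a stand-in for ../data/breast_cancer.csv (assumption)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# 100 shallow trees (weak learners), each correcting the ensemble so far
booster = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=2, random_state=0
)
booster.fit(X_train, y_train)

# Test accuracy after 1, 2, ..., 100 boosting stages
staged_acc = [
    (stage_pred == y_test).mean() for stage_pred in booster.staged_predict(X_test)
]
print(f"accuracy after   1 tree:  {staged_acc[0]:.3f}")
print(f"accuracy after 100 trees: {staged_acc[-1]:.3f}")
```

Unlike a random forest, where trees are built independently, the trees here are built sequentially, so adding too many of them can eventually overfit as well.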