# Supervised learning with decision trees

In supervised learning, a model is trained to predict the output value (target) based on the available features.

In this notebook, we will learn how to use the decision tree model for supervised learning tasks (regression and classification problems).

First, to build an intuition for how decision trees work, go through the visual ML tutorial:

*   http://www.r2d3.us/visual-intro-to-machine-learning-part-1/
*   http://www.r2d3.us/visual-intro-to-machine-learning-part-2/


We will use the same libraries as in the previous lab (pandas and seaborn), plus scikit-learn for training machine learning models: https://scikit-learn.org/stable/tutorial/index.html

#### Imports and functions

In [None]:
import pandas as pd
import seaborn as sns
from IPython.display import Image 
import pydotplus

from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier, \
    GradientBoostingRegressor, GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn import metrics 
# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older version.
from sklearn.datasets import load_boston, load_breast_cancer
from sklearn.model_selection import train_test_split 
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.tree import export_graphviz
from io import StringIO  # sklearn.externals.six is no longer shipped with scikit-learn

In [None]:
def plot_decision_tree(model, feature_names):
  """Render a fitted tree as a PNG image (requires graphviz and pydotplus)."""
  dot_data = StringIO()
  export_graphviz(model, out_file=dot_data,
                  filled=True, rounded=True,
                  feature_names=feature_names)
  graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
  return Image(graph.create_png())

def display_confusion_matrix(y_test, y_pred):
  """Plot the confusion matrix as a heatmap (rows: actual, columns: predicted)."""
  confusion_matrix = pd.DataFrame(metrics.confusion_matrix(y_test, y_pred))
  confusion_matrix.index.name = 'Actual'
  confusion_matrix.columns.name = 'Predicted'
  # fmt='d' prints the counts as plain integers instead of scientific notation
  sns.heatmap(confusion_matrix, annot=True, fmt='d')

# Regression model
First, we will apply tree models to a regression problem: predicting house prices in the Boston area.

## Dataset exploration
We will train the model on a built-in scikit-learn dataset of house prices in Boston.
https://scikit-learn.org/stable/datasets/toy_dataset.html#boston-house-prices-dataset

First, we load and explore the characteristics of the dataset (use the same methods as in the previous lab).

In [None]:
house_price_data = load_boston()
house_price_data_pd = pd.DataFrame(house_price_data.data, columns=house_price_data.feature_names)
house_price_data_pd.head()

Calculate the dataset statistics

In [None]:
house_price_data_pd.??

We will predict the house prices in Boston (stored in `house_price_data.target`) from these features.

Plot the distribution of price value:

In [None]:
sns.??(house_price_data.target)

Plot the feature correlation matrix

In [None]:
??

## Model training

To check whether our model generalizes to unseen data, we split the dataset into training data and test data (20% of the examples). The model is trained on the training split, and the evaluation metrics are calculated on the test split.

We use `train_test_split` from scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [None]:
X_train, X_test, y_train, y_test = train_test_split(house_price_data_pd, house_price_data.target, test_size=0.2)
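
To make the idea of a held-out test set concrete, here is a minimal pure-Python sketch of a shuffled 80/20 split. The helper `simple_train_test_split` and the `toy_*` data are made up for illustration; this is not how scikit-learn implements `train_test_split`, only the same idea in a few lines.

```python
import random

def simple_train_test_split(rows, targets, test_size=0.2, seed=0):
    """Shuffle example indices and hold out a fraction of them for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [rows[i] for i in train_idx]
    X_test = [rows[i] for i in test_idx]
    y_train = [targets[i] for i in train_idx]
    y_test = [targets[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Ten toy examples with one feature each; the target is twice the feature.
toy_rows = [[x] for x in range(10)]
toy_targets = [2 * x for x in range(10)]
toy_X_train, toy_X_test, toy_y_train, toy_y_test = simple_train_test_split(
    toy_rows, toy_targets, test_size=0.2)
print(len(toy_X_train), len(toy_X_test))  # 8 2
```

The `seed` argument makes the shuffle repeatable. In scikit-learn, passing `random_state` to `train_test_split` serves the same purpose, which is useful when you want to compare models on exactly the same split.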

We will use a decision tree regression model: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html

First, we create the model; next, we train it on the training data with the `.fit()` method.

In [None]:
model = DecisionTreeRegressor()
model = model.fit(X_train,y_train)
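
As a rough intuition for what happens during `.fit()`, a regression tree repeatedly looks for a feature threshold that splits the examples into two groups whose targets are as homogeneous as possible. The sketch below searches for one such split on a single feature using the summed squared error around each group's mean. The function `best_split` and the toy data are hypothetical, and this is a simplified illustration rather than scikit-learn's actual implementation.

```python
def best_split(values, targets):
    """Return (threshold, error) minimizing the summed squared error of two leaf means."""
    def sse(ys):
        if not ys:
            return 0.0
        mean = sum(ys) / len(ys)
        return sum((y - mean) ** 2 for y in ys)

    best_threshold, best_error = None, float("inf")
    candidates = sorted(set(values))
    # Try a threshold halfway between each pair of neighbouring feature values.
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, targets) if x <= threshold]
        right = [y for x, y in zip(values, targets) if x > threshold]
        error = sse(left) + sse(right)
        if error < best_error:
            best_threshold, best_error = threshold, error
    return best_threshold, best_error

# Made-up data: small houses are cheap, large houses are expensive.
toy_rooms = [3, 4, 5, 6, 7, 8]
toy_prices = [10.0, 11.0, 10.5, 30.0, 31.0, 29.5]
toy_threshold, toy_error = best_split(toy_rooms, toy_prices)
print(toy_threshold)  # 5.5
```

A full tree applies the same search to every feature, keeps the best split, and then repeats the procedure inside each of the two resulting groups.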

We can visualize the tree structure as a graph. What can you tell from this plot? Is it easy to interpret?

In [None]:
plot_decision_tree(model, house_price_data.feature_names)
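
If graphviz or pydotplus is not available, a text view of the tree can be a lighter alternative. The sketch below uses `sklearn.tree.export_text` on a tiny made-up dataset; the `toy_*` names and the `rooms` feature are assumptions for illustration only.

```python
from sklearn.tree import DecisionTreeRegressor, export_text

# Toy data (made up): one feature, two clearly separated price levels.
toy_X = [[3], [4], [5], [6], [7], [8]]
toy_y = [10.0, 11.0, 10.5, 30.0, 31.0, 29.5]

# max_depth=1 keeps the tree to a single split so the printed rules stay short.
toy_model = DecisionTreeRegressor(max_depth=1, random_state=0).fit(toy_X, toy_y)
toy_rules = export_text(toy_model, feature_names=["rooms"])
print(toy_rules)
```

For the model trained above, a call such as `export_text(model, feature_names=list(house_price_data.feature_names))` should work the same way, although the output of an unrestricted tree will be very long.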


First, to verify that the model has learned the training data, we calculate the mean absolute error between the actual and predicted values on the training set:

In [None]:
y_pred = model.predict(X_train)
metrics.mean_absolute_error(y_train, y_pred)
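
The mean absolute error is the average of the absolute differences between the actual and predicted values. A minimal by-hand version, with a hypothetical helper name and made-up numbers, looks like this:

```python
def mean_absolute_error_by_hand(y_true, y_pred):
    """Average of |actual - predicted| over all examples."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

toy_true = [20.0, 30.0, 25.0]
toy_pred = [22.0, 27.0, 25.0]
# Absolute errors are 2, 3 and 0, so the mean is 5 / 3 (about 1.67).
toy_mae = mean_absolute_error_by_hand(toy_true, toy_pred)
print(toy_mae)
```

The error is expressed in the same unit as the target, which makes it easy to read: an error of 2 means the prediction is off by 2 units of the target on average.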

Next, to check how it generalizes to new examples, calculate the error on the test set (use the `X_test` and `y_test` variables):

In [None]:
??

Why are these values different? 

### Tuning the model hyper-parameters

If the error on the test set is much higher than the error on the training set, the model **overfits** the training set (it just learns the training examples by heart).
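
The following sketch illustrates overfitting with a deliberately extreme model that only memorizes: it answers every query with the target of the closest training example. An unrestricted decision tree can behave similarly when it keeps splitting until each leaf holds a single training example. All names and data here are made up for illustration.

```python
import random

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def memorizer_predict(train_x, train_y, x):
    """Return the target of the closest training example (pure memorization)."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

rng = random.Random(0)
# Made-up data: target = 2 * x plus random noise.
toy_x = [rng.uniform(0, 10) for _ in range(60)]
toy_y = [2 * x + rng.gauss(0, 2) for x in toy_x]
train_x, test_x = toy_x[:40], toy_x[40:]
train_y, test_y = toy_y[:40], toy_y[40:]

train_error = mae(train_y, [memorizer_predict(train_x, train_y, x) for x in train_x])
test_error = mae(test_y, [memorizer_predict(train_x, train_y, x) for x in test_x])
print(train_error, test_error)
```

The training error is exactly zero because every training example is its own nearest neighbour, while the test error is not, because the memorized noise does not carry over to new examples.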

Now, try to restrict the model by limiting the `max_depth` parameter to 3 and `min_samples_leaf` to 10. Plot the structure and analyze the errors. What can you say about the tree structure and errors for the train and test sets?

In [None]:
model = DecisionTreeRegressor(max_depth=3, min_samples_leaf=10)
model = model.fit(X_train,y_train)

In [None]:
plot_decision_tree(model, house_price_data.feature_names)

Calculate the predictions and error on the train set.

In [None]:
??

Calculate the predictions and error on the test set.

In [None]:
??

Try some other values of `max_depth` and `min_samples_leaf`. Can you improve the results on the test set?

### Grid search and cross-validation

We could improve the results by manually selecting the hyperparameters. However, if there are many parameters to tune, this approach takes a very long time.

Now, we will select these values automatically. Grid search compares the results for all combinations of hyperparameters, and cross-validation repeatedly splits the training data into training and validation folds (to avoid tuning the model to a single validation split).

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
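
To show what these two ideas amount to, here is a small pure-Python sketch: it enumerates every combination in a parameter grid and generates k-fold train/validation index splits. The helper `k_fold_indices` and the `toy_*` names are hypothetical, and `GridSearchCV` does considerably more than this (fitting, scoring, and refitting the best model).

```python
from itertools import product

toy_param_grid = {'max_depth': [2, 3, 5, 10], 'min_samples_leaf': [1, 5, 10]}

# Every combination of the listed values is one candidate configuration.
names = sorted(toy_param_grid)
candidates = [dict(zip(names, values))
              for values in product(*(toy_param_grid[name] for name in names))]
print(len(candidates))  # 4 * 3 = 12 combinations

def k_fold_indices(n_examples, n_folds=5):
    """Yield (train_indices, validation_indices); each example is validated once."""
    indices = list(range(n_examples))
    fold_sizes = [n_examples // n_folds + (1 if i < n_examples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

toy_folds = list(k_fold_indices(10, n_folds=5))
print(toy_folds[0])  # ([2, 3, 4, 5, 6, 7, 8, 9], [0, 1])
```

With 12 candidates and 5 folds, a search like this trains 60 models, which is why the cost of grid search grows quickly as more parameters and values are added.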


In [None]:
param_grid = {'max_depth': [2, 3, 5, 10], 'min_samples_leaf': [1, 5, 10]}

In [None]:
model = DecisionTreeRegressor()
search = GridSearchCV(model, param_grid)
search.fit(X_train, y_train)

In [None]:
plot_decision_tree(search.best_estimator_, house_price_data.feature_names)

What are the best parameters and the test error for this configuration of the model? Is it better than for the manually selected values?

In [None]:
print(search.best_params_)

In [None]:
y_pred = search.predict(X_test)  # predict from the test features, not the test targets
?? # print the error

## Model ensemble 

Next, we will try to improve the results and reduce overfitting by using an ensemble of models.

### Random forest
We will train a random forest model and tune its parameters with grid search and cross-validation. This model is an ensemble of decision trees: each tree is fitted on a different random subsample of the dataset, and the predictions of the individual trees are averaged to get the final result. This method is more robust to overfitting than a single estimator.

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor
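
The subsample-and-average idea can be sketched in pure Python. Below, each "tree" is a one-split stump fitted on a bootstrap sample (drawn with replacement), and the ensemble prediction is the mean of the stump predictions. The functions `fit_stump` and `fit_bagged_stumps` and the toy data are made up for illustration. Random forests usually also restrict each split to a random subset of features, which this single-feature sketch omits.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regression tree on a single feature; return a predict function."""
    best = None
    points = sorted(set(xs))
    for low, high in zip(points, points[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # all feature values identical: predict the overall mean
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def fit_bagged_stumps(xs, ys, n_estimators=25, seed=0):
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_estimators):
        # Bootstrap sample: draw len(xs) example indices with replacement.
        sample = [rng.randrange(len(xs)) for _ in range(len(xs))]
        stumps.append(fit_stump([xs[i] for i in sample], [ys[i] for i in sample]))
    # The ensemble prediction is the average of the individual predictions.
    return lambda x: sum(stump(x) for stump in stumps) / len(stumps)

toy_xs = [1, 2, 3, 4, 5, 6, 7, 8]
toy_ys = [10.0, 12.0, 11.0, 13.0, 30.0, 29.0, 31.0, 32.0]
toy_forest = fit_bagged_stumps(toy_xs, toy_ys)
print(toy_forest(2), toy_forest(7))
```

Because each stump sees a slightly different sample, their individual mistakes differ, and averaging tends to smooth them out.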

In [None]:
param_grid = {'max_depth': [2, 3, 5, 10], 'min_samples_leaf': [1, 5, 10], 'n_estimators': [50, 100, 200]}
model = RandomForestRegressor()

Run grid search on the random forest model and parameter grid.

In [None]:
??

What are the best parameters for this model?

In [None]:
??

Calculate the error on the test set for this model.

In [None]:
??

# Classification model
Next, we will train a model on a binary classification task. We will predict whether a patient has breast cancer based on features computed from a medical image.
Dataset description: 
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)

### Load the dataset

In [None]:
breast_cancer_dataset = load_breast_cancer()
breast_cancer_dataset_pd = pd.DataFrame(breast_cancer_dataset.data, columns=breast_cancer_dataset.feature_names)
breast_cancer_dataset_pd.head()

In [None]:
breast_cancer_dataset_pd.describe() 

Use seaborn `countplot` to plot the number of examples in each class.

In [None]:
??(breast_cancer_dataset.target)

### Prepare data for training 
Create train and test splits

In [None]:
X_train, X_test, y_train, y_test = ??(breast_cancer_dataset_pd, breast_cancer_dataset.target, test_size=0.2)

### Train a decision tree classifier
Plot the structure and print the best parameters

In [None]:
param_grid = {'max_depth': [2, 3, 5, 10], 'min_samples_leaf': [1, 5, 10]}
model = DecisionTreeClassifier()
search = GridSearchCV(model, param_grid)

Fit the grid search model, print the best hyperparameters and plot the decision tree.

In [None]:
??

#### Calculate the evaluation metrics

Print the classification accuracy (percentage of correctly predicted classes)

In [None]:
y_pred = search.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))

Accuracy is useful, but it does not give the full picture of the classification results (in particular, for unbalanced classes).

We can display the confusion matrix to visualize the number of correctly (True Positive/True Negative) and incorrectly (False Positive/False Negative) classified instances for each class.

In [None]:
display_confusion_matrix(y_test, y_pred)
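
A small made-up example shows why accuracy alone can mislead on unbalanced classes. Here a "model" that always predicts the majority class reaches 95% accuracy while missing every positive case, which only the confusion-matrix counts reveal. All `toy_*` names and values are hypothetical.

```python
# Made-up labels: 5 positive examples (1) and 95 negative examples (0).
toy_actual = [1] * 5 + [0] * 95
toy_predicted = [0] * 100  # always predicts the majority class

toy_accuracy = sum(a == p for a, p in zip(toy_actual, toy_predicted)) / len(toy_actual)

toy_counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
for a, p in zip(toy_actual, toy_predicted):
    if a == 1 and p == 1:
        toy_counts["TP"] += 1
    elif a == 0 and p == 0:
        toy_counts["TN"] += 1
    elif a == 0 and p == 1:
        toy_counts["FP"] += 1
    else:
        toy_counts["FN"] += 1

print(toy_accuracy)  # 0.95
print(toy_counts)    # {'TP': 0, 'TN': 95, 'FP': 0, 'FN': 5}
```

In a medical setting, the five false negatives are exactly the cases that matter most, so it is worth checking the confusion matrix before trusting a high accuracy.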

### Train a random forest 
Use grid search, print the best hyperparameters and perform the same evaluation as for the decision tree classifier.

In [None]:
param_grid = {'max_depth': [2, 3, 5, 10], 'min_samples_leaf': [1, 5, 10],
              'n_estimators': [50, 100, 200]}
model = RandomForestClassifier()

Fit grid search

In [None]:
??

Print best params and accuracy

In [None]:
??

Display the confusion matrix

In [None]:
??

# What to remember

* In supervised learning, we train a model to learn the output (target) based on the input (features).
* In regression problems, the model predicts a continuous value (e.g. house prices). In classification, the model predicts a category (e.g. sick or healthy).
* To evaluate the model, we split the dataset into train and test sets. The model is trained on the training data, and the metrics are calculated on the test data.
* If the difference between train and test metrics is high, the model *overfits*. We can reduce overfitting by restricting the model through its hyperparameters (e.g. `max_depth` or `min_samples_leaf`).
* A decision tree is a useful model that is easy to interpret. A random forest is an ensemble of decision trees. It usually has better accuracy, but it is more difficult to interpret.