# Decision Trees: Scikit-Learn
M2U5 - Exercise 1

## What are we going to do?
- We will train a regression model using decision trees
- We will check whether the model suffers from bias (underfitting) or overfitting
- We will optimise the hyperparameters with validation
- We will evaluate the model on the test subset

Remember to follow the instructions for the submission of assignments indicated in [Submission Instructions](https://github.com/Tokio-School/Machine-Learning-EN/blob/main/Submission_instructions.md).

## Instructions
We are going to solve a multivariate linear regression problem similar to those in the previous exercises, but this time we will fit it with a decision tree regressor rather than a linear model.

An example that you can use as a reference for this exercise: [Decision Tree Regression](https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html)

In [None]:
# TODO: Import all the necessary modules into this cell

## Generate a synthetic dataset

Generate a synthetic dataset with a fairly large error term and few features, manually or with Scikit-learn:

In [None]:
# TODO: Generate a synthetic dataset, with few features and a significant error term
# Do not add a bias term to X

m = 1000
n = 2

X = [...]

Theta_true = [...]

error = 0.3

Y = [...]

# Check the values and dimensions of the vectors
print('Theta and its dimensions to be estimated:')
print()
print()

print('First 10 rows of X and Y:')
print()
print()

print('Dimensions of X and Y:')
print()
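As a rough illustration of the dataset described above, the following sketch builds it manually with NumPy. The seed, the value ranges, the uniform noise and the lowercase name `theta_true` are assumptions made for this example, not part of the exercise statement.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed: an assumption, for reproducibility only

m, n = 1000, 2   # number of examples and features, as in the cell above
error = 0.3      # relative size of the error term

# Features drawn uniformly from [-1, 1); no bias column is added to X
X = rng.uniform(-1.0, 1.0, size=(m, n))

# Hypothetical true coefficients: one intercept plus one weight per feature
theta_true = rng.uniform(-10.0, 10.0, size=n + 1)

# Noise-free target, then additive noise bounded by `error` times the largest |y|
y_clean = theta_true[0] + X @ theta_true[1:]
noise = rng.uniform(-1.0, 1.0, size=m) * error * np.abs(y_clean).max()
Y = y_clean + noise

print('Shapes of theta_true, X and Y:', theta_true.shape, X.shape, Y.shape)
```

`sklearn.datasets.make_regression` with its `noise` parameter would be a reasonable alternative to building the arrays by hand.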

In [None]:
# TODO: Graphically represent the dataset in 3D to ensure that the error term is sufficiently high

plt.figure(1)

plt.title([...])
plt.xlabel([...])
plt.ylabel([...])

[...]

plt.show()

## Preprocess the data

- Randomly reorder the data.
- Normalise the data.
- Divide the dataset into training and test subsets.

*Note*: We will use K-fold again for the cross-validation.

In [None]:
# TODO: Randomly reorder the data, normalise the examples, and divide them into training and test subsets
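One possible way to carry out the three preprocessing steps is sketched below with Scikit-learn. The placeholder data, the 80/20 split and the choice of `StandardScaler` are assumptions for this example; the scaler is fitted on the training subset only, so that no information from the test subset leaks into it.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Placeholder data standing in for the synthetic dataset (assumption)
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(1000, 2))
Y = X @ np.array([3.0, -2.0]) + rng.normal(0.0, 0.3, size=1000)

# shuffle=True randomly reorders the examples before splitting them
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.2, shuffle=True, random_state=42
)

# Fit the scaler on the training subset only, then apply it to both subsets
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

print('Training and test shapes:', X_train.shape, X_test.shape)
```

Decision trees split on thresholds, so they are largely insensitive to feature scaling; the normalisation here mainly keeps the workflow consistent with the previous exercises.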

## Train an initial model

We will begin exploring decision tree models for regression with an initial model.

To do this, train a [sklearn.tree.DecisionTreeRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html) model on the training subset:

In [None]:
# TODO: Train a regression tree on the training subset with a max. depth of 2
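For reference, a minimal sketch of fitting a shallow regression tree might look like this. The placeholder training data and the `random_state` value are assumptions for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Placeholder training data (assumption)
rng = np.random.default_rng(0)
X_train = rng.uniform(-1.0, 1.0, size=(800, 2))
y_train = X_train @ np.array([3.0, -2.0]) + rng.normal(0.0, 0.3, size=800)

# A tree of depth 2 has at most 2**2 = 4 leaves
tree = DecisionTreeRegressor(max_depth=2, random_state=42)
tree.fit(X_train, y_train)

print('Depth:', tree.get_depth())
print('Leaves:', tree.get_n_leaves())
print('Distinct predicted values:', np.unique(tree.predict(X_train)).size)
```

Each leaf predicts a single constant, so a tree this shallow can only output a handful of distinct values, which is why it tends to underfit a smooth target.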

Now check the suitability of the model by evaluating it on the test subset:

In [None]:
# TODO: Evaluate the model with MSE, RMSE and R^2 on the test subset

y_test_pred = [...]

mse = [...]
rmse = [...]
r2 = [...]
print('Mean squared error: {:.2f}'.format(mse))
print('Root mean squared error: {:.2f}'.format(rmse))
print('Coefficient of determination: {:.2f}'.format(r2))
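The three metrics can be computed as in the sketch below, shown on a small hand-made pair of vectors rather than on real model output (the values are illustrative assumptions). The result is stored in `r2` so that it does not shadow the `r2_score` function imported from `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative targets and predictions (assumption, not real model output)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)               # RMSE is the square root of the MSE
r2 = r2_score(y_true, y_pred)     # stored under a name that differs from the function

print('Mean squared error: {:.2f}'.format(mse))
print('Root mean squared error: {:.2f}'.format(rmse))
print('Coefficient of determination: {:.2f}'.format(r2))
```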

*QUESTION:*
*Do you think this model suffers from bias (underfitting) or overfitting?*

To find out, compare these metrics with the same metrics computed on the training subset, and answer in this cell:

In [None]:
# TODO: Now evaluate the model with MSE, RMSE and R^2 on the training subset

y_train_pred = [...]

mse = [...]
rmse = [...]
r2 = [...]
print('Mean squared error: {:.2f}'.format(mse))
print('Root mean squared error: {:.2f}'.format(rmse))
print('Coefficient of determination: {:.2f}'.format(r2))

As mentioned above, decision trees tend to overfit: they adjust too closely to the data used to train them and can then fail to predict well on new examples.

We are going to check this graphically by training another model with a much larger maximum depth of 6:

In [None]:
# TODO: Train another regression tree on the training subset with max. depth of 6

In [None]:
# TODO: Now evaluate the model with MSE, RMSE, and R^2 on the training subset

y_train_pred = [...]

mse = [...]
rmse = [...]
r2 = [...]
print('Mean squared error: {:.2f}'.format(mse))
print('Root mean squared error: {:.2f}'.format(rmse))
print('Coefficient of determination: {:.2f}'.format(r2))

Compare the metrics of this model with those of the previous one (both on the training subset).

*QUESTION:* Does the fit on the training subset get better or worse as the maximum depth of the tree increases?

We will now plot both models, to check whether they suffer from bias (underfitting) or overfitting.

To do so, you can be guided by the preceding example: [Decision Tree Regression](https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html)

In [None]:
# TODO: Graphically represent the predictions of both models

plt.figure(2)

plt.title([...])
plt.xlabel([...])
plt.ylabel([...])

# Plot the training subset for features 1 and 2 (with different shapes) in a dot plot
plt.scatter([...])
plt.scatter([...])
# Plot the test subset for features 1 and 2 (with different shapes) in a dot plot, with a different colour from the training subset
plt.scatter([...])
plt.scatter([...])

# Plot the predictions of the two models on a line graph, with different colours, and a legend to distinguish them
# Use a linear space with a large number of elements between the max. and min. value of both X features as the horizontal axis
x_axis = [...]

plt.plot([...])
plt.plot([...])

plt.show()

As we have seen, too small a max. depth generally leads to a model with bias (underfitting), one that cannot fit the curve well enough, while too high a max. depth leads to a model with overfitting, one that fits the training data very well but predicts poorly on future examples.
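The trade-off described above can also be checked numerically. The sketch below compares training and test R^2 for several depths on placeholder data; the data, the noise level and the extra unbounded tree (`max_depth=None`) are assumptions added for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Placeholder noisy linear data (assumption)
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(1000, 2))
y = X @ np.array([3.0, -2.0]) + rng.normal(0.0, 0.5, size=1000)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Store (training R^2, test R^2) for each depth; None means no depth limit
scores = {}
for depth in (2, 6, None):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    scores[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print('max_depth={}: train R^2 = {:.3f}, test R^2 = {:.3f}'.format(
        depth, *scores[depth]))
```

A widening gap between the training and the test score as the depth grows is the usual sign of overfitting, while low scores on both subsets point to bias.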

The maximum depth is therefore a hyperparameter of the regression tree that we need to optimise using validation. There are also other hyperparameters, such as the criterion for measuring the quality of a split, the strategy for choosing that split, the min. number of examples needed to split a node, etc.

For the sake of simplicity, let's start by performing a cross-validation just to find the optimal value for the maximum depth:

In [None]:
# TODO: Train a different model for each max_depth value considered, each on a different fold

# Values of max_depth to be considered in an integer space [1, 8]
max_depths = [...]
print('Max. depths to be considered:')
print(max_depths)

# Create K-fold splits, one for each value of max_depth to be considered
kf = [...]

# Iterate over the splits, train your models and evaluate them on the generated CV subset
tree_models = []
best_model = None
for train, cv in kf.split(X):
    # Train a model on the training subset
    # Evaluate it on the CV subset using its score() method
    # Save the model with the best score in the best_model variable and display the max. depth of the best model
    max_depth = [...]
    print('Max. depth used:', max_depth)

    tree_models.append([...])

    # If the model is better than the best model so far, update the best model found
    best_model = [...]

    print('Max. depth and R^2 of the best tree so far:', best_model.max_depth, best_model.score([...]))
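A minimal sketch of the loop outlined above, pairing each `max_depth` value with one K-fold split, could look like the following. The placeholder data, the seeds and the helper names `best_depth` and `best_score` are assumptions for this example.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

# Placeholder training data (assumption)
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(800, 2))
y = X @ np.array([3.0, -2.0]) + rng.normal(0.0, 0.3, size=800)

max_depths = list(range(1, 9))  # integer values in [1, 8]

# One split per candidate depth
kf = KFold(n_splits=len(max_depths), shuffle=True, random_state=42)

best_model, best_depth, best_score = None, None, -np.inf
for max_depth, (train_idx, cv_idx) in zip(max_depths, kf.split(X)):
    model = DecisionTreeRegressor(max_depth=max_depth, random_state=42)
    model.fit(X[train_idx], y[train_idx])

    # score() returns R^2 on the held-out CV fold
    score = model.score(X[cv_idx], y[cv_idx])
    if score > best_score:
        best_model, best_depth, best_score = model, max_depth, score

    print('Max. depth used: {}, CV R^2: {:.3f}, best depth so far: {}'.format(
        max_depth, score, best_depth))
```

Because each depth is scored on a single fold, the comparison is somewhat noisy; `sklearn.model_selection.GridSearchCV` or `cross_val_score` would average every candidate over all folds instead.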

## Evaluate the model on the test subset

Finally, we are going to evaluate the model on the test subset.

To do this, calculate its MSE, RMSE, and R^2 metrics and plot the model predictions and residuals vs. the test subset:

In [None]:
# TODO: Evaluate the model with MSE, RMSE and R^2 on the test subset

y_test_pred = [...]

mse = [...]
rmse = [...]
r2 = [...]
print('Mean squared error: {:.2f}'.format(mse))
print('Root mean squared error: {:.2f}'.format(rmse))
print('Coefficient of determination: {:.2f}'.format(r2))

In [None]:
# TODO: Plot the predictions of the best tree on the test subset and its residuals

plt.figure(3)

plt.title([...])
plt.xlabel([...])
plt.ylabel([...])

# Plot the test subset on a dot plot, representing both features with different shapes
plt.scatter([...])

# Plot the model's predictions on a line graph
# Use a linear space with a large number of elements between the max. and min. value of the X_test features as the horizontal axis
x_axis = [...]

plt.plot([...])

# Calculate the residuals and plot them as a bar chart on the horizontal axis
residuals = [...]

plt.bar([...])

plt.show()
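The residuals plotted above are simply the differences between the true and the predicted targets. The short sketch below computes them on illustrative values (an assumption, not real model output) and checks that their mean square matches the MSE.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Illustrative test targets and predictions (assumption)
y_test = np.array([1.2, -0.4, 2.5, 0.9, -1.1])
y_test_pred = np.array([1.0, -0.1, 2.0, 1.1, -1.5])

# Residual = true value minus predicted value, one per example
residuals = y_test - y_test_pred

print('Residuals:', residuals)
print('Mean residual: {:.3f}'.format(residuals.mean()))
print('MSE from residuals: {:.3f}'.format(np.mean(residuals ** 2)))
print('MSE from Scikit-learn: {:.3f}'.format(mean_squared_error(y_test, y_test_pred)))
```

Residuals that scatter around zero with no visible pattern suggest that the remaining error is mostly noise rather than structure the tree has missed.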