<font color="red">**Disclaimer**: The author of this exercise has been fired because he made a fundamental machine learning mistake when creating this Jupyter Notebook.</font>

> Your task is to find the mistake the data scientist made and submit the answer on ILIAS.

# Debugging Decision Trees and Random Forest

In today's exercise we want to predict car prices from the AutoScout24 dataset using a Decision Tree and a Random Forest.

In [None]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.model_selection import learning_curve
from sklearn.model_selection import GridSearchCV

from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

from sklearn.preprocessing import StandardScaler 
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score

from tqdm.notebook import tqdm

## Load the dataset
We are going to use the AutoScout24 dataset and apply the feature engineering we have developed in the previous exercises.

In [None]:
df = pd.read_csv("cars.csv")

df['Age'] = df.Year-1984
df['Brand'] = df.Name.str.split(' ').map(lambda x: x[0])
df.drop(['Name', 'Registration', 'Year'], axis='columns', inplace=True)
df.drop_duplicates(inplace=True)
df.drop([17010, 7734, 47002, 44369, 24720, 50574, 36542, 42611,
         22513, 12773, 21501, 2424, 52910, 29735, 43004, 47125], axis='rows', inplace=True)
df.drop(df.index[df.EngineSize > 7500], axis='rows', inplace=True)

df.head()

### Normalize data, handle categorical data

In [None]:
df["Brand"] = df["Brand"].astype("category").cat.codes
df["Color"] = df["Color"].astype("category").cat.codes
df.head()
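
As a small illustration of what `.astype("category").cat.codes` does, the following sketch encodes a toy column. The brand names are made up for the example and are not taken from `cars.csv`.

```python
import pandas as pd

# Toy data; the brand names are illustrative, not taken from cars.csv.
brands = pd.Series(["BMW", "Audi", "BMW", "VW"])

# The distinct values become categories (sorted alphabetically),
# and each row is replaced by the integer position of its category.
as_category = brands.astype("category")
categories = as_category.cat.categories.tolist()
codes = as_category.cat.codes.tolist()

print(categories)  # ['Audi', 'BMW', 'VW']
print(codes)       # [1, 0, 1, 2]
```

Note that the resulting integers carry an ordering that has no real meaning for brands or colors. Tree-based models can usually cope with this by splitting repeatedly, but the encoding would be a questionable choice for a linear model.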

### Split the data into train and test set

In [None]:
X = df.drop(columns=["Price"])
y = df["Price"]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
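
To see what `test_size=0.4` means in practice, here is a minimal sketch on a toy array with 10 rows (the toy data and variable names are assumptions for this example only).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each, targets 0..9.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# 40 % of the rows go to the test set, the remaining 60 % to the training set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.4, random_state=42)

print(X_tr.shape, X_te.shape)  # (6, 2) (4, 2)

# Every sample ends up in exactly one of the two sets.
no_overlap = set(y_tr).isdisjoint(set(y_te))
print(no_overlap)  # True
```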

## Decision Trees
We start by checking how well decision trees perform on our dataset.
> Grow a decision tree for our training data.  Use a [DecisionTreeRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html) from Scikit-Learn.


In [None]:
# dt = ...

*Click on the dots to display the solution*

In [None]:
dt = DecisionTreeRegressor(max_depth=30, min_samples_split=2, random_state=42)
dt.fit(X_train, y_train)

> Now calculate the $R^2$ score on the training set.
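
As a reminder, the $R^2$ score compares the squared error of the model with the squared error of a baseline that always predicts the mean: $R^2 = 1 - SS_{res} / SS_{tot}$. The helper `r2_by_hand` below is a hypothetical name used only for this sketch; in the exercise itself we use `r2_score` from Scikit-Learn.

```python
import numpy as np

def r2_by_hand(y_true, y_pred):
    """Compute R^2 = 1 - SS_res / SS_tot for two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of the mean baseline
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r2_by_hand(y_true, y_true)        # predictions equal the targets
baseline = r2_by_hand(y_true, [2.5] * 4)    # always predict the mean
close = r2_by_hand(y_true, [1.1, 1.9, 3.2, 3.8])

print(perfect)   # 1.0
print(baseline)  # 0.0
print(close)     # roughly 0.98
```

A score of 1 means a perfect fit, 0 means the model is no better than predicting the mean, and negative values are possible for models that are worse than that.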

In [None]:
# r2_train = ...
# print("r2 train:", r2_train)

*Click on the dots to display the solution*

In [None]:
y_pred = dt.predict(X_train)
r2_train = r2_score(y_train, y_pred)
print("r2 train:", r2_train)

Wow, such a high score. Almost perfect... Now calculate the cross-validation score using 5-fold cross-validation.

> Perform a 5-fold cross validation by using  the [cross_val_score](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function from Scikit-Learn.

In [None]:
# r2_cv = ...
# print("r2 cv:", r2_cv)

*Click on the dots to display the solution*

In [None]:
r2_cv = np.mean(cross_val_score(dt, X_train, y_train, cv=5, scoring="r2"))
print("r2 cv:", r2_cv)

Our CV score looks rather disappointing. What is this behaviour called?
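
For readers who want to see what 5-fold cross-validation does internally, the following sketch reproduces `cross_val_score` with an explicit loop. It uses synthetic data (an assumption made for this example), not the car dataset.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, used only for this illustration.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(0, 1, size=200)

# Manual 5-fold CV: train on four folds, score on the held-out fold.
manual_scores = []
for train_idx, val_idx in KFold(n_splits=5).split(X_demo):
    model = DecisionTreeRegressor(random_state=42)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    y_val_pred = model.predict(X_demo[val_idx])
    manual_scores.append(r2_score(y_demo[val_idx], y_val_pred))

# The library call performs the same steps for a regressor with cv=5.
library_scores = cross_val_score(
    DecisionTreeRegressor(random_state=42), X_demo, y_demo, cv=5, scoring="r2")

print(np.allclose(manual_scores, library_scores))
```

Each sample is used for validation exactly once, so the mean of the five scores estimates how the model behaves on data it has not seen during fitting.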

### Plot the learning curve

To further investigate how our model performs, we plot the so-called learning curve. We use the [learning_curve](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.learning_curve.html) function from Scikit-Learn.

The learning curve shows the validation and training scores of our model for varying numbers of training samples. It is a useful tool for finding out how much we benefit from adding more training data and whether our model suffers from a variance error or a bias error. If both the validation score and the training score converge to a value that is too low as the training set grows, we will not benefit much from more training data.
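
Before plotting, it can help to look at the raw output of `learning_curve`. The sketch below runs it on synthetic data (an assumption for this example) with an unrestricted tree and prints the gap between training and validation score for each training set size.

```python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeRegressor

# Synthetic, noisy regression data, used only for this illustration.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.3, size=200)

# One row per training set size, one column per CV fold.
sizes, train_scores, cv_scores = learning_curve(
    DecisionTreeRegressor(random_state=42), X_demo, y_demo,
    cv=5, train_sizes=np.linspace(0.2, 1.0, 5), scoring="r2")

# Gap between mean training score and mean validation score.
gap = train_scores.mean(axis=1) - cv_scores.mean(axis=1)
for n, g in zip(sizes, gap):
    print(f"{n:4d} samples: gap = {g:.3f}")
```

A gap that stays large as the training set grows points towards a variance error, whereas two curves that lie close together at a low score point towards a bias error.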

In [None]:
def plot_learning_curve(estimator, X, y, title):
    plt.figure(figsize=(15,8))
    plt.title(title)
    plt.xlabel("Number of training examples")
    plt.ylabel("Score")
    
    train_sizes = np.linspace(.1, 1.0, 10)
    train_sizes, train_scores, cv_scores = learning_curve(
        estimator, X, y, cv=5, train_sizes=train_sizes, scoring="r2")
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    cv_scores_mean = np.mean(cv_scores, axis=1)
    cv_scores_std = np.std(cv_scores, axis=1)
    plt.grid()
    plt.ylim(0.75, 1.01)

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, cv_scores_mean - cv_scores_std,
                     cv_scores_mean + cv_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, cv_scores_mean, 'o-', color="g",
             label="Cross validation score")

    plt.legend(loc="best")
    
    print("cross validation r2-score:", cv_scores_mean[-1])
    print("training r2-score", train_scores_mean[-1])
    
plot_learning_curve(dt, X_train, y_train, "Decision Tree")

Our $R^2$ score on the training set is pretty high. We play around with the parameters `max_depth` and `min_samples_split` to close the gap between the training and CV-score.

In [None]:
parameters = {
    "max_depth": range(10, 21),
    "min_samples_split": [2, 3, 4, 5, 6, 7], 
    "random_state": [42]
}

gridSearch = GridSearchCV(dt, parameters, cv=5, scoring="r2")
gridSearch.fit(X_test, y_test)

print("Best score", gridSearch.best_score_)
print("Best params", gridSearch.best_params_)
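
To get a feeling for the cost of this search, we can count the candidates the grid contains. The sketch below uses `ParameterGrid` from Scikit-Learn on a copy of the grid above; the variable names `n_candidates` and `n_fits` are only used for this illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as above; the range is converted to a list for clarity.
parameters = {
    "max_depth": list(range(10, 21)),          # 11 values
    "min_samples_split": [2, 3, 4, 5, 6, 7],   # 6 values
    "random_state": [42],                      # 1 value
}

# Every combination of values is one candidate model.
n_candidates = len(ParameterGrid(parameters))

# With cv=5, each candidate is fitted once per fold.
n_fits = n_candidates * 5

print(n_candidates)  # 66
print(n_fits)        # 330
```

With the default `refit=True`, `GridSearchCV` additionally fits the best candidate once more on all the data it was given, which is the model exposed as `best_estimator_`.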

We can access the best decision tree by means of the attribute `best_estimator_`.

In [None]:
dt = gridSearch.best_estimator_

Let us plot the learning curve again.

In [None]:
plot_learning_curve(dt, X_train, y_train, "Decision Tree")

### Evaluate the performance on the test set

Now that we have chosen the optimal hyperparameters for our tree, let's check how well our model performs on the test set.

In [None]:
y_pred = dt.predict(X_test)
r2 = r2_score(y_test, y_pred)
print("r2:", r2)

### Examine our tree
Let us check what our grown decision tree looks like.

In [None]:
print("Number of nodes:", dt.tree_.node_count)
print("Depth:", dt.tree_.max_depth)

#### Plot our tree (Optional)
Scikit-Learn 0.21 introduced the function `plot_tree`, which allows us to easily visualize our Decision Tree. To run the following cell, make sure that your Scikit-Learn version is 0.21 or newer.

In [None]:
from sklearn.tree import plot_tree

plt.figure(figsize=(20,7))
plot_tree(dt, max_depth=2, filled=True)
plt.show()

In our plot we can see that the first splits were made based on the features with indices 3 and 8. What kind of features are these?

In [None]:
print("Feature 3:", X_train.columns[3])
print("Feature 8:", X_train.columns[8])

It seems like the features *Horsepower* and *Age* are somehow important. Our estimator stores the importance of each feature in the attribute `feature_importances_`. Note that the values should sum up to 1.

In [None]:
dt.feature_importances_
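
The following sketch checks the claim about the sum on synthetic data (an assumption for this example), where only the first of three features actually influences the target.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the target depends on feature 0 only, plus a little noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(300, 3))
y_demo = 5.0 * X_demo[:, 0] + rng.normal(0, 0.1, size=300)

tree = DecisionTreeRegressor(max_depth=4, random_state=42)
tree.fit(X_demo, y_demo)

# The importances are normalized, so they add up to 1.
importances = tree.feature_importances_
print("Sum of importances:", importances.sum())
print("Most important feature index:", importances.argmax())
```

The importance of a feature reflects how much the splits on that feature reduce the error in total, so the informative feature receives almost all of the weight here.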

We can plot these feature importances.

In [None]:
def plot_feature_importance(estimator, X, title, ax=None):
    # Sort the features from most to least important.
    importances = estimator.feature_importances_
    indices = np.argsort(importances)[::-1]
    # Use the X that was passed in rather than the global X_train.
    feature_names = X.columns.take(indices).tolist()
    
    plt.subplots(figsize=(15, 5))
    plt.title(title)
    plt.bar(range(X.shape[1]), importances[indices], color="b", align="center")
    plt.xticks(range(X.shape[1]), feature_names)
    plt.xlim([-1, X.shape[1]])
    plt.show()
    
plot_feature_importance(dt, X_train, "Feature importances extracted from the Decision Tree model")

As we have guessed based on the plot of our tree, the features *Horsepower* and *Age* are the most important ones.

## Random Forest
Let us check if we can beat the score of our model by using a Random Forest.

> Fit a [RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) and use 10 decision trees.

In [None]:
# rf = ...

*Click on the dots to display the solution*

In [None]:
rf = RandomForestRegressor(n_estimators=10)
rf.fit(X_train, y_train)

We plot the learning curve as we did with the decision tree.

In [None]:
plot_learning_curve(rf, X_train, y_train, "Random Forest")

The gap between the training and the cross-validation score is not as large as with the decision tree.

We still want to run a grid search to tune the parameters.

In [None]:
parameters = {
    "n_estimators": [10],
    "max_depth": range(5, 21),
    "min_samples_split": [2, 3, 4, 5, 6, 7]
}

gridSearch = GridSearchCV(rf, parameters, cv=5, scoring="r2", n_jobs=-1)
gridSearch.fit(X_test, y_test)

print("Best score", gridSearch.best_score_)
print("Best params", gridSearch.best_params_)

### Evaluate our model on the test set

In [None]:
y_pred = rf.predict(X_test)
r2 = r2_score(y_test, y_pred)
print("r2:", r2)

### Plot feature importance

In [None]:
plot_feature_importance(rf, X_train, "Feature importances extracted from the Random Forest model")

When using a random forest model, the *Mileage* feature gets more weight than the *Age* feature.