* Read unit 3 (even just the draft) before starting on this unit!
* The goal of this unit is the same: to predict who won medals in the Olympics
* The sport we're focused on is Rhythmic Gymnastics
* The medal column means they won gold, silver, or bronze
* Previously we saw how a single decision tree could fit the training data very well, but it was very overfit
* The point of /this/ module is to try to avoid the overfitting that we saw with decision trees
* We will use random forests, which have already been explained in the content



# Exercise: Random forests and model architecture


In the previous exercise, we used decision trees to predict whether a Rhythmic Gymnastics athlete would win a medal in the Olympics (we did not differentiate between medals).

Recall that decision trees could fit the training data very well, but they have a tendency to *overfit*, meaning that the results would degrade considerably when using the *test* set or any *unseen data*.
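
As a quick reminder of what that gap looks like, here is a minimal sketch. It uses synthetic, noisy data from `make_classification` (an assumption for illustration, not the `olympics` dataset) and measures plain accuracy rather than sensitivity:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (an assumption for illustration only)
X, y = make_classification(n_samples=500, n_features=4, n_informative=2,
                           n_redundant=0, flip_y=0.2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

# A tree with no constraints keeps splitting until it memorizes the train set
tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

train_accuracy = tree.score(X_train, y_train)
test_accuracy = tree.score(X_test, y_test)
print("Train accuracy:", train_accuracy)
print("Test accuracy:", test_accuracy)
```

The tree scores perfectly on the data it was trained on, but noticeably worse on the data it has not seen.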

This time we will use *random forests* to address that tendency to overfit.

We will also look at how the *model's architecture* can influence its performance.

## Data visualization and preparation

As usual, let's take another quick look at the `olympics` dataset, then split it into *train* and *test* sets:


In [None]:
import pandas
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/graphing.py
from sklearn.model_selection import train_test_split

# Import the data from the .csv file
dataset = pandas.read_csv('olympics.csv', delimiter="\t")

# Remove the male column, which we know is entirely 0s 
# (There is no male rhythmic gymnastics in the Olympics)
del dataset["Male"]

#Let's have a look at the data and the relationship we are going to model
print(dataset.head())

# Split the dataset in a 75/25 train/test ratio.
train, test = train_test_split(dataset, test_size=0.25, random_state=1, shuffle=True)

Hopefully this looks familiar to you! If not, jump back and go through the previous exercise on decision trees.

## Decision tree

Let's quickly retrain the previous decision tree to remind ourselves of its performance:

In [None]:
import numpy as np
import sklearn.tree
from sklearn.metrics import recall_score as sensitivity_score

# Make a utility method that we can reuse throughout this exercise
# to easily fit and test our model
def fit_and_test_model(model):
    '''
    Trains a model and tests it against both train and test sets
    '''  

    # Define the features that we will use for all models
    features = ["Age", "Height", "Weight", "Year"]

    # Train the model
    model.fit(train[features], train.Medal)

    # Assess its performance
    # -- Train
    predictions = model.predict(train[features])
    train_sensitivity = sensitivity_score(train.Medal, predictions)

    # -- Test
    predictions = model.predict(test[features])
    test_sensitivity = sensitivity_score(test.Medal, predictions)

    return train_sensitivity, test_sensitivity

# Fit a tree using max_depth=2, as it yielded the best results in the previous exercise
model = sklearn.tree.DecisionTreeClassifier(random_state=1, max_depth=2)

# train a decision tree model
dt_train_sensitivity, dt_test_sensitivity = fit_and_test_model(model)
print("Model trained!")

print("Train Sensitivity:", dt_train_sensitivity)
print("Test Sensitivity:", dt_test_sensitivity)
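
The cell above imports `recall_score` under the name `sensitivity_score`. Sensitivity is the fraction of actual medal winners that the model correctly identifies: true positives divided by the sum of true positives and false negatives. A small worked sketch, using made-up labels (an assumption, not values from the dataset):

```python
from sklearn.metrics import recall_score

# Hypothetical labels: 1 = won a medal, 0 = did not
actual    = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0]

# Count medal winners the model found, and medal winners it missed
true_positives = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
false_negatives = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)

manual_sensitivity = true_positives / (true_positives + false_negatives)
library_sensitivity = recall_score(actual, predicted)
print(manual_sensitivity, library_sensitivity)
```

Three of the four medal winners are found, so both calculations give 0.75. The false positive in the fifth position does not affect sensitivity.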

## Random Forest

A random forest is a collection of decision trees that work together to calculate the label for a sample.

Trees in a random forest are trained independently, on different random samples of the data, and thus develop different biases. When combined, they are less likely to overfit the data.
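
To make the idea concrete, here is a rough hand-built sketch of a tiny forest. It runs on synthetic data (an assumption), draws each tree's rows with replacement, and combines the trees by majority vote. It is a simplification for illustration, not how `RandomForestClassifier` is implemented; scikit-learn's version averages predicted probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration only)
X, y = make_classification(n_samples=300, n_features=4, n_informative=2,
                           n_redundant=0, random_state=1)

rng = np.random.default_rng(1)
trees = []
for _ in range(5):
    # Each tree sees its own random sample of rows, drawn with replacement
    rows = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=1)
    trees.append(tree.fit(X[rows], y[rows]))

# Every tree votes on every sample; the majority label wins
votes = np.array([tree.predict(X) for tree in trees])
forest_prediction = (votes.mean(axis=0) > 0.5).astype(int)
print("Agreement with labels:", (forest_prediction == y).mean())
```

Using an odd number of trees avoids tied votes in this two-class sketch.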

Let's build a very simple forest with two trees and the *default* parameters:

In [None]:
from sklearn.ensemble import RandomForestClassifier



# Create a random forest model with two trees
random_forest = RandomForestClassifier( n_estimators=2,
                                        random_state=2, 
                                        verbose=False)

# Train and test the model
train_sensitivity, test_sensitivity = fit_and_test_model(random_forest)

print("Train Sensitivity:", train_sensitivity)
print("Test Sensitivity:", test_sensitivity)

Our two-tree forest has done more poorly than the single tree on the test set, though it has done a better job on the train set.

To some extent this should be expected. Random forests usually work with many more trees. Its two trees are also grown without the `max_depth=2` limit we gave the single tree, so the forest could overfit the training data far more than the original decision tree.

## Altering the number of trees

Let's then build several forest models, each with a different number of trees, and see how they perform:

In [None]:
import graphing

# n_estimators states how many trees to put in the model
# We will make one model for every entry in this list
# and see how well each model performs 
n_estimators = [2, 4, 6, 8, 10, 12, 14,
                16, 18, 20, 40, 60, 80, 
                100, 150, 200, 250, 300,
                500]

# Train our models and report their performance
train_sensitivities = []
test_sensitivities = []

for n_estimator in n_estimators:
    print("Preparing a model with", n_estimator, "trees")

    # Prepare the model 
    rf = RandomForestClassifier(n_estimators=n_estimator, 
                                random_state=2, 
                                verbose=False)
    
    # Train and test the result
    train_sensitivity, test_sensitivity = fit_and_test_model(rf)

    # Save the results
    test_sensitivities.append(test_sensitivity)
    train_sensitivities.append(train_sensitivity)


# Plot results
graphing.line_2D([("Train", lambda x: train_sensitivities), ("Test", lambda x: test_sensitivities)], 
                    n_estimators,
                    label_x="Number of estimators (n_estimators)",
                    label_y="Sensitivity",
                    title="Performance X number of trees")

The metrics look great for the *training* set, but not so much for the *test* set. In fact, more trees isn't always better - 500 estimators did marginally worse than 100.

We can assume this is a result of overtraining due to model complexity.

## Altering the minimum number of samples for split parameter

Recall that decision trees have a *root node*, *internal nodes* and *leaf nodes*, and that the first two can be split into newer nodes with subsets of data.

If we let our model split and create too many nodes, it can become increasingly complex and start to overfit.

One way to limit that complexity is to tell the model that each node needs to have __at least__ a certain number of samples; otherwise, it can't split into subnodes.

In other words, we can set the model's `min_samples_split` parameter to the minimum number of samples a node must contain before it can be split.

The default value for `min_samples_split` is only `2`, so models will quickly become too complex if that parameter is left untouched.
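
A small sketch can show the effect on a single tree. It uses synthetic, noisy data (an assumption for illustration) and counts the nodes in the fitted tree through its `tree_.node_count` attribute:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (an assumption for illustration only)
X, y = make_classification(n_samples=500, n_features=4, n_informative=2,
                           n_redundant=0, flip_y=0.2, random_state=1)

node_counts = {}
for min_split in [2, 10, 50]:
    tree = DecisionTreeClassifier(min_samples_split=min_split,
                                  random_state=1).fit(X, y)
    # node_count includes the root, internal nodes, and leaves
    node_counts[min_split] = tree.tree_.node_count
    print("min_samples_split =", min_split, "-> nodes:", node_counts[min_split])
```

Raising `min_samples_split` stops small nodes from splitting any further, so the tree ends up with fewer nodes.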

We will now take the best-performing model above and try it with different `min_samples_split` values to compare the results:

In [None]:
# Create a range of values for the minimum sample splits parameter
min_samples_split = np.arange(2, 51)  # every integer from 2 to 50

# Train our models and report their performance
train_sensitivities = []
test_sensitivities = []

# Build models using different values for min_samples_split
for min_split in min_samples_split:
    print("Preparing a model with min_samples_split=", min_split)

    # Prepare the model 
    rf = RandomForestClassifier(n_estimators=8, # best result from our first experiment
                                min_samples_split=min_split, 
                                # max_features=None,
                                random_state=2, 
                                verbose=False)

    # Train and test the result
    train_sensitivity, test_sensitivity = fit_and_test_model(rf)

    # Save the results
    test_sensitivities.append(test_sensitivity)
    train_sensitivities.append(train_sensitivity)



# Plot results
graphing.line_2D([("Train", lambda x: train_sensitivities), ("Test", lambda x: test_sensitivities)],
                    min_samples_split,
                    label_x="min_samples_split",
                    label_y="Sensitivity",
                    title="Performance X minimum number of samples for split")


As you can see above, the more we restrict a model's complexity - by limiting its ability to split nodes - the more we hurt its *training* performance and - up to a point - the more we __increase__ its performance on the *test* set.

By limiting the model complexity we address *overfitting*, improving the model's ability to generalize and make accurate predictions on *unseen* data.

Notice that using `min_samples_split=5` gave us the best result for the *test* set, and that higher values did not improve that outcome.

## Altering the model depth

Maybe the results above can still be improved if we experiment with different model parameters?

The `max_depth` parameter limits the maximum depth of the trees in a forest. Its default value is `None`, which means nodes can be expanded until all leaves are *pure* (all samples in a leaf have the same label) or have fewer samples than the value set for `min_samples_split`.

Limiting `max_depth` is another way to limit a model's complexity.
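
As a rough illustration on a single tree, the sketch below compares an unlimited tree with one capped at `max_depth=3`. The data is synthetic (an assumption), and `get_depth()` and `get_n_leaves()` report the size of each fitted tree:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (an assumption for illustration only)
X, y = make_classification(n_samples=500, n_features=4, n_informative=2,
                           n_redundant=0, flip_y=0.2, random_state=1)

# max_depth=None lets the tree grow until its leaves are pure
unlimited = DecisionTreeClassifier(max_depth=None, random_state=1).fit(X, y)
# max_depth=3 stops growth after three levels of splits
limited = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X, y)

print("Depth without a limit:", unlimited.get_depth())
print("Depth with max_depth=3:", limited.get_depth())
print("Leaves:", unlimited.get_n_leaves(), "vs", limited.get_n_leaves())
```

A tree of depth 3 can have at most 8 leaves, so the cap puts a hard ceiling on how finely the tree can carve up the data.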

Let's see if we can find an adequate value by training the most recent model with different maximum depths:

In [None]:
# max_depths lists the maximum depths to which trees may grow
# We will make one model for every entry in this list
# and see how well each model performs 
max_depths = np.arange(2, 51)  # every integer from 2 to 50

# Train our models and report their performance
train_sensitivities = []
test_sensitivities = []

for max_depth in max_depths:
    print("Preparing a model with max_depth=", max_depth)

    # Prepare the model 
    rf = RandomForestClassifier(n_estimators=8, # best result from our first experiment
                                min_samples_split=6, # best result from our second experiment
                                max_depth=max_depth,
                                random_state=2, 
                                verbose=False)

    # Train and test the results
    train_sensitivity, test_sensitivity = fit_and_test_model(rf)

    # Save the results
    test_sensitivities.append(test_sensitivity)
    train_sensitivities.append(train_sensitivity)


# Plot results
graphing.line_2D([("Train", lambda x: train_sensitivities), ("Test", lambda x: test_sensitivities)], 
                    max_depths,
                    label_x="max_depths",
                    label_y="Sensitivity",
                    title="Performance X maximum tree depth")


The plot above tells us that our model actually __benefits__ from a higher value for `max_depth`, up to the limit of `11`.

The poor results shown by setting this parameter too low suggest that you can actually constrain a model too much and hurt its performance.

As usual, it is important to evaluate different values when setting model parameters and defining its architecture.

## Our final model
After careful study, we have improved our random forest model by optimizing three different model parameters: `n_estimators`, `min_samples_split` and `max_depth`.

Let's run our final model using the optimal parameters we learned and compare the results to our original single decision tree:

## Adding some constraints

Setting `max_features=None` makes trees train like the original decision tree: every split can consider all features. The final model below omits that argument, letting the algorithm decide how many features are available at each split.
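
To see what this parameter changes, the sketch below fits two small forests on synthetic four-feature data (an assumption) and inspects the `max_features_` attribute of one fitted tree. The default used when the argument is omitted depends on the scikit-learn version, so the sketch sets `max_features="sqrt"` explicitly as a stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with four features (an assumption for illustration only)
X, y = make_classification(n_samples=300, n_features=4, n_informative=2,
                           n_redundant=0, random_state=1)

all_features = RandomForestClassifier(n_estimators=8, max_features=None,
                                      random_state=2).fit(X, y)
sqrt_features = RandomForestClassifier(n_estimators=8, max_features="sqrt",
                                       random_state=2).fit(X, y)

# max_features_ on a fitted tree is the number of features considered per split
n_all = all_features.estimators_[0].max_features_
n_sqrt = sqrt_features.estimators_[0].max_features_
print("max_features=None  ->", n_all)
print("max_features='sqrt' ->", n_sqrt)
```

With four features, `None` lets each split look at all 4, while `"sqrt"` restricts each split to a random subset of 2. The restriction makes the trees differ more from each other.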

In [None]:
# # Train our models and report their performance
# train_sensitivities = []
# test_sensitivities = []

# for max_depth in max_depths:
#     print("Preparing a model with max_depth=", max_depth)

#     # Prepare the model 
#     rf = RandomForestClassifier(n_estimators=8, # best result from our first experiment
#                                 min_samples_split=6, # best result from our second experiment
#                                 max_depth=max_depth,
#                                 max_features=None,
#                                 random_state=2, 
#                                 verbose=False)
    
#     # Assess and save its performance
#     train_sensitivity, test_sensitivity = fit_and_test_model(rf)
#     train_sensitivities.append(train_sensitivity)
#     test_sensitivities.append(test_sensitivity)


# # Plot results
# graphing.line_2D([("Train", lambda x: train_sensitivities), ("Test", lambda x: test_sensitivities)], 
#                     max_depths,
#                     label_x="max_depths",
#                     label_y="Sensitivity",
#                     title="Performance X maximum tree depth")

These results are quite different from what we just saw.

In [None]:
# Train the model with the best parameters we've found so far
rf = RandomForestClassifier(n_estimators=8, # best result from our first experiment
                            min_samples_split=6, # best result from our second experiment
                            max_depth=14, # best result from our third experiment
                            random_state=2, 
                            verbose=False)

# Train the model and save its performance
train_sensitivity, test_sensitivity = fit_and_test_model(rf)
print("train", train_sensitivity)
print("test", test_sensitivity)

## Model comparison
Let's compare the final results to those of our single decision tree model:

In [None]:
# Collect the sensitivities of both models in one table
results = {"Model": ["Decision tree", "Final random forest"],
           "Train sensitivity": [dt_train_sensitivity, train_sensitivity],
           "Test sensitivity": [dt_test_sensitivity, test_sensitivity],
           }

df = pandas.DataFrame(results, columns=["Model", "Train sensitivity", "Test sensitivity"])

print(df)

As you can see, fine-tuning the model's parameters resulted in a significant improvement in the *test* set results.

The lower sensitivity score on the *train* set, compared with the untuned forests, suggests that the model is overfitting less.

## Summary

In this exercise we covered the following topics:

- Random forest models and how they differ from decision trees
- How we can change a model's architecture by setting different parameters and changing their values
- The importance of trying several combinations of parameters and evaluating these changes to improve performance

In the future, you will see that different models have architectures where you can fine-tune the parameters. Experimentation is needed to achieve the best possible results.
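
In this exercise we tuned one parameter at a time. As a possible next step, scikit-learn's `GridSearchCV` can try every combination in a grid and score each one with cross-validation. This is a different technique from the loops above, shown here only as a minimal sketch on synthetic data; the grid values are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data (an assumption for illustration only)
X, y = make_classification(n_samples=300, n_features=4, n_informative=2,
                           n_redundant=0, flip_y=0.1, random_state=1)

# Candidate values for each parameter (arbitrary choices for this sketch)
param_grid = {
    "n_estimators": [8, 20],
    "min_samples_split": [2, 6, 20],
    "max_depth": [3, 10, None],
}

# scoring="recall" ranks the combinations by sensitivity
search = GridSearchCV(RandomForestClassifier(random_state=2),
                      param_grid, scoring="recall", cv=3)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated sensitivity:", search.best_score_)
```

Because each combination is scored on held-out folds of the training data, the test set can stay untouched until the final evaluation.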

