
# Introduction to random forests

In the previous lesson, we learned how to build a decision tree model and the main metrics and parameters one can use to evaluate and optimize the model. In this lesson, we will learn about cross-validation and ensemble techniques to improve our confidence in the model's performance once new data is available.

We will load the modified version of the [Portuguese student's performance dataset](https://archive.ics.uci.edu/ml/datasets/student+performance) to apply the theory learned throughout this lesson. The modifications were done in [lesson 2](https://colab.research.google.com/drive/1K1jVJ974iTxw1-bfdoSd8zDfxRHMcBkK#scrollTo=xqQcz4UJrRKo) when we prepared it for machine learning. Let's read it:


In [None]:
import pandas as pd
import numpy as np

from google.colab import drive

drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [None]:
grades = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Decision trees/grades_categorical.csv')

In [None]:
grades.head()

Unnamed: 0,absences,health,sex_fem,internet,Pstatus_together,famrel,Mjob_health,Mjob_other,Mjob_services,Mjob_teacher,Fjob_health,Fjob_other,Fjob_services,Fjob_teacher,guardian_mother,guardian_other,grades_cat
0,4,2.0,1,0,0,3.0,0,0,0,0,0,0,0,1,1,0,Sufficient
1,2,2.0,1,1,1,4.0,0,0,0,0,0,1,0,0,0,0,Sufficient
2,6,2.0,1,1,1,3.0,0,0,0,0,0,1,0,0,1,0,Sufficient
3,0,4.0,1,1,1,2.0,1,0,0,0,0,0,1,0,1,0,Good
4,0,4.0,1,0,1,3.0,0,1,0,0,0,1,0,0,0,0,Sufficient


# Cross-validation

To understand how cross-validation works, consider a toy dataframe containing 10 observations. When we use the train_test_split() function and choose the parameter test_size = 0.2, it will select 20% of the observations as part of the test set, and 80% as part of the training set. In our toy example, it would mean 2 observations in the test set and 8 in the training set. The choice of observations that belong to the training and test sets is random, and a different selection of observations for the training and test sets would impact the model's performance. In practice, this is precisely what happens.

Now, imagine that instead of performing only one random division of the observations into training and test sets, we performed many different ones in the same dataset. How? Well, we could follow the steps below:

- First, we shuffle the observations. 

- Then, we label them with numbers ranging from 1 to 10. 

- Next, we consider a test set containing only observations 1 and 2 and use the rest as a training set. Each such held-out subset is called a **fold**.

- We train a model and evaluate it. Instead of claiming the metric's value as the model's performance, we save the result in a list.

- We repeat the procedure above using observations 3 and 4 as the test set and the remaining observations as a training set.

- Next, calculate the model's performance once again, and save the value in a list. 

- Follow this procedure until all observations belong to the test set once. We will end up with a list containing different scores for the model's performance. 

- Finally, we average the scores and report this average as the estimate of the model's performance. That is how **K-Fold cross-validation** works. The letter **K** denotes the number of folds used in the cross-validation.
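The steps above can be sketched by hand with scikit-learn's KFold splitter. This is only an illustration on a made-up toy dataset of 10 observations (X_toy and y_toy are hypothetical and not part of the lesson's data); the tree settings are arbitrary.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: 10 observations, 2 features, binary target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(10, 2))
y_toy = np.array([0, 1] * 5)

# Shuffle the observations, then cut them into 5 folds of 2 observations each
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kfold.split(X_toy):
    model = DecisionTreeClassifier(max_depth=2, random_state=0)
    model.fit(X_toy[train_idx], y_toy[train_idx])
    # Save this fold's accuracy in a list instead of reporting it on its own
    fold_scores.append(model.score(X_toy[test_idx], y_toy[test_idx]))

# The cross-validated score is the average over the folds
mean_score = np.mean(fold_scores)
print(fold_scores)
print(mean_score)
```

Every observation lands in the test set exactly once, so the list ends up with one score per fold.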

Let's instantiate a decision tree and perform cross-validation using scikit-learn to see how to implement it in a real scenario. We will use the [cross_val_score](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function from scikit-learn. This function fits the model on each fold itself, so it does not require an explicit dataset split into training and test sets.

In [None]:
# Separate features from target
X = grades.drop('grades_cat',axis = 1)
y = grades['grades_cat']

Next, we instantiate a classification tree, and use cross_val_score to find the model's performance.

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
# Instantiate a classification tree
class_tree = DecisionTreeClassifier(criterion = 'gini', max_depth = 3, random_state = 34)

# Fit and evaluate the model with cross-validation
scores_list = cross_val_score(class_tree,X,y, cv = 5, n_jobs = -1)
print(scores_list)

[0.5        0.56923077 0.51538462 0.53076923 0.5503876 ]


Some important notes are:

- As we described above, each fold provides a different score. The list above contains the scores using accuracy, the default metric for a classification tree.

- We set the cross-validation parameter cv to 5, which is also its default value. It states the number of folds (that's why we have a list containing five different scores).

- n_jobs = -1 uses all available processors to perform cross-validation.

The last step is to compute the mean value for the list of scores.

In [None]:
# Compute the mean score
print(scores_list.mean())

0.533154442456768


There are other cross-validation methods, but we will focus on the K-fold cross-validation in this lesson.

# Cross-validate tool

If we need to change the metric used to compute the score, we can do so through the *scoring* parameter of the cross_val_score function. A list of metrics can be found on the [scikit-learn page](https://scikit-learn.org/stable/modules/model_evaluation.html). 

Let's practice with it and compute the recall_macro metric for the tree we built.

In [None]:
# Fit and evaluate the model with cross-validation using the recall_macro metric
scores_list = cross_val_score(class_tree,X,y, cv = 5, n_jobs = -1, scoring = 'recall_macro')
print(scores_list)
mean_macro = scores_list.mean()
print(f'The mean value for the recall_macro score was {mean_macro*100:.2f}%.')

[0.18943242 0.20281563 0.157277   0.16720017 0.16666667]
The mean value for the recall_macro score was 17.67%.


We see that the model's sensitivity, or recall, is not strong. We will look into other optimization tools later to try to improve these outcomes.

Scikit-learn also contains the cross_validate function, which accepts multiple metrics in the *scoring* parameter and returns an array of scores for each metric. Let's practice with it.

In [None]:
from sklearn.model_selection import cross_validate
# Fit and evaluate the model with cross-validation using the recall_macro and weighted f1 metrics
multiple_scores_list = cross_validate(class_tree,X,y, cv = 10, n_jobs = -1, scoring = ('recall_macro','f1_weighted'))

print(multiple_scores_list)

{'fit_time': array([0.00557804, 0.00759363, 0.00521326, 0.00805235, 0.00519872,
       0.00498486, 0.00530887, 0.00520611, 0.00536227, 0.00368977]), 'score_time': array([0.00552416, 0.00801897, 0.00554323, 0.00547934, 0.00542212,
       0.0052619 , 0.00743246, 0.00591421, 0.00550389, 0.00330544]), 'test_recall_macro': array([0.17361111, 0.16203704, 0.21759259, 0.16666667, 0.16203704,
       0.2037037 , 0.17579365, 0.16666667, 0.18181818, 0.18181818]), 'test_f1_weighted': array([0.4041942 , 0.38769231, 0.46348178, 0.40690738, 0.3956044 ,
       0.43496503, 0.4       , 0.38073038, 0.41275632, 0.42520533])}


It provides a dictionary that also includes the time needed to fit an estimator (the model) and the time it took to compute the score on the test set. 

If we want to check just the recall_macro scores, for example, we can call it explicitly:

In [None]:
# Print scores only for the recall_macro metric
scores_recall = multiple_scores_list['test_recall_macro']
print(multiple_scores_list['test_recall_macro'])

# Compute the mean for the recall_macro scores
mean_recall = scores_recall.mean()
print(f'The mean value for the recall_macro score was {mean_recall:.2f}.')

[0.17361111 0.16203704 0.21759259 0.16666667 0.16203704 0.2037037
 0.17579365 0.16666667 0.18181818 0.18181818]
The mean value for the recall_macro score was 0.18.


# Make scorer tool

Sometimes the metric we want to use to evaluate a model will not be available in the scikit-learn library. In these situations, it is possible to define your own scorer (or tweak existing ones) with the [make_scorer](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html) tool.

If you look at the list of predefined scoring names, you will find that the popular RMSE metric appears only in a negated form (neg_root_mean_squared_error in recent scikit-learn versions). Let's use the make_scorer tool to build an RMSE scorer ourselves.

In [None]:
from sklearn.metrics import make_scorer
from sklearn.metrics import mean_squared_error

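# Tweaking the mean squared error metric
# Note: the squared argument is deprecated in recent scikit-learn releases,
# where the root_mean_squared_error function replaces it
RMSE = make_scorer(mean_squared_error, squared = False, greater_is_better= False)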

Some notes about the code above:

- The make_scorer function takes the metric we want to customize as its first argument. Extra parameters, associated with the chosen metric, are passed as keyword arguments. In this case, we want to build the root mean squared error metric, so we entered the mean_squared_error metric and set the parameter squared to False.

- A large value for most metrics usually implies a strong model. For example, a large accuracy value usually (not always) implies the model is good at making predictions. However, the RMSE metric does not follow that criterion. It is an error. If the error is large, it will return a large value. That's why we set the greater_is_better parameter to False. It basically forces the RMSE values to follow the same logic as most metrics by multiplying its value by -1. That implies larger numbers (near zero) will be associated with good models, while smaller values (much smaller than 0) will be associated with weaker models.
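Since the squared argument has been deprecated in recent scikit-learn releases, a version-independent alternative is to wrap the square root in a small function of our own and hand that to make_scorer. The sketch below is one way to do it; the rmse helper, the toy data, and the DummyRegressor baseline are assumptions made for illustration only.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer, mean_squared_error

def rmse(y_true, y_pred):
    """Root mean squared error computed from the plain MSE (hypothetical helper)."""
    return np.sqrt(mean_squared_error(y_true, y_pred))

# greater_is_better=False makes the scorer return the negated RMSE
rmse_scorer = make_scorer(rmse, greater_is_better=False)

# Toy check: a baseline that always predicts the mean of the target (2.5)
X_toy = np.zeros((4, 1))
y_toy = np.array([1.0, 2.0, 3.0, 4.0])
baseline = DummyRegressor(strategy="mean").fit(X_toy, y_toy)

# A scorer is called as scorer(estimator, X, y)
score = rmse_scorer(baseline, X_toy, y_toy)
print(score)  # negative value: minus the square root of 1.25
```

The sign flip is visible here: the RMSE of the baseline is about 1.118, and the scorer reports it as about -1.118.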

Let's create a regression tree and compute the RMSE value using the cross_val_score function.

In [None]:
# Read the file with grades suitable for regression
grades_reg = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Decision trees/grades_reg.csv')

In [None]:
grades_reg.head()

Unnamed: 0,absences,health,sex_fem,internet,Pstatus_together,famrel,G3,Mjob_health,Mjob_other,Mjob_services,Mjob_teacher,Fjob_health,Fjob_other,Fjob_services,Fjob_teacher,guardian_mother,guardian_other
0,4,2.0,1,0,0,3.0,11,0,0,0,0,0,0,0,1,1,0
1,2,2.0,1,1,1,4.0,11,0,0,0,0,0,1,0,0,0,0
2,6,2.0,1,1,1,3.0,12,0,0,0,0,0,1,0,0,1,0
3,0,4.0,1,1,1,2.0,14,1,0,0,0,0,0,1,0,1,0
4,0,4.0,1,0,1,3.0,13,0,1,0,0,0,1,0,0,0,0


In [None]:
# Separate features from target
X = grades_reg.drop(['G3'], axis = 1)
y = grades_reg['G3']

In [None]:
from sklearn.tree import DecisionTreeRegressor
# Instantiate a regression tree
regression_tree_mse = DecisionTreeRegressor(criterion = 'squared_error', max_depth = 3, random_state = 34)

# Fit and evaluate the model with the RMSE and cross-validation
scores_list_RMSE = cross_val_score(regression_tree_mse, X, y, cv = 5, scoring = RMSE, n_jobs = -1)
print(scores_list_RMSE)

# Calculate the mean score
mean_score_RMSE = scores_list_RMSE.mean()

print(f'The mean score using the RMSE scorer to evaluate the regression tree was, {mean_score_RMSE:.2f}.')

The mean score using the RMSE scorer to evaluate the regression tree was, -3.25.


# Grid search and randomized search

We just learned about K-fold cross-validation and how it gives a more reliable estimate of the model's score. Previously, we also saw that different values for the hyperparameters influence the model's score. Scikit-learn contains the [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) tool, which provides a way to automate both steps. It can be computationally expensive, given that it exhaustively tries all possible combinations of the values supplied by the user. However, if computational power is not an issue, this is an excellent way to optimize the machine learning workflow.

In practice, we need to instantiate the GridSearchCV and use a dictionary containing the parameters we want to explore and the range of values for the arguments. Let's practice with it.

In [None]:
from sklearn.model_selection import GridSearchCV

# Create a grid of parameters
parameters = {'criterion': ['gini', 'entropy'],
              'class_weight': [None, 'balanced'],
              'min_samples_split': [12, 30, 48],
              'max_depth': list(range(3,7)),
              'min_samples_leaf': list(range(9,19,3))
              }

# Instantiate a model
tree = DecisionTreeClassifier(random_state = 34)

# Instantiate a grid search
gridSearch = GridSearchCV(tree, param_grid = parameters, cv = 5, scoring = 'recall_macro', n_jobs= -1)
            

We have created a GridSearch instance. Next, we can fit the data we are interested in and search for the best model.
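To get a feeling for how expensive the search will be, we can count the combinations in the grid before fitting. The sketch below rebuilds the same dictionary and uses scikit-learn's ParameterGrid helper to enumerate it; the variable names n_combinations and n_fits are ours.

```python
from sklearn.model_selection import ParameterGrid

# Same grid of parameters as in the cell above
parameters = {'criterion': ['gini', 'entropy'],
              'class_weight': [None, 'balanced'],
              'min_samples_split': [12, 30, 48],
              'max_depth': list(range(3, 7)),
              'min_samples_leaf': list(range(9, 19, 3))
              }

# 2 * 2 * 3 * 4 * 4 distinct parameter combinations
n_combinations = len(ParameterGrid(parameters))

# Each combination is fitted once per fold (cv = 5)
n_fits = n_combinations * 5

print(n_combinations)  # 192
print(n_fits)          # 960
```

With the default refit = True, GridSearchCV also trains the best combination one more time on the whole dataset after the search.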

In [None]:
# Separate features and target
X = grades.drop('grades_cat', axis = 1)
y = grades['grades_cat']

In [None]:
# Fit the data and search for the best model of the grid
gridSearch.fit(X,y)

GridSearchCV(cv=5, estimator=DecisionTreeClassifier(random_state=34), n_jobs=-1,
             param_grid={'class_weight': [None, 'balanced'],
                         'criterion': ['gini', 'entropy'],
                         'max_depth': [3, 4, 5, 6],
                         'min_samples_leaf': [9, 12, 15, 18],
                         'min_samples_split': [12, 30, 48]},
             scoring='recall_macro')

We can get the best set of parameters, score, and estimator using the best_params_, best_score_, and best_estimator_ attributes. Let's check them out.

In [None]:
# Printing best parameters, score and estimator
best_parameters = gridSearch.best_params_
best_score = gridSearch.best_score_
best_estimator = gridSearch.best_estimator_

print(f'The best set of parameters was {best_parameters}.')
print(f'The best score was a recall_macro of {best_score*100:.2f}%.')
print(f'The best estimator was {best_estimator}.')

The best set of parameters was {'class_weight': 'balanced', 'criterion': 'gini', 'max_depth': 4, 'min_samples_leaf': 15, 'min_samples_split': 48}.
The best score was a recall_macro of 30.13%.
The best estimator was DecisionTreeClassifier(class_weight='balanced', max_depth=4,
                       min_samples_leaf=15, min_samples_split=48,
                       random_state=34).


Previously in this lesson, we found the mean value for the recall_macro metric using only cross-validation, finding a score of about 18%. GridSearchCV improved it considerably to ~30%.

When computational power or time is limited and finding the very best value for the metric is not essential, [RandomizedSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) is a good alternative to GridSearchCV. It works similarly, but instead of trying every combination, it evaluates only a fixed number of randomly sampled parameter combinations (set by the n_iter parameter), each with the same cross-validation procedure.
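A minimal sketch of a randomized search is shown below. To keep it self-contained it runs on a synthetic dataset from make_classification rather than on the grades data, and the choice of n_iter = 20 is arbitrary; both are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data (hypothetical)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=4,
                                     n_classes=3, random_state=34)

# Same search space as the grid search above
param_distributions = {'criterion': ['gini', 'entropy'],
                       'class_weight': [None, 'balanced'],
                       'min_samples_split': [12, 30, 48],
                       'max_depth': list(range(3, 7)),
                       'min_samples_leaf': list(range(9, 19, 3))
                       }

# Evaluate only 20 randomly sampled combinations instead of all 192
random_search = RandomizedSearchCV(DecisionTreeClassifier(random_state=34),
                                   param_distributions=param_distributions,
                                   n_iter=20, cv=5, scoring='recall_macro',
                                   random_state=34)
random_search.fit(X_demo, y_demo)

n_tried = len(random_search.cv_results_['params'])
print(n_tried)  # 20
print(random_search.best_params_)
```

The attributes best_params_, best_score_, and best_estimator_ are available here as well, so the rest of the workflow stays the same.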

# Introduction to random forests

One common and powerful technique used to improve the performance of a model is the **random forest** algorithm. This method uses the same data to create a series of randomized decision trees by training each of them on different features and observations. As we saw in the first decision tree lesson, every tree is highly sensitive to the variable and threshold used to split a given node. The random forest technique thus generates a large number of decision trees (a forest) that are different from each other. The more diverse they are, the better. Some will underperform, and some will overfit the data, but combining their predictions averages out much of the individual trees' errors. This ensemble technique is very efficient at mitigating the biggest downside of decision trees: their tendency to overfit.

One important question is: How does the algorithm choose the random selection of features and observations for each tree?

There are two specific approaches adopted by random forests:

1 - **Bagging** (short for Bootstrap AGGregatING). In this case, random forests create a series of decision trees, each trained on a random sample of the available observations. The random choice is made *with replacement*, which means some observations can be selected more than once for a given sample. Each tree has access to the same features as the original dataset.

![bagging](https://raw.githubusercontent.com/RenatodaCostaSantos/Machine-Learning---Lessons/main/Supervised%20ML/Decision%20trees%20and%20random%20forests/images/Bagging.png)

2 - **Random subspace** (also known as Feature Bagging or Attribute Bagging). Here, each subtree will contain a subset of the original features. It also uses subsampling with replacement for each subtree. 

![random subspaces](https://raw.githubusercontent.com/RenatodaCostaSantos/Machine-Learning---Lessons/main/Supervised%20ML/Decision%20trees%20and%20random%20forests/images/random_subspaces.png)

These two approaches can be applied separately or combined. When combined, they are referred to as **random patches**.
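The three sampling schemes can be illustrated with scikit-learn's generic BaggingClassifier, which exposes the row and column sampling as separate parameters. This is a sketch on synthetic data, not how RandomForestClassifier is configured internally; the dataset and the fractions 0.5 and 0.8 are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (hypothetical): 200 observations, 10 features
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=34)

# Bagging: rows sampled with replacement, every tree sees all columns
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10,
                            bootstrap=True, max_samples=1.0, max_features=1.0,
                            random_state=34).fit(X_demo, y_demo)

# Random subspaces: every tree sees all rows but only a random half of the columns
subspaces = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10,
                              bootstrap=False, max_samples=1.0, max_features=0.5,
                              random_state=34).fit(X_demo, y_demo)

# Random patches: random rows and random columns at the same time
patches = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10,
                            bootstrap=True, max_samples=0.8, max_features=0.5,
                            random_state=34).fit(X_demo, y_demo)

# Distinct rows and number of columns used by the first tree of each ensemble
for name, model in [('bagging', bagging), ('subspaces', subspaces), ('patches', patches)]:
    n_rows = len(np.unique(model.estimators_samples_[0]))
    n_cols = len(model.estimators_features_[0])
    print(name, n_rows, n_cols)
```

With bootstrap sampling, the first tree sees fewer than 200 distinct rows because some observations are drawn more than once.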

Scikit-learn includes the [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) and [RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) classes. Let's practice importing, instantiating, and evaluating a model using scikit-learn.




In [None]:
# Split features and target
X_reg = grades_reg.drop('G3', axis = 1)
y_reg = grades_reg['G3']

In [None]:
from sklearn.model_selection import train_test_split

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_reg, y_reg, test_size= 0.3, random_state = 34)

In [None]:
from sklearn.ensemble import RandomForestRegressor

# Instantiate a random forest
random_forest = RandomForestRegressor(max_depth= 3, n_jobs= -1, random_state= 34)

# Fit the data
random_forest.fit(X_train,y_train)

# Compute the score
score = random_forest.score(X_test,y_test)

print(score)

0.02978221160334593


The default metrics for random forests are the same as for decision trees, i.e., $R^2$ for regression forests and the accuracy for classification forests. However, there are some subtleties in how they are calculated:

- For regression forests, it uses the mean of all predictions made by each tree. 

- For classification, scikit-learn computes the probability for each class at every tree. Then, it averages these probabilities and returns the class with the highest average probability. *This procedure is different from the standard implementation of a random forest where every tree predicts a class, and the highest number of predictions (mode) is chosen*.
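Both aggregation rules can be checked by hand through the estimators_ attribute, which holds the fitted trees of a forest. The sketch below uses synthetic data and small forests (25 trees, depth 3); these settings are assumptions chosen only to keep the example quick.

```python
import numpy as np
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: average the class probabilities of every tree
X_c, y_c = make_classification(n_samples=200, n_features=6, random_state=34)
clf = RandomForestClassifier(n_estimators=25, max_depth=3, random_state=34).fit(X_c, y_c)

tree_probas = np.stack([tree.predict_proba(X_c) for tree in clf.estimators_])
avg_proba = tree_probas.mean(axis=0)
# Pick the class with the highest average probability
manual_classes = clf.classes_[avg_proba.argmax(axis=1)]

# Regression: average the predictions of every tree
X_r, y_r = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=34)
reg = RandomForestRegressor(n_estimators=25, max_depth=3, random_state=34).fit(X_r, y_r)
manual_reg = np.mean([tree.predict(X_r) for tree in reg.estimators_], axis=0)

print(np.allclose(avg_proba, clf.predict_proba(X_c)))
print((manual_classes == clf.predict(X_c)).mean())
print(np.allclose(manual_reg, reg.predict(X_r)))
```

The averaged probabilities and the averaged regression outputs match what the forests return from predict_proba and predict.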

The model we instantiated above is not performing well, with an $R^2 = 0.030$. The model is able to explain only 3% of the variance in the target variable.

# Random forests: Parameters

Random forest classifiers and regressors share most of the parameters of their decision tree counterparts: 'criterion', 'max_depth', 'min_samples_leaf', 'min_weight_fraction_leaf', 'max_features', 'max_leaf_nodes', 'min_impurity_decrease', 'random_state', 'ccp_alpha' and, for classifiers only, 'class_weight'. However, there are a few extra parameters exclusive to random forests:

- n_estimators: The number of trees that will be generated by the random forest. The default value is 100.

- bootstrap: When set to False, the whole dataset will be used to generate each tree. Otherwise, it follows the bagging scheme we described above.

- n_jobs: The number of processing jobs that run in parallel. When set to -1, it uses all available processors.

- verbose: print logs describing the algorithm operations at each step. Higher integer numbers will provide more detailed logs. The default value is 0, where no logs are printed.

- warm_start: It takes a boolean as an argument. If set to True, calling fit again reuses the trees that were already trained and only adds new ones, saving computational time. For example, if we fit a forest with n_estimators = 100 and later increase n_estimators to 200 and fit again, it keeps the first 100 trees and trains only the extra 100.

- max_samples: Only works if bootstrap = True. This parameter limits the number of observations used to train each tree.

- oob_score: It takes a boolean as an argument and controls whether the out-of-bag (OOB) score is computed. This is an alternative metric that uses the observations that were left out during the bagging procedure as a validation set. It only works when bootstrap = True. We can obtain its value using the oob_score_ attribute. The scores obtained through this attribute are the $R^2$ for regression forests and the accuracy for classification forests.
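As an illustration of the warm_start parameter, the sketch below grows a forest in two steps on synthetic data. The dataset and the tree counts are assumptions; the point is only that the second call to fit adds trees instead of starting over.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (hypothetical)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=34)

# First fit: train 100 trees
forest = RandomForestRegressor(n_estimators=100, warm_start=True, random_state=34)
forest.fit(X_demo, y_demo)
first_tree = forest.estimators_[0]

# Ask for 200 trees and fit again: only the 100 missing trees are trained
forest.set_params(n_estimators=200)
forest.fit(X_demo, y_demo)

n_trees = len(forest.estimators_)
kept_first_tree = forest.estimators_[0] is first_tree
print(n_trees)  # 200
print(kept_first_tree)
```

Without warm_start = True, the second fit would discard the existing trees and train all 200 from scratch.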

Let's practice with some of these parameters and obtain the out-of-bag score for a random forest.




In [None]:
from sklearn.ensemble import RandomForestClassifier

# Instantiate a classification forest 
random_forest_classif = RandomForestClassifier(max_depth=3, n_jobs=-1, random_state = 34, n_estimators= 20, oob_score= True)

# Fit the forest
random_forest_classif.fit(X,y)

# Get oob_score
oob_score = random_forest_classif.oob_score_

print(f'The accuracy for the out-of-the-bag set in this classification forest was, {oob_score*100:.2f}%.')


The accuracy for the out-of-the-bag set in this classification forest was, 54.39%.


# Out-of-bag (OOB) score - Regression forests

For the regression forest, once we set the oob_score parameter to True, scikit-learn keeps track of the observations that were left out of the bootstrap sample used to train each tree (the out-of-bag observations for that tree). Each observation is then predicted using only the trees that did not see it during training, and these predictions are averaged. The OOB $R^2$ score is computed by comparing these averaged out-of-bag predictions with the true target values, in the same way the $R^2$ is computed for a test set.
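The sketch below reproduces the OOB $R^2$ by hand for a regression forest, using the oob_prediction_ attribute that scikit-learn fills in when oob_score = True. It runs on synthetic data with 100 trees; both choices are assumptions made to keep the example self-contained.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Synthetic data (hypothetical)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=34)

forest = RandomForestRegressor(n_estimators=100, max_depth=3, oob_score=True,
                               random_state=34).fit(X_demo, y_demo)

# Averaged predictions from the trees that did not see each observation
oob_pred = forest.oob_prediction_

# R^2 between the out-of-bag predictions and the true targets
manual_oob_r2 = r2_score(y_demo, oob_pred)

print(manual_oob_r2)
print(forest.oob_score_)
```

The two printed values agree, which shows that oob_score_ is simply the $R^2$ of the out-of-bag predictions.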

# Out-of-bag (OOB) score - Classification forests

We already explained above how to compute the score for a classification forest. Let's work all the steps in an exercise using scikit-learn to make it explicit. We will use the random_forest_classif instance we created above.



In [None]:
from sklearn.metrics import accuracy_score

# Get and store oob predictions
oob_predictions_class = random_forest_classif.oob_decision_function_

print(oob_predictions_class)

[[0.0233713  0.27526908 0.01850346 0.45696319 0.09955674 0.12633623]
 [0.01779383 0.18483143 0.02089541 0.55071356 0.07718143 0.14858434]
 [0.01868363 0.16519849 0.01697206 0.59238432 0.0899006  0.11686089]
 ...
 [0.01573107 0.1437328  0.01390058 0.5697899  0.06066863 0.196177  ]
 [0.01515793 0.15689882 0.01343504 0.59142343 0.08734163 0.13574314]
 [0.02296422 0.15850608 0.02035826 0.57905424 0.09005771 0.12905949]]


It returns a two-dimensional NumPy array. Every row contains the probabilities associated with each class for one observation. Let's get the class names:


In [None]:
# Get classes names in the order of the predictions above
classes = random_forest_classif.classes_
print(classes)

['Excellent' 'Good' 'Poor' 'Sufficient' 'Very Good' 'Weak']


Let's build a dataframe to make it easier to visualize the results.

In [None]:
# Create a dataframe from the observations and classes names
probabilities_oob = pd.DataFrame(oob_predictions_class, columns =  classes )

In [None]:
probabilities_oob.head()

Unnamed: 0,Excellent,Good,Poor,Sufficient,Very Good,Weak
0,0.023371,0.275269,0.018503,0.456963,0.099557,0.126336
1,0.017794,0.184831,0.020895,0.550714,0.077181,0.148584
2,0.018684,0.165198,0.016972,0.592384,0.089901,0.116861
3,0.030575,0.142464,0.023845,0.553891,0.104219,0.145007
4,0.034802,0.193078,0.035141,0.525376,0.095001,0.116602


Every observation will have a class prediction associated with it. We will create another column called y_pred and use the [idxmax()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.idxmax.html) method from pandas to get the column associated with the highest probability for each observation.

In [None]:
# Create a new column with the predicted classes
probabilities_oob['y_pred'] = probabilities_oob.idxmax(axis = 1)

In [None]:
probabilities_oob.head()

Unnamed: 0,Excellent,Good,Poor,Sufficient,Very Good,Weak,y_pred
0,0.023371,0.275269,0.018503,0.456963,0.099557,0.126336,Sufficient
1,0.017794,0.184831,0.020895,0.550714,0.077181,0.148584,Sufficient
2,0.018684,0.165198,0.016972,0.592384,0.089901,0.116861,Sufficient
3,0.030575,0.142464,0.023845,0.553891,0.104219,0.145007,Sufficient
4,0.034802,0.193078,0.035141,0.525376,0.095001,0.116602,Sufficient


Next, we add the correct outcomes in another column.

In [None]:
# Create a new column with the outcomes
probabilities_oob['y_test'] = y

In [None]:
probabilities_oob.head()

Unnamed: 0,Excellent,Good,Poor,Sufficient,Very Good,Weak,y_pred,y_test
0,0.023371,0.275269,0.018503,0.456963,0.099557,0.126336,Sufficient,Sufficient
1,0.017794,0.184831,0.020895,0.550714,0.077181,0.148584,Sufficient,Sufficient
2,0.018684,0.165198,0.016972,0.592384,0.089901,0.116861,Sufficient,Sufficient
3,0.030575,0.142464,0.023845,0.553891,0.104219,0.145007,Sufficient,Good
4,0.034802,0.193078,0.035141,0.525376,0.095001,0.116602,Sufficient,Sufficient


We are ready to compute the accuracy by hand and see how the OOB accuracy score is obtained. Let's compute it.


In [None]:
# Compute the accuracy score by hand
correct_predictions = probabilities_oob[(probabilities_oob['y_pred'] == probabilities_oob['y_test'])].shape[0]
total_numb_obs = probabilities_oob.shape[0]

accuracy = correct_predictions/total_numb_obs

print(f'The accuracy computed by hand was {accuracy*100:.2f}%.')

The accuracy computed by hand was 54.39%.


As expected, this is precisely the value obtained by scikit-learn using the oob_score_ attribute.
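The same check can be written more compactly with the accuracy_score function. The sketch below repeats the procedure on a synthetic dataset (an assumption, so that the block runs on its own) with 100 trees, so that every observation is out of the bag for at least one tree.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the grades data (hypothetical)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=4,
                                     n_classes=3, random_state=34)

forest = RandomForestClassifier(n_estimators=100, max_depth=3, oob_score=True,
                                random_state=34).fit(X_demo, y_demo)

# One row per observation, one column per class
oob_proba = pd.DataFrame(forest.oob_decision_function_, columns=forest.classes_)

# Predicted class = column with the highest out-of-bag probability
oob_pred = oob_proba.idxmax(axis=1)

manual_accuracy = accuracy_score(y_demo, oob_pred)
print(manual_accuracy)
print(forest.oob_score_)
```

As in the exercise above, the accuracy computed from the out-of-bag probabilities matches the oob_score_ attribute.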

# Extremely randomized trees

There is a lesser-known relative of the random forest: **extra trees** (short for extremely randomized trees). It shares most of its parameters with the random forest algorithm, but includes two fundamental differences that aim to increase the level of randomness of the trees:

- Every tree uses the entire dataset by default. In other words, it does not use **bagging**, and the OOB set is not available.

- Like a random forest, it considers a random selection of the columns at each split. However, instead of searching for the optimal threshold for each candidate column, it draws a threshold at random for each one. The best of these randomly drawn thresholds is then selected to split the data. 

The code steps to instantiate, fit and compute the score of an extra-trees model are very similar to the ones we used for random forests. However, due to the lack of the OOB set, we need to either separate the data into training and test sets or use cross-validation.

Let's practice with it.

In [None]:
from sklearn.ensemble import ExtraTreesRegressor

# Instantiate an extra tree regressor
extra_tree_reg = ExtraTreesRegressor(max_depth = 9, n_estimators = 200, random_state = 34)

# Perform cross-validation 
extra_trees_crossvalidation = cross_val_score(extra_tree_reg, X_reg, y_reg, cv = 8, n_jobs= -1)

# Calculate the mean of the R^2 scores
cross_val_mean = extra_trees_crossvalidation.mean()

In [None]:
print(f'The mean value for the R^2 value, using the extra trees instance and cross-validation, was {cross_val_mean:.4f}.')

The mean value for the R^2 value, using the extra trees instance and cross validation, was -0.3357.


The $R^2$ value for this model was negative, which means it performed worse than a model that always predicts the mean of the target. In a real-life scenario, we would have to rethink the data and model selection. 
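A negative value is possible because $R^2 = 1 - SS_{res}/SS_{tot}$, where $SS_{res}$ is the sum of squared prediction errors and $SS_{tot}$ is the sum of squared deviations of the target from its mean. Whenever the predictions are worse than simply predicting the mean, the ratio exceeds 1. The toy numbers below are made up to illustrate this.

```python
from sklearn.metrics import r2_score

y_true = [1, 2, 3]

# Always predicting the mean (2) gives SS_res = SS_tot, so R^2 = 0
r2_mean = r2_score(y_true, [2, 2, 2])

# Predictions in reverse order: SS_res = 8, SS_tot = 2, so R^2 = 1 - 8/2
r2_bad = r2_score(y_true, [3, 2, 1])

print(r2_mean)  # 0.0
print(r2_bad)   # -3.0
```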

# Wrapping up

Decision tree advantages are:

- They are easier to understand compared to other machine learning models.

- They work for classification and regression problems.

- They are easy to visualize and display, making them easy to share with a non-technical audience.

- They don't require underlying assumptions about the data. They simply divide the data into homogeneous groups.

The disadvantages of decision trees are:

- Unless constrained, they don't stop dividing the data until achieving homogeneous groups. For that reason, they tend to overfit the data.

- They require extra work to prune and to tweak the hyperparameters in order to avoid overfitting.

- Decision trees are extremely sensitive to small changes in the data. The threshold values at a split can change drastically with a small variation in the observations. However, we learned that random forests and extra trees mitigate this issue.

- Trees tend to be time-consuming and computationally expensive.

- Although they seem easy to grasp, the details of the computations become complex and repetitive.

Despite their limitations, decision trees and random forests are widely used and are two of the most important machine learning algorithms in use today.

# Summary

In this lesson we've learned:

- How to perform cross-validation using the K-Fold cross-validation method.

- The cross_val_score and cross_validate tools from scikit-learn and how to apply them to a real dataset.

- How to create a scorer with the make_scorer function.

- How grid search operates and how to implement it using the GridSearchCV class from scikit-learn.

- The importance of random forests to mitigate the tendency of decision trees to overfit.

- How random forests achieve it and how to implement them using scikit-learn.