# Random Forest
In this notebook, we will develop a Random Forest machine learning model to predict the next day's return of the HSCEI Index. We will determine the optimal hyperparameters for the model using both grid search and random search techniques.<br><br>
**The notebook is structured as follows**
1. Import the data
2. Define predictor variables and a target variable
3. Split the data into train and test datasets

## Import Libraries

In [1]:
# Data manipulation
import pandas as pd
import numpy as np

import yfinance as yf
import datetime

# Machine learning libraries
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Evaluation metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error

In [2]:
# Define the ticker symbol for HSCEI index
hscei_ticker = '^HSCE'  # HSCEI ticker symbol

# Define the date range
start_date = datetime.datetime(2010, 1, 1)
end_date = datetime.datetime(2020, 1, 5)

# Fetch the HSCEI data
data = yf.download(hscei_ticker, start=start_date, end=end_date)

# Print the first few rows of the data
data.head()

[*********************100%***********************]  1 of 1 completed


| Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|
| 2010-01-04 | 12791.330078 | 12928.769531 | 12642.099609 | 12750.549805 | 12750.549805 | 0 |
| 2010-01-05 | 12910.370117 | 13157.769531 | 12868.650391 | 13142.030273 | 13142.030273 | 0 |
| 2010-01-06 | 13175.110352 | 13350.120117 | 13114.740234 | 13246.209961 | 13246.209961 | 0 |
| 2010-01-07 | 13354.969727 | 13354.969727 | 13019.030273 | 13073.200195 | 13073.200195 | 0 |
| 2010-01-08 | 13082.700195 | 13131.049805 | 12952.179688 | 13035.089844 | 13035.089844 | 0 |


The first few rows shown above contain no NaN values. Note that `head()` displays only five rows, so this is not a check of the full dataset.
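
Missing values can be counted explicitly with `isna()`. The following is a minimal sketch on a small made-up frame (`toy` and its values are hypothetical stand-ins for the downloaded `data`):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the downloaded price data
toy = pd.DataFrame(
    {"Close": [100.0, 101.5, np.nan, 102.0],
     "Volume": [0, 0, 0, 0]},
    index=pd.date_range("2010-01-04", periods=4, freq="B"),
)

# Count missing values per column
nan_counts = toy.isna().sum()
print(nan_counts)

# True only if no column contains a NaN
has_no_nans = not toy.isna().any().any()
print(has_no_nans)
```

Running the same two expressions on `data` would confirm whether the full download is free of missing values.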
<a id='input'></a>
## Create Input Parameters

We will create custom indicators from the price data to use as predictor variables.

1. `ret1`, `ret5`, `ret10`, `ret20`, `ret40`: 1-day, 5-day, 10-day, 20-day and 40-day returns, respectively. `ret1` is the daily percentage change of `Adj Close`; the longer horizons are rolling sums of `ret1`.
2. `std5`, `std10`, `std20`, `std40`: 5-day, 10-day, 20-day and 40-day standard deviations of the daily returns (`ret1`), respectively.
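
Summing daily simple returns over a window approximates the multi-day return but is not identical to it. The sketch below compares the two on a made-up price series (`close` is hypothetical, for illustration only):

```python
import pandas as pd

# Hypothetical closing prices, for illustration only
close = pd.Series([100.0, 102.0, 101.0, 103.0, 104.0, 106.0])

# Daily simple returns
ret1 = close.pct_change()

# 5-day feature as a rolling sum of the last 5 daily returns
ret5_sum = ret1.rolling(5).sum()

# Exact 5-day simple return, for comparison
ret5_exact = close.pct_change(5)

print(ret5_sum.iloc[-1], ret5_exact.iloc[-1])
```

For small daily moves the two numbers are close, which is why the rolling sum is a reasonable feature even though it ignores compounding.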


In [3]:
# Returns
data['ret1'] = data['Adj Close'].pct_change()
data['ret5'] = data.ret1.rolling(5).sum()
data['ret10'] = data.ret1.rolling(10).sum()
data['ret20'] = data.ret1.rolling(20).sum()
data['ret40'] = data.ret1.rolling(40).sum()

# Standard Deviation
data['std5'] = data.ret1.rolling(5).std()
data['std10'] = data.ret1.rolling(10).std()
data['std20'] = data.ret1.rolling(20).std()
data['std40'] = data.ret1.rolling(40).std()

# Future returns
data['retFut1'] = data.ret1.shift(-1)

# Drop rows with NaN values created by the rolling windows and the shift
data = data.dropna()

# Define predictor variables (X) and a target variable (y)
predictor_list = ['ret1', 'ret5', 'ret10', 'ret20',
                  'ret40', 'std5', 'std10', 'std20', 'std40']
X = data[predictor_list]
y = data.retFut1

# Split the data chronologically into train (first 80%) and test (last 20%) datasets
train_length = int(len(data)*0.80)
X_train = X[:train_length]
X_test = X[train_length:]
y_train = y[:train_length]
y_test = y[train_length:]
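
The target alignment and the chronological split can be checked on a toy example. This is a sketch only; the `toy` frame and its return values are made up:

```python
import pandas as pd

# Hypothetical daily returns on five consecutive days
toy = pd.DataFrame({"ret1": [0.01, -0.02, 0.03, 0.00, 0.015]})

# Next day's return, aligned with today's row
toy["retFut1"] = toy["ret1"].shift(-1)

# The last row has no "next day", so its target is NaN and is dropped
toy = toy.dropna()

# Chronological 80/20 split: earlier rows train, later rows test
train_length = int(len(toy) * 0.80)
train, test = toy[:train_length], toy[train_length:]
print(train)
print(test)
```

Because the rows are sliced in order rather than shuffled, every test row comes after every training row.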

The key hyperparameters in the random forest method are:
- `n_estimators`
- `max_features`
- `max_depth`
- `min_samples_leaf`
- `bootstrap`

We define a range of values for each of these hyperparameters below.

In [4]:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start=10, stop=20, num=5)]

# Fraction of features to consider at every split
max_features = [round(x, 2) for x in np.linspace(start=0.3, stop=1.0, num=5)]

# Max depth of the tree
max_depth = [int(round(x, 2)) for x in np.linspace(start=2, stop=10, num=5)]

# Minimum number of samples required at each leaf node
min_samples_leaf = [int(x) for x in np.linspace(start=300, stop=600, num=10)]

# Method of selecting training subset for training each tree
bootstrap = [True, False]

# Save these parameters in a dictionary
param_grid = {'n_estimators': n_estimators,
              'max_features': max_features,
              'max_depth': max_depth,
              'min_samples_leaf': min_samples_leaf,
              'bootstrap': bootstrap
              }

# Print the dictionary
param_grid

{'n_estimators': [10, 12, 15, 17, 20],
 'max_features': [0.3, 0.48, 0.65, 0.82, 1.0],
 'max_depth': [2, 4, 6, 8, 10],
 'min_samples_leaf': [300, 333, 366, 400, 433, 466, 500, 533, 566, 600],
 'bootstrap': [True, False]}
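
The size of this search space is the product of the number of candidate values per hyperparameter. A quick sketch of that arithmetic (the dictionary is re-typed here so the snippet runs on its own):

```python
from math import prod

# Candidate values copied from the dictionary printed above
param_grid = {
    "n_estimators": [10, 12, 15, 17, 20],
    "max_features": [0.3, 0.48, 0.65, 0.82, 1.0],
    "max_depth": [2, 4, 6, 8, 10],
    "min_samples_leaf": [300, 333, 366, 400, 433, 466, 500, 533, 566, 600],
    "bootstrap": [True, False],
}

# Total number of distinct hyperparameter combinations
n_combinations = prod(len(values) for values in param_grid.values())
print(n_combinations)  # 2500
```

Evaluating all 2,500 combinations with 5-fold cross-validation would require 12,500 model fits, which is why sampling a subset of combinations is attractive.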

In [5]:
# Create the base model to tune
random_forest = RandomForestRegressor()

`RandomizedSearchCV` takes the following parameters as input:

1. `estimator`: The base estimator model for which the best hyperparameter values are found.
2. `param_distributions`: Dictionary of parameter names and lists of values to try.
3. `n_iter`: Number of hyperparameter combinations that are sampled and evaluated.
4. `random_state`: The random seed value.
5. `cv`: Number of cross-validation folds, or a cross-validation generator or iterable.
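
When `cv` is an integer and the estimator is a regressor, scikit-learn uses K-fold splitting without shuffling, so some validation folds precede part of the data the model was trained on. For time-ordered data, one commonly used alternative is to pass a `TimeSeriesSplit` object as `cv`. The sketch below illustrates this on synthetic data and is not the configuration used in this notebook; `X_demo`, `y_demo` and `demo_params` are made-up names and values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

# Synthetic stand-in for the notebook's X_train and y_train
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = rng.normal(scale=0.01, size=300)

# Small illustrative search space (values are assumptions, not tuned)
demo_params = {
    "n_estimators": [5, 10],
    "max_depth": [2, 4],
    "min_samples_leaf": [10, 20],
}

# Each split trains on earlier rows and validates on the rows that follow
tscv = TimeSeriesSplit(n_splits=3)

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_distributions=demo_params,
    n_iter=4,
    random_state=42,
    cv=tscv,
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

The design choice here is that validation data always lies in the future relative to the training data of the same split, which mirrors how the model would be used on new days.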

In [6]:
# Random search of parameters by searching across 50 different combinations
rf_random = RandomizedSearchCV(estimator=random_forest,
                               param_distributions=param_grid,
                               n_iter=50,
                               random_state=42,
                               cv=5
                               )

# Fit the model to find the best hyperparameter values
rf_random.fit(X_train, y_train)

The best hyperparameter values found for the random forest model are shown below.

In [7]:
rf_random.best_params_

{'n_estimators': 15,
 'min_samples_leaf': 500,
 'max_features': 0.82,
 'max_depth': 8,
 'bootstrap': True}

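In this step, we train the model created using the best hyperparameter values. `best_estimator_` has already been refit on the full training set by the search; we refit it here with a fixed `random_state` so that the results are reproducible.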

In [8]:
# Assign the best model to best_random_forest_random
best_random_forest_random = rf_random.best_estimator_

# Set random_state to 42 for reproducibility
best_random_forest_random.random_state = 42

# Fit the best random forest model on the train dataset
best_random_forest_random.fit(X_train, y_train)

## Grid Search

Similarly, we can find the best model using the grid search cross-validation technique. Because grid search tries every possible combination, it becomes time-consuming as the grid grows, so we define fewer hyperparameter values here for illustration purposes only. You may choose to specify more values for each hyperparameter. Note that `max_depth` is not part of this grid, so it stays at the estimator's default.

In [9]:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start=10, stop=20, num=3)]

# Fraction of features to consider at every split
max_features = [round(x, 2) for x in np.linspace(start=0.3, stop=1.0, num=3)]

# Minimum number of samples required at each leaf node
min_samples_leaf = [int(x) for x in np.linspace(start=300, stop=600, num=3)]

# Method of selecting training subset for training each tree
bootstrap = [True, False]

# Create the parameter grid
param_grid = {'n_estimators': n_estimators,
              'max_features': max_features,
              'min_samples_leaf': min_samples_leaf,
              'bootstrap': bootstrap
              }

param_grid

{'n_estimators': [10, 15, 20],
 'max_features': [0.3, 0.65, 1.0],
 'min_samples_leaf': [300, 450, 600],
 'bootstrap': [True, False]}

The code below finds the best hyperparameter values using 5-fold cross-validation.

In [10]:
# Grid search of parameters by searching all the possible combinations
rf_grid = GridSearchCV(estimator=random_forest,
                       param_grid=param_grid, cv=5
                       )

# Fit the model to find the best hyperparameter values
rf_grid.fit(X_train, y_train)

# Best hyperparameter values
rf_grid.best_params_

{'bootstrap': True,
 'max_features': 1.0,
 'min_samples_leaf': 600,
 'n_estimators': 10}

In [11]:
# Assign the best model to best_random_forest_grid
best_random_forest_grid = rf_grid.best_estimator_

# Set random_state to 42 for reproducibility
best_random_forest_grid.random_state = 42

# Fit the best random forest model on the train dataset
best_random_forest_grid.fit(X_train, y_train)

<a id='prediction'></a>
## Prediction & Evaluation

We will compare the predictive performance of the grid model with that of the random model.

In [12]:
# Predict the next day's return
retFut1_random = best_random_forest_random.predict(X_test)
retFut1_grid = best_random_forest_grid.predict(X_test)

In [13]:
# Mean absolute error
MAE_random = mean_absolute_error(y_test, retFut1_random)
print(f"MAE RF with random search : {MAE_random}")
MAE_grid = mean_absolute_error(y_test, retFut1_grid)
print(f"MAE RF with grid search : {MAE_grid}")

# Mean squared error
MSE_random = mean_squared_error(y_test, retFut1_random)
print(f"MSE RF with random search : {MSE_random}")
MSE_grid = mean_squared_error(y_test, retFut1_grid)
print(f"MSE RF with grid search : {MSE_grid}")

# Root mean squared error (square root of the MSE computed above)
RMSE_random = np.sqrt(MSE_random)
print(f"RMSE RF with random search : {RMSE_random}")
RMSE_grid = np.sqrt(MSE_grid)
print(f"RMSE RF with grid search : {RMSE_grid}")


MAE RF with random search : 0.009097830474213293
MAE RF with grid search : 0.009077935904365885
MSE RF with random search : 0.00014300716305058872
MSE RF with grid search : 0.00014256693023949123
RMSE RF with random search : 0.011958560241541986
RMSE RF with grid search : 0.011940139456450718
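
Error values like these are easier to interpret next to a naive reference. One simple option, sketched below as an assumption rather than as part of this notebook's method, is a baseline that always predicts a zero return. The example uses synthetic data (`y_demo` is a made-up stand-in for `y_test`), so its numbers say nothing about the HSCEI results above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Synthetic stand-in for y_test
rng = np.random.default_rng(1)
y_demo = rng.normal(loc=0.0, scale=0.012, size=500)

# Naive baseline: always predict a zero return
zero_pred = np.zeros_like(y_demo)

mae_zero = mean_absolute_error(y_demo, zero_pred)
rmse_zero = np.sqrt(mean_squared_error(y_demo, zero_pred))
print(f"MAE zero baseline : {mae_zero}")
print(f"RMSE zero baseline : {rmse_zero}")
```

Replacing `y_demo` with `y_test` would give baseline errors that can be compared directly with the two random forest models.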
