# Random Forest

1. What is the idea of bagging?
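Bagging (bootstrap aggregating) reduces the variance of the estimated prediction function by training many DTs (or any other learners), each on a bootstrap sample of the training data, and then averaging their predictions (regression) or taking the majority vote (classification).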
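
A minimal sketch of bagging by hand, assuming scikit-learn's `DecisionTreeRegressor` as the base learner and a synthetic noisy sine curve as data. The helpers `fit_bagged_trees` and `bagged_predict` are hypothetical names used only for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_bagged_trees(X, y, n_trees=50, seed=0):
    """Fit n_trees regression trees, each on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)  # draw n row indices with replacement
        trees.append(DecisionTreeRegressor(random_state=0).fit(X[idx], y[idx]))
    return trees


def bagged_predict(trees, X):
    """Average the individual tree predictions (regression case)."""
    return np.mean([t.predict(X) for t in trees], axis=0)


# Synthetic data (an assumption for illustration): noisy sine curve.
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)
X_test = np.linspace(-3, 3, 200).reshape(-1, 1)
y_test = np.sin(X_test[:, 0])  # noise-free target for evaluation

single_tree = DecisionTreeRegressor(random_state=0).fit(X, y)
trees = fit_bagged_trees(X, y)

mse_single = np.mean((single_tree.predict(X_test) - y_test) ** 2)
mse_bagged = np.mean((bagged_predict(trees, X_test) - y_test) ** 2)
print(f"single tree mse: {mse_single:.3f}, bagged mse: {mse_bagged:.3f}")
```

A fully grown single tree fits the noise in its training set; averaging trees trained on different bootstrap samples smooths that noise out, which is the variance reduction bagging is after.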

2. What is the tradeoff in choosing the number m of features to sample in every node?
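Bigger m means less randomness: the trees become more correlated, and when m equals the total number of features the behavior gets close to regular bagging.
Smaller m means more randomness, which is what random forest tries to achieve in order to decorrelate the trees, but it also means more noise: each split is chosen from fewer candidates, so it is more likely to be made on a weak feature.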
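
The cost of a small m can be made concrete with a short calculation. Assuming m candidate features are drawn without replacement from p features, of which r are relevant, the chance that a node sees no relevant candidate at all is C(p - r, m) / C(p, m). The numbers below (10,000 features, 20 relevant) and the helper `prob_no_relevant` are hypothetical and chosen only for illustration.

```python
from math import comb


def prob_no_relevant(p, r, m):
    """Probability that m features drawn without replacement from p
    contain none of the r relevant ones (hypergeometric)."""
    return comb(p - r, m) / comb(p, m)


# Hypothetical setting: 10,000 features, 20 of them relevant.
p, r = 10_000, 20
for m in (2, 10, 100, 1000):
    print(f"m={m:>4}: P(no relevant candidate) = {prob_no_relevant(p, r, m):.3f}")
```

With few relevant features and a small m, most nodes are forced to split on an irrelevant feature, which is the situation the exercise below is meant to demonstrate.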

3. What's Out Of Bag Sampling? What is the size of the Out Of Bag set when the dataset is large?
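The idea is like K-fold Cross Validation.
Each observation is predicted using only the trees whose bootstrap sample did not contain it, so the error estimate comes from models that never saw that observation.
A bootstrap sample of size n leaves a given observation out with probability (1 - 1/n)^n, which tends to 1/e ≈ 0.368 for large n, so the Out Of Bag set of each tree is about 37% of the dataset.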
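
To check the size of the Out Of Bag set numerically: a row is missing from a bootstrap sample of size n with probability (1 - 1/n)^n. The sketch below compares that formula with one simulated bootstrap draw and with the limit 1/e; the function names are hypothetical.

```python
import math
import random


def oob_fraction_exact(n):
    """Probability that a given row is absent from a bootstrap sample of size n."""
    return (1 - 1 / n) ** n


def oob_fraction_simulated(n, seed=0):
    """Draw one bootstrap sample of size n and measure the share of rows left out."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}  # distinct rows that were drawn
    return 1 - len(drawn) / n


for n in (10, 100, 10_000):
    print(n, round(oob_fraction_exact(n), 4), round(oob_fraction_simulated(n), 4))
print("limit 1/e =", round(math.exp(-1), 4))
```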

4. How can one measure feature importance in random forest?
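A classic way is to sum, over all splits made on this feature, the improvement in the split criterion that those splits added.
Another way, specifically for random forests, is to calculate the OOB accuracy with the feature intact and again with its values randomly permuted in the OOB samples (which removes its information), and compare.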
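
Both measures can be sketched with scikit-learn. Two assumptions to flag: the data is synthetic (only features 0 and 1 drive the target), and the permutation step uses `sklearn.inspection.permutation_importance` on a held-out validation set rather than on the OOB samples, so it is a close relative of the OOB approach and not the same procedure.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption): only features 0 and 1 affect the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 3 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=500)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# 1) Impurity-based importance: summed split improvement per feature, normalized to 1.
impurity_imp = rf.feature_importances_

# 2) Permutation importance: drop in score after shuffling one column at a time
#    (computed here on held-out data, not on the OOB samples).
perm = permutation_importance(rf, X_val, y_val, n_repeats=5, random_state=0)

for j in range(X.shape[1]):
    print(f"feature {j}: impurity={impurity_imp[j]:.3f}, "
          f"permutation={perm.importances_mean[j]:.3f}")
```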

## Exercise
Demonstrate in code that when the number of variables is large but the fraction of relevant variables is small, random forests are likely to perform poorly with small m.

In [78]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [79]:
temps = pd.read_csv('temps.csv')

temps.head(5)

Unnamed: 0,year,month,day,week,temp_2,temp_1,average,actual,forecast_noaa,forecast_acc,forecast_under,friend
0,2016,1,1,Fri,45,45,45.6,45,43,50,44,29
1,2016,1,2,Sat,44,45,45.7,44,41,50,44,61
2,2016,1,3,Sun,45,44,45.8,41,43,46,47,56
3,2016,1,4,Mon,44,41,45.9,40,44,48,46,53
4,2016,1,5,Tues,41,40,46.0,44,46,46,46,41


In [80]:
temps.describe()

Unnamed: 0,year,month,day,temp_2,temp_1,average,actual,forecast_noaa,forecast_acc,forecast_under,friend
count,348.0,348.0,348.0,348.0,348.0,348.0,348.0,348.0,348.0,348.0,348.0
mean,2016.0,6.477011,15.514368,62.652299,62.701149,59.760632,62.543103,57.238506,62.373563,59.772989,60.034483
std,0.0,3.49838,8.772982,12.165398,12.120542,10.527306,11.794146,10.605746,10.549381,10.705256,15.626179
min,2016.0,1.0,1.0,35.0,35.0,45.1,35.0,41.0,46.0,44.0,28.0
25%,2016.0,3.0,8.0,54.0,54.0,49.975,54.0,48.0,53.0,50.0,47.75
50%,2016.0,6.0,15.0,62.5,62.5,58.2,62.5,56.0,61.0,58.0,60.0
75%,2016.0,10.0,23.0,71.0,71.0,69.025,71.0,66.0,72.0,69.0,71.0
max,2016.0,12.0,31.0,117.0,117.0,77.4,92.0,77.0,82.0,79.0,95.0


In [81]:
temps = pd.get_dummies(temps)
temps.iloc[:,5:].head(5)

Unnamed: 0,average,actual,forecast_noaa,forecast_acc,forecast_under,friend,week_Fri,week_Mon,week_Sat,week_Sun,week_Thurs,week_Tues,week_Wed
0,45.6,45,43,50,44,29,1,0,0,0,0,0,0
1,45.7,44,41,50,44,61,0,0,1,0,0,0,0
2,45.8,41,43,46,47,56,0,0,0,1,0,0,0
3,45.9,40,44,48,46,53,0,1,0,0,0,0,0
4,46.0,44,46,46,46,41,0,0,0,0,0,1,0


In [82]:
# Target: the 'actual' column
y = np.array(temps['actual'])

# Features: every remaining column (names kept in temps_list for reference)
temps = temps.drop('actual', axis=1)
temps_list = list(temps.columns)
temps = np.array(temps)

x_train, x_test, y_train, y_test = train_test_split(temps, y, test_size=0.25, random_state=42)

## Standard RF

In [83]:
# Baseline: with default settings the regressor considers all features at every split
rf = RandomForestRegressor(n_estimators=1000, random_state=42)
rf.fit(x_train, y_train)

preds = rf.predict(x_test)

print(f'mse is {mean_squared_error(y_test,preds)}')

mse is 26.02690937931034


## RF with small m and many uninformative features

In [84]:
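# Append 10,000 all-zero (uninformative) columns to the train and test features
z = np.zeros((x_train.shape[0], 10000))
mx_train = np.concatenate((x_train, z), axis=1)
z = np.zeros((x_test.shape[0], 10000))
mx_test = np.concatenate((x_test, z), axis=1)

# Small m: only 2 candidate features are sampled at each split
rf = RandomForestRegressor(n_estimators=1000, random_state=42, max_features=2)
rf.fit(mx_train, y_train)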

preds = rf.predict(mx_test)

print(f'mse is {mean_squared_error(y_test,preds)}')

mse is 31.230310574913798
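
One caveat about the cell above: the appended columns are all zeros, and a constant column can never produce a split. scikit-learn's documentation notes that the split search continues until a valid partition is found, even if that means inspecting more than `max_features` features, so constant columns likely understate the effect. A sketch of a harsher variant follows, assuming synthetic data and random (non-constant) noise columns. All names and sizes here are illustrative and not taken from the notebook, and both forests are trained on the same augmented data so that only m differs.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_relevant, n_noise = 300, 5, 500

# Synthetic data (an assumption): the target depends only on the first 5 columns.
X_rel = rng.normal(size=(n_samples, n_relevant))
X_noise = rng.normal(size=(n_samples, n_noise))  # random noise, not constant
X = np.hstack([X_rel, X_noise])
y = X_rel.sum(axis=1) + rng.normal(scale=0.1, size=n_samples)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

results = {}
for m in (2, X.shape[1]):  # small m versus all features
    rf = RandomForestRegressor(n_estimators=100, max_features=m, random_state=0)
    rf.fit(X_tr, y_tr)
    results[m] = mean_squared_error(y_te, rf.predict(X_te))
    print(f"max_features={m}: mse = {results[m]:.3f}")
```

With m=2, most splits have only noise columns to choose from, so the trees learn little about the target; with all features available, each split can pick a relevant column.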
