# Random Forests

Random Forests are an extension of Decision Trees. Instead of building a single decision tree, many trees are built and their predictions are averaged. This usually results in better performance because it reduces the risk of overfitting to the training data.

## How are trees in a random forest made different?
What does averaging actually mean? How are the trees made different?

- Bagging:
    - Each tree is trained on some random subset of the training data.
- Random Feature Selection:
    - When splitting a node during the construction of the tree, the choice of the split is not made from all features. Instead, a random subset of features is selected.


The code for random forests is very similar to decision trees. The main difference is that you import `RandomForestRegressor` instead of `DecisionTreeRegressor`.

In [2]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# We load the data just as normal
iowa_file_path = './data/iowa_house_data.csv'

home_data = pd.read_csv(iowa_file_path)

y = home_data.SalePrice

feature_names = ['LotArea',
                 'YearBuilt',
                 '1stFlrSF',
                 '2ndFlrSF',
                 'FullBath',
                 'BedroomAbvGr',
                 'TotRmsAbvGrd']
X = home_data[feature_names]

X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=1)

# Now instead of DecisionTreeRegressor, we would use RandomForestRegressor
iowa_model = RandomForestRegressor(random_state=1)

iowa_model.fit(X, y)

val_predictions = iowa_model.predict(X_valid)

for i in range(len(X_valid)):
    print(f"Predicted: {val_predictions[i]}, Actual: {y_valid.iloc[i]}")

mae = mean_absolute_error(y_valid, val_predictions)
print(f"Mean Absolute Error: {mae}")


Predicted: 218290.0, Actual: 231500
Predicted: 168660.5, Actual: 179500
Predicted: 125714.5, Actual: 122000
Predicted: 85089.0, Actual: 84500
Predicted: 144467.0, Actual: 142000
Predicted: 312617.8, Actual: 325624
Predicted: 296307.93, Actual: 285000
Predicted: 147721.87504761908, Actual: 151000
Predicted: 203766.0, Actual: 195000
Predicted: 257640.5, Actual: 275000
Predicted: 175273.0, Actual: 175000
Predicted: 76934.8, Actual: 61000
Predicted: 191307.35, Actual: 174000
Predicted: 318621.24, Actual: 385000
Predicted: 237510.08, Actual: 230000
Predicted: 98087.0, Actual: 87000
Predicted: 123609.0, Actual: 125000
Predicted: 117133.74, Actual: 98600
Predicted: 234593.59, Actual: 260000
Predicted: 132521.5, Actual: 143000
Predicted: 132565.0, Actual: 124000
Predicted: 144901.45, Actual: 122500
Predicted: 231208.0, Actual: 236500
Predicted: 326188.65, Actual: 337500
Predicted: 84337.0, Actual: 76000
Predicted: 180696.5, Actual: 187000
Predicted: 126889.84, Actual: 128000
Predicted: 185027.

This gives us a MAE of 7879. A siginficant improvement over the decision tree's MAE of 29652.