## Random forest regression

- A Random Forest is an ensemble of Decision Trees, generally trained via the bagging method (or sometimes pasting), typically with max_samples set to the size of the training set.

- Rather than building a BaggingClassifier and passing it a DecisionTreeClassifier, you can use the RandomForestClassifier class, which is more convenient and optimized for Decision Trees.


- Similarly, there is a RandomForestRegressor class for regression tasks.
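The regression counterpart of that comparison can be sketched as follows. This is a minimal illustration, not the notebook's own code: the synthetic data from `make_regression` and the names `X_demo`, `y_demo`, `bag_reg`, and `forest_reg` are assumptions introduced here, and the two ensembles are only roughly comparable, since their defaults and internal randomness differ.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data stands in for a real dataset (assumption for illustration).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=6, noise=10.0, random_state=0
)

# Explicit bagging of decision trees: each tree is fit on a bootstrap sample
# as large as the training set (max_samples=1.0 with bootstrap=True).
bag_reg = BaggingRegressor(
    DecisionTreeRegressor(max_leaf_nodes=16),
    n_estimators=100,
    max_samples=1.0,
    bootstrap=True,
    random_state=0,
)
bag_reg.fit(X_demo, y_demo)

# The more convenient route: RandomForestRegressor builds the trees itself.
forest_reg = RandomForestRegressor(
    n_estimators=100, max_leaf_nodes=16, random_state=0
)
forest_reg.fit(X_demo, y_demo)

# Both ensembles expose the same predict() interface.
bag_pred = bag_reg.predict(X_demo[:5])
forest_pred = forest_reg.predict(X_demo[:5])
```

Both objects hold their fitted trees in `estimators_`, so they can be inspected the same way.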

#### Bagging

 - The approach is to use the same training algorithm for every predictor, but to train each one on a different random subset of the training set.

 
 - When sampling is performed with replacement, this method is called bagging (short for bootstrap aggregating). When sampling is performed without replacement, it is called pasting.
 

 - Once all predictors are trained, the ensemble can make a prediction for a new instance by simply aggregating the predictions of all predictors. The aggregation function is typically the statistical mode for classification, or the average for regression.
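The difference between the two sampling schemes, and the aggregation step, can be sketched with NumPy alone. The index array and the per-predictor outputs below are hypothetical values chosen for illustration; in practice, classifiers are usually combined by majority vote (the mode) and regressors by averaging.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10
indices = np.arange(n_samples)

# Bagging: draw with replacement, so some indices may repeat
# while others are left out of this predictor's training subset.
bagging_idx = rng.choice(indices, size=n_samples, replace=True)

# Pasting: draw without replacement, so every drawn index is unique.
pasting_idx = rng.choice(indices, size=n_samples // 2, replace=False)

# Hypothetical outputs of three predictors for a single new instance.
class_votes = np.array([1, 0, 1])           # classifiers: take the mode
reg_outputs = np.array([40.0, 44.0, 42.0])  # regressors: take the mean

values, counts = np.unique(class_votes, return_counts=True)
majority_class = values[np.argmax(counts)]
mean_prediction = reg_outputs.mean()
```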

### Importing libraries and loading the data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dataset = pd.read_csv('Real_Estate.csv', index_col='No')
dataset.head()

No,X1 transaction date,X2 house age,X3 distance to the nearest MRT station,X4 number of convenience stores,X5 latitude,X6 longitude,Y house price of unit area
1,2012.917,32.0,84.87882,10,24.98298,121.54024,37.9
2,2012.917,19.5,306.5947,9,24.98034,121.53951,42.2
3,2013.583,13.3,561.9845,5,24.98746,121.54391,47.3
4,2013.5,13.3,561.9845,5,24.98746,121.54391,54.8
5,2012.833,5.0,390.5684,5,24.97937,121.54245,43.1


In [2]:
# Features: every column except the last; target: the last column (house price of unit area)
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

In [3]:
corr = dataset.corr()
corr.style.background_gradient(cmap='coolwarm')

,X1 transaction date,X2 house age,X3 distance to the nearest MRT station,X4 number of convenience stores,X5 latitude,X6 longitude,Y house price of unit area
X1 transaction date,1.0,0.0175488,0.06088,0.00963544,0.0350578,-0.0410818,0.0874906
X2 house age,0.0175488,1.0,0.025622,0.0495925,0.0544199,-0.0485201,-0.210567
X3 distance to the nearest MRT station,0.06088,0.025622,1.0,-0.602519,-0.591067,-0.806317,-0.673613
X4 number of convenience stores,0.00963544,0.0495925,-0.602519,1.0,0.444143,0.449099,0.571005
X5 latitude,0.0350578,0.0544199,-0.591067,0.444143,1.0,0.412924,0.546307
X6 longitude,-0.0410818,-0.0485201,-0.806317,0.449099,0.412924,1.0,0.523287
Y house price of unit area,0.0874906,-0.210567,-0.673613,0.571005,0.546307,0.523287,1.0


### Splitting the dataset

In [20]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

### Training

In [36]:
from sklearn.ensemble import RandomForestRegressor
rf_regressor = RandomForestRegressor(n_estimators=500, max_leaf_nodes=16)
rf_regressor.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=16,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [37]:
y_predicted = rf_regressor.predict(X_test)
# For regressors, score() returns the coefficient of determination (R^2) on the given data
score = rf_regressor.score(X_test, y_test)
score

# y_predicted = np.expand_dims(y_predicted, axis=1)
# y_test = np.expand_dims(y_test, axis=1)
# np.concatenate((y_predicted, y_test), axis=1);

0.7336111864989442
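For a scikit-learn regressor, `score` returns the coefficient of determination, R², so the value above means the model explains roughly 73% of the variance of the test targets. A minimal sketch of how R² is computed follows. The helper `r2_manual` and the small arrays are hypothetical, introduced only to show the arithmetic, and are not taken from the dataset.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

# Hypothetical targets and predictions, for illustration only.
y_true_demo = np.array([30.0, 40.0, 50.0])
y_pred_demo = np.array([32.0, 40.0, 48.0])

# ss_res = 4 + 0 + 4 = 8, ss_tot = 100 + 0 + 100 = 200, so R^2 = 1 - 8/200
r2_demo = r2_manual(y_true_demo, y_pred_demo)
```

Applied to the notebook's `y_test` and `y_predicted`, the same function should agree with `rf_regressor.score(X_test, y_test)` up to floating-point error.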