# Assignment 25

In this assignment, students will build a random forest model to predict house prices from the Boston data set after normalizing the variables.

The following code loads the data into the environment (note that `load_boston` was removed in scikit-learn 1.2, so it requires an older release):
 import numpy as np
 import pandas as pd
 import matplotlib.pyplot as plt
 import seaborn as sns
 from sklearn.model_selection import train_test_split
 from sklearn.preprocessing import StandardScaler
 from sklearn import datasets
 boston = datasets.load_boston()
 features = pd.DataFrame(boston.data, columns=boston.feature_names)
 targets = boston.target

### What is a Random Forest model?

The random forest is a model made up of many decision trees. Rather than just being a forest though, this model is random because of two concepts:

1. Random sampling of data points
2. Splitting nodes based on subsets of features

### Random Sampling

One of the keys behind the random forest is that each tree trains on a random sample of the data points. The samples are drawn with replacement (known as bootstrapping), which means that some data points will be used multiple times in a single tree while others will not be used at all (this behavior can be disabled if we want). The idea is that by training each tree on different samples, although each tree might have high variance with respect to a particular set of the training data, the forest as a whole will have lower variance. This procedure of training each individual learner on a different bootstrap sample of the data and then averaging the predictions is known as bagging, short for bootstrap aggregating.
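As a minimal sketch of the bootstrapping step, the snippet below draws one bootstrap sample of row indices with NumPy. The seed and variable names are illustrative assumptions; the row count matches the 506 rows of the Boston data. In expectation roughly 63% of the rows (1 - 1/e) appear at least once, and the rest are left "out-of-bag" for that tree.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, chosen for reproducibility
n_samples = 506                 # same number of rows as the Boston data

# Draw one bootstrap sample: n_samples row indices chosen with replacement.
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Some rows appear several times, others not at all ("out-of-bag" rows).
unique_rows = np.unique(bootstrap_idx)
oob_rows = np.setdiff1d(np.arange(n_samples), unique_rows)

print("rows drawn:", bootstrap_idx.size)
print("distinct rows:", unique_rows.size)
print("out-of-bag rows:", oob_rows.size)
```

Each tree in the forest would be trained on its own `bootstrap_idx`, so no two trees see exactly the same data.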

### Random Subsets of Features

Another concept behind the random forest is that only a subset of all the features is considered for splitting each node in each decision tree. A common choice, especially for classification, is sqrt(n_features), meaning that at each node the decision tree considers splitting on a random sample of the features whose size is the square root of the total number of features. The random forest can also be trained considering all the features at every node. (These options are controlled by the `max_features` parameter in the Scikit-Learn random forest implementation.)
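The feature-subset idea can be illustrated with a short sketch. This is not Scikit-Learn's internal code, only an assumed, simplified picture of what happens at each node: a fresh random subset of feature indices is drawn, and only those features are candidates for the split.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
n_features = 13                 # number of feature columns in the Boston data

# Number of candidate features per split under the sqrt rule.
n_candidates = max(1, int(np.sqrt(n_features)))

# At each node a fresh subset is drawn without replacement.
for node in range(3):
    candidates = rng.choice(n_features, size=n_candidates, replace=False)
    print(f"node {node}: candidate feature indices {sorted(candidates.tolist())}")
```

With 13 features the sqrt rule gives 3 candidates per split, so different nodes usually end up splitting on different features.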

If you grasp a single decision tree, bagging decision trees, and random subsets of features, then you have a pretty good understanding of how a random forest works. The random forest combines hundreds or thousands of decision trees, trains each one on a slightly different set of the observations (sampling the data points with replacement) and also splits nodes in each tree considering only a limited number of the features. The final predictions made by the random forest are made by averaging the predictions of each individual tree.
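To make the combination of these ideas concrete, here is a hand-rolled sketch of a small forest built from individual `DecisionTreeRegressor` models. It is an illustration under stated assumptions, not how `RandomForestRegressor` is implemented: the data is synthetic (generated with `make_regression`, not the Boston data), and the number of trees, seeds, and helper name `mse` are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; not the Boston data).
X, y = make_regression(n_samples=300, n_features=13, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
trees = []
for i in range(50):
    # Bootstrap: sample training rows with replacement.
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # max_features="sqrt" makes each split consider a random feature subset.
    tree = DecisionTreeRegressor(max_features="sqrt", random_state=i)
    tree.fit(X_tr[idx], y_tr[idx])
    trees.append(tree)

# Forest prediction = mean of the individual tree predictions.
all_preds = np.stack([t.predict(X_te) for t in trees])
forest_pred = all_preds.mean(axis=0)


def mse(pred):
    """Mean squared error against the held-out targets."""
    return float(np.mean((y_te - pred) ** 2))


tree_mses = [mse(p) for p in all_preds]
print("mean single-tree MSE:", round(np.mean(tree_mses), 1))
print("averaged-forest MSE: ", round(mse(forest_pred), 1))
```

Because squared error is convex, the error of the averaged prediction can never exceed the average error of the individual trees, which is the variance-reduction effect described above.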

### Random Forest in Practice

Much like any other Scikit-Learn model, using the random forest in Python requires only a few lines of code. To see how the random forest performs in practice, we’ll use a real-world dataset split into a training and testing set.

In [8]:

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler



In [3]:
# Core Libraries to load (for data manipulation and analysis)

import numpy as np
import pandas as pd

#### Loading Dataset

In [9]:
from sklearn import datasets

boston = datasets.load_boston()

features = pd.DataFrame(boston.data, columns=boston.feature_names)

targets = boston.target

#### Exploring data

In [10]:
features.head()

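| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |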


In [6]:
features.shape

(506, 13)

In [23]:
features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
PRICE      506 non-null float64
dtypes: float64(14)
memory usage: 55.4 KB


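The dataframe returned by the loading step does not contain the target (dependent) column, so it is added below. The `info()` output above already lists `PRICE` because that cell (In [23]) was executed after the column had been added.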

In [18]:
boston.target[:5]

array([24. , 21.6, 34.7, 33.4, 36.2])

In [19]:
# Add the target prices to the features dataframe
features['PRICE'] = boston.target

In [20]:
features.head()

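| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |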


In [24]:
# get_dtype_counts() was removed in pandas 1.0; dtypes.value_counts() gives the same counts
features.dtypes.value_counts()

float64    14
dtype: int64

All columns have a numeric datatype.

In [21]:
features.PRICE.mean()

22.532806324110698

In [22]:
# Extract the PRICE column values as a float array
float_array = features['PRICE'].values.astype(float)

In [25]:
# Dataframe shape after updating 
print(features.shape)

(506, 14)


Check the datatypes and the presence of null values using info():

features.info()

No cleaning is required, as the data is already clean and has no null or NaN values.
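As a complement to `info()`, missing values can also be counted directly. The sketch below uses a small hypothetical dataframe (`demo`, with one deliberately missing value) rather than the Boston data, only to show what the check returns when something is missing.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame with one missing value, for illustration only.
demo = pd.DataFrame({"RM": [6.575, 6.421, np.nan], "LSTAT": [4.98, 9.14, 4.03]})

null_counts = demo.isnull().sum()             # missing values per column
has_nulls = bool(demo.isnull().values.any())  # True if any cell is missing

print(null_counts)
print("any nulls:", has_nulls)
```

Applied to `features`, the same two calls would be expected to report zero missing values in every column, consistent with the `info()` output.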

#### Building a model

In [29]:
# The column that we want to predict.
y_column = features['PRICE']

# The columns that we will be making predictions with.
x_columns = features.drop('PRICE', axis=1)

In [30]:
# split the data into training and test sets and scale the variables

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(x_columns, y_column, test_size = 0.3, random_state = 25)

X_scaler = StandardScaler()
X_train = X_scaler.fit_transform(X_train)
X_test = X_scaler.transform(X_test)

y_scaler = StandardScaler()
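# StandardScaler expects a 2-D array, so reshape the target to one column and flatten it back
y_train = y_scaler.fit_transform(y_train.values.reshape(-1, 1)).ravel()
y_test = y_scaler.transform(y_test.values.reshape(-1, 1)).ravel()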
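The scaling of the target can be sketched in isolation. The arrays below are hypothetical stand-ins for `y_train` and `y_test` (a few made-up prices in $1000s), and the names ending in `_demo` are illustrative. The point is that the scaler is fitted on the training values only, and that `inverse_transform` recovers the original units when predictions need to be reported as prices.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical prices (in $1000s), standing in for y_train / y_test.
y_train_demo = np.array([24.0, 21.6, 34.7, 33.4, 36.2])
y_test_demo = np.array([28.7, 22.9])

scaler = StandardScaler()
# fit_transform expects a 2-D array, hence reshape(-1, 1); ravel() flattens back.
y_train_scaled = scaler.fit_transform(y_train_demo.reshape(-1, 1)).ravel()
# The test set reuses the training mean and standard deviation.
y_test_scaled = scaler.transform(y_test_demo.reshape(-1, 1)).ravel()

# inverse_transform maps scaled values back to the original units.
y_test_restored = scaler.inverse_transform(y_test_scaled.reshape(-1, 1)).ravel()
print(y_test_restored)
```

Fitting the scaler on the training set alone keeps information about the test set from leaking into the preprocessing step.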

In [31]:
# Instantiate a random forest regressor, since we are predicting a continuous variable, and fit it on the training set

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [33]:
from sklearn import metrics

y_pred = model.predict(X_test)

print("Test Accuracy:", format(metrics.r2_score(y_test, y_pred) * 100, '.2f'), '%')
print("Mean Squared Error:", format(metrics.mean_squared_error(y_test, y_pred), '.5f'))

Test Accuracy: 83.25 %
Mean Squared Error: 0.14294
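The value printed as "Test Accuracy" is the R² score, and the mean squared error is measured on the standardized target, not in price units. The following sketch, using hypothetical true and predicted prices, shows how the two metrics behave under standardization: R² is unchanged, while the MSE in original units equals the scaled MSE multiplied by the variance used for scaling. In the notebook the scaler is fitted on `y_train`, so the relevant variance would be that of the training target.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true and predicted prices in original units.
y_true = np.array([24.0, 21.6, 34.7, 33.4, 36.2])
y_hat = np.array([25.1, 20.9, 33.0, 35.2, 34.8])

# Standardize both with the same mean and standard deviation.
mu, sigma = y_true.mean(), y_true.std()
y_true_s = (y_true - mu) / sigma
y_hat_s = (y_hat - mu) / sigma

r2_orig = r2_score(y_true, y_hat)
r2_scaled = r2_score(y_true_s, y_hat_s)
mse_orig = mean_squared_error(y_true, y_hat)
mse_scaled = mean_squared_error(y_true_s, y_hat_s)

print("R2 identical:", np.isclose(r2_orig, r2_scaled))
print("MSE rescaled by variance:", np.isclose(mse_orig, mse_scaled * sigma**2))
```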


Perform a grid search to tune the hyperparameters, then use the best estimator for scoring on the test set.

In [37]:
from sklearn.model_selection import GridSearchCV

parameters = {"min_samples_split": [2, 5, 10],
              "max_depth": [None, 2, 5, 10],
              "min_samples_leaf": [1, 3, 5],
              "max_features": ['auto', 'sqrt', 'log2'],
              "n_estimators": [50, 75, 100]
              }

grid_search = GridSearchCV(RandomForestRegressor(), param_grid=parameters, n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)

print("Best parameters set found on development set:\n")
print(grid_search.best_params_)

Fitting 3 folds for each of 324 candidates, totalling 972 fits


[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:    8.6s
[Parallel(n_jobs=-1)]: Done 196 tasks      | elapsed:   27.7s
[Parallel(n_jobs=-1)]: Done 446 tasks      | elapsed:   54.4s
[Parallel(n_jobs=-1)]: Done 796 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done 972 out of 972 | elapsed:  2.9min finished


Best parameters set found on development set:

{'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 75}
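The grid above has 3 × 4 × 3 × 3 × 3 = 324 parameter combinations, each evaluated with 3-fold cross-validation, which gives the 972 fits reported in the log. Beyond `best_params_`, the fitted search object also stores the score of every candidate in `cv_results_`. The sketch below shows how these could be inspected; it uses synthetic data and a deliberately tiny grid (`small_grid`, an assumption made to keep the run short), not the grid or data from the notebook.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; not the Boston data).
X, y = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=0)

# A tiny illustrative grid: 2 x 2 = 4 candidates.
small_grid = {"max_depth": [None, 5], "n_estimators": [20, 40]}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid=small_grid, cv=3)
search.fit(X, y)

# cv_results_ holds one row per candidate; rank 1 is the best mean CV score.
results = pd.DataFrame(search.cv_results_)
cols = ["params", "mean_test_score", "rank_test_score"]
print(results[cols].sort_values("rank_test_score"))
print("best params:", search.best_params_)
```

Looking at the full ranking, rather than only the winner, helps judge whether the best combination is clearly better or only marginally ahead of the alternatives.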


In [36]:
print("Accuracy for test data set:\n")
y_pred = grid_search.predict(X_test)
print("Test Accuracy:", format(metrics.r2_score(y_test, y_pred) * 100, '.2f'), '%')
print("Mean Squared Error:", format(metrics.mean_squared_error(y_test, y_pred), '.5f'))

Accuracy for test data set:

Test Accuracy: 87.39 %
Mean Squared Error: 0.10762


After tuning the hyperparameters, the R² score on the test set (printed as "Test Accuracy") increased by about 4 percentage points, from 83.25% to 87.39%, and the mean squared error on the standardized target dropped from 0.14 to 0.11.