<a href="https://colab.research.google.com/github/DavidBillayio/PythonMLtips/blob/master/RandomForestRegressor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Random Forest Regressors
For this example we seek to predict the sale prices in the test data set. You will notice that the training data has the sale prices listed for a number of homes and the test data is missing the sale prices. 

Your job is to use the following code to predict the sale prices of the test data homes.

Why use a Random Forest Regressor?

A random forest regressor is an ensemble algorithm. This means that it combines many instances of the same base model (here, decision trees) to make a better prediction than any single instance would. Regression is, of course, the task of predicting a continuous response.

The Random Forest Regressor trains a number of randomized decision trees, each a relatively weak predictor on its own, and averages their outputs to create a stronger prediction. (The classification variant combines its trees by voting instead.)
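
For regression, the trees' outputs are combined by averaging, and this can be checked directly. The following is a minimal sketch on synthetic data (the arrays `X_demo` and `y_demo` are made up for illustration and are not the housing data): it fits a small forest and compares the forest's prediction with the mean of its individual trees' predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, assumed purely for illustration
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 3 * X_demo[:, 0] + rng.normal(0, 1, size=200)

forest = RandomForestRegressor(n_estimators=25, random_state=0)
forest.fit(X_demo, y_demo)

# Each fitted tree is available in forest.estimators_
per_tree = np.stack([tree.predict(X_demo) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# Compare the hand-computed average with the forest's own prediction
matches = np.allclose(manual_average, forest.predict(X_demo))
print(matches)
```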

A disadvantage of Random Forest Regressors is that their output is not in fact strictly continuous. They inherit limitations of the underlying tree algorithm: each prediction is an average of training targets, so the output is piecewise constant and the model cannot predict outside the range of the training data.
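
The range limitation is easy to see on a toy problem. The sketch below assumes a made-up linear relationship (y = 2x, with x from 0 to 10) rather than the housing data, and asks the forest for a prediction far outside the training range.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic linear relationship y = 2x, assumed purely for illustration
x_train = np.arange(0, 11, dtype=float).reshape(-1, 1)   # x from 0 to 10
y_train_demo = 2 * x_train.ravel()                        # y from 0 to 20

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(x_train, y_train_demo)

# x = 20 lies far outside the training range; the true value would be 40
far_prediction = forest.predict(np.array([[20.0]]))[0]
print(far_prediction, "vs. largest training target", y_train_demo.max())
```

Because every leaf value is an average of training targets, the prediction cannot exceed the largest target seen during training, no matter how far x moves.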

Two ensemble terms worth knowing:

bagging - an ensemble meta-algorithm designed to improve stability, reduce variance, and improve accuracy. Random forests are built on bagging.

boosting - an ensemble meta-algorithm for reducing bias and variance in supervised learning.
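
To make bagging concrete, here is a hand-rolled sketch, assuming synthetic noisy sine data and plain `DecisionTreeRegressor` models (none of this is the housing data, and a real random forest also randomizes the features considered at each split). Each tree is trained on a bootstrap sample, and the bagged prediction is the average across trees.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy sine curve, assumed purely for illustration
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.3, size=300)

n_models = 20
trees = []
for _ in range(n_models):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    trees.append(DecisionTreeRegressor(random_state=0).fit(X_demo[idx], y_demo[idx]))

X_new = np.linspace(0, 10, 50).reshape(-1, 1)
all_preds = np.stack([t.predict(X_new) for t in trees])   # shape (n_models, 50)
bagged_prediction = all_preds.mean(axis=0)

# Average absolute error against the noise-free signal:
# individual trees versus the bagged average
signal = np.sin(X_new[:, 0])
single_tree_error = np.abs(all_preds - signal).mean()
bagged_error = np.abs(bagged_prediction - signal).mean()
print(single_tree_error, bagged_error)
```

The averaged prediction can never have a larger mean absolute error than the trees have on average, which is the variance reduction that bagging is after.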

In [None]:
# A simple random forest regressor that is tested and optimized for some parameters.

# import the modules

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
print("modules imported")

In [None]:
#Read the data

full_train_data = pd.read_csv('full_train_data.csv')
test_data = pd.read_csv('test_data.csv')

# Let's first look at the training data
print(full_train_data.head())

In [None]:
#Notice that 'lot facing' takes the values North, South, East and West. A One Hot encoder turns this categorical column into numeric columns the model can use.

#first, and most importantly, we make a copy to avoid changing the original data
X_train = full_train_data.copy()

#import OneHot Encoder
from sklearn.preprocessing import OneHotEncoder

cols = ['lot facing']
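# sparse_output replaced the older `sparse` argument in scikit-learn 1.2; on older versions use sparse=False
OH_encoder = OneHotEncoder(sparse_output=False)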
OH_train = pd.DataFrame(OH_encoder.fit_transform(X_train[cols]))
#notice the new data columns
print(OH_train)

In [None]:
#But how do we know which column is which? By inspecting the categories the encoder learned:
OH_encoder.categories_

In [None]:
#So we create the new column headings for the One Hot Encoded columns
OH_train.columns =['E', 'N', 'S', 'W']


#just to double check:
print(OH_train)
print(X_train)
print(X_train['lot facing'].value_counts())
OH_train.sum()

In [None]:
#at a high level it checks out.
# We must now add the OH values to the dataframe

OH_X_train = pd.concat([X_train,OH_train], axis = 1)
print(OH_X_train.head())

In [None]:
#that was a lot of work, let's use something easier for the same result.

# We will re-copy our data for the second try

X_train2 = full_train_data.copy()

In [None]:
from sklearn.compose import ColumnTransformer

transformer = ColumnTransformer(
    transformers=[
        ('lot facing',        # Just a name
         OneHotEncoder(), # The transformer class
         [3]            # The column(s) to be applied on.
         )
    ], remainder='passthrough'
)
OH_X_train2 = pd.DataFrame(transformer.fit_transform(X_train2))
OH_X_train2.columns =['E', 'N', 'S', 'W', 'area', 'bedrooms', 'bathrooms', 'saleprice']
print(OH_X_train2)

In [None]:
#that was much easier. We continue.

In [None]:
#Next, we define the target and features we will be using to predict the sale price

#define the target
y = OH_X_train2.saleprice

#define the features we are interested in using to predict
features = ['area', 'bedrooms', 'bathrooms', 'N', 'S', 'E', 'W']

#define the input features in a new dataframe
X = OH_X_train2[features].copy()
print(X)

In [None]:
#Split the training data into a training set and a validation set

X_train, X_valid, y_train, y_valid = train_test_split(X,y, train_size = 0.8, test_size = 0.2, random_state = 0)
print(X_train, X_valid, y_train, y_valid)

In [None]:
#We will want to try several models using various parameters to see which model will work best

#Define several random forest regressors to compare.
model_1 = RandomForestRegressor(n_estimators=50, random_state=0)
model_2 = RandomForestRegressor(n_estimators=100, random_state=0)
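# criterion='mae' was renamed to 'absolute_error' in scikit-learn 1.0 and the old name was later removed
model_3 = RandomForestRegressor(n_estimators=100, criterion='absolute_error', random_state=0)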
model_4 = RandomForestRegressor(n_estimators=200, min_samples_split=20, random_state=0)
model_5 = RandomForestRegressor(n_estimators=10000, max_depth=4, random_state=0)

models = [model_1, model_2, model_3, model_4, model_5]
print('models loaded')

In [None]:
# next we will define a function to score each model

def score_model(model, Xt, Xv, Yt, Yv):
  """takes in the model, the training and validation data and returns the mean absolute error"""
  model.fit(Xt,Yt)
  prediction = model.predict(Xv)
  return mean_absolute_error(Yv, prediction)

In [None]:
for i, model in enumerate(models, start=1):
  mae = score_model(model, X_train, X_valid, y_train, y_valid)
  print("Model {} MAE: {}".format(i, mae))

#Is the error good? 

It doesn't look like it, but we will take the model with the lowest error. What are the issues with this?
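
One issue is that each model is scored on a single validation split, so the ranking can depend on which rows happened to land in that split. A possible remedy, sketched here on made-up data rather than the housing data, is K-fold cross-validation with `cross_val_score`, which averages the error over several splits.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data, assumed purely for illustration
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(120, 3))
y_demo = 5 * X_demo[:, 0] + rng.normal(0, 2, size=120)

candidate = RandomForestRegressor(n_estimators=50, random_state=0)

# scikit-learn maximizes scores, so MAE is exposed as its negative
neg_mae = cross_val_score(candidate, X_demo, y_demo, cv=5,
                          scoring='neg_mean_absolute_error')
fold_mae = -neg_mae

# Mean and spread of the MAE across the five folds
print(fold_mae.mean(), fold_mae.std())
```

The spread across folds gives a sense of how much of the difference between two models could be down to the split alone.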

After all of that, what were we doing again? That's right, predicting the test values.

In [None]:
#What do we need to do first?

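#that's right, one-hot encode the test data the same way as the training data.
# Fit the transformer on the training columns that also exist in the test data (everything except the sale price),
# then only transform the test data, so it is encoded with the categories learned in training.
# This assumes the test columns have the same names and order as the training feature columns.
transformer.fit(full_train_data[test_data.columns])
OH_test = pd.DataFrame(transformer.transform(test_data))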
OH_test.columns =['E', 'N', 'S', 'W', 'area', 'bedrooms', 'bathrooms']
print(OH_test)

In [None]:
#initiate chosen model
chosen_model = model_5

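# refit on all of the training data, then predict with the test columns in the same order used for fitting
chosen_model.fit(X,y)
prediction_test = chosen_model.predict(OH_test[features])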
output_data = pd.DataFrame({'sale price' : prediction_test})
output = pd.concat([test_data,output_data], axis = 1)
output.to_csv('submission.csv', index=False)

#Some interesting other information about our model:

In [23]:
feature_importance = chosen_model.feature_importances_

# feature_importances_ follows the column order of X (the data the model was fit on), not OH_test
print("Feature ranking:")
for column, importance in zip(X.columns, feature_importance):
    print("{}. {} ".format(column, importance))


Feature ranking:
area. 0.6078583837647725
bedrooms. 0.11054492785198458
bathrooms. 0.12717739801475603
N. 0.01972355786594817
S. 0.06734195486872883
E. 0.022403869857629183
W. 0.04494990777618075
