<a href="https://colab.research.google.com/github/DavidBillayio/PythonMLtips/blob/master/RandomForestRegressor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this example we predict the sale prices of the homes in the test data set. The training data lists a sale price for each home; the test data does not.

Your job is to use the following code to predict the sale prices of the homes in the test data.

In [1]:
# A simple random forest regressor; several parameter settings are defined for comparison.

# import the modules

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
print("modules imported")

In [7]:
#Read the data

full_train_data = pd.read_csv('full_train_data.csv')
test_data = pd.read_csv('test_data.csv')

# Let's first look at the training data

print(full_train_data.head())

   area  bedrooms  bathrooms lot facing  saleprice
0  3836         2          2          N     398532
1   248         4          1          S     289967
2   496         2          1          E     236893
3  2295         3          1          W     309548
4   670         3          2          N     157009


In [32]:
#Notice that 'lot facing' holds compass directions (N, S, E, W) as text. A one-hot encoder turns each direction into its own 0/1 column, which a model can use.

from sklearn.preprocessing import OneHotEncoder

cols = ['lot facing']
#scikit-learn 1.2 and later name this argument sparse_output; older releases call it sparse
OH_encoder = OneHotEncoder(sparse_output=False)
OH_train = OH_encoder.fit_transform(full_train_data[cols])


#notice the new data columns added
print(OH_train)


0      N
1      S
2      E
3      W
4      N
      ..
317    N
318    S
319    E
320    W
321    N
Name: lot facing, Length: 322, dtype: object
[[0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [1. 0. 0. 0.]
 ...
 [1. 0. 0. 0.]
 [0. 0. 0. 1.]
 [0. 1. 0. 0.]]


In [33]:
#But which column is which direction? categories_ lists the categories in column order.

OH_encoder.categories_

[array(['E', 'N', 'S', 'W'], dtype=object)]

In [21]:
#Define the target and features we will be using to predict the sale price

#define the target
y = full_train_data.saleprice

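#define the features we are interested in using to predict
#note: 'lot facing' is still text here and must be replaced by its one-hot columns before a model is fit
features = ['area', 'bedrooms', 'bathrooms', 'lot facing']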

#define the input features in a new dataframe
X = full_train_data[features].copy()

<class 'pandas.core.frame.DataFrame'>


In [23]:
#Split the training data into a training set (80%) and a validation set (20%); the test data is left untouched

X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

In [None]:
#We will want to try several models using various parameters to see which model will work best

#Define several random forest regressors to compare.
model_1 = RandomForestRegressor(n_estimators=50, random_state=0)
model_2 = RandomForestRegressor(n_estimators=100, random_state=0)
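#criterion='mae' was renamed to 'absolute_error' in scikit-learn 1.0 and the old name was removed in 1.2
model_3 = RandomForestRegressor(n_estimators=100, criterion='absolute_error', random_state=0)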
model_4 = RandomForestRegressor(n_estimators=200, min_samples_split=20, random_state=0)
model_5 = RandomForestRegressor(n_estimators=100, max_depth=7, random_state=0)
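
To see which model works best, each one can be fit on the training set and scored on the validation set with `mean_absolute_error`, which was imported at the top. The sketch below shows one way to do that. It runs on synthetic numeric data rather than the housing files, so the scores it prints say nothing about the real data; `score_model`, `candidates`, `scores`, and `best` are hypothetical names, and only two of the five settings are included to keep it short.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic numeric data standing in for the encoded housing features
rng = np.random.default_rng(0)
X_demo = rng.uniform(200, 4000, size=(200, 3))
y_demo = 100 * X_demo[:, 0] + rng.normal(0, 10_000, size=200)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, train_size=0.8, test_size=0.2, random_state=0
)


def score_model(model, X_t=X_tr, X_v=X_va, y_t=y_tr, y_v=y_va):
    """Fit on the training split and return the MAE on the validation split."""
    model.fit(X_t, y_t)
    return mean_absolute_error(y_v, model.predict(X_v))


candidates = {
    "n_estimators=50": RandomForestRegressor(n_estimators=50, random_state=0),
    "max_depth=7": RandomForestRegressor(n_estimators=100, max_depth=7, random_state=0),
}

scores = {name: score_model(model) for name, model in candidates.items()}
for name, mae in scores.items():
    print(f"{name}: MAE = {mae:,.0f}")

# Lower MAE is better, so the best candidate has the smallest score
best = min(scores, key=scores.get)
print("best:", best)
```

With the real data, the same loop would take `model_1` through `model_5` and the encoded training and validation features in place of the synthetic arrays.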