# Regression with Random Forest
* Build a Random Forest Regressor to predict the sale price of houses.



https://www.kaggle.com/pollux751/housing-prices-regression-with-random-forest/data

In [1]:
# First import all necessary libraries
import pandas as pd
import numpy as np

In [2]:
# Import training/testing data
df = pd.read_csv("data/train.csv")
df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


# Getting familiar with the dataset:


* print df.isnull().sum() to count the missing values in each column
* if the output is truncated and you cannot see every row, set pd.options.display.max_rows = 100 before printing

In [3]:
pd.options.display.max_rows = 100
# see some info
df.isnull().sum()

Id                  0
MSSubClass          0
MSZoning            0
LotFrontage       259
LotArea             0
Street              0
Alley            1369
LotShape            0
LandContour         0
Utilities           0
LotConfig           0
LandSlope           0
Neighborhood        0
Condition1          0
Condition2          0
BldgType            0
HouseStyle          0
OverallQual         0
OverallCond         0
YearBuilt           0
YearRemodAdd        0
RoofStyle           0
RoofMatl            0
Exterior1st         0
Exterior2nd         0
MasVnrType          8
MasVnrArea          8
ExterQual           0
ExterCond           0
Foundation          0
BsmtQual           37
BsmtCond           37
BsmtExposure       38
BsmtFinType1       37
BsmtFinSF1          0
BsmtFinType2       38
BsmtFinSF2          0
BsmtUnfSF           0
TotalBsmtSF         0
Heating             0
HeatingQC           0
CentralAir          0
Electrical          1
1stFlrSF            0
2ndFlrSF            0
LowQualFin

***YOUR TURN***
#### Remove the columns that have too many missing values:
'Alley', 'PoolQC', 'Fence', 'MiscFeature', 'FireplaceQu'
and print the shape of the DataFrame after removing them

In [4]:
df.drop(['Alley', 'PoolQC', 'Fence', 'MiscFeature', 'FireplaceQu'],axis=1,inplace=True) 
df.shape

(1460, 76)
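
Instead of typing the column names by hand, the columns to drop can also be picked from the share of missing values. The sketch below uses a small made-up DataFrame and a hypothetical cutoff of 50% missing; neither comes from the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are made up).
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "Alley": [np.nan, np.nan, np.nan, "Grvl"],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
})

# Fraction of missing values per column (mean of a boolean mask).
missing_ratio = toy.isnull().mean()

# Hypothetical cutoff: drop columns with more than half of their values missing.
threshold = 0.5
to_drop = missing_ratio[missing_ratio > threshold].index.tolist()

reduced = toy.drop(columns=to_drop)
print(to_drop)        # ['Alley']
print(reduced.shape)  # (4, 2)
```

The right cutoff depends on the dataset, so treat 0.5 only as a starting point.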

***YOUR TURN*** remove the Id column as well

In [5]:
# Id is only a row identifier with no predictive value, so we drop it as well
df.drop(['Id'],axis=1,inplace=True) 

### Deal with categorical data

##### We still have both categorical data and missing values, and we need to handle both before training.

***YOUR TURN*** fill the missing values in df['LotFrontage'] with the median of the column
* we did this in a previous notebook, but if you don't remember how to do it, try searching for: 'how to fill missing value in pandas with the median of the column?'

In [6]:
df['LotFrontage'] = df['LotFrontage'].fillna(df['LotFrontage'].median())
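
To see why the median is a reasonable fill value, here is a small sketch on a made-up Series with one extreme value. The numbers are illustrative only and are not taken from the housing data.

```python
import numpy as np
import pandas as pd

# Made-up lot frontages with one missing value and one outlier (300.0).
s = pd.Series([60.0, 65.0, np.nan, 80.0, 300.0], name="LotFrontage")

median = s.median()  # NaN is skipped: median of 60, 65, 80, 300
mean = s.mean()      # the mean is pulled up by the outlier

filled = s.fillna(median)

print(median)                 # 72.5
print(mean)                   # 126.25
print(filled.isnull().sum())  # 0
```

The median stays close to the typical values, while the mean is dragged toward the outlier.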


#### YOUR TURN drop all remaining rows that contain NaN values

* NOTICE: always make sure the target column is still part of the DataFrame when you drop rows, so that features and targets stay aligned.

In [7]:
df.dropna(inplace = True)
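
The note about the target column can be illustrated with a toy example: if rows are dropped while the target is still in the DataFrame, the features and targets extracted afterwards share the same index. The data below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: the second row has a missing value in a feature column.
toy = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786, 1717],
    "Electrical": ["SBrkr", np.nan, "SBrkr", "FuseA"],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Drop rows first, with the target still included...
cleaned = toy.dropna()

# ...then split, so both parts lose exactly the same rows.
toy_targets = cleaned["SalePrice"]
toy_features = cleaned.drop("SalePrice", axis=1)

print(len(cleaned))                                   # 3
print(toy_features.index.equals(toy_targets.index))   # True
```

Splitting before dropping would leave a target value for a row that no longer exists in the features.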

* check again that we don't have any missing values at this point

In [8]:
df.isnull().sum()

MSSubClass       0
MSZoning         0
LotFrontage      0
LotArea          0
Street           0
LotShape         0
LandContour      0
Utilities        0
LotConfig        0
LandSlope        0
Neighborhood     0
Condition1       0
Condition2       0
BldgType         0
HouseStyle       0
OverallQual      0
OverallCond      0
YearBuilt        0
YearRemodAdd     0
RoofStyle        0
RoofMatl         0
Exterior1st      0
Exterior2nd      0
MasVnrType       0
MasVnrArea       0
ExterQual        0
ExterCond        0
Foundation       0
BsmtQual         0
BsmtCond         0
BsmtExposure     0
BsmtFinType1     0
BsmtFinSF1       0
BsmtFinType2     0
BsmtFinSF2       0
BsmtUnfSF        0
TotalBsmtSF      0
Heating          0
HeatingQC        0
CentralAir       0
Electrical       0
1stFlrSF         0
2ndFlrSF         0
LowQualFinSF     0
GrLivArea        0
BsmtFullBath     0
BsmtHalfBath     0
FullBath         0
HalfBath         0
BedroomAbvGr     0
KitchenAbvGr     0
KitchenQual      0
TotRmsAbvGrd

# create the targets and features

***YOUR TURN*** create a targets variable with the sale price and a features variable with all the other columns

In [9]:
targets = df["SalePrice"]
features = df.drop("SalePrice", axis = 1)

### Consider the categorical data

***YOUR TURN*** using get_dummies, create the dummy columns for all the categorical features, and also set drop_first=True
* there are a few ways you can achieve this.
* the quickest one is to call pd.get_dummies directly on the features DataFrame with drop_first=True
* in the solution I went for the longer route, since I wanted to be sure that integer and float values were not treated as categorical. I don't think it is necessary, and the short way should be fine. Check both methods; in the end we should all have 227 features

In [10]:
featuresObjects = features.loc[:, features.dtypes == object]
featuresObjects.columns

Index(['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities',
       'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
       'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
       'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation',
       'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
       'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual',
       'Functional', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond',
       'PavedDrive', 'SaleType', 'SaleCondition'],
      dtype='object')

In [11]:
features.drop(['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities',
       'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
       'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
       'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation',
       'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
       'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual',
       'Functional', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond',
       'PavedDrive', 'SaleType', 'SaleCondition'],axis=1,inplace=True) 

In [12]:
featuresObjectsDummies = pd.get_dummies(featuresObjects, drop_first=True)

In [13]:

features = features.join(featuresObjectsDummies)
features.shape

(1338, 227)
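
As a quick check that the short and the long route agree, here is a sketch on a small made-up DataFrame. The column values are illustrative, and the names short_way and long_way are introduced only for this example.

```python
import pandas as pd

# Toy frame with two numeric and two categorical columns (made-up values).
toy = pd.DataFrame({
    "MSSubClass": [60, 20, 60, 70],
    "LotArea": [8450.0, 9600.0, 11250.0, 9550.0],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
    "LotShape": ["Reg", "Reg", "IR1", "IR2"],
})

# Short way: get_dummies only encodes the non-numeric columns by default.
short_way = pd.get_dummies(toy, drop_first=True)

# Long way: separate the categorical columns, encode them, and join back.
cat_cols = toy.select_dtypes(exclude="number").columns
numeric_part = toy.drop(columns=cat_cols)
long_way = numeric_part.join(pd.get_dummies(toy[cat_cols], drop_first=True))

print(short_way.shape)                              # (4, 5)
print(set(short_way.columns) == set(long_way.columns))  # True
```

Numeric columns such as MSSubClass pass through unchanged in both routes, which is why the two results match.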

## Train split

***YOUR TURN*** import the library for splitting and split the data into training and testing sets

In [14]:
from sklearn.model_selection import train_test_split
# Split the data into training and testing sets (30% held out for testing)
x_train, x_test, y_train, y_test = train_test_split(features, targets, test_size=0.3)
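
Because the split is random, the scores further down will change slightly on every run. A minimal sketch of how a fixed seed makes the split repeatable, using made-up arrays and an arbitrary seed value of 42 (an assumption, not part of the original solution):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# The same random_state gives the same split on every call.
a_train, a_test, ya_train, ya_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
b_train, b_test, yb_train, yb_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

print(a_train.shape, a_test.shape)     # (7, 2) (3, 2)
print(np.array_equal(a_test, b_test))  # True
```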

***YOUR TURN*** import RandomForestRegressor from sklearn.ensemble

In [15]:
from sklearn.ensemble import RandomForestRegressor

***YOUR TURN*** create the model and set the number of estimators (n_estimators) and max_features (look at the documentation to see what max_features does)

In [16]:
# Create random forest regression predictor
regr = RandomForestRegressor(n_estimators=150, max_features=10)
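
In scikit-learn, max_features is the number of features each tree considers when looking for the best split at a node: an integer means exactly that many, and None means all of them. The sketch below checks this on synthetic data from make_regression; the dataset sizes and seed are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression problem with 20 features (illustrative only).
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

resolved = {}
for max_features in (2, 10, None):
    forest = RandomForestRegressor(n_estimators=10,
                                   max_features=max_features,
                                   random_state=0)
    forest.fit(X, y)
    # Each fitted tree stores the number of features it samples per split.
    resolved[max_features] = forest.estimators_[0].max_features_

print(resolved)  # {2: 2, 10: 10, None: 20}
```

Smaller values make the trees more different from each other, at the cost of each split seeing fewer candidate features.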

***YOUR TURN*** fit the model with x_train and y_train

In [17]:
# Fit the model on the training data
regr.fit(x_train, y_train)

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features=10, max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=150, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)

***YOUR TURN*** print the training score and the testing score

In [18]:
print(regr.score(x_train, y_train))
print(regr.score(x_test, y_test))

0.9722123386541784
0.8630794029489773
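
For a regressor, score returns the coefficient of determination R², so 1.0 is a perfect fit and 0.0 is no better than always predicting the mean. A minimal NumPy sketch of the formula, with a hypothetical helper r_squared and made-up numbers:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (hypothetical helper for illustration)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return 1.0 - ss_res / ss_tot

y_true = [100.0, 200.0, 300.0, 400.0]

perfect = r_squared(y_true, y_true)                        # exact predictions
baseline = r_squared(y_true, [250.0] * 4)                  # always the mean
close = r_squared(y_true, [110.0, 190.0, 310.0, 390.0])    # small errors

print(perfect, baseline, close)  # 1.0 0.0 0.992
```

A training score well above the testing score, as in the output above, suggests the forest fits the training data more closely than it generalizes.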


# Predictions 

***YOUR TURN*** store the predictions in a variable called predictions

In [19]:
predictions = regr.predict(x_test)

### Mean Square Error Evaluation

***YOUR TURN*** import mean_squared_error from sklearn.metrics and print the square root of the MSE between y_test and predictions; use np.sqrt for the square root.

In [20]:
from sklearn.metrics import mean_squared_error
np.sqrt(mean_squared_error(y_test, predictions))

29569.96413369838

This tells us that the typical prediction error (the RMSE) is about 29,570 dollars, which makes sense if we look at the range of the price values and the test score of our model.
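
The root of the MSE is not quite a plain average of the errors: squaring gives large misses more weight. The sketch below compares it with the mean absolute error on made-up prices, computed by hand with NumPy.

```python
import numpy as np

# Made-up sale prices and predictions; one prediction is off by 40,000.
y_true = np.array([200000.0, 150000.0, 300000.0, 250000.0])
y_pred = np.array([210000.0, 140000.0, 300000.0, 210000.0])

errors = y_pred - y_true

rmse = np.sqrt(np.mean(errors ** 2))  # root mean squared error
mae = np.mean(np.abs(errors))         # mean absolute error

print(round(rmse, 2))  # 21213.2
print(mae)             # 15000.0
```

The single 40,000 miss pushes the RMSE well above the MAE, so both numbers are worth looking at when judging a price model.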

## Features importance

***YOUR TURN*** print the features importances

* in the solution I show you how to put feature_importances_ into its own DataFrame, so the output is nicer to read and you can also sort it by value.

In [21]:
feature_importances = pd.DataFrame(regr.feature_importances_, index = features.columns, 
                                   columns=['importance']).sort_values('importance', ascending=False)
feature_importances.head(20)

Unnamed: 0,importance
OverallQual,0.080441
TotalBsmtSF,0.063799
GrLivArea,0.061213
GarageCars,0.051464
GarageArea,0.048305
1stFlrSF,0.044135
GarageYrBlt,0.030624
ExterQual_TA,0.028013
YearBuilt,0.027121
LotArea,0.026851
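
The importances of a fitted forest are normalized to sum to 1, which is why each value looks small when there are many features. A small sketch on synthetic data; the feature names f0 to f5 and the dataset are assumptions made only for this example.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with 6 features, only 2 of them informative.
X, y = make_regression(n_samples=200, n_features=6, n_informative=2,
                       noise=5.0, random_state=0)
cols = [f"f{i}" for i in range(6)]  # hypothetical feature names

forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(X, y)

# Put the importances in a Series indexed by feature name and sort them.
importances = pd.Series(forest.feature_importances_, index=cols)
importances = importances.sort_values(ascending=False)

print(importances)
print(importances.sum())  # approximately 1.0
```

Read the values as relative shares within this model rather than as absolute effects on the price.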
