In [1]:
import numpy as np
import pandas as pd
from pathlib import Path

In [2]:
data = pd.read_csv(Path('resources/house_clean_data_2.csv'))

In [3]:
data.head()

Unnamed: 0,SalePrice,LotArea,BedroomAbvGr,FullBath,HalfBath,Neighborhood,HouseStyle,OverallQual,YearBuilt
0,208500,8450,3,2,1,CollgCr,2Story,7,2003
1,181500,9600,3,2,0,Veenker,1Story,6,1976
2,223500,11250,3,2,1,CollgCr,2Story,7,2001
3,140000,9550,3,1,0,Crawfor,2Story,7,1915
4,250000,14260,4,2,1,NoRidge,2Story,8,2000


In [4]:
data.shape

(1094, 9)

In [5]:
## ENCODING 

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

X_train, X_test, y_train, y_test = train_test_split(data.drop(columns="SalePrice"), data.SalePrice, test_size=0.20, random_state=42)

enc = OneHotEncoder(handle_unknown='ignore')

# Fit the encoder on the training categories only, so nothing leaks from the test set
enc.fit(X_train.select_dtypes('object'))

def encode(X):
    """Return X with its object columns replaced by one-hot columns."""
    cat = X.select_dtypes('object')
    # get_feature_names() produces the x0_/x1_ prefixes; it was removed in scikit-learn 1.2 in favor of get_feature_names_out()
    onehot = pd.DataFrame(enc.transform(cat).toarray(), index=cat.index, columns=enc.get_feature_names().tolist())
    return pd.concat([X.select_dtypes(exclude='object'), onehot], axis=1)

X_train_encoded = encode(X_train)
X_test_encoded = encode(X_test)

In [6]:
X_train_encoded

Unnamed: 0,LotArea,BedroomAbvGr,FullBath,HalfBath,OverallQual,YearBuilt,x0_Blmngtn,x0_Blueste,x0_BrDale,x0_BrkSide,...,x0_Timber,x0_Veenker,x1_1.5Fin,x1_1.5Unf,x1_1Story,x1_2.5Fin,x1_2.5Unf,x1_2Story,x1_SFoyer,x1_SLvl
6,10084,3,2,0,8,2004,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
789,11361,3,2,0,6,1976,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1049,7415,3,2,1,6,2004,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
837,5400,2,1,0,5,1954,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
661,8750,3,2,1,6,1998,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
330,4280,2,1,0,5,1913,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
466,7943,3,1,0,4,1961,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
121,31770,3,1,0,6,1960,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1044,12665,4,2,1,8,2005,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


### OneHotEncoder

Encode categorical features as a one-hot numeric array.

The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features. The features are encoded using a one-hot (aka ‘one-of-K’ or ‘dummy’) encoding scheme. This creates one binary column per category and returns a sparse matrix or a dense array, depending on the sparse parameter.

By default, the encoder derives the categories based on the unique values in each feature. Alternatively, you can also specify the categories manually.

This encoding is needed for feeding categorical data to many scikit-learn estimators, notably linear models and SVMs with the standard kernels.
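To make the description above concrete, here is a minimal sketch on made-up data (the category values and variable names are illustrative, not taken from the housing file). It shows that the categories are learned from the data passed to fit, and that with handle_unknown='ignore', as used in this notebook, a category never seen during fit is encoded as a row of zeros.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): a single categorical feature
train = np.array([["1Story"], ["2Story"], ["SLvl"], ["1Story"]], dtype=object)
test = np.array([["2Story"], ["Split"]], dtype=object)  # "Split" was never seen during fit

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)

# Categories are derived from the unique values seen during fit
categories = enc.categories_[0].tolist()

# The default output is a sparse matrix; convert it to a dense array to inspect it
encoded = enc.transform(test).toarray()

print(categories)
print(encoded)
```

The first test row gets a single 1 in the 2Story column; the second row is all zeros because its category is unknown to the encoder.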

In [7]:
## MODEL  RandomForestRegressor

from sklearn.ensemble import RandomForestRegressor
regr = RandomForestRegressor(n_estimators=100, max_depth=None, min_samples_split=2, min_samples_leaf=1, random_state=42)
regr.fit(X_train_encoded, y_train)

RandomForestRegressor(random_state=42)

### Random Forest Regressor

A random forest regressor is a meta estimator that fits a number of decision tree regressors on various sub-samples of the dataset and averages their predictions to improve predictive accuracy and control over-fitting. The sub-sample size is controlled with the max_samples parameter if bootstrap=True (default); otherwise the whole dataset is used to build each tree.
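The averaging step can be checked directly. The sketch below uses synthetic data (the features, target, and n_estimators=25 are assumptions chosen for illustration, not the housing data) and compares the forest's prediction with the mean of the predictions of its individual trees, which are available through the estimators_ attribute.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (hypothetical): 200 samples, 3 features
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.1, size=200)

forest = RandomForestRegressor(n_estimators=25, random_state=42)
forest.fit(X, y)

X_new = rng.uniform(0, 10, size=(5, 3))

# One row of predictions per fitted tree
per_tree = np.stack([tree.predict(X_new) for tree in forest.estimators_])

# Averaging over the trees should reproduce the forest's own prediction
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(X_new)
averages_match = np.allclose(manual_average, forest_prediction)

print(averages_match)
```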

In [8]:
#regr = RandomForestRegressor(n_estimators=200, max_depth=10, min_samples_split=2, min_samples_leaf=2, random_state=42)
#regr.fit(X_train_encoded, y_train)

In [9]:
R2=regr.score(X_test_encoded, y_test)
R2

0.8384987104170214

In [10]:
# Adjusted R-Squared
# n = number of observations, p = number of features

n = X_test_encoded.shape[0]
p = X_test_encoded.shape[1]

Adj_r2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)

In [11]:
Adj_r2

0.8033112786084394

### Adjusted R-squared

Adjusted R2 is a corrected goodness-of-fit (model accuracy) measure for linear models. It identifies the percentage of variance in the target field that is explained by the input or inputs.

R2 tends to overestimate the fit of a linear regression: on the data used for fitting, it never decreases as more effects are added to the model, even when they carry no information. Adjusted R2 attempts to correct for this overestimation by penalizing the number of effects, so it can decrease when an added effect does not improve the model.
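The penalty can be seen with a small worked example. The helper name adjusted_r2 and the numbers below are hypothetical, chosen only to illustrate the formula used in the cell above: holding R2 and the number of observations fixed, a larger feature count gives a lower adjusted value.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R2 for n observations and p features."""
    if n - p - 1 <= 0:
        raise ValueError("adjusted R2 needs n > p + 1")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Same R2 and sample size, different numbers of features (illustrative values)
base = adjusted_r2(0.90, n=101, p=10)
more_features = adjusted_r2(0.90, n=101, p=20)

print(base, more_features)
```

With these inputs the adjusted value drops from about 0.889 to 0.875, although R2 itself is unchanged.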

In [12]:
#  Mean Absolute Error (MAE)

from sklearn.metrics import mean_absolute_error
# scikit-learn metrics take (y_true, y_pred); MAE is symmetric, so the value is unchanged
mean_absolute_error(y_test, regr.predict(X_test_encoded))


24206.873862143948

### Mean Absolute Error (MAE)

With any machine learning project, it is essential to measure the performance of the model. What we need is a metric to quantify the prediction error in a way that is easily understandable to an audience without a strong technical background. For regression problems, the Mean Absolute Error (MAE) is just such a metric.

The mean absolute error is the average absolute difference between the observations (true values) and the model output (predictions). The sign of each difference is ignored so that positive and negative errors do not cancel. If the signs were kept, the resulting average would likely be far lower than the true difference between model and data. (https://insidelearningmachines.com/mean_absolute_error/)
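The cancellation effect is easy to reproduce by hand. The prices below are invented for illustration (they are not from the housing data) and are picked so that the signed errors sum to exactly zero, while the absolute errors do not.

```python
# Hypothetical sale prices in thousands, chosen so the signed errors cancel
y_true = [200, 150, 300, 250]
y_pred = [210, 140, 320, 230]

# Signed errors: true value minus prediction
errors = [t - p for t, p in zip(y_true, y_pred)]

# Keeping the sign lets over- and under-predictions cancel out
mean_signed_error = sum(errors) / len(errors)

# Taking absolute values first gives the MAE
mae = sum(abs(e) for e in errors) / len(errors)

print(errors)             # [-10, 10, -20, 20]
print(mean_signed_error)  # 0.0
print(mae)                # 15.0
```

A mean signed error of zero would suggest a perfect model here, while the MAE shows that predictions are off by 15 on average.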