# Description
This notebook is an entry for a beginner Kaggle competition. The dataset has 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, and the main objective is to predict the final sale price of each home. More information is available here: https://www.kaggle.com/competitions/home-data-for-ml-course/overview

## Evaluation metric
Submissions are evaluated on the RMSE (Root Mean Squared Error) between the logarithm of the predicted value and the logarithm of the observed sale price. Taking logs means that errors in predicting expensive houses and cheap houses affect the result equally.
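
To see why taking logarithms puts cheap and expensive houses on the same footing, here is a minimal sketch in plain Python. The function name `log_rmse` and the example prices are illustrative assumptions, not part of the competition's own code.

```python
import math

def log_rmse(actual, predicted):
    """RMSE between log(predicted) and log(actual); all prices must be positive."""
    squared_log_errors = [
        (math.log(p) - math.log(a)) ** 2 for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))

# Hypothetical prices: both houses are over-predicted by 10%.
cheap = log_rmse([100_000], [110_000])
expensive = log_rmse([500_000], [550_000])

# Both scores equal log(1.1), although the absolute errors differ by a factor of 5.
print(cheap, expensive)
```

On the log scale only the ratio between prediction and actual price matters, so a 10% miss costs the same for either house.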

### Root Mean Squared Error
RMSE is a commonly used evaluation metric that measures the differences between the values predicted by a regression model and the actual values in the dataset. These are the steps to calculate it:
* Take the difference between each predicted value and its corresponding actual value;
* Square each of these differences;
* Calculate the average of these squared differences;
* Take the square root of the average.

The formula for RMSE is as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(\mathrm{predictedValue}_i - \mathrm{actualValue}_i\right)^2}$$

So, the lower the RMSE, the better the model.
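
The four steps listed above can be sketched in plain Python as follows. The function name `rmse` and the three example prices are made up for illustration.

```python
import math

def rmse(actual, predicted):
    # Step 1: difference between each predicted value and its actual value
    differences = [p - a for a, p in zip(actual, predicted)]
    # Step 2: square each difference
    squared = [d ** 2 for d in differences]
    # Step 3: average the squared differences
    mean_squared = sum(squared) / len(squared)
    # Step 4: take the square root of the average
    return math.sqrt(mean_squared)

# Hypothetical sale prices and predictions
actual = [200_000, 150_000, 300_000]
predicted = [210_000, 140_000, 300_000]

result = rmse(actual, predicted)
print(round(result, 2))  # about 8164.97
```

Two of the three predictions are off by 10,000 and one is exact, so the RMSE lands below 10,000 but above the mean absolute error of roughly 6,667, because squaring gives larger errors more weight.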

In [2]:
# importing libraries
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

In [34]:
# importing the train DataFrame
train_file_path = './data/train.csv'
train_df = pd.read_csv(train_file_path)

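# splitting into features and target, aka X & y
# note: several names are listed twice below (e.g. 'LotArea'), so X contains duplicated columns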
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 
            'TotRmsAbvGrd', 'MSSubClass', 'LotArea', 'OverallQual', 'OverallCond',
            'YearBuilt', 'YearRemodAdd','1stFlrSF','2ndFlrSF','LowQualFinSF','GrLivArea',
            'FullBath','HalfBath','BedroomAbvGr','KitchenAbvGr','TotRmsAbvGrd','Fireplaces',
            'WoodDeckSF','OpenPorchSF','EnclosedPorch','3SsnPorch','ScreenPorch','PoolArea',
            'MiscVal','MoSold','YrSold']
X = train_df[features]
y = train_df.SalePrice

In [22]:
# visualizing the feature rows
X

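| | LotArea | YearBuilt | 1stFlrSF | 2ndFlrSF | FullBath | BedroomAbvGr | TotRmsAbvGrd | MSSubClass | LotArea.1 | OverallQual | ... | Fireplaces | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8450 | 2003 | 856 | 854 | 2 | 3 | 8 | 60 | 8450 | 7 | ... | 0 | 0 | 61 | 0 | 0 | 0 | 0 | 0 | 2 | 2008 |
| 1 | 9600 | 1976 | 1262 | 0 | 2 | 3 | 6 | 20 | 9600 | 6 | ... | 1 | 298 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 2007 |
| 2 | 11250 | 2001 | 920 | 866 | 2 | 3 | 6 | 60 | 11250 | 7 | ... | 1 | 0 | 42 | 0 | 0 | 0 | 0 | 0 | 9 | 2008 |
| 3 | 9550 | 1915 | 961 | 756 | 1 | 3 | 7 | 70 | 9550 | 7 | ... | 1 | 0 | 35 | 272 | 0 | 0 | 0 | 0 | 2 | 2006 |
| 4 | 14260 | 2000 | 1145 | 1053 | 2 | 4 | 9 | 60 | 14260 | 8 | ... | 1 | 192 | 84 | 0 | 0 | 0 | 0 | 0 | 12 | 2008 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1455 | 7917 | 1999 | 953 | 694 | 2 | 3 | 7 | 60 | 7917 | 6 | ... | 1 | 0 | 40 | 0 | 0 | 0 | 0 | 0 | 8 | 2007 |
| 1456 | 13175 | 1978 | 2073 | 0 | 2 | 3 | 7 | 20 | 13175 | 6 | ... | 2 | 349 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 2010 |
| 1457 | 9042 | 1941 | 1188 | 1152 | 2 | 4 | 9 | 70 | 9042 | 7 | ... | 2 | 0 | 60 | 0 | 0 | 0 | 0 | 2500 | 5 | 2010 |
| 1458 | 9717 | 1950 | 1078 | 0 | 1 | 2 | 5 | 20 | 9717 | 5 | ... | 0 | 366 | 0 | 112 | 0 | 0 | 0 | 0 | 4 | 2010 |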


In [23]:
# visualizing target values
y

0       208500
1       181500
2       223500
3       140000
4       250000
         ...  
1455    175000
1456    210000
1457    266500
1458    142125
1459    147500
Name: SalePrice, Length: 1460, dtype: int64

In [35]:
# splitting the training dataset into train and validation sets
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

In [36]:
# visualizing train_X
train_X

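| | LotArea | YearBuilt | 1stFlrSF | 2ndFlrSF | FullBath | BedroomAbvGr | TotRmsAbvGrd | MSSubClass | LotArea.1 | OverallQual | ... | Fireplaces | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 10084 | 2004 | 1694 | 0 | 2 | 3 | 7 | 20 | 10084 | 8 | ... | 1 | 255 | 57 | 0 | 0 | 0 | 0 | 0 | 8 | 2007 |
| 807 | 21384 | 1923 | 1072 | 504 | 1 | 3 | 6 | 70 | 21384 | 5 | ... | 1 | 0 | 312 | 0 | 0 | 0 | 0 | 0 | 5 | 2009 |
| 955 | 7136 | 1946 | 979 | 979 | 2 | 4 | 8 | 90 | 7136 | 6 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8 | 2007 |
| 1040 | 13125 | 1957 | 1803 | 0 | 2 | 3 | 8 | 20 | 13125 | 5 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2006 |
| 701 | 9600 | 1969 | 1164 | 0 | 1 | 3 | 6 | 20 | 9600 | 7 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | 2006 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 715 | 10140 | 1974 | 1350 | 0 | 2 | 3 | 7 | 20 | 10140 | 6 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8 | 2009 |
| 905 | 9920 | 1954 | 1063 | 0 | 1 | 3 | 6 | 20 | 9920 | 5 | ... | 0 | 0 | 0 | 164 | 0 | 0 | 0 | 0 | 2 | 2010 |
| 1096 | 6882 | 1914 | 773 | 582 | 1 | 3 | 7 | 70 | 6882 | 6 | ... | 0 | 136 | 0 | 115 | 0 | 0 | 0 | 0 | 3 | 2007 |
| 235 | 1680 | 1971 | 483 | 504 | 1 | 2 | 5 | 160 | 1680 | 6 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8 | 2008 |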


In [37]:
# visualizing val_X
val_X

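| | LotArea | YearBuilt | 1stFlrSF | 2ndFlrSF | FullBath | BedroomAbvGr | TotRmsAbvGrd | MSSubClass | LotArea.1 | OverallQual | ... | Fireplaces | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 258 | 12435 | 2001 | 963 | 829 | 2 | 3 | 7 | 60 | 12435 | 7 | ... | 1 | 0 | 96 | 0 | 245 | 0 | 0 | 0 | 5 | 2008 |
| 267 | 8400 | 1939 | 1052 | 720 | 2 | 4 | 8 | 75 | 8400 | 5 | ... | 1 | 262 | 24 | 0 | 0 | 0 | 0 | 0 | 7 | 2008 |
| 288 | 9819 | 1967 | 900 | 0 | 1 | 3 | 5 | 20 | 9819 | 5 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 2010 |
| 649 | 1936 | 1970 | 630 | 0 | 1 | 1 | 3 | 180 | 1936 | 4 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12 | 2007 |
| 1233 | 12160 | 1959 | 1188 | 0 | 1 | 3 | 6 | 20 | 12160 | 5 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 2010 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1017 | 5814 | 1984 | 1360 | 0 | 1 | 1 | 4 | 120 | 5814 | 8 | ... | 1 | 63 | 0 | 0 | 0 | 0 | 0 | 0 | 8 | 2009 |
| 534 | 9056 | 2004 | 707 | 707 | 2 | 3 | 6 | 60 | 9056 | 8 | ... | 1 | 100 | 35 | 0 | 0 | 0 | 0 | 0 | 10 | 2006 |
| 1334 | 2368 | 1970 | 765 | 600 | 1 | 3 | 7 | 160 | 2368 | 5 | ... | 0 | 0 | 36 | 0 | 0 | 0 | 0 | 0 | 5 | 2009 |
| 1369 | 10635 | 2003 | 1668 | 0 | 2 | 3 | 8 | 20 | 10635 | 8 | ... | 1 | 0 | 262 | 0 | 0 | 0 | 0 | 0 | 5 | 2010 |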


After splitting the full training dataset into training and validation sets, we have a training set with 1095 rows and a validation set with 365 rows.
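
These row counts follow from `train_test_split` holding out 25% of the rows when no `test_size` is passed. The arithmetic can be sketched as below. The rounding rule (round the validation share up, give the rest to training) reflects my understanding of scikit-learn's default behaviour and should be read as an assumption.

```python
import math

n_rows = 1460             # rows in the full training DataFrame
default_test_size = 0.25  # share held out when test_size is not given

# Assumed rounding rule: validation size is rounded up, training gets the remainder.
n_val = math.ceil(default_test_size * n_rows)
n_train = n_rows - n_val

print(n_train, n_val)  # 1095 365
```

Passing an explicit `test_size` (for example `test_size=0.2`) to `train_test_split` would change this ratio.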

In [38]:
# instantiating and fitting the model
rf_model = RandomForestRegressor(random_state=1)
rf_model.fit(train_X, train_y)

In [39]:
# coefficient of determination (R²) on the data the model was trained on
rf_model.score(train_X, train_y)

0.9723549401546177

In [40]:
# coefficient of determination (R²) on data the model hasn't seen yet
rf_model.score(val_X, val_y)

0.8919192920412191

So, based on the coefficient of determination, the model explains about 89% of the variance in sale price on the validation set (R² ≈ 0.89, where 1 is a perfect fit), compared with about 97% on the training data.
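
The value returned by `score` is the coefficient of determination, R². As a rough illustration of what it measures, it can be computed by hand in plain Python. The function name `r_squared` and the toy prices below are assumptions made for this sketch.

```python
def r_squared(actual, predicted):
    """R² = 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical sale prices
actual = [100_000, 200_000, 300_000]

perfect = r_squared(actual, actual)                         # 1.0: every prediction exact
baseline = r_squared(actual, [200_000, 200_000, 200_000])   # 0.0: always predicting the mean
worse = r_squared(actual, [300_000, 200_000, 100_000])      # -3.0: worse than the mean

print(perfect, baseline, worse)
```

A score of 1 means a perfect fit and 0 means the model does no better than always predicting the mean price. The value can drop below 0 for a model that does worse than that baseline.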

In [59]:
# making predictions on the validation set, then calculating the MAE
rf_predict = rf_model.predict(val_X)
rf_mae_val = mean_absolute_error(val_y, rf_predict)
print(f'MAE achieved without any hyperparameter tuning: {rf_mae_val}')

MAE achieved without any hyperparameter tuning: 17879.090684931507


In [65]:
# trying to tune just the n_estimators parameter
estimators = [100, 150, 200, 250, 300, 350, 400, 450, 500]

for numestimators in estimators:
    rf_model_tunning = RandomForestRegressor(random_state=1, n_estimators=numestimators)
    rf_model_tunning.fit(train_X, train_y)
    
    rf_predict_tunning = rf_model_tunning.predict(val_X)
    rf_mae_tunning = mean_absolute_error(val_y, rf_predict_tunning)
    
    print(f'MAE with n_estimators = {numestimators}: {rf_mae_tunning}')

MAE with n_estimators = 100: 17879.090684931507
MAE with n_estimators = 150: 17921.050514459665
MAE with n_estimators = 200: 17864.308721461188
MAE with n_estimators = 250: 17889.87246757991
MAE with n_estimators = 300: 17863.46909284627
MAE with n_estimators = 350: 17885.35303065884
MAE with n_estimators = 400: 17877.512600456623
MAE with n_estimators = 450: 17830.830956874684
MAE with n_estimators = 500: 17847.95117625571
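
One way to read these results is to pick the setting with the lowest validation MAE and check how much the values vary across settings. Below is a minimal sketch using the numbers printed above; the variable names are illustrative.

```python
# Validation MAE for each n_estimators value, copied from the output above
mae_by_n_estimators = {
    100: 17879.090684931507,
    150: 17921.050514459665,
    200: 17864.308721461188,
    250: 17889.87246757991,
    300: 17863.46909284627,
    350: 17885.35303065884,
    400: 17877.512600456623,
    450: 17830.830956874684,
    500: 17847.95117625571,
}

# Setting with the lowest MAE
best_n = min(mae_by_n_estimators, key=mae_by_n_estimators.get)

# Gap between the worst and best MAE across all settings
spread = max(mae_by_n_estimators.values()) - min(mae_by_n_estimators.values())

print(f'Best n_estimators: {best_n} (MAE {mae_by_n_estimators[best_n]:.2f})')
print(f'Spread between worst and best MAE: {spread:.2f}')
```

The lowest MAE is at `n_estimators = 450`, but the spread across all settings is only about 90, roughly 0.5% of the MAE itself. With a single validation split, differences this small may well be noise rather than a real improvement.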
