# Housing Prices Machine Learning Project

Predicting house prices is crucial for everyone involved in the real estate industry, from property owners and investors to buyers and sellers. In this project I am given housing data describing many aspects of residential homes in Ames, Iowa. This data includes 79 explanatory variables that help characterize each house, including the size of the house, the year it was built, the number of bedrooms, the number of bathrooms and the number of kitchens, just to name a few. For this project I will be using these 79 explanatory variables and the power of machine learning, specifically the **Random Forest Regression** and **XGBoost** algorithms, to predict what the final price should be for a set of given homes.

## Loading The Data

In this project, I am given 2 datasets with housing information. The dataset **train.csv** comprises data on 1460 houses and includes the target variable *SalePrice*, indicating the property's sale price in dollars. The dataset **test.csv**, on the other hand, contains data on a separate 1459 houses, but does not contain a variable that indicates the sale price of the house. I will analyze and use Machine Learning techniques on the **train.csv** data in order to create a model. This model will then be used to predict the house prices of those in the **test.csv** dataset. This project is a "competition project" on the Kaggle site, so I will not know what the real housing prices are for those in the **test.csv** dataset. Instead I will submit my predicted sale prices for those 1459 houses using my model and will receive a score based on their accuracy.

I will begin by importing packages that I will use throughout this project. I will also import the datasets from the Kaggle site.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error


import os
if not os.path.exists("../input/train.csv"):
    os.symlink("../input/home-data-for-ml-course/train.csv", "../input/train.csv")  
    os.symlink("../input/home-data-for-ml-course/test.csv", "../input/test.csv") 
from learntools.core import binder
binder.bind(globals())

In [2]:
train_data = pd.read_csv('../input/train.csv', index_col='Id')
train_data.head()

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,2,2008,WD,Normal,208500
2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,...,0,,,,0,5,2007,WD,Normal,181500
3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,0,,,,0,9,2008,WD,Normal,223500
4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,...,0,,,,0,2,2006,WD,Abnorml,140000
5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,...,0,,,,0,12,2008,WD,Normal,250000


In [3]:
test_data = pd.read_csv('../input/test.csv', index_col='Id')
test_data.head()

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,Inside,...,120,0,,MnPrv,,0,6,2010,WD,Normal
1462,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,Corner,...,0,0,,,Gar2,12500,6,2010,WD,Normal
1463,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,Inside,...,0,0,,MnPrv,,0,3,2010,WD,Normal
1464,60,RL,78.0,9978,Pave,,IR1,Lvl,AllPub,Inside,...,0,0,,,,0,6,2010,WD,Normal
1465,120,RL,43.0,5005,Pave,,IR1,HLS,AllPub,Inside,...,144,0,,,,0,1,2010,WD,Normal


In [4]:
test_data.columns

Index(['MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley',
       'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope',
       'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
       'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle',
       'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea',
       'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
       'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC',
       'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
       'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
       'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd',
       'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt',
       'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond',
       'PavedDrive', 'Wo

Here I can see that both datasets contain the 79 explanatory variables, such as *MSSubClass*, *MSZoning*, *LotFrontage*, and *LotArea*, just to name the first few. The **train_data** also contains the *SalePrice* variable, which will be used as the target variable.

I will now take a closer look at the variables using the describe method.

In [5]:
train_data.describe()

,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,1460.0,1460.0,...,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,56.89726,70.049958,10516.828082,6.099315,5.575342,1971.267808,1984.865753,103.685262,443.639726,46.549315,...,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753,180921.19589
std,42.300571,24.284752,9981.264932,1.382997,1.112799,30.202904,20.645407,181.066207,456.098091,161.319273,...,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,79442.502883
min,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,20.0,59.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,129975.0
50%,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,383.5,0.0,...,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,70.0,80.0,11601.5,7.0,6.0,2000.0,2004.0,166.0,712.25,0.0,...,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,1474.0,...,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0


In [6]:
test_data.describe()

,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold
count,1459.0,1232.0,1459.0,1459.0,1459.0,1459.0,1459.0,1444.0,1458.0,1458.0,...,1458.0,1459.0,1459.0,1459.0,1459.0,1459.0,1459.0,1459.0,1459.0,1459.0
mean,57.378341,68.580357,9819.161069,6.078821,5.553804,1971.357779,1983.662783,100.709141,439.203704,52.619342,...,472.768861,93.174777,48.313914,24.243317,1.79438,17.064428,1.744345,58.167923,6.104181,2007.769705
std,42.74688,22.376841,4955.517327,1.436812,1.11374,30.390071,21.130467,177.6259,455.268042,176.753926,...,217.048611,127.744882,68.883364,67.227765,20.207842,56.609763,30.491646,630.806978,2.722432,1.30174
min,20.0,21.0,1470.0,1.0,1.0,1879.0,1950.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0
25%,20.0,58.0,7391.0,5.0,5.0,1953.0,1963.0,0.0,0.0,0.0,...,318.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,2007.0
50%,50.0,67.0,9399.0,6.0,5.0,1973.0,1992.0,0.0,350.5,0.0,...,480.0,0.0,28.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0
75%,70.0,80.0,11517.5,7.0,6.0,2001.0,2004.0,164.0,753.5,0.0,...,576.0,168.0,72.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0
max,190.0,200.0,56600.0,10.0,9.0,2010.0,2010.0,1290.0,4010.0,1526.0,...,1488.0,1424.0,742.0,1012.0,360.0,576.0,800.0,17000.0,12.0,2010.0


There are a couple of things that stand out:

- The count for a few variables is different from the number of houses in the datasets. This indicates that some houses are missing values in their data.
- Only 36 explanatory variables appear in the result of the describe method (not counting the target *SalePrice*), meaning the remaining 43 variables are categorical.
- In general, the summary statistics are similar across both datasets, which hopefully indicates that the data was split randomly when the **train** and **test** datasets were separated.
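
The first two observations can also be checked directly instead of being read off the describe output. Below is a minimal sketch on a small made-up frame (the name `toy` and its values are illustrative, not taken from the dataset); the same calls should apply to **train_data**.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_data; the values are invented for illustration.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
    "Alley": [np.nan, "Grvl", np.nan, np.nan],
    "Street": ["Pave", "Pave", "Pave", "Grvl"],
})

# Count missing values per column and keep only the columns that have any.
missing_counts = toy.isnull().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)

# Count numerical columns and non-numerical (categorical) columns.
n_numeric = toy.select_dtypes(include="number").shape[1]
n_categorical = toy.select_dtypes(exclude="number").shape[1]
print(n_numeric, n_categorical)
```

On the real data, the two counts should add up to the 79 explanatory variables once *SalePrice* is set aside.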

## Split the Data
Next I will split the train_data into 4 groups using the train_test_split function (80% training, 20% validation):

- X_train: 80% of the data with all variables except the target variable. Will be used to build the model(s).
- X_valid: 20% of the data with all variables except the target variable. Will be used to test the accuracy of the model(s) and to check the effect of any adjustments made to them.
- y_train: The same 80% of the data as the X_train set, but comprised only of the target *SalePrice* variable.
- y_valid: The same 20% of the data as the X_valid set, but comprised only of the target *SalePrice* variable.

I need to split the data because, as I mentioned before, I will not know the *SalePrice* values for the test_data. Therefore, I cannot use that data to assess the accuracy of my models or to decide what adjustments to make to them.

In [7]:
# Create a copy of the train_data that I will make adjustments against.
# Using .copy() keeps the in-place operations below from modifying train_data itself.
X = train_data.copy()

# Separate the target variable from the explanatory variables.
X.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X.SalePrice
X.drop(['SalePrice'], axis=1, inplace=True)

# Split the training set and the validation set from the training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)
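
As a quick sanity check on the split sizes, here is a minimal sketch that uses placeholder arrays with the same number of rows as **train.csv** (1460). The names `X_demo` and `y_demo` are hypothetical stand-ins, not the real data. With an 80/20 split I would expect 1168 training rows and 292 validation rows, and fixing random_state should make the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with as many rows as train.csv; the contents are arbitrary.
X_demo = np.arange(1460).reshape(-1, 1)
y_demo = np.arange(1460)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, train_size=0.8, test_size=0.2, random_state=0)
print(len(X_tr), len(X_va))

# Repeating the call with the same random_state gives the same validation rows.
X_tr2, X_va2, y_tr2, y_va2 = train_test_split(
    X_demo, y_demo, train_size=0.8, test_size=0.2, random_state=0)
print(np.array_equal(X_va, X_va2))
```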

## Assess and Clean Variables in the Data

#### Missing Data
Next I will handle the missing data points. There are a couple of ways to deal with missing values in data. In this case I will be using simple imputation to fill in the missing values. For this project I will fill them using the "constant" strategy of the SimpleImputer class, which replaces every missing value with a fixed fill value (by default 0 for numerical columns).

In [8]:
# Select numerical columns to impute against
numerical_cols = [cname for cname in X_train.columns if 
                X_train[cname].dtype in ['int64', 'float64']]

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy = "constant")
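
To make the effect of the "constant" strategy concrete, here is a minimal sketch on two tiny made-up columns (the frames `num` and `cat` are illustrative only). When no fill_value is given, scikit-learn fills numerical data with 0 and object data with the string "missing_value". The "mean" strategy is shown only for comparison and is not what this project uses.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny illustrative columns, each with missing entries.
num = pd.DataFrame({"LotFrontage": [65.0, np.nan, 68.0]})
cat = pd.DataFrame({"Alley": [np.nan, "Grvl", np.nan]}, dtype=object)

# "constant" with the default fill_value: 0 for numbers, "missing_value" for objects.
num_filled = SimpleImputer(strategy="constant").fit_transform(num)
cat_filled = SimpleImputer(strategy="constant").fit_transform(cat)
print(num_filled.ravel())
print(cat_filled.ravel())

# For comparison only: "mean" fills with the average of the observed values.
num_mean = SimpleImputer(strategy="mean").fit_transform(num)
print(num_mean.ravel())
```

Filling a measurement such as *LotFrontage* with 0 is a simplification; tree-based models can often work around it, but it is a choice worth revisiting.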

#### Categorical Data
As mentioned earlier, 43 of the explanatory variables are categorical. There are also a few approaches to handling categorical variables. In this project, I will be using One-Hot Encoding, which creates new columns in the data indicating the presence (or absence) of each possible value of the categorical variable. Since this typically does not perform well for categorical variables with a large number of distinct values, I will find the cardinality of each categorical variable, use One-Hot Encoding on those with fewer than 10 unique values, and drop the variables with 10 or more unique values from the data.

In [9]:
# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality.
categorical_cols = [cname for cname in X_train.columns if
                    X_train[cname].nunique() < 10 and 
                    X_train[cname].dtype == "object"]

# Using Pipeline to help preprocess the categorical data both Imputing and One-Hot Encoding
categorical_transformer = Pipeline(steps=[
    ("imputer",SimpleImputer(strategy = "constant")),
    ("one hot",OneHotEncoder(handle_unknown = "ignore"))])
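
The sketch below illustrates cardinality and the handle_unknown = "ignore" setting on a single made-up column. The frames `train_toy` and `valid_toy` and the category "Dirt" are hypothetical. A category that was never seen during fitting is encoded as a row of zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Illustrative data; "Dirt" is a made-up category absent from the training rows.
train_toy = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"]})
valid_toy = pd.DataFrame({"Street": ["Grvl", "Dirt"]})

# Cardinality: the number of unique values in the column.
cardinality = train_toy["Street"].nunique()
print(cardinality)

# Fit on the training rows only, then encode the validation rows.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_toy)
encoded = encoder.transform(valid_toy).toarray()

print(encoder.categories_)
print(encoded)
```

Each one-hot column corresponds to one category learned from the training rows, which is why high-cardinality variables would add many columns.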

## Selecting Variables For Model
Now, based on how I approached the data in the previous section, I will select which columns the model will use to predict. I will be keeping only the numerical and low-cardinality categorical variables in each of the 3 datasets (**X_train**, **X_valid**, and **test_data**).

In [10]:
# Keep selected columns only
my_cols = categorical_cols + numerical_cols

X_train_mycol = X_train[my_cols].copy()
X_valid_mycol = X_valid[my_cols].copy()
X_test_mycol = test_data[my_cols].copy()

I will run the head method and columns method on the valid dataset to show the number of columns I am keeping and which columns they are.

In [11]:
X_valid_mycol.head()

Id,MSZoning,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Condition1,Condition2,...,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold
530,RL,Pave,,IR1,Lvl,AllPub,CulDSac,Gtl,Norm,Norm,...,484,0,0,200,0,0,0,0,3,2007
492,RL,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Artery,Norm,...,240,0,0,32,0,0,0,0,8,2006
460,RL,Pave,,IR1,Bnk,AllPub,Corner,Gtl,Norm,Norm,...,352,0,0,248,0,0,0,0,7,2009
280,RL,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Norm,Norm,...,505,288,117,0,0,0,0,0,3,2008
656,RM,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Norm,Norm,...,264,0,0,0,0,0,0,0,3,2010


In [12]:
X_valid_mycol.columns

Index(['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities',
       'LotConfig', 'LandSlope', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'RoofStyle', 'RoofMatl', 'MasVnrType', 'ExterQual',
       'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure',
       'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir',
       'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType',
       'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC',
       'Fence', 'MiscFeature', 'SaleType', 'SaleCondition', 'MSSubClass',
       'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt',
       'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF',
       'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea',
       'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr',
       'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt',
       'GarageCars', 

In [13]:
# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
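
To show what the bundled preprocessor produces, here is a minimal sketch on a three-row made-up frame. The names `toy` and `toy_preprocessor` are hypothetical, and sparse_threshold=0 is added only so the result prints as a dense array. The output has one column per numerical variable plus one column per category of each categorical variable.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Illustrative frame: two numerical columns and one categorical column.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "LotFrontage": [65.0, np.nan, 68.0],
    "Street": pd.Series(["Pave", "Grvl", "Pave"], dtype=object),
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", SimpleImputer(strategy="constant"), ["LotArea", "LotFrontage"]),
        ("cat", Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="constant")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), ["Street"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Columns come out in transformer order: numerical first, then one-hot columns.
transformed = toy_preprocessor.fit_transform(toy)
print(transformed.shape)
print(transformed)
```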

## Creating The Models
For this project I will be using both Random Forest Regression and XGBoost Regression (gradient boosting) to create models.

The Random Forest Regression works by constructing a collection of decision trees. Data is inputted into each decision tree in the forest, and each tree independently makes a prediction. The final prediction is then determined by taking the average of all the predictions from the decision trees.
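
The averaging step can be verified on synthetic data. This is a minimal sketch, assuming a small generated regression problem (`X_demo` and `y_demo` are not the housing data): the forest's prediction should match the mean of the predictions of its individual trees.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, used only for illustration.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# Each fitted tree in the forest makes its own prediction for the first 5 rows.
per_tree = np.array([tree.predict(X_demo[:5]) for tree in forest.estimators_])

# Averaging over the trees should reproduce the forest's prediction.
manual_average = per_tree.mean(axis=0)
print(np.allclose(manual_average, forest.predict(X_demo[:5])))
```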

Gradient boosting works by going through cycles to iteratively add models into an ensemble. It begins by initializing the ensemble with a single model, whose predictions can be pretty naive. It then iteratively trains new models to correct the errors made by previous models in the ensemble. The errors from the previous models are used as targets for the next model to improve upon. This process continues iteratively, with each new model trying to minimize the residual errors of the ensemble.
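
The cycle described above can be written out by hand. The sketch below is plain gradient boosting with squared error on synthetic data, built from small scikit-learn decision trees. It is an illustration of the idea, not XGBoost's actual implementation, which adds regularization and other refinements. The tree depth, the number of rounds, and the data are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, used only for illustration.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

learning_rate = 0.05
n_rounds = 100

# Start with a naive model: always predict the mean of the target.
prediction = np.full(y_demo.shape, y_demo.mean())
errors = [np.mean((y_demo - prediction) ** 2)]

for _ in range(n_rounds):
    # The residuals of the current ensemble become the target for the next tree.
    residuals = y_demo - prediction
    tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_demo, residuals)
    # Add a scaled-down correction from the new tree to the ensemble.
    prediction += learning_rate * tree.predict(X_demo)
    errors.append(np.mean((y_demo - prediction) ** 2))

print(errors[0], errors[-1])
```

The training error shrinks with every round, which is also why too many rounds can overfit: a validation set is needed to decide when to stop.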

I will create 5 models with each technique, varying the n_estimators parameter.

In [14]:
# Define the Random Forest models for different n_estimators.
model_RF1 = RandomForestRegressor(n_estimators = 100, random_state = 0)
model_RF2 = RandomForestRegressor(n_estimators = 300, random_state = 0)
model_RF3 = RandomForestRegressor(n_estimators = 500, random_state = 0)
model_RF4 = RandomForestRegressor(n_estimators = 700,  random_state = 0)
model_RF5 = RandomForestRegressor(n_estimators = 900, random_state = 0)

# Define the XGBoost models for different n_estimators.
model_XGB1 = XGBRegressor(n_estimators = 100, learning_rate = 0.05, random_state = 0)
model_XGB2 = XGBRegressor(n_estimators = 400, learning_rate = 0.05, random_state = 0)
model_XGB3 = XGBRegressor(n_estimators = 700, learning_rate = 0.05, random_state = 0)
model_XGB4 = XGBRegressor(n_estimators = 1000, learning_rate = 0.05, random_state = 0)
model_XGB5 = XGBRegressor(n_estimators = 1300, learning_rate = 0.05, random_state = 0)

## Model Testing
Now I will create a score_model function that trains a model on the X_train_mycol and y_train data, predicts on the X_valid_mycol data, and returns the mean absolute error (MAE) between those predictions and the y_valid data.
The model that gives the lowest mean absolute error will be considered the best model.
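
For reference, the mean absolute error is the average of the absolute differences between the actual and predicted values. Here is a minimal sketch with three made-up prices (not from the dataset) that compares a manual calculation with scikit-learn's function.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up sale prices and predictions, for illustration only.
actual = np.array([200000, 150000, 310000])
predicted = np.array([190000, 165000, 310000])

# Absolute errors are 10000, 15000 and 0; the MAE is their average.
manual_mae = np.mean(np.abs(actual - predicted))
library_mae = mean_absolute_error(actual, predicted)
print(manual_mae, library_mae)
```

Because the MAE is in the same units as the target, it can be read directly as an average error in dollars.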

In [15]:
def score_model(model, X_t=X_train_mycol, X_v=X_valid_mycol, y_t=y_train, y_v=y_valid):
    my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                  ('model', model)
                                 ])
    my_pipeline.fit(X_t, y_t)
    preds = my_pipeline.predict(X_v)
    return mean_absolute_error(y_v, preds)

# Create a list of all the models, then iterate through them, calculating each mean absolute error.
models = [model_RF1, model_RF2, model_RF3, model_RF4, model_RF5, model_XGB1, model_XGB2, model_XGB3, model_XGB4, model_XGB5]
for i, model in enumerate(models, start=1):
    mae = score_model(model)
    print("Model " + str(i) + " MAE: " + str(mae))

Model 1 MAE: 17621.3197260274
Model 2 MAE: 17305.304303652967
Model 3 MAE: 17287.301842465753
Model 4 MAE: 17214.610132093934
Model 5 MAE: 17232.302907153728
Model 6 MAE: 17511.70408818493
Model 7 MAE: 17219.78846050942
Model 8 MAE: 17210.830037992295
Model 9 MAE: 17207.32646618151
Model 10 MAE: 17208.56204516267


It appears **model_XGB4** (Model 9 in the output) should be considered the best and final model, as it has the lowest mean absolute error, narrowly ahead of **model_XGB3** and **model_XGB5**. The more complex **model_XGB5** has a slightly larger error, possibly due to overfitting.
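
The same selection can be made programmatically. This is a minimal sketch that copies the MAE values printed by the scoring loop into a dictionary, mapping Model 1 to Model 10 onto the model names in the order they appear in the models list.

```python
# MAE values copied from the scoring loop's printed output.
mae_by_model = {
    "model_RF1": 17621.3197260274,
    "model_RF2": 17305.304303652967,
    "model_RF3": 17287.301842465753,
    "model_RF4": 17214.610132093934,
    "model_RF5": 17232.302907153728,
    "model_XGB1": 17511.70408818493,
    "model_XGB2": 17219.78846050942,
    "model_XGB3": 17210.830037992295,
    "model_XGB4": 17207.32646618151,
    "model_XGB5": 17208.56204516267,
}

# Pick the model name with the smallest validation MAE.
best_name = min(mae_by_model, key=mae_by_model.get)
print(best_name, mae_by_model[best_name])
```

The top few models differ by only a few dollars of MAE on a single validation split, so the ranking among them should be read with some caution.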

## Creating and Fitting the Final Model
I will now create the final model, once again calculate its mean absolute error under its new final model name, and then use the **X_test_mycol** data to predict the sale price of all houses in the test dataset. An **output** dataset consisting of the house Ids and predicted SalePrice values will be created and submitted to the Kaggle competition.

In [16]:
#Creating the Final Model.
model_final = XGBRegressor(n_estimators = 1000, learning_rate = 0.05, random_state = 0)

In [17]:
# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model_final)])

# Preprocessing of training data, fit model 
my_pipeline.fit(X_train_mycol, y_train)

# Preprocessing of validation data, get predictions
# (use the same column selection that the pipeline was fitted on)
preds = my_pipeline.predict(X_valid_mycol)

# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)

MAE: 17207.32646618151


In [18]:
# Preprocess the test data and predict with the already fitted pipeline
preds_test = my_pipeline.predict(X_test_mycol)

In [19]:
# Create the output dataset.
output = pd.DataFrame({'Id': test_data.index,
                       'SalePrice': preds_test})

In [20]:
output.to_csv('submission.csv', index=False)
print("Submission was successfully saved!")

Submission was successfully saved!


## Result
In conclusion, after submitting the output, my submission received a final score of 14893. In this case, the lower the score the more accurate the model, as I believe the score indicates the average error in the predicted house price compared to the real sale price of the houses in the **test** dataset.