The first step is to import the relevant packages. I will import the machine learning packages at a later stage.

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np

Next, I will load the training data and have a look at it. I have used the generic name "df" for the DataFrame.

In [2]:
df = pd.read_csv('train.csv')

In [3]:
df.head()

,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

There are 81 columns in the dataset, which is a large number. I will initially create a model that uses only some of the features and then iterate from there to create a more sophisticated model. These features will be numerical columns, because then I do not need to spend time cleaning or encoding the data. The target variable will be the `SalePrice` column.
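
One way to list the numerical candidates without typing every name is `DataFrame.select_dtypes`. The sketch below runs on a tiny made-up frame (`toy` and its values are illustrative, not rows from train.csv) and assumes that any column with a numeric dtype counts as a numerical feature. That assumption also picks up coded categoricals such as `MSSubClass`, so the resulting list is a starting point rather than a final selection.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the training data; the values are made up for illustration.
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "LotArea": [8450, 9600, 11250],
    "LotFrontage": [65.0, np.nan, 68.0],
    "MSZoning": ["RL", "RL", "RM"],
    "SalePrice": [208500, 181500, 223500],
})

# Keep columns with a numeric dtype, then drop the identifier and the target.
numeric_cols = toy.select_dtypes(include="number").columns
feature_cols = [c for c in numeric_cols if c not in ("Id", "SalePrice")]

X_toy = toy[feature_cols]
y_toy = toy["SalePrice"]
print(feature_cols)
```

Columns with missing values, such as `LotFrontage` here, stay in the selection; whether they need imputing depends on the model used.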

In [5]:
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
import xgboost as xgb
from sklearn.metrics import mean_squared_error
import time
import winsound  # standard-library module, available on Windows only


In [6]:
X = df[['MSSubClass', 'LotArea', 'OverallQual', 'OverallCond',
       'YearBuilt', 'YearRemodAdd', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF',
       'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea',
       'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr',
       'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageCars',
       'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch',
       'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold', 'LotFrontage', 'MasVnrArea', 'GarageYrBlt']]
y = df['SalePrice']


neigh_dummies = pd.get_dummies(df.Neighborhood, prefix = "Neighbourhood", drop_first=True)
X = pd.concat([X, neigh_dummies], axis = 1)


zone_dummies = pd.get_dummies(df.MSZoning, prefix = "MSZoning", drop_first=True)
X = pd.concat([X, zone_dummies], axis = 1)


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=999)

In [13]:
# params={
#     'max_depth':[3,5,7],                              
#     'subsample':[0.6,0.8,1.0],
#     'colsample_bytree':[0.5,0.7,1],
#     'n_estimators':[300, 500],
#     'gamma': [0, 0.5, 1.0, 1.5], 
#     'reg_alpha':[0, 0.02, 0.04],
#     'learning_rate':[0.02, 0.06, 0.1]
# }

params={
    'max_depth':[5],                              
    'subsample':[0.8,],
    'colsample_bytree':[0.5],
    'n_estimators':[500],
    'gamma': [0], 
    'reg_alpha':[0.04],
    'learning_rate':[0.02]
}
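
A grid search fits every combination of the listed values, so the number of candidates grows multiplicatively with each parameter. The sketch below only counts the combinations in the commented-out grid; it trains nothing. The five folds are taken from the `cv=5` setting used in the grid search in this notebook, and the names `full_grid`, `n_candidates` and `n_fits` are introduced here for illustration.

```python
from math import prod

# The full search space from the commented-out dictionary above.
full_grid = {
    "max_depth": [3, 5, 7],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.5, 0.7, 1],
    "n_estimators": [300, 500],
    "gamma": [0, 0.5, 1.0, 1.5],
    "reg_alpha": [0, 0.02, 0.04],
    "learning_rate": [0.02, 0.06, 0.1],
}

cv_folds = 5  # matches cv=5 in the grid search

# One candidate per combination of values, one fit per candidate per fold.
n_candidates = prod(len(values) for values in full_grid.values())
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)  # 1944 9720
```

By comparison, the single-value grid that is actually used has one candidate and therefore five fits.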

In [14]:
# 'reg:squarederror' is the current name of the objective formerly called 'reg:linear'
xg_reg = xgb.XGBRegressor(objective='reg:squarederror', seed=999, n_jobs=-1)

rs = GridSearchCV(xg_reg,
                  params,
                  cv=5,
                  scoring="neg_mean_squared_log_error",
                  n_jobs=-1,
                  verbose=3)

In [15]:
start = time.time()
rs.fit(X, y)
end = time.time()
print(f"The model took {(end-start)/60} minutes to fit.")
winsound.MessageBeep()

Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:   12.5s remaining:   18.8s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   13.7s finished


The model took 0.2504852573076884 minutes to fit.


In [16]:
print(rs.best_params_)

{'colsample_bytree': 0.5, 'gamma': 0, 'learning_rate': 0.02, 'max_depth': 5, 'n_estimators': 500, 'reg_alpha': 0.04, 'subsample': 0.8}


In [18]:
y_pred = rs.predict(X_test)

print("Mean Squared Error:", mean_squared_error(np.log(y_test), np.log(y_pred)))
print('Root Mean Squared Error:', mean_squared_error(np.log(y_test), np.log(y_pred)) **.5)


Mean Squared Error: 0.003964654447002182
Root Mean Squared Error: 0.06296550203883221
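
The two numbers above are computed on the logarithms of the prices, so they measure relative rather than absolute error. A minimal sketch of the same calculation with NumPy alone follows; `rmse_of_logs` is a helper name introduced here, and the prices are made up so that every prediction is exactly 10% too high.

```python
import numpy as np

def rmse_of_logs(y_true, y_pred):
    """Root mean squared error between log(y_true) and log(y_pred)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_true) - np.log(y_pred)) ** 2)))

# Made-up prices: every prediction is 10% too high.
actual = np.array([100000.0, 200000.0, 400000.0])
predicted = actual * 1.10

score = rmse_of_logs(actual, predicted)
print(score)  # equals log(1.1), whatever the price level
```

The `neg_mean_squared_log_error` scorer used in the grid search is based on log(1 + y) rather than log(y); at house-price magnitudes the two are nearly identical.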


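I think that I have finally got a model that makes reasonably good predictions. Bear in mind that the grid search was fitted on the full `X` and `y`, so the rows in `X_test` were seen during training and the error above is optimistic. There is still a lot of room for improvement, but as a first attempt it is probably not too bad. Let us see how it performs on the data in `test.csv`.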

In [19]:
testdf = pd.read_csv('test.csv')

In [20]:
X = testdf[['MSSubClass', 'LotArea', 'OverallQual', 'OverallCond',
       'YearBuilt', 'YearRemodAdd', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF',
       'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea',
       'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr',
       'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageCars',
       'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch',
       'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold', 'LotFrontage', 'MasVnrArea', 'GarageYrBlt']]


neigh_dummies = pd.get_dummies(testdf.Neighborhood, prefix = "Neighbourhood", drop_first=True)
X = pd.concat([X, neigh_dummies], axis = 1)


zone_dummies = pd.get_dummies(testdf.MSZoning, prefix = "MSZoning", drop_first=True)
X = pd.concat([X, zone_dummies], axis = 1)


predictions = rs.predict(X)
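
The encoding above relies on test.csv producing exactly the same dummy columns, in the same order, as train.csv. The following is a hedged sketch of one way to guard that assumption with `DataFrame.reindex`, shown on made-up frames (`train_toy` and `test_toy` are illustrative names, not the real files). The test dummies are built without `drop_first` so that the training column list alone decides which columns are kept.

```python
import pandas as pd

# Made-up stand-ins for the two files; "Veenker" is missing from the test rows.
train_toy = pd.DataFrame({"Neighborhood": ["CollgCr", "NAmes", "Veenker"]})
test_toy = pd.DataFrame({"Neighborhood": ["NAmes", "CollgCr"]})

train_dummies = pd.get_dummies(train_toy.Neighborhood, prefix="Neighbourhood", drop_first=True)

# No drop_first here: the baseline column is removed by the reindex below.
test_dummies = pd.get_dummies(test_toy.Neighborhood, prefix="Neighbourhood")

# Force the test frame onto the training columns: same names, same order,
# zeros for categories absent from the test rows, extra columns discarded.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(train_dummies.columns))
print(list(test_aligned.columns))
```

The same `reindex` call can be applied to the whole feature frame once the column list of the training `X` has been saved before `X` is reassigned.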

In [21]:
predictions

array([124857.9  , 155561.53 , 180968.28 , ..., 183313.11 , 117506.305,
       232676.7  ], dtype=float32)

In [22]:
submission = pd.DataFrame({'Id': testdf['Id'],
              'SalePrice': predictions})

In [23]:
submission.to_csv('housingpricesv11cv.csv', index = False)

In [24]:
submission.tail()

,Id,SalePrice
1454,2915,79387.046875
1455,2916,85390.648438
1456,2917,183313.109375
1457,2918,117506.304688
1458,2919,232676.703125
