This notebook uses the following Kaggle dataset: https://www.kaggle.com/c/home-data-for-ml-course
The data describe each house with a number of variables, which are used here to predict its sale price. I go straight to modelling and do not include EDA. The competition ranks entries by an error computed on the logarithm of the sale price, so the models below are fit to the log of SalePrice and scored with squared-error metrics on that log scale.
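
To make the metric concrete, here is a minimal sketch with made-up prices (the numbers are illustrative and not taken from the dataset). It computes the mean squared error and the root mean squared error of the log prices; one is the square of the other, so both rank models in the same order.

```python
import numpy as np

# Hypothetical true and predicted sale prices, for illustration only
y_true = np.array([200000.0, 150000.0, 320000.0])
y_pred = np.array([210000.0, 140000.0, 300000.0])

# Work on the log scale, as the rest of the notebook does
log_true = np.log(y_true)
log_pred = np.log(y_pred)

mse_log = np.mean((log_true - log_pred) ** 2)  # mean squared error of log prices
rmse_log = np.sqrt(mse_log)                    # root mean squared error of log prices

print("MSE on log scale :", mse_log)
print("RMSE on log scale:", rmse_log)
```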

In [194]:
#Importing necessary libraries
import numpy as np 
import pandas as pd 

Train_full=pd.read_csv('../Downloads/train.csv',index_col='Id')
Test_full=pd.read_csv('../Downloads/test.csv',index_col='Id')
print(Train_full.shape)
Train_full.head(2)

(1460, 80)


Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,2,2008,WD,Normal,208500
2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,...,0,,,,0,5,2007,WD,Normal,181500


## Data Cleaning

In [195]:
print("Count of null values for each column in training and test\n")
Datasets=[Train_full,Test_full]
print("Training set\n")

flag=0
for dataset in Datasets:
    for i,j in zip(dataset.columns, dataset.isnull().sum()):
        if j>0:
            print(i,j,end="\n")
    flag+=1
    if flag==1:
        print("\nTest set\n")
    

Count of null values for each column in training and test

Training set

LotFrontage 259
Alley 1369
MasVnrType 8
MasVnrArea 8
BsmtQual 37
BsmtCond 37
BsmtExposure 38
BsmtFinType1 37
BsmtFinType2 38
Electrical 1
FireplaceQu 690
GarageType 81
GarageYrBlt 81
GarageFinish 81
GarageQual 81
GarageCond 81
PoolQC 1453
Fence 1179
MiscFeature 1406

Test set

MSZoning 4
LotFrontage 227
Alley 1352
Utilities 2
Exterior1st 1
Exterior2nd 1
MasVnrType 16
MasVnrArea 15
BsmtQual 44
BsmtCond 45
BsmtExposure 44
BsmtFinType1 42
BsmtFinSF1 1
BsmtFinType2 42
BsmtFinSF2 1
BsmtUnfSF 1
TotalBsmtSF 1
BsmtFullBath 2
BsmtHalfBath 2
KitchenQual 1
Functional 2
FireplaceQu 730
GarageType 76
GarageYrBlt 78
GarageFinish 78
GarageCars 1
GarageArea 1
GarageQual 78
GarageCond 78
PoolQC 1456
Fence 1169
MiscFeature 1408
SaleType 1
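
The same listing can be produced without an explicit loop. This is a small sketch on a toy frame (the column values are invented); it assumes only the standard pandas calls `isnull`, `sum` and `sort_values`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for Train_full; values are made up
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0],
    "Alley": [None, None, "Pave"],
    "LotArea": [8450, 9600, 11250],
})

# Count nulls per column, keep only columns that have any, largest first
null_counts = toy.isnull().sum()
null_counts = null_counts[null_counts > 0].sort_values(ascending=False)
print(null_counts)
```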


In [196]:
#List of numerical or categorical columns, for later use
Cat_cols=[x for x in Train_full.columns if Train_full[x].dtype=='object']
Num_cols=[x for x in Train_full.columns if (Train_full[x].dtype=='float64' or Train_full[x].dtype=='int64' )]

### Creating dummies for categorical variables

In [197]:
#Creating dummy variables for categorical columns
Train_full=pd.get_dummies(data=Train_full,columns=Cat_cols,prefix=Cat_cols,dummy_na=True)
Test_full=pd.get_dummies(data=Test_full,columns=Cat_cols,prefix=Cat_cols,dummy_na=True)
print("Training data shape ",Train_full.shape,"\nTest data shape ",Test_full.shape)


Training data shape  (1460, 332) 
Test data shape  (1459, 313)


In [198]:
#Some categorical dummies are in the training set but absent from the test set. We add these to the
#test set with a value of 0

print("Dummy columns in the training data that are not found in test; some category variables in the test data "
      "did not have the same range of values as they did in the training set\n")
for x in Cat_cols:
    for y in Train_full.columns:
        if y.startswith(x) and y not in Test_full.columns:
            print(y)
            Test_full[y]=0

Dummy columns in the training data that are not found in test; some category variables in the test data did not have the same range of values as they did in the training set

Utilities_NoSeWa
Condition2_RRAe
Condition2_RRAn
Condition2_RRNn
HouseStyle_2.5Fin
RoofMatl_ClyTile
RoofMatl_Membran
RoofMatl_Metal
RoofMatl_Roll
Exterior1st_ImStucc
Exterior1st_Stone
Exterior2nd_Other
Heating_Floor
Heating_OthW
Electrical_Mix
GarageQual_Ex
PoolQC_Fa
MiscFeature_TenC
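
Adding the missing dummy columns and matching the training column order can also be done in one call. The following is a minimal sketch on toy data, assuming pandas' `DataFrame.reindex`; the frames and category values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the raw training and test frames
train = pd.DataFrame({"Heating": ["GasA", "Floor", "GasA"], "LotArea": [8450, 9600, 11250]})
test = pd.DataFrame({"Heating": ["GasA", "GasW"], "LotArea": [11622, 14267]})

train_d = pd.get_dummies(train, columns=["Heating"])
test_d = pd.get_dummies(test, columns=["Heating"])

# Give the test frame exactly the training columns, in the training order:
# columns missing from test are created and filled with 0,
# columns that exist only in test (here Heating_GasW) are dropped.
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)

print(list(test_aligned.columns))
```

Note that a category seen only in the test set is silently dropped by this approach, which matches what selecting `Test_full[X.columns]` does in the notebook.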


In [199]:
#Remove the Sale price variable from training data
X=Train_full.drop('SalePrice',axis=1)
Y=Train_full.SalePrice

#Rearranging the columns in the test data to fit that of the training data
Test_full=Test_full[X.columns]
print("Training data shape ",X.shape,"\nTest data shape ",Test_full.shape)
Test_full.head(2)

Training data shape  (1460, 331) 
Test data shape  (1459, 331)


Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,SaleType_Oth,SaleType_WD,SaleType_nan,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial,SaleCondition_nan
1461,20,80.0,11622,5,6,1961,1961,0.0,468.0,144.0,...,0,1,0,0,0,0,0,1,0,0
1462,20,81.0,14267,6,6,1958,1958,108.0,923.0,0.0,...,0,1,0,0,0,0,0,1,0,0


In [200]:
X.head(2)

Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,SaleType_Oth,SaleType_WD,SaleType_nan,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial,SaleCondition_nan
1,60,65.0,8450,7,5,2003,2003,196.0,706,0,...,0,1,0,0,0,0,0,1,0,0
2,20,80.0,9600,6,8,1976,1976,0.0,978,0,...,0,1,0,0,0,0,0,1,0,0


The column order is the same in both frames. This is important because, when fitting and predicting,
parameter values are associated with column position and not with column names.

### Impute missing values for numerical columns using kNN

In [201]:

from sklearn.impute import KNNImputer
imputer=KNNImputer(n_neighbors=3)

X_impute=imputer.fit_transform(X)
X=pd.DataFrame(X_impute,columns=X.columns,index=X.index)

#Similarly for the test dataset
imputer_test=KNNImputer(n_neighbors=3)
X_test_impute=imputer_test.fit_transform(Test_full)
Test_full=pd.DataFrame(X_test_impute,columns=Test_full.columns,index=Test_full.index)
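
The cell above fits a separate imputer on the test set. A common alternative, sketched here on toy arrays, is to fit the imputer on the training data only and reuse it for the test data, so that both sets are filled from the same neighbours. This is a variation and not what the notebook does; the arrays below are made up.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy numeric matrices standing in for X and Test_full
X_toy = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [4.0, 8.0]])
test_toy = np.array([[2.5, np.nan], [np.nan, 7.0]])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X_toy)    # learn the neighbours from the training rows
test_filled = imputer.transform(test_toy)  # fill test rows using those training rows

print(X_filled)
print(test_filled)
```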

In [202]:
#Taking natural log of Sale Price
Y=np.log(Y)

## Model Building
Fitting the data with several appropriate models and recording the validation error of each:

In [203]:
#Making a table to jot down results
Results=pd.DataFrame(columns=['Model','Mean_sq_log_error','Notes'])
pd.set_option('display.max_colwidth', 0)

In [204]:
#Splitting training and testing data
from sklearn.model_selection import train_test_split
X_train,X_valid,Y_train,Y_valid=train_test_split(X,Y,random_state=1)

### Linear Regression

In [205]:
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

Linear_model=LinearRegression(fit_intercept=True)
Linear_model.fit(X_train,Y_train)
R2=Linear_model.score(X_valid,Y_valid)
Y_pred=Linear_model.predict(X_valid)
Score= mean_squared_error(Y_valid,Y_pred)
#DataFrame.append was removed in pandas 2.0; adding the row with .loc works across versions
Results.loc[len(Results)]=["Linear reg",round(Score,3),"R2= {}".format(round(R2,3))]
Results.head()

,Model,Mean_sq_log_error,Notes
0,Linear reg,0.018,R2= 0.893


### Random Forest Regressor
Having used random forests for several classification models, I am curious to see how they perform for regression.

#### Grid Search

In [206]:
from sklearn.ensemble import RandomForestRegressor 
TreeReg=RandomForestRegressor(random_state=1)

#Set up a grid to determine optimal hyperparameters
param={'max_depth':[6,7,8,9,None], 'min_samples_split':[2,3,4],'min_samples_leaf':[1,2,3]}

from sklearn.model_selection import GridSearchCV
SearchObject=GridSearchCV(estimator=TreeReg,param_grid=param,cv=3,scoring='neg_root_mean_squared_error',n_jobs=-1)
SearchObject.fit(X,Y)
#The scorer returns the negated RMSE (on the log scale), so values closer to 0 are better
print(SearchObject.best_params_,"\nBest negative RMSE (log scale) out of the search:",SearchObject.best_score_)

{'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 3} 
Best negative RMSE (log scale) out of the search: -0.14709649308765013
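
The score reported by the search is negative because scikit-learn scorers follow a "higher is better" convention, so error metrics are negated. Here is a minimal sketch on synthetic data (generated with `make_regression`, not the housing data) showing how to turn `best_score_` back into a positive RMSE.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, for illustration only
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=1)

search = GridSearchCV(
    estimator=RandomForestRegressor(n_estimators=20, random_state=1),
    param_grid={"max_depth": [3, None]},
    cv=3,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_demo, y_demo)

rmse = -search.best_score_  # flip the sign to get a positive RMSE
print("Best params:", search.best_params_)
print("Cross-validated RMSE:", rmse)
```

Applied to the search above, the score of about -0.147 is an RMSE of about 0.147 on the log scale. Squaring it gives a value of the same order as the mean squared errors in the results table, although the two are not strictly comparable because one is averaged over cross-validation folds and the other comes from a single validation split.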


#### Final Model

In [207]:
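#Note: max_depth=6 and min_samples_split=2 differ from the best parameters reported by the grid search above
TreeRegFinal=RandomForestRegressor(max_depth=6,min_samples_leaf=1,min_samples_split=2,random_state=1)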
TreeRegFinal.fit(X_train,Y_train)
Y_pred=TreeRegFinal.predict(X_valid)
Score=mean_squared_error(Y_valid,Y_pred)
Results.loc[len(Results)]=["Random Forest reg",round(Score,3),"Grid Search used"]
Results.head()

,Model,Mean_sq_log_error,Notes
0,Linear reg,0.018,R2= 0.893
1,Random Forest reg,0.022,Grid Search used


### Ridge Regression

Compared to OLS, ridge regression is less prone to overfitting. It adds a penalty term, alpha times the sum of squared coefficients, to the least-squares objective being minimized, which shrinks the coefficients toward zero. I use grid search to find a good alpha value.
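
To illustrate the effect of the penalty, here is a small sketch on random data (not the housing data). It solves the ridge problem without an intercept in closed form, shows the coefficient norm shrinking as alpha grows, and cross-checks one solution against scikit-learn's `Ridge`. The helper name `ridge_closed_form` is made up for this example.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y, i.e. ridge without an intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

norms = []
for alpha in [0.0, 1.0, 100.0]:
    w = ridge_closed_form(X_demo, y_demo, alpha)
    norms.append(float(np.linalg.norm(w)))
    print("alpha =", alpha, "coefficients =", np.round(w, 3))

# Cross-check against scikit-learn for one alpha value
w_sklearn = Ridge(alpha=1.0, fit_intercept=False).fit(X_demo, y_demo).coef_
```

With alpha=0 this reduces to ordinary least squares. The alpha values here are not comparable to the ones searched in this notebook, because the notebook's model also rescales the columns.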

#### Grid Search

In [208]:
from sklearn.linear_model import Ridge

#Note: the normalize argument was removed in scikit-learn 1.2, so this cell needs an older version
RidgeModel=Ridge(normalize=True)
param={'alpha':list(np.linspace(0.2,0.5,num=300))} #Actually experimented with many different bounds
SearchObject=GridSearchCV(estimator=RidgeModel,param_grid=param,cv=3,scoring='neg_root_mean_squared_error',n_jobs=-1)
SearchObject.fit(X,Y)
#The scorer returns the negated RMSE (on the log scale), so values closer to 0 are better
print(SearchObject.best_params_,"\nBest negative RMSE (log scale) out of the search:",SearchObject.best_score_)

{'alpha': 0.3785953177257525} 
Best negative RMSE (log scale) out of the search: -0.13883048099842402


#### Final Model

In [209]:
RidgeFinal=Ridge(normalize=True,alpha=0.38)
RidgeFinal.fit(X_train,Y_train)
Y_pred=RidgeFinal.predict(X_valid)
R2=RidgeFinal.score(X_valid,Y_valid)
Score=mean_squared_error(Y_valid,Y_pred)
Results.loc[len(Results)]=["Ridge Reg",round(Score,3),"X standardized, alpha=0.38, R2 = {}".format(round(R2,3))]
Results.head()

,Model,Mean_sq_log_error,Notes
0,Linear reg,0.018,R2= 0.893
1,Random Forest reg,0.022,Grid Search used
2,Ridge Reg,0.016,"X standardized, alpha=0.38, R2 = 0.904"
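
On recent scikit-learn versions, where `Ridge` no longer accepts `normalize`, one possible replacement is to scale the features in a pipeline. The sketch below uses synthetic data and a `StandardScaler`; this is an assumption about a reasonable substitute, it is not numerically equivalent to `normalize=True`, and the alpha found above would have to be searched again.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data, for illustration only
X_demo, y_demo = make_regression(n_samples=100, n_features=8, noise=5.0, random_state=1)

# Scaling happens inside the pipeline, so each CV fold is scaled on its own training part
pipe = make_pipeline(StandardScaler(), Ridge())

search = GridSearchCV(
    estimator=pipe,
    param_grid={"ridge__alpha": np.logspace(-2, 2, 9)},  # step name comes from make_pipeline
    cv=3,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_demo, y_demo)
print(search.best_params_, -search.best_score_)
```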


In [211]:
### Using Ridge Regression model's predictions for final submission
Final=Ridge(normalize=True,alpha=0.38)
Final.fit(X,Y)
Y_final=Final.predict(Test_full)
#Convert back from log as we had taken log of sale price
Y_final=np.exp(Y_final)
output=pd.DataFrame({"Id":Test_full.index, "SalePrice":Y_final})
output.to_csv('submission.csv',index=False)