# Kaggle Competition: House Prices - Advanced Regression Techniques
## Competition Description


Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
## Goal
It is your job to predict the sales price for each house. For each Id in the test set, you must predict the value of the SalePrice variable. 
### Metric
Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
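
The metric described above can be sketched in a few lines. This is a minimal illustration and not the competition's official scoring code; the function name `rmse_of_logs` and the toy prices are assumptions made for the example.

```python
import numpy as np

def rmse_of_logs(y_true, y_pred):
    """RMSE between log(predicted) and log(observed) prices (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

# Toy values: a 10% overestimate on a cheap and on an expensive house
cheap = rmse_of_logs([100_000], [110_000])
expensive = rmse_of_logs([500_000], [550_000])
```

Both calls return the same value, log(1.1), which is what "errors affect the result equally" means in practice: the metric depends on the ratio of predicted to observed price, not on the absolute difference.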

## Inputs
### Imports

In [45]:
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
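# Use the full option keys; the short forms can match more than one option in recent pandas
pd.set_option('display.max_columns',100)
pd.set_option('display.max_rows',90)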
from pycaret.regression import setup, compare_models
from catboost import CatBoostRegressor

### Reading data

In [46]:
train0=pd.read_csv('train.csv')
test0=pd.read_csv('test.csv')

In [47]:
train0.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


## Data Cleaning

#### Combining train and test sets

In [48]:
train1=train0.drop(['Id','SalePrice'],axis=1)
test1=test0.drop(['Id'],axis=1)
data1=pd.concat([train1,test1],axis=0)

#### Filling null values

Number of null values

In [49]:
data1.isna().sum().sum()

13965

Columns where NaN has a meaning (the feature is absent, e.g. no alley access or no pool)

In [50]:
data2=data1.copy()

In [51]:
such_cols=['Alley','BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2','FireplaceQu',
            'GarageType','GarageFinish','GarageQual','GarageCond','PoolQC','Fence','MiscFeature']

In [52]:
# NaN in these columns means the feature is absent, so it gets its own 'None' level
for col in such_cols:
    data2[col]=data2[col].fillna('None')
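
As a small illustration of the step above, filling with the string 'None' turns the absence of a feature into a category of its own rather than a missing value. The toy frame below is an assumption made for the example; only the column names come from the dataset.

```python
import pandas as pd

# Toy frame (hypothetical rows) with two of the columns where NaN means "no such feature"
toy = pd.DataFrame({"PoolQC": ["Ex", None, None],
                    "Fence": [None, "MnPrv", None]})

filled = toy.fillna("None")                  # 'None' becomes an ordinary category level
counts = filled["PoolQC"].value_counts()     # how often each level occurs
```

After the fill, `counts` reports two 'None' entries and one 'Ex', so a later encoding step can treat "no pool" like any other level.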

Remaining categorical columns with null values

In [53]:
data2.select_dtypes(['O']).isna().any().sum()

9

In [54]:
data2.loc[:,data2.isna().any()].select_dtypes(['O']).isna().sum()

MSZoning        4
Utilities       2
Exterior1st     1
Exterior2nd     1
MasVnrType     24
Electrical      1
KitchenQual     1
Functional      2
SaleType        1
dtype: int64

In [55]:
data3=data2.copy()

In [56]:
# Fill the remaining categorical nulls with each column's most frequent value
for col in data3.loc[:,data3.isna().any()].select_dtypes(['O']).columns:
    data3[col]=data3[col].fillna(data3[col].mode()[0])

In [57]:
data3.select_dtypes(['O']).isna().any().sum()

0

### Numeric null values

In [58]:
data4=data3.copy()

In [59]:
data4.isna().sum().sum()

678

In [60]:
data4.loc[:,data4.isna().any()].select_dtypes(exclude=['O']).isna().sum()

LotFrontage     486
MasVnrArea       23
BsmtFinSF1        1
BsmtFinSF2        1
BsmtUnfSF         1
TotalBsmtSF       1
BsmtFullBath      2
BsmtHalfBath      2
GarageYrBlt     159
GarageCars        1
GarageArea        1
dtype: int64

GarageYrBlt is missing when the house has no garage.

So we fill it with the earliest garage year in the data (the column minimum), a value which might denote a lower standard.

In [61]:
fillyr=data4.GarageYrBlt.min()
data4['GarageYrBlt']=data4['GarageYrBlt'].fillna(fillyr)

Filling by mean for remaining numerical nulls

In [62]:
for col in data4.loc[:,data4.isna().any()].columns:
    data4[col]=data4[col].fillna(data4[col].mean())

In [63]:
data4.isna().sum().sum()

0

## Feature Transformation

In [64]:
data5=data4.copy()

Check for negatives before log1p transformation

In [65]:
data5.select_dtypes(exclude=['O']).min().min()

0.0

In [66]:
for col in data5.select_dtypes(exclude=['O']).columns:
    data5[col]=np.log1p(data5[col])
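
The reason for checking the minimum first can be seen on a toy array: `np.log1p` computes log(1 + x), so it is defined at zero (where a plain log is not) and it is undone exactly by `np.expm1`. The values below are made up for illustration.

```python
import numpy as np

values = np.array([0.0, 1.0, 99.0, 9999.0])   # toy non-negative feature values

transformed = np.log1p(values)     # log(1 + x); maps 0 to 0 and compresses large values
restored = np.expm1(transformed)   # exp(x) - 1, the inverse of log1p
```

With a minimum of 0.0 across the numeric columns, every value stays inside the domain of `log1p`, so the transformation introduces no NaN or infinite entries.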

## Feature Encoding

In [67]:
data6=data5.copy()

One-hot encoding the categorical columns into numeric indicator columns

In [68]:
data6=pd.get_dummies(data6)
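
One likely benefit of encoding the combined frame, rather than train and test separately, is that both parts end up with the same dummy columns even when a category appears in only one of them. The sketch below shows this on made-up data; the frames `toy_train` and `toy_test` are hypothetical.

```python
import pandas as pd

# Hypothetical frames: 'Hip' appears only in train, 'Flat' only in test
toy_train = pd.DataFrame({"RoofStyle": ["Gable", "Hip"]})
toy_test = pd.DataFrame({"RoofStyle": ["Gable", "Flat"]})

# Encoding each part on its own produces different column sets
separate_train_cols = set(pd.get_dummies(toy_train).columns)
separate_test_cols = set(pd.get_dummies(toy_test).columns)

# Encoding the stacked frame produces one shared column set
combined = pd.get_dummies(pd.concat([toy_train, toy_test], axis=0))
combined_cols = set(combined.columns)
```

Here the separately encoded parts disagree on their columns, while `combined` carries all three `RoofStyle_*` indicators for every row, so a model fitted on the train rows can score the test rows without column alignment.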

## Target transformation

In [69]:
y=train0.iloc[:,-1]
log_y=np.log(y)

## Model Selection
#### Splitting into train and test

In [70]:
data7=data6.copy()

In [71]:
len(train0)

1460

In [72]:
train7=data7.iloc[:len(train0),:]
test7=data7.iloc[len(train0):,:]
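
A note on why the split above uses positions: `pd.concat` without `ignore_index=True` keeps the original row labels, so the stacked frame has duplicate index values and label-based selection would be ambiguous. The toy frames below are assumptions made for this sketch.

```python
import pandas as pd

toy_train = pd.DataFrame({"LotArea": [8450, 9600, 11250]})
toy_test = pd.DataFrame({"LotArea": [7000, 12000]})        # made-up values

stacked = pd.concat([toy_train, toy_test], axis=0)         # index is 0,1,2,0,1

n_train = len(toy_train)
train_part = stacked.iloc[:n_train, :]    # first n_train rows by position
test_part = stacked.iloc[n_train:, :]     # remaining rows by position
rows_labelled_zero = stacked.loc[0]       # label 0 matches one train row and one test row
```

Positional slicing with `iloc` recovers the two parts exactly, whereas `stacked.loc[0]` returns two rows.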

### Pycaret

In [73]:
clf=setup(data=pd.concat([train7,log_y],axis=1),target='SalePrice',fold_shuffle=True)

| | Description | Value |
|---|---|---|
| 0 | session_id | 442 |
| 1 | Target | SalePrice |
| 2 | Original Data | (1460, 303) |
| 3 | Missing Values | False |
| 4 | Numeric Features | 302 |
| 5 | Categorical Features | 0 |
| 6 | Ordinal Features | False |
| 7 | High Cardinality Features | False |
| 8 | High Cardinality Method | |
| 9 | Transformed Train Set | (1021, 289) |


In [74]:
compare_models()

| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
|---|---|---|---|---|---|---|---|---|
| br | Bayesian Ridge | 0.0835 | 0.0146 | 0.1195 | 0.8984 | 0.0093 | 0.007 | 0.101 |
| huber | Huber Regressor | 0.0832 | 0.0148 | 0.1201 | 0.8983 | 0.0093 | 0.007 | 0.544 |
| catboost | CatBoost Regressor | 0.0807 | 0.015 | 0.1213 | 0.8975 | 0.0094 | 0.0068 | 8.357 |
| ridge | Ridge Regression | 0.0844 | 0.0149 | 0.1207 | 0.8955 | 0.0094 | 0.0071 | 0.032 |
| gbr | Gradient Boosting Regressor | 0.0869 | 0.0169 | 0.1289 | 0.8835 | 0.01 | 0.0073 | 0.49 |
| omp | Orthogonal Matching Pursuit | 0.089 | 0.0168 | 0.1287 | 0.8832 | 0.01 | 0.0075 | 0.037 |
| lightgbm | Light Gradient Boosting Machine | 0.0906 | 0.0183 | 0.1343 | 0.8758 | 0.0104 | 0.0076 | 0.288 |
| xgboost | Extreme Gradient Boosting | 0.0956 | 0.02 | 0.1402 | 0.8655 | 0.0109 | 0.008 | 1.048 |
| et | Extra Trees Regressor | 0.0984 | 0.0215 | 0.145 | 0.8555 | 0.0112 | 0.0082 | 1.633 |
| rf | Random Forest Regressor | 0.0978 | 0.0215 | 0.1448 | 0.8553 | 0.0112 | 0.0082 | 1.507 |


BayesianRidge(alpha_1=1e-06, alpha_2=1e-06, alpha_init=None,
              compute_score=False, copy_X=True, fit_intercept=True,
              lambda_1=1e-06, lambda_2=1e-06, lambda_init=None, n_iter=300,
              normalize=False, tol=0.001, verbose=False)

In [75]:
cat=CatBoostRegressor()

## Evaluation

In [76]:
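cv_score=cross_val_score(cat,train7,log_y,scoring='neg_mean_squared_error',cv=5)
# cv_score holds negated MSE values on the log target; the square root gives RMSE on the log scale
print('The cross validation RMSE (log scale) is:',round(np.sqrt(-cv_score).mean(),4))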

Learning rate set to 0.04196
0:	learn: 0.3920688	total: 37.5ms	remaining: 37.4s
1:	learn: 0.3820692	total: 71ms	remaining: 35.4s
2:	learn: 0.3717818	total: 113ms	remaining: 37.5s
3:	learn: 0.3624619	total: 132ms	remaining: 32.8s
4:	learn: 0.3534737	total: 164ms	remaining: 32.7s
5:	learn: 0.3451567	total: 184ms	remaining: 30.5s
6:	learn: 0.3363482	total: 227ms	remaining: 32.1s
7:	learn: 0.3283936	total: 262ms	remaining: 32.5s
8:	learn: 0.3201740	total: 298ms	remaining: 32.9s
9:	learn: 0.3126882	total: 342ms	remaining: 33.9s
10:	learn: 0.3055658	total: 405ms	remaining: 36.4s
11:	learn: 0.2989998	total: 439ms	remaining: 36.1s
12:	learn: 0.2926074	total: 468ms	remaining: 35.5s
13:	learn: 0.2862044	total: 524ms	remaining: 36.9s
14:	learn: 0.2802419	total: 590ms	remaining: 38.8s
15:	learn: 0.2738828	total: 614ms	remaining: 37.8s
16:	learn: 0.2686413	total: 633ms	remaining: 36.6s
17:	learn: 0.2631572	total: 684ms	remaining: 37.3s
18:	learn: 0.2579520	total: 745ms	remaining: 38.5s
19:	learn: 0
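
Because scikit-learn reports `neg_mean_squared_error` as a negated MSE, a per-fold RMSE on the log scale (the quantity comparable to the competition metric) is the square root of the negated score. The sketch below checks this on synthetic data with `LinearRegression` standing in for the CatBoost model; the data and coefficients are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for (features, log of sale price)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
log_price = 12.0 + X @ np.array([0.3, -0.2, 0.1]) + rng.normal(scale=0.1, size=200)

# Negated MSE per fold, converted to RMSE per fold
neg_mse = cross_val_score(LinearRegression(), X, log_price,
                          scoring="neg_mean_squared_error", cv=5)
rmse_per_fold = np.sqrt(-neg_mse)

# The built-in RMSE scorer returns the same numbers with a negative sign
neg_rmse = cross_val_score(LinearRegression(), X, log_price,
                           scoring="neg_root_mean_squared_error", cv=5)
```

Averaging `rmse_per_fold` gives a single cross-validated figure that can be compared directly with the leaderboard score, since both are RMSE values on log prices.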

## Submission

In [77]:
cat.fit(train7,log_y)

Learning rate set to 0.043466
0:	learn: 0.3876649	total: 16.6ms	remaining: 16.6s
1:	learn: 0.3771140	total: 31.3ms	remaining: 15.6s
2:	learn: 0.3661884	total: 54.6ms	remaining: 18.1s
3:	learn: 0.3569015	total: 75.4ms	remaining: 18.8s
4:	learn: 0.3476101	total: 87.2ms	remaining: 17.3s
5:	learn: 0.3383369	total: 98.7ms	remaining: 16.4s
6:	learn: 0.3290432	total: 114ms	remaining: 16.1s
7:	learn: 0.3207390	total: 128ms	remaining: 15.9s
8:	learn: 0.3125872	total: 142ms	remaining: 15.6s
9:	learn: 0.3055187	total: 156ms	remaining: 15.4s
10:	learn: 0.2983200	total: 172ms	remaining: 15.5s
11:	learn: 0.2915483	total: 207ms	remaining: 17s
12:	learn: 0.2849846	total: 230ms	remaining: 17.5s
13:	learn: 0.2786119	total: 251ms	remaining: 17.7s
14:	learn: 0.2719889	total: 263ms	remaining: 17.3s
15:	learn: 0.2660708	total: 277ms	remaining: 17s
16:	learn: 0.2608894	total: 291ms	remaining: 16.9s
17:	learn: 0.2553045	total: 306ms	remaining: 16.7s
18:	learn: 0.2501960	total: 321ms	remaining: 16.6s
19:	learn

<catboost.core.CatBoostRegressor at 0x2582af0a208>

In [78]:
log_pred=cat.predict(test7)
pred=np.exp(log_pred)

In [79]:
sub=pd.read_csv('sample_submission.csv')

In [80]:
sub['SalePrice']=pred
sub.head()

| | Id | SalePrice |
|---|---|---|
| 0 | 1461 | 127911.878314 |
| 1 | 1462 | 163656.677609 |
| 2 | 1463 | 188695.407731 |
| 3 | 1464 | 196540.331147 |
| 4 | 1465 | 183569.819911 |


In [81]:
sub.to_csv('sample_submission.csv',index=False)

## Kaggle Score: 0.12652