# Kaggle Competition: House Prices - Advanced Regression Techniques
## Competition Description


Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
## Goal
It is your job to predict the sales price for each house. For each Id in the test set, you must predict the value of the SalePrice variable. 
### Metric
Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
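
As a minimal sketch of the metric described above, the score can be computed with the standard library alone. The function name `rmse_of_logs` and the example prices are illustrative assumptions, not part of the competition's tooling.

```python
import math

def rmse_of_logs(predicted, observed):
    """RMSE between log(predicted) and log(observed); all values must be positive."""
    if len(predicted) != len(observed):
        raise ValueError("inputs must have the same length")
    squared_errors = [
        (math.log(p) - math.log(o)) ** 2 for p, o in zip(predicted, observed)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Overpredicting by 10% costs the same for a cheap and an expensive house.
cheap = rmse_of_logs([110_000], [100_000])
expensive = rmse_of_logs([550_000], [500_000])
```

Both calls yield the same error, which is the sense in which cheap and expensive houses affect the result equally.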

## Inputs
### Imports

In [37]:
import numpy as np
import pandas as pd
from catboost import CatBoostRegressor
from pycaret.regression import setup, compare_models
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Display up to 90 columns and 90 rows when printing DataFrames.
pd.set_option('display.max_columns',90)
pd.set_option('display.max_rows',90)


### Reading data

In [2]:
train0=pd.read_csv('train.csv')
test0=pd.read_csv('test.csv')

In [3]:
train0.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


## Data Cleaning

#### Combining train and test sets

In [4]:
train1=train0.drop(['Id','SalePrice'],axis=1)
test1=test0.drop(['Id'],axis=1)
data1=pd.concat([train1,test1],axis=0)

#### Filling null values

Number of null values

In [5]:
data1.isna().sum().sum()

13965

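Columns where NaN is meaningful: it indicates that the house lacks the feature (for example, no alley access, basement, garage or pool)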

In [6]:
data2=data1.copy()

In [7]:
such_cols=['Alley','BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2','FireplaceQu',
            'GarageType','GarageFinish','GarageQual','GarageCond','PoolQC','Fence','MiscFeature']

In [8]:
for col in such_cols:
    # Assign the result back; in-place fillna on a selected column is unreliable in recent pandas.
    data2[col]=data2[col].fillna('None')

Remaining categorical columns with null values

In [9]:
data2.select_dtypes(['O']).isna().any().sum()

9

In [10]:
data2.loc[:,data2.isna().any()].select_dtypes(['O']).isna().sum()

MSZoning        4
Utilities       2
Exterior1st     1
Exterior2nd     1
MasVnrType     24
Electrical      1
KitchenQual     1
Functional      2
SaleType        1
dtype: int64

In [11]:
data3=data2.copy()

In [12]:
for col in data3.loc[:,data3.isna().any()].select_dtypes(['O']).columns:
    # Impute with the most frequent value and assign the result back.
    data3[col]=data3[col].fillna(data3[col].mode()[0])

In [13]:
data3.select_dtypes(['O']).isna().any().sum()

0
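
The two categorical strategies above can be sketched on a toy frame. The values below are made up for illustration, and the column names are borrowed from the dataset only to show which strategy applies to which kind of column.

```python
import pandas as pd

# Toy frame (hypothetical values): NaN in PoolQC means "no pool",
# while NaN in MSZoning is a genuinely missing entry.
toy = pd.DataFrame({
    "PoolQC": ["Gd", None, None, "Ex"],
    "MSZoning": ["RL", "RL", None, "RM"],
})

# Meaningful NaN: turn the absence into an explicit category.
toy["PoolQC"] = toy["PoolQC"].fillna("None")

# Truly missing: impute with the most frequent value (the mode).
toy["MSZoning"] = toy["MSZoning"].fillna(toy["MSZoning"].mode()[0])
```

Keeping "None" as its own category lets a model distinguish houses without the feature from houses with a low-quality one, which mode imputation would blur.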

### Numeric null values

In [14]:
data4=data3.copy()

In [15]:
data4.isna().sum().sum()

678

In [16]:
data4.loc[:,data4.isna().any()].select_dtypes(exclude=['O']).isna().sum()

LotFrontage     486
MasVnrArea       23
BsmtFinSF1        1
BsmtFinSF2        1
BsmtUnfSF         1
TotalBsmtSF       1
BsmtFullBath      2
BsmtHalfBath      2
GarageYrBlt     159
GarageCars        1
GarageArea        1
dtype: int64

GarageYrBlt is missing when a house has no garage.

These rows are filled with the earliest garage year in the data, on the assumption that an old year denotes a lower standard.

In [17]:
fillyr=data4.GarageYrBlt.min()
# Assign the result back instead of relying on in-place fillna.
data4['GarageYrBlt']=data4['GarageYrBlt'].fillna(fillyr)

Filling by mean for remaining numerical nulls

In [18]:
for col in data4.loc[:,data4.isna().any()].columns:
    # Only numeric columns still contain nulls at this point.
    data4[col]=data4[col].fillna(data4[col].mean())

In [19]:
data4.isna().sum().sum()

0
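
The numeric imputation can be illustrated on a small example. This is a sketch with invented values; only the two column names come from the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "GarageYrBlt": [2003.0, None, 1976.0],
    "LotFrontage": [60.0, 80.0, None],
})

# No garage: fall back to the earliest year observed in the column.
toy["GarageYrBlt"] = toy["GarageYrBlt"].fillna(toy["GarageYrBlt"].min())

# Everything that is still missing: fill with the column mean.
for col in toy.columns[toy.isna().any()]:
    toy[col] = toy[col].fillna(toy[col].mean())
```

Because GarageYrBlt is filled first, the mean-imputation loop no longer touches it and only handles LotFrontage.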

## Feature Encoding

In [20]:
data5=data4.copy()

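Label-encode every column with fewer than 30 unique values. Because the filter uses cardinality rather than dtype, low-cardinality numeric columns are re-coded as well. Codes start at 1 and follow the sorted order of the values.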

In [21]:
num_trans=0
for col in data5.loc[:,data5.nunique()<30].columns:
    categories=data5.groupby(col).count().index
    label={k:i+1 for i,k in enumerate(categories)}
    data5[col]=data5[col].map(label)
    num_trans+=1
print('Number of columns transformed:',num_trans)


Number of columns transformed: 58
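
To make the mapping concrete, here is the same encoding step applied to a single toy column. The values are illustrative; the `groupby(...).count().index` expression is the one used in the cell above.

```python
import pandas as pd

toy = pd.DataFrame({
    "KitchenQual": ["TA", "Gd", "Ex", "Gd"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# groupby sorts its keys, so the index holds the sorted unique values.
categories = toy.groupby("KitchenQual").count().index
label = {k: i + 1 for i, k in enumerate(categories)}
toy["KitchenQual"] = toy["KitchenQual"].map(label)
```

The integer codes follow alphabetical order, not quality order, so they should be read as arbitrary labels rather than as a ranking.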


## Feature transformation

In [22]:
data6=data5.copy()

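Apply a log transform to every column that contains no zeros (the logarithm of 0 is undefined). Because the encoded columns start at 1, they are log-transformed as well.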

In [23]:
for col in data6.columns:
    if 0 not in data6[col].values:
        data6[col]=np.log(data6[col])
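
A small sketch of this rule on toy data shows which columns are affected. It assumes, as the loop above does, that no column contains negative values; the numbers are invented.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, 11250.0],  # strictly positive
    "PoolArea": [0.0, 0.0, 512.0],         # contains zeros
})

for col in toy.columns:
    # log(0) is -inf, so only zero-free columns are transformed.
    if 0 not in toy[col].values:
        toy[col] = np.log(toy[col])
```

Here LotArea is replaced by its logarithm while PoolArea is left untouched.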

## Target transformation

In [24]:
y=train0.iloc[:,-1]
log_y=np.log(y)

## Model Selection
#### Splitting into train and test

In [25]:
data7=data6.copy()

In [26]:
len(train0)

1460

In [27]:
train7=data7.iloc[:len(train0),:]
test7=data7.iloc[len(train0):,:]
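
The combine-then-split pattern used in this notebook can be sketched end to end on toy frames. The frame names and values below are hypothetical.

```python
import pandas as pd

train = pd.DataFrame({"Id": [1, 2, 3], "LotArea": [10, 20, 30],
                      "SalePrice": [100, 200, 300]})
test = pd.DataFrame({"Id": [4, 5], "LotArea": [40, 50]})

# Stack the feature columns of both sets so they are cleaned identically.
combined = pd.concat(
    [train.drop(["Id", "SalePrice"], axis=1), test.drop(["Id"], axis=1)],
    axis=0,
)

# ... cleaning and encoding would happen here on `combined` ...

# concat keeps row order, so a positional split recovers both parts.
train_part = combined.iloc[:len(train), :]
test_part = combined.iloc[len(train):, :]
```

Since `pd.concat` keeps the original row labels, the combined index contains duplicates, which is why the split is positional (`iloc`) rather than label-based.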

### Pycaret

In [28]:
clf=setup(data=pd.concat([train7,log_y],axis=1),target='SalePrice',fold_shuffle=True)

|   | Description | Value |
|---|---|---|
| 0 | session_id | 2266 |
| 1 | Target | SalePrice |
| 2 | Original Data | (1460, 80) |
| 3 | Missing Values | False |
| 4 | Numeric Features | 72 |
| 5 | Categorical Features | 7 |
| 6 | Ordinal Features | False |
| 7 | High Cardinality Features | False |
| 8 | High Cardinality Method |  |
| 9 | Transformed Train Set | (1021, 121) |


In [29]:
compare_models()

|   | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
|---|---|---|---|---|---|---|---|---|
| catboost | CatBoost Regressor | 0.083 | 0.0161 | 0.1246 | 0.8994 | 0.0096 | 0.0069 | 5.935 |
| gbr | Gradient Boosting Regressor | 0.0908 | 0.0187 | 0.1351 | 0.8824 | 0.0104 | 0.0076 | 0.316 |
| lightgbm | Light Gradient Boosting Machine | 0.0925 | 0.0187 | 0.1355 | 0.8819 | 0.0105 | 0.0077 | 0.264 |
| et | Extra Trees Regressor | 0.0968 | 0.0208 | 0.1423 | 0.8684 | 0.011 | 0.0081 | 0.92 |
| rf | Random Forest Regressor | 0.0992 | 0.0214 | 0.1448 | 0.8652 | 0.0112 | 0.0083 | 1.021 |
| xgboost | Extreme Gradient Boosting | 0.1028 | 0.0228 | 0.1498 | 0.8554 | 0.0116 | 0.0086 | 0.624 |
| br | Bayesian Ridge | 0.0998 | 0.0245 | 0.1506 | 0.8497 | 0.0115 | 0.0083 | 0.039 |
| ridge | Ridge Regression | 0.1004 | 0.025 | 0.1522 | 0.8464 | 0.0116 | 0.0084 | 0.026 |
| omp | Orthogonal Matching Pursuit | 0.1052 | 0.0269 | 0.158 | 0.8354 | 0.0121 | 0.0088 | 0.029 |
| lr | Linear Regression | 0.1025 | 0.0283 | 0.1607 | 0.8246 | 0.0123 | 0.0086 | 1.167 |


<catboost.core.CatBoostRegressor at 0x189bc724c18>

In [39]:
cat=CatBoostRegressor()

## Submission

In [41]:
cat.fit(train7,log_y)

Learning rate set to 0.043466
0:	learn: 0.3881509	total: 12.2ms	remaining: 12.2s
1:	learn: 0.3780172	total: 22.2ms	remaining: 11.1s
2:	learn: 0.3671397	total: 29ms	remaining: 9.65s
3:	learn: 0.3573554	total: 39ms	remaining: 9.72s
4:	learn: 0.3490878	total: 45.5ms	remaining: 9.05s
5:	learn: 0.3400263	total: 54.2ms	remaining: 8.97s
6:	learn: 0.3307437	total: 60.8ms	remaining: 8.62s
7:	learn: 0.3226855	total: 70.5ms	remaining: 8.74s
8:	learn: 0.3150492	total: 77.1ms	remaining: 8.48s
9:	learn: 0.3073668	total: 85.2ms	remaining: 8.43s
10:	learn: 0.2996275	total: 91.9ms	remaining: 8.26s
11:	learn: 0.2928598	total: 100ms	remaining: 8.23s
12:	learn: 0.2860116	total: 108ms	remaining: 8.16s
13:	learn: 0.2792599	total: 116ms	remaining: 8.17s
14:	learn: 0.2730436	total: 122ms	remaining: 8.03s
15:	learn: 0.2668385	total: 129ms	remaining: 7.92s
16:	learn: 0.2605971	total: 136ms	remaining: 7.85s
17:	learn: 0.2551473	total: 142ms	remaining: 7.75s
18:	learn: 0.2497552	total: 150ms	remaining: 7.75s
19:	

<catboost.core.CatBoostRegressor at 0x189bc7182b0>

In [42]:
log_pred=cat.predict(test7)
pred=np.exp(log_pred)
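
The model was trained on the logarithm of SalePrice, so its predictions have to be mapped back with the exponential before submission. A minimal round-trip check, using the first three SalePrice values shown in `train0.head()`:

```python
import numpy as np

prices = np.array([208500.0, 181500.0, 223500.0])
log_prices = np.log(prices)     # scale the model is trained on
recovered = np.exp(log_prices)  # scale the submission must use
```

`np.exp` undoes `np.log` up to floating-point error, so `recovered` matches `prices`.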

In [43]:
sub=pd.read_csv('sample_submission.csv')

In [44]:
sub['SalePrice']=pred
sub.head()

|   | Id | SalePrice |
|---|---|---|
| 0 | 1461 | 124702.81972 |
| 1 | 1462 | 162173.189249 |
| 2 | 1463 | 189757.650569 |
| 3 | 1464 | 193876.77169 |
| 4 | 1465 | 177801.264418 |


In [45]:
sub.to_csv('sample_submission.csv',index=False)

## Kaggle Score: 0.12608