# Kaggle Competition: House Prices - Advanced Regression Techniques
## Competition Description


Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
## Goal
It is your job to predict the sales price for each house. For each Id in the test set, you must predict the value of the SalePrice variable. 
### Metric
Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
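
As a rough illustration of the metric described above, the following sketch computes the RMSE between log prices. The function name `rmse_of_logs` and the toy prices are assumptions made for this example; they are not part of the competition's code.

```python
import numpy as np

def rmse_of_logs(y_true, y_pred):
    """RMSE between log(actual) and log(predicted) prices."""
    log_true = np.log(np.asarray(y_true, dtype=float))
    log_pred = np.log(np.asarray(y_pred, dtype=float))
    return float(np.sqrt(np.mean((log_pred - log_true) ** 2)))

# A 10% overestimate is penalized the same on a cheap and an expensive house,
# because only the ratio predicted/actual enters the log difference.
cheap = rmse_of_logs([100_000], [110_000])
expensive = rmse_of_logs([500_000], [550_000])
```

Both `cheap` and `expensive` come out as log(1.1), which is why training on the log of `SalePrice` with a squared-error loss lines up with this metric.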

## Inputs
### Imports

In [51]:
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from catboost import CatBoostRegressor

# Show more rows and columns when displaying wide DataFrames
pd.set_option('display.max_columns',100)
pd.set_option('display.max_rows',90)

### Reading data

In [52]:
train0=pd.read_csv('train.csv')
test0=pd.read_csv('test.csv')

In [53]:
train0.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


## Data Cleaning

#### Combining train and test sets

In [54]:
train1=train0.drop(['Id','SalePrice'],axis=1)
test1=test0.drop(['Id'],axis=1)
data1=pd.concat([train1,test1],axis=0)

#### Filling null values

Number of null values

In [55]:
data1.isna().sum().sum()

13965

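Columns where NaN has meaning: here a missing value indicates that the house lacks the feature (no alley, no basement, no garage, and so on), so it is treated as its own 'None' category.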

In [56]:
data2=data1.copy()

In [57]:
such_cols=['Alley','BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2','FireplaceQu',
            'GarageType','GarageFinish','GarageQual','GarageCond','PoolQC','Fence','MiscFeature']

In [58]:
for col in such_cols:
    # assign back instead of inplace=True on a column selection
    data2[col]=data2[col].fillna('None')

Remaining categorical columns with null values

In [59]:
data2.select_dtypes(['O']).isna().any().sum()

9

In [60]:
data2.loc[:,data2.isna().any()].select_dtypes(['O']).isna().sum()

MSZoning        4
Utilities       2
Exterior1st     1
Exterior2nd     1
MasVnrType     24
Electrical      1
KitchenQual     1
Functional      2
SaleType        1
dtype: int64

In [61]:
data3=data2.copy()

In [62]:
for col in data3.loc[:,data3.isna().any()].select_dtypes(['O']).columns:
    # fill with the most frequent category of the column
    data3[col]=data3[col].fillna(data3[col].mode()[0])

In [63]:
data3.select_dtypes(['O']).isna().any().sum()

0

#### Numeric null values

In [64]:
data4=data3.copy()

In [65]:
data4.isna().sum().sum()

678

In [66]:
data4.loc[:,data4.isna().any()].select_dtypes(exclude=['O']).isna().sum()

LotFrontage     486
MasVnrArea       23
BsmtFinSF1        1
BsmtFinSF2        1
BsmtUnfSF         1
TotalBsmtSF       1
BsmtFullBath      2
BsmtHalfBath      2
GarageYrBlt     159
GarageCars        1
GarageArea        1
dtype: int64

GarageYrBlt is missing when the house has no garage.

So we fill it with the earliest garage year found in the data, a value that might denote a lower standard.

In [67]:
fillyr=data4.GarageYrBlt.min()
data4['GarageYrBlt']=data4['GarageYrBlt'].fillna(fillyr)

Filling by mean for remaining numerical nulls

In [68]:
for col in data4.loc[:,data4.isna().any()].columns:
    data4[col]=data4[col].fillna(data4[col].mean())

In [69]:
data4.isna().sum().sum()

0
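
The three fill strategies used in this section can be seen side by side on a toy frame. This is only a sketch: the three rows below are invented for illustration, and only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values), one column per fill strategy
toy = pd.DataFrame({
    "PoolQC": ["Gd", np.nan, np.nan],      # NaN means "no pool"
    "MSZoning": ["RL", "RL", np.nan],      # genuinely missing category
    "LotFrontage": [60.0, 80.0, np.nan],   # genuinely missing number
})

filled = toy.copy()
# 1. Meaningful NaN becomes its own category
filled["PoolQC"] = filled["PoolQC"].fillna("None")
# 2. Remaining categorical gaps get the most frequent value
filled["MSZoning"] = filled["MSZoning"].fillna(filled["MSZoning"].mode()[0])
# 3. Remaining numeric gaps get the column mean
filled["LotFrontage"] = filled["LotFrontage"].fillna(filled["LotFrontage"].mean())
```

Assigning the result of `fillna` back to the column avoids relying on `inplace=True` acting on a column selection, which recent pandas versions warn about.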

## Feature Encoding

In [70]:
data5=data4.copy()

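Changing the categorical columns into numeric labels. Every column with fewer than 30 distinct values, including low-cardinality numeric columns, is mapped to integer codes starting at 1, in sorted order of its values.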

In [71]:
num_trans=0
for col in data5.loc[:,data5.nunique()<30].columns:
    categories=data5.groupby(col).count().index
    label={k:i+1 for i,k in enumerate(categories)}
    data5[col]=data5[col].map(label)
    num_trans+=1
print('Number of columns transformed:',num_trans)


Number of columns transformed: 58
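
To make the mapping concrete, here is the same loop applied to a small toy frame (the values are invented for this example). It also compares the result with `pd.factorize(..., sort=True)`, which is shown only as a cross-check and is not what the notebook uses.

```python
import pandas as pd

# Toy data (hypothetical values): one text column, one low-cardinality numeric column
toy = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"], "GarageCars": [2, 0, 1]})

encoded = toy.copy()
for col in encoded.columns:
    # groupby sorts the distinct values, so labels follow sorted order
    categories = encoded.groupby(col).count().index
    label = {k: i + 1 for i, k in enumerate(categories)}
    encoded[col] = encoded[col].map(label)

# Cross-check: sorted factorize codes shifted by one
street_codes = pd.factorize(toy["Street"], sort=True)[0] + 1
```

Here `Grvl` becomes 1 and `Pave` becomes 2, while `GarageCars` values 0, 1, 2 become 1, 2, 3. The labels are ordinal codes based on sort order, so for text columns the numeric order carries no inherent meaning.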


## Feature transformation

In [72]:
data6=data5.copy()

Apply a natural-log transform to every column that contains no 0 values (the log is undefined at 0)

In [73]:
for col in data6.columns:
    if 0 not in data6[col].values:
        data6[col]=np.log(data6[col])

## Target transformation

In [74]:
y=train0.iloc[:,-1]
log_y=np.log(y)

## Model Selection

In [75]:
catbst=CatBoostRegressor()

## Evaluation
#### Splitting into train and test

In [76]:
data7=data6.copy()

In [77]:
len(train0)

1460

In [78]:
train7=data7.iloc[:len(train0),:]
test7=data7.iloc[len(train0):,:]
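
The positional split works because `pd.concat` keeps the rows in order, train first and test second. A minimal sketch on toy frames (values invented for illustration):

```python
import pandas as pd

# Toy stand-ins for the train and test feature frames
toy_train = pd.DataFrame({"LotArea": [8450, 9600, 11250]})
toy_test = pd.DataFrame({"LotArea": [11622, 14267]})

combined = pd.concat([toy_train, toy_test], axis=0)

# Index labels repeat after concat, so split by position (iloc), not by label
train_part = combined.iloc[:len(toy_train), :]
test_part = combined.iloc[len(toy_train):, :]
```

Since the concatenated frame carries duplicate index labels (`0, 1, 2, 0, 1` here), label-based selection with `.loc` would be ambiguous, whereas `.iloc` recovers the two parts exactly.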

#### Cross validation score

In [79]:
cv_score=cross_val_score(catbst,train7,log_y,scoring='neg_mean_squared_error',cv=5)
# scores are negative MSE on log prices; the square root of the negated score is the RMSE
print('The cross validation score (RMSE) is:',round(np.sqrt(-cv_score).mean(),4))

Learning rate set to 0.04196
0:	learn: 0.3913618	total: 219ms	remaining: 3m 38s
1:	learn: 0.3808122	total: 238ms	remaining: 1m 58s
2:	learn: 0.3707639	total: 260ms	remaining: 1m 26s
3:	learn: 0.3617140	total: 285ms	remaining: 1m 10s
4:	learn: 0.3537290	total: 312ms	remaining: 1m 2s
5:	learn: 0.3445102	total: 342ms	remaining: 56.6s
6:	learn: 0.3360606	total: 352ms	remaining: 49.9s
7:	learn: 0.3277131	total: 408ms	remaining: 50.6s
8:	learn: 0.3201239	total: 429ms	remaining: 47.2s
9:	learn: 0.3130555	total: 449ms	remaining: 44.5s
10:	learn: 0.3054651	total: 470ms	remaining: 42.2s
11:	learn: 0.2992313	total: 482ms	remaining: 39.7s
12:	learn: 0.2925502	total: 492ms	remaining: 37.4s
13:	learn: 0.2860195	total: 505ms	remaining: 35.5s
14:	learn: 0.2799822	total: 520ms	remaining: 34.2s
15:	learn: 0.2742162	total: 534ms	remaining: 32.8s
16:	learn: 0.2683784	total: 550ms	remaining: 31.8s
17:	learn: 0.2630052	total: 561ms	remaining: 30.6s
18:	learn: 0.2577487	total: 597ms	remaining: 30.8s
...
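
Since the competition metric is an RMSE, the negative MSE returned by `cross_val_score` has to be negated and square-rooted to be comparable with the leaderboard. The sketch below shows this on synthetic data with `LinearRegression` standing in for the CatBoost model; the data and the variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem (hypothetical data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

model = LinearRegression()

# Option 1: negative MSE per fold, converted to RMSE by hand
neg_mse = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5)
rmse_from_mse = np.sqrt(-neg_mse)

# Option 2: let scikit-learn return negative RMSE directly
neg_rmse = cross_val_score(model, X, y, scoring="neg_root_mean_squared_error", cv=5)
rmse_direct = -neg_rmse
```

Both routes give the same per-fold RMSE. When the target passed to `cross_val_score` is the log of the price, as in this notebook, the mean of these values estimates the competition metric.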

## Submission

In [80]:
catbst.fit(train7,log_y)

Learning rate set to 0.043466
0:	learn: 0.3881509	total: 7.13ms	remaining: 7.12s
1:	learn: 0.3780172	total: 16.7ms	remaining: 8.33s
2:	learn: 0.3671397	total: 23.2ms	remaining: 7.71s
3:	learn: 0.3573554	total: 32.6ms	remaining: 8.11s
4:	learn: 0.3490878	total: 38.6ms	remaining: 7.68s
5:	learn: 0.3400263	total: 47.5ms	remaining: 7.87s
6:	learn: 0.3307437	total: 53.8ms	remaining: 7.63s
7:	learn: 0.3226855	total: 65.8ms	remaining: 8.16s
8:	learn: 0.3150492	total: 72.3ms	remaining: 7.96s
9:	learn: 0.3073668	total: 80.4ms	remaining: 7.96s
10:	learn: 0.2996275	total: 86.6ms	remaining: 7.79s
11:	learn: 0.2928598	total: 94.5ms	remaining: 7.78s
12:	learn: 0.2860116	total: 101ms	remaining: 7.64s
13:	learn: 0.2792599	total: 108ms	remaining: 7.62s
14:	learn: 0.2730436	total: 114ms	remaining: 7.52s
15:	learn: 0.2668385	total: 121ms	remaining: 7.43s
16:	learn: 0.2605971	total: 129ms	remaining: 7.44s
17:	learn: 0.2551473	total: 135ms	remaining: 7.36s
18:	learn: 0.2497552	total: 143ms	remaining: 7.37s

<catboost.core.CatBoostRegressor at 0x276027316d0>

In [81]:
log_pred=catbst.predict(test7)
pred=np.exp(log_pred)

In [82]:
sub=pd.read_csv('sample_submission.csv')

In [83]:
sub['SalePrice']=pred
sub.head()

Unnamed: 0,Id,SalePrice
0,1461,124702.81972
1,1462,162173.189249
2,1463,189757.650569
3,1464,193876.77169
4,1465,177801.264418


In [84]:
sub.to_csv('sample_submission.csv',index=False)
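
Before uploading, it can help to run a few sanity checks on the submission frame. The helper `check_submission` below is hypothetical and written only for this example; it assumes the expected layout is an `Id` column followed by a `SalePrice` column, as in the preview above.

```python
import pandas as pd

def check_submission(sub, expected_ids):
    """Hypothetical helper: basic sanity checks on a submission frame."""
    assert list(sub.columns) == ["Id", "SalePrice"], "unexpected columns"
    assert sub["Id"].tolist() == list(expected_ids), "Id mismatch"
    assert sub["SalePrice"].notna().all(), "missing predictions"
    assert (sub["SalePrice"] > 0).all(), "non-positive price"
    return True

# Toy frame (values rounded from the preview above, for illustration only)
toy_sub = pd.DataFrame({"Id": [1461, 1462], "SalePrice": [124702.82, 162173.19]})
ok = check_submission(toy_sub, [1461, 1462])
```

In the notebook this could be called as `check_submission(sub, test0['Id'])` before writing the CSV.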

## Kaggle Score: 0.12608