<a href="https://www.kaggle.com/code/lonnieqin/house-prices-prediction-with-catboost?scriptVersionId=114918575" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

## House Price Regression with CatBoost
## Table of Contents
- Summary
- Import Packages
- Import Datasets
- Common Functions
- Exploratory Data Analysis & Data Preprocessing
    - Summary Statistics
    - Missing Value Imputation
    - Convert Categorical Features to Numerical Features
    - Train Validation Split
    - Calculate Correlated Features
    - Feature Scaling
- Model Development and Evaluation
    - Hyperparameter Tuning
    - Model Training with K-Fold Algorithm
    - Model Training with all data
- Conclusion


## Summary
In this notebook, I use CatBoost to build a house price predictor and search over hyperparameters to find the best-performing configuration.

## Import Packages

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn
from sklearn import metrics
import tensorflow as tf
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, StratifiedKFold

## Import Datasets

In [2]:
train = pd.read_csv("/kaggle/input/house-prices-advanced-regression-techniques/train.csv")

test = pd.read_csv("/kaggle/input/house-prices-advanced-regression-techniques/test.csv")


## Common Functions

**Evaluation Function**

In [3]:
def evaluate(model, x_val, y_val):
    y_pred = model.predict(x_val)
    r2 = metrics.r2_score(y_val, y_pred)
    mse = metrics.mean_squared_error(y_val, y_pred)
    mae = metrics.mean_absolute_error(y_val, y_pred)
    msle = metrics.mean_squared_log_error(y_val, y_pred)
    mape = np.mean(tf.keras.metrics.mean_absolute_percentage_error(y_val, y_pred).numpy())
    rmse = np.sqrt(mse)
    rmlse_score = rmlse(y_val, y_pred).numpy()
    print("R2 Score:", r2)
    print("MSE:", mse)
    print("MAE:", mae)
    print("MSLE:", msle)
    print("MAPE", mape)
    print("RMSE:", rmse)
    print("RMLSE", rmlse_score)
    return {"r2": r2, "mse": mse, "mae": mae, "msle": msle, "mape": mape, "rmse": rmse, "rmlse": rmlse_score}

**Root Mean Squared Logarithmic Error**

In [4]:
def rmlse(y_true, y_pred):
    return tf.sqrt(tf.reduce_mean(tf.square(tf.math.log(y_pred + 1) - tf.math.log(y_true + 1))))
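
The metric above is more commonly abbreviated RMSLE. As a cross-check that does not need TensorFlow, here is a minimal NumPy sketch of the same formula. The name `rmsle_np` is hypothetical and not used elsewhere in this notebook, and the sketch assumes targets and predictions are non-negative.

```python
import numpy as np

def rmsle_np(y_true, y_pred):
    """Root mean squared logarithmic error, using log(1 + x) on both inputs."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # np.log1p(x) computes log(1 + x)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Toy prices (made up for illustration)
perfect = rmsle_np([100000, 200000], [100000, 200000])
off_by_ten_percent = rmsle_np([100000, 200000], [110000, 220000])
print(perfect, off_by_ten_percent)
```

Because the error is measured on a log scale, a 10% overestimate costs roughly the same (about log(1.1)) whether the house is cheap or expensive.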

**Submission**

In [5]:
def submit(model, X, ids, file_path):
    SalePrice = model.predict(X)
    submission = pd.DataFrame({"Id": ids, "SalePrice": SalePrice.reshape(-1)})
    submission.to_csv(file_path, index=False)

## Exploratory Data Analysis & Data Preprocessing

In [6]:
train.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [7]:
train.shape

(1460, 81)

**Summary statistics**

In [8]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

In [9]:
train.describe()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,1460.0,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,1460.0,...,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,730.5,56.89726,70.049958,10516.828082,6.099315,5.575342,1971.267808,1984.865753,103.685262,443.639726,...,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753,180921.19589
std,421.610009,42.300571,24.284752,9981.264932,1.382997,1.112799,30.202904,20.645407,181.066207,456.098091,...,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,79442.502883
min,1.0,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,365.75,20.0,59.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,129975.0
50%,730.5,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,383.5,...,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,1095.25,70.0,80.0,11601.5,7.0,6.0,2000.0,2004.0,166.0,712.25,...,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,1460.0,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,...,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0


**Correlation scores**

In [10]:
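# numeric_only=True (pandas >= 1.5) skips the object columns, which recent pandas no longer drops silently
correlation_scores = train.corr(numeric_only=True)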
correlation_scores

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
Id,1.0,0.011156,-0.010601,-0.033226,-0.028365,0.012609,-0.012713,-0.021998,-0.050298,-0.005024,...,-0.029643,-0.000477,0.002889,-0.046635,0.00133,0.057044,-0.006242,0.021172,0.000712,-0.021917
MSSubClass,0.011156,1.0,-0.386347,-0.139781,0.032628,-0.059316,0.02785,0.040581,0.022936,-0.069836,...,-0.012579,-0.0061,-0.012037,-0.043825,-0.02603,0.008283,-0.007683,-0.013585,-0.021407,-0.084284
LotFrontage,-0.010601,-0.386347,1.0,0.426095,0.251646,-0.059213,0.123349,0.088866,0.193458,0.233633,...,0.088521,0.151972,0.0107,0.070029,0.041383,0.206167,0.003368,0.0112,0.00745,0.351799
LotArea,-0.033226,-0.139781,0.426095,1.0,0.105806,-0.005636,0.014228,0.013788,0.10416,0.214103,...,0.171698,0.084774,-0.01834,0.020423,0.04316,0.077672,0.038068,0.001205,-0.014261,0.263843
OverallQual,-0.028365,0.032628,0.251646,0.105806,1.0,-0.091932,0.572323,0.550684,0.411876,0.239666,...,0.238923,0.308819,-0.113937,0.030371,0.064886,0.065166,-0.031406,0.070815,-0.027347,0.790982
OverallCond,0.012609,-0.059316,-0.059213,-0.005636,-0.091932,1.0,-0.375983,0.073741,-0.128101,-0.046231,...,-0.003334,-0.032589,0.070356,0.025504,0.054811,-0.001985,0.068777,-0.003511,0.04395,-0.077856
YearBuilt,-0.012713,0.02785,0.123349,0.014228,0.572323,-0.375983,1.0,0.592855,0.315707,0.249503,...,0.22488,0.188686,-0.387268,0.031355,-0.050364,0.00495,-0.034383,0.012398,-0.013618,0.522897
YearRemodAdd,-0.021998,0.040581,0.088866,0.013788,0.550684,0.073741,0.592855,1.0,0.179618,0.128451,...,0.205726,0.226298,-0.193919,0.045286,-0.03874,0.005829,-0.010286,0.02149,0.035743,0.507101
MasVnrArea,-0.050298,0.022936,0.193458,0.10416,0.411876,-0.128101,0.315707,0.179618,1.0,0.264736,...,0.159718,0.125703,-0.110204,0.018796,0.061466,0.011723,-0.029815,-0.005965,-0.008201,0.477493
BsmtFinSF1,-0.005024,-0.069836,0.233633,0.214103,0.239666,-0.046231,0.249503,0.128451,0.264736,1.0,...,0.204306,0.111761,-0.102303,0.026451,0.062021,0.140491,0.003571,-0.015727,0.014359,0.38642


**Features most strongly correlated with house price**

In [11]:
# numeric_only=True (pandas >= 1.5) restricts the correlation to numeric columns
train.corr(numeric_only=True)["SalePrice"].sort_values(key=lambda x: abs(x), ascending=False)

SalePrice        1.000000
OverallQual      0.790982
GrLivArea        0.708624
GarageCars       0.640409
GarageArea       0.623431
TotalBsmtSF      0.613581
1stFlrSF         0.605852
FullBath         0.560664
TotRmsAbvGrd     0.533723
YearBuilt        0.522897
YearRemodAdd     0.507101
GarageYrBlt      0.486362
MasVnrArea       0.477493
Fireplaces       0.466929
BsmtFinSF1       0.386420
LotFrontage      0.351799
WoodDeckSF       0.324413
2ndFlrSF         0.319334
OpenPorchSF      0.315856
HalfBath         0.284108
LotArea          0.263843
BsmtFullBath     0.227122
BsmtUnfSF        0.214479
BedroomAbvGr     0.168213
KitchenAbvGr    -0.135907
EnclosedPorch   -0.128578
ScreenPorch      0.111447
PoolArea         0.092404
MSSubClass      -0.084284
OverallCond     -0.077856
MoSold           0.046432
3SsnPorch        0.044584
YrSold          -0.028923
LowQualFinSF    -0.025606
Id              -0.021917
MiscVal         -0.021190
BsmtHalfBath    -0.016844
BsmtFinSF2      -0.011378
Name: SalePr

### Missing Value Imputation

I use the following imputation strategies for missing values:
- For numerical columns, replace missing values with the column mean.
- For categorical columns, replace missing values with an "Unknown" category.

In [12]:
for data in [train, test]:
    # Columns that contain at least one missing value
    null_counts = data.isnull().sum()
    null_columns = list(null_counts[null_counts > 0].index)
    for column in null_columns:
        if data[column].dtype == object:
            # Categorical: treat missing values as their own category
            data[column] = data[column].fillna("Unknown")
        else:
            # Numerical: fill with the mean of this frame's column
            data[column] = data[column].fillna(data[column].mean())
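
The two strategies can also be written as a reusable function. The sketch below is a variant, not the cell above: it takes the fill value for numerical columns from a reference frame (here the training data) so that test rows are filled with training means. The function name `impute` and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd

def impute(reference, target):
    """Fill missing values in `target` using column means from `reference`."""
    filled = target.copy()
    for column in filled.columns:
        if pd.api.types.is_numeric_dtype(filled[column]):
            # Numerical: use the mean of the reference (training) column
            filled[column] = filled[column].fillna(reference[column].mean())
        else:
            # Categorical: missing becomes its own "Unknown" category
            filled[column] = filled[column].fillna("Unknown")
    return filled

# Toy frames (made up for illustration)
toy_train = pd.DataFrame({"LotFrontage": [60.0, 80.0, np.nan], "Alley": ["Grvl", None, "Pave"]})
toy_test = pd.DataFrame({"LotFrontage": [np.nan, 50.0], "Alley": [None, "Grvl"]})

toy_train_filled = impute(toy_train, toy_train)
toy_test_filled = impute(toy_train, toy_test)
print(toy_test_filled)
```

Whether to use training means or each frame's own means for the test set is a design choice; the notebook cell uses each frame's own mean.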

### Convert Categorical Features to Numerical Features

In [13]:
train_test = pd.get_dummies(pd.concat([train, test]))
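
Concatenating train and test before encoding matters because a category may appear in only one of the two files. The toy sketch below (made-up rows, and `ignore_index=True` is an addition not used in the cell above) shows how separate encoding produces mismatched columns while joint encoding does not.

```python
import pandas as pd

toy_train = pd.DataFrame({"Street": ["Pave", "Grvl"], "LotArea": [8450, 9600]})
toy_test = pd.DataFrame({"Street": ["Pave", "Pave"], "LotArea": [11250, 9550]})

# Encoded separately, the test frame never sees "Grvl" and lacks that column
separate_train_cols = list(pd.get_dummies(toy_train).columns)
separate_test_cols = list(pd.get_dummies(toy_test).columns)

# Encoded together, both halves share one column layout
combined = pd.get_dummies(pd.concat([toy_train, toy_test], ignore_index=True))
train_part = combined.iloc[:len(toy_train)]
test_part = combined.iloc[len(toy_train):]

print(separate_train_cols)
print(separate_test_cols)
print(list(test_part.columns))
```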

In [14]:
train_test.head()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,SaleType_New,SaleType_Oth,SaleType_Unknown,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
0,1,60,65.0,8450,7,5,2003,2003,196.0,706.0,...,0,0,0,1,0,0,0,0,1,0
1,2,20,80.0,9600,6,8,1976,1976,0.0,978.0,...,0,0,0,1,0,0,0,0,1,0
2,3,60,68.0,11250,7,5,2001,2002,162.0,486.0,...,0,0,0,1,0,0,0,0,1,0
3,4,70,60.0,9550,7,5,1915,1970,0.0,216.0,...,0,0,0,1,1,0,0,0,0,0
4,5,60,84.0,14260,8,5,2000,2000,350.0,655.0,...,0,0,0,1,0,0,0,0,1,0


In [15]:
mean_value = train_test.mean()
std_value = train_test.std()
mean_value.pop("SalePrice")
std_value.pop("SalePrice")
print(mean_value)
print(std_value)

Id                        1460.000000
MSSubClass                  57.137718
LotFrontage                 69.315409
LotArea                  10168.114080
OverallQual                  6.089072
                             ...     
SaleCondition_AdjLand        0.004111
SaleCondition_Alloca         0.008222
SaleCondition_Family         0.015759
SaleCondition_Normal         0.822885
SaleCondition_Partial        0.083933
Length: 312, dtype: float64
Id                        842.787043
MSSubClass                 42.517628
LotFrontage                21.314457
LotArea                  7886.996359
OverallQual                 1.409947
                            ...     
SaleCondition_AdjLand       0.063996
SaleCondition_Alloca        0.090317
SaleCondition_Family        0.124562
SaleCondition_Normal        0.381832
SaleCondition_Partial       0.277335
Length: 312, dtype: float64


In [16]:
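# Copy the slices so the pops below operate on independent frames, not views of train_test
train_features = train_test.iloc[0: len(train)].copy()
test_features = train_test.iloc[len(train):].copy()
_ = train_features.pop("Id")
_ = test_features.pop("SalePrice")
test_ids = test_features.pop("Id")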

### Train Validation Split

In [17]:
train_features, val_features = train_test_split(train_features, test_size=0.2, random_state=np.random.randint(1000))
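
The split above draws a new `random_state` on every run, so the validation metrics and the hyperparameters selected later can change between runs. If reproducibility is wanted, a fixed seed gives the same split each time. A minimal sketch on a toy frame, where `SEED` is a hypothetical constant:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame (made up for illustration)
toy = pd.DataFrame({"GrLivArea": np.arange(10), "SalePrice": np.arange(10) * 1000})

SEED = 42  # hypothetical fixed seed; any integer works
train_a, val_a = train_test_split(toy, test_size=0.2, random_state=SEED)
train_b, val_b = train_test_split(toy, test_size=0.2, random_state=SEED)

# Both calls select the same validation rows
print(list(val_a.index), list(val_b.index))
```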

### Calculate Correlated Features

In [18]:
train_features.corr()

Unnamed: 0,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,SaleType_New,SaleType_Oth,SaleType_Unknown,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
MSSubClass,1.000000,-0.344856,-0.115305,0.043194,-0.068302,0.021869,0.051335,0.015437,-0.058979,-0.059594,...,-0.035357,-0.016011,,0.002062,0.005152,0.023828,0.022479,-0.027648,0.027050,-0.042875
LotFrontage,-0.344856,1.000000,0.303613,0.228501,-0.036063,0.113632,0.090529,0.169541,0.200871,0.039988,...,0.133686,-0.026274,,-0.085812,-0.039539,-0.038340,-0.014378,0.008812,-0.063089,0.131280
LotArea,-0.115305,0.303613,1.000000,0.100631,-0.011123,0.018765,0.011054,0.111464,0.217383,0.123789,...,0.021799,-0.006230,,0.003862,-0.037309,-0.012773,0.007278,-0.004956,0.008240,0.024719
OverallQual,0.043194,0.228501,0.100631,1.000000,-0.067271,0.582230,0.567515,0.414471,0.193549,-0.058647,...,0.323063,-0.064266,,-0.206371,-0.136342,-0.039827,-0.032657,-0.052346,-0.112271,0.317946
OverallCond,-0.068302,-0.036063,-0.011123,-0.067271,1.000000,-0.336445,0.099877,-0.113469,-0.013368,0.049069,...,-0.137464,-0.055250,,0.154107,-0.064270,-0.009698,-0.020540,-0.011633,0.148259,-0.131760
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
SaleCondition_AdjLand,0.023828,-0.038340,-0.012773,-0.039827,-0.009698,-0.024870,-0.061116,-0.005453,-0.001898,-0.014313,...,-0.015099,-0.002575,,0.019924,-0.013852,1.000000,-0.004716,-0.005589,-0.109661,-0.015357
SaleCondition_Alloca,0.022479,-0.014378,0.007278,-0.032657,-0.020540,-0.017128,-0.019801,-0.000722,-0.003066,-0.026211,...,-0.027651,-0.004716,,0.036485,-0.025367,-0.004716,1.000000,-0.010235,-0.200817,-0.028123
SaleCondition_Family,-0.027648,0.008812,-0.004956,-0.052346,-0.011633,-0.067355,-0.069882,-0.026126,-0.000767,0.003126,...,-0.032774,-0.005589,,0.020118,-0.030067,-0.005589,-0.010235,1.000000,-0.238021,-0.033334
SaleCondition_Normal,0.027050,-0.063089,0.008240,-0.112271,0.148259,-0.128849,-0.093440,-0.068531,0.023184,0.033781,...,-0.643007,-0.109661,,0.630536,-0.589904,-0.109661,-0.200817,-0.238021,1.000000,-0.653996


In [19]:
threshold = 0.05
correlated_scores = train_features.corr()["SalePrice"]
correlated_scores = correlated_scores[correlated_scores.abs() >= threshold]
correlated_columns = list(correlated_scores.index)
correlated_columns.remove("SalePrice")
print(correlated_columns)

['MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', 'ScreenPorch', 'MSZoning_C (all)', 'MSZoning_FV', 'MSZoning_RH', 'MSZoning_RL', 'MSZoning_RM', 'Alley_Grvl', 'Alley_Unknown', 'LotShape_IR1', 'LotShape_IR2', 'LotShape_Reg', 'LandContour_Bnk', 'LandContour_HLS', 'LotConfig_CulDSac', 'LotConfig_Inside', 'Neighborhood_BrDale', 'Neighborhood_BrkSide', 'Neighborhood_ClearCr', 'Neighborhood_CollgCr', 'Neighborhood_Crawfor', 'Neighborhood_Edwards', 'Neighborhood_IDOTRR', 'Neighborhood_MeadowV', 'Neighborhood_Mitchel', 'Neighborhood_NAmes', 'Neighborhood_NoRidge', 'Neighborhood_NridgHt', 'Neighborhood_OldTown', 'Neighborhood_SWISU', 'Neighborhood_Saw
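
The filter keeps every feature whose absolute correlation with `SalePrice` reaches the threshold. The sketch below applies the same idea to synthetic data; the helper name `select_by_correlation` and the column `RandomNoise` are hypothetical, and the toy threshold is 0.3 rather than 0.05 so that the pure-noise column is reliably excluded.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
area = rng.normal(size=n)
toy = pd.DataFrame({
    "GrLivArea": area,                      # drives the target
    "RandomNoise": rng.normal(size=n),      # unrelated to the target
    "SalePrice": 3.0 * area + rng.normal(scale=0.5, size=n),
})

def select_by_correlation(frame, target, threshold):
    """Return columns whose absolute Pearson correlation with `target` is >= threshold."""
    scores = frame.corr()[target].drop(target)
    return list(scores[scores.abs() >= threshold].index)

selected = select_by_correlation(toy, "SalePrice", threshold=0.3)
print(selected)
```

With a threshold as low as 0.05, an unrelated column can pass by chance on a dataset of this size, which is one reason the list above is so long.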

In [20]:
train_targets = train_features.pop("SalePrice")
val_targets = val_features.pop("SalePrice")

### Feature Scaling

In [21]:
categorical_columns = set(train.dtypes[train.dtypes==object].index)

In [22]:
scale_strategies = ["none", "standard_scale", "standard_scale_exclude_categorical_features"]
scale_strategy = scale_strategies[2]
if scale_strategy == scale_strategies[1]:
    train_features = (train_features - mean_value) / std_value
    val_features = (val_features - mean_value) / std_value
    test_features = (test_features - mean_value) / std_value
if scale_strategy == scale_strategies[2]:
    for column in train_features.columns:
        # One-hot columns are named "<original column>_<category>" and are left unscaled
        components = column.split("_")
        is_categorical_feature = len(components) == 2 and components[0] in categorical_columns
        if not is_categorical_feature:
            for features in [train_features, val_features, test_features]:
                features.loc[:, column] = (features.loc[:, column] - mean_value[column]) / std_value[column]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(ilocs[0], value, pi)
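
Standardizing only the non-dummy columns can be sketched on a toy frame as follows. Two things differ from the cell above and are assumptions of this sketch: the mean and standard deviation come from the training rows only (the notebook computes them on the combined `train_test` frame), and each frame is copied before assignment, which is the usual way to avoid the "set on a copy of a slice" warning.

```python
import pandas as pd

# Toy frames (made up for illustration)
toy_train = pd.DataFrame({"LotArea": [8000.0, 10000.0, 12000.0], "Street_Pave": [1, 0, 1]})
toy_val = pd.DataFrame({"LotArea": [10000.0, 14000.0], "Street_Pave": [1, 1]})
numeric_columns = ["LotArea"]  # columns to standardize; dummy columns are left untouched

# Statistics from the training rows only
mean = toy_train[numeric_columns].mean()
std = toy_train[numeric_columns].std()

def standardize(frame):
    scaled = frame.copy()  # work on an independent copy
    scaled[numeric_columns] = (scaled[numeric_columns] - mean) / std
    return scaled

scaled_train = standardize(toy_train)
scaled_val = standardize(toy_val)
print(scaled_val)
```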


In [23]:
train_features.head()

Unnamed: 0,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,SaleType_New,SaleType_Oth,SaleType_Unknown,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
407,0.302516,-0.296297,0.685671,-0.063174,1.289537,-1.859033,-0.395536,-0.572132,-0.969025,-0.293086,...,0,0,0,1,0,0,0,0,1,0
361,-0.167877,0.034462,-0.129848,-0.77242,-0.507197,-1.033717,-0.108377,-0.572132,-0.093127,-0.293086,...,0,0,0,1,0,0,0,0,1,0
608,0.302516,0.407451,0.253567,1.355319,0.39117,-1.231793,0.65738,-0.572132,-0.029465,-0.293086,...,0,0,0,1,0,0,1,0,0,0
1193,1.478499,0.034462,-0.718666,-0.063174,-0.507197,0.914028,0.70524,1.807139,0.969365,-0.293086,...,0,0,0,1,0,0,0,0,1,0
1378,2.419286,-2.266791,-1.041602,-0.063174,-0.507197,0.0557,-0.539116,1.711968,-0.290698,-0.293086,...,0,0,0,1,0,0,0,0,1,0


In [24]:
use_correlated_columns = True
if use_correlated_columns:
    train_features = train_features[correlated_columns]
    val_features = val_features[correlated_columns]
    test_features = test_features[correlated_columns]

## Model Development and Evaluation

### Hyperparameter Tuning

In [25]:
import catboost
import time
begin = time.time()
parameters = {
    "depth": [4, 5, 6, 7, 8, 9],
    "learning_rate": [0.01, 0.05, 0.07, 0.08, 0.09, 0.10, 0.11, 0.12, 0.13, 0.14, 0.15],
    "iterations": [500, 10000], 
}
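def train_with_catboost(hyperparameters, X_train, X_val, y_train, y_val):
    # Coordinate-wise search: tune one hyperparameter at a time while the
    # others stay at their best value found so far (initially the first entry).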
    keys = hyperparameters.keys()
    best_index = {key:0 for key in keys}
    best_cat = None
    best_score = 10e8
    for (index, key) in enumerate(keys):
        print("Find best parameter for %s" %(key))
        items = hyperparameters[key]
        best_parameter = None
        temp_best = 10e8
        for (key_index, item) in enumerate(items):
            iterations = hyperparameters["iterations"][best_index["iterations"]] if key != "iterations" else item
            learning_rate = hyperparameters["learning_rate"][best_index["learning_rate"]] if key != "learning_rate" else item
            depth = hyperparameters["depth"][best_index["depth"]] if key != "depth" else item
            print("Training with iterations: %d learning_rate: %.2f depth:%d"%(iterations, learning_rate, depth))
            cat = catboost.CatBoostRegressor(
                iterations = iterations, 
                learning_rate = learning_rate,
                depth = depth,
                verbose=500
            )
            cat.fit(X_train, y_train, verbose=False)
            result = evaluate(cat, X_val, y_val)
            score = result["rmlse"]
            if score < temp_best:
                temp_best = score
                best_index[key] = key_index
                best_parameter = item
            if score < best_score:
                best_score = score
                best_cat = cat
        print("Best Parameter for %s: "%(key), best_parameter)
    best_parameters = {
        "iterations": hyperparameters["iterations"][best_index["iterations"]],
        "learning_rate": hyperparameters["learning_rate"][best_index["learning_rate"]],
        "depth": hyperparameters["depth"][best_index["depth"]]
    }
    return best_cat, best_score, best_parameters
best_cat, best_score, best_parameters = train_with_catboost(parameters, train_features, val_features, train_targets, val_targets)
print("Best RMLSE: ", best_score)
print("Best Parameters: ", best_parameters)
elapsed = time.time() - begin 
print("Elapsed time: ", elapsed)
submit(best_cat, test_features, test_ids, "submission_cat.csv")

Find best parameter for depth
Training with iterations: 500 learning_rate: 0.01 depth:4
R2 Score: 0.8614748941645479
MSE: 1039439477.2812322
MAE: 17773.728839959324
MSLE: 0.016464609595544927
MAPE 9.367609235348228
RMSE: 32240.339286075017
RMLSE 0.12831449487702054
Training with iterations: 500 learning_rate: 0.01 depth:5
R2 Score: 0.8625905074773935
MSE: 1031068341.148482
MAE: 17265.838019321454
MSLE: 0.015424229971051721
MAPE 8.997216096314043
RMSE: 32110.252897610164
RMLSE 0.12419432342523438
Training with iterations: 500 learning_rate: 0.01 depth:6
R2 Score: 0.8567818247760965
MSE: 1074654477.2089067
MAE: 17116.028528196137
MSLE: 0.015103878793584642
MAPE 8.796762864427778
RMSE: 32781.923024876174
RMLSE 0.12289783884830784
Training with iterations: 500 learning_rate: 0.01 depth:7
R2 Score: 0.8596288730417234
MSE: 1053291314.6026068
MAE: 16969.152120631563
MSLE: 0.015155897437267092
MAPE 8.717267607359384
RMSE: 32454.449842858325
RMLSE 0.12310929062124877
Training with iterations: 5
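
The search above tunes one hyperparameter at a time instead of trying every combination, so the grid in this notebook needs 6 + 11 + 2 = 19 fits rather than 6 × 11 × 2 = 132. The sketch below isolates that strategy with a cheap toy objective in place of CatBoost; `coordinate_search` and `toy_score` are hypothetical names.

```python
def coordinate_search(grid, score_fn):
    """Tune one parameter at a time, keeping the others at their current best."""
    best = {name: values[0] for name, values in grid.items()}
    best_score = float("inf")
    evaluations = 0
    for name, values in grid.items():
        scores = []
        for value in values:
            # Vary only the current parameter
            candidate = {**best, name: value}
            scores.append(score_fn(**candidate))
            evaluations += 1
        # Keep the first value with the lowest score for this parameter
        winner = min(range(len(values)), key=scores.__getitem__)
        best[name] = values[winner]
        best_score = min(best_score, scores[winner])
    return best, best_score, evaluations

def toy_score(depth, learning_rate):
    # Toy stand-in for a validation error, lowest at depth=6 and learning_rate=0.1
    return (depth - 6) ** 2 + 100 * (learning_rate - 0.1) ** 2

grid = {"depth": [4, 5, 6, 7, 8], "learning_rate": [0.01, 0.05, 0.1, 0.15]}
best, best_score, evaluations = coordinate_search(grid, toy_score)
print(best, best_score, evaluations)
```

The saving comes at a cost: each parameter is judged only against the current values of the others, so the result depends on the order of the parameters and can miss combinations that only work well together.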

### Model Training with K-Fold Algorithm

In [26]:
from sklearn.model_selection import KFold
X = pd.concat([train_features, val_features])
y = pd.concat([train_targets, val_targets])
fold = 1
models = []
for train_indices, valid_indices in KFold(n_splits=5, shuffle=True).split(X):
    print("Training with Fold %d" % (fold))
    X_train = X.iloc[train_indices]
    X_val = X.iloc[valid_indices]
    y_train = y.iloc[train_indices]
    y_val = y.iloc[valid_indices]
    cat = catboost.CatBoostRegressor(
        iterations = best_parameters["iterations"], 
        learning_rate = best_parameters["learning_rate"],
        depth = best_parameters["depth"]
    )
    cat.fit(X_train, y_train, verbose=False)
    models.append(cat)
    evaluate(cat, X_val, y_val)
    submit(cat, test_features, test_ids, "submission_cat_fold%d.csv"%(fold))
    fold += 1

Training with Fold 1
R2 Score: 0.9003344075301658
MSE: 577861230.7462425
MAE: 14377.603281360725
MSLE: 0.015816579247204754
MAPE 8.552603069018472
RMSE: 24038.74436708878
RMLSE 0.12576398231292119
Training with Fold 2
R2 Score: 0.8948219938640544
MSE: 545866345.9384582
MAE: 14524.041426391417
MSLE: 0.015920166684151513
MAPE 8.9566634984489
RMSE: 23363.782783155177
RMLSE 0.1261751428933271
Training with Fold 3
R2 Score: 0.8621294677414684
MSE: 915821486.9326822
MAE: 16283.311272481795
MSLE: 0.019763047043290502
MAPE 9.598588483074428
RMSE: 30262.542638262934
RMLSE 0.1405811048586918
Training with Fold 4
R2 Score: 0.9455061693089956
MSE: 259891817.3184134
MAE: 11438.121165040877
MSLE: 0.007379030482142839
MAPE 6.592050070724944
RMSE: 16121.160545023222
RMLSE 0.08590128335562187
Training with Fold 5
R2 Score: 0.8677711308224061
MSE: 1180902025.4244986
MAE: 17940.542643265275
MSLE: 0.015575798984013576
MAPE 8.715028787924831
RMSE: 34364.25505411835
RMLSE 0.12480304076429218


In [27]:
SalePrice = np.mean([model.predict(test_features) for model in models], axis=0)
submission = pd.DataFrame({"Id": test_ids, "SalePrice": SalePrice})
submission.to_csv("submission.csv", index=False)
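
The averaging step stacks one prediction vector per fold model and takes the mean across models. A minimal sketch with made-up predictions for four houses shows what `axis=0` does:

```python
import numpy as np

# Hypothetical predictions from three fold models for the same four test houses
fold_predictions = [
    np.array([200000.0, 150000.0, 310000.0, 120000.0]),
    np.array([210000.0, 140000.0, 300000.0, 126000.0]),
    np.array([190000.0, 160000.0, 320000.0, 114000.0]),
]

# axis=0 averages across models, leaving one value per house
ensemble = np.mean(fold_predictions, axis=0)
print(ensemble)
```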

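### Model Training with all data
I would also like to train the model on all of the labeled data, because it seems a waste not to use it. Note that the metrics below are computed on the same rows the model was trained on, so they measure training fit rather than generalization.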

In [28]:
cat = catboost.CatBoostRegressor(
    iterations = best_parameters["iterations"], 
    learning_rate = best_parameters["learning_rate"],
    depth = best_parameters["depth"]
)
cat.fit(X, y, verbose=False)
evaluate(cat, X, y)
submit(cat, test_features, test_ids, "submission_cat_all_dataset.csv")

R2 Score: 0.9934578873406833
MSE: 41259721.4438474
MAE: 4899.643333368161
MSLE: 0.0019622836885055867
MAPE 3.1504717350151914
RMSE: 6423.3730581251
RMLSE 0.044297671366625886


## Conclusion
CatBoost is a good regression tool for the house price prediction problem. On the validation set it achieves an MAE of about 14,400 and a MAPE of about 8.4, meaning that on average its predictions are off by about 14,400 dollars, or 8.4%. In terms of RMLSE, the model scores about 0.10 on the validation set and 0.13 on the test set (top 25% of the Kaggle leaderboard), which suggests it overfits slightly. Because this is a small dataset, results can also differ significantly between folds when training with K-Fold. Averaging the predictions of the K-Fold models often gives a better result.

