<a href="https://www.kaggle.com/code/lonnieqin/house-prices-prediction-with-xgboost?scriptVersionId=114978469" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

## House Price Regression with XGBoost
## Table of Contents
- Summary
- Import Packages
- Import Datasets
- Common Functions
- Exploratory Data Analysis & Data Preprocessing
    - Summary Statistics
    - Missing Value Imputation
    - Convert Categorical Features to Numerical Features
    - Train Validation Split
    - Calculate Correlated Features
    - Feature Scaling
- Model Development and Evaluation

## Summary
In this notebook, I use XGBoost to build a house price predictor and apply a hyperparameter search to find the best-performing configuration.

## Import Packages

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn
from sklearn import metrics
import tensorflow as tf
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, KFold

## Import Datasets

In [2]:
train = pd.read_csv("/kaggle/input/house-prices-advanced-regression-techniques/train.csv")

test = pd.read_csv("/kaggle/input/house-prices-advanced-regression-techniques/test.csv")


## Common Functions

**Evaluation Function**

In [3]:
def evaluate(model, x_val, y_val):
    y_pred = model.predict(x_val)
    r2 = metrics.r2_score(y_val, y_pred)
    mse = metrics.mean_squared_error(y_val, y_pred)
    mae = metrics.mean_absolute_error(y_val, y_pred)
    msle = metrics.mean_squared_log_error(y_val, y_pred)
    mape = np.mean(tf.keras.metrics.mean_absolute_percentage_error(y_val, y_pred))
    rmse = np.sqrt(mse)
    rmlse_score = rmlse(y_val, y_pred)
    print("R2 Score:", r2)
    print("MSE:", mse)
    print("MAE:", mae)
    print("MSLE:", msle)
    print("MAPE", mape)
    print("RMSE:", rmse)
    print("RMLSE", rmlse_score)
    return {"r2": r2, "mse": mse, "mae": mae, "msle": msle, "mape": mape, "rmse": rmse, "rmlse": rmlse_score}

**Root Mean Squared Logarithmic Error**

In [4]:
def rmlse(y_true, y_pred):
    return np.sqrt(np.mean(np.square(np.log(y_pred + 1) - np.log(y_true + 1))))
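
As a quick sanity check, the sketch below compares this formula with scikit-learn's `mean_squared_log_error`, whose square root should give the same value. The sample prices are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

def rmlse(y_true, y_pred):
    # Same formula as above, written with log1p(x) = log(1 + x).
    return np.sqrt(np.mean(np.square(np.log1p(y_pred) - np.log1p(y_true))))

# Made-up prices, for illustration only.
y_true = np.array([208500.0, 181500.0, 223500.0])
y_pred = np.array([200000.0, 190000.0, 220000.0])

ours = rmlse(y_true, y_pred)
reference = np.sqrt(mean_squared_log_error(y_true, y_pred))
print(ours, reference)
```

Because the error is computed on log-transformed prices, it penalizes relative rather than absolute differences, so cheap and expensive houses contribute on a similar scale.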

**Submission**

In [5]:
def submit(model, X, ids, file_path):
    SalePrice = model.predict(X)
    submission = pd.DataFrame({"Id": ids, "SalePrice": SalePrice.reshape(-1)})
    submission.to_csv(file_path, index=False)

## Exploratory Data Analysis & Data Preprocessing

In [6]:
train.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [7]:
train.shape

(1460, 81)

**Summary Statistics**

In [8]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

In [9]:
train.describe()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,1460.0,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,1460.0,...,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,730.5,56.89726,70.049958,10516.828082,6.099315,5.575342,1971.267808,1984.865753,103.685262,443.639726,...,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753,180921.19589
std,421.610009,42.300571,24.284752,9981.264932,1.382997,1.112799,30.202904,20.645407,181.066207,456.098091,...,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,79442.502883
min,1.0,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,365.75,20.0,59.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,129975.0
50%,730.5,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,383.5,...,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,1095.25,70.0,80.0,11601.5,7.0,6.0,2000.0,2004.0,166.0,712.25,...,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,1460.0,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,...,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0


**Correlation scores**

In [10]:
correlation_scores = train.corr()
correlation_scores

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
Id,1.0,0.011156,-0.010601,-0.033226,-0.028365,0.012609,-0.012713,-0.021998,-0.050298,-0.005024,...,-0.029643,-0.000477,0.002889,-0.046635,0.00133,0.057044,-0.006242,0.021172,0.000712,-0.021917
MSSubClass,0.011156,1.0,-0.386347,-0.139781,0.032628,-0.059316,0.02785,0.040581,0.022936,-0.069836,...,-0.012579,-0.0061,-0.012037,-0.043825,-0.02603,0.008283,-0.007683,-0.013585,-0.021407,-0.084284
LotFrontage,-0.010601,-0.386347,1.0,0.426095,0.251646,-0.059213,0.123349,0.088866,0.193458,0.233633,...,0.088521,0.151972,0.0107,0.070029,0.041383,0.206167,0.003368,0.0112,0.00745,0.351799
LotArea,-0.033226,-0.139781,0.426095,1.0,0.105806,-0.005636,0.014228,0.013788,0.10416,0.214103,...,0.171698,0.084774,-0.01834,0.020423,0.04316,0.077672,0.038068,0.001205,-0.014261,0.263843
OverallQual,-0.028365,0.032628,0.251646,0.105806,1.0,-0.091932,0.572323,0.550684,0.411876,0.239666,...,0.238923,0.308819,-0.113937,0.030371,0.064886,0.065166,-0.031406,0.070815,-0.027347,0.790982
OverallCond,0.012609,-0.059316,-0.059213,-0.005636,-0.091932,1.0,-0.375983,0.073741,-0.128101,-0.046231,...,-0.003334,-0.032589,0.070356,0.025504,0.054811,-0.001985,0.068777,-0.003511,0.04395,-0.077856
YearBuilt,-0.012713,0.02785,0.123349,0.014228,0.572323,-0.375983,1.0,0.592855,0.315707,0.249503,...,0.22488,0.188686,-0.387268,0.031355,-0.050364,0.00495,-0.034383,0.012398,-0.013618,0.522897
YearRemodAdd,-0.021998,0.040581,0.088866,0.013788,0.550684,0.073741,0.592855,1.0,0.179618,0.128451,...,0.205726,0.226298,-0.193919,0.045286,-0.03874,0.005829,-0.010286,0.02149,0.035743,0.507101
MasVnrArea,-0.050298,0.022936,0.193458,0.10416,0.411876,-0.128101,0.315707,0.179618,1.0,0.264736,...,0.159718,0.125703,-0.110204,0.018796,0.061466,0.011723,-0.029815,-0.005965,-0.008201,0.477493
BsmtFinSF1,-0.005024,-0.069836,0.233633,0.214103,0.239666,-0.046231,0.249503,0.128451,0.264736,1.0,...,0.204306,0.111761,-0.102303,0.026451,0.062021,0.140491,0.003571,-0.015727,0.014359,0.38642


**Factors that impact house price most**

In [11]:
train.corr()["SalePrice"].sort_values(key = lambda x: abs(x), ascending=False)

SalePrice        1.000000
OverallQual      0.790982
GrLivArea        0.708624
GarageCars       0.640409
GarageArea       0.623431
TotalBsmtSF      0.613581
1stFlrSF         0.605852
FullBath         0.560664
TotRmsAbvGrd     0.533723
YearBuilt        0.522897
YearRemodAdd     0.507101
GarageYrBlt      0.486362
MasVnrArea       0.477493
Fireplaces       0.466929
BsmtFinSF1       0.386420
LotFrontage      0.351799
WoodDeckSF       0.324413
2ndFlrSF         0.319334
OpenPorchSF      0.315856
HalfBath         0.284108
LotArea          0.263843
BsmtFullBath     0.227122
BsmtUnfSF        0.214479
BedroomAbvGr     0.168213
KitchenAbvGr    -0.135907
EnclosedPorch   -0.128578
ScreenPorch      0.111447
PoolArea         0.092404
MSSubClass      -0.084284
OverallCond     -0.077856
MoSold           0.046432
3SsnPorch        0.044584
YrSold          -0.028923
LowQualFinSF    -0.025606
Id              -0.021917
MiscVal         -0.021190
BsmtHalfBath    -0.016844
BsmtFinSF2      -0.011378
Name: SalePr

### Missing Value Imputation

I use the following strategies to impute missing values:
- For numerical columns, replace missing values with the column mean.
- For categorical columns, replace missing values with an "Unknown" category.

In [12]:
for data in [train, test]:
    null_counts = data.isnull().sum()
    null_columns = list(null_counts[null_counts > 0].index)
    for column in null_columns:
        if data[column].dtype == object:
            # Categorical column: treat missing values as their own category.
            data[column] = data[column].fillna("Unknown")
        else:
            # Numerical column: fill with the mean of this dataset's column.
            data[column] = data[column].fillna(data[column].mean())
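
The loop above fills each dataset with its own column means. A common variant, sketched here on a made-up frame, computes the means on the training data only and reuses them for the test data. The `impute` helper and the toy columns are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def impute(df, numeric_means):
    # Return a copy with missing values filled:
    # numeric columns get the supplied mean, all others get "Unknown".
    out = df.copy()
    for column in out.columns:
        if pd.api.types.is_numeric_dtype(out[column]):
            out[column] = out[column].fillna(numeric_means[column])
        else:
            out[column] = out[column].fillna("Unknown")
    return out

# Made-up data, for illustration only.
toy_train = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0], "Alley": ["Grvl", None, "Pave"]})
toy_test = pd.DataFrame({"LotFrontage": [np.nan, 50.0], "Alley": [None, "Grvl"]})

# Means come from the training frame only.
means = toy_train.mean(numeric_only=True)
toy_train_filled = impute(toy_train, means)
toy_test_filled = impute(toy_test, means)
print(toy_test_filled)
```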

### Convert Categorical Features to Numerical Features

In [13]:
train_test = pd.get_dummies(pd.concat([train, test]))
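
Encoding the concatenated frame, rather than each set separately, keeps the dummy columns aligned even when a category appears in only one of the sets. A minimal sketch with made-up data:

```python
import pandas as pd

# Made-up data: "Grvl" appears only in the training half.
toy_train = pd.DataFrame({"Street": ["Pave", "Grvl"], "LotArea": [8450, 9600]})
toy_test = pd.DataFrame({"Street": ["Pave", "Pave"], "LotArea": [11250, 9550]})

# Encoded separately, the test frame has no Street_Grvl column.
separate_test = pd.get_dummies(toy_test)

# Encoded together, both halves share the same columns.
combined = pd.get_dummies(pd.concat([toy_train, toy_test]))
train_part = combined.iloc[:len(toy_train)]
test_part = combined.iloc[len(toy_train):]

print(list(separate_test.columns))
print(list(test_part.columns))
```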

In [14]:
train_test.head()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,SaleType_New,SaleType_Oth,SaleType_Unknown,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
0,1,60,65.0,8450,7,5,2003,2003,196.0,706.0,...,0,0,0,1,0,0,0,0,1,0
1,2,20,80.0,9600,6,8,1976,1976,0.0,978.0,...,0,0,0,1,0,0,0,0,1,0
2,3,60,68.0,11250,7,5,2001,2002,162.0,486.0,...,0,0,0,1,0,0,0,0,1,0
3,4,70,60.0,9550,7,5,1915,1970,0.0,216.0,...,0,0,0,1,1,0,0,0,0,0
4,5,60,84.0,14260,8,5,2000,2000,350.0,655.0,...,0,0,0,1,0,0,0,0,1,0


In [15]:
mean_value = train_test.mean()
std_value = train_test.std()
mean_value.pop("SalePrice")
std_value.pop("SalePrice")
print(mean_value)
print(std_value)

Id                        1460.000000
MSSubClass                  57.137718
LotFrontage                 69.315409
LotArea                  10168.114080
OverallQual                  6.089072
                             ...     
SaleCondition_AdjLand        0.004111
SaleCondition_Alloca         0.008222
SaleCondition_Family         0.015759
SaleCondition_Normal         0.822885
SaleCondition_Partial        0.083933
Length: 312, dtype: float64
Id                        842.787043
MSSubClass                 42.517628
LotFrontage                21.314457
LotArea                  7886.996359
OverallQual                 1.409947
                            ...     
SaleCondition_AdjLand       0.063996
SaleCondition_Alloca        0.090317
SaleCondition_Family        0.124562
SaleCondition_Normal        0.381832
SaleCondition_Partial       0.277335
Length: 312, dtype: float64


In [16]:
# Copy the slices so later in-place edits do not operate on views of train_test.
train_features = train_test.iloc[0: len(train)].copy()
test_features = train_test.iloc[len(train):].copy()
_ = train_features.pop("Id")
_ = test_features.pop("SalePrice")
test_ids = test_features.pop("Id")

### Calculate Correlated Features

In [17]:
train_features.corr()

Unnamed: 0,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,SaleType_New,SaleType_Oth,SaleType_Unknown,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
MSSubClass,1.000000,-0.357056,-0.139781,0.032628,-0.059316,0.027850,0.040581,0.022895,-0.069836,-0.065649,...,-0.045156,-0.014555,,0.026359,0.005003,0.016241,0.030002,0.000983,0.024359,-0.051068
LotFrontage,-0.357056,1.000000,0.306795,0.234196,-0.052820,0.117598,0.082746,0.179283,0.215828,0.043340,...,0.126580,-0.023461,,-0.089928,-0.021846,-0.037020,-0.018090,0.015818,-0.072074,0.124842
LotArea,-0.139781,0.306795,1.000000,0.105806,-0.005636,0.014228,0.013788,0.103960,0.214103,0.111170,...,0.020039,-0.005722,,-0.002292,-0.029126,-0.013208,0.008966,-0.010781,0.005711,0.022635
OverallQual,0.032628,0.234196,0.105806,1.000000,-0.091932,0.572323,0.550684,0.410238,0.239666,-0.059119,...,0.327412,-0.057962,,-0.225013,-0.103535,-0.041677,-0.044950,-0.025515,-0.143282,0.323295
OverallCond,-0.059316,-0.052820,-0.005636,-0.091932,1.000000,-0.375983,0.073741,-0.127788,-0.046231,0.040229,...,-0.156175,-0.050663,,0.163684,-0.046367,-0.038888,-0.033444,-0.023873,0.161642,-0.151659
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
SaleCondition_AdjLand,0.016241,-0.037020,-0.013208,-0.041677,-0.038888,-0.045601,-0.040294,-0.011959,-0.014874,-0.015130,...,-0.015827,-0.002378,,0.020457,-0.014289,1.000000,-0.004772,-0.006177,-0.112080,-0.016038
SaleCondition_Alloca,0.030002,-0.018090,0.008966,-0.044950,-0.033444,-0.010104,-0.020727,-0.009689,0.021369,-0.026277,...,-0.027489,-0.004131,,0.035530,-0.024817,-0.004772,1.000000,-0.010729,-0.194663,-0.027856
SaleCondition_Family,0.000983,0.015818,-0.010781,-0.025515,-0.023873,-0.035785,-0.048056,-0.009914,0.000765,-0.007929,...,-0.035587,-0.005348,,0.028599,-0.032128,-0.006177,-0.010729,1.000000,-0.252006,-0.036062
SaleCondition_Normal,0.024359,-0.072074,0.005711,-0.143282,0.161642,-0.158427,-0.120577,-0.084241,-0.019560,0.041207,...,-0.645698,-0.097031,,0.634322,-0.582947,-0.112080,-0.194663,-0.252006,1.000000,-0.654323


In [18]:
threshold = 0.05
correlated_scores = train_features.corr()["SalePrice"]
correlated_scores = correlated_scores[correlated_scores.abs() >= threshold]
correlated_columns = list(correlated_scores.index)
correlated_columns.remove("SalePrice")
print(correlated_columns)

['MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'GrLivArea', 'BsmtFullBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', 'ScreenPorch', 'PoolArea', 'MSZoning_C (all)', 'MSZoning_FV', 'MSZoning_RH', 'MSZoning_RL', 'MSZoning_RM', 'Alley_Grvl', 'Alley_Unknown', 'LotShape_IR1', 'LotShape_IR2', 'LotShape_Reg', 'LandContour_Bnk', 'LandContour_HLS', 'LotConfig_CulDSac', 'LotConfig_Inside', 'LandSlope_Gtl', 'Neighborhood_BrDale', 'Neighborhood_BrkSide', 'Neighborhood_ClearCr', 'Neighborhood_CollgCr', 'Neighborhood_Crawfor', 'Neighborhood_Edwards', 'Neighborhood_IDOTRR', 'Neighborhood_MeadowV', 'Neighborhood_Mitchel', 'Neighborhood_NAmes', 'Neighborhood_NoRidge', 'Neighborhood_NridgHt', 'Neighborhood_OldTown', 'Neighborhood_SWISU', 'Nei
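
The same filtering idea can be seen on a small synthetic frame. In this sketch the column names and data are made up, and the threshold is set higher than the 0.05 used above so that the effect is visible on a toy example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
size = rng.normal(size=200)
toy = pd.DataFrame({
    "size": size,                                            # strongly related to the target
    "noise": rng.normal(size=200),                           # unrelated to the target
    "target": 3.0 * size + rng.normal(scale=0.1, size=200),
})

threshold = 0.5  # stricter than the notebook's 0.05, for illustration
scores = toy.corr()["target"].drop("target")
selected = list(scores[scores.abs() >= threshold].index)
print(selected)
```

Note that Pearson correlation only captures linear relationships, so a low score does not prove a feature is useless to a tree-based model.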

In [19]:
y = train_features.pop("SalePrice")
X = train_features

### Feature Scaling

In [20]:
categorical_columns = set(train.dtypes[train.dtypes==object].index)

In [21]:
scale_strategies = ["none", "standard_scale", "standard_scale_exclude_categorical_features"]
scale_strategy = scale_strategies[2]
if scale_strategy == scale_strategies[1]:
    # Standardize every column, including the one-hot indicators.
    X = (X - mean_value) / std_value
    test_features = (test_features - mean_value) / std_value
if scale_strategy == scale_strategies[2]:
    for column in train_features.columns:
        # One-hot columns are named "<original column>_<category>"; leave them as 0/1.
        components = column.split("_")
        is_categorical_feature = len(components) == 2 and components[0] in categorical_columns
        if not is_categorical_feature:
            for features in [X, test_features]:
                features.loc[:, column] = (features.loc[:, column] - mean_value[column]) / std_value[column]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(ilocs[0], value, pi)
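
The idea behind the third strategy is to standardize only the numeric columns and leave the one-hot indicator columns as 0/1. A minimal sketch on made-up data, where the `one_hot_columns` set is supplied by hand rather than derived from the column names:

```python
import pandas as pd

# Made-up data, for illustration only.
toy = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, 11250.0, 9550.0],
    "Street_Pave": [1, 1, 0, 1],
})
one_hot_columns = {"Street_Pave"}  # assumption: known in advance for this toy frame

scaled = toy.copy()  # work on a copy so the original frame is untouched
for column in scaled.columns:
    if column not in one_hot_columns:
        # Standard score: subtract the mean, divide by the standard deviation.
        scaled[column] = (scaled[column] - scaled[column].mean()) / scaled[column].std()

print(scaled)
```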


In [22]:
X.head()

Unnamed: 0,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,SaleType_New,SaleType_Oth,SaleType_Unknown,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
0,0.06732,-0.202464,-0.217841,0.646073,-0.507197,1.046078,0.896679,0.525132,0.580809,-0.293086,...,0,0,0,1,0,0,0,0,1,0
1,-0.873466,0.501284,-0.072032,-0.063174,2.187904,0.154737,-0.395536,-0.572132,1.177912,-0.293086,...,0,0,0,1,0,0,0,0,1,0
2,0.06732,-0.061714,0.137173,0.646073,-0.507197,0.980053,0.848819,0.33479,0.097858,-0.293086,...,0,0,0,1,0,0,0,0,1,0
3,0.302516,-0.437047,-0.078371,0.646073,-0.507197,-1.859033,-0.682695,-0.572132,-0.494855,-0.293086,...,0,0,0,1,1,0,0,0,0,0
4,0.06732,0.68895,0.518814,1.355319,-0.507197,0.94704,0.7531,1.387268,0.468852,-0.293086,...,0,0,0,1,0,0,0,0,1,0


In [23]:
use_correlated_columns = True
if use_correlated_columns:
    X = X[correlated_columns]
    test_features = test_features[correlated_columns]

## Model Development and Evaluation

In [24]:
import xgboost
import time
import sklearn
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
def train_with_xgboost(hyperparameters, X_train,  y_train, X_val = None, y_val = None):
    keys = hyperparameters.keys()
    #for key in keys:
    #    hyperparameters[key] = sklearn.utils.shuffle(hyperparameters[key])
    best_index = {key:0 for key in keys}
    best_model = None
    best_parameters = None
    best_score = 10e8
    for (index, key) in enumerate(keys):
        print("Find best parameter for %s" %(key))
        items = hyperparameters[key]
        best_parameter = items[best_index[key]]
        for (key_index, item) in enumerate(items):
            params = {key2: hyperparameters[key2][best_index[key2]] if key2 != key else item for key2 in keys}
            print("Training with %s" %(params))
            model = xgboost.XGBRegressor(
                **params
            )
            model.fit(X_train, y_train, verbose=False)
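            # X_val and y_val default to None, so check for that before calling len().
            if X_val is not None and y_val is not None and len(X_val) != 0 and len(y_val) != 0: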
                result = evaluate(model, X_val, y_val)
            else:
                result = evaluate(model, X_train, y_train)
            score = result["rmlse"]
            if score < best_score:
                best_score = score
                best_index[key] = key_index
                best_parameter = item
                best_model = model
                best_parameters = params
        print("Best Parameter for %s: "%(key), best_parameter)
    return best_model, best_score, best_parameters
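
The function above tunes one hyperparameter at a time while holding the others at their best values so far, instead of trying every combination. For the grid used in this notebook that means 6 + 13 + 3 = 22 model fits rather than the 6 × 13 × 3 = 234 a full grid would need. The sketch below shows the same search pattern with a hypothetical scoring function in place of model training; `greedy_search` and `toy_score` are illustrative names, not part of the notebook.

```python
def greedy_search(grid, score_fn):
    # Tune one hyperparameter at a time, keeping the best value found
    # so far for every other hyperparameter (lower score is better).
    best = {key: values[0] for key, values in grid.items()}
    best_score = score_fn(best)
    for key, values in grid.items():
        for value in values:
            candidate = {**best, key: value}
            score = score_fn(candidate)
            if score < best_score:
                best, best_score = candidate, score
    return best, best_score

# Hypothetical score with its minimum at max_depth=6, learning_rate=0.1.
def toy_score(params):
    return (params["max_depth"] - 6) ** 2 + (params["learning_rate"] - 0.1) ** 2

grid = {"max_depth": [4, 5, 6, 7], "learning_rate": [0.03, 0.1, 0.15]}
best_params, best_score = greedy_search(grid, toy_score)
print(best_params, best_score)
```

The trade-off is that a one-at-a-time search can miss combinations that only work well together, which an exhaustive grid search would find.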

In [25]:
def split_data(X, y, strategy):
    if strategy not in ["full", "kfold", "train_validation_split"]:
        # Unknown strategy: yield nothing.
        return
    if strategy == "full":
        # Train on all data with no validation set, then stop.
        yield (0, X, y, [], [])
        return
    for index, (train_indices, valid_indices) in enumerate(KFold(n_splits=5, shuffle=True).split(X)):
        X_train = X.iloc[train_indices]
        X_val = X.iloc[valid_indices]
        y_train = y.iloc[train_indices]
        y_val = y.iloc[valid_indices]
        yield (index, X_train, y_train, X_val, y_val)
        if strategy != "kfold":
            # "train_validation_split" uses only the first fold.
            break
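
To see what the `KFold` splitter yields, the sketch below runs it on ten made-up rows. The `random_state` argument is an addition for reproducibility; the notebook does not set it, so its folds differ between runs.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten made-up rows, for illustration only.
X_toy = np.arange(10).reshape(-1, 1)

# random_state is an assumption added here; the notebook leaves it unset.
folds = list(KFold(n_splits=5, shuffle=True, random_state=0).split(X_toy))
for index, (train_indices, valid_indices) in enumerate(folds):
    print(index, train_indices, valid_indices)
```

Each row lands in exactly one validation fold, so every sample is used for validation once and for training in the other four folds.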

In [26]:
parameters = {
    "max_depth": list(range(4, 10)),
    "learning_rate": list(np.linspace(0.03, 0.15, 13)),
    "booster": ["gbtree", "gblinear", "dart"],
}
models = []
for strategy in ["full", "kfold"]:
    for (index, X_train, y_train, X_val, y_val) in split_data(X, y, strategy):
        begin = time.time()
        best_model, best_score, best_parameters = train_with_xgboost(parameters, X_train, y_train, X_val, y_val)
        print("Best RMLSE: ", best_score)
        print("Best Parameters: ", best_parameters)
        elapsed = time.time() - begin 
        print("Elapsed time: ", elapsed)
        submit(best_model, test_features, test_ids, "submission_%s_%d.csv"%(strategy, index))
        models.append(best_model)

Find best parameter for max_depth
Training with {'max_depth': 4, 'learning_rate': 0.03, 'booster': 'gbtree'}
R2 Score: 0.9280549089583293
MSE: 453742478.95352465
MAE: 15108.81006260702
MSLE: 0.013630965661721091
MAPE 8.644273
RMSE: 21301.231864695634
RMLSE 0.11675170998144598
Training with {'max_depth': 5, 'learning_rate': 0.03, 'booster': 'gbtree'}
R2 Score: 0.9453435187689506
MSE: 344706871.94334066
MAE: 13253.35116384846
MSLE: 0.00986149111389065
MAPE 7.4280357
RMSE: 18566.28320217433
RMLSE 0.0993050192948772
Training with {'max_depth': 6, 'learning_rate': 0.03, 'booster': 'gbtree'}
R2 Score: 0.9554093756898656
MSE: 281223640.41275483
MAE: 11976.297137200343
MSLE: 0.007358885660878677
MAPE 6.523532
RMSE: 16769.723921781027
RMLSE 0.08578393743195241
Training with {'max_depth': 7, 'learning_rate': 0.03, 'booster': 'gbtree'}
R2 Score: 0.9616767387074596
MSE: 241696706.87315243
MAE: 11037.376610659247
MSLE: 0.005638933998601864
MAPE 5.8081455
RMSE: 15546.597919582035
RMLSE 0.07509281609

## Create submission file

In [27]:
SalePrice = np.mean([model.predict(test_features) for model in models], axis=0)
submission = pd.DataFrame({"Id": test_ids, "SalePrice": SalePrice})
submission.to_csv("submission.csv", index=False)
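
The averaging step can be illustrated with made-up numbers. In this sketch, three hypothetical models each predict prices for two houses, and `axis=0` averages across models rather than across houses.

```python
import numpy as np

# Made-up predictions from three hypothetical models for two houses.
predictions = [
    np.array([200000.0, 150000.0]),
    np.array([210000.0, 140000.0]),
    np.array([190000.0, 160000.0]),
]

# axis=0 averages across models, giving one price per house.
ensemble = np.mean(predictions, axis=0)
print(ensemble)
```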

