# House Prices - Advanced Regression Techniques

In this notebook we will create a model to predict house prices for the **Kaggle** competition [House Prices - Advanced Regression Techniques](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques).  

This project uses the Ames Housing dataset provided as part of the Kaggle competition:  
*Anna Montoya and DataCanary. House Prices - Advanced Regression Techniques. Kaggle, 2016.*  
The dataset is used strictly for non-commercial, educational purposes.  

This is a streamlined notebook with only the necessary steps for preparing the data, training the model, and exporting the predictions for submission.  

The whole process (EDA, feature engineering and selection, finetuning of models, etc) is available in the housing_full_analysis.ipynb notebook on my project repository:  
https://github.com/Kev-HL/house-price-prediction

## 1. Setup and data loading

In [1]:
import numpy as np
import pandas as pd
from catboost import CatBoostRegressor, Pool

In [2]:
trainrawdata_path = '../data/raw/train.csv' # Relative path to the training dataset
traindf = pd.read_csv(trainrawdata_path)

testrawdata_path = '../data/raw/test.csv' # Relative path to the test dataset
testdf = pd.read_csv(testrawdata_path)

## 2. Data preprocessing

### 2.1. Handling missing values

Even though Catboost can handle missing values natively, we will repeat the same steps we did in the main notebook, for consistency.

Handle missing values on traindf:

In [3]:
# Fill NaN values in PoolQC with 'None'
traindf['PoolQC'] = traindf['PoolQC'].fillna('None')

# Drop rows where MiscVal is 0 and MiscFeature is not NaN
traindf = traindf.drop(index=traindf[(traindf['MiscVal'] == 0) & (traindf['MiscFeature'].notna())].index)

# Fill NaN values in MiscFeature with 'None'
traindf['MiscFeature'] = traindf['MiscFeature'].fillna('None')

# Fill NaN values in Alley with 'None'
traindf['Alley'] = traindf['Alley'].fillna('None')

# Fill NaN values in Fence with 'None'
traindf['Fence'] = traindf['Fence'].fillna('None')

# Drop NaN values where both MasVnrType and MasVnrArea are NaN
traindf = traindf.drop(index=traindf[(traindf['MasVnrType'].isnull()) & (traindf['MasVnrArea'].isnull())].index)
# Drop rows where MasVnrArea is 1.0
traindf = traindf.drop(index=traindf[(traindf['MasVnrArea'] == 1.0)].index)
# Fill NaN values in MasVnrType and MasVnrArea based on Neighborhood
# Create boolean mask for those rows where MasVnrType is NaN and MasVnrArea is not 0
mask1 = traindf['MasVnrType'].isna() & (traindf['MasVnrArea'] != 0)
# Create boolean mask for those rows where MasVnrType has a valid value and MasVnrArea is 0
mask2 = ~traindf['MasVnrType'].isna() & (traindf['MasVnrArea'] == 0)
# Group by Neighborhood and get the mode of MasVnrType by Neighborhood and the median of MasVnrArea.
MasVnrType_mode_Neighborhood = (traindf.groupby('Neighborhood')['MasVnrType'].agg(lambda x: x.mode().iloc[0] if not x.mode().empty else 'None'))
MasVnrArea_median_Neighborhood = traindf.groupby('Neighborhood')['MasVnrArea'].median()
# Map the mode values to the original DataFrame
traindf.loc[mask1, 'MasVnrType'] = traindf.loc[mask1, 'Neighborhood'].map(MasVnrType_mode_Neighborhood)
traindf.loc[mask2, 'MasVnrArea'] = traindf.loc[mask2, 'Neighborhood'].map(MasVnrArea_median_Neighborhood)
# Drop rows where MasVnrArea is 0 and MasVnrType is not NaN
traindf = traindf.drop(index=traindf[(traindf['MasVnrArea'] == 0) & ~(traindf['MasVnrType'].isnull())].index)
# Fill NaN values in MasVnrType with 'None' for remaining NaN values
traindf['MasVnrType'] = traindf['MasVnrType'].fillna('None')

# Fill NaN values in FireplaceQu with 'None'
traindf['FireplaceQu'] = traindf['FireplaceQu'].fillna('None')

# Fill NaN values in LotFrontage based on the median LotFrontage for each Neighborhood
# Create boolean mask for those rows where LotFrontage is NA.
mask = traindf['LotFrontage'].isna()
# Group by Neighborhood and get the mode of LotFrontage by Neighborhood
LotFrontage_median_Neighborhood = traindf.groupby('Neighborhood')['LotFrontage'].median()
# Map the median values to the original DataFrame
traindf.loc[mask, 'LotFrontage'] = traindf.loc[mask, 'Neighborhood'].map(LotFrontage_median_Neighborhood)

# Fill NaN values in GarageType, GarageFinish, GarageQual, GarageCond with 'None'
for var in ['GarageType', 'GarageFinish', 'GarageQual', 'GarageCond']:
    traindf[var] = traindf[var].fillna('None')
# Fill NaN values in GarageYrBlt with -1
traindf['GarageYrBlt'] = traindf['GarageYrBlt'].fillna(-1)

# Fill NaN values in the basement categorical columns with 'None' if all basement-related columns are NaN
BsmtCatCols = ['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']
mask = traindf[BsmtCatCols].isnull().all(axis=1)
traindf.loc[mask, BsmtCatCols] = traindf.loc[mask, BsmtCatCols].fillna('None')

# Drop remaining NaN values in basement-related columns
traindf = traindf.drop(index=traindf[traindf['BsmtExposure'].isnull() | traindf['BsmtFinType2'].isnull()].index)

# Drop rows where Electrical is NaN
traindf = traindf.drop(index=traindf[traindf['Electrical'].isnull()].index)

Check there are no more missing values on traindf:

In [4]:
assert traindf.isnull().sum().sum() == 0, f"Missing values found in traindf:\n{traindf.isnull().sum()[traindf.isnull().sum() > 0]}"

Sanity check to ensure the shape of the dataset is the same after these changes than in the 'full analysis' notebook (1444, 81):

In [5]:
assert traindf.shape == (1444, 81), f"Unexpected traindf shape: {traindf.shape}, expected (1444, 81)"

Handle missing values on testdf:

In [6]:
# Replace NA with 'None' in every missing PoolQC that has PoolArea = 0
mask = (testdf['PoolArea'] == 0) & (testdf['PoolQC'].isnull())
testdf.loc[mask, 'PoolQC'] = 'None'

# Replace NA with 'None' in every missing MiscFeature that has MiscVal = 0
mask = (testdf['MiscVal'] == 0) & (testdf['MiscFeature'].isnull())
testdf.loc[mask, 'MiscFeature'] = 'None'

# Replace NA with 'None' in every missing Alley and Fence
testdf['Alley'] = testdf['Alley'].fillna('None')
testdf['Fence'] = testdf['Fence'].fillna('None')

# For those with both MasVnrtype and MasVnrArea missing, we will first replace the area with the median of the neighborhood from the training set
mask = testdf['MasVnrType'].isna() & (testdf['MasVnrArea'].isna())
testdf.loc[mask, 'MasVnrArea'] = testdf.loc[mask, 'Neighborhood'].map(MasVnrArea_median_Neighborhood)
# Then replace the MasVnrType with the mode of the neighborhood from the training set on those rows with a valid MasVnrArea (>0)
mask = testdf['MasVnrType'].isna() & (testdf['MasVnrArea'] > 0)
testdf.loc[mask, 'MasVnrType'] = testdf.loc[mask, 'Neighborhood'].map(MasVnrType_mode_Neighborhood)
# And for those with MasVnrArea = 0 and MasVnrType missing, we will replace the type with 'None'
mask = testdf['MasVnrType'].isna() & (testdf['MasVnrArea'] == 0)
testdf.loc[mask, 'MasVnrType'] = 'None'

# Replace NA with 'None' in every missing FireplaceQu that has Fireplaces = 0
mask = (testdf['Fireplaces'] == 0) & (testdf['FireplaceQu'].isnull())
testdf.loc[mask, 'FireplaceQu'] = 'None'

# Replace NA with the median LotFrontage of the neighborhood from the training set
mask = testdf['LotFrontage'].isna()
testdf.loc[mask, 'LotFrontage'] = testdf.loc[mask, 'Neighborhood'].map(LotFrontage_median_Neighborhood)

# Replace NA with 'None' in every missing categorical Garage variables, with -1 in GarageYrBlt and with 0 in GarageArea and GarageCars
# But only for those entries where all Garage variables mean there is no garage
mask = (
    ((testdf['GarageArea'].isnull()) | (testdf['GarageArea'] == 0)) &
    ((testdf['GarageCars'].isnull()) | (testdf['GarageCars'] == 0)) &
    (testdf['GarageQual'].isnull()) &
    (testdf['GarageType'].isnull()) &
    (testdf['GarageFinish'].isnull()) &
    (testdf['GarageCond'].isnull()) &
    ((testdf['GarageYrBlt'].isnull()) | (testdf['GarageYrBlt'] == 0))
)
for var in ['GarageType', 'GarageFinish', 'GarageQual', 'GarageCond']:
    testdf.loc[mask, var] = 'None'
testdf.loc[mask, 'GarageYrBlt'] = -1
testdf.loc[mask, 'GarageArea'] = 0
testdf.loc[mask, 'GarageCars'] = 0

# Replace NA with 'None' in every missing categorical Basement variables, and with 0 in the numerical ones
# But only for those entries where all Basement variables mean there is no basement
mask = (
    ((testdf['BsmtFinSF1'].isnull()) | (testdf['BsmtFinSF1'] == 0)) &
    ((testdf['BsmtFinSF2'].isnull()) | (testdf['BsmtFinSF2'] == 0)) &
    ((testdf['BsmtUnfSF'].isnull()) | (testdf['BsmtUnfSF'] == 0)) &
    ((testdf['TotalBsmtSF'].isnull()) | (testdf['TotalBsmtSF'] == 0)) &
    ((testdf['BsmtFullBath'].isnull()) | (testdf['BsmtFullBath'] == 0)) &
    ((testdf['BsmtHalfBath'].isnull()) | (testdf['BsmtHalfBath'] == 0)) &
    (testdf['BsmtQual'].isnull()) &
    (testdf['BsmtCond'].isnull()) &
    (testdf['BsmtExposure'].isnull()) &
    (testdf['BsmtFinType1'].isnull()) &
    (testdf['BsmtFinType2'].isnull())
)
for var in ['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']:
    testdf.loc[mask, var] = 'None'
for var in ['BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath']:
    testdf.loc[mask, var] = 0

# Replace NA with the mode of the neighborhood from the training set for MSzoning
mask = testdf['MSZoning'].isna()
MSZoning_mode_Neighborhood = (traindf.groupby('Neighborhood')['MSZoning'].agg(lambda x: x.mode().iloc[0] if not x.mode().empty else 'None'))
testdf.loc[mask, 'MSZoning'] = testdf.loc[mask, 'Neighborhood'].map(MSZoning_mode_Neighborhood)

# Replace NA with the mode of the neighborhood from the training set for PoolQC
mask = testdf['PoolQC'].isna()
PoolQC_mode_Neighborhood = (traindf.groupby('Neighborhood')['PoolQC'].agg(lambda x: x.mode().iloc[0] if not x.mode().empty else 'None'))
testdf.loc[mask, 'PoolQC'] = testdf.loc[mask, 'Neighborhood'].map(PoolQC_mode_Neighborhood)

# Replace NA with the mode of the neighborhood from the training set for Utilities
mask = testdf['Utilities'].isna()
Utilities_mode_Neighborhood = (traindf.groupby('Neighborhood')['Utilities'].agg(lambda x: x.mode().iloc[0] if not x.mode().empty else 'None'))
testdf.loc[mask, 'Utilities'] = testdf.loc[mask, 'Neighborhood'].map(Utilities_mode_Neighborhood)

# Replace NA with 'Typ' in every missing Functional
mask = testdf['Functional'].isna()
testdf.loc[mask, 'Functional'] = 'Typ'

# Replace NA with the mode of the neighborhood from the training set for Exterior1st
mask = testdf['Exterior1st'].isna()
Exterior1st_mode_Neighborhood = (traindf.groupby('Neighborhood')['Exterior1st'].agg(lambda x: x.mode().iloc[0] if not x.mode().empty else 'None'))
testdf.loc[mask, 'Exterior1st'] = testdf.loc[mask, 'Neighborhood'].map(Exterior1st_mode_Neighborhood)

# Replace NA with the mode of the neighborhood from the training set for Exterior2nd
mask = testdf['Exterior2nd'].isna()
Exterior2nd_mode_Neighborhood = (traindf.groupby('Neighborhood')['Exterior2nd'].agg(lambda x: x.mode().iloc[0] if not x.mode().empty else 'None'))
testdf.loc[mask, 'Exterior2nd'] = testdf.loc[mask, 'Neighborhood'].map(Exterior2nd_mode_Neighborhood)

# Replace NA with the mode of the neighborhood from the training set for KitchenQual
mask = testdf['KitchenQual'].isna()
KitchenQual_mode_Neighborhood = (traindf.groupby('Neighborhood')['KitchenQual'].agg(lambda x: x.mode().iloc[0] if not x.mode().empty else 'None'))
testdf.loc[mask, 'KitchenQual'] = testdf.loc[mask, 'Neighborhood'].map(KitchenQual_mode_Neighborhood)

# Replace NA with 'Other' in every missing MiscFeature
mask = testdf['MiscFeature'].isna()
testdf.loc[mask, 'MiscFeature'] = 'Other'

# Replace NA with the mode of the neighborhood from the training set for SaleType
mask = testdf['SaleType'].isna()
SaleType_mode_Neighborhood = (traindf.groupby('Neighborhood')['SaleType'].agg(lambda x: x.mode().iloc[0] if not x.mode().empty else 'None'))
testdf.loc[mask, 'SaleType'] = testdf.loc[mask, 'Neighborhood'].map(SaleType_mode_Neighborhood)

# With the assumption that if GarageCars is null, then there is no garage, we will replace the missing values of the categorical variables with 'None'
# and the numerical variables with 0 or -1, depending on the variable
row_label = testdf[testdf['GarageCars'].isnull()].index[0]
for var in ['GarageType', 'GarageFinish', 'GarageQual', 'GarageCond']:
    testdf.loc[row_label, var] = 'None'
testdf.loc[row_label, 'GarageArea'] = 0.0
testdf.loc[row_label, 'GarageCars'] = 0.0
testdf.loc[row_label, 'GarageYrBlt'] = -1

# For the rest of the missing categorical Garage values, we will replace them with the mode of the neighborhood from the training set
for var in ['GarageFinish', 'GarageQual', 'GarageCond', 'BsmtExposure', 'BsmtQual', 'BsmtCond']:
    mask = testdf[var].isna()
    mode = (traindf.groupby('Neighborhood')[var].agg(lambda x: x.mode().iloc[0] if not x.mode().empty else 'None'))
    testdf.loc[mask, var] = testdf.loc[mask, 'Neighborhood'].map(mode)

# And for GarageYrBlt, the only numerical variable, we will replace it with the median of the neighborhood from the training set
GarageYrBlt_median_Neighborhood = traindf.groupby('Neighborhood')['GarageYrBlt'].median()
row_label = testdf[testdf['GarageYrBlt'].isnull()].index[0]
neighborhood = testdf.loc[row_label, 'Neighborhood']
testdf.loc[row_label, 'GarageYrBlt'] = GarageYrBlt_median_Neighborhood[neighborhood]

Check there are no more missing values in testdf:

In [7]:
assert testdf.isnull().sum().sum() == 0, f"Missing values found in testdf:\n{testdf.isnull().sum()[testdf.isnull().sum() > 0]}"

### 2.2. Feature engineering

Creating new features:

In [8]:
traindf['TotalBathrooms'] = traindf['FullBath'] + (0.5 * traindf['HalfBath']) + traindf['BsmtFullBath'] + (0.5 * traindf['BsmtHalfBath'])
testdf['TotalBathrooms'] = testdf['FullBath'] + (0.5 * testdf['HalfBath']) + testdf['BsmtFullBath'] + (0.5 * testdf['BsmtHalfBath'])

traindf['TotalSF'] = traindf['TotalBsmtSF'] + traindf['1stFlrSF'] + traindf['2ndFlrSF']
testdf['TotalSF'] = testdf['TotalBsmtSF'] + testdf['1stFlrSF'] + testdf['2ndFlrSF']

traindf['TotalFinSF'] = traindf['BsmtFinSF1'] + traindf['BsmtFinSF2'] + traindf['1stFlrSF'] + traindf['2ndFlrSF']
testdf['TotalFinSF'] = testdf['BsmtFinSF1'] + testdf['BsmtFinSF2'] + testdf['1stFlrSF'] + testdf['2ndFlrSF']

traindf['Has2ndFloor'] = (traindf['2ndFlrSF'] > 0).astype(int)
testdf['Has2ndFloor'] = (testdf['2ndFlrSF'] > 0).astype(int)

traindf['HasBasement'] = (traindf['TotalBsmtSF'] > 0).astype(int)
testdf['HasBasement'] = (testdf['TotalBsmtSF'] > 0).astype(int)

traindf['HasGarage'] = (traindf['GarageArea'] > 0).astype(int)
testdf['HasGarage'] = (testdf['GarageArea'] > 0).astype(int)

traindf['HasPool'] = (traindf['PoolArea'] > 0).astype(int)
testdf['HasPool'] = (testdf['PoolArea'] > 0).astype(int)

traindf['HouseAge'] = traindf['YrSold'] - traindf['YearBuilt']
testdf['HouseAge'] = testdf['YrSold'] - testdf['YearBuilt']

traindf['GarageAge'] = traindf['YrSold'] - traindf['GarageYrBlt']
traindf.loc[traindf['GarageYrBlt'] == -1, 'GarageAge'] = -1 
testdf['GarageAge'] = testdf['YrSold'] - testdf['GarageYrBlt']
testdf.loc[testdf['GarageYrBlt'] == -1, 'GarageAge'] = -1 

traindf['RemodelAge'] = traindf['YrSold'] - traindf['YearRemodAdd']
traindf.loc[traindf['RemodelAge'] < 0, 'RemodelAge'] = 0
testdf['RemodelAge'] = testdf['YrSold'] - testdf['YearRemodAdd']
testdf.loc[testdf['RemodelAge'] < 0, 'RemodelAge'] = 0

traindf['WasRemodel'] = (traindf['YearRemodAdd'] != traindf['YearBuilt']).astype(int)
testdf['WasRemodel'] = (testdf['YearRemodAdd'] != testdf['YearBuilt']).astype(int)

traindf['QualityIndex'] = traindf['OverallQual'] * traindf['OverallCond']
testdf['QualityIndex'] = testdf['OverallQual'] * testdf['OverallCond']

traindf['LotRatio'] = traindf['GrLivArea'] / traindf['LotArea']
testdf['LotRatio'] = testdf['GrLivArea'] / testdf['LotArea']

Applying domain-transformations:

In [9]:
# Training set log transformations
traindf['LotFrontage'] = np.log(traindf['LotFrontage'])
traindf['LotArea'] = np.log(traindf['LotArea'])
traindf['1stFlrSF'] = np.log(traindf['1stFlrSF'])
traindf['GrLivArea'] = np.log(traindf['GrLivArea'])
traindf['TotalSF'] = np.log(traindf['TotalSF'])
traindf['TotalFinSF'] = np.log(traindf['TotalFinSF'])
traindf['TotalBsmtSF'] = np.log1p(traindf['TotalBsmtSF'])
traindf['SalePrice'] = np.log(traindf['SalePrice']) # Target variable transformation

# Test set log transformations
testdf['LotFrontage'] = np.log(testdf['LotFrontage'])
testdf['LotArea'] = np.log(testdf['LotArea'])
testdf['1stFlrSF'] = np.log(testdf['1stFlrSF'])
testdf['GrLivArea'] = np.log(testdf['GrLivArea'])
testdf['TotalSF'] = np.log(testdf['TotalSF'])
testdf['TotalFinSF'] = np.log(testdf['TotalFinSF'])
testdf['TotalBsmtSF'] = np.log1p(testdf['TotalBsmtSF'])

Dropping unnecessary columns:

In [10]:
drop_columns = ['Id','MiscVal', 'MiscFeature', 'Utilities', 'OpenPorchSF']

traindf.drop(columns = drop_columns, inplace=True)
test_id = testdf['Id'] # Save the Id column for submission later
testdf.drop(columns = drop_columns, inplace=True)

### 2.3. Sanitize data types

In [11]:
cat_features = [
    'MSSubClass', 'MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour',
    'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
    'HouseStyle', 'OverallQual', 'OverallCond', 'RoofStyle', 'RoofMatl', 'Exterior1st',
    'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
    'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC',
    'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType',
    'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'Fence',
    'MoSold', 'SaleType', 'SaleCondition', 'Has2ndFloor', 'HasBasement',
    'HasGarage', 'HasPool', 'WasRemodel'
]

num_features = [
    'LotFrontage', 'LotArea', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1',
    'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
    'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr',
    'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars',
    'GarageArea', 'WoodDeckSF', 'EnclosedPorch', '3SsnPorch',
    'ScreenPorch', 'PoolArea', 'YrSold', 'TotalBathrooms', 'TotalSF',
    'TotalFinSF', 'HouseAge', 'GarageAge', 'RemodelAge', 'QualityIndex', 'LotRatio'
]

for cat in cat_features:
    traindf[cat] = traindf[cat].astype(str)
    testdf[cat] = testdf[cat].astype(str)

for num in num_features:
    traindf[num] = traindf[num].astype(np.float32)
    testdf[num] = testdf[num].astype(np.float32)

traindf['SalePrice'] = traindf['SalePrice'].astype(np.float32)

### 2.4. Split features and labels

In [12]:
Y = traindf['SalePrice']
X = traindf.drop('SalePrice', axis=1)

## 3. Model training

In [13]:
# Create Pools (CatBoost's data structure)
train_pool = Pool(data=X, label=Y, cat_features=cat_features)
test_pool = Pool(data=testdf, cat_features=cat_features)

In [14]:
# Create final model
Final_model = CatBoostRegressor(
    iterations=750,  # Number of boosting iterations
    # The best model was trained with 3000 iterations and early stopping, and achieved best results at 687 iterations
    # Given that the final model is trained with the full training set (instead of the 70/20/10 split), we will increase the number of iterations to 750 (+ ~10%)
    learning_rate=0.03,
    depth=6,
    l2_leaf_reg=1,
    bagging_temperature=1.0,
    eval_metric='RMSE',
    random_seed=33,
    verbose=100  # Print progress after how many iterations
)

# Fit the model
Final_model.fit(train_pool)

0:	learn: 0.3909216	total: 50.1ms	remaining: 37.6s
100:	learn: 0.1255822	total: 469ms	remaining: 3.01s
200:	learn: 0.0979779	total: 916ms	remaining: 2.5s
300:	learn: 0.0875046	total: 1.37s	remaining: 2.04s
400:	learn: 0.0798132	total: 1.83s	remaining: 1.59s
500:	learn: 0.0733863	total: 2.29s	remaining: 1.14s
600:	learn: 0.0676355	total: 2.74s	remaining: 680ms
700:	learn: 0.0627697	total: 3.2s	remaining: 224ms
749:	learn: 0.0607096	total: 3.43s	remaining: 0us


<catboost.core.CatBoostRegressor at 0x7e010fb5b050>

## 4. Prediction and export for submission

Finally, let's predict the prices of the test set, and export the results for submission.

In [15]:
test_pred = Final_model.predict(test_pool)

In [16]:
test_pred_df = pd.DataFrame({
    'Id': test_id,
    'SalePrice': test_pred  # Convert predictions back to original scale
})

In [17]:
test_pred_df['SalePrice'] = np.expm1(test_pred_df['SalePrice'])  # Convert log predictions back to original scale

In [18]:
test_pred_df.head()

Unnamed: 0,Id,SalePrice
0,1461,120789.03527
1,1462,163266.007771
2,1463,187628.270793
3,1464,196560.812107
4,1465,183218.249905


In [19]:
test_pred_df.to_csv('../outputs/submission.csv', index=False)