# House Prices Prediction - Regression

In this workbook, I work through the housing prices dataset to build a regression model that predicts house prices.

1. Basic data cleaning and missing value handling.
2. Explratory data analysis.
3. Feature engineering
4. Data splitting and preprocessing pipeline
5. Model training and evaluation
6. Hyperparameter tuning
7. Model Ensembling
8. Submission preparation

In [3]:
# import relevant packages

# Data manipulation
import pandas as pd
import numpy as np
np.seterr(all='ignore')  # Ignore numpy warnings
# Data visualization
import plotly.graph_objects as go
import plotly.express as px

# Normalization
import scipy.stats as stats

# ipynb readability
from IPython.display import display, HTML

# remove warnings
import warnings
warnings.filterwarnings('ignore')


## 1. Basic Data Exploration

1. Import the data
2. Look at summary stats
3. Evaluate null values

In [4]:
# import the data
df = pd.read_csv('/Users/chamodh/Documents/Projs/kaggle-competitions/house-price-pred/datasets/train.csv')

In [5]:
def create_scrollable_table(df, table_id, title):
    html = f'<h3>{title}</h3>'
    html += f'<div id="{table_id}" style="height:300px; overflow:auto;">'
    html += df.to_html()
    html += '</div>'
    return html

In [6]:
df.shape

(1460, 81)

In [7]:
# Summary stats for numerical features
numerical_features = df.select_dtypes(include=[int])
summary_stats = numerical_features.describe().T
html_numerical = create_scrollable_table(summary_stats, numerical_features, "Summary statistics for numerical features")

display(HTML(html_numerical))

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Id,1460.0,730.5,421.610009,1.0,365.75,730.5,1095.25,1460.0
MSSubClass,1460.0,56.89726,42.300571,20.0,20.0,50.0,70.0,190.0
LotArea,1460.0,10516.828082,9981.264932,1300.0,7553.5,9478.5,11601.5,215245.0
OverallQual,1460.0,6.099315,1.382997,1.0,5.0,6.0,7.0,10.0
OverallCond,1460.0,5.575342,1.112799,1.0,5.0,5.0,6.0,9.0
YearBuilt,1460.0,1971.267808,30.202904,1872.0,1954.0,1973.0,2000.0,2010.0
YearRemodAdd,1460.0,1984.865753,20.645407,1950.0,1967.0,1994.0,2004.0,2010.0
BsmtFinSF1,1460.0,443.639726,456.098091,0.0,0.0,383.5,712.25,5644.0
BsmtFinSF2,1460.0,46.549315,161.319273,0.0,0.0,0.0,0.0,1474.0
BsmtUnfSF,1460.0,567.240411,441.866955,0.0,223.0,477.5,808.0,2336.0


In [8]:
# Summary stats for categorical variables
categorical_features = df.select_dtypes(include=[object])
cat_summary_stats = categorical_features.describe().T
html_categorical = create_scrollable_table(cat_summary_stats, 'categorical_features', 'Summary statistics for categorical features.')

display(HTML(html_categorical))

Unnamed: 0,count,unique,top,freq
MSZoning,1460,5,RL,1151
Street,1460,2,Pave,1454
Alley,91,2,Grvl,50
LotShape,1460,4,Reg,925
LandContour,1460,4,Lvl,1311
Utilities,1460,2,AllPub,1459
LotConfig,1460,5,Inside,1052
LandSlope,1460,3,Gtl,1382
Neighborhood,1460,25,NAmes,225
Condition1,1460,9,Norm,1260


In [9]:
# Null values in the variables
null_values = df.isnull().sum()
html_null_values = create_scrollable_table(null_values.to_frame(), 'null_values', 'Null values in the dataset')

# Percentage of missing values for each feature
missing_percentage = (df.isnull().sum() / len(df)) * 100
html_missing_percentage = create_scrollable_table(missing_percentage.to_frame(), 'missing_percentage', 'Percentage of missing values')

display(HTML(html_null_values + html_missing_percentage))

Unnamed: 0,0
Id,0
MSSubClass,0
MSZoning,0
LotFrontage,259
LotArea,0
Street,0
Alley,1369
LotShape,0
LandContour,0
Utilities,0

Unnamed: 0,0
Id,0.0
MSSubClass,0.0
MSZoning,0.0
LotFrontage,17.739726
LotArea,0.0
Street,0.0
Alley,93.767123
LotShape,0.0
LandContour,0.0
Utilities,0.0


In [10]:
# Exploring rows with missing columns
rows_with_missing_values = df[df.isnull().any(axis=1)]
html_rows_with_missing_values = create_scrollable_table(rows_with_missing_values,
                                                        'rows_with_missing_values',
                                                        'Rows with missing values')

display(HTML(html_rows_with_missing_values))

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000
5,6,50,RL,85.0,14115,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Mitchel,Norm,Norm,1Fam,1.5Fin,5,5,1993,1995,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,TA,Wood,Gd,TA,No,GLQ,732,Unf,0,64,796,GasA,Ex,Y,SBrkr,796,566,0,1362,1,0,1,1,1,1,TA,5,Typ,0,,Attchd,1993.0,Unf,2,480,TA,TA,Y,40,30,0,320,0,0,,MnPrv,Shed,700,10,2009,WD,Normal,143000
6,7,20,RL,75.0,10084,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Somerst,Norm,Norm,1Fam,1Story,8,5,2004,2005,Gable,CompShg,VinylSd,VinylSd,Stone,186.0,Gd,TA,PConc,Ex,TA,Av,GLQ,1369,Unf,0,317,1686,GasA,Ex,Y,SBrkr,1694,0,0,1694,1,0,2,0,3,1,Gd,7,Typ,1,Gd,Attchd,2004.0,RFn,2,636,TA,TA,Y,255,57,0,0,0,0,,,,0,8,2007,WD,Normal,307000
7,8,60,RL,,10382,Pave,,IR1,Lvl,AllPub,Corner,Gtl,NWAmes,PosN,Norm,1Fam,2Story,7,6,1973,1973,Gable,CompShg,HdBoard,HdBoard,Stone,240.0,TA,TA,CBlock,Gd,TA,Mn,ALQ,859,BLQ,32,216,1107,GasA,Ex,Y,SBrkr,1107,983,0,2090,1,0,2,1,3,1,TA,7,Typ,2,TA,Attchd,1973.0,RFn,2,484,TA,TA,Y,235,204,228,0,0,0,,,Shed,350,11,2009,WD,Normal,200000
8,9,50,RM,51.0,6120,Pave,,Reg,Lvl,AllPub,Inside,Gtl,OldTown,Artery,Norm,1Fam,1.5Fin,7,5,1931,1950,Gable,CompShg,BrkFace,Wd Shng,,0.0,TA,TA,BrkTil,TA,TA,No,Unf,0,Unf,0,952,952,GasA,Gd,Y,FuseF,1022,752,0,1774,0,0,2,0,2,2,TA,8,Min1,2,TA,Detchd,1931.0,Unf,2,468,Fa,TA,Y,90,0,205,0,0,0,,,,0,4,2008,WD,Abnorml,129900
9,10,190,RL,50.0,7420,Pave,,Reg,Lvl,AllPub,Corner,Gtl,BrkSide,Artery,Artery,2fmCon,1.5Unf,5,6,1939,1950,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,BrkTil,TA,TA,No,GLQ,851,Unf,0,140,991,GasA,Ex,Y,SBrkr,1077,0,0,1077,1,0,1,0,2,2,TA,5,Typ,2,TA,Attchd,1939.0,RFn,1,205,Gd,TA,Y,0,4,0,0,0,0,,,,0,1,2008,WD,Normal,118000


In [11]:
df.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

## Explore dependant variable
- Should it be normalized?
- Normalize dependant variable

In [12]:
# Fit a normal distribution to the SalePrice data
mu, sigma = stats.norm.fit(df['SalePrice'])

# Create a histogram of `SalePrice` column
hist_saleprice = go.Histogram(x=df['SalePrice'],
                              nbinsx=50,
                              name='Histogram',
                              opacity=0.75,
                              histnorm='probability density',
                              marker=dict(color='purple')
                              )

# Calculate the normal distribution based on the fitted parameters
x_norm = np.linspace(df['SalePrice'].min(), df['SalePrice'].max(), 100)
y_norm = stats.norm.pdf(x_norm, mu, sigma)

# Create a normal distribution overlay
norm_saleprice = go.Scatter(x=x_norm,
                       y=y_norm,
                       mode='lines',
                       name=f"Normal dist. (mu = {mu:.2f}, sigma= {sigma:.2f})",
                       line=dict(color='green')
                       )

# Combine the histogram and overlay
fig = go.Figure(data = [hist_saleprice, norm_saleprice])

# Configure layout
fig.update_layout(
    title='SalePrice distribution',
    xaxis_title='SalePrice',
    yaxis_title='Density',
    legend_title_text='Fitted Normal Distribution',
    plot_bgcolor = 'rgba(32, 32, 32, 1)',
    paper_bgcolor = 'rgba(32, 32, 32, 1)',
    font=dict(color = 'white')
)

# Show the plot
fig.show()

In [13]:
# Create Q-Q plot
qq_data = stats.probplot(df['SalePrice'], dist='norm')
qq_fig = px.scatter(x=qq_data[0][0],
                    y=qq_data[0][1],
                    labels={'x': 'Theoretical Quantiles', 'y': 'Ordered Values'},
                    color_discrete_sequence = ['purple'])
qq_fig.update_layout(
    title='Q-Q Plot',
    plot_bgcolor = 'rgba(32, 32, 32, 1)',
    paper_bgcolor = 'rgba(32, 32, 32, 1)',
    font = dict(color = 'white')
)

# Calculate the line of best fit
slope, intercept, r_value, p_value, std_err = stats.linregress(qq_data[0][0], qq_data[0][1])
line_x = np.array(qq_data[0][0])
line_y = intercept + slope * line_x

# Add the line of best fit to the Q-Q plot
line_data = go.Scatter(x=line_x,
                       y=line_y,
                       mode='lines',
                       name='Normal Line',
                       line=dict(color='green')
                       )
qq_fig.add_trace(line_data)

# Show the figure
qq_fig.show()

## Exploratory Analysis

Hypothesis to prove from the data.

1. Distribution of dweling types and their relation to sale prices.
2. Does zoning impact sale price
3. Does street and alley access types affect on sale price.
4. Is there a correlation between living area and sale price.
5. Is there a correlation between property age and sale price.
6. Does pricing change year to year.
7. What is the average property price by property shape.

In [14]:
# Distribution of dweling types and their relation to sale prices.
dweling_types = df['BldgType'].value_counts()
dweling_prices = df.groupby('BldgType')['SalePrice'].mean()

# Format labels for the second graph
formatted_dweling_types = ['$' + f'{value:,.2f}' for value in dweling_prices.values]

# Create bar charts
# Plot the distribution of building types
fig1 = go.Figure(data=[go.Bar(
    x=dweling_types.index,
    y=dweling_types.values,
    marker_color='rgb(76, 175, 80)',
    text=dweling_types.values,
    textposition='outside',
    width=0.4,
    marker=dict(line=dict(width=2, color='rgba(0, 0, 0, 1)'), opacity=1)
)])

fig1.update_layout(
    title='Distribution of Building Types',
    xaxis_title='Building Type',
    yaxis_title='Count',
    plot_bgcolor = 'rgba(34, 34, 34, 1)',
    paper_bgcolor = 'rgba(34, 34, 34, 1)',
    font = dict(color = 'white')
)

fig1.show()

In [15]:
# Plot average sale price by building type

fig2 = go.Figure(data=[go.Bar(
    x=dweling_prices.index,
    y=dweling_prices.values,
    marker_color='rgb(156, 39, 76)',
    text=formatted_dweling_types,
    textposition='outside',
    width=0.4,
    marker=dict(line=dict(width=2, color='rgba(0,0,0,1)'), opacity=1)
)])

fig2.update_layout(
    title='Average Sale Price by Building Type',
    xaxis_title='Building Type',
    yaxis_title='Avg. Price',
    plot_bgcolor = 'rgba(34, 34, 34, 1)',
    paper_bgcolor = 'rgba(34, 34, 34, 1)',
    font = dict(color = 'white')
)

fig2.show()

In [16]:
# Zoning impact of sale price
zoning_prices = df.groupby('MSZoning')['SalePrice'].mean()
fig3 = px.bar(
    x=zoning_prices.index,
    y=zoning_prices.values,
    title='Average Sale Price by Zoning',
    color_discrete_sequence=['purple', 'green'],
    text=zoning_prices.values,
    template='plotly_dark'
)

fig3.update_traces(texttemplate='$%{text:,.0f}', textposition='outside')
fig3.update_xaxes(title='Zoning')
fig3.update_yaxes(title='Sale Price', tickprefix='$', tickformat=',')
fig3.update_layout(uniformtext_minsize=8, uniformtext_mode='hide')

fig3.show()

In [17]:
# Street and alley access on sales price

street_prices = df.groupby('Street')['SalePrice'].mean()
alley_prices = df.groupby('Alley')['SalePrice'].mean()

# Street prices
colors_street = np.where(street_prices.index == 'Pave', 'purple', 'green')

fig5 = px.bar(
    x=street_prices.index,
    y=street_prices.values,
    title='Average Sale Price by Street',
    template='plotly_dark',
    text=street_prices.values,
    color=colors_street,
    color_discrete_sequence=['purple', 'green']
)
fig5.update_traces(texttemplate='$%{text:,.0f}', textposition='outside')
fig5.update_xaxes(title='Street Type')
fig5.update_yaxes(title='Sale Price', tickprefix='$', tickformat=',')
fig5.update_layout(showlegend=False)

# Alley prices
colors_alley = np.where(alley_prices.index == 'Pave', 'purple', 'green')

fig6 = px.bar(
    x=alley_prices.index,
    y=alley_prices.values,
    title='Average Sale Price by Alley',
    template='plotly_dark',
    text=alley_prices.values,
    color=colors_alley,
    color_discrete_sequence=['purple', 'green']
)

fig6.update_traces(texttemplate='$%{text:,.0f}', textposition='outside')
fig6.update_xaxes(title='Alley Type')
fig6.update_yaxes(title='Sale Price', tickprefix='$', tickformat=',')
fig6.update_layout(showlegend=False)

# Show plots
fig5.show()
fig6.show()

In [18]:
# Average sale price by property shape

colors = px.colors.qualitative.Plotly

shape_prices= df.groupby('LotShape')['SalePrice'].mean()
contour_prices = df.groupby('LandContour')['SalePrice'].mean()

# Shape prices

fig7 = px.bar(
    x=shape_prices.index,
    y=shape_prices.values,
    title='Average Sale Price by Lot Shape',
    template='plotly_dark',
    text=shape_prices.values
)

fig7.update_traces(marker_color = colors, texttemplate='$%{text:,.0f}', textposition='outside')
fig7.update_xaxes(title='Property Shape')
fig7.update_yaxes(title='Sale Price', tickprefix='$', tickformat=',')
fig7.update_layout(uniformtext_minsize=8, uniformtext_mode='hide')

fig8 = px.bar(
    x=contour_prices.index,
    y=contour_prices.values,
    title='Average Sale Price by Land Contour',
    template='plotly_dark',
    text=contour_prices.values
)

fig8.update_traces(marker_color = colors, texttemplate='$%{text:,.0f}', textposition='outside')
fig8.update_xaxes(title='Property Contour')
fig8.update_yaxes(title='Sale Price', tickprefix='$', tickformat=',')

# Show figures
fig7.show()
fig8.show()

In [19]:
# 5. Calculate property Age
df['PropertyAge'] = df['YrSold'] - df['YearBuilt']

# Calculate Correlation between property age and saleprices
age_pice_corr = df['PropertyAge'].corr(df['SalePrice'])
print(f'Correlation between property age and sale price: {age_pice_corr}')

# Create a scatter plot to visualize the relationship between property age and sale prices
fig9 = px.scatter(df,
                  x='PropertyAge',
                  y='SalePrice',
                  color='PropertyAge',
                  color_continuous_scale= px.colors.sequential.Purp,
                  title='Relationship between Property Age and Sale Price')

fig9.update_layout(
    plot_bgcolor = 'rgb(30,30,30)',
    paper_bgcolor = 'rgb(30,30,30)',
    font = dict(color = 'white')
)

fig9.show()

Correlation between property age and sale price: -0.5233504175468163


In [20]:
# 6. Correlation between living area and sales price

living_area_price_corr = df['GrLivArea'].corr(df['SalePrice'])
print(f"Correlation between Living Area (above grade) and Sale Price: {living_area_price_corr}")

fig10 = px.scatter(df,
                   x='GrLivArea',
                   y='SalePrice',
                   title='Living Area (above grade) vs Sale Price',
                   color = 'GrLivArea',
                   color_continuous_scale=px.colors.sequential.Purp
                   )

fig10.update_layout(
    plot_bgcolor = 'rgb(30,30,30)',
    paper_bgcolor = 'rgb(30,30,30)',
    font = dict(color = 'white')
)

fig10.show()


Correlation between Living Area (above grade) and Sale Price: 0.7086244776126523


In [21]:
# 7. Box plot for price over the years

yearly_avg_sale_price = df.groupby('YrSold')['SalePrice'].mean()

fig13 = px.box(df,
               x='YrSold',
               y='SalePrice',
               title = 'Sale Price Trends Over the Years',
               points=False, color_discrete_sequence=['green']
               )

fig13.add_trace(
    px.line(x=yearly_avg_sale_price.index, y=yearly_avg_sale_price.values).data[0]
)

for year, avg_price in yearly_avg_sale_price.items():
  fig13.add_annotation(
      x=year,
      y=avg_price,
      text=f'{avg_price:,.0f}',
      font=dict(color='white'),
      showarrow=False,
      bgcolor = 'rgba(128, 0, 128, 0.6)'
  )

fig13.update_layout(
    plot_bgcolor = 'rgb(30,30,30)',
    paper_bgcolor = 'rgb(30,30,30)',
    font = dict(color = 'white'),
    xaxis_title = 'Year Sold',
    yaxis_title = 'Average Sale Price'
)

fig13.show()

## Creating a Data Pipeline

So we have consistent infrastructure for transforming the test set

Goal: To craete infrastructure that lets us make changes without breaking everything

In [22]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Define transformers for numerical and categorical columns

numerical_transformer = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler())
    ]
)

categorical_transformer = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
    ]
)

In [23]:
# Update categorical and numerical columns
categorical_columns = df.select_dtypes(include=['object', 'category']).columns
numerical_columns = df.select_dtypes(include=['int64', 'float64']).columns

# Remove target variable from numerical columns
numerical_columns = numerical_columns.drop('SalePrice')

# Combine transformers using ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_columns),
        ('cat', categorical_transformer, categorical_columns)
    ], remainder = 'passthrough' # All the data going in comes out not just the transformed values
)

# Create a pipeline with the preprocessor
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor)
])

# Apply the pipeline to the dataset
X = df.drop('SalePrice', axis=1)
y = np.log(df['SalePrice']) # normalize dependent variable
X_preprocessed = pipeline.fit_transform(X)

## Fit and Parameter Tune models

- Explore and see how the models work (or just they dont)

In [24]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Split the data into training and testing sets

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_preprocessed, y, test_size=0.2, random_state=42)

# Define the models
models = {
    "LinearRegression" : LinearRegression(),
    "RandomForest" : RandomForestRegressor(random_state=42),
    "XGBoost" : XGBRegressor(random_state=42)
}

# Define the hyperparameter grinds for each model
param_grids = {
    "LinearRegression" : {},
    "RandomForest" : {
        'n_estimators' : [100, 200, 500],
        'max_depth' : [None, 10, 30],
        'min_samples_split' : [2, 5, 10],
    },
    "XGBoost" : {
        'n_estimators' : [100, 200, 500],
        'learning_rate' : [0.01, 0.1, 0.3],
        'max_depth' : [3, 6, 10],
    }
}

# 3-fold corss-validation
cv = KFold(n_splits=3, shuffle=True, random_state=42)

# Train and tune models
grids = {}
for model_name, model in models.items():
  print(f"Training and tuning {model_name}")
  grids[model_name] = GridSearchCV(
      estimator=model,
      param_grid=param_grids[model_name],
      cv=cv,
      scoring='neg_mean_squared_error',
      n_jobs=-1,
      verbose=1)
  grids[model_name].fit(X_train, y_train)
  best_params = grids[model_name].best_params_
  best_score = np.sqrt(-1 * grids[model_name].best_score_)


  print(f"Best parameters for {model_name}: {best_params}")
  print(f"Best RMSE for {model_name}: {best_score}\n")

Training and tuning LinearRegression
Fitting 3 folds for each of 1 candidates, totalling 3 fits
Best parameters for LinearRegression: {}
Best RMSE for LinearRegression: 0.17616568200328317

Training and tuning RandomForest
Fitting 3 folds for each of 27 candidates, totalling 81 fits
Best parameters for RandomForest: {'max_depth': None, 'min_samples_split': 2, 'n_estimators': 500}
Best RMSE for RandomForest: 0.15321256965909336

Training and tuning XGBoost
Fitting 3 folds for each of 27 candidates, totalling 81 fits
Best parameters for XGBoost: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 500}
Best RMSE for XGBoost: 0.13797005834667236



In [25]:
from sklearn.neural_network import MLPRegressor

X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()

# Create an MLPRegressor instance
mlp = MLPRegressor(
    random_state = 42,
    max_iter=10000,
    n_iter_no_change = 3,
    learning_rate_init = 0.001
)

# Define the parameter grid for tuning
param_grid = {
    'hidden_layer_sizes' : [(10,), (10, 10), (10,10,10), (25)],
    'activation' : ['relu', 'tanh'],
    'solver' : ['adam'],
    'alpha' : [0.0001, 0.001, 0.01],
    'learning_rate' : ['constant', 'invscaling', 'adaptive'],
}

# Creating GridSearchCV object
grid_search_mlp = GridSearchCV(
    mlp,
    param_grid,
    scoring='neg_mean_squared_error',
    cv=3,
    n_jobs=-1,
    verbose=1
)

# Fit the model on the training data
grid_search_mlp.fit(X_train_scaled, y_train)

# Print the best parameters found duing the search
print(f"Best parameters found: {grid_search_mlp.best_params_}")

# Evaluate the model on test data
best_score = np.sqrt(-1 * grid_search_mlp.best_score_)
print(f"Test Score: {best_score}")

Fitting 3 folds for each of 72 candidates, totalling 216 fits
Best parameters found: {'activation': 'relu', 'alpha': 0.001, 'hidden_layer_sizes': (10, 10), 'learning_rate': 'constant', 'solver': 'adam'}
Test Score: 0.22778267781613565


## Principal Component Analysis

- Basic feature engineering

In [26]:
# PCA

from sklearn.decomposition import PCA

pca = PCA()

X_pca_pre = pca.fit_transform(X_preprocessed)

# Calculate the cumulative explained variance
cumulative_explained_variance = np.cumsum(pca.explained_variance_ratio_)

# Choose the number of components based on the explained variance threshold
n_components = np.argmax(cumulative_explained_variance >= 0.95) + 1

pca = PCA(n_components=n_components)
pipeline_pca = Pipeline(
    steps = [('preprocessor', preprocessor),
             ('pca', pca)]
)

X_pca = pipeline_pca.fit_transform(X)

Running the same models with new Data

In [27]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Split the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train_pca, X_test_pca, y_train_pca, y_test_pca = train_test_split(X_pca, y, test_size=0.2, random_state=42)

# Define the models
param_grids = {
    'LinearRegression' : {},
    'RandomForest' : {
        'n_estimators' : [100, 200, 500],
        'max_depth' : [None, 10, 30],
        'min_samples_split' : [2, 5, 10],
    },
    'XGBoost' : {
        'n_estimators' : [100, 200, 500],
        'learning_rate' : [0.01, 0.1, 0.3],
        'max_depth' : [3, 6, 10],
    }
}

# 3 fold cross validation
cv = KFold(n_splits=3, shuffle=True, random_state=42)

# Train and tune the models
grids_pca = {}

for model_name, model in models.items():
  print(f'Training and tuning {model_name}')
  grids_pca[model_name] = GridSearchCV(
      estimator=model,
      param_grid=param_grids[model_name],
      cv=cv,
      scoring='neg_mean_squared_error',
      n_jobs=-1,
      verbose=1
  )

  grids_pca[model_name].fit(X_train_pca, y_train_pca)
  best_params = grids_pca[model_name].best_params_
  best_score = np.sqrt(
      -1 * grids_pca[model_name].best_score_
  )

  print(f'Best parameters for {model_name} : {best_params}')
  print(f'Best RMSE for {model_name} : {best_score}\n')

Training and tuning LinearRegression
Fitting 3 folds for each of 1 candidates, totalling 3 fits
Best parameters for LinearRegression : {}
Best RMSE for LinearRegression : 0.16294794958281145

Training and tuning RandomForest
Fitting 3 folds for each of 27 candidates, totalling 81 fits
Best parameters for RandomForest : {'max_depth': None, 'min_samples_split': 5, 'n_estimators': 500}
Best RMSE for RandomForest : 0.15199947901325844

Training and tuning XGBoost
Fitting 3 folds for each of 27 candidates, totalling 81 fits


Exception ignored in: <function ResourceTracker.__del__ at 0x122665bc0>
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.13/multiprocessing/resource_tracker.py", line 82, in __del__
  File "/opt/anaconda3/lib/python3.13/multiprocessing/resource_tracker.py", line 91, in _stop
  File "/opt/anaconda3/lib/python3.13/multiprocessing/resource_tracker.py", line 116, in _stop_locked
ChildProcessError: [Errno 10] No child processes
Exception ignored in: <function ResourceTracker.__del__ at 0x104c65bc0>
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.13/multiprocessing/resource_tracker.py", line 82, in __del__
  File "/opt/anaconda3/lib/python3.13/multiprocessing/resource_tracker.py", line 91, in _stop
  File "/opt/anaconda3/lib/python3.13/multiprocessing/resource_tracker.py", line 116, in _stop_locked
ChildProcessError: [Errno 10] No child processes
Exception ignored in: <function ResourceTracker.__del__ at 0x102359bc0>
Traceback (most recent call last

Best parameters for XGBoost : {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 500}
Best RMSE for XGBoost : 0.13722603124561958



In [28]:
from sklearn.neural_network import MLPRegressor

X_train_pca_scaled = X_train_pca.copy()
X_test_pca_scaled = X_test_pca.copy()

# Create an MLPRegressor instance
mlp = MLPRegressor(
    random_state = 42,
    max_iter = 10000,
    n_iter_no_change = 3,
    learning_rate_init = 0.001
)

# Define the parameter grid for tuning
param_grid = {
    'hidden_layer_sizes' : [(10,), (10,10), (10,10,10), (25)],
    'activation' : ['relu', 'tanh'],
    'solver' : ['adam'],
    'alpha' : [0.0001, 0.001, 0.01, .1, 1],
    'learning_rate' : ['constant', 'invscaling', 'adaptive'],
}

# Create GridSearchCV object
grid_search_mlp_pca = GridSearchCV(
    mlp,
    param_grid,
    scoring='neg_mean_squared_error',
    cv=3,
    n_jobs=-1,
    verbose=1
)

# Fit the model on training data
grid_search_mlp_pca.fit(X_train_pca_scaled, y_train)

# Print the best params found during the search
print('Best parameters found: ', grid_search_mlp_pca.best_params_)

# Evaluate the model on the test data
best_score = np.sqrt(-1 * grid_search_mlp_pca.best_score_)
print('Best score: ', best_score)

Fitting 3 folds for each of 120 candidates, totalling 360 fits
Best parameters found:  {'activation': 'tanh', 'alpha': 1, 'hidden_layer_sizes': (10, 10, 10), 'learning_rate': 'constant', 'solver': 'adam'}
Best score:  0.22268361231174003


In [29]:
from sklearn.metrics import mean_squared_error

for i in grids.keys():
  print(i + ':' + str(np.sqrt(mean_squared_error(grids[i].predict(X_test), y_test))))

LinearRegression:0.13212453580847555
RandomForest:0.14740705465925416
XGBoost:0.13519734581188902


In [30]:
from sklearn.metrics import mean_squared_error

for i in grids_pca.keys():
    print(i + ':' + str(np.sqrt(mean_squared_error(grids_pca[i].predict(X_test_pca), y_test))))

LinearRegression:0.14196360812575212
RandomForest:0.15341290192309123
XGBoost:0.14076453118827834


In [31]:
print(f"MLPRegressor: {str(np.sqrt(mean_squared_error(grid_search_mlp.predict(X_test_scaled),y_test)))}")

MLPRegressor: 0.16683910272240587


In [32]:
print(f"MLPRegressor PCA: {str(np.sqrt(mean_squared_error(grid_search_mlp_pca.predict(X_test_pca_scaled),y_test)))}")

MLPRegressor PCA: 0.18359760643950232


## Feature engineering

In [33]:
var_explore = df[['Fence','Alley','MiscFeature','PoolQC','FireplaceQu','GarageCond','GarageQual',
                  'GarageFinish','GarageType','BsmtExposure','BsmtFinType2','BsmtFinType1','BsmtCond',
                  'BsmtQual','MasVnrType','Electrical','MSZoning','Utilities','Exterior1st','Exterior2nd',
                  'KitchenQual','Functional','SaleType','LotFrontage','GarageYrBlt','MasVnrArea','BsmtFullBath',
                  'BsmtHalfBath','GarageCars','GarageArea','TotalBsmtSF']]

display(HTML(create_scrollable_table(var_explore, 'var_explore', 'Variables to explore for feature engineering')))

Unnamed: 0,Fence,Alley,MiscFeature,PoolQC,FireplaceQu,GarageCond,GarageQual,GarageFinish,GarageType,BsmtExposure,BsmtFinType2,BsmtFinType1,BsmtCond,BsmtQual,MasVnrType,Electrical,MSZoning,Utilities,Exterior1st,Exterior2nd,KitchenQual,Functional,SaleType,LotFrontage,GarageYrBlt,MasVnrArea,BsmtFullBath,BsmtHalfBath,GarageCars,GarageArea,TotalBsmtSF
0,,,,,,TA,TA,RFn,Attchd,No,Unf,GLQ,TA,Gd,BrkFace,SBrkr,RL,AllPub,VinylSd,VinylSd,Gd,Typ,WD,65.0,2003.0,196.0,1,0,2,548,856
1,,,,,TA,TA,TA,RFn,Attchd,Gd,Unf,ALQ,TA,Gd,,SBrkr,RL,AllPub,MetalSd,MetalSd,TA,Typ,WD,80.0,1976.0,0.0,0,1,2,460,1262
2,,,,,TA,TA,TA,RFn,Attchd,Mn,Unf,GLQ,TA,Gd,BrkFace,SBrkr,RL,AllPub,VinylSd,VinylSd,Gd,Typ,WD,68.0,2001.0,162.0,1,0,2,608,920
3,,,,,Gd,TA,TA,Unf,Detchd,No,Unf,ALQ,Gd,TA,,SBrkr,RL,AllPub,Wd Sdng,Wd Shng,Gd,Typ,WD,60.0,1998.0,0.0,1,0,3,642,756
4,,,,,TA,TA,TA,RFn,Attchd,Av,Unf,GLQ,TA,Gd,BrkFace,SBrkr,RL,AllPub,VinylSd,VinylSd,Gd,Typ,WD,84.0,2000.0,350.0,1,0,3,836,1145
5,MnPrv,,Shed,,,TA,TA,Unf,Attchd,No,Unf,GLQ,TA,Gd,,SBrkr,RL,AllPub,VinylSd,VinylSd,TA,Typ,WD,85.0,1993.0,0.0,1,0,2,480,796
6,,,,,Gd,TA,TA,RFn,Attchd,Av,Unf,GLQ,TA,Ex,Stone,SBrkr,RL,AllPub,VinylSd,VinylSd,Gd,Typ,WD,75.0,2004.0,186.0,1,0,2,636,1686
7,,,Shed,,TA,TA,TA,RFn,Attchd,Mn,BLQ,ALQ,TA,Gd,Stone,SBrkr,RL,AllPub,HdBoard,HdBoard,TA,Typ,WD,,1973.0,240.0,1,0,2,484,1107
8,,,,,TA,TA,Fa,Unf,Detchd,No,Unf,Unf,TA,TA,,FuseF,RM,AllPub,BrkFace,Wd Shng,TA,Min1,WD,51.0,1931.0,0.0,0,0,2,468,952
9,,,,,TA,TA,Gd,RFn,Attchd,No,Unf,GLQ,TA,TA,,SBrkr,RL,AllPub,MetalSd,MetalSd,TA,Typ,WD,50.0,1939.0,0.0,1,0,1,205,991


In [34]:
from sklearn.preprocessing import FunctionTransformer

# Feature engineering functions

def custom_features(df):
    df_out = df.copy()
    df_out['PropertyAge'] = df_out['YrSold'] - df_out['YearBuilt']
    df_out['TotalSF'] = df_out['TotalBsmtSF'] + df_out['1stFlrSF'] + df_out['2ndFlrSF']
    df_out['TotalBath'] = df_out['FullBath'] + df_out['HalfBath'] + df_out['BsmtFullBath'] + df_out['BsmtHalfBath']
    df_out['HasRemodeled'] = df_out['YearRemodAdd'] > df_out['YearBuilt'].astype(object)
    df_out['HasSecondFloor'] = (df_out['2ndFlrSF'] > 0).astype(object)
    df_out['HasGarage'] = (df_out['GarageArea'] > 0).astype(object)
    df_out['YrSoldCat'] = df_out['YrSold'].astype(object)
    df_out['MoSoldCat'] = df_out['MoSold'].astype(object)
    df_out['YearBuiltCat'] = df_out['YearBuilt'].astype(object)
    df_out['MSSubClassCat'] = df_out['MSSubClass'].astype(object)

    return df_out


feature_engineering_transformer = FunctionTransformer(custom_features)

In [35]:
# Identify the categorical and numerical columns after feature engineering

new_cols_categorical = pd.Index(['HasRemodeled', 'HasGarage', 'HasSecondFloor'])
new_cols_numeric = pd.Index(['PropertyAge', 'TotalSF', 'TotalBath', 'YrSoldCat', 'MoSoldCat', 'YearBuiltCat', 'MSSubClassCat'])


# Categorical and numerical columns
categorical_columns = df.select_dtypes(include=['object', 'category']).columns.append(new_cols_categorical)
numerical_columns = df.select_dtypes(include=['int64', 'float64']).columns.append(new_cols_numeric)

# Remove target variable from numerical columns
numerical_columns = numerical_columns.drop('SalePrice')

# Combine transformers using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers = [
        ('num', numerical_transformer, numerical_columns),
        ('cat', categorical_transformer, categorical_columns)
    ], remainder= 'passthrough'
)

# Create pipeline with the preprocessor
pipeline_fe = Pipeline(steps=[
    ('fe', feature_engineering_transformer),
    ('preprocessor', preprocessor),
    ('pca', pca)
])

# Apply the pipeline to the dataset
X = df.drop('SalePrice', axis=1)
y = np.log(df['SalePrice'])
X_preprocessed_fe = pipeline_fe.fit_transform(X)

In [36]:
# Split the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train_fe, X_test_fe, y_train_fe, y_test_fe = train_test_split(X_preprocessed_fe, y, test_size=0.2, random_state=42)

# Define the models
models = {
    "LinearRegression" : LinearRegression(),
    "RandomForest" : RandomForestRegressor(random_state=42),
    "XGBoost" : XGBRegressor(random_state=42)
}

# Define hyperparameter grids for each model
param_grids = {
    "LinearRegression" : {},
    "RandomForest" : {
        'n_estimators' : [100, 200, 500],
        'max_depth' : [None, 10, 30],
        'min_samples_split' : [2, 5, 10],
    },
    "XGBoost" : {
        'n_estimators' : [100, 200, 500],
        'max_depth' : [3, 6, 10],
        'learning_rate' : [0.01, 0.1, 0.3],
    }
}

# 3-fold cross-validation
cv = KFold(n_splits=3, shuffle=True, random_state=42)

# Train and tune models
grids_fe = {}

for model_name, model in models.items():
    print(f"Training and tuning {model_name}")
    grids_fe[model_name] = GridSearchCV(
        estimator=model,
        param_grid=param_grids[model_name],
        cv=cv,
        scoring='neg_mean_squared_error',
        n_jobs=-1,
        verbose=0,
    )
    grids_fe[model_name].fit(X_train_fe, y_train_fe)
    best_params = grids_fe[model_name].best_params_
    best_score = np.sqrt(-1 * grids_fe[model_name].best_score_)

    print(f"Best parameters for {model_name}: {best_params}")
    print(f"Best MSE for {model_name}: {best_score}\n")

Training and tuning LinearRegression
Best parameters for LinearRegression: {}
Best MSE for LinearRegression: 0.16503621759468923

Training and tuning RandomForest
Best parameters for RandomForest: {'max_depth': 10, 'min_samples_split': 2, 'n_estimators': 200}
Best MSE for RandomForest: 0.1521749276125234

Training and tuning XGBoost


Exception ignored in: <function ResourceTracker.__del__ at 0x104a1dbc0>
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.13/multiprocessing/resource_tracker.py", line 82, in __del__
  File "/opt/anaconda3/lib/python3.13/multiprocessing/resource_tracker.py", line 91, in _stop
  File "/opt/anaconda3/lib/python3.13/multiprocessing/resource_tracker.py", line 116, in _stop_locked
ChildProcessError: [Errno 10] No child processes
Exception ignored in: <function ResourceTracker.__del__ at 0x105191bc0>
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.13/multiprocessing/resource_tracker.py", line 82, in __del__
  File "/opt/anaconda3/lib/python3.13/multiprocessing/resource_tracker.py", line 91, in _stop
  File "/opt/anaconda3/lib/python3.13/multiprocessing/resource_tracker.py", line 116, in _stop_locked
ChildProcessError: [Errno 10] No child processes
Exception ignored in: <function ResourceTracker.__del__ at 0x105075bc0>
Traceback (most recent call last

Best parameters for XGBoost: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 500}
Best MSE for XGBoost: 0.14126928812134099



In [37]:
X_train_scaled_fe = X_train_fe.copy()
X_test_scaled_fe = X_test_fe.copy()

# Create an MLPRegressor instance
mlp = MLPRegressor(
    random_state=42,
    max_iter=10000,
    n_iter_no_change=3,
    learning_rate_init=0.001
)

# Define the parameter grid for tuning
para_grid = {
    'hidden_layer_sizes' : [(10,), (10, 10), (10, 10, 10), (25)],
    'activation' : ['relu', 'tanh', 'sigmoid'],
    'solver' : ['adam', 'sgd'],
    'alpha' : [.1, .5, 1, 10, 100],
    'learning_rate' : ['constant', 'invscaling', 'adaptive'],
    'learning_rate_init' : [0.001, 0.01, 0.1]
}

# Create GridSearchCV object
grid_search_mlp_fe = GridSearchCV(
    mlp,
    param_grid=para_grid,
    scoring='neg_mean_squared_error',
    cv=3,
    n_jobs=-1,
    verbose=0,
)

# Fit the model on the training data
grid_search_mlp_fe.fit(X_train_scaled_fe, y_train_fe)

# Print the best parameters found during the search
print(f"Best parameters found: {grid_search_mlp_fe.best_params_}")
print(f"Best score: {np.sqrt(-1 * grid_search_mlp_fe.best_score_)}")

  return ((y_true - y_pred) ** 2).mean() / 2
  ret = a @ b
  ret = a @ b
  ret = a @ b
  return ((y_true - y_pred) ** 2).mean() / 2
  ret = a @ b
  return ((y_true - y_pred) ** 2).mean() / 2
  ret = a @ b
  ret = a @ b
  return ((y_true - y_pred) ** 2).mean() / 2
  ret = a @ b
  ret = a @ b
  return ((y_true - y_pred) ** 2).mean() / 2
  ret = a @ b
  ret = a @ b
  ret = a @ b
  ret = a @ b
  return ((y_true - y_pred) ** 2).mean() / 2
  ret = a @ b
  ret = a @ b
  ret = a @ b
  return ((y_true - y_pred) ** 2).mean() / 2
  ret = a @ b
  return ((y_true - y_pred) ** 2).mean() / 2
  ret = a @ b
  ret = a @ b
  return ((y_true - y_pred) ** 2).mean() / 2
  ret = a @ b
  ret = a @ b
  return ((y_true - y_pred) ** 2).mean() / 2
  ret = umr_sum(arr, axis, dtype, out, keepdims, where=where)
  ret = a @ b
  ret = a @ b
  return ((y_true - y_pred) ** 2).mean() / 2
  ret = a @ b
  ret = a @ b
  ret = a @ b
  ret = a @ b
  ret = a @ b
  ret = a @ b
  return ((y_true - y_pred) ** 2).mean() / 2
  ret 

Best parameters found: {'activation': 'tanh', 'alpha': 0.5, 'hidden_layer_sizes': (10,), 'learning_rate': 'adaptive', 'learning_rate_init': 0.1, 'solver': 'sgd'}
Best score: 0.1283706380909018


In [38]:
from sklearn.metrics import mean_squared_error

for i in grids_fe.keys():
    print(f"{i} : {np.sqrt(mean_squared_error(grids[i].predict(X_test), y_test))}")

LinearRegression : 0.13212453580847555
RandomForest : 0.14740705465925416
XGBoost : 0.13519734581188902


In [39]:
print( str(np.sqrt(mean_squared_error(grid_search_mlp_fe.predict(X_test_scaled_fe), y_test))))

0.13334199931610188


In [42]:
# load the test set
df_test = pd.read_csv('../datasets/test.csv')

In [43]:
# Preprocess the test set
df_test_preprocessed = pipeline_fe.transform(df_test)

In [45]:
# XGBoost submission
y_xgboost = np.exp(grids_fe['XGBoost'].predict(df_test_preprocessed))

df_xgboost_out = df_test[['Id']].copy()
df_xgboost_out['SalePrice'] = y_xgboost

df_xgboost_out.to_csv('../datasets/submissions/xgboost_submission.csv', index=False)

In [46]:
# rf submission
y_rf = np.exp(grids_fe['RandomForest'].predict(df_test_preprocessed))

df_rf_out = df_test[['Id']].copy()
df_rf_out['SalePrice'] = y_xgboost

df_xgboost_out.to_csv('../datasets/submissions/rf_submission.csv', index=False)

In [47]:
# mlp submission
y_mlp = np.exp(grid_search_mlp_fe.predict(df_test_preprocessed))

df_mlp_out = df_test[['Id']].copy()
df_mlp_out['SalePrice'] = y_mlp

df_mlp_out.to_csv('../datasets/submissions/mlp_submission.csv', index=False)

In [48]:
y_avg_ens_preds = (y_rf + y_xgboost + y_mlp) / 3

# avg submission
df_avg_ens_out = df_test[['Id']].copy()
df_avg_ens_out['SalePrice'] = y_avg_ens_preds

df_avg_ens_out.to_csv('../datasets/submissions/avg_ens_submission.csv', index=False)

In [50]:
from sklearn.ensemble import StackingRegressor

grids_fe['MLP'] = grid_search_mlp_fe

best_estimators = [(model_name, grid.best_estimator_) for model_name, grid in grids_fe.items()]

# Define candidate meta-models

meta_models = {
    'MLP' : MLPRegressor(
        random_state=42,
        max_iter=10000,
        n_iter_no_change=3,
        learning_rate_init=0.001
    ),
    'LinearRegression' : LinearRegression(),
    'XGBoost' : XGBRegressor(random_state=42)
}


# Define hyper-parameter grids for each model
meta_param_grids = {
    'MLP' : {
        'final_estimator__hidden_layer_sizes' : [(10,), (10, 10)],
        'final_estimator__activation' : ['relu', 'tanh'],
        'final_estimator__solver' : ['adam', 'sgd'],
        'final_estimator__alpha' : [0.001, 0.01, .1, .5],
        'final_estimator__learning_rate' : ['constant', 'invscaling', 'adaptive']
    }, 
    'LinearRegression' : {},
    'XGBoost' : {
        'final_estimator__n_estimators' : [100, 200, 500],
        'final_estimator__learning_rate' : [0.01, 0.1, 0.3],
        'final_estimator__max_depth' : [3, 6, 10],
    }
}

# 3-fold cross validation
cv = KFold(n_splits=3, shuffle=True, random_state=42)

# Train and tune the stacking ensemble
best_score = float('inf')
best_model = None

for meta_name, meta_model in meta_models.items():
    print(f"Training and tuning {meta_name} as the meta-model...")

    stacking_regressor = StackingRegressor(estimators=best_estimators, final_estimator=meta_model, cv=cv)
    grid_search = GridSearchCV(estimator=stacking_regressor,
                            param_grid=meta_param_grids[meta_name],
                            cv=cv,
                            scoring='neg_mean_squared_error',
                            n_jobs=-1,
                            verbose=1
                            )
    
    grid_search.fit(X_train_fe, y_train_fe)
    best_params = grid_search.best_params_
    best_rmse = np.sqrt(-1 * grid_search.best_score_)

    print(f"Best parameters for {meta_name} : {best_params}")
    print(f"Best RMSE for {meta_name} : {best_rmse}\n")

    if best_rmse < best_score:
        best_score = best_rmse
        best_model = grid_search


# Evaluate the best stacking ensemble on the data
y_pred = best_model.predict(X_test_fe)
rmse = np.sqrt(mean_squared_error(y_test_fe, y_pred))
print(f"Best stacking ensemble's RMSE on test data: {rmse}")

Training and tuning MLP as the meta-model...
Fitting 3 folds for each of 96 candidates, totalling 288 fits
Best parameters for MLP : {'final_estimator__activation': 'relu', 'final_estimator__alpha': 0.5, 'final_estimator__hidden_layer_sizes': (10,), 'final_estimator__learning_rate': 'adaptive', 'final_estimator__solver': 'sgd'}
Best RMSE for MLP : 0.14154372838675827

Training and tuning LinearRegression as the meta-model...
Fitting 3 folds for each of 1 candidates, totalling 3 fits
Best parameters for LinearRegression : {}
Best RMSE for LinearRegression : 0.14559134043455152

Training and tuning XGBoost as the meta-model...
Fitting 3 folds for each of 27 candidates, totalling 81 fits
Best parameters for XGBoost : {'final_estimator__learning_rate': 0.01, 'final_estimator__max_depth': 3, 'final_estimator__n_estimators': 500}
Best RMSE for XGBoost : 0.13337271242298795

Best stacking ensemble's RMSE on test data: 0.14155497940034387


In [None]:
y_stack = np.exp(best_model.predict(df_test_preprocessed))

#xgboost submission

df_stack_out = df_test[['Id']].copy()
df_stack_out['SalePrice'] = y_stack

df_stack_out.to_csv('../datasets/submissions/sub_stack.csv', index=False)

Exception ignored in: <function ResourceTracker.__del__ at 0x1025d5bc0>
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.13/multiprocessing/resource_tracker.py", line 82, in __del__
  File "/opt/anaconda3/lib/python3.13/multiprocessing/resource_tracker.py", line 91, in _stop
  File "/opt/anaconda3/lib/python3.13/multiprocessing/resource_tracker.py", line 116, in _stop_locked
ChildProcessError: [Errno 10] No child processes
Exception ignored in: <function ResourceTracker.__del__ at 0x104785bc0>
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.13/multiprocessing/resource_tracker.py", line 82, in __del__
  File "/opt/anaconda3/lib/python3.13/multiprocessing/resource_tracker.py", line 91, in _stop
  File "/opt/anaconda3/lib/python3.13/multiprocessing/resource_tracker.py", line 116, in _stop_locked
ChildProcessError: [Errno 10] No child processes
Exception ignored in: <function ResourceTracker.__del__ at 0x10684dbc0>
Traceback (most recent call last