# House Price Prediction - Modelling

This notebook predicts house prices from the properties of each house, using a sample of houses from Ames, Iowa. The dataset comes from a [Kaggle](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques) competition.

## Extract-Transform-Load (ETL)

In [None]:
import pandas as pd
import numpy as np

In [None]:
houses_train = pd.read_csv("../data/modelling/train.csv")
houses_validation = pd.read_csv("../data/modelling/validation.csv")

## Feature Selection

In this section, we select the features for our model. The first baseline model uses the whole feature set. We then apply feature selection techniques to reduce the number of features and improve the model's performance.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, OrdinalEncoder
from house_prices.modelling import build_transformer, ORDINAL_FEATURE_MAPPINGS

First we check how the features correlate with the target variable `SalePrice` and with each other. Note that this analysis covers only numerical features and captures only linear relationships.

In [None]:
corr_matrix = houses_train.select_dtypes(include="number").corr().round(2)

fig = plt.figure(figsize=(15, 15))

mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
cmap = sns.diverging_palette(230, 20, as_cmap=True)

ax = fig.add_subplot(1, 1, 1)
ax.set_title("Correlation Heatmap of Numeric Features")
sns.heatmap(corr_matrix, annot=True, ax=ax, cmap=cmap, mask=mask, annot_kws={"fontsize": 6})

plt.show()
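A heatmap of this size is hard to read cell by cell, so it can help to rank the correlations as well. The following is a minimal sketch under stated assumptions, not part of the original analysis: the helper name `strongest_correlations`, the 0.8 threshold, and the small demo frame are all made up for illustration.

```python
from itertools import combinations

import pandas as pd


def strongest_correlations(frame, target, threshold=0.8):
    """Rank features by correlation with the target and list highly correlated pairs.

    Hypothetical helper; the threshold is an arbitrary illustrative choice.
    """
    corr = frame.select_dtypes(include="number").corr()

    # Correlation of every numeric feature with the target, strongest first.
    with_target = corr[target].drop(target).sort_values(key=abs, ascending=False)

    # Feature pairs (target excluded) whose absolute correlation exceeds the threshold.
    features = [column for column in corr.columns if column != target]
    pairs = {
        (a, b): corr.loc[a, b]
        for a, b in combinations(features, 2)
        if abs(corr.loc[a, b]) > threshold
    }
    return with_target, pairs


# Tiny made-up frame, only to illustrate the call.
demo = pd.DataFrame({
    "area": [50, 80, 120, 200, 260],
    "rooms": [2, 3, 4, 6, 8],
    "age": [30, 5, 18, 40, 12],
    "SalePrice": [100, 160, 230, 410, 520],
})

with_target, pairs = strongest_correlations(demo, "SalePrice")
print(with_target)
print(pairs)
```

Pairs that show up in `pairs` are candidates for dropping one of the two features, since they carry largely the same linear information.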

Next we look for outliers by plotting every numeric feature against `SalePrice`. Outliers identified this way will be removed from the training set.

In [None]:
fig = plt.figure(figsize=(15, 15))

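# Select the numeric columns once and lay the plots out in a grid with five columns
numeric_features = houses_train.select_dtypes(include="number").columns
n_rows = len(numeric_features) // 5 + 1

for index, feature in enumerate(numeric_features):
  ax = fig.add_subplot(n_rows, 5, index + 1)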
  ax.set_title(feature)
  sns.scatterplot(x=feature, y="SalePrice", data=houses_train, ax=ax)

plt.tight_layout()
plt.show()
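The plots only make outliers visible; the removal step itself is not shown here. One common option is Tukey's interquartile range (IQR) rule, sketched below. This is an assumption and not necessarily the rule used in this analysis: the helper `drop_iqr_outliers`, the factor of 1.5, and the demo values are illustrative.

```python
import pandas as pd


def drop_iqr_outliers(frame, column, factor=1.5):
    """Drop rows whose value in `column` lies outside the IQR fences.

    Hypothetical helper; rows with a missing value in `column` are kept.
    """
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1

    lower = q1 - factor * iqr
    upper = q3 + factor * iqr

    keep = frame[column].between(lower, upper) | frame[column].isna()
    return frame[keep]


# Made-up values: one house is far larger than the rest.
demo = pd.DataFrame({
    "area": [900, 1100, 1200, 1300, 1500, 5600],
    "SalePrice": [120, 150, 160, 170, 200, 180],
})

cleaned = drop_iqr_outliers(demo, "area")
print(cleaned)
```

Such a filter should be applied to the training set only, so that the validation data still reflects the full range of houses.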

Next we look at the categorical features and flag those in which a single category accounts for more than 95% of the non-missing values. Such features vary too little to carry much information.

In [None]:
categorical_features = houses_train.select_dtypes(include="object").columns
# Share of the most frequent category per feature (missing values are ignored)
categorical_features_equality = houses_train[categorical_features].apply(lambda x: x.value_counts(normalize=True).max())
categorical_features_equality = categorical_features_equality[categorical_features_equality > 0.95]

print(categorical_features_equality)
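The cell above only prints the flagged features. A minimal sketch of how they could then be dropped follows. The helper name `dominated_columns` and the demo frame are hypothetical, and the demo columns are cast to `object` explicitly so that the dtype selection behaves the same across pandas versions.

```python
import pandas as pd


def dominated_columns(frame, threshold=0.95):
    """Return the categorical columns dominated by a single category.

    Hypothetical helper mirroring the 95% rule used above.
    """
    categorical = frame.select_dtypes(include="object")
    # Share of the most frequent category among the non-missing values.
    share = categorical.apply(lambda column: column.value_counts(normalize=True).max())
    return share[share > threshold].index.tolist()


# Made-up data: "street" is almost constant, "zoning" is not.
demo = pd.DataFrame({
    "street": ["Pave"] * 99 + ["Grvl"],
    "zoning": ["RL"] * 60 + ["RM"] * 40,
    "area": range(100),
}).astype({"street": object, "zoning": object})

to_drop = dominated_columns(demo)
reduced = demo.drop(columns=to_drop)
print(to_drop)
print(reduced.columns.tolist())
```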

Finally, we transform the selected features into a form the model can use. A similar transformation is applied to the data inside the machine learning pipeline.

In [None]:
ordinal_pipeline = Pipeline([
  ("imputer", SimpleImputer(strategy="most_frequent")),
  # One ordered category list per ordinal feature, in the order of the mapping
  ("encoder", OrdinalEncoder(categories=list(ORDINAL_FEATURE_MAPPINGS.values()), dtype=int)),
])

binary_pipeline = Pipeline([
  ("imputer", SimpleImputer(strategy="most_frequent")),
  ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

numerical_pipeline = Pipeline([
  ("imputer", SimpleImputer(strategy="mean")),
  ("scaler", StandardScaler()),
])

transformer = build_transformer(houses_train, ordinal_pipeline, binary_pipeline, numerical_pipeline)

houses_train_transformed = transformer.fit_transform(houses_train)
houses_train_transformed = pd.DataFrame(houses_train_transformed, columns=transformer.get_feature_names_out())

houses_train_transformed