# Housing Price Prediction Model

This notebook walks through building a machine learning model that predicts housing prices in Warsaw. The dataset describes each listing's size, rent, location, and amenities, and was web-scraped from Otodom with Selenium. More information about the scraper is available at https://github.com/KevinVanWallendael/OtodomScraper

In this notebook, we will:
- Load and preprocess the data
- Handle missing values and outliers
- Engineer new features like price per square meter and amenities
- Train a machine learning model using XGBoost
- Evaluate the model using performance metrics

We will also explain the significance of each preprocessing step and why particular evaluation metrics are used in this regression task.

## Step 1: Import Libraries

We begin by importing the required libraries:
- `pandas` and `numpy` for data manipulation and numerical operations.
- `sklearn` modules for splitting the data, preprocessing, building pipelines, and computing evaluation metrics.
- `xgboost`, a popular gradient boosting library, for training our regression model.
- `joblib` to save the trained model for future use.

In [1]:
# Importing necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
import joblib
from xgboost import XGBRegressor

## Step 2: Load the Dataset

Here, we load the dataset containing the housing details. It's crucial to check the dataset structure before any further processing to ensure it's ready for modeling.

In [2]:
# Load the dataset
df = pd.read_csv(r'data/otodom_data.csv')
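
A quick structural check right after loading might look like the following sketch. The small frame and its values are made up for illustration (only the column names `size`, `price`, and `Czynsz` come from this notebook); with the real data the same calls would be run on `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the scraped CSV; all values are invented.
sample = pd.DataFrame({
    "size": ["45,5 m²", "62 m²", "38 m²"],
    "price": ["650 000 zł", "Price not available", "540 000 zł"],
    "Czynsz": ["700 zł", "brak informacji", np.nan],
})

n_rows, n_cols = sample.shape              # number of listings and columns
missing_per_column = sample.isna().sum()   # counts true NaN values only

print(n_rows, n_cols)
print(sample.dtypes)           # raw scraped columns are text at this point
print(missing_per_column)
```

Placeholder strings such as 'brak informacji' are not counted by `isna()`, which is one reason they are converted explicitly in the later steps.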

### Step 3: Preprocessing 'size' Column

We begin by cleaning the 'size' column. The 'm²' unit is removed, commas are replaced with dots to convert to the correct float format, and then the column is cast to a numeric type.

This conversion matters because pandas reads the scraped values as text, and later steps such as the price-per-square-meter calculation need real numbers.

In [3]:
# Preprocess 'size' column
df['size'] = df['size'].str.replace('m²', '', regex=True).str.replace(',', '.').astype(float)
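
If some entries cannot be parsed, `astype(float)` raises an error. A slightly more forgiving variant, sketched below on invented values, uses `pd.to_numeric` with `errors="coerce"` so that unparseable text becomes NaN. The string "na zapytanie" is only a made-up example of a non-numeric entry, not something known to occur in the dataset.

```python
import pandas as pd

# Made-up raw values; "na zapytanie" stands in for any non-numeric entry.
raw_size = pd.Series(["45,5 m²", "120 m²", "na zapytanie"])

cleaned = (
    raw_size
    .str.replace("m²", "", regex=False)   # drop the unit
    .str.replace(",", ".", regex=False)   # decimal comma -> decimal point
    .str.strip()                          # remove leftover whitespace
)

# errors="coerce" turns unparseable text into NaN instead of raising.
size_numeric = pd.to_numeric(cleaned, errors="coerce")
print(size_numeric.tolist())
```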

### Step 4: Handling Missing or Non-Numeric Values in 'Czynsz'

We handle missing values and non-numeric entries in the 'Czynsz' column (monthly rent). The placeholder 'brak informacji' (Polish for "no information") is converted to NaN, the ' zł' currency suffix and spaces are removed, and decimal commas are replaced with dots so the column can be cast to float.

Marking these entries as NaN, rather than leaving text in the column, lets the imputation step in the preprocessing pipeline fill them in consistently.

In [4]:
# Handle missing or non-numeric values in 'Czynsz' (monthly rent)
df['Czynsz'] = (
    df['Czynsz']
    .replace('brak informacji', np.nan)
    .str.replace(' zł', '', regex=True)
    .str.replace(',', '.', regex=True)
    .str.replace(' ', '', regex=True)
    .astype(float)
)
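
The same chain can be tried on a few sample values. The inputs below are invented, and the non-breaking space used as a thousands separator is an assumption about how the site formats numbers; if it does occur, replacing only the regular space would leave it in place and the cast to float would fail.

```python
import numpy as np
import pandas as pd

# Invented examples; "\u00a0" is a non-breaking space (assumed, not confirmed).
raw_rent = pd.Series(["650 zł", "1\u00a0200,50 zł", "brak informacji", np.nan])

rent = (
    raw_rent
    .replace("brak informacji", np.nan)        # placeholder -> NaN
    .str.replace("zł", "", regex=False)        # drop the currency suffix
    .str.replace("\u00a0", "", regex=False)    # remove non-breaking spaces
    .str.replace(" ", "", regex=False)         # remove regular spaces
    .str.replace(",", ".", regex=False)        # decimal comma -> decimal point
    .astype(float)
)
print(rent.tolist())
```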

### Step 5: Creating Missing Indicator for 'Czynsz'

We create a binary indicator that records whether 'Czynsz' (monthly rent) was reported: `has_czynsz` is 1 when a value is present and 0 when it is missing. Since missing rents are imputed later, this flag preserves the information that the original listing had no rent figure.

In [5]:
# Indicator for 'Czynsz': 1 if a rent value is present, 0 if it is missing.
# notna() is used so that the values match the column name has_czynsz.
df['has_czynsz'] = df['Czynsz'].notna().astype(int)

### Step 6: Extracting City, Neighborhood, and Region

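We extract location information from the 'location' column using regular expressions anchored at the end of the string. The patterns read the last three comma-separated tokens: the third-from-last becomes `city`, the second-from-last becomes `neighborhood`, and the last becomes `region`. The original 'location' column is then dropped.

Because `\w+` matches a single run of word characters, names containing spaces or hyphens are only partially captured, so it is worth spot-checking the extracted values against the raw strings.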

In [6]:
# Extract city, neighborhood, and region
df['city'] = df['location'].str.extract(r'(\w+),\s*\w+,\s*(\w+)$')[0]
df['neighborhood'] = df['location'].str.extract(r'(\w+),\s*(\w+),\s*\w+$')[1]
df['region'] = df['location'].str.extract(r'(\w+)$')[0]
df = df.drop(columns=['location'])
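
To see what these patterns capture, they can be run on a couple of sample strings. The location strings below are hypothetical and the real scraped format may differ. The sketch also shows one possible alternative, splitting from the right on commas, which keeps hyphenated names intact; the column names `third_last`, `second_last`, and `last` are placeholders chosen for this example.

```python
import pandas as pd

# Hypothetical location strings; the real scraped format may differ.
loc = pd.Series([
    "Mokotów, Warszawa, mazowieckie",
    "Praga-Południe, Warszawa, mazowieckie",
])

# Pattern used for 'city' above: third-from-last token, word characters only.
by_regex = loc.str.extract(r'(\w+),\s*\w+,\s*(\w+)$')[0]

# Alternative: split from the right on commas, then strip whitespace.
parts = loc.str.rsplit(",", n=2, expand=True).apply(lambda col: col.str.strip())
parts.columns = ["third_last", "second_last", "last"]

print(by_regex.tolist())             # ['Mokotów', 'Południe']
print(parts["third_last"].tolist())  # ['Mokotów', 'Praga-Południe']
```

On these samples the regular expression truncates "Praga-Południe" to "Południe", while the split keeps the full name.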

### Step 7: Preprocessing the Price Column

We clean the 'price' column by removing unnecessary characters such as currency symbols and spaces, and replacing commas with dots. Any rows with unavailable price information are removed.

It's essential to preprocess target variables just as carefully as input features to ensure the model can make accurate predictions.

In [7]:
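# Preprocess price column: strip the ' zł' suffix and spaces, use '.' as the decimal separator
df['price'] = df['price'].str.replace(' zł', '', regex=True).str.replace(' ', '', regex=True).str.replace(',', '.', regex=True)
# The placeholder is matched without spaces because they were removed in the line above
df['price'] = df['price'].replace('Pricenotavailable', np.nan).astype(float)
# Drop listings without a usable price, since price is the prediction target
df = df.dropna(subset=['price'])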
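
The order of operations explains the unusual placeholder string. In the sketch below, the raw text "Price not available" is an assumption about what the scraper writes, and the prices are invented; once spaces are stripped, the placeholder reads `Pricenotavailable`, which is what the replacement has to match.

```python
import numpy as np
import pandas as pd

# Invented values; "Price not available" is an assumed raw placeholder.
raw_price = pd.Series(["650 000 zł", "Price not available", "1 250 000,50 zł"])

no_spaces = (
    raw_price
    .str.replace(" zł", "", regex=False)   # drop the currency suffix
    .str.replace(" ", "", regex=False)     # remove thousands separators
    .str.replace(",", ".", regex=False)    # decimal comma -> decimal point
)

# After space removal the placeholder has become one word.
price = no_spaces.replace("Pricenotavailable", np.nan).astype(float)
valid = price.dropna()

print(no_spaces.tolist())
print(valid.tolist())
```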

### Step 8: Creating 'Price per Square Meter' Feature

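We calculate the price per square meter, a standard way to compare listings of different sizes.

Because this column is computed from `price`, which is also the prediction target, it is not available for an unseen listing and it leaks the target when used as a model input. It is best suited to exploratory analysis; if it stays in the feature set, the evaluation metrics will look better than the model's real predictive ability.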

In [8]:
# Feature: Price per sqm
df['price_per_sqm'] = df['price'] / df['size']
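
A simple way to check how much a ratio feature reveals about the target is to multiply it back by its denominator. The sketch below uses made-up listings and shows that `price_per_sqm` together with `size` reproduces `price` exactly.

```python
import numpy as np
import pandas as pd

# Made-up listings for illustration.
toy = pd.DataFrame({
    "price": [500_000.0, 720_000.0, 900_000.0],
    "size": [40.0, 60.0, 90.0],
})
toy["price_per_sqm"] = toy["price"] / toy["size"]

# Multiplying the two columns recovers the target.
reconstructed = toy["price_per_sqm"] * toy["size"]
print(np.allclose(reconstructed, toy["price"]))
```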

### Step 9: Extracting Features from 'Informacje dodatkowe'

We create binary features based on the amenities mentioned in the 'Informacje dodatkowe' column (e.g., balcony, garage). This allows the model to understand whether these amenities are present or not.

Feature extraction from text data is used to derive meaningful features from unstructured information.

In [9]:
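# Feature Extraction: 'Informacje dodatkowe'
def extract_amenities(data):
    """Add one binary has_<amenity> column per amenity found in 'Informacje dodatkowe'."""
    amenities = ['balkon', 'taras', 'garaż/miejsce parkingowe', 'piwnica', 'oddzielna kuchnia', 'ogródek', 'pom. użytkowe']
    for amenity in amenities:
        # Column name: slashes and spaces become underscores, e.g. has_garaż_miejsce_parkingowe
        column = f'has_{amenity.replace("/", "_").replace(" ", "_").lower()}'
        # regex=False matches the amenity literally (the '.' in 'pom. użytkowe' is not a wildcard)
        data[column] = data['Informacje dodatkowe'].str.contains(amenity, case=False, na=False, regex=False).astype(int)
    return data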

df = extract_amenities(df)
df = df.drop(columns=['Informacje dodatkowe'])
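
The sketch below applies the same idea to three invented 'Informacje dodatkowe' values, to show the generated column names and flags. The helper `amenity_column` is a hypothetical name introduced only for this example.

```python
import pandas as pd

# Invented 'Informacje dodatkowe' strings; None represents a missing entry.
info = pd.Series(["balkon, piwnica", "garaż/miejsce parkingowe", None])

def amenity_column(amenity: str) -> str:
    """Build the has_* column name the same way the cell above does."""
    return "has_" + amenity.replace("/", "_").replace(" ", "_").lower()

flags = pd.DataFrame({
    amenity_column(a): info.str.contains(a, case=False, na=False, regex=False).astype(int)
    for a in ["balkon", "garaż/miejsce parkingowe", "piwnica"]
})

print(flags.columns.tolist())
print(flags.values.tolist())
```

A missing description yields 0 for every amenity, so "not mentioned" and "no information" are indistinguishable in these columns.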

### Step 10: Removing Outliers

Outliers are removed using the Interquartile Range (IQR) method: a row is kept only if its value lies strictly between Q1 - 1.5 × IQR and Q3 + 1.5 × IQR, where IQR = Q3 - Q1. The filter is applied first to 'price' and then to 'size'.

Removing extreme listings keeps a handful of unusual properties from dominating training and the error metric.

In [10]:
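# Handle outliers (IQR method)
def remove_outliers(df, column):
    """Keep rows whose `column` value lies strictly inside the 1.5 * IQR fences."""
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    # .copy() returns an independent frame, avoiding chained-assignment warnings when columns are added later
    return df[(df[column] > (Q1 - 1.5 * IQR)) & (df[column] < (Q3 + 1.5 * IQR))].copy()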

df = remove_outliers(df, 'price')
df = remove_outliers(df, 'size')
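
As a small worked example of the IQR rule, the sketch below computes the fences for six invented values and filters with the same strict inequalities.

```python
import pandas as pd

# Invented values with one obvious outlier.
prices = pd.DataFrame({"price": [100, 110, 120, 130, 140, 1000]})

q1 = prices["price"].quantile(0.25)
q3 = prices["price"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only rows strictly inside the fences.
kept = prices[(prices["price"] > lower) & (prices["price"] < upper)]

print(lower, upper)
print(kept["price"].tolist())
```

Here Q1 is 112.5 and Q3 is 137.5, so the fences are 75 and 175 and only the value 1000 is dropped.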

### Step 11: Log Transforming the Target Variable

We apply a natural-log transformation to the 'price' column to reduce right skew and make the distribution closer to Gaussian. This is a common technique in regression problems with heavily skewed targets.

Training on log-price means errors are treated in relative rather than absolute terms, so expensive listings do not dominate the loss. Predictions must be mapped back with `np.exp` before they are interpreted as prices.

In [11]:
# Log transform price (reduces skewness)
df['log_price'] = np.log(df['price'])
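
The effect of the transformation can be checked on a small invented sample: the skewness of the logged values should be lower than that of the raw prices, and `np.exp` should recover the original numbers.

```python
import numpy as np
import pandas as pd

# Invented prices with one very expensive listing.
prices = pd.Series([200_000.0, 300_000.0, 400_000.0, 500_000.0, 3_000_000.0])

log_prices = np.log(prices)      # what the model is trained on
recovered = np.exp(log_prices)   # back to the original scale

raw_skew = prices.skew()
log_skew = log_prices.skew()

print(raw_skew, log_skew)
print(np.allclose(recovered, prices))
```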

### Step 12: Defining Features and Target

We define the input features (`X`) and the target variable (`y`). In this case, the target variable is the log-transformed price.

Carefully defining the features and target is key to ensuring the model learns from the right data.

In [12]:
# Define features and target
X = df.drop(columns=['price', 'log_price', 'title'])
y = df['log_price']

### Step 13: Creating Preprocessing Pipelines

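We define separate preprocessing pipelines for numerical and categorical features. For numerical features, we impute missing values with the column mean and then standardize. For categorical features, we fill missing values with the constant 'missing' and one-hot encode, ignoring categories that were not seen during training.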

Using a pipeline allows us to streamline the preprocessing steps and ensure consistency during training and testing.

In [13]:
# Define categorical and numerical features
categorical_features = ['Ogrzewanie', 'Piętro', 'Stan wykończenia', 'Rynek', 'Forma własności', 'Typ ogłoszeniodawcy', 'neighborhood']
# The has_* comprehension already picks up has_czynsz, so it is not listed separately (avoids a duplicated column)
numerical_features = ['size', 'Czynsz', 'price_per_sqm'] + [col for col in df.columns if col.startswith('has_')]

# Preprocessing Pipelines
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

### Step 14: Model Pipeline with XGBoost

Here, we create the machine learning pipeline, which includes both preprocessing and the model. We use XGBoost, a powerful gradient boosting algorithm, for regression.

XGBoost is popular due to its efficiency, scalability, and strong performance in many tasks.

In [14]:
# Model Pipeline with XGBoost
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=6, random_state=42))
])

### Step 15: Training the Model

We split the dataset into training (80%) and test (20%) sets with a fixed random seed, then fit the pipeline on the training portion only. Evaluating on the held-out test set shows how well the model generalizes to unseen listings.

In [15]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model.fit(X_train, y_train)

### Step 16: Model Evaluation using Mean Absolute Error (MAE)

We evaluate the model's performance using the Mean Absolute Error (MAE), the average absolute difference between predicted and actual values. Because the model predicts log-price, both the predictions and the test targets are exponentiated first, so the MAE is expressed in złoty (PLN) rather than in log units.

MAE is easy to interpret: it states how far off, on average, a predicted price is from the actual one.

In [16]:
# Model Prediction and Evaluation
y_pred = np.exp(model.predict(X_test))  # Reverse log transformation
mae = mean_absolute_error(np.exp(y_test), y_pred)
print(f'Mean Absolute Error: {mae}')

Mean Absolute Error: 82749.85096153835
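
The difference between computing MAE in log space and on the price scale can be illustrated with two invented listings. The numbers below are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Invented prices in PLN.
true_price = np.array([400_000.0, 600_000.0])
pred_price = np.array([440_000.0, 540_000.0])

# The model works in log space, so mimic that here.
y_true_log, y_pred_log = np.log(true_price), np.log(pred_price)

mae_log = mean_absolute_error(y_true_log, y_pred_log)                   # in log units
mae_pln = mean_absolute_error(np.exp(y_true_log), np.exp(y_pred_log))   # in PLN

print(mae_log, mae_pln)
```

The log-scale value is roughly 0.1, which reads as an average relative error of about 10%, while the exponentiated version gives the average miss of 50,000 PLN directly.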


### Step 17: Saving the Model

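Finally, we save the trained pipeline using `joblib`, which allows us to reload it later and make predictions without retraining. The saved pipeline already contains the fitted preprocessing steps, so `housing_price_predictor_model.pkl` is sufficient on its own; the preprocessor is additionally written to `preprocessor.pkl` as a separate file.

Persisting the model lets other scripts reuse it, provided they run compatible versions of scikit-learn and XGBoost.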

In [17]:
# Save the model
joblib.dump(model, 'housing_price_predictor_model.pkl')
joblib.dump(preprocessor, 'preprocessor.pkl')

['preprocessor.pkl']
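
Reloading works the same way in reverse. The sketch below is a minimal round trip on invented data: a tiny pipeline with `LinearRegression` stands in for the XGBoost pipeline so the example stays small, and the file name `toy_model.pkl` is hypothetical. With the real model, `joblib.load('housing_price_predictor_model.pkl')` would take the place of the toy file.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny stand-in pipeline on invented data (LinearRegression replaces XGBoost here).
X_toy = np.array([[40.0], [60.0], [80.0], [100.0]])
y_toy = np.log(np.array([400_000.0, 600_000.0, 800_000.0, 1_000_000.0]))
toy_model = Pipeline([("scaler", StandardScaler()), ("regressor", LinearRegression())])
toy_model.fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_model.pkl")   # hypothetical file name
    joblib.dump(toy_model, path)
    restored = joblib.load(path)

# The restored pipeline scales and predicts in one call; exp() undoes the log target.
same = np.allclose(restored.predict(X_toy), toy_model.predict(X_toy))
predicted_price = np.exp(restored.predict(np.array([[70.0]])))

print(same, predicted_price)
```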