# Housing Price Predictor

## Objectives

The objective of this project is to evaluate candidate models and identify the best one for predicting property prices in the UK. To achieve this, the project uses publicly available property price data, postcode location information, financial data, and EPC (Energy Performance Certificate) records.

The workflow includes several key steps:

**1. Data Collection and Integration** – Importing and merging multiple datasets into a cohesive format.

**2. Data Cleaning** – Handling missing values, removing outliers, and ensuring data quality.

**3. Exploratory Data Analysis (EDA)** – Visualizing trends and relationships in the data to better understand feature behavior.

**4. Modelling** – Testing and optimizing various predictive models, including linear and tree-based methods, to identify the best-performing approach.

The ultimate goal is to develop a robust predictive model capable of accurately estimating the price of individual properties based on multiple features, providing reliable and actionable insights.

## 1. Data Collection and Integration

### 1.1 Data Sources

Accurate and reliable data is critical for the success of this project. Without high-quality data, any conclusions regarding model performance would be unreliable. This project draws on six key data sources:

**1. Price Paid Data (2018 – August 2025)** – Detailed property transaction prices across the UK. Source: Gov.uk Price Paid Data.

**2. ONS Postcode Directory (Latest Centroids)** – Geographic centroids for UK postcodes, enabling spatial analysis. Source: ONSPD Centroids.

**3. EPC Open Data** – Energy Performance Certificate records for UK properties. Source: EPC Data.

**4. Bank of England Interest Rates** – Historical base interest rates affecting property financing. Source: Bank of England Rates.

**5. ONS Unemployment Rates** – Labour market data reflecting economic conditions. Source: ONS Unemployment Data.

**6. ONS Inflation Rates** – Consumer price inflation data, providing context for market trends. Source: ONS Inflation Data.

Due to the large size of these datasets, some were accessed directly from public URLs and preprocessed in chunks to manage memory usage. Unnecessary features were removed before merging with other dataframes to reduce storage requirements. For example:

In [None]:
import pandas as pd

#Set the years of data available
years = [18, 19, 20, 21, 22, 23, 24, 25]

#Set the heading names from the CSVs
headers = ["transaction_id","price","date_of_transfer","postcode","property_type","new_build_flag","tenure_type","primary_addressable_object_name",
            "secondary_addressable_object_name","street","locality","town_city","district","county","ppd_category_type","record_status"]

#Drop irrelevant columns
drop = ["transaction_id","primary_addressable_object_name","secondary_addressable_object_name","street","locality","ppd_category_type","record_status"]

#Loop through all available CSVs
dfs = {}
for year in years:
    url = f"http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-20{year}.csv"
    #The yearly files have no header row, so pass the column names explicitly;
    #otherwise the first transaction in each file is consumed as the header
    df = pd.read_csv(url, header=None, names=headers)
    df = df.drop(columns=drop)
    dfs[f"df20{year}"] = df
    print(f"20{year} done")
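
The loop above reads each yearly file in one call. As a minimal sketch of the chunked approach described earlier, the hypothetical helper `read_price_paid` below reads a headerless Price Paid CSV in fixed-size chunks and keeps only the required columns at parse time. The helper name, the chunk size, and the two demo rows are illustrative assumptions and are not taken from the project.

```python
import io

import pandas as pd

HEADERS = ["transaction_id", "price", "date_of_transfer", "postcode", "property_type",
           "new_build_flag", "tenure_type", "primary_addressable_object_name",
           "secondary_addressable_object_name", "street", "locality", "town_city",
           "district", "county", "ppd_category_type", "record_status"]
DROP = ["transaction_id", "primary_addressable_object_name",
        "secondary_addressable_object_name", "street", "locality",
        "ppd_category_type", "record_status"]
KEEP = [col for col in HEADERS if col not in DROP]


def read_price_paid(source, chunksize=100_000):
    """Read a headerless Price Paid CSV in chunks, parsing only the kept columns."""
    with pd.read_csv(source, header=None, names=HEADERS, usecols=KEEP,
                     chunksize=chunksize) as reader:
        # Each chunk is a small DataFrame; unwanted columns are never materialised
        return pd.concat([chunk for chunk in reader], ignore_index=True)


# Two made-up rows in the 16-column layout, used only to exercise the helper
sample = (
    '"{A}","250000","2018-01-05 00:00","AB1 2CD","D","N","F","1","","HIGH STREET","","TOWN","DISTRICT","COUNTY","A","A"\n'
    '"{B}","180000","2018-02-10 00:00","EF3 4GH","T","N","F","7","","LOW ROAD","","TOWN","DISTRICT","COUNTY","A","A"\n'
)
demo = read_price_paid(io.StringIO(sample), chunksize=1)
print(demo[["price", "postcode", "property_type"]])
```

In practice `source` could be one of the yearly URLs; selecting columns with `usecols` keeps peak memory lower than dropping them after the full file has been loaded.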

### 1.2 Data Merging

To construct the final analysis dataset, all five external sources were merged into the primary Price Paid Data. Since each source had different structures and identifiers, a tailored merging strategy was required for each:

**1. ONS Postcode Directory** – Standardized postcodes by removing spaces and converting all characters to uppercase before merging.

**2. EPC Open Data** – Created a new matching column by combining cleaned address fields with standardized postcodes to align property records.

**3. Interest Rates** – Merged by aligning the month and year of the property sale with the corresponding Bank of England interest rate.

**4. Unemployment Rates** – Merged using the sale month and year to match with the Office for National Statistics unemployment rate for the same period.

**5. Inflation Rates** – Forward-filled inflation rates so that all transaction dates falling between ONS update intervals were assigned the most recent available rate.

This process ensured that each property transaction in the Price Paid Data was enriched with geographic, energy performance, and macroeconomic context. 
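
Two of the merging strategies above can be sketched as follows: postcode standardisation, and assigning each sale the most recent published rate (equivalent to forward-filling between release dates). The function names, the column names `date` and `inflation_rate`, and the sample values are assumptions made for illustration; the project's actual merge code is not shown here.

```python
import pandas as pd


def standardise_postcode(postcodes: pd.Series) -> pd.Series:
    """Remove all whitespace and upper-case so postcodes match across sources."""
    return postcodes.str.replace(r"\s+", "", regex=True).str.upper()


def attach_latest_rate(sales: pd.DataFrame, rates: pd.DataFrame,
                       date_col: str = "date_of_transfer",
                       rate_date_col: str = "date") -> pd.DataFrame:
    """Give each sale the latest rate published on or before its transfer date."""
    # merge_asof requires both frames to be sorted on their key columns
    return pd.merge_asof(sales.sort_values(date_col),
                         rates.sort_values(rate_date_col),
                         left_on=date_col, right_on=rate_date_col,
                         direction="backward")


# Made-up example values, not real ONS figures
sales = pd.DataFrame({
    "date_of_transfer": pd.to_datetime(["2024-01-15", "2024-02-20", "2024-03-05"]),
    "postcode": ["ab1 2cd", "EF3  4GH", "ij5 6kl"],
})
rates = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-02-01"]),
    "inflation_rate": [4.0, 3.4],
})

sales["postcode"] = standardise_postcode(sales["postcode"])
merged = attach_latest_rate(sales, rates)
print(merged[["date_of_transfer", "postcode", "inflation_rate"]])
```

The March sale falls after the last release in the sample, so it inherits the February rate, which is the forward-fill behaviour described for the inflation data.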

## 2. Data Cleaning

To prepare the final dataset for predictive modelling, the following preprocessing steps were applied:  

| Step | Action | Detail | Purpose |
|------|--------|--------|---------|
| 1 | **Column Reduction** | Removed unnecessary and duplicate columns | Streamlined the dataset to reduce noise and memory usage |
| 2 | **Handling Missing Values** | Removed rows or imputed with `0`, mode, or median depending on the feature | Ensured dataset completeness and reduced bias from missing values |
| 3 | **Categorical Encoding** | Applied one-hot encoding to categorical variables | Made categorical data usable in machine learning models |
| 4 | **Feature Engineering** | Created new features such as year, quarter, month, and day of week of sale | Captured temporal trends and improved model expressiveness |
| 5 | **Data Type Standardisation** | Converted all features to `int` or `float` | Ensured numerical consistency across all features |
| 6 | **Outlier Removal** | Removed property prices above the 95th percentile (the top 5%) | Reduced skewness and improved model stability |


### 2.1 Column Reduction

In [None]:
drop_cols = ['ADDRESS2', 'FLAT_TOP_STOREY', 'FLAT_STOREY_COUNT','HOT_WATER_ENV_EFF', 'FLOOR_ENERGY_EFF', 'WALLS_ENV_EFF', 'SHEATING_ENERGY_EFF','FLOOR_ENV_EFF','SHEATING_ENV_EFF']
df.drop(columns=drop_cols, inplace=True, errors='ignore')

### 2.2 Handling Missing Values

In [None]:
#Columns replaced with their mode
mode_impute_cols = ['MAINS_GAS_FLAG', 'MAIN_HEATING_CONTROLS','GLAZED_TYPE', 'LOW_ENERGY_LIGHTING','SOLAR_WATER_HEATING_FLAG',
                    'MAIN_FUEL', 'GLAZED_AREA', 'ROOF_ENERGY_EFF', 'ROOF_ENV_EFF']

for col in mode_impute_cols:
    if col in df.columns:
        mode_series = df[col].mode(dropna=True)
        if not mode_series.empty:
            #Assign the result back instead of calling fillna(inplace=True) on a
            #column selection, which newer pandas versions may not write through to df
            df[col] = df[col].fillna(mode_series.iloc[0])

#Columns replaced with their median
median_impute_cols = ['MULTI_GLAZE_PROPORTION', '', 'EXTENSION_COUNT', 'NUMBER_HABITABLE_ROOMS', 'NUMBER_HEATED_ROOMS', 'FLOOR_HEIGHT']

for col in median_impute_cols:
    if col in df.columns:
        df[col] = df[col].fillna(df[col].median())

#Columns replaced with 0
#EXTENSION_COUNT is also median-imputed above, so the zero fill only affects it if its median is undefined
zero_fill_cols = ['EXTENSION_COUNT', 'WIND_TURBINE_COUNT','NUMBER_OPEN_FIREPLACES']
for col in zero_fill_cols:
    if col in df.columns:
        df[col] = df[col].fillna(0)

#Dropping rows with null values
critical_cols = ['BUILT_FORM', 'LAT', 'LONG', 'OSEAST1M', 'OSNRTH1M','LIGHTING_ENV_EFF', 'LIGHTING_ENERGY_EFF', 
                 'MAINHEATC_ENV_EFF','MAINHEATC_ENERGY_EFF', 'MAINHEAT_ENV_EFF', 'MAINHEAT_ENERGY_EFF','WALLS_ENERGY_EFF',  
                 'WINDOWS_ENERGY_EFF','WINDOWS_ENV_EFF', 'HOT_WATER_ENERGY_EFF', 'HOTWATER_DESCRIPTION', 'ADDRESS1']

df.dropna(subset=[col for col in critical_cols if col in df.columns], inplace=True)

missing_summary = df.isnull().sum()
print("Remaining missing values:\n", missing_summary[missing_summary > 0])

### 2.3 Categorical Encoding

In [None]:
df_binary = pd.get_dummies(df, dtype=int, columns=['tenure_type','new_build_flag', 'property_type'])
df_binary.rename(columns={'tenure_type_F':'Freehold Tenure','tenure_type_L':'Leasehold Tenure','new_build_flag_N':'Old Build',
                          'new_build_flag_Y':'New Build','property_type_D':'Detached', 'property_type_F':'Flat', 
                          'property_type_O':'Other Property Type','property_type_S':'Semi-detached','property_type_T':'Terraced'}, inplace=True)

### 2.4 Feature Engineering

In [None]:
df_binary['Transfer Date'] = pd.to_datetime(df_binary['Transfer Date'], errors='coerce')
df_binary['Year'] = df_binary['Transfer Date'].dt.year
df_binary['Month'] = df_binary['Transfer Date'].dt.month
df_binary['Quarter'] = df_binary['Transfer Date'].dt.quarter
df_binary['Day of the Week'] = df_binary['Transfer Date'].dt.dayofweek
df_binary['Transfer Date'] = df_binary['Transfer Date'].astype(str).str[:10]
df_binary.tail()

### 2.5 Data Type Standardisation

In [None]:
df['MAIN_HEATING_CONTROLS'] = df['MAIN_HEATING_CONTROLS'].astype('float64')

### 2.6 Outlier Removal

In [None]:
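#Keep rows priced at or below the 95th percentile (this drops the top 5%)
upper = df['Price (Thousands)'].quantile(0.95)
#Filter the dataframe itself; the comparison alone only yields a boolean mask
df_no_outliers = df[df['Price (Thousands)'] <= upper]
df_no_outliers.describe()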

## 3. Exploratory Data Analysis

To analyse the distribution of property prices, two visualisations were generated: a boxplot (Figure 1) and a histogram (Figure 2). Both indicate that most properties cluster within a £200,000 range, but the histogram reveals a long, gradually tapering upper tail. Prices above the upper quartile are therefore spread across a wide range rather than forming a concentrated cluster.

Such a distribution poses challenges for many models, particularly those that rely on assumptions of normally distributed errors or homoscedasticity. XGBoost, however, is well suited to this task: as a tree-based gradient boosting method it makes no distributional assumptions, models non-linear relationships, and fits each new tree to the residual errors of the current ensemble, so poorly predicted instances, including high-value properties in the tail, receive the largest corrections. Its regularisation terms help prevent overfitting to sparse, extreme values, and a target transformation (for example, modelling the logarithm of price) can further stabilise variance in the upper tail.

This combination of flexibility, regularisation, and iterative correction of large errors makes XGBoost a strong choice for modelling property prices in the presence of a long-tailed distribution.
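
As an illustration of the target transformation mentioned above, the sketch below applies `log1p` to a synthetic, right-skewed price sample and inverts it with `expm1`. The log-normal sample and its parameters are assumptions chosen only to mimic a long upper tail; they are not drawn from the project data, and the source does not state that the final model used this transformation.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" in thousands (illustrative only)
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=5.5, sigma=0.6, size=10_000),
                   name="price_thousands")

# Train on log1p(price); map predictions back to the price scale with expm1
log_prices = np.log1p(prices)
recovered = np.expm1(log_prices)

print(f"skewness before transform: {prices.skew():.2f}")
print(f"skewness after transform:  {log_prices.skew():.2f}")
```

A model fitted to `log_prices` would have its predictions passed through `np.expm1` before computing MAE or MSE, so that the metrics remain on the original price scale.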

<figure style="text-align: center;">
    <img src="images/Box%20Plot.png" width="500">
    <figcaption>Figure 1: Distribution of House Prices Using a Box Plot</figcaption>
</figure>


<figure style="text-align: center;">
    <img src="images/Price%20Hist.png" width="500">
    <figcaption>Figure 2: Distribution of House Prices Using a Histogram</figcaption>
</figure>

The proportion of different property types is illustrated using a series of pie charts (Figure 3). While most property types are relatively evenly distributed, new builds are markedly underrepresented compared to older properties, and Freehold properties significantly outnumber Leasehold ones. This imbalance could make it more challenging for a model to detect subtle differences associated with underrepresented categories. However, given the large overall size of the dataset, even these smaller segments contain sufficient instances to provide meaningful training signals, mitigating some of the potential difficulties in modelling.

<figure style="text-align: center;">
    <img src="images/Pie.png" width="500">
    <figcaption>Figure 3: Pie Charts of Property Type Proportions</figcaption>
</figure>

Feature engineering introduced several time-based metrics, among which the month variable showed no meaningful correlation with property prices. The most notable temporal trend was a significant drop in average house prices during weekends. Further analysis suggests this is likely attributable to a reduced volume of ‘other’ property types—presumed to be commercial buildings—being sold on weekends, which skews the mean downward.

Year-on-year, average property prices displayed a generally upward trend, with only a slight decline observed in 2023. Quarterly analysis revealed that the third and fourth quarters exhibited marginally higher average prices compared to the first and second, although this increase was less than 1%, indicating limited seasonal influence overall.
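
The weekend effect described above can be checked by comparing mean prices with and without the 'other' property type. The sketch below uses five made-up sales and reuses the column names `Transfer Date`, `Price (Thousands)`, and `Other Property Type` from the earlier cells; the `Day Type` column and all values are assumptions for illustration.

```python
import pandas as pd

# Five made-up sales: three on weekdays (one high-value 'other' property), two at the weekend
sales = pd.DataFrame({
    "Transfer Date": pd.to_datetime(["2024-03-04", "2024-03-05", "2024-03-06",
                                     "2024-03-09", "2024-03-10"]),
    "Price (Thousands)": [250.0, 300.0, 2000.0, 240.0, 260.0],
    "Other Property Type": [0, 0, 1, 0, 0],
})

# dayofweek is 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend
is_weekend = sales["Transfer Date"].dt.dayofweek.ge(5)
sales["Day Type"] = is_weekend.map({True: "weekend", False: "weekday"})

# Mean is pulled up by the single 'other' sale; median is far less affected
summary = sales.groupby("Day Type")["Price (Thousands)"].agg(["mean", "median", "count"])

# Repeating the comparison without 'other' properties isolates their influence
residential = (sales[sales["Other Property Type"] == 0]
               .groupby("Day Type")["Price (Thousands)"].mean())

print(summary)
print(residential)
```

If the weekday and weekend means converge once 'other' properties are excluded, as they do in this toy sample, the gap is a composition effect rather than a genuine day-of-week price difference.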

<figure style="text-align: center;">
    <img src="images/Average%20Time%20Prices.png" width="500">
    <figcaption>Figure 4: Line Chart Showing Change in Average House Price for Various Time Metrics</figcaption>
</figure>

A bubble plot was created to visualise property prices geographically, using latitude and longitude as the axes. This revealed higher average prices concentrated around London and other major cities, including Liverpool and Manchester. These patterns indicate that geographical location has a substantial influence on property prices and is likely to be a key predictor in any pricing model. Incorporating spatial features allows models, such as XGBoost, to capture regional variations and localised pricing trends, which may significantly improve predictive accuracy.

<figure style="text-align: center;">
    <img src="images/Location.png" width="1000">
    <figcaption>Figure 5: Geographical Impact on House Prices</figcaption>
</figure>

As expected, lower interest rates, inflation, and unemployment levels are associated with higher property sales. These macroeconomic conditions directly influence buyer affordability and market confidence, leading to increased transaction volumes. Consequently, the model is likely to perform more accurately during periods of low interest, low inflation, and low unemployment, as these conditions dominate the dataset and provide abundant training examples. However, this also implies a potential limitation: the model may underperform during periods of economic stress or high rates, where fewer sales and more extreme price fluctuations occur. Incorporating macroeconomic indicators as features allows XGBoost to partially adjust for these conditions, but careful attention to data distribution and potential non-linear effects is necessary to maintain predictive reliability across varying economic climates.

<figure style="text-align: center;">
    <img src="images/Financial%20.png" width="1000">
    <figcaption>Figure 6: Effect of Various Macroeconomic Indicators on the Number of Houses Sold</figcaption>
</figure>

## 4. Modelling

The features exhibiting the strongest correlation with property price were found to be latitude, whether the property is detached, and the number of habitable rooms. This relationship was visualised using a single-column heatmap, which clearly highlights the relative strength of these correlations. Latitude likely captures regional pricing differences, detached properties generally command higher prices due to size and exclusivity, and the number of habitable rooms reflects property scale and utility. These insights suggest that these features will be particularly influential in the predictive performance of the model.

In [None]:
from matplotlib import pyplot as plt
import seaborn as sns
from scipy import stats
import numpy as np

p_score_df = []
for col in x_data:
    pearson_coef, p_value = stats.pearsonr(y_data, x_data[col])
    pcrs = abs(pearson_coef)  #absolute correlation strength, identical to sqrt(r ** 2)
    p_score_df.append([col, pearson_coef, p_value, pcrs])

p_score_df = pd.DataFrame(p_score_df, columns=["Feature", "Pearson Coefficient", "P-value", "Root Square Pearson Coefficient"])
p_score_df.set_index("Feature", inplace=True)
p_score_df = p_score_df.sort_values(by="Root Square Pearson Coefficient", ascending=False)

plt.figure(figsize=(6, len(p_score_df) * 0.5))
sns.heatmap(p_score_df[["Root Square Pearson Coefficient"]], annot=True, cmap='coolwarm', cbar=False,
            linewidths=0.5, linecolor='gray')

plt.yticks(rotation=0)
plt.title("Correlation with Price (Thousands)")
plt.show()

Model performance was evaluated using mean absolute error (MAE), mean squared error (MSE), which penalises large errors more heavily and therefore reflects sensitivity to outliers, and R², which measures overall explanatory power. The baseline linear regression model yielded an MAE of 63.3, an MSE of 33,120, and an R² of -0.185; the negative R² means that simply predicting the mean property price would outperform the model. In contrast, XGBoost demonstrated a substantial improvement, achieving an MAE of 44.9, an MSE of 4,541, and an R² of 0.838. This highlights XGBoost's superior ability to capture non-linear relationships and complex interactions in the data, as well as its robustness to the skewed distribution and outliers present in the property prices.
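
To make the interpretation of a negative R² concrete, the sketch below computes the three metrics from their definitions with NumPy. The helper `regression_metrics` and the four toy prices are assumptions for illustration; the project itself reports values from the scikit-learn metric functions.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE and R² computed directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))
    mse = np.mean(errors ** 2)
    # R² compares the model's squared error with that of always predicting the mean
    r2 = 1.0 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"mae": mae, "mse": mse, "r2": r2}


# Toy prices in thousands (made up)
y_true = np.array([150.0, 200.0, 250.0, 400.0])

# Predicting the mean for every property gives R² = 0 by construction
mean_baseline = regression_metrics(y_true, np.full_like(y_true, y_true.mean()))

# A model whose squared error exceeds the mean baseline's has R² < 0
poor_model = regression_metrics(y_true, np.array([400.0, 150.0, 200.0, 250.0]))

print(mean_baseline)
print(poor_model)
```

Because R² is measured against the mean predictor, any value below zero indicates a model that fits the evaluation data worse than that constant baseline.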

In [None]:
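from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from xgboost import XGBRegressor

#Fit XGBoost with default hyperparameters for comparison against the linear baseline
xgb = XGBRegressor()
xgb.fit(x_train, y_train)

#Evaluate on the held-out test set
y_pred_xgb = xgb.predict(x_test)
print("XGBoost - MAE:", mean_absolute_error(y_test, y_pred_xgb))
print("XGBoost - MSE:", mean_squared_error(y_test, y_pred_xgb))
print("XGBoost - R-squared:", r2_score(y_test, y_pred_xgb))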

The XGBoost model was hyperparameter-tuned using RandomizedSearchCV with 3-fold cross-validation over 20 iterations. The parameter search space included:

`base_score`: [0.25, 0.5] – initial prediction score of all instances, influencing the starting point for boosting.

`booster`: ['gbtree'] – using tree-based boosting to capture non-linearities and feature interactions.

`learning_rate`: [0.05, 0.1, 0.15] – controls the step size at each iteration, balancing convergence speed and overfitting.

`max_depth`: [2, 3, 5] – maximum depth of each tree, controlling model complexity.

`min_child_weight`: [1, 2] – minimum sum of instance weights in a child node, helping prevent overfitting on sparse data.

`n_estimators`: [100, 300, 500] – number of boosting rounds.

Other parameters were left at their defaults, including `colsample_bytree` (1, so every feature is available to each tree) and `gamma` (0, so no minimum loss reduction is required to split a node).
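
The search described above could be set up roughly as follows. The parameter lists, the 20 iterations, and the 3 folds come from the text; the scoring metric, `random_state`, `n_jobs`, and the helper name `build_search` are assumptions, since the original configuration is not shown.

```python
from math import prod

# Search space as listed above
PARAM_DISTRIBUTIONS = {
    "base_score": [0.25, 0.5],
    "booster": ["gbtree"],
    "learning_rate": [0.05, 0.1, 0.15],
    "max_depth": [2, 3, 5],
    "min_child_weight": [1, 2],
    "n_estimators": [100, 300, 500],
}
N_ITER = 20    # sampled parameter combinations
CV_FOLDS = 3   # cross-validation folds per combination


def build_search(random_state=42):
    """Build an unfitted RandomizedSearchCV over the space above (a sketch)."""
    # Imported here so the search space can be inspected without these libraries
    from sklearn.model_selection import RandomizedSearchCV
    from xgboost import XGBRegressor

    return RandomizedSearchCV(
        estimator=XGBRegressor(),
        param_distributions=PARAM_DISTRIBUTIONS,
        n_iter=N_ITER,
        cv=CV_FOLDS,
        scoring="neg_mean_absolute_error",  # assumption: metric not stated in the text
        random_state=random_state,          # assumption: fixed for reproducibility
        n_jobs=-1,
    )


# Size of the full grid versus what the randomized search actually evaluates
n_candidates = prod(len(values) for values in PARAM_DISTRIBUTIONS.values())
n_fits = N_ITER * CV_FOLDS
print(f"{n_candidates} possible combinations, {N_ITER} sampled, {n_fits} model fits")
```

Typical usage would be `search = build_search()`, then `search.fit(x_train, y_train)` and inspecting `search.best_params_`. The search samples 20 of the 108 possible combinations, so it trades exhaustive coverage for a much shorter run time.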

After tuning, the final model achieved MAE = 44.8, MSE = 4,526, and R² = 0.838. Compared with the default configuration (MAE = 44.9, MSE = 4,541, R² = 0.838), hyperparameter optimisation improved accuracy only marginally, which suggests that the default settings were already close to the best configuration within this search space. The tuned setup balances bias and variance while allowing XGBoost to model non-linear relationships and handle the skewed distribution of property prices.