# Overview

This project explores the affordability of UK housing. The datasets used are as follows:
- [UK House Price Index (HM Land Registry)](https://www.gov.uk/government/statistical-data-sets/uk-house-price-index-data-downloads-january-2025)
- [Annual Survey of Hours and Earnings (Sheet 12 - Full-time employees' pay by work region) (ONS)](https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/earningsandworkinghours/datasets/ashe1997to2015selectedestimates)
- [CPIH Index (ONS)](https://www.ons.gov.uk/economy/inflationandpriceindices/timeseries/l522/mm23)

This initial transformation phase aims to prepare the data in a format suitable for use within SQL.

This includes tasks such as unpivoting tables, removing columns irrelevant to the project, and eliminating "spacer" rows and columns.

Datatypes of columns are not addressed at this stage, nor are dimension tables set up or relationships between tables defined. These tasks will be handled in the next phase, within SQL.

# Initial Data Transformation (PowerQuery)

The original data sources are not structured in a format that is immediately suitable for use in SQL. For example, they contain spacer rows and columns, as well as data presented at multiple levels of granularity, which could lead to double (or even triple) counting. In this step, we focus on reshaping each dataset into a format that can be loaded into SQL, primarily through unpivoting and filtering, as needed.

Note that this stage does not involve handling of data types.

## UK House Price Index

This source contains data such as average house sale price, volume of sales, and average sale price by property type (e.g. detached, semi-detached, terraced, flat), all broken down by region.

However, the data includes rows at multiple levels of granularity: for example, UK-wide figures, NUTS1 regions, NUTS2 regions, and individual towns and cities. Retaining all of these would lead to significant overcounting, since the same sales would appear once at every level. To avoid this, only the rows at the NUTS1 region level are kept. Specifically, these regions are:

- North East  
- North West  
- Yorkshire and The Humber  
- East Midlands  
- West Midlands  
- East of England  
- London  
- South East  
- South West  
- Scotland  
- Wales  
- Northern Ireland

To filter the data accordingly, a helper table containing the NUTS1 regions (called `RUTS1Regions`) has been created. This avoids the built-in column filter, which doesn't seem to behave as I would expect when combined with the search feature. An additional column is then added to the main table to indicate whether or not each row corresponds to a NUTS1 region.

![A snapshot of the helper column, indicating whether a row is based on a NUTS1 region or not](documentation_images/DiscardKeepHelper.png)
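For readers working outside Excel and Power Query, the indicator column and the filter that follows can be sketched in Python. This is only an illustration of the idea, not the method used in this project: the names `REGIONS`, `flag_row` and `sample_rows` are hypothetical, and the prices are made up.

```python
# Hypothetical stand-in for the RUTS1Regions helper table.
REGIONS = {
    "North East", "North West", "Yorkshire and The Humber", "East Midlands",
    "West Midlands", "East of England", "London", "South East", "South West",
    "Scotland", "Wales", "Northern Ireland",
}


def flag_row(region_name: str) -> str:
    """Return "Keep" for a region-level row and "Discard" for anything else."""
    return "Keep" if region_name.strip() in REGIONS else "Discard"


# Made-up sample rows; the prices are illustrative, not real HPI figures.
sample_rows = [
    {"RegionName": "United Kingdom", "AveragePrice": "250000"},
    {"RegionName": "London", "AveragePrice": "500000"},
    {"RegionName": "Manchester", "AveragePrice": "220000"},
    {"RegionName": "Wales", "AveragePrice": "200000"},
]

# Keep only the rows flagged "Keep", mirroring the filter step.
kept_rows = [row for row in sample_rows if flag_row(row["RegionName"]) == "Keep"]
kept_names = [row["RegionName"] for row in kept_rows]
print(kept_names)
```

The match is exact, so a region spelled differently in the source (for example with different capitalisation) would be flagged "Discard".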

The table is then loaded into Power Query, where the data is filtered to keep only the "Keep" rows and discard the rest. Additionally, all columns are removed except the following:

- `Date`
- `RegionName`
- `AveragePrice`
- `SalesVolume`
- The average price columns for each property type (detached, semi-detached, terraced, and flat)

![The House Price Index table with unneeded columns removed](documentation_images/HPIFilteredPowerQuery.png)

This table is loaded and saved as "HPIInitialTransform.csv".

### A note on national data

Sales volumes are not broken down by property type (detached, semi-detached, terraced, and flat) within each region, so the national average sale price by type cannot be reconstructed from the regional data without making assumptions about how sales are distributed across the types.

Luckily, national figures are provided (filter for RegionName = "United Kingdom"), so we create a separate table with these figures.

We will also include the `SalesVolume` and `AveragePrice` columns. Although we could technically reproduce them from the regional data, i.e.

$$ \text{National Average Price} = \frac{\sum_{\text{regions}} \big( \text{Average price by region} \times \text{Sales volume per region}\big)}{\sum_{\text{regions}} \text{Sales volume per region}}, $$

the national data is available, so we may as well use it.
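As a small worked example of the weighted-average formula above, the sketch below uses two made-up regions. The function name `weighted_average_price` and all figures are hypothetical. It also shows why a plain mean of the regional averages would not be enough: the sales volumes act as weights.

```python
def weighted_average_price(rows):
    """Sales-volume-weighted mean of regional average prices.

    rows is a list of (average_price, sales_volume) pairs.
    """
    total_value = sum(price * volume for price, volume in rows)
    total_volume = sum(volume for _, volume in rows)
    return total_value / total_volume


# Made-up (average price, sales volume) pairs for two regions.
regions = [(200_000.0, 100), (500_000.0, 300)]

national = weighted_average_price(regions)
simple_mean = sum(price for price, _ in regions) / len(regions)
print(national, simple_mean)
```

Here the weighted figure is 425,000 while the unweighted mean is 350,000, because three quarters of the sales come from the more expensive region.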

This table is saved as "HPINational.csv". It will be used when doing analysis on the national level.

![A snapshot of the national (United Kingdom) House Price Index table](documentation_images/HPINationalSnapshot.png)

## Annual Survey of Hours and Earnings (median pay by region) - Initial data transformation

From this source, we use the median gross annual earnings of full-time employed workers by region. The original data source looks as follows: 

![Snapshot of the original source data for the Annual Survey of Hours and Earnings, obtained via the ONS](documentation_images/MedianSalarySource.png)

The following formatting issues will be corrected:
- Multiple "header" and "subheader" rows will be removed.
- Some years, such as 2004, 2011 and 2021, have multiple entries because the data is reported under both an "old" and a "new" way of measuring it. In each case, we will remove the "old" entry and keep the "new" one. Moreover, these columns are separated by a "spacer column", which will be removed.
- The summary rows labelled "United Kingdom" and "England" will be removed to avoid double-counting (or triple-counting).
- The data is in a pivot-table format, but does not have pivot-table functionality. The data is more suitably presented using a column for the region, a column for the year, and a third column for the value. We will therefore have to unpivot the table.

These changes will be performed in PowerQuery. A snapshot summary of this process is as follows:

![snapshot summarising the data transformation carried out in PowerQuery](documentation_images/MedianSalaryPowerQuery.png)
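To make the unpivot step concrete, here is a minimal Python sketch of the same reshaping on a tiny made-up table. It is not the Power Query query itself: the function name `unpivot`, the value column name `MedianSalary` for the regional table, and the salary figures are all assumptions for illustration.

```python
def unpivot(wide_rows, id_column="Region"):
    """Turn one row per region (a column per year) into one row per (region, year)."""
    long_rows = []
    for row in wide_rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier column is repeated on each output row
            long_rows.append(
                {id_column: row[id_column], "Year": column, "MedianSalary": value}
            )
    return long_rows


# Made-up wide table: one column per year, as in the source layout.
wide = [
    {"Region": "North East", "2019": 27000, "2020": 27500},
    {"Region": "London", "2019": 38000, "2020": 38500},
]

long_rows = unpivot(wide)
for row in long_rows:
    print(row)
```

Two regions with two year columns each produce four rows, one per (Region, Year) pair.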

We now have a table containing one row for each (Region, Year) pair, along with the median salary value.

Note that the "Year" column is not numerical, and contains issues such as 2006 displaying as 2006*, or 2011 displaying as 2011soc10. Transforming this column by extracting the first four characters resolves this.
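The year clean-up described above amounts to keeping the first four characters of each label. A minimal Python sketch, with the hypothetical helper name `clean_year`, is shown below. The result is deliberately left as text, since data types are handled later in SQL.

```python
def clean_year(label: str) -> str:
    """Keep only the first four characters of a year label, as text."""
    return label.strip()[:4]


# Labels of the kind mentioned above, plus one that is already clean.
raw_labels = ["2006*", "2011soc10", "2015"]
cleaned = [clean_year(label) for label in raw_labels]
print(cleaned)
```

This relies on every label starting with a four-digit year, which is an assumption about the source layout.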

This table is saved as a .csv file called "MedianPayInitialTransform.csv".

### A caveat with using median data

The median of the regional median salaries is NOT the median salary of the whole United Kingdom. In general, the median of the medians of a partition of a dataset need not equal the median of the original dataset.
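A small counterexample makes this concrete. The sketch below uses two made-up "regions" of different sizes (salaries in thousands of pounds). The variable names and figures are hypothetical.

```python
from statistics import median

# Made-up salaries (in thousands of pounds) for two regions of different sizes.
region_a = [20, 21, 22]
region_b = [30, 31, 32, 33, 34, 35, 36]

# Median of the two regional medians.
median_of_medians = median([median(region_a), median(region_b)])

# Median of everyone pooled together.
overall_median = median(region_a + region_b)

print(median_of_medians, overall_median)
```

The regional medians are 21 and 33, giving a median of medians of 27, whereas the median of all ten salaries pooled together is 31.5.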

Therefore, we create another table (a national version) containing only the median salary for each year for the entire United Kingdom. It has two columns, "Year" and "MedianSalary", and is saved as "MedianSalaryNational.csv". It will be used when finding insights and producing visuals at the national level.

![Transformed data showing UK-wide median salary for each year](documentation_images/MedianSalaryUKWide.png)

## CPIH Index - Initial data transformation

The last source we will use is the CPIH Index, with base month July 2015 (i.e. the CPIH value for July 2015 is 100).

This is the most straightforward source to transform. On line 194 of the source file, the data changes from quarterly data to monthly data. We therefore remove rows 1 to 193, and give the columns meaningful names.

![Removing the first 193 rows in PowerQuery, obtaining monthly data](documentation_images/CPIHPowerQuery.png)
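The same "drop the leading rows, then name the columns" step can be sketched in Python on a miniature stand-in for the source file. Everything here is an assumption for illustration: the function name `monthly_rows`, the column name `CPIH`, the layout of the miniature file, and its values, which are not real CPIH figures.

```python
import csv
import io

# Made-up miniature of the source file: one metadata row, two quarterly rows,
# then monthly rows. The values are illustrative only.
raw_text = (
    "Title,Example series\n"
    "2015 Q2,99.9\n"
    "2015 Q3,100.1\n"
    "2015 JUL,100.0\n"
    "2015 AUG,100.3\n"
)


def monthly_rows(text, rows_to_skip):
    """Drop the first rows_to_skip rows and name the two remaining columns."""
    rows = list(csv.reader(io.StringIO(text)))
    return [
        {"MonthYear": label, "CPIH": value}
        for label, value in rows[rows_to_skip:]
    ]


# The real file would use rows_to_skip=193; the miniature only needs 3.
cpih = monthly_rows(raw_text, rows_to_skip=3)
print(cpih)
```

Both columns are left as text here, consistent with deferring data types to SQL. A hard-coded row count is fragile if the source file gains rows, so it is worth re-checking whenever the download is refreshed.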

Note that the MonthYear column is a text column (not a date column). We will deal with this in SQL. 

This table is saved as a .csv file called "CPIHInitialTransform.csv".