# Overview

In this section, we use our model to extract insights from the data, focusing on five key questions:

1. How have house prices changed each year in nominal terms, adjusted for CPIH (inflation), and in relation to the national median salary?

2. How has housing affordability changed over time in each region? Which regions are the most and least affordable in each year?

3. How have the prices of different house types evolved relative to one another over time?

4. What are the regional trends in the total number of housing sales?

5. How are yearly sales distributed across the months, and do any seasonal trends exist?

For each question, we write a query to extract the relevant data and then export the results to Power BI for quick visualisation. The export and visualisation process is demonstrated only once, as we explore Power BI in more detail in the next and final section (when creating a report).

# Q1: How have house prices changed each year in nominal terms, adjusted for CPIH (inflation), and in relation to the national median salary?

First, we join the national housing data, CPIH data and national median salary data to the dates dimension table on the relevant columns. This produces the following:

![output of query joining national housing data, cpih data and national median salary data](documentation_images/joining-national-house-cpi-salary.png)

For this analysis, we make use of the `year`, `average_price`, `index_normed` and `MedianSalary` columns. 

We group by `year` and take the average (mean) of `average_price`, `average_price`/`index_normed` (i.e. the CPIH-adjusted average price), and `average_price`/`MedianSalary` (i.e. the average price to median salary ratio).

Note that this method of taking the mean is arguably slightly inaccurate, because we do not weight by monthly sales volume. However, sales volume is not available until 2005 and, moreover, the sales volumes are relatively similar year on year, so using a "simple mean" (i.e. an unweighted average) will not affect the general trend.

```sql
with HouseCPIHSalaryNationalData as (
    select
        d.year,
        h.average_price,
        c.index_normed,
        ms.MedianSalary
    from datesdim d
    join housepricesnational h on d.date = h.date
    join cpih c on d.month = c.month and d.year = c.year
    join mediansalarynational ms on d.year = ms.year
)
select
    year,
    avg(average_price) as average_price,
    round(avg(average_price / index_normed),2) as average_price_cpih_adjusted,
    round(avg(average_price / MedianSalary),2) as average_price_to_median_salary_ratio
from HouseCPIHSalaryNationalData
group by year
order by year;
```

Output:

![output of query answering Q1](documentation_images/Q1SQLOutput.png)
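
The earlier note on weighting can be checked directly for the years in which sales volume is recorded. The following is a minimal sketch, not part of the original analysis: it assumes `housepricesnational.sales_volume` (the column used in Q5) is null for months without volume data, and the output column names are illustrative.

```sql
-- Sketch: compare the unweighted yearly mean with a sales-volume-weighted mean.
-- Months without a recorded sales volume are excluded from both averages.
select
    d.year,
    round(avg(h.average_price), 2) as simple_mean_price,
    round(sum(h.average_price * h.sales_volume) / sum(h.sales_volume), 2) as volume_weighted_mean_price
from datesdim d
join housepricesnational h on d.date = h.date
where h.sales_volume is not null
group by d.year
order by d.year;
```

If the two columns stay close to each other in every year, the simple mean is a reasonable stand-in for the weighted one.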

## Creating a visual with the output of the SQL query

To create a visual with the information from the output, we export the output and load it into Power BI. 

This process is shown only for this question (for Q2-Q5, we just give the output), as there is little to do in Power BI other than drag and drop fields, and we discuss Power BI further in the next section when we create a report.

We use the `Export` button in MySQL to export the output as a CSV file:

![The export recordset button](documentation_images/ExportRecordSet.png)


In Power BI, start by selecting Get data > Text/CSV to load the dataset. Once imported, the data will be available in the Data pane.

![loading the data in PowerBI](documentation_images/LoadCSVPowerBI.png)

Next, create a line chart, adding the average house price and CPIH-adjusted average house price to the y-axis section. Note that we place the average price-to-median salary ratio on a secondary y-axis, as it has a different scale.

We then modify the y-axis ranges so that both lines begin at the same height, allowing for an easy visual comparison.

Then, line thickness is adjusted, and the average price line is changed to a dashed line, as the primary focus of this visual is the comparison between the CPIH adjusted price and the price-to-salary ratio. 

Finally, the title, axes labels, and legend labels are changed to ensure the chart is clear and informative. Data labels for key values are also added.

![formatting the visual](documentation_images/VisualFormattingPowerBI.png)

The result is the following visualisation:

![line chart comparing the average house price, CPIH-adjusted average house price and price-to-median salary ratio by year](documentation_images/Q1PowerBIVisual.png)

# Q2: How has housing affordability changed over time in each region? Which regions are the most and least affordable in each year?

To answer this question, we will use the `housepricesregional` and `mediansalaryregional` tables. 

We first perform a cross join with the DatesDim and RegionsDim tables, creating a dimension table with (date, region) pairs. 

Then, we (inner) join the `housepricesregional` table on the date and region columns, and also (inner) join the `mediansalaryregional` table on the region and year columns. (Recall that house prices are recorded monthly and median salaries are recorded annually.)

The query and output are as follows:

```sql
with date_region_pairs as (
    select d.date, d.year, r.region
    from datesdim d
    cross join regionsdim r
)
select dr.date, dr.year, dr.region, h.average_price, m.salary
from date_region_pairs dr
inner join housepricesregional h on dr.date = h.date and dr.region = h.region
inner join mediansalaryregional m on dr.region = m.region and dr.year = m.year;
```

Output:

![output of query finding average house price and median salary per (date, region) pair](documentation_images/Q2IntermediateSQLOutput.png)
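
Because prices are monthly and salaries are annual, each (year, region) pair should match at most 12 rows after the joins. A quick sanity check is sketched below, under the assumptions that `datesdim` holds one row per month and `mediansalaryregional` holds one row per (region, year). It lists any pair with more than 12 rows, which would point to duplicate rows in one of the tables.

```sql
-- Sketch: flag (year, region) pairs that match more than 12 monthly rows.
select d.year, h.region, count(*) as months_matched
from datesdim d
inner join housepricesregional h on d.date = h.date
inner join mediansalaryregional m on h.region = m.region and d.year = m.year
group by d.year, h.region
having count(*) > 12
order by d.year, h.region;
```

An empty result means the joins did not duplicate any rows.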

Next, we wrap the select query in a CTE (we drop `dr.date`, which was included only for clarity in the output above and is not needed). We then calculate the "affordability index" (`average_price` / `salary`) and average it over each `year` and `region`; a higher value means housing is less affordable. We export the output of this query and create a visual in Power BI (which we won't walk through, as the workflow is very similar to that in Q1).

Once again, we opt for a simple average rather than one weighted by monthly sales volume, for much the same reasons as we did in Q1. 

The query, output and visualisation are as follows:

```sql
with date_region_pairs as (
    select d.date, d.year, r.region
    from datesdim d
    cross join regionsdim r
),
year_region_price_salary as (
select dr.year, dr.region, h.average_price, m.salary
from date_region_pairs dr
inner join housepricesregional h on dr.date = h.date and dr.region = h.region
inner join mediansalaryregional m on dr.region = m.region and dr.year = m.year
)
select year, region, avg(average_price / salary) as affordability_index 
from year_region_price_salary 
group by year, region;
```

Output:

![affordability trends SQL output](documentation_images/Q2AffordabilitySQLOutput.png)

Visualisation:

![affordability trends visualisation, made in PowerBI](documentation_images/Q2PowerBIVisual.png)

To answer the second part of the question (namely, which regions are the least and most affordable in each year), we wrap the select statement from the previous query in a CTE and apply the window function `rank()`, partitioned by year and ordered by the affordability index, so that rank 1 is the most affordable region in each year. The query and output are as follows:

```sql
with date_region_pairs as (
    select d.date, d.year, r.region
    from datesdim d
    cross join regionsdim r
),
year_region_price_salary as (
select dr.year, dr.region, h.average_price, m.salary
from date_region_pairs dr
inner join housepricesregional h on dr.date = h.date and dr.region = h.region
inner join mediansalaryregional m on dr.region = m.region and dr.year = m.year
),
affordability_by_year_region as (
select year, region, avg(average_price / salary) as affordability_index 
from year_region_price_salary 
group by year, region
),
ranked_affordability as (
select year,region, rank() over (partition by year order by affordability_index) as affordability_rank
from affordability_by_year_region
)
-- Note: rank 12 assumes exactly 12 regions per year, with no ties in the affordability index
select year, region, if(affordability_rank = 1, 'Most affordable', 'Least affordable') as Affordability
from ranked_affordability
where affordability_rank = 1 or affordability_rank = 12;
```
```

Output:

![most and least affordable region for each year](documentation_images/Q2SQLOutputPart2.png)
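
The query above picks out the least affordable region with a fixed rank of 12, which relies on the number of regions. A variant that avoids this is to rank in both directions and keep rank 1 from each. The sketch below illustrates the idea on hypothetical sample rows (the region names and index values are made up) standing in for the `affordability_by_year_region` CTE.

```sql
-- Hypothetical sample rows standing in for the affordability_by_year_region CTE
with affordability_by_year_region as (
    select 2020 as year, 'Region A' as region, 5.2 as affordability_index
    union all select 2020, 'Region B', 8.1
    union all select 2020, 'Region C', 12.4
    union all select 2021, 'Region A', 5.6
    union all select 2021, 'Region B', 8.9
    union all select 2021, 'Region C', 12.0
),
ranked_affordability as (
    select year, region,
           -- rank 1 ascending = lowest price-to-salary ratio (most affordable)
           rank() over (partition by year order by affordability_index asc) as rank_most,
           -- rank 1 descending = highest price-to-salary ratio (least affordable)
           rank() over (partition by year order by affordability_index desc) as rank_least
    from affordability_by_year_region
)
select year, region,
       if(rank_most = 1, 'Most affordable', 'Least affordable') as Affordability
from ranked_affordability
where rank_most = 1 or rank_least = 1
order by year, Affordability desc;
```

On the sample rows, this returns Region A as the most affordable and Region C as the least affordable in both years, regardless of how many regions are present.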

This output would benefit greatly from a pivot on the `Affordability` column. We do this in Power BI via a matrix visualisation. This gives the following:

![matrix visualisation showing the least and most affordable region in each year](documentation_images/Q2PowerBIVisualPart2.png)
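
The same pivot could also be done in SQL with conditional aggregation, should the result be needed outside Power BI. The following is a minimal sketch on hypothetical sample rows (made-up region names) standing in for the output of the previous query.

```sql
-- Hypothetical sample rows standing in for the most/least affordable output
with most_least as (
    select 2020 as year, 'Region A' as region, 'Most affordable' as Affordability
    union all select 2020, 'Region C', 'Least affordable'
    union all select 2021, 'Region A', 'Most affordable'
    union all select 2021, 'Region C', 'Least affordable'
)
select year,
       -- each case expression is null except on the matching row, so max() picks that region
       max(case when Affordability = 'Most affordable' then region end) as most_affordable,
       max(case when Affordability = 'Least affordable' then region end) as least_affordable
from most_least
group by year
order by year;
```

This gives one row per year with two region columns. If two regions tie for a label, `max()` keeps only the alphabetically last one.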

# Q3: How have the prices of different house types evolved relative to one another over time?

For this question, we return to the national house price data in the table, `housepricesnational`. 

We will perform an (inner) join between this table and the `DatesDim` table.

We then group the results by year and calculate the average (mean) sale price for each housing type.

Finally, we filter out years in which any house type has no average price (i.e. the yearly average is null).

Once again, we use a simple average (i.e. we don't weight based on monthly sales), just as we did when answering Q1 and Q2. Moreover, we do not have a breakdown for sales volume for each house type, so even if we wanted to perform a weighted average, we couldn't. Therefore, the simple average will do.

The query and output are as follows:

```sql
select d.year, 
	   avg(h.detached_avg) as detached, 
	   avg(h.semi_avg) as semidetached, 
	   avg(h.terraced_avg) as terraced, 
	   avg(h.flat_avg) as flat
from datesdim d 
join housepricesnational h on d.date = h.date
group by d.year
having avg(h.detached_avg) is not null 
   and avg(h.semi_avg) is not null 
   and avg(h.terraced_avg) is not null 
   and avg(h.flat_avg) is not null;
```

![average price by type for each year](documentation_images/Q3SQLOutput.png)

A suitable visualisation for this data is a 100% stacked area chart, as it allows a clear comparison of the average prices relative to each other. The visualisation is as follows:

![100% stacked area chart showing the distribution of average prices of different types of houses from 2005 to 2024](documentation_images/Q3PowerBIVisual.png)
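
A 100% stacked area chart plots each house type's share of the sum of the four yearly averages. As a rough sketch (the `_pct` column names are illustrative and not part of the original analysis), the same shares could be computed in SQL as follows:

```sql
-- Sketch: each house type's share of the sum of the four yearly average prices.
with yearly_type_prices as (
    select d.year,
           avg(h.detached_avg) as detached,
           avg(h.semi_avg) as semidetached,
           avg(h.terraced_avg) as terraced,
           avg(h.flat_avg) as flat
    from datesdim d
    join housepricesnational h on d.date = h.date
    group by d.year
)
select year,
       round(100 * detached / (detached + semidetached + terraced + flat), 1) as detached_pct,
       round(100 * semidetached / (detached + semidetached + terraced + flat), 1) as semidetached_pct,
       round(100 * terraced / (detached + semidetached + terraced + flat), 1) as terraced_pct,
       round(100 * flat / (detached + semidetached + terraced + flat), 1) as flat_pct
from yearly_type_prices
-- the sum is null whenever any of the four averages is null, so such years are skipped
where (detached + semidetached + terraced + flat) is not null
order by year;
```

A share that rises over time indicates that the house type has become more expensive relative to the others.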

# Q4: What are the regional trends in the total number of housing sales?

For this question, we (inner) join the `DatesDim` table and the `housepricesregional` table on the `date` column, and select the columns `DatesDim.year`, `region` and `sales_volume`.

We then group by `year` and `region`, and find the sum of `sales_volume`. 

The query and output are as follows:

```sql
select d.year, h.region, sum(h.sales_volume) as total_sales_volume
from datesdim d
join housepricesregional h on d.date = h.date
group by d.year, h.region
having sum(h.sales_volume) is not null;
```

Output:

![output of query to find the total sales volume by region and year](documentation_images/Q4SQLOutput.png)

For the visualisation, we create a stacked area chart in Power BI. It is as follows:

![stacked area chart showing the total number of annual housing sales per region](documentation_images/Q4PowerBIVisual.png)

# Q5: How are yearly sales distributed across the months, and do any seasonal trends exist?

For our last question (and perhaps the easiest to query), we (inner) join the `DatesDim` dimension table to the `housepricesnational` table and simply select the `year`, `month` and `sales_volume` columns. 

As part of the query, we create a column with the abbreviated month name. This is for the sake of the associated visualisation. We keep the `month` column around (containing numbers from 1 to 12) as this can be used as a "helper sort column" to ensure the abbreviated month names are sorted chronologically (as opposed to alphabetically, which is the standard sorting order for text-based columns).

Moreover, December 2024 is missing data, and as we are looking at monthly distributions over each year, we must exclude the 2024 data.

The query and output are as follows:

```sql
select d.year, 
d.month, 
date_format(date(concat('2000-', d.month, '-01')), '%b') AS month_name_abbreved, 
h.sales_volume
from datesdim d
join housepricesnational h on d.date = h.date
where h.sales_volume is not null and d.year < 2024;
```

Output:

![number of house sales per (month,year) nationally](documentation_images/Q5SQLOutput.png)

Once again, the visual of choice in Power BI is a stacked area chart. It is as follows:



![stacked area chart showing the trend of total number of housing sales per month](documentation_images/Q5PowerBIVisual.png)
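
To put the seasonal pattern into numbers, each month's share of its year's sales can be computed with a window function. The following is a minimal sketch, assuming every year kept by the filter has all 12 months recorded; `pct_of_year_sales` is an illustrative column name.

```sql
-- Sketch: each month's percentage of that year's total national sales.
select d.year,
       d.month,
       h.sales_volume,
       round(100 * h.sales_volume / sum(h.sales_volume) over (partition by d.year), 1) as pct_of_year_sales
from datesdim d
join housepricesnational h on d.date = h.date
where h.sales_volume is not null and d.year < 2024
order by d.year, d.month;
```

With perfectly even sales, every month would sit near 8.3% (one twelfth), so months that are consistently above or below that level across years suggest a seasonal effect.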