# Data Science and Programming
# Week 6


# Table of contents
* [Introduction](#Introduction)
 * [Problem](#Problem)
 * [Importing the libraries and data](#Importing-the-libraries-and-data)
* [Exploring the data](#Exploring-the-data)
 * [Time series graphs](#Time-series-graphs)
 * [Checkpoint 1](#Checkpoint-1)
 * [Exploring a smaller subset of the data](#Exploring-a-smaller-subset-of-the-data)
* [Communicating the result](#Communicating-the-result)
 * [Checkpoint 2](#Checkpoint-2)

# Introduction

This activity uses the Seaborn library in Python to plot _time series_ so you can explore how measures change over time. More information about Seaborn is available at: https://seaborn.pydata.org/

The activity uses data from the MEI large data set (number 5), which covers London boroughs and national regions between 2003 and 2019.

Note the data has been reformatted slightly compared to the published version, to put years into rows rather than columns.
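
To illustrate what putting years into rows means, here is a minimal sketch using `melt()` in pandas. The table, area names and values are made up for illustration, and this is not necessarily how the file used in this activity was prepared.

```python
import pandas as pd

# Hypothetical wide-format table: one row per area, one column per year.
# The area names and values are made up for illustration.
wide = pd.DataFrame({
    'Area': ['Area A', 'Area B'],
    '2018': [100, 200],
    '2019': [110, 190],
})

# Stack the year columns into rows: one row per (Area, Year) pair
stacked = wide.melt(id_vars='Area', var_name='Year', value_name='Value')

# The year labels were column names (text), so convert them to integers
stacked['Year'] = stacked['Year'].astype(int)

print(stacked)
```

With years in rows, `Year` becomes a single column that can be used directly as the $x$-axis of a time series.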

## Problem

***How have income and house prices changed in different areas over time?***

## Importing the libraries and data

> Run the code in the boxes below to import the `pandas` and `seaborn` libraries and create the `borough_data` data set.

In [None]:
import pandas as pd
import seaborn as sns

In [None]:
borough_data = pd.read_csv('MEI LDS 5 stacked years.csv')
# check the data
borough_data

# Exploring the data

> Run the code below to check the data types.

In [None]:
borough_data.info()

You can make it easier to explore the data for the national regions by taking a copy of the rows for which `Region` is equal to `'National'`.

> Run the code below to create a new data set with just the national regions.

In [None]:
# create a slice of the data where Region = National
national_data = borough_data[borough_data['Region'] == 'National'].copy()
# check the data
national_data
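
The filter above works in two steps: the comparison produces a column of `True`/`False` values, and the square brackets keep only the rows marked `True`. The sketch below shows this on a small made-up table (the areas and the `Value` column are hypothetical), and shows that changes made to a `.copy()` do not affect the original data.

```python
import pandas as pd

# Made-up data; the real data set has many more rows and columns
df = pd.DataFrame({
    'Area': ['Area A', 'Area B', 'Area C'],
    'Region': ['National', 'Inner', 'National'],
    'Value': [1, 2, 3],
})

# Boolean mask: True for rows where Region is 'National'
mask = df['Region'] == 'National'

# Keep only the True rows and copy them into a new, independent data set
national = df[mask].copy()

# Changing the copy leaves the original data untouched
national['Value'] = national['Value'] * 10

print(mask.tolist())
print(national)
print(df)
```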

## Time-series graphs

You can see how median house prices have changed over time using a _time-series_ graph. In *Seaborn* a time series can be plotted using `relplot()` with `kind='line'`.

> Run the code below to create a time series of median house prices in the various national regions.

In [None]:
# time series of median house prices by national region
# lw=3 - thicker lines
sns.relplot(data=national_data, kind='line', x='Year', y='MEDIAN HOUSE PRICE (£)', hue='Area', lw=3, aspect=2);

> Add and run code below to create a time series of median income for the national regions.
>
> *Hint: You can copy the column name from the output of `info` above.*

In [None]:
# time series of median income by national region


It is often preferable for the $y$-axis on a graph to start at zero, to avoid misleading the viewer by making small differences look bigger.
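
A short calculation shows how much a truncated axis can exaggerate a difference. The two values below are made up, and `drawn_height_ratio` is a hypothetical helper written for this sketch: it compares how high two points sit above the bottom of the graph.

```python
# Two made-up values that differ by 5%
low, high = 200_000, 210_000

def drawn_height_ratio(axis_start):
    """How many times higher 'high' is drawn than 'low' when the axis starts at axis_start."""
    return (high - axis_start) / (low - axis_start)

# Axis starting at zero: the drawn heights match the real ratio
ratio_from_zero = drawn_height_ratio(0)

# Axis starting just below the smaller value: the gap is exaggerated
ratio_truncated = drawn_height_ratio(195_000)

print(ratio_from_zero)
print(ratio_truncated)
```

With the axis starting at zero, the larger value is drawn 1.05 times as high as the smaller one, which matches the real 5% difference. With the axis starting at 195,000, it is drawn 3 times as high.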

> Run the code below to plot a graph where the $y$-axis starts at 0.

In [None]:
# define fig as a time series of median house prices with a y-axis that starts at zero
fig = sns.relplot(data=national_data, kind='line', x='Year', y='MEDIAN HOUSE PRICE (£)', hue='Area', lw=3, aspect=2);

# set the y-axis of fig to start at zero, and let seaborn decide the upper limit
fig.set(ylim=(0, None));

> Adapt the code for your time series for median income so that the $y$-axis starts at zero.

In [None]:
# time series of median income with a y-axis that starts at zero


## Checkpoint 1

> * Describe how median house prices and median incomes have changed over time.
> * How do the changes in London compare to the rest of the country?
> * Look at the shape of the graph around 2008. What might explain this?

## Exploring a smaller subset of the data

A time series showing every inner London borough has too many lines to be readable.

> Run the code below to create a time series for median house prices in inner London boroughs.

In [None]:
# time series of median house prices for inner London boroughs
# lw=3 - thicker lines
sns.relplot(data=borough_data[borough_data['Region'] == 'Inner'], kind='line', x='Year', y='MEDIAN HOUSE PRICE (£)', hue='Area', lw=3, aspect=2);

Notice that: 
* the scale on the $y$-axis is now in millions (`1e6` is shorthand for the standard form notation $1 \times 10^6$);
* there are too many boroughs to distinguish the different colours.
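
Python accepts the same `1e6` notation in code. The sketch below uses a made-up price to show how a value read from an axis labelled `1e6` converts to an ordinary number.

```python
# 1e6 is Python's way of writing 1 × 10^6
million = 1e6
print(million == 1_000_000)

# A reading of 1.5 on an axis labelled 1e6 means 1.5 × 10^6
price = 1.5 * 1e6

# Format the value with thousands separators and no decimal places
formatted = f'£{price:,.0f}'
print(formatted)
```

The first line printed is `True` and the second is `£1,500,000`.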

> Run the code below to: 
> * create a subset of the data with just a few boroughs;
> * plot a time series for this subset.

In [None]:
# filter the data to just Westminster, Newham and Hackney
selected_data = borough_data[(borough_data['Area'] == 'Westminster') 
                             | (borough_data['Area'] == 'Newham')
                             | (borough_data['Area'] == 'Hackney')].copy()

# check the data
selected_data
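
As the list of boroughs grows, chaining comparisons with `|` becomes long. One alternative, shown here as a sketch on a small made-up table rather than the real data set, is the pandas `isin()` method, which checks each `Area` against a list of names.

```python
import pandas as pd

# Made-up rows using the same column names as the activity's data set
df = pd.DataFrame({
    'Area': ['Westminster', 'Newham', 'Hackney', 'Camden'],
    'Year': [2019, 2019, 2019, 2019],
})

boroughs = ['Westminster', 'Newham', 'Hackney']

# isin() is True where Area matches any name in the list
selected_isin = df[df['Area'].isin(boroughs)].copy()

# The same filter written with == and | for comparison
selected_or = df[(df['Area'] == 'Westminster')
                 | (df['Area'] == 'Newham')
                 | (df['Area'] == 'Hackney')].copy()

# Both approaches select the same rows
print(selected_isin.equals(selected_or))
```

Either form gives the same subset, so you can use whichever you find easier to read.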

In [None]:
# time series of median house price for selected boroughs
sns.relplot(data=selected_data, kind='line', x='Year', y='MEDIAN HOUSE PRICE (£)', hue='Area', lw=3, aspect=2);

> Choose three more boroughs and add code below to:
> * create a subset of the data for just those boroughs;
> * plot a time series of median house prices for those boroughs;
> * plot a time series of median income for those boroughs.

In [None]:
# filter the data to just three boroughs

# check the data


In [None]:
# time series of median house price for selected boroughs


In [None]:
# time series of median income for selected boroughs


# Communicating the result
## Checkpoint 2

> Use your analysis to answer the original problem: 
> ***How have income and house prices changed in different areas over time?***