# Initial data read
Notebook to begin exploring the UK air pollution data made available by the Department for Environment, Food and Rural Affairs (Defra): [UK Air Information Resource](https://uk-air.defra.gov.uk/).  
&copy; Crown 2020 copyright Defra via uk-air.defra.gov.uk, licensed under the [Open Government Licence (OGL)](http://www.nationalarchives.gov.uk/doc/open-government-licence/version/2/).

First, we want to access some air pollution readings for Oxford, UK.

In [1]:
import sys
import time
from pathlib import Path
import pandas as pd
import numpy as np
import altair as alt

# auto-formatting of notebook cell code
%load_ext lab_black

repo_location = (
    "\\Users\\Simon\\Documents\\Python_Projects\\Productivity Boiz\\air-pollution"
)

sys.path.append(repo_location)
top_level_dir = Path(repo_location)
interim_data_path = top_level_dir / "data/interim"

# if not interim_data_path.is_dir():
#     interim_data_path.mkdir()
#
# save_location = interim_data_path / "oxford_ebbes_full.csv"
# df.to_csv(save_location, index=False)
# print(f"Full historic air pollution data for {} ({time_range}) saved to: {save_location}")

# tracks version numbers used to run notebook
%load_ext watermark
%watermark -u -n -t -iv -v -g -a Simon-Lee-UK

altair 4.0.1
pandas 1.0.3
numpy  1.18.2
Simon-Lee-UK 
last updated: Fri Mar 27 2020 20:58:25 

CPython 3.7.3
IPython 7.13.0
Git hash: 071babb14d8d2775795d98e03b5c40316e8a22f1


## Data info
All data are timestamped in GMT, hour ending (each reading is labelled with the end of its hour)  
Status: R = Ratified / P = Provisional / P* = As supplied  
Site: Oxford St Ebbes  
  
### Raw columns:
`Date` - format: 'dd-mm-yyyy'  
`time` - 24 hour, e.g. '19:00'  
`PM<sub>10</sub> particulate matter (Hourly measured)` - $PM_{10}$ particulate matter with diameter $< 10 \;\mu m$  
`status` - ratified / provisional / as supplied  
`unit` - $\mu g \:/\: m^{3}$ (FIDAS)  
`Nitric oxide` - 'NO'  
`status.1` - ratified / provisional / as supplied  
`unit.1` - $\mu g \:/\: m^{3}$  
`Nitrogen dioxide` - 'NO$_2$'  
`status.2` - ratified / provisional / as supplied  
`unit.2` - $\mu g \:/\: m^{3}$  
`Nitrogen oxides as nitrogen dioxide` - 'NO$_x$ as NO$_2$'  
`status.3` - ratified / provisional / as supplied  
`unit.3` - $\mu g \:/\: m^{3}$  
`PM<sub>2.5</sub> particulate matter (Hourly measured)` - $PM_{2.5}$ particulate matter with diameter $< 2.5 \;\mu m$    
`status.4` - ratified / provisional / as supplied  
`unit.4` - $\mu g \:/\: m^{3}$ (Ref.eq)
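The numbered `status.N` / `unit.N` names in the list above are not in the raw file: they appear because every pollutant is followed by columns literally called `status` and `unit`, and `pd.read_csv` de-duplicates repeated headers by appending a suffix. The suffix therefore depends on column position, not on the pollutant. The sketch below shows this on a small made-up CSV, and a hypothetical helper, `label_by_pollutant` (not part of this notebook), that labels each pair by the measurement column preceding it. It assumes each `status`/`unit` pair directly follows its measurement column, as the list above suggests.

```python
import io
import re

import pandas as pd

# Made-up two-pollutant sample in the same layout as the raw columns listed above.
raw = io.StringIO(
    "Date,time,Nitric oxide,status,unit,Nitrogen dioxide,status,unit\n"
    "01-01-2008,01:00,3,R,ugm-3,20,R,ugm-3\n"
)
sample = pd.read_csv(raw)
mangled = sample.columns.tolist()  # repeated names gain ".1", ".2", ... suffixes


def label_by_pollutant(columns):
    """Hypothetical helper: rename status/unit columns after the preceding pollutant."""
    labelled, pollutant = [], None
    for name in columns:
        match = re.fullmatch(r"(status|unit)(\.\d+)?", name)
        if match and pollutant is not None:
            labelled.append(f"{match.group(1)}_{pollutant}")
            continue
        if not match and name not in ("Date", "time"):
            pollutant = name  # remember the most recent measurement column
        labelled.append(name)
    return labelled


relabelled = label_by_pollutant(mangled)
print(mangled)
print(relabelled)
```

Labelling by the preceding column keeps the pairing stable even if a given year's file contains extra or reordered pollutants.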

In [2]:
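def get_single_year(year):
    """Download one calendar year of readings for site OX8 as a DataFrame."""
    data_url = f"https://uk-air.defra.gov.uk/data_files/site_data/OX8_{year}.csv"
    # header=4: use row 4 (zero-based) as the column names and skip the rows above it
    single_year = pd.read_csv(data_url, header=4)
    return single_year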


def column_conversion(raw_data):
    """Rename the raw columns to snake_case and merge the time into the date column."""
    column_dict = {
        "Date": "date",
        "PM<sub>10</sub> particulate matter (Hourly measured)": "pm_10",
        "status": "status_pm_10",
        "unit": "unit_pm_10",
        "Nitric oxide": "nitric_oxide",
        "status.1": "status_nitric_oxide",
        "unit.1": "unit_nitric_oxide",
        "Nitrogen dioxide": "nitrogen_dioxide",
        "status.2": "status_nitrogen_dioxide",
        "unit.2": "unit_nitrogen_dioxide",
        "Nitrogen oxides as nitrogen dioxide": "NO2_eq",
        "status.3": "status_NO2_eq",
        "unit.3": "unit_NO2_eq",
        "PM<sub>2.5</sub> particulate matter (Hourly measured)": "pm_2_5",
        "status.4": "status_pm_2_5",
        "unit.4": "unit_pm_2_5",
    }

    converted_columns = raw_data.rename(columns=column_dict)
    converted_columns = extend_date_with_time(converted_columns)
    return converted_columns


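def extend_date_with_time(raw_data):
    """Append the `time` column to `date`, giving strings such as '01-01-2008 19:00'."""
    extended_date = raw_data.copy()
    # "24:00" cannot be parsed with %H:%M, so relabel it as "00:00".
    # Caveat: the date is not advanced, so an hour ending at 24:00 is stamped
    # at 00:00 of the same day rather than at the start of the next day.
    extended_date.time = extended_date.time.replace("24:00", "00:00")
    extended_date.date = extended_date.date + " " + extended_date.time
    return extended_date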


def datetime_conversion(raw_data, target_column="date", date_format="%d-%m-%Y %H:%M"):
    """Convert `target_column` from strings to pandas datetimes."""
    converted_datetime = raw_data.copy()
    try:
        converted_datetime[target_column] = pd.to_datetime(
            converted_datetime[target_column], format=date_format
        )
    except ValueError:
        # fall back to inferred parsing; dayfirst=True keeps dd-mm-yyyy dates
        # from being read as mm-dd-yyyy
        converted_datetime[target_column] = pd.to_datetime(
            converted_datetime[target_column], dayfirst=True
        )

    return converted_datetime
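`extend_date_with_time` maps "24:00" to "00:00" without changing the date, so a reading for the hour ending at midnight lands at the start of its own day instead of the start of the next one. A minimal sketch of one way to handle this is given below. The function `parse_hour_ending` is hypothetical (not part of this notebook) and assumes the raw `Date` and `time` columns hold strings in the formats listed under "Raw columns".

```python
import pandas as pd


def parse_hour_ending(dates, times, date_format="%d-%m-%Y"):
    """Hypothetical helper: build timestamps, rolling '24:00' into the next day."""
    is_midnight = times == "24:00"
    # Swap "24:00" for "00:00" so the string can be parsed with %H:%M.
    clock = times.where(~is_midnight, "00:00")
    stamps = pd.to_datetime(dates + " " + clock, format=f"{date_format} %H:%M")
    # Add one day only to the rows that were originally "24:00".
    return stamps + pd.to_timedelta(is_midnight.astype(int), unit="D")


dates = pd.Series(["31-12-2008", "01-01-2009"])
times = pd.Series(["24:00", "01:00"])
stamps = parse_hour_ending(dates, times)
print(stamps)
```

With this approach the last reading of 31 December 2008 is stamped 2009-01-01 00:00, one hour before the 01:00 reading that follows it.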

Data are available back to 2008: 'https://uk-air.defra.gov.uk/data_files/site_data/OX8_2008.csv'  
The csv file names follow a consistent pattern (`OX8_<year>.csv`), so we can loop over the years and save a single, fully appended version of the data.
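As a small illustration of the file-name pattern, the sketch below builds the list of yearly URLs without making any requests. The helper name `yearly_csv_urls` and its `site_code` parameter are assumptions for illustration; only the `OX8` URL pattern comes from this notebook.

```python
def yearly_csv_urls(site_code, start_year, end_year):
    """Hypothetical helper: one CSV URL per year, inclusive of both end points."""
    base = "https://uk-air.defra.gov.uk/data_files/site_data"
    return [
        f"{base}/{site_code}_{year}.csv" for year in range(start_year, end_year + 1)
    ]


urls = yearly_csv_urls("OX8", 2008, 2020)
print(len(urls))
print(urls[0])
```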

In [3]:
start_year = 2008
end_year = 2020
years_of_interest = list(np.arange(start_year, end_year + 1))

In [4]:
inspect_columns = pd.DataFrame(
    {
        "Data (Year)": ["blank"],
        "Date": [False],
        "time": [False],
        "Nitric oxide": [False],
        "Nitrogen dioxide": [False],
        "Nitrogen oxides as nitrogen dioxide": [False],
        "PM<sub>10</sub> particulate matter (Hourly measured)": [False],
        "PM<sub>2.5</sub> particulate matter (Hourly measured)": [False],
        "Volatile PM<sub>10</sub> (Hourly measured)": [False],
        "Volatile PM<sub>2.5</sub> (Hourly measured)": [False],
        "Volatile PM2.5 (Hourly measured)": [False],
        "Non-volatile PM<sub>10</sub> (Hourly measured)": [False],
        "Non-volatile PM<sub>2.5</sub> (Hourly measured)": [False],
    }
)
inspect_columns = inspect_columns.append(
    [inspect_columns] * (len(years_of_interest) - 1), ignore_index=True
)
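The same kind of all-`False` table can also be built in a single step, which avoids repeating a one-row frame with `append`. This is only an alternative sketch: `columns_to_check` is shortened here for readability and `presence` is a hypothetical name.

```python
import pandas as pd

years = list(range(2008, 2021))
# Shortened, illustrative subset of the column names checked in this notebook.
columns_to_check = ["Date", "time", "Nitric oxide"]

# A scalar plus an explicit index and columns fills the whole frame with False.
presence = pd.DataFrame(False, index=range(len(years)), columns=columns_to_check)
presence.insert(0, "Data (Year)", years)
print(presence.head())
```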

In [5]:
inspect_columns

Unnamed: 0,Data (Year),Date,time,Nitric oxide,Nitrogen dioxide,Nitrogen oxides as nitrogen dioxide,PM<sub>10</sub> particulate matter (Hourly measured),PM<sub>2.5</sub> particulate matter (Hourly measured),Volatile PM<sub>10</sub> (Hourly measured),Volatile PM<sub>2.5</sub> (Hourly measured),Volatile PM2.5 (Hourly measured),Non-volatile PM<sub>10</sub> (Hourly measured),Non-volatile PM<sub>2.5</sub> (Hourly measured)
0,blank,False,False,False,False,False,False,False,False,False,False,False,False
1,blank,False,False,False,False,False,False,False,False,False,False,False,False
2,blank,False,False,False,False,False,False,False,False,False,False,False,False
3,blank,False,False,False,False,False,False,False,False,False,False,False,False
4,blank,False,False,False,False,False,False,False,False,False,False,False,False
5,blank,False,False,False,False,False,False,False,False,False,False,False,False
6,blank,False,False,False,False,False,False,False,False,False,False,False,False
7,blank,False,False,False,False,False,False,False,False,False,False,False,False
8,blank,False,False,False,False,False,False,False,False,False,False,False,False
9,blank,False,False,False,False,False,False,False,False,False,False,False,False


In [6]:
query_columns = inspect_columns.columns.tolist()

for idx, indv_year in enumerate(years_of_interest):
    single_year = get_single_year(year=indv_year)

    single_year_cols = single_year.columns.tolist()
    inspect_columns.loc[idx, "Data (Year)"] = indv_year
    for col in query_columns:
        if col in single_year_cols:
            inspect_columns.loc[idx, col] = True

    processed_year = single_year.pipe(column_conversion).pipe(datetime_conversion)
    if idx == 0:
        air_pollution = processed_year.copy()
    else:
        air_pollution = air_pollution.append(processed_year, ignore_index=True)
    time.sleep(1.5)  # creates interval between requests to uk-air.defra.gov.uk

inspect_columns

Unnamed: 0,Data (Year),Date,time,Nitric oxide,Nitrogen dioxide,Nitrogen oxides as nitrogen dioxide,PM<sub>10</sub> particulate matter (Hourly measured),PM<sub>2.5</sub> particulate matter (Hourly measured),Volatile PM<sub>10</sub> (Hourly measured),Volatile PM<sub>2.5</sub> (Hourly measured),Volatile PM2.5 (Hourly measured),Non-volatile PM<sub>10</sub> (Hourly measured),Non-volatile PM<sub>2.5</sub> (Hourly measured)
0,2008,True,True,True,True,True,True,True,False,False,True,False,True
1,2009,True,True,True,True,True,True,True,True,False,True,True,True
2,2010,True,True,True,True,True,True,True,True,False,True,True,True
3,2011,True,True,True,True,True,True,True,True,True,False,True,True
4,2012,True,True,True,True,True,True,True,True,True,False,True,True
5,2013,True,True,True,True,True,True,True,True,True,False,True,True
6,2014,True,True,True,True,True,True,True,True,True,False,True,True
7,2015,True,True,True,True,True,True,True,True,True,False,True,True
8,2016,True,True,True,True,True,True,True,True,True,False,True,True
9,2017,True,True,True,True,True,True,True,True,True,False,True,True
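Because the yearly files do not all share the same columns, appending them yields the union of the columns, with missing values wherever a year lacks a column. The sketch below shows this on two made-up frames, collecting the frames in a list and combining them once with `pd.concat`, as an alternative to calling `append` inside the loop. The frames and the helper name `combine_years` are illustrative only.

```python
import pandas as pd


def combine_years(frames):
    """Hypothetical helper: stack yearly frames, keeping the union of their columns."""
    return pd.concat(frames, ignore_index=True, sort=False)


# Two made-up "years" with different pollutant columns.
year_a = pd.DataFrame({"date": ["01-01-2008 01:00"], "pm_10": [12.0]})
year_b = pd.DataFrame({"date": ["01-01-2009 01:00"], "pm_2_5": [7.5]})

combined = combine_years([year_a, year_b])
print(combined)
print(combined.isna().sum())
```

Columns that exist in only some years show up as missing values in the others, which is one source of the missing-entry counts inspected next.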


Inspect how many missing entries we have for this complete data set:

In [7]:
air_pollution.isna().sum()

date                                                   0
time                                                   0
pm_10                                              21399
status_pm_10                                       21399
unit_pm_10                                         21399
nitric_oxide                                       11981
status_nitric_oxide                                11981
unit_nitric_oxide                                  11981
nitrogen_dioxide                                   13092
status_nitrogen_dioxide                            13092
unit_nitrogen_dioxide                              13092
NO2_eq                                             13092
status_NO2_eq                                      13092
unit_NO2_eq                                        13092
Non-volatile PM<sub>2.5</sub> (Hourly measured)    32534
status_pm_2_5                                      33922
unit_pm_2_5                                        33922
pm_2_5                         
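Raw counts are easier to judge relative to the number of rows. A minimal sketch of turning the counts into a percentage per column is shown below, using a small made-up frame in place of `air_pollution`; the helper name `missing_share` is hypothetical.

```python
import pandas as pd


def missing_share(frame):
    """Hypothetical helper: percentage of missing entries per column, largest first."""
    return frame.isna().mean().mul(100).round(1).sort_values(ascending=False)


# Made-up stand-in for the combined data set.
toy = pd.DataFrame({"pm_10": [1.0, None, None, 4.0], "nitric_oxide": [1.0, 2.0, 3.0, 4.0]})
share = missing_share(toy)
print(share)
```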

In [8]:
""" Docstring

Parameters
----------
p_1 : dtype
    Description of p_1
p_2 : dtype
    Description of p_2
    
Returns
-------
r_1 : dtype
    Description of r_1
r_2 : dtype
    Description of r_2
"""
