# Initial data read
Notebook to begin exploring the UK air pollution data made available by the Department for Environment, Food and Rural Affairs (Defra): [UK Air Information Resource](https://uk-air.defra.gov.uk/).  
&copy; Crown 2020 copyright Defra via uk-air.defra.gov.uk, licenced under the [Open Government Licence (OGL)](http://www.nationalarchives.gov.uk/doc/open-government-licence/version/2/).

First, we want to access some air pollution readings for Oxford, UK.

In [1]:
import pandas as pd

# auto-formatting of notebook cell code
%load_ext lab_black

# tracks version numbers used to run notebook
%load_ext watermark
%watermark -u -n -t -iv -v -g -a Simon-Lee-UK

pandas 1.0.3
Simon-Lee-UK 
last updated: Fri Mar 27 2020 13:16:28 

CPython 3.7.3
IPython 7.13.0
Git hash: 77be65bf47c8543b16ffb3465993b3ed6da68b41


## Data info
All data are in GMT, with each reading labelled by the hour ending  
Status: R = Ratified / P = Provisional / P* = As supplied  
Oxford St Ebbes  
  
### Raw columns:
`Date` - format: 'dd-mm-yyyy'  
`time` - 24-hour clock, e.g. '19:00'  
`PM<sub>10</sub> particulate matter (Hourly measured)` - $PM_{10}$ particulate matter with diameter $< 10 \;\mu m$  
`status` - ratified / provisional / as supplied  
`unit` - $\mu g \:/\: m^{3}$ (FIDAS)  
`Nitric oxide` - NO  
`status.1` - ratified / provisional / as supplied  
`unit.1` - $\mu g \:/\: m^{3}$  
`Nitrogen dioxide` - NO$_2$  
`status.2` - ratified / provisional / as supplied  
`unit.2` - $\mu g \:/\: m^{3}$  
`Nitrogen oxides as nitrogen dioxide` - NO$_x$ as NO$_2$  
`status.3` - ratified / provisional / as supplied  
`unit.3` - $\mu g \:/\: m^{3}$  
`PM<sub>2.5</sub> particulate matter (Hourly measured)` - $PM_{2.5}$ particulate matter with diameter $< 2.5 \;\mu m$    
`status.4` - ratified / provisional / as supplied  
`unit.4` - $\mu g \:/\: m^{3}$ (Ref.eq)
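
The `.1` to `.4` suffixes on `status` and `unit` are most likely added by pandas rather than present in the file: when a header repeats a name, `read_csv` de-duplicates it by appending a counter. A minimal sketch on a made-up two-pollutant extract (the values and the shortened header are invented for illustration; only the repeated `status`/`unit` names mirror the real layout):

```python
import io

import pandas as pd

# Made-up extract: the header repeats "status" and "unit" once per pollutant.
raw = io.StringIO(
    "Date,time,Nitric oxide,status,unit,Nitrogen dioxide,status,unit\n"
    "01-01-2020,01:00,2.5,P,ugm-3,23.8,P,ugm-3\n"
)
toy = pd.read_csv(raw)

# pandas keeps the first "status"/"unit" as-is and numbers the repeats
print(toy.columns.tolist())
# ['Date', 'time', 'Nitric oxide', 'status', 'unit', 'Nitrogen dioxide', 'status.1', 'unit.1']
```

So `status.N` and `unit.N` always belong to the pollutant column immediately to their left.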

In [2]:
df = pd.read_csv(
    "https://uk-air.defra.gov.uk/data_files/site_data/OX8_2020.csv", header=4
)

Data are available back to 2008, e.g. 'https://uk-air.defra.gov.uk/data_files/site_data/OX8_2008.csv'.  
The CSV file names follow a consistent `OX8_<year>.csv` pattern, so we can loop over the years and save a single appended version of the data.
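
One way the yearly loop could look is sketched below. The helper names `yearly_urls` and `read_all_years` are hypothetical, and the sketch assumes every yearly file keeps its column names on the same line as the 2020 file (`header=4`), which has not been checked here for the older files.

```python
import pandas as pd

BASE_URL = "https://uk-air.defra.gov.uk/data_files/site_data/{site}_{year}.csv"


def yearly_urls(site="OX8", first_year=2008, last_year=2020):
    """Build one CSV URL per year, following the file-name pattern above."""
    return [
        BASE_URL.format(site=site, year=year)
        for year in range(first_year, last_year + 1)
    ]


def read_all_years(urls, reader=pd.read_csv):
    """Read each yearly file and stack the results into one DataFrame.

    Assumption: header=4 is valid for every year, as it is for 2020.
    """
    frames = [reader(url, header=4) for url in urls]
    # ignore_index gives the appended data one continuous row index
    return pd.concat(frames, ignore_index=True)
```

Calling `read_all_years(yearly_urls())` would download all thirteen files; the result could then be written out once with `to_csv` so later runs do not need to hit the site again.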

In [3]:
df.columns.tolist()

['Date',
 'time',
 'PM<sub>10</sub> particulate matter (Hourly measured)',
 'status',
 'unit',
 'Nitric oxide',
 'status.1',
 'unit.1',
 'Nitrogen dioxide',
 'status.2',
 'unit.2',
 'Nitrogen oxides as nitrogen dioxide',
 'status.3',
 'unit.3',
 'PM<sub>2.5</sub> particulate matter (Hourly measured)',
 'status.4',
 'unit.4']

In [4]:
df.dtypes

Date                                                      object
time                                                      object
PM<sub>10</sub> particulate matter (Hourly measured)     float64
status                                                    object
unit                                                      object
Nitric oxide                                             float64
status.1                                                  object
unit.1                                                    object
Nitrogen dioxide                                         float64
status.2                                                  object
unit.2                                                    object
Nitrogen oxides as nitrogen dioxide                      float64
status.3                                                  object
unit.3                                                    object
PM<sub>2.5</sub> particulate matter (Hourly measured)    float64
status.4                                                  object
unit.4                                                    object
dtype: object

In [5]:
df.isna().sum()

Date                                                      0
time                                                      0
PM<sub>10</sub> particulate matter (Hourly measured)      2
status                                                    2
unit                                                      2
Nitric oxide                                             11
status.1                                                 11
unit.1                                                   11
Nitrogen dioxide                                         11
status.2                                                 11
unit.2                                                   11
Nitrogen oxides as nitrogen dioxide                      11
status.3                                                 11
unit.3                                                   11
PM<sub>2.5</sub> particulate matter (Hourly measured)     2
status.4                                                  2
unit.4                                  
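
The counts shown above match within each pollutant's value/status/unit group, which suggests a reading and its flags go missing together. The counts alone do not prove the gaps fall on the same rows, so a small check is sketched here. `missing_together` is a hypothetical helper and the toy frame is made up.

```python
import pandas as pd


def missing_together(frame, columns):
    """Return True if the given columns are null on exactly the same rows."""
    null_mask = frame[columns].isna()
    # every row must be either all-null or all-present across the group
    return bool((null_mask.all(axis=1) == null_mask.any(axis=1)).all())


toy = pd.DataFrame(
    {
        "Nitric oxide": [1.0, None, 3.0],
        "status.1": ["P", None, "P"],
        "unit.1": ["ugm-3", None, "ugm-3"],
    }
)
group_is_consistent = missing_together(toy, ["Nitric oxide", "status.1", "unit.1"])
print(group_is_consistent)
```

On the real data the same call could be repeated for each pollutant group, e.g. `missing_together(df, ["Nitric oxide", "status.1", "unit.1"])`.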

In [6]:
column_conversion = {
    "Date": "date",
    "PM<sub>10</sub> particulate matter (Hourly measured)": "pm_10",
    "status": "status_pm_10",
    "unit": "unit_pm_10",
    "Nitric oxide": "nitric_oxide",
    "status.1": "status_nitric_oxide",
    "unit.1": "unit_nitric_oxide",
    "Nitrogen dioxide": "nitrogen_dioxide",
    "status.2": "status_nitrogen_dioxide",
    "unit.2": "unit_nitrogen_dioxide",
    "Nitrogen oxides as nitrogen dioxide": "NO2_eq",
    "status.3": "status_NO2_eq",
    "unit.3": "unit_NO2_eq",
    "PM<sub>2.5</sub> particulate matter (Hourly measured)": "pm_2_5",
    "status.4": "status_pm_2_5",
    "unit.4": "unit_pm_2_5",
}
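
The mapping above follows a regular pattern: each pollutant gets a short name, and its `status`/`unit` columns reuse that short name as a suffix. As a sketch, the same dictionary could be generated from the pollutant names alone, which may help if other sites carry a different set of pollutants. `build_column_conversion` is a hypothetical helper, and it assumes the pollutants are listed in the same order as in the file.

```python
def build_column_conversion(pollutants):
    """Map raw column names to short names, given pollutants in file order.

    The first pollutant owns 'status'/'unit'; the N-th later one owns
    'status.N'/'unit.N', matching the raw column names in this notebook.
    """
    conversion = {"Date": "date"}
    for position, (raw_name, short_name) in enumerate(pollutants.items()):
        suffix = f".{position}" if position else ""
        conversion[raw_name] = short_name
        conversion[f"status{suffix}"] = f"status_{short_name}"
        conversion[f"unit{suffix}"] = f"unit_{short_name}"
    return conversion


pollutants = {
    "PM<sub>10</sub> particulate matter (Hourly measured)": "pm_10",
    "Nitric oxide": "nitric_oxide",
    "Nitrogen dioxide": "nitrogen_dioxide",
    "Nitrogen oxides as nitrogen dioxide": "NO2_eq",
    "PM<sub>2.5</sub> particulate matter (Hourly measured)": "pm_2_5",
}
generated_conversion = build_column_conversion(pollutants)
```

The result has the same 16 entries as the hand-written dictionary.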

In [7]:
df.rename(columns=column_conversion, inplace=True)
df.columns

Index(['date', 'time', 'pm_10', 'status_pm_10', 'unit_pm_10', 'nitric_oxide',
       'status_nitric_oxide', 'unit_nitric_oxide', 'nitrogen_dioxide',
       'status_nitrogen_dioxide', 'unit_nitrogen_dioxide', 'NO2_eq',
       'status_NO2_eq', 'unit_NO2_eq', 'pm_2_5', 'status_pm_2_5',
       'unit_pm_2_5'],
      dtype='object')

In [8]:
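# Readings are hour-ending, so "24:00" is 00:00 on the following day.
# Flag those rows now; the date is shifted once it has been parsed.
ends_at_midnight = df.time == "24:00"
df.time = df.time.replace("24:00", "00:00")

In [9]:
df.date = df.date + " " + df.time

In [10]:
try:
    df.date = pd.to_datetime(df.date, format="%d-%m-%Y %H:%M")
except ValueError:
    # fall back to inferred parsing, but keep the day-first order of the raw dates
    df.date = pd.to_datetime(df.date, dayfirst=True)
# move the relabelled midnight readings onto the next calendar day
df.loc[ends_at_midnight, "date"] += pd.Timedelta(days=1)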
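
The midnight case is easy to get wrong, so here is a small self-contained sketch of the combination step on made-up rows. It assumes the "hour ending" convention from the data info means a `24:00` reading belongs to 00:00 on the next calendar day; relabelling `24:00` as `00:00` without touching the date would place it at the start of the same day instead. `combine_hour_ending` is a hypothetical helper name.

```python
import pandas as pd


def combine_hour_ending(dates, times):
    """Combine 'dd-mm-yyyy' dates and 'HH:MM' hour-ending times into timestamps."""
    rollover = times == "24:00"
    stamps = pd.to_datetime(
        dates + " " + times.replace("24:00", "00:00"), format="%d-%m-%Y %H:%M"
    )
    # a 24:00 reading closes the day, so add one day to those rows only
    return stamps + pd.to_timedelta(rollover.astype(int), unit="D")


toy = pd.DataFrame(
    {"Date": ["31-01-2020", "31-01-2020"], "time": ["23:00", "24:00"]}
)
stamps = combine_hour_ending(toy["Date"], toy["time"])
print(stamps.tolist())
```

With the rollover applied, the second row lands on 1 February and the timestamps stay in increasing order.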

In [11]:
df.dtypes

date                       datetime64[ns]
time                               object
pm_10                             float64
status_pm_10                       object
unit_pm_10                         object
nitric_oxide                      float64
status_nitric_oxide                object
unit_nitric_oxide                  object
nitrogen_dioxide                  float64
status_nitrogen_dioxide            object
unit_nitrogen_dioxide              object
NO2_eq                            float64
status_NO2_eq                      object
unit_NO2_eq                        object
pm_2_5                            float64
status_pm_2_5                      object
unit_pm_2_5                        object
dtype: object

In [12]:
df

| | date | time | pm_10 | status_pm_10 | unit_pm_10 | nitric_oxide | status_nitric_oxide | unit_nitric_oxide | nitrogen_dioxide | status_nitrogen_dioxide | unit_nitrogen_dioxide | NO2_eq | status_NO2_eq | unit_NO2_eq | pm_2_5 | status_pm_2_5 | unit_pm_2_5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-01-01 01:00:00 | 01:00 | 37.950 | P | ugm-3 (FIDAS) | 2.47755 | P | ugm-3 | 23.75899 | P | ugm-3 | 27.55785 | P | ugm-3 | 31.958 | P | ugm-3 (Ref.eq) |
| 1 | 2020-01-01 02:00:00 | 02:00 | 38.125 | P | ugm-3 (FIDAS) | 2.05763 | P | ugm-3 | 23.95215 | P | ugm-3 | 27.10714 | P | ugm-3 | 32.783 | P | ugm-3 (Ref.eq) |
| 2 | 2020-01-01 03:00:00 | 03:00 | 40.425 | P | ugm-3 (FIDAS) | 1.76368 | P | ugm-3 | 20.90984 | P | ugm-3 | 23.61412 | P | ugm-3 | 35.661 | P | ugm-3 (Ref.eq) |
| 3 | 2020-01-01 04:00:00 | 04:00 | 40.075 | P | ugm-3 (FIDAS) | 2.23610 | P | ugm-3 | 21.73078 | P | ugm-3 | 25.15942 | P | ugm-3 | 35.472 | P | ugm-3 (Ref.eq) |
| 4 | 2020-01-01 05:00:00 | 05:00 | 39.800 | P | ugm-3 (FIDAS) | 2.17311 | P | ugm-3 | 22.64830 | P | ugm-3 | 25.98036 | P | ugm-3 | 35.354 | P | ugm-3 (Ref.eq) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2059 | 2020-03-26 20:00:00 | 20:00 | 31.300 | P | ugm-3 (FIDAS) | 0.00000 | P | ugm-3 | 16.25625 | P | ugm-3 | 16.25625 | P | ugm-3 | 21.227 | P | ugm-3 (Ref.eq) |
| 2060 | 2020-03-26 21:00:00 | 21:00 | 33.700 | P | ugm-3 (FIDAS) | 0.00000 | P | ugm-3 | 18.36000 | P | ugm-3 | 18.36000 | P | ugm-3 | 22.359 | P | ugm-3 (Ref.eq) |
| 2061 | 2020-03-26 22:00:00 | 22:00 | 35.600 | P | ugm-3 (FIDAS) | 0.12473 | P | ugm-3 | 15.49125 | P | ugm-3 | 15.68250 | P | ugm-3 | 24.717 | P | ugm-3 (Ref.eq) |
| 2062 | 2020-03-26 23:00:00 | 23:00 | 39.600 | P | ugm-3 (FIDAS) | 0.00000 | P | ugm-3 | 15.87375 | P | ugm-3 | 15.87375 | P | ugm-3 | 26.604 | P | ugm-3 (Ref.eq) |
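
Every row displayed above carries the provisional flag `P`, but only the first and last few rows are visible. A quick tally of the status flags per pollutant would show whether any ratified (`R`) or as-supplied (`P*`) readings are present. The sketch below runs on a made-up frame and assumes the renamed `status_*` columns; `status_counts` is a hypothetical helper.

```python
import pandas as pd


def status_counts(frame):
    """Count each status flag (R, P, P*) per renamed status_* column."""
    status_columns = frame.filter(like="status_")
    counts = status_columns.apply(pd.Series.value_counts)
    # a flag absent from a column comes back as NaN, so report it as zero
    return counts.fillna(0).astype(int)


toy = pd.DataFrame(
    {
        "pm_10": [37.9, 38.1, 40.4],
        "status_pm_10": ["P", "P", "R"],
        "status_pm_2_5": ["P", "P", "P"],
    }
)
toy_counts = status_counts(toy)
print(toy_counts)
```

On the notebook's data the equivalent call would be `status_counts(df)`.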
