# Data Curation 

The aim of this work was to estimate the real changes in air quality levels attributable solely to the COVID-19 lockdown measures, relative to a business-as-usual (BAU) scenario built with statistical models. For this purpose, different statistical techniques were applied to explain daily pollutant concentrations at air quality monitoring sites in Spanish cities using meteorological variables as predictors.

The whole curation process has been performed using the `src/` scripts; this notebook only shows examples of the process for monitoring sites in the city of Madrid.

In [1]:
suppressMessages(library(tidyverse))

# repository directory
setwd("AirQualityCOVID/")

In [2]:
Sys.setlocale("LC_ALL", "es_ES.UTF-8")

For this purpose, daily pollutant concentration and meteorological time series from `January 1, 2013` to `December 30, 2020` have been obtained, using the data from 2013-2019 as the training partition to build the model.

The start date of the time series has been fixed at 2013 because of data availability in the download service of the [_European Environment Agency_ (**EEA**)](https://discomap.eea.europa.eu/map/fme/AirQualityExport.htm).

In [3]:
suppressMessages(library(lubridate))

start_dt <- ymd("2013-01-01")
end_dt <- ymd("2020-12-30")
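
The partition described above can be sketched in base R as follows. This is a minimal illustration on a synthetic daily series, not the code used in `src/`; the data frame `series` and its `value` column are hypothetical.

```R
# Hypothetical daily series covering the study period (values are synthetic)
dates <- seq(as.Date("2013-01-01"), as.Date("2020-12-30"), by = "day")
series <- data.frame(date = dates, value = seq_along(dates))

# 2013-2019 is the training partition; 2020 is kept apart for the comparison
is_train <- series$date < as.Date("2020-01-01")
train <- series[is_train, ]
test_2020 <- series[!is_train, ]

range(train$date)
range(test_2020$date)
```

Splitting by date rather than at random keeps the whole of 2020 out of the data used to fit the model.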

---

## Air Quality

> _The whole curation process of the air quality data has been performed by the `src/curation/airQuality.R` script._

This study focused on **urban traffic** sites in the most populated Spanish cities (**more than 100 000 inhabitants**). Traffic emissions should have been strongly affected by the COVID-19 lockdown restrictions, and the effect should be even more noticeable in larger cities, which have heavier traffic under normal conditions. Moreover, ground-level measurements provided by air quality stations are more sensitive to emission source changes and are more relevant to human health.

The pollutants studied were:

* _Nitrogen monoxide_ ($NO$) $\rightarrow$ **no**
* _Nitrogen dioxide_ ($NO_2$) $\rightarrow$ **no2**
* _Ozone_ ($O_3$) $\rightarrow$ **o3**
* _Particulate matter with a diameter of $\leq 10 \mu m$_ ($PM_{10}$) $\rightarrow$ **pm10**
* _Particulate matter with a diameter of $\leq 2.5 \mu m$_ ($PM_{2.5}$) $\rightarrow$ **pm2.5**

In [4]:
site_type <- "traffic"
site_area <- "urban"

pollutants <- c("no", "no2", "o3", "pm10", "pm2.5")

The file `data/curation/estaciones-CA-JA.xlsx` contains the information of the air quality monitoring sites in Spanish cities with more than 100 000 inhabitants.

In [5]:
suppressMessages(library(openxlsx))

# AQ station in cities with more than 100000 inhabitants
sites.100mil <- read.xlsx("data/curation/estaciones-CA-JA.xlsx",
                          sheet="ciudades-100000-A") %>% 
                    filter(Municipio == "Madrid") %>%
                    select("Municipio", "Población",
                           "Estación.tráfico", "Código.estación") 

Daily pollutant concentration time series for 2013-2020 have been obtained from the **EEA** using the [`saqgetr`](https://github.com/skgrange/saqgetr) package for R.

In [6]:
suppressMessages(library(saqgetr))

# `!!` forces the global variables site_type / site_area to be used;
# a bare `site_type == site_type` would compare the column with itself
spain.sites <- get_saq_sites() %>%
    filter(country == "spain",
           site %in% sites.100mil$"Código.estación",
           site_type == !!site_type,
           site_area == !!site_area,
           date_start <= start_dt) %>%
    select(site, site_name, latitude, longitude, elevation, 
           country, site_type, site_area, date_start, date_end)

sites.AQ <- merge(x = spain.sites,
                  y = sites.100mil,
                  by.x = "site", by.y="Código.estación",
                  all.x = TRUE) 
head(sites.AQ %>% select(site, site_name, Municipio, Población))

|   | site `<chr>` | site_name `<chr>` | Municipio `<chr>` | Población `<dbl>` |
|:-:|:------------:|:-----------------:|:-----------------:|:-----------------:|
| 1 |   es0115a    |  PLAZA DE ESPAÑA  |      Madrid       |      3266126      |
| 2 |   es0118a    | ESCUELAS AGUIRRE  |      Madrid       |      3266126      |
| 3 |   es0120a    |   RAMÓN Y CAJAL   |      Madrid       |      3266126      |
| 4 |   es1422a    | PLAZA DEL CARMEN  |      Madrid       |      3266126      |
| 5 |   es1426a    |     MORATALAZ     |      Madrid       |      3266126      |
| 6 |   es1521a    | BARRIO DEL PILAR  |      Madrid       |      3266126      |


Despite the reliability of the source, some negative concentration values, which have no physical meaning, were found, so the data had to be preprocessed to remove them. After preprocessing, air quality data were retained, by pollutant, only when observations were available for more than 3 years and for at least 80% of the days between March 2020 and June 2020.
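
The two retention criteria can be sketched in base R as shown below. This is a minimal illustration on synthetic data, not the code in `src/curation/airQuality.R`; the data frame `obs` is hypothetical, and "more than 3 years" is interpreted here as more than `3 * 365` valid daily values, which is an assumption.

```R
# Synthetic daily record for one site and pollutant (all values are made up)
set.seed(1)
dates <- seq(as.Date("2013-01-01"), as.Date("2020-12-30"), by = "day")
obs <- data.frame(date = dates,
                  value = runif(length(dates), min = -1, max = 60))

# Negative concentrations have no physical meaning: drop them
clean <- obs[!is.na(obs$value) & obs$value >= 0, ]

# Criterion 1: more than 3 years of observations (assumed: > 3 * 365 valid days)
enough_years <- nrow(clean) > 3 * 365

# Criterion 2: at least 80% of the days between March and June 2020
window <- seq(as.Date("2020-03-01"), as.Date("2020-06-30"), by = "day")
coverage <- mean(window %in% clean$date)

keep_series <- enough_years && coverage >= 0.8
```

In the real data the check would be repeated for every site and pollutant pair, keeping only the pairs for which `keep_series` is `TRUE`.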

In [7]:
valid.df <- read.csv("data/curation/checked_AQ.csv")
head(valid.df %>%
    select(site, variable, site_name, Municipio, Población))

|   | site `<fct>` | variable `<fct>` | site_name `<fct>`  | Municipio `<fct>` | Población `<int>` |
|:-:|:------------:|:----------------:|:------------------:|:-----------------:|:-----------------:|
| 1 |   es0041a    |       no2        | DIRECCIÓN DE SALUD |      Bilbao       |      346843       |
| 2 |   es0041a    |       pm10       | DIRECCIÓN DE SALUD |      Bilbao       |      346843       |
| 3 |   es0110a    |        no        |      ERANDIO       |      Bilbao       |      346843       |
| 4 |   es0110a    |       no2        |      ERANDIO       |      Bilbao       |      346843       |
| 5 |   es0110a    |       pm10       |      ERANDIO       |      Bilbao       |      346843       |
| 6 |   es0110a    |      pm2.5       |      ERANDIO       |      Bilbao       |      346843       |


---

## Meteorological data

Daily meteorological data have been obtained from the location nearest to each selected air quality station that has at least 80% of records available between 2013 and 2020. A good selection of predictors with high explanatory power for air quality can increase the accuracy and performance of the model. Hence, meteorological variables with a strong influence on pollutant levels were included in the study.
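
The selection rule (nearest location with at least 80% coverage) can be sketched as follows. This is a minimal base R illustration, not the code in `src/curation/`; the candidate stations, their coordinates and coverage values are hypothetical, and the haversine formula is one reasonable choice of distance, assumed here.

```R
# Great-circle (haversine) distance in km between one point and a set of points
haversine_km <- function(lat1, lon1, lat2, lon2, radius = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

# Hypothetical candidates: coordinates and coverage are illustrative only
candidates <- data.frame(
  name      = c("A", "B", "C"),
  latitude  = c(40.41, 40.38, 40.47),
  longitude = c(-3.68, -3.79, -3.56),
  coverage  = c(0.95, 0.99, 0.60)
)
site <- c(latitude = 40.42, longitude = -3.68)  # hypothetical air quality site

candidates$dist <- haversine_km(site[["latitude"]], site[["longitude"]],
                                candidates$latitude, candidates$longitude)

# Keep candidates with >= 80% of records, then take the closest one
valid <- candidates[candidates$coverage >= 0.8, ]
nearest <- valid[which.min(valid$dist), ]
nearest
```

Filtering by coverage before taking the minimum distance ensures that a close but sparsely populated station is never preferred over a slightly farther one with a complete record.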

However, no single complete and valid database with sufficient data for all the meteorological variables wanted for the study could be found. Therefore, three different sources of meteorological data have been used, the main one being the _[Agencia Estatal de Meteorología (**AEMET**)](http://www.aemet.es/es/portada)_. These data were supplemented with wind direction and speed data from the _[National Oceanic and Atmospheric Administration Integrated Surface Database (**NOAA ISD**)](https://www.ncdc.noaa.gov/isd)_ and with relative humidity and radiation data from a third source, as described in the sections below.

### Source: Agencia Estatal de Meteorología (**AEMET**)

> The whole curation process of **AEMET** data is available in the script `src/curation/aemet.py`

Daily temperature (maximum, mean and minimum; ºC), precipitation (mm) and surface pressure (maximum and minimum; hPa) were downloaded from the OpenData platform of the Agencia Estatal de Meteorología (AEMET) through its Application Programming Interface (API), using the [pyAEMET](https://github.com/Jaimedgp/pyAEMET) Python framework developed for this purpose.

| Variable |                     Description                      |  Unit  |
|:--------:|:----------------------------------------------------:|:------:|
|   fecha  |                  Date (YYYY-MM-DD)                   |   -    |
|   tmed   |                Mean daily temperature                |   ºC   |
|   prec   |        Daily precipitation from 07am to 07pm         |   mm   |
|   tmax   |             Maximum daily temperature                |   ºC   |
|   tmin   |             Minimum daily temperature                |   ºC   |
| presmax  |     Maximum pressure at station reference level      |  hPa   |
| presmin  |     Minimum pressure at station reference level      |  hPa   |


```Python

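from datetime import date

# AemetClima comes from the pyAEMET package; `apikey` and `row` (one air
# quality site with its coordinates) are assumed to be defined as in
# src/curation/aemet.py

# Initialize the AEMET API client with the API key
aemet = AemetClima(apikey=apikey)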

aemet.estaciones_curacion(latitud=row["latitude"],
                          longitud=row["longitude"], 
                          n_cercanas=10,   
                          fecha_ini=date(2013, 1, 1),
                          fecha_end=date(2020, 12, 30),
                          umbral=0.8,
                          variables=["fecha", "tmed", "prec",
                                     "tmin", "tmax", "presMax", "presMin"],
                          save_folder="")
```

In [8]:
aemet <- read.csv("data/curation/checked_AEMET.csv")
head(aemet %>% select(indicativo, nombre, dist, siteAQ))

|   | indicativo `<fct>` |   nombre `<fct>`   | dist `<dbl>` | siteAQ `<fct>` |
|:-:|:------------------:|:------------------:|:------------:|:--------------:|
| 1 |        1082        | BILBAO AEROPUERTO  |   5.396849   |    es0041a     |
| 2 |        1082        | BILBAO AEROPUERTO  |   5.756578   |    es0110a     |
| 3 |        3195        |   MADRID, RETIRO   |   1.137154   |    es0118a     |
| 4 |        3195        |   MADRID, RETIRO   |   4.417472   |    es0120a     |
| 5 |        5783        | SEVILLA AEROPUERTO |   7.968651   |    es0817a     |
| 6 |        5783        | SEVILLA AEROPUERTO |  11.220392   |    es0890a     |


### Source: NOAA

The surface wind speed (m/s) and direction (in degrees, with 90º for East) were included because of their influence on pollutant transport, which affects local measurements. Both were obtained from the National Oceanic and Atmospheric Administration Integrated Surface Database (**NOAA ISD**) using the [`worldmet`](https://github.com/davidcarslaw/worldmet) R package.

> The whole curation process of **NOAA ISD** data is available in the script `src/curation/worldMet.R`

| Variable |  Description   |         Unit          |
|:--------:|:--------------:|:---------------------:|
|    ws    |   Wind speed   |          m/s          |
|    wd    | Wind direction | degrees (90 for East) |

```R

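library(worldmet)

# `st`: code of the air quality site being processed
# (assumed to be set in src/curation/worldMet.R)
site_row <- sites.AQ[sites.AQ$site == st, ]

# Metadata of the 10 NOAA ISD stations closest to the site
getMeta(lat = site_row$latitude[1],
        lon = site_row$longitude[1],
        end.year = "current",
        n = 10, returnMap = FALSE)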
```

In [9]:
worldmet <- read.csv("data/curation/checked_NOAA-ISD.csv")
head(worldmet %>% select(station, code, dist, siteAQ))

|   | station `<fct>` | code `<fct>` | dist `<dbl>` | siteAQ `<fct>` |
|:-:|:---------------:|:------------:|:------------:|:--------------:|
| 1 |     BILBAO      | 080250-99999 |   5.464009   |    es0041a     |
| 2 |     BILBAO      | 080250-99999 |   5.36354    |    es0110a     |
| 3 | CUATRO VIENTOS  | 082230-99999 |  10.367903   |    es0118a     |
| 4 |     BARAJAS     | 082210-99999 |  10.444349   |    es0120a     |
| 5 |     SEVILLA     | 083910-99999 |   6.979568   |    es0817a     |
| 6 |     SEVILLA     | 083910-99999 |  10.069231   |    es0890a     |
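
Because wind direction is an angle, sub-daily records cannot simply be averaged arithmetically (the mean of 350º and 10º should be 0º, not 180º). The source does not state how the daily `ws` and `wd` values were aggregated, so the following is only a sketch of one common approach, a vector average of the wind components. The function `mean_wind` is hypothetical and is not part of `worldmet`.

```R
# Vector mean of wind speed and direction, using the meteorological convention:
# direction the wind blows FROM, 0 = North, 90 = East
mean_wind <- function(ws, wd) {
  rad <- wd * pi / 180
  u <- mean(-ws * sin(rad), na.rm = TRUE)  # eastward component
  v <- mean(-ws * cos(rad), na.rm = TRUE)  # northward component
  c(ws = sqrt(u^2 + v^2),
    wd = (atan2(-u, -v) * 180 / pi) %% 360)
}

# Two records of 2 m/s, slightly either side of East
mean_wind(ws = c(2, 2), wd = c(80, 100))
```

The resulting speed is slightly below 2 m/s because the two vectors do not point in exactly the same direction; a scalar mean of `ws` would not show this partial cancellation.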


### Source: ERA5-Land

In addition, daily solar radiation (W/$m^2$) and relative humidity (%) were downloaded from the [_**ERA5-Land**_](https://cds.climate.copernicus.eu/cdsapp#!/search?type=dataset&text=era5-land) reanalysis dataset. Solar radiation has been included because of the influence of photochemistry on ozone formation from primary air pollutants. This reanalysis has a spatial resolution of $0.1º \times 0.1º$ (approximately 9 km).

|    Variable     |   Description    | Unit  |
|:---------------:|:----------------:|:-------:|
| solar.radiation | Solar Radiation  | W/$m^2$ |
|       RH        | Relative Humidity |   \%    |
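
With a regular $0.1º$ grid, a site can be matched to the reanalysis by snapping its coordinates to the nearest grid node. The sketch below assumes that the grid nodes lie on multiples of the resolution and that nearest-node matching is acceptable; the source does not state how sites were matched to ERA5-Land cells, and `nearest_grid_node` is a hypothetical helper.

```R
# Snap a coordinate to the nearest node of a regular grid
# with the given resolution in degrees
nearest_grid_node <- function(lat, lon, res = 0.1) {
  c(latitude  = round(lat / res) * res,
    longitude = round(lon / res) * res)
}

# Hypothetical site coordinates in central Madrid
node <- nearest_grid_node(40.4168, -3.7038)
node
```

At this resolution the matched node is at most half a grid step away from the site in each direction.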