SEP data workflow: air quality data #17

FranckCo · 2021-11-07T10:57:07Z

Design and implement data workflow.

francescadag · 2022-01-20T17:00:23Z

Italian Air quality data
Data extraction
Step1: Data source website
Step2: Select DATA panel; data are organized in a set of tables.
Step3: Scroll to the requested table, named “Tabella 1 – PM10. Stazioni di monitoraggio: dati e parametri statistici per la valutazione della qualità dell'aria (2019)”.
Step4: Download link available on the left bottom at the end of the table.
These steps were repeated to extract the datasets of the other requested pollutans:
AMBIENT AIR QUALITY: NITROGEN DIOXIDE NO2
AMBIENT AIR QUALITY: TROPOSPHERIC OZONE O3
AMBIENT AIR QUALITY: PARTICULATE PM2.5

Data transformation
The downloaded file has the following Data Structure:
“Regione”,”Provincia”,”Comune”,”Nome della stazione Tipo di zona”,”Tipo di stazione”,”Giorni di superamento di 50 µg/m3”,”Valore medio annuo³ [µg/m³]”,”Rendimento [%]”,”Rispetta copertura minima”,”sufficiente distribuzione temporale nell'anno”,”numero_dati_validi”,”TIPO DI DATI 4”,”Codice zona”,”Nome zona”.

Data need to be filtered in order to be compliant to the requested Data Structure,
NUTS3 variable has been added through a transformation from municipality_id Variable, using data from ISTAT LAU archive
Provided metadata for NUTS3 transformation need to be downloaded and merged.
Metadata are referenced in a time series and Variable regarding year 2019 has been used in the script.
Metadata regarding pollutant type, data reference time and aggregation type have been added in the datafile.

The Data transformation phase was applied only to the dataset related to the PM10 pollutant.
Transformation script in R language: processing_ETL_AIR.R.txt

Data loading
The extracted datasets were uploaded to the FTP area of the project:

The dataset concerning PM 10.
The AMBIENT AIR QUALITY: NITROGEN DIOXIDE NO2.
The AMBIENT AIR QUALITY: TROPOSPHERIC OZONE O3 O3.
The AMBIENT AIR QUALITY: PARTICULATE PM2.5PM2.5.
All files are available in tabular format.

Possible integration in the French Air pollution datasets
The French dataset about the PM10 taken from the European Environmental Agency and uploaded to the FTP server, in its initial version contains the geographic coordinates; it has been enriched with the Municipality value through a script in java using the specific service/API.
The file obtained is available on the FTP area of the project.
After receiving confirmation that the extrapolated French dataset is correct, in the same way we will also extrapolate the datasets of the other pollutants and it will be possible to add the field referring to the Municipality to them as well.

FranckCo · 2022-02-03T10:17:11Z

Here is the direct link to the data for France, 2019 and PM10. Now we have to find a way to automate the "Download CSV" button.

Regarding the Nominatim API for geocoding, the problem is that it does not return the LAU (commune code), only the postal code, which is not the same (see example).

Also, the file on FTP does not seem to be UTF-8.

pafrance · 2022-02-03T15:59:22Z

Several themes regarding Air Quality are mixed together in ISPRA website
We actually don't know whether ISPRA has data for any wanted pollutant or not.
Here are the download link we could find.
PM10: https://annuario.isprambiente.it/sites/default/files/sys_ind_files/indicatori_ada/448/TABELLA%201_PM10_2019_rev.xlsx
PM2.5: https://annuario.isprambiente.it/sites/default/files/sys_ind_files/indicatori_ada/452/TABELLA%201_PM25_2019_rev_0.xlsx
NO2: https://annuario.isprambiente.it/sites/default/files/sys_ind_files/indicatori_ada/450/TABELLA1_NO2_2019.xlsx
O3: https://annuario.isprambiente.it/sites/default/files/sys_ind_files/indicatori_ada/451/TABELLA%201_O3_SALUTE_2019.xlsx
Url can hardly be made from rules. The only standard part is the prefix: https://annuario.isprambiente.it/sites/default/files/sys_ind_files/indicatori_ada/{theme_number}
where theme_number is loosely related to the pollutant and not to some standard classification
The last part of the url recites loosely as follow: /TABELLA[ ]1_{pollutant}_{referenced_year}[revision].xlsx
I don't know if these information can be used to construct a download link on the fly, but, at least they're useful to extract metadata about the dataset. namely the pollutant name and the year of reference, which, in turn are not present as columns in the dataset.

Dataset structure also differs from file to file. Mappings will be provided shortly

NB: We could switch to other more easy to manage data sources, such as the same EEA portal used to fetch French data if we just wanted to test the pipeline process. Different data sources like ISPRA give us a better use case to test data integration process as well.

pafrance · 2022-02-03T17:02:40Z

About ISPRA data file harmonization.
Dataset structure also differs from file to file, so the actual mapping needs a little reworking.
Mapping is explained in the file attached
Air Pollution meta.xlsx

There are three sections:

A set of variables of interest need to be extracted from the files, when available.
Desired Column Name is stated on the leftmost column, complete with description
Actual column name in Italian is provided for any of the input files

CSV final column name	Description
Region	Region code
Department	Department code
Municipality	Municipality code
StationId	Sensor station id
RegionName	Region Name
DepartmentName	Department name
MunicipalityName	Municipality name
StationName	Sensor Station Name
Aqvalue	Average annual value

A set of variable can be ignored completely
These columns are described for clarifying, but are not important and can be left out of the process
Last set of values are constants given as metadata in the file or in the metadata description which are not always provided.

CSV final column name	Description
Pollutant	Pollutant notation - constant data extracted from filename
PollutantDesc	Pollutant description - extracted from standard AQ Model definition
AggregationType	Aggregation level code - extracted from dataset
AggregationDesc	Aggregation level definition - extracted from standard AQ Model definition
ReferenceYear	Sampling Year - constant data extracted from filename
MeasureUnit	Measure unit - constant data extracted from column description
Country	Country - constant data extracted from web source
NUTS3	Department level (NUTS) - use NUTS3_LAU_IT metadata for department level join and conversion

francescadag · 2022-02-04T07:53:29Z

@FranckCo To automate the download, I saw that an API is called to get the resource that has this format:
http://aidef.apps.eea.europa.eu/tools/download?download_query=http://aidef.apps.eea.europa.eu/?source={"query":{"bool":{"must":[{"term":{"CountryOrTerritory":"France"}},{"term":{"ReportingYear":"2019"}},{"term":{"Pollutant":"Particulate+matter+<+10+µm+(aerosol)"}}]}}}&download_format=csv
The "source" field is a json that can be easily edited to get the other files.

Regarding the service for geocoding, in addition to the Municipality field, is also needed the LAU? I could add it, I just need to have the table of the French LAUs, these are the ones found on the Eurostat website.

I checked the files on the server and converted the non-UTF-8 ones, reorganized them and the pollutant files are now here. The Italian ones are in CSV format, if necessary I will also upload those in Excel format.

pafrance · 2022-02-04T11:04:27Z

Dear Franck, another option is to use European source both for Territorial metadata (Nuts3/Lau) and for Air Pollution Dataset retrieval. We tested Francesca's url above with "Italy" instead of "France" and it works.
This is a huge simplification to test the automated pipeline.
In the mean time we will use Ontology Integration for PM10 dataset to link Italian ISPRA and EAA French dataset only.
We could compare the outcomes later.

pafrance · 2022-02-17T15:35:32Z

Update regarding territorial representation mismatch in AIR Quality data:

IT: ISPRA sources uses ISTAT codes for Administrative Territorial Unit representation
FR: EAA source uses geo coordinates.

These codes must be converted into NUTS3 + LAU via custom ETL for each file.

IT: Metadata from Eurostat have been provided for administrative codes conversions.
FR: A service for translating geo coordinates into administrative names has been suggested too.

Both files must have two additional columns where actual NUTS3 and LAU codes are stored
Then, Ontological framework maps NUTS3 and LAUs accordingly.

francescadag · 2022-02-25T07:31:26Z

Added the LAU and NUTS3 fields to the French Air pollution datasets using python script (which was sent to Franck).
It obtains the datasets using the API described here, the Municipality name field is added using the following service and subsequently the corresponding LAU and NUTS3 are extrapolated from it from the Eurostat Excel file (of 2019).
The files obtained are on the FTP server here.

pafrance · 2022-04-20T13:25:27Z

SEP process steps

Step 1; Data acquisition

Source files are loaded from ftp repository to MySQL local database
FR source file(s)
FR PM10 pollution: (this file has been previously elaborated, adding LAU and NUTS3 by a dedicated service)
FR Census File:
IT source file(s)
IT PM10 pollution:
IT Census File: Istat website
Metadata:
Nuts3_LAU taken from Eurostat
Air Pollution codelists taken from EAA (example )
Age Classes and Sex codelist merged between IT and FR alike

Common data model (table or file structure) as specified here #17 (comment)
**Variable_name,Description,data_type,
StationLocalId,'Sensor station ID',text,
pollutant,'pollutant code codelist from EAA website',text,
COUNTRY,'country of reference',text,
Reporting_YEAR,'year of reference',number,
AggregationType,'data aggregation codelist from EAA website','text,
AQValue,'observed value',number
id,'observetion key',number
lauCode,'Municipality code',text (can be easily linked to territory metadata like NUTS3 and geonames)
concentration,'codelist if unity of measure',text

Input files are uploaded from ftp source as is.
Harmonization has been done via sql union queries wrapped in a single view

CREATE ALGORITHM=UNDEFINED DEFINER=root@localhost SQL SECURITY DEFINER VIEW v_observations AS
select
obs.StationLocalId AS StationLocalId,
obs.pollutant AS pollutant,
obs.COUNTRY AS COUNTRY,
obs.Reporting_YEAR AS Reporting_YEAR,
obs.AggregationType AS AggregationType,
obs.AQValue AS AQValue,
obs.spId AS spId,row_number() OVER (ORDER BY obs.StationLocalId ) AS id,
obs.lauCode AS lauCode,
obs.concentration AS concentration
from (
select
substr(air_pollution_it.MUNICIPALITY_ID,(length(air_pollution_it.province_id) + 1)) AS lauCode,
air_pollution_it.station_code AS StationLocalId,
'PM10' AS pollutant,'IT' AS COUNTRY,
2019 AS Reporting_YEAR,'Annual mean / 1 calendar year' AS AggregationType,
air_pollution_it.VALORE_MEDIO_ANNUO AS AQValue,NULL AS spId,
'µg/m3' AS concentration
from air_pollution_it
union all
select
substr(air_pollution_it.MUNICIPALITY_ID,
(length(air_pollution_it.province_id) + 1)) AS lauCode,
air_pollution_it.station_code AS StationLocalId,
'PM10' AS pollutant,
'IT' AS COUNTRY,
2019 AS Reporting_YEAR,
'1 day exceed 180' AS AggregationType,
air_pollution_it.Giorni_di_superamento AS AQValue,
NULL AS spId,
NULL AS concentration
from air_pollution_it
union all
select
a.LAU AS lauCode,
a.StationLocalId AS StationLocalId,
b.Notation AS Pollutant,
c.cod AS COUNTRY,
a.ReportingYear AS Reporting_YEAR,
a.AggregationType AS AggregationType,
a.AQValue AS AQValue,
a.SamplingPointLocalId AS spId,
a.Unit AS concentration
from (
(
air_pollution_fr a
join pollutant b on((a.Pollutant = b.Label)))
join country c on((a.CountryOrTerritory = c.description)))
) obs

This query has 3 nested sub queries, one for french data and two for Italian data, each for a single measure type. The resultset has the form of a data cube. Codelists and measures have been normalized and linked to their corresponding metadata

CREATE ALGORITHM=UNDEFINED DEFINER=root@localhost SQL SECURITY DEFINER VIEW v_cens_obs AS
select
row_number() OVER (ORDER BY d.nuts3,d.lau,d.age ) AS id,
d.lau AS lau,
d.nuts3 AS nuts3,
d.sex AS sex,
d.age AS age,
d.population AS population,
d.country AS country
from
(
select
lpad(c.lau,6,0) AS lau,
c.nuts3 AS nuts3,
c.sex AS sex,
c.age AS age,
c.population AS population,
c.country AS country
from
cens_it c
union all
select
a.lau AS lau,
b.NUTS_3_CODE AS nuts3,
a.sex AS sex,
a.age AS age,
a.population AS population,
'FR' AS country
from (cens_fr a join lau_nuts3 b on((a.lau = b.LAU_CODE)))
) d;

This query integrates files from different countries as one single harmonized view.

MySQL is used as data repository for monolith, the tool for mappings.
This tool associates mappings with sql queries on the mysql database, so that SparQL queries can be translated automatically into SQL queries.
The tool Monolith can export them in xml format, although this is hardly reusable elsewhere.

The sparql resultset can be formatted in csv, json and rdf and sent to the subsequent stages of the pipeline.

FranckCo added the sep pilot Required for the Supporting Environment Policies pilot label Nov 7, 2021

FranckCo added this to the Milestone 6 milestone Nov 7, 2021

FranckCo assigned FranckCo, romaintailhurat and NicoLaval Nov 7, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SEP data workflow: air quality data #17

SEP data workflow: air quality data #17

FranckCo commented Nov 7, 2021

francescadag commented Jan 20, 2022

FranckCo commented Feb 3, 2022

pafrance commented Feb 3, 2022

pafrance commented Feb 3, 2022

francescadag commented Feb 4, 2022

pafrance commented Feb 4, 2022

pafrance commented Feb 17, 2022

francescadag commented Feb 25, 2022

pafrance commented Apr 20, 2022 •

edited

Loading

SEP data workflow: air quality data #17

SEP data workflow: air quality data #17

Comments

FranckCo commented Nov 7, 2021

francescadag commented Jan 20, 2022

FranckCo commented Feb 3, 2022

pafrance commented Feb 3, 2022

pafrance commented Feb 3, 2022

francescadag commented Feb 4, 2022

pafrance commented Feb 4, 2022

pafrance commented Feb 17, 2022

francescadag commented Feb 25, 2022

pafrance commented Apr 20, 2022 • edited Loading

pafrance commented Apr 20, 2022 •

edited

Loading