# Exploratory analysis of water quality database

<a id = 'top'></a>

## TABLE OF CONTENTS

[0. Import datasets](#import_dataset)

[0.1 Loading datasets](#load_files)

[0.1.1 Loading spatial data](#load_spatial)

[0.1.2 Loading emissions data](#load_emissions)
    
[0.1.3 Loading monitoring data](#load_monitoring)

[0.1.4 Loading aggregated data](#load_aggregated)

[0.1.5 Loading aggregated by water body data](#load_aggregated_waterbody)

[0.2 Web scraping from Discodata EEA](#web_scraping)

[1. Exploratory analysis of datasets](#exploratory_analysis)

[1.1 Exploratory analysis of spatial dataset](#explore_spatial)

[1.2 Exploratory analysis of emissions dataset](#explore_emissions)
  
[1.3 Exploratory analysis of monitoring dataset](#explore_monitoring)

[1.4 Exploratory analysis of aggregated dataset](#explore_aggregated)

<a id = 'import_dataset'></a>
## 0. Import datasets
[Top](#top)
    
[1.](#exploratory_analysis)

<a id = 'load_files'></a>
### 0.1 Loading datasets
[Top](#top)

In [1]:
from urllib.request import urlopen
import json
import pandas as pd
from pandas import json_normalize
import numpy as np

<a id = 'load_spatial'></a>
#### 0.1.1 Loading spatial data
[Top](#top)

The data describing the identifiers, names and locations of the monitored water bodies is referred to as the "spatial" dataset.

It is imported from a CSV file, which was downloaded from https://discomap.eea.europa.eu/App/DiscodataViewer/?fqn=[WISE_SOE].[v2r1].[Waterbase_S_WISE_SpatialObject_DerivedData].

A JSON URL was not available for this table.

In [2]:
spatial_raw = pd.read_csv("SpatialObject.csv")
spatial_raw

Unnamed: 0,countryCode,thematicIdIdentifier,thematicIdIdentifierScheme,monitoringSiteIdentifier,monitoringSiteIdentifierScheme,monitoringSiteName,waterBodyIdentifier,waterBodyIdentifierScheme,waterBodyName,specialisedZoneType,...,subUnitIdentifier,subUnitIdentifierScheme,subUnitName,rbdIdentifier,rbdIdentifierScheme,rbdName,confidentialityStatus,lon,lat,statusCode
0,FR,FRFR05234020,euMonitoringSiteCode,FRFR05234020,euMonitoringSiteCode,MAUBOURGUET,FRFR326A,euSurfaceWaterBodyCode,L'ECHEZ DU CONFLUENT DU CANAL DU MOULIN AU CON...,riverWaterBody,...,FRF,euSubUnitCode,"LA GARONNE, L'ADOUR, LA DORDOGNE, LA CHARENTE ...",FRF,euRBDCode,"LA GARONNE, L'ADOUR, LA DORDOGNE, LA CHARENTE ...",F,0.03117,43.46819,stable
1,FR,FRFR05236100,euMonitoringSiteCode,FRFR05236100,euMonitoringSiteCode,PONT DE GERDE,FRFR236,euSurfaceWaterBodyCode,L'ADOUR DE SA SOURCE AU CONFLUENT DE LA DOULOU...,riverWaterBody,...,FRF,euSubUnitCode,"LA GARONNE, L'ADOUR, LA DORDOGNE, LA CHARENTE ...",FRF,euRBDCode,"LA GARONNE, L'ADOUR, LA DORDOGNE, LA CHARENTE ...",F,0.15756,43.05401,stable
2,FR,FRFR05237000,euMonitoringSiteCode,FRFR05237000,euMonitoringSiteCode,ST-PEE,FRFR273,euSurfaceWaterBodyCode,LA NIVELLE,riverWaterBody,...,FRF,euSubUnitCode,"LA GARONNE, L'ADOUR, LA DORDOGNE, LA CHARENTE ...",FRF,euRBDCode,"LA GARONNE, L'ADOUR, LA DORDOGNE, LA CHARENTE ...",F,-1.56073,43.35571,stable
3,FR,FRFR05238500,euMonitoringSiteCode,FRFR05238500,euMonitoringSiteCode,BIRIATOU,FRFT08,euSurfaceWaterBodyCode,ESTUAIRE BIDASSOA,transitionalWaterBody,...,FRF,euSubUnitCode,"LA GARONNE, L'ADOUR, LA DORDOGNE, LA CHARENTE ...",FRF,euRBDCode,"LA GARONNE, L'ADOUR, LA DORDOGNE, LA CHARENTE ...",F,-1.74205,43.33168,stable
4,FR,FRGLJ062510T,euMonitoringSiteCode,FRGLJ062510T,euMonitoringSiteCode,RETENUE DE ROPHEMEL,FRGL018,euSurfaceWaterBodyCode,RETENUE DE ROPHEMEL,lakeWaterBody,...,FRG,euSubUnitCode,"LA LOIRE, LES COURS D'EAU CÔTIERS VENDÉENS ET ...",FRG,euRBDCode,"LA LOIRE, LES COURS D'EAU CÔTIERS VENDÉENS ET ...",F,-2.06050,48.31543,stable
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73209,FR,FRFR05232350,euMonitoringSiteCode,FRFR05232350,euMonitoringSiteCode,BALEIX,FRFRR238_1,euSurfaceWaterBodyCode,LE LEES,riverWaterBody,...,FRF,euSubUnitCode,"LA GARONNE, L'ADOUR, LA DORDOGNE, LA CHARENTE ...",FRF,euRBDCode,"LA GARONNE, L'ADOUR, LA DORDOGNE, LA CHARENTE ...",F,-0.11570,43.37718,stable
73210,FR,FRFR05233000,euMonitoringSiteCode,FRFR05233000,euMonitoringSiteCode,ST-MONT,FRFR327C,euSurfaceWaterBodyCode,L'ADOUR DU CONFLUENT DE L'ECHEZ AU CONFLUENT D...,riverWaterBody,...,FRF,euSubUnitCode,"LA GARONNE, L'ADOUR, LA DORDOGNE, LA CHARENTE ...",FRF,euRBDCode,"LA GARONNE, L'ADOUR, LA DORDOGNE, LA CHARENTE ...",F,-0.15118,43.65393,stable
73211,FR,FRFR05234000,euMonitoringSiteCode,FRFR05234000,euMonitoringSiteCode,TASQUE,FRFR235A,euSurfaceWaterBodyCode,L'ARROS DU CONFLUENT DU LURUS AU CONFLUENT DE ...,riverWaterBody,...,FRF,euSubUnitCode,"LA GARONNE, L'ADOUR, LA DORDOGNE, LA CHARENTE ...",FRF,euRBDCode,"LA GARONNE, L'ADOUR, LA DORDOGNE, LA CHARENTE ...",F,0.02190,43.64107,stable
73212,FR,FRFR05234015,euMonitoringSiteCode,FRFR05234015,euMonitoringSiteCode,OZON-DARRE,FRFR235B,euSurfaceWaterBodyCode,L'ARROS DU CONFLUENT DU LACA (INCLUS) AU CONFL...,riverWaterBody,...,FRF,euSubUnitCode,"LA GARONNE, L'ADOUR, LA DORDOGNE, LA CHARENTE ...",FRF,euRBDCode,"LA GARONNE, L'ADOUR, LA DORDOGNE, LA CHARENTE ...",F,0.24909,43.15934,stable


In [3]:
records, attributes = spatial_raw.shape

In [4]:
print(f"The spatial dataset has {records} records and {attributes} attributes.")

The spatial dataset has 73214 records and 23 attributes.


<a id = 'load_emissions'></a>
#### 0.1.2 Loading emissions data
[Top](#top)

The dataset that gives information about the substances emitted to the water bodies, and their amounts, is referred to as "emissions".

It is imported as JSON directly from the website https://discodata.eea.europa.eu/#, from the table WISE_SOE > latest > Waterbase_T_WISE1_Emissions. The following query was run on the online server:

SELECT *
FROM [WISE_SOE].[latest].[Waterbase_T_WISE1_Emissions]

Running the query also generates a URL that returns the result as JSON. By default this URL returns only the first 100 records, so it was modified by changing the parameter "nrOfHits" from 100 to 103285, the total number of rows in the table.
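The URL edit described above can also be done in code. The sketch below is a hypothetical helper (`build_discodata_url` is an illustrative name, not part of any EEA library) that percent-encodes a query and sets the page and "nrOfHits" parameters. The parameter names are taken from the URLs used in this notebook; whether the server accepts other encodings has not been checked here.

```python
from urllib.parse import quote

BASE_URL = "https://discodata.eea.europa.eu/sql"

def build_discodata_url(query, page=1, hits=100):
    """Return a Discodata JSON URL for `query` (hypothetical helper)."""
    # Keep '*' unescaped so the result matches the URLs copied from the website.
    encoded = quote(query, safe="*")
    return f"{BASE_URL}?query={encoded}&p={page}&nrOfHits={hits}&mail=null&schema=null"

emissions_query = "SELECT * FROM [WISE_SOE].[latest].[Waterbase_T_WISE1_Emissions]"
url_all_rows = build_discodata_url(emissions_query, page=1, hits=103285)
print(url_all_rows)
```

Building the URL from the plain SQL text keeps the query readable and avoids editing the percent-encoded string by hand.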

In [152]:
"""
Import the JSON through the URL provided by the online database.
By default the URL returns only 100 records (nrOfHits); here nrOfHits is set to the total number of rows instead of looping through the pages (p).
eea_emission_url = "https://discodata.eea.europa.eu/sql?query=SELECT%20*%0AFROM%20%5BWISE_SOE%5D.%5Blatest%5D.%5BWaterbase_T_WISE1_Emissions%5D&p=1&nrOfHits=103285&mail=null&schema=null"
eea_emission_response = urlopen(eea_emission_url)
emissions_raw = json.loads(eea_emission_response.read())
"""

In [6]:
emissions_raw = {"results":[]}
p = 1
nr = 1000
while True:
    eea_emission_url = "https://discodata.eea.europa.eu/sql?query=SELECT%20*%20FROM%20%5BWISE_SOE%5D.%5Blatest%5D.%5BWaterbase_T_WISE1_Emissions%5D&p={page}&nrOfHits={num_record}&mail=null&schema=null".format(page=p, num_record=nr)
    eea_emission_response = urlopen(eea_emission_url)
    json_data = json.loads(eea_emission_response.read())
    if len(json_data.get("results", []))==0:
        break
    else:
        emissions_raw["results"].extend(json_data.get("results", []))
    p = p + 1
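The paging loop above is repeated for several tables in this notebook, so it could be factored into one function. The following is a minimal sketch under stated assumptions: `fetch_json`, `collect_all_pages` and the `max_pages` guard are hypothetical names introduced here, and the response is assumed to have the same "results" key as in the cell above. The demonstration uses a fake fetch function, so it runs without network access.

```python
import json
from urllib.request import urlopen

def fetch_json(url):
    """Download `url` and parse the body as JSON (performs a network call)."""
    with urlopen(url) as response:
        return json.loads(response.read())

def collect_all_pages(make_url, fetch=fetch_json, max_pages=10_000):
    """Request page 1, 2, ... until a page comes back with no results."""
    rows = []
    for page in range(1, max_pages + 1):
        page_rows = fetch(make_url(page)).get("results", [])
        if not page_rows:
            break
        rows.extend(page_rows)
    return rows

# Offline demonstration with made-up pages instead of the real server.
fake_pages = {1: [{"UID": 1}, {"UID": 2}], 2: [{"UID": 3}]}

def fake_fetch(url):
    page = int(url.rsplit("=", 1)[1])
    return {"results": fake_pages.get(page, [])}

demo_rows = collect_all_pages(lambda p: f"https://example.invalid/sql?p={p}", fetch=fake_fetch)
print(len(demo_rows))  # 3
```

Passing the fetch function as an argument keeps the stopping logic testable, and the `max_pages` limit prevents an endless loop if the server never returns an empty page.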

In [7]:
emissions_df_raw = json_normalize(emissions_raw['results'])

In [8]:
emissions_df_raw

Unnamed: 0,countryCode,spatialUnitIdentifier,spatialUnitIdentifierScheme,phenomenonTimeReferencePeriod,observedPropertyDeterminandCode,observedPropertyDeterminandLabel,parameterEmissionsSourceCategory,parameterEPRTRfacilities,resultEmissionsValue,resultEmissionsUom,procedureEmissionsMethod,resultObservationStatus,Remarks,metadata_versionId,metadata_beginLifeSpanVersion,metadata_statusCode,metadata_observationStatus,metadata_statements,UID
0,AT,AT1000,euRBDCode,2016,CAS_7439-92-1,Lead and its compounds,I,yes,759.500000,kg/a,calculated,X,data derived from EPRTR by ETC,http://discomap.eea.europa.eu/data/wisesoe/der...,2020-06-08 00:00:00.000,experimental,A,,137076
1,AT,AT1000,euRBDCode,2016,CAS_7439-92-1,Lead and its compounds,U2,yes,280.000000,kg/a,calculated,X,data derived from EPRTR by ETC,http://discomap.eea.europa.eu/data/wisesoe/der...,2020-06-08 00:00:00.000,experimental,A,,137077
2,AT,AT1000,euRBDCode,2016,CAS_7439-97-6,Mercury and its compounds,I,yes,5.290000,kg/a,calculated,X,data derived from EPRTR by ETC,http://discomap.eea.europa.eu/data/wisesoe/der...,2020-06-08 00:00:00.000,experimental,A,,137078
3,AT,AT1000,euRBDCode,2016,CAS_7440-02-0,Nickel and its compounds,I,yes,2568.300000,kg/a,calculated,X,data derived from EPRTR by ETC,http://discomap.eea.europa.eu/data/wisesoe/der...,2020-06-08 00:00:00.000,experimental,A,,137080
4,AT,AT1000,euRBDCode,2016,CAS_7440-02-0,Nickel and its compounds,U2,yes,3690.000000,kg/a,calculated,X,data derived from EPRTR by ETC,http://discomap.eea.europa.eu/data/wisesoe/der...,2020-06-08 00:00:00.000,experimental,A,,137081
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
103280,XK,XK,countryCode,2019,EEA_31615-01-7,Total nitrogen,U22,no,9.477000,t/a,measured,A,,https://cdr.eionet.europa.eu/xk/eea/wise_soe/w...,2021-01-13 09:43:12.000,stable,A,,162583
103281,XK,XK,countryCode,2020,EEA_31-02-7,Total suspended solids,U22,no,6.959722,t/a,measured,A,,https://cdr.eionet.europa.eu/xk/eea/wise_soe/w...,2022-01-12 23:18:02.000,stable,A,,179787
103282,XK,XK,countryCode,2020,EEA_3133-01-5,BOD5,U22,no,3.117451,t/a,measured,A,,https://cdr.eionet.europa.eu/xk/eea/wise_soe/w...,2022-01-12 23:18:02.000,stable,A,,179788
103283,XK,XK,countryCode,2020,EEA_3133-03-7,CODCr,U22,no,55.646105,t/a,measured,A,,https://cdr.eionet.europa.eu/xk/eea/wise_soe/w...,2022-01-12 23:18:02.000,stable,A,,179789


In [9]:
record, attributes = emissions_df_raw.shape
print(f"The emissions df is composed of {record} records and {attributes} attributes.")

The emissions df is composed of 103285 records and 19 attributes.


<a id = 'load_monitoring'></a>
#### 0.1.3 Loading monitoring data
[Top](#top)

In [None]:
monitoring_raw = {"results" : []}

p = 1
nr = 1000000

while True:
    monitoring_url = "https://discodata.eea.europa.eu/sql?query=SELECT%20*%0AFROM%20%5BWISE_SOE%5D.%5Blatest%5D.%5BWaterbase_T_WISE3_MonitoringData%5D&p={page}&nrOfHits={num_record}&mail=null&schema=null".format(page = p, num_record = nr)
    monitoring_response = urlopen(monitoring_url)
    json_data = json.loads(monitoring_response.read())
    
    if len(json_data.get("results", [])) == 0:
        break
    else:
        monitoring_raw['results'].extend(json_data.get("results", []))
    p = p + 1
# Takes long to load

In [None]:
monitoring_df_raw = json_normalize(monitoring_raw["results"])

If loading the JSON by looping over all pages of the result table does not work or takes too long, the following method can be used instead.

The total number of records is obtained by counting the rows of the table in the query editor of the database website https://discodata.eea.europa.eu/#:
SELECT COUNT(*) AS total_records
FROM [WISE_SOE].[latest].[Waterbase_T_WISE3_MonitoringData]

This number then replaces the default value (100) of "nrOfHits" in the JSON URL for the monitoring dataset. That URL is generated by running the query SELECT * FROM [WISE_SOE].[latest].[Waterbase_T_WISE3_MonitoringData] in the same query editor.

In [11]:
monitoring_url_tot_rec = "https://discodata.eea.europa.eu/sql?query=SELECT%20COUNT(*)%20AS%20total_records%0AFROM%20%5BWISE_SOE%5D.%5Blatest%5D.%5BWaterbase_T_WISE3_MonitoringData%5D&p=1&nrOfHits=100&mail=null&schema=null"

monitoring_resp_tot_rec = urlopen(monitoring_url_tot_rec)
monitoring_tot_rec_raw = json.loads(monitoring_resp_tot_rec.read())
monitoring_tot_rec_raw2 = json_normalize(monitoring_tot_rec_raw["results"])
monitoring_tot_rec = monitoring_tot_rec_raw2.iloc[0].total_records
monitoring_tot_rec

4888878

In [12]:
nr_mon = monitoring_tot_rec

monitoring_url = "https://discodata.eea.europa.eu/sql?query=SELECT%20*%0AFROM%20%5BWISE_SOE%5D.%5Blatest%5D.%5BWaterbase_T_WISE3_MonitoringData%5D&p=1&nrOfHits={num_rec_mon}&mail=null&schema=null".format(num_rec_mon = nr_mon)
monitoring_response = urlopen(monitoring_url)
monitoring_response

<http.client.HTTPResponse at 0x1e12612f700>

In [None]:
monitoring_raw = json.loads(monitoring_response.read())
monitoring_df_raw = json_normalize(monitoring_raw["results"])
monitoring_df_raw
# JSONDecodeError: Expecting value: line 1 column 1 (char 0)
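The message `Expecting value: line 1 column 1 (char 0)` means that the response body was empty or did not start with a JSON value. A small sketch for inspecting such a body before giving up on it follows; `parse_json_or_explain` is a hypothetical helper and the two sample bodies are made up for illustration, not taken from the server.

```python
import json

def parse_json_or_explain(body, encoding="utf-8"):
    """Return (data, None) on success or (None, message) if `body` is not JSON."""
    text = body.decode(encoding, errors="replace")
    try:
        return json.loads(text), None
    except json.JSONDecodeError as err:
        # Show the start of the body so the actual server reply is visible.
        preview = text[:80].strip() or "<empty body>"
        return None, f"not JSON ({err.msg}); body starts with: {preview}"

ok_data, ok_error = parse_json_or_explain(b'{"results": [{"total_records": 5}]}')
bad_data, bad_error = parse_json_or_explain(b"<html>Server Error</html>")
print(bad_error)
```

In the notebook, the argument would be the bytes returned by `monitoring_response.read()`.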

In [None]:
nr_mon = monitoring_tot_rec

monitoring_url = "https://discodata.eea.europa.eu/sql?query=SELECT%20*%0AFROM%20%5BWISE_SOE%5D.%5Blatest%5D.%5BWaterbase_T_WISE3_MonitoringData%5D&p=1&nrOfHits={num_rec_mon}&mail=null&schema=null".format(num_rec_mon = nr_mon)
monitoring_response = urlopen(monitoring_url)
# urlopen returns an HTTPResponse object; read and decode its body before parsing it as JSON:
raw_data = monitoring_response.read()
encoding = monitoring_response.info().get_content_charset('utf8')
monitoring_raw = json.loads(raw_data.decode(encoding))

The http.client.HTTPResponse output above is only the response object returned by urlopen, so its body still has to be read and parsed. If parsing the JSON obtained from the URL keeps failing, the CSV file can be loaded instead, as follows.

In [16]:
monitoring_df_raw = pd.read_csv("Monitoring.csv")
monitoring_df_raw


Unnamed: 0,countryCode,monitoringSiteIdentifier,monitoringSiteIdentifierScheme,observedProperty,phenomenonTimePeriod,phenomenonTimePeriod_year,phenomenonTimePeriod_month,phenomenonTimePeriod_day,resultObservedValue,resultObservationStatus,Remarks,metadata_versionId,metadata_beginLifeSpanVersion,metadata_statusCode,metadata_observationStatus,metadata_statements,UID
0,AT,AT212753,eionetMonitoringSiteCode,SF,2007-05-08,2007,5.0,8.0,26.30,,,http://discomap.eea.europa.eu/data/wisesoe/der...,2015-11-30 00:00:00.000,experimental,A,,1
1,AT,AT212753,eionetMonitoringSiteCode,SF,2007-05-09,2007,5.0,9.0,25.80,,,http://discomap.eea.europa.eu/data/wisesoe/der...,2015-11-30 00:00:00.000,experimental,A,,2
2,AT,AT212753,eionetMonitoringSiteCode,SF,2007-05-10,2007,5.0,10.0,25.10,,,http://discomap.eea.europa.eu/data/wisesoe/der...,2015-11-30 00:00:00.000,experimental,A,,3
3,AT,AT212753,eionetMonitoringSiteCode,SF,2007-05-11,2007,5.0,11.0,24.60,,,http://discomap.eea.europa.eu/data/wisesoe/der...,2015-11-30 00:00:00.000,experimental,A,,4
4,AT,AT212753,eionetMonitoringSiteCode,SF,2007-05-12,2007,5.0,12.0,25.50,,,http://discomap.eea.europa.eu/data/wisesoe/der...,2015-11-30 00:00:00.000,experimental,A,,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4888873,SE,SE635140-128901,euMonitoringSiteCode,SF,2020-08-01,2020,8.0,1.0,11.40,,,https://cdr.eionet.europa.eu/se/eea/wise_soe/w...,2022-02-12 07:04:29.000,valid,A,,6766109
4888874,SE,SE635140-128901,euMonitoringSiteCode,SF,2020-09-01,2020,9.0,1.0,8.97,,,https://cdr.eionet.europa.eu/se/eea/wise_soe/w...,2022-02-12 07:04:29.000,valid,A,,6766110
4888875,SE,SE635140-128901,euMonitoringSiteCode,SF,2020-10-01,2020,10.0,1.0,38.80,,,https://cdr.eionet.europa.eu/se/eea/wise_soe/w...,2022-02-12 07:04:29.000,valid,A,,6766111
4888876,SE,SE635140-128901,euMonitoringSiteCode,SF,2020-11-01,2020,11.0,1.0,77.90,,,https://cdr.eionet.europa.eu/se/eea/wise_soe/w...,2022-02-12 07:04:29.000,valid,A,,6766112


In [142]:
records, attributes = monitoring_df_raw.shape
print(f"The monitoring dataset has {records} records and {attributes} attributes.")

The monitoring dataset has 4888878 records and 17 attributes.
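With almost 4.9 million rows, reading the file in one call can be slow or memory-hungry. One option, shown here only as a sketch, is to read the CSV in chunks and keep just the columns that are needed. The helper name `read_csv_in_chunks`, the chunk size, and the in-memory sample data are assumptions for illustration; the column names are those shown in the table above.

```python
import io
import pandas as pd

def read_csv_in_chunks(source, chunksize=100_000, usecols=None):
    """Read a CSV piece by piece and concatenate the pieces into one DataFrame."""
    chunks = pd.read_csv(source, chunksize=chunksize, usecols=usecols)
    return pd.concat(chunks, ignore_index=True)

# Tiny in-memory stand-in for "Monitoring.csv" (made-up values).
sample_csv = io.StringIO(
    "countryCode,observedProperty,resultObservedValue\n"
    "AT,SF,26.3\n"
    "AT,SF,25.8\n"
    "SE,SF,11.4\n"
)
sample_df = read_csv_in_chunks(
    sample_csv, chunksize=2, usecols=["countryCode", "resultObservedValue"]
)
print(sample_df.shape)
```

For the real file, the first argument would be "Monitoring.csv". The final DataFrame is the same as with a single `pd.read_csv` call restricted to the same columns.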


<a id = 'load_aggregated'></a>
#### 0.1.4 Loading aggregated data
[Top](#top)

In [None]:
aggregated_raw = {"results" : []}

p = 1
nr = 1000000

while True:
    aggreagated_url = "https://discodata.eea.europa.eu/sql?query=SELECT%20*%0AFROM%20%5BWISE_SOE%5D.%5Blatest%5D.%5BWaterbase_T_WISE6_AggregatedData%5D&p={page}&nrOfHits={num_record}&mail=null&schema=null".format(page = p, num_record = nr)
    aggregated_response = urlopen(aggreagated_url)
    json_data = json.loads(aggregated_response.read())
    
    if len(json_data.get("results", [])) == 0:
        break
    else:
        aggregated_raw['results'].extend(json_data.get("results", []))
    p = p + 1

# HTTP Error 500 !!

In [None]:
aggregated_df_raw = json_normalize(aggregated_raw["results"])

Since the request returns an HTTP 500 error, the following method to import the data can be tried.

The total number of records is obtained by counting the rows of the table in the query editor of the database website https://discodata.eea.europa.eu/#:
SELECT COUNT(*) AS total_records
FROM [WISE_SOE].[latest].[Waterbase_T_WISE6_AggregatedData]

This number then replaces the default value (100) of "nrOfHits" in the JSON URL for the aggregated dataset. That URL is generated by running the query SELECT * FROM [WISE_SOE].[latest].[Waterbase_T_WISE6_AggregatedData] in the same query editor.

In [18]:
aggreagated_url_tot_rec = "https://discodata.eea.europa.eu/sql?query=SELECT%20COUNT(*)%20AS%20total_records%0AFROM%20%5BWISE_SOE%5D.%5Blatest%5D.%5BWaterbase_T_WISE6_AggregatedData%5D&p=1&nrOfHits=100&mail=null&schema=null"

aggreagated_resp_tot_rec = urlopen(aggreagated_url_tot_rec)
aggregated_tot_rec_raw = json.loads(aggreagated_resp_tot_rec.read())
aggregated_tot_rec_raw2 = json_normalize(aggregated_tot_rec_raw["results"])
aggregated_tot_rec = aggregated_tot_rec_raw2.iloc[0].total_records
aggregated_tot_rec

4550559

In [None]:
nr_aggr = aggregated_tot_rec

aggreagated_url = "https://discodata.eea.europa.eu/sql?query=SELECT%20*%0AFROM%20%5BWISE_SOE%5D.%5Blatest%5D.%5BWaterbase_T_WISE6_AggregatedData%5D&p=1&nrOfHits={num_rec_aggr}&mail=null&schema=null".format(num_rec_aggr = nr_aggr)
aggregated_response = urlopen(aggreagated_url)
aggregated_raw = json.loads(aggregated_response.read())
aggregated_df_raw = json_normalize(aggregated_raw["results"])
aggregated_df_raw
# Error 500 internal server error

Since this attempt to access the database directly also returns a 500 error (an internal server error, which can only be resolved on the server side), the dataset is imported directly from a CSV file.
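The fallback from the online database to a local file can also be written explicitly, so that the notebook runs in either situation. This is a minimal sketch: `load_with_fallback` is a hypothetical helper, and the two loader functions in the demonstration are fakes that stand in for the URL request and for `pd.read_csv("Aggregated.csv")`.

```python
from urllib.error import HTTPError

def load_with_fallback(load_remote, load_local):
    """Try `load_remote()`; on an HTTP error, report it and use `load_local()`."""
    try:
        return load_remote(), "remote"
    except HTTPError as err:
        print(f"Remote request failed with HTTP {err.code}; using the local file.")
        return load_local(), "local"

# Offline demonstration: the fake remote loader always raises a 500 error.
def failing_remote():
    raise HTTPError("https://example.invalid/sql", 500, "Internal Server Error", None, None)

def fake_local():
    return [{"UID": 1}]  # stand-in for pd.read_csv("Aggregated.csv")

demo_data, demo_source = load_with_fallback(failing_remote, fake_local)
print(demo_source)
```

Returning the source label alongside the data makes it easy to record which path was actually used.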

In [20]:
aggregated_df_raw = pd.read_csv("Aggregated.csv")
aggregated_df_raw


Unnamed: 0,countryCode,monitoringSiteIdentifier,monitoringSiteIdentifierScheme,parameterWaterBodyCategory,observedPropertyDeterminandCode,observedPropertyDeterminandLabel,procedureAnalysedMatrix,resultUom,phenomenonTimeReferenceYear,parameterSamplingPeriod,...,procedureAnalyticalMethod,parameterSampleDepth,resultObservationStatus,remarks,metadata_versionId,metadata_beginLifeSpanVersion,metadata_statusCode,metadata_observationStatus,metadata_statements,UID
0,AL,AL1,eionetMonitoringSiteCode,LW,CAS_14265-44-2,Phosphate,W,mg{P}/L,2004,2004-01--2004-12,...,,-9999.0,,,http://discomap.eea.europa.eu/data/wisesoe/der...,2015-11-30 00:00:00.000,experimental,A,,1
1,AL,AL1,eionetMonitoringSiteCode,LW,CAS_14265-44-2,Phosphate,W,mg{P}/L,2005,2005-01--2005-12,...,,-9999.0,,,http://discomap.eea.europa.eu/data/wisesoe/der...,2015-11-30 00:00:00.000,experimental,A,,2
2,AL,AL1,eionetMonitoringSiteCode,LW,CAS_14265-44-2,Phosphate,W,mg{P}/L,2006,2006-01--2006-12,...,,-9999.0,,,http://discomap.eea.europa.eu/data/wisesoe/der...,2015-11-30 00:00:00.000,experimental,A,,3
3,AL,AL1,eionetMonitoringSiteCode,LW,CAS_14265-44-2,Phosphate,W,mg{P}/L,2007,2007-01--2007-12,...,,-9999.0,,,http://discomap.eea.europa.eu/data/wisesoe/der...,2015-11-30 00:00:00.000,experimental,A,,4
4,AL,AL1,eionetMonitoringSiteCode,LW,CAS_14797-55-8,Nitrate,W,mg{NO3}/L,2005,2005-01--2005-12,...,,-9999.0,,,http://discomap.eea.europa.eu/data/wisesoe/der...,2015-11-30 00:00:00.000,experimental,A,,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4550554,RO,RO85010,euMonitoringSiteCode,RW,CAS_7440-66-6,Zinc and its compounds,W-DIS,ug/L,2017,2017-01-01--2017-12-31,...,Other analytical method,0.0,,EN ISO 8288:2001,http://discomap.eea.europa.eu/data/wisesoe/der...,2019-08-29 07:54:03.000,experimental,A,,17834956
4550555,RO,RO85010,euMonitoringSiteCode,RW,CAS_7440-47-3,Chromium and its compounds,W-DIS,ug/L,2017,2017-01-01--2017-12-31,...,EN ISO 15586:2003,0.0,,,http://discomap.eea.europa.eu/data/wisesoe/der...,2019-08-29 07:54:03.000,experimental,A,,17834957
4550556,RO,RO85010,euMonitoringSiteCode,RW,CAS_7440-38-2,Arsenic and its compounds,W-DIS,ug/L,2017,2017-01-01--2017-12-31,...,EN ISO 15586:2003,0.0,,,http://discomap.eea.europa.eu/data/wisesoe/der...,2019-08-29 07:54:03.000,experimental,A,,17834958
4550557,RO,RO85010,euMonitoringSiteCode,RW,EEA_31-02-7,Total suspended solids,W,mg/L,2017,2017-01-01--2017-12-31,...,EN 872:2005,0.0,,,http://discomap.eea.europa.eu/data/wisesoe/der...,2019-08-29 07:54:03.000,experimental,A,,17834959


<a id = 'load_aggregated_waterbody'></a>
#### 0.1.5 Loading aggregated by water body data
[Top](#top)

In [27]:
aggregatedwater_raw = {"results":[]}
p = 1
nr = 1000
while True:
    eea_aggregatedwater_url = "https://discodata.eea.europa.eu/sql?query=SELECT%20*%20FROM%20%5BWISE_SOE%5D.%5Blatest%5D.%5BWaterbase_T_WISE6_AggregatedDataByWaterBody%5D&p={page}&nrOfHits={num_record}&mail=null&schema=null".format(page=p, num_record=nr)
    eea_aggregatedwater_response = urlopen(eea_aggregatedwater_url)
    json_data = json.loads(eea_aggregatedwater_response.read())
    if len(json_data.get("results", []))==0:
        break
    else:
        aggregatedwater_raw["results"].extend(json_data.get("results", []))
    p = p + 1

In [28]:
aggregatedwater_df_raw = json_normalize(aggregatedwater_raw["results"])

In [29]:
aggregatedwater_df_raw

Unnamed: 0,countryCode,waterBodyIdentifier,waterBodyIdentifierScheme,parameterWaterBodyCategory,observedPropertyDeterminandCode,observedPropertyDeterminandLabel,procedureAnalysedMatrix,resultUom,phenomenonTimeReferenceYear,parameterSamplingPeriod,...,resultNumberOfSitesClass4,resultNumberOfSitesClass5,resultObservationStatus,remarks,metadata_versionId,metadata_beginLifeSpanVersion,metadata_statusCode,metadata_observationStatus,metadata_statements,UID
0,AL,ALGW_011,eionetGroundWaterBodyCode,GW,CAS_14797-55-8,Nitrate,W,mg{NO3}/L,2004,2004-01-01--2004-12-31,...,0.0,,,,http://discomap.eea.europa.eu/data/wisesoe/der...,2015-11-30 00:00:00.000,experimental,A,NOTE_LEGACY: The resultNumberOfSamples changed...,1
1,AL,ALGW_011,eionetGroundWaterBodyCode,GW,CAS_14797-55-8,Nitrate,W,mg{NO3}/L,2005,2005-01-01--2005-12-31,...,0.0,,,,http://discomap.eea.europa.eu/data/wisesoe/der...,2015-11-30 00:00:00.000,experimental,A,NOTE_LEGACY: The resultNumberOfSamples changed...,2
2,AL,ALGW_011,eionetGroundWaterBodyCode,GW,CAS_14797-65-0,Nitrite,W,mg{NO2}/L,2005,2005-01-01--2005-12-31,...,1.0,0.0,,,http://discomap.eea.europa.eu/data/wisesoe/der...,2015-11-30 00:00:00.000,experimental,U,NOTE_LEGACY: The resultNumberOfSamples changed...,3
3,AL,ALGW_011,eionetGroundWaterBodyCode,GW,EEA_3132-01-2,Dissolved oxygen,W,mg/L,2005,2005-01-01--2005-12-31,...,,,,,http://discomap.eea.europa.eu/data/wisesoe/der...,2015-11-30 00:00:00.000,experimental,A,NOTE_LEGACY: The resultNumberOfSamples changed...,4
4,AL,ALGW_021,eionetGroundWaterBodyCode,GW,CAS_14797-65-0,Nitrite,W,mg{NO2}/L,2004,2004-01-01--2004-12-31,...,0.0,1.0,,,http://discomap.eea.europa.eu/data/wisesoe/der...,2015-11-30 00:00:00.000,experimental,A,NOTE_LEGACY: The resultNumberOfSamples changed...,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20612,UK,UK500,eionetGroundWaterBodyCode,GW,CAS_14798-03-9,Ammonium,W,mg{NH4}/L,2004,2004-01-01--2004-12-31,...,0.0,,,,http://discomap.eea.europa.eu/data/wisesoe/der...,2015-11-30 00:00:00.000,experimental,A,NOTE_LEGACY: The resultNumberOfSamples changed...,19099
20613,UK,UK500,eionetGroundWaterBodyCode,GW,EEA_3132-01-2,Dissolved oxygen,W,mg/L,2001,2001-01-01--2001-12-31,...,,,,,http://discomap.eea.europa.eu/data/wisesoe/der...,2015-11-30 00:00:00.000,experimental,A,NOTE_LEGACY: The resultNumberOfSamples changed...,19100
20614,UK,UK500,eionetGroundWaterBodyCode,GW,EEA_3132-01-2,Dissolved oxygen,W,mg/L,2002,2002-01-01--2002-12-31,...,,,,,http://discomap.eea.europa.eu/data/wisesoe/der...,2015-11-30 00:00:00.000,experimental,A,NOTE_LEGACY: The resultNumberOfSamples changed...,19101
20615,UK,UK500,eionetGroundWaterBodyCode,GW,EEA_3132-01-2,Dissolved oxygen,W,mg/L,2003,2003-01-01--2003-12-31,...,,,,,http://discomap.eea.europa.eu/data/wisesoe/der...,2015-11-30 00:00:00.000,experimental,A,NOTE_LEGACY: The resultNumberOfSamples changed...,19102


In [30]:
records, attributes = aggregatedwater_df_raw.shape
print(f"The aggregated by water bodies dataset has {records} records and {attributes} attributes.")

The aggregated by water bodies dataset has 20617 records and 35 attributes.


In [35]:
aggregatedwater_df_raw['parameterWaterBodyCategory'].unique()

array(['GW'], dtype=object)

If the previous method does not work, the following can be tried.

In [None]:
aggregatedwater_df_raw = aggregatedwater_raw

In [130]:
aggreagatedwater_url_tot_rec = "https://discodata.eea.europa.eu/sql?query=SELECT%20COUNT(*)%20AS%20total_records%0AFROM%20%5BWISE_SOE%5D.%5Blatest%5D.%5BWaterbase_T_WISE6_AggregatedDataByWaterBody%5D&p=1&nrOfHits=100&mail=null&schema=null"

aggreagatedwater_resp_tot_rec = urlopen(aggreagatedwater_url_tot_rec)
aggregatedwater_tot_rec_raw = json.loads(aggreagatedwater_resp_tot_rec.read())
aggregatedwater_tot_rec_raw2 = json_normalize(aggregatedwater_tot_rec_raw["results"])
aggregatedwater_tot_rec = aggregatedwater_tot_rec_raw2.iloc[0].total_records
aggregatedwater_tot_rec

20617

In [None]:
nr_aggrw = aggregatedwater_tot_rec

aggreagatedwater_url = "https://discodata.eea.europa.eu/sql?query=SELECT%20*%0AFROM%20%5BWISE_SOE%5D.%5Blatest%5D.%5BWaterbase_T_WISE6_AggregatedDataByWaterBody%5D&p=1&nrOfHits={num_rec_aggrw}&mail=null&schema=null".format(num_rec_aggrw = nr_aggrw)
aggregatedwater_response = urlopen(aggreagatedwater_url)
aggregatedwater_raw = json.loads(aggregatedwater_response.read())
aggregatedwater_df_raw = json_normalize(aggregatedwater_raw["results"])
aggregatedwater_df_raw

##### NOTE
Since this dataset covers only groundwater bodies (GW), it will not be taken into account in the analysis.

<a id = 'web_scraping'></a>
### 0.2 Web scraping from Discodata EEA
[Top](#top)

In [None]:
# Install the required libraries
!pip install selenium
"""
Run if not installed yet
!pip install BeautifulSoup4
!pip install requests
!pip install pandas
!pip install lxml
"""

In [102]:
# Import the required libraries
from bs4 import BeautifulSoup
import requests
import lxml
from selenium import webdriver
import time

In [18]:
# Request the page and check the response status
url_emissions = "https://discomap.eea.europa.eu/App/DiscodataViewer/?fqn=[WISE_SOE].[v2r1].[Waterbase_T_WISE1_Emissions]#"
requests.get(url_emissions)

<Response [200]>

In [None]:
# A 200 response means the request succeeded,
# so the page content can be read as text
text_emissions = requests.get(url_emissions)
text_emissions.text

In [None]:
# Use a parser to change the html code into py-friendly text
soup_emissions = BeautifulSoup(text_emissions.text, 'lxml')
soup_emissions

In [29]:
# Inspect the html code for 'table'
table_emissions = soup_emissions.find('table', class_ = 'table table-bordered table-sm ')
table_emissions

In [30]:
# Create a list with all the column names
headers = []
for i in table_emissions.find_all('th'):
    title = i.text
    headers.append(title)

AttributeError: 'NoneType' object has no attribute 'find_all'

In [31]:
soup_emissions.find_all('table')

[]
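An empty result is consistent with a page whose table is inserted by JavaScript after loading: the HTML returned by `requests` then contains no `<table>` element at all. The sketch below shows one way to check this with the standard library only. The `TagCounter` class and the two sample pages are made up for illustration and are not the actual Discodata markup.

```python
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    """Count how often each start tag occurs in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1

def count_tags(html_text):
    counter = TagCounter()
    counter.feed(html_text)
    return counter.counts

# Made-up pages: one rendered on the server, one filled in later by a script.
static_page = "<html><body><table><tr><th>UID</th></tr></table></body></html>"
dynamic_page = "<html><body><div id='app'></div><script src='app.js'></script></body></html>"

print(count_tags(static_page).get("table", 0), count_tags(dynamic_page).get("table", 0))
```

In the notebook, `count_tags(text_emissions.text)` would show which tags the downloaded page really contains. If only containers and scripts appear, a browser-driven tool is needed to obtain the rendered table.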

#### Using Selenium for dynamic HTML pages

In [13]:
url_emissions = "https://discomap.eea.europa.eu/App/DiscodataViewer/?fqn=[WISE_SOE].[v2r1].[Waterbase_T_WISE1_Emissions]#"
driver = webdriver.Chrome()  # do not rebind the name `webdriver`, which is the imported module
driver.get(url_emissions)
#time.sleep(2)


In [7]:
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(5)
html_emissions = BeautifulSoup(driver.page_source,'lxml')

table_emissions = html_emissions.find('table', class_ = 'table table-bordered table-sm ')

print(table_emissions.prettify())

AttributeError: 'NoneType' object has no attribute 'prettify'

In [41]:
print(table_emissions)

None


<a id = 'exploratory_analysis'></a>
## 1. Exploratory analysis of the datasets
[Top](#top)

[0.](#import_dataset)


<a id = 'explore_spatial'></a>
### 1.1 Exploratory analysis of spatial dataset
[Top](#top)

In [38]:
spatial_raw.columns

Index(['countryCode', 'thematicIdIdentifier', 'thematicIdIdentifierScheme',
       'monitoringSiteIdentifier', 'monitoringSiteIdentifierScheme',
       'monitoringSiteName', 'waterBodyIdentifier',
       'waterBodyIdentifierScheme', 'waterBodyName', 'specialisedZoneType',
       'naturalAWBHMWB', 'reservoir', 'surfaceWaterBodyTypeCode',
       'subUnitIdentifier', 'subUnitIdentifierScheme', 'subUnitName',
       'rbdIdentifier', 'rbdIdentifierScheme', 'rbdName',
       'confidentialityStatus', 'lon', 'lat', 'statusCode'],
      dtype='object')

In [39]:
spatial_raw['thematicIdIdentifier'].equals(spatial_raw['monitoringSiteIdentifier'])

False
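`Series.equals` only reports whether the two columns are identical as a whole. To see which rows differ, an element-wise comparison can be used. The sketch below runs on a made-up toy frame with the same two column names; in the notebook the frame would be `spatial_raw`.

```python
import pandas as pd

# Made-up identifiers for illustration only.
toy = pd.DataFrame({
    "thematicIdIdentifier": ["FR1", "IT2", None, "UK4"],
    "monitoringSiteIdentifier": ["FR1", "IT9", None, None],
})

left = toy["thematicIdIdentifier"]
right = toy["monitoringSiteIdentifier"]

# Treat two missing values in the same row as equal, as Series.equals does.
same = (left == right) | (left.isna() & right.isna())
mismatches = toy[~same]
print(mismatches)
```

The number of mismatching rows indicates whether the two identifier columns differ only in a few records or systematically.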

In [40]:
spatial = spatial_raw[spatial_raw['confidentialityStatus'] == 'F'] # Keep only records with confidentiality status 'F' (free).

In [41]:
spatial['countryCode'].unique()

array(['FR', 'IT', 'HR', 'PL', 'PT', 'UK', 'HU', 'AL', 'IE', 'AT', 'BA',
       'BG', 'CY', 'IS', 'DE', 'EL', 'FI', 'LT', 'NL', 'LU', 'LV', 'MT',
       'NO', 'CH', 'DK', 'RO', 'SI', 'ES', 'SE', 'SK', 'BE', 'CZ', 'EE',
       'LI', 'ME', 'MK', 'XK', 'RS', 'TR'], dtype=object)

In [42]:
spatial['thematicIdIdentifier'].unique()

array(['FRFR05234020', 'FRFR05236100', 'FRFR05237000', ...,
       'FRFR05234000', 'FRFR05234015', 'FRFR05234019'], dtype=object)

In [43]:
spatial['waterBodyIdentifier'].unique()

array(['FRFR326A', 'FRFR236', 'FRFR273', ..., 'FRFRR238_1', 'FRFR235A',
       'FRFR235B'], dtype=object)

#### Null values
After noticing null values among the country codes, it is necessary to detect the rows that contain null or NaN values and exclude them from the dataset.

In [44]:
spatial[spatial['countryCode'].isna()] # Select all rows with NaN under a single DataFrame column

Unnamed: 0,countryCode,thematicIdIdentifier,thematicIdIdentifierScheme,monitoringSiteIdentifier,monitoringSiteIdentifierScheme,monitoringSiteName,waterBodyIdentifier,waterBodyIdentifierScheme,waterBodyName,specialisedZoneType,...,subUnitIdentifier,subUnitIdentifierScheme,subUnitName,rbdIdentifier,rbdIdentifierScheme,rbdName,confidentialityStatus,lon,lat,statusCode


In [45]:
spatial_lon_na = spatial[spatial['lon'].isna()] # Rows in which 'lon' values are null
spatial_lon_na['countryCode'].unique()          # Which countries do not have lon values for certain water bodies

array(['AL', 'AT', 'BA', 'EL', 'FR', 'HR', 'HU', 'IE', 'IT', 'PT', 'FI',
       'SE', 'CY', 'BG', 'DE', 'DK', 'EE', 'ES', 'IS', 'LT', 'LU', 'LV',
       'ME', 'NO', 'UK', 'RO', 'RS', 'SK', 'TR'], dtype=object)

In [46]:
spatial_lat_na = spatial[spatial['lat'].isna()]
spatial_lat_na['countryCode'].unique()

array(['AL', 'AT', 'BA', 'EL', 'FR', 'HR', 'HU', 'IE', 'IT', 'PT', 'FI',
       'SE', 'CY', 'BG', 'DE', 'DK', 'EE', 'ES', 'IS', 'LT', 'LU', 'LV',
       'ME', 'NO', 'UK', 'RO', 'RS', 'SK', 'TR'], dtype=object)

In [47]:
spatial[spatial.isnull().any(axis=1)]

Unnamed: 0,countryCode,thematicIdIdentifier,thematicIdIdentifierScheme,monitoringSiteIdentifier,monitoringSiteIdentifierScheme,monitoringSiteName,waterBodyIdentifier,waterBodyIdentifierScheme,waterBodyName,specialisedZoneType,...,subUnitIdentifier,subUnitIdentifierScheme,subUnitName,rbdIdentifier,rbdIdentifierScheme,rbdName,confidentialityStatus,lon,lat,statusCode
66,IT,IT02PO29,euMonitoringSiteCode,IT02PO29,euMonitoringSiteCode,STAB. SIMA,IT0201VA,euGroundWaterBodyCode,PIANA DI AOSTA,groundWaterBody,...,,,,ITB2018,euRBDCode,RBD FIUME PO,F,7.36693,45.73500,stable
67,IT,IT02PO34,euMonitoringSiteCode,IT02PO34,euMonitoringSiteCode,GRAND PLACE,IT0201VA,euGroundWaterBodyCode,PIANA DI AOSTA,groundWaterBody,...,,,,ITB2018,euRBDCode,RBD FIUME PO,F,7.35605,45.73558,stable
107,IT,IT02-PO9,euMonitoringSiteCode,IT02-PO9,euMonitoringSiteCode,BIRRERIA,IT0201VA,euGroundWaterBodyCode,PIANA DI AOSTA,groundWaterBody,...,,,,ITB2018,euRBDCode,RBD FIUME PO,F,7.36670,45.73610,deprecated
127,IT,IT13SA13P,euMonitoringSiteCode,IT13SA13P,euMonitoringSiteCode,SA13,IT13SA,euGroundWaterBodyCode,PIANA DEL SANGRO,groundWaterBody,...,,,,ITE2018,euRBDCode,RBD APPENNINO CENTRALE,F,14.50610,42.24105,stable
187,IT,IT13SA21P,euMonitoringSiteCode,IT13SA21P,euMonitoringSiteCode,SA21,IT13SA,euGroundWaterBodyCode,PIANA DEL SANGRO,groundWaterBody,...,,,,ITE2018,euRBDCode,RBD APPENNINO CENTRALE,F,14.44058,42.14367,stable
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
72877,IT,IT13SU43P,euMonitoringSiteCode,IT13SU43P,euMonitoringSiteCode,SU43,IT13SU,euGroundWaterBodyCode,PIANA DI SULMONA,groundWaterBody,...,,,,ITE2018,euRBDCode,RBD APPENNINO CENTRALE,F,13.80745,42.09816,stable
72887,UK,UKEW81120002,euMonitoringSiteCode,UKEW81120002,euMonitoringSiteCode,81120002,UKGB40802G806700,euGroundWaterBodyCode,TAMAR,groundWaterBody,...,,,,UK08,euRBDCode,SOUTH WEST,F,-4.07000,50.47000,deprecated
72914,IT,IT13PE7P,euMonitoringSiteCode,IT13PE7P,euMonitoringSiteCode,PE7,IT13PE,euGroundWaterBodyCode,PIANA DEL PESCARA,groundWaterBody,...,,,,ITE2018,euRBDCode,RBD APPENNINO CENTRALE,F,14.04828,42.30399,stable
72915,IT,IT13PE80P,euMonitoringSiteCode,IT13PE80P,euMonitoringSiteCode,PE80,IT13PE,euGroundWaterBodyCode,PIANA DEL PESCARA,groundWaterBody,...,,,,ITE2018,euRBDCode,RBD APPENNINO CENTRALE,F,14.15944,42.42583,stable


In [48]:
spatial.isnull().sum().sum() # Total null values

192972
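The grand total above does not show which columns carry the nulls. A minimal sketch of a per-column breakdown, assuming a small toy DataFrame (`toy_spatial`, with illustrative values only) in place of `spatial`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the spatial DataFrame (values are illustrative only).
toy_spatial = pd.DataFrame({
    "countryCode": ["IT", "IT", "UK", "FR"],
    "subUnitIdentifier": [np.nan, np.nan, np.nan, "FRF"],
    "lon": [7.36, np.nan, -4.07, 0.03],
    "lat": [45.73, np.nan, 50.47, 43.46],
})

# Count and percentage of nulls per column, largest first.
null_summary = pd.DataFrame({
    "n_null": toy_spatial.isna().sum(),
    "pct_null": (toy_spatial.isna().mean() * 100).round(1),
}).sort_values("n_null", ascending=False)

print(null_summary)
```

A table like this can help decide which columns are worth passing to `dropna(subset=...)` and which are sparse enough to leave alone.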

In [49]:
spatial = spatial.dropna(subset = ['lon', 'lat'])

In [50]:
spatial['waterBodyIdentifier'].isna().sum()

706

In [51]:
spatial = spatial.dropna(subset = ['waterBodyIdentifier'])

In [52]:
spatial.isnull().sum().sum()

153282

In [53]:
# Alternative (disabled): drop every row that contains at least one NaN
"""
spatial.dropna(inplace=True)
spatial.isnull().sum().sum() # Total null values after the cleaning"""


'\nspatial.dropna(inplace=True)\nspatial.isnull().sum().sum() # Total null values after the cleaning'

In [54]:
spatial['countryCode'].unique() # Unique values of countries after the cleaning

array(['FR', 'IT', 'HR', 'PL', 'PT', 'UK', 'HU', 'IE', 'BA', 'BG', 'CY',
       'IS', 'DE', 'LT', 'NL', 'LU', 'LV', 'MT', 'NO', 'CH', 'DK', 'RO',
       'SI', 'SE', 'SK', 'AL', 'AT', 'BE', 'CZ', 'EE', 'EL', 'ES', 'FI',
       'LI', 'ME', 'MK', 'XK', 'RS', 'TR'], dtype=object)

In [55]:
spatial['specialisedZoneType'].unique()

array(['riverWaterBody', 'transitionalWaterBody', 'lakeWaterBody',
       'groundWaterBody', 'coastalWaterBody', 'territorialWaters'],
      dtype=object)

#### Duplicates
The next step is to check for duplicated rows.

In [25]:
spatial[spatial.duplicated()] # Shows only the repeated occurrences; the first occurrence of each row is not marked.

Unnamed: 0,countryCode,thematicIdIdentifier,thematicIdIdentifierScheme,monitoringSiteIdentifier,monitoringSiteIdentifierScheme,monitoringSiteName,waterBodyIdentifier,waterBodyIdentifierScheme,waterBodyName,specialisedZoneType,...,subUnitIdentifier,subUnitIdentifierScheme,subUnitName,rbdIdentifier,rbdIdentifierScheme,rbdName,confidentialityStatus,lon,lat,statusCode
3384,IT,IT020561VA1,euMonitoringSiteCode,IT020561VA1,euMonitoringSiteCode,DOIRE DE LA THUILE - CHAZ PONTAILLE,IT020561VA,euSurfaceWaterBodyCode,DOIRE DE LA THUILE,riverWaterBody,...,ITN008,euSubUnitCode,SU PO,ITB2018,euRBDCode,RBD FIUME PO,F,6.90119,45.70563,stable
3389,IT,IT020570081VA1,euMonitoringSiteCode,IT020570081VA1,euMonitoringSiteCode,DOIRE DE FERRET - GREUVETTAZ,IT020570081VA,euSurfaceWaterBodyCode,DOIRE DE VAL FERRET,riverWaterBody,...,ITN008,euSubUnitCode,SU PO,ITB2018,euRBDCode,RBD FIUME PO,F,7.04696,45.86905,stable
3391,IT,IT020570081VA2,euMonitoringSiteCode,IT020570081VA2,euMonitoringSiteCode,DOIRE DE FERRET - PLANPINCIEUX,IT020570081VA,euSurfaceWaterBodyCode,DOIRE DE VAL FERRET,riverWaterBody,...,ITN008,euSubUnitCode,SU PO,ITB2018,euRBDCode,RBD FIUME PO,F,6.98846,45.83103,stable
3393,IT,IT020570082VA1,euMonitoringSiteCode,IT020570082VA1,euMonitoringSiteCode,DOIRE DE FERRET - FOCE,IT020570082VA,euSurfaceWaterBodyCode,DOIRE DE VAL FERRET,riverWaterBody,...,ITN008,euSubUnitCode,SU PO,ITB2018,euRBDCode,RBD FIUME PO,F,6.96235,45.80982,stable
3422,IT,IT020802VA1,euMonitoringSiteCode,IT020802VA1,euMonitoringSiteCode,SAINT-BARTHÏ¿½LEMY - PONTE PIERREY,IT020802VA,euSurfaceWaterBodyCode,TORRENT DE SAINT-BARTH?LEMY,riverWaterBody,...,ITN008,euSubUnitCode,SU PO,ITB2018,euRBDCode,RBD FIUME PO,F,7.51992,45.80641,stable
3426,IT,IT020804VA1,euMonitoringSiteCode,IT020804VA1,euMonitoringSiteCode,SAINT-BARTHÏ¿½LEMY - FOCE,IT020804VA,euSurfaceWaterBodyCode,TORRENT DE SAINT-BARTH?LEMY,riverWaterBody,...,ITN008,euSubUnitCode,SU PO,ITB2018,euRBDCode,RBD FIUME PO,F,7.4595,45.73625,stable
3446,IT,IT020941VA2,euMonitoringSiteCode,IT020941VA2,euMonitoringSiteCode,EVANÏ¿½ON - MONTE VERRAZ,IT020941VA,euSurfaceWaterBodyCode,TORRENT EVENSON,riverWaterBody,...,ITN008,euSubUnitCode,SU PO,ITB2018,euRBDCode,RBD FIUME PO,F,7.74147,45.88489,stable
3460,IT,IT021041VA1,euMonitoringSiteCode,IT021041VA1,euMonitoringSiteCode,LYS - GRENNE,IT021041VA,euSurfaceWaterBodyCode,TORRENT LYS,riverWaterBody,...,ITN008,euSubUnitCode,SU PO,ITB2018,euRBDCode,RBD FIUME PO,F,7.81165,45.86369,stable
3702,IT,IT03N0080563IR1,euMonitoringSiteCode,IT03N0080563IR1,euMonitoringSiteCode,MARMIROLO,IT03N0080564LO,euSurfaceWaterBodyCode,MINCIO (FIUME),riverWaterBody,...,ITN008,euSubUnitCode,SU PO,ITB2018,euRBDCode,RBD FIUME PO,F,10.70809,45.30345,stable
3720,IT,IT03N00806000412IR1,euMonitoringSiteCode,IT03N00806000412IR1,euMonitoringSiteCode,BARGHE,IT03N00806000412LO,euSurfaceWaterBodyCode,CHIESE (FIUME),riverWaterBody,...,ITN008,euSubUnitCode,SU PO,ITB2018,euRBDCode,RBD FIUME PO,F,10.40019,45.68847,stable


In [26]:
spatial[spatial.duplicated(keep = False)] # Shows every row that has a duplicate, including the first occurrence.

Unnamed: 0,countryCode,thematicIdIdentifier,thematicIdIdentifierScheme,monitoringSiteIdentifier,monitoringSiteIdentifierScheme,monitoringSiteName,waterBodyIdentifier,waterBodyIdentifierScheme,waterBodyName,specialisedZoneType,...,subUnitIdentifier,subUnitIdentifierScheme,subUnitName,rbdIdentifier,rbdIdentifierScheme,rbdName,confidentialityStatus,lon,lat,statusCode
3383,IT,IT020561VA1,euMonitoringSiteCode,IT020561VA1,euMonitoringSiteCode,DOIRE DE LA THUILE - CHAZ PONTAILLE,IT020561VA,euSurfaceWaterBodyCode,DOIRE DE LA THUILE,riverWaterBody,...,ITN008,euSubUnitCode,SU PO,ITB2018,euRBDCode,RBD FIUME PO,F,6.90119,45.70563,stable
3384,IT,IT020561VA1,euMonitoringSiteCode,IT020561VA1,euMonitoringSiteCode,DOIRE DE LA THUILE - CHAZ PONTAILLE,IT020561VA,euSurfaceWaterBodyCode,DOIRE DE LA THUILE,riverWaterBody,...,ITN008,euSubUnitCode,SU PO,ITB2018,euRBDCode,RBD FIUME PO,F,6.90119,45.70563,stable
3388,IT,IT020570081VA1,euMonitoringSiteCode,IT020570081VA1,euMonitoringSiteCode,DOIRE DE FERRET - GREUVETTAZ,IT020570081VA,euSurfaceWaterBodyCode,DOIRE DE VAL FERRET,riverWaterBody,...,ITN008,euSubUnitCode,SU PO,ITB2018,euRBDCode,RBD FIUME PO,F,7.04696,45.86905,stable
3389,IT,IT020570081VA1,euMonitoringSiteCode,IT020570081VA1,euMonitoringSiteCode,DOIRE DE FERRET - GREUVETTAZ,IT020570081VA,euSurfaceWaterBodyCode,DOIRE DE VAL FERRET,riverWaterBody,...,ITN008,euSubUnitCode,SU PO,ITB2018,euRBDCode,RBD FIUME PO,F,7.04696,45.86905,stable
3390,IT,IT020570081VA2,euMonitoringSiteCode,IT020570081VA2,euMonitoringSiteCode,DOIRE DE FERRET - PLANPINCIEUX,IT020570081VA,euSurfaceWaterBodyCode,DOIRE DE VAL FERRET,riverWaterBody,...,ITN008,euSubUnitCode,SU PO,ITB2018,euRBDCode,RBD FIUME PO,F,6.98846,45.83103,stable
3391,IT,IT020570081VA2,euMonitoringSiteCode,IT020570081VA2,euMonitoringSiteCode,DOIRE DE FERRET - PLANPINCIEUX,IT020570081VA,euSurfaceWaterBodyCode,DOIRE DE VAL FERRET,riverWaterBody,...,ITN008,euSubUnitCode,SU PO,ITB2018,euRBDCode,RBD FIUME PO,F,6.98846,45.83103,stable
3392,IT,IT020570082VA1,euMonitoringSiteCode,IT020570082VA1,euMonitoringSiteCode,DOIRE DE FERRET - FOCE,IT020570082VA,euSurfaceWaterBodyCode,DOIRE DE VAL FERRET,riverWaterBody,...,ITN008,euSubUnitCode,SU PO,ITB2018,euRBDCode,RBD FIUME PO,F,6.96235,45.80982,stable
3393,IT,IT020570082VA1,euMonitoringSiteCode,IT020570082VA1,euMonitoringSiteCode,DOIRE DE FERRET - FOCE,IT020570082VA,euSurfaceWaterBodyCode,DOIRE DE VAL FERRET,riverWaterBody,...,ITN008,euSubUnitCode,SU PO,ITB2018,euRBDCode,RBD FIUME PO,F,6.96235,45.80982,stable
3421,IT,IT020802VA1,euMonitoringSiteCode,IT020802VA1,euMonitoringSiteCode,SAINT-BARTHÏ¿½LEMY - PONTE PIERREY,IT020802VA,euSurfaceWaterBodyCode,TORRENT DE SAINT-BARTH?LEMY,riverWaterBody,...,ITN008,euSubUnitCode,SU PO,ITB2018,euRBDCode,RBD FIUME PO,F,7.51992,45.80641,stable
3422,IT,IT020802VA1,euMonitoringSiteCode,IT020802VA1,euMonitoringSiteCode,SAINT-BARTHÏ¿½LEMY - PONTE PIERREY,IT020802VA,euSurfaceWaterBodyCode,TORRENT DE SAINT-BARTH?LEMY,riverWaterBody,...,ITN008,euSubUnitCode,SU PO,ITB2018,euRBDCode,RBD FIUME PO,F,7.51992,45.80641,stable


In [27]:
spatial = spatial.drop_duplicates() # Drop all duplicated rows, keeping the first occurrence.
spatial[spatial.duplicated()]       # Check that no duplicates remain in the df.

Unnamed: 0,countryCode,thematicIdIdentifier,thematicIdIdentifierScheme,monitoringSiteIdentifier,monitoringSiteIdentifierScheme,monitoringSiteName,waterBodyIdentifier,waterBodyIdentifierScheme,waterBodyName,specialisedZoneType,...,subUnitIdentifier,subUnitIdentifierScheme,subUnitName,rbdIdentifier,rbdIdentifierScheme,rbdName,confidentialityStatus,lon,lat,statusCode
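`duplicated()` without arguments only flags rows that are identical in every column. As a hedged sketch, the same method with `subset` can reveal rows that share an identifier but differ elsewhere. Whether such rows exist in the real data is not shown in this notebook, so the toy DataFrame `toy_sites` below is purely illustrative:

```python
import pandas as pd

# Toy data: A1 is an exact duplicate, B2 shares the identifier but differs in status.
toy_sites = pd.DataFrame({
    "monitoringSiteIdentifier": ["A1", "A1", "B2", "B2", "C3"],
    "statusCode": ["stable", "stable", "stable", "deprecated", "stable"],
})

# Rows identical in every column (all occurrences shown).
exact_dupes = toy_sites[toy_sites.duplicated(keep=False)]

# Rows sharing the same site identifier, even if other columns differ.
key_dupes = toy_sites[toy_sites.duplicated(subset=["monitoringSiteIdentifier"], keep=False)]

print(len(exact_dupes), len(key_dupes))
```

If the second count is larger than the first, `drop_duplicates()` alone will leave several rows per identifier in place.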


In [28]:
# Select only the necessary columns to reduce the df size. (Can be revised later, after checking all the dfs.)
spatial = spatial[['countryCode', 'thematicIdIdentifier', 'monitoringSiteIdentifier', 'monitoringSiteIdentifierScheme',
                   'monitoringSiteName', 'waterBodyIdentifier', 'waterBodyIdentifierScheme', 'waterBodyName',
                   'specialisedZoneType', 'naturalAWBHMWB', 'reservoir', 'surfaceWaterBodyTypeCode', 'subUnitIdentifier',
                   'rbdIdentifier', 'rbdName', 'lon', 'lat', 'statusCode']]

In [89]:
records2, attributes2 = spatial.shape
print(f"The cleaned spatial df has {records2} records and {attributes2} attributes")

The cleaned spatial df has 54343 records and 23 attributes


In [90]:
spatial.to_csv("spatial_cleaned.csv") # Save the cleaned df in csv format.
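By default `to_csv` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again. A small sketch of the difference, using an in-memory buffer and a toy DataFrame instead of the real file (both are assumptions made for illustration):

```python
import io
import pandas as pd

toy = pd.DataFrame({"countryCode": ["FR", "IT"], "lon": [0.03117, 7.36693]})

# Default behaviour: the index is written as an unnamed first column.
buf_with_index = io.StringIO()
toy.to_csv(buf_with_index)
buf_with_index.seek(0)
reloaded_with_index = pd.read_csv(buf_with_index)

# With index=False only the data columns are written.
buf_no_index = io.StringIO()
toy.to_csv(buf_no_index, index=False)
buf_no_index.seek(0)
reloaded_no_index = pd.read_csv(buf_no_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_no_index.columns))
```

Passing `index=False` is one way to keep the saved file free of that extra column; keeping the index is equally valid if the original row numbers are needed later.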

<a id = 'explore_emissions'></a>
### 1.2 Exploratory analysis of emissions dataset
[Top](#top)

[Monitoring](#explore_monitoring)

In [67]:
emissions_df_raw.columns

Index(['countryCode', 'spatialUnitIdentifier', 'spatialUnitIdentifierScheme',
       'phenomenonTimeReferencePeriod', 'observedPropertyDeterminandCode',
       'observedPropertyDeterminandLabel', 'parameterEmissionsSourceCategory',
       'parameterEPRTRfacilities', 'resultEmissionsValue',
       'resultEmissionsUom', 'procedureEmissionsMethod',
       'resultObservationStatus', 'Remarks', 'metadata_versionId',
       'metadata_beginLifeSpanVersion', 'metadata_statusCode',
       'metadata_observationStatus', 'metadata_statements', 'UID'],
      dtype='object')

#### Null values

In [68]:
emissions_df_raw.isnull().sum().sum()

299218

In [69]:
emissions_df_raw[emissions_df_raw.isnull().any(axis=1)]

Unnamed: 0,countryCode,spatialUnitIdentifier,spatialUnitIdentifierScheme,phenomenonTimeReferencePeriod,observedPropertyDeterminandCode,observedPropertyDeterminandLabel,parameterEmissionsSourceCategory,parameterEPRTRfacilities,resultEmissionsValue,resultEmissionsUom,procedureEmissionsMethod,resultObservationStatus,Remarks,metadata_versionId,metadata_beginLifeSpanVersion,metadata_statusCode,metadata_observationStatus,metadata_statements,UID
0,AT,AT1000,euRBDCode,2016,CAS_7439-92-1,Lead and its compounds,I,yes,759.500000,kg/a,calculated,X,data derived from EPRTR by ETC,http://discomap.eea.europa.eu/data/wisesoe/der...,2020-06-08 00:00:00.000,experimental,A,,137076
1,AT,AT1000,euRBDCode,2016,CAS_7439-92-1,Lead and its compounds,U2,yes,280.000000,kg/a,calculated,X,data derived from EPRTR by ETC,http://discomap.eea.europa.eu/data/wisesoe/der...,2020-06-08 00:00:00.000,experimental,A,,137077
2,AT,AT1000,euRBDCode,2016,CAS_7439-97-6,Mercury and its compounds,I,yes,5.290000,kg/a,calculated,X,data derived from EPRTR by ETC,http://discomap.eea.europa.eu/data/wisesoe/der...,2020-06-08 00:00:00.000,experimental,A,,137078
3,AT,AT1000,euRBDCode,2016,CAS_7440-02-0,Nickel and its compounds,I,yes,2568.300000,kg/a,calculated,X,data derived from EPRTR by ETC,http://discomap.eea.europa.eu/data/wisesoe/der...,2020-06-08 00:00:00.000,experimental,A,,137080
4,AT,AT1000,euRBDCode,2016,CAS_7440-02-0,Nickel and its compounds,U2,yes,3690.000000,kg/a,calculated,X,data derived from EPRTR by ETC,http://discomap.eea.europa.eu/data/wisesoe/der...,2020-06-08 00:00:00.000,experimental,A,,137081
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
103280,XK,XK,countryCode,2019,EEA_31615-01-7,Total nitrogen,U22,no,9.477000,t/a,measured,A,,https://cdr.eionet.europa.eu/xk/eea/wise_soe/w...,2021-01-13 09:43:12.000,stable,A,,162583
103281,XK,XK,countryCode,2020,EEA_31-02-7,Total suspended solids,U22,no,6.959722,t/a,measured,A,,https://cdr.eionet.europa.eu/xk/eea/wise_soe/w...,2022-01-12 23:18:02.000,stable,A,,179787
103282,XK,XK,countryCode,2020,EEA_3133-01-5,BOD5,U22,no,3.117451,t/a,measured,A,,https://cdr.eionet.europa.eu/xk/eea/wise_soe/w...,2022-01-12 23:18:02.000,stable,A,,179788
103283,XK,XK,countryCode,2020,EEA_3133-03-7,CODCr,U22,no,55.646105,t/a,measured,A,,https://cdr.eionet.europa.eu/xk/eea/wise_soe/w...,2022-01-12 23:18:02.000,stable,A,,179789


In [70]:
emissions_df_raw = emissions_df_raw.drop(['metadata_versionId',
                                           'metadata_beginLifeSpanVersion', 'metadata_statusCode',
                                           'metadata_observationStatus', 'metadata_statements'], axis = 1)

In [71]:
emissions_df_raw[emissions_df_raw.isnull().any(axis=1)]

Unnamed: 0,countryCode,spatialUnitIdentifier,spatialUnitIdentifierScheme,phenomenonTimeReferencePeriod,observedPropertyDeterminandCode,observedPropertyDeterminandLabel,parameterEmissionsSourceCategory,parameterEPRTRfacilities,resultEmissionsValue,resultEmissionsUom,procedureEmissionsMethod,resultObservationStatus,Remarks,UID
107,AT,AT1001,euSubUnitCode,2004--2007,CAS_7723-14-0,Total phosphorus,NP,,4722.000000,t/a,,,,169087
108,AT,AT1001,euSubUnitCode,2004--2007,CAS_7723-14-0,Total phosphorus,NP1,,1529.000000,t/a,,,,169088
109,AT,AT1001,euSubUnitCode,2004--2007,CAS_7723-14-0,Total phosphorus,NP2,,44.000000,t/a,,,,169089
110,AT,AT1001,euSubUnitCode,2004--2007,CAS_7723-14-0,Total phosphorus,NP4,,299.000000,t/a,,,,169090
111,AT,AT1001,euSubUnitCode,2004--2007,EEA_31615-01-7,Total nitrogen,NP,,83218.000000,t/a,,,,169091
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
103280,XK,XK,countryCode,2019,EEA_31615-01-7,Total nitrogen,U22,no,9.477000,t/a,measured,A,,162583
103281,XK,XK,countryCode,2020,EEA_31-02-7,Total suspended solids,U22,no,6.959722,t/a,measured,A,,179787
103282,XK,XK,countryCode,2020,EEA_3133-01-5,BOD5,U22,no,3.117451,t/a,measured,A,,179788
103283,XK,XK,countryCode,2020,EEA_3133-03-7,CODCr,U22,no,55.646105,t/a,measured,A,,179789


In [72]:
emissions_df_raw['countryCode'].unique()
emissions_df_raw['spatialUnitIdentifier'].unique()
emissions_df_raw['observedPropertyDeterminandCode'].unique()
emissions_df_raw['resultEmissionsValue'].unique()
emissions_df_raw['resultEmissionsUom'].unique()
emissions_df_raw['parameterEmissionsSourceCategory'].unique()
emissions_df_raw['spatialUnitIdentifierScheme'].unique()

array(['euRBDCode', 'euSubUnitCode', 'eionetSubUnitCode', 'countryCode',
       'eionetRBDCode'], dtype=object)

In [73]:
emissions_df_raw = emissions_df_raw.dropna(subset = ['observedPropertyDeterminandCode', 'resultEmissionsValue', 'resultEmissionsUom'])

In [74]:
emissions_df_raw.isnull().sum().sum()

186743

In [75]:
emissions_df_raw[emissions_df_raw['resultEmissionsUom'] == "None"]
emissions_df_raw[emissions_df_raw['resultEmissionsUom'].isnull()]
emissions_df_raw['resultEmissionsUom'].isnull().sum()

0

#### Normalizing unit of measure columns

In [76]:
# Normalize values reported in kg/a to t/a so that all emissions are comparable
emissions_df_raw['resultsEmissionsValueNEW'] = np.where(emissions_df_raw['resultEmissionsUom'] == "kg/a",
                                                        emissions_df_raw['resultEmissionsValue']/1000,
                                                        emissions_df_raw['resultEmissionsValue'])
emissions_df_raw['resultEmissionsUomNEW'] = np.where(emissions_df_raw['resultEmissionsUom'] == "kg/a",
                                                     "t/a",
                                                     emissions_df_raw['resultEmissionsUom'])
emissions_df_raw

Unnamed: 0,countryCode,spatialUnitIdentifier,spatialUnitIdentifierScheme,phenomenonTimeReferencePeriod,observedPropertyDeterminandCode,observedPropertyDeterminandLabel,parameterEmissionsSourceCategory,parameterEPRTRfacilities,resultEmissionsValue,resultEmissionsUom,procedureEmissionsMethod,resultObservationStatus,Remarks,UID,resultsEmissionsValueNEW,resultEmissionsUomNEW
0,AT,AT1000,euRBDCode,2016,CAS_7439-92-1,Lead and its compounds,I,yes,759.500000,kg/a,calculated,X,data derived from EPRTR by ETC,137076,0.759500,t/a
1,AT,AT1000,euRBDCode,2016,CAS_7439-92-1,Lead and its compounds,U2,yes,280.000000,kg/a,calculated,X,data derived from EPRTR by ETC,137077,0.280000,t/a
2,AT,AT1000,euRBDCode,2016,CAS_7439-97-6,Mercury and its compounds,I,yes,5.290000,kg/a,calculated,X,data derived from EPRTR by ETC,137078,0.005290,t/a
3,AT,AT1000,euRBDCode,2016,CAS_7440-02-0,Nickel and its compounds,I,yes,2568.300000,kg/a,calculated,X,data derived from EPRTR by ETC,137080,2.568300,t/a
4,AT,AT1000,euRBDCode,2016,CAS_7440-02-0,Nickel and its compounds,U2,yes,3690.000000,kg/a,calculated,X,data derived from EPRTR by ETC,137081,3.690000,t/a
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
103280,XK,XK,countryCode,2019,EEA_31615-01-7,Total nitrogen,U22,no,9.477000,t/a,measured,A,,162583,9.477000,t/a
103281,XK,XK,countryCode,2020,EEA_31-02-7,Total suspended solids,U22,no,6.959722,t/a,measured,A,,179787,6.959722,t/a
103282,XK,XK,countryCode,2020,EEA_3133-01-5,BOD5,U22,no,3.117451,t/a,measured,A,,179788,3.117451,t/a
103283,XK,XK,countryCode,2020,EEA_3133-03-7,CODCr,U22,no,55.646105,t/a,measured,A,,179789,55.646105,t/a
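The `np.where` approach above passes every unit other than kg/a through unchanged. As an alternative sketch, a lookup of divisors makes the conversion explicit and fails loudly on a unit that was not anticipated. The names `toy_emissions`, `divisor_to_tonnes` and `value_t_per_a` are hypothetical, and the notebook does not show whether units other than kg/a and t/a occur:

```python
import pandas as pd

# Toy stand-in for the emissions DataFrame.
toy_emissions = pd.DataFrame({
    "resultEmissionsValue": [759.5, 9.477, 280.0],
    "resultEmissionsUom": ["kg/a", "t/a", "kg/a"],
})

# Divisor that converts each known unit to t/a.
divisor_to_tonnes = {"kg/a": 1000.0, "t/a": 1.0}

divisor = toy_emissions["resultEmissionsUom"].map(divisor_to_tonnes)

# A unit missing from the lookup maps to NaN, so it is caught here
# instead of being silently treated as t/a.
assert divisor.notna().all(), "unexpected unit of measure"

toy_emissions["value_t_per_a"] = toy_emissions["resultEmissionsValue"] / divisor
print(toy_emissions)
```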


In [77]:
emissions_df_raw = emissions_df_raw.drop(['resultEmissionsUom', 'resultEmissionsValue'], axis = 1)

In [78]:
emissions_df_raw

Unnamed: 0,countryCode,spatialUnitIdentifier,spatialUnitIdentifierScheme,phenomenonTimeReferencePeriod,observedPropertyDeterminandCode,observedPropertyDeterminandLabel,parameterEmissionsSourceCategory,parameterEPRTRfacilities,procedureEmissionsMethod,resultObservationStatus,Remarks,UID,resultsEmissionsValueNEW,resultEmissionsUomNEW
0,AT,AT1000,euRBDCode,2016,CAS_7439-92-1,Lead and its compounds,I,yes,calculated,X,data derived from EPRTR by ETC,137076,0.759500,t/a
1,AT,AT1000,euRBDCode,2016,CAS_7439-92-1,Lead and its compounds,U2,yes,calculated,X,data derived from EPRTR by ETC,137077,0.280000,t/a
2,AT,AT1000,euRBDCode,2016,CAS_7439-97-6,Mercury and its compounds,I,yes,calculated,X,data derived from EPRTR by ETC,137078,0.005290,t/a
3,AT,AT1000,euRBDCode,2016,CAS_7440-02-0,Nickel and its compounds,I,yes,calculated,X,data derived from EPRTR by ETC,137080,2.568300,t/a
4,AT,AT1000,euRBDCode,2016,CAS_7440-02-0,Nickel and its compounds,U2,yes,calculated,X,data derived from EPRTR by ETC,137081,3.690000,t/a
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
103280,XK,XK,countryCode,2019,EEA_31615-01-7,Total nitrogen,U22,no,measured,A,,162583,9.477000,t/a
103281,XK,XK,countryCode,2020,EEA_31-02-7,Total suspended solids,U22,no,measured,A,,179787,6.959722,t/a
103282,XK,XK,countryCode,2020,EEA_3133-01-5,BOD5,U22,no,measured,A,,179788,3.117451,t/a
103283,XK,XK,countryCode,2020,EEA_3133-03-7,CODCr,U22,no,measured,A,,179789,55.646105,t/a


#### Split date columns

In [79]:
emissions_df_raw['phenomenonTimeReferencePeriod'].unique()

array(['2016', '2017', '2018', '2019', '2020', '2004--2007', '2007',
       '2009--2011', '2010', '2011', '2012', '2014', '2015', '2005',
       '1998', '2000', '2001', '2002', '2003', '2004', '2006', '2008',
       '2009', '2013', '1977--1998', '1987--1998', '1985', '1992', '1995',
       '1996', '2009--2014', '2012--2015', '2011--2014', '2000--2006',
       '2009--2012', '2016--2018', '2018--2019', '2019--2020',
       '2013--2014'], dtype=object)

In [80]:
emissions_df_raw[['TimeReferenceStart', 'TimeReferenceEnd']] = emissions_df_raw['phenomenonTimeReferencePeriod'].str.split('--', expand = True)

In [81]:
emissions_df_raw['TimeReferenceEnd'] = np.where(emissions_df_raw['TimeReferenceEnd'].isnull(),
                                                emissions_df_raw['TimeReferenceStart'],
                                                emissions_df_raw['TimeReferenceEnd'])
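After the split, both columns still hold strings. A minimal sketch of the same idea on toy data, with the years converted to integers so that period lengths can be computed (the `period_years` column and the `toy_periods` DataFrame are illustrative additions, not part of the dataset):

```python
import pandas as pd

toy_periods = pd.DataFrame(
    {"phenomenonTimeReferencePeriod": ["2016", "2004--2007", "1977--1998"]}
)

# Split "start--end"; single years yield a missing second part.
parts = toy_periods["phenomenonTimeReferencePeriod"].str.split("--", expand=True)

toy_periods["TimeReferenceStart"] = parts[0].astype(int)
# For single-year periods, fall back to the start year.
toy_periods["TimeReferenceEnd"] = parts[1].fillna(parts[0]).astype(int)

# Number of calendar years covered by each record.
toy_periods["period_years"] = (
    toy_periods["TimeReferenceEnd"] - toy_periods["TimeReferenceStart"] + 1
)
print(toy_periods)
```

Knowing the period length matters when comparing records, since a value reported for a multi-year period is not directly comparable with a single-year value.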

In [82]:
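# Note: drop() returns a new DataFrame; without reassignment, emissions_df_raw itself keeps this column.
emissions_df_raw.drop('phenomenonTimeReferencePeriod', axis = 1)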

Unnamed: 0,countryCode,spatialUnitIdentifier,spatialUnitIdentifierScheme,observedPropertyDeterminandCode,observedPropertyDeterminandLabel,parameterEmissionsSourceCategory,parameterEPRTRfacilities,procedureEmissionsMethod,resultObservationStatus,Remarks,UID,resultsEmissionsValueNEW,resultEmissionsUomNEW,TimeReferenceStart,TimeReferenceEnd
0,AT,AT1000,euRBDCode,CAS_7439-92-1,Lead and its compounds,I,yes,calculated,X,data derived from EPRTR by ETC,137076,0.759500,t/a,2016,2016
1,AT,AT1000,euRBDCode,CAS_7439-92-1,Lead and its compounds,U2,yes,calculated,X,data derived from EPRTR by ETC,137077,0.280000,t/a,2016,2016
2,AT,AT1000,euRBDCode,CAS_7439-97-6,Mercury and its compounds,I,yes,calculated,X,data derived from EPRTR by ETC,137078,0.005290,t/a,2016,2016
3,AT,AT1000,euRBDCode,CAS_7440-02-0,Nickel and its compounds,I,yes,calculated,X,data derived from EPRTR by ETC,137080,2.568300,t/a,2016,2016
4,AT,AT1000,euRBDCode,CAS_7440-02-0,Nickel and its compounds,U2,yes,calculated,X,data derived from EPRTR by ETC,137081,3.690000,t/a,2016,2016
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
103280,XK,XK,countryCode,EEA_31615-01-7,Total nitrogen,U22,no,measured,A,,162583,9.477000,t/a,2019,2019
103281,XK,XK,countryCode,EEA_31-02-7,Total suspended solids,U22,no,measured,A,,179787,6.959722,t/a,2020,2020
103282,XK,XK,countryCode,EEA_3133-01-5,BOD5,U22,no,measured,A,,179788,3.117451,t/a,2020,2020
103283,XK,XK,countryCode,EEA_3133-03-7,CODCr,U22,no,measured,A,,179789,55.646105,t/a,2020,2020


#### Duplicates

In [92]:
emissions_df_raw[emissions_df_raw.duplicated(keep=False)]

Unnamed: 0,countryCode,spatialUnitIdentifier,spatialUnitIdentifierScheme,phenomenonTimeReferencePeriod,observedPropertyDeterminandCode,observedPropertyDeterminandLabel,parameterEmissionsSourceCategory,parameterEPRTRfacilities,procedureEmissionsMethod,resultObservationStatus,Remarks,UID,resultsEmissionsValueNEW,resultEmissionsUomNEW,TimeReferenceStart,TimeReferenceEnd


There are no duplicated rows.

#### Emissions source categories

In [83]:
emissions_df_raw['observedPropertyDeterminandLabel'].unique()
emissions_df_raw['parameterEmissionsSourceCategory'].unique()

array(['I', 'U2', 'NP', 'NP1', 'NP2', 'NP4', 'U21', 'U22', 'U23', 'U24',
       'I4', 'NP3', 'NP7', 'U', 'U1', 'NP72', 'O', 'NP8', 'I3', 'NP5',
       'PT', 'O2', 'O3', 'O4', 'U11', 'NP71', 'U12', 'U13', 'O1', 'U14'],
      dtype=object)

In [84]:
emission_source_category = ['PT', 
                            'U',
                            'U1', 'U11', 'U12', 'U13', 'U14',
                            'U2', 'U21', 'U22', 'U23', 'U24',
                            'I', 'I3', 'I4', 
                            'O', 'O1', 'O2', 'O3', 'O4', 
                            'NP',
                            'NP1', 'NP2', 'NP3', 'NP4', 'NP5', 'NP7', 'NP8',
                            'NP71', 'NP72', 'NP73', 'NP74']
emission_source_category_label = ['Point Sources',
                                  'Point Urban Wastewater',
                                  'Point Urban Wastewater Untreated',
                                  'Point Urban Wastewater Untreated less than 2000 p.e.',
                                  'Point Urban Wastewater Untreated between 2000 and 10000 p.e.',
                                  'Point Urban Wastewater Untreated between 10000 and 100000 p.e.',
                                  'Point Urban Wastewater Untreated more than 100000 p.e.',
                                  'Point Urban Wastewater Treated',
                                  'Point Urban Wastewater Treated less than 2000 p.e.',
                                  'Point Urban Wastewater Treated between 2000 and 10000 p.e.',
                                  'Point Urban Wastewater Treated between 10000 and 100000 p.e.',
                                  'Point Urban Wastewater Treated more than 100000 p.e.',
                                  'Point Industrial Wastewater',
                                  'Point Industrial Wastewater Treated',
                                  'Point Industrial Wastewater Untreated',
                                  'Point Other point emissions',
                                  'Point Contaminated sites or abandoned industrial sites',
                                  'Point Waste disposal sites',
                                  'Point Mine waters',
                                  'Point Aquaculture',
                                  'Diffuse sources',
                                  'Diffuse Agricultural emissions',
                                  'Diffuse Atmospheric deposition',
                                  'Diffuse Un-connected dwellings emissions',
                                  'Diffuse Urban run-off',
                                  'Diffuse Storm overflow emissions',
                                  'Diffuse Other diffuse emissions',
                                  'Diffuse Background emissions',
                                  'Diffuse Other Forestry emissions',
                                  'Diffuse Other Transport emissions',
                                  'Diffuse Other Mining emissions',
                                  'Diffuse Other Aquaculture emissions']

In [85]:
emission_category = pd.DataFrame(list(zip(emission_source_category, emission_source_category_label)), columns = ['EmissionSourceCat', 'EmissionSourceCatLabel'])
emission_category

Unnamed: 0,EmissionSourceCat,EmissionSourceCatLabel
0,PT,Point Sources
1,U,Point Urban Wastewater
2,U1,Point Urban Wastewater Untreated
3,U11,Point Urban Wastewater Untreated less than 200...
4,U12,Point Urban Wastewater Untreated between 2000 ...
5,U13,Point Urban Wastewater Untreated between 10000...
6,U14,Point Urban Wastewater Untreated more than 100...
7,U2,Point Urban Wastewater Treated
8,U21,Point Urban Wastewater Treated less than 2000 ...
9,U22,Point Urban Wastewater Treated between 2000 an...
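One possible use of this lookup table is attaching the readable label to each emissions record. The sketch below assumes toy versions of both DataFrames; the code `ZZ` is a made-up value included only to show how unmapped codes surface:

```python
import pandas as pd

# Toy subset of the category lookup table.
toy_categories = pd.DataFrame({
    "EmissionSourceCat": ["I", "U2", "NP"],
    "EmissionSourceCatLabel": [
        "Point Industrial Wastewater",
        "Point Urban Wastewater Treated",
        "Diffuse sources",
    ],
})

# Toy emissions records; "ZZ" is a hypothetical code absent from the lookup.
toy_emissions = pd.DataFrame(
    {"parameterEmissionsSourceCategory": ["I", "U2", "NP", "ZZ"]}
)

# Series indexed by code, used as a mapping from code to label.
label_by_code = toy_categories.set_index("EmissionSourceCat")["EmissionSourceCatLabel"]
toy_emissions["sourceCategoryLabel"] = (
    toy_emissions["parameterEmissionsSourceCategory"].map(label_by_code)
)

# Codes without a label end up as NaN and can be listed for review.
unmapped = toy_emissions.loc[
    toy_emissions["sourceCategoryLabel"].isna(), "parameterEmissionsSourceCategory"
].unique()
print(toy_emissions)
print(unmapped)
```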


#### Save cleaned df

In [91]:
emission_category.to_csv("emission_category.csv")

In [93]:
emissions_df_raw.to_csv("emissions_cleaned.csv")

<a id = 'explore_monitoring'></a>
### 1.3 Exploratory analysis of monitoring dataset
[Top](#top)

[Emissions](#explore_emissions)

An explanation of the monitoring dataset, together with some metadata, is available here https://www.eea.europa.eu/data-and-maps/data/waterbase-water-quantity-14/waterbase-water-quantity-microsoft-access-database-2-tables.

In [94]:
monitoring_df_raw.columns

Index(['countryCode', 'monitoringSiteIdentifier',
       'monitoringSiteIdentifierScheme', 'observedProperty',
       'phenomenonTimePeriod', 'phenomenonTimePeriod_year',
       'phenomenonTimePeriod_month', 'phenomenonTimePeriod_day',
       'resultObservedValue', 'resultObservationStatus', 'Remarks',
       'metadata_versionId', 'metadata_beginLifeSpanVersion',
       'metadata_statusCode', 'metadata_observationStatus',
       'metadata_statements', 'UID'],
      dtype='object')

In [95]:
monitoring_df_raw.head()

Unnamed: 0,countryCode,monitoringSiteIdentifier,monitoringSiteIdentifierScheme,observedProperty,phenomenonTimePeriod,phenomenonTimePeriod_year,phenomenonTimePeriod_month,phenomenonTimePeriod_day,resultObservedValue,resultObservationStatus,Remarks,metadata_versionId,metadata_beginLifeSpanVersion,metadata_statusCode,metadata_observationStatus,metadata_statements,UID
0,AT,AT212753,eionetMonitoringSiteCode,SF,2007-05-08,2007,5.0,8.0,26.3,,,http://discomap.eea.europa.eu/data/wisesoe/der...,2015-11-30 00:00:00.000,experimental,A,,1
1,AT,AT212753,eionetMonitoringSiteCode,SF,2007-05-09,2007,5.0,9.0,25.8,,,http://discomap.eea.europa.eu/data/wisesoe/der...,2015-11-30 00:00:00.000,experimental,A,,2
2,AT,AT212753,eionetMonitoringSiteCode,SF,2007-05-10,2007,5.0,10.0,25.1,,,http://discomap.eea.europa.eu/data/wisesoe/der...,2015-11-30 00:00:00.000,experimental,A,,3
3,AT,AT212753,eionetMonitoringSiteCode,SF,2007-05-11,2007,5.0,11.0,24.6,,,http://discomap.eea.europa.eu/data/wisesoe/der...,2015-11-30 00:00:00.000,experimental,A,,4
4,AT,AT212753,eionetMonitoringSiteCode,SF,2007-05-12,2007,5.0,12.0,25.5,,,http://discomap.eea.europa.eu/data/wisesoe/der...,2015-11-30 00:00:00.000,experimental,A,,5


In [102]:
monitoring_df_raw['observedProperty'].unique() # SF GWL
monitoring_df_raw['metadata_statusCode'].unique() #'experimental', 'stable', 'valid' 
monitoring_df_raw['metadata_observationStatus'].unique() # 'A', 'U'
monitoring_df_raw['resultObservationStatus'].unique() # nan, 'A', 'O', 'L', 'M', 'N'
monitoring_df_raw['Remarks'].unique() #

array([nan, 'under repair', 'ES_QA_4004', ..., 'IT02_VALTOURNENCHE_MAEN',
       'IT02_GRESSONEY_LA_TRINITE_ALPE_COURTLYS',
       'IT02_GRESSONEY_SAINT_JEAN_CAPOLUOGO'], dtype=object)

In [103]:
monitoring_df_raw['resultObservationStatus'].value_counts()

A    1118445
L      30219
M      22010
O      16265
N        579
Name: resultObservationStatus, dtype: int64

In [106]:
resultObservationStatus = ['A', 'O', 'L', 'M', 'N', 'W', 'X', 'Y']
resultObservationStatusLabel = ['Correct',
                                'Missing value - no further information or past record should be deleted',
                                'Missing value - not collected',
                                'Missing value - not exist',
                                'Missing value - not relevant or not significant',
                                'Missing value - in another source category',
                                'Reported value includes data from another source category (categories)',
                                'The source category does not exactly match the standard definition']
obs_status_df = pd.DataFrame(list(zip(resultObservationStatus, resultObservationStatusLabel)),
                             columns = ['Obs_status', 'Obs_status_label'])

In [107]:
obs_status_df

Unnamed: 0,Obs_status,Obs_status_label
0,A,Correct
1,O,Missing value - no further information or past...
2,L,Missing value - not collected
3,M,Missing value - not exist
4,N,Missing value - not relevant or not significant
5,W,Missing value - in another source category
6,X,Reported value includes data from another sour...
7,Y,The source category does not exactly match the...


In [119]:
# Rows flagged with one of the "missing value" codes.
# 'X' and 'Y' are left out on purpose: they qualify a reported value rather than mark it as missing.
missing_codes = ['O', 'L', 'M', 'N', 'W']
monitoring_df_raw_missing = monitoring_df_raw[monitoring_df_raw['resultObservationStatus'].isin(missing_codes)]

In [120]:
monitoring_df_raw_missing['resultObservedValue'].unique()

array([            nan,  2.27300000e+01,  2.27000000e+01,  2.26600000e+01,
        2.25800000e+01,  2.25200000e+01,  2.27200000e+01,  2.27400000e+01,
        2.27500000e+01,  2.27900000e+01,  2.27600000e+01,  2.80100000e+01,
        1.06700000e+02,  0.00000000e+00,  9.56000000e+01,  4.81612903e-01,
        3.03500000e-01,  2.83968254e-01,  7.06666667e-01,  1.36126984e+00,
       -1.17283333e+00,  8.69838710e-01,  1.81800000e+00, -5.21129032e-01,
       -6.08500000e-01,  4.96282615e+01,  4.80228420e+01,  4.86930356e+01,
        5.35070200e+01,  5.25833333e+01,  5.23266667e+01,  5.14500000e+01,
        5.15533333e+01,  5.22950000e+01,  5.15100000e+01,  5.01000000e+01,
        1.00000000e+01,  1.24000000e+01,  1.26000000e+01,  1.31000000e+01,
        1.56000000e+01,  1.55000000e+01,  1.62000000e+01,  1.61000000e+01,
        1.54000000e+01,  1.47000000e+01,  1.44000000e+01,  1.06000000e+01,
        1.08000000e+01,  9.31000000e+00,  2.04000000e+01,  4.80000000e+01,
        6.22000000e+01,  

Not all observation results labeled as "missing" actually have a null value; some of them carry an actual value.
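
To quantify this, the flagged rows can be grouped by status code and the non-null values counted per code. A minimal sketch on made-up data: `sample` stands in for `monitoring_df_raw`, and the column names `n_rows` and `n_with_value` are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up sample standing in for monitoring_df_raw.
sample = pd.DataFrame({
    'resultObservationStatus': ['A', 'O', 'L', 'M', 'N', 'L'],
    'resultObservedValue': [26.3, np.nan, 22.73, 0.0, np.nan, np.nan],
})

# Keep only the rows flagged with a "missing value" code.
missing_codes = ['O', 'L', 'M', 'N', 'W']
flagged = sample[sample['resultObservationStatus'].isin(missing_codes)]

# 'size' counts all rows per code, 'count' only the non-null values.
summary = (flagged.groupby('resultObservationStatus')['resultObservedValue']
                  .agg(n_rows='size', n_with_value='count'))
print(summary)
```

A code whose `n_with_value` is greater than zero is one where the "missing" flag and the stored value disagree.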

In [125]:
monitoring = monitoring_df_raw.drop(['metadata_versionId', 'metadata_beginLifeSpanVersion',
                                     'metadata_statements', 'Remarks'],
                                      axis = 1)

#### Null values

In [128]:
monitoring[monitoring.isnull().any(axis=1)]

Unnamed: 0,countryCode,monitoringSiteIdentifier,monitoringSiteIdentifierScheme,observedProperty,phenomenonTimePeriod,phenomenonTimePeriod_year,phenomenonTimePeriod_month,phenomenonTimePeriod_day,resultObservedValue,resultObservationStatus,metadata_statusCode,metadata_observationStatus,UID
0,AT,AT212753,eionetMonitoringSiteCode,SF,2007-05-08,2007,5.0,8.0,26.30,,experimental,A,1
1,AT,AT212753,eionetMonitoringSiteCode,SF,2007-05-09,2007,5.0,9.0,25.80,,experimental,A,2
2,AT,AT212753,eionetMonitoringSiteCode,SF,2007-05-10,2007,5.0,10.0,25.10,,experimental,A,3
3,AT,AT212753,eionetMonitoringSiteCode,SF,2007-05-11,2007,5.0,11.0,24.60,,experimental,A,4
4,AT,AT212753,eionetMonitoringSiteCode,SF,2007-05-12,2007,5.0,12.0,25.50,,experimental,A,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4888873,SE,SE635140-128901,euMonitoringSiteCode,SF,2020-08-01,2020,8.0,1.0,11.40,,valid,A,6766109
4888874,SE,SE635140-128901,euMonitoringSiteCode,SF,2020-09-01,2020,9.0,1.0,8.97,,valid,A,6766110
4888875,SE,SE635140-128901,euMonitoringSiteCode,SF,2020-10-01,2020,10.0,1.0,38.80,,valid,A,6766111
4888876,SE,SE635140-128901,euMonitoringSiteCode,SF,2020-11-01,2020,11.0,1.0,77.90,,valid,A,6766112


In [126]:
monitoring.isnull().sum().sum()

4125968

In [127]:
monitoring.isnull().sum()

countryCode                             0
monitoringSiteIdentifier                0
monitoringSiteIdentifierScheme          0
observedProperty                        0
phenomenonTimePeriod                    0
phenomenonTimePeriod_year               0
phenomenonTimePeriod_month          27909
phenomenonTimePeriod_day           327895
resultObservedValue                 68804
resultObservationStatus           3701360
metadata_statusCode                     0
metadata_observationStatus              0
UID                                     0
dtype: int64

Remove all rows that have a null value in the column 'resultObservedValue'. The null values in the remaining columns do not affect the analysis: the 'phenomenonTimePeriod' column still carries the date information when 'phenomenonTimePeriod_month' or 'phenomenonTimePeriod_day' is null, and an observation result is treated as accepted unless its status marks it as missing.

In both cases, the null values will be replaced with dummy values.

In [130]:
monitoring = monitoring.dropna(subset = ['resultObservedValue'], axis = 0)

In [131]:
monitoring.isnull().sum()

countryCode                             0
monitoringSiteIdentifier                0
monitoringSiteIdentifierScheme          0
observedProperty                        0
phenomenonTimePeriod                    0
phenomenonTimePeriod_year               0
phenomenonTimePeriod_month          27859
phenomenonTimePeriod_day           317631
resultObservedValue                     0
resultObservationStatus           3701342
metadata_statusCode                     0
metadata_observationStatus              0
UID                                     0
dtype: int64

Replace the null values in the column 'resultObservationStatus' with the value 'Unknown'.

Replace the null values in the columns 'phenomenonTimePeriod_month' and 'phenomenonTimePeriod_day' with the value 1, i.e. the first month and the first day.

In [134]:
monitoring['resultObservationStatus'] = monitoring['resultObservationStatus'].fillna('Unknown')

In [137]:
monitoring['phenomenonTimePeriod_month'] = monitoring['phenomenonTimePeriod_month'].fillna(1)

In [138]:
monitoring['phenomenonTimePeriod_day'] = monitoring['phenomenonTimePeriod_day'].fillna(1)

In [139]:
monitoring.isnull().sum()

countryCode                       0
monitoringSiteIdentifier          0
monitoringSiteIdentifierScheme    0
observedProperty                  0
phenomenonTimePeriod              0
phenomenonTimePeriod_year         0
phenomenonTimePeriod_month        0
phenomenonTimePeriod_day          0
resultObservedValue               0
resultObservationStatus           0
metadata_statusCode               0
metadata_observationStatus        0
UID                               0
dtype: int64
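
With no nulls left in the year, month and day columns, they can be assembled into a single datetime column if one is needed later. This is a sketch only: `sample` is made-up data, and the column name `phenomenonDate` is hypothetical and not part of the dataset. Dates built from the filled-in value 1 are placeholders, not observed dates.

```python
import pandas as pd

# Made-up rows standing in for the cleaned `monitoring` frame.
sample = pd.DataFrame({
    'phenomenonTimePeriod_year': [2007, 2008, 2020],
    'phenomenonTimePeriod_month': [5.0, 1.0, 8.0],
    'phenomenonTimePeriod_day': [8.0, 1.0, 1.0],
})

# pd.to_datetime can assemble dates from columns named year, month and day.
parts = sample.rename(columns={
    'phenomenonTimePeriod_year': 'year',
    'phenomenonTimePeriod_month': 'month',
    'phenomenonTimePeriod_day': 'day',
})[['year', 'month', 'day']].astype(int)  # the float columns hold whole numbers

sample['phenomenonDate'] = pd.to_datetime(parts)
print(sample['phenomenonDate'])
```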

#### Duplicates

In [133]:
monitoring[monitoring.duplicated(keep = False)]

Unnamed: 0,countryCode,monitoringSiteIdentifier,monitoringSiteIdentifierScheme,observedProperty,phenomenonTimePeriod,phenomenonTimePeriod_year,phenomenonTimePeriod_month,phenomenonTimePeriod_day,resultObservedValue,resultObservationStatus,metadata_statusCode,metadata_observationStatus,UID


Add a column with the unit of measure, which depends on the observed property: groundwater level (GWL) in m and stream flow (SF) in m3/s.

In [143]:
monitoring['resultObservedUnit'] = np.where(monitoring['observedProperty'] == 'SF',
                                           'm3/s',
                                           'm')
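
The `np.where` call labels every property other than 'SF' as 'm'. That is fine as long as 'SF' and 'GWL' are the only codes present. A possible alternative, sketched here on made-up data, maps each code explicitly so that an unexpected code ends up with a null unit instead; `unit_by_property` and the code 'XX' are hypothetical.

```python
import pandas as pd

# Property-to-unit lookup taken from the description above.
unit_by_property = {'SF': 'm3/s', 'GWL': 'm'}

# Made-up sample; 'XX' is a hypothetical unexpected property code.
sample = pd.DataFrame({'observedProperty': ['SF', 'GWL', 'XX']})
sample['resultObservedUnit'] = sample['observedProperty'].map(unit_by_property)

# Codes without a known unit stay null and are easy to list.
print(sample[sample['resultObservedUnit'].isna()])
```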

In [145]:
monitoring[monitoring['observedProperty'] == 'GWL']

Unnamed: 0,countryCode,monitoringSiteIdentifier,monitoringSiteIdentifierScheme,observedProperty,phenomenonTimePeriod,phenomenonTimePeriod_year,phenomenonTimePeriod_month,phenomenonTimePeriod_day,resultObservedValue,resultObservationStatus,metadata_statusCode,metadata_observationStatus,UID,resultObservedUnit
64438,BE,BEVL_VMM_MS_A051,euMonitoringSiteCode,GWL,2008-01,2008,1.0,1.0,44.435234,Unknown,experimental,A,287723,m
64439,BE,BEVL_VMM_MS_A051,euMonitoringSiteCode,GWL,2008-02,2008,2.0,1.0,44.481923,Unknown,experimental,A,287724,m
64440,BE,BEVL_VMM_MS_A051,euMonitoringSiteCode,GWL,2008-03,2008,3.0,1.0,44.660132,Unknown,experimental,A,287725,m
64441,BE,BEVL_VMM_MS_A051,euMonitoringSiteCode,GWL,2008-04,2008,4.0,1.0,44.674854,Unknown,experimental,A,287726,m
64442,BE,BEVL_VMM_MS_A051,euMonitoringSiteCode,GWL,2008-05,2008,5.0,1.0,44.469028,Unknown,experimental,A,287727,m
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4871998,IE,IEWE_G_0002_1200_0013,euMonitoringSiteCode,GWL,2020-07-01,2020,7.0,1.0,3.691000,Unknown,valid,A,6749234,m
4871999,IE,IEWE_G_0002_1200_0013,euMonitoringSiteCode,GWL,2020-08-01,2020,8.0,1.0,5.660000,Unknown,valid,A,6749235,m
4872000,IE,IEWE_G_0002_1200_0013,euMonitoringSiteCode,GWL,2020-09-01,2020,9.0,1.0,6.954000,Unknown,valid,A,6749236,m
4872001,IE,IEWE_G_0002_1200_0013,euMonitoringSiteCode,GWL,2020-10-01,2020,10.0,1.0,4.813000,Unknown,valid,A,6749237,m


In [146]:
records, attributes = monitoring.shape
print(f"The cleaned monitoring df has now {records} records and {attributes} attributes.")

The cleaned monitoring df has now 4820074 records and 14 attributes.


Filter the dataset to surface water results only, i.e. stream flow ('SF').

In [148]:
monitoring_surf = monitoring[monitoring['observedProperty'] == 'SF']

#### Save the cleaned df

In [147]:
monitoring.to_csv("Monitoring_cleaned.csv")

In [150]:
monitoring_surf.to_csv("Monitoring_surf_cleaned.csv")
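
By default `to_csv` also writes the DataFrame index, which comes back as an extra 'Unnamed: 0' column when the file is read again. If the index is not needed, `index=False` leaves it out. A small sketch on made-up data, using an in-memory buffer instead of a file:

```python
import io
import pandas as pd

# Made-up frame with a gap in the index, as left behind by dropna or filtering.
sample = pd.DataFrame({'UID': [1, 3], 'resultObservedValue': [26.3, 25.1]},
                      index=[0, 2])

# Default: the index is written as an unnamed first column.
with_index = pd.read_csv(io.StringIO(sample.to_csv()))

# index=False: only the data columns are written.
without_index = pd.read_csv(io.StringIO(sample.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

Keeping the index can still be useful when the original row positions of the raw dataset need to be traced back.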

<a id = 'explore_aggregated'></a>
### 1.4 Exploratory analysis of aggregated dataset
[Top](#top)

In [138]:
aggregated_df_raw.columns

Index(['countryCode', 'monitoringSiteIdentifier',
       'monitoringSiteIdentifierScheme', 'parameterWaterBodyCategory',
       'observedPropertyDeterminandCode', 'observedPropertyDeterminandLabel',
       'procedureAnalysedMatrix', 'resultUom', 'phenomenonTimeReferenceYear',
       'parameterSamplingPeriod', 'procedureLOQValue', 'resultNumberOfSamples',
       'resultQualityNumberOfSamplesBelowLOQ', 'resultQualityMinimumBelowLOQ',
       'resultMinimumValue', 'resultQualityMeanBelowLOQ', 'resultMeanValue',
       'resultQualityMaximumBelowLOQ', 'resultMaximumValue',
       'resultQualityMedianBelowLOQ', 'resultMedianValue',
       'resultStandardDeviationValue', 'procedureAnalyticalMethod',
       'parameterSampleDepth', 'resultObservationStatus', 'remarks',
       'metadata_versionId', 'metadata_beginLifeSpanVersion',
       'metadata_statusCode', 'metadata_observationStatus',
       'metadata_statements', 'UID'],
      dtype='object')

In [151]:
aggregated_df_raw.head()

Unnamed: 0,countryCode,monitoringSiteIdentifier,monitoringSiteIdentifierScheme,parameterWaterBodyCategory,observedPropertyDeterminandCode,observedPropertyDeterminandLabel,procedureAnalysedMatrix,resultUom,phenomenonTimeReferenceYear,parameterSamplingPeriod,...,procedureAnalyticalMethod,parameterSampleDepth,resultObservationStatus,remarks,metadata_versionId,metadata_beginLifeSpanVersion,metadata_statusCode,metadata_observationStatus,metadata_statements,UID
0,AL,AL1,eionetMonitoringSiteCode,LW,CAS_14265-44-2,Phosphate,W,mg{P}/L,2004,2004-01--2004-12,...,,-9999.0,,,http://discomap.eea.europa.eu/data/wisesoe/der...,2015-11-30 00:00:00.000,experimental,A,,1
1,AL,AL1,eionetMonitoringSiteCode,LW,CAS_14265-44-2,Phosphate,W,mg{P}/L,2005,2005-01--2005-12,...,,-9999.0,,,http://discomap.eea.europa.eu/data/wisesoe/der...,2015-11-30 00:00:00.000,experimental,A,,2
2,AL,AL1,eionetMonitoringSiteCode,LW,CAS_14265-44-2,Phosphate,W,mg{P}/L,2006,2006-01--2006-12,...,,-9999.0,,,http://discomap.eea.europa.eu/data/wisesoe/der...,2015-11-30 00:00:00.000,experimental,A,,3
3,AL,AL1,eionetMonitoringSiteCode,LW,CAS_14265-44-2,Phosphate,W,mg{P}/L,2007,2007-01--2007-12,...,,-9999.0,,,http://discomap.eea.europa.eu/data/wisesoe/der...,2015-11-30 00:00:00.000,experimental,A,,4
4,AL,AL1,eionetMonitoringSiteCode,LW,CAS_14797-55-8,Nitrate,W,mg{NO3}/L,2005,2005-01--2005-12,...,,-9999.0,,,http://discomap.eea.europa.eu/data/wisesoe/der...,2015-11-30 00:00:00.000,experimental,A,,5


In [166]:
aggregated_df_raw['resultObservationStatus'].unique() # nan, 'A', 'O'
aggregated_df_raw['metadata_statusCode'].unique() # 'experimental', 'valid', 'stable'
aggregated_df_raw['metadata_observationStatus'].unique() # 'A', 'U'
aggregated_df_raw['metadata_statements'].unique() # 
aggregated_df_raw['remarks'].unique() # 
aggregated_df_raw['procedureAnalyticalMethod'].unique() #

array([nan, 'ISO 7890-3 : 2000', 'EN 26777:1993', ...,
       'APAT CNR IRSA 9020 Man 29 2005',
       'APAT CNR IRSA 4110 A2 Man 29 2015',
       'APAT CNR IRSA 4110 A2 Man 29 2006'], dtype=object)

In [165]:
aggregated_df_raw[~(aggregated_df_raw['remarks'].isna())]

Unnamed: 0,countryCode,monitoringSiteIdentifier,monitoringSiteIdentifierScheme,parameterWaterBodyCategory,observedPropertyDeterminandCode,observedPropertyDeterminandLabel,procedureAnalysedMatrix,resultUom,phenomenonTimeReferenceYear,parameterSamplingPeriod,...,procedureAnalyticalMethod,parameterSampleDepth,resultObservationStatus,remarks,metadata_versionId,metadata_beginLifeSpanVersion,metadata_statusCode,metadata_observationStatus,metadata_statements,UID
168,BA,BAB3,eionetMonitoringSiteCode,LW,CAS_7439-89-6,Iron and its compounds,W,ug/L,2005,2005-01--2005-12,...,,11.17,,"0-1m,1-7m,7-bottom,0-2m,19-bottom,10-bottom,0-...",http://discomap.eea.europa.eu/data/wisesoe/der...,2015-11-30 00:00:00.000,experimental,A,,3755
170,BA,BAB3,eionetMonitoringSiteCode,LW,CAS_7439-96-5,Manganese and its compounds,W,ug/L,2005,2005-01--2005-12,...,,11.17,,"0-1m,1-7m,7-bottom,0-2m,19-bottom,10-bottom,0-...",http://discomap.eea.europa.eu/data/wisesoe/der...,2015-11-30 00:00:00.000,experimental,A,,3757
172,BA,BAB3,eionetMonitoringSiteCode,LW,CAS_7439-97-6,Mercury and its compounds,W,ug/L,2005,2005-01--2005-12,...,,12.06,,"0-1m,7-bottom,0-2m,19-bottom,10-bottom,0-4m,17...",http://discomap.eea.europa.eu/data/wisesoe/der...,2015-11-30 00:00:00.000,experimental,A,,3759
173,BA,BAB3,eionetMonitoringSiteCode,LW,CAS_7439-97-6,Mercury and its compounds,W,ug/L,2006,2006-01--2006-12,...,,9.56,,"0-2,2-bottom,0-3,3-bottom,0-4,4-8,8-bottom",http://discomap.eea.europa.eu/data/wisesoe/der...,2015-11-30 00:00:00.000,experimental,A,,3760
174,BA,BAB3,eionetMonitoringSiteCode,LW,CAS_7440-02-0,Nickel and its compounds,W,ug/L,2005,2005-01--2005-12,...,,11.17,,"0-1m,1-7m,7-bottom,0-2m,19-bottom,10-bottom,0-...",http://discomap.eea.europa.eu/data/wisesoe/der...,2015-11-30 00:00:00.000,experimental,A,,3761
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4550544,RO,RO144300_3,euMonitoringSiteCode,RW,CAS_7439-97-6,Mercury and its compounds,W-DIS,ug/L,2017,2017-01-01--2017-12-31,...,Other analytical method,0.00,,EN ISO 17852:2006,http://discomap.eea.europa.eu/data/wisesoe/der...,2019-08-29 07:54:03.000,experimental,A,,17834946
4550548,RO,RO85010,euMonitoringSiteCode,RW,CAS_14798-03-9,Ammonium,W,mg{NH4}/L,2017,2017-01-01--2017-12-31,...,Other analytical method,0.00,,ISO 7150-1:2001,http://discomap.eea.europa.eu/data/wisesoe/der...,2019-08-29 07:54:03.000,experimental,A,,17834950
4550549,RO,RO85010,euMonitoringSiteCode,RW,CAS_14797-55-8,Nitrate,W,mg{NO3}/L,2017,2017-01-01--2017-12-31,...,Other analytical method,0.00,,ISO 7890-3:2000,http://discomap.eea.europa.eu/data/wisesoe/der...,2019-08-29 07:54:03.000,experimental,A,,17834951
4550554,RO,RO85010,euMonitoringSiteCode,RW,CAS_7440-66-6,Zinc and its compounds,W-DIS,ug/L,2017,2017-01-01--2017-12-31,...,Other analytical method,0.00,,EN ISO 8288:2001,http://discomap.eea.europa.eu/data/wisesoe/der...,2019-08-29 07:54:03.000,experimental,A,,17834956


In [171]:
aggregated = aggregated_df_raw.drop(['metadata_versionId', 'metadata_beginLifeSpanVersion', 'remarks',
                                     'procedureAnalyticalMethod', 'metadata_statusCode', 'metadata_observationStatus'],
                                      axis = 1)

In [172]:
aggregated = aggregated[aggregated['metadata_statements'].isna()]

In [173]:
aggregated['metadata_statements'].unique()

array([nan], dtype=object)

In [175]:
aggregated = aggregated.drop(['metadata_statements'],
                            axis = 1)

In [176]:
aggregated[aggregated['resultObservationStatus'] == 'O'] # Flagged 'O' (missing value), yet the result values are present

Unnamed: 0,countryCode,monitoringSiteIdentifier,monitoringSiteIdentifierScheme,parameterWaterBodyCategory,observedPropertyDeterminandCode,observedPropertyDeterminandLabel,procedureAnalysedMatrix,resultUom,phenomenonTimeReferenceYear,parameterSamplingPeriod,...,resultQualityMeanBelowLOQ,resultMeanValue,resultQualityMaximumBelowLOQ,resultMaximumValue,resultQualityMedianBelowLOQ,resultMedianValue,resultStandardDeviationValue,parameterSampleDepth,resultObservationStatus,UID
2610347,SK,SKIDK002,euMonitoringSiteCode,RW,EEA_3133-05-9,Dissolved organic carbon (DOC),W-DIS,mg{C}/L,2013,2013-02-05--2013-12-03,...,False,4.1575,False,4.96,False,4.155,0.493206,0.25,O,11196309
2610348,SK,SKIDK002,euMonitoringSiteCode,RW,EEA_3133-05-9,Dissolved organic carbon (DOC),W-DIS,mg{C}/L,2014,2014-02-12--2014-11-26,...,False,4.235833,False,5.1,False,4.35,0.448506,0.25,O,11196310
2610349,SK,SKIDK003,euMonitoringSiteCode,RW,EEA_3133-05-9,Dissolved organic carbon (DOC),W-DIS,mg{C}/L,2013,2013-01-22--2013-12-03,...,False,4.420833,False,5.98,False,4.28,0.831519,0.25,O,11196311
2610350,SK,SKIDK003,euMonitoringSiteCode,RW,EEA_3133-05-9,Dissolved organic carbon (DOC),W-DIS,mg{C}/L,2014,2014-01-21--2014-11-11,...,False,4.42,False,5.7,False,4.3,0.500511,0.25,O,11196312
2610351,SK,SKIDK005,euMonitoringSiteCode,RW,EEA_3133-05-9,Dissolved organic carbon (DOC),W-DIS,mg{C}/L,2013,2013-01-22--2013-12-03,...,False,4.766667,False,5.7,False,4.7,0.516935,0.25,O,11196313
2610352,SK,SKIDK005,euMonitoringSiteCode,RW,EEA_3133-05-9,Dissolved organic carbon (DOC),W-DIS,mg{C}/L,2014,2014-01-21--2014-11-11,...,False,4.957143,False,6.3,False,5.0,0.698687,0.25,O,11196314


#### Null values

In [177]:
aggregated.isnull().sum()

countryCode                                   0
monitoringSiteIdentifier                      0
monitoringSiteIdentifierScheme                0
parameterWaterBodyCategory                    0
observedPropertyDeterminandCode               0
observedPropertyDeterminandLabel              0
procedureAnalysedMatrix                       0
resultUom                                     0
phenomenonTimeReferenceYear                   0
parameterSamplingPeriod                   69786
procedureLOQValue                       1437394
resultNumberOfSamples                     62926
resultQualityNumberOfSamplesBelowLOQ    1771224
resultQualityMinimumBelowLOQ            1366688
resultMinimumValue                        98666
resultQualityMeanBelowLOQ               1364321
resultMeanValue                              47
resultQualityMaximumBelowLOQ            1365736
resultMaximumValue                        97182
resultQualityMedianBelowLOQ             2020003
resultMedianValue                       

In [178]:
aggregated[aggregated['resultMeanValue'].isna()] # The median values are null as well, so these rows will be dropped.

Unnamed: 0,countryCode,monitoringSiteIdentifier,monitoringSiteIdentifierScheme,parameterWaterBodyCategory,observedPropertyDeterminandCode,observedPropertyDeterminandLabel,procedureAnalysedMatrix,resultUom,phenomenonTimeReferenceYear,parameterSamplingPeriod,...,resultQualityMeanBelowLOQ,resultMeanValue,resultQualityMaximumBelowLOQ,resultMaximumValue,resultQualityMedianBelowLOQ,resultMedianValue,resultStandardDeviationValue,parameterSampleDepth,resultObservationStatus,UID
1080603,IT,IT12-4_30,eionetMonitoringSiteCode,LW,CAS_12002-48-1,Trichlorobenzenes (all isomers),W,ug/L,2010,2010-01--2010-12,...,,,,,,,,25.0,,9257422
2096957,IT,IT12L3_44,euMonitoringSiteCode,LW,CAS_12002-48-1,Trichlorobenzenes (all isomers),W,ug/L,2010,2010-01--2010-12,...,,,,,,,,26.0,,10624804
2096958,IT,IT12L3_44,euMonitoringSiteCode,LW,CAS_12002-48-1,Trichlorobenzenes (all isomers),W,ug/L,2010,2010-01--2010-12,...,,,,,,,,55.0,,10624805
2097039,IT,IT12L3_44,euMonitoringSiteCode,LW,CAS_12002-48-1,Trichlorobenzenes (all isomers),W,ug/L,2010,2010-01--2010-12,...,,,,,,,,50.0,,10624886
2097053,IT,IT12L3_42,euMonitoringSiteCode,LW,CAS_12002-48-1,Trichlorobenzenes (all isomers),W,ug/L,2010,2010-01--2010-12,...,,,,,,,,30.0,,10624900
2097060,IT,IT12L3_57,euMonitoringSiteCode,LW,CAS_12002-48-1,Trichlorobenzenes (all isomers),W,ug/L,2010,2010-01--2010-12,...,,,,,,,,0.3,,10624907
2097068,IT,IT12L3_44,euMonitoringSiteCode,LW,CAS_12002-48-1,Trichlorobenzenes (all isomers),W,ug/L,2010,2010-01--2010-12,...,,,,,,,,25.0,,10624915
2097099,IT,IT12L4_26,euMonitoringSiteCode,LW,CAS_56-23-5,Carbon tetrachloride,W,ug/L,2010,2010-01--2010-12,...,,,,,,,,0.2,,10624946
2097118,IT,IT12L4_26,euMonitoringSiteCode,LW,CAS_56-23-5,Carbon tetrachloride,W,ug/L,2010,2010-01--2010-12,...,,,,,,,,50.0,,10624965
2097164,IT,IT12L4_26,euMonitoringSiteCode,LW,CAS_12002-48-1,Trichlorobenzenes (all isomers),W,ug/L,2010,2010-01--2010-12,...,,,,,,,,0.2,,10625011


In [179]:
aggregated = aggregated.dropna(subset = ['resultMeanValue'], axis = 0)

In [180]:
aggregated['parameterSamplingPeriod'] = np.where(aggregated['parameterSamplingPeriod'].isna(),
                                                aggregated['phenomenonTimeReferenceYear'],
                                                aggregated['parameterSamplingPeriod'])

In [182]:
aggregated.isnull().sum()

countryCode                                   0
monitoringSiteIdentifier                      0
monitoringSiteIdentifierScheme                0
parameterWaterBodyCategory                    0
observedPropertyDeterminandCode               0
observedPropertyDeterminandLabel              0
procedureAnalysedMatrix                       0
resultUom                                     0
phenomenonTimeReferenceYear                   0
parameterSamplingPeriod                       0
procedureLOQValue                       1437351
resultNumberOfSamples                     62883
resultQualityNumberOfSamplesBelowLOQ    1771177
resultQualityMinimumBelowLOQ            1366641
resultMinimumValue                        98619
resultQualityMeanBelowLOQ               1364274
resultMeanValue                               0
resultQualityMaximumBelowLOQ            1365689
resultMaximumValue                        97135
resultQualityMedianBelowLOQ             2019956
resultMedianValue                       

#### Duplicates

In [184]:
aggregated[aggregated.duplicated(keep = False)]

Unnamed: 0,countryCode,monitoringSiteIdentifier,monitoringSiteIdentifierScheme,parameterWaterBodyCategory,observedPropertyDeterminandCode,observedPropertyDeterminandLabel,procedureAnalysedMatrix,resultUom,phenomenonTimeReferenceYear,parameterSamplingPeriod,...,resultQualityMeanBelowLOQ,resultMeanValue,resultQualityMaximumBelowLOQ,resultMaximumValue,resultQualityMedianBelowLOQ,resultMedianValue,resultStandardDeviationValue,parameterSampleDepth,resultObservationStatus,UID


In [185]:
aggregated[['parameterSamplingPeriodStart', 'parameterSamplingPeriodEnd']] = aggregated['parameterSamplingPeriod'].str.split('--',
                                                                                                                            expand = True)

In [192]:
aggregated[aggregated['parameterSamplingPeriodStart'].isna()]

Unnamed: 0,countryCode,monitoringSiteIdentifier,monitoringSiteIdentifierScheme,parameterWaterBodyCategory,observedPropertyDeterminandCode,observedPropertyDeterminandLabel,procedureAnalysedMatrix,resultUom,phenomenonTimeReferenceYear,parameterSamplingPeriod,...,resultQualityMaximumBelowLOQ,resultMaximumValue,resultQualityMedianBelowLOQ,resultMedianValue,resultStandardDeviationValue,parameterSampleDepth,resultObservationStatus,UID,parameterSamplingPeriodStart,parameterSamplingPeriodEnd
1854952,EE,EESJA9303000,eionetMonitoringSiteCode,RW,CAS_14265-44-2,Phosphate,W,mg{P}/L,2013,2013,...,False,0.007,True,0.0050,,-9999.0,,10367820,,
1854954,EE,EESJA9303000,eionetMonitoringSiteCode,RW,CAS_14797-65-0,Nitrite,W,mg{NO2}/L,2013,2013,...,True,0.002,True,0.0020,,-9999.0,,10367822,,
1854955,EE,EESJA9303000,eionetMonitoringSiteCode,RW,CAS_14798-03-9,Ammonium,W,mg{NH4}/L,2013,2013,...,True,0.010,True,0.0100,,-9999.0,,10367823,,
1854956,EE,EESJA9303000,eionetMonitoringSiteCode,RW,CAS_7439-92-1,Lead and its compounds,W-DIS,ug/L,2013,2013,...,False,0.540,True,0.1000,,-9999.0,,10367824,,
1854957,EE,EESJA9303000,eionetMonitoringSiteCode,RW,CAS_7440-02-0,Nickel and its compounds,W-DIS,ug/L,2013,2013,...,False,2.300,False,0.2000,,-9999.0,,10367825,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4473134,DK,DK620014,eionetMonitoringSiteCode,RW,CAS_7723-14-0,Total phosphorus,W,mg{P}/L,2020,2020,...,False,0.210,False,0.0595,0.048317,0.0,,17752178,,
4473135,DK,DK660014,euMonitoringSiteCode,RW,EEA_3161-02-2,Total oxidised nitrogen,W-DIS,mg{N}/L,2020,2020,...,False,9.800,False,2.0000,3.145656,0.0,,17752179,,
4473136,DK,DK660014,euMonitoringSiteCode,RW,EEA_31615-01-7,Total nitrogen,W,mg{N}/L,2020,2020,...,False,12.000,False,2.2000,3.538933,0.0,,17752180,,
4473137,DK,DK660014,euMonitoringSiteCode,RW,CAS_14265-44-2,Phosphate,W-DIS,mg{P}/L,2020,2020,...,False,0.200,False,0.1100,0.057543,0.0,,17752181,,


In [193]:
aggregated['parameterSamplingPeriodStart'] = np.where(aggregated['parameterSamplingPeriodStart'].isna(),
                                                     aggregated['phenomenonTimeReferenceYear'],
                                                     aggregated['parameterSamplingPeriodStart'])

In [194]:
aggregated['parameterSamplingPeriodStart'].isna().sum()

0

In [195]:
aggregated['parameterSamplingPeriodEnd'] = np.where(aggregated['parameterSamplingPeriodEnd'].isna(),
                                                   aggregated['phenomenonTimeReferenceYear'],
                                                   aggregated['parameterSamplingPeriodEnd'])

In [196]:
aggregated['parameterSamplingPeriodEnd'].isna().sum()

0
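
The start and end columns need a second fill because the reference year is stored as a number, and `.str.split` returns null for non-string entries. A possible shortcut, sketched here on made-up data, converts the year to text before splitting. Note that in this variant the fallback values are strings such as '2013' rather than integers; `year_as_text`, `period` and `parts` are illustrative names.

```python
import pandas as pd

# Made-up sample: two 'start--end' periods and one missing period.
sample = pd.DataFrame({
    'phenomenonTimeReferenceYear': [2004, 2017, 2013],
    'parameterSamplingPeriod': ['2004-01--2004-12', '2017-01-01--2017-12-31', None],
})

# Fall back to the reference year as text, so .str.split only sees strings.
year_as_text = sample['phenomenonTimeReferenceYear'].astype(str)
period = sample['parameterSamplingPeriod'].fillna(year_as_text)

# n=1 limits the split to at most two parts: start and end.
parts = period.str.split('--', n=1, expand=True)
sample['parameterSamplingPeriodStart'] = parts[0]
# A bare year has no end part, so the start value is reused.
sample['parameterSamplingPeriodEnd'] = parts[1].fillna(parts[0])
print(sample)
```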

In [202]:
aggregated = aggregated.drop(['parameterSamplingPeriod', 'resultObservationStatus', 'parameterSampleDepth'],
                            axis = 1)

#### Filter to river and lake waters only (RW, LW)

In [203]:
aggregated_rw_lw = aggregated[aggregated['parameterWaterBodyCategory'].isin(['RW', 'LW'])]

In [204]:
aggregated_rw_lw

Unnamed: 0,countryCode,monitoringSiteIdentifier,monitoringSiteIdentifierScheme,parameterWaterBodyCategory,observedPropertyDeterminandCode,observedPropertyDeterminandLabel,procedureAnalysedMatrix,resultUom,phenomenonTimeReferenceYear,procedureLOQValue,...,resultQualityMeanBelowLOQ,resultMeanValue,resultQualityMaximumBelowLOQ,resultMaximumValue,resultQualityMedianBelowLOQ,resultMedianValue,resultStandardDeviationValue,UID,parameterSamplingPeriodStart,parameterSamplingPeriodEnd
0,AL,AL1,eionetMonitoringSiteCode,LW,CAS_14265-44-2,Phosphate,W,mg{P}/L,2004,,...,,0.001956,,0.002608,,0.001956,,1,2004-01,2004-12
1,AL,AL1,eionetMonitoringSiteCode,LW,CAS_14265-44-2,Phosphate,W,mg{P}/L,2005,,...,,0.033000,,0.052000,,0.030000,0.016050,2,2005-01,2005-12
2,AL,AL1,eionetMonitoringSiteCode,LW,CAS_14265-44-2,Phosphate,W,mg{P}/L,2006,0.00163,...,False,0.014861,False,0.020294,False,0.015324,0.005802,3,2006-01,2006-12
3,AL,AL1,eionetMonitoringSiteCode,LW,CAS_14265-44-2,Phosphate,W,mg{P}/L,2007,0.00163,...,False,0.014250,False,0.017118,False,0.013912,0.002409,4,2007-01,2007-12
4,AL,AL1,eionetMonitoringSiteCode,LW,CAS_14797-55-8,Nitrate,W,mg{NO3}/L,2005,,...,,0.442700,,0.752590,,0.442700,0.101800,5,2005-01,2005-12
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4550554,RO,RO85010,euMonitoringSiteCode,RW,CAS_7440-66-6,Zinc and its compounds,W-DIS,ug/L,2017,50.00000,...,True,50.000000,True,50.000000,True,50.000000,0.000000,17834956,2017-01-01,2017-12-31
4550555,RO,RO85010,euMonitoringSiteCode,RW,CAS_7440-47-3,Chromium and its compounds,W-DIS,ug/L,2017,1.00000,...,True,1.000000,True,1.000000,True,1.000000,0.000000,17834957,2017-01-01,2017-12-31
4550556,RO,RO85010,euMonitoringSiteCode,RW,CAS_7440-38-2,Arsenic and its compounds,W-DIS,ug/L,2017,0.10000,...,False,1.347500,False,2.920000,False,1.135000,1.153296,17834958,2017-01-01,2017-12-31
4550557,RO,RO85010,euMonitoringSiteCode,RW,EEA_31-02-7,Total suspended solids,W,mg/L,2017,10.00000,...,False,17.500000,False,26.000000,False,15.500000,6.137318,17834959,2017-01-01,2017-12-31


#### Save the cleaned dataset

In [205]:
aggregated.to_csv("Aggregated_cleaned.csv")

In [206]:
aggregated_rw_lw.to_csv("Aggregated_RW_LW_cleaned.csv")