First, let's import some libraries

In [None]:
import time
from datetime import datetime
from urllib.error import HTTPError
from urllib.error import URLError

import numpy as np
import pandas as pd

from erddapy import ERDDAP
from erddapy.doc_helpers import show_iframe
from erddapy import servers

import requests
import json

# Do not truncate long cell contents when displaying dataframes
pd.set_option('display.max_colwidth', None)

# Explore availability of data using ERDDAP
## What is ERDDAP?
From (https://coastwatch.pfeg.noaa.gov/erddap/information.html): *ERDDAP is a data server that gives you a simple, consistent way to download subsets of scientific datasets in common file formats and make graphs and maps.*
It lets you select and download a subset of the data, removing the need to download unnecessarily large files. You can download the data in your preferred format, regardless of the original format. It standardizes the variable names and units of position (latitude, longitude, altitude/depth) and time; it is particularly useful for time variables.

The two most common ERDDAP protocols are *tabledap* and *griddap*. Tabledap is for table-like data, like individual geolocated observations. Griddap is for gridded datasets, for example gridded climatologies or model outputs.

erddapy is a Python package that helps build ERDDAP query URLs. For more information, see the documentation: https://ioos.github.io/erddapy/
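
To make the idea of a "query URL" concrete, here is a minimal sketch of how a tabledap request is put together: the dataset ID and response format, then the requested variables, then one `&`-separated constraint per condition. The function `tabledap_url`, the server address and the dataset ID are hypothetical, and this is not how erddapy is implemented. The sketch does not quote string values or percent-encode special characters, which a real client would also need to handle.

```python
def tabledap_url(server, dataset_id, variables, constraints, response="csv"):
    """Assemble a tabledap request URL from its parts (illustrative only)."""
    # Requested variables come first, separated by commas
    query = ",".join(variables)
    # Each constraint is appended as "&<variable><operator><value>"
    for key, value in constraints.items():
        query += f"&{key}{value}"
    return f"{server}/tabledap/{dataset_id}.{response}?{query}"


url = tabledap_url(
    server="https://example.org/erddap",  # hypothetical server
    dataset_id="my_ph_dataset",           # hypothetical dataset ID
    variables=["time", "latitude", "longitude", "ph"],
    constraints={"time>=": "2017-01-01T00:00:00Z", "ph<=": 8.2},
)
print(url)
```

The keys of the `constraints` dictionary combine the variable name and the comparison operator, which is the same convention used for the erddapy constraints later in this notebook.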

## Find datasets in ERDDAP
If you do not know any ERDDAP servers, you can run a text search across 56 predefined instances here: http://erddap.com/. In addition, you can add servers you know of, or filter servers.

For an in-notebook solution, the cell below retrieves a list of ERDDAP servers and the datasets that have carbon variables, using their CF standard names. There are several standard vocabularies used in the natural sciences and the marine environment; for example, SDN codes will appear in our searches. However, CF is the most international, and the attribute name *standard_name* is standardized too, which contributes to interoperability.

**WARNING** it takes quite some time, approximately 15 minutes; the website is a lot quicker. Therefore an `exploreerddap` switch is in place.

In [None]:
exploreerddap=False

if exploreerddap:
    stdnames=['surface_partial_pressure_of_carbon_dioxide_in_sea_water',
    'fugacity_of_carbon_dioxide_in_sea_water',
    'mole_concentration_of_dissolved_inorganic_carbon_in_sea_water',
    'moles_of_dissolved_inorganic_carbon_per_unit_mass_in_sea_water',
    'sea_water_ph_reported_on_total_scale',
    'sea_water_alkalinity_expressed_as_mole_equivalent']
    starttime=time.time()
    for s in servers:
        e=ERDDAP(s)
        for v in stdnames:
            kw= {"standard_name": v}
            search_url = e.get_search_url(response="csv", **kw)
            try:
                search=pd.read_csv(search_url)
            except HTTPError:
                # Raised, for example, when the search returns no matching datasets
                print("No", v, "data in the server", s)
            except URLError:
                # The server could not be reached
                print(s,"throws a URL error")
            else:
                print(s,"has", len(set(search["tabledap"].dropna())),"tabledap datasets with",v)
                print(search[["Dataset ID","Title"]])
                if len(set(search["griddap"].dropna())) > 0:
                    print(s,"has", len(set(search["griddap"].dropna())),"griddap datasets with",v)



    endtime=time.time()-starttime
    print("\n Search took", endtime/60,"minutes")


# pH data from two ERDDAP servers: EMODnet and IFREMER
For this exercise, we will search for pH data from two European sources: EMODnet and Ifremer. pH is the most frequently measured of the inorganic carbon variables, and has one of the longest time and space coverages among them.

## List datasets that contain pH in EMODnet
Get list of table-like **datasets** that contain pH (not gridded). We search by using the *standard_name* attribute and its Climate and Forecast (CF) name.

**e** (or **i** for the IFREMER server) is a python object with multiple methods. The ones we will use are:
* response: the format of the response. We will use .csv and read those to pandas dataframes, but many others are available (.nc, .json, .html, .mat, .tsv ...)
* get_* : returns a valid ERDDAP URL. **get_search_url** provides a list of datasets according to the established criteria. **get_info_url** provides information about a particular dataset
* constraints: search and download constraints. Time, position, values of variables and/or attributes
* dataset_id: to set which dataset we will explore
* variables: set the variables to download if you do not wish to download the full dataset


In [None]:
e = ERDDAP(server="EMODnet", protocol='tabledap')
e.response= "csv"
search_url = e.get_search_url(**{
    "standard_name": "sea_water_ph_reported_on_total_scale"})
e_search = pd.read_csv(search_url)
display(e_search[["Title","Dataset ID"]])

We are interested in all the pH measurements, not only their minimum/maximum values, which leaves the `MINMAXMEAN` dataset out. We will pick the profiles dataset first.
## Metadata
Before downloading the data itself, we will get some further information about the dataset **EP_ERD_INT_PHPH_AL_PR_NRT**:
* From the ID, we know that the data is NRT (near real-time), so it will have gone through some, but not full, QC (adjustments, calibrations, etc.)
* List the variables available in the dataset. We may be interested in all, or only some of them
* List the attributes of those variables


### List of variables available
When retrieving the dataset information using **get_info_url**, we get the full list of attributes, both global (`Variable Name NC_GLOBAL`) and per variable. In this case, that results in >200 lines. It also provides information about the type and the value of each attribute.

In [None]:
e.dataset_id="EP_ERD_INT_PHPH_AL_PR_NRT"
e_info = pd.read_csv(e.get_info_url())
display(e_info)

To get only the list of variable names, we can subset by `Row Type = variable`

In [None]:
print(", ".join(e_info.loc[e_info["Row Type"] == "variable", "Variable Name"]))
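
The info table is a long-format table: one row per variable, plus one row per attribute. The sketch below reproduces that layout with a handful of made-up rows (the values are illustrative, not taken from the EMODnet dataset) to show how the `Row Type` and `Variable Name` columns are used for subsetting.

```python
import pandas as pd

# Hypothetical miniature of an ERDDAP info table; real ones have hundreds of rows
info = pd.DataFrame(
    [
        ("attribute", "NC_GLOBAL", "title", "String", "Example pH profiles"),
        ("variable", "time", None, "double", None),
        ("attribute", "time", "axis", "String", "T"),
        ("variable", "PHPH", None, "float", None),
        ("attribute", "PHPH", "long_name", "String", "Example pH"),
        ("attribute", "PHPH", "units", "String", "1"),
    ],
    columns=["Row Type", "Variable Name", "Attribute Name", "Data Type", "Value"],
)

# Rows of type "variable" give the list of variable names
variable_names = info.loc[info["Row Type"] == "variable", "Variable Name"].tolist()

# Rows of type "attribute" for one variable give its metadata
ph_attributes = info.loc[
    (info["Row Type"] == "attribute") & (info["Variable Name"] == "PHPH"),
    ["Attribute Name", "Value"],
]

print(variable_names)
print(ph_attributes)
```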

From the list we see the variable names as EMODnet stores them, information we need to retrieve their attributes. Let's check the time range available and the attributes for the variables `PHPH` and `PHPH_QC`.

In [None]:
# Time coverage
print('Time coverage for the dataset is from', e_info.loc[e_info["Attribute Name"] == "time_coverage_start", "Value"].item(),
       'to' , e_info.loc[e_info["Attribute Name"] == "time_coverage_end", "Value"].item())
# pH attributes
print()
print('The attributes of the variable PHPH (pH) are')
display(e_info.loc[e_info["Variable Name"] == "PHPH", :])
print('The attributes of the variable PHPH_QC (pH quality control flag) are')
display(e_info.loc[e_info["Variable Name"] == "PHPH_QC", :])

Information about the type of platform that recorded those data points is interesting too, especially if we compare with other sources of pH data. EMODnet has its own codes for platform type, stored in the variable `EP_PLATFORM_TYPE`. `wmo_platform_code` can be of interest, too.

In [None]:
display(e_info.loc[e_info["Variable Name"] == "EP_PLATFORM_TYPE", :])

Position and time variables have a particular attribute: `axis`. For geolocated variables, latitude and longitude are usually the `Y` and `X` axes, respectively. For ocean data, `Z` variables can be depth and/or pressure. Let's see which are the axis variables for EMODnet pH profile data:

In [None]:
display(e_info.loc[e_info["Attribute Name"] == "axis", ["Variable Name", "Value"]])
# To get them as list
e_varaxisname=e.get_var_by_attr(axis=lambda v: v in["X","Y","Z","T"])
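
Selecting variables by attribute with a callable, as in the cell above, can be pictured with plain dictionaries. The following is a simplified sketch of the idea, not erddapy's implementation: the helper `vars_by_attr` and the variable dictionary are hypothetical, and we assume a missing attribute is passed to the callable as `None`.

```python
def vars_by_attr(variables, **criteria):
    """Return the names of variables whose attributes satisfy every criterion.

    A criterion is either a plain value (tested for equality) or a callable
    that receives the attribute value and returns True or False.
    """
    selected = []
    for name, attrs in variables.items():
        for key, wanted in criteria.items():
            value = attrs.get(key)  # None when the attribute is missing
            matches = wanted(value) if callable(wanted) else value == wanted
            if not matches:
                break
        else:
            # No criterion failed, so the variable is kept
            selected.append(name)
    return selected


# Hypothetical variables and their attributes
variables = {
    "time": {"axis": "T"},
    "latitude": {"axis": "Y"},
    "longitude": {"axis": "X"},
    "depth": {"axis": "Z"},
    "PHPH": {"long_name": "pH"},
}

axis_vars = vars_by_attr(variables, axis=lambda v: v in ["X", "Y", "Z", "T"])
print(axis_vars)
```

The callable form is convenient here because a single query covers all four axis values, instead of one equality query per axis.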

## pH data from IFREMER ERDDAP servers
We will now go through similar steps to find and retrieve data from the IFREMER ERDDAP server (http://www.ifremer.fr/erddap/).
They also use the attribute *cf_standard_name* in addition to *standard_name*; searching by it returns more datasets.

In [None]:
i = ERDDAP(server="IFREMER", protocol='tabledap')
search_url = i.get_search_url(response="csv", **{
    "cf_standard_name": "sea_water_ph_reported_on_total_scale",
    })
i_search = pd.read_csv(search_url)
display(i_search[["Title","Dataset ID"]])

Let's explore the BGC Argo dataset. Datasets with an automatic data flow usually have a lot of variables (in this particular case, in the hundreds!), because they record and distribute not only the final, calculated variables, but also a lot of device diagnostics and the variables that are used to do data reduction to the final variables. We will see an example in pH.

In [None]:
i.dataset_id="ArgoFloats-synthetic-BGC"
i.response="csv"

i_info = pd.read_csv(i.get_info_url())
print(i.dataset_id,"contains",
      str(i_info.loc[i_info["Row Type"] == "variable", "Variable Name"].shape[0]),"variables.")

display(i_info.loc[(i_info["Attribute Name"] == "long_name") & 
                         (i_info["Variable Name"].str.contains('(?:^|_)ph(?:_|$)'))
                         , ["Variable Name", "Value"]])

print(i.get_var_by_attr(cf_standard_name="sea_water_ph_reported_on_total_scale"),
     "is the variable with the cf_standard_name attribute")
print("\n")
print('Time coverage for the dataset is from', i_info.loc[i_info["Attribute Name"] == "time_coverage_start", "Value"].item(),
       'to' , i_info.loc[i_info["Attribute Name"] == "time_coverage_end", "Value"].item())
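
The regular expression used above, `(?:^|_)ph(?:_|$)`, keeps only names in which `ph` is a whole underscore-delimited token, so that names merely containing the letters "ph" are not picked up. A small check with the standard `re` module illustrates the behaviour; the variable names in the list are illustrative and not necessarily present in the dataset.

```python
import re

# "ph" must be preceded by the start of the name or "_",
# and followed by "_" or the end of the name
pattern = re.compile(r"(?:^|_)ph(?:_|$)")

# Illustrative names only
names = [
    "ph_in_situ_total",
    "ph_in_situ_free",
    "vrs_ph",
    "temp_ph_qc",
    "phase_delay_doxy",
    "bphase_doxy",
]

matches = [name for name in names if pattern.search(name)]
print(matches)
```

Names such as `phase_delay_doxy` are rejected because the characters after `ph` are neither an underscore nor the end of the string.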


We see that BGC-Argo distributes values of voltages, current, and sensor temperatures, which in themselves are not of scientific interest, but are necessary to calculate pH in situ values. They provide pH in three different scales (total, free and seawater scales); we will pick the total scale.

Two variables are candidates for the pH values we want to explore: `ph_in_situ_total` and `ph_in_situ_total_adjusted` (which we add manually), together with their corresponding QC flag variables. Data measured by Argo floats may need certain adjustments, for example due to drift (more information in [Bittig *et al.* (2019)](https://doi.org/10.3389/fmars.2019.00502)). The preservation of both "raw" and adjusted data is necessary in order to maintain reproducibility.

We will keep both, too, and decide at a later stage what data to use, depending on the nature of the other data sources. For that, we need their attributes.

In [None]:
for vph in i.get_var_by_attr(cf_standard_name="sea_water_ph_reported_on_total_scale"):
    print("The attributes of the variable", vph, "(pH) are")
    display(i_info.loc[i_info["Variable Name"] == vph, :])
    print("The attributes of the variable", vph+"_qc", "(pH quality control flag) are")
    display(i_info.loc[i_info["Variable Name"] == vph+"_qc", :])

print("The attributes of the variable", vph+"_adjusted", "(pH, adjusted) are")
display(i_info.loc[i_info["Variable Name"] == vph+"_adjusted", :])

As with EMODnet, let's get some platform information. We know from the dataset name that the measuring devices are Argo floats. In addition, we can explore what other platform metadata exists in the dataset. Both `platform_type` and `platform_number` are of interest. The Argo Users Manual (https://doi.org/10.13155/29825) contains information on the Argo reference tables mentioned in the attributes.

In [None]:
display(i_info.loc[(i_info["Variable Name"].str.contains('(?:^|_)platform(?:_|$)'))
                 , ["Variable Name", "Attribute Name","Value"]])

Finally, let's check the axis variables. The main difference from EMODnet is that BGC-Argo does not use depth as the Z-axis variable, but pressure and adjusted pressure.

In [None]:
display(i_info.loc[i_info["Attribute Name"] == "axis", ["Variable Name", "Value"]])
# To get them as list
i_varaxisname=i.get_var_by_attr(axis=lambda v: v in["X","Y","Z","T"])

# Download the data and export to a local file
After exploring the variables available and their metadata, we will now proceed to download the data, filtering by a particular time frame and pH data of quality "good" and "probably good". Since the datasets contain NRT data, it is likely that many points have been flagged probably good, waiting for a manual QC.

We set the variables to be downloaded (geolocation, time and pH, and some platform metadata) and the constraints in the erddap objects (**e** and **i**). Using the method `to_pandas` we can retrieve the data into a pandas dataframe. Other methods are `to_xarray` (particularly useful for gridded data) or `to_ncCF`.

### EMODnet

In [None]:
# Variables to download
e.variables=e_varaxisname + ['PHPH','PHPH_QC', 'EP_PLATFORM_TYPE', 'wmo_platform_code','data_assembly_center']

# Constraints
e.constraints= {
    "time>=": "2017-01-01T00:00:00Z",
    "time<=": "2017-12-31T23:59:59Z",
    "PHPH_QC>=": 1, # good data
    "PHPH_QC<=": 2} # probably good data

dtype=object
df_emodnet = e.to_pandas()

# Print the variables retrieved
print("The EMODNet dataframe variables retrieved are", df_emodnet.columns.values)
print("\n")
# Size of the dataframe
print("The size of the EMODnet dataframe is", df_emodnet.shape[0],"rows (data points) and",df_emodnet.shape[1],"columns (variables)")

### BGC-Argo (IFREMER)

In [None]:
# List of variables to download
i.variables=i_varaxisname + ['ph_in_situ_total', 'ph_in_situ_total_qc','ph_in_situ_total_adjusted', 'ph_in_situ_total_adjusted_qc',
                            'platform_type','platform_number']

i.constraints= {
    "time>=": "2017-01-01T00:00:00Z",
    "time<=": "2017-12-31T23:59:59Z",
    "ph_in_situ_total_qc=~": "(1|2)"} # good data and probably good data
        
dtype=object
df_ifremer = i.to_pandas()

# Print the variables retrieved
print("The BGC-Argo (IFREMER) dataframe variables retrieved are", df_ifremer.columns.values)
print("\n")
# Size of the dataframe
print("The size of the BGC-Argo (IFREMER) dataframe is", df_ifremer.shape[0],"rows (data points) and",df_ifremer.shape[1],"columns (variables)")
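
The two downloads express the same QC filter in two ways: a numeric range (`>=` and `<=`) for EMODnet, and a regular expression (`=~`) for BGC-Argo. The sketch below applies both styles locally to a list of made-up flag values, assuming that a regular-expression constraint has to match the whole value rather than a part of it.

```python
import re

# Made-up QC flags, stored as text
flags = ["1", "2", "3", "4", "12", "21"]

# Regular-expression style, as in "ph_in_situ_total_qc=~": "(1|2)"
# (fullmatch reflects the whole-value assumption stated above)
kept_regex = [f for f in flags if re.fullmatch(r"(1|2)", f)]

# Range style, as in "PHPH_QC>=": 1 together with "PHPH_QC<=": 2
kept_range = [f for f in flags if 1 <= int(f) <= 2]

print(kept_regex)
print(kept_range)
```

The regular expression is the natural choice when the flag is stored as text, while the range form only makes sense for flags stored as numbers.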


The WMO platform code is a 5- or 7-digit code assigned to platforms that transmit on the Global Telecommunication System (GTS; for more information, see https://www.ocean-ops.org/dbcp/data/sharing.html). Ocean-OPS (https://www.ocean-ops.org/board) manages an extensive metadata catalogue of marine observation deployments, and it is possible to explore it through an API. Here we will retrieve the device names for the `df_emodnet` dataframe by querying the Ocean-OPS API. This kind of information is already provided in the `df_ifremer` dataframe, in the `platform_type` variable.

In [None]:
# Retrieve device types
for e_wmocode in df_emodnet.wmo_platform_code.dropna().unique():
    url = ("https://www.ocean-ops.org/api/preview/data/platform?exp=[%22wmos.wmo=%27"
        +str(e_wmocode)+"%27%22]&include=[%22ref%22,%22wmos.wmo%22]&include=[%22ptfModel.name%22]")
    rdata = requests.get(url)
    jdata= json.loads(rdata.text)
    pn=jdata['data'][0]['ptfModel']['name']
    df_emodnet.loc[df_emodnet.wmo_platform_code==e_wmocode,'platform_name_OceanOPS']=pn
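
The loop above assumes that every WMO code returns at least one platform record with a model name. A more defensive way of reading the response is sketched below. The helper `platform_model_name` is hypothetical, and the payload layout is inferred only from the keys used in the cell above (`data`, `ptfModel`, `name`); the sample payloads are made up.

```python
import json


def platform_model_name(payload):
    """Return the model name of the first platform record, or None if absent."""
    records = payload.get("data") or []
    if not records:
        return None
    model = records[0].get("ptfModel") or {}
    return model.get("name")


# Made-up payloads that follow the layout assumed above
found = json.loads('{"data": [{"ref": "0000000", "ptfModel": {"name": "EXAMPLE_FLOAT"}}]}')
empty = json.loads('{"data": []}')
no_model = json.loads('{"data": [{"ref": "0000000", "ptfModel": null}]}')

print(platform_model_name(found))
print(platform_model_name(empty))
print(platform_model_name(no_model))
```

Returning `None` instead of raising an error leaves the corresponding rows without a platform name, so they can be inspected after the loop finishes.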


## Unify column names

We will rename the common columns the same way, regardless of the data source. This will help when merging the dataframes. We also add a `SOURCE` variable that indicates where the data came from.
For this, we create a dictionary, so that we can change the variable names in one place, if necessary, instead of in each dataframe. We will also force some variables to a particular type (numeric/string).

In [None]:
vardict ={'datevec':'TIME',
          'lat':'LATITUDE','lon':'LONGITUDE','dep':'DEPTH', 'pres':'PRES',
          'ph': 'PHPH','phf': 'PHPH_QC',
          'wmo': 'WMO', 'platf': 'PLATFORM_NAME_OCEANOPS'}

In [None]:
df_emodnet.rename(
    columns={'latitude (degrees_north)': vardict['lat'], 'longitude (degrees_east)': vardict['lon'],
             'pres (dbar)': vardict['pres'], 'time (UTC)': vardict['datevec'],
             'PHPH (1)': vardict['ph'], 'PHPH_QC (1)': vardict['phf'],
             'wmo_platform_code': vardict['wmo'], 'platform_name_OceanOPS': vardict['platf']},
        inplace=True)
df_emodnet[vardict['wmo']]=df_emodnet[vardict['wmo']].astype(str, copy=False)

df_ifremer.rename(
    columns={'latitude (degrees_north)': vardict['lat'], 'longitude (degrees_east)': vardict['lon'],
             'pres (decibar)': vardict['pres'], 'time (UTC)': vardict['datevec'],
             'ph_in_situ_total (dimensionless)': vardict['ph'], 'ph_in_situ_total_qc': vardict['phf'],
             'platform_number': vardict['wmo'], 'platform_type': vardict['platf']},
        inplace=True)
df_ifremer[vardict['wmo']]=df_ifremer[vardict['wmo']].astype(str, copy=False)

df_emodnet['SOURCE']='EMODnet pH profiles'
df_ifremer['SOURCE']='BGC-Argo (IFREMER)'
df_ifremer['EP_PLATFORM_TYPE']='PF'
df_ifremer['data_assembly_center']='Ifremer'
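
To see why shared column names matter, the toy example below renames two small, made-up dataframes with a common dictionary and stacks them with `pd.concat`. The values and the `names` dictionary are illustrative; only the pattern mirrors the cell above.

```python
import pandas as pd

# Shared target names, kept in a single place
names = {"lat": "LATITUDE", "ph": "PHPH"}

# Two made-up sources with different column names for the same quantities
df_a = pd.DataFrame({"latitude (degrees_north)": [43.5], "PHPH (1)": [8.05]})
df_b = pd.DataFrame(
    {"latitude (degrees_north)": [-12.0], "ph_in_situ_total (dimensionless)": [7.98]}
)

df_a = df_a.rename(
    columns={"latitude (degrees_north)": names["lat"], "PHPH (1)": names["ph"]}
)
df_b = df_b.rename(
    columns={
        "latitude (degrees_north)": names["lat"],
        "ph_in_situ_total (dimensionless)": names["ph"],
    }
)

# Keep track of the origin of each row
df_a["SOURCE"] = "source A"
df_b["SOURCE"] = "source B"

# With identical column names, the rows line up in the same columns
merged = pd.concat([df_a, df_b], ignore_index=True)
print(merged)
```

Without the renaming step, the concatenated table would hold two separate pH columns, each half filled with missing values.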

We will save the downloaded data as separate csv files, in the same folder as the Jupyter notebooks. The retrieval date at the beginning of the file name is important: these datasets are updated regularly, and it is crucial to know when they were downloaded for reproducibility purposes.

In [None]:
dateforfile=datetime.now().strftime("%Y%m%d")
df_emodnet.to_csv(dateforfile+'_pH_data_ERDDAP_EMODNetpHProf.csv', index=False)
df_ifremer.to_csv(dateforfile+'_pH_data_ERDDAP_BGCArgoIFREMER.csv', index=False)
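
Because the date is written as `YYYYMMDD` at the start of the name, the files sort chronologically and the retrieval date can be read back from the name later. A minimal sketch, with hypothetical helper names and a made-up date:

```python
from datetime import datetime


def stamped_name(basename, when=None):
    """Prefix a file name with the retrieval date as YYYYMMDD."""
    when = when or datetime.now()
    return f"{when.strftime('%Y%m%d')}_{basename}"


def retrieval_date(filename):
    """Read the retrieval date back from a stamped file name."""
    return datetime.strptime(filename[:8], "%Y%m%d").date()


name = stamped_name("pH_data_example.csv", datetime(2021, 3, 9))
print(name)
print(retrieval_date(name))
```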