<a href="https://colab.research.google.com/github/MathewBiddle/sandbox/blob/main/IOOS_data_at_NCEI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using NCEI geoportal REST API to collect information about IOOS Regional Association archived data

Created: 2017-06-12

IOOS Regional Associations archive their non-federal observational data with NOAA's National Centers for Environmental Information (NCEI). In this notebook we use the [RESTful](https://github.com/Esri/geoportal-server/wiki/REST-API-Syntax) services of the [NCEI geoportal](https://www.ncei.noaa.gov/metadata/geoportal/#searchPanel) to collect metadata from the archive packages held at NCEI. The metadata are stored in [ISO 19115-2](https://wiki.earthdata.nasa.gov/display/NASAISO/ISO+19115-2) XML files, which the NCEI geoportal uses for discovery of Archival Information Packages (AIPs). This example uses the ISO metadata records to display publication information and to plot the time coverage of each AIP at NCEI that meets the search criteria.

First we update the namespaces dictionary from owslib to include the appropriate namespace reference for gmi and gml. 

For more information on ISO Namespaces see:  https://geo-ide.noaa.gov/wiki/index.php?title=ISO_Namespaces

In [1]:
try:
  import owslib
except ImportError:
  !pip install owslib

In [2]:
from owslib.iso import namespaces

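# Add the gmi and gml namespaces to the namespaces dictionary.
namespaces.update({"gmi": "http://www.isotc211.org/2005/gmi"})
namespaces.update({"gml": "http://www.opengis.net/gml/3.2"})
# Remove the default (None) entry, if present, before using the dictionary with ElementTree.
namespaces.pop(None, None)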
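
To illustrate why the prefix-to-URI mapping matters, here is a minimal sketch that parses a tiny hand-written XML fragment (not a real NCEI record) with `xml.etree.ElementTree`. The `ns` dictionary is a stand-in for the owslib `namespaces` dictionary, and `SAMPLE` is made up for this illustration. A prefixed path such as `gmd:dateStamp` only resolves when its prefix maps to the URI declared in the document.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration only; real ISO 19115-2 records are much larger.
SAMPLE = """
<gmi:MI_Metadata xmlns:gmi="http://www.isotc211.org/2005/gmi"
                 xmlns:gmd="http://www.isotc211.org/2005/gmd"
                 xmlns:gco="http://www.isotc211.org/2005/gco">
  <gmd:dateStamp><gco:Date>2021-03-03</gco:Date></gmd:dateStamp>
</gmi:MI_Metadata>
"""

# Stand-in for the owslib namespaces dictionary (prefix -> namespace URI).
ns = {
    "gmi": "http://www.isotc211.org/2005/gmi",
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}

root = ET.fromstring(SAMPLE.strip())

# The prefixes in the path are resolved through the ns dictionary.
date_node = root.find(".//gmd:dateStamp/gco:Date", ns)
print(date_node.text)
```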

## Now we select a Regional Association and platform
This is where the user identifies the Regional Association (RA) and the platform type they are interested in. Set `ra` to the acronym of the RA of interest, or to `None` to collect metadata about all IOOS non-federal observation data archived through the NCEI-IOOS pipeline.

The options for platform include: `"HF Radar"`, `"Glider"`, and `"FIXED PLATFORM"`. 

In [3]:
# Select RA, this will be the acronym for the RA or None if you want to search across all RAs
ra = None

# Identify the platform.
platform = '"FIXED PLATFORM"' # Options include: None, "HF Radar", "Glider", "FIXED PLATFORM"

## Next we generate a geoportal query and request a CSV response
To find more information about how to compile a geoportal query, have a look at [REST API Syntax](https://github.com/Esri/geoportal-server/wiki/REST-API-Syntax) and the [NCEI Search Tips](https://www.nodc.noaa.gov/search/granule/catalog/searchtips/searchtips.page) for the [NCEI geoportal](https://data.nodc.noaa.gov/geoportal/catalog/search/search.page). The example provided is specific to the NCEI-IOOS data pipeline project and only searches for non-federal timeseries data collected by each Regional Association.

The query developed here can be adapted to search for any Archival Information Packages at NCEI. To do so, develop the appropriate query using the [NCEI Geoportal](https://data.nodc.noaa.gov/geoportal/catalog/search/search.page) and update this portion of the code with the resulting REST API URL.

In [6]:
try:
    from urllib.parse import quote
except ImportError:
    from urllib import quote

# Generate the geoportal query URL.

# Base geoportal url.
baseurl = "https://www.ncei.noaa.gov/metadata/geoportal/opensearch?q="

# Identify the Regional Association
if ra is None:
    reg_assoc = ''
else:
    RAs = {
        "AOOS": "Alaska Ocean Observing System",
        "CARICOOS": "Caribbean Coastal Ocean Observing System",
        "CeNCOOS": "Central and Northern California Coastal Ocean Observing System",
        "GCOOS": "Gulf of Mexico Coastal Ocean Observing System",
        "GLOS": "Great Lakes Observing System",
        "MARACOOS": "Mid-Atlantic Regional Association Coastal Ocean Observing System",
        "NANOOS": "Northwest Association of Networked Ocean Observing Systems",
        "NERACOOS": "Northeastern Regional Association of Coastal Ocean Observing System",
        "PacIOOS": "Pacific Islands Ocean Observing System",
        "SCCOOS": "Southern California Coastal Ocean Observing System",
        "SECOORA": "Southeast Coastal Ocean Observing Regional Association",
        }
    reg_assoc = '(dataThemeinstitutions_s:"%s" dataThemeprojects_s:"%s (%s)")'%(RAs[ra], RAs[ra], ra)

# Identify the project.
project = '"Integrated Ocean Observing System Data Assembly Centers Data Stewardship Program"'

# Identify the number of records to return: records 1 to 1010.
records = "&start=1&num=1010"

# Identify the format of the response: csv.
response_format = "&f=csv"

# Combine the Regional Association and platform terms, then percent-encode them.
if platform is not None:
  if ra is not None:
    reg_assoc_plat = quote(reg_assoc + ' AND ' + platform)
  else:
    reg_assoc_plat = quote(platform)
else:
  reg_assoc_plat = quote(reg_assoc)

# Combine the URL.
url = "{}{}{}{}".format(baseurl, reg_assoc_plat, '&filter=dataThemeprojects_s:', quote(project) + records + response_format)

print("Identified response format:\n{}".format(url))
print(
    "\nSearch page response:\n{}".format(url.replace(response_format, "&f=searchPage"))
)

Identified response format:
https://www.ncei.noaa.gov/metadata/geoportal/opensearch?q=%22FIXED%20PLATFORM%22&filter=dataThemeprojects_s:%22Integrated%20Ocean%20Observing%20System%20Data%20Assembly%20Centers%20Data%20Stewardship%20Program%22&start=1&num=1010&f=csv

Search page response:
https://www.ncei.noaa.gov/metadata/geoportal/opensearch?q=%22FIXED%20PLATFORM%22&filter=dataThemeprojects_s:%22Integrated%20Ocean%20Observing%20System%20Data%20Assembly%20Centers%20Data%20Stewardship%20Program%22&start=1&num=1010&f=searchPage
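
A quick way to check a generated URL is to decode it back into its query parameters. The sketch below does this with the standard library for the URL printed above. The variable names `example_url` and `params` are ours and are not part of the notebook.

```python
from urllib.parse import parse_qs, urlsplit

# URL copied from the "Identified response format" output above.
example_url = (
    "https://www.ncei.noaa.gov/metadata/geoportal/opensearch"
    "?q=%22FIXED%20PLATFORM%22"
    "&filter=dataThemeprojects_s:%22Integrated%20Ocean%20Observing%20System"
    "%20Data%20Assembly%20Centers%20Data%20Stewardship%20Program%22"
    "&start=1&num=1010&f=csv"
)

# parse_qs percent-decodes each value and groups values by parameter name.
params = parse_qs(urlsplit(example_url).query)

for key, values in params.items():
    print(key, "=", values[0])
```

If the decoded `q` or `filter` value does not read the way you intended, the problem is in the query string that was passed to `quote`.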


## Time to query the portal and parse out the csv response
Here we request the URL built above and, because we asked for CSV, read the response directly into a pandas DataFrame. We also split the `Data_Date_Range` column into two columns, `data_start_date` and `data_end_date`, so the start and end of each record's time coverage are available as datetimes. One millisecond is added to each end date so that it falls on the following midnight.

In [7]:
import pandas as pd
import numpy as np

df = pd.read_csv(url)

df[['data_start_date','data_end_date']] = df['Data_Date_Range'].str.split(' to ',expand=True)
df['data_start_date'] = pd.to_datetime(df['data_start_date'])
df['data_end_date'] = pd.to_datetime(df['data_end_date']) + pd.Timedelta(np.timedelta64(1, "ms")) 

df.head()

Unnamed: 0,Id,Title,Description,West,South,East,North,Link_Xml,Link_1,Link_2,Link_3,Link_4,Data_Date_Range,Date_Published,data_start_date,data_end_date
0,gov.noaa.nodc:0226059,"Biological, chemical, physical and time series...",,-157.865,21.2887,-157.865,21.2887,http://www.ncei.noaa.gov/metadata/geoportal/re...,,,,,2008-08-07T00:00:00Z to 2017-01-04T23:59:59.999Z,2021-03-03T00:00:00Z,2008-08-07 00:00:00+00:00,2017-01-05 00:00:00+00:00
1,gov.noaa.nodc:0225979,"Biological, chemical, physical and time series...",,-157.848,21.2799,-157.848,21.2799,http://www.ncei.noaa.gov/metadata/geoportal/re...,,,,,2008-06-06T00:00:00Z to 2016-12-06T23:59:59.999Z,2021-03-03T00:00:00Z,2008-06-06 00:00:00+00:00,2016-12-07 00:00:00+00:00
2,gov.noaa.nodc:0209162,"Biological, chemical, physical, and time serie...",,-155.8285,20.02415,-155.8285,20.02415,http://www.ncei.noaa.gov/metadata/geoportal/re...,,,,,2018-03-10T00:00:00Z to 2020-12-31T23:59:59.999Z,2020-03-03T00:00:00Z,2018-03-10 00:00:00+00:00,2021-01-01 00:00:00+00:00
3,gov.noaa.nodc:0208370,"Meteorological, physical, and time series data...",,-157.7863,21.433,-157.7863,21.433,http://www.ncei.noaa.gov/metadata/geoportal/re...,,,,,2005-01-01T00:00:00Z to 2020-12-19T23:59:59.999Z,2020-01-15T00:00:00Z,2005-01-01 00:00:00+00:00,2020-12-20 00:00:00+00:00
4,gov.noaa.nodc:0208388,"Biological, chemical, physical, and time serie...",,-155.082,19.7341,-155.082,19.7341,http://www.ncei.noaa.gov/metadata/geoportal/re...,,,,,2010-10-23T00:00:00Z to 2020-12-31T23:59:59.999Z,2020-01-17T00:00:00Z,2010-10-23 00:00:00+00:00,2021-01-01 00:00:00+00:00
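
To see the date-range handling in isolation, without querying the geoportal, here is a small sketch that applies the same split to two `Data_Date_Range` values copied from the output above. The DataFrame name `sample` is ours.

```python
import pandas as pd

# Two Data_Date_Range values copied from the output above.
sample = pd.DataFrame(
    {
        "Data_Date_Range": [
            "2008-08-07T00:00:00Z to 2017-01-04T23:59:59.999Z",
            "2008-06-06T00:00:00Z to 2016-12-06T23:59:59.999Z",
        ]
    }
)

# Split on the literal " to " separator into two new columns.
sample[["data_start_date", "data_end_date"]] = sample["Data_Date_Range"].str.split(
    " to ", expand=True
)

sample["data_start_date"] = pd.to_datetime(sample["data_start_date"])
# Adding 1 ms moves the 23:59:59.999 end stamp to the following midnight.
sample["data_end_date"] = pd.to_datetime(sample["data_end_date"]) + pd.Timedelta(
    milliseconds=1
)

print(sample[["data_start_date", "data_end_date"]])
```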


Now, let's pull out all the ISO metadata record links and print them so the user can browse to each metadata record and look for items of interest.

In [8]:
# Print the ISO record link and the NCEI landing page for each record in the response.

print("Found %i record(s)" % len(df))
for index, row in df.iterrows():
    print('ISO19115-2 record:',row['Link_Xml'])  # URL to ISO19115-2 record.
    print('NCEI dataset metadata page: https://www.ncei.noaa.gov/access/metadata/landing-page/bin/iso?id=' + row['Id'] )
    print('\n')

Found 214 record(s)
ISO19115-2 record: http://www.ncei.noaa.gov/metadata/geoportal/rest/metadata/item/gov.noaa.nodc%3A0226059/xml
NCEI dataset metadata page: https://www.ncei.noaa.gov/access/metadata/landing-page/bin/iso?id=gov.noaa.nodc:0226059


ISO19115-2 record: http://www.ncei.noaa.gov/metadata/geoportal/rest/metadata/item/gov.noaa.nodc%3A0225979/xml
NCEI dataset metadata page: https://www.ncei.noaa.gov/access/metadata/landing-page/bin/iso?id=gov.noaa.nodc:0225979


ISO19115-2 record: http://www.ncei.noaa.gov/metadata/geoportal/rest/metadata/item/gov.noaa.nodc%3A0209162/xml
NCEI dataset metadata page: https://www.ncei.noaa.gov/access/metadata/landing-page/bin/iso?id=gov.noaa.nodc:0209162


ISO19115-2 record: http://www.ncei.noaa.gov/metadata/geoportal/rest/metadata/item/gov.noaa.nodc%3A0208370/xml
NCEI dataset metadata page: https://www.ncei.noaa.gov/access/metadata/landing-page/bin/iso?id=gov.noaa.nodc:0208370


ISO19115-2 record: http://www.ncei.noaa.gov/metadata/geoportal/rest/

## Let's collect what we have found
Now that we have all the ISO metadata records we are interested in, it's time to do something fun with them. In this example we want to generate a timeseries plot of the data coverage for the stations archived at NCEI that match the search above.

First we need to collect some information. We loop through each ISO record to collect metadata about each package. The example here shows how to collect the following items:
   1. NCEI Archival Information Package (AIP) Accession ID (7-digit Accession Number) 
   2. The first date the archive package was published.
   3. The platform code identified from the provider.
   4. The version number and date it was published.
   5. The current AIP size, in MB.

There are plenty of other metadata elements to collect from the ISO records, so we recommend browsing to one of the records and having a look at the items of interest to your community.

```
<gmd:onLine>
<gmd:CI_OnlineResource>
<gmd:linkage>
<gmd:URL>https://www.ncei.noaa.gov/thredds-ocean/catalog/ioos/glos/ESF1/catalog.html</gmd:URL>
</gmd:linkage>
<gmd:protocol>
<gco:CharacterString>THREDDS</gco:CharacterString>
```
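
The fragment above is the part of a record that points to the THREDDS catalog. As a minimal offline sketch of how such an entry can be picked out, the code below parses a hand-completed copy of that fragment. The closing tags, the wrapping element, and the second (non-THREDDS) entry with its `example.org` URL are assumptions added so the sample is well-formed. The helper name `thredds_urls` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-completed copy of the fragment above, for illustration only.
SAMPLE = """
<gmd:MD_DigitalTransferOptions xmlns:gmd="http://www.isotc211.org/2005/gmd"
                               xmlns:gco="http://www.isotc211.org/2005/gco">
  <gmd:onLine>
    <gmd:CI_OnlineResource>
      <gmd:linkage>
        <gmd:URL>https://www.ncei.noaa.gov/thredds-ocean/catalog/ioos/glos/ESF1/catalog.html</gmd:URL>
      </gmd:linkage>
      <gmd:protocol>
        <gco:CharacterString>THREDDS</gco:CharacterString>
      </gmd:protocol>
    </gmd:CI_OnlineResource>
  </gmd:onLine>
  <gmd:onLine>
    <gmd:CI_OnlineResource>
      <gmd:linkage>
        <gmd:URL>https://example.org/not-thredds</gmd:URL>
      </gmd:linkage>
      <gmd:protocol>
        <gco:CharacterString>HTTPS</gco:CharacterString>
      </gmd:protocol>
    </gmd:CI_OnlineResource>
  </gmd:onLine>
</gmd:MD_DigitalTransferOptions>
"""

ns = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}


def thredds_urls(root):
    """Return the URL of every gmd:onLine entry whose protocol is THREDDS."""
    urls = []
    for online in root.iterfind(".//gmd:onLine", ns):
        protocol = online.find(".//gmd:protocol/gco:CharacterString", ns)
        if protocol is not None and protocol.text == "THREDDS":
            urls.append(online.find(".//gmd:URL", ns).text)
    return urls


print(thredds_urls(ET.fromstring(SAMPLE.strip())))
```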

In [15]:
# Process each iso record.
%matplotlib inline

from datetime import datetime
import xml.etree.ElementTree as ET
from owslib import util
from urllib.request import urlopen

df_out = pd.DataFrame()

df[['provider_platform_name','NCEI_accession_number','package_size_mb','submitter']] = ''

# For each accession in response.
for url in df['Link_Xml']:

    iso = urlopen(url)
    iso_tree = ET.parse(iso)
    root = iso_tree.getroot()

    vers_dict = dict()
    
    # Collect Publication date information.
    date_path = (
        ".//"
        "gmd:identificationInfo/"
        "gmd:MD_DataIdentification/"
        "gmd:citation/"
        "gmd:CI_Citation/"
        "gmd:date/"
        "gmd:CI_Date/"
        "gmd:date/gco:Date"
    )
    # First published date.
    pubdate = root.find(date_path, namespaces)
    print("\nFirst published date = %s" % util.testXMLValue(pubdate))
    
    # Data Temporal Coverage.
    temporal_extent_path = (
        ".//"
        "gmd:temporalElement/"
        "gmd:EX_TemporalExtent/"
        "gmd:extent/"
        "gml:TimePeriod"
    
    )

    # Collect the THREDDS catalog URL from each online distribution entry.
    for distribution in root.iterfind('.//gmd:onLine', namespaces):
      for service in distribution.iterfind('.//gmd:protocol/gco:CharacterString', namespaces):
        if service.text == "THREDDS":
          thredds_url = distribution.find('.//gmd:URL', namespaces).text
          df_out = pd.concat([df_out, pd.DataFrame({'thredds': [thredds_url]})], ignore_index=True)
          print(thredds_url)
    
    # beginPosition = root.find(temporal_extent_path + '/gml:beginPosition', namespaces).text
    # endPosition = root.find(temporal_extent_path + '/gml:endPosition', namespaces).text
    
    # print("Data time coverage: %s to %s" % (beginPosition, endPosition))

    # # Collect keyword terms of interest.
    # for MD_keywords in root.iterfind('.//gmd:descriptiveKeywords/gmd:MD_Keywords', namespaces):

    #     for thesaurus_name in MD_keywords.iterfind('.//gmd:thesaurusName/gmd:CI_Citation/gmd:title/gco:CharacterString', namespaces):
            
    #         if thesaurus_name.text == "Provider Platform Names":

    #             plat_name = MD_keywords.find('.//gmd:keyword/gco:CharacterString', namespaces).text
    #             print("Provider Platform Code = %s" % plat_name)
    #             df.loc[df.Link_Xml == url, ['provider_platform_name']] = plat_name
    #             break
                
    #         elif thesaurus_name.text == "NCEI ACCESSION NUMBER":
    #             acce_no = MD_keywords.find('.//gmd:keyword/gmx:Anchor', namespaces).text
    #             print("Accession:",acce_no)
    #             df.loc[df.Link_Xml == url, ['NCEI_accession_number']] = acce_no
    #             break
            
    #         elif thesaurus_name.text == "NODC SUBMITTING INSTITUTION NAMES THESAURUS":
    #             submitter = MD_keywords.find('.//gmd:keyword/gmx:Anchor', namespaces).text
    #             print("Submitter:", submitter)
    #             df.loc[df.Link_Xml == url, ['submitter']] = submitter
            
    # # Pull out the version information.
    # # Iterate through each processing step which is an NCEI version.
    # for process_step in root.iterfind(".//gmd:processStep", namespaces):
    #     # Only parse gco:DateTime and gmd:title/gco:CharacterString.
    #     vers_title = (
    #         ".//"
    #         "gmi:LE_ProcessStep/"
    #         "gmi:output/"
    #         "gmi:LE_Source/"
    #         "gmd:sourceCitation/"
    #         "gmd:CI_Citation/"
    #         "gmd:title/"
    #         "gco:CharacterString"
    #     )
    #     vers_date = (
    #         ".//" 
    #         "gmi:LE_ProcessStep/" 
    #         "gmd:dateTime/"
    #         "gco:DateTime"
    #     )
    #     if process_step.findall(vers_date, namespaces) and process_step.findall(vers_title, namespaces):
    #         # Extract dateTime for each version.
    #         datetime = pd.to_datetime(process_step.find(vers_date, namespaces).text)
            
    #         # Extract version number.
    #         version = process_step.find(vers_title, namespaces).text.split(" ")[-1]
    #         print(
    #             "{} = {}".format(
    #                 version, datetime
    #             )
    #         )
    #         vers_dict[version] = datetime
    #         df.loc[df.Link_Xml == url, ['version_info']] = [vers_dict]
    
    # # Collect package size information.
    # # Iterate through transfer size nodes.
    # for trans_size in root.iterfind(".//gmd:transferSize", namespaces):

    #     if trans_size.find(".//gco:Real", namespaces).text:
            
    #         sizes = trans_size.find(".//gco:Real", namespaces).text
    #         print("Current AIP Size = %s MB" % sizes)
                
    #         df.loc[df.Link_Xml == url, ['package_size_mb']] = float(sizes)
    #         break

    #     break
    


First published date = 2021-03-03
https://www.ncei.noaa.gov/thredds-ocean/catalog/ioos/pacioos/WQB-KN/catalog.html

First published date = 2021-03-03
https://www.ncei.noaa.gov/thredds-ocean/catalog/ioos/pacioos/WQB-AW/catalog.html

First published date = 2020-03-03
https://www.ncei.noaa.gov/thredds-ocean/catalog/ioos/pacioos/WQB-05/catalog.html

First published date = 2020-01-15
https://www.ncei.noaa.gov/thredds-ocean/catalog/ioos/pacioos/AWS-HIMB/catalog.html

First published date = 2020-01-17
https://www.ncei.noaa.gov/thredds-ocean/catalog/ioos/pacioos/WQB-04/catalog.html

First published date = 2018-04-18
https://www.ncei.noaa.gov/thredds-ocean/catalog/ioos/cencoos/edu_humboldt_humboldt/catalog.html

First published date = 2018-04-18
https://www.ncei.noaa.gov/thredds-ocean/catalog/ioos/cencoos/mlml_monterey/catalog.html

First published date = 2018-04-18
https://www.ncei.noaa.gov/thredds-ocean/catalog/ioos/cencoos/wiyot_tribe_indianisland/catalog.html

First published date = 2018-0

In [17]:
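import xarray as xr

# These URLs point at THREDDS catalog pages (catalog.html), not at datasets,
# so xarray cannot open them directly and the call fails as shown below.
for url in df_out['thredds']:
  xr.open_mfdataset(url)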

ValueError: ignored

In [18]:
!pip install siphon

Collecting siphon
  Downloading siphon-0.9-py2.py3-none-any.whl (69 kB)
Installing collected packages: siphon
Successfully installed siphon-0.9


In [23]:
df_out.iloc[0]['thredds']

'https://www.ncei.noaa.gov/thredds-ocean/catalog/ioos/pacioos/WQB-KN/catalog.html'

In [45]:
from siphon.catalog import TDSCatalog

catalog = TDSCatalog(df_out.iloc[0]['thredds'])


info = """
Catalog information
-------------------

Base THREDDS URL: {}
Catalog name: {}
Catalog URL: {}
Metadata: {}
""".format(
    catalog.base_tds_url, catalog.catalog_name, catalog.catalog_url, catalog.metadata
)

print(info)

for name, ds in catalog.datasets.items():
    if ds.access_urls:
        print(name)
        print(ds.access_urls['OPENDAP'])


Catalog information
-------------------

Base THREDDS URL: https://www.ncei.noaa.gov
Catalog name: No name found
Catalog URL: https://www.ncei.noaa.gov/thredds-ocean/catalog/ioos/pacioos/WQB-KN/catalog.xml
Metadata: {'inherited': True, 'serviceName': 'nonWmsServices'}

WQB-KN.2017.nc
https://www.ncei.noaa.gov/thredds-ocean/dodsC/ioos/pacioos/WQB-KN/WQB-KN.2017.nc
WQB-KN.2016.nc
https://www.ncei.noaa.gov/thredds-ocean/dodsC/ioos/pacioos/WQB-KN/WQB-KN.2016.nc
WQB-KN.2015.nc
https://www.ncei.noaa.gov/thredds-ocean/dodsC/ioos/pacioos/WQB-KN/WQB-KN.2015.nc
WQB-KN.2014.nc
https://www.ncei.noaa.gov/thredds-ocean/dodsC/ioos/pacioos/WQB-KN/WQB-KN.2014.nc
WQB-KN.2013.nc
https://www.ncei.noaa.gov/thredds-ocean/dodsC/ioos/pacioos/WQB-KN/WQB-KN.2013.nc
WQB-KN.2012.nc
https://www.ncei.noaa.gov/thredds-ocean/dodsC/ioos/pacioos/WQB-KN/WQB-KN.2012.nc
WQB-KN.2011.nc
https://www.ncei.noaa.gov/thredds-ocean/dodsC/ioos/pacioos/WQB-KN/WQB-KN.2011.nc
WQB-KN.2010.nc
https://www.ncei.noaa.gov/thredds-ocean/dodsC/ioos/pacioos/WQB-KN/WQB-KN.2010.nc
WQB-KN.2009.nc
https://www.ncei.noaa.gov/thredds-ocean/dodsC/ioos/pacioos/WQB-KN/WQB-KN.2009.nc
WQB-KN.2008.nc
https://www.ncei.noaa.gov/thredds-ocean/dodsC/ioos/pacioos/WQB-KN/WQB-KN.2008.nc
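
The output above suggests a simple relationship between a catalog page and the OPeNDAP endpoints of its files. The sketch below encodes that relationship in a hypothetical helper, `opendap_url`. It assumes the `catalog`/`dodsC` layout seen here, which other THREDDS servers may not share. The endpoints reported by siphon remain the authoritative source.

```python
def opendap_url(catalog_url: str, dataset_name: str) -> str:
    """Guess the OPeNDAP endpoint for a file listed on a THREDDS catalog page.

    Assumes the catalog/dodsC layout seen in the output above; other
    THREDDS servers may be configured differently.
    """
    # Drop the trailing "catalog.html" to get the catalog folder.
    folder = catalog_url.rsplit("/", 1)[0]
    # Swap the first "/catalog/" path segment for "/dodsC/".
    return folder.replace("/catalog/", "/dodsC/", 1) + "/" + dataset_name


catalog_url = (
    "https://www.ncei.noaa.gov/thredds-ocean/catalog/ioos/pacioos/WQB-KN/catalog.html"
)
print(opendap_url(catalog_url, "WQB-KN.2017.nc"))
```

An endpoint of this form, unlike the catalog page itself, can usually be passed to `xarray.open_dataset`, provided an OPeNDAP-capable backend such as netCDF4 is installed.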


## Create a timeseries plot of data coverage
Now that we have a DataFrame with all the information we're interested in, let's make a time coverage plot for all the AIPs at NCEI.

In [None]:
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
    
ypos = range(len(df))
fig, ax = plt.subplots(figsize=(15, 12))

# Plot the data
ax.barh(ypos, mdates.date2num(df['data_end_date']) - mdates.date2num(df['data_start_date']), 
        left = mdates.date2num(df['data_start_date']), 
        height = 0.5, 
        align = 'center')

# Pad the x-axis by roughly one month (30 days) on either side.
xlim = ( mdates.date2num(df['data_start_date'].min() - pd.Timedelta(days=30)),
         mdates.date2num(df['data_end_date'].max() + pd.Timedelta(days=30)) )

ax.set_xlim(xlim)
ax.set(yticks = np.arange(0, len(df)))
ax.tick_params(which="both", direction="out")
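ax.set_ylabel("NCEI Accession Number")
# The accession-number column is only filled when the keyword parsing above is enabled,
# so take the number from the record Id (e.g. gov.noaa.nodc:0226059) instead.
ax.set_yticklabels(df['Id'].str.split(':').str[-1])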
ax.set_title('NCEI archive package time coverage')

ax.xaxis_date()
ax.set_xlabel('Date')

plt.grid(axis='x', linestyle='--')
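
Each bar in the plot starts at `data_start_date` and its length is the difference between the end and start dates. As a minimal sketch of that arithmetic, the code below computes the coverage length in days for the first two records shown in the table earlier. The DataFrame name `coverage` and the column `duration_days` are ours.

```python
import pandas as pd

# Start/end pairs copied from the first two rows of the table shown earlier.
coverage = pd.DataFrame(
    {
        "Id": ["gov.noaa.nodc:0226059", "gov.noaa.nodc:0225979"],
        "data_start_date": pd.to_datetime(
            ["2008-08-07T00:00:00Z", "2008-06-06T00:00:00Z"]
        ),
        "data_end_date": pd.to_datetime(
            ["2017-01-05T00:00:00Z", "2016-12-07T00:00:00Z"]
        ),
    }
)

# Bar length = end - start; the bar's left edge is the start date.
coverage["duration_days"] = (
    coverage["data_end_date"] - coverage["data_start_date"]
).dt.days

print(coverage[["Id", "duration_days"]])
```

Sorting by `duration_days` before plotting is one way to make the longest records easier to spot.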

This procedure has been developed as an example of how to use NCEI's geoportal REST APIs to collect information about packages that have been archived at NCEI. The intention is to provide some guidance on collecting this information without having to request it directly from NCEI. NCEI makes a large number of metadata elements available through its ISO metadata records. Anyone interested in collecting other information from the records at NCEI should therefore have a look at the ISO metadata records, determine which items are of interest to their community, and then update the example code provided to collect that information.

**Author:** Mathew Biddle