# Using the GHRSST Opensearch service

The GHRSST **Opensearch** service allows users to search for data files from any GHRSST data collection listed in the GHRSST catalogue (https://www.ghrsst.org/ghrsst-data-services/ghrsst-catalogue/) using spatial and temporal criteria. The returned results include, for each data file found, the download paths available from the different GHRSST Data Assembly Centers (DAC). The service was funded by the European **Copernicus** program and implemented by Ifremer.

The service can be accessed at: https://opensearch-ghrsst.ifremer.fr. The homepage of the service explains the syntax of the search queries with many examples.

This notebook shows how to query the GHRSST Opensearch service in Python, using the JSON return format (Atom/XML being the alternative format).

<figure>
    <img src=./logo_ghrsst.gif width="150"> 
    <img src=./logo_copernicus.png width="100" align="right">
</figure>

## Main functions

Here are the main functions used to build the queries and decode the results. Examples of usage are provided in the following section.

In [62]:
import json
import urllib.request
from datetime import datetime
from typing import Tuple


# the service end-point URL
GHRSST_OPENSEARCH_URL = "https://opensearch-ghrsst.ifremer.fr/granules.json"

def _format_opensearch_url(
        dataset_id: str,
        start: datetime,
        end: datetime,
        area=None,
        dac=None,
        protocol=None,
        page=0,
        count=1000,
        trace=False
):
    """creates the opensearch query string from search arguments"""
    
    # build the search query URL
    search_url = GHRSST_OPENSEARCH_URL + \
        '?datasetId={}'.format(dataset_id) + \
        '&startPage={}&count={}'.format(page, count) + \
        '&timeStart={}&timeEnd={}'.format(start.isoformat(), end.isoformat())

    if area is not None:
        search_url += '&geoBox={}'.format(','.join(str(v) for v in area))
    
    if dac is not None:
        search_url += '&source={}'.format(dac)

    if protocol is not None:
        search_url += '&protocol={}'.format(protocol)
    
    if trace:
        print(search_url)

    return search_url


def format_opensearch_result(entries):
    """format the search result as a list, merge duplicates"""
    dict_entries = {}
    for entry in entries:
        file_id = entry['title'] 

        # tag each link with the source (DAC) it was returned by
        for link in entry['links']:
            link['source'] = entry['source']
        entry.pop('source')
        
        if not file_id.endswith('.nc'):
            # PODAAC strips the .nc extension from the filename.
            file_id = file_id + '.nc'
        if file_id in dict_entries:
            # merge with existing entry if the same granule was 
            # returned from another DAC
            dict_entries[file_id]['links'].extend(entry['links'])
            # some attributes are not always filled in by every DAC;
            # when merging, take the value from whichever DAC provides it
            for attr in dict_entries[file_id]:
                if attr != 'links' and entry.get(attr) is not None:
                    dict_entries[file_id][attr] = entry[attr]
        else:
            dict_entries[file_id] = entry

    # sort by filename
    dict_entries_sorted = {name: dict_entries[name] for name in sorted(dict_entries)}
    
    return dict_entries_sorted


def search(
        dataset_id: str,
        start: datetime,
        end: datetime = None,
        area: Tuple[float, float, float, float] = None,
        dac: str = None,
        protocol: str = None,
        trace: bool = False
    ):
    """The main search function. Queries data files from a given GHRSST collection.
    
    When the same data files are returned from different DACs, they are merged into a 
    single entry in the list of results, providing all the possible download URLs. 
    
    It is possible to request the results from a specific DAC only, or for a given download
    protocol (FTP, HTTP, ...).
    
    Args:
        dataset_id: the identifier of the collection, as found in GHRSST catalogue.
        start: the start date and time of the search temporal interval, as a python 
            datetime object.
        end: the end date and time of the search temporal interval, as a python 
            datetime object. By default, the current date and time is used.
        area: the boundaries of the search area, as a list [lon min, lat min, lon max, lat max]. 
            By default the whole world is selected.
        dac: the identifier of the DAC to query. By default all DACs are queried. Use this
            argument to limit the search to a specific DAC.
        protocol: the type of download link to be returned for each found data file. By
            default, all links are returned. Use this argument to include in the search results
            only FTP, HTTPS, etc links.
        trace: print some traces of the queries to opensearch service
    """
    results = []

    if end is None:
        end = datetime.now()

    last_page = False
    page = 0
    while not last_page:
        # format the opensearch query
        uri = _format_opensearch_url(
            dataset_id, start, end, area, dac=dac, protocol=protocol, page=page)
        
        # call the service
        response = urllib.request.urlopen(uri)
        status_code = response.getcode()
        if trace:
            print(f'Requesting page {page}: {uri}')
            print("HTTP STATUS CODE = " + str(status_code))
        
        # decode json result
        json_result = json.loads(response.read())
        status, entries = json_result['header'], json_result['entries']
        
        results.extend(entries)
        
        if status['total_results'] < (status['start_index']+1)*status['items_per_page']:
            last_page = True
        
        page += 1

    # merge duplicates (if results coming from more than one dac)    
    return format_opensearch_result(results)
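

To make the query format concrete, here is a minimal standalone sketch that builds the same kind of URL as `_format_opensearch_url` above, using `urllib.parse.urlencode` instead of string concatenation. The helper name `build_query_url` is hypothetical and not part of the service or of this notebook's main functions; the parameter names (`datasetId`, `startPage`, `count`, `timeStart`, `timeEnd`, `geoBox`) are the ones used above.

```python
from datetime import datetime
from urllib.parse import urlencode

# same end-point as GHRSST_OPENSEARCH_URL above
ENDPOINT = "https://opensearch-ghrsst.ifremer.fr/granules.json"


def build_query_url(dataset_id, start, end, area=None, page=0, count=1000):
    """Hypothetical helper mirroring _format_opensearch_url."""
    params = {
        "datasetId": dataset_id,
        "startPage": page,
        "count": count,
        "timeStart": start.isoformat(),
        "timeEnd": end.isoformat(),
    }
    if area is not None:
        # (lon min, lat min, lon max, lat max) as a comma-separated string
        params["geoBox"] = ",".join(str(v) for v in area)
    # keep ':' and ',' unescaped so the URL matches the hand-built one
    return ENDPOINT + "?" + urlencode(params, safe=":,")


url = build_query_url(
    "AVHRR_SST_METOP_B-OSISAF-L2P-v1.0",
    datetime(2018, 12, 1),
    datetime(2018, 12, 2),
    area=(-180.0, -90.0, 180.0, 90.0),
)
print(url)
```

Printing the URL before sending it is a convenient way to check a query by pasting it into a browser.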


## Examples

### Basic search 

In this example, we query data files from the OSI SAF Metop-B L2P dataset. We first need to know the GHRSST identifier of this dataset to use it in the search query: it is the identifier provided in the catalogue, in the title of the top-right column listing the dataset properties:

<figure>
    <img src=./catalogue.png> 
</figure>


Here the identifier is ``AVHRR_SST_METOP_B-OSISAF-L2P-v1.0``.

We can then simply use the ``search`` function defined above, also providing the time frame of interest (as Python *datetime* objects) and the area of interest, as a tuple *(lon min, lat min, lon max, lat max)*.

In [63]:
# identifier, as provided in GHRSST catalogue
datasetId = "AVHRR_SST_METOP_B-OSISAF-L2P-v1.0"

# area of interest (lon min, lat min, lon max, lat max)
area = (-180.0, -90.0, 180.0, 90.0)

# timeframe of interest
start = datetime(2018, 12, 1)
end = datetime(2018, 12, 2)

result = search(datasetId, start=start, end=end, area=area)

print(f'Number of files found: {len(result)}')

Number of files found: 481


The result of the above query is a Python dictionary whose keys are the names of the returned files and whose values are their properties, including the download links in the `links` property.
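
As an illustration of how this structure can be used, the sketch below groups the download URLs of one entry by DAC. It relies on the `source` key that `format_opensearch_result` attaches to each link. The helper name `links_by_source` is hypothetical, and the `href` key holding the URL is an assumption, since the link structure is not shown in the outputs of this notebook; adjust `url_key` to whatever the service actually returns. The example entry is made up for illustration and is not actual service output.

```python
from collections import defaultdict


def links_by_source(entry, url_key="href"):
    """Group the download URLs of one result entry by DAC.

    `url_key` is an assumption about the name of the field holding the URL.
    """
    grouped = defaultdict(list)
    for link in entry.get("links", []):
        grouped[link.get("source")].append(link.get(url_key))
    return dict(grouped)


# made-up entry for illustration (not actual service output)
example_entry = {
    "links": [
        {"source": "OSISAF", "href": "ftp://example.org/a.nc"},
        {"source": "PODAAC", "href": "https://example.org/a.nc"},
    ]
}
print(links_by_source(example_entry))
```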

For instance, let's print one element of this dictionary.

In [64]:
# for nice printing
import pprint
pp = pprint.PrettyPrinter(indent=4)

# pretty print the element at index 10 (i.e. the 11th element)
pp.pprint(list(result.items())[10])


(   '20181201002803-OSISAF-L2P_GHRSST-SSTsubskin-AVHRR_SST_METOP_B-sstmgr_metop01_20181201_002803-v02.0-fv01.0.nc',
    {   'dc_date': '2018-12-01 00:28:03/2018-12-01 00:31:03',
        'geo_box': '-55.7705670107527 -57.79106721833021 -11.96957392938374 '
                   '-40.73174091882359',
        'geo_line': None,
        'geo_polygon': '-55.56169835875048 -52.67148182480798 '
                       '-55.7705670107527 -57.79106721833021 '
                       '-38.96490503641508 -56.99809552889469 '
                       '-31.90670168704763 -55.92066029334123 '
                       '-25.26205510203509 -54.51512185782996 '
                       '-11.96957392938374 -49.969497393784 -16.21935902029434 '
                       '-45.44043134437771 -19.80456782124388 '
                       '-40.73174091882359 -31.31928310001008 '
                       '-44.40516197172078 -36.74824984062293 '
                       '-45.6032435434533 -42.34634730581534 '
                      

### More search criteria

It is possible to further limit the search results. In particular:
* asking for the results of a specific GHRSST Data Assembly Center (DAC) using the `dac` argument of the `search` function. Possible values include *OSISAF*, *PODAAC*, *EUMETSAT*, ... (check in the central catalogue which DACs are available for a given GHRSST dataset).
* asking for the download links in a given protocol only, using the `protocol` argument. Possible values include: *FTP*, *HTTPS*, ...

In the first example below, we perform the same request but ask only for the results from the *OSISAF* DAC; it can be verified that only the links from the OSISAF DAC are returned, and that those from PODAAC are dropped:

In [65]:
# use `dac` keyword in search function
result = search(datasetId, start=start, end=end, area=area, dac='OSISAF')

# print the first element of result
pp.pprint(list(result.items())[0])

(   '20181130235803-OSISAF-L2P_GHRSST-SSTsubskin-AVHRR_SST_METOP_B-sstmgr_metop01_20181130_235803-v02.0-fv01.0.nc',
    {   'dc_date': '2018-11-30 23:58:03/2018-12-01 00:01:03',
        'geo_box': '123.2947507865959 -29.54499404600947 153.0719195822948 '
                   '-13.44006542392163',
        'geo_line': None,
        'geo_polygon': '125.118605989437 -18.55772002731654 123.2947507865959 '
                       '-23.65568488367033 132.853342853519 -26.36564569369375 '
                       '137.1999949953277 -27.32752869013784 141.5836292337872 '
                       '-28.20226969513492 151.6700818733315 '
                       '-29.54499404600947 153.0719195822948 '
                       '-19.16248150366356 143.862055791963 -17.60344134062807 '
                       '139.8051619269587 -16.75653910435467 126.8196849573603 '
                       '-13.44006542392163 125.118605989437 -18.55772002731654',
        'geo_where': None,
        'id': 'https://opensearch.ifreme

Note that searching for a dataset while specifying a DAC that does not serve this dataset raises an HTTP 404 `Not Found` error:

In [66]:
result = search(datasetId, start=start, end=end, area=area, dac='EUMETSAT')

HTTPError: HTTP Error 404: Not Found
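
If an empty result is preferable to an exception in this case, the call can be wrapped as in the minimal sketch below. The wrapper name `search_or_empty` is hypothetical, and `fake_search` is only a stand-in for the `search` function so that the sketch runs without network access; in practice, pass `search` itself.

```python
from urllib.error import HTTPError


def search_or_empty(search_func, *args, **kwargs):
    """Hypothetical wrapper: return {} instead of raising on a 404."""
    try:
        return search_func(*args, **kwargs)
    except HTTPError as err:
        if err.code == 404:
            # the requested DAC does not serve this dataset
            return {}
        # any other HTTP error is still reported
        raise


# stand-in for `search`, so that the sketch runs offline
def fake_search(dataset_id, dac=None):
    raise HTTPError("https://example.org", 404, "Not Found", None, None)


print(search_or_empty(fake_search, "some-dataset", dac="EUMETSAT"))
```

With the real function, the call would read `search_or_empty(search, datasetId, start=start, end=end, area=area, dac='EUMETSAT')`.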

In this second example, we limit the results to those for which an FTP link is provided, using the `protocol` argument. Other links are dropped from the results.

In [69]:
# use `protocol` keyword in search function
result = search(datasetId, start=start, end=end, area=area, protocol='FTP')

# print the first element of result
pp.pprint(list(result.items())[0])

(   '20181130235803-OSISAF-L2P_GHRSST-SSTsubskin-AVHRR_SST_METOP_B-sstmgr_metop01_20181130_235803-v02.0-fv01.0.nc',
    {   'dc_date': '2018-11-30 23:58:03/2018-12-01 00:01:03',
        'geo_box': '123.2947507865959 -29.54499404600947 153.0719195822948 '
                   '-13.44006542392163',
        'geo_line': None,
        'geo_polygon': '125.118605989437 -18.55772002731654 123.2947507865959 '
                       '-23.65568488367033 132.853342853519 -26.36564569369375 '
                       '137.1999949953277 -27.32752869013784 141.5836292337872 '
                       '-28.20226969513492 151.6700818733315 '
                       '-29.54499404600947 153.0719195822948 '
                       '-19.16248150366356 143.862055791963 -17.60344134062807 '
                       '139.8051619269587 -16.75653910435467 126.8196849573603 '
                       '-13.44006542392163 125.118605989437 -18.55772002731654',
        'geo_where': None,
        'id': 'https://opensearch.ifreme

Same as above, but now selecting only HTTPS links:

In [70]:
# use `protocol` keyword in search function
result = search(datasetId, start=start, end=end, area=area, protocol='HTTPS')

# print the first element of result
pp.pprint(list(result.items())[0])

(   '20181130235803-OSISAF-L2P_GHRSST-SSTsubskin-AVHRR_SST_METOP_B-sstmgr_metop01_20181130_235803-v02.0-fv01.0.nc',
    {   'dc_date': '2018-11-30 23:58:03/2018-12-01 00:01:03',
        'geo_box': '123.2947507865959 -29.54499404600947 153.0719195822948 '
                   '-13.44006542392163',
        'geo_line': None,
        'geo_polygon': '125.118605989437 -18.55772002731654 123.2947507865959 '
                       '-23.65568488367033 132.853342853519 -26.36564569369375 '
                       '137.1999949953277 -27.32752869013784 141.5836292337872 '
                       '-28.20226969513492 151.6700818733315 '
                       '-29.54499404600947 153.0719195822948 '
                       '-19.16248150366356 143.862055791963 -17.60344134062807 '
                       '139.8051619269587 -16.75653910435467 126.8196849573603 '
                       '-13.44006542392163 125.118605989437 -18.55772002731654',
        'geo_where': None,
        'id': 'https://opensearch.ifreme

## Other examples

You can now repeat the previous operations with the dataset of your choice. Here is another example with the EUMETSAT L2P product from SLSTR onboard Sentinel-3. Note that the EUMETSAT data store only serves the last month of NRT L2P products: an empty result is returned for older dates, and you will have to query other products, such as NTC (Non Time Critical), instead:

In [71]:
# identifier, as provided in GHRSST catalogue
datasetId = "SLSTRA-MAR-L2P-NTC-v3.0"

# area of interest
area = (-180.0, -90.0, 180.0, 90.0)

# timeframe of interest
start = datetime(2022, 5, 1)
end = datetime(2022, 5, 2)

result = search(datasetId, start=start, end=end, area=area)
len(result)

15
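
Once results are available, their properties can be post-processed. As a minimal sketch, the `dc_date` property seen in the printed entries above holds the start and end times of a file separated by a slash. The helper `parse_dc_date` below is hypothetical and assumes that every entry follows the same `YYYY-MM-DD HH:MM:SS/YYYY-MM-DD HH:MM:SS` layout as those samples.

```python
from datetime import datetime

# layout observed in the sample outputs; assumed to hold for all entries
DC_DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_dc_date(dc_date):
    """Split a 'start/end' dc_date string into two datetime objects."""
    start_str, end_str = dc_date.split("/")
    return (
        datetime.strptime(start_str, DC_DATE_FORMAT),
        datetime.strptime(end_str, DC_DATE_FORMAT),
    )


start_time, end_time = parse_dc_date("2018-12-01 00:28:03/2018-12-01 00:31:03")
print(end_time - start_time)  # duration covered by the file: 0:03:00
```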