# Building WaveWatch3(TM) ERDDAP Datasets

This notebook documents the process of creating XML fragments
for SalishSeaCast rolling forecast WaveWatch3(TM) run results files
for inclusion in `/results/erddap-datasets/datasets.xml`
which is symlinked to `/opt/tomcat/content/erddap/datasets.xml`
on the `skookum` ERDDAP server instance.

The contents are a combination of:

* instructions for using the
`GenerateDatasetsXml.sh` and `DasDds.sh` tools found in the
`/opt/tomcat/webapps/erddap/WEB-INF/` directory
* instructions for forcing the server to update the datasets collection
via the `/results/erddap/flag/` directory
* code and metadata to transform the output of `GenerateDatasetsXml.sh`
into XML fragments that are ready for inclusion in `/results/erddap-datasets/datasets.xml`

In [1]:
from collections import OrderedDict
from copy import copy

from lxml import etree

**NOTE**

The next cell mounts the `/results` filesystem on `skookum` locally.
It is intended for use when this notebook is run on a laptop
or other non-Waterhole machine that has `sshfs` installed 
and a mount point for `/results` available in its root filesystem.

Don't execute the cell if that doesn't describe your situation.

!sshfs skookum:/results /results

# Metadata for All Datasets

The `metadata` dictionary below contains information for dataset
attribute tags whose values need to be changed,
or that need to be added for all datasets.

The keys are the dataset attribute names.

The values are dicts containing a required `text` item
and an optional `after` item.

The value associated with the `text` key is the text content
for the attribute tag.

When present,
the value associated with the `after` key is the name
of the dataset attribute after which a new attribute tag
containing the `text` value is to be inserted.

In [3]:
metadata = OrderedDict([
    ('infoUrl', {
        'text': 
            'https://salishsea-meopar-docs.readthedocs.io/en/latest/results_server/index.html#salish-sea-model-results',
    }),
    ('institution', {
        'text': 'UBC EOAS', 
    }),
    ('institution_fullname', {
        'text': 'Earth, Ocean & Atmospheric Sciences, University of British Columbia',
        'after': 'institution',
    }),
    ('license', {
        'text': '''The Salish Sea MEOPAR NEMO model results are copyright
by the Salish Sea MEOPAR Project Contributors and The University of British Columbia.

They are licensed under the Apache License, Version 2.0. https://www.apache.org/licenses/LICENSE-2.0''',
    }),
    ('project', {
        'text':'Salish Sea MEOPAR NEMO Model',
        'after': 'title',
    }),
    ('creator_name', {
        'text': 'Salish Sea MEOPAR Project Contributors',
        'after': 'project',
    }),
    ('creator_email', {
        'text': 'gemmrich@uvic.ca',
        'after': 'creator_name',
    }),
    ('creator_url', {
        'text': 'https://salishsea-meopar-docs.readthedocs.io/',
        'after': 'creator_email',
    }),
    ('acknowledgement', {
        'text': 'MEOPAR, ONC, Compute Canada',
        'after': 'creator_url',
    }),
    ('drawLandMask', {
        'text': 'over',
        'after': 'acknowledgement',
    }),
])
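To make the role of the `after` key concrete, here is a minimal sketch on a toy `addAttributes` element. It uses the standard library's `xml.etree.ElementTree` rather than `lxml` so that it runs without third-party packages; the `insert_after` helper and the toy XML are hypothetical and are not part of this notebook's code.

```python
import xml.etree.ElementTree as ET

# Toy stand-in (hypothetical) for the <addAttributes> element of a fragment.
attrs = ET.fromstring(
    '<addAttributes>'
    '<att name="institution">???</att>'
    '<att name="title">Some title</att>'
    '</addAttributes>'
)


def insert_after(parent, after_name, name, text):
    """Insert <att name=name>text</att> right after the att named after_name."""
    for i, child in enumerate(list(parent)):
        if child.get('name') == after_name:
            new = ET.Element('att', name=name)
            new.text = text
            parent.insert(i + 1, new)
            return new
    raise ValueError(f'{after_name} attribute element not found')


# An entry without 'after' replaces the text of an existing tag ...
attrs.find('att[@name="institution"]').text = 'UBC EOAS'
# ... while an entry with 'after' adds a new tag following the named one.
insert_after(
    attrs, 'institution', 'institution_fullname',
    'Earth, Ocean & Atmospheric Sciences, University of British Columbia',
)

names = [att.get('name') for att in attrs]
print(names)
```

The `update_xml()` function later in this notebook does the equivalent with `lxml`'s `addnext()` method.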

# Dataset Attributes

The `datasets` dictionary below provides the content
for the dataset `title` and `summary` attributes.

The `title` attribute content appears in the datasets list table
(among other places).
It should be `<`80 characters long,
and note that only the 1st 40 characters will appear in the table.

The `summary` attribute content appears
(among other places)
when a user hovers the cursor over the `?` icon beside the `title`
content in the datasets list table.
The text that is inserted into the `summary` attribute tag
by code later in this notebook is the
`title` content followed by the `summary` content,
separated by a blank line.
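The length guidance for `title` and the way `summary` is composed can be sketched as follows. The helper names `check_title` and `build_summary` are hypothetical; the 80 and 40 character limits are the ones stated above.

```python
def check_title(title, max_len=80, table_len=40):
    """Return the leading part of title that appears in the datasets list table.

    Raise ValueError if title is not shorter than max_len characters.
    """
    if len(title) >= max_len:
        raise ValueError(
            f'title is {len(title)} characters long; it should be < {max_len}')
    return title[:table_len]


def build_summary(title, summary):
    """Join the title and summary content, separated by a blank line."""
    return f'{title}\n\n{summary}'


title = 'Forecast, Salish Sea, 2d Wave Fields, 30min, v17-02'
shown = check_title(title)
full = build_summary(
    title, '2d wave field values calculated at 30 minute intervals')
print(shown)
print(full)
```

Checking what survives in the first 40 characters is a quick way to confirm that the most distinguishing words of a title come first.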

The keys of the `datasets` dict are the `datasetID` strings that
are used in many places by the ERDDAP server.
They are structured as follows:

* `ubc` to indicate that the dataset was produced at UBC
* `SS` to indicate that the dataset is a product of the Salish Sea NEMO model
* a few letters to indicate the model runs that produce the dataset:

  * `n` to indicate that the dataset is from a nowcast run,
  * `f` for rolling forecast composed of the most recent 5 days of nowcast run results and the most recent forecast or forecast2 run,
  * `g` for nowcast-green,
  * `a` for atmospheric forcing,
* a description of the dataset variables; e.g. `PointAtkinsonSSH` or `3DuVelocity`
* the time interval of values in the dataset; e.g. `15m`, `1h`, `1d`
* the dataset version; e.g. `V16-10`, or `V1`

Versioning was changed to a [CalVer](http://calver.org/) type scheme in Oct-2016.
Thereafter versions are of the form `Vyy-mm` and indicate the year and month when the dataset entered production.

So:

* `ubcSSnPointAtkinsonSSH15mV1` is the version 1 dataset of 15 minute averaged sea surface height values at Point Atkinson from `PointAtkinson.nc` output files

* `ubcSSn3DwVelocity1hV2` is the version 2 dataset of 1 hr averaged vertical (w) velocity values over the entire domain from `SalishSea_1h_*_grid_W.nc` output files

* `ubcSSnSurfaceTracers1dV1` is the version 1 dataset of daily averaged surface tracer values over the entire domain from `SalishSea_1d_*_grid_T.nc` output files

* `ubcSSnBathymetry2V16-07` is the version 16-07 dataset of longitude, latitude, and bathymetry of the Salish Sea NEMO model grid that came into use in Jul-2016.
  The corresponding NEMO-generated mesh mask variables are in the `ubcSSn2DMeshMaskDbo2V16-07` (y, x variables),
  and the `ubcSSn3DMeshMaskDbo2V16-07` (z, y, x variables) datasets.
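As a rough illustration of the `datasetID` structure, the sketch below splits an id into its parts with a regular expression. The pattern `DATASET_ID_RE` and the function `parse_dataset_id` are hypothetical; the pattern assumes that the run letters are lowercase and that the description starts with a digit or an uppercase letter, which holds for the examples above but is not a rule enforced by ERDDAP.

```python
import re

DATASET_ID_RE = re.compile(
    r'^ubc'                         # produced at UBC
    r'SS'                           # product of the Salish Sea model
    r'(?P<runs>[a-z]+)'             # run letter(s): n, f, g, a
    r'(?P<description>.+?)'         # description of the dataset variables
    r'(?P<interval>\d+[mhd])?'      # optional time interval: 15m, 1h, 1d
    r'(?P<version>V\d+(?:-\d+)?)$'  # V1, or CalVer style Vyy-mm
)


def parse_dataset_id(dataset_id):
    """Return a dict of the parts of dataset_id, or raise ValueError."""
    match = DATASET_ID_RE.match(dataset_id)
    if match is None:
        raise ValueError(f'unrecognized datasetID: {dataset_id}')
    return match.groupdict()


waves = parse_dataset_id('ubcSSf2DWaveFields30mV17-02')
bathy = parse_dataset_id('ubcSSnBathymetry2V16-07')
print(waves)
print(bathy)
```

Datasets without a time dimension, such as the bathymetry, have no interval part, so `interval` is `None` for them.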

The dataset version part of the `datasetID` is used to indicate changes in the variables
contained in the dataset.
For example,
the transition from the `ubcSSn3DwVelocity1hV1` to the `ubcSSn3DwVelocity1hV2` dataset
occurred on 24-Jan-2016 when we started to output vertical eddy viscosity and diffusivity
values at the `w` grid points.

All dataset ids end with their version identifier and their `summary` ends with a notation about the variables
that they contain; e.g.
```
v1: wVelocity variable
```
When a dataset version is incremented a line describing the change is added
to the end of its `summary`; e.g.
```
v1: wVelocity variable
v2: Added eddy viscosity & diffusivity variables ve_eddy_visc & ve_eddy_diff
```
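A minimal sketch of that bookkeeping is shown below. The `add_version_note` helper and the first line of the sample summary are hypothetical; in this notebook the version lines are simply typed into the `summary` strings of the `datasets` dict.

```python
def add_version_note(summary, version, note):
    """Return summary with a 'version: note' line appended at the end."""
    return f'{summary.rstrip()}\n{version}: {note}\n'


# Hypothetical summary text ending with its v1 notation.
summary = 'Vertical velocity values.\n\nv1: wVelocity variable\n'
summary = add_version_note(
    summary, 'v2',
    'Added eddy viscosity & diffusivity variables ve_eddy_visc & ve_eddy_diff',
)
print(summary)
```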

In [8]:
datasets = {
        'ubcSSf2DWaveFields30mV17-02': {
        'type': '2d fields',
        'title': 'Forecast, Salish Sea, 2d Wave Fields, 30min, v17-02',
        'keywords': '''atmosphere,
Atmosphere &gt; Atmospheric Winds &gt; Surface Winds,
atmospheric, breaking wave height, circulation, currents, direction, drift, 
eastward_sea_water_velocity, eastward_surface_stokes_drift, eastward_wave_to_ocean_stress, eastward_wind, 
energy, flux, foc, frequency, height, latitude, length, local, longitude, mean, mean_wave_length, moment, 
northward, northward_sea_water_velocity, northward_surface_stokes_drift, northward_wave_to_ocean_stress, northward_wind, ocean, oceans,
Oceans &gt; Ocean Circulation &gt; Ocean Currents,
Oceans &gt; Ocean Waves &gt; Significant Wave Height,
Oceans &gt; Ocean Waves &gt; Wave Frequency,
Oceans &gt; Ocean Waves &gt; Wave Period,
Oceans &gt; Ocean Waves &gt; Wave Spectra,
Oceans &gt; Ocean Waves &gt; Wave Speed/Direction,
Oceans &gt; Ocean Waves &gt; Wind Waves,
peak, period, sea, sea_surface_wave_from_direction, sea_surface_wave_peak_direction, sea_surface_wave_peak_frequency, 
sea_surface_wave_significant_height, sea_surface_wind_wave_mean_period_from_variance_spectral_density_second_frequency_moment, 
seawater, second, significant, significant_breaking_wave_height, source, spectra, spectral, speed, stokes, stress, surface, 
swell, t02, time, ucur, utwo, uuss, uwnd, variance, vcur, velocity, vtwo, vuss, vwnd, water, wave, wave_to_ocean_energy_flux, 
waves, wcc, wch, whitecap coverage, whitecap_coverage, wind, winds''',
        'summary': '''2d wave field values calculated at 30 minute intervals
from the most recent Strait of Georgia WaveWatch3(TM) model forecast runs.
The values are calculated for a model grid that covers the Strait of Georgia
on the coast of British Columbia. The time values are UTC.

The Strait of Georgia WaveWatch3(TM) model grid and configuration were developed
by Johannes Gemmrich at the University of Victoria. The WaveWatch3(TM) model
is forced with currents from the Salish Sea NEMO model and the same ECCC HRDPS
GEM 2.5km resolution winds that are used to force the NEMO model.

This dataset is updated daily to move it forward 1 day in time.
It starts at 00:00:00 UTC 5 days prior to the most recently completed forecast run,
and extends to 11:30:00 UTC on the 2nd day after the forecast run date.
So, for example, after completion of the 10-Nov-2017 forecast run,
this dataset included data from 2017-11-05 00:00:00 UTC to 2017-11-12 11:30:00 UTC.

v17-02: WaveWatch3(TM)-5.16; NEMO-3.6; ubcSSnBathymetryV17-02 bathymetry; see infoUrl link for full details.
''',
        'fileNameRegex': r'.*SoG_ww3_fields_\d{8}_\d{8}\.nc$',
    }
}

# Convenience Functions

A few convenience functions to reduce code repetition:

In [5]:
def print_tree(root):
    """Display an XML tree fragment with indentation.
    """
    print(etree.tostring(root, pretty_print=True).decode('ascii'))

In [6]:
def find_att(root, att):
    """Return the dataset attribute element named att
    or raise a ValueError exception if it cannot be found.
    """
    e = root.find('.//att[@name="{}"]'.format(att))
    if e is None:
        raise ValueError('{} attribute element not found'.format(att))
    return e

In [13]:
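def update_xml(root, datasetID, metadata, datasets):
    """Update the dataset XML tree rooted at root in place:
    set the datasetID, fileNameRegex, keywords, summary, and title
    from datasets[datasetID], apply the metadata dict entries,
    and add attributes to the dataset and its axis variables.
    """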
    root.attrib['datasetID'] = datasetID
    root.find('.//fileNameRegex').text = datasets[datasetID]['fileNameRegex']
        
    title = datasets[datasetID]['title']
    if 'keywords' in datasets[datasetID]:
        keywords = find_att(root, 'keywords')
        keywords.text = datasets[datasetID]['keywords']
    summary = find_att(root, 'summary')
    summary.text = f'{title}\n\n{datasets[datasetID]["summary"]}'
    e = etree.Element('att', name='title')
    e.text = title
    summary.addnext(e)

    for att, info in metadata.items():
        e = etree.Element('att', name=att)
        e.text = info['text']
        try:
            find_att(root, info['after']).addnext(e)
        except KeyError:
            find_att(root, att).text = info['text']
            
    attrs = root.find('addAttributes')
    etree.SubElement(attrs, 'att', name='NCO').text = 'null'
    if 'Bathymetry' not in datasetID:
        etree.SubElement(attrs, 'att', name='history').text = 'null'
        etree.SubElement(attrs, 'att', name='name').text = 'null'
        
    for axis_name in root.findall('.//axisVariable/destinationName'):
        attrs = axis_name.getparent().find('addAttributes')
        etree.SubElement(attrs, 'att', name='coverage_content_type').text = 'modelResult'
        
        if axis_name.text == 'time':
            etree.SubElement(attrs, 'att', name='comment').text = 'time values are UTC'
        
#     for var_name in root.findall('.//dataVariable/destinationName'):
#         if var_name.text in dataset_vars:
#             var_name.text = dataset_vars[var_name.text]['destinationName']

#         if var_name.text in var_colour_ranges:
#             for att_name in ('colorBarMinimum', 'colorBarMaximum'):
#                 cb_att = var_name.getparent().find(f'addAttributes/att[@name="{att_name}"]')
#                 if cb_att is not None:
#                     cb_att.text = var_colour_ranges[var_name.text][att_name]
#                 else:
#                     attrs = var_name.getparent().find('addAttributes')
#                     etree.SubElement(attrs, 'att', name=att_name, type='double').text = (
#                         var_colour_ranges[var_name.text][att_name])

#         attrs = var_name.getparent().find('addAttributes')
#         etree.SubElement(attrs, 'att', name='coverage_content_type').text = 'modelResult'
#         etree.SubElement(attrs, 'att', name='cell_measures').text = 'null'
#         etree.SubElement(attrs, 'att', name='cell_methods').text = 'null'
#         etree.SubElement(attrs, 'att', name='interval_operation').text = 'null'
#         etree.SubElement(attrs, 'att', name='interval_write').text = 'null'
#         etree.SubElement(attrs, 'att', name='online_operation').text = 'null'
        
#         if var_name.text in ioos_categories:
#             etree.SubElement(attrs, 'att', name='ioos_category').text = ioos_categories[var_name.text]

# Generate Initial Dataset XML Fragment

Now we're ready to produce a dataset!!!

Use the `/opt/tomcat/webapps/erddap/WEB-INF/GenerateDatasetsXml.sh` script
to generate the initial version of an XML fragment for a dataset:
```
$ cd /opt/tomcat/webapps/erddap/WEB-INF/
$ bash GenerateDatasetsXml.sh EDDGridFromNcFiles /results/SalishSea/rolling-forecasts/ ".*SoG_ww3_fields_\d{8}_\d{8}\.nc$" "" 10080
```
The `EDDGridFromNcFiles`,
`/results/SalishSea/rolling-forecasts/`,
`".*SoG_ww3_fields_\d{8}_\d{8}\.nc$"`,
`""`,
and `10080` arguments
tell the script:

  * which `EDDType`
  * what parent directory to use
  * what file name regex to use
  * `""` to concatenate the parent directory and the file name regex to find a sample file
  * to reload the dataset every 10080 minutes
  
avoiding having to type those in answer to prompts.

The output is written to `/results/erddap/logs/GenerateDatasetsXml.out`.
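If you prefer to assemble that command line from Python, the sketch below builds the argument list and prints a shell-quoted version of it. The `generate_datasets_xml_cmd` helper is hypothetical; it only builds and prints the command and does not run the script.

```python
import shlex


def generate_datasets_xml_cmd(parent_dir, file_name_regex, reload_minutes=10080):
    """Return the argument list for a GenerateDatasetsXml.sh invocation
    that creates an EDDGridFromNcFiles dataset fragment.
    """
    return [
        'bash', 'GenerateDatasetsXml.sh',
        'EDDGridFromNcFiles',   # EDDType
        parent_dir,             # parent directory of the results files
        file_name_regex,        # file name regex
        '',                     # empty string, as in the command above
        str(reload_minutes),    # reload interval in minutes
    ]


argv = generate_datasets_xml_cmd(
    '/results/SalishSea/rolling-forecasts/',
    r'.*SoG_ww3_fields_\d{8}_\d{8}\.nc$',
)
# shlex.join() quotes each argument so that the line can be pasted into a shell
print(shlex.join(argv))
```

Keeping the regex in a raw string means that the `\d` and `\.` sequences reach the shell unchanged.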

Dataset ids and file name regexes from the `datasets` dict:

In [9]:
for dataset in sorted(datasets):
    print(dataset, datasets[dataset]['fileNameRegex'])

ubcSSf2DWaveFields30mV17-02 .*SoG_ww3_fields_\d{8}_\d{8}\.nc$
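Before handing a regex to `GenerateDatasetsXml.sh` it can be worth checking it against a few file names. The sketch below does that with Python's `re` module. ERDDAP evaluates the pattern with Java's regex engine, so this is only a sanity check, on the assumption that the two engines agree for a pattern this simple. The first candidate file name appears in the `history` attribute of the generated fragment; the other two are illustrative.

```python
import re

file_name_regex = r'.*SoG_ww3_fields_\d{8}_\d{8}\.nc$'

candidates = [
    'SoG_ww3_fields_20180411_20180413.nc',       # multi-day rolling forecast file
    'SoG_ww3_fields_20180411.nc',                # single-day file
    'SalishSea_1h_20180411_20180411_grid_U.nc',  # NEMO results file
]
# Keep only the names that the regex matches in full
matches = [name for name in candidates if re.fullmatch(file_name_regex, name)]
print(matches)
```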


# Finalize Dataset XML Fragment

Now, we:

* set the `datasetID` we want to use
* parse the output of `GenerateDatasetsXml.sh` into an XML tree data structure
* set the `datasetID` dataset attribute value
* re-set the `fileNameRegex` dataset attribute value because it loses its `\` characters during parsing(?)
* edit and add dataset attributes from the `metadata` dict
* set the `title` and `summary` dataset attributes from the `datasets` dict

In [14]:
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse('/results/erddap/logs/GenerateDatasetsXml.out', parser)
root = tree.getroot()

datasetID = 'ubcSSf2DWaveFields30mV17-02'

update_xml(root, datasetID, metadata, datasets)

Inspect the resulting dataset XML fragment below and edit the dicts and
code cell above until it is what is required for the dataset:

In [15]:
print_tree(root)

<dataset type="EDDGridFromNcFiles" datasetID="ubcSSf2DWaveFields30mV17-02" active="true">
  <reloadEveryNMinutes>10080</reloadEveryNMinutes>
  <updateEveryNMillis>10000</updateEveryNMillis>
  <fileDir>/results/SalishSea/rolling-forecasts/</fileDir>
  <fileNameRegex>.*SoG_ww3_fields_\d{8}_\d{8}\.nc$</fileNameRegex>
  <recursive>true</recursive>
  <pathRegex>.*</pathRegex>
  <metadataFrom>last</metadataFrom>
  <matchAxisNDigits>20</matchAxisNDigits>
  <fileTableInMemory>false</fileTableInMemory>
  <accessibleViaFiles>false</accessibleViaFiles>
  <!-- sourceAttributes>
        <att name="altitude_resolution">n/a</att>
        <att name="area">SoG_BCgrid_00500m</att>
        <att name="easternmost_longitude">237.996994</att>
        <att name="history">Wed Apr 11 20:51:01 2018: ncrcat -4 -L4 -o SoG_ww3_fields_20180411_20180413.nc SoG_ww3_fields_20180411.nc SoG_ww3_fields_20180412.nc SoG_ww3_fields_20180413.nc</att>
        <att name="latitude_resolution">4.50000027E-03</att>
        <att n

Store the XML fragment for the dataset:

In [16]:
with open('/results/erddap-datasets/fragments/{}.xml'.format(datasetID), 'wb') as f:
    f.write(etree.tostring(root, pretty_print=True))

Edit `/results/erddap-datasets/datasets.xml` to include the
XML fragment for the dataset that was stored by the above cell.

That file is symlinked to `/opt/tomcat/content/erddap/datasets.xml`.

Create a flag file to signal the ERDDAP server process to load the dataset:
```
$ cd /results/erddap/flag/
$ touch <datasetID>
```
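The same flag file can be created from Python, for example from a notebook cell. This is a minimal sketch: the `set_flag` helper is hypothetical, and the demonstration writes into a temporary directory rather than the real `/results/erddap/flag/` directory.

```python
import tempfile
from pathlib import Path


def set_flag(dataset_id, flag_dir):
    """Create an empty file named dataset_id in flag_dir and return its path."""
    flag = Path(flag_dir) / dataset_id
    flag.touch()
    return flag


# Demonstrate in a temporary directory; on skookum flag_dir would be
# '/results/erddap/flag/'.
with tempfile.TemporaryDirectory() as tmp_dir:
    flag = set_flag('ubcSSf2DWaveFields30mV17-02', tmp_dir)
    created = flag.exists()
print(flag.name, created)
```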

If the dataset does not appear on https://salishsea.eos.ubc.ca/erddap/info/,
check `/results/erddap/logs/log.txt` for error messages from the dataset load process
(they may not be at the end of the file because ERDDAP is pretty chatty).

Once the dataset has been successfully loaded and you are happy with the metadata
that ERDDAP is providing for it,
commit the changes in `/results/erddap-datasets/` and push them to Bitbucket.