# Exploring Getting ONC ADCP Data

Translating the essence of `getSogAdcpData.m` from MATLAB to Python,
and from (somewhat) general-purpose to pre-automation-specific.

In [1]:
import ftplib
import json
import os
from pprint import pprint

import arrow
import requests
from retrying import retry

## Web Service Requests

The base web service URL that provides access to ADCP data is:

`http://dmas.uvic.ca/VSearchByInstrumentServiceAjax`

In [2]:
service_url = 'http://dmas.uvic.ca/VSearchByInstrumentServiceAjax'

Two pieces of credential information associated with your `dmas.uvic.ca` account
are required to use that service URL:
    
1. The email address associated with your account.
It is used as your `userid`.
2. The user number associated with your account.
It is used in the directory path on the ONC FTP server where the requested ADCP data file is stored for downloading.

To avoid publishing my credentials, I'll read them from an environment variable
where I have stored them as a `:`-delimited string.
If you want to run this notebook you'll need to do the same,
either by exporting the `ONC_FTP_CREDENTIALS` environment variable before you start
Jupyter Notebook,
or by storing your credentials in `os.environ['ONC_FTP_CREDENTIALS']` 
before executing the cell below.

In [4]:
userid, userno = os.environ['ONC_FTP_CREDENTIALS'].split(':')
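
A slightly more defensive way to split the credentials string is sketched below. The helper name `parse_onc_credentials` is hypothetical (it is not part of the original workflow), and it assumes that the email address comes first and that neither part is empty. The credentials in the example are made up.

```python
def parse_onc_credentials(value):
    """Split a 'userid:userno' string into its two parts.

    Hypothetical helper; assumes the email address (userid) comes first.
    """
    userid, sep, userno = value.partition(':')
    if not sep or not userid or not userno:
        raise ValueError('expected credentials in the form userid:userno')
    return userid, userno


# Made-up credentials; in the notebook the string would come from
# os.environ['ONC_FTP_CREDENTIALS'].
example_userid, example_userno = parse_onc_credentials('someone@example.com:12345')
print(example_userid, example_userno)
```

Raising a `ValueError` with a clear message is meant to make a missing or malformed environment variable easier to diagnose than an unpacking error would be.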

The service returns day-log datasets.
We're going to use it to get ADCP data one day at a time,
so let's set a data date:

In [6]:
data_date = arrow.get('2016-07-17')

Many of the other query parameters for ADCP data
from the Strait of Georgia nodes are constants:

In [7]:
OPERATION = 0
MAT_FILE = 3
REGION_ID = 2
META = 23

In [8]:
data_params = {
    'operation': OPERATION,
    'userid': userid,
    'dataformatid': MAT_FILE,
    'timefrom': data_date.format('DD-MMM-YYYY HH:mm:ss'),
    # Newer arrow releases use shift() for relative offsets such as days=+1
    'timeto': data_date.shift(days=+1).format('DD-MMM-YYYY HH:mm:ss'),
    'deviceid': 37,
    'sensorid': 92,
    'regionid': REGION_ID,
    'locationid': 3,
    'siteid': 1000670,
    'meta': META,
    'params': '{"qc":"1","avg":"0","rotVar":"0"}',
}

The value for `locationid` specifies the node for which data is being requested
(e.g. the SoG East VENUS Instrument Platform is `locationid: 3`).

The values for `deviceid` and `sensorid` are determined by the type and serial number
of the ADCP device installed at the node during the deployment that includes the
data date being requested.

The value for `siteid` specifies the deployment that includes the
data date being requested.
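
Since only the two time parameters change from day to day, the one-day window could be computed in a small helper. This sketch uses the standard library `datetime` module instead of `arrow`; the function name `day_window` is hypothetical, and the `strftime` format is my translation of the `DD-MMM-YYYY HH:mm:ss` pattern used above (it assumes an English locale for the month abbreviation).

```python
from datetime import datetime, timedelta

TIME_FORMAT = '%d-%b-%Y %H:%M:%S'  # e.g. 17-Jul-2016 00:00:00


def day_window(date_str):
    """Return (timefrom, timeto) strings spanning one day from date_str."""
    start = datetime.strptime(date_str, '%Y-%m-%d')
    end = start + timedelta(days=1)
    return start.strftime(TIME_FORMAT), end.strftime(TIME_FORMAT)


timefrom, timeto = day_window('2016-07-17')
print(timefrom, timeto)
```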

Sending a GET request to the service URL with the data parameters
as its query string launches a search for the data on the ONC server.

The response to the request contains the `searchHdrId`, which identifies the search
on the server and is also used in the directory path on the ONC FTP server
where the requested ADCP data file is stored for downloading.

The response is a JSON snippet that is surrounded by parentheses
and followed by a trailing newline character.
After confirming that the request did not raise an HTTP error,
we strip those extra characters and parse the JSON response
to get a `dict` containing the `searchHdrId` value.

In [9]:
r = requests.get(service_url, data_params)
r.raise_for_status()
data_response = json.loads(r.text.lstrip('(').rstrip().rstrip(')'))
print(data_response)

{'searchHdrId': 1724432}
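
Because the same strip-and-parse step is needed for every response from the service, it could be wrapped in a helper. This is a minimal sketch; the name `parse_service_response` is hypothetical, and it assumes the response is either bare JSON or JSON wrapped in a single pair of parentheses.

```python
import json


def parse_service_response(text):
    """Parse a service response that is JSON wrapped in parentheses."""
    # Drop surrounding whitespace first, then the wrapping parentheses.
    stripped = text.strip()
    if stripped.startswith('(') and stripped.endswith(')'):
        stripped = stripped[1:-1]
    return json.loads(stripped)


print(parse_service_response('({"searchHdrId": 1724432})\n'))
```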


The search status can be queried by sending another request
to the service URL with the parameters below.
Doing so is useful for detecting errors in the initial query
that prevent the search from running.
However, once the search is running, the information returned is of limited value.

In [10]:
search_params = {
    'operation': 1,
    'userid': userid,
    'searchHdrId': data_response['searchHdrId'],
}

In [11]:
r = requests.get(service_url, search_params)
r.raise_for_status()
search_response = json.loads(r.text.lstrip('(').rstrip().rstrip(')'))
pprint(search_response)

{'searchResult': [{'comment': '{"searchType":"Long-Search","error":"","comment":"SearchResult '
                              'initiated in branch function"}',
                   'fileCount': 0,
                   'fileSize': 0,
                   'filename': '',
                   'metaDataFilename': '',
                   'processingTime': -1,
                   'searchId': 3386036,
                   'status': 1},
                  {'comment': '{"searchType":"Long-Search","error":"","comment":"SearchResult '
                              'initiated in branch function"}',
                   'fileCount': 0,
                   'fileSize': 0,
                   'filename': '',
                   'metaDataFilename': '',
                   'processingTime': -1,
                   'searchId': 3386037,
                   'status': 1}]}
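
In the output above, each `comment` value is itself a JSON string with an `error` field. A minimal sketch of pulling out any non-empty error messages follows. The function name `search_errors` is hypothetical, and it assumes that every `comment` has the shape shown above, which has not been checked against other responses.

```python
import json


def search_errors(search_response):
    """Return the non-empty error messages found in a search status response."""
    errors = []
    for result in search_response.get('searchResult', []):
        # Each comment is a JSON document embedded in a string.
        comment = json.loads(result['comment'])
        if comment.get('error'):
            errors.append(comment['error'])
    return errors


sample_response = {
    'searchResult': [
        {'comment': '{"searchType":"Long-Search","error":"",'
                    '"comment":"SearchResult initiated in branch function"}',
         'searchId': 3386036,
         'status': 1},
    ]
}
print(search_errors(sample_response))
```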


## Polling the ONC FTP Server for Data Search Results

When the data search completes the ADCP data
formatted as a `.mat` file will be stored on the ONC FTP server
`ftp.neptunecanada.ca` in a directory constructed from your ONC user number,
the `searchHdrId` value,
and the year and month of the data date requested:

In [12]:
ftp_server = 'ftp.neptunecanada.ca'
path_tmpl = 'pub/user{userno}/searchHeader{searchHdrId}/{data_date.year}/{data_date.month:02d}'
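
To see what the template expands to, here is a small self-contained example. The user number `12345` is made up, the `searchHdrId` is the one from the response shown earlier, and a standard library `date` stands in for the `arrow` object (both expose `year` and `month` attributes).

```python
from datetime import date

path_tmpl = 'pub/user{userno}/searchHeader{searchHdrId}/{data_date.year}/{data_date.month:02d}'

# userno=12345 is a made-up user number used only for illustration.
example_path = path_tmpl.format(
    userno=12345, searchHdrId=1724432, data_date=date(2016, 7, 17))
print(example_path)
```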

We're going to use the [retrying](https://pypi.python.org/pypi/retrying) package
to poll the FTP server to determine when the `.mat` file is ready.
We will also use it to deal with potential errors when we subsequently
ask the FTP server for the `.mat` file name,
and finally download the `.mat` file.

For the initial polling we need a function that iterates over the generator returned by the
[FTP.mlsd()](https://docs.python.org/3.3/library/ftplib.html#ftplib.FTP.mlsd)
method and returns `True` (i.e. retry) when the `.mat` file is not found,
and `False` (i.e. stop retrying) when the `.mat` file is present:

In [13]:
def retry_if_not_matfile(mlsd):
    for filename, facts in mlsd:
        if not filename.startswith('.'):
            print(filename, facts)
            return os.path.splitext(filename)[1] != '.mat'
    return True
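
The function above makes its decision based on the first non-hidden entry in the listing. If a directory could ever hold an intermediate file and the `.mat` file at the same time (an assumption; the output later in this notebook only shows one file per listing), a variant that scans the whole listing might be safer. The name `retry_unless_matfile_listed` is hypothetical, and the sample listings are invented:

```python
import os


def retry_unless_matfile_listed(mlsd):
    """Return True (retry) unless some entry in the listing is a .mat file."""
    for filename, facts in mlsd:
        if os.path.splitext(filename)[1] == '.mat':
            return False
    return True


listing_in_progress = [('.', {}), ('..', {}), ('data.rdi', {'size': '25077204'})]
listing_done = listing_in_progress + [('data.mat', {'size': '6174229'})]
print(retry_unless_matfile_listed(listing_in_progress))
print(retry_unless_matfile_listed(listing_done))
```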

We also want to retry in the event of errors from the FTP server:

In [14]:
def retry_if_ftp_error(exception):
    print(exception)
    return any((
        isinstance(exception, ftplib.error_reply),
        isinstance(exception, ftplib.error_temp),
        isinstance(exception, ftplib.error_perm),
        isinstance(exception, ftplib.error_proto),
    ))
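
The four exception classes listed above all derive from `ftplib.Error`, so the check could be written more compactly. A sketch, with the hypothetical name `is_ftp_error` and without the `print()` call:

```python
import ftplib


def is_ftp_error(exception):
    """Return True for ftplib's reply, temp, perm, and proto errors."""
    return isinstance(exception, ftplib.Error)


print(is_ftp_error(ftplib.error_temp('450 try again later')))
print(is_ftp_error(ValueError('not an FTP error')))
```

Note that `ftplib.all_errors` is broader than this: it also includes `OSError` and `EOFError`, which may or may not be worth retrying on.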

Tests have shown that searches typically take 5 to 10 minutes to complete,
but in some cases they take nearly 2 hours.
So, we'll set our retry interval to 1 minute,
and keep retrying for up to 120 minutes.

We decorate the function that polls the FTP server path
to both retry if there is an FTP error,
and retry if the `.mat` file is not present:

In [15]:
@retry(
    retry_on_exception=retry_if_ftp_error,
    wrap_exception=True, 
    wait_fixed=60*1000,
    stop_max_delay=120*60*1000,
)
@retry(
    retry_on_result=retry_if_not_matfile, 
    wait_fixed=60*1000,
    stop_max_delay=120*60*1000,
)
def poll_ftp_path(ftp, path):
    return ftp.mlsd(path)
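
To make the behaviour of the stacked decorators easier to reason about, here is a hand-rolled sketch of the same fixed-interval polling idea using only the standard library. It is not how `retrying` is implemented; `poll_until`, `fetch`, and `should_retry` are hypothetical names, and the `sleep` and `clock` arguments exist only so that the loop can be exercised without actually waiting.

```python
import time


def poll_until(fetch, should_retry, interval=60, timeout=120 * 60,
               sleep=time.sleep, clock=time.monotonic):
    """Call fetch() every `interval` seconds until should_retry(result) is False.

    Raises TimeoutError once `timeout` seconds have elapsed.
    """
    start = clock()
    while True:
        result = fetch()
        if not should_retry(result):
            return result
        if clock() - start >= timeout:
            raise TimeoutError('gave up after {} seconds'.format(timeout))
        sleep(interval)


# Simulated listings: the .mat file appears on the third poll.
listings = iter([['data.rdi'], ['data.rdi'], ['data.mat']])
polled = poll_until(
    fetch=lambda: next(listings),
    should_retry=lambda names: not any(n.endswith('.mat') for n in names),
    sleep=lambda seconds: None,  # skip the real wait in this demonstration
)
print(polled)
```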

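Getting the `.mat` file name should only be subject to possible FTP errors,
so we'll decorate a function that does that to retry with exponentially increasing waits,
capped at 1 minute between attempts,
if it encounters an FTP error.
Note that no stop condition is set here,
so it keeps retrying for as long as FTP errors occur: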

In [16]:
@retry(
    retry_on_exception=retry_if_ftp_error,
    wrap_exception=True,
    wait_exponential_multiplier=5*1000,
    wait_exponential_max=60*1000,
)
def get_matfile_name(ftp, path):
    for filename, facts in ftp.mlsd(path):
        if os.path.splitext(filename)[1] == '.mat':
            return filename
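
The `wait_exponential_multiplier` and `wait_exponential_max` arguments used above give waits that grow with each attempt up to a cap. The sketch below models that schedule. The formula (multiplier times 2 to the power of the attempt number, capped at the maximum) is my understanding of how `retrying` computes the wait and should be treated as an assumption; `backoff_waits` is a hypothetical name.

```python
def backoff_waits(attempts, multiplier_ms=5 * 1000, max_ms=60 * 1000):
    """Return the modelled wait in milliseconds after each failed attempt."""
    return [min(multiplier_ms * 2 ** n, max_ms) for n in range(1, attempts + 1)]


print(backoff_waits(5))
```

Under that assumption the waits reach the 60 second cap after a handful of failures and stay there.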

And we'll use the same decoration for the function that downloads
the `.mat` file:

In [17]:
@retry(
    retry_on_exception=retry_if_ftp_error,
    wrap_exception=True,
    wait_exponential_multiplier=5*1000,
    wait_exponential_max=60*1000,
)
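def get_matfile(filename):
    # Relies on the module-level `ftp` connection opened in the cell below.
    # The `with` block ensures the local file is closed, even on errors.
    with open(filename, 'wb') as f:
        ftp.retrbinary('RETR {}'.format(filename), f.write)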
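
A variant that takes the connection as an argument, and that avoids leaving a truncated `.mat` file behind if the transfer fails part way, is sketched below. `download_matfile` is a hypothetical name, and writing to a temporary `.part` file before renaming is a suggestion rather than something the original workflow does. `FakeFTP` stands in for a real `ftplib.FTP` connection so that the example runs offline.

```python
import os
import tempfile


def download_matfile(ftp, filename, dest_dir='.'):
    """Download filename via ftp into dest_dir, renaming only on success."""
    dest = os.path.join(dest_dir, filename)
    part = dest + '.part'
    try:
        with open(part, 'wb') as f:
            ftp.retrbinary('RETR {}'.format(filename), f.write)
    except Exception:
        # Remove the partial file so a retry starts from a clean slate.
        if os.path.exists(part):
            os.remove(part)
        raise
    os.replace(part, dest)
    return dest


class FakeFTP:
    """Stand-in for ftplib.FTP that serves fixed bytes (illustration only)."""

    payload = b'not really a .mat file'

    def retrbinary(self, cmd, callback):
        callback(self.payload)


with tempfile.TemporaryDirectory() as tmp_dir:
    saved = download_matfile(FakeFTP(), 'example.mat', dest_dir=tmp_dir)
    saved_size = os.path.getsize(saved)
print(saved_size)
```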

Putting it all together:

In [20]:
path = path_tmpl.format(
    data_date=data_date, userno=userno, searchHdrId=data_response['searchHdrId'])

with ftplib.FTP(ftp_server) as ftp:
    ftp.login()
    poll_ftp_path(ftp, path)
    filename = get_matfile_name(ftp, path)
    print(filename)
    ftp.cwd(path)
    get_matfile(filename)

RDIADCP150WH8497_20160717T000000.595Z-1A5010.rdi {'unix.owner': '80', 'type': 'file', 'unique': '16U8988D330F4C287FA', 'modify': '20160718215700', 'perm': 'adfr', 'size': '25077204', 'unix.group': '80', 'unix.mode': '0644'}
RDIADCP150WH8497_20160717T000000.595Z-1A5010.rdi {'unix.owner': '80', 'type': 'file', 'unique': '16U8988D330F4C287FA', 'modify': '20160718215801', 'perm': 'adfr', 'size': '62374122', 'unix.group': '80', 'unix.mode': '0644'}
RDIADCP150WH8497_20160717T000000.595Z-1A5010.rdi {'unix.owner': '80', 'type': 'file', 'unique': '16U8988D330F4C287FA', 'modify': '20160718215829', 'perm': 'adfr', 'size': '80094654', 'unix.group': '80', 'unix.mode': '0644'}
RDIADCP150WH8497_20160717T000000.595Z-1A5010.mat {'unix.owner': '80', 'type': 'file', 'unique': '16U886CD77AF2B740F3', 'modify': '20160718220401', 'perm': 'adfr', 'size': '6174229', 'unix.group': '80', 'unix.mode': '0644'}
RDIADCP150WH8497_20160717T000000.595Z-1A5010.mat


In [21]:
ls -l *.mat

-rw-rw-r-- 1 doug doug 6174229 Jul 18 15:04 RDIADCP150WH8497_20160717T000000.595Z-1A5010.mat
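
One cheap sanity check on the download is to compare the local file size with the `size` fact reported by the FTP listing (6174229 bytes in both outputs above). A sketch, with the hypothetical name `size_matches`; note that the listing reports the size as a string. A small temporary file stands in for the real download here.

```python
import os
import tempfile


def size_matches(facts, local_path):
    """Return True if the local file size equals the 'size' fact from MLSD."""
    return int(facts['size']) == os.path.getsize(local_path)


# Demonstration with a 10 byte temporary file in place of the real download.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'x' * 10)
check_ok = size_matches({'size': '10'}, tmp.name)
check_bad = size_matches({'size': '11'}, tmp.name)
os.remove(tmp.name)
print(check_ok, check_bad)
```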
