# A First Example Using the Sarracenia Moth API

Sarracenia is a package built to announce the availability of new data, usually as files.
We put files on standard servers, making them available via web or sftp, and tell
users that they have arrived using messages.  

Sarracenia uses existing standard message passing protocols, such as AMQP (RabbitMQ), to transport the messages.
In message passing circles, a server that distributes messages is called a *broker*.

We call the combination of a message broker, and a file server (which can be a single server, or a large cluster) a **data pump**.

Assuming you have installed the **metpx-sr3** package, either as a Debian package or via pip,
one way to access announcements is to use the sarracenia.moth (Messages Organized by Topic Headers) class,
which allows a Python program to connect to a Sarracenia server and start receiving
messages that announce resources.

The factory to build sarracenia.moth objects takes two arguments: 

* broker: an object (Credential) containing a url pointing to the message server that is announcing products, and other related options.
* options: a dictionary of other settings the class might use.

The example below builds a call to a broker anyone can access, and just requests
10 announcements.

You can run it, and then we can discuss a few settings:


In [1]:
import sarracenia.moth
import sarracenia.moth.amqp
import sarracenia.credentials

import time
import socket

broker = sarracenia.credentials.Credential('amqps://anonymous:anonymous@hpfx.collab.science.gc.ca')

options = sarracenia.moth.default_options
options.update(sarracenia.moth.amqp.default_options)
options['topicPrefix'] = [ 'v02', 'post' ]
options['bindings'] = [('xpublic', ['v02', 'post'], ['#'])]
options['queue_name'] = 'q_anonymous_' + socket.getfqdn() + '_SomethingHelpfulToYou'

print('options: %s' % options)



options: {'acceptUnmatched': True, 'batch': 25, 'bindings': [('xpublic', ['v02', 'post'], ['#'])], 'broker': None, 'exchange': None, 'expire': 300, 'inline': False, 'inline_encoding': 'guess', 'inline_max': 4096, 'logFormat': '%(asctime)s [%(levelname)s] %(name)s %(funcName)s %(message)s', 'logLevel': 'info', 'messageDebugDump': False, 'message_strategy': {'reset': True, 'stubborn': True, 'failure_duration': '5m'}, 'message_ttl': 0, 'topicPrefix': ['v02', 'post'], 'tls_rigour': 'normal', 'queue_name': 'q_anonymous_WPNRSSC229038-VM1_SomethingHelpfulToYou', 'subtopic': [], 'durable': True, 'prefetch': 25, 'auto_delete': False, 'vhost': '/', 'reset': False, 'declare': True, 'bind': True}



The **broker** setting is an object containing a conventional URL and other options, indicating the messaging protocol to be used to access the upstream server. When you connect to a broker, you need to tell it what messages you are interested in.
In Moth, all the brokers we are accessing are expected to use topic hierarchies. Once you have successfully
run the examples, you can see them in the message printouts, where the dictionaries carry a "topic" element.
Here is an example of one:

__v02.post.20210213.WXO-DD.observations.swob-ml.20210213.CTZR__

This divides into two parts:

* topic_prefix: v02.post
* the rest of the topic tree is a reflection of the path to the announced product, relative to a base directory.
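
The split described above can be sketched in a few lines of Python. This is only an illustration: split_topic is a hypothetical helper, not part of Sarracenia, and it assumes the prefix is always the first two elements of the topic (as in v02.post) and that the remaining elements map one-to-one onto directories.

```python
def split_topic(topic, prefix_len=2):
    """Split a dotted topic into (topic_prefix, subtopic) lists.

    Assumption: the prefix is the first `prefix_len` elements of the topic.
    """
    parts = topic.split('.')
    return parts[:prefix_len], parts[prefix_len:]


topic = 'v02.post.20210213.WXO-DD.observations.swob-ml.20210213.CTZR'
prefix, subtopic = split_topic(topic)

print('.'.join(prefix))    # v02.post
print('/'.join(subtopic))  # 20210213/WXO-DD/observations/swob-ml/20210213/CTZR
```

Joining the subtopic elements with slashes gives the directory path, relative to the base directory, under which the announced product was placed.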


In AMQP, there is the concept of "exchanges", which are somewhat analogous to television channels: they are groupings of announcements. So, to connect to an AMQP broker, one needs to specify:

* exchange: Sarracenia promulgates xpublic as a conventional default.
* topic_prefix: decide which version of messages you want to obtain.  This server is producing v02 ones.
* subtopic: what subset of topic_prefix messages do we want to subscribe to.


## Bindings

The bindings option sets out the three values above. In the example, the bindings are:

* topic_prefix: v02.post  (get v02 messages.)
* exchange: xpublic (the default one.)
* subtopic: # ( an AMQP wildcard meaning everything. )

We connect to the amqps://hpfx.collab.science.gc.ca broker, on the *xpublic* exchange, and we are interested in all messages matching the v02.post.# topic specification (which is all v02 messages available).

### subtopic

The subtopic here ( __#__ ) matches everything produced on the server. The wider the subtopic, the more messages have to be sent, and the more processing is done. It is better to make it narrower. Taking the example above, if we are interested in swob, a subtopic like:

* *.WXO-DD.observations.swob-ml.#

would match all of the swobs similar to the one above, but avoid sending messages for non-swobs to you.
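
To get a feel for how such a pattern selects messages, the matching rule can be imitated locally. In AMQP topic patterns, an asterisk stands for exactly one word and # for zero or more words. The function below is a hypothetical illustration (topic_matches is not a Sarracenia or AMQP library call); on a real connection the broker does this filtering, so that unwanted messages are never sent.

```python
def topic_matches(pattern, topic):
    """Return True if a dotted AMQP-style pattern matches a dotted topic.

    '*' matches exactly one word, '#' matches zero or more words.
    """
    def match(p, t):
        if not p:
            return not t
        if p[0] == '#':
            # '#' either absorbs nothing (drop it) or absorbs one more word.
            return match(p[1:], t) or (bool(t) and match(p, t[1:]))
        if not t:
            return False
        if p[0] == '*' or p[0] == t[0]:
            return match(p[1:], t[1:])
        return False

    return match(pattern.split('.'), topic.split('.'))


swob = '20210213.WXO-DD.observations.swob-ml.20210213.CTZR'
citypage = '20220204.WXO-DD.citypage_weather.xml.QC'
pattern = '*.WXO-DD.observations.swob-ml.#'

print(topic_matches(pattern, swob))      # True
print(topic_matches(pattern, citypage))  # False
print(topic_matches('#', citypage))      # True
```

The patterns here are applied to the subtopic part only, that is, the topic with the v02.post prefix already removed.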


## queue_name

By convention in brokers administered by Sarracenia, users can only create queues that start with q_ followed by their user name. We connected as anonymous, and so q_anonymous must be used. After that, the rest can be whatever you want, but there are a few considerations:

* If you want to start up multiple Python processes to share a data feed, have them all specify the same queue_name, and they will share the flow of messages. This scales well for a few dozen co-operating downloaders, but not indefinitely: do not expect more than 99 or so processes to share the load from a single queue effectively. To scale beyond that with AMQP, multiple selections are better.

* If you are going to ask for help from the data pump admins, you will need to give them the name of the queue, and they may need to pick it out of the hundreds or thousands that are on the server.
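
A small helper can keep queue names consistent with these considerations. This is a sketch only: make_queue_name is a hypothetical function, and the q_<user>_<host>_<purpose> layout simply mirrors the name built in the first example.

```python
import socket


def make_queue_name(user, purpose, host=None):
    """Build a queue name of the form q_<user>_<host>_<purpose>."""
    if host is None:
        host = socket.getfqdn()  # same default as the first example
    return 'q_%s_%s_%s' % (user, host, purpose)


print(make_queue_name('anonymous', 'swob_temperature_demo', host='myhost.example.com'))
# q_anonymous_myhost.example.com_swob_temperature_demo
```

Processes that should share one queue must end up with the same name, so when they run on different machines, pass the same explicit host value rather than relying on the local host name. A descriptive purpose makes the queue easy for an administrator to find.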


## Messages

Different messaging protocols have different storage structures and conventions. The Moth class returns
messages as Python dictionaries regardless of what protocol is used to obtain them or, if forwarding them, to transmit them. One can add fields for programmatic use to the messages just by adding elements to the dictionary.
If they are only for internal use, then add the name of the dictionary element to the special '\_deleteOnPost' key, so that the dictionary element will be dropped when forwarding the message.
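
As a minimal sketch of that convention, assuming '\_deleteOnPost' holds a Python set of key names (as it does in the messages printed by the example below): both function names here are hypothetical, and strip_for_post only imitates what happens on forwarding; it is not Sarracenia's own code.

```python
def add_internal_field(msg, key, value):
    """Attach a field for local use and mark it to be dropped on forwarding."""
    msg[key] = value
    msg.setdefault('_deleteOnPost', set()).add(key)


def strip_for_post(msg):
    """Return a copy without the fields named in '_deleteOnPost' (illustration only)."""
    drop = set(msg.get('_deleteOnPost', ())) | {'_deleteOnPost'}
    return {k: v for k, v in msg.items() if k not in drop}


msg = {
    'baseUrl': 'https://hpfx.collab.science.gc.ca',
    'relPath': '/20220204/WXO-DD/citypage_weather/xml/QC/s0000353_e.xml',
    '_deleteOnPost': {'exchange', 'ack_id'},
    'exchange': 'xpublic',
    'ack_id': 1,
}

add_internal_field(msg, 'local_note', 'seen by my script')
print(sorted(strip_for_post(msg)))  # ['baseUrl', 'relPath']
```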


## Ack

Messages are marked as in transit by the broker, and if you do not acknowledge them, the data pump will hold onto them and eventually re-dispatch them. Keeping pending messages in memory will also slow down processing of all messages. One should acknowledge receipt of messages as soon as practicable, but not so soon that you will lose data if the program is interrupted. In the example, we acknowledge after we have done our work of printing the message.




In [2]:
h = sarracenia.moth.Moth.subFactory(broker, options)

count=0
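while count < 10:
    m = h.getNewMessage()  # get at most one message; None if nothing is waiting
    if m is not None:
        print("message %d: %s" % (count, m))
        h.ack(m)  # acknowledge only after the work (printing) is done
        count += 1  # count only the announcements actually received
    time.sleep(0.1)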

h.cleanup() # remove server-side queue defined by Factory.
h.close()
print("obtained 10 product announcements")

message 4: {'mode': '644', 'to_clusters': 'ALL', 'atime': '20220204T203210.946426868', 'from_cluster': 'WXO', 'source': 'WXO', 'mtime': '20220204T203210.946426868', 'subtopic': ['20220204', 'WXO-DD', 'citypage_weather', 'xml', 'QC'], '_deleteOnPost': {'exchange', 'local_offset', 'subtopic', 'ack_id'}, 'pubTime': '20220204T203211.503924', 'baseUrl': 'https://hpfx.collab.science.gc.ca', 'relPath': '/20220204/WXO-DD/citypage_weather/xml/QC/s0000353_e.xml', 'integrity': {'method': 'md5', 'value': '/aigrNFaa5t0jG/VRuRkrg=='}, 'size': 29378, 'exchange': 'xpublic', 'ack_id': 1, 'local_offset': 0}
message 5: {'from_cluster': 'WXO', 'mtime': '20220204T203210.94242692', 'to_clusters': 'ALL', 'mode': '644', 'source': 'WXO', 'atime': '20220204T203210.94242692', 'subtopic': ['20220204', 'WXO-DD', 'citypage_weather', 'xml', 'QC'], '_deleteOnPost': {'exchange', 'local_offset', 'subtopic', 'ack_id'}, 'pubTime': '20220204T203211.500815', 'baseUrl': 'https://hpfx.collab.science.gc.ca', 'relPath': '/2022

The second example combines baseUrl and relPath (or retPath, when the message has one) into a URL and retrieves the data.
It uses newMessages() instead of getNewMessage() to show the alternate consumption interface, which returns a batch of messages per call.
Retrieval here is over HTTP. How to retrieve varies with the protocol listed in the baseUrl, and can get
complicated.
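
The URL assembly used in the next cell can be isolated in a small helper. This is a minimal sketch: data_url is a hypothetical name, not a Sarracenia function, and normalizing the slash at the join is an added assumption. The message fields are copied from the output above.

```python
def data_url(msg):
    """Join baseUrl with retPath when present, otherwise with relPath."""
    path = msg['retPath'] if 'retPath' in msg else msg['relPath']
    # Assumption: make sure exactly one '/' separates the two parts.
    return msg['baseUrl'].rstrip('/') + '/' + path.lstrip('/')


msg = {
    'baseUrl': 'https://hpfx.collab.science.gc.ca',
    'relPath': '/20220204/WXO-DD/citypage_weather/xml/QC/s0000353_e.xml',
}
print(data_url(msg))
# https://hpfx.collab.science.gc.ca/20220204/WXO-DD/citypage_weather/xml/QC/s0000353_e.xml
```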



In [4]:
import urllib.request
import xml.etree.ElementTree as ET


options['bindings'] = [('xpublic', [ 'v02', 'post'], \
        [ '*', 'WXO-DD', 'observations', 'swob-ml', '#'] )]

h = sarracenia.moth.Moth.subFactory(broker, options)

count=0

while count < 10:
    messages = h.newMessages()  # get all received messages, up to options['batch'] of them at a time.
    for m in messages:
        dataUrl = m['baseUrl']
        if 'retPath' in m:
           dataUrl += m['retPath']
        else:
           dataUrl += m['relPath']

        print("url %d: %s" % (count,dataUrl) )
        with urllib.request.urlopen( dataUrl ) as f:
            vxml = f.read().decode('utf-8')
            xmlData = ET.fromstring(vxml)

            stn_name=''
            tc_id=''
            lat=''
            lon=''
            air_temp=''

            for i in xmlData.iter():
                name = i.get('name')
                if name == 'stn_nam' :
                   stn_name= i.get('value')
                elif name == 'tc_id' :
                   tc_id = i.get('value')
                elif name == 'lat' :
                   lat =  i.get('value')
                elif name == 'long' :
                   lon  = i.get('value')
                elif name == 'air_temp' :
                   air_temp = i.get('value')

            print( 'station: %s, tc_id: %s, lat: %s, long: %s, air_temp: %s' %
                   ( stn_name, tc_id, lat, lon, air_temp  ))
        h.ack(m)
        count += 1
        if count >= 10:  # stop after exactly 10 products
            break
    time.sleep(1)

h.cleanup() # remove server-side queue defined by Factory.
h.close()
print("obtained 10 product temperatures")


url 0: https://hpfx.collab.science.gc.ca/20220204/WXO-DD/observations/swob-ml/partners/yt-gov/20220204/antimony_creek/2022-02-04-2000-ytg-antimonycreek-antimony_creek-AUTO-swob.xml
station: Antimony Creek, tc_id: , lat: 64.01471, long: -138.61544, air_temp: -25.2
url 1: https://hpfx.collab.science.gc.ca/20220204/WXO-DD/observations/swob-ml/partners/yt-gov/20220204/henderson/2022-02-04-2000-ytg-henderson-henderson-AUTO-swob.xml
station: Henderson, tc_id: , lat: 63.591667, long: -138.950714, air_temp: MSNG
url 2: https://hpfx.collab.science.gc.ca/20220204/WXO-DD/observations/swob-ml/partners/yt-gov/20220204/braeburn-w/2022-02-04-2000-ytg-braeburn-w-braeburn-w-AUTO-swob.xml
station: Braeburn-W, tc_id: , lat: 61.481453, long: -135.779817, air_temp: -19.4
url 3: https://hpfx.collab.science.gc.ca/20220204/WXO-DD/observations/swob-ml/partners/yt-gov/20220204/haines_jct/2022-02-04-2000-ytg-hainesjct-haines_jct-AUTO-swob.xml
station: Haines Jct, tc_id: , lat: 60.772872, long: -137.575847, air_t

# Downloading data with Python

You can use the urllib python library to download data, and then parse it.
In this example, the data for each message is an XML document, which is downloaded and read into memory.
Some station data is then printed.

This works well with urllib for Hypertext Transfer Protocol (HTTP) resources, but other resources may be announced using other protocols, such as sftp or ftp. The Python code will need to be expanded to deal
with other protocols, as well as error conditions, such as temporary failures.
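
One possible shape for that expansion is sketched below, using only the standard library. The fetch function, its parameters, and the retry policy are all assumptions made for illustration. It handles http and https only, rejects other schemes explicitly, and retries on any OSError (which covers urllib.error.URLError and timeouts); a fuller version would tell permanent failures, such as a missing file, apart from transient ones.

```python
import time
import urllib.error
import urllib.parse
import urllib.request


def fetch(url, attempts=3, delay=1.0, opener=urllib.request.urlopen):
    """Download url over http(s), retrying failed attempts with a growing delay."""
    scheme = urllib.parse.urlsplit(url).scheme
    if scheme not in ('http', 'https'):
        # sftp, ftp, etc. would each need their own retrieval code.
        raise NotImplementedError('no handler for %r in this sketch' % scheme)

    for attempt in range(1, attempts + 1):
        try:
            with opener(url, timeout=30) as f:
                return f.read()
        except OSError:
            if attempt == attempts:
                raise  # give up after the last attempt
            time.sleep(delay * attempt)
```

In the loop above, the urlopen call could then be replaced by something like vxml = fetch(dataUrl).decode('utf-8'). The opener parameter exists only so that the function can be exercised without network access.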


# Conclusion

Sarracenia.moth.amqp is the lightest-weight way to add consumption of Sarracenia messages to your existing python stack. You explicitly ask for new messages when ready to use them. 

Things this type of integration does not provide:

* data retrieval:  you need your own code to download the corresponding data, 

* error recovery: if there are transient errors, then you need to build error recovery code (for recovering partial downloads.)

* async/event/data driven: a way to say "do this every time you get a file" ... define callbacks to be run when a particular event happens, rather than the sequential flow shown above.

The sarracenia.flow class provides downloads, error recovery, and an asynchronous API using the sarracenia.flowcb (flowCallback) class.


