<img style="float: right;" src="images/hyperstream.svg">

# HyperStream Tutorial 5: Workflows

Workflows define a graph of streams. Usually, the first stream will be a special "raw" stream that pulls in data from a custom data source. Workflows can have multiple time ranges, which will cause the streams to be computed on all of the ranges given.

## Introduction

In this tutorial, we will be using a time-series dataset about the temperature in different countries and cities. The dataset is available at [The Census at School New Zealand][1]. The necessary files for this tutorial are already included in the folder **data/TimeSeriesDatasets_130207**.

In particular, there are four files with the minimum and maximum temperatures in different cities of Asia, Australia, NZ and USA from 2000 to 2012. The dataset also includes the rainfall levels of New Zealand.

![workflows](images/workflow_world_temp.svg)

[1]: http://new.censusatschool.org.nz/resource/time-series-data-sets-2013/

In [1]:
try:
    %load_ext watermark
    watermark = True
except ImportError:
    watermark = False
    pass

import sys
sys.path.append("../")  # Add the parent directory to the module search path

from hyperstream import HyperStream
from hyperstream import TimeInterval
from hyperstream.utils import UTC
import hyperstream

from datetime import datetime
from utils import plot_high_chart
from utils import plot_multiple_stock
from dateutil.parser import parse

if watermark:
    %watermark -v -m -p hyperstream -g

hs = HyperStream(loglevel=30)
print(hs)
print([p.channel_id_prefix for p in hs.config.plugins])

CPython 2.7.6
IPython 5.3.0

hyperstream 0.3.6

compiler   : GCC 4.8.4
system     : Linux
release    : 3.19.0-80-generic
machine    : x86_64
processor  : x86_64
CPU cores  : 4
interpreter: 64bit
Git hash   : f08960c9be4d0e40646dfe20d85d7105619ab4d6
HyperStream version 0.3.6, connected to mongodb://localhost:27017/hyperstream, session id <no session>
[u'example', u'data_importers', u'data_generators']


## Reading the data

In the data folder there are four csv files: **TempAsia.csv, TempAustralia.csv, TempNZ.csv and TempUSA.csv**. The first row of each csv file is a header with the column names. The first column is the date, and the remaining columns hold the minimum and maximum temperatures of different cities, named **cityMin** and **cityMax**.

Here is an example of the header and the first data row of the **TempAsia.csv** file:

```
Date,TokyoMax,TokyoMin,BangkokMax,BangkokMin
2000M01,11.2,4.2,32.8,24
```

The dates have the form **YYYYMmm**, where **YYYY** is the year and **mm** is the month. Because the default parser of the **csv_reader** tool does not recognize this format, we need to specify our own parser that first replaces the **M** with a hyphen **-** and then applies **dateutil.parser.parse**.
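One detail worth knowing: when a date string has no day, `dateutil.parser.parse` fills the missing fields from its `default` argument, which falls back to the current date, so the day of the month in the parsed timestamps depends on when the code is run. The sketch below is a standard-library alternative, not the parser used in this tutorial; it pins every record to the first day of the month. The names `SimpleUTC` and `parse_year_month` are hypothetical, and `SimpleUTC` only stands in for `hyperstream.utils.UTC` so that the snippet runs on its own.

```python
from datetime import datetime, timedelta, tzinfo


class SimpleUTC(tzinfo):
    """Minimal UTC time zone (hypothetical stand-in for hyperstream.utils.UTC)."""

    def utcoffset(self, dt):
        return timedelta(0)

    def tzname(self, dt):
        return "UTC"

    def dst(self, dt):
        return timedelta(0)


def parse_year_month(text):
    """Parse 'YYYYMmm' (e.g. '2000M01') into a UTC datetime on day 1."""
    # In the format string, '%Y' is the year, 'M' is a literal character
    # and '%m' is the month number.
    return datetime.strptime(text, "%YM%m").replace(tzinfo=SimpleUTC())


first = parse_year_month("2000M01")
print(first.isoformat())
```

A function like this could be passed wherever the tutorial passes `dateparser`, assuming the tool only requires a callable that maps a string to a timezone-aware `datetime`.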

Then, we will use a tool to read each csv file, and a Stream to store the results of applying the tool. When we tell the tool that the csv file has a header row, the value of each Stream instance is a dictionary that maps each column name to its value. For example, a Stream instance for the four columns shown above will look like:

```
[2000-01-19 00:00:00+00:00]: {'BangkokMin': 24.0, 'BangkokMax': 32.8, 'TokyoMin': 4.2, 'TokyoMax': 11.2}
```
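To make the mapping from a csv row to a stream value concrete, here is a plain-Python sketch that pairs the header with the data row shown above. It only illustrates the resulting dictionary; it is not how the **csv_reader** tool is implemented, and the helper name `row_to_value` is hypothetical.

```python
header_line = "Date,TokyoMax,TokyoMin,BangkokMax,BangkokMin"
data_line = "2000M01,11.2,4.2,32.8,24"


def row_to_value(header_line, data_line):
    """Return (date string, {column name: float}) for one csv row."""
    names = header_line.split(",")
    fields = data_line.split(",")
    # The first column is the date; the remaining columns are temperatures.
    value = dict(zip(names[1:], (float(f) for f in fields[1:])))
    return fields[0], value


date, value = row_to_value(header_line, data_line)
print(date, sorted(value.items()))
```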

In [2]:
def dateparser(dt):
    return parse(dt.replace('M', '-')).replace(tzinfo=UTC)

ti_all = TimeInterval(datetime(1999, 1, 1).replace(tzinfo=UTC),
                      datetime(2013, 1, 1).replace(tzinfo=UTC))
ti_sample = TimeInterval(datetime(2007, 1, 1).replace(tzinfo=UTC),
                         datetime(2007, 3, 1).replace(tzinfo=UTC))

# M will be the Memory Channel
M = hs.channel_manager.memory

countries_list = ['Asia', 'Australia', 'NZ', 'USA']

## Create the plates and meta_data instances

In [3]:
countries_dict = {
    'Asia': ['BangkokMin', 'BangkokMax', 'HongKongMax', 'HongKongMin', 'KualaLumpurMax', 'KualaLumpurMin',
             'NewDelhiMax', 'NewDelhiMin', 'TokyoMax', 'TokyoMin'],
    'Australia': ['BrisbaneMax', 'BrisbaneMin', 'Canberramax', 'CanberraMin', 'GoldCoastMax', 'GoldCoastMin',
                  'MelbourneMin', 'Melbournemax',  'SydneyMax', 'SydneyMin'],
    'NZ': ['AucklandMax', 'AucklandMin', 'ChristchurchMax', 'ChristchurchMin', 'DunedinMax', 'DunedinMin',
           'HamiltonMax', 'HamiltonMin','WellingtonMax', 'WellingtonMin'],
    'USA': ['ChicagoMin', 'ChicagoMax', 'HoustonMax', 'HoustonMin', 'LosAngelesMax', 'LosAngelesMin',
            'NYMax', 'NYMin', 'SeattleMax', 'SeattleMin']
}

# delete_plate requires child plates to be deleted before their parents
for plate_id in ['C.C', 'C']:
    if plate_id in hs.plate_manager.plates:
        hs.plate_manager.delete_plate(plate_id=plate_id, delete_meta_data=True)

for country in countries_dict:
    id_country = 'country_' + country
    if not hs.plate_manager.meta_data_manager.contains(identifier=id_country):
        hs.plate_manager.meta_data_manager.insert(
            parent='root', data=country, tag='country', identifier=id_country)
    for city in countries_dict[country]:
        id_city = id_country + '.' + 'city_' + city
        if not hs.plate_manager.meta_data_manager.contains(identifier=id_city):
            hs.plate_manager.meta_data_manager.insert(
                parent=id_country, data=city, tag='city', identifier=id_city)
            
C = hs.plate_manager.create_plate(plate_id="C", description="Countries", values=[], complement=True,
                                  parent_plate=None, meta_data_id="country")
CC = hs.plate_manager.create_plate(plate_id="C.C", description="Cities", values=[], complement=True,
                                   parent_plate="C", meta_data_id="city")
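The identifiers built in the loop above follow a dotted `parent.child` pattern. The following standalone sketch reproduces only that naming scheme with plain Python, without touching the meta data manager; `meta_data_entries` is a hypothetical helper and the sample dictionary is a shortened copy of `countries_dict`.

```python
def meta_data_entries(countries):
    """Yield (parent, tag, identifier, data) tuples, parents before children."""
    for country in sorted(countries):
        id_country = 'country_' + country
        yield ('root', 'country', id_country, country)
        for city in countries[country]:
            yield (id_country, 'city', id_country + '.city_' + city, city)


sample = {'NZ': ['AucklandMax', 'AucklandMin']}
entries = list(meta_data_entries(sample))
for entry in entries:
    print(entry)
```

Each identifier corresponds to one node of the meta data tree, for example `country_NZ.city_AucklandMax`.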

In [4]:
print(hs.plate_manager.meta_data_manager.global_plate_definitions)


root[root:None]
╟── country[country_NZ:NZ]
║   ╟── city[country_NZ.city_AucklandMax:AucklandMax]
║   ╟── city[country_NZ.city_AucklandMin:AucklandMin]
║   ╟── city[country_NZ.city_ChristchurchMax:ChristchurchMax]
║   ╟── city[country_NZ.city_ChristchurchMin:ChristchurchMin]
║   ╟── city[country_NZ.city_DunedinMax:DunedinMax]
║   ╟── city[country_NZ.city_DunedinMin:DunedinMin]
║   ╟── city[country_NZ.city_HamiltonMax:HamiltonMax]
║   ╟── city[country_NZ.city_HamiltonMin:HamiltonMin]
║   ╟── city[country_NZ.city_WellingtonMax:WellingtonMax]
║   ╙── city[country_NZ.city_WellingtonMin:WellingtonMin]
╟── country[country_Australia:Australia]
║   ╟── city[country_Australia.city_BrisbaneMax:BrisbaneMax]
║   ╟── city[country_Australia.city_BrisbaneMin:BrisbaneMin]
║   ╟── city[country_Australia.city_Canberramax:Canberramax]
║   ╟── city[country_Australia.city_CanberraMin:CanberraMin]
║   ╟── city[country_Australia.city_GoldCoastMax:GoldCoastMax]
║   ╟── city[country_Australia.city_GoldCoastMin

## Create the workflow and execute it

In [5]:
from hyperstream import Workflow

# parameters for the csv_multi_reader tool
csv_params = dict(
    filename_template='data/TimeSeriesDatasets_130207/Temp{}.csv',
    datetime_parser=dateparser, skip_rows=0, header=True)

def mean(x):
    return float(sum(x)) / max(len(x), 1)

with Workflow(workflow_id='tutorial_05', name='tutorial_05', owner='tutorials',
              description='Tutorial 5 workflow', online=False) as w:

    country_node = w.create_node(stream_name='raw_data', channel=M, plates=[C])
    city_node = w.create_node(stream_name='city_temperature', channel=M, plates=[CC])
    # FIXME Are these correct?
    country_node_avg_temp = w.create_node(stream_name='country_avg_temp', channel=M, plates=[C])
    # FIXME Should I create a node outside the plates?
    world_node_avg_temp = w.create_node(stream_name='world_avg_temp', channel=M, plates=[])

    for c in C:
        country_node[c] = hs.plugins.data_importers.factors.csv_multi_reader(source=None, **csv_params)
        for cc in CC[c]:
            city_node[cc] = hs.factors.splitter_from_stream(source=country_node[c],
                                                            splitting_node=country_node[c],
                                                            use_mapping_keys_only=True)
        country_node_avg_temp[c] = hs.factors.aggregate(sources=[city_node],
                                                        alignment_node=None,
                                                        aggregation_meta_data='city',
                                                        func=mean)
    # FIXME Should I create a node for a stream outside the plates?
    world_node_avg_temp = hs.factors.aggregate(sources=[country_node_avg_temp],
                                               alignment_node=None,
                                               aggregation_meta_data='country',
                                               func=mean)
    w.execute(ti_all)
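Conceptually, the splitter step turns one dictionary-valued stream per country into one scalar-valued stream per city column. The sketch below mimics that idea with plain lists and dictionaries; it is an illustration under that assumption, not the implementation of `splitter_from_stream`, and `split_by_key` is a hypothetical name. The sample instance reuses the first data row of **TempAsia.csv**.

```python
from collections import defaultdict


def split_by_key(instances):
    """Split (timestamp, dict) pairs into {key: [(timestamp, value), ...]}."""
    streams = defaultdict(list)
    for timestamp, value in instances:
        for key, item in value.items():
            streams[key].append((timestamp, item))
    return dict(streams)


raw_data = [
    ("2000M01", {"TokyoMax": 11.2, "TokyoMin": 4.2,
                 "BangkokMax": 32.8, "BangkokMin": 24.0}),
]
city_streams = split_by_key(raw_data)
for name in sorted(city_streams):
    print(name, city_streams[name])
```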

## See the country raw data streams

In [6]:
# Print the results
for stream_id, stream in country_node.streams.iteritems():
    print(stream_id)
    print(stream.window().first())

(('country', 'NZ'),)
StreamInstance(timestamp=datetime.datetime(2000, 1, 29, 0, 0, tzinfo=<UTC>), value={'WellingtonMin': 14.2, 'ChristchurchMin': 10.8, 'HamiltonMin': 12.4, 'DunedinMax': 18.2, 'WellingtonMax': 20.0, 'ChristchurchMax': 20.2, 'AucklandMax': 23.4, 'HamiltonMax': 23.8, 'DunedinMin': 8.8, 'AucklandMin': 15.5})
(('country', 'USA'),)
StreamInstance(timestamp=datetime.datetime(2000, 1, 29, 0, 0, tzinfo=<UTC>), value={'ChicagoMax': 1.8, 'LosAngelesMin': 10.0, 'HoustonMax': 21.6, 'NYMax': 4.6, 'SeattleMax': 7.9, 'SeattleMin': 1.4, 'ChicagoMin': -8.1, 'NYMin': -5.6, 'HoustonMin': 7.4, 'LosAngelesMax': 19.6})
(('country', 'Australia'),)
StreamInstance(timestamp=datetime.datetime(2000, 1, 29, 0, 0, tzinfo=<UTC>), value={'BrisbaneMax': 28.2, 'MelbourneMin': 15.9, 'Melbournemax': 24.3, 'BrisbaneMin': 19.1, 'CanberraMin': 10.1, 'GoldCoastMax': 27.5, 'SydneyMax': 24.9, 'Canberramax': 24.5, 'GoldCoastMin': 20.2, 'SydneyMin': 17.7})
(('country', 'Asia'),)
StreamInstance(timestamp=dateti

In [7]:
for stream_id, stream in city_node.streams.iteritems():
    print(stream_id)
    print(stream.window().first())

(('country', 'Australia'), ('city', 'GoldCoastMax'))
StreamInstance(timestamp=datetime.datetime(2000, 1, 29, 0, 0, tzinfo=<UTC>), value=27.5)
(('country', 'Asia'), ('city', 'TokyoMax'))
StreamInstance(timestamp=datetime.datetime(2000, 1, 29, 0, 0, tzinfo=<UTC>), value=11.2)
(('country', 'Asia'), ('city', 'BangkokMin'))
StreamInstance(timestamp=datetime.datetime(2000, 1, 29, 0, 0, tzinfo=<UTC>), value=24.0)
(('country', 'Asia'), ('city', 'HongKongMin'))
StreamInstance(timestamp=datetime.datetime(2000, 1, 29, 0, 0, tzinfo=<UTC>), value=14.3)
(('country', 'Asia'), ('city', 'BangkokMax'))
StreamInstance(timestamp=datetime.datetime(2000, 1, 29, 0, 0, tzinfo=<UTC>), value=32.8)
(('country', 'Asia'), ('city', 'KualaLumpurMin'))
StreamInstance(timestamp=datetime.datetime(2000, 1, 29, 0, 0, tzinfo=<UTC>), value=23.5)
(('country', 'USA'), ('city', 'LosAngelesMax'))
StreamInstance(timestamp=datetime.datetime(2000, 1, 29, 0, 0, tzinfo=<UTC>), value=19.6)
(('country', 'Australia'), ('city', 'Canber

In [8]:
def get_x_y_names_from_streams(streams, tag):
    names = []
    y = []
    x = []
    for stream_id, stream in streams.iteritems():
        meta_data = dict(stream_id.meta_data)
        names.append(meta_data[tag])
        y.append([instance.value for instance in stream.window().items()])
        x.append([str(instance.timestamp) for instance in stream.window().items()])
    return y, x, names

data, time, names = get_x_y_names_from_streams(M.find_streams(country='Australia', name='city_temperature'), 'city')

plot_multiple_stock(data, time=time, names=names,
                    title='Temperatures in Australia', ylabel='ºC')

In [9]:
for stream_id, stream in country_node_avg_temp.streams.iteritems():
    print(stream_id)
    print(stream.window().first())

(('country', 'NZ'),)
StreamInstance(timestamp=datetime.datetime(2000, 1, 29, 0, 0, tzinfo=<UTC>), value=16.73)
(('country', 'USA'),)
StreamInstance(timestamp=datetime.datetime(2000, 1, 29, 0, 0, tzinfo=<UTC>), value=6.0600000000000005)
(('country', 'Australia'),)
StreamInstance(timestamp=datetime.datetime(2000, 1, 29, 0, 0, tzinfo=<UTC>), value=21.24)
(('country', 'Asia'),)
StreamInstance(timestamp=datetime.datetime(2000, 1, 29, 0, 0, tzinfo=<UTC>), value=18.990000000000002)
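
These averages can be checked by hand against the raw instances printed earlier. The snippet below recomputes the New Zealand value from the ten numbers in its first raw instance, using the same `mean` function as the workflow. Note that the average mixes the Min and Max columns, and that `max(len(x), 1)` makes `mean` return 0.0 for an empty list instead of raising an error.

```python
def mean(x):
    return float(sum(x)) / max(len(x), 1)


# First raw instance for NZ, copied from the output of the raw data streams.
nz_first = {'WellingtonMin': 14.2, 'ChristchurchMin': 10.8, 'HamiltonMin': 12.4,
            'DunedinMax': 18.2, 'WellingtonMax': 20.0, 'ChristchurchMax': 20.2,
            'AucklandMax': 23.4, 'HamiltonMax': 23.8, 'DunedinMin': 8.8,
            'AucklandMin': 15.5}

nz_avg = mean(list(nz_first.values()))
print(round(nz_avg, 2))
```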


In [10]:
data, time, names = get_x_y_names_from_streams(M.find_streams(name='country_avg_temp'), 'country')

plot_multiple_stock(data, time=time, names=names,
                    title='Temperatures in world', ylabel='ºC')

In [11]:
from pprint import pprint
pprint(w.to_dict(tool_long_names=False))

{'factors': [{'id': 'csv_multi_reader', 'sink': 'raw_data', 'sources': []},
             {'id': 'splitter_from_stream',
              'sink': 'city_temperature',
              'sources': ['raw_data']},
             {'id': 'aggregate',
              'sink': 'country_avg_temp',
              'sources': ['city_temperature']}],
 'nodes': [{'id': 'country_avg_temp'},
           {'id': 'raw_data'},
           {'id': 'world_avg_temp'},
           {'id': 'city_temperature'}],
 'plates': {u'C': [{'id': 'country_avg_temp', 'type': 'node'},
                   {'id': 'raw_data', 'type': 'node'},
                   {'id': 'aggregate', 'type': 'factor'}],
            u'C.C': [{'id': 'city_temperature', 'type': 'node'}]}}


In [12]:
print(w.to_json(w.factorgraph_viz, tool_long_names=False, indent=4))

{
    "nodes": [
        {
            "type": "rv",
            "id": "country_avg_temp"
        },
        {
            "type": "rv",
            "id": "raw_data"
        },
        {
            "type": "rv",
            "id": "world_avg_temp"
        },
        {
            "type": "rv",
            "id": "city_temperature"
        },
        {
            "type": "fac",
            "id": "csv_multi_reader"
        },
        {
            "type": "fac",
            "id": "splitter_from_stream"
        },
        {
            "type": "fac",
            "id": "aggregate"
        }
    ],
    "links": [
        {
            "source": "csv_multi_reader",
            "target": "raw_data"
        },
        {
            "source": "raw_data",
            "target": "splitter_from_stream"
        },
        {
            "source": "splitter_from_stream",
            "target": "city_temperature"
        },
        {
            "source": "city_temperature",
            "target": "aggre