<img style="float: right;" src="images/hyperstream.svg">

# HyperStream Tutorial 5: Workflows

Workflows define a graph of streams. Usually, the first stream will be a special "raw" stream that pulls in data from a custom data source. Workflows can have multiple time ranges, which will cause the streams to be computed on all of the ranges given.

## Introduction

In this tutorial, we will be using a time-series dataset about the temperature in different countries and cities. The dataset is available at [The Census at School New Zealand][1]. The necessary files for this tutorial are already included in the folder **data/TimeSeriesDatasets_130207**.

In particular, there are four files with the minimum and maximum temperatures in different cities of Asia, Australia, NZ and USA from 2000 to 2012. The folder also contains the rainfall levels of New Zealand.

![workflows](images/workflow_world_temp.svg)

[1]: http://new.censusatschool.org.nz/resource/time-series-data-sets-2013/

In [16]:
try:
    %load_ext watermark
    watermark = True
except ImportError:
    watermark = False

import sys
sys.path.append("../")  # Add the parent directory to the module search path

from hyperstream import HyperStream
from hyperstream import TimeInterval
from hyperstream.utils import UTC
import hyperstream

from datetime import datetime
from utils import plot_high_chart
from utils import plot_multiple_stock
from dateutil.parser import parse

if watermark:
    %watermark -v -m -p hyperstream -g

hs = HyperStream(loglevel=30)
print(hs)
print([p.channel_id_prefix for p in hs.config.plugins])

CPython 2.7.6
IPython 5.4.1

hyperstream 0.3.7

compiler   : GCC 4.8.4
system     : Linux
release    : 3.19.0-80-generic
machine    : x86_64
processor  : x86_64
CPU cores  : 4
interpreter: 64bit
Git hash   : 7aa6cd89aa79578758ea85b6b3c367c60217fb10
HyperStream version 0.3.7, connected to mongodb://localhost:27017/hyperstream, session id <no session>
[u'example', u'data_importers', u'data_generators']


## Reading the data

In the data folder there are four CSV files named **TempAsia.csv, TempAustralia.csv, TempNZ.csv and TempUSA.csv**. The first row of each CSV file is a header with the column names. The first column is the date, and the remaining columns hold the maximum and minimum temperatures of different cities, named **cityMax** and **cityMin**.

Here is the beginning of the **TempAsia.csv** file:

```
Date,TokyoMax,TokyoMin,BangkokMax,BangkokMin
2000M01,11.2,4.2,32.8,24
```

The dates have the form **YYYYMmm**, where **YYYY** is the year and **mm** is the month. Because this format is not recognized by the default parser of the **csv_reader** tool, we need to specify our own parser that first replaces the **M** with a hyphen **-** and then applies **dateutil.parser**.
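As a quick illustration of the conversion, the following standalone sketch parses a few date strings with the standard library only. It is not the parser used in this notebook: `parse_month` is a hypothetical helper, it returns naive datetimes (the notebook attaches UTC afterwards), and it relies on `datetime.strptime` rather than **dateutil**. With `strptime` the day is always the first of the month, whereas `dateutil.parser.parse` takes missing fields such as the day from a default date, which is today's date unless `default` is given.

```python
from datetime import datetime


def parse_month(text):
    """Parse a 'YYYYMmm' string such as '2000M01' into a naive datetime."""
    # Replace the 'M' separator with a hyphen, then parse year and month
    return datetime.strptime(text.replace('M', '-'), '%Y-%m')


for raw in ['2000M01', '2007M03', '2012M12']:
    print('{} -> {}'.format(raw, parse_month(raw).isoformat()))
```

Pinning the day this way keeps the timestamps independent of the date on which the notebook is run.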

Then, we will use a tool to read each CSV file, and a Stream to store the results of applying the tool. When we tell the tool that the CSV file has a header row, the value of each Stream instance is a dictionary that maps each column name to its value. For example, a Stream instance for the columns shown above looks like:

```
[2000-01-19 00:00:00+00:00]: {'BangkokMin': 24.0, 'BangkokMax': 32.8, 'TokyoMin': 4.2}
```
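The mapping from a header row to one dictionary per data row can be reproduced with the standard `csv` module. The sketch below is only an illustration of the idea, not the implementation of the HyperStream reader tool; `read_rows` and `SAMPLE` are hypothetical names, and the sample text is the excerpt of **TempAsia.csv** shown above.

```python
import csv

SAMPLE = """Date,TokyoMax,TokyoMin,BangkokMax,BangkokMin
2000M01,11.2,4.2,32.8,24
"""


def read_rows(text):
    """Yield (date_string, {column: float}) pairs from CSV text with a header row."""
    # DictReader uses the first line as the header and keys each row by column name
    reader = csv.DictReader(text.splitlines())
    for row in reader:
        date = row.pop('Date')
        yield date, {name: float(value) for name, value in row.items()}


rows = list(read_rows(SAMPLE))
for date, values in rows:
    print('{}: {}'.format(date, sorted(values.items())))
```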

In [17]:
def dateparser(dt):
    return parse(dt.replace('M', '-')).replace(tzinfo=UTC)

ti_all = TimeInterval(datetime(1999, 1, 1).replace(tzinfo=UTC),
                      datetime(2013, 1, 1).replace(tzinfo=UTC))
ti_sample = TimeInterval(datetime(2007, 1, 1).replace(tzinfo=UTC),
                         datetime(2007, 3, 1).replace(tzinfo=UTC))

# M will be the Memory Channel
M = hs.channel_manager.memory

countries_list = ['Asia', 'Australia', 'NZ', 'USA']

## Create the plates and meta_data instances

In [18]:
countries_dict = {
    'Asia': ['Bangkok', 'HongKong', 'KualaLumpur', 'NewDelhi', 'Tokyo'],
    'Australia': ['Brisbane', 'Canberra', 'GoldCoast', 'Melbourne',  'Sydney'],
    'NZ': ['Auckland', 'Christchurch', 'Dunedin', 'Hamilton','Wellington'],
    'USA': ['Chicago', 'Houston', 'LosAngeles', 'NY', 'Seattle']
}

# delete_plate requires children to be deleted before their parents
for plate_id in ['C.C', 'C']:
    if plate_id in [plate[0] for plate in hs.plate_manager.plates.items()]:
        hs.plate_manager.delete_plate(plate_id=plate_id, delete_meta_data=True)

for country in countries_dict:
    id_country = 'country_' + country
    if not hs.plate_manager.meta_data_manager.contains(identifier=id_country):
        hs.plate_manager.meta_data_manager.insert(
            parent='root', data=country, tag='country', identifier=id_country)
    for city in countries_dict[country]:
        id_city = id_country + '.' + 'city_' + city
        if not hs.plate_manager.meta_data_manager.contains(identifier=id_city):
            hs.plate_manager.meta_data_manager.insert(
                parent=id_country, data=city, tag='city', identifier=id_city)
            
C = hs.plate_manager.create_plate(plate_id="C", description="Countries", values=[], complement=True,
                                  parent_plate=None, meta_data_id="country")
CC = hs.plate_manager.create_plate(plate_id="C.C", description="Cities", values=[], complement=True,
                                   parent_plate="C", meta_data_id="city")
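To make the identifier scheme easier to see, the sketch below rebuilds the same parent/child identifiers in plain Python, without touching the meta data manager. `build_meta_data_rows` is a hypothetical helper introduced only for this illustration; the tuple layout mirrors the keyword arguments passed to `insert` above.

```python
def build_meta_data_rows(countries):
    """Return (parent, tag, data, identifier) tuples, parents before children."""
    rows = []
    for country in sorted(countries):
        id_country = 'country_' + country
        rows.append(('root', 'country', country, id_country))
        for city in countries[country]:
            # A city identifier is nested under its country identifier
            id_city = id_country + '.' + 'city_' + city
            rows.append((id_country, 'city', city, id_city))
    return rows


sample = {'NZ': ['Auckland', 'Wellington'], 'Asia': ['Tokyo']}
for parent, tag, data, identifier in build_meta_data_rows(sample):
    print('{:<12} {:<8} {:<11} {}'.format(parent, tag, data, identifier))
```

Because every city identifier starts with the identifier of its country, the nested plate **C.C** can find the cities that belong to each value of the plate **C**.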

In [19]:
print(hs.plate_manager.meta_data_manager.global_plate_definitions)


root[root:None]
╟── country[country_NZ:NZ]
║   ╟── city[country_NZ.city_Auckland:Auckland]
║   ╟── city[country_NZ.city_Christchurch:Christchurch]
║   ╟── city[country_NZ.city_Dunedin:Dunedin]
║   ╟── city[country_NZ.city_Hamilton:Hamilton]
║   ╙── city[country_NZ.city_Wellington:Wellington]
╟── country[country_Australia:Australia]
║   ╟── city[country_Australia.city_Brisbane:Brisbane]
║   ╟── city[country_Australia.city_Canberra:Canberra]
║   ╟── city[country_Australia.city_GoldCoast:GoldCoast]
║   ╟── city[country_Australia.city_Melbourne:Melbourne]
║   ╙── city[country_Australia.city_Sydney:Sydney]
╟── country[country_USA:USA]
║   ╟── city[country_USA.city_Chicago:Chicago]
║   ╟── city[country_USA.city_Houston:Houston]
║   ╟── city[country_USA.city_LosAngeles:LosAngeles]
║   ╟── city[country_USA.city_NY:NY]
║   ╙── city[country_USA.city_Seattle:Seattle]
╙── country[country_Asia:Asia]
    ╟── city[country_Asia.city_Bangkok:Bangkok]
    ╟── city[country_Asia.city_HongKong:HongKong]
 

## Create the workflow and execute it

In [20]:
from hyperstream import Workflow

# parameters for the csv_multi_reader tool
csv_temp_params = dict(
    filename_template='data/TimeSeriesDatasets_130207/Temp{}.csv',
    datetime_parser=dateparser, skip_rows=0, header=True)

csv_rain_params = dict(
    filename_template='data/TimeSeriesDatasets_130207/{}Rainfall.csv',
    datetime_parser=dateparser, skip_rows=0, header=True)

def mean(x):
    x = [value for value in x if value is not None]
    return float(sum(x)) / max(len(x), 1)

def dict_mean(d):
    x = d.values()
    x = [value for value in x if value is not None]
    return float(sum(x)) / max(len(x), 1)

def split_temperatures(d):
    new_d = {}
    for name, value in d.iteritems():
        key = name[-3:].lower()
        name = name[:-3]
        if name not in new_d:
            new_d[name] = {}
        new_d[name][key] = value
    return new_d

with Workflow(workflow_id='tutorial_05',
              name='tutorial_05',
              owner='tutorials',
              description='Tutorial 5 workflow',
              online=False) as w:

    country_node_raw_temp = w.create_node(stream_name='raw_temp_data', channel=M, plates=[C])
    country_node_temp = w.create_node(stream_name='temp_data', channel=M, plates=[C])
    city_node_temp = w.create_node(stream_name='city_temp', channel=M, plates=[CC])
    city_node_avg_temp = w.create_node(stream_name='city_avg_temp', channel=M, plates=[CC])
    country_node_avg_temp = w.create_node(stream_name='country_avg_temp', channel=M, plates=[C])

    country_node_raw_rain = w.create_node(stream_name='raw_rain_data', channel=M, plates=[C])
    city_node_rain = w.create_node(stream_name='city_rain', channel=M, plates=[CC])
    country_node_avg_rain = w.create_node(stream_name='country_avg_rain', channel=M, plates=[C])

    for c in C:
        country_node_raw_temp[c] = hs.plugins.data_importers.factors.csv_multi_reader(
                source=None, **csv_temp_params)
        country_node_temp[c] = hs.factors.apply(
                sources=[country_node_raw_temp[c]],
                func=split_temperatures)

        country_node_raw_rain[c] = hs.plugins.data_importers.factors.csv_multi_reader(
                source=None, **csv_rain_params)
        for cc in CC[c]:
            city_node_temp[cc] = hs.factors.splitter_from_stream(
                                    source=country_node_temp[c],
                                    splitting_node=country_node_temp[c],
                                    use_mapping_keys_only=True)
            city_node_avg_temp[cc] = hs.factors.apply(
                                    sources=[city_node_temp[cc]],
                                    func=dict_mean)

            city_node_rain[cc] = hs.factors.splitter_from_stream(
                                    source=country_node_raw_rain[c],
                                    splitting_node=country_node_raw_rain[c],
                                    use_mapping_keys_only=True)
        country_node_avg_temp[c] = hs.factors.aggregate(
                                    sources=[city_node_avg_temp],
                                    alignment_node=None,
                                    aggregation_meta_data='city', func=mean)
        country_node_avg_rain[c] = hs.factors.aggregate(
                                    sources=[city_node_rain],
                                    alignment_node=None,
                                    aggregation_meta_data='city', func=mean)
    w.execute(ti_all)
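The data flow of this workflow for a single timestamp can be followed in plain Python. The sketch below is only an illustration under simplifying assumptions: it reuses the logic of `split_temperatures` and `mean` (rewritten with `items()` so that it also runs on Python 3), takes the sample **TempAsia.csv** row shown earlier as input, and replaces the splitter and aggregate factors with ordinary dictionary operations.

```python
def mean(values):
    """Average of the values that are not None (0.0 for an empty input)."""
    values = [v for v in values if v is not None]
    return float(sum(values)) / max(len(values), 1)


def split_temperatures(row):
    """Turn {'TokyoMax': 11.2, ...} into {'Tokyo': {'max': 11.2, ...}, ...}."""
    cities = {}
    for name, value in row.items():
        city, key = name[:-3], name[-3:].lower()
        cities.setdefault(city, {})[key] = value
    return cities


# raw_temp_data -> temp_data (one dictionary per country)
raw = {'TokyoMax': 11.2, 'TokyoMin': 4.2, 'BangkokMax': 32.8, 'BangkokMin': 24.0}
temp_data = split_temperatures(raw)

# temp_data -> city_temp -> city_avg_temp (one value per city)
city_avg_temp = {city: mean(t.values()) for city, t in temp_data.items()}

# city_avg_temp -> country_avg_temp (average over all cities of the country)
country_avg_temp = mean(city_avg_temp.values())

print(temp_data)
print(city_avg_temp)
print(country_avg_temp)
```

In the real workflow each of these steps is a factor, and HyperStream repeats it for every plate value and every timestamp in the requested time interval.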


## See the country temperature and rain

In [21]:
for stream_id, stream in M.find_streams(name='temp_data').iteritems():
    print(stream_id)
    for instance in stream.window(ti_sample).items():
        print("\t{}".format(instance.value))

temp_data: [country=USA]
	{'Houston': {'max': 16.5, 'min': 5.7}, 'LosAngeles': {'max': 18.7, 'min': 7.0}, 'NY': {'max': 8.0, 'min': -0.9}, 'Seattle': {'max': 7.4, 'min': 0.2}, 'Chicago': {'max': 2.1, 'min': -6.1}}
	{'Houston': {'max': 20.4, 'min': 6.3}, 'LosAngeles': {'max': 19.3, 'min': 9.9}, 'NY': {'max': 2.9, 'min': -5.7}, 'Seattle': {'max': 10.3, 'min': 3.4}, 'Chicago': {'max': -3.3, 'min': -11.6}}
temp_data: [country=Asia]
	{'KualaLumpur': {'max': 31.8, 'min': 23.7}, 'HongKong': {'max': 19.3, 'min': 13.3}, 'Bangkok': {'max': 33.4, 'min': 23.4}, 'NewDelhi': {'max': 21.7, 'min': 7.0}, 'Tokyo': {'max': 10.9, 'min': 4.6}}
	{'KualaLumpur': {'max': 32.7, 'min': 23.3}, 'HongKong': {'max': 23.3, 'min': 17.5}, 'Bangkok': {'max': 34.0, 'min': 24.3}, 'NewDelhi': {'max': 24.1, 'min': 11.9}, 'Tokyo': {'max': 12.8, 'min': 5.0}}
temp_data: [country=Australia]
	{'Brisbane': {'max': 29.0, 'min': 20.8}, 'Melbourne': {'max': 28.0, 'min': 16.8}, 'Sydney': {'max': 28.1, 'min': 19.1}, 'GoldCoast': {'ma

In [22]:
for stream_id, stream in M.find_streams(name='city_temp').iteritems():
    print(stream_id)
    for instance in stream.window(ti_sample).items():
        print("\t{}".format(instance.value))

city_temp: [country=Australia, city=Brisbane]
	{'max': 29.0, 'min': 20.8}
	{'max': 28.2, 'min': 19.7}
city_temp: [country=Australia, city=GoldCoast]
	{'max': 30.8, 'min': 21.0}
	{'max': 29.4, 'min': 21.4}
city_temp: [country=NZ, city=Christchurch]
	{'max': 20.6, 'min': 10.5}
	{'max': 21.0, 'min': 11.5}
city_temp: [country=Australia, city=Sydney]
	{'max': 28.1, 'min': 19.1}
	{'max': 26.8, 'min': 20.1}
city_temp: [country=USA, city=Seattle]
	{'max': 7.4, 'min': 0.2}
	{'max': 10.3, 'min': 3.4}
city_temp: [country=Asia, city=Tokyo]
	{'max': 10.9, 'min': 4.6}
	{'max': 12.8, 'min': 5.0}
city_temp: [country=Asia, city=Bangkok]
	{'max': 33.4, 'min': 23.4}
	{'max': 34.0, 'min': 24.3}
city_temp: [country=USA, city=LosAngeles]
	{'max': 18.7, 'min': 7.0}
	{'max': 19.3, 'min': 9.9}
city_temp: [country=Asia, city=KualaLumpur]
	{'max': 31.8, 'min': 23.7}
	{'max': 32.7, 'min': 23.3}
city_temp: [country=NZ, city=Dunedin]
	{'max': 19.8, 'min': 9.0}
	{'max': 20.5, 'min': 9.0}
city_temp: [country=NZ, city

In [23]:
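def get_x_y_names_from_streams(streams, tag=None):
    """Collect values, timestamps and names from a dictionary of streams.

    Returns (y, x, names): per non-empty stream, the list of values, the list
    of timestamps, and the meta data value stored under `tag`.
    """
    names = []
    y = []
    x = []
    for stream_id, stream in streams.iteritems():
        # Read the window once and reuse the resulting list
        instances = stream.window().items()
        if len(instances) == 0:
            continue
        meta_data = dict(stream_id.meta_data)
        names.append(meta_data[tag])
        y.append([instance.value for instance in instances])
        x.append([str(instance.timestamp) for instance in instances])
    return y, x, names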

data, time, names = get_x_y_names_from_streams(M.find_streams(country='NZ', name='city_avg_temp'), 'city')

plot_multiple_stock(data, time=time, names=names,
                    title='Temperatures in New Zealand', ylabel='ºC')

In [24]:
data, time, names = get_x_y_names_from_streams(M.find_streams(country='Australia', name='city_avg_temp'), 'city')

plot_multiple_stock(data, time=time, names=names,
                    title='Temperatures in Australia', ylabel='ºC')

In [25]:
data, time, names = get_x_y_names_from_streams(M.find_streams(country='NZ', name='city_rain'), 'city')

plot_multiple_stock(data, time=time, names=names,
                    title='Rain in New Zealand', ylabel='some precipitation unit')

In [26]:
data, time, names = get_x_y_names_from_streams(M.find_streams(name='country_avg_temp'), 'country')

plot_multiple_stock(data, time=time, names=names,
                    title='Temperatures in countries', ylabel='ºC')

In [27]:
data, time, names = get_x_y_names_from_streams(M.find_streams(name='country_avg_rain'), 'country')

plot_multiple_stock(data, time=time, names=names,
                    title='Average rain in countries', ylabel='some precipitation unit')

In [28]:
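# The workflow above does not create a 'world_avg_temp' node, so this search
# is expected to match no streams and the loop prints nothing.
for stream_id, stream in M.find_streams(name='world_avg_temp').iteritems():
    print(stream_id)
    print([instance.value for instance in stream.window(ti_sample).items()])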

In [29]:
from pprint import pprint
pprint(w.to_dict(tool_long_names=False))

{'factors': [{'id': 'csv_multi_reader',
              'sink': 'raw_temp_data',
              'sources': []},
             {'id': 'apply',
              'sink': 'temp_data',
              'sources': ['raw_temp_data']},
             {'id': 'csv_multi_reader',
              'sink': 'raw_rain_data',
              'sources': []},
             {'id': 'splitter_from_stream',
              'sink': 'city_temp',
              'sources': ['temp_data']},
             {'id': 'apply',
              'sink': 'city_avg_temp',
              'sources': ['city_temp']},
             {'id': 'splitter_from_stream',
              'sink': 'city_rain',
              'sources': ['raw_rain_data']},
             {'id': 'aggregate',
              'sink': 'country_avg_temp',
              'sources': ['city_avg_temp']},
             {'id': 'aggregate',
              'sink': 'country_avg_rain',
              'sources': ['city_rain']}],
 'nodes': [{'id': 'city_avg_temp'},
           {'id': 'country_avg_temp'},
        

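The `factors` list printed above is effectively an edge list: each entry connects its `sources` to its `sink` through a tool. As a rough sketch, and assuming only the dictionary layout visible in that output, the hypothetical helper `execution_order` below derives an order in which the nodes can be computed. It is a plain topological sort, not HyperStream's own scheduler.

```python
factors = [
    {'id': 'csv_multi_reader', 'sink': 'raw_temp_data', 'sources': []},
    {'id': 'apply', 'sink': 'temp_data', 'sources': ['raw_temp_data']},
    {'id': 'csv_multi_reader', 'sink': 'raw_rain_data', 'sources': []},
    {'id': 'splitter_from_stream', 'sink': 'city_temp', 'sources': ['temp_data']},
    {'id': 'apply', 'sink': 'city_avg_temp', 'sources': ['city_temp']},
    {'id': 'splitter_from_stream', 'sink': 'city_rain', 'sources': ['raw_rain_data']},
    {'id': 'aggregate', 'sink': 'country_avg_temp', 'sources': ['city_avg_temp']},
    {'id': 'aggregate', 'sink': 'country_avg_rain', 'sources': ['city_rain']},
]


def execution_order(factors):
    """Return sink names ordered so that every source precedes its sink."""
    pending = list(factors)
    done = []
    while pending:
        # A factor is ready once all of its source nodes have been computed
        ready = [f for f in pending if all(s in done for s in f['sources'])]
        if not ready:
            raise ValueError('cycle or missing source in factor list')
        for factor in ready:
            done.append(factor['sink'])
            pending.remove(factor)
    return done


for name in execution_order(factors):
    print(name)
```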
In [30]:
print(w.to_json(w.factorgraph_viz, tool_long_names=False, indent=4))

{
    "nodes": [
        {
            "type": "rv",
            "id": "city_avg_temp"
        },
        {
            "type": "rv",
            "id": "country_avg_temp"
        },
        {
            "type": "rv",
            "id": "country_avg_rain"
        },
        {
            "type": "rv",
            "id": "city_rain"
        },
        {
            "type": "rv",
            "id": "raw_rain_data"
        },
        {
            "type": "rv",
            "id": "temp_data"
        },
        {
            "type": "rv",
            "id": "city_temp"
        },
        {
            "type": "rv",
            "id": "raw_temp_data"
        },
        {
            "type": "fac",
            "id": "csv_multi_reader"
        },
        {
            "type": "fac",
            "id": "apply"
        },
        {
            "type": "fac",
            "id": "csv_multi_reader"
        },
        {
            "type": "fac",
            "id": "splitter_from_stream"
        },
        