<img style="float: right;" src="images/hyperstream.svg">

# HyperStream Tutorial 5: Workflows

Workflows define a graph of streams. Usually, the first stream will be a special "raw" stream that pulls in data from a custom data source. Workflows can have multiple time ranges, which will cause the streams to be computed on all of the ranges given.

## Introduction

In this tutorial, we will use a time-series dataset of temperatures in different countries and cities. The dataset is available at [Census at School New Zealand][1]. The necessary files for this tutorial are already included in the folder **data/TimeSeriesDatasets_130207**.

In particular, there are four files with the minimum and maximum temperatures in different cities of Asia, Australia, NZ and USA from 2000 to 2012. The dataset also includes the rainfall levels of New Zealand.

![workflows](images/workflow_world_temp.svg)

[1]: http://new.censusatschool.org.nz/resource/time-series-data-sets-2013/

In [1]:
try:
    %load_ext watermark
    watermark = True
except ImportError:
    watermark = False
    pass

import sys
sys.path.append("../") # Add parent dir in the Path

from hyperstream import HyperStream
from hyperstream import TimeInterval
from hyperstream.utils import UTC
import hyperstream

from datetime import datetime
from utils import plot_high_chart
from utils import plot_multiple_stock
from dateutil.parser import parse

if watermark:
    %watermark -v -m -p hyperstream -g

hs = HyperStream(loglevel=30)
print(hs)
print([p.channel_id_prefix for p in hs.config.plugins])

CPython 2.7.6
IPython 5.3.0

hyperstream 0.3.5

compiler   : GCC 4.8.4
system     : Linux
release    : 3.19.0-80-generic
machine    : x86_64
processor  : x86_64
CPU cores  : 4
interpreter: 64bit
Git hash   : db450f6100c7af2839f6cebae343182ec6d7a7a7
HyperStream version 0.3.5, connected to mongodb://localhost:27017/hyperstream, session id <no session>
[u'example', u'data_importers', u'data_generators']


## Reading the data

In the data folder there are four CSV files named **TempAsia.csv, TempAustralia.csv, TempNZ.csv and TempUSA.csv**. The first row of each CSV file is a header with the column names. The first column is the date, and the remaining columns are the minimum and maximum temperatures in different cities, named **cityMin** and **cityMax**.

Here is an abridged example of the header and the first data row of the **TempAsia.csv** file:

```
Date,TokyoMax,TokyoMin,BangkokMax,BangkokMin
2000M01,11.2,4.2,32.8,24
```

The date has the form **YYYYMmm**, where **YYYY** is the year and **mm** is the month. Because this format is not recognized by the default parser of the **csv_reader** tool, we need to specify our own parser that first replaces the **M** with a hyphen **-** and then applies **dateutil.parser**.
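The same conversion can be sketched with the standard library alone. This is only an illustration: the helper name `parse_year_month` is hypothetical and not part of HyperStream, the code assumes Python 3, and it pins the day to the first of the month. By contrast, `dateutil.parser.parse` takes the fields that are missing from the string from a default date (today, unless one is passed), so the day of the resulting timestamps depends on when the notebook is run.

```python
from datetime import datetime, timezone


def parse_year_month(text):
    """Parse a 'YYYYMmm' string such as '2000M01' into a UTC datetime.

    Hypothetical helper: the 'M' is matched literally by the format string,
    and the day defaults to the first of the month.
    """
    return datetime.strptime(text, "%YM%m").replace(tzinfo=timezone.utc)


print(parse_year_month("2000M01"))
```

Pinning the day makes the timestamps reproducible between runs, at the cost of differing from the timestamps recorded in this notebook.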

Then, we will use a tool to read each CSV file, and a Stream to store the results of applying the tool. When we tell the tool that the CSV file has a header row, the value of each Stream instance will be a dictionary that maps each column name to its value. For example, a Stream instance for the four columns shown above will look like:

```
[2000-01-19 00:00:00+00:00]: {'BangkokMin': 24.0, 'BangkokMax': 32.8, 'TokyoMin': 4.2, 'TokyoMax': 11.2}
```
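The way a header row turns each line into a dictionary can be sketched with Python's `csv.DictReader`. This is not the implementation of the **csv_reader** tool, only an illustration of the idea under the assumption of Python 3; the function name `read_rows` is hypothetical, and the sample text is the abridged excerpt shown earlier.

```python
import csv
import io

# Abridged sample taken from the TempAsia.csv excerpt in this tutorial
sample = (
    "Date,TokyoMax,TokyoMin,BangkokMax,BangkokMin\n"
    "2000M01,11.2,4.2,32.8,24\n"
)


def read_rows(text):
    """Return a list of (date_string, {column_name: float_value}) pairs."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        date = row.pop("Date")  # the first column is the timestamp
        rows.append((date, {name: float(cell) for name, cell in row.items()}))
    return rows


rows = read_rows(sample)
print(rows)
```

Each pair plays the role of one Stream instance: a timestamp plus a dictionary keyed by column name.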

In [2]:
def dateparser(dt):
    return parse(dt.replace('M', '-')).replace(tzinfo=UTC)

ti_all = TimeInterval(datetime(1999, 1, 1).replace(tzinfo=UTC),
                      datetime(2013, 1, 1).replace(tzinfo=UTC))
ti_sample = TimeInterval(datetime(2007, 1, 1).replace(tzinfo=UTC),
                         datetime(2007, 3, 1).replace(tzinfo=UTC))

# M will be the Memory Channel
M = hs.channel_manager.memory

countries_list = ['Asia', 'Australia', 'NZ', 'USA']

## Create the plates and meta_data instances

In [3]:
countries_dict = {
    'Asia': ['BangkokMin', 'BangkokMax', 'HongKongMax', 'HongKongMin', 'KualaLumpurMax', 'KualaLumpurMin',
             'NewDelhiMax', 'NewDelhiMin', 'TokyoMax', 'TokyoMin'], 
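    # NOTE: the CSV header (see the country stream output further down) spells two of
    # these columns 'GoldCoastMin' and 'Canberramax'
    'Australia': ['BrisbaneMax', 'BrisbaneMin', 'CanberraMax', 'CanberraMin', 'GoldCoastMax', 'GodCoastMin',
                  'MelbourneMin', 'Melbournemax',  'SydneyMax', 'SydneyMin'], 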
    'NZ': ['AucklandMax', 'AucklandMin', 'ChristchurchMax', 'ChristchurchMin', 'DunedinMax', 'DunedinMin',
           'HamiltonMax', 'HamiltonMin','WellingtonMax', 'WellingtonMin'], 
    'USA': ['ChicagoMin', 'ChicagoMax', 'HoustonMax', 'HoustonMin', 'LosAngelesMax', 'LosAngelesMin',
            'NYMax', 'NYMin', 'SeattleMax', 'SeattleMin']
}

# delete_plate requires child plates to be deleted before their parents
for plate_id in ['C.C', 'C']:
    if plate_id in hs.plate_manager.plates:
        print('Deleting plate ' + plate_id)
        hs.plate_manager.delete_plate(plate_id=plate_id, delete_meta_data=True)

for country in countries_dict:
    id_country = 'country_' + country
    if not hs.plate_manager.meta_data_manager.contains(identifier=id_country):
        hs.plate_manager.meta_data_manager.insert(
            parent='root', data=country, tag='country', identifier=id_country)
    for city in countries_dict[country]:
        id_city = id_country + '.' + 'city_' + city
        if not hs.plate_manager.meta_data_manager.contains(identifier=id_city):
            hs.plate_manager.meta_data_manager.insert(
                parent=id_country, data=city, tag='city', identifier=id_city)
            
C = hs.plate_manager.create_plate(plate_id="C", description="Countries", values=[], complement=True,
                                  parent_plate=None, meta_data_id="country")
CC = hs.plate_manager.create_plate(plate_id="C.C", description="Cities", values=[], complement=True,
                                   parent_plate="C", meta_data_id="city")
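The naming scheme used in the cell above can be sketched without HyperStream. The snippet below is only an illustration, assuming Python 3.7+ (ordered dictionaries); the functions `build_identifiers` and `plate_values` are hypothetical and do not exist in the library. It shows how the nested identifiers (`country_NZ.city_AucklandMax`) relate to the nested plate values that appear later in the stream output, such as `(('country', 'NZ'), ('city', 'AucklandMax'))`.

```python
# A reduced version of countries_dict, for illustration only
countries = {
    "NZ": ["AucklandMax", "AucklandMin"],
    "USA": ["NYMax", "NYMin"],
}


def build_identifiers(countries):
    """Return (identifier, parent, tag, data) rows, parents before children."""
    rows = []
    for country, cities in countries.items():
        id_country = "country_" + country
        rows.append((id_country, "root", "country", country))
        for city in cities:
            rows.append((id_country + ".city_" + city, id_country, "city", city))
    return rows


def plate_values(rows):
    """Pair every city entry with its parent country entry."""
    by_id = {identifier: (parent, tag, data) for identifier, parent, tag, data in rows}
    values = []
    for parent, tag, data in by_id.values():
        if tag == "city":
            _, parent_tag, parent_data = by_id[parent]
            values.append(((parent_tag, parent_data), (tag, data)))
    return values


rows = build_identifiers(countries)
for value in plate_values(rows):
    print(value)
```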

In [4]:
print(hs.plate_manager.meta_data_manager.global_plate_definitions)


root[root:None]
╟── country[country_NZ:NZ]
║   ╟── city[country_NZ.city_AucklandMax:AucklandMax]
║   ╟── city[country_NZ.city_AucklandMin:AucklandMin]
║   ╟── city[country_NZ.city_ChristchurchMax:ChristchurchMax]
║   ╟── city[country_NZ.city_ChristchurchMin:ChristchurchMin]
║   ╟── city[country_NZ.city_DunedinMax:DunedinMax]
║   ╟── city[country_NZ.city_DunedinMin:DunedinMin]
║   ╟── city[country_NZ.city_HamiltonMax:HamiltonMax]
║   ╟── city[country_NZ.city_HamiltonMin:HamiltonMin]
║   ╟── city[country_NZ.city_WellingtonMax:WellingtonMax]
║   ╙── city[country_NZ.city_WellingtonMin:WellingtonMin]
╟── country[country_Australia:Australia]
║   ╟── city[country_Australia.city_BrisbaneMax:BrisbaneMax]
║   ╟── city[country_Australia.city_BrisbaneMin:BrisbaneMin]
║   ╟── city[country_Australia.city_CanberraMax:CanberraMax]
║   ╟── city[country_Australia.city_CanberraMin:CanberraMin]
║   ╟── city[country_Australia.city_GoldCoastMax:GoldCoastMax]
║   ╟── city[country_Australia.city_GodCoastMin:

## Define the tools to use

In [5]:
csv_reader = hs.plugins.data_importers.tools.csv_multi_reader(
                filename_template='data/TimeSeriesDatasets_130207/Temp{}.csv',
                datetime_parser=dateparser, skip_rows=1)

# TODO use this tool to separate the raw values of each country into their respective city
splitter_tool = hs.tools.splitter_from_stream(element=None, use_mapping_keys_only=True)

## Create the workflow and execute it

In [9]:
with hs.create_workflow(workflow_id='tutorial_05', name='tutorial_05', owner='tutorials', 
                    description='Tutorial 5 workflow', online=False, safe=False) as w:
    country_node = w.create_node(stream_name='raw_data', channel=M, plate_ids=['C'])
    
    w.create_multi_output_factor(source=None, sink=country_node, splitting_node=None, tool=csv_reader)
    
    city_node = w.create_node(stream_name='temperature', channel=M, plate_ids=['C.C'])
    
    # TODO split the countries raw data into cities
    w.create_multi_output_factor(source=country_node, sink=city_node, splitting_node=country_node,
                                 tool=splitter_tool)
    
    w.execute(ti_all)

## See the country raw data streams

In [10]:
# Print the results
for stream in country_node.streams:
    print(stream)
    print(country_node.streams[stream].window().first())

(('country', 'NZ'),)
StreamInstance(timestamp=datetime.datetime(2000, 1, 26, 0, 0, tzinfo=<UTC>), value={'WellingtonMin': 14.2, 'ChristchurchMin': 10.8, 'HamiltonMin': 12.4, 'DunedinMax': 18.2, 'WellingtonMax': 20.0, 'ChristchurchMax': 20.2, 'AucklandMax': 23.4, 'HamiltonMax': 23.8, 'DunedinMin': 8.8, 'AucklandMin': 15.5})
(('country', 'USA'),)
StreamInstance(timestamp=datetime.datetime(2000, 1, 26, 0, 0, tzinfo=<UTC>), value={'ChicagoMax': 1.8, 'LosAngelesMin': 10.0, 'HoustonMax': 21.6, 'NYMax': 4.6, 'SeattleMax': 7.9, 'SeattleMin': 1.4, 'ChicagoMin': -8.1, 'NYMin': -5.6, 'HoustonMin': 7.4, 'LosAngelesMax': 19.6})
(('country', 'Australia'),)
StreamInstance(timestamp=datetime.datetime(2000, 1, 26, 0, 0, tzinfo=<UTC>), value={'BrisbaneMax': 28.2, 'MelbourneMin': 15.9, 'Melbournemax': 24.3, 'BrisbaneMin': 19.1, 'CanberraMin': 10.1, 'GoldCoastMax': 27.5, 'SydneyMax': 24.9, 'Canberramax': 24.5, 'GoldCoastMin': 20.2, 'SydneyMin': 17.7})
(('country', 'Asia'),)
StreamInstance(timestamp=dateti

In [12]:
for stream in city_node.streams:
    print(stream)
    print(city_node.streams[stream].window().first())

(('country', 'Australia'), ('city', 'GoldCoastMax'))
None
(('country', 'Asia'), ('city', 'TokyoMax'))
None
(('country', 'Asia'), ('city', 'BangkokMin'))
None
(('country', 'Asia'), ('city', 'HongKongMin'))
None
(('country', 'Asia'), ('city', 'BangkokMax'))
None
(('country', 'Asia'), ('city', 'KualaLumpurMin'))
None
(('country', 'NZ'), ('city', 'WellingtonMin'))
None
(('country', 'NZ'), ('city', 'AucklandMax'))
None
(('country', 'Australia'), ('city', 'CanberraMin'))
None
(('country', 'NZ'), ('city', 'DunedinMin'))
None
(('country', 'NZ'), ('city', 'ChristchurchMin'))
None
(('country', 'USA'), ('city', 'NYMin'))
None
(('country', 'USA'), ('city', 'HoustonMin'))
None
(('country', 'Australia'), ('city', 'Melbournemax'))
None
(('country', 'NZ'), ('city', 'ChristchurchMax'))
None
(('country', 'Asia'), ('city', 'KualaLumpurMax'))
None
(('country', 'Australia'), ('city', 'MelbourneMin'))
None
(('country', 'USA'), ('city', 'LosAngelesMin'))
None
(('country', 'Australia'), ('city', 'BrisbaneMin'

In [9]:
temp_tools_csv = {}
temp_streams = {}
for country in countries_list:
    temp_tools_csv[country] = hs.plugins.example.tools.csv_reader(
            'data/TimeSeriesDatasets_130207/Temp{}.csv'.format(country),
            header=True, dateparser=dateparser)
    temp_streams[country] = M.get_or_create_stream(country)
    temp_tools_csv[country].execute(sources=[], sink=temp_streams[country],
                                    interval=ti_all)

## Print one Stream Instance per Stream

Now that we have generated one Stream per country, we can inspect the first Stream Instance of each Stream.

In [10]:
for country in countries_list:
    print(temp_streams[country])
    print(temp_streams[country].window().first())

Stream(stream_id=Asia, channel_id=memory)
StreamInstance(timestamp=datetime.datetime(2000, 1, 26, 0, 0, tzinfo=<UTC>), value={'NewDelhiMax': 20.1, 'NewDelhiMin': 8.1, 'HongKongMin': 14.3, 'KualaLumpurMin': 23.5, 'TokyoMax': 11.2, 'KualaLumpurMax': 32.2, 'HongKongMax': 19.5, 'BangkokMin': 24.0, 'BangkokMax': 32.8, 'TokyoMin': 4.2})
Stream(stream_id=Australia, channel_id=memory)
StreamInstance(timestamp=datetime.datetime(2000, 1, 26, 0, 0, tzinfo=<UTC>), value={'BrisbaneMax': 28.2, 'MelbourneMin': 15.9, 'Melbournemax': 24.3, 'BrisbaneMin': 19.1, 'CanberraMin': 10.1, 'GoldCoastMax': 27.5, 'SydneyMax': 24.9, 'Canberramax': 24.5, 'GoldCoastMin': 20.2, 'SydneyMin': 17.7})
Stream(stream_id=NZ, channel_id=memory)
StreamInstance(timestamp=datetime.datetime(2000, 1, 26, 0, 0, tzinfo=<UTC>), value={'WellingtonMin': 14.2, 'ChristchurchMin': 10.8, 'HamiltonMin': 12.4, 'DunedinMax': 18.2, 'WellingtonMax': 20.0, 'ChristchurchMax': 20.2, 'AucklandMax': 23.4, 'HamiltonMax': 23.8, 'DunedinMin': 8.8, 'Au

## Visualize the temperatures in one Country

Now, we can visualize the temperatures of all the cities in one country. First, we create a list of all the cities in one of the Streams by looking at its first Stream Instance. Next, we build a list of lists containing the temperature values of each city, together with a list of the corresponding times. Finally, we pass both to the function **plot_multiple_stock** created for this tutorial.
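The reshaping step can be sketched on plain Python data. This is an illustration only, assuming Python 3: the function `to_series` is hypothetical, timestamps are plain strings, the first row reuses the Tokyo values shown earlier, and the second row is made up for the example rather than taken from the dataset.

```python
def to_series(instances):
    """Turn [(timestamp, {city: temperature}), ...] into plot-ready lists.

    Returns (times, names, data), where data[i] is the series for names[i].
    """
    names = sorted(instances[0][1])  # fix one column order for labels and data
    times = [str(timestamp) for timestamp, _ in instances]
    data = [[value[name] for _, value in instances] for name in names]
    return times, names, data


instances = [
    ("2000-01", {"TokyoMax": 11.2, "TokyoMin": 4.2}),
    ("2000-02", {"TokyoMax": 12.0, "TokyoMin": 5.0}),  # made-up values
]
times, names, data = to_series(instances)
print(names)
print(data)
```

Building the series by iterating over `names` keeps every series aligned with its label, regardless of the order in which a dictionary happens to return its keys.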

In [11]:
one_country_name = countries_list[0]

# The following code is only for the visualization
this_cities_names = [key for key, value in temp_streams[one_country_name].window().items()[0].value.iteritems()]

data = {city:[] for city in this_cities_names}
time = []
for key, values in temp_streams[one_country_name].window().items():
    time.append(str(key))
    for city, temperature in values.iteritems():
        data[city].append(temperature)
        
# Order the series to match this_cities_names, so that every series gets the right label
data = [data[city] for city in this_cities_names]
        
plot_multiple_stock(data, time=time, names=this_cities_names,
                    title='Temperatures in ' + one_country_name, ylabel='ºC')

## Create a Stream with all the city names

To split the raw streams, whose values are dictionaries that map every city to its temperature, we need a splitting tool. The **splitter_from_stream** tool is a MultiOutputTool that expects a stream holding the keys used to split the original stream, and a list of streams that act as sinks.
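The idea behind the split can be sketched in a few lines of plain Python. This is not the code of the HyperStream tool: `split_by_mapping` is a hypothetical function, timestamps are plain strings, and sinks are modelled as lists of `(timestamp, value)` pairs. The sample values are the Tokyo and Bangkok readings shown earlier.

```python
def split_by_mapping(instances, mapping):
    """Route each dictionary entry to the sink named by the mapping.

    instances: iterable of (timestamp, {key: value}) pairs (the source stream)
    mapping:   {key in the source value: name of the sink stream}
    """
    sinks = {sink_name: [] for sink_name in mapping.values()}
    for timestamp, value in instances:
        for key, sink_name in mapping.items():
            if key in value:
                sinks[sink_name].append((timestamp, value[key]))
    return sinks


source = [
    ("2000-01", {"TokyoMax": 11.2, "TokyoMin": 4.2, "BangkokMax": 32.8, "BangkokMin": 24.0}),
]
city_mapping = {"TokyoMax": "TokyoMax", "TokyoMin": "TokyoMin"}
sinks = split_by_mapping(source, city_mapping)
print(sinks)
```

Keys that are missing from the mapping (the Bangkok columns here) are simply not routed anywhere in this sketch.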

## Create the mapping names

Here, we first create a stream for the keys that we want to use for the split, together with the mapping that it will hold.

In [12]:
one_country_stream = temp_streams[one_country_name]

# A is the Assets channel, which is similar to a database channel
A = hs.channel_manager.assets
this_cities_stream = A.get_or_create_stream('cities_{}'.format(one_country_name))

mapping = {}
for city in this_cities_names:
    mapping[city] = city

print(mapping)

{'NewDelhiMax': 'NewDelhiMax', 'NewDelhiMin': 'NewDelhiMin', 'HongKongMin': 'HongKongMin', 'KualaLumpurMin': 'KualaLumpurMin', 'TokyoMax': 'TokyoMax', 'KualaLumpurMax': 'KualaLumpurMax', 'HongKongMax': 'HongKongMax', 'BangkokMin': 'BangkokMin', 'BangkokMax': 'BangkokMax', 'TokyoMin': 'TokyoMin'}


## Create the Stream with the names

Now we can write the mapping into the stream that the splitting tool will use.

In [13]:
from hyperstream import StreamInstance
from hyperstream import StreamId

A.write_to_stream(stream_id=this_cities_stream.stream_id, data=StreamInstance(ti_all.start, mapping))

print(this_cities_stream.window(TimeInterval.up_to_now()).items())

[StreamInstance(timestamp=datetime.datetime(1999, 1, 1, 0, 0, tzinfo=<bson.tz_util.FixedOffset object at 0x7fef34204a90>), value={u'NewDelhiMax': u'NewDelhiMax', u'NewDelhiMin': u'NewDelhiMin', u'HongKongMin': u'HongKongMin', u'KualaLumpurMin': u'KualaLumpurMin', u'TokyoMax': u'TokyoMax', u'KualaLumpurMax': u'KualaLumpurMax', u'HongKongMax': u'HongKongMax', u'BangkokMin': u'BangkokMin', u'BangkokMax': u'BangkokMax', u'TokyoMin': u'TokyoMin'})]


## Creating a city plate

Here we insert one meta-data entry per city, skipping the cities that already exist.

In [14]:
for city in this_cities_names:
    if not hs.plate_manager.meta_data_manager.contains(identifier='city_'+city):
        print("Adding " + city)
        hs.plate_manager.meta_data_manager.insert(parent='root', data=city,
                                                  tag='city', identifier='city_'+city)

## Creating the sink Streams

Now, we will create one Stream per city. Then, we will use all these Streams as sinks when using the splitting tool.

In [15]:
cities_plate = hs.plate_manager.create_plate(plate_id='C', meta_data_id='city', parent_plate=None, 
                                             values=[], complement=True, description='Cities')
this_country_temps = []
for city in this_cities_names:
    print("Adding " + city)
    this_country_temps.append(M.get_or_create_stream(stream_id=StreamId(name='temperature',
                                                                         meta_data=(('city', city),))))

Adding NewDelhiMax
Adding NewDelhiMin
Adding HongKongMin
Adding KualaLumpurMin
Adding TokyoMax
Adding KualaLumpurMax
Adding HongKongMax
Adding BangkokMin
Adding BangkokMax
Adding TokyoMin


It is also possible to create all the Streams by passing a list to the splitter tool **splitter_from_list**. However, this cannot be automated in a workflow.

```Python
# TODO Using this new tool, it is not necessary to create a new stream. However, if it is a Stream it could be
# automated for any other countries
splitter_tool = hs.plugins.example.tools.splitter_from_list(element=None)

# TODO try to change the parameter name of MultiOutputTool splitting_stream to splitting_parameter
# or something that does not force you to think that it is a stream
splitter_tool.execute(source=one_country_stream, splitting_stream=this_cities_list, output_plate=cities_plate, 
                      interval=ti_all, input_plate_value=None, sinks=this_country_temps)
```

**Question:** With the splitter_from_stream version we still need to create a list with the mapping, so what is the difference between the two methods?

**Answer:** The list in the Stream is allowed to change over time, which makes future Streams more robust to change (e.g. the number of houses in the SPHERE project). There is also a tool that uses a dictionary as the splitting criterion, **splitter_of_dict**, which expects the splitting stream to be None and takes a mapping parameter containing the static mapping.
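The benefit of keeping the mapping in a stream can be sketched as follows. The snippet assumes, for illustration only, that the mapping in force for an instance is the most recent one written at or before that instance's timestamp; this rule, the function names, the integer timestamps and the second data row are assumptions of the sketch, not HyperStream's documented behaviour.

```python
from bisect import bisect_right


def mapping_at(mapping_history, timestamp):
    """Return the latest mapping whose start time is <= timestamp.

    mapping_history: list of (start_time, mapping), sorted by start_time.
    """
    starts = [start for start, _ in mapping_history]
    index = bisect_right(starts, timestamp) - 1
    return mapping_history[index][1] if index >= 0 else {}


def split_with_history(instances, mapping_history):
    """Split dictionary-valued instances using a mapping that changes over time."""
    sinks = {}
    for timestamp, value in instances:
        for key, sink_name in mapping_at(mapping_history, timestamp).items():
            if key in value:
                sinks.setdefault(sink_name, []).append((timestamp, value[key]))
    return sinks


# From time 2 onwards, a second key is added to the mapping
history = [
    (0, {"TokyoMax": "TokyoMax"}),
    (2, {"TokyoMax": "TokyoMax", "TokyoMin": "TokyoMin"}),
]
instances = [
    (1, {"TokyoMax": 11.2, "TokyoMin": 4.2}),
    (3, {"TokyoMax": 12.0, "TokyoMin": 5.0}),  # made-up values
]
result = split_with_history(instances, history)
print(result)
```

With a static list or dictionary, the second key would either be split for the whole period or not at all; with a time-stamped mapping, it only appears from the moment it was added.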

In [16]:
splitter_tool = hs.tools.splitter_from_stream(element=None, use_mapping_keys_only=False)

splitter_tool.execute(source=one_country_stream, splitting_stream=this_cities_stream, output_plate_values=cities_plate, 
                      interval=ti_all, sinks=this_country_temps, meta_data_id="city")

In [17]:
one_city = this_country_temps[0]
city_name = one_city.stream_id.meta_data[0][1]

print(one_city)
print(one_city.window(ti_all))

my_time, my_data = zip(*[(str(key), value) for key, value in one_city.window(ti_all).items()])

plot_high_chart(my_time, my_data, type="high_stock", title='Temperature in {}'.format(city_name), yax='ºC')

Stream(stream_id=temperature: [city=NewDelhiMax], channel_id=memory)
StreamView(force_calculation=False, stream=Stream(stream_id=temperature: [city=NewDelhiMax], channel_id=memory), time_interval=TimeInterval(start=datetime.datetime(1999, 1, 1, 0, 0, tzinfo=<UTC>), end=datetime.datetime(2013, 1, 1, 0, 0, tzinfo=<UTC>)))
