<img style="float: right;" src="hyperstream.svg">

# HyperStream Tutorial 5: Workflows

Workflows define a graph of streams. Usually, the first stream will be a special "raw" stream that pulls in data from a custom data source. Workflows can have multiple time ranges, which will cause the streams to be computed on all of the ranges given.
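
To make the idea concrete, here is a minimal plain-Python sketch of a two-node graph evaluated on several time ranges. It is a conceptual toy only and does not use the HyperStream API: the names `RAW`, `raw_stream`, `doubled_stream` and `run_workflow` are hypothetical, and the values are placeholders.

```python
from datetime import datetime

# Hypothetical raw source: (timestamp, value) pairs with placeholder values.
RAW = [
    (datetime(2000, 1, 1), 1.0),
    (datetime(2000, 2, 1), 2.0),
    (datetime(2000, 3, 1), 3.0),
]


def raw_stream(start, end):
    """First node of the graph: pull the data of one time range from the source."""
    return [(t, v) for t, v in RAW if start <= t < end]


def doubled_stream(start, end):
    """Downstream node: computed from the raw stream on the same time range."""
    return [(t, 2 * v) for t, v in raw_stream(start, end)]


def run_workflow(final_node, time_ranges):
    """Compute the final node of the graph on every requested time range."""
    return [final_node(start, end) for start, end in time_ranges]


ranges = [(datetime(2000, 1, 1), datetime(2000, 2, 1)),
          (datetime(2000, 2, 1), datetime(2000, 4, 1))]
results = run_workflow(doubled_stream, ranges)
```

Each entry of `results` holds the downstream values for one of the given ranges, which mirrors how a workflow computes its streams on all ranges it is given.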

## Introduction

In this tutorial, we will be using a time-series dataset of temperatures in different countries and cities. The dataset is available at [The Census at School New Zealand][1]. The files needed for this tutorial are already included in the folder **data/TimeSeriesDatasets_130207**.

In particular, there are four files with the minimum and maximum temperatures in different cities of Asia, Australia, NZ and USA from 2000 to 2012.

[1]: http://new.censusatschool.org.nz/resource/time-series-data-sets-2013/

In [1]:
%load_ext watermark

import sys
sys.path.append("../") # Add parent dir in the Path

from hyperstream import HyperStream
from hyperstream import TimeInterval
from hyperstream.utils import UTC

from datetime import datetime
from utils import plot_high_chart
from utils import plot_multiple_stock
from dateutil.parser import parse

%watermark -v -m -p hyperstream -g

hs = HyperStream(loglevel=20)
print hs

CPython 2.7.6
IPython 5.3.0

hyperstream 0.3.0-beta

compiler   : GCC 4.8.4
system     : Linux
release    : 3.19.0-80-generic
machine    : x86_64
processor  : x86_64
CPU cores  : 4
interpreter: 64bit
Git hash   : f18bef4989f08b0b2ae118a600aa55d9f469791c
HyperStream version 0.3.0-beta, connected to mongodb://localhost:27017/hyperstream


## Reading the data

The data folder contains four CSV files: **TempAsia.csv, TempAustralia.csv, TempNZ.csv and TempUSA.csv**. The first row of each file is a header with the column names. The first column is the date, and the remaining columns hold the minimum and maximum temperatures of different cities, named **cityMin** and **cityMax** (for example, **TokyoMin** and **TokyoMax**).

Here is an abridged excerpt from the start of the **TempAsia.csv** file (header and first data row):

```
Date,TokyoMax,TokyoMin,BangkokMax,BangkokMin
2000M01,11.2,4.2,32.8,24
```
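
The column naming convention can be checked with the standard library alone. The sketch below reads the excerpt above with `csv.DictReader` and splits each column name into a city and a `Min`/`Max` suffix. The helpers `split_column` and `read_rows` are hypothetical names introduced for illustration, not part of HyperStream; the suffix is normalised because the Australia sample printed further down contains keys such as `Melbournemax`.

```python
import csv

# Header and first data row, copied from the TempAsia.csv excerpt above.
SAMPLE = """Date,TokyoMax,TokyoMin,BangkokMax,BangkokMin
2000M01,11.2,4.2,32.8,24"""


def split_column(name):
    """Split 'TokyoMax' into ('Tokyo', 'Max'), normalising the suffix case."""
    return name[:-3], name[-3:].capitalize()


def read_rows(lines):
    """Yield (date_string, {(city, kind): temperature}) for each data row."""
    for row in csv.DictReader(lines):
        date = row.pop('Date')
        yield date, {split_column(col): float(val) for col, val in row.items()}


rows = list(read_rows(SAMPLE.splitlines()))
```

Note that the **csv_reader** tool used in this tutorial keeps the original column names as keys; splitting them here only illustrates the convention.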

The dates have the form **YYYYMmm**, where **YYYY** is the year and **mm** is the month. Because the default parser of the **csv_reader** tool does not recognize this format, we need to supply our own parser, which first replaces the **M** with a hyphen (**-**).

In [2]:
def dateparser(dt):
    return parse(dt.replace('M', '-')).replace(tzinfo=UTC)

ti_all = TimeInterval(datetime(1999, 1, 1).replace(tzinfo=UTC),
                      datetime(2013, 1, 1).replace(tzinfo=UTC))
ti_sample = TimeInterval(datetime(2007, 1, 1).replace(tzinfo=UTC),
                         datetime(2007, 3, 1).replace(tzinfo=UTC))

# M will be the Memory Channel
M = hs.channel_manager.memory

countries = ['Asia', 'Australia', 'NZ', 'USA']
temp_tools_csv = {}
temp_streams = {}
for country in countries:
    temp_tools_csv[country] = hs.plugins.example.tools.csv_reader(
            'data/TimeSeriesDatasets_130207/Temp{}.csv'.format(country),
            header=True, dateparser=dateparser)
    temp_streams[country] = M.get_or_create_stream(country)
    temp_tools_csv[country].execute(sources=[], sink=temp_streams[country],
                                    interval=ti_all)
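
The date conversion can also be tested in isolation. As an alternative sketch, assuming only the standard library, `datetime.strptime` can match the literal `M` directly. Unlike `dateutil.parser.parse`, which fills missing fields from its `default` argument (today's date when none is given, which presumably explains the day 19 in the samples printed below), this version always yields the first day of the month. The function name `parse_year_month` is hypothetical, and the result is a naive datetime, so UTC would still need to be attached with `.replace(tzinfo=UTC)` as in the cell above.

```python
from datetime import datetime


def parse_year_month(text):
    """Parse 'YYYYMmm' (e.g. '2000M01') into a naive datetime on day 1."""
    # The literal 'M' in the format string matches the 'M' in the data.
    return datetime.strptime(text, '%YM%m')


first = parse_year_month('2000M01')   # datetime(2000, 1, 1, 0, 0)
last = parse_year_month('2012M12')    # illustrative input string
```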

## Print the first example of each Stream

In [3]:
for country in countries:
    # Print the first example of each stream
    print('\n{}: First sample'.format(country))
    key, value = temp_streams[country].window().first()
    print('[%s]: %s' % (key, value))


Asia: First sample
[2000-01-19 00:00:00+00:00]: {'NewDelhiMax': 20.1, 'NewDelhiMin': 8.1, 'HongKongMin': 14.3, 'KualaLumpurMin': 23.5, 'TokyoMax': 11.2, 'KualaLumpurMax': 32.2, 'HongKongMax': 19.5, 'BangkokMin': 24.0, 'BangkokMax': 32.8, 'TokyoMin': 4.2}

Australia: First sample
[2000-01-19 00:00:00+00:00]: {'BrisbaneMax': 28.2, 'MelbourneMin': 15.9, 'Melbournemax': 24.3, 'BrisbaneMin': 19.1, 'CanberraMin': 10.1, 'GoldCoastMax': 27.5, 'SydneyMax': 24.9, 'Canberramax': 24.5, 'GoldCoastMin': 20.2, 'SydneyMin': 17.7}

NZ: First sample
[2000-01-19 00:00:00+00:00]: {'WellingtonMin': 14.2, 'ChristchurchMin': 10.8, 'HamiltonMin': 12.4, 'DunedinMax': 18.2, 'WellingtonMax': 20.0, 'ChristchurchMax': 20.2, 'AucklandMax': 23.4, 'HamiltonMax': 23.8, 'DunedinMin': 8.8, 'AucklandMin': 15.5}

USA: First sample
[2000-01-19 00:00:00+00:00]: {'ChicagoMax': 1.8, 'LosAngelesMin': 10.0, 'HoustonMax': 21.6, 'NYMax': 4.6, 'SeattleMax': 7.9, 'SeattleMin': 1.4, 'ChicagoMin': -8.1, 'NYMin': -5.6, 'HoustonMin': 

## Visualize the temperatures in one Country

We can visualize all the temperatures in one of the country streams.

In [4]:
country = countries[0]
# City column names, taken from the first instance in the stream
this_cities_list = list(temp_streams[country].window().items()[0].value.keys())

# Collect one temperature series per city, plus the shared time axis
data = {city: [] for city in this_cities_list}
time = []
for key, values in temp_streams[country].window().items():
    time.append(str(key))
    for city, temperature in values.iteritems():
        data[city].append(temperature)

# keys() and values() of an unmodified dict are returned in matching order
names = list(data.keys())
data = list(data.values())

plot_multiple_stock(data, time=time, names=names, title='Temperatures in ' + country, ylabel='°C')
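
The reshaping done in the cell above, from one dictionary per timestamp to one series per city, can be sketched without HyperStream. The function `pivot` is a hypothetical helper; the first row uses the Tokyo values from the Asia sample, while the second row is invented purely for illustration.

```python
# Two (timestamp, values) pairs shaped like the items of a country stream.
items = [
    ('2000-01', {'TokyoMax': 11.2, 'TokyoMin': 4.2}),
    ('2000-02', {'TokyoMax': 12.0, 'TokyoMin': 5.0}),  # made-up values
]


def pivot(items):
    """Turn per-timestamp dicts into a time axis plus one series per key."""
    time = []
    series = {}
    for timestamp, values in items:
        time.append(str(timestamp))
        for name, temperature in values.items():
            series.setdefault(name, []).append(temperature)
    # Sort the names so that names and data stay aligned and reproducible.
    names = sorted(series)
    data = [series[name] for name in names]
    return time, names, data


time, names, data = pivot(items)
```

Sorting the names is a design choice of this sketch: it makes the order of the plotted series independent of dictionary ordering.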

In [5]:
from hyperstream import StreamInstance
from hyperstream import StreamId

this_stream = temp_streams[country]

# A will be the Assets Channel, which is similar to a database channel
A = hs.channel_manager.assets
this_cities_stream = A.get_or_create_stream('cities_{}'.format(country))

# Identity mapping: every column name is mapped to itself
mapping = {city: city for city in this_cities_list}

A.write_to_stream(stream_id=this_cities_stream.stream_id, data=StreamInstance(ti_all.end, mapping))

print this_cities_stream.window(TimeInterval.up_to_now()).items()

[StreamInstance(timestamp=datetime.datetime(2013, 1, 1, 0, 0, tzinfo=<bson.tz_util.FixedOffset object at 0x7f4fa803e410>), value={u'NewDelhiMax': u'NewDelhiMax', u'NewDelhiMin': u'NewDelhiMin', u'HongKongMin': u'HongKongMin', u'KualaLumpurMin': u'KualaLumpurMin', u'TokyoMax': u'TokyoMax', u'KualaLumpurMax': u'KualaLumpurMax', u'HongKongMax': u'HongKongMax', u'BangkokMin': u'BangkokMin', u'BangkokMax': u'BangkokMax', u'TokyoMin': u'TokyoMin'})]


In [6]:
splitter_tool = hs.tools.splitter_from_stream(element=None, use_mapping_keys_only=False)

for city in this_cities_list:
    if not hs.plate_manager.meta_data_manager.contains(identifier='city_'+city):
        print("Adding " + city)
        hs.plate_manager.meta_data_manager.insert(parent='root', data=city,
                                                  tag='city', identifier='city_'+city)

Adding NewDelhiMax
Adding NewDelhiMin
Adding HongKongMin
Adding KualaLumpurMin
Adding TokyoMax
Adding KualaLumpurMax
Adding HongKongMax
Adding BangkokMin
Adding BangkokMax
Adding TokyoMin


In [7]:
cities_plate = hs.plate_manager.create_plate(plate_id='C', meta_data_id='city', parent_plate=None, 
                                             values=[], complement=True, description='Cities')
this_country_temps = []
for city in this_cities_list:
    print("Adding " + city)
    this_country_temps.append(M.get_or_create_stream(stream_id=StreamId(name='temperature',
                                                                         meta_data=(('city', city),))))

Adding NewDelhiMax
Adding NewDelhiMin
Adding HongKongMin
Adding KualaLumpurMin
Adding TokyoMax
Adding KualaLumpurMax
Adding HongKongMax
Adding BangkokMin
Adding BangkokMax
Adding TokyoMin


In [8]:
splitter_tool.execute(source=this_stream, splitting_stream=this_cities_stream, output_plate=cities_plate, 
                      interval=ti_all, input_plate_value=None, sinks=this_country_temps)
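
Cells In [5] to In [8] build an identity mapping, a `city` plate and one sink stream per city, and then run `splitter_from_stream`. Conceptually, the effect appears to be that each dictionary-valued instance of the country stream is split into one scalar instance per city. The following plain-Python sketch shows that idea only: `split_by_mapping` is a hypothetical helper, not HyperStream's implementation, and plates and channels are left out.

```python
def split_by_mapping(items, mapping):
    """Split (timestamp, dict) items into one list per mapped plate value."""
    sinks = {plate_value: [] for plate_value in mapping.values()}
    for timestamp, values in items:
        for key, plate_value in mapping.items():
            if key in values:
                sinks[plate_value].append((timestamp, values[key]))
    return sinks


# One item shaped like the first Asia sample (only two of its keys shown).
items = [('2000-01-19', {'TokyoMax': 11.2, 'TokyoMin': 4.2})]
# Identity mapping, as built in cell In [5].
mapping = {'TokyoMax': 'TokyoMax', 'TokyoMin': 'TokyoMin'}

per_city = split_by_mapping(items, mapping)
```

Here `per_city['TokyoMax']` plays the role of the sink stream whose meta data is `(('city', 'TokyoMax'),)`.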

In [9]:
one_city = this_country_temps[0]
city_name = one_city.stream_id.meta_data[0][1]
# Unzip the (timestamp, value) pairs into two parallel tuples
my_time, my_data = zip(*[(str(key), value) for key, value in one_city.window(ti_all).items()])

plot_high_chart(my_time, my_data, type="high_stock", title='Temperature in {}'.format(city_name), yax='°C')