# Working with StreamSets

In this notebook we will focus on retrieving data from groups of time series streams (a `StreamSet`) as opposed to concentrating on individual streams.  You can think of a `StreamSet` as a wrapper around a list of `Stream` objects as found in the preceding notebook.

The `StreamSet` class has a number of methods that mirror those found in regular `Stream` objects, including methods for transforming and serializing the data to different formats.

As with a `Stream`, data retrieved from the BTrDB server is fully materialized in memory, so please keep this in mind.  In other words, do not attempt to retrieve more data than fits in the memory available to you.
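As a rough guide before issuing a large query, the sketch below estimates how many points a time window would return. It is not part of the btrdb library: the function names, the assumption of evenly spaced samples in a half-open window, and the figure of 16 bytes per point (an 8-byte timestamp plus an 8-byte float) are all illustrative assumptions. Real Python point objects take considerably more memory, so treat the result as a lower bound.

```python
def estimate_points(start_ns, end_ns, interval_ns):
    """Approximate count of evenly spaced points in the window [start_ns, end_ns)."""
    if end_ns <= start_ns:
        return 0
    # Integer ceiling division avoids floating point rounding on large nanosecond values.
    return -(-(end_ns - start_ns) // interval_ns)


def estimate_bytes(n_streams, points_per_stream, bytes_per_point=16):
    """Lower-bound memory estimate; bytes_per_point is an assumed figure."""
    return n_streams * points_per_stream * bytes_per_point


# Hypothetical example: one hour of three streams sampled every 8,333,333 ns.
HOUR_NS = 3_600 * 10**9
INTERVAL_NS = 8_333_333
points = estimate_points(0, HOUR_NS, INTERVAL_NS)
print(points, estimate_bytes(3, points))
```

If the estimate is close to your available memory, query a shorter window and process the data in pieces.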

If you would like to learn more about any of the topics covered here, please see the btrdb library [documentation](https://btrdb.readthedocs.io/en/develop/index.html).

**NOTE**: To get access to the Sunshine dataset to run this notebook, please register for an API key at [ni4ai.org](https://ni4ai.org/).

## Imports

In [1]:
import btrdb
import yaml  # needed to read the connection settings in the next cell
from btrdb.utils.timez import ns_delta
from tabulate import tabulate

## Connect To Server

To get started we'll connect to the server and define a helper method from a previous notebook.

In [2]:
# Make sure you add your API key to the yaml file to connect!
with open('../config.yaml', 'r') as f:
    config = yaml.safe_load(f)
    
conn = btrdb.connect(config['connection']['api_url'], config['connection']['api_key'])
conn.info()

{'majorVersion': 5, 'build': '5.7.5', 'proxy': {'proxyEndpoints': []}}

In [3]:
def describe_streams(streams):
    table = [["Collection", "Name", "Units", "Version", "Earliest", "Latest"]]
    for stream in streams:
        tags = stream.tags()
        table.append([
            stream.collection, stream.name, tags["unit"], stream.version(), 
            stream.earliest()[0].time, stream.latest()[0].time, 
        ])
    return tabulate(table, headers="firstrow")

# Helper Methods

The best way to think about the `StreamSet` is as a wrapper around a list of `Stream` objects with appropriate methods added to help with examining your data. To create a `StreamSet` we first need the UUIDs of the streams, which we will obtain by selecting a few current streams (those with units of amps) from the database.

In [4]:
streams = conn.streams_in_collection('sunshine/PMU2', tags={"unit": "amps"})
streams = list(streams)
print(describe_streams(streams))

Collection     Name    Units      Version             Earliest               Latest
-------------  ------  -------  ---------  -------------------  -------------------
sunshine/PMU2  C2MAG   amps         19018  1456790400008333333  1464738948999999960
sunshine/PMU2  C1MAG   amps         19018  1456790400008333333  1464738948999999960
sunshine/PMU2  C3MAG   amps         19018  1456790400008333333  1464738948999999960


Now let's create a `StreamSet` using the UUIDs from our known streams.  In the future, we will be able to query based on collection, tags, etc., but for now we need to provide the UUIDs.

In [5]:
UUIDs = [s.uuid for s in streams]
UUIDs

[UUID('0fd692df-6cc2-4fee-88d0-40e55cdcf58c'),
 UUID('809df629-cd9b-4baa-a4e2-374f133ba5aa'),
 UUID('250a2a6a-4504-40de-a5cd-f13661edfb9c')]

In [6]:
streamset = conn.streams(*UUIDs)
streamset

<StreamSet(3 streams)>

Now that we have a `StreamSet` with the three streams, let's take a look at some of the helper methods.  In a `Stream` object, the `earliest` method would provide the first point in the `Stream`.  `StreamSet.earliest` will provide a tuple containing the first points from each individual stream.  The order is the same as the UUIDs that were provided when creating the instance.

In [7]:
streamset.earliest()

(RawPoint(1456790400008333333, 115.00888061523438),
 RawPoint(1456790400008333333, 115.58443450927734),
 RawPoint(1456790400008333333, 104.03299713134766))

Similarly, let's look at the latest points in the streams.

In [8]:
streamset.latest()

(RawPoint(1464738948999999960, 94.16214752197266),
 RawPoint(1464738948999999960, 99.10070037841797),
 RawPoint(1464738948999999960, 89.8238525390625))
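One common use of these two tuples is finding the time range that every stream covers. The sketch below is an illustration only: `Point` is a stand-in namedtuple rather than btrdb's `RawPoint`, `common_window` is a hypothetical helper, and the values are copied from the outputs above.

```python
from collections import namedtuple

# Stand-in for btrdb's RawPoint, used only so this example runs on its own.
Point = namedtuple("Point", ["time", "value"])


def common_window(earliest_points, latest_points):
    """Return (start, end) in nanoseconds covered by every stream, or None if there is no overlap."""
    start = max(p.time for p in earliest_points)
    end = min(p.time for p in latest_points)
    return (start, end) if start <= end else None


earliest = (
    Point(1456790400008333333, 115.00888061523438),
    Point(1456790400008333333, 115.58443450927734),
    Point(1456790400008333333, 104.03299713134766),
)
latest = (
    Point(1464738948999999960, 94.16214752197266),
    Point(1464738948999999960, 99.10070037841797),
    Point(1464738948999999960, 89.8238525390625),
)

print(common_window(earliest, latest))
```

With real data you would pass `streamset.earliest()` and `streamset.latest()` directly, since the helper only relies on each point having a `time` attribute.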

# Viewing Data

Like the `Stream` object, the `StreamSet` has a `values` method which will return a list of lists.  Each internal list contains the `RawPoint` instances for a given stream.  Just as before we will return only a little bit of the data from the beginning of the streams.

We will start by finding the earliest time among all of the earliest points, although in this case they all have the same beginning time.

In [9]:
earliest_point = sorted(streamset.earliest(), key=lambda p: p.time)[0]
earliest_point.time

1456790400008333333

Next we will ask for the values in the streams. Values are returned as a list of lists of points, with the inner lists ordered according to the UUIDs provided on initialization. You can think of this method as a helper for querying multiple streams at once: data is fetched for each stream and returned together.

In [10]:
start = earliest_point.time
end = start + ns_delta(milliseconds=100)
streamset.filter(start, end).values()

[[RawPoint(1456790400008333333, 115.00888061523438),
  RawPoint(1456790400016666666, 114.9324722290039),
  RawPoint(1456790400024999999, 115.0665512084961),
  RawPoint(1456790400033333332, 115.17729949951172),
  RawPoint(1456790400041666665, 114.99266815185547),
  RawPoint(1456790400049999998, 114.45643615722656),
  RawPoint(1456790400058333331, 113.85437774658203),
  RawPoint(1456790400066666664, 113.72930145263672),
  RawPoint(1456790400074999997, 113.93565368652344),
  RawPoint(1456790400083333330, 113.95361328125),
  RawPoint(1456790400091666663, 113.69194030761719),
  RawPoint(1456790400099999996, 113.29793548583984),
  RawPoint(1456790400108333329, 113.32032012939453)],
 [RawPoint(1456790400008333333, 115.58443450927734),
  RawPoint(1456790400016666666, 115.8336410522461),
  RawPoint(1456790400024999999, 116.0658950805664),
  RawPoint(1456790400033333332, 115.89474487304688),
  RawPoint(1456790400041666665, 115.30825805664062),
  RawPoint(1456790400049999998, 114.27113342285156),

You may have noticed that we first called the `filter` method and then called the `values` method with no arguments.  The `StreamSet` class was designed to support a method chaining style of programming and so behaves slightly differently from the `Stream`.
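To make the chaining idea concrete, here is a toy class written for this notebook. It is not btrdb's `StreamSet` and makes no claim about how the library is implemented; `LazySet` and its internals are hypothetical. It only shows the pattern of recording the time bounds in `filter` and applying them when `values` is called.

```python
class LazySet:
    """Toy illustration of method chaining; not btrdb's StreamSet."""

    def __init__(self, data, start=None, end=None):
        self._data = data  # list of lists of (time, value) tuples
        self._start = start
        self._end = end

    def filter(self, start=None, end=None):
        # Return a new object that remembers the bounds; no data is read yet.
        return LazySet(self._data, start, end)

    def values(self):
        # The stored bounds are applied only when data is requested.
        def keep(t):
            after_start = self._start is None or t >= self._start
            before_end = self._end is None or t < self._end
            return after_start and before_end

        return [[(t, v) for t, v in series if keep(t)] for series in self._data]


toy = LazySet([[(1, 10.0), (2, 11.0), (3, 12.0)], [(2, 20.0), (4, 21.0)]])
print(toy.filter(2, 4).values())  # [[(2, 11.0), (3, 12.0)], [(2, 20.0)]]
```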

Data is only ever materialized when calling the `values` or `rows` method as demonstrated below.  The `rows` method is similar to the `values` method but orients the data differently.  Here you will notice that the streams are aligned according to time.  The first tuple contains all of the data for the first time index.  If any streams do not have data at that time index, then `None` is used as a placeholder.

In [11]:
streamset.filter(start, end).rows()

[(RawPoint(1456790400008333333, 115.00888061523438),
  RawPoint(1456790400008333333, 115.58443450927734),
  RawPoint(1456790400008333333, 104.03299713134766)),
 (RawPoint(1456790400016666666, 114.9324722290039),
  RawPoint(1456790400016666666, 115.8336410522461),
  RawPoint(1456790400016666666, 103.94195556640625)),
 (RawPoint(1456790400024999999, 115.0665512084961),
  RawPoint(1456790400024999999, 116.0658950805664),
  RawPoint(1456790400024999999, 103.55602264404297)),
 (RawPoint(1456790400033333332, 115.17729949951172),
  RawPoint(1456790400033333332, 115.89474487304688),
  RawPoint(1456790400033333332, 103.05726623535156)),
 (RawPoint(1456790400041666665, 114.99266815185547),
  RawPoint(1456790400041666665, 115.30825805664062),
  RawPoint(1456790400041666665, 102.64067840576172)),
 (RawPoint(1456790400049999998, 114.45643615722656),
  RawPoint(1456790400049999998, 114.27113342285156),
  RawPoint(1456790400049999998, 102.25871276855469)),
 (RawPoint(1456790400058333331, 113.85437774
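The alignment performed by `rows` can be pictured with a small stand-alone sketch. This is an illustration of the idea under stated assumptions, not the library's code: `Point` is a stand-in for `RawPoint`, and `align_rows` is a hypothetical helper that matches points on exactly equal timestamps.

```python
from collections import namedtuple

# Stand-in for btrdb's RawPoint, used only so this example runs on its own.
Point = namedtuple("Point", ["time", "value"])


def align_rows(series_list):
    """Group points by timestamp, using None where a stream has no point at that time."""
    times = sorted({p.time for series in series_list for p in series})
    lookups = [{p.time: p for p in series} for series in series_list]
    return [tuple(lookup.get(t) for lookup in lookups) for t in times]


a = [Point(1, 10.0), Point(2, 11.0), Point(3, 12.0)]
b = [Point(1, 20.0), Point(3, 22.0)]  # no point at time 2

for row in align_rows([a, b]):
    print(row)
```

The second row contains `None` in the position of the stream that has no point at time 2, which is the case any code consuming rows needs to handle.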

Let's use the tabulate library again to better format the data rows.

In [12]:
table = [["time"] + [s.name for s in streamset]]

for row in streamset.filter(start, end).rows():
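    # skip the None placeholders used when a stream has no point at this time
    time = max(p.time for p in row if p is not None)
    data = [time]
    for point in row:
        data.append(point.value if point is not None else None)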
    table.append(data)
        
print(tabulate(table, headers="firstrow"))

               time    C2MAG    C1MAG    C3MAG
-------------------  -------  -------  -------
1456790400008333333  115.009  115.584  104.033
1456790400016666666  114.932  115.834  103.942
1456790400024999999  115.067  116.066  103.556
1456790400033333332  115.177  115.895  103.057
1456790400041666665  114.993  115.308  102.641
1456790400049999998  114.456  114.271  102.259
1456790400058333331  113.854  113.308  102.027
1456790400066666664  113.729  113.053  102.049
1456790400074999997  113.936  113.111  102.063
1456790400083333330  113.954  113.043  101.942
1456790400091666663  113.692  112.812  101.871
1456790400099999996  113.298  112.506  101.874
1456790400108333329  113.32   112.664  101.973


# Transforming Data to Other Formats

A number of methods have been provided to convert the point data objects into objects you may already be familiar with such as numpy arrays and pandas dataframes.  Using these transformation methods materializes the data similar to the `values` method.  Examples of the available transformations follow.

## Numpy Arrays

Converting to Numpy arrays will produce a list of arrays.  This output will be similar in structure to calling the `values` method.

In [13]:
start = earliest_point.time
end = start + ns_delta(milliseconds=100)

streamset.filter(start, end).to_array()

[array([RawPoint(1456790400008333333, 115.00888061523438),
        RawPoint(1456790400016666666, 114.9324722290039),
        RawPoint(1456790400024999999, 115.0665512084961),
        RawPoint(1456790400033333332, 115.17729949951172),
        RawPoint(1456790400041666665, 114.99266815185547),
        RawPoint(1456790400049999998, 114.45643615722656),
        RawPoint(1456790400058333331, 113.85437774658203),
        RawPoint(1456790400066666664, 113.72930145263672),
        RawPoint(1456790400074999997, 113.93565368652344),
        RawPoint(1456790400083333330, 113.95361328125),
        RawPoint(1456790400091666663, 113.69194030761719),
        RawPoint(1456790400099999996, 113.29793548583984),
        RawPoint(1456790400108333329, 113.32032012939453)], dtype=object),
 array([RawPoint(1456790400008333333, 115.58443450927734),
        RawPoint(1456790400016666666, 115.8336410522461),
        RawPoint(1456790400024999999, 116.0658950805664),
        RawPoint(1456790400033333332, 115.89474

## Pandas Series

Converting to a pandas series will produce a view of the data similar to calling the `values` method.  The resulting series will be indexed by time.

In [14]:
streamset.filter(start, end).to_series()

[2016-03-01 00:00:00.008333333    115.008881
 2016-03-01 00:00:00.016666666    114.932472
 2016-03-01 00:00:00.024999999    115.066551
 2016-03-01 00:00:00.033333332    115.177299
 2016-03-01 00:00:00.041666665    114.992668
 2016-03-01 00:00:00.049999998    114.456436
 2016-03-01 00:00:00.058333331    113.854378
 2016-03-01 00:00:00.066666664    113.729301
 2016-03-01 00:00:00.074999997    113.935654
 2016-03-01 00:00:00.083333330    113.953613
 2016-03-01 00:00:00.091666663    113.691940
 2016-03-01 00:00:00.099999996    113.297935
 2016-03-01 00:00:00.108333329    113.320320
 Name: sunshine/PMU2/C2MAG, dtype: float64,
 2016-03-01 00:00:00.008333333    115.584435
 2016-03-01 00:00:00.016666666    115.833641
 2016-03-01 00:00:00.024999999    116.065895
 2016-03-01 00:00:00.033333332    115.894745
 2016-03-01 00:00:00.041666665    115.308258
 2016-03-01 00:00:00.049999998    114.271133
 2016-03-01 00:00:00.058333331    113.307884
 2016-03-01 00:00:00.066666664    113.053108
 2016-03-01
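The series index shows datetimes, while the raw points carry integer timestamps. The sketch below shows how the two relate, assuming the integers are nanoseconds since the Unix epoch interpreted as UTC, which is consistent with the output above. `ns_to_utc_string` is a hypothetical helper using only the standard library. Because `datetime` stores microseconds at most, the nanosecond part is formatted separately.

```python
from datetime import datetime, timezone


def ns_to_utc_string(ns):
    """Format a nanosecond epoch timestamp as a UTC string with full nanosecond precision."""
    seconds, nanos = divmod(ns, 10**9)
    stamp = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return f"{stamp:%Y-%m-%d %H:%M:%S}.{nanos:09d}"


print(ns_to_utc_string(1456790400008333333))  # 2016-03-01 00:00:00.008333333
```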

## Pandas DataFrame

Converting to a pandas dataframe will produce a tabular view of the data similar to calling the `rows` method.  The resulting dataframe will be indexed by time.

In [15]:
streamset.filter(start, end).to_dataframe()

time,sunshine/PMU2/C2MAG,sunshine/PMU2/C1MAG,sunshine/PMU2/C3MAG
1456790400008333333,115.008881,115.584435,104.032997
1456790400016666666,114.932472,115.833641,103.941956
1456790400024999999,115.066551,116.065895,103.556023
1456790400033333332,115.177299,115.894745,103.057266
1456790400041666665,114.992668,115.308258,102.640678
1456790400049999998,114.456436,114.271133,102.258713
1456790400058333331,113.854378,113.307884,102.027412
1456790400066666664,113.729301,113.053108,102.049446
1456790400074999997,113.935654,113.111473,102.063484
1456790400083333330,113.953613,113.043022,101.941925


## Python Dictionaries

Converting to Python dictionaries produces a list of `OrderedDicts` similar to calling the `rows` method. 

In [16]:
streamset.filter(start, end).to_dict()

[OrderedDict([('time', 1456790400008333333),
              ('sunshine/PMU2/C2MAG', 115.00888061523438),
              ('sunshine/PMU2/C1MAG', 115.58443450927734),
              ('sunshine/PMU2/C3MAG', 104.03299713134766)]),
 OrderedDict([('time', 1456790400016666666),
              ('sunshine/PMU2/C2MAG', 114.9324722290039),
              ('sunshine/PMU2/C1MAG', 115.8336410522461),
              ('sunshine/PMU2/C3MAG', 103.94195556640625)]),
 OrderedDict([('time', 1456790400024999999),
              ('sunshine/PMU2/C2MAG', 115.0665512084961),
              ('sunshine/PMU2/C1MAG', 116.0658950805664),
              ('sunshine/PMU2/C3MAG', 103.55602264404297)]),
 OrderedDict([('time', 1456790400033333332),
              ('sunshine/PMU2/C2MAG', 115.17729949951172),
              ('sunshine/PMU2/C1MAG', 115.89474487304688),
              ('sunshine/PMU2/C3MAG', 103.05726623535156)]),
 OrderedDict([('time', 1456790400041666665),
              ('sunshine/PMU2/C2MAG', 114.99266815185547),
    

# Serializing Data

Aside from transforming to other data objects, you can also serialize the data to other formats.  At the moment, you can serialize the data to CSV or to a string for display.


In [17]:
from io import StringIO

# create a new streamset with only 2 streams for better display in a notebook
streamset = conn.streams(*UUIDs[:2])

# create file-like object
fake_file = StringIO()

# call to_csv which will detect the output stream and write the data to it
streamset.filter(start, end).to_csv(fake_file)

# move back to beginning of file-like object and print contents
fake_file.seek(0)
print(fake_file.read())

time,sunshine/PMU2/C2MAG,sunshine/PMU2/C1MAG
1456790400008333333,115.00888061523438,115.58443450927734
1456790400016666666,114.9324722290039,115.8336410522461
1456790400024999999,115.0665512084961,116.0658950805664
1456790400033333332,115.17729949951172,115.89474487304688
1456790400041666665,114.99266815185547,115.30825805664062
1456790400049999998,114.45643615722656,114.27113342285156
1456790400058333331,113.85437774658203,113.3078842163086
1456790400066666664,113.72930145263672,113.05310821533203
1456790400074999997,113.93565368652344,113.1114730834961
1456790400083333330,113.95361328125,113.04302215576172
1456790400091666663,113.69194030761719,112.81172180175781
1456790400099999996,113.29793548583984,112.50586700439453
1456790400108333329,113.32032012939453,112.6644287109375



Note that if you wanted to write this to a real file, it would look something like:

```python
streamset = conn.streams(*UUIDs[:2])
streamset.filter(start, end).to_csv("mydata.csv")
```
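If you later need to read such a file back without pandas, the standard library's `csv` module is enough. The sketch below is a minimal example, assuming the layout shown in the output above (a `time` column followed by one column per stream). `parse_streamset_csv` is a hypothetical helper, and treating an empty cell as a missing value is an assumption rather than documented behavior.

```python
import csv
from io import StringIO

# Two rows copied from the CSV output above.
csv_text = """time,sunshine/PMU2/C2MAG,sunshine/PMU2/C1MAG
1456790400008333333,115.00888061523438,115.58443450927734
1456790400016666666,114.9324722290039,115.8336410522461
"""


def parse_streamset_csv(text):
    """Parse CSV text into a list of dicts with integer times and float values."""
    rows = []
    for record in csv.DictReader(StringIO(text)):
        # Keep time as an int: a float cannot hold nanosecond epoch values exactly.
        row = {"time": int(record["time"])}
        for name, raw in record.items():
            if name != "time":
                # Assumption: an empty cell means the stream had no point at this time.
                row[name] = float(raw) if raw else None
        rows.append(row)
    return rows


for row in parse_streamset_csv(csv_text):
    print(row)
```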

## To Table

Similar to our example code using the tabulate library, this will return a string containing a formatted table of the data.  

In [18]:
print(streamset.filter(start, end).to_table())