# Working with StreamSets

In this notebook we will focus on retrieving data from groups of time series streams (a `StreamSet`) as opposed to concentrating on individual streams.  You can think of a `StreamSet` as a wrapper around a list of `Stream` objects as found in the preceding notebook.

The StreamSet class has a number of methods to mirror those found in regular Stream objects including methods for transforming and serializing the data to different formats. 

As with a `Stream`, data retrieved from the BTrDB server is fully materialized in memory, so please keep this in mind.  In other words, do not attempt to retrieve more data than fits in the memory available to you.
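Before running a large query, it can help to estimate how much data it will return.  The following is a rough sketch only: the helper names are hypothetical (not part of btrdb), the sample rate of 30 points per second is an assumption, and 16 bytes per point (an 8-byte timestamp plus an 8-byte float) is a lower bound that ignores Python object overhead.

```python
NS_PER_SECOND = 10**9

def estimate_points(start_ns, end_ns, samples_per_second):
    """Approximate number of points one stream holds between two timestamps."""
    return (end_ns - start_ns) * samples_per_second // NS_PER_SECOND

def estimate_bytes(points_per_stream, stream_count, bytes_per_point=16):
    """Lower bound: one 8-byte timestamp plus one 8-byte float per point."""
    return points_per_stream * stream_count * bytes_per_point

# one hour of data from four streams at an assumed 30 samples per second
points = estimate_points(0, 3600 * NS_PER_SECOND, samples_per_second=30)
print(points, estimate_bytes(points, stream_count=4))
```

The real footprint will be larger once each point becomes a Python object, so treat the result as a floor rather than a budget.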

If you would like to learn more about any of the topics covered here, please see the btrdb library [documentation](https://btrdb.readthedocs.io/en/develop/index.html).

## Imports

In [1]:
import btrdb
from tabulate import tabulate
from btrdb.utils.timez import ns_delta

## Connect To Server

To get started we'll connect to the server and define a helper method from a previous notebook.

In [2]:
conn = btrdb.connect(apikey="AE0C013A87C48930E37ED8D8")
conn.info()

{'majorVersion': 5, 'build': '5.1.10', 'proxy': {'proxyEndpoints': []}}

In [3]:
def describe_streams(streams):
    table = [["Collection", "Name", "Units", "Version", "Earliest", "Latest"]]
    for stream in streams:
        tags = stream.tags()
        table.append([
            stream.collection, stream.name, tags["unit"], stream.version(), 
            stream.earliest()[0].time, stream.latest()[0].time, 
        ])
    return tabulate(table, headers="firstrow")

# Helper Methods

The best way to think about the `StreamSet` is as a wrapper around a list of `Stream` objects with appropriate methods added to help with examining your data. To create a `StreamSet` we first need the UUIDs for streams, which we will obtain by selecting a few voltage streams from the database.

In [4]:
streams = conn.streams_in_collection('relay/Possum Po_11-1L1', tags={"unit": "Volts"})
streams = list(streams)
print(describe_streams(streams))

Collection              Name           Units      Version             Earliest               Latest
----------------------  -------------  -------  ---------  -------------------  -------------------
relay/Possum Po_11-1L1  LINE560V1-MAG  Volts          942  1536710401000000000  1538265599000000000
relay/Possum Po_11-1L1  LINE560VA-MAG  Volts          942  1536710401000000000  1538265599000000000
relay/Possum Po_11-1L1  LINE560VB-MAG  Volts          942  1536710401000000000  1538265599000000000
relay/Possum Po_11-1L1  LINE560VC-MAG  Volts          942  1536710401000000000  1538265599000000000


Now let's create a `StreamSet` using the UUIDs from our known streams.  In the future, we will be able to query based on collection, tags, etc., but for now we need to provide the UUIDs.

In [5]:
UUIDs = [s.uuid for s in streams]
UUIDs

[UUID('1027914c-b84a-43e9-9f6e-f04f6e006dbc'),
 UUID('171facf4-fdb3-4589-9404-37fe705f46d3'),
 UUID('44c8db28-18b8-4f12-9a22-7b062d8e646c'),
 UUID('af8cf764-6f04-4f9b-b25b-467778bc5320')]

In [6]:
streamset = conn.streams(*UUIDs)
streamset

<StreamSet(4 streams)>

Now that we have a StreamSet with the four streams, let's take a look at some of the helper methods.  In a Stream object, the `earliest` method would provide the first point in the Stream.  `StreamSet.earliest` will provide a tuple containing the first points from each individual stream.  The order is the same as the UUIDs that were provided when creating the instance.

In [7]:
streamset.earliest()

(RawPoint(1536710401000000000, 302817.8),
 RawPoint(1536710401000000000, 303338.4),
 RawPoint(1536710401000000000, 303071.1),
 RawPoint(1536710401000000000, 302053.4))

Similarly, let's look at the latest points in the streams.

In [8]:
streamset.latest()

(RawPoint(1538265599000000000, 303385.6),
 RawPoint(1538265599000000000, 304196.8),
 RawPoint(1538265599000000000, 303723.8),
 RawPoint(1538265599000000000, 302253.8))

# Viewing Data

Like the `Stream` object, the `StreamSet` has a `values` method which will return a list of lists.  Each internal list contains the `RawPoint` instances for a given stream.  Just as before we will return only a little bit of the data from the beginning of the streams.

We will start by finding the earliest time from all of the earliest points although in this case they all have the same beginning time.

In [9]:
earliest_point = min(streamset.earliest(), key=lambda p: p.time)
earliest_point.time

1536710401000000000
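BTrDB timestamps such as the one above are integers counting nanoseconds since the Unix epoch.  As a minimal sketch using only the standard library, a value like this can be converted to a readable UTC datetime; the helper name `ns_to_datetime` is hypothetical and not part of btrdb.

```python
from datetime import datetime, timedelta, timezone

def ns_to_datetime(ns):
    """Convert integer nanoseconds since the Unix epoch to a UTC datetime."""
    seconds, remainder_ns = divmod(ns, 10**9)
    # datetime has microsecond resolution, so sub-microsecond digits are dropped
    return (datetime.fromtimestamp(seconds, tz=timezone.utc)
            + timedelta(microseconds=remainder_ns // 1000))

print(ns_to_datetime(1536710401000000000).isoformat())
```

Splitting with `divmod` keeps the arithmetic in integers, which avoids the rounding that dividing a 19-digit value by `1e9` as a float could introduce.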

Next we will ask for the values in the streams. Stream values are returned as a list of lists of points, ordered according to the UUIDs provided on initialization. The method fetches the data for each stream and returns it all together, so you can think of it as a helper for querying multiple streams at once.

In [11]:
start = earliest_point.time
end = start + ns_delta(milliseconds=100)
streamset.filter(start, end).values()

[[RawPoint(1536710401000000000, 302817.8),
  RawPoint(1536710401033000000, 302907.3),
  RawPoint(1536710401066000000, 302823.3)],
 [RawPoint(1536710401000000000, 303338.4),
  RawPoint(1536710401033000000, 303431.7),
  RawPoint(1536710401066000000, 303366.3)],
 [RawPoint(1536710401000000000, 303071.1),
  RawPoint(1536710401033000000, 303146.9),
  RawPoint(1536710401066000000, 303067.3)],
 [RawPoint(1536710401000000000, 302053.4),
  RawPoint(1536710401033000000, 302126.9),
  RawPoint(1536710401066000000, 302050.7)]]

You may have noticed that we first called the `filter` method and then called the `values` method with no arguments.  The `StreamSet` class was designed to support a method chaining style of programming and so behaves slightly differently from the `Stream`.

Data is materialized only when you call a method that returns it, such as `values` or `rows`, as demonstrated below.  The `rows` method is similar to the `values` method but orients the data differently: the streams are aligned by time, so each tuple holds the points from every stream for a single timestamp.  The first tuple contains all of the data for the first timestamp.  If a stream has no data at a given timestamp, then `None` is used as a placeholder.

In [12]:
streamset.filter(start, end).rows()

[(RawPoint(1536710401000000000, 302817.8),
  RawPoint(1536710401000000000, 303338.4),
  RawPoint(1536710401000000000, 303071.1),
  RawPoint(1536710401000000000, 302053.4)),
 (RawPoint(1536710401033000000, 302907.3),
  RawPoint(1536710401033000000, 303431.7),
  RawPoint(1536710401033000000, 303146.9),
  RawPoint(1536710401033000000, 302126.9)),
 (RawPoint(1536710401066000000, 302823.3),
  RawPoint(1536710401066000000, 303366.3),
  RawPoint(1536710401066000000, 303067.3),
  RawPoint(1536710401066000000, 302050.7))]
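To make the time alignment and the `None` placeholder concrete, here is a small sketch built on stand-in data.  It is not the library's implementation: `Point` and `align_rows` are hypothetical names used only for illustration, and the second stream is deliberately missing a sample.

```python
from collections import namedtuple

# Stand-in for the library's point type, for illustration only
Point = namedtuple("Point", ["time", "value"])

def align_rows(streams):
    """Align per-stream point lists into time-ordered rows, padding gaps with None."""
    # index each stream's points by timestamp
    lookups = [{p.time: p for p in points} for points in streams]
    # every timestamp seen in any stream, in ascending order
    times = sorted(set().union(*lookups))
    # dict.get returns None when a stream has no point at that timestamp
    return [tuple(lookup.get(t) for lookup in lookups) for t in times]

a = [Point(0, 1.0), Point(33, 1.1), Point(66, 1.2)]
b = [Point(0, 2.0), Point(66, 2.2)]  # no sample at time 33

for row in align_rows([a, b]):
    print(row)
```

Any code that walks rows like these needs to handle the `None` entries before reading `.time` or `.value`.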

Let's use the tabulate library again to better format the data rows.

In [13]:
table = [["time"] + [s.name for s in streamset]]

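for row in streamset.filter(start, end).rows():
    # rows() uses None for streams without a point at this time, so skip those
    time = max(p.time for p in row if p is not None)
    data = [time]
    for point in row:
        data.append(point.value if point is not None else None)
    table.append(data)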
        
print(tabulate(table, headers="firstrow"))

               time    LINE560V1-MAG    LINE560VA-MAG    LINE560VB-MAG    LINE560VC-MAG
-------------------  ---------------  ---------------  ---------------  ---------------
1536710401000000000           302818           303338           303071           302053
1536710401033000000           302907           303432           303147           302127
1536710401066000000           302823           303366           303067           302051


# Transforming Data to Other Formats

A number of methods have been provided to convert the point data objects into objects you may already be familiar with such as numpy arrays and pandas dataframes.  Using these transformation methods materializes the data similar to the `values` method.  Examples of the available transformations follow.

## Numpy Arrays

Converting to Numpy arrays will produce a list of arrays.  This output will be similar in structure to calling the `values` method.

In [14]:
start = earliest_point.time
end = start + ns_delta(milliseconds=100)

streamset.filter(start, end).to_array()

[array([RawPoint(1536710401000000000, 302817.8),
        RawPoint(1536710401033000000, 302907.3),
        RawPoint(1536710401066000000, 302823.3)], dtype=object),
 array([RawPoint(1536710401000000000, 303338.4),
        RawPoint(1536710401033000000, 303431.7),
        RawPoint(1536710401066000000, 303366.3)], dtype=object),
 array([RawPoint(1536710401000000000, 303071.1),
        RawPoint(1536710401033000000, 303146.9),
        RawPoint(1536710401066000000, 303067.3)], dtype=object),
 array([RawPoint(1536710401000000000, 302053.4),
        RawPoint(1536710401033000000, 302126.9),
        RawPoint(1536710401066000000, 302050.7)], dtype=object)]

## Pandas Series

Converting to pandas series produces a list of `Series` objects, one per stream, similar in structure to the output of the `values` method.  Each series is indexed by time.

In [15]:
streamset.filter(start, end).to_series()

[1536710401000000000    302817.8
 1536710401033000000    302907.3
 1536710401066000000    302823.3
 dtype: float64, 1536710401000000000    303338.4
 1536710401033000000    303431.7
 1536710401066000000    303366.3
 dtype: float64, 1536710401000000000    303071.1
 1536710401033000000    303146.9
 1536710401066000000    303067.3
 dtype: float64, 1536710401000000000    302053.4
 1536710401033000000    302126.9
 1536710401066000000    302050.7
 dtype: float64]

## Pandas DataFrame

Converting to a pandas dataframe will produce a tabular view of the data similar to calling the `rows` method.  The resulting dataframe will be indexed by time.

In [16]:
streamset.filter(start, end).to_dataframe()

               time  relay/Possum Po_11-1L1/LINE560V1-MAG  relay/Possum Po_11-1L1/LINE560VA-MAG  relay/Possum Po_11-1L1/LINE560VB-MAG  relay/Possum Po_11-1L1/LINE560VC-MAG
1536710401000000000                              302817.8                              303338.4                              303071.1                              302053.4
1536710401033000000                              302907.3                              303431.7                              303146.9                              302126.9
1536710401066000000                              302823.3                              303366.3                              303067.3                              302050.7


## Python Dictionaries

Converting to Python dictionaries produces a list of `OrderedDict` objects, one per timestamp, similar to calling the `rows` method.

In [18]:
streamset.filter(start, end).to_dict()

[OrderedDict([('time', 1536710401000000000),
              ('relay/Possum Po_11-1L1/LINE560V1-MAG', 302817.8),
              ('relay/Possum Po_11-1L1/LINE560VA-MAG', 303338.4),
              ('relay/Possum Po_11-1L1/LINE560VB-MAG', 303071.1),
              ('relay/Possum Po_11-1L1/LINE560VC-MAG', 302053.4)]),
 OrderedDict([('time', 1536710401033000000),
              ('relay/Possum Po_11-1L1/LINE560V1-MAG', 302907.3),
              ('relay/Possum Po_11-1L1/LINE560VA-MAG', 303431.7),
              ('relay/Possum Po_11-1L1/LINE560VB-MAG', 303146.9),
              ('relay/Possum Po_11-1L1/LINE560VC-MAG', 302126.9)]),
 OrderedDict([('time', 1536710401066000000),
              ('relay/Possum Po_11-1L1/LINE560V1-MAG', 302823.3),
              ('relay/Possum Po_11-1L1/LINE560VA-MAG', 303366.3),
              ('relay/Possum Po_11-1L1/LINE560VB-MAG', 303067.3),
              ('relay/Possum Po_11-1L1/LINE560VC-MAG', 302050.7)])]
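Because each dictionary maps a column name to a value, this format is easy to process with plain Python.  As a sketch, the following computes the mean of each stream from rows shaped like the output above, using two of the columns to keep it short.  The helper `column_means` is hypothetical, and skipping `None` values is a defensive assumption about how missing points might appear.

```python
from collections import OrderedDict
from statistics import mean

V1 = "relay/Possum Po_11-1L1/LINE560V1-MAG"
VA = "relay/Possum Po_11-1L1/LINE560VA-MAG"

# values copied from the first two columns of the output above
rows = [
    OrderedDict([("time", 1536710401000000000), (V1, 302817.8), (VA, 303338.4)]),
    OrderedDict([("time", 1536710401033000000), (V1, 302907.3), (VA, 303431.7)]),
    OrderedDict([("time", 1536710401066000000), (V1, 302823.3), (VA, 303366.3)]),
]

def column_means(rows, time_key="time"):
    """Return the mean of every non-time column, ignoring None values."""
    columns = [key for key in rows[0] if key != time_key]
    return {col: mean(row[col] for row in rows if row[col] is not None)
            for col in columns}

for name, value in column_means(rows).items():
    print(name, round(value, 1))
```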

# Serializing Data

Aside from transforming to other data objects, you can also serialize the data to other formats.  At the moment, you can serialize the data to disk as CSV or to a string for display.

## To CSV

The `to_csv` method takes a filename string or a file-like object as a mandatory argument.  If a string is received, then it will save the data to disk as a CSV file.  If a file-like object is received, then it will write to that as an output stream.

The example below uses `StringIO` to mimic a file object.  If `to_csv` is called with a regular string as a file name instead, then a new file is created and `None` is returned.  For this example, we will use only two streams for display purposes.

In [19]:
from io import StringIO

# create a new streamset with only 2 streams for better display in a notebook
streamset = conn.streams(*UUIDs[:2])

# create file-like object
fake_file = StringIO()

# call to_csv which will detect the output stream and write the data to it
streamset.filter(start, end).to_csv(fake_file)

# move back to beginning of file-like object and print contents
fake_file.seek(0)
print(fake_file.read())

time,relay/Possum Po_11-1L1/LINE560V1-MAG,relay/Possum Po_11-1L1/LINE560VA-MAG
1536710401000000000,302817.8,303338.4
1536710401033000000,302907.3,303431.7
1536710401066000000,302823.3,303366.3



Note that if you wanted to write this to a real file, it would look something like:

```python
streamset = conn.streams(*UUIDs[:2])
streamset.filter(start, end).to_csv("mydata.csv")
```
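The filename-or-file-object behavior described in this section can be sketched with the standard `csv` module.  This is an illustration of the general pattern under stated assumptions, not the library's implementation; `write_csv` and its parameters are hypothetical.

```python
import csv
from io import StringIO

def write_csv(dest, header, rows):
    """Write rows to `dest`, which may be a filename or a writable file-like object."""
    if isinstance(dest, str):
        # a string is treated as a path: open, write, and close the file ourselves
        with open(dest, "w", newline="") as f:
            write_csv(f, header, rows)
        return
    # anything else is assumed to be an output stream with a write() method
    writer = csv.writer(dest)
    writer.writerow(header)
    writer.writerows(rows)

buffer = StringIO()
write_csv(buffer, ["time", "a"], [[0, 1.5], [33, 1.6]])
print(buffer.getvalue())
```

Accepting both forms lets the same call write to disk, to an in-memory buffer for testing, or to any other open stream.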

## To Table

Similar to our example code using the tabulate library, the `to_table` method returns a string containing a formatted table of the data.

In [20]:
print(streamset.filter(start, end).to_table())

               time    relay/Possum Po_11-1L1/LINE560V1-MAG    relay/Possum Po_11-1L1/LINE560VA-MAG
-------------------  --------------------------------------  --------------------------------------
1536710401000000000                                  302818                                  303338
1536710401033000000                                  302907                                  303432
1536710401066000000                                  302823                                  303366
