# Working with Streams

In this notebook we focus on retrieving data from individual time series streams in BTrDB.  Along the way we will look at the `Point` classes that represent data within a `Stream` instance.  We start with basic metadata and then move on to the stored values, retrieving both raw points and aggregated windows of data.

If you would like to learn more about any of the topics covered here, please see the btrdb library [documentation](https://btrdb.readthedocs.io/en/develop/index.html).

## Imports

In [1]:
import btrdb
from tabulate import tabulate
from pprint import pprint

## Connect To Server

In [2]:
conn = btrdb.connect(apikey="AE0C013A87C48930E37ED8D8")
conn.info()

{'majorVersion': 5, 'build': '5.1.10', 'proxy': {'proxyEndpoints': []}}

# Stream Basics

Streams contain actual points of data and you can query the time series values in a number of ways.  But first, let's look at the Stream metadata for a better understanding of what is available.

## Metadata Properties

As discussed in the previous notebook, a stream has its own metadata properties such as its `collection` or `uuid`.  Let's start by defining each read-only property.

<dl>
    <dt>collection</dt>
    <dd>The hierarchical path where a stream lives.  This is used purely for organizational purposes and has no impact on the underlying data.</dd>
    <dt>name</dt> 
    <dd>A friendly name for the stream.</dd>
    <dt>uuid</dt> 
    <dd>A unique identifier for a given stream.  This UUID cannot be changed and can be used to fetch a stream directly using the connection object.</dd>
</dl>

In [3]:
streams = list(conn.streams_in_collection('relay/Possum Po_11-1L1', tags={"unit": "Volts"}))
for stream in streams:
    print("collection:{}, name:{}, uuid:{}".format(stream.collection, stream.name, stream.uuid))


collection:relay/Possum Po_11-1L1, name:LINE560V1-MAG, uuid:1027914c-b84a-43e9-9f6e-f04f6e006dbc
collection:relay/Possum Po_11-1L1, name:LINE560VA-MAG, uuid:171facf4-fdb3-4589-9404-37fe705f46d3
collection:relay/Possum Po_11-1L1, name:LINE560VB-MAG, uuid:44c8db28-18b8-4f12-9a22-7b062d8e646c
collection:relay/Possum Po_11-1L1, name:LINE560VC-MAG, uuid:af8cf764-6f04-4f9b-b25b-467778bc5320


## Metadata Methods

In addition to the metadata properties above, each stream has a set of metadata methods.  This information is provided as instance methods to indicate that they typically require a round-trip to the server and may change often.

<dl>
    <dt>tags</dt>
    <dd>A dictionary containing metadata used internally by the BTrDB server.  The stream's name is stored here; the name property is provided only for convenience.</dd>
    <dt>annotations</dt> 
    <dd>A dictionary of custom data attached to a stream.  For example, an annotation could record a point-of-contact for the data or the make/model of the sensor device.  Calling the annotations method also returns a separate property version number.  Like the stream version, this number increases monotonically, but it is updated only when the metadata changes.</dd>
</dl>

Let's look at some sample tags and annotations below.

In [4]:
stream = streams[0]

print("tags")
pprint(stream.tags())

annotations, property_version = stream.annotations()
print("\nannotations")
pprint(annotations)

print("\nproperty_version")
print(property_version)

tags
{'ingress': '', 'name': 'LINE560V1-MAG', 'unit': 'Volts'}

annotations
{'description': 'Possum Po_11-1L1 LINE560V1  + Voltage Magnitude',
 'phase': '+',
 'point_tag': 'C!DOM_POSSUMPO_11-1L1-PM1:V',
 'unit': 'Volts'}

property_version
1


## Metadata Example

As with collections, we can write a simple helper function to display information about a stream.

In [5]:
def stream_detail(stream):
    """
    Prints detailed information about a stream.
    """
    table = [["Attribute", "Value"]]    
    table.append(["collection", stream.collection])
    table.append(["version", stream.version()])

    for k,v in stream.tags().items():
        table.append(["tag/{}".format(k), v])
                
    for k,v in stream.annotations()[0].items():
        table.append(["annotation/{}".format(k), v])
    
    return tabulate(table, headers="firstrow")

print(stream_detail(stream))

Attribute               Value
----------------------  -----------------------------------------------
collection              relay/Possum Po_11-1L1
version                 942
tag/ingress
tag/name                LINE560V1-MAG
tag/unit                Volts
annotation/unit         Volts
annotation/phase        +
annotation/point_tag    C!DOM_POSSUMPO_11-1L1-PM1:V
annotation/description  Possum Po_11-1L1 LINE560V1  + Voltage Magnitude


## Stream Version

Remember that the version of a stream changes whenever data is modified (though not when metadata is modified).  This is especially useful with live data because you can pin all of your queries to a specific version of your data.  Otherwise, your statistics could keep changing as new data is appended or older data arrives out of order.

Think of the version as a snapshot of your data in time.  Most of the calls we use on this page allow for an optional `version` argument which we have omitted. _The default value of zero is used which indicates the current (or latest) version_.
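
As a minimal sketch of pinning, the snippet below looks up the version once and passes it to every later query.  It assumes `values` accepts the optional `version` keyword described above.  `FakeStream` and `read_pinned` are hypothetical names used only so the example runs without a server; they are not part of the btrdb library.

```python
class FakeStream:
    """Hypothetical stand-in for a btrdb Stream so the sketch runs offline."""

    def __init__(self):
        self.requested_versions = []

    def version(self):
        return 942  # the version shown in the table above

    def values(self, start, end, version=0):
        # Record which version each query asked for.
        self.requested_versions.append(version)
        return []  # a real stream would yield (RawPoint, version) tuples


def read_pinned(stream, start, end):
    """Run two queries against the same snapshot of the stream."""
    pinned = stream.version()  # look up the current version once
    first = list(stream.values(start, end, version=pinned))
    second = list(stream.values(start, end, version=pinned))
    return pinned, first, second


fake = FakeStream()
pinned, first, second = read_pinned(fake, 0, 10)
print(pinned, fake.requested_versions)
```

Both queries see the same snapshot even if new points arrive between them, which is the behaviour the paragraph above describes.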

# Viewing Data

Most likely you are interested in examining the stored values in each time series stream.  This is fairly straightforward, but note that each value is returned as a `RawPoint` object, which has `time` and `value` attributes.

Because a stream may contain billions of values, data queries should provide a start and end time to bound the amount of data retrieved.  All times are stored internally in nanoseconds with the zero value being the [Unix epoch](https://en.wikipedia.org/wiki/Unix_time). Luckily, the btrdb library comes with many convenience functions to help with converting between nanoseconds and other datetime formats.

Let's start with a simple example to view data in our `LINE560V1-MAG` stream. We will call the stream's `values` method which returns a sequence.  Each item in the sequence is a tuple containing a `RawPoint` and the version of the stream at retrieval time.  However, before we call the `values` method we will need to determine the start time for our call.

In [6]:
# retrieve the first point of data and the current stream version
earliest, version = stream.earliest()
print(earliest, version)

RawPoint(1536710401000000000, 302817.8) 942


We can see above that the first data point in the stream has a time of `1536710401000000000` nanoseconds.  Let's convert that to a datetime so we can understand it better.

In [7]:
from btrdb.utils.timez import ns_to_datetime
ns_to_datetime(earliest.time)

datetime.datetime(2018, 9, 12, 0, 0, 1, tzinfo=<UTC>)
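
For readers curious about what such a conversion involves, here is a standard-library sketch.  It is not the btrdb implementation; `ns_to_utc` and `utc_to_ns` are hypothetical helper names.  Note that `datetime` resolves only microseconds, so sub-microsecond digits are dropped on the way in.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ns_to_utc(ns):
    """Convert integer nanoseconds since the Unix epoch to an aware UTC datetime.

    datetime only resolves microseconds, so the last three digits are dropped.
    """
    return EPOCH + timedelta(microseconds=ns // 1_000)


def utc_to_ns(dt):
    """Convert an aware datetime back to integer nanoseconds since the epoch."""
    delta = dt - EPOCH
    whole_seconds = delta.days * 86_400 + delta.seconds
    return whole_seconds * 1_000_000_000 + delta.microseconds * 1_000


first = ns_to_utc(1536710401000000000)
print(first.isoformat())
```

Keeping the arithmetic in integers avoids the rounding that `datetime.timestamp()` would introduce through its float return value.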

Now, let's call `values` using our start time and ending 200 milliseconds later.

In [8]:
start = earliest.time
end = earliest.time + 200_000_000   # 200 ms in nanoseconds (an int, not the float 2e8)

for point, version in stream.values(start, end):
    print(point)

RawPoint(1536710401000000000, 302817.8)
RawPoint(1536710401033000000, 302907.3)
RawPoint(1536710401066000000, 302823.3)
RawPoint(1536710401100000000, 302831.9)
RawPoint(1536710401133000000, 302749.7)
RawPoint(1536710401166000000, 302692.6)
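
A caution when computing time bounds: nanosecond timestamps are around $1.5 \times 10^{18}$, which is beyond the range where a 64-bit float can represent every integer.  Adding a float literal such as `2e8` converts the timestamp to a float first, so low-order nanoseconds can be rounded away.  The sketch below uses a made-up timestamp ending in 1 ns to show the effect.

```python
t = 1536710401000000001          # hypothetical timestamp ending in 1 ns
exact = t + 200_000_000          # integer arithmetic keeps every nanosecond
via_float = int(t + 2e8)         # t is converted to a 64-bit float first

print(exact - via_float)         # nanoseconds lost to rounding
```

Timestamps that happen to be multiples of 256 ns survive the round trip at this magnitude, which is why the problem is easy to miss.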


We can also access the `time` and `value` attributes of each point as follows:

In [11]:
for point, version in stream.values(start, end):
    print(point.time, point.value)

1536710401000000000 302817.8
1536710401033000000 302907.3
1536710401066000000 302823.3
1536710401100000000 302831.9
1536710401133000000 302749.7
1536710401166000000 302692.6


# Windowed Data

Aside from viewing the raw values, we can also query windows of data.  Windowed queries return `StatPoint` values that contain aggregations of the data in each window rather than individual `RawPoint` values.  Each `StatPoint` contains attributes for the minimum, maximum, mean, and standard deviation of the values along with a count of total points and the start time of the window.  You can find more information about windowing data in the [docs](https://btrdb.readthedocs.io/en/develop/working/stream-view-data.html#view-windows-of-data) or view the [API](https://btrdb.readthedocs.io/en/develop/api/streams.html) reference.

This time we will use another convenience function, `ns_delta`, which expresses durations in nanoseconds.  We use it to define a one-second query range and to request aggregates of the data at 100 millisecond intervals.

In [13]:
from btrdb.utils.timez import ns_delta

start = earliest.time
end = earliest.time + ns_delta(seconds=1)   # 1 second later
width = ns_delta(milliseconds=100)          # 100 milliseconds per window

for point, version in stream.windows(start, end, width):
    print(point)

StatPoint(1536710401000000000, 302817.8, 302849.4666666666, 302907.3, 3, 40.95593871486314)
StatPoint(1536710401100000000, 302692.6, 302758.0666666667, 302831.9, 3, 57.175888044396515)
StatPoint(1536710401200000000, 302687.4, 302693.3333333334, 302704.4, 3, 7.832127038587137)
StatPoint(1536710401300000000, 302645.0, 302689.73333333334, 302715.3, 3, 31.73897010470396)
StatPoint(1536710401400000000, 302703.7, 302710.4666666666, 302722.9, 3, 8.803157679134115)
StatPoint(1536710401500000000, 302674.2, 302687.7, 302695.4, 3, 9.577404366412981)
StatPoint(1536710401600000000, 302674.6, 302679.7, 302687.5, 3, 5.601783141575044)
StatPoint(1536710401700000000, 302667.3, 302679.4666666667, 302691.7, 3, 9.961369986526128)
StatPoint(1536710401800000000, 302705.0, 302713.7, 302720.1, 3, 6.3754723068504235)
StatPoint(1536710401900000000, 302667.8, 302697.49999999994, 302721.1, 3, 22.183027305565243)
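
To make the aggregation concrete, the following sketch recomputes the first two windows in plain Python from the six raw points printed earlier.  It illustrates only the semantics of a window, not how the server computes it; `window_stats` is a hypothetical helper.  For these two windows the results agree with the population standard deviation.

```python
from statistics import fmean, pstdev

# (time_ns, value) pairs copied from the raw query output earlier on this page
raw = [
    (1536710401000000000, 302817.8),
    (1536710401033000000, 302907.3),
    (1536710401066000000, 302823.3),
    (1536710401100000000, 302831.9),
    (1536710401133000000, 302749.7),
    (1536710401166000000, 302692.6),
]


def window_stats(points, start, width):
    """Group points into fixed-width windows and summarize each one."""
    buckets = {}
    for t, v in points:
        index = (t - start) // width           # which window the point falls in
        buckets.setdefault(index, []).append(v)
    rows = []
    for index in sorted(buckets):
        vals = buckets[index]
        rows.append({
            "time": start + index * width,     # window start time
            "min": min(vals),
            "mean": fmean(vals),
            "max": max(vals),
            "count": len(vals),
            "stddev": pstdev(vals),            # population standard deviation
        })
    return rows


rows = window_stats(raw, start=1536710401000000000, width=100_000_000)
for row in rows:
    print(row)
```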


Let's create a convenience function to view all of this data in a table using tabulate again.

In [14]:
def stat_table(points):
    attrs = ["time", "min", "max", "mean", "stddev", "count"]
    table = [attrs]
    for p in points:
        table.append([getattr(p, attr) for attr in attrs])
    return tabulate(table, headers="firstrow")

points = [point for point, _ in stream.windows(start, end, width)]
print(stat_table(points))

               time     min     max    mean    stddev    count
-------------------  ------  ------  ------  --------  -------
1536710401000000000  302818  302907  302849  40.9559         3
1536710401100000000  302693  302832  302758  57.1759         3
1536710401200000000  302687  302704  302693   7.83213        3
1536710401300000000  302645  302715  302690  31.739          3
1536710401400000000  302704  302723  302710   8.80316        3
1536710401500000000  302674  302695  302688   9.5774         3
1536710401600000000  302675  302688  302680   5.60178        3
1536710401700000000  302667  302692  302679   9.96137        3
1536710401800000000  302705  302720  302714   6.37547        3
1536710401900000000  302668  302721  302697  22.183          3


While we haven't used it in our examples, you can also optionally supply a `depth` argument to the `windows` query.  The depth is the precision of the data in each window, expressed as a power of two in nanoseconds, and it can greatly speed up your queries if you don't need to be precise down to the nanosecond level.  As an example, supplying `depth=30` would return results accurate to $2^{30}$ nanoseconds (roughly one second) and take advantage of the BTrDB database structure to improve retrieval performance.  Please see the docs for more information.
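
Because these exponents are hard to read at a glance, a small conversion helper can help when choosing one.  `exponent_to_seconds` is a hypothetical name, and the sketch shows only the arithmetic.

```python
def exponent_to_seconds(exponent):
    """Duration of 2**exponent nanoseconds, expressed in seconds."""
    return (1 << exponent) / 1e9


for exponent in (20, 30, 40):
    print(exponent, exponent_to_seconds(exponent))
```

An exponent of 20 is about a millisecond, 30 is about a second, and 40 is a little over 18 minutes.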

### Aligned Windows

The aligned windows method is an alternative aggregate query that more directly exploits the underlying database structure for the best possible performance.  This method, `aligned_windows`, takes a `pointwidth` argument that defines the window width as a power of two.  As with the `windows` method, each point returned is a statistical aggregate of all the raw data within a window, but each window is $2^{pointwidth}$ nanoseconds wide.

Note that when bounding the query with a time range, `start` is inclusive but `end` is exclusive.  That is, results are returned for all windows that start in the interval $[start, end)$.  If $end < start+2^{pointwidth}$ you will not get any results.  If `start` and `end` are not multiples of $2^{pointwidth}$, their bottom `pointwidth` bits are cleared, which rounds each down to a window boundary.  Windows that contain no points (`count=0`) are omitted.

In [20]:
pointwidth = 27    # 2^27 nanoseconds wide
points = [point for point, _ in stream.aligned_windows(start, end, pointwidth)]
print(stat_table(points))

               time     min     max    mean    stddev    count
-------------------  ------  ------  ------  --------  -------
1536710400895090688  302818  302818  302818   0              1
1536710401029308416  302750  302907  302828  55.8047         4
1536710401163526144  302687  302704  302693   6.79025        4
1536710401297743872  302645  302723  302698  31.0125         4
1536710401431961600  302674  302705  302694  10.9952         5
1536710401566179328  302667  302688  302677   7.23637        4
1536710401700397056  302679  302720  302699  15.1546         4
1536710401834614784  302668  302721  302702  20.8143         4
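
The window start times in the table above can be reproduced with integer bit operations.  The sketch below assumes both query bounds are rounded down by clearing their bottom `pointwidth` bits, as described earlier; `align_down` and `aligned_starts` are hypothetical helpers, not btrdb functions.

```python
def align_down(t, pointwidth):
    """Clear the bottom `pointwidth` bits of a nanosecond timestamp."""
    return (t >> pointwidth) << pointwidth


def aligned_starts(start, end, pointwidth):
    """Start times of the aligned windows between the rounded-down bounds."""
    step = 1 << pointwidth
    return list(range(align_down(start, pointwidth),
                      align_down(end, pointwidth),
                      step))


query_start = 1536710401000000000          # earliest point in the stream
query_end = query_start + 1_000_000_000    # one second later
starts = aligned_starts(query_start, query_end, 27)
print(len(starts), starts[0], starts[-1])
```

The first window begins before the requested start time, which is why its row holds only the single point that falls inside the query range.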


This can be really useful if you don't particularly care where the start/end of your windows are, or if you just want a high-level statistical view.

As an example, the code below uses a pointwidth of 52, which gives windows of $2^{52}$ nanoseconds (roughly 52 days).  For this stream that is wide enough to return all of the data in a single window, so you can easily view statistics for the entire dataset.  We will use the `currently_as_ns` function to get the current datetime as nanoseconds from epoch to use as our end time.

In [26]:
from btrdb.utils.timez import currently_as_ns

pointwidth = 52
points = [point for point, _ in stream.aligned_windows(start, currently_as_ns(), pointwidth)]
print(stat_table(points))

               time    min     max    mean    stddev     count
-------------------  -----  ------  ------  --------  --------
1535727472933339136      0  310785  302661   1277.11  46571133
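
The query above uses a fixed pointwidth and relies on empty windows being omitted.  A different approach, sketched here, is to compute the smallest pointwidth whose single aligned window contains a whole range.  `covering_pointwidth` is a hypothetical helper, and whether the server accepts every pointwidth it may return is an assumption not verified here.

```python
def covering_pointwidth(start, end):
    """Smallest pointwidth whose aligned window contains all of [start, end).

    Requires end > start.
    """
    # The highest bit where start and the last included nanosecond differ
    # decides how many low bits must be cleared for them to share a window.
    return (start ^ (end - 1)).bit_length()


query_start = 1536710401000000000
query_end = query_start + 1_000_000_000   # the one-second range used earlier
pw = covering_pointwidth(query_start, query_end)
print(pw, (query_start >> pw) << pw)
```

Shifting both bounds right by the returned pointwidth gives the same window index, while shifting by one bit less does not.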
