# PredictiveGrid in Python
## A Quick Start Notebook

This notebook is a quick start to working with data using PredictiveGrid's Python API. It illustrates the basic, key functionality of the API to give you a running start working with data in the platform. 

The very first step is to import all the required packages. Below is a list of basic imports that you can copy and paste into your own notebooks to get going.
The external Python libraries (```matplotlib```, ```numpy```, etc.) have extensive documentation that you should consult if you want to explore their functionality or just get unstuck. 
The ```%matplotlib inline``` command ensures that plots will be rendered in the notebook itself, rather than in a pop-up window. 

In [None]:
# PredictiveGrid imports
import btrdb # Platform Python bindings
from btrdb.utils import timez # helpful sub-package for handling time
# from btrdb.stream import StreamSet # subpackage with light wrapper for working with multiple streams
from btrdb.utils.general import pointwidth

# support the new v6 api, currently leverage the util module from btrdb v5
import btrdbv6
from btrdbv6 import StreamSet

# External Python libraries
import numpy as np # scientific computing package
import matplotlib.pyplot as plt # plotting package
import pandas as pd # data analysis library
from tabulate import tabulate # creating & printing neat tables
from datetime import datetime, timedelta # for working with time

%matplotlib inline

Next, we must connect to the database. The ```connect``` function optionally accepts endpoint and apikey string arguments, which can be passed like:

```btrdb.connect(endpoint_string, apikey=key_string) ```

When running a notebook on JupyterHub, these do not need to be passed. 

In [None]:
# TODO: add in info command
conn = btrdbv6.connect(profile="debbie")
conn
#conn.info()

## Find Collections
The streams in the database are organized into collections, which can be thought of as hierarchical paths such as ```CALIFORNIA/SanFrancisco/91405/Sensor1``` but are internally just strings. A single collection, defined by a path-like name, can contain any number of individual streams. It is best to name collections and organize streams into them in a logically consistent and descriptive way to facilitate searching (think about creating a file hierarchy to organize data files, and mirror this in the collection name). 

Let us query all collections in this cluster and print a few of them. We print each collection name level by level to reflect the implicit hierarchical data organization it encodes. 

In [None]:
# todo list available collections in db
collections = conn.list_collections()
print(f'Found {len(collections)} collections\n')

collections.sort()

# We limit the number of collections so we don't print too many
for ith_collection, collection_name in enumerate(collections[0:3]):
    levels = collection_name.split('/')
    for ith_level, level in enumerate(levels):
        print(ith_level*' ','->', level)
    print("========================================")
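
To make the path-like structure concrete, here is a minimal sketch in plain Python that groups collection names into a nested dictionary, one level per path segment. It does not use the platform API; `build_tree` and `render` are hypothetical helpers, and every name except the first is invented for illustration.

```python
def build_tree(collection_names):
    """Group path-like collection names into a nested dict, one level per segment."""
    tree = {}
    for name in collection_names:
        node = tree
        for level in name.split("/"):
            # Descend into the child for this segment, creating it if needed
            node = node.setdefault(level, {})
    return tree


def render(tree, depth=0):
    """Return the tree as a list of indented lines, sorted within each level."""
    lines = []
    for level in sorted(tree):
        lines.append("  " * depth + level)
        lines.extend(render(tree[level], depth + 1))
    return lines


# Hypothetical collection names, for illustration only
names = [
    "CALIFORNIA/SanFrancisco/91405/Sensor1",
    "CALIFORNIA/SanFrancisco/91405/Sensor2",
    "CALIFORNIA/Oakland/94601/Sensor1",
]
tree = build_tree(names)
print("\n".join(render(tree)))
```

Because the names are only strings, a shared prefix is what makes two collections "siblings"; nothing in the sketch assumes the database stores them as a tree.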

We can also limit the search to collections with a certain prefix (you may want to change the prefix below to obtain interesting results on your particular allocation). 

In [None]:
prefix = 'sunshine/PMU1'
collections = conn.list_collections(prefix)

print(f'Found {len(collections)} collections : {collections}')

## Find Streams
Now that we have a collection, the next step is to find individual streams (though we could have done this directly if we already knew what we wanted). As you may recall, each stream represents a single time series within the database, which contains the atomic ```(time, value)``` pairs and has some associated metadata. 

To get the streams, we will use the ```streams_in_collection``` method. This takes an optional collection prefix argument. If we pass in no argument, we will get all streams in the collection. Let us work with the collection we found previously. Note that the method returns a `StreamSet` object, which we can iterate through to see the found streams.

In [None]:
# todo: get collection names from the above code, for now use a workaround
# collection = collections[0]
collection = "justin_test_insert_bench"
streams = conn.streams_in_collection(collection)
print(f'Found {len(streams)} streams in collection "{collection}"')

In [None]:
streams[0].name

Recall that streams have metadata, which can help us understand what the stream *is*. The convenience function below will obtain and pretty print a set of streams and their metadata.

In [None]:
def describe_streams(streams):
    table = [["Collection", "Name", "Units", "Version", "UUID"]]
    for stream in streams:
        # Get all the tags for this stream. 
        tags = stream.tags
        table.append([
            stream.collection, stream.name, tags["unit"], stream.version, stream.uuid
        ])
    return tabulate(table, headers="firstrow")

print(describe_streams(streams))

We can also search for streams in terms of their metadata tags by passing an optional ```tags``` argument to the same ```streams_in_collection``` method. Let us get all the streams with units of ```volts```.

In [None]:
# todo add in support for tags, annotations, etc querying
streams = conn.streams_in_collection(collection, tags={"unit": "volts"})
print(describe_streams(streams))
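
Conceptually, a tag search keeps only the streams whose metadata contains every requested key/value pair. The sketch below shows that matching rule on plain dictionaries; it is not how the platform implements the search, and `match_tags` and the sample records are hypothetical.

```python
def match_tags(records, tags):
    """Return the records whose tags contain every requested key/value pair."""
    return [
        record
        for record in records
        if all(record["tags"].get(key) == value for key, value in tags.items())
    ]


# Hypothetical stream metadata, for illustration only
records = [
    {"name": "L1MAG", "tags": {"unit": "volts"}},
    {"name": "C1MAG", "tags": {"unit": "amps"}},
    {"name": "L2MAG", "tags": {"unit": "volts"}},
]

volts = match_tags(records, {"unit": "volts"})
print([record["name"] for record in volts])
```

An empty `tags` dictionary matches every record, which mirrors the behavior of calling the search with no tag filter at all.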

The API also supports SQL queries to find streams and metadata, which you can learn more about in [this notebook](https://github.com/PingThingsIO/ni4ai-notebooks/blob/main/tutorials/7%20-%20Working%20with%20Metadata.ipynb). 

## Work with a stream
Let us work with the first stream in the above list of streams. We will look at its metadata and learn how to query time series data from the stream.

### NOTE
`Stream` objects are simple objects representing relevant metadata from BTrDB. If we want to query and access data from these streams, we **must** use a `StreamSet` object, even if it contains just a single stream.

In [None]:
stream = StreamSet.from_streams([streams[0]], db_conn=conn)
print(stream)
print(stream.version())

Let's check out the stream's metadata. Note that ```stream.collections()``` returns a dictionary mapping stream uuids to their collection names, ```stream.tags()``` returns a dictionary mapping stream uuids to their tags, and ```stream.annotations()``` returns a dictionary mapping stream uuids to their available annotations.

In [None]:
print(f'COLLECTION: {stream.collections()}')

print(f'TAGS: {stream.tags()}')
print(f'ANNOTATIONS: {stream.annotations()}')

In [None]:
uu = stream.get_uuids()
print(uu)
uu = uu[0]
print()

print(f'COLLECTION: {stream.collections()[uu]}')

print(f'TAGS: {stream.tags()[uu]}')
print(f'ANNOTATIONS: {stream.annotations()[uu]}')

### Data Types from Time Series Queries

Before we move on, let's refresh some concepts on BTrDB data types.  

There are two data types (represented by Python objects) that we may obtain when querying time series data from a stream. These are: 

1. ```RawPoint```s --- RawPoints represent the atomic data type in a time series: a paired timestamp and value that is the raw data of the series. Recall that these are stored at the absolute bottom of the BTrDB tree.
2. ```StatPoint```s --- These are stored in the internal nodes of the BTrDB tree and contain precomputed statistical aggregates of the raw data that lies beneath them. 

Querying ```RawPoint```s will be slower than querying ```StatPoint```s. Between ```StatPoint```s, it is faster to query those that reside higher up in the tree, corresponding to longer durations of raw data and correspondingly larger numbers of ```RawPoint```s. 
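
To illustrate why aggregates are cheaper to work with than raw data, the following sketch collapses raw `(time, value)` pairs into one summary per fixed-width window. It is a conceptual illustration only: `Aggregate` is a hypothetical stand-in for a ```StatPoint```, and the database precomputes these summaries in its tree rather than scanning raw points as done here.

```python
from dataclasses import dataclass


@dataclass
class Aggregate:
    """Hypothetical stand-in for a StatPoint: a summary of one time window."""
    time: int      # window start, in nanoseconds
    min: float
    mean: float
    max: float
    count: int


def summarize(points, start, width):
    """Group (time_ns, value) pairs into windows of `width` ns beginning at `start`."""
    buckets = {}
    for t, v in points:
        window_start = start + ((t - start) // width) * width
        buckets.setdefault(window_start, []).append(v)
    return [
        Aggregate(t, min(vs), sum(vs) / len(vs), max(vs), len(vs))
        for t, vs in sorted(buckets.items())
    ]


# Four raw points summarized into two windows of 20 ns each
raw = [(0, 1.0), (10, 3.0), (20, 5.0), (30, 7.0)]
stats = summarize(raw, start=0, width=20)
for s in stats:
    print(s)
```

The wider the window, the fewer aggregates are needed to cover the same time range, which is the intuition behind faster queries higher up in the tree.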

### Data Time Range
There are a couple of useful methods for understanding the time range covered by a stream. We can do this by getting the *earliest* and *latest* ```RawPoint```s in the stream, as shown below. 

A few things: queries for time series Points always return the Point(s) requested along with a ***version number***. For most introductory users, the version is not useful, though you can read more about it in our docs. Below, we discard the version and just work with the returned ```RawPoint```. 

Time stamps in BTrDB are integers which represent *nanoseconds since the Unix epoch* [(January 1, 1970)](https://www.epochconverter.com/). These can be easily converted to more readable time stamps using utility functions in the ```timez``` package [(read the docs here)](https://btrdb.readthedocs.io/en/latest/api/utils-timez.html).

In [None]:
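# TODO: support stream.earliest, and latest
# Get the earliest & latest RawPoints in the stream
earliest = stream.earliest()
latest = stream.latest()

# Assumption: as in the multi-stream example later in this notebook, each call
# returns one RawPoint per stream, so we take the first (and only) entry.
earliest_time = earliest[0].time
latest_time = latest[0].time

print('Stream starts at: ', timez.ns_to_datetime(earliest_time))
print('Stream ends at: ', timez.ns_to_datetime(latest_time))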
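
If it helps to see what "nanoseconds since the Unix epoch" means in practice, here is a minimal sketch of the conversion using only the Python standard library. The helpers `to_ns` and `from_ns` are hypothetical and are not part of the ```timez``` package; they assume timezone-aware UTC datetimes, and `from_ns` drops any sub-microsecond remainder because `datetime` cannot represent it.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_ns(dt):
    """Convert a timezone-aware datetime to integer nanoseconds since the Unix epoch."""
    delta = dt - EPOCH
    whole_seconds = delta.days * 86_400 + delta.seconds
    return whole_seconds * 1_000_000_000 + delta.microseconds * 1_000


def from_ns(ns):
    """Convert nanoseconds since the Unix epoch to a UTC datetime (microsecond resolution)."""
    return EPOCH + timedelta(microseconds=ns // 1_000)


ns = to_ns(datetime(2022, 1, 1, tzinfo=timezone.utc))
print(ns)
print(from_ns(ns))
```

Integer arithmetic is used throughout so that the round trip is exact; converting through floating-point seconds can lose precision at nanosecond scale.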

### Querying Data
There are two methods to query data from a stream. These are documented & described below (in order of increasing query speed). 
1. ```StreamSet.values(start, end)```: This call will return all ```RawPoint```s in the streamset between the ```start``` and ```end``` time.
2. ```StreamSet.statspoints(start, end, width)```: This call will return ```StatPoint```s spanning (and summarizing) windows of length ```width``` nanoseconds between the ```start``` and ```end``` times. The ```width``` argument is an integer specifying the window length in *nanoseconds*.

To reiterate, ```values``` returns ```RawPoint```s, while ```windows``` returns ```StatPoint```s. 

All of these queries return a numpy array of the format `timestamp_0, (stream_0 value, stream_1 value, ... stream_N-1 value, stream_N value)`, where the values are either ```RawPoint```s or ```StatPoint```s. As mentioned above, the time stamps in the Points are in nanoseconds since the Unix epoch. 

Overall, we need a way to convert the returned data into a more convenient format. These values can be converted to a [pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) or [numpy Array](https://numpy.org), both of which are easy to work with. The calls to do this are shown below. 

#### ```stream.values```
Let us choose a ```start``` and ```end``` time for a ```values``` query. **CAUTION**: Most streams contain very high frequency data, so don't use too long a duration when querying raw data. 

In [None]:
start = btrdb.utils.timez.to_nanoseconds(datetime(2022, 1, 1))
end = start + btrdb.utils.timez.ns_delta(seconds=5)
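
To get a feel for why short durations matter for raw queries, the sketch below estimates how many points a single stream would return for a few durations. The 120 Hz sampling rate is an assumption made purely for illustration; check the actual reporting rate of your own streams.

```python
def estimate_points(duration_seconds, sample_rate_hz):
    """Estimate the number of raw points one stream holds for a given duration."""
    return int(duration_seconds * sample_rate_hz)


SAMPLE_RATE_HZ = 120  # assumed rate, for illustration only

durations = [("5 seconds", 5), ("1 hour", 3_600), ("1 day", 86_400)]
for label, seconds in durations:
    print(f"{label}: {estimate_points(seconds, SAMPLE_RATE_HZ):,} points")
```

At this assumed rate a five-second query is tiny, while a full day already exceeds ten million points per stream, which is where an aggregate query becomes the better choice.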

In [None]:
stream.tags(update=True)

In [None]:
# Query the data
stream = stream.values(start, end)
# Convert it to a pandas Dataframe
data = stream.to_dataframe()
data

It is very easy to plot the data in a pandas dataframe. Just call ```plot()```! Here we pass a few aesthetic arguments to the call. 

In [None]:
data.plot(linestyle='--', marker='o', figsize=(15, 5), 
          title=f'Raw values of stream "{stream[0].name}"', 
          ylabel=stream.tags()[uu]['unit']); 

#### ```stream.windows```
Let us choose a longer time range and a window length for a ```windows``` query, which will return ```StatPoint```s. Remember that the window width must be specified as an integer number of nanoseconds.  

In [None]:
# start = btrdb.utils.timez.to_nanoseconds(datetime(2015, 10, 10))
start = btrdb.utils.timez.to_nanoseconds(datetime(2022, 1, 1))
end = start + btrdb.utils.timez.ns_delta(hours=24)
# width_ns = btrdb.utils.timez.ns_delta(minutes=1)
width_ns = btrdb.utils.timez.ns_delta(minutes=60)

In [None]:
# Query the data
stream.get_statspoints(start, end, width_ns)
# Convert it to a pandas Dataframe
data = stream.to_dataframe()
data

You can think of a pandas ```DataFrame``` as an Excel file. In this case, the rows are timestamps (corresponding to the windows) and the columns contain the different aggregates. 

We can visualize the dataframe structure and entries with the ```head()``` call. 

In [None]:
data.head()

In [None]:
data.columns

Let us plot the resulting data. We can call ```.plot()``` on the ```DataFrame```, as we did earlier, but instead, we will manually plot only some of the columns, for easier and greater control of the visualization.  

In [None]:
from pandas import IndexSlice as idx
uu = stream.get_uuids()[0]
plt.figure(figsize=(15, 5))
# Plot the mean of the windows
plt.plot(data.index, data.loc[:, (idx[:, "mean"])], linewidth=2, color='red', linestyle='--', marker='o')
# Show the range of the minimum and maximum over the windows
plt.fill_between(data.index, data.loc[:, (idx[:, "min"])].values.flatten(), data.loc[:, (idx[:, "max"])].values.flatten(), color='lightgrey')
plt.ylabel(stream.tags()[uu]['unit'])
plt.title(f'Mean, Min and Max of stream "{stream.tags()[uu]["name"]}" using windows() with width=60min');
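
The ```IndexSlice``` selections in the plotting cell suggest that the aggregate columns are labeled on two levels, with a stream identifier on the first and the aggregate name on the second. The sketch below reproduces that selection on a small hand-built ```DataFrame```; the stream labels and values are invented, and the two-level layout is an assumption inferred from the plotting code rather than a documented guarantee.

```python
import pandas as pd

# Two-level column labels: (stream identifier, aggregate name)
columns = pd.MultiIndex.from_product(
    [["stream-a", "stream-b"], ["max", "mean", "min"]]
)
# Two hourly windows, with timestamps given in nanoseconds since the epoch
index = pd.to_datetime([0, 3_600_000_000_000], unit="ns")

frame = pd.DataFrame(
    [
        [3.0, 2.0, 1.0, 30.0, 20.0, 10.0],
        [6.0, 5.0, 4.0, 60.0, 50.0, 40.0],
    ],
    index=index,
    columns=columns,
)

idx = pd.IndexSlice
# Keep every stream on the first level, but only the "mean" aggregate on the second
means = frame.loc[:, idx[:, "mean"]]
print(means)
```

Selecting with `idx[:, "mean"]` keeps both column levels in the result, which is why the plotting cell flattens the values before passing them to ```fill_between```.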

## Work with multiple streams using streamsets
We often want to query data from a bunch of streams. Of course, we could do this manually by iterating through the individual streams and using the methods described above. However, we also have the option of using a [```StreamSet```](https://btrdb.readthedocs.io/en/latest/working/streamsets.html), which is a light wrapper around a list of streams. 

The following cells show a few examples for working with ```StreamSet```s. For more, see [this notebook](https://github.com/PingThingsIO/ni4ai-notebooks/blob/main/tutorials/3%20-%20Working%20With%20StreamSets.ipynb). Notice that when querying ```StreamSet```s, version numbers are no longer returned, as they were for queries on single streams. 

To begin, we use the same collection as before, and get a list of streams, which we then convert to a ```StreamSet```. 

In [None]:
streams = conn.streams_in_collection(collection, tags={"unit": "volts"})
streamset = StreamSet(streams)

We can now do some queries on all streams at once. For example...

In [None]:
earliests = streamset.earliest() # Returns a list of RawPoints, one for each stream
latests = streamset.latest() # Returns a list of RawPoints, one for each stream

for i in range(len(streamset)):
    # Get the times for this stream
    earliest_time = earliests[i].time
    latest_time = latests[i].time
    
    print(f'Stream {streamset[i].name}')
    print(f'starts at: {timez.ns_to_datetime(earliest_time)}')
    print(f'ends at: {timez.ns_to_datetime(latest_time)}')
    print('--------------------')

Let us make a ```values``` (i.e. raw data) query on this ```StreamSet```. Notice the difference in the query form: we use a function chaining approach to first set the ```start``` and ```end``` times. Then, the ```values``` are requested. 

In [None]:
start = datetime(2015, 10, 10)
end = start + timedelta(seconds=5)
data = streamset.filter(start, end).values()

The result, ```data```, is a list of lists, each containing the ```RawPoint```s for one of the streams in the ```StreamSet```. 
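
A list of per-stream lists is awkward to compare across streams because each inner list carries its own timestamps. As a rough illustration of what a tabular view does for you, the sketch below merges per-stream `(time, value)` lists into rows keyed by timestamp, using `None` where a stream has no point. The `align` helper and the sample data are hypothetical and use plain tuples rather than ```RawPoint``` objects.

```python
def align(series_by_name):
    """Merge per-stream (time_ns, value) lists into rows keyed by timestamp."""
    names = list(series_by_name)
    # Every timestamp that appears in at least one stream, in ascending order
    times = sorted({t for points in series_by_name.values() for t, _ in points})
    lookup = {name: dict(points) for name, points in series_by_name.items()}
    # One row per timestamp; a missing value is reported as None
    return [(t, *(lookup[name].get(t) for name in names)) for t in times]


# Hypothetical data for two streams, for illustration only
series = {
    "stream_a": [(0, 1.0), (10, 2.0)],
    "stream_b": [(0, 5.0), (20, 6.0)],
}
rows = align(series)
for row in rows:
    print(row)
```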

To make the data more workable, we can instead query the result as a pandas ```DataFrame```, as shown below: 

In [None]:
data = streamset.filter(start, end).to_dataframe()
data.head()

The timestamps still need to be converted, which we do below: 

In [None]:
# convert nanoseconds to human readable time
data.index = pd.to_datetime(data.index)
data.head()

And we can plot!

In [None]:
data.plot(figsize=(15, 5), linewidth=3, ylabel='volts');

And that is it! You've had a basic introduction to working with PredictiveGrid in Python. You are ready to start working with your data and conducting some interesting analyses.

As you go, you will want to learn about and use more advanced platform capabilities. Make sure to look at the notebooks [here](https://github.com/PingThingsIO/ni4ai-notebooks) containing both tutorials and analytics demos. You can also [read the blog](https://blog.ni4ai.org/), which shares user stories, capabilities, etc. Also, read the docs! 

# THE END #