In [13]:
# Get credentials
from IPython.utils import io
with io.capture_output() as captured:
    %run ../Introduction.ipynb
    
from datetime import datetime
from sentenai import Sentenai
import pandas as pd
sentenai = Sentenai(host=host, port=port)

# Introduction to Stream Databases

Stream databases are the building blocks of data sets in Sentenai. A stream database can represent something as simple as a single data feed, or as complex as an entire production line on a factory floor. Data streams in a stream database are organized into a tree-like hierarchy. This hierarchy is purely organizational, but splitting a complex flat data set into a hierarchy can yield substantial performance benefits.

Here's an example of a flat dataset called "Activity" like you'd typically see in a relational database:

| Timestamp | Location | Sensor | Temperature | Humidity |
| --- | --- | --- | --- | --- |
| 2020-01-01T00:00:00Z | Boston | 53 | 48 | 0.83 |
| 2020-01-01T00:00:00Z | Providence | 48 | 43 | 0.55 |


In Sentenai, you might instead arrange the same dataset as a hierarchy:
```
- Activity
  - Boston
    - 53
      - Temperature
      - Humidity
  - Providence
    - 48
      - Temperature
      - Humidity
```

This organization implies there are natural filters you'd want to apply:

`Activity/Boston/53/Temperature` is equivalent to filtering a table on Location and Sensor ID, then selecting the Temperature column.
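
For comparison, here is a minimal sketch of the same filter applied to the flat table in plain Python. The `rows` list simply reproduces the example table above; the variable names are illustrative and not part of Sentenai.

```python
# The flat "Activity" table from above, as a list of row dicts.
rows = [
    {"Timestamp": "2020-01-01T00:00:00Z", "Location": "Boston",
     "Sensor": 53, "Temperature": 48, "Humidity": 0.83},
    {"Timestamp": "2020-01-01T00:00:00Z", "Location": "Providence",
     "Sensor": 48, "Temperature": 43, "Humidity": 0.55},
]

# Relational-style equivalent of the path Activity/Boston/53/Temperature:
# filter on Location and Sensor, then keep only Timestamp and Temperature.
boston_53_temperature = [
    (row["Timestamp"], row["Temperature"])
    for row in rows
    if row["Location"] == "Boston" and row["Sensor"] == 53
]
print(boston_53_temperature)
```

In the hierarchical layout, the path itself plays the role of the `if` clause.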

## Managing data in a Stream Database

Stream databases feature transactional data logging. Any streams within the stream graph can be updated in a single transaction, provided the update applies to the same time interval across all updated streams.

So if we initialize a new database `test-2`:

In [2]:
db = sentenai.init('test-2')

We can update the entire database with each insertion, without any up-front schema declaration or stream instantiation:

In [5]:
raise Exception("Log is currently disabled for single-node installations")
db.log[datetime(2020, 1, 1): datetime(2020, 1, 1, 1): 'update-1'] = {
    'Boston': {
        '53': {
            'Temperature': 48.0,
            'Humidity': 0.83,
        },
    },
    'Providence': {
        '48': {
            'Temperature': 43.0,
            'Humidity': 0.55,
        }
    }
}

Exception: Log is currently disabled for single-node installations
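
The nested dictionary in the update mirrors the stream hierarchy. As a rough sketch, assuming the flat rows are available as dicts, that structure could be built in plain Python as shown here. `rows_to_update` is a hypothetical helper written for this illustration, not part of the Sentenai client.

```python
def rows_to_update(rows):
    """Group flat rows into a {location: {sensor: {field: value}}} dict."""
    update = {}
    for row in rows:
        # Sensor IDs become string keys, matching the paths used in this chapter.
        sensor = update.setdefault(row["Location"], {}).setdefault(str(row["Sensor"]), {})
        sensor["Temperature"] = float(row["Temperature"])
        sensor["Humidity"] = float(row["Humidity"])
    return update


rows = [
    {"Location": "Boston", "Sensor": 53, "Temperature": 48, "Humidity": 0.83},
    {"Location": "Providence", "Sensor": 48, "Temperature": 43, "Humidity": 0.55},
]
update = rows_to_update(rows)
print(update)
```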

You can also create indexes manually and use `insert` to add new data.

In [10]:
# Manual creation
db['Boston', '53', 'Temperature'] = float
db['Boston', '53', 'Temperature'].insert([{
    'start': datetime(2020, 1, 1),
    'end': datetime(2020, 1, 1, 1),
    'value': 48.0
}])

Or you can assign a list of events and let the index type be detected automatically. Note that this overwrites any previous data and re-detects the index type.

In [11]:
db['Boston', '53', 'Humidity'] = [{
    'start': datetime(2020, 1, 1),
    'end': datetime(2020, 1, 1, 1),
    'value': 0.83
}]

100%|████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 134.25 values/s]
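
Both approaches above take a plain list of dicts with `start`, `end`, and `value` keys. Here is a small sketch of generating such a list from a sequence of readings, assuming each reading covers a fixed-width interval (one hour by default). `to_events` is a hypothetical helper, not a Sentenai function.

```python
from datetime import datetime, timedelta


def to_events(readings, first_start, width=timedelta(hours=1)):
    """Build back-to-back events, one per reading, each `width` long."""
    events = []
    for i, value in enumerate(readings):
        start = first_start + i * width
        events.append({"start": start, "end": start + width, "value": float(value)})
    return events


events = to_events([48.0, 47.5, 47.0], datetime(2020, 1, 1))
for event in events:
    print(event["start"], event["end"], event["value"])
```

The resulting list has the same shape as the literals used in the cells above.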


Or you can use a DataFrame to build out part of the database all at once. Note that an additional `event`-typed index is created at the path `Providence/48`, which can be useful for keeping track of common intervals across indexes that were added together.

In [14]:
db['Providence', '48'] = pd.DataFrame([{
    'start': datetime(2020, 1, 1),
    'end': datetime(2020, 1, 1, 1),
    'Temperature': 43.0,
    'Humidity': 0.55
}])

100%|████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 106.00 values/s]


Now that we've inserted data into the database, we can inspect its structure using the database's `.graph()` method.

In [16]:
db.graph().show()

test-2
├── Boston
│   └── 53
│       ├── Humidity
│       └── Temperature
└── Providence
    └── 48
        ├── Humidity
        └── Temperature



Each node in this tree is a stream of either events or values. A stream of values is just a stream of events with a value attached, so in essence every node in the tree is the same.
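
One way to picture this relationship is a single record type with an optional value. The `Event` class below is a hypothetical model written for illustration. It is not how Sentenai represents events internally.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Optional


@dataclass(frozen=True)
class Event:
    start: datetime
    end: datetime
    value: Optional[Any] = None  # None models a plain event with no value attached

    @property
    def has_value(self) -> bool:
        return self.value is not None


# A value stream element and a plain event share the same shape.
reading = Event(datetime(2020, 1, 1), datetime(2020, 1, 1, 1), value=48.0)
marker = Event(datetime(2020, 1, 1), datetime(2020, 1, 1, 1))
print(reading.has_value, marker.has_value)
```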

### Inserting bulk data

In addition to the basic pattern of insertion into the log with `[ start : (optional) end : (optional) id ]`, you can also stream large amounts of data over a single connection using Python's `with` keyword and a streaming context manager. To start up the streaming connection, type:
```
with db.log as log:
```
This creates a new streaming connection, called `log`, that we can use to insert data. The connection manages asynchronous uploading and buffering of data, so you can iterate over very large files without worrying about them being loaded fully into memory. Here's a basic example:

In [17]:
raise Exception("Disabled for single-node")
import random
with db.log as log:
    for i in range(50):
        log[datetime(2020, 1, 1, 2, i) : : ] = {
            'Boston': {'53': {
                'Humidity' : random.random(),
                'Temperature': random.random() * 100
            }}}

Exception: Disabled for single-node
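
To make the buffering idea concrete, here is a toy context manager that collects updates and hands them off in batches. This is only a sketch of the general pattern, assuming a synchronous `send` callback. `BufferedLog` is hypothetical and says nothing about how Sentenai's streaming connection is actually implemented.

```python
from datetime import datetime


class BufferedLog:
    """Toy stand-in for a streaming connection: buffers updates, flushes in batches."""

    def __init__(self, send, batch_size=10):
        self._send = send              # callable that receives a list of (key, update) pairs
        self._batch_size = batch_size
        self._buffer = []

    def __enter__(self):
        return self

    def __setitem__(self, key, update):
        self._buffer.append((key, update))
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self):
        if self._buffer:
            self._send(self._buffer)
            self._buffer = []          # start a fresh list so sent batches stay intact

    def __exit__(self, exc_type, exc, tb):
        self.flush()                   # send whatever is left when the block ends
        return False                   # do not suppress exceptions


batches = []
with BufferedLog(batches.append, batch_size=20) as log:
    for i in range(50):
        log[datetime(2020, 1, 1, 2, i)::] = {"Boston": {"53": {"Humidity": i / 100}}}

print([len(batch) for batch in batches])
```

Only one batch is held in the buffer at a time, which is why this style of insertion suits inputs too large to load at once.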

## Properties of a Stream Database

A stream database has several properties that are useful when writing programs.

In [18]:
# Name
print(db.name)

test-2


In [19]:
# Origin
db.origin

numpy.datetime64('1970-01-01T00:00:00')

## Accessing Streams in a Stream Database

So far we've seen how to manage data in a stream database, but we haven't yet seen how to work with that data.

Stream databases are dict-like objects, so we can access streams in the same way we would access a value in a dict:

In [21]:
for key in db:
    print(db[key])

test-2/Boston
test-2/Providence


The `.keys()` and `.items()` methods are also supported.

Paths that are multiple levels deep can be accessed in one of three ways:

In [23]:
print(db['Boston']['53'])

test-2/Boston/53


In [26]:
print(db['Boston', '53'])

test-2/Boston/53


In [27]:
print(db['Boston/53'])

test-2/Boston/53


These three forms are entirely equivalent, but one may be preferable to the others depending on the situation.
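
As a rough illustration of why the three forms coincide, a dict-like object can normalize every key into a tuple of path segments. The `Node` class and `normalize_path` function below are hypothetical and only mimic the behavior shown above. They are not the Sentenai client's code.

```python
def normalize_path(key):
    """Turn 'Boston/53', ('Boston', '53'), or 'Boston' into a tuple of segments."""
    parts = key if isinstance(key, tuple) else (key,)
    segments = []
    for part in parts:
        segments.extend(s for s in str(part).split("/") if s)
    return tuple(segments)


class Node:
    """Minimal path-holding node supporting chained, tuple, and slash-separated keys."""

    def __init__(self, path=()):
        self.path = tuple(path)

    def __getitem__(self, key):
        return Node(self.path + normalize_path(key))

    def __str__(self):
        return "/".join(self.path)


root = Node(("test-2",))
print(root["Boston"]["53"])
print(root["Boston", "53"])
print(root["Boston/53"])
```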

### [Next chapter: Streams](Streams.ipynb)