In [1]:
# Get credentials
from IPython.utils import io
with io.capture_output() as captured:
    %run ../Introduction.ipynb
    
from datetime import datetime
from sentenai import Sentenai
sentenai = Sentenai(host=host, port=port)

# Introduction to Stream Databases

Stream databases are the building blocks of data sets in Sentenai. A stream database can represent something as simple as a single data feed, or as complex as an entire production line on a factory floor. Data streams in a stream database are organized into a tree-like hierarchy. This hierarchy is purely organizational, but splitting a complex flat data set into a hierarchy can have substantial performance benefits.

Here's an example of a flat dataset called "Activity" like you'd typically see in a relational database:

| Timestamp | Location | Sensor | Temperature | Humidity |
| --- | --- | --- | --- | --- |
| 2020-01-01T00:00:00Z | Boston | 53 | 48 | 0.83 |
| 2020-01-01T00:00:00Z | Providence | 48 | 43 | 0.55 |


In Sentenai you might instead arrange this dataset as a hierarchy:
```
- Activity
  - Boston
    - 53
      - Temperature
      - Humidity
  - Providence
    - 48
      - Temperature
      - Humidity
```

This organization implies there are natural filters you'd want to apply:

`Activity/Boston/53/Temperature` is equivalent to filtering the table on Location and Sensor id and then selecting the Temperature column.
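
As a rough illustration of the mapping between the two layouts, the sketch below regroups flat rows into the nested shape using plain Python. The `nest` helper and the choice to key the result by timestamp are assumptions made for this example; they are not part of the Sentenai client.

```python
# Flat rows shaped like the "Activity" table above.
rows = [
    {"Timestamp": "2020-01-01T00:00:00Z", "Location": "Boston", "Sensor": 53,
     "Temperature": 48, "Humidity": 0.83},
    {"Timestamp": "2020-01-01T00:00:00Z", "Location": "Providence", "Sensor": 48,
     "Temperature": 43, "Humidity": 0.55},
]


def nest(rows):
    """Group flat rows by timestamp into Location -> Sensor -> measurement dicts.

    Hypothetical helper for illustration only.
    """
    updates = {}
    for row in rows:
        tree = updates.setdefault(row["Timestamp"], {})
        # Sensor ids become string keys, matching the path-like hierarchy.
        sensor = tree.setdefault(row["Location"], {}).setdefault(str(row["Sensor"]), {})
        sensor["Temperature"] = float(row["Temperature"])
        sensor["Humidity"] = float(row["Humidity"])
    return updates


nested = nest(rows)
# Walking the nested dicts plays the role of the Location and Sensor filters.
boston_temperature = nested["2020-01-01T00:00:00Z"]["Boston"]["53"]["Temperature"]
```

Each level of nesting replaces one filter column, which is why the path form needs no explicit `WHERE`-style condition.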

## Managing data in a Stream Database

Stream databases feature transactional data logging. Any streams within the stream graph can be updated in a single transaction, provided the update applies to the same time interval across all updated streams.

So if we initialize a new database `test-2`:

In [3]:
db = sentenai.init('test-2')

We can update the entire database with each insertion, without any up-front schema declaration or stream instantiation:

In [3]:
db.log[datetime(2020, 1, 1): datetime(2020, 1, 1, 1): 'update-1'] = {
    'Boston': {
        '53': {
            'Temperature': 48.0,
            'Humidity': 0.83,
        },
    },
    'Providence': {
        '48': {
            'Temperature': 43.0,
            'Humidity': 0.55,
        }
    }
}
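
The `log[start : end : id] = {...}` form relies on ordinary Python slice syntax: a subscript assignment with colons reaches `__setitem__` as a `slice` object. The sketch below shows that mechanism with a hypothetical `ToyLog` class. It only illustrates the language feature and makes no claim about how the Sentenai client is implemented.

```python
from datetime import datetime


class ToyLog:
    """Hypothetical stand-in for a log; records what the subscript syntax passes in."""

    def __init__(self):
        self.entries = []

    def __setitem__(self, key, value):
        # toy[a : b : c] = v arrives here with key == slice(a, b, c).
        if not isinstance(key, slice):
            raise TypeError("expected toy[start : end : id]")
        self.entries.append(
            {"start": key.start, "end": key.stop, "id": key.step, "data": value}
        )


toy = ToyLog()
toy[datetime(2020, 1, 1) : datetime(2020, 1, 1, 1) : "update-1"] = {
    "Boston": {"53": {"Temperature": 48.0}}
}
# Omitted parts of the slice show up as None.
toy[datetime(2020, 1, 1, 2) : :] = {"Boston": {"53": {"Temperature": 50.0}}}
```

Omitting `end` or `id` is therefore just a matter of leaving that slot of the slice empty.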

Now that we've inserted our first update into the database's `log`, we can see for ourselves the structure using the `.graph` property of the database.

In [4]:
db.graph.show()

test-2
├── Boston
│   └── 53
│       ├── Humidity
│       └── Temperature
└── Providence
    └── 48
        ├── Humidity
        └── Temperature



Each node in this tree is a stream of either events or values. A stream of values is just a stream of events with a value attached, so in essence every node in the tree is the same.

### Inserting bulk data

In addition to the basic pattern of insertion into the log with `[ start : (optional) end : (optional) id ]`, you can also stream large amounts of data over a single connection using Python's `with` keyword and a streaming context manager. To start up the streaming connection, type:
```
with db.log as log:
```
This creates a new streaming connection, called `log`, that we can use to insert data. The connection manages asynchronous uploading and buffering of data, so you can iterate over very large files without loading them fully into memory. Here's a basic example:

In [13]:
import random
with db.log as log:
    for i in range(50):
        log[datetime(2020, 1, 1, 2, i) : : ] = {
            'Boston': {'53': {
                'Humidity' : random.random(),
                'Temperature': random.random() * 100
            }}}
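
For file-based sources, one way to keep memory use flat is to produce updates lazily with a generator. The sketch below assumes a CSV file with the same columns as the "Activity" table; the `updates_from_csv` helper and the column layout are assumptions for this example. It uses an in-memory `io.StringIO` in place of a real file so that it runs without a server.

```python
import csv
import io
from datetime import datetime


def updates_from_csv(lines):
    """Yield (timestamp, update) pairs one row at a time (hypothetical helper)."""
    for row in csv.DictReader(lines):
        ts = datetime.strptime(row["Timestamp"], "%Y-%m-%dT%H:%M:%SZ")
        update = {row["Location"]: {row["Sensor"]: {
            "Temperature": float(row["Temperature"]),
            "Humidity": float(row["Humidity"]),
        }}}
        yield ts, update


# Stand-in for an open file handle.
sample = io.StringIO(
    "Timestamp,Location,Sensor,Temperature,Humidity\n"
    "2020-01-01T00:00:00Z,Boston,53,48,.83\n"
    "2020-01-01T00:00:00Z,Providence,48,43,.55\n"
)
pairs = list(updates_from_csv(sample))

# Following the pattern shown above, each pair could then be written inside
# the streaming context:
#     with db.log as log:
#         for ts, update in updates_from_csv(open_file):
#             log[ts : : ] = update
```

Because the generator reads one row per iteration, only the current row is held in memory regardless of file size.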

### Retrieving an update by unique id

Updates can be retrieved by id:

In [19]:
db.log['update-1']

{'Boston': {'53': {'Humidity': 0.83, 'Temperature': 48.0}},
 'Providence': {'48': {'Humidity': 0.55, 'Temperature': 43.0}}}

### Deleting an update

If an individual update has been given a unique id, it can be deleted from the log by its id:

In [4]:
del db.log['update-1']

In [6]:
try:
    print(db.log['update-1'])
except KeyError as k:
    print(k)

'update not found.'


### Retrieving updates by time

You can retrieve a set of updates by time slice: `db.log[ start : end : limit ]`. All arguments are optional. To retrieve all updates you can do `db.log[:]`. To get the first 5 updates, do:

In [7]:
for x in db.log[ : : 5]:
    print(x)

{'start': numpy.datetime64('2020-01-01T02:00:00'), 'end': None, 'data': {'Boston': {'53': {'Humidity': 0.9294579174551353, 'Temperature': 22.32584915785809}}}}
{'start': numpy.datetime64('2020-01-01T02:01:00'), 'end': None, 'data': {'Boston': {'53': {'Humidity': 0.6331413173773333, 'Temperature': 81.71983424155601}}}}
{'start': numpy.datetime64('2020-01-01T02:01:00'), 'end': None, 'data': {'Boston': {'53': {'Humidity': 0.030784589936583617, 'Temperature': 96.44885425146505}}}}
{'start': numpy.datetime64('2020-01-01T02:01:00'), 'end': None, 'data': {'Boston': {'54': {'Humidity': 55}}}}
{'start': numpy.datetime64('2020-01-01T02:02:00'), 'end': None, 'data': {'Boston': {'53': {'Humidity': 0.3031839372221987, 'Temperature': 23.21499324972436}}}}


Negative limit values reverse the retrieval order. To get the last five updates, you can do:

In [10]:
for x in db.log[ : : -5]:
    print(x)

{'start': numpy.datetime64('2020-01-01T02:49:00'), 'end': None, 'data': {'Boston': {'53': {'Humidity': 0.7662644674096308, 'Temperature': 93.03395578163271}}}}
{'start': numpy.datetime64('2020-01-01T02:49:00'), 'end': None, 'data': {'Boston': {'53': {'Humidity': 0.5353517813206717, 'Temperature': 8.141972824783416}}}}
{'start': numpy.datetime64('2020-01-01T02:48:00'), 'end': None, 'data': {'Boston': {'53': {'Humidity': 0.04041089532843878, 'Temperature': 94.67279770679082}}}}
{'start': numpy.datetime64('2020-01-01T02:48:00'), 'end': None, 'data': {'Boston': {'53': {'Humidity': 0.10329406728415713, 'Temperature': 77.18615457054389}}}}
{'start': numpy.datetime64('2020-01-01T02:47:00'), 'end': None, 'data': {'Boston': {'53': {'Humidity': 0.16096466815504118, 'Temperature': 90.58414373050245}}}}
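
Retrieved updates carry their values in a nested `data` dict, as the output above shows. If path/value pairs are more convenient, the nesting can be flattened with a small recursive generator. The `flatten` helper below is a hypothetical utility, and the sample update copies the shape of the printed output with a plain string in place of the `numpy.datetime64` start.

```python
def flatten(data, prefix=()):
    """Yield (path, value) pairs for every leaf in a nested update dict."""
    for key, value in data.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            # Descend into intermediate nodes, extending the path.
            yield from flatten(value, path)
        else:
            yield "/".join(path), value


# Same shape as one printed update; "start" is a plain string here.
update = {
    "start": "2020-01-01T02:01:00",
    "end": None,
    "data": {"Boston": {"53": {"Humidity": 0.5, "Temperature": 20.0},
                        "54": {"Humidity": 55}}},
}
flat = dict(flatten(update["data"]))
```

The resulting keys, such as `Boston/53/Humidity`, mirror the stream paths in the database graph.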


## Properties of a Stream Database

A stream database has several properties that are useful when writing programs.

In [13]:
# Name
print(db.name)

test-2


In [14]:
# Origin
db.origin

numpy.datetime64('1970-01-01T00:00:00')

In [25]:
# Number of updates
len(db)

101

## Accessing Streams in a Stream Database

So far we've seen how to manage data in a stream database, but we haven't yet seen how to work with that data.

Stream databases are dict-like objects, so we can access streams in the same way we would access a value in a dict:

In [31]:
for key in db:
    print(db[key])

test-2/Boston
test-2/Providence


The `.keys()` and `.items()` methods are also supported.

Paths that are multiple levels deep can be accessed in one of two ways:

In [33]:
print(db['Boston']['43'])

test-2/Boston/43


In [34]:
print(db['Boston', '43'])

test-2/Boston/43


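The two forms are entirely equivalent; which one is preferable depends on the situation. Chained indexing reads naturally when you descend step by step, while the tuple form is convenient when the path is already held in a sequence.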
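
Both forms are possible because Python passes a comma-separated subscript to `__getitem__` as a single tuple. The sketch below reproduces the behavior with a hypothetical `ToyNode` class; it illustrates the language mechanism only and is not the Sentenai implementation.

```python
class ToyNode:
    """Hypothetical path node that accepts chained or tuple subscripts."""

    def __init__(self, path):
        self.path = tuple(path)

    def __getitem__(self, key):
        # node["a", "b"] arrives as key == ("a", "b"); node["a"] as key == "a".
        parts = key if isinstance(key, tuple) else (key,)
        return ToyNode(self.path + parts)

    def __str__(self):
        return "/".join(self.path)


root = ToyNode(["test-2"])
chained = str(root["Boston"]["43"])
tupled = str(root["Boston", "43"])

# A path held in a list can be passed in one step by converting it to a tuple.
from_list = str(root[tuple(["Boston", "43"])])
```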

### [Next chapter: Streams](Streams.ipynb)