Sina Basic Usage
==============

This notebook will guide you through some of Sina's core functionality. For more examples, including advanced topics like handling large datasets or generating tables, see the example dataset folders (noaa/, fukushima/, etc.).

Initial Setup
=========
We first connect to one of Sina's backends. We'll use SQL (specifically SQLite) for simplicity, since it comes by default with Sina. We set up a connection to our database, then use that connection to get a record handler (`ds.records`), the core object for inserting, querying, and generally handling Records. The connection call is the only backend-specific portion of this tutorial. Everything else should apply to all backends equally.

In [None]:
import json
import random

import sina
from sina.model import Record, generate_record_from_json
from sina.utils import DataRange, has_all, has_any, all_in, any_in

# The default (read: without an argument) behavior of sina.connect()
# is to connect to an in-memory SQLite database. If you'd like to create a file, just provide
# the filename as an arg. You can also pass the URL to a database such as MySQL or MariaDB.

ds = sina.connect()
record_handler = ds.records

print("Connection is ready!")

Inserting Our First Records
----------------
Now that we've got a connection open and our handler ready, we can start inserting Records! The first one we'll create is as simple as possible, but the rest have data attached. We'll insert all of them into our database.

In [None]:
simple_record = Record(id="simplest", type="simple_sample")
record_handler.insert(simple_record)

possible_maintainers = ["John Doe", "Jane Doe", "Gary Stu", "Ann Bob"]
num_data_records = 100
for val in range(0, num_data_records):
    # Our sample "code runs" are mostly random data
    record = Record(id="rec_{}".format(val), type="foo_type")
    record.add_data('initial_density', random.randint(10, 1000) / 10.0, units='g/cm^3')
    record.add_data('final_volume', random.randint(1, int(num_data_records / 5)))
    record.add_data('maintainer', random.choice(possible_maintainers), tags=["personnel"])
    record_handler.insert(record)

print("{} Records have been inserted into the database.".format(num_data_records + 1))

Type-Based Queries and Deleting Records
--------------------------------------------------

On second thought, the "simple_sample" Record isn't useful. Pretending we've forgotten the id we used to create it above, we'll go ahead and find every simple_sample-type Record in our database and delete it.

In [None]:
simple_record_ids = list(record_handler.find_with_type("simple_sample", ids_only=True))
print("Simple_sample Records found: {}".format(simple_record_ids))

print("Deleting them all...")
record_handler.delete(simple_record_ids)

simple_records_post_delete = list(record_handler.find_with_type("simple_sample", ids_only=True))
print("Simple_sample Records found now: {}".format(simple_records_post_delete))

Finding Records Based on Data
=========================
The remaining Records in our database represent randomized runs of some imaginary code. We can use their inputs and outputs to select runs we're particularly interested in.

Basic data query
--------------------
John Doe just completed a run of the version he maintains where the final_volume was 6, which seemed a little low. After inserting that Record, he finds all Records in the database for which he is the maintainer and which have a final_volume of 6 or lower.

In [None]:
# Because Record data is represented by a JSON object/Python dictionary, we can also set it up like so:
data = {"final_volume": {"value": 6},
        "initial_density": {"value": 6, "units": "g/cm^3"},
        "maintainer": {"value": "John Doe"}}
record_handler.insert(Record(id="john_latest", type="foo_type", data=data))

# Now we'll find matching Records.
john_low_volume = record_handler.find_with_data(maintainer="John Doe",
                                                final_volume=DataRange(max=6, max_inclusive=True))

print("John Doe's low-volume runs: {}".format(', '.join(john_low_volume)))
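To make the range semantics concrete, here is a minimal pure-Python sketch of the bounds check that a `DataRange` expresses. The helper `in_range` and its parameter names are hypothetical and not part of Sina. It assumes, based on how `DataRange` is called in this notebook, that the lower bound is inclusive and the upper bound exclusive unless stated otherwise.

```python
def in_range(value, lo=None, hi=None, lo_inclusive=True, hi_inclusive=False):
    """Return True if value falls inside the (possibly open-ended) range.

    Hypothetical helper for illustration only; not Sina's implementation.
    """
    if lo is not None:
        if value < lo or (value == lo and not lo_inclusive):
            return False
    if hi is not None:
        if value > hi or (value == hi and not hi_inclusive):
            return False
    return True


# Mirrors DataRange(max=6, max_inclusive=True): no lower bound, and 6 itself matches.
volumes = [3, 6, 7]
low_volumes = [v for v in volumes if in_range(v, hi=6, hi_inclusive=True)]
print(low_volumes)
```

Without `hi_inclusive=True`, the value 6 would be excluded, which is why John's query sets `max_inclusive=True` to include his latest run.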

Specialized queries
-----------------------

John's still curious about how low the final_volume can be, regardless of who performed the run. Sina was designed with these sorts of questions in mind: there are specialized query functions for performing common selections like "Records with the <number> lowest values for <datum name>". See the documentation for the full list!

In [None]:
# John wants the 3 Records with the lowest volumes.
runs_with_lowest_volumes = record_handler.find_with_min("final_volume", 3)
for run in runs_with_lowest_volumes:
    print("{}: {} (maintainer: {})".format(run.id,
                                           run.data["final_volume"]["value"],
                                           run.data["maintainer"]["value"]))
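Conceptually, a "lowest n values" selection is an n-smallest search keyed on one datum. The sketch below shows that idea in plain Python on made-up values; the ids and volumes are hypothetical stand-ins, and this is not how Sina runs the query (a backend would normally do the selection inside the database).

```python
import heapq

# Hypothetical stand-in data: record id -> final_volume
final_volumes = {"rec_0": 12, "rec_1": 3, "rec_2": 7, "rec_3": 1, "rec_4": 9}

# The 3 ids with the smallest final_volume, in ascending order of volume.
lowest_three = heapq.nsmallest(3, final_volumes, key=final_volumes.get)
print(lowest_three)
```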

Complicated & combined queries
---------------------------------------

Of course, not every query can be predicted, and mathematical expressions especially can be too complicated to express cleanly in a function. This is where Sina's Python nature comes in handy. Data can be handled and transformed, and queries can be cast to sets to combine them.

Seeing that he may not be the only one with this issue, John decides to investigate further. While the runs don't record initial_mass, he knows he can find it from density and volume (mass is conserved throughout the simulation), and that an initial_mass below 45 indicates that something strange has happened. So he first finds all runs where (initial_density\*final_volume)<45, then figures out which of those, if any, belong to him.

In [None]:
record_data = record_handler.get_data(["final_volume", "initial_density"])
low_mass_records = set()
for rec_id, data_dict in record_data.items():
    mass = data_dict["initial_density"]["value"] * data_dict["final_volume"]["value"]
    if mass < 45:
        low_mass_records.add(rec_id)
print("Low-mass runs: {}".format(low_mass_records))

john_runs = list(record_handler.find_with_data(maintainer="John Doe"))
print("John's low-mass runs: {}".format(low_mass_records.intersection(john_runs)))

List Data and Querying Them
----------------------------------

Some data take the form of a list of entries, either numbers or strings: timeseries, options activated, and nodes in use are a few examples. Sina allows for storing and querying these lists. Note that, to maintain querying efficiency, a list can't have strings AND have scalars AND be queryable; only all-scalar or all-string lists can be part of a Record's data. Mixed-type lists (as well as any other JSON-legal structure) can be stored in a Record's user_defined section instead.

In [None]:
# Records expressed as JSON. We expect records 1 and 3 to match our query.
record_1 = """{"id": "list_rec_1",
               "type": "list_rec",
               "data": {"options_active": {"value": ["quickrun", "verification", "code_test"]},
                        "velocity": {"value": [0.0, 0.0, 0.0, 0.0, 0.0]}},
               "user_defined": {"mixed": [1, 2, "upper"]}}"""
record_2 = """{"id": "list_rec_2",
               "type": "list_rec",
               "data": {"options_active": {"value": ["quickrun", "distributed"]},
                        "velocity": {"value": [0.0, -0.2, -3.1, -12.8, -22.5]}},
               "user_defined": {"mixed": [1, 2, "upper"],
                                "nested": ["spam", ["egg"]]}}"""
record_3 = """{"id": "list_rec_3",
               "type": "list_rec",
               "data": {"options_active": {"value": ["code_test", "quickrun"]},
                        "velocity": {"value": [0.0, 1.0, 2.0, 3.0, 4.1]}},
               "user_defined": {"nested": ["spam", ["egg"]],
                                "bool_dict": {"my_key": [true, false]}}}"""

for record in (record_1, record_2, record_3):
    record_handler.insert(generate_record_from_json(json.loads(record)))
print("3 list-containing Records have been inserted into the database.\n")

# Find all the Records that have both "quickrun" and "code_test" in their options_active
quicktest = record_handler.find_with_data(options_active=has_all("quickrun", "code_test"))

# Get those Records and print their id, value for options_active, and the contents of their user_defined.
print("Records whose traits include 'quickrun' and 'code_test':\n")
for rec_id in quicktest:  # avoid shadowing the built-in id()
    record = record_handler.get(rec_id)
    print("{} traits: {} | user_defined: {}".format(rec_id,
                                                    ', '.join(record['data']['options_active']['value']),
                                                    str(record['user_defined'])))

Further List Queries
-----------------------
There are a few additional ways to retrieve Records based on their list data. A `has_any()` query will retrieve any Record that contains *at least* one of its arguments. An `all_in()` query retrieves Records where all members of a scalar list are in some range. An `any_in()` query retrieves Records where one or more members are in the range. Scalar ranges are assumed to be continuous.

It's important to note that, for these three types of list query, order and count don't matter. If `["quickrun", "code_test"]` would match, so would `["code_test", "quickrun", "quickrun"]`.

In [None]:
match_has_any = list(record_handler.find_with_data(options_active=has_any("quickrun", "code_test")))
print("Records whose traits include 'quickrun' and/or 'code_test': {}".format(', '.join(match_has_any)))


match_all_in = list(record_handler.find_with_data(velocity=all_in(DataRange(min=0, max=0,
                                                                  max_inclusive=True))))
print("Records where velocity never changed from zero: {}"
      .format(', '.join(match_all_in)))


match_any_in = list(record_handler.find_with_data(velocity=any_in(DataRange(min=0, min_inclusive=False))))
print("Records that had a velocity greater than zero at some point: {}"
      .format(', '.join(match_any_in)))
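The matching rules for these list queries can be sketched with plain Python sets and the built-in `all()`/`any()`. The helpers `matches_has_all` and `matches_has_any` are hypothetical names used only for illustration, not Sina functions; the data is copied from the three list Records above.

```python
options = {"list_rec_1": ["quickrun", "verification", "code_test"],
           "list_rec_2": ["quickrun", "distributed"],
           "list_rec_3": ["code_test", "quickrun"]}
velocity = {"list_rec_1": [0.0, 0.0, 0.0, 0.0, 0.0],
            "list_rec_2": [0.0, -0.2, -3.1, -12.8, -22.5],
            "list_rec_3": [0.0, 1.0, 2.0, 3.0, 4.1]}


def matches_has_all(entries, *wanted):
    """Every wanted string appears in entries; order and count are ignored."""
    return set(wanted) <= set(entries)


def matches_has_any(entries, *wanted):
    """At least one wanted string appears in entries."""
    return bool(set(wanted) & set(entries))


with_both = sorted(r for r, opts in options.items()
                   if matches_has_all(opts, "quickrun", "code_test"))
with_either = sorted(r for r, opts in options.items()
                     if matches_has_any(opts, "quickrun", "code_test"))
# all_in over the range [0, 0]: every entry must equal zero.
always_zero = sorted(r for r, vals in velocity.items() if all(v == 0 for v in vals))
# any_in over the range (0, infinity): one entry above zero is enough.
ever_positive = sorted(r for r, vals in velocity.items() if any(v > 0 for v in vals))

print(with_both, with_either, always_zero, ever_positive)
```

Because the helpers convert to sets, a list such as `["code_test", "quickrun", "quickrun"]` matches exactly as `["quickrun", "code_test"]` does.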

Curveset Basics
=============

Sina can also store *collections* of list data. These "curve sets" are useful for tracking relationships between curves, for example when plotting. They're not themselves used in many queries, but their list data is available through the aforementioned list queries.

In [None]:
# A Record expressed as JSON, containing a single curve set. We expect it to match our query.
curve_rec = """{"id": "curve_rec",
                "type": "curve_rec",
                "curve_sets": {
                    "sample_curve": {
                        "independent": {"time": {"value": [0, 1, 2]}},
                        "dependent": {"distance": {"value": [0, 2, 4]},
                                      "amount": {"value": [12, 14, 7]}}}}}"""
record_handler.insert(generate_record_from_json(json.loads(curve_rec)))
rec_with_curve_id = record_handler.find_with_data(amount=any_in(DataRange(min=12)))
print('Records with an "amount" >= 12 at some point: {}'
      .format(list(rec_with_curve_id)))
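To see why a query on `amount` can find this Record, it may help to look at the curve set as plain named lists. The sketch below walks the JSON layout shown above and collects every curve by name. It only illustrates the structure and makes no claim about how Sina stores or indexes curve sets internally.

```python
import json

curve_rec = json.loads("""{"id": "curve_rec",
                           "type": "curve_rec",
                           "curve_sets": {
                               "sample_curve": {
                                   "independent": {"time": {"value": [0, 1, 2]}},
                                   "dependent": {"distance": {"value": [0, 2, 4]},
                                                 "amount": {"value": [12, 14, 7]}}}}}""")

# Collect every curve (independent and dependent) from every curve set by name.
curves = {}
for curve_set in curve_rec["curve_sets"].values():
    for role in ("independent", "dependent"):
        for name, curve in curve_set.get(role, {}).items():
            curves[name] = curve["value"]

# Equivalent in spirit to any_in(DataRange(min=12)) applied to "amount".
amount_reaches_12 = any(value >= 12 for value in curves["amount"])
print(sorted(curves), amount_reaches_12)
```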

Releasing Resources
=================

When we are all done, it is important to release database resources. Failure to close connections can result in the server keeping additional resources open, resulting in performance issues.

In [None]:
ds.close()

To avoid forgetting to close a connection, we can use the object returned by `sina.connect()` as a context manager.

In [None]:
with sina.connect() as ds:
    record_handler = ds.records
    # Since we closed the connection above, SQLite dropped the in-memory database and we created a new one.
    # We need to re-populate it.
    for record in (record_1, record_2, record_3):
        record_handler.insert(generate_record_from_json(json.loads(record)))
    print(list(record_handler.find_with_data(velocity=any_in(DataRange(min=-10, max=-5)))))
# Once we exit the context, since it's an in-memory db, it's once again dropped.
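The drop-on-close behavior comes from SQLite itself: an in-memory database lives only as long as its connection. The sketch below demonstrates this with Python's standard `sqlite3` module directly rather than through Sina, so the table name `runs` is a made-up example. It uses `contextlib.closing` because a `sqlite3` connection used as a context manager manages transactions and does not close the connection.

```python
import sqlite3
from contextlib import closing

# First connection: create a table and a row in an in-memory database.
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE runs (id TEXT)")
    conn.execute("INSERT INTO runs VALUES ('rec_0')")
    rows_before = conn.execute("SELECT COUNT(*) FROM runs").fetchone()[0]

# Second connection: a brand-new, empty in-memory database.
with closing(sqlite3.connect(":memory:")) as conn:
    tables_after = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'").fetchall()

print(rows_before, tables_after)
```

Passing a filename instead of relying on the in-memory default keeps the data across connections.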