# dsdb.DatasetDatabase

### What is it?
A dataset database is an object you create to ingest, query, and process all types of datasets. It is the core of dsdb and has a lot built in to help you run algorithms across datasets and version the inputs and results properly. It also has built-in functions that help you share your datasets with relative ease.

### Why use it?
In short, it handles arbitrary datatypes with relative ease while providing versioning and a processing system on the backend for you. It can set up an entire database quickly with little admin overhead. It handles type and value checking on ingestion, provides custom module support for your own file management systems, and much more!

### Index
1) [Initialization](#Initialization)

2) [Uploading](#Uploading)

3) [Querying](#Querying)

4) [Processing](#Processing)

5) [Sharing](#Sharing)

## Initialization
Below is a minimal way to connect to a dataset database.

**Note: If you haven't already, now is a good time to look over [the connection manager explainer](EXPLAINER-connection_manager.ipynb).**

In [1]:
# import and connect to a connection manager instance
import datasetdatabase as dsdb
mngr = dsdb.ConnectionManager(user="jacksonb")
mngr.add_connections(dsdb.LOCAL)
mngr

LOCAL:
	driver: sqlite
	link: /active/examples/local_database/local.db

In [2]:
# connect to local
local = mngr.connect(dsdb.LOCAL)
local

Recent Datasets:
--------------------------------------------------------------------------------

That is it. You now have a local instance of a dataset database. Local instances use sqlite as their driver by default. (However, this shouldn't matter, as dsdb handles interactions the same way regardless of database driver.)

## Uploading

Great, you have connected to a dataset database and you want to add a dataset to the database. Let's do just that. I will explain the nuances as we go along.

In [3]:
# import to help us create a test dataset
import pandas as pd
import numpy as np
import pathlib
import json
import os

# create a test dataset to upload that has various types of data
fp_ex = pathlib.Path("./fp_example/")
if not fp_ex.exists():
    os.makedirs(fp_ex)

# for reproducibility I will set the seed here
np.random.seed(seed=12)

# creating lists of dicts to be formed into a dataframe
test = []
for i in range(10):
    fp =  fp_ex / (str(i) + ".json")
    with open(fp, "w") as write_out:
        json.dump({"hello": "world"}, write_out)
    
    d = {}
    d["strings"] = "foo" + str(i)
    d["bools"] = np.random.rand() < 0.5
    d["floats"] = np.random.rand() * 100
    d["ndarrays"] = np.random.rand(2, 2)
    d["tuples"] = tuple([1, 2, 3])
    d["sets"] = set([1, 2, 3, 3, 3])
    d["files"] = str(fp)
    test.append(d)

# convert this example to a dataframe
test = pd.DataFrame(test)
test

Unnamed: 0,bools,files,floats,ndarrays,sets,strings,tuples
0,True,fp_example/0.json,74.00497,"[[0.26331501518513467, 0.5337393933802977], [0...","{1, 2, 3}",foo0,"(1, 2, 3)"
1,False,fp_example/1.json,3.342143,"[[0.9569493362751168, 0.13720932135607644], [0...","{1, 2, 3}",foo1,"(1, 2, 3)"
2,False,fp_example/2.json,85.273554,"[[0.002259233518513537, 0.5212260272202929], [...","{1, 2, 3}",foo2,"(1, 2, 3)"
3,False,fp_example/3.json,16.071675,"[[0.7645604503388788, 0.020809797952066167], [...","{1, 2, 3}",foo3,"(1, 2, 3)"
4,True,fp_example/4.json,67.145265,"[[0.4712297782500141, 0.8161682980460269], [0....","{1, 2, 3}",foo4,"(1, 2, 3)"
5,False,fp_example/5.json,32.756948,"[[0.3346475291060558, 0.9780580790165189], [0....","{1, 2, 3}",foo5,"(1, 2, 3)"
6,False,fp_example/6.json,82.500925,"[[0.40664030180666166, 0.4513084114213143], [0...","{1, 2, 3}",foo6,"(1, 2, 3)"
7,True,fp_example/7.json,96.25969,"[[0.4192502702591062, 0.4240524465509987], [0....","{1, 2, 3}",foo7,"(1, 2, 3)"
8,True,fp_example/8.json,3.516826,"[[0.08427266973184566, 0.7325206981419501], [0...","{1, 2, 3}",foo8,"(1, 2, 3)"
9,True,fp_example/9.json,22.085252,"[[0.055019993340200135, 0.5232460707782919], [...","{1, 2, 3}",foo9,"(1, 2, 3)"


In [4]:
# upload and return the dataset info
ds_info = local.upload_dataset(dataset=test,
                               name="test dataset",
                               description="this is the hello world of dataset ingestion",
                               type_map={"bools": bool,
                                         "files": str,
                                         "floats": float,
                                         "ndarrays": np.ndarray,
                                         "strings": str},
                               filepath_columns=["files"])
ds_info

Validating Dataset...
Creating Iota...
Creating Junction Items...
Dataset upload complete!


{'DatasetId': 1,
 'Name': 'test dataset',
 'Description': 'this is the hello world of dataset ingestion',
 'SourceId': 1,
 'FilepathColumns': "['files']",
 'Created': '2018-08-06 23:43:35.062186'}

Breaking down what just happened: we first validated the dataset. We start by taking the md5 hash of the dataframe object. If we have already stored this md5, no processing is done and the already stored dataset info is returned. If it is truly a new dataset (like this one is), we check that all filepaths exist in any columns labeled as filepath columns, and we check the dataset types against a map of columns to their types (a column can even map to a list of approved types, ex: `{"files": [str, pathlib.Path]}`). We didn't here, but you can also specify a checking function for each column using a "value_validation_map" such as `{"floats": lambda x: x < 100}`. If we hadn't specified which columns store filepaths, our fms would not store any files in its storage and no files would be added to the files table. Additionally, if you leave the "type_map" as None, each value type is assumed correct. There is a lot to this one function, so I recommend reading the function docstring for more information.
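To make the two maps concrete, here is a minimal sketch of what type and value checking could look like. It is not dsdb's implementation: `validate_rows` is a hypothetical helper, and it works on a plain list of dicts rather than a dataframe so that it runs with only the standard library.

```python
import pathlib


def validate_rows(rows, type_map=None, value_validation_map=None):
    """Hypothetical helper: check each value's type and run optional value checks."""
    type_map = type_map or {}
    value_validation_map = value_validation_map or {}
    for i, row in enumerate(rows):
        for column, allowed in type_map.items():
            # a column may map to a single type or to a list of approved types
            if not isinstance(allowed, (list, tuple)):
                allowed = [allowed]
            if not isinstance(row[column], tuple(allowed)):
                found = type(row[column]).__name__
                raise TypeError(f"row {i}, column {column!r}: unexpected type {found}")
        for column, check in value_validation_map.items():
            # each check is a function that returns True for acceptable values
            if not check(row[column]):
                raise ValueError(f"row {i}, column {column!r}: {row[column]!r} failed validation")
    return True


rows = [
    {"floats": 74.0, "files": "fp_example/0.json"},
    {"floats": 3.3, "files": pathlib.Path("fp_example/1.json")},
]
is_valid = validate_rows(
    rows,
    type_map={"floats": float, "files": [str, pathlib.Path]},
    value_validation_map={"floats": lambda x: x < 100},
)
print(is_valid)
```

Columns that are missing from `type_map` are simply not checked here, which mirrors the "assumed correct" behavior described above.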

A note on filepaths: any file found in the filepath columns will be deduplicated, meaning that if a file's md5 hash has already been encountered, no new file is created in the fms.
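The deduplication idea can be sketched with the standard library alone. This is an illustration under stated assumptions, not the fms code: `file_md5` and `store_files` are hypothetical names, and storing each file under its md5 hash is just one simple way to make identical content land in one place.

```python
import hashlib
import pathlib
import shutil
import tempfile


def file_md5(path, chunk_size=65536):
    """Return the md5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as read_in:
        for chunk in iter(lambda: read_in.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_files(paths, storage_dir):
    """Hypothetical helper: copy each unique file once and return the stored path for every input."""
    storage_dir = pathlib.Path(storage_dir)
    storage_dir.mkdir(parents=True, exist_ok=True)
    stored = {}  # md5 -> stored path
    replaced = []
    for path in paths:
        md5 = file_md5(path)
        if md5 not in stored:
            # first time this content is seen: copy it into storage
            target = storage_dir / (md5 + pathlib.Path(path).suffix)
            shutil.copyfile(path, target)
            stored[md5] = target
        replaced.append(stored[md5])
    return replaced


with tempfile.TemporaryDirectory() as tmp:
    tmp = pathlib.Path(tmp)
    sources = []
    for i in range(3):
        fp = tmp / f"{i}.json"
        fp.write_text('{"hello": "world"}')
        sources.append(fp)
    stored_paths = store_files(sources, tmp / "fms")
    unique_count = len(set(stored_paths))

print(unique_count)
```

Three source files with identical content resolve to a single stored file, while every row still gets a usable path back.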

That was just the validation portion of the ingestion process. The next step shown is the "Creating Iota" process. This is how data is actually stored in the dataset database. We break the dataframe down into key-value pairings and store them individually. This reduces the amount of storage required a bit: instead of treating each dataset as a unique item, it will most likely happen that some row-column pairings from one dataset were already contained in another. Creating Iota takes a while because, as a secondary part of breaking down each key-value pairing, we also store every value as a string along with some metadata about that pairing, so that when you want to retrieve it we can cast it back to its original form properly. Because this is the lowest possible form a key-value pairing can take, we called them "Iota", meaning an extremely small amount of data.
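A rough sketch of the idea, assuming each Iota is a (row, key, value-as-string, type-name) tuple. The real dsdb schema and casting logic may differ; `to_iota` and `from_iota` are hypothetical names used only for illustration.

```python
import ast


def to_iota(rows):
    """Break rows into unique (row, key, value-as-string, type-name) tuples."""
    iota = set()
    for index, row in enumerate(rows):
        for key, value in row.items():
            iota.add((index, key, str(value), type(value).__name__))
    return iota


# map the stored type name back to a function that rebuilds the value
CASTERS = {
    "str": str,
    "int": int,
    "float": float,
    "bool": lambda s: s == "True",
    "tuple": ast.literal_eval,
    "set": ast.literal_eval,
}


def from_iota(item):
    """Cast a stored Iota back to its original value using the type metadata."""
    index, key, value_string, type_name = item
    return index, key, CASTERS[type_name](value_string)


original = [
    {"strings": "foo0", "floats": 74.0, "tuples": (1, 2, 3)},
    {"strings": "foo1", "floats": 3.5, "tuples": (1, 2, 3)},
]
# a processed copy where only one value changed
processed = [
    {"strings": "foo0", "floats": 86.0, "tuples": (1, 2, 3)},
    {"strings": "foo1", "floats": 3.5, "tuples": (1, 2, 3)},
]

iota_original = to_iota(original)
iota_all = iota_original | to_iota(processed)
rebuilt = from_iota((0, "tuples", "(1, 2, 3)", "tuple"))
print(len(iota_original), len(iota_all), rebuilt)
```

Storing the second dataset only adds the one pairing that changed, which is where the storage savings come from.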

The last process is "Creating Junction Items". This simply stitches all the Iota that were created to the dataset object that has also been created. We do this last so that if anything goes wrong at any point, we can stop the process and not create a dataset.
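The "stop and not create a dataset" behavior can be illustrated with a plain sqlite transaction. This is a sketch under assumptions: the table names echo the ones dsdb prints, but the columns and the `create_dataset` helper are hypothetical and are not dsdb's actual schema or code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE Iota (IotaId INTEGER PRIMARY KEY, KeyName TEXT, ValueString TEXT);
    CREATE TABLE Dataset (DatasetId INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE IotaDatasetJunction (
        IotaId INTEGER NOT NULL REFERENCES Iota(IotaId),
        DatasetId INTEGER NOT NULL REFERENCES Dataset(DatasetId)
    );
    """
)
with conn:
    conn.executemany(
        "INSERT INTO Iota (IotaId, KeyName, ValueString) VALUES (?, ?, ?)",
        [(1, "strings", "foo0"), (2, "floats", "74.0")],
    )


def create_dataset(conn, name, iota_ids):
    """Create a dataset row and its junction rows in a single transaction."""
    with conn:  # commits on success, rolls back if anything raises
        cursor = conn.execute("INSERT INTO Dataset (Name) VALUES (?)", (name,))
        dataset_id = cursor.lastrowid
        conn.executemany(
            "INSERT INTO IotaDatasetJunction (IotaId, DatasetId) VALUES (?, ?)",
            [(iota_id, dataset_id) for iota_id in iota_ids],
        )
    return dataset_id


good_id = create_dataset(conn, "test dataset", [1, 2])

try:
    # IotaId 999 does not exist, so the junction insert fails
    create_dataset(conn, "broken dataset", [1, 999])
except sqlite3.IntegrityError:
    pass

dataset_count = conn.execute("SELECT COUNT(*) FROM Dataset").fetchone()[0]
junction_count = conn.execute("SELECT COUNT(*) FROM IotaDatasetJunction").fetchone()[0]
print(good_id, dataset_count, junction_count)
```

Because the dataset row and its junction rows share one transaction, the failed attempt leaves no half-built dataset behind.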

What was returned was the dataset info. You can use the information stored in this dictionary with other dsdb functions to retrieve, process, and share datasets.

If you want to see the recent additions to all tables and not just the dataset table, you can run `dsdb._deep_print()`.

## Querying

You just uploaded a dataset, so let's get it back to view it.

In [5]:
# recent datasets
local

Recent Datasets:
--------------------------------------------------------------------------------
{'DatasetId': 1, 'Name': 'test dataset', 'Description': 'this is the hello world of dataset ingestion', 'SourceId': 1, 'FilepathColumns': "['files']", 'Created': '2018-08-06 23:43:35.062186'}

In [6]:
local.get_dataset(1)

Unnamed: 0,bools,files,floats,ndarrays,sets,strings,tuples
0,True,/home/jovyan/.local/share/QuiltCli/quilt_packa...,74.00497,"[[0.26331501518513467, 0.5337393933802977], [0...","{1, 2, 3}",foo0,"(1, 2, 3)"
1,False,/home/jovyan/.local/share/QuiltCli/quilt_packa...,3.342143,"[[0.9569493362751168, 0.13720932135607644], [0...","{1, 2, 3}",foo1,"(1, 2, 3)"
2,False,/home/jovyan/.local/share/QuiltCli/quilt_packa...,85.273554,"[[0.002259233518513537, 0.5212260272202929], [...","{1, 2, 3}",foo2,"(1, 2, 3)"
3,False,/home/jovyan/.local/share/QuiltCli/quilt_packa...,16.071675,"[[0.7645604503388788, 0.020809797952066167], [...","{1, 2, 3}",foo3,"(1, 2, 3)"
4,True,/home/jovyan/.local/share/QuiltCli/quilt_packa...,67.145265,"[[0.4712297782500141, 0.8161682980460269], [0....","{1, 2, 3}",foo4,"(1, 2, 3)"
5,False,/home/jovyan/.local/share/QuiltCli/quilt_packa...,32.756948,"[[0.3346475291060558, 0.9780580790165189], [0....","{1, 2, 3}",foo5,"(1, 2, 3)"
6,False,/home/jovyan/.local/share/QuiltCli/quilt_packa...,82.500925,"[[0.40664030180666166, 0.4513084114213143], [0...","{1, 2, 3}",foo6,"(1, 2, 3)"
7,True,/home/jovyan/.local/share/QuiltCli/quilt_packa...,96.25969,"[[0.4192502702591062, 0.4240524465509987], [0....","{1, 2, 3}",foo7,"(1, 2, 3)"
8,True,/home/jovyan/.local/share/QuiltCli/quilt_packa...,3.516826,"[[0.08427266973184566, 0.7325206981419501], [0...","{1, 2, 3}",foo8,"(1, 2, 3)"
9,True,/home/jovyan/.local/share/QuiltCli/quilt_packa...,22.085252,"[[0.055019993340200135, 0.5232460707782919], [...","{1, 2, 3}",foo9,"(1, 2, 3)"


**Keep calm!**

Yes, the filepaths look different. Remember when I said in the above section that during the Iota creation process dsdb handles the fms upload of any files in the columns listed as filepath columns? Well, it also changes those filepaths to be the fms filepaths. We do this so we can guarantee a dataset will always be usable. As you can see from the filepaths, the default fms module is a [Quilt](https://quiltdata.com) backend. If you want to view the original dataset without any changes to the filepaths or data, you can use the File and the dataset info that was passed back from the upload to track down the FMS readpath of the unchanged uploaded dataset. (For provenance, we also versioned that for you.)

## Processing

When handling algorithms and functions, the system is designed around processing a dataset. In short, this means your algorithms/methods/functions will have two parameters: a dataset parameter, which will be passed the recast dataset from a `dsdb.get_dataset()` call, and an "other parameters" parameter. Other parameters are options you pass to the `dsdb.process_run()` function under the `alg_parameters` keyword. Even if your function doesn't need them, having the parameter slot available is required (you can call these parameters whatever you would like). Additionally, your function must return a pandas DataFrame.

In [7]:
def add_to_col(dataset, params):
    dataset[params["col"]] += params["add"]
    return dataset

In [8]:
# process the original dataset
# using the function provided
# store the processed dataset info
post_process_info = local.process_run(add_to_col,
                                      ds_info["DatasetId"],
                                      alg_parameters={"col": "floats", "add": 12},
                                      dataset_parameters={"name": "processing test",
                                                          "description": "this is an example of processing a dataset",
                                                          "filepath_columns": "files"})
post_process_info

Validating Dataset...
Creating Iota...
Creating Junction Items...
Dataset upload complete!


{'DatasetId': 2,
 'Name': 'processing test',
 'Description': 'this is an example of processing a dataset',
 'SourceId': 2,
 'FilepathColumns': "['files']",
 'Created': '2018-08-06 23:43:38.041504'}

In [9]:
# view the processed dataset
local.get_dataset(post_process_info["DatasetId"])

Unnamed: 0,bools,files,floats,ndarrays,sets,strings,tuples
0,True,/home/jovyan/.local/share/QuiltCli/quilt_packa...,86.00497,"[[0.26331501518513467, 0.5337393933802977], [0...","{1, 2, 3}",foo0,"(1, 2, 3)"
1,False,/home/jovyan/.local/share/QuiltCli/quilt_packa...,15.342143,"[[0.9569493362751168, 0.13720932135607644], [0...","{1, 2, 3}",foo1,"(1, 2, 3)"
2,False,/home/jovyan/.local/share/QuiltCli/quilt_packa...,97.273554,"[[0.002259233518513537, 0.5212260272202929], [...","{1, 2, 3}",foo2,"(1, 2, 3)"
3,False,/home/jovyan/.local/share/QuiltCli/quilt_packa...,28.071675,"[[0.7645604503388788, 0.020809797952066167], [...","{1, 2, 3}",foo3,"(1, 2, 3)"
4,True,/home/jovyan/.local/share/QuiltCli/quilt_packa...,79.145265,"[[0.4712297782500141, 0.8161682980460269], [0....","{1, 2, 3}",foo4,"(1, 2, 3)"
5,False,/home/jovyan/.local/share/QuiltCli/quilt_packa...,44.756948,"[[0.3346475291060558, 0.9780580790165189], [0....","{1, 2, 3}",foo5,"(1, 2, 3)"
6,False,/home/jovyan/.local/share/QuiltCli/quilt_packa...,94.500925,"[[0.40664030180666166, 0.4513084114213143], [0...","{1, 2, 3}",foo6,"(1, 2, 3)"
7,True,/home/jovyan/.local/share/QuiltCli/quilt_packa...,108.25969,"[[0.4192502702591062, 0.4240524465509987], [0....","{1, 2, 3}",foo7,"(1, 2, 3)"
8,True,/home/jovyan/.local/share/QuiltCli/quilt_packa...,15.516826,"[[0.08427266973184566, 0.7325206981419501], [0...","{1, 2, 3}",foo8,"(1, 2, 3)"
9,True,/home/jovyan/.local/share/QuiltCli/quilt_packa...,34.085252,"[[0.055019993340200135, 0.5232460707782919], [...","{1, 2, 3}",foo9,"(1, 2, 3)"


**Note: The two rather important keyword arguments to think about passing to `dsdb.process_run()` are alg_parameters, as mentioned before, and dataset_parameters.**

"alg_parameters" are passed to the algorithm/ function/ method and you can then do what you want with them.

"dataset_parameters" are used by the dataset creation function and have the exact same keywords available as the `dsdb.upload_dataset()` function, except for dataset (name, description, type_map, value_validation_map, import_as_type_map, store_files, force_storage, filepath_columns, and replace_paths). These dataset parameters are used when the returned dataset is uploaded, meaning that if you make a brand new dataframe with different column names, you should use the new column names in your type_map, filepath_columns, etc.

## Sharing

You can also share your datasets relatively easily thanks to the built-in Quilt backend. Simply pass a dataset id to the `dsdb.export_to_quilt()` function and it will create a Quilt package and return the package name.

In [10]:
# export/ create the processed dataset
local.export_to_quilt(post_process_info["DatasetId"])

'dsdb/processing_test'

In [11]:
from quilt.data.dsdb import processing_test
processing_test

<PackageNode>
files/
README
data

In [12]:
# comes with a generated readme
# (the context manager closes the file once it has been read)
with open(processing_test.README(), "r") as readme:
    print(readme.read())

# processing test:

## Description:
this is an example of processing a dataset

## Filepath Columns:
['files']

## Created:
2018-08-06 23:43:38.041504

## Origin SourceId:
2




In [13]:
# read in the dataset
read_in = pd.read_pickle(processing_test.data())
read_in

Unnamed: 0,bools,files,floats,ndarrays,sets,strings,tuples
0,True,json_2,86.00497,"[[0.26331501518513467, 0.5337393933802977], [0...","{1, 2, 3}",foo0,"(1, 2, 3)"
1,False,json_2,15.342143,"[[0.9569493362751168, 0.13720932135607644], [0...","{1, 2, 3}",foo1,"(1, 2, 3)"
2,False,json_2,97.273554,"[[0.002259233518513537, 0.5212260272202929], [...","{1, 2, 3}",foo2,"(1, 2, 3)"
3,False,json_2,28.071675,"[[0.7645604503388788, 0.020809797952066167], [...","{1, 2, 3}",foo3,"(1, 2, 3)"
4,True,json_2,79.145265,"[[0.4712297782500141, 0.8161682980460269], [0....","{1, 2, 3}",foo4,"(1, 2, 3)"
5,False,json_2,44.756948,"[[0.3346475291060558, 0.9780580790165189], [0....","{1, 2, 3}",foo5,"(1, 2, 3)"
6,False,json_2,94.500925,"[[0.40664030180666166, 0.4513084114213143], [0...","{1, 2, 3}",foo6,"(1, 2, 3)"
7,True,json_2,108.25969,"[[0.4192502702591062, 0.4240524465509987], [0....","{1, 2, 3}",foo7,"(1, 2, 3)"
8,True,json_2,15.516826,"[[0.08427266973184566, 0.7325206981419501], [0...","{1, 2, 3}",foo8,"(1, 2, 3)"
9,True,json_2,34.085252,"[[0.055019993340200135, 0.5232460707782919], [...","{1, 2, 3}",foo9,"(1, 2, 3)"


In [14]:
# only one file because all the json files were the same {"hello": "world"}
processing_test.files

<GroupNode>
json_2/

In [15]:
# open all files
for f in read_in["files"]:
    fp = getattr(processing_test.files, f)
    # close each file handle after reading
    with open(fp.load()) as read_file:
        print(json.load(read_file))

{'hello': 'world'}
{'hello': 'world'}
{'hello': 'world'}
{'hello': 'world'}
{'hello': 'world'}
{'hello': 'world'}
{'hello': 'world'}
{'hello': 'world'}
{'hello': 'world'}
{'hello': 'world'}


In [16]:
# show that types were conserved even through export
[(type(nd), nd.shape) for nd in read_in["ndarrays"]]

[(numpy.ndarray, (2, 2)),
 (numpy.ndarray, (2, 2)),
 (numpy.ndarray, (2, 2)),
 (numpy.ndarray, (2, 2)),
 (numpy.ndarray, (2, 2)),
 (numpy.ndarray, (2, 2)),
 (numpy.ndarray, (2, 2)),
 (numpy.ndarray, (2, 2)),
 (numpy.ndarray, (2, 2)),
 (numpy.ndarray, (2, 2))]

While using the package outside of a dataset database is easy enough, there is also a `dsdb.import_from_quilt()` function if your collaborators are also using this package.

In [17]:
# teardown the database and supporting files so we know we will get a fresh dsdb instance
import shutil
shutil.rmtree(fp_ex)
db_store = pathlib.Path("/active/examples/local_database/")
shutil.rmtree(db_store)

In [18]:
# rebuild and display empty database
mngr = dsdb.ConnectionManager(dsdb.LOCAL, user="jacksonb")
local = mngr.connect(dsdb.LOCAL)
local._deep_print()

------------------------------- DATASET DATABASE -------------------------------
--------------------------------------------------------------------------------
Recent User:
--------------------------------------------------------------------------------
Recent Iota:
--------------------------------------------------------------------------------
Recent Source:
--------------------------------------------------------------------------------
Recent FileSource:
--------------------------------------------------------------------------------
Recent QuiltSource:
--------------------------------------------------------------------------------
Recent Dataset:
--------------------------------------------------------------------------------
Recent IotaDatasetJunction:
--------------------------------------------------------------------------------
Recent Algorithm:
--------------------------------------------------------------------------------
Recent Run:
------------------------------------

In [19]:
local.import_from_quilt("dsdb/processing_test")

Validating Dataset...
Creating Iota...
Creating Junction Items...
Dataset upload complete!


{'DatasetId': 1,
 'Name': 'dsdb/processing_test',
 'Description': 'this is an example of processing a dataset',
 'SourceId': 1,
 'FilepathColumns': "['files']",
 'Created': '2018-08-06 23:43:41.788372'}

In [20]:
local

Recent Datasets:
--------------------------------------------------------------------------------
{'DatasetId': 1, 'Name': 'dsdb/processing_test', 'Description': 'this is an example of processing a dataset', 'SourceId': 1, 'FilepathColumns': "['files']", 'Created': '2018-08-06 23:43:41.788372'}

## Wrap Up

There is more to come for Dataset Database, but this covers much of the basics of uploading, querying, processing, and sharing data. Please report any issues you have on the [GitHub issues page](https://github.com/AllenCellModeling/aics_modeling_db/issues) for this repo.