# ExeTera Import Example

This example demonstrates creating a simple dataset from a RandomDataset YAML schema, then converting it to an ExeTera database using a matching JSON schema.

The first step is to install `RandomDataset` and `exetera`, then import them along with the other libraries used below:

In [1]:
%pip install RandomDataset exetera

Collecting RandomDataset
  Using cached RandomDataset-0.1.4-py3-none-any.whl (14 kB)
Installing collected packages: RandomDataset
Successfully installed RandomDataset-0.1.4
Note: you may need to restart the kernel to use updated packages.


In [2]:
import os
import sys

import numpy as np
import pandas as pd

import randomdataset
import exetera, exetera.core, exetera.processing

Next, the YAML schema used to generate the random data is written to a file:

In [2]:
%%writefile randomschema.yml

- typename: randomdataset.generators.CSVGenerator
  num_lines: 10
  dataset:
    name: participants
    typename: randomdataset.Dataset
    fields:
    - name: id
      typename: randomdataset.UIDFieldGen
    - name: FirstName
      typename: randomdataset.StrFieldGen
      lmin: 6
      lmax: 14
    - name: LastName
      typename: randomdataset.StrFieldGen
      lmin: 6
      lmax: 14
    - name: Age
      typename: randomdataset.IntFieldGen
      vmin: 18
      vmax: 90
    - name: is_employed
      typename: randomdataset.BoolFieldGen
      as_string: False
        
- typename: randomdataset.generators.CSVGenerator
  num_lines: 10
  dataset:
    name: tests
    typename: randomdataset.Dataset
    fields:
    - name: id
      typename: randomdataset.UIDFieldGen
    - name: patient_id
      typename: randomdataset.IntFieldGen
      vmin: 0
      vmax: 10
    - name: test_type
      typename: randomdataset.SetFieldGen
      field_type: str
      values: ["Type1", "Type2", "Unknown"]
    - name: location
      typename: randomdataset.StrFieldGen
    - name: result
      typename: randomdataset.IntFieldGen
      vmin: 0
      vmax: 3
    - name: value
      typename: randomdataset.FloatFieldGen
      vmin: 0
      vmax: 9

Writing randomschema.yml
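To give a feel for the data this schema describes, here is a minimal standard-library sketch that produces rows shaped like the `participants` table. It is not how RandomDataset works internally. The helper names `random_string` and `make_participants` are hypothetical, and it assumes that `UIDFieldGen` yields sequential integers (as the ids in the later outputs suggest), that `lmin`/`lmax` bound string lengths, and that `vmin`/`vmax` bound values inclusively.

```python
import csv
import io
import random
import string


def random_string(rng, lmin, lmax):
    """Return a random alphanumeric string whose length lies in [lmin, lmax]."""
    length = rng.randint(lmin, lmax)
    alphabet = string.ascii_letters + string.digits
    return "".join(rng.choice(alphabet) for _ in range(length))


def make_participants(num_lines, seed=0):
    """Build rows shaped like the `participants` dataset (hypothetical helper)."""
    rng = random.Random(seed)
    rows = []
    for uid in range(num_lines):
        rows.append({
            "id": uid,  # assumption: sequential unique ids
            "FirstName": random_string(rng, 6, 14),
            "LastName": random_string(rng, 6, 14),
            "Age": rng.randint(18, 90),  # assumption: inclusive bounds
            "is_employed": rng.choice([False, True]),
        })
    return rows


rows = make_participants(10)

# Write the rows as CSV text, with the field names as the header line
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

The real generator writes one CSV file per dataset entry in the schema, here `participants.csv` and `tests.csv`.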


Instead of invoking the `generate_dataset` command line utility, the same command can be called directly through the imported library:

In [3]:
# !generate_dataset randomschema.yml .
randomdataset.generate_dataset.callback("randomschema.yml",".")

Schema: 'randomschema.yml'
Output: '.'


Next, the ExeTera schema is written out. It describes the same tables and fields as the YAML schema, with extra specifiers for storage types and for primary and foreign keys:

In [100]:
%%writefile exeteraschema.json

{
  "exetera": {
    "version": "1.0.0"
  },
  "schema": {
    "participants": {
      "primary_keys": [
        "id"
      ],
      "fields": {
        "id": {
          "field_type": "fixed_string",
          "length": 32
        },
        "FirstName": {
          "field_type": "fixed_string",
          "length": 32
        },
        "LastName": {
          "field_type": "fixed_string",
          "length": 32
        },
        "Age": {
          "field_type": "numeric",
          "value_type": "int32"
        },
        "is_employed": {
          "field_type": "categorical",
          "categorical": {
            "value_type": "int8",
            "strings_to_values": {
              "False": 1,
              "True": 2
            }
          }
        }
      }
    },
    "tests": {
      "primary_keys": [
        "id"
      ],
      "foreign_keys": {
        "participants_id": {
          "space": "participants",
          "key": "id"
        }
      },
      "fields": {
        "id": {
          "field_type": "fixed_string",
          "length": 32
        },
        "patient_id": {
          "field_type": "fixed_string",
          "length": 32
        },
        "test_type": {
          "field_type": "categorical",
          "categorical": {
            "value_type": "int8",
            "strings_to_values": {
              "Unknown": 0,
              "Type1": 1,
              "Type2": 2
            }
          }
        },
        "location": {
          "field_type": "string"
        },
        "result": {
          "field_type": "numeric",
          "value_type": "int32"
        },
        "value": {
          "field_type": "numeric",
          "value_type": "float32"
        } 
      }
    } 
  }
}


Overwriting exeteraschema.json
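The `categorical` entries map the strings found in the CSV to small integers. The sketch below illustrates that mapping with the standard library only, using the `test_type` definition from the schema. It assumes the semantics of `strings_to_values` are a plain lookup, and it is not ExeTera's importer code; the `raw_column` values are made up for illustration.

```python
import json

# The "test_type" field definition, copied from the schema above
schema_text = """
{
  "test_type": {
    "field_type": "categorical",
    "categorical": {
      "value_type": "int8",
      "strings_to_values": {"Unknown": 0, "Type1": 1, "Type2": 2}
    }
  }
}
"""

field_schema = json.loads(schema_text)["test_type"]
strings_to_values = field_schema["categorical"]["strings_to_values"]

# Invert the mapping so stored integers can be turned back into strings
values_to_strings = {value: name for name, value in strings_to_values.items()}

raw_column = ["Type1", "Unknown", "Type2", "Type1"]  # illustrative CSV values
encoded = [strings_to_values[s] for s in raw_column]
decoded = [values_to_strings[v] for v in encoded]

print(encoded)  # [1, 0, 2, 1]
print(decoded)  # ['Type1', 'Unknown', 'Type2', 'Type1']
```

This is consistent with the `is_employed` column in the merged tables later in this notebook, which holds 1 and 2 rather than `False` and `True`.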


The conversion command is run to produce the HDF5 dataset:

In [6]:
%%bash

rm -f dataset.hdf5
exetera import -w -s exeteraschema.json -i "participants:participants.csv, tests:tests.csv" -o dataset.hdf5
ls -lh

'exetera' is not recognized as an internal or external command,
operable program or batch file.
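The `-i` argument pairs each table name in the schema with the CSV file to import, and the import log later in this notebook prints that pairing as a dictionary. A minimal sketch of such a parser follows; `parse_import_spec` is a hypothetical helper written for illustration, not ExeTera's own argument handling.

```python
def parse_import_spec(spec):
    """Split "name:file, name:file" into a {name: file} dictionary (hypothetical helper)."""
    mapping = {}
    for item in spec.split(","):
        # Split each entry at the first colon into table name and file name
        name, _, filename = item.strip().partition(":")
        if not name or not filename:
            raise ValueError(f"expected 'name:file', got {item!r}")
        mapping[name.strip()] = filename.strip()
    return mapping


files = parse_import_spec("participants:participants.csv, tests:tests.csv")
print(files)  # {'participants': 'participants.csv', 'tests': 'tests.csv'}
```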


In [None]:
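# Unfinished cell: import_with_schema() cannot be called without arguments (schema file,
# input CSV files, destination file); their exact form depends on the installed ExeTera version.
exetera.core.importer.import_with_schema()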

Let's open the dataset with ExeTera and read some of the data back:

In [18]:
with exetera.core.session.Session() as s:
    dat = s.open_dataset("dataset.hdf5", "r", "dataset")  # load the dataset
    print(list(dat))  # list the frames

    frame = dat["participants"]  # pull out a frame

    print(type(frame), len(frame))

    field = frame["FirstName"]  # pull out a field of the frame

    print(type(field), len(field))

    print(field.data)  # the "data" member is a proxy for the actual stored data

    for i in range(len(field)):  # we can iterate over the data of the field
        print(i, field.data[i])

    age = frame["Age"]  # pull out another field
    age_filter = age >= 40  # create an array of boolean values to use as a selector

    print("Age filter data:", list(age_filter.data))

    print("Selected ages:", age.apply_filter(age_filter.data[:]).data[:])  # filter the field by the selector

    # filter all fields by the selector, saving as a dictionary
    filtered = {f: frame[f].apply_filter(age_filter.data[:]).data[:] for f in ["FirstName", "LastName", "Age", "id"]}
    # filtered=frame.apply_filter(age_filter.data[:]).items()  # since the dataset is read-only this will fail

    df = pd.DataFrame(filtered)  # convert to pandas for easy viewing

df

['participants', 'tests']
<class 'exetera.core.dataframe.HDF5DataFrame'> 8
<class 'exetera.core.fields.FixedStringField'> 10
<exetera.core.fields.ReadOnlyFieldArray object at 0x7fc91bbc13d0>
0 b'eS5RO30s5l'
1 b'Uq5Xvg104k26'
2 b'qpdOaVbkWxU8P'
3 b'4MfNDqcQuDe1s'
4 b'kQDdlpQQjW'
5 b'xqQjWvI'
6 b'Lc2iriaKGc'
7 b'xvQtag'
8 b'5DdjIHXf2TCvl'
9 b'JXFcQK0KtUf0y'
Age filter data: [True, True, True, True, True, True, False, True, True, False]
Selected ages: [49 82 78 88 56 42 71 86]


Unnamed: 0,FirstName,LastName,Age,id
0,b'eS5RO30s5l',b'eMIr6rSfr',49,b'0'
1,b'Uq5Xvg104k26',b'TvlcCTras0o',82,b'1'
2,b'qpdOaVbkWxU8P',b'BfbMR5vc732',78,b'2'
3,b'4MfNDqcQuDe1s',b'1TloPOuw4XB',88,b'3'
4,b'kQDdlpQQjW',b'FflsTMEBY3',56,b'4'
5,b'xqQjWvI',b'5sa3H67GPVBw',42,b'5'
6,b'xvQtag',b'ADugcwJaNivhb',71,b'7'
7,b'5DdjIHXf2TCvl',b'pdwWWe',86,b'8'
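The filtering step is ordinary boolean-mask selection: one boolean per row, applied identically to every column so that rows stay aligned. The sketch below reproduces it in plain Python with `itertools.compress`, using the ages and ids shown in this notebook's outputs. It only illustrates the idea and does not use ExeTera's `apply_filter`.

```python
from itertools import compress

# Ages and ids of the ten participants, in row order
ages = [49, 82, 78, 88, 56, 42, 20, 71, 86, 30]
ids = list(range(10))

# One boolean per row: True where the row should be kept
age_filter = [age >= 40 for age in ages]

# Apply the same mask to every column so the rows stay aligned
selected_ages = list(compress(ages, age_filter))
selected_ids = list(compress(ids, age_filter))

print(selected_ages)  # [49, 82, 78, 88, 56, 42, 71, 86]
print(selected_ids)   # [0, 1, 2, 3, 4, 5, 7, 8]
```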


Next, as a toy illustration, we merge a frame with itself, working on a copy so that the original data is preserved:

In [16]:
def simple_merge(session, frame_left, frame_right, left_on, right_on, *other_fields, field_suffix="_R"):
    """
    Defines a simple left merge between the left and right frames, using `left_on` and `right_on` as the merge keys.
    If `other_fields` names fields of the right frame, only these are merged; if omitted, all of its fields are merged.
    """
    if len(other_fields) == 0:
        other_fields = list(frame_right)  # the merged fields are read from the right frame

    result = session.merge_left(
        left_on=frame_left[left_on],
        right_on=frame_right[right_on],
        right_fields=tuple(frame_right[f] for f in other_fields),
        right_writers=tuple(frame_right[f].create_like(frame_left, f + field_suffix) for f in other_fields),
    )

    return result


with exetera.core.session.Session() as s:
    src = s.open_dataset("dataset.hdf5", "r", "dataset")
    dest = s.open_dataset("datasetx2.hdf5", "w", "datasetx2")  # open the destination for writing

    sframe = src["participants"]  # pull out the frame to merge
    dest["participants"] = sframe  # create the same frame with data in the destination object
    dframe = dest["participants"]  # get that frame

    result = simple_merge(s, dframe, dframe, "id", "id", "FirstName", "Age", field_suffix="1")  # apply the merge

    df = pd.DataFrame({f: dframe[f].data[:] for f in dframe})  # convert to pandas

df

Unnamed: 0,Age,Age_valid,FirstName,LastName,id,is_employed,j_valid_from,j_valid_to,FirstName1,Age1
0,49,True,b'eS5RO30s5l',b'eMIr6rSfr',b'0',1,1618862000.0,253370800000.0,b'eS5RO30s5l',49
1,82,True,b'Uq5Xvg104k26',b'TvlcCTras0o',b'1',2,1618862000.0,253370800000.0,b'Uq5Xvg104k26',82
2,78,True,b'qpdOaVbkWxU8P',b'BfbMR5vc732',b'2',2,1618862000.0,253370800000.0,b'qpdOaVbkWxU8P',78
3,88,True,b'4MfNDqcQuDe1s',b'1TloPOuw4XB',b'3',1,1618862000.0,253370800000.0,b'4MfNDqcQuDe1s',88
4,56,True,b'kQDdlpQQjW',b'FflsTMEBY3',b'4',1,1618862000.0,253370800000.0,b'kQDdlpQQjW',56
5,42,True,b'xqQjWvI',b'5sa3H67GPVBw',b'5',1,1618862000.0,253370800000.0,b'xqQjWvI',42
6,20,True,b'Lc2iriaKGc',b'UauvtgV',b'6',1,1618862000.0,253370800000.0,b'Lc2iriaKGc',20
7,71,True,b'xvQtag',b'ADugcwJaNivhb',b'7',2,1618862000.0,253370800000.0,b'xvQtag',71
8,86,True,b'5DdjIHXf2TCvl',b'pdwWWe',b'8',1,1618862000.0,253370800000.0,b'5DdjIHXf2TCvl',86
9,30,True,b'JXFcQK0KtUf0y',b'2Gi8urgnAJBp',b'9',2,1618862000.0,253370800000.0,b'JXFcQK0KtUf0y',30
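Conceptually, a left merge looks up each left-hand key in the right-hand frame and copies the matching right-hand values into new columns aligned with the left rows. The sketch below shows this with plain Python lists. `left_merge` is a hypothetical helper, it assumes the right-hand keys are unique, and it is not the algorithm `session.merge_left` uses.

```python
def left_merge(left_keys, right_keys, right_columns, suffix="_R"):
    """Map right-hand columns onto the left-hand key order (hypothetical helper).

    Assumes right-hand keys are unique; unmatched left keys yield None.
    """
    # Row index of each right-hand key
    right_index = {}
    for i, key in enumerate(right_keys):
        right_index.setdefault(key, i)

    merged = {}
    for name, column in right_columns.items():
        merged[name + suffix] = [
            column[right_index[key]] if key in right_index else None
            for key in left_keys
        ]
    return merged


left_ids = [b"0", b"1", b"2"]
right_ids = [b"2", b"0", b"1"]  # same keys, different row order
right_cols = {
    "FirstName": [b"qpdOaVbkWxU8P", b"eS5RO30s5l", b"Uq5Xvg104k26"],
    "Age": [78, 49, 82],
}

result = left_merge(left_ids, right_ids, right_cols, suffix="1")
print(result["FirstName1"])  # [b'eS5RO30s5l', b'Uq5Xvg104k26', b'qpdOaVbkWxU8P']
print(result["Age1"])        # [49, 82, 78]
```

Because the notebook merges the frame with itself, every key matches its own row, so the suffixed columns `FirstName1` and `Age1` simply repeat the originals.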


Here we create a second dataset in a new directory, then merge it with our existing one:

In [10]:
%%bash

mkdir -p other_dataset
generate_dataset randomschema.yaml other_dataset
cd other_dataset

rm -f dataset.hdf5
exetera import -w -s ../exeteraschema.json -i "participants:participants.csv, tests:tests.csv" -o dataset.hdf5

Schema: '<unopened file 'randomschema.yaml' r>'
Output: '/home/localek10/workspace/RandomDataset/examples/other_dataset'
{'participants': 'participants.csv', 'tests': 'tests.csv'}
2021-04-19 19:47:18.758719+00:00
../exeteraschema.json
{'participants': 'participants.csv', 'tests': 'tests.csv'}
loading took 3.123283386230469e-05 seconds
loading took 2.0742416381835938e-05 seconds
0 rows parsed in 0.015840768814086914s
9 rows parsed in 0.1236565113067627s
participants <KeysViewHDF5 ['participants']>
0 rows parsed in 0.023722410202026367s
9 rows parsed in 0.11020088195800781s
tests <KeysViewHDF5 ['participants', 'tests']>
<KeysViewHDF5 ['participants', 'tests']>


In [17]:
with exetera.core.session.Session() as s:
    src1 = s.open_dataset("dataset.hdf5", "r", "dataset")
    src2 = s.open_dataset("other_dataset/dataset.hdf5", "r", "otherdataset")
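    dest = s.open_dataset("datasetx2.hdf5", "w", "datasetx2")  # open the destination for writing

    sframe = src1["participants"]  # frame from the first dataset
    dest["participants"] = sframe  # copy it into the destination
    dframe = dest["participants"]

    otherframe = src2["participants"]  # frame from the second dataset

    # merge FirstName and Age from the second dataset into the copy, matching on id
    result = simple_merge(s, dframe, otherframe, "id", "id", "FirstName", "Age", field_suffix="1")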

    df = pd.DataFrame({f: dframe[f].data[:] for f in dframe})

df

Unnamed: 0,Age,Age_valid,FirstName,LastName,id,is_employed,j_valid_from,j_valid_to,FirstName1,Age1
0,49,True,b'eS5RO30s5l',b'eMIr6rSfr',b'0',1,1618862000.0,253370800000.0,b'hUohc0',34
1,82,True,b'Uq5Xvg104k26',b'TvlcCTras0o',b'1',2,1618862000.0,253370800000.0,b'4t2OLsfCQK',19
2,78,True,b'qpdOaVbkWxU8P',b'BfbMR5vc732',b'2',2,1618862000.0,253370800000.0,b'lHa6C30',89
3,88,True,b'4MfNDqcQuDe1s',b'1TloPOuw4XB',b'3',1,1618862000.0,253370800000.0,b'JVdU4M',69
4,56,True,b'kQDdlpQQjW',b'FflsTMEBY3',b'4',1,1618862000.0,253370800000.0,b'2hxwqatOVKvTQ',79
5,42,True,b'xqQjWvI',b'5sa3H67GPVBw',b'5',1,1618862000.0,253370800000.0,b'scjDjV',45
6,20,True,b'Lc2iriaKGc',b'UauvtgV',b'6',1,1618862000.0,253370800000.0,b'NKim8FB',45
7,71,True,b'xvQtag',b'ADugcwJaNivhb',b'7',2,1618862000.0,253370800000.0,b'5YKYKaSLaq',62
8,86,True,b'5DdjIHXf2TCvl',b'pdwWWe',b'8',1,1618862000.0,253370800000.0,b'tvvmeFDmE',37
9,30,True,b'JXFcQK0KtUf0y',b'2Gi8urgnAJBp',b'9',2,1618862000.0,253370800000.0,b'SWA66SwQ',32
