# ExeTera Import Example

This example demonstrates creating a simple database from a RandomDataset schema, then converting it to an ExeTera database using a JSON schema.

The first step is to import `randomdataset` from the parent of this directory:

In [1]:
import os
import sys

sys.path.append(os.path.abspath(".."))

import randomdataset

Next, the YAML schema that will be used to generate the random data is written out:

In [17]:
%%writefile randomschema.yaml

- typename: randomdataset.generators.CSVGenerator
  num_lines: 10
  dataset:
    name: participants
    typename: randomdataset.Dataset
    fields:
    - name: id
      typename: randomdataset.UIDFieldGen
    - name: FirstName
      typename: randomdataset.StrFieldGen
      lmin: 6
      lmax: 14
    - name: LastName
      typename: randomdataset.StrFieldGen
      lmin: 6
      lmax: 14
    - name: Age
      typename: randomdataset.IntFieldGen
      vmin: 18
      vmax: 90
    - name: is_employed
      typename: randomdataset.BoolFieldGen
      as_string: False
        
- typename: randomdataset.generators.CSVGenerator
  num_lines: 10
  dataset:
    name: tests
    typename: randomdataset.Dataset
    fields:
    - name: id
      typename: randomdataset.UIDFieldGen
    - name: patient_id
      typename: randomdataset.IntFieldGen
      vmin: 0
      vmax: 10
    - name: test_type
      typename: randomdataset.SetFieldGen
      field_type: str
      values: ["Type1", "Type2", "Unknown"]
    - name: location
      typename: randomdataset.StrFieldGen
    - name: result
      typename: randomdataset.IntFieldGen
      vmin: 0
      vmax: 3
    - name: value
      typename: randomdataset.FloatFieldGen
      vmin: 0
      vmax: 9

Overwriting randomschema.yaml
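To make the field parameters in the schema more concrete, here is a minimal standard-library sketch of what rows for the `participants` dataset might look like. It is an illustration only, not `randomdataset`'s implementation: the helper names (`random_str`, `random_participant`, `participants_csv`) are hypothetical, and it assumes that `lmin`/`lmax` bound string length, that `vmin`/`vmax` bound integer values inclusively, and that the UID is a 32-character hex string.

```python
import csv
import io
import random
import string

FIELDS = ["id", "FirstName", "LastName", "Age", "is_employed"]


def random_str(rng, lmin, lmax):
    """Random alphanumeric string with a length in [lmin, lmax] (assumed semantics)."""
    alphabet = string.ascii_letters + string.digits
    return "".join(rng.choice(alphabet) for _ in range(rng.randint(lmin, lmax)))


def random_participant(rng):
    """One row shaped like the `participants` dataset in the YAML schema."""
    return {
        "id": format(rng.getrandbits(128), "032x"),  # assumed: 32 hex characters
        "FirstName": random_str(rng, 6, 14),
        "LastName": random_str(rng, 6, 14),
        "Age": rng.randint(18, 90),  # assumed: both bounds inclusive
        "is_employed": rng.choice([True, False]),
    }


def participants_csv(num_lines, seed=0):
    """Return CSV text with a header row and `num_lines` random rows."""
    rng = random.Random(seed)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for _ in range(num_lines):
        writer.writerow(random_participant(rng))
    return out.getvalue()


print(participants_csv(3))
```

The real generators may differ in alphabet, bounds handling, and output formatting, so treat this only as a reading aid for the schema.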


Instead of invoking the `generate_dataset` command line utility, the command can be called directly through the imported library:

In [20]:
# !generate_dataset randomschema.yaml .
randomdataset.application.generate_dataset.callback("randomschema.yaml", ".")

Schema: 'randomschema.yaml'
Output: '.'


Next, the ExeTera schema is written out. It resembles the YAML schema but adds specifiers for primary and foreign keys:

In [23]:
%%writefile exeteraschema.json

{
  "exetera": {
    "version": "1.0.0"
  },
  "schema": {
    "participants": {
      "primary_keys": [
        "id"
      ],
      "fields": {
        "id": {
          "field_type": "fixed_string",
          "length": 32
        },
        "FirstName": {
          "field_type": "string"
        },
        "LastName": {
          "field_type": "string"
        },
        "Age": {
          "field_type": "numeric",
          "value_type": "int32"
        },
        "is_employed": {
          "field_type": "categorical",
          "categorical": {
            "value_type": "int8",
            "strings_to_values": {
              "False": 1,
              "True": 2
            }
          }
        }
      }
    },
    "tests": {
      "primary_keys": [
        "id"
      ],
      "foreign_keys": {
        "patient_id": {
          "space": "participants",
          "key": "id"
        }
      },
      "fields": {
        "id": {
          "field_type": "fixed_string",
          "length": 32
        },
        "patient_id": {
          "field_type": "fixed_string",
          "length": 32
        },
        "test_type": {
          "field_type": "categorical",
          "categorical": {
            "value_type": "int8",
            "strings_to_values": {
              "Unknown": 0,
              "Type1": 1,
              "Type2": 2
            }
          }
        },
        "location": {
          "field_type": "string"
        },
        "result": {
          "field_type": "numeric",
          "value_type": "int32"
        },
        "value": {
          "field_type": "numeric",
          "value_type": "float32"
        } 
      }
    } 
  }
}


Overwriting exeteraschema.json
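Because the JSON schema is written by hand, a quick structural check can catch slips such as a key that names a missing field or a foreign key that points at an undefined table. The sketch below is not part of ExeTera and does not reproduce whatever validation ExeTera itself performs; `check_schema` is a hypothetical helper that only relies on the layout visible in the schema above (`schema` → table → `primary_keys`, `foreign_keys`, `fields`).

```python
import json


def check_schema(doc):
    """Return a list of human-readable problems found in a schema dict."""
    problems = []
    tables = doc.get("schema", {})
    for table_name, table in tables.items():
        fields = table.get("fields", {})
        # Every primary key should name a field of the same table.
        for key in table.get("primary_keys", []):
            if key not in fields:
                problems.append(f"{table_name}: primary key '{key}' is not a field")
        # Every foreign key should be a local field and point at a known table and field.
        for fk_name, target in table.get("foreign_keys", {}).items():
            if fk_name not in fields:
                problems.append(f"{table_name}: foreign key '{fk_name}' is not a field")
            space, key = target.get("space"), target.get("key")
            if space not in tables:
                problems.append(
                    f"{table_name}: foreign key '{fk_name}' refers to unknown table '{space}'"
                )
            elif key not in tables[space].get("fields", {}):
                problems.append(
                    f"{table_name}: foreign key '{fk_name}' refers to unknown field '{space}.{key}'"
                )
    return problems


# A tiny made-up schema in the same layout, used only to exercise the check.
example = json.loads("""
{
  "schema": {
    "a": {"primary_keys": ["id"], "fields": {"id": {"field_type": "string"}}},
    "b": {
      "primary_keys": ["id"],
      "foreign_keys": {"a_id": {"space": "a", "key": "id"}},
      "fields": {"id": {"field_type": "string"}, "a_id": {"field_type": "string"}}
    }
  }
}
""")
for problem in check_schema(example):
    print(problem)
```

To run the same check on the file written above, load it with `json.load` and pass the resulting dict to `check_schema`.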


The import command is then run to convert the CSV files into the HDF5 dataset:

In [6]:
%%bash

rm -f dataset.hdf5
../../ExeTera_mine_old/exetera/bin/exetera import -s exeteraschema.json -i "participants:participants.csv, tests:tests.csv" -o dataset.hdf5
ls -lh

{'participants': 'participants.csv', 'tests': 'tests.csv'}
2021-04-12 13:42:45.209943+00:00
exeteraschema.json
{'participants': 'participants.csv', 'tests': 'tests.csv'}
loading took 2.5033950805664062e-05 seconds
loading took 2.3126602172851562e-05 seconds
0 rows parsed in 0.019425392150878906s
9 rows parsed in 0.09913277626037598s
participants <KeysViewHDF5 ['participants']>
0 rows parsed in 0.020839691162109375s
9 rows parsed in 0.10392403602600098s
tests <KeysViewHDF5 ['participants', 'tests']>
<KeysViewHDF5 ['participants', 'tests']>
total 207M
-rw-r--r-- 1 localek10 bioeng 207M Apr 12 14:42 dataset.hdf5
-rw-r--r-- 1 localek10 bioeng  17K Apr 12 14:42 exetera_import.ipynb
-rw-r--r-- 1 localek10 bioeng 1.7K Apr 12 12:31 exeteraschema.json
-rw-r--r-- 1 localek10 bioeng  391 Apr 10 20:17 participants.csv
-rw-r--r-- 1 localek10 bioeng 1.5K Apr  7 23:24 randomdataset.zip
-rw-r--r-- 1 localek10 bioeng 1.3K Apr 10 20:17 randomschema.yaml
-rw-r--r-- 1 localek10 bioeng  450 Apr 10 20:17 te

The resulting dataset can then be opened with ExeTera to check its contents:

In [2]:
sys.path.append(os.path.abspath("../../ExeTera"))
import exetera, exetera.core, exetera.processing
import pandas as pd

In [5]:
with exetera.core.session.Session() as s:
    dat = s.open_dataset("dataset.hdf5", "r", "dataset")
    print(dat.keys(), list(dat))
    
    d = dat["participants"]["FirstName"]
    field = s.get(d)
    print(len(field))  # correctly reports 10
    print([field.data[i] for i in range(len(field))])  # works
    print(list(field.data))  # raises ValueError, although it should be equivalent to the line above

dict_keys(['participants', 'tests']) ['participants', 'tests']
10
['LBEPVfhEQ6wns', '5IN786f', '6Ty8EO1haBV', 'KuLyontosIrXl', 'MAfu02TY6', 'Ybhqx1x', 'dv6Jw5s', 'uqdWqkkRGp', 'fNlksFT', 'jTv5YQh8c']
/participants/FirstName: unexpected exception index is out of range


ValueError: index is out of range
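
One possible explanation for this difference, not confirmed against ExeTera's source: when an object defines `__getitem__` but not `__iter__`, Python iterates it by calling `__getitem__` with 0, 1, 2, ... and stops only when an `IndexError` is raised. An object that raises `ValueError` for an out-of-range index would therefore work with explicit indexing but fail inside `list()`. The sketch below reproduces that behaviour with hypothetical stand-in classes (`ValueErrorSeq`, `IndexErrorSeq`) that are not ExeTera code.

```python
class ValueErrorSeq:
    """Stand-in sequence that raises ValueError when indexed out of range."""

    def __init__(self, items):
        self._items = list(items)

    def __len__(self):
        return len(self._items)

    def __getitem__(self, i):
        if not 0 <= i < len(self._items):
            raise ValueError("index is out of range")
        return self._items[i]


class IndexErrorSeq(ValueErrorSeq):
    """Same data, but signals the end of the sequence with IndexError."""

    def __getitem__(self, i):
        try:
            return super().__getitem__(i)
        except ValueError as e:
            raise IndexError(str(e)) from None


def values_by_index(seq):
    """Workaround used in the cell above: index explicitly over range(len(seq))."""
    return [seq[i] for i in range(len(seq))]


seq = ValueErrorSeq(["a", "b", "c"])
print(values_by_index(seq))  # explicit indexing never goes out of range

try:
    list(seq)  # iteration probes one index past the end
except ValueError as e:
    print("list() failed:", e)

print(list(IndexErrorSeq(["a", "b", "c"])))  # IndexError ends iteration cleanly
```

If this is indeed the cause, indexing over `range(len(field))` as in the cell above remains a workable way to read the values.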