# ExeTera Import Example

This example demonstrates creating a simple dataset from a RandomDataset schema, then converting it to an ExeTera database using a matching JSON schema.

The first step is to import `randomdataset` from the parent of this directory:

In [1]:
import os
import sys

sys.path.append(os.path.abspath(".."))

import randomdataset

Next, the YAML schema that will be used to generate the random data is written out:

In [2]:
%%writefile randomschema.yaml

- typename: randomdataset.generators.CSVGenerator
  num_lines: 10
  dataset:
    name: participants
    typename: randomdataset.Dataset
    fields:
    - name: id
      typename: randomdataset.UIDFieldGen
    - name: FirstName
      typename: randomdataset.StrFieldGen
      lmin: 6
      lmax: 14
    - name: LastName
      typename: randomdataset.StrFieldGen
      lmin: 6
      lmax: 14
    - name: Age
      typename: randomdataset.IntFieldGen
      vmin: 18
      vmax: 90
    - name: is_employed
      typename: randomdataset.BoolFieldGen
      as_string: False
        
- typename: randomdataset.generators.CSVGenerator
  num_lines: 10
  dataset:
    name: tests
    typename: randomdataset.Dataset
    fields:
    - name: id
      typename: randomdataset.UIDFieldGen
    - name: patient_id
      typename: randomdataset.IntFieldGen
      vmin: 0
      vmax: 10
    - name: test_type
      typename: randomdataset.SetFieldGen
      field_type: str
      values: ["Type1", "Type2", "Unknown"]
    - name: location
      typename: randomdataset.StrFieldGen
    - name: result
      typename: randomdataset.IntFieldGen
      vmin: 0
      vmax: 3
    - name: value
      typename: randomdataset.FloatFieldGen
      vmin: 0
      vmax: 9

Overwriting randomschema.yaml


Instead of invoking the `generate_dataset` command line utility, the command can be called directly through the imported library:

In [3]:
# !generate_dataset randomschema.yaml .
randomdataset.application.generate_dataset.callback("randomschema.yaml",".")

Schema: 'randomschema.yaml'
Output: '.'
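
Before converting the data, it can be worth confirming what was generated. The sketch below is a minimal check using only the standard library. It assumes the generator writes `participants.csv` and `tests.csv` (named after the datasets in the YAML schema) with a header row and a comma delimiter, which the output above does not confirm. `summarise_csv` is a hypothetical helper, not part of `randomdataset`.

```python
import csv
import os


def summarise_csv(path):
    """Return (column names, number of data rows) for a CSV file with a header row."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        return list(reader.fieldnames or []), len(rows)


# Assumed file names, taken from the dataset names in the YAML schema.
for name in ("participants.csv", "tests.csv"):
    if os.path.exists(name):
        columns, num_rows = summarise_csv(name)
        print(name, columns, num_rows)
    else:
        print(name, "not found")
```

If the assumptions hold, the column names printed for each file should match the field names declared in the schema.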


Next, the ExeTera schema is written out. It mirrors the YAML schema but adds specifiers for primary and foreign keys:

In [4]:
%%writefile exeteraschema.json

{
  "exetera": {
    "version": "1.0.0"
  },
  "schema": {
    "participants": {
      "primary_keys": [
        "id"
      ],
      "fields": {
        "id": {
          "field_type": "fixed_string",
          "length": 32
        },
        "FirstName": {
          "field_type": "string"
        },
        "LastName": {
          "field_type": "string"
        },
        "Age": {
          "field_type": "numeric",
          "value_type": "int32"
        },
        "is_employed": {
          "field_type": "categorical",
          "categorical": {
            "value_type": "int8",
            "strings_to_values": {
              "False": 1,
              "True": 2
            }
          }
        }
      }
    },
    "tests": {
      "primary_keys": [
        "id"
      ],
      "foreign_keys": {
        "patient_id": {
          "space": "participants",
          "key": "id"
        }
      },
      "fields": {
        "id": {
          "field_type": "fixed_string",
          "length": 32
        },
        "patient_id": {
          "field_type": "fixed_string",
          "length": 32
        },
        "test_type": {
          "field_type": "categorical",
          "categorical": {
            "value_type": "int8",
            "strings_to_values": {
              "Unknown": 0,
              "Type1": 1,
              "Type2": 2
            }
          }
        },
        "location": {
          "field_type": "string"
        },
        "result": {
          "field_type": "numeric",
          "value_type": "int32"
        },
        "value": {
          "field_type": "numeric",
          "value_type": "float32"
        } 
      }
    } 
  }
}


Overwriting exeteraschema.json
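
Because the key specifiers are plain JSON, they can be sanity-checked before running the import. The following is a minimal sketch, not ExeTera's own validation: it assumes that `space` in a foreign key names another table in the same schema and that `key` names a field of that table. `check_keys` is a hypothetical helper, and the embedded schema is a cut-down illustration in the same shape as `exeteraschema.json`.

```python
import json

# Cut-down schema in the same shape as exeteraschema.json (illustrative only).
SCHEMA_TEXT = """
{
  "schema": {
    "participants": {
      "primary_keys": ["id"],
      "fields": {"id": {"field_type": "fixed_string", "length": 32}}
    },
    "tests": {
      "primary_keys": ["id"],
      "foreign_keys": {"patient_id": {"space": "participants", "key": "id"}},
      "fields": {
        "id": {"field_type": "fixed_string", "length": 32},
        "patient_id": {"field_type": "fixed_string", "length": 32}
      }
    }
  }
}
"""


def check_keys(schema):
    """Return a list of problems found in the primary and foreign key declarations."""
    tables = schema["schema"]
    problems = []
    for table_name, table in tables.items():
        fields = table.get("fields", {})
        # Every primary key must be a declared field of its own table.
        for key in table.get("primary_keys", []):
            if key not in fields:
                problems.append(f"{table_name}: primary key '{key}' is not a field")
        # Every foreign key must be a declared field and point at an existing table and field.
        for field_name, target in table.get("foreign_keys", {}).items():
            space, key = target["space"], target["key"]
            if field_name not in fields:
                problems.append(f"{table_name}: foreign key '{field_name}' is not a field")
            if space not in tables:
                problems.append(f"{table_name}: '{field_name}' refers to unknown table '{space}'")
            elif key not in tables[space].get("fields", {}):
                problems.append(f"{table_name}: '{field_name}' refers to unknown field '{space}.{key}'")
    return problems


print(check_keys(json.loads(SCHEMA_TEXT)))  # prints [] when every key resolves
```

To check the real file, the same function can be applied to `json.load(open("exeteraschema.json"))`.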


The conversion command is run to produce the HDF5 dataset:

In [5]:
%%bash

rm -f dataset.hdf5
../../ExeTera_mine_old/exetera/bin/exetera import -s exeteraschema.json -i "participants:participants.csv, tests:tests.csv" -o dataset.hdf5
ls -lh

{'participants': 'participants.csv', 'tests': 'tests.csv'}
2021-04-13 15:17:10.319812+00:00
exeteraschema.json
{'participants': 'participants.csv', 'tests': 'tests.csv'}
loading took 2.5510787963867188e-05 seconds
loading took 2.2172927856445312e-05 seconds
0 rows parsed in 0.020249366760253906s
9 rows parsed in 0.10116267204284668s
participants <KeysViewHDF5 ['participants']>
0 rows parsed in 0.02090620994567871s
9 rows parsed in 0.10543584823608398s
tests <KeysViewHDF5 ['participants', 'tests']>
<KeysViewHDF5 ['participants', 'tests']>
total 207M
-rw-r--r-- 1 localek10 bioeng  292 Apr 13 15:02 customers.csv
-rw-r--r-- 1 localek10 bioeng 207M Apr 13 16:17 dataset.hdf5
-rw-r--r-- 1 localek10 bioeng  11K Apr 13 16:02 exetera_import.ipynb
-rw-r--r-- 1 localek10 bioeng 1.7K Apr 13 16:17 exeteraschema.json
-rw-r--r-- 1 localek10 bioeng  391 Apr 13 16:17 participants.csv
-rw-r--r-- 1 localek10 bioeng  793 Apr 13 15:02 paymentschema.yaml
-rw-r--r-- 1 localek10 bioeng  803 Apr 13 15:02 paymen

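With the dataset written, ExeTera is added to the path from a local checkout and imported so the file can be read back:

In [6]: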
sys.path.append(os.path.abspath("../../ExeTera"))
import exetera, exetera.core, exetera.processing
import pandas as pd

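The dataset is then opened in a session and the `FirstName` field of `participants` is inspected. Indexing `field.data` element by element works, but iterating over it directly with `list(field.data)` raises an exception:

In [15]:
with exetera.core.session.Session() as s:
    dat = s.open_dataset("dataset.hdf5", "r", "dataset")
    print(dat.keys(), list(dat))

    field = dat["participants"]["FirstName"]

    print(type(field))
    print(len(field))  # number of strings in the field: 10
    print(len(field.data._index_dataset))  # the index holds one more entry than there are strings
    print(field.data._index_dataset[0], field.data._index_dataset[10])  # first and last index entries

    # Accessing elements by position works, in a comprehension or in a loop
    print([field.data[i] for i in range(len(field))])
    for i in range(len(field)):
        print(i, field.data[i])

    # Iterating directly should be equivalent to the above but raises an exception
    # print(list(field.data))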

dict_keys(['participants', 'tests']) ['participants', 'tests']
<class 'exetera.core.fields.IndexedStringField'>
10
11
0 98
['veUYdBahsbE4', 'FJ1Y8T', 'Qo7E4E8iIiHPW', 'UWSJMDdJkfMs', '2VygIB6O3', '8dAAWV1', 'm34FFc', '7XkCvlG2OUqTp', 'YPykjabOf207', 'a3dyA28W']
0 veUYdBahsbE4
1 FJ1Y8T
2 Qo7E4E8iIiHPW
3 UWSJMDdJkfMs
4 2VygIB6O3
5 8dAAWV1
6 m34FFc
7 7XkCvlG2OUqTp
8 YPykjabOf207
9 a3dyA28W
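
The numbers printed above are consistent with an offsets-based layout for variable-length strings: 10 strings give 11 index entries, and the last entry, 98, equals the total length of all 10 names. The sketch below reproduces that arithmetic in plain Python. It is an inference from the output rather than a description of ExeTera's internals, and `pack_strings` and `get_string` are hypothetical helpers.

```python
first_names = ['veUYdBahsbE4', 'FJ1Y8T', 'Qo7E4E8iIiHPW', 'UWSJMDdJkfMs', '2VygIB6O3',
               '8dAAWV1', 'm34FFc', '7XkCvlG2OUqTp', 'YPykjabOf207', 'a3dyA28W']


def pack_strings(strings):
    """Pack strings into one byte buffer plus an offsets list with len(strings) + 1 entries."""
    offsets = [0]
    buffer = bytearray()
    for s in strings:
        buffer.extend(s.encode("utf-8"))
        offsets.append(len(buffer))  # end of this string, which is the start of the next
    return bytes(buffer), offsets


def get_string(buffer, offsets, i):
    """Recover string i by slicing the buffer between two consecutive offsets."""
    return buffer[offsets[i]:offsets[i + 1]].decode("utf-8")


values, index = pack_strings(first_names)
print(len(index))                    # 11
print(index[0], index[10])           # 0 98
print(get_string(values, index, 1))  # FJ1Y8T
```

Under this layout, element access needs only two adjacent offsets, which would explain why `field.data[i]` works for every `i` in `range(len(field))`.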
