# Exetera Import Example

This example demonstrates creating a simple database with a RandomDataset schema, then converting it to an Exetera database using a JSON schema.

The first step is to import `randomdataset` from the parent of this directory:

In [1]:
import os
import sys

sys.path.append(os.path.abspath(".."))

import randomdataset

Next, the YAML schema used to generate the random data is written out:

In [2]:
%%writefile randomschema.yaml

- typename: randomdataset.generators.CSVGenerator
  num_lines: 10
  dataset:
    name: participants
    typename: randomdataset.Dataset
    fields:
    - name: id
      typename: randomdataset.UIDFieldGen
    - name: FirstName
      typename: randomdataset.StrFieldGen
      lmin: 6
      lmax: 14
    - name: LastName
      typename: randomdataset.StrFieldGen
      lmin: 6
      lmax: 14
    - name: Age
      typename: randomdataset.IntFieldGen
      vmin: 18
      vmax: 90
    - name: is_employed
      typename: randomdataset.BoolFieldGen
      as_string: False
        
- typename: randomdataset.generators.CSVGenerator
  num_lines: 10
  dataset:
    name: tests
    typename: randomdataset.Dataset
    fields:
    - name: id
      typename: randomdataset.UIDFieldGen
    - name: patient_id
      typename: randomdataset.IntFieldGen
      vmin: 0
      vmax: 10
    - name: test_type
      typename: randomdataset.SetFieldGen
      field_type: str
      values: ["Type1", "Type2", "Unknown"]
    - name: location
      typename: randomdataset.StrFieldGen
    - name: result
      typename: randomdataset.IntFieldGen
      vmin: 0
      vmax: 3
    - name: value
      typename: randomdataset.FloatFieldGen
      vmin: 0
      vmax: 9

Overwriting randomschema.yaml
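To make the structure of the YAML schema easier to see, the sketch below walks over an already-parsed schema and lists the fields of each dataset. It is only an illustration: `summarise_schema` and `example` are hypothetical names, and `example` is a hand-written, cut-down stand-in for what a YAML parser would return for the file above.

```python
def summarise_schema(generators):
    """Map each dataset name to (field name, short generator name) pairs."""
    summary = {}
    for generator in generators:
        dataset = generator["dataset"]
        summary[dataset["name"]] = [
            # Keep only the last component of the dotted typename.
            (field["name"], field["typename"].rsplit(".", 1)[-1])
            for field in dataset["fields"]
        ]
    return summary


# Hand-written stand-in for a parsed, cut-down version of the schema above.
example = [
    {
        "typename": "randomdataset.generators.CSVGenerator",
        "num_lines": 10,
        "dataset": {
            "name": "participants",
            "typename": "randomdataset.Dataset",
            "fields": [
                {"name": "id", "typename": "randomdataset.UIDFieldGen"},
                {
                    "name": "Age",
                    "typename": "randomdataset.IntFieldGen",
                    "vmin": 18,
                    "vmax": 90,
                },
            ],
        },
    }
]

for name, fields in summarise_schema(example).items():
    print(name, fields)
```

Each top-level entry describes one generator, and each generator carries one dataset, so the full schema above would yield two entries: `participants` and `tests`.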


Instead of invoking the `generate_dataset` command-line utility, the same command can be called directly through the imported library:

In [3]:
# !generate_dataset randomschema.yaml .
randomdataset.application.generate_dataset.callback("randomschema.yaml", ".")

Schema: 'randomschema.yaml'
Output: '.'
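Before converting the generated files, it can be useful to look at the first few rows. The helper below is a minimal sketch using only the standard library; `preview_csv` is a hypothetical name, and it assumes the first row of each CSV file is a header, which is not confirmed here.

```python
import csv
from itertools import islice
from pathlib import Path


def preview_csv(path, limit=3):
    """Return up to `limit` rows of a CSV file as dicts keyed by the header row."""
    with open(path, newline="") as handle:
        # Assumption: the first line of the file holds the column names.
        return [dict(row) for row in islice(csv.DictReader(handle), limit)]


# Only preview files that are actually present in the working directory.
for name in ("participants.csv", "tests.csv"):
    if Path(name).exists():
        for row in preview_csv(name):
            print(name, row)
```

All values come back as strings, since the CSV files carry no type information. The types are supplied later by the Exetera schema.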


Next, the Exetera schema is written out. It describes the same tables and fields as the YAML schema, with extra specifiers for primary and foreign keys:

In [4]:
%%writefile exeteraschema.json

{
  "exetera": {
    "version": "1.0.0"
  },
  "schema": {
    "participants": {
      "primary_keys": [
        "id"
      ],
      "fields": {
        "id": {
          "field_type": "fixed_string",
          "length": 32
        },
        "FirstName": {
          "field_type": "string"
        },
        "LastName": {
          "field_type": "string"
        },
        "Age": {
          "field_type": "numeric",
          "value_type": "int32"
        },
        "is_employed": {
          "field_type": "categorical",
          "categorical": {
            "value_type": "int8",
            "strings_to_values": {
              "False": 1,
              "True": 2
            }
          }
        }
      }
    },
    "tests": {
      "primary_keys": [
        "id"
      ],
      "foreign_keys": {
        "patient_id": {
          "space": "patients",
          "key": "id"
        }
      },
      "fields": {
        "id": {
          "field_type": "fixed_string",
          "length": 32
        },
        "patient_id": {
          "field_type": "fixed_string",
          "length": 32
        },
        "test_type": {
          "field_type": "categorical",
          "categorical": {
            "value_type": "int8",
            "strings_to_values": {
              "Unknown": 0,
              "Type1": 1,
              "Type2": 2
            }
          }
        },
        "location": {
          "field_type": "string"
        },
        "result": {
          "field_type": "numeric",
          "value_type": "int32"
        },
        "value": {
          "field_type": "numeric",
          "value_type": "float32"
        } 
      }
    } 
  }
}


Overwriting exeteraschema.json
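Because the key specifiers are written by hand, a quick consistency check can catch mistakes before the import is run. The sketch below is not part of Exetera: `check_schema` is a hypothetical helper, and it assumes that a foreign key's `space` is meant to name another table in the same schema and that its `key` names a field of that table.

```python
import json
import os


def check_schema(schema_doc):
    """Return a list of human-readable problems found in a parsed schema."""
    problems = []
    tables = schema_doc.get("schema", {})
    for table_name, table in tables.items():
        fields = table.get("fields", {})
        for key in table.get("primary_keys", []):
            if key not in fields:
                problems.append(f"{table_name}: primary key '{key}' is not a declared field")
        for column, target in table.get("foreign_keys", {}).items():
            if column not in fields:
                problems.append(f"{table_name}: foreign key '{column}' is not a declared field")
            space = target.get("space")
            key = target.get("key")
            # Assumption: 'space' should match a table name in this schema.
            if space not in tables:
                problems.append(f"{table_name}: foreign key '{column}' refers to unknown table '{space}'")
            elif key not in tables[space].get("fields", {}):
                problems.append(f"{table_name}: foreign key '{column}' refers to unknown field '{key}' in '{space}'")
    return problems


# Small illustrative schema with a deliberately wrong foreign key target.
id_field = {"field_type": "fixed_string", "length": 32}
example = {
    "schema": {
        "participants": {"primary_keys": ["id"], "fields": {"id": id_field}},
        "tests": {
            "primary_keys": ["id"],
            "foreign_keys": {"participant_id": {"space": "missing_table", "key": "id"}},
            "fields": {"id": id_field, "participant_id": id_field},
        },
    }
}
print(check_schema(example))

# Run the same check on the file written above, if it is present.
if os.path.exists("exeteraschema.json"):
    with open("exeteraschema.json") as handle:
        for problem in check_schema(json.load(handle)):
            print(problem)
```

An empty list means the checks passed. It does not guarantee that Exetera will accept the schema, only that the keys refer to things the schema itself declares.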


The conversion command is run to produce the HDF5 dataset:

In [5]:
%%bash

rm -f dataset.hdf5
exetera import -s exeteraschema.json -i "participants:participants.csv, tests:tests.csv" -o dataset.hdf5
ls -lh

{'participants': 'participants.csv', 'tests': 'tests.csv'}
2021-04-13 20:50:30.026856+00:00
exeteraschema.json
{'participants': 'participants.csv', 'tests': 'tests.csv'}
loading took 2.7418136596679688e-05 seconds
loading took 2.1219253540039062e-05 seconds
0 rows parsed in 0.02012801170349121s
9 rows parsed in 0.10176253318786621s
participants <KeysViewHDF5 ['participants']>
0 rows parsed in 0.025080204010009766s
9 rows parsed in 0.11012411117553711s
tests <KeysViewHDF5 ['participants', 'tests']>
<KeysViewHDF5 ['participants', 'tests']>
total 207M
-rw-r--r-- 1 localek10 bioeng  292 Apr 13 15:02 customers.csv
-rw-r--r-- 1 localek10 bioeng 207M Apr 13 21:50 dataset.hdf5
-rw-r--r-- 1 localek10 bioeng  12K Apr 13 21:50 exetera_import.ipynb
-rw-r--r-- 1 localek10 bioeng 1.7K Apr 13 21:50 exeteraschema.json
-rw-r--r-- 1 localek10 bioeng  413 Apr 13 21:50 participants.csv
-rw-r--r-- 1 localek10 bioeng  793 Apr 13 15:02 paymentschema.yaml
-rw-r--r-- 1 localek10 bioeng  803 Apr 13 15:02 paymen
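The same import can also be driven from Python rather than a `%%bash` cell. The sketch below only assembles the argument list shown above (`-s`, `-i`, `-o`) and hands it to `subprocess`; `build_import_command` and `run_import` are hypothetical helper names, and no other command-line options are assumed.

```python
import subprocess


def build_import_command(schema, inputs, output):
    """Build the `exetera import` argument list used in the cell above."""
    # The -i value is a comma-separated list of table:file pairs.
    pairs = ", ".join(f"{table}:{path}" for table, path in inputs.items())
    return ["exetera", "import", "-s", schema, "-i", pairs, "-o", output]


def run_import(command):
    """Run the command, raising CalledProcessError if it exits with an error."""
    subprocess.run(command, check=True)


command = build_import_command(
    "exeteraschema.json",
    {"participants": "participants.csv", "tests": "tests.csv"},
    "dataset.hdf5",
)
print(command)
# run_import(command)  # uncomment once the schema and CSV files exist
```

Passing the arguments as a list avoids shell quoting issues with the space inside the `-i` value.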

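With the dataset created, it can be opened through an Exetera session to list its tables and read back the `FirstName` field of `participants`:

In [6]: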
import exetera, exetera.core, exetera.processing
import pandas as pd

In [8]:
with exetera.core.session.Session() as s:
    dat = s.open_dataset("dataset.hdf5", "r", "dataset")
    print(list(dat))

    field = dat["participants"]["FirstName"]
    
    print(type(field))
    print(len(field))

    for i in range(len(field)):
        print(i, field.data[i])

['participants', 'tests']
<class 'exetera.core.fields.IndexedStringField'>
10
0 eS5RO30s5l
1 Uq5Xvg104k26
2 qpdOaVbkWxU8P
3 4MfNDqcQuDe1s
4 kQDdlpQQjW
5 xqQjWvI
6 Lc2iriaKGc
7 xvQtag
8 5DdjIHXf2TCvl
9 JXFcQK0KtUf0y
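Reading one field at a time extends naturally to reading whole rows. The sketch below relies only on the two operations used above, `len(field)` and `field.data[i]`. `table_to_rows` is a hypothetical helper, and `FakeField` is a stand-in so the example runs without the HDF5 file; it is not an Exetera class.

```python
def table_to_rows(table, names):
    """Build one dict per row from the named fields of a table."""
    fields = [table[name] for name in names]
    length = len(fields[0]) if fields else 0
    return [
        {name: field.data[i] for name, field in zip(names, fields)}
        for i in range(length)
    ]


class FakeField:
    """Stand-in exposing only `len(field)` and `field.data[i]`."""

    def __init__(self, values):
        self.data = list(values)

    def __len__(self):
        return len(self.data)


# Made-up values, used here in place of dat["participants"].
fake_table = {
    "FirstName": FakeField(["Ann", "Bob"]),
    "Age": FakeField([34, 51]),
}
for row in table_to_rows(fake_table, ["FirstName", "Age"]):
    print(row)
```

Inside the session above, the call would be `table_to_rows(dat["participants"], ["FirstName", "Age"])`, and the resulting list of dicts can be passed to `pd.DataFrame` for further inspection.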
