<h1> 1, Import </h1>
<p>ExeTera uses the HDF5 file format to achieve fast performance when processing data. Hence, the first step in using ExeTera is usually to convert files from other formats, e.g. csv, into HDF5.</p>
<p>ExeTera provides utilities to convert csv data into HDF5, either from the command line or from code. </p>
<b>How import works</b>
<p>
a. Importing via the exetera import command:  <br>
<em>
exetera import \  <br>
-s path/to/covid_schema.json \  <br>
-i "patients:path/to/patient_data.csv, assessments:path/to/assessmentdata.csv, tests:path/to/covid_test_data.csv, diet:path/to/diet_study_data.csv" \  <br>
-o /path/to/output_dataset_name.hdf5 \  <br>
--include "patients:(id,country_code,blood_group), assessments:(id,patient_id,chest_pain)" \  <br>
--exclude "tests:(country_code)" </em>   <br>

Arguments:  <br>
-s/--schema: The location and name of the schema file  <br>
-te/--territories: If set, this only imports the listed territories. If left unset, all territories are imported  <br>
-i/--inputs: A comma-separated list of 'name:file' pairs. This should be enclosed in quotes if it contains any whitespace. See the example above.  <br>
-o/--output_hdf5: The path and name of the file to which the resulting hdf5 dataset should be written  <br>
-ts/--timestamp: An override for the timestamp to be written (defaults to datetime.now(timezone.utc))  <br>
-w/--overwrite: If set, overwrite any existing dataset with the same name; appends to existing dataset otherwise  <br>
-n/--include: If set, filters out all fields apart from those in the list.  <br>
-x/--exclude: If set, filters out the fields in this list.  <br>
</p>
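The 'name:file' format used by -i/--inputs can be illustrated with a small parser. This is only a sketch of the format described above, not ExeTera's own argument handling; the function name parse_name_file_pairs is hypothetical.

```python
def parse_name_file_pairs(inputs: str) -> dict:
    """Split a comma-separated list of 'name:file' pairs into a dict.

    Hypothetical helper for illustration only; ExeTera's real parser may differ.
    """
    pairs = {}
    for token in inputs.split(','):
        token = token.strip()
        if not token:
            continue  # tolerate a trailing comma
        # Split on the first colon only, so paths that contain ':' survive
        name, sep, path = token.partition(':')
        if not sep or not name.strip() or not path.strip():
            raise ValueError(f"expected 'name:file', got {token!r}")
        pairs[name.strip()] = path.strip()
    return pairs


example = "patients:path/to/patient_data.csv, assessments:path/to/assessmentdata.csv"
print(parse_name_file_pairs(example))
```

Whitespace after each comma is stripped by the sketch, which is why the whole argument has to be quoted on the command line: the shell would otherwise split it into several arguments.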

<p>
b. Importing through code <br>
Use <em> importer.import_with_schema(timestamp, output_hdf5_name, schema, tokens, overwrite, include_fields, exclude_fields) </em>
</p>


<b>Import example</b>
<br>
For the import example, please refer to the example in RandomDataset. After you finish, please copy the hdf5 file here to continue.

In [None]:
!ls *hdf5

<h1>2, ExeTera Session and DataSet</h1>
<p> 
Session instances are the top-level ExeTera class. They serve two main purposes: <br>

1, Functionality for creating / opening / closing Dataset objects, as well as managing the lifetime of open datasets <br>
2, Methods that operate on Fields <br>
</p>
<h3>Creating a session object</h3>
<p>
A Session object can be created in multiple ways, but we recommend that you wrap the session in a context manager (with statement). This allows the Session object to automatically manage the datasets that you have opened, closing them all once the with statement is exited. Opening and closing datasets is very fast. When working in Jupyter notebooks or JupyterLab, feel free to create a new Session object for each cell. <br>
</p>
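The benefit of the with statement can be illustrated with a toy stand-in. The ToySession class below is hypothetical and is not ExeTera's implementation; it only mimics the behaviour described above, where everything opened through the session is closed when the with block exits.

```python
class ToySession:
    """Hypothetical stand-in for a session; not ExeTera's Session class."""

    def __init__(self):
        self.open_names = []

    def open(self, name):
        # Record the name so it can be closed later
        self.open_names.append(name)
        return name

    def close(self):
        # Close everything that is still open
        self.open_names.clear()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False  # do not suppress exceptions raised inside the block


with ToySession() as s:
    s.open('ds1')
    s.open('ds2')
    still_open_inside = list(s.open_names)

print(still_open_inside)  # names open while inside the with block
print(s.open_names)       # empty once the with block has exited
```

Because __exit__ also runs when the block raises, nothing stays open after an error, which is the main reason to prefer the with form.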

In [None]:
# you should have exetera installed already, otherwise: pip install exetera
from exetera.core.session import Session

# recommended
with Session() as s:
  ...

# not recommended
s = Session()  

<h3>Loading dataset(s)</h3>
Once you have a session, the next step is typically to open a dataset. Datasets can be opened in one of three modes: <br>

read - the dataset can be read from but not written to <br>
append - the dataset can be read from and written to <br>
write - a new dataset is created (and will overwrite an existing dataset with the same name) <br>

In [None]:
with Session() as s:
  ds1 = s.open_dataset('user_assessments.hdf5', 'r', 'ds1')

<h3>Closing a dataset</h3>
Closing a dataset is done through Session.close_dataset, as follows:

In [None]:
with Session() as s:
  ds1 = s.open_dataset('user_assessments.hdf5', 'r', 'ds1')

  # do some work
  print(ds1.keys())

  s.close_dataset('ds1')

<h3>Dataset</h3>
ExeTera works with HDF5 datasets under the hood, and the Dataset class is the means by which you interact with them at the top level. Each Dataset instance corresponds to a physical dataset that has been created or opened through a call to session.open_dataset. <br>

Datasets are in turn used to create, access and delete DataFrames. Each DataFrame is a top-level HDF5 group that is intended to feel familiar to users of the Pandas DataFrame.

In [None]:
from exetera.core import dataset

with Session() as s:
    ds = s.open_dataset('temp.hdf5', 'w', 'ds')

    #Create a new dataframe
    df = ds.create_dataframe('foo')
    print(ds.keys())

    #Rename a dataframe
    ds['bar'] = ds['foo'] # internally performs a rename
    print('Renamed:', ds.keys())
    dataset.move(ds['bar'], ds, 'foo')
    print('Moved:', ds.keys())

    #Copy a dataframe within a dataset
    dataset.copy(ds['foo'], ds, 'bar')
    print('Copied:', ds.keys())
    
    #Delete an existing dataframe
    ds.delete_dataframe(ds['foo'])
    print('Dataframe foo deleted.', ds.keys())

    #Copy a dataframe between datasets
    ds2 = s.open_dataset('temp2.hdf5', 'w', 'ds2')
    ds2['foobar'] = ds['bar']
    print('Copied:', ds.keys())
    print('Copied:', ds2.keys())

<h1> 3, DataFrame and Fields </h1>
The ExeTera DataFrame object is intended to be familiar to users of Pandas, albeit not identical. <br>

ExeTera works with Datasets, which are backed by physical key-value HDF5 datastores on disk, and, as such, there are necessarily some differences from the Pandas DataFrame: <br>

- Pandas DataFrames enforce that all Series (Fields in ExeTera terms) are the same length. ExeTera doesn't require this, but some operations do not make sense unless all fields are of the same length. ExeTera allows DataFrames to have fields of different lengths because operations such as applying a filter to a whole DataFrame would run out of memory on large DataFrames <br>
- Types always matter in ExeTera. When creating new Fields (Pandas Series) you need to specify the type of the field that you would like to create. Fortunately, Fields have convenience methods to construct empty copies of themselves for when you need to create a field of a compatible type <br>
- ExeTera DataFrames are new with the 0.5 release of ExeTera and do not yet support all of the operations that Pandas DataFrames support. This functionality will be augmented in future releases. <br>

In [None]:
import numpy as np
with Session() as s:
    ds = s.open_dataset('temp.hdf5', 'w', 'ds')
    
    #Create a new field
    df = ds.create_dataframe('df')
    i_f = df.create_indexed_string('i_foo')
    f_f = df.create_fixed_string('f_foo', 8)
    n_f = df.create_numeric('n_foo', 'int32')
    c_f = df.create_categorical('c_foo', 'int8', {b'a': 0, b'b': 1})
    t_f = df.create_timestamp('t_foo')


    #Copy a field from another dataframe 
    df2 = ds.create_dataframe('df2')
    df2['foo'] = df['i_foo']
    df2['foobar'] = df2['foo']
    print(df2.keys())


    #Apply a filter to all fields in a dataframe
    df = ds.create_dataframe('df3')
    df.create_numeric('n_foo', 'int32').data.write([0,1,2,3,4,5,6,7,8,9])
    filt = np.array([True if i%2==0 else False for i in range(0,10)])  # filter out odd values
    df4 = ds.create_dataframe('df4')
    df.apply_filter(filt, ddf=df4) # creates a new dataframe from the filtered dataframe
    print('Original:', df['n_foo'].data[:])
    print('Filtered: ',df4['n_foo'].data[:])
    df.apply_filter(filt) # destructively filters the dataframe
    print('Original:', df['n_foo'].data[:])


    #Re-index all fields in a dataframe
    df = ds.create_dataframe('df5')
    df.create_numeric('n_foo', 'int32').data.write([0,1,2,3,4,5,6,7,8,9])
    print('Before re-index:', df['n_foo'].data[:])
    inds =  np.array([9,8,7,6,5,4,3,2,1,0])
    df6 = ds.create_dataframe('df6')
    df.apply_index(inds, ddf=df6) # creates a new dataframe from the re-indexed dataframe
    print('Re-indexed:', df6['n_foo'].data[:])
    df.apply_index(inds) # destructively re-indexes the dataframe
    print('Re-indexed:', df['n_foo'].data[:])
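
The row selection performed by the filter and re-index steps above can be sketched with plain Python lists. This assumes that a boolean filter keeps the rows flagged True and that an index array places the value at position inds[k] into output row k, as with NumPy boolean and integer indexing; it is an illustration of the semantics, not ExeTera code.

```python
from itertools import compress

values = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# Boolean filter: keep only the positions whose flag is True (the even values here)
filt = [i % 2 == 0 for i in range(10)]
filtered = list(compress(values, filt))

# Re-index: output row k takes the value found at position inds[k]
inds = [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
reindexed = [values[i] for i in inds]

print(filtered)
print(reindexed)
```

A filter can shorten the data while a re-index only reorders it, which is why the filtered and unfiltered fields in the cell above end up with different lengths.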

<h3>Fields </h3>
The Field object is the ExeTera analogue of the Pandas Series or the Numpy ndarray. Fields contain (often very large) arrays of a given data type, with an API that allows intuitive manipulations of the data. <br>

<br>
In order to store very large data arrays as efficiently as possible, Fields store their data in ways that may not be intuitive to people familiar with Pandas or Numpy. Numpy makes certain design decisions that reduce the flexibility of lists in order to gain speed and memory efficiency, and ExeTera does the same to further improve on speed and memory. The IndexedStringField, for example, uses two arrays: one containing the concatenated byte values of all of the strings in the field, and another containing indices that indicate where each string starts and ends. This is much faster and more memory efficient to iterate over than a Numpy string array when the variability of string lengths is very high. This kind of change, however, creates a great deal of complexity when exposed to the user, so Field does its best to hide it and act like a single array of string values. 
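
The two-array layout described above can be sketched in a few lines of plain Python. This is an illustration of the idea only: the names values, indices and get_string are chosen for the example, and the convention of storing one more index than there are strings is an assumption rather than a statement about ExeTera's internals.

```python
strings = [b'a', b'', b'hello', b'exetera']

# One flat run of bytes holding every string back to back
values = b''.join(strings)

# Offsets into that run: string i occupies values[indices[i]:indices[i + 1]]
indices = [0]
for s in strings:
    indices.append(indices[-1] + len(s))


def get_string(i):
    """Recover the i-th string from the flat layout."""
    return values[indices[i]:indices[i + 1]]


print(values)
print(indices)
print([get_string(i) for i in range(len(strings))])
```

No padding is stored, so a field mixing very short and very long strings uses only as many bytes as its content, whereas a fixed-width string array must size every entry for the longest string.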

<h3>Accessing underlying data</h3>
Underlying data can be accessed as follows: <br>

- All fields have a data property that provides access to the underlying data that they contain. For most field types, it is very efficient to read from and write to this property, provided it is done using slice syntax <br>
- Indexed string fields provide data as a convenience method, but this should only be used when performance is not a consideration <br>
- Indexed string fields provide indices and values properties should you need to interact with their underlying data efficiently and directly. For the most part, we discourage this and have tried to provide you with all of the methods that you need under the hood <br>

In [None]:
with Session() as s:
    ds = s.open_dataset('temp.hdf5', 'w', 'ds')
    df = ds.create_dataframe('df')
    df.create_numeric('field', 'int32').data.write([0,1,2,3,4,5,6,7,8,9])
    print(df['field'].data[:])

<h3>Constructing compatible empty fields</h3>
Fields have a create_like method that can be used to construct an empty field of a compatible type: <br>

- When called with no arguments, this creates an in-memory field that can be further manipulated before eventually being assigned to a DataFrame (or not) <br>
- When called with a DataFrame and a name, it creates an empty field of the given name on that DataFrame, which can subsequently be written to <br>

See below for examples.

In [None]:
with Session() as s:
    ds = s.open_dataset('temp.hdf5', 'w', 'ds')
    df = ds.create_dataframe('df')
    df.create_numeric('field', 'int32').data.write([0,1,2,3,4,5,6,7,8,9])
    df['field'].create_like(df, 'field2')  # use create_like to create a field with similar data type
    print(df['field'].data[:])
    print(df['field2'].data[:])  # note the data is not copied

<h3>Arithmetic operations </h3>
Numeric and timestamp fields have the standard set of arithmetic operations that can be applied to them: <br>

These are +, -, *, /, //, %, and divmod

In [None]:
with Session() as s:
    ds = s.open_dataset('temp.hdf5', 'w', 'ds')
    df = ds.create_dataframe('df')
    df.create_numeric('a', 'int32').data.write([0,1,2,3,4,5,6,7,8,9])
    df.create_numeric('b', 'int32').data.write([0,1,2,3,4,5,6,7,8,9])

    df['c'] = df['a'] + df['b']
    print(df['c'].data[:])

<h3>Element-wise logical operators</h3>
Numeric fields can have logical operations performed on them on an element-wise basis <br>

These are &, |, ^

In [None]:
with Session() as s:
    ds = s.open_dataset('temp.hdf5', 'w', 'ds')
    df = ds.create_dataframe('df')
    df.create_numeric('a', 'bool').data.write([True if i%2 == 0 else False for i in range(0,10)])
    df.create_numeric('b', 'bool').data.write([True if i%2 == 0 else False for i in range(0,10)])

    filter1 = df['a'] & df['b']
    print(filter1.data[:])

In [None]:
with Session() as s:
    ds = s.open_dataset('temp.hdf5', 'w', 'ds')
    df = ds.create_dataframe('df')
    df.create_numeric('a', 'bool').data.write([True if i%2 == 0 else False for i in range(0,10)])
    df.create_numeric('b', 'bool').data.write([True if i%2 == 1 else False for i in range(0,10)])

    filter1 = df['a'] | df['b']
    print(filter1.data[:])

<h3>Comparison operators</h3>
Numeric, categorical and timestamp fields have comparison operations that can be applied to them: <br>

These are <, <=, ==, !=, >=, >

In [None]:
with Session() as s:
    ds = s.open_dataset('temp.hdf5', 'w', 'ds')
    df = ds.create_dataframe('df')
    df.create_numeric('a', 'bool').data.write([True if i%2 == 0 else False for i in range(0,10)])
    df.create_numeric('b', 'bool').data.write([True if i%2 == 1 else False for i in range(0,10)])

    filter1 = df['a'] ==  df['b']
    print(filter1.data[:])
