# Import Dataset Example

In this example, we convert CSV files into HDF5 using the import utility provided by ExeTera.
We start with two CSV files and one JSON schema file:

In [9]:
!cat users.csv

id,FirstName,LastName,bmi,has_diabetes,height_cm,year_of_birth
0,"Grace","None",39,1,130,1967
1,"Carol","Nobody",38,0,119,1975
2,"Wendy","Random",28,0,128,1926
3,"Mallory","Nobody",25,0,117,1944
4,"Xavier","Unknown",29,1,190,1974
5,"Olivia","Thunk",26,0,107,2004
6,"Xavier","Anon",30,0,175,1973
7,"Xavier","Null",37,0,140,1963
8,"Ivan","Bloggs",37,0,134,1999
9,"Trudy","Bar",28,0,116,1929


In [10]:
!cat assessments.csv

id,date,user_id,abdominal_pain,brain_fog,loss_of_smell,tested_covid_positive,temperature_f
0,2021-10-24 11:45:43.677374+00:00,0,0,1,1,2,103.22149054047082
1,2021-12-16 13:09:58.380573+00:00,1,0,1,0,0,100.62518751030662
2,2021-08-05 17:51:30.943546+00:00,2,0,0,1,0,105.18487884609749
3,2021-04-09 14:47:54.599226+00:00,3,1,0,1,0,96.4302053852154
4,2021-09-29 00:15:42.142405+00:00,4,1,1,1,0,109.63616106818489
5,2021-04-24 09:53:44.215726+00:00,5,1,0,0,1,107.69840121429907
6,2021-11-13 07:35:32.840341+00:00,6,0,0,0,1,97.00309019318361
7,2022-02-14 00:08:04.885913+00:00,7,1,0,0,1,95.22598358524823
8,2022-02-07 15:36:57.841132+00:00,8,0,0,0,2,95.48740949212532
9,2021-02-21 01:48:38.675272+00:00,9,0,1,1,0,106.27664175133276
10,2021-08-05 00:06:12.343504+00:00,0,0,1,1,0,103.07544677653925
11,2021-11-07 21:52:41.868990+00:00,1,1,0,0,2,102.81942527899108
12,2021-05-20 14:49:01.700189+00:00,2,0,0,0,2,103.25591242165508
13,2021-09-28 03:13:05.410689+00:00,3,0,1,1,1,98.99925665317788
14,2022-01-21 1

In [11]:
!cat user_assessments.json


{
  "exetera": {
    "version": "1.0.0"
  },
  "schema": {
    "users": {
      "primary_keys": [
        "id"
      ],
      "fields": {
        "id": {
          "field_type": "fixed_string",
          "length": 32
        },
        "FirstName": {
          "field_type": "string"
        },
        "LastName": {
          "field_type": "string"
        },
        "bmi": {
          "field_type": "numeric",
          "value_type": "int32"
        },
        "has_diabetes": {
          "field_type": "categorical",
          "categorical": {
            "value_type": "int8",
            "strings_to_values": {
              "": 0,
              "False": 1,
              "True": 2
            }
          }
        },
        "height_cm": {
          "field_type": "numeric",
          "value_type": "int32"
        },   
        "year_of_birth": {
          "field_type": "numeric",
          "value_type": "int32"
        }
      }
    },
    "assessments": {
      "primary_keys": [
      



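The field names in the schema are expected to line up with the column names in the CSV header. As a minimal sketch (not part of ExeTera; `missing_columns` is a hypothetical helper introduced here for illustration), the check below compares the `users` field names from the schema above with the header row of `users.csv`, using inline copies of both so that it runs on its own.

```python
import csv
import io

# Field names copied from the "users" table of user_assessments.json above.
schema_fields = ["id", "FirstName", "LastName", "bmi",
                 "has_diabetes", "height_cm", "year_of_birth"]

# Header row copied from users.csv above.
users_header = "id,FirstName,LastName,bmi,has_diabetes,height_cm,year_of_birth\n"


def missing_columns(fields, header_line):
    """Return schema fields that do not appear in the CSV header (hypothetical helper)."""
    columns = next(csv.reader(io.StringIO(header_line)))
    return [name for name in fields if name not in columns]


missing = missing_columns(schema_fields, users_header)
print(missing)  # an empty list means every schema field has a matching column
```

Running a check like this before the import can catch a misspelled field name early; how ExeTera itself reports such a mismatch is not covered here.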
## 1. Import
ExeTera uses the HDF5 file format to achieve fast performance when processing data. Hence, the first step in using ExeTera is usually to transform files from other formats, e.g. CSV, into HDF5.

ExeTera provides utilities to transform CSV data into HDF5, either from the command line or from code.

a. Importing via the exetera import command: 

```

exetera import \
-s path/to/covid_schema.json \
-i "patients:path/to/patient_data.csv, assessments:path/to/assessmentdata.csv, tests:path/to/covid_test_data.csv, diet:path/to/diet_study_data.csv" \
-o /path/to/output_dataset_name.hdf5 \
--include "patients:(id,country_code,blood_group), assessments:(id,patient_id,chest_pain)" \
--exclude "tests:(country_code)"


Arguments:   
-s/--schema: The location and name of the schema file   
-te/--territories: If set, this only imports the listed territories. If left unset, all territories are imported  
-i/--inputs: A comma-separated list of 'name:file' pairs. This should be put in quotes if it contains any whitespace. See the example above.
-o/--output_hdf5: The path and name to where the resulting hdf5 dataset should be written   
-ts/--timestamp: An override for the timestamp to be written (defaults to datetime.now(timezone.utc))   
-w/--overwrite: If set, overwrite any existing dataset with the same name; appends to existing dataset otherwise   
-n/--include: If set, filters out all fields apart from those in the list.  
-x/--exclude: If set, filters out the fields in this list.   

```
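To make the `-i/--inputs` format concrete, here is a minimal sketch of how a comma-separated list of 'name:file' pairs can be split into a name-to-file mapping, which is the same shape as the `files` dictionary passed to `importer.import_with_schema` in the code-based import. The `parse_inputs` function is a hypothetical helper written for this illustration; it is not ExeTera's own parser and may differ from it in edge cases.

```python
def parse_inputs(inputs):
    """Split a 'name:file, name:file' string into a {name: file} dict.

    Hypothetical helper for illustration; this is not ExeTera's own parser.
    """
    files = {}
    for pair in inputs.split(","):
        pair = pair.strip()
        if not pair:
            continue
        # Split on the first colon only, so the file part may itself contain colons.
        name, _, path = pair.partition(":")
        name, path = name.strip(), path.strip()
        if not name or not path:
            raise ValueError(f"expected 'name:file', got {pair!r}")
        files[name] = path
    return files


files = parse_inputs("users:users.csv, assessments:assessments.csv")
print(files)  # {'users': 'users.csv', 'assessments': 'assessments.csv'}
```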



b. Importing through code  

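Use importer.import_with_schema, passing the session, a timestamp, the dataset alias, the output HDF5 filename, the schema file, a mapping of table names to CSV files, and the overwrite flag. Fields to include or exclude can optionally be given as well.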



In [13]:
# 1) Import CSV to HDF5 through the import_with_schema function

import exetera

from exetera.io import importer
from exetera.core import session
from datetime import datetime, timezone

with session.Session() as s:
    importer.import_with_schema(
        session=s,
        timestamp=str(datetime.now(timezone.utc)),
        dataset_alias="UserAssessments",
        dataset_filename="user_assessments.hdf5",
        schema_file="user_assessments.json",
        files={"users": "users.csv", "assessments":"assessments.csv"},
        overwrite=True,
    )

read_file_using_fast_csv_reader: 1 chunks, 10 accumulated_written_rows parsed in 0.0040361881256103516s
completed in 0.007875919342041016 seconds
Total time 0.008165121078491211s
read_file_using_fast_csv_reader: 1 chunks, 30 accumulated_written_rows parsed in 0.00348663330078125s
completed in 0.005882978439331055 seconds
Total time 0.006022214889526367s


In [None]:
%%bash
# 2) Import CSV to HDF5 through the command line; make sure you have
# added ExeTera/exetera/bin/ to the system path

exetera import -w -s user_assessments.json -i "users:users.csv, assessments:assessments.csv" -o user_assessments.hdf5
ls -lh

After running either command, you should have an HDF5 file available:

In [15]:
!ls *hdf5

dataset.hdf5  temp2.hdf5  temp.hdf5  user_assessments.hdf5
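
Beyond listing the file, a quick sanity check from Python is to confirm that the output starts with the 8-byte HDF5 format signature. The sketch below uses only the standard library; `looks_like_hdf5` is a hypothetical helper, and it assumes the file has no user block in front of the signature. It only confirms the file type, not that the tables were imported correctly.

```python
from pathlib import Path

# The 8-byte signature that begins an HDF5 file (when no user block is present).
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def looks_like_hdf5(path):
    """Return True if the file exists and begins with the HDF5 signature."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as f:
        return f.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE


print(looks_like_hdf5("user_assessments.hdf5"))
```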
