[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/CO-CONNECT/co-connect-tools/HEAD)


## Introduction

If you would prefer to run the ETL transform to the CDM interactively, for example in a Python notebook, you can follow this worked example, substituting your own rules file and input files.


## Installing

The recommended way to install the module is via `pip`.

In [None]:
!pip3 install co-connect-tools -q

## Loading the Rules

Given the full path to a `json` file containing the rules, the first step is to load it into a Python `dict`.

In [10]:
import coconnect.tools
import json
import os

coconnect_data_folder = os.path.join(os.path.dirname(coconnect.__file__),'data')

rules = coconnect.tools.load_json(f'{coconnect_data_folder}/test/rules/rules_14June2021.json')
print(json.dumps(rules, indent=2)[0:500])

{
  "metadata": {
    "date_created": "2021-06-14T15:27:37.123947",
    "dataset": "Test"
  },
  "cdm": {
    "observation": [
      {
        "observation_concept_id": {
          "source_table": "Demographics.csv",
          "source_field": "ethnicity",
          "term_mapping": {
            "Asian": 35825508
          }
        },
        "observation_datetime": {
          "source_table": "Demographics.csv",
          "source_field": "date_of_birth"
        },
        "observation_source_co
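
To make the structure of the rules easier to see, the following is a minimal sketch that walks a rules dictionary and lists which source table and field feed each destination field. The `rules_fragment` dictionary is hand-copied from the printed output above, and `summarise_rules` is a hypothetical helper written for this illustration; it is not part of `coconnect.tools`.

```python
# A hand-copied fragment mirroring the printed rules above (illustrative only).
rules_fragment = {
    "metadata": {"date_created": "2021-06-14T15:27:37.123947", "dataset": "Test"},
    "cdm": {
        "observation": [
            {
                "observation_concept_id": {
                    "source_table": "Demographics.csv",
                    "source_field": "ethnicity",
                    "term_mapping": {"Asian": 35825508},
                },
                "observation_datetime": {
                    "source_table": "Demographics.csv",
                    "source_field": "date_of_birth",
                },
            }
        ]
    },
}


def summarise_rules(rules):
    """Hypothetical helper: one (table, field, source table, source field) tuple per rule."""
    summary = []
    for destination_table, rule_sets in rules["cdm"].items():
        for rule_set in rule_sets:
            for destination_field, rule in rule_set.items():
                summary.append(
                    (destination_table, destination_field,
                     rule["source_table"], rule["source_field"])
                )
    return summary


for row in summarise_rules(rules_fragment):
    print(row)
```

Each destination table maps to a list of rule sets, which is why the later loop over `rules['cdm']` enumerates them.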


## Loading the input data

A convenience function is available to create a map between a file name and a file path for all files in a directory:

In [13]:
f_map = coconnect.tools.get_file_map_from_dir(f'{coconnect_data_folder}/test/inputs/')
print (json.dumps(f_map,indent=6))

{
      "Symptoms.csv": "/Users/calummacdonald/Usher/CO-CONNECT/Software/docs/public/docs/CoConnectTools/source_code/coconnect/data/test/inputs/Symptoms.csv",
      "Covid19_test.csv": "/Users/calummacdonald/Usher/CO-CONNECT/Software/docs/public/docs/CoConnectTools/source_code/coconnect/data/test/inputs/Covid19_test.csv",
      "covid19_antibody.csv": "/Users/calummacdonald/Usher/CO-CONNECT/Software/docs/public/docs/CoConnectTools/source_code/coconnect/data/test/inputs/covid19_antibody.csv",
      "vaccine.csv": "/Users/calummacdonald/Usher/CO-CONNECT/Software/docs/public/docs/CoConnectTools/source_code/coconnect/data/test/inputs/vaccine.csv",
      "Demographics.csv": "/Users/calummacdonald/Usher/CO-CONNECT/Software/docs/public/docs/CoConnectTools/source_code/coconnect/data/test/inputs/Demographics.csv"
}
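
If you need to build such a map yourself, for instance to include only some of the files, a rough equivalent can be written with the standard library. This is a sketch of the idea only: `build_file_map` is a hypothetical name, and it is not a description of how `get_file_map_from_dir` is implemented.

```python
import os
import tempfile


def build_file_map(directory):
    """Map each file name in `directory` to its absolute path."""
    file_map = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):  # skip sub-directories
            file_map[name] = os.path.abspath(path)
    return file_map


# Demonstrate on a throwaway directory holding two empty CSV files.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ("Demographics.csv", "Symptoms.csv"):
        open(os.path.join(tmp_dir, name), "w").close()
    example_map = build_file_map(tmp_dir)
    print(list(example_map))
```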


Use `f_map` to load all the inputs into a map between the file name and a dataframe object. This map can also be created manually via any preferred method.

In [16]:
inputs = coconnect.tools.load_csv(f_map)
inputs

{'Symptoms.csv':   PersonID  visit_date symptom1 symptom2 symptom3
 0      101  2020-11-15        Y        Y        Y
 1      102  2020-01-04        Y        Y        Y
 2      103  2020-03-27        Y        Y        Y
 3      104  2020-06-24        N        N        N
 4      105  2020-07-27        Y        Y        Y
 5      108  2020-11-04        N        Y        N
 6      109  2020-12-24        N        N        N
 7      110  2020-02-04        N        N        N,
 'Covid19_test.csv':   PersonID        date         result
 0      101  2020-11-15       POSITIVE
 1      102  2020-01-04       NEGATIVE
 2      103  2020-03-27            POS
 3      104  2020-06-24            NEG
 4      105  2020-07-27       POSITIVE
 5      108  2020-11-04       NEGATIVE
 6      109  2020-12-24  INDETERMINATE
 7      110  2020-02-04       NEGATIVE,
 'covid19_antibody.csv':   PersonID        date ABresult
 0      101  2020-11-29        1
 1      102  2020-04-15        1
 2      103  2020-10-04      
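
As an illustration of creating the inputs manually, the sketch below builds the same kind of name-to-dataframe dictionary directly with pandas. The tiny CSV file is a stand-in written on the fly, and reading every column as text (`dtype=str`) is an assumption made for this sketch, not necessarily what `coconnect.tools.load_csv` does.

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write a tiny stand-in file so the sketch runs anywhere.
    path = os.path.join(tmp_dir, "Covid19_test.csv")
    with open(path, "w") as handle:
        handle.write("PersonID,date,result\n"
                     "101,2020-11-15,POSITIVE\n"
                     "102,2020-01-04,NEGATIVE\n")

    manual_f_map = {"Covid19_test.csv": path}

    # dtype=str keeps identifiers and dates as text, so nothing is
    # silently reinterpreted on load (a choice made for this sketch).
    manual_inputs = {
        name: pd.read_csv(file_path, dtype=str)
        for name, file_path in manual_f_map.items()
    }

print(manual_inputs["Covid19_test.csv"])
```

A dictionary built this way has the same shape as `inputs`: file names as keys and dataframes as values.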

## Creating a CDM 

Because CO-CONNECT-Tools contains a pythonic version of the CDM, we can create an instance of the `CommonDataModel` class.

In [17]:
from coconnect.cdm import CommonDataModel

cdm = CommonDataModel(name=rules['metadata']['dataset'],
                      inputs=inputs,
                      output_folder='output_dir/')
cdm

2021-07-28 10:35:57 - CommonDataModel - INFO - CommonDataModel created


<coconnect.cdm.model.CommonDataModel at 0x1188c44f0>

## Adding CDM Objects to the CDM

The next step is to loop over all the rules from the `json`, creating and adding a new CDM object (e.g. Person) to the CDM.

Within the loop, each CDM object's `define` function is set to a lambda that applies the rules. This means that at run time the tool, via the `CommonDataModel` class, will execute the `define` function and so know how to apply the mapping rules.

In [18]:
from coconnect.cdm import get_cdm_class
from coconnect.tools import apply_rules

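for destination_table,rules_set in rules['cdm'].items():
    # use a distinct loop variable so the top-level `rules` dict is not overwritten
    for i,obj_rules in enumerate(rules_set):
        obj = get_cdm_class(destination_table)()
        obj.set_name(f"{destination_table}_{i}")
        obj.rules = obj_rules
        # define is only stored here; it is executed later, at run time
        obj.define = lambda x : apply_rules(x)
        cdm.add(obj)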

2021-07-28 10:38:05 - CommonDataModel - INFO - Added observation_0 of type observation
2021-07-28 10:38:05 - CommonDataModel - INFO - Added observation_1 of type observation
2021-07-28 10:38:05 - CommonDataModel - INFO - Added observation_2 of type observation
2021-07-28 10:38:05 - CommonDataModel - INFO - Added observation_3 of type observation
2021-07-28 10:38:05 - CommonDataModel - INFO - Added observation_4 of type observation
2021-07-28 10:38:05 - CommonDataModel - INFO - Added observation_5 of type observation
2021-07-28 10:38:05 - CommonDataModel - INFO - Added condition_occurrence_0 of type condition_occurrence
2021-07-28 10:38:05 - CommonDataModel - INFO - Added person_0 of type person
2021-07-28 10:38:05 - CommonDataModel -
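
The deferred-execution pattern used in the loop can be sketched in plain Python. Everything here is hypothetical and only for illustration: `SketchObject` and `sketch_apply_rules` are stand-ins, not the `coconnect` classes or the real `apply_rules`. The point is that assigning `define` stores a function without running it; the work happens only when the object is later executed.

```python
class SketchObject:
    """Hypothetical stand-in for a CDM object; not the coconnect class."""

    def __init__(self, name, rules):
        self.name = name
        self.rules = rules
        self.define = None  # assigned later, called only at processing time

    def execute(self):
        # The object passes itself to its define function.
        return self.define(self)


def sketch_apply_rules(obj):
    """Hypothetical rule application: report which source fields would be read."""
    return {dest: rule["source_field"] for dest, rule in obj.rules.items()}


rule_sets = [{"observation_datetime": {"source_field": "date_of_birth"}}]

registered = []
for i, rule_set in enumerate(rule_sets):
    obj = SketchObject(f"observation_{i}", rule_set)
    obj.define = lambda x: sketch_apply_rules(x)  # stored, not executed
    registered.append(obj)

# Nothing has been applied yet; the rules are only applied here.
results = {obj.name: obj.execute() for obj in registered}
print(results)
```

Because the lambda receives the object as its argument rather than capturing the loop variable, each object applies its own rules when executed.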

After initialising and creating the CDM objects, we can see which objects have been registered in the model.

In [20]:
cdm.objects()

{'observation': {'observation_0': <coconnect.cdm.objects.observation.Observation at 0x11887b490>,
  'observation_1': <coconnect.cdm.objects.observation.Observation at 0x11873fbb0>,
  'observation_2': <coconnect.cdm.objects.observation.Observation at 0x11879df40>,
  'observation_3': <coconnect.cdm.objects.observation.Observation at 0x11879dd00>,
  'observation_4': <coconnect.cdm.objects.observation.Observation at 0x11879d850>,
  'observation_5': <coconnect.cdm.objects.observation.Observation at 0x1188d5e20>},
 'condition_occurrence': {'condition_occurrence_0': <coconnect.cdm.objects.condition_occurrence.ConditionOccurrence at 0x1188d5c70>},
 'person': {'person_0': <coconnect.cdm.objects.person.Person at 0x1188dc430>,
  'person_1': <coconnect.cdm.objects.person.Person at 0x11879d940>}}

## Process The CDM

Processing the CDM executes all the objects, creating a pandas dataframe for each object based on the rules that have been provided.

Importantly, the CDM will also format, finalise and merge the individual dataframes for each object.

* Formatting makes sure the columns are in the correct format, e.g. a date is YYYY-MM-DD
* Finalise makes sure 


In [None]:
cdm.process()
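
To give a feel for what the formatting step means for a date column, here is a minimal pandas sketch that normalises datetime strings to `YYYY-MM-DD`. It is an assumption-laden illustration, not the formatting code that `CommonDataModel` runs: the column names and input values are made up for the example.

```python
import pandas as pd

# Made-up input: two datetime strings and one missing value.
raw = pd.DataFrame(
    {"observation_datetime": ["2020-11-15 10:30:00", "2020-01-04 08:00:00", None]}
)

# Parse with an explicit format; values that cannot be parsed become NaT.
parsed = pd.to_datetime(
    raw["observation_datetime"], format="%Y-%m-%d %H:%M:%S", errors="coerce"
)

# Render the parsed values as YYYY-MM-DD strings; missing values stay missing.
formatted = raw.assign(observation_date=parsed.dt.strftime("%Y-%m-%d"))
print(formatted)
```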

## Inspect Outputs

In [None]:
cdm.keys()

In [None]:
cdm['person'].dropna(axis=1,how='all')

In [None]:
cdm['observation'].dropna(axis=1,how='all')

In [None]:
cdm['condition_occurrence'].dropna(axis=1,how='all')
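
The `dropna(axis=1, how='all')` call used in the cells above removes only those columns in which every value is missing, which keeps the displayed tables compact. The small example below shows the effect on a made-up table; the column names and values are illustrative and are not output from the CDM.

```python
import pandas as pd

table = pd.DataFrame({
    "person_id": [1, 2],
    "gender_concept_id": [8507, None],  # partly filled: kept
    "race_concept_id": [None, None],    # entirely empty: dropped
})

# axis=1 works on columns; how="all" drops a column only if all values are missing.
trimmed = table.dropna(axis=1, how="all")
print(list(trimmed.columns))
```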