[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/CO-CONNECT/co-connect-tools/HEAD)


## Introduction

This notebook documents the ETL transform to the CDM using the classes defined in `co-connect-tools`, as an example of how those classes can be used. Developers can follow this worked example, substituting their own rules file and input files.


## Installing

The best way to install the module is via `pip`.

In [None]:
!pip3 install co-connect-tools -q

## Loading the Rules

Given the full path to a `json` file containing the rules, the first step is to load it into a Python `dict`.

In [1]:
import coconnect.tools
import json
import os

coconnect_data_folder = os.path.join(os.path.dirname(coconnect.__file__),'data')

rules = coconnect.tools.load_json(f'{coconnect_data_folder}/test/rules/rules_14June2021.json')
print(json.dumps(rules, indent=2)[0:500])

{
  "metadata": {
    "date_created": "2021-06-14T15:27:37.123947",
    "dataset": "Test"
  },
  "cdm": {
    "observation": [
      {
        "observation_concept_id": {
          "source_table": "Demographics.csv",
          "source_field": "ethnicity",
          "term_mapping": {
            "Asian": 35825508
          }
        },
        "observation_datetime": {
          "source_table": "Demographics.csv",
          "source_field": "date_of_birth"
        },
        "observation_source_co


## Loading the input data

A convenience function is available to create a map between file name and file path for all files in a directory:

In [2]:
f_map = coconnect.tools.get_file_map_from_dir(f'{coconnect_data_folder}/test/inputs/')
print (json.dumps(f_map,indent=6))

{
      "Symptoms.csv": "/Users/calummacdonald/Usher/CO-CONNECT/Software/co-connect-tools/coconnect/data/test/inputs/Symptoms.csv",
      "Covid19_test.csv": "/Users/calummacdonald/Usher/CO-CONNECT/Software/co-connect-tools/coconnect/data/test/inputs/Covid19_test.csv",
      "covid19_antibody.csv": "/Users/calummacdonald/Usher/CO-CONNECT/Software/co-connect-tools/coconnect/data/test/inputs/covid19_antibody.csv",
      "vaccine.csv": "/Users/calummacdonald/Usher/CO-CONNECT/Software/co-connect-tools/coconnect/data/test/inputs/vaccine.csv",
      "Demographics.csv": "/Users/calummacdonald/Usher/CO-CONNECT/Software/co-connect-tools/coconnect/data/test/inputs/Demographics.csv"
}
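For readers who want to build such a map by hand, the following is a minimal sketch using only the standard library. `build_file_map` is a hypothetical helper written for this illustration, not the implementation of `get_file_map_from_dir`, and it is demonstrated on a temporary directory rather than the packaged test data.

```python
import os
import tempfile


def build_file_map(directory):
    """Map each file name in `directory` to its full path (non-recursive)."""
    file_map = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):  # skip sub-directories
            file_map[name] = path
    return file_map


# Demonstrate on a throwaway directory holding two empty CSV files.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ("Demographics.csv", "Symptoms.csv"):
        open(os.path.join(tmp_dir, name), "w").close()
    example_map = build_file_map(tmp_dir)

print(sorted(example_map))
```

Keying the map by file name matters because the rules refer to inputs by name (for example `"source_table": "Demographics.csv"`).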


Use the `f_map` to load all the inputs into a map between file name and dataframe object. This map can also be created manually via any preferred method.

In [3]:
inputs = coconnect.tools.load_csv(f_map)
inputs

{'Symptoms.csv':   PersonID  visit_date symptom1 symptom2 symptom3
 0      101  2020-11-15        Y        Y        Y
 1      102  2020-01-04        Y        Y        Y
 2      103  2020-03-27        Y        Y        Y
 3      104  2020-06-24        N        N        N
 4      105  2020-07-27        Y        Y        Y
 5      108  2020-11-04        N        Y        N
 6      109  2020-12-24        N        N        N
 7      110  2020-02-04        N        N        N,
 'Covid19_test.csv':   PersonID        date         result
 0      101  2020-11-15       POSITIVE
 1      102  2020-01-04       NEGATIVE
 2      103  2020-03-27            POS
 3      104  2020-06-24            NEG
 4      105  2020-07-27       POSITIVE
 5      108  2020-11-04       NEGATIVE
 6      109  2020-12-24  INDETERMINATE
 7      110  2020-02-04       NEGATIVE,
 'covid19_antibody.csv':   PersonID        date ABresult
 0      101  2020-11-29        1
 1      102  2020-04-15        1
 2      103  2020-10-04      
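As one possible manual method, the sketch below builds the same kind of map with plain `pandas.read_csv`. `load_inputs` is a hypothetical helper, and reading with default options is an assumption: the library's own loader may use different settings (for example for dtypes). The two data rows are copied from the `Covid19_test.csv` output above.

```python
import os
import tempfile

import pandas as pd


def load_inputs(file_map):
    """Read every path in `file_map` into a DataFrame, keyed by file name."""
    return {name: pd.read_csv(path) for name, path in file_map.items()}


# Write a small CSV to a temporary directory, then load it through a one-entry map.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "Covid19_test.csv")
    with open(path, "w") as handle:
        handle.write("PersonID,date,result\n")
        handle.write("101,2020-11-15,POSITIVE\n")
        handle.write("102,2020-01-04,NEGATIVE\n")
    manual_inputs = load_inputs({"Covid19_test.csv": path})

print(manual_inputs["Covid19_test.csv"])
```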

## Creating a CDM 

As CO-CONNECT-Tools contains a pythonic version of the CDM, we can create an instance of the `CommonDataModel` class.

In [4]:
from coconnect.cdm import CommonDataModel

cdm = CommonDataModel(name=rules['metadata']['dataset'],
                      inputs=inputs,
                      output_folder='output_dir/')
cdm

2021-08-05 13:19:33 - CommonDataModel - INFO - CommonDataModel created with version 0.0.0


<coconnect.cdm.model.CommonDataModel at 0x11fd95670>

## Adding CDM Objects to the CDM

The next step is to loop over all the rules from the `json`, creating and adding a new CDM object (e.g. Person) to the CDM.

Within the loop, each CDM object's `define` function is set to a lambda that calls `apply_rules`. This means that at run time the tool (via the `CommonDataModel` class) executes the `define` function and so knows how to apply the mapping rules.

In [5]:
from coconnect.cdm import get_cdm_class
from coconnect.tools import apply_rules

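for destination_table,rules_set in rules['cdm'].items():
    # use a distinct loop variable so the loaded `rules` dict is not overwritten
    for i,obj_rules in enumerate(rules_set):
        # create an empty CDM object of the right type (e.g. Person)
        obj = get_cdm_class(destination_table)()
        obj.set_name(f"{destination_table}_{i}")
        # attach the mapping rules for this object
        obj.rules = obj_rules
        # defer the mapping: apply_rules is only called when the CDM is processed
        obj.define = lambda x : apply_rules(x)
        cdm.add(obj)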

2021-08-05 13:19:34 - CommonDataModel - INFO - Added observation_0 of type observation
2021-08-05 13:19:34 - CommonDataModel - INFO - Added observation_1 of type observation
2021-08-05 13:19:34 - CommonDataModel - INFO - Added observation_2 of type observation
2021-08-05 13:19:34 - CommonDataModel - INFO - Added observation_3 of type observation
2021-08-05 13:19:34 - CommonDataModel - INFO - Added observation_4 of type observation
2021-08-05 13:19:34 - CommonDataModel - INFO - Added observation_5 of type observation
2021-08-05 13:19:34 - CommonDataModel - INFO - Added condition_occurrence_0 of type condition_occurrence
2021-08-05 13:19:34 - CommonDataModel - INFO - Added person_0 of type person
2021-08-05 13:19:34 - CommonDataModel -

After the initialisation and creation of the CDM objects, we can see which objects have been registered in the model.

In [6]:
cdm.objects()

{'observation': {'observation_0': <coconnect.cdm.objects.observation.Observation at 0x11fe0f190>,
  'observation_1': <coconnect.cdm.objects.observation.Observation at 0x10710d250>,
  'observation_2': <coconnect.cdm.objects.observation.Observation at 0x11fd95f70>,
  'observation_3': <coconnect.cdm.objects.observation.Observation at 0x11fd95130>,
  'observation_4': <coconnect.cdm.objects.observation.Observation at 0x11fe1ab20>,
  'observation_5': <coconnect.cdm.objects.observation.Observation at 0x11fe1d910>},
 'condition_occurrence': {'condition_occurrence_0': <coconnect.cdm.objects.condition_occurrence.ConditionOccurrence at 0x11fe1db80>},
 'person': {'person_0': <coconnect.cdm.objects.person.Person at 0x11fe1d1c0>,
  'person_1': <coconnect.cdm.objects.person.Person at 0x11fe218b0>}}
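The pattern of attaching rules and a `define` callable now, and only running it later, can be illustrated without the library. The sketch below is a simplified stand-in: `DemoObject`, `demo_apply_rules` and `registry` are hypothetical names, and calling `obj.define(obj)` at processing time is an assumption about how the stored callable is used. The first rules entry mirrors the excerpt of the rules file printed earlier; the second is made up for illustration.

```python
class DemoObject:
    """Hypothetical stand-in for a CDM object; not the coconnect class."""

    def __init__(self, name, rules):
        self.name = name
        self.rules = rules
        self.define = None  # assigned later, called only at processing time


def demo_apply_rules(obj):
    """Hypothetical rule handler: list the destination fields to be mapped."""
    return sorted(obj.rules)


demo_rules = {
    "cdm": {
        "observation": [
            {
                "observation_concept_id": {
                    "source_table": "Demographics.csv",
                    "source_field": "ethnicity",
                    "term_mapping": {"Asian": 35825508},
                },
                "observation_datetime": {
                    "source_table": "Demographics.csv",
                    "source_field": "date_of_birth",
                },
            },
            {  # illustrative second object
                "observation_datetime": {
                    "source_table": "Demographics.csv",
                    "source_field": "date_of_birth",
                },
            },
        ]
    }
}

# Registration: one object per rules entry, grouped by destination table.
registry = {}
for destination_table, rules_set in demo_rules["cdm"].items():
    for i, obj_rules in enumerate(rules_set):
        obj = DemoObject(f"{destination_table}_{i}", obj_rules)
        obj.define = lambda x: demo_apply_rules(x)
        registry.setdefault(destination_table, {})[obj.name] = obj

# Processing: only now are the stored callables executed.
results = {
    name: obj.define(obj)
    for objects in registry.values()
    for name, obj in objects.items()
}
print(results)
```

Because the lambda receives the object as its argument instead of capturing the loop variable, each object is processed with its own rules.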

## Process The CDM

Processing the CDM executes all the objects: a pandas dataframe is created for each object, based on the rules that have been provided.

Importantly, the CDM will also format, finalise and merge the individual dataframes of all the objects.

* Formatting makes sure the columns are in the correct format, e.g. a date is YYYY-MM-DD
* Finalise makes sure 
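The formatting and merging ideas can be pictured with a small standard-library sketch. This is not the library's implementation: `format_date` and `merge_objects` are hypothetical helpers, the `observation_date` field is used only for illustration, and numbering the merged rows sequentially from 1 is an assumption. The person IDs and dates are taken from the test data; which object each row belongs to is illustrative.

```python
from datetime import datetime


def format_date(value):
    """Normalise an ISO date or datetime string to YYYY-MM-DD."""
    return datetime.fromisoformat(value).strftime("%Y-%m-%d")


def merge_objects(tables, id_field):
    """Concatenate the rows of several per-object tables and number them from 1."""
    merged = []
    for rows in tables:
        for row in rows:
            merged.append({id_field: len(merged) + 1, **row})
    return merged


# Two per-object tables destined for the same CDM table.
observation_0 = [{"person_id": 107, "observation_date": "1956-11-12 00:00:00"}]
observation_1 = [{"person_id": 104, "observation_date": "1975-06-07 00:00:00"}]

merged = merge_objects([observation_0, observation_1], "observation_id")
for row in merged:
    row["observation_date"] = format_date(row["observation_date"])

print(merged)
```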


In [7]:
cdm.process()

2021-08-05 13:19:36 - CommonDataModel - INFO - Starting processing in order: ['person', 'observation', 'condition_occurrence']
2021-08-05 13:19:36 - CommonDataModel - INFO - Number of objects to process for each table...
{
      "observation": 6,
      "condition_occurrence": 1,
      "person": 2
}
2021-08-05 13:19:36 - CommonDataModel - INFO - for person: found 2 objects
2021-08-05 13:19:36 - CommonDataModel - INFO - working on person
2021-08-05 13:19:36 - person_0 - INFO - Called apply_rules
2021-08-05 13:19:36 - person_0 - INFO - Mapped birth_datetime
2021-08-05 13:19:36 - person_0 - INFO - Mapped gender_concept_id
2021-08-05 13:19:36 - person_0 - INFO - Mapped gender_source_concept_id
2021-08-05 13:19:36 - person_0 - INFO - M

2021-08-05 13:19:37 - CommonDataModel - INFO - working on condition_occurrence
2021-08-05 13:19:37 - condition_occurrence_0 - INFO - Called apply_rules
2021-08-05 13:19:37 - condition_occurrence_0 - INFO - Mapped condition_concept_id
2021-08-05 13:19:37 - condition_occurrence_0 - INFO - Mapped condition_end_datetime
2021-08-05 13:19:37 - condition_occurrence_0 - INFO - Mapped condition_source_concept_id
2021-08-05 13:19:37 - condition_occurrence_0 - INFO - Mapped condition_source_value
2021-08-05 13:19:37 - condition_occurrence_0 - INFO - Mapped condition_start_datetime
2021-08-05 13:19:37 - condition_occurrence_0 - INFO - Mapped person_id
2021-08-05 13:19:37 - CommonDataModel - INFO - finished condition_occurrence_0 ... 0/1, 4 r

## Inspect Outputs

In [8]:
cdm.keys()

dict_keys(['person', 'observation', 'condition_occurrence'])

In [9]:
cdm['person'].dropna(axis=1,how='all')

| person_id | gender_concept_id | birth_datetime | person_source_value | gender_source_value | gender_source_concept_id | race_source_value | ethnicity_source_value |
|---|---|---|---|---|---|---|---|
| 101 | 8507 | 1951-12-25 00:00:00 | | M | 8507 | | |
| 102 | 8507 | 1981-11-19 00:00:00 | | M | 8507 | | |
| 103 | 8532 | 1997-05-11 00:00:00 | | F | 8532 | | |
| 104 | 8532 | 1975-06-07 00:00:00 | | F | 8532 | | |
| 105 | 8532 | 1976-04-23 00:00:00 | | F | 8532 | | |
| 106 | 8507 | 1966-09-29 00:00:00 | | M | 8507 | | |
| 107 | 8532 | 1956-11-12 00:00:00 | | F | 8532 | | |
| 108 | 8507 | 1985-03-01 00:00:00 | | M | 8507 | | |
| 109 | 8532 | 1950-10-31 00:00:00 | | F | 8532 | | |
| 110 | 8532 | 1993-09-07 00:00:00 | | F | 8532 | | |
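The `dropna(axis=1, how='all')` call used in these cells removes only those columns in which every value is missing. A small sketch on made-up data (the `year_of_birth` column and its contents are purely illustrative) shows the behaviour, including the fact that pandas does not treat empty strings as missing:

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "person_id": [101, 102],
        "gender_source_value": ["M", "M"],
        "year_of_birth": [None, None],  # entirely missing: dropped
        "race_source_value": ["", ""],  # empty strings are not missing: kept
    }
)

# axis=1 works on columns; how="all" drops a column only if all its values are missing.
trimmed = demo.dropna(axis=1, how="all")
print(list(trimmed.columns))
```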


In [10]:
cdm['observation'].dropna(axis=1,how='all')

| observation_id | person_id | observation_concept_id | observation_datetime | value_as_string | observation_source_value | observation_source_concept_id | unit_source_value | qualifier_source_value |
|---|---|---|---|---|---|---|---|---|
| 1 | 107 | 35825508 | 1956-11-12 00:00:00 | | Asian | 35825508 | | |
| 2 | 104 | 35825531 | 1975-06-07 00:00:00 | | Bangladeshi | 35825531 | | |
| 3 | 103 | 35826241 | 1997-05-11 00:00:00 | | Indian | 35826241 | | |
| 4 | 101 | 35827394 | 1951-12-25 00:00:00 | | White | 35827394 | | |
| 5 | 105 | 35827394 | 1976-04-23 00:00:00 | | White | 35827394 | | |
| 6 | 110 | 35827394 | 1993-09-07 00:00:00 | | White | 35827394 | | |
| 7 | 102 | 35825567 | 1981-11-19 00:00:00 | | Black | 35825567 | | |
| 8 | 106 | 35825567 | 1966-09-29 00:00:00 | | Black | 35825567 | | |
| 9 | 108 | 35827395 | 1985-03-01 00:00:00 | | White and Asian | 35827395 | | |


In [11]:
cdm['condition_occurrence'].dropna(axis=1,how='all')

| condition_occurrence_id | person_id | condition_concept_id | condition_start_datetime | condition_end_datetime | stop_reason | condition_source_value | condition_source_concept_id | condition_status_source_value |
|---|---|---|---|---|---|---|---|---|
| 1 | 101 | 254761 | 2020-11-15 00:00:00 | 2020-11-15 00:00:00 | | Y | 254761 | |
| 2 | 102 | 254761 | 2020-01-04 00:00:00 | 2020-01-04 00:00:00 | | Y | 254761 | |
| 3 | 103 | 254761 | 2020-03-27 00:00:00 | 2020-03-27 00:00:00 | | Y | 254761 | |
| 4 | 105 | 254761 | 2020-07-27 00:00:00 | 2020-07-27 00:00:00 | | Y | 254761 | |
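The four rows above are consistent with a term mapping that turns the source value `Y` into concept `254761` and skips every other value. The sketch below reproduces that outcome under stated assumptions: that the rule reads the `symptom1` column of `Symptoms.csv`, that the mapping is `{"Y": 254761}`, and that unmapped values are dropped. `apply_term_mapping` is a hypothetical helper, not the library's `apply_rules`.

```python
# (PersonID, symptom1) pairs copied from the Symptoms.csv input shown earlier.
symptoms = [
    (101, "Y"), (102, "Y"), (103, "Y"), (104, "N"),
    (105, "Y"), (108, "N"), (109, "N"), (110, "N"),
]

# Assumed rule: only the source value "Y" maps to a concept.
term_mapping = {"Y": 254761}


def apply_term_mapping(rows, mapping):
    """Keep rows whose value is in the mapping and attach the concept ID."""
    return [
        {
            "person_id": person_id,
            "condition_concept_id": mapping[value],
            "condition_source_value": value,
        }
        for person_id, value in rows
        if value in mapping
    ]


conditions = apply_term_mapping(symptoms, term_mapping)
print([row["person_id"] for row in conditions])
```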
