[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/CO-CONNECT/carrot-cdm/HEAD)


## Introduction

This Python notebook documents the ETL transform to the CDM using the classes defined in `carrot-cdm`, as an example of how those classes can be used. Developers can follow this workbook, substituting their own rules file and input files.


## Installing

The simplest way to install the module is via `pip`.

In [1]:
!pip3 install carrot-cdm -q

## Loading the Rules

Given the full path to a `json` file containing the rules, the first step is to load it into a Python `dict`.

In [2]:
import carrot.tools
import json
import os

carrot.data_folder = os.path.join(os.path.dirname(carrot.__file__),'data')

rules = carrot.tools.load_json(f'{carrot.data_folder}/test/rules/rules_14June2021.json')
print(json.dumps(rules, indent=6))

{
      "metadata": {
            "date_created": "2021-06-14T15:27:37.123947",
            "dataset": "Test"
      },
      "cdm": {
            "observation": {
                  "observation_0": {
                        "observation_concept_id": {
                              "source_table": "Demographics.csv",
                              "source_field": "ethnicity",
                              "term_mapping": {
                                    "Asian": 35825508
                              }
                        },
                        "observation_datetime": {
                              "source_table": "Demographics.csv",
                              "source_field": "date_of_birth"
                        },
                        "observation_source_concept_id": {
                              "source_table": "Demographics.csv",
                              "source_field": "ethnicity",
                              "term_mapping": {
                         

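The nesting printed above (destination table, then object name, then destination field) can be walked with ordinary dictionary iteration. The following is a minimal sketch, not part of `carrot-cdm`: `sample_rules` is a hand-written fragment mirroring the first entries of the output above, and `summarise_rules` is a hypothetical helper.

```python
# Hand-written fragment mirroring the first entries of the rules output above.
sample_rules = {
    "metadata": {"dataset": "Test"},
    "cdm": {
        "observation": {
            "observation_0": {
                "observation_concept_id": {
                    "source_table": "Demographics.csv",
                    "source_field": "ethnicity",
                    "term_mapping": {"Asian": 35825508},
                },
                "observation_datetime": {
                    "source_table": "Demographics.csv",
                    "source_field": "date_of_birth",
                },
            }
        }
    },
}


def summarise_rules(rules):
    """Hypothetical helper: flatten the nested rules into readable rows."""
    rows = []
    for destination_table, objects in rules["cdm"].items():
        for object_name, fields in objects.items():
            for destination_field, rule in fields.items():
                rows.append(
                    (
                        destination_table,
                        object_name,
                        destination_field,
                        rule["source_table"],
                        rule["source_field"],
                        "term_mapping" in rule,  # does this field map terms?
                    )
                )
    return rows


summary = summarise_rules(sample_rules)
for row in summary:
    print(row)
```

Each row shows which source table and field feed a destination field, and whether a term mapping is applied to the source values.
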
## Loading the input data

The ETL tool takes pandas dataframes as input and provides a helper for loading CSV files.


### CSV

A convenience function is available to create a map between a file name and a file path for all files in a directory:

In [3]:
f_map = carrot.tools.get_file_map_from_dir(f'{carrot.data_folder}/test/inputs/')
print (json.dumps(f_map,indent=6))

{
      "Symptoms.csv": "/Users/calummacdonald/Usher/CO-CONNECT/Software/carrot-cdm/carrot.data/test/inputs/Symptoms.csv",
      "Covid19_test.csv": "/Users/calummacdonald/Usher/CO-CONNECT/Software/carrot-cdm/carrot.data/test/inputs/Covid19_test.csv",
      "covid19_antibody.csv": "/Users/calummacdonald/Usher/CO-CONNECT/Software/carrot-cdm/carrot.data/test/inputs/covid19_antibody.csv",
      "vaccine.csv": "/Users/calummacdonald/Usher/CO-CONNECT/Software/carrot-cdm/carrot.data/test/inputs/vaccine.csv",
      "Demographics.csv": "/Users/calummacdonald/Usher/CO-CONNECT/Software/carrot-cdm/carrot.data/test/inputs/Demographics.csv"
}
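
For intuition, a map of this shape can also be built by hand. The sketch below is not the implementation of `get_file_map_from_dir`; `build_file_map` is a hypothetical helper that assumes only `.csv` files directly inside the directory are wanted.

```python
import os
import tempfile


def build_file_map(directory):
    """Hypothetical helper: map each CSV file name to its full path."""
    return {
        name: os.path.join(directory, name)
        for name in sorted(os.listdir(directory))
        if name.endswith(".csv")
    }


# Demonstrate on a throwaway directory holding two empty CSV files and one other file.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ("Symptoms.csv", "Demographics.csv", "notes.txt"):
        open(os.path.join(tmp_dir, name), "w").close()
    example_map = build_file_map(tmp_dir)

print(sorted(example_map))
```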


Use the `f_map` to load all the inputs into a map between the file name and a dataframe object. This map can also be created manually via any preferred method.

In [4]:
inputs = carrot.tools.load_csv(f_map)
inputs

<carrot.tools.file_helpers.InputData at 0x1063687f0>

In [5]:
inputs.keys()

dict_keys(['Symptoms.csv', 'Covid19_test.csv', 'covid19_antibody.csv', 'vaccine.csv', 'Demographics.csv'])

In [6]:
inputs['Symptoms.csv']

Unnamed: 0,PersonID,visit_date,symptom1,symptom2,symptom3
0,16dc368a89b428b2485484313ba67a3912ca03f2b2b424...,2020-11-15 00:00:00.000000,Y,Y,Y
1,37834f2f25762f23e1f74a531cbe445db73d6765ebe608...,2020-01-04 00:00:00.000000,Y,Y,Y
2,454f63ac30c8322997ef025edff6abd23e0dbe7b8a3d51...,2020-03-27 00:00:00.000000,Y,Y,Y
3,5ef6fdf32513aa7cd11f72beccf132b9224d33f271471f...,2020-06-24 00:00:00.000000,N,N,N
4,1253e9373e781b7500266caa55150e08e210bc8cd8cc70...,2020-07-27 00:00:00.000000,Y,Y,Y
...,...,...,...,...,...
795,62f6d46c48c7d9ff3d09a408d0ec880f167a5dc9c8fd34...,2020-11-04 00:00:00.000000,N,Y,N
796,c62510afc57db491f9f993387b76dd9a7d08f09c013269...,2020-07-27 00:00:00.000000,Y,Y,Y
797,bdc5d8a48c23897906b09a9a3680bd2e9c8b3121edbda3...,2020-03-27 00:00:00.000000,Y,Y,Y
798,fa88d374b9cf5e059fad4a2fe406feae4c49cbf4803083...,2020-12-24 00:00:00.000000,N,N,N


### Chunked CSV

For large datasets, it is better to load the data in chunks so as not to overload your computer's memory. This can be achieved by supplying a `chunksize` argument:


In [7]:
inputs_chunked = carrot.tools.load_csv(f_map,chunksize=100)
inputs_chunked['Symptoms.csv']

Unnamed: 0,PersonID,visit_date,symptom1,symptom2,symptom3
0,16dc368a89b428b2485484313ba67a3912ca03f2b2b424...,2020-11-15 00:00:00.000000,Y,Y,Y
1,37834f2f25762f23e1f74a531cbe445db73d6765ebe608...,2020-01-04 00:00:00.000000,Y,Y,Y
2,454f63ac30c8322997ef025edff6abd23e0dbe7b8a3d51...,2020-03-27 00:00:00.000000,Y,Y,Y
3,5ef6fdf32513aa7cd11f72beccf132b9224d33f271471f...,2020-06-24 00:00:00.000000,N,N,N
4,1253e9373e781b7500266caa55150e08e210bc8cd8cc70...,2020-07-27 00:00:00.000000,Y,Y,Y
...,...,...,...,...,...
95,a4e00d7e6aa82111575438c5e5d3e63269d4c475c718b2...,2020-11-04 00:00:00.000000,N,Y,N
96,8bcbb4c131df56f7c79066016241cc4bdf4e58db55c4f6...,2020-07-27 00:00:00.000000,Y,Y,Y
97,a4e00d7e6aa82111575438c5e5d3e63269d4c475c718b2...,2020-01-04 00:00:00.000000,Y,Y,Y
98,5a39cadd1b007093db50744797c7a04a34f73b35ed4447...,2020-12-24 00:00:00.000000,N,N,N


Internally, the `InputData` object moves to the next slice of data while the process is running, until all data has been processed.

In [8]:
inputs_chunked.next()

In [9]:
inputs_chunked['Symptoms.csv']

Unnamed: 0,PersonID,visit_date,symptom1,symptom2,symptom3
100,43974ed74066b207c30ffd0fed5146762e6c60745ac977...,2020-03-27 00:00:00.000000,Y,Y,Y
101,fc56dbc6d4652b315b86b71c8d688c1ccdea9c5f1fd077...,2020-02-04 00:00:00.000000,N,N,N
102,f8809aff4d69bece79dabe35be0c708b890d7eafb841f1...,2020-06-24 00:00:00.000000,N,N,N
103,5cf4e26bd3d87da5e03f80a43a64f1220a1f4ba9e1d634...,2020-11-15 00:00:00.000000,Y,Y,Y
104,f8809aff4d69bece79dabe35be0c708b890d7eafb841f1...,2020-07-27 00:00:00.000000,Y,Y,Y
...,...,...,...,...,...
195,a0f8b2c4cb1ac82abdb37f0fe5203b97be556c4468c83b...,2020-12-24 00:00:00.000000,N,N,N
196,4c15f47afe7f817fd559e12ddbc276f4930c5822f20490...,2020-01-04 00:00:00.000000,Y,Y,Y
197,983bd614bb5afece5ab3b6023f71147cd7b6bc2314f9d2...,2020-11-04 00:00:00.000000,N,Y,N
198,c3ea99f86b2f8a74ef4145bb245155ff5f91cd856f2875...,2020-06-24 00:00:00.000000,N,N,N
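
The chunking behaviour can be pictured with plain pandas. This sketch is an analogy rather than the `InputData` implementation: it relies only on `pandas.read_csv` returning an iterator of dataframes when `chunksize` is given, and it uses a small in-memory CSV invented for the illustration.

```python
import io

import pandas as pd

# Tiny made-up CSV standing in for a large input file.
csv_text = "PersonID,symptom1\n" + "".join(f"p{i},Y\n" for i in range(5))

# With chunksize set, read_csv returns an iterator instead of one dataframe.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=2)

first_chunk = next(reader)   # rows 0-1
second_chunk = next(reader)  # rows 2-3; the row index continues from the first chunk

print(first_chunk)
print(second_chunk)
```

As in the outputs above, the second slice keeps the original row numbers instead of restarting from zero.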


### SQL

Alternatively, if your input data is not in `csv` format, you can load the data yourself from SQL / Spark / DataBricks views etc.

First, initialise an input data handler object:

In [10]:
inputs_sql = carrot.tools.file_helpers.InputData()
inputs_sql

<carrot.tools.file_helpers.InputData at 0x11fd1db80>

Load your data, for example from a PostgreSQL server, using `sqlalchemy` and pandas:

In [11]:
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://localhost:5432/carrot.data_test')
df = pd.read_sql("my_data_table",engine)
df

Unnamed: 0,PersonID,visit_date,symptom1,symptom2,symptom3
0,16dc368a89b428b2485484313ba67a3912ca03f2b2b424...,2020-11-15 00:00:00.000000,Y,Y,Y
1,37834f2f25762f23e1f74a531cbe445db73d6765ebe608...,2020-01-04 00:00:00.000000,Y,Y,Y
2,454f63ac30c8322997ef025edff6abd23e0dbe7b8a3d51...,2020-03-27 00:00:00.000000,Y,Y,Y
3,5ef6fdf32513aa7cd11f72beccf132b9224d33f271471f...,2020-06-24 00:00:00.000000,N,N,N
4,1253e9373e781b7500266caa55150e08e210bc8cd8cc70...,2020-07-27 00:00:00.000000,Y,Y,Y
...,...,...,...,...,...
95,a4e00d7e6aa82111575438c5e5d3e63269d4c475c718b2...,2020-11-04 00:00:00.000000,N,Y,N
96,8bcbb4c131df56f7c79066016241cc4bdf4e58db55c4f6...,2020-07-27 00:00:00.000000,Y,Y,Y
97,a4e00d7e6aa82111575438c5e5d3e63269d4c475c718b2...,2020-01-04 00:00:00.000000,Y,Y,Y
98,5a39cadd1b007093db50744797c7a04a34f73b35ed4447...,2020-12-24 00:00:00.000000,N,N,N


Set the input object to this dataframe.

*note:* the name of the input table must be the same as the name in the `json` rules. In this example, the name in the `json` for the mapping for this table is `Symptoms.csv`, therefore the dataframe is associated with that name.

In [12]:
inputs_sql['Symptoms.csv'] = df
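
The same pattern can be tried without a database server. The sketch below swaps PostgreSQL for an in-memory SQLite database from the standard library, uses a made-up two-row table, and stores the result in a plain `dict` standing in for the `InputData` object; only the key name `Symptoms.csv` comes from the rules.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for the real SQL server.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE my_data_table (PersonID TEXT, visit_date TEXT, symptom1 TEXT)"
)
connection.executemany(
    "INSERT INTO my_data_table VALUES (?, ?, ?)",
    [("a1", "2020-11-15", "Y"), ("b2", "2020-01-04", "N")],
)

# With a plain DB-API connection, pandas needs a query rather than a table name.
df_example = pd.read_sql("SELECT * FROM my_data_table", connection)
connection.close()

# A plain dict stands in for InputData; the key must match the name in the rules.
inputs_example = {"Symptoms.csv": df_example}
print(inputs_example["Symptoms.csv"])
```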

### Spark / Databricks

If you want to use something like Spark for integration with DataBricks, you can use `pyspark` to load the data:

In [13]:
from pyspark.sql import SparkSession

In [14]:
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.jars", "/Users/calummacdonald/Downloads/postgresql-42.3.1.jar") \
    .getOrCreate()


In [15]:
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/carrot.data_test") \
    .option("dbtable", "my_data_table") \
    .option("driver", "org.postgresql.Driver") \
    .load()

df.printSchema()

root
 |-- PersonID: string (nullable = true)
 |-- visit_date: string (nullable = true)
 |-- symptom1: string (nullable = true)
 |-- symptom2: string (nullable = true)
 |-- symptom3: string (nullable = true)



In [16]:
inputs_databricks = carrot.tools.file_helpers.InputData()
# the key must match the table name used in the json rules ('Symptoms.csv')
inputs_databricks['Symptoms.csv'] = df.select("*").toPandas()
inputs_databricks['Symptoms.csv']

Unnamed: 0,PersonID,visit_date,symptom1,symptom2,symptom3
0,16dc368a89b428b2485484313ba67a3912ca03f2b2b424...,2020-11-15 00:00:00.000000,Y,Y,Y
1,37834f2f25762f23e1f74a531cbe445db73d6765ebe608...,2020-01-04 00:00:00.000000,Y,Y,Y
2,454f63ac30c8322997ef025edff6abd23e0dbe7b8a3d51...,2020-03-27 00:00:00.000000,Y,Y,Y
3,5ef6fdf32513aa7cd11f72beccf132b9224d33f271471f...,2020-06-24 00:00:00.000000,N,N,N
4,1253e9373e781b7500266caa55150e08e210bc8cd8cc70...,2020-07-27 00:00:00.000000,Y,Y,Y
...,...,...,...,...,...
95,a4e00d7e6aa82111575438c5e5d3e63269d4c475c718b2...,2020-11-04 00:00:00.000000,N,Y,N
96,8bcbb4c131df56f7c79066016241cc4bdf4e58db55c4f6...,2020-07-27 00:00:00.000000,Y,Y,Y
97,a4e00d7e6aa82111575438c5e5d3e63269d4c475c718b2...,2020-01-04 00:00:00.000000,Y,Y,Y
98,5a39cadd1b007093db50744797c7a04a34f73b35ed4447...,2020-12-24 00:00:00.000000,N,N,N


## Creating a CDM 

As `carrot-cdm` contains a Pythonic version of the CDM, we can create an instance of the `CommonDataModel` class.

In [14]:
from carrot.cdm import CommonDataModel

cdm = CommonDataModel(name=rules['metadata']['dataset'],
                      inputs=inputs,
                      output_folder='output_dir/')
cdm

<carrot.cdm.model.CommonDataModel at 0x120050a30>

## Adding CDM Objects to the CDM

The next step is to loop over all the rules from the `json`, creating and adding a new CDM object (e.g. Person) to the CDM.

Within the loop, the `define` function of each CDM object is set to a lambda function that applies the rules. This means that at runtime the tool (via the `CommonDataModel` class) will execute the `define` function and therefore know how to apply the mapping rules.

In [15]:
from carrot.cdm import get_cdm_class
from carrot.tools import apply_rules

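for destination_table,rules_set in rules['cdm'].items():
    # 'object_rules' avoids overwriting the top-level 'rules' dict loaded earlier
    for name,object_rules in rules_set.items():
        # create an empty CDM object (e.g. Person) for this destination table
        obj = get_cdm_class(destination_table)()
        obj.set_name(name)
        obj.rules = object_rules
        # defer execution: define() is only called when the CDM is processed
        obj.define = lambda x : apply_rules(x)
        cdm.add(obj)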
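
The deferred-execution idea can be shown without the library. In this sketch every name (`MiniObject`, `MiniModel`, `fake_apply_rules`, `example_rules`) is hypothetical and only imitates the pattern described above: a callable is attached to each object when it is created, and the model calls it later during processing.

```python
class MiniObject:
    """Hypothetical stand-in for a CDM object."""

    def __init__(self, name, rules):
        self.name = name
        self.rules = rules
        self.define = None   # callable attached after construction
        self.result = None   # filled in only when the model is processed


class MiniModel:
    """Hypothetical stand-in for the model that executes its objects."""

    def __init__(self):
        self.objects = []

    def add(self, obj):
        self.objects.append(obj)

    def process(self):
        # Only now is each object's define function actually run.
        for obj in self.objects:
            obj.result = obj.define(obj)


def fake_apply_rules(obj):
    """Pretend rule application: report which fields would be filled."""
    return sorted(obj.rules)


# Invented rules: object name -> destination fields.
example_rules = {
    "female": {"gender_concept_id": {}},
    "male": {"gender_concept_id": {}, "birth_datetime": {}},
}

model = MiniModel()
for name, object_rules in example_rules.items():
    obj = MiniObject(name, object_rules)
    obj.define = lambda x: fake_apply_rules(x)  # nothing is executed yet
    model.add(obj)

pending = [o.result for o in model.objects]  # results before processing
model.process()
results = {o.name: o.result for o in model.objects}
print(pending, results)
```

Nothing is computed while the objects are being registered; the work happens in a single place when `process()` is called.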

After the CDM objects have been created and added, we can see which objects have been registered in the model.

In [19]:
cdm.objects()

{'observation': {'observation_0': <carrot.cdm.objects.observation.Observation at 0x12005e190>,
  'observation_1': <carrot.cdm.objects.observation.Observation at 0x11fdefd30>,
  'observation_2': <carrot.cdm.objects.observation.Observation at 0x120060460>,
  'observation_3': <carrot.cdm.objects.observation.Observation at 0x11fe234c0>,
  'observation_4': <carrot.cdm.objects.observation.Observation at 0x120069100>,
  'observation_5': <carrot.cdm.objects.observation.Observation at 0x120069e50>},
 'condition_occurrence': {'condition_occurrence_0': <carrot.cdm.objects.condition_occurrence.ConditionOccurrence at 0x12006dc40>},
 'person': {'female': <carrot.cdm.objects.person.Person at 0x120074970>,
  'male': <carrot.cdm.objects.person.Person at 0x1200797c0>},
 'measurement': {'covid_antibody': <carrot.cdm.objects.measurement.Measurement at 0x12007f610>}}

## Process The CDM

Processing the CDM executes all objects: a pandas dataframe is created for each object, based on the rules that have been provided.

Importantly, the CDM will also format, finalise and merge the individual dataframes for each object.

* Formatting makes sure the columns are in the correct format, e.g. a date is YYYY-MM-DD
* Finalise makes sure 
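
As an illustration of the formatting step only, the sketch below normalises a timestamp string of the kind seen in the input tables to a `YYYY-MM-DD` date. `to_cdm_date` is a hypothetical helper written with the standard library; it is not how `carrot-cdm` performs formatting internally.

```python
from datetime import datetime


def to_cdm_date(value):
    """Hypothetical helper: convert a timestamp string to YYYY-MM-DD."""
    parsed = datetime.strptime(value, "%Y-%m-%d %H:%M:%S.%f")
    return parsed.strftime("%Y-%m-%d")


formatted = to_cdm_date("2020-11-15 00:00:00.000000")
print(formatted)
```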


In [20]:
cdm.process()



## Inspect Outputs

In [21]:
cdm.keys()

dict_keys(['person', 'observation', 'condition_occurrence', 'measurement'])

In [22]:
cdm['person'].dropna(axis=1,how='all')

person_id,gender_concept_id,year_of_birth,month_of_birth,day_of_birth,birth_datetime,gender_source_value,gender_source_concept_id
1,8532,1997,5,11,1997-05-11 00:00:00.000000,F,8532
2,8532,1950,10,31,1950-10-31 00:00:00.000000,F,8532
3,8532,1975,6,7,1975-06-07 00:00:00.000000,F,8532
4,8532,1975,6,7,1975-06-07 00:00:00.000000,F,8532
5,8532,1976,4,23,1976-04-23 00:00:00.000000,F,8532
...,...,...,...,...,...,...,...
996,8507,1985,3,1,1985-03-01 00:00:00.000000,M,8507
997,8507,1951,12,25,1951-12-25 00:00:00.000000,M,8507
998,8507,1951,12,25,1951-12-25 00:00:00.000000,M,8507
999,8507,1951,12,25,1951-12-25 00:00:00.000000,M,8507
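
The `dropna(axis=1,how='all')` call used here only affects what is displayed: it returns a copy without the columns in which every value is missing. A minimal sketch with a made-up dataframe (the column names and values are illustrative, not taken from the output):

```python
import pandas as pd

# Made-up frame: one column is entirely empty, another is only partly empty.
example = pd.DataFrame(
    {
        "gender_concept_id": [8532, 8507],
        "always_empty": [None, None],        # missing in every row
        "gender_source_value": ["F", None],  # missing in one row only
    }
)

# axis=1 works on columns; how='all' drops a column only if all its values are missing.
trimmed = example.dropna(axis=1, how="all")
print(list(trimmed.columns))
```

The partly filled column is kept, and the original dataframe is left unchanged.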


In [23]:
cdm['observation'].dropna(axis=1,how='all')

observation_id,person_id,observation_concept_id,observation_datetime,observation_source_value,observation_source_concept_id
1,134,35825508,1956-11-12 00:00:00.000000,Asian,35825508
2,115,35825508,1956-11-12 00:00:00.000000,Asian,35825508
3,364,35825508,1956-11-12 00:00:00.000000,Asian,35825508
4,561,35825508,1956-11-12 00:00:00.000000,Asian,35825508
5,503,35825508,1956-11-12 00:00:00.000000,Asian,35825508
...,...,...,...,...,...
896,957,35827395,1985-03-01 00:00:00.000000,White and Asian,35827395
897,961,35827395,1985-03-01 00:00:00.000000,White and Asian,35827395
898,944,35827395,1985-03-01 00:00:00.000000,White and Asian,35827395
899,637,35827395,1985-03-01 00:00:00.000000,White and Asian,35827395


In [24]:
cdm['condition_occurrence'].dropna(axis=1,how='all')

condition_occurrence_id,person_id,condition_concept_id,condition_start_datetime,condition_end_datetime,condition_source_value,condition_source_concept_id
1,646,254761,2020-11-15 00:00:00.000000,2020-11-15 00:00:00.000000,Y,254761
2,701,254761,2020-01-04 00:00:00.000000,2020-01-04 00:00:00.000000,Y,254761
3,175,254761,2020-03-27 00:00:00.000000,2020-03-27 00:00:00.000000,Y,254761
4,53,254761,2020-07-27 00:00:00.000000,2020-07-27 00:00:00.000000,Y,254761
5,388,254761,2020-01-04 00:00:00.000000,2020-01-04 00:00:00.000000,Y,254761
...,...,...,...,...,...,...
396,419,254761,2020-11-15 00:00:00.000000,2020-11-15 00:00:00.000000,Y,254761
397,250,254761,2020-01-04 00:00:00.000000,2020-01-04 00:00:00.000000,Y,254761
398,475,254761,2020-07-27 00:00:00.000000,2020-07-27 00:00:00.000000,Y,254761
399,905,254761,2020-03-27 00:00:00.000000,2020-03-27 00:00:00.000000,Y,254761
