The ETL tool is the command-line tool `carrot etl`. It takes a `yaml` file that configures how the ETL is run.

In [1]:
!carrot etl --help

Usage: carrot etl [OPTIONS] COMMAND [ARGS]...

  Command group for running the full ETL of a dataset

Options:
  --config, --config-file TEXT  specify a yaml configuration file
  -d, --daemon                  run the ETL as a daemon process
  -l, --log-file TEXT           specify the log file to write to
  --help                        Show this message and exit.

Commands:
  check-tables   check tables
  clean-table    clean (delete all rows) of a given table name
  clean-tables   clean (delete all rows) in the tables defined in the...
  create-tables  create new bclink tables
  delete-tables  delete some tables
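
The same invocation can be scripted from Python. The sketch below only assembles an argument list from the options listed in the help text above; `build_etl_command` is a hypothetical helper, not part of carrot, and actually running the command assumes the `carrot` executable is on the `PATH`.

```python
import shlex


def build_etl_command(config, log_file=None, daemon=False, subcommand=None):
    """Assemble a `carrot etl` argument list (hypothetical helper).

    Only flags shown by `carrot etl --help` are used.
    """
    cmd = ["carrot", "etl", "--config", str(config)]
    if log_file is not None:
        cmd += ["--log-file", str(log_file)]
    if daemon:
        cmd.append("--daemon")
    if subcommand is not None:
        # Group options come before the subcommand, as in the usage line.
        cmd.append(subcommand)
    return cmd


cmd = build_etl_command("config.yml", log_file="etl.log")
print(shlex.join(cmd))
# To run it (requires carrot to be installed):
# import subprocess; subprocess.run(cmd, check=True)
```

Building a list rather than a single string avoids shell-quoting problems if a path contains spaces.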


A very basic configuration for running locally (effectively just running the T-Tool, `carrot.run map`, on one input):

In [2]:
definition = """
transform: 
  settings: &settings
    output: output/
    rules: ../data/rules.json
  data:
    - input: ../data/part1
      <<: *settings
"""
with open('config.yml','w') as f:
    f.write(definition)
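
Because `data` is a YAML list, the same pattern presumably extends to several input folders, each reusing the shared settings through the `<<: *settings` merge key. The sketch below generates such a definition as text. Both `make_definition` and the `../data/part2` folder are hypothetical and used only for illustration; whether the tool processes several entries in one run is an assumption here.

```python
def make_definition(inputs, output="output/", rules="../data/rules.json"):
    """Return YAML text with one `data` entry per input folder (hypothetical helper)."""
    lines = [
        "transform:",
        "  settings: &settings",
        f"    output: {output}",
        f"    rules: {rules}",
        "  data:",
    ]
    for path in inputs:
        lines.append(f"    - input: {path}")
        lines.append("      <<: *settings")  # reuse the shared settings
    return "\n".join(lines) + "\n"


# `../data/part2` is a made-up second input folder.
definition = make_definition(["../data/part1", "../data/part2"])
print(definition)
```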

In [3]:
!carrot etl --config config.yml

[32m2022-06-17 14:48:53[0m - [34mrun_etl[0m - [1;37mINFO[0m - running etl on config.yml (last modified: 1655473730.8621333)
[32m2022-06-17 14:48:53[0m - [34mLocalDataCollection[0m - [1;37mINFO[0m - DataCollection Object Created
[32m2022-06-17 14:48:54[0m - [34mLocalDataCollection[0m - [1;37mINFO[0m - Registering  Demographics.csv [<carrot.io.common.DataBrick object at 0x1273f1130>]
[32m2022-06-17 14:48:54[0m - [34mLocalDataCollection[0m - [1;37mINFO[0m - Registering  GP_Records.csv [<carrot.io.common.DataBrick object at 0x1273f12e0>]
[32m2022-06-17 14:48:54[0m - [34mLocalDataCollection[0m - [1;37mINFO[0m - Registering  Hospital_Visit.csv [<carrot.io.common.DataBrick object at 0x1273f1550>]
[32m2022-06-17 14:48:54[0m - [34mLocalDataCollection[0m - [1;37mINFO[0m - Registering  Serology.csv [<carrot.io.common.DataBrick object at 0x1273f17c0>]
[32m2022-06-17 14:48:54[0m - [34mLocalDataCollection[0m - [1;37mINFO[0m - Registering  Symptoms.csv [<carr

Next, add a `load` section to configure the output for bclink:

In [4]:
definition = """
load: &load-bclink
  cache: ./output/cache/
  bclink:
      dry_run: true
transform: 
  settings: &settings
    output: *load-bclink
    rules: ../data/rules.json
  data:
    - input: ../data/part1
      <<: *settings
"""
with open('config.yml','w') as f:
    f.write(definition)
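
To see what the anchors and aliases in this definition resolve to, the structure that a YAML parser with merge-key support would produce can be written out as plain Python. This illustrates the YAML semantics only; it is not a statement about how carrot consumes the file.

```python
# What `&load-bclink` names: the whole `load` mapping.
load = {"cache": "./output/cache/", "bclink": {"dry_run": True}}

# `output: *load-bclink` makes the output setting that same mapping.
settings = {"output": load, "rules": "../data/rules.json"}

config = {
    "load": load,
    "transform": {
        "settings": settings,
        # `<<: *settings` merges the shared keys into the data entry,
        # much like `**settings` does for a dict.
        "data": [{"input": "../data/part1", **settings}],
    },
}

entry = config["transform"]["data"][0]
print(sorted(entry))  # ['input', 'output', 'rules']
print(entry["output"]["bclink"]["dry_run"])  # True
```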

In [5]:
!carrot etl --config config.yml

[32m2022-06-17 14:48:57[0m - [34mrun_etl[0m - [1;37mINFO[0m - running etl on config.yml (last modified: 1655473734.7268007)
[32m2022-06-17 14:48:57[0m - [34mLocalDataCollection[0m - [1;37mINFO[0m - DataCollection Object Created
[32m2022-06-17 14:48:57[0m - [34mLocalDataCollection[0m - [1;37mINFO[0m - Registering  Demographics.csv [<carrot.io.common.DataBrick object at 0x114023130>]
[32m2022-06-17 14:48:57[0m - [34mLocalDataCollection[0m - [1;37mINFO[0m - Registering  GP_Records.csv [<carrot.io.common.DataBrick object at 0x1140232e0>]
[32m2022-06-17 14:48:57[0m - [34mLocalDataCollection[0m - [1;37mINFO[0m - Registering  Hospital_Visit.csv [<carrot.io.common.DataBrick object at 0x114023550>]
[32m2022-06-17 14:48:57[0m - [34mLocalDataCollection[0m - [1;37mINFO[0m - Registering  Serology.csv [<carrot.io.common.DataBrick object at 0x1140237c0>]
[32m2022-06-17 14:48:57[0m - [34mLocalDataCollection[0m - [1;37mINFO[0m - Registering  Symptoms.csv [<carr

[32m2022-06-17 14:48:57[0m - [34mCommonDataModel[0m - [1;37mINFO[0m - Added COVID-19 vaccine 3034 of type drug_exposure
[32m2022-06-17 14:48:57[0m - [34mCommonDataModel[0m - [1;37mINFO[0m - Added COVID-19 vaccine 3035 of type drug_exposure
[32m2022-06-17 14:48:57[0m - [34mCommonDataModel[0m - [1;37mINFO[0m - Added COVID-19 vaccine 3036 of type drug_exposure
[32m2022-06-17 14:48:57[0m - [34mCommonDataModel[0m - [1;37mINFO[0m - Added SARS-CoV-2 (COVID-19) vaccine, mRNA-1273 0.2 MG/ML Injectable Suspension 3040 of type drug_exposure
[32m2022-06-17 14:48:57[0m - [34mCommonDataModel[0m - [1;37mINFO[0m - Added SARS-CoV-2 (COVID-19) vaccine, mRNA-BNT162b2 0.1 MG/ML Injectable Suspension 3041 of type drug_exposure
[32m2022-06-17 14:48:57[0m - [34mCommonDataModel[0m - [1;37mINFO[0m - Starting processing in order: ['person', 'observation', 'condition_occurrence', 'drug_exposure']
[32m2022-06-17 14:48:57[0m - [34mCommonDataModel[0m - [1;37mINFO[0m - Numbe

[32m2022-06-17 14:48:58[0m - [34mObservation[0m - [1;37mINFO[0m - Automatically formatting data columns.
[32m2022-06-17 14:48:58[0m - [34mObservation[0m - [1;37mINFO[0m - created df (0x114177d00)[Antibody_3027]
[32m2022-06-17 14:48:58[0m - [34mCommonDataModel[0m - [1;37mINFO[0m - finished Antibody 3027 (0x114177d00) ... 1/4 completed, 413 rows
[32m2022-06-17 14:48:58[0m - [34mCommonDataModel[0m - [1;37mERROR[0m - [31mThere are person_ids in this table that are not in the output person table![0m
[32m2022-06-17 14:48:58[0m - [34mCommonDataModel[0m - [1;37mERROR[0m - [31mEither they are not in the original data, or while creating the person table, [0m
[32m2022-06-17 14:48:58[0m - [34mCommonDataModel[0m - [1;37mERROR[0m - [31mstudies have been removed due to lack of required fields, such as birthdate.[0m
[32m2022-06-17 14:48:58[0m - [34mCommonDataModel[0m - [1;37mERROR[0m - [31m410/413 were good, 3 studies are removed.[0m
[32m2022-06-17 14
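
The error above reports rows whose `person_id` has no match in the output person table. The idea behind such a check can be pictured with plain sets. The IDs below are made up for illustration, and `split_by_known_person` is a hypothetical sketch, not carrot's implementation.

```python
def split_by_known_person(rows, person_ids):
    """Separate rows into those with a known person_id and those without."""
    known = set(person_ids)
    good = [r for r in rows if r["person_id"] in known]
    removed = [r for r in rows if r["person_id"] not in known]
    return good, removed


# Made-up example data: person 3 is missing from the person table.
person_ids = [1, 2, 4]
rows = [{"person_id": pid, "value": "x"} for pid in (1, 2, 3, 4)]

good, removed = split_by_known_person(rows, person_ids)
print(f"{len(good)}/{len(rows)} were good, {len(removed)} removed")
# 3/4 were good, 1 removed
```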

[32m2022-06-17 14:48:58[0m - [34mObservation[0m - [1;37mINFO[0m - created df (0x114191160)[Cancer_3045]
[32m2022-06-17 14:48:58[0m - [34mCommonDataModel[0m - [1;37mINFO[0m - finished Cancer 3045 (0x114191160) ... 4/4 completed, 349 rows
[32m2022-06-17 14:48:58[0m - [34mCommonDataModel[0m - [1;37mINFO[0m - saving dataframe (0x114191160) to <carrot.io.plugins.bclink.BCLinkDataCollection object at 0x113fdee20>
[32m2022-06-17 14:48:58[0m - [34mBCLinkDataCollection[0m - [1;37mINFO[0m - saving observation.Cancer_3045.0x114191160.2022-06-17T134858 to ./output/cache//observation.Cancer_3045.0x114191160.2022-06-17T134858.tsv
[32m2022-06-17 14:48:58[0m - [34mBCLinkDataCollection[0m - [1;37mINFO[0m - finished save to file
[32m2022-06-17 14:48:58[0m - [34mBCLinkHelpers[0m - [1;37mNOTICE[0m - [35mdataset_tool --load --table=observation --user=data --data_file=./output/cache//observation.Cancer_3045.0x114191160.2022-06-17T134858.tsv --support --bcqueue bclink[0m

[32m2022-06-17 14:48:58[0m - [34mConditionOccurrence[0m - [1;37mINFO[0m - created df (0x114191b80)[Dizziness_3030]
[32m2022-06-17 14:48:59[0m - [34mCommonDataModel[0m - [1;37mINFO[0m - finished Dizziness 3030 (0x114191b80) ... 3/12 completed, 134 rows
[32m2022-06-17 14:48:59[0m - [34mCommonDataModel[0m - [1;37mINFO[0m - saving dataframe (0x114191b80) to <carrot.io.plugins.bclink.BCLinkDataCollection object at 0x113fdee20>
[32m2022-06-17 14:48:59[0m - [34mBCLinkDataCollection[0m - [1;37mINFO[0m - saving condition_occurrence.Dizziness_3030.0x114191b80.2022-06-17T134859 to ./output/cache//condition_occurrence.Dizziness_3030.0x114191b80.2022-06-17T134859.tsv
[32m2022-06-17 14:48:59[0m - [34mBCLinkDataCollection[0m - [1;37mINFO[0m - finished save to file
[32m2022-06-17 14:48:59[0m - [34mBCLinkHelpers[0m - [1;37mNOTICE[0m - [35mdataset_tool --load --table=condition_occurrence --user=data --data_file=./output/cache//condition_occurrence.Dizziness_3030.0x11

[32m2022-06-17 14:48:59[0m - [34mConditionOccurrence[0m - [1;37mINFO[0m - Mapped condition_concept_id
[32m2022-06-17 14:48:59[0m - [34mConditionOccurrence[0m - [1;37mINFO[0m - Mapped condition_end_datetime
[32m2022-06-17 14:48:59[0m - [34mConditionOccurrence[0m - [1;37mINFO[0m - Mapped condition_source_concept_id
[32m2022-06-17 14:48:59[0m - [34mConditionOccurrence[0m - [1;37mINFO[0m - Mapped condition_source_value
[32m2022-06-17 14:48:59[0m - [34mConditionOccurrence[0m - [1;37mINFO[0m - Mapped condition_start_datetime
[32m2022-06-17 14:48:59[0m - [34mConditionOccurrence[0m - [1;37mINFO[0m - Mapped person_id
[32m2022-06-17 14:48:59[0m - [34mConditionOccurrence[0m - [1;37mINFO[0m - Automatically formatting data columns.
[32m2022-06-17 14:48:59[0m - [34mConditionOccurrence[0m - [1;37mINFO[0m - created df (0x1141a7f10)[Pneumonia_3042]
[32m2022-06-17 14:48:59[0m - [34mCommonDataModel[0m - [1;37mINFO[0m - finished Pneumonia 3042 (0x1141a

[32m2022-06-17 14:49:00[0m - [34mConditionOccurrence[0m - [1;37mINFO[0m - Automatically formatting data columns.
[32m2022-06-17 14:49:00[0m - [34mConditionOccurrence[0m - [1;37mINFO[0m - created df (0x1141ed970)[Type_2_diabetes_mellitus_3048]
[32m2022-06-17 14:49:00[0m - [34mCommonDataModel[0m - [1;37mINFO[0m - finished Type 2 diabetes mellitus 3048 (0x1141ed970) ... 10/12 completed, 264 rows
[32m2022-06-17 14:49:00[0m - [34mCommonDataModel[0m - [1;37mINFO[0m - saving dataframe (0x1141ed970) to <carrot.io.plugins.bclink.BCLinkDataCollection object at 0x113fdee20>
[32m2022-06-17 14:49:00[0m - [34mBCLinkDataCollection[0m - [1;37mINFO[0m - saving condition_occurrence.Type_2_diabetes_mellitus_3048.0x1141ed970.2022-06-17T134900 to ./output/cache//condition_occurrence.Type_2_diabetes_mellitus_3048.0x1141ed970.2022-06-17T134900.tsv
[32m2022-06-17 14:49:00[0m - [34mBCLinkDataCollection[0m - [1;37mINFO[0m - finished save to file
[32m2022-06-17 14:49:00[0m -

[32m2022-06-17 14:49:00[0m - [34mDrugExposure[0m - [1;37mINFO[0m - Mapped drug_source_concept_id
[32m2022-06-17 14:49:00[0m - [34mDrugExposure[0m - [1;37mINFO[0m - Mapped drug_source_value
[32m2022-06-17 14:49:00[0m - [34mDrugExposure[0m - [1;37mINFO[0m - Mapped person_id
[32m2022-06-17 14:49:00[0m - [34mDrugExposure[0m - [1;37mINFO[0m - Automatically formatting data columns.
[32m2022-06-17 14:49:00[0m - [34mDrugExposure[0m - [1;37mINFO[0m - created df (0x114215eb0)[COVID_19_vaccine_3034]
[32m2022-06-17 14:49:00[0m - [34mCommonDataModel[0m - [1;37mINFO[0m - finished COVID-19 vaccine 3034 (0x114215eb0) ... 1/5 completed, 245 rows
[32m2022-06-17 14:49:00[0m - [34mCommonDataModel[0m - [1;37mINFO[0m - saving dataframe (0x114215eb0) to <carrot.io.plugins.bclink.BCLinkDataCollection object at 0x113fdee20>
[32m2022-06-17 14:49:00[0m - [34mBCLinkDataCollection[0m - [1;37mINFO[0m - saving drug_exposure.COVID_19_vaccine_3034.0x114215eb0.2022-06-17T

[32m2022-06-17 14:49:00[0m - [34mDrugExposure[0m - [1;37mINFO[0m - Automatically formatting data columns.
[32m2022-06-17 14:49:01[0m - [34mDrugExposure[0m - [1;37mINFO[0m - created df (0x1141ffa00)[SARS_CoV_2_COVID_19_vaccine_mRNA_1273_0_2_MG_ML_Injectable_Suspension_3040]
[32m2022-06-17 14:49:01[0m - [34mCommonDataModel[0m - [1;37mINFO[0m - finished SARS-CoV-2 (COVID-19) vaccine, mRNA-1273 0.2 MG/ML Injectable Suspension 3040 (0x1141ffa00) ... 4/5 completed, 245 rows
[32m2022-06-17 14:49:01[0m - [34mCommonDataModel[0m - [1;37mINFO[0m - saving dataframe (0x1141ffa00) to <carrot.io.plugins.bclink.BCLinkDataCollection object at 0x113fdee20>
[32m2022-06-17 14:49:01[0m - [34mBCLinkDataCollection[0m - [1;37mINFO[0m - saving drug_exposure.SARS_CoV_2_COVID_19_vaccine_mRNA_1273_0_2_MG_ML_Injectable_Suspension_3040.0x1141ffa00.2022-06-17T134901 to ./output/cache//drug_exposure.SARS_CoV_2_COVID_19_vaccine_mRNA_1273_0_2_MG_ML_Injectable_Suspension_3040.0x1141ffa00.20