
Commit

prepare to create a v1.0
atsvetkova-ody committed Feb 8, 2021
1 parent 4d954bf commit e4d55f1
Showing 42 changed files with 2,258 additions and 2,852 deletions.
122 changes: 101 additions & 21 deletions README.md
@@ -1,39 +1,119 @@
# MIMIC IV to OMOP CDM Conversion #

### What is this repository for? ###

The project implements an ETL conversion of the MIMIC IV PhysioNet dataset to the OMOP CDM format.

* Version 1.0

### Concepts / Philosophy ###

The ETL is based on five steps; a minimal sketch of the intermediate-table flow follows the list.
* Create a snapshot of the source data. The snapshot data is stored in staging source tables with the prefix "src_".
* Clean the source data: filter out rows that are not to be used, format values, and apply some business rules. Create intermediate tables with the prefix "lk_" and the suffix "clean".
* Map distinct source codes to concepts in the vocabulary tables. Create intermediate tables with the prefix "lk_" and the suffix "concept".
    * Custom mapping is implemented with custom concepts generated in the vocabulary tables beforehand.
* Join the cleaned data to the mapped codes. Create intermediate tables with the prefix "lk_" and the suffix "mapped".
* Distribute the mapped data across the target CDM tables according to their target_domain_id values.
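
The sketch below illustrates steps 2–4 for a single source table. It is a minimal, hypothetical example: the table and column names are illustrative and do not come from the actual ETL scripts; only the "src_" / "lk_" naming convention and the standard OMOP vocabulary tables (concept, concept_relationship) are taken from the project.

```sql
-- Hypothetical sketch of the clean -> concept -> mapped flow.
-- Table and column names are illustrative only.

-- Step 2: clean the snapshot of one source table.
CREATE TABLE lk_diagnoses_clean AS
SELECT
    subject_id,
    hadm_id,
    icd_code              AS source_code,
    'ICD10CM'             AS source_vocabulary_id
FROM src_diagnoses_icd
WHERE icd_code IS NOT NULL;  -- filter out rows not to be used

-- Step 3: map distinct source codes to standard concepts.
CREATE TABLE lk_diagnoses_concept AS
SELECT DISTINCT
    cln.source_code,
    src.concept_id        AS source_concept_id,
    tgt.concept_id        AS target_concept_id,
    tgt.domain_id         AS target_domain_id
FROM lk_diagnoses_clean cln
LEFT JOIN concept src
    ON  src.concept_code  = cln.source_code
    AND src.vocabulary_id = cln.source_vocabulary_id
LEFT JOIN concept_relationship rel
    ON  rel.concept_id_1    = src.concept_id
    AND rel.relationship_id = 'Maps to'
LEFT JOIN concept tgt
    ON  tgt.concept_id = rel.concept_id_2;

-- Step 4: join the cleaned rows to their mapped concepts.
CREATE TABLE lk_diagnoses_mapped AS
SELECT
    cln.*,
    con.target_concept_id,
    con.target_domain_id
FROM lk_diagnoses_clean cln
LEFT JOIN lk_diagnoses_concept con
    ON con.source_code = cln.source_code;
```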

Intermediate and staging CDM tables have additional working fields such as unit_id. The unit_id field is composed during the ETL steps, and reads from right to left: source table name, initial target table abbreviation, final target table name or abbreviation. For example, unit_id = 'drug.cond.diagnoses_icd' means that the rows with this unit_id belong to the drug_exposure table, were initially prepared for the condition_occurrence table, and originate from the source table diagnoses_icd.
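
For example, a query along these lines can show how rows are distributed across units in a staging CDM table (the dataset and table names here are assumptions that follow the conventions above):

```sql
-- Illustrative only: distribution of rows across ETL units.
-- "etl_dataset.cdm_drug_exposure" is an assumed name following the conventions above.
SELECT
    unit_id,
    COUNT(*) AS row_count
FROM etl_dataset.cdm_drug_exposure
GROUP BY unit_id
ORDER BY row_count DESC;
```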

Vocabularies are kept in a separate dataset and are copied as part of the snapshot data too.
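
A quick sanity check that the vocabulary snapshot is in place might look like this (the dataset name is a placeholder, as in the .etlconf files):

```sql
-- Illustrative only: confirm the vocabulary tables were copied with the snapshot.
SELECT vocabulary_id, COUNT(*) AS concept_count
FROM voc_dataset.concept
GROUP BY vocabulary_id
ORDER BY concept_count DESC;
```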


### How to run the conversion ###

* The ETL process encapsulates the following workflows: ddl, vocabulary_refresh, staging, etl, ut, and unload.
* The unload workflow creates the final OMOP CDM dataset, which can be analysed with OHDSI tools such as Atlas or DQD.

* How to run the ETL end-to-end:
    * update the config files accordingly
    * perform the vocabulary_refresh steps if needed (see vocabulary_refresh/README.md)
    * set the project root (location of this file) as the current directory

```bash
cd vocabulary_refresh
python vocabulary_refresh.py -s10
python vocabulary_refresh.py -s20
python vocabulary_refresh.py -s30
cd ../
python scripts/run_workflow.py -e conf/dev.etlconf -c conf/workflow_ddl.conf
python scripts/run_workflow.py -e conf/dev.etlconf -c conf/workflow_staging.conf
python scripts/run_workflow.py -e conf/dev.etlconf -c conf/workflow_etl.conf
python scripts/run_workflow.py -e conf/dev.etlconf -c conf/workflow_ut.conf
python scripts/run_workflow.py -e conf/dev.etlconf -c conf/workflow_metrics.conf
```
* How to look at the UT and Metrics reports:
    * find the metrics dataset name in the corresponding .etlconf file, then run the queries below against it

```sql
-- UT report
SELECT report_starttime, table_id, test_type, field_name
FROM metrics_dataset.report_unit_test
WHERE NOT test_passed
;
-- Metrics - row count
SELECT * FROM metrics_dataset.me_total ORDER BY table_name;
-- Metrics - person and visit summary
SELECT
category, name, count AS row_count
FROM metrics_dataset.me_persons_visits ORDER BY category, name;
-- Metrics - Mapping rates
SELECT
table_name, concept_field,
count AS rows_mapped,
percent AS percent_mapped,
total AS rows_total
FROM metrics_dataset.me_mapping_rate
ORDER BY table_name, concept_field
;
-- Metrics - Top 100 Mapped and Unmapped
SELECT
table_name, concept_field, category, source_value, concept_id, concept_name,
count AS row_count,
percent AS rows_percent
FROM metrics_dataset.me_tops_together
ORDER BY table_name, concept_field, category, count DESC;
```

* More options to run parts of the ETL:
    * Run a workflow:
        * with local variables: `python scripts/run_workflow.py -c conf/workflow_etl.conf`
            * copy the "variables" section from the corresponding .etlconf file
        * with global variables: `python scripts/run_workflow.py -e conf/dev.etlconf -c conf/workflow_etl.conf`
    * Run explicitly named scripts (space delimited):
      `python scripts/run_workflow.py -e conf/dev.etlconf -c conf/workflow_etl.conf etl/etl/cdm_drug_era.sql`
    * Run in background:
      `nohup python scripts/run_workflow.py -e conf/full.etlconf -c conf/workflow_etl.conf > ../out_full_etl.out &`
    * Continue after an error, passing the remaining scripts explicitly:
      `nohup python scripts/run_workflow.py -e conf/full.etlconf -c conf/workflow_etl.conf etl/etl/cdm_observation.sql etl/etl/cdm_observation_period.sql etl/etl/cdm_fact_relationship.sql etl/etl/cdm_condition_era.sql etl/etl/cdm_drug_era.sql etl/etl/cdm_dose_era.sql etl/etl/cdm_cdm_source.sql >> ../out_full_etl.out &`


### Change Log (latest first) ###


**2021-02-08**

* Set version to v1.0

* Drug_exposure table
    * pharmacy.medication replaces particular values of prescription.drug
    * the source value format is changed to COALESCE(pharmacy.medication.selected, prescription.drug) || prescription.prod_strength (see the sketch below)
* Labevents mapping is replaced with a new reviewed version
    * vocabulary affected: mimiciv_meas_lab_loinc
    * lk_meas_labevents_clean and lk_meas_labevents_mapped are changed accordingly
* Unload for Atlas
    * the technical fields unit_id, load_row_id, load_table_id, and trace_id are removed from the tables intended for Atlas
* Delivery export script
    * tables are exported to a single directory, one file per table; if a table is too large, it is exported to multiple files
* Bugfixes and cleanup
    * real environment names are replaced with placeholders
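
A hypothetical sketch of the new drug source value composition described above; the join on pharmacy_id and the flattened pharmacy.medication column are assumptions, not the project's actual code:

```sql
-- Illustrative only: one way the new drug source value could be composed.
-- The pharmacy_id join and column names are assumptions.
SELECT
    COALESCE(ph.medication, pr.drug) || pr.prod_strength AS drug_source_value
FROM prescriptions pr
LEFT JOIN pharmacy ph
    ON ph.pharmacy_id = pr.pharmacy_id;
```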


**2021-02-01**

* Waveforms POC-2 (load from a folder tree and CSV files)
    * Waveform POC-2 is created for 4 MIMIC III waveform files uploaded to the bucket
    * iterates through the folder tree, captures metadata, and loads the CSVs
* Bugfixes


32 changes: 16 additions & 16 deletions conf/dev.etlconf
@@ -3,28 +3,28 @@

"variables":
{
"@source_project": "physionet-data",
"@core_dataset": "mimic_demo_core",
"@hosp_dataset": "mimic_demo_hosp",
"@icu_dataset": "mimic_demo_icu",
"@ed_dataset": "mimic_demo_ed",
"@source_project": "source_project...",
"@core_dataset": "core...",
"@hosp_dataset": "hosp...",
"@icu_dataset": "icu...",
"@ed_dataset": "ed...",

"@voc_project": "odysseus-mimic-dev",
"@voc_dataset": "vocabulary_2020_09_11",
"@voc_project": "etl_project...",
"@voc_dataset": "voc...",

"@wf_project": "odysseus-mimic-dev",
"@wf_dataset": "waveform_source_poc",
"@wf_project": "etl_project...",
"@wf_dataset": "wf...",

"@etl_project": "odysseus-mimic-dev",
"@etl_dataset": "mimiciv_demo_cdm_2021_01_20",
"@etl_project": "etl_project...",
"@etl_dataset": "etl...",

"@metrics_project": "odysseus-mimic-dev",
"@metrics_dataset": "mimiciv_demo_metrics_2021_01_20",
"@metrics_project": "etl_project...",
"@metrics_dataset": "metrics...",

"@atlas_project": "odysseus-mimic-dev",
"@atlas_dataset": "mimiciv_demo_202101_cdm_531",
"@atlas_project": "etl_project...",
"@atlas_dataset": "atlas...",

"@waveforms_csv_path": "gs://mimic_iv_to_omop/waveforms/source_data/csv"
"@waveforms_csv_path": "gs://bucket..."

},

38 changes: 22 additions & 16 deletions conf/full.etlconf
@@ -3,28 +3,28 @@

"variables":
{
"@source_project": "physionet-data",
"@core_dataset": "mimic_core",
"@hosp_dataset": "mimic_hosp",
"@icu_dataset": "mimic_icu",
"@ed_dataset": "mimic_ed",
"@source_project": "source_project...",
"@core_dataset": "core...",
"@hosp_dataset": "hosp...",
"@icu_dataset": "icu...",
"@ed_dataset": "ed...",

"@voc_project": "odysseus-mimic-dev",
"@voc_dataset": "vocabulary_2020_09_11",
"@voc_project": "etl_project...",
"@voc_dataset": "voc...",

"@wf_project": "odysseus-mimic-dev",
"@wf_dataset": "waveform_source_poc",
"@wf_project": "etl_project...",
"@wf_dataset": "wf...",

"@etl_project": "odysseus-mimic-dev",
"@etl_dataset": "mimiciv_full_cdm_2021_01_31",
"@etl_project": "etl_project...",
"@etl_dataset": "etl...",

"@metrics_project": "odysseus-mimic-dev",
"@metrics_dataset": "mimiciv_full_metrics_2021_01_31",
"@metrics_project": "etl_project...",
"@metrics_dataset": "metrics...",

"@atlas_project": "odysseus-mimic-dev",
"@atlas_dataset": "mimiciv_full_202101_cdm_531",
"@atlas_project": "etl_project...",
"@atlas_dataset": "atlas...",

"@waveforms_csv_path": "gs://mimic_iv_to_omop/waveforms/source_data/csv"
"@waveforms_csv_path": "gs://bucket..."

},

@@ -68,6 +68,12 @@
"conf": "workflow_qa.conf"
},

{
"workflow": "metrics",
"comment": "build metrics with metrics_gen scripts",
"type": "sql",
"conf": "workflow_metrics.conf"
},
{
"workflow": "gen_scripts",
"comment": "automation to generate similar queries for some tasks",
2 changes: 1 addition & 1 deletion custom_mapping_csv/custom_mapping_list.tsv
@@ -1,6 +1,6 @@
"file_name" "source_vocabulary_id" "min_concept_id" "max_concept_id" "row_count" "target_domains"
"gcpt_mimic_generated.csv" "mimiciv_mimic_generated" 2000000000 2000001000 "all(?)"
"gcpt_meas_lab_loinc.csv" "mimiciv_meas_lab_loinc" 2000001001 2000001173 173 "measurement"
"gcpt_meas_lab_loinc.csv" "mimiciv_meas_lab_loinc" 2000001001 2000001235 235 "measurement"
"gcpt_obs_insurance.csv" "mimiciv_obs_insurance" 2000001301 2000001305 5 "observation, Meas Value"
"gcpt_per_ethnicity.csv" "mimiciv_per_ethnicity" 2000001401 2000001408 8 "person"
"gcpt_obs_marital.csv" "mimiciv_obs_marital" 2000001501 2000001507 7 "observation"
