
Instructions for Sites: OMOP Data Model

Kristin Kostka edited this page Jan 20, 2022 · 9 revisions


Last updated 1/20/2022

The following documentation is intended for sites that plan to use the OMOP Common Data Model (CDM), v5.3, to transmit COVID patient data to N3C. If you are on v5.2, please reach out to confirm the small tweaks between these two versions. (NOTE: If you are planning to migrate to v5.4 and use the N3C extract code as-is, there should be minimal issues in continuing to send your payloads.)

By choosing to submit in the OMOP CDM, you are also choosing to uphold OHDSI THEMIS conventions. As such, you may receive notifications from the Data Ingestion & Harmonization team for non-compliance with OMOP convention. Some may be permissible but others, such as the omission of required tables, may be deal breakers in getting your payload included in the Enclave release set. We appreciate your help in resolving these errors and contributing to this important research project.

The most important thing to know: We are here to help! If you need support during any part of the phenotyping, data extraction, or data submission process, please put in a GitHub Issue. We are committed to responding to these issues within one business day.


Governance Prerequisites for Running Data Scripts

Before running these scripts and transmitting data, ensure that the following steps have been taken at your site:

  1. A Data Transfer Agreement (DTA) has been fully executed between your site and NCATS.
  2. An IRB has been submitted and approved at your site. Your site may choose to rely on the approved Johns Hopkins IRB for N3C or submit for local IRB approval.

Technical Requirements and Recommendations prior to Running Data Scripts

In order for our scripts to run properly, the person running them will need enough access to your CDM database to issue a CREATE TABLE statement (e.g., read/write permissions to a schema in the same cluster where you store your CDM). Note: This can be done in a scratch schema parallel to your CDM schema.

1. Load non-LOINC-coded COVID-related labs, if not already doing so.

In order to provide the richest data to N3C, we highly recommend loading at least a subset of non-LOINC-coded labs into your MEASUREMENT table, if you are not already. Where possible, these codes should be translated from source concepts to standard concepts. If you are not able to load all non-LOINC-coded labs, the minimum set to load is non-LOINC-coded COVID-related tests.

2. Load COVID-related labs with qualitative results, if not already doing so.

We also highly recommend loading at least a subset of labs with qualitative results into your MEASUREMENT table, if you are not already. If you are not able to load all labs with qualitative results, the minimum set to load is COVID-related tests (which will generally have qualitative results). Note that the phenotype uses the VALUE_AS_CONCEPT field to ascertain lab-confirmed diagnoses of COVID-19; this field is where categorical test results are mapped to standard concepts in the ‘Meas Value’ domain. Below is a list of categorical results that can be mapped to VALUE_AS_CONCEPT_IDs:

| VALUE_AS_CONCEPT_ID | Description |
| --- | --- |
| 4126681 | Detected |
| 45877985 | Detected |
| 45884084 | Positive |
| 9191 | Positive |
| 4181412 | Present |
| 45879438 | Present |
| 45881802 | Reactive |
| 45878583 | Negative |
| 9189 | Negative |
| 45876469 | Confirm Negative |
| 45880296 | Not detected |
| 9190 | Not detected |
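As an illustration of how such mappings might be applied during ETL, the Python sketch below maps raw qualitative result strings to VALUE_AS_CONCEPT_IDs. The concept IDs come from the table above; the source-result strings are hypothetical examples of local lab values, so substitute whatever appears in your source system:

```python
# Map common qualitative lab-result strings to OMOP 'Meas Value' concept IDs.
# Concept IDs are taken from the table above; the result-string keys are
# hypothetical examples of local source values.
QUALITATIVE_RESULT_TO_CONCEPT_ID = {
    "detected": 45877985,
    "positive": 9191,
    "present": 4181412,
    "reactive": 45881802,
    "negative": 9189,
    "not detected": 9190,
}

def map_value_as_concept_id(raw_result: str) -> int:
    """Return a VALUE_AS_CONCEPT_ID for a raw qualitative result, or 0
    (the OMOP convention for 'no matching concept') when unmapped."""
    return QUALITATIVE_RESULT_TO_CONCEPT_ID.get(raw_result.strip().lower(), 0)
```

Unmapped values fall back to concept ID 0 here, which by OMOP convention means "no matching concept"; your ETL may instead route unmapped values to a review queue.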

For additional assistance during ETL, consult the OHDSI community coding guidance issued here: https://github.com/OHDSI/Covid-19/wiki/Release. You may also reach out to the OHDSI Healthcare Systems Interest Group (formerly the Electronic Health Record WG), which meets every other Monday at 10 AM ET (check Upcoming Working Group Calls for the next meeting time), for additional guidance and to discuss questions about newly released coding.

3. Load the LOCATION table, if not already doing so.

N3C is designed to accept a HIPAA limited dataset, which can include 5-digit ZIPs, city, or state. If you are not already loading patient ZIPs (and/or city/state) into the LOCATION table, we recommend at least loading the most current ZIP (and/or city/state) for patients in the data mart, without worrying about the history.

4. Ensure data meets the definition of a HIPAA limited dataset and obfuscate data that could identify your site.

If your CDM contains any HIPAA identifiers other than dates and ZIP code, please remove or obfuscate them prior to transmitting data. For example, if PERSON_ID is populated with an identifier, replace it with an encrypted ID. Additionally, if you use the National Provider Identifier (NPI) to populate PROVIDER_ID, replace it with a non-public ID.
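One possible way to replace an identifying PERSON_ID with a stable encrypted ID is a keyed hash, sketched below. The key handling and the 16-character truncation are illustrative choices, not an N3C requirement; the essential properties are that the mapping is deterministic (so the same patient keeps the same ID across refreshes) and that the key never leaves your site:

```python
import hashlib
import hmac

# Illustrative sketch: derive a stable pseudonymous ID from a source
# identifier using a keyed hash (HMAC-SHA-256). The secret key must be
# generated and stored locally and never transmitted; the 16-hex-digit
# truncation is an arbitrary choice for readability.
def pseudonymize_id(source_id: str, secret_key: bytes) -> str:
    digest = hmac.new(secret_key, source_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]
```

Because the hash is keyed and deterministic, the same source ID always yields the same pseudonym, which also satisfies the PERSON_ID consistency requirement described under "Important Notes for Preparing Your Submission" below.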

5. Refresh the N3C subset of your CDM weekly.

Ideally, sites will transmit data to N3C once per week. It is not necessary to refresh your entire CDM that often if you are able to limit the refresh to the N3C subset. Your ETL team should use ATHENA to track when new OMOP vocabulary mappings are released; vocabularies should be updated at least once a month.

While you may choose to update your CDM more frequently (such as nightly), it is fine to update the vocabularies less often; the OMOP vocabulary is unlikely to be updated more than every other week, and the exact cadence is at the site's discretion. Failure to update OMOP vocabularies may result in missing concepts and underreporting of case counts.

If this rate of refresh is not feasible for your site, we will happily work with you on that--please contact us.

6. Ensure you have populated the minimum necessary tables for the OMOP CDM.

As you build your OMOP CDM, we understand that some OMOP domains/tables may not be applicable to your site-level use cases. In the extract script, we are pulling the PERSON, OBSERVATION_PERIOD, VISIT_OCCURRENCE, CONDITION_OCCURRENCE, DRUG_EXPOSURE, PROCEDURE_OCCURRENCE, MEASUREMENT, OBSERVATION, LOCATION, CARE_SITE, PROVIDER, DRUG_ERA, CONDITION_ERA, and DEATH tables.

WARNING: If you are missing a table, we need you to create an empty shell table. Otherwise, the ingestion process will break and you will receive a notice from the DI&H team.

Sites often ask if the OBSERVATION_PERIOD table is required. It is: downstream analytics depend on it. The table establishes, as best we are able, a valid period of time during which a person is observed in a database. Because each data partner has a different understanding of how patients move in and out of their system, there is no good rule of thumb for defining observation periods en masse. Essentially, an observation period should be the span of time in which you are reasonably confident that any interaction with the health care system would be captured in the data. We therefore ask sites to define it themselves, because they know their data best. If this is not possible, or you are missing other tables, please reach out to the Phenotype & Data Acquisition team.

7. Ensure your ERA tables are populated.

The OMOP CDM stores verbatim data from the source across various clinical domains, such as records for conditions, drugs, procedures, and measurements. In addition, to assist the analyst, the CDM also provides some derived tables, based on commonly used analytic procedures. For example, the CONDITION_ERA table is derived from the CONDITION_OCCURRENCE table and the DRUG_ERA is derived from the DRUG_EXPOSURE table. An era is defined as a span of time when a patient is assumed to have a given condition or exposure to a particular active ingredient. For network research, such as the N3C Program, it is essential to populate ERA tables such that downstream analytical methods/tools can utilize this information (e.g. ERA tables are required for using FeatureExtraction). It is suggested to run ERA scripts at the time of your latest ETL. The N3C Extract Scripts are written to consume ERA tables directly from the OMOP CDM holders. It is our belief that the derivation process should happen as close to the source as possible. If you need assistance populating your ERA tables, there are scripts available from the OHDSI community. We appreciate your assistance in populating these tables for network research.
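For sites scripting their own era derivation, the core step is collapsing a patient's exposure intervals for a single ingredient whenever the gap between them falls within a persistence window (30 days is the conventional choice for DRUG_ERA). Below is a minimal Python sketch of that interval-merging logic, assuming exposures have already been mapped to ingredient-level concepts; the OHDSI community scripts mentioned above implement the full convention in SQL:

```python
from datetime import date, timedelta

# Sketch of the era-collapsing step: merge a patient's exposure intervals
# for one ingredient when the gap between consecutive intervals is at most
# `persistence_days` (30 days is the conventional DRUG_ERA window).
def build_eras(intervals, persistence_days=30):
    """intervals: list of (start_date, end_date) tuples for one
    person/ingredient pair; returns the merged era intervals."""
    gap = timedelta(days=persistence_days)
    eras = []
    for start, end in sorted(intervals):
        if eras and start <= eras[-1][1] + gap:
            # Within the persistence window: extend the current era.
            eras[-1] = (eras[-1][0], max(eras[-1][1], end))
        else:
            # Gap too large: start a new era.
            eras.append((start, end))
    return eras
```

CONDITION_ERA derivation follows the same pattern with its own conventional gap; consult the OHDSI scripts for the exact parameters.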

Data Submission Process Overview

The following is the process for submitting data to N3C. Sites may choose from two approaches:

  • Preferred Approach: All sites are strongly encouraged to use the OMOP (R) or Python Exporters found in the Exporters folder. This will ensure uniform data exports, which is critical to the data harmonization process.
  • Alternative Approach: Alternatively, sites may run the phenotype and extract SQL independent of the OMOPExporter script. If your site chooses this option, please closely review and follow the export specifications outlined in this documentation.

If you are interested in using the R or Python Exporter but have questions, please submit a GitHub Issue or ask us during Phenotype & Data Acquisition office hours (See N3C Calendar). We'd be happy to help you out.

Important Notes for Preparing Your Submission

  • Phenotype and extract scripts may change over time--we will send out communications when a change has been made that will require you to update the scripts you’re using.

  • The first time you run phenotype 3.0 you must manually drop your existing N3C_COHORT table. (Just the first time—then you never have to think of it again.)

  • The N3C_CONTROL_MAP table must be maintained from week to week - do not drop or truncate this table. If you accidentally drop or truncate it, you will not be able to recreate it yourself (so please don’t try!)—just let us know if this happens, and N3C can securely transmit a copy of your prior week’s CONTROL_MAP table back to you so that you can reload it from CSV.

  • PERSON_ID must remain consistently associated with the same patient over time and across refreshes.

  • Each time you run the scripts, you will export all the data for the whole cohort back to 1/1/2018 (a full truncate and reload for all N3C tables other than the N3C_CONTROL_MAP, which must be maintained from week to week). That is to say, consistent with THEMIS conventions, we will not do incremental loads, where you’re only loading records that have changed since the prior run.

Preferred Approach 1: R Exporter

Please attempt to run the R Exporter "as is", only modifying as described in the readme or this documentation. If you experience an error when running the R Exporter, please submit a GitHub Issue before making any customizations. We will attempt to resolve issues centrally, in order to ensure consistency across sites.

The R Exporter will:

  1. Run the phenotype script.

  2. Run the extract script's statements one at a time.

  3. Write the results to individual files following our directory structure, naming conventions, and file format.

    Note: The file "EXTRACT_VALIDATION" should be empty; an empty file means you have no duplicate primary keys. If it is not empty, the duplicate primary keys it contains should be investigated.

  4. Zip the extract.

At this time, the R exporter does not automatically SFTP your output to NCATS--you will need to manually transmit the data. We hope to change this in an upcoming version. Regardless, connection information will be provided by NCATS.

Preparing to run the R Exporter.

  1. If not already installed, please download and install the latest version of R and/or RStudio.
  2. In this GitHub repository, navigate to the PhenotypeScripts folder and download or copy the OMOP phenotype script with your RDBMS’s name in the filename. (E.g., if I use Oracle for my CDM, I would download “N3C_phenotype_omop_oracle.sql”)
  3. Review and complete steps to define how "positive" COVID lab results appear in your data. See Defining a "Positive" COVID Lab for instructions.
  4. In this GitHub repository, navigate to the ExtractScripts folder and download or copy the extract script with your RDBMS’s name in the filename. This file contains individual SQL statements to extract the desired fields from each of the relevant tables.
  5. In this GitHub repository, navigate to the Exporters folder and download or copy the RExporter. Please review the RExporter readme for further instructions.

Preferred Approach 2: Python Exporter

Please attempt to run the Python Exporter "as is", only modifying as described in the readme or this documentation. If you experience an error when running the Python Exporter, please submit a GitHub Issue before making any customizations. We will attempt to resolve issues centrally, in order to ensure consistency across sites.

The Python Exporter will:

  1. Run the phenotype script.

  2. Run the extract script's statements one at a time.

  3. Write the results to individual files following our directory structure, naming conventions, and file format.

    Note: The file "EXTRACT_VALIDATION" should be empty; an empty file means you have no duplicate primary keys. If it is not empty, the duplicate primary keys it contains should be investigated.

  4. Zip the extract.

  5. SFTP extract to N3C. (Connection information will be provided by NCATS.)

Preparing to run the Python Exporter.

  1. If not already installed, please download and install the latest version of Python 3.
  2. In this GitHub repository, navigate to the PhenotypeScripts folder and download or copy the OMOP phenotype script with your RDBMS’s name in the filename. (E.g., if I use Oracle for my CDM, I would download “N3C_phenotype_omop_oracle.sql”)
  3. Review and complete steps to define how "positive" COVID lab results appear in your data. See Defining a "Positive" COVID Lab for instructions.
  4. In this GitHub repository, navigate to the ExtractScripts folder and download or copy the extract script with your RDBMS’s name in the filename. This file contains individual SQL statements to extract the desired fields from each of the relevant tables.
  5. In this GitHub repository, navigate to the Exporters folder and download or copy the PythonExporter. Please review the PythonExporter readme for further instructions.

Alternative Approach: SQL Scripts

Before running the SQL scripts, please closely review the following documentation. It is important for all sites using the raw SQL to closely review and follow structure and formatting requirements outlined here. If you have any questions, please submit a GitHub Issue.

Please attempt to run the SQL scripts "as is", only modifying as described in this documentation. If you experience an error when running the SQL scripts, please submit a GitHub Issue before making any customizations. We will attempt to resolve issues centrally, in order to support consistency across sites.

  1. In this GitHub repository, navigate to the PhenotypeScripts folder and download or copy the OMOP phenotype script with your RDBMS’s name in the filename. (E.g., if I use Oracle for my CDM, I would download “N3C_phenotype_omop_oracle.sql”)

  2. Review and complete steps to define how "positive" COVID lab results appear in your data. See Defining a "Positive" COVID Lab for instructions.

  3. Run the phenotype code in your CDM. The result will be a new table called N3C_CONTROL_MAP.

    • Reminder: This table should be maintained from week to week - do not drop or truncate this table. If you accidentally drop or truncate it, you will not be able to recreate it yourself (so please don’t try!)—just let us know if this happens, and N3C can securely transmit a copy of your prior week’s CONTROL_MAP table back to you so that you can reload it from CSV.
  4. Back in GitHub, navigate to the ExtractScripts folder and download or copy the extract script with your RDBMS’s name in the filename. This file contains individual SQL statements to extract the desired fields from each of the relevant tables.

    • Note: Please do not replace these scripts by just running a “SELECT *” on the relevant tables. Our scripts drop certain fields, transform others, and may update over time.
  5. Before you run the extract script, you will need to make the following adjustments:

    • Adjust the SQL for the MANIFEST table to fit your site. (See “The Manifest Table” section below for details.) This includes adding your site name, contact information, CDM information, etc.
    • Replace @cdmDatabaseSchema and @resultsDatabaseSchema with your local schema details.
  6. Review the following formatting requirements for the extract. It is imperative that the extract follows these specifications exactly to support the harmonization process. A sample of the expected format can be found at the end of this document.

    • The first row should contain the names of the column headers.
    • All data, regardless of data type, should be enclosed in double quotes.
    • The output of each statement is exported to individual pipe-delimited (“|”) CSV files.
    • Required file names for each extract file can be found in the "OUTPUT_FILE" tag in the comment above the corresponding SQL statement. For example, the extract file for condition_occurrence should be CONDITION_OCCURRENCE.csv, and the extract file for the MANIFEST table should be MANIFEST.csv.
  7. Create directory structure as follows:

    • Name directory as follows: site_cdm_yyyymmdd (Ex: UNC_OMOP_20200504).
    • Create subdirectory entitled "DATAFILES".
  8. Run the extract scripts one at a time, dumping the output of each statement into individual pipe-delimited ("|") CSV files, into the appropriate location in the directory. DATA_COUNTS and MANIFEST extracts should be exported to the root level. All other extracted tables should be placed within a subdirectory called "DATAFILES."

    Note: The file "EXTRACT_VALIDATION" should be empty; an empty file means you have no duplicate primary keys. If it is not empty, the duplicate primary keys it contains should be investigated.

  9. Once all extracts are complete, zip up their containing folder. Name the folder with the following convention: site_cdm_yyyymmdd.zip (Ex: UNC_OMOP_20200504.zip).

  10. SFTP the zip file to the destination server. Connection information will be provided by NCATS.
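The naming and layout rules in steps 7-10 can be summarized in a small Python sketch. The site abbreviation, CDM name, and table file names below are illustrative; this only computes the expected paths, leaving the actual export and SFTP to your tooling:

```python
from datetime import date

# Sketch of the layout rules from steps 7-10: MANIFEST.csv and
# DATA_COUNTS.csv sit at the root of site_cdm_yyyymmdd/, all other extract
# files go under DATAFILES/, and the zip takes the same base name.
def extract_layout(site: str, cdm: str, run_date: date, table_files):
    root = f"{site}_{cdm}_{run_date:%Y%m%d}"
    paths = {name: f"{root}/{name}" for name in ("MANIFEST.csv", "DATA_COUNTS.csv")}
    paths.update({name: f"{root}/DATAFILES/{name}" for name in table_files})
    return f"{root}.zip", paths
```

For example, a site abbreviated "UNC" running an OMOP extract on 2020-05-04 would produce UNC_OMOP_20200504.zip, matching the example in step 9.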

The Manifest Table

With each data payload, we need a table containing information about the sending site, as well as some metadata about the extract. The table below defines the fields we are requesting that each site provide.

Please edit the SQL in the extract script to fill in the necessary data for your site before running the extract. For many of the fields, you’ll be able to submit the same information each time; other fields can be updated dynamically.

| Field name | Definition | Sample value | Static or changing |
| --- | --- | --- | --- |
| SITE_ABBREV | Unique abbreviation for your site; will be provided by N3C | “Tufts” | Static |
| SITE_NAME | Full name of your site | “Tufts University” | Static |
| CONTACT_NAME | Full name of the N3C technical contact at your site | “Jane Doe” | Static |
| CONTACT_EMAIL | Email address of the N3C technical contact at your site | “jane_doe@ohdsi.org” | Static |
| CDM_NAME | Choose one: OMOP \| PCORNET \| ACT \| TRINETX | “OMOP” | Static |
| CDM_VERSION | Numbered version of your chosen CDM | “5.3.1” | Static |
| VOCABULARY_VERSION | Version of the OMOP vocabulary in use for this data pull | "v5.0 19-MAY-20" | Will change if you update your vocabulary tables at your site |
| N3C_PHENOTYPE_YN | Enter Y if you are using the N3C phenotype code to define your cohort; enter N if you are using a local definition (rare) | “Y” | Static |
| N3C_PHENOTYPE_VERSION | Numbered version of the phenotype (change to “NA” if not using the N3C phenotype) | “1.6” | Automatically populated from phenotype code |
| SHIFT_DATE_YN | Enter Y if your site is shifting dates prior to submission; otherwise enter N. Note: date shifting is not required and, indeed, it is preferred that sites not shift dates prior to submission | “Y” | Static |
| MAX_NUM_SHIFT_DAYS | The maximum number of days by which you are shifting dates. Write Unknown if you do not know, and NA if you do not shift dates | “30” | Static |
| RUN_DATE | Date the current extract was run | “2020-05-05” | Changing (use SYSDATE) |
| UPDATE_DATE | Date for which the data in this extract is current (i.e., the maximum date present in your dataset) | “2020-05-04” | Changing (use SYSDATE - # days of latency at your site) |
| NEXT_SUBMISSION_DATE | Date on which you will submit your next extract | “2020-05-07” | Changing (use SYSDATE + # days between submissions) |
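The three changing date fields can be computed at run time rather than edited by hand. A Python sketch follows; the one-day latency and seven-day submission cadence are illustrative placeholders for your site's actual numbers:

```python
from datetime import date, timedelta

# Compute the changing MANIFEST date fields at extract time. The latency
# and cadence defaults are illustrative -- substitute your site's values.
def manifest_dates(run_date: date, latency_days: int = 1, cadence_days: int = 7):
    return {
        "RUN_DATE": run_date.isoformat(),
        "UPDATE_DATE": (run_date - timedelta(days=latency_days)).isoformat(),
        "NEXT_SUBMISSION_DATE": (run_date + timedelta(days=cadence_days)).isoformat(),
    }
```

In the SQL extract itself, the equivalent is SYSDATE arithmetic as noted in the table's last column.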

Examples

Example of Required Directory Structure

Parent Directory

Screenshot of Example Parent Directory

Sub-Directory

Screenshot of Example Sub-Directory

Example of Required File Format for Submission: OBSERVATION_PERIOD table, 4 rows of data

"OBSERVATION_PERIOD_ID"|"PERSON_ID"|"OBSERVATION_PERIOD_START_DATE"|"OBSERVATION_PERIOD_END_DATE"|"PERIOD_TYPE_CONCEPT_ID"|
"1"|"1"|"2008-01-03"|"2010-11-05"|"44814722"|
"2"|"2"|"2008-10-04"|"2010-11-01"|"44814722"|
"3"|"3"|"2008-01-19"|"2010-10-10"|"44814722"|
"4"|"4"|"2009-06-24"|"2010-08-10"|"44814722"|