Workproducts to ETL CMS datasets into OMOP Common Data Model

ETL-CMS version 2.0.0

Release date: May 7, 2018

This project contains the source code to convert the public Centers for Medicare & Medicaid Services (CMS) Data Entrepreneurs' Synthetic Public Use File (DE-SynPUF) to .csv files suitable for loading into an OMOP Common Data Model v5.2 database.

The DE-SynPUF dataset contains 2.33 million synthetic patients, and we anticipate that this resource will be useful for researchers in developing OHDSI tools, as well as serve as a testbed for the analysis of observational health records.

The processed data can be retrieved from More details can be found here.

This is the first large, open synthetic dataset to be made available in CDM v5 format.

What's in Here?


A complete Python-based ETL of the DE-SynPUF data into CDMv5-compatible CSV files. See the file therein for detailed instructions for running the ETL, as well as creating and loading the data into a CDMv5 database.


The WhiteRabbit/RabbitInAHat files used to develop the ETL specification, along with an out-of-date ETL specification in DOCX format.


The scripts folder holds handy scripts for downloading and munging some of the raw data used in the ETL process. Instructions for their use can be found in the python_etl/ file.


@claire-oi hand-converted a couple of patients' worth of SynPUF data into CDMv5. Along the way she found several inconsistencies and ambiguities in the ETL specification, which have hopefully been addressed. Her internal notes, along with the sample patients and her hand-converted CDM outputs, are available in the hand_conversion directory.

Additional Resources

Version history

  • 1.0.0 First complete release, implementing version 5.0.0 of the OMOP CDM

  • 1.0.1 Bug fix affecting only the visit_occurrence table. Formerly, the visit_concept_id for all visits was set to the concept for an inpatient visit (9201). Now visits from the inpatient source data have visit_concept_id set to 9201, visits from outpatient source data are set to 9202, and visits from carrier claims source data are set to 0, as we cannot distinguish between inpatient and outpatient visits for carrier claims data. The new visit_occurrence.csv file has been uploaded to the FTP site, as well as a new file and we now retain versions of the ETL'd data within subdirectories at

  • 2.0.0 Leverages the 1.0.1 builder but transforms it to the CDM v5.2 format. Many thanks to Anthony Molinaro (@AnthonyMolinaro) for writing the translation script and to Lee Evans (@leeevans) for updating the OHDSI infrastructure with the updated copy.
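
The visit_concept_id assignment described in the 1.0.1 entry can be illustrated with a minimal Python sketch. The names `VISIT_CONCEPT_BY_SOURCE` and `visit_concept_id`, and the source labels `"inpatient"`, `"outpatient"`, and `"carrier"`, are hypothetical and chosen for illustration; the actual ETL code may organize this logic differently. Only the concept IDs (9201, 9202, 0) come from the description above.

```python
# Illustrative sketch only: the names and source labels below are
# assumptions, not taken from the ETL source code.

# Concept IDs as stated in the 1.0.1 release note.
VISIT_CONCEPT_BY_SOURCE = {
    "inpatient": 9201,   # inpatient visit
    "outpatient": 9202,  # outpatient visit
    "carrier": 0,        # carrier claims: inpatient vs. outpatient is unknown
}


def visit_concept_id(source_type: str) -> int:
    """Return the visit_concept_id for a claim, based on its source file type."""
    try:
        return VISIT_CONCEPT_BY_SOURCE[source_type.lower()]
    except KeyError:
        raise ValueError(f"unknown claim source: {source_type!r}") from None
```

For example, `visit_concept_id("Outpatient")` yields 9202, while an unrecognized source raises a `ValueError` rather than silently defaulting to an inpatient visit, which was the behavior the 1.0.1 fix corrected.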

History of contributions

An early release of the Python-based ETL-CMS software was developed by members of the CMS Working Group of the Observational Health Data Sciences and Informatics (OHDSI) community to process the DE-SynPUF files and to create OMOP CDM v5-compatible CSV files. Those contributors include:

Development was partial, stopping in August 2015. Researchers at the University of New Mexico resumed development in December 2015 and implemented the complete ETL, releasing version 1.0 on June 24, 2016. Documentation was created for running the ETL, creating an OMOP CDM v5 database, and loading the DE-SynPUF data. Among many improvements, the ETL was overhauled to implement the visit_occurrence, location, care_site, payer_plan_period, drug_era, and condition_era tables, and numerous deficiencies in concept mapping were rectified, in order to be feature-complete with the CDM v5. All tables now conform to the constraints defined in the schema. The contributors to this effort were:

  • Christophe Lambert @Christophe_Lambert, University of New Mexico, Center for Global Health, Division of Translational Informatics, Department of Internal Medicine
  • Praveen Kumar @Praveen_Kumar, University of New Mexico, Department of Computer Science
  • Amritansh @Amritansh, University of New Mexico, Department of Computer Science