This walk-through assumes that you have PostgreSQL (also known as Postgres) installed on your local machine. If you don't have Postgres, you can install it with one of the following commands:
brew install postgresql
(via Homebrew, for macOS users)
sudo apt-get install postgresql
(for Ubuntu users)
Further details on how to install Postgres can be found here.
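Either way, you can confirm that the client is available before continuing. A quick check (the version string will vary by installation):

```shell
# Check whether the psql client is on your PATH.
if command -v psql >/dev/null 2>&1; then
  psql --version
else
  echo "psql not found -- check your installation"
fi
```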
You can download prepared OMOP CDM files from ftp://ftp.ohdsi.org/synpuf
. For smaller-scale testing, you can simply download synpuf_1.zip
, which contains a small portion of the dataset (116,362 patients). synpuf_1.zip
contains a total of 19 files:
- care_site_1.csv
- condition_occurence_1.csv
- death_1.csv
- device_cost_1.csv
- device_exposure_1.csv
- drug_cost_1.csv
- drug_exposure_1.csv
- location_1.csv
- measurement_occurence_1.csv
- observation_1.csv
- observation_period_1.csv
- payer_plan_period_1.csv
- person_1.csv
- procedure_cost_1.csv
- procedure_occurence_1.csv
- provider_1.csv
- specimen_1.csv
- visit_cost_1.csv
- visit_occurence_1.csv
Place these files in a local directory.
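As a quick sanity check after unzipping, you can count the CSVs. This sketch assumes you run it from the directory where you placed the files:

```shell
# Count the synpuf_1 CSV files in the current directory (expect 19).
count=$(ls -1 ./*_1.csv 2>/dev/null | wc -l | tr -d ' ')
echo "Found ${count} of 19 expected CSV files"
```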
In order to set up a Postgres database, you will need to know your local machine's username. You can find out your username by typing the following command in terminal:
whoami
> username (e.g., johnsmith)
Now, we can connect to Postgres:
psql -U username -d postgres
- Note: postgres
is the default database name for newer installations. Your default database name may be different if you have an older installation.
After connecting to Postgres, let's create a new database called "ohdsi":
CREATE DATABASE ohdsi;
Use \list
to check that ohdsi
shows up in your list of databases.
By default, Postgres uses the public schema. You can create your own schema to host the data (recommended) with the following command:
CREATE SCHEMA synpuf5;
Remember to include the semicolon (;), which is required at the end of each SQL statement.
You will need to specify synpuf5
as your schema each time you launch psql:
set search_path to synpuf5;
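If typing this each session gets tedious, PostgreSQL also lets you attach a default search path to the database itself. This is an optional one-time alternative, using the ohdsi and synpuf5 names from above:

```sql
-- Optional: make synpuf5 the default schema for all
-- new connections to the ohdsi database.
ALTER DATABASE ohdsi SET search_path TO synpuf5;
```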
It's time to set up the tables in your ohdsi
database. You will find a set of SQL scripts in ETL-CMS/SQL/
. The first SQL script that we will run is create_CDMv5_tables.sql
. Return to the shell by exiting psql with \q
, then run the following command:
psql 'dbname=ohdsi user=username options=--search_path=synpuf5' -f create_CDMv5_tables.sql
You will see the following output in your console:
CREATE TABLE
CREATE TABLE
CREATE TABLE
...
This script creates tables for both the standardized vocabulary dataset and synthetic patient dataset.
We need to load two separate datasets into the empty tables. This involves two scripts.
The first dataset is the standardized vocabulary dataset, which was downloaded in Step 3 of the ETL instructions. Inside the standardized_vocab
directory, you should have the following files:
- CONCEPT.csv
- CONCEPT_ANCESTOR.csv
- CONCEPT_CLASS.csv
- CONCEPT_CPT4.csv
- CONCEPT_RELATIONSHIP.csv
- CONCEPT_SYNONYM.csv
- DOMAIN.csv
- DRUG_STRENGTH.csv
- RELATIONSHIP.csv
- VOCABULARY.csv
- Note: Make sure that you have appended the CPT4 concepts to the CONCEPT.csv file as described in Step 3.
Assuming you have downloaded all of the required standardized vocabulary files, run the load_CDMv5_vocabulary.sql
script. You can do so by executing the following command:
psql 'dbname=ohdsi user=username options=--search_path=synpuf5' -f load_CDMv5_vocabulary.sql -v data_dir={rel path to standardized_vocab}
If your standardized_vocab
directory is in python_etl
, your data_dir
above would be ../python_etl/standardized_vocab
.
You should expect the following output in your console:
COPY 239158
COPY 14455993
COPY 5461
COPY 0
...
The next step is to load the synthetic patient data into your empty tables. You can do so by executing the following command:
psql 'dbname=ohdsi user=username options=--search_path=synpuf5' -f load_CDMv5_synpuf.sql -v data_dir={rel path to synpuf_1}
Note: If you're using the smaller synpuf_1 dataset, you will need to make sure that load_CDMv5_synpuf.sql
is copying the correct CSV files. synpuf_1
has _1
appended to the end of each file's name, which is not reflected in the original version of the SQL script.
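One way to handle the _1 suffix is a simple text substitution on the script before running it. The sed transformation below is illustrative only, shown on a sample \copy line rather than the real script, whose exact contents may differ:

```shell
# Illustrative only: rewrite ".csv" references to "_1.csv".
# Applied to the real script, this might look like:
#   sed 's/\.csv/_1.csv/g' load_CDMv5_synpuf.sql > load_CDMv5_synpuf_1.sql
printf '%s\n' "\copy person from 'person.csv' csv" | sed 's/\.csv/_1.csv/g'
# prints: \copy person from 'person_1.csv' csv
```

Inspect the rewritten script before running it, since path prefixes inside the script may also need adjusting.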
Also, if you're loading the entire dataset from ftp://ftp.ohdsi.org/synpuf, you can load the gz
files as-is with the load_CDMv5_synpuf_gz.sql
script.
Finally, add the constraints and indices to the populated tables:
# add constraints
psql 'dbname=ohdsi user=username options=--search_path=synpuf5' -f create_CDMv5_constraints.sql
# add indices
psql 'dbname=ohdsi user=username options=--search_path=synpuf5' -f create_CDMv5_indices.sql
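With constraints and indices in place, a few row counts make a reasonable final sanity check. This is a sketch: it uses the synpuf5 schema created above, and the counts will depend on which dataset you loaded:

```sql
-- Sanity check: confirm the synthetic patient data loaded.
SELECT COUNT(*) FROM synpuf5.person;
SELECT COUNT(*) FROM synpuf5.observation_period;
```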