Pulls structured EMR data from an OMOP database using a given SQL query, and returns the results in a structured JSON file.
Create a virtual environment and install the dependencies from the requirements file. Then set up your config file depending on which database you'll be pulling from. To run the importer from the command line, run python emr_importer.py --config=<config file>
.
The JSON response contains a series of pairs of label
and data
. By default, the label is computed from the response dataframe to be the given row's column names. A custom script can also be passed in through the config file by specifying the name of the file (without the .py), which should contain a function called get_row_labels
that accepts a pandas Series and returns a JSON-serializable object.
The config file credentials object should have two entries in it: gcloud_project
, the project to connect to on Google Cloud Platform, and gcloud_credentials
, the JSON credentials file that's generated when a user goes through the authentication with Google Cloud Platform. You'll need to pip install -r requirements_bigquery.txt
to use this.
Instead of a credentials
object in the config JSON, pass in a filename
string.
Otherwise with an empty string or missing query
in the config file, this will return the entire table. For SQL queries, the table name must match the filename without the extension (and the filename can only have a limited charset, no '-' or '.').
The Importer accepts a custom Client object, which should have a query function that takes in a SQL query and returns a pandas dataframe.
The following examples use the bigquery-public-data
"Synthetic Patient Data in OMOP"
dataset provided by the GCP Marketplace. bigquery-public-data
must be added to the BigQuery-enabled GCP project that will go in the config file as the gcloud_project
parameter.
SELECT * FROM `bigquery-public-data.cms_synthetic_patient_data_omop.procedure_occurrence` LIMIT 100;
The results should be the same as those found in results/results_query1.json
.
SELECT observation.*, condition.condition_type_concept_id, procedure.procedure_type_concept_id FROM `bigquery-public-data.cms_synthetic_patient_data_omop.condition_occurrence` AS condition INNER JOIN `bigquery-public-data.cms_synthetic_patient_data_omop.observation_period` AS observation ON observation.person_id = condition.person_id INNER JOIN `bigquery-public-data.cms_synthetic_patient_data_omop.procedure_occurrence` AS procedure ON procedure.person_id = observation.person_id LIMIT 100;
Suppose we wanted to take some of the values from the BigQuery results and train a model to predict patient observation days. For the query above, we can put the following functions in label_extractor.py for serializing the data:
def get_row_label(row):
return (row.observation_period_end_date - row.observation_period_start_date).days
def get_row_value(row):
columns = ['condition_type_concept_id', 'procedure_type_concept_id']
return list(json.loads(row[columns].to_json()).values())
The results should be the same as those found in results/results_query2.json
.
You will need a test_config.json file, which can be made by copying the test_config_example.json file.
If not given a config file as a parameter, the importer will look in the root folder for one.
When joining multiple tables, you will need to exclude duplicate columns (see this StackOverflow question).