# Easy OMOP ETL via the Harmonization Copilot
The Harmonization CoPilot is a powerful tool designed to streamline the transformation of electronic health record (EHR) data into the OMOP common data model. This tutorial explores the process of harmonizing data using the Harmonization CoPilot, guiding you through the key steps involved in syntactic and semantic mapping, as well as running the data harmonization process. **In this example, we'll transform three clinical data tables into their corresponding OMOP tables: Person, Visit Occurrence, and Measurement.** 

## Overview

Harmonizing data is a multi-step process that involves both schema transformations (aka syntactic mapping) and code transformations (aka semantic mapping). Syntactic mapping establishes the structural foundation by defining how source tables and columns correspond to OMOP tables and fields. Once this structural foundation is in place, semantic mapping ensures that the data values align with standard terminologies. This step allows data fields, such as race, ethnicity, service type, and lab test names, to conform to OMOP's standard codes, enhancing consistency and interoperability. The process has three parts:

1. Create a Syntactic Mapping – Map the three source tables to their corresponding OMOP tables.
2. Create the Required Semantic Mappings – Harmonize race, ethnicity, type of service codes, and lab test names to OMOP standard codes.
3. Run the Data Harmonization Code Object – Apply the syntactic and semantic mappings to transform the data.


## Step 1: Investigate Your Data
The first step in any data harmonization project is to develop a deep understanding of the source data being transformed. The following questions can serve as a guide when reviewing the data:

1. ***What information am I standardizing?*** In this example, the goal is to standardize data about patients treated within my health system, their clinical visits (both inpatient and outpatient), and laboratory test results.
2. ***In which data tables is the information located?*** Often, multiple data warehouse tables store the relevant data (this scenario is supported by the Copilot). However, the scenario in this notebook is straightforward because each type of clinical event is stored in its own table.
3. ***In which columns is the information located?*** When transforming data to a common data model, you'll isolate relevant columns in your source data. Columns with information not required by the data model can be ignored. 
4. ***How are the columns encoded?*** Often, important information such as laboratory tests and hospital procedures is represented using vocabularies specific to a given hospital or health system. Identify these columns as targets for AI-powered semantic mapping. When columns already use standard vocabularies, semantic mapping via generative AI may not be necessary.
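
The questions above can be explored with a quick profile of each source table before importing it. The following is a minimal sketch using only the Python standard library; the sample rows and column names (`patient_id`, `test_name`, `result`, `collected_on`) are hypothetical stand-ins for your own extract. Columns with many near-duplicate spellings, such as `Hgb` and `HGB`, are typical candidates for semantic mapping.

```python
import csv
import io
from collections import Counter

# Hypothetical extract of a source laboratory table; replace with your own data.
SAMPLE_CSV = """patient_id,test_name,result,collected_on
1,Hgb,13.2,03/07/21
2,HGB,12.8,03/08/21
3,Glucose (fasting),98,03/08/21
4,Hgb,,03/09/21
"""


def profile_columns(csv_text, max_distinct=10):
    """Summarize each column: missing count, distinct count, most common values."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    profile = {}
    for column in rows[0].keys():
        values = [row[column] for row in rows]
        present = [v for v in values if v != ""]
        counts = Counter(present)
        profile[column] = {
            "missing": len(values) - len(present),
            "distinct": len(counts),
            "top_values": counts.most_common(max_distinct),
        }
    return profile


profile = profile_columns(SAMPLE_CSV)
for column, stats in profile.items():
    print(column, stats)
```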

Once you've developed a deep understanding of the data, [import the relevant tables onto the Rhino Federated Computing Platform](https://docs.rhinohealth.com/hc/en-us/articles/12385893636509-Creating-a-New-Dataset-or-Dataset-Version). 

![title](images/source_data.png)


## Step 2: Create Syntactic Mapping
Syntactic mapping serves as the blueprint for translating your source data into the OMOP model. It defines how each table and field will be mapped and transformed. Within the Harmonization CoPilot, users can create and manage these mappings through an intuitive interface. Follow these steps to create a new syntactic mapping:
1. Select **Data Mappings** on the left toolbar.
2. Select **Syntactic Mappings** on the tab menu.
3. Select the **Create Syntactic Mapping** button.
4. Select **OMOP V5.4** as the target data model.
5. Select **Manually Configure**. This allows you to use the user interface to design your syntactic mapping. Alternatively, you can upload a JSON file that specifies the mapping; we recommend the JSON upload only for power users!
6. Select your **Source Data Schemas**. You'll want to select the auto-generated schemas associated with each of the source datasets you uploaded in the previous step. If your dataset is named 'My Patient Data', for example, the schema will be named 'My Patient Data (v0)'. If you want to transform 3 source datasets, you'll select 3 schemas!
7. Select the **Target Tables** that you are transforming data into. Refer to the [OMOP Common Data Model documentation](https://ohdsi.github.io/CommonDataModel/cdm54.html) to identify the relevant target tables. In this case, we are transforming 'My Patient Data' into the OMOP 'Person' table, 'My Encounter Data' into the OMOP 'Visit Occurrence' table, and 'My Laboratory Data' into the OMOP 'Measurement' table.

![title](images/create_syntactic.png)


#### Using the graphical interface to design my ETL
Once a syntactic mapping is created, the dialog to map from source to target tables will automatically appear. 

The **Target Field column** lists all the columns associated with the target tables selected in the previous step. A red asterisk indicates that the target field is required - don't forget to provide mappings for these, or else you'll have trouble later! 

For each required target field, ***select one or more source fields*** to map to the target field. Click on the source field dropdown and you can select fields from any input data schemas for the syntactic mapping. If you select more than one field, make sure that you select fields from the same schema. 
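
The two rules above (every required target field needs a source, and multiple source fields must share one schema) can be checked mechanically. The sketch below is only an illustration: the dictionary layout is hypothetical and is not the Copilot's JSON mapping format, the source column names are invented, and the required-field set is a subset chosen for the example.

```python
# Hypothetical subset of required OMOP Person fields, for illustration only.
REQUIRED_TARGET_FIELDS = {
    "person_id",
    "gender_concept_id",
    "year_of_birth",
    "race_concept_id",
}

# Hypothetical mapping: target field -> list of (source schema, source column).
candidate_mapping = {
    "person_id": [("My Patient Data (v0)", "patient_id")],
    "gender_concept_id": [("My Patient Data (v0)", "gender")],
    "year_of_birth": [
        ("My Patient Data (v0)", "dob"),
        ("My Encounter Data (v0)", "admit_date"),  # deliberately from a second schema
    ],
}


def validate_mapping(mapping, required_fields):
    """Return a list of human-readable problems found in the mapping."""
    problems = []
    # Rule 1: every required target field must have at least one source field.
    for field in sorted(required_fields):
        if not mapping.get(field):
            problems.append(f"{field}: required target field has no source field")
    # Rule 2: all source fields for one target must come from the same schema.
    for field, sources in sorted(mapping.items()):
        schemas = sorted({schema for schema, _column in sources})
        if len(schemas) > 1:
            problems.append(f"{field}: source fields span multiple schemas: {schemas}")
    return problems


problems = validate_mapping(candidate_mapping, REQUIRED_TARGET_FIELDS)
for problem in problems:
    print(problem)
```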

If the source data needs to be modified to meet the requirements of the target field, specify a transformation in the **Transformation column**. You can create and edit transformations by clicking the pencil icon in the Transformation column.

![title](images/syntactic_1.png)


#### Column Transformations: The 'Workhorse' of Data Harmonization

In the Copilot, Transformations do the real work of data harmonization by modifying source data to comply with the specifications of a target data model. 

In mapping these three source tables to the OMOP Common Data Model, we used the following transformation types:

**Custom Mapping:** Maps values from the source to corresponding values in the target dataset, like mapping a number to the day of the week. In this example, a Custom Mapping transformation was used to map the source 'Gender' field to the 'gender_concept_id' field in the OMOP Person table. The transformation was specified as a CSV and pasted into the entry box:

> Male, 8507
> 
> Female, 8532
> 
> Other, 8521
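
To make the idea concrete, here is a minimal sketch of what a Custom Mapping does conceptually, using the number-to-weekday example mentioned above. It is not the Copilot's implementation; the function names `parse_custom_mapping` and `apply_custom_mapping` are hypothetical, and the choice to return `None` for unmapped values is an assumption.

```python
import csv
import io


def parse_custom_mapping(csv_text):
    """Parse 'source, target' lines into a lookup dictionary."""
    lookup = {}
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        source, target = (cell.strip() for cell in row)
        lookup[source] = target
    return lookup


def apply_custom_mapping(values, lookup, default=None):
    """Map each value through the lookup; unmapped values become `default`."""
    return [lookup.get(str(value).strip(), default) for value in values]


day_lookup = parse_custom_mapping("1, Monday\n2, Tuesday\n3, Wednesday\n")
mapped_days = apply_custom_mapping(["2", 3, "9"], day_lookup)
print(mapped_days)
```

Values without an entry in the mapping surface as `None` here, which makes gaps in the mapping easy to spot before the data is loaded.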

**Semantic Mapping:** Applies a semantic mapping to transform values, like mapping an input text to an OMOP concept ID. OMOP domains are helpful high-level categories that allow non-experts to select the appropriate set of standardized codes for mapping non-standard codes.

In this example, four semantic mappings were created:
1. Source column *gender* to OMOP Gender domain.
2. Source column *race* to OMOP Race domain.
3. Source column *service_type* to OMOP Visit domain.
4. Source column *test_name* to OMOP Measurement domain. 
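
Once a semantic mapping has been reviewed and approved, applying it amounts to a lookup from source term to concept ID. The sketch below illustrates that step only; it does not show how the Copilot generates candidate mappings. The dictionary is a hypothetical approved mapping for the *service_type* column, using the OMOP Visit concepts 9201 (Inpatient Visit) and 9202 (Outpatient Visit), and the fallback of 0 for unmapped terms is an assumption.

```python
# Hypothetical approved mapping for the source 'service_type' column.
# 9201 = Inpatient Visit, 9202 = Outpatient Visit (OMOP Visit domain).
SERVICE_TYPE_TO_VISIT_CONCEPT = {
    "inpatient": 9201,
    "outpatient": 9202,
}


def map_to_concept(value, lookup, unmapped_concept_id=0):
    """Normalize a source term and return its concept ID, or the fallback."""
    key = (value or "").strip().lower()
    return lookup.get(key, unmapped_concept_id)


source_values = ["Inpatient", " outpatient ", "Telehealth", None]
visit_concept_ids = [
    map_to_concept(value, SERVICE_TYPE_TO_VISIT_CONCEPT) for value in source_values
]
print(visit_concept_ids)
```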

**Set Value:** Assigns a specific value to all rows in the field, like setting all values to the number 1.

(*Helpful Hint:* when information is missing for a required OMOP concept field, set a value of 0, which is the OMOP concept_id for 'No matching concept'.)

**Convert Date:** Changes the date to a different format. For example, you can use this to convert a date from MM/DD/YY format to YYYY-MM-DD.
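
As an illustration of the conversion described above, the following sketch reformats a MM/DD/YY date into YYYY-MM-DD with Python's standard `datetime` module. It is not the Copilot's implementation, and the function name `convert_date` is hypothetical.

```python
from datetime import datetime


def convert_date(value, source_format="%m/%d/%y", target_format="%Y-%m-%d"):
    """Parse `value` using `source_format` and re-emit it in `target_format`."""
    return datetime.strptime(value.strip(), source_format).strftime(target_format)


converted = convert_date("03/07/21")
print(converted)
```

Two-digit years deserve care: Python's `%y` maps 69-99 to 1969-1999 and 00-68 to 2000-2068, so a birth date such as `03/07/45` would be read as 2045. Check columns like date of birth for this before relying on a two-digit source format.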

**Stable UUID:** Generates an encrypted unique identifier based on the input. The same input will always generate the same unique identifier.
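
The "same input, same identifier" property can be illustrated with Python's name-based UUIDs. This is a sketch under assumptions, not the Copilot's algorithm: `uuid.uuid5` derives the identifier from a SHA-1 hash of a namespace and the input, and the namespace string below is a hypothetical placeholder.

```python
import uuid

# Hypothetical fixed namespace; reusing the same one keeps identifiers stable.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.org/harmonization")


def stable_uuid(value):
    """Return a deterministic UUID string for the given source identifier."""
    return str(uuid.uuid5(NAMESPACE, str(value).strip()))


first = stable_uuid("MRN-000123")
second = stable_uuid("MRN-000123")
other = stable_uuid("MRN-000124")
print(first == second, first == other)
```

Because the output depends only on the input and the namespace, re-running a harmonization over the same source records yields the same identifiers, which keeps rows joinable across runs.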

![title](images/syntactic_2.png)

## Step 3: Execute Data Harmonization via the User Interface or SDK
With both syntactic and semantic mappings in place, the final step is executing the data harmonization process. The Harmonization CoPilot automates this transformation, applying the defined mappings to convert raw EHR data into the OMOP standard. This can be performed either through the user interface of Rhino's web platform or through Rhino's Python SDK, which enables users to build automated data pipelines. The process takes place within the secure environment of the Rhino Client, maintaining compliance with data privacy policies.

#### Execute Data Harmonization via the User Interface: Ideal for One-Time Harmonization
To initiate the harmonization process, users select the relevant syntactic mapping, input datasets, and associated semantic mappings. The system then processes the data, applying transformations and standardizations in a structured manner. Throughout this process, users can monitor progress and review detailed logs to ensure accuracy. Once completed, the harmonized dataset is stored within the same environment, ready for further analysis or export.
![title](images/code_run.png)


#### Execute Data Harmonization via the SDK: Ideal for Repeated Harmonization
Users who want to configure a reusable data pipeline can leverage the Rhino Python SDK. Once the Copilot has been used to create a Data Harmonization Code Object, develop a script that performs the following actions:
1. Retrieves new data from an enterprise data warehouse
2. Executes Data Harmonization Code Object(s) to generate harmonized datasets
3. Exports harmonized datasets
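
The three actions above can be tied together in a small driver function. The sketch below is a hypothetical skeleton rather than SDK code: each step is passed in as a callable so that the SDK calls shown in the following sections can be plugged in, and the stand-in lambdas exist only so the example runs without an FCP connection.

```python
from typing import Callable, List


def run_pipeline(
    import_step: Callable[[], List[str]],
    harmonize_step: Callable[[List[str]], List[str]],
    export_step: Callable[[List[str]], None],
) -> List[str]:
    """Run import -> harmonize -> export and return the harmonized dataset UIDs."""
    input_uids = import_step()
    if not input_uids:
        raise RuntimeError("Import step produced no datasets; nothing to harmonize.")
    output_uids = harmonize_step(input_uids)
    export_step(output_uids)
    return output_uids


# Stand-in steps so the sketch runs on its own; replace with real SDK calls.
exported = []
result = run_pipeline(
    import_step=lambda: ["input-uid-1"],
    harmonize_step=lambda uids: [f"{uid}-out" for uid in uids],
    export_step=exported.extend,
)
print(result)
```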

##### Authenticate to the FCP 

```python
import rhino_health as rh
from getpass import getpass
from rhino_health.lib.endpoints.code_object.code_object_dataclass import (
    CodeObjectCreateInput,
    CodeTypes,
    CodeObjectRunInput,
)

# Enter Rhino username and password
my_username = "daniel@rhinohealth.com"  # REPLACE
session = rh.login(username=my_username, password=getpass())

# Identify the project and workgroup by UID
project_uid = '845e30e1-519f-464c-935a-89705936e482'  # REPLACE
workgroup_uid = 'e590e0fa-ae37-48b3-b50e-c232536cefab'  # REPLACE
```



##### Retrieve new data from an enterprise data warehouse
Rhino's SDK offers the ability to retrieve data from a relational database and seamlessly create datasets within the FCP, which are the input to the Harmonization Copilot. You can adapt the code below to your own database and Copilot project. 

```python
# Import path assumed from the SDK's sql_query module; adjust if your SDK version differs.
from rhino_health.lib.endpoints.sql_query.sql_query_dataclass import (
    ConnectionDetails,
    SQLQueryImportInput,
    SQLServerTypes,
)

# Define database connection details
connection_details = ConnectionDetails(
    server_user="user",  # REPLACE
    password=getpass(),  # getpass was imported above with `from getpass import getpass`
    server_type=SQLServerTypes.POSTGRESQL,  # see docs for all supported types
    server_url="url",  # REPLACE
    db_name="db_name",  # REPLACE
)

# Define SQL query and other parameters
import_run_params = SQLQueryImportInput(
    session=session,
    project=project_uid,
    workgroup=workgroup_uid,
    connection_details=connection_details,
    cohort_name="dataset_name",  # REPLACE
    data_schema_uid=None,  # auto-generate the output data schema
    timeout_seconds=1200,
    is_data_deidentified=True,
    sql_query="SELECT * FROM schema.my_table",  # REPLACE
)

# Execute the SQL query, then look up the resulting dataset by name
response = session.sql_query.import_cohort_from_sql_query(import_run_params)
# Assumes `session.dataset.get_dataset_by_name` is available in your SDK version.
updated_dataset = session.dataset.get_dataset_by_name("dataset_name", project_uid=project_uid)
```


##### Execute Data Harmonization Code Object(s)
Navigate to the 'Code Objects' sidebar in the user interface of the Rhino FCP and copy the UID of the 'Data Harmonization' Code Object associated with your Copilot implementation. This will be used to configure the SDK code that executes the ETL process. 

<img src=images/copy_uid.png alt="drawing" width="400"/>


```python
# Configure the Data Harmonization Code Run
code_object_params = CodeObjectRunInput(
    code_object_uid='06955fb9-5ee1-4c1f-ad3c-e6e66d40fc0b',  # REPLACE
    input_dataset_uids=[['448dd222-d90d-47a4-8eb1-d1381cd10e57']],  # REPLACE
    output_dataset_naming_templates=['{{ input_dataset_names.0 }}-out'],
    timeout_seconds=300,  # REPLACE
)

# Run the Data Harmonization Code Object and wait for it to finish
code_run = session.code_object.run_code_object(code_object_params)
run_result = code_run.wait_for_completion()
# harmonized_dataset = run_result.output_dataset
errors = run_result.results_info.get('errors') if run_result.results_info else None
print(f"Result status is '{run_result.status.value}', errors={errors}")
```

When this cell was run for this tutorial, the request failed with the following error, so confirm that your FCP deployment supports running 'Data Harmonization' Code Objects through the SDK before building on this step:

```
Exception: Failed to make request
Status is 400, Error: , Content is b'{"errors":[{"title":"Validation Error","message":"Running code objects of type \'Data Harmonization\' is not supported","extra_info":{}}]}'
```



##### Export dataset to your Rhino client
Once the datasets are created, you can export them as CSVs onto your Rhino client (aka server). These CSVs can then be automatically loaded back into your enterprise data warehouse or sent to research collaborators or reporting agencies.

```python
session.dataset.export_dataset(
    dataset_uid='448dd222-d90d-47a4-8eb1-d1381cd10e57',  # REPLACE
    output_location="test",  # REPLACE: the API requires a valid string location
    output_format="csv",  # the API accepts 'csv' or 'json' (lowercase); 'CSV' is rejected
)
```

