# Notebook #1: Data Extraction
### Importing tabular data onto Rhino with SQL queries
In this notebook, you'll use SQL to query an external database (such as a health system's clinical data warehouse) and import the results of those queries onto the Rhino Federated Computing Platform.

#### Import the Rhino Health Python library
The code below imports classes and functions from the `rhino_health` library, the Python SDK for interacting with the Rhino Federated Computing Platform. More information about the SDK can be found in our <a href="https://rhinohealth.github.io/rhino_sdk_docs/html/autoapi/index.html" target="_blank">Official SDK Documentation</a> and on our <a href="https://pypi.org/project/rhino-health/" target="_blank">PyPI Repository Page</a>.

In [None]:
pip install --upgrade rhino_health

In [2]:
import getpass
from pprint import pprint
import rhino_health as rh
from rhino_health.lib.endpoints.sql_query.sql_query_dataclass import (
    SQLQueryImportInput,
    SQLQueryInput,
    SQLServerTypes,
    ConnectionDetails,
)

### Authenticate to the Rhino FCP
The `RhinoSession` class in the `rhino_health` library is the main interface to the Rhino Health platform's endpoints, including Code Objects, Datasets, Data Schemas, Code Runs, Projects, and Workgroups. It also supports two-factor authentication and user switching, so a single notebook can work with different user sessions.

In [3]:
my_username = "adrish+2@rhinohealth.com" # Replace this with the email you use to log into Rhino Health
session = rh.login(username=my_username, password=getpass.getpass(), rhino_api_url='https://dev.rhinohealth.com/api/')  # Change the URL to match your Rhino instance
print("Logged In")

 ········


Logged In
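The login cell above hardcodes the username and API URL. As a minimal sketch of an alternative, the helper below reads both from environment variables so they stay out of the notebook. The variable names `RHINO_USERNAME` and `RHINO_API_URL` and the helper itself are hypothetical, chosen only for this illustration; they are not part of the `rhino_health` SDK.

```python
import os


def resolve_login_settings(env=os.environ):
    """Return (username, api_url) read from a mapping of environment variables."""
    # RHINO_USERNAME and RHINO_API_URL are hypothetical names used for this sketch.
    username = env.get("RHINO_USERNAME", "").strip()
    if not username:
        raise ValueError("Set RHINO_USERNAME to the email you use to log into Rhino Health")

    # Fall back to the URL used in the login cell above.
    api_url = env.get("RHINO_API_URL", "https://dev.rhinohealth.com/api/").strip()
    # Normalize to a trailing slash, matching the form used in the login cell.
    if not api_url.endswith("/"):
        api_url += "/"
    return username, api_url
```

The returned values could then be passed to `rh.login(username=..., password=getpass.getpass(), rhino_api_url=...)` exactly as in the cell above.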


### Identify the desired project in the Rhino UI
Before completing this step using the Python SDK, create a project on the Rhino web platform. Then copy the project's UID from the UI: navigate to the homepage, click the three-vertical-dot button in your project's square, and select Copy UID.

![Copy Project UID Screenshot](./img/copy_uid.png)

In [4]:
project_uid = '942654f6-d292-4e53-bde1-406df31de2fd' # Replace with your Project's UID

workgroup_uid = session.current_user.primary_workgroup.uid
print(workgroup_uid)

48cb366f-b05f-4ca2-8e1d-6dfc336cd344
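Because the project UID is pasted by hand, a quick format check can catch a truncated or mistyped value before any API call is made. This is a minimal sketch that assumes UIDs are canonical hyphenated UUID strings, as the two values shown above appear to be; `is_valid_uid` is a hypothetical helper, not an SDK function.

```python
import uuid


def is_valid_uid(value):
    """Return True if `value` is a canonical, hyphenated UUID string."""
    try:
        # Round-trip through uuid.UUID and require the canonical hyphenated form.
        return str(uuid.UUID(value)) == value.lower()
    except (ValueError, AttributeError, TypeError):
        # Raised for malformed strings, non-string inputs, and None respectively.
        return False
```

For example, `is_valid_uid(project_uid)` could be asserted right after the assignment in the cell above.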


### Connection Setup
The `rhino_health.lib.endpoints.sql_query.sql_query_dataclass` module in the Rhino Health library provides classes to handle SQL queries against external databases and import data into the Rhino Federated Computing Platform. It includes `SQLQueryInput` for specifying parameters of a SQL query, `SQLQueryImportInput` for importing a Dataset from an external SQL database query, and `SQLQuery`, a class representing an executed SQL query. Additional classes like `QueryResultStatus` and `SQLServerTypes` define the status of query results and supported SQL server types, respectively, while the `ConnectionDetails` class specifies connection details for an external SQL database.

More information about Rhino's SQL classes can be found by reviewing our SDK documentation <a href="https://rhinohealth.github.io/rhino_sdk_docs/html/autoapi/rhino_health/lib/endpoints/sql_query/index.html" target="_blank">here</a>.

##### A note about `server_type`:
When specifying the connection details, provide `server_type` as a member of the `SQLServerTypes` enum. Using the enum ensures that your server type is one the platform supports for querying.

In [5]:
sql_db_user = "rhino" # Replace this with your DB username (make sure the user has read-only permissions to the DB).
external_server_url = "ext-hospital-data.covi47dnmpiy.us-east-1.rds.amazonaws.com:5432" # Replace this with the URL and port of the SQL DB you want to query (i.e. "{url}:{port}").
db_name = "hospital_data" # Replace this with your DB name.

connection_details = ConnectionDetails(
    server_user=sql_db_user,
    password=getpass.getpass(),    
    server_type=SQLServerTypes.POSTGRESQL, # Replace POSTGRESQL with the relevant type of your sql server (See docs for all supported types).
    server_url=external_server_url,
    db_name=db_name
)

 ········
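The `server_url` value above is expected in `"{url}:{port}"` form. As a minimal sketch, a small check like the following can catch a missing or malformed port before the connection details are sent. `split_host_port` is a hypothetical helper written for this illustration; the SDK may perform its own validation.

```python
def split_host_port(server_url):
    """Split a '{url}:{port}' string into (host, port), validating the port."""
    # rpartition splits on the last colon, so the port is whatever follows it.
    host, sep, port_text = server_url.rpartition(":")
    if not sep or not host:
        raise ValueError(f"Expected '{{url}}:{{port}}', got {server_url!r}")
    if not port_text.isdecimal() or not 1 <= int(port_text) <= 65535:
        raise ValueError(f"Invalid port {port_text!r} in {server_url!r}")
    return host, int(port_text)
```

Calling `split_host_port(external_server_url)` before building `ConnectionDetails` would raise a `ValueError` for inputs such as a bare hostname.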


### Writing SQL queries against the DB
Using the `SQLQueryImportInput` class, we can query an external relational database and import the results of the query as a Dataset. A Dataset is a central concept on the Rhino platform; to learn more, please navigate to this <a target="_blank" href="https://docs.rhinohealth.com/hc/en-us/articles/12384748397213-What-is-a-Cohort-">link</a>.

Constructing a `SQLQueryImportInput` requires a few arguments:
- dataset_name (str): Name for the Dataset you are creating.
- is_data_deidentified (bool): Indicates if the data in the query is deidentified for privacy reasons.
- connection_details (ConnectionDetails): Details like URL, user, and password to connect to the SQL server.
- data_schema_uid (Optional[str]): The unique identifier for the data schema in the context of the query.
- timeout_seconds (int): Time limit in seconds for the query execution.
- project_uid (str): Unique identifier for the project context of the query.
- workgroup_uid (str): Unique identifier for the workgroup context of the query.
- sql_query (str): The actual SQL query to be run.

#### Table 1: Patient Admission Data
Our first query will retrieve patient demographics and associated clinical codes from inpatient admissions for patients with chest x-rays (see the WHERE clause, where we identify a selection of chest x-rays in the MIMIC v4 database).

In [7]:
query_demo="""
SELECT DISTINCT
      pat.subject_id
    , adm.hadm_id
    , pat.anchor_age + (EXTRACT(YEAR FROM adm.admittime) - pat.anchor_year) AS age
    , pat.gender
    , adm.insurance
    , adm.admission_type
    , adm.admission_location
    , adm.discharge_location
    , adm.language
    , adm.marital_status
    , adm.race
    , icd.icd_code AS diagnosis_code
    , proc.icd_code AS procedure_code
FROM mimiciv_hosp.admissions adm
LEFT JOIN mimiciv_hosp.patients pat
ON pat.subject_id = adm.subject_id
LEFT JOIN mimiciv_hosp.diagnoses_icd icd
ON adm.subject_id = icd.subject_id
AND adm.hadm_id = icd.hadm_id
LEFT JOIN mimiciv_hosp.procedures_icd proc
ON adm.subject_id = proc.subject_id
AND adm.hadm_id = proc.hadm_id
LEFT JOIN mimiciv_cxr.study_list study
ON adm.subject_id = study.subject_id
WHERE study.study_id in(55199984,58487107,50127595,53092856,56675999,50331333,58713162,54624197,55037150,56783987,57734186,51764355,55223142,58846671,
51274834,57047258,56875381,51658914,51800155,51750028,53717084,59019496,59536212,50393027,55239920,55263578,56074305,50077246,52592881,53301121,54924087,
55068499,54726934,51737379,58221226,55616331,53180166,59376223,53396044,58664976,53106744,56018087,59978743,54837632,58547191,51509541,58831216,59141448,
55658939,54588572,57483156,53317659,56888594,53438164,50703768,59292343,50896309,51977596,55388853,51014962,50785186,59355587,57119564,59771012,53184881,
56463743,51268111,52269885,53127212,53528101,55376584,56064916,56626758,57244947,58152399,58562238,56867934)
"""
import_run_params = SQLQueryImportInput(
    session = session,
    project = project_uid, # The project/workgroup will be used to validate permissions (including the k_anonymization value)
    workgroup = workgroup_uid,
    connection_details = connection_details,
    dataset_name = 'mimic_ehr_demo_hco',
    data_schema_uid = None, # Auto-Generating the Output Data Schema for the Dataset
    timeout_seconds = 1200,
    is_data_deidentified = True,
    sql_query = query_demo
)

response = session.sql_query.import_dataset_from_sql_query(import_run_params)

Waiting for SQL query to complete (0 hours 0 minutes and a second)
Waiting for SQL query to complete (0 hours 0 minutes and 12 seconds)
Run finished successfully
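The `age` column in the query above is computed as `anchor_age` plus the number of calendar years between `anchor_year` and the admission year. The same arithmetic can be sketched in Python as follows; the input values are illustrative only and are not taken from the dataset.

```python
from datetime import datetime


def age_at_admission(anchor_age, anchor_year, admittime):
    """Mirror the SQL expression: anchor_age + (YEAR(admittime) - anchor_year)."""
    return anchor_age + (admittime.year - anchor_year)


# Illustrative values only: a patient aged 52 in anchor year 2180, admitted in 2183.
print(age_at_admission(52, 2180, datetime(2183, 3, 14, 8, 30)))  # 55
```

Because only the year is used, the result is a whole-year approximation and does not account for the month or day of admission.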


#### Table 2: EHR Observations
Our second query will retrieve observations from our clinical information system, including patient BMI, height, weight, and diastolic and systolic blood pressure.
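The query in the following cell aggregates with `max(omr.result_value)`. If `result_value` is stored as text in your database (an assumption worth checking against your own schema), `max` compares strings character by character rather than numerically. The sketch below shows the difference in plain Python with made-up values; `compare_max` and `to_float` are hypothetical helpers for this illustration.

```python
def to_float(text):
    """Return the float value of `text`, or None if it is not numeric."""
    try:
        return float(text)
    except ValueError:
        return None


def compare_max(values):
    """Return (text_max, numeric_max) for a list of string values."""
    # Keep only values that parse as numbers, e.g. skipping "120/80".
    numeric = [v for v in values if to_float(v) is not None]
    text_max = max(values)  # string comparison
    numeric_max = max(numeric, key=float) if numeric else None
    return text_max, numeric_max


# Made-up values: as strings, "9.5" sorts above "101.0" because "9" > "1".
print(compare_max(["9.5", "10.2", "101.0"]))  # ('9.5', '101.0')
```

If numeric ordering matters for a downstream analysis, casting or parsing the value before aggregating is one option.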

In [8]:
query_obs = """
SELECT
   omr.subject_id,
   omr.chartdate,
   omr.result_name,
   max(omr.result_value) as result
FROM mimiciv_hosp.omr omr
LEFT JOIN mimiciv_cxr.study_list study
ON omr.subject_id = study.subject_id
WHERE study.study_id in (55199984,58487107,50127595,53092856,56675999,50331333,58713162,54624197,55037150,56783987,57734186,51764355,55223142,58846671,
51274834,57047258,56875381,51658914,51800155,51750028,53717084,59019496,59536212,50393027,55239920,55263578,56074305,50077246,52592881,53301121,54924087,
55068499,54726934,51737379,58221226,55616331,53180166,59376223,53396044,58664976,53106744,56018087,59978743,54837632,58547191,51509541,58831216,59141448,
55658939,54588572,57483156,53317659,56888594,53438164,50703768,59292343,50896309,51977596,55388853,51014962,50785186,59355587,57119564,59771012,53184881,
56463743,51268111,52269885,53127212,53528101,55376584,56064916,56626758,57244947,58152399,58562238,56867934)
GROUP BY omr.subject_id, omr.chartdate, omr.result_name
"""

import_run_params = SQLQueryImportInput(
    session = session,
    project = project_uid, # The project/workgroup will be used to validate permissions (including the k_anonymization value)
    workgroup = workgroup_uid,
    connection_details = connection_details,
    dataset_name = 'mimic_ehr_obs_hco',
    data_schema_uid = None, # Auto-Generating the Output Data Schema for the Dataset
    timeout_seconds = 1200,
    is_data_deidentified = True,
    sql_query = query_obs
)

response = session.sql_query.import_dataset_from_sql_query(import_run_params)

Waiting for SQL query to complete (0 hours 0 minutes and 2 seconds)
Waiting for SQL query to complete (0 hours 0 minutes and 13 seconds)
Run finished successfully
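Both queries above repeat the same list of study IDs in their `WHERE` clauses. As a minimal sketch, the filter could be generated once from a Python list and interpolated into each query string. `build_study_filter` is a hypothetical helper, and the two IDs in the usage note are just the first two from the list above.

```python
def build_study_filter(study_ids, column="study.study_id"):
    """Build a SQL 'column IN (...)' clause from a list of integer study IDs."""
    if not study_ids:
        raise ValueError("study_ids must not be empty")
    # int() coerces each value to an integer and raises on non-numeric text,
    # so arbitrary strings cannot end up inside the SQL.
    ids = ", ".join(str(int(s)) for s in study_ids)
    return f"{column} IN ({ids})"
```

For example, `build_study_filter([55199984, 58487107])` returns `study.study_id IN (55199984, 58487107)`, which could follow `WHERE` in both `query_demo` and `query_obs`.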


### Images: Importing chest x-rays from a PACS system into my Rhino client
Next, we'll import chest x-rays into our project so that we can conduct a computer vision experiment. 

**To enable a frictionless guided sandbox experience, Rhino staff have uploaded DICOM data into the project for you.** If you are interested in learning more about how data can be imported from your local computing environment into the Rhino Federated Computing Platform, please refer to this section of our documentation <a target="_blank" href="https://docs.rhinohealth.com/hc/en-us/articles/12385912890653-Adding-Data-to-your-Rhino-Federated-Computing-Platform-Client">here</a>.

The data has been loaded in the `/rhino_data/image/dicom` path in the Rhino client. In addition, a file that provides metadata to associate the DICOM studies with the EHR data has been imported (`/rhino_data/image/metadata/hco_metadata.csv`).

In [9]:
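# DatasetCreateInput is not imported in the first cell, so import it here
from rhino_health.lib.endpoints.dataset.dataset_dataclass import DatasetCreateInput

# Replace with file locations if needed
dicom_path = "/rhino_data/dicom"
metadata_file = "/rhino_data/hco_dataset.csv"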

dataset_creation_params = DatasetCreateInput(
    name="mimic_cxr_hco",
    description="mimic_cxr_hco",
    project_uid=project_uid, 
    workgroup_uid=workgroup_uid,
    data_schema_uid = None,
    image_filesystem_location=dicom_path,
    csv_filesystem_location = metadata_file,
    is_data_deidentified=True,
    method="filesystem",
)

hco_image_dataset = session.dataset.add_dataset(dataset_creation_params)
print(f"Created new Dataset '{hco_image_dataset.name}' with uid '{hco_image_dataset.uid}'")

Created new Dataset 'mimic_cxr_hco' with uid 'b3321c41-85f1-492b-b56f-a6fa99c5c79e'


### What you'll see in the Rhino UI:
Once both SQL queries and the DICOM import have completed, you should see three Datasets in the user interface:
![Mimic Datasets in the FCP](./img/mimic_datasets.png)

### Where is my data in the Rhino client?  
Once data is uploaded, it'll reside in your designated Rhino client. While the Rhino Federated Computing Platform eliminates the need for the user to know the path of the data (enabling users just to refer to 'Datasets'), the data will reside in the `/rhino_data/image/dicom` folder.
![Dataset Container Paths](./img/dataset_container_path.png)

To learn more about working with DICOM data on the Rhino Federated Computing Platform, please refer to our documentation <a target="_blank" href="https://docs.rhinohealth.com/hc/en-us/articles/13136536913693-Example-1-Defining-a-Cohort-with-DICOM-Data">here</a>.