# Notebook #2: Data Engineering
### Transforming data across multiple nodes
In this notebook, we'll convert chest X-rays from the DICOM format (a standard format for medical images in clinical information systems) to JPEG files. This is a critical step in most machine learning pipelines that include image classification, segmentation, or other computer vision tasks. Because the data in this example is distributed across multiple health systems and *will remain on the edge*, we'll have to send the code that converts DICOM to JPEG to each health system's server.
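
The conversion itself happens inside the container at each site, and its code is not shown in this notebook. As a rough illustration of the central step, the sketch below rescales raw pixel intensities (DICOM images are commonly stored with more than 8 bits per pixel) to the 0-255 range that JPEG expects. It assumes simple min-max scaling on a nested list; `scale_to_8bit` is a hypothetical helper, and a real converter would typically read the pixel data with a DICOM library and may apply windowing instead.

```python
def scale_to_8bit(pixels):
    """Linearly rescale a 2-D list of raw intensities to integers in 0-255.

    Hypothetical helper for illustration only; not part of the Rhino SDK.
    """
    flat = [value for row in pixels for value in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A constant image carries no contrast, so map every pixel to 0.
        return [[0 for _ in row] for row in pixels]
    return [[round((value - lo) * 255 / (hi - lo)) for value in row] for row in pixels]


# Toy 2x2 "image" with 12-bit intensities.
raw_pixels = [[0, 1024], [2048, 4095]]
scaled_pixels = scale_to_8bit(raw_pixels)
print(scaled_pixels)  # [[0, 64], [128, 255]]
```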

#### Import the Rhino Health Python library
We'll again import any necessary functions from the `rhino_health` library and authenticate to the Rhino Cloud. Please refer to Notebook #1 for an explanation of the `session` interface for interacting with various endpoints in the Rhino Health ecosystem. In addition, you can always find more information about the Rhino SDK on our <a target="_blank" href="https://rhinohealth.github.io/rhino_sdk_docs/html/autoapi/index.html">Official SDK Documentation</a> and on our <a target="_blank" href="https://pypi.org/project/rhino-health/">PyPI Repository Page</a>.

In [None]:
import getpass
import rhino_health as rh
from rhino_health.lib.endpoints.code.code_object_dataclass import (
    CodeObject,
    CodeObjectCreateInput,
    CodeObjectRunInput,
    CodeTypes,
    CodeRunType
)

my_username = "FCP_LOGIN_EMAIL" # Replace this with the email you use to log into Rhino Health
session = rh.login(username=my_username, password=getpass.getpass())

#### Retrieve Project and Relevant Datasets
In the previous notebook we interfaced with the `Project` dataclass by first retrieving the project's unique identifier from the Rhino web platform (via copy & paste). In contrast, in this notebook we'll accomplish this by using the `get_project_by_name()` function (but either way is fine!). 

Each instance of the `Project` class has several helpful attributes, including `description` and `permissions`, that can be accessed easily through the SDK. In this example, we'll use the `collaborating_workgroups` property to retrieve our workgroups and index them by name and by UID, which we'll use later when we perform the DICOM to JPEG transformation on 'the edge'.

In [None]:
project = session.project.get_project_by_name("YOUR_PROJECT_NAME")  # Replace with your project name

collaborators = project.collaborating_workgroups
workgroups_by_name = {x.name: x for x in collaborators}
workgroups_by_uid = {x.uid: x for x in collaborators}
hco_workgroup = workgroups_by_name["YOUR_HEALTH_SYSTEM_WORKGROUP"]
aidev_workgroup = workgroups_by_name["YOUR_AI_DEVELOPMENT_WORKGROUP"]

print(f"Found workgroups '{aidev_workgroup.name}' and '{hco_workgroup.name}'")
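
A plain dictionary lookup raises a bare `KeyError` if a workgroup name is misspelled. The sketch below shows one way to make that failure easier to diagnose by listing the names that do exist. `FakeWorkgroup` and `find_workgroup` are hypothetical stand-ins so the example runs without a Rhino session; the only assumption about real workgroup objects is that they expose `name` and `uid`, as used above.

```python
from dataclasses import dataclass


@dataclass
class FakeWorkgroup:
    """Hypothetical stand-in for an SDK workgroup object (name and uid only)."""
    name: str
    uid: str


def find_workgroup(workgroups, name):
    """Return the workgroup called `name`, or raise a KeyError listing valid names."""
    by_name = {wg.name: wg for wg in workgroups}
    try:
        return by_name[name]
    except KeyError:
        available = ", ".join(sorted(by_name))
        raise KeyError(f"No workgroup named '{name}'. Available: {available}") from None


demo_workgroups = [
    FakeWorkgroup(name="Health System", uid="uid-1"),
    FakeWorkgroup(name="AI Dev", uid="uid-2"),
]
print(find_workgroup(demo_workgroups, "AI Dev").uid)  # uid-2
```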

#### Retrieve chest X-ray data from both participating sites
Now that we've identified both of the collaborating workgroups involved in our project, we can retrieve the identifiers for the datasets that each workgroup uploaded to their respective Rhino clients. In a later step, we'll use the dataset identifiers to execute the DICOM to JPEG transformation code on each respective dataset. 

In [None]:
datasets = project.datasets
datasets_by_workgroup = {workgroups_by_uid[x.workgroup_uid].name: x for x in datasets}
hco_cxr_dataset = project.get_dataset_by_name("mimic_cxr_hco")
aidev_cxr_dataset = project.get_dataset_by_name("mimic_cxr_dev")
hco_cxr_dataset_uid = hco_cxr_dataset.uid
aidev_cxr_dataset_uid = aidev_cxr_dataset.uid
print(f"Loaded CXR datasets '{hco_cxr_dataset.uid}', '{aidev_cxr_dataset.uid}'")
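
Note that a dictionary comprehension keyed by workgroup name keeps only one dataset per workgroup: if a workgroup owns several datasets, later ones overwrite earlier ones. The sketch below groups datasets into lists instead. `FakeDataset` and `group_datasets_by_workgroup` are hypothetical names used so the example runs on its own; it assumes only that each dataset exposes `name`, `uid`, and `workgroup_uid`.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FakeDataset:
    """Hypothetical stand-in for an SDK dataset object."""
    name: str
    uid: str
    workgroup_uid: str


def group_datasets_by_workgroup(datasets, workgroup_names_by_uid):
    """Map each workgroup name to the list of all datasets it owns."""
    grouped = defaultdict(list)
    for dataset in datasets:
        grouped[workgroup_names_by_uid[dataset.workgroup_uid]].append(dataset)
    return dict(grouped)


demo_names_by_uid = {"wg-1": "Health System", "wg-2": "AI Dev"}
demo_datasets = [
    FakeDataset("mimic_cxr_hco", "ds-1", "wg-1"),
    FakeDataset("mimic_cxr_dev", "ds-2", "wg-2"),
    FakeDataset("mimic_cxr_dev_conv", "ds-3", "wg-2"),
]
grouped = group_datasets_by_workgroup(demo_datasets, demo_names_by_uid)
print({name: [d.name for d in items] for name, items in grouped.items()})
```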

#### Create a Code Object to transform x-rays from DICOM to JPEG
In this step, we'll use a container to convert the DICOM files to JPEG images. This functionality, referred to in the Rhino-verse as **Generalized Compute (GC)**, lets you execute pre-built container images within the FCP environment. A GC Code Object can run any custom code, computation, or process that is packaged in a container image, so the same transformation can run at every site while the data stays where it is.

In [None]:
# URI of the container image that will convert the DICOM files to JPEG
cxr_image_uri = "YOUR DOCKER IMAGE URL"

# retrieve schema for chest x-rays
cxr_schema = project.get_data_schema_by_name('mimic_cxr_dev schema', project_uid=project.uid)
cxr_schema_uid = cxr_schema.uid

# define a code object by passing the path to the container image
compute_params = CodeObjectCreateInput(
    name="DICOM to JPG Transformation Code",
    description="CXR JPG transformation for the AI dev and Health System datasets",
    input_data_schema_uids = [cxr_schema_uid],
    output_data_schema_uids = [None], # a schema will be automatically generated
    project_uid = project.uid,
    code_type = CodeTypes.GENERALIZED_COMPUTE,    
    config={"container_image_uri": cxr_image_uri}
)

my_code_object = session.code_object.create_code_object(compute_params)
print(f"Got Code Object '{my_code_object.name}' with uid {my_code_object.uid}")

#### Run the Code Object
In this step, we'll execute the code object that we just defined. We'll pass the dataset identifiers for both the AI developer's data and the health system's data. 'Under the hood', the container image is transmitted to both sites and executed on the respective DICOM files. As defined in the Python code within the container, the newly generated JPEG files will be saved as another dataset (with the `_conv` suffix, as defined in the function argument below).
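
To give a feel for what the container does with the files at each site, the sketch below pairs every `.dcm` file under an input directory with a `.jpg` path under an output directory, preserving subfolders. This is an illustration only: `plan_conversions` is a hypothetical helper, the directory layout is an assumption rather than the container's documented behavior, and the pixel conversion itself is omitted.

```python
import tempfile
from pathlib import Path


def plan_conversions(input_dir, output_dir):
    """Pair each DICOM file under input_dir with a JPEG target under output_dir."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    plan = []
    for dicom_path in sorted(input_dir.rglob("*.dcm")):
        relative = dicom_path.relative_to(input_dir)
        plan.append((dicom_path, output_dir / relative.with_suffix(".jpg")))
    return plan


# Demo on a temporary folder tree (assumed layout, not the real one).
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "input"
    (source / "patient_a").mkdir(parents=True)
    (source / "patient_a" / "scan1.dcm").touch()
    (source / "notes.txt").touch()  # non-DICOM files are skipped
    demo_plan = plan_conversions(source, Path(tmp) / "output")
    demo_targets = [target.relative_to(tmp).as_posix() for _, target in demo_plan]

print(demo_targets)  # ['output/patient_a/scan1.jpg']
```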

In [None]:
run_params = CodeObjectRunInput(
    code_object_uid = my_code_object.uid,
    input_dataset_uids = [aidev_cxr_dataset_uid, hco_cxr_dataset_uid],
    output_dataset_names_suffix = "_conv",
    timeout_seconds = 600
)
code_run = session.code_object.run_code_object(run_params)
run_result = code_run.wait_for_completion()
print(f"Finished running {my_code_object.name}")
errors = run_result.result_info.get('errors') if run_result.result_info else None
print(f"Result status is '{run_result.status.value}', errors={errors}")
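
If you check run results in several places, the status-and-errors logic can be wrapped in a small helper. The sketch below works on plain values so that it runs without a Rhino session; `describe_run` is a hypothetical helper, and the status strings in the demo are placeholders rather than the SDK's actual status values.

```python
def describe_run(status, result_info):
    """Build a one-line summary from a status string and an optional result_info dict."""
    errors = result_info.get("errors") if result_info else None
    if errors:
        return f"Run ended with status '{status}' and errors: {errors}"
    return f"Run ended with status '{status}' and no reported errors"


# Placeholder values for illustration only.
print(describe_run("done", None))
print(describe_run("failed", {"errors": ["container exited with code 1"]}))
```

With a real run, this could be called as `describe_run(run_result.status.value, run_result.result_info)`.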