# Notebook #2: Data Engineering
### Transforming data across multiple nodes
In this notebook, we'll convert chest X-rays from the DICOM format (a standard format for medical images in clinical information systems) to JPEG files. This is a critical step in most machine learning pipelines that include image classification, segmentation, or other computer vision tasks. Because in this example the data is distributed over multiple health systems and *will remain on the edge*, we'll have to send any code that converts DICOM to JPEG to each health system's server. .

#### Import the Rhino Health Python library
We'll again import any necessary functions from the `rhino_health` library and authenticate to the Rhino Cloud. Please refer to Notebook #1 for an explanation of the `session` interface for interacting with various endpoints in the Rhino Health ecosystem. In addition, you can always find more information about the Rhino SDK on our <a target="_blank" href="https://rhinohealth.github.io/rhino_sdk_docs/html/autoapi/index.html">Official SDK Documentation</a> and on our <a target="_blank" href="https://pypi.org/project/rhino-health/">PyPI Repository Page</a>

In [9]:
pip install --upgrade pip rhino_health

Note: you may need to restart the kernel to use updated packages.


In [67]:
import getpass
import rhino_health as rh
from rhino_health.lib.endpoints.code_object.code_object_dataclass import (
    CodeObject,
    CodeObjectCreateInput,
    CodeTypes,
    CodeObjectRunInput
)

In [47]:
my_username = "adrish+1@rhinohealth.com" # Replace this with the email you use to log into Rhino Health
session = rh.login(username=my_username, password=getpass.getpass(), rhino_api_url='https://dev.rhinohealth.com/api/')  ## chnage the URL to match the Rhino instance
print("Logged In")

 ········


Logged In


#### Retrieve Project and Relevant Datasets
In the previous notebook we interfaced with the `Project` dataclass by first retrieving the project's unique identifier from the Rhino web platform (via copy & paste). In contrast, in this notebook we'll accomplish this by using the `get_project_by_name()` function (but either way is fine!). 

Each instance of the `Project` class is associated with several helpful parameters, including `description` and `permissions` that can be accessed easily through the SDK. In this example, we'll use the `collaborating_workgroups` property to retreive and encode our workgroups, which we'll use later when we perform the DICOM to JPEG transformation on 'the edge'. 

In [48]:
project = session.project.get_project_by_name("Federated Modeling")  # Replace with your project name
project_uid = project.uid
print(project_uid)

942654f6-d292-4e53-bde1-406df31de2fd


In [49]:
cxr_schema = session.project.get_data_schema_by_name("mimic_cxr_dev schema", project_uid=project_uid)

In [51]:
cxr_schema_uid = cxr_schema.uid
print(cxr_schema_uid)

309c9b54-3d41-435e-bc3c-1de8c794463a


#### Retrieve chest X-ray data from both participating sites
Now that we've identified both of the collaborating workgroups involved in our project, we can retrieve the identifiers for the datasets that each workgroup uploaded to their respective Rhino clients. In a later step, we'll use the dataset identifiers to execute the DICOM to JPEG transformation code on each respective dataset. 

In [64]:
hco_cxr_dataset = project.get_dataset_by_name("mimic_cxr_hco")
aidev_cxr_dataset = project.get_dataset_by_name("mimic_cxr_dev")
hco_cxr_dataset_uid = hco_cxr_dataset.uid
aidev_cxr_dataset_uid = aidev_cxr_dataset.uid
print(f"Loaded CXR datasets '{hco_cxr_dataset.uid}', '{aidev_cxr_dataset.uid}'")

Loaded CXR datasets 'b3321c41-85f1-492b-b56f-a6fa99c5c79e', 'a34d3b8a-bdb8-48a4-8858-fdc79dcd65a6'


#### Create a Code Object to transform x-rays from DICOM to JPEG
In this step, we'll use a container to convert the DICOM files to JPEG images. This functionality, referred to in the Rhino-verse as **Generalized Compute (GC)**, represents a versatile and powerful way to execute pre-built container images within the FCP environment. This Code Object type enables you to run custom code, computations, or processes that are encapsulated within container images. With GC Code Objects, you can harness the full potential of distributed computing while tailoring your computations to suit your specific needs.

In [94]:
python_code = """
import pandas as pd
import os
import pydicom
import numpy as np
from PIL import Image
from sklearn.impute import SimpleImputer
import glob


def convert_dcm_image_to_jpg(name):
	dcm = pydicom.dcmread(name)
	img = dcm.pixel_array.astype(float)
	rescaled_image = (np.maximum(img, 0) / img.max()) * 255  # float pixels
	final_image = np.uint8(rescaled_image)  # integers pixels
	final_image = Image.fromarray(final_image)
	return final_image


def dataset_dcm_to_jpg(dataset_df):
	input_dir = '/input/dicom_data/'
	output_dir = '/output/file_data/'
	dcm_list = glob.glob(input_dir + '/*/*.dcm')

	dataset_df['JPG_file'] = 'Nan'
	for dcm_file in dcm_list:
		image = convert_dcm_image_to_jpg(dcm_file)
		jpg_file_name = dcm_file.split('/')[-1].split('.dcm')[0] + '.jpg'
		ds = pydicom.dcmread(dcm_file)
		idx = dataset_df['Pneumonia'][dataset_df.SeriesUID == ds.SeriesInstanceUID].index[0]
		ground_truth = '1' if dataset_df.loc[idx, 'Pneumonia'] else '0'
		class_folder = output_dir + ground_truth
		if not os.path.exists(class_folder):
			os.makedirs(class_folder)
		image.save('/'.join([class_folder, jpg_file_name]))
		dataset_df.loc[idx, 'JPG file'] = '/'.join([ground_truth, jpg_file_name])

	return dataset_df


if __name__ == '__main__':
	# Read dataset from /input
	dataset = pd.read_csv('/input/dataset.csv')

	# Convert DICOM to JPG
	dataset = dataset_dcm_to_jpg(dataset)

	# Write dataset to /output
	dataset.to_csv('/output/dataset.csv', index=False)
 """

In [106]:
code_object_params = CodeObjectCreateInput(
    name="DICOM to JPG Transformation Code",
    description="CXR JPG transformation the AI dev and Health System datasets",
    input_data_schema_uids = [cxr_schema_uid],
    output_data_schema_uids = [None], # a schema will be automatically generated
    project_uid = project_uid,
    code_type = CodeTypes.PYTHON_CODE,
    code_execution_mode = 'AUTO_CONTAINER_SNIPPET',
    requirements_mode = 'PYTHON_PIP',
    config = {
		   "python_code": python_code,
           "requirements" : ["pandas == 1.3.4", "numpy == 1.21.3","sklearn==0.0", "sklearn-pandas==1.8.0", "scikit-learn==1.0.2","pydicom==2.2.0","Pillow==8.4.0"],
    }
)

data_code_object = session.code_object.create_code_object(code_object_params)
print(f"Got Code Object '{data_code_object.name}' with uid {data_code_object.uid}")

Got Code Object 'DICOM to JPG Transformation Code' with uid 8c5828ac-422a-404f-8763-a131bf259bea


In [104]:
# Reterive the code object - in case you already have code object created previously 
code_object = session.code_object.get_code_object_by_name("DICOM to JPG Transformation Code", project_uid=project_uid)

#### Run the Code Object
In this step, we'll execute the code object that we just defined. We'll pass the dataset identifiers for both the AI developer's data as well as the health system's data. 'Under the hood', the container image is transmitted to both sites and executed on the respective DICOM files. As defined in the Python code within the container, the newly generated JPEG files will be saved as another dataset (with the `_conv` suffix as defined in the function argument below.

In [105]:
code_object_params = CodeObjectRunInput(
  code_object_uid = code_object.uid,
  input_dataset_uids = [[aidev_cxr_dataset_uid],[hco_cxr_dataset_uid]],     
  output_dataset_names_suffix = "_conv",
  timeout_seconds = 600
)
code_run = session.code_object.run_code_object(code_object_params)
run_result = code_run.wait_for_completion()
print(f"Finished running {code_object.name}")
print(f"Result status is '{run_result.status.value}', errors={run_result.result_info.get('errors') if run_result.result_info else None}")

Waiting for code run to complete (0 hours 0 minutes and 2 seconds)
Done
Finished running DICOM to JPG Transformation Code4


AttributeError: 'CodeRun' object has no attribute 'result_info'