# Tutorial 2: Multi-Dataset Data Harmonization

### Demonstrating how to script data harmonization, dataset import, and reusable ETLs with the Rhino Health Python SDK.

## 1. Load All Necessary Libraries
### Ensure that you are running this notebook in the correct kernel.
### If needed, install the required libraries by uncommenting the following line. 

For installation:

In [None]:
# !pip install --upgrade rhino_health

In [None]:
import pandas as pd
from datetime import datetime
import os
import sys
from pathlib import Path
from getpass import getpass
import rhino_health as rh
from rhino_health.lib.endpoints.dataset.dataset_dataclass import DatasetCreateInput
from rhino_health.lib.endpoints.code_object.code_object_dataclass import (
    CodeObjectCreateInput,
    CodeTypes,
    CodeObjectRunInput,
)
from rhino_health import ApiEnvironment

### 2. Logging into the Rhino Health Platform


#### Replace the values of the variables below
1. my_username - The username you use to log into the Rhino Health Platform
2. password - You are prompted for the password you use to log into the Rhino Health Platform when the cell runs; it is read with `getpass()`, so there is no variable to edit
3. project_uid - Copy the UID of the project you just created in the UI by navigating to the homepage, pressing the three-vertical-dot button in your project's square, and then selecting _Copy UID_.

In [None]:
my_username = 'USERNAME'                                             
project_uid = "XXXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"                   # Paste your project uid here as a string

print("Logging In")
session = rh.login(username=my_username, password=getpass())
print("Logged In")
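
Before logging in, it can help to fail fast if a placeholder was left unedited. The check below is a minimal sketch: `looks_like_placeholder` is a hypothetical helper, not part of the Rhino Health SDK, and it only recognizes the two placeholder shapes used in the cell above.

```python
def looks_like_placeholder(value: str) -> bool:
    """Return True if `value` still looks like an unedited template value."""
    stripped = value.strip()
    if stripped == "" or stripped == "USERNAME":
        return True
    # The project UID template consists only of 'X' characters and dashes.
    return set(stripped) <= {"X", "-"}


# Stand-in values for illustration; in the notebook, pass my_username and project_uid.
candidates = {
    "my_username": "USERNAME",
    "project_uid": "XXXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
}
for name, value in candidates.items():
    if looks_like_placeholder(value):
        print(f"{name} still holds a placeholder value; replace it before logging in")
```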

### 3. Import Datasets:

#### 3.1  Collect all Necessary Parameters for Importing your dataset:

In [None]:
workgroup = session.project.get_collaborating_workgroups(project_uid)[0]
dataschema = session.project.get_data_schemas(project_uid)[0]
print(f"Loaded dataschema '{dataschema.name}' with uid '{dataschema.uid}'")

# Replace the paths here according to where you placed the files on your client
dataset1_part1_path = "/rhino_data/tutorial_2/datasets/site1_part1_dataset.csv"
dataset2_part1_path = "/rhino_data/tutorial_2/datasets/site2_part1_dataset.csv"
dataset3_part1_path = "/rhino_data/tutorial_2/datasets/site3_part1_dataset.csv"

# Alternatively, if you have access to client-mounted storage, use something like the following:

# dataset1_part1_path = "/rhino_data/external/s3/tutorial_2/site1_part1_dataset.csv"
# dataset2_part1_path = "/rhino_data/external/s3/tutorial_2/site2_part1_dataset.csv"
# dataset3_part1_path = "/rhino_data/external/s3/tutorial_2/site3_part1_dataset.csv"

#### 3.2 Trigger dataset import

In [None]:
dataset_creation_params = DatasetCreateInput(
    name="Site 1 Dataset",
    description="Diabetes dataset for site 1",
    project_uid=project_uid, 
    workgroup_uid=workgroup.uid,
    data_schema_uid=dataschema.uid,
    csv_filesystem_location=dataset1_part1_path,
    is_data_deidentified=True,
    method="filesystem",
)

site1_dataset = session.dataset.add_dataset(dataset_creation_params)
print(f"Created new dataset '{site1_dataset.name}' with uid '{site1_dataset.uid}'")

dataset_creation_params = DatasetCreateInput(
    name="Site 2 Dataset",
    description="Diabetes dataset for site 2",
    project_uid=project_uid, 
    workgroup_uid=workgroup.uid,
    data_schema_uid=dataschema.uid,
    csv_filesystem_location=dataset2_part1_path,
    is_data_deidentified=True,
    method="filesystem",
)

site2_dataset = session.dataset.add_dataset(dataset_creation_params)
print(f"Created new dataset '{site2_dataset.name}' with uid '{site2_dataset.uid}'")

dataset_creation_params = DatasetCreateInput(
    name="Site 3 Dataset",
    description="Diabetes dataset for site 3",
    project_uid=project_uid, 
    workgroup_uid=workgroup.uid,
    data_schema_uid=dataschema.uid,
    csv_filesystem_location=dataset3_part1_path,
    is_data_deidentified=True,
    method="filesystem",
)

site3_dataset = session.dataset.add_dataset(dataset_creation_params)
print(f"Created new dataset '{site3_dataset.name}' with uid '{site3_dataset.uid}'")
print("You should now have 3 new datasets in the project within the GUI. Feel free to take a look!")
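
The three import blocks above differ only in the site number, so the parameters can also be generated in a loop. The sketch below builds plain dictionaries with the same keyword arguments used above; `build_dataset_params` is a hypothetical helper, and the UID strings in the example are stand-ins. In the notebook, each dictionary could then be passed on as `session.dataset.add_dataset(DatasetCreateInput(**params))`.

```python
def build_dataset_params(site, project_uid, workgroup_uid, data_schema_uid,
                         base_dir="/rhino_data/tutorial_2/datasets"):
    """Build the keyword arguments for one site's DatasetCreateInput."""
    return {
        "name": f"Site {site} Dataset",
        "description": f"Diabetes dataset for site {site}",
        "project_uid": project_uid,
        "workgroup_uid": workgroup_uid,
        "data_schema_uid": data_schema_uid,
        "csv_filesystem_location": f"{base_dir}/site{site}_part1_dataset.csv",
        "is_data_deidentified": True,
        "method": "filesystem",
    }


# Stand-in UIDs; in the notebook use project_uid, workgroup.uid and dataschema.uid.
all_params = [
    build_dataset_params(site, "project-uid", "workgroup-uid", "schema-uid")
    for site in (1, 2, 3)
]
for params in all_params:
    print(params["name"], "->", params["csv_filesystem_location"])
```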

### 4. Harmonize Datasets Using Containerless GC
By reviewing the dataset analytics on the GUI, you can see several inconsistencies across the 3 datasets (specifically for 'Outcome', 'Weight', 'Height' and 'SkinThickness'). These kinds of inconsistencies can often occur when collecting data from multiple sources. 

In this next part, you will use simple pandas operations to produce harmonized versions of these datasets. 

#### 4.1 Define harmonization code for each dataset

In [None]:
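# Each snippet operates on the dataset as a pandas DataFrame named `df`.
# Site 1: encode Outcome as 1/0 and convert Weight from pounds to kilograms.
site_1_code = "df.replace({'Outcome': { 'Positive': 1, 'Negative': 0}}, inplace=True)\ndf.Weight = round(df.Weight*0.453592, 0).astype(int)"
site_2_code = "df['SkinThickness'] = df['SkinThickness']*100\ndf['Height'] = df['Height']/100"
# Site 3: assign the result back, because an in-place replace on a selected column may not update `df` in newer pandas versions.
site_3_code = "df.replace({'Outcome': { 'Positive': 1, 'Negative': 0}},inplace=True)\ndf['Pregnancies'] = df['Pregnancies'].replace('None', 0)"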
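
Because each snippet is ordinary pandas code acting on a DataFrame called `df`, it can be dry-run locally before it is sent to the platform. The sketch below is an illustration under stated assumptions: `dry_run` is a hypothetical helper, the sample rows are made up, and `exec` only approximates how the platform runs the code.

```python
import pandas as pd

# Same Site 2 snippet as defined in the cell above.
site_2_code = "df['SkinThickness'] = df['SkinThickness']*100\ndf['Height'] = df['Height']/100"


def dry_run(code: str, sample: pd.DataFrame) -> pd.DataFrame:
    """Run a harmonization snippet against a copy of `sample` and return the result."""
    namespace = {"df": sample.copy(), "pd": pd}
    exec(code, namespace)  # the snippet sees the copy under the name `df`
    return namespace["df"]


# Made-up rows, used only to exercise the snippet.
sample = pd.DataFrame({"SkinThickness": [0.25, 0.5], "Height": [165, 180]})
result = dry_run(site_2_code, sample)
print(result)
```

Working on a copy keeps the sample unchanged, so the same rows can be reused to try several snippets.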

#### 4.3 Site 1 Data Harmonization by Defining a Run of Our Code Object

In [None]:
print("Starting to run harmonization on site 1 data")
output_dataset, run_results = site1_dataset.run_code(site_1_code, output_data_schema_uid=dataschema.uid, output_dataset_names_suffix=" Fixed")
print("Finished running harmonization on site 1 data")

print("You can now see a new dataset in the GUI named 'Site 1 Dataset Fixed'")
print("View the Results below")
run_results.raw_response().json()

#### 4.4 Site 2 Data Harmonization

In [None]:
print("Starting to run harmonization on site 2 data")
output_dataset, run_results = site2_dataset.run_code(site_2_code, output_data_schema_uid=dataschema.uid, output_dataset_names_suffix=" Fixed")
print("Finished running harmonization on site 2 data")

print("You can now see a new dataset in the GUI named 'Site 2 Dataset Fixed'")
print("View the Results below")
run_results.raw_response().json()

#### 4.5 Site 3 Data Harmonization

In [None]:
print("Starting to run harmonization on site 3 data")
output_dataset, run_results = site3_dataset.run_code(
    site_3_code,
    output_data_schema_uid=dataschema.uid,
    output_dataset_names_suffix=" Fixed",
)
print("Finished running harmonization on site 3 data")

print("You can now see a new dataset in the GUI named 'Site 3 Dataset Fixed'")
print("View the Results below")
run_results.raw_response().json()
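
The Site 3 snippet replaces the string 'None' in the Pregnancies column with 0. If the column might contain other non-numeric strings, one alternative is to coerce it with `pd.to_numeric`. This is a different technique from the one used above, shown here on made-up values; note that it maps every unparseable entry to 0, not only 'None'.

```python
import pandas as pd

# Made-up values standing in for the Site 3 Pregnancies column.
df = pd.DataFrame({"Pregnancies": ["None", "2", "1"]})

# Unparseable entries become NaN, are then filled with 0, and the column is cast to int.
df["Pregnancies"] = (
    pd.to_numeric(df["Pregnancies"], errors="coerce").fillna(0).astype(int)
)
print(df["Pregnancies"].tolist())
```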

### 5. Import updated datasets

Now let's imagine you have some updated data (the `*_part2_dataset.csv` files) that you would like to harmonize in the same way. With simple modifications to the code above, you can harmonize the new data with little effort.

First let's import the updated data as new datasets.

In [None]:
# Replace the paths here according to where you placed the files on your client
dataset1_part2_path = "/rhino_data/tutorial_2/datasets/site1_part2_dataset.csv"
dataset2_part2_path = "/rhino_data/tutorial_2/datasets/site2_part2_dataset.csv"
dataset3_part2_path = "/rhino_data/tutorial_2/datasets/site3_part2_dataset.csv"

# Alternatively, if you have access to client-mounted storage, use something like the following:
# dataset1_part2_path = "/rhino_data/external/s3/tutorial_2/site1_part2_dataset.csv"
# dataset2_part2_path = "/rhino_data/external/s3/tutorial_2/site2_part2_dataset.csv"
# dataset3_part2_path = "/rhino_data/external/s3/tutorial_2/site3_part2_dataset.csv"

In [None]:
dataset_creation_params = DatasetCreateInput(
    name="Site 1 Dataset - Part 2",
    description="Updated diabetes dataset for site 1 - part 2",
    project_uid=project_uid, 
    workgroup_uid=workgroup.uid,
    data_schema_uid=dataschema.uid,
    csv_filesystem_location=dataset1_part2_path,
    is_data_deidentified=True,
    method="filesystem",
)
site1_part2_dataset = session.dataset.add_dataset(dataset_creation_params)
print(f"Created new dataset '{site1_part2_dataset.name}' with uid '{site1_part2_dataset.uid}'")

dataset_creation_params = DatasetCreateInput(
    name="Site 2 Dataset - Part 2",
    description="Updated diabetes dataset for site 2 - part 2",
    project_uid=project_uid, 
    workgroup_uid=workgroup.uid,
    data_schema_uid=dataschema.uid,
    csv_filesystem_location=dataset2_part2_path,
    is_data_deidentified=True,
    method="filesystem",
)
site2_part2_dataset = session.dataset.add_dataset(dataset_creation_params)
print(f"Created new dataset '{site2_part2_dataset.name}' with uid '{site2_part2_dataset.uid}'")

dataset_creation_params = DatasetCreateInput(
    name="Site 3 Dataset - Part 2",
    description="Updated diabetes dataset for site 3 - part 2",
    project_uid=project_uid, 
    workgroup_uid=workgroup.uid,
    data_schema_uid=dataschema.uid,
    csv_filesystem_location=dataset3_part2_path,
    is_data_deidentified=True,
    method="filesystem",
)
site3_part2_dataset = session.dataset.add_dataset(dataset_creation_params)
print(f"Created new dataset '{site3_part2_dataset.name}' with uid '{site3_part2_dataset.uid}'")

### 6. Harmonize data reusing the previous harmonization code
The new Part 2 datasets suffer from the same inconsistencies as the Part 1 datasets.
You can easily fix this by running the same harmonization code you defined earlier, using each site's own snippet:

In [None]:
# Site 1
print("Starting to run harmonization on site 1 - part 2 data")
output_dataset, run_results = site1_part2_dataset.run_code(site_1_code, output_data_schema_uid=dataschema.uid, output_dataset_names_suffix=" Fixed")
print("Finished running harmonization on site 1 - part 2 data")
print("You can now see a new dataset in the GUI named 'Site 1 Dataset - Part 2 Fixed'")

# Site 2
print("Starting to run harmonization on site 2 - part 2 data")
output_dataset, run_results = site2_part2_dataset.run_code(site_2_code, output_data_schema_uid=dataschema.uid, output_dataset_names_suffix=" Fixed")
print("Finished running harmonization on site 2 - part 2 data")
print("You can now see a new dataset in the GUI named 'Site 2 Dataset - Part 2 Fixed'")

# Site 3
print("Starting to run harmonization on site 3 - part 2 data")
output_dataset, run_results = site3_part2_dataset.run_code(site_3_code, output_data_schema_uid=dataschema.uid, output_dataset_names_suffix=" Fixed")
print("Finished running harmonization on site 3 - part 2 data")
print("You can now see a new dataset in the GUI named 'Site 3 Dataset - Part 2 Fixed'")
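
Pairing each dataset with its own snippet in a single structure makes it harder to apply the wrong code to a site. The sketch below relies only on the `run_code` call used above and on it returning an (output dataset, run results) pair; `harmonize_all` is a hypothetical helper, not an SDK function.

```python
def harmonize_all(jobs, data_schema_uid, suffix=" Fixed"):
    """Run each dataset's own harmonization snippet and collect the outputs.

    `jobs` maps a label to a (dataset, code) pair.
    """
    outputs = {}
    for label, (dataset, code) in jobs.items():
        print(f"Starting to run harmonization on {label} data")
        output_dataset, run_results = dataset.run_code(
            code,
            output_data_schema_uid=data_schema_uid,
            output_dataset_names_suffix=suffix,
        )
        print(f"Finished running harmonization on {label} data")
        outputs[label] = (output_dataset, run_results)
    return outputs
```

In the notebook, a call might look like `harmonize_all({"site 1 - part 2": (site1_part2_dataset, site_1_code), "site 2 - part 2": (site2_part2_dataset, site_2_code), "site 3 - part 2": (site3_part2_dataset, site_3_code)}, dataschema.uid)`.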


#### Your datasets are now harmonized! Use the filters on the Dataset Analytics tab in the GUI to visualize the results.
# End of tutorial 2! 