# Step 4: Upload a Dataset

## Setup Variables

Before we start, let's store some variables that will come in handy later in the notebook.

In [1]:
# autodetect the host_ip
from utils import auto_detect_domain_host_ip

DOMAIN_HOST_IP = auto_detect_domain_host_ip()

Your DOMAIN_HOST_IP is: 223.177.170.11


In [2]:
DOMAIN_HOST_IP="localhost"

In [3]:
# Set the email and password of your Domain node.
# We will use the default email and password that were created along with the Domain.

ADMIN_EMAIL="info@openmined.org"
ADMIN_PASSWORD="changethis"
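If you would rather not hard-code credentials in the notebook, one option is to read them from environment variables and fall back to the defaults above. This is only a sketch: the variable names `DOMAIN_ADMIN_EMAIL` and `DOMAIN_ADMIN_PASSWORD` are hypothetical and are not defined by PySyft.

```python
import os

# Hypothetical environment variable names (an assumption, not part of PySyft).
# If they are unset, fall back to the default credentials used in this tutorial.
ADMIN_EMAIL = os.environ.get("DOMAIN_ADMIN_EMAIL", "info@openmined.org")
ADMIN_PASSWORD = os.environ.get("DOMAIN_ADMIN_PASSWORD", "changethis")

# Flag the well-known default password so it is not left in place by accident.
USING_DEFAULT_PASSWORD = ADMIN_PASSWORD == "changethis"
if USING_DEFAULT_PASSWORD:
    print("Warning: still using the default admin password.")
```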

## Step 4a: Log into our Domain

In [4]:
import syft as sy

print(f"You're running syft version: {sy.__version__}")



You're running syft version: 0.7.0-beta.18


In [5]:
# Let's log into the domain using the credentials
try:
    domain_client = sy.login(
        url=DOMAIN_HOST_IP, email=ADMIN_EMAIL, password=ADMIN_PASSWORD, port=8081
    )
    print()
    print("🎉 You successfully connected to your domain!")
except Exception as e:
    print("❌ Unable to connect, did you set the `DOMAIN_HOST_IP` variable above?")
    raise e


Anyone can login as an admin to your node right now because your password is still the default PySyft username and password!!!

Connecting to localhost... done! 	 Logging into canada... done!

🎉 You successfully connected to your domain!


## Step 4b: Creating a Dataset

### MedNIST Dataset

We will be using the MedNIST dataset. The MedNIST dataset was gathered from several sets from TCIA, the RSNA Bone Age Challenge, and the NIH Chest X-ray dataset.

The dataset is kindly made available by Dr. Bradley J. Erickson M.D., Ph.D. (Department of Radiology, Mayo Clinic) under the Creative Commons CC BY-SA 4.0 license. If you use the MedNIST dataset, please acknowledge the source, e.g. https://colab.research.google.com/drive/1wy8XUSnNWlhDNazFdvGBHLfdkGvOHBKe#scrollTo=ZaHFhidyCBJa

Let's move on to downloading the dataset.

The dataset is stored as a pickle file. Let's download it using the helper below.

In [6]:
# download MedNIST.pkl
from utils import download_mednist_dataset

download_mednist_dataset()

MedNIST is already downloaded


Now, before we move forward, let's store some variables related to the dataset.

We require your participant number and the total participant count in the session to allocate you a unique subset of the MedNIST data.

### Participant Number

Copy your variables `MY_PARTICIPANT_NUMBER` and `TOTAL_PARTICIPANTS` from your session details.

```
Hi Person,
These are your Session Details:
-------------------------------
Username: azureuser
Password: **********
VM IP Address: x.x.x.x

MY_PARTICIPANT_NUMBER=1
TOTAL_PARTICIPANTS=10
```

In [7]:
# file path where the MedNIST.pkl is downloaded
FILE_PATH = "./MedNIST.pkl"

In [9]:
# replace these with your own from the session details
MY_PARTICIPANT_NUMBER = 1
TOTAL_PARTICIPANTS = 10
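To see why both numbers are needed, here is a minimal sketch of one way a dataset could be divided into non-overlapping chunks, one per participant. The function `participant_slice` is hypothetical; the actual logic inside `load_data_as_df` is not shown in this notebook and may differ.

```python
def participant_slice(n_items, participant_number, total_participants):
    """Return (start, stop) bounds of the chunk for a 1-based participant number."""
    if not 1 <= participant_number <= total_participants:
        raise ValueError("participant_number must be between 1 and total_participants")
    base, extra = divmod(n_items, total_participants)
    # The first `extra` participants each receive one additional item,
    # so every item is assigned to exactly one participant.
    start = (participant_number - 1) * base + min(participant_number - 1, extra)
    stop = start + base + (1 if participant_number <= extra else 0)
    return start, stop


# Example with a made-up item count of 103 split across 10 participants.
for number in (1, 2, 10):
    print(number, participant_slice(103, number, 10))
```

Because the chunks are contiguous and do not overlap, two participants who use different numbers never upload the same rows.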

### Load the Dataset

Below are some helper methods that we will need to load the dataset.

In [10]:
# Import helper methods
from syft.core.adp.data_subject_list import DataSubjectList
from utils import (
    get_data_description,
    get_label_mapping,
    split_into_train_test_val_sets,
    load_data_as_df,
)

In [11]:
# Let's load the dataset as a dataframe
dataset_df = load_data_as_df(MY_PARTICIPANT_NUMBER, TOTAL_PARTICIPANTS, FILE_PATH)

Columns: Index(['patient_id', 'image', 'label'], dtype='object')
Total Images: 5895
Label Mapping {'AbdomenCT': 0, 'BreastMRI': 1, 'CXR': 2, 'ChestCT': 3, 'Hand': 4, 'HeadCT': 5}


In [12]:
# Let's get a peek of the dataset
dataset_df.head()

| | patient_id | image | label |
|---|---|---|---|
| 0 | 11000 | [[101, 101, 101, 101, 101, 101, 101, 101, 101,... | 0 |
| 1 | 11002 | [[25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, ... | 5 |
| 2 | 11002 | [[126, 126, 126, 126, 126, 126, 126, 126, 126,... | 3 |
| 3 | 11004 | [[3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 3, 3, 3, 3, 3,... | 4 |
| 4 | 11004 | [[101, 101, 101, 101, 101, 101, 101, 101, 101,... | 0 |
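The `label` column holds integers, so it can help to translate them back into class names when inspecting rows. The sketch below inverts the label mapping printed when the dataframe was loaded; the names `index_to_label` and `first_labels` are introduced here for illustration only.

```python
# Label mapping as printed by the loading step above.
label_mapping = {"AbdomenCT": 0, "BreastMRI": 1, "CXR": 2, "ChestCT": 3, "Hand": 4, "HeadCT": 5}

# Invert it so that an integer label looks up its class name.
index_to_label = {index: name for name, index in label_mapping.items()}

# Labels of the five rows shown in the preview above.
first_labels = [0, 5, 3, 4, 0]
print([index_to_label[label] for label in first_labels])
# → ['AbdomenCT', 'HeadCT', 'ChestCT', 'Hand', 'AbdomenCT']
```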


In [13]:
# Split the dataset into `train`, `validation` and `test` sets.
data_dict = split_into_train_test_val_sets(dataset_df)

In [14]:
data_dict["train"].shape, data_dict["val"].shape, data_dict["test"].shape

((4707, 3), (585, 3), (603, 3))
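The sizes above correspond to roughly an 80/10/10 split. The internals of `split_into_train_test_val_sets` are not shown here, so the following is only a sketch of a plain shuffled split over row indices; the function `split_indices` and its parameters are hypothetical, and the real helper may split differently (for example, by patient).

```python
import random


def split_indices(n_rows, train_pct=80, val_pct=10, seed=0):
    """Shuffle row indices and cut them into train/val/test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_train = n_rows * train_pct // 100
    n_val = n_rows * val_pct // 100
    return {
        "train": indices[:n_train],
        "val": indices[n_train:n_train + n_val],
        "test": indices[n_train + n_val:],  # whatever remains goes to test
    }


splits = split_indices(5895)
print({name: len(rows) for name, rows in splits.items()})
```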

Next, get the dataset description, which must be provided to the domain when uploading the dataset.

In [15]:
dataset_description = get_data_description(dataset_df)
print(dataset_description)

The MedNIST dataset was gathered from several sets from TCIA, the RSNA Bone Age Challenge, and the NIH Chest X-ray dataset. The dataset is kindly made available by Dr. Bradley J. Erickson M.D., Ph.D. (Department of Radiology, Mayo Clinic) under the Creative Commons CC BY-SA 4.0 license.
Label Count: 6
Label Mapping: {"AbdomenCT": 0, "BreastMRI": 1, "CXR": 2, "ChestCT": 3, "Hand": 4, "HeadCT": 5}
Image Dimensions: (64, 64)
Total Images: 5895



The dataset description contains a brief summary of the dataset along with some metadata: the label count, the label mapping, the image dimensions, and the total number of images.
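A description in this format can be assembled from a few values. The sketch below is an illustration only: `build_description` is a hypothetical function, not the implementation of `get_data_description`, and the summary text is shortened.

```python
import json


def build_description(summary, label_mapping, image_shape, total_images):
    """Join a free-text summary with metadata lines, one item per line."""
    lines = [
        summary,
        f"Label Count: {len(label_mapping)}",
        f"Label Mapping: {json.dumps(label_mapping)}",
        f"Image Dimensions: {image_shape}",
        f"Total Images: {total_images}",
    ]
    return "\n".join(lines)


example_description = build_description(
    summary="Example summary of the dataset.",  # placeholder text
    label_mapping={"AbdomenCT": 0, "BreastMRI": 1, "CXR": 2, "ChestCT": 3, "Hand": 4, "HeadCT": 5},
    image_shape=(64, 64),
    total_images=5895,
)
print(example_description)
```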

### Prepare Dataset for Upload

In the next step, we will create assets for our dataset. An asset is a collection of private data. In our case, the images and labels of the train, val, and test sets will each become an asset.

In [16]:
import numpy as np

assets = dict()

for name, data in data_dict.items():

    # Let's create the data subjects list.
    # Data Subjects are the individuals whose privacy we're trying to protect.
    data_subjects = DataSubjectList.from_series(data["patient_id"])

    # Convert images to numpy int64 array
    images = data["image"]
    images = np.dstack(images.values).astype(np.int64)
    images = np.rollaxis(images, -1)

    # Convert labels to numpy int64 array
    labels = data["label"].to_numpy().astype("int64")

    # Next, we will make the data private by attaching min, max, and data subjects.
    # The min and max are the minimum and maximum values the data can take.

    # converting images to private data
    image_data = sy.Tensor(images).private(
        min_val=0, max_val=255, data_subjects=data_subjects
    )

    # converting labels to private data
    label_data = sy.Tensor(labels).private(
        min_val=0, max_val=5, data_subjects=data_subjects
    )

    assets[f"{name}_images"] = image_data
    assets[f"{name}_labels"] = label_data
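The `np.dstack` plus `np.rollaxis` step in the cell above turns a series of 2D images into a single array whose first axis indexes the images. A small sketch with toy 2x2 arrays (stand-ins for the 64x64 images) shows the shapes involved and that `np.stack` along axis 0 gives the same result:

```python
import numpy as np

# Three toy 2x2 "images", each filled with its own index.
toy_images = [np.full((2, 2), fill_value=i) for i in range(3)]

# dstack places the images along the last axis: shape (2, 2, 3).
stacked_depth = np.dstack(toy_images)

# rollaxis moves that last axis to the front: shape (3, 2, 2).
images_first = np.rollaxis(stacked_depth, -1)

# Stacking along axis 0 produces the same layout in one call.
direct = np.stack(toy_images, axis=0)

print(stacked_depth.shape, images_first.shape)
print(np.array_equal(images_first, direct))
```

With images first, the array has one entry per row of the dataframe, so its length matches the number of labels and data subjects.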

Finally, we will upload the assets to the domain.

## Step 4c: Upload the Dataset

In [17]:
# creating/uploading the dataset
# Name of the dataset

name = f"MedNIST Data {MY_PARTICIPANT_NUMBER}/{TOTAL_PARTICIPANTS}"

In [18]:
# upload the MedNIST data
domain_client.load_dataset(
    assets=assets, name=name, description=dataset_description, use_blob_storage=True
)

Loading dataset... checking assets...
Loading dataset... checking dataset name for uniqueness...
Loading dataset... checking asset types...
Loading dataset... uploading... 🚀

Uploading `train_images`: 100%|██████████| 1/1 [00:00<00:00, 16.13it/s]
Uploading `train_labels`: 100%|██████████| 1/1 [00:00<00:00, 169.51it/s]
Uploading `val_images`: 100%|██████████| 1/1 [00:00<00:00, 48.54it/s]
Uploading `val_labels`: 100%|██████████| 1/1 [00:00<00:00, 163.22it/s]
Uploading `test_images`: 100%|██████████| 1/1 [00:00<00:00, 48.42it/s]
Uploading `test_labels`: 100%|

Dataset is uploaded successfully !!! 🎉

Run `<your client variable>.datasets` to see your new dataset loaded into your machine!


Now let's check that the dataset was uploaded successfully.

In [19]:
domain_client.datasets

| Idx | Name | Description | Assets | Id |
|---|---|---|---|---|
| [0] | MedNIST Data 6/10 | The MedNIST dataset was gathered from several sets from TCIA, the RSNA Bone Age Challenge, and the NIH Chest X-ray dataset. The dataset is kindly made available by Dr. Bradley J. Erickson M.D., Ph.D. (Department of Radiology, Mayo Clinic) under the Creative Commons CC BY-SA 4.0 license. Label Count: 6 Label Mapping: {"AbdomenCT": 0, "BreastMRI": 1, "CXR": 2, "ChestCT": 3, "Hand": 4, "HeadCT": 5} Image Dimensions: (64, 64) Total Images: 5895 | ["train_images"] -> ["train_labels"] -> ["val_images"] -> ... | 05bb34a0-5b24-4071-92ed-d7e9d18e1289 |
| [1] | MedNIST Data 1/10 | The MedNIST dataset was gathered from several sets from TCIA, the RSNA Bone Age Challenge, and the NIH Chest X-ray dataset. The dataset is kindly made available by Dr. Bradley J. Erickson M.D., Ph.D. (Department of Radiology, Mayo Clinic) under the Creative Commons CC BY-SA 4.0 license. Label Count: 6 Label Mapping: {"AbdomenCT": 0, "BreastMRI": 1, "CXR": 2, "ChestCT": 3, "Hand": 4, "HeadCT": 5} Image Dimensions: (64, 64) Total Images: 5895 | ["train_images"] -> ["train_labels"] -> ["val_images"] -> ... | 83078e03-01a5-4ae6-95fc-e89433acb519 |


In [20]:
domain_client.datasets[0]

Dataset: MedNIST Data 6/10
Description: The MedNIST dataset was gathered from several sets from TCIA,
    the RSNA Bone Age Challenge, and the NIH Chest X-ray dataset.
    The dataset is kindly made available by Dr. Bradley J. Erickson M.D., Ph.D. (Department of Radiology, Mayo Clinic)
    under the Creative Commons CC BY-SA 4.0 license.
Label Count: 6
Label Mapping: {"AbdomenCT": 0, "BreastMRI": 1, "CXR": 2, "ChestCT": 3, "Hand": 4, "HeadCT": 5}
Image Dimensions: (64, 64)
Total Images: 5895




| Asset Key | Type | Shape |
|---|---|---|
| ["train_images"] | | (4722, 64, 64) |
| ["train_labels"] | | (4722,) |
| ["val_images"] | | (588, 64, 64) |
| ["val_labels"] | | (588,) |
| ["test_images"] | | (585, 64, 64) |
| ["test_labels"] | | (585,) |


## Step 4d: Create a Data Scientist Account

In [21]:
data_scientist_details = {
    "name": "Samantha Carter",
    "email": "sam@sg1.net",
    "password": "stargate",
    "budget": 9999,
}

In [22]:
domain_client.users.create(**data_scientist_details)

In [23]:
print("Please give these details to the data scientist 👇🏽")
login_details = {}
login_details["url"] = DOMAIN_HOST_IP
login_details["name"] = data_scientist_details["name"]
login_details["email"] = data_scientist_details["email"]
login_details["password"] = data_scientist_details["password"]
login_details["dataset_name"] = name
print()
print(login_details)

Please give these details to the data scientist 👇🏽

{'url': 'localhost', 'name': 'Samantha Carter', 'email': 'sam@sg1.net', 'password': 'stargate', 'dataset_name': 'MedNIST Data 1/10'}
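The password used in this walkthrough is short and easy to guess. Outside a tutorial setting, you may prefer a randomly generated one. The sketch below uses Python's standard `secrets` module to build the same details dictionary with a random password; the choice of 16 random bytes is an assumption, not a PySyft requirement.

```python
import secrets

# Same fields as the details dictionary above, with a random password.
# token_urlsafe(16) encodes 16 random bytes as URL-safe text.
data_scientist_details = {
    "name": "Samantha Carter",
    "email": "sam@sg1.net",
    "password": secrets.token_urlsafe(16),
    "budget": 9999,
}

print(len(data_scientist_details["password"]))
```

The generated password changes on every run, so record it before sharing the login details with the data scientist.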
