## Create Dataset

Now that our Domain is part of the OpenMined network (as described in notebook [data-owners-start.ipynb](http://localhost:8888/notebooks/notebooks/adastra/data-owners/01-data-owners-start.ipynb)), let's move on to add a dataset to our Domain.

First, let's log in to our Domain.

In [None]:
# autodetect the host_ip
CURL_OUTPUT=!echo $(curl -s ifconfig.co)
DOMAIN_HOST_IP=""
import sys
if "google.colab" not in sys.modules:
    DOMAIN_HOST_IP=CURL_OUTPUT[0]
    print(f"Your DOMAIN_HOST_IP is: {DOMAIN_HOST_IP}")
else:
    print("Google Colab detected, please set the DOMAIN_HOST_IP variable manually")

In [None]:
# Set the email and password of your Domain node.
# We will be using the default email and password that were created during Domain creation.
# Please update the email and password below in case you have changed them.

ADMIN_EMAIL="info@openmined.org"
ADMIN_PASSWORD="changethis"

In [None]:
# Import syft package
import syft as sy

In [None]:
# Let's log into the domain using the credentials

try:
    domain_client = sy.login(url="localhost", email=ADMIN_EMAIL, password=ADMIN_PASSWORD, port=8081)
    print()
    print("🎉 You successfully connected to your domain!")
except Exception as e:
    print("❌ Unable to connect, did you set the `DOMAIN_HOST_IP` variable above?")
    raise e

### MedNIST Dataset

We will be using the MedNIST dataset. The MedNIST dataset was gathered from several sets from TCIA, the RSNA Bone Age Challenge, and the NIH Chest X-ray dataset.

The dataset is kindly made available by Dr. Bradley J. Erickson M.D., Ph.D. (Department of Radiology, Mayo Clinic) under the Creative Commons CC BY-SA 4.0 license. If you use the MedNIST dataset, please acknowledge the source, e.g. https://colab.research.google.com/drive/1wy8XUSnNWlhDNazFdvGBHLfdkGvOHBKe#scrollTo=ZaHFhidyCBJa

Let's move on to downloading the dataset.

The dataset is stored as a pickle file. Let's download it using the cell below.

In [None]:
# download MedNIST.pkl
import os  # needed here because this cell runs before the helper-methods cell

if not os.path.exists("./MedNIST.pkl"):
    os.system('curl -O "https://media.githubusercontent.com/media/shubham3121/datasets/main/MedNIST/MedNIST.pkl"')
else:
    print("MedNIST already downloaded")

Now, before we move forward, let's store some variables related to the dataset.

We require your participant number and the total participant count in the session to allocate you a unique subset of the MedNIST data.

### Participant Number

Copy your variables `MY_PARTICIPANT_NUMBER` and `TOTAL_PARTICIPANTS` from your session details.

```
Hi Person,
These are your Session Details:
-------------------------------
Username: azureuser
Password: **********
VM IP Address: x.x.x.x

MY_PARTICIPANT_NUMBER=1
TOTAL_PARTICIPANTS=10
```

In [None]:
# file path where the MedNIST.pkl is downloaded
FILE_PATH = "./MedNIST.pkl"

In [None]:
# replace these with your own from the session details
MY_PARTICIPANT_NUMBER = 1
TOTAL_PARTICIPANTS = 10

### Load the Dataset

Below are some helper methods that we will need to load the dataset.

In [None]:
# Helper Methods

import os
import json
import pandas as pd
from PIL import Image
from enum import Enum
from collections import defaultdict
import numpy as np
from syft.core.adp.data_subject_list import DataSubjectList


def get_label_mapping():
    # the data uses the following mapping
    mapping = {
        "AbdomenCT": 0, 
        "BreastMRI": 1, 
        "CXR": 2, 
        "ChestCT": 3, 
        "Hand": 4, 
        "HeadCT": 5
    }
    return mapping

def load_data_as_df(file_path="./MedNIST.pkl"):
    df = pd.read_pickle(file_path)
    df.sort_values("patient_id", inplace=True, ignore_index=True)
    
    # Calculate start and end index based on your participant number
    batch_size = df.shape[0] // TOTAL_PARTICIPANTS
    start_idx = (MY_PARTICIPANT_NUMBER - 1) * batch_size
    end_idx = start_idx + batch_size
    
    # Slice the dataframe accordingly
    df = df[start_idx: end_idx]
    
    # Get label mapping
    mapping = get_label_mapping()
    
    total_num = df.shape[0]
    print("Columns:", df.columns)
    print("Total Images:", total_num)
    print("Label Mapping", mapping)
    return df

def get_data_description(data):
    unique_label_cnt = data.label.nunique()
    label_mapping = json.dumps(get_label_mapping())
    image_size = data.iloc[0]["image"].shape
    description = f"The MedNIST dataset was gathered from several sets from TCIA, the RSNA Bone Age Challenge, and the NIH Chest X-ray dataset. The dataset is kindly made available by Dr. Bradley J. Erickson M.D., Ph.D. (Department of Radiology, Mayo Clinic) under the Creative Commons CC BY-SA 4.0 license.\n"
    description += f"Label Count: {unique_label_cnt}\n"
    description += f"Label Mapping: {label_mapping}\n"
    description += f"Image Dimensions: {image_size}\n"
    description += f"Total Images: {data.shape[0]}\n"
    return description
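
The slicing arithmetic in `load_data_as_df` can be checked in isolation. The sketch below is only an illustration: `participant_slice` is a hypothetical helper (not part of the notebook or of syft), and the 1,005-row dataset size is a made-up number. It also adds a range check on the participant number, which the original helper does not perform.

```python
def participant_slice(total_rows, participant_number, total_participants):
    """Return the [start, end) row range for a 1-indexed participant.

    Hypothetical helper mirroring the arithmetic in load_data_as_df.
    """
    if not 1 <= participant_number <= total_participants:
        raise ValueError("participant_number must be between 1 and total_participants")
    # Integer division: every participant receives the same number of rows
    batch_size = total_rows // total_participants
    start_idx = (participant_number - 1) * batch_size
    return start_idx, start_idx + batch_size


# Made-up example: 1,005 rows shared by 10 participants
print(participant_slice(1005, 1, 10))   # first participant
print(participant_slice(1005, 10, 10))  # last participant
```

Because the batch size uses integer division, any remainder rows at the end of the sorted dataframe (five rows in this made-up example) are not assigned to any participant.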

In [None]:
# Let's load the dataset as a dataframe
dataset_df = load_data_as_df(FILE_PATH)

In [None]:
# Let's get a peek of the dataset
dataset_df.head()
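
The `label` column holds the integer codes from `get_label_mapping`. To read them back as class names, the mapping can be inverted. This is a minimal sketch: `INDEX_TO_NAME` and `decode_labels` are hypothetical names introduced here for illustration, and the mapping is copied from the helper above.

```python
# Mapping copied from get_label_mapping()
LABEL_MAPPING = {
    "AbdomenCT": 0,
    "BreastMRI": 1,
    "CXR": 2,
    "ChestCT": 3,
    "Hand": 4,
    "HeadCT": 5,
}

# Invert the mapping so that integer codes point back to class names
INDEX_TO_NAME = {index: name for name, index in LABEL_MAPPING.items()}


def decode_labels(labels):
    """Translate an iterable of integer labels into class names."""
    return [INDEX_TO_NAME[int(label)] for label in labels]


print(decode_labels([0, 2, 5]))
```

In the notebook, the same function could be applied to a few rows, for example `decode_labels(dataset_df["label"].head())`.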

Next, get the dataset description that needs to be provided to the domain when uploading the dataset.

In [None]:
dataset_description = get_data_description(dataset_df)
print(dataset_description)

We can see that the dataset description contains brief information about the dataset along with some metadata: the label count, the label mapping, the image dimensions, and the total number of images.

### Prepare Dataset for Upload

Let's create the data subjects list. Data subjects are the individuals whose privacy we're trying to protect. Here, the patients are the data subjects.

In [None]:
data_subjects = DataSubjectList.from_series(dataset_df['patient_id'])

Next, we need to convert our image and label data to NumPy arrays of type **int64**.

In [None]:
# Convert images to numpy int64 array
images = dataset_df['image']
images = np.dstack(images.values).astype(np.int64)
images = np.rollaxis(images,-1)
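
To see what the `np.dstack` and `np.rollaxis` steps do to the array shape, here is a small sketch on toy data. The three 4x5 arrays are made up for illustration; the real images have whatever dimensions are stored in the dataframe.

```python
import numpy as np

# Three toy "images" of shape (4, 5), each filled with its own index
toy_images = [np.full((4, 5), fill_value=i) for i in range(3)]

# dstack joins along a new last axis: (height, width, n_images)
stacked = np.dstack(toy_images)
print(stacked.shape)

# rollaxis moves the last axis to the front: (n_images, height, width)
rolled = np.rollaxis(stacked, -1)
print(rolled.shape)

# The result matches stacking the images directly along axis 0
print(np.array_equal(rolled, np.stack(toy_images)))
```

After the roll, the first axis indexes images, so row `i` of the array lines up with row `i` of the labels and of the data subjects.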

In [None]:
# Convert labels to numpy int64 array
labels = dataset_df['label'].to_numpy().astype("int64")

Next, we will make the data private by attaching a minimum value, a maximum value, and the data subjects. The min and max are the minimum and maximum values in the given data.

In [None]:
# converting images to private data
image_data = sy.Tensor(images).private(min_val=0, max_val=255, data_subjects=data_subjects)

In [None]:
# converting labels to private data
label_data = sy.Tensor(labels).private(min_val=0, max_val=5, data_subjects=data_subjects)
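
Before attaching `min_val` and `max_val`, it can be worth confirming that the arrays really stay inside those bounds. The sketch below is an optional sanity check, not something syft requires: `within_bounds` is a hypothetical helper, and the sample arrays are made up.

```python
import numpy as np


def within_bounds(values, min_val, max_val):
    """Return True if every element lies in [min_val, max_val]."""
    values = np.asarray(values)
    return bool(values.min() >= min_val and values.max() <= max_val)


# Made-up samples: pixel-like values and label-like values
sample_pixels = np.array([[0, 17, 255], [128, 64, 3]], dtype=np.int64)
sample_labels = np.array([0, 1, 5, 3], dtype=np.int64)

print(within_bounds(sample_pixels, 0, 255))
print(within_bounds(sample_labels, 0, 5))
print(within_bounds(np.array([0, 6]), 0, 5))
```

In the notebook, the equivalent checks would be `within_bounds(images, 0, 255)` and `within_bounds(labels, 0, 5)`.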

Finally, we will upload the images and labels to the domain.

In [None]:
# creating/uploading the dataset

# Name of the dataset
name = f"MedNIST Data {MY_PARTICIPANT_NUMBER}/{TOTAL_PARTICIPANTS}"

# upload the MedNIST data
domain_client.load_dataset(
    assets={"images": image_data, "labels": label_data},
    name=name,
    description=dataset_description,
    use_blob_storage=True
)

Now let's check whether the dataset was successfully uploaded.

In [None]:
domain_client.datasets

## Create a Data Scientist Account

In [None]:
data_scientist_details = {
    "name": "Samantha Carter",
    "email": "sam@sg1.net",
    "password": "stargate",
    "budget": 9999,
}

In [None]:
domain_client.users.create(**data_scientist_details)

In [None]:
print("Please give these details to the data scientist 👇🏽")
login_details = {}
login_details["url"] = DOMAIN_HOST_IP
login_details["name"] = data_scientist_details["name"]
login_details["email"] = data_scientist_details["email"]
login_details["password"] = data_scientist_details["password"]
login_details["dataset_name"] = name
print()
print(login_details)