# 3. Upload a Dataset

## 3.1. Log into our Domain

### 3.1.1 Import Syft

👇🏽👇🏽 **Run** the cell below 👇🏽👇🏽

In [None]:
import syft as sy

print(f"You're running syft version: {sy.__version__}")

### 3.1.2 Log into Domain

#### 3.1.2.a. Setup login credentials

👇🏽👇🏽 **Run** the cells below 👇🏽👇🏽

In [None]:
# autodetect the host_ip
from utils import auto_detect_domain_host_ip

DOMAIN_HOST_IP = auto_detect_domain_host_ip()

In [None]:
# Set the email and password of your Domain node.
# We will be using the default email and password that got created during Domain creation.

ADMIN_EMAIL="info@openmined.org"
ADMIN_PASSWORD="changethis"

#### 3.1.2.b. Log in using syft

👇🏽👇🏽 **Run** the cells below 👇🏽👇🏽

In [None]:
# Let's log into the domain using the credentials
try:
    domain_client = sy.login(
        url=DOMAIN_HOST_IP, email=ADMIN_EMAIL, password=ADMIN_PASSWORD
    )
    print()
    print("🎉 You successfully connected to your domain!")
except Exception as e:
    print("❌ Unable to connect, did you set the `DOMAIN_HOST_IP` variable above?")
    raise e

## 3.2. Creating a Dataset

### 3.2.1. MedNIST Dataset

We will be using the MedNIST dataset. The MedNIST dataset was gathered from several sets from TCIA, the RSNA Bone Age Challenge, and the NIH Chest X-ray dataset.

The dataset is kindly made available by Dr. Bradley J. Erickson M.D., Ph.D. (Department of Radiology, Mayo Clinic) under the Creative Commons CC BY-SA 4.0 license. If you use the MedNIST dataset, please acknowledge the source, e.g. https://colab.research.google.com/drive/1wy8XUSnNWlhDNazFdvGBHLfdkGvOHBKe#scrollTo=ZaHFhidyCBJa

Let's move on to download and extract the dataset.

The dataset is stored as a pickle file. Let's download it using the instructions below.

Copy your `MY_DATASET_URL` variable from your session details.

```
Hi Person,
These are your Session Details:
-------------------------------
Username: azureuser
Password: **********
VM IP Address: x.x.x.x
    
📎 MY_DATASET_URL:
'https://media.githubusercontent.com/media/shubham3121/datasets/main/MedNIST/subsets/xxxxxxx.pkl'

```

#### 3.2.1.a. Set Dataset URL

👇🏽👇🏽 Set **MY_DATASET_URL** in the cell below and **Run** the cell. 👇🏽👇🏽

In [None]:
# Replace the empty string with your own dataset URL from the session details.
MY_DATASET_URL = ""
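
An empty or mis-pasted URL is an easy mistake to make at this step. As a minimal sketch, the check below confirms that a value is a well-formed `https` URL whose path ends in `.pkl`, matching the sample session details above. The helper name `looks_like_dataset_url` is hypothetical and is not part of the workshop's `utils` module.

```python
from urllib.parse import urlparse


def looks_like_dataset_url(url: str) -> bool:
    """Return True if `url` is an https URL whose path ends in .pkl.

    Hypothetical helper for illustration; not part of `utils`.
    """
    parsed = urlparse(url.strip())
    return (
        parsed.scheme == "https"
        and bool(parsed.netloc)
        and parsed.path.endswith(".pkl")
    )


# Placeholder URL copied from the sample session details (not a real subset).
example_url = (
    "https://media.githubusercontent.com/media/shubham3121/"
    "datasets/main/MedNIST/subsets/xxxxxxx.pkl"
)
print(looks_like_dataset_url(example_url))
print(looks_like_dataset_url(""))
```

In the notebook you could call `looks_like_dataset_url(MY_DATASET_URL)` after setting the variable and fix the value if it returns `False`.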

#### 3.2.1.b. Download Dataset

👇🏽👇🏽 **Run** the cell below 👇🏽👇🏽

In [None]:
# download MedNIST.pkl
from utils import download_mednist_dataset

FILE_PATH = download_mednist_dataset(MY_DATASET_URL)

Now, before we move forward, let's store some variables related to the dataset.

We require your participant number and the total participant count in the session to allocate you a unique subset of the MedNIST data.

### 3.2.2. Load the Dataset

#### 3.2.2.a. Import Helper methods

Below are some helper methods that we will need to load and preprocess the dataset.

👇🏽👇🏽 **Run** the cell below 👇🏽👇🏽

In [None]:
# Import helper methods
from syft.core.adp.data_subject_list import DataSubjectList
from utils import (
    get_data_description,
    get_label_mapping,
    split_into_train_test_val_sets,
    load_data_as_df,
    get_dataset_name,
)

#### 3.2.2.b. Load dataset as DataFrame

👇🏽👇🏽 **Run** the cells below 👇🏽👇🏽

In [None]:
# Let's load the dataset as a dataframe
dataset_df = load_data_as_df(FILE_PATH)

In [None]:
# Let's get a peek of the dataset
dataset_df.head()

#### 3.2.2.c. Split dataset

👇🏽👇🏽 **Run** the cells below 👇🏽👇🏽

In [None]:
# Split the dataset into `train`, `validation` and `test` sets.
data_dict = split_into_train_test_val_sets(dataset_df)

In [None]:
data_dict["train"].shape, data_dict["val"].shape, data_dict["test"].shape
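
To give a feel for what a three-way split does, here is a minimal sketch that shuffles row positions and cuts them into `train`, `val` and `test` groups. It is an illustration only: the function `split_indices`, the 10% fractions and the seed are assumptions, and the actual `split_into_train_test_val_sets` helper in `utils` may work differently.

```python
import numpy as np


def split_indices(n_rows, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle row positions and cut them into train/val/test groups.

    Hypothetical helper; the fractions and seed are illustrative defaults.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)  # every row position exactly once
    n_val = int(n_rows * val_frac)
    n_test = int(n_rows * test_frac)
    return {
        "train": order[n_val + n_test:],
        "val": order[:n_val],
        "test": order[n_val:n_val + n_test],
    }


parts = split_indices(100)
print({name: len(idx) for name, idx in parts.items()})
```

With a pandas DataFrame, each group of positions could then be selected with `DataFrame.iloc`, for example `dataset_df.iloc[parts["train"]]`.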

Next, we need a dataset description, which must be provided to the domain when uploading the dataset.

The description contains brief information about the dataset along with some metadata. We will generate and print it in section 3.3.2.

### 3.2.3. Prepare Dataset for Upload

In the next step, we will create assets for our dataset. An asset is a collection of private data. In our case, the images and labels in the train, val and test sets will be the assets.

#### 3.2.3.a. Preprocess the dataset

👇🏽👇🏽 **Run** the cells below 👇🏽👇🏽

In [None]:
import numpy as np

assets = dict()

for name, data in data_dict.items():

    # Let's create data subjects list.
    # Data Subjects are the individuals whose privacy we're trying to protect.
    data_subjects = DataSubjectList.from_series(data["patient_id"])

    # Convert images to numpy int64 array
    images = data["image"]
    images = np.dstack(images.values).astype(np.int64)  # type cast to int64
    dims = images.shape
    images = images.reshape(dims[0] * dims[1], dims[2])  # reshape to 2D array
    images = np.rollaxis(images, -1)

    # Convert labels to numpy int64 array
    labels = data["label"].to_numpy().astype("int64")

    # Next we will make your data private with min, max and data subjects.
    # The min and max are the lower and upper bounds on the values in the given data.

    # converting images to private data
    image_data = sy.Tensor(images).private(
        min_val=0, max_val=255, data_subjects=data_subjects
    )

    # converting labels to private data
    label_data = sy.Tensor(labels).private(
        min_val=0, max_val=5, data_subjects=data_subjects
    )

    assets[f"{name}_images"] = image_data
    assets[f"{name}_labels"] = label_data
    
print("Data has been successfully preprocessed !!! 🕺")
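
The stacking and reshaping in the cell above can be hard to picture. The sketch below replays the same three NumPy calls (`np.dstack`, `reshape`, `np.rollaxis`) on tiny made-up 2x2 arrays, not on MedNIST data, and checks that the result has one flattened image per row. The variable names are illustrative only.

```python
import numpy as np

# Three fake 2x2 "images"; image i holds the values 10*i .. 10*i + 3.
fake_images = [np.arange(4).reshape(2, 2) + 10 * i for i in range(3)]

stacked = np.dstack(fake_images).astype(np.int64)   # (2, 2, 3): height, width, image
dims = stacked.shape
flat = stacked.reshape(dims[0] * dims[1], dims[2])  # (4, 3): pixel, image
flat = np.rollaxis(flat, -1)                        # (3, 4): image, pixel

print(flat.shape)
print(flat[2])

# Each row is one flattened image, so this matches a plain per-image ravel.
assert np.array_equal(flat, np.stack([img.ravel() for img in fake_images]))

# The bounds passed to `.private()` should cover the actual data range.
assert flat.min() >= 0 and flat.max() <= 255
```

The final assertion mirrors the `min_val=0, max_val=255` bounds used for the images; a similar check against `0` and `5` would apply to the labels.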

Finally, we will upload the assets to the domain.

## 3.3. Upload the Dataset

### 3.3.1. Set dataset name

👇🏽👇🏽 **Run** the cell below 👇🏽👇🏽

In [None]:
# Name of the dataset to create and upload.
# The name is derived from the dataset URL.
name = get_dataset_name(MY_DATASET_URL)

### 3.3.2. Set dataset description

👇🏽👇🏽 **Run** the cell below 👇🏽👇🏽

In [None]:
dataset_description = get_data_description(dataset_df)
print(dataset_description)

### 3.3.3. Upload dataset to Domain

👇🏽👇🏽 **Run** the cell below 👇🏽👇🏽

In [None]:
# upload the MedNIST data
domain_client.load_dataset(
    assets=assets, name=name, description=dataset_description, use_blob_storage=True
)

Now let's check whether the dataset was successfully uploaded.

### 3.3.4. List datasets

👇🏽👇🏽 **Run** the cell below 👇🏽👇🏽

In [None]:
domain_client.datasets

### 3.3.5. View uploaded asset

👇🏽👇🏽 **Run** the cell below 👇🏽👇🏽

In [None]:
domain_client.datasets[0]

## 3.4. Create a Data Scientist Account

### 3.4.1. Set data scientist credentials

👇🏽👇🏽 **Run** the cell below 👇🏽👇🏽

In [14]:
# Enter the details for your data scientist below.
from utils import validate_ds_credentials

data_scientist_details = {
    "name": "",
    "email": "",
    "password": "",
    "budget": 9999,
}
       
validate_ds_credentials(data_scientist_details)

Please set a value for 'name'.
Please set a value for 'email'.
Please set a value for 'password'.
Please set the missing/incorrect values and re-run this cell
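
The output above reports each field that was left blank. As a rough sketch of that kind of check, the snippet below lists the required keys whose values are empty. The function `find_missing_fields` is hypothetical; the real `validate_ds_credentials` in `utils` may apply additional rules (for example, checking the email format).

```python
def find_missing_fields(details, required=("name", "email", "password")):
    """Return the required keys whose values are missing or blank.

    Hypothetical helper for illustration; not part of `utils`.
    """
    return [
        key for key in required
        if not str(details.get(key) or "").strip()
    ]


example_details = {"name": "", "email": "", "password": "", "budget": 9999}
for key in find_missing_fields(example_details):
    print(f"Please set a value for '{key}'.")
```

An empty list means every required field has a value, so the details are ready to pass on to the next step.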


### 3.4.2. Create data scientist user

👇🏽👇🏽 **Run** the cell below 👇🏽👇🏽

In [12]:
# Create the data scientist user
domain_client.users.create(**data_scientist_details)
print("Data Scientist successfully created 🙌🙌")


### 3.4.3. Print data scientist details

👇🏽👇🏽 **Run** the cell below 👇🏽👇🏽

In [None]:
print("Please give these details to the data scientist 👇🏽")
login_details = {}
login_details["url"] = DOMAIN_HOST_IP
login_details["name"] = data_scientist_details["name"]
login_details["email"] = data_scientist_details["email"]
login_details["password"] = data_scientist_details["password"]
login_details["dataset_name"] = name
print()
print(login_details)

☝🏽☝🏽 **Copy** the credentials and share them with the Demo Instructor. ☝🏽☝🏽