## Data acquisition and preparation (Lesson 7)


Welcome to Lesson 3 of Course 3. In this lesson, we look at the steps involved in the data acquisition and data preparation stage.

We will focus on a single data source, the NIH Chest X-Ray Dataset of medical images, and on a binary classification task: identifying whether an X-ray shows abnormal findings.

#### Data is the responsibility of the data owner

In a remote data science setup, the data scientist cannot access the data to explore or prepare it, so ensuring the quality of the dataset becomes the responsibility of the data owner. This includes providing high-quality annotations, applying the right preprocessing steps, and adding differential privacy to preserve data privacy.

#### What you'll learn

By the end of this notebook, you will know how to preprocess an example medical dataset, add differential privacy to the data, and load the dataset into PyGrid for further use.

#### Instructors
The instructors of this course are ------

#### Milestones
The main milestones of this lesson are:
- Acquisition
<!-- <p> Please see these external resources/examples ---Abinav's resources--- </p> -->
- Quality check on dataset
<!-- <p> Please see these external resources/examples ---Abinav's resources--- </p> -->
- Annotation
<!-- <p> Please see these external resources/examples ---Abinav's resources--- </p> -->
- Converting data to PyGrid format

- Linking data from multiple sources
<!-- <p> Please see these external resources/examples ---Abinav's resources--- </p> -->

- Adding Differential Privacy to the metadata of the dataset

- Loading data into the node

## Pre-requisites for the notebook

The easiest way to install the prerequisites for this notebook is to set up a virtual environment with conda by running the following:

```
conda create -n lab python=3.9

conda activate lab
```
Inside that environment, syft can be installed by running the following commands in sequence:
```
git clone https://github.com/OpenMined/PySyft && cd PySyft

git fetch origin dev

git checkout dev

cd packages/syft && pip install -e .
```
For the other requirements, please run the following command:

```
pip install -r requirements.txt
```

A specific requirement for this lesson is a library called PyDicom, which helps us read the medical images in the dataset. It is not part of PySyft, so it needs to be installed before proceeding:

In [None]:
! pip install pydicom
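
Before proceeding, it can help to confirm that the packages this notebook imports are actually visible to the current Python environment. The sketch below uses only the standard library; the helper name `missing_modules` and the module list are illustrative assumptions, not part of PySyft.

```python
import importlib.util

# Top-level module names imported by this notebook; adjust if your setup differs.
REQUIRED_MODULES = ["pandas", "numpy", "torch", "pydicom", "syft"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If anything is reported as missing, the usual cause is that the notebook kernel is not running inside the `lab` environment created above.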

## Imports

In [8]:
import pandas as pd
from pydicom import dcmread
import numpy as np
import torch
from syft.core.adp.entity import DataSubject
import os
import syft as sy

## Loading the data

We will make use of a data loading utility from PyTorch called DataLoader, which allows us to easily iterate over the dataset. More about it can be read <a href="https://pytorch.org/docs/stable/data.html">here</a>.

To achieve this, we define our own class that inherits from the Dataset utility and defines the following methods:
- <i> \_\_init\_\_ </i> method which:
    - stores the path of the directory containing the images (<i> root_path </i>), so that images can be loaded on demand
    - parses the metadata about the images found in <i> label_file </i>

- <i> \_\_getitem_\_ </i> method that offers support for fetching a data sample for a given key

- <i> \_\_len_\_</i> method (optional) that returns the number of samples in our dataset

- <i> to_tensor</i> method that converts the data given as param (images) to a numeric format, preferably as a NumPy array, or as a Tensor.
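
As a minimal sketch of how a class with these methods plugs into DataLoader, the toy example below uses random arrays instead of X-ray files. The class name `ToyImageDataset` and its sizes are hypothetical and chosen only for illustration.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class ToyImageDataset(Dataset):
    """Hypothetical stand-in dataset: random HWC images with binary labels."""

    def __init__(self, num_samples=8, height=4, width=4, channels=1):
        rng = np.random.default_rng(0)
        self.images = rng.integers(
            0, 256, size=(num_samples, height, width, channels), dtype=np.int16
        )
        self.labels = rng.integers(0, 2, size=num_samples)

    def __len__(self):
        return len(self.labels)

    def to_tensor(self, image):
        # Reorder the axes from HWC (height, width, channels) to CHW.
        return torch.from_numpy(np.ascontiguousarray(image.transpose((2, 0, 1))))

    def __getitem__(self, idx):
        return self.to_tensor(self.images[idx]), int(self.labels[idx])


toy_dataset = ToyImageDataset()
loader = DataLoader(toy_dataset, batch_size=4, shuffle=False)

# Each iteration yields a batch of stacked images and the matching labels.
for images, labels in loader:
    print(images.shape, labels)
```

DataLoader only relies on `__len__` and `__getitem__`; `to_tensor` is a helper that `__getitem__` calls internally.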

<b>Note: Need to add some images inline for easier user visualization/conception</b>

In [10]:
class ChestXRayDataset(torch.utils.data.Dataset):
    """Data loading utility for the NIH Chest X-Ray Dataset."""
    
    def __init__(self, root_path, label_file):
        # Reads the dataset from csv and sets the root path
        self.dataset = pd.read_csv(label_file)
        self.root = root_path
    
    def __len__(self):
        """Returns the outer-most dimension of the dataset."""
        return self.dataset.shape[0]
    
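    def to_tensor(self, image):
        """Convert an image from HWC layout (height, width, channels) to the CHW layout used by PyTorch."""
        # Grayscale DICOM images decode to a 2D (height, width) array, so add a channel axis first.
        if image.ndim == 2:
            image = image[:, :, np.newaxis]
        image_tensor = torch.from_numpy(np.ascontiguousarray(image.transpose((2, 0, 1))))
        return image_tensor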

    def __getitem__(self, idx):
        # Identify the file path of the current sample.
        filename = self.dataset.loc[idx].at["patientId"]
        # Use the root path stored in __init__ rather than an undefined global name.
        file_path = os.path.join(self.root, filename)
        
        label = self.dataset.loc[idx].at["Target"]

        # Load the image into memory.
        dicom = dcmread(file_path)
        image = dicom.pixel_array
        
        # Cast the pixels to 16-bit integers so that the tensor can be used by ADP.
        image = self.to_tensor(image.astype(np.int16))
        # Convert tensor into a syft private tensor.
        tensor_image = sy.Tensor(image).private(0, 255, entities=DataSubject(dicom.PatientID))
        
        return tensor_image, label
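
The `private(0, 255, ...)` call above attaches a lower bound, an upper bound, and a data subject to each tensor. Why a differentially private computation needs such bounds can be illustrated with the textbook Laplace mechanism for a bounded mean. The sketch below is an illustration only and is not PySyft's internal mechanism; all function names in it are hypothetical.

```python
import random


def clipped_mean(values, lower, upper):
    """Mean of the values after clamping each one into [lower, upper]."""
    clipped = [min(max(v, lower), upper) for v in values]
    return sum(clipped) / len(clipped)


def mean_sensitivity(lower, upper, n):
    """Largest change in the clipped mean when one of n values is replaced."""
    return (upper - lower) / n


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def private_mean(values, lower, upper, epsilon, rng):
    """Textbook Laplace mechanism for a bounded mean (illustrative only)."""
    scale = mean_sensitivity(lower, upper, len(values)) / epsilon
    return clipped_mean(values, lower, upper) + laplace_noise(scale, rng)


rng = random.Random(0)
pixels = [0, 12, 130, 255, 300]  # 300 is out of range and is clipped to 255
print(private_mean(pixels, lower=0, upper=255, epsilon=1.0, rng=rng))
```

The noise scale grows with `upper - lower`, so declaring bounds that are much wider than the real data adds more noise than necessary, while bounds that are too narrow clip real values.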

Now we can instantiate the class above to initialize our dataset and inspect what a random sample looks like:

<b> Note: Missing a root_path and a label file. Add here a sample picture after paths are added: maybe we download it from someplace?. </b>

In [None]:
ROOT_PATH = ...
LABEL_FILE = ...

remote_dataset = ChestXRayDataset(ROOT_PATH, LABEL_FILE)
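
Since dataset quality is the data owner's responsibility, a quick consistency check between the label file and the image directory is worth running before uploading. The sketch below uses only the standard library and a throwaway directory with made-up file names. It assumes, as the class above does, that the `patientId` column holds the file name relative to the root path; the helper `check_labels` is hypothetical.

```python
import csv
import os
import tempfile
from collections import Counter


def check_labels(root_path, label_file, id_column="patientId", target_column="Target"):
    """Return (ids with no matching file, ids listed more than once, label counts)."""
    with open(label_file, newline="") as handle:
        rows = list(csv.DictReader(handle))
    id_counts = Counter(row[id_column] for row in rows)
    missing = sorted(
        i for i in id_counts if not os.path.exists(os.path.join(root_path, i))
    )
    duplicated = sorted(i for i, n in id_counts.items() if n > 1)
    label_counts = Counter(row[target_column] for row in rows)
    return missing, duplicated, label_counts


# Demo on a temporary directory with made-up file names and labels.
with tempfile.TemporaryDirectory() as demo_root:
    for name in ("a.dcm", "b.dcm"):
        open(os.path.join(demo_root, name), "wb").close()
    demo_labels = os.path.join(demo_root, "labels.csv")
    with open(demo_labels, "w", newline="") as handle:
        csv.writer(handle).writerows(
            [["patientId", "Target"], ["a.dcm", "0"], ["b.dcm", "1"],
             ["b.dcm", "1"], ["c.dcm", "1"]]
        )
    missing, duplicated, label_counts = check_labels(demo_root, demo_labels)
    print("missing files:", missing)
    print("duplicated ids:", duplicated)
    print("label counts:", dict(label_counts))
```

Repeated ids are not necessarily an error, since some label files list one row per finding, but they are worth reviewing along with the class balance.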

## Connect to PyGrid Domain

It is time to connect to a PyGrid Domain node and upload our dataset. 

In the following lesson, you will learn how to set up a Domain node yourself and how to manage it as a data owner once it is running, so hold tight!

In [None]:
# Log in to the already deployed Domain node.

# Please note that the credentials might differ if you use a different user or if the node is
# deployed on a different port.
remote = sy.login(
    email="info@openmined.org", 
    password="changethis", 
    port=8081)

In [None]:
# Upload the dataset we have created above and specify the required fields.
# Note that the dataset is specified via a dict in the 'assets' argument.

remote.load_dataset(
    assets = {"chest_xray" : remote_dataset},
    name = "chest xray from NIH",
    description = "Chest xray dataset of NIH for pneumonia classification"
)

### Congrats, your dataset should now be deployed in a node!

Let's go to the next lesson to learn about deployment and the different ways to query the dataset once it is deployed.