## Lesson 7

### Concept 1
Welcome to Lesson 7 of Course 3. This module covers the steps of the data acquisition and data preparation stage. The aim is to make students familiar with each of these steps. By the end of this notebook, students will know how to preprocess an example medical dataset, add differential privacy to the data, and load it into PyGrid.

### Concept 2
The instructors of this course are ------

### Concept 3
In remote data science, the data scientist has no direct access to the data and therefore cannot inspect it in full detail, so it becomes the data owner's responsibility to ensure that the dataset is of good quality. Here, quality covers the quality of the annotations, the preprocessing steps applied to the dataset, and the differential privacy added to keep the data private.

### Concept 4
The big milestones of this lesson are:
- Acquisition
- Quality check on dataset
- Annotation
- Converting data to PyGrid format
- Linking data from multiple sources
- Adding Differential Privacy to the metadata of the dataset
- Loading data into the node

This notebook walks through these milestones for a single data source: a medical image dataset with binary classification labels.
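
As an illustration of the "Quality check on dataset" milestone, the sketch below counts a few common problems in a binary label table. The column names `patientId` and `Target` are the ones used by the dataset class later in this notebook; the helper `check_labels` and the tiny in-memory table are hypothetical and only stand in for a real label file. Which checks are appropriate depends on the dataset, so treat this as a starting point rather than a complete audit.

```python
import io

import pandas as pd


def check_labels(labels: pd.DataFrame) -> dict:
    """Return simple quality indicators for a binary label table."""
    ids = labels["patientId"]
    targets = labels["Target"]
    return {
        "rows": len(labels),
        "missing_ids": int(ids.isna().sum()),
        "missing_labels": int(targets.isna().sum()),
        # Repeated identifiers (the first occurrence is not counted).
        "duplicate_ids": int(ids.dropna().duplicated().sum()),
        # Non-missing labels that are neither 0 nor 1.
        "invalid_labels": int((~targets.dropna().isin([0, 1])).sum()),
    }


# Made-up table standing in for the real label file.
example_csv = io.StringIO(
    "patientId,Target\n"
    "a1,0\n"
    "a2,1\n"
    "a2,1\n"
    "a3,2\n"
    "a4,\n"
)
report = check_labels(pd.read_csv(example_csv))
print(report)
```

A non-zero count does not always indicate an error (a repeated identifier may be intentional in some datasets), but each one is worth reviewing before the data is loaded into a node.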

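## Prerequisites for the notebook

It is easiest to set up a dedicated virtual environment in which to install the dependencies. With conda, create one as follows and activate it with `conda activate lab` before installing anything: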

```
conda create -n lab python=3.9
```
After creating and activating the environment, install syft by running the following commands in order:
```
git clone https://github.com/OpenMined/PySyft && cd PySyft

git fetch origin dev

git checkout dev

cd packages/syft && pip install -e .
```
Then install the remaining requirements by running this command:

```
pip install -r requirements.txt
```
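
After installation, a quick check that the packages imported in the next section can be found may save some debugging. The sketch below uses only the standard library; the helper name `missing_packages` is hypothetical, and the package list is taken from the import cell of this notebook.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Top-level packages imported by this notebook.
required = ["pandas", "pydicom", "numpy", "torch", "syft"]
print("Missing packages:", missing_packages(required) or "none")
```

If any name is listed, make sure the `lab` environment is active and rerun the installation steps above.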

## Import Libraries

In [1]:
import pandas as pd
from pydicom import dcmread
import numpy as np
import torch
from syft.core.adp.entity import Entity
import os
import syft as sy

## Dataset class

We need a dataset class that can be sent to the domain and consumed by a data loader.

In [3]:
class dataset(torch.utils.data.Dataset):
    def __init__(self, root_path, label_file):
        # Label table with "patientId" and "Target" columns
        self.dataset = pd.read_csv(label_file)
        self.root = root_path

    def __len__(self):
        return self.dataset.shape[0]

    def to_tensor(self, image):
        # Single-frame grayscale DICOM pixel arrays are 2-D (H, W); add a channel axis
        if image.ndim == 2:
            image = image[:, :, np.newaxis]
        # Reorder (H, W, C) to the channels-first (C, H, W) layout
        img = torch.from_numpy(np.ascontiguousarray(image.transpose((2, 0, 1))))
        return img

    def __getitem__(self, idx):
        filename = self.dataset.loc[idx].at["patientId"]
        # Build the path from the root directory stored on the instance
        file_path = os.path.join(self.root, filename)
        label = self.dataset.loc[idx].at["Target"]
        dicom = dcmread(file_path)
        image = dicom.pixel_array
        # Our ADP can handle only integer tensors, so cast the tensor to int16
        image = self.to_tensor(image).to(torch.int16)
        tensor_image = sy.Tensor(image).private(0, 255, entities=Entity(dicom.PatientID))
        return tensor_image, label
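
The `.private(0, 255, ...)` call above attaches a lower and an upper bound to each tensor. To give some intuition for why such bounds matter in differential privacy, the sketch below applies the classic Laplace mechanism to a bounded sum using only the standard library. It is an illustration under simplifying assumptions (one value per contributor, add-or-remove-one neighbouring datasets) and is not a description of how PySyft accounts for privacy internally; the function names and the sample values are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    if scale == 0:
        return 0.0
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_sum(values, lower, upper, epsilon, rng):
    """Clip values to [lower, upper], sum them, and add Laplace noise."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    clipped = [min(max(v, lower), upper) for v in values]
    # Adding or removing one contributor changes the sum by at most this much.
    sensitivity = max(abs(lower), abs(upper))
    return sum(clipped) + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)  # fixed seed so the illustration is repeatable
pixels = [12, 200, 255, 90]  # made-up values, one per contributor
print(private_sum(pixels, lower=0, upper=255, epsilon=1.0, rng=rng))
```

The tighter the bounds, the smaller the sensitivity and the less noise is needed for the same `epsilon`, which is why declaring an accurate value range is useful.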

## Connect to Domain

In [None]:
# Log in to the domain
remote = sy.login(email="info@openmined.org", password="changethis", port=8081)

In [None]:
# root_path (image directory) and label_file (label CSV) must be defined before running this cell
remote_dataset = dataset(root_path, label_file)

In [None]:
remote.load_dataset(
    assets={"chest_xray": remote_dataset},
    name="chest xray from NIH",
    description="Chest xray dataset of NIH for pneumonia classification",
)