# Introduction

Author: Johannes Peter Knoll

Within this notebook you will:
- Preprocess raw data
- Train a neural network model

In [1]:
# The autoreload extension allows you to tweak the code in the imported modules
# and rerun cells to reflect the changes.
%load_ext autoreload
%autoreload 2

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [2]:
# LOCAL IMPORTS
from dataset_processing import *

# IMPORTS
import copy  # used below for copy.deepcopy
import numpy as np # type: ignore
import random
import h5py # type: ignore

# Preprocess Training Data

## SHHS Dataset

The [Sleep Heart Health Study (SHHS)](https://sleepdata.org/datasets/shhs) is a multi-center cohort study implemented by the National Heart Lung & Blood Institute to determine the cardiovascular and other consequences of sleep-disordered breathing.

### Download Dataset

In [None]:
# !wget "https://onedrive.live.com/download?cid=45D5A10F94E33861&resid=45D5A10F94E33861%21248707&authkey=AKRa5kb3XFj4G-o" -O shhs_dataset.h5

### Processing Dataset for Neural Network

The class SleepDataManager handles the data we want to pass to the network. It makes the data accessible in
a memory-saving way, but it needs to save the data again, this time into a pickle file. You can delete the .h5
file afterwards if you want to.

We have multiple data sources (besides the SHHS dataset) that we can train the network on.
Because the DataLoader class of the neural network needs to access the data, it is more convenient to have
all data in one place: the SleepDataManager.

During the saving process, the SleepDataManager makes sure that the data is uniform and may
perform the following actions:
- Scale number of datapoints in signal if sampling frequency does not match
- Alter sleep stage labels if they do not refer to the same context
- Split datapoint into multiple if signal duration is longer than required for the neural network
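The first action can be illustrated with a minimal sketch. It assumes linear interpolation with `np.interp`, which is not necessarily what SleepDataManager does internally, and the function name `resample_signal` is hypothetical. Interpolation only makes sense for continuous signals such as RRI, not for categorical sleep stage labels.

```python
import numpy as np

def resample_signal(signal, source_frequency, target_frequency):
    """Resample a 1D signal to a new sampling frequency (hypothetical helper)."""
    signal = np.asarray(signal, dtype=float)
    # The duration stays the same, so the number of datapoints scales with the frequency.
    duration_seconds = len(signal) / source_frequency
    target_length = int(round(duration_seconds * target_frequency))
    source_times = np.arange(len(signal)) / source_frequency
    target_times = np.arange(target_length) / target_frequency
    # Linear interpolation between the original samples (an assumption of this sketch).
    return np.interp(target_times, source_times, signal)

# 8 samples at 2 Hz (4 seconds) become 16 samples at 4 Hz.
resampled = resample_signal(np.arange(8), source_frequency=2, target_frequency=4)
print(len(resampled))
```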

To do all of this, we need to provide more information than the signal itself:

In [None]:
# Saveable Datapoint:
"""
{
    "ID": str,                  # always required
    "RRI": np.ndarray,
    "MAD": np.ndarray,
    "SLP": np.ndarray,
    "RRI_frequency": int,       # required if RRI signal is provided
    "MAD_frequency": int,       # required if MAD signal is provided
    "SLP_frequency": int,       # required if SLP signal is provided
    "sleep_stage_label": list   # required if SLP signal is provided
} 
"""
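The requirements listed in the comments above can be checked before saving. The following is only a sketch of such a check. The helper `missing_keys` is hypothetical and not part of SleepDataManager.

```python
def missing_keys(datapoint):
    """Return the keys that the rules above require but the datapoint lacks."""
    missing = []
    # "ID" is always required.
    if "ID" not in datapoint:
        missing.append("ID")
    # Every provided signal needs its sampling frequency.
    for signal in ("RRI", "MAD", "SLP"):
        if signal in datapoint and f"{signal}_frequency" not in datapoint:
            missing.append(f"{signal}_frequency")
    # The sleep stage signal additionally needs its label description.
    if "SLP" in datapoint and "sleep_stage_label" not in datapoint:
        missing.append("sleep_stage_label")
    return missing

incomplete = missing_keys({"ID": "example_id", "RRI": [0.8, 0.9]})
print(incomplete)
```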

Most of the keys are self-explanatory, except for the last one:

We want to assign different sleep stage labels in our network (SSM in the following):

|number|SHHS stage|SSM stage|
|------|----------|---------|
|  0   | wake     | wake    |
|  1   | N1       | LS      |
|  2   | N2       | DS      |
|  3   | N3       | REM     |
|  5   | REM      |         |
| other| artifact |         |
| -1   |          | artifact|

As you see: N1 needs to be classified as wake, N2 as LS (light sleep), and N3 as DS (deep sleep).
To do this, we effectively need to change: \
0 -> 0, 1 -> 0, 2 -> 1, 3 -> 2, 5 -> 3, other -> -1

To make this achievable by the algorithm, we just need to say which labels correspond to which stage
in the "sleep_stage_label" key as follows:

In [3]:
shhs_sleep_stage_label = {"wake": [0, 1], "LS": [2], "DS": [3], "REM": [5], "artifact": ["other"]}
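To make the relabeling concrete, here is a minimal sketch of how such a dictionary could be applied to an array of SHHS labels. The function `remap_sleep_stages` and the dictionary `ssm_stage_numbers` are hypothetical names for this illustration. The actual conversion happens inside SleepDataManager and may differ.

```python
import numpy as np

# Target numbers of the SSM stages, taken from the table above.
ssm_stage_numbers = {"wake": 0, "LS": 1, "DS": 2, "REM": 3}

# Source labels per stage (the artifact entry is handled as the default case).
source_labels = {"wake": [0, 1], "LS": [2], "DS": [3], "REM": [5]}

def remap_sleep_stages(labels, source_labels, target_numbers, artifact_value=-1):
    """Translate source labels into target labels (hypothetical helper)."""
    labels = np.asarray(labels)
    # Start with everything marked as artifact, then overwrite the known stages.
    remapped = np.full(labels.shape, artifact_value, dtype=int)
    for stage, originals in source_labels.items():
        remapped[np.isin(labels, originals)] = target_numbers[stage]
    return remapped

# 9 is not a known SHHS label here and therefore ends up as artifact (-1).
example = remap_sleep_stages([0, 1, 2, 3, 5, 9], source_labels, ssm_stage_numbers)
print(example)  # [ 0  0  1  2  3 -1]
```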

### Accessing Dataset

In [5]:
#path_to_shhs_dataset = "../Training_Data/SHHS_dataset.h5"
path_to_shhs_dataset = "Raw_Data/SHHS_dataset.h5"

shhs_dataset = h5py.File(path_to_shhs_dataset, 'r')

### Transferring Data from .h5 into .pkl file using SleepDataManager

It is wise to check whether all IDs are unique prior to saving. Then we can skip
checking every ID in the database when saving each datapoint, which speeds up the saving process greatly.

In [24]:
# initializing the database
file_path_to_sleep_data = "Processed_Data/sleep_data.pkl"
sleep_data_manager = SleepDataManager(file_path = file_path_to_sleep_data)

# accessing patient ids:
patients = list(shhs_dataset['slp'].keys()) # type: ignore

# check if patient ids are unique:
sleep_data_manager.check_if_ids_are_unique(patients)

All IDs are unique.
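For reference, a uniqueness check of this kind can be sketched in a few lines. This is an illustration only, not the implementation of `check_if_ids_are_unique`. The helper `find_duplicate_ids` and the example IDs are hypothetical.

```python
from collections import Counter

def find_duplicate_ids(ids):
    """Return every ID that occurs more than once (hypothetical helper)."""
    counts = Counter(ids)
    return sorted(identifier for identifier, count in counts.items() if count > 1)

duplicates = find_duplicate_ids(["200001", "200002", "200001"])
print(duplicates)  # ['200001']
```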


If all IDs are unique, you can continue:

In [25]:
# saving all data from SHHS dataset to the sleep_data.pkl
for patient_id in patients:
    new_datapoint = {
        "ID": patient_id,
        "RRI": shhs_dataset["rri"][patient_id][:], # type: ignore
        "SLP": shhs_dataset["slp"][patient_id][:], # type: ignore
        "RRI_frequency": shhs_dataset["rri"].attrs["freq"], # type: ignore
        "SLP_frequency": shhs_dataset["slp"].attrs["freq"], # type: ignore
        "sleep_stage_label": copy.deepcopy(shhs_sleep_stage_label)
    }

    sleep_data_manager.save(new_datapoint, unique_id=True)

### Transforming Data to overlapping windows

We want to pass the signal in overlapping windows to the neural network:

In [None]:
sleep_data_manager.transform_signals_to_windows(
    number_windows = 1197, 
    window_duration_seconds = 120, 
    overlap_seconds = 90, 
    priority_order = [0, 1, 2, 3, 5, -1]
    )
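With these parameters, consecutive windows start 120 - 90 = 30 seconds apart, so 1197 windows cover 120 + 1196 * 30 = 36000 seconds, which is 10 hours of signal. The sketch below reproduces this arithmetic on a dummy signal. The function `split_into_windows` and the sampling frequency of 4 Hz are assumptions for illustration, and the priority-based handling of labels is not shown.

```python
import numpy as np

def split_into_windows(signal, frequency, window_duration_seconds, overlap_seconds):
    """Cut a 1D signal into overlapping windows (hypothetical helper)."""
    window_length = int(window_duration_seconds * frequency)
    step = int((window_duration_seconds - overlap_seconds) * frequency)
    if len(signal) < window_length:
        raise ValueError("Signal is shorter than one window.")
    # Only windows that fit completely into the signal are kept.
    number_windows = (len(signal) - window_length) // step + 1
    return np.stack([
        signal[i * step : i * step + window_length]
        for i in range(number_windows)
    ])

# 10 hours of dummy data at an assumed sampling frequency of 4 Hz.
signal = np.arange(36000 * 4)
windows = split_into_windows(signal, frequency=4, window_duration_seconds=120, overlap_seconds=90)
print(windows.shape)  # (1197, 480)
```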

## GIF Dataset

Analogous to the SHHS dataset, we will save the data to our SleepDataManager and transform it into windows.

In [None]:
# TODO: set this to the GIF dataset file (the path below is the SHHS file)
path_to_gif_dataset = "../Training_Data/SHHS_dataset.h5"

## Create Training, Validation, and Test Datasets

For easier use, we will split our database into main, training, validation, and test files:

In [None]:
sleep_data_manager.separate_train_test_validation(
    train_size = 0.8, 
    validation_size = 0.1, 
    test_size = 0.1, 
    random_state = None, 
    shuffle = True
)

We should now have three additional files in the same directory as our main data file ("file_path_to_sleep_data").

Each can be accessed separately with another instance of the class SleepDataManager. Note that their functionality
is limited, as they are only meant to return (load) data.

The data in these files can be reshuffled by running the code cell above again.
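The effect of the split parameters can be sketched on a plain list of IDs. This is a simplified illustration under the assumption that the split happens per ID. The function `split_ids` is hypothetical and does not reflect how SleepDataManager writes its files.

```python
import random

def split_ids(ids, train_size=0.8, validation_size=0.1, test_size=0.1,
              random_state=None, shuffle=True):
    """Split a list of IDs into train, validation, and test parts (hypothetical helper)."""
    if abs(train_size + validation_size + test_size - 1.0) > 1e-9:
        raise ValueError("The three sizes must sum to 1.")
    ids = list(ids)
    if shuffle:
        # random_state=None gives a different order on every call.
        random.Random(random_state).shuffle(ids)
    n_train = round(len(ids) * train_size)
    n_validation = round(len(ids) * validation_size)
    train = ids[:n_train]
    validation = ids[n_train:n_train + n_validation]
    test = ids[n_train + n_validation:]
    return train, validation, test

all_ids = [f"patient_{i}" for i in range(100)]
train_ids, validation_ids, test_ids = split_ids(all_ids, random_state=0)
print(len(train_ids), len(validation_ids), len(test_ids))  # 80 10 10
```

Passing a fixed `random_state` in this sketch makes the split reproducible, while `None` reshuffles on every call.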

# Train Neural Network Model