**Author:** *Johannes Peter Knoll*

# Introduction

In this notebook you will:
- Preprocess the raw data
- Train a neural network model

In [2]:
# The autoreload extension allows you to tweak the code in the imported modules
# and rerun cells to reflect the changes.
%load_ext autoreload
%autoreload 2

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [3]:
# LOCAL IMPORTS
from dataset_processing import *
from neural_network_model import *

# IMPORTS
import copy # needed for copy.deepcopy when saving datapoints
import numpy as np # type: ignore
import random
import h5py # type: ignore

# Preprocess Training Data

## Processing Dataset for Neural Network

The class SleepDataManager handles the data we want to pass to the network. It makes the data accessible in
a memory-saving way, but to do so it needs to save the data again, this time into a pickle file. You can
delete the .h5 file afterwards if you want to.

Besides the SHHS dataset, we have multiple other sources with data that we can train the network on.
The data must be uniform in sampling frequency, signal length, etc., which is why the SleepDataManager class
checks and transforms each datapoint before saving it.

During the saving process, the SleepDataManager makes sure that the data is uniform in every way and might
perform the following actions:
- Scale number of datapoints in signal if sampling frequency does not match
- Alter sleep stage labels if they do not refer to the same context
- Split datapoint into multiple if signal duration is longer than required for the neural network
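
The first and third actions can be illustrated with a minimal sketch. It assumes linear interpolation for resampling and simple consecutive chunks for splitting; `resample_signal` and `split_signal` are hypothetical helpers, not part of SleepDataManager, which may work differently internally. Linear interpolation only makes sense for continuous signals such as RRI, not for sleep stage labels.

```python
def resample_signal(signal, source_frequency, target_frequency):
    """Linearly interpolate `signal` from source_frequency to target_frequency (in Hz)."""
    if source_frequency == target_frequency:
        return list(signal)
    duration_seconds = len(signal) / source_frequency
    target_length = round(duration_seconds * target_frequency)
    resampled = []
    for i in range(target_length):
        # Position of the new sample, measured on the old sample grid.
        position = i * source_frequency / target_frequency
        left = min(int(position), len(signal) - 1)
        right = min(left + 1, len(signal) - 1)
        fraction = position - left
        resampled.append(signal[left] * (1 - fraction) + signal[right] * fraction)
    return resampled


def split_signal(signal, frequency, max_duration_seconds):
    """Cut `signal` into consecutive chunks of at most max_duration_seconds."""
    chunk_length = int(max_duration_seconds * frequency)
    return [signal[i:i + chunk_length] for i in range(0, len(signal), chunk_length)]


rri = [0.0, 1.0, 2.0, 3.0]             # 4 samples at 2 Hz, i.e. 2 seconds
rri_4hz = resample_signal(rri, 2, 4)   # 8 samples at 4 Hz
chunks = split_signal(rri_4hz, 4, 1)   # two chunks of 1 second each
```

The signal duration stays the same after resampling; only the number of samples changes.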

To do all of this, we need to provide more information than the signal itself:

In [None]:
# Saveable Datapoint:
"""
{
    "ID": str,                  # always required
    "RRI": np.ndarray,
    "MAD": np.ndarray,
    "SLP": np.ndarray,
    "RRI_frequency": int,       # required if RRI signal is provided
    "MAD_frequency": int,       # required if MAD signal is provided
    "SLP_frequency": int,       # required if SLP signal is provided
    "sleep_stage_label": dict   # required if SLP signal is provided
}
"""

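The "required if" rules in the datapoint description can be checked before saving. The following is a minimal sketch of such a check; `missing_keys` is a hypothetical helper written for illustration, and SleepDataManager may validate datapoints differently.

```python
def missing_keys(datapoint):
    """Return the required keys that are absent from `datapoint`."""
    missing = []
    if "ID" not in datapoint:
        missing.append("ID")
    # Every provided signal needs its sampling frequency.
    for signal in ("RRI", "MAD", "SLP"):
        if signal in datapoint and f"{signal}_frequency" not in datapoint:
            missing.append(f"{signal}_frequency")
    # Sleep stage signals additionally need the label description.
    if "SLP" in datapoint and "sleep_stage_label" not in datapoint:
        missing.append("sleep_stage_label")
    return missing


example = {"ID": "patient_1", "RRI": [0.8, 0.9], "RRI_frequency": 4, "SLP": [0, 1]}
problems = missing_keys(example)
```

Here `problems` lists the two keys that the SLP signal requires but the example lacks.
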
Most of the keys are self-explanatory, except for the last one:

We want to assign different sleep stage labels in our network (SSM in the following):

|number|SHHS stage|GIF stage|SSM stage|
|------|----------|---------|---------|
|  0   | wake     |         | wake    |
|  1   | N1       |         | LS      |
|  2   | N2       |         | DS      |
|  3   | N3       |         | REM     |
|  5   | REM      |         |         |
| other| artifact |         |         |
| -1   |          |         | artifact|

As you can see, N1 needs to be classified as wake, N2 as LS (light sleep), and N3 as DS (deep sleep).
For the SHHS labels, this effectively means changing: \
0 -> 0, 1 -> 0, 2 -> 1, 3 -> 2, 5 -> 3, other -> -1

To let the algorithm do this, we state which labels correspond to which stage
under the "sleep_stage_label" key as follows:

In [10]:
shhs_sleep_stage_label = {"wake": [0, 1], "LS": [2], "DS": [3], "REM": [5], "artifact": ["other"]}
gif_sleep_stage_label = {"wake": [0, 1], "LS": [2], "DS": [3], "REM": [5], "artifact": ["other"]}
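
To show how such a dictionary could drive the relabeling, here is a minimal sketch. It assumes the SSM numbers from the table (wake 0, LS 1, DS 2, REM 3, artifact -1) and treats every label not listed explicitly as an artifact; `remap_labels` and `SSM_STAGE_NUMBERS` are hypothetical names, not the SleepDataManager implementation.

```python
# Target numbers of the SSM stages, taken from the table above.
SSM_STAGE_NUMBERS = {"wake": 0, "LS": 1, "DS": 2, "REM": 3, "artifact": -1}


def remap_labels(labels, sleep_stage_label):
    """Translate source labels to SSM numbers; unlisted labels become artifacts."""
    lookup = {}
    for stage, source_labels in sleep_stage_label.items():
        for source_label in source_labels:
            if source_label != "other":  # "other" is handled by the default below
                lookup[source_label] = SSM_STAGE_NUMBERS[stage]
    return [lookup.get(label, SSM_STAGE_NUMBERS["artifact"]) for label in labels]


example_mapping = {"wake": [0, 1], "LS": [2], "DS": [3], "REM": [5], "artifact": ["other"]}
remapped = remap_labels([0, 1, 2, 3, 5, 9], example_mapping)
```

With this mapping, the labels 0, 1, 2, 3, 5 become 0, 0, 1, 2, 3, and the unlisted label 9 becomes -1.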

## SHHS Dataset

The [Sleep Heart Health Study (SHHS)](https://sleepdata.org/datasets/shhs) is a multi-center cohort study implemented by the National Heart Lung & Blood Institute to determine the cardiovascular and other consequences of sleep-disordered breathing.

### Download Dataset

In [None]:
# !wget "https://onedrive.live.com/download?cid=45D5A10F94E33861&resid=45D5A10F94E33861%21248707&authkey=AKRa5kb3XFj4G-o" -O shhs_dataset.h5

### Accessing Dataset

In [49]:
#path_to_shhs_dataset = "../Training_Data/SHHS_dataset.h5"
path_to_shhs_dataset = "Raw_Data/SHHS_dataset.h5"

shhs_dataset = h5py.File(path_to_shhs_dataset, 'r')

### Transferring Data from .h5 into .pkl file using SleepDataManager

It is wise to check whether all IDs are unique prior to saving. Then we can skip
checking every ID in the database when saving each datapoint, which speeds up the saving process considerably.

In [50]:
# initializing the database
path_to_save_processed_shhs_data = "Processed_Data/shhs_data.pkl"
shhs_data_manager = SleepDataManager(file_path = path_to_save_processed_shhs_data)

# accessing patient ids:
patients = list(shhs_dataset['slp'].keys()) # type: ignore

# check if patient ids are unique:
shhs_data_manager.check_if_ids_are_unique(patients)

All IDs are unique.
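
A single up-front check is cheap because it only needs one pass over the ID list. The sketch below shows one way to find duplicates with the standard library; `find_duplicate_ids` is a hypothetical helper and not necessarily what `check_if_ids_are_unique` does internally.

```python
from collections import Counter


def find_duplicate_ids(ids):
    """Return all IDs that occur more than once, in sorted order."""
    counts = Counter(ids)
    return sorted(identifier for identifier, count in counts.items() if count > 1)


duplicates = find_duplicate_ids(["SL003", "SL005", "SL003"])
all_unique = not find_duplicate_ids(["SL003", "SL005", "SL006"])
```

An empty result means that all IDs are unique.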


If all IDs are unique, you can continue:

In [51]:
# saving all data from SHHS dataset to the shhs_data.pkl file
for patient_id in patients:
    new_datapoint = {
        "ID": patient_id,
        "RRI": shhs_dataset["rri"][patient_id][:], # type: ignore
        "SLP": shhs_dataset["slp"][patient_id][:], # type: ignore
        "RRI_frequency": shhs_dataset["rri"].attrs["freq"], # type: ignore
        "SLP_frequency": shhs_dataset["slp"].attrs["freq"], # type: ignore
        "sleep_stage_label": copy.deepcopy(shhs_sleep_stage_label)
    }

    shhs_data_manager.save(new_datapoint, unique_id=True)

### Transforming Data to overlapping windows

We want to pass the signal in overlapping windows to the neural network:

In [53]:
shhs_data_manager.transform_signals_to_windows(
    number_windows = 1197, 
    window_duration_seconds = 120, 
    overlap_seconds = 90, 
    priority_order = [0, 1, 2, 3, 5, -1]
    )

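
The parameters above determine how much signal the windows cover: with a window of 120 seconds and an overlap of 90 seconds, each window starts 30 seconds after the previous one. The following sketch computes the window boundaries under that reading of the parameters; `window_bounds` is a hypothetical helper, not a SleepDataManager method.

```python
def window_bounds(number_windows, window_duration_seconds, overlap_seconds):
    """Return (start, end) in seconds for each overlapping window."""
    step = window_duration_seconds - overlap_seconds  # shift between window starts
    return [
        (i * step, i * step + window_duration_seconds)
        for i in range(number_windows)
    ]


bounds = window_bounds(1197, 120, 90)
total_seconds = bounds[-1][1]        # end of the last window
total_hours = total_seconds / 3600
```

Under this assumption, 1197 windows cover 120 + 1196 * 30 = 36000 seconds, which is exactly 10 hours of signal.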

## GIF Dataset

Analogous to the SHHS dataset, we will save the data to our SleepDataManager and transform it into windows.

In [None]:
# relevant_keys = ["file_name", "RRI", "RRI_frequency", "MAD", "MAD_frequency", "SLP"]
# results_generator = load_from_pickle("Processed_GIF/GIF_Results.pkl")

### Accessing Dataset

In [28]:
path_to_gif_dataset = "Raw_Data/GIF_dataset.h5"

gif_dataset = h5py.File(path_to_gif_dataset, 'r')

### Transferring Data from .h5 into .pkl file using SleepDataManager

In [29]:
# initializing the database
path_to_save_processed_gif_data = "Processed_Data/gif_data.pkl"
gif_data_manager = SleepDataManager(file_path = path_to_save_processed_gif_data)

# accessing patient ids:
patients = list(gif_dataset['stage'].keys()) # type: ignore

# check if patient ids are unique:
gif_data_manager.check_if_ids_are_unique(patients)

All IDs are unique.


In [24]:
print(len(patients), patients)
print(gif_dataset["stage"].attrs["freq"])
print(gif_dataset["rri"].attrs["freq"])
print(gif_dataset["mad"].attrs["freq"])
sleep = gif_dataset["stage"]["SL003"][:]
rri = gif_dataset["rri"]["SL003"][:]
mad = gif_dataset["mad"]["SL003"][:]
print(len(sleep), sleep[400:500])
print(len(rri))
print(len(mad))

293 ['SL003', 'SL005', 'SL006', 'SL008', 'SL009', 'SL010', 'SL012', 'SL013', 'SL014', 'SL015', 'SL017', 'SL018', 'SL019', 'SL020', 'SL021', 'SL022', 'SL024', 'SL026', 'SL028', 'SL029', 'SL030', 'SL031', 'SL033', 'SL035', 'SL036', 'SL038', 'SL039', 'SL041', 'SL042', 'SL043', 'SL044', 'SL045', 'SL046', 'SL047', 'SL048', 'SL049', 'SL050', 'SL051', 'SL052', 'SL053', 'SL054', 'SL056', 'SL058', 'SL059', 'SL060', 'SL062', 'SL063', 'SL064', 'SL065', 'SL067', 'SL068', 'SL069', 'SL070', 'SL071', 'SL072', 'SL074', 'SL077', 'SL078', 'SL080', 'SL081', 'SL082', 'SL084', 'SL086', 'SL092', 'SL093', 'SL094', 'SL095', 'SL097', 'SL099', 'SL102', 'SL103', 'SL104', 'SL106', 'SL107', 'SL108', 'SL109', 'SL110', 'SL112', 'SL113', 'SL115', 'SL117', 'SL118', 'SL119', 'SL120', 'SL121', 'SL122', 'SL123', 'SL124', 'SL125', 'SL127', 'SL128', 'SL129', 'SL130', 'SL131', 'SL134', 'SL135', 'SL136', 'SL137', 'SL139', 'SL140', 'SL142', 'SL143', 'SL144', 'SL146', 'SL147', 'SL148', 'SL149', 'SL150', 'SL152', 'SL153', 'SL15

If all IDs are unique, you can continue:

In [39]:
# saving all data from GIF dataset to the gif_data.pkl file
for patient_id in patients:
    new_datapoint = {
        "ID": patient_id,
        "RRI": gif_dataset["rri"][patient_id][:], # type: ignore
        "MAD": gif_dataset["mad"][patient_id][:], # type: ignore
        "SLP": gif_dataset["stage"][patient_id][:], # type: ignore
        "RRI_frequency": gif_dataset["rri"].attrs["freq"], # type: ignore
        "MAD_frequency": gif_dataset["mad"].attrs["freq"], # type: ignore
        "SLP_frequency": 1/30, # type: ignore
        "sleep_stage_label": copy.deepcopy(gif_sleep_stage_label)
    }

    gif_data_manager.save(new_datapoint, unique_id=True)

### Transforming Data to overlapping windows

In [None]:
gif_data_manager.transform_signals_to_windows(
    number_windows = 1197, 
    window_duration_seconds = 120, 
    overlap_seconds = 90, 
    priority_order = [0, 1, 2, 3, 5, -1]
    )

## Create Training, Validation, and Test Datasets

For easier application, we will split our database into main, training, validation, and test files:

In [None]:
shhs_data_manager.separate_train_test_validation(
    train_size = 0.8, 
    validation_size = 0.1, 
    test_size = 0.1, 
    random_state = None, 
    shuffle = True
)

In [None]:
gif_data_manager.separate_train_test_validation(
    train_size = 0.8, 
    validation_size = 0.1, 
    test_size = 0.1, 
    random_state = None, 
    shuffle = True
)

We should now have three additional files in the same directory where our processed data is saved.

Each can be accessed separately with another instance of the class SleepDataManager. Note that the functionality
of these instances is limited, as they are only meant to return (load) data.

The data in these files can be reshuffled by calling separate_train_test_validation again.
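
The split itself can be pictured as shuffling the datapoint IDs and cutting the list at the requested proportions. This is a minimal sketch under that assumption; `split_ids` is a hypothetical function whose parameter names merely mirror the call above, and the real method may distribute the data differently.

```python
import random


def split_ids(ids, train_size, validation_size, test_size, shuffle=True, random_state=None):
    """Split `ids` into train, validation, and test lists by proportion."""
    if abs(train_size + validation_size + test_size - 1.0) > 1e-9:
        raise ValueError("train_size, validation_size and test_size must sum to 1")
    ids = list(ids)
    if shuffle:
        # A fixed random_state gives a reproducible shuffle; None gives a new one each call.
        random.Random(random_state).shuffle(ids)
    train_end = round(len(ids) * train_size)
    validation_end = train_end + round(len(ids) * validation_size)
    return ids[:train_end], ids[train_end:validation_end], ids[validation_end:]


train_ids, validation_ids, test_ids = split_ids(range(100), 0.8, 0.1, 0.1, random_state=0)
```

With `random_state = None`, as in the cells above, every call produces a different split, which is why calling the method again reshuffles the files.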

# Datasets and Dataloaders

We now want to access our training, validation, and test data using a custom dataset class and the
DataLoader, as suggested in the PyTorch tutorials
(Source: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html)

## Loading Dataset

In [None]:
# training_data_shhs = CustomSleepDataset(path_to_data)
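
The interface of `CustomSleepDataset` is not shown here, so the following is only a sketch of the pattern from the PyTorch tutorial: a map-style dataset implements `__len__` and `__getitem__`. To stay dependency-free, the hypothetical `WindowDataset` below does not subclass `torch.utils.data.Dataset`, and the hypothetical `iterate_batches` stands in for the batching that a PyTorch `DataLoader` would normally perform.

```python
class WindowDataset:
    """Map-style dataset: pairs each signal with its sleep stage label."""

    def __init__(self, signals, labels):
        if len(signals) != len(labels):
            raise ValueError("signals and labels must have the same length")
        self.signals = signals
        self.labels = labels

    def __len__(self):
        return len(self.signals)

    def __getitem__(self, index):
        return self.signals[index], self.labels[index]


def iterate_batches(dataset, batch_size):
    """Yield consecutive batches of (signal, label) pairs."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, stop)]


dataset = WindowDataset([[0.8, 0.9], [1.0, 1.1], [0.7, 0.6]], [0, 1, 2])
batches = list(iterate_batches(dataset, batch_size=2))
```

In the notebook itself, the dataset class would typically subclass `torch.utils.data.Dataset` and be wrapped in a `DataLoader`, which also handles shuffling and tensor collation.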

## Preparing data for training with DataLoaders

# Train Neural Network Model