**Author:** *Johannes Peter Knoll*

# Introduction

This notebook demonstrates all functionality this project offers:
- preprocess data (unify data, split training data into training and validation pids)
- train the neural network
- predict sleep stages of validation data and evaluate neural network performance
- predict sleep stages of non-training and non-validation data

It is essentially an annotated version of the file 'main.py'.

In [3]:
# The autoreload extension allows you to tweak the code in the imported modules
# and rerun cells to reflect the changes.
%load_ext autoreload
%autoreload 2

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [2]:
# LOCAL IMPORTS
from main import *

# IMPORTS
import os  # used below when checking for and removing the configuration file
import h5py # type: ignore

# Training Data

We essentially want to map RRI (and MAD) data to sleep stages. So we need datasets providing exactly this.
For training our model, we used the following:
- SHHS: provides RRI and sleep stage
- GIF: provides RRI, MAD and sleep stage

## SHHS Dataset

The [Sleep Heart Health Study (SHHS)](https://sleepdata.org/datasets/shhs) is a multi-center cohort study implemented by the National Heart Lung & Blood Institute to determine the cardiovascular and other consequences of sleep-disordered breathing.

### Download Dataset

In [None]:
# !wget "https://onedrive.live.com/download?cid=45D5A10F94E33861&resid=45D5A10F94E33861%21248707&authkey=AKRa5kb3XFj4G-o" -O shhs_dataset.h5

### Accessing Dataset

In [None]:
#path_to_shhs_dataset = "../Training_Data/SHHS_dataset.h5"
path_to_shhs_dataset = "Raw_Data/SHHS_dataset.h5"

shhs_dataset = h5py.File(path_to_shhs_dataset, 'r')

## GIF Dataset

Unfortunately, the GIF data is not publicly available.

# Project Configuration

Project Configuration includes [setting file paths](#setting-file-paths) and [adjusting parameters](#adjusting-parameters).

Preprocessing data and training the neural network can be controlled using various parameters. To ensure that
later predictions use the same parameters for setting up the neural network and preprocessing the data, we
save them as a dictionary in a pickle file, which is accessed at every step.
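
The idea of persisting the configuration can be sketched with the standard library's `pickle` module. This is only an illustration: the project's own `save_to_pickle` helper may work differently, and the helper names, the example values and the temporary file path below are hypothetical.

```python
import os
import pickle
import tempfile


def save_configuration(configuration: dict, file_path: str) -> None:
    """Write the configuration dictionary to a pickle file (hypothetical helper)."""
    with open(file_path, "wb") as file:
        pickle.dump(configuration, file)


def load_configuration(file_path: str) -> dict:
    """Read the configuration dictionary back from a pickle file (hypothetical helper)."""
    with open(file_path, "rb") as file:
        return pickle.load(file)


# Illustrative values only; the real configuration holds many more keys.
example_configuration = {"RRI_frequency": 4, "window_duration_seconds": 120}

# A temporary directory keeps this sketch from touching real project files.
with tempfile.TemporaryDirectory() as directory:
    example_path = os.path.join(directory, "Project_Configuration.pkl")
    save_configuration(example_configuration, example_path)
    restored_configuration = load_configuration(example_path)
```

Because every later step reads the same file, preprocessing, training and prediction cannot silently drift apart in their settings.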

## Default Parameters

Changing the following default parameters directly is strongly discouraged, especially if you only aim to
adjust a few of them.

In the following section ['Creating Project Configuration'](#creating-project-configuration) we will see the 
recommended way of adjusting parameters.

### Parameters for 'SleepDataManager' class

The 'SleepDataManager' class and all data processing functions are thoroughly explained in the Jupyter
notebook 'Processing_Demo'. However, below is a short summary of the basics you need to know for this
project:

This class resaves your training data to a separate pickle file and ensures that the data is uniform and can
be accessed and passed to the neural network in a memory-saving way. During the saving process it might perform
the following actions:
- scale the number of datapoints in a signal so that the signal's sampling frequency matches the uniform
    database signal frequency
- alter sleep labels
- remove RRI and/or MAD outliers
- split a signal into multiple signals if it is longer than the uniform maximum signal length:
    'signal_length_seconds'

All other functionalities of this class will be explained during this project when necessary.

See 'SleepDataManager' class in 'dataset_processing.py'

In [None]:
sleep_data_manager_parameters = {
    "RRI_frequency": 4,
    "MAD_frequency": 1,
    "SLP_frequency": 1/30,
    "RRI_inlier_interval": [0.3, 2],
    "MAD_inlier_interval": [None, None],
    "sleep_stage_label": {"wake": 0, "LS": 1, "DS": 2, "REM": 3, "artifect": 0},
    "signal_length_seconds": 36000,
    "wanted_shift_length_seconds": 5400,
    "absolute_shift_deviation_seconds": 1800,
}

### Splitting data into pids

Splitting our data into training-, validation- (and test-) pids can be performed using the 'SleepDataManager' 
class.

See: 'separate_train_test_validation' function of 'SleepDataManager' class in 'dataset_processing.py'

In [None]:
split_data_parameters = {
    "train_size": 0.8,
    "validation_size": 0.2,
    "test_size": None,
    "random_state": None,
    "shuffle": True
}
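
To make the meaning of these parameters concrete, here is a minimal sketch of a shuffled ID split using only the standard library. It is not the project's 'separate_train_test_validation' implementation; the function name `split_ids`, the example IDs and the rounding rule are assumptions made for illustration, and the optional test split is left out.

```python
import random


def split_ids(ids, train_size=0.8, validation_size=0.2, shuffle=True, random_state=None):
    """Split IDs into a training and a validation list (illustrative only)."""
    if abs(train_size + validation_size - 1.0) > 1e-9:
        raise ValueError("train_size and validation_size must sum to 1.")

    ids = list(ids)  # copy, so the caller's list is not reordered
    if shuffle:
        # A fixed random_state makes the split reproducible.
        random.Random(random_state).shuffle(ids)

    number_train = round(len(ids) * train_size)
    return ids[:number_train], ids[number_train:]


# Hypothetical IDs, only for demonstration.
all_ids = [f"ID_{i}" for i in range(10)]
train_ids, validation_ids = split_ids(all_ids, random_state=0)
```

Splitting by ID rather than by individual window keeps all data of one recording in the same subset.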

### Parameters for 'CustomSleepDataset' class

Custom Dataset class for our Sleep Stage Data. The class is used to load data from a file 
(using 'SleepDataManager' class) and prepare it for training the neural network.

The whole project, including this [class](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html),
was created in analogy to the [PyTorch Tutorial](https://pytorch.org/tutorials/beginner/basics/).
Knowing this tutorial is not necessary for using this project, but it is highly recommended if you are
interested in understanding and editing the code.

See: 'CustomSleepDataset' class in 'neural_network_model.py'

In [None]:
dataset_class_transform_parameters = {
    "transform": ToTensor(), 
    "target_transform": None,
}

### Reshape signals to overlapping windows

Reshape a signal with shape (n <= nn_signal_duration_seconds * target_frequency) to 
(number_windows, window_size), where windows overlap by 'overlap_seconds' and adjust the signal to the neural 
network's requirements.

This will be performed when accessing the data using 'CustomSleepDataset' class.

See: 'reshape_signal_to_overlapping_windows' function in 'dataset_processing.py'

In [None]:
window_reshape_parameters = {
    "nn_signal_duration_seconds": sleep_data_manager_parameters["signal_length_seconds"],
    "number_windows": 1197,
    "window_duration_seconds": 120,
    "overlap_seconds": 90,
    "priority_order": [3, 2, 1, 0],
    "pad_feature_with": 0,
    "pad_target_with": 0
}
sleep_data_manager_parameters["SLP_expected_predicted_frequency"] = 1/window_reshape_parameters["window_duration_seconds"]
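
The relationship between signal duration, window duration, overlap and 'number_windows' can be checked with a short sketch. The helper names below are hypothetical, and the windowing function is a simplified stand-in for 'reshape_signal_to_overlapping_windows': it only pads the end of the signal so that the last window is complete, which may differ from what the project does.

```python
import math


def count_windows(signal_duration_seconds, window_duration_seconds, overlap_seconds):
    """Number of complete windows that fit into the signal (hypothetical helper)."""
    step_seconds = window_duration_seconds - overlap_seconds
    if step_seconds <= 0:
        raise ValueError("overlap_seconds must be smaller than window_duration_seconds.")
    return (signal_duration_seconds - window_duration_seconds) // step_seconds + 1


def split_into_windows(signal, window_size, overlap, pad_with=0):
    """Cut a 1D list into overlapping windows, padding the end if needed (simplified)."""
    step = window_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than window_size.")

    if len(signal) <= window_size:
        number_windows = 1
    else:
        number_windows = math.ceil((len(signal) - window_size) / step) + 1

    # Pad so that the last window has the full window_size.
    needed_length = (number_windows - 1) * step + window_size
    padded = list(signal) + [pad_with] * (needed_length - len(signal))
    return [padded[i * step : i * step + window_size] for i in range(number_windows)]


# Default setup: 36000 s signal, 120 s windows, 90 s overlap.
default_number_windows = count_windows(36000, 120, 90)

# Small demonstration with 9 datapoints, windows of 4 datapoints, overlap of 2.
example_windows = split_into_windows(list(range(9)), window_size=4, overlap=2)
```

With a step of 120 - 90 = 30 seconds, a 36000 second signal yields (36000 - 120) / 30 + 1 = 1197 windows, which matches the default 'number_windows'.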

### Normalization

Normalize the signal into range: (normalization_min, normalization_max) using the unity based normalization 
method.

This will be performed when accessing the data using 'CustomSleepDataset' class.

See: 'unity_based_normalization' function in 'dataset_processing.py'

In [None]:
signal_normalization_parameters = {
    "normalize_rri": False,
    "normalize_mad": False,
    "normalization_max": 1,
    "normalization_min": 0,
    "normalization_mode": "global"
}
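
As a minimal sketch of unity-based (min-max) normalization, assuming a plain list of numbers: each value is shifted and scaled so that the smallest value lands on 'normalization_min' and the largest on 'normalization_max'. The function name `normalize_to_range` and the handling of constant signals are assumptions; the project's 'unity_based_normalization' and its 'normalization_mode' option may behave differently.

```python
def normalize_to_range(signal, normalization_min=0.0, normalization_max=1.0):
    """Rescale values linearly into [normalization_min, normalization_max]."""
    lowest = min(signal)
    highest = max(signal)

    # Assumption: a constant signal is mapped to the lower bound
    # to avoid dividing by zero.
    if highest == lowest:
        return [normalization_min for _ in signal]

    scale = (normalization_max - normalization_min) / (highest - lowest)
    return [normalization_min + (value - lowest) * scale for value in signal]


# Hypothetical RRI values in seconds.
normalized_rri = normalize_to_range([0.5, 1.0, 1.5])
```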

### Neural Network Architecture

Parameters affecting the neural network architecture.

See 'SleepStageModel' or 'YaoModel' class in 'neural_network_model.py'

In [None]:
neural_network_model_parameters = {
    "number_sleep_stages": 4,
    "datapoints_per_rri_window": int(sleep_data_manager_parameters["RRI_frequency"] * window_reshape_parameters["window_duration_seconds"]),
    "datapoints_per_mad_window": int(sleep_data_manager_parameters["MAD_frequency"] * window_reshape_parameters["window_duration_seconds"]),
    "windows_per_signal": window_reshape_parameters["number_windows"],
    "number_window_learning_features": 128,
    "rri_convolutional_channels": [1, 8, 16, 32, 64],
    "mad_convolutional_channels": [1, 8, 16, 32, 64],
    "window_learning_dilations": [2, 4, 8, 16, 32],
}

### Neural Network Training Hyperparameters

Hyperparameters used when training the neural network. 

These are the only ones that will not be saved to the project configuration file, because they might differ 
based on what data is currently used for training.

See 'main_model_training' function in 'main.py'

In [None]:
neural_network_hyperparameters_shhs = {
    "batch_size": 8,
    "number_epochs": 40,
    "lr_scheduler_parameters": {
        "number_updates_to_max_lr": 10,
        "start_learning_rate": 2.5 * 1e-5,
        "max_learning_rate": 1 * 1e-4,
        "end_learning_rate": 5 * 1e-5
    }
}

neural_network_hyperparameters_gif = {
    "batch_size": 8,
    "number_epochs": 100,
    "lr_scheduler_parameters": {
        "number_updates_to_max_lr": 25,
        "start_learning_rate": 2.5 * 1e-5,
        "max_learning_rate": 1 * 1e-4,
        "end_learning_rate": 1 * 1e-5
    }
}

## Setting File Paths

During this project, you will likely create a lot of files. Most of them are assigned an intuitive name in
'main.py'. It is recommended to leave these names as they are, or to change them once before creating your
first project.

Now you only need to set the directory where you want to store your files or from which you want to access the
trained model for making predictions:

In [None]:
processed_shhs_path = "Processed_Data/shhs_data.pkl"
processed_gif_path = "Processed_Data/gif_data.pkl"

# Create directory to store configurations and results
model_directory_path = "Neural_Network/"
create_directories_along_path(model_directory_path)

## Creating Project Configuration

We will now create a dictionary that holds all parameters introduced in the previous section 
['Default Parameters'](#default-parameters).

The current default parameters correspond to the idea of: overlapping windows, artifact = wake stage

In [None]:
project_configuration = dict()
project_configuration.update(sleep_data_manager_parameters)
project_configuration.update(window_reshape_parameters)
project_configuration.update(signal_normalization_parameters)
project_configuration.update(split_data_parameters)
project_configuration.update(dataset_class_transform_parameters)
project_configuration.update(neural_network_model_parameters)

### Adjusting Parameters

If you aim to test one of the additional ideas below, just run the corresponding cell.
DO NOT RUN BOTH CELLS!

To prevent you from accidentally running one of the cells, each of them starts with a line that raises an
error. Remove this line (or comment it out) in the cell you want to run.

This would also be the ideal place to create your own paragraph with your desired adjustments!

Additional Idea: non-overlapping windows, artifact = wake stage:

In [None]:
raise ValueError("This error was intentionally placed here to prevent the user from running this cell" +
                    " accidentally. Comment out this line only if you know what you are doing.")

project_configuration["overlap_seconds"] = 0
project_configuration["number_windows"] = 300
project_configuration["windows_per_signal"] = 300

Additional Idea: overlapping windows, artifact is a unique stage:

In [None]:
raise ValueError("This error was intentionally placed here to prevent the user from running this cell" +
                    " accidentally. Comment out this line only if you know what you are doing.")

project_configuration["sleep_stage_label"] = {"wake": 1, "LS": 2, "DS": 3, "REM": 4, "artifect": 0}
project_configuration["priority_order"] = [4, 3, 2, 1, 0]
project_configuration["number_sleep_stages"] = 5

Play around:

In [None]:
raise ValueError("This error was intentionally placed here to prevent the user from running this cell" +
                    " accidentally. Comment out this line only if you know what you are doing.")

project_configuration["..."] = "..."

### Checking and Saving Project Configuration

In [None]:
check_project_configuration(project_configuration)

if os.path.isfile(model_directory_path + project_configuration_file):
    os.remove(model_directory_path + project_configuration_file)
save_to_pickle(project_configuration, model_directory_path + project_configuration_file)

del project_configuration

# Preprocess Training Data

## Expanding on 'SleepDataManager' class

As mentioned above: During the saving process, the SleepDataManager makes sure that the data is uniform and
might perform the following actions:
- Scale number of datapoints in signal if sampling frequency does not match
- Alter sleep stage labels if they do not refer to the same context
- Remove outliers from RRI and/or MAD data
- Split datapoint into multiple ones, if signal duration is too long to be processable by the neural network

To do all of this, we need to provide more information than just the signal itself:

In [None]:
# Saveable Datapoint:
"""
{
    "ID": str,                  # always required
    "RRI": np.ndarray,
    "MAD": np.ndarray,
    "SLP": np.ndarray,
    "RRI_frequency": int,       # required if RRI signal is provided
    "MAD_frequency": int,       # required if MAD signal is provided
    "SLP_frequency": float,     # required if SLP signal is provided (e.g. 1/30)
    "sleep_stage_label": dict   # required if SLP signal is provided
}
"""

Most of the keys are self-explanatory, except for the last one ('sleep_stage_label'):

Different sources might use different numbers to label the sleep stages. The last key is used to ensure
that these labels are mapped to a uniform scheme. Here is an example of the problem, where 'SSM stage'
(SleepStageModel stage) refers to the labeling used for training the neural network:

|number|SHHS stage|GIF stage| SSM stage         |
|------|----------|---------|-------------------|
|  0   | wake     | wake    | wake & artifact   |
|  1   | N1       | N1      | LS                |
|  2   | N2       | N2      | DS                |
|  3   | N3       | N3      | REM               |
|  5   | REM      | REM     |                   |
| other| artifact | artifact|                   |

As you can see, the sources use different sleep stages and assign them different numbers. Taking the SHHS
stages as an example: we want to map wake (0) and N1 (1) to wake (0), N2 (2) to LS (1), N3 (3) to DS (2),
REM (5) to REM (3), and artifact (other) to artifact (0).

So we have to state two things: which numbers in the data you want to save correspond to which desired sleep
stage ('sleep_stage_label' key above), and which number each desired sleep stage is assigned
('sleep_stage_label' key in SleepDataManager's 'file_info' variable).

In [3]:
# data to save
shhs_sleep_stage_label = {"wake": [0, 1], "LS": [2], "DS": [3], "REM": [5], "artifect": ["other"]}
gif_sleep_stage_label = {"wake": [0, 1], "LS": [2], "DS": [3], "REM": [5], "artifect": ["other"]}

# SleepDataManager's file_info:
# file_info["sleep_stage_label"] = {"wake": 0, "LS": 1, "DS": 2, "REM": 3, "artifect": 0}
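
How the two dictionaries combine can be sketched in plain Python. This is not the 'SleepDataManager' implementation; the function names are hypothetical, and the sketch assumes that "other" means "any number not listed for another stage".

```python
def build_label_lookup(source_labels, target_labels):
    """Map each source number to the target number of the same stage (hypothetical helper)."""
    lookup = {}
    fallback = None
    for stage, source_numbers in source_labels.items():
        for number in source_numbers:
            if number == "other":
                # Assumption: "other" catches every number not listed explicitly.
                fallback = target_labels[stage]
            else:
                lookup[number] = target_labels[stage]
    return lookup, fallback


def relabel_sleep_stages(sleep_stages, source_labels, target_labels):
    """Translate a sequence of source sleep stage numbers into target numbers."""
    lookup, fallback = build_label_lookup(source_labels, target_labels)
    return [lookup.get(stage, fallback) for stage in sleep_stages]


# Dictionaries as defined in this notebook (the key spelling "artifect" is kept as-is).
source_sleep_stage_label = {"wake": [0, 1], "LS": [2], "DS": [3], "REM": [5], "artifect": ["other"]}
target_sleep_stage_label = {"wake": 0, "LS": 1, "DS": 2, "REM": 3, "artifect": 0}

relabeled = relabel_sleep_stages([0, 1, 2, 3, 5, 6], source_sleep_stage_label, target_sleep_stage_label)
```

Here the source numbers 0 and 1 both become wake (0), 5 becomes REM (3), and the unlisted number 6 falls back to the artifact label (0).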

## Preprocessing SHHS and GIF

Because preprocessing is very specific to the individual dataset, the following functions might not work for
your data. Nonetheless, they are well documented (in 'main.py') and demonstrate how to use the
'SleepDataManager' class to preprocess your data. Therefore, it should be easy to adapt a preprocessing
function to your dataset.

In [None]:
# paths to the data
original_shhs_data_path = "Raw_Data/SHHS_dataset.h5"
original_gif_data_path = "Raw_Data/GIF_dataset.h5"

In [None]:
Process_SHHS_Dataset(
    path_to_shhs_dataset = original_shhs_data_path,
    path_to_save_processed_data = processed_shhs_path,
    path_to_project_configuration = model_directory_path + project_configuration_file,
    )

In [None]:
Process_GIF_Dataset(
    path_to_gif_dataset = original_gif_data_path,
    path_to_save_processed_data = processed_gif_path,
    path_to_project_configuration = model_directory_path + project_configuration_file
    )

# Training Neural Network

The following function is designed to have the same structure as taught in the
[PyTorch Tutorials](https://pytorch.org/tutorials/beginner/basics/).
The only major difference is that the learning rate is not a fixed value but depends on the epoch, as
controlled by the 'CosineScheduler' class.

Again, the function is well documented. See 'main_model_training' in 'main.py'.
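
To give an impression of an epoch-dependent learning rate, here is a sketch of one common schedule: a linear warm-up from the start to the maximum learning rate, followed by a cosine decay towards the end learning rate. This is an assumption about how the 'lr_scheduler_parameters' might be interpreted, not a description of the project's 'CosineScheduler'; the function name `learning_rate_at` is hypothetical.

```python
import math


def learning_rate_at(update, number_updates_total, number_updates_to_max_lr,
                     start_learning_rate, max_learning_rate, end_learning_rate):
    """Learning rate for a given update: linear warm-up, then cosine decay (assumed shape)."""
    if update < number_updates_to_max_lr:
        # Warm-up: move linearly from the start to the maximum learning rate.
        fraction = update / number_updates_to_max_lr
        return start_learning_rate + fraction * (max_learning_rate - start_learning_rate)

    # Decay: follow half a cosine period from the maximum to the end learning rate.
    decay_updates = max(number_updates_total - number_updates_to_max_lr, 1)
    progress = min((update - number_updates_to_max_lr) / decay_updates, 1.0)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))
    return end_learning_rate + cosine * (max_learning_rate - end_learning_rate)


# Values from 'neural_network_hyperparameters_shhs' above (40 epochs, one update per epoch assumed).
schedule = [
    learning_rate_at(epoch, 40, 10, 2.5e-5, 1e-4, 5e-5)
    for epoch in range(40)
]
```

Under these assumptions the learning rate starts at 2.5e-5, peaks at 1e-4 in epoch 10 and then decreases smoothly towards 5e-5.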

First, the model will be trained on the SHHS dataset for a certain number of epochs. During training, the 
accuracy and loss are saved in a pickle file for every epoch. The final model state dictionary is saved in a 
.pth file.

Afterwards, the model will be further trained on the GIF dataset, again saving the course of accuracy and loss
to a pickle file. The updated model state will be saved to another .pth file.

All files will be saved to the directory set above ([Setting File Paths](#setting-file-paths)).

In [None]:
# training model on SHHS dataset
main_model_training(
    neural_network_model = SleepStageModel,
    neural_network_hyperparameters = neural_network_hyperparameters_shhs,
    path_to_processed_data = processed_shhs_path,
    path_to_project_configuration = model_directory_path + project_configuration_file,
    path_to_model_state = None,
    path_to_updated_model_state = model_directory_path + model_state_after_shhs_file,
    path_to_loss_per_epoch = model_directory_path + loss_per_epoch_shhs_file,
    )

In [None]:
# training model on GIF dataset
main_model_training(
    neural_network_model = SleepStageModel,
    neural_network_hyperparameters = neural_network_hyperparameters_gif,
    path_to_processed_data = processed_gif_path,
    path_to_project_configuration = model_directory_path + project_configuration_file,
    path_to_model_state = model_directory_path + model_state_after_shhs_file,
    path_to_updated_model_state = model_directory_path + model_state_after_shhs_gif_file,
    path_to_loss_per_epoch = model_directory_path + loss_per_epoch_gif_file,
    )

# Validating Model Performance

# Stuff

In [25]:
Process_NAKO_Dataset(
    path_to_nako_dataset = "/Volumes/NaKo-UniHalle/RRI_and_MAD/NAKO-33a.pkl",
    path_to_save_processed_data = "Processed_NAKO/NAKO-33a.pkl",
    path_to_project_configuration = "SSM_no_overlap/Project_Configuration.pkl"
)

All IDs are unique.

Preproccessing datapoints from NAKO dataset (ensuring uniformity):
   ✅: 100.0% [██████████████████████] 7365 / 7365 | 7m 51s / 7m 51s (0.1s/it) | 


In [7]:
main_model_predicting(
    neural_network_model = SleepStageModel,
    path_to_model_state = "SSM_no_overlap/Model_State.pth",
    path_to_processed_data = "Processed_NAKO/NAKO-33a.pkl",
    path_to_project_configuration = "SSM_no_overlap/Project_Configuration.pkl",
)


Using cpu device

Predicting Sleep Stages:
   ⏳: 3.2% [█░░░░░░░░░░░░░░░░░░░] 484 / 15333 | 1m 25s / 44m 52s (0.2s/it) |

KeyboardInterrupt: 

In [5]:
from side_functions import *

p = DynamicProgressBar(total=1)

Initializing progress bar...True


SystemExit: 

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)


In [9]:
processed_gif_path = "Processed_Data_2/gif_data.pkl"
model_directory_path = "SSM_no_overlap/"

Process_GIF_Dataset(
    path_to_gif_dataset = original_gif_data_path,
    path_to_save_processed_data = processed_gif_path,
    path_to_project_configuration = model_directory_path + project_configuration_file
    )

All IDs are unique.

Preproccessing datapoints from GIF dataset (ensuring uniformity):
   ✅: 100.0% [█████████████████████████████] 293 / 293 | 3s / 3s (0.0s/it) |
Distributing 80.0% / 20.0% of datapoints into training / validation pids, respectively:
   ✅: 100.0% [█████████████████████████] 731 / 731 | 0.6s / 0.6s (0.0s/it) |[0, 1, 2, 3, 5, 6]
