**Author:** *Johannes Peter Knoll*

# Introduction

In this notebook you will:
- Preprocess raw data
- Train a neural network model

In [15]:
# The autoreload extension allows you to tweak the code in the imported modules
# and rerun cells to reflect the changes.
%load_ext autoreload
%autoreload 2

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [2]:
# LOCAL IMPORTS
from dataset_processing import *

# IMPORTS
import copy # used below for copy.deepcopy
import h5py # type: ignore

# Preprocess Training Data

## Processing Dataset for Neural Network

The class SleepDataManager handles the data we want to pass to the network. It makes the data accessible in
a memory-efficient way, but to do so it has to save the data again, this time into a pickle file. You can
delete the .h5 file afterwards if you want to.

We have multiple sources (besides the SHHS Dataset) with data that we can train the network on.
We need to make sure that the data is uniform in sampling frequency, signal length, etc., which is why we
check and transform each datapoint before saving it with the SleepDataManager class.

During the saving process, the SleepDataManager makes sure that the data is uniform and may
perform the following actions:
- Scale the number of datapoints in a signal if its sampling frequency does not match
- Alter sleep stage labels if they do not refer to the same context
- Split a datapoint into several if its signal is longer than the neural network requires
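
The first and third action can be sketched in a few lines. This is only an illustration under assumptions: the helper names `resample_linear` and `split_signal` are hypothetical, linear interpolation is just one possible way to rescale a signal, and the SleepDataManager may implement either step differently.

```python
def resample_linear(signal, source_frequency, target_frequency):
    """Rescale `signal` to `target_frequency` using linear interpolation (assumed method)."""
    if len(signal) < 2 or source_frequency == target_frequency:
        return list(signal)
    duration = len(signal) / source_frequency              # signal duration in seconds
    target_length = round(duration * target_frequency)     # number of samples after rescaling
    resampled = []
    for i in range(target_length):
        position = i * source_frequency / target_frequency # fractional index in the source signal
        left = min(int(position), len(signal) - 1)
        right = min(left + 1, len(signal) - 1)
        fraction = position - left
        resampled.append(signal[left] * (1 - fraction) + signal[right] * fraction)
    return resampled


def split_signal(signal, frequency, max_seconds):
    """Cut `signal` into consecutive parts of at most `max_seconds` seconds each."""
    samples_per_part = int(max_seconds * frequency)
    return [signal[i:i + samples_per_part] for i in range(0, len(signal), samples_per_part)]


# hypothetical toy values: a 2 Hz signal rescaled to 4 Hz, then split into 1-second parts
upsampled = resample_linear([0, 2, 4, 6], source_frequency=2, target_frequency=4)
parts = split_signal(upsampled, frequency=4, max_seconds=1)
print(upsampled)
print(parts)
```

Note that the last part returned by `split_signal` can be shorter than the others; how such remainders are treated in the real pipeline is not covered by this sketch.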

To do all of this, we need to provide more information than the signal itself:

In [None]:
# Saveable Datapoint:
"""
{
    "ID": str,                  # always required
    "RRI": np.ndarray,
    "MAD": np.ndarray,
    "SLP": np.ndarray,
    "RRI_frequency": int,       # required if RRI signal is provided
    "MAD_frequency": int,       # required if MAD signal is provided
    "SLP_frequency": int,       # required if SLP signal is provided
    "sleep_stage_label": list   # required if SLP signal is provided
} 
"""

Most of the keys are self-explanatory, except for the last one:

Our network (SSM in the following) uses different sleep stage labels than the source datasets:

|number|SHHS stage|GIF stage|SSM stage|
|------|----------|---------|---------|
|  0   | wake     |         | wake    |
|  1   | N1       |         | LS      |
|  2   | N2       |         | DS      |
|  3   | N3       |         | REM     |
|  5   | REM      |         |         |
| other| artifact |         |         |
| -1   |          |         | artifact|

As you can see, N1 needs to be classified as wake, N2 as LS (light sleep), and N3 as DS (deep sleep).
To do this, we effectively need to change the numbers as follows: \
0 -> 0, 1 -> 0, 2 -> 1, 3 -> 2, 5 -> 3, other -> -1

To let the algorithm do this, we only need to state which source labels correspond to which SSM stage
under the "sleep_stage_label" key, as follows:

In [3]:
shhs_sleep_stage_label = {"wake": [0, 1], "LS": [2], "DS": [3], "REM": [5], "artifect": ["other"]}
gif_sleep_stage_label = {"wake": [0, 1], "LS": [2], "DS": [3], "REM": [5], "artifect": ["other"]}
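
How such a dictionary could be turned into the number mapping listed above is sketched below. The function `remap_labels` and the constant `SSM_STAGE_NUMBER` are hypothetical names introduced for illustration; the target numbers are taken from the table, and the key spelling "artifect" is kept exactly as in the dictionaries above. The actual conversion inside SleepDataManager may differ.

```python
# Target numbers of the SSM stages (from the table above); names are assumptions.
SSM_STAGE_NUMBER = {"wake": 0, "LS": 1, "DS": 2, "REM": 3, "artifect": -1}


def remap_labels(labels, sleep_stage_label):
    """Translate source labels into SSM numbers; unlisted labels become artifacts."""
    lookup = {}
    for stage, source_values in sleep_stage_label.items():
        for value in source_values:
            if value != "other":  # "other" is covered by the default below
                lookup[value] = SSM_STAGE_NUMBER[stage]
    return [lookup.get(label, SSM_STAGE_NUMBER["artifect"]) for label in labels]


example_label = {"wake": [0, 1], "LS": [2], "DS": [3], "REM": [5], "artifect": ["other"]}
print(remap_labels([0, 1, 2, 3, 5, 9], example_label))  # prints [0, 0, 1, 2, 3, -1]
```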

## SHHS Dataset

The [Sleep Heart Health Study (SHHS)](https://sleepdata.org/datasets/shhs) is a multi-center cohort study implemented by the National Heart Lung & Blood Institute to determine the cardiovascular and other consequences of sleep-disordered breathing.

### Download Dataset

In [None]:
# !wget "https://onedrive.live.com/download?cid=45D5A10F94E33861&resid=45D5A10F94E33861%21248707&authkey=AKRa5kb3XFj4G-o" -O shhs_dataset.h5

### Accessing Dataset

In [4]:
#path_to_shhs_dataset = "../Training_Data/SHHS_dataset.h5"
path_to_shhs_dataset = "Raw_Data/SHHS_dataset.h5"

shhs_dataset = h5py.File(path_to_shhs_dataset, 'r')

### Transferring Data from .h5 into .pkl file using SleepDataManager

It is wise to check that all IDs are unique before saving. We can then skip checking every ID
in the database each time a datapoint is saved, which speeds up the saving process greatly.
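
The idea behind a one-time check can be illustrated as follows. The helper `find_duplicate_ids` is a hypothetical stand-in and not the method of SleepDataManager; it only shows that counting all IDs once is enough to detect duplicates up front.

```python
from collections import Counter


def find_duplicate_ids(ids):
    """Return every ID that occurs more than once, in sorted order."""
    counts = Counter(ids)  # one pass over all IDs
    return sorted(identifier for identifier, count in counts.items() if count > 1)


# hypothetical example list containing one repeated ID
example_ids = ["SL003", "SL005", "SL006", "SL005"]
duplicates = find_duplicate_ids(example_ids)
if duplicates:
    print("Duplicate IDs found:", duplicates)
else:
    print("All IDs are unique.")
```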

In [5]:
# initializing the database
path_to_save_processed_shhs_data = "Processed_Data/shhs_data.pkl"
shhs_data_manager = SleepDataManager(file_path = path_to_save_processed_shhs_data)

# accessing patient ids:
patients = list(shhs_dataset['slp'].keys()) # type: ignore

# check if patient ids are unique:
shhs_data_manager.check_if_ids_are_unique(patients)

All IDs are unique.


If all IDs are unique, you can continue:

In [6]:
# saving all data from SHHS dataset to the shhs_data.pkl file
for patient_id in patients:
    new_datapoint = {
        "ID": patient_id,
        "RRI": shhs_dataset["rri"][patient_id][:], # type: ignore
        "SLP": shhs_dataset["slp"][patient_id][:], # type: ignore
        "RRI_frequency": shhs_dataset["rri"].attrs["freq"], # type: ignore
        "SLP_frequency": shhs_dataset["slp"].attrs["freq"], # type: ignore
        "sleep_stage_label": copy.deepcopy(shhs_sleep_stage_label)
    }

    shhs_data_manager.save(new_datapoint, unique_id=True)

## GIF Dataset

Analogous to the SHHS Dataset, we will save the data to our SleepDataManager and transform it into windows.

In [None]:
# relevant_keys = ["file_name", "RRI", "RRI_frequency", "MAD", "MAD_frequency", "SLP"]
# results_generator = load_from_pickle("Processed_GIF/GIF_Results.pkl")

### Accessing Dataset

In [7]:
path_to_gif_dataset = "Raw_Data/GIF_dataset.h5"

gif_dataset = h5py.File(path_to_gif_dataset, 'r')

### Transferring Data from .h5 into .pkl file using SleepDataManager

In [8]:
# initializing the database
path_to_save_processed_gif_data = "Processed_Data/gif_data.pkl"
gif_data_manager = SleepDataManager(file_path = path_to_save_processed_gif_data)

# accessing patient ids:
patients = list(gif_dataset['stage'].keys()) # type: ignore

# check if patient ids are unique:
gif_data_manager.check_if_ids_are_unique(patients)

All IDs are unique.


In [9]:
print(len(patients), patients)
print(gif_dataset["stage"].attrs["freq"])
print(gif_dataset["rri"].attrs["freq"])
print(gif_dataset["mad"].attrs["freq"])
sleep = gif_dataset["stage"]["SL003"][:]
rri = gif_dataset["rri"]["SL003"][:]
mad = gif_dataset["mad"]["SL003"][:]
print(len(sleep), sleep[400:500])
print(len(rri))
print(len(mad))

293 ['SL003', 'SL005', 'SL006', 'SL008', 'SL009', 'SL010', 'SL012', 'SL013', 'SL014', 'SL015', 'SL017', 'SL018', 'SL019', 'SL020', 'SL021', 'SL022', 'SL024', 'SL026', 'SL028', 'SL029', 'SL030', 'SL031', 'SL033', 'SL035', 'SL036', 'SL038', 'SL039', 'SL041', 'SL042', 'SL043', 'SL044', 'SL045', 'SL046', 'SL047', 'SL048', 'SL049', 'SL050', 'SL051', 'SL052', 'SL053', 'SL054', 'SL056', 'SL058', 'SL059', 'SL060', 'SL062', 'SL063', 'SL064', 'SL065', 'SL067', 'SL068', 'SL069', 'SL070', 'SL071', 'SL072', 'SL074', 'SL077', 'SL078', 'SL080', 'SL081', 'SL082', 'SL084', 'SL086', 'SL092', 'SL093', 'SL094', 'SL095', 'SL097', 'SL099', 'SL102', 'SL103', 'SL104', 'SL106', 'SL107', 'SL108', 'SL109', 'SL110', 'SL112', 'SL113', 'SL115', 'SL117', 'SL118', 'SL119', 'SL120', 'SL121', 'SL122', 'SL123', 'SL124', 'SL125', 'SL127', 'SL128', 'SL129', 'SL130', 'SL131', 'SL134', 'SL135', 'SL136', 'SL137', 'SL139', 'SL140', 'SL142', 'SL143', 'SL144', 'SL146', 'SL147', 'SL148', 'SL149', 'SL150', 'SL152', 'SL153', 'SL15

If all IDs are unique, you can continue:

In [10]:
# saving all data from GIF dataset to the gif_data.pkl file
for patient_id in patients:
    new_datapoint = {
        "ID": patient_id,
        "RRI": gif_dataset["rri"][patient_id][:], # type: ignore
        "MAD": gif_dataset["mad"][patient_id][:], # type: ignore
        "SLP": gif_dataset["stage"][patient_id][:], # type: ignore
        "RRI_frequency": gif_dataset["rri"].attrs["freq"], # type: ignore
        "MAD_frequency": gif_dataset["mad"].attrs["freq"], # type: ignore
        "SLP_frequency": 1/30, # type: ignore
        "sleep_stage_label": copy.deepcopy(gif_sleep_stage_label)
    }

    gif_data_manager.save(new_datapoint, unique_id=True)

## Create Training, Validation, and Test Datasets

For easier application we will split our database into main, training, validation, and test files:

In [7]:
shhs_data_manager.separate_train_test_validation(
    train_size = 0.8, 
    validation_size = 0.1, 
    test_size = 0.1, 
    random_state = None, 
    shuffle = True
)

In [None]:
gif_data_manager.separate_train_test_validation(
    train_size = 0.8, 
    validation_size = 0.1, 
    test_size = 0.1, 
    random_state = None, 
    shuffle = True
)

We should now have three additional files in the directory where our processed data is saved.

Each can be accessed separately with another instance of the class SleepDataManager. Note that their functionality
is limited, as they are only meant to return (load) data.

The data in these files can be reshuffled by calling the above code cell again.
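
The following sketch shows how a shuffled 80/10/10 split of patient IDs could be produced. The function `split_ids` is hypothetical and only mirrors the parameter names used above; what `separate_train_test_validation` does internally is not shown in this notebook. With `random_state = None`, each call yields a different shuffle, which matches the reshuffling behavior described above.

```python
import random


def split_ids(ids, train_size, validation_size, test_size, random_state=None, shuffle=True):
    """Split `ids` into training, validation, and test lists by the given fractions."""
    if abs(train_size + validation_size + test_size - 1.0) > 1e-9:
        raise ValueError("train_size, validation_size and test_size must sum to 1")
    ids = list(ids)  # work on a copy so the input stays untouched
    if shuffle:
        random.Random(random_state).shuffle(ids)  # random_state=None gives a new order per call
    number_train = round(len(ids) * train_size)
    number_validation = round(len(ids) * validation_size)
    train = ids[:number_train]
    validation = ids[number_train:number_train + number_validation]
    test = ids[number_train + number_validation:]  # remainder goes to the test set
    return train, validation, test


# hypothetical IDs for demonstration
example_ids = [f"ID{i:03d}" for i in range(10)]
train, validation, test = split_ids(example_ids, 0.8, 0.1, 0.1, random_state=0)
print(len(train), len(validation), len(test))  # prints 8 1 1
```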

# Neural Network

Now that we have uniform data, split into training, validation, and test data, we can
pass it to the neural network.

The following code is designed to have the same structure as taught in the [PyTorch Tutorials](https://pytorch.org/tutorials/beginner/basics/).

In [5]:
# LOCAL IMPORTS
from neural_network_model import *

## Accessing Datasets

We now want to access our training, validation, and test data using a custom dataset class.

This class accesses the data through the SleepDataManager class and transforms the signals
into overlapping windows before returning them.
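
As a rough illustration of what "overlapping windows" means, consider the sketch below. The function name `split_into_windows` and the toy window length and overlap are assumptions chosen for readability; the values and the exact procedure used by CustomSleepDataset are not shown here. In this sketch, samples at the end that do not fill a complete window are dropped.

```python
def split_into_windows(signal, window_length, overlap):
    """Return consecutive windows of `window_length` samples that share `overlap` samples."""
    if not 0 <= overlap < window_length:
        raise ValueError("overlap must be non-negative and smaller than window_length")
    step = window_length - overlap  # distance between the starts of two windows
    windows = []
    for start in range(0, len(signal) - window_length + 1, step):
        windows.append(signal[start:start + window_length])
    return windows


# toy signal of 10 samples, windows of 4 samples overlapping by 2
example_windows = split_into_windows(list(range(10)), window_length=4, overlap=2)
print(example_windows)  # prints [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```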

In [6]:
# repeating path for quicker access to this section:
path_to_save_processed_shhs_data = "Processed_Data/shhs_data.pkl"

shhs_training_data_path = path_to_save_processed_shhs_data[:-4] + "_training_pid.pkl"
shhs_validation_data_path = path_to_save_processed_shhs_data[:-4] + "_validation_pid.pkl"
shhs_test_data_path = path_to_save_processed_shhs_data[:-4] + "_test_pid.pkl"

apply_transformation = ToTensor()

shhs_training_data = CustomSleepDataset(path_to_data = shhs_training_data_path, transform = apply_transformation)
shhs_validation_data = CustomSleepDataset(path_to_data = shhs_validation_data_path, transform = apply_transformation)
shhs_test_data = CustomSleepDataset(path_to_data = shhs_test_data_path, transform = apply_transformation)

In [None]:
# repeating path for quicker access to this section:
path_to_save_processed_gif_data = "Processed_Data/gif_data.pkl"

gif_training_data_path = path_to_save_processed_gif_data[:-4] + "_training_pid.pkl"
gif_validation_data_path = path_to_save_processed_gif_data[:-4] + "_validation_pid.pkl"
gif_test_data_path = path_to_save_processed_gif_data[:-4] + "_test_pid.pkl"

gif_training_data = CustomSleepDataset(path_to_data = gif_training_data_path, transform = apply_transformation)
gif_validation_data = CustomSleepDataset(path_to_data = gif_validation_data_path, transform = apply_transformation)
gif_test_data = CustomSleepDataset(path_to_data = gif_test_data_path, transform = apply_transformation)

## Hyperparameters

The hyperparameters "batch_size" and "number_epochs" are self-explanatory.

For the learning rate we use a scheduler that quickly increases the learning rate during the first few
epochs and afterwards decreases it following a cosine function. We do this because small values yield slow
learning, while large values may result in unpredictable behavior during training
([Source](https://pytorch.org/tutorials/beginner/basics/optimization_tutorial.html)).

In [7]:
batch_size = 8
number_epochs = 40

learning_rate_scheduler = CosineScheduler(
    number_updates_total = number_epochs,
    number_updates_to_max_lr = 10,
    start_learning_rate = 2.5 * 1e-5,
    max_learning_rate = 1 * 1e-4,
    end_learning_rate = 5 * 1e-5
)
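
A schedule of this shape (quick increase, then cosine decrease) could look roughly like the sketch below. The function `cosine_learning_rate` is a hypothetical stand-in and not the CosineScheduler class from neural_network_model; in particular, the linear warm-up is an assumption. Only the parameter names and values are taken from the cell above.

```python
import math


def cosine_learning_rate(update, number_updates_total, number_updates_to_max_lr,
                         start_learning_rate, max_learning_rate, end_learning_rate):
    """Learning rate for a given update: linear warm-up (assumed), then cosine decay."""
    if update < number_updates_to_max_lr:
        # warm-up: move linearly from the start value to the maximum
        progress = update / number_updates_to_max_lr
        return start_learning_rate + (max_learning_rate - start_learning_rate) * progress
    # decay: follow half a cosine period from the maximum down to the end value
    decay_updates = max(number_updates_total - number_updates_to_max_lr, 1)
    progress = min((update - number_updates_to_max_lr) / decay_updates, 1.0)
    return end_learning_rate + 0.5 * (max_learning_rate - end_learning_rate) * (1 + math.cos(math.pi * progress))


settings = dict(
    number_updates_total=40,
    number_updates_to_max_lr=10,
    start_learning_rate=2.5e-5,
    max_learning_rate=1e-4,
    end_learning_rate=5e-5,
)
for epoch in (0, 5, 10, 25, 40):
    print(epoch, cosine_learning_rate(epoch, **settings))
```

Under these assumptions the rate starts at 2.5e-5, reaches its maximum of 1e-4 at update 10, and approaches 5e-5 at update 40.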

## Preparing data for training with DataLoaders

In [8]:
shhs_train_dataloader = DataLoader(shhs_training_data, batch_size = batch_size, shuffle=True)
shhs_validation_dataloader = DataLoader(shhs_validation_data, batch_size = batch_size, shuffle=True)
shhs_test_dataloader = DataLoader(shhs_test_data, batch_size = batch_size, shuffle=True)

In [None]:
gif_train_dataloader = DataLoader(gif_training_data, batch_size = batch_size, shuffle=True)
gif_validation_dataloader = DataLoader(gif_validation_data, batch_size = batch_size, shuffle=True)
gif_test_dataloader = DataLoader(gif_test_data, batch_size = batch_size, shuffle=True)

## Setting Device

In [9]:
# Get cpu, gpu or mps device for training.
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)
print(f"\nUsing {device} device")


Using mps device


## Initialize Neural Network Model

In [10]:
nn_model = SleepStageModel()
nn_model.to(device)

SleepStageModel(
  (rri_signal_learning): Sequential(
    (0): Conv1d(1, 2, kernel_size=(3,), stride=(1,), padding=same)
    (1): ReLU()
    (2): MaxPool1d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): BatchNorm1d(2, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (4): Conv1d(2, 4, kernel_size=(3,), stride=(1,), padding=same)
    (5): ReLU()
    (6): MaxPool1d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (7): BatchNorm1d(4, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (8): Conv1d(4, 8, kernel_size=(3,), stride=(1,), padding=same)
    (9): ReLU()
    (10): MaxPool1d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (11): BatchNorm1d(8, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
  (mad_signal_learning): Sequential(
    (0): Conv1d(1, 2, kernel_size=(3,), stride=(1,), padding=same)
    (1): ReLU()
    (2): MaxPool1d(kernel_size=2, stride=2, padding=0, 

## Loss and Optimizer Function

In [11]:
loss_function = nn.CrossEntropyLoss()
optimizer_function = optim.Adam

## Training Neural Network Model

In [18]:
for t in range(number_epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(
        dataloader = shhs_train_dataloader,
        model = nn_model,
        device = device,
        loss_fn = loss_function,
        optimizer_fn = optimizer_function,
        lr_scheduler = learning_rate_scheduler,
        current_epoch = t,
        batch_size = batch_size,
    )
    test_loop(
        dataloader = shhs_validation_dataloader,
        model = nn_model,
        device = device,
        loss_fn = loss_function,
    )

Epoch 1
-------------------------------
torch.Size([8, 1, 1197, 480]) torch.Size([8, 1197]) None


RuntimeError: expected scalar type Double but found Float

In [23]:
# params_to_update = signal_normalization_parameters
params_to_update = {
    "RRI_inlier_interval": [0.3, 2],
    "MAD_inlier_interval": [None, None],
}

paths = [
    "SSM_Artifect/Project_Configuration.pkl",
    "SSM_no_overlap/Project_Configuration.pkl",
    "SSM_Original/Project_Configuration.pkl",
    "Yao_Artifect/Project_Configuration.pkl",
    "Yao_no_overlap/Project_Configuration.pkl",
    "Yao_Original/Project_Configuration.pkl",
]
for path in paths:
    with open(path, "rb") as f:
        project_configuration = pickle.load(f)
    
    project_configuration.update(params_to_update)

    os.remove(path)
    save_to_pickle(project_configuration, path)

raise SystemExit

SystemExit: 

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)


In [14]:
import numpy as np
a = [[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3]]
a = np.array(a)
print(a)
a = a.T
print(a)

[[1 2 3]
 [1 2 3]
 [1 2 3]
 [1 2 3]]
[[1 1 1 1]
 [2 2 2 2]
 [3 3 3 3]]


In [19]:
a = [[1], [3], [5]]
a = np.array(a)
print(a.T[0])

[1 3 5]


In [19]:
from main import *

hey


In [25]:
Process_NAKO_Dataset(
    path_to_nako_dataset = "/Volumes/NaKo-UniHalle/RRI_and_MAD/NAKO-33a.pkl",
    path_to_save_processed_data = "Processed_NAKO/NAKO-33a.pkl",
    path_to_project_configuration = "SSM_no_overlap/Project_Configuration.pkl"
)

All IDs are unique.

Preproccessing datapoints from NAKO dataset (ensuring uniformity):
   ✅: 100.0% [██████████████████████] 7365 / 7365 | 7m 51s / 7m 51s (0.1s/it) | 


In [None]:
main_model_predicting(
    neural_network_model = SleepStageModel,
    path_to_model_state = "SSM_no_overlap/Model_State.pth",
    path_to_processed_data = "/Volumes/NaKo-UniHalle/RRI_and_MAD/NAKO-33a.pkl",
    path_to_project_configuration = "SSM_no_overlap/Project_Configuration.pkl",
    path_to_save_results = "Neural_Network/Model_Accuracy.pkl",
)