**Author:** Johannes Peter Knoll

# Introduction

In this notebook you will learn about and test everything that was implemented to preprocess the data
for the neural network.

Note:   This notebook is meant for those who want to make sure everything works correctly. It is very thorough
        and therefore unnecessary if you only want a quick start with the predictions. In that case, head
        to 'Classification_Demo.ipynb'.


# Thorough Demonstration of 'dataset_processing.py'

In [1]:
# The autoreload extension allows you to tweak the code in the imported modules
# and rerun cells to reflect the changes.
%load_ext autoreload
%autoreload 2

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

## Managing Data

In this section we demonstrate the implemented class that helps you to manage the data you want to pass to the
neural network model.

Using this class is not required. You could also apply the implemented functions to your data directly;
they are explained in the next section.

I would still highly recommend using this class, as it handles large data in a memory-saving way
and makes it very easy to check and process your data so that it can be passed to the model.

In [16]:
from dataset_processing import *

import random
import copy
import os

### Basics

The Database itself is just a .pkl file that contains multiple dictionaries. The first dictionary is always
the file information, while the following will be the Database's datapoints. Each datapoint needs a unique ID
(key: "ID") and can contain the following signals: 
- RRI (key: "RRI")
- MAD (key: "MAD")
- Sleep-Labels (key: "SLP")
- predicted Sleep-Labels (key: "SLP_predicted")
- predicted individual probabilities for every sleep stage (key: "SLP_predicted_probability")

The file information dictionary holds parameters that apply to every datapoint. These can be parameters that
affect how the data is processed or uniform information like the sampling frequencies for each signal.
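
As a rough illustration of this layout, the following sketch writes a file information dictionary followed by two datapoints into one .pkl file and reads them back one at a time. It assumes the dictionaries are simply pickled one after another into the same file; the helper names `append_to_pickle` and `iter_pickle` are hypothetical and not part of 'dataset_processing.py'.

```python
import os
import pickle
import tempfile


def append_to_pickle(obj, path):
    """Append one pickled object to the end of the file (hypothetical helper)."""
    with open(path, "ab") as file:
        pickle.dump(obj, file)


def iter_pickle(path):
    """Yield every object stored in the file, one at a time (hypothetical helper)."""
    with open(path, "rb") as file:
        while True:
            try:
                yield pickle.load(file)
            except EOFError:
                break


path = os.path.join(tempfile.mkdtemp(), "sketch.pkl")
append_to_pickle({"RRI_frequency": 4, "MAD_frequency": 1}, path)  # file information comes first
append_to_pickle({"ID": "a", "RRI": [0.8, 0.9]}, path)            # datapoints follow
append_to_pickle({"ID": "b", "RRI": [1.0, 1.1]}, path)

entries = iter_pickle(path)
file_info = next(entries)                          # first dictionary: file information
ids = [datapoint["ID"] for datapoint in entries]   # remaining dictionaries: datapoints
print(file_info, ids)
```

Reading through a generator means only one dictionary is held in memory at a time, which is one plausible way to keep memory usage low for large files.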

### Note

It may go without saying, but NEVER MANUALLY CHANGE THE CLASS ATTRIBUTES once data has been
added.

If you need a different uniform frequency, signal length or any other parameter, that's fine. But change
them before you save data to the database. See section: "Changing File Information".

### Creating Database (.pkl - file)

When initializing a database (calling 'SleepDataManager' on a non-existent path to a .pkl file), the class will
automatically write the first dictionary to it, which describes the properties of the data
within the file.

In [17]:
a_data_manager = SleepDataManager(file_path = "Processing_Demonstration/demo_file_info_change.pkl")
file_information = a_data_manager.file_info

for key in file_information.keys():
    print(f"{key}: {file_information[key]}")

RRI_frequency: 4
MAD_frequency: 1
SLP_frequency: 0.03333333333333333
SLP_predicted_frequency: 0.03333333333333333
RRI_inlier_interval: [0.3, 2.0]
MAD_inlier_interval: [None, None]
sleep_stage_label: {'wake': 0, 'LS': 1, 'DS': 2, 'REM': 3, 'artifect': 0}
signal_length_seconds: 36000
wanted_shift_length_seconds: 5400
absolute_shift_deviation_seconds: 1800
signal_split_reversed: False
train_val_test_split_applied: False
main_file_path: Processing_Demonstration/demo_file_info_change.pkl
train_file_path: Processing_Demonstration/demo_file_info_change_training_pid.pkl
validation_file_path: Processing_Demonstration/demo_file_info_change_validation_pid.pkl
test_file_path: Processing_Demonstration/demo_file_info_change_test_pid.pkl


Don't mind all the different keys yet. Necessary ones will be explained below.

### Changing File Information

File information can be changed easily, but only as long as no data has been added to the file:

In [18]:
# change the file information
new_file_info = {"RRI_frequency": 2, "SLP_frequency": 2}
a_data_manager.change_file_information(new_file_info)

print("Updated file information:\n")
file_information = a_data_manager.file_info
for key in new_file_info.keys():
    print(f"{key}: {file_information[key]}")

del a_data_manager, file_information

# the change in file information is saved and can be accessed by another instance of SleepDataManager
another_data_manager = SleepDataManager(file_path = "Processing_Demonstration/demo_file_info_change.pkl")
file_information = another_data_manager.file_info

print("\nFile information in new instance on same path:\n")
for key in new_file_info.keys():
    print(f"{key}: {file_information[key]}")

del another_data_manager, file_information
os.remove("Processing_Demonstration/demo_file_info_change.pkl")

some_data_manager = SleepDataManager(file_path = "Processing_Demonstration/messing_around.pkl")
file_information = some_data_manager.file_info

print("\nFile information in new instance on different path:\n")
for key in new_file_info.keys():
    print(f"{key}: {file_information[key]}")

Updated file information:

RRI_frequency: 2
SLP_frequency: 2

File information in new instance on same path:

RRI_frequency: 2
SLP_frequency: 2

File information in new instance on different path:

RRI_frequency: 4
SLP_frequency: 0.03333333333333333


### Saving Data

Most of the processing is already done during saving. To keep the data uniform, you must always provide
the sampling frequency for each signal 
(keys: "RRI_frequency", "MAD_frequency", "SLP_frequency", "SLP_predicted_frequency").

Operations that may be applied to the data you save:
- scale the number of datapoints in a signal so that its frequency matches the uniform frequency of the database
- alter sleep labels
- remove RRI and/or MAD outliers
- split the signal into multiple signals if it is longer than the uniform maximum signal length: 'signal_length_seconds'

To add a SLP signal, you must additionally provide the key: "sleep_stage_label".
This dictionary is supposed to tell which entries correspond to which sleep stage:

In [19]:
# sleep stage labels in shhs dataset:
# "wake": 0,    "N1": 1,    "N2": 2,    "N3": 3,    "REM": 5,   "artifect": "other integers"

# The neural network only distinguishes wake, LS, DS, REM and artifect. Therefore, N1 is relabeled as "wake",
# N2 as "LS" and N3 as "DS":
shhs_labels = {"wake": [0, 1], "LS": [2], "DS": [3], "REM": [5], "artifect": ["other"]}
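
To make the relabeling concrete, here is a minimal sketch of how such a dictionary could be used to translate dataset-specific labels into the uniform labels of the database. The function `remap_labels` is a hypothetical stand-in, not the implementation of `alter_slp_labels`, and it assumes that "other" acts as a catch-all for every value not listed elsewhere.

```python
import numpy as np


def remap_labels(signal, current_labels, desired_labels):
    """Translate source labels into uniform labels (hypothetical helper).

    current_labels: stage name -> list of source values ("other" = everything else)
    desired_labels: stage name -> uniform integer label
    """
    lookup = {}
    fallback = None
    for stage, values in current_labels.items():
        for value in values:
            if value == "other":
                fallback = desired_labels[stage]  # assumed catch-all entry
            else:
                lookup[value] = desired_labels[stage]
    return np.array([lookup.get(int(label), fallback) for label in signal])


source_labels = {"wake": [0, 1], "LS": [2], "DS": [3], "REM": [5], "artifect": ["other"]}
uniform_labels = {"wake": 0, "LS": 1, "DS": 2, "REM": 3, "artifect": 0}

remapped = remap_labels([3, 1, 3, 4, 1, 3, 4, 5, 5, 1], source_labels, uniform_labels)
print(remapped)
```

In this sketch the value 4 is not listed explicitly, so it falls under "other" and receives the artifect label.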

Now we will pass the worst possible data to the database, with differing sampling frequencies and excess length:

In [20]:
# creating signal with different sampling frequencies and overlength:
signal_time_in_seconds = 12.1 * 3600
rri_frequency = 6
mad_frequency = 2
slp_frequency = 1/20

# creating signals and printing manually scaled versions
rri_signal = np.array([random.randint(1, 5) for i in range(int(signal_time_in_seconds * rri_frequency))], dtype=np.float64)
print(f"First datapoints of RRI signal: {rri_signal[:10]} (shape: {rri_signal.shape})")
interpolate_rri = interpolate_signal(rri_signal, rri_frequency, 4) # type: ignore
print(f"First datapoints of RRI signal scaled: {interpolate_rri[:10]} (shape: {interpolate_rri.shape})")
mad_signal = np.array([random.randint(1, 5) for i in range(int(signal_time_in_seconds * mad_frequency))])
print(f"First datapoints of MAD signal: {mad_signal[:10]} (shape: {len(mad_signal)})")
interpolate_mad = interpolate_signal(mad_signal, mad_frequency, 1) # type: ignore
print(f"First datapoints of MAD signal scaled: {interpolate_mad[:10]} (shape: {len(interpolate_mad)})")
slp_signal = [random.randint(1, 5) for i in range(int(signal_time_in_seconds * slp_frequency))]
print(f"First datapoints of SLP signal: {slp_signal[:10]} (shape: {len(slp_signal)})")
scaled_slp = scale_classification_signal(slp_signal, slp_frequency, 1/30) # type: ignore
print(f"First datapoints of SLP signal scaled: {scaled_slp[:10]} (shape: {len(scaled_slp)})")

random_sleep_stage_labels = {"wake": [0, 1], "LS": [2], "DS": [3], "REM": [5], "artifect": ["other"]}
altered_scaled_slp = alter_slp_labels(scaled_slp, random_sleep_stage_labels, desired_labels = {"wake": 0, "LS": 1, "DS": 2, "REM": 3, "artifect": 0}) # type: ignore
print(f"First datapoints of scaled SLP signal altered: {altered_scaled_slp[:10]} (shape: {len(altered_scaled_slp)})")

new_datapoint = {
    "ID": "4",
    "RRI": rri_signal,
    "RRI_frequency": rri_frequency,
    "MAD": mad_signal,
    "MAD_frequency": mad_frequency,
    "SLP": slp_signal,
    "SLP_frequency": slp_frequency,
    "sleep_stage_label": random_sleep_stage_labels
}

First datapoints of RRI signal: [2. 5. 5. 4. 2. 3. 1. 5. 3. 4.] (shape: (261360,))
First datapoints of RRI signal scaled: [2.  5.  4.  2.5 1.  4.  4.  3.  3.  4. ] (shape: (174240,))
First datapoints of MAD signal: [4 3 5 5 3 5 2 4 5 5] (shape: 87120)
First datapoints of MAD signal scaled: [4 5 3 2 5 3 4 2 3 5] (shape: 43560)
First datapoints of SLP signal: [3, 1, 5, 3, 4, 5, 1, 3, 4, 4] (shape: 2178)
First datapoints of SLP signal scaled: [3 1 3 4 1 3 4 5 5 1] (shape: 1452)
First datapoints of scaled SLP signal altered: [2 0 2 0 0 2 0 3 3 0] (shape: 1452)


In [21]:
# saving the new datapoint
some_data_manager.save(copy.deepcopy(new_datapoint), overwrite_id=True, unique_id=False)

In [22]:
# print the data
for dict in some_data_manager:
    print("-"*70)
    for key in dict.keys():
        if key in ["RRI", "MAD", "SLP"]:
            print(key, dict[key][:10], dict[key].shape)
        else:
            print(key, dict[key])
print("-"*70)

----------------------------------------------------------------------
ID 4
RRI [2.  2.  2.  2.  1.  2.  2.  1.5 2.  2. ] (144000,)
MAD [4 5 3 2 5 3 4 2 3 5] (36000,)
SLP [2 0 2 0 0 2 0 3 3 0] (1200,)
shift_length_seconds 3780
----------------------------------------------------------------------
ID 4_shift_x1
RRI [2. 2. 2. 2. 2. 2. 2. 2. 2. 2.] (144000,)
MAD [1 4 2 2 3 4 3 1 2 5] (36000,)
SLP [3 0 3 1 3 3 1 2 0 0] (1200,)
shift_length_seconds 3780
----------------------------------------------------------------------
ID 4_shift_x2
RRI [2.  2.  2.  2.  1.  2.  1.  1.5 1.  2. ] (144000,)
MAD [1 1 1 1 4 5 2 5 4 3] (36000,)
SLP [3 3 3 3 2 1 1 0 2 2] (1200,)
shift_length_seconds 3780
----------------------------------------------------------------------


The provided datapoint was too long and its signal sampling frequencies did not match those of the database. Therefore, the
signals were resampled and the datapoint ("4") was split into multiple datapoints ("4", "4_shift_x1", "4_shift_x2")
by shifting a window of the wanted length (file_info: signal_length_seconds) by "shift_length_seconds" along the
datapoint.
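
The shift length of 3780 seconds in the output above can be reproduced with the following sketch. It assumes that the smallest number of shifts is chosen whose shift length does not exceed 'wanted_shift_length_seconds' + 'absolute_shift_deviation_seconds', and that the windows are spread evenly over the signal. This rule was picked because it matches the printed output; the selection rule in 'dataset_processing.py' may differ, and `split_starts` is a hypothetical name.

```python
def split_starts(total_seconds, window_seconds, wanted_shift, max_deviation):
    """Return (shift length, window start times) for evenly shifted windows.

    Hypothetical sketch: only the upper bound of the allowed shift range is
    enforced, the lower bound (wanted_shift - max_deviation) is not checked.
    """
    if total_seconds <= window_seconds:
        return 0, [0]  # signal fits into one window, nothing to split

    overhang = total_seconds - window_seconds
    n_shifts = 1
    while overhang / n_shifts > wanted_shift + max_deviation:
        n_shifts += 1

    shift = overhang / n_shifts
    starts = [i * shift for i in range(n_shifts + 1)]
    return shift, starts


# 12.1 h signal, 10 h windows, wanted shift 5400 s, allowed deviation 1800 s
shift, starts = split_starts(12.1 * 3600, 36000, 5400, 1800)
print(shift, starts)
```

With these numbers a single shift would have to be 7560 s long, which exceeds the allowed maximum of 7200 s, so two shifts of 3780 s are used and three windows result.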

Additionally, some of the RRI values were outside of the 'RRI_inlier_interval' (see file information) and were
therefore adjusted.
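
A minimal sketch of such an adjustment is shown below. It assumes that "adjusted" means clamping each value to the nearest bound of the inlier interval, which the saved RRI values (none above 2.0) suggest but do not prove. `clip_to_inlier_interval` is a hypothetical helper, and a bound of `None` is taken to mean "no limit on this side".

```python
import numpy as np


def clip_to_inlier_interval(signal, inlier_interval):
    """Clamp values to [lower, upper]; None disables a bound (hypothetical helper)."""
    lower, upper = inlier_interval
    clipped = np.asarray(signal, dtype=float).copy()
    if lower is not None:
        clipped = np.maximum(clipped, lower)
    if upper is not None:
        clipped = np.minimum(clipped, upper)
    return clipped


rri = [2, 5, 5, 4, 2, 3, 1, 5, 3, 4]
clipped_rri = clip_to_inlier_interval(rri, [0.3, 2.0])      # like 'RRI_inlier_interval'
untouched = clip_to_inlier_interval(rri, [None, None])       # like 'MAD_inlier_interval'
print(clipped_rri)
print(untouched)
```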

#### Adding information to already existing datapoints

The ultimate goal is either to train the neural network on the data or to use the neural network to predict
the sleep stages of the data. In the second case you might want to store the predictions in your database 
after passing the data to the neural network (keys: "SLP_predicted", "SLP_predicted_probability").

To add or overwrite an existing signal in the database, set the optional argument "overwrite_id"
to "True" when saving. If it is set to "False" and the ID already exists in the database, the data 
you are trying to save is discarded and an error is raised.

Attention: For this argument to have an effect, the optional argument "unique_id" must be set to "False". 
Further information below.

In [23]:
datapoint_additions = {
    "ID": "4",
    "SLP_predicted": np.array([random.randint(0, 3) for _ in range(int(36000 * 1/120))]),
    "SLP_predicted_probability": np.array([random.randint(0, 3) for _ in range(int(36000 * 1/120))]),
    "SLP_predicted_frequency": 1/120
}

print("Uniform frequency of predicted sleep stages (before saving the first):", some_data_manager.file_info["SLP_predicted_frequency"])

# saving the additional signals to datapoint
some_data_manager.save(copy.deepcopy(datapoint_additions), overwrite_id=True, unique_id=False)

# print the data
for dict in some_data_manager:
    print("-"*85)
    for key in dict.keys():
        if key in ["RRI", "MAD", "SLP", "SLP_predicted", "SLP_predicted_probability"]:
            print(key, dict[key][:10], dict[key].shape)
        else:
            print(key, dict[key])
print("-"*85)

print("Uniform frequency of predicted sleep stages (after saving):", some_data_manager.file_info["SLP_predicted_frequency"])

Uniform frequency of predicted sleep stages (before saving the first): 0.03333333333333333
-------------------------------------------------------------------------------------
ID 4
RRI [2.  2.  2.  2.  1.  2.  2.  1.5 2.  2. ] (144000,)
MAD [4 5 3 2 5 3 4 2 3 5] (36000,)
SLP [2 0 2 0 0 2 0 3 3 0] (1200,)
shift_length_seconds 3780
SLP_predicted [1 0 1 2 1 2 3 1 1 1] (300,)
SLP_predicted_probability [1 2 2 3 2 0 2 2 1 2] (300,)
-------------------------------------------------------------------------------------
ID 4_shift_x1
RRI [2. 2. 2. 2. 2. 2. 2. 2. 2. 2.] (144000,)
MAD [1 4 2 2 3 4 3 1 2 5] (36000,)
SLP [3 0 3 1 3 3 1 2 0 0] (1200,)
shift_length_seconds 3780
-------------------------------------------------------------------------------------
ID 4_shift_x2
RRI [2.  2.  2.  2.  1.  2.  1.  1.5 1.  2. ] (144000,)
MAD [1 1 1 1 4 5 2 5 4 3] (36000,)
SLP [3 3 3 3 2 1 1 0 2 2] (1200,)
shift_length_seconds 3780
-----------------------------------------------------------------------------

#### Speed up data saving

During the saving process, every ID is checked against the database to ensure that it is unique. This takes a while if 
you want to add many datapoints. To speed up the process, you can check once whether all IDs are unique and 
afterwards skip the check when adding the datapoints to the database:

In [24]:
list_of_ids = ["101", "102", "103"]

some_data_manager.check_if_ids_are_unique(list_of_ids)

All IDs are unique.


If all IDs are unique, you can use some_data_manager.save(..., unique_id=True) and saving will be much faster!
(This makes "overwrite_id" obsolete.)
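
The idea behind a one-time check can be sketched as follows: compare the new IDs among themselves and against the IDs already stored, once, instead of scanning the database on every save. `find_conflicting_ids` is a hypothetical helper written for illustration; it is not how `check_if_ids_are_unique` is implemented.

```python
from collections import Counter


def find_conflicting_ids(new_ids, existing_ids):
    """Return IDs that repeat within new_ids or already exist in the database."""
    counts = Counter(new_ids)
    duplicates = {id_ for id_, count in counts.items() if count > 1}
    already_stored = set(new_ids) & set(existing_ids)
    return sorted(duplicates | already_stored)


stored_ids = ["4", "4_shift_x1", "4_shift_x2"]
no_conflicts = find_conflicting_ids(["101", "102", "103"], stored_ids)
conflicts = find_conflicting_ids(["101", "4", "101"], stored_ids)
print(no_conflicts)
print(conflicts)
```

An empty result means the whole batch can be saved without any further ID checks.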

In [25]:
# print the data
for dict in some_data_manager:
    print("-"*70)
    for key in dict.keys():
        if key in ["RRI", "MAD", "SLP", "SLP_windows", "SLP_predicted", "SLP_predicted_probability"]:
            print(key, dict[key][0:5], dict[key].shape)
        elif key in ["RRI_windows", "MAD_windows"]:
            print(key, dict[key][0][0:5], dict[key].shape)
        else:
            print(key, dict[key])
print("-"*70)

----------------------------------------------------------------------
ID 4
RRI [2. 2. 2. 2. 1.] (144000,)
MAD [4 5 3 2 5] (36000,)
SLP [2 0 2 0 0] (1200,)
shift_length_seconds 3780
SLP_predicted [1 0 1 2 1] (300,)
SLP_predicted_probability [1 2 2 3 2] (300,)
----------------------------------------------------------------------
ID 4_shift_x1
RRI [2. 2. 2. 2. 2.] (144000,)
MAD [1 4 2 2 3] (36000,)
SLP [3 0 3 1 3] (1200,)
shift_length_seconds 3780
----------------------------------------------------------------------
ID 4_shift_x2
RRI [2. 2. 2. 2. 1.] (144000,)
MAD [1 1 1 1 4] (36000,)
SLP [3 3 3 3 2] (1200,)
shift_length_seconds 3780
----------------------------------------------------------------------


### Load Data

Data can be loaded in multiple ways using a string or an integer. 

If it's an integer, it is treated as a position in the database and the whole data dictionary is returned. \
If it's a string that equals a key in the data dictionaries, all entries of that specific key in the database are returned. \
If it's a different string, it is treated as an ID and matched against the datapoints. As with an index, the whole
dictionary is returned.
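
The three cases can be pictured with a small in-memory sketch. The list of known keys and the return value for an unknown ID (`None`) are assumptions made for illustration, and `load_sketch` is a hypothetical function that operates on a plain list of dictionaries instead of the .pkl file.

```python
# assumed set of keys that are treated as "load this key from every datapoint"
KNOWN_KEYS = {"ID", "RRI", "MAD", "SLP", "SLP_predicted", "SLP_predicted_probability"}


def load_sketch(datapoints, key_id_index):
    """Dispatch on the argument type and value (hypothetical helper)."""
    if isinstance(key_id_index, int):
        return datapoints[key_id_index]                 # position in the database
    if key_id_index in KNOWN_KEYS:
        return [d[key_id_index] for d in datapoints if key_id_index in d]
    for datapoint in datapoints:                        # otherwise: treat as ID
        if datapoint["ID"] == key_id_index:
            return datapoint
    return None                                         # assumed behaviour for unknown IDs


database = [
    {"ID": "4", "RRI": [2.0, 2.0]},
    {"ID": "4_shift_x1", "RRI": [2.0, 1.0]},
]

by_index = load_sketch(database, 1)
by_id = load_sketch(database, "4_shift_x1")
by_key = load_sketch(database, "RRI")
print(by_index, by_id, by_key)
```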

In [26]:
loaded_data = some_data_manager.load(1)
# loaded_data = some_data_manager[1] # same as above
print(loaded_data)

{'ID': '4_shift_x1', 'RRI': array([2., 2., 2., ..., 2., 2., 1.]), 'MAD': array([1, 4, 2, ..., 1, 1, 2]), 'SLP': array([3, 0, 3, ..., 2, 3, 0]), 'shift_length_seconds': 3780}


In [27]:
loaded_data = some_data_manager.load("4_shift_x1")
# loaded_data = some_data_manager["4_shift_x1"] # same as above
print(loaded_data)

{'ID': '4_shift_x1', 'RRI': array([2., 2., 2., ..., 2., 2., 1.]), 'MAD': array([1, 4, 2, ..., 1, 1, 2]), 'SLP': array([3, 0, 3, ..., 2, 3, 0]), 'shift_length_seconds': 3780}


In [28]:
loaded_data = some_data_manager.load("RRI")
# loaded_data = some_data_manager["RRI"] # same as above
print(loaded_data)

[array([2. , 2. , 2. , ..., 2. , 2. , 1.5]), array([2., 2., 2., ..., 2., 2., 1.]), array([2. , 2. , 2. , ..., 1.5, 1. , 2. ])]


### Remove Data

Removing takes the same argument as loading.

Deleting a signal from all entries:

In [29]:
some_data_manager.remove("RRI")

# print all data
for dict in some_data_manager:
    print("-"*20)
    for key in dict.keys():
        if key in ["RRI", "MAD", "SLP", "SLP_predicted", "SLP_predicted_probability"]:
            print(key, dict[key].shape)
        else:
            print(key, dict[key])
print("-"*20)

--------------------
ID 4
MAD (36000,)
SLP (1200,)
shift_length_seconds 3780
SLP_predicted (300,)
SLP_predicted_probability (300,)
--------------------
ID 4_shift_x1
MAD (36000,)
SLP (1200,)
shift_length_seconds 3780
--------------------
ID 4_shift_x2
MAD (36000,)
SLP (1200,)
shift_length_seconds 3780
--------------------


Deleting an entry by ID (if a signal was split and one of its IDs is removed, all others will be 
removed as well):

In [30]:
# add some data
new_datapoint_2 = copy.deepcopy(new_datapoint)
new_datapoint_2["ID"] = "5"
some_data_manager.save(new_datapoint_2, overwrite_id=True, unique_id=False)

In [31]:
some_data_manager.remove("4_shift_x1")

# print all data
for dict in some_data_manager:
    print("-"*20)
    for key in dict.keys():
        if key in ["RRI", "MAD", "SLP", "SLP_predicted", "SLP_predicted_probability"]:
            print(key, dict[key].shape)
        else:
            print(key, dict[key])
print("-"*20)

--------------------
ID 5
RRI (144000,)
MAD (36000,)
SLP (1200,)
shift_length_seconds 3780
--------------------
ID 5_shift_x1
RRI (144000,)
MAD (36000,)
SLP (1200,)
shift_length_seconds 3780
--------------------
ID 5_shift_x2
RRI (144000,)
MAD (36000,)
SLP (1200,)
shift_length_seconds 3780
--------------------
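
The cascade seen in the output above, where removing "4_shift_x1" also removed "4" and "4_shift_x2", can be sketched as follows. The sketch assumes that split parts are recognised solely by the "_shift_x&lt;number&gt;" suffix visible in the IDs; both helpers are hypothetical and work on a plain list of dictionaries.

```python
import re


def base_id(datapoint_id):
    """Strip a trailing '_shift_x<number>' suffix, if present."""
    return re.sub(r"_shift_x\d+$", "", datapoint_id)


def remove_with_split_parts(datapoints, datapoint_id):
    """Drop the datapoint and every other part that was split from the same signal."""
    target = base_id(datapoint_id)
    return [d for d in datapoints if base_id(d["ID"]) != target]


database = [{"ID": id_} for id_ in
            ["4", "4_shift_x1", "4_shift_x2", "5", "5_shift_x1", "5_shift_x2"]]

remaining = remove_with_split_parts(database, "4_shift_x1")
remaining_ids = [d["ID"] for d in remaining]
print(remaining_ids)
```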


Removing by index works analogously to removing by ID:

In [32]:
# add some data
new_datapoint_2 = copy.deepcopy(new_datapoint)
new_datapoint_2["ID"] = "6"
some_data_manager.save(new_datapoint_2, overwrite_id=True, unique_id=False)
del new_datapoint_2

In [33]:
some_data_manager.remove(0)

# print all data
for dict in some_data_manager:
    print("-"*30)
    for key in dict.keys():
        if key in ["RRI", "MAD", "SLP"]:
            print(key, dict[key].shape)
        else:
            print(key, dict[key])
print("-"*30)

------------------------------
ID 6
RRI (144000,)
MAD (36000,)
SLP (1200,)
shift_length_seconds 3780
------------------------------
ID 6_shift_x1
RRI (144000,)
MAD (36000,)
SLP (1200,)
shift_length_seconds 3780
------------------------------
ID 6_shift_x2
RRI (144000,)
MAD (36000,)
SLP (1200,)
shift_length_seconds 3780
------------------------------


For now, let's restore the data:

In [34]:
some_data_manager.remove(0)
some_data_manager.save(copy.deepcopy(new_datapoint), overwrite_id=True, unique_id=False)

### Other Operations:

Iterating over Database:

In [35]:
for datapoint in some_data_manager:
    print(datapoint["ID"])    

4
4_shift_x1
4_shift_x2


Checking if datapoint with certain ID is in database:

In [36]:
if "4" in some_data_manager:
    print("Datapoint with \"ID\" = 4 is in the data manager")

Datapoint with "ID" = 4 is in the data manager


Print function:

In [37]:
print(some_data_manager)

file_path: Processing_Demonstration/messing_around.pkl
file_info: {'RRI_frequency': 4, 'MAD_frequency': 1, 'SLP_frequency': 0.03333333333333333, 'SLP_predicted_frequency': 0.03333333333333333, 'RRI_inlier_interval': [0.3, 2.0], 'MAD_inlier_interval': [None, None], 'sleep_stage_label': {'wake': '0', 'LS': '1', 'DS': '2', 'REM': '3', 'artifect': '0'}, 'signal_length_seconds': 36000, 'wanted_shift_length_seconds': 5400, 'absolute_shift_deviation_seconds': 1800, 'signal_split_reversed': False, 'train_val_test_split_applied': False, 'main_file_path': 'Processing_Demonstration/messing_around.pkl', 'train_file_path': 'Processing_Demonstration/messing_around_training_pid.pkl', 'validation_file_path': 'Processing_Demonstration/messing_around_validation_pid.pkl', 'test_file_path': 'Processing_Demonstration/messing_around_test_pid.pkl'}


### Train-, Validation-, Test- Split

Of course, we aim to train a machine learning model with the data handled by this class. So, we want to
be able to separate the data into training-, validation- and test- pids.

First, let's create a new file and add some more data:

In [38]:
many_files_data_manager = SleepDataManager(file_path = "Processing_Demonstration/Data.pkl")

add_number_datapoints = 100

# optimal signal (fitting sampling frequencies and length):
signal_time_in_seconds = 10 * 3600
rri_frequency = 4
mad_frequency = 1
slp_frequency = 1/30

random_sleep_stage_labels = {"wake": [0, 1], "LS": [2], "DS": [3], "REM": [5], "artifect": ["other"]}

for i in range(add_number_datapoints):
    rri_signal = np.array([random.randint(1, 5) for i in range(int(signal_time_in_seconds * rri_frequency))], dtype=np.float64)
    mad_signal = [random.randint(1, 5) for i in range(int(signal_time_in_seconds * mad_frequency))]
    slp_signal = [random.randint(1, 5) for i in range(int(signal_time_in_seconds * slp_frequency))]

    decide_what_data_to_add = random.randint(0, 2)

    if decide_what_data_to_add == 0:
        new_datapoint = {
            "ID": str(i),
            "RRI": rri_signal,
            "RRI_frequency": rri_frequency,
            "MAD": mad_signal,
            "MAD_frequency": mad_frequency,
            "SLP": slp_signal,
            "SLP_frequency": slp_frequency,
            "sleep_stage_label": random_sleep_stage_labels
        } # optimal data (rri and mad to slp)
    elif decide_what_data_to_add == 1:
        new_datapoint = {
            "ID": str(i),
            "RRI": rri_signal,
            "RRI_frequency": rri_frequency,
            "MAD": mad_signal,
            "MAD_frequency": mad_frequency,
        } # invalid data (no target: slp)
    else:
        new_datapoint = {
            "ID": str(i),
            "RRI": rri_signal,
            "RRI_frequency": rri_frequency,
            "SLP": slp_signal,
            "SLP_frequency": slp_frequency,
            "sleep_stage_label": random_sleep_stage_labels
        } # only rri to slp
    
    many_files_data_manager.save(new_datapoint, overwrite_id=False)

print(f"Number of datapoints in file: {len(many_files_data_manager)}")

Number of datapoints in file: 100


Depending on whether test_size is provided or None, we can create separate files that store training-, validation- and 
test- data or just training- and validation- data.

Data that cannot be used to train the network (i.e. data missing "RRI" or "SLP") will be left in the main file. 

As we can manage data with "RRI" and "MAD" as well as data with "RRI" only, the algorithm makes sure
that only one of the two types of data is used (the one with more samples). The other type will 
be left in the main file.
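
How a list of usable IDs might be distributed according to these proportions is sketched below. The rounding rule (validation and test sizes are rounded, training receives the rest) is an assumption, and `split_ids` is a hypothetical helper rather than the logic of `separate_train_test_validation`.

```python
import random


def split_ids(ids, train_size, validation_size, test_size=None,
              shuffle=True, random_state=None):
    """Distribute IDs into training, validation and (optionally) test lists."""
    total = train_size + validation_size + (test_size or 0)
    if abs(total - 1) > 1e-9:
        raise ValueError("The provided sizes must sum to 1.")

    ids = list(ids)
    if shuffle:
        random.Random(random_state).shuffle(ids)

    # assumed rounding rule: round validation and test, training gets the remainder
    n_validation = round(len(ids) * validation_size)
    n_test = round(len(ids) * test_size) if test_size is not None else 0
    n_train = len(ids) - n_validation - n_test

    train = ids[:n_train]
    validation = ids[n_train:n_train + n_validation]
    test = ids[n_train + n_validation:]
    return train, validation, test


usable_ids = [str(i) for i in range(36)]
train, validation, test = split_ids(usable_ids, 0.8, 0.1, 0.1, random_state=0)
print(len(train), len(validation), len(test))

train_2, validation_2, test_2 = split_ids(usable_ids, 0.5, 0.5, None, random_state=0)
print(len(train_2), len(validation_2), len(test_2))
```

Every ID ends up in exactly one of the lists, so no datapoint is shared between training, validation and test data.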

In [39]:
many_files_data_manager.separate_train_test_validation(
    train_size = 0.8, 
    validation_size = 0.1, 
    test_size = 0.1, 
    random_state = None, 
    shuffle = True
)


Attention: 26 datapoints with MAD signal will be left in the main file.

Distributing 80.0% / 10.0% / 10.0% of datapoints into training / validation / test pids, respectively:
   ✅: 100.0% [█████████████████████████] 100 / 100 | 0.0s / 0.0s (0.0s/it) |

The individual files can be accessed by another instance of this class. 

ATTENTION:  

-   The instances on all files will have reduced functionality from now on. As the data should
    be fully prepared for the network now, the instances are designed to only load data and
    not save or edit it.

-   The functionality of the instance on the main file is not as restricted as the ones on the
    training, validation, and test files. The main file instance can additionally save data
    (only to main file, won't be forwarded to training, validation, or test files), reshuffle 
    the data in the secondary files or pull them back into the main file for further processing.

Accessing training-, validation- and test- data:

In [40]:
main_file_info = many_files_data_manager.file_info

train_data_manager = SleepDataManager(file_path = main_file_info["train_file_path"])
validation_data_manager = SleepDataManager(file_path = main_file_info["validation_file_path"])
test_data_manager = SleepDataManager(file_path = main_file_info["test_file_path"])

In [41]:
print("Length of each dataset:")
print("-"*30)
print(f"Main: {len(many_files_data_manager)}")
print(f"Train: {len(train_data_manager)}")
print(f"Validation: {len(validation_data_manager)}")
print(f"Test: {len(test_data_manager)}")

Length of each dataset:
------------------------------
Main: 64
Train: 28
Validation: 4
Test: 4


In [42]:
train_file_info = train_data_manager.file_info
validation_file_info = validation_data_manager.file_info
test_file_info = test_data_manager.file_info

equal = True
for key in main_file_info.keys():
    if main_file_info[key] != train_file_info[key] or main_file_info[key] != validation_file_info[key] or main_file_info[key] != test_file_info[key]:
        equal = False
        break

if equal:
    print("Main file info is the same as train, validation and test file info!")

Main file info is the same as train, validation and test file info!


Of course, you can always reshuffle the data again from the main file manager (let's assign different 
arguments to see that something happened):

In [43]:
many_files_data_manager.separate_train_test_validation(
    train_size = 0.5, 
    validation_size = 0.5, 
    test_size = None, 
    random_state = None, 
    shuffle = True
)


Attention: 26 datapoints with MAD signal will be left in the main file.

Distributing 50.0% / 50.0% of datapoints into training / validation pids, respectively:
   ✅: 100.0% [█████████████████████████] 100 / 100 | 0.0s / 0.0s (0.0s/it) |

In [44]:
print("Length of each dataset:")
print("-"*30)
print(f"Main: {len(many_files_data_manager)}")
print(f"Train: {len(train_data_manager)}")
print(f"Validation: {len(validation_data_manager)}")
try:
    print(f"Test: {len(test_data_manager)}")
except Exception:
    print("No test data manager")

Length of each dataset:
------------------------------
Main: 64
Train: 18
Validation: 18
No test data manager


We can also fuse the data again (do not forget to close your active data managers!):

In [45]:
many_files_data_manager.fuse_train_test_validation()

In [46]:
del train_data_manager
del validation_data_manager
del test_data_manager

In [47]:
print(len(many_files_data_manager))

100


### Reversing Signal Split

After you have added predicted sleep stages to the database, you might want to reverse the signal split that was 
applied to the data during the saving process.

Calling the function will recombine all signals, including the predicted sleep stages, providing you with
multiple results for the sleep stage in the overlapping parts.
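
The following sketch illustrates what "multiple results for the overlapping parts" means. It assumes that all windows have the same length and were shifted by a constant number of samples; `merge_overlapping_windows` is a hypothetical helper, not the code behind `reverse_signal_split`.

```python
def merge_overlapping_windows(windows, shift):
    """Collect, for every position of the original signal, all values assigned to it.

    windows: equally long lists of predictions, in the order they were split
    shift:   number of samples between the starts of consecutive windows
    """
    window_length = len(windows[0])
    total_length = shift * (len(windows) - 1) + window_length
    merged = [[] for _ in range(total_length)]
    for window_index, window in enumerate(windows):
        start = window_index * shift
        for offset, value in enumerate(window):
            merged[start + offset].append(value)
    return merged


windows = [[39, 40, 41, 42, 43], [42, 43, 44, 45, 46], [45, 46, 47, 48, 49]]
merged = merge_overlapping_windows(windows, shift=3)
print(merged)
```

Positions covered by two windows end up with two entries, which could later be reduced to a single label, for example by a majority vote.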

In [78]:
# initialize the data manager
splitting_data_manager = SleepDataManager(file_path = "Processing_Demonstration/Reverse_Splitting.pkl")

# change the file information
new_file_info = {"RRI_frequency": 2, "MAD_frequency": 1, "signal_length_seconds": 10, "wanted_shift_length_seconds": 5, "absolute_shift_deviation_seconds": 2, "SLP_frequency": 1, "SLP_predicted_frequency": 0.5, "RRI_inlier_interval": [None, None]}
splitting_data_manager.change_file_information(new_file_info)

data_dict = {
    "RRI": np.array([i for i in range(44)], dtype=np.float64),
    "RRI_frequency": 2,
    "MAD": np.array([i for i in range(22)], dtype=np.int64),
    "MAD_frequency": 1,
}

# add data that will be split
for i in range(5):
    data_dict["ID"] = str(i)
    splitting_data_manager.save(data_dict, overwrite_id=False)

file_generator = load_from_pickle("Processing_Demonstration/Reverse_Splitting.pkl")
next(file_generator)

# add "predicted" sleep stages
count = 1
old_slp_pred_prob = np.array([np.round(np.random.rand(4), 2) for _ in range(2)], dtype=np.float64)
for file in file_generator:
    new_slp_pred_prob = np.array([np.round(np.random.rand(4), 2) for _ in range(3)], dtype=np.float64)
    new_slp_pred_prob = np.append(old_slp_pred_prob, new_slp_pred_prob, axis=0)
    old_slp_pred_prob = new_slp_pred_prob[3:]
    additional_info = {
        "SLP_predicted": np.array([i for i in range(5)], dtype=np.int64)+3*count,
        "SLP_predicted_probability": new_slp_pred_prob,
        "SLP_predicted_frequency": 0.5
    }
    additional_info["ID"] = file["ID"]
    splitting_data_manager.save(additional_info, overwrite_id=True)
    count += 1

del file_generator

# print the data
print("="*90)
for dict in splitting_data_manager:
    message = f"ID: {dict['ID']}"
    for key in dict.keys():
        if key in ["RRI", "MAD", "SLP_predicted", "SLP_predicted_probability"]:
            message += f", {key}: {dict[key].shape}"
    print(message)

    if "4" in dict['ID']:
        for key in dict.keys():
            if key in ["RRI", "MAD", "SLP_predicted", "SLP_predicted_probability"]:
                print(key, dict[key])
        print("-"*90)

print("="*90)

ID: 0, RRI: (20,), MAD: (10,), SLP_predicted: (5,), SLP_predicted_probability: (5, 4)
ID: 0_shift_x1, RRI: (20,), MAD: (10,), SLP_predicted: (5,), SLP_predicted_probability: (5, 4)
ID: 0_shift_x2, RRI: (20,), MAD: (10,), SLP_predicted: (5,), SLP_predicted_probability: (5, 4)
ID: 1, RRI: (20,), MAD: (10,), SLP_predicted: (5,), SLP_predicted_probability: (5, 4)
ID: 1_shift_x1, RRI: (20,), MAD: (10,), SLP_predicted: (5,), SLP_predicted_probability: (5, 4)
ID: 1_shift_x2, RRI: (20,), MAD: (10,), SLP_predicted: (5,), SLP_predicted_probability: (5, 4)
ID: 2, RRI: (20,), MAD: (10,), SLP_predicted: (5,), SLP_predicted_probability: (5, 4)
ID: 2_shift_x1, RRI: (20,), MAD: (10,), SLP_predicted: (5,), SLP_predicted_probability: (5, 4)
ID: 2_shift_x2, RRI: (20,), MAD: (10,), SLP_predicted: (5,), SLP_predicted_probability: (5, 4)
ID: 3, RRI: (20,), MAD: (10,), SLP_predicted: (5,), SLP_predicted_probability: (5, 4)
ID: 3_shift_x1, RRI: (20,), MAD: (10,), SLP_predicted: (5,), SLP_predicted_probability

In [52]:
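def list_shape(nested_list):
    """Return the shape of a nested list as a string, following the first element of each level."""
    shape = "("
    while True:
        try:
            shape += str(len(nested_list))
            nested_list = nested_list[0]
            shape += ", "
        except (TypeError, IndexError):
            # TypeError: reached a scalar, IndexError: reached an empty list
            break
    shape += ")"
    return shape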

In [79]:
# apply the reverse splitting
splitting_data_manager.reverse_signal_split()

# print the data
print("-"*90)
for dict in splitting_data_manager:
    message = f"ID: {dict['ID']}"
    for key in dict.keys():
        if key in ["RRI", "MAD", "SLP_predicted_probability"]:
            message += f", {key}: {dict[key].shape}"
        if key in ["SLP_predicted"]:
            message += f", {key}: {list_shape(dict[key])}"
    print(message)
print("-"*90)

for key in dict.keys():
    if key in ["RRI", "MAD", "SLP_predicted", "SLP_predicted_probability"]:
        print(key, dict[key])

------------------------------------------------------------------------------------------
ID: 0, RRI: (44,), MAD: (22,), SLP_predicted: (11, 1, ), SLP_predicted_probability: (11, 4)
ID: 1, RRI: (44,), MAD: (22,), SLP_predicted: (11, 1, ), SLP_predicted_probability: (11, 4)
ID: 2, RRI: (44,), MAD: (22,), SLP_predicted: (11, 1, ), SLP_predicted_probability: (11, 4)
ID: 3, RRI: (44,), MAD: (22,), SLP_predicted: (11, 1, ), SLP_predicted_probability: (11, 4)
ID: 4, RRI: (44,), MAD: (22,), SLP_predicted: (11, 1, ), SLP_predicted_probability: (11, 4)
------------------------------------------------------------------------------------------
RRI [ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17.
 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35.
 36. 37. 38. 39. 40. 41. 42. 43.]
MAD [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21]
SLP_predicted [[39], [40], [41], [42, 42], [43, 43], [44], [45, 45], [46, 46], [47], [48], [49]]
SLP_p

### Cleaning up

In [54]:
created_files = os.listdir("Processing_Demonstration")
for file in created_files:
    try:
        os.remove(f"Processing_Demonstration/{file}")
    except OSError:
        pass
os.rmdir("Processing_Demonstration")

## Introduction to the implemented functions

In [55]:
import numpy as np # type: ignore
import random
import h5py # type: ignore

In this section you can check whether the implemented functions in this project work correctly.

### Scaling the number of datapoints from signal frequency to target frequency

I highly recommend providing data whose signals are already sampled at the frequencies of the data used to
train the neural network, so that no rescaling is needed.

If that is not possible, the following functions will be applied to
your data:

#### Classification Signal

In [56]:
classification_array = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])
classification_frequency = 1/20
target_frequency = 1/30

print("-"*71)
print(f"Classification Frequency: {classification_frequency} -> Target Frequency: {target_frequency}")
print("-"*71)
print("\nClassification array: ", classification_array)
print("Classification array shape: ", classification_array.shape)

reshaped_array = scale_classification_signal(
        signal = classification_array, # type: ignore
        signal_frequency = classification_frequency,
        target_frequency = target_frequency
        )

print("\nScaled array: ", reshaped_array)
print("Scaled array shape: ", reshaped_array.shape)

classification_array = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8])
classification_frequency = 1/50
target_frequency = 1/30

print("\n")
print("-"*71)
print(f"Classification Frequency: {classification_frequency} -> Target Frequency: {target_frequency}")
print("-"*71)
print("\nClassification array: ", classification_array)
print("Classification array shape: ", classification_array.shape)

reshaped_array = scale_classification_signal(
        signal = classification_array, # type: ignore
        signal_frequency = classification_frequency,
        target_frequency = target_frequency
        )

print("\nScaled array: ", reshaped_array)
print("Scaled array shape: ", reshaped_array.shape)

del reshaped_array, classification_array, classification_frequency, target_frequency

-----------------------------------------------------------------------
Classification Frequency: 0.05 -> Target Frequency: 0.03333333333333333
-----------------------------------------------------------------------

Classification array:  [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]
Classification array shape:  (15,)

Scaled array:  [ 0  1  3  4  6  7  9 10 12 13]
Scaled array shape:  (10,)


-----------------------------------------------------------------------
Classification Frequency: 0.02 -> Target Frequency: 0.03333333333333333
-----------------------------------------------------------------------

Classification array:  [0 1 2 3 4 5 6 7 8]
Classification array shape:  (9,)

Scaled array:  [0 1 1 2 2 3 4 4 5 5 6 7 7 8 8]
Scaled array shape:  (15,)
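
One rule that is consistent with both outputs above is "take the source label nearest in time, and on an exact tie take the earlier one". The following is a minimal sketch of that rule only. The name `resample_labels` and the use of integer sample periods (seconds per sample instead of frequencies) are assumptions made for illustration, and `scale_classification_signal` may be implemented differently.

```python
def resample_labels(labels, source_period_s, target_period_s):
    """Pick, for every target sample, the source label nearest in time.

    Periods are integer seconds per sample (the inverse of the frequency),
    which keeps all arithmetic exact. Exact ties go to the earlier label.
    """
    duration_s = len(labels) * source_period_s
    n_out = duration_s // target_period_s
    resampled = []
    for i in range(n_out):
        # Nearest source index to i * target_period_s / source_period_s,
        # rounding exact halves down, written in integer arithmetic.
        idx = (2 * i * target_period_s + source_period_s - 1) // (2 * source_period_s)
        resampled.append(labels[min(idx, len(labels) - 1)])
    return resampled


print(resample_labels(list(range(15)), 20, 30))  # 1/20 Hz -> 1/30 Hz
# → [0, 1, 3, 4, 6, 7, 9, 10, 12, 13]
print(resample_labels(list(range(9)), 50, 30))   # 1/50 Hz -> 1/30 Hz
# → [0, 1, 1, 2, 2, 3, 4, 4, 5, 5, 6, 7, 7, 8, 8]
```

Labels are never averaged here, because a value between two sleep stages has no meaning.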


#### Continuous Signal

In [57]:
continuous_array_int = np.array([0, 1, 2, 3, 4, 5])
continuous_array_float = np.array([0, 1, 2, 3, 4, 5], dtype = float)
continuous_frequency = 3
target_frequency = 4

print("-"*75)
print(f"Continuous Frequency: {continuous_frequency} -> Target Frequency: {target_frequency}")
print("-"*75)
print(f"Continuous array: {continuous_array_int} / {continuous_array_float}")
print("Continuous array shape: ", continuous_array_int.shape)

reshaped_array_int = interpolate_signal(
        signal = continuous_array_int, # type: ignore
        signal_frequency = continuous_frequency,
        target_frequency = target_frequency
        )

reshaped_array_float = interpolate_signal(
        signal = continuous_array_float, # type: ignore
        signal_frequency = continuous_frequency,
        target_frequency = target_frequency
        )

print(f"\nScaled array: {reshaped_array_int} / {reshaped_array_float}")
print("Scaled array shape: ", reshaped_array_int.shape)

print("\n")

continuous_array_int = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
continuous_array_float = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype = float)
continuous_frequency = 5
target_frequency = 4

print("-"*75)
print(f"Continuous Frequency: {continuous_frequency} -> Target Frequency: {target_frequency}")
print("-"*75)
print(f"Continuous array: {continuous_array_int} / {continuous_array_float}")
print("Continuous array shape: ", continuous_array_int.shape)

reshaped_array_int = interpolate_signal(
        signal = continuous_array_int, # type: ignore
        signal_frequency = continuous_frequency,
        target_frequency = target_frequency
        )

reshaped_array_float = interpolate_signal(
        signal = continuous_array_float, # type: ignore
        signal_frequency = continuous_frequency,
        target_frequency = target_frequency
        )

print(f"\nScaled array: {reshaped_array_int} / {reshaped_array_float}")
print("Scaled array shape: ", reshaped_array_int.shape)

del reshaped_array_int, reshaped_array_float, continuous_array_int, continuous_array_float, continuous_frequency, target_frequency

---------------------------------------------------------------------------
Continuous Frequency: 3 -> Target Frequency: 4
---------------------------------------------------------------------------
Continuous array: [0 1 2 3 4 5] / [0. 1. 2. 3. 4. 5.]
Continuous array shape:  (6,)

Scaled array: [0 1 2 2 3 4 4 5] / [0.   0.75 1.5  2.25 3.   3.75 4.5  5.  ]
Scaled array shape:  (8,)


---------------------------------------------------------------------------
Continuous Frequency: 5 -> Target Frequency: 4
---------------------------------------------------------------------------
Continuous array: [0 1 2 3 4 5 6 7 8 9] / [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
Continuous array shape:  (10,)

Scaled array: [0 1 2 4 5 6 8 9] / [0.   1.25 2.5  3.75 5.   6.25 7.5  8.75]
Scaled array shape:  (8,)
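
The float outputs above are what plain linear interpolation produces when positions past the last sample are clamped to it. A minimal sketch under that assumption follows. `interpolate_linear` is a hypothetical helper written in pure Python, not the code behind `interpolate_signal`.

```python
def interpolate_linear(signal, signal_frequency, target_frequency):
    """Linearly interpolate `signal` onto a grid sampled at `target_frequency`."""
    n_out = int(len(signal) * target_frequency / signal_frequency)
    last = len(signal) - 1
    result = []
    for i in range(n_out):
        # Position of the i-th target sample on the source index axis,
        # clamped so that it never runs past the last source sample.
        position = min(i * signal_frequency / target_frequency, last)
        lower = int(position)
        upper = min(lower + 1, last)
        weight = position - lower
        result.append(signal[lower] * (1 - weight) + signal[upper] * weight)
    return result


upsampled = interpolate_linear([0.0, 1.0, 2.0, 3.0, 4.0, 5.0], 3, 4)
print(upsampled)
# → [0.0, 0.75, 1.5, 2.25, 3.0, 3.75, 4.5, 5.0]
print([round(value) for value in upsampled])
# → [0, 1, 2, 2, 3, 4, 4, 5]
```

Rounding the interpolated values with round-half-to-even, as Python's `round` does, gives the same integers as the integer output shown above.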


### Splitting a signal which is too long for the neural network model

A signal that is too short will be padded with zeros, which is unproblematic. A signal that is too long, however,
will be split into multiple signals. To create more data, the 10 hour range is shifted along the
signal.

This shift should not be too small, as that would create redundant data, but also not too large, because more
data is better. So we try to find a shift size close to 1 hour that lets us shift the range an integer number
of times.

#### Finding optimal shift size

In [58]:
signal_length_addition_hours = 2.5
desired_length_hours = 10

optimal_shift_length = calculate_optimal_shift_length(
        signal_length_seconds = (desired_length_hours + signal_length_addition_hours) * 3600, # type: ignore
        desired_length_seconds = desired_length_hours*3600, 
        wanted_shift_length_seconds = 3600,
        absolute_shift_deviation_seconds = 1800,
        all_signal_frequencies = [4, 1, 1/30, 1/120]
)
print(optimal_shift_length)

print(f"Optimal shift length for signal which is {signal_length_addition_hours} hours longer than desired length of {desired_length_hours} hours: {round(optimal_shift_length/3600, 3)} hours")

3000
Optimal shift length for signal which is 2.5 hours longer than desired length of 10 hours: 0.833 hours
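
The result can be reasoned through by hand: the signal exceeds the desired length by 2.5 hours (9000 s), and this excess has to be divided into equal shifts. The sketch below tries every integer number of shifts and keeps the shift closest to the wanted one. It assumes that a shift is only valid if it covers a whole number of samples at every frequency. `sketch_shift_length` is a hypothetical name and this is not necessarily how `calculate_optimal_shift_length` works internally.

```python
def sketch_shift_length(signal_s, window_s, wanted_shift_s, max_deviation_s, frequencies):
    """Return the valid shift (in seconds) closest to `wanted_shift_s`, or None."""
    excess_s = signal_s - window_s
    min_shift_s = max(wanted_shift_s - max_deviation_s, 1)  # guards against endless loops
    best = None
    number_of_shifts = 1
    while excess_s / number_of_shifts >= min_shift_s:
        shift_s = excess_s / number_of_shifts
        number_of_shifts += 1
        if shift_s > wanted_shift_s + max_deviation_s:
            continue  # still too large, try more shifts
        # Assumption: the shift must cover a whole number of samples at every frequency.
        if any(abs(shift_s * f - round(shift_s * f)) > 1e-9 for f in frequencies):
            continue
        if best is None or abs(shift_s - wanted_shift_s) < abs(best - wanted_shift_s):
            best = shift_s
    return best


print(sketch_shift_length(45000, 36000, 3600, 1800, [4, 1, 1/30, 1/120]))
# → 3000.0
```

Here 4500 s (two shifts) is rejected because it is not a whole number of samples at 1/120 Hz, which leaves 3000 s (three shifts) as the candidate closest to one hour.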


#### Splitting Signal

The above function for finding the optimal shift length is embedded in the following split function. The optimal
shift size is estimated for every signal individually.

If no integer shift size within the allowed range lets the shifted windows end exactly on the last datapoints
of the long signal, the last shift is altered so that they do.

In [59]:
# Create random signal
frequency = 4
length_signal_seconds = 12.1 * 3600
signal = np.random.rand(int(length_signal_seconds * frequency))

# Only important parameters here:
nn_signal_seconds = 10 * 3600
shift_length_seconds = 3600
absolute_shift_deviation_seconds = 1800

signals_from_splitting, shift_length = split_long_signal(
        signal = signal, # type: ignore
        sampling_frequency = frequency,
        target_frequency = frequency,
        nn_signal_duration_seconds = nn_signal_seconds,
        wanted_shift_length_seconds = shift_length_seconds,
        absolute_shift_deviation_seconds = absolute_shift_deviation_seconds
        )

print(f"Shift length: {shift_length} seconds")
print(f"Shift length: {int(shift_length * frequency)} datapoints")
print("Signal shape: ", signal.shape)
print(f"Datapoints in NN: {nn_signal_seconds * frequency}")
print("Signals from splitting shape: ", list_shape(signals_from_splitting))

del signals_from_splitting, signal, shift_length, frequency, nn_signal_seconds, shift_length_seconds, absolute_shift_deviation_seconds

Shift length: 3780 seconds
Shift length: 15120 datapoints
Signal shape:  (174240,)
Datapoints in NN: 144000
Signals from splitting shape:  (3, 144000, )
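
The splitting itself amounts to cutting windows of fixed length whose start moves forward by the shift. The following is a minimal sketch with all lengths given in datapoints (seconds multiplied by the frequency). `split_with_shift` is a hypothetical helper. Letting a final window that runs past the end simply come out shorter is an assumption of this sketch, not a statement about `split_long_signal`.

```python
def split_with_shift(signal, window_length, shift_length):
    """Cut `signal` into windows of `window_length` whose starts are `shift_length` apart.

    `shift_length` must be positive. The loop stops as soon as a window
    reaches the end of the signal.
    """
    segments = []
    start = 0
    while True:
        segments.append(signal[start:start + window_length])
        if start + window_length >= len(signal):
            break
        start += shift_length
    return segments


for segment in split_with_shift(list(range(22)), window_length=10, shift_length=6):
    print(segment)
```

For the 12.1 hour example above, 174240 datapoints with a window of 144000 and a shift of 15120 datapoints (3780 s at 4 Hz) give exactly three windows.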


#### Splitting signals within dictionary

In [60]:
# Create random signal
length_signal_seconds = 12.1 * 3600
rri_frequency = 4
mad_frequency = 1
rri_signal = np.random.rand(int(length_signal_seconds * rri_frequency))
mad_signal = np.random.rand(int(length_signal_seconds * mad_frequency))

data_dict = {
    "ID": "1",
    "RRI": rri_signal,
    "RRI_frequency": rri_frequency,
    "MAD": mad_signal,
    "MAD_frequency": mad_frequency,
}

new_dictionaries = split_signals_within_dictionary(
    data_dict = data_dict,
    id_key = "ID",
    valid_signal_keys = ["RRI", "MAD"],
    signal_frequencies = [rri_frequency, mad_frequency],
    signal_target_frequencies = [rri_frequency, mad_frequency],
    nn_signal_duration_seconds = 10 * 3600,
    wanted_shift_length_seconds = 3600,
    absolute_shift_deviation_seconds = 1800,
    all_signal_frequencies = [rri_frequency, mad_frequency]
)

print("Original dictionary:")
print("-"*20)
for key, value in data_dict.items():
    if key == "RRI" or key == "MAD" or key == "SLP":
        print(f"{key}: {value.shape}")
    else:
        print(f"{key}: {value}")
print("\nNew dictionaries:")
print("-"*20)
for new_dict in new_dictionaries:
    for key, value in new_dict.items():
        if key == "RRI" or key == "MAD" or key == "SLP":
            print(f"{key}: {value.shape}")
        else:
            print(f"{key}: {value}")
    print("")

Original dictionary:
--------------------
ID: 1
RRI: (174240,)
RRI_frequency: 4
MAD: (43560,)
MAD_frequency: 1

New dictionaries:
--------------------
ID: 1
RRI: (144000,)
RRI_frequency: 4
MAD: (36000,)
MAD_frequency: 1
shift_length_seconds: 3780

ID: 1_shift_x1
RRI: (144000,)
RRI_frequency: 4
MAD: (36000,)
MAD_frequency: 1
shift_length_seconds: 3780

ID: 1_shift_x2
RRI: (144000,)
RRI_frequency: 4
MAD: (36000,)
MAD_frequency: 1
shift_length_seconds: 3780



#### Fusing signals back together

In [61]:
signals_from_splitting, shift_length = split_long_signal(
        signal = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21], # type: ignore
        sampling_frequency = 1,
        target_frequency = 1,
        nn_signal_duration_seconds = 10,
        wanted_shift_length_seconds = 5,
        absolute_shift_deviation_seconds = 1,
        all_signal_frequencies = [1]
        )

print("Splitted Signals:\n", signals_from_splitting)

fused_signal = fuse_splitted_signals(
    signals = signals_from_splitting, # type: ignore
    shift_length = int(shift_length), # type: ignore
    signal_type = "feature"
)

print("\nFused signal:\n", fused_signal)

Splitted Signals:
 [array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([ 6,  7,  8,  9, 10, 11, 12, 13, 14, 15]), array([12, 13, 14, 15, 16, 17, 18, 19, 20, 21])]

Fused signal:
 [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21]
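
For feature signals, fusing can be pictured as appending only the part of each segment that the fused signal does not cover yet. The sketch below assumes that consecutive segments overlap or touch, and that overlapping values are identical. `fuse_segments` is a hypothetical helper, and `fuse_splitted_signals` may treat other signal types differently.

```python
def fuse_segments(segments, shift_length):
    """Rebuild a signal from segments whose starts are `shift_length` datapoints apart."""
    fused = list(segments[0])
    for k, segment in enumerate(segments[1:], start=1):
        start = k * shift_length              # where this segment begins in the fused signal
        already_covered = len(fused) - start  # datapoints of the segment that are already present
        fused.extend(segment[already_covered:])
    return fused


segments = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
    [12, 13, 14, 15, 16, 17, 18, 19, 20, 21],
]
print(fuse_segments(segments, shift_length=6))
```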


#### Fusing split dictionaries

In [62]:
data_dict = {
    "ID": "1",
    "RRI": np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 20, 20, 21, 21, 22, 22]),
    "RRI_frequency": 2,
    "MAD": np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]),
    "MAD_frequency": 1,
    "SLP": np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]),
    "SLP_frequency": 0.5
}

new_dictionaries = split_signals_within_dictionary(
    data_dict = data_dict,
    id_key = "ID",
    valid_signal_keys = ["RRI", "MAD", "SLP"],
    signal_frequencies = [2, 1, 0.5],
    signal_target_frequencies = [2, 1, 0.5],
    nn_signal_duration_seconds = 10,
    wanted_shift_length_seconds = 5,
    absolute_shift_deviation_seconds = 2,
    all_signal_frequencies = [2, 1, 0.5]
)

print("Original dictionary:")
print("-"*20)
for key, value in data_dict.items():
    if key == "RRI" or key == "MAD" or key == "SLP":
        print(f"{key}: {value.shape}")
    else:
        print(f"{key}: {value}")

Original dictionary:
--------------------
ID: 1
RRI: (46,)
RRI_frequency: 2
MAD: (23,)
MAD_frequency: 1
SLP: (12,)
SLP_frequency: 0.5


In [63]:
print("\nNew dictionaries:")
print("-"*20)
for new_dict in new_dictionaries:
    for key, value in new_dict.items():
        print(f"{key}: {value}")
    print("")


New dictionaries:
--------------------
ID: 1
RRI: [0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9]
RRI_frequency: 2
MAD: [0 1 2 3 4 5 6 7 8 9]
MAD_frequency: 1
SLP: [0 1 2 3 4]
SLP_frequency: 0.5
shift_length_seconds: 6

ID: 1_shift_x1
RRI: [ 6  6  7  7  8  8  9  9 10 10 11 11 12 12 13 13 14 14 15 15]
RRI_frequency: 2
MAD: [ 6  7  8  9 10 11 12 13 14 15]
MAD_frequency: 1
SLP: [3 4 5 6 7]
SLP_frequency: 0.5
shift_length_seconds: 6

ID: 1_shift_x2
RRI: [12 12 13 13 14 14 15 15 16 16 17 17 18 18 19 19 20 20 21 21]
RRI_frequency: 2
MAD: [12 13 14 15 16 17 18 19 20 21]
MAD_frequency: 1
SLP: [ 6  7  8  9 10]
SLP_frequency: 0.5
shift_length_seconds: 6

ID: 1_shift_x3
RRI: [18 18 19 19 20 20 21 21 22 22]
RRI_frequency: 2
MAD: [18 19 20 21 22]
MAD_frequency: 1
SLP: [ 9 10 11]
SLP_frequency: 0.5
shift_length_seconds: 6



In [64]:
fused_dictionary = fuse_splitted_signals_within_dictionaries(
    data_dictionaries = new_dictionaries,
    valid_signal_keys = ["RRI", "MAD", "SLP"],
    valid_signal_frequencies = [2, 1, 0.5],
)

for key, value in fused_dictionary.items():
    print(f"{key}: {value}")

ID: 1
RRI: [ 0  0  1  1  2  2  3  3  4  4  5  5  6  6  7  7  8  8  9  9 10 10 11 11
 12 12 13 13 14 14 15 15 16 16 17 17 18 18 19 19 20 20 21 21 22 22]
MAD: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22]
SLP: [ 0  1  2  3  4  5  6  7  8  9 10  9 10 11]
RRI_frequency: 2
MAD_frequency: 1
SLP_frequency: 0.5


### Reading an .h5 file

One of the available training datasets for our neural network model is stored in an .h5 file, so we need
to be able to read it. These are the important operations:

In [65]:
shhs_dataset = h5py.File("Raw_Data/SHHS_dataset.h5", 'r')
patients = list(shhs_dataset['slp'].keys()) # type: ignore

random_patient = patients[np.random.randint(0, len(patients))]
print(f"Random patient: {random_patient}")

print(np.unique(shhs_dataset["slp"][random_patient][:])) # type: ignore

for key in ["slp", "rri"]:
    print(f"\nkey: {key}")

    patient_data = shhs_dataset[key][random_patient][:] # type: ignore
    print(f"Data shape: {patient_data.shape}") # type: ignore

    data_freq = shhs_dataset[key].attrs["freq"] # type: ignore
    print(f"Data frequency: {data_freq}")
    print(f"Inverse data frequency: {1/data_freq}") # type: ignore

    print(f"Data length: {patient_data.shape[0]/data_freq} s") # type: ignore

del shhs_dataset, patients, random_patient, key, patient_data, data_freq

Random patient: 201966_1
[0 1 2 3 5]

key: slp
Data shape: (747,)
Data frequency: 0.03333333333333333
Inverse data frequency: 30.0
Data length: 22410.0 s

key: rri
Data shape: (89640,)
Data frequency: 4
Inverse data frequency: 0.25
Data length: 22410.0 s


### Divide up a signal into overlapping windows

The hardest part is that 'window_overlap' and 'datapoints_per_window' must be chosen so that
the whole signal fits perfectly into n windows.

Additionally, those values must be integers. This means that 'window_duration_seconds' and 'overlap_seconds'
multiplied by 'target_frequency' as well as by 'sampling_frequency' must be integers. (The features and the target labels
must fit equally well into the windows, so that we can find the correlation between a feature window and a target window.)

We have the RRI and MAD values as features and the sleep phase as target classification. As we will see,
the RRI and MAD values were recorded with an integer sampling frequency, while the sampling frequency of the
sleep classification is 1/30.

Finding window parameters that fulfill these conditions is easier than it sounds. We will always pass data
to the neural network that is 10 hours long. Now, we just need to think in seconds and find integer values
for 'window_duration_seconds' and 'overlap_seconds' that are multiples of 30:

#### Finding optimal window_parameters:

In [66]:
find_suitable_window_parameters(
        signal_length = 10 * 3600,
        number_windows_range = (1000, 1400),
        window_size_range = (120, 180),
        minimum_window_size_overlap_difference = 30
    )

Suitable window parameters for signal of length: 36000:
-------------------------------------------------------
Number of windows: 1025, Window size: 160, Overlap: 125.0
Number of windows: 1026, Window size: 125, Overlap: 90.0
Number of windows: 1055, Window size: 164, Overlap: 130.0
Number of windows: 1056, Window size: 130, Overlap: 96.0
Number of windows: 1087, Window size: 162, Overlap: 129.0
Number of windows: 1088, Window size: 129, Overlap: 96.0
Number of windows: 1121, Window size: 160, Overlap: 128.0
Number of windows: 1122, Window size: 128, Overlap: 96.0
Number of windows: 1157, Window size: 164, Overlap: 133.0
Number of windows: 1158, Window size: 133, Overlap: 102.0
Number of windows: 1196, Window size: 150, Overlap: 120.0
Number of windows: 1197, Window size: 120, Overlap: 90.0


Only two of these combinations have a window size and an overlap that are both multiples of 30, so our options are:

Number of windows: 1196, Window size: 150, Overlap: 120.0 \
Number of windows: 1197, Window size: 120, Overlap: 90.0

We will choose the latter, because we don't want the window_size to be too large.
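
The numbers in the list above follow from one relation: a signal of length L fits perfectly into n windows of size w with overlap o if (L - w) is divisible by (w - o), and then n = (L - w) / (w - o) + 1. A small sketch to check individual combinations follows. `count_windows` is a hypothetical helper and not part of 'dataset_processing.py'.

```python
def count_windows(signal_length, window_size, overlap):
    """Return the number of windows if the signal fits perfectly, otherwise None."""
    stride = window_size - overlap
    if stride <= 0 or signal_length < window_size:
        return None
    steps, remainder = divmod(signal_length - window_size, stride)
    return steps + 1 if remainder == 0 else None


print(count_windows(36000, 120, 90))   # → 1197
print(count_windows(36000, 150, 120))  # → 1196
print(count_windows(36000, 121, 91))   # → None
```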

#### Classification Signal

When a classification signal that serves as the target of the neural network is transformed into windows,
each window is represented only by its most common sleep stage. If there is a tie
between labels, the one with the highest priority is chosen.

In [67]:
signal_length_seconds = 10 * 3600
frequency = 1/30
signal_length = int(signal_length_seconds * frequency)

signal = np.array([random.randint(0, 5) for _ in range(signal_length)])

signal_in_windows = signal_to_windows(
    signal = signal, # type: ignore
    datapoints_per_window = int(120 * frequency),
    window_overlap = int(90 * frequency),
    signal_type = "target",
    priority_order = [0, 1, 2, 3, 4, 5, -1]
    )

print(f"Signal shape: {signal.shape}")
print(f"Signal in windows shape: {signal_in_windows.shape}")

del signal, signal_in_windows, signal_length_seconds, frequency, signal_length

Signal shape: (1200,)
Signal in windows shape: (1197,)
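
The reduction of each target window to a single label can be sketched as a majority vote with a tie-break. The helpers `window_label` and `labels_to_windows` are hypothetical. That an earlier position in 'priority_order' means a higher priority is an assumption of this sketch, not something confirmed by `signal_to_windows`.

```python
from collections import Counter


def window_label(window, priority_order):
    """Return the most common label, breaking ties by position in `priority_order`."""
    counts = Counter(window)
    highest = max(counts.values())
    tied = [label for label, count in counts.items() if count == highest]
    # Assumption: an earlier position in priority_order wins.
    # Every label must appear in priority_order, otherwise .index raises ValueError.
    return min(tied, key=priority_order.index)


def labels_to_windows(labels, datapoints_per_window, window_overlap, priority_order):
    """Slide a window over `labels` and keep one representative label per window."""
    stride = datapoints_per_window - window_overlap
    last_start = len(labels) - datapoints_per_window
    return [
        window_label(labels[start:start + datapoints_per_window], priority_order)
        for start in range(0, last_start + 1, stride)
    ]


priority = [0, 1, 2, 3, 4, 5, -1]
print(window_label([1, 1, 2, 2], priority))
print(len(labels_to_windows([0] * 1200, 4, 3, priority)))
```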


#### Continuous Signal

In [68]:
signal = np.random.rand(36000)

signal_in_windows = signal_to_windows(
    signal = signal, # type: ignore
    datapoints_per_window = 120,
    window_overlap = 90,
    signal_type = "feature"
    )

print(f"Signal shape: {signal.shape}")
print(f"Signal in windows shape: {signal_in_windows.shape}")

del signal, signal_in_windows

Signal shape: (36000,)
Signal in windows shape: (1197, 120)
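
For a feature signal, every window keeps all of its datapoints, which is why the result is two-dimensional. The following is a minimal pure-Python sketch of the slicing, assuming the signal already fits perfectly into the windows. `to_windows` is a hypothetical helper.

```python
def to_windows(signal, datapoints_per_window, window_overlap):
    """Return overlapping windows whose starts are (window size - overlap) apart."""
    stride = datapoints_per_window - window_overlap
    last_start = len(signal) - datapoints_per_window
    return [
        signal[start:start + datapoints_per_window]
        for start in range(0, last_start + 1, stride)
    ]


windows = to_windows([0.0] * 36000, datapoints_per_window=120, window_overlap=90)
print(len(windows), len(windows[0]))
```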


#### Reshape Signal

The following function will be applied to transform a signal into overlapping windows. It will make sure
that the data is passed correctly to the function mentioned above. 

This means it will:
- check that 'number_nn_datapoints', 'datapoints_per_window' and 'window_overlap' are integers
- check that 'datapoints_per_window' and 'window_overlap' fit perfectly into 'number_nn_datapoints'
- compare the length of the provided signal to the signal length expected by the neural network ('number_nn_datapoints')
    - if shorter: pad with zeros
    - if longer: print a warning, but continue by cropping the last datapoints
- check that the signal transformed to windows has the right shape

In [69]:
random_array = np.random.rand(36000)
reshaped_array = reshape_signal_to_overlapping_windows(
    signal = random_array, # type: ignore
    target_frequency = 4, 
    number_windows = 1197, 
    window_duration_seconds = 120, 
    overlap_seconds = 90,
    signal_type = "feature",
    nn_signal_duration_seconds = 10*3600,
    )

print(f"Random array shape: {random_array.shape}")
print(f"Reshaped array shape: {reshaped_array.shape}")

random_array = np.array([random.randint(0, 3) for _ in range(int(36000/30))])
reshaped_array = reshape_signal_to_overlapping_windows(
    signal = random_array, # type: ignore
    target_frequency = 1/30, 
    number_windows = 1197, 
    window_duration_seconds = 120, 
    overlap_seconds = 90,
    signal_type = "target",
    nn_signal_duration_seconds = 10*3600,
    )

print(f"Random array shape: {random_array.shape}")
print(f"Reshaped array shape: {reshaped_array.shape}")

del random_array, reshaped_array

Random array shape: (36000,)
Reshaped array shape: (1197, 480)
Random array shape: (1200,)
Reshaped array shape: (1197,)
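
The length adjustment described in the list above can be sketched in a few lines. `pad_or_crop` is a hypothetical helper. The real function performs additional checks that are not reproduced here.

```python
def pad_or_crop(signal, number_nn_datapoints, pad_value=0):
    """Bring `signal` to exactly `number_nn_datapoints` entries."""
    signal = list(signal)
    if len(signal) < number_nn_datapoints:
        # Too short: append padding values at the end.
        return signal + [pad_value] * (number_nn_datapoints - len(signal))
    if len(signal) > number_nn_datapoints:
        # Too long: warn, then drop the surplus datapoints at the end.
        print("Warning: signal is longer than expected, cropping the last datapoints.")
        return signal[:number_nn_datapoints]
    return signal


print(pad_or_crop([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 16))
```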


#### Reverse Reshape

Reversing Reshape of feature:

In [70]:
print("Original signal:")
test = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(test)

print("\nSignal reshaped to overlapping windows:")
reshaped_test = reshape_signal_to_overlapping_windows(
    signal = test,
    target_frequency = 1,
    nn_signal_duration_seconds = 16,
    number_windows = 12,
    window_duration_seconds = 5,
    overlap_seconds = 4,
    signal_type = "feature"
    )
print(reshaped_test)

print("\nLast window when padding was cropped:")
cropped_padding = remove_padding_from_windows(
    signal_in_windows = copy.deepcopy(reshaped_test), # type: ignore
    target_frequency = 1,
    original_signal_length = 10,
    window_duration_seconds = 5, 
    overlap_seconds = 4,
    )
print(cropped_padding[-1])

print("\nSignal reshaped back to original:")
reversed_test = reverse_signal_to_windows_reshape(
    signal_in_windows = reshaped_test, # type: ignore
    target_frequency = 1,
    original_signal_length = 10,
    number_windows = 12,
    window_duration_seconds = 5,
    overlap_seconds = 4
    )
print(reversed_test)

Original signal:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Signal reshaped to overlapping windows:
[[ 1  2  3  4  5]
 [ 2  3  4  5  6]
 [ 3  4  5  6  7]
 [ 4  5  6  7  8]
 [ 5  6  7  8  9]
 [ 6  7  8  9 10]
 [ 7  8  9 10  0]
 [ 8  9 10  0  0]
 [ 9 10  0  0  0]
 [10  0  0  0  0]
 [ 0  0  0  0  0]
 [ 0  0  0  0  0]]

Last window when padding was cropped:
[10  0  0  0  0]

Signal reshaped back to original:
[ 1.  2.  3.  4.  5.  6.  7.  8.  9. 10.]


The sleep stage labels are reshaped differently: we keep only one label per window and therefore do not
create a 2D array.

After predicting the sleep stage labels, we transform them into a 2D array that our reverse reshape function
can process. Effectively, we create an array from each label that contains only this label as
elements:

In [71]:
print("Original signal:")
test = [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
print(test)
reshaped_test = reshape_signal_to_overlapping_windows(
    signal = test,
    target_frequency = 1/3,
    nn_signal_duration_seconds = 36,
    number_windows = 9,
    window_duration_seconds = 12,
    overlap_seconds = 9,
    signal_type = "target"
    )

print("\nSignal reshaped to overlapping windows:")
print(reshaped_test)

expanded_reshaped_test = []
for slp_stg in reshaped_test:
    expanded_reshaped_test.append([slp_stg for _ in range(int(12 * 1/3))])

print("\nExpanded signal:")
print(expanded_reshaped_test)

reversed_test = reverse_signal_to_windows_reshape(
    signal_in_windows = expanded_reshaped_test, # type: ignore
    target_frequency = 1/3, # type: ignore
    original_signal_length = 12,
    number_windows = 9,
    window_duration_seconds = 12,
    overlap_seconds = 9
    )

print("\nExpanded signal reshaped to original:")
print(reversed_test)
print([round(i) for i in reversed_test])

Original signal:
[1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3]

Signal reshaped to overlapping windows:
[1 1 1 2 2 2 3 3 3]

Expanded signal:
[[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1], [2, 2, 2, 2], [2, 2, 2, 2], [2, 2, 2, 2], [3, 3, 3, 3], [3, 3, 3, 3], [3, 3, 3, 3]]

Expanded signal reshaped to original:
[1.   1.   1.   1.25 1.5  1.75 2.25 2.5  2.75 3.   3.   3.  ]
[1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
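
The fractional values in the output above are what you get when every datapoint is averaged over all windows that cover it. The following is a minimal sketch of that idea. `windows_to_signal` is a hypothetical helper. That `reverse_signal_to_windows_reshape` averages in exactly this way is an inference from the outputs shown, not taken from its source.

```python
def windows_to_signal(windows, window_overlap, original_length):
    """Average overlapping windows back into one signal and crop trailing padding."""
    window_size = len(windows[0])
    stride = window_size - window_overlap
    total = stride * (len(windows) - 1) + window_size
    sums = [0.0] * total
    counts = [0] * total
    for k, window in enumerate(windows):
        for offset, value in enumerate(window):
            sums[k * stride + offset] += value
            counts[k * stride + offset] += 1
    averaged = [s / c for s, c in zip(sums, counts)]
    return averaged[:original_length]


expanded = [[label] * 4 for label in [1, 1, 1, 2, 2, 2, 3, 3, 3]]
restored = windows_to_signal(expanded, window_overlap=3, original_length=12)
print(restored)
print([round(value) for value in restored])
```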


### Normalize Signal

The implemented unity-based normalization function can normalize a multi-dimensional array either across all
arrays (global) or each array individually (local).

In [72]:
one_dimensional = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
two_dimensional = np.array([[0, 2, 4], [4, 5, 6], [6, 8, 10]])
three_dimensional = np.array([[[0, 1, 2], [3, 4, 5]], [[6, 7, 8], [8, 9, 10]]])

In [73]:
message = "Normalization_Mode: \'global\'"
print(message)
print("-"*len(message))
print("\nNormalized One dimensional array:")
print(unity_based_normalization(
        signal = one_dimensional, # type: ignore
        normalization_max = 1,
        normalization_min = 0,
        normalization_mode = "global"
    ))
print("\nNormalized Two dimensional array:")
print(unity_based_normalization(
        signal = two_dimensional, # type: ignore
        normalization_max = 1,
        normalization_min = 0,
        normalization_mode = "global"
    ))
print("\nNormalized Three dimensional array:")
print(unity_based_normalization(
        signal = three_dimensional, # type: ignore
        normalization_max = 1,
        normalization_min = 0,
        normalization_mode = "global"
    ))

Normalization_Mode: 'global'
----------------------------

Normalized One dimensional array:
[0.  0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1. ]

Normalized Two dimensional array:
[[0.  0.2 0.4]
 [0.4 0.5 0.6]
 [0.6 0.8 1. ]]

Normalized Three dimensional array:
[[[0.  0.1 0.2]
  [0.3 0.4 0.5]]

 [[0.6 0.7 0.8]
  [0.8 0.9 1. ]]]


In [74]:
message = "Normalization_Mode: \'local\'"
print(message)
print("-"*len(message))
print("\nNormalized One dimensional array:")
print(unity_based_normalization(
        signal = one_dimensional, # type: ignore
        normalization_max = 1,
        normalization_min = 0,
        normalization_mode = "local"
    ))
print("\nNormalized Two dimensional array:")
print(unity_based_normalization(
        signal = two_dimensional, # type: ignore
        normalization_max = 1,
        normalization_min = 0,
        normalization_mode = "local"
    ))
print("\nNormalized Three dimensional array:")
print(unity_based_normalization(
        signal = three_dimensional, # type: ignore
        normalization_max = 1,
        normalization_min = 0,
        normalization_mode = "local"
    ))

Normalization_Mode: 'local'
---------------------------

Normalized One dimensional array:
[0.  0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1. ]

Normalized Two dimensional array:
[[0.  0.5 1. ]
 [0.  0.5 1. ]
 [0.  0.5 1. ]]

Normalized Three dimensional array:
[[[0.  0.5 1. ]
  [0.  0.5 1. ]]

 [[0.  0.5 1. ]
  [0.  0.5 1. ]]]
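
The difference between the two modes comes down to where the minimum and maximum are taken from. The sketch below handles a two-dimensional list of rows. `rescale` and `normalize_rows` are hypothetical helpers. Mapping a constant row to the lower bound is a choice made for this sketch only.

```python
def rescale(values, new_min=0.0, new_max=1.0, lowest=None, highest=None):
    """Min-max rescale `values` into [new_min, new_max]."""
    lowest = min(values) if lowest is None else lowest
    highest = max(values) if highest is None else highest
    if highest == lowest:
        # Constant input: avoid dividing by zero (choice made for this sketch).
        return [new_min for _ in values]
    span = highest - lowest
    return [new_min + (v - lowest) / span * (new_max - new_min) for v in values]


def normalize_rows(rows, mode="global"):
    """'global': one min/max for all rows, 'local': min/max taken per row."""
    if mode == "global":
        flat = [v for row in rows for v in row]
        return [rescale(row, lowest=min(flat), highest=max(flat)) for row in rows]
    return [rescale(row) for row in rows]


rows = [[0, 2, 4], [4, 5, 6], [6, 8, 10]]
print(normalize_rows(rows, mode="global"))
print(normalize_rows(rows, mode="local"))
```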


### Alter Sleep Labels

The following function makes sure the labels are kept uniform.

In [75]:
slp = np.array([-2, -1, 0, 1, 2, 3, 4, 5, 6, 7])
print(slp)

current_labels = {"wake": [0, 1], "LS": [2], "DS": [3], "REM": [5], "artifect": ["other"]}
desired_labels = {"wake": 0, "LS": 1, "DS": 2, "REM": 3, "artifect": -1}

print(alter_slp_labels(
        slp_labels = slp, # type: ignore
        current_labels = current_labels,
        desired_labels = desired_labels,
))

[-2 -1  0  1  2  3  4  5  6  7]
[-1 -1  0  0  1  2 -1  3 -1 -1]


In [76]:
slp = np.array(["light_sleep", "deep_sleep", "deep_sleep_2", "WAKE", "REM", "bla", "blub"])
print(slp)

current_labels = {"wake": ["WAKE"], "LS": ["light_sleep"], "DS": ["deep_sleep", "deep_sleep_2"], "REM": ["REM"], "artifect": ["other"]}
desired_labels = {"wake": 0, "LS": 1, "DS": 2, "REM": 3, "artifect": -1}

print(alter_slp_labels(
        slp_labels = slp, # type: ignore
        current_labels = current_labels,
        desired_labels = desired_labels,
))

['light_sleep' 'deep_sleep' 'deep_sleep_2' 'WAKE' 'REM' 'bla' 'blub']
['1' '2' '2' '0' '3' '-1' '-1']
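
Both examples follow the same pattern: every listed current value is mapped to the desired label of its stage, and everything else falls back to the stage whose list contains "other". The following is a minimal sketch of that lookup. `relabel` is a hypothetical helper, the stage names are free-form keys here, and unlike the NumPy string array above it returns plain integers.

```python
def relabel(labels, current_labels, desired_labels):
    """Map raw labels to desired labels, with "other" acting as the fallback marker."""
    lookup = {}
    fallback = None
    for stage, old_values in current_labels.items():
        for old in old_values:
            if old == "other":
                fallback = desired_labels[stage]
            else:
                lookup[old] = desired_labels[stage]
    return [lookup.get(label, fallback) for label in labels]


current = {"wake": [0, 1], "LS": [2], "DS": [3], "REM": [5], "artifact": ["other"]}
desired = {"wake": 0, "LS": 1, "DS": 2, "REM": 3, "artifact": -1}
print(relabel([-2, -1, 0, 1, 2, 3, 4, 5, 6, 7], current, desired))
```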


Label transformation used in a previous sleep stage classification project (not my own):

In [77]:
slp = np.array([-2, -1, 0, 1, 2, 3, 4, 5, 6, 7])
print(slp)

slp[slp>=1] = slp[slp>=1] - 1
slp[slp==4] = 3
slp[slp==5] = 0
slp[slp==-1] = 0 # set artifact as wake stage

print(slp)

[-2 -1  0  1  2  3  4  5  6  7]
[-2  0  0  0  1  2  3  3  0  6]
