#### This notebook shows how to build a train/test split by concatenating several surgery recordings, in order to train a regression model

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from pathlib import Path
from sklearn.model_selection import train_test_split

from doa_zero_eeg.utils import utils

## 1 - Load the data and split for training and testing based on paths

In [2]:
# Filtered dataset: one surgery per file, restricted to the time window with enough BIS values
data_path = utils.FILTERED_DATA_DIR.rglob('*.parquet')

# Sort the paths: rglob order depends on the filesystem, and the seeded split depends on list order
file_paths = sorted(data_path)

# 80 surgery recordings for training, 20 for testing
utils.set_seed(42) # always use seed 42 for reproducibility
path_train, path_test = train_test_split(file_paths, test_size=0.2, random_state=42)
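
Splitting the list of files, rather than the rows of a concatenated DataFrame, keeps every surgery entirely in one set. The sketch below illustrates this on made-up file names (the `filtered/surgery_XXX.parquet` paths are placeholders, not the real dataset) and checks that the two sets share no recording.

```python
from pathlib import Path
from sklearn.model_selection import train_test_split

# Hypothetical file names standing in for the real parquet recordings.
file_paths = sorted(Path("filtered") / f"surgery_{i:03d}.parquet" for i in range(100))

# Same call as in the cell above: the split is done on files, not on rows.
path_train, path_test = train_test_split(file_paths, test_size=0.2, random_state=42)

# No recording appears in both sets, so no surgery leaks from train to test.
overlap = set(path_train) & set(path_test)
print(len(path_train), len(path_test), len(overlap))  # 80 20 0
```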

## 2 - Concatenate all surgeries for train and test set

In [3]:
df_train = utils.concatenate_surgeries(path_train, utils.FILTERED_DATA_DIR)
df_test = utils.concatenate_surgeries(path_test, utils.FILTERED_DATA_DIR)

df_train

SubLabel                           BIS  CO₂fe    FC  PNIm       SpO₂  rec_id                                 scope_session
TimeStamp
2022-03-11 13:15:01.917000+00:00  54.0   24.0  80.0  84.0  98.699997       0  cd377160-f477-4f4e-963f-3b6441fa6a8a.parquet
2022-03-11 13:15:02.941000+00:00  54.0   21.0  81.0  84.0  98.599998       0  cd377160-f477-4f4e-963f-3b6441fa6a8a.parquet
2022-03-11 13:15:03.965000+00:00  56.0   21.0  81.0  84.0  98.599998       0  cd377160-f477-4f4e-963f-3b6441fa6a8a.parquet
2022-03-11 13:15:04.989000+00:00  56.0   21.0  81.0  84.0  98.599998       0  cd377160-f477-4f4e-963f-3b6441fa6a8a.parquet
2022-03-11 13:15:06.013000+00:00  56.0   21.0  81.0  84.0  98.500000       0  cd377160-f477-4f4e-963f-3b6441fa6a8a.parquet
...                                ...    ...   ...   ...        ...     ...                                           ...
2022-11-01 06:59:55.442000+00:00  40.0   34.0  46.0  63.0  97.199997      79  80b98665-5fe0-462c-a6cd-c9b9943abc37.parquet
2022-11-01 06:59:56.466000+00:00  42.0   34.0  47.0  63.0  97.199997      79  80b98665-5fe0-462c-a6cd-c9b9943abc37.parquet
2022-11-01 06:59:57.490000+00:00  42.0   34.0  48.0  63.0  97.199997      79  80b98665-5fe0-462c-a6cd-c9b9943abc37.parquet
2022-11-01 06:59:58.514000+00:00  43.0   34.0  48.0  63.0  97.199997      79  80b98665-5fe0-462c-a6cd-c9b9943abc37.parquet


#### As you can see, the training DataFrame contains the BIS and all 4 signals (the function has already dropped every row containing at least one NaN value), plus rec_id, the recording id of the surgery, which identifies the surgery each row belongs to. It also contains scope_session, the name of the source file, which can be useful to pick out a specific recording and look at its plot.
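
To make the behaviour described above concrete, here is a minimal sketch of what such a concatenation step could look like. It is not the actual implementation of `utils.concatenate_surgeries`: the function `concatenate_frames` is hypothetical and takes in-memory DataFrames instead of reading parquet files, and the two tiny input frames are invented for illustration.

```python
import pandas as pd

def concatenate_frames(frames_by_name):
    """Hypothetical stand-in for utils.concatenate_surgeries.

    Takes a dict mapping file name -> per-surgery DataFrame
    (the real helper reads parquet files from disk instead).
    """
    parts = []
    for rec_id, (name, df) in enumerate(frames_by_name.items()):
        part = df.dropna().copy()      # drop rows with at least one NaN
        part["rec_id"] = rec_id        # integer id of the surgery
        part["scope_session"] = name   # name of the source file
        parts.append(part)
    return pd.concat(parts)

# Two made-up recordings; the first one has a missing BIS value.
a = pd.DataFrame({"BIS": [54.0, None], "FC": [80.0, 81.0]})
b = pd.DataFrame({"BIS": [40.0, 42.0], "FC": [46.0, 47.0]})

df = concatenate_frames({"a.parquet": a, "b.parquet": b})
print(df["rec_id"].tolist())  # [0, 1, 1]
```

Keeping rec_id on every row is what later allows per-surgery operations, such as grouping rows by recording before plotting or preprocessing.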

## 3 - Create X_train, y_train, X_test and y_test

In [4]:
X_train, y_train = df_train[["CO₂fe", "FC", "PNIm", "SpO₂"]], df_train["BIS"]
X_test, y_test = df_test[["CO₂fe", "FC", "PNIm", "SpO₂"]], df_test["BIS"]

print("X_train")
print(X_train.head())

print("\ny_train")
print(y_train.head())

X_train
SubLabel                          CO₂fe    FC  PNIm       SpO₂
TimeStamp                                                     
2022-03-11 13:15:01.917000+00:00   24.0  80.0  84.0  98.699997
2022-03-11 13:15:02.941000+00:00   21.0  81.0  84.0  98.599998
2022-03-11 13:15:03.965000+00:00   21.0  81.0  84.0  98.599998
2022-03-11 13:15:04.989000+00:00   21.0  81.0  84.0  98.599998
2022-03-11 13:15:06.013000+00:00   21.0  81.0  84.0  98.500000

y_train
TimeStamp
2022-03-11 13:15:01.917000+00:00    54.0
2022-03-11 13:15:02.941000+00:00    54.0
2022-03-11 13:15:03.965000+00:00    56.0
2022-03-11 13:15:04.989000+00:00    56.0
2022-03-11 13:15:06.013000+00:00    56.0
Name: BIS, dtype: float32


#### At this point you can use X_train and y_train to train your regression model, after applying any scaling or per-signal preprocessing you need.
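
As one possible next step, the sketch below chains a scaler and a regressor in a scikit-learn pipeline, so that the scaling statistics are learned from the training set only. The choice of `StandardScaler` and `Ridge` is an assumption for illustration, not a recommendation from this notebook, and the data is randomly generated so the block runs on its own; with the real data you would reuse the X_train, y_train, X_test and y_test created above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
cols = ["CO₂fe", "FC", "PNIm", "SpO₂"]

def make_fake(n):
    """Generate synthetic features and a synthetic BIS-like target (not real data)."""
    X = pd.DataFrame(rng.normal(size=(n, 4)), columns=cols)
    noise = rng.normal(scale=0.5, size=n)
    y = pd.Series(50 + 5 * X["FC"] - 3 * X["PNIm"] + noise, name="BIS")
    return X, y

X_train, y_train = make_fake(400)
X_test, y_test = make_fake(100)

# The pipeline fits the scaler on the training data only, then the regressor.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X_train, y_train)

# Evaluate on the held-out set; the scaler reuses the training statistics.
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print(f"MAE on the synthetic test set: {mae:.2f}")
```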