# Data Pre-Processing of the CHB-MIT Scalp Database
This notebook contains the code for the creation of a dataset of labeled time windows for seizure detection based on the CHB-MIT Scalp EEG Database. <br>
1. [Imports](#1-imports)
2. [Define Functions](#2-define-functions)
3. [Generate Data](#3-generate-data)
4. [Old Functions](#4-old-functions)


## Dataset Description:
Recordings, grouped into 23 cases, were collected from 22 subjects (5 males, ages 3–22; and 17 females, ages 1.5–19). (Case chb21 was obtained 1.5 years after case chb01, from the same female subject.) The file SUBJECT-INFO contains the gender and age of each subject. (Case chb24 was added to this collection in December 2010, and is not currently included in SUBJECT-INFO.)

Each case (chb01, chb02, etc.) contains between 9 and 42 continuous .edf files from a single subject. Hardware limitations resulted in gaps between consecutively-numbered .edf files, during which the signals were not recorded; in most cases, the gaps are 10 seconds or less, but occasionally there are much longer gaps. In order to protect the privacy of the subjects, all protected health information (PHI) in the original .edf files has been replaced with surrogate information in the files provided here. Dates in the original .edf files have been replaced by surrogate dates, but the time relationships between the individual files belonging to each case have been preserved. In most cases, the .edf files contain exactly one hour of digitized EEG signals, although those belonging to case chb10 are two hours long, and those belonging to cases chb04, chb06, chb07, chb09, and chb23 are four hours long; occasionally, files in which seizures are recorded are shorter.

All signals were sampled at 256 Hz with 16-bit resolution. Most files contain 23 EEG signals (24 or 26 in a few cases). The International 10-20 system of EEG electrode positions and nomenclature was used for these recordings. In a few records, other signals are also recorded, such as an ECG signal in the last 36 files belonging to case chb04 and a vagal nerve stimulus (VNS) signal in the last 18 files belonging to case chb09. In some cases, up to 5 “dummy” signals (named "-") were interspersed among the EEG signals to obtain an easy-to-read display format; these dummy signals can be ignored.

The file RECORDS contains a list of all 664 .edf files included in this collection, and the file RECORDS-WITH-SEIZURES lists the 129 of those files that contain one or more seizures. In all, these records include 198 seizures (182 in the original set of 23 cases); the beginning ([) and end (]) of each seizure is annotated in the .seizure annotation files that accompany each of the files listed in RECORDS-WITH-SEIZURES. In addition, the files named chbnn-summary.txt contain information about the montage used for each recording, and the elapsed time in seconds from the beginning of each .edf file to the beginning and end of each seizure contained in it. <br>
Source: <a href="https://physionet.org/content/chbmit/1.0.0/">PhysioNet</a>

## 1. Imports
Import required libraries. <br>
External packages can be installed via the `pip install -r requirements.txt` command or the notebook-cell below.

In [None]:
! pip install -r ../requirements.txt

In [None]:
# Import built-in libraries
import os
import re
import glob
import shutil
import random
import subprocess

# Import data science libraries
import numpy as np
import pandas as pd

# Import library for processing edf-files
import pyedflib

# Import progress bar library
from tqdm import tqdm

In [None]:
if(not(os.path.exists("../00_Data/chb-mit-scalp-eeg-database-1.0.0/"))):
    print("File not found. Download started... (this might take a while):")
    subprocess.call(['sh', '../bin/download_and_unzip_data.sh'])

## 2. Define Functions
The following functions are needed to process the raw EEG data. <br>
Each function is documented with a docstring, and every line is commented.

### 2.1 get_patient_dict()
This function reads the `summary.txt`-file that is located in the folder of every patient. It contains the channel-mapping, start-, and end-time of each file as well as the start and end of each individual seizure. This function parses every line and creates a dictionary that can be used for labeling the final dataframe.
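As a minimal sketch of this parsing step, the example below runs on a short, made-up text snippet that only imitates the assumed layout of a summary file (channel lines, file name lines, seizure start and end lines). The helper name `parse_summary_text` and the sample values are illustrative assumptions and not part of the notebook.

```python
import re

# Illustrative snippet in the assumed layout of a summary file (not copied from the database)
sample_summary = """\
Channel 1: FP1-F7
Channel 2: F7-T7
File Name: chb01_03.edf
Number of Seizures in File: 1
Seizure Start Time: 2996 seconds
Seizure End Time: 3036 seconds
File Name: chb01_04.edf
Number of Seizures in File: 0
"""

def parse_summary_text(text):
    """Hypothetical helper: parse summary text into a dict of channels and seizure times."""
    parsed = {"channel_list": []}
    current_file = None
    for line in text.splitlines():
        file_match = re.search(r"File Name:\s*(\S+\.edf)", line)
        channel_match = re.search(r"Channel\s\d+:\s(\S+)", line)
        time_match = re.search(r"Seizure(?: \d+)? (Start|End) Time:\s*(\d+)\sseconds", line)
        if file_match:  # A new file section starts
            current_file = file_match.group(1)
            parsed[current_file] = {"seizure_start": [], "seizure_end": []}
        elif channel_match:  # A channel description line
            parsed["channel_list"].append(channel_match.group(1))
        elif time_match and current_file is not None:  # A seizure boundary line
            key = "seizure_start" if time_match.group(1) == "Start" else "seizure_end"
            parsed[current_file][key].append(int(time_match.group(2)))
    return parsed

summary_dict = parse_summary_text(sample_summary)
print(summary_dict)
```

Files without seizures still get an entry with empty lists, which lets the labeling step treat every file the same way.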

In [None]:
def get_patient_dict(patient:str, root_path:str) -> dict:
    '''
    Creates dictionary of files and seizures of a patient

    Parameters
    ----------
    patient : str
        identifier of the patient
    root_path : str
        path to the root directory of the scalp database

    Returns
    -------
    dict
        dictionary that contains the list of channels and times of seizures for each file
    '''
    with open(root_path + patient + '/' + patient + '-summary.txt', 'r') as summary_file: # Open txt file and close it after reading
        info_file = summary_file.readlines() # Read all lines of the txt file
    patient_dict = {'channel_list': []} # Create empty dictionary
    for line in info_file: # Iterate over lines in txt file
        if(re.findall(r'((File Name: )\D*\d*(_)\d*(.edf)|(File Name: )\D*\d*\D(_)\d*(.edf))', line)): # If information about next file
            file = re.findall(r'((?:chb)\d*_\d*(?:.edf)|(?:chb)\d*\D_\d*(?:.edf))', line)[0] # Get filename
            patient_dict[file] = {'seizure_start': [], 'seizure_end': []} # Create new sub-dict for new file
        elif(re.findall(r'Channel \d+', line)): # If channel description
            patient_dict['channel_list'].append(str(re.findall(r'Channel\s\d+:\s(\S*)', line)[0])) # Add channels to list
        elif(re.findall(r'Seizure Start Time|Seizure \d+ Start Time', line)): # If seizure start timestamp
            patient_dict[file]['seizure_start'].append(int(re.findall(r'(\d+)\sseconds', line)[0])) # Add seizure start to list
        elif(re.findall(r'Seizure End Time|Seizure \d+ End Time', line)): # If seizure end timestamp
            patient_dict[file]['seizure_end'].append(int(re.findall(r'(\d+)\sseconds', line)[0])) # Add seizure end to list
    return patient_dict
# Based on the approach of: https://github.com/Eldave93/Seizure-Detection-Tutorials/blob/master/Extra_01_Assemble_Feature_DataFrames.ipynb (last access on: 08.06.2023)

### 2.2 get_labeled_file()
This function reads one `.edf` file of a patient and converts it into a pandas DataFrame. Using the dictionary created by the previous function, each sample is labeled according to whether a seizure is present, and the file name is added as a column for later window processing. <br>
Because the seizure boundaries are given in seconds from the beginning of a file, a temporary column with the elapsed second of each sample (derived from the sampling frequency) is added.
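The mapping from seconds to sample labels can be sketched as follows. The sampling rate of 4 Hz and the seizure interval are toy values chosen for readability (the recordings themselves use 256 Hz), and the variable names are illustrative.

```python
import numpy as np

sampling_rate = 4                       # Toy value in Hz; assumption for illustration only
n_samples = 10 * sampling_rate          # Ten seconds of data
seizure_intervals = [(3, 5)]            # Hypothetical seizure from second 3 to second 5

# Elapsed whole second of every sample, analogous to the temporary "seconds" column
seconds = np.arange(n_samples) // sampling_rate

# Label every sample whose second lies inside a seizure interval (both ends inclusive)
labels = np.zeros(n_samples, dtype="int8")
for start_second, end_second in seizure_intervals:
    labels[(seconds >= start_second) & (seconds <= end_second)] = 1

print(labels)
```

Because both interval ends are inclusive, all samples of the end second are labeled as well, so the interval from second 3 to second 5 covers three full seconds of samples.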

In [None]:
def get_labeled_file(file_path:str, channel_list:list, patient_dict:dict) -> pd.DataFrame:
    '''
    Converts a single file from edf-format to a pandas DataFrame and adds a label if a seizure is present

    Parameters
    ----------
    file_path : str
        relative path to the file
    channel_list : list
        list of the requested channels names
    patient_dict : dict
        dict that contains the seizure information for each file of the patient

    Returns
    -------
    pd.DataFrame
        pandas DataFrame that contains the requested channels with seizure labels
    '''
    edf_file = pyedflib.EdfReader(file_path) # Read edf file
    if not set(channel_list).issubset(set(edf_file.getSignalLabels())): # Check if all requested channels are present
        raise ValueError("File " + file_path + " does not contain requested channels!") # Raise error if not
    signal_data = np.zeros((edf_file.getNSamples()[0], len(channel_list))) # Create empty array for data
    for i, channel in enumerate(channel_list): # Iterate over channels
        signal_data[:, i] = edf_file.readSignal(edf_file.getSignalLabels().index(channel)) # Add channel data to array
    dataframe = pd.DataFrame(signal_data, columns=channel_list).astype('float32') # Create a dataframe from array
    dataframe["seconds"] = np.floor(np.linspace(0, len(dataframe)/edf_file.getSampleFrequencies()[0], len(dataframe), endpoint=False)).astype('uint16') # Add seconds column
    file_name = re.findall(r'([^\/]+$)', file_path)[-1] # Get name of file
    seizure_start_list = patient_dict.get(file_name).get("seizure_start") # Get list of seizure starts for file
    seizure_end_list = patient_dict.get(file_name).get("seizure_end") # Get list of seizure ends for file
    dataframe["seizure"] = 0 # Create new column for seizure labels
    if(len(seizure_start_list) > 0): # If seizures are present in file
        for seizure in range(len(seizure_start_list)): # Iterate over seizures
            start_second = seizure_start_list[seizure] # Get current start of seizure
            end_second = seizure_end_list[seizure] # Get current end of seizure
            dataframe.loc[dataframe["seconds"].between(start_second, end_second), "seizure"] = 1 # Label timeframe of seizure
    dataframe = dataframe.drop(columns=["seconds"]) # Drop seconds column
    dataframe["file_name"] = file_name # Add column with file name for later time window processing
    return dataframe
# Based on the approach of: https://github.com/Eldave93/Seizure-Detection-Tutorials/blob/master/Extra_01_Assemble_Feature_DataFrames.ipynb (last access on: 08.06.2023)

### 2.3 get_complete_patient_data()
The output of this function is a dataframe that contains the complete, labeled, and enriched data of one patient. In addition, a timestamp column is added for optional later resampling.
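The spacing of the timestamp column follows from the sampling rate: at 256 Hz, one sample lasts 1/256 s, which is exactly 3,906,250 ns. The sketch below checks this arithmetic. It passes the period as a `pd.Timedelta` instead of a frequency string, which is an illustrative choice (in recent pandas versions the nanosecond alias is, to my knowledge, spelled `ns` rather than `N`).

```python
import pandas as pd

sampling_rate = 256                                  # Hz, as stated in the dataset description
period_ns = 1_000_000_000 // sampling_rate           # Duration of one sample in nanoseconds

# One second of data plus one extra sample, spaced by the sample period
timestamps = pd.date_range(
    "1970-01-01 00:00:00",
    periods=sampling_rate + 1,
    freq=pd.Timedelta(period_ns, unit="ns"),
)

print(period_ns)
print(timestamps[-1] - timestamps[0])
```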

In [None]:
def get_complete_patient_data(patient:str, channel_list:list, root_path:str) -> pd.DataFrame:
    '''
    Creates a pandas DataFrame that contains all requested channels of the complete eeg data of a patient

    Parameters
    ----------
    patient : str
        identifier of the patient
    channel_list : list
        list of the requested channels names
    root_path: str
        path to the root directory of the scalp database

    Returns
    -------
    pd.DataFrame
        pandas DataFrame that contains the complete labeled eeg data of one patient
    '''
    parent_path = root_path + patient # Get path of patients parent directory
    all_patient_files = sorted(glob.glob(os.path.join(parent_path , ("*.edf")))) # Get list of patients files
    all_patient_files = [ x for x in all_patient_files if "+" not in x ] # Clean file list
    patient_dict = get_patient_dict(patient=patient, root_path=root_path) # Get dict of patient data information
    concat_list = [] # Create empty list for files
    bar = tqdm(total=len(all_patient_files)) # Create progress bar
    for file in all_patient_files: # Iterate over all files
        concat_list.append(get_labeled_file(file_path=file, channel_list=channel_list, patient_dict=patient_dict)) # Get labeled dataframe of file (errors propagate to the caller)
        bar.update(1) # Update progress bar
    bar.close() # Close progress bar
    dataframe = pd.concat(concat_list, axis=0, ignore_index=True) # Combine all dataframes into one
    dataframe["patient"] = patient # Create column with patient identifier
    dataframe["timestamp"] = pd.date_range('1970-01-01 00:00:00', freq='3906250N', periods=len(dataframe)) # Add timestamp for later resampling
    subject_info = pd.read_csv(root_path + '/SUBJECT-INFO', delimiter=r'\s+').drop(columns=["(years)"]).set_index("Case", drop=True).loc[patient].to_dict() # Get subject info as dict
    dataframe["age"] = float(subject_info.get("Age")) # Add column with subject age
    if(subject_info.get("Gender")=='F'): # If subject is female
        dataframe["gender"] = 1 # Add column with encoded gender
    elif(subject_info.get("Gender")=='M'): # If subject is male
        dataframe["gender"] = 0 # Add column with encoded gender
    else: # If subject gender is not defined
        dataframe["gender"] = -1 # Add column with arbitrary value
    return dataframe

### 2.4 resample_dataframe()
This function downsamples a dataframe to the desired frequency by keeping the first sample of each interval, and drops the timestamp index. No anti-aliasing filter is applied.
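A small sketch of this behavior on toy data: 256 samples spanning one second are reduced to one sample per 10 ms interval. The column names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# One second of toy data at 256 Hz; the value equals the original sample number
toy = pd.DataFrame({
    "timestamp": pd.date_range("1970-01-01", periods=256, freq=pd.Timedelta(3906250, unit="ns")),
    "value": np.arange(256, dtype="float32"),
})

# Keep the first sample that falls into each 10 ms interval and drop the timestamp index
downsampled = toy.resample(rule="10ms", on="timestamp").agg("first").reset_index(drop=True)

print(len(downsampled))
print(downsampled["value"].head().tolist())
```

The retained samples are not evenly spaced in the original signal, because 10 ms is not a multiple of the original sample period of about 3.9 ms.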

In [None]:
def resample_dataframe(dataframe:pd.DataFrame, resample_freq:str, time_col:str) -> pd.DataFrame:
    '''
    Resamples a pandas DataFrame to the desired frequency

    Parameters
    ----------
    dataframe : pd.DataFrame
        dataframe to be resampled
    resample_freq : str
        target data frequency
    time_col : str
        column that contains the timestamp

    Returns
    -------
    pd.DataFrame
        pandas DataFrame that contains the resampled data
    '''
    resampled_df = dataframe.resample(rule=resample_freq, on=time_col).agg("first") # Resample dataframe and select the first value
    resampled_df = resampled_df.reset_index(drop=True) # Drop timestamp index
    return resampled_df

### 2.5 create_balanced_time_windows()
This function is the fourth iteration of a sliding time window generation method. The following problems must be addressed:
- Memory limitations: The generation of long overlapping time windows is very memory intensive and can quickly lead to out-of-memory (OOM) exceptions
- Processing time: Depending on the device used, this process can take very long (up to 48 h per patient)
- Imbalance: The resulting data is extremely imbalanced because seizures are infrequent and short compared to the total recording time

To solve these issues, the following concepts were used:
- Memory limitations: Instead of creating all possible time windows for the data, only a subset based on the balance ratio is extracted
- Processing time: This issue is solved by the reduced number of windows as well as by pre-allocating the memory for the NumPy arrays
- Imbalance: To address the imbalance, the window start indices are first split into two sets, one per label. A random sample of the majority-class indices is then drawn based on a defined `balance_ratio`, which avoids highly imbalanced data and reduces the overall number of windows.

Because the positive window start indices are taken from samples that already lie inside a seizure, these windows never contain the onset of a seizure. Therefore, an edge-case handling step creates additional time windows that start before the seizure and supplement the training data.
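The index selection described above can be sketched on a toy label series. The label layout, `step`, and `balance_ratio` values below are assumptions chosen so that the counts are easy to follow, and the edge-case handling is left out for brevity.

```python
import random
import numpy as np

labels = np.array([0] * 60 + [1] * 20 + [0] * 120)   # Toy label series with one seizure
step = 2                                             # Keep every second positive index
balance_ratio = 1.5                                  # Negative windows per positive window
random_seed = 28

# Start indices of positive windows: samples inside the seizure, thinned by the step
positive_starts = np.flatnonzero(labels == 1)[::step].tolist()

# Start indices of negative windows: random subset of seizure-free samples
negative_pool = np.flatnonzero(labels == 0).tolist()
random.seed(random_seed)
negative_starts = random.sample(negative_pool, int(len(positive_starts) * balance_ratio))

window_starts = positive_starts + negative_starts
print(len(positive_starts), len(negative_starts), len(window_starts))
```

Only the selected start indices are turned into windows, so the memory footprint scales with the number of seizure samples rather than with the full recording length.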

In [None]:
def create_balanced_time_windows(dataframe:pd.DataFrame, window_length:int, id_column:str, label_column:str, balance_ratio:float, step:int, extract_series:bool, label_max:bool, random_state:int) -> tuple:
    '''
    Creates sliding time windows based on time series data

    Parameters
    ----------
    dataframe : pd.DataFrame
        dataframe that contains the time series data
    window_length : int
        length of the time windows (number of samples)
    id_column : str
        column that contains the id of the patients
    label_column : str
        column that contains the label
    balance_ratio : float
        ratio between the number of majority class and minority class windows
    step : int
        step between windows
    extract_series : bool
        decides whether y is a series or a single label
    label_max : bool
        extract labels based on maximum or average in time window
    random_state : int
        random state for resampling

    Returns
    -------
    X : np.array
        array containing the feature values for each window
    y : np.array
        array containing the label(s) for each window
    '''
    unique_ids = dataframe[id_column].unique() # Create list of unique ids in dataframe
    for id in unique_ids: # Iterate over ids (safety feature)
        index_positive = list(dataframe[(dataframe[id_column] == id) & (dataframe[label_column] == 1)].index.values)[::step] # Get indices of all positive samples and apply step
        index_negative = dataframe[((dataframe[id_column] == id) & (dataframe[label_column] == 0))].index # Get indices of all negative samples
        index_positive_edge_case = [] # Create empty list for ids, that are directly at the beginning of seizures
        for idx in index_positive: # Iterate over positive indices
            if((idx-window_length-1)>=0): # If the shifted window start lies inside the dataframe
                if(dataframe[label_column].iloc[(idx-window_length)]==0): # If the sample one window length earlier is seizure-free
                    index_positive_edge_case.append(int(idx-window_length-1)) # Add new index to list
        index_positive = index_positive + index_positive_edge_case # Combine positive indices with edge case indices
        random.seed(random_state) # Set seed for random
        index_negative_sample = random.sample(list(index_negative), int(len(index_positive) * balance_ratio)) # Sample subset of negative indices
        sample_indices = list(index_positive + index_negative_sample) # Combine indices lists
        X = np.zeros((len(sample_indices), window_length, (len(dataframe.columns)-3)), dtype='float32') # Create empty array for features
        if extract_series:
            y = np.zeros((len(sample_indices), window_length), dtype='int8') # Create empty array for label series
        else:
            y = np.zeros((len(sample_indices), 1), dtype='int8') # Create empty array for single labels
        bar = tqdm(total=len(sample_indices)) # Create progress bar
        i = 0 # Set iteration variable
        for index in sample_indices: # Iterate over indices
            end_index = index + window_length # Calculate end index of current window
            if (end_index <= len(dataframe)): # If the window does not exceed the length of the dataframe
                if(dataframe["file_name"].iloc[index] == dataframe["file_name"].iloc[end_index-1]): # If first and last sample of the window come from the same file (safety feature)
                    seq_X = dataframe.drop(columns=[id_column, label_column, "file_name"]).iloc[index:end_index].values.tolist() # Create list of feature values in window
                    if extract_series: # If a series of labels is to be extracted
                        seq_y = dataframe[label_column].iloc[index:end_index].values.tolist() # Create list of label values in window
                    else:
                        if(label_max):
                            seq_y = np.amax(np.array(dataframe[label_column].iloc[index:end_index].values.tolist())) # If seizure anywhere present
                        else:
                            seq_y = round(np.mean(np.array(dataframe[label_column].iloc[index:end_index].values.tolist()), axis=0)) # Get most present label
                    X[i] = seq_X # Add feature window to main list
                    y[i] = seq_y # Add label (window) to main list
            bar.update(1) # Update progress bar
            i += 1 # Increment iteration variable
        bar.close() # Close progress bar
        X = np.array(X) # Create numpy array from window feature list
        y = np.int_(np.array(y)) # Create numpy array from window label list
    return X, y
# With modifications taken from my bachelor thesis

### 2.6 scalp_database_to_dataframe()
This function works in two stages. In the first stage, the raw patient and EEG data are processed patient by patient: a temporary dataframe with the complete, labeled data of the patient is built, resampled, and cut into sliding time windows, which are stored in one temporary file per patient. In the second stage, all temporary files are combined into one large dataset for the future training and validation of machine learning models.
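The two-stage pattern (one compressed file per patient, then a merge into pre-allocated arrays) can be sketched as follows. The patient names, array shapes, and the use of a temporary directory are illustrative assumptions.

```python
import glob
import os
import tempfile

import numpy as np

window_shape = (4, 2)  # Toy shape: (window_length, n_features)

with tempfile.TemporaryDirectory() as temp_dir:
    # Stage 1: store the windows of each (hypothetical) patient in a separate compressed file
    for name, n_windows, label in [("patient_a", 3, 1), ("patient_b", 2, 0)]:
        np.savez_compressed(
            os.path.join(temp_dir, name),
            features=np.ones((n_windows, *window_shape), dtype="float32"),
            labels=np.full((n_windows, 1), label, dtype="int8"),
        )

    # Stage 2a: count all windows so the final arrays can be allocated once
    temp_files = sorted(glob.glob(os.path.join(temp_dir, "*.npz")))
    total_windows = 0
    for path in temp_files:
        with np.load(path) as data:
            total_windows += len(data["features"])

    # Stage 2b: copy each file into its slice of the pre-allocated arrays
    X_all = np.zeros((total_windows, *window_shape), dtype="float32")
    y_all = np.zeros((total_windows, 1), dtype="int8")
    offset = 0
    for path in temp_files:
        with np.load(path) as data:
            n = len(data["features"])
            X_all[offset:offset + n] = data["features"]
            y_all[offset:offset + n] = data["labels"]
            offset += n

print(X_all.shape, y_all.ravel())
```

Storing intermediate results per patient means that only one patient's raw data has to be held in memory at a time.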

In [None]:
def scalp_database_to_dataframe(patient_list:list, channel_list:list, root_path:str, conf:dict, save_dataframe:bool, save_path:str="") -> tuple:
    '''
    Processes the data of all patients into labeled time windows and combines them into one dataset

    Parameters
    ----------
    patient_list : list
        list of all patient identifiers
    channel_list : list
        list of the requested channels names
    root_path : str
        path to the root directory of the scalp database
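    conf : dict
        dictionary that contains the configuration data
    save_dataframe : bool
        decides whether the aggregated arrays are saved to disk
    save_path : str
        path of the compressed output file (only used if save_dataframe is True)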

    Returns
    -------
    X : np.array
        array containing the aggregated feature values for each window of all patients
    y : np.array
        array containing the aggregated label(s) for each window of all patients
    '''
    if not os.path.isdir(root_path + '/../Processed-Data'): # Check if processed data folder exists
        os.makedirs(root_path + '/../Processed-Data') # Create processed data folder
    if not os.path.isdir(root_path + '/../Processed-Data/Temp'): # Check if temp sub-folder exists
        os.makedirs(root_path + '/../Processed-Data/Temp') # Create temp sub-folder
    for patient in patient_list: # Iterate over patients
        print("==================================\n Processing Patient: " + patient + " (" + str(patient_list.index(patient)+1) + "/" + str(len(patient_list)) + ")\n==================================")
        try:
            print("1. Read raw files & find seizures")
            temp_df = get_complete_patient_data(patient, channel_list, root_path) # Create dataframe that contains the labeled data of a patient
            print("2. Resample Data")
            temp_df_resampled = resample_dataframe(
                dataframe = temp_df, 
                resample_freq = conf["resample"]["frequency"], 
                time_col = conf["resample"]["timestamp_column"]
                ) # Resample Dataset
            print("3. Create Sliding Time Windows")
            X_temp, y_temp = create_balanced_time_windows(
                dataframe = temp_df_resampled, 
                window_length = conf["sliding_time_window"]["window_length"], 
                id_column = "patient", 
                label_column = "seizure", 
                balance_ratio = conf["sliding_time_window"]["balance_ratio"], 
                step = conf["sliding_time_window"]["window_step"], 
                extract_series = conf["sliding_time_window"]["return_sequences"], 
                label_max = conf["sliding_time_window"]["label_max"],
                random_state = conf["sliding_time_window"]["random_seed"]
            ) # Create sliding time windows
            np.savez_compressed('../00_Data/Processed-Data/Temp/' + str(patient), features=X_temp, labels=y_temp) # Save processed data of one patient as compressed file
        except Exception as e:
            print(e)
    print("==================================\n Build Complete Data\n==================================")
    total_len = 0 # Iteration variable for total number of time windows
    all_files = sorted(glob.glob(os.path.join('../00_Data/Processed-Data/Temp/' , ("*.npz")))) # Get list of all compressed patient data files
    for file in all_files: # Iterate over all files
        total_len += len(np.load(file)["features"]) # Load compressed file and add number of windows to variable
    X = np.zeros((total_len, X_temp.shape[1], X_temp.shape[2]), dtype='float32') # Create empty array for all feature windows
    if(conf["sliding_time_window"]["return_sequences"]): # If label sequences are returned
        y = np.zeros((total_len, y_temp.shape[1]), dtype='int8') # Create empty array for label sequences
    else:
        y = np.zeros((total_len, 1), dtype='int8') # Create empty array for window labels
    i = 0 # Iteration variable for accessing empty arrays
    bar = tqdm(total=len(all_files)) # Create progress bar
    for file in all_files: # Iterate over all files again
        compressed_data = np.load(file) # Load current file
        X_temp = compressed_data["features"] # Extract features from compressed file
        y_temp = compressed_data["labels"] # Extract labels from compressed file
        X[i:i+len(X_temp)] = X_temp # Copy all window features of the file into the global array
        y[i:i+len(X_temp)] = y_temp # Copy all window label(s) of the file into the global array
        i += len(X_temp) # Increment iteration variable
        bar.update(1) # Update progress bar
    bar.close() # Close progress bar
    if save_dataframe:
        print("Saving data; Depending on the amount of data, this might take up to 20 minutes!")
        np.savez_compressed(save_path, features=X, labels=y) # Save aggregated data as compressed file
        shutil.rmtree(root_path + '/../Processed-Data/Temp') # Delete Temp folder
    return X, y

### 2.7 get_valid_channels()
Because the recorded channels and electrode montages vary between patients and even between individual files, not every channel is available everywhere. This function iterates over all files of all patients and extracts the channels that are present in every file. This ensures that the data can be processed successfully and that the resulting dataset has no missing values.
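The core of this approach is a set intersection over the channel labels of all files. A minimal sketch with made-up label lists:

```python
# Hypothetical channel labels of three files (illustrative values only)
labels_per_file = [
    ["FP1-F7", "F7-T7", "T7-P7", "ECG"],
    ["FP1-F7", "F7-T7", "T7-P7", "-"],
    ["FP1-F7", "T7-P7", "F7-T7", "VNS"],
]

# Channels that occur in every file; sorted only to make the output reproducible
common_channels = sorted(set.intersection(*map(set, labels_per_file)))
print(common_channels)
```

A single file with an unusual montage can shrink the result considerably, which is one reason to consider the alternative of filtering patients instead of channels.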

In [None]:
def get_valid_channels(patient_list:list, root_path:str) -> list:
    '''
    Creates a list of channels present for all patients in all files

    Parameters
    ----------
    patient_list : list
        list of all patient identifiers
    root_path: str
        path to the root directory of the scalp database

    Returns
    -------
    list
        list that contains the channels that are present for all files
    '''
    channel_list = [] # Create empty list for channels
    for patient in patient_list: # Iterate over patients
        parent_path = root_path + patient # Create path to patient files
        all_patient_files = sorted(glob.glob(os.path.join(parent_path , ("*.edf")))) # Get list of all patient files
        all_patient_files = [ x for x in all_patient_files if "+" not in x ] # Get file(s) with channel information
        for file in all_patient_files: # Iterate over file(s) with channel information
            temp_file = pyedflib.EdfReader(file) # Read edf file
            channel_list.append(temp_file.getSignalLabels()) # Create list with all channels from all files
    elements_in_all = list(set.intersection(*map(set, channel_list))) # Create set with channels that are present in all files
    return elements_in_all

### 2.8 get_valid_patients()
Because the recorded channels and electrode montages vary between patients and even between individual files, not every channel is available everywhere. This function takes an alternative approach and extracts the patients for which all files contain the requested channels. This is also done by iterating over all patients and files and checking the channels that are present. This ensures that the data can be processed successfully and that the resulting dataset has no missing values.
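The patient filter boils down to a subset check per file. A minimal sketch with made-up patients and label lists:

```python
required_channels = ["FP1-F7", "F7-T7"]

# Hypothetical channel labels per file for two patients (illustrative values only)
labels_by_patient = {
    "patient_a": [["FP1-F7", "F7-T7", "T7-P7"], ["FP1-F7", "F7-T7"]],
    "patient_b": [["FP1-F7", "F7-T7"], ["FP1-F7", "T7-P7"]],
}

# Keep a patient only if every one of their files contains all required channels
valid_patients = [
    patient
    for patient, files in labels_by_patient.items()
    if all(set(required_channels).issubset(file_labels) for file_labels in files)
]
print(valid_patients)
```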

In [None]:
def get_valid_patients(patient_list:list, root_path:str, requiered_channels:list) -> list:
    '''
    Creates a list of patients for which all files contain the requested channels

    Parameters
    ----------
    patient_list : list
        list of all patient identifiers
    root_path : str
        path to the root directory of the scalp database
    requiered_channels : list
        list of channels that are to be extracted

    Returns
    -------
    list
        list that contains the patients for which all files contain the requested channels
    '''
    valid_patients = []
    for patient in patient_list:
        parent_path = root_path + patient
        all_patient_files = sorted(glob.glob(os.path.join(parent_path , ("*.edf")))) # Get list of all patient files
        all_patient_files = [ x for x in all_patient_files if "+" not in x ] # Get file(s) with channel information
        requiered_channels_present = []
        for file in all_patient_files:
            temp_file = pyedflib.EdfReader(file)
            if(set(requiered_channels).issubset(temp_file.getSignalLabels())):
                requiered_channels_present.append(True)
            else:
                requiered_channels_present.append(False)
        if(False not in requiered_channels_present):
            valid_patients.append(patient)
    return valid_patients

## 3. Generate Data
After defining all necessary functions, the training data for the machine learning models can be generated. First, the root path of the raw data and a list of all patients are defined. Patient chb12 is dropped because its completely different electrode placement makes it incompatible with the other files. Next, either the channels present in all patient files or the patients whose files contain all requested channels must be selected. To ensure a correct and complete data basis for the classification of the EEG data, the second approach was chosen, and the channels of the international 10-20 system were requested. <br>

**International EEG 10-20 Electrode Placement:** <br>
<img src="99_Assets/02_Images/EEG_Elektrodenanordnung_nach_10-20_-englisch-TerniMed.jpg" alt="Topomap 10-20 System" width="50%"/><br>
With changes taken from: <a href="https://www.ternimed.de/WebRoot/Store2/Shops/62826360/MediaGallery/Bilder/EEG_Elektrodenanordnung_nach_10-20_-englisch-TerniMed.jpg">Source</a>
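The configuration used in this section expresses the window length and step in samples of the resampled signal. As a rough sketch, assuming the values `"10ms"`, `1000`, and `100` from the configuration cell, they translate into physical units as follows:

```python
resample_period_ms = 10   # Resampling interval from the configuration ("10ms")
window_length = 1000      # Samples per window after resampling
window_step = 100         # Step between positive window start indices, in samples

effective_rate_hz = 1000 // resample_period_ms                   # Samples per second after resampling
window_duration_s = window_length * resample_period_ms / 1000    # Window length in seconds
step_duration_s = window_step * resample_period_ms / 1000        # Step in seconds

print(effective_rate_hz, window_duration_s, step_duration_s)
```

With these settings, each window covers ten seconds of EEG and consecutive positive windows start one second apart.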

In [None]:
root_path = '../00_Data/chb-mit-scalp-eeg-database-1.0.0/'
all_patients = sorted([patient for patient in os.listdir(root_path) if re.match(r'(chb)\d+', patient)])
all_patients.remove("chb12")

In [None]:
# channels = get_valid_channels(patient_list=all_patients, root_path=root_path)

In [None]:
channels = ['F8-T8', 'T7-FT9', 'F4-C4', 'C3-P3', 'P7-T7', 'P7-O1', 'T8-P8', 'FP1-F7', 'P8-O2', 'T7-P7', 'C4-P4', 'FT10-T8', 'P4-O2', 'F7-T7', 'CZ-PZ', 'FP2-F8', 'P3-O1', 'FP1-F3','FP2-F4', 'FZ-CZ', 'F3-C3', 'FT9-FT10']
all_patients = get_valid_patients(patient_list=all_patients, root_path=root_path, requiered_channels=channels)

In [None]:
config_dict = {
    'resample':{
        'frequency': "10ms",
        'timestamp_column': "timestamp"
    },
    'sliding_time_window':{
       'window_length': 1000,
       'balance_ratio': 1.2,
       'window_step': 100,
       'return_sequences': False,
       'label_max': True,
       'random_seed': 28
    }
}

In [None]:
X, y = scalp_database_to_dataframe(
    patient_list=all_patients, 
    channel_list=channels, 
    root_path=root_path, 
    conf=config_dict, 
    save_dataframe=True, 
    save_path='../00_Data/Processed-Data/classification_dataset'
)

## 4. Old Functions
The following functions were created during the development of the approach and were later replaced due to multiple issues. The cells below only demonstrate the earlier approaches and are not used to create the dataset.

In [None]:
# Version 1 of Window Generation
# Issues:
#   - OOM-Exception
#   - Extremely inefficient for big data
#   - Very imbalanced data


# def create_sliding_windows(dataframe:pd.DataFrame, window_size:int, id_col:str, label_col:str) -> tuple:
#     """
#     Function for the creation of time windows of certain size

#     Parameters
#     ----------
#     dataframe : pd.DataFrame
#         Dataframe containing all features and target variable as well as a unique identifier
#     window_size : int
#         Number of timesteps in a timewindow
#     id_col : str
#         Name of the column that contains the ids
#     label_col : str
#         Name of the column that is to predicted

#     Returns
#     -------
#     X : np.array
#         Array that contains all of the features of each time window
#     y : np.array
#         Array that contains all of the labels of each time window
#     """
#     X, y = list(), list() # Create empty lists for X and y
#     unique_ids = dataframe[id_col].unique()
#     bar = tqdm(total=(len(dataframe) - window_size + 1))
#     for i in unique_ids:
#         temp_df = dataframe.loc[dataframe[id_col] == i].reset_index().drop(columns="index")
#         for n in range(len(temp_df)): # Iterate over rows of temporary dataframe
#             end_ix = n + window_size # Calculate last idx of time window
#             if (end_ix <= len(temp_df)): # If last idx is still within temporary dataframe
#                 seq_x = temp_df.drop(columns=[id_col, label_col])[n:end_ix].values.tolist()
#                 seq_y = temp_df.loc[end_ix][label_col]
#                 X.append(seq_x) # Append X of current time window to global list
#                 y.append(seq_y) # Append y of current time window to global list
#             bar.update(1)
#     bar.close()
#     X = np.array(X) # Create array from global list of X
#     y = np.array(y, dtype=np.float64) # Create array of type float from global list of y (np.float_ was removed in NumPy 2.0)
#     return X, y

In [None]:
# Version 2 of Window Generation
# Issues
#   - Still inefficient
#   - Unbalanced data

# def create_sliding_windows_step(dataframe:pd.DataFrame, window_size:int, id_col:str, label_col:str, step:int, y_series:bool) -> tuple:
#     """
#     Function for the creation of time windows of certain size

#     Parameters
#     ----------
#     dataframe : pd.DataFrame
#         Dataframe containing all features and target variable as well as a unique identifier
#     window_size : int
#         Number of timesteps in a timewindow
#     id_col : str
#         Name of the column that contains the ids
#     label_col : str
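#         Name of the column that is to be predicted
#     step : int
#         Number of rows between the start indices of consecutive windows
#     y_series : bool
#         If True, return the label of every timestep in a window; otherwise a single label per window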

#     Returns
#     -------
#     X : np.array
#         Array that contains all of the features of each time window
#     y : np.array
#         Array that contains all of the labels of each time window
#     """
#     X, y = list(), list() # Create empty lists for X and y
#     unique_ids = dataframe[id_col].unique()
#     bar = tqdm(total=len(range(0, len(dataframe), step)))
#     for i in unique_ids:
#         temp_df = dataframe.loc[dataframe[id_col] == i].reset_index().drop(columns="index")
#         for n in range(0, len(temp_df), step): # Iterate over rows of temporary dataframe
#             end_ix = n + window_size # Calculate last idx of time window
#             if (end_ix < len(temp_df)): # If end_ix is still a valid row index (its label may be read below)
#                 seq_x = temp_df.drop(columns=[id_col, label_col])[n:end_ix].values.tolist()
#                 if y_series:
#                     seq_y = temp_df.iloc[n:end_ix][label_col].values.tolist() # iloc excludes end_ix, so seq_y has exactly window_size labels
#                 else:
#                     seq_y = temp_df.loc[end_ix][label_col]
#                 X.append(seq_x) # Append X of current time window to global list
#                 y.append(seq_y) # Append y of current time window to global list
#             bar.update(1)
#     bar.close()
#     X = np.array(X) # Create array from global list of X
#     y = np.array(y, dtype=np.float64) # Create array of type float from global list of y (np.float_ was removed in NumPy 2.0)
#     return X, y

In [None]:
# Version 3 of Window Generation
# Issues
#   - Memmap has size of >100GB

# def create_balanced_time_windows(dataframe:pd.DataFrame, window_length:int, id_column:str, label_column:str, balance_ratio:float, extract_series:bool, random_state:int):
#     unique_ids = dataframe[id_column].unique()
#     for id in unique_ids:
#         print("Processing patient: " + str(id))
#         index_positive = dataframe[(dataframe[id_column] == id) & (dataframe[label_column] == 1)].index
#         index_negative = dataframe[((dataframe[id_column] == id) & (dataframe[label_column] == 0))].index
#         random.seed(random_state)
#         index_negative_sample = random.sample(list(index_negative), int(len(index_positive) * float(balance_ratio)))
#         sample_indices = list(index_positive) + index_negative_sample # Concatenate as lists; Index + list would add element-wise
#         X = np.memmap('../00_Data/Dataframes/' + str(id) + '_features.npy', np.float32, mode='w+', shape=(len(sample_indices), window_length, 18))
#         if extract_series:
#             y = np.memmap('../00_Data/Dataframes/' + str(id) + '_label.npy', np.int16, mode='w+', shape=(len(sample_indices), window_length))
#         else:
#             y = np.memmap('../00_Data/Dataframes/' + str(id) + '_label.npy', np.int16, mode='w+', shape=(len(sample_indices), 1))
#         bar = tqdm(total=len(sample_indices))
#         i = 0
#         for index in sample_indices:
#             end_index = index + window_length # Calculate last idx of time window
#             if (end_index < len(dataframe)): # end_index must be a valid row position for the iloc lookups below
#                 if(dataframe[id_column].iloc[index] == dataframe[id_column].iloc[end_index]):
#                     seq_x = dataframe.drop(columns=[id_column, "timestamp", label_column]).iloc[index:end_index].values.tolist()
#                     if extract_series:
#                         seq_y = dataframe[label_column].iloc[index:end_index]
#                     else:
#                         seq_y = dataframe[label_column].iloc[end_index]
#                     X[i] = seq_x # Write X of current time window into the memmap
#                     y[i] = seq_y # Write y of current time window into the memmap
#             bar.update(1)
#             i += 1
#         bar.close()
#     return None
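
The memmap size noted as the issue for version 3 follows directly from the array shape `(len(sample_indices), window_length, 18)` stored as `float32`. The sketch below estimates that footprint. The helper name `memmap_size_gb`, the window count, and the 10-second window length are hypothetical values chosen only for illustration; the 256 Hz sampling rate and the 18 channels come from the dataset description and the code above.

```python
import numpy as np


def memmap_size_gb(n_windows, window_length, n_channels=18, dtype=np.float32):
    """Hypothetical helper: estimate the on-disk size of the feature memmap in GB."""
    bytes_total = n_windows * window_length * n_channels * np.dtype(dtype).itemsize
    return bytes_total / 1e9


# Assumed example: 1,000,000 windows of 10 s each at 256 Hz (2560 samples per window)
size = memmap_size_gb(n_windows=1_000_000, window_length=10 * 256)
print(f"{size:.2f} GB")
```

Since every sampled index stores a full copy of its window, overlapping windows duplicate the same raw samples many times, which is why the file grows far beyond the size of the underlying recordings.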