# Data preparation examples with ml_eeg_tools

## Introduction

The `ml_eeg_tools` library is a collection of tools for working with EEG data, built specifically for this PIE. It is still evolving and is not a final version. The library is built on top of MNE and is designed to be used in the context of the PIE. In this notebook, we explore the functions that prepare the data for training.



In [1]:
import os

## Data preparation

The data preparation functions are stored in preprocessing/data_preparation.py. The main function is `prepare_data_train`.
This function takes as input a list of npy files (EEG data) and a set of settings. It loads the data, applies the preprocessing steps described in the settings, and returns a list of epochs and a list of labels for the model to be trained on. Further details are given later. An epoch is a segment extracted from the original data, defined by a start time and an end time. The label is the class of the epoch.
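To make the notion of an epoch concrete, here is a minimal sketch of cutting one epoch around an event from a continuous recording with NumPy. The helper `extract_epoch` is hypothetical and is not part of `ml_eeg_tools`. The sampling frequency of 1024 Hz is an assumption, chosen only because a window from -1 s to 2 s then contains the 3073 time points seen in the outputs later in this notebook.

```python
import numpy as np

def extract_epoch(signal, event_sample, sfreq, tmin, tmax):
    """Cut one epoch around an event from a continuous recording.

    signal: array of shape (n_channels, n_samples)
    event_sample: index of the event (e.g. movement onset) in samples
    sfreq: sampling frequency in Hz (assumed value, not read from the library)
    tmin, tmax: epoch limits in seconds, relative to the event
    """
    start = event_sample + int(round(tmin * sfreq))
    stop = event_sample + int(round(tmax * sfreq)) + 1  # include the end sample
    if start < 0 or stop > signal.shape[1]:
        raise ValueError("epoch window falls outside the recording")
    return signal[:, start:stop]

# Synthetic recording: 37 channels, 10 s at an assumed 1024 Hz
sfreq = 1024
signal = np.random.default_rng(0).normal(size=(37, 10 * sfreq))

# Epoch from 1 s before to 2 s after an event located at t = 5 s
epoch = extract_epoch(signal, event_sample=5 * sfreq, sfreq=sfreq, tmin=-1, tmax=2)
print(epoch.shape)  # (37, 3073)
```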

In [2]:
from ml_eeg_tools.preprocessing import data_preparation

### Inputs:

- `file_path_list`: list of strings, the path to the npy files
- `settings`: dictionary, the settings for the preprocessing \
    = { \
        - **'FMIN'**: 1, # the lower cutoff frequency of the bandpass filter \
        - **'FMAX'**: 40, # the upper cutoff frequency of the bandpass filter \
        - **'EPOCHS_TMIN'**: -1, # the start of the epochs, in seconds relative to the movement \
        - **'EPOCHS_TMAX'**: 2, # the end of the epochs, in seconds relative to the movement \
        - **'EPOCHS_EMPTY_FROM_MVT_TMINS'**: -4, # the start, in seconds relative to the movement, of the epochs with no movement \
        - **'EPOCHS_INTENTION_FROM_MVT_TMIN'**: -2 [*OPTIONAL: if not provided, no intention epochs are created; not used for movement classification*] # the start, in seconds relative to the movement, of the movement intention epochs. If provided, movement intention epochs are labeled 1 and both movement and no-movement epochs are labeled 0 \
        - **'BINARY_CLASSIFICATION'**: True,  # Relevant for movement classification only. If True, all movement epochs are labeled 1. If False, the movement epochs are labeled 1 if the movement is extension and 0 if the movement is flexion \
        - **'RANDOM_STATE'**: 42 # the random state used for shuffling the data \
    }
- `verbose`: boolean, If True, print information about the data
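Since the settings are passed as a plain dictionary, it can help to check them before launching a long preprocessing run. The sketch below is a hypothetical helper and is not part of `ml_eeg_tools`. It assumes that every key listed above except `EPOCHS_INTENTION_FROM_MVT_TMIN` is required, and that the sanity checks on the values are reasonable; neither assumption is confirmed by the library.

```python
# Assumption: all keys except the intention one are required
REQUIRED_KEYS = {
    'FMIN', 'FMAX', 'EPOCHS_TMIN', 'EPOCHS_TMAX',
    'EPOCHS_EMPTY_FROM_MVT_TMINS', 'BINARY_CLASSIFICATION', 'RANDOM_STATE',
}
OPTIONAL_KEYS = {'EPOCHS_INTENTION_FROM_MVT_TMIN'}

def check_settings(settings):
    """Hypothetical helper: validate the settings and return the task they describe."""
    missing = REQUIRED_KEYS - set(settings)
    if missing:
        raise ValueError(f"missing settings: {sorted(missing)}")
    unknown = set(settings) - REQUIRED_KEYS - OPTIONAL_KEYS
    if unknown:
        raise ValueError(f"unknown settings: {sorted(unknown)}")
    if not 0 < settings['FMIN'] < settings['FMAX']:
        raise ValueError("expected 0 < FMIN < FMAX")
    if not settings['EPOCHS_TMIN'] < settings['EPOCHS_TMAX']:
        raise ValueError("expected EPOCHS_TMIN < EPOCHS_TMAX")
    # The presence of the optional key switches to movement intention classification
    if 'EPOCHS_INTENTION_FROM_MVT_TMIN' in settings:
        return 'intention'
    return 'movement'

example_settings = {
    'FMIN': 1, 'FMAX': 40, 'EPOCHS_TMIN': -1, 'EPOCHS_TMAX': 2,
    'EPOCHS_EMPTY_FROM_MVT_TMINS': -4, 'BINARY_CLASSIFICATION': True,
    'RANDOM_STATE': 42,
}
print(check_settings(example_settings))  # movement
```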


### Outputs:
```
return data_patients, labels_patients, patients_id, sessions_id
```
- `data_patients`: list of patients, each patient is a list of sessions, each session is a list of epochs, each epoch is a np.array of shape (n_channels, n_times)
- `labels_patients`: list of patients, each patient is a list of sessions, each session is a list of labels, each label is 0, 1 or 2 (stored as `numpy.float64` in the examples below)
    - For **movement classification** (`EPOCHS_INTENTION_FROM_MVT_TMIN` not provided):
        - the labels are 0 for no movement and 1 for movement for binary classification (`BINARY_CLASSIFICATION` = `True`)
        - the labels are 0 for no movement, 1 for flexion and 2 for extension for multiclass classification (`BINARY_CLASSIFICATION` = `False`)
    - For **movement intention classification** (`EPOCHS_INTENTION_FROM_MVT_TMIN` provided):
        - the labels are 0 for no movement, 1 for movement intention
- `patients_id`: list of patients, each patient is a string, the id of the patient
- `sessions_id`: list of patients, each patient is a list of sessions, each session is a string, the id of the session
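Most training code expects flat arrays rather than the nested patient/session structure described above. The following sketch shows one possible way to flatten it while keeping track of which patient each epoch comes from. The helper `flatten_sessions` is hypothetical, and the data is synthetic; it assumes that all sessions share the same number of channels and time points.

```python
import numpy as np

def flatten_sessions(data_patients, labels_patients, patients_id):
    """Stack every session of every patient into X, y and a per-epoch patient id."""
    X_parts, y_parts, groups = [], [], []
    for patient_data, patient_labels, patient_id in zip(
        data_patients, labels_patients, patients_id
    ):
        for session_data, session_labels in zip(patient_data, patient_labels):
            X_parts.append(np.asarray(session_data))
            y_parts.append(np.asarray(session_labels))
            # Remember the patient of each epoch, e.g. for patient-wise splits
            groups.extend([patient_id] * len(session_labels))
    return np.concatenate(X_parts), np.concatenate(y_parts), np.array(groups)

# Synthetic stand-in with the documented nesting:
# 2 patients x 2 sessions, 4 epochs per session, 3 channels, 5 time points
rng = np.random.default_rng(0)
data = [[rng.normal(size=(4, 3, 5)) for _ in range(2)] for _ in range(2)]
labels = [[np.array([0., 1., 0., 1.]) for _ in range(2)] for _ in range(2)]

X, y, groups = flatten_sessions(data, labels, ['001', '002'])
print(X.shape, y.shape, groups.shape)  # (16, 3, 5) (16,) (16,)
```

Keeping the `groups` array makes it possible to split train and test sets by patient, so that epochs from one patient do not end up on both sides.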




### Inside the function

The function is composed of the following steps:
1. Load the data for each file
2. Select the sessions within the file that correspond to the arm opposite to the stroke side
3. Select the channels of interest, which are the channels on the stroke side (and the ones on the midline -> z)
4. Apply the bandpass filter
5. Create the epochs for the movement and no movement (and movement intention if provided) with labels
6. Shuffle the data
7. Return the data
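Steps 5 and 6 above can be illustrated with a short NumPy sketch: movement epochs get label 1, no-movement epochs get label 0, and both arrays are shuffled with the same permutation so that each epoch keeps its label. This is only an illustration under assumptions. The helper `label_and_shuffle` is hypothetical and the actual implementation in `ml_eeg_tools` may differ, including in how it uses `RANDOM_STATE`.

```python
import numpy as np

def label_and_shuffle(movement_epochs, empty_epochs, random_state):
    """Label movement epochs 1 and no-movement epochs 0, then shuffle them together."""
    epochs = np.concatenate([movement_epochs, empty_epochs])
    labels = np.concatenate(
        [np.ones(len(movement_epochs)), np.zeros(len(empty_epochs))]
    )
    # One permutation applied to both arrays keeps epochs and labels aligned
    order = np.random.default_rng(random_state).permutation(len(epochs))
    return epochs[order], labels[order]

# Synthetic epochs filled with their own label value, so alignment is easy to check
movement = np.ones((3, 2, 4))   # 3 epochs, 2 channels, 4 time points
empty = np.zeros((3, 2, 4))

epochs, labels = label_and_shuffle(movement, empty, random_state=42)
print(epochs.shape, labels.shape)  # (6, 2, 4) (6,)
```

Using a fixed random state makes the shuffled order reproducible from one run to the next.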


## Example for movement prediction

In this example, we will use the `prepare_data_train` function to prepare the data for the movement prediction. We will use the data of the first 5 files. 

In [9]:
FOLDER_PATH = './../../data/raw/Data_npy/'
FILE_PATH_LIST = [FOLDER_PATH + file_path for file_path in os.listdir(FOLDER_PATH) if file_path.endswith('.npy')]
NUMBER_OF_FILES = 5
settings = {
    'FMIN': 1,
    'FMAX': 40,
    'EPOCHS_TMIN': -1,
    'EPOCHS_TMAX': 2,
    'EPOCHS_EMPTY_FROM_MVT_TMINS': -4,
    'BINARY_CLASSIFICATION': True,
    'RANDOM_STATE': 42,
}

data_patients, labels_patients, patients_id, sessions_id = data_preparation.prepare_data_train(FILE_PATH_LIST[:NUMBER_OF_FILES], settings)


100%|██████████| 5/5 [00:33<00:00,  6.67s/it]


In [25]:
# Shape of the data
print(f'Shape of the data_patients: {len(data_patients)} x {len(data_patients[0])} x {len(data_patients[0][0])} x {len(data_patients[0][0][0])} x {len(data_patients[0][0][0][0])}')
print(f'Number of patients: {len(data_patients)}')
print(f'Number of session for the first patient: {len(data_patients[0])}')
print(f'Number of epochs: {len(data_patients[0][0])}')
print(f'Number of channels: {len(data_patients[0][0][0])}')
print(f'Number of time points: {len(data_patients[0][0][0][0])} \n')

# Shape of the labels
print(f'Shape of labels_patients: {len(labels_patients)}x{len(labels_patients[0])}x{len(labels_patients[0][0])}x1')
print(f'Number of patients: {len(labels_patients)}')
print(f'Number of session for the first patient: {len(labels_patients[0])}')
print(f'Number of epochs: {len(labels_patients[0][0])} \n')

print("sessions_id: ", sessions_id)
print("patients_id: ", patients_id)

Shape of the data_patients: 2 x 2 x 38 x 37 x 3073
Number of patients: 2
Number of session for the first patient: 2
Number of epochs: 38
Number of channels: 37
Number of time points: 3073 

Shape of labels_patients: 2x2x38x1
Number of patients: 2
Number of session for the first patient: 2
Number of epochs: 38 

sessions_id:  [['Trial1', 'Trial2'], ['Trial1', 'Trial2']]
patients_id:  ['001', '002']


In [10]:
data_patients

[[array([[[ 7.21315278e-07,  1.41329656e-06,  2.00904149e-06, ...,
           -2.98816890e-06, -2.73692247e-06, -2.36873094e-06],
          [-7.88132131e-06, -7.60194819e-06, -7.26851056e-06, ...,
           -4.63919422e-06, -4.78924111e-06, -4.77609653e-06],
          [ 1.03619950e-06,  1.68634651e-06,  2.29267737e-06, ...,
            2.56825456e-07,  3.62005393e-07,  5.18714886e-07],
          ...,
          [-3.32112503e-06, -3.46777811e-06, -3.57908770e-06, ...,
            1.17485589e-06,  1.44668773e-06,  1.66763281e-06],
          [-1.13849119e-06, -9.45488591e-07, -8.56948458e-07, ...,
            1.27854057e-06,  1.12021092e-06,  8.38427743e-07],
          [ 6.26109998e-07,  5.80906923e-07,  4.86373329e-07, ...,
           -3.26410117e-07, -1.43236971e-07, -5.26275873e-09]],
  
         [[ 3.02022247e-06,  2.82060247e-06,  2.54972286e-06, ...,
           -3.69190950e-06, -3.85649860e-06, -4.06687591e-06],
          [ 4.05879596e-06,  3.41069601e-06,  2.73169723e-06, ...,
    

In [11]:
labels_patients

[[array([0., 0., 1., 1., 0., 0., 1., 0., 0., 1., 1., 1., 1., 1., 0., 1., 0.,
         1., 0., 1., 1., 1., 0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 0., 1.,
         0., 1., 1., 0.]),
  array([0., 0., 1., 1., 0., 0., 1., 0., 0., 1., 1., 1., 1., 1., 0., 1., 0.,
         1., 0., 1., 1., 1., 0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 0., 1.,
         0., 1., 1., 0.])],
 [array([0., 0., 1., 1., 0., 0., 1., 0., 0., 1., 1., 1., 1., 1., 0., 1., 0.,
         1., 0., 1., 1., 1., 0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 0., 1.,
         0., 1., 1., 0.]),
  array([0., 0., 1., 1., 0., 0., 1., 0., 0., 1., 1., 1., 1., 1., 0., 1., 0.,
         1., 0., 1., 1., 1., 0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 0., 1.,
         0., 1., 1., 0.])]]

In [27]:
# First epoch of the first session of the first patient
print('Epoch = ', data_patients[0][0][0])
print('epoch shape = ', data_patients[0][0][0].shape, '\n')

# First label of the first session of the first patient
print('Label = ', labels_patients[0][0][0])
print('label type = ', type(labels_patients[0][0][0]), '\n')


Epoch =  [[ 7.21315278e-07  1.41329656e-06  2.00904149e-06 ... -2.98816890e-06
  -2.73692247e-06 -2.36873094e-06]
 [-7.88132131e-06 -7.60194819e-06 -7.26851056e-06 ... -4.63919422e-06
  -4.78924111e-06 -4.77609653e-06]
 [ 1.03619950e-06  1.68634651e-06  2.29267737e-06 ...  2.56825456e-07
   3.62005393e-07  5.18714886e-07]
 ...
 [-3.32112503e-06 -3.46777811e-06 -3.57908770e-06 ...  1.17485589e-06
   1.44668773e-06  1.66763281e-06]
 [-1.13849119e-06 -9.45488591e-07 -8.56948458e-07 ...  1.27854057e-06
   1.12021092e-06  8.38427743e-07]
 [ 6.26109998e-07  5.80906923e-07  4.86373329e-07 ... -3.26410117e-07
  -1.43236971e-07 -5.26275873e-09]]
epoch shape =  (37, 3073) 

Label =  0.0
label type =  <class 'numpy.float64'> 



## Example for movement intention prediction

In [28]:
FOLDER_PATH = './../../data/raw/Data_npy/'
FILE_PATH_LIST = [FOLDER_PATH + file_path for file_path in os.listdir(FOLDER_PATH) if file_path.endswith('.npy')]
NUMBER_OF_FILES = 5
settings = {
    'FMIN': 1,
    'FMAX': 40,
    'EPOCHS_TMIN': -1,
    'EPOCHS_TMAX': 2,
    'EPOCHS_EMPTY_FROM_MVT_TMINS': -4,
    'BINARY_CLASSIFICATION': True,
    'RANDOM_STATE': 42,
    'EPOCHS_INTENTION_FROM_MVT_TMIN': -2,
}

data_patients, labels_patients, patients_id, sessions_id = data_preparation.prepare_data_train(FILE_PATH_LIST[:NUMBER_OF_FILES], settings)


100%|██████████| 5/5 [00:31<00:00,  6.23s/it]


In [29]:
# Shape of the data
print(f'Shape of the data_patients: {len(data_patients)} x {len(data_patients[0])} x {len(data_patients[0][0])} x {len(data_patients[0][0][0])} x {len(data_patients[0][0][0][0])}')
print(f'Number of patients: {len(data_patients)}')
print(f'Number of session for the first patient: {len(data_patients[0])}')
print(f'Number of epochs: {len(data_patients[0][0])}')
print(f'Number of channels: {len(data_patients[0][0][0])}')
print(f'Number of time points: {len(data_patients[0][0][0][0])} \n')

# Shape of the labels
print(f'Shape of labels_patients: {len(labels_patients)}x{len(labels_patients[0])}x{len(labels_patients[0][0])}x1')
print(f'Number of patients: {len(labels_patients)}')
print(f'Number of session for the first patient: {len(labels_patients[0])}')
print(f'Number of epochs: {len(labels_patients[0][0])} \n')

print("sessions_id: ", sessions_id)
print("patients_id: ", patients_id)

Shape of the data_patients: 2 x 2 x 57 x 37 x 3073
Number of patients: 2
Number of session for the first patient: 2
Number of epochs: 57
Number of channels: 37
Number of time points: 3073 

Shape of labels_patients: 2x2x57x1
Number of patients: 2
Number of session for the first patient: 2
Number of epochs: 57 

sessions_id:  [['Trial1', 'Trial2'], ['Trial1', 'Trial2']]
patients_id:  ['001', '002']


In [30]:
labels_patients

[[array([0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 1., 0., 1., 0., 0., 0., 0.,
         0., 1., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0.,
         1., 1., 0., 1., 0., 0., 0., 1., 1., 0., 0., 1., 0., 0., 0., 1., 0.,
         0., 1., 0., 0., 1., 1.]),
  array([0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 1., 0., 1., 0., 0., 0., 0.,
         0., 1., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0.,
         1., 1., 0., 1., 0., 0., 0., 1., 1., 0., 0., 1., 0., 0., 0., 1., 0.,
         0., 1., 0., 0., 1., 1.])],
 [array([0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 1., 0., 1., 0., 0., 0., 0.,
         0., 1., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0.,
         1., 1., 0., 1., 0., 0., 0., 1., 1., 0., 0., 1., 0., 0., 0., 1., 0.,
         0., 1., 0., 0., 1., 1.]),
  array([0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 1., 0., 1., 0., 0., 0., 0.,
         0., 1., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0.,
         1., 1., 0., 1., 0., 0., 0., 1., 1., 0.

In [31]:
# Number of movement intention epochs (labeled 1) and other epochs (labeled 0) in the first session of the first patient
print('Number of movement intention epoch = ', sum(labels_patients[0][0]))
print('Number of non-movement intention epoch = ', len(labels_patients[0][0]) - sum(labels_patients[0][0]))

Number of movement intention epoch =  19.0
Number of non-movement intention epoch =  38.0


We can see that there are twice as many epochs labeled 0 (38) as epochs labeled 1 (19). This is because label 0 groups both the movement and the no-movement epochs, while label 1 contains only the movement intention epochs. Be careful when using this data for training: unbalanced classes can lead to biased models.
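One common way to compensate for such an imbalance is to weight each class inversely to its frequency during training. The sketch below computes weights with the usual "balanced" heuristic, `n_samples / (n_classes * class_count)`, for the counts observed above. The helper `balanced_class_weights` is hypothetical and not part of `ml_eeg_tools`; whether weighting or resampling works better depends on the model.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Same class counts as in the session above: 38 epochs labeled 0, 19 labeled 1
labels = [0.0] * 38 + [1.0] * 19
weights = balanced_class_weights(labels)
print(weights)  # {0.0: 0.75, 1.0: 1.5}
```

With these weights, an error on a movement intention epoch counts twice as much as an error on an epoch labeled 0, which offsets the 2:1 ratio between the classes.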