# Preprocessing Guide
This notebook is a guide to converting the initial graph representations created by [Díaz-Montiel & Lankarany (2023)](https://www.biorxiv.org/content/10.1101/2023.06.02.543277v1.abstract) from the OpenNeuro ds003029 dataset into a format that can be used by [PyTorch Geometric](https://pytorch-geometric.readthedocs.io/en/latest/). The whole conversion is automated by the `patch` function in `src/patch.py`. The processed data we are using can be found in the Graham cluster directory:

`/User/projects/def-milad777/gr_research/brain-greg/data/ds003029-processed/graph_representation_elements`

which contains folders for each patient and their runs.

In [1]:
import torch
import pickle
import numpy as np
import sys
sys.path.append('../src')
from preprocess import new_grs, create_tensordata_new, convert_to_Data, pseudo_data, convert_to_PairData, convert_to_TripletData




### Step 1: Extracting Graph Representations
For each patient and each run, there are three files: preictal (before seizure), ictal (seizure occurring), and postictal (after seizure). Each file is a list with entries of the form `graph = [A, NF, EF]`, where `A`, `NF`, and `EF` are lists of length 4, 3, and 4 respectively, defined below.

`A = [A0, A1, A2, A3]`, where:
-   `A0` = Ones, shape `(107,107)`.
-   `A1` = Correlation, shape `(107,107)`.  
-   `A2` = Coherence, shape `(107,107)`.
-   `A3` = Phase, shape `(107,107)`.

`NF = [NF0, NF1, NF2]`, where:

-  `NF0` = Ones, shape `(107,1)`.
-  `NF1` = Average Energy, shape `(107,1)`.
-  `NF2` = Band Energy, shape `(107,8)`.


`EF = [EF0, EF1, EF2, EF3]`, where:

-  `EF0` = Ones, shape `(107,107,1)`.
-  `EF1` = Correlation, shape `(107,107,1)`.
-  `EF2` = Coherence, shape `(107,107,1)`.
-  `EF3` = Phase, shape `(107,107,1)`.

The structure above has been confirmed experimentally and agrees with Alan's documentation of the `get_nf`, `get_adj`, and `get_ef` helper functions in his `load_data()` function. These details should still be confirmed with Alan for absolute certainty.
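
The shapes listed above can also be checked programmatically. The following is a minimal sketch, assuming each entry unpacks as `[A, NF, EF]` with array-like members; `check_graph` and `EXPECTED` are hypothetical names introduced here and are not part of `preprocess.py`. It runs on a synthetic entry rather than on the real pickle files.

```python
import numpy as np

N = 107  # number of electrodes (nodes) used in the shapes above

# Expected shapes, copied from the lists above.
EXPECTED = {
    "A": [(N, N)] * 4,
    "NF": [(N, 1), (N, 1), (N, 8)],
    "EF": [(N, N, 1)] * 4,
}

def check_graph(graph):
    """Return a list of shape mismatches for one [A, NF, EF] entry (empty if none)."""
    problems = []
    for name, parts in zip(("A", "NF", "EF"), graph):
        if len(parts) != len(EXPECTED[name]):
            problems.append(f"{name}: expected {len(EXPECTED[name])} arrays, got {len(parts)}")
            continue
        for k, (arr, shape) in enumerate(zip(parts, EXPECTED[name])):
            if np.shape(arr) != shape:
                problems.append(f"{name}{k}: expected {shape}, got {np.shape(arr)}")
    return problems

# Synthetic stand-in for one entry of a loaded pickle file.
fake_graph = [
    [np.ones((N, N)) for _ in range(4)],
    [np.ones((N, 1)), np.ones((N, 1)), np.ones((N, 8))],
    [np.ones((N, N, 1)) for _ in range(4)],
]
print(check_graph(fake_graph))  # an empty list means every shape matched
```

Running the same check over every entry of a loaded file would flag any entry that deviates from the documented layout.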

We'll first load the preictal, ictal, and postictal files for a single patient and run. In this case, the patient folder is `jh101` and we are using run $1$.

In [5]:
path_ictal_1 = "/Users/xaviermootoo/Documents/Data/ssl-seizure-detection/patient_gr/jh101/ictal_1.pickle"
path_preictal_1 = "/Users/xaviermootoo/Documents/Data/ssl-seizure-detection/patient_gr/jh101/preictal_1.pickle"
path_postictal_1 = "/Users/xaviermootoo/Documents/Data/ssl-seizure-detection/patient_gr/jh101/postictal_1.pickle"

with open(path_preictal_1, 'rb') as f:
    data_preictal_1 = pickle.load(f)
with open(path_ictal_1, 'rb') as f:
    data_ictal_1 = pickle.load(f)
with open(path_postictal_1, 'rb') as f:
    data_postictal_1 = pickle.load(f)

path_ictal_2 = "/Users/xaviermootoo/Documents/Data/ssl-seizure-detection/patient_gr/pt3/ictal_2.pickle"
path_preictal_2 = "/Users/xaviermootoo/Documents/Data/ssl-seizure-detection/patient_gr/pt3/preictal_2.pickle"
path_postictal_2 = "/Users/xaviermootoo/Documents/Data/ssl-seizure-detection/patient_gr/pt3/postictal_2.pickle"

with open(path_preictal_2, 'rb') as f:
    data_preictal_2 = pickle.load(f)
with open(path_ictal_2, 'rb') as f:
    data_ictal_2 = pickle.load(f)
with open(path_postictal_2, 'rb') as f:
    data_postictal_2 = pickle.load(f)

#### Valid & Corrupt File List
**jh101**:
- Run 1 (valid)
- Run 2 (valid)
- Run 3 (valid)
- Run 4 (valid)

**pt3**:
- Run 1 (corrupted)
- Run 2 (valid)
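
A list like the one above could be produced by attempting to load each file. This is a minimal sketch, assuming that "corrupted" means the file fails to unpickle (the notebook does not say how the list was obtained); `scan_runs` is a hypothetical helper, and the demonstration uses temporary files rather than the real dataset.

```python
import pickle
import tempfile
from pathlib import Path

def scan_runs(paths):
    """Map each file name to 'valid' or 'corrupted' depending on whether it unpickles."""
    status = {}
    for path in paths:
        try:
            with open(path, "rb") as f:
                pickle.load(f)
            status[Path(path).name] = "valid"
        except Exception:  # unpickling can raise several different exception types
            status[Path(path).name] = "corrupted"
    return status

# Demonstration on temporary files (hypothetical names, not the real dataset).
tmp = Path(tempfile.mkdtemp())
good = tmp / "ictal_good.pickle"
bad = tmp / "ictal_truncated.pickle"
payload = pickle.dumps([[1.0, 2.0], [3.0, 4.0]])
good.write_bytes(payload)
bad.write_bytes(payload[: len(payload) // 2])  # simulate an incomplete write

report = scan_runs([good, bad])
print(report)
```

Only unpickle files from a trusted source, since `pickle.load` can execute arbitrary code.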

### Step 2: Selecting Graph Representations
For simplicity we're going to select the most extensive graph representation:
-  `A` = None
-  `NF` = Average Energy and Band Energy, shape `(107,9)`.
-  `EF` = Correlation, Coherence, Phase, shape `(107, 107, 3)`.

Note that because most PyG layers do not use a separate weighted adjacency matrix, we will not use one, and instead we'll use all the available edge features. This selection is done by the `new_grs` function, which supports both **binary classification** and **multiclass classification** through its `mode` argument.
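
To make the selected shapes concrete, here is a minimal sketch of the feature selection only, assuming the `[A, NF, EF]` layout from Step 1. `select_features` is a hypothetical stand-in: the real `new_grs` also attaches a label according to `type` and `mode`, which is not reproduced here.

```python
import numpy as np

def select_features(graph):
    """Keep NF1 and NF2 as node features and EF1, EF2, EF3 as edge features."""
    _, NF, EF = graph  # the adjacency matrices are dropped
    node_features = np.concatenate([NF[1], NF[2]], axis=1)  # (107, 1) + (107, 8) -> (107, 9)
    edge_features = np.concatenate(EF[1:], axis=2)          # 3 x (107, 107, 1) -> (107, 107, 3)
    return node_features, edge_features

# Synthetic stand-in for one [A, NF, EF] entry.
rng = np.random.default_rng(0)
n = 107
graph = [
    [rng.random((n, n)) for _ in range(4)],
    [np.ones((n, 1)), rng.random((n, 1)), rng.random((n, 8))],
    [rng.random((n, n, 1)) for _ in range(4)],
]
nf, ef = select_features(graph)
print(nf.shape, ef.shape)  # (107, 9) (107, 107, 3)
```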

##### Binary Classification

In [3]:
# Select the graph representation for each class
new_data_preictal = new_grs(data_preictal_1, type="preictal", mode="binary")
new_data_ictal = new_grs(data_ictal_1, type="ictal", mode="binary")
new_data_postictal = new_grs(data_postictal_1, type="postictal", mode="binary")

##### Multiclass Classification

In [5]:
# Select the graph representation for each class
new_data_preictal = new_grs(data_preictal_1, type="preictal", mode="multiclass")
new_data_ictal = new_grs(data_ictal_1, type="ictal", mode="multiclass")
new_data_postictal = new_grs(data_postictal_1, type="postictal", mode="multiclass")


After selecting the GRs for each class, we concatenate them temporally into a single list `[preictal, ictal, postictal]`.

In [4]:
new_data = new_data_preictal + new_data_ictal + new_data_postictal

### Step 3: Standard GRs $\rightarrow$ PyG GRs
The function `create_tensordata_new` converts the list of standard graph representations, whose entries have the form $[ [NF, EF] , Y]$ (where $NF$ are the node features, $EF$ are the edge features, and $Y$ is the graph label), into PyG-style tensors. It first inserts an `edge_index` for a **complete graph** in the PyG format, a tensor of shape `[2, num_edges]` in which each column $[i \ \ j]^T$ indicates the directed edge $i \to j$; this is built using the helper function `build_K_n` found in `preprocess.py`. The node features $NF$ are left unchanged apart from being converted to a float32 tensor, denoted `x` in PyG. The edge features are converted to `edge_attr`, a float32 tensor of shape `[num_edges, num_edge_features]` whose rows follow the columns of `edge_index`; for example, the 4th column of `edge_index` (the 4th edge) corresponds to the edge feature `edge_attr[3,:]`. The label $Y$ is converted to a long torch tensor. The output is a list with entries of the form `[[edge_index, x, edge_attr], y]`.
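
The relationship between a dense edge-feature array and the `edge_index`/`edge_attr` pair can be sketched as follows. This is an illustration only: it uses NumPy instead of torch to stay lightweight, the function names are hypothetical, and the row-major edge ordering is an assumption (`build_K_n` may order edges differently).

```python
import numpy as np

def complete_edge_index(num_nodes):
    """Directed edges i -> j for all i != j, as an array of shape (2, num_edges)."""
    src, dst = np.nonzero(~np.eye(num_nodes, dtype=bool))  # every off-diagonal position
    return np.stack([src, dst])

def dense_to_edge_attr(EF, edge_index):
    """Pick EF[i, j, :] for every column (i, j) of edge_index."""
    src, dst = edge_index
    return EF[src, dst].astype(np.float32)

n = 107
edge_index = complete_edge_index(n)
EF = np.random.default_rng(0).normal(size=(n, n, 3))  # synthetic edge features
edge_attr = dense_to_edge_attr(EF, edge_index)
print(edge_index.shape, edge_attr.shape)  # (2, 11342) (11342, 3)
```

A complete directed graph on 107 nodes without self-loops has $107 \times 106 = 11342$ edges, which matches the `edge_attr` shape printed in the output further down.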

In [5]:
pyg_grs = create_tensordata_new(num_nodes=107, data_list=new_data, complete=True, save=False, logdir=None)

In [10]:
# Look inside of pyg_grs
print(len(pyg_grs))
print(type(pyg_grs[0][0][0]))
print("Edge features shape:", pyg_grs[0][0][2].shape)
print("Edge features stored in edge_attr:", pyg_grs[0][0][2])

1113
<class 'torch.Tensor'>
Edge features shape: torch.Size([11342, 3])
Edge features stored in edge_attr: tensor([[ 0.4213,  0.3902,  0.2319],
        [ 0.4969,  0.4126, -0.1610],
        [ 0.4405,  0.3708,  0.7440],
        ...,
        [ 0.8595,  0.2592, -0.5651],
        [ 0.7164,  0.2137, -0.8699],
        [ 0.8794,  0.2266, -0.6079]])


### Step 4: PyG GRs $\rightarrow$ PyG Data
<u>**Stop after this step**</u> if you only need PyG Data for <u>**supervised learning**</u>. 

Here we take the PyG graph representations and apply the `convert_to_Data` function to create a new list in which each entry is a `torch_geometric.data.Data` object. This is the main object used to hold graphs in PyG, so we need it, especially for batching (for more details see my tutorial `tutorial.ipynb`, or click [here](https://pytorch-geometric.readthedocs.io/en/latest/get_started/introduction.html) for the official tutorial from PyG).

In [3]:
# Convert the PyG GRs to the PyG Data format
pyg_Data_path = r"C:\Users\xmoot\Desktop\Data\ssl-seizure-detection\patient_gr\jh101_pyg_Data.pt"
Data_list = convert_to_Data(pyg_grs, save=True, logdir=pyg_Data_path)

### Step 5: Relative Positioning
In this step we take the output of Step 3 (`pyg_grs`) and create the pseudolabeled dataset of graph pairs for the relative positioning self-supervised method. Given our list `pyg_grs` and the hyperparameters $\tau_+$ and $\tau_-$, the function `pseudo_data` below returns a list of graph pairs in which each entry has the form `[[edge_index1, x1, edge_attr1], [edge_index2, x2, edge_attr2], y]`, where `y` is a pseudolabel (not the original label). Since the full pseudolabeled dataset can be quite large, we use the `sample_ratio` argument to randomly sample a portion of it (e.g., `sample_ratio = 0.2` gives us 20% of the full pseudolabeled dataset). Note also that the function returns an equal number of positive and negative samples, because `pseudo_data` automatically balances the two classes.
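
The pairing logic can be sketched on indices alone. This is a minimal sketch under an assumed labeling rule (the usual relative-positioning convention: a pair is positive when its two graphs are at most $\tau_+$ steps apart and negative when they are more than $\tau_-$ steps apart, with anything in between discarded). The function name and the exact rule are assumptions; `pseudo_data` may differ in detail.

```python
import random

def relative_positioning_pairs(num_graphs, tau_pos, tau_neg, sample_ratio=1.0, seed=0):
    """Return balanced (i, j, y) index pairs: y=1 if j-i <= tau_pos, y=0 if j-i > tau_neg."""
    pos, neg = [], []
    for i in range(num_graphs):
        for j in range(i + 1, num_graphs):
            gap = j - i
            if gap <= tau_pos:
                pos.append((i, j, 1))
            elif gap > tau_neg:
                neg.append((i, j, 0))
            # pairs with tau_pos < gap <= tau_neg are discarded
    rng = random.Random(seed)
    # Balance the classes, then keep only a fraction of the balanced set.
    k = int(min(len(pos), len(neg)) * sample_ratio)
    pairs = rng.sample(pos, k) + rng.sample(neg, k)
    rng.shuffle(pairs)
    return pairs

pairs = relative_positioning_pairs(num_graphs=50, tau_pos=3, tau_neg=20, sample_ratio=0.5)
n_pos = sum(y for _, _, y in pairs)
print(len(pairs), n_pos, len(pairs) - n_pos)
```

Each index pair `(i, j, y)` would then be expanded into `[pyg_grs[i][0], pyg_grs[j][0], y]` to obtain the entry format described above.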

In [12]:
pdata = pseudo_data(pyg_grs, tau_pos=12 // 0.12, tau_neg=60 // 0.12, stats=True, save=False, patientid="", 
                            logdir=None, model="relative_positioning", sample_ratio=0.10)

Number of examples: 21250
y
0    10625
1    10625
Name: count, dtype: int64


In [15]:
# Look inside of pdata
print(len(pdata))
example = pdata[0]
graph1, graph2, label = example
edge_index1, x1, edge_attr1 = graph1
edge_index2, x2, edge_attr2 = graph2
print("Edge features shape:", edge_attr1.shape)
print("Edge features stored in edge_attr:", edge_attr1)

21250
Edge features shape: torch.Size([11342, 3])
Edge features stored in edge_attr: tensor([[ 0.1689,  0.2751,  0.3535],
        [ 0.7258,  0.2714,  0.5142],
        [ 0.4268,  0.3288,  0.0477],
        ...,
        [ 0.2697,  0.2210,  0.7101],
        [ 0.4200,  0.2228, -0.2204],
        [ 0.5756,  0.2243, -0.2721]])


Rather than converting each graph pair to a `torch_geometric.data.Data` object, we create a new class called `PairData` that inherits from the `torch_geometric.data.Data` class, allowing us to batch *pairs* of graphs. We use the `convert_to_PairData` function to convert the list of graph pairs to a list of `PairData` objects (see [here](https://pytorch-geometric.readthedocs.io/en/latest/advanced/batching.html) for more details).
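
The reason a dedicated class is needed is that PyG batches graphs by merging them into one large disconnected graph and shifting each graph's edge indices by the number of nodes placed before it. With pairs, the first and second graphs need their own offsets. The following pure-Python sketch illustrates that bookkeeping; it is not the notebook's `PairData` implementation, and `batch_pairs` is a hypothetical name.

```python
def batch_pairs(pairs):
    """Merge (edges1, n1, edges2, n2) pairs into two big disjoint graphs.

    Each graph is a list of (src, dst) edges plus its node count. Edges of a
    pair are shifted by the number of nodes already placed on the same side,
    so the two sides keep independent offsets.
    """
    batch1, batch2 = [], []
    offset1 = offset2 = 0
    for edges1, n1, edges2, n2 in pairs:
        batch1 += [(s + offset1, d + offset1) for s, d in edges1]
        batch2 += [(s + offset2, d + offset2) for s, d in edges2]
        offset1 += n1
        offset2 += n2
    return batch1, batch2

# Two pairs: first graphs have 2 nodes each, second graphs have 3 nodes each.
pairs = [
    ([(0, 1), (1, 0)], 2, [(0, 1), (1, 2)], 3),
    ([(0, 1)], 2, [(0, 2)], 3),
]
b1, b2 = batch_pairs(pairs)
print(b1)  # [(0, 1), (1, 0), (2, 3)]
print(b2)  # [(0, 1), (1, 2), (3, 5)]
```

If both sides shared a single offset, edges of the second graphs would point at nodes belonging to the first graphs, which is the failure mode the custom class avoids.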

In [16]:
Pair_Data = convert_to_PairData(pdata, save=False, logdir=None)

### Step 6: Temporal Shuffling
This step is nearly identical to Step 5: we take `pyg_grs` and use it to create a pseudolabeled dataset for the temporal shuffling self-supervised method. However, in this method we generate *graph triplets* of the form `[[edge_index1, x1, edge_attr1], [edge_index2, x2, edge_attr2], [edge_index3, x3, edge_attr3], y]`, where `y` is the pseudolabel. Because the pseudolabeled dataset for temporal shuffling can be extremely large, it is <u>**strongly encouraged**</u> to use the `sample_ratio` argument to scale down the dataset.
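
To see why the triplet dataset grows so quickly, the enumeration can be sketched on indices. The labeling rule below is an assumption based on the usual temporal-shuffling convention (two anchors close in time, with the middle graph either between them or far from both); the function name is hypothetical and `pseudo_data` may implement the rule differently.

```python
def temporal_shuffling_triplets(num_graphs, tau_pos, tau_neg):
    """Enumerate (i, j, k, y) index triplets under an assumed labeling rule.

    (i, k) are anchors with 2 <= k - i <= tau_pos.
    y=1 if j lies strictly between the anchors (ordered),
    y=0 if j is more than tau_neg away from both anchors (shuffled).
    """
    triplets = []
    for i in range(num_graphs):
        for k in range(i + 2, min(i + tau_pos, num_graphs - 1) + 1):
            for j in range(num_graphs):
                if i < j < k:
                    triplets.append((i, j, k, 1))
                elif j < i - tau_neg or j > k + tau_neg:
                    triplets.append((i, j, k, 0))
    return triplets

# The count grows much faster than the number of graphs.
for n in (30, 60, 120):
    print(n, len(temporal_shuffling_triplets(n, tau_pos=4, tau_neg=10)))

triplets = temporal_shuffling_triplets(30, tau_pos=4, tau_neg=10)
```

Without balancing and subsampling, most triplets are negatives, which is one reason a small `sample_ratio` is practical here.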

In [19]:
pdata = pseudo_data(pyg_grs, tau_pos=12 // 0.12, tau_neg=60 // 0.12, stats=True, save=False, patientid="patient", logdir=None, 
                    model="temporal_shuffling", sample_ratio=0.3)

In [17]:
print(pdata[0][0][2])

tensor([[ 0.1689,  0.2751,  0.3535],
        [ 0.7258,  0.2714,  0.5142],
        [ 0.4268,  0.3288,  0.0477],
        ...,
        [ 0.2697,  0.2210,  0.7101],
        [ 0.4200,  0.2228, -0.2204],
        [ 0.5756,  0.2243, -0.2721]])


Similar to Step 5, we create a new class called `TripletData` that inherits from the `torch_geometric.data.Data` class for batching graph triplets in PyG.

In [None]:
Triplet_Data = convert_to_TripletData(pdata, save=False, logdir=None)

### Step 7: Automatic Conversion
The `patch` function in `patch.py` does all of the above, converting the original preictal, ictal, and postictal files from a single patient run. Please see the documentation in `patch.py`.

In [1]:
import sys
sys.path.append('../src')
from patch import patch

path_preictal = "/Users/xaviermootoo/Documents/Data/ssl-seizure-detection/patient_gr/jh101/preictal_1.pickle"
path_ictal = "/Users/xaviermootoo/Documents/Data/ssl-seizure-detection/patient_gr/jh101/ictal_1.pickle"
path_postictal = "/Users/xaviermootoo/Documents/Data/ssl-seizure-detection/patient_gr/jh101/postictal_1.pickle"

graphrep_dir = (path_preictal, path_ictal, path_postictal)

data = patch(graphrep_dir=graphrep_dir, logdir=None, file_name="", num_electrodes=107, tau_pos=12//0.12, tau_neg=120//0.12, 
          model="relative_positioning", stats=True, save=False, sample_ratio=0.1)

Number of examples: 1264
y
1    632
0    632
Name: count, dtype: int64
