# Connectome Pipeline

Hi and welcome to the Connectome Pipeline!

## 1. Preprocessing

In the first step, you will preprocess the CONN Matlab files into an analysis-ready dataset.

Here is an overview of the parameters for the preprocessing pipeline. Parameters marked with a (*) are optional.


+    *matlab_dir*: path to the Matlab files
+    *excel_path*: path to the Excel list
+    *preprocessing_type*: "conn" for the connectivity matrix or "aggregation" for the aggregated connectivity matrix
+    *export_file**: if False, the dataset is returned as a pandas DataFrame
+    *write_dir**: path to write the dataset to if export_file = True
+    *network**: Yeo7 or Yeo17 network (only applicable if preprocessing_type = "aggregation")
+    *statistic**: summary statistic to be applied (only applicable if preprocessing_type = "aggregation")
+    *upper**: boolean indicating whether only the upper diagonal elements of the connectivity matrices should be used
+    *file_format**: pass "h5" for further modelling in Python or "csv" for R (default "csv")
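
To illustrate what the *upper* option refers to, here is a minimal sketch (not the package's own implementation) that flattens the strict upper triangle of a symmetric connectivity matrix into entries named `i_j`, the same 1-based naming pattern as columns such as `1_2` or `9_10` in the resulting dataset. The helper name `upper_to_row` is hypothetical.

```python
import numpy as np

def upper_to_row(conn):
    """Flatten the strict upper triangle of a square matrix into a dict keyed 'i_j' (1-based)."""
    conn = np.asarray(conn)
    n = conn.shape[0]
    rows, cols = np.triu_indices(n, k=1)  # index pairs with i < j, diagonal excluded
    return {f"{i + 1}_{j + 1}": float(conn[i, j]) for i, j in zip(rows, cols)}

# toy 3 x 3 symmetric connectivity matrix
conn = np.array([[0.0, 0.5, -0.2],
                 [0.5, 0.0, 0.8],
                 [-0.2, 0.8, 0.0]])
row = upper_to_row(conn)
print(row)  # {'1_2': 0.5, '1_3': -0.2, '2_3': 0.8}
```

For n regions this yields n(n-1)/2 values, for example 45 values for 10 regions.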

In [8]:
import os
import pandas as pd

In [9]:
from connectome.preprocessing.preprocessing_matlab_files import preprocess_mat_files

In [12]:
matlab_dir = r"C:\Users\Kai\Desktop\My Life\Master\3. Semester\Innolabs\Connectome\data\10\10\conn_data" # Enter the directory containing the Matlab files
excel_path = r"C:\Users\Kai\Desktop\My Life\Master\3. Semester\Innolabs\Connectome\data\10\10\example_data_10.xlsx" # Enter the path to the corresponding Excel sheet
preprocessing_type = 'conn'
write_dir = "" # only relevant if export_file = True
export_file = False # False: return the dataset as a pandas DataFrame

In [13]:
df = preprocess_mat_files(matlab_dir = matlab_dir, excel_path = excel_path, preprocessing_type = preprocessing_type,
                          write_dir = write_dir, export_file = export_file)

loading files
Starting Preprocessing
Creating Final Dataset
Done!


In [14]:
df.head()

Unnamed: 0.1,Unnamed: 0,age,sex,edyears,Apoe,target,subject_id,ConnID,IDs,1_2,...,6_7,6_8,6_9,6_10,7_8,7_9,7_10,8_9,8_10,9_10
0,1.0,72.0,1.0,18.0,1.0,1.0,1.0,1.0,1.0,-2.942277,...,-0.064793,0.975841,0.234597,-3.256017,2.714773,-2.952573,-3.22547,-0.834634,-2.281607,0.490368
1,2.0,80.0,0.0,18.0,1.0,1.0,2.0,2.0,2.0,3.337648,...,-1.595689,-0.474086,-3.539505,-1.806304,-0.495115,1.660703,-1.528401,4.027515,2.467213,0.971484
2,3.0,87.0,0.0,19.0,0.0,0.0,3.0,3.0,3.0,3.076518,...,2.386448,2.177649,-2.782558,1.516519,0.511374,0.64002,-1.247038,-1.335651,0.252126,1.324432
3,5.0,81.0,1.0,10.0,0.0,0.0,5.0,5.0,5.0,-0.867223,...,-0.656894,-1.630556,-1.468686,-3.09331,-3.076284,-1.983045,-0.653831,-2.407441,1.524945,0.43702
4,8.0,84.0,1.0,11.0,1.0,0.0,8.0,8.0,8.0,0.120163,...,-0.523197,-2.129155,-0.779487,-0.570189,1.317503,-1.529356,0.374784,-0.939822,2.208215,-1.170817


## 2. Modelling

In the second step, you can either run the new input files through a pretrained model or train a new model.

### 2.1  Data preparation
Prepares the data for modelling: creates the target variable, drops unnecessary columns and, if requested, performs a train/test split.

The user has to specify:
- *classification*: whether it is a classification task (True) or a regression task (False)
- *columns_drop*: which variables should not be used for modelling
- *target*: the name of the target variable
- *y_0, y_1* (only relevant for classification tasks): which values of the target variable are mapped to 0 and which to 1
- *train_size*: share of the data used for training (e.g. 0.8)
- *seed*: a seed to ensure reproducibility of the train/test split
- *split*: whether a train/test split should be performed
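
The steps described above can be sketched as follows. This is only an illustration of the idea using pandas and NumPy, not the code behind `prepare_data`; the function name `prepare_data_sketch` and the toy data are hypothetical, and the real function may treat edge cases differently.

```python
import numpy as np
import pandas as pd

def prepare_data_sketch(data, columns_drop, target, y_0, y_1, train_size, seed):
    """Illustrative only: binarise the target, drop columns, split rows at random."""
    # keep only rows whose target value is assigned to class 0 or class 1
    data = data[data[target].isin(list(y_0) + list(y_1))].copy()
    y = data[target].isin(y_1).astype(int)
    X = data.drop(columns=list(columns_drop) + [target])

    # reproducible shuffle of the row positions
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_train = int(round(train_size * len(X)))
    train_idx, test_idx = order[:n_train], order[n_train:]
    return X.iloc[train_idx], y.iloc[train_idx], X.iloc[test_idx], y.iloc[test_idx]

# toy data, not taken from the real dataset
toy = pd.DataFrame({
    "age": [72, 80, 87, 81, 84],
    "ConnID": [1, 2, 3, 5, 8],
    "target": [1, 1, 0, 0, 0],
    "1_2": [-2.9, 3.3, 3.1, -0.9, 0.1],
})
X_tr, y_tr, X_te, y_te = prepare_data_sketch(toy, ["ConnID"], "target", [0], [1], 0.8, 1855)
print(X_tr.shape, X_te.shape)  # (4, 2) (1, 2)
```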

In [15]:
from connectome.preprocessing.data_preparation import prepare_data

In [20]:
classification = True
columns_drop = ["ConnID", "Apoe", "subject_id"]
target = "target"
y_0 = [0]
y_1 = [1]
train_size = 0.8
seed = 1855
split = True

In [21]:
# preparation of data
X_train, y_train, X_test, y_test = prepare_data(data = df, classification = classification,
                                                columns_drop = columns_drop, target = target, y_0 = y_0, y_1 = y_1,
                                                train_size = train_size, seed = seed, split = split)



### 2.2 Run Model or get pretrained model

Select which model should be used and whether a pretrained or a newly trained model is desired.

You can find a selection of pretrained models in the models folder.

The user has to specify:
- X_train: training data coming from the previous step
- y_train: values of target variable for the training data coming from the previous step
- model: which model should be used (options are: "elnet" for elastic net, "gboost" for gradient boosting, "rf" for random forest and "cnn" for convolutional neural network)
- pretrained: is a pretrained model wanted or should the training data be used to fit a new one. (True = pretrained, False = new fit)
- model_path: the full path to the desired pretrained model if one should be used

In [22]:
from connectome.models.framework import model_framework

In [24]:
model = model_framework(X_train = X_train,
                        y_train = y_train,
                        model = "cnn",
                        pretrained = False,
                        model_path = None,
                        epochs = 1,
                        patience = 1)

Turning flat array to matrix


ValueError: cannot reshape array of size 155 into shape (31,3)
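
The log line above refers to turning each flat feature vector back into a square connectivity matrix. Below is a minimal sketch of that idea, assuming the vector holds the strict upper triangle of a symmetric matrix; `flat_to_mat_sketch` is a hypothetical stand-in and not the package's `flat_to_mat`. Under that assumption the vector length must equal n(n-1)/2 for some integer n. A length of 155 does not have that form, so one possible cause of an error like the one above is that non-connectivity columns ended up in the flat vector.

```python
import math
import numpy as np

def flat_to_mat_sketch(flat):
    """Rebuild a symmetric n x n matrix (zero diagonal) from its strict upper triangle."""
    flat = np.asarray(flat, dtype=float)
    m = flat.size
    # solve m = n * (n - 1) / 2 for n and check that the solution is an integer
    n = (1 + math.isqrt(1 + 8 * m)) // 2
    if n * (n - 1) // 2 != m:
        raise ValueError(f"length {m} is not n*(n-1)/2 for any integer n")
    mat = np.zeros((n, n))
    mat[np.triu_indices(n, k=1)] = flat  # fill the upper triangle row by row
    return mat + mat.T                   # mirror into the lower triangle

mat = flat_to_mat_sketch(np.arange(1, 46))  # 45 values correspond to 10 regions
print(mat.shape)  # (10, 10)
```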

In [35]:
X = X_train
y = y_train
aggregation = False
augmentation = False
reorder = False

X_img_cols = []
X_struc_cols = []
for x in X.columns:
    if len(x.split("_")) > 1 and x.split("_")[0].isdigit() and x.split("_")[1].isdigit():
        X_img_cols.append(x)
    else:
        X_struc_cols.append(x)

In [36]:
if augmentation:
    print("Starting Data Augmentation")
    X_img_aug, X_struc_aug, y_aug = augmented_data(X, y, X_img_cols, X_struc_cols, sd=scale,
                                                   augm_fact=augmentation_factor)
    # merging augmented data with input data
    X_img = pd.concat([X[X_img_cols], X_img_aug])
    X_struc = pd.concat([X[X_struc_cols], X_struc_aug])
    y = np.concatenate([np.array(y), y_aug], axis=0)
else:
    X_img = X[X_img_cols]
    X_struc = X[X_struc_cols]
    y = np.array(y)

if aggregation:
    n_c = 8
    n_train = len(X_img)
    X_train_2d = np.zeros((n_train, n_c, n_c))

    # turn array to matrix
    for i in range(n_train):
        X_train_2d[i] = flat_to_mat_aggregation(X_img.iloc[i, :])

    stacked = np.stack(X_train_2d, axis=0)

else:
    n_c = dtl.flat_to_mat(X_img.iloc[0, :]).shape[0]
    n_train = len(X_img)
    X_train_2d = np.zeros((n_train, n_c, n_c))

    # turn array to matrix
    for i in range(n_train):
        X_train_2d[i] = dtl.flat_to_mat(X_img.iloc[i, :])

    if reorder:
        stacked = np.stack(reorder_matrices_regions(X_train_2d, network='yeo7'), axis=0)
    else:
        stacked = np.stack(X_train_2d, axis=0)

In [40]:
X_img = stacked.reshape(stacked.shape[0], stacked.shape[1], stacked.shape[2], 1)
if X_struc.shape[1] != 0:
    X_struc = X_struc.to_numpy().reshape(stacked.shape[0], X_struc.shape[1])
else:
    X_struc = X_struc.to_numpy()



AttributeError: 'numpy.ndarray' object has no attribute 'to_numpy'
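
An error like the one above typically appears when the object is already a NumPy array, for example when the cell is executed a second time after the first run has converted `X_struc`; arrays have no `to_numpy` method. A hedged sketch of a conversion that accepts either type, using `np.asarray` (the helper name `to_array` and the toy frame are hypothetical):

```python
import numpy as np
import pandas as pd

def to_array(x):
    """Return x as a NumPy array whether it is a DataFrame or already an array."""
    return np.asarray(x)

frame = pd.DataFrame({"age": [72.0, 80.0], "sex": [1.0, 0.0]})
first = to_array(frame)   # DataFrame -> ndarray
second = to_array(first)  # ndarray -> ndarray, so re-running is safe
print(first.shape, second.shape)  # (2, 2) (2, 2)
```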

In [38]:
X

array([[-1.26961509,  0.98648887,  0.90748521, -1.07646857, -1.26961509],
       [-0.10345012,  1.23715407,  0.90748521,  0.94598753, -0.10345012],
       [-1.05095916,  1.23715407,  0.90748521, -1.41354459, -1.05095916],
       [-0.6865326 ,  1.36248667,  0.90748521, -0.40231654, -0.6865326 ],
       [-0.97807385, -1.39483057,  0.90748521,  0.60891151, -0.97807385],
       [ 0.91694423, -0.26683715,  0.90748521,  0.94598753,  0.91694423],
       [-1.48827102,  0.61049106,  0.90748521, -1.41354459, -1.48827102],
       [ 0.0423205 ,  0.23449326,  0.90748521, -1.07646857,  0.0423205 ],
       [ 1.06271485,  0.23449326, -1.10194633, -1.07646857,  1.06271485],
       [ 1.3542561 ,  0.35982586, -1.10194633,  0.60891151,  1.3542561 ],
       [-1.63404164,  1.36248667, -1.10194633,  1.62013957, -1.63404164],
       [-0.17633543,  1.11182147, -1.10194633, -0.73939255, -0.17633543],
       [ 1.28137079, -0.39216975,  0.90748521, -1.07646857,  1.28137079],
       [-0.61364729,  0.73582366,  0.9

In [33]:
import connectome.preprocessing.data_loader as dtl
from connectome.preprocessing.reorder_matrices_regions import reorder_matrices_regions
from connectome.preprocessing.data_loader import flat_to_mat_aggregation
from connectome.models.E2E_conv import E2E_conv

import numpy as np

## 3. Model Evaluation

In this step, you can evaluate the model on a set of prespecified metrics.

+ For Classification: Accuracy, Precision, Recall, F1 and AUC
+ For Regression: MSE, MAE and R2

Check out https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics for details.
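
As a small worked example of how most of the listed metrics are defined, the sketch below computes them directly with NumPy on toy labels. It is not the code behind `model_evaluation`; the function names are hypothetical, and AUC is left out because it requires predicted scores rather than hard labels.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels coded as 0/1."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(y_true), "precision": precision,
            "recall": recall, "f1": f1}

def regression_metrics(y_true, y_pred):
    """MSE, MAE and R2 for continuous targets."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return {"mse": float(np.mean((y_true - y_pred) ** 2)),
            "mae": float(np.mean(np.abs(y_true - y_pred))),
            "r2": float(1 - ss_res / ss_tot)}

clf = classification_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
reg = regression_metrics([1, 2, 3], [1, 2, 4])
print(clf["accuracy"], reg["r2"])  # 0.6 0.5
```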

In [None]:
from connectome.models.evaluation import model_evaluation
from connectome.models.brainnet_cnn import preprocess_test_data_for_cnn

In [None]:
# If a CNN model was trained, uncomment the next line to transform the test dataset into the input format expected by the CNN (use the same settings as during training)
# X_test, y_test = preprocess_test_data_for_cnn(X_test, y_test, aggregation=False, reorder=False)

In [None]:
model_evaluation(model, X_test, y_test)

## 4. Feature Visualization and Interpretation

In the final step you can choose between several feature visualization and interpretation techniques.

The user has to specify:
+ model: the model from step 2
+ X: X_test dataframe
+ y: target test dataframe
+ viz_method: one of "GFI", "GFI_only", "FI", "FI_only", "elastic_net", "shapley" and "feature_attribution"

Visualization methods:
+ GFI: Grouped Permutation Feature Importance (based on yeo7 network)
+ GFI_only: Group only Permutation Feature Importance (based on yeo7 network)
+ FI: Permutation Feature Importance
+ FI_only: the same approach as GFI_only, but applied to every individual feature instead of groups
+ elastic_net: Visualization of the elastic net coefficients
+ shapley: Summary plot for Shapley values
+ feature_attribution: Neural Network Visualization with Saliency Maps

For more details and customization of plots see our documentation.
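
To give an intuition for the permutation-based methods, here is a minimal sketch of grouped permutation feature importance: the columns of one group are shuffled together and the resulting increase in error is recorded. This is an illustration under simplifying assumptions (mean squared error as the loss, a plain prediction function instead of a fitted model, made-up groups instead of the yeo7 networks), not the package's implementation; all names in it are hypothetical.

```python
import numpy as np

def permutation_importance_sketch(predict, X, y, groups, n_repeats=10, seed=0):
    """Mean increase in MSE when the columns of each group are shuffled together."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    base = np.mean((predict(X) - y) ** 2)  # error on the untouched data
    scores = {}
    for name, cols in groups.items():
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            perm = rng.permutation(len(X))
            # shuffle the rows of the grouped columns jointly
            X_perm[:, cols] = X[perm][:, cols]
            increases.append(np.mean((predict(X_perm) - y) ** 2) - base)
        scores[name] = float(np.mean(increases))
    return scores

# toy data: the target depends on columns 0 and 1 only
data_rng = np.random.default_rng(1)
X_toy = data_rng.normal(size=(200, 3))
y_toy = 2.0 * X_toy[:, 0] + X_toy[:, 1]

def predict(data):
    # stands in for the predict method of a fitted model
    return 2.0 * data[:, 0] + data[:, 1]

scores = permutation_importance_sketch(predict, X_toy, y_toy,
                                       {"group_a": [0, 1], "group_b": [2]})
print(scores)
```

In this toy setup, shuffling `group_b` leaves the predictions unchanged, so its importance is zero, while shuffling `group_a` increases the error.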

In [None]:
from connectome.visualization.viz_framework import visualization_framework

In [None]:
viz = visualization_framework(model=model, X=X_test, y=y_test,
                              viz_method="feature_attribution", method='saliency',
                              average=True, ordered=True)

In [None]:
viz