## Meta matching v2.0
This Jupyter notebook demonstrates how to load and use the multilayer meta-matching algorithm. In this demonstration, we perform multilayer meta-matching on 100 example subjects.

Packages needed (with the versions this Jupyter notebook was tested on):
* Numpy (1.19.2)
* Scipy (1.5.2)
* PyTorch (1.7.1)
* Scikit-learn (0.23.2)

### Step 0. Setup
Please modify `path_repo` below to point to your local copy of the repo:

In [1]:
path_repo = '../' # '/home/the/deepGround/code/2002/Meta_matching_models/'

In [2]:
# Imports and random seed initialization

import os
import sys
import random
import scipy
import torch
import pickle
import sklearn
import numpy as np

seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

import warnings
warnings.filterwarnings("ignore")

### Step 1. Load data
Load the example data that we provide. It contains:
* Example input functional connectivity (FC) `x` with size (100, 87571)
    * 100 is the number of subjects
    * 87571 is the length of the flattened vector of the 419 by 419 FC matrix (419*418/2=87571)
* Example output phenotypes `y` with size (100, 3)
    * 3 is the number of phenotypes.
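
To make the 419*418/2=87571 arithmetic concrete, the following minimal sketch flattens one symmetric 419 by 419 matrix into its unique off-diagonal entries. The matrix here is random, and the lower-triangle ordering is an assumption for illustration; the ordering used in the example data may differ.

```python
import numpy as np

n_roi = 419
rng = np.random.default_rng(0)

# Hypothetical symmetric FC matrix for one subject (random values, unit diagonal).
a = rng.standard_normal((n_roi, n_roi))
fc = (a + a.T) / 2
np.fill_diagonal(fc, 1.0)

# Keep only the entries strictly below the diagonal: 419 * 418 / 2 = 87571 values.
# The element ordering is an assumption and may not match the provided data.
rows, cols = np.tril_indices(n_roi, k=-1)
fc_vector = fc[rows, cols]

print(fc_vector.shape)  # (87571,)
```

Stacking one such vector per subject would give an array with the same layout as `x`.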

In [3]:
path_v20 = os.path.join(path_repo, 'v2.0')
path_v11 = os.path.join(path_repo, 'v1.1')
path_v10 = os.path.join(path_repo, 'v1.0')
model_v20_path = os.path.join(path_v20, 'models')
sys.path.append(path_v10)
from CBIG_model_pytorch import demean_norm

npz = np.load(os.path.join(path_v10, 'meta_matching_v1.0_data.npz'))
x_input = npz['x']
y_input = npz['y']
x_input = demean_norm(x_input)

print(x_input.shape, y_input.shape)

(100, 87571) (100, 3)
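
The cell above normalizes the FC with `demean_norm` before prediction. As a rough illustration of what a step with that name plausibly does, the sketch below subtracts each subject's mean and scales each row to unit L2 norm. This is an assumption, not the repo's implementation: `demean_norm_sketch` is a hypothetical stand-in, and the actual behavior of `demean_norm` should be checked in `CBIG_model_pytorch.py`.

```python
import numpy as np

def demean_norm_sketch(x):
    """Hypothetical stand-in: subtract each row's mean, then scale rows to unit L2 norm."""
    x = np.asarray(x, dtype=float)
    centered = x - np.nanmean(x, axis=1, keepdims=True)
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave constant rows as zeros instead of dividing by zero
    return centered / norms

# Small synthetic example: 5 "subjects" with 10 features each.
rng = np.random.default_rng(0)
x_demo = rng.standard_normal((5, 10)) + 3.0
x_demo_norm = demean_norm_sketch(x_demo)

print(x_demo_norm.mean(axis=1))            # each row mean is approximately 0
print(np.linalg.norm(x_demo_norm, axis=1)) # each row norm is approximately 1
```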


### Step 2. Split data
Here, we split the 100 subjects 80/20: 80 for training and 20 for testing.

In [4]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_input, y_input, test_size=0.2, random_state=42)
n_subj_train, n_subj_test = x_train.shape[0], x_test.shape[0]
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

(80, 87571) (20, 87571) (80, 3) (20, 3)


### Step 3. Multilayer meta-matching models predict
Here we apply the DNN and RR models trained on the extra-large source dataset (UK Biobank), the large source dataset (ABCD) and the medium source datasets (GSP, HBN and eNKI) to predict source phenotypes on `x_train` and `x_test`. This yields 458 predicted source phenotypes for both the 80 training subjects and the 20 test subjects.

In [5]:
from CBIG_model_pytorch import multilayer_metamatching_infer
dataset_names = {'extra-large': 'UKBB', 'large': 'ABCD', 'medium': ['GSP', 'HBN', 'eNKI']}
y_train_pred = multilayer_metamatching_infer(x_train, y_train, model_v20_path, dataset_names)
y_test_pred = multilayer_metamatching_infer(x_test, y_test, model_v20_path, dataset_names)

print(y_train_pred.shape, '\n', y_train_pred)
print(y_test_pred.shape, '\n', y_test_pred)

dnn(
  (layers): ModuleList(
    (0): Sequential(
      (0): Dropout(p=0.4, inplace=False)
      (1): Linear(in_features=87571, out_features=512, bias=True)
      (2): ReLU()
      (3): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): Sequential(
      (0): Dropout(p=0.4, inplace=False)
      (1): Linear(in_features=512, out_features=256, bias=True)
      (2): ReLU()
      (3): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (2): Sequential(
      (0): Dropout(p=0.4, inplace=False)
      (1): Linear(in_features=256, out_features=128, bias=True)
      (2): ReLU()
      (3): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (3): Sequential(
      (0): Dropout(p=0.4, inplace=False)
      (1): Linear(in_features=128, out_features=67, bias=True)
    )
  )
)
dnn(
  (layers): ModuleList(
    (0): Sequential(
      (0): Dropout(p=0.4, inplace=False)
      (1): Line

### Step 4. Stacking
Perform stacking with `y_train_pred`, `y_test_pred` and `y_train`. We use the predictions for the 80 training subjects, `y_train_pred` (input), and the real data `y_train` (output) to train the stacking model. You can either use all 458 source phenotypes for stacking, or select the top K source phenotypes most relevant to the target phenotype, as described in our paper; the two approaches achieve similar performance. We then apply the model to `y_test_pred` to get the final prediction of 3 phenotypes on the 20 test subjects.
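
For the top K option, one possible selection rule is sketched below: rank the source predictions by their absolute Pearson correlation with the target phenotype on the training subjects and keep the K highest. This criterion is an assumption for illustration, `select_top_k` is a hypothetical helper that is not part of the repo, and the data are synthetic.

```python
import numpy as np

def select_top_k(source_pred, target, k):
    """Return indices of the k columns of source_pred with the largest |Pearson r| to target."""
    xs = source_pred - source_pred.mean(axis=0)
    yt = target - target.mean()
    # Assumes no column of source_pred is constant (otherwise the denominator is zero).
    r = (xs.T @ yt) / (np.linalg.norm(xs, axis=0) * np.linalg.norm(yt))
    return np.argsort(-np.abs(r))[:k]

# Synthetic stand-ins: 80 training subjects, 458 predicted source phenotypes.
rng = np.random.default_rng(0)
source_demo = rng.standard_normal((80, 458))
# Build a target that closely follows source column 7, so that column should rank first.
target_demo = source_demo[:, 7] + 0.1 * rng.standard_normal(80)

top_idx = select_top_k(source_demo, target_demo, k=10)
print(top_idx)
```

The selected columns could then be passed to the stacking step in place of the full prediction matrix, for example `y_train_pred[:, top_idx]` and `y_test_pred[:, top_idx]`, with the selection done on training subjects only.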

#### Hyperparameter Tuning 
In the `stacking()` function, we set the range of `alpha` to `[0.00001, 0.0001, 0.001, 0.004, 0.007, 0.01, 0.04, 0.07, 0.1, 0.4, 0.7, 1, 1.5, 2, 2.5, 3, 3.5, 4, 5, 10, 15, 20]`. You are welcome to modify the range of `alpha` to get better performance on your own data.
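
To illustrate what tuning `alpha` over a grid involves, here is a minimal NumPy sketch that picks the regularization strength by K-fold cross-validation. It uses a plain linear ridge regression as a stand-in and does not reproduce the internals of `stacking()`; the helper names, the 5-fold split and the synthetic data are all assumptions.

```python
import numpy as np

def ridge_fit_predict(x_tr, y_tr, x_te, alpha):
    """Linear ridge regression with an unpenalized intercept, solved in dual form."""
    x_mean, y_mean = x_tr.mean(axis=0), y_tr.mean()
    xc = x_tr - x_mean
    # Dual form solves an (n x n) system, which is cheap when features outnumber subjects.
    dual = np.linalg.solve(xc @ xc.T + alpha * np.eye(len(y_tr)), y_tr - y_mean)
    return (x_te - x_mean) @ (xc.T @ dual) + y_mean

def pick_alpha(x, y, alphas, n_folds=5, seed=0):
    """Return the alpha with the lowest mean squared error averaged over folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, n_folds)
    mse = []
    for alpha in alphas:
        err = 0.0
        for fold in folds:
            train = np.setdiff1d(idx, fold)  # all subjects not in the held-out fold
            pred = ridge_fit_predict(x[train], y[train], x[fold], alpha)
            err += np.mean((pred - y[fold]) ** 2)
        mse.append(err / n_folds)
    return alphas[int(np.argmin(mse))], mse

# Synthetic stand-ins for 80 subjects and 458 predicted source phenotypes.
rng = np.random.default_rng(0)
x_demo = rng.standard_normal((80, 458))
y_demo = x_demo[:, :5].sum(axis=1) + 0.5 * rng.standard_normal(80)

alphas_demo = [0.001, 0.01, 0.1, 1, 10]
best_alpha, cv_mse = pick_alpha(x_demo, y_demo, alphas_demo)
print(best_alpha, cv_mse)
```

If the selected value sits at either end of the grid, that is usually a sign the range should be extended in that direction.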

In [6]:
from CBIG_model_pytorch import stacking
# Candidate regularization strengths searched inside stacking()
alphas = [0.00001, 0.0001, 0.001, 0.004, 0.007, 0.01, 0.04, 0.07, 0.1, 0.4, 0.7,
          1, 1.5, 2, 2.5, 3, 3.5, 4, 5, 10, 15, 20]
y_test_final = np.zeros((y_test_pred.shape[0], y_train.shape[1]))
for i in range(y_train.shape[1]):
    # For each target phenotype, perform stacking by developing a KRR model
    y_test_temp, _ = stacking(y_train_pred, y_test_pred, y_train[:, i], alphas)
    y_test_final[:, i] = y_test_temp.flatten()
print(y_test_final.shape, '\n', y_test_final)

(20, 3) 
 [[61.62359808 27.93901759 27.64252235]
 [47.658789   19.92006732 17.27647632]
 [58.43337712 26.20132618 29.76981137]
 [20.36260652  7.39094979  6.64798866]
 [63.52666342 25.27519393 30.04042147]
 [29.72350055 11.76516165  9.60142366]
 [51.09027522 23.06604386 26.29862704]
 [31.23978765 12.49798075 12.69467152]
 [51.85516564 20.25201089 25.30294817]
 [66.28780239 26.16455697 30.73975801]
 [70.38900293 29.70561584 31.41852754]
 [61.30589263 27.47497718 26.8497007 ]
 [65.79098479 27.37919549 29.89891964]
 [69.60049003 30.0780364  32.97285101]
 [67.89407292 27.82349812 33.61771439]
 [45.53163788 21.1205561  18.77536153]
 [33.9616015  14.91249895 21.02459435]
 [37.69831782 16.09879595 13.11550474]
 [40.1757404  16.83000851 22.02049174]
 [46.42372843 19.09157674 19.88317101]]


### Step 5. Evaluation
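Evaluate the prediction performance as the Pearson correlation between the predicted and observed values of each phenotype on the 20 test subjects.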

In [7]:
from scipy.stats import pearsonr  # scipy.stats.stats is a deprecated import path
corr = np.zeros((y_train.shape[1]))
for i in range(y_train.shape[1]):
    corr[i] = pearsonr(y_test_final[:, i], y_test[:, i])[0]
print(corr)

[0.38619462 0.53118872 0.49750862]
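
The same per-phenotype correlation can be computed for all columns at once without a loop. The sketch below is a NumPy-only illustration on synthetic data; `columnwise_pearson` is a hypothetical helper, not a function from the repo.

```python
import numpy as np

def columnwise_pearson(pred, truth):
    """Pearson correlation between matching columns of two (n_subjects, n_phenotypes) arrays."""
    pc = pred - pred.mean(axis=0)
    tc = truth - truth.mean(axis=0)
    return (pc * tc).sum(axis=0) / (np.linalg.norm(pc, axis=0) * np.linalg.norm(tc, axis=0))

# Synthetic stand-ins for 20 test subjects and 3 phenotypes.
rng = np.random.default_rng(0)
truth_demo = rng.standard_normal((20, 3))
pred_demo = truth_demo + rng.standard_normal((20, 3))

corr_demo = columnwise_pearson(pred_demo, truth_demo)
print(corr_demo)
```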


### Step 6. Haufe transform predictive network features (PNFs) computation
Here we compute the PNFs for the stacking we just performed, as the covariance between each of the 3 phenotype predictions and each element of the FC across the 80 training subjects. The final PNF has shape (87571, 3), where 87571 is the number of unique elements of the 419 by 419 FC matrix and 3 is the number of phenotypes.

In [8]:
from CBIG_model_pytorch import covariance_rowwise

y_train_haufe, _ = stacking(y_train_pred, y_train_pred, y_train)
print(y_train_haufe.shape)
cov = covariance_rowwise(x_train, y_train_haufe)
print(cov, '\n', cov.shape)

(80, 3)
[[-3.15746420e-03 -1.29972049e-03 -7.34787839e-04]
 [-7.26632304e-04 -2.39789108e-04 -5.77203773e-04]
 [ 1.63102467e-03  8.37832777e-04 -5.88613070e-05]
 ...
 [ 1.87897411e-03  5.39591152e-04  8.30710471e-04]
 [ 2.86308689e-03  1.04940039e-03  1.45609124e-03]
 [ 5.07279448e-03  2.39546817e-03  1.21153472e-03]] 
 (87571, 3)
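
The covariance behind the Haufe transform can be written as a single matrix product. The sketch below is a NumPy illustration on synthetic data, not the repo's `covariance_rowwise`; the helper name and the division by n - 1 (sample covariance) are assumptions.

```python
import numpy as np

def feature_prediction_covariance(x, y_pred):
    """Covariance between each column of x (n, p) and each column of y_pred (n, k); returns (p, k)."""
    n = x.shape[0]
    xc = x - x.mean(axis=0)            # center each feature across subjects
    yc = y_pred - y_pred.mean(axis=0)  # center each prediction across subjects
    return xc.T @ yc / (n - 1)         # sample covariance; normalization is an assumption

# Synthetic stand-ins: 80 subjects, 50 features (instead of 87571), 3 predictions.
rng = np.random.default_rng(0)
x_demo = rng.standard_normal((80, 50))
y_pred_demo = x_demo[:, :3] + rng.standard_normal((80, 3))

cov_demo = feature_prediction_covariance(x_demo, y_pred_demo)
print(cov_demo.shape)  # (50, 3)
```

Each row of the result says how strongly one feature co-varies with each predicted phenotype across subjects, which is what makes the map interpretable feature by feature.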


### Step 7. Haufe transform predictive network features (PNFs) computation for training phenotypes
Here we compute the PNFs with respect to the source phenotypes: the covariance between each of the 3 phenotype predictions and each of the 458 predicted source phenotypes across the 80 training subjects. The final PNF has shape (458, 3), where 458 is the number of source phenotypic predictions and 3 is the number of phenotypes.

In [9]:
from CBIG_model_pytorch import covariance_rowwise

cov = covariance_rowwise(y_train_pred, y_train_haufe)
print(cov, '\n', cov.shape)

[[ 0.03974357 -0.01331241 -0.08194796]
 [-0.3313986  -0.17342801  0.01566548]
 [ 0.84347044  0.42487221  0.11818034]
 ...
 [ 3.09089228  1.49775607  0.77518796]
 [ 2.52328096  1.19044549  0.38457833]
 [ 2.08633564  1.15713255  1.60349482]] 
 (458, 3)
