## Meta matching v1.0
This Jupyter notebook demonstrates how to load and use the meta-matching algorithm. In this demonstration, we perform meta-matching with 20 example subjects.

Packages needed (with the versions this notebook was tested on):
* Numpy (1.24.2)
* Scipy (1.9.1)
* PyTorch (1.7.1)
* Scikit-learn (0.22.2)
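
To compare your environment against the versions listed above, a minimal sketch along these lines may help. It relies only on the standard-library `importlib.metadata`; the helper name `installed_versions` is hypothetical, and the PyPI distribution names (`torch`, `scikit-learn`) are assumed to match how the packages were installed.

```python
from importlib import metadata

# Versions this notebook reports being tested with (taken from the list above).
# Keys are assumed PyPI distribution names.
tested = {
    "numpy": "1.24.2",
    "scipy": "1.9.1",
    "torch": "1.7.1",
    "scikit-learn": "0.22.2",
}


def installed_versions(packages):
    """Return {package: installed version, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


for name, version in installed_versions(tested).items():
    print(f"{name}: installed={version}, tested={tested[name]}")
```

A mismatch does not necessarily mean the notebook will fail; it only flags where behavior might differ from the tested setup.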


### Step 0. Setup
Please set `path_repo` below to the location of your copy of the repo:


In [2]:
path_repo = './'

In [3]:
# initialization and random seed set

import os
import sys
import random
import scipy
import torch
import pickle
import sklearn
import numpy as np

seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

import warnings
warnings.filterwarnings("ignore")

### Step 1. Load data
Load the example fake data that we provide. It contains:
* Example input structural MRI T1 `x` with size (20, 182, 218, 182)
    * 20 is the number of subjects
    * 182x218x182 is the dimension of the 3D T1 data
* Example output phenotypes `y` with size (20, 2)
    * 2 is the number of phenotypes
* Example intracranial volume (ICV) data `icv` with size (20, 1)
    * 1 is the dimension of the ICV data

In [4]:
data_path = os.path.join(path_repo, 'data')
model_path = os.path.join(path_repo, 'model')

from CBIG_util import znorm_icv

npz = np.load(os.path.join(data_path, 'meta_matching_v1.0_data.npz'))
x_input = npz['x']
y_input = npz['y']
icv_input = npz['icv']
icv_input = znorm_icv(icv_input)
print(x_input.shape, y_input.shape, icv_input.shape)

(20, 182, 218, 182) (20, 2) (20, 1)
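
When substituting your own data, it may be worth checking that the three arrays share the subject axis before going further. The sketch below writes and reloads a small `.npz` file with the same keys (`x`, `y`, `icv`). The file name and the 4x5x4 volume size are hypothetical stand-ins chosen to keep the example light; the real volumes are 182x218x182.

```python
import os
import tempfile

import numpy as np

# Hypothetical, down-sized stand-in for the example file:
# 20 subjects, a tiny 4x5x4 volume, 2 phenotypes, 1 ICV value per subject.
rng = np.random.default_rng(0)
fake = {
    "x": rng.standard_normal((20, 4, 5, 4)).astype(np.float32),
    "y": rng.standard_normal((20, 2)),
    "icv": rng.uniform(1.2e6, 1.8e6, size=(20, 1)),
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "fake_data.npz")
    np.savez(path, **fake)
    # Arrays are read into memory on indexing, so they stay valid after closing
    with np.load(path) as npz:
        x, y, icv = npz["x"], npz["y"], npz["icv"]

# All three arrays must agree on the number of subjects (first axis)
n_subj = x.shape[0]
assert y.shape[0] == n_subj, "x and y disagree on the number of subjects"
assert icv.shape == (n_subj, 1), "icv should have one value per subject"
print(x.shape, y.shape, icv.shape)
```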


### Step 2. Split data
Here we split the 20 subjects 80/20: 80% (16 subjects) for training and 20% (4 subjects) for testing.

In [5]:
from sklearn.model_selection import train_test_split
from CBIG_util import mics_z_norm

x_train, x_test, icv_train, icv_test, y_train, y_test = train_test_split(x_input, icv_input, y_input, test_size=0.2, random_state=42)
n_subj_train, n_subj_test = x_train.shape[0], x_test.shape[0]
y_train, y_test, _, _ = mics_z_norm(y_train, y_test)
print(x_train.shape, x_test.shape, icv_train.shape, icv_test.shape, y_train.shape, y_test.shape)

(16, 182, 218, 182) (4, 182, 218, 182) (16, 1) (4, 1) (16, 2) (4, 2)
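
The idea of normalizing phenotypes after the split can be sketched as follows. This is an illustration under assumptions, not the implementation of `mics_z_norm`: the helper `z_norm_with_train_stats` is hypothetical, and it assumes the mean and standard deviation are estimated on the training subjects only and then applied to both sets, so that no test information leaks into training.

```python
import numpy as np


def z_norm_with_train_stats(y_train, y_test):
    """Hypothetical helper: z-score both sets using training mean/std only."""
    mean = y_train.mean(axis=0)
    std = y_train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # guard against constant phenotypes
    return (y_train - mean) / std, (y_test - mean) / std, mean, std


rng = np.random.default_rng(42)
y = rng.standard_normal((20, 2)) * 5.0 + 100.0  # synthetic phenotypes

# 80/20 split of 20 subjects: 16 for training, 4 for testing
perm = rng.permutation(20)
n_test = int(round(0.2 * 20))
test_idx, train_idx = perm[:n_test], perm[n_test:]

y_tr, y_te, mean, std = z_norm_with_train_stats(y[train_idx], y[test_idx])
print(y_tr.shape, y_te.shape)
```

After this step the training phenotypes have zero mean and unit variance by construction, while the test phenotypes generally do not, because they are scaled with the training statistics.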


### Step 3. Meta-matching model prediction
Here we apply the model pretrained on a large source dataset (UK Biobank) to predict the source phenotypes for `x_train` and `x_test`. This yields 67 predicted source phenotypes for each of the 16 training subjects and 4 test subjects.

In [6]:
from CBIG_util import metamatching_infer

y_train_pred = metamatching_infer(x_train, icv_train, y_train, model_path)
y_test_pred = metamatching_infer(x_test, icv_test, y_test, model_path)

print(y_train_pred.shape, '\n', y_train_pred)
print(y_test_pred.shape, '\n', y_test_pred)

./model/CBIG_ukbb_dnn_run_0_epoch_98.pkl_torch
./model/CBIG_ukbb_dnn_run_0_epoch_98.pkl_torch
(16, 67) 
 [[  5.14068842   4.64960146   0.37592569 ...  -0.79975235  -6.86000061
    3.36048269]
 [  6.32211447   7.25385284  -0.03813863 ...  -2.59329224 -10.23042679
    5.342453  ]
 [  4.97654867   4.57014561   0.10048731 ...  -0.61772722  -6.87038994
    3.27581429]
 ...
 [  4.19604874   3.60710502   0.81585735 ...  -1.02136409  -5.84539986
    4.06673527]
 [  1.93042707   1.30830824   4.18760109 ...  -2.90544534  -5.70266676
    4.83503199]
 [  5.83531666   5.09975815   1.64362442 ...  -1.31022167  -8.29333401
    4.30726099]]
(4, 67) 
 [[ 3.92415547e+00  4.40324068e+00  1.45960104e+00 -2.59575319e+00
   6.67742634e+00 -1.72643602e+00 -6.15279734e-01  2.66337490e+00
   2.22580028e+00 -7.09242439e+00  6.12375212e+00 -1.39724225e-01
   3.61671233e+00  1.84387958e+00 -1.31239220e-01  1.17996292e+01
   2.56022000e+00 -1.68650627e+01 -1.43079567e+00 -2.31316590e+00
   3.66654801e+00 -3.545922

### Step 4. Stacking
Perform stacking with `y_train_pred`, `y_test_pred`, and `y_train`. We train the stacking model on the 16 training subjects, using the predicted source phenotypes `y_train_pred` as input and the real phenotypes `y_train` as output, then apply it to `y_test_pred` to get the final prediction of 2 phenotypes for the 4 test subjects. For simplicity, this example uses all 67 outputs of the pretrained model as input to the stacking KRR model. If you want to select the top K outputs, please see our [CBIG repo](https://github.com/ThomasYeoLab/CBIG/tree/master/stable_projects/predict_phenotypes/Naren2024_MMT1) for more details.

#### Hyperparameter Tuning
In the `stacking()` function, the search range of `alpha` is set to `[5, 10, 15, 20, 25, 30, 35, 40, 45, 50]`. You are welcome to modify this range to get better performance on your own data.

In [7]:
from CBIG_util import stacking

y_test_final_arr = np.zeros((y_test_pred.shape[0], y_train.shape[1]))
y_train_final_arr = np.zeros((y_train_pred.shape[0], y_train.shape[1]))
for i in range(y_train.shape[1]):
    # For each test phenotype, perform stacking by developing a KRR model
    y_test_final, y_train_final = stacking(y_train_pred, y_test_pred, y_train[:,i])
    y_test_final_arr[:,i] = y_test_final
    y_train_final_arr[:,i] = y_train_final
print(y_test_final_arr.shape, '\n', y_test_final_arr)

(4, 2) 
 [[-0.00985054 -0.58495296]
 [-0.65465397 -0.77294818]
 [-2.32535036 -2.09903303]
 [ 0.42019968 -0.04759149]]
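
To make the role of `alpha` concrete, here is a minimal NumPy sketch of kernel ridge regression (KRR) stacking with a grid search over the same `alpha` values. It is an illustration under assumptions, not the code inside `stacking()`: the linear kernel, the 4-fold cross-validation, the mean-squared-error criterion, and the helper names `krr_fit_predict` and `select_alpha` are all choices made for this example, and the data are synthetic.

```python
import numpy as np


def krr_fit_predict(x_train, y_train, x_test, alpha):
    """Linear-kernel ridge regression solved in dual (closed) form."""
    y_mean = y_train.mean()
    k_train = x_train @ x_train.T  # (n_train, n_train) Gram matrix
    k_test = x_test @ x_train.T    # (n_test, n_train) cross-kernel
    dual = np.linalg.solve(k_train + alpha * np.eye(len(y_train)),
                           y_train - y_mean)
    return k_test @ dual + y_mean


def select_alpha(x, y, alphas, n_folds=4):
    """Pick the alpha with the lowest k-fold mean squared error."""
    all_idx = np.arange(len(y))
    folds = np.array_split(all_idx, n_folds)
    best_alpha, best_mse = None, np.inf
    for alpha in alphas:
        sq_err = 0.0
        for val_idx in folds:
            tr_idx = np.setdiff1d(all_idx, val_idx)
            pred = krr_fit_predict(x[tr_idx], y[tr_idx], x[val_idx], alpha)
            sq_err += np.sum((pred - y[val_idx]) ** 2)
        mse = sq_err / len(y)
        if mse < best_mse:
            best_alpha, best_mse = alpha, mse
    return best_alpha


# Synthetic stand-ins: 67 predicted source phenotypes for 16 + 4 subjects
rng = np.random.default_rng(0)
x_tr = rng.standard_normal((16, 67))
x_te = rng.standard_normal((4, 67))
y_tr = x_tr[:, 0] + 0.1 * rng.standard_normal(16)

alphas = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50]
best = select_alpha(x_tr, y_tr, alphas)
y_te_pred = krr_fit_predict(x_tr, y_tr, x_te, best)
print(best, y_te_pred.shape)
```

Larger `alpha` values shrink the dual coefficients more strongly, which tends to help when there are far fewer subjects than stacking inputs, as is the case here (16 subjects, 67 inputs).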


### Step 5. Evaluation
Evaluate the prediction performance.

In [8]:
from scipy.stats import pearsonr  # scipy.stats.stats is a deprecated namespace

corr = np.zeros((y_train.shape[1]))
for i in range(y_train.shape[1]):
    corr[i] = pearsonr(y_test_final_arr[:, i], y_test[:, i])[0]
print('meta-matching stacking prediction performance in terms of correlation for two phenotypes are: ', corr)

meta-matching stacking prediction performance in terms of correlation for two phenotypes are:  [0.7865745  0.30652337]
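
For reference, the per-phenotype Pearson correlation used above can also be computed directly in NumPy. The sketch below uses synthetic data and a hypothetical helper `pearson_columns`; it is meant to show what the metric measures, not to replace `pearsonr`.

```python
import numpy as np


def pearson_columns(pred, target):
    """Pearson r between matching columns of two (n_subjects, n_phenotypes) arrays."""
    pred_c = pred - pred.mean(axis=0)
    target_c = target - target.mean(axis=0)
    num = (pred_c * target_c).sum(axis=0)
    den = np.sqrt((pred_c ** 2).sum(axis=0) * (target_c ** 2).sum(axis=0))
    return num / den


# Synthetic example with 4 test subjects and 2 phenotypes
rng = np.random.default_rng(1)
target = rng.standard_normal((4, 2))
pred = target + 0.5 * rng.standard_normal((4, 2))

corr = pearson_columns(pred, target)
print(corr)
```

With only 4 test subjects, a correlation estimate is very noisy, so the numbers in this demonstration should be read as a check that the pipeline runs rather than as a measure of real performance.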


### Step 6. Haufe transform predictive network features (PNFs) computation
Here we compute the PNFs for the stacking we just performed. This computes the covariance between each of the 2 phenotype predictions and each voxel of the 3D T1 data across the 16 training subjects. The final PNF has shape (87571, 2), where 87571 is the number of voxels after cropping and 2 is the number of phenotypes.
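
The covariance described here can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation of `covariance_rowwise`: the helper name `covariance_columns` is hypothetical, the sample covariance is assumed to be normalized by `n - 1`, and the voxel count is reduced to 50 to keep the example small.

```python
import numpy as np


def covariance_columns(x, y):
    """Covariance between every column of x and every column of y.

    x: (n_subjects, n_voxels), y: (n_subjects, n_phenotypes)
    Returns an array of shape (n_voxels, n_phenotypes).
    """
    x_c = x - x.mean(axis=0)
    y_c = y - y.mean(axis=0)
    return x_c.T @ y_c / (x.shape[0] - 1)  # assumed n - 1 normalization


# Synthetic stand-ins: 16 training subjects, 50 voxels, 2 predicted phenotypes
rng = np.random.default_rng(2)
x = rng.standard_normal((16, 50))
y_hat = rng.standard_normal((16, 2))

pnf = covariance_columns(x, y_hat)
print(pnf.shape)
```

Each row of the result describes how one voxel co-varies with the predicted phenotypes across subjects, which is what makes the Haufe-transformed features interpretable in input space.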

In [9]:
from CBIG_util import covariance_rowwise, load_3D_input

x_train = load_3D_input(x_train)
cov = covariance_rowwise(x_train, y_train_final_arr)
print(cov, '\n', cov.shape)

100000 0.022002458572387695
200000 0.03455638885498047
300000 0.04672837257385254
400000 0.06115293502807617
500000 0.0732114315032959
600000 0.08515763282775879
700000 0.09690737724304199
800000 0.10851883888244629
900000 0.1195061206817627
1000000 0.1302497386932373
1100000 0.14090800285339355
1200000 0.15177130699157715
1300000 0.16260838508605957
1400000 0.17337274551391602
1500000 0.18400001525878906
1600000 0.19462180137634277
1700000 0.20571589469909668
1800000 0.21669983863830566
1900000 0.22895503044128418
2000000 0.24018383026123047
2100000 0.25124359130859375
2200000 0.26238560676574707
2300000 0.27352142333984375
2400000 0.28438711166381836
2500000 0.29517436027526855
2600000 0.3061559200286865
2700000 0.3173844814300537
2800000 0.32857370376586914
2900000 0.33974361419677734
3000000 0.3501567840576172
3100000 0.3605825901031494
3200000 0.3708221912384033
3300000 0.381044864654541
3400000 0.39164090156555176
3500000 0.4021780490875244
3600000 0.4127688407897949
3700000 0.42