opNMF for vertex data

The steps below outline how to run opNMF using vertex wise outputs from CIVET. The overall process is:

  1. Build input matrix - each row is a vertex, each column is a subject's data
  2. Run NMF - the matrix from 1 is the input, NMF generates two output matrices
  3. Visualize components
  4. Analyze weights

The analysis is set up to run on Niagara, although for visualization a Jupyter notebook is sometimes more useful, which will require either remote desktop/the CIC or running some things on your laptop.

Start by cloning the repository at https://github.com/CoBrALab/cobra-nmf. All scripts can be found in this git repo.

git clone --recursive https://github.com/CoBrALab/cobra-nmf.git 
cd cobra-nmf

You should find vertex and voxel folders, each containing Python scripts to manipulate the inputs (e.g. create the input matrix) and outputs (e.g. plot the output weight matrix) of an opNMF analysis based on either vertex or voxel data. You should also see a cobra_brainparts folder which contains the MATLAB/Octave scripts that run the opNMF decomposition. generic_niagara_pnmf is a wrapper script that runs the opNMF analysis given an input, an output, and a number of components.

All steps below require loading the cobralab/2019b module. Instructions below are written as though you are in your cobra-nmf folder, but if this is not the case be sure to adjust and use appropriate relative paths.

Start by loading cobralab/2019b

module load cobralab/2019b

Analyses using voxel data also require specific Python packages, provided via a dedicated environment. The repo you just cloned contains a .yml environment file that can be used to create a conda environment with the code dependencies:

conda env create -f cobra-nmf.yml

Check that the environment was installed correctly, and then activate the environment:

conda env list  # you should see an environment called `cobranmf` listed

conda activate cobranmf

Shortcuts

  1. Set up your input spreadsheet
  2. Extract cortical metrics
  3. Build multivariate NMF input
  4. Run opNMF on Niagara
  5. Visualize components
  6. Plot component weights, extract to csv
  7. Run stability analysis
  8. Run NMF on metrics residualized for a global measure
  9. Automatically plot component scores on a cortical surface
  10. Visualize component-metric BSR values from PLS results as mini heatmaps

1. Set up your input spreadsheet

For the analysis, it is convenient to have your demographic spreadsheet sorted by age. This makes certain major trends (e.g. CT decrease with age) easily visible, which is a nice sanity check after building matrices. The following script can be used:

python sort_sheet.py $input.csv $agevariable $output.csv

Where $agevariable = the name of the column listing ages in your spreadsheet, $input.csv = your spreadsheet, and $output.csv = the name of the new .csv you want to create. Alternatively, you can do this in R, whichever you prefer.
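If you'd rather do this in Python, a minimal pandas equivalent would look something like the following (a sketch; it assumes your age column is literally named age):

import pandas as pd

# sort the demographic sheet by age and write out a new csv
df = pd.read_csv("input.csv")
df.sort_values("age").to_csv("input_agesort.csv", index=False)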

Sorting by age is technically optional. The following, however, is required:

  • Create 1 column for every metric file that you wish to analyse, containing the paths (absolute paths highly recommended) to the relevant vertex data for each subject. I.e. if doing an analysis of CT and SA, you should add 4 columns to your spreadsheet - one for each of left/right CT/SA.
  • If you wish to stratify your data during the stability splits by any demographic variables, make sure they are categorical or have a categorical equivalent. If using age, you will need to create a categorical age variable with 2 or more groups. The simplest approach is a median split into 2 groups (old/young), as was previously done in the define_splits script, and as in the following snippet:
    ...
    import numpy as np
    # assuming "demog" is the dataframe of your demographics spreadsheet, and "age" is the column with subject ages
    median_age = np.median(demog['age'].values)
    demog['age_dummy'] = np.where(demog['age'].values > median_age, 'old', 'young')
    ...
  • See /home/m/mchakrav/patelmo6/scratch/pnmf_bin/nmf_vertex/test/wh/derivatives/extract_metrics/df_inputs.csv on Niagara for an example

2. Extract cortical metrics

For each of your metrics, you need to build a vertex x subject matrix. In the cobra-nmf repo, the vertex/extract_metrics.py script can do this. It requires you to specify the columns in your spreadsheet that point to the .txt vertex files. You also need a copy of the CIVET mask file (/project/m/mchakrav/quarantine/resources/CIVET_2.0_mask_left_short.txt on Niagara). An example usage of the script is:

python vertex/extract_metrics.py --metric ct --metric_column left_ct right_ct --metric sa --metric_column left_sa right_sa --input_csv df_inputs.csv --mask_file CIVET_2.0_mask_left_short.txt

The --metric and --metric_column options have to be used together, and the order is important. In the above, --metric ct --metric_column left_ct right_ct says that the CT filepaths are found in the columns titled left_ct and right_ct in the df_inputs.csv sheet. The label given to --metric could in theory be anything, but keep it an abbreviation of the relevant metric, as it is used in naming the output files. Assuming you have separate .txt files for the left and right hemispheres, make sure you specify the correct column names together: --metric_column left_ct right_ct results in your left and right CT data being grouped together; if specified incorrectly you will mix up your metric data. The script will output 3 .mat files for each metric - one each for left, right, and whole brain.
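To check that the extraction worked, you can load one of the output .mat files and inspect its shape - a minimal sketch, assuming the whole brain file is named wb_ct.mat and the matrix is stored under the key X (the same key the R examples in step 8 read via readMat(...)$X):

import scipy.io

mat = scipy.io.loadmat("wb_ct.mat")
print(mat["X"].shape)  # expect (n_vertices, n_subjects), with subjects in spreadsheet order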

If extracting >1 metric or if your dataset is larger than a few hundred subjects, try submitting the above command to the cluster in a joblist, e.g. using qbatch -w 00:30:00 extract_metrics_joblist (Niagara) or qbatch --ppj 8 extract_metrics_joblist (CIC) to avoid the script running out of memory.

If using voxel data: The NMF scripts require the TractREC Python package to extract voxel information from NIfTI files. While this package can be downloaded from its GitHub repository, using it on newer CIC computers (and on Niagara) will result in errors, since the code is written in Python 2. An updated Python 3 version of the TractREC package is available on Niagara here: /home/m/mchakrav/parent41/scratch/resources/TractREC_p3

3. Build multivariate NMF input

The next step is to concatenate each of the wb*.mat files you created in step 2 - e.g. for an analysis of CT and SA, paste the wb_ct.mat and wb_sa.mat files together. You can use the vertex/build_nmf_vertexinput.py script as follows:

python vertex/build_nmf_vertexinput.py --inputs $.matfilesfromstep2 --output $output.mat --norm all

--inputs takes the list of files you want to concatenate - the whole brain metric files from step 2, separated by spaces. --output takes the name of your output .mat file. --norm lets you specify how z-scoring is done - the default is 'all', i.e. across both subjects and vertices. You can also specify 'vertex' to z-score on a per-vertex basis only. After creating your input matrix, you can visualize/plot it with:

python vertex/plot_input.py --input $nmfmatrix.mat --output $output.png
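For intuition, the two --norm modes described above amount to the following (an illustrative numpy sketch of the z-scoring only, not the script's exact code):

import numpy as np

X = np.random.rand(77122, 100)  # toy vertices x subjects matrix

# --norm all: one mean/sd computed over the entire matrix
z_all = (X - X.mean()) / X.std()

# --norm vertex: mean/sd computed separately for each vertex (row)
z_vertex = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)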

4. Run opNMF on Niagara

The next step is to run NMF using the wholebrain matrix created in step 3 as input. Upload the whole brain .mat file you created to Niagara. The generic_niagara_pnmf script can be used to run NMF. This script points to and references scripts in the cobra_brainparts folder contained in your git repo.

The script takes 3 arguments, in order: input matrix, k, output filename. Module cobralab/2019b is required.

To run the script, the following would work if run from your cobra-nmf folder:

octave generic_niagara_pnmf $wholebrain.mat $k $output.mat

Where $wholebrain.mat = the .mat file you created in step 3 with the data for every subject, and $k = the number of components. Later you will use a stability analysis to select this number; for now just choose 10. $output.mat = the output file containing the NMF outputs; set this to whatever you like so long as it has a .mat extension.

Create a joblist containing the octave command and submit it using qbatch. Use a walltime of a couple of hours (the job will likely take much less than this, but just in case). For example, if your octave command is

octave generic_niagara_pnmf wholebrain_ct.mat 10 nmf_results_wholebrainct_k10.mat

You want to create a file called joblist which contains only the command above (i.e. joblist is 1 line), e.g.:

echo octave generic_niagara_pnmf wholebrain_ct.mat 10 nmf_results_wholebrainct_k10.mat > joblist

Then run: qbatch -w 2:00:00 joblist

Note: Full analysis of NMF outputs should only be completed after the stability analysis. It can be helpful to run a test with an arbitrary number of components prior to stability analysis (e.g. to identify any processing issues), but refrain from any post hoc analyses until stability is assessed.

5. Visualize components

After NMF is complete, the output .mat file will contain a W matrix holding the component scores of every vertex. If you map these back to a .txt file, you can visualize the spatial patterns using brainview/Display.

python mat_to_brainview.py --nmf_results $nmfresults.mat --mask_file $civetmaskfile.txt --output_dir $outputdirectory

The script should output two files: left_k${componentnumber}.txt and right_k${componentnumber}.txt, so for 10 components you should have left_k10.txt and right_k10.txt. Each file has 1 line for every vertex and one column for every component. You can use a .obj file and this .txt file to look at your results using brainview, or automatically plot component scores on a cortical surface using create_civet_image.sh, following Section 9. The .txt files also contain an additional column with a clustering derived from the component scores. The clustering is 'winner take all': for every vertex, find the component with the highest score, then assign that vertex to that cluster. So, for a 10 component analysis you will have 11 total columns to view.
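The clustering column can be reproduced in a couple of lines - a sketch, assuming the vertices x k score matrix is stored under the key W in your results file (check the actual key names in your .mat file):

import numpy as np
from scipy.io import loadmat

W = loadmat("nmf_results_wholebrainct_k10.mat")["W"]

# 'winner take all': assign each vertex to its highest-scoring component (1-indexed)
clusters = np.argmax(W, axis=1) + 1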

6. Plot component weights, extract to csv

In addition to looking at the spatial components on the brain, you can also view a plot of the weightings. This tells you what's happening in each component. Similar to the plotting of the input matrix in step 3, there is a script that can be used:

python vertex/plot_Hweights.py --nmf_weights $nmfresults.mat

You can convert the NMF .mat results to a csv containing the H weights for each subject, so that you can then run linear models/PLS/other analyses:

python vertex/Hweights_tocsv.py --demo_csv $demographicscsv --id_col $subject --metrics $listofmetrics --nmf_results $nmfresults.mat

The script outputs two .csv files by default - one containing the H weights appended to your full input spreadsheet, and another containing just the subject ID column and the NMF weight data. The latter is useful for PLS analysis where you only want select columns; the former is likely more useful for linear models.
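For example, a linear model on one component-metric weight might look like the following sketch (the filename and column names here are hypothetical - check the header of the output csv for the real ones):

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("demographics_with_Hweights.csv")  # hypothetical output name
model = smf.ols("comp1_ct ~ age + sex", data=df).fit()  # hypothetical column names
print(model.summary())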

7. Run stability analysis

The instructions in 4-6 above assume you have a number of components selected, or are experimenting with a certain number. To select the number of components for your full analysis, a stability analysis is used. Stability analysis involves running NMF on various 'splits' of the data at varying granularities (i.e. varying numbers of components). The premise is to split your data into two equal-sized groups, run NMF on each, and look at the similarity of the output components. High similarity -> the components across different subsets of participants at this granularity are similar -> high stability.

For a more rigorous assessment, ten 'splits' of the data are created. In each split, you have two non-overlapping subsets of participants (stratified by key variables such as age, sex, disease group). Example: splitA_0 splitB_0, splitA_1 splitB_1, ..., splitA_9 splitB_9

The participants in splitA_0 are completely different from those in splitB_0. The participants in splitA_0 are not identical to those in splitA_1 or splitB_1, but there will be some overlap. The participants in splitA_1 are completely different from splitB_1.

The components from splitA_0 are compared to those from splitB_0, those from splitA_1 to those from splitB_1, etc.
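Conceptually, the comparison for one split pair at one granularity looks something like the sketch below (the real computation is done by the script in 7.3; the key name and matching strategy here are illustrative):

import numpy as np
from scipy.io import loadmat

# load the component score matrices for one split pair at k=5 (assumes scores are stored under 'W')
Wa = loadmat("stability_outputs/k5/a_0.mat")["W"]
Wb = loadmat("stability_outputs/k5/b_0.mat")["W"]

k = Wa.shape[1]
# k x k matrix of correlations between every component in A and every component in B
corr = np.corrcoef(Wa.T, Wb.T)[:k, k:]
# match each A component to its best-correlated B component, then average
stability = corr.max(axis=1).mean()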

7.1 Build input matrices for splits

Each of the splits described above needs an input matrix, which will be subject to NMF. The following script can be used to create these:

python vertex/define_splits.py --demo_csv $demographicscsv --id_col $subject  --n_folds 10 --inputs $wholebrainrawmetricdata --stratifyby dummy_agegroup --norm all
# $wholebrainrawmetricdata is a list of your wholebrain metric files, e.g. wb_ct.mat wb_sa.mat

The script will create a folder stability_splits with files a_0.mat, b_0.mat, ..., a_9.mat, b_9.mat containing the input matrices for stability.

7.2 Run opNMF for each split

Next, run NMF as in step 4 for each of the 10 splits, at each granularity of interest. For practicality, you may choose to analyze every second granularity (e.g. k=2,4,6,8) as opposed to every single one.

On Niagara, build a joblist to run NMF, e.g.:

#create output folders (mkdir -p also creates the parent stability_outputs folder)
for k in $(seq 2 2 20); do
  mkdir -p stability_outputs/k${k}
done

for k in $(seq 2 2 20); do
  for split in $(seq 0 9); do
    echo octave generic_niagara_pnmf stability_splits/a_${split}.mat $k stability_outputs/k${k}/a_${split}.mat
    echo octave generic_niagara_pnmf stability_splits/b_${split}.mat $k stability_outputs/k${k}/b_${split}.mat
  done
done > stability_joblist

This will create commands to run NMF for every other granularity from k=2 to k=20 across the 10 splits of data, and save the corresponding .mat files in the output directory. Adjust the loop to create commands for higher granularities if applicable. Then submit your stability jobs:

> qbatch -c10 -j10 -w 10:00:00 stability_joblist

Walltime recommendations for the stability analysis (4 raw metrics, niter = 100000, convergence tolerance = 0.00001 - the defaults in generic_niagara_pnmf):

  • ks 2-10: 5:00:00
  • ks 12-20: 10:00:00
  • ks 22-30: 14:00:00

7.3 Extract stability metrics

After NMF for each split is complete, you can calculate the similarity of the outputs to determine the spatial stability at each granularity. This involves correlating the vertex-wise component scores between splitA_0 and splitB_0, between splitA_1 and splitB_1, etc. This can be computed for each granularity as follows:

python vertex/compute_stability_corr.py --n_folds 10 --stability_results_dir $stabilityresultsfolder --k 5

The script operates on one granularity at a time - i.e. it computes stability at k=2 only, or k=4 only, etc. - so run the above once for every granularity of interest, varying the number passed to --k:

> for k in $(seq 2 2 20); do echo python vertex/compute_stability_corr.py --n_folds 10 --stability_results_dir $stabilityresultsfolder --k $k; done > joblist_compute_stability 
> qbatch -w 2:00:00 -c 1 -j 1 joblist_compute_stability 
# these qbatch options tell the scheduler to run 1 command per job; -j 1 helps with Python parallelizing the matrix operations involved in the script

This should output 1 csv file for each granularity, in a folder named stability_correlations. After the above jobs run, combine all these outputs to one csv file:

cd stability_correlations
ls #should see 1 csv file for each granularity
#grab the header/col names and store in stability_metrics.csv, which will contain stability metrics for all granularities
head -n1 stability_corr_k2.csv > stability_metrics.csv 
#for each of the stability_corr_k*csv files, grab the data and append it to stability_metrics.csv. Each file should be 11 lines long - a header plus 1 line per split - so tail -n10 does it
for k in $(seq 2 2 20);
do 
tail -n10 stability_corr_k${k}.csv >> stability_metrics.csv
done

7.4 Plot stability

After obtaining the .csv with stability metrics, use it to plot the computed stability. The script below creates this plot:

python vertex/plot_stability.py --stability_correlations $spreadsheet --spacing $kinterval --output $stabilityplot.png

where $spreadsheet is the concatenated .csv from 7.3, and $kinterval is the interval of granularities assessed. For example, if you ran the stability analysis for k=2,4,6,8..., then $kinterval = 2.
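If you want a quick custom look at the same data, the combined csv can also be plotted by hand (a sketch; the column names here are hypothetical - check the header of stability_metrics.csv for the real ones):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("stability_metrics.csv")
# average the split-half correlations within each granularity and plot against k
df.groupby("k")["corr"].mean().plot(marker="o")
plt.xlabel("granularity (k)")
plt.ylabel("mean split-half correlation")
plt.savefig("stability_by_k_custom.png")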

8. Run NMF on metrics residualized for a global measure

The example below residualizes vertex-level cortical thickness (CT), surface area (SA), mean curvature (MC), and gyrification index (GI) estimates from PNC subjects against subject-level mean CT and total SA. Replace prefixes as needed.
Load required modules:
Niagara:

module load cobralab/2019b

CIC:

module load R/3.5.1 minc-toolkit/1.9.17 anaconda/5.1.0-python3 minc-stuffs/0.1.24^minc-toolkit-1.9.17 RMINC/1.5.2.2^minc-toolkit-1.9.17^R-3.5.1 rstudio

8.1 Calculate mean CT and total SA per subject

> R  

library(RMINC)

# create dataframe with IDs *sorted by age*
demog <- read.csv("demographics_agesort.csv") # demographics spreadsheet from step 1
data <- data.frame(id = demog$id)

## calculate mean CT
# create paths to left/right CT text files from CIVET and store in same `data` dataframe
data$left_ct <- paste("output_PS_N3/PNC_",data$id,"/thickness/PNC_",data$id,"_native_rms_rsl_tlaplace_30mm_left.txt",sep="")
data$right_ct <- paste("output_PS_N3/PNC_",data$id,"/thickness/PNC_",data$id,"_native_rms_rsl_tlaplace_30mm_right.txt",sep="")

# create tables of metric values - dimensions n_vertices x subjects
left_ct_tab <- as.data.frame(vertexTable(data$left_ct))
right_ct_tab <- as.data.frame(vertexTable(data$right_ct))

# load midline mask files (can be found in the data quarantine on Niagara/CIC or at https://github.com/CoBrALab/minc-toolkit-extras)
left_mask <- read.csv("mask_files/CIVET_2.0_mask_left_short.txt", header = F)
right_mask <- read.csv("mask_files/CIVET_2.0_mask_right_short.txt", header = F)

# multiply each column of left/right vertex-wise CT tables with the corresponding mask
left_ct_tab_masked <- left_ct_tab * left_mask$V1
right_ct_tab_masked <- right_ct_tab * right_mask$V1
ct_tab_masked <- rbind(left_ct_tab_masked, right_ct_tab_masked) # vertically concatenate hemispheres
mean_ct <- colSums(ct_tab_masked)/77122 # 81924 vertices - 2*2401 midline vertices; equivalently, divide by sum(left_mask$V1) + sum(right_mask$V1)

## calculate total SA
# repeat the above steps, changing the CT file paths/variable names to SA ones as necessary. As well, the last line `mean_ct <- colSums(ct_tab_masked)/77122`
# should be changed to `total_sa <- colSums(sa_tab_masked)` (total SA is a sum over vertices, not a mean)

# create dataframe containing subject IDs, mean CT, total SA
globalmeas <- data.frame(id = data$id, mean_ct = mean_ct, total_sa = total_sa)
write.table(globalmeas, file = "ids_meanCT-totalSA_agesort.csv", sep = ",", row.names = F)

# append mean_ct, total_sa to your existing NMF demographics spreadsheet
demog$mean_ct = mean_ct
demog$total_sa = total_sa
write.table(demog, file = "demographics_with_mean-ct_total-sa_agesort.csv", sep=",", row.names=F)

8.2 Residualize metrics

Assuming you have already generated whole-brain raw metric .mat files (wb_ct.mat, wb_sa.mat, etc.) for a regular NMF run, the next step is to regress vertex-wise values of each metric against both mean CT and total SA to obtain whole-brain, residualized metric .mat files (wb_ct_resid.mat, wb_sa_resid.mat, etc.). These will be used to run any NMF decompositions on the entire sample, but not for split-half stability analysis.

(Load required modules from 8.1)

> R

library(R.matlab)

# load in whole-brain metric matrices
ct <- as.data.frame(readMat("input/wholebrain_ct_raw.mat")$X)
sa <- as.data.frame(readMat("input/wholebrain_sa_raw.mat")$X)
mc <- as.data.frame(readMat("input/wholebrain_mc_raw.mat")$X)
gi <- as.data.frame(readMat("input/wholebrain_gi_raw.mat")$X)

# sanity check
dim(ct) # should be 77122 x n_subjects

# load in mean CT & total SA values (can ignore this line if `globalmeas` variable is still in your workspace)
globalmeas <- read.csv("ids_meanCT-totalSA_agesort.csv") 

# for each metric dataframe, use a for loop to apply the following regression equation to each vertex column: vertex_i ~ meanCT + totalSA
residf <- function(metric) {
    metric <- t(metric) # transpose dataframe to match orientation of spreadsheet with global measures
    metric_resid <- matrix(nrow = 266, ncol = 77122) # NOTE: nrow should correspond to your n_subjects!
    for (i in 1:77122) {
        # columns 2 and 3 in globalmeas are the mean_ct and total_sa columns, respectively
        metric_resid[,i] <- residuals(lm(metric[,i] ~ globalmeas[,2] + globalmeas[,3]))
    }
    metric_resid <- t(metric_resid) # transpose back to vertices x n_subjects
    return(metric_resid)
}

# apply above function to each metric
ct_resid <- residf(ct)
sa_resid <- residf(sa)
gi_resid <- residf(gi)
mc_resid <- residf(mc)

# write residualized matrices to .mat files
writeMat("wholebrain_ct_resid.mat", X = ct_resid)
writeMat("wholebrain_sa_resid.mat", X = sa_resid)
writeMat("wholebrain_mc_resid.mat", X = mc_resid)
writeMat("wholebrain_gi_resid.mat", X = gi_resid)

To run an NMF with a specific k (e.g. k=10), use the above wholebrain_{metric}_resid matrices to build your whole-brain multivariate input matrix (i.e. using build_nmf_vertexinput.py) and proceed according to step 4.
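For example, reusing the filenames written out above (the --norm setting shown is just the script's default; adjust as appropriate for your analysis):

python vertex/build_nmf_vertexinput.py --inputs wholebrain_ct_resid.mat wholebrain_sa_resid.mat wholebrain_mc_resid.mat wholebrain_gi_resid.mat --output wholebrain_multimetric_resid.mat --norm all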

8.3 Stability analysis with residualized metrics

To appropriately define splits for the stability analysis, we need to residualize each split of the raw data for mean CT and total SA separately (treating each split as its own sample). To do this, use the optional --residfor argument to define_splits.py, e.g.

python vertex/define_splits.py --demo_csv $demographicscsv --id_col $subject  --n_folds 10 --inputs $wholebrainrawmetricdata --stratifyby dummy_agegroup --norm all --residfor mean_ct total_sa

--residfor accepts a list of strings corresponding to names of columns in your demographics csv that you wish to residualize for.

You will need to submit a joblist with the above command. Before doing so, ensure that any existing folder called stability_splits in your working directory is renamed to something else (otherwise it will be overwritten by the script).

qbatch -w 1:15:00 define_splits_joblist

Once the job finishes running, proceed with the rest of the stability analysis as usual (see 7.2).

9. Automatically plot component scores on a cortical surface using modified create_civet_image.sh

Clone the nmf_viz repository (git clone https://github.com/alyssadai/nmf_viz.git) to a directory on the CIC (it is convenient to run visualization scripts on the CIC, as many plotting packages (e.g. Python's seaborn) are available by default with anaconda).

Run the scripts below in the specified order, editing lines in the scripts marked with # MODIFY as needed.

Modules required: R/3.5.1 minc-toolkit minc-toolkit-extras

# i. Extract min & max component score for each hemisphere from the {left,right}_$k.txt files created in Step 5, storing them into two csv files for Step ii
> nmf_viz/get_nmf_comp_limits.R 

# ii. Use customized create_civet_image_nmf.sh to generate a brain map of each component in your NMF solution (e.g. if your k=8, this script will output 8 images with file name format k8_comp_${compnum}.png)
> nmf_viz/make_nmf_brain_maps.sh # RdPu colormap in nmf_viz used by default

An example map for one component is shown below:
[Example component map: k8_comp_1_sample.png]

10. Visualize component-metric BSR values from PLS results as mini heatmaps

Running a PLS analysis using subject-specific component-metric weights (H weights) as the brain input results in latent variables where the brain pattern is described by a bootstrap ratio (BSR) value for each component-metric combination. E.g., if you have a k=8 NMF decomposition from the 4 metrics CT, SA, MC, and GI, the PLS brain pattern for a given LV will include a BSR for comp1_CT, comp1_SA, comp2_CT, comp2_SA, etc.

One way to visualize these results besides a standard bar plot is to color code each metric according to its BSR, using a specified colormap:
[Examples: pls_bsr_sample.png, bsr_colormap_sample.png]

The script nmf_viz/create_NMF_BSR_heatmaps.py will automatically create the 2x2 plots as well as the colorbar in the image above (as individual .png files), which can then be combined with the brain maps from Step 9 to visualize the brain pattern for a given LV of your PLS results. The script is set up to work with 4 metrics, and to display metrics with BSRs below a user-defined threshold (default: 1.96, i.e. p > .05) as a different color (e.g., gray).

Modules required: anaconda/5.1.0-python3
Usage:

> python nmf_viz/create_NMF_BSR_heatmaps.py # EDIT lines in script marked with # MODIFY
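For reference, the underlying reshape-and-threshold logic is roughly as follows (a sketch with a hypothetical input file; the real script handles the plot layout, colormap, and colorbar for you):

import numpy as np
import matplotlib.pyplot as plt

# hypothetical input: a flat vector of BSRs for one LV, ordered comp1_CT, comp1_SA, ... as described above
bsr = np.loadtxt("lv1_bsr.txt")
grid = bsr.reshape(8, 4)  # k=8 components x 4 metrics (CT, SA, MC, GI)

# mask out sub-threshold values (|BSR| < 1.96, i.e. p > .05) so they plot in a different color
masked = np.ma.masked_where(np.abs(grid) < 1.96, grid)

plt.imshow(masked, cmap="coolwarm")
plt.colorbar()
plt.savefig("lv1_bsr_heatmap.png")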