Cross‐site data harmonization using CovBat in Python

Matthew Danyluik edited this page Dec 5, 2023 · 8 revisions

Background

CovBat is a tool that corrects batch (site, scanner, etc.) effects on your brain variables of interest. Specifically, CovBat corrects any differences in mean, variance, and covariance of your brain measures across batches. Here, we'll give an overview of running CovBat in Python on the CIC.

Reference: https://pubmed.ncbi.nlm.nih.gov/34904312/

CovBat: https://github.com/andy1764/CovBat_Harmonization/

ComBat (the method that inspired CovBat): https://github.com/Jfortin1/ComBatHarmonization/

Setup

To access the CovBat function, run the following in your analysis folder:

git clone https://github.com/andy1764/CovBat_Harmonization

If you are on the CIC, all the Python packages you need are available through the anaconda module.

module load anaconda

Import the following packages at the top of your Python script. Assuming you cloned the GitHub repo into your analysis folder, you'll also need to append the location of the CovBat script to your Python path before importing it, as shown.

import pandas as pd
import patsy
import sys

sys.path.append('/path/to/analysis/folder/CovBat_Harmonization/Python')

import covbat as cb
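If you'd rather not hard-code the path, the same step can be wrapped in a small function that fails with a clear message when the repo hasn't been cloned yet. This is only a sketch: the helper name add_covbat_to_path is hypothetical and not part of CovBat, and it assumes the repo was cloned directly inside the analysis folder.

```python
import sys
from pathlib import Path

def add_covbat_to_path(analysis_dir):
    """Append <analysis_dir>/CovBat_Harmonization/Python to sys.path (hypothetical helper)."""
    covbat_dir = Path(analysis_dir) / "CovBat_Harmonization" / "Python"
    # Stop early if the clone is missing, instead of failing later at import time.
    if not covbat_dir.is_dir():
        raise FileNotFoundError(f"CovBat folder not found: {covbat_dir}")
    # Avoid adding the same entry twice when the script is re-run in one session.
    if str(covbat_dir) not in sys.path:
        sys.path.append(str(covbat_dir))
    return covbat_dir

# Usage, with the placeholder replaced by your own analysis folder:
# add_covbat_to_path('/path/to/analysis/folder')
# import covbat as cb
```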

Running CovBat

Brain data

First, load a matrix with the brain variables you're interested in correcting. CovBat expects these data as a Pandas DataFrame, with rows as features and columns as subjects.

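Note that this DataFrame should contain only numeric values. The file itself should have no header row or row labels; header=None, as shown below, tells pandas to read the first row as data rather than as column names.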

brain_data = pd.read_csv('path/to/brain/data', header=None)

print(brain_data.shape) # Should be (brain_features, subjects)
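Many tabular exports store subjects as rows instead. The sketch below reshapes such a table into the features-by-subjects layout described above. It assumes a made-up table with a subject_id column; the helper name to_covbat_layout and the column names are illustrative and not part of CovBat, and the missing-value check is an extra precaution, not a documented CovBat requirement.

```python
import pandas as pd

def to_covbat_layout(table, id_column="subject_id"):
    """Return a numeric features-by-subjects DataFrame (hypothetical helper)."""
    # Drop the ID column, then flip so rows are features and columns are subjects.
    features = table.drop(columns=[id_column]).T
    # Use plain 0..n-1 integer labels on both axes, as read_csv(header=None) would.
    features = features.reset_index(drop=True)
    features.columns = range(features.shape[1])
    # Fail early on anything that cannot be read as a number.
    features = features.apply(pd.to_numeric, errors="raise")
    if features.isna().any().any():
        raise ValueError("brain data contains missing values")
    return features

# Toy example: 3 subjects, 2 brain features.
example = pd.DataFrame({
    "subject_id": ["s1", "s2", "s3"],
    "thickness_roi1": [2.5, 2.7, 2.4],
    "thickness_roi2": [3.1, 3.0, 3.3],
})
brain_example = to_covbat_layout(example)
print(brain_example.shape)  # (2, 3): 2 features, 3 subjects
```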

Batch labels

Next, create a vector of batch labels. CovBat expects batch labels to be a Pandas Series, with one label for each subject corresponding to their site/scanner.

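# read_csv returns a DataFrame; squeeze("columns") converts its single column to a Series
batch_labels = pd.read_csv('path/to/batch/labels').squeeze("columns")

print(batch_labels.shape) # Should be (subjects,)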
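Because each label has to correspond to one column of the brain data, it can help to check the two objects against each other before harmonizing. Here is a minimal sketch with a hypothetical helper, check_batch_labels, which is not part of CovBat. The rule that every batch needs at least two subjects is an assumption, motivated by the need to estimate a within-batch variance.

```python
import pandas as pd

def check_batch_labels(brain_data, batch_labels):
    """Validate batch labels against features-by-subjects data (hypothetical helper)."""
    if not isinstance(batch_labels, pd.Series):
        raise TypeError("batch_labels should be a pandas Series")
    if len(batch_labels) != brain_data.shape[1]:
        raise ValueError("need exactly one batch label per subject (column)")
    if batch_labels.isna().any():
        raise ValueError("batch labels contain missing values")
    counts = batch_labels.value_counts()
    # Assumption: a batch with a single subject has no estimable variance.
    too_small = counts[counts < 2]
    if not too_small.empty:
        raise ValueError(f"batches with fewer than 2 subjects: {list(too_small.index)}")
    return counts

# Toy example: 2 features, 4 subjects, 2 sites.
toy_brain = pd.DataFrame([[2.5, 2.7, 2.4, 2.6], [3.1, 3.0, 3.3, 3.2]])
toy_batch = pd.Series(["siteA", "siteA", "siteB", "siteB"])
batch_counts = check_batch_labels(toy_brain, toy_batch)
print(batch_counts)
```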

Covariate structure

Optionally, specify covariates whose relationships with your brain data should be preserved after batch correction. This will make sure CovBat doesn't remove any effects of, e.g., age or sex, if age and sex aren't perfectly balanced across your sites/scanners. CovBat expects this information through a design matrix, specified as follows.

In this example, we're assuming you have a Pandas DataFrame with subjects as one column, and age and sex as two of the others. You can follow the same logic for any other variables whose relationships with your brain data you're interested in preserving (diagnosis, IQ, etc.).

demographic_data = pd.read_csv('path/to/covariate/data')
print(demographic_data.shape) # Should be (subjects, demographic_features)

covariate_model = patsy.dmatrix("~ age + sex", demographic_data, return_type="dataframe")
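Covariates matter most when they differ between batches, so it is worth looking at how they are distributed across sites. The sketch below summarizes covariates per batch with plain pandas. The data are made up, and the site column is a hypothetical addition for illustration; your own table may store batch labels separately.

```python
import pandas as pd

demo = pd.DataFrame({
    "site": ["A", "A", "A", "B", "B", "B"],
    "age":  [21, 25, 30, 40, 45, 52],
    "sex":  ["F", "M", "F", "M", "M", "F"],
})

# Mean, spread, and sample size of the numerical covariate within each site.
age_by_site = demo.groupby("site")["age"].agg(["mean", "std", "count"])

# Counts of each level of the categorical covariate within each site.
sex_by_site = pd.crosstab(demo["site"], demo["sex"])

print(age_by_site)
print(sex_by_site)
```

In this toy table the two sites differ clearly in mean age, which is the kind of imbalance the design matrix is meant to account for.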

Analysis

Finally, run CovBat with your brain data, batch labels, and the covariate structure you want to preserve. See the CovBat GitHub repo for information on other options to the CovBat function.

Here, we're also specifying that age is a numerical covariate. CovBat will assume that all other covariates (in our case, sex) are categorical, so any other numerical covariates in your model must also be named in numerical_covariates.

If you have few subjects relative to your number of features, you may want to decrease pct_var from the default (0.95) to something like 0.90 to avoid overfitting. This parameter controls how many principal components CovBat will adjust across your dataset. See section 2.11 of the CovBat paper for more details.
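To build intuition for pct_var, the sketch below counts how many principal components are needed to reach a given fraction of total variance in a random toy matrix. It uses NumPy's SVD directly and only illustrates the idea; it is not CovBat's internal code, and the function name n_components_for is hypothetical.

```python
import numpy as np

def n_components_for(data, pct_var):
    """Smallest number of PCs whose cumulative variance ratio reaches pct_var.

    `data` is features-by-subjects; subjects are treated as observations.
    """
    observations = np.asarray(data, dtype=float).T        # subjects x features
    centered = observations - observations.mean(axis=0)   # center each feature
    singular_values = np.linalg.svd(centered, compute_uv=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(explained)
    # Index of the first component reaching the threshold, converted to a count
    # and capped at the number of available components.
    return int(min(np.searchsorted(cumulative, pct_var) + 1, len(cumulative)))

rng = np.random.default_rng(0)
toy = rng.normal(size=(50, 20))  # 50 features, 20 subjects
n_95 = n_components_for(toy, 0.95)
n_90 = n_components_for(toy, 0.90)
print(n_95, n_90)
```

A lower threshold never needs more components than a higher one, so reducing pct_var reduces (or leaves unchanged) the number of components that are adjusted.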

harmonized_brain_data = cb.covbat(data=brain_data, batch=batch_labels, model=covariate_model, numerical_covariates='age', pct_var=0.95)
harmonized_brain_data.to_csv('/path/to/output/harmonized_data.csv', header=False, index=False)

The output, harmonized_brain_data, will be a Pandas DataFrame of identical shape to your input brain_data, containing batch-corrected values. Here, we exported it to a .csv file containing only the numeric values, with no header row or row labels.
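One quick sanity check is to compare per-batch feature means before and after harmonization. Gaps between batches should generally shrink, although real covariate differences between sites can leave some gap. Below is a minimal sketch with a hypothetical helper, batch_feature_means, demonstrated on toy data and not on real CovBat output.

```python
import pandas as pd

def batch_feature_means(data, batch):
    """Mean of each feature within each batch; returns features-by-batches."""
    # Transpose so subjects are rows, then group subjects by their batch label.
    return data.T.groupby(batch.to_numpy()).mean().T

# Toy example: 2 features, 4 subjects, 2 sites.
toy_data = pd.DataFrame([[1.0, 3.0, 10.0, 12.0],
                         [2.0, 2.0, 5.0, 7.0]])
toy_batch = pd.Series(["siteA", "siteA", "siteB", "siteB"])

means = batch_feature_means(toy_data, toy_batch)
print(means)

# Largest between-batch gap in means across all features.
largest_gap = (means.max(axis=1) - means.min(axis=1)).max()
print(largest_gap)
```

Calling the helper once on brain_data and once on harmonized_brain_data, with the same batch_labels, gives two tables that can be compared side by side.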
