This repository contains a project for the Computing Methods for Experimental Physics and Data Analysis course.
The aim is to design and implement a regression model that predicts the age of healthy subjects from brain features extracted from T1-weighted MRI images. Data are taken from the well-known ABIDE dataset, which includes subjects affected by Autism Spectrum Disorder (ASD) and healthy control subjects (CTR).
The code allows the user to:
- visualize and explore ABIDE data;
- harmonize data by site;
- train different regression models;
- compare two alternative approaches to the problem.
The repository is structured as follows:
brain_age_predictor/
├── docs
├── LICENSE
├── dataset/
├── brain_age_predictor/
│ ├── images/
│ ├── metrics/
│ ├── best_estimator/
│ ├── preprocess.py
│ ├── brain_age_pred.py
│ ├── brain_age_site.py
│ ├── grid_CV.py
│ ├── __init__.py
│ ├── variability.py
│ ├── DDNregressor.py
│ └── predict_helper.py
│
├── README.md
├── requirements.txt
└── tests
    ├── test.py
    └── __init__.py
Data from ABIDE (Autism Brain Imaging Data Exchange) are stored in .csv files inside the brain_age_predictor/dataset folder and are handled with pandas. The dataset contains 419 brain morphological features (volumes, thickness, area, etc.) of different segmented brain areas (obtained with the FreeSurfer software) belonging to 915 male subjects (451 cases, 464 controls), with mean ages of 17.47 ± 0.36 and 17.38 ± 0.40 years respectively. The age distribution of the subjects, although heterogeneous between the CTR and ASD groups, is quite skewed, as shown below:
The age distribution also varies considerably across sites, as shown in the following boxplot:
On top of these differences, another important confounding factor is the effect of the different acquisition sites on the features. To mitigate this effect, the state-of-the-art harmonization tool neuroHarmonize, implemented by Pomponio et al., has been used.
neuroHarmonize corrects differences introduced by multi-site image acquisition while preserving specified covariates, so harmonization can be safely performed without affecting the age-related biological variability of the dataset. This is particularly important because different sites have different age distributions. The analysis has been conducted on both 'unharmonized' and 'harmonized' data.
Models have been trained using only control subjects (CTR) and then evaluated separately on the CTR set and on the ASD set. The differences are shown through residual plots in the results available in the /images folder. Because they are very poorly represented (<4%), subjects older than 40 years have been discarded from the present study, as in other studies in the field [1].
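The selection described above (drop subjects older than 40, then separate controls from cases) can be sketched in a few lines. This is only an illustration in plain Python: the record keys `AGE` and `GROUP`, the subject ids and the helper `select_by_group` are hypothetical and are not taken from the repository, which handles the real data with pandas.

```python
# Hypothetical records: the keys AGE and GROUP are illustrative,
# not the actual column names of the ABIDE .csv files.
subjects = [
    {"id": "s1", "AGE": 12.5, "GROUP": "CTR"},
    {"id": "s2", "AGE": 45.0, "GROUP": "CTR"},
    {"id": "s3", "AGE": 23.1, "GROUP": "ASD"},
    {"id": "s4", "AGE": 58.3, "GROUP": "ASD"},
    {"id": "s5", "AGE": 40.0, "GROUP": "CTR"},
]

MAX_AGE = 40  # subjects older than this are discarded


def select_by_group(records, group, max_age=MAX_AGE):
    """Return the records of one group, keeping only subjects aged <= max_age."""
    return [r for r in records if r["GROUP"] == group and r["AGE"] <= max_age]


ctr = select_by_group(subjects, "CTR")  # training set and CTR evaluation set
asd = select_by_group(subjects, "ASD")  # evaluation only, never used for fitting

print([r["id"] for r in ctr])
print([r["id"] for r in asd])
```

Keeping the ASD subjects out of the fit means that any systematic shift in their residuals reflects the group rather than the training procedure.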
Two different pipelines have been followed, based on the Leave-One-Site-Out approach:
- 1) Data are first split into train/test sets, using one provenance site as the test set and the others as the training set, and then cross-validated with KFold CV [2][3].
- 2) Data are processed without discrimination based on site and validated through a regular GridSearch CV. Both scikit-learn models and a custom neural network have been used.
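To make the Leave-One-Site-Out idea concrete, the sketch below generates one train/test split per acquisition site using only the standard library. The function `leave_one_site_out` and the site labels are hypothetical and are not taken from the repository. scikit-learn offers equivalent splitting through `sklearn.model_selection.LeaveOneGroupOut`, but whether the project uses it is not stated here.

```python
def leave_one_site_out(sites):
    """Yield (held_out_site, train_indices, test_indices) for each distinct site."""
    for held_out in sorted(set(sites)):
        train_idx = [i for i, s in enumerate(sites) if s != held_out]
        test_idx = [i for i, s in enumerate(sites) if s == held_out]
        yield held_out, train_idx, test_idx


# Hypothetical site label of each subject (not real ABIDE assignments).
sites = ["A", "A", "B", "C", "B", "C", "C"]

splits = list(leave_one_site_out(sites))
for held_out, train_idx, test_idx in splits:
    print(f"test site {held_out}: train={train_idx} test={test_idx}")
```

Each subject appears in exactly one test set, so every prediction is made on a site the model never saw during training.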
Typical regression metrics (MAE, MSE) have been evaluated, and the Pearson correlation coefficient (PR) has also been calculated. For pipeline 1, result plots are collected in the 'images' folder, while fitted models and the related metrics are stored in the 'best_estimator' and 'metrics/grid' folders respectively. Variability plots are stored in the 'images_SITE/grid' folder. For pipeline 2, metrics are stored in 'metrics/site' and summary plots are stored in the 'images_SITE/site' folder.
To use this code, the following Python packages are required:
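For reference, the three metrics named above can be computed as in the following sketch. It is written in plain Python to show the definitions. The function names and the age values are made up for illustration, and the repository may instead rely on library implementations such as those in scikit-learn and SciPy.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """MAE: average absolute difference between true and predicted ages."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """MSE: average squared difference between true and predicted ages."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Made-up true and predicted ages, in years.
age_true = [10.0, 20.0, 30.0, 40.0]
age_pred = [12.0, 19.0, 33.0, 38.0]

mae = mean_absolute_error(age_true, age_pred)  # 2.0
mse = mean_squared_error(age_true, age_pred)   # 4.5
pr = pearson_r(age_true, age_pred)

print(f"MAE={mae:.2f}  MSE={mse:.2f}  PR={pr:.3f}")
```

MAE is expressed in years and is therefore the easiest of the three to interpret, while PR measures how well the predictions track the ordering of the true ages regardless of offset.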
- keras
- matplotlib
- neuroHarmonize
- numpy
- pandas
- prettytable
- scikit-learn
- scipy
- seaborn
- sphinx
- statsmodels
- tensorflow
- 1) Download the repository from GitHub:
git clone https://github.com/Pastiera/brain_age_predictor
- 2) Change directory:
cd path/to/brain_age_predictor/brain_age_predictor
- 3) The modules brain_age_pred.py, brain_age_site.py, variability.py and preprocess.py can be run from the command line, either without arguments or with the options listed in their help message (shown by passing -h). Example of usage (Pipeline 1):
usage: brain_age_pred.py [-h] [-dp DATAPATH] [-grid] [-pred] [-neuroharm] [-verb]
Main module for brain age predictor package.
optional arguments:
-h, --help show this help message and exit
-dp, --datapath DATAPATH
Path to the data folder.
-grid, --gridcv Use GridSearch cross validation to train and fit models.
-pred, --predict Make predictions with models pre-trained with GridSearchCV.
-neuroharm, --harmonize
Use NeuroHarmonize to harmonize data by provenance site.
-verb, --verbose Set DDN Regressor model's verbosity. If True, it shows model summary.Default = False
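An `argparse` parser consistent with the help message above could look like the following sketch. It is reconstructed from the help text only and is not the actual source of brain_age_pred.py: the function name `build_parser`, the `None` default for the data path and the use of `store_true` flags are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the options listed in the help message (sketch)."""
    parser = argparse.ArgumentParser(
        prog="brain_age_pred.py",
        description="Main module for brain age predictor package.",
    )
    # Default of None is an assumption; the real default path is not documented here.
    parser.add_argument("-dp", "--datapath", type=str, default=None,
                        help="Path to the data folder.")
    parser.add_argument("-grid", "--gridcv", action="store_true",
                        help="Use GridSearch cross validation to train and fit models.")
    parser.add_argument("-pred", "--predict", action="store_true",
                        help="Make predictions with models pre-trained with GridSearchCV.")
    parser.add_argument("-neuroharm", "--harmonize", action="store_true",
                        help="Use NeuroHarmonize to harmonize data by provenance site.")
    parser.add_argument("-verb", "--verbose", action="store_true",
                        help="Set DDN Regressor model's verbosity. "
                             "If True, it shows model summary. Default = False")
    return parser


# Equivalent to running: python brain_age_pred.py -grid -neuroharm
args = build_parser().parse_args(["-grid", "-neuroharm"])
print(args.gridcv, args.harmonize, args.predict, args.datapath)
```

With this layout every switch defaults to `False`, so running the module with no arguments leaves all optional steps disabled.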
Pre-trained models in /best_estimator can be run for reproducibility, and newly trained models will be saved in the same folder. If no fitted model is present in this folder yet, brain_age_pred.py must be run first in order to use variability.py.
Result plots are collected in the 'images' or 'images_site' folder, while fitted models and the related metrics are stored in the 'best_estimator' and 'metrics' folders respectively.
- [1] Courchesne E, Campbell K, Solso S. Brain growth across the life span in autism: age-specific changes in anatomical pathology. Brain Res. 2011 Mar 22;1380:138-45. doi: 10.1016/j.brainres.2010.09.101. Epub 2010 Oct 1. PMID: 20920490; PMCID: PMC4500507.
- [2] Saponaro S., Giuliano A., Bellotti R., Lombardi A., Tangaro S., Oliva P., Calderoni S., Retico A., Multi-site harmonization of MRI data uncovers machine-learning discrimination capability in barely separable populations: An example from the ABIDE dataset, NeuroImage: Clinical, Volume 35, 2022, 103082, ISSN 2213-1582.
- [3] Okamoto N and Akama H (2021), Extended Invariant Information Clustering Is Effective for Leave-One-Site-Out Cross-Validation in Resting State Functional Connectivity Modeling. Front. Neuroinform. 15:709179.
- [4] Bhaumik, R., Pradhan, A., Das, S. et al. Predicting Autism Spectrum Disorder Using Domain-Adaptive Cross-Site Evaluation. Neuroinform 16, 197–205 (2018).