TReNDS Neuroimaging

The first place solution to the Kaggle "TReNDS Neuroimaging" competition.

By Nikita Churkin and Dmitry Simakov

Data

We use the original competition dataset. Download the files from Kaggle and put them all in the data/raw folder. Your directory structure should look as follows:

.
├── age.py
├── agg_feats.py
├── all_labels.py
├── compute_biases.py
├── create_img_statistics.py
├── create_pca_dl_feats.py
├── create_submission.py
├── d11.py
├── d12.py
├── d21.py
├── d22.py
├── data
│   └── raw
│       ├── fMRI_test
│       ├── fMRI_train
│       ├── fnc.csv
│       ├── loading.csv
│       ├── reveal_ID_site2.csv
│       ├── sample_submission.csv
│       └── train_scores.csv
├── resave_imgs.py
├── site_classifier.py
├── trends.py
├── utils.py
├── LICENSE
├── README.md
├── model_summary.pdf
└── requirements.txt

Hardware

This code was tested in the following setting:

  • OS: Ubuntu 18.04
  • RAM: 64 GB
  • CPU: 12 cores (24 threads), 3800 MHz
  • SSD: 3200 MB/s Sequential Read, 400K IOPS Random Read

Additional disk space requirements:

  • 450 GB for the resaved fMRI data
  • ~30 GB for PCA/DL models
  • ~160 GB for all saved models (7 seeds for 5 targets)

Execution time:

  • 13 hours for PCA feature creation
  • 47 hours for DL feature creation
  • ~1 hour for the other features
  • ~2.5 hours for training and inference of a single seed model (the final submission is a blend over 7 validation seeds)

Requirements

We provide a requirements.txt file to install the dependencies through pip.
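
For example:

pip install -r requirements.txt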

Submission

To reproduce our full solution, run the following scripts in order:

python resave_imgs.py
python create_img_statistics.py
python create_pca_dl_feats.py
python site_classifier.py
python agg_feats.py 
python compute_biases.py
python all_labels.py
python create_submission.py

Scripts

resave_imgs.py

We resaved the original 3D fMRI data in pickle format, with the correct channel order and in float32. This is needed for faster data loading in the create_pca_dl_feats.py script.
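
A minimal sketch of this step, assuming the .mat files are MATLAB v7.3 (HDF5) containers storing the 4D volume under the SM_feats key, as in the public competition kernels; the exact key and any transpose applied in resave_imgs.py may differ:

import pickle

import h5py
import numpy as np

def resave_mat(src_path, dst_path):
    # MATLAB v7.3 .mat files are HDF5; 'SM_feats' is assumed to hold
    # the (53, x, y, z) volume with feature channels on the first axis.
    with h5py.File(src_path, 'r') as f:
        vol = f['SM_feats'][()]
    # Cast to float32 (halves the file size) before pickling.
    vol = np.ascontiguousarray(vol, dtype=np.float32)
    with open(dst_path, 'wb') as f:
        pickle.dump(vol, f, protocol=pickle.HIGHEST_PROTOCOL)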

create_img_statistics.py

We calculated simple statistics (mean, std, quantiles) for each of the 53 feature channels of the original 3D fMRI .mat files.
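
A sketch of the per-channel statistics; the quantile levels here are illustrative, not necessarily the exact set used in create_img_statistics.py:

import numpy as np

def channel_stats(vol):
    # vol: (53, x, y, z) array; one row of statistics per feature channel.
    flat = vol.reshape(vol.shape[0], -1)
    stats = [flat.mean(axis=1), flat.std(axis=1)]
    # Quantile levels are placeholders for whichever set the script uses.
    stats += [np.quantile(flat, q, axis=1) for q in (0.05, 0.25, 0.5, 0.75, 0.95)]
    return np.stack(stats, axis=1)  # shape (53, 7)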

create_pca_dl_feats.py

The longest script: its execution took ~2.5 days on the hardware described above. We applied Incremental PCA to the fMRI data with n_components=200 and batch_size=200. The 53 channels were split into groups of 10 (6 groups in total) and flattened within each group, which yielded 1200 PCA features. Dictionary learning (DL) used n_components=100, batch_size=100, and n_iter=10 with the same channel-splitting scheme, which yielded 600 DL features.
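
A sketch of the per-group Incremental PCA fit with the parameters stated above; volume_iter and the remainder handling are simplifications of what create_pca_dl_feats.py actually does:

import numpy as np
from sklearn.decomposition import IncrementalPCA

GROUP_SIZE, N_CHANNELS = 10, 53  # 6 groups: 10+10+10+10+10+3 channels

def fit_group_pca(volume_iter, group):
    # volume_iter yields the resaved (53, x, y, z) float32 volumes.
    lo = group * GROUP_SIZE
    hi = min(lo + GROUP_SIZE, N_CHANNELS)
    ipca = IncrementalPCA(n_components=200, batch_size=200)
    batch = []
    for vol in volume_iter:
        batch.append(vol[lo:hi].reshape(-1))  # flatten the channel group
        if len(batch) == 200:
            ipca.partial_fit(np.stack(batch))
            batch = []
    # A final batch smaller than n_components is dropped in this sketch.
    return ipca

Transforming the same flattened groups then gives 200 features per group, 1200 in total. The DL features follow the same pattern with scikit-learn's MiniBatchDictionaryLearning(n_components=100, batch_size=100); note that recent scikit-learn releases renamed its n_iter parameter to max_iter.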

site_classifier.py

This script trained a site2 classifier and ran inference with it. We applied a StandardScaler fitted on the combined train + test data and used ElasticNet regression for modeling. Our model detected ~1400 new site2 observations.
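
A sketch of this scheme; the ElasticNet hyperparameters and the decision threshold below are placeholders, not the tuned values:

import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

def detect_site2(X_train, y_site2, X_test):
    # Scaler is fitted on train + test together, as described above.
    scaler = StandardScaler().fit(np.vstack([X_train, X_test]))
    # ElasticNet used as a regressor on the 0/1 site2 label;
    # alpha/l1_ratio are placeholder values.
    model = ElasticNet(alpha=0.1, l1_ratio=0.5)
    model.fit(scaler.transform(X_train), y_site2)
    scores = model.predict(scaler.transform(X_test))
    return scores > 0.5  # decision threshold is an assumption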

agg_feats.py

Simple statistics for the different FNC groups were calculated here.
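
A sketch of the aggregation, assuming the groups are obtained by stripping the component indices from the FNC column names (e.g. SCN(53)_vs_SCN(69) becomes SCN_vs_SCN); the exact grouping in agg_feats.py may differ:

import pandas as pd

fnc = pd.read_csv('data/raw/fnc.csv')
cols = fnc.columns.drop('Id')
# Strip the '(NN)' component indices to get a network-pair group per column.
groups = cols.str.replace(r'\(\d+\)', '', regex=True)
agg = {}
for g in groups.unique():
    member_cols = cols[groups == g]
    agg[f'{g}_mean'] = fnc[member_cols].mean(axis=1)
    agg[f'{g}_std'] = fnc[member_cols].std(axis=1)
agg_feats = pd.DataFrame(agg, index=fnc['Id'])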

compute_biases.py

Offsets for the test set were calculated to minimize the differences between the train and test distributions.
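
A minimal mean-matching sketch; the actual objective optimized in compute_biases.py may be more elaborate:

import numpy as np

def compute_offsets(train_feats, test_feats):
    # Shift each test feature so its mean matches the train mean.
    return train_feats.mean(axis=0) - test_feats.mean(axis=0)

# Usage: X_test_adjusted = X_test + compute_offsets(X_train, X_test)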

all_labels.py

We trained our ensemble and predicted the test data. More details can be found in our technical report (model_summary.pdf).

create_submission.py

We blended all validation seeds and applied postprocessing. The final submission can be found in predicts/submission/.
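
A sketch of the blending step, assuming per-seed prediction files in Kaggle's Id/Predicted submission layout; the file paths are hypothetical, and the postprocessing itself is described in the technical report:

import numpy as np
import pandas as pd

# Hypothetical per-seed prediction files with the sample_submission layout.
seed_preds = [pd.read_csv(f'predicts/seed_{s}.csv') for s in range(7)]
blend = seed_preds[0].copy()
blend['Predicted'] = np.mean(
    [p['Predicted'].to_numpy() for p in seed_preds], axis=0)
blend.to_csv('predicts/submission/submission.csv', index=False)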
