
1st-place-solution-single-cell-pbs

This repository implements the 1st place solution to the single-cell perturbations problem (open-problems-single-cell-perturbations).

General Methodology

  1. Input Features
  • Use one-hot encodings of cell_type and sm_name.
  • Add the mean, standard deviation, and 25%/50%/75% percentiles of the target values (differential expressions) per cell_type and per sm_name (see the feature sketch after this list).
  2. Model Architectures
  • Use LSTM, GRU, and 1d-CNN architectures (see models.py).
  3. Loss Functions and Optimizer
  • Use MSE, MAE, BCE, and LogCosh losses (see helper_classes.py).
  • Use the Adam optimizer to train the models in a 5-fold cross-validation setting.
  4. Hyperparameters
  • 250 epochs; learning rate 0.001 for the LSTM and 1d-CNN, 0.0003 for the GRU.
  • Clip the gradient norm to 1.0 during training (see the training sketch after this list).
  • Batch size 16.
  5. Predictions
  • Use weighted ensemble predictions; fold-wise, use the coefficients [0.25, 0.15, 0.2, 0.15, 0.25], and model-wise use [0.29, 0.33, 0.38] (see the ensemble sketch after this list).
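
The feature construction from step 1 is implemented in prepare_data.py; the sketch below is only an illustrative reading of it, assuming the training data is a pandas DataFrame with cell_type and sm_name columns and the remaining columns holding the differential-expression targets. The helper name build_features and the exact column naming are hypothetical.

```python
import pandas as pd

def build_features(de_train: pd.DataFrame, target_cols: list[str]) -> pd.DataFrame:
    """Illustrative: one-hot ids plus per-group statistics of the targets."""
    # One-hot encode the categorical identifiers.
    onehot = pd.get_dummies(
        de_train[["cell_type", "sm_name"]], dtype=float
    ).reset_index(drop=True)

    # Mean, standard deviation, and 25/50/75% percentiles of the targets,
    # computed per cell_type and per sm_name, then mapped back to each row.
    stats = []
    for key in ["cell_type", "sm_name"]:
        grouped = de_train.groupby(key)[target_cols]
        agg = grouped.agg(["mean", "std"])                              # (target, stat) columns
        quantiles = grouped.quantile([0.25, 0.5, 0.75]).unstack(level=-1)
        per_key = pd.concat([agg, quantiles], axis=1)
        per_key.columns = [f"{key}_{t}_{s}" for t, s in per_key.columns]
        stats.append(per_key.loc[de_train[key]].reset_index(drop=True))

    return pd.concat([onehot] + stats, axis=1)
```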
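The losses themselves live in helper_classes.py and the training loop in train.py; the fragment below is only a minimal sketch, assuming a standard PyTorch setup, of how a LogCosh loss, the Adam optimizer, and the gradient-norm clipping from steps 3-4 could fit together (LogCoshLoss and train_one_epoch are hypothetical names, not the repository's actual classes).

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class LogCoshLoss(nn.Module):
    """Mean of log(cosh(pred - target)): a smooth compromise between MSE and MAE."""
    def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        diff = pred - target
        # Numerically stable identity: log(cosh(x)) = x + softplus(-2x) - log(2)
        return (diff + F.softplus(-2.0 * diff) - math.log(2.0)).mean()

def train_one_epoch(model, loader, optimizer, loss_fn, device):
    model.train()
    for x, y in loader:                      # batch size 16 in this solution
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        # Clip the global gradient norm to 1.0, as in step 4.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()

# Example: optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # 3e-4 for the GRU
```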
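The fold-wise and model-wise coefficients in step 5 are given above, but the exact combination code is in predict.py; the snippet below is one plausible reading, assuming per-fold predictions are first blended with the fold weights and the three architectures are then blended with the model weights.

```python
import numpy as np

FOLD_WEIGHTS = np.array([0.25, 0.15, 0.2, 0.15, 0.25])   # one weight per CV fold
MODEL_WEIGHTS = np.array([0.29, 0.33, 0.38])              # one weight per architecture

def ensemble(preds: np.ndarray) -> np.ndarray:
    """preds has shape (n_models=3, n_folds=5, n_samples, n_targets)."""
    per_model = np.tensordot(FOLD_WEIGHTS, preds, axes=([0], [1]))   # (3, n_samples, n_targets)
    return np.tensordot(MODEL_WEIGHTS, per_model, axes=([0], [0]))   # (n_samples, n_targets)
```

Both weight vectors sum to 1.0, so the ensemble stays on the same scale as the individual predictions.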

Installation

Make sure Anaconda3 is installed and execute the following:

  1. Clone this repository: git clone https://github.com/Jean-KOUAGOU/1st-place-solution-single-cell-pbs.git

  2. Create and activate a conda environment: conda create -n single_cell_env python==3.9.0 -y && conda activate single_cell_env

  3. Install all required packages in the environment: pip install -r requirements.txt

Dependencies

  1. python 3.9.0
  2. pandas 2.1.3
  3. pyarrow 14.0.1
  4. tqdm 4.66.1
  5. scikit-learn 1.3.2
  6. torch 2.1.1
  7. transformers 4.35.2
  8. matplotlib 3.8.2

Hardware

  • Ubuntu 20.04.6 LTS (Kaggle): AMD EPYC 7B12 CPU @ 2.25 GHz (4 CPUs), 30 GB RAM, 1x Tesla P100 GPU (16 GB), 73 GB disk
  • Also tested on Debian GNU/Linux 11: AMD EPYC 7282 16-core processor @ 3.2 GHz (32 CPUs), 252 GB RAM, 1x Nvidia RTX 3090 GPU (24 GB), 500 GB disk

Preprocessing

  1. Create a folder called data/ in the main directory

  2. Add the training data in parquet format (e.g., de_train.parquet from the competition) and check that its path is correct in SETTINGS.json

  3. Also add the test data and a sample submission file (both should be CSV files) to the same directory data/ and check SETTINGS.json for path correctness (a quick path check is sketched after this list)

  4. Run python prepare_data.py to complete all required preprocessing steps
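
Before running prepare_data.py, you can optionally sanity-check the paths listed in SETTINGS.json with a few lines of Python. This sketch assumes every value in SETTINGS.json is a file or directory path, which may not hold for all keys; check the file itself for the actual layout.

```python
import json
from pathlib import Path

# Print every entry of SETTINGS.json and whether the path it points to exists.
settings = json.loads(Path("SETTINGS.json").read_text())
for key, value in settings.items():
    status = "ok" if Path(str(value)).exists() else "MISSING"
    print(f"{key:30s} {value} [{status}]")
```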

Training

Make sure you are at the top level of this GitHub repository.

  • Run python train.py to train models. This will automatically create a directory called trained_models and store the trained models there.
  • Pretrained models can also be downloaded from Kaggle to avoid training (see the link in the Reproduction section below).

Predicting

Check that there is a non-empty directory named trained_models and that its path is specified in SETTINGS.json under MODEL_DIR

  • Run python predict.py to predict on the test data whose path is specified in SETTINGS.json. This will automatically create an output directory specified in SETTINGS.json and store the predictions in a file named submission.csv

Reproduction (Docker)

  1. Create a directory data in this GitHub repository
  2. If there is no directory named trained_models at the top level of this repository, create an empty directory with this name
  3. Add de_train.parquet, id_map.csv, and sample_submission.csv into the directory data
  4. If necessary, edit SETTINGS.json to specify the correct paths
  5. Make sure your machine has at least 16 GB RAM
  6. Execute ./build.sh to build a Docker image
  7. If you would like to predict with pretrained models:
  • Download the trained models from Kaggle at https://www.kaggle.com/datasets/jeannkouagou/best-models-single-cell/data, and place them under a folder named trained_models at the top level of this GitHub repository
  • Execute ./run.sh predict to run the container and directly predict using the trained models. The output will be a CSV file named submission.csv in the main directory.
  8. Execute ./run.sh train_and_predict to train new models and predict.
  • I recommend training on a GPU, as it can take very long on a CPU.
  • Training on a GPU can take between 6 hours (e.g., on an Nvidia RTX 3090) and 10 hours (e.g., on a Tesla P100), depending on the GPU used.
  • If the objective is not to reproduce the results, you can also change configurations in config, such as learning rate, epochs, etc., before building the container image.

Note: ./run.sh should always be run with an argument; there are two possibilities, ./run.sh predict or ./run.sh train_and_predict. If you encounter an error in steps 7 or 8, there is probably a conflicting container name, e.g., because you have executed ./run.sh several times. The error might look like: The container name "single_cell_container" is already in use by container container_id. In that case, delete the old container with sudo docker rm <container_id>, and retry.
