NovaBind

The Salimov and Frolov Laboratory — winners in the international Ibis competition for predicting transcription factor binding levels to DNA sequences. We participated in predictions on genomic sequences, using synthetic data. More details about the architecture and the methods used can be found in this Google Document: https://clck.ru/3Ddv7i.

Here we demonstrate how NovaBind was run to produce the predictions we submitted to the competition.

Environment

We used a server with a GPU running Ubuntu 20.04.6. To set up the environment, please use the following command:

conda env create -f environment.yml
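
Then activate the environment before running any of the scripts. For example (the environment name below is an assumption; use the name defined in environment.yml):

conda activate novabind  # 'novabind' is a placeholder for the name set in environment.yml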

Input data

You can find the input data on the Ibis site. The archive is too large to include here. Please download it, unzip it, and place the data folder in the root directory of the repository, where all the scripts are located.
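
For example, assuming the downloaded archive is named ibis_data.zip (the actual file name on the Ibis site may differ) and the repository is checked out at /path/to/NovaBind:

unzip ibis_data.zip          # extracts the data folder
mv data /path/to/NovaBind/   # place it next to prep_data.py and the other scripts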

Reproduction

Data preprocessing

Step 1. Extract the files and convert them to a unified .csv format. For the model ensembling later on, the data is immediately split into folds. This creates the folds_PBM, folds_HTS, and test directories with the necessary data.

To run the script that does this, execute the following command in bash:

python prep_data.py
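
Purely as an illustration of the fold-splitting idea (not the actual prep_data.py; it assumes a pandas DataFrame df loaded from the unified .csv data, and the function and output file names are hypothetical):

import numpy as np
import pandas as pd

def split_into_folds(df: pd.DataFrame, n_folds: int = 3, seed: int = 0) -> list[pd.DataFrame]:
    # Shuffle the rows, then cut the permutation into n_folds roughly equal parts.
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(df))
    return [df.iloc[idx].reset_index(drop=True) for idx in np.array_split(order, n_folds)]

# e.g. write each fold to its own file, mirroring the folds_* layout:
# for i, fold in enumerate(split_into_folds(df)):
#     fold.to_csv(f"folds_PBM/fold_{i}.csv", index=False)  # file naming is an assumption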

Step 2. We are now ready to split the folds into training and validation sets. DNA sequences are encoded using one-hot encoding, and the complementary sequences are added to the data. For data from the GHTS and CHS experiments, sequence segmentation is performed using a sliding window with a stride of 1. These steps are performed by the encode_data.py script:

python encode_data.py
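
For illustration, a minimal sketch of these three encoding steps (not the actual encode_data.py; the default window length below is an assumption, and the sketch assumes sequences contain only A/C/G/T):

import numpy as np

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def one_hot(seq: str) -> np.ndarray:
    # Encode a DNA sequence as a (length, 4) one-hot matrix.
    out = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq):
        out[i, BASES.index(base)] = 1.0
    return out

def reverse_complement(seq: str) -> str:
    # Complementary strand, read in the opposite direction to the input.
    return seq.translate(COMPLEMENT)[::-1]

def sliding_windows(seq: str, window: int = 40, stride: int = 1) -> list[str]:
    # Segment a long (GHTS/CHS) sequence into overlapping windows with stride 1.
    return [seq[i:i + window] for i in range(0, len(seq) - window + 1, stride)]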

Training

Step 3. To start training, use the parallel_training.py script. The training procedure is the same for both experiment types: for PBM and for HTS, three repetitions with seeds 0, 1, and 2 are run for each of the three folds (nine runs in total). To select the training mode, set the --type_exp argument to either 'PBM' or 'HTS'. Note that training runs in parallel on the available GPUs; in both cases, if fewer than 9 GPUs are available, all available devices are used and the remaining tasks are queued.

We recommend running the following two commands in sequence, with the second one delayed until the first training stage is complete.

python parallel_training.py --type_exp PBM
python parallel_training.py --type_exp HTS
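
Conceptually, the GPU queueing behaviour works like the following sketch (not the actual parallel_training.py; N_GPUS and the per-run entry point train_fold.py are illustrative assumptions):

import itertools
import multiprocessing as mp
import os
import subprocess

N_GPUS = 3  # assumption; in practice all visible GPUs are used

def init_worker(gpu_queue):
    # Pin each worker process to one GPU for its whole lifetime.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_queue.get())

def run_task(task):
    fold, seed = task
    # train_fold.py is a hypothetical per-run entry point used for illustration.
    subprocess.run(["python", "train_fold.py", "--fold", str(fold), "--seed", str(seed)], check=True)

if __name__ == "__main__":
    gpu_queue = mp.Queue()
    for gpu in range(N_GPUS):
        gpu_queue.put(gpu)
    tasks = list(itertools.product(range(3), (0, 1, 2)))  # 3 folds x 3 seeds = 9 runs
    # One worker per GPU; tasks beyond the GPU count wait in the pool's queue.
    with mp.Pool(processes=N_GPUS, initializer=init_worker, initargs=(gpu_queue,)) as pool:
        pool.map(run_task, tasks)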

The model weights are saved in the models_PBM and models_HTS folders, respectively. We have also saved our trained model weights in this repository, in case you want to skip training and proceed directly to prediction.

Prediction

Step 4. To generate predictions, run the make_predict.py script with the --type_exp argument set to 'PBM' or 'HTS'; this specifies which experiments the prediction is based on.

Prediction | Based on    | Discipline
---------- | ----------- | ----------
PBM        | PBM         | Secondary
GHTS       | PBM and HTS | Primary
CHS        | PBM and HTS | Primary
HTS        | HTS         | Secondary

If you want to run the PBM-based and HTS-based predictions in parallel, specify the device number on which each should run:

python make_predict.py --device 0 --type_exp PBM
python make_predict.py --device 1 --type_exp HTS

The predictions of the individual models are summed, and min-max scaling is applied to the sum. To merge the prediction results, run the script:

python get_results.py
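
The merging step amounts to the following (a minimal sketch, not the actual get_results.py):

import numpy as np

def ensemble(per_model_predictions: list[np.ndarray]) -> np.ndarray:
    # Sum the predictions of the individual models, then min-max scale to [0, 1].
    total = np.sum(per_model_predictions, axis=0)
    return (total - total.min()) / (total.max() - total.min())

# e.g. scores from several models for the same set of sequences:
# ensemble([pred_fold0, pred_fold1, pred_fold2])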
