ProteoNeMo

This repository contains Peptone Ltd's code for pre-training protein language models at proteome scale, and for running inference with them, built on the NVIDIA NeMo toolkit.

ProteoNeMo can be used to extract residue level representations of proteins and to train models on related downstream tasks.

Table of Contents

  • Usage
      • Quick-start
  • Datasets
  • ProteoNeMo pre-training
  • Tensorboard monitoring
  • Residue level representations extraction
  • Licence

Quick-start

As a prerequisite, you must have NeMo 1.7 or later installed to use this repository.

Install the proteonemo package:

Clone the ProteoNeMo repository, change into the ProteoNeMo directory, and run:

python setup.py install
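
To check that the installation worked, you can try importing the package from Python. This is a minimal sketch; it only assumes that the proteonemo package is importable after the step above.

# Minimal post-install sanity check: the import fails if the package
# was not installed correctly.
import proteonemo
print("proteonemo imported from:", proteonemo.__file__)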

Datasets

ProteoNeMo can be pre-trained on:

  • UniRef 50, 90 and 100
  • UniParc
  • UniProtKB (Swiss-Prot, TrEMBL and isoform sequences)

Download and preprocess datasets

ProteoNeMo can be pre-trained on the datasets listed above. You can choose your preferred one or combine two or more of them.

Each dataset will be:

  • Downloaded from UniProt and decompressed as a .fasta file
  • Sharded into several smaller .txt sub-files, each containing a random subset of the sequences from the related .fasta file, already split into training and evaluation samples
  • Tokenized into several .hdf5 files, one for each sharded .txt file, with the masking procedure already applied (see the inspection sketch after this list)
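
If you want to inspect one of the tokenized shards, it can be opened with h5py. The snippet below is a minimal sketch: the file name is hypothetical, and the dataset keys printed are simply whatever your shard actually contains, since the exact key names are not documented here.

import h5py

# Hypothetical path to one tokenized shard under $BERT_PREP_WORKING_DIR/hdf5
shard_path = "<your_dir>/hdf5/train_shard_0.hdf5"

with h5py.File(shard_path, "r") as f:
    # Print every dataset stored in the shard with its shape and dtype
    for name, dset in f.items():
        print(name, dset.shape, dset.dtype)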

In the ProteoNeMo directory run:

export BERT_PREP_WORKING_DIR=<your_dir>
cd scripts
bash create_datasets_from_start.sh <to_download> 

Where:

  • BERT_PREP_WORKING_DIR defines the directory where the data will be downloaded and preprocessed
  • <to_download> defines the datasets you want to download and preprocess; uniref_50_only is the default. The accepted values are listed in the table below.

The outputs are the download, sharded and hdf5 directories, created under $BERT_PREP_WORKING_DIR and containing the related files.

To Download        Datasets
uniref_50_only     UniRef 50
uniref_all         UniRef 50, 90 and 100
uniparc            UniParc
uniprotkb_all      UniProtKB Swiss-Prot, TrEMBL and isoform sequences

ProteoNeMo pre-training

Once the download and preprocessing procedure is completed you're ready to pre-train ProteoNeMo.

The pre-training procedure exploits NeMo to solve the Masked Language Modeling (Masked LM) task. One training instance of Masked LM is a single modified protein sequence. Each token in the sequence has a 15% chance of being selected for masking. A selected token is replaced with [MASK] 80% of the time, replaced with a random token 10% of the time, and kept unchanged the remaining 10% of the time. The task is then to predict the original token.
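
As a minimal, self-contained sketch of that masking scheme (not the actual ProteoNeMo implementation, which works on token ids from the vocabulary file), the logic looks roughly like this:

import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", select_prob=0.15):
    """BERT-style masking: each token has a 15% chance of being selected;
    a selected token becomes [MASK] 80% of the time, a random vocabulary
    token 10% of the time, and stays unchanged 10% of the time."""
    masked, labels = [], []
    for tok in tokens:
        if random.random() < select_prob:
            labels.append(tok)            # target the model must predict
            r = random.random()
            if r < 0.8:
                masked.append(mask_token)
            elif r < 0.9:
                masked.append(random.choice(vocab))
            else:
                masked.append(tok)        # selected but kept unchanged
        else:
            labels.append(None)           # not selected: no prediction target
            masked.append(tok)
    return masked, labels

# Toy example on a short amino-acid sequence
amino_acids = list("ACDEFGHIKLMNPQRSTVWY")
print(mask_tokens(list("MKTAYIAKQR"), vocab=amino_acids))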

We have currently integrated BERT-like uncased models from HuggingFace.

The first thing you need to do is create a model_config.yaml file in the conf directory, specifying the relevant pre-training and model options. You can use this config as a template.

Take a look at these NeMo tutorials to get familiar with these options.

Secondly, you have to modify the config_name argument of the @hydra_runner decorator in bert_pretraining.py so that it points to your configuration file.
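
For illustration only (the decorator arguments in the repository may differ), the relevant part of bert_pretraining.py looks roughly like this, with config_name pointing at your .yaml file in the conf directory:

# Sketch of the decorator in bert_pretraining.py; config_path and the
# function body are assumptions, only config_name needs to match your file.
from nemo.core.config import hydra_runner

@hydra_runner(config_path="../conf", config_name="model_config")
def main(cfg):
    # cfg holds the options from conf/model_config.yaml
    ...

if __name__ == "__main__":
    main()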

Lastly, in the ProteoNeMo directory run:

cd scripts
python bert_pretraining.py 

The pre-training will start and a progress bar will appear.

Tensorboard monitoring

Once the pre-training procedure has started a nemo_experiments directory will be automatically created under the scripts directory.

Based on the name: <PretrainingModelName> parameter in the .yaml configuration file, a <PretrainingModelName> sub-directory containing all the related pre-training experiment logs will be created under nemo_experiments.

In the ProteoNeMo directory run:

tensorboard --logdir=scripts/nemo_experiments/<PretrainingModelName> 

The TensorBoard UI will be available on port 6006.

Residue level representations extraction

Once a ProteoNeMo model has been pre-trained, you'll get a .nemo file placed in the nemo_path you've specified in the .yaml configuration file.

You're now ready to extract the residue level representations of each protein in a .fasta file.

In the ProteoNeMo directory run:

cd scripts
python bert_eval.py --input_file <fasta_input_file> \
                    --vocab_file ../static/vocab.txt \
                    --output_dir <reprs_output_dir> \
                    --model_file <nemo_pretrained_model>

Where:

  • --input_file defines the .fasta file containing the proteins for which you want to extract the residue level representations
  • --vocab_file defines the .txt file containing the vocabulary you want to use during the inference phase. We suggest you use the standard one
  • --output_dir defines the output directory where the residue level representations will be written. You'll get a .pt file for each protein sequence in the --input_file (see the loading sketch after this list)
  • --model_file defines the .nemo file used to get the pre-trained weights needed to get the residue level representations
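
Each .pt file can then be loaded with PyTorch. The snippet below is a minimal sketch; the file name is hypothetical and the exact layout of the saved object (for example a tensor of shape sequence length x hidden size) should be verified against your own outputs.

import torch

# Hypothetical representation file written to <reprs_output_dir>
reprs = torch.load("<reprs_output_dir>/P69905.pt", map_location="cpu")

# Depending on how it was saved, this may be a tensor or a dict of tensors.
if torch.is_tensor(reprs):
    print("shape:", reprs.shape)   # e.g. (sequence_length, hidden_size)
else:
    print(type(reprs), reprs.keys() if hasattr(reprs, "keys") else reprs)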

Licence

This source code is licensed under the Apache 2.0 license found in the LICENSE file in the root directory of this source tree.