
Protein Phenotype Prediction

This repository holds the code for processing, analysis, and model development on the mega-scale protein folding stability dataset (https://www.biorxiv.org/content/10.1101/2022.12.06.519132v1.full.pdf); the models can also be applied to data from MaveDB (https://mavedb.org/#/).


Data

Datasets can be downloaded from https://zenodo.org/record/7401275: download K50_dG_tables.zip and the Processed_K50_dG_datasets archive.
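For convenience, the files can also be fetched programmatically. A minimal sketch; the exact file names on the record, in particular the .zip extension of the processed datasets archive, are assumptions and should be verified against the record:

import urllib.request

# Fetch the archives from the Zenodo record given above;
# file names are taken from this README and may differ on the record.
BASE = "https://zenodo.org/record/7401275/files"
for name in ["K50_dG_tables.zip", "Processed_K50_dG_datasets.zip"]:
    urllib.request.urlretrieve(f"{BASE}/{name}?download=1", name)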

Notebooks and scripts

1) Data_Analysis_and_Methods_overview.ipynb

Exploration of the dataset's features and a discussion of prior related models used in protein modelling and sequence analysis.

2) embed_sequences.py

A script to generate embeddings with the pretrained ProtT5 protein language model. Run as:

python embed_sequences.py <sequences_csv_name> [<output_file>]
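For orientation, here is a minimal sketch of the embedding step, assuming the half-precision ProtT5 encoder from ProtTrans (the checkpoint name and the mean-pooling choice are assumptions; see the ProtTrans repository for the reference implementation):

import re
import torch
from transformers import T5EncoderModel, T5Tokenizer

# Assumed checkpoint; embed_sequences.py may use a different ProtT5 variant
MODEL = "Rostlab/prot_t5_xl_half_uniref50-enc"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = T5Tokenizer.from_pretrained(MODEL)
model = T5EncoderModel.from_pretrained(MODEL).to(device).eval()

seq = "MKTAYIAKQR"
# ProtT5 expects space-separated residues, with rare amino acids mapped to X
spaced = " ".join(re.sub(r"[UZOB]", "X", seq))
batch = tokenizer(spaced, return_tensors="pt").to(device)
with torch.no_grad():
    out = model(**batch).last_hidden_state
per_residue = out[0, : len(seq)]       # (L, 1024) per-residue embeddings
per_protein = per_residue.mean(dim=0)  # mean-pooled per-protein embedding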

3) run_MAVE_model.sh

An example training script for supervised models.

4) Model_evaluation_and_outlook.ipynb

Analysis of the predictions made by the different model stages and of the modelling assumptions, with an outlook on improvements and further development.

Computational Environment:

The required packages can be installed with

conda env create -f prot_ml.yml

Activate the environment with

conda activate prot_ml

before opening the notebooks. If you are working on Colab, install the following packages at the start of the runtime:

pip install pytorch-lightning hydra-core pyrootutils mlflow
pip install torch transformers sentencepiece h5py

Precalculated Datasets:

Notebooks can be run with the provided mega_scale_ddg_single_mut_with_seq_no_con.csv input file. However, parts of the project use embeddings from the pretrained large protein language model ProtT5 (https://github.com/agemagician/ProtTrans), which are computationally expensive to extract. The precalculated per-residue and per-protein embeddings for all sequences (....h5) and for the subset used here for training (....h5) are therefore provided separately. Place these files in the protT5/output directory.
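A minimal sketch for reading such an embedding file with h5py; the file name and the one-dataset-per-sequence layout are assumptions:

import h5py
import numpy as np

# File name is a placeholder for one of the provided embedding files
with h5py.File("protT5/output/per_protein_embeddings.h5", "r") as f:
    ids = list(f.keys())                     # assumed: one dataset per sequence id
    emb = np.stack([f[i][()] for i in ids])  # -> (n_sequences, embedding_dim)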

Model development is done with the models defined in the protml module (the protml folder).

The protml training module

The protml module defines models and model components that are trained on protein sequences and map them to phenotype. protml uses Hydra for experiment configuration management, which allows for quick and convenient experiment configuration and hyperparameter optimization. Basic model parameters are configured in hierarchically structured .yaml files that can be combined, and individual hyperparameters can be overridden from the command line.
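The general pattern looks as follows (a minimal sketch; the entry-point and config names are illustrative, not the actual protml code):

import hydra
from omegaconf import DictConfig, OmegaConf

# Illustrative entry point: Hydra composes the config from the yaml group
# files under config_path and applies any command-line overrides on top.
@hydra.main(config_path="protml/configs", config_name="train", version_base=None)
def main(cfg: DictConfig) -> None:
    print(OmegaConf.to_yaml(cfg))  # the fully resolved configuration

if __name__ == "__main__":
    main()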

Documentation of all module components

The configurations for the ML models contain YAML descriptions of all individual components. A simple MLP-based variational encoder component, for example, looks as follows:

_target_: protml.models.encoders.VariationalEncoder
model_params:
  hidden_layer_sizes: [100, 100, 50]
  z_dim: ${z_dim}
  nonlinear_activation: "relu"
  dropout_prob: 0.0
  mu_bias_init: 0.1
  log_var_bias_init: -10.0

The _target_ attribute allows Hydra to instantiate the model class associated with the YAML description. In a run, any argument of the encoder class can be changed from the command line, e.g.

python3 -m protml.apps.train experiment=supervised/train_base \
    train_data=<PATH_TO_TRAINING_DATA> \
    val_data=<PATH_TO_VALIDATION_DATA> \
    trainer.max_epochs=50000 \
    model.encoder.model_params.hidden_layer_sizes=[100,100,100,100,100] z_dim=10

where the experiment keyword selects the basic experiment setup (experiments are located in protml/configs/experiment) and model.encoder.model_params.hidden_layer_sizes=[100,100,100,100,100] specifies a new architecture for the encoder module.
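To illustrate what the _target_ mechanism does, here is a minimal sketch of instantiating the encoder config above outside a full training run (the file path is illustrative; the config is nested so that the ${z_dim} interpolation can resolve against the config root):

from hydra.utils import instantiate
from omegaconf import OmegaConf

# Path is illustrative; supply a root-level z_dim for the interpolation
encoder_cfg = OmegaConf.load("protml/configs/model/encoder/variational.yaml")
cfg = OmegaConf.create({"z_dim": 10, "encoder": encoder_cfg})
encoder = instantiate(cfg.encoder)  # -> protml.models.encoders.VariationalEncoder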

Several example run commands can be found in the notebook Experiments_and_Model_training.ipynb.
