Lab Immersion at EPFL
Lab: Bitbol Lab - Laboratory of Computational Biology and Theoretical Biophysics
Professor: Anne-Florence Bitbol
Supervisor: Damiano Sgarbossa
This project is an extension of the paper Protein language models trained on multiple sequence alignments learn phylogenetic relationships, which shows that regressions on predictors derived from the MSA Transformer's column attentions can capture Hamming distance, a simple proxy for phylogenetic relatedness. However, without the true phylogenetic tree, we lack accurate ground-truth values for the distances.
Therefore, we go beyond this by generating synthetic sequences along existing (known) trees and using the patristic distances derived from those trees to analyze whether MSA Transformer-based embeddings can capture them. We do this by repeating the regression analysis, but also by fine-tuning the MSA Transformer.
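To make the setup concrete, the following sketch (not part of the repository; the tensor shape and the averaging over columns are assumptions based on the fair-esm package) shows how per-pair predictors can be derived from the MSA Transformer's column attentions:
import torch
import esm

# Load the pre-trained MSA Transformer from fair-esm.
model, alphabet = esm.pretrained.esm_msa1b_t12_100M_UR50S()
model.eval()
batch_converter = alphabet.get_batch_converter()

# One MSA = list of (label, aligned sequence) pairs; toy alignment for illustration.
msa = [("seq0", "MKT-LLV"), ("seq1", "MKV-LLI"), ("seq2", "MRT-LLV")]
_, _, tokens = batch_converter([msa])

with torch.no_grad():
    out = model(tokens, need_head_weights=True)

# Assumed shape: (batch, layers, heads, columns, rows, rows).
col_attn = out["col_attentions"]
# Average over columns, then use the layers*heads values per sequence pair as predictors.
feats = col_attn.mean(dim=3).squeeze(0)            # (layers, heads, rows, rows)
pair_feats = feats.permute(2, 3, 0, 1).reshape(len(msa), len(msa), -1)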
For more details on data, pipeline, and models, refer to Presentation.pdf.
- Clone the repository.
git clone https://github.com/masazelic/Fine-tuning-MSA-Transformer.git
- Install the requirements.
pip install -r requirements.txt
To run the following section of code, go into the bmDCA folder.
- Subsampling Data
Synthetic sequences for 15 protein families (PF00004, PF00005, PF00041, PF00072, PF00076, PF00096, PF00153, PF00271, PF00397, PF00512, PF00595, PF01535, PF02518, PF07679, PF13354) were generated using the bmDCA approach along the trees corresponding to the natural sequences. Trees can be found at link, while sequences can be found at link.
If you want to subsample 100 sequences per family using the random subsampling approach, run:
python generate_distance_matrices.py -ft <your_tree_folder> -mf <your_msa_folder> -md 100 -a random -om <your_msa_output_folder> -dm <your_distance_matrix_folder>
NOTE: Subsampling 100 random sequences per family takes around 47 minutes for all families; increasing this to 500 sequences takes over 2 hours for a single family.
For more details on available command line arguments, run:
python generate_distance_matrices.py -h
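For reference, the patristic distance between two leaves is the sum of branch lengths along the path connecting them in the tree. A minimal, illustrative way to compute such a matrix with Biopython (not necessarily what generate_distance_matrices.py does internally; the tree filename is a placeholder) is:
import numpy as np
from Bio import Phylo

# Placeholder Newick tree for one family.
tree = Phylo.read("PF00004_tree.nwk", "newick")
leaves = tree.get_terminals()

# Patristic distance = sum of branch lengths on the path between two leaves.
dist = np.zeros((len(leaves), len(leaves)))
for i, a in enumerate(leaves):
    for j, b in enumerate(leaves):
        if j > i:
            dist[i, j] = dist[j, i] = tree.distance(a, b)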
- Regression
For performing the regression analysis, go over Regression.ipynb in the regression folder. The cell that requires user input of specific paths/parameters is clearly noted.
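The notebook contains the actual pipeline; purely as a sketch, the regression step amounts to mapping per-pair column-attention features to patristic distances, for example with ridge regression (file names below are placeholders):
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Placeholder inputs: X has one row per sequence pair (layers*heads attention
# features), y holds the corresponding patristic distances.
X = np.load("attention_features.npy")
y = np.load("patristic_distances.npy")

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
reg = Ridge(alpha=1.0).fit(X_tr, y_tr)
print("R^2 on held-out pairs:", reg.score(X_te, y_te))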
- Training a fully-connected network with MSA Transformer's column attentions as embeddings
Code supporting this approach can be found in the fc_network folder.
For tuning hyperparameters such as the network architecture and learning rate for the subtree approach, use the following command (you need to enter the fc_network folder before doing so):
python hyperparameter_tuning.py -af <your_attentions_folder> -df <your_distances_folder> -a <subtree/random>
NOTE 1: The code assumes that attention matrices and distances are pre-computed. For the specific families noted above, you can download patristic distances from link and attention matrices from link (subtree approach) and link (random approach).
NOTE 2: File names for both distances and attention matrices follow the form familyName_approach_var.npy.
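Assuming that naming convention, the pre-computed arrays can be loaded per family along these lines (the folder names and the value of var are placeholders):
import numpy as np
from pathlib import Path

# Placeholder folders and 'var' names; adjust to the actual file names.
attn_dir, dist_dir = Path("attentions"), Path("distances")
family, approach = "PF00004", "subtree"

attn = np.load(attn_dir / f"{family}_{approach}_attentions.npy")
dists = np.load(dist_dir / f"{family}_{approach}_distances.npy")
print(attn.shape, dists.shape)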
Running this file produces a .pkl file containing the best hyperparameters selected according to the criterion. You can find the best hyperparameters for both approaches in the parameters folder.
For training the model with the optimal parameters, run the following command:
python run.py -af <your_attentions_folder> -df <your_distances_folder> -a <subtree/random> -r train -bp <your_best_params_file>
Running this command results in a model named trained_approach.pth being saved in the models folder. Trained models for both approaches can already be found there.
For evaluating the trained model on the test dataset, run the following command:
python run.py -af <your_attentions_folder> -df <your_distances_folder> -a <subtree/random> -r test -bp <your_best_params_file>
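The network architecture and learning rate are whatever the hyperparameter search selects; purely as an illustration of the approach, a fully-connected regressor from per-pair attention features to a distance could look like this (layer sizes and feature dimension are made up):
import torch
import torch.nn as nn

class DistanceRegressor(nn.Module):
    """Toy fully-connected network: per-pair attention features -> distance."""
    def __init__(self, in_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

# Example: 144 features = 12 layers x 12 heads of column attention per pair.
model = DistanceRegressor(in_dim=144)
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)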
- Fine-tuning MSA Transformer's column attentions with LoRA
This code can be found in the finetune_msa folder. To train and evaluate the model, run the following command:
python train_pipeline.py -mf <your_msas_folder> -df <your_distances_folder> -cp <your_checkpoint_folder> -a bmDCA
NOTE 1: Since training and evaluation took around an hour, a non-exhaustive hyperparameter search was performed. The best parameters (for which results are reported) are specified at the top of the train_pipeline.py script.
NOTE 2: The trained model is larger than 100 MB and is therefore not included.
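For context, LoRA freezes the pre-trained weights and adds trainable low-rank matrices to selected linear layers (here, the projections feeding the column attentions). A minimal, generic sketch of the idea in PyTorch (not the repository's implementation):
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze pre-trained weights
        self.lora_A = nn.Parameter(torch.zeros(rank, base.in_features))
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        nn.init.normal_(self.lora_A, std=0.01)   # B stays zero, so the initial update is zero
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

# Usage: replace e.g. a column-attention projection with a LoRA-wrapped copy.
proj = nn.Linear(768, 768)
proj_lora = LoRALinear(proj, rank=8)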
ESM2-generated sequences are only used for fine-tuning the MSA Transformer's column attentions due to their larger volume. They come in the form of small trees (up to 50 sequences each). Code for this approach is stored in the ESM2 folder.
For the organization of the folder containing the ESM2 data, refer to link. Note that the val folder contains the test data.
To save computational time during training, distance matrices are pre-computed for each small tree and saved as .pkl dictionaries by running the following command:
python create_and_store_dists_matrices_esm.py -ef <your_esm_folder> -dft <your_dists_train_folder> -dfts <your_dists_test_folder>
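For reference, the pre-computed distances are just pickled dictionaries of per-tree matrices; a sketch of writing and reading such a file (keys and file name are illustrative) could be:
import pickle
import numpy as np

# Illustrative example: one symmetric patristic distance matrix per small tree.
m = np.random.rand(50, 50)
dists = {"tree_0001": np.triu(m, 1) + np.triu(m, 1).T}

with open("tree_0001_dists.pkl", "wb") as f:
    pickle.dump(dists, f)

with open("tree_0001_dists.pkl", "rb") as f:
    loaded = pickle.load(f)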
To train and evaluate the model, run the following command:
python train_pipeline.py -cp <your_checkpoint_folder> -a esm -esmf <your_esm_folder>
Regarding hyperparameters and the trained model, the same rules as above apply.