From 0b28a2e00d8ef14ac36f6fe4bf81740f96f9eace Mon Sep 17 00:00:00 2001
From: Felix Becker
Date: Tue, 22 Aug 2023 15:52:49 +0200
Subject: [PATCH] Update README.md

---
 README.md | 44 +++++++++++++++++++++-----------------------
 1 file changed, 21 insertions(+), 23 deletions(-)

diff --git a/README.md b/README.md
index 768a36a..5649f13 100644
--- a/README.md
+++ b/README.md
@@ -3,16 +3,17 @@
 # learnMSA: Learning and Aligning large Protein Families
 
 # Introduction
-Multiple sequence alignment formulated as a statistical machine learning problem, where an optimal profile hidden Markov model for a potentially ultra-large family of protein sequences is searched and an alignment is decoded. We use an automatically differentiable variant of the Forward algorithm to align via gradient descent.
+Multiple sequence alignment formulated as a statistical machine learning problem, where an optimal profile hidden Markov model for a potentially ultra-large family of protein sequences is learned from unaligned sequences and an alignment is decoded. We use a novel, automatically differentiable variant of the Forward algorithm to train pHMMs via gradient descent.
 
 ## Features
 
 - Aligns large numbers of protein sequences with state-of-the-art accuracy
 - Enables ultra-large alignment of millions of sequences
+- GPU acceleration, multi-GPU support
 - Scales linear in the number of sequences (does not require a guide tree)
 - Memory efficient (depending on sequence length, aligning millions of sequences on a laptop is possible)
-- GPU acceleration
 - Visualize a profile HMM or a sequence logo of the consensus motif
+- Experimental use of large protein language models to improve alignment accuracy
 
 ## Current limitations
 
@@ -28,18 +29,11 @@ Choose according to your preference:
 
   If you haven't done it yet, set up [Bioconda channels](https://bioconda.github.io/) first.
 
-  *Recommended way to install learnMSA (including aligned insertion support via famsa):*
+  *Recommended way to install learnMSA:*
 
   ```
   conda install mamba
-  mamba create -n learnMSA_env tensorflow==2.10.0 learnMSA famsa
-  ```
-
-  *Without aligned insertion support*
-
-  ```
-  conda install mamba
-  mamba create -n learnMSA_env tensorflow==2.10.0 learnMSA
+  mamba create -n learnMSA_env learnMSA
   ```
 
   which creates an environment called `learnMSA_env` and installs learnMSA in it.
@@ -58,7 +52,7 @@ Choose according to your preference:
 
 *Optional, but recommended for proteins longer than 100 residues. The install instructions above may already be sufficient to support GPU depending on your system. LearnMSA will notify you whether it finds any GPUs it can use or it will fall back to CPU.*
 
-You have to meet the [TensorFlow GPU](https://www.tensorflow.org/install/gpu) requirements.
+You have to meet the [TensorFlow GPU](https://www.tensorflow.org/install/gpu) requirements and may do the cuda setup steps.
 
 ## Command line use after installing with Bioconda or pip
 
@@ -66,11 +60,7 @@ You have to meet the [TensorFlow GPU](https://www.tensorflow.org/install/gpu) re
 
 learnMSA -h
 
-*New since learnMSA version 1.2.0:*
-
-learnMSA -i INPUT_FILE -o OUTPUT_FILE --align_insertions
-
-This will trigger a quick, seperate alignment phase of insertions left unaligned by learnMSA after the main step. Insertions are aligned with `famsa`, which has to be installed by the user.
+*Since learnMSA version 1.2.0, insertions are aligned with famsa. This improves overall accuracy. The old behavior can be restored with the `--unaligned_insertions` flag.*
 
 ## Manual installation
 
@@ -79,17 +69,18 @@ Requirements:
 - [networkx](https://networkx.org/)
 - [logomaker](https://logomaker.readthedocs.io/en/latest/)
 - [seaborn](https://seaborn.pydata.org/)
+- [biopython](https://biopython.org/) (>=1.69)
+- [pyfamsa](https://pypi.org/project/pyfamsa/)
+- [transformers](https://huggingface.co/docs/transformers/index)
 - python 3.9 (there are known issues with 3.7 which is deprecated and 3.8 is untested)
 
-1. Clone the repository:
+1. Clone the repository
 
     git clone https://github.com/Gaius-Augustus/learnMSA
 
-2. Install dependencies:
-
-    pip install tensorflow==2.10.0 logomaker networkx seaborn
+2. Install dependencies with pip or conda
 
-3. Run:
+3. Run
 
     cd learnMSA
 
@@ -98,8 +89,15 @@ Requirements:
 
 ## Interactive notebook with visualization:
 
-Run the notebook MsaHmm.ipynb with juypter.
+Run the notebooks learnMSA_demo.ipynb or learnMSA_with_language_model_demo.ipynb with juypter.
 
+# Version 1.3.0 improvements
+
+- Use `pyfamsa` to align insertions, also made aligning insertions the default behavior (also added `--unaligned_insertions` flag).
+- Use `biopython` for data parsing. Many more input file formats are not available as well as the experimental `indexed_data` flag for large datasets that allows constant memory model training.
+- Multi GPU training works now.
+- Added the highly experimental `--use_language_model` flag that uses a large, pretrained protein language model to guide the MSA and improve alignment accuracy.
+
 # Version 1.2.0 improvements
 
 - insertions that were left unaligned by learnMSA can now be aligned retroactively by a third party aligner which improves accuracy on the HomFam benchmark by about 2%-points