cls-protein-prediction

Code and results for the practical exercises of the course "Protein Prediction 2" in the winter semester 2021/22 at TUM.
Authors: Adrian Henkel, Finn Gaida, Lis Arend, Sebastian Dötsch, Shlomo Libo Feigin [1]

Motivation

Transformers [2] have performed strongly on NLP tasks and, more recently, on vision tasks [3]. Here we bring the class-attention mechanism [4], which was introduced for image Transformers, back to sequence input: we predict transmembrane classes directly from protein sequence embeddings [5].

Results

We compared Class-Attention in Image Transformers (CaiT) against a baseline MLP and a baseline CNN and observed similar performance across all three models.

We conclude that CaiT does not significantly improve on a simple MLP baseline, presumably because the embeddings are already highly information-dense. One advantage of the attention mechanism, however, is interpretability: for a single prediction, one can inspect how much attention was placed on each input token.
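
To illustrate what "looking at the attention placed on single input tokens" means, here is a minimal sketch of class attention in plain Python. It is not the model in this repository: it assumes a single head, uses the raw vectors as queries and keys (no learned projections), and the function name `class_attention_weights` and the toy numbers are made up for illustration.

```python
import math


def class_attention_weights(cls_token, tokens):
    """Softmax attention weights of the class token over [cls] + tokens.

    Simplified assumption: queries and keys are the raw vectors,
    i.e. there are no learned projection matrices.
    """
    keys = [cls_token] + tokens
    scale = 1.0 / math.sqrt(len(cls_token))
    # Scaled dot product between the class-token query and every key.
    scores = [scale * sum(q * k for q, k in zip(cls_token, key)) for key in keys]
    # Numerically stable softmax.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: 3 residues with 2-dimensional "embeddings" (made-up numbers).
cls_token = [1.0, 0.0]
residues = [[0.0, 1.0], [2.0, 0.0], [0.0, 1.0]]
weights = class_attention_weights(cls_token, residues)

# Position 0 is the class token itself; positions 1.. are the residues.
top = max(range(1, len(weights)), key=weights.__getitem__) - 1
print(f"most attended residue index: {top}")
```

In a trained model, the weights over the residue positions indicate which parts of the sequence the class token drew on for that one prediction.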

View the full metrics report on Weights & Biases.

Usage

Download the weights from Google Drive and place them in a folder called models in the repository root
Download the TMH dataset from Moodle or Nextcloud (password protected)
Download the FASTA file for evaluation from the same source
Install the dependencies

pip install -r requirements.txt

Run the main script, supplying at least the following arguments:

python main.py \
    --model_type <one of: "CNN", "MLP", "CAIT"> \
    --emb <path to embeddings.h5> \
    --fasta <path to FASTA file>
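
For readers unfamiliar with this kind of command-line interface, the three arguments above could be parsed roughly as follows. This is a hypothetical sketch using the standard-library `argparse` module; the actual `main.py` may define its arguments differently and may accept further options. The helper name `build_parser` and the example file names are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser for the three documented arguments."""
    parser = argparse.ArgumentParser(description="Predict transmembrane classes")
    parser.add_argument(
        "--model_type",
        required=True,
        choices=["CNN", "MLP", "CAIT"],
        help="which model to run",
    )
    parser.add_argument("--emb", required=True, help="path to embeddings.h5")
    parser.add_argument("--fasta", required=True, help="path to FASTA file")
    return parser


# Example invocation with made-up file names.
args = build_parser().parse_args(
    ["--model_type", "CAIT", "--emb", "embeddings.h5", "--fasta", "proteins.fasta"]
)
print(args.model_type, args.emb, args.fasta)
```

With `choices` set, an unsupported value such as `--model_type RNN` makes `argparse` exit with a usage error instead of failing later.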

References

[1] Final presentation slides
[2] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).
[3] Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020).
[4] Touvron, Hugo, et al. "Going deeper with image transformers." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
[5] Elnaggar, Ahmed, et al. "ProtTrans: towards cracking the language of Life's code through self-supervised deep learning and high performance computing." arXiv preprint arXiv:2007.06225 (2020).
