- Mahbuba Tasmin
- Bryn Reimer
Course: CS 690U, University of Massachusetts Amherst (Spring 2025)
This project evaluates the effectiveness of traditional bioinformatics techniques on the Convergent Enzyme Classification task from the DGEB benchmark. Specifically, we assess how simple modeling pipelines, such as BLAST-based nearest-neighbor search and logistic regression on basic encodings, compare with modern foundation models like ESM.
The convergent_enzymes dataset comprises protein sequences of enzymes
annotated with Enzyme Commission (EC) numbers, where training and test
sequences with the same EC number share little to no sequence similarity. This
simulates convergent evolution, where similar function has evolved independently
multiple times. Because of this, even foundation models struggle to generalize —
making this an ideal case for re-evaluating traditional approaches.
- Source: TattaBio Convergent Enzymes
- Train set: 2000 amino acid sequences
- Test set: 400 amino acid sequences
- Task: Predict EC number (multi-class classification)
- Inputs: Amino acid sequences
- Encodings:
  - One-hot encoding (e.g., max-len x 21)
  - k-mer count vectors
  - Count encoding
  - Amino acid properties encoding
- Model: Multinomial Logistic Regression (implemented in PyTorch)
- Evaluation:
  - Accuracy
  - Macro F1-score
  - Number correct
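
The encodings listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebook's code. It assumes that the 21st one-hot column is a catch-all for non-standard residues (written `X`), that padded positions are left all-zero, and that "count encoding" means per-residue counts. The function names are hypothetical.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
UNKNOWN = "X"                          # assumed catch-all for the 21st column
ALPHABET = AMINO_ACIDS + UNKNOWN
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}


def one_hot(seq, max_len):
    """Return a max_len x 21 matrix; positions past the sequence end stay all-zero."""
    matrix = [[0] * len(ALPHABET) for _ in range(max_len)]
    for pos, aa in enumerate(seq[:max_len]):  # longer sequences are truncated
        matrix[pos][INDEX.get(aa, INDEX[UNKNOWN])] = 1
    return matrix


def count_encoding(seq):
    """Return a length-21 vector of residue counts (assumed meaning of 'count encoding')."""
    counts = Counter(aa if aa in INDEX else UNKNOWN for aa in seq)
    return [counts[aa] for aa in ALPHABET]


def kmer_counts(seq, k=2):
    """Return counts over all 20**k k-mers of standard residues.

    k-mers containing a non-standard residue are ignored.
    """
    vocab = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts[kmer] for kmer in vocab]
```

Each function returns plain lists, which can be flattened (for the one-hot matrix) and converted to a tensor before being passed to the logistic regression model. The k-mer vocabulary grows as 20^k, so small values of k keep the feature vector manageable.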
- Use `blastp` to search test sequences against the training set
- Predict EC label based on the top BLAST hit
- Evaluate using accuracy, macro F1, and number correct
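
The BLAST baseline above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the notebook's implementation: it assumes `blastp` was run with tabular output (`-outfmt 6`, where the bit score is the 12th column), that the "top hit" is the hit with the highest bit score, that a query with no hit counts as incorrect, and that macro F1 is averaged over the classes present in the true labels. File names and function names are hypothetical.

```python
# Assumed shell steps that produce the input (file names are hypothetical):
#   makeblastdb -in train.fasta -dbtype prot -out train_db
#   blastp -query test.fasta -db train_db -outfmt 6 -out hits.tsv


def top_hits(blast_tsv_lines):
    """Map each query ID to the subject ID of its highest-bitscore hit."""
    best = {}
    for line in blast_tsv_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def predict_ec(test_ids, hits, train_labels):
    """Transfer the EC label of the top hit; None when a query has no hit."""
    return [train_labels.get(hits.get(query)) for query in test_ids]


def evaluate(y_true, y_pred):
    """Return number correct, accuracy, and macro F1 over the true classes."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    f1_scores = []
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return {
        "num_correct": correct,
        "accuracy": correct / len(y_true),
        "macro_f1": sum(f1_scores) / len(f1_scores),
    }
```

With a hits file on disk, `top_hits(open("hits.tsv"))` would feed `predict_ec`, and the same `evaluate` function could score the logistic regression predictions so that both baselines are compared on identical metrics.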
All results are available in the main Jupyter notebook.