This folder contains Google Colab notebooks to test the StackDPP datasets using the TransBind deep learning method.
TransBind uses a ProtTrans protein language model (these notebooks load ProtTrans T5-XL) to generate 1024-dimensional per-residue embeddings and processes them with an Inception-based CNN. Because generating these features from raw FASTA sequences is computationally intensive, the implementation is split into Google Colab notebooks that can use free GPU resources.
- Google Drive: You must have a Google account with Google Drive.
- Upload Data: Upload the entire `StackDPP-main/Dataset` folder to your Google Drive. We recommend creating a parent folder named `StackDPPvsTransBind` in your Drive root (`MyDrive`) so the paths match the notebooks.
Google Drive structure:

    MyDrive/
    └── StackDPPvsTransBind/
        ├── StackDPP-main/
        │   └── Dataset/
        │       ├── uniprot1424.fasta
        │       ├── uniprot356.fasta
        │       ├── pdb1075.fasta
        │       ├── pdb186.fasta
        │       └── pdb1035.fasta
        └── StackDPP_on_TransBind/   <-- (Upload the notebooks here)
            ├── 01_generate_features.ipynb
            ├── 02_train_and_validation.ipynb
            └── 03_inference.ipynb
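Before running the notebooks, it can help to confirm that the upload matches this layout. The following is a minimal sketch, not part of the notebooks: `missing_fasta_files` is a hypothetical helper, and the file list is copied from the tree above.

```python
from pathlib import Path

# Expected FASTA files, copied from the folder tree in this README.
EXPECTED_FASTA = [
    "uniprot1424.fasta",
    "uniprot356.fasta",
    "pdb1075.fasta",
    "pdb186.fasta",
    "pdb1035.fasta",
]


def missing_fasta_files(project_root):
    """Return the expected FASTA files that are absent under project_root.

    project_root is the StackDPPvsTransBind folder (hypothetical helper,
    not a function defined in the notebooks).
    """
    dataset_dir = Path(project_root) / "StackDPP-main" / "Dataset"
    return [name for name in EXPECTED_FASTA if not (dataset_dir / name).is_file()]
```

In Colab, assuming Drive is mounted at the usual `/content/drive` mount point, calling `missing_fasta_files("/content/drive/MyDrive/StackDPPvsTransBind")` should return an empty list when everything is in place.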
Goal: Convert FASTA sequences into ProtTrans T5-XL embeddings.
- Open `01_generate_features.ipynb` in Google Colab.
- Go to Runtime > Change runtime type and ensure Hardware accelerator is set to T4 GPU (or better).
- The notebook will mount your Google Drive and parse the `.fasta` files.
- It dynamically loads the 11 GB ProtTrans T5-XL model.
- Run the feature generation blocks sequentially. This takes a significant amount of time.
- Features are saved directly to your Google Drive under `StackDPP_on_TransBind/dataset/` as `LLM_features_<name>.gz` and `labels_<name>.npy`.
Note: The script saves to Drive immediately. If Colab disconnects, you can resume without losing generated files.
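The resume behaviour can be pictured as a check for output files that already exist. This is only a sketch under assumptions: `output_paths` and `pending_datasets` are hypothetical helpers, and `<name>` is assumed to be the FASTA file stem (for example `uniprot1424`); the notebook's actual logic may differ.

```python
from pathlib import Path


def output_paths(out_dir, name):
    """Build the feature and label paths using the naming scheme from this README."""
    out_dir = Path(out_dir)
    return out_dir / f"LLM_features_{name}.gz", out_dir / f"labels_{name}.npy"


def pending_datasets(out_dir, names):
    """Return the dataset names whose feature or label file is still missing.

    Assumption: a dataset counts as finished only when both files exist.
    """
    pending = []
    for name in names:
        features, labels = output_paths(out_dir, name)
        if not (features.is_file() and labels.is_file()):
            pending.append(name)
    return pending
```

After a disconnect, only the names returned by `pending_datasets` would need their feature generation cells re-run.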
Goal: Train the Inception-based CNN on the generated features and validate its performance.
- Open `02_train_and_validation.ipynb` in Google Colab (a GPU is recommended for speed but is not strictly required for training).
- Choose your dataset configuration by changing the `CONFIG` variable:
  - `'A'`: Train on UNIPROT1424 / Test on UNIPROT356 (StackDPP config)
  - `'B'`: Train on PDB1075 / Test on PDB186 (TransBind config)
  - `'C'`: Train on PDB1035 / Test on PDB186
- Run the notebook. It will perform:
  - 10-Fold Cross-Validation on the training set.
  - Full Training on the entire training set.
  - Independent Testing on the test set.
- The trained model is saved to `StackDPP_on_TransBind/models/`.
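The configuration switch and the 10-fold split can be sketched as follows. This is an illustration under assumptions, not the notebook's code: `DATASET_CONFIGS`, `select_datasets`, and `k_fold_indices` are hypothetical names, the lowercase dataset names are assumed to match the FASTA file stems, and the split shown is a plain shuffled k-fold (the notebook may stratify by class instead).

```python
import random

# CONFIG letter -> (training set, test set), following the list above.
DATASET_CONFIGS = {
    "A": ("uniprot1424", "uniprot356"),
    "B": ("pdb1075", "pdb186"),
    "C": ("pdb1035", "pdb186"),
}


def select_datasets(config):
    """Return the (train, test) dataset names for a CONFIG letter."""
    if config not in DATASET_CONFIGS:
        raise ValueError(
            f"Unknown CONFIG {config!r}; expected one of {sorted(DATASET_CONFIGS)}"
        )
    return DATASET_CONFIGS[config]


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, val_idx) pairs for a shuffled, non-stratified k-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, val_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx
```

Each sample lands in exactly one validation fold, so every training sequence is scored once during cross-validation before the final model is fit on the full training set.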
Goal: Load a previously saved model and evaluate it against any test dataset.
- Use `03_inference.ipynb` to quickly check performance without re-training.
- Update the `CONFIG` and `TEST_DATASET` variables to match the model you want to evaluate.
- It prints a full classification report and confusion matrix.
The training and inference scripts report the following metrics matching TransBind's evaluation style:
- Accuracy (Acc)
- Precision (Pre)
- Sensitivity/Recall (Sen)
- Specificity (Spec)
- Matthews Correlation Coefficient (MCC)
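All five metrics follow from the four confusion-matrix counts. The sketch below uses the standard definitions; `binary_metrics` is a hypothetical helper rather than a function from the notebooks, and returning 0.0 for an undefined ratio is an assumption made here for simplicity.

```python
import math


def binary_metrics(tn, fp, fn, tp):
    """Compute Acc, Pre, Sen, Spec and MCC from binary confusion-matrix counts."""

    def ratio(num, den):
        # Assumption: report 0.0 when a denominator is zero.
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Acc": ratio(tp + tn, tn + fp + fn + tp),
        "Pre": ratio(tp, tp + fp),
        "Sen": ratio(tp, tp + fn),
        "Spec": ratio(tn, tn + fp),
        "MCC": ratio(tp * tn - fp * fn, mcc_den),
    }
```

For example, the counts TN=176, FP=2, FN=10, TP=168 reported for the uniprot356 independent test give Acc 0.9663, Pre 0.9882, Sen 0.9438, Spec 0.9888, and MCC 0.9335 when rounded to four decimals.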
I reproduced the experiments from both methods in this workspace; the key cross-validation and independent-test metrics are listed below.
- Mean Fold Accuracy (10-fold CV): 0.9415 (+/- 0.0496)
- CV — Accuracy: 0.9415; Precision: 0.9802; Sensitivity: 0.9017; Specificity: 0.9816; MCC: 0.8860
- Independent test (uniprot356, n=356) — Accuracy: 0.9663; Precision: 0.9882; Sensitivity: 0.9438; Specificity: 0.9888; MCC: 0.9335
- Confusion: TN=176, FP=2, FN=10, TP=168
- Mean Fold Accuracy (10-fold CV): 0.7524 (+/- 0.1008)
- CV — Accuracy: 0.7524; Precision: 0.7231; Sensitivity: 0.7721; Specificity: 0.7348; MCC: 0.5062
- Independent test (pdb186, n=186) — Accuracy: 0.7957; Precision: 0.7391; Sensitivity: 0.9140; Specificity: 0.6774; MCC: 0.6087
- Confusion: TN=63, FP=30, FN=8, TP=85
- Mean Fold Accuracy (10-fold CV): 0.7879 (+/- 0.0900)
- CV — Accuracy: 0.7879; Precision: 0.7534; Sensitivity: 0.8438; Specificity: 0.7339; MCC: 0.5805
- Independent test (pdb186, n=186) — Accuracy: 0.7957; Precision: 0.7350; Sensitivity: 0.9247; Specificity: 0.6667; MCC: 0.6121
- Confusion: TN=62, FP=31, FN=7, TP=86
Based on the reproduced results and typical methodological differences, possible reasons TransBind shows better performance are:
- Richer learned representations: Transformer-based models (e.g., ProtBERT) provide contextualized embeddings that capture long-range dependencies better than handcrafted features.
- Pretraining & transfer learning: Large protein-language pretraining can improve generalization on downstream tasks with limited labeled data.
- Attention mechanism: Transformers can focus on informative residues and interactions without manual feature design.
- End-to-end feature learning: Reduces bias from handcrafted descriptors and lets the model discover discriminative patterns.
Note: the points above are plausible explanations rather than verified conclusions. The papers are listed below with short descriptions.
- Task: DNA-binding protein prediction (protein-level classification and residue-level binding annotation); the experiments are reproduced in this repository.
- StackDPP (BMC Bioinformatics, 2024): proposed the high-quality benchmark datasets UNIPROT1424/UNIPROT356, extensive handcrafted feature extraction (sequence, PSSM, SPIDER3) with recursive feature elimination to obtain an rf452 feature set, and a stacking ensemble classifier that achieved strong generalization on the independent test set. Source: https://pmc.ncbi.nlm.nih.gov/articles/PMC10941422/
- TransBind (preprint / article): uses protein language model embeddings (ProtBERT / ProtTrans) combined with an inception-style CNN and self-attention to learn global and local sequence features; it includes synthetic data generation to mitigate class imbalance and reports faster runtimes with improved accuracy over historical methods. Paper link: https://www.nature.com/articles/s42003-025-07534-w. The project README and datasets are available in `TransBind-main/README.md` and the `TransBind-main` folder in this workspace.