Test StackDPP Datasets Using TransBind Method

This folder contains Google Colab notebooks that evaluate the TransBind deep learning method on the StackDPP datasets.

TransBind uses a ProtTrans protein language model (ProtT5-XL, the model the first notebook loads) to generate 1024-dimensional per-residue embeddings and passes them through an Inception-based CNN. Because generating these embeddings from raw FASTA sequences is computationally intensive, the workflow is split into Google Colab notebooks that can use free GPU resources.

Pre-requisites

Google Drive: You must have a Google account with Google Drive.
Upload Data: Upload the entire StackDPP-main/Dataset folder to your Google Drive. We recommend creating a parent folder named StackDPPvsTransBind in your Drive root (MyDrive) so that the paths match those used in the notebooks.

Google Drive structure:

MyDrive/
└── StackDPPvsTransBind/
    ├── StackDPP-main/
    │   └── Dataset/
    │       ├── uniprot1424.fasta
    │       ├── uniprot356.fasta
    │       ├── pdb1075.fasta
    │       ├── pdb186.fasta
    │       └── pdb1035.fasta
    └── StackDPP_on_TransBind/    <-- (Upload the notebooks here)
        ├── 01_generate_features.ipynb
        ├── 02_train_and_validation.ipynb
        └── 03_inference.ipynb
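
Before running the notebooks, it can help to confirm that the upload matches the layout above. The following is a minimal sketch, not code from the notebooks: `DRIVE_ROOT` assumes Drive is mounted at the usual Colab location `/content/drive`, and the helper name `missing_fasta_files` is hypothetical.

```python
from pathlib import Path

# Assumed Colab mount point plus the parent folder recommended above.
# Adjust this if your Drive is mounted elsewhere.
DRIVE_ROOT = Path("/content/drive/MyDrive/StackDPPvsTransBind")

# File names copied from the folder tree above.
FASTA_FILES = [
    "uniprot1424.fasta",
    "uniprot356.fasta",
    "pdb1075.fasta",
    "pdb186.fasta",
    "pdb1035.fasta",
]


def missing_fasta_files(root):
    """Return the expected FASTA files that are absent under root."""
    dataset_dir = Path(root) / "StackDPP-main" / "Dataset"
    return [name for name in FASTA_FILES if not (dataset_dir / name).is_file()]


# Usage in a Colab cell, after mounting Drive:
#   missing = missing_fasta_files(DRIVE_ROOT)
#   print("All files found." if not missing else f"Missing: {missing}")
```

An empty list means all five files are where the notebooks expect them.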

Notebooks Workflow

1. Feature Generation (`01_generate_features.ipynb`)

Goal: Convert FASTA sequences into ProtTrans (ProtT5-XL) per-residue embeddings.

Open this notebook in Google Colab.
Go to Runtime > Change runtime type and ensure Hardware accelerator is set to T4 GPU (or better).
The notebook will mount your Google Drive and parse the .fasta files.
It loads the ProtTrans T5-XL model (about 11 GB) at runtime.
Run the feature generation cells in order. This step takes a significant amount of time.
Features are saved directly to your Google Drive under StackDPP_on_TransBind/dataset/ as LLM_features_<name>.gz and labels_<name>.npy.
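
The parsing step above can be sketched as follows. This is an illustration under stated assumptions, not the notebook's code: the function names are hypothetical, and the preprocessing (mapping the rare residues U, Z, O, B to X and separating residues with spaces) follows the convention described in the ProtTrans model documentation. How labels are derived from the FASTA records is not stated here, so the sketch does not attempt it.

```python
import re


def read_fasta(path):
    """Parse a FASTA file into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if there is one.
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def prepare_for_prott5(sequence):
    """Map rare residues (U, Z, O, B) to X and space-separate the residues."""
    cleaned = re.sub(r"[UZOB]", "X", sequence.upper())
    return " ".join(cleaned)
```

The space-separated string is the form a ProtT5 tokenizer expects, so each residue becomes one token and yields one embedding vector.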

Note: The notebook writes its outputs to Drive as soon as they are generated. If Colab disconnects, the files already produced are kept and you can resume from there.
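
One way to resume after a disconnect is to skip any dataset whose two output files already exist. The sketch below assumes the file naming given above (`LLM_features_<name>.gz` and `labels_<name>.npy`); the helper names and the example dataset names are hypothetical, and the notebook may handle resumption differently.

```python
from pathlib import Path


def output_paths(out_dir, name):
    """Return the (features, labels) paths for one dataset name."""
    out_dir = Path(out_dir)
    return out_dir / f"LLM_features_{name}.gz", out_dir / f"labels_{name}.npy"


def datasets_to_generate(out_dir, names):
    """Return the dataset names whose feature or label file is still missing."""
    pending = []
    for name in names:
        features, labels = output_paths(out_dir, name)
        if not (features.is_file() and labels.is_file()):
            pending.append(name)
    return pending


# Usage (dataset names are illustrative):
#   datasets_to_generate("StackDPP_on_TransBind/dataset", ["uniprot1424", "uniprot356"])
```

A dataset counts as done only when both files are present, so a run interrupted between the two writes is regenerated rather than left half-finished.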

2. Training and Validation (`02_train_and_validation.ipynb`)

Goal: Train the Inception-based CNN on the generated features and validate its performance.

Open this notebook in Google Colab. A GPU is recommended for speed but is not strictly required for training.
Choose your dataset configuration by changing the CONFIG variable:
- 'A': Train on UNIPROT1424 / Test on UNIPROT356 (StackDPP config)
- 'B': Train on PDB1075 / Test on PDB186 (TransBind config)
- 'C': Train on PDB1035 / Test on PDB186
Run the notebook. It will perform:
- 10-Fold Cross-Validation on the training set.
- Full Training on the entire training set.
- Independent Testing on the test set.
The trained model is saved to StackDPP_on_TransBind/models/.
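
The configuration choice and the 10-fold split described above can be sketched as follows. This is not the notebook's code: the lower-case dataset names are an assumption based on the FASTA file names, the function names are hypothetical, and whether the notebook stratifies its folds by class is also an assumption.

```python
import random
from collections import defaultdict

# Train/test pairs as listed above (names assumed to follow the FASTA files).
CONFIGS = {
    "A": ("uniprot1424", "uniprot356"),
    "B": ("pdb1075", "pdb186"),
    "C": ("pdb1035", "pdb186"),
}


def resolve_config(config):
    """Return (train_name, test_name) for a CONFIG value, or raise ValueError."""
    if config not in CONFIGS:
        raise ValueError(f"Unknown CONFIG {config!r}; expected one of {sorted(CONFIGS)}")
    return CONFIGS[config]


def stratified_folds(labels, n_splits=10, seed=0):
    """Yield (train_idx, val_idx) pairs that keep class proportions per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets a similar share.
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    for k in range(n_splits):
        val_idx = sorted(folds[k])
        train_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train_idx, val_idx
```

In practice, scikit-learn's `StratifiedKFold` provides the same kind of split; the standard-library version here only makes the mechanics explicit.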

3. Inference (`03_inference.ipynb`)

Goal: Load a previously saved model and evaluate it against any test dataset.

Use this notebook to check performance quickly without re-training.
Update the CONFIG and TEST_DATASET variables to match the model and test set you want to evaluate.
The notebook prints a full classification report and a confusion matrix.
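
For reference, a confusion matrix for a binary classifier can be tallied from predicted probabilities as in the sketch below. The 0.5 threshold, the convention that label 1 is the positive (DNA-binding) class, and the function names are assumptions for illustration, not details taken from the notebook.

```python
def threshold_predictions(probabilities, threshold=0.5):
    """Convert positive-class probabilities into 0/1 predictions."""
    return [1 if p >= threshold else 0 for p in probabilities]


def confusion_counts(y_true, y_pred):
    """Return (TN, FP, FN, TP) for binary labels where 1 is the positive class."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


# Usage:
#   preds = threshold_predictions(model_probabilities)
#   tn, fp, fn, tp = confusion_counts(true_labels, preds)
```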

Metrics

The training and inference notebooks report the following metrics, which match TransBind's evaluation style:

Accuracy (Acc)
Precision (Pre)
Sensitivity/Recall (Sen)
Specificity (Spec)
Matthews Correlation Coefficient (MCC)
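
All five metrics follow from the four confusion-matrix counts. The sketch below uses the standard definitions; the function name `binary_metrics` is hypothetical, and returning 0.0 when a denominator is zero is a choice made for this sketch rather than something the notebooks are known to do.

```python
import math


def binary_metrics(tn, fp, fn, tp):
    """Compute Acc, Pre, Sen, Spec and MCC from confusion-matrix counts."""

    def ratio(numerator, denominator):
        # Return 0.0 for an undefined ratio instead of dividing by zero.
        return numerator / denominator if denominator else 0.0

    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Acc": ratio(tp + tn, tn + fp + fn + tp),
        "Pre": ratio(tp, tp + fp),
        "Sen": ratio(tp, tp + fn),
        "Spec": ratio(tn, tn + fp),
        "MCC": ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Counts from the Config A independent test reported in this README:
#   binary_metrics(tn=176, fp=2, fn=10, tp=168)
# Rounded to four decimals this gives
#   Acc 0.9663, Pre 0.9882, Sen 0.9438, Spec 0.9888, MCC 0.9335
```

Recomputing the metrics from the reported confusion counts is a quick way to check that a results table is internally consistent.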

Reproduced results and comparison

I reproduced the experiments for both methods in this workspace. The key cross-validation and independent-test metrics are listed below.

Config A (UNIPROT1424 -> UNIPROT356)

Mean Fold Accuracy (10-fold CV): 0.9415 (+/- 0.0496)
CV — Accuracy: 0.9415; Precision: 0.9802; Sensitivity: 0.9017; Specificity: 0.9816; MCC: 0.8860
Independent test (uniprot356, n=356) — Accuracy: 0.9663; Precision: 0.9882; Sensitivity: 0.9438; Specificity: 0.9888; MCC: 0.9335
Confusion: TN=176, FP=2, FN=10, TP=168

Config C (PDB1035 -> PDB186)

Mean Fold Accuracy (10-fold CV): 0.7524 (+/- 0.1008)
CV — Accuracy: 0.7524; Precision: 0.7231; Sensitivity: 0.7721; Specificity: 0.7348; MCC: 0.5062
Independent test (pdb186, n=186) — Accuracy: 0.7957; Precision: 0.7391; Sensitivity: 0.9140; Specificity: 0.6774; MCC: 0.6087
Confusion: TN=63, FP=30, FN=8, TP=85

Config B (PDB1075 -> PDB186)

Mean Fold Accuracy (10-fold CV): 0.7879 (+/- 0.0900)
CV — Accuracy: 0.7879; Precision: 0.7534; Sensitivity: 0.8438; Specificity: 0.7339; MCC: 0.5805
Independent test (pdb186, n=186) — Accuracy: 0.7957; Precision: 0.7350; Sensitivity: 0.9247; Specificity: 0.6667; MCC: 0.6121
Confusion: TN=62, FP=31, FN=7, TP=86

Short comparison / plausible motivations why TransBind may outperform StackDPP

Based on the reproduced results and typical methodological differences, possible reasons TransBind shows better performance are:

Richer learned representations: Transformer-based models (e.g., ProtBERT) provide contextualized embeddings that capture long-range dependencies better than handcrafted features.
Pretraining & transfer learning: Large protein-language pretraining can improve generalization on downstream tasks with limited labeled data.
Attention mechanism: Transformers can focus on informative residues and interactions without manual feature design.
End-to-end feature learning: Reduces bias from handcrafted descriptors and lets the model discover discriminative patterns.

Note: the points above are plausible explanations, not established conclusions. The papers are listed below with short descriptions.

Papers & sources

Task: DNA-binding protein prediction (protein-level classification and residue-level binding annotation) — experiments reproduced in this repository.
StackDPP (BMC Bioinformatics, 2024): proposed high-quality benchmark datasets UNIPROT1424/UNIPROT356, extensive handcrafted feature extraction (sequence, PSSM, SPIDER3) and recursive feature elimination to obtain an rf452 feature set, and a stacking ensemble classifier that achieved strong generalization on the independent test set. Source: https://pmc.ncbi.nlm.nih.gov/articles/PMC10941422/
TransBind (preprint / article): uses protein language model embeddings (ProtBERT / ProtTrans) combined with an Inception-style CNN and self-attention to learn global and local sequence features. It includes synthetic data generation to mitigate class imbalance and reports faster runtimes and improved accuracy over earlier methods. Source: https://www.nature.com/articles/s42003-025-07534-w. The project README and datasets are available in TransBind-main/README.md and the TransBind-main folder in this workspace.
