Rahuldrabit/StackDpp-vs-TransBind

Test StackDPP Datasets Using TransBind Method

This folder contains Google Colab notebooks to test the StackDPP datasets using the TransBind deep learning method.

TransBind uses a ProtTrans protein language model (called ProtBert in this README; the notebooks load the ProtTrans T5-XL model) to generate 1024-dimensional per-residue embeddings, which are then processed by an Inception-based CNN. Generating these features from raw FASTA sequences is computationally intensive, so the implementation is split into Google Colab notebooks that can use free GPU resources.

Pre-requisites

  1. Google Drive: You must have a Google account with Google Drive.
  2. Upload Data: Upload the entire StackDPP-main/Dataset folder to your Google Drive. We recommend creating a parent folder named StackDPPvsTransBind in your Drive root (MyDrive) so that the paths match those used in the notebooks.

Google Drive structure:

MyDrive/
└── StackDPPvsTransBind/
    ├── StackDPP-main/
    │   └── Dataset/
    │       ├── uniprot1424.fasta
    │       ├── uniprot356.fasta
    │       ├── pdb1075.fasta
    │       ├── pdb186.fasta
    │       └── pdb1035.fasta
    └── StackDPP_on_TransBind/    <-- (Upload the notebooks here)
        ├── 01_generate_features.ipynb
        ├── 02_train_and_validation.ipynb
        └── 03_inference.ipynb
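
Before running the notebooks, it can help to confirm that the uploaded files sit where the layout above expects them. The following is a minimal sketch, not code from the notebooks: the helper name `find_missing` and the list `EXPECTED_FILES` are hypothetical, and the mount point `/content/drive/MyDrive` is an assumption based on the usual Colab convention.

```python
from pathlib import Path

# Relative paths copied from the layout above.
# EXPECTED_FILES and find_missing are hypothetical names used for illustration.
DATASET_DIR = "StackDPPvsTransBind/StackDPP-main/Dataset"
EXPECTED_FILES = [
    f"{DATASET_DIR}/uniprot1424.fasta",
    f"{DATASET_DIR}/uniprot356.fasta",
    f"{DATASET_DIR}/pdb1075.fasta",
    f"{DATASET_DIR}/pdb186.fasta",
    f"{DATASET_DIR}/pdb1035.fasta",
]


def find_missing(drive_root):
    """Return the expected relative paths that do not exist under drive_root."""
    root = Path(drive_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


# Assumption: Drive is mounted at /content/drive, so MyDrive is the root.
missing = find_missing("/content/drive/MyDrive")
print("Missing files:", missing if missing else "none")
```

Outside Colab, or before Drive is mounted, the check simply reports every file as missing rather than raising an error.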

Notebooks Workflow

1. Feature Generation (01_generate_features.ipynb)

Goal: Convert FASTA sequences into ProtBert embeddings.

  1. Open this notebook in Google Colab.
  2. Go to Runtime > Change runtime type and ensure Hardware accelerator is set to T4 GPU (or better).
  3. The notebook will mount your Google Drive and parse the .fasta files.
  4. It loads the ProtTrans T5-XL model (about 11 GB) at runtime.
  5. Run the feature generation blocks sequentially. This takes a significant amount of time.
  6. Features are saved directly to your Google Drive under StackDPP_on_TransBind/dataset/ as LLM_features_<name>.gz and labels_<name>.npy.

Note: The script saves to Drive immediately. If Colab disconnects, you can resume without losing generated files.
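
The parsing and resume behaviour described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the notebook's code: the function names are hypothetical, the resume check works per dataset (the README does not say whether the notebook resumes per dataset or per sequence), and the preprocessing step follows the input format described on the ProtTrans model cards (residues separated by spaces, rare residues U, Z, O and B mapped to X).

```python
import re
from pathlib import Path


def read_fasta(path):
    """Parse a FASTA file into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            else:
                chunks.append(line)  # sequences may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def prepare_for_prottrans(sequence):
    """Map rare residues (U, Z, O, B) to X and separate residues with spaces."""
    return " ".join(re.sub(r"[UZOB]", "X", sequence.upper()))


def pending_datasets(names, out_dir):
    """Return dataset names that still lack a feature file or a label file."""
    out = Path(out_dir)
    return [
        name for name in names
        if not ((out / f"LLM_features_{name}.gz").exists()
                and (out / f"labels_{name}.npy").exists())
    ]
```

For example, `pending_datasets(["uniprot1424", "uniprot356"], "StackDPP_on_TransBind/dataset")` would list only the datasets whose output files are not yet on Drive, so a restarted session can skip the rest.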

2. Training and Validation (02_train_and_validation.ipynb)

Goal: Train the Inception-based CNN on the generated features and validate its performance.

  1. Open this notebook in Google Colab (a GPU is recommended for speed but is not strictly required for training).
  2. Choose your dataset configuration by changing the CONFIG variable:
    • 'A': Train on UNIPROT1424 / Test on UNIPROT356 (StackDPP config)
    • 'B': Train on PDB1075 / Test on PDB186 (TransBind config)
    • 'C': Train on PDB1035 / Test on PDB186
  3. Run the notebook. It will perform:
    • 10-Fold Cross-Validation on the training set.
    • Full Training on the entire training set.
    • Independent Testing on the test set.
  4. The trained model is saved to StackDPP_on_TransBind/models/.
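
The configuration choice and the 10-fold split can be pictured as follows. This is a sketch under assumptions: `CONFIGS` and `kfold_indices` are hypothetical names, the split is a plain shuffled k-fold (the notebook may use a stratified split instead), and the training-set size of 1424 is inferred from the dataset name.

```python
import random

# Train/test dataset pairs as listed above.
# The dict and function names are hypothetical.
CONFIGS = {
    "A": ("uniprot1424", "uniprot356"),
    "B": ("pdb1075", "pdb186"),
    "C": ("pdb1035", "pdb186"),
}


def kfold_indices(n_samples, n_splits=10, seed=0):
    """Yield (train_idx, val_idx) index lists for shuffled k-fold CV."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Slice the shuffled indices into n_splits folds of near-equal size.
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for k in range(n_splits):
        val_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, val_idx


CONFIG = "A"
train_name, test_name = CONFIGS[CONFIG]
# 1424 is assumed from the dataset name, not read from the file.
for fold, (train_idx, val_idx) in enumerate(kfold_indices(1424), start=1):
    print(f"{train_name} fold {fold}: "
          f"{len(train_idx)} train / {len(val_idx)} validation")
```

Each sample appears in exactly one validation fold, so the per-fold metrics together cover the whole training set once.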

3. Inference (03_inference.ipynb)

Goal: Load a previously saved model and evaluate it against any test dataset.

  1. Use this notebook to check performance quickly without retraining.
  2. Update the CONFIG and TEST_DATASET variables to match the model you want to evaluate.
  3. It prints a full classification report and confusion matrix.
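
The confusion matrix in step 3 is a tally of predictions against true labels. The sketch below shows that tally in plain Python; it is not the notebook's code. The function name is hypothetical, and it assumes that label 1 means DNA-binding and that predicted probabilities are thresholded at 0.5. The sample values are made up for illustration.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count TN, FP, FN and TP for binary labels and predicted probabilities."""
    counts = {"TN": 0, "FP": 0, "FN": 0, "TP": 0}
    for truth, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if truth == 1 and predicted == 1:
            counts["TP"] += 1
        elif truth == 1 and predicted == 0:
            counts["FN"] += 1
        elif truth == 0 and predicted == 1:
            counts["FP"] += 1
        else:
            counts["TN"] += 1
    return counts


# Toy values, not results from this repository.
toy_labels = [1, 1, 0, 0, 1, 0]
toy_probs = [0.9, 0.4, 0.2, 0.7, 0.6, 0.1]
print(confusion_counts(toy_labels, toy_probs))
```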

Metrics

The training and inference notebooks report the following metrics, matching TransBind's evaluation style:

  • Accuracy (Acc)
  • Precision (Pre)
  • Sensitivity/Recall (Sen)
  • Specificity (Spec)
  • Matthews Correlation Coefficient (MCC)
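
All five metrics follow from the four confusion-matrix counts. The sketch below uses the standard definitions; the function name `binary_metrics` is hypothetical and the notebooks may compute these values differently (for example with scikit-learn). The counts in the usage lines are the Config A independent-test confusion values reported in the results section of this README.

```python
import math


def binary_metrics(tn, fp, fn, tp):
    """Compute Acc, Pre, Sen, Spec and MCC from confusion-matrix counts."""
    total = tn + fp + fn + tp
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "Acc": (tp + tn) / total,
        "Pre": tp / (tp + fp) if (tp + fp) else 0.0,
        "Sen": tp / (tp + fn) if (tp + fn) else 0.0,
        "Spec": tn / (tn + fp) if (tn + fp) else 0.0,
        # MCC is set to 0 when any marginal count is zero.
        "MCC": (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0,
    }


# Config A independent test: TN=176, FP=2, FN=10, TP=168.
metrics = binary_metrics(tn=176, fp=2, fn=10, tp=168)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
# Acc: 0.9663, Pre: 0.9882, Sen: 0.9438, Spec: 0.9888, MCC: 0.9335
```

The same function applied to the Config B and Config C counts gives the independent-test values listed for those configurations.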

Reproduced results and comparison

I reproduced the experiments from both methods. The key cross-validation and independent-test metrics are listed below (values reproduced in this workspace).

Config A (UNIPROT1424 -> UNIPROT356)

  • Mean Fold Accuracy (10-fold CV): 0.9415 (+/- 0.0496)
  • CV — Accuracy: 0.9415; Precision: 0.9802; Sensitivity: 0.9017; Specificity: 0.9816; MCC: 0.8860
  • Independent test (uniprot356, n=356) — Accuracy: 0.9663; Precision: 0.9882; Sensitivity: 0.9438; Specificity: 0.9888; MCC: 0.9335
  • Confusion: TN=176, FP=2, FN=10, TP=168

Config C (PDB1035 -> PDB186)

  • Mean Fold Accuracy (10-fold CV): 0.7524 (+/- 0.1008)
  • CV — Accuracy: 0.7524; Precision: 0.7231; Sensitivity: 0.7721; Specificity: 0.7348; MCC: 0.5062
  • Independent test (pdb186, n=186) — Accuracy: 0.7957; Precision: 0.7391; Sensitivity: 0.9140; Specificity: 0.6774; MCC: 0.6087
  • Confusion: TN=63, FP=30, FN=8, TP=85

Config B (PDB1075 -> PDB186)

  • Mean Fold Accuracy (10-fold CV): 0.7879 (+/- 0.0900)
  • CV — Accuracy: 0.7879; Precision: 0.7534; Sensitivity: 0.8438; Specificity: 0.7339; MCC: 0.5805
  • Independent test (pdb186, n=186) — Accuracy: 0.7957; Precision: 0.7350; Sensitivity: 0.9247; Specificity: 0.6667; MCC: 0.6121
  • Confusion: TN=62, FP=31, FN=7, TP=86

Short comparison: plausible reasons why TransBind may outperform StackDPP

Based on the reproduced results and typical methodological differences, possible reasons TransBind shows better performance are:

  • Richer learned representations: Transformer-based models (e.g., ProtBERT) provide contextualized embeddings that capture long-range dependencies better than handcrafted features.
  • Pretraining & transfer learning: Large protein-language pretraining can improve generalization on downstream tasks with limited labeled data.
  • Attention mechanism: Transformers can focus on informative residues and interactions without manual feature design.
  • End-to-end feature learning: Reduces bias from handcrafted descriptors and lets the model discover discriminative patterns.

Note: the reasons above are plausible explanations rather than verified findings. The papers are listed below with short descriptions.

Papers & sources

  • Task: DNA-binding protein prediction (protein-level classification and residue-level binding annotation) — experiments reproduced in this repository.

  • StackDPP (BMC Bioinformatics, 2024): proposed high-quality benchmark datasets UNIPROT1424/UNIPROT356, extensive handcrafted feature extraction (sequence, PSSM, SPIDER3) and recursive feature elimination to obtain an rf452 feature set, and a stacking ensemble classifier that achieved strong generalization on the independent test set. Source: https://pmc.ncbi.nlm.nih.gov/articles/PMC10941422/

  • TransBind (preprint / article): uses protein language model embeddings (ProtBERT / ProtTrans) combined with an Inception-style CNN and self-attention to learn global and local sequence features; includes synthetic data generation to mitigate class imbalance and reports faster runtimes with improved accuracy over earlier methods. An additional project README and datasets are available in TransBind-main/README.md and the TransBind-main folder in this workspace. Source: https://www.nature.com/articles/s42003-025-07534-w


About

This repo is an experiment comparing deep learning and machine learning approaches for DNA-binding protein prediction.
