Releases: RitAreaSciencePark/Protein_Crystallization_Data_Extraction

Protein Crystallization Data Extraction V1.0.1

@Njantang1 released this 09 Jun 15:47

🎉 Protein Crystallization Data Extraction (PCDE) V1.0.1

Protein Crystallization Data Extraction Tool
by Nana Njantang Ruth · ORCID 0000-0002-6003-7521


Overview

Protein crystallography remains one of the primary methods for determining three-dimensional protein structures, and identifying the right crystallization conditions is a critical bottleneck. This pipeline mines the RCSB Protein Data Bank (PDB) to retrieve, filter, annotate, and visualize crystallization condition data for any input sequence, turning raw PDB metadata into structured, FAIR-compliant datasets.


What's included in this release

🗂️ Input & FASTA Export

  • Accepts any amino acid, DNA, or RNA sequence interactively
  • Automatically detects sequence type (protein / DNA / RNA)
  • Saves the input sequence as a FASTA file to output/{seq_type_name}/
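
As a rough illustration of the input stage, the sketch below guesses the sequence type from the characters present and writes a FASTA file under output/{seq_type_name}/. The function names, the alphabets used for detection, and the 60-character line width are assumptions made for this example; the pipeline's actual detection rules are not documented here and may differ.

```python
from pathlib import Path

# Assumed alphabets for this sketch; N stands for an unknown base.
DNA_ALPHABET = set("ACGTN")
RNA_ALPHABET = set("ACGUN")


def detect_sequence_type(sequence: str) -> str:
    """Guess 'dna', 'rna', or 'protein' from the characters present."""
    residues = set("".join(sequence.split()).upper())
    if not residues:
        raise ValueError("empty sequence")
    if residues <= DNA_ALPHABET:
        return "dna"
    if residues <= RNA_ALPHABET:
        return "rna"
    return "protein"


def write_fasta(sequence: str, seq_type_name: str, root: str = "output", width: int = 60) -> Path:
    """Write the sequence to <root>/<seq_type_name>/<seq_type_name>_sequence.fasta."""
    out_dir = Path(root) / seq_type_name
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{seq_type_name}_sequence.fasta"
    seq = "".join(sequence.split()).upper()
    # Wrap the sequence into fixed-width lines, as is conventional for FASTA.
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    path.write_text(f">{seq_type_name}\n" + "\n".join(lines) + "\n")
    return path
```

Character-based detection is inherently ambiguous: a short peptide made only of A, C, G, and T residues would be classified as DNA by this sketch, so a real tool may want to ask the user to confirm.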

🔍 Step 1 — RCSB Sequence Search (rcsb_sequence_identity.py)

  • Queries the RCSB Search API (/rcsbsearch/v2/query) using sequence similarity scoring
  • Maps sequence type to the correct target (pdb_protein_sequence, pdb_dna_sequence, pdb_rna_sequence)
  • Filters results to X-ray crystallography only via a per-hit lookup to /rest/v1/core/entry/{pdb_id}
  • Extracts per-hit: PDB_ID, Entity, Score, Seq_id (sequence identity), E-value
  • Saves ranked hits to {seq_type_name}_rcsb_hits.csv
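
To make the search step more concrete, here is a minimal, network-free sketch of how a sequence-search request body could be assembled and how an entry record could be checked for X-ray diffraction. The payload field names (target, evalue_cutoff, identity_cutoff), the default cutoffs, and the exptl/method layout of the entry record are assumptions based on the description above; they should be verified against the current RCSB Search and Data API schemas before use.

```python
# Mapping from detected sequence type to the search target named in the list above.
TARGETS = {
    "protein": "pdb_protein_sequence",
    "dna": "pdb_dna_sequence",
    "rna": "pdb_rna_sequence",
}


def build_sequence_query(sequence, seq_type, evalue_cutoff=1.0, identity_cutoff=0.0):
    """Assemble a JSON-serializable request body (field names are assumptions)."""
    return {
        "query": {
            "type": "terminal",
            "service": "sequence",
            "parameters": {
                "target": TARGETS[seq_type],
                "value": sequence,
                "evalue_cutoff": evalue_cutoff,
                "identity_cutoff": identity_cutoff,
            },
        },
        "return_type": "polymer_entity",
        "request_options": {"scoring_strategy": "sequence"},
    }


def is_xray_entry(entry):
    """Return True if any experimental method in the entry record is X-ray diffraction.

    Assumes the record looks like {"exptl": [{"method": "X-RAY DIFFRACTION"}, ...]}.
    """
    methods = [item.get("method", "") for item in entry.get("exptl", [])]
    return any(method.upper() == "X-RAY DIFFRACTION" for method in methods)
```

The body returned by build_sequence_query() would be sent as JSON to the search endpoint, and is_xray_entry() would be applied to the record fetched for each hit.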

🧹 Step 2 — Crystallization Data Retrieval & Filtering (PDB_searchAPI.py)

  • Runs a full PDB search for crystallization metadata
  • Filters entries by experimental conditions via filter_experimental_conditions()
  • All intermediate temporary CSV files are automatically cleaned up after each stage
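
The filter-then-clean-up pattern described above might look like the following sketch. It uses only the standard library; the helper names and the filtering criterion (keep rows that report both pH and Temp) are assumptions for illustration and are not taken from filter_experimental_conditions(). The PDB IDs in the usage example are placeholders.

```python
import csv
import os
import tempfile


def has_conditions(row):
    """Assumed criterion: keep rows that report both a pH and a temperature."""
    return bool(row.get("pH", "").strip()) and bool(row.get("Temp", "").strip())


def filter_to_temp_csv(rows, fieldnames):
    """Write the rows that pass the filter to a temporary CSV and return its path."""
    tmp = tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="")
    with tmp:
        writer = csv.DictWriter(tmp, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(row for row in rows if has_conditions(row))
    return tmp.name


def read_and_cleanup(path):
    """Load a temporary CSV into memory, then delete it even if reading fails."""
    try:
        with open(path, newline="") as handle:
            return list(csv.DictReader(handle))
    finally:
        os.remove(path)
```

A short usage example:

```python
rows = [
    {"PDB_ID": "1ABC", "pH": "7.5", "Temp": "293"},  # placeholder IDs
    {"PDB_ID": "2DEF", "pH": "", "Temp": "291"},
]
path = filter_to_temp_csv(rows, ["PDB_ID", "pH", "Temp"])
kept = read_and_cleanup(path)  # only the 1ABC row remains; the temp file is gone
```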

🧪 Compound Annotation (extract_structures.py)

  • Enriches filtered crystallization data with a COMPOUND column using structures.pkl
  • Output stored in a temporary CSV before merging

🔗 Step 3 — Merge RCSB + Crystallization Data

  • Merges RCSB sequence-identity results with crystallization condition data on PDB_ID
  • Final merged CSV contains a defined column order:
    PDB_ID · Entity · Score · Seq_id · E-value · Resolution · Pubmed_id · Method · pH · Temp · Ligands · Polymer · Assembly · pdbx_pH_range · pdbx_details · Compounds (mM) · PEG_Id · PEG_con
    
  • Saved as {seq_type_name}_merged_results.csv
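
The merge on PDB_ID with a fixed column order can be sketched as below. Plain dictionaries are used to keep the example dependency-free (a pandas merge would achieve the same result). The function name and the choice of an inner join, where hits without crystallization data are dropped, are assumptions; the column order is copied from the list above.

```python
COLUMN_ORDER = [
    "PDB_ID", "Entity", "Score", "Seq_id", "E-value", "Resolution",
    "Pubmed_id", "Method", "pH", "Temp", "Ligands", "Polymer", "Assembly",
    "pdbx_pH_range", "pdbx_details", "Compounds (mM)", "PEG_Id", "PEG_con",
]


def merge_on_pdb_id(hits, conditions):
    """Join sequence hits with crystallization rows on PDB_ID (inner join, assumed)."""
    by_id = {row["PDB_ID"]: row for row in conditions}
    merged = []
    for hit in hits:
        cond = by_id.get(hit["PDB_ID"])
        if cond is None:
            continue  # no crystallization data for this hit
        combined = {**cond, **hit}
        # Emit columns in the documented order, leaving missing fields blank.
        merged.append({col: combined.get(col, "") for col in COLUMN_ORDER})
    return merged
```

Each returned row can be passed directly to csv.DictWriter with fieldnames=COLUMN_ORDER to produce a file with the documented layout.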

📊 Step 4 — High-Resolution Visualization (plot.py)

  • run_plot() generates 300 dpi analytical scatter plots from the merged CSV:
    • pH vs. Temperature (K)
    • pH vs. PEG concentration (%)

📄 Consolidated PDF Report

  • Outputs Cryst_cocktail_Table.pdf: a colored summary table compiling PDB IDs, alignment metrics, ligands, and complete chemical cocktails into a single laboratory resource

Pipeline Flow

Sequence input
     │
     ├──► FASTA export
     │
     ├──► Step 1: RCSB sequence search + X-ray filter → rcsb_hits.csv
     │
     ├──► Step 2: PDB crystallization search (×6 workers) → temp CSV
     │                └── filter experimental conditions → temp CSV
     │                        └── append COMPOUND column → temp CSV
     │
     ├──► Step 3: Merge RCSB + crystallization data → merged_results.csv
     │
     └──► Step 4: run_plot() → scatter plots (300 dpi)
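
The "×6 workers" note in Step 2 suggests that per-entry lookups run concurrently. A minimal sketch of that idea with concurrent.futures follows; fetch_all and the fetch_one callback are hypothetical names, and whether the pipeline uses threads or processes is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(pdb_ids, fetch_one, max_workers=6):
    """Call fetch_one(pdb_id) for every ID using a pool of worker threads.

    Returns a dict mapping each ID to its result, in input order.
    """
    ids = list(pdb_ids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the inputs.
        return dict(zip(ids, pool.map(fetch_one, ids)))
```

Threads suit this kind of work because each lookup spends most of its time waiting on the network. In real use fetch_one would perform the HTTP request for a single entry; any exception it raises surfaces when the results are collected.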

Output Structure

output/
└── {seq_type_name}/
    ├── {seq_type_name}_sequence.fasta
    ├── {seq_type_name}_rcsb_hits.csv
    ├── {seq_type_name}_merged_results.csv
    ├── Cryst_cocktail_Table.pdf
    └── plots/
        ├── pH_vs_Temperature.png
        └── pH_vs_PEG_concentration.png

Module Overview

Module                      Role
main.py                     Pipeline orchestration and I/O
rcsb_sequence_identity.py   RCSB sequence search + X-ray filter
PDB_searchAPI.py            Crystallization metadata retrieval & filtering
extract_structures.py       Compound annotation via structures.pkl
plot.py                     High-resolution scatter plot generation

Dependencies

  • Python 3.8+
  • Third-party packages: requests, pandas, openpyxl, matplotlib
  • Standard-library modules (no installation needed): concurrent.futures, tempfile

Install all dependencies with:

pip install -r requirements.txt

📦 Download Assets

The following pre-built archives are available for this release:

Asset                  Format   Size   Description
Source code (zip)      ZIP      -      Complete repository as ZIP archive
Source code (tar.gz)   TAR.GZ   -      Complete repository as compressed TAR archive

Direct download links:


How to Install

Option 1: From ZIP Archive

# Download and extract
unzip v1.0.1.zip
cd Protein_Crystallization_Data_Extraction

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Linux/macOS
venv\Scripts\activate     # Windows

# Install dependencies
pip install -r requirements.txt

Option 2: Clone from Git

git clone -b v1.0.1 https://github.com/RitAreaSciencePark/Protein_Crystallization_Data_Extraction.git
cd Protein_Crystallization_Data_Extraction
pip install -r requirements.txt

Quick Start

Web Application

cd protein_crystallization_app
python manage.py migrate
python manage.py runserver

Then visit http://127.0.0.1:8000

Command-Line (Single Sequence)

cd src
python Main.py
Enter sequence: MSPRKTYILKLYVAGNTPNSVRALK...
Enter a descriptive sequence type name: MyProtein

Command-Line (FASTA Batch)

cd src_fasta_file
python main.py input.fasta

How to Cite

If you use this pipeline in your research, please cite:

Nana Njantang Ruth, Valerio PIOMPONI, and Andrea DALLE VEDOVE (2026). Protein Crystallization Data Extraction Tool (v1.0.0). GitHub Repository. https://github.com/RitAreaSciencePark/Protein_Crystallization_Data_Extraction

Or use the CITATION.cff file included in this repository.


Known Limitations

  • Compound annotation depends on the static structures.pkl reference file
  • Free-text crystallization fields in PDB are not fully standardized across all entries
  • Web interface improvements planned for v2.0.0

License

MIT © 2025 RitAreaSciencePark

This project is released under the MIT License. See the LICENSE file for details.


Support & Feedback

For issues, feature requests, or questions:


Thank you for using PCDE! 🧬

Full Changelog: V1.0.0...v1.0.1

v1.0.0 – Initial release

Pre-release

@Njantang1 released this 09 Jun 10:00

🎉 Protein Crystallization Data Extraction (PCDE) - v1.0.0

Protein Crystallization Data Extraction Tool
by Nana Njantang Ruth · ORCID 0000-0002-6003-7521


Overview

This is the first stable release of the automated pipeline for retrieving, filtering, and analyzing protein crystallization conditions from the RCSB Protein Data Bank (PDB). Starting from an amino acid sequence, the pipeline produces structured, analysis-ready datasets and publication-quality figures.

