v1.0.0 – Initial release
🎉 Protein Crystallization Data Extraction (PCDE) - v1.0.0
Protein Crystallization Data Extraction Tool
by Nana Njantang Ruth · ORCID 0000-0002-6003-7521
Overview
This is the first stable release of the automated pipeline for retrieving, filtering, and analyzing protein crystallization conditions from the RCSB Protein Data Bank (PDB). Starting from an amino acid sequence, the pipeline produces structured, analysis-ready datasets and publication-quality figures.
What's included in this release
🗂️ Input & FASTA Export
- Accepts any amino acid, DNA, or RNA sequence interactively
- Automatically detects sequence type (protein / DNA / RNA)
- Saves the input sequence as a FASTA file to output/{seq_type_name}/
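The input step above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the function names `detect_sequence_type` and `write_fasta` are hypothetical, and the alphabet-based heuristic is an assumption (a short peptide made only of A, C, G and T would be classified as DNA).

```python
from pathlib import Path

DNA_LETTERS = set("ACGTN")
RNA_LETTERS = set("ACGUN")


def detect_sequence_type(sequence: str) -> str:
    """Classify a sequence as 'dna', 'rna', or 'protein' from its alphabet."""
    letters = set("".join(sequence.upper().split()))
    if not letters:
        raise ValueError("empty sequence")
    if letters <= DNA_LETTERS:
        return "dna"
    if letters <= RNA_LETTERS:
        return "rna"
    # Anything outside the nucleotide alphabets is treated as protein.
    return "protein"


def write_fasta(sequence: str, seq_type_name: str, root: str = "output") -> Path:
    """Write the sequence to <root>/<seq_type_name>/<seq_type_name>_sequence.fasta."""
    out_dir = Path(root) / seq_type_name
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{seq_type_name}_sequence.fasta"
    seq = "".join(sequence.upper().split())
    # Wrap the sequence at 60 characters per line, a common FASTA convention.
    lines = [seq[i:i + 60] for i in range(0, len(seq), 60)]
    path.write_text(f">{seq_type_name}\n" + "\n".join(lines) + "\n")
    return path
```

Calling `write_fasta("MSPRK", "MyProtein")` would create `output/MyProtein/MyProtein_sequence.fasta`, matching the layout listed under Output Structure.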
🔍 Step 1 — RCSB Sequence Search (rcsb_sequence_identity.py)
- Queries the RCSB Search API (/rcsbsearch/v2/query) using sequence similarity scoring
- Maps sequence type to the correct target (pdb_protein_sequence, pdb_dna_sequence, pdb_rna_sequence)
- Filters results to X-ray crystallography only via a per-hit lookup to /rest/v1/core/entry/{pdb_id}
- Extracts per-hit: PDB_ID, Entity, Score, Seq_id (sequence identity), E-value
- Saves ranked hits to {seq_type_name}_rcsb_hits.csv
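As an illustration of Step 1, the sketch below builds a sequence-search request body and checks an entry record for the X-ray method. It makes no network calls. The host names, the cutoff defaults, the function names, and the exact JSON field names are assumptions based on the public RCSB API and the target names listed above, so they should be checked against the current API schema before use.

```python
# Assumed hosts; the release notes only give the URL paths.
SEARCH_URL = "https://search.rcsb.org/rcsbsearch/v2/query"
ENTRY_URL = "https://data.rcsb.org/rest/v1/core/entry/{pdb_id}"

# Sequence type -> search target, as listed in the release notes.
TARGETS = {
    "protein": "pdb_protein_sequence",
    "dna": "pdb_dna_sequence",
    "rna": "pdb_rna_sequence",
}


def build_sequence_query(sequence, seq_type, evalue_cutoff=1.0, identity_cutoff=0.0):
    """Return a request body for a sequence search (field names are assumptions)."""
    return {
        "query": {
            "type": "terminal",
            "service": "sequence",
            "parameters": {
                "target": TARGETS[seq_type],
                "value": sequence,
                "evalue_cutoff": evalue_cutoff,      # hypothetical default
                "identity_cutoff": identity_cutoff,  # hypothetical default
            },
        },
        "return_type": "polymer_entity",
        "request_options": {"scoring_strategy": "sequence"},
    }


def is_xray_entry(entry_json):
    """True if any experimental method in the entry record is X-ray diffraction."""
    methods = [e.get("method", "") for e in entry_json.get("exptl", [])]
    return any(m.upper() == "X-RAY DIFFRACTION" for m in methods)
```

In a real run the body would be posted to `SEARCH_URL` (for example with `requests.post(SEARCH_URL, json=payload)`), and each hit's entry record fetched from `ENTRY_URL` would be passed to `is_xray_entry`.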
🧹 Step 2 — Crystallization Data Retrieval & Filtering (PDB_searchAPI.py)
- Runs a full PDB search for crystallization metadata
- Filters entries by experimental conditions via filter_experimental_conditions()
- All intermediate temporary CSV files are automatically cleaned up after each stage
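The stage pattern described above (write to a temporary CSV, filter, then clean up) might look like the sketch below. The criteria used by `filter_experimental_conditions()` are not given in these notes, so the helper `keep_rows_with_conditions` is a hypothetical stand-in that simply keeps rows with non-empty pH and Temp values. The PDB IDs in the usage example are placeholders.

```python
import csv
import os
import tempfile


def write_temp_csv(rows, fieldnames):
    """Write rows to a named temporary CSV file and return its path."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".csv", newline="", delete=False
    ) as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
        return fh.name


def keep_rows_with_conditions(csv_path, required=("pH", "Temp")):
    """Keep only rows where every required column is non-empty (assumed criterion)."""
    with open(csv_path, newline="") as fh:
        return [
            row for row in csv.DictReader(fh)
            if all((row.get(col) or "").strip() for col in required)
        ]


def run_stage(rows):
    """Run one stage: stash rows in a temp CSV, filter, and always delete the file."""
    tmp_path = write_temp_csv(rows, ["PDB_ID", "pH", "Temp"])
    try:
        return keep_rows_with_conditions(tmp_path)
    finally:
        os.remove(tmp_path)  # cleanup happens even if filtering raises
```

For example, `run_stage([{"PDB_ID": "1ABC", "pH": "7.5", "Temp": "293"}, {"PDB_ID": "2DEF", "pH": "", "Temp": "291"}])` keeps only the first row, and no temporary file is left behind.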
🧪 Compound Annotation (extract_structures.py)
- Enriches filtered crystallization data with a COMPOUND column using structures.pkl
- Output stored in a temporary CSV before merging
🔗 Step 3 — Merge RCSB + Crystallization Data
- Merges RCSB sequence-identity results with crystallization condition data on PDB_ID
- Final merged CSV contains a defined column order: PDB_ID · Entity · Score · Seq_id · E-value · Resolution · Pubmed_id · Method · pH · Temp · Ligands · Polymer · Assembly · pdbx_pH_range · pdbx_details · Compounds (mM) · PEG_Id · PEG_con
- Saved as {seq_type_name}_merged_results.csv
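The merge step can be illustrated with a dependency-free sketch. The join type is not stated in these notes, so an inner join on PDB_ID is assumed here; the function name `merge_on_pdb_id` is hypothetical, and missing columns are filled with empty strings purely for illustration.

```python
# Column order as listed in the release notes.
COLUMN_ORDER = [
    "PDB_ID", "Entity", "Score", "Seq_id", "E-value", "Resolution",
    "Pubmed_id", "Method", "pH", "Temp", "Ligands", "Polymer", "Assembly",
    "pdbx_pH_range", "pdbx_details", "Compounds (mM)", "PEG_Id", "PEG_con",
]


def merge_on_pdb_id(hits, conditions):
    """Inner-join two lists of row dicts on PDB_ID and apply the column order."""
    conditions_by_id = {row["PDB_ID"]: row for row in conditions}
    merged = []
    for hit in hits:
        cond = conditions_by_id.get(hit["PDB_ID"])
        if cond is None:
            continue  # assumed inner join: drop hits without crystallization data
        combined = {**cond, **hit}
        merged.append({col: combined.get(col, "") for col in COLUMN_ORDER})
    return merged
```

With pandas, which this release lists as a dependency, a comparable result would come from something like `hits_df.merge(conditions_df, on="PDB_ID")[COLUMN_ORDER]`, where both frame names are placeholders.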
📊 Step 4 — High-Resolution Visualization (plot.py)
- run_plot() generates 300 dpi analytical scatter plots from the merged CSV:
  - pH vs. Temperature (K)
  - pH vs. PEG concentration (%)
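A 300 dpi scatter plot of the kind listed above could be produced along these lines. This is a sketch, not the body of `run_plot()`: the function name `scatter_ph_vs_temperature`, the figure size, and the marker styling are assumptions, and the values in the usage note are placeholders rather than real data.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt


def scatter_ph_vs_temperature(ph_values, temps_k, out_path):
    """Save a pH vs. temperature scatter plot at 300 dpi and return the path."""
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.scatter(ph_values, temps_k, s=20, alpha=0.7)
    ax.set_xlabel("pH")
    ax.set_ylabel("Temperature (K)")
    ax.set_title("pH vs. Temperature")
    fig.tight_layout()
    fig.savefig(out_path, dpi=300)
    plt.close(fig)  # release the figure's memory
    return out_path
```

For example, `scatter_ph_vs_temperature([6.5, 7.0, 8.0], [277, 291, 293], "pH_vs_Temperature.png")` writes a PNG using the file name shown under Output Structure.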
📄 Consolidated PDF Report
- Outputs Cryst_cocktail_Table.pdf: a colored summary table compiling PDB IDs, alignment metrics, ligands, and complete chemical cocktails into a single laboratory resource
Pipeline Flow
Sequence input
│
├──► FASTA export
│
├──► Step 1: RCSB sequence search + X-ray filter → rcsb_hits.csv
│
├──► Step 2: PDB crystallization search (×6 workers) → temp CSV
│         └── filter experimental conditions → temp CSV
│               └── append COMPOUND column → temp CSV
│
├──► Step 3: Merge RCSB + crystallization data → merged_results.csv
│
└──► Step 4: run_plot() → scatter plots (300 dpi)
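The "×6 workers" in Step 2 suggests the per-entry lookups run concurrently. A minimal sketch with `concurrent.futures` is shown below, assuming a thread pool (the notes do not say whether threads or processes are used). `fetch_conditions` is a hypothetical stand-in that returns a dummy record instead of calling the PDB.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_conditions(pdb_id):
    """Stand-in for a per-entry request; returns a dummy record."""
    return {"PDB_ID": pdb_id, "pH": None}


def fetch_all(pdb_ids, worker=fetch_conditions, max_workers=6):
    """Run the worker over all IDs with a pool of threads, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, pdb_ids))
```

Because `pool.map` yields results in input order, the returned records line up with the ranked hit list even though the requests complete out of order.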
Output Structure
output/
└── {seq_type_name}/
    ├── {seq_type_name}_sequence.fasta
    ├── {seq_type_name}_rcsb_hits.csv
    ├── {seq_type_name}_merged_results.csv
    ├── Cryst_cocktail_Table.pdf
    └── plots/
        ├── pH_vs_Temperature.png
        └── pH_vs_PEG_concentration.png
Module Overview
| Module | Role |
|---|---|
| main.py | Pipeline orchestration and I/O |
| rcsb_sequence_identity.py | RCSB sequence search + X-ray filter |
| PDB_searchAPI.py | Crystallization metadata retrieval & filtering |
| extract_structures.py | Compound annotation via structures.pkl |
| plot.py | High-resolution scatter plot generation |
Dependencies
- Python 3.8+
- Third-party packages: requests, pandas, openpyxl, matplotlib
- Standard-library modules used: concurrent.futures, tempfile
Install all dependencies with:
pip install -r requirements.txt
📦 Download Assets
The following source archives are available for this release:
| Asset | Format | Size | Description |
|---|---|---|---|
| Source code (zip) | ZIP | - | Complete repository as ZIP archive |
| Source code (tar.gz) | TAR.GZ | - | Complete repository as compressed TAR archive |
Direct download links:
How to Install
Option 1: From ZIP Archive
# Download and extract
unzip V1.0.0.zip
cd Protein_Crystallization_Data_Extraction
# Create virtual environment
python -m venv venv
source venv/bin/activate # Linux/macOS
venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
Option 2: Clone from Git
git clone -b V1.0.0 https://github.com/RitAreaSciencePark/Protein_Crystallization_Data_Extraction.git
cd Protein_Crystallization_Data_Extraction
pip install -r requirements.txt
Quick Start
Web Application
cd protein_crystallization_app
python manage.py migrate
python manage.py runserver
Then visit http://127.0.0.1:8000
Command-Line (Single Sequence)
cd src
python Main.py
Enter sequence: MSPRKTYILKLYVAGNTPNSVRALK...
Enter a descriptive sequence type name: MyProtein
Command-Line (FASTA Batch)
cd src_fasta_file
python main.py input.fasta
How to Cite
If you use this pipeline in your research, please cite:
Nana Njantang Ruth, Valerio PIOMPONI, and Adrea DALLE VEDOVE (2026). Protein Crystallization Data Extraction Tool (v1.0.0). GitHub Repository. https://github.com/RitAreaSciencePark/Protein_Crystallization_Data_Extraction
Or use the CITATION.cff file included in this repository.
Known Limitations
- Compound annotation depends on the static structures.pkl reference file
- Free-text crystallization fields in PDB are not fully standardized across all entries
- Web interface improvements planned for v2.0.0
License
MIT © 2025 RitAreaSciencePark
This project is released under the MIT License. See the LICENSE file for details.
Support & Feedback
For issues, feature requests, or questions:
Thank you for using PCDE! 🧬