Unique Complex Identifiers for the PDB archive

Background

This repository contains code for a Python package that aggregates macromolecular complex data from the PDBe graph database, assigns unique identifiers to the complexes, and maps human-readable names to them.

Quick start

1.) Clone this repository

git clone git@github.com:PDBe-KB/process-complex-data.git

2.) Install dependencies with pip

pip install -r requirements.txt

Basic usage

python pdbe_complexes/main.py -b <bolt_url> -u <username> -p <password> -o <output_csv_path> -m <UniProt_mapping_path> -i <complex_portal_path>

A short explanation for each command line argument is given below:

bolt_url = Neo4j bolt url
username = Neo4j username
password = Neo4j password
output_csv_path = The path to the output CSV file
UniProt_mapping_path = The path to the directory containing the UniProt mapping file.
complex_portal_path = The path to the Complex Portal FTP site. Please use the following path value "pub/databases/IntAct/current/various/complex2pdb"
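As an illustration of how these six flags fit together, here is a minimal sketch of an argument parser. It is not taken from main.py: the `dest` names, the `required=True` setting and the example values (including the bolt URL and file paths) are assumptions made for this example. Only the short flags and the Complex Portal path come from the usage above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the six short flags listed in the usage line."""
    parser = argparse.ArgumentParser(description="PDBe complexes process (sketch)")
    # Short flags are from the README; dest names and required=True are assumptions.
    parser.add_argument("-b", dest="bolt_url", required=True, help="Neo4j bolt url")
    parser.add_argument("-u", dest="username", required=True, help="Neo4j username")
    parser.add_argument("-p", dest="password", required=True, help="Neo4j password")
    parser.add_argument("-o", dest="output_csv_path", required=True,
                        help="Path to the output CSV file")
    parser.add_argument("-m", dest="uniprot_mapping_path", required=True,
                        help="Directory containing the UniProt mapping file")
    parser.add_argument("-i", dest="complex_portal_path", required=True,
                        help="Path to the Complex Portal FTP site")
    return parser


# Placeholder values, for illustration only.
args = build_parser().parse_args([
    "-b", "bolt://localhost:7687",
    "-u", "neo4j",
    "-p", "secret",
    "-o", "out/complexes_master.csv",
    "-m", "sample",
    "-i", "pub/databases/IntAct/current/various/complex2pdb",
])
print(args.complex_portal_path)
```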

The manually curated complexes CSV files (complexes_molecules.csv, complexes_components.csv) are provided by Romana Gáborová, EMBL-EBI.

A sample UniProt mapping file is provided in the sample directory. This file maps obsolete UniProt accessions to their current replacements. Update this file weekly so that entries with obsolete UniProt accessions are corrected.
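The sketch below shows one way such a mapping could be applied. The file layout is an assumption (two columns per line: obsolete accession, then new accession), the function names are hypothetical, and the accessions are made-up placeholders. Check the sample file for the actual format.

```python
import csv
import io


def load_accession_mapping(handle):
    """Read (obsolete, new) accession pairs into a dictionary.

    Assumes a two-column CSV layout; the real mapping file may differ.
    """
    return {row[0]: row[1] for row in csv.reader(handle) if len(row) >= 2}


def update_accession(accession, mapping):
    """Return the replacement accession, or the input if it is not obsolete."""
    return mapping.get(accession, accession)


# Placeholder accessions, not real UniProt entries.
mapping = load_accession_mapping(io.StringIO("OLD001,NEW001\nOLD002,NEW002\n"))
print(update_accession("OLD001", mapping))
print(update_accession("CURRENT9", mapping))
```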

Documentation

Executing main.py runs two separate processes sequentially: process_complex.py and get_complex_name.py. The steps involved in each process are given below:

process_complex.py

Gets complex-composition data from Complex Portal.
Drops existing PDBComplex nodes in the graph database.
Reads existing mapping of complex-composition strings to pdb_complex_ids (complexes_master.csv) and stores the data in a reference dictionary.
Gets complex-composition data from the PDBe graph database.
Assigns unique PDB complex identifiers for each unique complex-composition and Complex Portal identifiers for consensus complex compositions.
Processes complex-composition data from the PDBe graph database to create relationships between selected pairs of nodes.
Creates relationships between six pairs of nodes:
1. Uniprot and PDBComplex
2. Entity and PDBComplex
3. UnmappedPolymer and PDBComplex
4. Rfam and PDBComplex
5. Assembly and PDBComplex
6. Complex and PDBComplex
Creates sub-complex relationships.
Creates a CSV file called complexes_mapping.csv that contains complex-related information except the names.
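The identifier-assignment step above can be sketched as follows. This is a simplified illustration, not the code in process_complex.py. It assumes that the reference dictionary is keyed by an MD5 hash of the complex-composition string (suggested by the md5_obj column in the output files) and that a new identifier takes the next free number. The function names and the composition strings are hypothetical.

```python
import hashlib


def composition_key(composition: str) -> str:
    """Hash a complex-composition string (assumed to be what md5_obj stores)."""
    return hashlib.md5(composition.encode("utf-8")).hexdigest()


def assign_complex_id(composition: str, reference: dict, prefix: str = "PDB-CPX-") -> str:
    """Reuse the identifier of a known composition, or mint the next one."""
    key = composition_key(composition)
    if key in reference:
        return reference[key]
    # Assumption: identifiers end in an integer and new ones continue the sequence.
    used = [int(value.rsplit("-", 1)[1]) for value in reference.values()]
    new_id = f"{prefix}{max(used, default=0) + 1}"
    reference[key] = new_id
    return new_id


reference = {}  # would normally be loaded from complexes_master.csv
first = assign_complex_id("ACC1_2_1234", reference)
again = assign_complex_id("ACC1_2_1234", reference)
second = assign_complex_id("ACC2_1_5678", reference)
print(first, again, second)
```

Keeping the reference dictionary between runs is what makes the identifiers stable: a composition that was seen before always receives the identifier it was given the first time.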

get_complex_name.py

Gets complex-related data from Complex Portal.
Gets complex-related data from PDBe graph database.
Assigns a complex name for each PDB Complex identifier, if possible.
Creates a CSV file called complexes_names.csv that contains the names assigned to the complexes.

Post-process

In the final step, the two CSV files are merged together into a single CSV file called complexes_master.csv using the pdb_complex_id as the column to join on. The parent CSV files are then deleted.
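A minimal sketch of this merge, using only the standard library, is shown below. It is not the code in main.py. The function name is hypothetical, and treating the merge as a left join (complexes without a name keep empty name columns) is an assumption. The two example rows are taken from the expected-content examples in this README.

```python
def merge_on_complex_id(mapping_rows, names_rows, key="pdb_complex_id"):
    """Join name rows onto mapping rows by the shared identifier column."""
    names_by_id = {row[key]: row for row in names_rows}
    # Columns contributed by the names file, excluding the join column.
    name_columns = [c for c in (names_rows[0] if names_rows else {}) if c != key]
    merged = []
    for row in mapping_rows:
        name_row = names_by_id.get(row[key], {})
        combined = dict(row)
        for column in name_columns:
            # Assumption: a complex without a name keeps an empty value.
            combined[column] = name_row.get(column, "")
        merged.append(combined)
    return merged


mapping_rows = [
    {"md5_obj": "52fce5e893d4552c319724c8b6ae7dab", "pdb_complex_id": "PDB-CPX-100015",
     "accession": "A0A010_2_67581", "entries": "5b01_1,5b00_1"},
    {"md5_obj": "e894061d1c2d6dd1e4683de2073998d0", "pdb_complex_id": "PDB-CPX-100016",
     "accession": "A0A011_2_67581", "entries": "3vkc_1,3vkd_1,3vka_1,3vkb_1,3vk5_1"},
]
names_rows = [
    {"pdb_complex_id": "PDB-CPX-100015", "complex_name": "MoeN5",
     "complex_name_type": "protein name from UniProt"},
]
merged = merge_on_complex_id(mapping_rows, names_rows)
print(merged[0]["complex_name"])
```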

Expected content of the CSV files (examples)

complexes_mapping.csv

md5_obj	pdb_complex_id	accession	entries
52fce5e893d4552c319724c8b6ae7dab	PDB-CPX-100015	A0A010_2_67581	5b01_1,5b00_1
e894061d1c2d6dd1e4683de2073998d0	PDB-CPX-100016	A0A011_2_67581	3vkc_1,3vkd_1,3vka_1,3vkb_1,3vk5_1
4fea44b9d12043c924d68c4db918cdd5	PDB-CPX-100017	A0A014C6J9_2_1310912	6br7_1
6c18bde5b9d22b482f74dbc78456982f	PDB-CPX-100018	A0A014M399_2_1188239	7dg0_1,7dfx_1
f1a69c0363ad712baa207434fb945c48	PDB-CPX-100019	A0A016UZK2_3_53326	7a4a_1

complexes_names.csv

pdb_complex_id	complex_name	complex_name_type
PDB-CPX-100015	MoeN5	protein name from UniProt
PDB-CPX-100016	MoeO5	protein name from UniProt
PDB-CPX-100017	Two-component system response regulator protein	protein name from UniProt
PDB-CPX-100018	DAC domain-containing protein	protein name from UniProt
PDB-CPX-100019	Integrase catalytic domain-containing protein	protein name from UniProt

complexes_master.csv

md5_obj	pdb_complex_id	accession	entries	complex_name	complex_name_type
52fce5e893d4552c319724c8b6ae7dab	PDB-CPX-100015	A0A010_2_67581	5b01_1,5b00_1	MoeN5	protein name from UniProt
e894061d1c2d6dd1e4683de2073998d0	PDB-CPX-100016	A0A011_2_67581	3vkc_1,3vkd_1,3vka_1,3vkb_1,3vk5_1	MoeO5	protein name from UniProt
4fea44b9d12043c924d68c4db918cdd5	PDB-CPX-100017	A0A014C6J9_2_1310912	6br7_1	Two-component system response regulator protein	protein name from UniProt
6c18bde5b9d22b482f74dbc78456982f	PDB-CPX-100018	A0A014M399_2_1188239	7dg0_1,7dfx_1	DAC domain-containing protein	protein name from UniProt
f1a69c0363ad712baa207434fb945c48	PDB-CPX-100019	A0A016UZK2_3_53326	7a4a_1	Integrase catalytic domain-containing protein	protein name from UniProt
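For downstream use, a file with these columns can be read as sketched below. The delimiter is left as a parameter because it depends on how the file is written; the tab used here is an assumption. Reading each item of the entries column as a PDB entry identifier followed by an assembly identifier (for example 5b01 and 1) is also an assumption, and the function name is hypothetical.

```python
import csv
import io

# One row from the example above, tab-separated for this sketch.
SAMPLE = (
    "md5_obj\tpdb_complex_id\taccession\tentries\tcomplex_name\tcomplex_name_type\n"
    "52fce5e893d4552c319724c8b6ae7dab\tPDB-CPX-100015\tA0A010_2_67581\t"
    "5b01_1,5b00_1\tMoeN5\tprotein name from UniProt\n"
)


def read_master(handle, delimiter="\t"):
    """Read the master file and split the entries column into pairs."""
    rows = []
    for row in csv.DictReader(handle, delimiter=delimiter):
        # Assumption: each item is "<pdb_id>_<assembly_id>".
        row["entries"] = [tuple(item.rsplit("_", 1)) for item in row["entries"].split(",")]
        rows.append(row)
    return rows


rows = read_master(io.StringIO(SAMPLE))
print(rows[0]["pdb_complex_id"], rows[0]["entries"])
```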

Dependencies

Dependencies for running the process

See requirements.txt

Development dependencies

For running unit tests and calculating test coverage, we suggest pytest, codecov and pytest-cov:

pip install pytest
pip install codecov
pip install pytest-cov

For running sanity checks and linting, we suggest pre-commit:

pip install pre-commit
pre-commit
pre-commit install

Authors

Sri Devan Appasamy (lead developer)
Mihaly Varadi (review & refactoring)
John Berrisford (initial process and conceptualisation)

License

Licensed under the Apache License, Version 2.0. Please see LICENSE.

Acknowledgements

We would like to acknowledge the PDBe team for their help with both coding and consultation, especially John Berrisford, who laid the foundations of this data process, and Romana Gáborová, who maintains a list of manually curated complex names.
