# pypdb demos

This is a set of basic examples of the usage and outputs of the individual functions included in pypdb. There are generally two types of functions:

+ Functions that perform searches and return lists of PDB IDs
+ Functions that get information about specific PDB IDs

The list of supported search types, and of the kinds of information that can be returned for a given PDB ID, is large (and growing); it is enumerated completely in the docstrings of pypdb.py. The PDB supports a very wide range of queries, so an option that is not yet available can likely be implemented fairly easily by following the structure of the query types that already exist. I appreciate any feedback and pull requests.

**Another notebook in this directory, advanced_demos.ipynb, includes more in-depth uses of multiple functions, including the tutorial on graphing the popularity of CRISPR that was originally included in this notebook.**

### Preamble

In [1]:
%pylab inline
from IPython.display import HTML

# from pypdb.pypdb import *
from pypdb import *

import pprint

Populating the interactive namespace from numpy and matplotlib


# 1. Search functions that return lists of PDB IDs

### Get a list of PDBs for a specific search term

In [2]:
search_dict = make_query('actin network')
found_pdbs = do_search(search_dict)
print(found_pdbs)

['1D7M', '3W3D', '4A7H', '4A7L', '4A7N']


### Search by PubMed ID Number

In [1]:
search_dict = make_query('27499440','PubmedIdQuery')
found_pdbs = do_search(search_dict)
print(found_pdbs)

['5IMT', '5IMW', '5IMY']

### Search by a specific modified structure

In [3]:
search_dict = make_query('3W3D',querytype='ModifiedStructuresQuery')
found_pdbs = do_search(search_dict)
print(found_pdbs)

['1DGI', '1FYD', '1JL0', '2MM3', '2MS4', '2N6B', '3JAJ', '3JAN', '3JC1', '3WP8', '3WPA', '3WPO', '3WPP', '3WPR', '3WQA', '3X2L', '3X2M', '3X3F', '4I79', '4UW7', '4UW8', '4WVM', '4WYN', '4WYP', '4WYZ', '4XB4', '4XJV', '4XM6', '4XM7', '4XM8', '4XRT', '4XRW', '4YI8', '4YI9', '4YXX', '4YXY', '4YXZ', '4YY2', '4YY5', '4Z8N', '4ZKQ', '4ZLT', '4ZTX', '4ZTY', '5AQ5', '5BV8', '5BVL', '5BW8', '5BWK', '5BYO', '5C10', '5C12', '5C15', '5C2D', '5C2F', '5CCK', '5CMA', '5CQZ', '5CTM', '5CTN', '5CWB', '5CWC', '5CWD', '5CWF', '5CWG', '5CWH', '5CWI', '5CWJ', '5CWK', '5CWL', '5CWM', '5CWN', '5CWO', '5CWP', '5CWQ', '5CYN', '5D0I', '5D0K', '5D0M', '5D1K', '5D1L', '5D1M', '5D5H', '5DCA', '5DCU', '5DCW', '5DCY', '5DF1', '5DIY', '5DJS', '5DLM', '5DPK', '5DSV', '5DT1', '5E1E', '5E1L', '5E29', '5E3P', '5E3Q', '5E3R', '5E47', '5E4R', '5E8G', '5E8I', '5EC6', '5ECG', '5EE2', '5EE4', '5EK0', '5EPJ', '5F0A', '5F0D']


### Search by Author

In [4]:
search_dict = make_query('Perutz, M.F.',querytype='AdvancedAuthorQuery')
found_pdbs = do_search(search_dict)
print(found_pdbs)

['1CQ4', '1FDH', '1GDJ', '1HDA', '1PBX', '2DHB', '2GDM', '2HHB', '2MHB', '3HHB', '4HHB']


### Search by Motif

In [5]:
search_dict = make_query('T[AG]AGGY',querytype='MotifQuery')
found_pdbs = do_search(search_dict)
print(found_pdbs)

['EZ:1', '3SGH:1', '4F47:1']


### Search by a specific experimental method

In [6]:
search_dict = make_query('SOLID-STATE NMR',querytype='ExpTypeQuery')
found_pdbs = do_search(search_dict)
print(found_pdbs)

['1CEK', '1EQ8', '1M8M', '1MAG', '1MP6', '1MZT', '1NH4', '1NYJ', '1PI7', '1PI8', '1PJD', '1PJE', '1PJF', '1Q7O', '1RVS', '1XSW', '1ZN5', '1ZY6', '2C0X', '2CZP', '2E8D', '2H3O', '2H95', '2JSV', '2JU6', '2JZZ', '2K0P', '2KAD', '2KB7', '2KHT', '2KIB', '2KJ3', '2KLR', '2KQ4', '2KQT', '2KRJ', '2KSJ', '2KWD', '2KYV', '2L0J', '2L3Z', '2LBU', '2LEG', '2LGI', '2LJ2', '2LME', '2LMN', '2LMO', '2LMP', '2LMQ', '2LNL', '2LNQ', '2LNY', '2LPZ', '2LTQ', '2LU5', '2M02', '2M3B', '2M3G', '2M4J', '2M5K', '2M5M', '2M5N', '2M67', '2MC7', '2MCU', '2MCV', '2MCW', '2MCX', '2MEX', '2MJZ', '2MME', '2MMU', '2MPX', '2MPZ', '2MS7', '2MSG', '2MTZ', '2MVX', '2MXU', '2N0R', '2N1E', '2N1F', '2N28', '2N3D', '2N7H', '2NNT', '2RLZ', '2UVS', '2W0N', '2XKM', '3ZPK']


### Search for entries without free ligands

In [7]:
search_dict = make_query('', querytype='NoLigandQuery')
found_pdbs = do_search(search_dict)
print(found_pdbs[:10])

['100D', '101D', '101M', '102D', '102L', '102M', '103L', '103M', '104M', '105M']


### Search by protein symmetry group

In [8]:
kk = do_protsym_search('C9', min_rmsd=0.0, max_rmsd=1.0)
print(kk[:5])

['1KZU', '1NKZ', '2FKW', '3B8M', '3B8N']


# Information Search functions

While the basic search functions in the previous section return raw lists of PDB IDs, these functions are intended to be more user-facing: they take search keywords and return lists of authors, papers, or dates.

### Find most common authors for a given keyword

In [10]:
top_authors = find_authors('crispr',max_results=100)
pprint.pprint(top_authors[:5])

['Doudna, J.A.', 'Jinek, M.', 'Li, H.', 'Savchenko, A.', 'Zhou, K.']


### Find papers for a given keyword

In [11]:
matching_papers = find_papers('crispr',max_results=3)
pprint.pprint(matching_papers)

['Crystal structure of a CRISPR-associated protein from thermus thermophilus',
 'Crystal structure of a hypothetical protein TT1823 from Thermus '
 'thermophilus',
 'Hypothetical protein PF1117 from Pyrococcus furiosus']


# 2. Functions that return information about single PDB entries

### Get the full PDB file

In [24]:
pdb_file = get_pdb_file('4lza', filetype='cif', compression=True)
print(pdb_file[:200])

data_4LZA
# 
_entry.id   4LZA 
# 
_audit_conform.dict_name       mmcif_pdbx.dic 
_audit_conform.dict_version    4.032 
_audit_conform.dict_location   http://mmcif.pdb.org/dictionaries/ascii/mmcif_pdbx


### Get a general description of the entry's metadata

In [4]:
describe_pdb('4lza')

{'citation_authors': 'Malashkevich, V.N., Bhosle, R., Toro, R., Hillerich, B., Gizzi, A., Garforth, S., Kar, A., Chan, M.K., Lafluer, J., Patel, H., Matikainen, B., Chamala, S., Lim, S., Celikgil, A., Villegas, G., Evans, B., Love, J., Fiser, A., Khafizov, K., Seidel, R., Bonanno, J.B., Almo, S.C.',
 'deposition_date': '2013-07-31',
 'expMethod': 'X-RAY DIFFRACTION',
 'keywords': 'TRANSFERASE',
 'last_modification_date': '2013-08-14',
 'nr_atoms': '0',
 'nr_entities': '1',
 'nr_residues': '390',
 'release_date': '2013-08-14',
 'resolution': '1.84',
 'status': 'CURRENT',
 'structureId': '4LZA',
 'structure_authors': 'Malashkevich, V.N., Bhosle, R., Toro, R., Hillerich, B., Gizzi, A., Garforth, S., Kar, A., Chan, M.K., Lafluer, J., Patel, H., Matikainen, B., Chamala, S., Lim, S., Celikgil, A., Villegas, G., Evans, B., Love, J., Fiser, A., Khafizov, K., Seidel, R., Bonanno, J.B., Almo, S.C., New York Structural Genomics Research Consortium (NYSGRC)',
 'title': 'Crystal structure of adenin
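
Every value in the dictionary above is a string, so numeric fields and dates need converting before they can be compared or plotted. A minimal sketch, assuming the field names and values shown in the output above; the `description` dict is a hand-copied subset of that output, not a live call to `describe_pdb`:

```python
from datetime import date

# Hand-copied subset of the describe_pdb('4lza') output shown above.
description = {
    'deposition_date': '2013-07-31',
    'release_date': '2013-08-14',
    'nr_entities': '1',
    'nr_residues': '390',
    'resolution': '1.84',
}

# Dates are ISO formatted, so the standard library can parse them directly.
deposited = date.fromisoformat(description['deposition_date'])
released = date.fromisoformat(description['release_date'])
days_to_release = (released - deposited).days

# Counts and the resolution arrive as strings and must be cast explicitly.
nr_residues = int(description['nr_residues'])
resolution = float(description['resolution'])

print(days_to_release, nr_residues, resolution)
```

Casting once, right after the lookup, keeps later comparisons such as `resolution < 2.0` from silently comparing strings.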

### Get all of the information deposited in a PDB entry

In [12]:
all_info = get_all_info('4lza')
print(all_info)

{'id': '4LZA', 'polymer': {'@length': '195', '@weight': '22023.9', '@type': 'protein', 'enzClass': {'@ec': '2.4.2.7'}, '@entityNr': '1', 'chain': [{'@id': 'A'}, {'@id': 'B'}], 'Taxonomy': {'@id': '496866', '@name': 'Thermoanaerobacter pseudethanolicus'}, 'polymerDescription': {'@description': 'Adenine phosphoribosyltransferase'}, 'macroMolecule': {'@name': 'Adenine phosphoribosyltransferase', 'accession': {'@id': 'B0K969'}}, 'synonym': {'@name': 'APRT'}}}


In [9]:
results = get_all_info('2F5N')
first_polymer = results['polymer'][0]
first_polymer['polymerDescription']

{'@description': "5'-D(*AP*GP*GP*TP*AP*GP*AP*CP*CP*TP*GP*GP*AP*CP*GP*C)-3'"}
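
The two calls above differ in shape: for 4LZA the `'polymer'` value is a single dict, while for 2F5N it is indexed like a list. A small helper, here called `as_list` (a hypothetical name, not part of pypdb), lets downstream code treat both cases the same way. The dicts below are trimmed stand-ins for the real output, and the second 2F5N entry is a placeholder.

```python
def as_list(value):
    """Wrap a single item in a list; pass lists through; map None to []."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]

# Trimmed stand-in for get_all_info('4lza'): one polymer, stored as a dict.
single_entity = {
    'id': '4LZA',
    'polymer': {
        'polymerDescription': {
            '@description': 'Adenine phosphoribosyltransferase'}},
}

# Trimmed stand-in for get_all_info('2F5N'): several polymers, stored as a list.
multi_entity = {
    'id': '2F5N',
    'polymer': [
        {'polymerDescription': {
            '@description':
                "5'-D(*AP*GP*GP*TP*AP*GP*AP*CP*CP*TP*GP*GP*AP*CP*GP*C)-3'"}},
        {'polymerDescription': {'@description': '(placeholder second entity)'}},
    ],
}

for info in (single_entity, multi_entity):
    for polymer in as_list(info.get('polymer')):
        print(info['id'], polymer['polymerDescription']['@description'])
```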

### Run a BLAST search on an entry

There are several options here. One function, get_blast(), returns a dict() like most of the other functions in this section, but the metadata associated with a BLAST search makes that dictionary deeply nested. A simpler function, get_blast2(), parses the text of the raw output page and returns a tuple consisting of (1) a ranked list of the PDB IDs that were hits and (2) a list of the actual BLAST alignments and similarity scores.

In [11]:
blast_results = get_blast('2F5N', chain_id='A')
just_hits = blast_results['BlastOutput_iterations']['Iteration']['Iteration_hits']['Hit']
print(just_hits[50]['Hit_hsps']['Hsp']['Hsp_hseq'])

PELPEVETVRRELEKRIVGQKIISIEATYPRMVL--TGFEQLKKELTGKTIQGISRRGKYLIFEIGDDFRLISHLRMEGKYRLATLDAPREKHDHLTMKFADG-QLIYADVRKFGTWELISTDQVLPYFLKKKIGPEPTYEDFDEKLFREKLRKSTKKIKPYLLEQTLVAGLGNIYVDEVLWLAKIHPEKETNQLIESSIHLLHDSIIEILQKAIKLGGSSIRTY-SALGSTGKMQNELQVYGKTGEKCSRCGAEIQKIKVAGRGTHFCPVCQQ
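
Chained lookups like the one above raise an exception as soon as one level is missing, for example when a search returns fewer hits than expected. One way to guard against that is a small path-walking helper. This is a sketch only: `dig` is a hypothetical helper, not a pypdb function, and `mock_blast` is a tiny stand-in that reuses the key names from the cell above with shortened sequences.

```python
def dig(data, path, default=None):
    """Follow a list of keys and indices into nested dicts and lists."""
    current = data
    for key in path:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            return default
    return current

# Stand-in with the same nesting as the get_blast() result used above.
mock_blast = {
    'BlastOutput_iterations': {
        'Iteration': {
            'Iteration_hits': {
                'Hit': [
                    {'Hit_hsps': {'Hsp': {'Hsp_hseq': 'MPELPEVETIRR'}}},
                    {'Hit_hsps': {'Hsp': {'Hsp_hseq': 'PELPEVETVRRE'}}},
                ]
            }
        }
    }
}

HITS_PATH = ['BlastOutput_iterations', 'Iteration', 'Iteration_hits', 'Hit']
SEQ_PATH = ['Hit_hsps', 'Hsp', 'Hsp_hseq']

second_seq = dig(mock_blast, HITS_PATH + [1] + SEQ_PATH)
missing_seq = dig(mock_blast, HITS_PATH + [50] + SEQ_PATH, default='')
print(second_seq, repr(missing_seq))
```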


In [12]:
blast_results = get_blast2('2F5N', chain_id='A', output_form='HTML')
print('Total Results: ' + str(len(blast_results[0])) +'\n')
pprint.pprint(blast_results[1][0])

Total Results: 84

<pre>
&gt;<a name="45354"></a>2F5P:3:A|pdbid|entity|chain(s)|sequence
          Length = 274

 Score =  545 bits (1404), Expect = e-155,   Method: Composition-based stats.
 Identities = 274/274 (100%), Positives = 274/274 (100%)

Query: 1   MPELPEVETIRRTLLPLIVGKTIEDVRIFWPNIIRHPRDSEAFAARMIGQTVRGLERRGK 60
           MPELPEVETIRRTLLPLIVGKTIEDVRIFWPNIIRHPRDSEAFAARMIGQTVRGLERRGK
Sbjct: 1   MPELPEVETIRRTLLPLIVGKTIEDVRIFWPNIIRHPRDSEAFAARMIGQTVRGLERRGK 60

Query: 61  FLKFLLDRDALISHLRMEGRYAVASALEPLEPHTHVVFCFTDGSELRYRDVRKFGTMHVY 120
           FLKFLLDRDALISHLRMEGRYAVASALEPLEPHTHVVFCFTDGSELRYRDVRKFGTMHVY
Sbjct: 61  FLKFLLDRDALISHLRMEGRYAVASALEPLEPHTHVVFCFTDGSELRYRDVRKFGTMHVY 120

Query: 121 AKEEADRRPPLAELGPEPLSPAFSPAVLAERAVKTKRSVKALLLDCTVVAGFGNIYVDES 180
           AKEEADRRPPLAELGPEPLSPAFSPAVLAERAVKTKRSVKALLLDCTVVAGFGNIYVDES
Sbjct: 121 AKEEADRRPPLAELGPEPLSPAFSPAVLAERAVKTKRSVKALLLDCTVVAGFGNIYVDES 180

Query: 181 LFRAGILPGRPAASLSSKEIERLHEEMVATIGEAVMKGGSTVRTYVNTQGEAGTFQHHLY 240
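
The alignment text returned by get_blast2() is plain BLAST report output, so summary numbers can be pulled out with regular expressions. A minimal sketch, assuming the `Score = ... bits` and `Identities = n/m` lines look like the ones printed above; `summarize_alignment` is a hypothetical helper and `alignment_text` is copied from that output rather than fetched live.

```python
import re

# Two header lines copied from the alignment printed above.
alignment_text = """\
 Score =  545 bits (1404), Expect = e-155,   Method: Composition-based stats.
 Identities = 274/274 (100%), Positives = 274/274 (100%)
"""

SCORE_RE = re.compile(r'Score\s*=\s*([\d.]+)\s*bits')
IDENT_RE = re.compile(r'Identities\s*=\s*(\d+)/(\d+)')

def summarize_alignment(text):
    """Return the bit score and identity counts, or None if not found."""
    score = SCORE_RE.search(text)
    ident = IDENT_RE.search(text)
    if score is None or ident is None:
        return None
    matched, total = int(ident.group(1)), int(ident.group(2))
    return {
        'bit_score': float(score.group(1)),
        'identities': matched,
        'aligned_length': total,
        'identity_fraction': matched / total,
    }

summary = summarize_alignment(alignment_text)
print(summary)
```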
  

### Get PFAM information about an entry

In [37]:
pfam_info = get_pfam('2LME')
print(pfam_info)

{'pfamHit': {'@pfamAcc': 'PF03895.10', '@pfamName': 'YadA_anchor', '@structureId': '2LME', '@pdbResNumEnd': '105', '@pdbResNumStart': '28', '@pfamDesc': 'YadA-like C-terminal region', '@eValue': '5.0E-22', '@chainId': 'A'}}
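
The residue numbers and E-value in the hit are strings, so they need casting before any arithmetic. A short sketch using the dictionary printed above, copied by hand rather than fetched, to compute the length of the annotated domain:

```python
# Copied from the get_pfam('2LME') output shown above.
pfam_info = {'pfamHit': {'@pfamAcc': 'PF03895.10',
                         '@pfamName': 'YadA_anchor',
                         '@structureId': '2LME',
                         '@pdbResNumEnd': '105',
                         '@pdbResNumStart': '28',
                         '@pfamDesc': 'YadA-like C-terminal region',
                         '@eValue': '5.0E-22',
                         '@chainId': 'A'}}

hit = pfam_info['pfamHit']
start = int(hit['@pdbResNumStart'])
end = int(hit['@pdbResNumEnd'])

# Both end points are included in the domain, hence the +1.
domain_length = end - start + 1
e_value = float(hit['@eValue'])

print(hit['@pfamName'], domain_length, e_value)
```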


### Get chemical info

This function takes the chemical's ID (such as NAG), not a PDB ID.

In [39]:
chem_desc = describe_chemical('NAG')
pprint.pprint(chem_desc)

{'describeHet': {'ligandInfo': {'ligand': {'@chemicalID': 'NAG',
                                           '@molecularWeight': '221.208',
                                           '@type': 'D-saccharide',
                                           'InChI': 'InChI=1S/C8H15NO6/c1-3(11)9-5-7(13)6(12)4(2-10)15-8(5)14/h4-8,10,12-14H,2H2,1H3,(H,9,11)/t4-,5-,6-,7-,8-/m1/s1',
                                           'InChIKey': 'OVRNDRQMDRJTHS-FMDGEEDCSA-N',
                                           'chemicalName': 'N-ACETYL-D-GLUCOSAMINE',
                                           'formula': 'C8 H15 N O6',
                                           'smiles': 'CC(=O)N[C@@H]1[C@H]([C@@H]([C@H](O[C@H]1O)CO)O)O'}}}}
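
The `'formula'` field is a space-separated string, which is awkward to query directly. A minimal sketch of turning it into element counts, assuming every token is an element symbol optionally followed by a count, as in the two formulas shown in this notebook; `parse_formula` is a hypothetical helper, not part of pypdb.

```python
import re

# One token: an element symbol, then an optional integer count.
TOKEN_RE = re.compile(r'([A-Z][a-z]?)(\d*)')

def parse_formula(formula):
    """Convert a formula such as 'C8 H15 N O6' into {'C': 8, 'H': 15, ...}."""
    counts = {}
    for token in formula.split():
        match = TOKEN_RE.fullmatch(token)
        if match is None:
            raise ValueError('Unrecognized formula token: ' + token)
        element, number = match.groups()
        # A missing count means a single atom of that element.
        counts[element] = counts.get(element, 0) + int(number or 1)
    return counts

nag_counts = parse_formula('C8 H15 N O6')
print(nag_counts)
```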


### Get ligand info if present


In [40]:
ligand_dict = get_ligands('100D')
pprint.pprint(ligand_dict)

{'id': '100D',
 'ligandInfo': {'ligand': {'@chemicalID': 'SPM',
                           '@molecularWeight': '202.34',
                           '@structureId': '100D',
                           '@type': 'non-polymer',
                           'InChI': 'InChI=1S/C10H26N4/c11-5-3-9-13-7-1-2-8-14-10-4-6-12/h13-14H,1-12H2',
                           'InChIKey': 'PFNFFQXMRSDOHW-UHFFFAOYSA-N',
                           'chemicalName': 'SPERMINE',
                           'formula': 'C10 H26 N4',
                           'smiles': 'C(CCNCCCN)CNCCCN'}}}


### Get gene ontology info

In [45]:
gene_info = get_gene_onto('4Z0L')
pprint.pprint(gene_info['term'][0])

{'@chainId': 'A',
 '@id': 'GO:0001516',
 '@structureId': '4Z0L',
 'detail': {'@definition': 'The chemical reactions and pathways resulting '
                           'in the formation of prostaglandins, any of a '
                           'group of biologically active metabolites which '
                           'contain a cyclopentane ring.',
            '@name': 'prostaglandin biosynthetic process',
            '@ontology': 'B',
            '@synonyms': 'prostaglandin anabolism, prostaglandin '
                         'biosynthesis, prostaglandin formation, '
                         'prostaglandin synthesis'}}


### Get sequence clusters by chain

In [13]:
sclust = get_seq_cluster('2F5N.A')
pprint.pprint(sclust['pdbChain'][:10]) # Just look at the top 10

[{'@name': '3ZJB.A', '@rank': '1'},
 {'@name': '3ZJB.B', '@rank': '1'},
 {'@name': '3ZJB.C', '@rank': '1'},
 {'@name': '1CZY.A', '@rank': '2'},
 {'@name': '1CZY.B', '@rank': '2'},
 {'@name': '1CZY.C', '@rank': '2'},
 {'@name': '1D00.A', '@rank': '3'},
 {'@name': '1D00.B', '@rank': '3'},
 {'@name': '1D00.C', '@rank': '3'},
 {'@name': '1D00.D', '@rank': '3'}]
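
Several chains share each rank in the output above, so it can be convenient to group them. A short sketch using the ten entries printed above, copied by hand; `group_by_rank` is a hypothetical helper, not a pypdb function.

```python
from collections import defaultdict

# The ten entries printed above.
chains = [
    {'@name': '3ZJB.A', '@rank': '1'},
    {'@name': '3ZJB.B', '@rank': '1'},
    {'@name': '3ZJB.C', '@rank': '1'},
    {'@name': '1CZY.A', '@rank': '2'},
    {'@name': '1CZY.B', '@rank': '2'},
    {'@name': '1CZY.C', '@rank': '2'},
    {'@name': '1D00.A', '@rank': '3'},
    {'@name': '1D00.B', '@rank': '3'},
    {'@name': '1D00.C', '@rank': '3'},
    {'@name': '1D00.D', '@rank': '3'},
]

def group_by_rank(chain_entries):
    """Map each integer rank to the list of chain names that share it."""
    groups = defaultdict(list)
    for entry in chain_entries:
        groups[int(entry['@rank'])].append(entry['@name'])
    return dict(groups)

by_rank = group_by_rank(chains)
for rank in sorted(by_rank):
    print(rank, by_rank[rank])
```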


### Get the representative for a chain

In [46]:
clusts = get_clusters('4hhb.A')
print(clusts)

{'pdbChain': {'@name': '2W72.A'}}


### List all taxa associated with a list of IDs

In [15]:
crispr_query = make_query('crispr')
crispr_results = do_search(crispr_query)
pprint.pprint(list_taxa(crispr_results[:10]))

['Thermus thermophilus',
 'Thermus thermophilus',
 'Pyrococcus furiosus',
 'Sulfolobus solfataricus',
 'Sulfolobus solfataricus',
 'Sulfolobus solfataricus',
 'Hyperthermus butylicus',
 'unidentified phage',
 'Archaeoglobus fulgidus',
 'Sulfolobus solfataricus']
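
Because list_taxa returns one name per ID, repeated organisms show up as duplicates. Tallying them with `collections.Counter` gives a quick summary. The sketch below uses the ten names printed above, copied by hand rather than fetched.

```python
from collections import Counter

# The ten taxa printed above, in the same order.
taxa = [
    'Thermus thermophilus',
    'Thermus thermophilus',
    'Pyrococcus furiosus',
    'Sulfolobus solfataricus',
    'Sulfolobus solfataricus',
    'Sulfolobus solfataricus',
    'Hyperthermus butylicus',
    'unidentified phage',
    'Archaeoglobus fulgidus',
    'Sulfolobus solfataricus',
]

taxon_counts = Counter(taxa)

# most_common() sorts by descending count.
for name, count in taxon_counts.most_common(2):
    print(name, count)
```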


### List data types for a list of IDs

In [16]:
crispr_query = make_query('crispr')
crispr_results = do_search(crispr_query)
pprint.pprint(list_types(crispr_results[:5]))

['protein', 'protein', 'protein', 'protein', 'protein']
