# PDBProp - Working With a Single PDB Structure

This notebook is a tutorial on the **PDBProp object**, specifically how it handles chains and how to map a sequence to a structure.

<div class="alert alert-info">

**Input:** PDB ID

</div>

<div class="alert alert-info">

**Output:** PDBProp object

</div>

## Imports

In [1]:
from ssbio.databases.pdb import PDBProp
from ssbio.databases.uniprot import UniProtProp

In [2]:
import sys
import logging

In [3]:
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)  # SET YOUR LOGGING LEVEL HERE #

In [4]:
# Other logger stuff for Jupyter notebooks
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
handler.setFormatter(formatter)
logger.handlers = [handler]

## Basic methods

In [5]:
my_structure = PDBProp(ident='5T4Q', description='E. coli ATP synthase')

### Download the structure

Downloading will:
- Download the file type of choice to the specified output directory
- Parse the PDB header file to fill out the metadata fields

In [6]:
import tempfile
my_structure.download_structure_file(outdir=tempfile.gettempdir(), file_type='cif')

[2017-04-12 14:11] [ssbio.databases.pdb] DEBUG: /tmp/5t4q.cif: structure file already saved
[2017-04-12 14:11] [ssbio.databases.pdb] DEBUG: 5T4Q: downloaded cif file
[2017-04-12 14:11] [ssbio.databases.pdb] DEBUG: /tmp/5t4q.cif: no resolution field
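
The download step can be pictured as "work out where the file lives remotely, and where it should be saved locally". The sketch below does only that planning, without touching the network. The helper `plan_download` is hypothetical and not part of ssbio, and the RCSB URL pattern is an assumption about where the file could be fetched from; ssbio may use a different source internally. The lowercase local file name mirrors the `/tmp/5t4q.cif` path in the log above.

```python
from pathlib import Path
import tempfile

# Assumption: RCSB serves entry files under this URL pattern.
RCSB_DOWNLOAD_URL = "https://files.rcsb.org/download/{pdb_id}.{ext}"


def plan_download(pdb_id, file_type="cif", outdir=None):
    """Hypothetical helper: return (url, local_path) without downloading anything."""
    outdir = Path(outdir or tempfile.gettempdir())
    url = RCSB_DOWNLOAD_URL.format(pdb_id=pdb_id.upper(), ext=file_type)
    # The log above shows the saved file using a lowercase ID
    local_path = outdir / f"{pdb_id.lower()}.{file_type}"
    return url, local_path


url, local_path = plan_download("5T4Q", file_type="cif")
print(url)
# A downloader would skip the request when the file is already present
print(local_path.name, "already saved" if local_path.exists() else "not downloaded yet")
```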


### View all attributes

In [7]:
my_structure.get_dict()

{'_structure_dir': '/tmp',
 'chains': [],
 'chemicals': ['ADP', 'ATP'],
 'date': '2017-01-04',
 'description': 'E. coli ATP synthase',
 'experimental_method': 'ELECTRON MICROSCOPY',
 'file_type': 'cif',
 'id': '5T4Q',
 'is_experimental': True,
 'mapped_chains': [],
 'pdb_title': 'Autoinhibited E. coli ATP synthase state 3',
 'reference_seq_top_coverage': None,
 'representative_chain': None,
 'resolution': None,
 'structure_file': '5t4q.cif',
 'taxonomy_name': ['Escherichia coli',
  'Escherichia coli',
  'Escherichia coli',
  'Escherichia coli',
  'Escherichia coli',
  'Escherichia coli',
  'Escherichia coli',
  'Escherichia coli']}

### Set chains that we are interested in (if any)

The `mapped_chains` attribute limits sequence analyses to the chains we specify (see the later section where we [align a sequence to this structure](#Aligning-a-sequence-to-the-structure)). In this example, ATP synthase is a complex of several protein chains, so if we are interested in the product of one specific gene, we can set the chains that correspond to it.

In [8]:
# Chains A, B, and C make up ATP synthase subunit alpha - from the gene b3734 (UniProt ID P0ABB0)
my_structure.add_mapped_chain_ids(['A', 'B', 'C'])

[2017-04-12 14:11] [ssbio.protein.structure.structprop] DEBUG: A: added to list of mapped chains
[2017-04-12 14:11] [ssbio.protein.structure.structprop] DEBUG: B: added to list of mapped chains
[2017-04-12 14:11] [ssbio.protein.structure.structprop] DEBUG: C: added to list of mapped chains
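
To make the idea of mapped chains concrete, here is a minimal pure-Python sketch of the bookkeeping: a list of chain IDs that later acts as a filter on per-chain data. Both functions are hypothetical stand-ins rather than ssbio code, the chain sequences are toy placeholders, and falling back to all chains when nothing is mapped is an assumption of this sketch.

```python
def add_mapped_chain_ids(mapped_chains, new_ids):
    """Append chain IDs to the list, skipping duplicates and keeping order."""
    for chain_id in new_ids:
        if chain_id not in mapped_chains:
            mapped_chains.append(chain_id)
    return mapped_chains


def select_mapped(chain_data, mapped_chains):
    """Keep only entries for mapped chains (assumption: all chains if none are mapped)."""
    if not mapped_chains:
        return dict(chain_data)
    return {cid: val for cid, val in chain_data.items() if cid in mapped_chains}


# Toy placeholder sequences, not the real 5T4Q chains
toy_chain_sequences = {"A": "MQLNST", "B": "XQLNST", "C": "XXXNST", "D": "MATGKI"}

mapped = add_mapped_chain_ids([], ["A", "B", "C"])
mapped = add_mapped_chain_ids(mapped, ["A"])  # adding a chain twice changes nothing
print(mapped)
print(sorted(select_mapped(toy_chain_sequences, mapped)))
```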


### Parse the structure to work with the Biopython Structure object

Parsing the structure reads the sequence of each chain and stores it in the `chains` attribute. It also returns an object that exposes the Biopython Structure (as `.structure`) and its first model (as `.first_model`), which opens up all of the methods Biopython provides for structures.

In [9]:
parsed_structure = my_structure.parse_structure()
print(type(parsed_structure.structure))
print(type(parsed_structure.first_model))

[2017-04-04 17:55] [ssbio.protein.structure.utils.structureio] DEBUG: 5t4q.cif: parsed 3D coordinates of structure
[2017-04-04 17:55] [ssbio.protein.structure.structprop] DEBUG: A: added to chains list
[2017-04-04 17:55] [ssbio.protein.structure.structprop] DEBUG: B: added to chains list
[2017-04-04 17:55] [ssbio.protein.structure.structprop] DEBUG: C: added to chains list
[2017-04-04 17:55] [ssbio.protein.structure.structprop] DEBUG: D: added to chains list
[2017-04-04 17:55] [ssbio.protein.structure.structprop] DEBUG: E: added to chains list
[2017-04-04 17:55] [ssbio.protein.structure.structprop] DEBUG: F: added to chains list
[2017-04-04 17:55] [ssbio.protein.structure.structprop] DEBUG: G: added to chains list
[2017-04-04 17:55] [ssbio.protein.structure.structprop] DEBUG: H: added to chains list
[2017-04-04 17:55] [ssbio.protein.structure.structprop] DEBUG: I: added to chains list
[2017-04-04 17:55] [ssbio.protein.structure.structprop] DEBUG: J: added to chains list
[2017-04-04 17:

<class 'Bio.PDB.Structure.Structure'>
<class 'Bio.PDB.Model.Model'>


### Clean and save the structure

Cleaning a structure does the following:
- Add missing chain identifiers to a PDB file
- Select a single chain if noted
- Remove alternate atom locations
- Add atom occupancies
- Add B (temperature) factors (default Biopython behavior)

In the example below, we will clean the structure so it only includes our mapped chains.

In [10]:
cleaned_structure = my_structure.clean_structure(outdir='/tmp', keep_chains=my_structure.mapped_chains, force_rerun=True)
cleaned_structure

[2017-04-04 17:55] [ssbio.protein.structure.utils.structureio] DEBUG: 5t4q.cif: parsed 3D coordinates of structure


'/tmp/5t4q_clean.pdb'
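
The cleaning steps listed above can be sketched on plain PDB-format `ATOM` records. This is only an illustration of the idea, not the ssbio implementation: `atom_line` and `clean_pdb_lines` are hypothetical helpers, the atoms are toy data, and the fill-in values (chain `X`, occupancy 1.00, B-factor 0.00, keeping only a blank or `A` alternate location) are assumptions chosen for the sketch.

```python
def atom_line(serial, name, altloc, resname, chain, resseq, xyz, occ="", bfac=""):
    """Build a toy fixed-width PDB ATOM record (columns 1-66)."""
    x, y, z = xyz
    return (f"ATOM  {serial:>5} {name:<4}{altloc:1}{resname:>3} {chain:1}{resseq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{occ:>6}{bfac:>6}")


def clean_pdb_lines(lines, keep_chains=None, default_chain="X"):
    """Hypothetical cleaner: filter chains, drop altlocs, fill occupancy and B-factor."""
    cleaned = []
    for line in lines:
        if not line.startswith(("ATOM", "HETATM")):
            cleaned.append(line)  # leave non-coordinate records untouched
            continue
        line = line.ljust(66)
        altloc, chain = line[16], line[21]
        if altloc not in (" ", "A"):
            continue  # keep only the first alternate location
        if chain == " ":
            chain = default_chain  # add a missing chain identifier
        if keep_chains and chain not in keep_chains:
            continue  # keep only the requested chains
        occupancy = float(line[54:60].strip() or "1.00")
        bfactor = float(line[60:66].strip() or "0.00")
        cleaned.append(f"{line[:16]} {line[17:21]}{chain}{line[22:54]}"
                       f"{occupancy:6.2f}{bfactor:6.2f}{line[66:]}")
    return cleaned


toy_lines = [
    atom_line(1, " N  ", " ", "MET", "A", 1, (0.0, 0.0, 0.0), "1.00", "20.00"),
    atom_line(2, " CA ", "A", "MET", "A", 1, (1.0, 0.0, 0.0), "0.50", "20.00"),
    atom_line(3, " CA ", "B", "MET", "A", 1, (1.2, 0.0, 0.0), "0.50", "20.00"),
    atom_line(4, " N  ", " ", "MET", "B", 1, (5.0, 0.0, 0.0)),  # no occupancy/B-factor
    atom_line(5, " N  ", " ", "MET", "D", 1, (9.0, 0.0, 0.0), "1.00", "20.00"),
]

cleaned = clean_pdb_lines(toy_lines, keep_chains=["A", "B", "C"])
for line in cleaned:
    print(line)
```

Building the toy records with a formatter keeps the fixed-width columns aligned, which matters because the PDB format identifies fields by column position rather than by delimiter.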

### Viewing the structure

In [32]:
# The original structure
my_structure.view_structure(recolor=False)

In [33]:
# The cleaned structure
import nglview
nglview.show_structure_file(cleaned_structure)

### Aligning a sequence to the structure

In [13]:
# First load up the sequence we are interested in
my_sequence = UniProtProp('P0ABB0')
my_sequence.download_seq_file(outdir=tempfile.gettempdir())

In [14]:
# Then align the sequence to only the mapped_chains
my_structure.align_seqprop_to_mapped_chains(my_sequence, engine='needle', outdir=tempfile.gettempdir())

[2017-04-04 17:55] [ssbio.protein.structure.utils.structureio] DEBUG: 5t4q.cif: parsed 3D coordinates of structure
[2017-04-04 17:55] [ssbio.protein.structure.structprop] DEBUG: A: chain already present
[2017-04-04 17:55] [ssbio.protein.structure.structprop] DEBUG: B: chain already present
[2017-04-04 17:55] [ssbio.protein.structure.structprop] DEBUG: C: chain already present
[2017-04-04 17:55] [ssbio.protein.structure.structprop] DEBUG: D: chain already present
[2017-04-04 17:55] [ssbio.protein.structure.structprop] DEBUG: E: chain already present
[2017-04-04 17:55] [ssbio.protein.structure.structprop] DEBUG: F: chain already present
[2017-04-04 17:55] [ssbio.protein.structure.structprop] DEBUG: G: chain already present
[2017-04-04 17:55] [ssbio.protein.structure.structprop] DEBUG: H: chain already present
[2017-04-04 17:55] [ssbio.protein.structure.structprop] DEBUG: I: chain already present
[2017-04-04 17:55] [ssbio.protein.structure.structprop] DEBUG: J: chain already present
[2017

In [15]:
# Here is the alignment information, which is stored in my_sequence
for alignment in my_sequence.structure_alignments:
    print(alignment.id)
    print(alignment.annotations)

P0ABB0_5T4Q-A
{'deletions': [((512, 513), 2)], 'structure_id': '5T4Q', 'percent_gaps': 0.4, 'a_seq': 'P0ABB0', 'b_seq': '5T4Q-A', 'percent_similarity': 98.6, 'chain_id': 'A', 'percent_identity': 98.6, 'mutations': [('C', 47, 'A'), ('C', 90, 'A'), ('C', 193, 'A'), ('C', 243, 'A'), ('K', 419, 'N')], 'insertions': [], 'score': 2510.0}
P0ABB0_5T4Q-B
{'deletions': [((512, 513), 2)], 'structure_id': '5T4Q', 'percent_gaps': 0.4, 'a_seq': 'P0ABB0', 'b_seq': '5T4Q-B', 'percent_similarity': 98.4, 'chain_id': 'B', 'percent_identity': 98.4, 'mutations': [('M', 1, 'X'), ('C', 47, 'A'), ('C', 90, 'A'), ('C', 193, 'A'), ('C', 243, 'A'), ('K', 419, 'N')], 'insertions': [], 'score': 2504.0}
P0ABB0_5T4Q-C
{'deletions': [((512, 513), 2)], 'structure_id': '5T4Q', 'percent_gaps': 0.4, 'a_seq': 'P0ABB0', 'b_seq': '5T4Q-C', 'percent_similarity': 98.1, 'chain_id': 'C', 'percent_identity': 98.1, 'mutations': [('M', 1, 'X'), ('Q', 2, 'X'), ('L', 3, 'X'), ('C', 47, 'A'), ('C', 90, 'A'), ('C', 193, 'A'), ('C', 24
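
To help read these annotations, the sketch below derives the same kinds of fields from a pair of already-aligned strings. It assumes the layout seen in the output above: a mutation is `(reference residue, reference position, structure residue)` and a deletion is `((start, end), length)` in reference numbering. The layout used for insertions, `(reference position before the insert, length)`, is a guess, since the output above has none. The function and the toy sequences are hypothetical and are not ssbio code.

```python
def summarize_alignment(a_aln, b_aln):
    """Summarize a pairwise alignment; a_aln is the reference, '-' marks a gap."""
    if len(a_aln) != len(b_aln):
        raise ValueError("aligned sequences must have equal length")
    mutations, deletions, insertions = [], [], []
    identical = gaps = 0
    a_pos = 0  # 1-based position in the ungapped reference
    i, n = 0, len(a_aln)
    while i < n:
        if a_aln[i] == "-":  # residues present only in the structure
            j = i
            while j < n and a_aln[j] == "-":
                j += 1
            insertions.append((a_pos, j - i))
            gaps += j - i
            i = j
        elif b_aln[i] == "-":  # reference residues missing from the structure
            j = i
            while j < n and b_aln[j] == "-" and a_aln[j] != "-":
                j += 1
            length = j - i
            deletions.append(((a_pos + 1, a_pos + length), length))
            a_pos += length
            gaps += length
            i = j
        else:
            a_pos += 1
            if a_aln[i] == b_aln[i]:
                identical += 1
            else:
                mutations.append((a_aln[i], a_pos, b_aln[i]))
            i += 1
    return {
        "mutations": mutations,
        "deletions": deletions,
        "insertions": insertions,
        "percent_identity": round(100 * identical / n, 1),
        "percent_gaps": round(100 * gaps / n, 1),
    }


# Toy alignment: one substitution and a two-residue deletion at the end
summary = summarize_alignment("MKTAYIAKQR", "MKTCYIAK--")
print(summary)
```

Percentages here are taken over the alignment length. That is consistent with chain A above, where 5 mutations and 2 deleted positions over 513 columns give 506/513, or about 98.6% identity.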

In [16]:
# We can also see which chain represents the sequence best
my_structure.sequence_quality_checker(my_sequence, allow_deletions=True)

[2017-04-04 17:55] [ssbio.protein.structure.properties.quality] DEBUG: 0.98635477582846: percent identity
[2017-04-04 17:55] [ssbio.protein.structure.properties.quality] DEBUG: Alignment meets percent identity cutoff
[2017-04-04 17:55] [ssbio.protein.structure.properties.quality] DEBUG: No insertion regions
[2017-04-04 17:55] [ssbio.protein.structure.structprop] DEBUG: 5T4Q: chain A set as representative


<ChainProp A at 0x7fdede842550>
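
The log lines above suggest the kind of checks involved: a percent identity cutoff, a test for insertion regions, and optional tolerance of deletions. The following sketch applies those checks to the annotation values printed earlier for chains A and B. The function name, the `identity_cutoff` value, and the rule of returning the first chain that passes are all assumptions for illustration, not the documented behavior of `sequence_quality_checker`.

```python
def pick_representative(alignments, identity_cutoff=0.95, allow_deletions=True):
    """Hypothetical check: return the first chain ID whose alignment passes, else None."""
    for ann in alignments:
        if ann["percent_identity"] / 100 < identity_cutoff:
            continue  # too dissimilar to the reference sequence
        if ann["insertions"]:
            continue  # structure has residues the sequence lacks
        if ann["deletions"] and not allow_deletions:
            continue  # structure is missing part of the sequence
        return ann["chain_id"]
    return None


# Values copied from the alignment annotations printed above
alignments = [
    {"chain_id": "A", "percent_identity": 98.6,
     "insertions": [], "deletions": [((512, 513), 2)]},
    {"chain_id": "B", "percent_identity": 98.4,
     "insertions": [], "deletions": [((512, 513), 2)]},
]

print(pick_representative(alignments, allow_deletions=True))
print(pick_representative(alignments, allow_deletions=False))
```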

In [17]:
# This is how to map residue numbers in the sequence to residue numbers in the structure's mapped chains
my_structure.map_seqprop_resnums_to_mapped_chains(my_sequence, [1,2,3,4,5,6])



{'A': {1: (' ', 1, ' '),
  2: (' ', 2, ' '),
  3: (' ', 3, ' '),
  4: (' ', 4, ' '),
  5: (' ', 5, ' '),
  6: (' ', 6, ' ')},
 'B': {2: (' ', 2, ' '),
  3: (' ', 3, ' '),
  4: (' ', 4, ' '),
  5: (' ', 5, ' '),
  6: (' ', 6, ' ')},
 'C': {4: (' ', 4, ' '), 5: (' ', 5, ' '), 6: (' ', 6, ' ')}}
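
The mapping above can be understood as walking the alignment column by column while counting positions on both sides. The sketch below is a hypothetical illustration of that walk, not the ssbio implementation. It assumes that `X` in the structure sequence marks a residue without coordinates, which would explain why residue 1 is missing for chain B and residues 1 to 3 for chain C. The aligned strings are toy fragments, and the tuples follow the Biopython residue ID layout `(hetero flag, residue number, insertion code)`.

```python
def map_resnums(a_aln, b_aln, b_resnums, wanted):
    """Map reference positions to structure residue IDs through a pairwise alignment.

    a_aln, b_aln: aligned reference and structure sequences ('-' marks a gap).
    b_resnums: residue number for each ungapped character of b_aln.
    wanted: reference positions (1-based) to look up.
    """
    wanted = set(wanted)
    mapping = {}
    a_pos = 0  # 1-based position in the ungapped reference
    b_idx = 0  # index into b_resnums
    for a, b in zip(a_aln, b_aln):
        if a != "-":
            a_pos += 1
        if b != "-":
            # Assumption: 'X' marks a structure residue with no coordinates
            if a != "-" and b != "X" and a_pos in wanted:
                mapping[a_pos] = (" ", b_resnums[b_idx], " ")
            b_idx += 1
    return mapping


# Toy fragment resembling chain C above: the first three residues are unresolved
toy_mapping = map_resnums("MQLNST", "XXXNST", [1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5, 6])
print(toy_mapping)
```

Positions that land on a gap or an unresolved residue are simply left out of the result, so callers should check for a key before using it.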