## PV - the JavaScript PDB viewer

https://biasmv.github.io/pv/

- Here is how you load PV within an IPython notebook
- Steps:
    - Use Biopython to download the file
    - Get the ligand names from the file for display in the structure
    - Load the structure using PV

In [2]:
# load Biopython PDB packages

# PDBList to download PDBs
from Bio.PDB.PDBList import PDBList
pdbl = PDBList()

# PDBParser to load and work with files
from Bio.PDB.PDBParser import PDBParser
parser = PDBParser()

# uuid generates a unique id for the viewer's <div>
import uuid

In [3]:
# download pdb
pdb_file_path = pdbl.retrieve_pdb_file('2VFA')

Downloading PDB structure '2VFA'...


In [4]:
# open the downloaded file
# someprotein is just the name of this structure - can be arbitrary
structure = parser.get_structure('someprotein', pdb_file_path)



In [5]:
# get the ligands within this file for display
# from: http://stackoverflow.com/questions/25718201/remove-heteroatoms-from-pdb
ligands = []

for residue in structure.get_residues():
    tags = residue.get_full_id()
    # tags is a tuple: (structure ID, model ID, chain ID, residue ID)
    # residue ID is itself a tuple: (hetero field, residue number, insertion code)

    # The hetero field is " " for a standard residue, "W" for a water,
    # and "H_<name>" for any other hetero residue (e.g. "H_SO4")
    if tags[3][0] != " " and tags[3][0] != "W":
        ligands.append(tags[3][0].split('_')[1].strip())
    else:
        continue
        
print(ligands)

['5GP', 'SO4', '5GP', 'SO4']
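
The list contains each ligand once per chain. PV's `rnames` selection only needs each name once, so a small sketch like the following could collapse the duplicates. The function name `unique_ligand_names` and the sample hetero fields are illustrative assumptions, not part of Biopython or PV; the `H_<name>` layout is the same one the loop above relies on.

```python
def unique_ligand_names(hetero_fields):
    """Return sorted, de-duplicated ligand names from Biopython-style hetero fields.

    hetero_fields: iterable of strings such as " " (standard residue),
    "W" (water) or "H_SO4" (other hetero residue).
    """
    names = set()
    for field in hetero_fields:
        # skip standard residues and waters, as in the loop above
        if field in (" ", "W"):
            continue
        names.add(field.split("_")[1].strip())
    return sorted(names)


# hypothetical hetero fields resembling those found in 2VFA
sample = [" ", "H_5GP", "H_SO4", "W", "H_5GP", "H_SO4"]
print(unique_ligand_names(sample))
```

In the notebook, the input could be collected with `[res.id[0] for res in structure.get_residues()]`.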


In [6]:
class PDBViewer(object):
    '''
    Contributed by: Ali Ebrahim
    '''
    
    def __init__(self, f):
        self.pdb = open(f).read()

    def _repr_html_(self):
        div_id = str(uuid.uuid4())
        
        return """<div id="%s" style="width: 800px; height: 600px"></div>
        <!--script src="//biasmv.github.io/pv/js/pv.min.js"></script-->
        <script>
        require.config({paths: {"pv": "//biasmv.github.io/pv/js/pv.min"}});
        require(["pv"], function (pv) {
            pdb = "%s";
            structure = pv.io.pdb(pdb);
            viewer = pv.Viewer(document.getElementById('%s'),
                               {quality : 'medium', width: 'auto', height : 'auto',
                                antialias : true, outline : true});
            viewer.fitParent();
            var ligand = structure.select({rnames : %s});
            viewer.ballsAndSticks('ligand', ligand);
            viewer.cartoon('molecule', structure);
            viewer.centerOn(structure);
            
        });
        </script>
        """ % (div_id, self.pdb.replace("\n", "\\n"), div_id, ligands)

In [7]:
PDBViewer(pdb_file_path)
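
The template above embeds the PDB text by escaping newlines only, so a double quote or backslash in the file would break the JavaScript string. One hedged alternative is to let `json.dumps` produce the literals. The helper name `to_js_literal` and the sample text are hypothetical, and this sketch does not guard against a literal `</script>` inside the data.

```python
import json


def to_js_literal(value):
    """Serialize a Python str or list into a JavaScript-compatible literal."""
    return json.dumps(value)


# made-up PDB-like text containing characters that need escaping
pdb_text = 'REMARK "quoted" text\nATOM      1  N   ASP A  15\n'
ligand_names = ["5GP", "SO4"]

js_pdb = to_js_literal(pdb_text)          # quotes and newlines are escaped
js_ligands = to_js_literal(ligand_names)  # becomes a JS array literal

# these fragments could be substituted into the <script> template as
#   pdb = %s;   and   {rnames : %s}
# with no surrounding quotes, because json.dumps already adds them
print(js_pdb)
print(js_ligands)
```

Passing the ligand names in explicitly like this would also remove the class's dependence on the global `ligands` variable.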

### Getting the PDB structure sequence

#### Method 1: use built-in Polypeptide builder function

`PPBuilder` builds polypeptides from the atom-atom distances in the file, so it starts a new polypeptide wherever the chain is interrupted. Calling `get_sequence()` on each polypeptide returns its sequence.

More info here: http://biopython.org/DIST/docs/api/Bio.PDB.Polypeptide-module.html

Pros:
- easy to use
- the `aa_only` flag controls whether non-standard amino acids are accepted (`aa_only=True` restricts the build to standard ones)

Cons:
- missing parts of the structure split up the returned sequence
- missing parts at the beginning (or end) are ignored

In [8]:
from Bio.PDB.Polypeptide import PPBuilder
ppb=PPBuilder()
for pp in ppb.build_peptides(structure, aa_only=True):
    print('polypeptide:')
    print(pp.get_sequence())

polypeptide:
DPVFVKDDDGYDLDSFMIPAHYKKYLTKVLVPNGVIKNRIEKLARDVMKEMGGHHIVALCVLKGGYKFFADLLDYIKALNRNSDRSIPMTVDFIR
polypeptide:
KVIGGDDLSTLTGKNVLIVEDIIDTGKTMQTLLSLVRQYNPKMVKVACLFIKRTPLWNGFKADFVGFSIPDHFVVGYSLDYNEIFRDLDHCCLVNDEGKKK
polypeptide:
AFDPVFVKDDDGYDLDSFMIPAHYKKYLTKVLVPNGVIKNRIEKLARDVMKEMGGHHIVALCVLKGGYKFFADLLDYIKALNRNSDRSIPMTVDFIRLK
polypeptide:
DIKVIGGDDLSTLTGKNVLIVEDIIDTGKTMQTLLSLVRQYNPKMVKVACLFIKRTPLWNGFKADFVGFSIPDHFVVGYSLDYNEIFRDLDHCCLVNDEGKKKYKA


#### Problem:

The 3D structure may not have the whole sequence resolved, and the method above does not show where residues are missing. For example:

If we have a sequence:
  
    MGASGDLAKKKIYPTIWWLFRDGLLPENTFIV

and the structure is missing residues 3-10, the method above splits them into two "polypeptides":

    MG
    KIYPTIWWLFRDGLLPENTFIV
    
This makes alignments confusing later on.

#### Solution / Method 2:

The next piece of code will return this instead, with X's in the "missing" residues:

    MGXXXXXXXXKIYPTIWWLFRDGLLPENTFIV
    
It builds each sequence by simply iterating over the residues of a chain, and returns a dictionary of the sequences with the chain IDs as keys.

Pros:
- also fills in for "missing" residues at the beginning of the sequence

Cons:
- does not currently handle non-standard residues
    - they are filled in with an X
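
The gap-filling idea can be sketched without Biopython on the toy sequence from the example. This is only an illustration: `fill_gaps` is a hypothetical helper that assumes residues arrive as `(residue_number, one_letter_code)` pairs sorted by number, and it ignores insertion codes.

```python
def fill_gaps(resolved, gap_char="X"):
    """Join resolved residues into one string, padding numbering gaps.

    resolved: list of (residue_number, one_letter_code), sorted by number.
    """
    pieces = []
    previous = 0  # number of the last residue added
    for number, letter in resolved:
        # a jump in numbering means (number - previous - 1) residues are missing
        pieces.append(gap_char * (number - previous - 1))
        pieces.append(letter)
        previous = number
    return "".join(pieces)


full = "MGASGDLAKKKIYPTIWWLFRDGLLPENTFIV"
# pretend residues 3-10 were not resolved in the structure
resolved = [(i, aa) for i, aa in enumerate(full, start=1) if not 3 <= i <= 10]
print(fill_gaps(resolved))
```

Because `previous` starts at 0, residues missing at the start of the chain are padded as well.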

In [9]:
# only Polypeptide is used by the functions below
# (Bio.Alphabet is no longer available in recent Biopython releases)
from Bio.PDB import Polypeptide

In [10]:
def get_pdb_seq(structure):
    '''
    Takes in a Biopython structure object and returns the sequence of each chain
    :param structure: Biopython structure object
    :return: Dictionary of sequence strings with chain IDs as the key
    '''
    
    structure_seqs = {}
    
    # loop over each chain of the PDB
    for chain in structure[0]:
        
        chain_seq = ''
        tracker = 0  # number of the last residue added to chain_seq
        
        # loop over the residues
        for res in chain.get_residues():
            # NOTE: you can get the residue number too
            res_num = res.id[1]
            
            # only standard residues are added here; a non-standard residue
            # (e.g. selenomethionine) is skipped, so it shows up as an X
            # when the next standard residue is processed
            if Polypeptide.is_aa(res, standard=True):
                full_id = res.get_full_id()
                end_tracker = full_id[3][1]
                i_code = full_id[3][2]
                aa = Polypeptide.three_to_one(res.get_resname())
                
                # tracker to fill in X's
                if end_tracker != (tracker + 1):
                    if i_code != ' ':
                        chain_seq += aa
                        tracker = end_tracker + 1
                        continue
                    else:
                        chain_seq += 'X'*(end_tracker - tracker - 1)
                        
                chain_seq += aa
                tracker = end_tracker
                
            else:
                continue

        structure_seqs[chain.get_id()] = chain_seq

    return structure_seqs

In [11]:
get_pdb_seq(structure)

{'A': 'XXXXXXXXXXXXXXDPVFVKDDDGYDLDSFMIPAHYKKYLTKVLVPNGVIKNRIEKLARDVMKEMGGHHIVALCVLKGGYKFFADLLDYIKALNRNSDRSIPMTVDFIRXXXXXXXXXXXXXKVIGGDDLSTLTGKNVLIVEDIIDTGKTMQTLLSLVRQYNPKMVKVACLFIKRTPLWNGFKADFVGFSIPDHFVVGYSLDYNEIFRDLDHCCLVNDEGKKK',
 'B': 'XXXXXXXXXXXXAFDPVFVKDDDGYDLDSFMIPAHYKKYLTKVLVPNGVIKNRIEKLARDVMKEMGGHHIVALCVLKGGYKFFADLLDYIKALNRNSDRSIPMTVDFIRLKXXXXXXXXXDIKVIGGDDLSTLTGKNVLIVEDIIDTGKTMQTLLSLVRQYNPKMVKVACLFIKRTPLWNGFKADFVGFSIPDHFVVGYSLDYNEIFRDLDHCCLVNDEGKKKYKA'}

### Getting a PDB residue ID for PV viewer input

Once you have the structure's sequence from the code above, you can align it to your sequence of interest, look up the residue number of an aligned position, and use the PV code to view that residue. You need the structure's own residue number because the numbering in the structure does not always correspond to the numbering of your sequence.

I think the best way to do this is to keep track of the numbering while building the structure's sequence (see "NOTE: you can get the residue number too"), so that after aligning you can backtrack to the original residue number in the structure.

Here's a quick modification of the above code that saves the sequence as a dictionary, with a list of tuples of residues and their sequence numbers for each chain, like this:

    {'A': [('D', 15),
           ('P', 16),
           ('V', 17),
           ('F', 18),
           ('V', 19),
           ('K', 20),
           ...
          ] 
     }
     
You could probably think of a better way to do it, but this works for now!

In [12]:
def get_pdb_seq2(structure):
    '''
    Takes in a Biopython structure object and returns each chain's sequence with residue numbers
    :param structure: Biopython structure object
    :return: Dictionary of lists of (one-letter residue, residue number) tuples with chain IDs as the key
    '''
    
    structure_seqs = {}
    
    # loop over each chain of the PDB
    for chain in structure[0]:
        
        chain_seq = []
        tracker = 0  # number of the last residue added to chain_seq
        
        # loop over the residues
        for res in chain.get_residues():
            # NOTE: you can get the residue number too
            res_num = res.id[1]
            
            # only standard residues are added here; a non-standard residue
            # (e.g. selenomethionine) is skipped, so it shows up as an X
            # when the next standard residue is processed
            # TODO: except when it's at the beginning or end...
            if Polypeptide.is_aa(res, standard=True):
                full_id = res.get_full_id()
                end_tracker = full_id[3][1]
                i_code = full_id[3][2]
                aa = Polypeptide.three_to_one(res.get_resname())
                
                # tracker to fill in X's
                if end_tracker != (tracker + 1):
                    if i_code != ' ':
                        chain_seq.append((aa,end_tracker))
                        tracker = end_tracker + 1
                        continue
                    else:
                        # give each missing position its own residue number
                        for missing_num in range(tracker + 1, end_tracker):
                            chain_seq.append(('X', missing_num))
                        
                chain_seq.append((aa,end_tracker))
                tracker = end_tracker
                
            else:
                continue

        structure_seqs[chain.get_id()] = chain_seq

    return structure_seqs

In [13]:
my_structure_sequence = get_pdb_seq2(structure)
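
To backtrack from an alignment to residue numbers, one possible approach is to walk the two aligned strings column by column. The sketch below assumes you already have two gapped strings of equal length (with `-` as the gap character) from whatever aligner you use. The function name `map_query_to_resnum` and the toy data are hypothetical.

```python
def map_query_to_resnum(aligned_query, aligned_struct, struct_residues):
    """Map 1-based query positions to structure residue numbers.

    aligned_query / aligned_struct: gapped strings of equal length.
    struct_residues: list of (one_letter, residue_number) tuples, in the
    same order as the ungapped structure sequence.
    """
    mapping = {}
    q_pos = 0  # 1-based position in the ungapped query
    s_idx = 0  # index into struct_residues
    for q_char, s_char in zip(aligned_query, aligned_struct):
        if q_char != "-":
            q_pos += 1
        if s_char != "-":
            aa, resnum = struct_residues[s_idx]
            s_idx += 1
            # only keep columns where both sides have a resolved residue
            if q_char != "-" and aa != "X":
                mapping[q_pos] = resnum
    return mapping


# toy data: two unresolved positions followed by residues 15-17
struct_residues = [("X", 13), ("X", 14), ("D", 15), ("P", 16), ("V", 17)]
mapping = map_query_to_resnum("MAGDPV", "-XXDPV", struct_residues)
print(mapping)
```

Positions that align to a gap or to an unresolved `X` are left out of the mapping, so a missing key means that residue cannot be shown in the structure.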

#### PV input
It looks like PV takes an ID of this form as input for an atom:

    'A.31.CA'

This means:

    'CHAIN_ID.RESIDUE_NUM.ATOM_NAME'
    
So, we can use Biopython to get this information.

For more information on the Biopython "structure" object, see here: http://biopython.org/wiki/The_Biopython_Structural_Bioinformatics_FAQ#The_Structure_object

In [14]:
from Bio.PDB import Selection

In [15]:
# let's say after aligning, this is the residue that matches the structure
my_structure_sequence['A'][26]

('L', 27)

In [16]:
# so we want to look at residue number 27
my_mutation_resnum = my_structure_sequence['A'][26][1]
print(my_mutation_resnum)

27


In [17]:
# let's get the info from the structure
my_mutation_residue = structure[0]['A'][my_mutation_resnum]
print(my_mutation_residue)

<Residue LEU het=  resseq=27 icode= >


In [18]:
# we can use the Selection class to select all atoms of this residue
# 'A' here stands for ATOM (http://biopython.org/DIST/docs/api/Bio.PDB.Selection-module.html)
atom_list = Selection.unfold_entities(my_mutation_residue, 'A')
atom_list

[<Atom N>,
 <Atom CA>,
 <Atom C>,
 <Atom O>,
 <Atom CB>,
 <Atom CG>,
 <Atom CD1>,
 <Atom CD2>]

In [19]:
# then you can format this information for PV:
for a in atom_list:
    print('{}.{}.{}'.format('A', my_mutation_resnum, a.id))

A.27.N
A.27.CA
A.27.C
A.27.O
A.27.CB
A.27.CG
A.27.CD1
A.27.CD2


Obviously, you need to fill in the chain ID and residue number from the information you get from the sequence alignment, but this should get you started!
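
A small helper can make the chain ID and residue number explicit parameters. This is a minimal sketch: `pv_atom_ids` is a hypothetical name, and the atom names are the ones printed for the leucine above.

```python
def pv_atom_ids(chain_id, residue_number, atom_names):
    """Build 'CHAIN_ID.RESIDUE_NUM.ATOM_NAME' strings for a residue's atoms."""
    return ["{}.{}.{}".format(chain_id, residue_number, name) for name in atom_names]


# atom names as they would come from [a.id for a in atom_list]
leu_atoms = ["N", "CA", "C", "O", "CB", "CG", "CD1", "CD2"]
for atom_id in pv_atom_ids("A", 27, leu_atoms):
    print(atom_id)
```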