# Procheck analysis of an NMR ensemble

This notebook shows how to use Python to automate a Procheck analysis of each model in the PDB file of an NMR ensemble, in this case entry 5sxy.

The notebook makes use of three Python packages:
 - [collections](https://docs.python.org/2/library/collections.html): part of the standard Python library; provides `OrderedDict`, a dictionary that remembers the order in which keys were inserted.
 - [mdtraj](http://www.mdtraj.org): a package for reading and analysing molecular dynamics trajectories.
 - [xbowflow](https://github.com/ChrisSuess/Project-Xbow/wiki/An-Introduction-to-Xbowflow-Workflows): a library for making command line tools work like Python functions.
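
To give a feel for what "making a command line tool work like a Python function" involves, here is a minimal sketch that uses only the standard library. It is not how xbowflow is implemented: the function `run_with_files` and the stand-in tool are invented for this illustration.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def run_with_files(command, input_name, input_text, output_name):
    """Run `command` in a scratch directory and return the text of one output file."""
    with tempfile.TemporaryDirectory() as workdir:
        # Write the input where the tool expects to find it
        Path(workdir, input_name).write_text(input_text)
        # Run the tool inside the scratch directory; raise if it fails
        subprocess.run(command, cwd=workdir, check=True)
        # Hand back the contents of the one output file we care about
        return Path(workdir, output_name).read_text()

# Stand-in for a real tool (hypothetical): copies x.pdb to x.out in upper case
fake_tool = [sys.executable, '-c',
             "open('x.out', 'w').write(open('x.pdb').read().upper())"]
result = run_with_files(fake_tool, 'x.pdb', 'atom line\n', 'x.out')
print(result)
```

The caller passes data in and gets data back, and never sees the temporary files in between.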

## Prerequisites:

You need the Python packages `mdtraj` and `xbowflow`, plus `Procheck` itself.

MDTraj and xbowflow can be installed using `pip`:
```
% pip install mdtraj
% pip install xbowflow
```

But `Procheck` is not a Python package and can be a bit more effort to install if you don't already have it. If you are lucky enough to be looking at this notebook on your own laptop, or some other computer that has [Docker](http://docker.com) installed, then you can use `pinda` to install `Procheck`:
```
% pip install pinda
% pinda install procheck 3.5.4
```

You also need a copy of the PDB file `5sxy.pdb` in the directory you launch this notebook from. If it's not already here, download it from the [PDB website](http://www.rcsb.org/structure/5sxy).
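
If you would rather fetch the file from Python, the sketch below is one way to do it. It assumes network access and the RCSB file server URL pattern `https://files.rcsb.org/download/<id>.pdb`. The helper names `pdb_url` and `fetch_pdb` are made up for this example and are not part of any package used in this notebook.

```python
import os
import urllib.request

def pdb_url(pdb_id):
    """Build the download URL for a PDB entry (assumed RCSB URL pattern)."""
    return 'https://files.rcsb.org/download/{}.pdb'.format(pdb_id.lower())

def fetch_pdb(pdb_id, directory='.'):
    """Download <pdb_id>.pdb into `directory` unless it is already there."""
    target = os.path.join(directory, '{}.pdb'.format(pdb_id.lower()))
    if not os.path.exists(target):
        urllib.request.urlretrieve(pdb_url(pdb_id), target)
    return target

# fetch_pdb('5sxy')  # uncomment to download 5sxy.pdb into the current directory
```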

-----

In this first cell we check that we have everything we need to run the notebook:

In [None]:
OK = True
import collections  # standard library; used later by get_rama
import subprocess
try:
    import mdtraj as mdt
except ImportError:
    print('Error: you do not seem to have the MDTraj Python package installed')
    OK = False
try:
    from xbowflow import xflowlib
except ImportError:
    print('Error: you do not seem to have the xbowflow Python package installed')
    OK = False
result = subprocess.call('which procheck', shell=True)
if result != 0:
    print('Error: you do not seem to have Procheck installed')
    OK = False
if not OK:
    print('This notebook will not work until you fix these issues')
else:
    print('Success: you seem to have all the packages installed that are needed.')
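
The `which` call above relies on a Unix shell. A portable alternative, sketched here, is `shutil.which` from the standard library, which returns the path of an executable or `None` if it cannot be found. The helper name `missing_tools` is made up for this example.

```python
import shutil

def missing_tools(names):
    """Return the subset of `names` that cannot be found on the PATH."""
    return [name for name in names if shutil.which(name) is None]

missing = missing_tools(['procheck'])
if missing:
    print('Error: not found on PATH: {}'.format(', '.join(missing)))
else:
    print('Procheck found')
```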

In this cell we create the functions needed for our analysis workflow:

In [None]:
# Here we create a Python function that runs Procheck. It will take a PDB file (or more 
# accurately, something that can be turned into a PDB file) as its argument, and return the
# output (*.out) file created by Procheck (or more accurately, the contents of that file). By the 
# way: if you run Procheck from the command line you will see that it produces a whole load 
# of output files, but this function ignores all of the other ones.
#
#    output = procheck.run(pdbdata)
#
procheck = xflowlib.SubprocessKernel('procheck x.pdb 2.0')
procheck.set_inputs(['x.pdb'])
procheck.set_outputs(['x.out'])

# Two functions to parse the output file from running Procheck.
#
# The first returns the Ramachandran map assignment for each residue, which is a one or 
# two letter code. Single capital letters refer to the most favoured regions of the map,
# single lower case letters to the allowed regions, two letter symbols that start '~' 
# to the generously allowed regions, and 'XX' to disallowed regions. Gly and Pro 
# residues have a '-' symbol.
#
# The second returns a list of the bad contacts identified in the Procheck output file.

def get_rama(out):
    '''
    Parse a Procheck output file and extract Ramachandran map assignments.
    '''
    results = collections.OrderedDict() # A dictionary that remembers insertion order
    with open(out) as f:
        lines = f.readlines()
    check = False # When check becomes True, we have reached the relevant section of the file
    for line in lines:
        if 'Full print-out' in line:
            check = True
        if check:
            if line[:4] == 'Mean':
                check = False # We have reached the end of the relevant section
            # Test positions 2:5 and 22:24 of this line to see if it's a Ramachandran assignment:
            elif line[2:5] != '   ':
                rama = line[22:24]
                if rama in ['A ','a ','~a','B ','b ','~b','L ','l ','~l', 'p ', '~p', '- ','XX']:
                    resid = line[2:11].strip()
                    results[resid] = rama
    return results

def get_bad_contacts(out):
    '''
    Parse a Procheck output file and extract bad contacts.
    '''
    results = []
    with open(out) as f:
        lines = f.readlines()
    check = False # When check becomes True, we have reached the relevant section of the file
    for line in lines:
        if 'C  O  N  T  A' in line:
            check = True
        if check:
            if 'R A M A' in line:
                check = False # We have reached the end of the relevant section
            elif '-->' in line: # This line contains the definition of a bad contact
                results.append(line[30:90])
    return results
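
To make the fixed-column tests in `get_rama` concrete, here is a small sketch that applies the same slices to synthetic lines. The lines are built to fit the column positions the parser assumes and are not real Procheck output. The residue labels and the helpers `make_line` and `rama_from_lines` are hypothetical, and the section start and end checks are left out for brevity.

```python
import collections

RAMA_CODES = {'A ', 'a ', '~a', 'B ', 'b ', '~b', 'L ', 'l ', '~l',
              'p ', '~p', '- ', 'XX'}

def make_line(resid, code):
    """Build a synthetic line: residue label in columns 2-10, code in columns 22-23."""
    return '  {:<9}{:11}{:<2}'.format(resid, '', code)

def rama_from_lines(lines):
    """Apply the same column tests as get_rama to an in-memory list of lines."""
    results = collections.OrderedDict()
    for line in lines:
        # Columns 2:5 must be non-blank and columns 22:24 must hold a known code
        if line[2:5] != '   ' and line[22:24] in RAMA_CODES:
            results[line[2:11].strip()] = line[22:24]
    return results

lines = [make_line('ALA A  12', 'A '),
         make_line('SER A  13', '~b'),
         make_line('GLY A  14', '- ')]
rama = rama_from_lines(lines)
for resid, code in rama.items():
    print(repr(resid), repr(code))
```

Because the parser depends on exact column positions, it is worth checking it against a real output file if your Procheck version formats its print-out differently.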

Now we run the analysis. The PDB file is read in using mdtraj. Each model in the file is treated like a separate frame in an MD trajectory. Each frame is analysed by Procheck, the output file is parsed by the two analysis functions to extract the key data, and then the results are printed out.

In [None]:
t = mdt.load('5sxy.pdb')

for i in range(len(t)):
    pro_out = procheck.run(t[i]).as_file() # as_file() means we get the file name, not file contents
    rama = get_rama(pro_out)
    bad_rama = ', '.join([k for k in rama if rama[k] in ['~a', '~b', '~l', '~p', 'XX']])
    if len(bad_rama) == 0:
        bad_rama = '(none)'
    bad_contacts = get_bad_contacts(pro_out)
    print('\nModel {:2d}'.format(i + 1))
    print('  Bad residues in Ramachandran map:')
    print('    {}'.format(bad_rama))
    print('  Bad contacts:')
    for bc in bad_contacts:
        print('    {}'.format(bc))
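
A per-model listing can be hard to scan for a large ensemble. One possible follow-up, sketched here with made-up data, counts how many models flag each residue. The `per_model` list stands in for the `get_rama` results you would collect inside the loop above. The residue labels are hypothetical and `count_flagged` is a name invented for this example.

```python
import collections

BAD_CODES = {'~a', '~b', '~l', '~p', 'XX'}

def count_flagged(per_model):
    """Count, for each residue, the number of models in which it has a bad code."""
    counts = collections.Counter()
    for rama in per_model:
        counts.update(resid for resid, code in rama.items() if code in BAD_CODES)
    return counts

# Made-up results for two models (not taken from 5sxy)
per_model = [
    {'SER 13': '~b', 'ALA 12': 'A '},
    {'SER 13': 'XX', 'ALA 12': 'A ', 'LYS 20': '~a'},
]
for resid, n in count_flagged(per_model).most_common():
    print('{}: flagged in {} of {} models'.format(resid, n, len(per_model)))
```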