# Building a system (globular protein and ligand)

by Stefan Doerr

This tutorial shows how to automatically build a system using HTMD. In this example, a system consisting of a globular protein (trypsin) and its inhibitor (benzamidine) is prepared for ligand binding simulations with ACEMD (Buch et al. 2011 PNAS 108(25) 10184-89).

## Working files

To generate a system ready for simulation, HTMD needs the topology, parameters and coordinates (e.g. a PDB file or the PDB code of the protein) of every component in the simulation.

Download all tutorial files from the following [link](http://docs.htmd.org/download/building_files.zip). 

## Getting started

First, we import the modules needed for the tutorial:

In [1]:
from htmd import *
import numpy as np
import os
%pylab inline
matplotlib.rcParams.update({'font.size': 12})

You are on the latest HTMD version (unpackaged).
Populating the interactive namespace from numpy and matplotlib


`%matplotlib` prevents importing * from pylab and numpy


## Clean and split PDB

In this example, the structure is retrieved from the PDB database by its code (3PTB), and the heavy atoms of the chain A protein are saved to `/tmp/protein.pdb`.

In [2]:
prot = Molecule('3PTB')
prot.write('/tmp/protein.pdb', sel='chain A and protein and noh')

Keep only the protein, water and calcium atoms of chain A:

In [3]:
prot.filter('chain A and (protein or water or resname CA)')

## Define segments

The crystallographic water molecules and the calcium ion kept from the PDB entry are each assigned their own segment, separate from the protein.

In [4]:
prot.set('segid', 'P', sel='protein and noh')
prot.set('segid', 'W', sel='water')
prot.set('segid', 'CA', sel='resname CA')

## Center the system to the origin

In [5]:
prot.center()

Then calculate the radius of the protein, taken here as the largest distance of any atom from the origin:

In [6]:
from htmd.molecule.util import maxDistance, uniformRandomRotation
D = maxDistance(prot, 'all')
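
The centering and radius steps can be illustrated without HTMD. The following is a minimal sketch in plain Python, assuming that centering moves the geometric center to the origin and that the radius is the largest atom-to-origin distance. The helper names and the three coordinates are made up for illustration and are not part of HTMD.

```python
import math

def geometric_center(coords):
    """Unweighted mean position of a list of (x, y, z) tuples."""
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))

def center_on_origin(coords):
    """Translate the coordinates so their geometric center sits at the origin."""
    cx, cy, cz = geometric_center(coords)
    return [(x - cx, y - cy, z - cz) for x, y, z in coords]

def max_distance(coords, origin=(0.0, 0.0, 0.0)):
    """Largest distance of any point from the origin."""
    return max(math.dist(c, origin) for c in coords)

# Three made-up atom positions in Angstrom (illustrative only)
coords = [(1.0, 2.0, 3.0), (4.0, 2.0, 3.0), (1.0, 6.0, 3.0)]
centered = center_on_origin(coords)
D = max_distance(centered)
print(f"radius after centering: {D:.3f} A")
```

Centering first matters: the distance is measured from the origin, so it only describes the size of the molecule once the molecule sits there.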

## Load the ligand and merge molecules

Load the ligand, move its geometric center to the origin, and rotate it randomly about that center:

In [7]:
tutfiles = home() + '/data/building-protein-ligand/'
ligand = Molecule(tutfiles + 'benzamidine.pdb')
ligand.center()
ligand.rotateBy(uniformRandomRotation())

Now place the ligand at a random position around the protein, 10 Å beyond the distance computed above:

In [8]:
ligand.moveBy([D+10, 0, 0])  # Move the ligand 10 Angstrom away from the furthest protein atom in X dimension
ligand.rotateBy(uniformRandomRotation())
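
Why a translation followed by a rotation gives a random placement can be seen in a small standalone sketch. It assumes the second rotation is applied about the origin, so the ligand ends up at a random point on a sphere of radius `D + 10`. The rotation matrix here is built from a uniformly sampled unit quaternion, which is not necessarily how `uniformRandomRotation` works internally. The function names and the value of `D` are hypothetical.

```python
import math
import random

def random_rotation_matrix(rng):
    """3x3 rotation matrix from a uniformly sampled unit quaternion."""
    u1, u2, u3 = rng.random(), rng.random(), rng.random()
    a, b = math.sqrt(1.0 - u1), math.sqrt(u1)
    x = a * math.sin(2 * math.pi * u2)
    y = a * math.cos(2 * math.pi * u2)
    z = b * math.sin(2 * math.pi * u3)
    w = b * math.cos(2 * math.pi * u3)
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]

def rotate(point, matrix):
    """Rotate a point about the origin."""
    return tuple(sum(matrix[i][j] * point[j] for j in range(3)) for i in range(3))

D = 25.0                       # hypothetical protein radius in Angstrom
start = (D + 10.0, 0.0, 0.0)   # ligand center after the move along X
placed = rotate(start, random_rotation_matrix(random.Random(0)))
radius = math.sqrt(sum(c * c for c in placed))
print(placed, radius)
```

A rotation about the origin changes the direction of the ligand position but not its distance from the protein center, so `radius` stays at `D + 10`.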

Set resname and segid of the ligand

In [9]:
ligand.set('resname','MOL')
ligand.set('segid','L')

Merge the protein and the ligand into a single molecule:

In [10]:
all = Molecule()
all.append(prot)
all.append(ligand)

## Solvate the system

Define the size of the solvation box, extending 15 Å beyond the protein radius in every direction, and solvate the system:

In [11]:
D = D + 15
allsol = solvate(all, minmax=[[-D, -D, -D], [D, D, D]])

2015-11-23 18:44:22,004 - htmd.builder.solvate - INFO - Using water pdb file at: /shared/sdoerr/Work/pyHTMD/htmd/builder/wat.pdb
2015-11-23 18:44:22,626 - htmd.builder.solvate - INFO - Replicating 8 water segments, 2 by 2 by 2
Solvating: 100% (8/8) [############################################] eta 00:01 /
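
The box arithmetic can be checked by hand. This is a rough sketch with a hypothetical value of `D`, using the bulk number density of liquid water (about 0.0334 molecules per cubic Angstrom) to get an upper estimate of the water count. The real count is lower because the solute displaces water.

```python
D = 25.0    # hypothetical protein radius in Angstrom
PAD = 15.0  # padding used in the tutorial

half = D + PAD
minmax = [[-half, -half, -half], [half, half, half]]

edge = minmax[1][0] - minmax[0][0]  # box edge length in Angstrom
volume = edge ** 3                  # box volume in cubic Angstrom

WATER_PER_A3 = 0.0334               # bulk water number density
approx_waters = round(volume * WATER_PER_A3)

print(f"edge = {edge} A, volume = {volume} A^3, at most ~{approx_waters} waters")
```

Because the padding (15 Å) is larger than the ligand offset (10 Å), the ligand center placed at `D + 10` from the origin falls inside the box.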


## Building the system for CHARMM

Check for the available CHARMM parameter and topology files:

In [12]:
charmm.listFiles()

---- Topologies files list: /shared/sdoerr/Work/pyHTMD/htmd/builder/charmmfiles/top/ ----
top/top_all22star_prot.rtf
top/top_all36_carb.rtf
top/top_all36_lipid.rtf
top/top_all36_prot.rtf
top/top_water_ions.rtf
top/top_all36_cgenff.rtf
top/top_all36_na.rtf
---- Parameters files list: /shared/sdoerr/Work/pyHTMD/htmd/builder/charmmfiles/par/ ----
par/par_all22star_prot.prm
par/par_all36_carb.prm
par/par_all36_lipid.prm
par/par_all36_prot.prm
par/par_all36_cgenff.prm
par/par_all36_na.prm
par/par_water_ions.prm


Indicate the location of the CHARMM topology and parameter files as well as your own custom parameter and topology files. The CHARMM files can be included without their full path, using just the name shown by the previous list command. Then we build the system for CHARMM.

In [13]:
topos  = ['top/top_all22star_prot.rtf', 'top/top_water_ions.rtf', tutfiles + 'benzamidine.rtf']
params = ['par/par_all22star_prot.prm', 'par/par_water_ions.prm', tutfiles + 'benzamidine.prm']

molbuilt = charmm.build(allsol, topo=topos, param=params, outdir='/tmp/build', saltconc=0.15)

2015-11-23 18:44:40,576 - htmd.builder.charmm - INFO - Writing out segments.
Bond between A: [serial 48 resid 22 resname CYS chain A segid P]
             B: [serial 1007 resid 157 resname CYS chain A segid P]

Bond between A: [serial 185 resid 42 resname CYS chain A segid P]
             B: [serial 298 resid 58 resname CYS chain A segid P]

Bond between A: [serial 811 resid 128 resname CYS chain A segid P]
             B: [serial 1521 resid 232 resname CYS chain A segid P]

Bond between A: [serial 853 resid 136 resname CYS chain A segid P]
             B: [serial 1327 resid 201 resname CYS chain A segid P]

Bond between A: [serial 1084 resid 168 resname CYS chain A segid P]
             B: [serial 1190 resid 182 resname CYS chain A segid P]

Bond between A: [serial 1265 resid 191 resname CYS chain A segid P]
             B: [serial 1422 resid 220 resname CYS chain A segid P]

2015-11-23 18:45:12,563 - htmd.builder.builder - INFO - 6 disulfide bonds were added
2015-11-23 18:45:12,759 -

Note regarding ions: by default, the build command only tries to neutralize the system. To add a specific salt concentration, use the `saltconc` option. The previous command used a 150 mM NaCl concentration.
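
To get a feel for how many ions a given concentration implies, a common back-of-the-envelope estimate scales the number of water molecules by the ratio of the salt molarity to the molarity of pure water (about 55.5 M). The sketch below uses that estimate for monovalent ions. It is not HTMD's ion placement code, and the water count and net charge are made-up inputs.

```python
WATER_MOLARITY = 55.5  # mol/L of pure water, approximate

def ions_for_salt(n_waters, net_charge, saltconc_molar):
    """Estimate monovalent cation/anion counts: neutralize, then add salt pairs."""
    pairs = round(saltconc_molar * n_waters / WATER_MOLARITY)
    n_cations = pairs + max(-net_charge, 0)  # extra cations for a negative solute
    n_anions = pairs + max(net_charge, 0)    # extra anions for a positive solute
    return n_cations, n_anions

# Hypothetical system: 17000 waters, solute net charge +8, 0.15 M salt
n_na, n_cl = ions_for_salt(n_waters=17000, net_charge=8, saltconc_molar=0.15)
print(n_na, n_cl)
```

With `saltconc_molar=0`, only the neutralizing ions remain, which mirrors the default behaviour described above.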

Visualize the built system

In [14]:
molbuilt.view(sel='water',style='Lines',hold=True)
molbuilt.view(sel='resname MOL',style='Licorice',hold=True)
molbuilt.view(sel='ions',style='VDW',hold=True)
molbuilt.view(sel='protein',style='NewCartoon',color='Secondary Structure')

## Building the system for AMBER

Check for available AMBER forcefield files

In [15]:
amber.listFiles()

---- Forcefield files list: /shared/lab/software/AmberTools14/amber14/dat/leap/cmd/ ----
leaprc.phosaa10
leaprc.GLYCAM_06j-1
leaprc.ff14ipq
leaprc.lipid11
leaprc.gaff
leaprc.lipid14
leaprc.modrna08
leaprc.GLYCAM_06EPb
leaprc.ff12SB
leaprc.ff03.r1
leaprc.constph
leaprc.ff14SBonlysc
leaprc.ff03ua
leaprc.ffAM1
leaprc.ffPM3
leaprc.ff14SB


Indicate the desired forcefield files and build the system for AMBER

In [16]:
ffs = ['leaprc.lipid14', 'leaprc.ff14SB', 'leaprc.gaff']  # Missing the parameters for Benzamidine in AMBER

#molbuilt = amber.build(allsol, ff=ffs, outdir='/tmp/build', saltconc=0.15)

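Visualize the built system. Because the AMBER build above is commented out, `molbuilt` still refers to the CHARMM-built system.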

In [17]:
molbuilt.view()

## Before building your system (preliminary considerations)

The PDB format is very old. In an effort to handle its legacy shortcomings, several versions have been made over the years; they are not all readily interchangeable, and not all software handles each version perfectly. The most important things to watch out for are:

* Columns: the PDB format has very rigid rules about what values can go in each position. It is not a space-, tab- or comma-delimited format; each field occupies a fixed range of columns.
* Size limits: the PDB format as originally designed cannot handle more than 9,999 resids or 99,999 atoms (due to the column format issue). Several workarounds have been devised, such as using hexadecimal numbers or other compact number formats. VMD has no trouble saving more atoms/residues.
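
The fixed-column point is easiest to see in code. The sketch below writes one `ATOM` record using the standard column layout and reads it back by slicing. The helper names and the atom values are invented for illustration. It also shows how whitespace splitting fails when two coordinates fill their full width and run together.

```python
def format_atom(serial, name, resname, chain, resid, x, y, z, element):
    """Build a 78-character ATOM record (occupancy 1.00, B-factor 0.00)."""
    return (
        f"{'ATOM':<6}{serial:>5} {name:<4} {resname:>3} {chain:1}{resid:>4}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{0.0:6.2f}          {element:>2}"
    )

def parse_atom(line):
    """Read fields by column position (0-based slices of the 1-based spec)."""
    return {
        "serial": int(line[6:11]),      # columns 7-11
        "name": line[12:16].strip(),    # columns 13-16
        "resname": line[17:20].strip(), # columns 18-20
        "chain": line[21],              # column 22
        "resid": int(line[22:26]),      # columns 23-26
        "x": float(line[30:38]),        # columns 31-38
        "y": float(line[38:46]),        # columns 39-46
        "z": float(line[46:54]),        # columns 47-54
    }

# Made-up atom whose x and y values fill their full 8-character fields
line = format_atom(1, " CA ", "ILE", "A", 16, -100.125, -200.5, 3.0, "C")
atom = parse_atom(line)
naive_fields = line.split()  # x and y are fused into one token here

print(line)
print(atom)
print(naive_fields)
```

Slicing recovers `x` and `y` correctly, while `split()` returns the single token `-100.125-200.500` for both.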

In addition, one needs to know the system well:

* Always review your PDB file: inspect the REMARK sections of the PDB file. You can often find key information specific to the structure (e.g. disulfide bridges, missing atoms, etc.).

* Protonation/pH: the protonation state of the system is critical. Since molecular dynamics simulations typically don't allow for bond breaking, the initial protonation of the system must be accurate. Knowing what pH you are trying to reproduce is therefore important to obtain the correct results. If you suspect changing protonation is important to your system and you still want to use classical mechanics, consider simulating both states (protonated and not protonated). Histidine residues can have three different protonation states even at pH 7; correct protonation of this residue is therefore particularly critical. It can be protonated at the delta nitrogen (most common), at the epsilon nitrogen (also very common), or at both nitrogens (special situations and low pH).

<img src="http://docs.htmd.org/img/histidines.png">

The best way to determine how histidine should be protonated is to look at the structure. Typically, a histidine residue is protonated if it is close enough to an electron donor (e.g. a glutamic acid), thus creating a hydrogen bond. Certain automated tools predict the protonation state of histidines based on their surrounding environment (e.g. AutoDock Tools). Since histidines are frequently present at protein active sites, a correct protonation state is particularly important in ligand binding simulations.
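
In practice, the three histidine states are expressed through the residue name in the input structure. The lookup below records the conventional names in the CHARMM and AMBER force fields. The function itself is a hypothetical helper for illustration, not an HTMD call.

```python
# (proton on ND1, proton on NE2) -> (CHARMM residue name, AMBER residue name)
HIS_NAMES = {
    (True, False): ("HSD", "HID"),  # neutral, proton on the delta nitrogen
    (False, True): ("HSE", "HIE"),  # neutral, proton on the epsilon nitrogen
    (True, True): ("HSP", "HIP"),   # positively charged, both nitrogens protonated
}

def histidine_name(h_on_nd1, h_on_ne2, forcefield):
    """Return the residue name for a histidine protonation state."""
    if forcefield not in ("charmm", "amber"):
        raise ValueError(f"unknown force field: {forcefield}")
    charmm_name, amber_name = HIS_NAMES[(h_on_nd1, h_on_ne2)]
    return charmm_name if forcefield == "charmm" else amber_name

print(histidine_name(True, False, "charmm"))
print(histidine_name(False, True, "amber"))
```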

* Disulfide bonds present in the system must be identified. As the CHARMM build output above shows, HTMD does this automatically.
* Metalloproteins: if the metal ion is not an active part of an interaction, it may be acceptable to treat it as a simple cation, restraining it with harmonic restraints if necessary.
* Duplicate atoms in the PDB file: typically, simply delete one of the duplicated groups. However, if both conformations are potentially important (e.g. loops involved in molecular recognition), it might be necessary to simulate both conformations separately.
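
Disulfide identification, mentioned in the list above, is usually a distance test between cysteine sulfur atoms: a disulfide S-S bond is about 2.05 Å long. The sketch below is one simple way to do it and is not necessarily how HTMD detects them. The 3.0 Å cutoff is an assumption, and the coordinates are invented; only the residue numbers are borrowed from the build log.

```python
import math
from itertools import combinations

def find_disulfides(sg_atoms, cutoff=3.0):
    """Return residue pairs whose SG atoms are within `cutoff` Angstrom."""
    pairs = []
    for (res_a, pos_a), (res_b, pos_b) in combinations(sg_atoms.items(), 2):
        if math.dist(pos_a, pos_b) <= cutoff:
            pairs.append((res_a, res_b))
    return pairs

# resid -> made-up SG coordinates in Angstrom (illustrative only)
sg_atoms = {
    22: (0.0, 0.0, 0.0),
    157: (2.05, 0.0, 0.0),
    42: (10.0, 0.0, 0.0),
    58: (10.0, 2.0, 0.0),
    100: (30.0, 30.0, 30.0),  # a free cysteine, far from every other SG
}

bonds = find_disulfides(sg_atoms)
print(bonds)
```

It is still worth comparing any automatic result against the disulfide records in the PDB file, as recommended above.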

## List of common patches

C-terminal patches:

<table class="summarytable">
    <thead>
        <tr>
            <th>Name</th>
            <th>Charge</th>
            <th>Description</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>CTER</td>
            <td>-1.00</td>
            <td>standard C-terminus</td>
        </tr>
        <tr>
            <td>CT1</td>
            <td>0.00</td>
            <td>methylated C-terminus from methyl acetate</td>
        </tr>
        <tr>
            <td>CT2</td>
            <td>0.00</td>
            <td>amidated C-terminus</td>
        </tr>
        <tr>
            <td>CT3</td>
            <td>0.00</td>
            <td>N-Methylamide C-terminus</td>
        </tr>
    </tbody>
</table>

N-terminal patches:

<table class="summarytable">
    <thead>
        <tr>
            <th>Name</th>
            <th>Charge</th>
            <th>Description</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>NTER</td>
            <td>1.00</td>
            <td>standard N-terminus</td>
        </tr>
        <tr>
            <td>ACE</td>
            <td>0.00</td>
            <td>acetylated N-terminus (to create dipeptide)</td>
        </tr>
        <tr>
            <td>ACP</td>
            <td>0.00</td>
            <td>acetylated N-terminus (for proline dipeptide)</td>
        </tr>
        <tr>
            <td>PROP</td>
            <td>1.00</td>
            <td>Proline N-Terminal</td>
        </tr>
        <tr>
            <td>GLYP</td>
            <td>1.00</td>
            <td>Glycine N-terminus </td>
        </tr>
    </tbody>
</table>

Side chain patches:

<table class="summarytable">
    <thead>
        <tr>
            <th>Name</th>
            <th>Charge</th>
            <th>Description</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>ASPP</td>
            <td>0.00</td>
            <td>patch for protonated aspartic acid, proton on od2</td>
        </tr>
        <tr>
            <td>GLUP</td>
            <td>0.00</td>
            <td>patch for protonated glutamic acid, proton on oe2</td>
        </tr>
        <tr>
            <td>CYSD</td>
            <td>-1.0</td>
            <td>patch for deprotonated CYS</td>
        </tr>
        <tr>
            <td>DISU</td>
            <td>-0.36</td>
            <td>patch for disulfides. Patch must be 1-CYS and 2-CYS</td>
        </tr>
        <tr>
            <td>HS2</td>
            <td>0.00</td>
            <td>Patch for neutral His, move proton from ND1 to NE2</td>
        </tr>
        <tr>
            <td>TP1</td>
            <td>-1.00</td>
            <td>convert tyrosine to monoanionic phosphotyrosine</td>
        </tr>
        <tr>
            <td>TP1A</td>
            <td>-1.00</td>
            <td>patch to convert tyrosine to monoanionic phenol-phosphate model
            compound when generating tyr, use first none last none for terminal
            patches</td>
        </tr>
        <tr>
            <td>TP2</td>
            <td>-2.00</td>
            <td>patch to convert tyrosine to dianionic phosphotyrosine</td>
        </tr>
        <tr>
            <td>TP2A</td>
            <td>-2.00</td>
            <td>patch to convert tyrosine to dianionic phosphotyrosine when
            generating tyr, use first none last none for terminal patches this
            converts a single tyrosine to a phenol phosphate</td>
        </tr>
        <tr>
            <td>TMP1</td>
            <td>-1.00</td>
            <td>patch to convert tyrosine to monoanionic phosphonate ester O -&gt;
            methylene (see RESI BMPH)</td>
        </tr>
        <tr>
            <td>TMP2</td>
            <td>-2.00</td>
            <td>patch to convert tyrosine to dianionic phosphonate ester O -&gt;
            methylene (see RESI BMPD)</td>
        </tr>
        <tr>
            <td>TDF1</td>
            <td>-1.00</td>
            <td>patch to convert tyrosine to monoanionic difluoro phosphonate ester
            O -&gt;  methylene (see RESI BDFH)</td>
        </tr>
    </tbody>
</table>

Circular protein chain patches:

<table class="summarytable">
    <thead>
        <tr>
            <th>Name</th>
            <th>Charge</th>
            <th>Description</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>LIG1</td>
            <td>0.00000</td>
            <td>linkage for cyclic peptide, 1 refers to the C terminus which is a
            glycine , 2 refers to the N terminus</td>
        </tr>
        <tr>
            <td>LIG2</td>
            <td>0.00000</td>
            <td>linkage for cyclic peptide, 1 refers to the C terminus, 2 refers to
            the N terminus which is a glycine</td>
        </tr>
        <tr>
            <td>LIG3</td>
            <td>0.00000</td>
            <td>linkage for cyclic peptide, 1 refers to the C terminus which is a
            glycine, 2 refers to the N terminus which is a glycine</td>
        </tr>
    </tbody>
</table>