# Free Energy Components

As with ligandswap, one of the major advantages of waterswap over other absolute binding free energy methods is that it supports decomposition of the free energy averages into per-residue components. How it does this is [described in detail here](http://dx.doi.org/10.1039/c3fd00125c).

What does this mean? Well, it means that waterswap can give you an indication of which residues contribute most to the binding of the ligand to the protein. More subtly, it reveals which residues the calculation showed as binding more strongly to the ligand, and which bound more strongly to the water.

The raw data from which we can calculate the free energy components is held in the `results_XXXX.log` files that are written to the `output` directory during a waterswap simulation. We unpacked some example output into the directory `output`.

In [None]:
outdir = "output"

There is one `results_XXXX.log` file produced from each iteration of the waterswap calculation. For example, the file `results_0500.log` contains the results produced from the 500th iteration of the calculation. Take a look at it here:

In [None]:
! cat output/results_0500.log

The file starts with the free energy calculated from the sampling performed at that iteration, e.g.

```
TOTAL BINDING FREE ENERGY

TOTAL   BOUND    FREE
-33.146761144427906   -62.042406366040375   27.86135212442684
```

This splits the total free energy into the contribution from the protein box ("BOUND") and the contribution from the water box ("FREE"). In this case it shows that most of the free energy (-62 kcal mol-1) comes from a preference of the protein for the ligand over water. This is larger than the 28 kcal mol-1 preference of bulk water for the ligand over the swapped water. This implies that the ligand is very soluble, but it binds much more strongly to the protein than to bulk water. This is unsurprising, as the ligand is zwitterionic (it carries both a positive and a negative charge), and so is both highly soluble and well-suited to the highly charged binding site of the protein.

The above numbers can be useful, as they highlight when a ligand binds only because it is insoluble, and so is seeking shelter in a hydrophobic binding site. For these cases, the "BOUND" component would be negative (protein prefers the ligand), while the "FREE" component would also be negative, and larger in magnitude than the "BOUND" component (water significantly prefers water, and doesn't want to solvate the ligand).
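
The three numbers in this header block can also be read programmatically, which helps when comparing many iterations. The following is a minimal sketch that assumes the block layout shown above (a `TOTAL BINDING FREE ENERGY` title, a column header, then one line of three numbers). `parse_total_block` is a hypothetical helper name used for illustration; it is not part of Sire.

```python
def parse_total_block(text):
    """Return the TOTAL, BOUND and FREE values from the header block of a results log."""
    lines = [line.strip() for line in text.splitlines()]
    for i, line in enumerate(lines):
        if line == "TOTAL BINDING FREE ENERGY":
            # the first following line that holds three numbers is the data line
            for candidate in lines[i + 1:]:
                words = candidate.split()
                if len(words) == 3:
                    try:
                        total, bound, free = (float(w) for w in words)
                    except ValueError:
                        continue  # this was the "TOTAL BOUND FREE" column header
                    return {"total": total, "bound": bound, "free": free}
    raise ValueError("no TOTAL BINDING FREE ENERGY block found")

# the example block quoted above
example = """TOTAL BINDING FREE ENERGY

TOTAL   BOUND    FREE
-33.146761144427906   -62.042406366040375   27.86135212442684
"""
energies = parse_total_block(example)
print(energies["bound"], energies["free"])
```

To use it on a real file, pass the file contents, e.g. `parse_total_block(open("output/results_0500.log").read())`.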

After the "BOUND" and "FREE" components, you can see free energy components for all of the residues that were within 15 Å of the ligand, e.g.

```
RESIDUE FREE ENERGY COMPONENTS

RESIDUE    TOTAL    COULOMB    LJ
Residue( VAL : 35 )  -0.16535165766352544  -0.16085920425914235  -0.004492451120687602
Residue( ILE : 36 )  0.3878332703164674  0.40076901074697097  -0.012935740297036945
Residue( ARG : 37 )  -2.503970191431942  4.726547180906718  -7.2192015044114495
Residue( GLU : 38 )  -3.7762095801871554  -22.938939884560835  19.230661665228936
```

Positive values show that the residue binds more strongly to water, while negative values show that the residue binds more strongly to the ligand.

In this case, you can see that the charged ARG37 and GLU38 residues strongly prefer the ligand (by 2.5 and 3.8 kcal mol-1 respectively). For GLU38 this comes from a -23 kcal mol-1 electrostatic (COULOMB) preference for the ligand, balanced by a +19 kcal mol-1 van der Waals (LJ) preference for water. For ARG37 this is from a +4.7 kcal mol-1 electrostatic preference for water, which is overcome by a -7.2 kcal mol-1 van der Waals preference for the ligand.
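
To make the sign convention concrete, here is a small sketch that parses one residue line and reports which partner each component favours. It assumes the `Residue( NAME : NUMBER )  TOTAL  COULOMB  LJ` layout shown above. The names `RESIDUE_LINE`, `parse_residue_line` and `preference` are made up for this example.

```python
import re

# Matches lines such as "Residue( ARG : 37 )  -2.50  4.73  -7.22"
RESIDUE_LINE = re.compile(
    r"Residue\(\s*(\w+)\s*:\s*(-?\d+)\s*\)\s+(\S+)\s+(\S+)\s+(\S+)"
)

def parse_residue_line(line):
    """Return (name, number, total, coulomb, lj), or None if the line does not match."""
    match = RESIDUE_LINE.match(line.strip())
    if match is None:
        return None
    name, number, total, coulomb, lj = match.groups()
    return name, int(number), float(total), float(coulomb), float(lj)

def preference(value):
    """Negative values favour the ligand, positive values favour water."""
    if value < 0:
        return "ligand"
    if value > 0:
        return "water"
    return "neither"

line = "Residue( ARG : 37 )  -2.503970191431942  4.726547180906718  -7.2192015044114495"
name, number, total, coulomb, lj = parse_residue_line(line)
print(name, number, preference(total), preference(coulomb), preference(lj))
```

For ARG37 this reports a ligand preference overall, a water preference for the Coulomb term and a ligand preference for the LJ term, matching the description above.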

These values are taken from just one iteration from the simulation. Ideally we should average the components across all iterations that we class as "production" (so not discarded as equilibration). We can do this using a simple script.

First, we import pandas and matplotlib for data handling and plotting, and then also import the Sire modules we need to calculate averages (in Sire.Maths), identify parts of molecules (in Sire.Mol) and load molecules from files (in Sire.IO).

In [None]:
import pandas as pd
from pandas import Series, DataFrame
import matplotlib.pyplot as plt
%config InlineBackend.figure_format = 'svg'   # helps make things look better in Jupyter :-)

import Sire.Maths
import Sire.Mol
import Sire.IO

Next we set the range of iterations over which we want to average. We will use the last 60% again, so use iterations 400-1000.

In [None]:
r = [400, 1000]

We need to process the files from results_0400.log to results_1000.log. The below line generates a list of all of these filenames.

In [None]:
filenames = ["%s/results_%04d.log" % (outdir,i) for i in range(r[0],r[1]+1)]
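
Before reading the files it can be worth checking that they are all present, since a missing log would otherwise stop the analysis part-way through. The sketch below is an optional extra, not part of the original workflow, and `existing_files` is a hypothetical helper name. The filename list is rebuilt with the same values as above so that the example is self-contained.

```python
import os

def existing_files(filenames):
    """Split filenames into those that exist on disk and those that are missing."""
    found = [f for f in filenames if os.path.isfile(f)]
    missing = [f for f in filenames if not os.path.isfile(f)]
    return found, missing

# same naming scheme and range as the cell above
outdir = "output"
r = [400, 1000]
filenames = ["%s/results_%04d.log" % (outdir, i) for i in range(r[0], r[1] + 1)]

found, missing = existing_files(filenames)
print("%d of %d files found" % (len(found), len(filenames)))
```

Note that the range is inclusive at both ends, so iterations 400-1000 correspond to 601 files.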

The next cell defines a function that reads all of the results files and extracts the average energy components. These are placed into a pandas DataFrame for easier manipulation later...

In [None]:
def getComponents(filenames):
    """Read all of the residue-based free energy components from the log files produced
       by a waterswap or ligandswap simulation (passed as a list of filenames). Return
       the average components as a pandas DataFrame"""
    avgs = {}
    resids = {}
    
    # Loop over all of the files...
    for filename in filenames:
        has_started = False
        # open with a context manager so each file is closed after reading
        with open(filename) as logfile:
            for line in logfile:
                # Read from the line "RESIDUE FREE ENERGY COMPONENTS" onwards...
                if line.find("RESIDUE FREE ENERGY COMPONENTS") != -1:
                    has_started = True
                
                elif has_started:
                    words = line.split()
                    if len(words) == 8:
                        resname = words[1]
                        resnum = int(words[3])
                        total = float(words[-3])
                        coul = float(words[-2])
                        lj = float(words[-1])
                        key = "%s:%s" % (resname,resnum)
                        
                        if key not in avgs:
                            avgs[key] = [Sire.Maths.Average(), Sire.Maths.Average(), Sire.Maths.Average()]
                            if resnum not in resids:
                                resids[resnum] = [resname]
                            else:
                                resids[resnum].append(resname)
                        
                        # accumulate the average total, coulomb and LJ free energies
                        avgs[key][0].accumulate(total)
                        avgs[key][1].accumulate(coul)
                        avgs[key][2].accumulate(lj)
                        
                    elif line.find("COMPONENTS") != -1:
                        break
    
    # Now sort the data into a pandas DataFrame
    resnums = list(resids.keys())
    resnums.sort()
    resnams = []
    total = []
    coul = []
    lj = []
    
    for resnum in resnums:
        for resname in resids[resnum]:
            key = "%s:%s" % (resname,resnum)
            avg = avgs[key]
            resnams.append(resname)
            total.append(avg[0].average())
            coul.append(avg[1].average())
            lj.append(avg[2].average())
    
    # The data is in lists which can be put into pandas columns. We will index the 
    # DataFrame using the residue number (assuming that they are all unique)
    return DataFrame( index = resnums,
                      data = {"name" : resnams, "total" : total, "coulomb" : coul, "LJ" : lj},
                      columns=["name", "total", "coulomb", "LJ"] )

Now use the above function to process all of the files and generate the pandas DataFrame...

In [None]:
components = getComponents(filenames)
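
The averaging inside `getComponents` follows an accumulate-then-average pattern: each value read from a log file is passed to `accumulate()`, and `average()` is called once at the end. The sketch below mimics that pattern in plain Python so it can be tried without Sire. `RunningAverage` is a hypothetical stand-in and makes no claim about how `Sire.Maths.Average` is implemented, and the three input values are made up purely for illustration.

```python
class RunningAverage:
    """Plain-Python stand-in for the accumulate()/average() pattern used above."""

    def __init__(self):
        self._sum = 0.0
        self._count = 0

    def accumulate(self, value):
        """Add one more sample to the running total."""
        self._sum += value
        self._count += 1

    def average(self):
        """Return the mean of all accumulated samples."""
        # assumption of this sketch: an empty accumulator averages to zero
        return self._sum / self._count if self._count else 0.0

avg = RunningAverage()
for value in (-3.5, -4.0, -4.5):   # made-up per-iteration components
    avg.accumulate(value)
print(avg.average())
```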

The components are in a pandas DataFrame, so can be manipulated using any of the pandas functions. For example, look at the first few rows using the 'head' function

In [None]:
components.head()

There are components for all residues. Most of these are near zero, so it is a good idea to focus on those that are significant ( > 1.0 kcal mol-1 or < -1.0 kcal mol-1 ).

In [None]:
components[ components.total.abs() > 1.0 ]

You can even use matplotlib to plot these :-)

In [None]:
components[ components.total.abs() > 1.0 ].plot.bar()

The analysis shows which residues are making a contribution to binding. A negative sign shows that the residue prefers to bind the ligand, while a positive sign shows that the residue prefers to bind water.

You can use these components to begin to gain insight into why waterswap predicts that the ligand is a good or poor binder to the protein, in terms of how well the ligand displaces water for each of the residues in the protein. For example, here we see that the residues that contribute most strongly to binding are GLU38, ARG212 and ARG287, which are the charged residues that form salt-bridges with the ligand. The residues that destabilise the ligand are ARG75 and GLU196. This suggests that stronger binding could be obtained by modifying oseltamivir to better target these residues.

The next step once you have identified residues is to look at the 3D structures sampled during the waterswap calculation to see if you can understand from those why different residues have different preferences for the ligand or water. One way to help is to color-code residues based on the free energy components calculated above.

First, we need to get one of the PDB structures output by the waterswap calculation. By default the calculation writes PDB files every 50 iterations from the protein box and the water box at the closest lambda value to 0 (lambda=0.005) and the closest lambda value to 1 (0.995). The files are called

* bound_mobile_XXXXX_YYYYY.pdb : the protein box files at iteration XXXXX and lambda value YYYYY, e.g. bound_mobile_001000_0.00500.pdb
* free_mobile_XXXXX_YYYYY.pdb : the water box files at iteration XXXXX and lambda value YYYYY, e.g. free_mobile_000500_0.99500.pdb

Note that, to save space, the files contain only the mobile atoms in the simulation, so don't worry that they look like a ball and most of the protein and water is missing. The fixed atoms are in the simulation, they just aren't written to these files. 
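
If you want to pick out a particular snapshot, the filename can be built from the iteration and lambda value. This sketch infers the zero-padding from the two example names given above (six digits for the iteration, five decimal places for lambda). That format is an assumption based on those examples, and `snapshot_name` is a hypothetical helper.

```python
def snapshot_name(box, iteration, lam):
    """Build a snapshot filename; box is "bound" (protein box) or "free" (water box)."""
    if box not in ("bound", "free"):
        raise ValueError("box must be 'bound' or 'free'")
    # assumption: six-digit zero-padded iteration, lambda with five decimal places
    return "%s_mobile_%06d_%.5f.pdb" % (box, iteration, lam)

print(snapshot_name("bound", 1000, 0.005))
print(snapshot_name("free", 500, 0.995))
```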

To color-code the residues, we first need to read one of the protein-box files...

In [None]:
system = Sire.IO.MoleculeParser.read("%s/bound_mobile_001000_0.00500.pdb" % outdir)

We need to extract the protein... We can do this by finding the first molecule with a residue called "ALA"

In [None]:
protein = system[Sire.Mol.MolWithResID(Sire.Mol.ResName("ALA"))]

Here is a function that colour-codes the protein based on the passed pandas DataFrame

In [None]:
def colourProtein(protein, data, column):
    """Colour-code the passed protein using the data contained in the passed dataframe, using the
       specified column"""
    
    # first find the maximum absolute value - we will scale linearly from there
    vals = data[column]
    maxval = vals.abs().max()
    
    # now create an AtomFloatProperty that will contain a number for each atom
    # in each residue. This will be from 0-100, with 0 representing -maxval, 
    # 50 representing 0 and 100 representing maxval
    betas = Sire.Mol.AtomFloatProperty(protein, 50.0)
    
    for x in data.index:
        resnum = Sire.Mol.ResNum(int(x))
        resnam = Sire.Mol.ResName(data["name"][x])
        value = vals[x]
        
        scaled = 50.0 + 50.0*(value/maxval)
        
        # issues with beta mean it must lie between 0 and 99.99
        if scaled < 0:
            scaled = 0.0
        elif scaled > 99.99:
            scaled = 99.99
            
        residue = protein[ resnam + resnum ]
        
        for atom in residue.atoms():
            betas.set(atom.cgAtomIdx(), scaled)
            
    # Set the 'beta_factor' property as this is the name used for the 'beta_factor'
    # value by the PDB writer
    protein = protein.edit().setProperty("beta_factor", betas).commit()
    return protein

We will use this function to colour the protein according to the "total" free energy components, which are written into its "beta_factor" property.

In [None]:
protein = colourProtein(protein, components, "total")
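
The linear mapping used inside `colourProtein` can be checked in isolation. The sketch below reproduces the same arithmetic (0 for -maxval, 50 for zero, capped at 99.99 for +maxval) as a standalone function. `scale_to_beta` is a hypothetical name, and the guard for `maxval == 0` is an addition of this sketch that is not present in the function above.

```python
def scale_to_beta(value, maxval):
    """Map a free energy component onto the 0-99.99 range used for the beta factor."""
    if maxval == 0:
        # assumption of this sketch: with no spread in the data, everything is neutral
        return 50.0
    scaled = 50.0 + 50.0 * (value / maxval)
    # clamp to the range that the beta factor column can hold
    return min(max(scaled, 0.0), 99.99)

for value in (-4.0, 0.0, 1.0, 4.0):
    print(value, scale_to_beta(value, 4.0))
```

Residues that strongly prefer the ligand therefore end up near 0, indifferent residues sit at 50, and residues that strongly prefer water end up near 100.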

Next we update the loaded system with the new, colour-coded version of the protein

In [None]:
system.update(protein)

Finally, we write this system out to a PDB file that you can load into any molecular visualiser

In [None]:
Sire.IO.MoleculeParser.write(system, "colourcoded.pdb")

You can now download this PDB file using the Jupyter interface. Load it into a molecular viewer, for example VMD. Select the protein and colour it by beta factor. You should see the residues that contribute strongly to the binding free energy highlighted in the 3D view. For example, here is the view I've made using VMD...

![Colour-coded neuraminidase](images/colorcoded.jpg)

Note that the selection I used to highlight only the important residues was:

```
noh and protein and not name O N and (beta > 52 or beta < 48)
```

(this used the fact that beta=50 meant no significant difference, while beta > 52 or beta < 48 were significant)
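
Because the colour scale is linear, the 48-52 window can be translated back into a fraction of the largest absolute component: beta = 52 corresponds to value/maxval = +0.04 and beta = 48 to -0.04. The sketch below shows that inversion. The function names are hypothetical, and the `maxval` of 10 kcal mol-1 is a made-up number chosen only to make the arithmetic easy to follow.

```python
def beta_to_fraction(beta):
    """Invert the linear colour scale: return value / maxval for a given beta."""
    return (beta - 50.0) / 50.0

def is_highlighted(value, maxval, low=48.0, high=52.0):
    """True if a component would fall outside the 48-52 beta window."""
    beta = 50.0 + 50.0 * (value / maxval)
    return beta < low or beta > high

print(beta_to_fraction(52.0), beta_to_fraction(48.0))

# hypothetical numbers: with a largest component of 10 kcal mol-1,
# anything beyond +/- 0.4 kcal mol-1 is highlighted
print(is_highlighted(0.3, 10.0), is_highlighted(-0.5, 10.0))
```

The cutoff is therefore relative to the strongest residue in the data set rather than a fixed energy, which is worth bearing in mind when comparing pictures from different calculations.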

From this picture we can make some interesting observations:

* Of the three arginines at the base of the binding site, the ligand is binding strongly to ARG287 and ARG212. ARG37 shows little preference for binding to the ligand or water. This correlates with [dynamics studies](http://dx.doi.org/10.1021/bi400754t) that showed during unbinding trajectories that the ligand lost interaction with ARG37 easily, and with [mutation studies](http://dx.doi.org/10.1038/srep03561) that show that mutation of ARG212 can lead to loss of drug efficacy.
* GLU38 has a slight preference to bind to the ligand over water, but this leads to a strong preference of ARG75 for water. This is because binding of the ligand to GLU38 weakens the salt bridge between GLU38 and ARG75. This is good for GLU38, but destabilises ARG75.
* Similarly, a slight preference of ARG144 for the ligand weakens its interaction with GLU196, which suggests why GLU196 shows a preference to bind water (and therefore destabilises binding of the ligand).
* Binding is strongly driven by ARG287. The contribution of the other residues in the binding site is very small compared to the strong preference of ARG287 to bind oseltamivir. This is unsurprising, as this is a strong salt-bridge interaction. However, it does suggest that the other residues are comparatively indifferent to binding the ligand or water, and so more binding affinity could be obtained by modifying oseltamivir to strengthen specific interactions with other residues, e.g. ARG71 and ASP70, which are on the "150-loop" that sits above the ligand, and the opening of which [has been shown](http://dx.doi.org/10.1021/bi300561n) to lead to drug unbinding.