# Free Energy Components

Like ligandswap, proteinswap supports decomposition of the free energy averages into per-residue components. This can be performed for both the residues in the reference (wildtype) protein and the residues in the perturbed (mutant) protein.

This means that proteinswap can show you why it makes a particular prediction. If proteinswap suggests that the mutant binds the ligand more weakly, it should show you how this weaker binding relates to loss of interaction with different residues in the mutant.

The raw data from which we can calculate the free energy components is held in the `results_XXXX.log` files that are written to the `output` directory during a proteinswap simulation. These have a similar format to the files from ligandswap, except that there are now two sets of "RESIDUE FREE ENERGY COMPONENTS", the first for the reference protein and the second for the perturbed protein. You unpacked some example output in a previous section, into the directory "output".

In [None]:
outdir = "output"

There is one `results_XXXX.log` file produced from each iteration of the proteinswap calculation. For example, the file `results_0500.log` contains the results produced from the 500th iteration of the calculation. Take a look at it here:

In [None]:
!cat output/results_0500.log

These values are taken from just one iteration from the simulation. Ideally we should average the components across all iterations that we class as "production" (so not discarded as equilibration). We can do this using a simple script.

First, we import pandas and matplotlib for data handling and plotting, and then import the Sire modules we need to calculate averages (in Sire.Maths), identify parts of molecules (in Sire.Mol) and load molecules from files (in Sire.IO).

In [None]:
import pandas as pd
from pandas import Series, DataFrame
import matplotlib.pyplot as plt
%config InlineBackend.figure_format = 'svg'   # helps make things look better in Jupyter :-)

import Sire.Maths
import Sire.Mol
import Sire.IO

Next we set the range of iterations over which we want to average. We will use the last 60% again, so we use iterations 400-1000.

In [None]:
r = [400, 1000]

We need to process the files from `results_0400.log` to `results_1000.log`. The line below generates a list of all of these filenames.

In [None]:
filenames = ["%s/results_%04d.log" % (outdir,i) for i in range(r[0],r[1]+1)]
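If a run stopped before the last iteration, some of the files in this range will not exist, and opening them later will fail. The sketch below rebuilds the same list and filters it down to the files that are present. The helper name `existing_files` is introduced here for illustration, and `outdir` and `r` are repeated only so that the block runs on its own.

```python
import os

def existing_files(filenames):
    """Return only the filenames that exist on disk, reporting how many were missing."""
    found = [f for f in filenames if os.path.isfile(f)]
    missing = len(filenames) - len(found)
    if missing:
        print("Skipping %d missing results files" % missing)
    return found

# Stand-ins for the notebook variables, repeated so this block is self-contained
outdir = "output"
r = [400, 1000]

filenames = ["%s/results_%04d.log" % (outdir, i) for i in range(r[0], r[1] + 1)]
print(len(filenames), filenames[0], filenames[-1])

available = existing_files(filenames)
```

Note that the range is inclusive at both ends, so iterations 400 to 1000 give 601 files.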

The next cell defines a function that reads all of the results files and extracts the average energy components. These are placed into two pandas DataFrames (first for reference, second for perturbed) for easier manipulation later...

In [None]:
def getComponents(filenames):
    """Read all of the residue-based free energy components from the log files produced
       by a proteinswap simulation (passed as a list of filenames). Return
       a tuple of the reference protein and perturbed protein average components as
       pandas DataFrames"""
    # One (resids, avgs) pair per "RESIDUE FREE ENERGY COMPONENTS" block: the first is
    # the reference protein, the second is the perturbed protein. These persist across
    # files so that the averages accumulate over every iteration.
    blocks = [({}, {}), ({}, {})]

    # Loop over all of the files...
    for filename in filenames:
        block = -1   # index of the block being read; -1 until the first header is seen
        with open(filename) as logfile:
            for line in logfile:
                # Read from the line "RESIDUE FREE ENERGY COMPONENTS" onwards...
                if "RESIDUE FREE ENERGY COMPONENTS" in line:
                    block += 1
                    if block >= len(blocks):
                        break
                    (resids, avgs) = blocks[block]

                elif block >= 0:
                    words = line.split()
                    if len(words) == 8:
                        resname = words[1]
                        resnum = int(words[3])
                        total = float(words[-3])
                        coul = float(words[-2])
                        lj = float(words[-1])
                        key = "%s:%s" % (resname,resnum)

                        if key not in avgs:
                            avgs[key] = [Sire.Maths.Average(), Sire.Maths.Average(), Sire.Maths.Average()]
                            resids.setdefault(resnum, []).append(resname)

                        # accumulate the average total, coulomb and LJ free energies
                        avgs[key][0].accumulate(total)
                        avgs[key][1].accumulate(coul)
                        avgs[key][2].accumulate(lj)

                    elif "COMPONENTS" in line:
                        # a different components section has started, so stop reading
                        break

    frames = []

    for (resids, avgs) in blocks:
        if not resids:
            # nothing was read for this block (e.g. logs with a single set of components)
            continue

        # Now sort the data into a pandas DataFrame
        index = []
        resnams = []
        total = []
        coul = []
        lj = []

        for resnum in sorted(resids):
            for resname in resids[resnum]:
                avg = avgs["%s:%s" % (resname,resnum)]
                index.append(resnum)
                resnams.append(resname)
                total.append(avg[0].average())
                coul.append(avg[1].average())
                lj.append(avg[2].average())

        # The data is in lists which can be put into pandas columns. We index the
        # DataFrame using the residue number, with one row per residue name read
        frames.append( DataFrame( index = index,
                       data = {"name" : resnams, "total" : total, "coulomb" : coul, "LJ" : lj},
                       columns=["name", "total", "coulomb", "LJ"] ) )

    if len(frames) == 1:
        return frames[0]
    else:
        return (frames[0],frames[1])
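Each per-residue accumulator in `getComponents` only has to remember a running sum and a count, so averaging over hundreds of log files does not need every value to be stored. As a rough illustration of that idea, here is a plain-Python stand-in with the same two method names. The class `RunningAverage` is written for this explanation only and is not a description of how `Sire.Maths.Average` is implemented.

```python
class RunningAverage:
    """Keep a running mean without storing every value (illustrative stand-in)."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def accumulate(self, value):
        """Add one more sample to the average."""
        self.total += value
        self.count += 1

    def average(self):
        """Return the mean of the samples seen so far (0.0 if there are none)."""
        return self.total / self.count if self.count else 0.0

# Made-up component values for one residue over three iterations
avg = RunningAverage()
for value in [-1.0, -2.0, -3.0]:
    avg.accumulate(value)

print(avg.average())
```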

Now use `getComponents` to process all of the files and generate the two pandas DataFrames...

In [None]:
(wildtype,mutant) = getComponents(filenames)

The components are in pandas DataFrames, so can be manipulated using any of the pandas functions. For example, look at the first few rows using the `head` function.

In [None]:
wildtype.head(), mutant.head()

One thing to note is that the components from the mutant have the opposite sign to those from the wildtype. This is because the wildtype components are the free energies to swap the ligand with water, while the mutant components are the free energies to swap the water with the ligand.

You can make things easier to understand by taking the negative of the mutant components.

In [None]:
wildtype.head(), mutant.multiply(-1).head()

While the first couple of residue components are very similar, you can already see that there are big differences for ARG37 and GLU38 between the wildtype and mutant protein. Let's now combine the two dataframes to find all residues with a significant difference between the wildtype and mutant components

In [None]:
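# the mutant components have the opposite sign, so the difference between the two
# proteins for residue x is wildtype.total[x] - (-mutant.total[x])
residues = [ x for x in wildtype.index if x in mutant.total and abs(wildtype.total[x]+mutant.total[x]) > 2 ]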
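To compare like with like, the perturbed values need their sign flipped before the difference is thresholded. The sketch below works through that on a few made-up numbers using plain dictionaries. The residue numbers, the values and the helper name `changed_residues` are all hypothetical, and the 2 kcal/mol threshold simply mirrors the cut-off used in this notebook.

```python
# Hypothetical per-residue totals in kcal/mol (illustrative numbers, not simulation output)
wildtype_total = {1: -0.1, 2: -4.0, 3: 1.5}
mutant_total = {1: 0.1, 2: 1.0, 3: -1.4}   # reported with the opposite sign convention

def changed_residues(reference, perturbed, threshold=2.0):
    """Return the residues whose components differ by more than threshold once
       the perturbed values have been flipped onto the reference sign convention."""
    flipped = {resnum: -value for resnum, value in perturbed.items()}
    return [resnum for resnum in sorted(reference)
            if resnum in flipped and abs(reference[resnum] - flipped[resnum]) > threshold]

changed_example = changed_residues(wildtype_total, mutant_total)
print(changed_example)
```

Only residue 2 is selected: its component moves from -4.0 to -1.0 once the signs agree, while the other two residues are essentially unchanged.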

In [None]:
changed = DataFrame( index=residues, data={"wildtype" : [wildtype.total[x] for x in residues],
                                           "mutant" : [-mutant.total[x] for x in residues]}  )

You can even use matplotlib to plot these :-)

In [None]:
changed.plot.bar()

Looking at the above graph, it is clear that we are really interested in the residues whose free energy component has changed significantly between the wildtype and the mutant. Getting and plotting these differences is simple with pandas :-)

In [None]:
print(changed.mutant - changed.wildtype)
(changed.mutant - changed.wildtype).plot.bar()

From this, it is clear that the mutation of residue 212 from arginine to lysine (R292K) has meant that this residue binds more weakly to oseltamivir. This mutation has also affected SER166 (which binds the ligand more weakly in the mutant). In contrast, ARG38 binds oseltamivir much more strongly in the mutant.

To understand the structural reason for this, the next step is to colour-code the PDB outputs from proteinswap according to these differences. We will colour-code the reference protein structure here, but you can use the same function to colour-code any structure. For proteinswap, the PDB files for the reference (wildtype) protein are called `bound0_mobile_XXXX_YYYY.pdb`, while the PDB files for the perturbed (mutant) protein are called `bound1_mobile_XXXX_YYYY.pdb`. Here we will load the PDB of the wildtype from iteration 1000 and lambda=0.005.

In [None]:
system = Sire.IO.MoleculeParser.read("%s/bound0_mobile_001000_0.00500.pdb" % outdir)

We need to extract the protein... We can do this by finding the first molecule with a residue called "ALA"

In [None]:
protein = system[Sire.Mol.MolWithResID(Sire.Mol.ResName("ALA"))]

Here is a function that colour-codes the protein based on the passed set of differences

In [None]:
def colourProtein(protein, data):
    """Colour-code the passed protein using the difference data contained in the passed dataframe"""
    
    # first find the maximum absolute value - we will scale linearly from there
    maxval = data.abs().max()
    
    # now create an AtomFloatProperty that will contain a number for each atom
    # in each residue. This will be from 0-100, with 0 representing -maxval, 
    # 50 representing 0 and 100 representing maxval
    betas = Sire.Mol.AtomFloatProperty(protein, 50.0)
    
    for x in data.index:
        resnum = Sire.Mol.ResNum(int(x))
        value = data[x]
        
        scaled = 50.0 + 50.0*(value/maxval)
        
        # issues with beta mean it must lie between 0 and 99.99
        if scaled < 0:
            scaled = 0.0
        elif scaled > 99.99:
            scaled = 99.99
            
        residue = protein[ resnum ]
        
        for atom in residue.atoms():
            betas.set(atom.cgAtomIdx(), scaled)
            
    # Set the 'beta_factor' property as this is the name used for the 'beta_factor'
    # value by the PDB writer
    protein = protein.edit().setProperty("beta_factor", betas).commit()
    return protein
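The scaling inside `colourProtein` can be tried on its own, without Sire. The sketch below restates the same linear map and clamp in plain Python. The function name `scale_to_beta` and the guard for a zero maximum are additions made for this illustration.

```python
def scale_to_beta(value, maxval):
    """Map a difference in the range [-maxval, maxval] onto a 0-99.99 beta value,
       with 50 meaning 'no difference'."""
    if maxval == 0:
        # assumption for this sketch: with no differences at all, everything is neutral
        return 50.0

    scaled = 50.0 + 50.0 * (value / maxval)

    # clamp to the range that fits in the PDB beta column
    return min(max(scaled, 0.0), 99.99)

# Made-up differences in kcal/mol, with a largest absolute difference of 4.0
for value in [-4.0, 0.0, 2.0, 4.0]:
    print(value, scale_to_beta(value, 4.0))
```

The largest positive difference would map to exactly 100, which is why the clamp brings it down to 99.99.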

We will use `colourProtein` to update the protein, colouring it via the "beta_factor" property according to the differences in the "total" free energy components.

In [None]:
protein = colourProtein(protein, changed.wildtype - changed.mutant)

Next we update the loaded system with the new, colour-coded version of the protein

In [None]:
system.update(protein)

Finally, we write this system out to a PDB file that you can load into any molecular visualiser

In [None]:
Sire.IO.MoleculeParser.write(system, "colourcoded.pdb")
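Before opening a viewer, you can check that the beta values made it into the file by reading them back. This is a minimal sketch that assumes the output follows the standard fixed-column PDB layout, in which columns 18-20 hold the residue name, 23-26 the residue number and 61-66 the beta (temperature) factor. The helper name `residue_betas` and the example record are made up for illustration.

```python
def residue_betas(pdb_lines):
    """Return {(residue name, residue number): beta} from PDB ATOM/HETATM records.
       All atoms in a residue are expected to share one value, so the last atom read wins."""
    betas = {}
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            resname = line[17:20].strip()
            resnum = int(line[22:26])
            betas[(resname, resnum)] = float(line[60:66])   # columns 61-66 hold the beta factor
    return betas

# A single hand-built record (made-up coordinates) to show the fixed-column layout
example = "ATOM  %5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f%6.2f%6.2f" % (
    1, " N", "ARG", "A", 212, 11.104, 6.134, -6.504, 1.00, 72.50)

print(residue_betas([example]))

# For the file written above you would instead use:
#   with open("colourcoded.pdb") as pdbfile:
#       betas = residue_betas(pdbfile)
```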

You can now download this PDB file using the Jupyter interface. Load it into a molecular viewer, for example VMD. Select the protein and colour it by beta factor. You should see the residues that contribute strongly to the binding free energy highlighted in the 3D view. For example, here is the view I've made using VMD, next to the view when I repeated this colour-coding for the mutant protein (`bound1_mobile_001000_0.99500.pdb`).

![Image of color-coded wildtype](images/wildtype.jpg)![Image of color-coded mutant](images/mutant.jpg)

Note that the selection I used to highlight only the important residues was:

```
noh and protein and not name O N and (beta > 52 or beta < 48)
```

(this used the fact that beta=50 meant no significant difference, while beta > 52 or beta < 48 were significant)
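
The cut-off can be translated back into an energy using the linear scaling in `colourProtein`. Writing $v$ for a residue's difference and $v_{\max}$ for the largest absolute difference (symbols introduced here for illustration):

$$\beta = 50 + 50\,\frac{v}{v_{\max}} \quad\Rightarrow\quad |\beta - 50| > 2 \iff |v| > 0.04\,v_{\max}$$

So this selection keeps the residues whose difference is more than 4% of the largest difference in the set.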

From the view we can infer that the mutation has weakened binding between the ligand and the residues to the left (including the R292K). This is unsurprising, as lysine is a smaller residue than arginine, and so it is not able to reach oseltamivir and bind as well. Conversely, oseltamivir compensates by binding more strongly to the residue on the right (ARG38). The increased binding to ARG38 is not enough to compensate for the loss to ARG212 and SER166, and so the mutant binds oseltamivir more weakly, and is thus drug-resistant. This suggests that, to overcome resistance, oseltamivir needs to be made slightly larger so that it can reach the smaller LYS212, while still keeping the stronger interaction with ARG38.