# Setting up and Analysing Binding Free Energies of Lysozyme

This notebook shows how to compute relative binding free energies from alchemical free energy simulations inside a notebook rather than on the command line. In particular, we will compute the relative binding free energies of lysozyme ligands.


The notebook forms part of the CCPBio-Sim workshop **Alchemical Free Energy Simulation Analysis with analyse_freenrg** run on the 11th of April 2018 at the University of Bristol.

*Author: Antonia Mey   
Email: antonia.mey@ed.ac.uk*

**Reading time of the document: 50 mins**

## Let's start with the necessary imports

In [None]:
%pylab inline
from Sire.Tools.FreeEnergyAnalysis import NotebookHelper
from Sire.Units import *
import glob
from PIL import Image
import seaborn as sbn
sbn.set_style("ticks")
sbn.set_context("notebook", font_scale = 2)
from ipywidgets import interact, interactive, fixed, interact_manual, Layout, Label
import ipywidgets as widgets
## This sets the style for the widgets used later
style = {'description_width': 'initial'}
layout = Layout(flex='2 1 auto', width='auto')

## Free energies of binding

This notebook analyses a series of simulations that were run using an alchemical free energy approach. We will look at how to compute an individual relative binding free energy from a single set of perturbations and how to analyse a whole perturbation map. As in the ethane/methanol example, we can write down a thermodynamic cycle that allows us to estimate the relative binding free energy of two ligands. In the pictorial example below, one ligand is represented by a circle and the other by a square:

![cycle](images/Therm_cycle.png)

The analysis will be done using the `analyse_freenrg mbar` tool in Sire, as well as by interacting directly with the Python interface to compute binding free energies.   
From the thermodynamic cycle you can compute the relative free energy of binding as:   

![ddg](images/DDG.png)

Each ΔG can be computed using either TI or MBAR; the difference between the two then gives the relative binding free energy of the two ligands.
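
Written out for a perturbation from ligand A to ligand B, the relation can be sketched as follows. The labels A and B are introduced here for illustration only, and the sign convention (bound leg minus free leg) is assumed from the code used later in this notebook:

$$
\Delta\Delta G_{\mathrm{bind}}(A \rightarrow B) = \Delta G_{\mathrm{bound}}(A \rightarrow B) - \Delta G_{\mathrm{free}}(A \rightarrow B) = \Delta G_{\mathrm{bind}}(B) - \Delta G_{\mathrm{bind}}(A)
$$

The second equality holds because the thermodynamic cycle is closed, so the free energy changes around it sum to zero.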

## Perturbation map and directory structure

Often we have a large number of ligands for which we want to evaluate relative binding free energies. Take a look at the set of perturbations for the example lysozyme ligands we will be using. This is an example of a perturbation map you would design when using FESetup. As you can see, such maps quickly become complicated.

![map](images/perturbation_map.png)

Simulations were run for each arrow in the perturbation map. Each arrow stands for two simulations: one of the ligand in solution and one of the ligand bound to the protein lysozyme. ΔΔG is then computed as defined above, so each arrow represents the relative binding free energy between the two ligands it connects.

The image below summarises the directory structure used for the simulations. It is very similar to that of the ethane~methanol simulation, but now we have many more data sets than just the ethane~methanol one.

![datastruc](images/Directory_structure.png)

## How was the simulation generated?
The following is a typical way of reporting simulation details in a methods section. However, some crucial information is missing.


*Methods:   
Each simulation box was treated with **[boundary condition?]** and simulations were run for **[?]** ns each using a **[?]** fs integration timestep with a **[integrator ?]** integrator. Bonds involving hydrogens were constrained, except if the hydrogen atom was morphed to a heavy atom in the perturbation. The temperature was maintained at **[?]** K using an **[?]** thermostat and a collision frequency of **[?]** with velocities initially drawn from a Maxwell–Boltzmann distribution of that temperature. Pressure was kept at **[?]** atm using the Monte Carlo Barostat implemented in OpenMM with an update frequency of **[?]** MD steps. For non-bonded interactions an atom-based shifted Barker–Watts reaction field scheme was used with a cutoff of **[?]** Å and the reaction field dielectric constant **[?]**. Lambda values **[complete a sentence about lambda values]** [...]*

**Task: Can you extract this information from the simulation configuration input file?**

In [None]:
!cat images/siminfo.dat

## Running the analysis

### The old way -- analyse_freenrg mbar from the command line
Make use of the knowledge you have gained from the hydration free energy example to complete the task below:

**Task: Can you run `analyse_freenrg mbar` for indole~indene for both the bound and the solvated leg to compute the free energy difference in each state?** Then look at the output files and estimate the relative binding free energy using the formula above.

In [None]:
##Insert code here




### The new way -- running the analysis in the notebook 
You don't want to run this analysis manually for every perturbation. For this purpose we have created an interactive notebook widget. `NotebookHelper` is a class written to make interactive analysis as easy as possible. Below we create the `NotebookHelper` object and then initialise the notebook, which displays a collection of widgets for setting variables interactively.

In [None]:
nbh = NotebookHelper()

In the widget box below, set the parameters for the analysis: the path in which all perturbations can be found, the output directory where all output files should be collected, the number of initial frames to discard from each simulation, and whether or not to compute an overlap matrix.

Calling `nbh.update()` updates the variables with the values currently set in the widget box; you can do this at any point in time.

In [None]:
#The code below visualises the widget for interactively working with the data
ui = nbh.initialise_notebook()
display(ui)

In [None]:
nbh.update()

Once the simulation directory has been set, all perturbations are read from that base directory.

In [None]:
perturbation_list = nbh.perturbation_list

Let's check that we actually have the correct perturbations:

In [None]:
print(perturbation_list)

### Running a single analysis
Below we manually walk through the analysis of a bound simulation (protein + ligand) and a simulation of the ligand in water. Note: the same could be done for a simulation in vacuum and a ligand in water to obtain a hydration free energy.


We select the entry `indene~indole` of the `perturbation_list` for the analysis. As before, we need to read data from the `simfile.dat` input files that can be found in every λ directory. The variable `input_files` therefore needs to contain a list of all `simfile.dat` files generated in one alchemical simulation, either of the protein + ligand or of the ligand in water.

Let's start by finding the index of our `indene~indole` simulation in our perturbation list:

In [None]:
index = perturbation_list.index('indene~indole')

Now we can generate a list of all the simfiles for the bound and free simulations using the `glob` functionality.

In [None]:
input_files_bound = glob.glob(nbh._basedir+'/'+perturbation_list[index]+'/run001/bound/output/lambda-*/simfile.dat.bz2')
input_files_free = glob.glob(nbh._basedir+'/'+perturbation_list[index]+'/run001/free/output/lambda-*/simfile.dat.bz2')
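
`glob` makes no promise about the order in which it returns files, so it can be handy to sort the list by λ when inspecting it, for example to check that every window is present. The following is a minimal sketch on a throwaway directory tree. It assumes that the λ directories are named `lambda-<value>`, as the glob pattern above suggests; the helper `lambda_of` and the λ values are made up for the illustration and are not part of Sire.

```python
import glob
import os
import re
import tempfile


def lambda_of(path):
    """Extract the lambda value from a path containing a 'lambda-<value>' directory."""
    match = re.search(r"lambda-([0-9.]+)", path)
    if match is None:
        raise ValueError("no lambda directory in path: %s" % path)
    return float(match.group(1))


# Build a throwaway directory tree that mimics one simulation leg.
# The lambda values used here are made up for the illustration.
with tempfile.TemporaryDirectory() as basedir:
    for lam in ("1.000", "0.000", "0.500"):
        window = os.path.join(basedir, "output", "lambda-" + lam)
        os.makedirs(window)
        open(os.path.join(window, "simfile.dat.bz2"), "w").close()

    pattern = os.path.join(basedir, "output", "lambda-*", "simfile.dat.bz2")
    # Sort numerically by lambda rather than relying on the order glob returns.
    found = sorted(glob.glob(pattern), key=lambda_of)
    lambdas = [lambda_of(f) for f in found]

print(lambdas)
```

The same `sorted(..., key=lambda_of)` call could be applied to `input_files_bound` and `input_files_free` before printing them.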

Next we compute free energies from the `bound` and `free` simulations by passing the list of input files to the `compute_free_energies` function of the `NotebookHelper` class.    

What exactly does the `compute_free_energies` function do? We can take a look at its help text: press Shift+Tab in a cell where you have written `nbh.compute_free_energies(`

```python
    def compute_free_energies(self, input_files, TI = False):
        r"""computes free energies
        Parameters:
        -----------
        input_files : FILES
            list of simulation.dat files for a given lambda
        TI : boolean
            decides whether to also compute TI free energies or just MBAR free energies
            Default: False

        Returns:
        --------
        free_energies : FreeEnergy object
            object contains free energy differences 
        T : float
            temperature at which simulations were run, as recorded in the simfile.dat files
        """
```
It returns a free energy object and the simulation temperature as read from the simulation files. In the following we execute our function in order to get our free energy objects.

In [None]:
%%capture runinfo
bound, T = nbh.compute_free_energies(input_files_bound)
free, T = nbh.compute_free_energies(input_files_free)

In [None]:
runinfo.show()

`bound` and `free` are FreeEnergy objects that hold the results of an MBAR or thermodynamic integration free energy analysis, in particular the computed free energy differences and their errors. Next we manually compute the relative free energy difference of binding between indene and indole.

In [None]:
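# Scale the MBAR free energies by k_B * T, then take bound minus free
DDG = bound.deltaF_mbar * T * k_boltz - free.deltaF_mbar * T * k_boltz
# Combine the errors of the two legs in quadrature
dDDG = sqrt((bound.errorF_mbar * T * k_boltz)**2 + (free.errorF_mbar * T * k_boltz)**2)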

In [None]:
print('Free energy in kcal/mol and error in kcal/mol for:')
print('%s,%s,%.2f,%.2f kcal/mol' %(perturbation_list[index].split('~')[0],perturbation_list[index].split('~')[1],DDG,dDDG))
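
The arithmetic in the two cells above can be reproduced without Sire. The sketch below assumes that the MBAR values are stored in units of kT, which is what the multiplication by `T * k_boltz` suggests, and uses made-up numbers for both legs. The constant `K_BOLTZ_KCAL` and the helper `reduced_to_kcal` are names chosen for this illustration.

```python
import math

K_BOLTZ_KCAL = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def reduced_to_kcal(value_kt, temperature):
    """Convert a free energy given in units of kT into kcal/mol."""
    return value_kt * K_BOLTZ_KCAL * temperature


# Made-up reduced free energies and errors for the two legs.
temperature = 298.15
bound_kt, bound_err_kt = 4.0, 0.3
free_kt, free_err_kt = 3.0, 0.4

# Relative binding free energy: bound leg minus free leg.
ddg = reduced_to_kcal(bound_kt - free_kt, temperature)
# Errors of the two legs are treated as independent and added in quadrature.
ddg_err = reduced_to_kcal(math.hypot(bound_err_kt, free_err_kt), temperature)

print("ddG = %.2f +/- %.2f kcal/mol" % (ddg, ddg_err))
```

Adding the errors in quadrature is only justified if the bound and free simulations are statistically independent, which is the usual assumption for two separate runs.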

**Task: Can you compare the computed relative free energy of binding of indene and indole to an experimental value?**   
Indene~Indole: ΔΔG (indene~indole) = -4.89 -(-5.13) = 0.24 kcal/mol
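
One possible way to set up the comparison is sketched below. It assumes that the first experimental number quoted above belongs to indole and the second to indene, as the A~B ordering suggests. The computed estimate and its error are hypothetical placeholders that you would replace with `DDG` and `dDDG`.

```python
# Experimental binding free energies quoted above, in kcal/mol.
# Assumption: the first number belongs to indole, the second to indene.
dg_exp_indole = -4.89
dg_exp_indene = -5.13
ddg_exp = dg_exp_indole - dg_exp_indene

# Hypothetical computed estimate and uncertainty; replace with DDG and dDDG.
ddg_calc, ddg_calc_err = 0.60, 0.25

deviation = ddg_calc - ddg_exp
within_two_sigma = abs(deviation) <= 2 * ddg_calc_err
print("experiment: %.2f, deviation: %.2f kcal/mol" % (ddg_exp, deviation))
print("within two standard errors:", within_two_sigma)
```

The two-standard-error criterion is just one simple choice and ignores the experimental uncertainty, which is not given here.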

In [None]:
## Add some information here.



So what about TI? Can we compare the results from MBAR and TI as well?

Note that when we ran `compute_free_energies` above, `TI` was left at its default of `False`. What happens if you set `TI` to `True`?

**Task: Can you figure out how to also compute free energy information using TI?**

In [None]:
## Insert code that will run TI.



In [None]:
DDG_TI = bound.deltaF_ti * T * k_boltz - free.deltaF_ti * T * k_boltz

print('Free energy in kcal/mol for:')
print('%s,%s,%.2f kcal/mol' %(perturbation_list[index].split('~')[0],perturbation_list[index].split('~')[1],DDG_TI))

How well do the TI, MBAR and experimental results agree?
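
As a reminder of what TI does: it estimates ΔG by integrating the average gradient ⟨∂U/∂λ⟩ over λ from 0 to 1. The sketch below uses a simple trapezoidal rule on synthetic gradients. It is an illustration of the idea only; the quadrature Sire uses internally may differ, and the function `trapezoid` is a name chosen here.

```python
def trapezoid(x, y):
    """Integrate y over x with the trapezoidal rule."""
    total = 0.0
    for i in range(1, len(x)):
        total += 0.5 * (y[i] + y[i - 1]) * (x[i] - x[i - 1])
    return total


# Synthetic example: 11 evenly spaced lambda windows with gradient 2 * lambda.
lambdas = [i / 10 for i in range(11)]
gradients = [2.0 * lam for lam in lambdas]

# The exact integral of 2 * lambda from 0 to 1 is 1.
delta_g = trapezoid(lambdas, gradients)
print(delta_g)
```

Because the estimate depends on how well the λ windows resolve the gradient curve, a sharply varying gradient with coarse λ spacing is a common reason for TI and MBAR to disagree.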

### Assessing the quality and reliability of your results. 
In the ethane~methanol tutorial we looked at some ways to evaluate the robustness of the data, both before comparing results to experimental values and to identify reasons why a simulation might give a poor result. Two easy ways to check whether your estimate is likely to be good are the overlap matrix and the average gradients with respect to lambda. Let's take a look at how to compute these.

**Task: Can you plot the overlap matrix of the bound and free simulations?** 
Hint: using the FreeEnergy objects `bound` and `free` will make this very easy. For plotting an overlap matrix, the seaborn heatmap (see previous exercise) might be a good option.
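
Besides looking at the plot, you can inspect the numbers directly. The sketch below flags neighbouring λ windows whose overlap falls below a threshold. The matrix is made up, the function name is chosen for this illustration, and the default threshold of 0.03 is a commonly quoted rule of thumb that is assumed here rather than taken from this tutorial.

```python
def weak_neighbour_overlaps(overlap, threshold=0.03):
    """Return (i, i+1, value) for neighbouring windows with overlap below threshold."""
    weak = []
    for i in range(len(overlap) - 1):
        # The first off-diagonal element couples window i to window i + 1.
        value = overlap[i][i + 1]
        if value < threshold:
            weak.append((i, i + 1, value))
    return weak


# Made-up 4x4 overlap matrix; each row sums to 1.
overlap = [
    [0.70, 0.29, 0.01, 0.00],
    [0.29, 0.69, 0.02, 0.00],
    [0.01, 0.02, 0.57, 0.40],
    [0.00, 0.00, 0.40, 0.60],
]
print(weak_neighbour_overlaps(overlap))
```

In this made-up example the second and third windows barely overlap, which in a real analysis would suggest adding a λ window between them.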

In [None]:
## Insert code to plot the overlap matrix of the ligand bound to the protein.




In [None]:
## Insert code to plot the overlap matrix of the ligand free in solution. 




**Task: Can you plot the average gradient with respect to lambda with error bars for the protein bound and ligand in solution simulations?**
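
If you ever need to build the error bars yourself, the sketch below computes a mean and a naive standard error per λ window from made-up gradient samples. The helper `mean_and_sem` is a name chosen here, and the estimate assumes uncorrelated samples, which simulation time series usually are not, so treat the resulting error bars as a lower bound.

```python
import math
import statistics


def mean_and_sem(samples):
    """Return the mean and the naive standard error of the mean."""
    mean = statistics.mean(samples)
    sem = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, sem


# Made-up gradient samples (kcal/mol) for two lambda windows.
samples_per_window = {
    0.0: [1.0, 2.0, 3.0, 4.0],
    1.0: [-2.0, -2.0, -2.0, -2.0],
}
summary = {lam: mean_and_sem(s) for lam, s in samples_per_window.items()}
for lam, (mean, sem) in sorted(summary.items()):
    print("lambda %.1f: %.3f +/- %.3f" % (lam, mean, sem))
```

The means and standard errors can then be passed to matplotlib's `errorbar` function as the `y` and `yerr` arguments.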

In [None]:
## Insert code to plot the average gradient of the ligand bound to the protein.




In [None]:
## Insert code to plot the average gradient of the ligand free in solution. 




### Analysis in batch mode
Having the `bound` and `free` analysis for one set of simulations is useful, but what we really want is the relative binding free energy for all the perturbations we have run, ideally together with all the plots that tell us something about the quality of the results. The cell below executes the whole analysis for MBAR.

In [None]:
%%capture --no-display runinfo
#the two lines below are used to track progress with a progress bar in form of a widget.
#this is not necessary, but may be useful to figure out how much longer it may take to 
#execute one of these cells. 
pg_bar = widgets.IntProgress(min=0, max=len(perturbation_list),description="Perturbation analysis progress:", layout=layout, style = style)
display(pg_bar)

##DDG_list will contain the final data we are interested in. 
DDG_list = []
for pert in perturbation_list:
    print ("Working on perturbation: %s" %pert)
    sim_bound = '/run001/bound/output/lambda-*/simfile.dat.bz2'
    sim_free = '/run001/free/output/lambda-*/simfile.dat.bz2'
    input_files_bound = glob.glob(nbh._basedir+'/'+pert+sim_bound)
    input_files_free = glob.glob(nbh._basedir+'/'+pert+sim_free)
    result = nbh.run_free_energy_analysis(pert, input_files_bound, input_files_free)
    DDG_list.append(result)
    pg_bar.value+=1
    print ("Done.......")
    print ("---------------------------")

We can print the output information...

In [None]:
DDG_list

... and also write it to file.

In [None]:
nbh.write_free_energies(DDG_list)

**Task: Look at `runinfo` and inspect the standard output and standard error. What kind of warnings were generated? Should you be worried?**
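
The object created by `%%capture` exposes the captured text as the strings `runinfo.stdout` and `runinfo.stderr`. With many perturbations the output gets long, so a small filter can help. The sketch below works on a made-up string, and the helper `warning_lines` is a name chosen for this illustration.

```python
def warning_lines(text):
    """Return the lines of a captured output string that mention a warning."""
    return [line for line in text.splitlines() if "warning" in line.lower()]


# Made-up captured output; in the notebook one could pass runinfo.stdout
# or runinfo.stderr instead.
captured = "Working on perturbation: A~B\nUserWarning: example warning text\nDone......."
print(warning_lines(captured))
```

Filtering only narrows down where to look; whether a given warning matters still has to be judged from its content.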

In [None]:
##Insert code to look at runinfo




### Looking at the output
With the above we have generated a list of free energies, which was written to the output file given in the widget field. If the overlap matrix option was selected, all the overlap matrix plots will have been generated as well. (If you did not select the overlap matrix option, now would be a good time to do so and rerun the analysis.)

In [None]:
all_matrices = glob.glob(nbh._outputdir+'/*.png')

In [None]:
print ('matrix: %s' %all_matrices[1])
display(Image.open(all_matrices[1]))

**Task: Play around a little with some of the overlap matrix plots. What do you observe?**

**Task: Generate free energy estimates using TI.**   
Take the code from the previous 'batch' analysis mode and modify it in such a way that we can also compute the relative free energies for TI. Also try and include plots for the average gradients to have a look at the reliability of the TI results.  

In [None]:
## updating the output info for the free energy file might be useful, i.e. you can change the 
# directory you want to write things to etc. 
ui = nbh.initialise_notebook()
display(ui)

In [None]:
nbh.update()

In [None]:
%%capture --no-display runinfo
#the two lines below are used to track progress with a progress bar in form of a widget.
#this is not necessary, but may be useful to figure out how much longer it may take to 
#execute one of these cells. 
pg_bar = widgets.IntProgress(min=0, max=len(perturbation_list),description="Perturbation analysis progress:", layout=layout, style = style)
display(pg_bar)

##DDG_list_ti will contain the final TI data we are interested in. 
DDG_list_ti = []
for pert in perturbation_list:
    print ("Working on perturbation: %s" %pert)
    sim_bound = '/run001/bound/output/lambda-*/simfile.dat.bz2'
    sim_free = '/run001/free/output/lambda-*/simfile.dat.bz2'
    input_files_bound = glob.glob(nbh._basedir+'/'+pert+sim_bound)
    input_files_free = glob.glob(nbh._basedir+'/'+pert+sim_free)
    ## update the function below
    #----->!!!! Something needs to be done here!!!!!<--------#
    result = nbh.run_free_energy_analysis()
    DDG_list_ti.append(result)
    pg_bar.value+=1
    print ("Done.......")
    print ("---------------------------")

In [None]:
nbh.write_free_energies(DDG_list_ti)

In [None]:
#display them in the notebook output
DDG_list_ti

**Task: There is something odd going on with some of the TI binding free energy estimates. Have a look at the average gradients you generated. Does this give you a clue?**

In [None]:
#Insert some code here to display your average gradient plots



### Questions:
- Based on the generated output how reliable do you think the estimates are going to be for:
    - TI?
    - MBAR?
- Can you make suggestions on how you could improve the simulation protocol? Think of:
    - Perturbations
    - Lambda spacing
    - Simulation length
    - Repetitions

Congratulations, you have finished this tutorial!