## An introduction to xbowflow workflows

In this notebook you will see how a simple MD simulation job can be converted from its normal command-line form into a Python function using tools in *Xbowflow*.

Then you will see how easy it is to chain jobs together to create a workflow.

The notebook assumes you have a basic knowledge of *Amber*, and that *Amber* (or *Ambertools*) is installed on the computer you are running this notebook on.

Some knowledge of the Python package *MDTraj* may also help, but is not obligatory.

----

### Part 1: running jobs the conventional way
Have a look at the contents of this directory:

In [None]:
!ls

You should see:

    Xbowflow workflows 101.ipynb : This notebook
    dhfr.crd                     : Coordinates for DHFR in Amber .crd format
    dhfr.prmtop                  : Amber topology file for DHFR
    step1.mdin                   : An input file for sander/pmemd defining a restrained energy minimisation job
    step2.mdin                   : An input file for sander/pmemd defining an unrestrained energy minimisation job
    step3.mdin                   : An input file for sander/pmemd defining a short MD job
    
Let's begin by running the energy minimisation job interactively in the conventional way. 

In [None]:
!sander -O -i step1.mdin -c dhfr.crd -ref dhfr.crd -p dhfr.prmtop -o step1.mdout -r dhfr.step1.rst7

Assuming the job completed without errors, you should see the output files in the current directory:

In [None]:
!ls

Have a look at the log file:

In [None]:
!cat step1.mdout

### Part 2: Turning this into Python

OK. Now you will see how you can turn the energy minimisation job from something you run on the command line (in this situation, within a Jupyter notebook, by using the "!" special command) into a pure Python function.

The function will take a .crd file, a .prmtop file and a .mdin file as inputs, and return the .rst7 (restart) and .mdout (log) files when the job completes.

So your aim is something like this:

    restart, logfile = md(mdin, startcrds, prmtop)
    
---

Begin by importing the *xflowlib* module:

In [None]:
from xbowflow import xflowlib

Now you create the function, which in xflowlib is called a **Kernel**:

In [None]:
md = xflowlib.SubprocessKernel('sander -O -i {x.mdin} -c {x.rst7} -ref {x.rst7} -p {x.prmtop} -o {x.mdout} -r {out.rst7}')

You can see that creating the function (or "kernel") involves providing a "template" for the command you want to run. Variables in the template (e.g. files) are enclosed in braces ("{}"). Their names are completely up to you (e.g. you could use "system.crd", etc. instead of "x.rst7"), but in general make sure the filenames have the appropriate extensions.
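
To check which variable names a template contains, you can list its placeholders with plain Python. This is only an illustrative sketch using the standard `re` module; it does not use Xbowflow and makes no claim about how xflowlib parses templates internally.

```python
import re

# The command template used for the kernel above.
template = ('sander -O -i {x.mdin} -c {x.rst7} -ref {x.rst7} '
            '-p {x.prmtop} -o {x.mdout} -r {out.rst7}')

# Find every {name} placeholder, then keep each distinct name
# in order of first appearance (x.rst7 occurs twice in the template).
placeholders = list(dict.fromkeys(re.findall(r'\{([^{}]+)\}', template)))
print(placeholders)
```

Every name in this list should end up as an input, an output, or (as you will see later) a constant of the kernel.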

---

Now you have to tell the kernel what files are inputs, and what are outputs. To do this you pass *lists* of strings that correspond to the filenames in the template above. 

**NB:** the order of the strings in the inputs list defines the order that input variables will be passed to the kernel, and the order of the strings in the output list defines the order that the outputs from the function will appear in:

In [None]:
# Give the kernel the signature: restart, logfile = md.run(mdin, startcrds, prmtop)
md.set_inputs(['x.mdin', 'x.rst7', 'x.prmtop'])
md.set_outputs(['out.rst7', 'x.mdout'])
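
If the role of the two lists is unclear, the toy class below may help. It is a hypothetical stand-in written for this notebook (`ToyKernel` is not part of xflowlib, and this is not how xflowlib is implemented): it writes each positional argument to the file named at the same position in the inputs list, runs the command in a temporary directory, and returns the output files in the order given. It assumes a POSIX shell with `cat` available.

```python
import os
import subprocess
import tempfile


class ToyKernel:
    """Hypothetical illustration only; NOT the xflowlib implementation."""

    def __init__(self, template):
        self.template = template
        self.inputs = []
        self.outputs = []

    def set_inputs(self, names):
        self.inputs = list(names)

    def set_outputs(self, names):
        self.outputs = list(names)

    def run(self, *args):
        if len(args) != len(self.inputs):
            raise TypeError('expected {} inputs'.format(len(self.inputs)))
        with tempfile.TemporaryDirectory() as workdir:
            # The i-th argument becomes the file named by the i-th input.
            for name, data in zip(self.inputs, args):
                with open(os.path.join(workdir, name), 'wb') as f:
                    f.write(data)
            # Replace each {name} placeholder with the plain file name.
            command = self.template
            for name in self.inputs + self.outputs:
                command = command.replace('{' + name + '}', name)
            subprocess.run(command, shell=True, cwd=workdir, check=True)
            # Read the outputs back in the order given to set_outputs().
            results = []
            for name in self.outputs:
                with open(os.path.join(workdir, name), 'rb') as f:
                    results.append(f.read())
        return tuple(results)


template = 'cat {first.txt} {second.txt} > {joined.txt}'

cat = ToyKernel(template)
cat.set_inputs(['first.txt', 'second.txt'])
cat.set_outputs(['joined.txt'])
(joined,) = cat.run(b'alpha\n', b'beta\n')

# Same template, but the inputs list is reversed, so the first
# argument is now written to second.txt.
cat_swapped = ToyKernel(template)
cat_swapped.set_inputs(['second.txt', 'first.txt'])
cat_swapped.set_outputs(['joined.txt'])
(joined_swapped,) = cat_swapped.run(b'alpha\n', b'beta\n')

print(joined, joined_swapped)
```

The command itself never changes; only the order of the names in the inputs list decides which argument lands in which file.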

And that's about it - your new function is ready for use.

However, your data is not quite ready. Xbowflow is designed to distribute work across multiple workers that do not necessarily share a file system. So before you can use the function, you need to get the input files into suitable globally-accessible variables:

In [None]:
startcrds = xflowlib.load('dhfr.crd')
prmtop = xflowlib.load('dhfr.prmtop')
em_protocol_1 = xflowlib.load('step1.mdin')

Now you can run the function by calling its run() method:

In [None]:
restart, logfile = md.run(em_protocol_1, startcrds, prmtop)

At the moment your results are only stored in the variables 'restart' and 'logfile'. If you want to turn them into files, that's easy:

In [None]:
logfile.save('test.mdout')
restart.save('test.rst7')

Check they are there:

In [None]:
!cat test.mdout

### Part 3: A workflow

Let's make a workflow that runs both energy minimisations and then the MD stage all in one go.

In [None]:
# Create variables from the required input files:
em_protocol_2 = xflowlib.load('step2.mdin')
md_protocol = xflowlib.load('step3.mdin')

In [None]:
# Run the jobs. For clarity, we begin at the beginning again:
restart1, logfile1 = md.run(em_protocol_1, startcrds, prmtop)
print('first stage done...')
restart2, logfile2 = md.run(em_protocol_2, restart1, prmtop)
print('second stage done...')
restart3, logfile3 = md.run(md_protocol, restart2, prmtop)
print('third stage done.')
logfile3.save('stage3.mdout')

In [None]:
!cat stage3.mdout

Assuming all went OK, a few things to note:

1. You were able to re-use the same kernel for all three simulation stages.
2. Because you did this, you haven't captured the trajectory file that the third stage will have produced.
3. All three stages use the same prmtop file as an argument - in effect it's a constant.
    
Let's fix issue 2 first.

Make a new kernel that also returns a trajectory file:

In [None]:
md_with_traj = xflowlib.SubprocessKernel('sander -O -i {x.mdin} -c {x.rst7} -p {x.prmtop} -o {x.mdout} -r {out.rst7} -x {x.nc}')
md_with_traj.set_inputs(['x.mdin', 'x.rst7', 'x.prmtop'])
md_with_traj.set_outputs(['x.nc', 'out.rst7', 'x.mdout'])

Now let's make the prmtop file a constant in both kernels. This means it does not have to appear in the kernel argument list any more (but has the disadvantage that these kernels are now 'hard wired' to only work for DHFR):

In [None]:
md.set_constant('x.prmtop', prmtop)
md_with_traj.set_constant('x.prmtop', prmtop)
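
Fixing an argument in this way resembles partial application in ordinary Python. The sketch below shows the analogy with `functools.partial`; `run_md` is a hypothetical stand-in function, not an xflowlib kernel, and the strings are just labels rather than loaded files.

```python
from functools import partial


def run_md(mdin, startcrds, prmtop):
    """Hypothetical stand-in for a three-argument kernel call."""
    return 'ran {} on {} with {}'.format(mdin, startcrds, prmtop)


# Fix the topology once; the remaining signature is (mdin, startcrds).
run_dhfr_md = partial(run_md, prmtop='dhfr.prmtop')

result = run_dhfr_md('step1.mdin', 'dhfr.crd')
print(result)
```

As with the kernels above, the shorter call is more convenient, but `run_dhfr_md` is now tied to one system.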

### Part 4: Exercise - a better workflow

Your turn - rewrite the workflow to use these improved kernels:

In [None]:
# A workflow that runs a two-stage energy minimisation and then an MD simulation
# For clarity, start at the beginning:
restart1, logfile1 = md.run(em_protocol_1, startcrds)
print('first stage done...')
# Add your code below:
restart2, logfile2 = md.run(em_protocol_2, restart1)
print('second stage done...')
trajectory, restart3, logfile3 = md_with_traj.run(md_protocol, restart2)
print('final stage done.')

### Part 5: interfacing with more Python

At this stage you may be thinking "OK - but nothing here I couldn't do with a bash script". The power of the workflow comes when you interface your new pythonized-MD functions with other Python tools.

Let's make use of the *MDTraj* package for analysis of MD trajectory data. You will use it to calculate the RMSD of the trajectory frames from the starting structure.

If you are not yet familiar with MDTraj don't worry - what's below should be more or less self-explanatory.

The MDTraj load() function expects *filenames* as arguments - not the data those files contain. For this, you can use the as_file() method of the variables created by xflowlib.load() or returned by a kernel:

In [None]:
import mdtraj as mdt
traj = mdt.load(trajectory.as_file(), top=prmtop.as_file())
print(traj)
# Print the rmsd of each frame from the first:
print(mdt.rmsd(traj, traj[0], atom_indices=traj.topology.select('protein')))

Let's make your workflow identify which snapshot from your trajectory has the highest RMSD from the starting structure, and then energy minimise that:

In [None]:
import numpy as np
rmsdlist = mdt.rmsd(traj, traj[0], atom_indices=traj.topology.select('protein'))
i = np.argmax(rmsdlist)
print('Energy minimising snapshot {}'.format(i))
max_rmsd_minimised, logfile = md.run(em_protocol_2, traj[i])
max_rmsd_minimised.save('max_rmsd.rst7')
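
The frame selection step can be tried on its own with made-up numbers. The RMSD values below are invented for illustration (MDTraj reports RMSDs in nanometres); they are not results from the DHFR simulation.

```python
import numpy as np

# Made-up RMSD values (nm), one per frame; frame 0 is the reference.
rmsdlist = np.array([0.0, 0.08, 0.15, 0.12, 0.15])

# argmax returns the index of the largest value; when several frames
# tie for the maximum, the first of them is returned.
i = int(np.argmax(rmsdlist))
print('Frame with the highest RMSD: {}'.format(i))
```

Here frames 2 and 4 tie, so frame 2 is selected.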

### Part 6: Final exercise

Create a Python function that in effect does all the above: takes a set of starting coordinates, a topology file, and three md input files (two for energy minimisations, one for an MD run), runs the workflow and then returns the energy-minimised structure of the snapshot with the highest RMSD from the starting structure. The function should do everything, including creating the required kernels:

In [None]:
def my_workflow(crd_filename, top_filename, em_step1_filename, em_step2_filename, md_step_filename):
    # Over to you!
    # Load data:
    startcrds = xflowlib.load(crd_filename)
    topfile = xflowlib.load(top_filename)
    protocol_step_1 = xflowlib.load(em_step1_filename)
    protocol_step_2 = xflowlib.load(em_step2_filename)
    protocol_step_3 = xflowlib.load(md_step_filename)
    
    # Create kernels:
    
    md = xflowlib.SubprocessKernel('sander -O -i {x.mdin} -c {x.rst7} -ref {x.rst7} -p {x.prmtop} -o {x.mdout} -r {out.rst7}')
    md.set_inputs(['x.mdin', 'x.rst7'])
    md.set_outputs(['out.rst7'])
    md.set_constant('x.prmtop', topfile)
    
    md_with_traj = xflowlib.SubprocessKernel('sander -O -i {x.mdin} -c {x.rst7} -p {x.prmtop} -o {x.mdout} -r {out.rst7} -x {x.nc}')
    md_with_traj.set_inputs(['x.mdin', 'x.rst7'])
    md_with_traj.set_outputs(['x.nc'])
    md_with_traj.set_constant('x.prmtop', topfile)
    
    # Run workflow:
    restart1 = md.run(protocol_step_1, startcrds)
    print('first stage done...')
    restart2 = md.run(protocol_step_2, restart1)
    print('second stage done...')
    trajectory = md_with_traj.run(protocol_step_3, restart2)
    print('final stage done.')
    # Use the topology loaded inside this function, not the notebook-level prmtop:
    traj = mdt.load(trajectory.as_file(), top=topfile.as_file())
    rmsdlist = mdt.rmsd(traj, traj[0], atom_indices=traj.topology.select('protein'))
    i = np.argmax(rmsdlist)
    print('Energy minimising snapshot {}'.format(i))
    final_crds = md.run(protocol_step_2, traj[i])
    
    # Return final structure:
    return final_crds

# Test the workflow:
final_crds = my_workflow('dhfr.crd', 'dhfr.prmtop', 'step1.mdin', 'step2.mdin', 'step3.mdin')
final_crds.save('final_coordinates.rst7')