## An introduction to Crossflow workflows

In this notebook you will see how a simple MD simulation job can be converted from its normal command-line form into a Python function using tools in *Crossflow*.

Then you will see how easy it is to chain jobs together to create a workflow.

Requirements:

1. *Amber* or *AmberTools* installed.
2. Python packages *MDTraj* and *crossflow* installed.


The notebook assumes you have a basic knowledge of *Amber*; some knowledge of *MDTraj* may also help, but is not obligatory.

----

### Part 1: Running jobs the conventional way
Have a look at the contents of this directory:

In [None]:
!ls

You should see:

    Crossflow workflows 101.ipynb   : This notebook
    dhfr.crd                        : Coordinates for DHFR in Amber .crd format
    dhfr.prmtop                     : Amber topology file for DHFR
    step1.mdin                      : An input file for sander/pmemd defining a restrained energy minimisation job
    step2.mdin                      : An input file for sander/pmemd defining an unrestrained energy minimisation job
    step3.mdin                      : An input file for sander/pmemd defining a short MD job
    
Let's begin by running the restrained energy minimisation job interactively in the conventional way (note: depending on your Amber installation, you may need to replace "pmemd" with whichever executable works for you, e.g. "sander").

In [None]:
!pmemd -O -i step1.mdin -c dhfr.crd -ref dhfr.crd -p dhfr.prmtop -o step1.mdout -r dhfr.step1.rst7

Assuming the job completed without errors, you should see the output files in the current directory:

In [None]:
!ls

Have a look at the log file:

In [None]:
!cat step1.mdout

### Part 2: Turning this into Python

OK. Now you will see how you can turn the energy minimisation job from something you run on the command line (in this situation, within a Jupyter notebook, by using the "!" special command) into a pure Python function.

The function will take a .mdin file, a .crd file and a .prmtop file as inputs, and return the restart (.rst7) and log (.mdout) files when the job completes.

So your aim is something like this:

    restart, logfile = md(mdin, startcrds, prmtop)
    
---

Begin by importing key elements of the *crossflow* module:

In [None]:
from crossflow import filehandling, tasks, clients

Now you create the function, which in crossflow is called a **Task**:

In [None]:
md = tasks.SubprocessTask('pmemd -O -i x.mdin -c x.rst7 -ref x.rst7 -p x.prmtop -o x.mdout -r out.rst7')

The string used to create the task is a template for the command you want to run. The names of input and output files are arbitrary, but make sure they have the right extensions.

---

Now you have to tell the task which files are inputs and which are outputs. To do this you pass *lists* of strings that correspond to the filenames in the template above.

**NB:** the order of the strings in the inputs list defines the order that input variables will be passed to the task, and the order of the strings in the output list defines the order that the outputs from the function will appear in:

In [None]:
# Give the task the signature: restart, logfile = md(mdin, startcrds, prmtop)
md.set_inputs(['x.mdin', 'x.rst7', 'x.prmtop'])
md.set_outputs(['out.rst7', 'x.mdout'])
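As an aside, the way the two lists define the call signature can be pictured with a toy stand-in written using only the standard library. This is a minimal sketch, not crossflow's actual implementation: `make_task` is a hypothetical helper, the "program" is a Python one-liner rather than an MD engine, and data is passed as plain strings rather than file handles.

```python
import shlex
import subprocess
import sys
import tempfile
from pathlib import Path


def make_task(template, inputs, outputs):
    """Toy stand-in for a command-line task (hypothetical, not crossflow code)."""
    def task(*args):
        if len(args) != len(inputs):
            raise TypeError('expected {} arguments'.format(len(inputs)))
        with tempfile.TemporaryDirectory() as workdir:
            # Positional arguments are matched to input filenames by order.
            for name, data in zip(inputs, args):
                Path(workdir, name).write_text(data)
            subprocess.run(shlex.split(template), cwd=workdir, check=True)
            # Results come back in the order given by the outputs list.
            return tuple(Path(workdir, name).read_text() for name in outputs)
    return task


# A Python one-liner plays the role of the MD program: it upper-cases
# the first file into the third and reverses the second into the fourth.
script = (
    "import sys; a, b, up, rev = sys.argv[1:]; "
    "open(up, 'w').write(open(a).read().upper()); "
    "open(rev, 'w').write(open(b).read()[::-1])"
)
template = '{} -c {} a.txt b.txt up.txt rev.txt'.format(
    shlex.quote(sys.executable), shlex.quote(script)
)

# Signature: reversed_b, upper_a = toy(a, b)
toy = make_task(template, inputs=['a.txt', 'b.txt'], outputs=['rev.txt', 'up.txt'])
reversed_b, upper_a = toy('hello', 'abc')
print(reversed_b, upper_a)
```

Note that the outputs come back in the order of the outputs list, not the order in which the filenames appear in the command. The real `md` task defined above follows the same ordering rule.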

And that's about it - your new function is ready for use.

However, your data is not quite ready. Crossflow is designed to distribute work across multiple workers that do not necessarily share a file system. So before you can use the function, you need to get the input files into suitable globally-accessible variables:

In [None]:
fh = filehandling.FileHandler()
startcrds = fh.load('dhfr.crd')
prmtop = fh.load('dhfr.prmtop')
em_protocol_1 = fh.load('step1.mdin')

Although it's possible to run crossflow tasks interactively, they are normally run via a crossflow `Client`, so you create one of these, running locally on this machine:

In [None]:
client = clients.Client()

Now you can submit the function, with its input arguments, to the client:

In [None]:
restart, logfile = client.submit(md, em_protocol_1, startcrds, prmtop)

The submit call returns immediately, because the job has only been submitted to the client; it may not have finished yet. The output arguments, `restart` and `logfile`, are `Futures`. You get at the real data, possibly after some wait, by calling the `Future`'s `.result()` method:

In [None]:
logfile.result().save('test.mdout')
restart.result().save('test.rst7')
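If futures are new to you, the submit-then-wait pattern can be tried out with Python's standard `concurrent.futures` module. This is only an analogy: the sketch below does not use crossflow at all, and `slow_square` is a made-up stand-in for a long-running job.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_square(x):
    """Stand-in for a long-running job."""
    time.sleep(0.2)
    return x * x


with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(slow_square, 7)   # returns straight away
    print('submitted; finished yet?', future.done())
    value = future.result()                # blocks until the job completes
    print('result:', value)
```

The call to `.result()` is the only place where the code waits, which is the same behaviour you rely on when saving `logfile.result()` above.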

Check they are there:

In [None]:
!cat test.mdout

### Part 3: A workflow

Let's make a workflow that:
1. Runs a restrained energy minimisation.
2. Runs an unrestrained energy minimisation on the final coordinates from step 1.
3. Runs a short MD simulation on the final coordinates from step 2.

The md input files for steps 2 and 3 are already prepared, and just need to be loaded:

In [None]:
# Create variables from the required input files:
em_protocol_2 = fh.load('step2.mdin')
md_protocol = fh.load('step3.mdin')

In [None]:
# Run the jobs. For clarity, we begin at the beginning again:
restart1, logfile1 = client.submit(md, em_protocol_1, startcrds, prmtop)
print('first stage submitted...')
restart2, logfile2 = client.submit(md, em_protocol_2, restart1, prmtop)
print('second stage submitted...')
restart3, logfile3 = client.submit(md, md_protocol, restart2, prmtop)
print('third stage submitted.')
# Now we wait for logfile3 to appear:
logfile3.result().save('stage3.mdout')

In [None]:
!cat stage3.mdout

Assuming all went OK, a few things to note:
1. You were able to re-use the same task for all three simulation stages.
2. Because you did this, you haven't captured the trajectory file that the third stage will have produced.
3. All three tasks use the same prmtop file as an argument - in effect it's a constant.
    
Let's fix issue 2 first.

Make a new task that also returns a trajectory file:

In [None]:
md_with_traj = tasks.SubprocessTask('pmemd -O -i x.mdin -c x.rst7 -p x.prmtop -o x.mdout -r out.rst7 -x x.nc')
md_with_traj.set_inputs(['x.mdin', 'x.rst7', 'x.prmtop'])
md_with_traj.set_outputs(['x.nc', 'out.rst7', 'x.mdout'])

Now let's make the prmtop file a constant in both tasks. This means it does not have to appear in the task argument list any more (but has the disadvantage that these tasks are now 'hard wired' to only work for DHFR):

In [None]:
md2 = md.copy() # Take a copy of the previous version of the task
md2.set_constant('x.prmtop', prmtop)
md_with_traj.set_constant('x.prmtop', prmtop)
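Fixing one argument of a task is conceptually similar to binding an argument of an ordinary Python function with `functools.partial`. The sketch below illustrates that idea only; `run_md` is a hypothetical function that just builds a string, and nothing here reflects how crossflow implements `set_constant` internally.

```python
from functools import partial


def run_md(mdin, startcrds, prmtop):
    """Hypothetical stand-in for a three-argument task."""
    return 'ran {} on {} with {}'.format(mdin, startcrds, prmtop)


# Bind the topology once; the new callable takes only two arguments.
run_dhfr_md = partial(run_md, prmtop='dhfr.prmtop')

message = run_dhfr_md('step1.mdin', 'dhfr.crd')
print(message)
```

The trade-off is the same as noted above: the shorter call is convenient, but the bound callable only makes sense for the one system whose topology was fixed.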

### Part 4: A better workflow

Here's the workflow re-written to use these improved tasks:

In [None]:
# A workflow that runs a two-stage energy minimisation and then an MD simulation
restart1, logfile1 = client.submit(md2, em_protocol_1, startcrds)
print('first stage submitted...')
restart2, logfile2 = client.submit(md2, em_protocol_2, restart1)
print('second stage submitted...')
trajectory, restart3, logfile3 = client.submit(md_with_traj, md_protocol, restart2)
print('final stage submitted, waiting for trajectory file to appear:')
t = trajectory.result()
print('All done.')

### Part 5: Interfacing with more Python

At this stage you may be thinking "OK - but nothing here I couldn't do with a bash script". The power of the workflow comes when you interface your new pythonized-MD functions with other Python tools.

Let's make use of the *MDTraj* package for analysis of MD trajectory data. You will use it to calculate the RMSD of the trajectory frames from the starting structure.

The MDTraj `load()` function expects *filenames* as arguments; crossflow `FileHandles` subclass `os.PathLike`, so they can often be used in Python anywhere a *path* is expected, e.g.:

In [None]:
with open(logfile3.result()) as f:
    for line in f:
        print(line, end='')  # each line already ends with a newline
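To see why this works, here is a minimal sketch of the `os.PathLike` protocol itself. `ToyHandle` is a hypothetical class written for this illustration (it is not crossflow's `FileHandle`); all it does is return a real path from `__fspath__`, which is what `open()` and similar functions ask for.

```python
import os
import tempfile


class ToyHandle(os.PathLike):
    """Hypothetical path-like wrapper (not crossflow's FileHandle)."""

    def __init__(self, path):
        self._path = path

    def __fspath__(self):
        # Called by open(), os.fspath() and friends to obtain the real path.
        return self._path


with tempfile.TemporaryDirectory() as tmpdir:
    filename = os.path.join(tmpdir, 'example.mdout')
    with open(filename, 'w') as f:
        f.write('NSTEP = 1\n')

    handle = ToyHandle(filename)
    with open(handle) as f:          # open() accepts the handle directly
        first_line = f.readline()

print(first_line.strip())
```

Any library that accepts path-like objects can be given such a handle, which is why a trajectory handle can be passed straight to MDTraj.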

So:

In [None]:
import mdtraj as mdt

traj = mdt.load(t, top=prmtop)
print(traj)
# Print the rmsd of each frame from the first:
print(mdt.rmsd(traj, traj[0], atom_indices=traj.topology.select('protein')))

Let's make your workflow identify which snapshot from your trajectory has the highest RMSD from the starting structure, and then energy minimise that:

In [None]:
import numpy as np
rmsdlist = mdt.rmsd(traj, traj[0], atom_indices=traj.topology.select('protein'))
i = np.argmax(rmsdlist)
chosen_snapshot = traj[i]
print('Energy minimising snapshot {}'.format(i))
max_rmsd_minimised, logfile = client.submit(md2, em_protocol_2, chosen_snapshot)
max_rmsd_minimised.result().save('max_rmsd.rst7')
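The selection logic (compute an RMSD per frame, then take the index of the largest) can be checked by hand on a tiny example. The sketch below is a simplified pure-Python illustration with made-up coordinates: unlike MDTraj's `rmsd`, it does not superpose each frame on the reference first, so it is not a substitute for the call above.

```python
import math


def rmsd(frame, reference):
    """RMSD between two frames, each a list of (x, y, z) tuples; no superposition."""
    total = sum(
        (a - b) ** 2
        for atom, ref_atom in zip(frame, reference)
        for a, b in zip(atom, ref_atom)
    )
    return math.sqrt(total / len(frame))


# Three made-up frames of a two-atom system (illustrative values only).
frames = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0)],
    [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0)],
]

rmsds = [rmsd(frame, frames[0]) for frame in frames]
# Index of the largest value, the pure-Python equivalent of np.argmax:
i_max = max(range(len(rmsds)), key=rmsds.__getitem__)
print(rmsds, i_max)
```

Here the second frame (index 1) is furthest from the first, so that is the one a workflow like the one above would go on to minimise.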

### Part 6: Putting everything together

Here's the complete workflow:

In [None]:
def my_workflow(crd_filename, top_filename, em_step1_filename, em_step2_filename, md_step_filename):
    # Note: relies on `client`, `mdt` and `np` defined earlier in the notebook.
    # Load data:
    fh = filehandling.FileHandler()
    startcrds = fh.load(crd_filename)
    topfile = fh.load(top_filename)
    protocol_step_1 = fh.load(em_step1_filename)
    protocol_step_2 = fh.load(em_step2_filename)
    protocol_step_3 = fh.load(md_step_filename)
    
    # Create tasks:
    
    md = tasks.SubprocessTask('pmemd -O -i x.mdin -c x.rst7 -ref x.rst7 -p x.prmtop -o x.mdout -r out.rst7')
    md.set_inputs(['x.mdin', 'x.rst7'])
    md.set_outputs(['out.rst7'])
    md.set_constant('x.prmtop', topfile)
    
    md_with_traj = tasks.SubprocessTask('pmemd -O -i x.mdin -c x.rst7 -p x.prmtop -o x.mdout -r out.rst7 -x x.nc')
    md_with_traj.set_inputs(['x.mdin', 'x.rst7'])
    md_with_traj.set_outputs(['x.nc'])
    md_with_traj.set_constant('x.prmtop', topfile)
    
    # Run workflow:
    restart1 = client.submit(md, protocol_step_1, startcrds)
    restart2 = client.submit(md, protocol_step_2, restart1)
    print('running MD steps')
    trajectory = client.submit(md_with_traj, protocol_step_3, restart2).result()
    traj = mdt.load(trajectory, top=topfile)
    rmsdlist = mdt.rmsd(traj, traj[0], atom_indices=traj.topology.select('protein'))
    i = np.argmax(rmsdlist)
    print('Energy minimising snapshot {}'.format(i))
    final_crds = client.submit(md, protocol_step_2, traj[i])
    
    # Return final structure:
    return final_crds.result()

# Test the workflow:
final_crds = my_workflow('dhfr.crd', 'dhfr.prmtop', 'step1.mdin', 'step2.mdin', 'step3.mdin')
final_crds.save('final_coordinates.rst7')