## An introduction to Crossflow workflows

In this notebook you will see how a simple MD simulation job can be converted from its normal command-line form into a Python function using tools in *Crossflow*.

Then you will see how easy it is to chain jobs together to create a workflow.

The notebook assumes you have a basic knowledge of *Gromacs* and the Python package *MDTraj*, and that both are installed on the computer on which you are running this notebook.

In addition it's assumed you have installed *Crossflow* (e.g. `pip install crossflow`).

----

### Part 1: Running jobs the conventional way
Have a look at the contents of this directory:

In [1]:
!ls

 bpti-em.edr	    bpti_min_1.gro		     em.log
'#bpti-em.edr.1#'   bpti_min_2.gro		     em.mdp
 bpti-em.gro	    bpti_min_3.gro		     final_coordinates.gro
'#bpti-em.gro.1#'   bpti_min_4.gro		     max_rmsd.gro
 bpti-em.log	    bpti_min_5.gro		     mdout.mdp
'#bpti-em.log.1#'   bpti_min_6.gro		     nvt.log
 bpti-em.tpr	    bpti_min_7.gro		     nvt.mdp
'#bpti-em.tpr.1#'   bpti_min_8.gro		     provision.dat
 bpti-em.trr	    bpti_min_9.gro		     README.md
'#bpti-em.trr.1#'   bpti.top			     test.gro
 bpti.gro	   'Crossflow 201.ipynb'	     test.log
 bpti_min_0.gro    'Crossflow workflows 101.ipynb'
 bpti_min_10.gro    dask-worker-space


You should see:

    Crossflow workflows 101.ipynb : This notebook
    bpti.gro                      : Coordinates for BPTI in Gromacs .gro format
    bpti.top                      : Gromacs topology file for BPTI
    em.mdp                        : A Gromacs .mdp file defining an energy minimisation job
    nvt.mdp                       : A Gromacs .mdp file defining a short NVT MD simulation
    
Let's begin by running the energy minimisation job interactively in the conventional way. 

First we run grompp:

In [2]:
!gmx grompp -f em.mdp -c bpti.gro -p bpti.top -o bpti-em.tpr

                      :-) GROMACS - gmx grompp, 2022.2 (-:

Executable:   /usr/remote/gromacs/2022.2/bin/gmx
Data prefix:  /usr/remote/gromacs/2022.2
Working dir:  /users/charlie/crossflow/examples/Notebooks/Gromacs
Command line:
  gmx grompp -f em.mdp -c bpti.gro -p bpti.top -o bpti-em.tpr

Ignoring obsolete mdp entry 'ns_type'

NOTE 1 [file em.mdp]:
  With Verlet lists the optimal nstlist is >= 10, with GPUs >= 20. Note
  that with the Verlet scheme, nstlist has no effect on the accuracy of
  your simulation.

Setting the LD random seed to -38023

Generated 2145 of the 2145 non-bonded parameter combinations
Generating 1-4 interactions: fudge = 0.5

Generated 2145 of the 2145 1-4 parameter combinations

Excluding 3 bonded neighbours molecule type 'Protein'

Excluding 2 bonded neighbours molecule type 'SOL'

Excluding 1 bonded neighbours molecule type 'CL'
Analysing residue names:
There are:    58    Protein residues
There are:  6541      Water residues
There are:     6        Ion resi

Assuming everything there went as expected, now we can run the energy minimisation itself:

In [3]:
! gmx mdrun -s bpti-em.tpr -c bpti-em.gro -g bpti-em.log -o bpti-em.trr -e bpti-em.edr

                      :-) GROMACS - gmx mdrun, 2022.2 (-:

Executable:   /usr/remote/gromacs/2022.2/bin/gmx
Data prefix:  /usr/remote/gromacs/2022.2
Working dir:  /users/charlie/crossflow/examples/Notebooks/Gromacs
Command line:
  gmx mdrun -s bpti-em.tpr -c bpti-em.gro -g bpti-em.log -o bpti-em.trr -e bpti-em.edr


Back Off! I just backed up bpti-em.log to ./#bpti-em.log.2#
Compiled SIMD: AVX_256, but for this host/run AVX2_256 might be better (see
log).
Reading file bpti-em.tpr, VERSION 2022.2 (single precision)
1 GPU selected for this run.
Mapping of GPU IDs to the 1 GPU task in the 1 rank on this node:
  PP:0
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PP task will update and constrain coordinates on the CPU
Using 1 MPI thread
Using 8 OpenMP threads 


Back Off! I just backed up bpti-em.trr to ./#bpti-em.trr.2#

Back Off! I just backed up bpti-em.edr to ./#bpti-em.edr.2#

Steepest Descents:
   Tolerance (Fmax)   =  1.00000e+03
   Number of steps    =      

Assuming the job completed without errors, you should see the output files in the current directory:

In [4]:
!ls

 bpti-em.edr	   '#bpti-em.trr.2#'  'Crossflow 201.ipynb'
'#bpti-em.edr.1#'   bpti.gro	      'Crossflow workflows 101.ipynb'
'#bpti-em.edr.2#'   bpti_min_0.gro     dask-worker-space
 bpti-em.gro	    bpti_min_10.gro    em.log
'#bpti-em.gro.1#'   bpti_min_1.gro     em.mdp
'#bpti-em.gro.2#'   bpti_min_2.gro     final_coordinates.gro
 bpti-em.log	    bpti_min_3.gro     max_rmsd.gro
'#bpti-em.log.1#'   bpti_min_4.gro     mdout.mdp
'#bpti-em.log.2#'   bpti_min_5.gro     nvt.log
 bpti-em.tpr	    bpti_min_6.gro     nvt.mdp
'#bpti-em.tpr.1#'   bpti_min_7.gro     provision.dat
'#bpti-em.tpr.2#'   bpti_min_8.gro     README.md
 bpti-em.trr	    bpti_min_9.gro     test.gro
'#bpti-em.trr.1#'   bpti.top	       test.log


Have a look at the log file:

In [None]:
!cat bpti-em.log

### Part 2: Turning this into Python

OK. Now you will see how you can turn the energy minimisation job from something you run on the command line (in this situation, within a Jupyter notebook, by using the "!" special command) into a pure Python function.

The function will take a .tpr file as the input, and return the .gro and .log files when the job completes. For now, you can assume you are not that bothered about what's in the .edr and .trr files.

So your aim is something like this:

    grofile, logfile = md(tprfile)
    
---

Begin by importing the required submodules from *Crossflow*:

In [5]:
from crossflow import filehandling, tasks, clients

Now you create the function, which in *Crossflow* is called a **Task**:

In [6]:
mdrun = tasks.SubprocessTask('gmx mdrun -s x.tpr -c x.gro -g x.log -e x.edr -o x.trr -ntmpi 1')

You can see that creating the task involves providing a template for the command you want to run. The names of the files in the template are completely up to you (e.g. you could use "system.tpr", etc. instead of "x.tpr") - but in general make sure the filenames have the appropriate extensions.

---

Now you have to tell the task what files are inputs, and what are outputs. To do this you pass *lists* of strings that correspond to the filenames in the template above. 

**NB:** the order of the strings in the inputs list defines the order that input variables will be passed to the task, and the order of the strings in the output list defines the order that the outputs from the function will appear in.

**NB2:** You only get the outputs you ask for. So although the job is going to produce trajectory (x.trr) and energy (x.edr) files, you are not going to see them.

In [7]:
# Give the task the signature: grofile, logfile = mdrun(tprfile)
mdrun.set_inputs(['x.tpr'])
mdrun.set_outputs(['x.gro', 'x.log'])
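
To make the roles of the template, inputs and outputs concrete, here is a minimal stand-alone sketch of the idea using only the Python standard library. It is not how *Crossflow* is implemented: the helper `make_task` is hypothetical, data is passed around as raw bytes rather than file handles, and a trivial Python one-liner stands in for `gmx mdrun` so that the example runs without Gromacs.

```python
import shlex
import subprocess
import sys
import tempfile
from pathlib import Path


def make_task(template, inputs, outputs):
    """Hypothetical helper: wrap a command template as a Python function."""
    def task(*args):
        with tempfile.TemporaryDirectory() as workdir:
            # Write each argument to the template filename it corresponds to,
            # in the order given by `inputs`.
            for name, data in zip(inputs, args):
                Path(workdir, name).write_bytes(data)
            # Run the command inside the scratch directory.
            subprocess.run(shlex.split(template), cwd=workdir, check=True)
            # Collect only the requested outputs, in the order given by
            # `outputs`; any other files the command wrote are discarded.
            return tuple(Path(workdir, name).read_bytes() for name in outputs)
    return task


# Stand-in command: reads x.in, writes x.out and x.extra.
script = (
    "from pathlib import Path; "
    "text = Path('x.in').read_text(); "
    "Path('x.out').write_text(text.upper()); "
    "Path('x.extra').write_text('not requested')"
)
template = f'{shlex.quote(sys.executable)} -c "{script}"'

upper = make_task(template, inputs=['x.in'], outputs=['x.out'])
(result,) = upper(b'hello')
print(result)
```

Note that `x.extra` is written by the command but never returned, mirroring the point in **NB2** that you only get the outputs you ask for.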

*Crossflow* runs your job on a **cluster**. This can be anything from a number of processes/threads on your current computer, to an HPC cluster or a set of resources in the cloud. The way you create the cluster depends on which of these you choose, but once it's up and running, the way you run jobs on it via *Crossflow* is always the same.

In this case we are going to run the jobs on the local computer, which we assume can really only run one MD job at a time, so we create a **LocalCluster** with one **worker**. The code to do this comes from the `dask.distributed` package which underpins *Crossflow* and which you therefore have already installed:

In [8]:
from distributed import LocalCluster
cluster = LocalCluster(n_workers=1)

2022-10-05 15:24:58,738 - distributed.diskutils - INFO - Found stale lock file and directory '/users/charlie/crossflow/examples/Notebooks/Gromacs/dask-worker-space/worker-ogvg37_o', purging


Now create a *Crossflow* **client** to serve your cluster, and then submit the job (the function and its arguments) to it:

In [9]:
client = clients.Client(cluster)
fh = filehandling.FileHandler()
tprfile = fh.load('bpti-em.tpr')
grofile, logfile = client.submit(mdrun, tprfile)

The job is now submitted for computation, which takes place in the background. The output variables `grofile` and `logfile` are `Futures`, whose final values are obtained by calling their `result()` methods. Use this to save the outputs to local files:

In [11]:
print(grofile)
logfile.result().save('test.log')
grofile.result().save('test.gro')

<Future: finished, type: crossflow.filehandling.FileHandle, key: lambda-9d021a68afb22c0a31573eb863c0aa9a>


'test.gro'
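
If futures are new to you, the same submit-then-`result()` pattern can be tried out with Python's built-in `concurrent.futures` module. This is only an analogy: the futures returned by *Crossflow* come from `dask.distributed`, and `slow_square` is a made-up stand-in for a long-running job.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_square(x):
    """Stand-in for a long-running job."""
    time.sleep(0.2)
    return x * x


with ThreadPoolExecutor(max_workers=1) as pool:
    # submit() returns immediately with a Future; the work runs in the background.
    future = pool.submit(slow_square, 7)
    # result() blocks until the job has finished, then returns its value.
    value = future.result()

print(value)
```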

In [None]:
!cat test.log

In [None]:
!cat test.gro

### Part 3: A workflow

Let's make a workflow that runs a grompp job, then immediately the md (or energy minimisation) job.

You already have a task that can run *mdrun*, but you need to build one to run *grompp*:

In [12]:
# Build a task with the signature: tprfile = grompp(mdpfile, grofile, topfile):
grompp = tasks.SubprocessTask('gmx grompp -f x.mdp -c x.gro -p x.top -o x.tpr -maxwarn 1')
grompp.set_inputs(['x.mdp', 'x.gro', 'x.top'])
grompp.set_outputs(['x.tpr'])

See if it works:

In [13]:
# Create variables from the required input files:
emfile = 'em.mdp'
start_crds = 'bpti.gro'
topfile = 'bpti.top'
# Run the job:
em_tprfile = client.submit(grompp, emfile, start_crds, topfile)

The output from this task should be ready for use in the mdrun task - let's see:

In [14]:
# Now the energy minimisation:
em_crds, em_logfile = client.submit(mdrun, em_tprfile)
em_logfile.result().save('em.log')

'em.log'

In [None]:
!cat em.log

**Note**: You may have spotted that the first time you ran `client.submit(mdrun, ...)` the argument was a file handle loaded from a file on disk, but the second time it was a *future* that points at a tprfile object. Likewise, the `grompp` task was simply given filenames as strings. That's fine - the `client.submit()` function works all that out for you.
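
One way to picture what happens when a future is passed as an argument is sketched below with the standard library. It is an illustration only, not the mechanism used by *Crossflow* or `dask.distributed`; `submit_resolving`, `fake_grompp` and `fake_mdrun` are hypothetical names, and the "files" are just strings.

```python
from concurrent.futures import Future, ThreadPoolExecutor


def submit_resolving(pool, func, *args):
    """Submit func, replacing any Future arguments by their results first."""
    def run():
        resolved = [a.result() if isinstance(a, Future) else a for a in args]
        return func(*resolved)
    return pool.submit(run)


def fake_grompp(mdp, gro, top):
    return f'tpr({mdp},{gro},{top})'


def fake_mdrun(tpr):
    return f'gro[{tpr}]'


with ThreadPoolExecutor(max_workers=2) as pool:
    tpr_future = submit_resolving(pool, fake_grompp, 'em.mdp', 'bpti.gro', 'bpti.top')
    # The second job receives a Future, not a value; it is resolved when the job runs.
    gro_future = submit_resolving(pool, fake_mdrun, tpr_future)
    final = gro_future.result()

print(final)
```

The caller never waits for the first job explicitly: the dependency is followed only when the second job actually needs its input.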

### Part 4: Exercise - a bigger workflow

Now we add the second simulation stage - the NVT MD - into your workflow.

Notice you don't need to make any new tasks - you can re-use the ones you have.

In [15]:
# A workflow that runs an energy minimisation and then an NVT MD simulation
em_tprfile = client.submit(grompp, emfile, start_crds, topfile)
em_crds, em_logfile = client.submit(mdrun, em_tprfile)
nvtfile = 'nvt.mdp'
nvt_tprfile = client.submit(grompp, nvtfile, em_crds, topfile)
nvt_crds, nvt_logfile = client.submit(mdrun, nvt_tprfile)
nvt_logfile.result().save('nvt.log')

'nvt.log'

### Part 5: A better workflow

Let's improve the workflow. Firstly, it would be nice if the NVT simulation job also returned the trajectory file. You don't want this for the EM job, so what that means is that you need to make a second mdrun-type task. Here it is:

In [16]:
mdrun_with_traj = mdrun.copy()
mdrun_with_traj.set_outputs(['x.gro', 'x.log', 'x.trr'])

The copy() convenience method saves you having to rewrite the task from scratch if it's just a tweak on an existing one. But it is also necessary if you want to tweak a task that has already been used in a client.submit() call (if you want to understand why, see the dask.distributed documentation about 'pure' vs. 'impure' functions).

Secondly, notice that both grompp jobs in the workflow above take the same topology file as an argument - in effect, it's a constant. In such cases, you can define it as such at the time you create the task, and then you don't have to include it in the list of arguments when you call it:

In [17]:
grompp2 = grompp.copy()
grompp2.set_constant('x.top', topfile)
# Now the new improved workflow:
em_tprfile = client.submit(grompp2, emfile, start_crds) # no need to specify a topfile here
em_crds, em_logfile = client.submit(mdrun, em_tprfile)
nvt_tprfile = client.submit(grompp2, nvtfile, em_crds)
nvt_crds, nvt_logfile, nvt_traj = client.submit(mdrun_with_traj, nvt_tprfile)
nvt_logfile.result().save('nvt.log')

'nvt.log'
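
Fixing an argument ahead of time is the same idea as partial application in plain Python. The sketch below uses `functools.partial` on a made-up `fake_grompp` function to show the effect; it is an analogy for `set_constant()`, not a description of how *Crossflow* implements it.

```python
from functools import partial


def fake_grompp(mdp, gro, top):
    """Made-up stand-in that just reports the arguments it was given."""
    return f'grompp -f {mdp} -c {gro} -p {top}'


# Bind the topology once; callers now pass only the two varying arguments.
fake_grompp2 = partial(fake_grompp, top='bpti.top')
command = fake_grompp2('nvt.mdp', 'bpti-em.gro')
print(command)
```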

### Part 6: Interfacing with more Python

At this stage you may be thinking "OK - but nothing here I couldn't do with a bash script". The power of the workflow comes when you interface your new pythonized-MD functions with other Python tools.

Let's make use of the *MDTraj* package for analysis of MD trajectory data. You will use it to calculate the RMSD of the trajectory frames from the starting structure.

In [18]:
import mdtraj as mdt
traj = mdt.load(nvt_traj.result(), top=start_crds)
print(traj)
# Calculate the rmsd of each frame from the first:
print(mdt.rmsd(traj, traj[0], atom_indices=traj.topology.select('protein')))

<mdtraj.Trajectory with 11 frames, 20521 atoms, 6605 residues, and unitcells>
[0.         0.08701991 0.09370182 0.1102126  0.11426546 0.11102293
 0.1252876  0.12361396 0.13193893 0.12360178 0.13172925]
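
As a reminder of what these numbers mean, the usual definition of the RMSD of frame $t$ from the reference frame, after optimal superposition, is given below. The notation is introduced here for illustration and is not taken from the *MDTraj* documentation: $N$ is the number of selected (protein) atoms, $\mathbf{x}_i(t)$ is the position of atom $i$ in frame $t$, and $R$ and $\mathbf{b}$ are the rotation and translation that best superimpose the frame on the reference.

$$
\mathrm{RMSD}(t) = \min_{R,\,\mathbf{b}} \sqrt{\frac{1}{N}\sum_{i=1}^{N} \left\| R\,\mathbf{x}_i(t) + \mathbf{b} - \mathbf{x}_i(0) \right\|^{2}}
$$

The values are in nanometres, the length unit used by *MDTraj*, and the first entry is zero because frame 0 is compared with itself.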


Let's make your workflow identify which snapshot from your trajectory has the highest RMSD from the starting structure, and then energy minimise that (this script may raise a warning from `distributed` - don't worry about that):

In [19]:
import numpy as np
rmsdlist = mdt.rmsd(traj, traj[0], atom_indices=traj.topology.select('protein'))
i = np.argmax(rmsdlist)
print('Energy minimising snapshot {}'.format(i))
selected_snapshot = traj[i]
em2_tprfile = client.submit(grompp2, emfile, selected_snapshot)
em2_crds, em2_logfile = client.submit(mdrun, em2_tprfile)
em2_crds.result().save('max_rmsd.gro')

Energy minimising snapshot 8


  ('em.mdp', <mdtraj.Trajectory with 1 frames, 20521 ... x7fd3b2ee7a00>)
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and 
keep data on workers

    future = client.submit(func, big_data)    # bad

    big_future = client.scatter(big_data)     # good
    future = client.submit(func, big_future)  # good


'max_rmsd.gro'
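
The selection step can be checked by hand against the RMSD values printed earlier in this part. This sketch uses plain Python instead of NumPy, with the values copied from that output; the index it finds should match the snapshot number reported by the workflow for that run.

```python
# RMSD values (nm) copied from the output of the earlier cell.
rmsds = [0.0, 0.08701991, 0.09370182, 0.1102126, 0.11426546, 0.11102293,
         0.1252876, 0.12361396, 0.13193893, 0.12360178, 0.13172925]

# Index of the largest value: a plain-Python equivalent of np.argmax(rmsds).
i = max(range(len(rmsds)), key=rmsds.__getitem__)
print('Energy minimising snapshot {}'.format(i))
```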

### Part 7: Putting it all together

Here is a Python function that in effect does all of the above. It takes a set of starting coordinates, a topology file, and two .mdp files (one for an energy minimisation, one for an MD run), runs the workflow, and returns the energy-minimised structure of the snapshot with the highest RMSD from the starting structure. The function does everything, including creating the required tasks:

In [20]:
def my_workflow(crd_filename, top_filename, em_mdp_filename, md_mdp_filename):
    # Relies on `tasks`, `client`, `mdt` and `np` defined earlier in this notebook.
    
    # Create tasks:
    grompp = tasks.SubprocessTask('gmx grompp -f x.mdp -c x.gro -p x.top -o x.tpr -maxwarn 1')
    grompp.set_inputs(['x.mdp', 'x.gro'])
    grompp.set_constant('x.top', top_filename)
    grompp.set_outputs(['x.tpr'])
    
    mdrun = tasks.SubprocessTask('gmx mdrun -s x.tpr -c x.gro -g x.log -e x.edr -o x.trr -ntmpi 1')
    mdrun.set_inputs(['x.tpr'])
    mdrun.set_outputs(['x.gro', 'x.log'])
    
    mdrun_with_traj = mdrun.copy()
    mdrun_with_traj.set_outputs(['x.gro', 'x.log', 'x.trr'])
    
    # Run workflow (note the nested client.submit() calls - compact but not necessary!):
    em_crds, em_logfile = client.submit(mdrun, client.submit(grompp, em_mdp_filename, crd_filename))
    md_crds, md_logfile, md_traj = client.submit(mdrun_with_traj, client.submit(grompp, md_mdp_filename, em_crds))
    traj = mdt.load(md_traj.result(), top=crd_filename)
    rmsdlist = mdt.rmsd(traj, traj[0], atom_indices=traj.topology.select('protein'))
    i = np.argmax(rmsdlist)
    print('Energy minimising snapshot {}'.format(i))
    em2_crds, em2_logfile = client.submit(mdrun, client.submit(grompp, em_mdp_filename, traj[i]))
    
    # Return final structure:
    return em2_crds.result()

# Test the workflow:
final_crds = my_workflow('bpti.gro', 'bpti.top', 'em.mdp', 'nvt.mdp')
final_crds.save('final_coordinates.gro')

Energy minimising snapshot 10


'final_coordinates.gro'