## Introduction
This notebook illustrates how to create workflows using *Crossflow*, and then run them efficiently using *Crossflow*'s distributed computing capabilities.

It assumes you have worked through the "Crossflow 101" notebook or similar, and understand how to create *tasks* that run command-line applications.

 - If you are running the notebook locally, you will need to have the following installed:
   - Gromacs
   - The python package MDTraj

---

The aim of this notebook is to build on the "Crossflow 101" notebook, and create a workflow to:

1. Run a short MD job
2. Energy minimise each of the structures in the resulting trajectory

---

Begin by creating the required tasks:

In [None]:
from crossflow import filehandling, tasks, clients
# Build tasks for mdrun (one for MD, one for energy minimisation) and grompp:
md = tasks.SubprocessTask('gmx mdrun -ntmpi 1 -s x.tpr -c x.gro -o x.trr -g x.log')
md.set_inputs(['x.tpr'])
md.set_outputs(['x.trr'])

em = tasks.SubprocessTask('gmx mdrun -ntmpi 1 -s x.tpr -c x.gro')
em.set_inputs(['x.tpr'])
em.set_outputs(['x.gro'])

grompp = tasks.SubprocessTask('gmx grompp -f x.mdp -c x.gro -p x.top -o x.tpr -maxwarn 1')
grompp.set_inputs(['x.mdp', 'x.gro', 'x.top'])
grompp.set_outputs(['x.tpr'])

## Part 1: Running the workflow without distributed computing

The workflow is fairly simple: 
1. Run grompp to prepare the starting structure for MD.
2. Run the MD.
3. For each structure in the trajectory:

    a. Run grompp to prepare it for energy minimisation.
    
    b. Run the energy minimisation.
    
    c. Save the final coordinates to a file.
    
Below we run each step interactively, i.e. without starting a Crossflow client (this means that the objects returned by the functions are the actual data, not `Futures`).

In [None]:
# Run grompp and the MD:
fh = filehandling.FileHandler()
start_crds = fh.load('bpti.gro')
topfile = fh.load('bpti.top')
md_mdp = fh.load('nvt.mdp')
mdtpr = grompp(md_mdp, start_crds, topfile)
trajectory = md(mdtpr)

In [None]:
# Convert the MD trajectory file to an MDTraj trajectory object:
import mdtraj as mdt
traj = mdt.load(trajectory, top=start_crds)

In [None]:
# Import the energy minimisation mdp file, then minimise each snapshot in turn:
em_mdp = fh.load('em.mdp')
for i, snapshot in enumerate(traj):
    print('Energy minimising snapshot {}'.format(i))
    emtpr = grompp(em_mdp, snapshot, topfile)
    mincrds = em(emtpr)
    mincrds.save('bpti_min_{}.gro'.format(i))

## Part 2: Running the workflow with distributed computing

In distributed computing, tasks are farmed out to "workers". Where the program logic permits, tasks that can be run in parallel are sent to different workers. Clearly that applies to the energy minimisation steps here - they are independent of each other, and if enough workers were available, each task could be run at the same time.

Crossflow comes with a distributed computing capability built on [dask.distributed](http://distributed.dask.org/en/latest/). If you are running this notebook on your own desktop machine or equivalent, it will create a "pool" of workers on that machine to run jobs in parallel. Depending on the capabilities of your machine, you may or may not see much performance improvement compared to running the jobs without distributed computing. If you have the resources to add the code to launch a "proper" cluster (see e.g. [dask-kubernetes](https://kubernetes.dask.org/en/latest/) or [dask-jobqueue](https://jobqueue.dask.org/en/latest/)), then each worker is a separate compute node and you should see a significant speed-up.

----

Begin by creating a *client*:

In [None]:
# In a production setting you would have extra code here to create a 'proper' distributed cluster:
# cluster = ???
# 
cluster = None
client = clients.Client(cluster)
client

You are going to distribute the energy minimisations of the snapshots from the trajectory file across the workers in your cluster. For performance reasons, you begin by uploading the snapshots to the cluster, using `client.upload()`:

In [None]:
snapshots = [client.upload(t) for t in traj]

Now you will run the grompp and mdrun jobs in parallel across the available workers, using the `client.map()` method. This takes the task as the first argument, and *lists* of task arguments after that. The `map()` method takes one item from each of the argument lists, and evaluates the task using those. It then returns a *list* of task outputs.

Thus, if a task had the form:

    result = myfunc(inputa, inputb)

Then this would become:

    [result1, result2] = client.map(myfunc, [inputa_1, inputa_2], [inputb_1, inputb_2])

However, as a shortcut, if one of the arguments (e.g. `inputb`) is always the same, you can instead write:

    [result1, result2] = client.map(myfunc, [inputa_1, inputa_2], inputb)
    
And `inputb` will be expanded to `[inputb, inputb]` automatically.
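The expansion rule described above can be sketched in plain Python. This is only an illustration of the idea, not Crossflow's actual implementation: `expand_args` is a hypothetical helper, and it assumes that any argument that is not a list should be repeated once per item in the list arguments.

```python
def expand_args(*args):
    """Hypothetical helper (not part of Crossflow): repeat every
    non-list argument so that all arguments become equal-length lists."""
    lengths = [len(a) for a in args if isinstance(a, list)]
    if not lengths:
        raise ValueError("at least one argument must be a list")
    n = lengths[0]
    if any(length != n for length in lengths):
        raise ValueError("all list arguments must have the same length")
    return [a if isinstance(a, list) else [a] * n for a in args]


def myfunc(inputa, inputb):
    # Stand-in for a real task: just combine the two inputs.
    return f"{inputa}+{inputb}"


# Only the first argument is a list; "b" is expanded to ["b", "b"].
expanded = expand_args(["a1", "a2"], "b")

# Take one item from each expanded list per call, as map() is described as doing.
results = [myfunc(*call_args) for call_args in zip(*expanded)]
print(results)  # ['a1+b', 'a2+b']
```

In this sketch the calls run one after another; with a real client the same pairing of arguments would be sent to different workers.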

In [None]:
# Run the grompp jobs, then the energy minimisations:
em_tprs = client.map(grompp, em_mdp, snapshots, topfile) # Note only snapshots is a list, other arguments get expanded automatically
mincrds = client.map(em, em_tprs)

You may have been surprised that when you executed the cell above, it appeared to complete almost instantaneously - did the jobs really run that fast? 

No - the `client.map()` method runs the jobs asynchronously. They have been submitted to the workers, but have probably not finished yet. The variables `em_tprs` and `mincrds` are not actually the (lists of) new files - they are lists of `Futures` from which, at some time in the future, the real files can be obtained by calling their `result()` method.
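If futures are new to you, the same submit-now, collect-later pattern can be tried with Python's standard library alone. The sketch below uses `concurrent.futures` rather than Crossflow or dask, and `slow_square` is a made-up stand-in for a long-running job; the only point is that submission returns immediately, while `result()` blocks until the value is ready.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_square(x):
    # Stand-in for a long-running job such as an energy minimisation.
    time.sleep(0.2)
    return x * x


with ThreadPoolExecutor(max_workers=4) as pool:
    # submit() returns a Future straight away; the work runs in the background.
    futures = [pool.submit(slow_square, x) for x in range(4)]
    # result() waits for each job to finish and then returns its value.
    values = [f.result() for f in futures]

print(values)  # [0, 1, 4, 9]
```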

In the cell below you wait for the jobs to complete, and then write out the minimised coordinate files:

In [None]:
for i, mincrd in enumerate(mincrds):
    print('saving minimised snapshot {}'.format(i))
    mincrd.result().save('bpti_min_{}.gro'.format(i))

It's recommended to properly shut down the client before you quit the notebook:

In [None]:
client.close()