## An Introduction to Crossflow

This tutorial will guide you through the process of creating a Python script that can run the following simple biomolecular simulation-type workflow:

1. From a set of starting coordinates, run five independent short MD simulations, each of which has different starting velocities.
2. Calculate the RMSD of each of the five final structures from the starting coordinates.
3. If the largest RMSD is greater than a target value, stop; otherwise choose the final structure with the largest RMSD as the new starting coordinates and go back to step 1.
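
Before bringing in any MD code, the control flow of these three steps can be sketched in plain Python. This is only an illustration under stated assumptions: `fake_md`, `rmsd`, and `explore` are hypothetical helpers (not part of Crossflow or `fakemd`), a "structure" is reduced to a flat list of numbers, different starting velocities are mimicked by different random seeds, and the RMSD is assumed to be measured against the original starting coordinates.

```python
import math
import random


def fake_md(start, seed):
    """Hypothetical stand-in for one short MD run.

    Returns a 'final structure': the input coordinates plus a small random
    displacement. The seed plays the role of the starting velocities.
    """
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, 0.1) for x in start]


def rmsd(a, b):
    """Root-mean-square deviation between two equal-length coordinate lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def explore(start, target, n_replicas=5, max_cycles=200):
    """Repeat steps 1-3 until the largest RMSD exceeds the target."""
    current = list(start)
    for cycle in range(1, max_cycles + 1):
        # Step 1: independent runs from the same coordinates, different seeds.
        finals = [fake_md(current, seed=cycle * n_replicas + i)
                  for i in range(n_replicas)]
        # Step 2: RMSD of each final structure from the original start
        # (an assumption of this sketch).
        rmsds = [rmsd(final, start) for final in finals]
        # Step 3: stop if the largest RMSD exceeds the target, otherwise
        # restart from the structure that moved furthest.
        best = max(range(n_replicas), key=lambda i: rmsds[i])
        if rmsds[best] > target:
            return finals[best], rmsds[best], cycle
        current = finals[best]
    raise RuntimeError("target RMSD not reached within max_cycles")


start = [0.0] * 30
final, final_rmsd, cycles = explore(start, target=0.3)
print(f"stopped after {cycles} cycle(s) with RMSD {final_rmsd:.3f}")
```

In the real workflow, the list comprehension in step 1 is the part that gets farmed out to workers, since the five runs do not depend on each other.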

### Set up

1. (optional) If you are using your own laptop or workstation, create and activate a fresh virtual environment.
2. Install the following Python packages:
   * MDTraj (`pip install mdtraj`)
   * Crossflow (`pip install crossflow`)
3. Download copies of the tutorial files.

### Fakemd

You will be using `fakemd` for this tutorial: a "proxy app" that mimics a real-life MD package but runs much faster. It is designed to stand in for a real MD code, whether that is Gromacs, Amber, or NAMD:

For Gromacs users:

    fakemd -c start.gro -p system.top -r md.gro -x md.xtc -l md.log -n 100

For Amber users:

    fakemd -c start.ncrst -p system.prmtop -r md.ncrst -x md.nc -l md.log -n 100

For NAMD users:

    fakemd -c start.pdb -p system.psf -r md.pdb -x md.dcd -l md.log -n 100
    
In each case the `-n 100` argument means that 100 snapshots will be generated; this is the only user-controlled parameter. The coordinates in the restart and trajectory files will be bogus, but the files should still be readable by analysis scripts and similar tools.
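
If you would rather assemble these command lines from Python than type them, a small helper along the following lines may be useful. It is a sketch only: `FAKEMD_FILES` and `fakemd_command` are hypothetical names made up for this example, and the flags and file names are simply copied from the three commands above.

```python
import shlex

# File names copied from the three example command lines above.
FAKEMD_FILES = {
    "gromacs": {"-c": "start.gro", "-p": "system.top", "-r": "md.gro", "-x": "md.xtc"},
    "amber": {"-c": "start.ncrst", "-p": "system.prmtop", "-r": "md.ncrst", "-x": "md.nc"},
    "namd": {"-c": "start.pdb", "-p": "system.psf", "-r": "md.pdb", "-x": "md.dcd"},
}


def fakemd_command(package, n_snapshots=100, log="md.log"):
    """Return the fakemd argument list for one of the three file-naming styles."""
    args = ["fakemd"]
    for flag, filename in FAKEMD_FILES[package].items():
        args += [flag, filename]
    # -n (number of snapshots) is the only parameter the user controls.
    args += ["-l", log, "-n", str(n_snapshots)]
    return args


# Print each command as it would be typed in a shell.
for package in FAKEMD_FILES:
    print(shlex.join(fakemd_command(package)))
```

Assuming `fakemd` is on your `PATH` and the input files are in the current directory, the returned list can be passed directly to `subprocess.run(args, check=True)`.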

If you are doing this tutorial on your own laptop or similar and have VMD or a similar visualization tool installed, try running `fakemd` using whichever of the above options looks most familiar to you and have a look at the resulting trajectory file.

### Clusters and Clients

`Crossflow` works with a `cluster` that runs the required jobs. A `cluster` consists of a `scheduler` and one or more `worker`s. `Worker`s may be individual nodes in an HPC system, individual virtual machines in the cloud, or (as in this tutorial) just separate processes running on your own laptop or workstation. `Crossflow` makes a connection to the `scheduler` via a `client` object. You use methods of the `client` to send jobs to the cluster.

Let's begin by creating a simple Python program that will create a "local cluster" (a set of worker processes on the current machine), and connect a `client` to it. Create a file "workflow.py" with the following content:

```
from distributed import LocalCluster
from crossflow.clients import Client

def run(client):
    # simple job - ask the client to tell us about itself:
    print(client)

if __name__ == '__main__':
    # create the cluster:
    cluster = LocalCluster()
    # connect a client:
    client = Client(cluster)
    # run the job:
    run(client)
```
Let's look through a few key lines in the code:

`from distributed import LocalCluster`

The Python package `distributed` (also importable as `dask.distributed`) supplies the `LocalCluster` class used to create the cluster. As you scale your workflow to bigger resources (e.g. HPC or cloud), this is the part you will change. Almost everything else will stay the same, which is one of the great advantages of `crossflow`.
