# A Crossflow workflow

This notebook illustrates a basic Crossflow workflow, with scatter, parallel processing, and gather steps.

The workflow:

1. Splits an input text file into pieces
2. In parallel, reverses the order of the lines in each piece
3. Stitches the reversed pieces back together
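
Before bringing in Crossflow, the same scatter, parallel-process, gather pattern can be sketched with the Python standard library alone. This is an illustration only, not how Crossflow works internally: the helper names `split_lines`, `reverse_lines` and `join_chunks` are hypothetical, and a `ThreadPoolExecutor` stands in for the compute cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def split_lines(lines, chunk_size):
    """Scatter: cut the list of lines into consecutive chunks."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def reverse_lines(chunk):
    """Process: reverse the order of the lines within one chunk."""
    return chunk[::-1]


def join_chunks(chunks):
    """Gather: concatenate the chunks back into a single list."""
    return [line for chunk in chunks for line in chunk]


lines = ['line {}'.format(i) for i in range(25)]
pieces = split_lines(lines, 5)

# Each chunk is handled independently, so the work can be mapped in parallel.
with ThreadPoolExecutor() as pool:
    reversed_pieces = list(pool.map(reverse_lines, pieces))

result = join_chunks(reversed_pieces)
print(result[:5])  # ['line 4', 'line 3', 'line 2', 'line 1', 'line 0']
```

The order of lines is reversed within each chunk, but the chunks themselves stay in their original order, which is the result the Crossflow version of the workflow produces as well.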

In [1]:
from crossflow import clients, kernels, filehandling
from pathlib import Path

Start a client connected to a temporary compute cluster launched on the current machine:

In [2]:
client = clients.Client(local=True)
client.client

Client: Scheduler: tcp://127.0.0.1:53972
Cluster: Workers: 4, Cores: 8, Memory: 17.18 GB


Create a text file of 25 lines:

In [3]:
here = Path('.')
input_file = here / 'input.txt'
with input_file.open('w') as f:
    for i in range(25):
        f.write('line {}\n'.format(i))

Create the three kernels required: one to split the initial text file, one to reverse the order of the lines, and one to join the pieces back together.

We are going to use the standard Unix `split`, `tail`, and `cat` commands to illustrate how tools usually run from the command line can be converted into Python functions.

**Note**: some flavours of Unix do not support `tail -r` (GNU `tail`, as found on most Linux systems, is one example); in such cases `tac` will do the same job.
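
If the notebook needs to run on both kinds of system, the choice of command can be made at runtime. The sketch below is one possible approach, not part of Crossflow: `reverse_command` is a hypothetical helper that uses `shutil.which` to check whether `tac` is on the `PATH` and falls back to `tail -r` otherwise.

```python
import shutil


def reverse_command(infile='input', outfile='output'):
    """Return a shell command that reverses the line order of infile.

    Prefers `tac` when it is installed; otherwise assumes a `tail`
    that supports the -r flag (as on BSD and macOS).
    """
    if shutil.which('tac'):
        return 'tac {} > {}'.format(infile, outfile)
    return 'tail -r {} > {}'.format(infile, outfile)


command = reverse_command()
print(command)
```

The resulting string could then be passed to `kernels.SubprocessKernel` in place of the hard-coded command used in the next cell.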

In [4]:
# Create a SubprocessKernel that will split up the input file:
splitter = kernels.SubprocessKernel('split -l 5 input.txt')
splitter.set_inputs(['input.txt'])
splitter.set_outputs(['xaa', 'xab', 'xac', 'xad', 'xae'])

# Create a SubprocessKernel to reverse the order of the lines in a file:
reverser = kernels.SubprocessKernel('tail -r input > output')
# Alternative for systems where `tail -r` is not available:
# reverser = kernels.SubprocessKernel('tac input > output')
reverser.set_inputs(['input'])
reverser.set_outputs(['output'])

# Create a SubprocessKernel that will join input files together:
joiner = kernels.SubprocessKernel('cat * > output')
joiner.set_inputs(['*'])
joiner.set_outputs(['output'])

Arguments to command-line tools are typically the *names* of files. Because Crossflow is designed for use with distributed computing resources where filesystems may not be shared, it uses portable FileHandles to refer to input and output files:

In [5]:
# Convert the input data file into a Crossflow FileHandle object:
fh = filehandling.FileHandler()
input_data = fh.load(input_file)

Here is the workflow, using the client's `.submit()` and `.map()` methods:

In [6]:
# First split the file into pieces:
pieces = client.submit(splitter, input_data)
# 'pieces' is a tuple; convert it to a list and process each piece in parallel:
reversed_pieces = client.map(reverser, list(pieces))
# Stitch the reversed pieces back together again:
output = client.submit(joiner, reversed_pieces)

The client returns its outputs as `Futures`. These can be passed as-is between kernels, but to get at the final data you need to call their `.result()` method:

In [7]:
output_filehandle = output.result()
# Save the output FileHandle object as a file, and print its contents:
output_file = here / 'output.txt'
output_filehandle.save(output_file)
print(output_file.read_text())

line 4
line 3
line 2
line 1
line 0
line 9
line 8
line 7
line 6
line 5
line 14
line 13
line 12
line 11
line 10
line 19
line 18
line 17
line 16
line 15
line 24
line 23
line 22
line 21
line 20
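
As a quick sanity check, the expected text can be computed independently and compared with the saved file. This is a minimal sketch that assumes `output.txt` was written to the current directory by the cell above; `expected_output` is a hypothetical helper, not part of Crossflow.

```python
from pathlib import Path


def expected_output(n_lines=25, chunk_size=5):
    """Build the text the workflow should produce: lines reversed within each chunk."""
    lines = ['line {}\n'.format(i) for i in range(n_lines)]
    chunks = [lines[i:i + chunk_size] for i in range(0, n_lines, chunk_size)]
    return ''.join(''.join(reversed(chunk)) for chunk in chunks)


output_path = Path('output.txt')
if output_path.exists():
    # Prints True when the workflow output matches the expectation.
    print(output_path.read_text() == expected_output())
```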

