# Snakemake Introduction to Elephant
## Workflow management based on an electrophysiology example

In this tutorial I will first introduce the workflow management system [snakemake](https://snakemake.readthedocs.io) and then show how it can be used to analyze electrophysiology data with the electrophysiology analysis toolkit [Elephant](http://neuralensemble.org/elephant). Tutorials for both snakemake and Elephant already exist as part of their documentation and were partially reused in the tutorial presented here.

*NOTE: When running this tutorial locally, please first install the [requirements](environment.yml) to ensure the examples are working.*

## Workflow management - Do I need this?

There are a number of reasons for managing your workflows beyond the classical 'I always run script a first and then script b':
* **Growing Complexity** When starting a seemingly small and simple project, everything is still pretty easy, but projects tend to grow beyond the initially expected size and complexity. This affects two different aspects: 1) the dependencies between the different steps in the workflow and 2) the dependency of your analysis results on the versions of intermediate steps in the pipeline (data as well as code).
* **Collaboration & Sharing** Some day a colleague of yours wants to use your workflow, or you are required to publish the analysis together with the manuscript presenting the results of your work to the scientific community. Instead of writing a book about which versions of what program you installed in which order, wouldn't it be nice to have a structured and self-explanatory description of your analysis steps already at hand?
* **Software Evolution** Luckily, software evolves and bugs are fixed from time to time. Unfortunately, this also implies that your workflow might break unexpectedly when you update your system. Having a well-defined environment to run your analysis in is essential for the reproducibility of your results.

## Snakemake

Snakemake helps you structure your workflow by providing a framework to specify the dependencies between the individual steps of your analysis. By doing so it enforces a modular structure in the project and allows you to specify the (Python) software versions used in each step of the process.

The workflow definition in snakemake is file based, i.e. each step in the workflow is specified via the required input files and the generated output files, as you might know from common [`Makefiles`](https://en.wikipedia.org/wiki/Make_(software)). Each step is defined in a `rule`, specifying input and output files as well as instructions on how to get from the input to the output files. Here the instructions to convert `fileA.txt` into `fileB.txt` are a simple copy performed in a shell:

In [1]:
%%writefile Snakefile
rule:
    input: 'fileA.txt'
    output: 'fileB.txt'
    shell: 'cp fileA.txt fileB.txt'

Writing Snakefile


For snakemake the workflow definition needs to be specified in a `Snakefile` and can be executed by calling `snakemake` in a terminal in the same location as the `Snakefile`. Here the example rule above has been exported into a [`Snakefile`](Snakefile) using the `%%writefile` jupyter magic command.

Now we can ask snakemake to generate `fileB.txt` for us:

In [2]:
%%sh
snakemake

Building DAG of jobs...
Nothing to be done.
Complete log: /home/julia/presentations/2019-06-20_Toronto/snakemake_elephant_demo/.snakemake/log/2019-06-06T095229.961155.snakemake.log


This fails with snakemake complaining about
```
Missing input files for rule 1:
fileA.txt
```
This is correct, since there is no `fileA.txt` present to generate `fileB.txt` from. So let's add a second rule that can generate `fileA.txt` without requiring any input files.


In [3]:
%%writefile -a Snakefile
rule:
    output: 'fileA.txt'
    shell: 'touch fileA.txt'

Appending to Snakefile
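
To make the reasoning behind the missing-input error more concrete, here is a minimal Python sketch of the check. It is a toy model only and not snakemake's actual implementation: the function `missing_inputs` and the dictionaries `only_copy_rule` and `both_rules` are hypothetical names introduced for this illustration, and each rule is reduced to a mapping from its output file to the input files it needs.

```python
# Toy model (not snakemake's real code): output file -> list of required input files.
def missing_inputs(target, rules, existing_files):
    """Return the files needed for `target` that neither exist on disk
    nor can be produced by any of the given rules."""
    if target in existing_files:
        return set()       # file is already there, nothing to resolve
    if target not in rules:
        return {target}    # no file and no rule that could create it
    missing = set()
    for dependency in rules[target]:
        missing |= missing_inputs(dependency, rules, existing_files)
    return missing

# Hypothetical stand-ins for the Snakefile before and after appending the second rule.
only_copy_rule = {'fileB.txt': ['fileA.txt']}
both_rules = {'fileB.txt': ['fileA.txt'], 'fileA.txt': []}

print(missing_inputs('fileB.txt', only_copy_rule, existing_files=set()))  # {'fileA.txt'}
print(missing_inputs('fileB.txt', both_rules, existing_files=set()))      # set()
```

With only the copy rule, `fileA.txt` is reported as missing. Once a rule without inputs can create it, nothing is missing any more.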


And ask snakemake again to generate `fileB.txt` for us:

In [4]:
%%sh
snakemake

Building DAG of jobs...
Nothing to be done.
Complete log: /home/julia/presentations/2019-06-20_Toronto/snakemake_elephant_demo/.snakemake/log/2019-06-06T095234.818879.snakemake.log


Internally, snakemake first resolves the set of rules into a directed acyclic graph (DAG) to determine in which order the rules need to be executed. We can generate a visualization of the workflow using the `--dag` flag in combination with `dot` and `display` (for local notebook instances) or save the graph as SVG (e.g. for remote instances).

In [6]:
%%sh
snakemake --dag | dot | display
snakemake --dag | dot -Tsvg > dag0.svg

Building DAG of jobs...
Building DAG of jobs...


The resulting graph shows the dependencies between the two rules, which were automatically enumerated. The line style (continuous/dashed) indicates whether the rules were already executed or not.

![DAG](dag0.svg)
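
How an execution order follows from such a graph can be sketched in plain Python with the standard library module `graphlib` (Python 3.9 or newer). This only illustrates the idea of ordering a DAG and makes no claim about how snakemake does it internally. The labels `'copy'` and `'create'` are hypothetical stand-ins for the two unnamed rules.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical labels for the two rules: each rule maps to the rules it depends on.
dependencies = {
    'copy': {'create'},   # copying fileA.txt to fileB.txt needs fileA.txt first
    'create': set(),      # creating fileA.txt needs nothing
}

# A topological order lists every rule after all rules it depends on.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)  # ['create', 'copy']
```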

We can also provide explicit names for the rules to make the graph more readable:

In [11]:
%%writefile Snakefile
rule copy_A_to_B:
    input: 'fileA.txt'
    output: 'fileB.txt'
    shell: 'cp {input} {output}'
rule create_A:
    output: 'fileA.txt'
    shell: 'touch fileA.txt'

Overwriting Snakefile


In [12]:
%%sh
snakemake --dag | dot | display
snakemake --dag | dot -Tsvg > dag1.svg

Building DAG of jobs...
Building DAG of jobs...


![DAG](dag1.svg)

Here we already used a different notation in the shell command: `cp {input} {output}` instead of explicitly repeating the input and output filenames. During execution snakemake substitutes these placeholders with the filenames defined as `input` / `output`. We can use the same notation to generalize a rule so that its required input depends on the requested output, e.g. we can permit the copy rule to work for arbitrary files following a certain naming scheme. Here the folder `new_folder` is generated automatically for the copied files.

In [19]:
%%writefile Snakefile
rule copy_to_new_folder:
    input: 'file{id}.txt'
    output: 'new_folder/file{id}.txt'
    shell: 'cp {input} {output}'
rule create_file:
    output: 'file{id}.txt'
    shell: 'touch {output}'

Overwriting Snakefile
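
The following Python sketch illustrates what happens conceptually when a requested file is matched against an output pattern with a wildcard. It is a simplified model under my own assumptions, not snakemake's implementation: the helper `match_wildcards` is a hypothetical name, and it only handles patterns in which each wildcard name occurs once.

```python
import re

def match_wildcards(pattern, filename):
    """Match `filename` against a pattern such as 'new_folder/file{id}.txt'.
    Returns a dict of wildcard values, or None if the pattern does not fit."""
    regex = ''
    position = 0
    for placeholder in re.finditer(r'\{(\w+)\}', pattern):
        # literal text before the placeholder must match exactly
        regex += re.escape(pattern[position:placeholder.start()])
        # the placeholder itself may match any non-empty string
        regex += f'(?P<{placeholder.group(1)}>.+)'
        position = placeholder.end()
    regex += re.escape(pattern[position:])
    match = re.fullmatch(regex, filename)
    return match.groupdict() if match else None

requested = 'new_folder/fileZ.txt'
wildcards = match_wildcards('new_folder/file{id}.txt', requested)
# fill the wildcard value into the input pattern, then into the shell command
input_file = 'file{id}.txt'.format(**wildcards)
command = 'cp {input} {output}'.format(input=input_file, output=requested)

print(wildcards)  # {'id': 'Z'}
print(command)    # cp fileZ.txt new_folder/fileZ.txt
```

The value of `id` is therefore never set by hand. It is derived from the file that is requested.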


To run the workflow now, we need to specify which file we actually need as the final result, and snakemake takes care of the individual steps to generate that file. We specify the desired output file as a snakemake argument:

In [21]:
%%sh
snakemake new_folder/fileZ.txt --dag | dot | display
snakemake new_folder/fileZ.txt --dag | dot -Tsvg > dag2.svg

Building DAG of jobs...
Building DAG of jobs...


To generate a set of output files, we can request these individually when running snakemake, e.g. using `snakemake -np new_folder/file{0,1,2,3,4,5,6,7,8,9}.txt`. If the set of workflow outputs does not change frequently, it is also possible to add a final rule (conventionally named 'all'), which requests all desired output files of the workflow:

In [36]:
%%writefile Snakefile
rule all:
    input: expand('new_folder/file{id}.txt', id=range(10))
rule copy_to_new_folder:
    input: 'file{id}.txt'
    output: 'new_folder/file{id}.txt'
    shell: 'cp {input} {output}'
rule create_file:
    output: 'file{id}.txt'
    shell: 'touch {output}'

Overwriting Snakefile


In [37]:
%%sh
snakemake --dag | dot | display
snakemake --dag | dot -Tsvg > dag3.svg

Building DAG of jobs...
Building DAG of jobs...


Here I used the snakemake function `expand`, which expands a given pattern (here `new_folder/file{id}.txt`) for all combinations of the parameters provided (here `id` values from 0 to 9). This makes it easy to apply a set of rules to a number of different files.

![DAG](dag3.svg)
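
To see which filenames such an expansion produces, the behaviour described above can be mimicked in plain Python. The function `expand_sketch` below is a hypothetical re-implementation for illustration only. It is not snakemake's `expand` and covers just the simple case of filling a pattern with all combinations of the given values.

```python
from itertools import product

def expand_sketch(pattern, **wildcards):
    """Fill `pattern` with every combination of the given wildcard values."""
    names = list(wildcards)
    return [pattern.format(**dict(zip(names, combination)))
            for combination in product(*wildcards.values())]

targets = expand_sketch('new_folder/file{id}.txt', id=range(10))
print(len(targets))   # 10
print(targets[0])     # new_folder/file0.txt
print(targets[-1])    # new_folder/file9.txt
```

Since `range(10)` stops before 10, the last requested file is `new_folder/file9.txt`.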

Typically, analysis is a bit more complicated than creating empty files and copying them from A to B using shell commands. Snakemake also supports a number of different execution methods:

* in a shell (as used above)
* in Python (using `run:`)
* running Python/R/Markdown scripts directly (using `script:`)

As an example, we can use a small Python script to generate our initial data files and store a (randomly generated) value. The Python script would look like this:

In [39]:
%%writefile generate_data.py
import sys
import numpy as np

def generate_random_data(output_filename):
    # write a single random number in [0, 1) to the output file
    with open(output_filename, "w") as f:
        # f.write() expects a string, so convert the float first
        f.write(str(np.random.random()))

# extracting the output filename from the command line parameters provided
output_filename = sys.argv[1]
generate_random_data(output_filename)

Writing generate_data.py


The corresponding snakemake rule now needs to provide the argument to the `generate_data.py` script:

In [43]:
%%writefile Snakefile
rule all:
    input: expand('new_folder/file{id}.txt', id=range(10))
rule copy_to_new_folder:
    input: 'file{id}.txt'
    output: 'new_folder/file{id}.txt'
    shell: 'cp {input} {output}'
rule generate_data:
    output: 'file{id}.txt'
    shell: 'python generate_data.py {output}'

Overwriting Snakefile


In [44]:
%%sh
snakemake --dag | dot | display
snakemake --dag | dot -Tsvg > dag4.svg

Building DAG of jobs...
Building DAG of jobs...


## Utilizing Snakemake for Data Analysis

In [None]:
TOADD
SNAKEMAKE
* wildcards (done)
* different rule execution statements (done)
* conda environments
* expand
* config
* rulegraph
NEO
* class structure
* io overview
* intro to main data objects
ELEPHANT
* spade elephant demo