# Snakemake Introduction to Elephant

In this tutorial I will first introduce the workflow management system [snakemake](https://snakemake.readthedocs.io) and then show how it can be used to analyze electrophysiology data with the Electrophysiology Analysis Toolkit [Elephant](http://neuralensemble.org/elephant). Tutorials for both snakemake and Elephant already exist as part of their documentation and were partially reused here.

*NOTE: When running this tutorial locally, please first install the [requirements](environment.yml) to ensure the examples are working.*

## Workflow management - Do I need this?

There are a number of reasons for managing your workflows beyond the classical 'I always run script A first and then script B':
* **Growing Complexity** When starting a seemingly small and simple project, everything is still fairly easy, but projects tend to grow beyond the initially expected size and complexity. This affects two aspects: 1) the dependencies between the different steps of the workflow and 2) the dependency of your analysis results on the versions of the intermediate steps in the pipeline (data as well as code).
* **Collaboration & Sharing** Some day a colleague of yours wants to use your workflow, or you are required to publish the analysis together with the manuscript presenting the results of your work to the scientific community. Instead of writing a book about which versions of which program you installed in which order, wouldn't it be nice to have a structured and self-explanatory description of your analysis steps already at hand?
* **Software Evolution** Luckily, software evolves and bugs are fixed from time to time. Unfortunately, this also means that your workflow might break unexpectedly when you update your system. Having a well-defined environment to run your analysis in is essential for the reproducibility of your results.

## Snakemake

Snakemake helps you to structure your workflow by providing a framework for specifying the dependencies between the individual steps of your analysis. By doing so it enforces a modular structure in the project and allows you to specify the (Python) software versions used in each step of the process.

The workflow definition in snakemake is file-based, i.e. each step in the workflow is specified via its required input files and its generated output files, as you might know from common [`Makefiles`](https://en.wikipedia.org/wiki/Make_(software)). Each step is defined in a `rule`, which specifies the input and output files as well as instructions on how to get from the input to the output files. Here the instructions to convert `fileA.txt` into `fileB.txt` are a simple copy performed in a shell:

In [13]:
%%writefile -a Snakefile
rule:
    input: 'fileA.txt'
    output: 'fileB.txt'
    shell: 'cp fileA.txt fileB.txt'

Writing Snakefile


By default, snakemake reads the workflow definition from a file named `Snakefile`, and the workflow can be executed by calling `snakemake` in a terminal in the directory containing the `Snakefile`. Here the example rule above has been exported into a [`Snakefile`](Snakefile) using the `%%writefile` Jupyter magic command.
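
To make the file-based idea more concrete, here is a minimal Python sketch of how a make-like tool might decide whether a rule has to run. The `Rule` class and the `needs_run` function are hypothetical names introduced only for this illustration; they are not part of snakemake, and snakemake's actual decision logic takes more into account than the simple existence and timestamp check assumed here.

```python
import os
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Rule:
    """Hypothetical stand-in for a file-based rule (not snakemake's own class)."""
    output: str
    inputs: list = field(default_factory=list)


def needs_run(rule, workdir):
    """Return True if the output is missing or older than an existing input.

    Assumption: a plain make-like timestamp comparison.
    """
    out = Path(workdir) / rule.output
    if not out.exists():
        return True
    out_mtime = out.stat().st_mtime
    return any(
        (Path(workdir) / name).stat().st_mtime > out_mtime
        for name in rule.inputs
        if (Path(workdir) / name).exists()
    )


# Mirrors the rule from the Snakefile: fileA.txt -> fileB.txt
copy_rule = Rule(output="fileB.txt", inputs=["fileA.txt"])

with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "fileA.txt").touch()
    before = needs_run(copy_rule, tmp)  # fileB.txt does not exist yet

    (Path(tmp) / "fileB.txt").touch()
    # Backdate the input so the output is unambiguously newer.
    os.utime(Path(tmp) / "fileA.txt", (1, 1))
    after = needs_run(copy_rule, tmp)   # output exists and is up to date

print(before, after)
```

In this sketch the rule has to run as long as `fileB.txt` is absent, and is considered up to date once the output exists and is newer than its input.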

Now we can ask snakemake to generate `fileB.txt` for us:

In [14]:
%%bash
snakemake

Building DAG of jobs...
MissingInputException in line 1 of /home/julia/presentations/2019-06-20_Toronto/snakemake_elephant_demo/Snakefile:
Missing input files for rule 1:
fileA.txt


CalledProcessError: Command 'b'snakemake\n'' returned non-zero exit status 1.

This fails, with snakemake complaining:
```
Missing input files for rule 1:
fileA.txt
```
This is correct, since there is no `fileA.txt` present from which `fileB.txt` could be generated. So let's add a second rule that generates `fileA.txt` without requiring any input files.


In [16]:
%%writefile -a Snakefile
rule:
    output: 'fileA.txt'
    shell: 'touch fileA.txt'

Appending to Snakefile


Now we ask snakemake again to generate `fileB.txt` for us:

In [18]:
%%bash
snakemake

Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
	count	jobs
	1	1
	1	2
	2

[Wed Jun  5 16:47:01 2019]
rule 2:
    output: fileA.txt
    jobid: 1

[Wed Jun  5 16:47:02 2019]
Finished job 1.
1 of 2 steps (50%) done

[Wed Jun  5 16:47:02 2019]
rule 1:
    input: fileA.txt
    output: fileB.txt
    jobid: 0

[Wed Jun  5 16:47:02 2019]
Finished job 0.
2 of 2 steps (100%) done
Complete log: /home/julia/presentations/2019-06-20_Toronto/snakemake_elephant_demo/.snakemake/log/2019-06-05T164701.955012.snakemake.log


Internally, snakemake first resolves the set of rules into a directed acyclic graph (DAG) to determine the order in which the rules need to be executed. We can generate a visualization of the workflow using the `--dag` flag in combination with `dot` and `display`.

In [21]:
%%bash
snakemake --dag | dot | display

Building DAG of jobs...
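
The dependency resolution described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not snakemake's implementation: the `rules` table and the `execution_order` function are hypothetical, each rule is assumed to produce exactly one output file, and the ordering is delegated to `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical rule table mirroring the two rules in the Snakefile:
# rule name -> (list of input files, output file)
rules = {
    "rule 1": (["fileA.txt"], "fileB.txt"),
    "rule 2": ([], "fileA.txt"),
}


def execution_order(rules, target, existing=()):
    """Return rule names in an order that satisfies all file dependencies.

    `existing` lists files assumed to be present already; a required file
    that neither exists nor is produced by any rule raises an error.
    """
    producer = {out: name for name, (_, out) in rules.items()}
    graph = {}  # rule name -> set of rules that must run before it

    def visit(path):
        if path not in producer:
            if path in existing:
                return None  # nothing to run, the file is already there
            raise FileNotFoundError(f"Missing input file: {path}")
        name = producer[path]
        if name not in graph:
            graph[name] = set()
            for inp in rules[name][0]:
                dependency = visit(inp)
                if dependency is not None:
                    graph[name].add(dependency)
        return name

    visit(target)
    return list(TopologicalSorter(graph).static_order())


order = execution_order(rules, "fileB.txt")
print(order)  # ['rule 2', 'rule 1']
```

The resulting order matches the log above, where the rule creating `fileA.txt` runs before the rule copying it to `fileB.txt`. Removing `"rule 2"` from the table reproduces the earlier situation: the sketch raises an error because nothing can provide `fileA.txt`.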
