# Lecture 5 - Data Analysis pipelines using Make

## Learning Objectives

By the end of the lecture, students should be able to:

- Write a simple automated analysis pipeline using Make

## Building a Data Analysis pipeline using Make, a tutorial
adapted from [Software Carpentry](http://software-carpentry.org/)

### Set-up instructions

- Download [data_analysis_pipeline_eg-2.0.zip](https://github.com/ttimbers/data_analysis_pipeline_eg/archive/v2.0.zip)
- Unzip it and change into the `data_analysis_pipeline_eg-2.0` directory.
- Note: the tutorial in [lecture 4](lecture/04_lecture-shell-driver-scripts.md) is a prerequisite.


## Why Make?

We previously built a data analysis pipeline by using a shell script (we called it `run_all.sh`) to piece together and create a record of all the scripts and arguments we used in our analysis. That is a step in the right direction, but there were a few unsatisfactory things about this strategy:

  1. It takes time to manually erase all intermediate and final files generated by the analysis in order to run a complete test and confirm that everything works from top to bottom.
  2. It runs every step every time. This can be problematic if some steps take a long time and you have only changed other, smaller parts of the analysis.

Thus, to improve on this, we are going to use the build automation tool Make to create a smarter
data analysis pipeline.


### Makefile Structure

Each block of code in a Makefile is called a rule. It looks something like this:

~~~
file_to_create.png : data_it_depends_on.dat script_it_depends_on.py
	python script_it_depends_on.py data_it_depends_on.dat file_to_create.png
~~~

* `file_to_create.png` is a target, a file to be created, or built.
* `data_it_depends_on.dat` and `script_it_depends_on.py` are dependencies, files which are needed to build or update the target. Targets can have zero or more dependencies.
* `:` separates targets from dependencies.
* `python script_it_depends_on.py data_it_depends_on.dat file_to_create.png` is an action, a command to run to build or update the target using the dependencies. Targets can have zero or more actions. Actions are indented using the TAB character, not 8 spaces.
* Together, the target, dependencies, and actions form a rule.

### Structure if you have multiple targets from a script

~~~
file_to_create_1.png file_to_create_2.png : data_it_depends_on.dat script_it_depends_on.py
	python script_it_depends_on.py data_it_depends_on.dat file_to_create
~~~

### Let's do some analysis!

Good reference: http://swcarpentry.github.io/make-novice/reference

Create a file, called `Makefile`, with the following content:

~~~
# Count words.
results/isles.dat : data/isles.txt src/wordcount.py
	python src/wordcount.py data/isles.txt results/isles.dat
~~~

This is a simple build file, which for
Make is called a Makefile: a file executed
by Make. Let us go through each line in turn:

* `#` denotes a *comment*. Any text from `#` to the end of the line is
  ignored by Make.
* `results/isles.dat` is a [target](http://swcarpentry.github.io/make-novice/reference#target), a file to be
  created, or built.
* `data/isles.txt` and `src/wordcount.py` are [dependencies](http://swcarpentry.github.io/make-novice/reference#dependency),
  files that are needed to build or update the target. Targets can have
  zero or more dependencies.
* `:` separates targets from dependencies.
* `python src/wordcount.py data/isles.txt results/isles.dat` is an
  [action](http://swcarpentry.github.io/make-novice/reference#action), a command to run to build or update
  the target using the dependencies. Targets can have zero or more
  actions.
* Actions are indented using the TAB character, *not* 8 spaces. This
  is a legacy of Make's 1970s origins.
* Together, the target, dependencies, and actions form a
  [rule](http://swcarpentry.github.io/make-novice/reference#rule).

Our rule above describes how to build the target `results/isles.dat` using the
action `python src/wordcount.py data/isles.txt results/isles.dat` and the
dependencies `data/isles.txt` and `src/wordcount.py`.

By default, Make looks for a Makefile, called `Makefile`, and we can
run Make as follows:

~~~
$ make
~~~

Make prints out the actions it executes:

~~~
python src/wordcount.py data/isles.txt results/isles.dat
~~~

If we see,

~~~
Makefile:3: *** missing separator.  Stop.
~~~

then we have used spaces instead of a TAB character to indent one of
our actions.

We don't have to call our Makefile `Makefile`. However, if we call it
something else, we need to tell Make where to find it. We can do this
using the `-f` flag, followed by the file name. For example:

~~~
$ make -f Makefile
~~~

Because we have now run Make a second time, it informs us that:

~~~
make: `results/isles.dat' is up to date.
~~~

This is because our target, `results/isles.dat`, already exists and is
newer than its dependencies, so Make does not rebuild it. To see how
this works, let's pretend to
update one of the text files. Rather than opening the file in an
editor, we can use the shell `touch` command to update its timestamp
(which would happen if we did edit the file):

~~~
$ touch data/isles.txt
~~~

If we compare the timestamps of `data/isles.txt` and `results/isles.dat`,

~~~
$ ls -l data/isles.txt results/isles.dat
~~~

then we see that `results/isles.dat`, the target, is now older
than `data/isles.txt`, its dependency:

~~~
-rw-r--r--    1 mjj      Administ   323972 Jun 12 10:35 data/isles.txt
-rw-r--r--    1 mjj      Administ   182273 Jun 12 09:58 results/isles.dat
~~~

If we run Make again,

~~~
$ make
~~~

then it recreates `results/isles.dat`:

~~~
python src/wordcount.py data/isles.txt results/isles.dat
~~~

When it is asked to build a target, Make checks the 'last modification
time' of both the target and its dependencies. If any dependency has
been updated since the target, then the actions are re-run to update
the target.
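
The same comparison can be imitated in the shell with the `-nt` ("newer than") file test. This is only a sketch of the idea, not how Make is implemented: the file names `dep.txt` and `target.dat` are made up for the demonstration, and the timestamps are set by hand in a throwaway directory. Make applies this check to every dependency of a target, and also rebuilds when the target does not exist yet.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical files: the dependency is stamped 09:00, the target 10:00.
touch -t 202401010900 dep.txt
touch -t 202401011000 target.dat

# The target is newer than its dependency, so there is nothing to do.
if [ dep.txt -nt target.dat ]; then before="rebuild"; else before="up to date"; fi

# "Edit" the dependency at 11:00; it is now newer than the target.
touch -t 202401011100 dep.txt
if [ dep.txt -nt target.dat ]; then after="rebuild"; else after="up to date"; fi

echo "before: $before"
echo "after: $after"
```

The script prints `before: up to date` followed by `after: rebuild`, which mirrors what happened when we ran `touch data/isles.txt` and then `make`.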

We may want to remove all our data files so we can explicitly recreate
them all. We can introduce a new target, and associated rule, `clean`:

~~~
results/isles.dat : data/isles.txt src/wordcount.py
	python src/wordcount.py data/isles.txt results/isles.dat

clean :
	rm -f results/*.dat
~~~

This is an example of a rule that has no dependencies. `clean` has no
dependencies on any `.dat` file as it makes no sense to create these
just to remove them. We just want to remove the data files whether or
not they exist. If we run Make and specify this target,

~~~
$ make clean
~~~

then we get:

~~~
rm -f results/*.dat
~~~

There is no actual thing built called `clean`. Rather, it is a
short-hand that we can use to execute a useful sequence of
actions. Such targets, though very useful, can lead to problems. For
example, let us recreate our data files, create a directory called
`clean`, then run Make:

~~~
$ make results/isles.dat
$ mkdir clean
$ make clean
~~~

We get:

~~~
make: `clean' is up to date.
~~~
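
Make reports `clean` as up to date because it now finds something named `clean` on disk (the directory), and a target that exists and has no dependencies never needs rebuilding. The usual remedy in GNU Make is to declare such targets phony with the special `.PHONY` target, which tells Make that the name is a label for a set of actions rather than a file. A minimal sketch, reusing the `clean` rule from above:

```make
# Declare clean as a phony target: it is not a file to be built,
# so Make runs its actions even if a file or directory named
# "clean" exists.
.PHONY : clean
clean :
	rm -f results/*.dat
```

With this declaration in place, `make clean` runs the `rm` action again. The stray directory can be removed with `rmdir clean`.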

Let's add another rule to `Makefile`, this time for a figure, and extend `clean` to remove it:

~~~
results/isles.dat : data/isles.txt src/wordcount.py
	python src/wordcount.py data/isles.txt results/isles.dat

results/figure/isles.png : results/isles.dat src/plotcount.py
	python src/plotcount.py results/isles.dat results/figure/isles.png

clean :
	rm -f results/*.dat
	rm -f results/figure/*.png
~~~

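The new target `results/figure/isles.png` depends on the target `results/isles.dat`. So to make both, we only need to
ask Make for the figure; it will build `results/isles.dat` first if that file is missing or out of date:

~~~
$ make results/figure/isles.png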
$ ls
~~~

Let's add another book:

~~~
results/isles.dat : data/isles.txt src/wordcount.py
	python src/wordcount.py data/isles.txt results/isles.dat

results/abyss.dat : data/abyss.txt src/wordcount.py
	python src/wordcount.py data/abyss.txt results/abyss.dat

results/figure/isles.png : results/isles.dat src/plotcount.py
	python src/plotcount.py results/isles.dat results/figure/isles.png

results/figure/abyss.png : results/abyss.dat src/plotcount.py
	python src/plotcount.py results/abyss.dat results/figure/abyss.png

clean :
	rm -f results/*.dat
	rm -f results/figure/*.png
~~~

To run all of the commands, we need to type `make <TARGET>` for each one:

~~~
$ make results/figure/isles.png
$ make results/figure/abyss.png
~~~

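Or we can add a target `all`, placed first in the Makefile, that depends on the final outputs of the pipeline. Because Make builds the first target in the file when it is run without arguments, typing `make` now builds everything: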

~~~
all: results/figure/isles.png results/figure/abyss.png

# count words
results/isles.dat : data/isles.txt src/wordcount.py
	python src/wordcount.py data/isles.txt results/isles.dat
	
results/abyss.dat : data/abyss.txt src/wordcount.py
	python src/wordcount.py data/abyss.txt results/abyss.dat

# plot word count
results/figure/isles.png : results/isles.dat src/plotcount.py
	python src/plotcount.py results/isles.dat results/figure/isles.png

results/figure/abyss.png : results/abyss.dat src/plotcount.py
	python src/plotcount.py results/abyss.dat results/figure/abyss.png

clean :
	rm -f results/*.dat
	rm -f results/figure/*.png

~~~

## Finish off the Makefile!

1. Try to add the other books to the Makefile.

2. Add the final report.


## Pattern matching and variables in a Makefile

It is possible to DRY out (reduce repetition in) a Makefile by using pattern matching and variables.

Using wildcards and pattern matching in a Makefile is possible, but the syntax is not very readable. So if you choose to do this, proceed with caution. Examples of how to do this are here: http://swcarpentry.github.io/make-novice/05-patterns/index.html
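
As a rough illustration of what that syntax looks like, the two word-count rules from the Makefile above could be collapsed into a single pattern rule. This is a sketch rather than a drop-in replacement for the full pipeline. It relies on `%`, Make's wildcard in a pattern rule, and on the automatic variables `$<` (the first dependency) and `$@` (the target).

```make
# One rule for every book: results/NAME.dat is built from data/NAME.txt.
# $< expands to the first dependency (the .txt file),
# $@ expands to the target (the .dat file).
results/%.dat : data/%.txt src/wordcount.py
	python src/wordcount.py $< $@
```

The rule is shorter, but the Makefile no longer names each input and output file explicitly, which is the readability cost noted above.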

As for variables in a Makefile, in most cases we actually do not want to use them. The reason is that we want this file to be a record of what we did to run our analysis (e.g., what files were used, what settings were used, etc.). If you start using variables with your Makefile, then you are shifting the problem of recording how your analysis was done to another file. There needs to be some file in your repo that captures what values the variables were given so that you can replicate your analysis. Examples of using variables in a Makefile are here: http://swcarpentry.github.io/make-novice/06-variables/index.html

## What did we learn?

- How to use GNU Make to create data analysis pipelines
- Advantages of data analysis pipeline tools that have a dependency tree

## What's next?

- How to use Docker so that you can use someone else's shippable and shareable compute environment.