# Start Here

This notebook is all about **getting you started doing Reproducible Data Science**, and giving you a **deeper look** at some of the concepts we will cover in this tutorial. For the latest version of this notebook, visit:

    https://github.com/hackalog/bus_number

## The Bare Minimum
You will need:
* `cookiecutter` 
* `conda` (and then `python >= 3.6`)
* `make`


# Installation: While we are talking...

(This step is also described on the [Bus Number Wiki](https://github.com/hackalog/bus_number/wiki/Getting-Started).)

1. Install [cookiecutter](https://cookiecutter.readthedocs.io/en/latest/installation.html):
```
conda install -c conda-forge cookiecutter
```
2. Use cookiecutter to install the `pydata_nyc` branch of `cookiecutter-easydata`:

```
cookiecutter https://github.com/hackalog/cookiecutter-easydata.git --checkout pydata_nyc
```

3. Configure a new project. Call it **bus_number**:
<pre>
project_name [project_name]: <b>bus_number</b>
repo_name [bus_number]: <b>↵</b>
module_name [src]: <b>↵</b>
author_name [Your name (or your organization/company/team)]: <b>Kjell Wooding</b>
description [A short description of this project.]: <b>Reproducible Data Science</b>
Select open_source_license:
1 - MIT
2 - BSD-2-Clause
3 - Proprietary
Choose from 1, 2, 3 [1]: <b>↵</b>
s3_bucket [[OPTIONAL] your-bucket-for-syncing-data (do not include 's3://')]: <b>↵</b>
aws_profile [default]: <b>↵</b>
Select virtualenv:
1 - conda
2 - virtualenv
Choose from 1, 2 [1]: <b>↵</b>
Select python_interpreter:
1 - python3
2 - python
Choose from 1, 2 [1]: <b>↵</b>
</pre>

4. Create your development environment:
```
cd bus_number
make create_environment
conda activate bus_number         # or `source activate bus_number`
make requirements
git init
```
That's it! You're ready to go.

## The Reproducible Data Science Process
### How do you spend your "Data Science" time?
A typical data science process looks something like this:
* Munge: Fetch, process data, do EDA
* Science: Train models, Predict, Transform data
* Deliver: Analyze, summarize, publish

Usually, the reproducible parts are in the **science** (and sometimes, but not always, the **deliver**) part of the process.
<img src="references/charts/munge-supervised.png" alt="Typical Data science Process" width=500/>

That seems like a bad idea, since the vast majority of the work is in the **munge** step. In some cases, it looks more like this:
<img src="references/charts/munge-unsupervised.png" alt="Typical Data science Process" width=500/>

We're going to try to improve this to a process that is **reproducible from start to finish**. 



## Data Science is a DAG
DAG = Directed Acyclic Graph. 

Because the graph has no cycles, following the dependencies always terminates: the process eventually stops. (This is a good thing!)

It also means we can use a super old, but incredibly handy tool to implement this workflow: `make`.

### Make, Makefiles, and the Data Flow
We use a `Makefile` to organize and invoke the various steps in our Data Science pipeline.
You have already used this file when you created your virtual environment in the first place:
```
make create_environment
```
Here are the steps we will be working through in this tutorial:
<img src="references/cheat_sheet.png" alt="Reproducible Data Science Workflow" width="400"/>

A [PDF version of the cheat sheet](references/cheat_sheet.pdf) is also available.



### ASIDE: What's my make target doing?
If you are ever curious which commands a `make` target will invoke (including those of any dependencies it triggers), use `make -n`, which prints the commands without executing them:

In [None]:
%%bash
cd .. && make -n requirements

We use a cute **self-documenting Makefile trick** (borrowed from `cookiecutter-datascience`) to make it easy to document the various targets that you add. This documentation is printed when you type a plain `make`:

In [None]:
%%bash
cd .. && make

### Under the Hood: The Format of a Makefile

```
## Comment to appear in the auto-generated documentation
thing_to_build: space separated list of dependencies
	command_to_run            # there is a tab before this command.
	another_command_to_run    # every line gets run in a *new shell*
```
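To make the `##` convention concrete, here is a minimal Python sketch of the idea: collect the `## ` comment lines that sit directly above a target and pair them with that target's name. This is only an illustration, not the recipe the generated Makefile actually uses; `makefile_help`, `TARGET_RE`, and the sample Makefile text are all hypothetical names and content invented for this example.

```python
import re

# Hypothetical helper: an illustration of the "## comment above a target"
# convention, NOT the recipe the project Makefile really uses.
TARGET_RE = re.compile(r"^([A-Za-z0-9_.-]+)\s*:(?!=)")

def makefile_help(text):
    """Return {target: description} for targets preceded by '## ' comments."""
    docs = {}
    pending = []  # '## ' lines seen since the last non-comment line
    for line in text.splitlines():
        if line.startswith("## "):
            pending.append(line[3:].strip())
            continue
        match = TARGET_RE.match(line)
        if match and pending:
            docs[match.group(1)] = " ".join(pending)
        pending = []  # any other line breaks the comment/target pairing
    return docs

# Made-up Makefile text; recipe lines start with a tab, as make requires.
example = (
    "## Fetch raw data\n"
    "raw:\n"
    "\t@echo 'Fetch raw data'\n"
    "\n"
    "## Build datasets from raw data\n"
    "data: raw\n"
    "\t@echo 'Build Datasets'\n"
    "\n"
    "undocumented:\n"
    "\t@echo 'no help text for this one'\n"
)

for target, description in makefile_help(example).items():
    print(f"{target:<20} {description}")
```

Targets without a `##` comment (like `undocumented` here) simply do not show up in the listing, which is a gentle nudge to document every target you add.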



In [None]:
%%file Makefile.test

data: raw
	@echo "Build Datasets"
train_test_split:
	@echo "do train/test split"
train: data transform_data train_test_split
	@echo "Train Models"
transform_data:
	@echo "do a data transformation"
raw:
	@echo "Fetch raw data"


If you see: ```*** missing separator.  Stop.``` it's because you have used spaces instead of **tabs** before your commands. 

In [None]:
%%bash
make -f Makefile.test train

In [None]:
%%file Makefile.test

cycle: cycle_b
	@echo "in a Makefile"
cycle_b: cycle_c
	@echo "have a cycle"
cycle_c: cycle
	@echo "You can't"

In [None]:
%%bash
make -f Makefile.test cycle

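Using a Makefile like this is an easy way to set up a process flow expressed as a Directed Acyclic Graph (DAG). If you do introduce a cycle, as in the second example, GNU `make` warns that the circular dependency has been dropped and carries on with the remaining targets.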
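If it helps to see the dependency resolution spelled out, the following Python sketch walks the same graph as the first `Makefile.test` depth-first, building prerequisites before the target that needs them. It is a simplified model under stated assumptions: `build_order` is a hypothetical helper, it ignores file timestamps entirely, and it raises on a cycle, whereas GNU `make` only warns and drops the circular dependency.

```python
# Dependencies copied from the first Makefile.test above: target -> prerequisites.
DEPENDENCIES = {
    "data": ["raw"],
    "train_test_split": [],
    "train": ["data", "transform_data", "train_test_split"],
    "transform_data": [],
    "raw": [],
}

def build_order(target, dependencies):
    """Depth-first walk: prerequisites first, each target at most once."""
    order = []
    done = set()
    in_progress = set()  # targets on the current path, used to spot cycles

    def visit(name):
        if name in done:
            return
        if name in in_progress:
            # Simplification: make would warn and drop this edge instead.
            raise ValueError(f"cycle detected at {name!r}")
        in_progress.add(name)
        for prerequisite in dependencies.get(name, []):
            visit(prerequisite)
        in_progress.discard(name)
        done.add(name)
        order.append(name)

    visit(target)
    return order

print(build_order("train", DEPENDENCIES))
# ['raw', 'data', 'transform_data', 'train_test_split', 'train']

# The second Makefile.test, where every target depends on the next one.
CYCLE = {"cycle": ["cycle_b"], "cycle_b": ["cycle_c"], "cycle_c": ["cycle"]}
try:
    build_order("cycle", CYCLE)
except ValueError as error:
    print(error)
```

The order for `train` matches the order in which the `echo` lines appear when you run `make -f Makefile.test train`.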

**Note**: We have only scratched the surface here. There are lots of interesting tricks you can do with make.
* http://zmjones.com/make/
* http://blog.byronjsmith.com/makefile-shortcuts.html
* https://www.gnu.org/software/make/manual/


### ASIDE: Our Favourite Python Parts
Why the `python>=3.6` requirement?
* f-strings: Finally, long, readable strings in our code.
* dictionaries: insertion order is preserved! (This is an implementation detail of CPython 3.6, and guaranteed by the language from 3.7 onward.)

Other great tools:
* `pathlib`: Sane, multiplatform path handling: https://realpython.com/python-pathlib/
* `doctest`: Examples that always work: https://docs.python.org/3/library/doctest.html
* `joblib`: Especially the persistence part: https://joblib.readthedocs.io/en/latest/persistence.html
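Here is a small sketch that touches several of these at once: f-strings, ordered dictionaries, `pathlib`, and `doctest`. The function `describe` and the file name `buses.csv` are hypothetical, made up purely for illustration; `joblib` is left out so the example needs only the standard library.

```python
import doctest
from pathlib import Path

def describe(path, counts):
    """Summarise row counts per split for a data file.

    >>> describe(Path("data") / "raw" / "buses.csv", {"train": 80, "test": 20})
    'buses.csv: train=80, test=20 (total 100)'
    """
    # dict iteration follows insertion order, so "train" is listed first.
    parts = ", ".join(f"{name}={n}" for name, n in counts.items())
    return f"{path.name}: {parts} (total {sum(counts.values())})"

# pathlib builds paths with "/" regardless of the host platform.
data_file = Path("data") / "raw" / "buses.csv"  # hypothetical path
print(data_file.suffix)             # .csv
print(data_file.parent.as_posix())  # data/raw

# doctest re-runs the example in the docstring; it prints nothing on success.
doctest.run_docstring_examples(describe, {"describe": describe, "Path": Path})
```

If the docstring example ever drifts from what the function really returns, the last line prints a failure report, which is what keeps such examples honest.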


## ASIDE: Virtual Environments

It's impossible to have Reproducible Data Science without **reproducible code**. If your code, run by someone else (or on a different machine), produces different results, then it is **not** reproducible. To fix this, we need to have a reproducible environment.

In short, we need **virtual environments**. In this case, we're going to use `conda`, as this is the most common choice within this community. Furthermore, we use an `environment.yml` file to tell a user which packages need to be installed to run our code.
    
Two `make` commands ensure that we have the appropriate environment:
* `make create_environment`: for the initial creation of a project-specific conda environment
* `make requirements`: to update our environment to the latest version of the `environment.yml` specs

**Caveat**: Technically speaking, as implemented in this workflow, a `conda` environment is **not reproducible**. Even if you pin a specific version of a package in your `environment.yml`, its dependencies may still resolve to different versions from one install to the next. Other approaches to virtual environments have **lockfiles** that ensure that the environment is completely reproducible (e.g. `pipenv`). This is the **right way** to handle such things, and we are hoping conda catches up quickly.
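To illustrate what this kind of drift looks like, the sketch below compares two snapshots of installed packages, written as simple `name=version` lines, and reports the differences. Everything here is an assumption for the sake of illustration: `parse_pins` and `drift` are hypothetical helpers, and the package versions are made-up sample data, not output from a real environment.

```python
def parse_pins(lines):
    """Turn 'name=version' lines into a {name: version} dict."""
    pins = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        name, _, version = line.partition("=")
        pins[name] = version
    return pins

def drift(old, new):
    """Return {name: (old_version, new_version)} for packages that differ."""
    names = sorted(set(old) | set(new))
    return {n: (old.get(n), new.get(n)) for n in names if old.get(n) != new.get(n)}

# Made-up snapshots of two environments built from the same environment.yml.
# Assume only `pandas` is pinned there; `numpy` arrives as a dependency.
machine_a = parse_pins(["pandas=0.23.4", "numpy=1.15.2"])
machine_b = parse_pins(["pandas=0.23.4", "numpy=1.15.4"])

print(drift(machine_a, machine_b))
# {'numpy': ('1.15.2', '1.15.4')}
```

A lockfile amounts to storing one such complete snapshot and installing exactly that, so the comparison above would always come back empty.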
