# Basic Data Pipelines with ![](imgs/pin_small.png) and ![](imgs/docker_small.png)
## From Zero to ETL with Minimal Fuss
#### Gordon Inggs, Data Scientist, *in his private capacity*

# Talk Outline
1. Why do this to yourself?
2. What do we get from using Airflow and Docker?
3. How to do this right now?
4. Why is this a bad idea?
5. How to do this better?

# 1. Why do you want a data ~~engineer~~ pipeline?
![](imgs/whyyyy.gif)

* You need to Extract, Transform and Load (ETL) data on a regular schedule or in real time.
* Separation of concerns AKA modularity
* (**awkward shoehorn**) Control vs Data flow.
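
To make the separation of concerns concrete, here is a minimal sketch of an ETL job split into three independent steps. The CSV layout, the `sensor` and `reading` column names, and the list standing in for a data store are all assumptions made for illustration.

```python
import csv
import io


def extract(raw_csv):
    """Extract: parse raw CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_csv)))


def transform(rows):
    """Transform: drop rows without a reading and convert the rest to floats."""
    return [
        {"sensor": row["sensor"], "reading": float(row["reading"])}
        for row in rows
        if row["reading"]
    ]


def load(rows, store):
    """Load: append the cleaned rows to a destination (here, a plain list)."""
    store.extend(rows)
    return len(rows)


# Hypothetical input: the second row has no reading and is filtered out.
raw = "sensor,reading\na,1.5\nb,\nc,3.0\n"
warehouse = []
loaded = load(transform(extract(raw)), warehouse)
print(loaded, warehouse)
```

Because each step only depends on the output of the one before it, any of them can be swapped out or rerun on its own, which is what a pipeline tool then schedules for you.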

![](./imgs/assembly_line.gif)

The approach in this talk is a good way to get from 0 to ~5 data pipelines.

# 2.1 Why do you want to use Airflow?
![](imgs/airblowing.gif)

## What is Airflow?
![](imgs/airflow-example-dag.png)

## Visibility
![](imgs/airflow_visibility.png)

## Visibility (again)
![](imgs/airflow_visibility_II.png)

## Conditional Scheduling
![](imgs/airflow-example-dag2.png)
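
The idea behind conditional scheduling can be sketched without Airflow itself: run tasks in dependency order, and let each task's rule decide whether it runs based on what happened upstream. This is a standard-library illustration, not Airflow's API. The rule names `all_success` and `one_failed` are borrowed from Airflow's trigger rules, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def run_dag(dependencies, callables, trigger_rules):
    """Run tasks in dependency order and return the final state of each."""
    states = {}
    for task in TopologicalSorter(dependencies).static_order():
        upstream = [states[dep] for dep in dependencies[task]]
        rule = trigger_rules.get(task, "all_success")
        if rule == "all_success":
            should_run = all(state == "success" for state in upstream)
        elif rule == "one_failed":
            should_run = any(state == "failed" for state in upstream)
        else:
            raise ValueError("unknown trigger rule: {}".format(rule))

        if not should_run:
            states[task] = "skipped"
            continue
        try:
            callables[task]()
            states[task] = "success"
        except Exception:
            states[task] = "failed"
    return states


def failing_transform():
    raise RuntimeError("bad input")


# Each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "alert": {"transform"},
}
callables = {
    "extract": lambda: None,
    "transform": failing_transform,
    "load": lambda: None,
    "alert": lambda: None,
}
states = run_dag(dependencies, callables, {"alert": "one_failed"})
print(states)
```

Here `load` is skipped because `transform` failed, while `alert` runs precisely because of that failure.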

# 2.2 Why do you want to use Docker (inside Airflow)?
![](imgs/flopping.gif)

## Resource-efficient Isolation
![](imgs/docker_isolation.png)

## Dependency Closure
![](imgs/docker-closure.jpg)

# 3. How to do this Right Now
![](imgs/rightnow.gif)

## What you need:
* Your *ahem* bare-metal, FOSS data science environment.

* Airflow Checklist:
  * `LocalExecutor`.
  * Separate DB for Airflow state - PostgreSQL, MySQL, etc.
  * Docker Python SDK
  * Read/write access to Docker socket (`/var/run/docker.sock`)
  
[Large public sector organisation Airflow Docker image](https://github.com/cityofcapetown/airflow_docker_datapipelines).
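
The last item on the Airflow checklist is easy to get wrong, so a quick check before the first run can save some confusion. A minimal sketch, assuming the Airflow worker runs as the same user as this script; the function name is made up for illustration.

```python
import os


def docker_socket_usable(path="/var/run/docker.sock"):
    """Return True if `path` exists and this process may read and write it."""
    # os.access returns False for a missing path, so no separate check is needed.
    return os.access(path, os.R_OK | os.W_OK)


print(docker_socket_usable())
```

If this prints `False`, the usual suspects are that the socket is not mounted into the Airflow container, or that the user is not in the group that owns the socket.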

* Docker Checklist:
  * Docker images with dependencies
  * Scripts to run tasks inside images:
    ```bash
    #!/usr/bin/env bash

    PYTHONPATH="$PIPELINE_DIR" python3 "$PIPELINE_DIR"/my_module/my_task.py
    ```

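```python
import uuid

import docker  # Docker SDK for Python

# Placeholder values (assumptions); the real ones live in your DAG file.
PIPELINE_PREFIX = "my-pipeline"
docker_run_args = {"image": "my-pipeline-image", "remove": True}
docker_client = docker.from_env()  # configured from the environment


def pipeline_task(task_name):
    # A unique name per run avoids clashes between concurrent containers.
    docker_name = "-".join([
        PIPELINE_PREFIX,
        task_name,
        str(uuid.uuid4())
    ])

    run_args = docker_run_args.copy()
    docker_command = "bash -c '/run_{}.sh'".format(task_name)

    # Blocks until the container exits, then returns its logs as bytes.
    operation_run = docker_client.containers.run(
        name=docker_name,
        command=docker_command,
        **run_args
    )

    return operation_run.decode("utf-8")
```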
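
The function above expects a `docker_run_args` dictionary. One way to build it is sketched below. The image name, the paths and the `build_run_args` helper are hypothetical, and mounting the pipeline code read-only (rather than baking it into the image) is an assumption. The keys themselves (`image`, `environment`, `volumes`, `remove`) are keyword arguments of `containers.run` in the Docker SDK for Python.

```python
def build_run_args(image, pipeline_dir, extra_env=None):
    """Build keyword arguments for the Docker SDK's `containers.run`."""
    # PIPELINE_DIR is what the run script inside the image reads.
    environment = {"PIPELINE_DIR": pipeline_dir}
    environment.update(extra_env or {})
    return {
        "image": image,
        "environment": environment,
        # Mount the pipeline code read-only at the same path in the container.
        "volumes": {pipeline_dir: {"bind": pipeline_dir, "mode": "ro"}},
        # Remove the container once it exits, so finished tasks don't pile up.
        "remove": True,
    }


# Hypothetical image name and path.
docker_run_args = build_run_args("my-pipeline-image:latest", "/opt/pipeline")
print(docker_run_args)
```

Keeping this in one helper means every task in the pipeline runs with the same image and mounts, and only the command differs.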

# 4. Why is this a Bad Idea?
![](imgs/badidea.gif)

## Problems we've run into
* Loading DAGs into Airflow

* Scaling beyond one Docker host

* Weird performance problems:
    * Noisy neighbours
    * Heavy load on Docker daemon
    * containerisation $\neq$ virtualisation
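
One partial mitigation for noisy neighbours, offered here as an assumption rather than something from the talk, is to cap what each task container may use. A minimal sketch: `mem_limit` and `nano_cpus` are keyword arguments of `containers.run` in the Docker SDK for Python, while the helper name and default values are made up.

```python
def with_resource_limits(run_args, memory="512m", cpus=1.0):
    """Return a copy of `run_args` with memory and CPU caps added."""
    limited = dict(run_args)
    limited["mem_limit"] = memory
    # The SDK expects CPU limits in units of one billionth of a CPU.
    limited["nano_cpus"] = int(cpus * 1_000_000_000)
    return limited


# Hypothetical base arguments.
base_args = {"image": "my-pipeline-image:latest"}
limited_args = with_resource_limits(base_args, memory="1g", cpus=0.5)
print(limited_args)
```

This does nothing for load on the Docker daemon itself, and since containers still share a kernel it is not the isolation a virtual machine would give you.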

![](imgs/IDK.gif)

# 5. How to do this better?
![](imgs/learning.gif)

## Improving
**Bar to beat: 6 months to a year, no dedicated administration**

* Serverless containers (e.g. AWS Fargate)
* Kubernetes (scaling)

## Different Paradigm
**Warning: Speculative**
* $\mu$Service Architecture - even more modular
* Declarative/reconciliation approach
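
In the same speculative spirit, the declarative approach boils down to stating what should be running and letting a loop work out how to get there. A toy sketch of one reconciliation step follows; the task names, image tags and action names are all hypothetical.

```python
def reconcile(desired, actual):
    """Return the actions needed to move `actual` towards `desired`."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            actions.append(("start", name))
        elif actual[name] != spec:
            actions.append(("restart", name))
    for name in sorted(actual):
        if name not in desired:
            actions.append(("stop", name))
    return actions


# What we declared should run, versus what is actually running.
desired = {"extract": "image:v2", "load": "image:v1"}
actual = {"extract": "image:v1", "cleanup": "image:v1"}
actions = reconcile(desired, actual)
print(actions)
```

A real system would run this comparison repeatedly and apply the actions, instead of a scheduler issuing imperative "run this now" commands.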

# Thank You!

![](imgs/thankyou.gif)

## Questions?
