# Docker and Jupyter Notebooks for Reproducible Research

<b>Goal</b>: To understand what Docker is and how it can be used with Jupyter notebooks for reproducible research.

Docker is a tool that creates high-performance, shareable, reproducible computational environments. Jupyter notebooks are tools for interactive analysis that interweave prose, code, and results. Together, Docker and Jupyter notebooks are best-of-breed methods to create research that is reproducible.

In [None]:
#Imports for running this presentation live

from ipywidgets import interact, interactive
from IPython.display import clear_output, display, HTML, YouTubeVideo

import numpy as np

from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib.colors import cnames
from matplotlib import animation

%matplotlib inline

!docker info
!docker load -i busybox.dockerarchive.tar

# The Problem

Even though computers are often considered deterministic, **computational software is a rapidly evolving and changing landscape**. Libraries are constantly adding new features and fixing issues. 

<img src="Data/PythonStoryline.svg" width=700>

Image source: http://www.michaelogawa.com/research/storylines/

Even libraries with the strictest backwards-compatibility policies can **change in significant ways**.
<img src="Data/BackwardsCompatibility.png" width="600px">

Image source: http://www.bonkersworld.net/backwards-compatibility/

A **reproducible computational environment** has a *sufficiently consistent state for the computational task at hand*.

For example, this can consist of

- a similar CPU instruction set
- libraries and executables available with a specific version and configuration options
- a specific version of a given compiler
- a specific version of a libc implementation
- a specific version of the C++ standard library
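
As a rough illustration of the list above, the following sketch records a few of these properties from inside a running Python session. The function name `environment_snapshot` and the default package list are assumptions made for this example; it only captures what the Python standard library can report and is not a complete description of an environment.

```python
import json
import platform
from importlib import metadata


def environment_snapshot(packages=("numpy", "matplotlib")):
    """Collect a few facts that characterize the current computational environment."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed here
    libc_name, libc_version = platform.libc_ver()
    return {
        "machine": platform.machine(),           # CPU architecture
        "system": platform.system(),             # operating system name
        "python": platform.python_version(),
        "compiler": platform.python_compiler(),  # compiler used to build Python
        "libc": f"{libc_name} {libc_version}".strip(),
        "packages": versions,
    }


print(json.dumps(environment_snapshot(), indent=2))
```

Saving this output next to an analysis makes it easier to notice when two machines differ, but it does not recreate the environment; that is the gap the rest of this talk addresses.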

## Close But Not Good Enough

### Source code

Does not include:

- Compiler
- Hardware it was built on
- How it is configured
- Package dependencies
- Run-time environment
- How to run it

<img src="Data/ConfusedCat.jpg" width="400px">

Image source: https://www.youtube.com/watch?v=g1LgVfV5_ZQ


### Package managers and distributions

- There is not a consensus on *the* package manager
- Packages become unsupported over time
- What to do if a required library is not packaged?

### Virtual machines (VMs)

- Inefficient utilization of computational resources

<img src="Data/CarJam.jpg">

Image source: http://time-az.com/images/2014/02/20140203carjam.jpg

# Enter Linux Containers

![Docker logo](Data/DockerLogo.png)

[Linux container systems](http://www.infoworld.com/article/2938638/application-virtualization/docker-donates-its-container-specs-for-opc-open-standard.html), like Docker, are a new type of tool to easily build, ship, and run reproducible binary applications.

It is "good enough" for a reproducible computational environment.

In this talk, we will introduce Docker from the perspective of a scientific research software engineer.  We will


- Generate an understanding of what Docker is by comparing it to existing technologies.

- Give an introduction to basic Docker concepts.

- Describe how Docker fits into the scientific analysis workflow with Jupyter notebooks.

# Understanding Docker

### Not just this cute whale thing

Docker is an open-source engine that automates the deployment of any application as a **lightweight**, **portable**, **self-sufficient container** that will run virtually anywhere.




In [2]:
!docker run --rm busybox sh -c 'echo "Hello Docker World!"'

Hello Docker World!


### Docker is a combination of a:

1. **Sandboxed chroot**
2. **Copy on write filesystem**
3. **Distributed VCS for binaries**

## Sandboxed chroot

Docker works with images that **consume minimal disk space** and are **versioned**, **archivable**, and **shareable**. Executing applications in these images does not require dedicated resources and is **high performance**.

It works with **containers** as opposed to **virtual machines** (VMs).

<img src="Data/DockerVM.jpg" width="600">

In [4]:
%time !docker run --rm busybox sh -c 'echo "Hello Docker World!"'

Hello Docker World!
CPU times: user 8 ms, sys: 0 ns, total: 8 ms
Wall time: 950 ms


A Docker container is similar to running an application in a *chroot*, but it also sandboxes processes and the network stack using two Linux kernel features:

* **Namespaces**: isolate processes, networking, messaging, file systems, and hostnames
* **Cgroups** (control groups): group processes and limit their CPU, memory, and I/O resources

<img src="Data/Chroot.png" width="600px">
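
On a Linux host, the kernel exposes the namespaces a process belongs to as symbolic links under `/proc/<pid>/ns`. The sketch below lists them for the current process; the helper name `list_namespaces` is an assumption made for this example, and it returns an empty dictionary on systems without that directory. Two processes share a namespace when their identifiers match, so running this on the host and inside a container should typically show different values.

```python
import os


def list_namespaces(pid="self"):
    """Return {namespace_type: identifier} for a process, or {} if unavailable."""
    ns_dir = f"/proc/{pid}/ns"
    namespaces = {}
    try:
        entries = sorted(os.listdir(ns_dir))
    except OSError:
        return namespaces  # not Linux, or /proc is not mounted
    for entry in entries:
        try:
            # Each entry is a symlink whose target names the namespace type and inode.
            namespaces[entry] = os.readlink(os.path.join(ns_dir, entry))
        except OSError:
            continue  # skip entries we are not allowed to read
    return namespaces


for name, ident in list_namespaces().items():
    print(f"{name:10s} {ident}")
```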

## Copy on Write Filesystem

**Union file systems**, or UnionFS, are file systems that operate by **creating layers**, making them very **lightweight** and **fast** while **saving disk space**.

Docker can make use of several storage drivers to implement layers, not all of which are true union file systems, including:

- AUFS
- btrfs
- vfs
- DeviceMapper

<table border="0">
<tr>
<th><img src="Data/LayerCake.jpg" width="300px"></th>
<th><img src="Data/DockerFilesystems.svg" width="400px"></th>
</tr>
</table>
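
The layering idea can be sketched with Python's `collections.ChainMap`, which looks keys up through a stack of dictionaries and sends all writes to the first one. This is only an analogy: the layer contents and the `start_container` helper are made up for the example, and Docker's storage drivers are not implemented this way.

```python
from collections import ChainMap

# Read-only image layers, modeled as dicts mapping path -> content.
base_layer = {"/bin/sh": "busybox shell", "/etc/hostname": "base"}
data_layer = {"/Data/Chroot.png": "png bytes"}


def start_container(*image_layers):
    """Stack a fresh, empty writable layer on top of the shared read-only layers."""
    return ChainMap({}, *image_layers)


container_a = start_container(data_layer, base_layer)
container_b = start_container(data_layer, base_layer)

# A write lands only in container_a's top layer (copy on write).
container_a["/etc/hostname"] = "container-a"

print(container_a["/etc/hostname"])  # value from container_a's writable layer
print(container_b["/etc/hostname"])  # still the value from the shared base layer
print(container_a.maps[0])           # everything container_a has changed so far
```

Both containers share the same lower layers, so starting a second container costs only one empty dictionary, which mirrors why layered images save disk space.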


## Distributed VCS for binaries

### Docker is like Git for binaries



In [5]:
!docker search itk

NAME                                        DESCRIPTION                                     STARS     OFFICIAL   AUTOMATED
pitkley/samba-ad-dc                         Samba4 Active Directory Domain Controller ...   1                    [OK]
insighttoolkit/itk-bin-testing                                                              1                    [OK]
insighttoolkit/itk-bin-examples                                                             1                    [OK]
businessdecision/itkg                       Interakting Docker base images                  1                    [OK]
insighttoolkit/itk-bin                                                                      1                    [OK]
itkj/cloud-sdk-appengine-go-godep           google/cloud-sdk + GAE/Go SDK + Godep           1                    [OK]
insighttoolkit/itk-dashboard                                                                0                    [OK]
chitkiu1/rocket.chat                       

- Docker images are identified with a hex string or tags
- Interface is `docker <subcommand>`
- `docker push`, `docker pull`, `docker tag`
- `docker save` creates an archivable tarball of an image, and `docker export` creates one from a container's filesystem.
- DockerHub is like GitHub

<img src="Data/DockerHub.png" width="400px">

### Installing

Here's what you need:

- Linux kernel with control groups and namespaces
- Support for a layered filesystem (like AUFS)
- Docker Daemon / Server (written in Go)

<img src="Data/MasonJar.jpg" width="600px">

#### Linux

- Ubuntu 14.04 *or*
- See [Docker installation instructions](http://docs.docker.com/installation/) for distributions with kernel 3.8 or later *or*
- [Kernel configuration instructions](https://wiki.gentoo.org/wiki/LXC)

#### Windows and Mac

[Docker Machine](https://docs.docker.com/machine/overview/)

* easy install of
  - Git Bash
  - VirtualBox
  - Lightweight Linux distribution
  - Docker

* Mac OS X users can use the Docker client from the Mac bash shell
* Comes with the busybox shell, so write your Docker `build.sh` and `run.sh` scripts in Bourne shell

# Docker Concepts

## Image
 
### A read-only file system layer

<img src="Data/DockerFilesystemsBusybox.png" width="600px">

In [6]:
!docker images

REPOSITORY          TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
busybox             latest              65e4158d9625        10 days ago         1.114 MB


## Container

### A modifiable image with processes running in memory, or an exited container with a modified filesystem

<img src="Data/DockerFilesystemsBusybox.png" width="600px">

In [7]:
!docker ps

CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES


In [8]:
!docker run -d busybox sh -c 'sleep 3'

06fcb48ae8f17f18b1727c998c003c697cd2f4d292ca096624485291b617571b


In [10]:
!docker ps

CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES


In [11]:
!docker ps -a

CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS                      PORTS               NAMES
06fcb48ae8f1        busybox             "sh -c 'sleep 3'"   23 seconds ago      Exited (0) 19 seconds ago                       berserk_newton


## Volume

### A directory within one or more containers that bypasses the Union File System

* Data volumes are initialized when a container is created
* Volumes can be shared and reused between containers
* Changes to a data volume are made directly
* Changes to a data volume will not be included when you update an image
* Volumes persist until no containers use them
* Host directories can also be mounted as data volumes

### Why use a data volume?

* Store and share data
* Expose data or code from the host to the Docker computational environment
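
As a minimal sketch of exposing a host directory to a container, the helper below builds the argument list for `docker run` with the `-v host:container` option. The function name `docker_run_with_volume` and its parameters are assumptions made for this example; it only constructs the command, so it can be inspected on a machine without Docker.

```python
import shlex
from pathlib import Path


def docker_run_with_volume(host_dir, container_dir, image, command, read_only=False):
    """Build the argument list for `docker run` with a host directory mounted."""
    host_path = Path(host_dir).resolve()  # use an absolute path for the host side
    mount = f"{host_path}:{container_dir}"
    if read_only:
        mount += ":ro"  # the container can read but not modify the host files
    return ["docker", "run", "--rm", "-v", mount, image, *command]


argv = docker_run_with_volume("Data", "/Data", "busybox", ["ls", "/Data"], read_only=True)
print(shlex.join(argv))
```

Where Docker is installed, the list can be passed to `subprocess.run(argv, check=True)`. Mounting data read-only is one way to keep an analysis from silently altering its inputs.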

## Dockerfile

### A sequence of instructions to generate a Docker image

In [12]:
!mkdir -p docker-ls-data
!cp $PWD/Data/*.png docker-ls-data/

In [13]:
%%writefile docker-ls-data/Dockerfile

FROM busybox
MAINTAINER Matt McCormick <matt.mccormick@kitware.com>
RUN mkdir -p /Data
ADD *.png /Data/
VOLUME /Data
CMD ["/bin/sh", "-c", "ls /Data"]

Overwriting docker-ls-data/Dockerfile


In [14]:
!docker build -t ls-data ./docker-ls-data

Sending build context to Docker daemon 1.252 MB
Step 1 : FROM busybox
 ---> 65e4158d9625
Step 2 : MAINTAINER Matt McCormick <matt.mccormick@kitware.com>
 ---> Running in 38c48dc2d406
 ---> b96fb10c253e
Removing intermediate container 38c48dc2d406
Step 3 : RUN mkdir -p /Data
 ---> Running in cdd5148e8620
 ---> 218e55546268
Removing intermediate container cdd5148e8620
Step 4 : ADD *.png /Data/
 ---> 8995b0829c1c
Removing intermediate container eab99e441323
Step 5 : VOLUME /Data
 ---> Running in a34db940bc7a
 ---> 85d12a9b4a39
Removing intermediate container a34db940bc7a
Step 6 : CMD /bin/sh -c ls /Data
 ---> Running in 071760d6bcae
 ---> b221b5c1a4fe
Removing intermediate container 071760d6bcae
Successfully built b221b5c1a4fe


In [15]:
!docker run --rm ls-data

BackwardsCompatibility.png
Chroot.png
Debian.png
DockerFilesystemsBusybox.png
DockerHub.png
DockerLogo.png
Jupyter.png
Liar.png


# Scientific Research with Docker Notebook

## Graphical Applications and Docker

A **portable Docker image** assumes only that standard CPU, memory, disk, and network resources are available. If an image depends on *local USB devices* or **video card devices**, it will **not run everywhere**.

* Use [IPython / Jupyter Notebooks](http://ipython.org/notebook.html)
* The [docker-opengl](https://github.com/thewtex/docker-opengl-nvidia)  image offers CPU-based OpenGL rendering viewable via an HTML5 VNC client.

<img src="Data/Jupyter.png" width="500px">

## Choosing a base image

* [debian](https://registry.hub.docker.com/_/debian/) - [Most common](https://docs.docker.com/articles/dockerfile_best-practices/) lightweight image with many packages available
* [alpine](https://hub.docker.com/_/alpine/) - Very small image
* [ipython/notebook](https://registry.hub.docker.com/u/ipython/notebook/) - Launches an SSL- and password-enabled IPython notebook
* [jupyter/tmpnb](https://registry.hub.docker.com/u/jupyter/tmpnb/) - Launches "temporary" Jupyter notebook servers
* [continuumio/miniconda](https://registry.hub.docker.com/u/continuumio/miniconda/) - Miniconda installed
* [nixos/nix](https://registry.hub.docker.com/u/nixos/nix/) - Nix package manager installed
* ...
* Make your own

<img src="Data/Debian.png" width="100px">

# Recap and Next Steps

## Docker is


* Sandboxed chroot +

* Incremental, copy on write filesystem +

* Distributed VCS for binaries

## Concepts

* *Image*:  A read-only file system layer

* *Container*: A writable image with processes running in memory, or an exited container with a modified filesystem

* *Volume*: A mounted directory that is not tracked as a filesystem layer

* *Dockerfile*: A sequence of instructions to generate a Docker image

## Scientific Python and Docker

* Not for graphical applications, especially OpenGL 

* Reproducible computational environment for IPython notebook

* Use with Linux-based packaging system of your choice

## Learn more!

* [Interactive Browser-Based Docker Tutorial](https://www.docker.com/tryit/)
* [Docker Documentation](https://docs.docker.com/userguide/)
* [Reproducible Research: Walking the Walk Tutorial](https://reproducible-research.github.io/scipy-tutorial-2014/)
* [IPython DockerHub Repositories](https://registry.hub.docker.com/repos/ipython/)

## Docker vs. LXC

* [LXC](https://linuxcontainers.org/) is a set of tools and API to interact with Linux kernel namespaces, cgroups, etc.
* LXC used to be the default execution environment for Docker
* Docker provides LXC functionality, plus:
  - Portable deployment across machines
  - Application-centric
  - Automatic builds
  - Versioning
  - Component re-use
  - Sharing
  - Tool ecosystem

## Docker vs Rocket

- [Rocket](https://github.com/coreos/rocket) is a container system like Docker developed by CoreOS
- Rocket is not as mature
- Rocket does not use a daemon/client system