# Part 1: An Overview of the Jupyter Ecosystem

---

**January 31st, 2018**


*Ian Allison*

* # [https://ganymede-io.github.io/brochure-site](https://ganymede-io.github.io/brochure-site) <- Brochure website
* # [https://ganymede.syzygy.ca](https://ganymede.syzygy.ca) <- temporary hub

## define: Python, IPython, Jupyter & JupyterHub

All of these tools exist in a hierarchy:

1. **Python** - High level (nice to use!) general purpose programming language (1991)
2. **IPython** - Shell making Python nice to use interactively (frontends!) (2001)
3. **Jupyter** - This works for other languages! (e.g. R for stats) (2014)
4. **JupyterHub** - All this as a published web service (2015)

## synonyms: JuPyteR

  * **Notebook**: The notebook takes inspiration from traditional notebooks
    * A mixture of computation.
    * ...back of the envelope calculations
    * ...notes to yourself
    * ...interactions with APIs

If you're ever feeling short of inspiration, please take a look at [A Gallery of interesting notebooks](https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks) [goo.gl/Lb7HQc].


```
import requests

r = requests.get(
    'http://api.pearson.com/v2/dictionaries/entries?headword=notebook')
r.json()['results'][1]['senses']
```

```
[{'definition': 'a book of plain paper in which you can write notes'}]
```

# Why Jupyter?

Just try it, and you'll see! The [gallery above](https://goo.gl/Lb7HQc) is a great place to start.

 * Mixed notes, code, visualization and interaction 
   - particularly suited to math/science/computing.
 * It's open and it's flexible!
   - It works out of the box
   - It can be twisted, shaped and extended to fit your needs
   - It defines an ecosystem but doesn't try to tie you to it
 * There's a large **open and active community**

## How could we use it? Services

 * [nbgitpuller](https://github.com/data-8/nbgitpuller)
   - Embed links anywhere (Ebooks, blogs, email, help menus ↑)
 * [Kernel Gateway](https://github.com/jupyter/kernel_gateway)
   - Publish notebooks as an API
   - A path to publishing web services
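
An nbgitpuller link is just a URL on the hub with the repository details in the query string. The following is a minimal sketch of building one, assuming the `git-pull` endpoint and the `repo`, `branch` and `urlpath` parameters described in the nbgitpuller documentation (older releases used different names, so check the version installed on your hub). The hub address and repository are placeholders, and `nbgitpuller_link` is a hypothetical helper, not part of nbgitpuller.

```python
from urllib.parse import urlencode


def nbgitpuller_link(hub_url, repo, branch="master", urlpath=None):
    """Build a link that pulls `repo` into the user's home, then redirects."""
    params = {"repo": repo, "branch": branch}
    if urlpath is not None:
        # Where to land after the pull, e.g. a notebook inside the repo.
        params["urlpath"] = urlpath
    # `user-redirect` lets the hub fill in whichever user clicks the link.
    # The "/hub/..." prefix is an assumption; some hubs are mounted elsewhere.
    return hub_url.rstrip("/") + "/hub/user-redirect/git-pull?" + urlencode(params)


link = nbgitpuller_link(
    "https://hub.example.org",                      # placeholder hub
    "https://github.com/example/course-notebooks",  # placeholder repository
    urlpath="tree/course-notebooks/intro.ipynb",
)
print(link)
```

The resulting string can be pasted into an ebook, a blog post or an email; anyone with an account on the hub who clicks it should end up in their own copy of the notebook.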

## How could we use it? Python Modules

  * You _could_ use the requests module to talk to the Twitter API
    - You would need to figure out the authentication parameters and process for OAuth
    - You would have to construct arguments to the endpoints manually
    - You would have to dig through lots of irrelevant boilerplate in your results
  * Or you could use the [tweepy module](http://www.tweepy.org/)
  * There are tens of thousands of modules for almost any purpose (https://pypi.python.org/pypi)
    - `!pip install --user tweepy`


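```
import tweepy

# consumer_key, consumer_secret, access_token and access_token_secret
# are the credentials issued for your Twitter app; define them first.
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)

# Print the text of each tweet on the authenticated user's home timeline.
public_tweets = api.home_timeline()
for tweet in public_tweets:
    print(tweet.text)
```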

# Python Modules - continued

If you can't find a module that does what you want, you can write one!

  * [A minimal Python module demo](http://python-packaging.readthedocs.io/en/latest/minimal.html)
  * [mobilechelonian](https://github.com/takluyver/mobilechelonian)
  * e.g. You could write a calculus module to teach students limits
    - (or you could try the [sympy](http://docs.sympy.org/latest/tutorial/calculus.html) module)
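
To give a feel for how small such a teaching module could be, here is a sketch using only the standard library. The function names `approach` and `estimate_limit` are made up for this illustration, and the method is a numerical heuristic (it compares values just left and right of the point), not a symbolic computation like sympy's.

```python
import math


def approach(f, a, steps=6):
    """Tabulate f just left and right of `a` for shrinking offsets h."""
    rows = []
    for k in range(1, steps + 1):
        h = 10.0 ** -k
        rows.append((h, f(a - h), f(a + h)))
    return rows


def estimate_limit(f, a, steps=6, tol=1e-4):
    """Estimate lim x->a f(x); return None if the two sides disagree."""
    _, left, right = approach(f, a, steps)[-1]
    if abs(left - right) > tol:
        return None
    return (left + right) / 2


# sin(x)/x is undefined at 0, but both sides approach the same value.
sinc_limit = estimate_limit(lambda x: math.sin(x) / x, 0.0)
# |x|/x jumps from -1 to 1 at 0, so there is no two-sided limit.
jump_limit = estimate_limit(lambda x: abs(x) / x, 0.0)

# A table like this is what a student would look at in the notebook.
for h, left, right in approach(lambda x: math.sin(x) / x, 0.0, steps=3):
    print(f"h={h:<6} left={left:.8f} right={right:.8f}")
print(sinc_limit, jump_limit)
```

A heuristic like this can be fooled (for example by functions that blow up symmetrically, such as 1/x² at 0), which is itself a useful discussion point with students.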

## Python Modules - example

* [cs103](https://pypi.python.org/pypi/cs103) written by UBC
```
from cs103 import draw, circle
draw(circle(100, 'solid', 'red'))
```

# JupyterLab

This is the new shiny interface for working inside Jupyter.

* Improved access to all of the tools
  - notebooks, terminals become movable tabs
  - file browser is much easier to use
  - integrations (e.g. git, Google Drive) are being developed

<div style="display: block; margin-left:auto;"><img src="./jlab-ss.png" style="display: block; margin-left:auto;margin-right:auto;" width="500px" /></div>

* [Try it out](https://ganymede.syzygy.ca/jupyter/user-redirect/lab)

Don't worry, notebooks are still the main unit of computation; JupyterLab just makes using them easier.

# Part 2: Behind the Curtain

The slides beyond this point describe some of the technical details of how we're intending to do this for thousands of students. They're important, but in keeping with the module/encapsulation theme, most of the people here should (figuratively) be able to say

```
import <OUR PROJECT NAME GOES HERE!>
```

and forget about it! 

<div style="text-align:center;"><img src="./vm-overview.png" width="800px" /></div>

# [cybera.syzygy.ca](https://cybera.syzygy.ca)

cybera.syzygy.ca has seen around 700 users since it started up (about a year ago). It typically has 10-60 users active at any one time.

  * c8m32 flavour (8 VCPU/32GB)
  * 2 Data volumes
    * HOME - 100GB ZFS based
    * DOCKER - 100GB LVM2 Devicemapper based

That's it. A single virtual machine running inside the [Cybera Rapid Access Cloud](https://www.cybera.ca/services/rapid-access-cloud/).

# Limitations

The ultimate goal is to increase the total number of **active** users. A user is considered active based on the last time the hub communicated with their web browser. After ~60 minutes without contact we declare a container inactive and shut it down.
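
The activity rule can be sketched in a few lines. This is not the hub's actual culling code, only an illustration of the comparison it makes; the user names and timestamps are invented, and `idle_users` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

IDLE_TIMEOUT = timedelta(minutes=60)  # the ~60 minute threshold from the text


def idle_users(last_activity, now, timeout=IDLE_TIMEOUT):
    """Return the names whose last browser contact is older than `timeout`."""
    return sorted(
        name for name, seen in last_activity.items() if now - seen > timeout
    )


now = datetime(2018, 1, 31, 12, 0, tzinfo=timezone.utc)
# Hypothetical users and timestamps, for illustration only.
last_activity = {
    "alice": now - timedelta(minutes=5),
    "bob": now - timedelta(minutes=61),
    "carol": now - timedelta(hours=3),
}
to_stop = idle_users(last_activity, now)
print(to_stop)
```

Only the containers belonging to the returned users would be shut down; their home directories are on a separate volume and are not affected.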

Empirically, we've run into two main limitations for increasing the number of active users:

 * _Memory & CPU_
 * _Disk IO_
 
Both of these are currently bound to what a single VM can satisfy. Currently, our maximum user count for a tuned VM is around 200-300 users. With local storage improvements we can reasonably expect somewhere around 300-500 to be possible in a single VM. This number isn't as easy to pin down with non-local IO.

## Memory and CPU - cgroups

We use the cgroups mechanism to manage CPU and memory for users.
 
  * Each new container starts with the following 
     - memlimit/memswaplimit: 2g/2g (2GB of RAM)
     - cpuquota/cpuperiod: 100000/100000 (~1 VCPU)*
  * This is applied to child processes in the user's session
  * Under memory pressure the (VM) kernel's OOM killer will kill processes!
     - This can be messy: kill the hub? or the proxy?
     - Somewhat tunable (e.g. by cgroup slice or process type)
     
\* _N.B. CPU scheduling is tricky; these numbers refer to arbitrary scheduling periods and set bounds on scheduling priority under load._
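
A quick back-of-the-envelope check shows why the OOM killer matters. The sketch below uses the limits listed above, the 32GB of the c8m32 flavour and 200 users (the lower end of the single-VM estimate); it ignores memory used by the OS, the hub and the proxy, and the helper names are invented for this example.

```python
GIB = 1024 ** 3


def cpu_limit(cpu_quota_us, cpu_period_us):
    """CPUs' worth of run time a cgroup may use per scheduling period."""
    return cpu_quota_us / cpu_period_us


def max_containers_at_limit(vm_memory_bytes, mem_limit_bytes):
    """How many containers could sit at their memory limit at once."""
    return vm_memory_bytes // mem_limit_bytes


def oversubscription(n_containers, mem_limit_bytes, vm_memory_bytes):
    """Sum of all memory limits divided by the memory the VM really has."""
    return n_containers * mem_limit_bytes / vm_memory_bytes


cpus = cpu_limit(100000, 100000)                       # quota/period from the list
at_limit = max_containers_at_limit(32 * GIB, 2 * GIB)  # 32GB VM, 2GB per container
ratio = oversubscription(200, 2 * GIB, 32 * GIB)       # 200 containers on one VM
print(cpus, at_limit, ratio)
```

The limits are caps rather than reservations: 200 containers capped at 2GB each add up to 12.5 times the VM's memory, which only works because most notebooks use far less than their cap most of the time.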

## Disk IO    
   - Home (mapped through to containers as volumes)
     - 1GB default storage quotas are enforced by ZFS
     - _rarely_ we are asked to increase a quota for someone
     - IO patterns are typical for home dir usage
     - BUT! quirks can cause issues with load (min_arc_size)
   - Docker (container images and ephemeral storage)
     - IO takes a whacking at initialization
     - Gold standard is local (SSD or NVMe)
     - Transitioning to Overlay2

# Scaling

All of our current ideas for scaling boil down to horizontally scaling the containers: sooner or later you will run out of space on a single VM. The obvious solution is to use more than one! But that means

  * Distributed filesystems (e.g. gluster)
    - _must support quotas_
    - _must be performant_
  * More complicated networking (e.g. Calico)

Main options: **Kubernetes**, Docker Swarm, OpenStack Containers.

## [Kubernetes](https://kubernetes.io)

A Google project to automate deployment, scaling and management of containerized applications.

  * Early leader - used with JupyterHub by [data-8/UC Berkeley](http://data8.org/)
  * Containers become "pods" and Kubernetes manages scheduling
  * k8s cluster of Docker engines (1 per node) + networking glue
  * Spawner talks to kubernetes:
    - spawn pod somewhere inside cluster
    - return internal address to jupyterhub
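
The spawn handshake can be modelled with a toy example. Nothing here talks to a real cluster: `ToyCluster`, `ToySpawner`, the node names and the pod network addresses are all invented, and a real JupyterHub spawner does this asynchronously through the Kubernetes API. The sketch only shows the shape of the contract: the scheduler picks a node, and the spawner hands an internal `(ip, port)` back to the hub.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    subnet: str  # made-up pod network prefix, e.g. "10.1.0"
    pods: list = field(default_factory=list)


class ToyCluster:
    """Stand-in for the Kubernetes scheduler: picks the least loaded node."""

    def __init__(self, nodes):
        self.nodes = nodes
        self._host_ids = itertools.count(2)

    def create_pod(self, pod_name):
        node = min(self.nodes, key=lambda n: len(n.pods))
        ip = f"{node.subnet}.{next(self._host_ids)}"
        node.pods.append(pod_name)
        return ip


class ToySpawner:
    """Stand-in for a hub spawner: start a server, report its address."""

    port = 8888

    def __init__(self, cluster, username):
        self.cluster = cluster
        self.username = username

    def start(self):
        ip = self.cluster.create_pod(f"jupyter-{self.username}")
        # The hub only needs to know where to proxy this user's traffic.
        return ip, self.port


cluster = ToyCluster([Node("node1", "10.1.0"), Node("node2", "10.1.1")])
addresses = [ToySpawner(cluster, u).start() for u in ("alice", "bob", "carol")]
print(addresses)
```

The hub never needs to know which node a pod landed on; it just registers the returned address with the proxy, which is what makes adding nodes transparent to users.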

Kubernetes can be complicated to set up (mostly the networking), but is "battle tested". Definitely the winner in terms of current mind share.

<div style="display: block; margin-left:auto;"><img src="./kubernetes_overview.png" style="display: block; margin-left:auto;margin-right:auto;" width="800px" /></div>

## Kubernetes:  Our current deployment

We've started experimenting already. 

* 3 nodes deployed on OpenStack via Terraform
* Kubernetes cluster (2\*master/nodes + 1\*nodeonly) 
* GlusterFS for backing storage
* Calico (BGP) networking
* [JupyterHub Helm chart](http://zero-to-jupyterhub.readthedocs.io/en/v0.5-doc/) install partially complete (service creation isn't quite working yet)

```
 kubectl get pods --all-namespaces=true
NAMESPACE     NAME                                    READY     STATUS    RESTARTS   AGE
j8s           hub-78fb688b89-gh96q                    1/1       Running   0          49d
j8s           proxy-5d6cbd7b97-x6wct                  2/2       Running   0          49d
kube-system   calico-node-7kfz4                       1/1       Running   0          54d
kube-system   calico-node-jg48d                       1/1       Running   5          54d
kube-system   calico-node-s7dnk                       1/1       Running   0          54d
kube-system   kube-apiserver-master1                  1/1       Running   0          54d
kube-system   kube-apiserver-master2                  1/1       Running   0          54d
kube-system   kube-controller-manager-master1         1/1       Running   2          54d
kube-system   kube-controller-manager-master2         1/1       Running   3          54d
kube-system   kube-dns-cf9d8c47-fxx4r                 3/3       Running   0          54d
kube-system   kube-dns-cf9d8c47-z2rph                 3/3       Running   0          54d
kube-system   kube-proxy-master1                      1/1       Running   0          54d
kube-system   kube-proxy-master2                      1/1       Running   0          54d
kube-system   kube-proxy-node1                        1/1       Running   0          54d
kube-system   kube-scheduler-master1                  1/1       Running   4          54d
kube-system   kube-scheduler-master2                  1/1       Running   4          54d
kube-system   kubedns-autoscaler-86c47697df-bw4nz     1/1       Running   0          54d
kube-system   kubernetes-dashboard-85d88b455f-4llhj   1/1       Running   0          54d
kube-system   nginx-proxy-node1                       1/1       Running   0          54d
kube-system   tiller-deploy-546cf9696c-vslvl          1/1       Running   0          54d
```

## Docker Swarm

[Docker Swarm](https://docs.docker.com/engine/swarm/) is the "native" Docker solution.

  * Swarm mode is already built into Docker installs
  * Used to be prohibitively complicated, but some roles (e.g. k-v nodes) have been absorbed
  * Basically the Docker engine component itself spans multiple virtual machines
  * Spawning should be simpler - it's just like the docker we're already doing!

We investigated this a while back and got bogged down setting up required services. At the time there were also fewer distributed filesystem options (this seems to be changing now). **This is plan B.**

## Docker Swarm
<div style="display: block; margin-left:auto;"><img src="./swarm-diagram.png" style="display: block; margin-left:auto;margin-right:auto;" width="800px" /></div>

## Native OpenStack Containers

In principle we can run containers directly on OpenStack just like we run VMs.

  * OpenStack can run containers natively via Magnum
  * Would need us to write a new spawner to talk to the Magnum API
  * _Has anyone actually experimented with this?_
  
**Caveat: This isn't something I've looked at at all yet.**


# End