

This repo provides Docker containers to run:

  • Spark master and worker(s) for running Spark in standalone mode on dedicated hosts
  • Mesos-enhanced containers for Mesos-mastered Spark jobs
  • IPython web interface for interacting with Spark or Mesos master via PySpark

Please see the accompanying blog posts for the technical details and motivation behind this project.


Docker containers provide a portable and repeatable method for deploying the cluster:

(Architecture diagram: hadoop-docker-client connections)

CDH5 Tools and Libraries

  • HDFS, HBase, Hive, Oozie, Pig, Hue

Python Packages and Modules

  • Pattern, NLTK, Pandas, NumPy, SciPy, SymPy, Seaborn, Cython, Numba, Biopython, Rmagic, 0MQ, Matplotlib, Scikit-Learn, Statsmodels, Beautiful Soup, NetworkX, LLVM, Bokeh, Vincent, MDP


Option 1. Mesos-mastered Spark Jobs

  1. Install Mesos with the Docker Containerizer and Docker Images: Install a Mesos cluster configured to use the Docker containerizer, which enables the Mesos slaves to execute Spark tasks within Docker containers.
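As a rough sketch of what "configured to use the Docker containerizer" means on each slave (file locations assume the standard Mesosphere packages; adjust for your installation):

```shell
# On each Mesos slave: enable the Docker containerizer alongside the
# default Mesos containerizer (paths assume the Mesosphere .deb packages)
echo 'docker,mesos' | sudo tee /etc/mesos-slave/containerizers

# Allow extra time for large Docker images to pull before Mesos gives up
# on executor registration
echo '5mins' | sudo tee /etc/mesos-slave/executor_registration_timeout

sudo service mesos-slave restart
```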

    A. End-to-end Installation: The script in mesos/ uses the Python library Fabric to install and configure a cluster according to "How To Configure a Production-Ready Mesosphere Cluster on Ubuntu 14.04". After installation, it also pulls the Docker images that will execute Spark tasks. To use:

    • Update the IP addresses of the Mesos nodes in mesos/. Find the instances to change with:
    grep 'ip-address' mesos/
    • Install/configure the cluster by running the script.

    Optional: build the Docker images from scratch yourself rather than having the script pull them from Docker Hub.

    B. Manual Installation: Follow the general steps in mesos/ to manually install:

    • Install mesosphere on masters
    • Install mesos on slaves
    • Configure zookeeper on all nodes
    • Configure and start masters
    • Configure and start slaves
    • Load the Docker images:
      docker pull lab41/spark-mesos-dockerworker-ipython
      docker pull lab41/spark-mesos-mesosworker-ipython

  2. Run the client container on a client host (replace 'username-for-sparkjobs' and 'mesos-master-fqdn' below):
    ./ username-for-sparkjobs mesos://mesos-master-fqdn:5050
    Note: the client container creates username-for-sparkjobs when started, providing the ability to submit Spark jobs as a specific user and/or to deploy different IPython servers for different users.

Option 2. Spark Standalone Mode

Installation and Deployment - Build each Docker image and run each on separate dedicated hosts

Tip: Build a common/shared host image with all necessary configurations and pre-built containers, which you can then use to deploy each node. When starting each node, you can pass the container run scripts as user data to initialize that container at boot time.
  1. Prerequisites
  • Deploy a Hadoop/HDFS cluster. Spark uses the cluster to distribute analysis of data pulled from multiple sources, including the Hadoop Distributed File System (HDFS). The ephemeral nature of Docker containers makes them ill-suited for persisting long-term data in a cluster. Rather than attempting to store data within the Docker containers' HDFS nodes or mounting host volumes, it is recommended you point this cluster at an external Hadoop deployment. Cloudera provides complete resources for installing and configuring its distribution (CDH) of Hadoop. This repo has been tested using CDH5.
  2. Build and configure hosts

  1. Install Docker v1.5+, the jq JSON processor, and iptables. For example, on an Ubuntu host:


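The Ubuntu example commands did not survive in this copy; a plausible reconstruction (package names are assumptions, and note that the docker.io package in older Ubuntu archives may predate v1.5, in which case use Docker's own apt repository instead):

```shell
# Install Docker, jq, and iptables on an Ubuntu host (assumed package names)
sudo apt-get update
sudo apt-get install -y docker.io jq iptables

# Verify the Docker version meets the v1.5+ requirement
docker --version
```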
  2. Update the Hadoop configuration files in runtime/cdh5/<hadoop|hive>/<multiple-files> with the correct hostnames for your Hadoop cluster. Use grep FIXME -R . to find hostnames to change.

  3. Generate a new SSH keypair (private key dockerfiles/base/lab41/spark-base/config/ssh/id_rsa plus its matching public key), adding the public key to dockerfiles/base/lab41/spark-base/config/ssh/authorized_keys.
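One way to generate the keypair, assuming a passphrase-less key is wanted for container-to-container SSH (the relative paths below match the repo layout named above):

```shell
# Generate a passphrase-less RSA keypair in the spark-base config directory
KEYDIR=dockerfiles/base/lab41/spark-base/config/ssh
mkdir -p "$KEYDIR"
ssh-keygen -q -t rsa -N "" -f "$KEYDIR/id_rsa"

# Authorize the freshly generated public key
cat "$KEYDIR/id_rsa.pub" >> "$KEYDIR/authorized_keys"
```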

  4. (optional) Update the SPARK_WORKER_CONFIG environment variable for Spark-specific options such as executor cores. Update the variable via a shell export command or by editing dockerfiles/standalone/lab41/spark-client-ipython/config/service/ipython/run.
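For example, via a shell export (the variable name SPARK_WORKER_CONFIG comes from this repo; the specific --conf values here are illustrative, not defaults):

```shell
# Illustrative Spark options; tune memory/cores for your workers
export SPARK_WORKER_CONFIG="--conf spark.executor.memory=4g --conf spark.cores.max=4"
echo "$SPARK_WORKER_CONFIG"
```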

  5. (optional) Comment out any unwanted Python packages in the base Dockerfile image dockerfiles/base/lab41/python-datatools/Dockerfile.

  6. Get the Docker images:

  Option A: pull from Docker Hub:
    docker pull lab41/spark-master
    docker pull lab41/spark-worker
    docker pull lab41/spark-client-ipython

  Option B: build the images from scratch yourself.

  If you are creating common/shared host images, this is the point to snapshot the host image for replication.
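For Option B, the build commands were lost in this copy; a hypothetical sketch, assuming each image's Dockerfile lives under dockerfiles/standalone/lab41/ as the paths elsewhere in this README suggest:

```shell
# Build the three standalone-mode images locally instead of pulling them
for image in spark-master spark-worker spark-client-ipython; do
  docker build -t "lab41/$image" "dockerfiles/standalone/lab41/$image"
done
```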
  3. Deploy cluster nodes

  Ensure each host has a fully-qualified domain name (FQDN) so the Spark nodes can properly associate with one another.
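A quick way to check on each host (hostname -f prints the FQDN when one is configured; a bare name with no dot means the DNS/hosts setup still needs attention):

```shell
# Warn if this host lacks a fully-qualified domain name
fqdn=$(hostname -f 2>/dev/null || true)
case "$fqdn" in
  *.*) echo "OK: FQDN is $fqdn" ;;
  *)   echo "WARNING: no FQDN configured (got '$fqdn')" ;;
esac
```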
  1. Run the master container on the master host.
  2. Run worker container(s) on the worker host(s) (replace 'spark-master-fqdn' below):
    ./ spark://spark-master-fqdn:7077
  3. Run the client container on the client host (replace 'spark-master-fqdn' below):
    ./ spark://spark-master-fqdn:7077