SparkCluster

Docker configuration for a Spark cluster

Table of contents

  1. Overview
  2. Docker Swarm
    1. Usage
      1. Multi-Host Swarm
    2. Scaling
  3. Data & Code
  4. Toree
  5. TODOs

Overview

This Docker image bundles a full Spark distribution with the following components:

  • Oracle JDK 8
  • Hadoop 2.7.5
  • Scala 2.11.12
  • Spark 2.2.1

It also includes an Apache Toree installation.
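
As a quick sanity check, you can print the bundled versions from inside a running container. This is a sketch: the container name is a placeholder, and it assumes the tools are on the container's PATH.

# Verify the bundled versions (container name is a placeholder).
docker exec -it <container> java -version            # expect JDK 1.8
docker exec -it <container> hadoop version           # expect Hadoop 2.7.5
docker exec -it <container> spark-submit --version   # expect Spark 2.2.1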

Docker Swarm

A docker-compose.yml file is provided to run the Spark cluster in a Docker Swarm environment.

Usage

Run the following commands to deploy the stack defined in docker-compose.yml. The stack contains a Spark master service and a single worker instance.

docker network create -d overlay --attachable --scope swarm core  
docker stack deploy -c docker-compose.yml <stack-name>
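
Once deployed, you can confirm that both services are up (a sketch, assuming the stack was named spark):

# List the services in the stack and tail the master's logs.
docker stack services spark
docker service logs spark_master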

Multi-Host Swarm

To run the stack in cluster mode, create the swarm before creating the overlay network.
Otherwise the stack will be deployed on a single swarm node, the manager.
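
A minimal sketch of forming a multi-node swarm first; the IP address and join token are placeholders:

# On the manager node:
docker swarm init --advertise-addr <manager-ip>
# On each worker node, using the token printed by the command above:
docker swarm join --token <token> <manager-ip>:2377

After joining the workers, create the overlay network and deploy the stack as shown above.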

To stop the stack, type:

docker stack rm <stack-name>

Scaling

If you need more worker instances, scale the service by running the following command:

docker service scale <stack-name>_worker=<num_of_tasks>
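
For example, to run three workers in a hypothetical stack named spark:

docker service scale spark_worker=3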

Data & Code

If you need to inject data and code into the containers, use the data and code volumes, mounted at /home/data and /home/code respectively.
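
With the stack already running, docker cp is a quick way to drop files in. A sketch, assuming a stack named spark and a local dataset.csv:

# Look up the generated master container name, then copy the file in.
MASTER=$(docker ps --filter name=spark_master --format '{{.Names}}' | head -n 1)
docker cp ./dataset.csv "$MASTER":/home/data/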

Toree

Apache Toree is already installed. To launch a Spark notebook, run the following commands:

docker exec -it <stack-name>_master.<id> bash
SPARK_OPTS='--master=spark://master:7077' jupyter notebook --ip 0.0.0.0 --allow-root

The SPARK_OPTS setting points the notebook at the cluster master, so notebook jobs run on the cluster rather than in local mode.
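
The <id> suffix in the master container's name is generated by Swarm. A quick way to look it up before running docker exec (a sketch, assuming the stack is named spark):

docker ps --filter name=spark_master --format '{{.Names}}'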

Apache Toree includes SparkR, PySpark, Scala, and SQL interpreters.

TODOs

  • Separating Jupyter notebook into a different
