Docker configuration for a Spark cluster
This Docker image contains a full Spark distribution with the following components:
- Oracle JDK 8
- Hadoop 2.7.5
- Scala 2.11.12
- Spark 2.2.1
It also includes an Apache Toree installation.
A docker-compose.yml file is provided to run the Spark cluster in a Docker Swarm environment. It contains a Spark master service and a worker instance. Type the following commands to run the stack provided with the docker-compose.yml:
docker network create -d overlay --attachable --scope swarm core
docker stack deploy -c docker-compose.yml <stack-name>
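For example, assuming the stack is named `spark` (the name is just a placeholder), the deployment can be inspected with the standard `docker stack` subcommands:

```shell
docker stack deploy -c docker-compose.yml spark
docker stack services spark   # lists the master and worker services with their replica counts
docker stack ps spark         # shows the node on which each task was scheduled
```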
To run the stack in cluster mode, create the swarm before creating the overlay network. Otherwise the stack will be deployed on a single swarm node, the manager.
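The provided docker-compose.yml is not reproduced in this README. As a rough orientation, a minimal master/worker stack of this kind could look like the sketch below; the image tag and the entrypoint commands are assumptions, while the `master`/`worker` service names, the `data`/`code` volumes, and the external `core` network follow from the commands used elsewhere in this document.

```yaml
# Sketch only - the actual docker-compose.yml shipped with this repository may differ.
version: "3.3"

services:
  master:
    image: spark-toree:latest     # assumption: the image built from this repository
    hostname: master
    command: bin/spark-class org.apache.spark.deploy.master.Master   # assumed entrypoint
    ports:
      - "8080:8080"               # Spark master web UI
      - "8888:8888"               # Jupyter/Toree notebook
    volumes:
      - data:/home/data
      - code:/home/code
    networks:
      - core

  worker:
    image: spark-toree:latest
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    volumes:
      - data:/home/data
      - code:/home/code
    networks:
      - core

volumes:
  data:
  code:

networks:
  core:
    external: true                # the overlay network created above
```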
To stop the stack, type:
docker stack rm <stack-name>
If you need more worker instances, scale the worker service by typing the following command:
docker service scale <stack-name>_worker=<num_of_task>
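For example, to run three worker tasks in a stack deployed as `spark`:

```shell
docker service scale spark_worker=3
docker service ls   # the worker service should now report 3/3 replicas
```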
If you need to inject data and code into the containers, use the `data` and `code` volumes, mounted at `/home/data` and `/home/code` respectively.
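For instance, a dataset can be copied into the data volume through the running master task; the stack name `spark` and the file name below are placeholders:

```shell
docker cp dataset.csv "$(docker ps -q -f name=spark_master)":/home/data/
```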
The Apache Toree notebook environment is already installed; to launch a Spark notebook, run the following commands:
docker exec -it <stack-name>_master.<id> bash
SPARK_OPTS='--master=spark://master:7077' jupyter notebook --ip 0.0.0.0 --allow-root
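The exact task name to pass to `docker exec` (including the `<id>` suffix) varies between deployments; assuming a stack named `spark`, it can be looked up with:

```shell
docker ps --filter name=spark_master --format '{{.Names}}'
```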
Setting SPARK_OPTS in the jupyter notebook command makes the notebook execute jobs on the cluster rather than in local mode.
Apache Toree includes SparkR, PySpark, Spark Scala, and SQL kernels.
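As a quick check that the notebook is really attached to the cluster, a cell like the following can be run in the Toree Scala kernel (the expected master URL comes from the SPARK_OPTS value above):

```scala
// `sc` is the SparkContext that the Toree Scala kernel creates automatically.
sc.master                                   // should print spark://master:7077

// A small job that is distributed over the worker tasks rather than run locally.
val evens = sc.parallelize(1 to 100000).filter(_ % 2 == 0).count()
println(s"number of even values: $evens")
```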
- Separating Jupyter notebook into a different