Name
ubuntu
Dockerfile.stage0
Dockerfile.stage1
README.md
build.sh
dummy.py
run.sh

README.md

Purpose

This Docker container is intended for learning Spark programming. It includes the following components.

Hadoop v3.3.1
Spark v3.1.2
Python v3.8

After starting the container, you can visit the web UIs on the ports listed under Ports below.

Docker

Build.

./build.sh

We need to create a custom network, because Docker cannot assign a static IP to a container unless it is attached to a user-created network. Why do we need a static IP? SPARK_MASTER_HOST is typically set to localhost, which binds only to 127.0.0.1, so any request from outside the container on port 7077 is rejected. We can set SPARK_MASTER_HOST to 0.0.0.0 explicitly, but this breaks Spark entirely (computations will not run because workers cannot find the master). The workaround is to create a network and assign a static IP to the container.

docker network create --subnet=172.18.0.0/16 sparknet
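The static IP given to the container has to be a host address inside the subnet created above. The following is a minimal sketch of that check using Python's standard ipaddress module. It assumes the container is assigned 172.18.0.5 (the address used in the commands later in this README); the helper name ip_in_subnet is hypothetical and not part of this repository.

```python
import ipaddress


def ip_in_subnet(ip: str, subnet: str) -> bool:
    """Return True if ip is an assignable host address inside subnet."""
    network = ipaddress.ip_network(subnet)
    address = ipaddress.ip_address(ip)
    # The network and broadcast addresses cannot be assigned to a container.
    reserved = (network.network_address, network.broadcast_address)
    return address in network and address not in reserved


if __name__ == "__main__":
    # Assumed static IP, checked against the subnet passed to `docker network create`.
    print(ip_in_subnet("172.18.0.5", "172.18.0.0/16"))
```

An address outside the subnet, such as 172.19.0.5, fails the same check.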

Run.

./run.sh

Ports

Hadoop

9870 : Name Node (HDFS)
8020 : Name Node metadata service
8042 : Node Manager
8088 : Resource Manager (YARN)
9864 : Data Node
19888 : Job History Server

Spark

8080 : Master web UI
18080 : History server web UI
7077 : Master port
4040 : Application web UI
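
To see which of the ports above are actually reachable, a plain TCP connection attempt is usually enough. The following is a minimal sketch using only Python's standard socket module. The port numbers and labels are copied from the lists above; the function names is_port_open and check_ports are hypothetical, and the host to probe (for example 172.18.0.5, as used in the commands in this README) is an assumption about how the container was started.

```python
import socket

# Ports and labels copied from the Hadoop and Spark lists above.
PORTS = {
    9870: "Name Node (HDFS)",
    8020: "Name Node metadata service",
    8042: "Node Manager",
    8088: "Resource Manager (YARN)",
    9864: "Data node",
    19888: "History Server",
    8080: "Spark master web UI",
    18080: "Spark history server web UI",
    7077: "Spark master port",
    4040: "Spark application web UI",
}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_ports(host: str, ports=PORTS, timeout: float = 1.0) -> dict:
    """Map each port in ports to True (reachable) or False (not reachable)."""
    return {port: is_port_open(host, port, timeout) for port in ports}
```

Calling check_ports("172.18.0.5") returns a dictionary keyed by port number. Port 4040 is expected to be closed unless a Spark application is currently running, since the application web UI exists only for the lifetime of an application.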

Test Connection

Some useful network commands.

docker network ls
docker network inspect sparknet
netstat -tulpn | grep LISTEN

To use a Spark shell inside the container, run one of the following.

docker exec -it <CONTAINER_ID> spark-shell --master spark://172.18.0.5:7077
docker exec -it <CONTAINER_ID> pyspark --master spark://172.18.0.5:7077

To use a locally installed instance of Spark, point it at the master running in the container.

spark-shell --master spark://172.18.0.5:7077
pyspark --master spark://172.18.0.5:7077

To submit a Python application, use spark-submit.

# Spark standalone
spark-submit \
    --master spark://172.18.0.5:7077 \
    dummy.py

# YARN
HADOOP_CONF_DIR=/home/super/dev/hadoop/etc/hadoop/ spark-submit \
    --deploy-mode client \
    --master yarn \
    dummy.py
