This Docker container is meant for learning Spark programming. It includes the following components.
- Hadoop v3.3.1
- Spark v3.1.2
- Python v3.8
After running the container, you may visit the following pages.
Build.
./build.sh
We need to create a network, because a container can only be given a static IP on a custom-created network. Why do we need a static IP? Because SPARK_MASTER_HOST is typically set to localhost, and localhost binds only to 127.0.0.1, so any request from outside the container on port 7077 will be rejected. We can set SPARK_MASTER_HOST to 0.0.0.0 explicitly, but this breaks Spark entirely (computations will not run because workers cannot find the master). The workaround is to create a network and assign a static IP to the container.
docker network create --subnet=172.18.0.0/16 sparknet
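As a quick sanity check on the addressing, the sketch below confirms that the master address used in the commands of this README (172.18.0.5) is a usable static IP inside the sparknet subnet. The helper name `is_usable_static_ip` is hypothetical, and treating the first host address as Docker's gateway is an assumption you can confirm with `docker network inspect sparknet`.

```python
import ipaddress

# Values taken from the commands in this README.
SUBNET = ipaddress.ip_network("172.18.0.0/16")
MASTER_IP = ipaddress.ip_address("172.18.0.5")


def is_usable_static_ip(ip, subnet, gateway=None):
    """Return True if ip lies in subnet and is not a reserved address."""
    if gateway is None:
        # Assumption: Docker uses the first host address as the gateway.
        gateway = next(subnet.hosts())
    reserved = {subnet.network_address, subnet.broadcast_address, gateway}
    return ip in subnet and ip not in reserved


if __name__ == "__main__":
    print(is_usable_static_ip(MASTER_IP, SUBNET))
```

An address outside 172.18.0.0/16, or the gateway address itself, would be rejected by this check.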
Run.
./run.sh
- 9870: Name Node (HDFS)
- 8020: Name Node metadata service
- 8042: Node Manager
- 8088: Resource Manager (YARN)
- 9864: Data node
- 19888: History Server

- 8080: Master web UI
- 18080: History server web UI
- 7077: Master port
- 4040: Application web UI
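To see which of these ports are actually reachable once the container is up, a small TCP probe can help. This is a minimal sketch using only the standard library. The host 172.18.0.5 is assumed from the master URL used elsewhere in this README, and the names `is_open` and `report` are hypothetical.

```python
import socket

# Assumption: the container's static IP, as used in the master URL in this README.
HOST = "172.18.0.5"

# Ports and services as listed above.
PORTS = {
    9870: "Name Node (HDFS)",
    8020: "Name Node metadata service",
    8042: "Node Manager",
    8088: "Resource Manager (YARN)",
    9864: "Data node",
    19888: "History Server",
    8080: "Master web UI",
    18080: "History server web UI",
    7077: "Master port",
    4040: "Application web UI",  # only open while a Spark application is running
}


def is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def report(host=HOST, ports=PORTS):
    """Print one line per port with its state and service name."""
    for port, name in ports.items():
        state = "open" if is_open(host, port) else "closed"
        print(f"{port:>5}  {state:<6}  {name}")
```

Calling `report()` from a Python shell after `./run.sh` prints the state of each port. A closed 7077 usually means the master is not bound to the static IP.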
Some useful network commands.
docker network ls
docker network inspect sparknet
netstat -tulpn | grep LISTEN
If you want to use the shell on the container.
docker exec -it <CONTAINER_ID> spark-shell --master spark://172.18.0.5:7077
docker exec -it <CONTAINER_ID> pyspark --master spark://172.18.0.5:7077
If you want to use a locally installed instance of Spark.
spark-shell --master spark://172.18.0.5:7077
pyspark --master spark://172.18.0.5:7077
If you want to submit a Python application.
# spark standalone
spark-submit \
--master spark://172.18.0.5:7077 \
dummy.py
# YARN
HADOOP_CONF_DIR=/home/super/dev/hadoop/etc/hadoop/ spark-submit \
--deploy-mode client \
--master yarn dummy.py
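If you prefer to launch these submissions from a script, the two commands above can be assembled in Python. This is only a sketch: the helper names `standalone_cmd`, `yarn_cmd`, and `yarn_env` are hypothetical, and the master URL and HADOOP_CONF_DIR path are copied from the commands above and may differ on your machine.

```python
import os
import shlex

# Values copied from the commands in this README; adjust for your setup.
MASTER_URL = "spark://172.18.0.5:7077"
HADOOP_CONF_DIR = "/home/super/dev/hadoop/etc/hadoop/"


def standalone_cmd(app, master=MASTER_URL):
    """Argument list for submitting app to the Spark standalone master."""
    return ["spark-submit", "--master", master, app]


def yarn_cmd(app, deploy_mode="client"):
    """Argument list for submitting app to YARN."""
    return ["spark-submit", "--deploy-mode", deploy_mode, "--master", "yarn", app]


def yarn_env(conf_dir=HADOOP_CONF_DIR):
    """Copy of the current environment with HADOOP_CONF_DIR set for YARN."""
    env = dict(os.environ)
    env["HADOOP_CONF_DIR"] = conf_dir
    return env


if __name__ == "__main__":
    # Print the commands rather than running them.
    print(shlex.join(standalone_cmd("dummy.py")))
    print(shlex.join(yarn_cmd("dummy.py")))
```

To actually run a submission, pass the list to `subprocess.run`, for example `subprocess.run(yarn_cmd("dummy.py"), env=yarn_env(), check=True)`.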
- How to resolve pickle error in pyspark?
- What can be pickled and unpickled
- cloudpickle
- How do I call pyspark code with .whl file?
- Does spark standalone cluster supports deploye mode = cluster for python applications?
- Understand the default configuration
- Call From kv.local/172.20.12.168 to localhost:8020 failed on connection exception, when using tera gen
- Yarn JobHistory Error: Failed redirect for container_1400260444475_3309_01_000001