Set up a quick-to-spin-up workbench with Docker, with the option to select services


This repository provides a suite of Docker-based setups for local development and for testing in a production-like environment. Currently the following stacks are available:

  • Hadoop - Hive - Spark - Python dev environment with PySpark installed
  • ELK - Spark - JupyterLab dev environment with PySpark installed


  • Docker Engine - Ubuntu 22.04
  • Docker Desktop - Mac
  • Docker Desktop with WSL2 Backend - Windows >=10
  • Git Bash - Windows only

Note: On Windows, run all of the following commands from Git Bash if you are using the Windows filesystem. If you are using a WSL2 terminal, execute the Linux instructions directly.


This stack starts a Hadoop-Hive-Spark cluster and a Python dev environment (with the necessary libraries pre-installed) on the same Docker network to facilitate inter-service communication.

Please note that I use a docker-compose.yml for the devcontainer instead of just the Dockerfile, because I wanted to retain the flexibility of adding more services to the dev compose file (e.g. a front-end server) if and when I want to extend this setup.

NOTE 1: Execute these commands from the root folder. On Windows, please run all of the following commands from WSL.

NOTE 2: To use this setup in your own project, please follow these steps:

  • Copy the hadoop folder and the data/hive* folders to your project.
  • Copy the files:,, docker-compose-dev.yml, Dockerfile* files to your project.
  • Use the requirements.txt file of your project.
  • Execute the following steps from your project folder

Init steps

This creates the docker network which these services will use.

$ ./

For M1/M2 MacBook users: also run

$ ./

Start the compute stack

Start the namenode, datanode, hive-server, hive-metastore, spark-master and spark-workers

$ cd hadoop && docker compose up

You can use any client, such as Docker Desktop or Portainer, to manage the containers from a UI.
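Before moving on, it can help to confirm that the services are actually accepting connections, since the containers may take a while to come up. The following is a minimal Python sketch, not part of this repository. It assumes the NameNode web UI is published on localhost:9870 (the port used in the validation URL later in this README); the helper name wait_for_port is hypothetical.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect only shows that something is listening,
            # not that the service has finished initialising.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, wait_for_port("localhost", 9870, timeout=120) waits up to two minutes for the NameNode UI. Ports for the other services depend on the compose file and are not assumed here.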

Add some data to hdfs

The instructions for this step have been picked from this blog.

  • Log into the hive server as follows, or use Docker Desktop or Portainer to open an exec console.
$ docker exec -it hive-server /bin/bash
  • Create a hive table for weather data using the included hql
$ cd .. && cd hive_db
$ hive -f weather_table.hql
$ hadoop fs -put weather.csv hdfs://namenode:8020/user/hive/warehouse/testdb.db/weather

Validate the setup by navigating to http://localhost:9870/explorer.html#/user/hive/warehouse/testdb.db and verifying that the weather table has been created.
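The same check can be scripted instead of done in the browser. The sketch below is an assumption-laden example, not part of this repository: it assumes WebHDFS is enabled on the NameNode (the Hadoop default) and reachable on the same localhost:9870 port as the explorer page. It uses the WebHDFS LISTSTATUS operation; the function names are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def liststatus_url(path: str, host: str = "localhost", port: int = 9870) -> str:
    """Build a WebHDFS LISTSTATUS URL for an absolute HDFS path."""
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?op=LISTSTATUS"


def entry_names(payload: dict) -> list:
    """Extract entry names from a parsed WebHDFS LISTSTATUS response."""
    return [status["pathSuffix"] for status in payload["FileStatuses"]["FileStatus"]]


def table_exists(db_path: str, table: str, host: str = "localhost", port: int = 9870) -> bool:
    """Return True if a directory or file named `table` is listed under `db_path`."""
    with urlopen(liststatus_url(db_path, host, port), timeout=10) as resp:
        return table in entry_names(json.load(resp))
```

With the stack running, table_exists("/user/hive/warehouse/testdb.db", "weather") should return True once the table has been created.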

Testing the setup using PySpark

This step requires the Dev Containers extension to be installed in VS Code. The following steps are for VS Code:

  • Ctrl+Shift+P → Dev Containers: Reopen Folder in Container
  • Open notebooks/test_spark.ipynb and fire away!!
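For orientation, a smoke test against the cluster might look roughly like the sketch below. This is not the content of the notebook. The host names come from the service names listed above; the ports 7077 (Spark standalone master) and 9083 (Hive metastore Thrift) are the usual defaults and are assumptions here, as are the function names. The devcontainer may already provide this configuration, in which case the explicit settings are unnecessary.

```python
def spark_settings(master_host: str = "spark-master", metastore_host: str = "hive-metastore") -> dict:
    """Return Spark settings for the cluster (ports are assumed defaults)."""
    return {
        "spark.master": f"spark://{master_host}:7077",
        "spark.hadoop.hive.metastore.uris": f"thrift://{metastore_host}:9083",
    }


def read_weather(limit: int = 5):
    """Read a few rows from the testdb.weather table created earlier."""
    # Imported lazily so spark_settings() can be used without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("weather-smoke-test").enableHiveSupport()
    for key, value in spark_settings().items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()
    try:
        return spark.table("testdb.weather").limit(limit).collect()
    finally:
        spark.stop()
```

Calling read_weather() from inside the devcontainer should return a handful of rows if the weather data was loaded in the previous step.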

Advanced Usage: Using git from inside the devcontainer

This step requires an SSH key configured for your host user for your GitHub/GitLab etc. account. Your SSH key gets forwarded to the devcontainer (nonroot) user, but there is an issue with known_hosts.

  • Fire up a terminal from inside the devcontainer, run vim ~/.ssh/known_hosts, and replace the existing content there with the following:
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIOMqqnkVzrm0SdG6UOoqKLsabgH5C9okWi0dh2l9GKJl
ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBEmKSENjQEezOmxkZMy7opKgwFB9nkt5YRrYMjNuG5N87uRgg6CLrbo5wAdT/y6v0mKV0U2w0WZ2YB/++Tpockg=
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQCj7ndNxQowgcQnjshcLrqPEiiphnt+VTTvDP6mHBL9j1aNUkY4Ue1gvwnGLVlOhGeYrnZaMgRK6+PKCUXaDbC7qtbW8gIkhL7aGCsOr/C56SJMy/BCZfxd1nWzAOxSDPgVsmerOBYfNqltV9/hWCqBywINIR+5dIg6JTJ72pcEpEjcYgXkE2YEFXV1JHnsKgbLWNlhScqb2UmyRkQyytRLtL+38TGxkxCflmO+5Z8CSSNY7GidjMIZ7Q4zMjA2n1nGrlTDkzwDCsw+wqFPGQA179cnfGWOWRVruj16z6XyvxvjJwbz0wQZ75XK5tKSb7FNyeIEs4TT4jk+S4dhPeAUC5y+bDYirYgM4GC7uEnztnZyaVWQ7B381AK4Qdrwt51ZqExKbQpTUNn+EjqoTwvqNj4kqx5QUCI0ThS/YkOxJCXmPUWZbhjpCg56i+2aB6CmK2JGhn57K5mj0MNdBXA4/WnwH6XoPWJzK5Nyu2zB3nAZp+S5hpQs+p1vN1/wsjk=
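Before pasting host keys into known_hosts, it is worth comparing their fingerprints with the ones your Git host publishes. The sketch below, which is an illustration and not part of this repository, computes the OpenSSH-style SHA256 fingerprint of a single key line. It assumes each line has the form "[host] key-type base64-blob [comment]"; the function name is hypothetical.

```python
import base64
import hashlib


def fingerprint(key_line: str) -> str:
    """Return the OpenSSH-style SHA256 fingerprint of one public key line."""
    # SSH public key blobs begin with a 4-byte length field whose leading
    # bytes are zero, so the base64 text starts with "AAAA". This locates
    # the blob whether or not the line begins with a host name.
    blob_b64 = next(field for field in key_line.split() if field.startswith("AAAA"))
    digest = hashlib.sha256(base64.b64decode(blob_b64)).digest()
    # OpenSSH prints the base64 digest without trailing "=" padding.
    return "SHA256:" + base64.b64encode(digest).decode("ascii").rstrip("=")
```

Running fingerprint on each of the lines above and comparing the output with the provider's documented fingerprints confirms that the keys were copied intact.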

Helpful Utils/Tips

  • Increase the resources available to Docker.
  • When opening the repo/project inside the devcontainer for the first time, the JupyterLab extensions need to be installed. This step will be moved to the extensions list in devcontainer.json.
  • Install Resource Monitor inside the devcontainer to track the percentage of resources being consumed by the devcontainer.
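To see how much CPU and memory the devcontainer can actually see after adjusting Docker's resource settings, a quick check from a Python prompt inside the container can help. This is a minimal sketch assuming a Linux container; the function name is hypothetical. The values generally reflect the Docker VM or host totals rather than any per-container limit.

```python
import os


def visible_resources() -> dict:
    """Report the CPU count and physical memory visible to this process."""
    cpus = os.cpu_count() or 1
    # Page size multiplied by the number of physical pages gives total RAM in bytes.
    mem_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return {"cpus": cpus, "memory_gib": round(mem_bytes / 2**30, 2)}
```

If the reported numbers are lower than expected, the Docker resource settings are the first place to look.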


  • Replace the bde2020 Docker images with our own images for the following reasons:
    • Make all the images ARM64 compatible.
    • Update the stack to the latest Hadoop, Hive and Spark.
    • Reduce vulnerabilities.
    • Use multi-stage builds to reduce the images' size on disk.
  • Include scanning of images for security issues.
  • Set up the Spark cluster with the YARN (or potentially even Kubernetes) cluster manager instead of the Standalone (current) cluster manager.


