This repository provides a suite of Docker-based setups for local development and for testing against production-like environments. The following stacks are currently available:
- Hadoop + Hive + Spark + a Python dev environment with PySpark installed
- ELK + Spark + a JupyterLab dev environment with PySpark installed
- Docker Engine - Ubuntu 22.04
- Docker Desktop - Mac
- Docker Desktop with WSL2 Backend - Windows >=10
- Git Bash - Windows only
Note: On Windows, run all the following commands from Git Bash if using the Windows filesystem. If using a WSL2 terminal, execute the Linux instructions directly.
This stack starts a Hadoop-Hive-Spark cluster and a Python dev environment (with the necessary libraries pre-installed) on the same Docker network to facilitate inter-service communication.
Please note that I use a docker-compose.yml for the devcontainer instead of just the Dockerfile, since I wanted to retain the flexibility of adding more services to the dev compose file (e.g. a front-end server) if and when I extend this setup.
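As a rough illustration of that flexibility, an extra service could be appended to docker-compose-dev.yml along these lines. The service name, image, and network name below are assumptions for the sketch, not part of this repo; match the network name to whatever prerequisites.sh actually creates:

```yaml
# Hypothetical extra service for docker-compose-dev.yml.
services:
  frontend:            # example front-end server, not part of the repo
    image: nginx:alpine
    ports:
      - "8080:80"
    networks:
      - hadoop-net     # assumed name of the shared external network

networks:
  hadoop-net:
    external: true     # created by prerequisites.sh, not by compose
```

Marking the network `external: true` keeps compose from creating (or tearing down) the shared network that the cluster services also attach to.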
NOTE 1: Execute these commands from the root folder. On Windows, please run all the following commands from WSL.
NOTE 2: In order to use this setup in your own project, follow these steps:
- Copy the hadoop folder and the data/hive* folders to your project.
- Copy these files to your project: prerequisites.sh, mac_arm64_prereqs.sh, docker-compose-dev.yml, and the Dockerfile* files.
- Use your own project's requirements.txt file.
- Execute the following steps from your project folder.
This creates the Docker network which these services will use.
$ ./prerequisites.sh
For M1/M2 Macbook Users: also run
$ ./mac_arm64_prereqs.sh
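For reference, the network-creation step can be sketched roughly as below. The network name is an assumption for illustration; prerequisites.sh is the source of truth:

```shell
# Rough sketch of the network setup; the name below is assumed.
NETWORK_NAME="hadoop-spark-net"
if command -v docker >/dev/null 2>&1; then
  # create the shared network only if it does not already exist
  docker network inspect "$NETWORK_NAME" >/dev/null 2>&1 \
    || docker network create "$NETWORK_NAME" \
    || echo "could not create network (is the Docker daemon running?)"
fi
```

Making the creation idempotent means the script can be re-run safely before every `docker compose up`.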
Start the namenode, datanode, hive-server, hive-metastore, spark-master and spark-workers
$ cd hadoop && docker compose up
You can use any client such as Docker Desktop or Portainer to manage the containers from a UI.
The instructions for this step have been adapted from this blog.
- Log into the hive-server container as follows, or use Docker Desktop or Portainer to open an exec console:
$ docker exec -it hive-server /bin/bash
- Create a Hive table for weather data using the included HQL script:
$ cd .. && cd hive_db
$ hive -f weather_table.hql
$ hadoop fs -put weather.csv hdfs://namenode:8020/user/hive/warehouse/testdb.db/weather
Validate the setup by navigating to http://localhost:9870/explorer.html#/user/hive/warehouse/testdb.db and verifying that the weather table has been created.
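The HDFS destination used above follows Hive's default warehouse layout: hdfs://&lt;namenode&gt;:&lt;port&gt;/user/hive/warehouse/&lt;db&gt;.db/&lt;table&gt;. A tiny helper makes the pattern explicit; the function name is my own, and the layout only holds while Hive keeps its default warehouse directory:

```shell
# Build the HDFS URI Hive uses for a managed table under the default
# warehouse directory: /user/hive/warehouse/<db>.db/<table>
warehouse_path() {
  printf 'hdfs://%s:%s/user/hive/warehouse/%s.db/%s\n' "$1" "$2" "$3" "$4"
}

warehouse_path namenode 8020 testdb weather
# hdfs://namenode:8020/user/hive/warehouse/testdb.db/weather
```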
Pre-requisite for this step is having the Dev Containers extension installed in VS Code. The following steps are for VS Code:
Ctrl+Shift+P → Dev Containers: Reopen in Container
Then open notebooks/test_spark.ipynb and fire away!!
Pre-requisite for this step is having an SSH key configured for your host user for GitHub/GitLab etc. Your SSH key gets forwarded to the devcontainer (non-root) user, but there is an issue with known_hosts.
- Fire up a terminal from inside the devcontainer and edit the known_hosts file:
$ vim ~/.ssh/known_hosts
Replace the existing content there with the following:
github.com ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIOMqqnkVzrm0SdG6UOoqKLsabgH5C9okWi0dh2l9GKJl
github.com ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBEmKSENjQEezOmxkZMy7opKgwFB9nkt5YRrYMjNuG5N87uRgg6CLrbo5wAdT/y6v0mKV0U2w0WZ2YB/++Tpockg=
github.com ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQCj7ndNxQowgcQnjshcLrqPEiiphnt+VTTvDP6mHBL9j1aNUkY4Ue1gvwnGLVlOhGeYrnZaMgRK6+PKCUXaDbC7qtbW8gIkhL7aGCsOr/C56SJMy/BCZfxd1nWzAOxSDPgVsmerOBYfNqltV9/hWCqBywINIR+5dIg6JTJ72pcEpEjcYgXkE2YEFXV1JHnsKgbLWNlhScqb2UmyRkQyytRLtL+38TGxkxCflmO+5Z8CSSNY7GidjMIZ7Q4zMjA2n1nGrlTDkzwDCsw+wqFPGQA179cnfGWOWRVruj16z6XyvxvjJwbz0wQZ75XK5tKSb7FNyeIEs4TT4jk+S4dhPeAUC5y+bDYirYgM4GC7uEnztnZyaVWQ7B381AK4Qdrwt51ZqExKbQpTUNn+EjqoTwvqNj4kqx5QUCI0ThS/YkOxJCXmPUWZbhjpCg56i+2aB6CmK2JGhn57K5mj0MNdBXA4/WnwH6XoPWJzK5Nyu2zB3nAZp+S5hpQs+p1vN1/wsjk=
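Alternatively, since hard-coded keys go stale whenever GitHub rotates them, the file can be regenerated from inside the devcontainer with ssh-keyscan (this assumes network access; verify the fetched keys against GitHub's published fingerprints before trusting them):

```shell
# Regenerate known_hosts from GitHub's currently served host keys.
known_hosts="$HOME/.ssh/known_hosts"
mkdir -p "$(dirname "$known_hosts")"
ssh-keyscan -t ed25519,ecdsa,rsa github.com > "$known_hosts" 2>/dev/null || true
```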
- Increase resources available to Docker.
- When opening the repo/project inside the devcontainer for the first time, JupyterLab extensions need to be installed. Will move this part to devcontainer.json extensions.
- Install Resource Monitor inside the devcontainer to track the percentage of resources being consumed by the devcontainer.
- Replace the bde2020 docker images with our own images for the following reasons:
- Make all the images ARM64 compatible.
- Update the stack to the latest Hadoop, Hive and Spark.
- Reduce vulnerabilities.
- Use multi-stage builds to reduce the images' size on disk.
- Include scanning of images for security issues.
- Set up the Spark cluster with YARN (or potentially even Kubernetes) as the cluster manager instead of the current Standalone cluster manager.
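The multi-stage-build item above could look roughly like the sketch below. The base images and paths are placeholders, not the repo's actual Dockerfiles:

```dockerfile
# Hypothetical multi-stage sketch: build artifacts in one stage, then copy
# only what is needed into a slim runtime stage.
FROM eclipse-temurin:17-jdk AS build    # placeholder build image
WORKDIR /opt
# ... download/unpack Spark, Hadoop, Hive, etc. here ...

FROM eclipse-temurin:17-jre AS runtime  # slim runtime-only image
COPY --from=build /opt /opt
```

The build toolchain (JDK, compilers, downloaded archives) stays in the first stage, so the shipped image carries only the runtime layer.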