This repository provides a suite of Docker-based setups for local development and for testing against production-like environments. The following stacks are currently available:
- Hadoop + Hive + Spark + a Python dev environment with PySpark installed
- ELK + Spark + a JupyterLab dev environment with PySpark installed
- Docker Engine - Ubuntu 22.04
- Docker Desktop - Mac
- Docker Desktop with WSL2 Backend - Windows >=10
- Git Bash - Windows only
Note: On Windows, run all the following commands from Git Bash if using the Windows filesystem. If using a WSL2 terminal, execute the Linux instructions directly.
This stack starts a Hadoop-Hive-Spark cluster and a Python dev environment (with the necessary libraries pre-installed) on the same Docker network to facilitate inter-service communication.
Please note that I use a docker-compose.yml for the devcontainer instead of just the Dockerfile, since I wanted to retain the flexibility of adding more services to the dev compose file (e.g. a front-end server) if and when I extend this setup.
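As a rough illustration of that flexibility, an extra service could be appended to docker-compose-dev.yml along these lines. The service name, image, and network name below are assumptions for the sketch, not part of this repo; match the network name to whatever prerequisites.sh actually creates:

```yaml
# Hypothetical extra service for docker-compose-dev.yml.
services:
  frontend:            # example front-end server, not part of the repo
    image: nginx:alpine
    ports:
      - "8080:80"
    networks:
      - hadoop-net     # assumed name of the shared external network

networks:
  hadoop-net:
    external: true     # created by prerequisites.sh, not by compose
```

Marking the network `external: true` keeps compose from creating (or tearing down) the shared network that the cluster services also attach to.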
NOTE 1: Execute these commands from the root folder. On Windows, please run all the following commands from WSL.
NOTE 2: In order to use this setup in your own project, follow these steps:
- Copy the hadoop folder and the data/hive* folders to your project.
- Copy these files to your project: prerequisites.sh, mac_arm64_prereqs.sh, docker-compose-dev.yml, and the Dockerfile* files.
- Use your own project's requirements.txt file.
- Execute the following steps from your project folder.
This creates the Docker network which these services will use.
$ ./prerequisites.sh
For M1/M2 Macbook Users: also run
$ ./mac_arm64_prereqs.sh
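For reference, the network-creation step can be sketched roughly as below. The network name is an assumption for illustration; prerequisites.sh is the source of truth:

```shell
# Rough sketch of the network setup; the name below is assumed.
NETWORK_NAME="hadoop-spark-net"
if command -v docker >/dev/null 2>&1; then
  # create the shared network only if it does not already exist
  docker network inspect "$NETWORK_NAME" >/dev/null 2>&1 \
    || docker network create "$NETWORK_NAME" \
    || echo "could not create network (is the Docker daemon running?)"
fi
```

Making the creation idempotent means the script can be re-run safely before every `docker compose up`.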
Start the namenode, datanode, hive-server, hive-metastore, spark-master and spark-workers
$ cd hadoop && docker compose up
You can use any client such as Docker Desktop or Portainer to manage the containers from a UI.
The instructions for this step have been adapted from this blog.
- Log into the hive-server container as follows, or use Docker Desktop or Portainer to open an exec console:
$ docker exec -it hive-server /bin/bash
- Create a Hive table for weather data using the included HQL script:
$ cd .. && cd hive_db
$ hive -f weather_table.hql
$ hadoop fs -put weather.csv hdfs://namenode:8020/user/hive/warehouse/testdb.db/weather
Validate the setup by navigating to http://localhost:9870/explorer.html#/user/hive/warehouse/testdb.db and verifying that the weather table has been created.
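The HDFS destination used above follows Hive's default warehouse layout: hdfs://&lt;namenode&gt;:&lt;port&gt;/user/hive/warehouse/&lt;db&gt;.db/&lt;table&gt;. A tiny helper makes the pattern explicit; the function name is my own, and the layout only holds while Hive keeps its default warehouse directory:

```shell
# Build the HDFS URI Hive uses for a managed table under the default
# warehouse directory: /user/hive/warehouse/<db>.db/<table>
warehouse_path() {
  printf 'hdfs://%s:%s/user/hive/warehouse/%s.db/%s\n' "$1" "$2" "$3" "$4"
}

warehouse_path namenode 8020 testdb weather
# hdfs://namenode:8020/user/hive/warehouse/testdb.db/weather
```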
Pre-requisite for this step is having the Dev Containers extension installed in VS Code. The following steps are for VS Code:
Ctrl+Shift+P → Dev Containers: Reopen in Container
Then open notebooks/test_spark.ipynb and fire away!!
Pre-requisite for this step is having an SSH key configured for your host user for GitHub/GitLab etc. Your SSH key gets forwarded to the devcontainer (non-root) user, but there is an issue with known_hosts.
- Fire up a terminal from inside the devcontainer and edit the known_hosts file:
$ vim ~/.ssh/known_hosts
Replace the existing content there with the following:
github.com ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIOMqqnkVzrm0SdG6UOoqKLsabgH5C9okWi0dh2l9GKJl
github.com ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBEmKSENjQEezOmxkZMy7opKgwFB9nkt5YRrYMjNuG5N87uRgg6CLrbo5wAdT/y6v0mKV0U2w0WZ2YB/++Tpockg=
github.com ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQCj7ndNxQowgcQnjshcLrqPEiiphnt+VTTvDP6mHBL9j1aNUkY4Ue1gvwnGLVlOhGeYrnZaMgRK6+PKCUXaDbC7qtbW8gIkhL7aGCsOr/C56SJMy/BCZfxd1nWzAOxSDPgVsmerOBYfNqltV9/hWCqBywINIR+5dIg6JTJ72pcEpEjcYgXkE2YEFXV1JHnsKgbLWNlhScqb2UmyRkQyytRLtL+38TGxkxCflmO+5Z8CSSNY7GidjMIZ7Q4zMjA2n1nGrlTDkzwDCsw+wqFPGQA179cnfGWOWRVruj16z6XyvxvjJwbz0wQZ75XK5tKSb7FNyeIEs4TT4jk+S4dhPeAUC5y+bDYirYgM4GC7uEnztnZyaVWQ7B381AK4Qdrwt51ZqExKbQpTUNn+EjqoTwvqNj4kqx5QUCI0ThS/YkOxJCXmPUWZbhjpCg56i+2aB6CmK2JGhn57K5mj0MNdBXA4/WnwH6XoPWJzK5Nyu2zB3nAZp+S5hpQs+p1vN1/wsjk=
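Alternatively, since hard-coded keys go stale whenever GitHub rotates them, the file can be regenerated from inside the devcontainer with ssh-keyscan (this assumes network access; verify the fetched keys against GitHub's published fingerprints before trusting them):

```shell
# Regenerate known_hosts from GitHub's currently served host keys.
known_hosts="$HOME/.ssh/known_hosts"
mkdir -p "$(dirname "$known_hosts")"
ssh-keyscan -t ed25519,ecdsa,rsa github.com > "$known_hosts" 2>/dev/null || true
```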
- Increase resources available to Docker.
- When opening the repo/project inside the devcontainer for the first time, JupyterLab extensions need to be installed. Will move this part to devcontainer.json extensions.
- Install Resource Monitor inside the devcontainer to track the percentage of resources being consumed by the devcontainer.
- Replace the bde2020 docker images with our own images for the following reasons:
- Make all the images ARM64 compatible.
- Update the stack to the latest Hadoop, Hive and Spark.
- Reduce vulnerabilities.
- Use multi-stage builds to reduce the images' size on disk.
- Include scanning of images for security issues.
- Set up the Spark cluster with YARN (or potentially even Kubernetes) as the cluster manager instead of the current Standalone cluster manager.
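The multi-stage-build item above could look roughly like the sketch below. The base images and paths are placeholders, not the repo's actual Dockerfiles:

```dockerfile
# Hypothetical multi-stage sketch: build artifacts in one stage, then copy
# only what is needed into a slim runtime stage.
FROM eclipse-temurin:17-jdk AS build    # placeholder build image
WORKDIR /opt
# ... download/unpack Spark, Hadoop, Hive, etc. here ...

FROM eclipse-temurin:17-jre AS runtime  # slim runtime-only image
COPY --from=build /opt /opt
```

The build toolchain (JDK, compilers, downloaded archives) stays in the first stage, so the shipped image carries only the runtime layer.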