Consume Data from Kafka | Load to AWS S3 | Containerize Multi-Node Spark Cluster with Docker

This project demonstrates how to create a data lake, containerize a multi-node Spark cluster using Docker, and run your Spark code within the containers.

The Spark code reads JSON-formatted messages from a Kafka topic. Each message is parsed with Spark SQL's json_tuple function to build a DataFrame containing the relevant columns. The processed data is then written in Parquet format, in append mode, to a path in the S3 bucket that is derived from the current timestamp.
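The flow described above can be sketched in PySpark. This is only an illustration, not the code shipped in this repository (which is submitted as a jar). The field names (`order_id`, `customer_id`, `amount`), the helper `build_output_path`, the `year/month/day/hour` path layout, and the batch (non-streaming) read are all assumptions made for the example. It also assumes the Kafka and S3 connectors (for example `spark-sql-kafka` and `hadoop-aws`) are available to Spark.

```python
from datetime import datetime, timezone


def build_output_path(bucket: str, prefix: str, now: datetime) -> str:
    """Return an s3a:// path derived from a timestamp (hypothetical layout)."""
    return f"s3a://{bucket}/{prefix}/{now:%Y/%m/%d/%H}"


def run(bootstrap_servers: str, topic: str, bucket: str) -> None:
    """Read JSON from Kafka, parse it with json_tuple, append Parquet to S3."""
    # Imported here so build_output_path stays usable without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, json_tuple

    spark = SparkSession.builder.appName("kafka-to-s3-sketch").getOrCreate()

    # Batch read of the topic; a streaming job would use readStream instead.
    raw = (
        spark.read.format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", topic)
        .option("startingOffsets", "earliest")
        .load()
    )

    # Kafka delivers the value as bytes, so cast it to a string before
    # extracting the (hypothetical) JSON fields.
    parsed = raw.select(
        json_tuple(
            col("value").cast("string"), "order_id", "customer_id", "amount"
        ).alias("order_id", "customer_id", "amount")
    )

    # Append Parquet files under a path built from the current UTC time.
    path = build_output_path(bucket, "orders", datetime.now(timezone.utc))
    parsed.write.mode("append").parquet(path)
```

Calling `run("kafka:9092", "orders", "my-data-lake")` would execute the job; the broker address, topic, and bucket name here are placeholders to replace with your own values.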

  1. Clone this repository to your local machine:
git clone
cd ecart-migration
  2. Start the Spark cluster using Docker Compose:
docker-compose up
  3. Access the master node container:
docker exec -it <master_node_container_name> bash

(Note: Replace <master_node_container_name> with the actual name of your master node container.)

  4. Create a folder named myjars inside the container:
mkdir myjars
  5. From a separate terminal on your local machine, copy the jar into the myjars folder inside the container:
docker cp /path/to/local/jar/file <master_node_container_name>:/opt/bitnami/spark/myjars

(Note: Replace /path/to/local/jar/file with the actual path to your jar file, and <master_node_container_name> with the name of your master node container.)

  6. Exit the container:
exit
  7. Run the Spark Submit command from your local machine:
docker exec -it <master_node_container_name> bash -c "cd myjars && spark-submit --master local[*] --class <main_class> ecart-migration.jar"

(Note: Replace <master_node_container_name> with the name of your master node container, and <main_class> with the fully qualified name of your application's main class. The --class option takes the class name; the jar file follows it as a separate argument.)

Your Spark code will now be executed within the containerized Spark cluster.
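If you repeat these steps often, they can be scripted. The following is a minimal sketch, assuming the container working layout used above (`/opt/bitnami/spark/myjars`) and a jar named `ecart-migration.jar`. The helper names `build_commands` and `run_all` are hypothetical, and the container name, jar path, and main class must be supplied by you. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess
from typing import List

JAR_DIR = "/opt/bitnami/spark/myjars"


def build_commands(container: str, local_jar: str, main_class: str,
                   jar_name: str = "ecart-migration.jar") -> List[List[str]]:
    """Build the docker commands for the create-folder, copy, and submit steps."""
    target = f"{JAR_DIR}/{jar_name}"
    return [
        # Create the jar folder inside the container (no error if it exists).
        ["docker", "exec", container, "mkdir", "-p", JAR_DIR],
        # Copy the jar from the local machine into that folder.
        ["docker", "cp", local_jar, f"{container}:{target}"],
        # Submit the job; --class takes the main class, the jar follows it.
        ["docker", "exec", container, "spark-submit",
         "--master", "local[*]", "--class", main_class, target],
    ]


def run_all(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops at the first failing command.
            subprocess.run(cmd, check=True)
```

For example, `run_all(build_commands("<master_node_container_name>", "/path/to/local/jar/file", "<main_class>"))` prints the three commands without running them; pass `dry_run=False` once the printed commands look right.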

Feel free to modify the project structure and Spark code as needed for your specific use case. Happy Spark containerization and data processing!


This project is licensed under the MIT License.

