This project demonstrates how to build a data lake by containerizing a multi-node Spark cluster with Docker and running your Spark code inside the containers.

The code reads JSON-formatted records from a Kafka topic, parses them with Spark SQL's `json_tuple` function into a DataFrame with the relevant columns, and writes the result out in Parquet format. The output is appended to a path in the S3 bucket that is derived from the current timestamp.
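The flow above can be sketched in Scala roughly as follows. This is a minimal illustration only, not the project's actual job class: the broker address, topic name, field names, and bucket path are all placeholders, and the real values live in `StreamKafkaConsumerEcartFactOrder1`.

```scala
// Sketch of the Kafka -> json_tuple -> Parquet-on-S3 flow described above.
// All names marked "placeholder" are assumptions, not taken from the project.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.json_tuple
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

object StreamSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("ecart-sketch").getOrCreate()
    import spark.implicits._

    // Read raw records from Kafka; the value column arrives as bytes.
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092") // placeholder broker
      .option("subscribe", "ecart-orders")             // placeholder topic
      .load()
      .selectExpr("CAST(value AS STRING) AS json")

    // Extract the relevant fields from each JSON record with json_tuple.
    val parsed = raw.select(
      json_tuple($"json", "order_id", "customer_id", "amount") // placeholder fields
        .as(Seq("order_id", "customer_id", "amount")))

    // Append Parquet output under a timestamp-based S3 prefix.
    val ts = LocalDateTime.now.format(DateTimeFormatter.ofPattern("yyyyMMddHHmm"))
    parsed.writeStream
      .format("parquet")
      .option("path", s"s3a://my-bucket/orders/$ts/")            // placeholder bucket
      .option("checkpointLocation", "s3a://my-bucket/checkpoint") // required for streaming
      .start()
      .awaitTermination()
  }
}
```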
- Clone this repository to your local machine:

```bash
git clone https://github.com/Noosarpparashar/ecart-migration.git
cd ecart-migration
```
- Start the Spark cluster using Docker Compose:

```bash
docker-compose up
```
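For orientation, a Bitnami-based master/worker Compose setup typically looks like the fragment below. This is illustrative only; the repository ships its own `docker-compose.yml`, and the service names and ports here are common Bitnami Spark defaults, not values taken from the project file.

```yaml
# Illustrative fragment -- see the repository's docker-compose.yml for the real config.
version: "3"
services:
  spark-master:
    image: bitnami/spark:latest
    environment:
      - SPARK_MODE=master
    ports:
      - "8080:8080"   # Spark master web UI
      - "7077:7077"   # Spark master RPC port
  spark-worker:
    image: bitnami/spark:latest
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
```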
- Access the master node container (replace `<master_node_container_name>` with the actual name of your master node container):

```bash
docker exec -it <master_node_container_name> bash
```
- Create a folder named `myjars` inside the container:

```bash
mkdir myjars
```
- Copy the jar from your local machine to the `myjars` folder inside the container. Run this from your host machine (e.g. in a separate terminal), replacing `/path/to/local/jar/file` with the actual path to your jar file and `<master_node_container_name>` with the name of your master node container:

```bash
docker cp /path/to/local/jar/file <master_node_container_name>:/opt/bitnami/spark/myjars
```
- Exit the container:

```bash
exit
```
- Run the `spark-submit` command from your local machine (again replacing `<master_node_container_name>` with the name of your master node container):

```bash
docker exec -it <master_node_container_name> bash -c "cd myjars && spark-submit --master local[*] --class com.its.ecartsales.framework.jobs.controllers.StreamKafkaConsumerEcartFactOrder1 ecart-migration.jar"
```
Your Spark code will now execute within the containerized Spark cluster.
Feel free to modify the project structure and Spark code as needed for your specific use case. Happy Spark containerization and data processing!
This project is licensed under the MIT License.