This project demonstrates how to build a data lake by containerizing a multi-node Spark cluster with Docker and running your Spark code inside the containers.

The code reads JSON-formatted records from a Kafka topic, parses them with Spark SQL's `json_tuple` function into a DataFrame with the relevant columns, and writes the result out in Parquet format. The output is appended to a path in the S3 bucket that is derived from the current timestamp.
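The flow above can be sketched in Scala roughly as follows. This is a minimal illustration only, not the project's actual job class: the broker address, topic name, field names, and bucket path are all placeholders, and the real values live in `StreamKafkaConsumerEcartFactOrder1`.

```scala
// Sketch of the Kafka -> json_tuple -> Parquet-on-S3 flow described above.
// All names marked "placeholder" are assumptions, not taken from the project.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.json_tuple
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

object StreamSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("ecart-sketch").getOrCreate()
    import spark.implicits._

    // Read raw records from Kafka; the value column arrives as bytes.
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092") // placeholder broker
      .option("subscribe", "ecart-orders")             // placeholder topic
      .load()
      .selectExpr("CAST(value AS STRING) AS json")

    // Extract the relevant fields from each JSON record with json_tuple.
    val parsed = raw.select(
      json_tuple($"json", "order_id", "customer_id", "amount") // placeholder fields
        .as(Seq("order_id", "customer_id", "amount")))

    // Append Parquet output under a timestamp-based S3 prefix.
    val ts = LocalDateTime.now.format(DateTimeFormatter.ofPattern("yyyyMMddHHmm"))
    parsed.writeStream
      .format("parquet")
      .option("path", s"s3a://my-bucket/orders/$ts/")            // placeholder bucket
      .option("checkpointLocation", "s3a://my-bucket/checkpoint") // required for streaming
      .start()
      .awaitTermination()
  }
}
```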
- Clone this repository to your local machine:

```bash
git clone https://github.com/Noosarpparashar/ecart-migration.git
cd ecart-migration
```
- Start the Spark cluster using Docker Compose:

```bash
docker-compose up
```
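For orientation, a Bitnami-based master/worker Compose setup typically looks like the fragment below. This is illustrative only; the repository ships its own `docker-compose.yml`, and the service names and ports here are common Bitnami Spark defaults, not values taken from the project file.

```yaml
# Illustrative fragment -- see the repository's docker-compose.yml for the real config.
version: "3"
services:
  spark-master:
    image: bitnami/spark:latest
    environment:
      - SPARK_MODE=master
    ports:
      - "8080:8080"   # Spark master web UI
      - "7077:7077"   # Spark master RPC port
  spark-worker:
    image: bitnami/spark:latest
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
```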
- Access the master node container (replace `<master_node_container_name>` with the actual name of your master node container):

```bash
docker exec -it <master_node_container_name> bash
```
- Create a folder named `myjars` inside the container:

```bash
mkdir myjars
```
- Copy the jar from your local machine to the `myjars` folder inside the container. Run this from your host machine (e.g. in a separate terminal), replacing `/path/to/local/jar/file` with the actual path to your jar file and `<master_node_container_name>` with the name of your master node container:

```bash
docker cp /path/to/local/jar/file <master_node_container_name>:/opt/bitnami/spark/myjars
```
- Exit the container:

```bash
exit
```
- Run the `spark-submit` command from your local machine (again replacing `<master_node_container_name>` with the name of your master node container):

```bash
docker exec -it <master_node_container_name> bash -c "cd myjars && spark-submit --master local[*] --class com.its.ecartsales.framework.jobs.controllers.StreamKafkaConsumerEcartFactOrder1 ecart-migration.jar"
```
Your Spark code will now execute within the containerized Spark cluster.
Feel free to modify the project structure and Spark code as needed for your specific use case. Happy Spark containerization and data processing!
This project is licensed under the MIT License.