This code repository has all the code needed to setup a AWS Data Pipeline to do big data processing using spark application. For complete implementation details visit here.
- AWS account
- PySpark basics
- Shell scripting basics
- Login to AWS console
- Create an S3 bucket and upload the files in the repo in the below structure
- Create an EC2 instance, install task runner and start task runner service. For detailed steps click here.
- Go to Data Pipeline, click create pipeline and upload the pipeline definition json file.
- Click activate and your pipeline starts running.
We are going to setup a data pipeline to process the data daily, upon the arrival of a ready.txt dummy file. We need to create an EMR cluster once the data arrives and start processing data. So first we will create an EC2 Instance, install and start task runnner service in it.
Below are the different activities performed in the pipeline to achieve the required output.
- Check if ready.txt file exists. This is the Pre-Condition that has to be met for the active pipeline to start running.
- Once Pre-Condition is met, create an EMR cluster and install & start the task runnner as a bootstrap action.
- ShellActivity copies the input file from S3 to HDFS. Below commands are given as command in ShellActivity.
aws s3 cp #{my_s3_input_data} /home/hadoop hdfs dfs -mkdir /input hdfs dfs -copyFromLocal /home/hadoop/netflix_titles.csv /input
- EMRActivity submits the PySpark application to the EMR cluster.
- Once the data transformations are done. Another ShellActivity combines the output files into a single csv file and uploads to S3.
hdfs dfs -getmerge /output/transformed_data /home/hadoop/output.csv aws s3 cp /home/hadoop/output.csv s3://<myBucket>/data/