Data Processing using AWS Data Pipeline, S3 and EMR

Project Overview

This repository contains all the code needed to set up an AWS Data Pipeline that does big data processing with a Spark application. For complete implementation details visit here.

Pre-requisites

AWS account
PySpark basics
Shell scripting basics

Getting Started

Log in to the AWS console.
Create an S3 bucket and upload the files from this repo in the structure below.
Create an EC2 instance, install Task Runner, and start the Task Runner service. For detailed steps click here.
Go to Data Pipeline, click Create pipeline, and upload the pipeline definition JSON file.
Click Activate and your pipeline starts running.
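
Before uploading the definition file, it can help to check that every reference inside it resolves. The sketch below assumes the usual Data Pipeline definition layout: a top-level `objects` list whose entries carry an `id` and point at each other with `{"ref": "<id>"}`. The function name and the sample object ids are hypothetical and are not taken from this repo.

```python
def find_dangling_refs(definition):
    """Return 'owner -> missing_id' strings for references that match no object id."""
    objects = definition.get("objects", [])
    known_ids = {obj["id"] for obj in objects if "id" in obj}
    dangling = []

    def walk(owner_id, value):
        # References look like {"ref": "<id>"} and may sit inside lists.
        if isinstance(value, dict):
            ref = value.get("ref")
            if isinstance(ref, str) and ref not in known_ids:
                dangling.append(f"{owner_id} -> {ref}")
            for nested in value.values():
                walk(owner_id, nested)
        elif isinstance(value, list):
            for item in value:
                walk(owner_id, item)

    for obj in objects:
        for value in obj.values():
            walk(obj.get("id", "<no id>"), value)
    return dangling


# Hypothetical sample: the second object points at an id that is not defined.
sample_definition = {
    "objects": [
        {"id": "ExampleCheck"},
        {
            "id": "ExampleActivity",
            "precondition": {"ref": "ExampleCheck"},
            "runsOn": {"ref": "ExampleCluster"},
        },
    ]
}
problems = find_dangling_refs(sample_definition)
print(problems)
```

To check a real file, load it with `json.load` and pass the resulting dictionary to `find_dangling_refs`. An empty list means every reference points at a defined object.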

Pipeline flow

We are going to set up a data pipeline that processes data daily, triggered by the arrival of a dummy ready.txt file. Once the data arrives, the pipeline creates an EMR cluster and starts processing. So first we create an EC2 instance, then install and start the Task Runner service on it.

The pipeline performs the following activities to produce the required output.

Check whether the ready.txt file exists. This is the Pre-Condition that must be met before the activated pipeline starts running.
Once the Pre-Condition is met, create an EMR cluster, then install and start Task Runner on it as a bootstrap action.
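
The ready.txt check above can be expressed as an `S3KeyExists` precondition object that an activity references. The sketch below builds such objects as Python dictionaries. The bucket name, the `data/ready.txt` key, the object ids, and the choice to attach the check to the first activity are all assumptions made for illustration; the actual values live in this repo's pipeline definition file.

```python
import json


def ready_file_precondition(bucket, key="data/ready.txt"):
    """Build an S3KeyExists precondition for the marker file (location assumed)."""
    return {
        "id": "ReadyFileExists",  # hypothetical id
        "name": "ReadyFileExists",
        "type": "S3KeyExists",
        "s3Key": f"s3://{bucket}/{key}",
    }


def with_precondition(activity, precondition):
    """Return a copy of the activity that waits for the given precondition."""
    guarded = dict(activity)
    guarded["precondition"] = {"ref": precondition["id"]}
    return guarded


precondition = ready_file_precondition("example-bucket")  # placeholder bucket
first_activity = {"id": "FirstActivity", "name": "FirstActivity"}  # hypothetical
guarded = with_precondition(first_activity, precondition)

print(json.dumps({"objects": [precondition, guarded]}, indent=2))
```

The printed fragment is only the precondition wiring; a complete definition also needs the schedule, the cluster, and the remaining activities.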

ShellActivity copies the input file from S3 to HDFS. The commands below are passed as the command of the ShellActivity.

 aws s3 cp #{my_s3_input_data}  /home/hadoop
 hdfs dfs -mkdir /input
 hdfs dfs -copyFromLocal /home/hadoop/netflix_titles.csv /input
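
As a rough illustration of how these commands and the `#{my_s3_input_data}` placeholder could appear in a definition, the sketch below builds a `ShellCommandActivity` object and a matching parameter entry as Python dictionaries. The object id, the worker group name, and joining the commands with `&&` are assumptions; only the three commands and the parameter name come from this README.

```python
import json

# The three commands from this README, in order.
COPY_COMMANDS = [
    "aws s3 cp #{my_s3_input_data} /home/hadoop",
    "hdfs dfs -mkdir /input",
    "hdfs dfs -copyFromLocal /home/hadoop/netflix_titles.csv /input",
]


def copy_input_activity(worker_group):
    """Build a ShellCommandActivity picked up by Task Runner's worker group."""
    return {
        "id": "CopyInputToHdfs",  # hypothetical id
        "name": "CopyInputToHdfs",
        "type": "ShellCommandActivity",
        # '&&' stops the chain at the first failing command (a design assumption).
        "command": " && ".join(COPY_COMMANDS),
        "workerGroup": worker_group,
    }


activity = copy_input_activity("example-worker-group")  # placeholder name
fragment = {
    "objects": [activity],
    # Declares the placeholder used in the command; its value is supplied
    # when the pipeline is created or activated.
    "parameters": [
        {
            "id": "my_s3_input_data",
            "type": "AWS::S3::ObjectKey",
            "description": "S3 location of the input CSV",
        }
    ],
}
print(json.dumps(fragment, indent=2))
```

The worker group name must match the one Task Runner was started with, otherwise the activity is never picked up.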

EMRActivity submits the PySpark application to the EMR cluster.
Once the data transformations are done, another ShellActivity combines the output files into a single CSV file and uploads it to S3.
```
 hdfs dfs -getmerge /output/transformed_data /home/hadoop/output.csv
 aws s3 cp /home/hadoop/output.csv s3://<myBucket>/data/
```
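
One thing to watch with the merge step: `getmerge` simply concatenates the part files. If the Spark job writes a header row into every part file (an assumption; this README does not say whether it does), the merged CSV contains that header more than once. The sketch below shows one way to merge local part files while keeping a single header. The function name and the sample rows are hypothetical.

```python
import tempfile
from pathlib import Path


def merge_csv_parts(part_dir, merged_path):
    """Concatenate part-* files, writing the header row only once.

    Assumes every non-empty part file starts with the same header line.
    Returns the number of data rows written.
    """
    header = None
    rows = 0
    with open(merged_path, "w", newline="") as out:
        for part in sorted(Path(part_dir).glob("part-*")):
            with open(part, newline="") as src:
                first = src.readline()
                if not first:
                    continue  # skip empty part files
                if header is None:
                    header = first
                    out.write(header)
                elif first != header:
                    # This part has no repeated header: keep the line as data.
                    out.write(first)
                    rows += 1
                for line in src:
                    out.write(line)
                    rows += 1
    return rows


# Small demonstration with two made-up part files.
with tempfile.TemporaryDirectory() as tmp:
    parts = Path(tmp) / "transformed_data"
    parts.mkdir()
    (parts / "part-00000.csv").write_text("title,year\nA,2019\n")
    (parts / "part-00001.csv").write_text("title,year\nB,2020\n")
    merged = Path(tmp) / "output.csv"
    row_count = merge_csv_parts(parts, merged)
    merged_text = merged.read_text()

print(merged_text)
```

This works on files already copied to the local file system, for example with `hdfs dfs -get`. If the job writes no header at all, the plain `getmerge` command above is sufficient.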
