An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
Updated Mar 9, 2020 - Python
BERT, AWS RDS, AWS Forecast, EMR Spark Cluster, Hive, Serverless, Google Assistant + Raspberry Pi, Infrared, Google Cloud Platform Natural Language, Anomaly detection, TensorFlow, Mathematics
Reference Architectures for Datalakes on AWS
Classwork projects and homework completed for the Udacity Data Engineering Nanodegree
Terraform module to provision an Elastic MapReduce (EMR) cluster on AWS
Bits of code I use during live demos
This project offers a ready-to-use AWS EMR template built on Spot Fleet and On-Demand Instances, so you can focus on writing PySpark code.
This is an ETL application on AWS that uses general open sales and customer data, available at https://github.com/camposvinicius/data/blob/main/AdventureWorks.zip. The zipped file contains several .csv files, to which transformations are applied.
Apache Spark TPC-DS benchmark setup with EMR launch setup
A Cassandra Architecture for GDELT Database 🌍
An end-to-end data pipeline for building a Data Lake and supporting reports using Apache Spark.
Uses EMR clusters to export DynamoDB tables to S3 and generates import steps
A boilerplate for Spark projects with Docker support for local development and scripts for EMR support.
Create a data pipeline on AWS to execute batch processing in a Spark cluster provisioned by Amazon EMR. ETL using managed Airflow: extracts data from S3, transforms it using Spark, and loads the transformed data back to S3.
A large-scale data framework that will enable us to store and analyze financial market data and drive future predictions for investment.
This project demonstrates the use of Amazon Elastic MapReduce (EMR) for processing large datasets using Apache Spark. It includes a Spark script for ETL (Extract, Transform, Load) operations, AWS command line instructions for setting up and managing the EMR cluster, and a dataset for testing and demonstration purposes.
Generic Python library for provisioning EMR clusters from YAML config files (Configuration as Code)
Event-driven EMR via Serverless