A collection of Airflow sample workflows for data processing on AWS

ychantit/airflow_aws_utils

airflow_aws_utils

A collection of Airflow helper scripts to bootstrap data processing pipelines on AWS

requirements

aws_emr_concurrent_job_runner

  • This workflow showcases a way to run concurrent jobs (Spark jobs, Hive scripts, MapReduce jobs, etc.) on AWS EMR.
  • I needed a way to submit multiple jobs to a shared EMR cluster and execute them in parallel. The AWS EMR Step API only schedules jobs sequentially, and AWS Data Pipeline is too expensive, so I ended up using Airflow's SSH operator to connect to the EMR master node and submit the jobs from the command line.
  • In this workflow, the concurrent jobs are Hive scripts. Each script attempts to write a new partition of an external table stored on S3 in Parquet format.
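
The approach described above can be sketched roughly as follows. This is a minimal illustration, not the code shipped in this repository: the function names, the `emr_master_ssh` connection id, the DAG id, the script paths, and the `ds` Hive variable are all assumptions made for the example. It assumes a recent Airflow 2.x install with the `apache-airflow-providers-ssh` package; older installs expose the SSH operator under a different import path.

```python
import shlex
from datetime import datetime


def hive_command(script_path, partition_date):
    """Build the shell command that runs one Hive script for one partition.

    `ds` is a hypothetical Hive variable name; the script is assumed to
    reference it as ${hivevar:ds} when writing its partition.
    """
    return "hive -f {} --hivevar ds={}".format(
        shlex.quote(script_path), shlex.quote(partition_date)
    )


def build_dag(scripts, partition_date, ssh_conn_id="emr_master_ssh"):
    """Create one SSH task per Hive script.

    The tasks have no dependencies on each other, so the Airflow scheduler
    is free to run them at the same time on the EMR master node.
    """
    # Imported lazily so hive_command() can be used without Airflow installed.
    from airflow import DAG
    from airflow.providers.ssh.operators.ssh import SSHOperator

    dag = DAG(
        dag_id="emr_concurrent_hive_jobs",  # hypothetical DAG id
        start_date=datetime(2024, 1, 1),
        catchup=False,
    )
    for name, path in scripts.items():
        SSHOperator(
            task_id="run_" + name,
            ssh_conn_id=ssh_conn_id,  # Airflow connection pointing at the master node
            command=hive_command(path, partition_date),
            dag=dag,
        )
    return dag
```

For instance, `build_dag({"sales": "/home/hadoop/sales.hql", "clicks": "/home/hadoop/clicks.hql"}, "2024-01-01")` would yield two independent tasks. How many of them actually run in parallel still depends on the Airflow executor and its concurrency settings.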

aws_athena_query_runner

  • This workflow shows how to submit a query to AWS Athena and then block until the query completes.
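
A submit-then-block pattern like the one described above could look roughly like this. It is a sketch under assumptions, not the repository's implementation: `run_query_and_wait` and its parameters are hypothetical names, and `client` is expected to behave like a boto3 Athena client (`boto3.client("athena")`), whose `start_query_execution` and `get_query_execution` calls are used here.

```python
import time


def run_query_and_wait(client, query, database, output_location,
                       poll_seconds=5.0, sleep=time.sleep):
    """Submit an Athena query and block until it reaches a terminal state.

    Returns the query execution id on success and raises RuntimeError if
    the query fails or is cancelled.
    """
    response = client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = response["QueryExecutionId"]

    while True:
        execution = client.get_query_execution(QueryExecutionId=query_id)
        status = execution["QueryExecution"]["Status"]
        state = status["State"]
        if state == "SUCCEEDED":
            return query_id
        if state in ("FAILED", "CANCELLED"):
            reason = status.get("StateChangeReason", "no reason given")
            raise RuntimeError(
                "Athena query {} ended in state {}: {}".format(query_id, state, reason)
            )
        # Still QUEUED or RUNNING: wait before polling again.
        sleep(poll_seconds)
```

Inside an Airflow task, this function could be called from a Python callable with a real boto3 client, for example `run_query_and_wait(boto3.client("athena"), "SELECT 1", "default", "s3://my-bucket/athena-results/")`, where the database and bucket names are placeholders. The `sleep` parameter exists only so the polling loop can be exercised without real delays.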
