An end-to-end overview covering the ETL framework, the DevOps pipeline, cloud infrastructure, and more.

SixGod191001/CEDC

CEDC

Background

This project aims to build a complete cloud-based DevOps ETL process. It includes the following parts:

AWS

  1. Cloud Infrastructure
    • Jenkins on ECS
    • Airflow on EKS
  2. Airflow framework (wrapper)
  3. Jenkins DevOps Pipeline
  4. Glue ETL Common Solution
  5. Multi-account architecture

Power BI

  1. Front end development & design
  2. Backend development & design
  3. DB development & design

Azure

  1. User/Role Management Architecture
  2. Network/Security Architecture
  3. DevOps Architecture
    • Infrastructure Level DevOps
    • Project Level DevOps
  4. Project Architecture
    • ETL framework/solution
    • Data Visualization (Power BI)

Project Name

Cloud-based ETL DevOps process of Community = CEDC

Project Directory

Project Wiki

Project Wiki

Project Sprint

Sprint

Architecture

Basic logic flow

Airflow framework

Cloud Infrastructure

Account distribution

  • DevOps Account: the DevOps account, which mainly hosts Jenkins and Airflow
  • Data Account: the data lake account, which mainly hosts S3
  • Serverless Account: the ETL account, which mainly hosts Glue, Lambda, etc.
  • IDP Account: the identity account, which can assume roles in the A/B/C accounts through a User role or an Admin role

Jenkins infrastructure

Note: for the first draft, all services can be deployed centrally into a single account for demo purposes.

Airflow framework

Features

  • Parameter driven framework
  • Dependency check
  • Kickoff
  • Monitor
  • Job Retry
  • Notify
  • Metadata backend
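The features above could fit together roughly as follows. This is only a minimal sketch, not the project's actual wrapper: `JobParams`, `run_job`, and the injected `is_done`, `kickoff`, and `notify` callables are hypothetical names introduced for illustration, and the metadata backend is left out.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class JobParams:
    """Parameters that drive one job run (hypothetical shape)."""
    name: str
    upstream: List[str] = field(default_factory=list)
    max_retries: int = 2


def run_job(
    params: JobParams,
    is_done: Callable[[str], bool],   # dependency check for one upstream job
    kickoff: Callable[[str], str],    # starts the job, monitors it, returns final status
    notify: Callable[[str], None],    # sends a message (mail, chat, log, ...)
) -> str:
    # Check dependencies: do not start until every upstream job is done.
    missing = [u for u in params.upstream if not is_done(u)]
    if missing:
        notify(f"{params.name}: waiting on {', '.join(missing)}")
        return "BLOCKED"

    # Kick off and monitor, retrying up to max_retries times after the first attempt.
    total_attempts = params.max_retries + 1
    for attempt in range(1, total_attempts + 1):
        if kickoff(params.name) == "SUCCEEDED":
            notify(f"{params.name}: succeeded on attempt {attempt}")
            return "SUCCEEDED"

    notify(f"{params.name}: failed after {total_attempts} attempts")
    return "FAILED"
```

Passing the dependency check, kickoff, and notification steps in as callables keeps the sketch independent of any particular Airflow operator or AWS client.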

Jenkins DevOps Pipeline

Features

  • Deploy Airflow DAGs and Glue jobs in the project
  • Onboarding/Offboarding
  • Data validation
  • Convert SQL to Glue PySpark

Glue ETL jobs

Account prerequisite

A standard AWS serverless account with the following items:

  • Glue
  • Lambda
  • S3
  • CloudWatch Events
  • CloudWatch Logs
  • Secrets Manager
  • wip ...

Glue

Glue job naming standard:

  • <project_name>_<table_name or process_name>_prelanding
  • <project_name>_<table_name or process_name>_landing
  • <project_name>_<table_name or process_name>_landing_merge
  • <project_name>_<table_name or process_name>_refinement
  • <project_name>_<table_name or process_name>_publish
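The naming standard above can be expressed as a small helper. This is a sketch only: the function names `glue_job_name` and `all_job_names` and the example project and table names are assumptions, not part of the repository.

```python
# Stage suffixes taken from the naming standard above, in pipeline order.
STAGES = ("prelanding", "landing", "landing_merge", "refinement", "publish")


def glue_job_name(project_name: str, subject: str, stage: str) -> str:
    """Build '<project_name>_<table_name or process_name>_<stage>'."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage {stage!r}; expected one of {STAGES}")
    return f"{project_name}_{subject}_{stage}"


def all_job_names(project_name: str, subject: str) -> list:
    """Return the job names for every stage of one table or process."""
    return [glue_job_name(project_name, subject, s) for s in STAGES]


if __name__ == "__main__":
    # 'cedc' and 'orders' are made-up example values.
    for name in all_job_names("cedc", "orders"):
        print(name)
```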

IAM Roles Management

  1. Serverless Account: Glue job execution role -> DEVOPS_GLUE_CEDC_EXECUTION (a cross-account role that lets Airflow trigger Glue jobs in Account C)
  2. DevOps Account: DEVOPS_GLUE_CEDC_READ/DEVOPS_GLUE_CEDC_ADMIN (read-only or admin)
  3. IDP Account: CI/CD role -> DEVOPS_CICD_CEDC (for now, it assumes admin access to all accounts)
  4. Data Account: DEVOPS_S3_CEDC_READ/DEVOPS_S3_CEDC_ADMIN
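To make the role layout above concrete, the sketch below maps each account to its roles and builds the role ARNs that a cross-account caller would pass to STS AssumeRole. The role names come from the list above; the account IDs are made-up placeholders, and `ACCOUNT_IDS`, `ROLES`, and `role_arn` are hypothetical names.

```python
# Placeholder 12-digit account IDs (assumptions, not real accounts).
ACCOUNT_IDS = {
    "serverless": "111111111111",
    "devops": "222222222222",
    "idp": "333333333333",
    "data": "444444444444",
}

# Role names as listed in this section, grouped by the account that owns them.
ROLES = {
    "serverless": ["DEVOPS_GLUE_CEDC_EXECUTION"],
    "devops": ["DEVOPS_GLUE_CEDC_READ", "DEVOPS_GLUE_CEDC_ADMIN"],
    "idp": ["DEVOPS_CICD_CEDC"],
    "data": ["DEVOPS_S3_CEDC_READ", "DEVOPS_S3_CEDC_ADMIN"],
}


def role_arn(account: str, role_name: str) -> str:
    """Return the IAM role ARN, rejecting roles not defined for that account."""
    if role_name not in ROLES.get(account, []):
        raise ValueError(f"{role_name} is not defined for account {account!r}")
    return f"arn:aws:iam::{ACCOUNT_IDS[account]}:role/{role_name}"


if __name__ == "__main__":
    # The ARN that Airflow would assume to trigger Glue jobs in the serverless account.
    print(role_arn("serverless", "DEVOPS_GLUE_CEDC_EXECUTION"))
```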

OpenAI