Data93/Project-3-Batch-Processing-Using-Airflow-and-Spark

Project 3 Batch Processing Using Airflow and Spark

Use case: The product team needs help integrating data from our data warehouse (DWH) into their product via API:

  • Top Country Based on User
  • Total Film Based on Category
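As a rough illustration of the two aggregations above, the sketch below runs them as SQL against an in-memory SQLite database. The tables are simplified versions of the Pagila tables (`customer`, `address`, `city`, `country`, `category`, `film_category`), the sample rows are made up, and treating "user" as a Pagila customer is an assumption. SQLite stands in for Postgres here; the project itself performs the transformation with Spark.

```python
import sqlite3

# In-memory SQLite stands in for the Postgres DWH (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE country (country_id INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE city (city_id INTEGER PRIMARY KEY, country_id INTEGER);
CREATE TABLE address (address_id INTEGER PRIMARY KEY, city_id INTEGER);
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, address_id INTEGER);
CREATE TABLE category (category_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE film_category (film_id INTEGER, category_id INTEGER);

-- Made-up sample rows, not taken from the real dataset.
INSERT INTO country VALUES (1, 'India'), (2, 'Japan');
INSERT INTO city VALUES (10, 1), (11, 1), (20, 2);
INSERT INTO address VALUES (100, 10), (101, 11), (200, 20);
INSERT INTO customer VALUES (1, 100), (2, 101), (3, 200);
INSERT INTO category VALUES (1, 'Action'), (2, 'Comedy');
INSERT INTO film_category VALUES (1, 1), (2, 1), (3, 2);
""")

# Top Country Based on User: count customers per country, largest first.
TOP_COUNTRY_SQL = """
SELECT co.country, COUNT(cu.customer_id) AS total_user
FROM customer cu
JOIN address a ON a.address_id = cu.address_id
JOIN city ci ON ci.city_id = a.city_id
JOIN country co ON co.country_id = ci.country_id
GROUP BY co.country
ORDER BY total_user DESC, co.country
"""

# Total Film Based on Category: count films per category, largest first.
TOTAL_FILM_SQL = """
SELECT c.name, COUNT(fc.film_id) AS total_film
FROM film_category fc
JOIN category c ON c.category_id = fc.category_id
GROUP BY c.name
ORDER BY total_film DESC, c.name
"""

top_country = conn.execute(TOP_COUNTRY_SQL).fetchall()
total_film = conn.execute(TOTAL_FILM_SQL).fetchall()
print(top_country)  # [('India', 2), ('Japan', 1)]
print(total_film)   # [('Action', 2), ('Comedy', 1)]
```

The same joins and group-bys should carry over to Spark SQL with little change, although that has not been checked against this repository's scripts.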

Prepare Tools:

Dataset: https://www.kaggle.com/datasets/kapturovalexander/pagila-postgresql-sample-database

Flow: [diagram not available]

Notes:

  • What is TiDB? TiDB is an open-source NewSQL database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads.

Step by Step:

  • Check the connection to each database server

    • Postgres
    • TiDB
  • Run Airflow locally

    • Create the requirements.txt file
    • Build the image from the Dockerfile: docker build -t my-airflow .
    • Create the Docker Compose file, docker-compose.yaml
  • Set up the connections in Airflow

  • Extract:

    • Create a Postgres connector module
    • Create a module to get data from Postgres
  • Transform:

    • Create a script that transforms the data using Spark
  • Load:

    • Create a Hadoop connector module
    • Create a module to load data into Hadoop
Result: [screenshot: WhatsApp Image 2024-09-08 at 05 06 07_cba724bc]
