
TPC-H Big Data Pipeline with Apache Spark, Iceberg, and dbt


Prerequisites

Before running this project, make sure the following are installed and configured:

sudo apt update
sudo apt install python3-pip
sudo apt install openjdk-17-jdk
pip install -r requirements.txt
  1. Install Spark 3.4.2:

    wget https://archive.apache.org/dist/spark/spark-3.4.2/spark-3.4.2-bin-hadoop3.tgz
    tar -xvzf spark-3.4.2-bin-hadoop3.tgz
    mv spark-3.4.2-bin-hadoop3 ~/spark
    rm spark-3.4.2-bin-hadoop3.tgz
  2. Add Spark to PATH: append the following to ~/.bashrc:

    export SPARK_HOME="$HOME/spark"
    export PATH="$SPARK_HOME/bin:$PATH"

    Apply the changes:

    source ~/.bashrc
    spark-submit --version
  3. Download the Apache Iceberg JAR:

    mkdir -p ~/.ivy2/jars
    wget https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.4_2.12/1.9.1/iceberg-spark-runtime-3.4_2.12-1.9.1.jar -P ~/.ivy2/jars
    cp ~/.ivy2/jars/iceberg-spark-runtime-3.4_2.12-1.9.1.jar ~/spark/jars/
  4. Make run.sh executable:

    chmod +x run.sh
  5. Create a .env file based on .env.docker or .env.example.
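The variable names required by run.sh and load_to_iceberg.py are defined in .env.example and .env.docker; the keys below are illustrative placeholders only, sketching what a Spark-and-Iceberg .env typically holds:

```shell
# Hypothetical .env sketch — copy the real keys from .env.example or .env.docker
SPARK_HOME=/home/<user>/spark          # illustrative; path from the Spark install step
ICEBERG_WAREHOUSE=./iceberg_lakehouse  # illustrative; local warehouse directory
TPCH_CSV_DIR=./tpch-csv                # illustrative; location of the converted CSVs
```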


🛠️ Run Locally Without Docker

The execution steps are as follows:

  1. Clone the main repository:

    git clone https://github.com/Raylouiss/Tubes_big_data.git
    cd Tubes_big_data
  2. Clone the TPC-H generator into the repo (skip if tpch-dbgen already contains files):

    git clone https://github.com/electrum/tpch-dbgen
    cd tpch-dbgen
    make
    ./dbgen -s 50
  3. Convert the .tbl files to .csv (run from Tubes_big_data; skip if tpch-csv already contains files):

    cd ..
    ./run.sh clean_data
  4. Load the data into Apache Iceberg (run from Tubes_big_data; skip if iceberg_lakehouse already contains files):

    ./run.sh iceberg_submit
  5. (Optional) Run SQL queries against the Iceberg tables for testing:

    ./run.sh iceberg_sql
  6. Start Spark Thrift Server:

    ./run.sh start_sts
  7. Run dbt:

    ./run.sh dbt_profile
    dbt debug
    dbt run --vars '{"startdate": "1997-01-01", "enddate": "1997-12-31"}' --select top_customers
    dbt run --vars '{"startdate": "1997-01-01", "enddate": "1997-12-31"}' --select top_products
  8. To add more queries, create new models in the models folder.

  9. Stop Spark Thrift Server:

    ./run.sh stop_sts
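Step 3's `./run.sh clean_data` converts dbgen's pipe-delimited `.tbl` output (each row ends with a trailing `|`) into `.csv`. The actual logic lives in `run.sh`; the following is only a rough Python sketch of that transformation, assuming the standard dbgen row format:

```python
import csv
import io

def tbl_to_csv(tbl_text: str) -> str:
    """Convert pipe-delimited dbgen .tbl rows (with a trailing '|') to CSV text."""
    out = io.StringIO()
    writer = csv.writer(out)
    for line in tbl_text.splitlines():
        # dbgen terminates every row with '|', so drop the empty trailing field
        writer.writerow(line.split("|")[:-1])
    return out.getvalue()

sample = "1|Customer#000000001|IVhzIApeRb|15|25-989-741-2988|711.56|BUILDING|\n"
print(tbl_to_csv(sample))
```

Dropping the empty field left by the trailing delimiter is the one subtlety; the rest is a straight delimiter swap, with the csv writer handling any quoting.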

🐳 Run with Docker

The execution steps using Docker are as follows:

  1. Install Docker. Make sure Docker is installed and running; see the official Docker documentation for installation steps.
  2. Rename .env.docker to .env. You can modify the env values as needed.
  3. Generate TPC-H data (skip if the data has already been generated):
    cd tpch-dbgen
    docker build -t tpch-dbgen .
    chmod +x docker_run.sh
    ./docker_run.sh <file size in GB>
    
  4. Run the Docker container:
    docker compose up -d
    docker exec -it tpch_pipeline bash
    
  5. Load data to Apache Iceberg. Go back to the project root directory and load the data into Apache Iceberg using:
    ./run.sh iceberg_submit
    Verify the data exists with ./run.sh iceberg_sql
  6. Start Spark Thrift Server
    ./run.sh start_sts
  7. Generate dbt profile
    ./run.sh dbt_profile
    dbt debug
    dbt run --vars '{"startdate": "1997-01-01", "enddate": "1997-12-31"}' --select top_customers
    dbt run --vars '{"startdate": "1997-01-01", "enddate": "1997-12-31"}' --select top_products
  8. Stop Spark Thrift Server
    ./run.sh stop_sts
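The `--vars` argument used with `dbt run` above is a JSON/YAML mapping that dbt parses and exposes to models through the `var()` Jinja function (the project's `date_filter.sql` macro presumably reads these dates). A minimal illustration of the payload shape:

```python
import json

# The exact JSON string passed via --vars in the commands above
payload = '{"startdate": "1997-01-01", "enddate": "1997-12-31"}'

# dbt parses this mapping and serves each key through var(),
# e.g. {{ var("startdate") }} would render the start date below
run_vars = json.loads(payload)
print(run_vars["startdate"], "to", run_vars["enddate"])
```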

📝 Bonus: dbt testing

Run dbt data quality checks with:

dbt test

Output from a previous run can be seen in output/dbt_test.txt.
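`dbt test` runs the checks declared in models/schema.yml. The column names below are illustrative placeholders, not the project's actual schema; a typical declaration looks like:

```yaml
# Illustrative schema.yml sketch — the real column names are in models/schema.yml
version: 2

models:
  - name: top_customers
    columns:
      - name: customer_key        # hypothetical column name
        tests:
          - not_null
          - unique
```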

📝 Bonus: dbt docs

Generate dbt's automated documentation:

dbt docs generate
dbt docs serve --port [port]

🧊 Folder Structure

Tubes_big_data/
├── analytics/                   # Query analysis results
├── csv_analytics/               # Python scripts for querying the CSV files directly
├── iceberg_lakehouse/           # Iceberg data (local warehouse)
├── macros/                      # dbt macros
│ ├── date_filter.sql
│ └── revenue.sql
├── models/                      # dbt models
│ ├── top_customers.sql
│ ├── top_products.sql
│ └── schema.yml
├── output/                      # dbt test results
├── snapshots/                   # dbt snapshots
├── target/                      # dbt build artifacts (manifest, run_results, etc.)
├── tpch-csv/                    # .tbl data converted to .csv (created during setup)
├── tpch-dbgen/                  # TPC-H data generator
├── .dockerignore                # Excludes files from the Docker build context
├── .env                         # Environment configuration for Spark & Iceberg
├── .env.example                 # Example environment configuration
├── .gitignore                   # Excludes files from Git tracking
├── Dockerfile                   # Dockerfile
├── README.md                    # Project documentation
├── dbt_project.yml              # dbt project configuration
├── docker-compose.yml           # Docker Compose configuration
├── load_to_iceberg.py           # Python script for loading data into Iceberg
├── requirements.txt             # Python dependencies
└── run.sh                       # Script for clean_data, Iceberg submit & SQL, dbt profile, STS start/stop, and Spark SQL
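The `revenue.sql` macro most likely encodes the standard TPC-H revenue expression, `l_extendedprice * (1 - l_discount)`; a quick Python check of that arithmetic:

```python
def revenue(extended_price: float, discount: float) -> float:
    """Standard TPC-H revenue term: price reduced by the discount fraction."""
    return extended_price * (1.0 - discount)

# A 10% discount on a 100.0 extended price yields 90.0 in revenue
print(revenue(100.0, 0.10))
```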

📚 References
