Before running this project, make sure the following have been installed and configured:
sudo apt update
sudo apt install python3-pip
sudo apt install openjdk-17-jdk
pip install -r requirements.txt
- Install Spark 3.4.2:
wget https://archive.apache.org/dist/spark/spark-3.4.2/spark-3.4.2-bin-hadoop3.tgz
tar -xvzf spark-3.4.2-bin-hadoop3.tgz
mv spark-3.4.2-bin-hadoop3 ~/spark
rm spark-3.4.2-bin-hadoop3.tgz
- Add Spark to PATH: add the following to ~/.bashrc:
export SPARK_HOME="$HOME/spark"
export PATH="$SPARK_HOME/bin:$PATH"
Apply the changes and verify the installation:
source ~/.bashrc
spark-submit --version
- Download the Apache Iceberg JAR:
mkdir -p ~/.ivy2/jars
wget https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.4_2.12/1.9.1/iceberg-spark-runtime-3.4_2.12-1.9.1.jar -P ~/.ivy2/jars
cp ~/.ivy2/jars/iceberg-spark-runtime-3.4_2.12-1.9.1.jar ~/spark/jars/
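The run.sh helpers are expected to handle the Spark-to-Iceberg configuration, but as a rough point of reference, a manual spark-sql session against a local Iceberg Hadoop catalog could be configured like the sketch below (the catalog name local and the warehouse path are illustrative assumptions, not values taken from this project's scripts):
spark-sql \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.local.type=hadoop \
  --conf spark.sql.catalog.local.warehouse=$PWD/iceberg_lakehouse  # catalog name and warehouse path are assumptions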
- Make the run.sh script executable:
chmod +x run.sh
- Create a .env file based on .env.docker or .env.example.
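For example, to start from the provided template and then adjust the values as needed:
cp .env.example .env  # then edit the values for your environment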
The execution steps are as follows:
- Clone the main repository:
git clone https://github.com/Raylouiss/Tubes_big_data.git
cd Tubes_big_data
- Clone the TPC-H generator into the repo and generate the data (skip this if tpch-dbgen already has its contents; the -s 50 flag sets the TPC-H scale factor, i.e. roughly 50 GB of generated data):
git clone https://github.com/electrum/tpch-dbgen
cd tpch-dbgen
make
./dbgen -s 50
- Convert the .tbl files to .csv (run this from Tubes_big_data; skip if tpch-csv already has its contents):
cd ..
./run.sh clean_data
- Load the data into Apache Iceberg (run this from Tubes_big_data; skip if iceberg_lakehouse already has its contents):
./run.sh iceberg_submit
- (Optional) Run SQL queries against the Iceberg tables for testing:
./run.sh iceberg_sql
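As an illustration, a simple sanity check could be a row count on one of the TPC-H tables; the catalog and schema in the identifier below are placeholders and must match whatever load_to_iceberg.py actually creates:
-- placeholder identifiers; adjust the catalog/schema to this project's setup
SELECT COUNT(*) FROM local.tpch.customer;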
- Start Spark Thrift Server:
./run.sh start_sts
- Run dbt:
./run.sh dbt_profile
dbt debug
dbt run --vars '{"startdate": "1997-01-01", "enddate": "1997-12-31"}' --select top_customers
dbt run --vars '{"startdate": "1997-01-01", "enddate": "1997-12-31"}' --select top_products
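The startdate and enddate values passed with --vars are read inside the models or macros through dbt's var() function; a filter along these lines illustrates the idea (a sketch only, not necessarily this project's actual date_filter macro):
-- sketch only: o_orderdate is a placeholder column, not confirmed from this project
where o_orderdate between '{{ var("startdate") }}' and '{{ var("enddate") }}'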
- If you want to add your own queries, you can add them in the models folder (a sketch of a new model is shown below).
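For instance, a new model could be added as a hypothetical models/top_suppliers.sql (the file name and the bare lineitem table reference below are illustrative assumptions; adapt them to how this project actually exposes its Iceberg tables):
-- models/top_suppliers.sql (hypothetical example, not part of this project)
-- Top 10 suppliers by revenue, using standard TPC-H lineitem columns
select
    l_suppkey,
    sum(l_extendedprice * (1 - l_discount)) as total_revenue
from lineitem  -- placeholder table reference
group by l_suppkey
order by total_revenue desc
limit 10
-- a date filter (e.g. via the date_filter macro or var()) could be added in a where clause
It could then be built with dbt run --select top_suppliers.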
- Stop Spark Thrift Server:
./run.sh stop_sts
The execution steps using Docker are as follows:
- Install Docker: make sure Docker is installed and running. The Docker installation process can be seen here.
- Rename .env.docker
Rename .env.docker to .env. You can modify the env values as needed.
- Generate TPC-H data (skip if the data has already been generated)
Generate the TPC-H data with these steps:
cd tpch-dbgen
docker build -t tpch-dbgen .
chmod +x docker_run.sh
./docker_run.sh <file size in GB>
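For example, assuming docker_run.sh takes only the desired data size in GB as its argument (as the placeholder above suggests), generating roughly 1 GB would be:
./docker_run.sh 1  # hypothetical invocation: roughly 1 GB of generated data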
- Run Docker container
Run the Docker container with these steps:
docker compose up -d
docker exec -it tpch_pipeline bash
- Load data to Apache Iceberg
Go back to the project root directory and load the data into Apache Iceberg using:
./run.sh iceberg_submit
Check that the data exists using:
./run.sh iceberg_sql
- Start Spark Thrift Server
./run.sh start_sts
- Generate dbt profile
./run.sh dbt_profile
dbt debug
dbt run --vars '{"startdate": "1997-01-01", "enddate": "1997-12-31"}' --select top_customers
dbt run --vars '{"startdate": "1997-01-01", "enddate": "1997-12-31"}' --select top_products
- Stop Spark Thrift Server
./run.sh stop_sts
Run the dbt data quality checks with:
dbt test
Previous output can be seen inside output/dbt_test.txt.
Generate the automated dbt documentation:
dbt docs generate
dbt docs serve --port [port]
Project structure:
Tubes_big_data/
├── analytics/ # Folder for storing query analysis results
├── csv_analytics/ # Folder containing Python scripts for querying the CSV files
├── iceberg_lakehouse/ # Folder for storing Iceberg data (local warehouse)
├── macros/ # dbt macros folder
│ ├── date_filter.sql
│ └── revenue.sql
├── models/ # dbt models folder
│ ├── top_customers.sql
│ ├── top_products.sql
│ └── schema.yml
├── output/ # Folder for storing dbt test results
├── snapshots/ # Folder for storing snapshots
├── target/ # dbt build artifacts (manifest, run_results, etc.)
├── tpch-csv/ # Data converted from .tbl to .csv (created during execution)
├── tpch-dbgen/ # TPC-H data generator
├── .dockerignore # Excludes files from the Docker build context
├── .env # Environment configuration for Spark & Iceberg
├── .env.example # Example environment configuration
├── .gitignore # Excludes files from Git tracking
├── Dockerfile # Dockerfile
├── README.md # Project documentation
├── dbt_project.yml # dbt project configuration
├── docker-compose.yml # Docker Compose configuration
├── load_to_iceberg.py # Python script for loading data into Iceberg
├── requirements.txt # Python dependencies
└── run.sh # Script for data cleaning, Iceberg submit & SQL, the dbt profile, starting & stopping the STS, and Spark SQL