dbt Core

dbt Core is a popular open-source command-line tool used for orchestrating and managing the transformation and modeling of data in your data warehouse. It is particularly well suited for analytics teams that work with SQL-based transformations in data warehousing environments. Here are some reasons why you might consider using dbt in your data workflows:

  • Modularity and Reusability
  • Version Control
  • Code Reviews and Collaboration
  • Data Testing
  • Incremental Processing
  • Easy Deployment
  • Documentation
  • Data Lineage
  • Elastic Compute Scaling
  • Community and Ecosystem

In this repo, we will explore dbt Core with Apache Spark.

Prerequisites

  • Python 3.10 (version 3.8 or higher is suitable)
  • Spark Thrift Server (Spark 3.1.1)
  • Linux OS (to use the commands below directly)

Steps

  • Create a Spark Thrift Server using Docker and run it
  • Copy the csv files to the Docker container
  • Connect to the Spark container
  • Run beeline and connect to Hive
  • Create tables from the csv files
  • Create a venv on your local machine
  • Install dbt
  • Initialize and configure dbt
  • Check the configuration using debug
  • Create sql and yml files for the transformations
  • Run dbt
  • Create a DBeaver connection and check the data easily

Create a Spark Thrift Server using Docker and run it

This docker-compose starts a Spark Thrift server and a Postgres database as a Hive Metastore backend.

docker-compose up -d
docker ps
CONTAINER ID   IMAGE                      COMMAND                  CREATED       STATUS       PORTS                                              NAMES
0594d2d79ef6   godatadriven/spark:3.1.1   "/scripts/entrypoint…"   3 weeks ago   Up 3 hours   0.0.0.0:4040->4040/tcp, 0.0.0.0:10000->10000/tcp   dbt_core-dbt-spark3-thrift-1
ef7fb9edd3aa   postgres:9.6.17-alpine     "docker-entrypoint.s…"   3 weeks ago   Up 3 hours   0.0.0.0:5433->5432/tcp                             dbt_core-dbt-hive-metastore-1
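
For reference, a minimal docker-compose.yml for this setup might look roughly like the sketch below. This is an assumption reconstructed from the containers listed above, not the repo's actual file, which may differ in volumes, environment variables, and the Thrift server startup command.

version: "3.7"
services:
  dbt-spark3-thrift:
    image: godatadriven/spark:3.1.1
    ports:
      - "10000:10000"   # Thrift JDBC port used by dbt and beeline
      - "4040:4040"     # Spark UI
    depends_on:
      - dbt-hive-metastore

  dbt-hive-metastore:
    image: postgres:9.6.17-alpine
    ports:
      - "5433:5432"     # Postgres backing the Hive Metastore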

Copy the csv files to the Docker container

Change "your_path" and "container_id" as your info.

cd /*your_path*/dbt_core/seeds
docker cp raw_customers.csv *container_id*:/opt
docker cp raw_orders.csv *container_id*:/opt
docker cp raw_payments.csv *container_id*:/opt
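
To confirm the files landed in the container, you can list the target directory (using the container id reported by docker ps):

docker exec *container_id* ls /opt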

Connect to the Spark container

docker exec -it <container_id> bash

Run beeline and connect to Hive

/usr/spark/bin/beeline
!connect jdbc:hive2://localhost:10000

Example Output:

root@0594d2d79ef6:/# /usr/spark/bin/beeline
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Beeline version 2.3.7 by Apache Hive
beeline> !connect jdbc:hive2://localhost:10000
Connecting to jdbc:hive2://localhost:10000
Enter username for jdbc:hive2://localhost:10000: train
Enter password for jdbc:hive2://localhost:10000: ********
Connected to: Spark SQL (version 3.1.1)
Driver: Hive JDBC (version 2.3.7)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000>

Create tables from the csv files

CREATE TABLE default.raw_customers (
    id int,
    first_name string,
    last_name string
    ) 
USING CSV
OPTIONS (path "/opt/raw_customers.csv",
        delimiter ",",
        header "true");

CREATE TABLE default.raw_orders (
    id int,
    user_id int,
    order_date date,
    status string
    )
USING CSV
OPTIONS (path "/opt/raw_orders.csv",
        delimiter ",",
        header "true");

CREATE TABLE default.raw_payments (
    id int,
    order_id int,
    payment_method string,
    amount bigint
    )
USING CSV
OPTIONS (path "/opt/raw_payments.csv",
        delimiter ",",
        header "true");

Create a venv on your local machine

python3 -m venv dbt-env
source dbt-env/bin/activate

Install dbt

pip install "dbt-spark[PyHive]"

Note: the sasl dependency can be a problem when installing the package. If you are using Linux, you can follow the steps here.
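
As an assumption for Debian/Ubuntu systems (package names differ on other distributions), installing the SASL and Python development headers before pip usually resolves the sasl build error:

sudo apt-get update
sudo apt-get install -y libsasl2-dev python3-dev build-essential
pip install "dbt-spark[PyHive]"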

Initialize and configure dbt

dbt init tutorial_spark

[Screenshots: dbt_init_1, dbt_init_2]

After configuration, a project folder structure will be created.
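
The profile written to ~/.dbt/profiles.yml should look roughly like the sketch below. The host, port, user, and schema are assumptions matching the Thrift server and beeline login used above; adjust them to whatever you entered during dbt init.

tutorial_spark:
  target: dev
  outputs:
    dev:
      type: spark
      method: thrift
      host: localhost
      port: 10000
      user: train
      schema: default
      threads: 1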

Check the configuration using debug

cd tutorial_spark
dbt debug

[Screenshot: dbt_debug_1]

Create sql and yml files for the transformations

Visit https://github.com/dbt-labs/jaffle_shop, grab the "models" folder from there, and copy its files into your tutorial_spark/models folder.

[Screenshot: models_1]
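
For orientation, a staging model from jaffle_shop looks roughly like this sketch of stg_customers.sql (the exact contents depend on the jaffle_shop version): it reads the raw customer data and renames the columns.

with source as (
    select * from {{ ref('raw_customers') }}
),

renamed as (
    select
        id as customer_id,
        first_name,
        last_name
    from source
)

select * from renamed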

Run dbt

dbt run

[Screenshot: dbt_run_1]
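
If the copied yml files define tests for the models, you can also run them and build the documentation site with the standard dbt commands (optional, not specific to this repo):

dbt test
dbt docs generate
dbt docs serve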

Create a DBeaver connection and check the data easily

Download DBeaver from https://dbeaver.io/download/

[Screenshots: dbeaver_1, dbeaver_2, dbeaver_3]
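
In DBeaver, a connection created with the Apache Hive JDBC driver and the settings below should work; the username is an assumption matching the beeline login used earlier.

JDBC URL : jdbc:hive2://localhost:10000
Host     : localhost
Port     : 10000
Username : train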
