Open Data Stack

Deploy an open source and modern data stack in no time with Terraform.

Intro

This repository deploys open source tools easily so you can stand up a modern data stack.

There is no real need for Airflow here: with this repository, dbt is meant to be deployed serverless (Cloud Workflows + Cloud Scheduler). This is an opinionated choice, because I dislike how data teams currently use Airflow, but I will still add support for it later on.

It only supports GCP for now.

As Airbyte and Lightdash need multiple containers to run, they can't be deployed serverless; instead, they are deployed on Compute Engine VMs.

Planning to later add support for:

  • Airflow, for people who don't want to reduce costs by staying serverless.
  • Snowflake, because it's as popular as BigQuery, so why not.
  • AWS, because it's the most popular cloud provider.
  • Metabase, because it was (and still is) the go-to visualization tool to deploy in early-stage projects.
  • DuckDB, because it's a trending database for analytics workloads.

Data Tools

  • GCP: IAM, APIs, etc.
  • Airbyte: Extract and Load
  • dbt: Transform
  • BigQuery: Warehouse
  • Cloud Workflows / Cloud Scheduler: Schedule
  • Lightdash: Visualize
  • Streamlit: Machine Learning UI

Begin your journey

Setup GCP Account and Billing

To use the Google Cloud Platform with this project, you will need to create a Google Cloud account and enable billing. In the billing page, you will find a billing ID in the format ######-######-######. Make sure to note this value, as it will be required in the next step.
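
If you already have the gcloud CLI installed, you can also list your billing accounts from the terminal (assuming a recent gcloud release; on older ones this command lives under gcloud beta billing):

gcloud billing accounts list

The ACCOUNT_ID column shows the billing ID in that ######-######-###### format.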

You can even set up an organization if you need one and have a professional DNS domain. You should follow the instructions at the following link.

If you want to avoid the next few steps, you can use the Docker image already created for you.

It has all the tools needed: the gcloud CLI, Terraform, etc.
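
A minimal way to run it, assuming Docker is installed (IMAGE is a placeholder for the image mentioned above):

docker run -it --rm -v "$(pwd)":/repo -w /repo IMAGE bash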

Setup Google Cloud CLI

To set up the Google Cloud SDK on your computer, follow the instructions provided for your specific operating system. Once you have installed the gcloud command-line interface (CLI), open a terminal window and run the following command. This will allow Terraform to use your default credentials for authentication.

gcloud auth application-default login

Install Terraform

To use Terraform, you will first need to install the Terraform CLI on your local machine. Follow the instructions provided at the following link to complete the installation.
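
Once installed, you can verify the CLI is on your PATH:

terraform -version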

Fork this repository

You can just do it through the GitHub UI and then clone the fork to your local machine.

Or you can fork and clone it in one step with the following command:

gh repo fork REPOSITORY --org ORGANIZATION --clone=true

This command requires the GitHub CLI, which you can install with the following instructions.

TODO document github PAT

Deploy the open data stack

Fill up your .env file

To create the resources on Google Cloud, you first have to fill in your .env file. A template is provided; just copy it and rename it to .env.

Then fill in only what you need. At a minimum, BILLING_ID (which you already have from step 1), PROJECT, REGION and ZONE should be set. You can keep the default REGION and ZONE set in the template file. Make sure your PROJECT value is 6 to 30 characters long and only contains lowercase letters, numbers, and hyphens.
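
For illustration, a minimal .env could look like this (all values are placeholders; the REGION and ZONE shown here are assumptions, keep the defaults from the template):

BILLING_ID=012345-6789AB-CDEF01
PROJECT=my-open-data-stack
REGION=europe-west1
ZONE=europe-west1-b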

(Optional) folder creation for organization

If you set up an organization (optional), you can also create a folder (optional). For this optional step, fill in the FOLDER_NAME value in your .env file and then run:

make create-folder

Check the FOLDER_ID value in the CLI output; it's the value shown under the green rectangle in the example screenshot. Fill in your .env file with this value.
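
Folders can also be listed directly with the gcloud CLI (an alternative way to find the ID, not necessarily what make create-folder uses under the hood):

gcloud resource-manager folders list --organization=ORGANIZATION_ID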

Deploy the whole thing

Finally run the following command in a terminal window:

make all

This will create a Google Cloud project and a GCS bucket for storing Terraform state, then deploy the IaC. That's it.
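
Roughly, this corresponds to a standard bootstrap flow; a hand-rolled sketch (assuming a conventional GCS-backend Terraform setup, not the exact Makefile recipes):

# create the project and link billing
gcloud projects create "$PROJECT"
gcloud billing projects link "$PROJECT" --billing-account="$BILLING_ID"
# bucket for remote Terraform state
gsutil mb -p "$PROJECT" -l "$REGION" "gs://${PROJECT}-tfstate"
# deploy the infrastructure
terraform init
terraform apply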

The different tools deployed

(Optional) Load balancers for personal DNS

You can enable IAP and HTTPS endpoints for the Airbyte and Lightdash instances. This is configured in the load_balancer files and through the DNS variable that you can optionally set in your .env file.

Your two instances will then only be reachable through your DNS, by IAP-authenticated users: airbyte.yourdns.com / lightdash.yourdns.com.
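
For example, in your .env (assuming the variable holds a bare domain; yourdns.com is a placeholder):

DNS=yourdns.com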

Airbyte

If you want to access your Airbyte instance directly, you can tunnel the instance port to your localhost with this command:

make airbyte-tunnel

You can now create connections between your sources and destinations at localhost:8002.
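
Under the hood this is presumably an SSH tunnel; a hand-rolled equivalent could look like this (a sketch: the instance name airbyte-vm, the remote port 8000 and the use of --tunnel-through-iap are assumptions, not the verified Makefile contents):

# forward localhost:8002 to the Airbyte webapp on the VM, without opening a shell
gcloud compute ssh airbyte-vm --zone="$ZONE" --tunnel-through-iap -- -L 8002:localhost:8000 -N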

When you don't need to connect to the instance anymore, just run:

make airbyte-fuser
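
Judging by the target name, this frees the local port by killing whatever process holds it; presumably something like (an assumption about the Makefile, not its verified contents):

# kill the process bound to local TCP port 8002 (the tunnel)
fuser -k 8002/tcp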

dbt

You can initialize a dbt project with the command:

make dbt-init

It is based on three env variables in your .env file: PROJECT, DBT_PROJECT and DBT_DATASET.

Then you can run your models, views, etc. locally with the following command:

make dbt-run
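
These targets presumably wrap the standard dbt CLI; a rough equivalent (a sketch assuming the dbt-bigquery adapter and default profile locations, not the exact Makefile recipes):

# scaffold a dbt project named after DBT_PROJECT
dbt init "$DBT_PROJECT"
# build models, views, etc. against BigQuery
dbt run --project-dir "$DBT_PROJECT"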

TODO develop serverless dbt

Lightdash

If you want to access your Lightdash instance directly, you can tunnel the instance port to your localhost with this command:

make lightdash-tunnel

You can now connect at localhost:8003. Sadly, Lightdash isn't really Terraform-friendly, so a few UI steps are needed. For now I don't know how to automate this; I will need to deep-dive into the CLI (or the API, if there is one).
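
A possible starting point for that automation is the Lightdash CLI; a sketch, assuming Node.js is available and you run it from inside your dbt project (none of this is wired up in the repo yet):

# install the CLI and log in against the tunneled instance
npm install -g @lightdash/cli
lightdash login http://localhost:8003
# create a Lightdash project from the local dbt project
lightdash deploy --create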

See the Lightdash initial project setup tutorial in our docs here.

When you don't need to connect to the instance anymore, just run:

make lightdash-fuser

Known issues and technical difficulties

New latest image version
