Skip to content

04 Building and Serving OLAP Database

Matin Nuhamunada edited this page Apr 18, 2024 · 15 revisions

Building Knowledgebase: ELT with DBT and DuckDB

Alt text BGCFlow integrates various bionformatic tools for the analysis of genomes and BGCs. Many of these tools generate table, or table-like outputs. Using bgcflow build database we can extract data from all this tables, build a parquet file, and load them into DuckDB (SQL OLAP Database) using data build tools (DBT).

This step will generate:

  • Parquet tables, located in data/processed/<PROJECT NAME>/data_warehouse/
  • DuckDB SQL database, located in data/processed/<PROJECT NAME>/dbt/antiSMASH_<ANTISMASH_VERSION>/duckdb.db

Exploring the Knowledgebase with Metabase

Metabase admin setup

Alt text For interactive exploration and visualization of the tables, BGCFlow utilizes Metabase, a popular business analytics platform.

To run metabase, run:

bgcflow serve --metabase

If you are running Metabase for the first time, you will be given a default admin id and password. Please change this once you access the admin panel of Metabase. Note that this password will only shown once!

$ bgcflow serve --metabase
(cd resources/metabase && java -Xms2g -Xmx8g -DMB_SETUP_TOKEN=ad0fb086-351b-4fa5-a17e-76282d2c9753 -jar metabase_v0.46.2.jar)
Metabase is running - pid: 20201
Connecting to metabase...
Retrying to connect: 1x
Retrying to connect: 2x
Retrying to connect: 3x
Metabase setup successful.

Login account: admin@bgcflow.com
Password: Q&@Gt853Q7Ae
Please change the password once you logged in.

Metabase is running at: http://localhost:3000
To add database connections, see: https://www.metabase.com/docs/latest/databases/connecting
Press q to quit Metabase server: 

This will start a Metabase server with DuckDB plugin that is served in http://localhost:3000. Follow the Metabase documentation to load and explore the data.

image image image image

Using command line script to add and synchronise DBT model in metabase

In BGCFlow_wrapper 0.3.1, we introduce a script to synchronize the dbt schema with the DuckDB data in metabase.

Usage

$ bgcflow sync -h
Usage: bgcflow sync [OPTIONS] PROJECT_NAME

  Upload and sync DuckDB database to Metabase.

Options:
  --dbt-excludes TEXT       A list of dbt models to exclude from the
                            synchronization.
  --metabase-http           Use HTTP instead of HTTPS to connect to Metabase.
  --dbt-database TEXT       The name of the dbt database to use.
  --metabase-database TEXT  The name of the Metabase database to use.
  --dbt-schema TEXT         The name of the dbt schema to use.
  --mb-password TEXT        The Metabase password. If None, the user will be
                            prompted to enter their password.
  --mb-username TEXT        The Metabase username. If None, the user will be
                            prompted to enter their username.
  --metabase-host TEXT      The URL of the Metabase server.
  --dbt-dir DIRECTORY       The directory containing the dbt project to
                            upload. If None, the directory is inferred from
                            the BGCFlow project directory.
  --bgcflow-dir DIRECTORY   The root directory of the BGCFlow project.
  -h, --help                Show this message and exit.

Arguments

  • project-name: The name of the project to upload to Metabase.
  • --bgcflow-dir: The root directory of the BGCFlow project. Default is the current directory.
  • --dbt-dir: The directory containing the dbt project to upload. If None, the directory is inferred from the BGCFlow project directory.
  • --metabase-host: The URL of the Metabase server. Default is http://localhost:3000.
  • --mb-username: The Metabase username. If None, the user will be prompted to enter their username.
  • --mb-password: The Metabase password. If None, the user will be prompted to enter their password.
  • --dbt-schema: The name of the dbt schema to use. Default is main.
  • --metabase-database: The name of the Metabase database to use. If None, the project name is used.
  • --dbt-database: The name of the dbt database to use. Default is dbt_bgcflow.
  • --metabase-http: Whether to use HTTP instead of HTTPS to connect to Metabase.

Example

bgcflow sync Lactobacillus_delbrueckii

image image image

Uploading your data to Motherduck

You can also upload the generated database to Motherduck.

python workflow/scripts/motherduck.py -h
usage: motherduck.py [-h] [--database_name DATABASE_NAME] database_path

Upload a local DuckDB database to a remote DuckDB database.

positional arguments:
  database_path         The path to the local DuckDB database.

options:
  -h, --help            show this help message and exit
  --database_name DATABASE_NAME
                        The name of the remote database to create or replace. Default is "bgcflow".