
Egor/dev 1212 jaffle shop to duckdb ingestion first notebook #12

Merged
ZmeiGorynych merged 19 commits into main from
egor/dev-1212-jaffle-shop-to-duckdb-ingestion-first-notebook
Apr 8, 2026

Conversation

@ZmeiGorynych
Member

@ZmeiGorynych ZmeiGorynych commented Apr 6, 2026

Summary by CodeRabbit

  • New Features

    • Added a Jaffle Shop example set to generate sample data, build a DuckDB database, load a schema, and auto-ingest into the semantic layer; includes a CLI-style example that verifies integrity and prints reports.
  • Documentation

    • Added a notebook walkthrough demonstrating generation, loading, ingestion, and semantic-layer queries; updated docs to reflect new example and test locations.
  • Tests

    • Added integration tests for the example pipeline and notebook execution.
  • Chores

    • Added optional example dependencies and tooling.

@coderabbitai
Contributor

coderabbitai bot commented Apr 6, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds a complete Jaffle Shop example set: CSV data generation, DuckDB schema and ingestion scripts, a notebook demonstrating YAML-backed SLayer ingestion, integration tests to validate data and notebook execution, project dependency updates, and docs updated to use tests/integration/ paths.

Changes

  • Database Schema (docs/examples/jaffle_shop_schema.sql): New DuckDB schema creating eight tables with primary keys and FK relationships used by examples and tests.
  • Data generation & DuckDB ingestion (docs/examples/jaffle_shop_duckdb.py): New CLI-style script that generates Jaffle Shop CSVs via jafgen, creates the schema, loads CSVs into DuckDB (converting cent columns to dollars), and runs verification queries; exposes helper functions used by tests.
  • SLayer ingestion example (docs/examples/ingest.py): New script that ensures a DuckDB DB exists, configures YAMLStorage, persists a DatasourceConfig, runs ingest_datasource, sets default time dimensions for specific models, saves models, and prints discovery reports.
  • Notebook example (docs/examples/jaffle_shop_ingest.ipynb): New Jupyter notebook demonstrating the full pipeline: data generation, DuckDB loading, SLayer ingestion into YAML storage, model inspection, and example semantic-layer queries.
  • Integration tests (tests/integration/test_jaffle_shop_duckdb.py, tests/integration/test_jaffle_shop_notebook.py, tests/integration/test_integration.py): New integration tests for the DuckDB example and notebook execution; a fixture to build the test DB; assertions for row counts, FK integrity, and join/aggregate sanity checks; docstring path update in test_integration.py.
  • Docs & developer guidance (CLAUDE.md, docs/development.md): Updated docs to reference the new tests/integration/ layout and revised integration test commands; removed the old package-structure section.
  • Project config / deps (pyproject.toml): Added optional jafgen (git dependency) and a new examples extra (jafgen, duckdb, duckdb-engine); added jafgen to the all extra; added nbclient, nbformat, and ipykernel to dev deps.

Sequence Diagram(s)

sequenceDiagram
    participant Gen as Data Generator (jafgen)
    participant Temp as Temp CSV Dir
    participant DB as DuckDB
    participant YAML as YAMLStorage
    participant SLayer as SLayer Ingest
    participant User as User/Console

    Gen->>Temp: generate CSVs
    Temp->>DB: load schema & ingest CSVs (convert cents→dollars)
    DB->>DB: run verification queries
    DB->>YAML: provide connection / DatasourceConfig
    YAML->>SLayer: persist datasource config
    SLayer->>DB: ingest datasource, discover models
    SLayer->>YAML: persist discovered models (apply time dims)
    SLayer->>User: print discovery report

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • AivanF

Poem

🐰 I hopped through CSV rows, crunchy and neat,
DuckDB warmed the data into a treat.
YAML shelves hold models tidy and bright,
I nudged time dims under the moonlit byte.
Jaffle Shop sings — carrots, queries, delight!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 38.89%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check — ✅ Passed: check skipped; CodeRabbit's high-level summary is enabled.
  • Title check — ✅ Passed: the title references the main objective (Jaffle Shop to DuckDB ingestion notebook) but is somewhat verbose and includes an internal reference (Egor/dev 1212) that lacks clarity for external audiences.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@AivanF
Collaborator

AivanF commented Apr 6, 2026

Let's move all the integration tests into a separate folder

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (1)
docs/examples/jaffle_shop_schema.sql (1)

13-13: Prefer fixed-point types for currency columns.

Using DOUBLE for monetary values can introduce rounding drift in sums and comparisons.

💡 Proposed fix
-    tax_rate DOUBLE NOT NULL
+    tax_rate DECIMAL(6,4) NOT NULL

-    price DOUBLE NOT NULL,
+    price DECIMAL(12,2) NOT NULL,

-    subtotal DOUBLE NOT NULL,
-    tax_paid DOUBLE NOT NULL,
-    order_total DOUBLE NOT NULL
+    subtotal DECIMAL(12,2) NOT NULL,
+    tax_paid DECIMAL(12,2) NOT NULL,
+    order_total DECIMAL(12,2) NOT NULL

-    cost DOUBLE NOT NULL,
+    cost DECIMAL(12,2) NOT NULL,

Also applies to: 20-20, 29-31, 44-44

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/examples/jaffle_shop_schema.sql` at line 13, Replace floating-point
DOUBLE types used for monetary values with a fixed-point type: change the
tax_rate column definition (and any other currency-related columns flagged
elsewhere) from DOUBLE to a DECIMAL/NUMERIC with appropriate precision and scale
(for example DECIMAL(10,4) or NUMERIC(10,4)) to avoid rounding drift; update the
CREATE TABLE column definitions (e.g., the tax_rate declaration and the other
flagged DOUBLE columns) to use the chosen DECIMAL/NUMERIC type and ensure any
INSERTs/SELECTs or schema docs reflect the new type.
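The rounding drift the reviewer warns about is easy to demonstrate outside DuckDB. A minimal Python sketch (stdlib only; this is an illustration of the general float-vs-fixed-point issue, not code from the PR):

```python
from decimal import Decimal

# Summing a repeated $0.10 charge 1000 times: binary floats accumulate
# representation error, while fixed-point Decimal (the analogue of
# DuckDB's DECIMAL(12,2)) stays exact.
float_total = sum(0.10 for _ in range(1000))
decimal_total = sum(Decimal("0.10") for _ in range(1000))

print(float_total)                         # close to, but not exactly, 100.0
print(decimal_total)                       # exactly 100.00
print(float_total == 100.0)                # False with IEEE-754 doubles
print(decimal_total == Decimal("100.00"))  # True
```

The same drift appears in SQL SUM() over DOUBLE columns, which is why DECIMAL with explicit precision and scale is the safer choice for order_total-style fields.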
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/examples/ingest.py`:
- Around line 41-46: The calls to generate_data, create_schema, and load_data
use positional arguments for functions that accept multiple parameters; update
these invocations to use keyword arguments so parameters are explicit and follow
the guideline — e.g., call generate_data(work_dir=WORK_DIR, years=years),
create_schema(conn=conn, schema_file=SCHEMA_FILE), and load_data(conn=conn,
data_dir=data_dir) (referencing the generate_data, create_schema, and load_data
symbols to locate the calls).
- Around line 44-47: The DuckDB connection opened via duckdb.connect (assigned
to conn) can leak if create_schema or load_data throws; wrap the calls that use
conn in a try/finally (or use a context manager) so that conn.close() is always
executed: ensure conn is closed in the finally block after calling
create_schema(conn, SCHEMA_FILE) and load_data(conn, data_dir), or use with
duckdb.connect(DB_PATH) as conn to guarantee cleanup.

In `@docs/examples/jaffle_shop_duckdb.py`:
- Around line 41-44: Calls to functions with multiple parameters should use
keyword arguments: replace the positional call subprocess.run(cmd,
cwd=output_dir, check=True) with subprocess.run(args=cmd, cwd=output_dir,
check=True) and update the generate_data(...) and load_data(...) invocations to
pass named parameters (e.g., years=..., days=..., output_dir=...) instead of
relying on positional ordering so each argument is explicit (refer to
subprocess.run, generate_data, and load_data in the diff).
- Around line 139-171: Wrap the sequence that uses the DuckDB connection (conn)
— the calls to create_schema(conn), load_data(conn, data_dir), verify(conn) and
the subsequent printing — in a try/finally so conn.close() is always executed;
specifically open conn = duckdb.connect(db_path) then try: perform the prints,
create_schema, load_data, verify and printing of results, and finally: call
conn.close(); if you want to preserve original error behavior re-raise the
exception after cleaning up.

In `@pyproject.toml`:
- Around line 54-55: The dev dependency lists in pyproject.toml are missing the
notebook runtime packages required by tests; update the extras so that nbclient
and nbformat are included (add "nbclient" and "nbformat" to the appropriate dev
dependencies and to the "all" extra) so tests/test_jaffle_shop_notebook.py
(which calls pytest.importorskip) will run under poetry install -E all; locate
and modify the entries that define examples and all (the "examples = [...]" and
"all = [...]" lists) to include these two packages.

In `@tests/test_jaffle_shop_duckdb.py`:
- Line 16: The test import includes an unused symbol LOAD_ORDER which triggers
Ruff F401; remove LOAD_ORDER from the import list so the statement only imports
TABLE_NAMES, create_schema, generate_data, load_data, and verify (i.e., update
the import in tests/test_jaffle_shop_duckdb.py to drop LOAD_ORDER).
- Line 23: The fixture calls use positional arguments for multi-parameter
helpers (notably generate_data) — update those calls to pass arguments by name
instead of position (e.g., pass the tmpdir path as the parameter name for the
path argument and use years= and days= for generate_data), and do the same for
the other multi-parameter helper calls in this fixture (the two other helper
invocations referenced in the comment) so they follow the project rule "use
keyword arguments for functions with more than 1 parameter."

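The connection-cleanup and keyword-argument points from the prompts above combine into one pattern. A hedged sketch, using stdlib sqlite3 as a stand-in (duckdb.connect supports the same try/finally usage); create_schema and load_data here are simplified placeholders for the example's helpers, whose real signatures differ:

```python
import sqlite3  # stand-in for duckdb; duckdb connections close the same way


def create_schema(conn: sqlite3.Connection, schema_sql: str) -> None:
    """Apply schema DDL (placeholder for the example's create_schema)."""
    conn.executescript(schema_sql)


def load_data(conn: sqlite3.Connection, rows: list) -> None:
    """Load rows (placeholder for the example's CSV loader)."""
    conn.executemany("INSERT INTO orders (id) VALUES (?)", rows)


def build_database(db_path: str) -> int:
    conn = sqlite3.connect(db_path)
    try:
        # Keyword arguments make multi-parameter calls explicit,
        # per the project rule flagged in the review.
        create_schema(conn=conn, schema_sql="CREATE TABLE orders (id INTEGER)")
        load_data(conn=conn, rows=[(1,), (2,), (3,)])
        (count,) = conn.execute("SELECT COUNT(*) FROM orders").fetchone()
        return count
    finally:
        conn.close()  # runs even if schema creation or loading raises


print(build_database(":memory:"))  # 3
```

The try/finally guarantees the connection is released on any exception; with duckdb, `with duckdb.connect(db_path) as conn:` achieves the same cleanup more concisely.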

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1a945f3f-9c33-4650-a637-6b63919ff061

📥 Commits

Reviewing files that changed from the base of the PR and between 96eecb8 and 2df8612.

⛔ Files ignored due to path filters (1)
  • poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (7)
  • docs/examples/ingest.py
  • docs/examples/jaffle_shop_duckdb.py
  • docs/examples/jaffle_shop_ingest.ipynb
  • docs/examples/jaffle_shop_schema.sql
  • pyproject.toml
  • tests/test_jaffle_shop_duckdb.py
  • tests/test_jaffle_shop_notebook.py

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/integration/test_jaffle_shop_notebook.py`:
- Around line 23-29: The call to nbformat.read and the NotebookClient
constructor use positional args; change nbformat.read(f, as_version=4) to use
the fp keyword (nbformat.read(fp=f, as_version=4)) and change
nbclient.NotebookClient(nb, timeout=600, kernel_name="python3",
resources={"metadata": {"path": os.path.dirname(NOTEBOOK_PATH)}}) to use the nb
keyword for the notebook argument (nbclient.NotebookClient(nb=nb, timeout=600,
kernel_name="python3", resources={"metadata": {"path":
os.path.dirname(NOTEBOOK_PATH)}})) so both multi-parameter function calls use
explicit keyword arguments.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e52a02eb-1860-4ae9-aecd-cc80fd99fd35

📥 Commits

Reviewing files that changed from the base of the PR and between 2df8612 and af7418c.

📒 Files selected for processing (8)
  • CLAUDE.md
  • docs/development.md
  • tests/integration/__init__.py
  • tests/integration/test_integration.py
  • tests/integration/test_integration_duckdb.py
  • tests/integration/test_integration_postgres.py
  • tests/integration/test_jaffle_shop_duckdb.py
  • tests/integration/test_jaffle_shop_notebook.py
✅ Files skipped from review due to trivial changes (3)
  • tests/integration/test_integration.py
  • docs/development.md
  • CLAUDE.md

Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
tests/integration/test_jaffle_shop_notebook.py (1)

22-31: Consider using context managers for proper resource cleanup.

Two observations:

  1. Missing encoding (line 22): open() without explicit encoding may behave inconsistently across platforms. Use encoding="utf-8" for Jupyter notebooks.

  2. Kernel cleanup (lines 25-31): NotebookClient supports context manager protocol. If execute() raises an exception or the test fails, the kernel process may be left orphaned. The related test fixture in test_jaffle_shop_duckdb.py (context snippet 2) demonstrates proper cleanup with explicit teardown.

♻️ Proposed refactor using context managers
-    with open(NOTEBOOK_PATH) as f:
+    with open(NOTEBOOK_PATH, encoding="utf-8") as f:
         nb = nbformat.read(fp=f, as_version=4)
 
-    client = nbclient.NotebookClient(
+    with nbclient.NotebookClient(
         nb=nb,
         timeout=600,
         kernel_name="python3",
         resources={"metadata": {"path": os.path.dirname(NOTEBOOK_PATH)}},
-    )
-    client.execute()
+    ) as client:
+        client.execute()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/test_jaffle_shop_notebook.py` around lines 22 - 31, Open
the notebook file using open(NOTEBOOK_PATH, encoding="utf-8") and pass the file
handle to nbformat.read to avoid platform-dependent decoding, then instantiate
nbclient.NotebookClient using its context manager (e.g., with
nbclient.NotebookClient(nb=nb, timeout=600, kernel_name="python3",
resources={"metadata": {"path": os.path.dirname(NOTEBOOK_PATH)}}) as client:)
and call client.execute() inside that with-block so the kernel is always cleaned
up on success or exception; reference NOTEBOOK_PATH, nbformat.read,
nbclient.NotebookClient, and client.execute to locate where to apply these
changes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d2b47999-07de-4ca9-bc78-0bca7c9ac0fc

📥 Commits

Reviewing files that changed from the base of the PR and between 0571d73 and 7a19cb2.

⛔ Files ignored due to path filters (1)
  • poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (3)
  • docs/examples/jaffle_shop_duckdb.py
  • pyproject.toml
  • tests/integration/test_jaffle_shop_notebook.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • docs/examples/jaffle_shop_duckdb.py
  • pyproject.toml

…ev-1212-jaffle-shop-to-duckdb-ingestion-first-notebook
…ev-1212-jaffle-shop-to-duckdb-ingestion-first-notebook
…SQL and query-level dimension/measure references:

  1. Auto-convert multi-dot references in model SQL snippets
    Model filters, dimension sql, and measure sql fields are SQL snippets that must use __ alias syntax for multi-hop join paths. Multi-dot references (e.g., customers.regions.name) are now auto-converted to the correct form (customers__regions.name) with a warning, instead of silently producing invalid SQL.
  2. Reject __ in entity names
    Model, dimension, measure, and query names now reject double underscores, which are reserved for join path aliases in generated SQL.
  3. Resolve query filter columns through model dimensions
    Query-level filters now resolve column references through model dimensions and measures rather than passing them as raw SQL. Bare names are looked up on the source model; dotted names walk the join graph to find the target model and use the dimension's sql expression. This fixes multi-hop filters like "customers.regions.name == 'US'" which previously produced invalid SQL.
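The behavior described in items 1 and 2 can be illustrated with a small sketch. This is not the project's actual implementation — the regex and the names rewrite_multi_dot and validate_name are hypothetical — but it shows the documented rules: a multi-dot reference collapses its join path into a __ alias, and entity names containing __ are rejected:

```python
import re

# A multi-dot reference has at least three dot-separated segments:
# everything before the final column name is a join path.
MULTI_DOT = re.compile(r"\b([a-z_]+(?:\.[a-z_]+){2,})\b")


def rewrite_multi_dot(sql: str) -> str:
    """Collapse join paths: customers.regions.name -> customers__regions.name."""
    def repl(match: re.Match) -> str:
        *path, column = match.group(1).split(".")
        return "__".join(path) + "." + column
    return MULTI_DOT.sub(repl, sql)


def validate_name(name: str) -> str:
    """Reject names containing '__', which is reserved for join path aliases."""
    if "__" in name:
        raise ValueError(f"'__' is reserved for join path aliases: {name!r}")
    return name


print(rewrite_multi_dot("customers.regions.name = 'US'"))
# customers__regions.name = 'US'
print(rewrite_multi_dot("orders.total > 10"))  # two-part refs are untouched
```

Single-hop references like orders.total stay as-is; only paths of three or more segments are rewritten, matching the multi-hop filter fix described above.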
…ev-1212-jaffle-shop-to-duckdb-ingestion-first-notebook

# Conflicts:
#	poetry.lock
#	pyproject.toml
@AivanF AivanF self-requested a review April 8, 2026 11:20
@ZmeiGorynych ZmeiGorynych merged commit dcc9eb0 into main Apr 8, 2026
3 checks passed


2 participants