# Setup: Build Schema Index

First, index the table schemas from both databases (Snowflake and PostgreSQL) into a shared collection so the agent can discover the relevant tables.

In [2]:
import os

from dotenv import load_dotenv
from src.retrieval.schema_indexer import SchemaIndexer

# Assumption: database credentials live in a local .env file
load_dotenv()

# Index Snowflake first (recreate the collection)
snowflake_indexer = SchemaIndexer(connector_type='snowflake')
snowflake_indexer.build_schema_index(recreate=True)

# Index PostgreSQL second (append to collection)
postgres_indexer = SchemaIndexer(connector_type='postgres')
postgres_indexer.build_schema_index(recreate=False)

✓ Connected to Snowflake: PROD_SALES.PUBLIC
✓ Disconnected from Snowflake
Auto-discovered 3 schemas: DATA_MART_MANUFACTURING, DATA_MART_MARKETING, PUBLIC

[1/3] INDEXING SCHEMA: DATA_MART_MANUFACTURING
✓ Connected to Snowflake: PROD_SALES.PUBLIC
Discovering tables in PROD_SALES.DATA_MART_MANUFACTURING...
✓ Found 1 tables/views
✓ Disconnected from Snowflake

Generating metadata documents for 1 tables...
  [1/1] Processing PRODUCTS...
✓ Connected to Snowflake: PROD_SALES.PUBLIC
Discovering tables in PROD_SALES.DATA_MART_MANUFACTURING...

Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given


✓ Disconnected from Snowflake
✓ Generated 1 metadata documents
✓ Deleted existing collection: database_schemas

Indexing 1 table metadata documents...
✓ Indexed 1 tables from schema DATA_MART_MANUFACTURING

[2/3] INDEXING SCHEMA: DATA_MART_MARKETING
✓ Connected to Snowflake: PROD_SALES.PUBLIC
Discovering tables in PROD_SALES.DATA_MART_MARKETING...
✓ Found 1 tables/views
✓ Disconnected from Snowflake

Generating metadata documents for 1 tables...
  [1/1] Processing CUSTOMERS...
✓ Connected to Snowflake: PROD_SALES.PUBLIC
Discovering tables in PROD_SALES.DATA_MART_MARKETING...

✓ Disconnected from Snowflake
✓ Generated 1 metadata documents

Indexing 1 table metadata documents...
✓ Indexed 1 tables from schema DATA_MART_MARKETING

[3/3] INDEXING SCHEMA: PUBLIC
✓ Connected to Snowflake: PROD_SALES.PUBLIC
Discovering tables in PROD_SALES.PUBLIC...
✓ Found 0 tables/views
✓ Disconnected from Snowflake

Generating metadata documents for 0 tables...
✓ Generated 0 metadata documents
No tables found in schema PUBLIC

✓ SCHEMA INDEX BUILT SUCCESSFULLY
  - Collection: database_schemas
  - Schemas indexed: 3
  - Total tables indexed: 2
  - Path: ./chroma_db

✓ Connected to PostgreSQL: stage_sales
✓ Disconnected from PostgreSQL
✓ Generated 2 metadata documents

Indexing 2 table metadata documents...
✓ Indexed 2 tables from schema public

✓ SCHEMA INDEX BUILT SUCCESSFULLY
  - Collection: database_schemas
  - Schemas indexed: 1
  - Total tables indexed: 2
  - Path: ./chroma_db
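
The `recreate=True` / `recreate=False` pattern used above can be illustrated with a small in-memory stand-in. This is only a sketch: `InMemorySchemaIndex` and its `build` method are hypothetical, not the actual `SchemaIndexer` or vector-store API, and it assumes that `recreate=True` drops the collection before writing while `recreate=False` appends to it.

```python
from dataclasses import dataclass, field


@dataclass
class InMemorySchemaIndex:
    """Hypothetical stand-in for a persistent schema collection."""
    collections: dict = field(default_factory=dict)

    def build(self, name, documents, recreate=False):
        # recreate=True drops any existing collection first;
        # recreate=False keeps existing entries and adds to them.
        if recreate or name not in self.collections:
            self.collections[name] = {}
        for doc_id, text in documents.items():
            self.collections[name][doc_id] = text
        return len(self.collections[name])


index = InMemorySchemaIndex()

# Placeholder document texts; only the table identifiers come from the logs.
snowflake_docs = {
    "PROD_SALES.DATA_MART_MANUFACTURING.PRODUCTS": "snowflake table PRODUCTS",
    "PROD_SALES.DATA_MART_MARKETING.CUSTOMERS": "snowflake table CUSTOMERS",
}
postgres_docs = {
    "stage_sales.public.products": "postgres table products",
}

index.build("database_schemas", snowflake_docs, recreate=True)
total = index.build("database_schemas", postgres_docs, recreate=False)
print(total)
```

Under these assumptions, running the second call with `recreate=True` would leave only the PostgreSQL entry in the collection, which is why the order of the two calls matters.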



In [7]:
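from src.agent.smart_planner import run_smart_dq_check

# Run the same duplicate check against each source; the planner
# selects the target table from the schema index built above.
result_1 = run_smart_dq_check("Check for duplicates in Snowflake PRODUCTS table")
result_2 = run_smart_dq_check("Check for duplicates in postgres PRODUCTS table")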


SMART DATA QUALITY CHECK
User Query: Check for duplicates in Snowflake PRODUCTS table

Step 1: Discovering relevant tables...
----------------------------------------------------------------------


Found 3 relevant table(s):

  1. [SNOWFLAKE] PROD_SALES.DATA_MART_MANUFACTURING.PRODUCTS
     Relevance: 4.48%

  2. [SNOWFLAKE] PROD_SALES.DATA_MART_MARKETING.CUSTOMERS
     Relevance: -14.95%

  3. [POSTGRES] stage_sales.public.products
     Relevance: -29.56%

Step 2: Building context for agent...
----------------------------------------------------------------------

Step 3: Running agent with discovered tables...
----------------------------------------------------------------------


> Entering new AgentExecutor chain...
Invoking: `check_dataset_duplicates` with `{'dataset_id': 'PROD_SALES.DATA_MART_MANUFACTURING.PRODUCTS', 'connector_type': 'snowflake'}`


--- Loading data for: PROD_SALES.DATA_MART_MANUFACTURING.PRODUCTS using SNOWFLAKE connector ---
✓ Connected to Snowflake: PROD_SALES.PUBLIC
Executing query: SELECT * FROM PROD_SALES.DATA_MART_MANUFACTURING.PRODUCTS
✓ Loaded 24599 rows from Snowflake
✓ Disconnected from Snowflake
[36;1m[1;3m{'datase


Found 3 relevant table(s):

  1. [POSTGRES] stage_sales.public.products
     Relevance: -26.71%

  2. [SNOWFLAKE] PROD_SALES.DATA_MART_MANUFACTURING.PRODUCTS
     Relevance: -32.31%

  3. [SNOWFLAKE] PROD_SALES.DATA_MART_MARKETING.CUSTOMERS
     Relevance: -40.17%

Step 2: Building context for agent...
----------------------------------------------------------------------

Step 3: Running agent with discovered tables...
----------------------------------------------------------------------


> Entering new AgentExecutor chain...
Invoking: `check_dataset_duplicates` with `{'dataset_id': 'stage_sales.public.products', 'connector_type': 'postgres'}`


--- Loading data for: stage_sales.public.products using POSTGRES connector ---
✓ Connected to PostgreSQL: stage_sales
Executing query: SELECT * FROM stage_sales.public.products
✓ Loaded 30000 rows from PostgreSQL
✓ Disconnected from PostgreSQL
[36;1m[1;3m{'dataset_id': 'stage_sales.public.products', 'duplicate_qty': 599, 'status': 'failure'}[0m[32;1m[1;3mThe duplicate check for the PostgreSQL table `stage_sales.public.products` has been completed. There are **599 duplicate rows** in the table.[0m

> Finished chain.

RESULT
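
Negative relevance values, as in the rankings above, would arise if relevance is derived from a vector distance as `(1 - distance) * 100`, which drops below zero whenever the distance exceeds 1. That formula is an assumption about how `run_smart_dq_check` scores results, not something the output confirms. The helper names are hypothetical and the distances below are back-computed purely for illustration.

```python
def relevance_percent(distance: float) -> float:
    """Convert a distance to a percentage, assuming relevance = (1 - distance) * 100."""
    return round((1.0 - distance) * 100.0, 2)


def rank_by_relevance(distances: dict) -> list:
    """Return (table, relevance) pairs sorted from most to least relevant."""
    scored = [(table, relevance_percent(d)) for table, d in distances.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Illustrative distances only, chosen to mirror the first ranking above.
example_distances = {
    "stage_sales.public.products": 1.2956,
    "PROD_SALES.DATA_MART_MANUFACTURING.PRODUCTS": 0.9552,
    "PROD_SALES.DATA_MART_MARKETING.CUSTOMERS": 1.1495,
}

ranking = rank_by_relevance(example_distances)
for table, relevance in ranking:
    print(f"{table}: {relevance:.2f}%")
```

If this is indeed how the score is computed, the sign carries no special meaning; only the ordering of the candidates matters when the agent picks a table.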



In [8]:
# Display both agent results
print("Agent Output:")
print(result_1['output'])
print(result_2['output'])

Agent Output:
The duplicate check for the Snowflake table `PROD_SALES.DATA_MART_MANUFACTURING.PRODUCTS` has been completed successfully. No duplicate rows were found in the dataset.
The duplicate check for the PostgreSQL table `stage_sales.public.products` has been completed. There are **599 duplicate rows** in the table.
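
The tool result shape seen in the trace (`dataset_id`, `duplicate_qty`, `status`) suggests a check along the following lines. This is a minimal sketch, not the actual `check_dataset_duplicates` implementation: it assumes a duplicate is any row that exactly repeats an earlier row, and that `status` is `'failure'` when at least one duplicate exists. The function name, the `'success'` label, and the sample rows are assumptions for illustration.

```python
from collections import Counter


def count_duplicate_rows(dataset_id, rows):
    """Count rows that exactly repeat an earlier row (hypothetical helper)."""
    counts = Counter(tuple(row) for row in rows)
    # Each distinct row is allowed once; every extra copy counts as a duplicate.
    duplicate_qty = sum(n - 1 for n in counts.values())
    return {
        "dataset_id": dataset_id,
        "duplicate_qty": duplicate_qty,
        # Assumption: any duplicate fails the check; 'success' is a guessed label.
        "status": "failure" if duplicate_qty else "success",
    }


sample_rows = [
    (1, "widget"),
    (2, "gadget"),
    (1, "widget"),  # repeats the first row
    (3, "gizmo"),
    (1, "widget"),  # repeats it again
]

result = count_duplicate_rows("example.products", sample_rows)
print(result)
```

With pandas, the same count is typically obtained from `df.duplicated().sum()`, which by default marks every occurrence of a row after its first.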
