# Smart Data Quality Agent - Demo Notebook

## What This Notebook Does
This notebook demonstrates the complete Smart DQ Check workflow with enhanced null analysis:

1. **Schema Indexing** - Automatically discovers and indexes database tables. This step must be run the first time you use the notebook. No CDC (change data capture) logic is applied, because the task description did not define a requirement for it.
2. **Smart Analysis** - Uses AI to interpret your query and select the right tables to analyze
3. **Enhanced Report Generation** - Creates comprehensive data quality reports with detailed null analysis in multiple formats
4. **Fixed Null Analysis Issue** - The agent prompt was enhanced so that the detailed null values breakdown is always included
5. **Complete Data Coverage** - Reports now show per-column null counts and percentages
6. **Multi-Format Output** - All formats (Markdown, HTML, JSON) contain the complete null analysis details

## How to Use

### **For Comprehensive Analysis:**
1. **First-Time Setup**: Run the indexing cell to build the schema index and discover your database tables
2. **Run Analysis**: The following cell prompts for your query. Try examples such as:
   - "Generate comprehensive DQ report for prod customers table"
   - "Create data quality report for production database invoices"
3. **Get Reports**: The system generates detailed reports and saves them to `../reports/` in 3 formats (.html, .json, .md)

### **For Single Quality Check Analysis:**
1. **Run Schema Indexing** (same as above - required for all analysis types)
2. **Choose Single Check Type**: Run the specific cells for focused analysis:
   - **Duplicate-Only**: Run duplicate analysis cell for just duplicate detection
   - **Null-Only**: Run null analysis cell for just missing values detection
   - **Statistics-Only**: Run the statistics analysis cell, or use the query "Check statistics for [table]" in the comprehensive cell
3. **Faster Execution**: Single checks run much faster than comprehensive analysis
4. **Focused Reports**: Each single check creates a targeted report file (e.g., `duplicate_only_analysis.md`)

### **When to Use Each Approach:**
- **Comprehensive**: When you need a complete data quality overview with all checks, in all report formats
- **Single Check**: When you are investigating a specific issue or need a quick, focused analysis

### **Reports saved under the reports folder include:**
- **Duplicate Record Analysis** - Complete duplicate detection and counts
- **Detailed Null Values Assessment** - Per-column breakdown with counts and percentages  
- **Statistical Summaries** - Comprehensive data profiling with descriptive stats
- **Actionable Recommendations** - Data quality improvement suggestions
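
As a rough illustration of what the duplicate and null sections report, the sketch below computes a duplicate-row count and a per-column null breakdown over a small in-memory table. It is plain Python and not the project's implementation: the helper names `null_breakdown` and `count_duplicate_rows` and the sample rows are hypothetical, and the notebook's actual checks run through the agent tools shown later.

```python
from collections import Counter


def null_breakdown(rows, columns):
    """Return {column: (null_count, null_percentage)} for a list of dict rows."""
    total = len(rows)
    result = {}
    for col in columns:
        # A value counts as null when it is missing from the row or is None
        nulls = sum(1 for row in rows if row.get(col) is None)
        pct = round(100.0 * nulls / total, 2) if total else 0.0
        result[col] = (nulls, pct)
    return result


def count_duplicate_rows(rows, columns):
    """Count rows that repeat an earlier row across the given columns."""
    seen = Counter(tuple(row.get(col) for col in columns) for row in rows)
    return sum(n - 1 for n in seen.values() if n > 1)


# Hypothetical sample data, for illustration only
sample_rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 2, "email": None},
    {"id": 3, "email": "c@example.com"},
]
columns = ["id", "email"]

print(null_breakdown(sample_rows, columns))
print(count_duplicate_rows(sample_rows, columns))
```

Here a "duplicate" means a fully identical row across the listed columns; a check keyed on a subset of columns would pass only those column names.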

**Just run all cells in order and follow the prompts**

In [1]:
# Standard library
import os
import json
import re
import warnings
from datetime import datetime

# Third-party
import yaml
from dotenv import load_dotenv

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

# Project modules (duplicate imports removed)
from src.retrieval.schema_indexer import SchemaIndexer
from src.agent.smart_planner import run_smart_dq_check
import src.reporting.report_templates
import src.data_quality.checks
import src.reporting.report_generator
import src.agent.reporting_tools
import src.reporting.report_processor
from src.reporting.report_processor import SmartDQReportProcessor

## Schema Indexing 

Run this cell the first time you use the notebook.
You do not need to rerun it unless the target schema or database changes. (An incremental approach was not implemented, because the task did not define it as a requirement.)

In [2]:
# Load settings.yaml to read connector types dynamically
with open('../config/settings.yaml', 'r') as f:
    settings = yaml.safe_load(f)

# Get connector types from settings.yaml
connector_types = list(settings['connectors'].keys())

print(f" BUILDING SCHEMA INDEX FROM SCRATCH")

# First connector: recreate=True
# Subsequent connectors: recreate=False to preserve other connector data
for i, ct in enumerate(connector_types):
    print(f"\n Rebuilding index for {ct.upper()}")

    # First connector recreates the entire index, others append to it
    recreate_index = (i == 0)  # Only recreate on first connector

    if recreate_index:
        print(f"  Starting fresh - recreating entire index")
    else:
        print(f"  Adding to existing index")

    indexer = SchemaIndexer(connector_type=ct)
    indexer.build_schema_index(recreate=recreate_index)
    print(f"   Index {'created' if recreate_index else 'updated'} for {ct}")

print(f"\n SCHEMA INDEX REBUILD COMPLETE!")

 BUILDING SCHEMA INDEX FROM SCRATCH

 Rebuilding index for SNOWFLAKE
  Starting fresh - recreating entire index
 Connecting to Snowflake...
✓ Connected to Snowflake: PROD_SALES.PUBLIC
✓ Disconnected from Snowflake
Auto-discovered 4 schemas on Snowflake: DATA_MART_FINANCE, DATA_MART_MANUFACTURING, DATA_MART_MARKETING, PUBLIC
[1/4] INDEXING SCHEMA: DATA_MART_FINANCE

Generating metadata documents for 2 tables...
  [1/2] Processing INCOME_BAND...
  [2/2] Processing INVOICES...

Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given

 Deleted existing collection: database_schemas

Indexing 2 table metadata documents...
Indexed 2 tables from schema DATA_MART_FINANCE
[2/4] INDEXING SCHEMA: DATA_MART_MANUFACTURING

Generating metadata documents for 1 tables...
  [1/1] Processing PRODUCTS...

Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given

Indexing 1 table metadata documents...
Indexed 1 tables from schema DATA_MART_MANUFACTURING
[3/4] INDEXING SCHEMA: DATA_MART_MARKETING

Generating metadata documents for 3 tables...
  [1/3] Processing CALL_CENTER...
  [2/3] Processing CUSTOMERS...
  [3/3] Processing SALES...

Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given

Indexing 3 table metadata documents...
Indexed 3 tables from schema DATA_MART_MARKETING
[4/4] INDEXING SCHEMA: PUBLIC

Generating metadata documents for 0 tables...
No tables found in schema PUBLIC
 SCHEMA INDEX BUILT SUCCESSFULLY
  - Collection: database_schemas
  - Schemas indexed: 4
  - Total tables indexed: 6
   Index created for snowflake

 Rebuilding index for POSTGRES
  Adding to existing index

 Connecting to postgres...
✓ Connected to PostgreSQL: stage_sales
✓ Disconnected from PostgreSQL
Auto-discovered 1 schemas on postgres: public
[1/1] INDEXING SCHEMA: public
Discovering tables in stage_sales.public...
✓ Found 3 tables/views

Generating metadata documents for

Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given



Indexing 3 table metadata documents...
Indexed 3 tables from schema public
 SCHEMA INDEX BUILT SUCCESSFULLY
  - Collection: database_schemas
  - Schemas indexed: 1
  - Total tables indexed: 3
   Index updated for postgres

 SCHEMA INDEX REBUILD COMPLETE!


## Enhanced Comprehensive Report Creation

Generate comprehensive data quality reports with detailed duplicate counts, null analysis, and descriptive statistics.

In [None]:
# Loop over example queries, including two that are expected to fail

queries = [
    "Analyze staging analytics data quality",  # No matching table: demonstrates error handling
    "Create comprehensive report of UNKNOWN table data quality for staging database",  # Nonexistent table name: tests error handling
    "Generate comprehensive DQ report for staging sales table in postgres",
    "Check prod customers table for missing values",
    "Find duplicates in production invoice table snowflake",
]

# Use a named loop index: `_` conventionally marks a value that is never read
for i, query in enumerate(queries):
    print(f"\n Running DQ check for query: {query}")
    report = run_smart_dq_check(query)
    print(f"Report output: {report.get('output', 'No output found')}")
    processor = SmartDQReportProcessor(reports_dir="../reports")
    processor.process_comprehensive_report(report)
    print(f"{i}, Report processing complete.")


 Running DQ check for query: Analyze staging analytics data quality

SMART DATA QUALITY CHECK
User Query: Analyze staging analytics data quality

Step 1: Discovering relevant tables...
----------------------------------------------------------------------


Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event CollectionGetEvent: capture() takes 1 positional argument but 3 were given
Number of requested results 20 is greater than number of elements in index 9, updating n_results = 9
Failed to send telemetry event CollectionQueryEvent: capture() takes 1 positional argument but 3 were given

Report output: No relevant tables found. Please ensure the schema index is built. Run: python src/retrieval/schema_indexer.py
Smart DQ check failed due to low relevance scores.
Skipping report generation - no valid table found for analysis.
Try using a more specific table name in your query.
0, Report processing complete.

 Running DQ check for query: Create comprehensive report of UNKNOWN table data quality for staging database

SMART DATA QUALITY CHECK
User Query: Create comprehensive report of UNKNOWN table data quality for staging database

Step 1: Discovering relevant tables...
----------------------------------------------------------------------


Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Number of requested results 20 is greater than number of elements in index 9, updating n_results = 9


Report output: No relevant tables found. Please ensure the schema index is built. Run: python src/retrieval/schema_indexer.py
Smart DQ check failed due to low relevance scores.
Skipping report generation - no valid table found for analysis.
Try using a more specific table name in your query.
1, Report processing complete.

 Running DQ check for query: Generate comprehensive DQ report for staging sales table in postgres

SMART DATA QUALITY CHECK
User Query: Generate comprehensive DQ report for staging sales table in postgres

Step 1: Discovering relevant tables...
----------------------------------------------------------------------


Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Number of requested results 20 is greater than number of elements in index 9, updating n_results = 9

Found 3 relevant table(s) (≥5% relevance):

  1. [POSTGRES] stage_sales.public.products
     Relevance: 16.67% ✓

  2. [SNOWFLAKE] PROD_SALES.DATA_MART_MARKETING.SALES
     Relevance: 12.63% ✓

  3. [POSTGRES] stage_sales.public.invoices
     Relevance: 10.19% ✓

Step 2: Building context for agent...
----------------------------------------------------------------------

Step 3: Running agent with discovered tables...
----------------------------------------------------------------------

> Entering new AgentExecutor chain...

Invoking: `generate_comprehensive_dq_report` with `{'dataset_id': 'stage_sales.public.products', 'connector_type': 'postgres', 'output_format': 'markdown'}`

Running duplicates check...
--- Loading data for: stage_sales.public.products using POSTGRES connector ---
✓ Connected to PostgreSQL: stage_sales
Executing query: SELECT * FROM stage_sales.public.products

Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Number of requested results 20 is greater than number of elements in index 9, updating n_results = 9

Found 3 relevant table(s) (≥5% relevance):

  1. [SNOWFLAKE] PROD_SALES.DATA_MART_MARKETING.CUSTOMERS
     Relevance: 34.92% ✓

  2. [SNOWFLAKE] PROD_SALES.DATA_MART_MANUFACTURING.PRODUCTS
     Relevance: 12.31% ✓

  3. [SNOWFLAKE] PROD_SALES.DATA_MART_MARKETING.CALL_CENTER
     Relevance: 5.50% ✓

Step 2: Building context for agent...
----------------------------------------------------------------------

Step 3: Running agent with discovered tables...
----------------------------------------------------------------------

> Entering new AgentExecutor chain...

Invoking: `check_dataset_null_values` with `{'dataset_id': 'PROD_SALES.DATA_MART_MARKETING.CUSTOMERS', 'connector_type': 'snowflake'}`

--- Loading data for: PROD_SALES.DATA_MART_MARKETING.CUSTOMERS using SNOWFLAKE connector ---

Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Number of requested results 20 is greater than number of elements in index 9, updating n_results = 9

Found 3 relevant table(s) (≥5% relevance):

  1. [SNOWFLAKE] PROD_SALES.DATA_MART_FINANCE.INVOICES
     Relevance: 56.88% ✓

  2. [SNOWFLAKE] PROD_SALES.DATA_MART_MARKETING.CALL_CENTER
     Relevance: 22.37% ✓

  3. [SNOWFLAKE] PROD_SALES.DATA_MART_MARKETING.SALES
     Relevance: 13.12% ✓

Step 2: Building context for agent...
----------------------------------------------------------------------

Step 3: Running agent with discovered tables...
----------------------------------------------------------------------

> Entering new AgentExecutor chain...

Invoking: `check_dataset_duplicates` with `{'dataset_id': 'PROD_SALES.DATA_MART_FINANCE.INVOICES', 'connector_type': 'snowflake'}`

--- Loading data for: PROD_SALES.DATA_MART_FINANCE.INVOICES using SNOWFLAKE connector ---
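
The discovery step logged above keeps only tables at or above a 5% relevance score, and the first two queries fail because no table clears that bar. The sketch below shows one way such a cutoff could work. It is an assumption-laden illustration, not the logic inside `run_smart_dq_check`: the function name `filter_relevant`, the tuple layout, and the 2.10% entry are hypothetical, while the other three scores are copied from the log.

```python
def filter_relevant(candidates, min_relevance=0.05):
    """Keep (table, score) pairs at or above the threshold, best score first."""
    kept = [c for c in candidates if c[1] >= min_relevance]
    return sorted(kept, key=lambda c: c[1], reverse=True)


candidates = [
    ("PROD_SALES.DATA_MART_MARKETING.CALL_CENTER", 0.0550),
    ("PROD_SALES.DATA_MART_MARKETING.CUSTOMERS", 0.3492),
    ("stage_sales.public.invoices", 0.0210),  # hypothetical below-threshold score
    ("PROD_SALES.DATA_MART_MANUFACTURING.PRODUCTS", 0.1231),
]

relevant = filter_relevant(candidates)
if not relevant:
    # Mirrors the failure path seen for the first two queries
    print("No relevant tables found.")
for rank, (table, score) in enumerate(relevant, start=1):
    print(f"{rank}. {table}  Relevance: {score:.2%}")
```

A low cutoff like this favors recall, which is why loosely related tables such as CALL_CENTER still appear in the candidate list; the agent then picks the table to analyze from that list.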

Enter a query (or accept the default) to run the check and create a report

In [4]:
# Get user input for the query with a default option
default_query = "Generate comprehensive DQ report for prod customers table"
user_query = input("Enter your DQ check query (or press Enter to use default): ").strip() or default_query
print(f"Using query: {user_query}")

comprehensive_report = run_smart_dq_check(user_query)

Using query: Generate comprehensive DQ report for prod customers table

SMART DATA QUALITY CHECK
User Query: Generate comprehensive DQ report for prod customers table

Step 1: Discovering relevant tables...
----------------------------------------------------------------------


Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Number of requested results 20 is greater than number of elements in index 9, updating n_results = 9

Found 3 relevant table(s) (≥5% relevance):

  1. [SNOWFLAKE] PROD_SALES.DATA_MART_MARKETING.CUSTOMERS
     Relevance: 37.23% ✓

  2. [SNOWFLAKE] PROD_SALES.DATA_MART_MANUFACTURING.PRODUCTS
     Relevance: 36.49% ✓

  3. [SNOWFLAKE] PROD_SALES.DATA_MART_MARKETING.SALES
     Relevance: 23.34% ✓

Step 2: Building context for agent...
----------------------------------------------------------------------

Step 3: Running agent with discovered tables...
----------------------------------------------------------------------

> Entering new AgentExecutor chain...

Invoking: `generate_comprehensive_dq_report` with `{'dataset_id': 'PROD_SALES.DATA_MART_MARKETING.CUSTOMERS', 'connector_type': 'snowflake', 'output_format': 'markdown'}`

Running duplicates check...
--- Loading data for: PROD_SALES.DATA_MART_MARKETING.CUSTOMERS using SNOWFLAKE connector ---

Save the comprehensive report

In [5]:
# Initialize the report processor
processor = SmartDQReportProcessor(reports_dir="../reports")

# Process the comprehensive report (handles validation, extraction, checks, and file generation)
result = processor.process_comprehensive_report(comprehensive_report)

Smart DQ check successful - proceeding with report generation...
Extracted info:
   Dataset: PROD_SALES.DATA_MART_MARKETING.CUSTOMERS
   Connector: snowflake

Running comprehensive DQ assessment with reliable data sources...
   Executing duplicates check...
--- Loading data for: PROD_SALES.DATA_MART_MARKETING.CUSTOMERS using SNOWFLAKE connector ---
✓ Connected to Snowflake: PROD_SALES.PUBLIC
Executing query: SELECT * FROM PROD_SALES.DATA_MART_MARKETING.CUSTOMERS
✓ Loaded 37811 rows from Snowflake
✓ Disconnected from Snowflake
   Executing null_values check...
--- Loading data for: PROD_SALES.DATA_MART_MARKETING.CUSTOMERS using SNOWFLAKE connector ---
✓ Connected to Snow

## Single-Quality Check Report Creation

Run a single check and save the results to a focused report file. Use this when individual quality checks are needed and should be saved separately.
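
The three single-check cells that follow repeat the same pattern: build a small Markdown document from the agent output, then write it to the reports folder. A minimal sketch of a shared helper is shown here, assuming the result is a dict with an `output` key as in the cells. The name `save_single_check_report` is hypothetical and not part of the project, and the usage example writes to a temporary directory with placeholder output rather than calling the agent.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def save_single_check_report(title, query, result, path):
    """Write a single-check result to a Markdown file and return its path."""
    # Fall back to a placeholder instead of raising KeyError when 'output' is absent
    output = result.get("output", "No output found")
    content = (
        f"# {title}\n"
        f"Generated: {datetime.now():%Y-%m-%d %H:%M:%S}\n"
        f"Query: {query}\n\n"
        f"## Results:\n{output}\n"
    )
    report_path = Path(path)
    # Create the target folder if it does not exist yet
    report_path.parent.mkdir(parents=True, exist_ok=True)
    report_path.write_text(content, encoding="utf-8")
    return report_path


# Usage with placeholder output (no agent call), written to a temporary folder
with tempfile.TemporaryDirectory() as tmp_dir:
    saved = save_single_check_report(
        "Duplicate Analysis Report",
        "Check for duplicates in prod customers table",
        {"output": "placeholder agent output"},
        Path(tmp_dir) / "reports" / "duplicate_only_analysis.md",
    )
    print(saved.read_text(encoding="utf-8"))
```

In the notebook, the `result` argument would be the value returned by `run_smart_dq_check` and the path would point under `../reports/`.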

In [6]:
# For DUPLICATE-ONLY analysis - use agent with specific query
duplicate_query = "Check for duplicates in prod customers table"
print(f"Running duplicate-only analysis: {duplicate_query}")

duplicate_result = run_smart_dq_check(duplicate_query)

# Save to a simple duplicate-focused file
duplicate_report_content = f"""# Duplicate Analysis Report
Generated: {datetime.now()}
Query: {duplicate_query}

## Results:
{duplicate_result['output']}
"""

# Save to reports folder
duplicate_file_path = "../reports/duplicate_only_analysis.md"
with open(duplicate_file_path, "w") as f:
    f.write(duplicate_report_content)

Running duplicate-only analysis: Check for duplicates in prod customers table

SMART DATA QUALITY CHECK
User Query: Check for duplicates in prod customers table

Step 1: Discovering relevant tables...
----------------------------------------------------------------------


Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Number of requested results 20 is greater than number of elements in index 9, updating n_results = 9

Found 2 relevant table(s) (≥5% relevance):

  1. [SNOWFLAKE] PROD_SALES.DATA_MART_MARKETING.CUSTOMERS
     Relevance: 34.89% ✓

  2. [SNOWFLAKE] PROD_SALES.DATA_MART_MANUFACTURING.PRODUCTS
     Relevance: 6.91% ✓

Step 2: Building context for agent...
----------------------------------------------------------------------

Step 3: Running agent with discovered tables...
----------------------------------------------------------------------

> Entering new AgentExecutor chain...

Invoking: `check_dataset_duplicates` with `{'dataset_id': 'PROD_SALES.DATA_MART_MARKETING.CUSTOMERS', 'connector_type': 'snowflake'}`

--- Loading data for: PROD_SALES.DATA_MART_MARKETING.CUSTOMERS using SNOWFLAKE connector ---

In [7]:
# For NULL-ONLY analysis - use agent with specific query
null_query = "Check for null values in prod customers table"
print(f"Running null-only analysis: {null_query}")

null_result = run_smart_dq_check(null_query)

# Save to a simple null-focused file
null_report_content = f"""# Null Analysis Report
Generated: {datetime.now()}
Query: {null_query}

## Results:
{null_result['output']}
"""

# Save to reports folder
null_file_path = "../reports/null_only_analysis.md"
with open(null_file_path, "w") as f:
    f.write(null_report_content)

Running null-only analysis: Check for null values in prod customers table

SMART DATA QUALITY CHECK
User Query: Check for null values in prod customers table

Step 1: Discovering relevant tables...
----------------------------------------------------------------------


Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Number of requested results 20 is greater than number of elements in index 9, updating n_results = 9

Found 1 relevant table(s) (≥5% relevance):

  1. [SNOWFLAKE] PROD_SALES.DATA_MART_MARKETING.CUSTOMERS
     Relevance: 28.18% ✓

Step 2: Building context for agent...
----------------------------------------------------------------------

Step 3: Running agent with discovered tables...
----------------------------------------------------------------------

> Entering new AgentExecutor chain...

Invoking: `check_dataset_null_values` with `{'dataset_id': 'PROD_SALES.DATA_MART_MARKETING.CUSTOMERS', 'connector_type': 'snowflake'}`

--- Loading data for: PROD_SALES.DATA_MART_MARKETING.CUSTOMERS using SNOWFLAKE connector ---
✓ Connected to Snowflake: PROD_SALES.PUBLIC


In [8]:
# For DESCRIPTIVE STATISTICS-ONLY analysis - use agent with specific query
stats_query = "Check descriptive statistics for prod customers table"
print(f"Running statistics-only analysis: {stats_query}")

stats_result = run_smart_dq_check(stats_query)

# Save to a simple statistics-focused file
stats_report_content = f"""# Descriptive Statistics Analysis Report
Generated: {datetime.now()}
Query: {stats_query}

## Results:
{stats_result['output']}
"""

# Save to reports folder
stats_file_path = "../reports/stats_only_analysis.md"
with open(stats_file_path, "w") as f:
    f.write(stats_report_content)

Running statistics-only analysis: Check descriptive statistics for prod customers table

SMART DATA QUALITY CHECK
User Query: Check descriptive statistics for prod customers table

Step 1: Discovering relevant tables...
----------------------------------------------------------------------


Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Number of requested results 20 is greater than number of elements in index 9, updating n_results = 9

Found 3 relevant table(s) (≥5% relevance):

  1. [SNOWFLAKE] PROD_SALES.DATA_MART_MARKETING.CUSTOMERS
     Relevance: 40.41% ✓

  2. [SNOWFLAKE] PROD_SALES.DATA_MART_MANUFACTURING.PRODUCTS
     Relevance: 31.84% ✓

  3. [SNOWFLAKE] PROD_SALES.DATA_MART_MARKETING.CALL_CENTER
     Relevance: 22.27% ✓

Step 2: Building context for agent...
----------------------------------------------------------------------

Step 3: Running agent with discovered tables...
----------------------------------------------------------------------

> Entering new AgentExecutor chain...

Invoking: `check_dataset_descriptive_stats` with `{'dataset_id': 'PROD_SALES.DATA_MART_MARKETING.CUSTOMERS', 'connector_type': 'snowflake'}`

--- Loading data for: PROD_SALES.DATA_MART_MARKETING.CUSTOMERS using SNOWFLAKE connector ---
✓ Connected to Snowflake: PROD_SALES.PUBLIC
Executing query: SELECT * FROM PROD_SALES.DATA_MART_MARKETING.CUSTOMERS
✓ Loaded 37811 rows from Snowflake
✓