## Demo notebook showcasing the entire process

In [None]:
import os
from dotenv import load_dotenv
import json
import re
import yaml
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')
from src.retrieval.schema_indexer import SchemaIndexer
from src.agent.smart_planner import run_smart_dq_check
import src.reporting.report_templates
import src.data_quality.checks
import src.reporting.report_generator
import src.agent.reporting_tools
from src.reporting.report_templates import ReportTemplates
from src.reporting.report_generator import DataQualityReportGenerator

### Indexing target schemas and tables based on sample data and table metadata

In [None]:
# Load settings.yaml to read connector types dynamically
with open('../config/settings.yaml', 'r') as f:
    settings = yaml.safe_load(f)

# Get connector types from settings.yaml
connector_types = list(settings['connectors'].keys())

for ct in connector_types:
    print(f"Building schema index for connector type: {ct}")
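    # recreate=True drops the existing collection and rebuilds it; recreate=False keeps it and adds to it.
    # Rebuild once for snowflake, then add the other connectors to the same collection.
    # On later runs, when the index already exists, recreate=False is the right choice for every connector.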
    if ct == 'snowflake':
        SchemaIndexer(connector_type=ct).build_schema_index(recreate=True)
    else:
        SchemaIndexer(connector_type=ct).build_schema_index(recreate=False)

Building schema index for connector type: snowflake
 Connecting to Snowflake...
✓ Connected to Snowflake: PROD_SALES.PUBLIC
✓ Disconnected from Snowflake
Auto-discovered 4 schemas on Snowflake: DATA_MART_FINANCE, DATA_MART_MANUFACTURING, DATA_MART_MARKETING, PUBLIC
[1/4] INDEXING SCHEMA: DATA_MART_FINANCE

Generating metadata documents for 1 tables...
  [1/1] Processing INVOICES...


Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given


 Deleted existing collection: database_schemas

Indexing 1 table metadata documents...
Indexed 1 tables from schema DATA_MART_FINANCE
[2/4] INDEXING SCHEMA: DATA_MART_MANUFACTURING

Generating metadata documents for 1 tables...
  [1/1] Processing PRODUCTS...


Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given



Indexing 1 table metadata documents...
Indexed 1 tables from schema DATA_MART_MANUFACTURING
[3/4] INDEXING SCHEMA: DATA_MART_MARKETING

Generating metadata documents for 1 tables...
  [1/1] Processing CUSTOMERS...


Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given



Indexing 1 table metadata documents...
Indexed 1 tables from schema DATA_MART_MARKETING
[4/4] INDEXING SCHEMA: PUBLIC

Generating metadata documents for 0 tables...
No tables found in schema PUBLIC
 SCHEMA INDEX BUILT SUCCESSFULLY
  - Collection: database_schemas
  - Schemas indexed: 4
  - Total tables indexed: 3
Building schema index for connector type: postgres

 Connecting to postgres...
✓ Connected to PostgreSQL: stage_sales
✓ Disconnected from PostgreSQL
Auto-discovered 1 schemas on postgres: public
[1/1] INDEXING SCHEMA: public
Discovering tables in stage_sal

Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given



Indexing 3 table metadata documents...
Indexed 3 tables from schema public
 SCHEMA INDEX BUILT SUCCESSFULLY
  - Collection: database_schemas
  - Schemas indexed: 1
  - Total tables indexed: 3
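
The indexing loop above hardcodes `'snowflake'` as the one connector that rebuilds the collection, so reordering or removing connectors in `settings.yaml` would silently change whether the index is rebuilt at all. A minimal sketch of one alternative, assuming (as the log suggests) that every connector writes into the same `database_schemas` collection: decide `recreate` by position rather than by name. The helper `plan_index_builds` is hypothetical and not part of the project.

```python
def plan_index_builds(connector_types, rebuild=True):
    """Return (connector_type, recreate) pairs.

    Only the first connector recreates the shared collection; the rest
    add to it. Pass rebuild=False on later runs to keep the existing index.
    """
    return [
        (ct, rebuild and position == 0)
        for position, ct in enumerate(connector_types)
    ]


# Hypothetical connector list, mirroring the two connectors in the log above.
for ct, recreate in plan_index_builds(["snowflake", "postgres"]):
    print(f"{ct}: recreate={recreate}")
```

Each pair could then be passed to `SchemaIndexer(connector_type=ct).build_schema_index(recreate=recreate)` in place of the `if ct == 'snowflake'` branch.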


## Enhanced Comprehensive Report Creation

Generate comprehensive data quality reports with detailed duplicate counts, null analysis, and descriptive statistics.
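
To make the three report sections concrete, here is a small standalone sketch of what duplicate counts, null analysis, and descriptive statistics mean for a handful of in-memory rows. It is an illustration only: the function `summarize_rows` and the sample data are hypothetical, and the project's own checks in `src.data_quality.checks` run against the database and may compute these differently.

```python
from collections import Counter
from statistics import mean


def summarize_rows(rows):
    """Summarize duplicates, nulls, and numeric columns for a list of dict rows."""
    # Every occurrence of an identical row beyond the first counts as a duplicate.
    row_counts = Counter(tuple(sorted(row.items())) for row in rows)
    duplicate_rows = sum(count - 1 for count in row_counts.values())

    columns = sorted({col for row in rows for col in row})

    # A missing key and an explicit None are both treated as null.
    null_counts = {
        col: sum(1 for row in rows if row.get(col) is None) for col in columns
    }

    # Descriptive statistics only for columns that hold numeric values.
    stats = {}
    for col in columns:
        values = [
            row[col]
            for row in rows
            if isinstance(row.get(col), (int, float))
            and not isinstance(row.get(col), bool)
        ]
        if values:
            stats[col] = {"min": min(values), "max": max(values), "mean": mean(values)}

    return {
        "duplicate_rows": duplicate_rows,
        "null_counts": null_counts,
        "descriptive_stats": stats,
    }


# Hypothetical sample rows, not taken from the demo databases.
sample = [
    {"id": 1, "email": "a@example.com", "spend": 10.0},
    {"id": 2, "email": None, "spend": 40.0},
    {"id": 1, "email": "a@example.com", "spend": 10.0},
]
summary = summarize_rows(sample)
print(summary["duplicate_rows"], summary["null_counts"]["email"])
```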

In [None]:
comprehensive_report = run_smart_dq_check(
    "Generate comprehensive DQ report for prod customers table"
    #  "generate comprehensive report for me unknown table"
)

print("\n Smart DQ Check Response:")
print(comprehensive_report.get('output', 'No output returned'))

🚀 Running smart DQ check with relevance threshold protection...

SMART DATA QUALITY CHECK
User Query: generate comprehensive report for me unknown table

Step 1: Discovering relevant tables...
----------------------------------------------------------------------


Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Number of requested results 20 is greater than number of elements in index 6, updating n_results = 6
Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Number of requested results 20 is greater than number of elements in index 6, updating n_results = 6
Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given



📝 Smart DQ Check Response:
--------------------------------------------------
No relevant tables found. Please ensure the schema index is built. Run: python src/retrieval/schema_indexer.py
--------------------------------------------------
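
The run above was stopped by the "relevance threshold protection" mentioned in the log. The notebook does not show how `run_smart_dq_check` implements it, so the following is only a sketch of the general idea under stated assumptions: each candidate table comes with a similarity score where higher means more relevant, and the cutoff of 0.5 is an arbitrary illustrative value. The names `filter_by_relevance` and `pick_table` are hypothetical. Note that some vector stores return distances instead, where lower is better, in which case the comparison would be reversed.

```python
NO_MATCH_MESSAGE = "No tables found with sufficient relevance"


def filter_by_relevance(candidates, threshold=0.5):
    """Keep (table, score) pairs at or above the threshold, best first."""
    kept = [(table, score) for table, score in candidates if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


def pick_table(candidates, threshold=0.5):
    """Return a response dict: the best table, or a failure message if none qualify."""
    ranked = filter_by_relevance(candidates, threshold)
    if not ranked:
        # Refuse to guess rather than run checks against an unrelated table.
        return {"output": NO_MATCH_MESSAGE}
    best_table = ranked[0][0]
    return {"output": f"Selected table: {best_table}", "table": best_table}


# Hypothetical scores for illustration only.
print(pick_table([("customers", 0.9), ("invoices", 0.6)])["output"])
print(pick_table([("customers", 0.2)])["output"])
```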


In [None]:
# Check if the smart DQ check was successful before proceeding with report generation
report_output = comprehensive_report.get('output', '')
print(f"Report content preview: {report_output[:200]}...")

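# Check for a discovery failure: run_smart_dq_check can report either message
failure_markers = (
    "No tables found with sufficient relevance",
    "No relevant tables found",
)
if any(marker in report_output for marker in failure_markers):
    print("Smart DQ check failed: no table matched the query with sufficient relevance.")
    print("Skipping report generation - no valid table found for analysis.")
    print("Try a more specific table name, or rebuild the schema index.")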
else:
    print("Smart DQ check successful - proceeding with report generation...")

    # Create reports directory one folder up
    reports_dir = "../reports"
    os.makedirs(reports_dir, exist_ok=True)

    # Initialize report generator
    generator = DataQualityReportGenerator()

    # Generate timestamp for unique filenames
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

    # Extract dataset and connector info from the comprehensive_report
    dataset_id = None
    connector_type = None

    # Look for fully qualified names like "stage_sales.public.customers" or "PROD_SALES.PUBLIC.CUSTOMERS"
    dataset_patterns = [
        r'`([^`]+\.public\.[^`]+)`',  # Backtick-quoted: `database.public.table`
        r'([A-Z_]+\.[A-Z_]+\.[A-Z_]+)',  # Uppercase: DATABASE.SCHEMA.TABLE
        r'([a-z_]+\.public\.[a-z_]+)',   # Lowercase: database.public.table
    ]

    for pattern in dataset_patterns:
        match = re.search(pattern, report_output)
        if match:
            dataset_id = match.group(1)
            break

    # Determine connector type based on dataset naming convention or explicit mention
    if dataset_id:
        if 'PROD_SALES' in dataset_id.upper() or 'snowflake' in report_output.lower():
            connector_type = 'snowflake'
        elif 'STAGE_SALES' in dataset_id.upper() or 'postgres' in report_output.lower():
            connector_type = 'postgres'
        else:
            # Default based on case - uppercase typically Snowflake, lowercase typically Postgres
            connector_type = 'snowflake' if dataset_id.isupper() else 'postgres'

    # Only proceed if we successfully extracted dataset info
    if not dataset_id:
        print("Could not extract dataset from successful report - this shouldn't happen!")
    else:
        print("Extracted info:")
        print(f"   Dataset: {dataset_id}")
        print(f"   Connector: {connector_type}")

        base_filename = f"comprehensive_dq_report_{timestamp}"

        print("\nRunning optimized DQ assessment...")

        # Use the optimized approach: run DQ checks directly and create assessment
        from src.data_quality.checks import check_dataset_duplicates, check_dataset_null_values, check_dataset_descriptive_stats

        # Run all DQ checks directly
        print("   Running duplicate check...")
        duplicate_results = check_dataset_duplicates(dataset_id, connector_type=connector_type)

        print("   Running null values check...")
        null_results = check_dataset_null_values(dataset_id, connector_type=connector_type)

        print("   Running descriptive statistics...")
        stats_results = check_dataset_descriptive_stats(dataset_id, connector_type=connector_type)

        # Bundle the results under the keys passed to the generator's check_results parameter
        check_results = {
            'duplicates': duplicate_results,
            'null_values': null_results,
            'descriptive_stats': stats_results
        }

        assessment_results = generator.create_assessment_from_results(
            check_results=check_results,
            dataset_id=dataset_id,
            connector_type=connector_type
        )

        # Display summary of actual results
        print(f"\nData Quality Results for {dataset_id}:")
        print(f"   Passed: {assessment_results['summary']['passed_checks']} checks")
        print(f"   Failed: {assessment_results['summary']['failed_checks']} checks")
        print(f"   Errors: {assessment_results['summary']['error_checks']} checks")

        print("\nGenerating reports using standardized templates...")

        # 1. Save as Markdown (.md) using template
        md_file = f"{reports_dir}/{base_filename}.md"
        md_content = generator.generate_markdown_report(assessment_results)
        with open(md_file, 'w', encoding='utf-8') as f:
            f.write(md_content)
        print(f"   MARKDOWN: {md_file}")

        # 2. Save as HTML (.html) using template
        html_file = f"{reports_dir}/{base_filename}.html"
        html_content = generator.generate_html_report(assessment_results)
        with open(html_file, 'w', encoding='utf-8') as f:
            f.write(html_content)
        print(f"   HTML: {html_file}")

        # 3. Save as JSON (.json) using generator
        json_file = f"{reports_dir}/{base_filename}.json"
        json_content = generator.generate_json_report(assessment_results)
        with open(json_file, 'w', encoding='utf-8') as f:
            f.write(json_content)
        print(f"   JSON: {json_file}")

📝 Report content preview: No relevant tables found. Please ensure the schema index is built. Run: python src/retrieval/schema_indexer.py...
✅ Smart DQ check successful - proceeding with report generation...
⚠️ Could not extract dataset from successful report - this shouldn't happen!
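
The dataset extraction and connector inference in the cell above are easier to verify when pulled out into small functions that can be exercised without a database. The sketch below reuses the same three patterns and the same naming heuristics as the cell; the function names `extract_dataset_id` and `infer_connector_type` are hypothetical, and the second sample string is made up for illustration.

```python
import re

# Same three patterns as the cell above, tried in order.
DATASET_PATTERNS = [
    r'`([^`]+\.public\.[^`]+)`',     # backtick-quoted name containing .public.
    r'([A-Z_]+\.[A-Z_]+\.[A-Z_]+)',  # uppercase three-part name
    r'([a-z_]+\.public\.[a-z_]+)',   # lowercase name containing .public.
]


def extract_dataset_id(report_output):
    """Return the first fully qualified table name in the text, or None."""
    for pattern in DATASET_PATTERNS:
        match = re.search(pattern, report_output)
        if match:
            return match.group(1)
    return None


def infer_connector_type(dataset_id, report_output=""):
    """Guess the connector from the database name, an explicit mention, or letter case."""
    name = dataset_id.upper()
    text = report_output.lower()
    if 'PROD_SALES' in name or 'snowflake' in text:
        return 'snowflake'
    if 'STAGE_SALES' in name or 'postgres' in text:
        return 'postgres'
    # Fallback heuristic: uppercase names are typically Snowflake, lowercase Postgres.
    return 'snowflake' if dataset_id.isupper() else 'postgres'


failure_text = "No relevant tables found. Please ensure the schema index is built."
print(extract_dataset_id(failure_text))

found = extract_dataset_id("Checked `stage_sales.public.customers` for duplicates")
print(found, infer_connector_type(found))
```

With the failure message from the run above, `extract_dataset_id` finds no match and returns `None`, which is why the cell ends on the "Could not extract dataset" branch.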
