# Smart Data Quality Agent - Demo Notebook

## What This Notebook Does
This notebook demonstrates the complete Smart DQ Check workflow:

1. **Schema Indexing** - Automatically discovers and indexes your database tables. The first run is mandatory; later runs are only needed when the schema has changed
2. **Smart Analysis** - Uses AI to understand your query and analyze the right tables  
3. **Report Generation** - Creates comprehensive data quality reports in multiple formats (Markdown, HTML, JSON)

## How to Use
1. **First Time Setup**: The schema indexer will run automatically and discover your database tables
2. **Run Analysis**: Enter a natural language query such as "check quality of customer data" or "analyze sales table duplicates", or press Enter to run the default query
3. **Get Reports**: The system generates detailed reports saved to the `../reports` folder

## What You'll Get (in `.html`, `.json`, and `.md` Formats)
- Duplicate record analysis
- Missing data (null values) assessment  
- Statistical summaries and data profiling
- Actionable recommendations for data quality improvements
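
The three output formats can all be derived from a single result dictionary. The sketch below only illustrates that idea: `render_reports`, the `table` and `metrics` keys, and the `dq_report` file stem are hypothetical names chosen for this example, and they are not the project's `DataQualityReportGenerator` API.

```python
import json
import tempfile
from html import escape
from pathlib import Path


def render_reports(summary: dict, out_dir: str, stem: str = "dq_report") -> dict:
    """Write one summary dict as .json, .md and .html and return the paths by format.

    Hypothetical helper: the 'table' and 'metrics' keys are assumptions for this sketch.
    """
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)

    # Markdown: a title followed by one bullet per metric.
    md_lines = [f"# Data Quality Report: {summary['table']}", ""]
    md_lines += [f"- **{key}**: {value}" for key, value in summary["metrics"].items()]

    # HTML: the same metrics as an unordered list, with values escaped.
    items = "".join(
        f"<li><b>{escape(str(k))}</b>: {escape(str(v))}</li>"
        for k, v in summary["metrics"].items()
    )
    html = f"<html><body><h1>{escape(summary['table'])}</h1><ul>{items}</ul></body></html>"

    contents = {
        "json": json.dumps(summary, indent=2),
        "md": "\n".join(md_lines) + "\n",
        "html": html,
    }
    paths = {}
    for ext, text in contents.items():
        path = target / f"{stem}.{ext}"
        path.write_text(text, encoding="utf-8")
        paths[ext] = path
    return paths


# Usage with made-up numbers, written to a temporary directory.
demo_dir = tempfile.mkdtemp()
demo_paths = render_reports(
    {"table": "CUSTOMERS", "metrics": {"duplicate_rows": 2, "null_cells": 5}},
    demo_dir,
)
print(sorted(demo_paths))
```

Keeping one dictionary as the source for every format means the three files cannot drift apart in content.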

**Just run all cells in order.**

In [None]:
# Standard library
import json
import os
import re
import warnings
from datetime import datetime

# Third-party
import yaml
from dotenv import load_dotenv

# Silence library warnings before the project modules load, to keep the demo output readable
warnings.filterwarnings('ignore')

# Project modules
from src.retrieval.schema_indexer import SchemaIndexer
from src.agent.smart_planner import run_smart_dq_check
import src.reporting.report_templates
import src.data_quality.checks
import src.reporting.report_generator
import src.agent.reporting_tools
from src.reporting.report_templates import ReportTemplates
from src.reporting.report_generator import DataQualityReportGenerator

### Schema Indexing - Always rebuild from scratch to ensure fresh data

In [None]:
# Load settings.yaml to read connector types dynamically
with open('../config/settings.yaml', 'r') as f:
    settings = yaml.safe_load(f)

# Get connector types from settings.yaml
connector_types = list(settings['connectors'].keys())

print(f"üöÄ REBUILDING SCHEMA INDEX FROM SCRATCH")
print(f"{'='*60}")

# First connector: recreate=True to start fresh
# Subsequent connectors: recreate=False to preserve other connector data
for i, ct in enumerate(connector_types):
    print(f"\nüî® Rebuilding index for {ct.upper()}")
    print(f"{'='*40}")

    # First connector recreates the entire index, others append to it
    recreate_index = (i == 0)  # Only recreate on first connector

    if recreate_index:
        print(f"   üóëÔ∏è  Starting fresh - recreating entire index")
    else:
        print(f"   ➕ Adding to existing index")

    indexer = SchemaIndexer(connector_type=ct)
    indexer.build_schema_index(recreate=recreate_index)
    print(f"   ✅ Index {'created' if recreate_index else 'updated'} for {ct}")

print(f"\nüéâ SCHEMA INDEX REBUILD COMPLETE!")
print(f"{'='*60}")


🔨 Rebuilding index for SNOWFLAKE
 Connecting to Snowflake...
✓ Connected to Snowflake: PROD_SALES.PUBLIC
✓ Disconnected from Snowflake
Auto-discovered 4 schemas on Snowflake: DATA_MART_FINANCE, DATA_MART_MANUFACTURING, DATA_MART_MARKETING, PUBLIC
[1/4] INDEXING SCHEMA: DATA_MART_FINANCE

Generating metadata documents for 2 tables...
  [1/2] Processing INCOME_BAND...
  [2/2] Processing INVOICES...


Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given



Indexing 2 table metadata documents...
Indexed 2 tables from schema DATA_MART_FINANCE
[2/4] INDEXING SCHEMA: DATA_MART_MANUFACTURING

Generating metadata documents for 1 tables...
  [1/1] Processing PRODUCTS...



Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given



Indexing 1 table metadata documents...
Indexed 1 tables from schema DATA_MART_MANUFACTURING
[3/4] INDEXING SCHEMA: DATA_MART_MARKETING

Generating metadata documents for 3 tables...
  [1/3] Processing CALL_CENTER...
  [2/3] Processing CUSTOMERS...
  [3/3] Processing SALES...


Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given



Indexing 3 table metadata documents...
Indexed 3 tables from schema DATA_MART_MARKETING
[4/4] INDEXING SCHEMA: PUBLIC

Generating metadata documents for 0 tables...
No tables found in schema PUBLIC
 SCHEMA INDEX BUILT SUCCESSFULLY
  - Collection: database_schemas
  - Schemas indexed: 4
  - Total tables indexed: 6
   ✅ Index rebuilt for snowflake

🔨 Rebuilding index for POSTGRES
 Connecting to postgres...
✓ Connected to PostgreSQL: stage_sales
✓ Disconnected from PostgreSQL
Auto-discovered 1 schemas on postgres: public
[1/1] INDEXING SCHEMA: public

Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given



Indexing 3 table metadata documents...
Indexed 3 tables from schema public
 SCHEMA INDEX BUILT SUCCESSFULLY
  - Collection: database_schemas
  - Schemas indexed: 1
  - Total tables indexed: 3
   ✅ Index rebuilt for postgres

🎉 SCHEMA INDEX REBUILD COMPLETE!

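The rebuild cell recreates the index for the first connector and appends for every connector after it, so stale entries are dropped once and nothing is overwritten afterwards. The sketch below shows that pattern in isolation. `InMemoryIndexer` is a hypothetical stand-in for `SchemaIndexer` that stores entries in a plain dict, and `connector_types_from` and `rebuild_all` are names invented for this example; the `settings` dict mirrors the shape the notebook reads from `settings.yaml` (a top-level `connectors` mapping).

```python
class InMemoryIndexer:
    """Hypothetical stand-in for SchemaIndexer; it does not touch a real database."""

    store: dict = {}  # shared by all instances, like a single on-disk collection

    def __init__(self, connector_type: str):
        self.connector_type = connector_type

    def build_schema_index(self, recreate: bool = False) -> None:
        if recreate:
            # Drop every existing entry, including those of other connectors.
            InMemoryIndexer.store.clear()
        InMemoryIndexer.store[self.connector_type] = f"schemas for {self.connector_type}"


def connector_types_from(settings: dict) -> list:
    """Return connector names in file order, failing early if the section is missing."""
    connectors = settings.get("connectors") or {}
    if not connectors:
        raise ValueError("settings has no 'connectors' section")
    return list(connectors)


def rebuild_all(settings: dict) -> dict:
    """Recreate the index on the first connector, then append the remaining ones."""
    for position, connector_type in enumerate(connector_types_from(settings)):
        indexer = InMemoryIndexer(connector_type)
        indexer.build_schema_index(recreate=(position == 0))
    return dict(InMemoryIndexer.store)


# Usage: a stale entry from an earlier run is removed by the first connector's rebuild.
InMemoryIndexer.store["stale"] = "left over from an earlier run"
demo_settings = {"connectors": {"snowflake": {}, "postgres": {}}}
demo_index = rebuild_all(demo_settings)
print(list(demo_index))
```

If `recreate=True` were passed for every connector, each rebuild would wipe the entries written by the previous one, which is why only the first iteration recreates.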

## Enhanced Comprehensive Report Creation

Generate comprehensive data quality reports with detailed duplicate counts, null analysis, and descriptive statistics.
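
As a rough illustration of what those three checks measure, the sketch below computes duplicate counts, per-column null counts, and descriptive statistics over a small in-memory sample using only the standard library. `profile_rows` and the sample rows are invented for this example; the project's own checks in `src.data_quality.checks` run against the database and may be defined differently.

```python
from collections import Counter
from statistics import mean, median


def profile_rows(rows: list, numeric_column: str) -> dict:
    """Profile a list of row dicts (hypothetical helper, not the project's API)."""
    if not rows:
        raise ValueError("rows must not be empty")

    # Duplicates: every repeat of a fully identical row after its first occurrence.
    row_counts = Counter(tuple(sorted(row.items())) for row in rows)
    duplicate_rows = sum(count - 1 for count in row_counts.values())

    # Nulls: number of None values per column, using the first row's columns.
    columns = list(rows[0])
    null_counts = {
        column: sum(1 for row in rows if row.get(column) is None)
        for column in columns
    }

    # Descriptive statistics over the non-null values of one numeric column.
    values = [row[numeric_column] for row in rows if row.get(numeric_column) is not None]
    stats = {}
    if values:
        stats = {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
            "median": median(values),
        }

    return {"duplicate_rows": duplicate_rows, "null_counts": null_counts, "stats": stats}


# Made-up sample data for illustration only.
sample_rows = [
    {"id": 1, "email": "a@example.com", "amount": 10.0},
    {"id": 2, "email": None, "amount": 30.0},
    {"id": 2, "email": None, "amount": 30.0},
    {"id": 3, "email": "c@example.com", "amount": None},
]
sample_profile = profile_rows(sample_rows, "amount")
print(sample_profile)
```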

In [None]:
# Get user input for the query with a default option
default_query = "Generate comprehensive DQ report for prod customers table"
print(f"Default query: {default_query}")
user_query = input("Enter your DQ check query (or press Enter to use default): ").strip() or default_query
print(f"Using query: {user_query}")

comprehensive_report = run_smart_dq_check(user_query)

Default query: Generate comprehensive DQ report for prod customers table
Using query: create report for stage sales table

SMART DATA QUALITY CHECK
User Query: create report for stage sales table

Step 1: Discovering relevant tables...
----------------------------------------------------------------------


Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Number of requested results 20 is greater than number of elements in index 12, updating n_results = 12
Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Number of requested results 20 is greater than number of elements in index 12, updating n_results = 12
Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given


In [None]:
# Use the new SmartDQReportProcessor to handle the complex report generation workflow
from src.reporting.report_processor import SmartDQReportProcessor

# Initialize the processor
processor = SmartDQReportProcessor(reports_dir="../reports")

# Process the comprehensive report (handles validation, extraction, checks, and file generation)
result = processor.process_comprehensive_report(comprehensive_report)

# Display the final result
if result['status'] == 'success':
    print(f" Generated files: {list(result['files_generated'].keys())}")
else:
    print(f"\nReport generation failed: {result['reason']}")
    if result['reason'] == 'low_relevance':
        print("Try using a more specific table name in your query.")

Report content preview: No tables found with sufficient relevance (minimum 15% match required). Best match was 'PROD_SALES.DATA_MART_MARKETING.SALES' with only 0.9% relevance. Please try a more specific table name or check i...
Smart DQ check failed due to low relevance scores.
Skipping report generation - no valid table found for analysis.
Try using a more specific table name in your query.

Report generation failed: low_relevance
Try using a more specific table name in your query.
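
The failure above comes from a relevance gate: no candidate table reached the 15% minimum, and the best match scored 0.9%. A minimal sketch of such a gate follows, assuming relevance scores are fractions between 0 and 1. `select_tables`, `MIN_RELEVANCE`, and the returned keys other than `status` and `reason` are hypothetical; the planner's actual scoring and result structure may differ.

```python
MIN_RELEVANCE = 0.15  # 15%, the minimum quoted in the output above


def select_tables(scores: dict, min_relevance: float = MIN_RELEVANCE) -> dict:
    """Keep tables at or above the threshold, or explain why none qualified.

    Hypothetical helper: `scores` maps a table name to a relevance fraction.
    """
    if not scores:
        return {"status": "failed", "reason": "no_candidates"}

    # Rank candidates from most to least relevant.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    selected = [name for name, score in ranked if score >= min_relevance]
    if selected:
        return {"status": "success", "tables": selected}

    # Nothing passed the gate: report the closest candidate so the user can refine the query.
    best_name, best_score = ranked[0]
    return {
        "status": "failed",
        "reason": "low_relevance",
        "best_match": best_name,
        "best_score": best_score,
    }


# The SALES score is taken from the output above; the second entry is invented.
low = select_tables({
    "PROD_SALES.DATA_MART_MARKETING.SALES": 0.009,
    "example.other_table": 0.004,
})
print(low["reason"], low["best_match"])

# Invented scores showing the passing case.
ok = select_tables({"table_a": 0.40, "table_b": 0.10})
print(ok)
```

Returning the best match alongside the failure reason lets the caller print a concrete hint instead of a generic error.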
