# Smart Data Quality Agent - Demo Notebook

## What This Notebook Does
This notebook demonstrates the complete Smart DQ Check workflow with enhanced null analysis:

1. **Schema Indexing** - Automatically discovers and indexes database tables (must be run the first time). No CDC logic is applied, because the task description defines no related requirement.
2. **Smart Analysis** - Uses AI to understand your query and analyze the right tables  
3. **Enhanced Report Generation** - Creates comprehensive data quality reports with detailed null analysis in multiple formats
4. **Null Analysis Fix** - The agent prompt was enhanced so that a detailed null values breakdown is always included
5. **Complete Data Coverage** - Reports now show per-column null counts and percentages
6. **Multi-Format Output** - All formats (Markdown, HTML, JSON) contain the complete null analysis details
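
To make "per-column null counts and percentages" concrete, here is a minimal sketch of that calculation over plain Python rows. It is an illustration only, not the agent's actual implementation; the function name `null_profile` and the sample rows are hypothetical.

```python
from typing import Any


def null_profile(rows: list[dict[str, Any]]) -> dict[str, dict[str, float]]:
    """Return per-column null counts and percentages for a list of row dicts."""
    total = len(rows)
    # Collect every column name that appears in any row.
    columns = sorted({col for row in rows for col in row})
    profile = {}
    for col in columns:
        # A value counts as null when it is None or the key is absent.
        nulls = sum(1 for row in rows if row.get(col) is None)
        pct = round(100.0 * nulls / total, 2) if total else 0.0
        profile[col] = {"null_count": nulls, "null_pct": pct}
    return profile


# Hypothetical sample data, used only for illustration.
sample_rows = [
    {"id": 1, "email": "a@example.com", "city": None},
    {"id": 2, "email": None, "city": None},
    {"id": 3, "email": "c@example.com", "city": "Oslo"},
    {"id": 4, "email": "d@example.com", "city": None},
]
for column, stats in null_profile(sample_rows).items():
    print(f"{column}: {stats['null_count']} nulls ({stats['null_pct']}%)")
```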

## How to Use

### **For Comprehensive Analysis:**
1. **First Time Setup**: Run indexing cell to build schema index and discover your database tables
2. **Run Analysis**: The following cell prompts for your query. Try examples like:
   - "Generate comprehensive DQ report for prod customers table"
   - "Create data quality report for production database invoices"
3. **Get Reports**: The system generates detailed reports and saves them to `../reports/` in three formats (.html, .json, .md)
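
After a run, one quick way to confirm that all three formats were written is to list the reports folder by file extension. The sketch below uses only the standard library; the helper name `list_reports` is hypothetical, and it assumes the notebook is run from a directory where `../reports` resolves to the reports folder.

```python
from pathlib import Path


def list_reports(reports_dir: str = "../reports") -> dict[str, list[str]]:
    """Group report file names in reports_dir by the three expected formats."""
    found = {".html": [], ".json": [], ".md": []}
    root = Path(reports_dir)
    if not root.is_dir():
        return found  # nothing generated yet
    for path in sorted(root.iterdir()):
        if path.is_file() and path.suffix in found:
            found[path.suffix].append(path.name)
    return found


for suffix, names in list_reports().items():
    print(f"{suffix}: {len(names)} file(s)")
    for name in names:
        print(f"  {name}")
```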

### **For Single Quality Check Analysis:**
1. **Run Schema Indexing** (same as above - required for all analysis types)
2. **Choose Single Check Type**: Run the specific cells for focused analysis:
   - **Duplicate-Only**: Run duplicate analysis cell for just duplicate detection
   - **Null-Only**: Run null analysis cell for just missing values detection
   - **Statistics-Only**: Run the statistics analysis cell, or use the query "Check statistics for [table]" in the comprehensive cell
3. **Faster Execution**: Single checks run much faster than comprehensive analysis
4. **Focused Reports**: Each single check creates a targeted report file (e.g., `duplicate_only_analysis.md`)

### **When to Use Each Approach:**
- **Comprehensive**: When you need a complete data quality overview with all checks and all report formats
- **Single Check**: When you are investigating a specific issue or need a quick, focused analysis

### **Reports Saved Under the Reports Folder Include:**
- **Duplicate Record Analysis** - Complete duplicate detection and counts
- **Detailed Null Values Assessment** - Per-column breakdown with counts and percentages  
- **Statistical Summaries** - Comprehensive data profiling with descriptive stats
- **Actionable Recommendations** - Data quality improvement suggestions
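
As an illustration of what "duplicate detection and counts" can mean, the sketch below counts rows that repeat on a chosen set of key columns. This is an assumption about the general technique, not a description of how the agent computes it; `duplicate_summary` and the sample rows are hypothetical.

```python
from collections import Counter
from typing import Any


def duplicate_summary(rows: list[dict[str, Any]], key_columns: list[str]) -> dict[str, int]:
    """Count rows that share the same values on key_columns."""
    # Build one hashable key per row from the selected columns.
    keys = Counter(tuple(row.get(col) for col in key_columns) for row in rows)
    duplicated = {key: n for key, n in keys.items() if n > 1}
    return {
        "total_rows": len(rows),
        "duplicate_groups": len(duplicated),
        # Every occurrence after the first in a group counts as a duplicate row.
        "duplicate_rows": sum(n - 1 for n in duplicated.values()),
    }


# Hypothetical sample data, used only for illustration.
customers = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
    {"id": 3, "email": "a@example.com"},
    {"id": 4, "email": "a@example.com"},
]
print(duplicate_summary(customers, key_columns=["email"]))
```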

**Just run all cells in order and follow the prompts**

In [None]:
# Standard library
import json
import os
import re
import warnings
from datetime import datetime

# Third-party
import yaml
from dotenv import load_dotenv

# Suppress warnings before the project modules are imported
warnings.filterwarnings('ignore')

# Project modules (duplicate imports removed)
from src.retrieval.schema_indexer import SchemaIndexer
from src.agent.smart_planner import run_smart_dq_check
import src.reporting.report_templates
import src.data_quality.checks
import src.reporting.report_generator
import src.agent.reporting_tools
import src.reporting.report_processor
from src.reporting.report_processor import SmartDQReportProcessor

**Indexing target databases cell**

In [None]:
# Load settings.yaml to read connector types dynamically
with open('../config/settings.yaml', 'r') as f:
    settings = yaml.safe_load(f)

# Get connector types from settings.yaml
connector_types = list(settings['connectors'].keys())

print(f" BUILDING SCHEMA INDEX FROM SCRATCH")

# First connector: recreate=True
# Subsequent connectors: recreate=False to preserve other connector data
for i, ct in enumerate(connector_types):
    print(f"\n Rebuilding index for {ct.upper()}")

    # First connector recreates the entire index, others append to it
    recreate_index = (i == 0)  # Only recreate on first connector

    if recreate_index:
        print(f"  Starting fresh - recreating entire index")
    else:
        print(f"  Adding to existing index")

    indexer = SchemaIndexer(connector_type=ct)
    indexer.build_schema_index(recreate=recreate_index)
    print(f"   Index {'created' if recreate_index else 'updated'} for {ct}")

print(f"\n SCHEMA INDEX REBUILD COMPLETE!")
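
The indexing cell assumes that `settings.yaml` has a `connectors` mapping and that only the first connector recreates the index. The sketch below isolates those two ideas so they can be checked without touching a database. The helper names and the nested connector values are hypothetical; `postgres` and `snowflake` are used as keys only because they appear in the example queries.

```python
def get_connector_types(settings):
    """Return connector names, failing early with a clear message if none are configured."""
    # yaml.safe_load returns None for an empty file, so guard against that too.
    connectors = (settings or {}).get("connectors")
    if not isinstance(connectors, dict) or not connectors:
        raise ValueError("settings.yaml must define a non-empty 'connectors' mapping")
    return list(connectors)


def plan_index_build(connector_types):
    """Pair each connector with its recreate flag: True only for the first one."""
    return [(ct, i == 0) for i, ct in enumerate(connector_types)]


# Hypothetical parsed settings, standing in for the content of settings.yaml.
example_settings = {"connectors": {"postgres": {}, "snowflake": {}}}

for connector, recreate in plan_index_build(get_connector_types(example_settings)):
    action = "recreate index" if recreate else "append to index"
    print(f"{connector}: {action}")
```

Because the first connector wipes the index, the order of keys under `connectors` determines which connector triggers the rebuild; the remaining connectors only add to it.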

**Example queries run in a loop**

In [None]:
# Loop over example queries

queries = [
    "Analyze staging analytics data quality",  # Missing table to show error handling
    "Create comprehensive report of UNKNOWN table data quality for staging database",  # Nonsense table name to test error handling
    "Generate comprehensive DQ report for staging sales table in postgres",
    "Check prod customers table for missing values",
    "Find duplicates in production invoice table snowflake",
]

# Create the processor once and reuse it for every query
processor = SmartDQReportProcessor(reports_dir="../reports")

for i, query in enumerate(queries, start=1):
    print(f"\n Running DQ check for query: {query}")
    report = run_smart_dq_check(query)
    print(f"Report output: {report.get('output', 'No output found')}")
    processor.process_comprehensive_report(report)
    print(f"[{i}/{len(queries)}] Report processing complete.")
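
Two of the example queries are meant to fail. If a failing check raises an exception instead of returning an error report (whether `run_smart_dq_check` does so is an assumption here), the loop would stop at the first failure. The sketch below shows one way to isolate each query; `run_queries` and `fake_check` are hypothetical names, and `fake_check` is only a stand-in for the real agent call.

```python
def run_queries(queries, run_check, process_report):
    """Run every query, recording failures instead of aborting the loop."""
    results = []
    for query in queries:
        try:
            report = run_check(query)
            process_report(report)
            results.append({"query": query, "ok": True, "error": None})
        except Exception as exc:  # keep going so later queries still run
            results.append({"query": query, "ok": False, "error": str(exc)})
    return results


def fake_check(query):
    # Hypothetical stand-in for run_smart_dq_check.
    if "UNKNOWN" in query:
        raise ValueError("table not found")
    return {"output": f"checked: {query}"}


processed = []
summary = run_queries(
    [
        "Check prod customers table for missing values",
        "Create comprehensive report of UNKNOWN table data quality for staging database",
    ],
    run_check=fake_check,
    process_report=processed.append,
)
for entry in summary:
    status = "ok" if entry["ok"] else f"failed ({entry['error']})"
    print(f"{entry['query']}: {status}")
```

In the notebook, the same helper could be called with `run_smart_dq_check` and `processor.process_comprehensive_report` in place of the stand-ins.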

**Query based on user input**

In [None]:
# Get user input for the query with a default option
default_query = "Generate comprehensive DQ report for prod customers table"
user_prompt = input("Enter your DQ check query (or press Enter to use default): ").strip() or default_query
print(f"Using prompt: {user_prompt}")

comprehensive_report = run_smart_dq_check(user_prompt)

**Report creator class and method showcase**

In [None]:
# Initialize the report processor
processor = SmartDQReportProcessor(reports_dir="../reports")

# Process the comprehensive report (handles validation, extraction, checks, and file generation)
result = processor.process_comprehensive_report(comprehensive_report)

**Single data quality check instead of a comprehensive report**

In [None]:
# For DUPLICATE-ONLY analysis - use agent with specific query
duplicate_query = "Check for duplicates in prod customers table"
print(f"Running duplicate-only analysis: {duplicate_query}")

duplicate_result = run_smart_dq_check(duplicate_query)

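# Save to a simple duplicate-focused file
# .get() avoids a KeyError if the result has no 'output' key
duplicate_report_content = f"""# Duplicate Analysis Report
Generated: {datetime.now()}
Query: {duplicate_query}

## Results:
{duplicate_result.get('output', 'No output found')}
"""

# Save to reports folder, creating it first if it does not exist
duplicate_file_path = "../reports/duplicate_only_analysis.md"
os.makedirs(os.path.dirname(duplicate_file_path), exist_ok=True)
with open(duplicate_file_path, "w", encoding="utf-8") as f:
    f.write(duplicate_report_content)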

In [None]:
# For DESCRIPTIVE STATISTICS-ONLY analysis - use agent with specific query
stats_query = "Check descriptive statistics for prod customers table"
print(f"Running statistics-only analysis: {stats_query}")

stats_result = run_smart_dq_check(stats_query)

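# Save to a simple statistics-focused file
# .get() avoids a KeyError if the result has no 'output' key
stats_report_content = f"""# Descriptive Statistics Analysis Report
Generated: {datetime.now()}
Query: {stats_query}

## Results:
{stats_result.get('output', 'No output found')}
"""

# Save to reports folder, creating it first if it does not exist
stats_file_path = "../reports/stats_only_analysis.md"
os.makedirs(os.path.dirname(stats_file_path), exist_ok=True)
with open(stats_file_path, "w", encoding="utf-8") as f:
    f.write(stats_report_content)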
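
The duplicate and statistics cells follow the same pattern: run one query, wrap the output in a short Markdown template, and write it to the reports folder. A small helper could cover both, and a null-only check as well. The sketch below is one possible shape; `save_single_check_report` and the file name `null_only_analysis.md` are hypothetical, and the result dictionary is a stand-in for what `run_smart_dq_check` returns.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def save_single_check_report(title, query, result, file_path):
    """Write a single-check result to a Markdown file and return its path."""
    output = result.get("output", "No output found")
    content = (
        f"# {title}\n"
        f"Generated: {datetime.now()}\n"
        f"Query: {query}\n\n"
        f"## Results:\n{output}\n"
    )
    path = Path(file_path)
    # Create the target folder if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content, encoding="utf-8")
    return path


# Demonstration with a stand-in result, written to a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_single_check_report(
        title="Null Analysis Report",
        query="Check prod customers table for missing values",
        result={"output": "(agent output would appear here)"},
        file_path=Path(tmp) / "reports" / "null_only_analysis.md",
    )
    print(saved.read_text(encoding="utf-8"))
```

In the notebook itself, `result` would be the return value of `run_smart_dq_check(query)` and `file_path` would point into `../reports/`.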