# YAML Shredder & Comparator Demo

A walkthrough of Schema Sentinel's tools for transforming YAML/JSON data into relational tables and comparing configurations.

## Overview

This notebook demonstrates:
- **YAML Shredder**: Convert nested YAML/JSON files into normalized relational tables
- **YAML Comparator**: Compare two YAML files to identify structural and data differences
- **Document Generator**: Generate markdown documentation from YAML configurations
- **SQL DDL Generation**: Create database schemas for the Snowflake SQL dialect

## Section 1: Setup and Installation

First, let's ensure the yaml_shredder package is properly installed and ready to use.

In [None]:
import sys
from pathlib import Path

# Verify yaml_shredder is available
try:
    import yaml_shredder
    print(f"✓ yaml_shredder version: {yaml_shredder.__version__}")
    print(f"✓ Location: {yaml_shredder.__file__}")
except ImportError as e:
    print(f"✗ Error importing yaml_shredder: {e}")
    raise RuntimeError("yaml_shredder is required for this demo notebook. Please install it (e.g., 'pip install yaml_shredder') and retry.") from e

## Section 2: Import Required Modules

Import all the necessary modules from yaml_shredder for YAML processing and comparison.

In [None]:
from pathlib import Path
from pprint import pprint
import tempfile
import shutil
from datetime import datetime

import pandas as pd
import yaml

from yaml_shredder import (
    TableGenerator,
    DDLGenerator,
    StructureAnalyzer,
    SQLiteLoader,
    YAMLComparator,
    generate_doc_from_yaml,
)

# Set up temporary directory for this demo with unique suffix to avoid conflicts
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
TEMP_DIR = Path(tempfile.gettempdir()) / f"yaml_shredder_demo_{timestamp}"
TEMP_DIR.mkdir(parents=True, exist_ok=True)

print(f"Working directory: {TEMP_DIR}")
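
The working directory created above lives under the system temp folder and is not removed automatically. The following is a minimal sketch of one cleanup pattern using only the standard library. The name `scratch_dir` is a hypothetical stand-in for `TEMP_DIR`, and the file written into it is only a placeholder.

```python
import shutil
import tempfile
from pathlib import Path

# mkdtemp creates a uniquely named directory, so no timestamp suffix is needed.
scratch_dir = Path(tempfile.mkdtemp(prefix="yaml_shredder_demo_"))
try:
    marker = scratch_dir / "example.txt"
    marker.write_text("scratch data", encoding="utf-8")
    existed_during_run = marker.exists()
finally:
    # Remove the directory tree and everything written into it.
    shutil.rmtree(scratch_dir, ignore_errors=True)

print(existed_during_run, scratch_dir.exists())
```

In the notebook itself, the equivalent step would be a final cell that calls `shutil.rmtree(TEMP_DIR, ignore_errors=True)` once the generated files are no longer needed.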

## Section 3: Create Sample YAML Files

Let's create sample YAML files to demonstrate the yaml_shredder and comparator functionality.

In [None]:
# Create sample YAML file 1 (Original configuration)
config_v1 = {
    "deployment": {
        "environment": "production",
        "region": "us-east-1",
        "version": "1.0.0"
    },
    "database": {
        "host": "db.example.com",
        "port": 5432,
        "name": "main_db"
    },
    "services": [
        {
            "name": "api-service",
            "port": 8080,
            "replicas": 3,
            "cpu": "500m"
        },
        {
            "name": "cache-service",
            "port": 6379,
            "replicas": 2,
            "cpu": "250m"
        }
    ]
}

yaml_file_1 = TEMP_DIR / "config_v1.yaml"
with open(yaml_file_1, "w") as f:
    yaml.safe_dump(config_v1, f)

print(f"✓ Created {yaml_file_1.name}")
print("\nContent of config_v1.yaml:")
pprint(config_v1)

In [None]:
# Create sample YAML file 2 (Updated configuration with changes)
config_v2 = {
    "deployment": {
        "environment": "production",
        "region": "us-west-2",  # Changed
        "version": "1.1.0"      # Changed
    },
    "database": {
        "host": "db-new.example.com",  # Changed
        "port": 5432,
        "name": "main_db",
        "ssl_enabled": True  # New
    },
    "services": [
        {
            "name": "api-service",
            "port": 8080,
            "replicas": 5,      # Changed
            "cpu": "500m"
        },
        {
            "name": "cache-service",
            "port": 6379,
            "replicas": 2,
            "cpu": "250m"
        },
        {
            "name": "worker-service",  # New service
            "port": 9000,
            "replicas": 1,
            "cpu": "100m"
        }
    ]
}

yaml_file_2 = TEMP_DIR / "config_v2.yaml"
with open(yaml_file_2, "w") as f:
    yaml.safe_dump(config_v2, f)

print(f"✓ Created {yaml_file_2.name}")
print("\nContent of config_v2.yaml:")
pprint(config_v2)

## Section 4: Analyze YAML Structure

Before shredding, analyze the structure of the first YAML file to see which nested elements it contains.

In [None]:
# Load and analyze the structure
with open(yaml_file_1) as f:
    data = yaml.safe_load(f)

analyzer = StructureAnalyzer(max_depth=3)
analysis = analyzer.analyze(data)

print("=" * 70)
print("YAML STRUCTURE ANALYSIS")
print("=" * 70)
analyzer.print_summary(analysis)
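
To make the idea of structure analysis concrete, here is a minimal standalone sketch that walks a nested structure and reports each path with its kind. It is not how `StructureAnalyzer` is implemented; the function `summarize_structure` and its output format are hypothetical, and `max_depth` here simply limits how far the walk descends.

```python
def summarize_structure(node, path="root", depth=0, max_depth=3):
    """Yield (path, kind, detail) tuples describing a nested structure."""
    if isinstance(node, dict):
        yield path, "object", f"{len(node)} keys"
        if depth < max_depth:
            for key, value in node.items():
                yield from summarize_structure(value, f"{path}.{key}", depth + 1, max_depth)
    elif isinstance(node, list):
        yield path, "array", f"{len(node)} items"
        if node and depth < max_depth:
            # Inspect only the first element as a representative sample.
            yield from summarize_structure(node[0], f"{path}[]", depth + 1, max_depth)
    else:
        yield path, "scalar", type(node).__name__


sample = {
    "deployment": {"environment": "production", "region": "us-east-1"},
    "services": [{"name": "api-service", "port": 8080}],
}

structure = list(summarize_structure(sample))
for path, kind, detail in structure:
    print(f"{path:30} {kind:7} {detail}")
```

Arrays of objects such as `services` are the parts that typically become child tables in a relational layout.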

## Section 5: Convert YAML to Relational Tables

Transform the nested YAML structure into normalized relational tables for database storage.

In [None]:
# Generate tables from YAML
table_gen = TableGenerator(max_depth=None)  # Full flattening
tables = table_gen.generate_tables(data, root_table_name="CONFIG", source_file=yaml_file_1)

print("\n" + "=" * 70)
print("GENERATED TABLES")
print("=" * 70)
table_gen.print_summary()

print("\nTable Details:")
for table_name, df in tables.items():
    print(f"\n📊 Table: {table_name}")
    print(f"   Rows: {len(df)}, Columns: {len(df.columns)}")
    print(f"   Schema: {list(df.columns)}")

In [None]:
# Show sample data from SERVICES table
if "SERVICES" in tables:
    print("\nSample SERVICES data:")
    print(tables["SERVICES"].to_string())
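
The transformation above can be pictured with a small standalone sketch: nested objects are flattened into prefixed columns, and lists of objects become child tables linked to their parent row. This is an illustration under stated assumptions, not `TableGenerator`'s actual algorithm; the `shred` function, the surrogate `id` column, and the `<parent>_id` naming are all hypothetical.

```python
def shred(record, table, tables, parent=None):
    """Append one flattened row for `record` to tables[table]; recurse into lists of dicts."""
    rows = tables.setdefault(table, [])
    row = {"id": len(rows) + 1}  # surrogate key, unique within the table
    if parent is not None:
        parent_table, parent_id = parent
        row[f"{parent_table.lower()}_id"] = parent_id  # link back to the parent row
    rows.append(row)

    def flatten(obj, prefix=""):
        for key, value in obj.items():
            column = f"{prefix}{key}"
            if isinstance(value, dict):
                # Nested object: fold its keys into prefixed columns on the same row.
                flatten(value, prefix=f"{column}_")
            elif isinstance(value, list) and value and all(isinstance(v, dict) for v in value):
                # List of objects: one child-table row per element.
                for child in value:
                    shred(child, key.upper(), tables, parent=(table, row["id"]))
            else:
                row[column] = value

    flatten(record)
    return tables


config = {
    "deployment": {"region": "us-east-1"},
    "services": [
        {"name": "api-service", "port": 8080},
        {"name": "cache-service", "port": 6379},
    ],
}

tables_demo = shred(config, "CONFIG", {})
for name, rows in tables_demo.items():
    print(name, rows)
```

Each list of row dictionaries could then be passed to `pandas.DataFrame` to obtain a table comparable in shape to the ones printed above.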

## Section 6: Load Tables into SQLite Database

Store the generated tables in a SQLite database for persistence and querying.

In [None]:
# Load tables into SQLite
db_path = TEMP_DIR / "config.db"
loader = SQLiteLoader(db_path)
loader.connect()
try:
    loader.load_tables(tables, if_exists="replace", create_indexes=True)
    loader.print_summary()
finally:
    # Always release the connection, even if loading fails
    loader.disconnect()

print(f"\n✓ Database created at: {db_path}")
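
Once the data is in SQLite, it can be queried with Python's built-in `sqlite3` module. The sketch below uses an in-memory database as a stand-in for `config.db`. The `SERVICES` column layout and the index name are assumptions made for illustration, so inspect the real schema (for example via `sqlite_master`, as shown) before writing queries against the generated file.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the demo's config.db
conn.execute("CREATE TABLE SERVICES (name TEXT, port INTEGER, replicas INTEGER, cpu TEXT)")
conn.executemany(
    "INSERT INTO SERVICES VALUES (?, ?, ?, ?)",
    [("api-service", 8080, 3, "500m"), ("cache-service", 6379, 2, "250m")],
)
conn.execute("CREATE INDEX idx_services_name ON SERVICES (name)")

# List the tables that exist in the database.
table_names = [r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")]

# Parameterized query: services running at least three replicas.
busy = conn.execute(
    "SELECT name, replicas FROM SERVICES WHERE replicas >= ? ORDER BY name", (3,)
).fetchall()
conn.close()

print(table_names)
print(busy)
```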

## Section 7: Generate SQL DDL

Create SQL DDL statements in the Snowflake dialect to recreate the schema.

In [None]:
# Generate DDL for Snowflake
ddl_gen = DDLGenerator(dialect="snowflake")
ddl_statements = ddl_gen.generate_ddl(tables, table_gen.relationships)

print("\n" + "=" * 70)
print("SQL DDL - SNOWFLAKE DIALECT")
print("=" * 70)
for table_name, sql in ddl_statements.items():
    print(f"\n-- Table: {table_name}")
    print(sql)
    print()
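
As a rough picture of what DDL generation involves, the following standalone sketch infers a column type from sample values and emits a `CREATE TABLE` statement. It is not `DDLGenerator`'s logic; the type mapping and the helper names `infer_type` and `create_table_sql` are assumptions, and the sketch ignores keys, constraints, and relationships.

```python
# Assumed mapping from Python types to Snowflake column types.
SNOWFLAKE_TYPES = {bool: "BOOLEAN", int: "NUMBER", float: "FLOAT", str: "VARCHAR"}


def infer_type(values):
    """Pick a column type from the Python types of the non-null sample values."""
    kinds = {type(v) for v in values if v is not None}
    if len(kinds) == 1:
        return SNOWFLAKE_TYPES.get(kinds.pop(), "VARIANT")
    return "VARCHAR"  # mixed or all-null columns fall back to text


def create_table_sql(table, rows):
    """Build a CREATE TABLE statement from a list of row dictionaries."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    lines = [f"    {name.upper()} {infer_type(values)}" for name, values in columns.items()]
    return f"CREATE TABLE {table} (\n" + ",\n".join(lines) + "\n);"


services = [
    {"name": "api-service", "port": 8080, "replicas": 3, "cpu": "500m"},
    {"name": "cache-service", "port": 6379, "replicas": 2, "cpu": "250m"},
]

ddl = create_table_sql("SERVICES", services)
print(ddl)
```

Using `type(v)` rather than `isinstance` keeps booleans from being classified as integers, since `bool` is a subclass of `int` in Python.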

## Section 8: Compare Two YAML Files

Use YAMLComparator to identify differences between the two configuration versions.

In [None]:
# Initialize comparator
comparator = YAMLComparator(output_dir=TEMP_DIR / "comparisons")

print("\n" + "=" * 70)
print("YAML COMPARISON")
print("=" * 70)

# Load YAML files to databases for comparison
db1 = comparator.load_yaml_to_db(yaml_file_1, max_depth=None)
db2 = comparator.load_yaml_to_db(yaml_file_2, max_depth=None)

print(f"\n✓ Loaded {yaml_file_1.name} → {db1.name}")
print(f"✓ Loaded {yaml_file_2.name} → {db2.name}")

In [None]:
# Get table information for both databases
print("\n" + "-" * 70)
print("Table Information Comparison")
print("-" * 70)

tables_db1 = comparator.get_table_info(db1)
tables_db2 = comparator.get_table_info(db2)

print(f"\nDatabase 1 ({yaml_file_1.name}):")
for table_name in tables_db1.keys():
    print(f"  - {table_name}")

print(f"\nDatabase 2 ({yaml_file_2.name}):")
for table_name in tables_db2.keys():
    print(f"  - {table_name}")

In [None]:
# Get row counts to identify changes
row_counts_db1 = comparator.get_row_counts(db1)
row_counts_db2 = comparator.get_row_counts(db2)

print("\n" + "-" * 70)
print("Row Count Comparison")
print("-" * 70)

all_tables = set(row_counts_db1.keys()) | set(row_counts_db2.keys())
for table in sorted(all_tables):
    count1 = row_counts_db1.get(table, 0)
    count2 = row_counts_db2.get(table, 0)
    change = count2 - count1
    symbol = "+" if change > 0 else "-" if change < 0 else "="
    print(f"{table:20} {count1:3d} → {count2:3d} [{symbol}{abs(change)}]")
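
Row counts alone do not reveal in-place edits: between the two sample configurations, `api-service` changes from 3 to 5 replicas without affecting the count. A minimal standalone sketch of a row-level diff follows. It assumes rows can be matched on a natural key (here `name`), and the function `diff_rows` is hypothetical rather than part of `YAMLComparator`.

```python
def diff_rows(old_rows, new_rows, key="name"):
    """Classify rows as added, removed, or modified by matching on `key`."""
    old = {row[key]: row for row in old_rows}
    new = {row[key]: row for row in new_rows}
    modified = {}
    for k in sorted(old.keys() & new.keys()):
        columns = sorted(old[k].keys() | new[k].keys())
        # Record (old_value, new_value) for every column that differs.
        changes = {
            col: (old[k].get(col), new[k].get(col))
            for col in columns
            if old[k].get(col) != new[k].get(col)
        }
        if changes:
            modified[k] = changes
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "modified": modified,
    }


services_v1 = [
    {"name": "api-service", "port": 8080, "replicas": 3},
    {"name": "cache-service", "port": 6379, "replicas": 2},
]
services_v2 = [
    {"name": "api-service", "port": 8080, "replicas": 5},
    {"name": "cache-service", "port": 6379, "replicas": 2},
    {"name": "worker-service", "port": 9000, "replicas": 1},
]

result = diff_rows(services_v1, services_v2)
print(result)
```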

## Section 9: Generate Comparison Report

Create a detailed markdown report showing the differences between the two YAML files.

In [None]:
# Generate detailed comparison report
report_path = TEMP_DIR / "comparison_report.md"

report = comparator.compare_yaml_files(
    yaml1_path=yaml_file_1,
    yaml2_path=yaml_file_2,
    output_report=report_path,
    keep_dbs=True,
    root_table_name="CONFIG"
)

print(f"\n✓ Comparison report saved to: {report_path}")
print("\n" + "=" * 70)
print("COMPARISON REPORT (Preview)")
print("=" * 70)
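# Only flag truncation when the report is actually longer than the preview
preview = report[:1500]
if len(report) > 1500:
    preview += "...\n[Report truncated for display]"
print(preview)

# Show where the full report was written
print("\nFull report path:", report_path)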

## Section 10: Generate Markdown Documentation

Create comprehensive markdown documentation from the YAML configuration showing all tables and data.

In [None]:
# Generate markdown documentation
docs_dir = TEMP_DIR / "docs"
docs_dir.mkdir(exist_ok=True)

doc_path = generate_doc_from_yaml(
    yaml_path=yaml_file_1,
    output_dir=docs_dir,
    root_name="CONFIG",
    max_depth=None,  # Full flattening
    keep_db=False    # Remove temporary database
)

print(f"\n✓ Documentation generated: {doc_path}")
print(f"  File size: {doc_path.stat().st_size:,} bytes")

# Show preview of the documentation
with open(doc_path, encoding="utf-8") as f:
    doc_content = f.read()

print("\n" + "=" * 70)
print("DOCUMENTATION PREVIEW")
print("=" * 70)
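# Only flag truncation when the document is actually longer than the preview
doc_preview = doc_content[:1000]
if len(doc_content) > 1000:
    doc_preview += "...\n[Documentation truncated for display]"
print(doc_preview)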
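
For a sense of how table data ends up as markdown, here is a minimal standalone sketch that renders a list of row dictionaries as a pipe table. It is an illustration only; `to_markdown_table` is a hypothetical helper and does not reflect the layout that `generate_doc_from_yaml` actually produces.

```python
def to_markdown_table(rows):
    """Render a non-empty list of row dicts as a markdown pipe table."""
    headers = list(rows[0])  # column order follows the first row
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(row.get(h, "")) for h in headers) + " |")
    return "\n".join(lines)


rows = [
    {"name": "api-service", "port": 8080},
    {"name": "cache-service", "port": 6379},
]

table_md = to_markdown_table(rows)
print(table_md)
```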

## Section 11: Summary and Key Takeaways

Review the key features and use cases of YAML Shredder and Comparator.

In [None]:
print("\n" + "=" * 70)
print("SUMMARY - YAML SHREDDER & COMPARATOR CAPABILITIES")
print("=" * 70)

summary = """
✓ YAML SHREDDER FEATURES:
  1. Structure Analysis - Identify nested structures and data patterns
  2. Table Generation - Convert nested YAML/JSON into normalized tables
  3. Database Loading - Store transformed data in SQLite with indexes
  4. DDL Generation - Create SQL schemas for Snowflake, PostgreSQL, MySQL
  5. Depth Control - max_depth parameter controls flattening levels

✓ YAML COMPARATOR FEATURES:
  1. Database Loading - Convert YAML files to SQLite databases
  2. Schema Comparison - Identify table and column differences
  3. Row Count Analysis - Track data changes between versions
  4. Difference Detection - Automatically detect added/removed/modified rows
  5. Report Generation - Create detailed markdown comparison reports

✓ DOCUMENTATION GENERATOR FEATURES:
  1. Markdown Output - Generate comprehensive documentation
  2. Schema Details - Show table schemas with column types
  3. Data Preview - Display actual table data in markdown tables
  4. Smart Truncation - Preserve JSON objects, truncate regular text
  5. File Naming - Auto-name documents after source files

USE CASES:
  • Configuration Drift Detection - Compare YAML across environments
  • Data Pipeline Transformation - Normalize nested data for analytics
  • Schema Discovery - Infer database schemas from YAML examples
  • Change Tracking - Monitor configuration evolution over time
  • API Response Processing - Convert JSON API responses to tables
  • Environment Synchronization - Ensure config consistency
"""

print(summary)

print("\n" + "=" * 70)
print("FILES GENERATED IN THIS DEMO")
print("=" * 70)
print(f"Working Directory: {TEMP_DIR}")
for item in sorted(TEMP_DIR.rglob("*")):
    if item.is_file():
        rel_path = item.relative_to(TEMP_DIR)
        size = item.stat().st_size
        print(f"  {rel_path} ({size:,} bytes)")