# Sam's Databricks Tooling - Comprehensive Workflow Demo

This notebook demonstrates the complete workflow using all four tools in the Databricks tooling suite:

1. **`tool__workstation`** - Spark session management
2. **`tool__dag_chainer`** - DataFrame workflow orchestration
3. **`tool__table_polisher`** - Data standardization
4. **`tool__table_indexer`** - Entity indexing with persistence

## Philosophy: Unix-like Tool Composition

These tools follow the Unix philosophy of building small, focused components that can be composed together to create powerful data processing workflows.
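To make the composition idea concrete, here is a minimal sketch in plain Python. The `compose` helper and the two list-based steps are hypothetical stand-ins, not part of the tooling suite; they only illustrate how small single-purpose steps can be chained.

```python
from functools import reduce
from typing import Callable, TypeVar

T = TypeVar("T")


def compose(*steps: Callable[[T], T]) -> Callable[[T], T]:
    """Return a function that applies each step in order, left to right."""
    def pipeline(value: T) -> T:
        return reduce(lambda acc, step: step(acc), steps, value)
    return pipeline


# Hypothetical single-purpose steps operating on plain lists of strings.
def strip_names(rows: list[str]) -> list[str]:
    return [row.strip() for row in rows]


def lowercase_names(rows: list[str]) -> list[str]:
    return [row.lower() for row in rows]


clean = compose(strip_names, lowercase_names)
print(clean(["  ACME Corp  ", "acme corp"]))  # ['acme corp', 'acme corp']
```

Each step knows nothing about the others, so steps can be reordered, removed, or tested in isolation.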

In [None]:
# Quick Spark Test - Run this first to verify setup
print("🔧 Testing Spark Setup...")

import os
# Point Spark at Java 17 (Homebrew path on Apple Silicon macOS; adjust for your machine)
os.environ['JAVA_HOME'] = '/opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home'

try:
    from tool__workstation import get_spark
    spark = get_spark('local_basic')
    print(f"✅ Spark {spark.version} session created successfully!")
    print(f"☕ Java Home: {os.environ.get('JAVA_HOME')}")
    
    # Simple test
    test_df = spark.range(3).toDF("number")
    count = test_df.count()
    print(f"🧪 Test DataFrame created with {count} rows")
    
    spark.stop()
    print("🔥 Spark session stopped - ready for main demo!")
except Exception as e:
    print(f"❌ Setup Error: {e}")
    print("💡 Make sure Java 17 is installed: brew install openjdk@17")
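If the setup check fails, it can help to see which Java runtime is actually being picked up. The sketch below uses only the standard library; `describe_java` is a hypothetical helper name, not part of `tool__workstation`. It checks `JAVA_HOME` first and then falls back to `PATH`.

```python
import os
import shutil
import subprocess
from typing import Optional


def describe_java() -> Optional[str]:
    """Return the first line of `java -version`, or None if no java binary is found."""
    java_home = os.environ.get("JAVA_HOME")
    candidate = os.path.join(java_home, "bin", "java") if java_home else None
    # Prefer the binary under JAVA_HOME; otherwise look on PATH.
    java_bin = candidate if candidate and os.path.isfile(candidate) else shutil.which("java")
    if java_bin is None:
        return None
    try:
        result = subprocess.run(
            [java_bin, "-version"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    # `java -version` conventionally writes to stderr; fall back to stdout just in case.
    output = (result.stderr or result.stdout).strip()
    return output.splitlines()[0] if output else None


print(describe_java() or "No Java runtime found via JAVA_HOME or PATH")
```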

In [3]:
# Step 1: Import all tools and initialize workstation
print("🚀 Initializing Sam's Databricks Tooling Suite")
print("=" * 50)

# Import all our tools
from tool__workstation import get_spark, is_spark_active, spark_health_check
from tool__dag_chainer import DagChain
from tool__table_polisher import polish
from tool__table_indexer import TableIndexer

# Initialize Spark session using workstation
spark = get_spark("local_delta")
print(f"✅ Spark session active: {spark.version}")

# Verify health
health = spark_health_check()
print(f"📊 Delta Lake enabled: {health['delta_enabled']}")
print(f"🔧 Session ID: {health['session_id']}")
print()

🚀 Initializing Sam's Databricks Tooling Suite
✅ Spark session active: 4.0.1
📊 Delta Lake enabled: True
🔧 Session ID: local-1758552298134



25/09/22 09:44:58 WARN SparkSession: Cannot use io.delta.sql.DeltaSparkSessionExtension to configure session extensions.
java.lang.ClassNotFoundException: io.delta.sql.DeltaSparkSessionExtension
	at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:445)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:592)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
	at java.base/java.lang.Class.forName0(Native Method)
	at java.base/java.lang.Class.forName(Class.java:467)
	at org.apache.spark.util.SparkClassUtils.classForName(SparkClassUtils.scala:41)
	at org.apache.spark.util.SparkClassUtils.classForName$(SparkClassUtils.scala:36)
	at org.apache.spark.util.Utils$.classForName(Utils.scala:99)
	at org.apache.spark.sql.classic.SparkSession$.$anonfun$applyExtensions$2(SparkSession.scala:1056)
	at org.apache.spark.sql.classic.SparkSession$.$anonfun$applyExtensions$2$adapted(SparkSession.scala:1054)
	at scala.collection.IterableOnceOps.foreach(IterableOnce
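
The `ClassNotFoundException` in the log above usually means the Delta Lake JARs were not on the Spark classpath when the session started, even though the health check reported Delta as enabled. How `get_spark("local_delta")` configures the session is not shown here, so the following is only a sketch of the settings a Delta-enabled local session commonly needs. The helper name `delta_session_conf` is hypothetical, and the Delta version `4.0.0` is an assumption; it has to match the installed Spark and Scala versions.

```python
def delta_session_conf(delta_version: str, scala_version: str = "2.13") -> dict[str, str]:
    """Build Spark settings that pull in Delta Lake and register its extension and catalog."""
    return {
        # Maven coordinates resolved at session start (needs network access on first run).
        "spark.jars.packages": f"io.delta:delta-spark_{scala_version}:{delta_version}",
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    }


conf = delta_session_conf("4.0.0")  # assumed version; check compatibility with your Spark
for key, value in conf.items():
    print(f"{key}={value}")

# Applying the settings requires pyspark and must happen before the session is created:
#   builder = SparkSession.builder.master("local[*]")
#   for key, value in conf.items():
#       builder = builder.config(key, value)
#   spark = builder.getOrCreate()
```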

In [None]:
# Step 2: Load sample demand data and create workflow chain
print("📊 Loading Sample Demand Data")
print("-" * 30)

# Create sample demand planning data with realistic business scenario
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, DateType
from datetime import date

# Sample data schema
schema = StructType([
    StructField("Customer Name", StringType(), True),
    StructField("Plant_Location", StringType(), True), 
    StructField("Material-Code", StringType(), True),
    StructField("Demand Qty", IntegerType(), True),
    StructField("Forecast_Amount", DoubleType(), True),
    StructField("Date", DateType(), True)
])

# Sample data with messy column names and values (realistic scenario)
sample_data = [
    ("  ACME Corp  ", "PLANT_001", "MAT-12345", 1000, 95000.50, date(2024, 1, 15)),
    ("acme corp", "plant_001", "mat-12345", 1200, 110000.00, date(2024, 2, 15)),
    ("Global Industries", "PLANT_002", "MAT-67890", 800, 75000.25, date(2024, 1, 20)),
    ("  Tech Solutions LLC  ", "Plant_003", "mat-11111", 1500, 140000.00, date(2024, 1, 25)),
    ("ACME Corp", "Plant_001", "MAT-22222", 900, 85000.75, date(2024, 2, 10)),
    (None, "PLANT_002", "mat-67890", 600, 55000.00, date(2024, 2, 20)),
    ("tech solutions llc", "PLANT_003", "MAT-11111", 1800, 165000.50, date(2024, 2, 25))
]

# Create DataFrame with messy data
df_raw_demand = spark.createDataFrame(sample_data, schema)

# Initialize workflow chain
chain = DagChain()
chain.dag__raw_import = df_raw_demand

print("✅ Raw data loaded into chain")
chain.look(0)  # Show raw data
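
The attribute-assignment pattern above (`chain.dag__raw_import = ...`) can be illustrated with a small stand-in class. This is not the `DagChain` implementation; `MiniChain` and its `pick` method are hypothetical and only show one way a container could record `dag__`-prefixed attributes in insertion order and retrieve them by position.

```python
class MiniChain:
    """Hypothetical stand-in for a DagChain-like container (not the real implementation)."""

    PREFIX = "dag__"

    def __init__(self) -> None:
        # Use object.__setattr__ so the internal store bypasses our own hook.
        object.__setattr__(self, "_steps", {})

    def __setattr__(self, name: str, value: object) -> None:
        if name.startswith(self.PREFIX):
            self._steps[name] = value  # dicts preserve insertion order
        object.__setattr__(self, name, value)

    def pick(self, index: int) -> tuple[str, object]:
        """Return (name, value) of the step at a position; negative indexes count from the end."""
        names = list(self._steps)
        name = names[index]
        return name, self._steps[name]


chain = MiniChain()
chain.dag__raw_import = ["raw"]
chain.dag__polished_data = ["polished"]
print(chain.pick(0))   # ('dag__raw_import', ['raw'])
print(chain.pick(-1))  # ('dag__polished_data', ['polished'])
```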

In [None]:
# Step 3: Apply table polisher for data standardization
print("🧹 Applying Table Polisher - Data Standardization")
print("-" * 45)

# Apply polish() function to standardize the messy data
chain.dag__polished_data = polish(chain.dag__raw_import)

print("✅ Data standardized with table polisher")
print("📋 Notice the changes:")
print("   • Column names: lowercase, special chars → underscores")
print("   • Customer values: trimmed, normalized")
print("   • Consistent ordering: key columns first")
print()

# Show the polished data
chain.look(-1)  # Look at latest (polished data)

print("\n📊 Comparison - Before vs After Polishing:")
print("   BEFORE: 'Customer Name', 'Plant_Location', 'Material-Code'")
print("   AFTER:  'customer_name', 'plant_location', 'material_code'")
print("   + Data values cleaned and normalized")
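
The column renaming described above (lowercase, special characters to underscores) can be sketched with a regular expression. This is an illustration of the rule, not the code inside `polish()`; both helper names are hypothetical, and lowercasing cell values is an assumption, since the notebook only says values are "trimmed, normalized".

```python
import re
from typing import Optional


def normalize_column_name(name: str) -> str:
    """Lowercase a column name and replace runs of non-alphanumerics with one underscore."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return cleaned.strip("_").lower()


def normalize_text_value(value: Optional[str]) -> Optional[str]:
    """Trim, collapse inner whitespace, and lowercase a value (lowercasing is an assumption)."""
    if value is None:
        return None
    return " ".join(value.split()).lower()


for column in ["Customer Name", "Plant_Location", "Material-Code", "Demand Qty"]:
    print(f"{column!r} -> {normalize_column_name(column)!r}")

print(normalize_text_value("  ACME Corp  "))  # acme corp
```

With PySpark, the renamed columns could then be applied in one call via `df.toDF(*[normalize_column_name(c) for c in df.columns])`.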

In [None]:
# Step 4: Apply TableIndexer for persistent entity indexing
print("🔢 Applying Table Indexer - Entity Indexing")
print("-" * 40)

# Create TableIndexer with polished data
indexer = TableIndexer(chain.dag__polished_data)

print("📊 Indexing entities:")
print("   • Customers: Persistent mapping to consecutive integers")
print("   • Plants: Separate index mapping")  
print("   • Materials: Separate index mapping")
print("   • Stored in Delta tables for consistency across runs")
print()

# Apply indexing for each entity type
chain.dag__indexed_customers = indexer.customer("customer_name")
chain.dag__indexed_plants = indexer.plant("plant_location") 
chain.dag__indexed_materials = indexer.material("material_code")

print("✅ Entity indexing completed")
print("📋 New columns added:")
print("   • Index__customer_name - Customer indices")
print("   • Index__plant_location - Plant indices")
print("   • Index__material_code - Material indices")
print()

# Show final indexed data
chain.look(-1)  # Look at materials (latest)
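
The indexing behaviour described above (known entities keep their index, new entities get the next consecutive one) can be sketched without Spark. `extend_index` is a hypothetical helper operating on a plain dict instead of a Delta table; starting at 0 and skipping `None` values are assumptions, not documented behaviour of `TableIndexer`.

```python
from typing import Iterable, Optional


def extend_index(existing: dict[str, int], values: Iterable[Optional[str]]) -> dict[str, int]:
    """Return a copy of `existing` where unseen values receive the next consecutive indices."""
    mapping = dict(existing)
    # Assumption: indices start at 0 and continue from the current maximum.
    next_index = max(mapping.values(), default=-1) + 1
    for value in values:
        if value is None or value in mapping:
            continue  # nulls are skipped; known entities keep their index
        mapping[value] = next_index
        next_index += 1
    return mapping


first_run = extend_index({}, ["acme corp", "global industries", "acme corp", None])
print(first_run)   # {'acme corp': 0, 'global industries': 1}

second_run = extend_index(first_run, ["tech solutions llc", "acme corp"])
print(second_run)  # {'acme corp': 0, 'global industries': 1, 'tech solutions llc': 2}
```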

In [None]:
# Step 5: Workflow inspection and analysis
print("🔍 Workflow Analysis & Chain Inspection")
print("-" * 38)

print("📊 Complete workflow trace:")
chain.trace(shape=True)  # Show all DataFrames with row counts

print("\n🎯 Business Value Demonstration:")
print("   • Started with messy, inconsistent data")
print("   • Applied systematic standardization")  
print("   • Created persistent entity mappings")
print("   • Ready for ML models and analytics")

print("\n💾 Persistence Features:")
print("   • Entity indices persist across Spark sessions")
print("   • New entities get consecutive indices")
print("   • Race-condition safe with Delta MERGE")
print("   • Catalog: test_catalog.supply_chain")

print("\n🔧 Tool Composition Benefits:")
print("   • Each tool focused on single responsibility")
print("   • Tools compose naturally in pipelines")
print("   • Consistent session management via workstation")
print("   • Visible intermediate steps via dag chainer")

In [None]:
# Step 6: Demonstrate Delta table persistence
print("💾 Delta Table Persistence Demo")
print("-" * 30)

print("📋 Created mapping tables:")

# Show the mapping tables that were created
mapping_tables = [
    "test_catalog.supply_chain.mapping__active_customers",
    "test_catalog.supply_chain.mapping__active_plants", 
    "test_catalog.supply_chain.mapping__active_materials"
]

for table in mapping_tables:
    try:
        df_mapping = spark.table(table)
        count = df_mapping.count()
        print(f"   ✅ {table}: {count} entities")
        if count > 0:
            print(f"      Sample: {df_mapping.take(2)}")
    except Exception as e:
        # The table may not exist yet, or the lookup may have failed for another reason
        print(f"   ❌ {table}: not available ({type(e).__name__})")

print("\n🔄 Persistence Test:")
print("   • Run this notebook again - entities keep same indices")
print("   • Add new entities - they get next available indices") 
print("   • Multiple concurrent sessions - no conflicts (Delta MERGE)")

print("\n📈 Production Ready Features:")
print("   • Auto-compaction enabled on mapping tables")
print("   • Write optimization enabled")
print("   • Catalog/schema auto-creation")
print("   • Comprehensive error handling and logging")
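
The Delta `MERGE` mentioned above can be sketched as an insert-only statement: rows whose key already exists in the mapping table are left alone, and only unseen keys are inserted. The helper name, the source view `new_customers`, and the key column `customer_name` are hypothetical; the real schema of the mapping tables is not shown in this notebook, and the sketch does not cover how new indices are computed.

```python
def build_insert_only_merge(target_table: str, source_view: str, key_column: str) -> str:
    """Build a Delta MERGE statement that inserts only rows whose key is not yet in the target."""
    return (
        f"MERGE INTO {target_table} AS t\n"
        f"USING {source_view} AS s\n"
        f"ON t.{key_column} = s.{key_column}\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


sql = build_insert_only_merge(
    "test_catalog.supply_chain.mapping__active_customers",
    "new_customers",   # hypothetical temp view holding candidate rows
    "customer_name",   # hypothetical key column
)
print(sql)

# With an active Delta-enabled session, the statement would be run as: spark.sql(sql)
```

`INSERT *` requires the source view to expose the same columns as the target table, so the index column would need to be computed before the merge.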

## Summary: Complete Toolset Integration

This demo applied **Unix-like tool composition** to a demand planning workflow: four small tools, each with one job, chained into a single pipeline.

### Tools Used:
1. **Workstation** → Centralized Spark session management
2. **DAG Chainer** → Workflow orchestration with visibility
3. **Table Polisher** → Consistent data standardization  
4. **Table Indexer** → Persistent entity indexing

### Key Benefits:
- **Composable**: Tools work together naturally
- **Visible**: Every step tracked and inspectable
- **Persistent**: Entity mappings survive session restarts
- **Race-safe**: Concurrent access handled properly
- **Scalable**: Ready for production Databricks environments

### Next Steps:
- Load your own CSV/Parquet data instead of sample data
- Extend with additional business logic transformations
- Write final results to gold layer tables using `chain.write()`
- Create custom tools following the same patterns
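
For the first step in the list, a minimal sketch of loading your own file with the standard PySpark reader might look like this. `load_table` is a hypothetical helper, and the path in the usage comment is a placeholder; the function only dispatches on the file extension and leaves schema handling to Spark.

```python
import os


def load_table(spark, path: str):
    """Read a CSV or Parquet file into a DataFrame, chosen by file extension."""
    extension = os.path.splitext(path)[1].lower()
    if extension == ".csv":
        # Treat the first row as a header and let Spark infer column types.
        return (
            spark.read
            .option("header", "true")
            .option("inferSchema", "true")
            .csv(path)
        )
    if extension == ".parquet":
        return spark.read.parquet(path)
    raise ValueError(f"Unsupported file type: {extension or '(none)'}")


# Usage with the chain from this notebook (placeholder path):
#   chain.dag__raw_import = load_table(spark, "data/demand.csv")
```

Schema inference reads the file an extra time, so for large CSV inputs an explicit schema like the `StructType` in Step 2 is usually the better choice.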

*The toolset is designed to grow with your needs while maintaining simplicity and composability.*