<div align="center">

# Medallion Architecture Data Pipeline with LangChain Agents
Building a Modern Data Lakehouse using GenAI-powered ETL Orchestration

</div>

<div align="center">

## Cell 1: Install Dependencies and Restart Python
This cell installs the Python libraries used by the agentic pipeline: langchain, langchain-community, langchain-core, databricks-langchain, langgraph, playwright, and Pillow. Calling dbutils.library.restartPython() afterwards restarts the Python process so that the newly installed libraries are importable in the cells that follow.

</div>

In [None]:
%pip install langchain langchain-community langchain-core databricks-langchain langgraph playwright Pillow
dbutils.library.restartPython()

<div align="center">

## Cell 2: Configure LangChain and LangSmith
This cell sets up environment variables for LangChain and LangSmith. LangSmith is used for tracing, monitoring, and debugging LangChain applications. It's configured with an endpoint, API key (which you should replace with your actual key), and a project name for organizing runs.

</div>

In [None]:
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = "YOUR_LANGSMITH_API_KEY"
os.environ["LANGCHAIN_PROJECT"] = "Databricks - Medallion Pipeline"

print(f"LangSmith configured. Project: '{os.environ['LANGCHAIN_PROJECT']}'")
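
Hardcoding the key in the notebook makes it easy to leak or to forget replacing the placeholder. The following is a minimal sketch of a guard around the same four variables; the helper name `configure_langsmith`, its `env` parameter, and the secret scope and key names in the trailing comment are illustrative assumptions, not part of the original notebook.

```python
import os

PLACEHOLDER_KEY = "YOUR_LANGSMITH_API_KEY"  # the placeholder used in the cell above


def configure_langsmith(api_key, project, env=None):
    """Set the LangSmith variables, refusing an empty or placeholder key.

    `env` defaults to os.environ; passing a plain dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    if not api_key or api_key == PLACEHOLDER_KEY:
        # Fail fast instead of enabling tracing with an unusable key.
        raise ValueError("Provide a real LangSmith API key before enabling tracing")
    env["LANGCHAIN_TRACING_V2"] = "true"
    env["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
    env["LANGCHAIN_API_KEY"] = api_key
    env["LANGCHAIN_PROJECT"] = project
    return env


# In a Databricks notebook the key could come from a secret scope, for example
# (scope and key names are hypothetical):
#   api_key = dbutils.secrets.get(scope="my-scope", key="langsmith-api-key")
```

Passing the key in from a secret store keeps it out of the notebook source and its revision history.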

<div align="center">

## Cell 3: Import Dependencies
This cell imports everything the rest of the notebook relies on:
- LangChain integration with Databricks for LLM-powered agents
- LangGraph components for orchestrating the agent workflow
- PySpark SQL capabilities for scalable data transformations
- Core data science libraries for analysis and visualization

</div>

In [None]:
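# ChatDatabricks is provided by the databricks-langchain package installed in Cell 1;
# the langchain_community import path is deprecated.
from databricks_langchain import ChatDatabricks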
from langchain_core.messages import HumanMessage, AIMessage
from langchain.tools import tool
from langgraph.graph import StateGraph, END
from langgraph.prebuilt import ToolNode

from typing import Annotated, TypedDict, List, Optional
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_replace, when, lit, md5, concat_ws, expr, to_date, upper, sum as _sum, avg as _avg, count as _count, date_format, year, month
from pyspark.sql.window import Window
from pyspark.sql.types import *
import json
import time
from datetime import datetime
import random
from IPython.display import Image, display

<div align="center">

## Cell 4: Initialize Spark Session, LLM, and Database Schemas

This cell initializes the SparkSession, the large language model (LLM) for agentic operations, and creates the necessary database schemas (Bronze, Silver, Gold) following the Medallion Architecture pattern. A log_event utility function is also defined here to provide structured, colored logging throughout the pipeline execution.

</div>

In [None]:
def log_event(level, source, message):
    colors = {"AGENT": "\033[94m", "TOOL": "\033[93m", "INFO": "\033[92m", "ERROR": "\033[91m", "GRAPH": "\033[95m"}
    reset_color = "\033[0m"
    timestamp = datetime.now().strftime("%H:%M:%S.%f")[:-3]
    print(f"{timestamp} | {colors.get(level, '')}{level:<7}{reset_color} | {colors.get(level, '')}[{source}]{reset_color} {message}")

spark = SparkSession.builder.appName("AgenticDataPipeline").getOrCreate()
llm = ChatDatabricks(endpoint="databricks-claude-3-7-sonnet", temperature=0.0, max_tokens=8000)

spark.sql("CREATE SCHEMA IF NOT EXISTS ops_bronze")
spark.sql("CREATE SCHEMA IF NOT EXISTS ops_silver")
spark.sql("CREATE SCHEMA IF NOT EXISTS ops_gold")

log_event("INFO", "Setup", "Environment initialized with Claude 3.7 Sonnet")

<div align="center">

## Cell 5: Generate Bronze Layer Raw Data

This cell creates synthetic raw data for the Bronze layer. It generates data for customers_raw, transactions_raw, accounts_raw, and opportunities_raw tables within the ops_bronze schema. This step simulates the ingestion of raw, potentially messy, source data into the data lake.

</div>

In [None]:
def create_bronze_data():
    log_event("INFO", "DataGen", "Generating Bronze layer data...")
    
    customer_data = []
    for i in range(1, 101):
        name = f"FName{i} LName{i}" if random.random() > 0.1 else None
        email = f"fname{i}.lname{i}@email.com" if random.random() > 0.1 else "invalid-email"
        address = f"{i*10} Main St" if random.random() > 0.05 else None
        customer_data.append((i, name, email, address, f"2023-01-{random.randint(10, 28)}"))
    
    customer_data.extend([
        (1, 'John Doe', 'john.doe@email.com', '123 Elm St', '2023-01-15'), 
        (3, 'Peter Jones', 'peter.jones.dup@email.com', '789 Pine Ln', '2023-01-17')
    ])
    spark.createDataFrame(customer_data, ["customer_id", "name", "email", "address", "join_date"]).write.mode("overwrite").saveAsTable("ops_bronze.customers_raw")

    transactions_data = []
    for i in range(1, 201):
        cust_id = random.randint(1, 110)
        qty = random.randint(-5, 50)
        amount = f"${random.uniform(5, 1000):.2f}" if random.random() > 0.05 else None
        transactions_data.append((100+i, cust_id, qty, amount, f"2023-02-{random.randint(1, 28)}"))
    spark.createDataFrame(transactions_data, ["transaction_id", "customer_id", "quantity", "amount", "transaction_date"]).write.mode("overwrite").saveAsTable("ops_bronze.transactions_raw")

    accounts_data = [('ACC{:03d}'.format(i), f'GlobalCorp-{i}', random.choice(['Technology', 'Healthcare', 'Finance', 'TECH', None]), random.choice(['USA', 'UK', 'Germany'])) for i in range(1, 51)]
    accounts_data.append(('ACC001', 'Global Corporation', 'Tech', 'USA'))
    spark.createDataFrame(accounts_data, ["account_id", "account_name", "industry", "region"]).write.mode("overwrite").saveAsTable("ops_bronze.accounts_raw")
    
    opportunities_data = [(f'OPP{i:03d}', f"ACC{random.randint(1, 55):03d}", random.randint(-1000, 200000), '2024-07-15', random.choice(['Closed Won', 'Negotiation', 'Proposal', 'Qualification', 'Closed Lost'])) for i in range(1, 151)]
    spark.createDataFrame(opportunities_data, ["opportunity_id", "account_id", "value", "close_date", "stage"]).write.mode("overwrite").saveAsTable("ops_bronze.opportunities_raw")
    
    log_event("INFO", "DataGen", "Bronze data creation complete")
    display(spark.sql("SHOW TABLES IN ops_bronze"))

create_bronze_data()
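
To see which defects the Silver layer will have to handle, the customer generator can be reproduced in plain Python. This is a sketch under two assumptions that differ from the cell above: it uses a seeded local `random.Random` (the notebook uses the unseeded global generator, so its output changes on every run), and it inspects tuples directly instead of a Spark table. The function name `generate_customers` is hypothetical.

```python
import random
from collections import Counter


def generate_customers(seed=42):
    """Pure-Python copy of the customer generator above, with a fixed seed."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    rows = []
    for i in range(1, 101):
        name = f"FName{i} LName{i}" if rng.random() > 0.1 else None
        email = f"fname{i}.lname{i}@email.com" if rng.random() > 0.1 else "invalid-email"
        address = f"{i*10} Main St" if rng.random() > 0.05 else None
        rows.append((i, name, email, address, f"2023-01-{rng.randint(10, 28)}"))
    # The two hand-written rows reuse customer_id 1 and 3, creating duplicates.
    rows.extend([
        (1, "John Doe", "john.doe@email.com", "123 Elm St", "2023-01-15"),
        (3, "Peter Jones", "peter.jones.dup@email.com", "789 Pine Ln", "2023-01-17"),
    ])
    return rows


rows = generate_customers()
id_counts = Counter(row[0] for row in rows)
duplicate_ids = sorted(cid for cid, n in id_counts.items() if n > 1)

print("rows:", len(rows))
print("duplicate customer_ids:", duplicate_ids)
print("null names:", sum(1 for r in rows if r[1] is None))
print("invalid emails:", sum(1 for r in rows if r[2] == "invalid-email"))
print("null addresses:", sum(1 for r in rows if r[3] is None))
```

The duplicate ids are fixed by construction, while the null and invalid counts depend on the seed.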

<div align="center">

## Cell 6: Define Agent Tools

This cell defines the tool functions that the agents within the LangGraph workflow can use:

- **get_table_info**: Profiles a given table, providing schema, row count, and a data preview.
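- **execute_pyspark_code**: Runs generated PySpark code with exec() against the active Spark session and reports success or the error message as JSON. The code is not sandboxed, so the tool should only be used in a trusted workspace.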
- **create_notebook_visualization**: Generates and displays basic visualizations directly within the Databricks notebook.

These tools empower the LLM agents to interact with the data and environment.

</div>

In [None]:
@tool
def get_table_info(table_name: str) -> str:
    """Provides comprehensive table profiling including schema, statistics, and data preview for transformation planning."""
    try:
        log_event("TOOL", "get_table_info", f"Profiling: {table_name}")
        df = spark.table(table_name)
        schema_str = "\n".join([f"- {field.name}: {str(field.dataType)}" for field in df.schema.fields])
        count = df.count()
        preview = df.limit(3).toPandas().to_string()
        return f"TABLE: {table_name}\nROW_COUNT: {count}\n\nSCHEMA:\n{schema_str}\n\nSAMPLE_DATA:\n{preview}"
    except Exception as e:
        return f"ERROR: {str(e)}"

@tool
def execute_pyspark_code(code: str) -> str:
    """Executes PySpark transformation code with comprehensive error handling and validation."""
    try:
        log_event("TOOL", "execute_pyspark_code", f"Executing transformation...")
        exec_globals = {
            'spark': spark, 'col': col, 'regexp_replace': regexp_replace, 'when': when, 'lit': lit,
            'md5': md5, 'concat_ws': concat_ws, 'expr': expr, 'to_date': to_date, 'upper': upper,
            '_sum': _sum, '_avg': _avg, '_count': _count, 'date_format': date_format, 'year': year, 'month': month, 'Window': Window
        }
        exec(code, exec_globals)
        log_event("TOOL", "execute_pyspark_code", "SUCCESS")
        return json.dumps({"status": "success", "error": None})
    except Exception as e:
        error_msg = f"EXECUTION_ERROR: {str(e)}"
        log_event("ERROR", "execute_pyspark_code", error_msg)
        return json.dumps({"status": "error", "error": error_msg})
        
@tool
def create_notebook_visualization(table_name: str, plot_type: str, x_col: str, y_col: str, title: str) -> str:
    """Creates optimized dashboard visualizations with automatic data sorting and limiting for performance."""
    try:
        log_event("TOOL", "create_visualization", f"Creating: {title}")
        df = spark.table(table_name)
        if plot_type == 'bar' and 'Top' in title:
            df_display = df.orderBy(col(y_col).desc()).limit(5)
        else:
            df_display = df.orderBy(x_col)
        print(f"\n=== DASHBOARD COMPONENT: {title} ===")
        display(df_display)
        return f"SUCCESS: '{title}' visualization created and displayed"
    except Exception as e:
        return f"VISUALIZATION_ERROR: {str(e)}"

all_tools = [get_table_info, execute_pyspark_code, create_notebook_visualization]
tool_node = ToolNode(all_tools)
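
The execution tool follows a simple pattern: run a code string with `exec()` in a prepared namespace and report the outcome as JSON so that downstream routing can parse it. The sketch below isolates that pattern without Spark or LangChain; `run_snippet` is a hypothetical name, and including the exception type in the message is an addition the notebook's tool does not make.

```python
import json


def run_snippet(code, namespace=None):
    """Run a code string with exec() and report the outcome as a JSON string."""
    exec_globals = dict(namespace or {})  # copy so the caller's dict is not modified
    try:
        exec(code, exec_globals)
        return json.dumps({"status": "success", "error": None})
    except Exception as e:
        # The exception type gives a code-generating agent more to work with.
        return json.dumps({"status": "error", "error": f"{type(e).__name__}: {e}"})


print(run_snippet("total = base + 1", {"base": 41}))
print(run_snippet("undefined_name + 1"))
```

Note that `exec()` restricts nothing: the snippet can import modules, read files, or drop tables, so catching exceptions makes the tool robust but not safe.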

<div align="center">

## Cell 7: Define Agent States, Nodes, and Graph Workflow

This is the core of the agentic pipeline.

### AgentState
Defines the shared state object passed between agents in the LangGraph workflow, including messages, current task, PySpark code, review feedback, execution errors, and retry count.

### Agent Functions

- **planner_agent**: Generates a detailed data transformation plan
- **code_generator_agent**: Writes PySpark code based on the plan and feedback
- **code_reviewer_agent**: Reviews the generated code for quality and correctness 
- **prepare_for_execution**: Formats the PySpark code for tool execution

### Conditional Edges
- **after_review_decider**: Routes based on code review feedback
- **after_execution_decider**: Routes based on execution results

These enable an autonomous self-correction loop.

### Workflow Definition 
StateGraph defines the nodes (agents and tools) and edges (transitions) of the Medallion pipeline workflow.

</div>
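
The way partial updates from each node are folded into the shared state can be modelled in a few lines. This is a simplified sketch, not LangGraph's implementation: keys annotated with a reducer (here only `messages`, using the same list concatenation as the lambda in AgentState) are combined with the existing value, and every other key is overwritten. The names `concat_lists`, `REDUCERS`, and `apply_update` are hypothetical.

```python
def concat_lists(existing, new):
    """Reducer with the same behaviour as the lambda in AgentState: concatenate lists."""
    return existing + new


# Keys listed here are merged with a reducer; every other key is simply overwritten.
REDUCERS = {"messages": concat_lists}


def apply_update(state, update):
    """Simplified model of merging one node's partial update into the shared state."""
    merged = dict(state)
    for key, value in update.items():
        if key in REDUCERS and key in merged:
            merged[key] = REDUCERS[key](merged[key], value)
        else:
            merged[key] = value
    return merged


state = {"messages": ["Initialize transformation pipeline"], "retry_count": 0}
state = apply_update(state, {"messages": ["plan text"]})                      # appended
state = apply_update(state, {"retry_count": 1, "review_feedback": "APPROVED"})  # overwritten / added
print(state)
```

This is why a node only needs to return the keys it changes, and why returning a one-element `messages` list extends the history instead of replacing it.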

In [None]:
class AgentState(TypedDict):
    messages: Annotated[list, lambda x, y: x + y]
    current_task: str
    pyspark_code: Optional[str]
    review_feedback: Optional[str]
    execution_error: Optional[str]
    retry_count: int

def planner_agent(state: AgentState):
    log_event("AGENT", "Planner", "Creating transformation strategy")
    
    system_prompt = """You are an expert data architect specializing in Databricks Medallion architecture. 
    
    OBJECTIVES:
    - Create precise, executable PySpark transformation plans
    - Ensure data quality, deduplication, and proper type casting
    - Follow medallion best practices (Bronze=raw, Silver=cleaned, Gold=aggregated)
    - Design for scalability and performance optimization
    
    OUTPUT FORMAT:
    Provide a detailed, step-by-step technical plan with:
    1. Data profiling requirements
    2. Specific transformation logic
    3. Quality checks and validation rules
    4. Performance optimization strategies"""
    
    user_prompt = f"""TRANSFORMATION TASK: {state['current_task']}
    
    Create a comprehensive technical plan that addresses:
    - Data quality issues and remediation
    - Schema standardization and type casting
    - Deduplication strategies  
    - Business rule implementation
    - Performance optimization techniques"""
    
    messages = [HumanMessage(content=f"{system_prompt}\n\n{user_prompt}")]
    response = llm.invoke(messages)
    return {"messages": [AIMessage(content=response.content, name="PlannerAgent")]}

def code_generator_agent(state: AgentState):
    log_event("AGENT", "CodeGenerator", "Generating PySpark implementation")
    
    plan = next((msg.content for msg in state['messages'] if isinstance(msg, AIMessage) and msg.name == "PlannerAgent"), "")
    
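    context_prompt = ""
    last_message = state['messages'][-1]
    if state.get('review_feedback') and "APPROVED" not in state['review_feedback'].upper():
        context_prompt = f"\n\nREVIEW FEEDBACK TO ADDRESS:\n{state['review_feedback']}"
    elif getattr(last_message, "type", "") == "tool" and '"status": "success"' not in str(last_message.content):
        # A failed executor run arrives as the last tool message; conditional edges cannot
        # write execution_error into the state, so read the error from the message itself.
        context_prompt = f"\n\nEXECUTION ERROR TO FIX:\n{last_message.content}"
    elif state.get('execution_error'):
        context_prompt = f"\n\nEXECUTION ERROR TO FIX:\n{state['execution_error']}"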

    system_prompt = """You are a senior PySpark developer with expertise in Databricks optimization.

    CODING REQUIREMENTS:
    - Generate complete, executable PySpark code only
    - Use explicit imports for all functions
    - Implement robust error handling and data validation
    - Apply performance optimizations (caching, partitioning)
    - Follow PySpark best practices for large-scale data processing
    - Use .mode("overwrite").saveAsTable() for all writes
    
    CRITICAL CONSTRAINTS:
    - NO explanatory text, comments, or markdown
    - SparkSession 'spark' is pre-initialized
    - Code must be production-ready and optimized
    - Handle edge cases and null values appropriately"""

    user_prompt = f"""IMPLEMENTATION PLAN:\n{plan}{context_prompt}
    
    Generate the complete PySpark transformation code that:
    1. Implements all requirements from the plan
    2. Handles data quality issues robustly
    3. Optimizes for performance and scalability
    4. Produces clean, reliable output tables"""

    messages = [HumanMessage(content=f"{system_prompt}\n\n{user_prompt}")]
    response = llm.invoke(messages)
    clean_code = response.content.strip().replace("```python", "").replace("```", "").strip()
    return {"pyspark_code": clean_code, "execution_error": None, "review_feedback": None}

def code_reviewer_agent(state: AgentState):
    log_event("AGENT", "CodeReviewer", "Performing quality assurance review")
    
    plan = next((msg.content for msg in state['messages'] if isinstance(msg, AIMessage) and msg.name == "PlannerAgent"), "")
    
    system_prompt = """You are a senior data engineering QA specialist with deep PySpark expertise.

    REVIEW CRITERIA:
    - Code completeness and correctness
    - Performance optimization implementation
    - Data quality and validation logic
    - Error handling robustness
    - Adherence to transformation requirements
    - Production readiness standards

    RESPONSE FORMAT:
    - If code meets ALL requirements: respond with exactly "APPROVED"
    - If issues exist: respond with "REJECTION_REASON: [specific actionable feedback]"
    
    Be thorough but decisive. Code must be production-quality."""

    user_prompt = f"""ORIGINAL PLAN:\n{plan}\n\nIMPLEMENTATION CODE:\n```python\n{state['pyspark_code']}\n```
    
    Evaluate if the code fully implements the plan requirements with production-quality standards."""

    messages = [HumanMessage(content=f"{system_prompt}\n\n{user_prompt}")]
    response = llm.invoke(messages)
    return {"review_feedback": response.content}

def prepare_for_execution(state: AgentState):
    log_event("AGENT", "ExecutorPrep", "Preparing code execution")
    tool_call_message = AIMessage(
        content="", 
        tool_calls=[{
            'name': 'execute_pyspark_code', 
            'args': {'code': state.get('pyspark_code')}, 
            'id': f'exec_{datetime.now().isoformat()}'
        }]
    )
    return {"messages": [tool_call_message]}

def after_review_decider(state: AgentState):
    feedback = state.get('review_feedback', '')
    if "APPROVED" in feedback.upper():
        log_event("GRAPH", "Router", "Code approved → Execution")
        return "prepare_for_execution"
    else:
        log_event("GRAPH", "Router", f"Code rejected → Revision (attempt {state.get('retry_count', 0) + 1})")
        return "revise_code"

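def after_execution_decider(state: AgentState):
    last_message = state['messages'][-1]
    try:
        execution_result = json.loads(last_message.content)
    except (TypeError, ValueError):
        # Treat a non-JSON tool response as a failed execution
        execution_result = {"status": "error", "error": str(last_message.content)}

    # A conditional edge only chooses the next node; it cannot write to the state.
    # Count failed executions from the tool messages instead of relying on retry_count,
    # which is only incremented after a rejected review.
    failed_runs = sum(
        1 for msg in state['messages']
        if getattr(msg, "type", "") == "tool" and '"status": "success"' not in str(msg.content)
    )

    if execution_result.get("status") == "success":
        log_event("GRAPH", "Router", "Execution successful → Complete")
        return END
    elif failed_runs <= 3:
        log_event("GRAPH", "Router", f"Execution failed → Retry ({failed_runs}/3)")
        return "code_generator"
    else:
        log_event("ERROR", "Router", "Max retries exceeded → Abort")
        return END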

workflow = StateGraph(AgentState)
workflow.add_node("planner", planner_agent)
workflow.add_node("code_generator", code_generator_agent)
workflow.add_node("code_reviewer", code_reviewer_agent)
workflow.add_node("revise_code_node", lambda state: {"retry_count": state.get('retry_count', 0) + 1})
workflow.add_node("prepare_for_execution", prepare_for_execution)
workflow.add_node("executor", tool_node)

workflow.set_entry_point("planner")
workflow.add_edge("planner", "code_generator")
workflow.add_edge("code_generator", "code_reviewer")
workflow.add_conditional_edges("code_reviewer", after_review_decider, {
    "prepare_for_execution": "prepare_for_execution", 
    "revise_code": "revise_code_node"
})
workflow.add_edge("revise_code_node", "code_generator")
workflow.add_edge("prepare_for_execution", "executor")
workflow.add_conditional_edges("executor", after_execution_decider, {
    "code_generator": "code_generator", 
    "__end__": END
})

app = workflow.compile()
log_event("INFO", "Setup", "Enhanced LangGraph pipeline compiled")
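
The control flow of the compiled graph can be traced without an LLM by scripting the reviewer and executor outcomes. The sketch below is a simplified model, not the graph itself: `route_after_review` mirrors after_review_decider, the cap of three retries is taken from after_execution_decider, and counting execution failures directly is an assumption of this sketch. `simulate` and `MAX_RETRIES` are hypothetical names.

```python
MAX_RETRIES = 3  # retries allowed after a failed execution


def route_after_review(feedback):
    """Mirror of after_review_decider: proceed only when the feedback contains APPROVED."""
    return "prepare_for_execution" if "APPROVED" in (feedback or "").upper() else "revise_code"


def simulate(review_outcomes, execution_outcomes):
    """Walk the loop with scripted reviewer and executor results; return the visited nodes."""
    reviews, executions = iter(review_outcomes), iter(execution_outcomes)
    path, failures = ["planner"], 0
    while True:
        path += ["code_generator", "code_reviewer"]
        if route_after_review(next(reviews)) == "revise_code":
            path.append("revise_code_node")  # loops back to the code generator
            continue
        path += ["prepare_for_execution", "executor"]
        if next(executions) == "success":
            return path + ["END"]
        failures += 1
        if failures > MAX_RETRIES:
            return path + ["END (aborted)"]


path = simulate(
    ["REJECTION_REASON: missing deduplication", "APPROVED", "APPROVED"],
    ["error", "success"],
)
print(" -> ".join(path))
```

In this run the code is rejected once, approved, fails once at execution, and succeeds after regeneration, so the generator is visited three times and the executor twice.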

In [None]:
def create_workflow_diagram():
    """Create a text-based workflow diagram as fallback for visualization issues"""
    diagram = """
    ┌─────────────┐
    │   PLANNER   │
    │   AGENT     │
    └─────┬───────┘
          │
          ▼
    ┌─────────────┐
    │    CODE     │◄─────────┐
    │ GENERATOR   │          │
    └─────┬───────┘          │
          │                  │
          ▼                  │
    ┌─────────────┐          │
    │    CODE     │          │
    │  REVIEWER   │          │
    └─────┬───────┘          │
          │                  │
          ▼                  │
    ┌─────────────┐    ┌─────┴───────┐
    │  APPROVED?  │───►│   REVISE    │
    │             │    │    CODE     │
    └─────┬───────┘    └─────────────┘
          │
          ▼
    ┌─────────────┐
    │  EXECUTE    │
    │    CODE     │
    └─────┬───────┘
          │
          ▼
    ┌─────────────┐
    │  SUCCESS?   │
    │             │
    └─────────────┘
    """
    print("=== LANGGRAPH WORKFLOW DIAGRAM ===")
    print(diagram)

try:
    # Try to create visual graph with improved error handling
    try:
        # First try with mermaid text output (safer fallback)
        mermaid_code = app.get_graph().draw_mermaid()
        print("=== WORKFLOW MERMAID DIAGRAM ===")
        print(mermaid_code)
    except Exception as mermaid_error:
        log_event("INFO", "Visualization", f"Mermaid generation failed: {mermaid_error}")
        
    # Then try the ASCII rendering, falling back to the hand-drawn diagram
    try:
        if hasattr(app.get_graph(), 'draw_ascii'):
            ascii_graph = app.get_graph().draw_ascii()
            print("=== WORKFLOW ASCII DIAGRAM ===")
            print(ascii_graph)
        else:
            create_workflow_diagram()
    except Exception as ascii_error:
        log_event("INFO", "Visualization", f"ASCII visualization unavailable: {ascii_error}")
        create_workflow_diagram()
        
except Exception as e:
    log_event("INFO", "Visualization", "Using text-based workflow diagram")
    create_workflow_diagram()

<div align="center">

## Cell 9: Execute Bronze to Silver Transformations

This cell initiates the first major phase of the Medallion pipeline: transforming raw data from the Bronze layer to a cleaned and standardized Silver layer. It iterates through a list of predefined transformation tasks, each describing requirements for cleansing and validating data for specific tables (customers, transactions, accounts, opportunities). The execute_pipeline_task function orchestrates the agentic workflow for each task.

</div>

In [None]:
def execute_pipeline_task(task_description):
    task_name = task_description.strip().splitlines()[0][:50]
    log_event("INFO", "Pipeline", f"EXECUTING: {task_name}")
    
    initial_state = {
        "messages": [HumanMessage(content="Initialize transformation pipeline")], 
        "current_task": task_description, 
        "retry_count": 0
    }
    
    try:
        for state_update in app.stream(initial_state, {"recursion_limit": 20}):
            node_name = list(state_update.keys())[0]
            log_event("GRAPH", "Flow", f"Completed: {node_name}")
        
        log_event("INFO", "Pipeline", f"COMPLETED: {task_name}")
    except Exception as e:
        log_event("ERROR", "Pipeline", f"Task failed: {task_name} - {str(e)}")
    
    time.sleep(1)

log_event("INFO", "PIPELINE", "===== BRONZE → SILVER TRANSFORMATIONS =====")

bronze_to_silver_transformations = [
    """CUSTOMER DATA CLEANSING: ops_bronze.customers_raw → ops_silver.customers_cleaned
    
    REQUIREMENTS:
    - Remove duplicate customer_id records (keep first occurrence)  
    - Fill null 'name' values with 'Unknown Customer'
    - Validate email format using regex pattern '.+@.+\\..+' 
    - Set invalid emails to null
    - Convert join_date string to proper DateType
    - Filter out records with null email OR null address
    - Add data quality flags for tracking""",
    
    """TRANSACTION DATA STANDARDIZATION: ops_bronze.transactions_raw → ops_silver.transactions_cleaned
    
    REQUIREMENTS:
    - Deduplicate on transaction_id (keep first record)
    - Clean amount field: remove '$' prefix and convert to Decimal(10,2)
    - Filter out negative quantity transactions
    - Filter out null or zero amount transactions  
    - Restrict to customer_id <= 100 (valid customers only)
    - Convert transaction_date to DateType
    - Add calculated fields for analysis""",
    
    """ACCOUNT DATA NORMALIZATION: ops_bronze.accounts_raw → ops_silver.accounts_cleaned
    
    REQUIREMENTS:
    - Deduplicate by account_id, preserving first record
    - Standardize industry values to uppercase
    - Map 'TECH' industry to 'TECHNOLOGY' 
    - Replace null industry with 'NOT_SPECIFIED'
    - Validate account_id format consistency
    - Add data lineage tracking columns""",
    
    """OPPORTUNITY DATA VALIDATION: ops_bronze.opportunities_raw → ops_silver.opportunities_cleaned
    
    REQUIREMENTS:
    - Cast opportunity value to Decimal(14,2) with proper handling
    - Filter out opportunities with value <= 0
    - Validate account_id matches pattern 'ACC###'
    - Convert close_date string to DateType
    - Standardize stage values with proper casing
    - Add business rule validation flags"""
]

for task in bronze_to_silver_transformations:
    execute_pipeline_task(task)
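
The cleansing rules in these task descriptions are ultimately implemented by generated PySpark code, but their intent can be checked on single values in plain Python. The sketch below covers three of them: the email pattern from the customer task, the '$' stripping and two-decimal conversion from the transaction task, and date parsing. The helper names are hypothetical, and Python's `Decimal` and `datetime` stand in for Spark's DecimalType and DateType.

```python
import re
from datetime import datetime
from decimal import Decimal

EMAIL_PATTERN = re.compile(r".+@.+\..+")  # the pattern given in the customer task


def clean_email(email):
    """Return the email if it matches the pattern, else None."""
    return email if email and EMAIL_PATTERN.fullmatch(email) else None


def clean_amount(amount):
    """Strip a leading '$' and convert to a two-decimal Decimal; None stays None."""
    if amount is None:
        return None
    return Decimal(amount.lstrip("$")).quantize(Decimal("0.01"))


def clean_date(value):
    """Parse date strings such as '2023-02-7' (the generator does not zero-pad days)."""
    return datetime.strptime(value, "%Y-%m-%d").date()


print(clean_email("fname1.lname1@email.com"))
print(clean_email("invalid-email"))
print(clean_amount("$12.5"))
print(clean_date("2023-02-7"))
```

Small checks like these are useful as expected values when reviewing what the generated Silver tables actually contain.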

<div align="center">

## Cell 10: Execute Silver to Gold Aggregations 

This cell performs the second major phase of the Medallion pipeline: aggregating and enriching data from the Silver layer into the Gold layer. It defines and executes tasks for creating analytical tables like customer_spending, account_performance, and monthly_sales_summary. These aggregations prepare the data for business intelligence and reporting.

</div>

In [None]:
log_event("INFO", "PIPELINE", "===== SILVER → GOLD AGGREGATIONS =====")

silver_to_gold_aggregations = [
    """CUSTOMER SPENDING ANALYTICS: Create ops_gold.customer_spending
    
    REQUIREMENTS:
    - Join ops_silver.customers_cleaned with ops_silver.transactions_cleaned
    - Group by customer_id and customer name
    - Calculate total_spent (sum of amounts) with null handling
    - Calculate total_transactions (count of transactions)
    - Calculate average_transaction_value 
    - Add customer spending tier classification
    - Order by total_spent descending for performance""",
    
    """ACCOUNT PERFORMANCE METRICS: Create ops_gold.account_performance  
    
    REQUIREMENTS:
    - Join ops_silver.accounts_cleaned with ops_silver.opportunities_cleaned
    - Group by account_id, account_name, and industry
    - Calculate total_pipeline_value (sum of all opportunity values)
    - Calculate won_value (sum where stage = 'Closed Won')
    - Calculate win_rate as percentage of closed won vs total closed
    - Count open_opportunities (non-closed stages)
    - Add performance ranking within industry""",
    
    """MONTHLY SALES TRENDS: Create ops_gold.monthly_sales_summary
    
    REQUIREMENTS:
    - Source from ops_silver.transactions_cleaned
    - Extract year-month from transaction_date as 'YYYY-MM' format into a column named 'month'
    - Group by month period  
    - Calculate monthly_revenue (sum of amounts)
    - Calculate monthly_transaction_count
    - Calculate average_transaction_size per month
    - Add month-over-month growth calculations
    - Order chronologically for time series analysis"""
]

for task in silver_to_gold_aggregations:
    execute_pipeline_task(task)
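
As a worked example of the monthly sales task, the grouping and month-over-month growth can be computed by hand on a few rows. The transactions below are invented for illustration, and `monthly_summary` is a hypothetical helper; growth is defined here as the percentage change in revenue relative to the previous month, which is one common reading of the requirement.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical cleaned transactions: (transaction_date, amount)
transactions = [
    ("2023-01-05", Decimal("100.00")),
    ("2023-01-20", Decimal("50.00")),
    ("2023-02-03", Decimal("180.00")),
    ("2023-03-11", Decimal("90.00")),
]


def monthly_summary(rows):
    """Group by 'YYYY-MM' and add month-over-month revenue growth in percent."""
    revenue, counts = defaultdict(Decimal), defaultdict(int)
    for tx_date, amount in rows:
        month = tx_date[:7]  # 'YYYY-MM' prefix of an ISO date string
        revenue[month] += amount
        counts[month] += 1

    summary, previous = [], None
    for month in sorted(revenue):  # ISO month strings sort chronologically
        # No growth for the first month, or when the previous revenue is zero
        growth = None if not previous else (revenue[month] - previous) / previous * 100
        summary.append({
            "month": month,
            "monthly_revenue": revenue[month],
            "monthly_transaction_count": counts[month],
            "mom_growth_pct": growth,
        })
        previous = revenue[month]
    return summary


summary = monthly_summary(transactions)
for row in summary:
    print(row)
```

January totals 150.00, so February's 180.00 is a 20% increase and March's 90.00 a 50% decrease.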

In [None]:
log_event("INFO", "PIPELINE", "===== BUSINESS INTELLIGENCE DASHBOARD =====")

<div align="center">

## Cell 12: Generate and Validate Dashboard Visualizations

This cell uses the bi_agent (the LLM bound to the create_notebook_visualization tool) to generate the key business intelligence dashboard components. It then validates the Gold layer tables by displaying their record counts and top rows, so the pipeline's output can be checked against expectations.

</div>

In [None]:
dashboard_prompt = """You are a Business Intelligence specialist. Create executive dashboard visualizations using the create_notebook_visualization tool.

DASHBOARD REQUIREMENTS:
1. Customer Analysis: Top 5 customers by total spending (bar chart from ops_gold.customer_spending)
2. Industry Performance: Pipeline value distribution by industry (bar chart from ops_gold.account_performance)  
3. Revenue Trends: Monthly revenue progression over time (line chart from ops_gold.monthly_sales_summary)

Execute all three visualizations to complete the executive dashboard."""

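# Bind only the visualization tool: the response is handled in a single pass,
# so results from other tools would never be fed back to the model.
bi_agent = llm.bind_tools([create_notebook_visualization])
dashboard_response = bi_agent.invoke(dashboard_prompt)

if dashboard_response.tool_calls:
    tools_by_name = {t.name: t for t in all_tools}
    for tool_call in dashboard_response.tool_calls:
        result = tools_by_name[tool_call['name']].invoke(tool_call['args'])
        # Log the tool's own status string instead of assuming success
        level = "INFO" if str(result).startswith("SUCCESS") else "ERROR"
        log_event(level, "Dashboard", f"{tool_call['args'].get('title', 'Unknown')}: {result}")
else:
    log_event("ERROR", "Dashboard", "The model returned no tool calls, so no visualizations were created")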

log_event("INFO", "VALIDATION", "===== FINAL DATA VALIDATION =====")

validation_tables = [
    "ops_gold.customer_spending",
    "ops_gold.account_performance", 
    "ops_gold.monthly_sales_summary"
]

for table in validation_tables:
    try:
        print(f"\n=== FINAL VALIDATION: {table} ===")
        df = spark.table(table)
        print(f"Record Count: {df.count()}")
        if "customer_spending" in table:
            display(df.orderBy(col("total_spent").desc()).limit(10))
        elif "account_performance" in table:
            display(df.orderBy(col("total_pipeline_value").desc()).limit(10))
        else:
            display(df.orderBy("month").limit(12))
    except Exception as e:
        log_event("ERROR", "Validation", f"Failed to validate {table}: {str(e)}")

log_event("INFO", "COMPLETION", "===== MEDALLION PIPELINE EXECUTION COMPLETE =====")
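
The tool-call handling in this cell is a name-based dispatch: each entry returned by the model carries a `name` and an `args` dictionary, and the matching function is looked up and invoked. The sketch below shows that pattern with a stand-in function instead of a LangChain tool; `dispatch_tool_calls` and `fake_visualization` are hypothetical, and handling unknown tool names and per-call exceptions goes beyond what the cell does.

```python
def fake_visualization(table_name, plot_type, x_col, y_col, title):
    """Stand-in for create_notebook_visualization; returns a status string only."""
    return f"SUCCESS: '{title}' visualization created and displayed"


def dispatch_tool_calls(tool_calls, tools_by_name):
    """Run each requested tool and collect (name, result) pairs."""
    results = []
    for call in tool_calls:
        tool_fn = tools_by_name.get(call["name"])
        if tool_fn is None:
            # The model may ask for a tool that was never registered
            results.append((call["name"], f"ERROR: unknown tool '{call['name']}'"))
            continue
        try:
            results.append((call["name"], tool_fn(**call["args"])))
        except Exception as e:
            # Wrong or missing arguments end up here instead of stopping the loop
            results.append((call["name"], f"ERROR: {e}"))
    return results


calls = [
    {"name": "create_notebook_visualization",
     "args": {"table_name": "ops_gold.customer_spending", "plot_type": "bar",
              "x_col": "name", "y_col": "total_spent", "title": "Top 5 Customers"}},
    {"name": "not_a_tool", "args": {}},
]
results = dispatch_tool_calls(calls, {"create_notebook_visualization": fake_visualization})
for name, result in results:
    print(name, "->", result)
```

Collecting results instead of logging success unconditionally makes it visible when a requested visualization was not produced.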

In [None]:
print("\n🎉 SUCCESS: Databricks Medallion Architecture pipeline completed successfully!")
print("✅ All Bronze → Silver → Gold transformations executed")
print("✅ Business Intelligence dashboard components created")
print("✅ Data quality validation completed")