# Star Schema Builder for Fabric Monitoring Analytics

## Overview
This notebook transforms raw Monitor Hub activity data into a Kimball-style star schema suitable for SQL queries, semantic models, and Power BI reports.

It is designed to work in both:
- **Microsoft Fabric notebooks** (paths auto-resolve under `/lakehouse/default/Files/`), and
- **Local dev** (writes under `exports/` by default).
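
The dual-environment behaviour described above can be pictured with a small helper. This is only a sketch under stated assumptions, not the package's own `resolve_path`: the name `resolve_output_path` and the rule "use the lakehouse mount when it exists, otherwise the current working directory" are hypothetical and chosen for illustration.

```python
from pathlib import Path

# Lakehouse mount point named in the overview above.
FABRIC_FILES_ROOT = Path("/lakehouse/default/Files")


def resolve_output_path(relative: str, fabric_root: Path = FABRIC_FILES_ROOT) -> Path:
    """Hypothetical helper: place a relative path under Fabric or local storage."""
    rel = Path(relative)
    if rel.is_absolute():
        # Absolute paths are returned unchanged.
        return rel
    if fabric_root.exists():
        # Assumption: the mount exists only inside a Fabric notebook.
        return fabric_root / rel
    # Local development: resolve against the current working directory.
    return Path.cwd() / rel


print(resolve_output_path("exports/star_schema"))
```

Passing `fabric_root` explicitly makes the helper easy to exercise locally without a lakehouse attached.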

## Star Schema Tables
- **Dimensions**: dim_date, dim_time, dim_workspace, dim_item, dim_user, dim_activity_type, dim_status
- **Facts**: fact_activity, fact_daily_metrics
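
To make the relationship between the two fact tables concrete, the sketch below rolls a toy `fact_activity` up into a `fact_daily_metrics`-shaped table with pandas. The column names come from the queries later in this notebook; the grain (one row per date and workspace) and the `YYYYMMDD` integer form of `date_sk` are assumptions, and the package's own builder may differ.

```python
import pandas as pd

# Toy fact_activity rows (values are made up for illustration).
fact_activity = pd.DataFrame({
    "activity_id": ["a1", "a2", "a3", "a4"],
    "date_sk": [20240101, 20240101, 20240101, 20240102],  # assumed YYYYMMDD keys
    "workspace_sk": [1, 1, 2, 1],
    "user_sk": [10, 11, 10, 10],
    "is_failed": [0, 1, 0, 0],
})

# Assumed grain of fact_daily_metrics: one row per (date, workspace).
fact_daily_metrics = (
    fact_activity
    .groupby(["date_sk", "workspace_sk"], as_index=False)
    .agg(
        total_activities=("activity_id", "count"),
        unique_users=("user_sk", "nunique"),
        failed_activities=("is_failed", "sum"),
    )
)

print(fact_daily_metrics)
```

Because `unique_users` is a distinct count per workspace and day, it is not additive: summing it across workspaces can count the same user more than once.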

## Key Features
- Incremental loading with high-water mark tracking
- SCD Type 2 support for slowly changing dimensions
- Automatic surrogate key generation
- Pre-aggregated daily metrics for fast dashboards
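
The first and third features can be sketched in a few lines of pandas. This is an illustrative sketch, not the `StarSchemaBuilder` implementation: the helper names `filter_new_rows` and `assign_surrogate_keys` and the natural key column `workspace_id` are hypothetical.

```python
import pandas as pd


def filter_new_rows(df: pd.DataFrame, time_col: str, high_water_mark) -> pd.DataFrame:
    """Keep only rows strictly newer than the stored high-water mark."""
    if high_water_mark is None:
        # First run: there is no mark yet, so every row is new.
        return df
    return df[df[time_col] > pd.Timestamp(high_water_mark)]


def assign_surrogate_keys(dim: pd.DataFrame, incoming_keys, natural_key: str, sk_col: str) -> pd.DataFrame:
    """Append unseen natural keys to a dimension using the next integer surrogate keys."""
    known = set(dim[natural_key])
    # dict.fromkeys de-duplicates while preserving first-seen order.
    new_keys = [k for k in dict.fromkeys(incoming_keys) if k not in known]
    if not new_keys:
        return dim
    next_sk = int(dim[sk_col].max()) + 1 if len(dim) else 1
    new_rows = pd.DataFrame({
        sk_col: range(next_sk, next_sk + len(new_keys)),
        natural_key: new_keys,
    })
    if dim.empty:
        return new_rows
    return pd.concat([dim, new_rows], ignore_index=True)


# Toy data: one workspace is already known, two arrive after the last load.
activities = pd.DataFrame({
    "workspace_id": ["ws-a", "ws-b", "ws-c"],
    "start_time": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
})
dim_workspace = pd.DataFrame({"workspace_sk": [1], "workspace_id": ["ws-a"]})

new_rows = filter_new_rows(activities, "start_time", "2024-01-01")
dim_workspace = assign_surrogate_keys(
    dim_workspace, new_rows["workspace_id"], "workspace_id", "workspace_sk"
)
new_mark = new_rows["start_time"].max()  # would be persisted for the next run

print(dim_workspace)
print(f"New high-water mark: {new_mark}")
```

Existing surrogate keys are never reassigned, so fact rows written by earlier runs stay valid when new dimension members are appended.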

## How to Use
1. **Install Package** (first run only): Uncomment and run the pip install cell
2. **Configure Paths**: Set INPUT_DIR and OUTPUT_DIR for your environment
3. **Run Pipeline**: Execute the build cells
4. **Optional**: Convert to Delta tables for SQL Endpoint access

## Package Installation
<span style="color:red">pip install is only required on first run. Uncomment and run once, then re-comment.</span>

In [None]:
# %pip install /lakehouse/default/Files/usf_fabric_monitoring-0.3.0-py3-none-any.whl --force-reinstall

## Setup Local Path (For Local Development)

In [None]:
# SETUP LOCAL PATH (For Local Development)
import sys
import os
from pathlib import Path

# Add the src directory to sys.path to allow importing the local package
# This is necessary when running locally without installing the package
current_dir = Path(os.getcwd())

# Check if we are in notebooks directory
if current_dir.name == "notebooks":
    src_path = current_dir.parent / "src"
else:
    # Assume we are in project root
    src_path = current_dir / "src"

if src_path.exists() and str(src_path) not in sys.path:
    sys.path.insert(0, str(src_path))
    print(f"✅ Added {src_path} to sys.path (local development mode)")
else:
    print("ℹ️ Running in Fabric or package already installed")

In [None]:
# Force reload of modules to pick up code changes
import importlib
import usf_fabric_monitoring.core.star_schema_builder as ssb
importlib.reload(ssb)
print("✅ Modules reloaded")

In [None]:
# Package / environment verification (safe: no Azure/API imports)
from importlib.metadata import PackageNotFoundError, version
import importlib
import usf_fabric_monitoring
from usf_fabric_monitoring.core.utils import resolve_path

try:
    pkg_version = getattr(usf_fabric_monitoring, "__version__", None) or version("usf_fabric_monitoring")
except PackageNotFoundError:
    pkg_version = "unknown"

print(f"usf_fabric_monitoring version: {pkg_version}")
print(f"Resolved output dir example: {resolve_path('exports/star_schema')}")

# Check star schema builder availability
try:
    from usf_fabric_monitoring.core.star_schema_builder import StarSchemaBuilder, ALL_STAR_SCHEMA_DDLS
    print("✅ StarSchemaBuilder module loaded successfully")
except ImportError as e:
    print(f"❌ Failed to import StarSchemaBuilder: {e}")
    print("   Make sure you have version 0.3.0+ installed")

## Configuration

In [None]:
from pathlib import Path
from usf_fabric_monitoring.core.utils import resolve_path

# ============================================================================
# CONFIGURATION - Update these values for your environment
# ============================================================================

# Input: Where Monitor Hub pipeline outputs are stored (CSV files with Smart Merge data)
INPUT_DIR = resolve_path("exports/monitor_hub_analysis")

# Output: Where star schema tables will be written
OUTPUT_DIR = resolve_path("exports/star_schema")

# Processing options
INCREMENTAL_LOAD = True  # Set to False for full refresh (rebuilds all tables)
WRITE_TO_DELTA_TABLES = False  # Set to True in Fabric to create SQL Endpoint tables

# ============================================================================
# Display configuration
# ============================================================================
print("=" * 60)
print("STAR SCHEMA BUILDER CONFIGURATION")
print("=" * 60)
print(f"Input Directory:       {INPUT_DIR}")
print(f"Output Directory:      {OUTPUT_DIR}")
print(f"Mode:                  {'Incremental' if INCREMENTAL_LOAD else 'Full Refresh'}")
print(f"Write to Delta Tables: {WRITE_TO_DELTA_TABLES}")
print("=" * 60)

## Load Source Data

In [None]:
import pandas as pd
from pathlib import Path
from datetime import datetime

# Find activities_master CSV files (contains Smart Merge enriched data with accurate failure status)
input_path = Path(INPUT_DIR)
activities_files = sorted(input_path.glob("activities_master_*.csv"), reverse=True)

if not activities_files:
    # Fallback to parquet if no CSV
    parquet_dir = input_path / "parquet"
    parquet_files = sorted(parquet_dir.glob("activities_*.parquet"), reverse=True) if parquet_dir.exists() else []
    if parquet_files:
        print(f"⚠️ No CSV files found, falling back to parquet: {parquet_files[0].name}")
        activities_df = pd.read_parquet(parquet_files[0])
    else:
        raise FileNotFoundError(
            f"No activities files found in {INPUT_DIR}\n"
            "Run the Monitor Hub pipeline first: make monitor-hub"
        )
else:
    # Show available files
    print("Available activity files:")
    for i, f in enumerate(activities_files[:5]):
        df_check = pd.read_csv(f, usecols=['status'], nrows=100000)
        failed_sample = (df_check['status'] == 'Failed').sum()
        print(f"  {i+1}. {f.name} - {'Has failures' if failed_sample > 0 else 'No failures in sample'}")
    
    # Try to find a file with failures, otherwise use latest
    selected_file = None
    for f in activities_files:
        df_check = pd.read_csv(f, usecols=['status'], low_memory=False)
        if (df_check['status'] == 'Failed').sum() > 0:
            selected_file = f
            print(f"\n✅ Selected file with Smart Merge failure data: {f.name}")
            break
    
    if selected_file is None:
        selected_file = activities_files[0]
        print(f"\n⚠️ No file with failures found, using latest: {selected_file.name}")
    
    activities_df = pd.read_csv(selected_file, low_memory=False)

print(f"\n✅ Loaded {len(activities_df):,} activity records")
print(f"   Columns: {len(activities_df.columns)} total")

# Show status distribution (key for Smart Merge validation)
if 'status' in activities_df.columns:
    status_counts = activities_df['status'].value_counts()
    print(f"   Status distribution:")
    for status, count in status_counts.items():
        print(f"     - {status}: {count:,}")

# Show date range
time_col = 'start_time' if 'start_time' in activities_df.columns else 'StartTimeUtc'
if time_col in activities_df.columns:
    activities_df[time_col] = pd.to_datetime(activities_df[time_col], errors='coerce')
    print(f"   Date range: {activities_df[time_col].min()} to {activities_df[time_col].max()}")

## Build Star Schema

In [None]:
# Reload the builder module so local code changes are picked up,
# then bind StarSchemaBuilder from the reloaded module
import importlib
from datetime import datetime

import usf_fabric_monitoring.core.star_schema_builder as ssb_module
importlib.reload(ssb_module)
StarSchemaBuilder = ssb_module.StarSchemaBuilder

# Create output directory
output_path = Path(OUTPUT_DIR)
output_path.mkdir(parents=True, exist_ok=True)

# Build star schema
print("=" * 60)
print("BUILDING STAR SCHEMA")
print("=" * 60)
start_time = datetime.now()

builder = StarSchemaBuilder(output_directory=OUTPUT_DIR)
results = builder.build_complete_schema(
    activities=activities_df.to_dict(orient="records"),
    incremental=INCREMENTAL_LOAD
)

duration = (datetime.now() - start_time).total_seconds()

if results["status"] == "success":
    print(f"\n✅ Star Schema build completed in {duration:.2f} seconds!")
    
    print(f"\n📊 Dimensions Built:")
    for dim_name, count in results.get("dimensions_built", {}).items():
        if not dim_name.endswith("_new"):
            new_count = results.get("dimensions_built", {}).get(f"{dim_name}_new", "")
            new_suffix = f" (+{new_count} new)" if new_count else ""
            print(f"   • {dim_name}: {count:,} records{new_suffix}")
    
    print(f"\n📈 Fact Tables Built:")
    for fact_name, count in results.get("facts_built", {}).items():
        if not fact_name.endswith("_new"):
            new_count = results.get("facts_built", {}).get(f"{fact_name}_new", "")
            new_suffix = f" (+{new_count} new)" if new_count else ""
            print(f"   • {fact_name}: {count:,} records{new_suffix}")
    
    print(f"\n📁 Output Directory: {OUTPUT_DIR}")
else:
    print(f"\n❌ Build failed:")
    for error in results.get("errors", []):
        print(f"   {error}")

## Convert to Delta Tables (Fabric Only)
Run this cell only in Microsoft Fabric to create Delta tables accessible via SQL Endpoint.

In [None]:
if WRITE_TO_DELTA_TABLES:
    try:
        from pyspark.sql import SparkSession
        
        spark = SparkSession.builder.getOrCreate()
        
        print("Converting Parquet files to Delta tables...")
        print("-" * 50)
        
        # Convert each parquet to a Delta table, tracking any failures
        failed_tables = []
        for parquet_file in Path(OUTPUT_DIR).glob("*.parquet"):
            table_name = parquet_file.stem  # e.g., "dim_date", "fact_activity"
            
            try:
                df = spark.read.parquet(str(parquet_file))
                
                # Write as Delta table (overwrite mode)
                df.write.mode("overwrite").format("delta").saveAsTable(table_name)
                
                print(f"   ✅ {table_name}: {df.count():,} rows")
            except Exception as e:
                failed_tables.append(table_name)
                print(f"   ❌ {table_name}: {e}")
        
        print("-" * 50)
        if failed_tables:
            # Report partial failure instead of an unconditional success message
            print(f"⚠️ {len(failed_tables)} table(s) failed to convert: {', '.join(failed_tables)}")
        else:
            print("✅ Delta tables created successfully!")
            print("   Tables are now available in the SQL Endpoint.")
        
    except ImportError:
        print("⚠️ PySpark not available. Delta table creation skipped.")
        print("   This feature is only available in Microsoft Fabric.")
else:
    print("ℹ️ Delta table creation skipped (WRITE_TO_DELTA_TABLES=False)")
    print("   Set WRITE_TO_DELTA_TABLES=True in Fabric to enable.")

## Validate Star Schema

In [None]:
import pandas as pd

print("=" * 60)
print("STAR SCHEMA VALIDATION")
print("=" * 60)

# Load and validate tables
tables = {}
for parquet_file in Path(OUTPUT_DIR).glob("*.parquet"):
    table_name = parquet_file.stem
    tables[table_name] = pd.read_parquet(parquet_file)
    print(f"{table_name}: {len(tables[table_name]):,} records")

# Check FK integrity
print("\n🔗 Foreign Key Validation:")

fk_checks = [
    ('fact_activity', 'workspace_sk', 'dim_workspace', 'workspace_sk'),
    ('fact_activity', 'item_sk', 'dim_item', 'item_sk'),
    ('fact_activity', 'user_sk', 'dim_user', 'user_sk'),
    ('fact_activity', 'date_sk', 'dim_date', 'date_sk'),
    ('fact_activity', 'time_sk', 'dim_time', 'time_sk'),
    ('fact_activity', 'activity_type_sk', 'dim_activity_type', 'activity_type_sk'),
    ('fact_activity', 'status_sk', 'dim_status', 'status_sk'),
]

all_passed = True
for fact_table, fact_col, dim_table, dim_col in fk_checks:
    if fact_table in tables and dim_table in tables:
        fact_vals = set(tables[fact_table][fact_col].dropna().unique())
        dim_vals = set(tables[dim_table][dim_col].unique())
        orphans = fact_vals - dim_vals
        status = '✅ PASS' if len(orphans) == 0 else f'❌ {len(orphans)} orphans'
        if len(orphans) > 0:
            all_passed = False
        print(f"   {fact_col}: {status}")

print("\n" + ("✅ All FK validations passed!" if all_passed else "⚠️ Some FK validations failed"))

## Sample Analytical Queries

In [None]:
# Top 10 Most Active Workspaces (aggregate by surrogate key first, then join for names)
print("📊 Top 10 Most Active Workspaces")
print("-" * 80)

if 'fact_activity' in tables and 'dim_workspace' in tables:
    fact = tables['fact_activity']
    dim_ws = tables['dim_workspace']
    
    # Aggregate by workspace_sk first (avoids duplicate rows from merge)
    ws_stats = fact.groupby('workspace_sk').agg(
        activity_count=('activity_id', 'count'),
        unique_users=('user_sk', 'nunique'),
        total_duration=('duration_seconds', 'sum'),
        failed=('is_failed', 'sum')
    ).reset_index()
    
    # Join with dimension for names
    ws_stats = ws_stats.merge(
        dim_ws[['workspace_sk', 'workspace_name', 'environment']], 
        on='workspace_sk', 
        how='left'
    )
    
    # Sort and display top 10
    ws_stats = ws_stats.sort_values('activity_count', ascending=False).head(10)
    ws_stats['duration_hrs'] = round(ws_stats['total_duration'] / 3600, 2)
    ws_stats['failure_rate'] = round(100 * ws_stats['failed'] / ws_stats['activity_count'], 2)
    
    print(f"{'Workspace':<45} | {'Env':<7} | {'Activities':>10} | {'Hours':>8} | {'Failed':>6} | {'Fail %':>6}")
    print("-" * 105)
    for _, row in ws_stats.iterrows():
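        # pd.notna is needed because the left join yields NaN, which is truthy and would print as 'nan'
        ws_name = str(row['workspace_name'])[:44] if pd.notna(row['workspace_name']) else 'Unknown'
        env = str(row['environment'])[:7] if pd.notna(row['environment']) else 'N/A'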
        print(f"{ws_name:<45} | {env:<7} | {int(row['activity_count']):>10,} | {row['duration_hrs']:>8} | {int(row['failed']):>6} | {row['failure_rate']:>5}%")

In [None]:
# Activity by Type
print("📊 Activity Count by Type")
print("-" * 50)

if 'fact_activity' in tables and 'dim_activity_type' in tables:
    fact = tables['fact_activity']
    dim_type = tables['dim_activity_type']
    
    merged = fact.merge(dim_type[['activity_type_sk', 'activity_type', 'activity_category']], on='activity_type_sk')
    result = merged.groupby(['activity_category', 'activity_type']).size().sort_values(ascending=False).head(10)
    
    for (cat, act_type), count in result.items():
        print(f"   {cat:<15} | {act_type:<30} | {count:>10,}")

In [None]:
# Daily Activity Trend
print("📊 Daily Activity Trend (Last 14 Days)")
print("-" * 50)

if 'fact_daily_metrics' in tables and 'dim_date' in tables:
    fact_daily = tables['fact_daily_metrics']
    dim_date = tables['dim_date']
    
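    # Aggregate across workspaces
    # Note: unique_users is a distinct count per row, so summing it can
    # count the same user more than once if they were active in several workspaces
    daily_totals = fact_daily.groupby('date_sk').agg({
        'total_activities': 'sum',
        'unique_users': 'sum',
        'failed_activities': 'sum'
    }).reset_index()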
    
    # Join with date dimension
    daily_totals = daily_totals.merge(
        dim_date[['date_sk', 'full_date', 'day_of_week_name']], 
        on='date_sk'
    ).sort_values('full_date', ascending=False).head(14)
    
    for _, row in daily_totals.iterrows():
        day = row['day_of_week_name'][:3]
        print(f"   {row['full_date']} ({day}) | {int(row['total_activities']):>8,} activities | {int(row['unique_users']):>4} users | {int(row['failed_activities']):>3} failed")

In [None]:
# Failure Analysis (Smart Merge Data)
print("📊 Failure Analysis")
print("-" * 50)

if 'fact_activity' in tables and 'dim_status' in tables:
    fact = tables['fact_activity']
    dim_status = tables['dim_status']
    
    # Overall failure stats
    total = len(fact)
    failed = (fact['is_failed'] == 1).sum()
    success_rate = ((total - failed) / total) * 100 if total > 0 else 0
    
    print(f"   Total Activities:    {total:,}")
    print(f"   Failed Activities:   {failed:,}")
    print(f"   Success Rate:        {success_rate:.2f}%")
    
    # Failures by status
    print(f"\n   Failures by Status:")
    merged = fact[fact['is_failed'] == 1].merge(
        dim_status[['status_sk', 'status_code']], on='status_sk'
    )
    if len(merged) > 0:
        for status, count in merged['status_code'].value_counts().items():
            print(f"     - {status}: {count:,}")
    
    # Failures by activity type (if available)
    if 'dim_activity_type' in tables:
        dim_type = tables['dim_activity_type']
        merged = fact[fact['is_failed'] == 1].merge(
            dim_type[['activity_type_sk', 'activity_type']], on='activity_type_sk'
        )
        if len(merged) > 0:
            print(f"\n   Top Failed Activity Types:")
            for act_type, count in merged['activity_type'].value_counts().head(5).items():
                print(f"     - {act_type}: {count:,}")

## Additional Sample Analytical Queries

The following cells demonstrate various analytical capabilities of the star schema.

In [None]:
# Top 10 Most Frequently Used Items
print("📊 Top 10 Most Frequently Used Items")
print("-" * 100)

if 'fact_activity' in tables and 'dim_item' in tables:
    fact = tables['fact_activity']
    dim_item = tables['dim_item']
    
    item_stats = fact.groupby('item_sk').agg(
        activity_count=('activity_id', 'count'),
        unique_users=('user_sk', 'nunique'),
        total_duration=('duration_seconds', 'sum'),
        failed=('is_failed', 'sum')
    ).reset_index()
    
    item_stats = item_stats.merge(
        dim_item[['item_sk', 'item_name', 'item_type', 'item_category']], 
        on='item_sk'
    )
    
    top_items = item_stats.sort_values('activity_count', ascending=False).head(10)
    
    print(f"{'Item Name':<40} | {'Type':<18} | {'Category':<12} | {'Activities':>10} | {'Users':>6}")
    print("-" * 100)
    for _, row in top_items.iterrows():
        item_name = str(row['item_name'])[:39] if row['item_name'] else 'Unknown'
        item_type = str(row['item_type'])[:17] if row['item_type'] else 'N/A'
        category = str(row['item_category'])[:11] if row['item_category'] else 'N/A'
        print(f"{item_name:<40} | {item_type:<18} | {category:<12} | {int(row['activity_count']):>10,} | {int(row['unique_users']):>6}")

In [None]:
# Activity by Hour of Day (Peak Usage Analysis)
print("📊 Activity by Hour of Day (Peak Usage Analysis)")
print("-" * 60)

if 'fact_activity' in tables and 'dim_time' in tables:
    fact = tables['fact_activity']
    dim_time = tables['dim_time']
    
    hourly_stats = fact.groupby('time_sk').agg(
        activity_count=('activity_id', 'count'),
        unique_users=('user_sk', 'nunique')
    ).reset_index()
    
    hourly_stats = hourly_stats.merge(
        dim_time[['time_sk', 'hour', 'period_of_day', 'is_business_hours']], 
        on='time_sk'
    )
    
    hourly_stats = hourly_stats.sort_values('hour')
    
    print(f"{'Hour':<6} | {'Period':<12} | {'Business':>8} | {'Activities':>12} | {'Users':>6}")
    print("-" * 60)
    for _, row in hourly_stats.iterrows():
        hour_str = f"{int(row['hour']):02d}:00"
        period = str(row['period_of_day'])[:11]
        biz = "Yes" if row['is_business_hours'] else "No"
        print(f"{hour_str:<6} | {period:<12} | {biz:>8} | {int(row['activity_count']):>12,} | {int(row['unique_users']):>6}")

In [None]:
# Top 10 Most Active Users
print("📊 Top 10 Most Active Users")
print("-" * 90)

if 'fact_activity' in tables and 'dim_user' in tables:
    fact = tables['fact_activity']
    dim_user = tables['dim_user']
    
    user_stats = fact.groupby('user_sk').agg(
        activity_count=('activity_id', 'count'),
        unique_items=('item_sk', 'nunique'),
        unique_workspaces=('workspace_sk', 'nunique'),
        failed=('is_failed', 'sum')
    ).reset_index()
    
    user_stats = user_stats.merge(
        dim_user[['user_sk', 'user_key', 'user_domain', 'user_type']], 
        on='user_sk'
    )
    
    top_users = user_stats.sort_values('activity_count', ascending=False).head(10)
    
    print(f"{'User':<35} | {'Domain':<15} | {'Type':<8} | {'Activities':>10} | {'Items':>6} | {'WS':>4}")
    print("-" * 90)
    for _, row in top_users.iterrows():
        user_key = str(row['user_key'])[:34] if row['user_key'] else 'Unknown'
        domain = str(row['user_domain'])[:14] if row['user_domain'] else 'N/A'
        user_type = str(row['user_type'])[:7] if row['user_type'] else 'N/A'
        print(f"{user_key:<35} | {domain:<15} | {user_type:<8} | {int(row['activity_count']):>10,} | {int(row['unique_items']):>6} | {int(row['unique_workspaces']):>4}")

In [None]:
# Activity Volume by Day of Week
print("📊 Activity Volume by Day of Week")
print("-" * 60)

if 'fact_activity' in tables and 'dim_date' in tables:
    fact = tables['fact_activity']
    dim_date = tables['dim_date']
    
    daily_stats = fact.groupby('date_sk').agg(
        activity_count=('activity_id', 'count'),
        unique_users=('user_sk', 'nunique'),
        failed=('is_failed', 'sum')
    ).reset_index()
    
    daily_stats = daily_stats.merge(
        dim_date[['date_sk', 'day_of_week_name', 'day_of_week', 'is_weekend']], 
        on='date_sk'
    )
    
    dow_stats = daily_stats.groupby(['day_of_week', 'day_of_week_name', 'is_weekend']).agg({
        'activity_count': 'sum',
        'unique_users': 'mean',
        'failed': 'sum'
    }).reset_index()
    
    dow_stats = dow_stats.sort_values('day_of_week')
    
    print(f"{'Day':<12} | {'Weekend':>8} | {'Activities':>12} | {'Avg Users':>10} | {'Failed':>8}")
    print("-" * 60)
    for _, row in dow_stats.iterrows():
        day_name = str(row['day_of_week_name'])
        weekend = "Yes" if row['is_weekend'] else "No"
        print(f"{day_name:<12} | {weekend:>8} | {int(row['activity_count']):>12,} | {row['unique_users']:>10.1f} | {int(row['failed']):>8}")

In [None]:
# Environment Comparison (DEV vs TEST vs UAT vs PRD)
print("📊 Environment Comparison")
print("-" * 80)

if 'fact_activity' in tables and 'dim_workspace' in tables:
    fact = tables['fact_activity']
    dim_ws = tables['dim_workspace']
    
    # Get workspace environments
    ws_env = dim_ws[['workspace_sk', 'environment']].drop_duplicates()
    fact_with_env = fact.merge(ws_env, on='workspace_sk', how='left')
    
    env_stats = fact_with_env.groupby('environment').agg(
        activity_count=('activity_id', 'count'),
        unique_users=('user_sk', 'nunique'),
        unique_items=('item_sk', 'nunique'),
        unique_workspaces=('workspace_sk', 'nunique'),
        total_duration=('duration_seconds', 'sum'),
        failed=('is_failed', 'sum')
    ).reset_index()
    
    env_stats['duration_hrs'] = round(env_stats['total_duration'] / 3600, 2)
    env_stats['failure_rate'] = round(100 * env_stats['failed'] / env_stats['activity_count'], 2)
    env_stats = env_stats.sort_values('activity_count', ascending=False)
    
    print(f"{'Environment':<12} | {'Activities':>12} | {'Users':>6} | {'Items':>6} | {'WS':>4} | {'Hours':>10} | {'Fail %':>7}")
    print("-" * 80)
    for _, row in env_stats.iterrows():
        env = str(row['environment']) if row['environment'] else 'Unknown'
        print(f"{env:<12} | {int(row['activity_count']):>12,} | {int(row['unique_users']):>6} | {int(row['unique_items']):>6} | {int(row['unique_workspaces']):>4} | {row['duration_hrs']:>10} | {row['failure_rate']:>6}%")

In [None]:
# Item Category Distribution
print("📊 Activity by Item Category")
print("-" * 70)

if 'fact_activity' in tables and 'dim_item' in tables:
    fact = tables['fact_activity']
    dim_item = tables['dim_item']
    
    item_cat = dim_item[['item_sk', 'item_category']].drop_duplicates()
    fact_with_cat = fact.merge(item_cat, on='item_sk', how='left')
    
    cat_stats = fact_with_cat.groupby('item_category').agg(
        activity_count=('activity_id', 'count'),
        unique_items=('item_sk', 'nunique'),
        unique_users=('user_sk', 'nunique'),
        failed=('is_failed', 'sum')
    ).reset_index()
    
    cat_stats['failure_rate'] = round(100 * cat_stats['failed'] / cat_stats['activity_count'], 2)
    cat_stats = cat_stats.sort_values('activity_count', ascending=False)
    
    print(f"{'Category':<18} | {'Activities':>12} | {'Items':>6} | {'Users':>6} | {'Failed':>8} | {'Fail %':>7}")
    print("-" * 70)
    for _, row in cat_stats.iterrows():
        cat = str(row['item_category'])[:17] if row['item_category'] else 'Unknown'
        print(f"{cat:<18} | {int(row['activity_count']):>12,} | {int(row['unique_items']):>6} | {int(row['unique_users']):>6} | {int(row['failed']):>8} | {row['failure_rate']:>6}%")

In [None]:
# Long-Running Activities Analysis
print("📊 Long-Running Activities Analysis (>1 hour)")
print("-" * 100)

if 'fact_activity' in tables and 'dim_item' in tables and 'dim_workspace' in tables:
    fact = tables['fact_activity']
    dim_item = tables['dim_item']
    dim_ws = tables['dim_workspace']
    
    # Filter for long-running activities (>1 hour = 3600 seconds)
    long_running = fact[fact['is_long_running'] == 1].copy() if 'is_long_running' in fact.columns else fact[fact['duration_seconds'] > 3600].copy()
    
    if len(long_running) > 0:
        print(f"Total long-running activities (>1 hr): {len(long_running):,}")
        print()
        
        # Enrich with names
        long_running = long_running.merge(
            dim_item[['item_sk', 'item_name', 'item_type']], on='item_sk', how='left'
        ).merge(
            dim_ws[['workspace_sk', 'workspace_name']], on='workspace_sk', how='left'
        )
        
        # Top 10 longest
        top_longest = long_running.nlargest(10, 'duration_seconds')
        
        print(f"{'Item Name':<35} | {'Type':<15} | {'Workspace':<25} | {'Duration':>12}")
        print("-" * 100)
        for _, row in top_longest.iterrows():
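            # pd.notna is needed because the left joins yield NaN, which is truthy and would print as 'nan'
            item_name = str(row['item_name'])[:34] if pd.notna(row['item_name']) else 'Unknown'
            item_type = str(row['item_type'])[:14] if pd.notna(row['item_type']) else 'N/A'
            ws_name = str(row['workspace_name'])[:24] if pd.notna(row['workspace_name']) else 'Unknown'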
            duration_hrs = round(row['duration_seconds'] / 3600, 2)
            print(f"{item_name:<35} | {item_type:<15} | {ws_name:<25} | {duration_hrs:>10} hrs")
    else:
        print("No long-running activities found (>1 hour)")

In [None]:
# Cross-Workspace Usage Patterns (Workspaces shared by multiple users)
print("📊 Cross-Workspace Usage Patterns")
print("-" * 80)

if 'fact_activity' in tables and 'dim_workspace' in tables:
    fact = tables['fact_activity']
    dim_ws = tables['dim_workspace']
    
    # Calculate collaboration metrics per workspace
    ws_collab = fact.groupby('workspace_sk').agg(
        unique_users=('user_sk', 'nunique'),
        unique_items=('item_sk', 'nunique'),
        activity_count=('activity_id', 'count')
    ).reset_index()
    
    ws_collab = ws_collab.merge(
        dim_ws[['workspace_sk', 'workspace_name', 'environment']], on='workspace_sk'
    )
    
    # Sort by user count (collaboration indicator)
    ws_collab = ws_collab.sort_values('unique_users', ascending=False).head(10)
    ws_collab['activities_per_user'] = round(ws_collab['activity_count'] / ws_collab['unique_users'], 1)
    
    print(f"{'Workspace':<40} | {'Env':<7} | {'Users':>6} | {'Items':>6} | {'Activities':>10} | {'Act/User':>8}")
    print("-" * 95)
    for _, row in ws_collab.iterrows():
        ws_name = str(row['workspace_name'])[:39] if row['workspace_name'] else 'Unknown'
        env = str(row['environment'])[:6] if row['environment'] else 'N/A'
        print(f"{ws_name:<40} | {env:<7} | {int(row['unique_users']):>6} | {int(row['unique_items']):>6} | {int(row['activity_count']):>10,} | {row['activities_per_user']:>8}")

In [None]:
# Monthly Trend Analysis
print("📊 Monthly Activity Trend")
print("-" * 70)

if 'fact_activity' in tables and 'dim_date' in tables:
    fact = tables['fact_activity']
    dim_date = tables['dim_date']
    
    # Join with date dimension
    fact_dated = fact.merge(
        dim_date[['date_sk', 'year', 'month', 'month_name']], on='date_sk'
    )
    
    monthly_stats = fact_dated.groupby(['year', 'month', 'month_name']).agg(
        activity_count=('activity_id', 'count'),
        unique_users=('user_sk', 'nunique'),
        unique_items=('item_sk', 'nunique'),
        failed=('is_failed', 'sum')
    ).reset_index()
    
    monthly_stats = monthly_stats.sort_values(['year', 'month'], ascending=False).head(12)
    monthly_stats['failure_rate'] = round(100 * monthly_stats['failed'] / monthly_stats['activity_count'], 2)
    
    print(f"{'Month':<15} | {'Activities':>12} | {'Users':>6} | {'Items':>6} | {'Failed':>8} | {'Fail %':>7}")
    print("-" * 70)
    for _, row in monthly_stats.iterrows():
        month_label = f"{row['month_name'][:3]} {int(row['year'])}"
        print(f"{month_label:<15} | {int(row['activity_count']):>12,} | {int(row['unique_users']):>6} | {int(row['unique_items']):>6} | {int(row['failed']):>8} | {row['failure_rate']:>6}%")

## Summary

The star schema has been built and is ready for:

1. **SQL Queries** - Query the parquet files directly or use Delta tables via SQL Endpoint
2. **Semantic Model** - Create a Direct Lake model pointing to these tables
3. **Power BI Reports** - Build monitoring dashboards using the semantic model

### Scheduled Refresh
To automate the star schema build:
1. Schedule this notebook to run daily after the Monitor Hub pipeline
2. Or use a Fabric Data Pipeline to orchestrate both steps

### Table Relationships (for Semantic Model)
```
fact_activity[date_sk] → dim_date[date_sk]
fact_activity[time_sk] → dim_time[time_sk]
fact_activity[workspace_sk] → dim_workspace[workspace_sk]
fact_activity[item_sk] → dim_item[item_sk]
fact_activity[user_sk] → dim_user[user_sk]
fact_activity[activity_type_sk] → dim_activity_type[activity_type_sk]
fact_activity[status_sk] → dim_status[status_sk]
```
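
Each relationship above is many-to-one, which expects the key on the dimension side to be unique and non-null. The sketch below is one possible pre-flight check before building the semantic model; the helper name `non_unique_dimension_keys` and the toy tables are hypothetical, and in this notebook the `tables` dict loaded in the validation cell could be passed instead.

```python
import pandas as pd

# (fact column, dimension table, dimension key) for each relationship listed above.
RELATIONSHIPS = [
    ("date_sk", "dim_date", "date_sk"),
    ("time_sk", "dim_time", "time_sk"),
    ("workspace_sk", "dim_workspace", "workspace_sk"),
    ("item_sk", "dim_item", "item_sk"),
    ("user_sk", "dim_user", "user_sk"),
    ("activity_type_sk", "dim_activity_type", "activity_type_sk"),
    ("status_sk", "dim_status", "status_sk"),
]


def non_unique_dimension_keys(tables: dict) -> list:
    """Return dimension tables whose key column has duplicates or nulls."""
    problems = []
    for _, dim_name, dim_key in RELATIONSHIPS:
        dim = tables.get(dim_name)
        if dim is None:
            continue  # table not loaded, nothing to check
        keys = dim[dim_key]
        if keys.isna().any() or not keys.is_unique:
            problems.append(dim_name)
    return problems


# Toy example: dim_user deliberately contains a duplicated surrogate key.
demo_tables = {
    "dim_date": pd.DataFrame({"date_sk": [20240101, 20240102]}),
    "dim_user": pd.DataFrame({"user_sk": [1, 1, 2]}),
}
problems = non_unique_dimension_keys(demo_tables)
print(problems)
```

This complements the foreign key validation earlier in the notebook, which checks for orphaned fact rows but not for duplicate keys on the dimension side.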