# Star Schema Builder for Fabric Monitoring Analytics

## Overview
This notebook transforms raw Monitor Hub activity data into a Kimball-style star schema suitable for SQL queries, semantic models, and Power BI reports.

It is designed to work in both:
- **Microsoft Fabric notebooks** (paths auto-resolve under `/lakehouse/default/Files/`), and
- **Local dev** (writes under `exports/` by default).
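
The environment switch described above can be pictured with a small helper. This is only a sketch: `resolve_output_path` and its `lakehouse_root` parameter are hypothetical names used for illustration, and the package's own `resolve_path` may be implemented differently.

```python
from pathlib import Path


def resolve_output_path(relative_path: str,
                        lakehouse_root: str = "/lakehouse/default/Files") -> Path:
    """Hypothetical helper (not the package's resolve_path).

    Returns a path under the Fabric lakehouse mount when that mount exists,
    otherwise a path relative to the current working directory.
    """
    root = Path(lakehouse_root)
    if root.is_dir():
        # Fabric notebook: the default lakehouse is mounted on the local filesystem
        return root / relative_path
    # Local development: keep outputs inside the project tree
    return Path.cwd() / relative_path


print(resolve_output_path("exports/star_schema"))
```

Checking for the mount directory keeps the notebook free of environment flags: the same cell runs unchanged in Fabric and on a laptop.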

## Star Schema Tables
- **Dimensions**: dim_date, dim_time, dim_workspace, dim_item, dim_user, dim_activity_type, dim_status
- **Facts**: fact_activity, fact_daily_metrics
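
To make the relationship between the two fact tables concrete, here is a minimal sketch of how a daily aggregate could be derived from activity-level rows. The column names match those used in the query cells later in this notebook, but the grain (one row per date and workspace) and the `YYYYMMDD` integer form of `date_sk` are assumptions, not confirmed details of the builder.

```python
import pandas as pd

# Toy fact_activity rows; values are illustrative only
fact_activity = pd.DataFrame({
    "date_sk":      [20251216, 20251216, 20251216, 20251217],
    "workspace_sk": [1, 1, 2, 1],
    "user_sk":      [10, 11, 10, 10],
    "record_count": [1, 1, 1, 1],
    "is_failed":    [0, 1, 0, 0],
})

# Assumed grain for fact_daily_metrics: one row per (date, workspace)
fact_daily_metrics = (
    fact_activity
    .groupby(["date_sk", "workspace_sk"], as_index=False)
    .agg(
        total_activities=("record_count", "sum"),
        unique_users=("user_sk", "nunique"),
        failed_activities=("is_failed", "sum"),
    )
)

print(fact_daily_metrics)
```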

## Key Features
- Incremental loading with high-water mark tracking
- SCD Type 2 support for slowly changing dimensions
- Automatic surrogate key generation
- Pre-aggregated daily metrics for fast dashboards
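
Two of the features above, high-water mark tracking and surrogate key generation, can be sketched in a few lines of pandas. The helper names `filter_new_rows` and `assign_surrogate_keys`, and the `workspace_id` and `start_time` columns in the toy data, are hypothetical; the package's actual implementation may differ.

```python
import pandas as pd


def filter_new_rows(activities: pd.DataFrame, high_water_mark, ts_col: str = "start_time") -> pd.DataFrame:
    """Keep only rows newer than the stored high-water mark (None means first load)."""
    if high_water_mark is None:
        return activities
    return activities[activities[ts_col] > high_water_mark]


def assign_surrogate_keys(dim: pd.DataFrame, incoming_ids: pd.Series,
                          id_col: str, sk_col: str) -> pd.DataFrame:
    """Append unseen natural keys to a dimension, numbering them after the current max key."""
    known = set(dim[id_col])
    new_ids = [i for i in incoming_ids.dropna().unique() if i not in known]
    if not new_ids:
        return dim.copy()
    next_sk = int(dim[sk_col].max()) + 1 if len(dim) else 1
    new_rows = pd.DataFrame({
        sk_col: range(next_sk, next_sk + len(new_ids)),
        id_col: new_ids,
    })
    return pd.concat([dim, new_rows], ignore_index=True)


# Toy data: one workspace is already in the dimension, two arrive in the new batch
activities = pd.DataFrame({
    "workspace_id": ["ws-a", "ws-b", "ws-c"],
    "start_time": pd.to_datetime(["2025-12-15", "2025-12-16", "2025-12-17"]),
})
dim_workspace = pd.DataFrame({"workspace_sk": [1], "workspace_id": ["ws-a"]})

new_rows = filter_new_rows(activities, pd.Timestamp("2025-12-15"))
dim_workspace = assign_surrogate_keys(dim_workspace, new_rows["workspace_id"],
                                      "workspace_id", "workspace_sk")
high_water_mark = new_rows["start_time"].max()  # persist this value for the next run

print(dim_workspace)
```

Existing surrogate keys are never renumbered, so fact rows written by earlier runs keep pointing at the same dimension rows.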

## How to Use
1. **Install Package** (first run only): Uncomment and run the pip install cell
2. **Configure Paths**: Set INPUT_DIR and OUTPUT_DIR for your environment
3. **Run Pipeline**: Execute the build cells
4. **Optional**: Convert to Delta tables for SQL Endpoint access

## Package Installation
<span style="color:red">pip install is only required on first run. Uncomment and run once, then re-comment.</span>

In [1]:
# %pip install /lakehouse/default/Files/usf_fabric_monitoring-0.3.0-py3-none-any.whl --force-reinstall

## Setup Local Path (For Local Development)

In [2]:
# SETUP LOCAL PATH (For Local Development)
import sys
import os
from pathlib import Path

# Add the src directory to sys.path to allow importing the local package
# This is necessary when running locally without installing the package
current_dir = Path(os.getcwd())

# Check if we are in notebooks directory
if current_dir.name == "notebooks":
    src_path = current_dir.parent / "src"
else:
    # Assume we are in project root
    src_path = current_dir / "src"

if src_path.exists() and str(src_path) not in sys.path:
    sys.path.insert(0, str(src_path))
    print(f"✅ Added {src_path} to sys.path (local development mode)")
else:
    print("ℹ️ Running in Fabric or package already installed")

ℹ️ Running in Fabric or package already installed


In [3]:
# Force reload of modules to pick up code changes
import importlib
import usf_fabric_monitoring.core.star_schema_builder as ssb
importlib.reload(ssb)
print("✅ Modules reloaded")

✅ Modules reloaded


In [4]:
# Package / environment verification (safe: no Azure/API imports)
from importlib.metadata import PackageNotFoundError, version
import importlib
import usf_fabric_monitoring
from usf_fabric_monitoring.core.utils import resolve_path

try:
    pkg_version = getattr(usf_fabric_monitoring, "__version__", None) or version("usf_fabric_monitoring")
except PackageNotFoundError:
    pkg_version = "unknown"

print(f"usf_fabric_monitoring version: {pkg_version}")
print(f"Resolved output dir example: {resolve_path('exports/star_schema')}")

# Check star schema builder availability
try:
    from usf_fabric_monitoring.core.star_schema_builder import StarSchemaBuilder, ALL_STAR_SCHEMA_DDLS
    print("✅ StarSchemaBuilder module loaded successfully")
except ImportError as e:
    print(f"❌ Failed to import StarSchemaBuilder: {e}")
    print("   Make sure you have version 0.3.0+ installed")

usf_fabric_monitoring version: 0.3.7
Resolved output dir example: /home/sanmi/Documents/J'TOYE_DIGITAL/LEIT_TEKSYSTEMS/1_Project_Rhico/usf_fabric_monitoring/exports/star_schema
✅ StarSchemaBuilder module loaded successfully


## Configuration

In [5]:
from pathlib import Path
from usf_fabric_monitoring.core.utils import resolve_path

# ============================================================================
# CONFIGURATION - Update these values for your environment
# ============================================================================

# Input: Where Monitor Hub pipeline outputs are stored (Smart Merge data; parquet preferred, CSV fallback)
INPUT_DIR = resolve_path("exports/monitor_hub_analysis")

# Output: Where star schema tables will be written
OUTPUT_DIR = resolve_path("exports/star_schema")

# Processing options
INCREMENTAL_LOAD = True  # Set to False for full refresh (rebuilds all tables)
WRITE_TO_DELTA_TABLES = False  # Set to True in Fabric to create SQL Endpoint tables

# ============================================================================
# Display configuration
# ============================================================================
print("=" * 60)
print("STAR SCHEMA BUILDER CONFIGURATION")
print("=" * 60)
print(f"Input Directory:       {INPUT_DIR}")
print(f"Output Directory:      {OUTPUT_DIR}")
print(f"Mode:                  {'Incremental' if INCREMENTAL_LOAD else 'Full Refresh'}")
print(f"Write to Delta Tables: {WRITE_TO_DELTA_TABLES}")
print("=" * 60)

STAR SCHEMA BUILDER CONFIGURATION
Input Directory:       /home/sanmi/Documents/J'TOYE_DIGITAL/LEIT_TEKSYSTEMS/1_Project_Rhico/usf_fabric_monitoring/exports/monitor_hub_analysis
Output Directory:      /home/sanmi/Documents/J'TOYE_DIGITAL/LEIT_TEKSYSTEMS/1_Project_Rhico/usf_fabric_monitoring/exports/star_schema
Mode:                  Incremental
Write to Delta Tables: False


## Load Source Data

In [6]:
# Preview source data (optional - for verification only)
import pandas as pd
from pathlib import Path

input_path = Path(INPUT_DIR)

# Check for parquet files first (preferred - has complete data including end_time for failures)
parquet_dir = input_path / "parquet"
parquet_files = sorted(parquet_dir.glob("activities_*.parquet"), reverse=True) if parquet_dir.exists() else []

if parquet_files:
    print(f"✅ Found parquet source: {parquet_files[0].name}")
    preview_df = pd.read_parquet(parquet_files[0])
    print(f"   Columns: {len(preview_df.columns)} (includes end_time for failure tracking)")
else:
    # Fallback to CSV
    csv_files = sorted(input_path.glob("activities_master_*.csv"), reverse=True)
    if csv_files:
        print(f"⚠️ Using CSV source: {csv_files[0].name}")
        preview_df = pd.read_csv(csv_files[0], low_memory=False)
        print(f"   Columns: {len(preview_df.columns)}")
    else:
        raise FileNotFoundError(f"No activities files found in {INPUT_DIR}")

print(f"   Records: {len(preview_df):,}")

# Show status distribution
if 'status' in preview_df.columns:
    print(f"   Status distribution:")
    for status, count in preview_df['status'].value_counts().items():
        print(f"     - {status}: {count:,}")

✅ Found parquet source: activities_20251217_143508.parquet
   Columns: 39 (includes end_time for failure tracking)
   Records: 1,286,374
   Status distribution:
     - Succeeded: 1,280,156
     - Failed: 6,218


## Build Star Schema

In [None]:
# Build Star Schema using the recommended function
# This function:
# 1. Loads from parquet source (complete data including end_time)
# 2. Uses end_time fallback for failed records that have NULL start_time
# 3. Enriches with workspace names automatically
# 4. Handles all dimension lookups correctly

import importlib
import usf_fabric_monitoring.core.star_schema_builder as ssb_module
importlib.reload(ssb_module)
build_star_schema_from_pipeline_output = ssb_module.build_star_schema_from_pipeline_output

from datetime import datetime
from pathlib import Path

# Create output directory
output_path = Path(OUTPUT_DIR)
output_path.mkdir(parents=True, exist_ok=True)

# Build star schema
print("=" * 60)
print("BUILDING STAR SCHEMA")
print("=" * 60)
start_time = datetime.now()

# NOTE: incremental=False is hard-coded here and overrides INCREMENTAL_LOAD from the
# Configuration cell. Pass incremental=INCREMENTAL_LOAD for production daily loads with new data.
tables = build_star_schema_from_pipeline_output(
    pipeline_output_dir=str(INPUT_DIR),
    output_directory=str(OUTPUT_DIR),
    incremental=False  # Full rebuild - prevents duplicate data issues
)

duration = (datetime.now() - start_time).total_seconds()
print(f"\n✅ Star Schema build completed in {duration:.2f} seconds!")
print(f"📁 Output Directory: {OUTPUT_DIR}")

BUILDING STAR SCHEMA
2025-12-17 14:50:56 | INFO | star_schema_builder | Starting Star Schema Build
2025-12-17 14:50:56 | INFO | star_schema_builder | Mode: Full Refresh
2025-12-17 14:50:56 | INFO | star_schema_builder | Input activities: 1286374
2025-12-17 14:50:56 | INFO | star_schema_builder | Pre-processing: Enriching activities with workspace names...
2025-12-17 14:50:56 | INFO | star_schema_builder | Loaded 512 workspace names from /home/sanmi/Documents/J'TOYE_DIGITAL/LEIT_TEKSYSTEMS/1_Project_Rhico/usf_fabric_monitoring/exports/monitor_hub_analysis/parquet/workspaces_20251217_143508.parquet
2025-12-17 14:50:56 | INFO | star_schema_builder | Loaded 5

## Convert to Delta Tables (Fabric Only)
Run this cell only in Microsoft Fabric to create Delta tables accessible via SQL Endpoint.

In [None]:
if WRITE_TO_DELTA_TABLES:
    try:
        from pyspark.sql import SparkSession
        
        spark = SparkSession.builder.getOrCreate()
        
        print("Converting Parquet files to Delta tables...")
        print("-" * 50)
        
        # Convert each Parquet file to a Delta table, tracking any failures
        failed_tables = []
        for parquet_file in Path(OUTPUT_DIR).glob("*.parquet"):
            table_name = parquet_file.stem  # e.g., "dim_date", "fact_activity"
            
            try:
                df = spark.read.parquet(str(parquet_file))
                
                # Write as Delta table (overwrite mode)
                df.write.mode("overwrite").format("delta").saveAsTable(table_name)
                
                print(f"   ✅ {table_name}: {df.count():,} rows")
            except Exception as e:
                failed_tables.append(table_name)
                print(f"   ❌ {table_name}: {e}")
        
        print("-" * 50)
        if failed_tables:
            # Do not report success when one or more tables could not be written
            print(f"⚠️ {len(failed_tables)} table(s) failed: {', '.join(failed_tables)}")
        else:
            print("✅ Delta tables created successfully!")
            print("   Tables are now available in the SQL Endpoint.")
        
    except ImportError:
        print("⚠️ PySpark not available. Delta table creation skipped.")
        print("   This feature is only available in Microsoft Fabric.")
else:
    print("ℹ️ Delta table creation skipped (WRITE_TO_DELTA_TABLES=False)")
    print("   Set WRITE_TO_DELTA_TABLES=True in Fabric to enable.")

## Validate Star Schema

In [None]:
import pandas as pd

print("=" * 60)
print("STAR SCHEMA VALIDATION")
print("=" * 60)

# Load tables from parquet files (ensures we have fresh data)
tables = {}
for parquet_file in Path(OUTPUT_DIR).glob("*.parquet"):
    table_name = parquet_file.stem
    tables[table_name] = pd.read_parquet(parquet_file)
    print(f"{table_name}: {len(tables[table_name]):,} records")

# Quick data quality check
fact = tables.get('fact_activity')
if fact is not None:
    print(f"\n📊 Fact Activity Summary:")
    print(f"   Total records: {len(fact):,}")
    print(f"   Total activities (record_count sum): {fact['record_count'].sum():,}")
    print(f"   Failed activities: {fact['is_failed'].sum():,}")
    print(f"   Records with date_sk: {fact['date_sk'].notna().sum():,}")

# Check FK integrity
print("\n🔗 Foreign Key Validation:")

fk_checks = [
    ('fact_activity', 'workspace_sk', 'dim_workspace', 'workspace_sk'),
    ('fact_activity', 'item_sk', 'dim_item', 'item_sk'),
    ('fact_activity', 'user_sk', 'dim_user', 'user_sk'),
    ('fact_activity', 'date_sk', 'dim_date', 'date_sk'),
    ('fact_activity', 'time_sk', 'dim_time', 'time_sk'),
    ('fact_activity', 'activity_type_sk', 'dim_activity_type', 'activity_type_sk'),
    ('fact_activity', 'status_sk', 'dim_status', 'status_sk'),
]

all_passed = True
for fact_table, fact_col, dim_table, dim_col in fk_checks:
    if fact_table in tables and dim_table in tables:
        fact_vals = set(tables[fact_table][fact_col].dropna().unique())
        dim_vals = set(tables[dim_table][dim_col].unique())
        orphans = fact_vals - dim_vals
        status = '✅ PASS' if len(orphans) == 0 else f'❌ {len(orphans)} orphans'
        if len(orphans) > 0:
            all_passed = False
        print(f"   {fact_col}: {status}")

print("\n" + ("✅ All FK validations passed!" if all_passed else "⚠️ Some FK validations failed"))

## Sample Analytical Queries

In [None]:
# Top 10 Most Active Workspaces
# NOTE: Use record_count (not activity_id) as many activities have NULL activity_id
print("📊 Top 10 Most Active Workspaces")
print("-" * 100)

if 'fact_activity' in tables and 'dim_workspace' in tables:
    fact = tables['fact_activity']
    dim_ws = tables['dim_workspace']
    
    # Aggregate by workspace_sk - use record_count for accurate totals
    ws_stats = fact.groupby('workspace_sk').agg(
        activity_count=('record_count', 'sum'),  # Sum record_count, not count activity_id
        unique_users=('user_sk', 'nunique'),
        total_duration=('duration_seconds', 'sum'),
        failed=('is_failed', 'sum')
    ).reset_index()
    
    # Join with dimension for names
    ws_stats = ws_stats.merge(
        dim_ws[['workspace_sk', 'workspace_name', 'environment']], 
        on='workspace_sk', 
        how='left'
    )
    
    # Sort and display top 10
    ws_stats = ws_stats.sort_values('activity_count', ascending=False).head(10)
    ws_stats['duration_hrs'] = round(ws_stats['total_duration'] / 3600, 2)
    ws_stats['failure_rate'] = round(100 * ws_stats['failed'] / ws_stats['activity_count'], 2)
    
    print(f"{'Workspace':<45} | {'Env':<7} | {'Activities':>12} | {'Hours':>10} | {'Failed':>8} | {'Fail %':>7}")
    print("-" * 100)
    for _, row in ws_stats.iterrows():
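        # NaN (from the left join) is truthy, so test for it explicitly
        ws_name = str(row['workspace_name'])[:44] if pd.notna(row['workspace_name']) and row['workspace_name'] else 'Unknown'
        env = str(row['environment'])[:7] if pd.notna(row['environment']) and row['environment'] else 'N/A'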
        print(f"{ws_name:<45} | {env:<7} | {int(row['activity_count']):>12,} | {row['duration_hrs']:>10.2f} | {int(row['failed']):>8,} | {row['failure_rate']:>6.2f}%")

In [None]:
# Activity by Type - use record_count for accurate totals
# Shows BOTH top 15 by volume AND all activity types with failures
print("📊 Activity Count by Type (Top 15 by Volume)")
print("-" * 85)

if 'fact_activity' in tables and 'dim_activity_type' in tables:
    fact = tables['fact_activity']
    dim_type = tables['dim_activity_type']
    
    # Aggregate by activity_type_sk first
    type_stats = fact.groupby('activity_type_sk').agg(
        activity_count=('record_count', 'sum'),
        failed=('is_failed', 'sum')
    ).reset_index()
    
    # Join with dimension for type details
    type_stats = type_stats.merge(
        dim_type[['activity_type_sk', 'activity_type', 'activity_category']], 
        on='activity_type_sk'
    )
    type_stats['success_rate'] = round(100 * (1 - type_stats['failed'] / type_stats['activity_count']), 2)
    
    # Top 15 by volume
    top_15 = type_stats.sort_values('activity_count', ascending=False).head(15)
    
    print(f"{'Category':<18} | {'Activity Type':<30} | {'Count':>12} | {'Failed':>8} | {'Success %':>9}")
    print("-" * 85)
    for _, row in top_15.iterrows():
        print(f"{row['activity_category']:<18} | {row['activity_type']:<30} | {int(row['activity_count']):>12,} | {int(row['failed']):>8,} | {row['success_rate']:>8.2f}%")
    
    # Show ALL activity types with failures (important for job history activities)
    types_with_failures = type_stats[type_stats['failed'] > 0].sort_values('failed', ascending=False)
    if len(types_with_failures) > 0:
        print()
        print("=" * 85)
        print("📊 Activity Types with Failures (All)")
        print("=" * 85)
        print(f"{'Category':<18} | {'Activity Type':<30} | {'Count':>12} | {'Failed':>8} | {'Success %':>9}")
        print("-" * 85)
        for _, row in types_with_failures.iterrows():
            print(f"{row['activity_category']:<18} | {row['activity_type']:<30} | {int(row['activity_count']):>12,} | {int(row['failed']):>8,} | {row['success_rate']:>8.2f}%")
        print("-" * 85)
        print(f"{'TOTAL':<18} | {'':<30} | {int(types_with_failures['activity_count'].sum()):>12,} | {int(types_with_failures['failed'].sum()):>8,} |")

In [None]:
# Daily Activity Trend
print("📊 Daily Activity Trend (Last 14 Days)")
print("-" * 50)

if 'fact_daily_metrics' in tables and 'dim_date' in tables:
    fact_daily = tables['fact_daily_metrics']
    dim_date = tables['dim_date']
    
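    # Aggregate across workspaces
    # NOTE: if fact_daily_metrics holds one row per workspace per day, summing unique_users
    # counts a user once for every workspace they touched that day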
    daily_totals = fact_daily.groupby('date_sk').agg({
        'total_activities': 'sum',
        'unique_users': 'sum',
        'failed_activities': 'sum'
    }).reset_index()
    
    # Join with date dimension
    daily_totals = daily_totals.merge(
        dim_date[['date_sk', 'full_date', 'day_of_week_name']], 
        on='date_sk'
    ).sort_values('full_date', ascending=False).head(14)
    
    for _, row in daily_totals.iterrows():
        day = row['day_of_week_name'][:3]
        print(f"   {row['full_date']} ({day}) | {int(row['total_activities']):>8,} activities | {int(row['unique_users']):>4} users | {int(row['failed_activities']):>3} failed")

In [None]:
# Failure Analysis (Smart Merge Data)
print("📊 Failure Analysis")
print("-" * 50)

if 'fact_activity' in tables and 'dim_status' in tables:
    fact = tables['fact_activity']
    dim_status = tables['dim_status']
    
    # Overall failure stats
    total = len(fact)
    failed = (fact['is_failed'] == 1).sum()
    success_rate = ((total - failed) / total) * 100 if total > 0 else 0
    
    print(f"   Total Activities:    {total:,}")
    print(f"   Failed Activities:   {failed:,}")
    print(f"   Success Rate:        {success_rate:.2f}%")
    
    # Failures by status
    print(f"\n   Failures by Status:")
    merged = fact[fact['is_failed'] == 1].merge(
        dim_status[['status_sk', 'status_code']], on='status_sk'
    )
    if len(merged) > 0:
        for status, count in merged['status_code'].value_counts().items():
            print(f"     - {status}: {count:,}")
    
    # Failures by activity type (if available)
    if 'dim_activity_type' in tables:
        dim_type = tables['dim_activity_type']
        merged = fact[fact['is_failed'] == 1].merge(
            dim_type[['activity_type_sk', 'activity_type']], on='activity_type_sk'
        )
        if len(merged) > 0:
            print(f"\n   Top Failed Activity Types:")
            for act_type, count in merged['activity_type'].value_counts().head(5).items():
                print(f"     - {act_type}: {count:,}")

## Additional Sample Analytical Queries

The following cells demonstrate various analytical capabilities of the star schema.

In [None]:
# Top 10 Most Frequently Used Items
print("📊 Top 10 Most Frequently Used Items")
print("-" * 110)

if 'fact_activity' in tables and 'dim_item' in tables:
    fact = tables['fact_activity']
    dim_item = tables['dim_item']
    
    # Use record_count for accurate totals
    item_stats = fact.groupby('item_sk').agg(
        activity_count=('record_count', 'sum'),
        unique_users=('user_sk', 'nunique'),
        total_duration=('duration_seconds', 'sum'),
        failed=('is_failed', 'sum')
    ).reset_index()
    
    item_stats = item_stats.merge(
        dim_item[['item_sk', 'item_name', 'item_type', 'item_category']], 
        on='item_sk'
    )
    
    top_items = item_stats.sort_values('activity_count', ascending=False).head(10)
    
    print(f"{'Item Name':<40} | {'Type':<18} | {'Category':<12} | {'Activities':>12} | {'Users':>6}")
    print("-" * 110)
    for _, row in top_items.iterrows():
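        # NaN is truthy, so test for it explicitly before falling back
        item_name = str(row['item_name'])[:39] if pd.notna(row['item_name']) and row['item_name'] else 'Unknown'
        item_type = str(row['item_type'])[:17] if pd.notna(row['item_type']) and row['item_type'] else 'N/A'
        category = str(row['item_category'])[:11] if pd.notna(row['item_category']) and row['item_category'] else 'N/A'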
        print(f"{item_name:<40} | {item_type:<18} | {category:<12} | {int(row['activity_count']):>12,} | {int(row['unique_users']):>6}")

In [None]:
# Activity by Hour of Day
# Shows when activities occur during the day
print("📊 Activity by Hour of Day")
print("-" * 70)

if 'fact_activity' in tables and 'dim_time' in tables:
    fact = tables['fact_activity']
    dim_time = tables['dim_time']
    
    # Get hourly stats from fact table (use record_count for accurate totals)
    hourly_stats = fact.groupby('time_sk').agg(
        activity_count=('record_count', 'sum')
    ).reset_index()
    
    # Join with time dimension (using correct column names: hour_24, time_period)
    hourly_stats = hourly_stats.merge(
        dim_time[['time_sk', 'hour_24', 'time_period', 'is_business_hours']], 
        on='time_sk'
    )
    
    # Aggregate by hour
    hourly_summary = hourly_stats.groupby(['hour_24', 'time_period']).agg(
        total_activities=('activity_count', 'sum')
    ).reset_index().sort_values('hour_24')
    
    print(f"{'Hour':<8} | {'Period':<12} | {'Activities':>14} | {'Chart'}")
    print("-" * 70)
    for _, row in hourly_summary.iterrows():
        bar = "█" * int(row['total_activities'] / hourly_summary['total_activities'].max() * 30)
        print(f"{row['hour_24']:02d}:00    | {row['time_period']:<12} | {int(row['total_activities']):>14,} | {bar}")

In [None]:
# Top 10 Most Active Users
print("📊 Top 10 Most Active Users")
print("-" * 100)

if 'fact_activity' in tables and 'dim_user' in tables:
    fact = tables['fact_activity']
    dim_user = tables['dim_user']
    
    # Use record_count for accurate totals
    user_stats = fact.groupby('user_sk').agg(
        activity_count=('record_count', 'sum'),
        unique_items=('item_sk', 'nunique'),
        unique_workspaces=('workspace_sk', 'nunique'),
        failed=('is_failed', 'sum')
    ).reset_index()
    
    # Join with user dimension (correct column names: user_principal_name, domain)
    user_stats = user_stats.merge(
        dim_user[['user_sk', 'user_principal_name', 'domain', 'user_type']], 
        on='user_sk'
    )
    
    top_users = user_stats.sort_values('activity_count', ascending=False).head(10)
    
    print(f"{'User':<35} | {'Domain':<15} | {'Type':<8} | {'Activities':>12} | {'Items':>6} | {'WS':>4}")
    print("-" * 100)
    for _, row in top_users.iterrows():
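        # NaN is truthy, so test for it explicitly before falling back
        user = str(row['user_principal_name'])[:34] if pd.notna(row['user_principal_name']) and row['user_principal_name'] else 'Unknown'
        domain = str(row['domain'])[:14] if pd.notna(row['domain']) and row['domain'] else 'N/A'
        user_type = str(row['user_type'])[:7] if pd.notna(row['user_type']) and row['user_type'] else 'N/A'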
        print(f"{user:<35} | {domain:<15} | {user_type:<8} | {int(row['activity_count']):>12,} | {int(row['unique_items']):>6} | {int(row['unique_workspaces']):>4}")

In [None]:
# Activity Volume by Day of Week
print("📊 Activity Volume by Day of Week")
print("-" * 70)

if 'fact_activity' in tables and 'dim_date' in tables:
    fact = tables['fact_activity']
    dim_date = tables['dim_date']
    
    # Use record_count for accurate totals
    daily_stats = fact.groupby('date_sk').agg(
        activity_count=('record_count', 'sum'),
        unique_users=('user_sk', 'nunique'),
        failed=('is_failed', 'sum')
    ).reset_index()
    
    daily_stats = daily_stats.merge(
        dim_date[['date_sk', 'day_of_week_name', 'day_of_week', 'is_weekend']], 
        on='date_sk'
    )
    
    dow_stats = daily_stats.groupby(['day_of_week', 'day_of_week_name', 'is_weekend']).agg({
        'activity_count': 'sum',
        'unique_users': 'mean',
        'failed': 'sum'
    }).reset_index()
    
    dow_stats = dow_stats.sort_values('day_of_week')
    
    print(f"{'Day':<12} | {'Weekend':>8} | {'Activities':>14} | {'Avg Users':>10} | {'Failed':>10}")
    print("-" * 70)
    for _, row in dow_stats.iterrows():
        day_name = str(row['day_of_week_name'])
        weekend = "Yes" if row['is_weekend'] else "No"
        print(f"{day_name:<12} | {weekend:>8} | {int(row['activity_count']):>14,} | {row['unique_users']:>10.1f} | {int(row['failed']):>10,}")

In [None]:
# Environment Comparison (DEV vs TEST vs UAT vs PRD)
print("📊 Environment Comparison")
print("-" * 95)

if 'fact_activity' in tables and 'dim_workspace' in tables:
    fact = tables['fact_activity']
    dim_ws = tables['dim_workspace']
    
    # Get workspace environments
    ws_env = dim_ws[['workspace_sk', 'environment']].drop_duplicates()
    fact_with_env = fact.merge(ws_env, on='workspace_sk', how='left')
    
    # Use record_count for accurate totals
    # dropna=False keeps activities whose workspace has no environment (shown as 'Unknown')
    env_stats = fact_with_env.groupby('environment', dropna=False).agg(
        activity_count=('record_count', 'sum'),
        unique_users=('user_sk', 'nunique'),
        unique_items=('item_sk', 'nunique'),
        unique_workspaces=('workspace_sk', 'nunique'),
        total_duration=('duration_seconds', 'sum'),
        failed=('is_failed', 'sum')
    ).reset_index()
    
    env_stats['duration_hrs'] = round(env_stats['total_duration'] / 3600, 2)
    env_stats['failure_rate'] = round(100 * env_stats['failed'] / env_stats['activity_count'], 2)
    env_stats = env_stats.sort_values('activity_count', ascending=False)
    
    print(f"{'Environment':<12} | {'Activities':>14} | {'Users':>6} | {'Items':>6} | {'WS':>4} | {'Hours':>10} | {'Fail %':>7}")
    print("-" * 95)
    for _, row in env_stats.iterrows():
        env = str(row['environment']) if pd.notna(row['environment']) and row['environment'] else 'Unknown'
        print(f"{env:<12} | {int(row['activity_count']):>14,} | {int(row['unique_users']):>6} | {int(row['unique_items']):>6} | {int(row['unique_workspaces']):>4} | {row['duration_hrs']:>10.2f} | {row['failure_rate']:>6.2f}%")

In [None]:
# Activity by Item Category
print("📊 Activity by Item Category")
print("-" * 80)

if 'fact_activity' in tables and 'dim_item' in tables:
    fact = tables['fact_activity']
    dim_item = tables['dim_item']
    
    item_cat = dim_item[['item_sk', 'item_category']].drop_duplicates()
    fact_with_cat = fact.merge(item_cat, on='item_sk', how='left')
    
    # Use record_count for accurate totals
    # dropna=False keeps activities whose item has no category (shown as 'Unknown')
    cat_stats = fact_with_cat.groupby('item_category', dropna=False).agg(
        activity_count=('record_count', 'sum'),
        unique_items=('item_sk', 'nunique'),
        unique_users=('user_sk', 'nunique'),
        failed=('is_failed', 'sum')
    ).reset_index()
    
    cat_stats['failure_rate'] = round(100 * cat_stats['failed'] / cat_stats['activity_count'], 2)
    cat_stats = cat_stats.sort_values('activity_count', ascending=False)
    
    print(f"{'Category':<18} | {'Activities':>14} | {'Items':>6} | {'Users':>6} | {'Failed':>8} | {'Fail %':>7}")
    print("-" * 80)
    for _, row in cat_stats.iterrows():
        cat = str(row['item_category'])[:17] if pd.notna(row['item_category']) and row['item_category'] else 'Unknown'
        print(f"{cat:<18} | {int(row['activity_count']):>14,} | {int(row['unique_items']):>6} | {int(row['unique_users']):>6} | {int(row['failed']):>8,} | {row['failure_rate']:>6.2f}%")

In [None]:
# Long-Running Activities Analysis
print("📊 Long-Running Activities Analysis (>1 hour)")
print("-" * 100)

if 'fact_activity' in tables and 'dim_item' in tables and 'dim_workspace' in tables:
    fact = tables['fact_activity']
    dim_item = tables['dim_item']
    dim_ws = tables['dim_workspace']
    
    # Filter for long-running activities (>1 hour = 3600 seconds)
    long_running = fact[fact['is_long_running'] == 1].copy() if 'is_long_running' in fact.columns else fact[fact['duration_seconds'] > 3600].copy()
    
    if len(long_running) > 0:
        print(f"Total long-running activities (>1 hr): {len(long_running):,}")
        print()
        
        # Enrich with names
        long_running = long_running.merge(
            dim_item[['item_sk', 'item_name', 'item_type']], on='item_sk', how='left'
        ).merge(
            dim_ws[['workspace_sk', 'workspace_name']], on='workspace_sk', how='left'
        )
        
        # Top 10 longest
        top_longest = long_running.nlargest(10, 'duration_seconds')
        
        print(f"{'Item Name':<35} | {'Type':<15} | {'Workspace':<25} | {'Duration':>12}")
        print("-" * 100)
        for _, row in top_longest.iterrows():
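            # NaN (from the left joins) is truthy, so test for it explicitly
            item_name = str(row['item_name'])[:34] if pd.notna(row['item_name']) and row['item_name'] else 'Unknown'
            item_type = str(row['item_type'])[:14] if pd.notna(row['item_type']) and row['item_type'] else 'N/A'
            ws_name = str(row['workspace_name'])[:24] if pd.notna(row['workspace_name']) and row['workspace_name'] else 'Unknown'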
            duration_hrs = round(row['duration_seconds'] / 3600, 2)
            print(f"{item_name:<35} | {item_type:<15} | {ws_name:<25} | {duration_hrs:>10} hrs")
    else:
        print("No long-running activities found (>1 hour)")

In [None]:
# Cross-Workspace Usage Patterns (Workspaces shared by multiple users)
print("📊 Cross-Workspace Usage Patterns")
print("-" * 80)

if 'fact_activity' in tables and 'dim_workspace' in tables:
    fact = tables['fact_activity']
    dim_ws = tables['dim_workspace']
    
    # Calculate collaboration metrics per workspace
    # Use record_count sum instead of activity_id count (activity_id is NULL for granular operations)
    ws_collab = fact.groupby('workspace_sk').agg(
        unique_users=('user_sk', 'nunique'),
        unique_items=('item_sk', 'nunique'),
        activity_count=('record_count', 'sum')
    ).reset_index()
    
    ws_collab = ws_collab.merge(
        dim_ws[['workspace_sk', 'workspace_name', 'environment']], on='workspace_sk'
    )
    
    # Sort by user count (collaboration indicator)
    ws_collab = ws_collab.sort_values('unique_users', ascending=False).head(10)
    ws_collab['activities_per_user'] = round(ws_collab['activity_count'] / ws_collab['unique_users'], 1)
    
    print(f"{'Workspace':<40} | {'Env':<7} | {'Users':>6} | {'Items':>6} | {'Activities':>10} | {'Act/User':>8}")
    print("-" * 95)
    for _, row in ws_collab.iterrows():
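        # NaN is truthy, so test for it explicitly before falling back
        ws_name = str(row['workspace_name'])[:39] if pd.notna(row['workspace_name']) and row['workspace_name'] else 'Unknown'
        env = str(row['environment'])[:6] if pd.notna(row['environment']) and row['environment'] else 'N/A'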
        print(f"{ws_name:<40} | {env:<7} | {int(row['unique_users']):>6} | {int(row['unique_items']):>6} | {int(row['activity_count']):>10,} | {row['activities_per_user']:>8}")

In [None]:
# Monthly Trend Analysis
print("📊 Monthly Activity Trend")
print("-" * 70)

if 'fact_activity' in tables and 'dim_date' in tables:
    fact = tables['fact_activity']
    dim_date = tables['dim_date']
    
    # Join with date dimension (use month_number not month)
    fact_dated = fact.merge(
        dim_date[['date_sk', 'year', 'month_number', 'month_name']], on='date_sk'
    )
    
    # Use record_count sum instead of activity_id count
    monthly_stats = fact_dated.groupby(['year', 'month_number', 'month_name']).agg(
        activity_count=('record_count', 'sum'),
        unique_users=('user_sk', 'nunique'),
        unique_items=('item_sk', 'nunique'),
        failed=('is_failed', 'sum')
    ).reset_index()
    
    monthly_stats = monthly_stats.sort_values(['year', 'month_number'], ascending=False).head(12)
    monthly_stats['failure_rate'] = round(100 * monthly_stats['failed'] / monthly_stats['activity_count'], 2)
    
    print(f"{'Month':<15} | {'Activities':>12} | {'Users':>6} | {'Items':>6} | {'Failed':>8} | {'Fail %':>7}")
    print("-" * 70)
    for _, row in monthly_stats.iterrows():
        month_label = f"{row['month_name'][:3]} {int(row['year'])}"
        print(f"{month_label:<15} | {int(row['activity_count']):>12,} | {int(row['unique_users']):>6} | {int(row['unique_items']):>6} | {int(row['failed']):>8,} | {row['failure_rate']:>6.2f}%")

## Summary

The star schema has been built and is ready for:

1. **SQL Queries** - Query the parquet files directly or use Delta tables via SQL Endpoint
2. **Semantic Model** - Create a Direct Lake model pointing to these tables
3. **Power BI Reports** - Build monitoring dashboards using the semantic model

### Scheduled Refresh
To automate the star schema build:
1. Schedule this notebook to run daily after the Monitor Hub pipeline
2. Or use a Fabric Data Pipeline to orchestrate both steps

### Table Relationships (for Semantic Model)
```
fact_activity[date_sk] → dim_date[date_sk]
fact_activity[time_sk] → dim_time[time_sk]
fact_activity[workspace_sk] → dim_workspace[workspace_sk]
fact_activity[item_sk] → dim_item[item_sk]
fact_activity[user_sk] → dim_user[user_sk]
fact_activity[activity_type_sk] → dim_activity_type[activity_type_sk]
fact_activity[status_sk] → dim_status[status_sk]
```
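
The relationships above can also be exercised in pandas before a semantic model exists. The sketch below joins each loaded dimension onto the fact table along its surrogate key; `denormalize` is a hypothetical helper, and the toy tables carry only a few of the real columns. Passing `validate="many_to_one"` makes the merge raise if a dimension key is duplicated, which mirrors the one-to-many cardinality a semantic model relationship expects.

```python
import pandas as pd

# (fact column, dimension table) pairs taken from the relationship list above;
# each dimension is keyed by a column with the same name as the fact column
RELATIONSHIPS = [
    ("date_sk", "dim_date"),
    ("time_sk", "dim_time"),
    ("workspace_sk", "dim_workspace"),
    ("item_sk", "dim_item"),
    ("user_sk", "dim_user"),
    ("activity_type_sk", "dim_activity_type"),
    ("status_sk", "dim_status"),
]


def denormalize(fact: pd.DataFrame, dims: dict) -> pd.DataFrame:
    """Left-join every loaded dimension onto the fact table along its surrogate key."""
    wide = fact
    for key_col, dim_name in RELATIONSHIPS:
        if dim_name in dims and key_col in wide.columns:
            wide = wide.merge(dims[dim_name], on=key_col, how="left",
                              validate="many_to_one")
    return wide


# Toy tables for illustration only
fact = pd.DataFrame({
    "date_sk": [20251217, 20251217],
    "status_sk": [1, 2],
    "record_count": [1, 1],
})
dims = {
    "dim_date": pd.DataFrame({"date_sk": [20251217], "full_date": ["2025-12-17"]}),
    "dim_status": pd.DataFrame({"status_sk": [1, 2], "status_code": ["Succeeded", "Failed"]}),
}

wide = denormalize(fact, dims)
print(wide)
```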