# ConnectWise to Microsoft Fabric ETL Notebook

This notebook shows how to run the ConnectWise PSA to Microsoft Fabric OneLake integration pipeline. The pipeline writes Delta tables directly to OneLake, using a consistent table naming convention and entity-specific partitioning.

## Setup

First, install the package and any dependencies:

In [ ]:
# Install the package (adjust path as needed)
%pip install /lakehouse/Files/dist/fabric_api-0.1.0-py3-none-any.whl

# Optional: Install any additional dependencies
%pip install delta-spark sparkdantic

## Set Environment Variables

Configure the environment variables for API access and logging:

In [ ]:
import os
import logging

# Configure logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s")

# Set environment variables directly in the notebook for testing only.
# In production, read these from Key Vault instead of hard-coding them.
os.environ["CW_COMPANY"] = "your_company_name"  # Replace with actual value
os.environ["CW_PUBLIC_KEY"] = "your_public_key"  # Replace with actual value
os.environ["CW_PRIVATE_KEY"] = "your_private_key"  # Replace with actual value
os.environ["CW_CLIENTID"] = "your_client_id"  # Replace with actual value

# Set Fabric storage environment variables - these help with path resolution
os.environ["FABRIC_STORAGE_ACCOUNT"] = "your_fabric_storage"  # Replace with your storage account name
os.environ["FABRIC_TENANT_ID"] = "your_tenant_id"  # Optional, can be obtained from workspace settings

# Required ConnectWise credentials
required_vars = [
    "CW_COMPANY", 
    "CW_PUBLIC_KEY",
    "CW_PRIVATE_KEY",
    "CW_CLIENTID"
]

# Verify all required variables are available
missing_vars = [var for var in required_vars if not os.getenv(var)]
if missing_vars:
    raise ValueError(f"Missing required environment variables: {', '.join(missing_vars)}. "
                     "Please add these secrets to your workspace Key Vault or set them directly in the notebook.")

print("Environment variables configured successfully.")
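
The cell above hard-codes placeholder values and notes that production secrets should come from Key Vault. A minimal sketch of that pattern follows. `load_secrets_into_env` and its `get_secret` parameter are hypothetical names introduced for illustration, not part of `fabric_api`, and the `DEMO_` variable names are used only so the example does not overwrite real credentials.

```python
import os
from typing import Callable, Iterable, Optional


def load_secrets_into_env(
    names: Iterable[str],
    get_secret: Callable[[str], Optional[str]],
    overwrite: bool = False,
) -> list:
    """Copy secrets into os.environ and return the names that were set.

    Hypothetical helper: `get_secret` is any callable that maps a secret
    name to its value, for example a thin wrapper around a Key Vault client.
    """
    loaded = []
    for name in names:
        if not overwrite and os.environ.get(name):
            continue  # keep values that are already configured
        value = get_secret(name)
        if value:
            os.environ[name] = value
            loaded.append(name)
    return loaded


# Stand-in secret store so the sketch runs anywhere; replace with a real getter.
fake_vault = {"DEMO_CW_COMPANY": "acme", "DEMO_CW_CLIENTID": "abc-123"}
loaded = load_secrets_into_env(
    ["DEMO_CW_COMPANY", "DEMO_CW_CLIENTID", "DEMO_CW_MISSING"],
    fake_vault.get,
    overwrite=True,
)
print(f"Loaded from vault: {loaded}")
```

In a Fabric notebook the getter could, for instance, be a lambda around `notebookutils.credentials.getSecret` with your own Key Vault URL, assuming that utility is available in your runtime. Names that the store cannot resolve are skipped, so the `missing_vars` check above still catches them.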

## Simple ETL Execution

Run a full ETL process to extract all entities and load them to OneLake:

In [ ]:
from pyspark.sql import SparkSession
from fabric_api.bronze_loader import process_all_entities, process_entities

# Get or create SparkSession
spark = SparkSession.getActiveSession() or SparkSession.builder.getOrCreate()

# Set the Bronze layer path in OneLake
# This will be automatically formatted for Fabric using ensure_fabric_path()
bronze_path = "/lakehouse/default/Tables/psa_bronze"

# Process all configured entities
results = process_all_entities(
    spark=spark,
    bronze_path=bronze_path,
    page_size=100,
    max_pages=10,  # Limit for testing, set to None for all pages in production
    write_mode="append"
)

# Display the results
print("\nETL Results:")
for entity_name, (df, errors) in results.items():
    print(f"  {entity_name}: {df.count()} records, {len(errors)} validation errors")
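
The `page_size` and `max_pages` arguments above cap how much data is pulled per entity (at most `page_size * max_pages` records). The sketch below illustrates how such a cap typically interacts with pagination. It is not the implementation inside `fabric_api`; `fetch_all_pages` and `fake_fetch` are hypothetical names used only for this example.

```python
from typing import Callable, Optional


def fetch_all_pages(
    fetch_page: Callable[[int, int], list],
    page_size: int = 100,
    max_pages: Optional[int] = None,
) -> list:
    """Collect records page by page until a short page or the page cap is hit."""
    records = []
    page = 1
    while max_pages is None or page <= max_pages:
        batch = fetch_page(page, page_size)
        records.extend(batch)
        if len(batch) < page_size:
            break  # a short (or empty) page means there is no more data
        page += 1
    return records


# Fake API holding 250 records, used only to demonstrate the loop.
data = list(range(250))


def fake_fetch(page: int, page_size: int) -> list:
    start = (page - 1) * page_size
    return data[start:start + page_size]


capped = fetch_all_pages(fake_fetch, page_size=100, max_pages=2)
everything = fetch_all_pages(fake_fetch, page_size=100, max_pages=None)
print(len(capped), len(everything))  # 200 250
```

With `max_pages=2` the loop stops after 200 records even though more exist, which is why the cap should be lifted (`None`) for production loads.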

## Incremental ETL with Date Filtering

Run an incremental ETL process to extract data for a specific date range:

In [ ]:
from datetime import datetime, timedelta
import fabric_api.api_utils as api_utils

# Set date range for incremental load (last 30 days).
# Capture the current time once so both bounds derive from the same instant.
now = datetime.now()
end_date = now.strftime("%Y-%m-%d")
start_date = (now - timedelta(days=30)).strftime("%Y-%m-%d")

# Build one date condition and apply it to each date-filtered entity type
date_condition = api_utils.build_condition_string(date_gte=start_date, date_lte=end_date)
conditions = {
    name: date_condition
    for name in ["TimeEntry", "ExpenseEntry", "PostedInvoice", "UnpostedInvoice"]
}

# Process specific entities with date filtering
incremental_results = process_entities(
    spark=spark,
    entity_names=["TimeEntry", "ExpenseEntry", "PostedInvoice", "UnpostedInvoice"],
    bronze_path=bronze_path,
    page_size=100,
    max_pages=10,  # Limit for testing
    conditions=conditions,  # Apply our date filters
    write_mode="append"    # Use append mode for incremental loading
)

# Display the results
from fabric_api import log_utils as log

print("\nIncremental ETL Results:")
for entity_name, (df, errors) in incremental_results.items():
    print(f"  {entity_name}: {df.count()} records, {len(errors)} validation errors")

    # Show validation error summary if errors exist
    if errors:
        error_summary = log.summarize_validation_errors(errors)
        print(f"  Error types: {error_summary.get('fields', [])}")
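
The notebook does not show what `build_condition_string` returns. As a rough illustration only, the sketch below builds a date filter in the style of ConnectWise `conditions` query strings, where date literals are wrapped in square brackets. `build_date_condition`, `trailing_window`, and the field name `date` are assumptions made for this example and may differ from what `api_utils` actually produces.

```python
from datetime import date, timedelta
from typing import Optional, Tuple


def build_date_condition(
    field: str,
    date_gte: Optional[str] = None,
    date_lte: Optional[str] = None,
) -> str:
    """Join optional lower and upper date bounds into one condition string."""
    parts = []
    if date_gte:
        parts.append(f"{field} >= [{date_gte}]")
    if date_lte:
        parts.append(f"{field} <= [{date_lte}]")
    return " and ".join(parts)


def trailing_window(days: int, today: date) -> Tuple[str, str]:
    """Return (start, end) ISO dates for the `days` leading up to `today`."""
    start = today - timedelta(days=days)
    return start.isoformat(), today.isoformat()


window_start, window_end = trailing_window(30, date(2024, 3, 31))
condition = build_date_condition("date", date_gte=window_start, date_lte=window_end)
print(condition)  # date >= [2024-03-01] and date <= [2024-03-31]
```

Passing `today` explicitly keeps the window reproducible, which is convenient when re-running a failed incremental load for a past period.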

## Advanced: Processing Specific Entities

If you need more control over the ETL process, you can use the lower-level APIs:

In [ ]:
# Process specific entities with custom configuration
from fabric_api.extract import fetch_active_agreements
from fabric_api.extract._common import validate_batch
from fabric_api.connectwise_models import Agreement
from fabric_api.client import ConnectWiseClient

# Create a ConnectWise client 
client = ConnectWiseClient()

# Example of manually fetching and processing data
print("\nManual processing example - Active Agreements:")

# 1. Fetch raw data directly
raw_agreements = fetch_active_agreements(client, max_pages=5)
print(f"  Fetched {len(raw_agreements)} active agreements")

# 2. Validate against schema
valid_agreements, errors = validate_batch(raw_agreements, Agreement)
print(f"  Validation: {len(valid_agreements)} valid, {len(errors)} invalid")

# 3. Convert to DataFrame (optional)
if valid_agreements:
    from fabric_api.bronze_loader import create_dataframe
    df = create_dataframe(spark, valid_agreements, Agreement)
    
    print("\nSample data:")
    df.select("id", "name", "type").show(5, truncate=False)
    
    # 4. Write to custom location if needed
    custom_path = "/lakehouse/default/Tables/psa_bronze/cw_active_agreements"
    
    from fabric_api.bronze_loader import write_to_delta, add_fabric_metadata, register_table_metadata
    
    # Add metadata
    df = add_fabric_metadata(df, "ActiveAgreement", "ConnectWise active agreements only")
    
    # Write to Delta
    write_to_delta(
        df=df,
        table_path=custom_path,
        partition_cols=["type"],
        mode="overwrite"
    )
    
    # Register table
    register_table_metadata(
        spark=spark, 
        table_path=custom_path,
        entity_name="ActiveAgreement",
        description="ConnectWise active agreements only"
    )
    
    print(f"\nWrote {df.count()} active agreements to {custom_path}")
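
Step 2 above splits raw records into valid models and validation errors. The following sketch shows the general shape of that pattern using only the standard library. It is not the code behind `validate_batch`; `AgreementSketch` and `validate_batch_sketch` are hypothetical stand-ins, and the real `Agreement` model has many more fields.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class AgreementSketch:
    """Hypothetical, trimmed-down stand-in for the Agreement model."""
    id: int
    name: str

    @classmethod
    def from_raw(cls, raw: Dict[str, Any]) -> "AgreementSketch":
        if not isinstance(raw.get("id"), int):
            raise ValueError("id must be an integer")
        if not raw.get("name"):
            raise ValueError("name is required")
        return cls(id=raw["id"], name=raw["name"])


def validate_batch_sketch(
    raw_records: List[Dict[str, Any]],
    parse: Callable[[Dict[str, Any]], Any],
) -> Tuple[list, list]:
    """Parse each record, collecting failures instead of aborting the batch."""
    valid, errors = [], []
    for index, raw in enumerate(raw_records):
        try:
            valid.append(parse(raw))
        except (ValueError, TypeError, KeyError) as exc:
            errors.append({"index": index, "error": str(exc), "raw": raw})
    return valid, errors


demo_raw = [
    {"id": 1, "name": "Managed Services"},
    {"id": "x", "name": "Bad id"},
    {"id": 3},
]
demo_valid, demo_errors = validate_batch_sketch(demo_raw, AgreementSketch.from_raw)
print(len(demo_valid), len(demo_errors))  # 1 2
```

Keeping the failing raw record alongside the error message is what makes a dedicated errors table useful for later diagnosis.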

## Query the Loaded Data

Once the data is loaded, you can query it using Spark SQL:

In [ ]:
# Query the loaded data using Spark SQL
from fabric_api.bronze_loader import ENTITY_CONFIG
import os

# Get the table names from our configuration
agreement_table = os.path.basename(ENTITY_CONFIG["Agreement"]["output_table"])
posted_invoice_table = os.path.basename(ENTITY_CONFIG["PostedInvoice"]["output_table"])
time_table = os.path.basename(ENTITY_CONFIG["TimeEntry"]["output_table"])

print(f"Querying tables: {agreement_table}, {posted_invoice_table}, {time_table}")

# Query agreements
agreement_df = spark.sql(f"""
SELECT id, name, type, agreementType, billingCycle, startDate, endDate
FROM {agreement_table}
WHERE startDate IS NOT NULL
ORDER BY startDate DESC
LIMIT 10
""")

# Display the results
print("\nSample Agreements:")
agreement_df.show(truncate=False)

# Example of using the ETL metadata columns (etl_entity_type, etl_timestamp)
metadata_df = spark.sql(f"""
SELECT 
    COUNT(*) as count,
    etl_entity_type,
    DATE(etl_timestamp) as load_date
FROM {agreement_table}
GROUP BY etl_entity_type, DATE(etl_timestamp)
ORDER BY load_date DESC
""")

print("\nData Load Summary:")
metadata_df.show()

# Example complex query joining multiple tables
print("\nAttempting to run a join query - this will only work if all tables exist:")
try:
    joined_df = spark.sql(f"""
    SELECT 
        i.id as invoice_id,
        i.identifier as invoice_number,
        i.total as invoice_total,
        t.id as time_entry_id,
        t.hours as hours_worked,
        t.timeStart as work_date,
        t.billableOption as billing_option
    FROM {posted_invoice_table} i
    JOIN {time_table} t ON t.invoice_id = i.id
    WHERE i.total > 0
    ORDER BY i.date DESC
    LIMIT 10
    """)
    
    print("Sample Joined Data:")
    joined_df.show(truncate=False)
except Exception as e:
    print(f"Join query failed (this is normal if you haven't loaded all tables): {str(e)}")

# Example of querying the validation errors table
try:
    validation_errors_df = spark.sql("""
    SELECT 
        entity,
        COUNT(*) as error_count,
        COLLECT_SET(error_type) as error_types
    FROM cw_validation_errors
    GROUP BY entity
    """)
    
    print("\nValidation Error Summary:")
    validation_errors_df.show(truncate=False)
except Exception as e:
    print(f"Validation errors query failed (this is normal if no errors exist): {str(e)}")
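
The `try`/`except` blocks above rely on a failed query to signal a missing table. An alternative is to check the catalog first. The sketch below assumes a Spark version that provides `spark.catalog.tableExists` (PySpark 3.3 or later); `missing_tables` is a hypothetical helper, and the fake session exists only so the example runs without a cluster.

```python
from types import SimpleNamespace


def missing_tables(spark, table_names):
    """Return the table names that are not registered in the Spark catalog."""
    return [name for name in table_names if not spark.catalog.tableExists(name)]


# Stand-in for a SparkSession so the sketch runs without a cluster.
registered = {"cw_agreement", "cw_time_entry"}
fake_spark = SimpleNamespace(
    catalog=SimpleNamespace(tableExists=lambda name: name in registered)
)

absent = missing_tables(fake_spark, ["cw_posted_invoice", "cw_time_entry"])
print(absent)  # ['cw_posted_invoice']
```

In the notebook, the same function could be called with the real `spark` session and the table names read from `ENTITY_CONFIG`, running the join only when the returned list is empty.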

## Fabric-Native Features

The modernized pipeline includes several Fabric-specific optimizations:

1. **Standardized Naming Convention**: All tables use a consistent `cw_` prefix (e.g., `cw_agreement`, `cw_time_entry`)

2. **Entity-Specific Partitioning**:
   - Time-related tables (`cw_time_entry`, `cw_expense_entry`, `cw_posted_invoice`) - Partitioned by date
   - Agreement tables (`cw_agreement`) - Partitioned by type
   - Error tables (`cw_validation_errors`) - Partitioned by entity

3. **Rich Metadata**: All tables include standardized metadata columns:
   - `etl_entity_name`: Source system name ("ConnectWise")
   - `etl_entity_type`: Entity type (e.g., "Agreement", "TimeEntry")
   - `etl_timestamp`: When the data was loaded
   - `etl_version`: Pipeline version

4. **Delta Optimizations**:
   - `delta.autoOptimize.optimizeWrite` - Optimizes write performance
   - `delta.autoOptimize.autoCompact` - Auto-compacts small files
   - `mergeSchema` - Supports schema evolution

5. **Automatic Table Registration**: Tables are automatically registered in the Fabric catalog for easy discovery

6. **ABFSS Path Construction**: Direct writes to OneLake using proper ABFSS URLs when in Fabric

7. **Comprehensive Validation**: Invalid records are tracked in a dedicated `cw_validation_errors` table

These features ensure that data is immediately available for analysis in Fabric without any post-processing steps.
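
To make the naming convention (item 1) and ABFSS path construction (item 6) concrete, here is a small sketch. Both functions are hypothetical illustrations and not the pipeline's own helpers; the path layout follows the general OneLake ABFSS URL format, and the workspace and lakehouse names are placeholders.

```python
import re


def entity_to_table_name(entity_name: str, prefix: str = "cw_") -> str:
    """Convert a PascalCase entity name to a prefixed snake_case table name."""
    snake = re.sub(r"(?<!^)(?=[A-Z])", "_", entity_name).lower()
    return f"{prefix}{snake}"


def onelake_table_path(workspace: str, lakehouse: str, table: str) -> str:
    """Build an ABFSS URL for a table in a lakehouse's Tables area."""
    return (
        f"abfss://{workspace}@onelake.dfs.fabric.microsoft.com/"
        f"{lakehouse}.Lakehouse/Tables/{table}"
    )


table_name = entity_to_table_name("TimeEntry")
table_path = onelake_table_path("MyWorkspace", "MyLakehouse", table_name)
print(table_name)  # cw_time_entry
print(table_path)
```

The pipeline's `ensure_fabric_path()` may resolve paths differently (for example, by including the `psa_bronze` folder used earlier), so treat the output here as an illustration of the URL shape only.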