# BI & Analytics Integrations

**Training Objective:** Integrating Databricks Lakehouse with BI tools and sharing data with business users

**Topics Covered:**
- SQL Warehouses: Serverless, Pro, Classic
- Databricks Genie (AI/BI) - natural language queries
- External Integrations: Power BI, Dremio
- Preparing data for BI layer

## Theoretical Introduction

**Section Objective:** Understand how to expose Lakehouse data to business users and external tools.

**Basic Concepts:**
- **SQL Warehouse**: Optimized compute engine for SQL queries (not for Python/Scala code).
- **Genie**: Intelligent data assistant that understands table structure and answers natural language questions.
- **Direct Lake**: Power BI (Microsoft Fabric) storage mode that reads the Parquet files of Delta tables directly from OneLake, bypassing the SQL layer; usually the lowest-latency option.

## User Isolation

In [0]:
%run ../00_setup

## Environment Configuration

In [0]:
from pyspark.sql import functions as F

# Set catalog and schema
spark.sql(f"USE CATALOG {CATALOG}")
spark.sql(f"USE SCHEMA {GOLD_SCHEMA}")

display(spark.createDataFrame([
    ("Catalog", CATALOG),
    ("Schema", GOLD_SCHEMA)
], ["Parameter", "Value"]))

## Databricks SQL Warehouses

SQL Warehouses are the "heart" of the BI layer in Databricks.

### Warehouse Types

1. **Serverless**: Starts in seconds and scales automatically. Recommended for BI workloads.
2. **Pro**: Runs on VMs in your own cloud account and supports the Photon engine, but takes minutes rather than seconds to start.
3. **Classic**: Older VM-based architecture with a more limited feature set.

In [0]:
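# Example warehouse configuration as a Python dict
# The field names follow the SQL Warehouses REST API, so it can serve as a starting point for automated creation
# (note: the API expects tags as a "custom_tags" list of key/value pairs, not a flat dict)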

warehouse_config = {
    "name": "Warehouse_Demo",
    "cluster_size": "2X-Small",
    "min_num_clusters": 1,
    "max_num_clusters": 2,
    "auto_stop_mins": 10,
    "enable_serverless_compute": True,
    "warehouse_type": "PRO",
    "tags": {
        "department": "Sales",
        "cost_center": "1234"
    }
}

# Display configuration as DataFrame
config_df = spark.createDataFrame([
    ("name", warehouse_config["name"]),
    ("cluster_size", warehouse_config["cluster_size"]),
    ("min_num_clusters", str(warehouse_config["min_num_clusters"])),
    ("max_num_clusters", str(warehouse_config["max_num_clusters"])),
    ("auto_stop_mins", str(warehouse_config["auto_stop_mins"])),
    ("enable_serverless_compute", str(warehouse_config["enable_serverless_compute"])),
    ("warehouse_type", warehouse_config["warehouse_type"]),
    ("tags.department", warehouse_config["tags"]["department"]),
    ("tags.cost_center", warehouse_config["tags"]["cost_center"])
], ["Parameter", "Value"])

display(config_df)
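
The dictionary above is close to, but not identical to, a request body for the SQL Warehouses REST API. The following is a minimal sketch, assuming the `POST /api/2.0/sql/warehouses` endpoint and its `custom_tags` list shape for tags. The host name and token are placeholders, the helper names are hypothetical, and the request is only built, never sent.

```python
import json
import urllib.request


def to_api_payload(config: dict) -> dict:
    """Copy the config, converting flat tags into the custom_tags list shape."""
    payload = {key: value for key, value in config.items() if key != "tags"}
    flat_tags = config.get("tags", {})
    if flat_tags:
        payload["tags"] = {
            "custom_tags": [{"key": k, "value": v} for k, v in flat_tags.items()]
        }
    return payload


def build_create_request(host: str, token: str, config: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request that would create the warehouse."""
    return urllib.request.Request(
        url=f"https://{host}/api/2.0/sql/warehouses",
        data=json.dumps(to_api_payload(config)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Same values as the notebook's warehouse_config, repeated so the sketch is self-contained
example_config = {
    "name": "Warehouse_Demo",
    "cluster_size": "2X-Small",
    "min_num_clusters": 1,
    "max_num_clusters": 2,
    "auto_stop_mins": 10,
    "enable_serverless_compute": True,
    "warehouse_type": "PRO",
    "tags": {"department": "Sales", "cost_center": "1234"},
}

# Placeholder host and token (assumptions, not real values)
request = build_create_request(
    "example-workspace.cloud.databricks.com", "example-token", example_config
)
print(request.get_method(), request.full_url)
print(json.dumps(to_api_payload(example_config), indent=2))
```

Sending the request (for example with `urllib.request.urlopen(request)`) would create a real, billable warehouse, so the sketch stops at building it.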

## Databricks Genie (AI/BI)

Genie lets users ask questions about their data in natural language, without knowing SQL.

### Preparing Data for Genie

For Genie to work well, we need to ensure metadata (comments on tables and columns) is present.

In [0]:
%python
# Add descriptive comments to the Gold fact table and its amount column
spark.sql(
    f"""
    COMMENT ON TABLE {CATALOG}.{GOLD_SCHEMA}.fact_sales IS 
    'Fact table containing sales transactions. Contains amounts, dates, and foreign keys to dimensions.'
    """
)

spark.sql(
    f"""
    COMMENT ON COLUMN {CATALOG}.{GOLD_SCHEMA}.fact_sales.net_amount IS 
    'Net order value in PLN.'
    """
)

display(
    spark.createDataFrame(
        [
            ("Metadata", "Updated"),
            ("Goal", "Genie will now better understand this data")
        ],
        ["Status", "Value"]
    )
)
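
When many columns need descriptions, the statements can be generated from a dictionary instead of being written by hand. This is a minimal sketch: the helper names, the `main.gold` catalog and schema, and the column descriptions are illustrative assumptions, and the quoting assumes Spark SQL's default backslash escaping inside string literals.

```python
def spark_sql_literal(text: str) -> str:
    """Quote text as a SQL string literal using backslash escapes (Spark SQL default)."""
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def comment_statements(table: str, table_comment: str, column_comments: dict) -> list:
    """Return one COMMENT ON statement for the table and one per column."""
    statements = [f"COMMENT ON TABLE {table} IS {spark_sql_literal(table_comment)}"]
    for column, comment in column_comments.items():
        statements.append(
            f"COMMENT ON COLUMN {table}.{column} IS {spark_sql_literal(comment)}"
        )
    return statements


# Hypothetical table name and descriptions, for illustration only
statements = comment_statements(
    "main.gold.fact_sales",
    "Fact table containing sales transactions.",
    {
        "order_date": "Date the order was placed.",
        "customer_id": "Foreign key to dim_customer.",
    },
)

for statement in statements:
    print(statement)  # in a notebook: spark.sql(statement)
```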

## External Integrations (Power BI & Iceberg)

### Power BI - Direct Lake vs DirectQuery

- **Direct Lake**: Power BI Service -> OneLake (Delta/Parquet). Requires Microsoft Fabric, with the Databricks tables made available in OneLake.
- **DirectQuery**: Power BI -> SQL Warehouse -> Storage. Each visual sends a live SQL query, so the warehouse must be running.
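
For the SQL Warehouse path, a BI client needs two values: the workspace server hostname and the warehouse's HTTP path. A minimal sketch, assuming the usual `/sql/1.0/warehouses/<warehouse-id>` path format; the function name, host, and warehouse ID below are hypothetical placeholders.

```python
def warehouse_connection(server_hostname: str, warehouse_id: str) -> dict:
    """Return the two connection values a BI client asks for."""
    return {
        "server_hostname": server_hostname,
        "http_path": f"/sql/1.0/warehouses/{warehouse_id}",
    }


# Placeholder values (assumptions, not a real workspace or warehouse)
connection = warehouse_connection(
    "example-workspace.cloud.databricks.com", "abc123def456"
)
print(connection["server_hostname"])
print(connection["http_path"])
```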

### Unity Catalog Iceberg Endpoint

**Dremio** connects to Databricks via the **Unity Catalog Iceberg REST Catalog endpoint**. 
This requires enabling **UniForm (Iceberg reads)** on Delta tables.

#### How does it work?

```
Dremio → Unity Catalog (Iceberg REST API) → Delta Table with UniForm → Parquet files
```

Delta Lake and Iceberg use the same Parquet files - UniForm only generates additional Iceberg metadata without copying data.

#### Requirements:
1. Table registered in **Unity Catalog** (managed or external)
2. **Column mapping** enabled (`delta.columnMapping.mode = 'name'`)
3. **Databricks Runtime 14.3 LTS+** for writing
4. Table **without deletion vectors** (or use REORG to remove them)
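
Requirements 2 and 4 can be checked from table properties before enabling UniForm. The sketch below is an illustration under assumptions: `uniform_blockers` is a hypothetical helper, the property values are assumed to arrive as strings (as `SHOW TBLPROPERTIES` returns them), and requirements 1 and 3 are not checked.

```python
def uniform_blockers(table_properties: dict) -> list:
    """List the reasons a table is not yet ready for UniForm (empty list = ready)."""
    blockers = []

    # Requirement 2: column mapping must be enabled ('name' or 'id' mode)
    mapping_mode = table_properties.get("delta.columnMapping.mode", "none")
    if mapping_mode not in ("name", "id"):
        blockers.append("column mapping is not enabled")

    # Requirement 4: deletion vectors must be off
    deletion_vectors = table_properties.get("delta.enableDeletionVectors", "false")
    if str(deletion_vectors).lower() == "true":
        blockers.append("deletion vectors are enabled")

    return blockers


# In a notebook the properties could be collected like this (not run here):
#   rows = spark.sql(f"SHOW TBLPROPERTIES {table}").collect()
#   props = {row["key"]: row["value"] for row in rows}
ready_props = {"delta.columnMapping.mode": "name"}
blocked_props = {"delta.enableDeletionVectors": "true"}

print(uniform_blockers(ready_props))
print(uniform_blockers(blocked_props))
```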

In [0]:
# Step 1: Enable UniForm when creating a new table

spark.sql(f"""
CREATE TABLE IF NOT EXISTS {CATALOG}.{GOLD_SCHEMA}.fact_sales_iceberg
TBLPROPERTIES (
    'delta.columnMapping.mode' = 'name',
    'delta.enableIcebergCompatV2' = 'true',
    'delta.universalFormat.enabledFormats' = 'iceberg'
)
AS SELECT * FROM {CATALOG}.{GOLD_SCHEMA}.fact_sales
""")

In [0]:
# Step 2: Enable UniForm on an existing table (ALTER TABLE)

spark.sql(f"""
ALTER TABLE {CATALOG}.{GOLD_SCHEMA}.dim_customer SET TBLPROPERTIES (
    'delta.columnMapping.mode' = 'name',
    'delta.enableIcebergCompatV2' = 'true',
    'delta.universalFormat.enabledFormats' = 'iceberg'
)
""")

In [0]:
# Step 3: If the table has Deletion Vectors - use REORG
# REORG removes deletion vectors and enables UniForm in one step

# spark.sql(f"""
# REORG TABLE {CATALOG}.{GOLD_SCHEMA}.fact_sales 
# APPLY (UPGRADE UNIFORM(ICEBERG_COMPAT_VERSION=2))
# """)

In [0]:
# Step 4: Check Iceberg metadata generation status

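result = spark.sql(f"DESCRIBE EXTENDED {CATALOG}.{GOLD_SCHEMA}.dim_customer")
# Match case-insensitively so capitalized rows (e.g. a section header) are not missed
display(result.filter("lower(col_name) LIKE '%iceberg%' OR lower(col_name) LIKE '%uniform%' OR lower(col_name) LIKE '%converted%'"))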

#### Dremio Configuration - Unity Catalog Iceberg REST

Dremio connects via **Iceberg REST Catalog API** - **does NOT require SQL Warehouse**.

| Connection | SQL Warehouse? | How it works |
|------------|----------------|--------------|
| Power BI / Tableau (ODBC/JDBC) | YES | Queries run on the SQL Warehouse |
| Dremio / Snowflake / Trino (Iceberg REST) | NO | The client reads Parquet files directly, using metadata and credentials from the API |

**Configuration in Dremio:**

1. **Add Source** → Iceberg / REST Catalog
2. **Endpoint URI**: 
   ```
   https://<workspace-url>/api/2.1/unity-catalog/iceberg-rest
   ```
3. **Warehouse** (= catalog name): `<uc-catalog-name>` e.g. `main`
4. **Authentication**: Personal Access Token or OAuth2 (Service Principal)

Unity Catalog uses **credential vending** - it passes temporary credentials to storage (S3/ADLS), so Dremio reads Parquet files directly.
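
The values entered in Dremio map onto a standard Iceberg REST catalog handshake. As a minimal sketch, the snippet below builds the `GET /v1/config?warehouse=<catalog>` request defined by the Iceberg REST specification against the endpoint shown above. The function names, host, and token are placeholder assumptions, and the request is only constructed, not sent.

```python
import urllib.parse
import urllib.request


def iceberg_rest_uri(workspace_host: str) -> str:
    """Endpoint URI to enter in the Iceberg client (e.g. Dremio)."""
    return f"https://{workspace_host}/api/2.1/unity-catalog/iceberg-rest"


def build_config_request(workspace_host: str, catalog: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the catalog configuration request."""
    query = urllib.parse.urlencode({"warehouse": catalog})
    return urllib.request.Request(
        url=f"{iceberg_rest_uri(workspace_host)}/v1/config?{query}",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


# Placeholder host and token (assumptions); "main" is the example catalog from the text
config_request = build_config_request(
    "example-workspace.cloud.databricks.com", "main", "example-token"
)
print(config_request.get_method(), config_request.full_url)
```

Building the request this way is a quick way to double-check the URI and catalog name before entering them in the Dremio source dialog.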

#### Manual Metadata Synchronization

If Dremio doesn't see the latest data, force synchronization:

In [0]:
# Manual Iceberg metadata synchronization (if automatic didn't work)
# spark.sql(f"MSCK REPAIR TABLE {CATALOG}.{GOLD_SCHEMA}.dim_customer SYNC METADATA")
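
Before forcing a sync, it helps to confirm that the Iceberg metadata is actually behind. The sketch below assumes that UniForm reports the last converted Delta version as `converted_delta_version` and that the latest table version comes from `DESCRIBE HISTORY`. The helper name is hypothetical, and both inputs are passed in as plain values.

```python
from typing import Optional


def iceberg_metadata_is_stale(
    converted_delta_version: Optional[int], latest_delta_version: int
) -> bool:
    """Return True when Iceberg metadata lags behind the latest Delta version."""
    if converted_delta_version is None:
        # No Iceberg metadata has been generated yet
        return True
    return converted_delta_version < latest_delta_version


# In a notebook, latest_delta_version could come from (not run here):
#   spark.sql(f"DESCRIBE HISTORY {table} LIMIT 1").first()["version"]
print(iceberg_metadata_is_stale(12, 12))
print(iceberg_metadata_is_stale(10, 12))
print(iceberg_metadata_is_stale(None, 0))
```

Because metadata generation runs asynchronously after a write, a small lag right after a commit may resolve on its own without a manual sync.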

#### UniForm Limitations for Dremio

| Limitation | Description |
|--------------|------|
| Read-only | Dremio can only read, not write |
| Deletion Vectors | Must be disabled (or use REORG) |
| Materialized Views | Do not support UniForm |
| Streaming Tables | Do not support UniForm |
| VOID type | Not supported in Iceberg |

---

### Preparing a Dedicated View

For BI tools (Dremio, Power BI), it is good practice to create views that hide join logic.

In [0]:
# Creating a reporting view
view_name = "v_sales_summary_bi"

spark.sql(f"""
CREATE OR REPLACE VIEW {CATALOG}.{GOLD_SCHEMA}.{view_name} AS
SELECT 
    c.country,
    year(f.order_date) as year,
    month(f.order_date) as month,
    count(distinct f.order_id) as orders_count,
    sum(f.total_amount) as total_revenue
FROM {CATALOG}.{GOLD_SCHEMA}.fact_sales f
JOIN {CATALOG}.{GOLD_SCHEMA}.dim_customer c ON f.customer_id = c.customer_id
GROUP BY 1, 2, 3
""")

display(spark.createDataFrame([
    ("View", view_name),
    ("Location", f"{CATALOG}.{GOLD_SCHEMA}.{view_name}"),
    ("Status", "Ready for BI connection")
], ["Parameter", "Value"]))

In [0]:
# Verify view
display(spark.table(f"{CATALOG}.{GOLD_SCHEMA}.{view_name}"))

## Best Practices

### BI Performance:
- Use **Serverless SQL Warehouses** for best UX (fast start).
- Enable **Photon** (default in Serverless/Pro).
- Use **Materialized Views** for heavy aggregations if dashboards are slow.

### Governance:
- Do not connect BI directly to Silver/Bronze tables. Use **Gold** only.
- Use dedicated **Service Principals** for BI connections, not personal accounts.
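
The governance points above can be expressed as a small set of Unity Catalog grants for the BI service principal. This is a sketch under assumptions: `bi_grants` is a hypothetical helper, the principal is a made-up application ID, `main.gold` stands in for the catalog and schema, and the `ON VIEW` form is assumed to be accepted for views in your workspace.

```python
def bi_grants(catalog: str, schema: str, view: str, principal: str) -> list:
    """Return the minimal grants that let a principal query one reporting view."""
    quoted = f"`{principal}`"  # principals are wrapped in backticks
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO {quoted}",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO {quoted}",
        f"GRANT SELECT ON VIEW {catalog}.{schema}.{view} TO {quoted}",
    ]


# Hypothetical catalog, schema, and service principal application ID
grants = bi_grants(
    "main", "gold", "v_sales_summary_bi", "00000000-0000-0000-0000-000000000000"
)

for grant in grants:
    print(grant)  # in a notebook: spark.sql(grant)
```

Granting `SELECT` only on the reporting view, rather than on the whole schema, keeps the BI connection away from the underlying tables.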

## Summary

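1. We compared **SQL Warehouse** types.
2. We added table and column metadata for **Genie**.
3. We enabled **UniForm** so that Dremio can read Delta tables through the Iceberg REST endpoint.
4. We prepared a reporting view for **Power BI / Dremio**.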

## Clean up resources

In [0]:
# spark.sql(f"DROP VIEW IF EXISTS {CATALOG}.{GOLD_SCHEMA}.{view_name}")
display(spark.createDataFrame([("Status", "Resources kept for further exercises")], ["Info", "Value"]))