# M09: Governance & Unity Catalog 

**Training Objective:** Master Unity Catalog as a governance platform for Databricks Lakehouse, managing access, data masking, lineage, and audit logging

**Topics Covered:**
- Unity Catalog Architecture: Metastore, Catalog, Schema, Tables/Views/Volumes
- Access Management: GRANT/REVOKE privileges
- Data Masking and Row-Level Security
- Data Lineage and Audit Logging
- Delta Sharing - secure data sharing
- Best Practices for Data Governance

---


## 9.1. Theoretical Introduction

**Section Objective:** Understand Unity Catalog as the unified governance platform for the data lakehouse

**Basic Concepts:**
- **Unity Catalog**: Unified governance solution for all data assets
- **Metastore**: Region-level container for catalogs (top-level)
- **Three-level namespace**: catalog.schema.table
- **Securable objects**: Tables, Views, Functions, Volumes, Models
- **Fine-grained access control**: Table, column, row-level security
- **Automatic lineage**: End-to-end data flow tracking without instrumentation

**Unity Catalog Object Hierarchy:**
```
Metastore (region-level)
 ↓
Catalog (domain/environment)
 ↓
Schema (namespace/layer)
 ↓
Securable Objects:
 - Tables / Views (data)
 - Functions (UDF, stored procedures)
 - Volumes (file storage)
 - Models (ML models)
```

**Key Features:**
- **Unified governance**: Single platform for data, ML, BI
- **ACID transactions**: Delta Lake provides transactional guarantees for the tables registered in the catalog
- **Audit logging**: Who accessed what and when
- **Data discovery**: Metadata search and tagging
- **Delta Sharing**: Secure cross-organization sharing

**Why is this important?**
Unity Catalog solves fundamental governance problems in a data lake:
- Lack of central access control
- Difficulty tracking lineage
- No data access audit
- Compliance issues (GDPR, HIPAA)
- Data silos between teams

Unity Catalog provides enterprise-grade governance while maintaining data lakehouse flexibility.

## 9.2. Per-User Isolation

Run the initialization script for per-user catalog and schema isolation:

In [0]:
%run ../../setup/00_setup

## 9.3. Configuration

Paths to the source data used in this module:

In [0]:
# Paths to data directories (subdirectories in DATASET_PATH from 00_setup)
CUSTOMERS_PATH = f"{DATASET_PATH}/customers"
ORDERS_PATH = f"{DATASET_PATH}/orders"
PRODUCTS_PATH = f"{DATASET_PATH}/products"

# Paths to specific files
CUSTOMERS_CSV = f"{CUSTOMERS_PATH}/customers.csv"
ORDERS_JSON = f"{ORDERS_PATH}/orders_batch.json"
PRODUCTS_PARQUET = f"{PRODUCTS_PATH}/products.parquet"

## 9.4. Unity Catalog Architecture

**Unity Catalog** is a unified governance solution for Databricks Lakehouse.

### Object Hierarchy:

```
Metastore (region-level)
 ↓
Catalog (domain/environment)
 ↓
Schema (namespace)
 ↓
Securable Objects:
 - Tables / Views
 - Functions (UDF, stored procedures)
 - Volumes (file storage)
 - Models (ML models)
```

### Three-level namespace:
```sql
catalog.schema.table
```

Example:
```sql
main.sales.orders
dev.analytics.customer_metrics
prod.gold.daily_revenue
```
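Notebook cells in this module build object names from variables such as `CATALOG` and `BRONZE_SCHEMA`. The sketch below shows one way to compose a well-formed three-level name. `quote_identifier` and `full_name` are hypothetical helpers, not part of any Databricks API, and the sketch assumes backtick quoting with embedded backticks doubled.

```python
def quote_identifier(part: str) -> str:
    """Wrap one identifier in backticks, doubling any embedded backtick."""
    if not part:
        raise ValueError("identifier parts must be non-empty")
    return "`" + part.replace("`", "``") + "`"


def full_name(catalog: str, schema: str, obj: str) -> str:
    """Compose a three-level name: catalog.schema.object."""
    return ".".join(quote_identifier(p) for p in (catalog, schema, obj))


print(full_name("main", "sales", "orders"))
print(full_name("prod", "gold", "daily-revenue"))  # hyphen is safe once quoted
```

Quoting each part separately keeps names with hyphens or other special characters from being parsed as expressions.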

### Key Features:
- **Unified governance**: single platform for data, ML, BI
- **Fine-grained access control**: table, column, row level
- **Automatic lineage**: end-to-end data flow tracking
- **Audit logging**: who accessed what and when
- **Data discovery**: metadata search and tagging

---

#### Unity Catalog Full Architecture

![Unity Catalog Architecture - Metastore, Catalogs, Schemas, Objects, External Locations](../../../assets/images/training_2026/m09_unity_catalog_architecture.png)

### 9.4.1. Setup and Basic Operations

### 9.4.2. Creating User Groups
We create user groups to demonstrate permissions (the names match the GRANT statements later in this module):
- `data-engineers`: Full access to Bronze/Silver schemas
- `data-analysts`: Read-only access to Gold

**Active context verification:**

In [0]:
# Verification of created schemas
schemas = spark.sql(f"SHOW SCHEMAS IN {CATALOG}").select("databaseName").collect()
schema_names = [row.databaseName for row in schemas]

**Active catalog and schema set**

We set the default working context - all subsequent operations will be executed in this catalog and schema unless a full path is specified.

In [0]:
# Verification of created schemas
spark.sql(f"SHOW SCHEMAS IN {CATALOG}").display()

## 9.5. Data Preparation

Before we proceed to access management, we will load real data from the dataset/ directory that we will use in the Unity Catalog examples.

In [0]:
# The JSON reader infers the schema itself; CSV options such as header/inferSchema do not apply
orders_df = spark.read.json(ORDERS_JSON)
orders_df.write.format("delta").mode("overwrite").saveAsTable(f"{CATALOG}.{BRONZE_SCHEMA}.orders")

display(spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.orders"))

In [0]:
customers_df = spark.read.option("header", "true").option("inferSchema", "true").csv(CUSTOMERS_CSV)
customers_df.write.format("delta").mode("overwrite").saveAsTable(f"{CATALOG}.{BRONZE_SCHEMA}.customers")

display(spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.customers"))

In [0]:
products_df = spark.read.parquet(PRODUCTS_PARQUET)
products_df.write.mode("overwrite").saveAsTable(f"{CATALOG}.{BRONZE_SCHEMA}.products")
display(spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.products"))

In [0]:
# Verification of orders record count
spark.sql(f"SELECT COUNT(*) as count FROM {CATALOG}.{BRONZE_SCHEMA}.orders").display()

### 9.5.1. Add comments to table and columns

You can add descriptive comments to Unity Catalog tables and columns using SQL commands. This improves data discoverability and governance.

The cell below demonstrates how to add comments to a table and a specific column using Spark SQL.

In [0]:
# Add comments to table and columns
spark.sql(f"""
    COMMENT ON TABLE {CATALOG}.{BRONZE_SCHEMA}.orders IS
    'Cleaned orders table with data quality validations applied'
""")

spark.sql(f"""
    COMMENT ON COLUMN {CATALOG}.{BRONZE_SCHEMA}.orders.customer_id IS
    'Customer identifier - PII data, access restricted'
""")

### 9.5.2. Add tags to orders table

You can classify and manage tables in Unity Catalog using **tags** (key-value pairs). Tags help with data discovery, compliance, and governance (e.g., marking tables as PII, GDPR, or Sensitive).

Example:
```sql
ALTER TABLE retailhub_trainer.bronze.orders
  SET TAGS ('pii' = 'false', 'data_classification' = 'transactional', 'retention' = '7_years');
```

- `pii`: Indicates if table contains personally identifiable information.
- `data_classification`: Describes the type of data (e.g., transactional, reference).
- `retention`: Specifies data retention policy.

You need `APPLY TAG` privilege to add tags.
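When the same tag set is applied to many tables, the statement can be generated from a dictionary. The following is a minimal sketch: `build_set_tags_sql` is a hypothetical helper, and the character whitelist is an assumption chosen to sidestep SQL string escaping, not a Unity Catalog rule.

```python
import re

# Assumption: restrict keys and values to characters that need no SQL escaping.
_SAFE = re.compile(r"^[A-Za-z0-9_ .-]+$")


def build_set_tags_sql(table: str, tags: dict) -> str:
    """Render an ALTER TABLE ... SET TAGS statement from a dict of tags."""
    if not tags:
        raise ValueError("at least one tag is required")
    for key, value in tags.items():
        if not _SAFE.match(key) or not _SAFE.match(value):
            raise ValueError(f"unsupported characters in tag {key!r}={value!r}")
    pairs = ", ".join(f"'{k}' = '{v}'" for k, v in tags.items())
    return f"ALTER TABLE {table} SET TAGS ({pairs})"


print(build_set_tags_sql(
    "retailhub_trainer.bronze.orders",
    {"pii": "false", "data_classification": "transactional"},
))
```

The resulting string can be passed to `spark.sql(...)` in a notebook cell.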

In [0]:
# Add tags to orders table
spark.sql(f"""
    ALTER TABLE {CATALOG}.{BRONZE_SCHEMA}.orders 
    SET TAGS ('sensitivity' = 'high', 'domain' = 'sales')
""")

# Add tags to customer_id column
spark.sql(f"""
    ALTER TABLE {CATALOG}.{BRONZE_SCHEMA}.orders 
    ALTER COLUMN customer_id SET TAGS ('pii' = 'true')
""")

display(spark.createDataFrame([("Status", "Tags added to table and column")], ["Info", "Value"]))

## 9.6. Querying Metadata and Tags

In [0]:
# Find all columns marked as PII
pii_columns = spark.sql(f"""
    SELECT 
        catalog_name, 
        schema_name, 
        table_name, 
        column_name, 
        tag_value 
    FROM system.information_schema.column_tags
    WHERE tag_name = 'pii' AND tag_value = 'true'
      AND catalog_name = '{CATALOG}'
""")

display(pii_columns)

## 9.7. Monitoring & Observability (System Tables)

Unity Catalog provides **System Tables** (`system.*`) for operational monitoring and observability.
These tables give insights into costs, job runs, pipeline health, query performance, and storage usage.

**Available System Table Categories:**

| Category | Schema | Key Tables |
|----------|--------|------------|
| **Billing** | `system.billing` | `usage`, `list_prices` |
| **Compute** | `system.compute` | `clusters`, `warehouse_events` |
| **Workflows** | `system.lakeflow` (formerly `system.workflow`) | `job_run_timeline`, `job_task_run_timeline` |
| **Pipelines** | `system.lakeflow` | `pipeline_event_log` |
| **Queries** | `system.query` | `history` |
| **Storage** | `system.storage` | `predictive_optimization_operations_history` |
| **Access** | `system.access` | `audit`, `table_lineage`, `column_lineage` |

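> **Note**: Access to system tables is governed by Unity Catalog. An account admin enables the system schemas, and users need `USE CATALOG`, `USE SCHEMA`, and `SELECT` on the `system` schemas they query.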

#### System Tables Overview

![System Tables - audit, billing, compute, information_schema, lineage, storage](../../../assets/images/training_2026/m09_system_tables_overview.png)

### 9.7.1. Cost Monitoring (DBU Usage)

Track Databricks Unit (DBU) consumption by workspace, SKU, and user. Essential for budget management and chargeback.

In [None]:
# Daily DBU cost breakdown by SKU (last 30 days)
cost_daily = spark.sql("""
    SELECT 
        u.usage_date,
        u.sku_name,
        u.usage_unit,
        SUM(u.usage_quantity) as total_dbus,
        ROUND(SUM(u.usage_quantity * lp.pricing.default), 2) as estimated_cost_usd
    FROM system.billing.usage u
    LEFT JOIN system.billing.list_prices lp
        -- Columns are qualified because sku_name and usage_unit exist in both tables
        ON u.sku_name = lp.sku_name
        AND u.cloud = lp.cloud
        AND u.usage_unit = lp.usage_unit
        -- Pick the list price that was valid when the usage occurred
        AND u.usage_start_time >= lp.price_start_time
        AND (lp.price_end_time IS NULL OR u.usage_start_time < lp.price_end_time)
    WHERE u.usage_date >= current_date() - INTERVAL 30 DAYS
    GROUP BY u.usage_date, u.sku_name, u.usage_unit
    ORDER BY u.usage_date DESC, estimated_cost_usd DESC
""")

display(cost_daily)

In [None]:
# Top 10 user/SKU combinations by DBU consumption (last 30 days)
cost_by_user = spark.sql("""
    SELECT 
        identity_metadata.run_as as run_as_user,
        sku_name,
        ROUND(SUM(usage_quantity), 2) as total_dbus,
        COUNT(DISTINCT usage_date) as active_days
    FROM system.billing.usage
    WHERE usage_date >= current_date() - INTERVAL 30 DAYS
        AND identity_metadata.run_as IS NOT NULL
    GROUP BY identity_metadata.run_as, sku_name
    ORDER BY total_dbus DESC
    LIMIT 10
""")

display(cost_by_user)

In [None]:
# Cost trend: weekly aggregation with week-over-week change
cost_trend = spark.sql("""
    WITH weekly AS (
        SELECT
            DATE_TRUNC('week', usage_date) as week_start,
            ROUND(SUM(usage_quantity), 2) as total_dbus
        FROM system.billing.usage
        WHERE usage_date >= current_date() - INTERVAL 90 DAYS
        GROUP BY DATE_TRUNC('week', usage_date)
    )
    SELECT
        week_start,
        total_dbus,
        LAG(total_dbus) OVER (ORDER BY week_start) as prev_week_dbus,
        ROUND(
            (total_dbus - LAG(total_dbus) OVER (ORDER BY week_start)) 
            / NULLIF(LAG(total_dbus) OVER (ORDER BY week_start), 0) * 100, 1  -- NULLIF avoids division by zero
        ) as wow_change_pct
    FROM weekly
    ORDER BY week_start DESC
""")

display(cost_trend)
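The week-over-week calculation can be checked outside Spark with plain Python. This is an illustrative sketch with made-up weekly totals; `wow_change_pct` is a hypothetical helper that mirrors the `LAG`-based arithmetic, returning `None` where the change is undefined (first week, or a previous week with zero usage).

```python
def wow_change_pct(weekly_totals):
    """Week-over-week % change for totals listed in chronological order."""
    changes = []
    previous = None
    for total in weekly_totals:
        if previous is None or previous == 0:
            changes.append(None)  # no earlier week, or nothing to divide by
        else:
            changes.append(round((total - previous) / previous * 100, 1))
        previous = total
    return changes


# Hypothetical DBU totals for three consecutive weeks
print(wow_change_pct([100.0, 120.0, 90.0]))  # [None, 20.0, -25.0]
```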

### 9.7.2. Job & Workflow Monitoring

Monitor Lakeflow Jobs execution: success rates, durations, failures. Critical for SLA compliance.

In [None]:
# Job run history with success/failure rates (last 7 days)
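job_runs = spark.sql("""
    WITH runs AS (
        -- A long run spans several timeline rows; collapse them to one row per run
        SELECT
            workspace_id, job_id, run_id,
            MIN(period_start_time) as start_time,
            MAX(period_end_time) as end_time,
            MAX(result_state) as result_state  -- populated only on a run's final row
        FROM system.lakeflow.job_run_timeline
        WHERE period_start_time >= current_date() - INTERVAL 7 DAYS
        GROUP BY workspace_id, job_id, run_id
    )
    SELECT
        job_id,  -- job names are stored in system.lakeflow.jobs, not in the timeline
        COUNT(*) as total_runs,
        SUM(CASE WHEN result_state = 'SUCCEEDED' THEN 1 ELSE 0 END) as success_count,
        SUM(CASE WHEN result_state IN ('FAILED', 'ERROR', 'TIMED_OUT') THEN 1 ELSE 0 END) as failure_count,
        ROUND(SUM(CASE WHEN result_state = 'SUCCEEDED' THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 1) as success_rate_pct,
        ROUND(AVG(TIMESTAMPDIFF(MINUTE, start_time, end_time)), 1) as avg_duration_min
    FROM runs
    WHERE result_state IS NOT NULL  -- skip runs that are still in progress
    GROUP BY job_id
    ORDER BY failure_count DESC, total_runs DESC
""")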

display(job_runs)

In [None]:
# Failed jobs with termination details (last 24 hours)
failed_jobs = spark.sql("""
    SELECT
        job_id,
        run_id,
        result_state,
        termination_code,
        period_start_time as start_time,
        period_end_time as end_time,
        -- For runs longer than one hour this covers only the final timeline slice
        TIMESTAMPDIFF(MINUTE, period_start_time, period_end_time) as duration_min,
        trigger_type
    FROM system.lakeflow.job_run_timeline
    WHERE result_state IN ('FAILED', 'ERROR', 'TIMED_OUT', 'CANCELLED')
        AND period_start_time >= current_timestamp() - INTERVAL 24 HOURS
    ORDER BY period_start_time DESC
""")

display(failed_jobs)

### 9.7.3. Lakeflow Pipeline Monitoring

Track Lakeflow Declarative Pipeline health, update durations, and data quality issues.

In [None]:
# Lakeflow pipeline events (last 7 days)
pipeline_events = spark.sql("""
    SELECT
        pipeline_id,
        pipeline_name,
        event_type,
        maturity_level,
        message,
        timestamp
    FROM system.lakeflow.pipeline_event_log
    WHERE timestamp >= current_date() - INTERVAL 7 DAYS
        AND event_type IN ('flow_progress', 'update_progress', 'maintenance_progress')
    ORDER BY timestamp DESC
    LIMIT 50
""")

display(pipeline_events)

In [None]:
# Pipeline update durations and success rates
pipeline_health = spark.sql("""
    WITH updates AS (
        SELECT
            pipeline_id,
            pipeline_name,
            origin.update_id,
            MIN(timestamp) as start_time,
            MAX(timestamp) as end_time,
            MAX(CASE WHEN message LIKE '%completed successfully%' THEN 'SUCCESS'
                      WHEN message LIKE '%FAILED%' OR message LIKE '%error%' THEN 'FAILED'
                      ELSE 'RUNNING' END) as status
        FROM system.lakeflow.pipeline_event_log
        WHERE timestamp >= current_date() - INTERVAL 30 DAYS
        GROUP BY pipeline_id, pipeline_name, origin.update_id
    )
    SELECT
        pipeline_name,
        COUNT(*) as total_updates,
        SUM(CASE WHEN status = 'SUCCESS' THEN 1 ELSE 0 END) as successes,
        SUM(CASE WHEN status = 'FAILED' THEN 1 ELSE 0 END) as failures,
        ROUND(AVG(TIMESTAMPDIFF(MINUTE, start_time, end_time)), 1) as avg_duration_min
    FROM updates
    GROUP BY pipeline_name
    ORDER BY failures DESC
""")

display(pipeline_health)

### 9.7.4. Query Performance Monitoring

Identify slow queries, heavy users, and optimization opportunities using SQL Warehouse query history.

In [None]:
# Slowest queries (last 7 days, > 60s)
slow_queries = spark.sql("""
    SELECT
        statement_id,
        executed_by as user,
        SUBSTRING(statement_text, 1, 120) as query_preview,
        execution_status,
        total_duration_ms / 1000 as duration_sec,
        produced_rows as rows_produced,
        start_time
    FROM system.query.history
    WHERE start_time >= current_date() - INTERVAL 7 DAYS
        AND total_duration_ms > 60000
        AND statement_type IN ('SELECT', 'MERGE', 'INSERT', 'CREATE_TABLE_AS_SELECT')
    ORDER BY total_duration_ms DESC
    LIMIT 20
""")

display(slow_queries)

In [None]:
# Query volume and performance by user (last 7 days)
query_by_user = spark.sql("""
    SELECT
        executed_by as user,
        COUNT(*) as total_queries,
        ROUND(AVG(total_duration_ms / 1000), 2) as avg_duration_sec,
        ROUND(MAX(total_duration_ms / 1000), 2) as max_duration_sec,
        SUM(produced_rows) as total_rows_produced,
        COUNT(CASE WHEN execution_status = 'FAILED' THEN 1 END) as failed_queries
    FROM system.query.history
    WHERE start_time >= current_date() - INTERVAL 7 DAYS
    GROUP BY executed_by
    ORDER BY total_queries DESC
    LIMIT 15
""")

display(query_by_user)

### 9.7.5. Compute & Cluster Monitoring

Track cluster utilization, uptime, and idle time to optimize compute costs.

In [None]:
# Recent cluster configuration changes (the table stores one row per change, not runtime state)
cluster_usage = spark.sql("""
    SELECT
        cluster_id,
        cluster_name,
        cluster_source,
        driver_node_type,
        worker_count,
        change_time
    FROM system.compute.clusters
    WHERE change_time >= current_date() - INTERVAL 7 DAYS
    ORDER BY change_time DESC
    LIMIT 50
""")

display(cluster_usage)

In [None]:
# SQL Warehouse usage patterns
warehouse_usage = spark.sql("""
    SELECT
        warehouse_id,
        event_type,
        event_time,
        cluster_count
    FROM system.compute.warehouse_events
    WHERE event_time >= current_date() - INTERVAL 7 DAYS
    ORDER BY event_time DESC
    LIMIT 50
""")

display(warehouse_usage)

### 9.7.6. Storage & Table Size Monitoring

Monitor table sizes, growth trends, and identify tables that need optimization.

In [None]:
# Largest tables in the catalog
largest_tables = spark.sql(f"""
    SELECT
        catalog_name,
        schema_name,
        table_name,
        ROUND(total_size_in_bytes / (1024*1024*1024), 3) as size_gb,
        row_count,
        active_files_count,
        last_updated
    FROM system.storage.table_storage
    WHERE catalog_name = '{CATALOG}'
    ORDER BY total_size_in_bytes DESC
    LIMIT 20
""")

display(largest_tables)

In [None]:
# Predictive Optimization history
pred_opt = spark.sql(f"""
    SELECT
        catalog_name,
        schema_name,
        table_name,
        operation_type,
        operation_status,
        usage_quantity as dbus_used,
        start_time,
        end_time
    FROM system.storage.predictive_optimization_operations_history
    WHERE catalog_name = '{CATALOG}'
        AND start_time >= current_date() - INTERVAL 30 DAYS
    ORDER BY start_time DESC
    LIMIT 30
""")

display(pred_opt)

### 9.7.7. Governance Health Dashboard

Combined governance health check: tables without comments, untagged PII, permissions audit.

In [None]:
# Tables without comments (governance gap)
undocumented = spark.sql(f"""
    SELECT 
        table_catalog,
        table_schema,
        table_name,
        table_type
    FROM {CATALOG}.information_schema.tables
    WHERE comment IS NULL OR comment = ''
    ORDER BY table_schema, table_name
""")

display(undocumented)

In [None]:
# All tags across catalog (governance inventory)
all_tags = spark.sql(f"""
    SELECT 
        catalog_name,
        schema_name,
        table_name,
        column_name,
        tag_name,
        tag_value
    FROM system.information_schema.column_tags
    WHERE catalog_name = '{CATALOG}'
    ORDER BY schema_name, table_name, column_name
""")

display(all_tags)

In [None]:
# Permission audit: all grants in catalog
perm_audit = spark.sql(f"""
    SELECT 
        grantor,
        grantee,
        table_catalog,
        table_schema,
        table_name,
        privilege_type,
        is_grantable
    FROM {CATALOG}.information_schema.table_privileges
    ORDER BY grantee, table_schema, table_name
""")

display(perm_audit)

> **Tip**: Create a **Lakeflow Job** that runs these monitoring queries daily and sends alerts via email/Slack on anomalies (e.g., cost spike > 20%, job failure rate > 5%, tables without comments).
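The alerting logic from the tip can be sketched as a plain Python function that a scheduled notebook might call after collecting the query results. `governance_alerts` and its parameters are hypothetical; the default thresholds are the example values from the tip (20% cost spike, 5% failure rate), and delivery via email or Slack is left out.

```python
def governance_alerts(wow_cost_change_pct, job_failure_rate_pct, undocumented_tables,
                      cost_threshold_pct=20.0, failure_threshold_pct=5.0):
    """Return human-readable alerts for metrics that exceed their thresholds."""
    alerts = []
    # The cost change may be None when there is no previous week to compare with
    if wow_cost_change_pct is not None and wow_cost_change_pct > cost_threshold_pct:
        alerts.append(f"Cost up {wow_cost_change_pct:.1f}% week over week")
    if job_failure_rate_pct > failure_threshold_pct:
        alerts.append(f"Job failure rate at {job_failure_rate_pct:.1f}%")
    if undocumented_tables:
        alerts.append(f"{len(undocumented_tables)} table(s) without comments")
    return alerts


# Hypothetical metric values for illustration
for alert in governance_alerts(32.5, 2.0, ["bronze.products"]):
    print(alert)
```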

## 9.8. Unity Catalog Functions (UDF)

**Functions** in Unity Catalog allow:
- Creating reusable SQL/Python functions
- Centralized management of business logic
- Access control through GRANT/REVOKE
- Lineage tracking for functions

**Function types**:
- **Scalar Functions**: return a single value
- **Table Functions**: return a table
- **SQL Functions**: written in SQL
- **Python Functions**: written in Python (UDF)

### 9.8.1. Data Classification (Tagging)

> *Note: This section covers data tagging for classification, related to governance.*

**Tagging** allows data classification (e.g., PII, Sensitive, GDPR) at the table or column level.
This facilitates data discovery and governance (e.g., reporting all tables containing personal data).

In [0]:
# SQL Function - masking customer_id
spark.sql(f"""
  CREATE OR REPLACE FUNCTION {CATALOG}.{SILVER_SCHEMA}.mask_customer_id(customer_id STRING)
  RETURNS STRING
  LANGUAGE SQL
  COMMENT 'Masks customer_id, showing only last 3 digits'
  RETURN CONCAT('****', SUBSTRING(CAST(customer_id AS STRING), -3))
""")

In [0]:
# Test mask_customer_id function
result_df = spark.sql(f"""
  SELECT 
    customer_id,
    {CATALOG}.{SILVER_SCHEMA}.mask_customer_id(customer_id) as masked_id,
    first_name,
    last_name
  FROM {CATALOG}.{BRONZE_SCHEMA}.customers
  LIMIT 5
""")

display(result_df)
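To reason about what the masking function returns, it helps to mirror its expression in plain Python. This is an illustrative sketch, not the registered function: it assumes that `SUBSTRING(x, -3)` keeps the last three characters and that `CONCAT` yields NULL for a NULL input.

```python
def mask_customer_id(customer_id):
    """Local stand-in for CONCAT('****', SUBSTRING(customer_id, -3))."""
    if customer_id is None:
        return None  # assumption: NULL in, NULL out
    return "****" + str(customer_id)[-3:]


for sample in ("CUST001", "CUST-42-0917", None):
    print(sample, "->", mask_customer_id(sample))
```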

**Creating categorize_price function**

The Python UDF categorizes product prices:
- **Low**: < 50
- **Medium**: 50 to < 200
- **High**: >= 200

The function body is ordinary Python.

In [0]:
# Python UDF - price categorization
spark.sql(f"""
  CREATE OR REPLACE FUNCTION {CATALOG}.{SILVER_SCHEMA}.categorize_price(price DOUBLE)
  RETURNS STRING
  LANGUAGE PYTHON
  COMMENT 'Categorizes prices: Low, Medium, High'
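  AS $$
    # A NULL price arrives as None; comparing None with a number raises TypeError
    if price is None:
        return None
    if price < 50:
        return "Low"
    elif price < 200:
        return "Medium"
    else:
        return "High"
  $$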
""")

In [0]:
# Test categorize_price function
result_df = spark.sql(f"""
  SELECT 
    product_name,
    unit_cost,
    {CATALOG}.{SILVER_SCHEMA}.categorize_price(unit_cost) as price_category
  FROM {CATALOG}.{BRONZE_SCHEMA}.products
  ORDER BY unit_cost
  LIMIT 10
""")

display(result_df)
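The thresholds are easiest to verify at the boundaries. The sketch below repeats the categorization logic as a local Python function so it can be tested without a cluster; returning `None` for a missing price is an assumption made for this example.

```python
def categorize_price(price):
    """Thresholds: below 50 is Low, 50 up to (not including) 200 is Medium, otherwise High."""
    if price is None:
        return None  # assumption: a missing price stays uncategorized
    if price < 50:
        return "Low"
    if price < 200:
        return "Medium"
    return "High"


# Boundary values: 50 is already Medium and 200 is already High
for value in (49.99, 50, 199.99, 200):
    print(value, categorize_price(value))
```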

**Setting permissions for data-analysts**

The cells below grant the `data-analysts` group:
- **USE CATALOG**: Access to the catalog
- **USE SCHEMA**: Access to the Silver schema
- **SELECT**: Read data from the Silver schema

**Setup:** Create the groups for demonstration purposes.
> **Note:** Creating groups requires account admin privileges. If you don't have them, make sure these groups already exist.

* Create the groups in the account console UI before running the cells below.

In [0]:
# Grant catalog access to data analysts
spark.sql(f"""
    GRANT USE CATALOG ON CATALOG {CATALOG} TO `data-analysts`
""")

spark.sql(f"""
    GRANT USE SCHEMA ON SCHEMA {CATALOG}.{SILVER_SCHEMA} TO `data-analysts`
""")

spark.sql(f"""
    GRANT SELECT ON SCHEMA {CATALOG}.{SILVER_SCHEMA} TO `data-analysts`
""")

In [0]:
# Grant catalog access and schema creation to data engineers
spark.sql(f"""
    GRANT USE CATALOG, CREATE SCHEMA ON CATALOG {CATALOG} TO `data-engineers`
""")

**Permissions for Data Analysts (Gold Layer):**

In [0]:
# GRANT for data-analysts on Gold schema
spark.sql(f"""
  GRANT USE SCHEMA ON SCHEMA {CATALOG}.{GOLD_SCHEMA} TO `data-analysts`
""")

spark.sql(f"""
  GRANT SELECT ON SCHEMA {CATALOG}.{GOLD_SCHEMA} TO `data-analysts`
""")

**Table-specific access control**

Fine-grained permissions:
- **finance-team**: Access to fact_sales (revenue analysis)
- **marketing-team**: Access to customers_masked (customer insights with PII masking)

In [0]:
# GRANT EXECUTE on functions
spark.sql(f"""
  GRANT EXECUTE ON FUNCTION {CATALOG}.{SILVER_SCHEMA}.mask_customer_id TO `data-analysts`
""")

spark.sql(f"""
  GRANT EXECUTE ON FUNCTION {CATALOG}.{SILVER_SCHEMA}.categorize_price TO `data-analysts`
""")

In [0]:
# Verify permissions on table
spark.sql(f"""
    SHOW GRANTS ON TABLE {CATALOG}.{BRONZE_SCHEMA}.customers
""").display()

---

## 9.9. Data Masking and Row-Level Security

### 9.9.1. Column-level masking (Dynamic Views):

Use `current_user()` and `is_account_group_member()` functions for conditional masking:

#### Column Masking & Row-Level Security Flow

![Column Masking & Row-Level Security - dynamic data access control](../../../assets/images/training_2026/m09_masking_rls_flow.png)

In [0]:
# Create masked view for PII data
spark.sql(f"""
  CREATE OR REPLACE VIEW {CATALOG}.{GOLD_SCHEMA}.customers_masked AS
  SELECT 
    customer_id,
    CASE 
      WHEN is_account_group_member('alt_test') THEN first_name
      ELSE CONCAT(LEFT(first_name, 1), '***')
    END as first_name,
    CASE 
      WHEN is_account_group_member('alt_test') THEN last_name
      ELSE CONCAT(LEFT(last_name, 1), '***')
    END as last_name,
    city,
    country,
    registration_date
  FROM {CATALOG}.{BRONZE_SCHEMA}.customers
""")

In [0]:
df = spark.table(f"{CATALOG}.{GOLD_SCHEMA}.customers_masked")
display(df)

**View customers_masked created**

View with dynamic PII data masking:

In [0]:
# Test the view with masking
result_df = spark.sql(f"""
  SELECT * FROM {CATALOG}.{GOLD_SCHEMA}.customers_masked LIMIT 10
""")

display(result_df)

In [0]:
spark.sql(f"DROP TABLE IF EXISTS {CATALOG}.{SILVER_SCHEMA}.customers_masked;")

In [0]:
%sql
CREATE OR REPLACE FUNCTION customer_mask_test (customer_id STRING)
  RETURN CASE WHEN is_account_group_member('alt_test') THEN customer_id ELSE 'CUST-**-****' END

In [0]:
# Define the customers_masked table schema without CTAS
spark.sql(f"""
  CREATE OR REPLACE TABLE {CATALOG}.{SILVER_SCHEMA}.customers_masked (
    customer_id STRING,
    customer_id_masked STRING MASK customer_mask_test,
    first_name STRING,
    last_name STRING,
    email STRING,
    country STRING
  )
""")

In [0]:
spark.sql(f"""
  INSERT INTO {CATALOG}.{SILVER_SCHEMA}.customers_masked (customer_id, customer_id_masked, first_name, last_name, email, country)
  VALUES
    ('CUST001', 'CUST001', 'Alice', 'Smith', 'alice.smith@example.com', 'USA'),
    ('CUST002', 'CUST002', 'Bob', 'Johnson', 'bob.johnson@example.com', 'Canada'),
    ('CUST003', 'CUST003', 'Carol', 'Williams', 'carol.williams@example.com', 'UK')
""")
display(spark.sql(f"SELECT * FROM {CATALOG}.{SILVER_SCHEMA}.customers_masked"))

In [0]:
spark.sql(
    f"""
    CREATE OR REPLACE VIEW {CATALOG}.{GOLD_SCHEMA}.orders_hashed AS
    SELECT 
        order_id,
        SHA2(CAST(customer_id AS STRING), 256) as customer_id_hash,
        product_id,
        quantity,
        total_amount,
        order_datetime
    FROM {CATALOG}.{BRONZE_SCHEMA}.orders
    """
)

display(
    spark.createDataFrame(
        [
            ("View", f"{CATALOG}.{GOLD_SCHEMA}.orders_hashed"),
            ("Masking", "customer_id → SHA2-256 hash"),
            ("Purpose", "Analysts can aggregate without revealing customer_id")
        ],
        ["Parameter", "Value"]
    )
)

**View orders_hashed created**

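`customer_id` is hashed with SHA2-256. This enables:
- **Analysts**: Aggregating data without seeing the raw `customer_id`
- **Privacy**: Pseudonymizing the identifier while preserving the ability to group by it
- **Compliance**: Supporting GDPR and similar privacy requirements; an unsalted hash of a guessable identifier can be reversed by hashing candidate values, so treat the hash as pseudonymous data, not anonymous data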
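Hashing preserves grouping because the same input always produces the same digest. The sketch below illustrates this with Python's `hashlib` and made-up order rows; it is expected to produce the same hex string as `SHA2(..., 256)` for the same UTF-8 input, but verify that against your own data before relying on it.

```python
import hashlib
from collections import Counter


def sha256_hex(value: str) -> str:
    """Hex-encoded SHA-256 digest of the UTF-8 bytes of value."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Hypothetical (customer_id, total_amount) rows
orders = [("CUST001", 20.0), ("CUST002", 5.0), ("CUST001", 7.5)]

# Aggregate by the hash only; the raw identifier never appears in the result
totals = Counter()
for customer_id, amount in orders:
    totals[sha256_hex(customer_id)] += amount

for digest, total in totals.items():
    print(digest[:12], total)
```

Two distinct digests come out for the three rows, so per-customer totals survive even though the identifier itself is hidden.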

In [0]:
display(spark.sql(f"SELECT * FROM {CATALOG}.{GOLD_SCHEMA}.orders_hashed"))

### 9.9.2. Row-Level Security (RLS):

Restrict which rows users can see based on their identity or group membership:

**RLS View customers_rls created**

Row-Level Security filters data based on group membership:
- **global-access**: Sees all customers
- **east-coast-team**: Only customers from NY, NJ, NC, GA 
- **alt_test**: Only customers from CA
- **midwest-team**: Only customers from FL, IL, TX, MI
- **Other groups**: No access (FALSE)

Automatic row filtering without data duplication.

In [0]:
# Creating RLS view - access per region (state)
spark.sql(f"""
    CREATE OR REPLACE VIEW {CATALOG}.{GOLD_SCHEMA}.customers_rls AS
    SELECT *
    FROM {CATALOG}.{BRONZE_SCHEMA}.customers
    WHERE 
        CASE 
            WHEN is_account_group_member('global-access') THEN TRUE
            WHEN is_account_group_member('east-coast-team') THEN UPPER(state) IN ('NY', 'NJ', 'NC', 'GA')
            WHEN is_account_group_member('alt_test') THEN UPPER(state) = 'CA'
            WHEN is_account_group_member('midwest-team') THEN UPPER(state) IN ('FL', 'IL', 'TX', 'MI')
            ELSE FALSE
        END
""")
display(spark.table(f"{CATALOG}.{GOLD_SCHEMA}.customers_rls"))
display(spark.createDataFrame([
    ("RLS View", f"{CATALOG}.{GOLD_SCHEMA}.customers_rls"),
    ("Mechanism", "Filtering per state based on group membership"),
    ("global-access", "All customers"),
    ("east-coast-team", "NY, NJ, NC, GA"),
    ("alt_test", "CA"),
    ("midwest-team", "FL, IL, TX, MI")
], ["Group", "Visibility"]))
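A `CASE` expression stops at the first branch whose condition is true, which matters for users who belong to more than one group. The sketch below simulates the view's filter in plain Python; `row_visible` is a hypothetical stand-in for the SQL predicate, with group membership passed in as a set instead of calling `is_account_group_member()`.

```python
# Rules in the same order as the CASE branches; None means "all states"
RULES = [
    ("global-access", None),
    ("east-coast-team", {"NY", "NJ", "NC", "GA"}),
    ("alt_test", {"CA"}),
    ("midwest-team", {"FL", "IL", "TX", "MI"}),
]


def row_visible(user_groups, state):
    """Return True if a row with this state passes the filter for the user."""
    for group, states in RULES:
        if group in user_groups:
            if states is None:
                return True
            # A NULL state fails the IN check, so the row is filtered out
            return state is not None and state.upper() in states
    return False  # ELSE FALSE: no matching group, no rows


print(row_visible({"alt_test"}, "ca"))                     # True
print(row_visible({"east-coast-team", "alt_test"}, "CA"))  # False
print(row_visible({"some-other-group"}, "NY"))             # False
```

The second call shows the ordering effect: a member of both `east-coast-team` and `alt_test` is evaluated against the east-coast branch only and does not see CA rows.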

**Granting permissions on the RLS views** is covered in section 9.10.

### 9.9.3. Attribute-Based Access Control (ABAC)

ABAC is a **security pattern** in Unity Catalog that combines multiple governance features to control access based on **data attributes** rather than just user identity.

**ABAC in Databricks = Tags + Column Masks + Row Filters**

| Component | Purpose | Mechanism |
|---|---|---|
| **Tags** (Data Classification) | Mark sensitive data with attributes | `ALTER TABLE ... SET TAGS ('pii' = 'true')` |
| **Column Masks** | Hide/transform column values based on user group | `CREATE FUNCTION mask_fn(...)` + Dynamic Views |
| **Row Filters** | Restrict which rows a user can see | RLS Views with `IS_ACCOUNT_GROUP_MEMBER()` |

**How ABAC works in practice:**

1. **Classify** -- Tag tables and columns with sensitivity levels (e.g., `pii`, `data_classification`)
2. **Define policies** -- Create masking functions and RLS views that enforce access rules
3. **Assign** -- Grant access to groups; the masking functions and row filters bound to the tagged data then enforce attribute-based filtering (tags alone do not restrict access)
4. **Audit** -- Use `system.information_schema` to track tags and access patterns

**Key difference from RBAC:**
- **RBAC** (Role-Based): Access determined by user's *role* (e.g., `data-analysts` group gets SELECT)
- **ABAC** (Attribute-Based): Access determined by *data attributes* (e.g., a policy masks every column tagged `pii=true`)

> **Exam note**: Unity Catalog implements ABAC through the combination of Tags, Column Masks, and Row Filters. Know that tags provide metadata for governance, while masks and filters enforce data-level security policies.

## 9.10. Access Management: GRANT / REVOKE

### Privileges Hierarchy in Unity Catalog:

**Privilege levels**:
1. **Metastore-level**: CREATE CATALOG, CREATE EXTERNAL LOCATION
2. **Catalog-level**: USE CATALOG, CREATE SCHEMA
3. **Schema-level**: USE SCHEMA, CREATE TABLE, CREATE FUNCTION, CREATE VOLUME
4. **Object-level**: SELECT, MODIFY (INSERT/UPDATE/DELETE/MERGE), EXECUTE

**Securable Objects - Inheritance**:
- Privileges inherit down the hierarchy
- GRANT on Catalog → inherits to all Schemas and Tables
- GRANT on Schema → inherits to all Tables in that Schema
- You can grant privileges at specific level for fine-grained control
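The inheritance rules can be modeled as a walk up the object hierarchy. The following is a simplified sketch, not the Unity Catalog implementation: grants are a set of hypothetical `(principal, privilege, securable)` tuples, and ownership, `ALL PRIVILEGES`, and group membership are ignored. It also shows that reading a table needs `USE CATALOG` and `USE SCHEMA` in addition to `SELECT`.

```python
def levels(securable):
    """'prod.gold.t' -> ['prod.gold.t', 'prod.gold', 'prod'] (object first)."""
    parts = securable.split(".")
    return [".".join(parts[:i]) for i in range(len(parts), 0, -1)]


def has_privilege(grants, principal, privilege, securable):
    """True if the privilege is granted on the securable or any ancestor."""
    return any((principal, privilege, level) in grants for level in levels(securable))


def can_select(grants, principal, table):
    """Reading a table needs USE CATALOG, USE SCHEMA and SELECT."""
    catalog, schema, _ = table.split(".")
    return (
        has_privilege(grants, principal, "USE CATALOG", catalog)
        and has_privilege(grants, principal, "USE SCHEMA", f"{catalog}.{schema}")
        and has_privilege(grants, principal, "SELECT", table)
    )


# Hypothetical grants: SELECT on the schema is inherited by every table in it
grants = {
    ("data-analysts", "USE CATALOG", "prod"),
    ("data-analysts", "USE SCHEMA", "prod.gold"),
    ("data-analysts", "SELECT", "prod.gold"),
}

print(can_select(grants, "data-analysts", "prod.gold.daily_revenue"))  # True
print(can_select(grants, "data-analysts", "prod.silver.orders"))       # False
```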

### GRANT/REVOKE Examples:

#### Permission Inheritance Diagram

![Permission Inheritance - GRANT privilege inheritance in the UC hierarchy](../../../assets/images/training_2026/m09_grant_inheritance.png)

In [0]:
# GRANT access to customers_rls
spark.sql(
    f"""
    GRANT SELECT ON VIEW {CATALOG}.{GOLD_SCHEMA}.customers_rls TO `account users`
    """
)

**Granting permissions to orders_hashed**

In [0]:
# GRANT access to orders_hashed
spark.sql(f"""
  GRANT SELECT ON VIEW {CATALOG}.{GOLD_SCHEMA}.orders_hashed TO `account users`
""")

**Revoking access to base tables (Enforcement):**

In [0]:
# Revoke direct access to base table
spark.sql(f"""
    REVOKE SELECT ON TABLE {CATALOG}.{BRONZE_SCHEMA}.orders FROM `account users`
""")

**RLS Views - Access control setup**

Security pattern:
1. **GRANT SELECT** on RLS Views for `account users`
2. **REVOKE SELECT** on base tables (force Views usage)
3. **Automatic filtering** based on group membership

Users can SELECT from Views, but not from base tables - enforcing RLS.

---

## 9.11. Data Lineage and Audit Logging

### 9.11.1. Querying Data Lineage:

Unity Catalog automatically tracks lineage for:
- Table → Table (ETL transformations)
- Notebook → Table (data writes)
- Dashboard → Table (BI queries)
- ML Model → Table (training data)

**General Table Lineage**

In [0]:
# Query table lineage from system tables
lineage_df = spark.sql(f"""
  SELECT 
    source_table_full_name,
    source_type,
    target_table_full_name,
    target_type,
    event_date,
    created_by
  FROM system.access.table_lineage
  WHERE target_table_full_name LIKE '{CATALOG}.%'
  ORDER BY event_date DESC
  LIMIT 50
""")

display(lineage_df)

**Lineage for tables in catalog displayed**

Lineage is available through `system.access.table_lineage` without additional instrumentation.

**1. Upstream Lineage (Sources)**

In [0]:
# Find upstream dependencies (sources) for a table
upstream_df = spark.sql(f"""
    SELECT DISTINCT
        source_table_full_name,
        source_type
    FROM system.access.table_lineage
    WHERE target_table_full_name = '{CATALOG}.{SILVER_SCHEMA}.fact_sales'
""")

display(upstream_df)

**⬆ Upstream: Source tables for fact_sales**

Shows all tables used as data sources in the `fact_sales` View. Helpful for impact analysis when making changes to upstream tables.

**2. Downstream Lineage (Consumers)**

In [0]:
# Find downstream dependencies (consumers) of a table
downstream_df = spark.sql(f"""
    SELECT DISTINCT
        target_table_full_name,
        target_type
    FROM system.access.table_lineage
    WHERE source_table_full_name = '{CATALOG}.{BRONZE_SCHEMA}.customers'
""")

display(downstream_df)

**Downstream: Views/Tables consuming customers**

Shows all Views and tables that consume data from the `customers` table. Critical for understanding impact of changes and data governance.
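
The query above returns direct consumers only. For impact analysis it is often useful to follow the chain further, to consumers of consumers. The sketch below is one way to do that in plain Python with a breadth-first walk over (source, target) pairs. The function name `downstream_closure` and the sample edges are hypothetical; in a notebook the pairs would be collected from `system.access.table_lineage`.

```python
from collections import deque


def downstream_closure(edges, start):
    """Return every table reachable downstream of `start`, sorted by name.

    `edges` is an iterable of (source_table_full_name, target_table_full_name)
    pairs. Pairs with a missing side are skipped.
    """
    children = {}
    for source, target in edges:
        if source and target:
            children.setdefault(source, set()).add(target)

    seen = set()
    queue = deque([start])
    while queue:
        table = queue.popleft()
        for child in sorted(children.get(table, ())):
            if child not in seen and child != start:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


# Hypothetical edges for illustration; real ones come from the lineage table.
sample_edges = [
    ("main.bronze.customers", "main.silver.dim_customer"),
    ("main.silver.dim_customer", "main.gold.fact_sales"),
    ("main.bronze.orders", "main.gold.fact_sales"),
    (None, "main.bronze.orders"),
]
print(downstream_closure(sample_edges, "main.bronze.customers"))
```

The `seen` set keeps the walk from looping if the lineage graph contains a cycle, for example a table that is rewritten from one of its own descendants.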

**3. Column-Level Lineage**

In [0]:
# Column-level lineage (if available)
column_lineage = spark.sql(f"""
    SELECT 
        source_table_full_name,
        source_column_name,
        target_table_full_name,
        target_column_name,
        event_date
    FROM system.access.column_lineage
    WHERE target_table_full_name = '{CATALOG}.{SILVER_SCHEMA}.fact_sales'
    ORDER BY target_column_name
""")

display(column_lineage)

**Column-level lineage for fact_sales**

Unity Catalog tracks lineage at column level - which columns in source tables affect which columns in the target table. Detailed information for data governance and impact analysis.

### 9.11.2. Audit Logging:

Unity Catalog logs all access and operations:

**1. General Audit Logs**

In [0]:
# Query audit logs
audit_df = spark.sql("""
    SELECT 
        event_time,
        user_identity.email as user_email,
        service_name,
        action_name,
        request_params.full_name_arg as table_name,
        response.status_code,
        request_id
    FROM system.access.audit
    WHERE action_name IN ('getTable', 'createTable', 'deleteTable', 'updateTable')
        AND event_date >= current_date() - INTERVAL 7 DAYS
    ORDER BY event_time DESC
    LIMIT 100
""")
audit_df.display()

**2. Sensitive Data Access**

In [0]:
# Track who accessed sensitive tables
sensitive_access = spark.sql(f"""
    SELECT 
        event_time,
        user_identity.email as user,
        action_name,
        request_params.full_name_arg as table_accessed,
        source_ip_address
    FROM system.access.audit
    WHERE request_params.full_name_arg LIKE '{CATALOG}.%.customers%'
        AND action_name = 'getTable'
        AND event_date >= current_date() - INTERVAL 7 DAYS
    ORDER BY event_time DESC
    LIMIT 100
""")

display(sensitive_access)

**Audit logs: Access to customers table (last 7 days)**

Monitoring access to sensitive tables with PII data:
- **Who**: User email
- **When**: Event time 
- **What**: Table name
- **From where**: Source IP address

Critical for compliance (GDPR, HIPAA) and security monitoring.
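
A raw list of events is hard to review, so a common next step is to count accesses per user and table. The sketch below does this in plain Python on rows shaped like the query result above. The helper `access_counts` and the sample rows are hypothetical; in a notebook the rows could be obtained with `[r.asDict() for r in sensitive_access.collect()]`.

```python
from collections import Counter


def access_counts(rows):
    """Count audit events per (user, table) pair, most frequent first."""
    counts = Counter((row["user"], row["table_accessed"]) for row in rows)
    return counts.most_common()


# Hypothetical rows shaped like the query result above.
sample_rows = [
    {"user": "ana@example.com", "table_accessed": "main.bronze.customers"},
    {"user": "ana@example.com", "table_accessed": "main.bronze.customers"},
    {"user": "bo@example.com", "table_accessed": "main.silver.customers_clean"},
]
for (user, table), hits in access_counts(sample_rows):
    print(f"{user:20} {table:30} {hits}")
```

For large audit windows the same aggregation is better expressed as a `GROUP BY` in the SQL query, so that only the summary leaves the cluster.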

**3. Privilege Changes**

In [0]:
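# Grant/Revoke audit trail
# NOTE: if this returns no rows, list the distinct action_name values logged for
# service_name = 'unityCatalog'; permission changes are often recorded as 'updatePermissions'.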
grant_audit = spark.sql("""
    SELECT 
        event_time,
        user_identity.email as admin_user,
        action_name,
        request_params.privilege as privilege_granted,
        request_params.securable_full_name as object_name,
        request_params.principal as grantee
    FROM system.access.audit
    WHERE action_name IN ('grantPrivilege', 'revokePrivilege')
        AND event_date >= current_date() - INTERVAL 30 DAYS
    ORDER BY event_time DESC
""")

display(grant_audit)

**Audit trail of privilege changes**

Complete audit trail of permission changes:
- **Admin user**: Who executed GRANT/REVOKE
- **Action**: grantPrivilege or revokePrivilege
- **Privilege**: Which permission (SELECT, MODIFY, etc.)
- **Object**: On which object (table, schema, catalog)
- **Grantee**: To whom permissions were granted/revoked

Essential for governance and compliance audits.

---

## 9.12. Delta Sharing

**Delta Sharing** = Secure data sharing protocol (cross-org, cross-cloud)

### 9.12.1. Components:
- **Share**: collection of tables to share
- **Recipient**: organization/user receiving data
- **Provider**: data owner (you)

### 9.12.2. Create Share:

#### Delta Sharing Architecture

![Delta Sharing - Provider, Recipient, Open Protocol, Cross-Cloud](../../../assets/images/training_2026/m09_delta_sharing.png)

In [0]:
# Creating Share for external partners
share_name = f"{CATALOG}_partner_share"

spark.sql(f"""
  CREATE SHARE IF NOT EXISTS {share_name}
  COMMENT 'Data sharing for business partners'
""")

**Share '{share_name}' created**

A Delta Sharing share is a collection of tables that can be shared securely with external partners:
- **Cross-org**: Between different Databricks organizations
- **Cross-cloud**: AWS ↔ Azure ↔ GCP 
- **Open protocol**: Open-source standard

In [0]:
# Add table to Share (Gold layer only - aggregated data)
spark.sql(f"""
  ALTER SHARE {share_name}
  ADD TABLE {CATALOG}.{GOLD_SCHEMA}.fact_sales
""")

In [0]:
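# Add an entire schema to the Share (covers all tables in the schema).
# Note: this exposes the Silver layer, which goes beyond the Gold-only practice described below.
spark.sql(f"""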
  ALTER SHARE {share_name}
  ADD SCHEMA {CATALOG}.{SILVER_SCHEMA}
""")

**Table fact_sales added to Share**

Best practice: Share only Gold layer (aggregated data):
- **Security**: No access to raw data
- **Privacy**: Aggregations hide individual records
- **Stability**: Gold layer has stable schema and structure

In [0]:
# Verify Share contents
spark.sql(f"SHOW ALL IN SHARE {share_name}").display()

**Tables in Share verified**

The Share now contains the added tables and can be shared with recipients. Recipients will receive an activation link to consume the shared data via the Delta Sharing protocol.

### 9.12.3. Create Recipient:

> **[UI DEMO]** Create a recipient in the Databricks UI: Catalog -> Delta Sharing -> New Recipient.
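
The same steps can also be scripted with SQL instead of the UI. The sketch below only builds the statements; `recipient_setup_statements`, the recipient name, and the share name are hypothetical placeholders. It assumes an open-sharing recipient, that identifiers are plain names, and that the comment contains no single quotes.

```python
def recipient_setup_statements(recipient, share, comment="Business partner"):
    """Build SQL to create a recipient, grant it a share, and inspect it.

    Hypothetical helper: it returns SQL text and executes nothing.
    """
    return [
        f"CREATE RECIPIENT IF NOT EXISTS {recipient} COMMENT '{comment}'",
        f"GRANT SELECT ON SHARE {share} TO RECIPIENT {recipient}",
        f"DESCRIBE RECIPIENT {recipient}",
    ]


# Placeholder names for illustration.
recipient_sql = recipient_setup_statements("partner_acme", "my_catalog_partner_share")
for stmt in recipient_sql:
    print(stmt)  # in a notebook: spark.sql(stmt).display()
```

Without the `GRANT SELECT ON SHARE` step the recipient exists but sees no data, so creating and granting are kept together here.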

### 9.12.4. Consuming shared data (as recipient):

> **[UI DEMO]** As a recipient, use the activation link to access shared data via Open Sharing protocol.

### 9.12.5. Best practices for Delta Sharing:

1. **Share only aggregated/gold data**: don't share raw/bronze layers
2. **Use views for masking**: create view with masked PII before sharing
3. **Monitor access**: track who accesses shared data
4. **Version control**: use table versions for stable APIs
5. **Documentation**: clear documentation for recipients

---

## 9.13. Information Schema

**Theoretical Introduction:**

Unity Catalog provides an `INFORMATION_SCHEMA` in every catalog for metadata queries.

```sql
-- List all tables in a schema
SELECT table_name, table_type, created
FROM my_catalog.information_schema.tables
WHERE table_schema = 'my_schema';

-- List all columns for a table
SELECT column_name, data_type, is_nullable
FROM my_catalog.information_schema.columns
WHERE table_name = 'customers';

-- Check grants on a table
SELECT grantee, privilege_type
FROM my_catalog.information_schema.table_privileges
WHERE table_name = 'customers';
```

**Available views in `information_schema`:**

| View | Content |
|------|--------|
| `tables` | All tables and views |
| `columns` | Column definitions |
| `table_privileges` | Granted permissions |
| `schemata` | Schema metadata |
| `catalogs` | Catalog information |
| `views` | View definitions |

**Exam Note:** `INFORMATION_SCHEMA` is the standard SQL way to query metadata. It is available per catalog in Unity Catalog.
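
As an illustration of using `information_schema` for governance checks, the sketch below builds a query that lists tables without a comment. The helper `undocumented_tables_query` is hypothetical, and the `comment` column of `information_schema.tables` is assumed to be available in your workspace. Identifiers are validated before being interpolated into the SQL text.

```python
import re

# Conservative pattern: letters, digits, and underscores only.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def undocumented_tables_query(catalog, schema):
    """Build a metadata query for tables in `schema` that have no comment."""
    for name in (catalog, schema):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unexpected identifier: {name!r}")
    return (
        "SELECT table_name, table_type, created\n"
        f"FROM {catalog}.information_schema.tables\n"
        f"WHERE table_schema = '{schema}'\n"
        "  AND (comment IS NULL OR comment = '')\n"
        "ORDER BY table_name"
    )


query = undocumented_tables_query("my_catalog", "my_schema")
print(query)  # in a notebook: spark.sql(query).display()
```

Rejecting anything that is not a plain identifier keeps a stray quote or semicolon in a parameter from changing the meaning of the query.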

---

## 9.14. Summary

### 9.14.1. You learned:

- **Unity Catalog Architecture**: Metastore → Catalog → Schema → Tables
- **Access Control**: GRANT/REVOKE privileges at multiple levels
- **Data Masking**: Column-level masking with dynamic views
- **Row-Level Security**: Filter data based on user identity
- **Data Lineage**: Track data flow through system tables
- **Audit Logging**: Monitor who accessed what and when
- **Monitoring & Observability**: Cost, job, query, and storage monitoring via System Tables
- **Delta Sharing**: Secure cross-organization data sharing

### 9.14.2. Key Takeaways:

1. **Unified Governance**: Single platform for all data assets
2. **Fine-grained Control**: Table, column, row-level security
3. **Automatic Lineage**: No extra instrumentation needed
4. **Compliance-ready**: Audit logs for regulatory requirements
5. **Secure Sharing**: Delta Sharing for external collaboration

## 9.15. Troubleshooting

### Problem 1: "Table or view not found"
**Cause**: Missing USE CATALOG or USE SCHEMA permissions 
**Solution**:
```sql
GRANT USE CATALOG ON CATALOG <catalog_name> TO <principal>;
GRANT USE SCHEMA ON SCHEMA <catalog>.<schema> TO <principal>;
```

### Problem 2: "Permission denied" on SELECT
**Cause**: Missing SELECT permissions on table 
**Solution**:
```sql
GRANT SELECT ON TABLE <catalog>.<schema>.<table> TO <principal>;
-- or on entire schema:
GRANT SELECT ON SCHEMA <catalog>.<schema> TO <principal>;
```

### Problem 3: "Cannot execute function"
**Cause**: Missing EXECUTE permission on function 
**Solution**:
```sql
GRANT EXECUTE ON FUNCTION <catalog>.<schema>.<function_name> TO <principal>;
```

### Problem 4: "Volume not accessible"
**Cause**: Missing READ VOLUME / WRITE VOLUME permissions 
**Solution**:
```sql
GRANT READ VOLUME ON VOLUME <catalog>.<schema>.<volume> TO <principal>;
GRANT WRITE VOLUME ON VOLUME <catalog>.<schema>.<volume> TO <principal>;
```

### Problem 5: RLS View not filtering data
**Cause**: User doesn't belong to any group defined in CASE WHEN 
**Solution**: Add user to appropriate group or add default fallback in View

### Problem 6: Lineage not showing dependencies
**Cause**: Lineage is automatic but may be delayed by a few minutes 
**Solution**: Wait 5-10 minutes and query system.access.table_lineage again
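
Instead of re-running the query by hand, the wait can be automated with a small polling loop. This is a minimal sketch: `poll_until_nonempty` is a hypothetical helper, and the demo uses a stand-in fetch function and a no-op sleep so it runs instantly. In a notebook, `fetch` could be something like `lambda: spark.sql(lineage_query).collect()`.

```python
import time


def poll_until_nonempty(fetch, attempts=5, delay_seconds=120, sleep=time.sleep):
    """Call `fetch()` until it returns a non-empty list or attempts run out."""
    for attempt in range(1, attempts + 1):
        rows = fetch()
        if rows:
            return rows
        if attempt < attempts:
            sleep(delay_seconds)  # no sleep after the final attempt
    return []


# Stand-in for a lineage query that is empty twice, then returns a row.
responses = iter([[], [], ["main.silver.fact_sales"]])
rows = poll_until_nonempty(lambda: next(responses), attempts=5, sleep=lambda s: None)
print(rows)
```

With the defaults the loop waits at most 8 minutes in total (four pauses of 120 seconds), which matches the 5-10 minute guidance above.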

### Problem 7: Share not visible to recipient
**Cause**: Recipient has not yet used the activation link  
**Solution**: Retrieve the activation link with `DESCRIBE RECIPIENT <recipient_name>` and send it to the recipient

---

## 9.16. Best Practices Summary

### 1. **Catalog Organization**
- Use environment-based catalogs: `dev`, `test`, `prod`
- Organize schemas by layers: `bronze`, `silver`, `gold`
- Apply naming conventions: `<catalog>.<schema>.<object>`

### 2. **Access Control**
- **Principle of Least Privilege**: Grant minimum required permissions
- Use groups, not individual users
- Inheritance: privileges granted on a Catalog are inherited by its Schemas, and from there by the Tables in them
- Regularly audit permissions (SHOW GRANTS)

### 3. **Data Masking & RLS**
- Mask PII in Views for users who are not members of the PII access group (e.g. `pii-access-group`)
- Use RLS for multi-tenant scenarios
- Always test masking with different group memberships

### 4. **Lineage & Audit**
- Leverage automatic lineage to track data flow
- Regularly check audit logs for sensitive tables
- Monitor lineage after pipeline changes

### 5. **Delta Sharing**
- Share only Gold layer (aggregated data)
- Use masked Views in Share
- Document Share contracts for recipients

### 6. **Documentation & Governance**
- Add COMMENT to all tables, views, functions
- Use Table Properties for metadata (owner, PII, retention)
- Regularly run governance health checks

### 7. **Monitoring & Observability**
- Use `system.billing.usage` for cost tracking and chargeback
- Monitor job/pipeline SLAs with `system.lakeflow.job_run_timeline` (named `system.workflow.job_run_timeline` in older workspaces)
- Track slow queries in `system.query.history`
- Set up daily governance health checks (undocumented tables, untagged PII)
- Create alerts for cost spikes and job failures

---