# Unity Catalog & Governance 

**Training Objective:** Understand Unity Catalog as the governance platform for the Databricks Lakehouse: managing access, data masking, lineage, and audit logging.

**Topics Covered:**
- Unity Catalog Architecture: Metastore, Catalog, Schema, Tables/Views/Volumes
- Access Management: GRANT/REVOKE privileges
- Data Masking and Row-Level Security
- Data Lineage and Audit Logging
- Delta Sharing - secure data sharing
- Best Practices for Data Governance


## Theoretical Introduction

**Section Objective:** Understand Unity Catalog as a unified governance platform for the data lakehouse.

**Basic Concepts:**
- **Unity Catalog**: Unified governance solution for all data assets
- **Metastore**: Region-level container for catalogs (top-level)
- **Three-level namespace**: catalog.schema.table
- **Securable objects**: Tables, Views, Functions, Volumes, Models
- **Fine-grained access control**: Table, column, row-level security
- **Automatic lineage**: End-to-end data flow tracking without instrumentation

**Unity Catalog Object Hierarchy:**
```
Metastore (region-level)
 ↓
Catalog (domain/environment)
 ↓
Schema (namespace/layer)
 ↓
Securable Objects:
 - Tables / Views (data)
```

## Per-User Isolation

Run the initialization script for per-user catalog and schema isolation:

In [0]:
%run ../00_setup

## Configuration

Paths to the source datasets (relative to `DATASET_BASE_PATH` from `00_setup`):

In [0]:
# Paths to data directories (subdirectories in DATASET_BASE_PATH from 00_setup)
CUSTOMERS_PATH = f"{DATASET_BASE_PATH}/customers"
ORDERS_PATH = f"{DATASET_BASE_PATH}/orders"
PRODUCTS_PATH = f"{DATASET_BASE_PATH}/products"

# Paths to specific files
CUSTOMERS_CSV = f"{CUSTOMERS_PATH}/customers.csv"
ORDERS_JSON = f"{ORDERS_PATH}/orders_batch.json"
PRODUCTS_PARQUET = f"{PRODUCTS_PATH}/products.parquet"

## Unity Catalog Architecture

**Unity Catalog** is a unified governance solution for Databricks Lakehouse.

### Object Hierarchy:

```
Metastore (region-level)
 ↓
Catalog (database/domain)
 ↓
Schema (namespace)
 ↓
Securable Objects:
 - Tables / Views
 - Functions (UDF, stored procedures)
 - Volumes (files storage)
 - Models (ML models)
```

### Three-level namespace:
```sql
catalog.schema.table
```

Example:
```sql
main.sales.orders
dev.analytics.customer_metrics
prod.gold.daily_revenue
```
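To make the naming rule concrete, here is a minimal Python sketch that splits a fully qualified name into its three levels. `QualifiedName` and `parse_full_name` are hypothetical names chosen for illustration, not part of any Databricks API, and the sketch assumes plain identifiers (no backtick-quoted names containing dots).

```python
from typing import NamedTuple


class QualifiedName(NamedTuple):
    catalog: str
    schema: str
    table: str


def parse_full_name(full_name: str) -> QualifiedName:
    """Split 'catalog.schema.table' into its three levels."""
    parts = full_name.split(".")
    # Exactly three non-empty parts are required for a full path.
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected catalog.schema.table, got {full_name!r}")
    return QualifiedName(*parts)


name = parse_full_name("prod.gold.daily_revenue")
print(name.catalog, name.schema, name.table)
```

A two-part name such as `sales.orders` is rejected here on purpose, because it only resolves once a default catalog is set.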

### Key Features:
- **Unified governance**: single platform for data, ML, BI
- **Fine-grained access control**: table, column, row level
- **Automatic lineage**: end-to-end data flow tracking
- **Audit logging**: who accessed what and when
- **Data discovery**: metadata search and tagging

---

### Setup and Basic Operations

### Creating User Groups
We create user groups for permission demonstration:
- `data_engineers`: Full access to Bronze/Silver schemas
- `data_analysts`: Read-only access to Gold

**Active context verification:**

In [0]:
# Verification of created schemas
schemas = spark.sql(f"SHOW SCHEMAS IN {CATALOG}").select("databaseName").collect()
schema_names = [row.databaseName for row in schemas]

**Active catalog and schema set**

We set the default working context - all subsequent operations will be executed in this catalog and schema unless a full path is specified.

In [0]:
# Verification of created schemas
spark.sql(f"SHOW SCHEMAS IN {CATALOG}").display()

## Data Preparation

Before we proceed to access management, we will load real data from the dataset/ directory that we will use in the Unity Catalog examples.

In [0]:
# The JSON reader infers the schema by default; "header" and "inferSchema" are CSV options
orders_df = spark.read.json(ORDERS_JSON)
orders_df.write.format("delta").mode("overwrite").saveAsTable(f"{CATALOG}.{BRONZE_SCHEMA}.orders")

display(spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.orders"))

In [0]:
customers_df = spark.read.option("header", "true").option("inferSchema", "true").csv(CUSTOMERS_CSV)
customers_df.write.format("delta").mode("overwrite").saveAsTable(f"{CATALOG}.{BRONZE_SCHEMA}.customers")

display(spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.customers"))

In [0]:
products_df = spark.read.parquet(PRODUCTS_PARQUET)
products_df.write.format("delta").mode("overwrite").saveAsTable(f"{CATALOG}.{BRONZE_SCHEMA}.products")
display(spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.products"))

In [0]:
# Verification of orders record count
spark.sql(f"SELECT COUNT(*) as count FROM {CATALOG}.{BRONZE_SCHEMA}.orders").display()

### Add comments to table and columns

You can add descriptive comments to Unity Catalog tables and columns using SQL commands. This improves data discoverability and governance.

The cell below demonstrates how to add comments to a table and a specific column using Spark SQL.

In [0]:
# Add comments to table and columns
spark.sql(f"""
    COMMENT ON TABLE {CATALOG}.{BRONZE_SCHEMA}.orders IS
    'Cleaned orders table with data quality validations applied'
""")

spark.sql(f"""
    COMMENT ON COLUMN {CATALOG}.{BRONZE_SCHEMA}.orders.customer_id IS
    'Customer identifier - PII data, access restricted'
""")

### Add tags to orders table

You can classify and manage tables in Unity Catalog using **tags** (key-value pairs). Tags help with data discovery, compliance, and governance (e.g., marking tables as PII, GDPR, or Sensitive).

Example:

```sql
ALTER TABLE ecommerce_platform_trainer.bronze.orders
  SET TAGS ('pii' = 'false', 'data_classification' = 'transactional', 'retention' = '7_years');
```

- `pii`: Indicates if table contains personally identifiable information.
- `data_classification`: Describes the type of data (e.g., transactional, reference).
- `retention`: Specifies data retention policy.

You need the `APPLY TAG` privilege on the object (plus `USE SCHEMA` and `USE CATALOG` on its parents) to add tags.
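When the same tags are applied to many tables, it can help to generate the statement from a mapping. The sketch below is one possible approach; `build_set_tags_sql` is a hypothetical helper, and rejecting single quotes (instead of escaping them) is a simplifying assumption.

```python
from typing import Mapping


def build_set_tags_sql(table: str, tags: Mapping[str, str]) -> str:
    """Render an ALTER TABLE ... SET TAGS statement from a mapping of tags."""
    if not tags:
        raise ValueError("at least one tag is required")
    for key, value in tags.items():
        # Assumption: reject quotes rather than guess at SQL escaping rules.
        if "'" in key or "'" in value:
            raise ValueError(f"single quotes are not supported: {key!r}={value!r}")
    pairs = ", ".join(f"'{k}' = '{v}'" for k, v in tags.items())
    return f"ALTER TABLE {table} SET TAGS ({pairs})"


stmt = build_set_tags_sql(
    "ecommerce_platform_trainer.bronze.orders",
    {"pii": "false", "retention": "7_years"},
)
print(stmt)
```

In a notebook, the resulting string would be passed to `spark.sql(stmt)`.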

In [0]:
# Add tags to orders table
spark.sql(f"""
    ALTER TABLE {CATALOG}.{BRONZE_SCHEMA}.orders 
    SET TAGS ('sensitivity' = 'high', 'domain' = 'sales')
""")

# Add tags to customer_id column
spark.sql(f"""
    ALTER TABLE {CATALOG}.{BRONZE_SCHEMA}.orders 
    ALTER COLUMN customer_id SET TAGS ('pii' = 'true')
""")

display(spark.createDataFrame([("Status", "Tags added to table and column")], ["Info", "Value"]))

## Unity Catalog Functions (UDF)

**Functions** in Unity Catalog allow:
- Creating reusable SQL/Python functions
- Centralized management of business logic
- Access control through GRANT/REVOKE
- Lineage tracking for functions

**Function types**:
- **Scalar Functions**: return a single value
- **Table Functions**: return a table
- **SQL Functions**: written in SQL
- **Python Functions**: written in Python (UDF)

### Data Classification (Tagging)

**Tagging** allows data classification (e.g., PII, Sensitive, GDPR) at the table or column level.
This facilitates data discovery and governance (e.g., reporting all tables containing personal data).

In [0]:
# SQL Function - masking customer_id
spark.sql(f"""
  CREATE OR REPLACE FUNCTION {CATALOG}.{SILVER_SCHEMA}.mask_customer_id(customer_id STRING)
  RETURNS STRING
  LANGUAGE SQL
  COMMENT 'Masks customer_id, showing only last 3 digits'
  RETURN CONCAT('****', SUBSTRING(CAST(customer_id AS STRING), -3))
""")

In [0]:
# Test mask_customer_id function
result_df = spark.sql(f"""
  SELECT 
    customer_id,
    {CATALOG}.{SILVER_SCHEMA}.mask_customer_id(customer_id) as masked_id,
    first_name,
    last_name
  FROM {CATALOG}.{BRONZE_SCHEMA}.customers
  LIMIT 5
""")

display(result_df)
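As a cross-check of the masking rule in `mask_customer_id`, the same logic can be sketched in plain Python. `mask_customer_id_py` is a hypothetical local helper, not the Unity Catalog function itself, and the sample ID is made up. It assumes a missing input should stay missing, mirroring how `CONCAT` returns NULL when any argument is NULL.

```python
from typing import Optional


def mask_customer_id_py(customer_id: Optional[str]) -> Optional[str]:
    """Keep only the last three characters, prefixed with a fixed mask."""
    if customer_id is None:
        return None  # mirrors NULL propagation in the SQL version
    return "****" + customer_id[-3:]


print(mask_customer_id_py("CUST-10482"))
```

Because the mask prefix has a fixed length, the output does not reveal how long the original identifier was.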

**Function categorize_price created**

The Python UDF categorizes product prices:
- **Low**: < 50
- **Medium**: 50 to < 200
- **High**: >= 200

Python UDF can contain any Python logic.
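The boundary behaviour is easiest to verify outside Spark. The sketch below repeats the branching used in the UDF body (below 50 is Low, 50 up to but not including 200 is Medium, 200 and above is High). The `None` guard is an assumption added here for illustration; the UDF in this notebook does not include it.

```python
from typing import Optional


def categorize_price(price: Optional[float]) -> Optional[str]:
    """Bucket a price into Low / Medium / High."""
    if price is None:
        return None  # assumption: pass missing prices through instead of raising
    if price < 50:
        return "Low"
    if price < 200:
        return "Medium"
    return "High"


for value in (49.99, 50, 199.99, 200):
    print(value, categorize_price(value))
```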

In [0]:
# Python UDF - price categorization
spark.sql(f"""
  CREATE OR REPLACE FUNCTION {CATALOG}.{SILVER_SCHEMA}.categorize_price(price DOUBLE)
  RETURNS STRING
  LANGUAGE PYTHON
  COMMENT 'Categorizes prices: Low, Medium, High'
  AS $$
    if price < 50:
        return "Low"
    elif price < 200:
        return "Medium"
    else:
        return "High"
  $$
""")

In [0]:
# Test categorize_price function
result_df = spark.sql(f"""
  SELECT 
    product_name,
    unit_cost,
    {CATALOG}.{SILVER_SCHEMA}.categorize_price(unit_cost) as price_category
  FROM {CATALOG}.{BRONZE_SCHEMA}.products
  ORDER BY unit_cost
  LIMIT 10
""")

display(result_df)

---

## Data Masking and Row-Level Security

### Column-level masking (Dynamic Views):

Use `current_user()` and `is_account_group_member()` functions for conditional masking:

In [0]:
# Create masked view for PII data
spark.sql(f"""
  CREATE OR REPLACE VIEW {CATALOG}.{GOLD_SCHEMA}.customers_masked AS
  SELECT 
    customer_id,
    CASE 
      WHEN is_account_group_member('pii-access-group') THEN first_name
      ELSE CONCAT(LEFT(first_name, 1), '***')
    END as first_name,
    CASE 
      WHEN is_account_group_member('pii-access-group') THEN last_name
      ELSE CONCAT(LEFT(last_name, 1), '***')
    END as last_name,
    city,
    country,
    registration_date
  FROM {CATALOG}.{BRONZE_SCHEMA}.customers
""")

**View customers_masked created**

View with dynamic PII data masking:

In [0]:
# Test View with masking
result_df = spark.sql(f"""
  SELECT * FROM {CATALOG}.{GOLD_SCHEMA}.customers_masked LIMIT 10
""")

display(result_df)
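The `CASE` expressions in `customers_masked` reduce to a simple rule: members of the privileged group see the value, everyone else sees the first character followed by `***`. A minimal Python sketch of that rule follows; `mask_name` and the boolean `has_pii_access` (standing in for `is_account_group_member('pii-access-group')`) are hypothetical names used only for illustration.

```python
from typing import Optional


def mask_name(name: Optional[str], has_pii_access: bool) -> Optional[str]:
    """Return the name unchanged for privileged users, else a masked form."""
    if has_pii_access or name is None:
        return name
    # Keep the first character only, like CONCAT(LEFT(name, 1), '***').
    return name[:1] + "***"


print(mask_name("Alice", has_pii_access=True))
print(mask_name("Alice", has_pii_access=False))
```

The key property of a dynamic view is that this decision is made per query, based on who is running it, so a single view serves both audiences.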

In [0]:
%sql
CREATE OR REPLACE FUNCTION customer_mask(customer_id STRING)
  RETURN CASE WHEN is_account_group_member('data-engineers') THEN customer_id ELSE 'CUST-**-****' END

In [0]:
# Define the customers_masked table schema without CTAS
spark.sql(f"""
  CREATE OR REPLACE TABLE {CATALOG}.{SILVER_SCHEMA}.customers_masked (
    customer_id STRING,
    customer_id_masked STRING MASK customer_mask,
    first_name STRING,
    last_name STRING,
    email STRING,
    country STRING
  )
""")

In [0]:
display(spark.sql(f"SELECT * FROM {CATALOG}.{SILVER_SCHEMA}.customers_masked"))

In [0]:
spark.sql(
    f"""
    CREATE OR REPLACE VIEW {CATALOG}.{GOLD_SCHEMA}.orders_hashed AS
    SELECT 
        order_id,
        SHA2(CAST(customer_id AS STRING), 256) as customer_id_hash,
        product_id,
        quantity,
        total_amount,
        order_datetime
    FROM {CATALOG}.{BRONZE_SCHEMA}.orders
    """
)

display(
    spark.createDataFrame(
        [
            ("View", f"{CATALOG}.{GOLD_SCHEMA}.orders_hashed"),
            ("Masking", "customer_id → SHA2-256 hash"),
            ("Purpose", "Analysts can aggregate without revealing customer_id")
        ],
        ["Parameter", "Value"]
    )
)

**View orders_hashed created**

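`customer_id` is hashed with SHA2-256. This enables:
- **Analysts**: Data aggregation without revealing the raw `customer_id`
- **Privacy**: Pseudonymization that preserves grouping, since equal IDs always map to the same hash
- **Compliance**: Support for GDPR/privacy requirements; note that an unsalted hash of a guessable ID can be reversed by brute force, so hashed IDs should still be handled as personal data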
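To see why hashing keeps grouping intact, here is a small sketch using Python's standard `hashlib`. For the same UTF-8 string it should yield the same hex digest as `SHA2(..., 256)` in the view. The order and customer IDs are made-up sample values, and `hash_customer_id` is a hypothetical helper name.

```python
import hashlib
from collections import Counter


def hash_customer_id(customer_id: str) -> str:
    """Return the SHA-256 hex digest of the identifier."""
    return hashlib.sha256(customer_id.encode("utf-8")).hexdigest()


# Made-up sample data: (order_id, customer_id)
orders = [("o1", "CUST-001"), ("o2", "CUST-002"), ("o3", "CUST-001")]

# Equal IDs map to equal hashes, so counts per customer are preserved.
orders_per_customer = Counter(hash_customer_id(cid) for _, cid in orders)
print(sorted(orders_per_customer.values()))
```

The same determinism is what makes an unsalted hash guessable: anyone who can enumerate candidate IDs can hash them and compare.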

In [0]:
display(spark.sql(f"SELECT * FROM {CATALOG}.{GOLD_SCHEMA}.orders_hashed"))

### Row-Level Security (RLS):

Restrict which rows users can see based on their identity or group membership:

**RLS View customers_rls created**

Row-Level Security filters data based on group membership:
- **global-access**: Sees all customers
- **east-coast-team**: Only customers from NY, NJ, NC, GA 
- **west-coast-team**: Only customers from CA
- **midwest-team**: Only customers from FL, IL, TX, MI
- **Other groups**: No access (FALSE)

Automatic row filtering without data duplication.
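The group-to-state mapping above can be sketched as a plain Python predicate, which makes the first-match behaviour of the `CASE` expression easy to test. `row_visible` is a hypothetical helper; the group names and state lists are copied from the list above, and passing the caller's groups explicitly stands in for `is_account_group_member()`.

```python
from typing import Iterable, Optional

EAST = {"NY", "NJ", "NC", "GA"}
WEST = {"CA"}
MIDWEST = {"FL", "IL", "TX", "MI"}


def row_visible(state: Optional[str], groups: Iterable[str]) -> bool:
    """Decide whether a customer row is visible to a user in `groups`."""
    member_of = set(groups)
    normalized = state.upper() if state is not None else None
    # Branches are checked in order and the first matching group wins.
    if "global-access" in member_of:
        return True
    if "east-coast-team" in member_of:
        return normalized in EAST
    if "west-coast-team" in member_of:
        return normalized in WEST
    if "midwest-team" in member_of:
        return normalized in MIDWEST
    return False


print(row_visible("ny", ["east-coast-team"]))
print(row_visible("CA", ["east-coast-team", "west-coast-team"]))
```

Note the second call: a user in both regional groups only gets the east-coast rows, because the first matching branch decides the outcome.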

In [0]:
# Creating RLS view - access per region (state)
spark.sql(f"""
    CREATE OR REPLACE VIEW {CATALOG}.{GOLD_SCHEMA}.customers_rls AS
    SELECT *
    FROM {CATALOG}.{BRONZE_SCHEMA}.customers
    WHERE 
        CASE 
            WHEN is_account_group_member('global-access') THEN TRUE
            WHEN is_account_group_member('east-coast-team') THEN UPPER(state) IN ('NY', 'NJ', 'NC', 'GA')
            WHEN is_account_group_member('west-coast-team') THEN UPPER(state) = 'CA'
            WHEN is_account_group_member('midwest-team') THEN UPPER(state) IN ('FL', 'IL', 'TX', 'MI')
            ELSE FALSE
        END
""")

In [0]:
display(spark.createDataFrame([
    ("RLS View", f"{CATALOG}.{GOLD_SCHEMA}.customers_rls"),
    ("Mechanism", "Filtering per state based on group membership"),
    ("global-access", "All customers"),
    ("east-coast-team", "NY, NJ, NC, GA"),
    ("west-coast-team", "CA"),
    ("midwest-team", "FL, IL, TX, MI")
], ["Group", "Visibility"]))

**Granting permissions to RLS Views:**

## Access Management: GRANT / REVOKE

### Privileges Hierarchy in Unity Catalog:

**Privilege levels**:
1. **Metastore-level**: CREATE CATALOG, CREATE SHARE, CREATE RECIPIENT
2. **Catalog-level**: USE CATALOG, CREATE SCHEMA
3. **Schema-level**: USE SCHEMA, CREATE TABLE, CREATE FUNCTION, CREATE VOLUME
4. **Object-level**: SELECT, MODIFY (INSERT/UPDATE/DELETE/MERGE), EXECUTE

**Securable Objects - Inheritance**:
- Privileges inherit down the hierarchy
- GRANT on Catalog → inherits to all Schemas and Tables
- GRANT on Schema → inherits to all Tables in that Schema
- You can grant privileges at a specific level for fine-grained control
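The inheritance rule can be illustrated with a deliberately simplified model: a privilege applies to an object if it was granted on the object itself or on any of its parents. This is a teaching sketch only. It ignores group membership and the separate `USE CATALOG` / `USE SCHEMA` requirements, and `ancestors`, `has_privilege`, and the sample grants are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# (securable full name, principal) -> privileges granted directly at that level
Grants = Dict[Tuple[str, str], Set[str]]


def ancestors(full_name: str) -> List[str]:
    """'c.s.t' -> ['c.s.t', 'c.s', 'c'] (the object, then each parent)."""
    parts = full_name.split(".")
    return [".".join(parts[:i]) for i in range(len(parts), 0, -1)]


def has_privilege(grants: Grants, principal: str, securable: str, privilege: str) -> bool:
    """True if the privilege is granted on the securable or any parent."""
    return any(
        privilege in grants.get((level, principal), set())
        for level in ancestors(securable)
    )


grants: Grants = {
    ("prod", "data_analysts"): {"SELECT"},            # catalog-level grant
    ("dev.bronze", "data_engineers"): {"MODIFY"},     # schema-level grant
}
print(has_privilege(grants, "data_analysts", "prod.gold.daily_revenue", "SELECT"))
```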

### GRANT/REVOKE Examples:

In [0]:
# GRANT access to customers_rls
spark.sql(
    f"""
    GRANT SELECT ON VIEW {CATALOG}.{GOLD_SCHEMA}.customers_rls TO `account users`
    """
)

**Granting permissions to orders_hashed**

In [0]:
# GRANT access to orders_hashed
spark.sql(f"""
  GRANT SELECT ON VIEW {CATALOG}.{GOLD_SCHEMA}.orders_hashed TO `account users`
""")

**Revoking access to base tables (Enforcement):**

In [0]:
# Revoke direct access to base table
spark.sql(f"""
    REVOKE SELECT ON TABLE {CATALOG}.{BRONZE_SCHEMA}.orders FROM `account users`
""")

**RLS Views - Access control setup**

Security pattern:
1. **GRANT SELECT** on RLS Views for `account users`
2. **REVOKE SELECT** on base tables (force Views usage)
3. **Automatic filtering** based on group membership

Users can SELECT from Views, but not from base tables - enforcing RLS.
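The pattern above pairs one GRANT with one REVOKE, so the two statements can be generated together to keep them in sync. The sketch below only builds the SQL strings; `rls_enforcement_statements` is a hypothetical helper and the object names in the example are placeholders.

```python
from typing import List


def rls_enforcement_statements(view: str, base_table: str, principal: str) -> List[str]:
    """Build the GRANT-on-view / REVOKE-on-table pair for one principal."""
    quoted = f"`{principal}`"  # backticks allow spaces in the principal name
    return [
        f"GRANT SELECT ON VIEW {view} TO {quoted}",
        f"REVOKE SELECT ON TABLE {base_table} FROM {quoted}",
    ]


for stmt in rls_enforcement_statements(
    "main.gold.customers_rls", "main.bronze.customers", "account users"
):
    print(stmt)
```

In a notebook, each string would be passed to `spark.sql(stmt)`.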

---

## Data Lineage and Audit Logging

### Querying Data Lineage:

Unity Catalog automatically tracks lineage for:
- Table → Table (ETL transformations)
- Notebook → Table (data writes)
- Dashboard → Table (BI queries)
- ML Model → Table (training data)

**General Table Lineage**

In [0]:
# Query table lineage from system tables
lineage_df = spark.sql(f"""
  SELECT 
    source_table_full_name,
    source_type,
    target_table_full_name,
    target_type,
    event_date,
    created_by
  FROM system.access.table_lineage
  WHERE target_table_full_name LIKE '{CATALOG}.%'
  ORDER BY event_date DESC
  LIMIT 50
""")

display(lineage_df)

**Lineage for tables in catalog displayed**

The system automatically tracks lineage for:
- **Table → Table**: ETL transformations
- **Notebook → Table**: Data writes
- **Dashboard → Table**: BI queries
- **ML Model → Table**: Training data

Lineage is available through `system.access.table_lineage` without additional instrumentation.

**1. Upstream Lineage (Sources)**

In [0]:
# Find upstream dependencies (sources) for a table
upstream_df = spark.sql(f"""
    SELECT DISTINCT
        source_table_full_name,
        source_type
    FROM system.access.table_lineage
    WHERE target_table_full_name = '{CATALOG}.{SILVER_SCHEMA}.customer_order_summary'
""")

display(upstream_df)

**Upstream: Source tables for customer_order_summary**

Shows all tables used as data sources in the `customer_order_summary` View. Helpful for impact analysis when making changes to upstream tables.

**2. Downstream Lineage (Consumers)**

In [0]:
# Find downstream dependencies (consumers) of a table
downstream_df = spark.sql(f"""
    SELECT DISTINCT
        target_table_full_name,
        target_type
    FROM system.access.table_lineage
    WHERE source_table_full_name = '{CATALOG}.{BRONZE_SCHEMA}.customers'
""")

display(downstream_df)

**Downstream: Views/Tables consuming customers**

Shows all Views and tables that consume data from the `customers` table. Critical for understanding impact of changes and data governance.
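The queries above return only direct neighbours. For impact analysis it is often useful to follow the edges transitively, which can be sketched as a breadth-first traversal over (source, target) pairs such as those collected from `system.access.table_lineage`. `downstream_closure` is a hypothetical helper and the edge list is made-up sample data.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set, Tuple


def downstream_closure(edges: Iterable[Tuple[str, str]], start: str) -> Set[str]:
    """Return every table reachable from `start` by following lineage edges."""
    children: Dict[str, Set[str]] = defaultdict(set)
    for source, target in edges:
        children[source].add(target)

    seen: Set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in children.get(node, ()):
            # Skip the start node and anything already visited (handles cycles).
            if child != start and child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Made-up lineage edges: (source_table_full_name, target_table_full_name)
edges = [
    ("c.bronze.customers", "c.silver.customer_order_summary"),
    ("c.bronze.orders", "c.silver.customer_order_summary"),
    ("c.silver.customer_order_summary", "c.gold.fact_sales"),
]
print(sorted(downstream_closure(edges, "c.bronze.customers")))
```

Swapping source and target in each pair yields the transitive upstream set with the same function.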

**3. Column-Level Lineage**

In [0]:
# Column-level lineage (if available)
column_lineage = spark.sql(f"""
    SELECT 
        source_table_full_name,
        source_column_name,
        target_table_full_name,
        target_column_name,
        event_date
    FROM system.access.column_lineage
    WHERE target_table_full_name = '{CATALOG}.{SILVER_SCHEMA}.customer_order_summary'
    ORDER BY target_column_name
""")

display(column_lineage)

**Column-level lineage for customer_order_summary**

Unity Catalog tracks lineage at column level - which columns in source tables affect which columns in the target table. Detailed information for data governance and impact analysis.

### Audit Logging:

Unity Catalog logs all access and operations:

**1. General Audit Logs**

In [0]:
# Query audit logs
audit_df = spark.sql("""
    SELECT 
        event_time,
        user_identity.email as user_email,
        service_name,
        action_name,
        request_params.full_name_arg as table_name,
        response.status_code,
        request_id
    FROM system.access.audit
    WHERE action_name IN ('getTable', 'createTable', 'deleteTable', 'updateTable')
        AND event_date >= current_date() - INTERVAL 7 DAYS
    ORDER BY event_time DESC
    LIMIT 100
""")
audit_df.display()

**2. Sensitive Data Access**

In [0]:
# Track who accessed sensitive tables
sensitive_access = spark.sql(f"""
    SELECT 
        event_time,
        user_identity.email as user,
        action_name,
        request_params.full_name_arg as table_accessed,
        source_ip_address
    FROM system.access.audit
    WHERE request_params.full_name_arg LIKE '{CATALOG}.%.customers%'
        AND action_name = 'getTable'
        AND event_date >= current_date() - INTERVAL 7 DAYS
    ORDER BY event_time DESC
    LIMIT 100
""")

display(sensitive_access)

**Audit logs: Access to customers table (last 7 days)**

Monitoring access to sensitive tables with PII data:
- **Who**: User email
- **When**: Event time 
- **What**: Table name
- **From where**: Source IP address

Critical for compliance (GDPR, HIPAA) and security monitoring.
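Once the audit rows are collected, a common next step is to summarize them, for example to see which users read the sensitive table most often. The sketch below works on plain dictionaries whose keys follow the column aliases used in the query above (`user`, `table_accessed`). The rows are made-up sample data and `accesses_per_user` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def accesses_per_user(rows: Iterable[Dict[str, str]], top_n: int = 3) -> List[Tuple[str, int]]:
    """Count audit events per user and return the most active users first."""
    counts = Counter(row["user"] for row in rows)
    return counts.most_common(top_n)


# Made-up sample rows shaped like the query result
rows = [
    {"user": "ana@example.com", "table_accessed": "c.bronze.customers"},
    {"user": "ben@example.com", "table_accessed": "c.bronze.customers"},
    {"user": "ana@example.com", "table_accessed": "c.gold.customers_masked"},
]
print(accesses_per_user(rows))
```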

**3. Privilege Changes**

In [0]:
# Grant/Revoke audit trail
grant_audit = spark.sql("""
    SELECT 
        event_time,
        user_identity.email as admin_user,
        action_name,
        request_params.privilege as privilege_granted,
        request_params.securable_full_name as object_name,
        request_params.principal as grantee
    FROM system.access.audit
    WHERE action_name IN ('grantPrivilege', 'revokePrivilege')
        AND event_date >= current_date() - INTERVAL 30 DAYS
    ORDER BY event_time DESC
""")

display(grant_audit)

**Audit trail of privilege changes**

Complete audit trail of permission changes:
- **Admin user**: Who executed GRANT/REVOKE
- **Action**: grantPrivilege or revokePrivilege
- **Privilege**: Which permission (SELECT, MODIFY, etc.)
- **Object**: On which object (table, schema, catalog)
- **Grantee**: To whom permissions were granted/revoked

Essential for governance and compliance audits.

---

## Delta Sharing

**Delta Sharing** = Secure data sharing protocol (cross-org, cross-cloud)

### Components:
- **Share**: collection of tables to share
- **Recipient**: organization/user receiving data
- **Provider**: data owner (you)

### Create Share:

In [0]:
# Creating Share for external partners
share_name = f"{CATALOG}_partner_share"

spark.sql(f"""
  CREATE SHARE IF NOT EXISTS {share_name}
  COMMENT 'Data sharing for business partners'
""")

**Share `{CATALOG}_partner_share` created**

Delta Sharing Share is a collection of tables for secure sharing with external partners:
- **Cross-org**: Between different Databricks organizations
- **Cross-cloud**: AWS ↔ Azure ↔ GCP
- **Open protocol**: Open-source standard

In [0]:
# Add table to Share (Gold layer only - aggregated data)
spark.sql(f"""
  ALTER SHARE {share_name}
  ADD TABLE {CATALOG}.{GOLD_SCHEMA}.fact_sales
""")

In [0]:
spark.sql(f"""
  ALTER SHARE {share_name}
  ADD SCHEMA {CATALOG}.{SILVER_SCHEMA}
""")

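**Table fact_sales and the Silver schema added to Share**

The `ADD SCHEMA` cell above shares the entire Silver schema, which shows the syntax but goes against the following guidance. Best practice: Share only the Gold layer (aggregated data):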
- **Security**: No access to raw data
- **Privacy**: Aggregations hide individual records
- **Stability**: Gold layer has stable schema and structure

**Tables in Share verified**

Share currently contains the added tables and can be shared with recipients. Recipients will receive an activation link to consume shared data via Delta Sharing protocol.

In [0]:
# Verify Share contents
spark.sql(f"SHOW ALL IN SHARE {share_name}").display()

### Best practices for Delta Sharing:

1. **Share only aggregated/gold data**: don't share raw/bronze layers
2. **Use views for masking**: create view with masked PII before sharing
3. **Monitor access**: track who accesses shared data
4. **Version control**: use table versions for stable APIs
5. **Documentation**: clear documentation for recipients
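The first rule can be checked mechanically before objects are added to a share. The sketch below flags any fully qualified table name whose schema is not the allowed one. `tables_outside_schema` is a hypothetical helper, and the schema name `gold` and the table names are assumptions for the example.

```python
from typing import Iterable, List


def tables_outside_schema(full_names: Iterable[str], allowed_schema: str) -> List[str]:
    """Return the names that are not catalog.<allowed_schema>.table."""
    offenders = []
    for full_name in full_names:
        parts = full_name.split(".")
        # Anything that is not a three-part name in the allowed schema is flagged.
        if len(parts) != 3 or parts[1] != allowed_schema:
            offenders.append(full_name)
    return offenders


candidates = ["c.gold.fact_sales", "c.silver.customers_masked", "c.bronze.orders"]
print(tables_outside_schema(candidates, allowed_schema="gold"))
```

An empty result means every candidate sits in the allowed schema; a non-empty one lists what to review before running `ALTER SHARE`.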

---

## Summary

### You learned:

- **Unity Catalog Architecture**: Metastore → Catalog → Schema → Tables
- **Access Control**: GRANT/REVOKE privileges at multiple levels
- **Data Masking**: Column-level masking with dynamic views
- **Row-Level Security**: Filter data based on user identity
- **Data Lineage**: Track data flow through system tables
- **Audit Logging**: Monitor who accessed what and when
- **Delta Sharing**: Secure cross-organization data sharing

### Key Takeaways:

1. **Unified Governance**: Single platform for all data assets
2. **Fine-grained Control**: Table, column, row-level security
3. **Automatic Lineage**: No extra instrumentation needed
4. **Compliance-ready**: Audit logs for regulatory requirements
5. **Secure Sharing**: Delta Sharing for external collaboration

## Clean up resources

In [0]:
# spark.sql(f"DROP SCHEMA IF EXISTS {CATALOG}.{USER_SCHEMA} CASCADE")
display(spark.createDataFrame([("Status", "Resources kept for further exercises")], ["Info", "Value"]))