# Unity Catalog Governance - Demo

**Training objective:** Learn to use Unity Catalog as the governance platform for the Databricks Lakehouse: access management, data masking, lineage, and audit logging

**Topics covered:**
- Unity Catalog Architecture: Metastore, Catalog, Schema, Tables/Views/Volumes
- Access management: GRANT/REVOKE privileges
- Data Masking and Row-Level Security
- Data Lineage and Audit Logging
- Delta Sharing - secure data sharing
- Best Practices for Data Governance

---

## Context and requirements

- **Training day**: Day 3 - Transformation, Governance & Integrations
- **Notebook type**: Demo
- **Technical requirements**:
  - Databricks Runtime 13.0+ (recommended: 14.3 LTS)
  - Unity Catalog enabled (required!)
  - Privileges: CREATE CATALOG, CREATE SCHEMA, GRANT/REVOKE
  - Cluster: Standard with at least 2 workers
- **Duration**: 45 minutes
- **Prerequisites**: 03_databricks_jobs_orchestration.ipynb

## Theoretical introduction

**Section goal:** Understand Unity Catalog as a unified governance platform for the data lakehouse

**Core concepts:**
- **Unity Catalog**: Unified governance solution for all data assets
- **Metastore**: Region-level, top-level container for catalogs
- **Three-level namespace**: catalog.schema.table
- **Securable objects**: Tables, Views, Functions, Volumes, Models
- **Fine-grained access control**: Table, column, row-level security
- **Automatic lineage**: End-to-end data flow tracking without manual instrumentation

**Unity Catalog object hierarchy:**
```
Metastore (region-level)
    ↓
Catalog (domain/environment)
    ↓
Schema (namespace/layer)
    ↓
Securable Objects:
    - Tables / Views (data)
    - Functions (UDF, stored procedures)
    - Volumes (file storage)
    - Models (ML models)
```

**Key features:**
- **Unified governance**: One platform for data, ML, and BI
- **ACID transactions**: Transactional guarantees for tables registered in the catalog (provided by Delta Lake at the table level)
- **Audit logging**: Who accessed what and when
- **Data discovery**: Metadata search and tagging
- **Delta Sharing**: Secure cross-organization sharing

**Why does it matter?**
Unity Catalog addresses fundamental governance problems in a data lake:
- No central access control
- Difficulty tracking lineage
- No audit trail of data access
- Compliance problems (GDPR, HIPAA)
- Data silos between teams

Unity Catalog provides enterprise-grade governance while preserving the flexibility of the data lakehouse.

## Per-user isolation

Run the initialization script that sets up per-user catalogs and schemas:

In [0]:
%run ../00_setup

## Configuration

Import libraries and display the user context:

In [0]:
from pyspark.sql import functions as F
from pyspark.sql.types import *

# Display the user context (variables defined in 00_setup)
print("=" * 80)
print("UNITY CATALOG GOVERNANCE - USER CONTEXT")
print("=" * 80)
print(f"Catalog: {CATALOG}")
print(f"Schema Bronze: {BRONZE_SCHEMA}")
print(f"Schema Silver: {SILVER_SCHEMA}")
print(f"Schema Gold: {GOLD_SCHEMA}")
print(f"User: {raw_user}")
print(f"Dataset path: {DATASET_BASE_PATH}")
print("=" * 80)

# Set the default catalog and schema
spark.sql(f"USE CATALOG {CATALOG}")
spark.sql(f"USE SCHEMA {SILVER_SCHEMA}")

print(f"\n✓ Active catalog: {CATALOG}")
print(f"✓ Active schema: {SILVER_SCHEMA}")

## 2.1 Preparing Data from the Dataset

Before moving on to access management, we load real data from the dataset/ directory; the Unity Catalog examples below use it.

In [0]:
# Load customers from the dataset
# Spark APIs expect the dbfs:/ scheme; the /dbfs/ prefix is only for local file APIs
customers_path = "dbfs:/FileStore/dataset/customers/customers.csv"

customers_df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv(customers_path)

print(f"✓ Loaded {customers_df.count()} customers")
customers_df.printSchema()
display(customers_df.limit(5))

In [0]:
# Load orders from the dataset (dbfs:/ scheme for Spark APIs)
orders_path = "dbfs:/FileStore/dataset/orders/orders_batch.json"

orders_df = spark.read \
    .option("multiline", "true") \
    .json(orders_path)

print(f"✓ Loaded {orders_df.count()} orders")
orders_df.printSchema()
display(orders_df.limit(5))

In [0]:
# Load products from the dataset (dbfs:/ scheme for Spark APIs)
products_path = "dbfs:/FileStore/dataset/products/products.parquet"

products_df = spark.read.parquet(products_path)

print(f"✓ Loaded {products_df.count()} products")
products_df.printSchema()
display(products_df.limit(5))

## 1️⃣ Unity Catalog Architecture

**Unity Catalog** is the unified governance solution for the Databricks Lakehouse.

### Object hierarchy:

```
Metastore (region-level)
    ↓
Catalog (database/domain)
    ↓
Schema (namespace)
    ↓
Securable Objects:
    - Tables / Views
    - Functions (UDF, stored procedures)
    - Volumes (file storage)
    - Models (ML models)
```

### Three-level namespace:
```sql
catalog.schema.table
```

Example:
```sql
main.sales.orders
dev.analytics.customer_metrics
prod.gold.daily_revenue
```
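
A three-level name can be split and re-assembled programmatically, which helps when the catalog or schema comes from a variable such as `CATALOG`. The sketch below is plain Python with hypothetical helper names (`split_full_name`, `quote_full_name`); it backtick-quotes each part so that names containing characters such as hyphens can be embedded in SQL. It is a simplification and does not handle names that themselves contain dots or backticks.

```python
from typing import NamedTuple


class FullName(NamedTuple):
    catalog: str
    schema: str
    table: str


def split_full_name(full_name: str) -> FullName:
    """Split 'catalog.schema.table' into its three parts."""
    parts = full_name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected catalog.schema.table, got {full_name!r}")
    return FullName(*parts)


def quote_full_name(name: FullName) -> str:
    """Backtick-quote each part so it can be embedded in a SQL statement."""
    return ".".join(f"`{part}`" for part in name)


name = split_full_name("prod.gold.daily_revenue")
print(name.schema)            # gold
print(quote_full_name(name))  # `prod`.`gold`.`daily_revenue`
```

A two-part name such as `sales.orders` raises `ValueError`, which makes a missing catalog visible early instead of silently resolving against the session default.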

### Key features:
- **Unified governance**: one platform for data, ML, and BI
- **Fine-grained access control**: table, column, row level
- **Automatic lineage**: end-to-end data flow tracking
- **Audit logging**: who accessed what and when
- **Data discovery**: metadata search and tagging

---

## 📋 Setup and Basic Operations

### Creating Catalogs and Schemas:

In [0]:
# Create Catalog
spark.sql(f"""
    CREATE CATALOG IF NOT EXISTS {CATALOG}
    COMMENT 'KION catalog for training data'
""")

print(f"✓ Catalog '{CATALOG}' created/verified")

# List catalogs
spark.sql("SHOW CATALOGS").display()

In [0]:
# Create Schemas within catalog
spark.sql(f"""
  CREATE SCHEMA IF NOT EXISTS {CATALOG}.{BRONZE_SCHEMA}
  COMMENT 'Bronze layer - raw data'
""")

spark.sql(f"""
  CREATE SCHEMA IF NOT EXISTS {CATALOG}.{SILVER_SCHEMA}
  COMMENT 'Silver layer - cleaned data'
""")

spark.sql(f"""
  CREATE SCHEMA IF NOT EXISTS {CATALOG}.{GOLD_SCHEMA}
  COMMENT 'Gold layer - business data'
""")

print(f"✓ Bronze, Silver and Gold schemas created in catalog '{CATALOG}'")

In [0]:
# Set default catalog and schema
spark.sql(f"USE CATALOG {CATALOG}")
spark.sql(f"USE SCHEMA {SILVER_SCHEMA}")

print(f"✓ Active catalog: {CATALOG}")
print(f"✓ Active schema: {SILVER_SCHEMA}")

# Verify the created schemas
schemas = spark.sql(f"SHOW SCHEMAS IN {CATALOG}").select("databaseName").collect()
schema_names = [row.databaseName for row in schemas]

print("\n✓ Schemas in the catalog:")
for schema_name in schema_names:
    print(f"  - {schema_name}")

### Creating Tables in Unity Catalog:

In [0]:
# Save the customers table in the Bronze layer
customers_df.write \
    .format("delta") \
    .mode("overwrite") \
    .option("mergeSchema", "true") \
    .saveAsTable(f"{CATALOG}.{BRONZE_SCHEMA}.customers")

print(f"✓ Table customers saved in {CATALOG}.{BRONZE_SCHEMA}")

# Verification (index by name: Row.count would resolve to the tuple method, not the column)
result = spark.sql(f"SELECT COUNT(*) AS row_count FROM {CATALOG}.{BRONZE_SCHEMA}.customers").collect()[0]
print(f"✓ Number of records: {result['row_count']}")

In [0]:
# Save the orders table in the Bronze layer first, so the statements below have a target
orders_table = f"{CATALOG}.{BRONZE_SCHEMA}.orders"

orders_df.write \
    .format("delta") \
    .mode("overwrite") \
    .option("mergeSchema", "true") \
    .saveAsTable(orders_table)

print(f"✓ Table orders saved in {CATALOG}.{BRONZE_SCHEMA}")

# Verification (index by name: Row.count would resolve to the tuple method, not the column)
result = spark.sql(f"SELECT COUNT(*) AS row_count FROM {orders_table}").collect()[0]
print(f"✓ Number of records: {result['row_count']}")

# Add table properties and comments
# 'owner' is a reserved property key in Spark SQL, so the team is stored under 'owner_team'
spark.sql(f"""
    ALTER TABLE {orders_table}
    SET TBLPROPERTIES (
        'delta.enableChangeDataFeed' = 'true',
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact' = 'true',
        'owner_team' = 'data-engineering-team',
        'department' = 'analytics',
        'pii_data' = 'true'
    )
""")

spark.sql(f"""
    COMMENT ON TABLE {orders_table} IS
    'Orders table loaded from the training dataset'
""")

# ALTER COLUMN ... COMMENT is used here for the column-level comment
spark.sql(f"""
    ALTER TABLE {orders_table} ALTER COLUMN customer_id
    COMMENT 'Customer identifier - PII data, access restricted'
""")

In [0]:
# Save the products table in the Bronze layer
products_df.write \
    .format("delta") \
    .mode("overwrite") \
    .option("mergeSchema", "true") \
    .saveAsTable(f"{CATALOG}.{BRONZE_SCHEMA}.products")

print(f"✓ Table products saved in {CATALOG}.{BRONZE_SCHEMA}")

# Verification (index by name: Row.count would resolve to the tuple method, not the column)
result = spark.sql(f"SELECT COUNT(*) AS row_count FROM {CATALOG}.{BRONZE_SCHEMA}.products").collect()[0]
print(f"✓ Number of records: {result['row_count']}")

## 4. Unity Catalog Volumes

**Volumes** are governed locations for storing files (non-tabular data) in Unity Catalog:
- **Managed Volumes**: Databricks manages the lifecycle of the files
- **External Volumes**: point to external storage locations

**Use cases**:
- Storing ML model files and checkpoints
- Staging area for data before ingestion
- Archive of documents, logs, and reports
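
Files in a Volume are addressed through a path of the form `/Volumes/<catalog>/<schema>/<volume>/...`, as used in the cells below. As a small sketch, the hypothetical helper `volume_path` assembles such a path and rejects empty names or names containing a slash; the catalog, schema, and volume names in the example are placeholders.

```python
def volume_path(catalog: str, schema: str, volume: str, *parts: str) -> str:
    """Build a /Volumes/<catalog>/<schema>/<volume>/... path."""
    for name in (catalog, schema, volume):
        if not name or "/" in name:
            raise ValueError(f"invalid name: {name!r}")
    # Strip stray slashes from the optional sub-path segments before joining
    cleaned = [p.strip("/") for p in parts if p.strip("/")]
    return "/".join(["/Volumes", catalog, schema, volume, *cleaned])


print(volume_path("demo", "bronze", "files", "customers_export"))
# /Volumes/demo/bronze/files/customers_export
```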

In [0]:
# Create a Managed Volume
volume_name = "files"

spark.sql(f"""
  CREATE VOLUME IF NOT EXISTS {CATALOG}.{BRONZE_SCHEMA}.{volume_name}
  COMMENT 'Managed volume for staging files'
""")

print(f"✓ Volume '{volume_name}' created in {CATALOG}.{BRONZE_SCHEMA}")

In [0]:
# Example: writing data to a Volume
volume_path = f"/Volumes/{CATALOG}/{BRONZE_SCHEMA}/{volume_name}"

# Export customers to CSV in the Volume
customers_df.coalesce(1).write \
    .mode("overwrite") \
    .option("header", "true") \
    .csv(f"{volume_path}/customers_export")

print(f"✓ Customers data exported to Volume: {volume_path}/customers_export")

In [0]:
# Verify the files in the Volume
dbutils.fs.ls(f"{volume_path}/customers_export")

## 5. Unity Catalog Functions (UDF)

**Functions** in Unity Catalog let you:
- Create reusable SQL/Python functions
- Manage business logic centrally
- Control access with GRANT/REVOKE
- Track lineage for functions

**Function types**:
- **Scalar Functions**: return a single value
- **Table Functions**: return a table
- **SQL Functions**: written in SQL
- **Python Functions**: written in Python (UDF)
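
Before registering logic as a Unity Catalog function, it can help to check the rule itself in plain Python. The sketch below mirrors the two functions created in the following cells (`mask_customer_id` and `categorize_price`); the sample values are made up for illustration, and the registered functions remain the source of truth.

```python
def mask_customer_id(customer_id: int) -> str:
    """Keep the last three characters of the ID and mask the rest."""
    return "****" + str(customer_id)[-3:]


def categorize_price(price: float) -> str:
    """Low below 50, Medium from 50 up to (not including) 200, High otherwise."""
    if price < 50:
        return "Low"
    elif price < 200:
        return "Medium"
    return "High"


print(mask_customer_id(1234567))   # ****567
print(categorize_price(49.99))     # Low
print(categorize_price(50))        # Medium
print(categorize_price(200))       # High
```

Checking the boundary values (50 and 200) locally makes it clear which category each threshold falls into before the function is shared with other teams.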

In [0]:
# Example 1: SQL Function - masking customer_id
spark.sql(f"""
  CREATE OR REPLACE FUNCTION {CATALOG}.{SILVER_SCHEMA}.mask_customer_id(customer_id INT)
  RETURNS STRING
  LANGUAGE SQL
  COMMENT 'Masks customer_id, showing only the last 3 digits'
  RETURN CONCAT('****', SUBSTRING(CAST(customer_id AS STRING), -3))
""")

print(f"✓ Function mask_customer_id created in {CATALOG}.{SILVER_SCHEMA}")

In [0]:
# Test the mask_customer_id function
result_df = spark.sql(f"""
  SELECT 
    customer_id,
    {CATALOG}.{SILVER_SCHEMA}.mask_customer_id(customer_id) as masked_id,
    first_name,
    last_name
  FROM {CATALOG}.{BRONZE_SCHEMA}.customers
  LIMIT 5
""")

display(result_df)

In [0]:
# Example 2: Python UDF - price categorization
spark.sql(f"""
  CREATE OR REPLACE FUNCTION {CATALOG}.{SILVER_SCHEMA}.categorize_price(price DOUBLE)
  RETURNS STRING
  LANGUAGE PYTHON
  COMMENT 'Categorizes prices: Low, Medium, High'
  AS $$
    if price < 50:
        return "Low"
    elif price < 200:
        return "Medium"
    else:
        return "High"
  $$
""")

print(f"✓ Function categorize_price created in {CATALOG}.{SILVER_SCHEMA}")

In [0]:
# Test the categorize_price function
result_df = spark.sql(f"""
  SELECT 
    product_name,
    price,
    {CATALOG}.{SILVER_SCHEMA}.categorize_price(price) as price_category
  FROM {CATALOG}.{BRONZE_SCHEMA}.products
  ORDER BY price
  LIMIT 10
""")

display(result_df)

In [0]:
# Describe the orders table created above
spark.sql(f"DESCRIBE EXTENDED {CATALOG}.{BRONZE_SCHEMA}.orders").display()

# Create a View in the Silver layer - order aggregation per customer
spark.sql(f"""
  CREATE OR REPLACE VIEW {CATALOG}.{SILVER_SCHEMA}.customer_order_summary AS
  SELECT
    c.customer_id,
    c.first_name,
    c.last_name,
    c.country,
    COUNT(o.order_id) as total_orders,
    SUM(o.total_amount) as total_spent,
    AVG(o.total_amount) as avg_order_value,
    MAX(o.order_datetime) as last_order_date
  FROM {CATALOG}.{BRONZE_SCHEMA}.customers c
  LEFT JOIN {CATALOG}.{BRONZE_SCHEMA}.orders o
    ON c.customer_id = o.customer_id
  GROUP BY c.customer_id, c.first_name, c.last_name, c.country
""")

print(f"✓ View customer_order_summary created in {CATALOG}.{SILVER_SCHEMA}")

---

## 2️⃣ Access management: GRANT / REVOKE

### Privilege hierarchy in Unity Catalog:

**Privilege levels**:
1. **Metastore-level**: CREATE CATALOG, CREATE SHARE, CREATE RECIPIENT
2. **Catalog-level**: USE CATALOG, CREATE SCHEMA
3. **Schema-level**: USE SCHEMA, CREATE TABLE, CREATE FUNCTION, CREATE VOLUME
4. **Object-level**: SELECT, MODIFY (INSERT/UPDATE/DELETE/MERGE), EXECUTE

**Securable Objects - Inheritance**:
- Privileges are inherited down the hierarchy
- GRANT on a Catalog → applies to all of its Schemas and Tables
- GRANT on a Schema → applies to all Tables in that Schema
- Privileges can also be granted at a specific level for fine-grained control
- Reading a table requires USE CATALOG and USE SCHEMA on its parents in addition to SELECT
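
The inheritance rules can be pictured with a toy model. The sketch below is plain Python, not Databricks code: the `Securable` class and `can_select` function are hypothetical names, and the model only captures two ideas, that a privilege granted on a parent applies to its children, and that reading a table also needs USE SCHEMA and USE CATALOG on the parents.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Securable:
    name: str
    parent: Optional["Securable"] = None
    grants: dict = field(default_factory=dict)  # principal -> set of privileges

    def grant(self, principal: str, privilege: str) -> None:
        self.grants.setdefault(principal, set()).add(privilege)

    def has(self, principal: str, privilege: str) -> bool:
        """True if the privilege is granted here or on any ancestor."""
        node = self
        while node is not None:
            if privilege in node.grants.get(principal, set()):
                return True
            node = node.parent
        return False


def can_select(table: Securable, principal: str) -> bool:
    """SELECT on a table also needs USE SCHEMA and USE CATALOG on its parents."""
    schema = table.parent
    catalog = schema.parent
    return (
        table.has(principal, "SELECT")
        and schema.has(principal, "USE SCHEMA")
        and catalog.has(principal, "USE CATALOG")
    )


catalog = Securable("prod")
gold = Securable("gold", parent=catalog)
revenue = Securable("daily_revenue", parent=gold)

gold.grant("data-analysts", "SELECT")            # inherited by tables in gold
print(can_select(revenue, "data-analysts"))      # False: usage privileges missing
catalog.grant("data-analysts", "USE CATALOG")
gold.grant("data-analysts", "USE SCHEMA")
print(can_select(revenue, "data-analysts"))      # True
```

The first check fails even though SELECT is inherited from the schema, which is the same situation as "Table or view not found" in the Troubleshooting section.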

### GRANT/REVOKE examples:

In [0]:
# 1. Grant catalog access to data analysts
# (the groups used in these cells are assumed to exist already)
spark.sql(f"""
    GRANT USE CATALOG ON CATALOG {CATALOG} TO `data-analysts`
""")

# 2. GRANT USE SCHEMA on Gold
spark.sql(f"""
    GRANT USE SCHEMA ON SCHEMA {CATALOG}.{GOLD_SCHEMA} TO `data-analysts`
""")

# 3. GRANT SELECT on all tables in Gold
spark.sql(f"""
    GRANT SELECT ON SCHEMA {CATALOG}.{GOLD_SCHEMA} TO `data-analysts`
""")

print(f"✓ Group 'data-analysts' has SELECT on {CATALOG}.{GOLD_SCHEMA}")

In [0]:
# 4. Grant full access to data engineers
spark.sql(f"""
    GRANT USE CATALOG, CREATE SCHEMA ON CATALOG {CATALOG} TO `data-engineers`
""")

for schema in (BRONZE_SCHEMA, SILVER_SCHEMA, GOLD_SCHEMA):
    spark.sql(f"""
        GRANT ALL PRIVILEGES ON SCHEMA {CATALOG}.{schema} TO `data-engineers`
    """)

print("✓ Group 'data-engineers' has ALL PRIVILEGES on Bronze/Silver/Gold")

In [0]:
# Grant specific table access
# (daily_sales and customer_metrics are illustrative names; they are not created in this notebook)
spark.sql(f"""
    GRANT SELECT ON TABLE {CATALOG}.{GOLD_SCHEMA}.daily_sales TO `finance-team`
""")

spark.sql(f"""
    GRANT SELECT ON TABLE {CATALOG}.{GOLD_SCHEMA}.customer_metrics TO `marketing-team`
""")

print("✅ Granted table-specific access")

In [0]:
# 5. GRANT EXECUTE on Functions
spark.sql(f"""
  GRANT EXECUTE ON FUNCTION {CATALOG}.{SILVER_SCHEMA}.mask_customer_id TO `data-analysts`
""")

spark.sql(f"""
  GRANT EXECUTE ON FUNCTION {CATALOG}.{SILVER_SCHEMA}.categorize_price TO `data-analysts`
""")

print("✓ Group 'data-analysts' has EXECUTE on the functions")

In [0]:
# Show grants on object
spark.sql(f"""
    SHOW GRANTS ON TABLE {CATALOG}.{BRONZE_SCHEMA}.customers
""").display()

print("✓ Privileges on the customers table")

### Ownership and transfer:
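
Each securable has a single owner, and ownership can be handed over with an `ALTER ... OWNER TO` statement. As a minimal sketch, the hypothetical helper below only builds the statement text for a table, schema, or catalog; the object and group names in the example are placeholders, and the resulting string would be passed to `spark.sql(...)` in a notebook.

```python
ALLOWED_TYPES = {"TABLE", "SCHEMA", "CATALOG"}


def transfer_ownership_sql(securable_type: str, full_name: str, new_owner: str) -> str:
    """Build an ALTER ... OWNER TO statement; the principal is backtick-quoted."""
    kind = securable_type.upper()
    if kind not in ALLOWED_TYPES:
        raise ValueError(f"unsupported securable type: {securable_type}")
    return f"ALTER {kind} {full_name} OWNER TO `{new_owner}`"


stmt = transfer_ownership_sql("table", "prod.gold.daily_revenue", "data-engineers")
print(stmt)  # ALTER TABLE prod.gold.daily_revenue OWNER TO `data-engineers`
# In a notebook: spark.sql(stmt)
```

Assigning ownership to a group rather than an individual keeps the object manageable when people change teams.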

---

## 3️⃣ Data Masking and Row-Level Security

### Column-level masking (Dynamic Views):

Use the `current_user()` and `is_account_group_member()` functions for conditional masking:
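
The effect of a dynamic view is easiest to see by evaluating the same rule for two users. The sketch below is a plain-Python stand-in for the `CASE WHEN is_account_group_member(...)` expression used in the next cell; `mask_name`, the sample row, and the group sets are made up for illustration, and the real membership check is performed by Databricks at query time.

```python
def mask_name(value: str, user_groups: set, pii_group: str = "pii-access-group") -> str:
    """Return the full value for PII readers, otherwise first letter plus '***'."""
    if pii_group in user_groups:
        return value
    return value[:1] + "***"


row = {"first_name": "Anna", "last_name": "Kowalska"}

analyst_groups = {"data-analysts"}
steward_groups = {"data-analysts", "pii-access-group"}

print([mask_name(v, analyst_groups) for v in row.values()])  # ['A***', 'K***']
print([mask_name(v, steward_groups) for v in row.values()])  # ['Anna', 'Kowalska']
```

Both users query the same view; only the group membership of the caller changes what is returned.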

In [0]:
# Create masked view for PII data
spark.sql(f"""
  CREATE OR REPLACE VIEW {CATALOG}.{GOLD_SCHEMA}.customers_masked AS
  SELECT 
    customer_id,
    CASE 
      WHEN is_account_group_member('pii-access-group') THEN first_name
      ELSE CONCAT(LEFT(first_name, 1), '***')
    END as first_name,
    CASE 
      WHEN is_account_group_member('pii-access-group') THEN last_name
      ELSE CONCAT(LEFT(last_name, 1), '***')
    END as last_name,
    city,
    country,
    registration_date
  FROM {CATALOG}.{BRONZE_SCHEMA}.customers
""")

print(f"✓ View customers_masked created in {CATALOG}.{GOLD_SCHEMA}")
print("  - pii-access-group: sees the full data")
print("  - Other groups: see masked first and last names")

In [0]:
# Test the View with masking
result_df = spark.sql(f"""
  SELECT * FROM {CATALOG}.{GOLD_SCHEMA}.customers_masked LIMIT 10
""")

display(result_df)
print("✓ Masked data (first and last names are masked for users outside pii-access-group)")

In [0]:
# Alternatively: hash sensitive identifiers
spark.sql(f"""
  CREATE OR REPLACE VIEW {CATALOG}.{GOLD_SCHEMA}.orders_hashed AS
  SELECT
    order_id,
    SHA2(CAST(customer_id AS STRING), 256) as customer_id_hash,
    product_id,
    quantity,
    total_amount,
    order_datetime,
    status
  FROM {CATALOG}.{BRONZE_SCHEMA}.orders
""")

print("✓ View orders_hashed created - customer_id is hashed")
print("  - Analysts can aggregate without seeing the raw customer_id")

### Row-Level Security (RLS):

Restrict which rows users can see based on their identity or group membership:
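
One detail of a `CASE`-based row filter is that branches are evaluated in order and the first match decides. The plain-Python sketch below mirrors the country filter defined in the next cell; the function names and the three sample customers are hypothetical. It shows that a user who belongs to two country groups sees only the rows of the first matching branch, and that a user in none of the groups sees nothing.

```python
COUNTRY_GROUPS = [          # evaluated in order, like the WHEN branches
    ("poland-team", "Poland"),
    ("germany-team", "Germany"),
    ("france-team", "France"),
]


def row_visible(country: str, user_groups: set) -> bool:
    """Mirror of the CASE expression: the first matching branch decides."""
    if "global-access" in user_groups:
        return True
    for group, allowed_country in COUNTRY_GROUPS:
        if group in user_groups:
            return country == allowed_country
    return False  # ELSE FALSE: users in no listed group see nothing


customers = [("C1", "Poland"), ("C2", "Germany"), ("C3", "France")]


def visible_ids(user_groups: set) -> list:
    return [cid for cid, country in customers if row_visible(country, user_groups)]


print(visible_ids({"germany-team"}))                 # ['C2']
print(visible_ids({"poland-team", "germany-team"}))  # ['C1']  (first branch wins)
print(visible_ids(set()))                            # []
print(visible_ids({"global-access"}))                # ['C1', 'C2', 'C3']
```

If members of several country teams should see the union of their countries, the filter would need `OR`-combined conditions instead of a single `CASE`.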

In [0]:
# Create row-level security view - country access
spark.sql(f"""
    CREATE OR REPLACE VIEW {CATALOG}.{GOLD_SCHEMA}.customers_rls AS
    SELECT *
    FROM {CATALOG}.{BRONZE_SCHEMA}.customers
    WHERE 
        CASE 
            WHEN is_account_group_member('global-access') THEN TRUE
            WHEN is_account_group_member('poland-team') THEN country = 'Poland'
            WHEN is_account_group_member('germany-team') THEN country = 'Germany'
            WHEN is_account_group_member('france-team') THEN country = 'France'
            ELSE FALSE
        END
""")

print("✓ RLS View created - users see only customers from their own country")

In [0]:
# RLS based on user attribute (e.g., department)
# Illustrative pattern only: the silver orders table, departments and user_department_mapping
# are not created in this notebook, so this statement fails unless they already exist.
spark.sql(f"""
    CREATE OR REPLACE VIEW {CATALOG}.{GOLD_SCHEMA}.sales_rls AS
    SELECT
        o.*,
        d.department
    FROM {CATALOG}.{SILVER_SCHEMA}.orders o
    JOIN {CATALOG}.{SILVER_SCHEMA}.departments d ON o.department_id = d.department_id
    WHERE
        is_account_group_member('admin') OR
        current_user() IN (
            SELECT user_email
            FROM {CATALOG}.{GOLD_SCHEMA}.user_department_mapping
            WHERE department = d.department
        )
""")

# Users only see sales from their own department

# RLS on orders - each role sees only the order statuses it is allowed to see
spark.sql(f"""
  CREATE OR REPLACE VIEW {CATALOG}.{GOLD_SCHEMA}.orders_rls AS
  SELECT
    o.*
  FROM {CATALOG}.{BRONZE_SCHEMA}.orders o
  WHERE
    is_account_group_member('admin') OR
    (is_account_group_member('finance-team') AND o.status IN ('completed', 'shipped')) OR
    (is_account_group_member('warehouse-team') AND o.status IN ('pending', 'processing', 'shipped'))
""")

print("✓ RLS View for orders - users see only the orders that match their role")

In [0]:
# GRANT access to the RLS Views
spark.sql(f"""
  GRANT SELECT ON VIEW {CATALOG}.{GOLD_SCHEMA}.customers_rls TO `all-users`
""")

spark.sql(f"""
  GRANT SELECT ON VIEW {CATALOG}.{GOLD_SCHEMA}.orders_rls TO `all-users`
""")

# Revoke direct access to the base table that orders_rls reads from
spark.sql(f"""
    REVOKE SELECT ON TABLE {CATALOG}.{BRONZE_SCHEMA}.orders FROM `all-users`
""")

print("✓ Users have access through the RLS Views")
print("  - Rows are filtered automatically based on group membership")

---

## 4️⃣ Data Lineage and Audit Logging

### Querying Data Lineage:

Unity Catalog automatically tracks lineage for:
- Table → Table (ETL transformations)
- Notebook → Table (data writes)
- Dashboard → Table (BI queries)
- ML Model → Table (training data)
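
The lineage queries in the following cells return direct source-to-target pairs. To answer "what does this table ultimately depend on?", those pairs can be walked as a graph. The sketch below is plain Python over a hypothetical list of `(source, target)` tuples, for example collected from a lineage query result; the table names are invented.

```python
from collections import defaultdict, deque


def upstream_tables(edges, target):
    """Return every table that feeds `target`, directly or transitively."""
    parents = defaultdict(set)
    for source, dest in edges:
        parents[dest].add(source)

    seen = set()
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for source in parents[current]:
            if source not in seen:      # also protects against cycles
                seen.add(source)
                queue.append(source)
    return seen


# Hypothetical edges shaped like (source_table_full_name, target_table_full_name)
edges = [
    ("demo.bronze.customers", "demo.silver.customer_order_summary"),
    ("demo.bronze.orders", "demo.silver.customer_order_summary"),
    ("demo.silver.customer_order_summary", "demo.gold.customer_report"),
]

print(sorted(upstream_tables(edges, "demo.gold.customer_report")))
# ['demo.bronze.customers', 'demo.bronze.orders', 'demo.silver.customer_order_summary']
```

Swapping the roles of source and target in `parents` gives the downstream (impact analysis) direction.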

In [0]:
# Query table lineage from system tables
# (the lineage tables expose the event timestamp as event_time)
lineage_df = spark.sql(f"""
  SELECT
    source_table_full_name,
    source_type,
    target_table_full_name,
    target_type,
    event_time,
    created_by
  FROM system.access.table_lineage
  WHERE target_table_full_name LIKE '{CATALOG}.%'
  ORDER BY event_time DESC
  LIMIT 50
""")

display(lineage_df)
print(f"✓ Lineage for tables in catalog {CATALOG}")

In [0]:
# Find upstream dependencies (sources) for a table
upstream_df = spark.sql(f"""
    SELECT DISTINCT
        source_table_full_name,
        source_type
    FROM system.access.table_lineage
    WHERE target_table_full_name = '{CATALOG}.{SILVER_SCHEMA}.customer_order_summary'
""")

display(upstream_df)
print("⬆️ Upstream: source tables for customer_order_summary")

In [0]:
# Find downstream dependencies (consumers) of a table
downstream_df = spark.sql(f"""
    SELECT DISTINCT
        target_table_full_name,
        target_type
    FROM system.access.table_lineage
    WHERE source_table_full_name = '{CATALOG}.{BRONZE_SCHEMA}.customers'
""")

display(downstream_df)
print("⬇️ Downstream: Views/Tables that read from customers")

In [0]:
# Column-level lineage (if available)
# (the lineage tables expose the event timestamp as event_time)
column_lineage = spark.sql(f"""
    SELECT
        source_table_full_name,
        source_column_name,
        target_table_full_name,
        target_column_name,
        event_time
    FROM system.access.column_lineage
    WHERE target_table_full_name = '{CATALOG}.{SILVER_SCHEMA}.customer_order_summary'
    ORDER BY target_column_name
""")
display(column_lineage)

print("📊 Column-level lineage for the customer_order_summary View")

### Audit Logging:

Unity Catalog logs all access and operations:
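
Once audit rows have been collected, a quick per-user summary is often more useful than the raw event list. The sketch below is plain Python over hypothetical rows shaped like `(user_email, action_name, table_name)`; the users and tables are invented, and in a notebook the same aggregation could equally be done in SQL.

```python
from collections import Counter

# Hypothetical rows shaped like (user_email, action_name, table_name)
events = [
    ("ana@example.com", "getTable", "demo.bronze.customers"),
    ("ana@example.com", "getTable", "demo.bronze.customers"),
    ("bob@example.com", "getTable", "demo.bronze.orders"),
    ("bob@example.com", "deleteTable", "demo.bronze.tmp"),
]


def access_counts(rows, action="getTable"):
    """Count events of one action type per (user, table) pair."""
    return Counter((user, table) for user, act, table in rows if act == action)


for (user, table), n in access_counts(events).most_common():
    print(f"{user} accessed {table} {n}x")
```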

In [0]:
# Query audit logs
audit_df = spark.sql("""
    SELECT 
        event_time,
        user_identity.email as user_email,
        service_name,
        action_name,
        request_params.full_name_arg as table_name,
        response.status_code,
        request_id
    FROM system.access.audit
    WHERE action_name IN ('getTable', 'createTable', 'deleteTable', 'updateTable')
        AND event_date >= current_date() - INTERVAL 7 DAYS
    ORDER BY event_time DESC
    LIMIT 100
""")
audit_df.display()

In [0]:
# Track who accessed sensitive tables
sensitive_access = spark.sql(f"""
    SELECT 
        event_time,
        user_identity.email as user,
        action_name,
        request_params.full_name_arg as table_accessed,
        source_ip_address
    FROM system.access.audit
    WHERE request_params.full_name_arg LIKE '{CATALOG}.%.customers%'
        AND action_name = 'getTable'
        AND event_date >= current_date() - INTERVAL 7 DAYS
    ORDER BY event_time DESC
    LIMIT 100
""")

display(sensitive_access)
print("🔒 Audit logs: access to the customers table (last 7 days)")

In [0]:
# Grant/Revoke audit trail
# Assumption: Unity Catalog records GRANT/REVOKE as 'updatePermissions' events, with the
# affected principals and privileges serialized in request_params.changes
grant_audit = spark.sql("""
    SELECT
        event_time,
        user_identity.email as admin_user,
        action_name,
        request_params.securable_type as securable_type,
        request_params.securable_full_name as object_name,
        request_params.changes as permission_changes
    FROM system.access.audit
    WHERE service_name = 'unityCatalog'
        AND action_name = 'updatePermissions'
        AND event_date >= current_date() - INTERVAL 30 DAYS
    ORDER BY event_time DESC
""")
grant_audit.display()

print("📝 Audit trail of privilege changes")

---

## 5️⃣ Delta Sharing

**Delta Sharing** = Secure data sharing protocol (cross-org, cross-cloud)

### Components:
- **Share**: a collection of tables to be shared
- **Recipient**: the organization/user receiving the data
- **Provider**: the data owner (you)

### Create Share:

In [0]:
# Create a Share for external partners
share_name = f"{CATALOG}_partner_share"

spark.sql(f"""
  CREATE SHARE IF NOT EXISTS {share_name}
  COMMENT 'KION data shared with business partners'
""")

print(f"✓ Share '{share_name}' created")

In [0]:
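# Add tables to the Share (Gold layer only - aggregated data)
# customer_order_summary was created as a view in Silver, so materialize it as a Gold table first
spark.sql(f"""
  CREATE OR REPLACE TABLE {CATALOG}.{GOLD_SCHEMA}.customer_order_summary AS
  SELECT * FROM {CATALOG}.{SILVER_SCHEMA}.customer_order_summary
""")

spark.sql(f"""
  ALTER SHARE {share_name}
  ADD TABLE {CATALOG}.{GOLD_SCHEMA}.customer_order_summary
""")

print(f"✓ Table customer_order_summary added to {share_name}")
print("  - Partners get access only to aggregated Gold data")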

In [0]:
# Verify the contents of the Share
spark.sql(f"SHOW ALL IN SHARE {share_name}").display()

print(f"✓ Tables in Share: {share_name}")

### Create Recipient:
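
On the provider side, a recipient is created and then given access to a share. As a minimal sketch, the hypothetical helper below only builds the two SQL statements (`CREATE RECIPIENT` and `GRANT SELECT ON SHARE ... TO RECIPIENT`); the share and recipient names are placeholders, and only simple identifiers are accepted so that no quoting is needed. In a notebook each statement would be passed to `spark.sql(...)`.

```python
import re

_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _check(name: str) -> str:
    """Allow only simple identifiers so names can be embedded without quoting."""
    if not _IDENT.match(name):
        raise ValueError(f"not a simple identifier: {name!r}")
    return name


def recipient_setup_sql(share: str, recipient: str) -> list:
    """Statements a provider runs to create a recipient and give it a share."""
    share, recipient = _check(share), _check(recipient)
    return [
        f"CREATE RECIPIENT IF NOT EXISTS {recipient}",
        f"GRANT SELECT ON SHARE {share} TO RECIPIENT {recipient}",
    ]


for stmt in recipient_setup_sql("kion_partner_share", "partner_acme"):
    print(stmt)
    # In a notebook: spark.sql(stmt)
```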

### Consuming shared data (as recipient):

### Best practices for Delta Sharing:

1. **Share only aggregated/gold data**: do not share raw/bronze layers
2. **Use views for masking**: create a view with masked PII before sharing
3. **Monitor access**: track who accesses shared data
4. **Version control**: use table versions for stable APIs
5. **Documentation**: provide clear documentation for recipients

---

## 6️⃣ Best Practices for Data Governance

### 1. Catalog organization strategy:

### 2. Access control patterns:

### 3. Tagging and documentation:

### 4. Monitoring and alerts:

In [0]:
# Regular governance health checks

# 1. Tables without owners
unowned_tables = spark.sql(f"""
    SELECT 
        table_catalog,
        table_schema,
        table_name
    FROM system.information_schema.tables
    WHERE table_catalog = '{CATALOG}'
        AND table_owner IS NULL
""")

display(unowned_tables)
print("⚠️ Tables without owners (each table should have an assigned owner)")

# 2. Tables without comments
undocumented = spark.sql(f"""
    SELECT 
        table_catalog,
        table_schema,
        table_name
    FROM system.information_schema.tables
    WHERE table_catalog = '{CATALOG}'
        AND (comment IS NULL OR comment = '')
""")

display(undocumented)
print("📝 Tables without documentation (add COMMENT ON TABLE)")

# 3. Unused tables (no queries in 90 days)
unused_tables = spark.sql(f"""
    WITH recent_access AS (
        SELECT DISTINCT request_params.full_name_arg as table_name
        FROM system.access.audit
        WHERE action_name = 'getTable'
            AND event_date >= current_date() - INTERVAL 90 DAYS
    )
    SELECT 
        t.table_catalog,
        t.table_schema,
        t.table_name,
        t.created as table_created_at
    FROM system.information_schema.tables t
    LEFT JOIN recent_access ra 
        ON CONCAT(t.table_catalog, '.', t.table_schema, '.', t.table_name) = ra.table_name
    WHERE t.table_catalog = '{CATALOG}'
        AND ra.table_name IS NULL
        AND t.created < current_date() - INTERVAL 90 DAYS
""")
unused_tables.display()

---

## ✅ Summary

### You have learned:

✅ **Unity Catalog Architecture**: Metastore → Catalog → Schema → Tables  
✅ **Access Control**: GRANT/REVOKE privileges at multiple levels  
✅ **Data Masking**: Column-level masking with dynamic views  
✅ **Row-Level Security**: Filter data based on user identity  
✅ **Data Lineage**: Track data flow through system tables  
✅ **Audit Logging**: Monitor who accessed what and when  
✅ **Delta Sharing**: Secure cross-organization data sharing  

### Key Takeaways:

1. **Unified Governance**: Single platform for all data assets
2. **Fine-grained Control**: Table, column, row-level security
3. **Automatic Lineage**: No extra instrumentation needed
4. **Compliance-ready**: Audit logs for regulatory requirements
5. **Secure Sharing**: Delta Sharing for external collaboration

### Next steps:
- **Notebook 05**: BI & ML Integrations
- **Workshop 03**: Governance + Integrations hands-on

---

## 📚 Additional resources

- [Unity Catalog Documentation](https://docs.databricks.com/data-governance/unity-catalog/index.html)
- [Delta Sharing Protocol](https://delta.io/sharing/)
- [Unity Catalog Best Practices](https://docs.databricks.com/data-governance/unity-catalog/best-practices.html)

---

## ✅ Checklist - Unity Catalog Governance

After completing this notebook you should be able to:

- [ ] **UC Architecture**: Understand the Metastore → Catalog → Schema → Objects hierarchy
- [ ] **Creating objects**: Create a Catalog, Schema, Tables, Views, Volumes, Functions
- [ ] **GRANT/REVOKE**: Manage privileges at every level
- [ ] **Privileges**: Understand SELECT, MODIFY, CREATE TABLE, EXECUTE
- [ ] **Data Masking**: Create Views that mask sensitive data
- [ ] **Row-Level Security**: Implement RLS based on group membership
- [ ] **Lineage**: Trace upstream/downstream dependencies
- [ ] **Audit Logging**: Query system.access.audit for user activity
- [ ] **Delta Sharing**: Create a Share and share data with external recipients
- [ ] **Best Practices**: Monitor governance health (owners, documentation, unused tables)

---

## 🔧 Troubleshooting

### Problem 1: "Table or view not found"
**Cause**: Missing USE CATALOG or USE SCHEMA privilege  
**Solution**:
```sql
GRANT USE CATALOG ON CATALOG <catalog_name> TO <principal>;
GRANT USE SCHEMA ON SCHEMA <catalog>.<schema> TO <principal>;
```

### Problem 2: "Permission denied" on SELECT
**Cause**: Missing SELECT privilege on the table  
**Solution**:
```sql
GRANT SELECT ON TABLE <catalog>.<schema>.<table> TO <principal>;
-- or on the whole schema:
GRANT SELECT ON SCHEMA <catalog>.<schema> TO <principal>;
```

### Problem 3: "Cannot execute function"
**Cause**: Missing EXECUTE privilege on the function  
**Solution**:
```sql
GRANT EXECUTE ON FUNCTION <catalog>.<schema>.<function_name> TO <principal>;
```

### Problem 4: "Volume not accessible"
**Cause**: Missing READ VOLUME / WRITE VOLUME privileges  
**Solution**:
```sql
GRANT READ VOLUME ON VOLUME <catalog>.<schema>.<volume> TO <principal>;
GRANT WRITE VOLUME ON VOLUME <catalog>.<schema>.<volume> TO <principal>;
```

### Problem 5: RLS View returns no rows or does not filter as expected
**Cause**: The user does not belong to any of the groups listed in the CASE WHEN branches, so the ELSE branch applies  
**Solution**: Add the user to the appropriate group, or change the default (ELSE) branch of the View

### Problem 6: Lineage does not show dependencies
**Cause**: Lineage is captured automatically, but it can lag by a few minutes  
**Solution**: Wait 5-10 minutes and query system.access.table_lineage again

### Problem 7: Share not visible to the recipient
**Cause**: The recipient has not used the activation link  
**Solution**: Send the activation link shown by DESCRIBE RECIPIENT

---

## 🏆 Best Practices Summary

### 1. **Catalog Organization**
- ✅ Use environment-based catalogs: `dev`, `test`, `prod`
- ✅ Organize schemas by layer: `bronze`, `silver`, `gold`
- ✅ Follow naming conventions: `<catalog>.<schema>.<object>`

### 2. **Access Control**
- ✅ **Principle of Least Privilege**: Grant only the minimum privileges required
- ✅ Grant to groups, not individual users
- ✅ Inheritance: GRANT on a Catalog → inherited by its Schemas → inherited by their Tables
- ✅ Audit privileges regularly (SHOW GRANTS)

### 3. **Data Masking & RLS**
- ✅ Mask PII in Views for users outside pii-access-group
- ✅ Use RLS for multi-tenant scenarios
- ✅ Always test masking with different group memberships

### 4. **Lineage & Audit**
- ✅ Use automatic lineage to trace data flow
- ✅ Review audit logs for sensitive tables regularly
- ✅ Check lineage after pipeline changes

### 5. **Delta Sharing**
- ✅ Share only the Gold layer (aggregated data)
- ✅ Use masked Views in a Share
- ✅ Document Share contracts for recipients

### 6. **Documentation & Governance**
- ✅ Add a COMMENT to all tables, views, and functions
- ✅ Use Table Properties for metadata (owner, PII, retention)
- ✅ Run governance health checks regularly

### 7. **Volumes & Functions**
- ✅ Use Managed Volumes for ML artifacts and staging
- ✅ Centralize business logic in UC Functions
- ✅ Control access with GRANT EXECUTE

---