# Workshop 4: Unity Catalog Governance

## The Story

A security audit revealed that too many users have access to sensitive customer data.
Your task is to secure the `customers_governance` table using Unity Catalog.
You need to implement Row-Level Security (RLS) to ensure analysts can only see data from their own country.

**Your Mission:**
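1. Audit current permissions.
2. Grant specific permissions to a group.
3. Create a Row Filter (RLS) to restrict access.
4. Create a Column Mask to hide PII (Personally Identifiable Information).
5. Document the table with comments and tags.
6. Add a data quality constraint and verify the result.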

**Time:** 30 minutes


In [None]:
%run ../00_setup

In [None]:
from pyspark.sql.functions import col, concat, lit

# --- INDEPENDENT SETUP ---
# We will create a small synthetic table (20 rows) for the workshop.
table_name = f"{CATALOG}.{BRONZE_SCHEMA}.customers_governance"

print(f"Generating synthetic data for table: {table_name}...")

# Generate 20 unique, realistic records with SSN (Sensitive Data)
data = [
    (1, "James", "Smith", "james.smith@example.com", "123-45-6789", "US"),
    (2, "Michael", "Johnson", "michael.johnson@example.com", "234-56-7890", "US"),
    (3, "Robert", "Williams", "robert.williams@example.com", "345-67-8901", "US"),
    (4, "Maria", "Jones", "maria.jones@example.com", "456-78-9012", "US"),
    (5, "David", "Brown", "david.brown@example.com", "567-89-0123", "US"),
    (6, "Joseph", "Davis", "joseph.davis@example.com", "678-90-1234", "US"),
    (7, "Thomas", "Miller", "thomas.miller@example.com", "789-01-2345", "US"),
    (8, "Charles", "Wilson", "charles.wilson@example.com", "890-12-3456", "US"),
    (9, "Daniel", "Moore", "daniel.moore@example.com", "901-23-4567", "US"),
    (10, "Matthew", "Taylor", "matthew.taylor@example.com", "012-34-5678", "US"),
    (11, "Christopher", "Anderson", "chris.anderson@example.com", "321-54-9876", "UK"),
    (12, "Andrew", "Thomas", "andrew.thomas@example.com", "432-65-0987", "UK"),
    (13, "Elizabeth", "Jackson", "elizabeth.jackson@example.com", "543-76-1098", "UK"),
    (14, "Brian", "White", "brian.white@example.com", "654-87-2109", "UK"),
    (15, "George", "Harris", "george.harris@example.com", "765-98-3210", "UK"),
    (16, "Jennifer", "Martin", "jennifer.martin@example.com", "876-09-4321", "UK"),
    (17, "Linda", "Thompson", "linda.thompson@example.com", "987-10-5432", "UK"),
    (18, "Barbara", "Garcia", "barbara.garcia@example.com", "098-21-6543", "UK"),
    (19, "Susan", "Martinez", "susan.martinez@example.com", "109-32-7654", "UK"),
    (20, "Jessica", "Robinson", "jessica.robinson@example.com", "210-43-8765", "UK"),
]

schema = "CustomerID INT, FirstName STRING, LastName STRING, Email STRING, SSN STRING, Country STRING"
df_synthetic = spark.createDataFrame(data, schema)

# Add FullName
df_final = df_synthetic.withColumn("FullName", concat(col("FirstName"), lit(" "), col("LastName")))

# Save as Delta Table
df_final.write.format("delta").mode("overwrite").saveAsTable(table_name)

print(f"✅ Table {table_name} created successfully with 20 unique records.")
display(df_final)

## Step 1: Audit Permissions

### Task 1.1: Check current grants

See who has access to the table.

**Hint:**
```sql
SHOW GRANTS ON TABLE catalog.schema.table_name
```
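The mission also lists granting permissions to a group. Here is a minimal sketch of that step, assuming a hypothetical account group named `analysts` and placeholder catalog and schema names (none of these come from the workshop setup). The helper only builds the SQL text; in the notebook you would pass each statement to `spark.sql`.

```python
# Hypothetical names for illustration only. In the notebook, use CATALOG and
# BRONZE_SCHEMA from the setup and a group that exists in your account.
def grant_statements(catalog: str, schema: str, table: str, group: str) -> list[str]:
    """Build the GRANT statements a group needs to query one table."""
    return [
        # Reading a table also requires access to its parent catalog and schema.
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{group}`",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO `{group}`",
        f"GRANT SELECT ON TABLE {catalog}.{schema}.{table} TO `{group}`",
    ]


statements = grant_statements("demo_catalog", "bronze", "customers_governance", "analysts")
for stmt in statements:
    print(stmt)  # in the notebook: spark.sql(stmt)
```

Running `SHOW GRANTS` again after the grants should list the new privileges for the group.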


In [None]:
# TODO: Show grants
# display(spark.sql(f"SHOW GRANTS ON TABLE {table_name}"))


## Step 2: Row-Level Security (Row Filter)

This workshop simplifies the scenario to a single hard-coded country: non-admin users should see only rows where `Country = 'US'`, while members of the `admin` group see every row.

### Task 2.1: Create a Row Filter Function

Create a SQL function that returns TRUE if the user is allowed to see the row.

**Hint:**
```sql
CREATE OR REPLACE FUNCTION filter_country(country STRING)
RETURN IF(is_account_group_member('admin'), true, country = 'US')
```
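To see what this predicate does to a query result, here is a plain-Python stand-in for the same logic. It is not Unity Catalog code: the `is_admin` flag is an assumption that replaces `is_account_group_member('admin')`, and the three rows are a subset of the synthetic data.

```python
def filter_country(country: str, is_admin: bool) -> bool:
    """Mirror of the SQL hint: admins see every row, everyone else only 'US'."""
    return True if is_admin else country == "US"


# (FirstName, Country) pairs taken from the synthetic table.
rows = [("James", "US"), ("Andrew", "UK"), ("Maria", "US")]

visible_to_analyst = [r for r in rows if filter_country(r[1], is_admin=False)]
visible_to_admin = [r for r in rows if filter_country(r[1], is_admin=True)]

print(visible_to_analyst)  # only the US rows
print(visible_to_admin)    # all three rows
```

The real function is evaluated once per row at query time, so the caller never receives the rows for which it returns false.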

In [None]:
# TODO: Create row filter function
# spark.sql(f"CREATE OR REPLACE FUNCTION {CATALOG}.{BRONZE_SCHEMA}.filter_country ...")

### Task 2.2: Apply the Filter to the Table

**Hint:**
```sql
ALTER TABLE catalog.schema.table_name SET ROW FILTER catalog.schema.filter_country ON (Country)
```
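A small sketch of how the statement could be assembled from the notebook variables. The catalog and schema values below are placeholders, not the ones defined in the setup notebook. The `DROP ROW FILTER` statement is included on the assumption that you may want to reset the table between attempts.

```python
def row_filter_statements(table: str, function: str, column: str) -> dict[str, str]:
    """Build the statements that attach a row filter to a table and remove it."""
    return {
        "apply": f"ALTER TABLE {table} SET ROW FILTER {function} ON ({column})",
        "remove": f"ALTER TABLE {table} DROP ROW FILTER",
    }


# Placeholder names; in the notebook, build these from CATALOG and BRONZE_SCHEMA.
table = "demo_catalog.bronze.customers_governance"
function = "demo_catalog.bronze.filter_country"

stmts = row_filter_statements(table, function, "Country")
print(stmts["apply"])   # in the notebook: spark.sql(stmts["apply"])
print(stmts["remove"])
```

The column listed in `ON (...)` is passed as the argument to the filter function, so its type has to match the function's parameter.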

## Step 3: Column Masking (Dynamic Data Masking)

We need to mask the `SSN` column so that only members of the `admin` group see the real values.

### Task 3.1: Create a Masking Function

**Hint:**
```sql
CREATE OR REPLACE FUNCTION mask_ssn(ssn STRING)
RETURN CASE WHEN is_account_group_member('admin') THEN ssn ELSE '***-**-****' END
```
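Once the function exists, it still has to be attached to the column. The sketch below first mimics the masking logic in plain Python, then builds the `ALTER TABLE ... SET MASK` statement. The `is_admin` flag stands in for `is_account_group_member('admin')`, and the catalog and schema names are placeholders.

```python
def mask_ssn(ssn: str, is_admin: bool) -> str:
    """Mirror of the SQL hint: admins get the real value, others a fixed mask."""
    return ssn if is_admin else "***-**-****"


def set_mask_statement(table: str, column: str, function: str) -> str:
    """Build the statement that attaches a masking function to one column."""
    return f"ALTER TABLE {table} ALTER COLUMN {column} SET MASK {function}"


print(mask_ssn("123-45-6789", is_admin=False))
print(mask_ssn("123-45-6789", is_admin=True))

# Placeholder names; in the notebook, build these from CATALOG and BRONZE_SCHEMA.
apply_mask = set_mask_statement(
    "demo_catalog.bronze.customers_governance", "SSN", "demo_catalog.bronze.mask_ssn"
)
print(apply_mask)  # in the notebook: spark.sql(apply_mask)
```

Because the mask returns a `STRING` for a `STRING` column, queries and downstream code keep working with the same schema whether or not the caller is an admin.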

In [None]:
# TODO: Apply row filter (Task 2.2)
# spark.sql(f"ALTER TABLE {table_name} SET ROW FILTER ...")

# TODO: Create masking function
# spark.sql(f"CREATE OR REPLACE FUNCTION {CATALOG}.{BRONZE_SCHEMA}.mask_ssn ...")

# TODO: Apply column mask
# spark.sql(f"ALTER TABLE {table_name} ALTER COLUMN SSN SET MASK ...")

## Step 4: Metadata Management (Tags & Comments)

Good governance requires good documentation. Unity Catalog allows you to add **Comments** and **Tags** to tables and columns.

### Task 4.1: Add Comments
Add a description to the table and the `SSN` column.

**Hint:**
```sql
COMMENT ON TABLE catalog.schema.table_name IS 'Customer data with PII';
COMMENT ON COLUMN catalog.schema.table_name.column_name IS 'Social Security Number';
```
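`spark.sql` runs one statement per call, so the two comments need two calls. The sketch below builds both statements and runs them one at a time through a caller-supplied function. For the column it uses `ALTER TABLE ... ALTER COLUMN ... COMMENT`, an alternative syntax that may help if `COMMENT ON COLUMN` is not accepted by your runtime. The table name is a placeholder, and `run_all` is a hypothetical helper.

```python
from typing import Callable


def comment_statements(table: str) -> list[str]:
    """Build one statement for the table comment and one for the SSN column."""
    return [
        f"COMMENT ON TABLE {table} IS 'Customer data with PII'",
        f"ALTER TABLE {table} ALTER COLUMN SSN COMMENT 'Social Security Number'",
    ]


def run_all(statements: list[str], run: Callable[[str], object]) -> int:
    """Execute each statement separately and return how many were run."""
    for stmt in statements:
        run(stmt)
    return len(statements)


# Here the "runner" just records the statements; in the notebook pass spark.sql.
executed: list[str] = []
count = run_all(comment_statements("demo_catalog.bronze.customers_governance"), executed.append)
print(count, "statements executed")
```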

### Task 4.2: Add Tags
Tag the table as containing PII data.

**Hint:**
```sql
ALTER TABLE catalog.schema.table_name SET TAGS ('pii' = 'true', 'sensitivity' = 'high');
ALTER TABLE catalog.schema.table_name ALTER COLUMN SSN SET TAGS ('pii' = 'true');
```
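To confirm that the tags were stored, one option is to query the catalog's information schema. This is a sketch under the assumption that the `information_schema.table_tags` view is available in your workspace and exposes `schema_name`, `table_name`, `tag_name`, and `tag_value` columns. The catalog and schema names are placeholders.

```python
def table_tags_query(catalog: str, schema: str, table: str) -> str:
    """Build a query that lists the tags attached to a single table."""
    return (
        "SELECT tag_name, tag_value "
        f"FROM {catalog}.information_schema.table_tags "
        f"WHERE schema_name = '{schema}' AND table_name = '{table}'"
    )


query = table_tags_query("demo_catalog", "bronze", "customers_governance")
print(query)  # in the notebook: display(spark.sql(query))
```

If the view is available, the result should contain one row per tag, here `pii` and `sensitivity`.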

In [None]:
# TODO: Add comments
# spark.sql(f"COMMENT ON TABLE {table_name} IS ...")
# spark.sql(f"COMMENT ON COLUMN {table_name}.SSN IS ...")

# TODO: Add tags
# spark.sql(f"ALTER TABLE {table_name} SET TAGS ...")
# spark.sql(f"ALTER TABLE {table_name} ALTER COLUMN SSN SET TAGS ...")

## Step 5: Data Quality (Constraints)

Governance is not just about security; it is also about **Data Quality**.
We want to ensure that no one can insert invalid data into our governed table.

### Task 5.1: Add a Constraint

Add a check constraint to ensure `SSN` is not empty and has more than 5 characters.

**Hint:**
```sql
ALTER TABLE catalog.schema.table_name ADD CONSTRAINT valid_ssn CHECK (length(SSN) > 5)
```
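The comparison in the constraint is strict, which matters at the boundary. A plain-Python stand-in for the same expression (not Delta code, and limited to non-null strings) makes this visible:

```python
def valid_ssn(ssn: str) -> bool:
    """Mirror of CHECK (length(SSN) > 5): strictly more than five characters."""
    return len(ssn) > 5


for candidate in ["BAD", "12345", "123456", "123-45-6789"]:
    print(f"{candidate!r}: {valid_ssn(candidate)}")
```

A five-character value such as `'12345'` is rejected, while every SSN in the synthetic data (11 characters) passes.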

### Task 5.2: Test the Constraint

Try to insert an invalid record and confirm that the write fails.

**Hint:**
```sql
INSERT INTO catalog.schema.table_name (CustomerID, SSN) VALUES (999, 'BAD')
```
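When testing for an expected failure, it helps to also report the case where the write unexpectedly succeeds. Below is a minimal sketch using a hypothetical helper, `expect_rejection`, which takes any zero-argument callable. The two lambdas simulate a rejected and an accepted write so that the example runs without Spark.

```python
from typing import Callable


def expect_rejection(action: Callable[[], object]) -> bool:
    """Return True if the action raises, False if it unexpectedly succeeds."""
    try:
        action()
    except Exception as exc:
        print(f"Rejected as expected: {type(exc).__name__}")
        return True
    print("The write succeeded; the constraint did not block it.")
    return False


def failing_write() -> None:
    # Stand-in for an INSERT that violates the constraint.
    raise ValueError("CHECK constraint valid_ssn violated")


rejected = expect_rejection(failing_write)
accepted = expect_rejection(lambda: None)
```

In the notebook, the call would look like `expect_rejection(lambda: spark.sql(f"INSERT INTO {table_name} (CustomerID, SSN) VALUES (999, 'BAD')"))`.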

In [None]:
# TODO: Add constraint
# spark.sql(f"ALTER TABLE {table_name} ADD CONSTRAINT ...")

# TODO: Try to insert invalid data (Expect Error)
# try:
#     spark.sql(f"INSERT INTO {table_name} ...")
# except Exception as e:
#     print(f"Caught expected error: {e}")

## Step 6: Verification

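Query the table to check that the row filter and column mask are working. The constraint was already tested in Step 5.
Note: both functions return unfiltered, unmasked data to members of the `admin` account group, so if you belong to that group you will still see all 20 rows and the real SSNs. To observe the restrictions, test with a user outside the group or temporarily change the group name in the functions.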

In [None]:
# Verify results
display(spark.table(table_name))

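To check the result programmatically rather than by eye, you could collect the visible rows and test them against the expected policy. This is a sketch with a hypothetical helper, `check_visible_rows`; it only makes sense when run as a user outside the `admin` group, and the sample rows below are hand-written stand-ins for a query result.

```python
MASK = "***-**-****"


def check_visible_rows(rows: list[dict]) -> list[str]:
    """Return a list of policy violations found in the rows a user can see."""
    problems = []
    for row in rows:
        if row["Country"] != "US":
            problems.append(f"CustomerID {row['CustomerID']}: non-US row is visible")
        if row["SSN"] != MASK:
            problems.append(f"CustomerID {row['CustomerID']}: SSN is not masked")
    return problems


# Stand-in data: the first row respects the policy, the second breaks it twice.
sample = [
    {"CustomerID": 1, "Country": "US", "SSN": MASK},
    {"CustomerID": 11, "Country": "UK", "SSN": "321-54-9876"},
]
problems = check_visible_rows(sample)
for p in problems:
    print(p)
```

In the notebook, the input could be built with `[r.asDict() for r in spark.table(table_name).collect()]`; an empty list means the filter and mask behaved as expected for the current user.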

# Solution

The complete code is below.


In [None]:
# ============================================================
# FULL SOLUTION - Workshop 4: Unity Catalog Governance
# ============================================================

table_name = f"{CATALOG}.{BRONZE_SCHEMA}.customers_governance"

# --- Step 1: Audit ---
print("CURRENT GRANTS:")
display(spark.sql(f"SHOW GRANTS ON TABLE {table_name}"))

# --- Step 2: Row Filter ---
# 1. Create Function
spark.sql(f"""
CREATE OR REPLACE FUNCTION {CATALOG}.{BRONZE_SCHEMA}.filter_country(country STRING)
RETURN IF(is_account_group_member('admin'), true, country = 'US')
""")

# 2. Apply Filter
spark.sql(f"ALTER TABLE {table_name} SET ROW FILTER {CATALOG}.{BRONZE_SCHEMA}.filter_country ON (Country)")
print("Row Filter applied.")

# --- Step 3: Column Mask ---
# 1. Create Function
spark.sql(f"""
CREATE OR REPLACE FUNCTION {CATALOG}.{BRONZE_SCHEMA}.mask_ssn(ssn STRING)
RETURN CASE WHEN is_account_group_member('admin') THEN ssn ELSE '***-**-****' END
""")

# 2. Apply Mask
spark.sql(f"ALTER TABLE {table_name} ALTER COLUMN SSN SET MASK {CATALOG}.{BRONZE_SCHEMA}.mask_ssn")
print("Column Mask applied.")

# --- Step 4: Metadata Management ---
# 1. Add Comments
spark.sql(f"COMMENT ON TABLE {table_name} IS 'Customer data with PII'")
spark.sql(f"COMMENT ON COLUMN {table_name}.SSN IS 'Social Security Number'")
print("Comments added.")

# 2. Add Tags
spark.sql(f"ALTER TABLE {table_name} SET TAGS ('pii' = 'true', 'sensitivity' = 'high')")
spark.sql(f"ALTER TABLE {table_name} ALTER COLUMN SSN SET TAGS ('pii' = 'true')")
print("Tags added.")

# --- Step 5: Data Quality ---
try:
    spark.sql(f"ALTER TABLE {table_name} ADD CONSTRAINT valid_ssn CHECK (length(SSN) > 5)")
    print("Constraint added.")
except Exception as e:
    print(f"Constraint might already exist: {e}")

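print("Testing constraint with invalid data...")
try:
    spark.sql(f"INSERT INTO {table_name} (CustomerID, SSN) VALUES (999, 'BAD')")
except Exception as e:
    # The CHECK constraint is enforced at write time, so the INSERT raises.
    print(f"✅ SUCCESS: Invalid data was rejected by the CHECK constraint ({type(e).__name__}).")
else:
    # No exception means the row was written; the constraint is missing or too loose.
    print("❌ The invalid row was inserted; check that the constraint exists.")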

# --- Step 6: Verification ---
print("\nVERIFICATION (As Admin):")
display(spark.table(table_name))