# Training Environment Setup

## What This Notebook Does

This setup notebook creates an **isolated training environment** for each participant:

1. **Gets current user** from Databricks session
2. **Creates personal catalog** (`training_<username>`)
3. **Creates medallion schemas** (bronze, silver, gold)
4. **Sets up data paths** for Unity Catalog Volumes
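
Steps 1 and 2 amount to deriving a catalog name from the user's login. The following is a minimal sketch in plain Python, assuming the login is an email address. The function names `make_user_slug` and `catalog_for` are illustrative and are not defined by this notebook.

```python
import re


def make_user_slug(login: str) -> str:
    """Turn a login such as an email address into a lowercase, identifier-safe slug."""
    local_part = login.split("@")[0]
    # Replace every character outside [a-zA-Z0-9] with an underscore
    return re.sub(r"[^a-zA-Z0-9]", "_", local_part).lower()


def catalog_for(login: str, prefix: str = "training") -> str:
    """Build the per-user catalog name in the form <prefix>_<slug>."""
    return f"{prefix}_{make_user_slug(login)}"


print(catalog_for("jan.kowalski@company.com"))  # training_jan_kowalski
```

Note that two logins sharing a local part but differing in domain map to the same slug, so this scheme assumes usernames are unique within the training group.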

## Why Per-User Isolation?

| Approach | Pros | Cons | When to Use |
|----------|------|------|-------------|
| **Catalog per user** | Full isolation, realistic production setup | Requires CREATE CATALOG permission | Training, sandbox environments |
| **Schema per user** | Simpler permissions | Shared catalog, potential conflicts | Quick demos, limited permissions |
| **Shared everything** | Easiest setup | No isolation, data conflicts | Single-user demos only |

**We use catalog isolation** because it:
- Mimics production environments (each team/domain has its own catalog)
- Demonstrates Unity Catalog governance features
- Prevents conflicts between participants
- Lets each user DROP/CREATE without affecting others

## Cost Considerations

- **Storage**: Managed tables live in Unity Catalog managed storage (the catalog's own managed location if one is configured, otherwise the metastore's root location)
- **Compute**: Shared cluster, no additional cost per catalog
- **Cleanup**: Catalogs can be dropped after training to free storage

---

In [0]:
import re

# =============================================================================
# STEP 1: Get Current User (Dynamic)
# =============================================================================
# This automatically gets the logged-in user from Databricks session
raw_user = spark.sql("SELECT current_user()").first()[0]
print(f"Current user: {raw_user}")

# Create a clean slug from email (e.g., "jan.kowalski@company.com" -> "jan_kowalski")
user_slug = re.sub(r'[^a-zA-Z0-9]', '_', raw_user.split('@')[0]).lower()
print(f"User slug: {user_slug}")

## Configuration Variables

After running this setup, the following variables are available in any notebook that runs it (for example via `%run`):

| Variable | Description | Example |
|----------|-------------|---------|
| `CATALOG` | User's personal catalog | `training_jan_kowalski` |
| `BRONZE_SCHEMA` | Raw data layer | `bronze` |
| `SILVER_SCHEMA` | Cleaned/validated data | `silver` |
| `GOLD_SCHEMA` | Business-ready aggregates | `gold` |
| `DATASET_BASE_PATH` | Path to source data files | `/Volumes/.../datasets` |
| `user_slug` | Sanitized username | `jan_kowalski` |
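
Downstream notebooks typically combine these variables into three-level Unity Catalog names. Here is a small sketch of one way to do that safely; `table_fqn` and the `example_*` variables are hypothetical and only illustrate the idea, using the example values from the table above.

```python
example_catalog = "training_jan_kowalski"  # example value of CATALOG
example_schema = "bronze"                  # example value of BRONZE_SCHEMA


def table_fqn(catalog: str, schema: str, table: str) -> str:
    """Return a backtick-quoted three-level name: `catalog`.`schema`.`table`."""
    parts = (catalog, schema, table)
    # Escape embedded backticks by doubling them
    return ".".join("`" + p.replace("`", "``") + "`" for p in parts)


customers = table_fqn(example_catalog, example_schema, "customers")
print(f"SELECT * FROM {customers}")
```

Quoting each part keeps the generated SQL valid even if a slug or table name contains characters that are not allowed in an unquoted identifier.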

In [None]:
# =============================================================================
# STEP 2: Define Catalog and Schema Names
# =============================================================================
# Each user gets their own catalog with medallion architecture schemas

CATALOG = f"training_{user_slug}"
BRONZE_SCHEMA = "bronze"
SILVER_SCHEMA = "silver"
GOLD_SCHEMA = "gold"

# Shared data source (read-only Volume with training datasets)
# This Volume is pre-created by trainer and shared with all participants
SHARED_DATA_CATALOG = "training_shared"
DATASET_BASE_PATH = f"/Volumes/{SHARED_DATA_CATALOG}/datasets/files"

print(f"User catalog: {CATALOG}")
print(f"Schemas: {BRONZE_SCHEMA}, {SILVER_SCHEMA}, {GOLD_SCHEMA}")
print(f"Source data: {DATASET_BASE_PATH}")

## Create User Catalog and Schemas

**What happens below:**
1. Creates the catalog if it does not exist (requires the `CREATE CATALOG` privilege)
2. Creates the bronze/silver/gold schemas
3. Sets the catalog as the default context

**If you get a permission error:**
- Ask your Databricks admin to run: `GRANT CREATE CATALOG ON METASTORE TO <user>`
- Or use a pre-created catalog (change the `CATALOG` variable above)
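
The second option could also be automated. The sketch below assumes an admin has already created a fallback catalog; `create_or_fallback` and the name `training_sandbox` are hypothetical. The error-string check is the same heuristic used in the Step 3 cell.

```python
def create_or_fallback(spark, catalog: str, fallback: str) -> str:
    """Try to create `catalog`; return `fallback` if the user lacks permission."""
    try:
        spark.sql(f"CREATE CATALOG IF NOT EXISTS {catalog}")
        return catalog
    except Exception as e:
        message = str(e)
        if "PERMISSION_DENIED" in message or "does not have" in message:
            print(f"No permission to create '{catalog}', using '{fallback}' instead")
            return fallback
        raise  # anything else is unexpected and should surface


# Possible usage in the notebook (assumes `training_sandbox` already exists):
# CATALOG = create_or_fallback(spark, CATALOG, "training_sandbox")
```

With a shared fallback catalog, participants lose per-user isolation, so this is best treated as a last resort.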

In [None]:
# =============================================================================
# STEP 3: Create Catalog (if not exists)
# =============================================================================
try:
    spark.sql(f"CREATE CATALOG IF NOT EXISTS {CATALOG}")
    print(f"Catalog '{CATALOG}' ready")
except Exception as e:
    # Print a hint for permission problems, then re-raise in every case
    if "PERMISSION_DENIED" in str(e) or "does not have" in str(e):
        print("ERROR: No permission to create catalog.")
        print(f"Ask admin to run: GRANT CREATE CATALOG ON METASTORE TO `{raw_user}`")
        print("Or use existing catalog by changing CATALOG variable above.")
    raise

In [None]:
# =============================================================================
# STEP 4: Create Medallion Schemas
# =============================================================================
spark.sql(f"USE CATALOG {CATALOG}")

for schema_name in [BRONZE_SCHEMA, SILVER_SCHEMA, GOLD_SCHEMA]:
    spark.sql(f"CREATE SCHEMA IF NOT EXISTS {schema_name}")
    print(f"Schema '{CATALOG}.{schema_name}' ready")

In [None]:
# =============================================================================
# STEP 5: Verify Setup
# =============================================================================
print("=" * 60)
print("TRAINING ENVIRONMENT READY")
print("=" * 60)
print(f"User:              {raw_user}")
print(f"Catalog:           {CATALOG}")
print(f"Bronze Schema:     {CATALOG}.{BRONZE_SCHEMA}")
print(f"Silver Schema:     {CATALOG}.{SILVER_SCHEMA}")
print(f"Gold Schema:       {CATALOG}.{GOLD_SCHEMA}")
print(f"Source Data:       {DATASET_BASE_PATH}")
print("=" * 60)
print()
print("You can now use these variables in your notebooks:")
print("  - CATALOG, BRONZE_SCHEMA, SILVER_SCHEMA, GOLD_SCHEMA")
print("  - DATASET_BASE_PATH (for reading source files)")
print()
print("Example:")
print(f"  spark.read.csv('{DATASET_BASE_PATH}/customers/customers.csv')")
print(f"  spark.sql('SELECT * FROM {CATALOG}.{BRONZE_SCHEMA}.customers')")

## Verify Data Access

Check if you can access the shared training data:

In [None]:
# =============================================================================
# STEP 6: Verify Data Access
# =============================================================================
try:
    files = dbutils.fs.ls(DATASET_BASE_PATH)
    print(f"Available data directories in {DATASET_BASE_PATH}:")
    for f in files:
        print(f"  - {f.name}")
except Exception as e:
    print(f"WARNING: Cannot access {DATASET_BASE_PATH}")
    print(f"Error: {e}")
    print()
    print("Possible solutions:")
    print("  1. Ask trainer to create shared Volume and grant READ VOLUME on it")
    print("  2. Upload data manually to your catalog's Volume")
    print(f"  3. Create Volume: CREATE VOLUME {CATALOG}.default.datasets")

---

## Trainer Setup (Run Once Before Training)

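**If you are the trainer**, run the cell below to:
1. Create the shared catalog, schema, and data Volume
2. Grant read access on the shared data to the participants group
3. Grant `CREATE CATALOG` to the group so that each participant can create a personal catalog

The group itself is not created by this cell; create it beforehand in the Account Console or via SCIM.

This only needs to be run once before the training session.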
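
Because the last grant applies at metastore level, it can help to review the statements before executing them. The sketch below only builds the SQL strings that appear in the trainer cell; `build_grants` is an illustrative helper, not part of the notebook.

```python
def build_grants(group: str, catalog: str = "training_shared") -> list[str]:
    """Return the GRANT statements for the training group, in execution order."""
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{group}`",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.datasets TO `{group}`",
        f"GRANT READ VOLUME ON VOLUME {catalog}.datasets.files TO `{group}`",
        f"GRANT CREATE CATALOG ON METASTORE TO `{group}`",
    ]


for statement in build_grants("training_participants"):
    print(statement)  # review first, then pass each one to spark.sql(...)
```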

In [None]:
# =============================================================================
# TRAINER SETUP - Run once before training
# =============================================================================
# Uncomment and run if you are the trainer

# TRAINING_GROUP = "training_participants"
# SHARED_CATALOG = "training_shared"

# # Step 1: Create shared catalog for data
# spark.sql(f"CREATE CATALOG IF NOT EXISTS {SHARED_CATALOG}")
# spark.sql(f"CREATE SCHEMA IF NOT EXISTS {SHARED_CATALOG}.datasets")

# # Step 2: Create Volume for shared data files
# spark.sql(f"""
#     CREATE VOLUME IF NOT EXISTS {SHARED_CATALOG}.datasets.files
#     COMMENT 'Shared training data files (read-only for participants)'
# """)

# # Step 3: Grant permissions to training group
# # First create group in Account Console or via SCIM
# spark.sql(f"GRANT USE CATALOG ON CATALOG {SHARED_CATALOG} TO `{TRAINING_GROUP}`")
# spark.sql(f"GRANT USE SCHEMA ON SCHEMA {SHARED_CATALOG}.datasets TO `{TRAINING_GROUP}`")
# spark.sql(f"GRANT READ VOLUME ON VOLUME {SHARED_CATALOG}.datasets.files TO `{TRAINING_GROUP}`")

# # Step 4: Grant CREATE CATALOG to training group (for per-user isolation)
# spark.sql(f"GRANT CREATE CATALOG ON METASTORE TO `{TRAINING_GROUP}`")

# # Step 5: Print next steps
# # Note: Listing or editing group members requires Account Admin rights or the
# # Databricks REST API, so this cell only prints a reminder
# print(f"Setup complete. Add users to group '{TRAINING_GROUP}' in Account Console.")
# print(f"Upload training data to: /Volumes/{SHARED_CATALOG}/datasets/files/")

---

## Cleanup (After Training)

Run this to remove your training environment after the session:

In [None]:
# =============================================================================
# CLEANUP - Run after training to remove your catalog
# =============================================================================
# WARNING: This will DELETE all tables and data in your training catalog!

# Uncomment to run:
# spark.sql(f"DROP CATALOG IF EXISTS {CATALOG} CASCADE")
# print(f"Catalog '{CATALOG}' has been removed.")
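
A small guard can reduce the risk of dropping the wrong catalog, for example if `CATALOG` was changed to a pre-created one. This is a sketch that assumes personal catalogs always carry the `training_` prefix used in this notebook; `is_safe_to_drop` is a hypothetical helper.

```python
def is_safe_to_drop(catalog: str, shared_catalog: str = "training_shared") -> bool:
    """Allow dropping only per-user training catalogs, never the shared one."""
    return catalog.startswith("training_") and catalog != shared_catalog


# Possible usage in the cleanup cell:
# if is_safe_to_drop(CATALOG):
#     spark.sql(f"DROP CATALOG IF EXISTS {CATALOG} CASCADE")
print(is_safe_to_drop("training_jan_kowalski"))  # True
print(is_safe_to_drop("training_shared"))        # False
```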