# Pre-Training Configuration

## Purpose
This notebook is run **once by the trainer** before the training starts.  
It creates isolated environments for each participant:

1. **Catalog** per user: `retailhub_{username}`
2. **Schemas**: `bronze`, `silver`, `gold`, `default`
3. **Volume**: `datasets` (managed) in the `default` schema
4. **Dataset files**: copied from the Git repo to each user's Volume
5. **Permissions**: full access granted to each participant on their catalog

**After running this notebook**, participants can run `00_setup.ipynb` to validate their environment.

---

### Checklist before running:
- [ ] Databricks workspace is ready
- [ ] Training group exists with all participants added
- [ ] This repo is cloned via Repos in the workspace
- [ ] You have `CREATE CATALOG` and `MANAGE` privileges (account admin or metastore admin)
- [ ] Storage location for managed catalogs is configured

## Step 0: Configuration

Adjust the values below for your training session.

In [0]:
# =============================================================================
# CONFIGURATION -- Adjust these values for your training
# =============================================================================

# Databricks group containing all training participants
TRAINING_GROUP = "admins"

# Catalog naming: retailhub_{username}
CATALOG_PREFIX = "retailhub"

# Managed location for catalogs (ADLS, S3, or GCS path)
# Example Azure: "abfss://container@account.dfs.core.windows.net/path"
# Example AWS:   "s3://bucket/path"
# Leave empty to use the metastore default location
STORAGE_LOCATION = "abfss://unity-catalog-storage@dbstoragexlcs5kgkoop2g.dfs.core.windows.net/7405606614957089"

# Schema names (Medallion architecture)
BRONZE_SCHEMA = "bronze"
SILVER_SCHEMA = "silver"
GOLD_SCHEMA = "gold"
DEFAULT_SCHEMA = "default"

# Volume name for datasets
VOLUME_NAME = "datasets"

print(f"Training group:   {TRAINING_GROUP}")
print(f"Catalog prefix:   {CATALOG_PREFIX}")
print(f"Storage location: {STORAGE_LOCATION or '(metastore default)'}")

## Step 1: Get Participants from Training Group

Uses the Databricks SCIM API to retrieve group members and generate catalog names.

In [0]:
import requests
import re

def get_group_members(group_name):
    """
    Get all members of a Databricks group using the SCIM REST API.
    Returns list of usernames (email addresses).
    """
    context = dbutils.notebook.entry_point.getDbutils().notebook().getContext()
    host = context.apiUrl().get()
    token = context.apiToken().get()

    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json"
    }

    # Find the group by name
    groups_url = f"{host}/api/2.0/preview/scim/v2/Groups"
    params = {"filter": f'displayName eq "{group_name}"'}
    response = requests.get(groups_url, headers=headers, params=params)
    response.raise_for_status()

    groups = response.json().get("Resources", [])
    if not groups:
        raise ValueError(f"Group '{group_name}' not found")

    group_id = groups[0]["id"]

    # Get group details with members
    group_url = f"{host}/api/2.0/preview/scim/v2/Groups/{group_id}"
    response = requests.get(group_url, headers=headers)
    response.raise_for_status()

    members = response.json().get("members", [])

    # Resolve each member to an email
    user_emails = []
    for member in members:
        if member.get("$ref", "").startswith("Users/"):
            user_id = member["value"]
            user_url = f"{host}/api/2.0/preview/scim/v2/Users/{user_id}"
            user_response = requests.get(user_url, headers=headers)
            user_response.raise_for_status()
            emails = user_response.json().get("emails", [])
            # Skip users whose first email entry has no usable value
            if emails and emails[0].get("value"):
                user_emails.append(emails[0]["value"])

    return user_emails


def sanitize_username(email):
    """
    Convert email to a safe catalog name suffix.
    Example: jan.kowalski@company.com -> jan_kowalski
    """
    username = email.split('@')[0]
    safe_name = re.sub(r'[^a-zA-Z0-9]', '_', username).lower()
    safe_name = re.sub(r'_+', '_', safe_name)
    safe_name = re.sub(r'^[0-9_]+', '', safe_name)
    safe_name = re.sub(r'[0-9_]+$', '', safe_name)
    return safe_name or "user"


print("Functions defined: get_group_members(), sanitize_username()")
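
The resolver above takes the first entry of the `emails` list. A minimal sketch of a more defensive choice follows. It assumes the SCIM user payload may flag one entry with `primary` and carries the login in `userName`; both attributes come from the SCIM core schema, but whether a given workspace populates them is an assumption to verify. `pick_user_email` is a hypothetical helper, not part of this notebook.

```python
def pick_user_email(user_payload):
    """Return the best email candidate from a SCIM user payload, or None.

    Hypothetical helper: prefers the entry flagged primary, then the first
    non-empty entry, then falls back to userName if it looks like an email.
    """
    emails = [e for e in user_payload.get("emails", []) if e.get("value")]
    for entry in emails:
        if entry.get("primary"):
            return entry["value"]
    if emails:
        return emails[0]["value"]
    user_name = user_payload.get("userName", "")
    return user_name if "@" in user_name else None


# Made-up payloads for illustration only
sample = {
    "userName": "jan.kowalski@company.com",
    "emails": [
        {"value": "alias@company.com", "primary": False},
        {"value": "jan.kowalski@company.com", "primary": True},
    ],
}
print(pick_user_email(sample))
print(pick_user_email({"userName": "service-principal-id"}))
```

Returning `None` instead of an empty string lets the caller skip members that cannot be mapped to a catalog.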

In [0]:
# =============================================================================
# Get users and build catalog mapping
# =============================================================================
try:
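    training_users = get_group_members(TRAINING_GROUP)
    if not training_users:
        # An empty group would otherwise pass silently and create nothing
        raise ValueError(f"Group '{TRAINING_GROUP}' has no user members")
    print(f"Found {len(training_users)} users in group '{TRAINING_GROUP}':")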
    print("=" * 60)

    user_catalog_map = {}
    for email in sorted(training_users):
        safe_name = sanitize_username(email)

        # Map trainer accounts to a single catalog
        if any(sub in email.lower() for sub in ["trainer", "krzysztof.burejza"]):
            catalog_name = f"{CATALOG_PREFIX}_trainer"
        else:
            catalog_name = f"{CATALOG_PREFIX}_{safe_name}"

        user_catalog_map[email] = catalog_name
        print(f"  {email:<40} -> {catalog_name}")

    print("=" * 60)
    print(f"Total: {len(set(user_catalog_map.values()))} unique catalogs to create")

except Exception as e:
    print(f"ERROR: {e}")
    print("\nPossible issues:")
    print(f"  1. Group '{TRAINING_GROUP}' does not exist")
    print("  2. You don't have permission to read group members")
    print("  3. Group has no members")
    raise
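
Because `sanitize_username()` drops the email domain and strips trailing digits, two different participants can end up with the same catalog name (for example `anna1@...` and `anna2@...` both become `retailhub_anna`). The sketch below shows one way to detect that before any catalog is created. `find_catalog_collisions` and its `shared_ok` parameter are hypothetical names introduced for illustration, and the mapping is made-up data.

```python
from collections import defaultdict


def find_catalog_collisions(user_catalog_map, shared_ok=()):
    """Return {catalog_name: [emails]} for catalogs claimed by more than one email.

    shared_ok lists catalogs that are meant to be shared (hypothetical
    parameter), such as the single trainer catalog.
    """
    by_catalog = defaultdict(list)
    for email, catalog_name in user_catalog_map.items():
        by_catalog[catalog_name].append(email)
    return {
        catalog: sorted(emails)
        for catalog, emails in by_catalog.items()
        if len(emails) > 1 and catalog not in shared_ok
    }


# Made-up mapping for illustration only
example_map = {
    "anna1@company.com": "retailhub_anna",
    "anna2@company.com": "retailhub_anna",
    "jan.kowalski@company.com": "retailhub_jan_kowalski",
    "trainer.a@company.com": "retailhub_trainer",
    "trainer.b@company.com": "retailhub_trainer",
}
collisions = find_catalog_collisions(example_map, shared_ok={"retailhub_trainer"})
print(collisions)
```

In the notebook, the same call could be made on `user_catalog_map` with `f"{CATALOG_PREFIX}_trainer"` in `shared_ok`. A non-empty result means some participants would share one catalog.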

## Step 2: Create Catalogs, Schemas, and Volumes

For each participant:
- Catalog: `retailhub_{username}`
- Schemas: `bronze`, `silver`, `gold`, `default`
- Volume: `datasets` in `default` schema
- Permissions: `ALL PRIVILEGES` granted to the user and the trainer

In [0]:
def create_user_environment(email, catalog_name, storage_location):
    """
    Create catalog, schemas, volume and set permissions for a training participant.
    """
    results = {"catalog": False, "schemas": [], "volume": False, "permissions": False}
    trainer_email = spark.sql("SELECT current_user()").first()[0]

    try:
        # Create catalog (with or without managed location)
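        if storage_location:
            # Strip a trailing slash so the path has no empty segment
            managed_location = f"{storage_location.rstrip('/')}/{catalog_name}"
            spark.sql(f"""
                CREATE CATALOG IF NOT EXISTS {catalog_name}
                MANAGED LOCATION '{managed_location}'
            """)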
        else:
            spark.sql(f"CREATE CATALOG IF NOT EXISTS {catalog_name}")
        results["catalog"] = True

        # Create Medallion schemas + default
        for schema in [DEFAULT_SCHEMA, BRONZE_SCHEMA, SILVER_SCHEMA, GOLD_SCHEMA]:
            spark.sql(f"CREATE SCHEMA IF NOT EXISTS {catalog_name}.{schema}")
            results["schemas"].append(schema)

        # Create managed Volume for datasets
        spark.sql(f"""
            CREATE VOLUME IF NOT EXISTS {catalog_name}.{DEFAULT_SCHEMA}.{VOLUME_NAME}
            COMMENT 'Training datasets for RetailHub project'
        """)
        results["volume"] = True

        # Grant permissions
        spark.sql(f"GRANT ALL PRIVILEGES ON CATALOG {catalog_name} TO `{email}`")
        if trainer_email != email:
            spark.sql(f"GRANT ALL PRIVILEGES ON CATALOG {catalog_name} TO `{trainer_email}`")
        results["permissions"] = True

    except Exception as e:
        results["error"] = str(e)

    return results
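
Before touching the metastore, it can help to see the statements that would be issued for one participant. The following is a dry-run sketch, not the notebook's method: `build_setup_statements` is a hypothetical helper that only builds strings and mirrors the order used above (catalog, schemas, volume, grant). It assumes the first entry of `schemas` holds the volume and leaves out the extra trainer grant.

```python
def build_setup_statements(email, catalog_name, storage_location="",
                           schemas=("default", "bronze", "silver", "gold"),
                           volume_name="datasets"):
    """Return the SQL statements for one participant, in execution order."""
    statements = []
    create_catalog = f"CREATE CATALOG IF NOT EXISTS {catalog_name}"
    if storage_location:
        # rstrip avoids a double slash when the configured path ends with '/'
        location = f"{storage_location.rstrip('/')}/{catalog_name}"
        create_catalog += f" MANAGED LOCATION '{location}'"
    statements.append(create_catalog)
    for schema in schemas:
        statements.append(f"CREATE SCHEMA IF NOT EXISTS {catalog_name}.{schema}")
    # Assumption: the volume lives in the first schema of the tuple
    statements.append(
        f"CREATE VOLUME IF NOT EXISTS {catalog_name}.{schemas[0]}.{volume_name}"
    )
    statements.append(f"GRANT ALL PRIVILEGES ON CATALOG {catalog_name} TO `{email}`")
    return statements


# Made-up participant and the example AWS path from the configuration comments
preview = build_setup_statements(
    "jan.kowalski@company.com", "retailhub_jan_kowalski", "s3://bucket/path/"
)
for stmt in preview:
    print(stmt)
```

Printing the list for a single user is a cheap way to catch a malformed storage path or catalog name before the loop runs for the whole group.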

In [0]:
# =============================================================================
# Create environments for all participants
# =============================================================================
creation_results = {}

for email, catalog_name in user_catalog_map.items():
    print(f"Processing: {email}")
    result = create_user_environment(email, catalog_name, STORAGE_LOCATION)
    creation_results[email] = result

    if "error" in result:
        print(f"  ERROR: {result['error']}")
    else:
        print(f"  Catalog:     {catalog_name}")
        print(f"  Schemas:     {', '.join(result['schemas'])}")
        print(f"  Volume:      {result['volume']}")
        print(f"  Permissions: {result['permissions']}")
    print()

successful = sum(1 for r in creation_results.values() if "error" not in r)
print(f"Result: {successful}/{len(creation_results)} environments created successfully")

## Step 3: Copy Dataset Files to Volumes

Copies the entire `dataset/` folder from the Git repo to each user's Volume.  
Uses `shutil.copytree` for a recursive copy that skips hidden files such as `.DS_Store`.

**Dataset structure in each Volume:**
```
/Volumes/retailhub_{user}/default/datasets/
  customers/
    customers.csv
    customers_extended.csv
    customers_new.csv
  orders/
    orders_batch.json
    stream/
      orders_stream_001.json ... 003.json
  products/
    products.csv
  demo/ingestion/
    orders/stream/
      orders_stream_004.json ... 006.json
  workshop/
    Customers.csv, Product.csv, ...
    Lakeflow/
      Customers/Customers.csv, ...
```

In [0]:
import shutil
import os

def copy_dataset_to_volume(catalog_name):
    """
    Copy dataset files from the Git repo to a user's Volume.
    Skips .DS_Store and other hidden files.
    """
    # Source: dataset/ folder in the repo root
    # When running from training_2026/setup/, repo root is ../../
    repo_root = os.path.abspath(os.path.join(os.getcwd(), "..", ".."))
    source_path = os.path.join(repo_root, "dataset")

    # Target: Volume path
    volume_path = f"/Volumes/{catalog_name}/{DEFAULT_SCHEMA}/{VOLUME_NAME}"

    if not os.path.exists(source_path):
        return {"error": f"Source not found: {source_path}"}

    try:
        # Copy with ignore for hidden files
        def ignore_hidden(directory, files):
            return [f for f in files if f.startswith('.')]

        shutil.copytree(source_path, volume_path, dirs_exist_ok=True, ignore=ignore_hidden)
        return {"success": True}
    except Exception as e:
        return {"error": str(e)}
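
The copy behaviour can be tried locally without a workspace. This sketch uses temporary directories in place of the repo and the Volume, and swaps the custom `ignore_hidden` callback for the standard-library `shutil.ignore_patterns(".*")`, which filters the same dot-prefixed names. The file names are made up for the demonstration.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "dataset")
    target = os.path.join(tmp, "volume")

    # Build a tiny source tree that contains two hidden files
    os.makedirs(os.path.join(source, "customers"))
    for rel in ("customers/customers.csv", "customers/.DS_Store", ".gitkeep"):
        with open(os.path.join(source, rel), "w") as fh:
            fh.write("x\n")

    # Run the copy twice: dirs_exist_ok=True lets the second run reuse the target
    for _ in range(2):
        shutil.copytree(source, target, dirs_exist_ok=True,
                        ignore=shutil.ignore_patterns(".*"))

    # Collect what actually arrived in the target, as relative paths
    copied = sorted(
        os.path.relpath(os.path.join(root, name), target)
        for root, _dirs, files in os.walk(target)
        for name in files
    )
    print(copied)
```

Because `dirs_exist_ok=True` tolerates an existing target, re-running the copy step after a partial failure does not raise. Note that it requires Python 3.8 or later.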

In [0]:
# =============================================================================
# Copy dataset to each participant's Volume
# =============================================================================
print("Copying dataset to user Volumes...")
print("=" * 60)

copy_results = {}

for email, catalog_name in user_catalog_map.items():
    print(f"  {catalog_name}... ", end="")
    result = copy_dataset_to_volume(catalog_name)
    copy_results[email] = result

    if "error" in result:
        print(f"FAILED: {result['error']}")
    else:
        print("OK")

successful = sum(1 for r in copy_results.values() if r.get("success"))
print("=" * 60)
print(f"Result: {successful}/{len(copy_results)} volumes populated")

## Step 4: Verification

Final check that all environments are ready for training.

In [0]:
# =============================================================================
# VERIFICATION -- Check all environments
# =============================================================================
print("=" * 60)
print("TRAINING ENVIRONMENT SUMMARY")
print("=" * 60)
print()

all_ok = True

for email, catalog_name in user_catalog_map.items():
    status = []

    # Check catalog
    try:
        spark.sql(f"USE CATALOG {catalog_name}")
        status.append("catalog: OK")
    except Exception:
        status.append("catalog: MISSING")
        all_ok = False

    # Check schemas
    try:
        schemas = [row[0] for row in spark.sql(f"SHOW SCHEMAS IN {catalog_name}").collect()]
        missing = [s for s in [DEFAULT_SCHEMA, BRONZE_SCHEMA, SILVER_SCHEMA, GOLD_SCHEMA] if s not in schemas]
        if missing:
            status.append(f"schemas: MISSING {missing}")
            all_ok = False
        else:
            status.append("schemas: OK")
    except Exception:
        status.append("schemas: ERROR")
        all_ok = False

    # Check Volume
    volume_path = f"/Volumes/{catalog_name}/{DEFAULT_SCHEMA}/{VOLUME_NAME}"
    try:
        files = dbutils.fs.ls(volume_path)
        file_count = len(files)
        status.append(f"volume: OK ({file_count} items)")
    except Exception:
        status.append("volume: MISSING")
        all_ok = False

    print(f"  {email}")
    print(f"    Catalog: {catalog_name}")
    print(f"    Status:  {' | '.join(status)}")
    print()

print("=" * 60)
if all_ok:
    print("ALL ENVIRONMENTS READY -- Training can begin!")
    print("Participants should run: training_2026/setup/00_setup.ipynb")
else:
    print("WARNING: Some environments have issues. Check the details above.")
print("=" * 60)
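
The Volume check above counts only top-level items, so a copy that stopped halfway could still report OK. A stricter comparison is sketched below under the assumption that the Volume path can be walked with local file APIs, as the `shutil` copy in Step 3 already implies. `list_visible_files` and `missing_in_target` are hypothetical helpers, and the demonstration uses temporary directories rather than a real Volume.

```python
import os
import tempfile


def list_visible_files(root):
    """Return the set of relative paths of non-hidden files under root."""
    found = set()
    for current, dirs, files in os.walk(root):
        # Prune hidden directories in place so os.walk does not descend into them
        dirs[:] = [d for d in dirs if not d.startswith(".")]
        for name in files:
            if not name.startswith("."):
                found.add(os.path.relpath(os.path.join(current, name), root))
    return found


def missing_in_target(source_root, target_root):
    """Return sorted relative paths present in source but absent in target."""
    return sorted(list_visible_files(source_root) - list_visible_files(target_root))


# Local demonstration with temporary directories instead of a real Volume
with tempfile.TemporaryDirectory() as tmp:
    src, dst = os.path.join(tmp, "src"), os.path.join(tmp, "dst")
    os.makedirs(os.path.join(src, "orders", "stream"))
    os.makedirs(os.path.join(dst, "orders"))
    for base, rel in ((src, "orders/orders_batch.json"),
                      (src, "orders/stream/orders_stream_001.json"),
                      (dst, "orders/orders_batch.json")):
        with open(os.path.join(base, rel), "w") as fh:
            fh.write("{}\n")
    missing = missing_in_target(src, dst)
    print(missing)
```

In the notebook, `source_root` would be the repo's `dataset/` folder and `target_root` the participant's Volume path. An empty list means every visible source file has a counterpart in the Volume.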

---

## Cleanup (After Training)

Run this section **after** the training to remove all participant catalogs and data.  
**WARNING: This is a destructive operation -- all training data will be permanently deleted!**

In [0]:
# =============================================================================
# CLEANUP -- Remove all training catalogs (after training)
# =============================================================================
# Uncomment the code below to execute cleanup.

# print("Dropping training catalogs...")
# for email, catalog_name in user_catalog_map.items():
#     try:
#         spark.sql(f"DROP CATALOG IF EXISTS {catalog_name} CASCADE")
#         print(f"  Dropped: {catalog_name}")
#     except Exception as e:
#         print(f"  Failed: {catalog_name} -- {e}")
# print("Cleanup complete.")

In [0]:
# =============================================================================
# ALTERNATIVE CLEANUP -- Find and remove all retailhub_* catalogs
# =============================================================================
# Use this if user_catalog_map is not available (e.g., new session).

# catalogs_df = spark.sql("SHOW CATALOGS")
# retailhub_catalogs = [row.catalog for row in catalogs_df.collect()
#                       if row.catalog.startswith("retailhub_")]
#
# print(f"Found {len(retailhub_catalogs)} catalogs to remove:")
# for cat in retailhub_catalogs:
#     print(f"  - {cat}")
#
# # Uncomment to drop:
# # for cat in retailhub_catalogs:
# #     spark.sql(f"DROP CATALOG IF EXISTS {cat} CASCADE")
# #     print(f"Dropped: {cat}")