# Vector Search with BrickKit

This notebook demonstrates **end-to-end usage of BrickKit** for deploying a governed Vector Search solution.

## What BrickKit Does

BrickKit automates governance for Databricks resources:
- **Naming conventions** - Environment-aware names (dev/acc/prd suffixes)
- **Tagging** - Automatic cost center, team, compliance tags
- **Ownership rules** - Enforce service principals for catalogs, groups for schemas
- **Validation** - Catch governance violations before deployment
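
To make the first two bullets concrete, here is a minimal sketch of environment-aware naming and tag merging in plain Python. It is not BrickKit's implementation: `environment_name`, `merge_tags`, the tag values, and the rule that resource-level tags override convention defaults are all assumptions made for illustration.

```python
# Hypothetical helpers; BrickKit's own naming and tagging logic may differ.
VALID_ENVIRONMENTS = ("dev", "acc", "prd")


def environment_name(base_name: str, environment: str) -> str:
    """Append an environment suffix, e.g. 'quant_risk' -> 'quant_risk_dev'."""
    if environment not in VALID_ENVIRONMENTS:
        raise ValueError(f"Unknown environment: {environment!r}")
    suffix = f"_{environment}"
    # Avoid double-suffixing a name that already carries the suffix.
    return base_name if base_name.endswith(suffix) else base_name + suffix


def merge_tags(default_tags: dict, resource_tags: dict) -> dict:
    """Combine convention defaults with resource tags (assumed: resource tags win)."""
    return {**default_tags, **resource_tags}


print(environment_name("quant_risk", "dev"))  # prints quant_risk_dev
print(merge_tags({"cost_center": "cc-001", "team": "risk"}, {"purpose": "semantic_search"}))
```

Keeping both helpers as pure functions means the same base name and tag set can be resolved for `dev`, `acc`, and `prd` without touching a workspace.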

## What This Demo Shows

1. Load a governance convention from YAML
2. Create sample data (or optionally fetch it from the World Bank API)
3. Define governed resources using BrickKit models
4. Deploy using BrickKit executors
5. Test vector search
6. See what governance BrickKit applied automatically

---
## 1. Configuration

In [None]:
%pip install databricks-vectorsearch databricks-sdk pydantic pyyaml --quiet
dbutils.library.restartPython()

In [None]:
# === CONFIGURATION ===
# Edit these widgets or override via job parameters

dbutils.widgets.text("catalog", "quant_risk", "Catalog Name (base)")
dbutils.widgets.text("schema", "indicators", "Schema Name")
dbutils.widgets.dropdown("environment", "dev", ["dev", "acc", "prd"], "Environment")
dbutils.widgets.dropdown("dry_run", "true", ["true", "false"], "Dry Run")

# Read widget values
CATALOG_BASE = dbutils.widgets.get("catalog")
SCHEMA_NAME = dbutils.widgets.get("schema")
ENVIRONMENT = dbutils.widgets.get("environment")
DRY_RUN = dbutils.widgets.get("dry_run").lower() == "true"

# Derived names (will be suffixed by BrickKit based on environment)
TABLE_NAME = "worldbank_indicators"
ENDPOINT_NAME = "quant_risk_search"
INDEX_NAME = f"{TABLE_NAME}_index"

print(f"Environment: {ENVIRONMENT}")
print(f"Dry Run: {DRY_RUN}")
print(f"Catalog (base): {CATALOG_BASE}")
print(f"Schema: {SCHEMA_NAME}")

In [None]:
# === IMPORTS ===
import logging
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

from databricks.sdk import WorkspaceClient
from databricks.vector_search.client import VectorSearchClient

# BrickKit imports
from brickkit import (
    Catalog,
    Schema,
    Tag,
    SecurableType,
    VectorSearchEndpoint,
    VectorSearchIndex,
    load_convention,
)
from brickkit.executors import (
    CatalogExecutor,
    SchemaExecutor,
    VectorSearchEndpointExecutor,
    VectorSearchIndexExecutor,
)
from brickkit.models.base import set_current_environment
from brickkit.models.enums import Environment

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
logger = logging.getLogger(__name__)

# Set BrickKit environment
ENV_MAP = {"dev": Environment.DEV, "acc": Environment.ACC, "prd": Environment.PRD}
set_current_environment(ENV_MAP[ENVIRONMENT])

print(f"BrickKit environment set to: {ENVIRONMENT}")

In [None]:
# === LOAD GOVERNANCE CONVENTION ===
# The convention defines naming patterns, required tags, and ownership rules

CONVENTION_PATH = "conventions/financial_services.yml"
convention = load_convention(CONVENTION_PATH)

print(f"Loaded convention: {convention.name} (v{convention.version})")
print(f"Rules: {len(convention.schema.rules)}")
print(f"Default tags: {len(convention.schema.tags)}")

# Show what the convention enforces
for rule in convention.schema.rules:
    mode = "ENFORCED" if rule.mode.value == "enforced" else "ADVISORY"
    print(f"  [{mode}] {rule.rule}")

---
## 2. Sample Data

We'll use a small inline dataset of World Bank indicators. This lets you run the full demo quickly without external API calls.

In [None]:
# === SAMPLE DATA ===
# 20 World Bank indicators with embedding text for vector search

SAMPLE_INDICATORS = [
    ("SP.POP.TOTL", "Population, total", "Total population counts all residents regardless of legal status or citizenship.", "Demographics"),
    ("NY.GDP.MKTP.CD", "GDP (current US$)", "GDP at purchaser's prices is the sum of gross value added by all resident producers.", "Economy"),
    ("NY.GDP.PCAP.CD", "GDP per capita (current US$)", "GDP per capita is gross domestic product divided by midyear population.", "Economy"),
    ("SI.POV.DDAY", "Poverty headcount ratio at $2.15 a day", "Poverty headcount ratio at $2.15 a day is the percentage of the population living on less than $2.15 a day.", "Poverty"),
    ("SI.POV.GINI", "Gini index", "Gini index measures the extent to which the distribution of income among individuals deviates from a perfectly equal distribution.", "Inequality"),
    ("SL.UEM.TOTL.ZS", "Unemployment, total (% of labor force)", "Unemployment refers to the share of the labor force that is without work but available and seeking employment.", "Labor"),
    ("FP.CPI.TOTL.ZG", "Inflation, consumer prices (annual %)", "Inflation as measured by the consumer price index reflects the annual percentage change in the cost of goods and services.", "Economy"),
    ("SP.DYN.LE00.IN", "Life expectancy at birth, total (years)", "Life expectancy at birth indicates the number of years a newborn infant would live if patterns of mortality at birth were to stay the same.", "Health"),
    ("SH.DYN.MORT", "Mortality rate, under-5 (per 1,000 live births)", "Under-five mortality rate is the probability per 1,000 that a newborn baby will die before reaching age five.", "Health"),
    ("SE.ADT.LITR.ZS", "Literacy rate, adult total (% of people ages 15 and above)", "Adult literacy rate is the percentage of people ages 15 and above who can read and write a short simple statement.", "Education"),
    ("SE.PRM.ENRR", "School enrollment, primary (% gross)", "Gross enrollment ratio is the ratio of total enrollment to the population of the age group that officially corresponds to the level of education.", "Education"),
    ("EG.USE.ELEC.KH.PC", "Electric power consumption (kWh per capita)", "Electric power consumption measures the production of power plants and combined heat and power plants less transmission losses.", "Energy"),
    ("EN.ATM.CO2E.PC", "CO2 emissions (metric tons per capita)", "Carbon dioxide emissions are those stemming from the burning of fossil fuels and the manufacture of cement.", "Environment"),
    ("AG.LND.FRST.ZS", "Forest area (% of land area)", "Forest area is land under natural or planted stands of trees of at least 5 meters in situ.", "Environment"),
    ("SH.XPD.CHEX.PC.CD", "Current health expenditure per capita (current US$)", "Current expenditures on health per capita in current US dollars.", "Health"),
    ("IT.NET.USER.ZS", "Individuals using the Internet (% of population)", "Internet users are individuals who have used the Internet in the last 3 months.", "Technology"),
    ("BX.KLT.DINV.CD.WD", "Foreign direct investment, net inflows (BoP, current US$)", "Foreign direct investment are the net inflows of investment to acquire a lasting management interest.", "Economy"),
    ("GC.DOD.TOTL.GD.ZS", "Central government debt, total (% of GDP)", "Debt is the entire stock of direct government fixed-term contractual obligations to others outstanding.", "Economy"),
    ("NE.EXP.GNFS.ZS", "Exports of goods and services (% of GDP)", "Exports of goods and services represent the value of all goods and other market services provided to the rest of the world.", "Trade"),
    ("NE.IMP.GNFS.ZS", "Imports of goods and services (% of GDP)", "Imports of goods and services represent the value of all goods and other market services received from the rest of the world.", "Trade"),
]

# Schema for the indicators table
INDICATORS_SCHEMA = StructType([
    StructField("indicator_id", StringType(), False),
    StructField("indicator_name", StringType(), True),
    StructField("description", StringType(), True),
    StructField("topic", StringType(), True),
    StructField("embedding_text", StringType(), True),
])

def create_sample_dataframe(spark: SparkSession):
    """Create DataFrame from sample indicators with embedding text."""
    rows = [
        (ind_id, name, desc, topic, f"{name}. {desc}")
        for ind_id, name, desc, topic in SAMPLE_INDICATORS
    ]
    return spark.createDataFrame(rows, INDICATORS_SCHEMA)

# Preview the sample data
sample_df = create_sample_dataframe(spark)
print(f"Sample data: {sample_df.count()} indicators")
sample_df.select("indicator_id", "indicator_name", "topic").show(5, truncate=40)

### (Optional) Fetch Real Data from World Bank API

Uncomment and run the cell below to fetch real indicator metadata. This takes several minutes.

In [None]:
# === OPTIONAL: FETCH FROM WORLD BANK API ===
# Uncomment this cell to fetch real data (takes several minutes)

# %pip install wbgapi requests tqdm --quiet

# import wbgapi as wb
# import requests
# from requests.exceptions import RequestException, Timeout
# from tqdm import tqdm

# def fetch_worldbank_indicators(spark: SparkSession, limit: int = 100):
#     """Fetch indicator metadata from World Bank API."""
#     series_list = wb.series.info()
#     series_ids = [s.get("id") for s in series_list.items][:limit]
#     
#     rows = []
#     for series_id in tqdm(series_ids, desc="Fetching"):
#         try:
#             url = f"https://api.worldbank.org/v2/indicator/{series_id}?format=json"
#             resp = requests.get(url, timeout=30)
#             resp.raise_for_status()
#             data = resp.json()
#             if len(data) >= 2 and data[1]:
#                 meta = data[1][0]
#                 name = meta.get("name", "") or ""
#                 desc = meta.get("sourceNote", "") or ""
#                 topics = meta.get("topics", []) or []
#                 topic = topics[0].get("value", "") if topics else ""
#                 embedding_text = f"{name}. {desc}".strip()
#                 rows.append((series_id, name, desc, topic, embedding_text))
#         except (RequestException, Timeout, ValueError) as e:
#             print(f"Skipping {series_id}: {e}")
#     
#     return spark.createDataFrame(rows, INDICATORS_SCHEMA)

# # Fetch real data (uncomment to use)
# sample_df = fetch_worldbank_indicators(spark, limit=500)
# print(f"Fetched {sample_df.count()} indicators from World Bank API")

---
## 3. Define Governed Resources

Now we define our resources using BrickKit models. The convention automatically applies:
- Environment-specific naming (e.g., `quant_risk_dev`)
- Required governance tags
- Ownership rules validation
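
As a rough illustration of what an ownership rule check might look like, the sketch below validates owner types against a small rule table. The `Owner` class, `ALLOWED_OWNER_TYPES`, and `ownership_errors` are hypothetical stand-ins, not BrickKit models; the rule table simply encodes "service principals for catalogs, groups for schemas" from the overview.

```python
from dataclasses import dataclass

# Hypothetical rule table: which principal types may own which securable type.
ALLOWED_OWNER_TYPES = {
    "CATALOG": {"SERVICE_PRINCIPAL"},
    "SCHEMA": {"GROUP"},
}


@dataclass(frozen=True)
class Owner:
    name: str
    principal_type: str  # "USER", "GROUP", or "SERVICE_PRINCIPAL"


def ownership_errors(securable_type: str, owner: Owner) -> list[str]:
    """Return a list of violations; an empty list means the owner is acceptable."""
    allowed = ALLOWED_OWNER_TYPES.get(securable_type)
    # Securable types without a rule are not checked in this sketch.
    if allowed is None or owner.principal_type in allowed:
        return []
    return [
        f"{securable_type} owner '{owner.name}' is a {owner.principal_type}; "
        f"expected one of {sorted(allowed)}"
    ]


print(ownership_errors("CATALOG", Owner("spn_risk", "SERVICE_PRINCIPAL")))  # prints []
print(ownership_errors("CATALOG", Owner("alice", "USER")))
```

Returning a list of messages rather than raising immediately mirrors the pattern used in the cells below, where validation errors are collected first and the caller decides whether to raise.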

In [None]:
# === DEFINE GOVERNED RESOURCES ===

environment = ENV_MAP[ENVIRONMENT]

# Get owners from convention (enforces SP for catalogs, Group for schemas)
catalog_owner = convention.get_catalog_owner(environment)
schema_owner = convention.get_owner(SecurableType.SCHEMA, environment)

print(f"Catalog owner: {catalog_owner.resolved_name} ({catalog_owner.principal_type.value})")
print(f"Schema owner: {schema_owner.resolved_name} ({schema_owner.principal_type.value})")

In [None]:
# === CATALOG ===
catalog_name = convention.generate_name(SecurableType.CATALOG, environment)

catalog = Catalog(
    name=catalog_name,
    owner=catalog_owner,
    comment="Risk Analytics catalog for quantitative trading",
)

# Apply convention (adds tags, validates rules)
convention.apply_to(catalog, environment)
errors = convention.get_validation_errors(catalog)
if errors:
    raise ValueError(f"Catalog validation failed: {errors}")

print(f"Catalog: {catalog.name}")
print(f"  Tags: {len(catalog.tags)}")

In [None]:
# === SCHEMA ===
schema = Schema(
    name=SCHEMA_NAME,
    catalog_name=catalog.name,
    owner=schema_owner,
    comment="World Bank indicator metadata for vector search",
)

convention.apply_to(schema, environment)
errors = convention.get_validation_errors(schema)
if errors:
    raise ValueError(f"Schema validation failed: {errors}")

print(f"Schema: {schema.fqdn}")
print(f"  Tags: {len(schema.tags)}")

In [None]:
# === VECTOR SEARCH ENDPOINT ===
vs_endpoint = VectorSearchEndpoint(
    name=ENDPOINT_NAME,
    comment="Semantic search endpoint for risk analytics indicators",
    tags=[
        Tag(key="purpose", value="semantic_search"),
        Tag(key="model", value="databricks-bge-large-en"),
    ],
)

convention.apply_to(vs_endpoint, environment)
errors = convention.get_validation_errors(vs_endpoint)
if errors:
    raise ValueError(f"Endpoint validation failed: {errors}")

print(f"Endpoint: {vs_endpoint.resolved_name}")
print(f"  Tags: {len(vs_endpoint.tags)}")

In [None]:
# === VECTOR SEARCH INDEX ===
FULL_TABLE_NAME = f"{catalog.name}.{schema.name}.{TABLE_NAME}"

vs_index = VectorSearchIndex(
    name=INDEX_NAME,
    endpoint_name=ENDPOINT_NAME,
    source_table=FULL_TABLE_NAME,
    primary_key="indicator_id",
    embedding_column="embedding_text",
    embedding_model="databricks-bge-large-en",
    pipeline_type="TRIGGERED",
    tags=[
        Tag(key="index_type", value="managed_embedding"),
    ],
)

convention.apply_to(vs_index, environment)
errors = convention.get_validation_errors(vs_index)
if errors:
    raise ValueError(f"Index validation failed: {errors}")

print(f"Index: {vs_index.resolved_name}")
print(f"  Source: {vs_index.source_table}")
print(f"  Endpoint: {vs_index.resolved_endpoint_name}")
print(f"  Tags: {len(vs_index.tags)}")

---
## 4. Deploy with BrickKit Executors

BrickKit executors handle:
- Idempotent create (skip if exists)
- Wait for provisioning
- Tag application
- Error handling
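
The "skip if exists" behaviour combined with a dry-run switch can be sketched as below. This is a generic pattern under stated assumptions, not the executors' actual code: `create_if_missing` and the returned strings (`"skipped"`, `"planned"`, `"created"`) are hypothetical, and a Python set stands in for the workspace.

```python
from typing import Callable


def create_if_missing(
    name: str,
    exists: Callable[[str], bool],
    create: Callable[[str], None],
    dry_run: bool = True,
) -> str:
    """Create a resource only if it is absent; report what was (or would be) done."""
    if exists(name):
        return "skipped"  # already present, nothing to do
    if dry_run:
        return "planned"  # would be created, but no change is made
    create(name)
    return "created"


# In-memory stand-in for a workspace that tracks deployed resource names.
deployed = set()
print(create_if_missing("quant_risk_dev", deployed.__contains__, deployed.add, dry_run=True))   # prints planned
print(create_if_missing("quant_risk_dev", deployed.__contains__, deployed.add, dry_run=False))  # prints created
print(create_if_missing("quant_risk_dev", deployed.__contains__, deployed.add, dry_run=False))  # prints skipped
```

Because the existence check runs before the dry-run check, a dry run still reports accurately which resources already exist.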

In [None]:
# === INITIALIZE CLIENTS AND EXECUTORS ===

ws_client = WorkspaceClient()
vs_client = VectorSearchClient()

catalog_executor = CatalogExecutor(ws_client, dry_run=DRY_RUN)
schema_executor = SchemaExecutor(ws_client, dry_run=DRY_RUN)
endpoint_executor = VectorSearchEndpointExecutor(ws_client, dry_run=DRY_RUN)
index_executor = VectorSearchIndexExecutor(ws_client, dry_run=DRY_RUN)

print(f"Executors initialized (dry_run={DRY_RUN})")

In [None]:
# === DEPLOY CATALOG ===
result = catalog_executor.create(catalog)
print(f"Catalog: {result.operation.value} - {result.message}")

In [None]:
# === DEPLOY SCHEMA ===
result = schema_executor.create(schema)
print(f"Schema: {result.operation.value} - {result.message}")

In [None]:
# === WRITE DATA TO TABLE ===
# Using PySpark DataFrame API (not spark.sql)

if not DRY_RUN:
    # Write with Delta format and Change Data Feed enabled
    (
        sample_df
        .write
        .format("delta")
        .option("delta.enableChangeDataFeed", "true")
        .mode("overwrite")
        .saveAsTable(FULL_TABLE_NAME)
    )
    
    # Verify
    count = spark.table(FULL_TABLE_NAME).count()
    print(f"Table: {FULL_TABLE_NAME} - {count} rows written")
else:
    print(f"[DRY RUN] Would write {sample_df.count()} rows to {FULL_TABLE_NAME}")

In [None]:
# === DEPLOY VECTOR SEARCH ENDPOINT ===
result = endpoint_executor.create(vs_endpoint)
print(f"Endpoint: {result.operation.value} - {result.message}")

# Wait for endpoint to be online (uses executor's built-in wait logic)
if not DRY_RUN and result.operation.value == "CREATE":
    print("Waiting for endpoint to be online...")
    if endpoint_executor.wait_for_endpoint(vs_endpoint):
        print(f"Endpoint {vs_endpoint.resolved_name} is ONLINE")
    else:
        raise RuntimeError(f"Endpoint {vs_endpoint.resolved_name} failed to provision")

In [None]:
# === DEPLOY VECTOR SEARCH INDEX ===
result = index_executor.create(vs_index)
print(f"Index: {result.operation.value} - {result.message}")

---
## 5. Test Vector Search

The index syncs asynchronously. Once ready, we can run similarity searches.
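
Because the sync is asynchronous, it can help to poll until the index reports ready before searching. The helper below is a sketch: `wait_until_ready` and its parameters are hypothetical, and it assumes only that the describe call returns a mapping with a `status.ready` flag, as the status cell below reads it. The example run uses canned responses in place of a real index.

```python
import time
from typing import Callable, Mapping


def wait_until_ready(
    describe: Callable[[], Mapping],
    timeout_s: float = 600.0,
    poll_interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll describe() until status.ready is True or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = describe().get("status", {})
        if status.get("ready") is True:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_interval_s)


# Example with canned responses standing in for a real index.
responses = iter([{"status": {"ready": False}}, {"status": {"ready": True}}])
print(wait_until_ready(lambda: next(responses), timeout_s=5, poll_interval_s=0))  # prints True
```

In the notebook, this could be called as `wait_until_ready(index.describe)` once `index` has been obtained from `vs_client.get_index(...)`.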

In [None]:
# === CHECK INDEX STATUS ===

if not DRY_RUN:
    FULL_INDEX_NAME = f"{catalog.name}.{schema.name}.{vs_index.resolved_name}"
    
    index = vs_client.get_index(
        endpoint_name=vs_endpoint.resolved_name,
        index_name=FULL_INDEX_NAME,
    )
    status = index.describe().get("status", {})
    print(f"Index status: ready={status.get('ready', 'UNKNOWN')}")
    print(f"Message: {status.get('message', 'N/A')}")
else:
    print("[DRY RUN] Would check index status")

In [None]:
# === RUN SIMILARITY SEARCH ===

if not DRY_RUN:
    TEST_QUERY = "poverty and inequality measures"
    
    try:
        results = index.similarity_search(
            query_text=TEST_QUERY,
            columns=["indicator_id", "indicator_name", "description", "topic"],
            num_results=5,
        )
        
        print(f"Search: '{TEST_QUERY}'")
        print("=" * 60)
        
        data = results.get("result", {}).get("data_array", [])
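        for i, row in enumerate(data, 1):
            # description is nullable in the table schema, so guard against None
            description = row[2] or ""
            print(f"{i}. [{row[3]}] {row[1]}")
            print(f"   {description[:80]}{'...' if len(description) > 80 else ''}")
            print()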
            
    except Exception as e:
        if "not ready" in str(e).lower() or "syncing" in str(e).lower():
            print("Index is still syncing. Please wait and try again.")
        else:
            raise
else:
    print("[DRY RUN] Would run similarity search")

---
## 6. What BrickKit Added (Governance Value)

Let's see what governance BrickKit applied automatically.

In [None]:
# === GOVERNANCE SUMMARY ===

def display_resource_governance(name: str, resource):
    """Display governance metadata for a resource."""
    print(f"\n{'='*60}")
    print(f"{name}")
    print(f"{'='*60}")
    
    # Name (with environment suffix)
    if hasattr(resource, 'resolved_name'):
        print(f"Name: {resource.resolved_name}")
    elif hasattr(resource, 'fqdn'):
        try:
            print(f"Name: {resource.fqdn}")
        except ValueError:
            print(f"Name: {resource.name}")
    else:
        print(f"Name: {resource.name}")
    
    # Owner
    if hasattr(resource, 'owner') and resource.owner:
        owner = resource.owner
        print(f"Owner: {owner.resolved_name} ({owner.principal_type.value})")
    
    # Tags
    if hasattr(resource, 'tags') and resource.tags:
        print(f"Tags ({len(resource.tags)}):")
        for tag in sorted(resource.tags, key=lambda t: t.key):
            print(f"  - {tag.key}: {tag.value}")

# Display governance for all resources
display_resource_governance("CATALOG", catalog)
display_resource_governance("SCHEMA", schema)
display_resource_governance("VECTOR SEARCH ENDPOINT", vs_endpoint)
display_resource_governance("VECTOR SEARCH INDEX", vs_index)

In [None]:
# === CONVENTION RULES APPLIED ===

print("\n" + "=" * 60)
print("CONVENTION RULES")
print("=" * 60)
print(f"Convention: {convention.name} (v{convention.version})")
print()

for rule in convention.schema.rules:
    mode = "ENFORCED" if rule.mode.value == "enforced" else "ADVISORY"
    print(f"[{mode}] {rule.rule}")
    
print()
print("What this means:")
print("- Catalogs MUST be owned by service principals (not users)")
print("- All resources MUST be owned by SP or Group (no individual users)")
print("- Resources SHOULD have cost_center and team tags")
print("- BrickKit validated all these rules before deployment")

In [None]:
# === WHAT YOU DIDN'T HAVE TO DO ===

print("\n" + "=" * 60)
print("WHAT BRICKKIT DID FOR YOU")
print("=" * 60)

benefits = [
    ("Environment suffixes", f"All names automatically suffixed with '_{ENVIRONMENT}'"),
    ("Governance tags", f"{len(catalog.tags)} tags auto-applied from convention"),
    ("Ownership validation", "Verified catalog has SP owner, schema has Group owner"),
    ("Idempotent deployment", "Executors skip if resource exists, sync tags if needed"),
    ("Wait logic", "Built-in endpoint provisioning wait with timeout/retry"),
    ("Consistent patterns", "Same governance across Catalog, Schema, Endpoint, Index"),
]

for benefit, detail in benefits:
    print(f"\n{benefit}:")
    print(f"  {detail}")

print("\n" + "=" * 60)
print("Without BrickKit, you would manually:")
print("  - Add environment suffixes to every resource name")
print("  - Remember which tags to apply (and apply them consistently)")
print("  - Validate ownership rules before deployment")
print("  - Write wait/retry logic for endpoint provisioning")
print("  - Handle idempotency (check exists, update tags, etc.)")
print("=" * 60)

---
## Summary

This demo showed:

1. **Convention Loading** - Governance rules from YAML
2. **Governed Models** - `Catalog`, `Schema`, `VectorSearchEndpoint`, `VectorSearchIndex`
3. **Executors** - Idempotent deployment with built-in wait logic
4. **Automatic Governance** - Tags, naming, ownership validation

### Next Steps

- Modify `conventions/financial_services.yml` to change governance rules
- Set the `dry_run` widget to `false` to deploy for real
- Try different environments (`dev`, `acc`, `prd`) to see naming changes
- Add your own data source instead of sample indicators