# Module 8 (BONUS): GenAI, Vector Search & AI Functions

**Description:**
This is a **BONUS** module that goes beyond traditional Data Preparation.
Databricks is not just for traditional ML. It now integrates Generative AI deeply into the platform. This module explores the newer AI capabilities: calling LLMs directly from SQL and building Vector Search indexes for RAG (Retrieval-Augmented Generation) applications.

**Agenda:**
1.  **AI Functions:** Using built-in SQL functions (`ai_analyze_sentiment`, `ai_query`) to process text with LLMs.
2.  **Embeddings & Vector Search:** Understanding how text is converted to vectors and searching by semantic similarity.
3.  **RAG Concept:** A brief look at how Vector Search powers chatbots.


## Context and Requirements

| Attribute | Value |
|-----------|-------|
| **Training Day** | Day 2 |
| **Module Type** | BONUS - Demo Notebook |
| **Technical Requirements** | Databricks Runtime ML 14.3 LTS or newer |
| **Dependencies** | Unity Catalog, Model Serving endpoints (optional) |
| **Estimated Time** | 30-45 minutes |

**Important Compatibility Notes:**
- This notebook requires **Databricks Runtime 14.3 LTS or newer**
- AI Functions (`ai_analyze_sentiment`, `ai_query`) require **Serverless SQL** or compatible runtime
- Vector Search requires a provisioned **Vector Search Endpoint**
- Some features may not work on older runtimes or community edition

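**Note:** This is a BONUS module introducing GenAI concepts. The notebook is designed to demonstrate the concepts even without the full infrastructure: cells that need a specific endpoint (the `ai_query` call, Vector Search index creation and querying) can be skipped or commented out if that endpoint is not available.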

## Theoretical Introduction

### GenAI Ecosystem in Databricks

| Component | Description | Use Case |
|-----------|-------------|----------|
| **AI Functions** | SQL functions powered by LLMs | Text analysis, sentiment, extraction |
| **Embeddings** | Vector representations of text | Semantic similarity, search |
| **Vector Search** | Index and query vector embeddings | RAG, recommendation systems |
| **Model Serving** | Deploy and serve ML/LLM models | Real-time inference |

### Key Concepts

**Embeddings:**
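- Dense vector representations of text (typically a few hundred to a few thousand dimensions; `databricks-gte-large-en`, used later in this notebook, produces 1024-dimensional vectors)
- Similar meanings → similar vectors, usually measured with cosine similarity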
- Foundation for semantic search and RAG
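
To make "similar meanings → similar vectors" concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are hand-made illustrative values, not output from any real embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical embeddings (illustrative values only)
refund_request = [0.9, 0.2, 0.1]   # "Where is my refund?"
money_back = [0.8, 0.3, 0.2]       # "I want my money back"
forklift_spec = [0.1, 0.1, 0.95]   # "Diesel forklift, 5-ton capacity"

sim_related = cosine_similarity(refund_request, money_back)
sim_unrelated = cosine_similarity(refund_request, forklift_spec)

print(f"refund vs money back: {sim_related:.3f}")
print(f"refund vs forklift:   {sim_unrelated:.3f}")
```

The two refund-related vectors point in nearly the same direction, so their score is close to 1, while the unrelated pair scores much lower.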

**Vector Search:**
- Efficient nearest neighbor search in high-dimensional space
- Enables "find similar documents" at scale
- Databricks offers managed Vector Search with Delta Sync

**RAG (Retrieval-Augmented Generation):**
- Combine LLM reasoning with private knowledge
- Steps: Query → Retrieve relevant docs → Generate answer with context
- Reduces hallucinations and enables company-specific AI
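
The Query → Retrieve → Generate steps can be sketched in a few lines of Python. This is an illustration under loud assumptions: `retrieve` ranks documents by shared words as a stand-in for real vector retrieval, and `llm` is a hypothetical callable that takes a prompt string and returns text. Neither is a Databricks API.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, documents, k=2):
    """Toy retriever: rank documents by the number of tokens shared with the query.
    A real RAG system would query a vector index here instead."""
    query_tokens = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_tokens & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query, context_docs):
    """Place the retrieved documents in the prompt so the model answers from them."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}\n"
        "Answer:"
    )

def answer_with_rag(query, documents, llm, k=2):
    """Full flow: retrieve context, build the prompt, hand it to the model."""
    return llm(build_prompt(query, retrieve(query, documents, k)))

documents = [
    "Electric Forklift 2000: Ideal for indoor warehouse use. Zero emissions. 2-ton capacity.",
    "Diesel Titan X: Heavy-duty outdoor forklift. 5-ton capacity. Best for rough terrain.",
    "Hand Pallet Jack: Manual tool for moving light pallets up to 500kg.",
]
query = "Which outdoor forklift works on rough terrain?"
prompt = build_prompt(query, retrieve(query, documents, k=1))
print(prompt)
```

Grounding the prompt in retrieved text is what lets the model answer from private knowledge rather than from its training data alone.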

## Per-User Isolation

Run the setup notebook to configure your isolated environment with unique catalog and schema names. This ensures your GenAI experiments don't interfere with other participants.

In [0]:
%pip install databricks-vectorsearch
dbutils.library.restartPython()


In [0]:
%run ./00_Setup

## Section 1: AI Functions in SQL

Databricks now allows you to call Large Language Models (LLMs) directly from SQL. This democratizes AI, allowing Data Analysts to perform complex text tasks without Python.

**Available Functions:**
- `ai_analyze_sentiment(text)`: Returns 'positive', 'negative', 'neutral', or 'mixed'.
- `ai_classify(text, categories)`: Classifies text into provided labels.
- `ai_summarize(text)`: Summarizes long text.
- `ai_translate(text, lang)`: Translates text.
- `ai_query(endpoint, request)`: Sends a custom prompt to a Model Serving endpoint (e.g., one serving Llama 3 or DBRX).

In [0]:
# Create sample data - customer reviews or support tickets
reviews_data = [
    (1, "I absolutely love this product! Fast shipping and great quality."),
    (2, "The item arrived damaged and the support team was rude."),
    (3, "It's okay, does the job but nothing special."),
    (4, "Can you help me reset my password?"),
    (5, "Where is my refund for order #12345?")
]

df_reviews = spark.createDataFrame(reviews_data, ["id", "text"])

# Save to table
try:
    df_reviews.write.mode("overwrite").saveAsTable(f"{catalog_name}.{schema_name}.customer_reviews")
    print(f"Table created: {catalog_name}.{schema_name}.customer_reviews")
except Exception as e:
    print(f"Could not save table: {e}")
    print("Continuing with DataFrame in memory...")

display(df_reviews)

### Example 1.1: Sentiment Analysis & Classification
We can use `ai_analyze_sentiment` to quickly gauge customer satisfaction and `ai_classify` to route tickets.

In [0]:
# AI Functions - Sentiment Analysis & Classification
# NOTE: These functions require Databricks Runtime 14.3+ with AI Functions enabled
# They may not work on all workspace configurations

try:
    df_ai = spark.sql(f"""
    SELECT 
        id,
        text,
        ai_analyze_sentiment(text) as sentiment,
        ai_classify(text, ARRAY('complaint', 'praise', 'support_request', 'billing')) as category
    FROM {catalog_name}.{schema_name}.customer_reviews
    """)
    display(df_ai)
    
except Exception as e:
    print(f"AI Functions not available: {e}")
    print("\nThis feature requires:")
    print("  - Databricks Runtime 14.3 LTS or newer")
    print("  - Serverless SQL or AI Functions enabled workspace")
    print("  - Unity Catalog enabled")
    print("\nManual alternative: Use Python libraries like transformers or openai")

### Example 1.2: Generative AI with `ai_query`
For more complex tasks, we can ask an LLM (like Llama 3 or DBRX) to generate text.
*Note: This requires a Model Serving Endpoint to be active.*

In [0]:
# Example: Generating a polite response to the customer
# The call is wrapped in try/except because it needs an active (paid) Model Serving endpoint.

try:
    query = f'''
    SELECT 
        id,
        text,
        ai_query(
            'databricks-llama-4-maverick',
            CONCAT('Write a short, polite response to this customer review: ', text)
        ) AS suggested_response
    FROM {catalog_name}.{schema_name}.customer_reviews
    '''
    display(spark.sql(query))
except Exception as e:
    print(f"ai_query not available: {e}")
    print("Check that the Model Serving endpoint named in the query exists and is running.")


## Section 2: Embeddings & Vector Search

**What is an Embedding?**
Computers don't understand text; they understand numbers. An **Embedding** is a way to convert text into a long list of numbers (a vector), e.g., `[0.12, -0.5, 0.88, ...]`.
- **Magic Property:** Texts with similar *meanings* will have vectors that are mathematically *close* to each other.
- "King" and "Queen" will be close. "Apple" and "Car" will be far apart.

**What is Vector Search?**
Vector Search is a specialized database that indexes these vectors so it can perform "Nearest Neighbor" search extremely fast. This is the foundation of **RAG (Retrieval-Augmented Generation)**.
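
As a rough illustration of what "Nearest Neighbor" search means, the sketch below scores every stored vector against the query and keeps the best `k`. This brute-force scan is an assumption made for clarity: a production vector index uses approximate data structures to avoid comparing against every vector. The three-dimensional vectors are made-up values.

```python
import heapq
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def nearest_neighbors(query_vec, vectors, k=2):
    """Return the k (key, score) pairs most similar to query_vec, best first."""
    scored = ((key, cosine_similarity(query_vec, vec)) for key, vec in vectors.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Hypothetical embeddings; the axes loosely mean (royalty, food, vehicle)
vectors = {
    "king": [0.9, 0.1, 0.1],
    "queen": [0.85, 0.15, 0.05],
    "apple": [0.05, 0.95, 0.05],
    "car": [0.1, 0.05, 0.9],
}
monarch_query = [0.8, 0.1, 0.1]

results = nearest_neighbors(monarch_query, vectors, k=2)
for key, score in results:
    print(f"{key}: {score:.3f}")
```

The scan costs one comparison per stored vector, which is why dedicated indexes matter once the collection grows to millions of documents.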

In [0]:
# Prepare Source Data (Knowledge Base)
# Sample product documentation for Vector Search demo
docs_data = [
    (1, "Electric Forklift 2000: Ideal for indoor warehouse use. Zero emissions. 2-ton capacity."),
    (2, "Diesel Titan X: Heavy-duty outdoor forklift. 5-ton capacity. Best for rough terrain."),
    (3, "Hand Pallet Jack: Manual tool for moving light pallets up to 500kg."),
    (4, "Warehouse Automation Suite: Software for inventory management and robot control."),
    (5, "Hydraulic Lift Table: Adjustable height platform for ergonomic loading and unloading."),
    (6, "Cold Storage Reach Truck: Designed for freezer environments, extended reach."),
    (7, "Order Picker Pro: High-efficiency picker for small item fulfillment."),
    (8, "Battery Management System: Monitors and optimizes forklift battery health."),
    (9, "Dock Leveler: Bridges the gap between dock and truck for safe loading."),
    (10, "Pallet Wrapper 3000: Automated stretch wrapping for palletized goods."),
    (11, "Safety Training Module: Interactive e-learning for warehouse staff."),
    (12, "RFID Inventory Scanner: Real-time tracking of goods with handheld device.")
]

df_docs = spark.createDataFrame(docs_data, ["id", "content"])

# Save with Change Data Feed enabled (required for Vector Search sync)
try:
    df_docs.write.format("delta") \
        .option("delta.enableChangeDataFeed", "true") \
        .mode("overwrite") \
        .saveAsTable(f"{catalog_name}.{schema_name}.product_docs")
    print(f"✅ Knowledge Base created: {catalog_name}.{schema_name}.product_docs")
except Exception as e:
    print(f"Could not save with Delta format: {e}")
    print("Trying without Change Data Feed...")
    try:
        df_docs.write.mode("overwrite").saveAsTable(f"{catalog_name}.{schema_name}.product_docs")
        print("✅ Table created (without CDF)")
    except Exception as e2:
        print(f"Could not create table: {e2}")
        print("Continuing with DataFrame in memory...")

display(df_docs)

### Example 2.1: Creating a Vector Index

To make this searchable, we would create a **Vector Search Index**.
Databricks manages the embedding process for us (Source Table -> Embedding Model -> Vector Index).

*Note: The code below requires a Vector Search Endpoint to be provisioned in the Compute tab.*
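
To show the shape of the Source Table -> Embedding Model -> Vector Index flow without any endpoint, here is a toy, in-memory sketch. Everything in it is an assumption for illustration: `embed` is a bag-of-words counter standing in for a real embedding model, and the "index" is a plain dictionary. Unlike real embeddings, word counts only match exact words, not meanings.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_vocabulary(texts):
    """Give every distinct token a fixed position in the vector."""
    vocab = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: i for i, tok in enumerate(vocab)}

def embed(text, vocab):
    """Toy 'embedding model': an L2-normalised bag-of-words count vector."""
    vec = [0.0] * len(vocab)
    for tok, count in Counter(tokenize(text)).items():
        if tok in vocab:
            vec[vocab[tok]] = float(count)
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

def build_index(rows, vocab):
    """rows: list of (primary_key, text) pairs. Returns {primary_key: vector}."""
    return {pk: embed(text, vocab) for pk, text in rows}

def search(query, index, vocab, k=1):
    """Embed the query and return the k primary keys with the highest dot product."""
    q = embed(query, vocab)
    scored = [(sum(a * b for a, b in zip(q, vec)), pk) for pk, vec in index.items()]
    scored.sort(reverse=True)
    return [pk for _, pk in scored[:k]]

rows = [
    (1, "Electric Forklift 2000: Ideal for indoor warehouse use."),
    (2, "Diesel Titan X: Heavy-duty outdoor forklift."),
    (8, "Battery Management System: Monitors and optimizes forklift battery health."),
]
vocab = build_vocabulary([text for _, text in rows])
index = build_index(rows, vocab)
print(search("battery health", index, vocab))
```

In the managed service, the same three roles are played by the Delta source table, the embedding model endpoint, and the Delta Sync index, with the primary key linking each vector back to its row.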

In [0]:
from databricks.vector_search.client import VectorSearchClient

In [0]:
# Vector Search Client - Configuration
# NOTE: This requires a Vector Search Endpoint provisioned in Databricks

vs_endpoint_name = "my_vector_search_endpoint"  
index_name = f"{catalog_name}.{schema_name}.product_docs_index"
source_table = f"{catalog_name}.{schema_name}.product_docs"

vsc = VectorSearchClient()
vsc.create_delta_sync_index(
    endpoint_name=vs_endpoint_name,
    source_table_name=source_table,
    index_name=index_name,
    pipeline_type="TRIGGERED",
    primary_key="id",
    embedding_source_column="content",
    embedding_model_endpoint_name="databricks-gte-large-en"
)

In [0]:
vsc = VectorSearchClient()
indexes = vsc.list_indexes(vs_endpoint_name)
display(indexes)

### Example 2.2: Performing a Similarity Search

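Once the index is built, we can ask questions like "I need a truck for outside". The system converts the query to a vector and returns the closest documents, here the Diesel Titan X, even though the words "truck" and "outside" never appear in its description: "outside" matches "outdoor" semantically. The cell below runs a different query, 'Battery more info', which should surface the Battery Management System entry.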
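
Index creation typically runs asynchronously, so a query issued right after the create call can fail until the first sync has finished. A minimal, generic polling helper might look like the sketch below. `wait_until` and `is_ready` are hypothetical names: you would supply an `is_ready` callable that checks the index status reported by your SDK version, which is not shown here.

```python
import time

def wait_until(is_ready, timeout_s=600.0, poll_interval_s=10.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call is_ready() until it returns True or timeout_s seconds have elapsed.
    Returns True on success and False on timeout."""
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)

# Demo with a fake readiness check that succeeds on the third poll
poll_count = {"n": 0}

def fake_index_ready():
    poll_count["n"] += 1
    return poll_count["n"] >= 3

became_ready = wait_until(fake_index_ready, timeout_s=5, poll_interval_s=0)
print(f"ready={became_ready} after {poll_count['n']} polls")
```

Passing `sleep` and `clock` as parameters keeps the helper easy to test without real waiting.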

In [0]:

# Query the index with the VECTOR_SEARCH SQL function, best matches first
display(spark.sql(f"""
SELECT
    *
FROM
    VECTOR_SEARCH(
        index => '{index_name}',
        query_text => 'Battery more info',
        num_results => 3
    )
ORDER BY search_score DESC
"""))

## Best Practices

| Practice | Description |
|----------|-------------|
| **Endpoint Management** | Turn off Model Serving endpoints when not in use to control costs |
| **Chunking Strategy** | For long documents, split into meaningful chunks (paragraphs, sections) before embedding |
| **Embedding Model Selection** | Choose embedding model based on language and domain (e.g., `databricks-gte-large-en` for English) |
| **Enable CDF** | Enable Change Data Feed on every source table that backs a Delta Sync Vector Search index |
| **RAG Temperature** | Use lower temperature (0.0-0.3) for factual Q&A, higher (0.7-1.0) for creative tasks |
| **Prompt Engineering** | Include context and format instructions in prompts for better `ai_query` results |
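
The chunking practice above can be sketched as follows. This is one simple strategy among many, assuming paragraphs are separated by blank lines and that a character budget is an acceptable proxy for the embedding model's token limit. `chunk_paragraphs` and `max_chars` are illustrative names.

```python
def chunk_paragraphs(text, max_chars=200):
    """Greedily pack whole paragraphs into chunks of at most max_chars characters.
    A single paragraph longer than max_chars becomes its own oversized chunk."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if not current or len(candidate) <= max_chars:
            current = candidate      # paragraph still fits into the open chunk
        else:
            chunks.append(current)   # close the open chunk and start a new one
            current = para
    if current:
        chunks.append(current)
    return chunks

manual = (
    "Electric Forklift 2000 is built for indoor warehouse use.\n\n"
    "It produces zero emissions and lifts up to 2 tons.\n\n"
    "Charge the battery fully before the first shift and check the "
    "charge level at the start of every working day."
)
manual_chunks = chunk_paragraphs(manual, max_chars=120)
for i, chunk in enumerate(manual_chunks):
    print(i, len(chunk), repr(chunk[:40]))
```

Keeping paragraphs whole means each chunk stays readable on its own, which tends to help both retrieval and the answer the LLM writes from it.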

## Summary

### What We Accomplished
| Task | Status |
|------|--------|
| Explored AI Functions in SQL | ✅ |
| Used `ai_analyze_sentiment` and `ai_classify` | ✅ |
| Understood Embeddings concept | ✅ |
| Learned Vector Search architecture | ✅ |
| Created knowledge base for RAG | ✅ |

### Key Takeaways

1. **AI Functions** democratize AI by allowing SQL-based LLM calls
2. **Embeddings** convert text to semantic vectors for similarity search
3. **Vector Search** enables meaning-based retrieval at scale
4. **RAG** combines retrieval with generation for accurate, grounded AI responses
5. Databricks provides managed infrastructure for the entire GenAI stack

### Next Steps

With this BONUS module complete, you have a solid foundation in Data Preparation for ML. From here:
- Return to core modules if needed for review
- Explore Databricks documentation for advanced GenAI features
- Consider building a RAG chatbot with your own data

## Cleanup (Optional)

Remove temporary tables and indexes created during this demo. Vector Search indexes can be deleted via the Databricks UI or the API.

In [0]:
# Cleanup: Remove tables created in this notebook
# Uncomment and run to clean up resources

# spark.sql(f"DROP TABLE IF EXISTS {catalog_name}.{schema_name}.customer_reviews")
# spark.sql(f"DROP TABLE IF EXISTS {catalog_name}.{schema_name}.product_docs")

# Note: Vector Search indexes must be deleted via the Databricks UI or API:
# vsc = VectorSearchClient()
# vsc.delete_index(endpoint_name=vs_endpoint_name, index_name=index_name)

print("To clean up, uncomment the DROP TABLE statements above.")