# AI/ML and GenAI Integrations

**Training Objective:** Understand the AI/ML capabilities of Databricks: MLflow, Feature Store, Vector Search, and GenAI

**Topics Covered:**
- MLflow: Tracking, Registry, Model Serving
- Feature Store: central feature repository
- Vector Search: vector database for RAG
- GenAI: LLM integration (DBRX, Llama)

## Setup and Configuration

## Context and Requirements

- **Training Day**: Day 3 - Integrations and Governance
- **Notebook Type**: Demo
- **Technical Requirements**:
  - ML Cluster (Databricks Runtime ML 13.0+)
  - Access to Unity Catalog (for Feature Store and Vector Search)
  - (Optional) Configured Vector Search Endpoint
  - (Optional) Access to Model Serving (Foundation Models)

## Theoretical Introduction

**Section Objective:** Understand how Databricks supports the full AI/ML lifecycle.

**Basic Concepts:**
- **MLflow**: Open-source platform for managing the ML lifecycle (experiments, models, deployments).
- **Feature Engineering (Feature Store)**: Centralized feature repository in Unity Catalog, ensuring consistency between training and inference.
- **Vector Search**: Vector database integrated with Delta Lake, key for RAG (Retrieval Augmented Generation).
- **Model Serving**: Deploying models as REST API endpoints (including Foundation Models like DBRX, Llama).

## User Isolation

In [0]:
%run ../00_setup

## Environment Configuration

In [0]:
import mlflow
import pandas as pd
from pyspark.sql import functions as F
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Feature Engineering Client (Unity Catalog - new way)
from databricks.feature_engineering import FeatureEngineeringClient

# Set catalog and schema
spark.sql(f"USE CATALOG {CATALOG}")
spark.sql(f"USE SCHEMA {GOLD_SCHEMA}")

display(spark.createDataFrame([
    ("Catalog", CATALOG),
    ("Schema", GOLD_SCHEMA)
], ["Parameter", "Value"]))

## MLflow - Tracking & Registry

We will build a simple model predicting order value and register it in MLflow.

In [0]:
# 1. Experiment Configuration
username = spark.sql("SELECT current_user()").collect()[0][0]
experiment_path = f"/Users/{username}/KION_ML_Experiment"

mlflow.set_experiment(experiment_path)

display(spark.createDataFrame([("MLflow Experiment", experiment_path)], ["Parameter", "Value"]))

In [0]:
%python
# 2. Data Preparation
df_sales = spark.sql(
    f"""
    SELECT 
        net_amount as label,
        quantity,
        month(order_ts) as month,
        year(order_ts) as year
    FROM {CATALOG}.{GOLD_SCHEMA}.fact_sales
    WHERE net_amount IS NOT NULL
    """
).toPandas()

X = df_sales.drop("label", axis=1)
y = df_sales["label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

display(
    spark.createDataFrame(
        [
            ("Train shape", f"{X_train.shape[0]} rows, {X_train.shape[1]} features"),
            ("Test shape", f"{X_test.shape[0]} rows, {X_test.shape[1]} features")
        ],
        ["Dataset", "Size"]
    )
)

In [0]:
# 3. Training and Logging
import mlflow
from mlflow.models.signature import infer_signature
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

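# Ensure the variables defined by the setup notebook are available
if 'CATALOG' not in globals() or 'GOLD_SCHEMA' not in globals():
    raise ValueError("CATALOG and GOLD_SCHEMA variables are not defined. Run Setup cells.")

# Register models in Unity Catalog (older MLflow versions default to the workspace registry)
mlflow.set_registry_uri("databricks-uc")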

with mlflow.start_run(run_name="RandomForest_v1") as run:
    # Parameters
    n_estimators = 50
    mlflow.log_param("n_estimators", n_estimators)
    
    # Model
    model = RandomForestRegressor(n_estimators=n_estimators)
    model.fit(X_train, y_train)
    
    # Metrics
    predictions = model.predict(X_test)
    mse = mean_squared_error(y_test, predictions)
    mlflow.log_metric("mse", mse)
    
    # Signature (required to register a model in Unity Catalog)
    # Defines the input and output schema, which the REST API endpoint also uses
    signature = infer_signature(X_train, model.predict(X_train))
    
    # Logging and Registering model in Unity Catalog
    # We save the model under: catalog.schema.model_name
    model_name = f"{CATALOG}.{GOLD_SCHEMA}.kion_sales_forecast"
    
    print(f"Registering model in Unity Catalog: {model_name}...")
    
    mlflow.sklearn.log_model(
        model, 
        "model",
        signature=signature,
        registered_model_name=model_name,
        input_example=X_train.iloc[:5]
    )
    
    run_id = run.info.run_id

display(spark.createDataFrame([
    ("Run ID", run_id),
    ("MSE", f"{mse:.2f}"),
    ("Model Name", model_name),
    ("Signature", "Added & Registered"),
    ("Next Step", "Go to 'Serving' -> 'Create Endpoint'")
], ["Parameter", "Value"]))
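
The `mse` metric logged above is the mean of the squared prediction errors, so it is expressed in squared units of `net_amount`. As a minimal sketch in plain Python, using invented numbers that do not come from `fact_sales`, the snippet below computes MSE by hand and then takes its square root (RMSE), which is back in the units of the label and is often easier to interpret.

```python
import math

def mean_squared_error_manual(y_true, y_pred):
    """Average of squared differences between actual and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)

# Hypothetical order values and predictions, for illustration only
actual = [100.0, 250.0, 80.0, 400.0]
predicted = [110.0, 240.0, 100.0, 380.0]

mse_demo = mean_squared_error_manual(actual, predicted)
rmse_demo = math.sqrt(mse_demo)

print(f"MSE:  {mse_demo:.2f}")
print(f"RMSE: {rmse_demo:.2f}")
```

Logging RMSE next to MSE (for example with a second `mlflow.log_metric` call) is a common choice when the metric is meant to be read by people rather than only compared between runs.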

In [0]:
# Optional: Programmatic Model Serving Endpoint creation
# To serve this model as a REST API, we can use the Databricks SDK or the UI (Serving -> Create Endpoint)

"""
# Example code to create endpoint (requires: pip install databricks-sdk)
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import EndpointCoreConfigInput, ServedEntityInput

w = WorkspaceClient()

endpoint_name = "kion-sales-forecast-endpoint"

w.serving_endpoints.create(
    name=endpoint_name,
    config=EndpointCoreConfigInput(
        served_entities=[
            ServedEntityInput(
                entity_name=f"{CATALOG}.{GOLD_SCHEMA}.kion_sales_forecast",
                entity_version="1",  # Model version from Unity Catalog
                workload_size="Small",
                scale_to_zero_enabled=True
            )
        ]
    )
)
"""
print("Model registered. You can now go to 'Serving' tab and create an endpoint selecting model:", model_name)
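
Once an endpoint exists, it is queried with a JSON request body. The sketch below only builds that body and does not call any endpoint. It assumes the `dataframe_split` input format that Model Serving accepts for MLflow models with tabular input, and it reuses the feature columns from training (`quantity`, `month`, `year`). The helper name `build_scoring_payload` and the sample rows are illustrative.

```python
import json

def build_scoring_payload(rows, columns):
    """Build a 'dataframe_split' request body from a list of row dicts."""
    data = [[row[col] for col in columns] for row in rows]
    return {"dataframe_split": {"columns": list(columns), "data": data}}

# Column order mirrors the features used to train the model above
feature_columns = ["quantity", "month", "year"]

# Hypothetical rows to score
sample_rows = [
    {"quantity": 3, "month": 5, "year": 2024},
    {"quantity": 10, "month": 11, "year": 2023},
]

payload = build_scoring_payload(sample_rows, feature_columns)
print(json.dumps(payload, indent=2))
```

The resulting JSON would be sent in a POST request to the endpoint's invocations URL with a bearer token; the exact URL depends on your workspace and endpoint name.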

## Feature Store

Feature Store lets you define a feature once, store it in a governed table, and reuse it for both training and inference.

In [0]:
from databricks.feature_engineering import FeatureEngineeringClient

In [0]:
# Feature Table Definition
# FeatureEngineeringClient is the client for feature tables in Unity Catalog
fs = FeatureEngineeringClient()
feature_table_name = f"{CATALOG}.{GOLD_SCHEMA}.customer_features_{username.split('@')[0].replace('.', '_')}"

# Feature calculation logic
features_df = spark.sql(
    f"""
    SELECT 
        customer_id,
        count(*) as order_count,
        sum(net_amount) as total_spend,
        avg(net_amount) as avg_order_value
    FROM {CATALOG}.{GOLD_SCHEMA}.fact_sales
    GROUP BY customer_id
    """
)

# Write to Feature Store
# Note: create_table needs a unique table name and a primary key; it fails if the table already exists

fs.create_table(
    name=feature_table_name,
    primary_keys=["customer_id"],
    df=features_df,
    description="Basic customer features: order count, total spend, average order value."
)
status_msg = "Created"


display(spark.createDataFrame([
    ("Feature Table", feature_table_name),
    ("Status", status_msg),
    ("Primary Key", "customer_id"),
    ("Features", "order_count, total_spend, avg_order_value")
], ["Parameter", "Value"]))
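
The feature query above reduces each customer's orders to three aggregates. As a minimal plain-Python sketch of the same logic, the function below can be called from both a training and an inference code path, which is the consistency that a shared feature definition is meant to provide. The function name and the sample orders are hypothetical, and the sketch assumes `net_amount` is never null (in SQL, `count(*)` counts rows with a null amount while `avg` skips them).

```python
from collections import defaultdict

def compute_customer_features(orders):
    """Aggregate (customer_id, net_amount) pairs into per-customer features."""
    amounts_by_customer = defaultdict(list)
    for customer_id, net_amount in orders:
        amounts_by_customer[customer_id].append(net_amount)

    features = {}
    for customer_id, amounts in amounts_by_customer.items():
        total = sum(amounts)
        features[customer_id] = {
            "order_count": len(amounts),
            "total_spend": total,
            "avg_order_value": total / len(amounts),
        }
    return features

# Hypothetical orders, for illustration only
demo_orders = [("C1", 100.0), ("C1", 300.0), ("C2", 50.0)]
demo_features = compute_customer_features(demo_orders)

for customer_id, feats in sorted(demo_features.items()):
    print(customer_id, feats)
```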

## Vector Search (RAG Foundation)

Vector Search is the key to building RAG systems. It allows searching for documents semantically similar to the user's query.

**Requirements:**
1. **Vector Search Endpoint** (must be created in Compute -> Vector Search).
2. **Source Table** with Change Data Feed enabled.

In [0]:
# 1. Text Data Preparation (Knowledge Base Simulation)
# We create a table with product descriptions (in a real scenario, these would be PDFs, documentation, etc.)

docs_df = spark.createDataFrame([
    (1, "Electric forklift, 2-ton capacity, for indoor warehouse use."),
    (2, "Diesel forklift, 5-ton capacity, for outdoor use, rough terrain."),
    (3, "Manual pallet jack, 500kg capacity, for light tasks."),
    (4, "Warehouse automation system, high-bay racking.")
], ["id", "text"])

source_table = f"{CATALOG}.{GOLD_SCHEMA}.product_docs_rag"
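docs_df.write.format("delta").mode("overwrite").saveAsTable(source_table)

# Enable Change Data Feed explicitly as a table property (required by Delta Sync indexes)
spark.sql(f"ALTER TABLE {source_table} SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")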

display(spark.createDataFrame([
    ("Source Table", source_table),
    ("Document Count", "4"),
    ("Change Data Feed", "Enabled")
], ["Parameter", "Value"]))

In [0]:

%skip
%pip install databricks-vectorsearch
dbutils.library.restartPython()

In [0]:

# 2. Vector Search Configuration (Actual Code)
# Note: This requires an active Vector Search Endpoint.
# If the endpoint doesn't exist, the error is caught and shown in the status table below.

from databricks.vector_search.client import VectorSearchClient

# Endpoint name (must exist in UI)
VECTOR_SEARCH_ENDPOINT_NAME = "product_docs_index"
index_name = f"{CATALOG}.{GOLD_SCHEMA}.product_docs_idx"

status_info = []

try:
    vsc = VectorSearchClient()
    # Create Index (Delta Sync Index)
    vsc.create_delta_sync_index(
        endpoint_name=VECTOR_SEARCH_ENDPOINT_NAME,
        source_table_name=source_table,
        index_name=index_name,
        pipeline_type="TRIGGERED",
        primary_key="id",
        embedding_source_column="text",
        embedding_model_endpoint_name="databricks-gte-large-en"
    )
    status_info.append(("Status", "Index created"))
    status_info.append(("Endpoint", VECTOR_SEARCH_ENDPOINT_NAME))
    status_info.append(("Index Name", index_name))
except Exception as e:
    status_info.append(("Status", f"Error: {e}"))

display(spark.createDataFrame(status_info, ["Parameter", "Value"]))

In [0]:

# 3. Querying the index with the VECTOR_SEARCH SQL function (requires the index created above)

query = "I am looking for a forklift for heavy outdoor work"
display(spark.sql(f"""
SELECT
    *
FROM
    VECTOR_SEARCH(
        index => '{index_name}',
        query_text => '{query}',
        num_results => 3
    )
"""))

In [0]:

# --- LOCAL SIMULATION (For educational purposes, when infrastructure is missing) ---
display(spark.createDataFrame([
    ("Query", "I am looking for a forklift for heavy outdoor work"),
    ("", ""),
    ("Result - ID", "2"),
    ("Result - Text", "Diesel forklift, 5-ton capacity, for outdoor use, rough terrain."),
    ("Result - Score", "0.89")
], ["Parameter", "Value"]))
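
The simulated result above reports a score without showing where such a number comes from. As a rough intuition only, the sketch below ranks four documents by cosine similarity to a query vector. The 4-dimensional vectors are hand-made stand-ins for embeddings and were not produced by any model. Real Vector Search computes embeddings with the configured model endpoint and uses its own index and scoring, which may differ from plain cosine similarity.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=3):
    """Return the k (doc_id, score) pairs with the highest similarity."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hand-made "embeddings" (hypothetical, not produced by a model).
# Dimensions loosely mean: forklift, outdoor, heavy-duty, automation.
toy_embeddings = {
    1: [0.9, 0.1, 0.3, 0.0],  # electric indoor forklift
    2: [0.9, 0.9, 0.8, 0.0],  # diesel outdoor forklift
    3: [0.4, 0.1, 0.0, 0.0],  # manual pallet jack
    4: [0.0, 0.0, 0.2, 0.9],  # warehouse automation
}
toy_query = [0.8, 0.9, 0.9, 0.0]  # "forklift for heavy outdoor work"

toy_results = top_k(toy_query, toy_embeddings, k=3)
for doc_id, score in toy_results:
    print(doc_id, round(score, 3))
```

With these toy vectors the diesel forklift (ID 2) ranks first because its vector points in almost the same direction as the query vector.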

## GenAI & Model Serving

Integration with LLMs (Large Language Models) via Databricks Model Serving.

In [0]:
# LLM Model Call (e.g., DBRX, Llama 3)
import mlflow.deployments

# List available endpoints (may require permissions)
# client = mlflow.deployments.get_deploy_client("databricks")
# endpoints = client.list_endpoints()

def query_llm(prompt):
    try:
        client = mlflow.deployments.get_deploy_client("databricks")
        
        # We use 'databricks-claude-sonnet-4-5' or 'databricks-meta-llama-3-70b-instruct'
        # These models are often available as "Pay-per-token"
        response = client.predict(
            endpoint="databricks-claude-sonnet-4-5",
            inputs={
                "messages": [
                    {"role": "system", "content": "You are a forklift expert."},
                    {"role": "user", "content": prompt}
                ],
                "max_tokens": 100
            }
        )
        return response['choices'][0]['message']['content']
    except Exception as e:
        return f"Failed to connect to model: {e}"

# Test
prompt = "What are the advantages of electric forklifts?"
answer = query_llm(prompt)

display(spark.createDataFrame([
    ("Question", prompt),
    ("LLM Answer", answer)
], ["Parameter", "Value"]))
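
In a RAG flow, the documents returned by Vector Search are placed into the prompt before the LLM is called. The sketch below shows one possible way to assemble the chat `messages` list from retrieved texts. The helper `build_rag_messages` and the prompt wording are hypothetical, and the snippet only builds the messages without calling a model.

```python
def build_rag_messages(question, retrieved_docs, system_role="You are a forklift expert."):
    """Combine retrieved documents and a user question into chat messages."""
    context = "\n".join(f"- {doc}" for doc in retrieved_docs)
    user_content = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return [
        {"role": "system", "content": system_role},
        {"role": "user", "content": user_content},
    ]

# Document text taken from the knowledge base table defined earlier
retrieved = ["Diesel forklift, 5-ton capacity, for outdoor use, rough terrain."]
rag_messages = build_rag_messages(
    "I am looking for a forklift for heavy outdoor work", retrieved
)

for message in rag_messages:
    print(message["role"], "->", message["content"])
```

The resulting list has the same shape as the `messages` value passed to `client.predict` in `query_llm`, so it could be supplied there in place of the fixed system and user messages.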

## AI Functions - Built-in LLM Functions in SQL

**Databricks AI Functions** are built-in SQL functions that call LLMs directly from SQL queries.
They are applied row by row to column values, so a whole table can be enriched in a single statement.

**Available Functions:**

| Function | Description | Use Case |
|---------|------|------------------|
| `ai_query()` | Send any prompt to a model serving endpoint | Generation, translation, summarization |
| `ai_analyze_sentiment()` | Analyze text sentiment | Customer feedback, reviews |
| `ai_classify()` | Classify text into categories | Ticket routing, tagging |
| `ai_extract()` | Extract structured data | Document parsing |
| `ai_fix_grammar()` | Grammar correction | Text data cleaning |
| `ai_gen()` | Generate text | Product descriptions, emails |
| `ai_similarity()` | Semantic text similarity | Deduplication, matching |
| `ai_summarize()` | Summarize text | Article summaries |
| `ai_translate()` | Translate text | Internationalization |

> **Note:** These functions require an active Model Serving endpoint and are paid per token.

### ai_query() - Universal LLM Function

In [0]:
%sql
-- Example 5.1a: Generating product descriptions
-- We use ai_query() to enrich data with LLM-generated descriptions

SELECT 
    product_id,
    product_name,
    category,
    price,
    ai_query(
        'databricks-meta-llama-3-3-70b-instruct',  -- model name
        CONCAT(
            'Write a short, attractive marketing description (max 2 sentences) for the product: ',
            product_name, 
            ' from category: ', category,
            ' priced at: ', price, ' USD'
        )
    ) AS generated_description
FROM workshop.silver.products
LIMIT 3;

-- NOTE: ai_query() requires an available Foundation Model endpoint
-- In an environment without an endpoint, you will get an error

In [0]:
%sql
-- Example 5.1b: Translating data to another language
SELECT 
    customer_id,
    first_name,
    city,
    state,
    ai_query(
        'databricks-meta-llama-3-3-70b-instruct',
        CONCAT('Translate to Polish: Customer from ', city, ', ', state)
    ) AS polish_description
FROM workshop.silver.customers
LIMIT 3;

### ai_analyze_sentiment() - Sentiment Analysis

Automatic text sentiment analysis (positive/negative/neutral).

In [0]:
%sql
-- Example 5.2: Customer review sentiment analysis
-- Creating temporary data with reviews for analysis

WITH customer_reviews AS (
    SELECT 1 AS review_id, 'Great product! Highly recommended.' AS review_text
    UNION ALL
    SELECT 2, 'Terrible quality, will never buy again.'
    UNION ALL
    SELECT 3, 'Product is OK, nothing special.'
    UNION ALL
    SELECT 4, 'Fantastic customer service, super fast delivery!'
    UNION ALL
    SELECT 5, 'Disappointed. Product does not work as described.'
)
SELECT 
    review_id,
    review_text,
    ai_analyze_sentiment(review_text) AS sentiment
FROM customer_reviews;

-- Result: sentiment = 'positive', 'negative' or 'neutral' for each review

### ai_similarity() - Semantic Similarity

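Compares two texts and returns a similarity score: higher means more similar, and 1 means the texts are identical. The score is relative, so it is best used for ranking pairs against each other.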

In [0]:
%sql
-- Example 5.3: Finding similar products
-- Useful for deduplication or recommendations

WITH product_pairs AS (
    SELECT 
        'Laptop Dell Inspiron 15' AS product_a,
        'Notebook Dell Inspiron 15.6 inch' AS product_b
    UNION ALL
    SELECT 
        'iPhone 15 Pro Max 256GB',
        'Samsung Galaxy S24 Ultra'
    UNION ALL
    SELECT 
        'USB-C Cable 1m',
        'USB Type-C Cord 100cm'
)
SELECT 
    product_a,
    product_b,
    ai_similarity(product_a, product_b) AS similarity_score
FROM product_pairs
ORDER BY similarity_score DESC;

-- Rule-of-thumb thresholds (scores are relative, so tune these on your own data):
-- similarity_score > 0.8 = probable duplicates
-- similarity_score 0.5-0.8 = similar products
-- similarity_score < 0.5 = different products
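
If the scores are collected into Python, the cut-offs from the comments above can be applied with a small helper. This is a minimal sketch: the function name `label_similarity` is hypothetical, and the default thresholds simply mirror the comments, so they are heuristics rather than values defined by `ai_similarity()`.

```python
def label_similarity(score, duplicate_threshold=0.8, similar_threshold=0.5):
    """Map a similarity score to a coarse label using configurable cut-offs."""
    if score > duplicate_threshold:
        return "probable duplicate"
    if score >= similar_threshold:
        return "similar"
    return "different"

# Hypothetical scores, for illustration only
for demo_score in (0.93, 0.8, 0.62, 0.31):
    print(demo_score, "->", label_similarity(demo_score))
```

Keeping the thresholds as parameters makes it easy to adjust them after inspecting real score distributions.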

### ai_extract() - Structured Data Extraction

Extracts the entities named by a list of labels (for example a name, phone number, or email address) from text.

In [0]:
%sql
-- Example 5.4: Extracting data from unstructured text
-- Parsing contact information from messages

WITH messages AS (
    SELECT 1 AS msg_id, 
           'Hello, my name is John Smith. My phone number is 512-345-678, email: john.smith@example.com' AS message
    UNION ALL
    SELECT 2, 
           'Please contact: Anna Nowak, tel. 601 234 567, anna.nowak@company.com'
    UNION ALL
    SELECT 3, 
           'Order placed by Peter Wilson, contact: peter.w@gmail.com, +48 700 800 900'
)
SELECT 
    msg_id,
    message,
    ai_extract(
        message,
        ARRAY('full_name', 'phone', 'email')
    ) AS extracted_data
FROM messages;

-- Result: STRUCT with fields full_name, phone, email for each message

### ai_classify() - Text Classification

Assigns text to one of the predefined categories.

In [0]:
%sql
-- Example 5.5: Automatic support ticket categorization

WITH support_tickets AS (
    SELECT 1 AS ticket_id, 'I cannot log in to the system' AS issue
    UNION ALL
    SELECT 2, 'When will I receive a refund for order #12345?'
    UNION ALL
    SELECT 3, 'Product arrived damaged, photos attached'
    UNION ALL
    SELECT 4, 'How to change account password?'
    UNION ALL
    SELECT 5, 'I want to cancel premium subscription'
)
SELECT 
    ticket_id,
    issue,
    ai_classify(
        issue,
        ARRAY('technical_issues', 'returns_refunds', 'payments', 'account_settings', 'cancellation')
    ) AS category
FROM support_tickets;

-- Automatic ticket routing to appropriate teams!
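
Routing itself is then a lookup from the predicted category to a team. The sketch below assumes the category labels from the query above; the team names, the `route_ticket` helper, and the fallback queue are hypothetical. The fallback covers a missing or unexpected category, for example when the text could not be classified.

```python
# Hypothetical mapping from ai_classify() labels to owning teams
TEAM_BY_CATEGORY = {
    "technical_issues": "IT Support",
    "returns_refunds": "Returns Desk",
    "payments": "Billing",
    "account_settings": "Account Services",
    "cancellation": "Retention",
}

def route_ticket(category, default_team="Triage"):
    """Return the team for a category, falling back to manual triage."""
    if category is None:
        return default_team
    return TEAM_BY_CATEGORY.get(category, default_team)

print(route_ticket("returns_refunds"))
print(route_ticket(None))
```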

### ai_summarize() and ai_translate() - Summarization and Translation

In [0]:
%sql
-- Example 5.6a: Summarizing long product descriptions

WITH product_descriptions AS (
    SELECT 1 AS product_id, 
           'The Dell Latitude 5540 business laptop is a versatile notebook designed for professionals. Equipped with the latest 13th Gen Intel Core i7 processor, 16GB DDR5 RAM, and a fast 512GB NVMe SSD. The 15.6-inch Full HD matte IPS screen ensures excellent readability even in bright light. The aluminum chassis guarantees durability, and the battery provides up to 10 hours of operation. Integrated security features include a fingerprint reader, TPM 2.0, and an IR camera for Windows Hello.' AS full_description
)
SELECT 
    product_id,
    ai_summarize(full_description) AS short_summary
FROM product_descriptions;

In [0]:
%sql
-- Example 5.6b: Translating product descriptions to multiple languages

SELECT 
    product_id,
    product_name,
    ai_translate(product_name, 'de') AS german_name,
    ai_translate(product_name, 'es') AS spanish_name,
    ai_translate(product_name, 'fr') AS french_name
FROM workshop.silver.products
LIMIT 5;

-- The set of supported target languages is limited and can change between releases;
-- check the ai_translate() documentation before relying on a specific language code

### Practical Use Case: Customer Data Enrichment

Combining multiple AI Functions in a single pipeline to analyze and enrich data.

In [0]:
%sql
-- Example 5.7: Complete customer feedback analysis pipeline

WITH customer_feedback AS (
    SELECT 
        1 AS feedback_id,
        'John Smith' AS customer_name,
        'I absolutely love this product! Fast shipping and great quality. Will buy again!' AS feedback_text
    UNION ALL
    SELECT 2, 'Maria Garcia', 'Product arrived damaged. Customer service was unhelpful. Very disappointed.'
    UNION ALL
    SELECT 3, 'Klaus Mueller', 'Decent product for the price. Nothing special but works as expected.'
)
SELECT 
    feedback_id,
    customer_name,
    feedback_text,
    
    -- Sentiment Analysis
    ai_analyze_sentiment(feedback_text) AS sentiment,
    
    -- Classification into categories
    ai_classify(
        feedback_text, 
        ARRAY('product_quality', 'delivery', 'customer_service', 'price')
    ) AS main_topic,
    
    -- Summarization (for long texts)
    ai_summarize(feedback_text) AS summary,
    
    -- Translation to Polish
    ai_translate(feedback_text, 'pl') AS polish_translation
    
FROM customer_feedback;

-- Result: Complete analysis of each feedback in a single query!

### 📌 AI Functions Summary

| Function | Syntax | Usage Example |
|---------|----------|-----------------|
| `ai_query()` | `ai_query(model, prompt)` | Generation, any prompt |
| `ai_analyze_sentiment()` | `ai_analyze_sentiment(text)` | Review analysis |
| `ai_classify()` | `ai_classify(text, categories)` | Ticket routing |
| `ai_extract()` | `ai_extract(text, fields)` | Document parsing |
| `ai_similarity()` | `ai_similarity(text1, text2)` | Deduplication |
| `ai_summarize()` | `ai_summarize(text)` | Description shortening |
| `ai_translate()` | `ai_translate(text, lang)` | Internationalization |
| `ai_fix_grammar()` | `ai_fix_grammar(text)` | Text correction |
| `ai_gen()` | `ai_gen(prompt)` | Content generation |

**Requirements:**
- Unity Catalog 
- Foundation Model APIs or custom Model Serving endpoint
- Permissions: `USE CATALOG`, `USE SCHEMA`, and access to the model

**Costs:** Billed per token - remember to optimize prompts!
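
Because every row sends its own prompt, prompt length multiplies across the table. The sketch below compares two prompt templates using a very rough estimate of about four characters per token. That ratio is only a rule of thumb assumed here for illustration; real token counts depend on the model's tokenizer and on the language of the text. The helper names and sample values are hypothetical.

```python
import math

def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate; real tokenizers vary by model and language."""
    return math.ceil(len(text) / chars_per_token)

def estimate_batch_tokens(prompt_template, values):
    """Estimate total prompt tokens when one prompt is sent per row."""
    return sum(estimate_tokens(prompt_template.format(value=v)) for v in values)

verbose_prompt = (
    "Please could you kindly write a short description "
    "for the following product: {value}"
)
concise_prompt = "Describe product: {value}"
demo_products = ["Electric forklift", "Manual pallet jack"]

verbose_total = estimate_batch_tokens(verbose_prompt, demo_products)
concise_total = estimate_batch_tokens(concise_prompt, demo_products)

print("Verbose prompt, estimated tokens:", verbose_total)
print("Concise prompt, estimated tokens:", concise_total)
```

The estimate covers input tokens only; generated output is billed as well, so limiting the response length matters just as much.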

## Summary

1. We built a training pipeline in **MLflow**.
2. We created a feature table in **Feature Store**.
3. We showed how to configure **Vector Search** for RAG.
4. We used **GenAI** to generate responses via Model Serving.
5. We used **AI Functions** to call LLMs directly from SQL.

## Clean up resources

In [0]:
# spark.sql(f"DROP TABLE IF EXISTS {feature_table_name}")
# spark.sql(f"DROP TABLE IF EXISTS {source_table}")
display(spark.createDataFrame([("Status", "Resources kept for further exercises")], ["Info", "Value"]))