# Databricks Data Preparation in ML - Workshop 2
## Vector Search Fundamentals

**Part of the Databricks Data Preparation in ML Training Series**

---

## Workshop Objectives

This workshop introduces Vector Search fundamentals in Databricks, focusing on:
- **Theory**: Understanding Vector Search and embeddings concepts
- **Demo**: Step-by-step implementation with simple datasets
- **Workshop**: Hands-on practice with Amazon Kindle data

## Theory: Vector Search and Embeddings

### What is Vector Search?
Vector Search enables **semantic similarity-based** data retrieval, going beyond exact keyword matching.

**Example:**
- Query: "fantasy book"
- Finds: "magical adventure", "wizard story", "enchanted tale"

### Embeddings - Text as Numbers
Embeddings represent text as numerical vectors:
```
"Harry Potter" → [0.2, -0.5, 0.8, ...]
"Magic story" → [0.3, -0.4, 0.7, ...]
```
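
Texts with related meaning end up with vectors that point in similar directions. A minimal sketch of cosine similarity, one common way to compare embeddings, using only the three components visible in the truncated illustration above (real embeddings have hundreds of dimensions, so the resulting number is purely illustrative):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Only the first three components shown above; illustrative, not real model output
harry_potter = [0.2, -0.5, 0.8]
magic_story = [0.3, -0.4, 0.7]

similarity = cosine_similarity(harry_potter, magic_story)
print(f"Cosine similarity: {similarity:.3f}")
```

A value close to 1 means the vectors point in nearly the same direction; a value near 0 means they are unrelated.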

### Vector Search Process:
1. **Data Preparation** - Text cleaning and preprocessing
2. **Embedding Generation** - Text → numerical vectors  
3. **Index Creation** - Optimized search infrastructure
4. **Semantic Search** - Finding similar content

### ML Applications:
- Semantic search systems
- Recommendation engines
- RAG (Retrieval Augmented Generation)
- Document similarity analysis

## Duration: ~60 minutes
## Level: Intermediate → Advanced

## 1. Environment Setup

### Required Libraries:
- `databricks-vectorsearch` - Native Vector Search
- `sentence-transformers` - Embeddings (alternative to OpenAI)
- Standard PySpark libraries

### Workshop Structure:
- **DEMO**: Simple data + implementation
- **WORKSHOP**: Amazon Kindle dataset + instructions

In [0]:
%pip install databricks-vectorsearch sentence-transformers
dbutils.library.restartPython()

In [0]:
from pyspark.sql.functions import col, concat, lit
from pyspark.sql.types import StructType, StructField, StringType
from databricks.vector_search.client import VectorSearchClient

vs_client = VectorSearchClient()

print("✅ Vector Search client ready")
print("🎯 Environment setup completed! Ready to start Vector Search workshop 🚀")

---

# PART 1: DEMO - Vector Search Implementation

## Demo Objective:
Demonstrate the complete Vector Search process using simple generated data in the notebook.

## 1.1 Generate Sample Data

Create a simple book dataset for Vector Search demonstration.

In [0]:
# Creating sample book data
books_data = [
    ("1", "Harry Potter and the Magic Stone", "A young wizard discovers his magical heritage and attends Hogwarts school", "Fantasy"),
    ("2", "The Hunger Games", "A dystopian story about survival and rebellion in a post-apocalyptic world", "Sci-Fi"),
    ("3", "Pride and Prejudice", "A romantic novel about love, marriage and social expectations in 19th century England", "Romance"),
    ("4", "Lord of the Rings", "An epic fantasy adventure about hobbits, wizards and the battle against evil", "Fantasy"),
    ("5", "1984", "A dystopian novel about totalitarian government surveillance and control", "Sci-Fi"),
    ("6", "Romeo and Juliet", "A tragic love story about two young lovers from feuding families", "Romance"),
    ("7", "Game of Thrones", "A medieval fantasy epic with political intrigue, dragons and magic", "Fantasy"),
    ("8", "The Martian", "A science fiction story about survival on Mars using technology and ingenuity", "Sci-Fi"),
    ("9", "The Notebook", "A heartwarming romance about enduring love that transcends time and memory", "Romance"),
    ("10", "Chronicles of Narnia", "A magical adventure in a fantasy world with talking animals and mythical creatures", "Fantasy")
]

schema = StructType([
    StructField("id", StringType(), True),
    StructField("title", StringType(), True),
    StructField("description", StringType(), True),
    StructField("category", StringType(), True)
])

clean_books_df = spark.createDataFrame(books_data, schema)
display(clean_books_df)

## 1.2 Generate Embeddings

Two methods for creating embeddings:
- **Method 1**: Sentence Transformers (free)
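- **Method 2**: OpenAI `text-embedding-ada-002` (paid API; returns 1536-dimensional vectors, so `embedding_dimension` would need to change accordingly)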

In [0]:
# Method 1: Sentence Transformers (free)
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Convert to pandas for batch processing
texts_df = clean_books_df.toPandas()

# Generate embeddings
embeddings = model.encode(texts_df['description'].tolist())

# Add embeddings to DataFrame
texts_df['embedding'] = embeddings.tolist()

# Check dimensions
print(f"Embedding dimensions: {embeddings.shape}")
print(f"First embedding (first 5 values): {embeddings[0][:5]}")

In [0]:
display(texts_df)

In [0]:
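# Method 2: OpenAI Ada (optional - requires API key and openai>=1.0)
# from openai import OpenAI
# "<scope>" and "<key>" are placeholders for your own Databricks secret
# client = OpenAI(api_key=dbutils.secrets.get(scope="<scope>", key="<key>"))

# def get_openai_embedding(text):
#     response = client.embeddings.create(
#         input=text,
#         model="text-embedding-ada-002"
#     )
#     return response.data[0].embedding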

# For demo we use Sentence Transformers
print("✅ Embeddings generated using Sentence Transformers method")

## 1.3 Delta Table Preparation

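A Delta Sync index in Databricks Vector Search reads from a Delta table, so the embeddings are first saved as one. The source table also needs Change Data Feed enabled so the index can pick up changes.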

In [0]:
# Save as Delta Table
table_name = "data_ml_preparation.vectors.demo_books_embeddings"
spark.createDataFrame(texts_df).write.mode("overwrite").saveAsTable(table_name)

print(f"✅ Delta Table '{table_name}' created")
display(texts_df)

In [0]:
%sql

DESCRIBE DETAIL data_ml_preparation.vectors.demo_books_embeddings

In [0]:
# Enable Change Data Feed on the source table (required for a Delta Sync index)
spark.sql(f"""
    ALTER TABLE {table_name}
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

## 1.4 Creating Vector Search Index

Creating an index for efficient semantic search.

In [0]:
endpoint_name = "demo_books_embeddings_endpoint"
index_name = "data_ml_preparation.vectors.demo_books_embeddings_index"

In [0]:
# Step 1: Create the endpoint, reusing vs_client from the environment setup.
# Provisioning is not instant; wait until the endpoint is online before creating the index.
vs_client.create_endpoint(name=endpoint_name)

In [0]:
vs_client.create_delta_sync_index(
    endpoint_name=endpoint_name,
    index_name=index_name,
    source_table_name=table_name,
    pipeline_type="TRIGGERED",  # sync runs only when triggered, not continuously
    primary_key="id",  # adjust based on your schema
    embedding_dimension=384,
    embedding_vector_column="embedding"
)
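
The `embedding_dimension` passed to the index has to match the length of the vectors stored in the `embedding` column (384 for `all-MiniLM-L6-v2`). A small sketch of a check that could run before the index is created; `check_embedding_dims` is a hypothetical helper, not part of any library, and the rows below are placeholders rather than real embeddings:

```python
def check_embedding_dims(rows, expected_dim, column="embedding"):
    """Return the ids of rows whose embedding length differs from expected_dim."""
    return [row["id"] for row in rows if len(row[column]) != expected_dim]

# Placeholder rows: zero vectors standing in for real embeddings
rows = [
    {"id": "1", "embedding": [0.0] * 384},
    {"id": "2", "embedding": [0.0] * 383},  # deliberately one value short
]

bad_ids = check_embedding_dims(rows, expected_dim=384)
print(f"Rows with unexpected dimension: {bad_ids}")
```

In the demo, the same check could be applied to the pandas DataFrame with `check_embedding_dims(texts_df.to_dict("records"), 384)`.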

## 1.5 Testing Vector Search

Examples of semantic search with different queries.

In [0]:
# Test queries for Vector Search
test_queries = [
    "magical wizard adventure",
    "dystopian future society",
    "romantic love story",
    "space exploration Mars"
]

# The index must finish its initial sync before it can be queried
index = vs_client.get_index(endpoint_name, index_name)

# Encode the first test query with the same model used for the documents
query = test_queries[0]
query_embedding = model.encode([query])[0].tolist()

# Perform similarity search
results = index.similarity_search(
    query_vector=query_embedding,
    columns=["title", "description", "category"],
    num_results=5
)

print(results)


In [0]:
display(results)
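
The raw response is easier to read once it is turned into one record per hit. A sketch under a stated assumption: the response is a dict with column names under `manifest.columns` and rows under `result.data_array`, with the similarity score as the last column. `results_to_records` is a hypothetical helper, and `sample_response` is a hand-written placeholder with made-up scores, not output from a real index.

```python
def results_to_records(response):
    """Turn a similarity_search-style response into a list of dicts."""
    column_names = [column["name"] for column in response["manifest"]["columns"]]
    rows = response.get("result", {}).get("data_array", [])
    return [dict(zip(column_names, row)) for row in rows]

# Hand-written placeholder mimicking the assumed response shape; scores are made up
sample_response = {
    "manifest": {
        "columns": [{"name": "title"}, {"name": "category"}, {"name": "score"}]
    },
    "result": {
        "row_count": 2,
        "data_array": [
            ["Harry Potter and the Magic Stone", "Fantasy", 0.61],
            ["Lord of the Rings", "Fantasy", 0.55],
        ],
    },
}

records = results_to_records(sample_response)
for record in records:
    print(f"{record['score']:.2f}  {record['title']} ({record['category']})")
```

If the actual response has a different shape, `print(results)` from the previous cell shows which keys to adapt.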

---

# PART 2: WORKSHOP - Amazon Kindle Dataset

## 🎯 Objective:
Hands-on implementation of Vector Search on real Amazon Kindle data.

## 📋 Tasks to Complete:
1. Load Amazon Kindle data
2. Clean and prepare data
3. Remove duplicates and handle missing values
4. Generate embeddings (2 methods)
5. Create Vector Search index
6. Test search functionality

## 📂 Data Path:
`...raw/amazon_kindel_data/kindel_data_amazon.csv`

## Task 1: Loading Amazon Kindle Data

**Instructions:**
1. Load data from the given path using `spark.read.csv()`
2. Use options: `header=true`, `inferSchema=true`, `multiline=true`
3. Check data schema (`printSchema()`)
4. Count number of records
5. Display first 5 rows

In [0]:
# TODO: Task 1 - Loading data
# data_path = "abfss://altml@altswdsa.dfs.core.windows.net/raw/amazon_kindel_data/kindel_data_amazon.csv"

# Your code here:

## Task 2: Data Cleaning

**Instructions:**
1. Check missing values in columns: `title`, `description`, `category`
2. Remove duplicates using `distinct()`
3. Filter records where `description` is not null and has length > 20 characters
4. Strip HTML tags from `description` using `regexp_replace`
5. Create `text_for_embedding` column combining title and description
6. Count final records and display examples
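
The tag-stripping and column-combining steps can be prototyped in plain Python before writing the PySpark version. A sketch, assuming a simple `<[^>]+>` pattern is good enough for the tags in the data; the helper names and the sample record are hypothetical and not taken from the Kindle dataset:

```python
import re

TAG_PATTERN = r"<[^>]+>"  # matches anything between '<' and the next '>'

def strip_html(text):
    # Replace tags with a space, then collapse repeated whitespace
    without_tags = re.sub(TAG_PATTERN, " ", text)
    return re.sub(r"\s+", " ", without_tags).strip()

def build_text_for_embedding(title, description):
    # Combine the title and the cleaned description into one string
    return f"{title}. {strip_html(description)}"

# Hypothetical record for illustration
sample = build_text_for_embedding("Sample Title", "<p>A <b>magical</b> adventure.</p>")
print(sample)
```

The same pattern can be passed to PySpark as `regexp_replace(col("description"), "<[^>]+>", " ")`.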

In [0]:
# TODO: Task 2 - Data cleaning

# Check missing values:

# Remove duplicates:

# Filter and clean:

# Your code here:

## Task 3: Generating embeddings

**Instructions:**
1. **Method 1**: Use Sentence Transformers
   - Load model: `SentenceTransformer('all-MiniLM-L6-v2')`
   - Convert data to pandas: `toPandas()`
   - Generate embeddings: `model.encode(texts)`

2. **Method 2**: (Optional) OpenAI API
   - Use model `text-embedding-ada-002`
   - Requires API key in secrets

3. Add embeddings to DataFrame
4. Check embedding dimensions
5. Test similarity between sample texts

In [0]:
# TODO: Task 3 - Embeddings

# Method 1: Sentence Transformers

# Method 2: OpenAI (optional)

# Your code here:

## Task 4: Vector Search Implementation

**Instructions:**
1. **Delta Table Preparation**:
   - Convert embeddings to Spark DataFrame
   - Save as Delta Table: `saveAsTable("kindle_books_embeddings")`
   - Enable Change Data Feed on the table (required for a Delta Sync index)

2. **Vector Search Creation**:
   - Create endpoint: `vs_client.create_endpoint("kindle_endpoint")`
   - Create index: `vs_client.create_delta_sync_index()`
   - Parameters: `embedding_dimension=384`, `embedding_vector_column="embedding"`

3. **Search Testing**:
   - Test queries: "fantasy magic", "romance love", "sci-fi space"
   - Use `similarity_search()` with query embeddings
   - Display top 3 results for each query

In [0]:
# TODO: Task 4 - Vector Search

# Delta Table preparation:

# Vector Search endpoint and index creation:

# Search testing:

# Your code here:

---

## 🎯 Workshop Summary

### What you learned:
1. **Vector Search Theory** - embeddings, semantic similarity
2. **Data Preparation** - text cleaning, handling duplicates
3. **Generating embeddings** - Sentence Transformers vs OpenAI
4. **Databricks Vector Search** - endpoints, indexes, queries
5. **Practical applications** - semantic search, recommendations

### Best practices for ML Associate:
- **Data Quality**: Clean text = better embeddings
- **Model Choice**: Balance between cost and quality
- **Performance**: Batch processing for large datasets
- **Monitoring**: Track embedding quality and search relevance
- **Security**: API keys in Databricks secrets
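
As an illustration of the batch-processing point, a minimal sketch that splits a list of texts into fixed-size chunks so each chunk can be embedded separately. `batched` is a hypothetical helper, and the texts are placeholders:

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

texts = [f"description {i}" for i in range(10)]  # placeholder texts
batch_sizes = [len(batch) for batch in batched(texts, batch_size=4)]
print(batch_sizes)
```

Each batch could then be passed to `model.encode(batch)`; `encode` also accepts its own `batch_size` argument, which may be enough when all the texts fit in memory.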

### Next steps:
- **RAG Implementation**: Integration with LLM
- **Hybrid Search**: Vector + keyword search
- **Production Deployment**: Monitoring and scaling
- **Advanced Features**: Metadata filtering, re-ranking