![](https://raw.githubusercontent.com/Databricks-BR/workshop_agents/refs/heads/main/demo-main/img/header_workshop.png)


| Item | Description |
| --- | --- |
| **Objective** | Create a Vector Search index and endpoint |
| **Databricks Runtime** | DBR 16.4 LTS |
| **Language** | Python, PySpark, and SQL |


![](https://raw.githubusercontent.com/Databricks-BR/workshop_agents/refs/heads/main/demo-main/img/img/04_diagram.png)


## Querying Unstructured Data
<img src="https://www.databricks.com/sites/default/files/2024-01/db-vector-search-image-01_0.png?v=1705100714" style="width: 800px; margin-left: 10px">

Often the data we need is unstructured, or we want results that are semantically close to a query rather than an exact match.

**Databricks Vector Search** is a serverless vector database, **integrated** seamlessly into the Data Intelligence Platform.

Databricks Vector Search supports **automatic synchronization** of data from the source table to the index, which removes the need to build and maintain a separate ingestion pipeline.

It uses the same **security and governance** tools that organizations have already set up on the platform.

With its serverless design, Databricks Vector Search **scales** easily to support billions of embeddings and thousands of real-time queries per second.
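
To make "not an exact search" concrete, the sketch below ranks documents by cosine similarity between embedding vectors. It is a conceptual illustration only, not how Databricks Vector Search is implemented: the three-dimensional vectors and document names are made up, and real embeddings come from an embedding model and have many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", invented for this illustration.
docs = {
    "shipping times": [0.9, 0.1, 0.0],
    "return policy": [0.1, 0.9, 0.1],
    "advertising": [0.0, 0.2, 0.9],
}

def top_k(query_vec, k=1):
    """Return the names of the k documents most similar to the query vector."""
    ranked = sorted(
        docs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]

# Pretend this is the embedding of "When will my order arrive?".
query = [0.8, 0.2, 0.1]
best = top_k(query)
print(best)
```

The query shares no keywords with "shipping times", yet that document ranks first because its vector points in nearly the same direction.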

### Creating Vector Search

In [0]:
# Databricks Notebook: Parse PDF using Python and Save to Delta

# ==============================================================================
# Step 0: Installs and Imports
# ==============================================================================
# This notebook reads a PDF, extracts question and answer pairs using a robust
# Python-based parser, and saves the result as a Delta Table.
# This method works on any Databricks cluster.

%pip install pypdf --quiet

dbutils.library.restartPython()

import re
import pandas as pd

# ==============================================================================
# Step 1: Configuration
# ==============================================================================
# Please fill in these variables with your own Unity Catalog names.
CATALOG = "vinicius_fialho_testes"
SCHEMA = "workshop_ml_agentes"
VOLUME = "unstructured_data"
PDF_FILE_NAME = "faq-meli-ir.pdf"
DELTA_TABLE_NAME = "faq"

# --- Construct full paths ---
PDF_PATH = f"/Volumes/{CATALOG}/{SCHEMA}/{VOLUME}/{PDF_FILE_NAME}"
DELTA_TABLE_FQN = f"{CATALOG}.{SCHEMA}.{DELTA_TABLE_NAME}"

print(f"Reading PDF from: {PDF_PATH}")
print(f"Will create Delta Table at: {DELTA_TABLE_FQN}")

# ==============================================================================
# Step 2: Extract Questions and Answers from the PDF
# ==============================================================================
# Questions are located with a regular expression (Python's `re` module).

def extract_qa_pairs_from_pdf(pdf_path):
    """
    Extracts text from a PDF and splits it into (question, answer) pairs
    based on the FAQ structure (Question in ALL CAPS ending with '?' followed by the answer).
    """
    try:
        from pypdf import PdfReader
        reader = PdfReader(pdf_path)
        # Join pages with newlines and pad both ends so the regex can match
        # questions at page boundaries and at the start/end of the document.
        raw_text = "\n" + "\n".join((page.extract_text() or "") for page in reader.pages) + "\n"

        # This pattern finds multi-line, all-caps questions ending in '?'.
        # We will use finditer to get the exact start and end positions of each question.
        question_pattern = r'(\n[A-Z][A-Z\s,]+\?\n)'
        matches = list(re.finditer(question_pattern, raw_text))

        if not matches:
            print("Warning: Regex did not find any matching Q&A patterns. The PDF structure might be different than expected.")
            return []

        # --- Stage 1: Initial rough extraction ---
        qa_pairs = []
        for i, match in enumerate(matches):
            question_text = match.group(1)
            answer_start_pos = match.end()
            answer_end_pos = matches[i+1].start() if i + 1 < len(matches) else len(raw_text)
            answer_text = raw_text[answer_start_pos:answer_end_pos]
            qa_pairs.append([ # Use a list to make it mutable
                " ".join(question_text.strip().split()),
                " ".join(answer_text.strip().split())
            ])
        
        # --- Stage 2: Cleanup process to fix question fragments in answers ---
        # We iterate backwards to safely modify the next item in the list.
        for i in range(len(qa_pairs) - 2, -1, -1):
            current_answer = qa_pairs[i][1]
            
            # This pattern implements the rule: a period, a space, and then at least three ALL-CAPS words.
            # Such a run marks the start of a question fragment that spilled into the answer.
            spillover_pattern = r'\. (?=([A-Z\']{2,}\s){2,}[A-Z\']{2,})'
            match = re.search(spillover_pattern, current_answer)
            
            if match:
                split_point = match.start()
                
                # The text that belongs to the next question
                spillover_text = current_answer[split_point+2:].strip()
                
                # The corrected answer for the current question
                corrected_answer = current_answer[:split_point+1].strip()
                
                # Update the current answer
                qa_pairs[i][1] = corrected_answer
                
                # Prepend the fragment to the next question
                qa_pairs[i+1][0] = f"{spillover_text} {qa_pairs[i+1][0]}"

        # Convert list of lists back to list of tuples for the final output
        final_pairs = [tuple(pair) for pair in qa_pairs if pair[0] and pair[1]]

        print(f"Successfully extracted and cleaned {len(final_pairs)} Question/Answer pairs.")
        return final_pairs
        
    except Exception as e:
        print(f"Error reading or processing PDF: {e}")
        return []

qa_pairs = extract_qa_pairs_from_pdf(PDF_PATH)

# ==============================================================================
# Step 3: Create and Save the Delta Table
# ==============================================================================
# This section creates the table with id, question, and answer columns.

def create_qa_delta_table(qa_data, table_fqn):
    """
    Creates a DataFrame with the Q&A data and saves it as a Delta Table.
    """
    if not qa_data:
        print("No data to process. Halting execution.")
        return
    
    # Creates a pandas DataFrame with 'question' and 'answer' columns.
    df_pandas = pd.DataFrame(qa_data, columns=['question', 'answer'])
    
    # Adds a sequential index column.
    df_pandas.reset_index(inplace=True)
    
    # Renames the generated 'index' column to 'id'.
    df_pandas = df_pandas.rename(columns={'index': 'id'})
    
    # Creates a Spark DataFrame and saves it.
    spark_df = spark.createDataFrame(df_pandas)
    print(f"\nWriting to Delta Table: {table_fqn}...")
    spark_df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").saveAsTable(table_fqn)
    print("="*80)
    print("SUCCESS! Your Delta Table with questions and answers is ready.")
    print("="*80)
    display(spark.table(table_fqn).limit(5))

# Run the final function
create_qa_delta_table(qa_pairs, DELTA_TABLE_FQN)
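
To see what the two regular expressions in the cell above do, the sketch below runs them on a small FAQ text. The sample text is invented for illustration and does not come from the actual PDF; the patterns are copied from the cell.

```python
import re

# Patterns copied from the extraction cell.
question_pattern = r'(\n[A-Z][A-Z\s,]+\?\n)'
spillover_pattern = r'\. (?=([A-Z\']{2,}\s){2,}[A-Z\']{2,})'

# Hypothetical FAQ text, made up for this example.
sample_text = (
    "\n"
    "HOW DOES SHIPPING WORK?\n"
    "Orders are shipped by partner carriers.\n"
    "WHAT IS THE RETURN POLICY?\n"
    "Items can be returned within 30 days.\n"
)

# Each answer runs from the end of its question to the start of the next one.
matches = list(re.finditer(question_pattern, sample_text))
pairs = []
for i, match in enumerate(matches):
    end = matches[i + 1].start() if i + 1 < len(matches) else len(sample_text)
    question = " ".join(match.group(1).split())
    answer = " ".join(sample_text[match.end():end].split())
    pairs.append((question, answer))

# Spillover cleanup: an answer that ends with the first words of the next question.
dirty_answer = "Orders are shipped by partner carriers. WHAT IS THE"
hit = re.search(spillover_pattern, dirty_answer)
corrected = dirty_answer[: hit.start() + 1]   # keep the period
fragment = dirty_answer[hit.start() + 2 :]    # text that belongs to the next question

print(pairs)
print(corrected, "|", fragment)
```

Note that the question pattern only accepts the letters A-Z, whitespace, and commas, so a heading containing digits, accented letters, or apostrophes would not be detected as a question.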


In [0]:
%sql
SELECT *
FROM vinicius_fialho_testes.workshop_ml_agentes.faq
LIMIT 10

In [0]:
%sql
-- Enable Change Data Feed on the `FAQ` table
-- This configuration allows Vector Search to read inserted, deleted, or updated data in the FAQ incrementally.

ALTER TABLE vinicius_fialho_testes.workshop_ml_agentes.faq SET TBLPROPERTIES (delta.enableChangeDataFeed = true)

Next, create the Vector Search endpoint and index through the UI.

[Our Vector Search Index](https://e2-demo-field-eng.cloud.databricks.com/explore/data/vinicius_fialho_testes/workshop_ml_agentes/faq_vs?o=1444828305810485&activeTab=overview)

![](/Workspace/Users/vinicius.fialho@databricks.com/Hunter/mercado_libre/workshop_01/img/04_VS Endpoint)
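
The workshop uses the UI, but the same setup can also be scripted. The sketch below is one possible way to do it with the `databricks-vectorsearch` Python package, under several assumptions that are not in the original notebook: the endpoint name `faq_vs_endpoint`, the embedding model endpoint `databricks-gte-large-en`, and the choice of `question` as the column to embed are all hypothetical and should be replaced with the values picked in the UI.

```python
def build_index_spec(catalog, schema, table, endpoint_name, embedding_model):
    """Collect the keyword arguments for a Delta Sync index on the FAQ table."""
    return {
        "endpoint_name": endpoint_name,
        "source_table_name": f"{catalog}.{schema}.{table}",
        "index_name": f"{catalog}.{schema}.{table}_vs",
        "pipeline_type": "TRIGGERED",            # sync on demand, not continuously
        "primary_key": "id",
        "embedding_source_column": "question",   # assumption: embed the question text
        "embedding_model_endpoint_name": embedding_model,
    }

def create_endpoint_and_index(spec):
    """Create the endpoint and the index. Requires `%pip install databricks-vectorsearch`."""
    # Imported lazily so the spec above can be built without the package installed.
    from databricks.vector_search.client import VectorSearchClient

    client = VectorSearchClient()
    client.create_endpoint(name=spec["endpoint_name"], endpoint_type="STANDARD")
    # Endpoint provisioning can take several minutes; the index call is expected
    # to succeed only once the endpoint is online.
    return client.create_delta_sync_index(**spec)

spec = build_index_spec(
    catalog="vinicius_fialho_testes",
    schema="workshop_ml_agentes",
    table="faq",
    endpoint_name="faq_vs_endpoint",            # hypothetical name
    embedding_model="databricks-gte-large-en",  # assumption: any available embedding endpoint
)
print(spec["index_name"])
```

Calling `create_endpoint_and_index(spec)` inside a Databricks notebook would then perform the actual creation. The source table needs Change Data Feed enabled, which the previous cell takes care of.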

### Querying the FAQ

Knowledge bases, such as FAQs, service scripts, and compliance rules, can be easily indexed with Databricks Vector Search. Once they are indexed, we can:

* Identify relevant documents
* Increase the accuracy of responses from generative AI models
* Avoid pre-training or fine-tuning these models
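
The points above describe retrieval-augmented generation: the retrieved FAQ entries are placed in the model's prompt instead of being trained into the model. A minimal sketch of that step follows. The `build_prompt` helper, the prompt wording, and the sample row are all hypothetical; in practice the rows would come from a vector search over the FAQ index.

```python
def build_prompt(user_question, retrieved_rows):
    """Combine retrieved FAQ entries and the user's question into one prompt."""
    context = "\n\n".join(
        f"Q: {row['question']}\nA: {row['answer']}" for row in retrieved_rows
    )
    return (
        "Answer the user's question using only the FAQ entries below.\n\n"
        f"FAQ entries:\n{context}\n\n"
        f"User question: {user_question}"
    )

# Hypothetical retrieved row, shaped like the id/question/answer columns of the FAQ table.
rows = [
    {
        "id": 0,
        "question": "HOW DOES SHIPPING WORK?",
        "answer": "Orders are shipped by partner carriers.",
    }
]

prompt = build_prompt("When will my order arrive?", rows)
print(prompt)
```

Because the knowledge lives in the prompt, updating the FAQ table and re-syncing the index changes the model's answers without any retraining.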

In [0]:
%sql
CREATE OR REPLACE FUNCTION vinicius_fialho_testes.workshop_ml_agentes.search_faq(question STRING)
RETURNS TABLE(id LONG, question STRING, answer STRING)
COMMENT 'Use this function to query the knowledge base about delivery times, exchange or return requests, among other frequently asked questions about our marketplace'
RETURN SELECT id, question, answer FROM vector_search(
  index => 'vinicius_fialho_testes.workshop_ml_agentes.faq_vs', 
  query => search_faq.question,
  num_results => 1
)

In [0]:
%sql
SELECT *
FROM vinicius_fialho_testes.workshop_ml_agentes.search_faq("Does Meli generate revenue from logistics?")

In [0]:
%sql
SELECT *
FROM vinicius_fialho_testes.workshop_ml_agentes.search_faq("¿Donde se muestran los anuncios de Mercado Libre?")