# Azure AI Search - Creating and updating an index from local files

This code demonstrates how to use the Azure AI Search push API to insert vectors into your search index:

+ Create an index schema
+ Load the sample data from a local folder
+ Embed the documents in-memory using Azure OpenAI's text-embedding-ada-002 model
+ Index the vector and nonvector fields on Azure AI Search
+ Run a series of vector and hybrid (text + vectors) queries, including metadata filtering and semantic ranking

The code uses Azure OpenAI to generate embeddings for the content of each document chunk. You'll need access to Azure OpenAI to run this demo.

The code reads the PDF files in the `data` folder, which holds the input data for which embeddings are generated.

The output is a combination of human-readable text and embeddings that can be pushed into a search index.
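
As a rough sketch of that output, each chunk becomes one JSON record shaped like the one below. The field names match the index schema used later in this notebook. The `make_record` helper and the three-element vector are illustrative assumptions only (real text-embedding-ada-002 vectors have 1,536 dimensions).

```python
import json
import uuid


def make_record(title, content, vector):
    """Build one search document (hypothetical helper, for illustration only)."""
    return {
        "id": str(uuid.uuid4()),   # unique document key
        "title": title,            # source file name
        "content": content,        # human-readable chunk text
        "contentVector": vector,   # embedding of the chunk text
    }


# Toy vector for readability; a real embedding would hold 1536 floats.
record = make_record("example.pdf", "Sample chunk text.", [0.01, -0.02, 0.03])
print(json.dumps(record, indent=2))
```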

## Prerequisites

+ An Azure subscription, with [access to Azure OpenAI](https://aka.ms/oai/access). You must have the Azure OpenAI service name and an API key.

+ A deployment of the text-embedding-ada-002 embedding model.

+ Azure AI Search, any tier, but choose a service that has sufficient capacity for your vector index. We recommend Basic or higher. [Enable semantic ranking](https://learn.microsoft.com/azure/search/semantic-how-to-enable-disable) if you want to run the hybrid query with semantic ranking.

+ A `.env` file filled out with the correct values for your services.
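
A quick way to catch an incomplete `.env` file is to check for empty settings once the file has been loaded. The sketch below is one possible check, not part of the original notebook. The variable names are the ones the notebook reads later, and `missing_vars` is a hypothetical helper.

```python
import os

# Variable names read later in this notebook.
REQUIRED_VARS = [
    "AZURE_SEARCH_SERVICE_ENDPOINT",
    "AZURE_SEARCH_ADMIN_KEY",
    "AZURE_SEARCH_INDEX",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_KEY",
    "AZURE_OPENAI_EMBEDDING_DEPLOYMENT",
    "AZURE_OPENAI_API_VERSION",
]


def missing_vars(env=None):
    """Return the required variable names that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_vars()
if missing:
    print("Missing settings:", ", ".join(missing))
else:
    print("All required settings are present.")
```

Running this after `load_dotenv` reports any setting that still needs a value before the clients are created.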


### Install packages

In [1]:
! pip install -r requirements.txt --quiet

## Import required libraries and environment variables

<span style="color:red">Make sure to change `local.env` to `.env` in the `load_dotenv` call below</span>.

In [8]:
from dotenv import load_dotenv
import os
import uuid
import json
from azure.core.credentials import AzureKeyCredential
from PyPDF2 import PdfReader
from openai import AzureOpenAI
from azure.search.documents.models import VectorizedQuery
from azure.search.documents import SearchClient
from azure.search.documents import SearchIndexingBufferedSender


load_dotenv("./local.env", override=True) # change this to .env for your own environment

# The following variables from your .env file are used in this notebook
endpoint = os.environ["AZURE_SEARCH_SERVICE_ENDPOINT"]
credential = AzureKeyCredential(os.environ["AZURE_SEARCH_ADMIN_KEY"])
index_name = os.environ["AZURE_SEARCH_INDEX"]
azure_openai_endpoint = os.environ["AZURE_OPENAI_ENDPOINT"]
azure_openai_key = os.environ["AZURE_OPENAI_KEY"] if len(os.environ["AZURE_OPENAI_KEY"]) > 0 else None
azure_openai_embedding_deployment = os.environ["AZURE_OPENAI_EMBEDDING_DEPLOYMENT"]
azure_openai_api_version = os.environ["AZURE_OPENAI_API_VERSION"]

# Initialize AzureOpenAI client
client = AzureOpenAI(
    azure_deployment=azure_openai_embedding_deployment,
    api_version=azure_openai_api_version,
    azure_endpoint=azure_openai_endpoint,
    api_key=azure_openai_key
)

# Initialize the SearchClient
search_client = SearchClient(
    endpoint=endpoint,
    index_name=index_name,
    credential=credential
)

## Prep Data & Create embeddings
Read your data, generate Azure OpenAI embeddings, and export them in a format that can be inserted into your Azure AI Search index.

Each chunk requires a separate embedding call, so you may run into timeouts if the `data` folder contains a lot of PDFs.
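
One common way to soften such timeouts is to retry failed calls with exponential backoff. The sketch below is an assumption about how that could look, not something the notebook does. `with_retries` and `flaky_embedding_call` are hypothetical names, and a simulated failure stands in for the real embedding call so the example runs on its own. In practice you would likely catch only timeout or rate-limit errors rather than every exception.

```python
import random
import time


def with_retries(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func() and retry with exponential backoff plus jitter on any exception."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait 1x, 2x, 4x, ... the base delay, plus up to 10% random jitter.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay * 0.1))


# Self-contained demo: a stand-in that fails twice, then succeeds.
calls = {"count": 0}


def flaky_embedding_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return [0.1, 0.2, 0.3]


vector = with_retries(flaky_embedding_call, base_delay=0.01)
print(vector, "after", calls["count"], "attempts")
```

In the notebook, the same wrapper could be applied to the embedding request, for example `with_retries(lambda: client.embeddings.create(input=chunk, model=azure_openai_embedding_deployment).data[0].embedding)`.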

In [3]:


def process_pdf(file_path):
    pdf_reader = PdfReader(file_path)
    num_pages = len(pdf_reader.pages)
    chunks = []
    for i in range(0, num_pages, 5):
        chunk_text = ""
        for page in pdf_reader.pages[i:min(i + 5, num_pages)]:
            chunk_text += page.extract_text()
        chunks.append(chunk_text)
    return chunks

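# Vectorize text with the Azure OpenAI embedding deployment.
# The client call itself is synchronous; `async`/`await` works here because Jupyter supports top-level await.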
async def vectorize_text(text):
    content_response = client.embeddings.create(input=text, model=azure_openai_embedding_deployment)
    return content_response.data[0].embedding

data_folder = "data"
for file_name in os.listdir(data_folder):
    if file_name.endswith(".pdf"):
        file_path = os.path.join(data_folder, file_name)
        chunks = process_pdf(file_path)
        print(len(chunks))
        file_chunks = []
        for chunk in chunks:
            vector = await vectorize_text(chunk)
            file_chunks.append({
                "id": str(uuid.uuid4()),
                "title": file_name,
                "content": chunk,
                "contentVector": vector
            })
        # Write this file's chunks and embeddings to output/<file_name>.json
        output_path = os.path.join('.', 'output', f"{file_name}.json")
        output_directory = os.path.dirname(output_path)
        # Create the output folder if it does not exist yet
        os.makedirs(output_directory, exist_ok=True)
        with open(output_path, "w") as f:
            json.dump(file_chunks, f)

22
21


## Create your search index

Create your search index schema and vector search configuration. If you get an error, check the search service for available quota and check the .env file to make sure you're using a unique search index name.

In [4]:
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SimpleField,
    SearchFieldDataType,
    SearchableField,
    SearchField,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchProfile,
    SemanticConfiguration,
    SemanticPrioritizedFields,
    SemanticField,
    SemanticSearch,
    SearchIndex
)

# Create a search index
index_client = SearchIndexClient(
    endpoint=endpoint, credential=credential)
fields = [
    SimpleField(name="id", type=SearchFieldDataType.String, key=True, sortable=True, filterable=True, facetable=True),
    SearchableField(name="title", type=SearchFieldDataType.String, filterable=True),
    SearchableField(name="content", type=SearchFieldDataType.String),
    SearchField(name="contentVector", type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
                searchable=True, vector_search_dimensions=1536, vector_search_profile_name="myHnswProfile"),
]

# Configure vector search: an HNSW algorithm and a profile that references it
vector_search = VectorSearch(
    algorithms=[
        HnswAlgorithmConfiguration(
            name="myHnsw"
        )
    ],
    profiles=[
        VectorSearchProfile(
            name="myHnswProfile",
            algorithm_configuration_name="myHnsw",
        )
    ]
)

semantic_config = SemanticConfiguration(
    name="my-semantic-config",
    prioritized_fields=SemanticPrioritizedFields(
        title_field=SemanticField(field_name="title"),
        content_fields=[SemanticField(field_name="content")]
    )
)

# Create the semantic settings with the configuration
semantic_search = SemanticSearch(configurations=[semantic_config])

# Create the search index with the semantic settings
index = SearchIndex(name=index_name, fields=fields,
                    vector_search=vector_search, semantic_search=semantic_search)
result = index_client.create_or_update_index(index)
print(f' {result.name} created')


 local-test-index created


## Insert text and embeddings into index
Take all the `.json` files from the `output` folder and upload their contents to the index.
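
Before uploading, it can help to confirm that every document matches the index schema, since a vector with the wrong length is likely to be rejected. The check below is a sketch under that assumption. `find_invalid_documents` is a hypothetical helper, and the sample documents are made up.

```python
EXPECTED_DIMENSIONS = 1536  # matches vector_search_dimensions in the index schema
REQUIRED_FIELDS = ("id", "title", "content", "contentVector")


def find_invalid_documents(documents, dimensions=EXPECTED_DIMENSIONS):
    """Return (position, reason) pairs for documents that would not fit the schema."""
    problems = []
    for position, doc in enumerate(documents):
        missing = [field for field in REQUIRED_FIELDS if field not in doc]
        if missing:
            problems.append((position, f"missing fields: {missing}"))
        elif len(doc["contentVector"]) != dimensions:
            problems.append((position, f"vector has {len(doc['contentVector'])} dimensions"))
    return problems


# Made-up sample: one valid document and two broken ones.
sample = [
    {"id": "1", "title": "a.pdf", "content": "ok", "contentVector": [0.0] * 1536},
    {"id": "2", "title": "b.pdf", "content": "short vector", "contentVector": [0.0] * 3},
    {"id": "3", "title": "c.pdf", "content": "no vector"},
]
for position, reason in find_invalid_documents(sample):
    print(f"Document {position}: {reason}")
```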

In [5]:

# Directory containing the JSON files
json_directory = "output"

# Load and update documents from JSON files
for filename in os.listdir(json_directory):
    if filename.endswith(".json"):
        file_path = os.path.join(json_directory, filename)
        with open(file_path, 'r') as file:
            document = json.load(file)
            # Use SearchIndexingBufferedSender to upload the documents in batches optimized for indexing  
            with SearchIndexingBufferedSender(  
                endpoint=endpoint,  
                index_name=index_name,  
                credential=credential,  
            ) as batch_client:  
                # Add upload actions for all documents  
                batch_client.upload_documents(documents=document) 
            print(f"Uploaded {len(document)} documents in total")
print("Documents updated successfully.")


Uploaded 22 documents in total
Uploaded 21 documents in total
Documents updated successfully.


<span style="color:green">It may take a couple of minutes for the index to be populated with the batched data.</span>
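
Rather than waiting a fixed time, you could poll the document count until it reaches the expected number. The sketch below assumes a polling approach that the notebook itself does not use. `wait_for_count` is a hypothetical helper, and a fake counter stands in for the search service so the example runs offline.

```python
import time


def wait_for_count(get_count, expected, timeout=120.0, interval=5.0, sleep=time.sleep):
    """Poll get_count() until it reaches `expected` or `timeout` seconds pass."""
    waited = 0.0
    while True:
        count = get_count()
        if count >= expected:
            return count
        if waited >= timeout:
            raise TimeoutError(f"Only {count} of {expected} documents are visible")
        sleep(interval)
        waited += interval


# Demo with a stand-in counter that grows on every poll.
fake_counts = iter([0, 22, 43])
print(wait_for_count(lambda: next(fake_counts), expected=43, interval=0.01))
```

Against the real service, the counter could be `search_client.get_document_count`, with `expected` set to the total number of uploaded chunks (43 in the run shown above).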

---

## Perform a vector similarity search

This example shows a pure vector search. The query text is embedded client-side with Azure OpenAI, and the resulting vector is passed to the index as a `VectorizedQuery`.

In [None]:
# Pure Vector Search
query = "deductible for health plus plan"
  
embedding = client.embeddings.create(input=query, model=azure_openai_embedding_deployment).data[0].embedding
vector_query = VectorizedQuery(vector=embedding, k_nearest_neighbors=3, fields="contentVector")
  
results = search_client.search(  
    search_text=None,  
    vector_queries= [vector_query],
    select=["title", "content"],
)  
  
for result in results:  
    print(f"Title: {result['title']}")  
    print(f"Score: {result['@search.score']}")  
    print(f"Content: {result['content']}")  


This example shows a pure vector search to demonstrate OpenAI's text-embedding-ada-002 multilingual capabilities.

In [None]:
# Pure Vector Search multi-lingual: the query below is in English;
# replace it with a translation (for example, in Dutch) to test cross-lingual retrieval
query = "deductible for health plus plan"
  
embedding = client.embeddings.create(input=query, model=azure_openai_embedding_deployment).data[0].embedding
vector_query = VectorizedQuery(vector=embedding, k_nearest_neighbors=3, fields="contentVector")

results = search_client.search(  
    search_text=None,  
    vector_queries= [vector_query],
    select=["title", "content"],
)  
  
for result in results:  
    print(f"Title: {result['title']}")  
    print(f"Score: {result['@search.score']}")  
    print(f"Content: {result['content']}")  



## Perform an Exhaustive KNN exact nearest neighbor search

This example shows how to search your vector field exhaustively, regardless of whether the index is configured with HNSW or exhaustive KNN. You can use the exact results to calculate ground-truth values.
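
To make the idea of ground truth concrete, the sketch below runs a brute-force nearest neighbor search over toy vectors and measures how much of the exact top-k an approximate result recovered. It assumes cosine similarity as the metric. All names and data are made up for illustration and do not call Azure AI Search.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_knn(query, vectors, k):
    """Score every stored vector against the query and return the top-k ids."""
    scored = sorted(
        vectors.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:k]]


def recall_at_k(approximate_ids, exact_ids):
    """Fraction of the exact top-k that the approximate search also returned."""
    return len(set(approximate_ids) & set(exact_ids)) / len(exact_ids)


# Toy 2-D vectors keyed by document id (made-up data).
vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
ground_truth = exact_knn([1.0, 0.05], vectors, k=2)
print(ground_truth)                            # exact top-2 ids
print(recall_at_k(["a", "c"], ground_truth))   # recall of a pretend approximate result
```

The same comparison could be made between the ids returned by a default query and the ids returned with `exhaustive=True`.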

In [None]:
# Exhaustive KNN (exact) vector search
query = "deductible for health plus plan"
  
embedding = client.embeddings.create(input=query, model=azure_openai_embedding_deployment).data[0].embedding
vector_query = VectorizedQuery(vector=embedding, k_nearest_neighbors=3, fields="contentVector", exhaustive=True)
  
results = search_client.search(  
    search_text=None,  
    vector_queries= [vector_query],
    select=["title", "content"],
)  
  
for result in results:  
    print(f"Title: {result['title']}")  
    print(f"Score: {result['@search.score']}")  
    print(f"Content: {result['content']}")  


## Perform a Pure Vector Search with a filter
This example shows how to apply a filter to a vector query. You can choose between pre-filtering (the default) and post-filtering through the `vector_filter_mode` parameter.
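
The difference between the two modes can be illustrated with a toy in-memory search. In the sketch below, pre-filtering narrows the candidates before ranking, while post-filtering ranks everything and then drops hits that fail the filter, so it can return fewer than `k` results. The functions and data are hypothetical and only mimic the behavior conceptually.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query, docs, k):
    """Return the k documents closest to the query vector."""
    return sorted(docs, key=lambda d: squared_distance(query, d["vector"]))[:k]


def pre_filter_search(query, docs, predicate, k):
    """Filter first, then rank only the documents that passed."""
    return top_k(query, [d for d in docs if predicate(d)], k)


def post_filter_search(query, docs, predicate, k):
    """Rank everything first, then drop top-k hits that fail the filter."""
    return [d for d in top_k(query, docs, k) if predicate(d)]


def is_plus(doc):
    """Toy filter, comparable to title eq 'plus.pdf'."""
    return doc["title"] == "plus.pdf"


# Made-up documents with 2-D vectors.
docs = [
    {"id": "1", "title": "plus.pdf", "vector": [0.9, 0.1]},
    {"id": "2", "title": "standard.pdf", "vector": [1.0, 0.0]},
    {"id": "3", "title": "standard.pdf", "vector": [0.95, 0.05]},
    {"id": "4", "title": "plus.pdf", "vector": [0.0, 1.0]},
]
query_vector = [1.0, 0.0]

pre_ids = [d["id"] for d in pre_filter_search(query_vector, docs, is_plus, k=2)]
post_ids = [d["id"] for d in post_filter_search(query_vector, docs, is_plus, k=2)]
print("pre-filter:", pre_ids)
print("post-filter:", post_ids)
```

Here the two nearest documents both fail the filter, so the post-filter variant returns nothing while the pre-filter variant still returns two matches.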

In [None]:
from azure.search.documents.models import VectorFilterMode

# Pure Vector Search with a filter
query = "deductible for health plus plan"
  
embedding = client.embeddings.create(input=query, model=azure_openai_embedding_deployment).data[0].embedding
vector_query = VectorizedQuery(vector=embedding, k_nearest_neighbors=3, fields="contentVector")

results = search_client.search(  
    search_text=None,  
    vector_queries= [vector_query],
    vector_filter_mode=VectorFilterMode.PRE_FILTER,
    filter="title eq 'Northwind_Health_Plus_Benefits_Details.pdf'",
    select=["title", "content"],
)
  
for result in results:  
    print(f"Title: {result['title']}")  
    print(f"Score: {result['@search.score']}")  
    print(f"Content: {result['content']}")  


## Perform a Hybrid Search
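
A hybrid query sends the same query as both keyword text and a vector, then merges the two ranked lists. Azure AI Search documents Reciprocal Rank Fusion (RRF) as the merging method. The sketch below only illustrates the idea with made-up rankings. The function name and the constant `k=60` are assumptions taken from common RRF usage, not from this notebook.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked id lists: each list adds 1 / (k + rank) to a document's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


text_ranking = ["doc2", "doc1", "doc3"]    # made-up keyword order
vector_ranking = ["doc1", "doc4", "doc2"]  # made-up vector order
fused = reciprocal_rank_fusion([text_ranking, vector_ranking])
for doc_id, score in fused:
    print(doc_id, round(score, 4))
```

Documents that rank well in both lists rise to the top, which is why the hybrid scores differ in scale from the pure vector scores.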

In [None]:
# Hybrid Search
query = "deductible for health plus plan"
  
embedding = client.embeddings.create(input=query, model=azure_openai_embedding_deployment).data[0].embedding
vector_query = VectorizedQuery(vector=embedding, k_nearest_neighbors=3, fields="contentVector")

results = search_client.search(  
    search_text=query,  
    vector_queries=[vector_query],
    select=["title", "content"],
    top=3
)  
  
for result in results:  
    print(f"Title: {result['title']}")  
    print(f"Score: {result['@search.score']}")  
    print(f"Content: {result['content']}")  


## Perform a Semantic Hybrid Search

In [None]:
from azure.search.documents.models import QueryType, QueryCaptionType, QueryAnswerType

# Semantic Hybrid Search
query = "deductible for health plus plan"

embedding = client.embeddings.create(input=query, model=azure_openai_embedding_deployment).data[0].embedding
vector_query = VectorizedQuery(vector=embedding, k_nearest_neighbors=3, fields="contentVector", exhaustive=True)

results = search_client.search(  
    search_text=query,  
    vector_queries=[vector_query],
    select=["title", "content"],
    query_type=QueryType.SEMANTIC,
    semantic_configuration_name='my-semantic-config',
    query_caption=QueryCaptionType.EXTRACTIVE,
    query_answer=QueryAnswerType.EXTRACTIVE,
    top=3
)

# get_answers() may return None when no semantic answer is found
semantic_answers = results.get_answers() or []
for answer in semantic_answers:
    if answer.highlights:
        print(f"Semantic Answer: {answer.highlights}")
    else:
        print(f"Semantic Answer: {answer.text}")
    print(f"Semantic Answer Score: {answer.score}\n")

for result in results:
    print(f"Title: {result['title']}")
    print(f"Reranker Score: {result['@search.reranker_score']}")
    print(f"Content: {result['content']}")

    captions = result["@search.captions"]
    if captions:
        caption = captions[0]
        if caption.highlights:
            print(f"Caption: {caption.highlights}\n")
        else:
            print(f"Caption: {caption.text}\n")
