# Azure AI Search integrated vectorization sample

This Python notebook demonstrates the [integrated vectorization](https://learn.microsoft.com/azure/search/vector-search-integrated-vectorization) features of Azure AI Search that are currently in public preview. 

Integrated vectorization depends on indexers and skillsets. It uses the Text Split skill for data chunking, and the AzureOpenAIEmbedding skill with your Azure OpenAI resource for embedding.

This example uses PDFs from the `data/documents` folder for chunking, embedding, indexing, and queries.

### Prerequisites

+ An Azure subscription, with [access to Azure OpenAI](https://aka.ms/oai/access).
 
+ Azure AI Search, any tier, but we recommend Basic or higher for this workload. [Enable semantic ranker](https://learn.microsoft.com/azure/search/semantic-how-to-enable-disable) if you want to run a hybrid query with semantic ranking.

+ A deployment of the `text-embedding-ada-002` model on Azure OpenAI.

+ Azure Blob Storage. This notebook connects to your storage account and loads a container with the sample PDFs.


### Set up a Python virtual environment in Visual Studio Code

1. Open the Command Palette (Ctrl+Shift+P).
1. Search for **Python: Create Environment**.
1. Select **Venv**.
1. Select a Python interpreter. Choose 3.10 or later.

It can take a minute to set up. If you run into problems, see [Python environments in VS Code](https://code.visualstudio.com/docs/python/environments).

### Install packages

In [None]:
! pip install -r azure-search-integrated-vectorization-sample-requirements.txt --quiet

### Load .env file (Copy .env-sample to .env and update accordingly)

In [19]:
from dotenv import load_dotenv
from azure.identity import DefaultAzureCredential
from azure.core.credentials import AzureKeyCredential
import os

load_dotenv(override=True) # take environment variables from .env.

# Variables not used here do not need to be updated in your .env file
endpoint = os.environ["AZURE_SEARCH_SERVICE_ENDPOINT"]
credential = AzureKeyCredential(os.getenv("AZURE_SEARCH_ADMIN_KEY")) if os.getenv("AZURE_SEARCH_ADMIN_KEY") else DefaultAzureCredential()
index_name = os.getenv("AZURE_SEARCH_INDEX", "int-vec")
blob_connection_string = os.environ["BLOB_CONNECTION_STRING"]
# search blob datasource connection string is optional - defaults to blob connection string
# This field is only necessary if you are using MI to connect to the data source
# https://learn.microsoft.com/azure/search/search-howto-indexing-azure-blob-storage#supported-credentials-and-connection-strings
search_blob_connection_string = os.getenv("SEARCH_BLOB_DATASOURCE_CONNECTION_STRING", blob_connection_string)
blob_container_name = os.getenv("BLOB_CONTAINER_NAME", "int-vec")
azure_openai_endpoint = os.environ["AZURE_OPENAI_ENDPOINT"]
azure_openai_key = os.getenv("AZURE_OPENAI_KEY")
azure_openai_embedding_deployment = os.environ["AZURE_OPENAI_EMBEDDING_DEPLOYMENT"]
azure_openai_model_name = os.environ["AZURE_OPENAI_EMBEDDING_MODEL_NAME"]
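# Must match the vector size your embedding deployment returns (text-embedding-ada-002 always returns 1536 dimensions)
azure_openai_model_dimensions = int(os.getenv("AZURE_OPENAI_EMBEDDING_DIMENSIONS", 1024))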
# This field is only necessary if you want to use OCR to scan PDFs in the data source
azure_ai_services_key = os.getenv("AZURE_AI_SERVICES_KEY", "")

use_ocr = len(azure_ai_services_key) > 0
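
The cell above reads several settings with `os.environ[...]`, which raises `KeyError` on the first one that is missing. As a minimal sketch, the check below lists every missing required setting at once. The helper name `missing_settings` and the `REQUIRED_SETTINGS` list are illustrative additions, not part of the sample; the variable names themselves are copied from the cell above.

```python
import os

# Settings the cell above reads with os.environ[...] (no default value).
REQUIRED_SETTINGS = [
    "AZURE_SEARCH_SERVICE_ENDPOINT",
    "BLOB_CONNECTION_STRING",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_EMBEDDING_DEPLOYMENT",
    "AZURE_OPENAI_EMBEDDING_MODEL_NAME",
]

def missing_settings(env, required=REQUIRED_SETTINGS):
    """Return the required names that are absent or empty in `env`."""
    return [name for name in required if not env.get(name)]

# Hypothetical usage: report all gaps before any Azure call is made.
missing = missing_settings(os.environ)
if missing:
    print("Missing settings:", ", ".join(missing))
```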

## Connect to Blob Storage and load documents

Upload the sample PDFs to a Blob Storage container, creating the container if it doesn't exist. You can use the sample documents in the `data/documents` folder.

In [21]:
from azure.storage.blob import BlobServiceClient
import glob

def upload_sample_documents(
        blob_connection_string: str,
        blob_container_name: str,
        use_user_identity: bool = False,
        use_ocr_sample: bool = False
    ):
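    # Connect to Blob Storage using the credentials embedded in the connection string.
    # Note: use_user_identity is currently ignored. The identity-based variant is kept below for reference:
    # blob_service_client = BlobServiceClient.from_connection_string(logging_enable=True, conn_str=blob_connection_string, credential=DefaultAzureCredential() if use_user_identity else None)
    blob_service_client = BlobServiceClient.from_connection_string(blob_connection_string)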
    container_client = blob_service_client.get_container_client(blob_container_name)
    if not container_client.exists():
        container_client.create_container()

    documents_directory = os.path.join("data", "documents")
    if use_ocr_sample:
        documents_directory = os.path.join("data", "ocrdocuments")
    pdf_files = glob.glob(os.path.join(documents_directory, '*.pdf'))
    print(f"Uploading {len(pdf_files)} documents")
    for file in pdf_files:
        with open(file, "rb") as data:
            name = os.path.basename(file)
            if not container_client.get_blob_client(name).exists():
                container_client.upload_blob(name=name, data=data)

upload_sample_documents(
    blob_connection_string=blob_connection_string,
    blob_container_name=blob_container_name,
    # Intended to choose between your identity (True) and the credentials in the
    # blob connection string (False); the function currently always uses the connection string
    use_user_identity=True,
    # By default, OCR is not used
    # If an AI services API key is provided, OCR will be used
    use_ocr_sample=use_ocr
)
print(f"Setup sample data in {blob_container_name}")

Uploading 1 documents
data\ocrdocuments\Invoice_1.pdf
Setup sample data in documentos


## Create a blob data source connector on Azure AI Search

In [22]:
from azure.search.documents.indexes import SearchIndexerClient
from azure.search.documents.indexes.models import (
    SearchIndexerDataContainer,
    SearchIndexerDataSourceConnection
)
from azure.search.documents.indexes.models import NativeBlobSoftDeleteDeletionDetectionPolicy

# Create a data source 
indexer_client = SearchIndexerClient(endpoint, credential)
container = SearchIndexerDataContainer(name=blob_container_name)
data_source_connection = SearchIndexerDataSourceConnection(
    name=f"{index_name}-blob",
    type="azureblob",
    connection_string=search_blob_connection_string,
    container=container,
    data_deletion_detection_policy=NativeBlobSoftDeleteDeletionDetectionPolicy()
)
data_source = indexer_client.create_or_update_data_source_connection(data_source_connection)

print(f"Data source '{data_source.name}' created or updated")

Data source 'inv-vec-blob' created or updated


## Create a search index

Vector and nonvector content is stored in a search index.

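Be careful with the names you give the profiles: the `vector_search_profile_name` on the vector field must match the name of a profile defined in the vector search configuration.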
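
To illustrate why the names matter, here is a minimal sketch that checks the name references in a vector search setup before anything is sent to the service. It works on plain dictionaries and sets rather than SDK objects, and `check_vector_config` is a hypothetical helper written for this illustration. The names in the example mirror the ones used in the index definition in this notebook.

```python
def check_vector_config(fields, profiles, algorithms, vectorizers):
    """Return a list of dangling name references in a vector search setup.

    All arguments are plain Python stand-ins (not SDK objects):
    fields      -> {field_name: profile_name}
    profiles    -> {profile_name: (algorithm_name, vectorizer_name)}
    algorithms  -> set of algorithm configuration names
    vectorizers -> set of vectorizer names
    """
    problems = []
    for field, profile in fields.items():
        if profile not in profiles:
            problems.append(f"field '{field}' uses unknown profile '{profile}'")
    for profile, (algorithm, vectorizer) in profiles.items():
        if algorithm not in algorithms:
            problems.append(f"profile '{profile}' uses unknown algorithm '{algorithm}'")
        if vectorizer not in vectorizers:
            problems.append(f"profile '{profile}' uses unknown vectorizer '{vectorizer}'")
    return problems

# Names mirror the ones used in this notebook's index definition.
problems = check_vector_config(
    fields={"vector": "myKnnProfile"},
    profiles={"myKnnProfile": ("myKnn", "myOpenAI")},
    algorithms={"myKnn"},
    vectorizers={"myOpenAI"},
)
print(problems)
```

An empty list means every field points at a defined profile, and every profile points at a defined algorithm and vectorizer.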

In [24]:
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchField,
    SearchFieldDataType,
    VectorSearch,
    HnswAlgorithmConfiguration,
    ExhaustiveKnnAlgorithmConfiguration,
    ExhaustiveKnnParameters,
    VectorSearchProfile,
    AzureOpenAIVectorizer,
    AzureOpenAIParameters,
    SemanticConfiguration,
    SemanticSearch,
    SemanticPrioritizedFields,
    SemanticField,
    SearchIndex
)

# Create a search index  
index_client = SearchIndexClient(endpoint=endpoint, credential=credential)  
fields = [  
    SearchField(name="parent_id", type=SearchFieldDataType.String, sortable=True, filterable=True, facetable=True),  
    SearchField(name="title", type=SearchFieldDataType.String),  
    SearchField(name="chunk_id", type=SearchFieldDataType.String, key=True, sortable=True, filterable=True, facetable=True, analyzer_name="keyword"),  
    SearchField(name="chunk", type=SearchFieldDataType.String, sortable=False, filterable=False, facetable=False),  
    SearchField(name="vector", type=SearchFieldDataType.Collection(SearchFieldDataType.Single), vector_search_dimensions=azure_openai_model_dimensions, vector_search_profile_name="myKnnProfile"),
    SearchField(name="metadata_storage_path", type=SearchFieldDataType.String, filterable=True, facetable=True)
]  
  
# Configure the vector search configuration  
vector_search = VectorSearch(  
    algorithms=[  
        ExhaustiveKnnAlgorithmConfiguration(name="myKnn",
                                            kind="exhaustiveKnn",
                                            parameters=ExhaustiveKnnParameters(metric="cosine"))
    ],  
    profiles=[  
        VectorSearchProfile(  
            name="myKnnProfile",  
            algorithm_configuration_name="myKnn",  
            vectorizer="myOpenAI",  
        )
    ],  
    vectorizers=[  
        AzureOpenAIVectorizer(  
            name="myOpenAI",  
            kind="azureOpenAI",  
            azure_open_ai_parameters=AzureOpenAIParameters(  
                resource_uri=azure_openai_endpoint,  
                deployment_id=azure_openai_embedding_deployment,
                model_name=azure_openai_model_name,
                api_key=azure_openai_key,
            ),
        ),  
    ],  
)  
  
semantic_config = SemanticConfiguration(  
    name="my-semantic-config",  
    prioritized_fields=SemanticPrioritizedFields(
        title_field=SemanticField(field_name="title"),  
        content_fields=[SemanticField(field_name="chunk"),
                        SemanticField(field_name="title")],
        keywords_fields=[SemanticField(field_name="chunk_id")],  
    ) 
)
  
# Create the semantic search with the configuration  
semantic_search = SemanticSearch(configurations=[semantic_config])  
  
# Create the search index
index = SearchIndex(name=index_name, fields=fields, vector_search=vector_search, semantic_search=semantic_search)  
result = index_client.create_or_update_index(index)  
print(f"{result.name} created")  


inv-vec created


## Create a skillset

Skills drive integrated vectorization. [Text Split](https://learn.microsoft.com/azure/search/cognitive-search-skill-textsplit) provides data chunking. [AzureOpenAIEmbedding](https://learn.microsoft.com/azure/search/cognitive-search-skill-azure-openai-embedding) handles calls to Azure OpenAI, using the connection information you provide in the environment variables. An [index projection](https://learn.microsoft.com/azure/search/index-projections-concept-intro) specifies the secondary index that receives the chunked data, one search document per chunk.

In [25]:
from azure.search.documents.indexes.models import (
    SplitSkill,
    InputFieldMappingEntry,
    OutputFieldMappingEntry,
    AzureOpenAIEmbeddingSkill,
    OcrSkill,
    MergeSkill,
    SearchIndexerIndexProjections,
    SearchIndexerIndexProjectionSelector,
    SearchIndexerIndexProjectionsParameters,
    IndexProjectionMode,
    SearchIndexerSkillset,
    CognitiveServicesAccountKey
)

# Create a skillset  
skillset_name = f"{index_name}-skillset"

ocr_skill = OcrSkill(
    description="OCR skill to scan PDFs and other images with text",
    context="/document/normalized_images/*",
    line_ending="Space",
    default_language_code="en",
    should_detect_orientation=True,
    inputs=[
        InputFieldMappingEntry(name="image", source="/document/normalized_images/*")
    ],
    outputs=[
        OutputFieldMappingEntry(name="text", target_name="text"),
        OutputFieldMappingEntry(name="layoutText", target_name="layoutText")
    ]
)

merge_skill = MergeSkill(
    description="Merge skill for combining OCR'd and regular text",
    context="/document",
    inputs=[
        InputFieldMappingEntry(name="text", source="/document/content"),
        InputFieldMappingEntry(name="itemsToInsert", source="/document/normalized_images/*/text"),
        InputFieldMappingEntry(name="offsets", source="/document/normalized_images/*/contentOffset")
    ],
    outputs=[
        OutputFieldMappingEntry(name="mergedText", target_name="merged_content")
    ]
)

# If an AI Services key is provided, use the OCR text as the source text for chunking
# Otherwise, use the normal document content.
split_skill_text_source = "/document/content" if not use_ocr else "/document/merged_content"
split_skill = SplitSkill(  
    description="Split skill to chunk documents",  
    text_split_mode="pages",  
    context="/document",  
    maximum_page_length=10000,  
    page_overlap_length=1000,  
    inputs=[  
        InputFieldMappingEntry(name="text", source=split_skill_text_source),  
    ],  
    outputs=[  
        OutputFieldMappingEntry(name="textItems", target_name="pages")  
    ],  
)  
  
embedding_skill = AzureOpenAIEmbeddingSkill(  
    description="Skill to generate embeddings via Azure OpenAI",  
    context="/document/pages/*",  
    resource_uri=azure_openai_endpoint,  
    deployment_id=azure_openai_embedding_deployment,  
    model_name=azure_openai_model_name,
    dimensions=azure_openai_model_dimensions,
    api_key=azure_openai_key,  
    inputs=[  
        InputFieldMappingEntry(name="text", source="/document/pages/*"),  
    ],  
    outputs=[  
        OutputFieldMappingEntry(name="embedding", target_name="vector")  
    ],  
)  
  
index_projections = SearchIndexerIndexProjections(  
    selectors=[  
        SearchIndexerIndexProjectionSelector(  
            target_index_name=index_name,  
            parent_key_field_name="parent_id",  
            source_context="/document/pages/*",  
            mappings=[  
                InputFieldMappingEntry(name="chunk", source="/document/pages/*"),  
                InputFieldMappingEntry(name="vector", source="/document/pages/*/vector"),  
                InputFieldMappingEntry(name="title", source="/document/metadata_storage_name"),
                InputFieldMappingEntry(name="metadata_storage_path", source="/document/metadata_storage_path"),  
            ],  
        ),  
    ],  
    parameters=SearchIndexerIndexProjectionsParameters(  
        projection_mode=IndexProjectionMode.SKIP_INDEXING_PARENT_DOCUMENTS  
    ),  
) 

cognitive_services_account = CognitiveServicesAccountKey(key=azure_ai_services_key) if use_ocr else None

skills = [split_skill, embedding_skill]
if use_ocr:
    skills.extend([ocr_skill, merge_skill])

skillset = SearchIndexerSkillset(  
    name=skillset_name,  
    description="Skillset to chunk documents and generate embeddings",  
    skills=skills,  
    index_projections=index_projections,
    cognitive_services_account=cognitive_services_account
)
  
client = SearchIndexerClient(endpoint, credential)  
client.create_or_update_skillset(skillset)  
print(f"{skillset.name} created")  


inv-vec-skillset created
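
The split skill above uses `maximum_page_length=10000` and `page_overlap_length=1000`. The sketch below shows roughly what overlapping chunks look like, using fixed character windows. It is only an approximation for intuition: the actual Text Split skill also tries to break at sentence boundaries, and `split_with_overlap` is a hypothetical helper, not part of the SDK.

```python
def split_with_overlap(text, max_length, overlap):
    """Cut `text` into windows of at most `max_length` characters, where each
    window starts with the last `overlap` characters of the previous one."""
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must be non-negative and smaller than max_length")
    step = max_length - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_length])
        if start + max_length >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks

# A 25,000-character document with the settings used by the split skill above.
document = "x" * 25000
pages = split_with_overlap(document, max_length=10000, overlap=1000)
print([len(page) for page in pages])
```

Each chunk produced this way becomes one `/document/pages/*` item, which is the context the embedding skill and the index projection operate on.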


## Create an indexer

In [29]:
from azure.search.documents.indexes.models import (
    SearchIndexer,
    FieldMapping,
    IndexingParameters,
    IndexingParametersConfiguration,
    BlobIndexerImageAction,
)

# Create an indexer  
indexer_name = f"{index_name}-indexer"  

indexer_parameters = None
if use_ocr:
    indexer_parameters = IndexingParameters(
        configuration=IndexingParametersConfiguration(
            image_action=BlobIndexerImageAction.GENERATE_NORMALIZED_IMAGE_PER_PAGE,
            query_timeout=None,
            excluded_file_name_extensions=".pptx",
            indexed_file_name_extensions=".pdf",
            data_to_extract="contentAndMetadata",
            ))

indexer = SearchIndexer(  
    name=indexer_name,  
    description="Indexer to index documents and generate embeddings",
    skillset_name=skillset_name,  
    target_index_name=index_name,  
    data_source_name=data_source.name,
    # Map the metadata_storage_name field to the title field in the index to display the PDF title in the search results  
    field_mappings=[FieldMapping(source_field_name="metadata_storage_name", target_field_name="title"),
                    FieldMapping(source_field_name="metadata_storage_path", target_field_name="metadata_storage_path")],
    parameters=indexer_parameters
)  

indexer_client = SearchIndexerClient(endpoint, credential)  
indexer_result = indexer_client.create_or_update_indexer(indexer)  
  
# Run the indexer  
indexer_client.run_indexer(indexer_name)  
print(f' {indexer_name} is created and running. If queries return no results, please wait a bit and try again.')  


ResourceExistsError: () Another indexer invocation is currently in progress; concurrent invocations are not allowed.
Code: 
Message: Another indexer invocation is currently in progress; concurrent invocations are not allowed.
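
The error above appears when `run_indexer` is called while a previous run is still in progress. One way to avoid it is to wait until the indexer is idle before running it again. The sketch below is a generic polling helper under stated assumptions: `wait_until_idle` is a hypothetical name, and the status value `"inProgress"` is assumed to be what the service reports for a running indexer. In this notebook, the status callable could be something like `lambda: indexer_client.get_indexer_status(indexer_name).last_result.status`, keeping in mind that `last_result` may be empty for an indexer that has never run.

```python
import time

def wait_until_idle(get_status, timeout_seconds=300, poll_seconds=5, sleep=time.sleep):
    """Poll `get_status()` until it stops returning "inProgress".

    Returns the final status string, or raises TimeoutError when the
    indexer is still running after `timeout_seconds`.
    """
    waited = 0
    while True:
        status = get_status()
        if status != "inProgress":
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"indexer still running after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds

# Demonstration with a fake status source and a no-op sleep (no Azure calls).
fake_statuses = iter(["inProgress", "inProgress", "success"])
final_status = wait_until_idle(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(final_status)
```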

## Perform a vector similarity search

This example shows a pure vector search using a vectorizable text query. You pass in the query text, and the vectorizer defined on the index converts it to a vector at query time.

If you indexed the health plan PDF file, send queries that ask plan-related questions.
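
The index in this notebook is configured with the exhaustive KNN algorithm and the cosine metric. As a rough illustration of what that means, the sketch below compares a query vector against every stored vector and keeps the closest ones. The two-dimensional vectors and chunk names are made up for the example; real embeddings have hundreds or thousands of dimensions, and the service's own scoring formula may differ from raw cosine similarity.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def exhaustive_knn(query_vector, documents, k):
    """Score every document against the query and return the top k.

    documents: {doc_id: vector}. Returns [(doc_id, similarity)], best first.
    """
    scored = [(doc_id, cosine_similarity(query_vector, vector))
              for doc_id, vector in documents.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy index with three chunks.
toy_index = {"chunk_a": [1.0, 0.0], "chunk_b": [0.6, 0.8], "chunk_c": [0.0, 1.0]}
top_matches = exhaustive_knn([1.0, 0.1], toy_index, k=2)
print(top_matches)
```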

In [30]:
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizableTextQuery

# Pure Vector Search
query = "Which is more comprehensive, Northwind Health Plus vs Northwind Standard?"
if use_ocr:
    query = "Que es un fondo mutuo ?"
  
search_client = SearchClient(endpoint, index_name, credential=credential)
vector_query = VectorizableTextQuery(text=query, k_nearest_neighbors=1, fields="vector", exhaustive=True)
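# To pass in a precomputed vector instead of relying on query vectorization, use a vectorized query.
# Note: generate_embeddings is not defined in this notebook; you would need to supply your own embedding function.
# vector_query = VectorizedQuery(vector=generate_embeddings(query), k_nearest_neighbors=3, fields="vector")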
  
results = search_client.search(  
    search_text=None,  
    vector_queries= [vector_query],
    select=["parent_id", "chunk_id", "chunk"],
    top=1
)  
  
for result in results:  
    print(f"parent_id: {result['parent_id']}")  
    print(f"chunk_id: {result['chunk_id']}")  
    print(f"Score: {result['@search.score']}")  
    print(f"Content: {result['chunk']}")   


parent_id: aHR0cHM6Ly9pdGF1aW52ZXJzaW9uZXMuYmxvYi5jb3JlLndpbmRvd3MubmV0L2RvY3VtZW50b3MvUEZJJTIwSW52ZXJzaW9uZXMlMjAoY2FwYWNpdGFjaSVDMyVCM24pJTIwbWFyem8lMjAyMDI0LnBwdHg1
chunk_id: 8938fa2c17ef_aHR0cHM6Ly9pdGF1aW52ZXJzaW9uZXMuYmxvYi5jb3JlLndpbmRvd3MubmV0L2RvY3VtZW50b3MvUEZJJTIwSW52ZXJzaW9uZXMlMjAoY2FwYWNpdGFjaSVDMyVCM24pJTIwbWFyem8lMjAyMDI0LnBwdHg1_pages_3
Score: 0.8823329
Content: los cuales tienen por objetivo lograr la mayor rentabilidad posible, ajustada por el riesgo que el fondo puede tomar. FFMM recomendados por Itaú Recomienda estos FFMM a tus clientes Fondos Mutuos Itaú Mi Cartera: Fondo Mutuo Itaú Mi Cartera Tranqui Son fondos mutuos que invierten en otros fondos de litau, Fondo diseñado para clientes de perfil conservador, en porcentajes que permiten lograr una cartera o por su bajo nivel de riesgo. Actualmente es un fondo portafolio diversificado, según la estrategia de cada que está compuesto por sólo 4 fondos mutuos de renta uno de los Mi Cartera por cada perfil de inversion
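
The `parent_id` in the output above looks opaque because it is the encoded storage path of the source blob, and `chunk_id` embeds the same value followed by `_pages_N`. The sketch below decodes such a value, assuming the indexer used the URL-token flavor of Base64: URL-safe Base64 in which the final character is a digit giving the number of `=` padding characters that were removed. That encoding is an assumption here, and `decode_url_token` is a hypothetical helper.

```python
import base64

def decode_url_token(token):
    """Decode a URL-token Base64 string.

    The last character is a digit holding the number of '=' padding
    characters that were stripped from the URL-safe Base64 body.
    """
    padding = int(token[-1])
    body = token[:-1] + "=" * padding
    return base64.urlsafe_b64decode(body).decode("utf-8")

# The last eight characters of the parent_id shown above.
suffix = decode_url_token("LnBwdHg1")
print(suffix)
```

Passing a full `parent_id` to the same function should yield the blob URL of the source document, if the assumption about the encoding holds.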

## Perform a hybrid search

In [31]:
# Hybrid Search
query = "Which is more comprehensive, Northwind Health Plus vs Northwind Standard?"  
if use_ocr:
    query = "Que es un fondo mutuo ?"

search_client = SearchClient(endpoint, index_name, credential=credential)
vector_query = VectorizableTextQuery(text=query, k_nearest_neighbors=1, fields="vector", exhaustive=True)
  
results = search_client.search(  
    search_text=query,  
    vector_queries= [vector_query],
    select=["parent_id", "chunk_id", "chunk"],
    top=1
)  
  
for result in results:  
    print(f"parent_id: {result['parent_id']}")  
    print(f"chunk_id: {result['chunk_id']}")  
    print(f"Score: {result['@search.score']}")  
    print(f"Content: {result['chunk']}")  


parent_id: aHR0cHM6Ly9pdGF1aW52ZXJzaW9uZXMuYmxvYi5jb3JlLndpbmRvd3MubmV0L2RvY3VtZW50b3MvUEZJJTIwSW52ZXJzaW9uZXMlMjAoY2FwYWNpdGFjaSVDMyVCM24pJTIwbWFyem8lMjAyMDI0LnBwdHg1
chunk_id: 8938fa2c17ef_aHR0cHM6Ly9pdGF1aW52ZXJzaW9uZXMuYmxvYi5jb3JlLndpbmRvd3MubmV0L2RvY3VtZW50b3MvUEZJJTIwSW52ZXJzaW9uZXMlMjAoY2FwYWNpdGFjaSVDMyVCM24pJTIwbWFyem8lMjAyMDI0LnBwdHg1_pages_3
Score: 0.024931129068136215
Content: los cuales tienen por objetivo lograr la mayor rentabilidad posible, ajustada por el riesgo que el fondo puede tomar. FFMM recomendados por Itaú Recomienda estos FFMM a tus clientes Fondos Mutuos Itaú Mi Cartera: Fondo Mutuo Itaú Mi Cartera Tranqui Son fondos mutuos que invierten en otros fondos de litau, Fondo diseñado para clientes de perfil conservador, en porcentajes que permiten lograr una cartera o por su bajo nivel de riesgo. Actualmente es un fondo portafolio diversificado, según la estrategia de cada que está compuesto por sólo 4 fondos mutuos de renta uno de los Mi Cartera por cada perfil d
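
The hybrid score above (about 0.025) is on a different scale from the pure vector score (about 0.88) because hybrid queries merge the text ranking and the vector ranking with Reciprocal Rank Fusion (RRF) instead of reporting a similarity value. The sketch below shows the general RRF idea with made-up rankings. The constant `k=60` is the value commonly cited for RRF and is an assumption here, not something this notebook configures.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids (best first).

    Each document receives 1 / (k + rank) from every list it appears in;
    the contributions are summed into one fused score per document.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores

# Hypothetical rankings from the keyword query and the vector query.
text_ranking = ["chunk_b", "chunk_a", "chunk_c"]
vector_ranking = ["chunk_a", "chunk_c"]
fused = reciprocal_rank_fusion([text_ranking, vector_ranking])
best = max(fused, key=fused.get)
print(best, round(fused[best], 6))
```

A document that ranks well in both lists wins even if it is not first in either one, which is why hybrid results can differ from the pure vector results.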

## Perform a hybrid search + semantic reranking

In [32]:
from azure.search.documents.models import (
    QueryType,
    QueryCaptionType,
    QueryAnswerType
)
# Semantic Hybrid Search
query = "Which is more comprehensive, Northwind Health Plus vs Northwind Standard?"
if use_ocr:
    query = "Que es un fondo mutuo ?"

search_client = SearchClient(endpoint, index_name, credential)
vector_query = VectorizableTextQuery(text=query, k_nearest_neighbors=1, fields="vector", exhaustive=True)

results = search_client.search(  
    search_text=query,
    vector_queries=[vector_query],
    select=["parent_id", "chunk_id", "chunk"],
    query_type=QueryType.SEMANTIC,
    semantic_configuration_name='my-semantic-config',
    query_caption=QueryCaptionType.EXTRACTIVE,
    query_answer=QueryAnswerType.EXTRACTIVE,
    top=1
)

semantic_answers = results.get_answers()
if semantic_answers:
    for answer in semantic_answers:
        if answer.highlights:
            print(f"Semantic Answer: {answer.highlights}")
        else:
            print(f"Semantic Answer: {answer.text}")
        print(f"Semantic Answer Score: {answer.score}\n")

for result in results:
    print(f"parent_id: {result['parent_id']}")  
    print(f"chunk_id: {result['chunk_id']}")  
    print(f"Reranker Score: {result['@search.reranker_score']}")
    print(f"Content: {result['chunk']}")  

    captions = result["@search.captions"]
    if captions:
        caption = captions[0]
        if caption.highlights:
            print(f"Caption: {caption.highlights}\n")
        else:
            print(f"Caption: {caption.text}\n")


Semantic Answer: Perfil cliente avanzado Un fondo mutuo es un<em> vehiculo</em> de<em> inversion</em> que permite acceder a un abanico de activos o instrumentos financieros de forma diversificada, con costos atractivos Cuentan con un servicio de administración, es decir, que son los expertos los que manejan las inversiones del portafolio por el cliente Existen fondos de múltiples ca...
Semantic Answer Score: 0.97802734375

parent_id: aHR0cHM6Ly9pdGF1aW52ZXJzaW9uZXMuYmxvYi5jb3JlLndpbmRvd3MubmV0L2RvY3VtZW50b3MvUEZJJTIwSW52ZXJzaW9uZXMlMjAoY2FwYWNpdGFjaSVDMyVCM24pJTIwbWFyem8lMjAyMDI0LnBwdHg1
chunk_id: 8938fa2c17ef_aHR0cHM6Ly9pdGF1aW52ZXJzaW9uZXMuYmxvYi5jb3JlLndpbmRvd3MubmV0L2RvY3VtZW50b3MvUEZJJTIwSW52ZXJzaW9uZXMlMjAoY2FwYWNpdGFjaSVDMyVCM24pJTIwbWFyem8lMjAyMDI0LnBwdHg1_pages_3
Reranker Score: 3.721200942993164
Content: los cuales tienen por objetivo lograr la mayor rentabilidad posible, ajustada por el riesgo que el fondo puede tomar. FFMM recomendados por Itaú Recomienda estos FFMM a tus c