# Azure AI Search: Integrated Vectorization with Cohere embed-v-4 Skillset

This notebook demonstrates how to use Azure AI Search **integrated vectorization** with skillsets to:

1. **Crack PDFs** from Azure Blob Storage (built-in document processing)
2. **Chunk documents** using the Text Split skill
3. **Generate embeddings** with Cohere embed-v-4 through the Azure Machine Learning (AML) skill, calling a Microsoft Foundry endpoint
4. **Index each chunk** as its own search document using index projections (one-to-many mapping)
5. **Search** using hybrid search with the semantic ranker

## Architecture

```
Blob Storage (PDFs)
        |
    Data Source
        |
    Indexer + Skillset
        |-- Document Cracking (built-in PDF parsing)
        |-- Text Split Skill (chunking)
        |-- AzureMachineLearningSkill (Cohere embed-v-4)
        |
    Index Projections (one-to-many)
        |
    Search Index (chunks with vectors + semantic config)
        |
    Hybrid Search + Semantic Ranker
```
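
The one-to-many step in the diagram can be pictured in plain Python. This is only an illustrative sketch, not how the service implements index projections: the `project_chunks` helper and the `chunk_id` format are hypothetical (the service generates its own chunk keys), while the field names `parent_id`, `title`, `chunk`, and `chunk_vector` match the index defined later in this notebook.

```python
from typing import Dict, List


def project_chunks(parent_id: str, title: str, pages: List[str],
                   vectors: List[List[float]]) -> List[Dict]:
    """Turn one enriched parent document into one search document per chunk."""
    if len(pages) != len(vectors):
        raise ValueError("each chunk needs exactly one embedding")
    return [
        {
            # Hypothetical key format; the real service generates chunk keys itself.
            "chunk_id": f"{parent_id}_pages_{i}",
            "parent_id": parent_id,
            "title": title,
            "chunk": page,
            "chunk_vector": vector,
        }
        for i, (page, vector) in enumerate(zip(pages, vectors))
    ]


# One parent PDF with two chunks becomes two index documents.
docs = project_chunks("doc-1", "report.pdf", ["first chunk", "second chunk"],
                      [[0.1, 0.2], [0.3, 0.4]])
print(len(docs), docs[1]["chunk_id"])
```

Every chunk document carries the parent's `title` and `parent_id`, which is what lets search results point back to the source PDF.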

## 1. Setup and Installation

In [63]:
%pip install azure-search-documents azure-identity python-dotenv azure-storage-blob azure-ai-inference --quiet

Note: you may need to restart the kernel to use updated packages.


In [64]:
import os
from datetime import timedelta
from dotenv import load_dotenv
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.indexes import SearchIndexClient, SearchIndexerClient
from azure.search.documents.indexes.models import (
    SearchIndex, SearchField, SearchFieldDataType,
    VectorSearch, HnswAlgorithmConfiguration, VectorSearchProfile,
    SemanticConfiguration, SemanticField, SemanticPrioritizedFields, SemanticSearch,
    SearchIndexerDataContainer, SearchIndexerDataSourceConnection,
    SearchIndexer, SearchIndexerSkillset, SplitSkill, AzureMachineLearningSkill,
    InputFieldMappingEntry, OutputFieldMappingEntry,
    SearchIndexerIndexProjection, SearchIndexerIndexProjectionSelector,
    SearchIndexerIndexProjectionsParameters, FieldMapping,
)
from azure.search.documents.models import VectorizedQuery, QueryType
from azure.ai.inference import EmbeddingsClient
from azure.ai.inference.models import EmbeddingInputType
import time

# Load environment variables
load_dotenv("../.env")

# Configuration
search_endpoint = os.environ["AZURE_SEARCH_ENDPOINT"]
search_api_key = os.environ["AZURE_SEARCH_API_KEY"]
storage_connection_string = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
storage_container = os.environ["AZURE_STORAGE_CONTAINER"]
inference_endpoint = os.environ["AZURE_INFERENCE_ENDPOINT"]
inference_credential = os.environ["AZURE_INFERENCE_CREDENTIAL"]

# Resource names
index_name = "cohere-skillset-demo-index"
data_source_name = "cohere-skillset-demo-datasource"
skillset_name = "cohere-skillset-demo-skillset"
indexer_name = "cohere-skillset-demo-indexer"
embedding_model = "embed-v-4-0"
embedding_dimensions = 1536

# Initialize clients
credential = AzureKeyCredential(search_api_key)
index_client = SearchIndexClient(endpoint=search_endpoint, credential=credential)
indexer_client = SearchIndexerClient(endpoint=search_endpoint, credential=credential)
embeddings_client = EmbeddingsClient(endpoint=inference_endpoint, credential=AzureKeyCredential(inference_credential))

print(f"Connected to: {search_endpoint}")

Connected to: https://farzad-srch-wcus-basic.search.windows.net
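
The setup cell reads six environment variables with `os.environ[...]`, which raises a bare `KeyError` on the first one that is missing. As a minimal sketch (the `missing_settings` helper and the `fake_env` values are hypothetical, not part of the notebook), all required names could be checked up front so that every gap is reported at once:

```python
import os
from typing import Iterable, List, Mapping

# Variable names taken from the setup cell above.
REQUIRED_VARS = [
    "AZURE_SEARCH_ENDPOINT",
    "AZURE_SEARCH_API_KEY",
    "AZURE_STORAGE_CONNECTION_STRING",
    "AZURE_STORAGE_CONTAINER",
    "AZURE_INFERENCE_ENDPOINT",
    "AZURE_INFERENCE_CREDENTIAL",
]


def missing_settings(names: Iterable[str],
                     env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names that are unset or blank in the given environment."""
    return [name for name in names if not env.get(name, "").strip()]


# Demo with a fake environment (placeholder values) so the check runs anywhere.
fake_env = {
    "AZURE_SEARCH_ENDPOINT": "https://example.search.windows.net",
    "AZURE_SEARCH_API_KEY": "",
}
print(missing_settings(REQUIRED_VARS, fake_env))
```

Calling `missing_settings(REQUIRED_VARS)` after `load_dotenv` would check the real environment instead of the fake one.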


## 2. Create Data Source

In [65]:
# Create data source for Azure Blob Storage
data_source = SearchIndexerDataSourceConnection(
    name=data_source_name, type="azureblob",
    connection_string=storage_connection_string,
    container=SearchIndexerDataContainer(name=storage_container)
)
indexer_client.create_or_update_data_source_connection(data_source)
print(f"Data source '{data_source_name}' created")

Data source 'cohere-skillset-demo-datasource' created


## 3. Create Search Index

In [66]:
# Define index fields
fields = [
    SearchField(name="chunk_id", type=SearchFieldDataType.String, key=True, filterable=True, analyzer_name="keyword"),
    SearchField(name="parent_id", type=SearchFieldDataType.String, filterable=True),
    SearchField(name="title", type=SearchFieldDataType.String, searchable=True, filterable=True),
    SearchField(name="chunk", type=SearchFieldDataType.String, searchable=True),
    SearchField(name="chunk_vector", type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
                searchable=True, vector_search_dimensions=embedding_dimensions, vector_search_profile_name="cohere-vector-profile")
]

# Vector search with HNSW defaults
vector_search = VectorSearch(
    algorithms=[HnswAlgorithmConfiguration(name="cohere-hnsw-config")],
    profiles=[VectorSearchProfile(name="cohere-vector-profile", algorithm_configuration_name="cohere-hnsw-config")]
)

# Semantic search
semantic_search = SemanticSearch(configurations=[
    SemanticConfiguration(name="cohere-semantic-config", prioritized_fields=SemanticPrioritizedFields(
        title_field=SemanticField(field_name="title"),
        content_fields=[SemanticField(field_name="chunk")]
    ))
])

# Create index
index = SearchIndex(name=index_name, fields=fields, vector_search=vector_search, semantic_search=semantic_search)
result = index_client.create_or_update_index(index)
print(f"Index '{result.name}' created with {len(result.fields)} fields")

Index 'cohere-skillset-demo-index' created with 5 fields
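
The index declares `chunk_vector` with a fixed number of dimensions (1536 in this notebook), so stored vectors and query vectors are expected to have exactly that length; a mismatch typically surfaces as an indexing or query error. The sketch below shows a local sanity check that could be run on a vector before it is sent to the service. The `check_vector` helper is hypothetical and is not part of the Azure SDK.

```python
import math
from typing import Sequence

EMBEDDING_DIMENSIONS = 1536  # same value as embedding_dimensions in the setup cell


def check_vector(vector: Sequence[float],
                 expected_dim: int = EMBEDDING_DIMENSIONS) -> None:
    """Raise ValueError if the vector has the wrong length or non-finite values."""
    if len(vector) != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {len(vector)}")
    if not all(math.isfinite(x) for x in vector):
        raise ValueError("vector contains NaN or infinite values")


check_vector([0.0] * 1536)  # passes silently
try:
    check_vector([0.0] * 1024)
except ValueError as err:
    print(err)
```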


## 4. Create Skillset

In [67]:
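# Embeddings route of the inference endpoint; rstrip avoids a double slash if the endpoint ends with "/"
inference_scoring_uri = f"{inference_endpoint.rstrip('/')}/embeddings"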

# Text Split Skill - chunks documents
split_skill = SplitSkill(
    name="text-split-skill", text_split_mode="pages",
    maximum_page_length=2000, page_overlap_length=500,
    inputs=[InputFieldMappingEntry(name="text", source="/document/content")],
    outputs=[OutputFieldMappingEntry(name="textItems", target_name="pages")]
)

# AML Skill - generates embeddings via Azure AI Inference API
embedding_skill = AzureMachineLearningSkill(
    name="cohere-embedding-skill",
    scoring_uri=inference_scoring_uri,
    authentication_key=inference_credential,
    context="/document/pages/*",
    timeout=timedelta(seconds=60),
    degree_of_parallelism=5,
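    inputs=[
        # The "=[...]" expression wraps the current chunk in a single-element array
        InputFieldMappingEntry(name="input", source="=[$(/document/pages/*)]"),
        # Reuse the model name from the setup cell instead of repeating the literal
        InputFieldMappingEntry(name="model", source=f"='{embedding_model}'")
    ],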
    outputs=[OutputFieldMappingEntry(name="data", target_name="embedding_response")]
)

# Index Projections - one-to-many mapping (parent doc -> chunks)
index_projections = SearchIndexerIndexProjection(
    selectors=[SearchIndexerIndexProjectionSelector(
        target_index_name=index_name, parent_key_field_name="parent_id",
        source_context="/document/pages/*",
        mappings=[
            InputFieldMappingEntry(name="chunk", source="/document/pages/*"),
            InputFieldMappingEntry(name="chunk_vector", source="/document/pages/*/embedding_response/0/embedding"),
            InputFieldMappingEntry(name="title", source="/document/metadata_storage_name")
        ]
    )],
    parameters=SearchIndexerIndexProjectionsParameters(projection_mode="skipIndexingParentDocuments")
)

# Create skillset
skillset = SearchIndexerSkillset(
    name=skillset_name, skills=[split_skill, embedding_skill], index_projection=index_projections
)
indexer_client.create_or_update_skillset(skillset)
print(f"Skillset '{skillset_name}' created")

Skillset 'cohere-skillset-demo-skillset' created
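
The Text Split skill settings above (`maximum_page_length=2000`, `page_overlap_length=500`) mean that consecutive chunks share up to 500 characters. The sketch below approximates that behaviour with fixed character windows. It is a simplification under stated assumptions: the real skill also tries to respect sentence boundaries, so its chunk boundaries will differ, and the `split_with_overlap` helper is hypothetical.

```python
from typing import List


def split_with_overlap(text: str, max_len: int = 2000, overlap: int = 500) -> List[str]:
    """Fixed-size character windows; each chunk repeats the previous chunk's tail."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be non-negative and smaller than max_len")
    step = max_len - overlap  # how far the window advances each time
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_len])
        if start + max_len >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


sample = "x" * 5000
chunks = split_with_overlap(sample)
print([len(c) for c in chunks])
```

A larger overlap reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary, at the cost of more chunks and more embedding calls per document.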


## 5. Create and Run Indexer

In [68]:
# Create and run the indexer
indexer = SearchIndexer(
    name=indexer_name,
    data_source_name=data_source_name,
    target_index_name=index_name,
    skillset_name=skillset_name,
    field_mappings=[
        FieldMapping(source_field_name="metadata_storage_path", target_field_name="parent_id"),
        FieldMapping(source_field_name="metadata_storage_name", target_field_name="title")
    ]
)
indexer_client.create_or_update_indexer(indexer)
indexer_client.run_indexer(indexer_name)
print(f"Indexer '{indexer_name}' started...")

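# Poll for completion, with a cap on the total wait so the loop cannot hang forever.
# Right after run_indexer, last_result may still describe an earlier run,
# so pause briefly before the first status check.
time.sleep(5)
deadline = time.time() + 600  # stop waiting after 10 minutes
while time.time() < deadline:
    status = indexer_client.get_indexer_status(indexer_name)
    if status.last_result and status.last_result.status in ["success", "transientFailure", "persistentFailure"]:
        print(f"Status: {status.last_result.status} | Docs: {status.last_result.item_count} | Errors: {status.last_result.failed_item_count}")
        if status.last_result.errors:
            for e in status.last_result.errors[:3]:
                print(f"  Error: {e.error_message}")
        break
    time.sleep(10)
else:
    print("Timed out waiting for the indexer to finish")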

Indexer 'cohere-skillset-demo-indexer' started...
Status: success | Docs: 0 | Errors: 0
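
The polling pattern used for the indexer can be factored into a small reusable helper. The sketch below is one possible shape, not part of the Azure SDK: `wait_for_terminal_status` and its parameters are hypothetical, and the status source is injected as a callable so the demo runs without a search service.

```python
import time
from typing import Callable, Optional

# Status strings checked by the polling loop in this notebook.
TERMINAL_STATUSES = ("success", "transientFailure", "persistentFailure")


def wait_for_terminal_status(get_status: Callable[[], Optional[str]],
                             timeout_s: float = 600.0,
                             interval_s: float = 10.0) -> Optional[str]:
    """Poll get_status until it returns a terminal status or the timeout elapses.

    Returns the terminal status, or None if the timeout was reached first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval_s)


# Demo with a fake status source standing in for the indexer status call.
fake_statuses = iter([None, "inProgress", "success"])
print(wait_for_terminal_status(lambda: next(fake_statuses), timeout_s=5, interval_s=0.01))
```

In this notebook, `get_status` could wrap `indexer_client.get_indexer_status(indexer_name)` and return `last_result.status` when a result exists, or `None` otherwise.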


## 6. Hybrid Search

In [69]:
search_client = SearchClient(endpoint=search_endpoint, index_name=index_name, credential=credential)

def hybrid_search(query: str, top: int = 5):
    """Hybrid search with semantic ranker using Cohere embeddings."""
    embedding = embeddings_client.embed(
        input=[query], model=embedding_model, input_type=EmbeddingInputType.QUERY
    ).data[0].embedding
    
    return list(search_client.search(
        search_text=query,
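        # VectorizedQuery names this parameter k_nearest_neighbors (a bare `k` is not a recognized attribute)
        vector_queries=[VectorizedQuery(vector=embedding, k_nearest_neighbors=top, fields="chunk_vector")],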
        query_type=QueryType.SEMANTIC,
        semantic_configuration_name="cohere-semantic-config",
        select=["chunk_id", "title", "chunk"],
        top=top
    ))
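
Hybrid queries in Azure AI Search merge the keyword ranking and the vector ranking with Reciprocal Rank Fusion (RRF) before the semantic ranker reorders the top results. The sketch below is a simplified illustration of the RRF formula only, not the service's implementation: the `rrf_fuse` helper, the sample document ids, and the constant `k=60` are assumptions chosen for the example.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of document ids: score(d) = sum over lists of 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; ties break by document id for stable output.
    return sorted(scores, key=lambda d: (-scores[d], d))


keyword_hits = ["a", "b", "c"]  # ranking from the full-text query
vector_hits = ["c", "a", "d"]   # ranking from the vector query
print(rrf_fuse([keyword_hits, vector_hits]))
```

Documents that appear in both rankings accumulate two contributions, which is why a chunk that is merely good in both lists can outrank one that is top of a single list.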

In [70]:
# Example searches
queries = [
    "What are the key trends in artificial intelligence?",
    "How is AI impacting healthcare?"
]

for query in queries:
    print(f"Query: {query}")
    print("-" * 60)
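    for i, r in enumerate(hybrid_search(query, top=3), 1):
        # Format the score only when it is present; applying ":.2f" to the "N/A" fallback would raise ValueError
        score = r.get('@search.reranker_score')
        score_text = f"{score:.2f}" if score is not None else "N/A"
        print(f"{i}. {r['title']} (score: {score_text})")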
        print(f"   {r['chunk'][:150].replace(chr(10), ' ')}...")
    print()

Query: What are the key trends in artificial intelligence?
------------------------------------------------------------
1. the-state-of-enterprise-ai_2025-report.pdf (score: 2.79)
   Deliberate change  management  They build structures that speed organizational learning, combining  centralized governance and training with distribut...
2. the-state-of-enterprise-ai_2025-report.pdf (score: 2.76)
   a grounded view of how  AI is being deployed inside organizations today.    The state of enterprise AI  |  2025 Report3  01 Enterprise usage is scalin...
3. the-state-of-enterprise-ai_2025-report.pdf (score: 2.68)
   today, despite broad availability of these tools. Models are capable of far more than  most organizations have embedded into workflows, and this prese...

Query: How is AI impacting healthcare?
------------------------------------------------------------
1. the-state-of-enterprise-ai_2025-report.pdf (score: 2.93)
   savings KPI.    The state of enterprise AI  |  2025 Report21  AI 

## 7. Cleanup (Optional)

In [71]:
# Uncomment to delete all resources
# for name, delete_fn in [(indexer_name, indexer_client.delete_indexer),
#                         (skillset_name, indexer_client.delete_skillset),
#                         (data_source_name, indexer_client.delete_data_source_connection),
#                         (index_name, index_client.delete_index)]:
#     delete_fn(name)
#     print(f"Deleted: {name}")