## 📚 Prerequisites

Before running this notebook, make sure you have configured Azure AI services, set the required configuration parameters, and created a Conda environment for reproducibility. Setup instructions, including how to create the Conda environment, are in the [REQUIREMENTS.md](REQUIREMENTS.md) file.

## 📋 Table of Contents

This notebook lays the foundation for subsequent notebooks by guiding you through the creation of an Azure AI Search index.

> We'll be using the Azure AI Search SDK for Python (`azure-search-documents`) to accomplish this.

It covers the following sections:

1. [**Define Field Types**](#define-field-types): Outlines the process of defining the structure and behavior of an index using various field types.

2. [**Configuring Vector Search**](#configuring-vector-search): Discusses the setup of algorithms and profiles for handling vector-based queries.

3. [**Configuring Semantic Search**](#configuring-semantic-search): Explores how to enhance search capabilities by leveraging advanced AI models.

4. [**Create or Update Index**](#create-or-update-index): Details the steps to create a new index or update an existing one.

For additional information, refer to the following resources:
- [Azure AI Search Documentation](https://learn.microsoft.com/en-us/azure/search/)

In [None]:
import os
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient, SearchIndexerClient
from azure.search.documents.indexes.models import (
    AzureOpenAIVectorizer,
    AzureOpenAIVectorizerParameters,
    HnswAlgorithmConfiguration,
    HnswParameters,
    SearchField,
    SearchFieldDataType,
    SearchIndex,
    SemanticConfiguration,
    SemanticField,
    SemanticPrioritizedFields,
    SemanticSearch,
    VectorSearch,
    VectorSearchProfile,
)

from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Define the target directory
target_directory = os.getcwd()  # Get the current working directory

# Move one directory back
parent_directory = os.path.dirname(target_directory)

# Check if the parent directory exists
if os.path.exists(parent_directory):
    # Change the current working directory to the parent directory
    os.chdir(parent_directory)
    print(f"Directory changed to {os.getcwd()}")
else:
    print(f"Parent directory {parent_directory} does not exist.")

Directory changed to c:\Users\pablosal\Desktop\aihlsignited-medindexer\labs


In [None]:
# Read the service endpoint and admin key from the environment
endpoint = os.environ["AZURE_AI_SEARCH_SERVICE_ENDPOINT"]

# Create the index-management client. SearchIndexClient works at the service level,
# so the index name (e.g., ai-policies-v2-index) is supplied later on the SearchIndex itself.
admin_documents_index_client = SearchIndexClient(
    endpoint=endpoint,
    credential=AzureKeyCredential(os.environ["AZURE_SEARCH_ADMIN_KEY"]),
)
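
Because the cell above reads its settings with `os.environ[...]`, a missing variable surfaces as a bare `KeyError`. The following is a minimal sketch of a pre-flight check, assuming the variable names used in this notebook; `find_missing_vars` is a hypothetical helper, not part of the Azure SDK.

```python
import os
from typing import List, Mapping, Sequence

# Variable names taken from the cells in this notebook.
REQUIRED_SEARCH_VARS = (
    "AZURE_AI_SEARCH_SERVICE_ENDPOINT",
    "AZURE_SEARCH_INDEX_NAME",
    "AZURE_SEARCH_ADMIN_KEY",
)


def find_missing_vars(
    names: Sequence[str], env: Mapping[str, str] = os.environ
) -> List[str]:
    """Return the names that are unset or blank in `env` (hypothetical helper)."""
    return [name for name in names if not env.get(name, "").strip()]


missing = find_missing_vars(REQUIRED_SEARCH_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Running a check like this right after `load_dotenv()` makes a configuration problem easier to spot before any client is created.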

# Creating the Index

## Define Field Types

### 🧠 Understanding Field Types in Azure AI Search

In Azure AI Search, the structure and behavior of an index are defined by its fields. The Python SDK offers four ways to declare them, each tailored to specific use cases: the `SearchField` class and the `SimpleField`, `SearchableField`, and `ComplexField` helpers, which return a preconfigured `SearchField`.

- **SearchField**: This is the foundational field type for defining an index's schema. It encompasses a broad range of attributes that specify the field's role and behavior in the index. Key attributes include:
  - `name` and `type`: Define the field's identifier and data type.
  - `key`: Indicates if the field is a unique identifier for documents.
  - `searchable`: Specifies if the field undergoes full-text search analysis.
  - `filterable`, `sortable`, `facetable`: Determine how the field interacts with search queries.
  - Analyzers (`analyzer_name`, `search_analyzer_name`, `index_analyzer_name`): Configure text analysis for the field.
  - Advanced search attributes like `vector_search_dimensions` and `synonym_map_names`.
  - `fields`: For complex types, defining nested sub-fields. 

- **SimpleField**: A streamlined version of `SearchField`, designed for fields that don't require full-text search or advanced text analysis. It's typically used for non-textual data like identifiers and metadata, supporting attributes like `key`, `filterable`, `sortable`, and `facetable`.

- **SearchableField**: Tailored for fields that require full-text search capabilities, this type includes most of the attributes of `SearchField`. It's particularly suitable for fields with textual content that needs to be searchable, like titles, descriptions, or full text.

- **ComplexField**: Designed for fields that contain nested data structures, `ComplexField` allows you to define a field with sub-fields. It's characterized by:
  - `name`: The unique identifier for the field.
  - `collection`: A boolean indicating if the field is a collection of complex objects.
  - `fields`: A list of sub-fields, which can be of any field type, including nested `ComplexField`.

### How to Use These Field Types 🛠️

- **Creating Simple and Searchable Fields**: Use `SimpleField` for basic data types and `SearchableField` for text-heavy fields requiring search capabilities.

- **Designing Complex Data Structures**: Utilize `ComplexField` to model hierarchical or nested data within your index, defining each level of the hierarchy with appropriate sub-fields. 

- **Optimizing Search Behavior**: Leverage `SearchField` for granular control over search behavior, including the use of analyzers and advanced search features like vector search.

> **Note:** Full-text search analyzes and searches through all text within documents, considering language nuances and relevance. Non-full-text search, on the other hand, looks for exact matches or range queries in specific fields or attributes.
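
As a rough illustration of the guidance above, the sketch below declares a small schema with the `SimpleField`, `SearchableField`, and `ComplexField` helpers from `azure.search.documents.indexes.models`. The field names (`doc_id`, `title`, `page_count`, `author`) are hypothetical and are not part of the policy index defined in the next cell.

```python
from azure.search.documents.indexes.models import (
    ComplexField,
    SearchableField,
    SearchFieldDataType,
    SimpleField,
)

# Hypothetical schema used only to contrast the three helpers.
example_fields = [
    # Non-searchable identifier: exact-match lookups and filters only.
    SimpleField(
        name="doc_id",
        type=SearchFieldDataType.String,
        key=True,
        filterable=True,
    ),
    # Free text that should go through full-text analysis.
    SearchableField(name="title", type=SearchFieldDataType.String, sortable=True),
    # Numeric metadata that is filtered and sorted, never searched as text.
    SimpleField(
        name="page_count",
        type=SearchFieldDataType.Int32,
        filterable=True,
        sortable=True,
    ),
    # Nested object with its own sub-fields.
    ComplexField(
        name="author",
        fields=[
            SimpleField(name="id", type=SearchFieldDataType.String),
            SearchableField(name="name", type=SearchFieldDataType.String),
        ],
    ),
]
```

The policy index below uses `SearchField` directly instead, which keeps every attribute explicit and is required for the vector field.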

In [4]:
# Define the schema for the Azure AI Search index used in the Policy Knowledge Store
fields = [
    # Unique identifier for the parent document (e.g., the full policy file)
    SearchField(
        name="parent_id",
        type=SearchFieldDataType.String,
        sortable=True,
        filterable=True,
        facetable=True,
    ),

    # Path to the original policy file in Azure Blob Storage
    SearchField(
        name="parent_path",
        type=SearchFieldDataType.String,
        filterable=True,
        facetable=False,
    ),

    # Title or name of the policy document
    SearchField(
        name="policy_name",
        type=SearchFieldDataType.String,
        searchable=True,
        filterable=True,
        facetable=True,
    ),

    # Name of the payer organization (e.g., Aetna, Cigna)
    SearchField(
        name="payer_name",
        type=SearchFieldDataType.String,
        searchable=True,
        filterable=True,
        facetable=True,
    ),

    # List of drug names mentioned in the policy
    SearchField(
        name="drug_names",
        type=SearchFieldDataType.Collection(SearchFieldDataType.String),
        searchable=True,
        filterable=True,
        facetable=True,
    ),

    # Medical specialties the policy applies to (e.g., oncology, nephrology)
    SearchField(
        name="medical_specialties",
        type=SearchFieldDataType.Collection(SearchFieldDataType.String),
        searchable=True,
        filterable=True,
        facetable=True,
    ),

    # Diseases or conditions covered in the policy
    SearchField(
        name="covered_diseases",
        type=SearchFieldDataType.Collection(SearchFieldDataType.String),
        searchable=True,
        filterable=True,
        facetable=True,
    ),

    # ICD codes corresponding to the diseases
    SearchField(
        name="covered_diseases_icd_codes",
        type=SearchFieldDataType.Collection(SearchFieldDataType.String),
        searchable=True,
        filterable=True,
        facetable=True,
    ),

    # Drug codes referenced in the policy (e.g., NDC, RxNorm)
    SearchField(
        name="covered_drug_codes",
        type=SearchFieldDataType.Collection(SearchFieldDataType.String),
        searchable=True,
        filterable=True,
        facetable=True,
    ),

    # Unique identifier for each chunk of the policy
    SearchField(
        name="chunk_id",
        type=SearchFieldDataType.String,
        key=True,
        sortable=True,
        filterable=True,
        facetable=True,
        analyzer_name="keyword",
    ),

    # Chunked content from the policy document (e.g., sections or paragraphs)
    SearchField(
        name="chunk",
        type=SearchFieldDataType.String,
        searchable=True,
        sortable=False,
        filterable=False,
        facetable=False,
    ),

    # Vector representation of the chunk, used for vector (similarity) search
    SearchField(
        name="vector",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        vector_search_dimensions=3072,  # Must match the embedding model; text-embedding-3-large produces 3072 dimensions
        vector_search_profile_name="myHnswProfile",
    ),
]
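
The `vector_search_dimensions` value has to agree with the embedding model behind the vectorizer. The sketch below is one way to guard against a mismatch; the lookup table lists the default output sizes of the Azure OpenAI embedding models, and `check_vector_dimensions` is a hypothetical helper rather than an SDK function.

```python
# Default output dimensions of the Azure OpenAI embedding models.
# The text-embedding-3 models can also be asked for fewer dimensions;
# this table assumes the defaults.
EMBEDDING_DIMENSIONS = {
    "text-embedding-ada-002": 1536,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}


def check_vector_dimensions(model_name: str, configured_dimensions: int) -> None:
    """Raise ValueError if the index dimensions do not match the model (hypothetical helper)."""
    expected = EMBEDDING_DIMENSIONS.get(model_name)
    if expected is None:
        raise ValueError(f"Unknown embedding model: {model_name!r}")
    if expected != configured_dimensions:
        raise ValueError(
            f"{model_name} produces {expected} dimensions, "
            f"but the index field is configured for {configured_dimensions}"
        )


# Matches the `vector` field defined above, so this passes silently.
check_vector_dimensions("text-embedding-3-large", 3072)
```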


## Configuring Vector Search

Configuring vector search in Azure AI Search involves setting up algorithms and profiles to handle vector-based queries. These are particularly useful for semantic search scenarios, such as finding similar items based on vector representations.

### Understanding the Configuration

The configuration consists of two main components: algorithm configurations and vector search profiles.

#### Algorithm Configurations:

1. **HnswAlgorithmConfiguration**: Hierarchical Navigable Small World (HNSW) is a high-performance, memory-efficient algorithm for approximate nearest neighbor search in high-dimensional spaces. It creates a multi-layer graph structure that enables fast search for nearest neighbors in high-dimensional data. The configuration includes:
   - `name`: A unique identifier for this configuration.
   - `kind`: Specifies the algorithm type, in this case, it's HNSW.
   - `parameters`: Settings that tune HNSW's behavior for performance and accuracy:
     - `m`: Controls the degree of the graph (the number of bi-directional links per node), affecting both search speed and accuracy.
     - `ef_construction`: Influences the index construction time and quality.
     - `ef_search`: Determines the trade-off between search time and accuracy.
     - `metric`: Specifies the distance function used for measuring vector similarity, such as "cosine".

2. **ExhaustiveKnnAlgorithmConfiguration**: A brute-force algorithm that compares the query against every vector in the index at query time. It is slower than HNSW but returns exact nearest neighbors, which makes it useful when recall matters more than latency. Like HNSW, it has `name`, `kind`, and a `metric` parameter, but none of the additional tuning parameters.
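
To make the brute-force idea concrete, here is a small pure-Python sketch of exhaustive k-nearest-neighbor search with the cosine metric. It is a conceptual illustration only, not how Azure AI Search implements it, and the toy vectors and chunk ids are made up.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def exhaustive_knn(
    query: Sequence[float], vectors: Dict[str, Sequence[float]], k: int
) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the k most similar."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up two-dimensional "embeddings" keyed by chunk id.
toy_vectors = {
    "chunk-a": [1.0, 0.0],
    "chunk-b": [0.0, 1.0],
    "chunk-c": [1.0, 1.0],
}
top = exhaustive_knn([1.0, 0.1], toy_vectors, k=2)
print([doc_id for doc_id, _ in top])
```

Every query touches every vector, so the cost grows linearly with index size. HNSW avoids that full scan by walking a graph, at the price of occasionally missing a true neighbor.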


### Tuning HNSW Parameters for Optimal Performance

**Striking the Right Balance between Recall, Latency, and Indexing**

The HNSW algorithm parameters can be adjusted to optimize the performance of your vector search. Here are some strategies:

- **Increase `ef_search`**: This can improve recall without needing to reindex. However, monitor your system for potential latency increases. If increasing `ef_search` isn't effective or causes high latency, consider the next steps.

- **Reindex with higher values of `m` and/or `ef_construction`**: This can improve the quality of the search. However, keep in mind that increasing `ef_construction` may result in longer indexing latency.

- **Increase the `m` value**: This should be done carefully and only if the previous steps don't sufficiently improve recall. Increasing `m` can improve the quality of the HNSW graph, but it may also increase memory usage and indexing time.

Remember, tuning these parameters involves a trade-off between recall and latency. It's important to test different configurations and monitor their impact on your system's performance.   
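
One way to measure the effect of a parameter change is recall@k: the fraction of the true top-k neighbors (for example, from an exhaustive KNN query) that the HNSW query also returns. The sketch below assumes you already have both ranked id lists; `recall_at_k` and the sample ids are hypothetical.

```python
from typing import Sequence


def recall_at_k(approximate_ids: Sequence[str], exact_ids: Sequence[str], k: int) -> float:
    """Fraction of the exact top-k neighbors that the approximate search also returned."""
    if k <= 0:
        raise ValueError("k must be positive")
    exact_top = set(exact_ids[:k])
    if not exact_top:
        raise ValueError("exact_ids must not be empty")
    return len(set(approximate_ids[:k]) & exact_top) / len(exact_top)


# Made-up result lists: ground truth vs. what an approximate query returned.
exact = ["c1", "c2", "c3", "c4"]
approximate = ["c1", "c3", "c9", "c2"]
print(f"recall@3 = {recall_at_k(approximate, exact, k=3):.2f}")
```

Averaging this score over a representative set of queries, alongside query latency, gives a concrete basis for comparing parameter settings.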

#### Vector Search Profiles:

Profiles name the combination of settings that a vector field uses. Each profile, like `myHnswProfile` or `myExhaustiveKnnProfile`, is linked to exactly one algorithm configuration via `algorithm_configuration_name` and can optionally reference a vectorizer via `vectorizer_name`.

For example, you might have a profile `fastSearchProfile` linked to an HNSW configuration for general queries where speed is essential, and another profile `accurateSearchProfile` linked to an exhaustive KNN configuration for scenarios where precision is paramount.

```python
fastSearchProfile = {
    "name": "fastSearchProfile",
    "algorithm_configuration_name": "myHnswConfiguration"
}

accurateSearchProfile = {
    "name": "accurateSearchProfile",
    "algorithm_configuration_name": "myExhaustiveKnnConfiguration"
}
```
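
A profile that points at an algorithm configuration name that was never defined is an easy mistake to make. The following sketch checks the dictionary-style profiles shown above against a list of defined configuration names; `find_dangling_profiles` is a hypothetical helper.

```python
from typing import Dict, Iterable, List


def find_dangling_profiles(
    profiles: Iterable[Dict[str, str]], algorithm_names: Iterable[str]
) -> List[str]:
    """Return the names of profiles whose algorithm configuration is not defined."""
    known = set(algorithm_names)
    return [
        profile["name"]
        for profile in profiles
        if profile["algorithm_configuration_name"] not in known
    ]


profiles = [
    {"name": "fastSearchProfile", "algorithm_configuration_name": "myHnswConfiguration"},
    {"name": "accurateSearchProfile", "algorithm_configuration_name": "myExhaustiveKnnConfiguration"},
]

# Only the HNSW configuration is defined here, so the second profile is flagged.
dangling = find_dangling_profiles(profiles, ["myHnswConfiguration"])
print(dangling)
```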

### Why Configure Vector Search This Way?

- **Flexibility**: Having different algorithms and profiles lets you tailor your search strategy to specific needs. For example, use HNSW for general queries where speed is essential and exhaustive KNN for scenarios where precision is paramount.

- **Tunable Performance**: HNSW algorithm parameters can be adjusted to find the right balance between speed and accuracy, making it adaptable to various datasets and search requirements.

- **Accuracy vs. Speed Trade-offs**: Exhaustive KNN offers high accuracy at the cost of speed and is suitable for scenarios where search completeness is critical.

### Integrated Vectorization

This notebook uses integrated vectorization, in which Azure AI Search generates the embeddings itself during indexing and querying. This removes the need for a separate preprocessing pipeline and ensures that documents and queries are embedded with the same model, which keeps the vectors comparable and improves search relevance. For details, see the [integrated vectorization documentation](https://learn.microsoft.com/en-us/azure/search/vector-search-integrated-vectorization).

In [5]:
# Configure the vector search for policy chunk indexing and retrieval
vector_search = VectorSearch(
    algorithms=[
        HnswAlgorithmConfiguration(
            name="myHnsw",  # Name of the HNSW algorithm configuration
            parameters=HnswParameters(
                m=5,  # Number of bi-directional links per element
                ef_construction=300,  # Size of the dynamic candidate list during index construction
                ef_search=400,  # Size of the dynamic candidate list during querying
            ),
        ),
    ],
    profiles=[
        VectorSearchProfile(
            name="myHnswProfile",  # Profile name referenced in the index's vector fields
            algorithm_configuration_name="myHnsw",  # Links to the HNSW algorithm configuration
            vectorizer_name="myOpenAIVectorizer",  # Associates with the defined vectorizer
        )
    ],
    vectorizers=[
        AzureOpenAIVectorizer(
            vectorizer_name="myOpenAIVectorizer",  # Name of the vectorizer
            parameters=AzureOpenAIVectorizerParameters(
                resource_url=os.environ['AZURE_OPENAI_ENDPOINT'],  # Azure OpenAI resource endpoint
                deployment_name=os.environ['AZURE_OPENAI_EMBEDDING_DEPLOYMENT'],  # Deployment ID of the embedding model
                model_name=os.environ['AZURE_OPENAI_EMBEDDING_DEPLOYMENT'],  # Embedding model name; reusing this variable assumes the deployment is named after the model (e.g., text-embedding-3-large)
                api_key=os.environ['AZURE_OPENAI_KEY'],  # API key for authentication
            ),
        ),
    ],
)


## Configuring Semantic Search

In Azure AI Search, a `SemanticConfiguration` enhances search capabilities by leveraging advanced AI models to interpret the intent and context of search queries. This configuration is particularly useful for creating a more intuitive and context-aware search experience. The key components of this configuration are `SemanticPrioritizedFields` and `SemanticField`.

### SemanticPrioritizedFields

`SemanticPrioritizedFields` plays a critical role in guiding the semantic search engine towards the most relevant parts of your documents. It includes three main properties:

1. **Title Field (`title_field`)**: This field is typically given higher priority in semantic analysis. It's crucial for summarizing the document and is often used in generating captions, highlights, and determining semantic relevance.

2. **Content Fields (`content_fields`)**: These fields usually contain the bulk of the document's text in natural language. They provide detailed context and are essential for in-depth semantic analysis. The order of the fields indicates their priority, with higher-priority fields being more influential in the analysis.

3. **Keywords Fields (`keywords_fields`)**: These fields should contain key terms or concepts relevant to the document. They are used to enhance the semantic understanding of the document's main themes or topics.

### SemanticField

`SemanticField` specifies individual fields from the index to be used in the `SemanticPrioritizedFields`. Each `SemanticField` requires only one attribute:

- **Field Name (`field_name`)**: This is the name of the field in the index that is to be used for semantic analysis.


In [6]:
# Configure semantic search for the policy knowledge store index
semantic_config_policy_index = SemanticConfiguration(
    name="policy-index-semantic-config",  # Give a descriptive name aligned with your index
    prioritized_fields=SemanticPrioritizedFields(
        # Use policy_name as the semantic title for better highlights in results
        title_field=SemanticField(field_name="policy_name"),

        # Use payer_name, medical_specialties, and drug_names as keyword hints
        keywords_fields=[
            SemanticField(field_name="payer_name"),
            SemanticField(field_name="medical_specialties"),
            SemanticField(field_name="drug_names"),
        ],

        # Chunk is your primary content for retrieval and semantic relevance
        content_fields=[
            SemanticField(field_name="chunk"),
        ],
    )
)

# Wrap into the semantic search settings
semantic_search_policy = SemanticSearch(
    configurations=[semantic_config_policy_index]
)


In this configuration, we are enabling semantic search for the Policy Knowledge Store, which means Azure AI Search will try to understand the meaning behind search queries instead of just looking for exact keyword matches. To do that well, we need to tell the system which parts of our indexed data are most important for understanding what each policy is about.

We start by defining `policy_name` as the `title_field`. This is the human-readable name of the policy (like “Type 2 Diabetes Coverage Policy”), and it will appear as the title in the search results. Titles help Azure summarize and highlight results more naturally, giving users quick insight into what each result is about.

Then we define `payer_name`, `medical_specialties`, and `drug_names` as `keywords_fields`. These fields act like tags or key concepts. If someone searches for “oncology drugs covered by Cigna,” these keyword fields help the system focus on relevant content because it knows which policies mention “oncology,” “Cigna,” or certain drugs. These fields guide the AI to understand the overall theme of the document.

Finally, we use `chunk` as the only entry in `content_fields`. This is where the real content of the policy lives, split into smaller pieces or paragraphs. This is what the semantic engine will actually read and compare against the user’s query to decide if it’s a good match. By pointing to `chunk`, we ensure the model looks at the right level of detail without being overwhelmed by the entire document.
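
Every name passed to `SemanticField` must refer to a field that exists in the index schema. As a minimal sketch, the check below compares the names used in the semantic configuration with the field names defined earlier in this notebook; `missing_semantic_fields` is a hypothetical helper, and the name list is copied by hand from the `fields` cell.

```python
from typing import List, Sequence


def missing_semantic_fields(
    index_field_names: Sequence[str],
    title_field: str,
    content_fields: Sequence[str],
    keywords_fields: Sequence[str],
) -> List[str]:
    """Return semantic field names that are not present in the index schema."""
    known = set(index_field_names)
    referenced = [title_field, *content_fields, *keywords_fields]
    return [name for name in referenced if name not in known]


# Field names from the schema defined earlier in this notebook.
index_field_names = [
    "parent_id", "parent_path", "policy_name", "payer_name", "drug_names",
    "medical_specialties", "covered_diseases", "covered_diseases_icd_codes",
    "covered_drug_codes", "chunk_id", "chunk", "vector",
]

missing = missing_semantic_fields(
    index_field_names,
    title_field="policy_name",
    content_fields=["chunk"],
    keywords_fields=["payer_name", "medical_specialties", "drug_names"],
)
print(missing)
```

An empty list means the configuration only refers to fields that the index actually defines.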

## Create or Update Index

In [7]:
index = SearchIndex(
    name=os.environ["AZURE_SEARCH_INDEX_NAME"],
    fields=fields,
    vector_search=vector_search,
    semantic_search=semantic_search_policy,
)

try:
    result = admin_documents_index_client.create_or_update_index(index)
    print("Index", result.name, "created")
except Exception as ex:
    print("Error creating index:", ex)


Index ai-policies-v2-index created
