## 📚 Prerequisites

Ensure that your Azure Services are properly set up, your Conda environment is created, and your environment variables are configured as per the instructions in the [README.md](README.md) file.

#### Import Libraries

In [None]:
!pip install azure-search-documents==11.6.0b5
!pip install python-dotenv
!pip install azure-storage-blob
!pip install azure-identity

In [2]:
from dotenv import load_dotenv
from azure.identity import DefaultAzureCredential
from azure.core.credentials import AzureKeyCredential
import os
load_dotenv(override=True) # take environment variables from .env.

True

#### Load Env Variables

In [3]:
#Azure Search
endpoint = os.environ["AZURE_AI_SEARCH_SERVICE_ENDPOINT"]
credential = AzureKeyCredential(os.getenv("AZURE_AI_SEARCH_ADMIN_KEY")) if os.getenv("AZURE_AI_SEARCH_ADMIN_KEY") else DefaultAzureCredential()
index_name = os.getenv("AZURE_AI_SEARCH_INDEX_NAME", "ai-policies-index")

#blob storage
blob_connection_string = os.environ["BLOB_CONNECTION_STRING"]
search_blob_connection_string = os.getenv("SEARCH_BLOB_DATASOURCE_CONNECTION_STRING", blob_connection_string)
blob_container_name = os.getenv("BLOB_CONTAINER_NAME", "pre-auth-policies")

#Azure OpenAI
azure_openai_endpoint = os.environ["AZURE_OPENAI_ENDPOINT"]
azure_openai_key = os.getenv("AZURE_OPENAI_KEY")
azure_openai_embedding_deployment = os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT", "text-embedding-3-large")
azure_openai_model_name = os.getenv("AZURE_OPENAI_EMBEDDING_MODEL_NAME", "text-embedding-3-large")
azure_openai_model_dimensions = int(os.getenv("AZURE_OPENAI_EMBEDDING_DIMENSIONS", 3072))

# This field is only necessary if you want to use OCR to scan PDFs in the data source
azure_ai_services_key = os.getenv("AZURE_AI_SERVICES_KEY", "")

use_ocr = len(azure_ai_services_key) > 0
# OCR must be used to add page numbers
add_page_numbers = use_ocr

## Upload Policies to Blob Storage

In this section, we will upload policy documents to Azure Blob Storage. This process involves connecting to the Azure Blob Storage account, creating a container if it doesn't already exist, and uploading the policy documents from a specified local directory to the blob container. 

### Steps:
1. **Initialize Azure Blob Storage Client**: Connect to the Azure Blob Storage account using the connection string and set up the container client.
2. **Create Container (if not exists)**: Ensure the specified container exists in the Blob Storage. If it doesn't, create it.
3. **Upload Policy Documents**: Iterate through the local directory containing the policy documents and upload each document to the Blob Storage container.

In [4]:
from src.storage.blob_helper import AzureBlobUploader

uploader = AzureBlobUploader(
    connection_string=search_blob_connection_string,
    container_name=blob_container_name,
    use_user_identity=False
)

2024-10-20 22:41:06,966 - micro - MainProcess - INFO     Container 'pre-auth-policies' already exists. (blob_helper.py:_create_container_if_not_exists:52)


In [5]:
# Local directory to upload files from
LOCAL_PATH = r"C:\Users\pablosal\Desktop\gbb-ai-hls-factory-prior-auth\utils\data\cases\policies"

# Remote directory in blob storage
REMOTE_PATH = "policies_ocr"

uploader.upload_files(
    local_path=LOCAL_PATH,
    remote_path=REMOTE_PATH,
    file_filter=AzureBlobUploader.filter_by_extension('.pdf'),
    overwrite=True
)

2024-10-20 22:41:07,207 - micro - MainProcess - INFO     Uploaded 'policies_ocr/001_inflammatory_Conditions.pdf' to blob storage. (blob_helper.py:upload_files:90)


## Setting Up a Blob Data Source Connector in Azure AI Search

In [6]:
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexerClient
from azure.search.documents.indexes.models import (
    SearchIndexerDataSourceConnection,
    SearchIndexerDataContainer,
    NativeBlobSoftDeleteDeletionDetectionPolicy
)

# Initialize the SearchIndexerClient
indexer_client = SearchIndexerClient(endpoint, credential)

# Create a data container for your blob storage
container = SearchIndexerDataContainer(name=blob_container_name)

# Notes:
# - data_change_detection_policy is not applicable for Blob Storage. The indexer automatically detects changes in blobs based on their LastModified timestamps.
# - Include data_deletion_detection_policy if needed. Use it to detect deletions, but remember to enable soft delete on your storage account.

# Create a data source connection without data_change_detection_policy
data_source_connection = SearchIndexerDataSourceConnection(
    name=f"{index_name}-blob",
    type="azureblob",
    connection_string=search_blob_connection_string,
    container=container,
    data_deletion_detection_policy=NativeBlobSoftDeleteDeletionDetectionPolicy()
)

# Create or update the data source connection
data_source = indexer_client.create_or_update_data_source_connection(data_source_connection)

print(f"Data source '{data_source.name}' created or updated")

Data source 'ai-policies-index-blob' created or updated


### Creating a Search Index

A search index is where both vector and non-vector content is stored. This index enables efficient searching and retrieval of documents, allowing for advanced search capabilities and quick access to relevant information.

In [7]:
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient, SearchIndexerClient
from azure.search.documents.indexes.models import (
    SearchField,
    SearchFieldDataType,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchProfile,
    AzureOpenAIVectorizerParameters,
    AzureOpenAIVectorizer,
    SemanticConfiguration,
    SemanticSearch,
    SemanticSearch,
    SemanticPrioritizedFields,
    SemanticField,
    SearchIndex,
    HnswParameters,
    SimpleField,
    SearchIndexerDataSourceConnection,
    SearchIndexerDataContainer,
    SearchIndexer,
    SearchIndexerDataSourceType,
    NativeBlobSoftDeleteDeletionDetectionPolicy,
)

In [8]:
# Create a search index  
index_client = SearchIndexClient(endpoint=endpoint, credential=credential)  

In [9]:
# Define fields
fields = [
    SearchField(
        name="parent_id",
        type=SearchFieldDataType.String,
        sortable=True,
        filterable=True,
        facetable=True
    ),
    SearchField(
        name="title",
        type=SearchFieldDataType.String,
    ),
    SearchField(
        name="parent_path",
        type=SearchFieldDataType.String,
    ),
    SearchField(
        name="chunk_id",
        type=SearchFieldDataType.String,
        key=True,
        sortable=True,
        filterable=True,
        facetable=True,
        analyzer_name="keyword"
    ),
    SearchField(
        name="chunk",
        type=SearchFieldDataType.String,
        searchable=True,
        sortable=False,
        filterable=False,
        facetable=False,
    ),
    SearchField(
        name="vector",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        vector_search_dimensions=azure_openai_model_dimensions,
        vector_search_profile_name="myHnswProfile"
    )
]

if add_page_numbers:
    fields.append(
        SearchField(name="page_number", type=SearchFieldDataType.String, sortable=True, filterable=True, facetable=False)
    )

# Configure the vector search configuration  
vector_search = VectorSearch(  
    algorithms=[  
        HnswAlgorithmConfiguration(name="myHnsw",
                                   parameters=HnswParameters(
                                        m=4,
                                        ef_construction=400,
                                        ef_search=500,
                                   )),
    ],  
    profiles=[  
        VectorSearchProfile(  
            name="myHnswProfile",  
            algorithm_configuration_name="myHnsw",  
            vectorizer_name="myOpenAI",  
        )
    ],  
    vectorizers=[  
        AzureOpenAIVectorizer(  
            vectorizer_name="myOpenAI",  
            parameters=AzureOpenAIVectorizerParameters(  
                resource_url=azure_openai_endpoint,  
                deployment_name=azure_openai_embedding_deployment,
                model_name=azure_openai_model_name,
                api_key=azure_openai_key,
            ),
        ),  
    ],  
)  
  
semantic_config = SemanticConfiguration(  
    name="my-semantic-config",  
    prioritized_fields=SemanticPrioritizedFields(  
        content_fields=[SemanticField(field_name="chunk")]  
    ),  
)

In [10]:
semantic_search = SemanticSearch(configurations=[semantic_config])

# Create the search index
index = SearchIndex(
    name=index_name,
    fields=fields,
    vector_search=vector_search,
    semantic_search=semantic_search
)

# Create or update the index
index_result = index_client.create_or_update_index(index)
print(f"Index '{index_result.name}' created or updated successfully.")

Index 'ai-policies-index' created or updated successfully.


## Creating a Skillset for Integrated Vectorization

A skillset in Azure AI Search defines a collection of skills that are applied to your data during indexing. These skills can include text splitting for data chunking, vectorization using Azure OpenAI embeddings, and more. By configuring a skillset, you can enhance your search index with advanced capabilities, making it more efficient and powerful.

### Key Components:
1. **Text Split Skill**: This skill chunks your data into manageable pieces, improving the granularity and relevance of search results.
2. **Azure OpenAI Embedding Skill**: This skill integrates with Azure OpenAI to generate embeddings for your data, enhancing vector search capabilities.
3. **Indexer Projection**: Specifies secondary indexes used for chunked data, ensuring that the processed data is correctly indexed and searchable.

By setting up a skillset, you can leverage these advanced features to create a more robust and efficient search index.

In [11]:
# Create a skillset  
skillset_name = f"{index_name}-skillset"

In [12]:
from azure.search.documents.indexes.models import (
    SplitSkill,
    InputFieldMappingEntry,
    OutputFieldMappingEntry,
    AzureOpenAIEmbeddingSkill,
    OcrSkill,
    SearchIndexerIndexProjection,
    SearchIndexerIndexProjectionSelector,
    SearchIndexerIndexProjectionsParameters,
    IndexProjectionMode,
    SearchIndexerSkillset,
    CognitiveServicesAccountKey
)

# Create a skillset  
skillset_name = f"{index_name}-skillset"

def create_ocr_skillset():
    ocr_skill = OcrSkill(
        description="OCR skill to scan PDFs and other images with text",
        context="/document/normalized_images/*",
        line_ending="Space",
        default_language_code="en",
        should_detect_orientation=True,
        inputs=[
            InputFieldMappingEntry(name="image", source="/document/normalized_images/*")
        ],
        outputs=[
            OutputFieldMappingEntry(name="text", target_name="text"),
            OutputFieldMappingEntry(name="layoutText", target_name="layoutText")
        ]
    )

    split_skill = SplitSkill(  
        description="Split skill to chunk documents",  
        text_split_mode="pages",  
        context="/document/normalized_images/*",  
        maximum_page_length=3000,  
        page_overlap_length=500,  
        inputs=[  
            InputFieldMappingEntry(name="text", source="/document/normalized_images/*/text"),  
        ],  
        outputs=[  
            OutputFieldMappingEntry(name="textItems", target_name="pages")  
        ]
    )

    embedding_skill = AzureOpenAIEmbeddingSkill(  
        description="Skill to generate embeddings via Azure OpenAI",  
        context="/document/normalized_images/*/pages/*",  
        resource_url=azure_openai_endpoint,  
        deployment_name=azure_openai_embedding_deployment,  
        model_name=azure_openai_model_name,
        dimensions=azure_openai_model_dimensions,
        api_key=azure_openai_key,  
        inputs=[  
            InputFieldMappingEntry(name="text", source="/document/normalized_images/*/pages/*"),  
        ],  
        outputs=[
            OutputFieldMappingEntry(name="embedding", target_name="vector")  
        ]
    )

    index_projections = SearchIndexerIndexProjection(  
        selectors=[  
            SearchIndexerIndexProjectionSelector(  
                target_index_name=index_name,  
                parent_key_field_name="parent_id",  
                source_context="/document/normalized_images/*/pages/*",  
                mappings=[
                    InputFieldMappingEntry(name="chunk", source="/document/normalized_images/*/pages/*"),  
                    InputFieldMappingEntry(name="vector", source="/document/normalized_images/*/pages/*/vector"),
                    InputFieldMappingEntry(name="parent_path", source="/document/metadata_storage_path"),
                    InputFieldMappingEntry(name="title", source="/document/metadata_storage_name"),
                    InputFieldMappingEntry(name="page_number", source="/document/normalized_images/*/pageNumber")
                ]
            )
        ],  
        parameters=SearchIndexerIndexProjectionsParameters(  
            projection_mode=IndexProjectionMode.SKIP_INDEXING_PARENT_DOCUMENTS  
        )  
    )

    cognitive_services_account = CognitiveServicesAccountKey(key=azure_ai_services_key) if use_ocr else None

    skills = [ocr_skill, split_skill, embedding_skill]

    return SearchIndexerSkillset(  
        name=skillset_name,  
        description="Skillset to chunk documents and generating embeddings",  
        skills=skills,  
        index_projection=index_projections,
        cognitive_services_account=cognitive_services_account
    )

def create_skillset():
    split_skill = SplitSkill(  
        description="Split skill to chunk documents",  
        text_split_mode="pages",  
        context="/document",  
        maximum_page_length=2000,  
        page_overlap_length=500,  
        inputs=[  
            InputFieldMappingEntry(name="text", source="/document/content"),  
        ],  
        outputs=[  
            OutputFieldMappingEntry(name="textItems", target_name="pages")  
        ]
    )

    embedding_skill = AzureOpenAIEmbeddingSkill(  
        description="Skill to generate embeddings via Azure OpenAI",  
        context="/document/pages/*",  
        resource_url=azure_openai_endpoint,  
        deployment_name=azure_openai_embedding_deployment,  
        model_name=azure_openai_model_name,
        dimensions=azure_openai_model_dimensions,
        api_key=azure_openai_key,  
        inputs=[  
            InputFieldMappingEntry(name="text", source="/document/pages/*"),  
        ],  
        outputs=[
            OutputFieldMappingEntry(name="embedding", target_name="vector")  
        ]
    )

    index_projections = SearchIndexerIndexProjection(  
        selectors=[  
            SearchIndexerIndexProjectionSelector(  
                target_index_name=index_name,  
                parent_key_field_name="parent_id",  
                source_context="/document/pages/*",  
                mappings=[
                    InputFieldMappingEntry(name="chunk", source="/document/pages/*"),  
                    InputFieldMappingEntry(name="vector", source="/document/pages/*/vector"),
                    InputFieldMappingEntry(name="parent_path", source="/document/metadata_storage_path"),
                    InputFieldMappingEntry(name="title", source="/document/metadata_storage_name"),
                ]
            )
        ],  
        parameters=SearchIndexerIndexProjectionsParameters(  
            projection_mode=IndexProjectionMode.SKIP_INDEXING_PARENT_DOCUMENTS  
        )  
    )

    cognitive_services_account = CognitiveServicesAccountKey(key=azure_ai_services_key) if use_ocr else None

    skills = [split_skill, embedding_skill]

    return SearchIndexerSkillset(  
        name=skillset_name,  
        description="Skillset to chunk documents and generating embeddings",  
        skills=skills,  
        index_projection=index_projections,
        cognitive_services_account=cognitive_services_account
    )

In [13]:
# Example usage
try:
    skillset = create_ocr_skillset()
    client = SearchIndexerClient(endpoint, credential)
    client.create_or_update_skillset(skillset)
    print(f"{skillset.name} created")
except Exception as e:
    print(f"Failed to create skillset: {e.message}")

ai-policies-index-skillset created


## Indexing Data

In [14]:
from azure.search.documents.indexes.models import IndexingParameters

# Configure indexing parameters to include blob metadata
indexing_parameters = IndexingParameters(
    configuration={
        "parsingMode": "default",
        "indexStorageMetadata": True
    }
)


In [15]:
from azure.search.documents.indexes.models import (
    SearchIndexer,
    IndexingParameters,
    IndexingParametersConfiguration,
    BlobIndexerImageAction
)

# Create an indexer  
indexer_name = f"{index_name}-indexer"  

indexer_parameters = None
if use_ocr:
    indexer_parameters = IndexingParameters(
        configuration=IndexingParametersConfiguration(
            image_action=BlobIndexerImageAction.GENERATE_NORMALIZED_IMAGE_PER_PAGE,
            query_timeout=None))

indexer = SearchIndexer(  
    name=indexer_name,  
    description="Indexer to index documents and generate embeddings",  
    skillset_name=skillset_name,  
    target_index_name=index_name,  
    data_source_name=data_source.name,
    parameters=indexer_parameters
)  

indexer_client = SearchIndexerClient(endpoint, credential)  
indexer_result = indexer_client.create_or_update_indexer(indexer)  
  
# Run the indexer  
indexer_client.run_indexer(indexer_name)  
print(f' {indexer_name} is created and running. If queries return no results, please wait a bit and try again.')  

 ai-policies-index-indexer is created and running. If queries return no results, please wait a bit and try again.


## Retrieval Results

In [22]:
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizableTextQuery

In [23]:
search_client = SearchClient(
    endpoint=os.environ["AZURE_AI_SEARCH_SERVICE_ENDPOINT"],
    index_name=index_name,
    credential=AzureKeyCredential(os.environ["AZURE_AI_SEARCH_ADMIN_KEY"]),
)

In [26]:
SEARCH_QUERY = "Prior Authorization is being requested for the following medication: Adalinumab 40mg"

In [30]:
vector_query = VectorizableTextQuery(text=SEARCH_QUERY, k_nearest_neighbors=5, fields="vector", exhaustive=True)

In [33]:
from azure.search.documents.models import QueryType, QueryCaptionType, QueryAnswerType

results = search_client.search(  
    search_text=SEARCH_QUERY,  
    vector_queries=[vector_query],
    #select=["content", "id"],
    #filters = 
    query_type=QueryType.SEMANTIC, semantic_configuration_name='my-semantic-config', query_caption=QueryCaptionType.EXTRACTIVE, query_answer=QueryAnswerType.EXTRACTIVE,
    top=5
)

for result in results:
    print("=" * 40)
    print(f"ID: {result['chunk_id']}")
    print(f"Reranker Score: {result['@search.reranker_score']}")
    print(f"Source_doc_path: {result['parent_path']}")
    content = result['chunk'][:100] + '...' if len(result['chunk']) > 100 else result['chunk']
    print(f"Content: {content}")

    captions = result.get("@search.captions", [])
    if captions:
        caption = captions[0]
        if caption.highlights:
            print(f"Caption: {caption.highlights}")
        else:
            print(f"Caption: {caption.text}")
    print("=" * 40)


ID: 5a23e2d0e7e0_aHR0cHM6Ly9zdG9yYWdlZmFjdG9yeWVhc3R1cy5ibG9iLmNvcmUud2luZG93cy5uZXQvcHJlLWF1dGgtcG9saWNpZXMvcG9saWNpZXNfb2NyLzAwMV9pbmZsYW1tYXRvcnlfQ29uZGl0aW9ucy5wZGY1_normalized_images_3_pages_0
Reranker Score: 2.91237735748291
Source_doc_path: https://storagefactoryeastus.blob.core.windows.net/pre-auth-policies/policies_ocr/001_inflammatory_Conditions.pdf
Content: Other Uses with Supportive Evidence There are guidelines and/or published data supporting the use of...
Caption: pulmonary and neurosarcoidosis.15 POLICY STATEMENT<em> Prior Authorization</em> is recommended for prescription benefit coverage of<em> adalimumab</em> products All approvals are provided for the duration noted below In cases where the<em> approval</em> is<em> authorized</em> in months, 1 month is equal to 30 days Because of the specialized skills required for evaluation and diagnosi...
ID: 5a23e2d0e7e0_aHR0cHM6Ly9zdG9yYWdlZmFjdG9yeWVhc3R1cy5ibG9iLmNvcmUud2luZG93cy5uZXQvcHJlLWF1dGgtcG9saWNpZXMvcG9saWNpZXNfb2NyL