## 📚 Prerequisites

Before running this notebook, ensure you have provisioned your Azure resources, configured your environment variables (Search endpoint, key, and index name), and set up a Conda environment to manage dependencies. Refer to [REQUIREMENTS.md](REQUIREMENTS.md) for full setup instructions.

## 📋 Table of Contents

This notebook walks through **two approaches for indexing policy documents stored in Azure Blob Storage**, using either the **Push SDK** or the **native Azure AI Search Indexer**.

#### 1. [**Indexing Policy Documents from Blob Storage Using the SDK (Push Model)**](#index-using-push-sdk)

In this approach, we build a fully controlled indexing pipeline using the Azure SDK for Python.

You will learn how to:

- Load PDF or text-based policy documents directly from Blob Storage
- Preprocess and chunk content for retrieval
- Generate embeddings using Azure OpenAI
- Upload documents, metadata, and vector embeddings into Azure AI Search using the `SearchClient`

This method gives you **maximum flexibility** and is ideal when you need to integrate your own preprocessing logic, custom chunking strategies, or real-time updates.

#### 2. [**Indexing Policy Documents Using Azure AI Search Indexers with Custom Skillsets**](#index-using-indexer)

This approach uses the built-in **Azure Search Indexer** to automatically crawl documents from Blob Storage and enrich them using a **custom skillset**.

You will learn how to:

- Connect Blob Storage as a data source
- Define a skillset that performs OCR, embedding, or text extraction
- Automatically map enriched fields into your Azure Search index

This method is best when you want to **automate the ingestion pipeline** with minimal custom code, leveraging Azure's low-code enrichment platform.

> By the end of this notebook, you’ll understand how to build both a **fully customized ingestion flow** using the SDK and a **scalable low-code pipeline** using native indexers with skillsets — both optimized for retrieving policy content stored in Blob Storage.


In [1]:
import os
from tenacity import retry, wait_random_exponential, stop_after_attempt
from dotenv import load_dotenv
import os
import json
import copy
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    ExhaustiveKnnAlgorithmConfiguration,
    ExhaustiveKnnParameters,
    SearchIndex,
    SearchField,
    SearchFieldDataType,
    SimpleField,
    SearchableField,
    SearchIndex,
    SemanticConfiguration,
    SemanticPrioritizedFields,
    SemanticField,
    SearchField,
    VectorSearch,
    SemanticSearch,
    HnswAlgorithmConfiguration,
    HnswParameters,
    VectorSearch,
    VectorSearchAlgorithmKind,
    VectorSearchProfile,
    SearchIndex,
    SearchField,
    SearchFieldDataType,
    SimpleField,
    SearchableField,
    VectorSearch,
    ExhaustiveKnnParameters,
    SearchIndex,
    SearchField,
    SearchFieldDataType,
    SimpleField,
    SearchableField,
    SearchIndex,
    SemanticConfiguration,
    SemanticField,
    SearchField,
    VectorSearch,
    HnswParameters,
    VectorSearch,
    VectorSearchAlgorithmKind,
    VectorSearchAlgorithmMetric,
    VectorSearchProfile,
)

# Load environment variables from .env file
load_dotenv()

# Define the target directory
target_directory = os.getcwd()  # Get the current working directory

# Move one directory back
parent_directory = os.path.dirname(target_directory)

# Check if the parent directory exists
if os.path.exists(parent_directory):
    # Change the current working directory to the parent directory
    os.chdir(parent_directory)
    print(f"Directory changed to {os.getcwd()}")
else:
    print(f"Parent directory {parent_directory} does not exist.")

Directory changed to c:\Users\pablosal\Desktop\aihlsignited-medindexer


### **Indexing Policy Documents from Blob Storage Using the SDK (Push Model)**

#### **1. Use Document Intelligence to parse and read the PDF into text (Markdown)**

In [2]:
from src.documentintelligence.document_intelligence_helper import AzureDocumentIntelligenceManager

text_extractor = AzureDocumentIntelligenceManager()

2025-04-02 19:20:55,844 - micro - MainProcess - INFO     Container 'pre-auth-policies' already exists. (blob_helper.py:_create_container_if_not_exists:89)


In [3]:
policy_raw_text_markdown = text_extractor.analyze_document(document_input="https://storageaeastusfactory.blob.core.windows.net/pre-auth-policies/policies_ocr/001.pdf", 
                                model_type="prebuilt-layout")

2025-04-02 19:20:56,765 - micro - MainProcess - INFO     Blob URL detected. Extracting content. (document_intelligence_helper.py:analyze_document:78)
2025-04-02 19:20:57,443 - micro - MainProcess - INFO     Downloaded blob 'policies_ocr/001.pdf' as bytes. (blob_helper.py:download_blob_to_bytes:311)


In [4]:
policy_raw_text_markdown.content[:100]

'<figure>\n\ncigna\nhealthcare\n\n</figure>\n\n\n# PRIOR AUTHORIZATION POLICY\n\nPOLICY:\n\nInflammatory Conditio'

#### **2. Extract Metadata**

- **Policy Name**: Extracted from document title or headers (e.g., "Antiseizure Medications – Epidiolex Prior Authorization Policy").
- **Payer Name**: Identifies which insurance company issued the policy (e.g., Cigna, UnitedHealthcare).
- **Drug Name(s)**: The medications referenced in the policy (e.g., Epidiolex, Dupixent).
- **Medical Specialties Involved**: Identifies if a policy is specific to Rheumatology, Neurology, Oncology, etc.
- **Indications & Diseases Covered**: Disease categories linked to a policy (e.g., Crohn’s disease, epilepsy)

In [5]:
import json
from typing import List, Dict, Optional
from pydantic import BaseModel
import openai
import os
import time
import requests
from azure.core.credentials import AzureKeyCredential
from utils.ml_logging import get_logger

logger = get_logger()

# --------------------------------------------------------------------------
# Configure the Azure OpenAI client using key-based authentication.
# --------------------------------------------------------------------------
client = openai.AzureOpenAI(
    api_version="2024-12-01-preview",
    azure_endpoint="https://pablo-m2areked-westeurope.cognitiveservices.azure.com",
    azure_deployment="gpt-4o-structured-outputs",
    api_key=os.getenv("AZURE_OPENAI_KEY_StructuredOutputs")
    )  # Ensure this env var is set.

model_name = "gpt-4o-structured-outputs"

In [6]:
# ----------------- DEFINE POLICY METADATA MODEL -----------------
class PolicyMetadata(BaseModel):
    policy_name: str
    payer_name: str
    drug_names: List[str]
    medical_specialties: List[str]
    indications_diseases: List[str]
    covered_diseases_icd_codes: Optional[List[str]]
    covered_drug_codes: Optional[List[str]]

    class Config:
        extra = "forbid"

# ----------------- ICD-10 LOOKUP FUNCTION -----------------
def lookup_icd_codes_for_disease(disease: str, max_retries: int = 2) -> List[str]:
    """Fetches up to three ICD-10 codes for a given disease with retries on failure."""

    if not isinstance(disease, str) or not disease.strip():
        logger.error("Invalid disease parameter provided.")
        return []

    url = "https://clinicaltables.nlm.nih.gov/api/icd10cm/v3/search"
    params = {"sf": "code,name", "terms": disease, "maxList": 3}

    for attempt in range(max_retries + 1):
        try:
            logger.info(f"[Attempt {attempt+1}] Fetching ICD-10 codes for disease: '{disease}' with params: {params}")
            resp = requests.get(url, params=params, timeout=10)
            resp.raise_for_status()

            if 'application/json' not in resp.headers.get('Content-Type', ''):
                logger.error(f"Unexpected content type: {resp.headers.get('Content-Type')}")
                return []

            data = resp.json()
            logger.debug(f"Full ICD-10 API Response: {data}")

            # ✅ Extract only the ICD-10 codes and ensure they are valid
            icd_codes = [item for item in data[1] if isinstance(item, str) and len(item) > 3]

            if icd_codes:
                logger.info(f"ICD-10 Codes Found for '{disease}': {icd_codes}")
                return icd_codes
            else:
                logger.warning(f"No valid ICD-10 codes found for '{disease}'.")
                return []

        except requests.RequestException as e:
            logger.error(f"ICD lookup failed for '{disease}' on attempt {attempt+1}: {e}")
            if attempt < max_retries:
                logger.info(f"Retrying ICD lookup for '{disease}'...")
                time.sleep(2)  # Wait before retrying

    logger.error(f"ICD lookup ultimately failed for '{disease}' after {max_retries+1} attempts.")
    return []


# ----------------- RXNORM LOOKUP FUNCTION -----------------
def lookup_drug_details(drug: str, max_retries: int = 2) -> List[str]:
    """Fetches RxNorm IDs for TAH-validated drugs with retries on failure."""
    
    if not isinstance(drug, str) or not drug.strip():
        logger.error("Invalid drug parameter provided.")
        return []

    url = "https://rxnav.nlm.nih.gov/REST/rxcui.json"
    params = {"name": drug}

    for attempt in range(max_retries + 1):
        try:
            logger.info(f"[Attempt {attempt+1}] Fetching RxNorm ID for drug: '{drug}' with params: {params}")
            resp = requests.get(url, params=params, timeout=10)
            resp.raise_for_status()

            if 'application/json' not in resp.headers.get('Content-Type', ''):
                logger.error(f"Unexpected content type: {resp.headers.get('Content-Type')}")
                return []

            data = resp.json()
            logger.debug(f"Full RxNorm API Response: {data}")

            # Extract only the RxNorm IDs
            rxnorm_ids = data.get("idGroup", {}).get("rxnormId", [])

            if rxnorm_ids:
                logger.info(f"RxNorm IDs Found for '{drug}': {rxnorm_ids}")
                return rxnorm_ids
            else:
                logger.warning(f"No RxNorm ID found for '{drug}'.")
                return []

        except requests.RequestException as e:
            logger.error(f"Drug lookup failed for '{drug}' on attempt {attempt+1}: {e}")
            if attempt < max_retries:
                logger.info(f"Retrying RxNorm lookup for '{drug}'...")
                time.sleep(2)  # Wait before retrying

    logger.error(f"RxNorm lookup ultimately failed for '{drug}' after {max_retries+1} attempts.")
    return []


# ----------------- FINAL ENRICHMENT FUNCTION -----------------
def enrich_metadata(metadata: PolicyMetadata) -> PolicyMetadata:
    """Enrich extracted metadata with ICD-10 and RxNorm codes, only for TAH-validated terms."""
    enriched = metadata.model_copy()
    enriched.covered_diseases_icd_codes = [
        code for disease in metadata.indications_diseases
        for code in lookup_icd_codes_for_disease(disease)
    ]
    enriched.covered_drug_codes = [
        code for drug in metadata.drug_names
        for code in lookup_drug_details(drug)
    ]
    return enriched
# --------------------------------------------------------------------------
# Extraction Phase: Use Azure OpenAI to parse policy text into the above schema.
# --------------------------------------------------------------------------
# Optimized function for metadata extraction
def extract_policy_metadata(policy_text: str) -> PolicyMetadata:
    messages = [
        {
            "role": "system",
            "content": (
                "You are an advanced AI system specializing in extracting **structured metadata** from clinical policy documents. "
                "Your goal is to achieve **100% accuracy** by leveraging **Tree of Thought reasoning, multi-step validation techniques, "
                "and automated normalization of payer names, drug names, and medical conditions**.\n\n"

                "### **1️⃣ Extract Policy Name (`policy_name`)** 📄\n"
                "- **Identify the official policy title** from document **headers, footers, or the first paragraph**.\n"
                "- **Ensure Standard Formatting:** Convert extracted titles into a **consistent naming format**.\n"
                "  - **Example Input (Raw OCR Extracted Text):**\n"
                "    - 'DUPIXENT (DUPILUMAB) - PRIOR AUTHORIZATION POLICY'\n"
                "    - 'Anthem BCBS Policy 2024: Prior Authorization - Dupixent'\n"
                "  - **Expected Standard Output:**\n"
                "    - 'Dupixent Prior Authorization Policy'\n\n"

                "2️⃣ **Payer Name (payer_name) - AUTOMATIC NORMALIZATION ENABLED:**\n"
                "   - Locate in **headers, footers, disclaimers, or embedded watermarks**.\n"
                "   - Normalize using the following standardized mapping:\n"
                "     - 'Cigna', 'Cigna Healthcare', 'Cigna Corp' → 'Cigna'\n"
                "     - 'Humana', 'Humana Inc', 'Humana Health Plan' → 'Humana'\n"
                "     - 'United Healthcare', 'United Health Care', 'UHC' → 'UnitedHealthcare'\n"
                "     - 'Anthem Blue Cross Blue Shield' → 'Anthem BCBS'\n"
                "     - 'Blue Cross Blue Shield' → 'BCBS'\n"
                "     - 'Kaiser Permanente' → 'Kaiser Permanente'\n"
                "     - 'Aetna', 'Aetna Health Inc', 'Aetna Insurance' → 'Aetna'\n"
                "     - 'WellCare Health Plans' → 'WellCare'\n"
                "     - 'Medicare Advantage' → 'Medicare'\n"
                "     - 'Medicaid' → 'Medicaid'\n"
                "     - 'MVP Health Care' → 'MVP HealthCare'\n"
                "     - 'HealthFirst' → 'HealthFirst'\n"
                "     - 'Molina Healthcare' → 'Molina Healthcare'\n"
                "     - 'Centene Corporation' → 'Centene'\n"
                "     - 'Blue Shield of California' → 'Blue Shield of California'\n"
                "     - 'Empire Blue Cross Blue Shield' → 'Empire BCBS'\n"
                "     - 'Horizon Blue Cross Blue Shield' → 'Horizon BCBS'\n"

                "3️⃣ **Drug Names (drug_names) - AUTOMATIC NORMALIZATION ENABLED:**\n"
                "   - Extract **all medications mentioned in the policy**.\n"
                "   - Include **both brand and generic names**.\n"
                "   - Normalize using **RxNorm drug database standards**.\n"

                "4️⃣ **Medical Specialties (medical_specialties):**\n"
                "   - Identify relevant **clinical specialties** (e.g., Neurology, Rheumatology, Oncology).\n"
                "   - Cross-check with **medical board designations** to avoid ambiguity.\n"

                "5️⃣ **Indications & Diseases (indications_diseases) - AUTOMATIC NORMALIZATION ENABLED:**\n"
                "   - Extract **every disease, condition, or indication mentioned**.\n"
                "   - Normalize to **ICD-10 classification**.\n"
                "   - Check **eligibility, exclusions, and coverage sections** for additional conditions.\n\n"

                "### **General Guidelines:**\n"
                "- If a field is missing, return an **empty string (`""`)** for text fields or an **empty list (`[]`)** for arrays.\n"
                "- Strictly **follow the JSON schema** without adding extra keys.\n"
                "- **Cross-validate extracted information** across multiple sections to prevent errors.\n"
            )
        },
        {
            "role": "user",
            "content": (
                "Extract and normalize structured metadata from the following insurance policy document:\n\n"
                f"{policy_text}\n\n"
                "### **Output Schema (Strict JSON Format):**\n"
                "{\n"
                '  "policy_name": "The official title of the prior authorization policy.",\n'
                '  "payer_name": "The name of the insurance company, automatically normalized.",\n'
                '  "drug_names": ["List of all referenced drugs, including brand and generic, automatically normalized."],\n'
                '  "medical_specialties": ["List of relevant clinical specialties (e.g., Neurology, Oncology)."],\n'
                '  "indications_diseases": ["List of all conditions covered under this policy, normalized to ICD-10 standards."]\n'
                "}"
            )
        }
    ]

    try:
        response = client.beta.chat.completions.parse(
            model=model_name,
            messages=messages,
            response_format=PolicyMetadata
        )
        metadata = response.choices[0].message.parsed

        # Enrich metadata with ICD-10 and RxNorm codes
        enriched_metadata = enrich_metadata(metadata)
        enriched = enriched_metadata.model_dump()

        return enriched
    
    except Exception as e:
        logger.error("Error during extraction: %s", e)
        raise

In [7]:
extract_policy_metadata = extract_policy_metadata(policy_raw_text_markdown.content)

2025-04-02 19:21:30,192 - micro - MainProcess - INFO     [Attempt 1] Fetching ICD-10 codes for disease: 'Ankylosing spondylitis' with params: {'sf': 'code,name', 'terms': 'Ankylosing spondylitis', 'maxList': 3} (1126985018.py:lookup_icd_codes_for_disease:27)
2025-04-02 19:21:30,356 - micro - MainProcess - INFO     ICD-10 Codes Found for 'Ankylosing spondylitis': ['M08.1', 'M45.6', 'M45.2'] (1126985018.py:lookup_icd_codes_for_disease:42)
2025-04-02 19:21:30,358 - micro - MainProcess - INFO     [Attempt 1] Fetching ICD-10 codes for disease: 'Crohn's disease' with params: {'sf': 'code,name', 'terms': "Crohn's disease", 'maxList': 3} (1126985018.py:lookup_icd_codes_for_disease:27)
2025-04-02 19:21:30,494 - micro - MainProcess - INFO     ICD-10 Codes Found for 'Crohn's disease': ['K50.90', 'K50.913', 'K50.914'] (1126985018.py:lookup_icd_codes_for_disease:42)
2025-04-02 19:21:30,496 - micro - MainProcess - INFO     [Attempt 1] Fetching ICD-10 codes for disease: 'Hidradenitis suppurativa' wit

In [8]:
extract_policy_metadata

{'policy_name': 'Inflammatory Conditions - Adalimumab Products Prior Authorization Policy',
 'payer_name': 'Cigna',
 'drug_names': ['Abrilada (adalimumab-afzb)',
  'adalimumab-aacf',
  'adalimumab-adaz',
  'adalimumab-adbm',
  'adalimumab-fkjp',
  'adalimumab-ryvk',
  'Humira (adalimumab)',
  'Amjevita (adalimumab-atto)',
  'Cyltezo (adalimumab-adbm)',
  'Hadlima (adalimumab-bwwd)',
  'Hulio (adalimumab-fkjp)',
  'Hyrimoz (adalimumab-adaz)',
  'Idacio (adalimumab-aacf)',
  'Simlandi (adalimumab-ryvk)',
  'Yuflyma (adalimumab-aaty)',
  'Yusimry (adalimumab-aqvh)'],
 'medical_specialties': ['Rheumatology',
  'Gastroenterology',
  'Dermatology',
  'Ophthalmology'],
 'indications_diseases': ['Ankylosing spondylitis',
  "Crohn's disease",
  'Hidradenitis suppurativa',
  'Juvenile idiopathic arthritis',
  'Plaque psoriasis',
  'Psoriatic arthritis',
  'Rheumatoid arthritis',
  'Ulcerative colitis',
  'Uveitis',
  "Behcet's disease",
  'Pyoderma gangrenosum',
  'Sarcoidosis',
  'Scleritis',
 

### 3. **Chunk Documents**

In [9]:
from langchain_text_splitters import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
    ("####", "Header 4")
]

In [10]:
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
chunks = splitter.split_text(policy_raw_text_markdown.content)

In [26]:
chunks

[Document(metadata={}, page_content='<figure>  \ncigna\nhealthcare  \n</figure>'),
 Document(metadata={'Header 1': 'PRIOR AUTHORIZATION POLICY'}, page_content='POLICY:  \nInflammatory Conditions - Adalimumab Products Prior Authorization  \nPolicy  \n· Abrilada™ (adalimumab-afzb subcutaneous injection - Pfizer)  \n· adalimumab-aacf subcutaneous injection (Fresenius Kabi)  \n· adalimumab-adaz subcutaneous injection (Sandoz/Novartis)  \n· adalimumab-adbm subcutaneous injection (Boehringer Ingelheim)  \n· adalimumab-fkjp subcutaneous injection (Mylan)  \n· adalimumab-ryvk subcutaneous injection (Teva/Alvotech)  \n· Amjevita® (adalimumab-atto subcutaneous injection - Amgen)  \n· Cyltezo® (adalimumab-adbm subcutaneous injection - Boehringer\nIngelheim)  \n· Hadlima™ (adalimumab-bwwd\nsubcutaneous\ninjection\n–  \nOrganon/Samsung Bioepis)  \n· Hulio® (adalimumab-fkjp subcutaneous injection - Mylan)  \n· Humira® (adalimumab subcutaneous injection - AbbVie, Cordavis)  \n. Hyrimoz® (adalimumab-a

### 4. **Vectorize, Add Metadata, Index**

In [33]:
search_client = SearchClient(
    endpoint=os.environ["AZURE_AI_SEARCH_SERVICE_ENDPOINT"],
    index_name=os.environ["AZURE_SEARCH_INDEX_NAME"],
    credential=AzureKeyCredential(os.environ["AZURE_SEARCH_ADMIN_KEY"]),
)

In [None]:
from src.aoai.aoai_helper import AzureOpenAIManager
aoai_client = AzureOpenAIManager()

In [35]:
embedding = aoai_client.generate_embedding(input_text="perro")
embedding.data[0].embedding[:10] # Display the first 10 dimensions of the embedding

[-0.047314975410699844,
 0.012618283741176128,
 0.007378609851002693,
 -0.00984053872525692,
 0.028208108618855476,
 0.031466756016016006,
 -0.02137499861419201,
 0.0013170961756259203,
 0.02908378094434738,
 0.03399328514933586]

In [37]:
n = 100  # max batch size (number of docs) to upload at a time
total_docs_uploaded = 0

# Split up a list into chunks
def divide_chunks(l, n):
    for i in range(0, len(l), n):
        yield l[i : i + n]

# Generate embeddings (assuming aoai_client is already defined and configured)
@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))
def generate_embeddings(text):
    response = aoai_client.generate_embedding(input_text=text)
    return response.data[0].embedding

# Prepare the data for indexing
chunked_content_docs = []

for chunk_counter, chunk in enumerate(chunks):
    json_data = {
        "parent_id": "001",  # Replace with dynamic Parent ID 
        "parent_path": "https://storageaeastusfactory.blob.core.windows.net/pre-auth-policies/policies_ocr/001.pdf",  # Replace with dynamic Parent Path
        "policy_name": extract_policy_metadata["policy_name"],
        "payer_name": extract_policy_metadata["payer_name"],
        "drug_names": extract_policy_metadata["drug_names"],
        "medical_specialties": extract_policy_metadata["medical_specialties"],
        "covered_diseases": extract_policy_metadata["indications_diseases"],
        "covered_diseases_icd_codes": extract_policy_metadata["covered_diseases_icd_codes"],
        "covered_drug_codes": extract_policy_metadata["covered_drug_codes"],
        "chunk_id": f"001_{chunk_counter}",  # Ensure unique chunk_id generation
        "chunk": chunk.page_content,
        "vector": generate_embeddings(chunk.page_content)
    }
    chunked_content_docs.append(json_data)

total_docs = len(chunked_content_docs)
total_docs_uploaded += total_docs
print(f"Total Documents to Upload: {total_docs}")

# Upload chunks to Azure AI Search
for documents_chunk in divide_chunks(chunked_content_docs, n):
    try:
        print(f"Uploading batch of {len(documents_chunk)} documents...")
        result = search_client.upload_documents(documents=documents_chunk)
        if all(res.succeeded for res in result):
            print(f"Upload of batch of {len(documents_chunk)} documents succeeded.")
        else:
            print("Some documents in the batch were not uploaded successfully.")
    except Exception as ex:
        print("Error in multiple documents upload: ", ex)

Total Documents to Upload: 14
Uploading batch of 14 documents...
Upload of batch of 14 documents succeeded.


### **Retrieve Data from Azure AI search**

Before executing this notebook, please review the notebook `03-retrieval.ipynb` for a better understanding of the process. 


In [44]:
import os
from dotenv import load_dotenv
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import (
    VectorQuery,
    VectorizedQuery,
    VectorizableTextQuery,
    QueryType,
    QueryCaptionType,
    QueryAnswerType,
)

In [None]:
# Set up Azure Cognitive Search credentials
service_endpoint = os.getenv("AZURE_AI_SEARCH_SERVICE_ENDPOINT")
key = os.getenv("AZURE_SEARCH_ADMIN_KEY")
credential = AzureKeyCredential(key)

# Define the name of the Azure Search index
# This is the index where your data is stored in Azure Search
index_name = os.getenv("AZURE_SEARCH_INDEX_NAME")

# Set up the Azure Search client with the specified index
# This prepares the client to interact with the Azure Search service
search_client = SearchClient(service_endpoint, index_name, credential=credential)

search_query = "What is the prior authorization policy for Inflammatory Conditions?"
search_vector = aoai_client.generate_embedding(search_query)

In [50]:
# Hybrid retrieval + rerank
r = search_client.search(
    search_text=search_query,
    top=5,
    vector_queries=[
        VectorizedQuery(vector=embedding.data[0].embedding, k_nearest_neighbors=50, fields="vector", weight=0.5),
    ],
    query_type=QueryType.SEMANTIC,
    semantic_configuration_name="policy-index-semantic-config",
    query_language="en-us",
    query_caption=QueryCaptionType.EXTRACTIVE,
    query_answer=QueryAnswerType.EXTRACTIVE,
)

# Iterate through the search results and print all metadata
for doc in r:
    print("Document Metadata:")
    print(f"Parent ID: {doc.get('parent_id', 'N/A')}")
    print(f"Parent Path: {doc.get('parent_path', 'N/A')}")
    print(f"Policy Name: {doc.get('policy_name', 'N/A')}")
    print(f"Payer Name: {doc.get('payer_name', 'N/A')}")
    print(f"Drug Names: {doc.get('drug_names', 'N/A')}")
    print(f"Medical Specialties: {doc.get('medical_specialties', 'N/A')}")
    print(f"Covered Diseases: {doc.get('covered_diseases', 'N/A')}")
    print(f"Covered Diseases ICD Codes: {doc.get('covered_diseases_icd_codes', 'N/A')}")
    print(f"Covered Drug Codes: {doc.get('covered_drug_codes', 'N/A')}")
    print(f"Chunk ID: {doc.get('chunk_id', 'N/A')}")
    print(f"Search Score: {doc.get('@search.score', 'N/A')}")
    print(f"Reranker Score: {doc.get('@search.reranker_score', 'N/A')}")
    print("-" * 80)  # Separator for readability

Document Metadata:
Parent ID: 001
Parent Path: https://storageaeastusfactory.blob.core.windows.net/pre-auth-policies/policies_ocr/001.pdf
Policy Name: Inflammatory Conditions - Adalimumab Products Prior Authorization Policy
Payer Name: Cigna
Drug Names: ['Abrilada (adalimumab-afzb)', 'adalimumab-aacf', 'adalimumab-adaz', 'adalimumab-adbm', 'adalimumab-fkjp', 'adalimumab-ryvk', 'Humira (adalimumab)', 'Amjevita (adalimumab-atto)', 'Cyltezo (adalimumab-adbm)', 'Hadlima (adalimumab-bwwd)', 'Hulio (adalimumab-fkjp)', 'Hyrimoz (adalimumab-adaz)', 'Idacio (adalimumab-aacf)', 'Simlandi (adalimumab-ryvk)', 'Yuflyma (adalimumab-aaty)', 'Yusimry (adalimumab-aqvh)']
Medical Specialties: ['Rheumatology', 'Gastroenterology', 'Dermatology', 'Ophthalmology']
Covered Diseases: ['Ankylosing spondylitis', "Crohn's disease", 'Hidradenitis suppurativa', 'Juvenile idiopathic arthritis', 'Plaque psoriasis', 'Psoriatic arthritis', 'Rheumatoid arthritis', 'Ulcerative colitis', 'Uveitis', "Behcet's disease", 'P

## **Filtering by Payer Name and ICD-10 Code**

The filter expression in the search query is used to narrow down the results based on specific conditions:

1. **`payer_name eq 'Cigna'`**  
   - This condition filters documents where the `payer_name` field is exactly `'Cigna'`.  
   - It ensures that only documents related to the payer `'Cigna'` are included in the search results.

2. **`covered_diseases_icd_codes/any(c: c eq 'M45.6')`**  
   - This condition filters documents where the `covered_diseases_icd_codes` collection contains the value `'M45.6'`.  
   - The `any()` function is used to check if any element in the `covered_diseases_icd_codes` array matches the specified value.  
   - In this case, it ensures that only documents related to the ICD code `'M45.6'` (e.g., Ankylosing Spondylitis) are included.

3. **Combining Conditions with `and`**  
   - The `and` operator combines the two conditions, so the filter only returns documents where **both conditions are true**.  
   - This means the results will include documents where the payer is `'Cigna'` **and** the ICD code `'M45.6'` is present in the `covered_diseases_icd_codes` field.


In [55]:
# Perform the search query with the updated filter
# Hybrid retrieval + rerank
r = search_client.search(
    search_text=search_query,
    top=5,
    vector_queries=[
        VectorizedQuery(vector=embedding.data[0].embedding, k_nearest_neighbors=50, fields="vector", weight=0.5),
    ],
    query_type=QueryType.SEMANTIC,
    semantic_configuration_name="policy-index-semantic-config",
    query_language="en-us",
    query_caption=QueryCaptionType.EXTRACTIVE,
    query_answer=QueryAnswerType.EXTRACTIVE,
    filter="payer_name eq 'Cigna' and covered_diseases_icd_codes/any(c: c eq 'M45.6')",  # Filter by payer name and ICD code
)

# Iterate through the search results and print all relevant metadata
for doc in r:
    content = doc.get("content", "").replace("\n", " ")[:1000]  # Limit content to 1000 characters for readability
    print(
        f"ID: {doc.get('parent_id', 'N/A')}, "
        f"Policy Name: {doc.get('policy_name', 'N/A')}, "
        f"Payer Name: {doc.get('payer_name', 'N/A')}, "
        f"Drug Names: {doc.get('drug_names', 'N/A')}, "
        f"Medical Specialties: {doc.get('medical_specialties', 'N/A')}, "
        f"Covered Diseases: {doc.get('covered_diseases', 'N/A')}, "
        f"Score: {doc.get('@search.score', 'N/A')}, "
        f"Reranker Score: {doc.get('@search.reranker_score', 'N/A')}. "
        f"Content: {content}"
    )

ID: 001, Policy Name: Inflammatory Conditions - Adalimumab Products Prior Authorization Policy, Payer Name: Cigna, Drug Names: ['Abrilada (adalimumab-afzb)', 'adalimumab-aacf', 'adalimumab-adaz', 'adalimumab-adbm', 'adalimumab-fkjp', 'adalimumab-ryvk', 'Humira (adalimumab)', 'Amjevita (adalimumab-atto)', 'Cyltezo (adalimumab-adbm)', 'Hadlima (adalimumab-bwwd)', 'Hulio (adalimumab-fkjp)', 'Hyrimoz (adalimumab-adaz)', 'Idacio (adalimumab-aacf)', 'Simlandi (adalimumab-ryvk)', 'Yuflyma (adalimumab-aaty)', 'Yusimry (adalimumab-aqvh)'], Medical Specialties: ['Rheumatology', 'Gastroenterology', 'Dermatology', 'Ophthalmology'], Covered Diseases: ['Ankylosing spondylitis', "Crohn's disease", 'Hidradenitis suppurativa', 'Juvenile idiopathic arthritis', 'Plaque psoriasis', 'Psoriatic arthritis', 'Rheumatoid arthritis', 'Ulcerative colitis', 'Uveitis', "Behcet's disease", 'Pyoderma gangrenosum', 'Sarcoidosis', 'Scleritis', 'Axial spondyloarthritis'], Score: 0.02182539738714695, Reranker Score: 1.8