# A Different Indexing Strategy
The idea here is to use a strategy that takes advantage of the most common questions that would benefit from a semantic similarity search. For example, vectorize only the vendor name. A user can then ask questions about a vendor and, even if they misspell the vendor name, you can still find that vendor's contract along with any amendments associated with it.
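
To make the misspelling tolerance concrete, here is a minimal sketch. It does not call an embedding model: character trigram counts are used as a toy stand-in for the 1536-dimension vendor name vector, and cosine similarity (the metric the index uses) picks the closest vendor. The function names and the vendor list, including "Northwind Traders", are illustrative assumptions.

```python
import math
from collections import Counter


def trigram_vector(text: str) -> Counter:
    """Toy stand-in for an embedding: counts of character trigrams."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def closest_vendor(query: str, vendors: list[str]) -> str:
    """Return the vendor whose vector is most similar to the query's."""
    query_vec = trigram_vector(query)
    return max(vendors, key=lambda v: cosine_similarity(query_vec, trigram_vector(v)))


# Illustrative vendor names only; a real index would hold one vector per contract.
vendors = ["Fabrikam Services", "Contoso Elite", "Northwind Traders"]
match = closest_vendor("Fabrikm Servces", vendors)  # misspelled on purpose
print(match)
```

A real embedding model captures more than spelling overlap, but the retrieval step has the same shape: embed the query, then take the nearest vendor vector.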

## Index Structure

```
   {
      "id": "1fb91887f558ee99f577956e7f6701df",
      "contractId": "5004432",
      "vendorName": "Fabrikam Services",
      "clientName": "Contoso Elite",
      "contractTitle": "Vendor Contractor Agreement",
      "effectiveDate": "2024-12-14T00:00:00Z",
      "endDate": "2024-02-20T00:00:00Z",
      "signingDate": "2024-12-08T00:00:00Z",
      "status": null,
      "compensation": 20000,
      "terminationTerms": "This Agreement shall commence on the Effective Date and continue for a period of 24 months, ending on January 15, 2026",
      "paymentTerms": "Payment shall be made in monthly installments of $10,416.67, due within 30 days of invoice date",
      "currency": "USD",
      "parentContractId": null,
      "amendmentNumber": null,
      "creationDate": null,
      "sourceFileName": "MSA_TechCorp_Global_2024.pdf"
   }

    fields = [
            SimpleField(name="id", type=SearchFieldDataType.String, key=True, filterable=True),
            SearchableField(name="contractId", type=SearchFieldDataType.String, filterable=True),
            SearchableField(name="vendorName", type=SearchFieldDataType.String, filterable=True),
            SearchableField(name="clientName", type=SearchFieldDataType.String, filterable=True),
            SearchableField(name="contractTitle", type=SearchFieldDataType.String, filterable=True),  # Changed to SearchableField
            SimpleField(name="effectiveDate", type=SearchFieldDataType.DateTimeOffset, filterable=True, facetable=True),
            SimpleField(name="endDate", type=SearchFieldDataType.DateTimeOffset, filterable=True, facetable=True),
            SimpleField(name="signingDate", type=SearchFieldDataType.DateTimeOffset, filterable=True),
            SearchableField(name="status", type=SearchFieldDataType.String, filterable=True),
            SimpleField(name="compensation", type=SearchFieldDataType.Double, filterable=True),
            SimpleField(name="terminationTerms", type=SearchFieldDataType.String, filterable=True),
            SearchableField(name="parentContractId", type=SearchFieldDataType.String, filterable=True),
            SearchableField(name="amendmentNumber", type=SearchFieldDataType.String),
            SimpleField(name="creationDate", type=SearchFieldDataType.DateTimeOffset, filterable=True),
            SearchableField(name="sourceFileName", type=SearchFieldDataType.String, filterable=True),
            SearchableField(name="content", type=SearchFieldDataType.String),
            SearchField(
                name="vendorNameVector",
                type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
                vector_search_dimensions=1536,
                vector_search_profile_name="myHnswProfile"
            )
        ]

```

Let's load our endpoints and keys from our .env file and print them out to make sure we are good for the next step.

In [1]:
from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.storage.blob import BlobServiceClient, generate_blob_sas, BlobSasPermissions
from datetime import datetime, timedelta
import os
from dotenv import load_dotenv
from pathlib import Path

# Get root directory path
root_dir = Path().absolute().parent
env_path = root_dir / '.env'

# Load .env from root
load_dotenv(dotenv_path=env_path)
print(f"Loaded .env from {env_path}")
# Access variables
# Azure Storage settings

storage_account_name = os.getenv("STORAGE_ACCOUNT_NAME")  
storage_account_key = os.getenv("STORAGE_ACCOUNT_KEY")  # Add your storage account key here
container_name = "source"

ai_search_endpoint = os.environ["AZURE_SEARCH_ENDPOINT"]
ai_search_key = os.environ["AZURE_SEARCH_KEY"]
ai_search_admin_key = os.environ["AZURE_SEARCH_ADMIN_KEY"]
ai_search_index = "rdc-contracts-v1"

print(f"storage_account_name: {storage_account_name}")
print(f"storage acct Key: {storage_account_key[:4] + '*' * 5 + storage_account_key[-4:]}")
print(f"container_name: {container_name}")
print(f"ai_search_endpoint: {ai_search_endpoint}")
print(f"ai_search_key: {ai_search_key[:4] + '*' * 5 + ai_search_key[-4:]}")
print(f"ai_search_index: {ai_search_index}")

Loaded .env from c:\Users\rickcau\source\repos\vendor-contracts-gen-ai\.env
storage_account_name: stgclarivatecw
storage acct Key: ayze*****NQ==
container_name: source
ai_search_endpoint: https://rdc-ai-search.search.windows.net
ai_search_key: n8p2*****IUzQ
ai_search_index: rdc-contracts-v1
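
The masking used in the print statements above fails with a `TypeError` if a variable is missing, because `os.getenv` returns `None`. A small helper can tolerate that case. This is only a sketch; `mask_secret` and its defaults are hypothetical names, not part of the notebook.

```python
def mask_secret(value, visible: int = 4, stars: int = 5) -> str:
    """Show only the first and last few characters of a secret."""
    if not value:
        # Missing or empty variable: say so instead of raising on slicing.
        return "<not set>"
    if len(value) <= visible * 2:
        # Too short to reveal both ends without exposing the whole value.
        return "*" * stars
    return value[:visible] + "*" * stars + value[-visible:]


print(f"storage acct Key: {mask_secret(None)}")
print(f"ai_search_key: {mask_secret('abcd1234efgh')}")
```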


Version 3 of the Create Index code. Here we create embeddings only for the VendorName and not for the content. Because we are likely to have many different document types, a single chunking strategy will likely not work for all of them, or its level of accuracy will likely be lower.

As long as all questions are vendor related, this approach *should* work very well. Fine-tuning a chunking strategy for even one document type can be tough, as many variables come into play.

Example:

Let's say you use a chunk size of 1000 and you have a document that is broken up into 20 chunks. Suppose a user asks the following question:

   ~~~
       What are the termination terms for our contract with Vendor X?
   ~~~

Let's assume that the termination terms span multiple chunks. When a semantic search runs across those chunks, it finds the chunk that contains "Termination Terms", since that is the closest match. The additional chunks that make up the full details of the termination terms would likely not be included, because they bear little similarity to the words "termination terms".
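
The effect can be reproduced with a toy example. The sketch below chunks a made-up clause by word count and scores each chunk by word overlap with the question. Word overlap is a crude stand-in for semantic similarity, and the contract text, chunk size, and function names are all assumptions chosen for illustration.

```python
def chunk_words(text: str, size: int) -> list[str]:
    """Split text into consecutive chunks of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def normalize(text: str) -> set[str]:
    """Lowercase words with surrounding punctuation stripped."""
    return {w.strip(".,:;?").lower() for w in text.split()}


def overlap_score(chunk: str, query: str) -> int:
    """Number of distinct words shared by the chunk and the query."""
    return len(normalize(chunk) & normalize(query))


# Made-up sample clause: the heading lands in chunk 0, the details in chunk 1.
contract_text = (
    "Termination Terms: Either party may end this Agreement by giving written notice. "
    "Notice must arrive sixty days ahead and unpaid invoices remain due afterwards."
)
question = "What are the termination terms for our contract with Vendor X?"

chunks = chunk_words(contract_text, size=12)
scores = [overlap_score(chunk, question) for chunk in chunks]
best_chunk = chunks[scores.index(max(scores))]

print(scores)      # per-chunk overlap with the question
print(best_chunk)  # the only chunk a top-1 retrieval would return
```

The top-scoring chunk carries the heading, while the notice period sits in the chunk that scored lowest, so a top-1 retrieval would hand the LLM an incomplete answer.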

So, if we instead first perform a semantic similarity search using the vendor name, we can find the closest match for the vendor and extract the field details and the full text of the contract. Next, a second step performs a search for any amendments using the ContractID where ParentID is not NULL AND AmendmentNumber is not NULL, extracts their content, injects it into the prompt, and lets the LLM respond to the question.
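
The second step can be sketched without any SDK calls. The helpers below are hypothetical: they build an OData filter string of the kind accepted by the `filter` argument of `SearchClient.search`, and then assemble the prompt. The sketch assumes amendments carry the same `contractId` as the original contract, which is one reading of the description above; if they link through `parentContractId` instead, the filter would compare that field. Because `amendmentNumber` is not declared filterable in this index, it is checked client-side here.

```python
def odata_quote(value: str) -> str:
    """Quote a string literal for an OData filter (single quotes are doubled)."""
    return "'" + value.replace("'", "''") + "'"


def amendment_filter(contract_id: str) -> str:
    """Filter for documents of this contract that have a parent contract."""
    return f"contractId eq {odata_quote(contract_id)} and parentContractId ne null"


def is_amendment(doc: dict) -> bool:
    """Client-side check, since amendmentNumber is not filterable in this index."""
    return doc.get("parentContractId") is not None and doc.get("amendmentNumber") is not None


def build_prompt(question: str, contract: dict, amendments: list[dict]) -> str:
    """Concatenate the contract, its amendments, and the question for the LLM."""
    parts = [f"Contract {contract['contractId']} ({contract['vendorName']}):",
             contract.get("content", "")]
    for doc in sorted(amendments, key=lambda d: str(d["amendmentNumber"])):
        parts.append(f"Amendment {doc['amendmentNumber']}:")
        parts.append(doc.get("content", ""))
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)


# Made-up search results standing in for what the two searches would return.
contract = {"id": "c1", "contractId": "5004432", "vendorName": "Fabrikam Services",
            "parentContractId": None, "amendmentNumber": None, "content": "Original text."}
results = [contract,
           {"id": "a1", "contractId": "5004432", "parentContractId": "5004432",
            "amendmentNumber": "1", "content": "Amended text."}]

amendments = [doc for doc in results if is_amendment(doc)]
prompt = build_prompt("What are the termination terms?", contract, amendments)
print(amendment_filter("5004432"))
print(prompt)
```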

In [2]:
import logging
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
# from azure.search.documents.indexes.models import (
#     SimpleField,
#     SearchFieldDataType,
#     SearchableField,
#     SemanticConfiguration, 
#     SemanticField, 
#     SearchField,
#     VectorSearch,
#     HnswAlgorithmConfiguration,
#     VectorSearchProfile,
#     SearchIndex,
#     ScoringProfile,
#     TextWeights,
#     FreshnessScoringFunction,
#     MagnitudeScoringFunction,
#     ScoringFunctionAggregation,
#     FreshnessScoringParameters,
#     MagnitudeScoringParameters,
#     ScoringFunctionInterpolation
# )

from azure.search.documents.indexes.models import (
    SimpleField,
    SearchFieldDataType,
    SearchableField,
    SearchField,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchProfile,
    SemanticConfiguration,
    SemanticPrioritizedFields,
    SemanticField,
    SemanticSearch,
    SearchIndex
)

from azure.core.exceptions import ServiceRequestError, ResourceExistsError, ResourceNotFoundError

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def create_search_index(
    endpoint: str,
    admin_key: str,
    index_name: str
) -> SearchIndex | None:
    """
    Creates an Azure Cognitive Search index with the specified fields, vector search
    capability, and a semantic configuration. An existing index is left unchanged.
    
    Args:
        endpoint (str): Azure Cognitive Search endpoint URL
        admin_key (str): Admin API key for Azure Cognitive Search
        index_name (str): Name of the index to create

    Returns:
        SearchIndex | None: The created index, or None if it already existed.
    """
    # Input validation
    if not all([endpoint, admin_key, index_name]):
        raise ValueError("Endpoint, admin key, and index name are required")

    # Initialize the search index client
    try:
        logger.info(f"Initializing SearchIndexClient with endpoint: {endpoint}")
        search_index_client = SearchIndexClient(
            endpoint=endpoint,
            credential=AzureKeyCredential(admin_key)
        )
    except Exception as e:
        logger.error(f"Failed to initialize SearchIndexClient: {str(e)}")
        raise

    # Check if index exists
    try:
        search_index_client.get_index(index_name)
        logger.info(f"Index '{index_name}' already exists")
        return
    except ResourceNotFoundError:
        logger.info(f"Creating new index '{index_name}'...")
    except Exception as e:
        logger.error(f"Error checking index existence: {str(e)}")
        raise

    try:
        # Define the fields for the index
        fields = [
            SimpleField(name="id", type=SearchFieldDataType.String, key=True, filterable=True),
            SearchableField(name="contractId", type=SearchFieldDataType.String, filterable=True),
            SearchableField(name="vendorName", type=SearchFieldDataType.String, filterable=True),
            SearchableField(name="clientName", type=SearchFieldDataType.String, filterable=True),
            SearchableField(name="contractTitle", type=SearchFieldDataType.String, filterable=True),  # Changed to SearchableField
            SimpleField(name="effectiveDate", type=SearchFieldDataType.DateTimeOffset, filterable=True, facetable=True),
            SimpleField(name="endDate", type=SearchFieldDataType.DateTimeOffset, filterable=True, facetable=True),
            SimpleField(name="signingDate", type=SearchFieldDataType.DateTimeOffset, filterable=True),
            SearchableField(name="status", type=SearchFieldDataType.String, filterable=True),
            SimpleField(name="compensation", type=SearchFieldDataType.Double, filterable=True),
            SimpleField(name="terminationTerms", type=SearchFieldDataType.String, filterable=True),
            SearchableField(name="parentContractId", type=SearchFieldDataType.String, filterable=True),
            SearchableField(name="amendmentNumber", type=SearchFieldDataType.String),
            SimpleField(name="creationDate", type=SearchFieldDataType.DateTimeOffset, filterable=True),
            SearchableField(name="sourceFileName", type=SearchFieldDataType.String, filterable=True),
            SearchableField(name="content", type=SearchFieldDataType.String),
            SearchField(
                name="vendorNameVector",
                type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
                vector_search_dimensions=1536,
                vector_search_profile_name="myHnswProfile"
            )
        ]
       
        # Define vector search settings
        vector_search = VectorSearch(
            algorithms=[
                HnswAlgorithmConfiguration(
                    name="myHnsw",
                    parameters={
                        "m": 4,
                        "efConstruction": 400,
                        "efSearch": 500,
                        "metric": "cosine"
                    }
                )
            ],
            profiles=[
                VectorSearchProfile(
                    name="myHnswProfile",
                    algorithm_configuration_name="myHnsw"
                )
            ]
        )
        
        # Define semantic settings
        semantic_config = SemanticConfiguration(
            name="default",
            prioritized_fields=SemanticPrioritizedFields(
                title_field=SemanticField(field_name="vendorName"),
                keywords_fields=[SemanticField(field_name="contractTitle"),
                                SemanticField(field_name="clientName")],
                content_fields=[SemanticField(field_name="content")]
            )
        )
        
        # Create semantic search configuration
        semantic_search = SemanticSearch(
            configurations=[semantic_config]
        )

        # Create the index with the defined fields and vector search settings
        index = SearchIndex(
            name=index_name,
            fields=fields,
            vector_search=vector_search,
            semantic_search=semantic_search
        )
        
        logger.info("Attempting to create or update index...")
        result = search_index_client.create_or_update_index(index)
        logger.info(f"Successfully created/updated index '{index_name}'")
        return result

    except ServiceRequestError as e:
        logger.error(f"Service request error while creating index: {str(e)}")
        raise
    except Exception as e:
        logger.error(f"Unexpected error while creating index: {str(e)}")
        raise

if __name__ == "__main__":
    try:
        create_search_index(
            endpoint=ai_search_endpoint,
            admin_key=ai_search_admin_key,
            index_name=ai_search_index
        )
    except Exception as e:
        logger.error(f"Failed to create index: {str(e)}")

INFO:__main__:Initializing SearchIndexClient with endpoint: https://rdc-ai-search.search.windows.net
INFO:azure.core.pipeline.policies.http_logging_policy:Request URL: 'https://rdc-ai-search.search.windows.net/indexes('rdc-contracts-v1')?api-version=REDACTED'
Request method: 'GET'
Request headers:
    'api-key': 'REDACTED'
    'Accept': 'application/json;odata.metadata=minimal'
    'x-ms-client-request-id': '46049f75-b83f-11ef-88ec-1091d1f8d990'
    'User-Agent': 'azsdk-python-search-documents/11.5.2 Python/3.12.8 (Windows-11-10.0.26100-SP0)'
No body was attached to the request
INFO:azure.core.pipeline.policies.http_logging_policy:Response status: 404
Response headers:
    'Cache-Control': 'no-cache,no-store'
    'Pragma': 'no-cache'
    'Content-Length': '117'
    'Content-Type': 'application/json; charset=utf-8'
    'Content-Language': 'REDACTED'
    'Expires': '-1'
    'Server': 'Microsoft-IIS/10.0'
    'request-id': '46049f75-b83f-11ef-88ec-1091d1f8d990'
    'elapsed-time': 'REDACT