# Azure AI Search Simulator - Indexer Test

This notebook demonstrates how to use **indexers** with the Azure AI Search Simulator to automatically ingest and index documents.

## What This Notebook Tests

1. **Data Source Creation** - Configure a local file system data source
2. **Index Schema** - Create an index with various field types
3. **Indexer Execution** - Run an indexer to process JSON metadata and TXT content files
4. **Verification** - Confirm all 5 documents are indexed and searchable

## Prerequisites

1. **Start the Azure AI Search Simulator with HTTPS**:
   ```bash
   cd src/AzureAISearchSimulator.Api
   dotnet run --urls "https://localhost:7250"
   ```

2. **Sample data files** should be in the `./data` folder (already provided)

> ⚠️ **Note**: The Azure SDK requires HTTPS. The simulator must run on `https://localhost:7250`.

## 1. Import Required Libraries

In [None]:
# Install required packages (uncomment if needed)
# !pip install azure-search-documents requests pandas

import os
import json
import time
import urllib3
from pathlib import Path

# Azure AI Search SDK imports
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.indexes import SearchIndexClient, SearchIndexerClient
from azure.search.documents.indexes.models import (
    SearchIndex,
    SearchField,
    SearchFieldDataType,
    SimpleField,
    SearchableField,
    SearchIndexer,
    SearchIndexerDataContainer,
    SearchIndexerDataSourceConnection,
    FieldMapping,
    IndexingParameters,
    IndexingParametersConfiguration,
)

# For displaying results
import pandas as pd
from IPython.display import display, HTML

# Suppress SSL warnings for local development
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

print("✅ Libraries imported successfully!")






✅ Libraries imported successfully!


## 2. Initialize Azure AI Search Clients

Configure connection to the local Azure AI Search Simulator.

In [2]:
# Configuration for Azure AI Search Simulator
SEARCH_ENDPOINT = "https://localhost:7250"
ADMIN_API_KEY = "admin-key-12345"

# Resource names for this test
INDEX_NAME = "indexer-test-docs"
DATA_SOURCE_NAME = "local-test-files"
INDEXER_NAME = "test-indexer"

# Path to sample data (relative to notebook location)
DATA_PATH = Path("./data").resolve()

# Create credentials
admin_credential = AzureKeyCredential(ADMIN_API_KEY)

# Configure HTTP transport to skip SSL certificate validation for local development
import requests as req_lib
from azure.core.pipeline.transport import RequestsTransport

session = req_lib.Session()
session.verify = False
transport = RequestsTransport(session=session, connection_verify=False)

# Create clients
index_client = SearchIndexClient(
    endpoint=SEARCH_ENDPOINT,
    credential=admin_credential,
    transport=transport,
    connection_verify=False
)

indexer_client = SearchIndexerClient(
    endpoint=SEARCH_ENDPOINT,
    credential=admin_credential,
    transport=transport,
    connection_verify=False
)

print(f"✅ Connected to Azure AI Search Simulator at {SEARCH_ENDPOINT}")
print(f"📁 Data path: {DATA_PATH}")

# List sample data files
json_files = list(DATA_PATH.glob("*.json"))
txt_files = list(DATA_PATH.glob("*.txt"))
print(f"📄 Found {len(json_files)} JSON metadata files")
print(f"📄 Found {len(txt_files)} TXT content files")

✅ Connected to Azure AI Search Simulator at https://localhost:7250
📁 Data path: C:\Projets\AzureAISimulator\samples\IndexerTestNotebook\data
📄 Found 5 JSON metadata files
📄 Found 5 TXT content files


## 3. Review Sample Data

Let's look at the sample documents we'll be indexing.

In [3]:
# Load and display sample data
sample_docs = []

for json_file in sorted(DATA_PATH.glob("*.json")):
    with open(json_file, 'r', encoding='utf-8') as f:
        metadata = json.load(f)
    
    # Read associated content file
    content_file = DATA_PATH / metadata.get('contentFile', '')
    content = ""
    if content_file.exists():
        with open(content_file, 'r', encoding='utf-8') as f:
            content = f.read()[:200] + "..."  # First 200 chars
    
    sample_docs.append({
        'id': metadata['id'],
        'title': metadata['title'],
        'author': metadata['author'],
        'category': metadata['category'],
        'tags': ', '.join(metadata.get('tags', [])),
        'content_preview': content
    })

# Display as DataFrame
df = pd.DataFrame(sample_docs)
print(f"📚 Sample Documents to Index ({len(sample_docs)} total):\n")
display(df[['id', 'title', 'author', 'category', 'tags']])

📚 Sample Documents to Index (5 total):



|   | id | title | author | category | tags |
|---|----|-------|--------|----------|------|
| 0 | doc-001 | Introduction to Azure AI Search | Azure Documentation Team | Documentation | azure, search, ai, introduction |
| 1 | doc-002 | Creating Search Indexes | Search Engineering Team | Tutorial | indexes, schema, fields, configuration |
| 2 | doc-003 | Understanding Indexers and Data Sources | Data Integration Team | Tutorial | indexers, data-sources, blob-storage, automation |
| 3 | doc-004 | Search Query Syntax Guide | Query Processing Team | Reference | queries, lucene, odata, filters |
| 4 | doc-005 | Security and Access Control | Security Team | Security | security, api-keys, rbac, authentication |
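
The review cell above silently shows an empty preview when a metadata file's `contentFile` is missing on disk. The following is a minimal sketch of a pre-flight check that lists such metadata files. The helper name `find_missing_content_files` and the demo file names are hypothetical; only the `contentFile` key comes from the cell above.

```python
import json
import tempfile
from pathlib import Path


def find_missing_content_files(data_path: Path) -> list[str]:
    """Return names of JSON metadata files whose 'contentFile' is absent on disk."""
    missing = []
    for json_file in sorted(data_path.glob("*.json")):
        metadata = json.loads(json_file.read_text(encoding="utf-8"))
        content_name = metadata.get("contentFile", "")
        # An empty name and a non-existent file both count as missing.
        if not content_name or not (data_path / content_name).is_file():
            missing.append(json_file.name)
    return missing


# Self-contained demo on a temporary folder (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp)
    (demo / "a.json").write_text(json.dumps({"id": "a", "contentFile": "a.txt"}), encoding="utf-8")
    (demo / "a.txt").write_text("hello", encoding="utf-8")
    (demo / "b.json").write_text(json.dumps({"id": "b", "contentFile": "b.txt"}), encoding="utf-8")
    demo_missing = find_missing_content_files(demo)

print(demo_missing)  # ['b.json']
```

In the notebook, the same helper could be called as `find_missing_content_files(DATA_PATH)` before the data source is created.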


## 4. Create Search Index

Define the index schema with fields matching our document structure.

In [4]:
# Define the index schema
index = SearchIndex(
    name=INDEX_NAME,
    fields=[
        # Key field (required)
        SimpleField(name="id", type=SearchFieldDataType.String, key=True),
        
        # Searchable text fields
        SearchableField(name="title", type=SearchFieldDataType.String, 
                       sortable=True, filterable=True),
        SearchableField(name="author", type=SearchFieldDataType.String,
                       filterable=True, facetable=True),
        SearchableField(name="content", type=SearchFieldDataType.String),
        
        # Filterable/Facetable fields
        SimpleField(name="category", type=SearchFieldDataType.String,
                   filterable=True, facetable=True, sortable=True),
        SimpleField(name="language", type=SearchFieldDataType.String,
                   filterable=True, facetable=True),
        
        # Date field
        SimpleField(name="createdDate", type=SearchFieldDataType.DateTimeOffset,
                   filterable=True, sortable=True),
        
        # Collection field for tags
        SearchField(name="tags", type=SearchFieldDataType.Collection(SearchFieldDataType.String),
                   searchable=True, filterable=True, facetable=True),
    ]
)

# Create or update the index
try:
    result = index_client.create_or_update_index(index)
    print(f"✅ Index '{result.name}' created/updated successfully!")
    print(f"   Fields: {len(result.fields)}")
    for field in result.fields:
        print(f"   - {field.name}: {field.type} (key={field.key}, searchable={field.searchable})")
except Exception as e:
    print(f"❌ Error creating index: {e}")

✅ Index 'indexer-test-docs' created/updated successfully!
   Fields: 8
   - id: Edm.String (key=True, searchable=False)
   - title: Edm.String (key=False, searchable=True)
   - author: Edm.String (key=False, searchable=True)
   - content: Edm.String (key=False, searchable=True)
   - category: Edm.String (key=False, searchable=False)
   - language: Edm.String (key=False, searchable=False)
   - createdDate: Edm.DateTimeOffset (key=False, searchable=False)
   - tags: Collection(Edm.String) (key=False, searchable=True)
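
The `createdDate` field is typed `Edm.DateTimeOffset`, which expects ISO 8601 timestamps that carry a time zone, such as `2024-01-15T10:30:00Z`. The sketch below shows one way to produce such a value from a Python `datetime`. The helper name and the sample date are hypothetical, and treating naive datetimes as UTC is an assumption made here, not something the notebook specifies.

```python
from datetime import datetime, timezone


def to_edm_datetimeoffset(value: datetime) -> str:
    """Format a datetime as an ISO 8601 UTC string, e.g. 2024-01-15T10:30:00Z."""
    if value.tzinfo is None:
        # Assumption: naive datetimes are interpreted as UTC.
        value = value.replace(tzinfo=timezone.utc)
    return value.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# Hypothetical sample date, not taken from the notebook's data files.
example = to_edm_datetimeoffset(datetime(2024, 1, 15, 10, 30, 0))
print(example)  # 2024-01-15T10:30:00Z
```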


## 5. Create Data Source Connection

Configure a data source pointing to our local file system with JSON documents.

In [None]:
# Create a data source connection pointing to local files
# The simulator supports "filesystem" type for local development
# container.name is combined with connection_string as a subfolder
# Use "." for root (no subfolder) since our files are directly in DATA_PATH

data_source = SearchIndexerDataSourceConnection(
    name=DATA_SOURCE_NAME,
    type="filesystem",  # Simulator-specific type for local files
    connection_string=str(DATA_PATH),
    container=SearchIndexerDataContainer(name=".", query="*.json")  # "." means root, query filters to *.json files
)

try:
    result = indexer_client.create_or_update_data_source_connection(data_source)
    print(f"✅ Data source '{result.name}' created/updated successfully!")
    print(f"   Type: {result.type}")
    print(f"   Path: {result.connection_string}")
    print(f"   Container: {result.container.name if result.container else 'N/A'}")
    print(f"   Query: {result.container.query if result.container else 'N/A'}")
except Exception as e:
    print(f"❌ Error creating data source: {e}")

✅ Data source 'local-test-files' created/updated successfully!
   Type: filesystem
   Path: C:\Projets\AzureAISimulator\samples\IndexerTestNotebook\data
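
The comments in the cell above say that `container.name` is combined with the connection string as a subfolder and that `query` filters the file names. The simulator's own resolution logic is not shown in this notebook, so the following is only a sketch of that idea, assuming plain path joining and glob matching. The function name and demo file names are hypothetical.

```python
import tempfile
from pathlib import Path


def resolve_source_files(connection_string: str, container_name: str, query: str) -> list[Path]:
    """Join root folder and container name, then match files against the query pattern."""
    root = Path(connection_string) / container_name  # "." leaves the root folder unchanged
    return sorted(p for p in root.glob(query) if p.is_file())


# Demo on a temporary folder with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    for name in ("doc-001.json", "doc-001.txt", "doc-002.json"):
        (base / name).write_text("{}", encoding="utf-8")
    matched = [p.name for p in resolve_source_files(str(base), ".", "*.json")]

print(matched)  # ['doc-001.json', 'doc-002.json']
```

Under this assumption, a container name of `"."` together with the query `*.json` selects only the JSON metadata files directly inside the data folder, which matches the intent described in the cell's comments.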


## 6. Create and Run Indexer

Create an indexer that processes the JSON files and maps fields to the index.

In [6]:
# Create an indexer with JSON parsing configuration
indexer = SearchIndexer(
    name=INDEXER_NAME,
    data_source_name=DATA_SOURCE_NAME,
    target_index_name=INDEX_NAME,
    parameters=IndexingParameters(
        configuration=IndexingParametersConfiguration(
            parsing_mode="json"  # Parse JSON documents
        )
    ),
    # Field mappings from JSON to index fields
    field_mappings=[
        FieldMapping(source_field_name="id", target_field_name="id"),
        FieldMapping(source_field_name="title", target_field_name="title"),
        FieldMapping(source_field_name="author", target_field_name="author"),
        FieldMapping(source_field_name="category", target_field_name="category"),
        FieldMapping(source_field_name="tags", target_field_name="tags"),
        FieldMapping(source_field_name="createdDate", target_field_name="createdDate"),
        FieldMapping(source_field_name="language", target_field_name="language"),
    ]
)

try:
    result = indexer_client.create_or_update_indexer(indexer)
    print(f"✅ Indexer '{result.name}' created/updated successfully!")
    print(f"   Data Source: {result.data_source_name}")
    print(f"   Target Index: {result.target_index_name}")
except Exception as e:
    print(f"❌ Error creating indexer: {e}")

✅ Indexer 'test-indexer' created/updated successfully!
   Data Source: local-test-files
   Target Index: indexer-test-docs
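
Conceptually, each field mapping copies one value from the parsed JSON document to a named index field. The sketch below illustrates that idea with plain dictionaries. It is not the simulator's implementation, and the helper name and the `contentFile` value are hypothetical.

```python
def apply_field_mappings(source_doc: dict, mappings: list[tuple[str, str]]) -> dict:
    """Copy each mapped source field to its target name, skipping absent fields."""
    target = {}
    for source_name, target_name in mappings:
        if source_name in source_doc:
            target[target_name] = source_doc[source_name]
    return target


# The same identity mappings the indexer above declares.
mappings = [(name, name) for name in
            ("id", "title", "author", "category", "tags", "createdDate", "language")]

# Hypothetical parsed metadata; "contentFile" has no mapping and is dropped.
source_doc = {
    "id": "doc-001",
    "title": "Introduction to Azure AI Search",
    "contentFile": "doc-001.txt",
}

mapped = apply_field_mappings(source_doc, mappings)
print(mapped)  # {'id': 'doc-001', 'title': 'Introduction to Azure AI Search'}
```

In Azure AI Search itself, source and target fields with identical names are matched implicitly, so explicit mappings are mainly needed when names differ. Whether the simulator requires the identity mappings is not stated in this notebook.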


In [7]:
# Run the indexer
print("🚀 Running indexer...")
try:
    indexer_client.run_indexer(INDEXER_NAME)
    print("✅ Indexer run triggered!")
except Exception as e:
    print(f"❌ Error running indexer: {e}")

# Wait for indexer to complete
print("\n⏳ Waiting for indexer to complete...")
max_wait = 30  # seconds
wait_interval = 2

for i in range(0, max_wait, wait_interval):
    time.sleep(wait_interval)
    try:
        status = indexer_client.get_indexer_status(INDEXER_NAME)
        last_result = status.last_result
        
        if last_result:
            print(f"   Status: {last_result.status}")
            if last_result.status in ["success", "transientFailure", "reset"]:
                break
    except Exception as e:
        print(f"   Checking status... ({e})")

# Get final status
status = indexer_client.get_indexer_status(INDEXER_NAME)
if status.last_result:
    result = status.last_result
    print(f"\n📊 Indexer Execution Results:")
    print(f"   Status: {result.status}")
    print(f"   Items Processed: {result.item_count}")
    print(f"   Items Failed: {result.failed_item_count}")
    if result.errors:
        print(f"   Errors:")
        for error in result.errors:
            print(f"      - {error.error_message}")

🚀 Running indexer...
✅ Indexer run triggered!

⏳ Waiting for indexer to complete...
   Status: success

📊 Indexer Execution Results:
   Status: success
   Items Processed: 0
   Items Failed: 0
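
The wait loop in the cell above can be factored into a small reusable helper. The sketch below polls a caller-supplied function until it reports one of the terminal statuses used above, or until a timeout elapses. The helper name and its parameters are hypothetical, and the demo uses a fake status source so that it runs without a live simulator.

```python
import time
from typing import Callable, Optional

# The same terminal statuses the notebook's loop checks for.
TERMINAL_STATUSES = ("success", "transientFailure", "reset")


def wait_for_terminal_status(
    get_status: Callable[[], Optional[str]],
    timeout: float = 30.0,
    interval: float = 2.0,
) -> Optional[str]:
    """Poll get_status until it returns a terminal status; return None on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            return None  # timed out without reaching a terminal status
        time.sleep(interval)


# Demo with a fake status sequence instead of a live indexer.
fake_statuses = iter([None, "inProgress", "success"])
final_status = wait_for_terminal_status(lambda: next(fake_statuses), interval=0.0)
print(final_status)  # success
```

With the notebook's client, `get_status` would wrap `indexer_client.get_indexer_status(INDEXER_NAME)` and return the status of `last_result` when one exists. Returning `None` on timeout lets the caller distinguish a run that never finished from one that failed.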


## 7. Verify Indexed Documents

Check that all documents were indexed correctly by querying the index.

In [None]:
# Create search client to query the index
search_client = SearchClient(
    endpoint=SEARCH_ENDPOINT,
    index_name=INDEX_NAME,
    credential=admin_credential,
    transport=transport,
    connection_verify=False
)

# Get document count
results = search_client.search(search_text="*", include_total_count=True)
results_list = list(results)

print(f"📊 Document Count Verification:")
print(f"   Expected: 5 documents")
print(f"   Actual:   {len(results_list)} documents")

if len(results_list) == 5:
    print("   ✅ All documents indexed successfully!")
else:
    print("   ⚠️  Document count mismatch!")

# Display all indexed documents
print(f"\n📚 Indexed Documents:")
doc_data = []
for doc in results_list:
    doc_data.append({
        'id': doc.get('id'),
        'title': doc.get('title'),
        'author': doc.get('author'),
        'category': doc.get('category'),
        'tags': ', '.join(doc.get('tags', []) or [])
    })

display(pd.DataFrame(doc_data))
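
A count comparison reports that something is wrong but not which documents are affected. As a minimal sketch, the expected IDs `doc-001` through `doc-005` from the sample data can be compared with the IDs returned by the search. The helper name and the partial result in the demo are hypothetical.

```python
def compare_ids(expected_ids, actual_ids):
    """Return (missing, unexpected) document IDs as two sorted lists."""
    expected, actual = set(expected_ids), set(actual_ids)
    return sorted(expected - actual), sorted(actual - expected)


expected_ids = [f"doc-{n:03d}" for n in range(1, 6)]  # doc-001 .. doc-005
actual_ids = ["doc-001", "doc-002", "doc-004"]        # hypothetical partial result

missing, unexpected = compare_ids(expected_ids, actual_ids)
print("Missing:", missing)        # Missing: ['doc-003', 'doc-005']
print("Unexpected:", unexpected)  # Unexpected: []
```

In the notebook, `actual_ids` would be built from the query results, for example `[doc.get('id') for doc in results_list]`.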

## 8. Test Search Functionality

Verify that the indexed content is searchable.

In [None]:
# Test search with different queries
test_queries = [
    ("indexer", "Should find doc about indexers"),
    ("security", "Should find security document"),
    ("Azure", "Should find multiple documents"),
]

print("🔍 Search Tests:\n")
for query, description in test_queries:
    results = search_client.search(search_text=query, top=5)
    results_list = list(results)
    
    print(f"Query: '{query}'")
    print(f"Description: {description}")
    print(f"Results: {len(results_list)} document(s)")
    
    for doc in results_list:
        print(f"   - [{doc.get('id')}] {doc.get('title')}")
    print()

In [None]:
# Test filtering by category
print("🏷️ Filter Tests:\n")

# Filter by category
results = search_client.search(
    search_text="*", 
    filter="category eq 'Tutorial'"
)
tutorial_docs = list(results)
print(f"Category = 'Tutorial': {len(tutorial_docs)} document(s)")
for doc in tutorial_docs:
    print(f"   - {doc.get('title')}")

print()

# Get facet counts for category and author
results = search_client.search(
    search_text="*", 
    facets=["category", "author"]
)
results_list = list(results)

print("📊 Facet Results:")
facets = results.get_facets()
if facets:
    for facet_name, facet_values in facets.items():
        print(f"\n{facet_name}:")
        for fv in facet_values:
            print(f"   - {fv.get('value')}: {fv.get('count')}")

## 9. Cleanup (Optional)

Delete all resources created during this test.

In [None]:
# Set cleanup = True below and run this cell to clean up all resources
# WARNING: This will delete the index, indexer, and data source!

cleanup = False  # Set to True to enable cleanup

if cleanup:
    print("🧹 Cleaning up resources...")
    
    # Delete indexer first
    try:
        indexer_client.delete_indexer(INDEXER_NAME)
        print(f"   ✅ Deleted indexer: {INDEXER_NAME}")
    except Exception as e:
        print(f"   ⚠️ Could not delete indexer: {e}")
    
    # Delete data source
    try:
        indexer_client.delete_data_source_connection(DATA_SOURCE_NAME)
        print(f"   ✅ Deleted data source: {DATA_SOURCE_NAME}")
    except Exception as e:
        print(f"   ⚠️ Could not delete data source: {e}")
    
    # Delete index
    try:
        index_client.delete_index(INDEX_NAME)
        print(f"   ✅ Deleted index: {INDEX_NAME}")
    except Exception as e:
        print(f"   ⚠️ Could not delete index: {e}")
    
    print("\n✅ Cleanup complete!")
else:
    print("ℹ️ Cleanup skipped. Set cleanup = True to delete resources.")

## Summary

This notebook demonstrated:

| Feature | Status |
|---------|--------|
| Index Creation | ✅ Created index with 8 fields |
| Data Source | ✅ Configured local file system data source |
| Indexer | ✅ Created and executed indexer |
| Document Indexing | ✅ Indexed 5 documents |
| Search | ✅ Full-text search working |
| Filtering | ✅ OData filters working |
| Faceting | ✅ Faceted navigation working |

The Azure AI Search Simulator successfully replicates the core indexer functionality of Azure AI Search!