# Notebook 1: The Ingestion Phase

As discussed in the presentation, the ingestion phase loads the data sources that the retrieval system uses. These data sources can be existing databases with structured data; in this notebook, however, we'll focus on unstructured data such as documents.

## Learning Objectives
- Learn how to chunk markdown files into smaller sizes
- Learn how chunk size affects retrieval quality in a RAG application
- Learn how different embedding models provide different results
- Learn how to load chunks and embeddings into an Azure AI Search index used as a vector store

### Install Required Packages

> NOTE: We need to use Semantic Kernel in this notebook in order to work with the embeddings and chunking (those features are not yet in Agent Framework as of the beginning of Jan 2026).

In [None]:
%pip install -U semantic-kernel -q

## Step 1: Chunk files into smaller pieces

### Document Chunking 

The process of taking a document and splitting it into pieces is often referred to as "chunking". There are many ways to split a document and it isn't a *one-size-fits-all* activity, so keep in mind how a document needs to be split in order to provide the most valuable chunks for your retrieval system.

Important things to remember about these chunks:

- We will get embeddings for each chunk
- Relevant chunks will be found by a similarity search using embeddings
- An overlap of 10-20% between adjacent chunks is often used when there is no clean way to split the document
- When working with real documents, you may need to address tables and images (images typically have different embedding models or need to be *verbalized*)
- Each chunk needs to fit in the context window of the LLM, and keep in mind things can get lost in the middle when the context is too big
- You may need to modify your chunking to improve the retrieval quality of your system
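
The overlap idea from the list above can be sketched in plain Python. This is a minimal illustration, not what Semantic Kernel or LangChain do internally: the function name `chunk_with_overlap`, the character-based window, and the default sizes are all assumptions chosen for the example.

```python
def chunk_with_overlap(text: str, chunk_size: int = 200, overlap_ratio: float = 0.15) -> list[str]:
    """Split text into fixed-size character windows that share an overlapping region."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")

    overlap = int(chunk_size * overlap_ratio)  # characters shared by neighbouring chunks
    step = chunk_size - overlap                # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghij" * 5  # 50 characters of toy text
pieces = chunk_with_overlap(sample, chunk_size=20, overlap_ratio=0.2)
for piece in pieces:
    print(len(piece), piece)
```

With a 20% overlap, the last 4 characters of each chunk are repeated at the start of the next one, so a sentence cut at a chunk boundary still appears whole in at least one chunk more often than it would without overlap.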

In [2]:
from semantic_kernel.text import text_chunker
from typing import List

async def chunk_markdown_file(file_path: str, max_token_per_line: int = 256):
    """
    Reads a markdown file and chunks it into smaller pieces using Semantic Kernel's text_chunker.
    
    Args:
        file_path: Path to the markdown file
        max_token_per_line: Maximum number of tokens per line
    
    Returns:
        List of text chunks
    """
    
    # Step 1: Read the markdown file from the file system
    print(f"Reading file: {file_path}")
    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            markdown_content = file.read()
        print(f"Successfully read file. Total characters: {len(markdown_content)}\n")
    except FileNotFoundError:
        print(f"Error: File '{file_path}' not found.")
        return []
    except Exception as e:
        print(f"Error reading file: {e}")
        return []
    
    # Step 2: Use Semantic Kernel's text_chunker to split into smaller pieces
    print(f"Chunking text with max_token_per_line={max_token_per_line}...\n")
        
    # Split the text into chunks
    chunks = text_chunker.split_markdown_lines(
        text=markdown_content, 
        max_token_per_line=max_token_per_line,
    )
    
    # Step 3: Capture all chunks into a list variable
    chunk_list: List[str] = list(chunks)
    
    print(f"Total chunks created: {len(chunk_list)}\n")
    
    # Step 4: Print out the first 3 chunks (or fewer if less than 3 exist)
    chunks_to_display = min(3, len(chunk_list))
    print(f"Displaying first {chunks_to_display} chunks:\n")
    print("=" * 80)
    
    for i in range(chunks_to_display):
        print(f"\n--- Chunk {i + 1} ---")
        print(f"Length: {len(chunk_list[i])} characters")
        print(f"Content:\n{chunk_list[i]}")
        print("-" * 40)
    
    print("=" * 80)
    
   
    
    return chunk_list

Next, use the function above to split the sample markdown file (in the **labs/assets** folder) into chunks.

In [3]:
# Specify the path to your markdown file
markdown_file_path = "../../assets/sample.md"

# Chunk the markdown file
chunks = await chunk_markdown_file(
    file_path=markdown_file_path,
    max_token_per_line=256,  # Adjust chunk size as needed
)

if chunks:
    # Print summary statistics
    avg_chunk_size = sum(len(chunk) for chunk in chunks) / len(chunks)
    print("\nChunking Summary:")
    print(f"  - Total chunks: {len(chunks)}")
    print(f"  - Average chunk size: {avg_chunk_size:.2f} characters")
    print(f"  - Smallest chunk: {len(min(chunks, key=len))} characters")
    print(f"  - Largest chunk: {len(max(chunks, key=len))} characters")

# The chunks list is now available for use in the next step
print("\n✅ Chunks are now stored in the 'chunks' variable for use in the next step.")
print("   You can access individual chunks with chunks[0], chunks[1], etc.")


Reading file: ../../assets/sample.md
Successfully read file. Total characters: 1410

Chunking text with max_token_per_line=256...

Total chunks created: 2

Displaying first 2 chunks:


--- Chunk 1 ---
Length: 656 characters
Content:
# Introduction to Machine Learning

Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed. It focuses on developing computer programs that can access data and use it to learn for themselves.

## Types of Machine Learning

### Supervised Learning

In supervised learning, the algorithm learns from labeled training data. Each training example consists of an input object and a desired output value. The algorithm analyzes the training data and produces an inferred function.

### Unsupervised Learning

Unsupervised learning algorithms work with unlabeled data.
----------------------------------------

--- Chunk 2 ---
Length: 753 characters
Content:
The system tries to 

### Try Using LangChain's MarkdownHeaderTextSplitter (optional)

LangChain is another popular Python package used in RAG applications. It offers more text splitter options than Semantic Kernel does. In the code below you'll explore the [MarkdownHeaderTextSplitter](https://reference.langchain.com/v0.3/python/text_splitters/markdown/langchain_text_splitters.markdown.MarkdownHeaderTextSplitter.html).

First you'll need to install the packages.

In [4]:
%pip install langchain -q
%pip install langchain-core -q
%pip install langchain-text-splitters -q

Note: you may need to restart the kernel to use updated packages.


In [5]:
from langchain_text_splitters import MarkdownHeaderTextSplitter
from typing import List, Dict

def chunk_markdown_file(file_path: str) -> List[Dict]:
    """
    Read a markdown file and split it into chunks based on headers.
    
    Args:
        file_path: Path to the markdown file
    
    Returns:
        List of document chunks with metadata
    """
    
    # Step 1: Read the markdown file
    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            markdown_content = file.read()
        print(f"Successfully read file: {file_path}")
        print(f"File size: {len(markdown_content)} characters\n")
    except FileNotFoundError:
        print(f"Error: File '{file_path}' not found.")
        return []
    except Exception as e:
        print(f"Error reading file: {e}")
        return []
    
    # Step 2: Configure the MarkdownHeaderTextSplitter
    # Define which headers to split on and their metadata keys
    headers_to_split_on = [
        ("#", "Header 1"),      # H1 headers
        ("##", "Header 2"),     # H2 headers
        ("###", "Header 3"),    # H3 headers
    ]
    
    # Create the splitter instance
    markdown_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=headers_to_split_on,
        strip_headers=False  # Keep headers in the content
    )
    
    # Step 3: Split the document and capture chunks
    chunks = markdown_splitter.split_text(markdown_content)
    
    # Convert to list of dictionaries for easier handling
    chunk_list = []
    for i, chunk in enumerate(chunks):
        chunk_dict = {
            'index': i,
            'content': chunk.page_content,
            'metadata': chunk.metadata,
            'length': len(chunk.page_content)
        }
        chunk_list.append(chunk_dict)
    
    print(f"Total number of chunks created: {len(chunk_list)}\n")
    print("=" * 60)
    
    # Step 4: Print the first 3 chunks (or fewer if less than 3 exist)
    chunks_to_display = min(3, len(chunk_list))
    
    for i in range(chunks_to_display):
        chunk = chunk_list[i]
        print(f"\n📄 CHUNK {i + 1}:")
        print(f"   Metadata: {chunk['metadata']}")
        print(f"   Length: {chunk['length']} characters")
        print(f"   Content preview:")
        print("-" * 40)
        
        # Display first 300 characters of content (or full if shorter)
        content_preview = chunk['content'][:300]
        if len(chunk['content']) > 300:
            content_preview += "..."
        print(content_preview)
        print("-" * 40)
    
    # Return the full list of chunks for use in next lab
    return chunk_list


Next, use the function above to split the same sample markdown file (in the **labs/assets** folder) into chunks.

> NOTE: the LangChain splitter splits on sections and provides metadata about the section hierarchy (which may be useful for you).

In [6]:
# Specify the path to your markdown file
markdown_file_path = "../../assets/sample.md"

# Chunk the markdown file
lc_chunks = chunk_markdown_file(markdown_file_path)
if lc_chunks:
    print("\n" + "=" * 60)
    print("📊 SUMMARY STATISTICS:")
    print(f"   Total chunks: {len(lc_chunks)}")
    
    total_chars = sum(chunk['length'] for chunk in lc_chunks)
    avg_chunk_size = total_chars / len(lc_chunks)
    print(f"   Average chunk size: {avg_chunk_size:.1f} characters")
    
    max_chunk = max(lc_chunks, key=lambda x: x['length'])
    min_chunk = min(lc_chunks, key=lambda x: x['length'])
    print(f"   Largest chunk: {max_chunk['length']} characters (chunk #{max_chunk['index']})")
    print(f"   Smallest chunk: {min_chunk['length']} characters (chunk #{min_chunk['index']})")

# To use the LangChain chunks in the next step, set the chunks variable to the chunk text
chunks = [item["content"] for item in lc_chunks]

print("\n✅ Chunks are now stored in the 'chunks' variable for use in the next step.")
print("   You can access individual chunks with chunks[0], chunks[1], etc.")


Successfully read file: ../../assets/sample.md
File size: 1410 characters

Total number of chunks created: 6


📄 CHUNK 1:
   Metadata: {'Header 1': 'Introduction to Machine Learning'}
   Length: 287 characters
   Content preview:
----------------------------------------
# Introduction to Machine Learning  
Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed. It focuses on developing computer programs that can access data and use it to learn for themselves.
----------------------------------------

📄 CHUNK 2:
   Metadata: {'Header 1': 'Introduction to Machine Learning', 'Header 2': 'Types of Machine Learning', 'Header 3': 'Supervised Learning'}
   Length: 283 characters
   Content preview:
----------------------------------------
## Types of Machine Learning  
### Supervised Learning  
In supervised learning, the algorithm learns from labeled training data. Each training example consi
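
One possible use of the header metadata shown in the output above is to prepend the section path to each chunk before embedding it, so that the chunk carries its context with it. The sketch below assumes chunk dictionaries shaped like the `chunk_list` entries built earlier (`content` and `metadata` keys); the helper name `add_header_context` and the `A > B > C` path format are hypothetical choices for illustration.

```python
def add_header_context(chunk: dict) -> str:
    """Prefix a chunk's text with its header path, e.g. '[A > B > C]'."""
    metadata = chunk.get("metadata", {})
    # Sort keys so "Header 1" comes before "Header 2", and so on
    path = " > ".join(metadata[key] for key in sorted(metadata))
    return f"[{path}]\n{chunk['content']}" if path else chunk["content"]


example_chunk = {
    "index": 1,
    "content": "In supervised learning, the algorithm learns from labeled training data.",
    "metadata": {
        "Header 1": "Introduction to Machine Learning",
        "Header 2": "Types of Machine Learning",
        "Header 3": "Supervised Learning",
    },
}
print(add_header_context(example_chunk))
```

Whether this improves retrieval depends on your documents and queries, so treat it as something to test rather than a default.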

## Step 2: Create Embeddings for Semantic Searches

In this step you'll use Azure OpenAI to create the embeddings for the chunks you created above. You'll need to decide which chunking technique you like best.

The code below uses the older text-embedding-ada-002 model to create the embeddings. In Step 4, you'll compare embeddings from different OpenAI models and see how they can differ in a semantic search.

First you'll need to install the packages.

In [7]:
%pip install openai azure-identity python-dotenv azure-search-documents -q

Note: you may need to restart the kernel to use updated packages.


In [8]:
import os
import dotenv
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

dotenv.load_dotenv()

# Create a token provider that returns a fresh bearer token on each call
token_provider = get_bearer_token_provider(
    DefaultAzureCredential(),
    "https://cognitiveservices.azure.com/.default",
)

client = AzureOpenAI(
    azure_ad_token_provider=token_provider,
    api_version="2024-02-01",
)

def embed_chunks(text_chunks: list[str], model: str) -> list[list[float]]:
    # model is your Azure deployment name, e.g. "embeddings-prod"
    response = client.embeddings.create(
        model=model,          # deployment name, not the base model id
        input=text_chunks,    # list of chunk strings
    )
    return [item.embedding for item in response.data]

Next, use the utility above to create embeddings for the chunks created earlier and take a look at a few of the returned vectors.

In [9]:

embeddings = embed_chunks(chunks, model="text-embedding-ada-002")

# Single embedding (first item)
print("First embedding:")
print(embeddings[0])  # a list[float]
print(len(embeddings[0]), "dimensions")

# First two embeddings
print("\nFirst two embeddings:")
for i, emb in enumerate(embeddings[:2]):  # slice to first 2
    print(f"Embedding {i}:")
    print(emb[:8], "...")  # show just first few dims to keep output short
    print("dim:", len(emb))

First embedding:
[-0.02100842073559761, -0.003979153465479612, -0.0002070511254714802, -0.02178790792822838, -0.01785275712609291, 0.007222823332995176, 0.003259385470300913, 0.005214388482272625, -0.025019006803631783, -0.015313140116631985, 0.014747384004294872, 0.02328401990234852, -0.032663002610206604, -0.007373692002147436, -0.005714139901101589, -0.004095447715371847, 0.02549675665795803, -0.003567408537492156, 0.006983948405832052, -0.008863517083227634, -0.025157302618026733, 0.008938951417803764, 0.010221332311630249, -0.04174024984240532, -0.0010246477322652936, -0.01949973776936531, 0.020794691517949104, -0.037867963314056396, 0.0038817175664007664, -0.0025867640506476164, 0.009278405457735062, -0.021586749702692032, -0.001859138486906886, 0.0004050658899359405, -0.010868809185922146, -0.004717779811471701, 0.02535845898091793, -0.0016438367310911417, -0.0010136469500139356, 0.001255664974451065, 0.018393369391560555, 0.01457137055695057, -0.009196684695780277, -0.023937782
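
Vectors like the one above are compared with a similarity measure in order to find relevant chunks. Cosine similarity is a common choice, and the sketch below shows it on tiny made-up vectors rather than real 1536-dimensional embeddings. The function names and the toy data are assumptions for illustration; a vector database performs this comparison for you, typically with an approximate index.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec: list[float], chunk_vecs: list[list[float]]) -> list[tuple[int, float]]:
    """Return (chunk index, similarity) pairs, most similar first."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(chunk_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional "embeddings" standing in for real ones
toy_query = [1.0, 0.0, 0.0]
toy_chunks = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.0]]
ranking = rank_by_similarity(toy_query, toy_chunks)
for idx, score in ranking:
    print(f"chunk {idx}: {score:.4f}")
```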

## Step 3: Load Azure AI Search Index

The next step is to insert the chunks and embeddings into a vector database. In this step we'll use Azure AI Search as the vector database.

In [10]:
import os
from typing import List
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SimpleField,
    SearchableField,
    SearchField,
    SearchFieldDataType,
    SearchIndex,
    VectorSearch,
    VectorSearchProfile,
    HnswAlgorithmConfiguration,
)

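search_endpoint = os.getenv("AZURE_SEARCH_ENDPOINT")
search_key = os.getenv("AZURE_SEARCH_API_KEY")
my_initials = os.getenv("MY_INITIALS")

# Validate up front: my_initials.lower() below would otherwise fail with an unhelpful AttributeError
if not my_initials:
    raise ValueError("MY_INITIALS environment variable must be set in order to prevent index name collisions.")
if not search_endpoint or not search_key:
    raise ValueError("AZURE_SEARCH_ENDPOINT and AZURE_SEARCH_API_KEY must be set.")

index_name = f"{my_initials.lower()}vectorindex"
embedding_dimension = 1536  # text-embedding-ada-002 returns 1536-dimensional vectors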

# Create SearchClient for later use
search_client = SearchClient(
    endpoint=search_endpoint,
    index_name=index_name,
    credential=AzureKeyCredential(search_key),
)

def ensure_chunk_vector_index() -> None:
    """
    Ensure an Azure AI Search index exists for chunk text + embeddings.
    If it does not exist, create it. If it exists, do nothing.
    """
    if not my_initials:
        raise ValueError("MY_INITIALS environment variable must be set in order to prevent index name collisions.")

    if not search_endpoint or not search_key:
        raise ValueError("AZURE_SEARCH_ENDPOINT and AZURE_SEARCH_API_KEY must be set or passed in.")

    credential = AzureKeyCredential(search_key)
    index_client = SearchIndexClient(endpoint=search_endpoint, credential=credential)

    # Check if the index already exists
    existing_names = list(index_client.list_index_names())
    if index_name in existing_names:
        print(f"Index '{index_name}' already exists; skipping creation.")
        return

    print(f"Index '{index_name}' does not exist; creating now...")

    fields = [
        SimpleField(
            name="id",
            type=SearchFieldDataType.String,
            key=True,
            filterable=True,
            sortable=True,
            facetable=True,
        ),
        SearchableField(
            name="content",
            type=SearchFieldDataType.String,
        ),
        SearchField(
            name="contentVector",
            type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
            searchable=True,
            vector_search_dimensions=embedding_dimension,
            vector_search_profile_name="chunk-vector-profile",
        ),
    ]

    vector_search = VectorSearch(
        algorithms=[
            HnswAlgorithmConfiguration(
                name="chunk-hnsw-config",
                kind="hnsw",
            )
        ],
        profiles=[
            VectorSearchProfile(
                name="chunk-vector-profile",
                algorithm_configuration_name="chunk-hnsw-config",
            )
        ],
    )

    index = SearchIndex(
        name=index_name,
        fields=fields,
        vector_search=vector_search,
    )

    result = index_client.create_index(index)  # only create because we know it doesn't exist
    print(f"Index '{result.name}' created.")


def upload_chunks_with_embeddings(
    chunks: List[str],
    embeddings: List[List[float]],
) -> None:
    """
    Upload chunks and their corresponding embeddings to the Azure AI Search index.
    """
    if len(chunks) != len(embeddings):
        raise ValueError("chunks and embeddings must have the same length")

    docs = []
    for i, (text, vector) in enumerate(zip(chunks, embeddings)):
        docs.append(
            {
                "@search.action": "mergeOrUpload",  # or "upload" if you only insert
                "id": str(i),
                "content": text,
                "contentVector": vector,  # must match the index's vector field name and dimensions
            }
        )

    # Azure AI Search accepts up to 1,000 documents per indexing batch, so send at most that many at a time
    batch_size = 1000
    for start in range(0, len(docs), batch_size):
        batch = docs[start : start + batch_size]
        result = search_client.upload_documents(documents=batch)
        # Optional: check status per doc
        succeeded = sum(1 for r in result if r.succeeded)
        print(f"Uploaded {succeeded}/{len(batch)} documents in batch starting at {start}.")


Next, run the code that ensures the index exists and then loads the chunks and embeddings.

In [11]:
# first make sure the index exists
ensure_chunk_vector_index()

# chunks: list[str]  (your chunk strings from above - either Semantic Kernel or LangChain)
# embeddings: list[list[float]] generated from Azure OpenAI for each chunk using text-embedding-ada-002
upload_chunks_with_embeddings(chunks, embeddings)

Index 'jahvectorindex' does not exist; creating now...
Index 'jahvectorindex' created.
Uploaded 6/6 documents in batch starting at 0.


Next, let's see what the semantic search results would be for these questions:
- “Explain the difference between supervised, unsupervised, and reinforcement learning.”
- “What kinds of real-world problems can machine learning solve today?”
- “How does reinforcement learning decide which actions to take to maximize rewards?”
- “Give some examples of how machine learning is used in healthcare and fraud prevention.”
- “Why is machine learning becoming more important as the amount of data grows?”

In [12]:
import os
from typing import List

import dotenv  # used by dotenv.load_dotenv() below
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI

dotenv.load_dotenv()

# Create a token provider that returns a fresh bearer token on each call
token_provider = get_bearer_token_provider(
    DefaultAzureCredential(),
    "https://cognitiveservices.azure.com/.default",
)

aoai_client = AzureOpenAI(
    azure_ad_token_provider=token_provider,
    api_version="2024-02-01",
)

search_client = SearchClient(
    endpoint=search_endpoint,
    index_name=index_name,
    credential=AzureKeyCredential(search_key),
)

def embed_query(text: str, model: str = "text-embedding-ada-002") -> List[float]:
    """Create a single embedding vector for a query string using Azure OpenAI."""
    resp = aoai_client.embeddings.create(
        model=model,        # Azure deployment name
        input=[text],
    )
    return resp.data[0].embedding

def run_test_queries(queries: list[str], use_hybrid: bool = True, top_k: int = 3):
    for q in queries:
        print("=" * 80)
        print(f"Query: {q}")
        print(f"Hybrid search: {use_hybrid}")
        print("-" * 80)

        q_vector = embed_query(q)

        vq = VectorizedQuery(
            vector=q_vector,
            fields="contentVector",
        )

        if use_hybrid:
            results = search_client.search(
                search_text=q,
                vector_queries=[vq],
                top=top_k,
            )
        else:
            results = search_client.search(
                search_text=None,
                vector_queries=[vq],
                top=top_k,
            )

        for i, doc in enumerate(results):
            score = doc.get("@search.score", None)  # float relevance score
            print(f"[{i}] id={doc['id']}  score={score:.4f}" if score is not None else f"[{i}] id={doc['id']}")
            print(doc["content"])
            print("-" * 40)



In [13]:
TEST_QUERIES = [
    "Explain the difference between supervised, unsupervised, and reinforcement learning.",
    #"What kinds of real-world problems can machine learning solve today?",
    #"How does reinforcement learning decide which actions to take to maximize rewards?",
    #"Give some examples of how machine learning is used in healthcare and fraud prevention.",
    #"Why is machine learning becoming more important as the amount of data grows?",
]

# Hybrid on:
run_test_queries(TEST_QUERIES, use_hybrid=True, top_k=3)

# Hybrid off:
run_test_queries(TEST_QUERIES, use_hybrid=False, top_k=3)


Query: Explain the difference between supervised, unsupervised, and reinforcement learning.
Hybrid search: True
--------------------------------------------------------------------------------
[0] id=1  score=0.0331
## Types of Machine Learning  
### Supervised Learning  
In supervised learning, the algorithm learns from labeled training data. Each training example consists of an input object and a desired output value. The algorithm analyzes the training data and produces an inferred function.
----------------------------------------
[1] id=2  score=0.0328
### Unsupervised Learning  
Unsupervised learning algorithms work with unlabeled data. The system tries to learn without a teacher, finding hidden patterns or intrinsic structures in input data.
----------------------------------------
[2] id=3  score=0.0325
### Reinforcement Learning  
Reinforcement learning is about taking suitable action to maximize reward in a particular situation. It is employed by various software and machines

## Step 4: Test Different Chunk Sizes and Embedding Models

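In this step, you'll compare search results across three embedding models:
- text-embedding-ada-002
- text-embedding-3-small
- text-embedding-3-large

The comparison code below expects one index per model (`large3index`, `small3index`, and `ada002index`) to already exist in your search service, each loaded with embeddings from the matching model.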

In [14]:
from azure.search.documents import SearchClient
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.models import VectorizedQuery
from typing import List, Dict, Any
import os

class AzureSearchComparison:
    """
    Compare semantic search results across different embedding model indexes.
    """
    
    def __init__(self):
        """
        Initialize Azure Search connection.
        """
        self.endpoint = search_endpoint
        self.credential = AzureKeyCredential(search_key)
        
        # Define the indexes and their embedding dimensions
        self.indexes = {
            'large3index': {
                'dimensions': 3072,  # Default output size of text-embedding-3-large
                'model': 'text-embedding-3-large'
            },
            'small3index': {
                'dimensions': 1536,  # Default output size of text-embedding-3-small
                'model': 'text-embedding-3-small'
            },
            'ada002index': {
                'dimensions': 1536,
                'model': 'text-embedding-ada-002'
            }
        }
    
    def get_embedding(self, text: str, model: str) -> List[float]:
        """
        Generate embedding for the query text using specified model.
        
        Args:
            text: Query text to embed
            model: OpenAI embedding model name
            
        Returns:
            List of floats representing the embedding vector
        """
        from openai import AzureOpenAI
        from azure.identity import DefaultAzureCredential, get_bearer_token_provider
        
        # Create a token provider that returns a fresh bearer token on each call
        token_provider = get_bearer_token_provider(
            DefaultAzureCredential(),
            "https://cognitiveservices.azure.com/.default",
        )
        
        client = AzureOpenAI(
            azure_ad_token_provider=token_provider,
            api_version="2024-02-01",
        )
                
        response = client.embeddings.create(
            input=text,
            model=model
        )
        
        return response.data[0].embedding
    
    def search_index(self, 
                     index_name: str, 
                     query_text: str, 
                     vector_dimensions: int,
                     embedding_model: str,
                     top_k: int = 5) -> List[Dict[str, Any]]:
        """
        Search a specific Azure AI Search index using vector similarity.
        
        Args:
            index_name: Name of the Azure search index
            query_text: Text query to search for
            vector_dimensions: Dimension of the embedding vectors
            embedding_model: OpenAI model to use for generating query embedding
            top_k: Number of top results to return
            
        Returns:
            List of search results with scores
        """
        # Create search client for this index
        search_client = SearchClient(
            endpoint=self.endpoint,
            index_name=index_name,
            credential=self.credential
        )
        
        # Generate embedding for the query
        print(f"  Generating embedding with {embedding_model}...")
        query_vector = self.get_embedding(query_text, embedding_model)
        
        # Create vector query
        vector_query = VectorizedQuery(
            vector=query_vector,
            k_nearest_neighbors=top_k,  # parameter name in azure-search-documents 11.4+
            fields="contentVector"
        )
        
        # Perform search
        results = search_client.search(
            search_text=None,  # Pure vector search
            vector_queries=[vector_query],
            select=["id", "content"],
            top=top_k
        )
        
        # Collect results
        search_results = []
        for result in results:
            search_results.append({
                'id': result['id'],
                'content': result['content'][:200] + "..." if len(result['content']) > 200 else result['content'],
                'score': result['@search.score']
            })
        
        return search_results
    
    def compare_search_results(self, query_text: str, top_k: int = 3):
        """
        Compare search results across all three indexes.
        
        Args:
            query_text: The search query
            top_k: Number of top results to show per index
        """
        print("\n" + "=" * 80)
        print("🔍 SEMANTIC SEARCH COMPARISON")
        print("=" * 80)
        print(f"\nQuery: '{query_text}'\n")
        print(f"Retrieving top {top_k} results from each index...\n")
        
        all_results = {}
        
        # Search each index
        for index_name, index_info in self.indexes.items():
            print(f"\n📊 Searching {index_name.upper()}")
            print(f"   Model: {index_info['model']}")
            print(f"   Dimensions: {index_info['dimensions']}")
            print("-" * 60)
            
            try:
                results = self.search_index(
                    index_name=index_name,
                    query_text=query_text,
                    vector_dimensions=index_info['dimensions'],
                    embedding_model=index_info['model'],
                    top_k=top_k
                )
                
                all_results[index_name] = results
                
                # Display results for this index
                for i, result in enumerate(results, 1):
                    print(f"\n   Result {i} (Score: {result['score']:.4f}):")
                    print(f"   ID: {result['id']}")
                    print(f"   Content: {result['content']}")
                    
            except Exception as e:
                print(f"   ❌ Error searching {index_name}: {str(e)}")
                all_results[index_name] = []
        
        # Compare and analyze differences
        self.analyze_differences(all_results, query_text)
        
        return all_results
    
    def analyze_differences(self, all_results: Dict[str, List[Dict]], query_text: str):
        """
        Analyze and highlight differences between search results.
        
        Args:
            all_results: Dictionary of results from each index
            query_text: Original query text
        """
        print("\n" + "=" * 80)
        print("📈 ANALYSIS: Differences Between Embedding Models")
        print("=" * 80)
        
        # Check if all indexes returned results
        indexes_with_results = [idx for idx, results in all_results.items() if results]
        
        if len(indexes_with_results) < 2:
            print("\n⚠️  Not enough results to compare. Check your indexes and API keys.")
            return
        
        # Compare top results
        print("\n🎯 Top Result Comparison:")
        print("-" * 40)
        for index_name, results in all_results.items():
            if results:
                top_result = results[0]
                print(f"\n{index_name}:")
                print(f"  Top match ID: {top_result['id']}")
                print(f"  Score: {top_result['score']:.4f}")
        
        # Check for agreement on top result
        top_ids = [results[0]['id'] for results in all_results.values() if results]
        if len(set(top_ids)) == 1:
            print("\n✅ All models agree on the top result!")
        else:
            print("\n🔄 Models returned different top results")
            
        # Calculate overlap in results
        print("\n📊 Result Overlap Analysis:")
        print("-" * 40)
        
        # Get all unique IDs per index
        for i, (idx1, results1) in enumerate(all_results.items()):
            if not results1:
                continue
            ids1 = set(r['id'] for r in results1)
            
            for idx2, results2 in list(all_results.items())[i+1:]:
                if not results2:
                    continue
                ids2 = set(r['id'] for r in results2)
                
                overlap = ids1.intersection(ids2)
                overlap_pct = (len(overlap) / max(len(ids1), len(ids2))) * 100
                
                print(f"\n{idx1} vs {idx2}:")
                print(f"  Overlapping results: {len(overlap)}/{max(len(ids1), len(ids2))}")
                print(f"  Similarity: {overlap_pct:.1f}%")
                
                if overlap:
                    print(f"  Common IDs: {', '.join(sorted(overlap))}")


In [15]:

# Initialize the comparison tool
searcher = AzureSearchComparison()

# The search query
query = "Explain the difference between supervised, unsupervised, and reinforcement learning."

# Run the comparison
print("\n🚀 Starting semantic search comparison across embedding models...")
results = searcher.compare_search_results(query, top_k=3)

# Additional insights
print("\n" + "=" * 80)
print("💡 KEY INSIGHTS FOR STUDENTS:")
print("=" * 80)
print("""
1. DIMENSIONALITY: 
    - text-embedding-3-large (3072 dims) can capture more nuanced relationships
    - text-embedding-ada-002 and text-embedding-3-small (1536 dims) are more efficient

2. PERFORMANCE VS COST:
    - Larger models may provide better semantic understanding
    - Smaller models are faster and cheaper to run at scale

3. USE CASE CONSIDERATIONS:
    - For high-precision tasks: Consider larger embedding models
    - For high-throughput applications: Smaller models may be sufficient
    - Always test with your specific data and queries

4. WHAT TO LOOK FOR:
    - Do all models find the same top result?
    - How much overlap is there in the top 3 results?
    - Are the relevance scores significantly different?
""")


🚀 Starting semantic search comparison across embedding models...

🔍 SEMANTIC SEARCH COMPARISON

Query: 'Explain the difference between supervised, unsupervised, and reinforcement learning.'

Retrieving top 3 results from each index...


📊 Searching LARGE3INDEX
   Model: text-embedding-3-large
   Dimensions: 3072
------------------------------------------------------------
  Generating embedding with text-embedding-3-large...
   ❌ Error searching large3index: () The index 'large3index' for service 'srcjan2026afworkshop' was not found.
Code: 
Message: The index 'large3index' for service 'srcjan2026afworkshop' was not found.

📊 Searching SMALL3INDEX
   Model: text-embedding-3-small
   Dimensions: 1536
------------------------------------------------------------
  Generating embedding with text-embedding-3-small...
   ❌ Error searching small3index: () The index 'small3index' for service 'srcjan2026afworkshop' was not found.
Code: 
Message: The index 'small3index' for servi