# Notebook 1: The Ingestion Phase

As discussed in the presentation, the ingestion phase loads the data sources that the retrieval system uses. These data sources can be existing databases with structured data, but this notebook focuses on unstructured data (such as documents).

## Learning Objectives
- Learn how to chunk markdown files into smaller pieces
- Learn how the chunk size affects retrieval quality in a RAG application
- Learn how different embedding models produce different results
- Learn how to load chunks and embeddings into an Azure AI Search index used as a vector store

### Install Required Packages

> NOTE: We need to use Semantic Kernel in this notebook in order to work with the embeddings and chunking (those features are not yet in Agent Framework as of the beginning of Jan 2026).

In [None]:
#r "nuget:Microsoft.Agents.AI.AzureAI, 1.0.0-preview.260108.1"
#r "nuget:Microsoft.SemanticKernel"

## Step 1: Chunk files into smaller pieces

### Document Chunking 

The process of taking a document and splitting it into pieces is often referred to as "chunking". There are many ways to split a document and it isn't a *one-size-fits-all* activity, so keep in mind how a document needs to be split in order to provide the most valuable chunks for your retrieval system.

Important things to remember about these chunks:

- We will get embeddings for each chunk
- Relevant chunks will be found by a similarity search using embeddings
- An overlap of 10-20% between adjacent chunks is often used when there is no clean way to split the document
- When working with real documents, you may need to address tables and images (images typically have different embedding models or need to be *verbalized*)
- Each chunk needs to fit in the context window of the LLM; keep in mind that content in the middle of a very large context can get lost
- You may need to modify your chunking to improve the retrieval quality of your system
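To make the overlap idea from the list above concrete, here is a minimal sketch of fixed-size chunking with overlap. `ChunkWithOverlap` is a hypothetical helper written for illustration; it is not part of Semantic Kernel, and it counts characters rather than tokens to stay self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not a Semantic Kernel API): fixed-size character windows
// where each chunk repeats the tail of the previous chunk.
static List<string> ChunkWithOverlap(string text, int chunkSize, double overlapRatio)
{
    if (chunkSize <= 0) throw new ArgumentOutOfRangeException(nameof(chunkSize));
    if (overlapRatio < 0 || overlapRatio >= 1) throw new ArgumentOutOfRangeException(nameof(overlapRatio));

    int overlap = (int)(chunkSize * overlapRatio);
    int step = chunkSize - overlap;   // always >= 1 because overlapRatio < 1
    var result = new List<string>();

    for (int start = 0; start < text.Length; start += step)
    {
        int length = Math.Min(chunkSize, text.Length - start);
        result.Add(text.Substring(start, length));
        if (start + length >= text.Length) break;   // this window reached the end of the text
    }
    return result;
}

// 250 characters of repeating alphabet as stand-in content
string sampleText = string.Concat(Enumerable.Range(0, 250).Select(i => (char)('a' + i % 26)));
List<string> overlapped = ChunkWithOverlap(sampleText, chunkSize: 100, overlapRatio: 0.2);

Console.WriteLine($"Chunks: {overlapped.Count}");
Console.WriteLine($"Tail of chunk 1:  {overlapped[0].Substring(80)}");
Console.WriteLine($"Head of chunk 2:  {overlapped[1].Substring(0, 20)}");
```

With a 20% overlap, the last 20 characters of one chunk reappear at the start of the next, so a sentence cut at a chunk boundary is still fully present in at least one chunk.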

In [None]:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.SemanticKernel.Text;

#pragma warning disable SKEXP0001, SKEXP0050

/// <summary>
/// Reads a markdown file and chunks it into smaller pieces using Semantic Kernel's TextChunker.
/// </summary>
/// <param name="filePath">Path to the markdown file</param>
/// <param name="maxTokenPerLine">Maximum number of tokens per line</param>
/// <returns>List of text chunks</returns>
public static async Task<List<string>> ChunkMarkdownFileAsync(string filePath, int maxTokenPerLine = 256)
{
    // Step 1: Read the markdown file from the file system
    Console.WriteLine($"Reading file: {filePath}");
    string markdownContent;
    
    try
    {
        markdownContent = await File.ReadAllTextAsync(filePath);
        Console.WriteLine($"Successfully read file. Total characters: {markdownContent.Length}\n");
    }
    catch (FileNotFoundException)
    {
        Console.WriteLine($"Error: File '{filePath}' not found.");
        return new List<string>();
    }
    catch (Exception e)
    {
        Console.WriteLine($"Error reading file: {e.Message}");
        return new List<string>();
    }
    
    // Step 2: Use Semantic Kernel's TextChunker to split into smaller pieces
    Console.WriteLine($"Chunking text with max_token_per_line={maxTokenPerLine}...\n");
    
    // Split the text into chunks
    var chunks = TextChunker.SplitMarkDownLines(
        text: markdownContent,
        maxTokensPerLine: maxTokenPerLine
    );
    
    // Step 3: Capture all chunks into a list variable
    List<string> chunkList = chunks.ToList();
    
    Console.WriteLine($"Total chunks created: {chunkList.Count}\n");
    
    // Step 4: Print out the first 3 chunks (or fewer if less than 3 exist)
    int chunksToDisplay = Math.Min(3, chunkList.Count);
    Console.WriteLine($"Displaying first {chunksToDisplay} chunks:\n");
    Console.WriteLine(new string('=', 80));
    
    for (int i = 0; i < chunksToDisplay; i++)
    {
        Console.WriteLine($"\n--- Chunk {i + 1} ---");
        Console.WriteLine($"Length: {chunkList[i].Length} characters");
        Console.WriteLine($"Content:\n{chunkList[i]}");
        Console.WriteLine(new string('-', 40));
    }
    
    Console.WriteLine(new string('=', 80));
    
    return chunkList;
}


Next, use the method above to split the sample markdown file (in the **assets** folder, per the path in the cell below) into chunks.

In [None]:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Specify the path to your markdown file
string markdownFilePath = "./../../assets/sample.md";

// Chunk the markdown file
List<string> chunks = await ChunkMarkdownFileAsync(
    filePath: markdownFilePath,
    maxTokenPerLine: 256  // Adjust chunk size as needed
);

if (chunks != null && chunks.Count > 0)
{
    // Print summary statistics
    double avgChunkSize = chunks.Average(chunk => chunk.Length);
    int smallestChunkSize = chunks.Min(chunk => chunk.Length);
    int largestChunkSize = chunks.Max(chunk => chunk.Length);
    
    Console.WriteLine("\nChunking Summary:");
    Console.WriteLine($"  - Total chunks: {chunks.Count}");
    Console.WriteLine($"  - Average chunk size: {avgChunkSize:F2} characters");
    Console.WriteLine($"  - Smallest chunk: {smallestChunkSize} characters");
    Console.WriteLine($"  - Largest chunk: {largestChunkSize} characters");
}

// The chunks list is now available for use in the next notebook
Console.WriteLine("\n✅ Chunks are now stored in the 'chunks' variable for use in the next step.");
Console.WriteLine("   You can access individual chunks with chunks[0], chunks[1], etc.");

### Try Using LangChain's MarkdownHeaderTextSplitter (optional)

LangChain is another popular Python package used in RAG applications, and there is a NuGet package for the [LangChain](https://github.com/tryAGI/LangChain) .NET project. I'm not sure how active its development is these days. It offers more splitter options than Semantic Kernel. In the code below you'll explore whether the MarkdownHeaderTextSplitter provides a better splitting option for you.

First you'll need to install the package.

In [None]:
#r "nuget:LangChain, 0.17.0"

In [None]:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
using LangChain.Splitters.Text;


/// <summary>
/// Read a markdown file and split it into chunks based on headers.
/// </summary>
/// <param name="filePath">Path to the markdown file</param>
/// <returns>List of document chunks with metadata</returns>
public static async Task<List<string>> LangChain_ChunkMarkdownFileAsync(string filePath)
{
    // Step 1: Read the markdown file
    string markdownContent;
    try
    {
        markdownContent = await File.ReadAllTextAsync(filePath);
        Console.WriteLine($"Successfully read file: {filePath}");
        Console.WriteLine($"File size: {markdownContent.Length} characters\n");
    }
    catch (FileNotFoundException)
    {
        Console.WriteLine($"Error: File '{filePath}' not found.");
        return new List<string>();
    }
    catch (Exception e)
    {
        Console.WriteLine($"Error reading file: {e.Message}");
        return new List<string>();
    }

    // Step 2: Configure the MarkdownHeaderTextSplitter
    // Define which headers to split on and their metadata keys
    var headersToSplitOn = new string[] { "#", "##", "###" };
    
    // Create the splitter instance
    var markdownSplitter = new MarkdownHeaderTextSplitter(
        headersToSplitOn: headersToSplitOn,
        includeHeaders: true  // Keep headers in the content
    );

    // Step 3: Split the document and capture chunks
    var chunks = markdownSplitter.SplitText(markdownContent);

    // Copy the chunks into a list of strings for easier handling
    var chunkList = new List<string>();
    for (int i = 0; i < chunks.Count; i++)
    {
        var chunk = chunks[i];
        chunkList.Add(chunk);
    }

    Console.WriteLine($"Total number of chunks created: {chunkList.Count}\n");
    Console.WriteLine(new string('=', 60));

    // Step 4: Print the first 3 chunks (or fewer if less than 3 exist)
    int chunksToDisplay = Math.Min(3, chunkList.Count);

    for (int i = 0; i < chunksToDisplay; i++)
    {
        var chunk = chunkList[i];
        Console.WriteLine($"\n📄 CHUNK {i + 1}:");
        Console.WriteLine($"   Length: {chunk.Length} characters");
        Console.WriteLine($"   Content preview:");
        Console.WriteLine(new string('-', 40));

        // Display first 300 characters of content (or full if shorter)
        string contentPreview = chunk.Length > 300 
            ? chunk.Substring(0, 300) + "..."
            : chunk;
        
        Console.WriteLine(contentPreview);
        Console.WriteLine(new string('-', 40));
    }

    // Return the full list of chunks for use in next lab
    return chunkList;
}


Next, use the method above to split the same sample markdown file (in the **assets** folder, per the path in the cell below) into chunks.

> NOTE: the LangChain splitter splits on sections and provides metadata about the section hierarchy (which may be useful for you).
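To illustrate what splitting on headers while keeping the section hierarchy can look like, here is a minimal sketch. `SplitByHeaders` is a hypothetical helper written for this example and is not the LangChain implementation; it handles only `#`, `##`, and `###` headers and does not special-case `#` lines inside fenced code blocks.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch (not the LangChain API): split markdown on #, ## and ###
// headers and record the header path that leads to each section.
static List<(string HeaderPath, string Content)> SplitByHeaders(string markdown)
{
    var sections = new List<(string HeaderPath, string Content)>();
    var path = new[] { "", "", "" };     // current H1, H2 and H3 titles
    var body = new List<string>();

    void Flush()
    {
        string text = string.Join("\n", body).Trim();
        if (text.Length > 0)
        {
            string headerPath = string.Join(" > ", path.Where(p => p.Length > 0));
            sections.Add((headerPath, text));
        }
        body.Clear();
    }

    foreach (var line in markdown.Split('\n'))
    {
        string trimmed = line.TrimEnd('\r');
        int level = trimmed.TakeWhile(c => c == '#').Count();
        bool isHeader = level >= 1 && level <= 3
                        && trimmed.Length > level && trimmed[level] == ' ';
        if (isHeader)
        {
            Flush();                                            // close the previous section
            path[level - 1] = trimmed.Substring(level + 1).Trim();
            for (int i = level; i < path.Length; i++) path[i] = "";   // reset deeper levels
        }
        else
        {
            body.Add(trimmed);
        }
    }
    Flush();   // emit the final section
    return sections;
}

string sampleDoc = "# Guide\nIntro text.\n## Setup\nInstall it.\n## Usage\nRun it.";
var sections = SplitByHeaders(sampleDoc);
foreach (var (headerPath, content) in sections)
{
    Console.WriteLine($"[{headerPath}] {content}");
}
```

Storing the header path next to each chunk lets you prepend it to the chunk text before embedding, or keep it as a filterable field in the index.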

In [None]:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Specify the path to your markdown file
string markdownFilePath = "./../../assets/sample.md";

// Chunk the markdown file
List<string> lcChunks = await LangChain_ChunkMarkdownFileAsync(markdownFilePath);

if (lcChunks != null && lcChunks.Any())
{
    Console.WriteLine("\n" + new string('=', 60));
    Console.WriteLine("📊 SUMMARY STATISTICS:");
    Console.WriteLine($"   Total chunks: {lcChunks.Count}");
    
    int totalChars = lcChunks.Sum(chunk => chunk.Length);
    double avgChunkSize = lcChunks.Count > 0 ? (double)totalChars / lcChunks.Count : 0;
    Console.WriteLine($"   Average chunk size: {avgChunkSize:F1} characters");
    
    // Chunks are plain strings, so look up their positions in the list (1-based, to match the chunk display)
    var maxChunk = lcChunks.OrderByDescending(x => x.Length).First();
    var minChunk = lcChunks.OrderBy(x => x.Length).First();
    Console.WriteLine($"   Largest chunk: {maxChunk.Length} characters (chunk #{lcChunks.IndexOf(maxChunk) + 1})");
    Console.WriteLine($"   Smallest chunk: {minChunk.Length} characters (chunk #{lcChunks.IndexOf(minChunk) + 1})");
}

// The chunks list is now available for use in the next notebook
Console.WriteLine("\n✅ Chunks are now stored in the 'chunks' variable for use in the next step.");
Console.WriteLine($"   You can access individual chunks with chunks[0], chunks[1], etc.");

// If going to use the LangChain chunks in the next step, set the chunks variable to the chunk text
List<string> chunks = lcChunks.Select(item => item).ToList();

## Step 2: Create Embeddings for Semantic Searches

In this step you'll use Azure OpenAI to create the embeddings for the chunks you created above. You'll need to decide which chunking technique you like best.

The code below uses the older text-embedding-ada-002 model to create the embeddings. In Step 4, you'll compare how embeddings from different OpenAI models differ in a semantic search.

First you'll need to install the packages.

In [None]:
#r "nuget:Azure.AI.OpenAI, *-*"
#r "nuget:Azure.Core, *-*"
#r "nuget:DotNetEnv, *-*"
#r "nuget:Microsoft.Extensions.Configuration, 10.0.1"
#r "nuget:Microsoft.Extensions.Configuration.Json, 10.0.1"
#r "nuget:Microsoft.Extensions.Configuration.Binder, 10.0.1"
#r "nuget:Azure.Identity, *-*"
#r "nuget:Microsoft.Extensions.Configuration.EnvironmentVariables, 10.0.1"

In [None]:
public class AzureOpenAISettings
{
    public string ApiKey { get; set; }
    public string Endpoint { get; set; }
    public string ApiVersion { get; set; }
    public string EmbeddingDeploymentName { get; set; }
}

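Load the configuration file (`appsettings.Local.json`) and copy its values into environment variables so that later cells can read them.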

In [None]:
using System.IO;
using Microsoft.Extensions.Configuration;

public const string DefaultConfigFileName = "appsettings.Local.json";

public static string? FindConfigDirectory(string fileName)
{
    var directory = new DirectoryInfo(Directory.GetCurrentDirectory());

    while (directory is not null)
    {
        if (File.Exists(Path.Combine(directory.FullName, fileName)))
        {
            return directory.FullName;
        }
        directory = directory.Parent;
    }

    return null;
}

var basePath = FindConfigDirectory(DefaultConfigFileName)
            ?? throw new InvalidOperationException(
                $"Could not find {DefaultConfigFileName} in current directory or any parent directory.");

// Load configuration from appsettings.Local.json and environment variables
var configuration = new ConfigurationBuilder()
    .SetBasePath(basePath)
    .AddJsonFile("appsettings.Local.json", optional: true, reloadOnChange: true) // Optional environment-specific settings
    .AddEnvironmentVariables()
    .Build();

// Bind configuration to settings object
var _settings = new AzureOpenAISettings()
{
    Endpoint = configuration["AZURE_OPENAI_ENDPOINT"],
    ApiVersion = configuration["AZURE_OPENAI_API_VERSION"],
    EmbeddingDeploymentName = configuration["AZURE_OPENAI_EMBEDDING_DEPLOYMENT"]
};        

foreach (var kvp in configuration.AsEnumerable())
{
    if (!string.IsNullOrEmpty(kvp.Value))
    {
        Environment.SetEnvironmentVariable(kvp.Key, kvp.Value);
    }
}

In [None]:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Extensions.Configuration;
using Azure;
using Azure.AI.OpenAI;
using Azure.Identity;

public class EmbeddingService
{
    private readonly AzureOpenAIClient _client;
    private readonly AzureOpenAISettings _settings;
    
    public EmbeddingService()
    {
        // Bind configuration to settings object
        _settings = new AzureOpenAISettings()
        {
            Endpoint = Environment.GetEnvironmentVariable("AZURE_OPENAI_ENDPOINT"),
            ApiVersion = Environment.GetEnvironmentVariable("AZURE_OPENAI_API_VERSION"),
            EmbeddingDeploymentName = Environment.GetEnvironmentVariable("AZURE_OPENAI_EMBEDDING_DEPLOYMENT")
        };        
        
        Console.WriteLine("Azure OpenAI Settings:");
        Console.WriteLine($"  - Endpoint: {_settings.Endpoint}");
        Console.WriteLine($"  - API Version: {_settings.ApiVersion}");
        Console.WriteLine($"  - Embedding Deployment Name: {_settings.EmbeddingDeploymentName}");
        
        // Initialize Azure OpenAI client
        _client = new AzureOpenAIClient(
            new Uri(_settings.Endpoint),
            new DefaultAzureCredential()
        );
    }
    
    public async Task<List<float[]>> EmbedChunksAsync(List<string> textChunks, string model = null)
    {
        // Use provided model or default from settings
        string deploymentName = model ?? _settings.EmbeddingDeploymentName;
        
        var embeddingClient = _client.GetEmbeddingClient(deploymentName);
        
        // Create embeddings for all chunks
        var response = await embeddingClient.GenerateEmbeddingsAsync(textChunks);
        
        // Extract embeddings from response
        return response.Value.Select(item => item.ToFloats().ToArray()).ToList();
    }
}

Next, use the utility above to create embeddings for the chunks created earlier and take a look at a few of the returned vectors.

In [None]:
var embeddings = await new EmbeddingService().EmbedChunksAsync(chunks);

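// Single embedding (first item)
Console.WriteLine("First embedding:");
// embeddings[0] is a float[]; printing the array directly would only show its type name
Console.WriteLine(string.Join(", ", embeddings[0].Take(8)) + " ...");
Console.WriteLine(embeddings[0].Length + " dimensions");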

// First two embeddings
Console.WriteLine("\nFirst two embeddings:");
for (int i = 0; i < 2 && i < embeddings.Count; i++)  // slice to first 2
{
    Console.WriteLine($"Embedding {i}:");
    Console.WriteLine(string.Join(", ", embeddings[i].Take(8)) + " ...");  // show just first few dims to keep output short
    Console.WriteLine("dim: " + embeddings[i].Length);
} 
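To build intuition for how such vectors are compared during a similarity search, here is a minimal sketch of cosine similarity, one common similarity measure for embeddings. The three-dimensional vectors are made up for illustration; real embeddings from the models in this notebook have hundreds or thousands of dimensions.

```csharp
using System;

// Cosine similarity: dot(a, b) / (|a| * |b|). The result lies between -1 and 1,
// and vectors pointing in the same direction score close to 1.
static double CosineSimilarity(float[] a, float[] b)
{
    if (a.Length != b.Length)
        throw new ArgumentException("Vectors must have the same dimension.");

    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }

    if (normA == 0 || normB == 0) return 0;   // treat a zero vector as "no similarity"
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

// Toy vectors standing in for real embeddings
float[] queryVec = { 1f, 0f, 1f };
float[] closeVec = { 2f, 0f, 2f };   // same direction as queryVec, different length
float[] farVec = { 0f, 1f, 0f };     // orthogonal to queryVec

Console.WriteLine($"query vs close: {CosineSimilarity(queryVec, closeVec):F2}");
Console.WriteLine($"query vs far:   {CosineSimilarity(queryVec, farVec):F2}");
```

Because the measure depends only on direction, `closeVec` scores about 1 even though it is twice as long as `queryVec`, while the orthogonal `farVec` scores 0.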

## Step 3: Load Azure AI Search Index

The next step is to insert the chunks and embeddings into a vector database. In this step we'll use Azure AI Search as the vector database.

First you'll need to install the package.

In [None]:
#r "nuget:Azure.Search.Documents, *-*"


In [None]:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Azure;
using Azure.Search.Documents;
using Azure.Search.Documents.Indexes;
using Azure.Search.Documents.Indexes.Models;
using Azure.Search.Documents.Models;
using Microsoft.Extensions.Configuration;

public class AzureSearchSettings
{
    public string Endpoint { get; set; }
    public string ApiKey { get; set; }
}

public class AzureSearchService
{
    private readonly string _searchEndpoint;
    private readonly string _searchKey;
    private readonly string _myInitials;
    private readonly string _indexName;
    private readonly int _embeddingDimension = 1536; // e.g. 1536 for ada-002
    private readonly SearchClient _searchClient;
    private readonly SearchIndexClient _indexClient;

    // TODO: modify to match config used above
    public AzureSearchService()
    {
        
        // Bind configuration to settings object
        var settings = new AzureSearchSettings()
        {
            Endpoint = Environment.GetEnvironmentVariable("AZURE_SEARCH_ENDPOINT"),
            ApiKey = Environment.GetEnvironmentVariable("AZURE_SEARCH_API_KEY")
        };

        _searchEndpoint = settings.Endpoint;
        _searchKey = settings.ApiKey;
        _myInitials = Environment.GetEnvironmentVariable("MY_INITIALS");
        _indexName = $"{_myInitials?.ToLower()}vectorindex";

        // Create SearchClient for later use
        _searchClient = new SearchClient(
            new Uri(_searchEndpoint),
            _indexName,
            new AzureKeyCredential(_searchKey)
        );

        // Create SearchIndexClient for index management
        _indexClient = new SearchIndexClient(
            new Uri(_searchEndpoint),
            new AzureKeyCredential(_searchKey)
        );
    }

    /// <summary>
    /// Ensure an Azure AI Search index exists for chunk text + embeddings.
    /// If it does not exist, create it. If it exists, do nothing.
    /// </summary>
    public async Task EnsureChunkVectorIndexAsync()
    {
        if (string.IsNullOrEmpty(_myInitials))
        {
            throw new InvalidOperationException("MY_INITIALS configuration must be set to prevent index name collisions.");
        }

        if (string.IsNullOrEmpty(_searchEndpoint) || string.IsNullOrEmpty(_searchKey))
        {
            throw new InvalidOperationException("AZURE_SEARCH_ENDPOINT and AZURE_SEARCH_API_KEY must be set.");
        }

        // Check if the index already exists
        var existingIndexes = _indexClient.GetIndexNamesAsync();
        await foreach (var existingName in existingIndexes)
        {
            if (existingName == _indexName)
            {
                Console.WriteLine($"Index '{_indexName}' already exists; skipping creation.");
                return;
            }
        }

        Console.WriteLine($"Index '{_indexName}' does not exist; creating now...");

        var fields = new List<SearchField>
        {
            new SimpleField("id", SearchFieldDataType.String)
            {
                IsKey = true,
                IsFilterable = true,
                IsSortable = true,
                IsFacetable = true
            },
            new SearchableField("content")
            {
                AnalyzerName = LexicalAnalyzerName.StandardLucene
            },
            new SearchField("contentVector", SearchFieldDataType.Collection(SearchFieldDataType.Single))
            {
                IsSearchable = true,
                VectorSearchDimensions = _embeddingDimension,
                VectorSearchProfileName = "chunk-vector-profile"
            }
        };

        var vectorSearch = new VectorSearch
        {
            Algorithms =
            {
                new HnswAlgorithmConfiguration("chunk-hnsw-config")
            },
            Profiles =
            {
                new VectorSearchProfile("chunk-vector-profile", "chunk-hnsw-config")
            }
        };

        var index = new SearchIndex(_indexName)
        {
            Fields = fields,
            VectorSearch = vectorSearch
        };

        var result = await _indexClient.CreateIndexAsync(index);
        Console.WriteLine($"Index '{result.Value.Name}' created.");
    }

    /// <summary>
    /// Upload chunks and their corresponding embeddings to the Azure AI Search index.
    /// </summary>
    public async Task UploadChunksWithEmbeddingsAsync(
        List<string> chunks,
        List<float[]> embeddings)
    {
        if (chunks.Count != embeddings.Count)
        {
            throw new ArgumentException("chunks and embeddings must have the same length");
        }

        var documents = new List<SearchDocument>();
        
        for (int i = 0; i < chunks.Count; i++)
        {
            var doc = new SearchDocument
            {
                ["id"] = i.ToString(),
                ["content"] = chunks[i],
                ["contentVector"] = embeddings[i]
            };
            documents.Add(doc);
        }

        // Azure AI Search supports up to 1,000 docs per batch
        const int batchSize = 1000;
        
        for (int start = 0; start < documents.Count; start += batchSize)
        {
            var batch = documents.Skip(start).Take(batchSize).ToList();
            
            var indexDocumentsBatch = IndexDocumentsBatch.MergeOrUpload(batch);
            var result = await _searchClient.IndexDocumentsAsync(indexDocumentsBatch);
            
            // Check status per doc
            int succeeded = result.Value.Results.Count(r => r.Succeeded);
            Console.WriteLine($"Uploaded {succeeded}/{batch.Count} documents in batch starting at {start}.");
        }
    }
}

Next, run the code that ensures the index has been created and then loads the chunks and embeddings.

In [None]:
var searchService = new AzureSearchService();

// Ensure index exists
await searchService.EnsureChunkVectorIndexAsync();

// Upload chunks with embeddings
await searchService.UploadChunksWithEmbeddingsAsync(chunks, embeddings);

Next let's see what the semantic search results would be for these questions:
- “Explain the difference between supervised, unsupervised, and reinforcement learning.”
- “What kinds of real-world problems can machine learning solve today?”
- “How does reinforcement learning decide which actions to take to maximize rewards?”
- “Give some examples of how machine learning is used in healthcare and fraud prevention.”
- “Why is machine learning becoming more important as the amount of data grows?”

In [None]:
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Azure;
using Azure.AI.OpenAI;
using Azure.Search.Documents;
using Azure.Search.Documents.Models;
using Microsoft.Extensions.Configuration;

public class SearchQueryService
{
    private readonly AzureOpenAIClient _aoaiClient;
    private readonly SearchClient _searchClient;
    private readonly string _embeddingDeploymentName;
    
    public SearchQueryService()
    {
        // Bind configuration to settings object
        var settings = new AzureOpenAISettings()
        {
            Endpoint = Environment.GetEnvironmentVariable("AZURE_OPENAI_ENDPOINT"),
            ApiVersion = Environment.GetEnvironmentVariable("AZURE_OPENAI_API_VERSION"),
            EmbeddingDeploymentName = Environment.GetEnvironmentVariable("AZURE_OPENAI_EMBEDDING_DEPLOYMENT")
        };        
            
        // Initialize Azure OpenAI client
        _embeddingDeploymentName = Environment.GetEnvironmentVariable("AZURE_OPENAI_EMBEDDING_DEPLOYMENT") ?? "text-embedding-ada-002";
        
        _aoaiClient = new AzureOpenAIClient(
            new Uri(settings.Endpoint),
            new DefaultAzureCredential()
        );
        
        // Initialize Search client
        var searchEndpoint = Environment.GetEnvironmentVariable("AZURE_SEARCH_ENDPOINT");
        var searchKey = Environment.GetEnvironmentVariable("AZURE_SEARCH_API_KEY");
        var myInitials = Environment.GetEnvironmentVariable("MY_INITIALS");
        var indexName = $"{myInitials?.ToLower()}vectorindex";
        
        _searchClient = new SearchClient(
            new Uri(searchEndpoint),
            indexName,
            new AzureKeyCredential(searchKey)
        );
    }
    
    /// <summary>
    /// Create a single embedding vector for a query string using Azure OpenAI.
    /// </summary>
    public async Task<float[]> EmbedQueryAsync(string text, string model = null)
    {
        var deploymentName = model ?? _embeddingDeploymentName;
        var embeddingClient = _aoaiClient.GetEmbeddingClient(deploymentName);
        
        var response = await embeddingClient.GenerateEmbeddingsAsync(new List<string> { text });
        return response.Value.First().ToFloats().ToArray();
    }
    
    /// <summary>
    /// Run test queries against the search index.
    /// </summary>
    public async Task RunTestQueriesAsync(List<string> queries, bool useHybrid = true, int topK = 3)
    {
        foreach (var query in queries)
        {
            Console.WriteLine(new string('=', 80));
            Console.WriteLine($"Query: {query}");
            Console.WriteLine($"Hybrid search: {useHybrid}");
            Console.WriteLine(new string('-', 80));
            
            // Generate embedding for the query
            var queryVector = await EmbedQueryAsync(query);
            
            // Create vector query
            var vectorQuery = new VectorizedQuery(queryVector)
            {
                Fields = { "contentVector" }
            };
            
            // Configure search options
            var searchOptions = new SearchOptions
            {
                VectorSearch = new VectorSearchOptions
                {
                    Queries = { vectorQuery }
                },
                Size = topK
            };
            
            // Perform search
            SearchResults<SearchDocument> results;
            
            if (useHybrid)
            {
                // Hybrid search: both text and vector
                results = await _searchClient.SearchAsync<SearchDocument>(
                    searchText: query,
                    searchOptions);
            }
            else
            {
                // Pure vector search
                results = await _searchClient.SearchAsync<SearchDocument>(
                    searchText: null,
                    searchOptions);
            }
            
            // Display results
            int i = 0;
            await foreach (var result in results.GetResultsAsync())
            {
                var doc = result.Document;
                var score = result.Score;
                
                if (score.HasValue)
                {
                    Console.WriteLine($"[{i}] id={doc["id"]}  score={score.Value:F4}");
                }
                else
                {
                    Console.WriteLine($"[{i}] id={doc["id"]}");
                }
                
                Console.WriteLine(doc["content"]);
                Console.WriteLine(new string('-', 40));
                i++;
            }
        }
    }
}

In [None]:
var testQueries = new List<string>
{
    "Explain the difference between supervised, unsupervised, and reinforcement learning.",
    //"What kinds of real-world problems can machine learning solve today?",
    //"How does reinforcement learning decide which actions to take to maximize rewards?",
    //"Give some examples of how machine learning is used in healthcare and fraud prevention.",
    //"Why is machine learning becoming more important as the amount of data grows?",
};

var queryService = new SearchQueryService();

// Hybrid on:
await queryService.RunTestQueriesAsync(testQueries, useHybrid: true, topK: 3);

// Hybrid off:
//await queryService.RunTestQueriesAsync(testQueries, useHybrid: false, topK: 3);
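When hybrid search is on, the keyword ranking and the vector ranking have to be merged into one result list. Azure AI Search documents Reciprocal Rank Fusion (RRF) as its method for this. The sketch below shows the general RRF idea only: `FuseRankings` is a hypothetical helper, the document ids are made up, and `k = 60` is a commonly used constant assumed here rather than a value read from the service.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Reciprocal Rank Fusion: score(doc) = sum over rankings of 1 / (k + rank),
// where rank starts at 1 in each ranking. k = 60 is an assumed, commonly used constant.
static List<(string Id, double Score)> FuseRankings(
    IEnumerable<IReadOnlyList<string>> rankings, int k = 60)
{
    var scores = new Dictionary<string, double>();
    foreach (var ranking in rankings)
    {
        for (int i = 0; i < ranking.Count; i++)
        {
            scores.TryGetValue(ranking[i], out double current);   // 0 if not seen yet
            scores[ranking[i]] = current + 1.0 / (k + i + 1);
        }
    }

    return scores
        .OrderByDescending(kv => kv.Value)
        .ThenBy(kv => kv.Key, StringComparer.Ordinal)   // deterministic order for ties
        .Select(kv => (Id: kv.Key, Score: kv.Value))
        .ToList();
}

// Made-up document ids, best match first in each ranking
var keywordRanking = new List<string> { "7", "2", "5" };
var vectorRanking = new List<string> { "2", "9", "7" };

var fused = FuseRankings(new[] { keywordRanking, vectorRanking });
foreach (var (id, score) in fused)
{
    Console.WriteLine($"id={id}  score={score:F4}");
}
```

Documents that appear in both rankings ("2" and "7") rise above documents that appear in only one, which is why hybrid results can differ from pure vector results for the same query.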

## Step 4: Test Different Chunk Sizes and Embedding Models

In this step, you get to explore the differences between using:
- text-embedding-ada-002
- text-embedding-3-small
- text-embedding-3-large

In [None]:
using Azure;
using Azure.AI.OpenAI;
using Azure.Search.Documents;
using Azure.Search.Documents.Models;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

/// <summary>
/// Compare semantic search results across different embedding model indexes.
/// </summary>
public class AzureSearchComparison
{
    private readonly string _openai_endpoint;
    private readonly string _openai_api_key;
    private readonly string _search_endpoint;
    private readonly AzureKeyCredential _search_credential;
    private readonly Dictionary<string, IndexInfo> _indexes;

    public class IndexInfo
    {
        public int Dimensions { get; set; }
        public string Model { get; set; } = string.Empty;
    }

    public class SearchResultItem
    {
        public string Id { get; set; } = string.Empty;
        public string Content { get; set; } = string.Empty;
        public double Score { get; set; }
    }

    /// <summary>
    /// Initialize Azure Search connection.
    /// </summary>
    public AzureSearchComparison()
    {
        
        _search_endpoint = Environment.GetEnvironmentVariable("AZURE_SEARCH_ENDPOINT");
        _search_credential = new AzureKeyCredential(Environment.GetEnvironmentVariable("AZURE_SEARCH_API_KEY"));
        _openai_endpoint = Environment.GetEnvironmentVariable("AZURE_OPENAI_ENDPOINT");

        // Define the indexes and their embedding dimensions
        _indexes = new Dictionary<string, IndexInfo>
        {
            ["large3index"] = new IndexInfo
            {
                Dimensions = 3072,  // Dimension for text-embedding-3-large
                Model = "text-embedding-3-large"
            },
            ["small3index"] = new IndexInfo
            {
                Dimensions = 1536,  // Dimension for text-embedding-3-small
                Model = "text-embedding-3-small"
            },
            ["ada002index"] = new IndexInfo
            {
                Dimensions = 1536,
                Model = "text-embedding-ada-002"
            }
        };
    }

    /// <summary>
    /// Generate embedding for the query text using specified model.
    /// </summary>
    /// <param name="text">Query text to embed</param>
    /// <param name="model">OpenAI embedding model name</param>
    /// <returns>The embedding vector as a read-only block of floats</returns>
    private async Task<ReadOnlyMemory<float>> GetEmbeddingAsync(string text, string model)
    {
        var client = new AzureOpenAIClient(
            new Uri(_openai_endpoint),
            new DefaultAzureCredential()
        );

        var embeddingClient = client.GetEmbeddingClient(model);
        var response = await embeddingClient.GenerateEmbeddingAsync(text);

        return response.Value.ToFloats();
    }

    /// <summary>
    /// Search a specific Azure AI Search index using vector similarity.
    /// </summary>
    /// <param name="indexName">Name of the Azure search index</param>
    /// <param name="queryText">Text query to search for</param>
    /// <param name="vectorDimensions">Dimension of the embedding vectors</param>
    /// <param name="embeddingModel">OpenAI model to use for generating query embedding</param>
    /// <param name="topK">Number of top results to return</param>
    /// <returns>List of search results with scores</returns>
    public async Task<List<SearchResultItem>> SearchIndexAsync(
        string indexName,
        string queryText,
        int vectorDimensions,
        string embeddingModel,
        int topK = 5)
    {
        // Create search client for this index
        var searchClient = new SearchClient(
            new Uri(_search_endpoint),
            indexName,
            _search_credential
        );

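        // Generate embedding for the query
        Console.WriteLine($"  Generating embedding with {embeddingModel}...");
        var queryVector = await GetEmbeddingAsync(queryText, embeddingModel);

        // Fail fast if the embedding size does not match what the index expects
        if (queryVector.Length != vectorDimensions)
        {
            throw new InvalidOperationException(
                $"Embedding from {embeddingModel} has {queryVector.Length} dimensions, but index '{indexName}' expects {vectorDimensions}.");
        }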

        // Create vector query
        var vectorQuery = new VectorizedQuery(queryVector)
        {
            KNearestNeighborsCount = topK,
            Fields = { "contentVector" }
        };

        // Perform search
        var searchOptions = new SearchOptions
        {
            Size = topK,
            Select = { "id", "content" },
            VectorSearch = new VectorSearchOptions
            {
                Queries = { vectorQuery }
            }
        };

        var response = await searchClient.SearchAsync<SearchDocument>(null, searchOptions);

        // Collect results
        var searchResults = new List<SearchResultItem>();
        await foreach (var result in response.Value.GetResultsAsync())
        {
            var content = result.Document["content"]?.ToString() ?? string.Empty;
            searchResults.Add(new SearchResultItem
            {
                Id = result.Document["id"]?.ToString() ?? string.Empty,
                Content = content.Length > 200 ? content[..200] + "..." : content,
                Score = result.Score ?? 0
            });
        }

        return searchResults;
    }

    /// <summary>
    /// Compare search results across all three indexes.
    /// </summary>
    /// <param name="queryText">The search query</param>
    /// <param name="topK">Number of top results to show per index</param>
    /// <returns>Results for each index, keyed by index name</returns>
    public async Task<Dictionary<string, List<SearchResultItem>>> CompareSearchResultsAsync(
        string queryText,
        int topK = 3)
    {
        Console.WriteLine();
        Console.WriteLine(new string('=', 80));
        Console.WriteLine("🔍 SEMANTIC SEARCH COMPARISON");
        Console.WriteLine(new string('=', 80));
        Console.WriteLine($"\nQuery: '{queryText}'\n");
        Console.WriteLine($"Retrieving top {topK} results from each index...\n");

        var allResults = new Dictionary<string, List<SearchResultItem>>();

        // Search each index
        foreach (var (indexName, indexInfo) in _indexes)
        {
            Console.WriteLine($"\n📊 Searching {indexName.ToUpper()}");
            Console.WriteLine($"   Model: {indexInfo.Model}");
            Console.WriteLine($"   Dimensions: {indexInfo.Dimensions}");
            Console.WriteLine(new string('-', 60));

            try
            {
                var results = await SearchIndexAsync(
                    indexName: indexName,
                    queryText: queryText,
                    vectorDimensions: indexInfo.Dimensions,
                    embeddingModel: indexInfo.Model,
                    topK: topK
                );

                allResults[indexName] = results;

                // Display results for this index
                for (int i = 0; i < results.Count; i++)
                {
                    var result = results[i];
                    Console.WriteLine($"\n   Result {i + 1} (Score: {result.Score:F4}):");
                    Console.WriteLine($"   ID: {result.Id}");
                    Console.WriteLine($"   Content: {result.Content}");
                }
            }
            catch (Exception ex)
            {
                Console.WriteLine($"   ❌ Error searching {indexName}: {ex.Message}");
                allResults[indexName] = new List<SearchResultItem>();
            }
        }

        // Compare and analyze differences
        AnalyzeDifferences(allResults, queryText);

        return allResults;
    }

    /// <summary>
    /// Analyze and highlight differences between search results.
    /// </summary>
    /// <param name="allResults">Dictionary of results from each index</param>
    /// <param name="queryText">Original query text</param>
    private void AnalyzeDifferences(
        Dictionary<string, List<SearchResultItem>> allResults,
        string queryText)
    {
        Console.WriteLine();
        Console.WriteLine(new string('=', 80));
        Console.WriteLine("📈 ANALYSIS: Differences Between Embedding Models");
        Console.WriteLine(new string('=', 80));

        // Check if all indexes returned results
        var indexesWithResults = allResults
            .Where(kvp => kvp.Value.Count > 0)
            .Select(kvp => kvp.Key)
            .ToList();

        if (indexesWithResults.Count < 2)
        {
            Console.WriteLine("\n⚠️  Not enough results to compare. Check your indexes and credentials.");
            return;
        }

        // Compare top results
        Console.WriteLine("\n🎯 Top Result Comparison:");
        Console.WriteLine(new string('-', 40));

        foreach (var (indexName, results) in allResults)
        {
            if (results.Count > 0)
            {
                var topResult = results[0];
                Console.WriteLine($"\n{indexName}:");
                Console.WriteLine($"  Top match ID: {topResult.Id}");
                Console.WriteLine($"  Score: {topResult.Score:F4}");
            }
        }

        // Check for agreement on top result
        var topIds = allResults
            .Where(kvp => kvp.Value.Count > 0)
            .Select(kvp => kvp.Value[0].Id)
            .ToList();

        if (topIds.Distinct().Count() == 1)
        {
            Console.WriteLine("\n✅ All models agree on the top result!");
        }
        else
        {
            Console.WriteLine("\n🔄 Models returned different top results");
        }

        // Calculate overlap in results
        Console.WriteLine("\n📊 Result Overlap Analysis:");
        Console.WriteLine(new string('-', 40));

        // Compare every pair of indexes by the set of result IDs each returned
        var resultsList = allResults.ToList();
        for (int i = 0; i < resultsList.Count; i++)
        {
            var (idx1, results1) = resultsList[i];
            if (results1.Count == 0) continue;

            var ids1 = results1.Select(r => r.Id).ToHashSet();

            for (int j = i + 1; j < resultsList.Count; j++)
            {
                var (idx2, results2) = resultsList[j];
                if (results2.Count == 0) continue;

                var ids2 = results2.Select(r => r.Id).ToHashSet();

                var overlap = ids1.Intersect(ids2).ToList();
                var maxCount = Math.Max(ids1.Count, ids2.Count);
                var overlapPct = (overlap.Count / (double)maxCount) * 100;

                Console.WriteLine($"\n{idx1} vs {idx2}:");
                Console.WriteLine($"  Overlapping results: {overlap.Count}/{maxCount}");
                Console.WriteLine($"  Similarity: {overlapPct:F1}%");

                if (overlap.Count > 0)
                {
                    Console.WriteLine($"  Common IDs: {string.Join(", ", overlap.OrderBy(x => x))}");
                }
            }
        }
    }
}
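
The overlap analysis in `AnalyzeDifferences` divides the number of shared result IDs by the larger of the two result counts. The standalone sketch below reproduces that ratio on two small lists and, for comparison, also computes Jaccard similarity (shared IDs over the union), which the class above does not use. The document IDs are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical top-3 result IDs from two indexes (illustrative values only)
var large = new List<string> { "doc-4", "doc-7", "doc-9" };
var small = new List<string> { "doc-7", "doc-4", "doc-2" };

var setA = large.ToHashSet();
var setB = small.ToHashSet();
int shared = setA.Intersect(setB).Count();

// Same ratio as AnalyzeDifferences: shared IDs over the larger result set
double overlapPct = shared / (double)Math.Max(setA.Count, setB.Count) * 100;

// Jaccard similarity: shared IDs over the union of both sets
double jaccardPct = shared / (double)setA.Union(setB).Count() * 100;

Console.WriteLine($"Overlap: {overlapPct:F1}%  Jaccard: {jaccardPct:F1}%");
```

Both measures ignore ranking: two indexes that return the same IDs in a different order still count as a full overlap, which is why the class reports the top result separately.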

In [None]:
// Initialize the comparison tool
var searcher = new AzureSearchComparison();

// The search query
var query = "Explain the difference between supervised, unsupervised, and reinforcement learning.";

// Run the comparison
Console.WriteLine("\n🚀 Starting semantic search comparison across embedding models...");
var results = await searcher.CompareSearchResultsAsync(query, topK: 3);

// Additional insights
Console.WriteLine();
Console.WriteLine(new string('=', 80));
Console.WriteLine("💡 KEY INSIGHTS FOR STUDENTS:");
Console.WriteLine(new string('=', 80));
Console.WriteLine("""

1. DIMENSIONALITY:
    - text-embedding-3-large (3072 dims) can capture more nuanced relationships
    - text-embedding-ada-002 and text-embedding-3-small (1536 dims) produce smaller vectors that are cheaper to store and search

2. PERFORMANCE VS COST:
    - Larger models may provide better semantic understanding
    - Smaller models are faster and cheaper to run at scale

3. USE CASE CONSIDERATIONS:
    - For high-precision tasks: Consider larger embedding models
    - For high-throughput applications: Smaller models may be sufficient
    - Always test with your specific data and queries

4. WHAT TO LOOK FOR:
    - Do all models find the same top result?
    - How much overlap is there in the top 3 results?
    - Are the relevance scores significantly different?
""");
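
When comparing relevance scores, it helps to know what the number means. For a vector query on a field that uses the cosine metric, Azure AI Search reports the score as `1 / (1 + (1 - cosineSimilarity))`, so scores fall between roughly 0.333 and 1.0. The sketch below assumes the `contentVector` fields in these indexes use cosine (the index definitions are not shown in this section) and converts a score back to cosine similarity. `ScoreToCosine` is a hypothetical helper and the sample scores are made up.

```csharp
using System;

// Convert an Azure AI Search vector score back to cosine similarity.
// Assumption: the vector field uses the cosine metric, where
// score = 1 / (1 + (1 - cosineSimilarity)).
double ScoreToCosine(double score)
{
    double cosineDistance = (1 - score) / score;
    return 1 - cosineDistance;
}

// Hypothetical scores, for illustration only
foreach (var score in new[] { 1.0, 0.8, 0.5 })
{
    Console.WriteLine($"score {score:F3} -> cosine {ScoreToCosine(score):F3}");
}
```

Because each embedding model places text in its own vector space, even converted similarities are best compared within one index rather than across indexes.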