# Search documents with Azure OpenAI Embeddings

This tutorial walks you through using the Azure OpenAI embeddings API to perform document search: you'll query a knowledge base to find the most relevant document.

## Settings

| Setting | Value | 
| --- | --- |
| Model | text-embedding-ada-002 |


## Install additional .NET SDKs

You'll need to install 3 additional SDKs from NuGet.

### Azure.AI.OpenAI

[Azure.AI.OpenAI NuGet package](https://www.nuget.org/packages/Azure.AI.OpenAI/)

In [1]:
#r "nuget: Azure.AI.OpenAI, 1.0.0-beta.5"

### Utility SDKs

* [CsvHelper](https://www.nuget.org/packages/CsvHelper/): reads CSV files
* [Microsoft.ML.Tokenizers](https://www.nuget.org/packages/Microsoft.ML.Tokenizers/0.20.1): tokenizes text so we can count tokens

In [2]:
#r "nuget: CsvHelper, 30.0.1"
#r "nuget: Microsoft.ML.Tokenizers, 0.20.1"

## Set the `using` statements

In [3]:
using System;
using Azure.AI.OpenAI;
using Azure;
using Microsoft.ML.Tokenizers;
using CsvHelper;
using System.Globalization;
using System.Text.RegularExpressions;

## Configure Azure OpenAI client

We'll read the configuration values you set in **.devcontainer.json** from environment variables and use them to configure and initialize the `OpenAIClient`.

In [4]:
var AOAI_ENDPOINT = Environment.GetEnvironmentVariable("AOAI_ENDPOINT");
var AOAI_KEY = Environment.GetEnvironmentVariable("AOAI_KEY");
var AOAI_DEPLOYMENTID = Environment.GetEnvironmentVariable("AOAI_DEPLOYMENTID");

In [5]:
var endpoint = new Uri(AOAI_ENDPOINT);
var credentials = new AzureKeyCredential(AOAI_KEY);
var openAIClient = new OpenAIClient(endpoint, credentials);
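
If one of the environment variables is missing, `new Uri(null)` throws an `ArgumentNullException` that does not say which setting was absent. The following is a minimal sketch of a guard you could place before creating the client. `RequireEnv` and the `AOAI_ENDPOINT_DEMO` variable are hypothetical names used only for illustration; they are not part of the original notebook.

```csharp
using System;

// Hypothetical helper (an assumption, not from the notebook): read an
// environment variable and fail with a message that names the missing setting.
string RequireEnv(string name)
{
    var value = Environment.GetEnvironmentVariable(name);
    if (string.IsNullOrWhiteSpace(value))
    {
        throw new InvalidOperationException($"Environment variable '{name}' is not set.");
    }
    return value;
}

// Set a demo variable in-process so the sketch runs without any external setup.
Environment.SetEnvironmentVariable("AOAI_ENDPOINT_DEMO", "https://example.openai.azure.com/");

var endpointUri = new Uri(RequireEnv("AOAI_ENDPOINT_DEMO"));
Console.WriteLine(endpointUri.Host);
```

In the notebook itself you would call the helper with `AOAI_ENDPOINT`, `AOAI_KEY`, and `AOAI_DEPLOYMENTID` instead of the demo variable.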

## Read the document to be searched into memory

The document to be searched is a CSV file that contains legislative bills.

First, create a class that models each row of the file.

In [6]:
public class BillSummary
{
    public string Id { get; set; }
    public string Text { get; set; }
    public string Title { get; set; }
    public string Summary { get; set; }
    public int TokenCount { get; set; }
    public float[] AdaV2 { get; set; }
}

Next, open the CSV file and read certain columns into memory.

_(Note: The CSV file was downloaded when the Codespace was started. If you are not using Codespaces, download the [CSV from GitHub](https://raw.githubusercontent.com/Azure-Samples/Azure-OpenAI-Docs-Samples/main/Samples/Tutorials/Embeddings/data/bill_sum_data.csv) and place it in the same directory as this notebook.)_

In [7]:
List<BillSummary> billSummaries = new ();

using (var reader = new System.IO.StreamReader("bill_sum_data.csv"))
using (var csv = new CsvReader(reader, CultureInfo.InvariantCulture))
{
    csv.Read();
    csv.ReadHeader();

    while (csv.Read())
    {
        var billSummary = new BillSummary
        {
            Id = csv.GetField<string>("bill_id"),
            Text = csv.GetField<string>("text"),
            Title = csv.GetField<string>("title"),
            Summary = csv.GetField<string>("summary")
        };
        billSummaries.Add(billSummary);
    }
}

Before checking the token count or creating the embeddings, we want to clean up the `Text` property. That involves collapsing extra whitespace, replacing ".." with ".", and so on.

In [8]:
// The function to do the normalization
string NormalizeText(string s)
{
    // Replace multiple whitespace characters with a single space
    s = Regex.Replace(s, @"\s+", " ").Trim();

    // Remove ". ,"
    s = Regex.Replace(s, @"\. ,", "");

    // Replace ".." with "."
    s = s.Replace("..", ".");

    // Replace ". ." with "."
    s = s.Replace(". .", ".");

    // Remove newline characters
    s = s.Replace("\n", "");

    // Remove leading and trailing spaces
    s = s.Trim();

    return s;
}

// Perform the normalization on the Text property
billSummaries.ForEach(b => b.Text = NormalizeText(b.Text));
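
To see what the normalization does, here is a small worked example. It repeats the effective steps of `NormalizeText` so that it runs on its own; the `sample` string is made up for illustration and is not taken from the CSV file.

```csharp
using System;
using System.Text.RegularExpressions;

// Same effective steps as the notebook's NormalizeText, repeated here so the
// sketch is self-contained.
string NormalizeText(string s)
{
    s = Regex.Replace(s, @"\s+", " ").Trim(); // collapse whitespace runs, including newlines
    s = Regex.Replace(s, @"\. ,", "");        // remove ". ,"
    s = s.Replace("..", ".");                 // ".." becomes "."
    s = s.Replace(". .", ".");                // ". ." becomes "."
    return s.Trim();
}

// Made-up input with extra spaces, blank lines, and a doubled period.
var sample = "SECTION 1.  SHORT TITLE.\n\n  This Act may be cited as the   \"Example Act\"..  ";
var normalized = NormalizeText(sample);

Console.WriteLine(normalized);
// SECTION 1. SHORT TITLE. This Act may be cited as the "Example Act".
```

Because the first step already turns every newline into a space, the separate newline removal in the notebook's function is a harmless safeguard rather than a required step.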

## Tokenize the document's text

Because Azure OpenAI's embedding function can only handle 8192 tokens per input, we need to make sure none of the `Text` fields exceed that limit. We'll count the tokens with **Microsoft.ML.Tokenizers**.

In [9]:
// Function to perform the token count
int TokenCount(Tokenizer tokenizer, string s)
{
    var tokenizerResult = tokenizer.Encode(s);

    return tokenizerResult.Tokens.Count();
}

// In order to tokenize, we need the following
var vocabFilePath = @"vocab.json";
var mergesFilePath = @"merges.txt";

var tokenizer = new Tokenizer(new Bpe(vocabFilePath, mergesFilePath));

// Get a token count for the Text and put it into the TokenCount property
billSummaries.ForEach(b => b.TokenCount = TokenCount(tokenizer, b.Text));

### Remove anything that has too many tokens

Any row with too many tokens can't be embedded, so we remove it from `billSummaries`.

In [10]:
// Remove every bill whose text exceeds the token limit
billSummaries.RemoveAll(b => b.TokenCount > 8192);

### View the tokens

As a side note, let's view the tokens returned for the first row of data. _(This is not necessary for the tutorial, but it's interesting.)_

In [11]:
var firstTokenResult = tokenizer.Encode(billSummaries[0].Text);

firstTokenResult.Tokens.Display();

## Get the embeddings for each row in the document

This cell requests an embedding for every row in the document and stores it as an array of `float`s in the `AdaV2` property.

In [12]:
foreach (var bill in billSummaries)
{
    var embeddingResponse = await openAIClient.GetEmbeddingsAsync(AOAI_DEPLOYMENTID, new EmbeddingsOptions(bill.Text));

    bill.AdaV2 = embeddingResponse.Value.Data[0].Embedding.ToArray();
}

## Get embedding for the incoming query

Next, we'll take the incoming query (the text we want to search for in the document) and get an embedding for it using the same deployment.

In [13]:
var incoming = "Can I get information about STEM contributions?";
var incomingEmbeddingResponse = await openAIClient.GetEmbeddingsAsync(AOAI_DEPLOYMENTID, new EmbeddingsOptions(incoming));
var incomingEmbedding = incomingEmbeddingResponse.Value.Data[0].Embedding.ToArray();

## Find closest match with cosine similarity comparison

To find the closest match between the rows in the document and the incoming query, we'll use the cosine similarity to compare the vectors (or arrays) of each embedding.

In [14]:
float CosineSimilarity(float[] vectorA, float[] vectorB)
{
    float dotProduct = 0.0f;
    float magnitudeA = 0.0f;
    float magnitudeB = 0.0f;

    for (int i = 0; i < vectorA.Length; i++)
    {
        dotProduct += vectorA[i] * vectorB[i];
        magnitudeA += vectorA[i] * vectorA[i];
        magnitudeB += vectorB[i] * vectorB[i];
    }

    return dotProduct / (float)(Math.Sqrt(magnitudeA) * Math.Sqrt(magnitudeB));
}
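
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on direction: 1 means the same direction, 0 means orthogonal, and -1 means opposite. The sketch below repeats the function so it runs on its own and applies it to small hand-made vectors. Real embeddings are much longer, and these values are chosen only to make the arithmetic easy to check.

```csharp
using System;

// Same computation as the notebook's CosineSimilarity, repeated so the sketch
// is self-contained.
float CosineSimilarity(float[] vectorA, float[] vectorB)
{
    float dotProduct = 0.0f;
    float magnitudeA = 0.0f;
    float magnitudeB = 0.0f;

    for (int i = 0; i < vectorA.Length; i++)
    {
        dotProduct += vectorA[i] * vectorB[i];
        magnitudeA += vectorA[i] * vectorA[i];
        magnitudeB += vectorB[i] * vectorB[i];
    }

    return dotProduct / (float)(Math.Sqrt(magnitudeA) * Math.Sqrt(magnitudeB));
}

// Hand-made three-dimensional vectors (illustrative only).
float[] a = { 1f, 2f, 3f };
float[] sameDirection = { 2f, 4f, 6f };   // a scaled by 2
float[] orthogonal = { 3f, 0f, -1f };     // dot product with a is 3 + 0 - 3 = 0
float[] opposite = { -1f, -2f, -3f };     // a scaled by -1

float simSame = CosineSimilarity(a, sameDirection);
float simOrthogonal = CosineSimilarity(a, orthogonal);
float simOpposite = CosineSimilarity(a, opposite);

Console.WriteLine(simSame);        // approximately 1
Console.WriteLine(simOrthogonal);  // 0
Console.WriteLine(simOpposite);    // approximately -1
```

Note that the function assumes both vectors have the same length and that neither is all zeros; embeddings returned by the same model satisfy the first condition.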

Now use the `CosineSimilarity` function to perform the comparison. Here we take the top four results for illustration purposes.

In [15]:
var topMatches = billSummaries.Select(b => new
{
    Bill = b,
    Similarity = CosineSimilarity(incomingEmbedding, b.AdaV2)
}).OrderByDescending(b => b.Similarity).Take(4).ToList();
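
The same select, sort, and take pattern can be tried on toy data without calling the service. In this sketch the document titles and two-dimensional "embeddings" are hypothetical, and every vector has unit length, so a plain dot product (the hypothetical `Dot` helper) gives the same value as the cosine similarity.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Hypothetical documents with unit-length two-dimensional vectors.
var docs = new[]
{
    new { Title = "Doc A", Vector = new[] { 1f, 0f } },
    new { Title = "Doc B", Vector = new[] { 0.6f, 0.8f } },
    new { Title = "Doc C", Vector = new[] { 0f, 1f } },
};

// Hypothetical query vector, also unit length.
var query = new[] { 0.8f, 0.6f };

// For unit-length vectors the dot product equals the cosine similarity.
float Dot(float[] x, float[] y) => x.Zip(y, (p, q) => p * q).Sum();

// Score every document, sort from most to least similar, keep the top two.
var ranked = docs
    .Select(d => new { d.Title, Similarity = Dot(query, d.Vector) })
    .OrderByDescending(r => r.Similarity)
    .Take(2)
    .ToList();

foreach (var r in ranked)
{
    Console.WriteLine($"{r.Title}: {r.Similarity.ToString("F2", CultureInfo.InvariantCulture)}");
}
// Doc B: 0.96
// Doc A: 0.80
```

Sorting the whole list is fine for a few dozen rows like this tutorial's data; for much larger collections you would typically use a vector index instead.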

Finally, display the title and summary of the top match.

In [17]:
topMatches[0].Bill.Title.Display();
topMatches[0].Bill.Summary.Display();

To amend the Internal Revenue Code of 1986 to encourage businesses to improve math and science education at elementary and secondary schools.

National Science Education Tax Incentive for Businesses Act of 2007 - Amends the Internal Revenue Code to allow a general business tax credit for contributions of property or services to elementary and secondary schools and for teacher training to promote instruction in science, technology, engineering, or mathematics .