# OpenAI: Composing data for embeddings

## Intro

A simplified sample that shows how the way input is composed for embeddings affects vector search results.

### Step 1: Azure environment

This [Azure CLI script](../CreateEnv/CreateEnv.azcli) creates:

- an Azure OpenAI instance
- a deployment of text-embedding-ada-002 to calculate embeddings

The script provides necessary credentials to connect to Azure OpenAI (e.g. API key and endpoint information) and stores them in environment variables.
```powershell
$ENV:AZURE_OPENAI_ENDPOINT = $csEndpoint
$ENV:AZURE_OPENAI_API_KEY = $csApiKey
$ENV:AZURE_OPENAI_DEPLOYMENTNAME = $modelDeploymentName
```
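
The environment variables stored by the script can be read from C# instead of pasting the values into the notebook. The following is a minimal sketch, not part of the original sample: `ReadSetting` and the three `...Setting` variables are hypothetical names, and the fallback strings are placeholders only.

```csharp
using System;

// Hypothetical helper (not part of the notebook): returns the value of an
// environment variable, or the given fallback when it is missing or empty.
static string ReadSetting(string name, string fallback)
{
    var value = Environment.GetEnvironmentVariable(name);
    return string.IsNullOrEmpty(value) ? fallback : value;
}

// The variable names match the ones set by the script; the fallbacks are placeholders.
string endpointSetting = ReadSetting("AZURE_OPENAI_ENDPOINT", "https://Your_Azure_OpenAI_API_endpoint");
string apiKeySetting = ReadSetting("AZURE_OPENAI_API_KEY", "<<Your Azure OpenAI API key>>");
string deploymentSetting = ReadSetting("AZURE_OPENAI_DEPLOYMENTNAME", "<<your Azure OpenAI embedding deployment name>>");

// Print only the endpoint; the API key should not be written to notebook output.
Console.WriteLine($"Endpoint configured: {endpointSetting}");
```

Environment variables set in a PowerShell session are only visible to processes started from that session, so the notebook host may need to be launched from the same shell.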

### Step 2: Housekeeping 

- Import NuGet packages
- Define arbitrary facts 
- Create an instance of ***OpenAIClient()***

Replace `apiEndpoint`, `apiKey` and `embeddingModelDeploymentName` with values from your Azure OpenAI instance.

```csharp
#r "nuget: Azure.AI.OpenAI, 1.0.0-beta.6"
#r "nuget: MathNet.Numerics, 5.0.0"

using System.IO;
using Azure.AI.OpenAI;
using Azure;
using MathNet.Numerics; 

//Define Azure OpenAI information
Uri apiEndpoint = new Uri("https://Your_Azure_OpenAI_API_endpoint");
string apiKey = "<<Your Azure OpenAI API key>>";
string embeddingModelDeploymentName = "<<your Azure OpenAI embedding deployment name>>";

AzureKeyCredential azureKeyCredential = new AzureKeyCredential(apiKey);
OpenAIClient openAIClient = new OpenAIClient(apiEndpoint, azureKeyCredential);

#!share --from c# openAIClient --as openAIClient
#!share --from c# embeddingModelDeploymentName --as embeddingModelDeploymentName
```


### Step 2: Create data

The data to be vectorized is composed of:

  - "ideal data" copied from the public ACI documentation available on the web
  - "distracting data" from the public AKS documentation available on the web
  - "distracting data" from [The Complete Works of William Shakespeare](https://www.gutenberg.org/cache/epub/100/pg100.txt)

The data is used to ground an LLM query asking for information about "ACI container hosting alternatives".

```csharp
//Query data for vector search
string query = "aci container hosting alternatives"; 

//"Optimal data" - Copied from the public ACI documentation
string acifacts = await File.ReadAllTextAsync(Path.Combine("assets", "acifacts.txt"));
//"Distracting data" - Copied from the public AKS documentation - somehow related to ACI
string aksfacts = await File.ReadAllTextAsync(Path.Combine("assets", "aksfacts.txt"));
//"Distracting data" - From the free ebook "The complete works of William Shakespeare"
string shakespearefacts = (await File.ReadAllTextAsync(Path.Combine("assets", "shakespearefacts.txt"))).Replace("\r\n", " ");

//"Distracting Data" - Starting with the "optimal data" - followed by a % of Shakespeare information
string distractedGrounding_100_25 = string.Concat(acifacts, " ", shakespearefacts.Substring(0, (int)(acifacts.Length * 0.25)));
string distractedGrounding_100_50 = string.Concat(acifacts, " ", shakespearefacts.Substring(0, (int)(acifacts.Length * 0.5)));
string distractedGrounding_100_75 = string.Concat(acifacts, " ", shakespearefacts.Substring(0, (int)(acifacts.Length * 0.75)));
string distractedGrounding_100_100_aci = string.Concat(acifacts, " ", shakespearefacts.Substring(0, acifacts.Length));

//"Distracting Data" - Starting with a % of Shakespeare information - followed by the "optimal data"
string distractedGrounding_25_100 = string.Concat(shakespearefacts.Substring(0, (int)(acifacts.Length * 0.25)), " ", acifacts);
string distractedGrounding_50_100 = string.Concat(shakespearefacts.Substring(0, (int)(acifacts.Length * 0.5)), " ", acifacts);
string distractedGrounding_75_100 = string.Concat(shakespearefacts.Substring(0, (int)(acifacts.Length * 0.75)), " ", acifacts);
string distractedGrounding_100_100_sp = string.Concat(shakespearefacts.Substring(0, acifacts.Length), " ", acifacts);

//"Distracting Data" - A mix of ACI and AKS information
string distracted_aci_aks = string.Concat(acifacts, " ", aksfacts); 
string distracted_aks_aci = string.Concat(aksfacts, " ", acifacts);

Console.WriteLine("Query, 'optimal data' and 'distracting data' defined");

#!share --from c# acifacts --as acifacts
#!share --from c# aksfacts --as aksfacts
#!share --from c# shakespearefacts --as shakespearefacts
#!share --from c# distractedGrounding_100_25 --as distractedGrounding_100_25
#!share --from c# distractedGrounding_100_50 --as distractedGrounding_100_50
#!share --from c# distractedGrounding_100_75 --as distractedGrounding_100_75
#!share --from c# distractedGrounding_100_100_aci --as distractedGrounding_100_100_aci
#!share --from c# distractedGrounding_25_100 --as distractedGrounding_25_100
#!share --from c# distractedGrounding_50_100 --as distractedGrounding_50_100
#!share --from c# distractedGrounding_75_100 --as distractedGrounding_75_100
#!share --from c# distractedGrounding_100_100_sp --as distractedGrounding_100_100_sp
#!share --from c# distracted_aci_aks --as distracted_aci_aks
#!share --from c# distracted_aks_aci --as distracted_aks_aci
```

```
Query, 'optimal data' and 'distracting data' defined
```
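
Each composition pairs the ACI facts with a slice of distracting text whose length is a fraction of the facts' length, placed either after or before the facts. That pattern can be sketched as one helper. `Compose`, `facts` and `noise` are hypothetical names used for illustration, and the clamping of the slice length is an added assumption so that a short distraction text does not cause an out-of-range `Substring` call.

```csharp
using System;

// Hypothetical helper (not part of the notebook): combines the facts with a slice of
// distracting text whose length is `ratio` times the length of the facts.
static string Compose(string facts, string distraction, double ratio, bool distractionFirst)
{
    // Clamp the slice so it never exceeds the available distracting text.
    int length = Math.Min(distraction.Length, (int)(facts.Length * ratio));
    string slice = distraction.Substring(0, length);

    return distractionFirst
        ? string.Concat(slice, " ", facts)
        : string.Concat(facts, " ", slice);
}

// Toy inputs standing in for the contents of the asset files.
string facts = "ACI runs containers.";
string noise = "To be, or not to be, that is the question.";

Console.WriteLine(Compose(facts, noise, 0.5, distractionFirst: false));
Console.WriteLine(Compose(facts, noise, 0.5, distractionFirst: true));
```

Note that `string.Concat` joins its arguments in order, whereas `string.Join` treats its first argument as the separator placed between the remaining ones.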


### Step 3: Vectorize data

An Azure OpenAI embedding model calculates an embedding (a vector representation) for the query and for each data variant. The embeddings are also stored as text files in the `assets` folder.


```csharp
//Create Embeddings

Dictionary<string, double[]> vectors = new Dictionary<string, double[]>();

vectors.Add("query", await CreateEmbedding(query));
vectors.Add("acifacts", await CreateEmbedding(acifacts));
vectors.Add("aksfacts", await CreateEmbedding(aksfacts));
vectors.Add("shakespearefacts", await CreateEmbedding(shakespearefacts));
vectors.Add("distractedGrounding_100_25", await CreateEmbedding(distractedGrounding_100_25));
vectors.Add("distractedGrounding_100_50", await CreateEmbedding(distractedGrounding_100_50));
vectors.Add("distractedGrounding_100_75", await CreateEmbedding(distractedGrounding_100_75));
vectors.Add("distractedGrounding_100_100_aci", await CreateEmbedding(distractedGrounding_100_100_aci));
vectors.Add("distractedGrounding_25_100", await CreateEmbedding(distractedGrounding_25_100));
vectors.Add("distractedGrounding_50_100", await CreateEmbedding(distractedGrounding_50_100));
vectors.Add("distractedGrounding_75_100", await CreateEmbedding(distractedGrounding_75_100));
vectors.Add("distractedGrounding_100_100_sp", await CreateEmbedding(distractedGrounding_100_100_sp));
vectors.Add("distracted_aci_aks", await CreateEmbedding(distracted_aci_aks));
vectors.Add("distracted_aks_aci", await CreateEmbedding(distracted_aks_aci));

//Store embeddings
foreach(KeyValuePair<string, double[]> vector in vectors)
{
    string vectorString = String.Join(",", vector.Value);
    await File.WriteAllTextAsync(Path.Combine("assets", $"vector_{vector.Key}.txt"), vectorString);
}

//Calls the embedding deployment and converts the returned float values to double
private async Task<double[]> CreateEmbedding(string query)
{
    EmbeddingsOptions embeddingsOptions = new EmbeddingsOptions(query);
    Response<Embeddings> embedding = await openAIClient.GetEmbeddingsAsync(embeddingModelDeploymentName, embeddingsOptions); 

    return Array.ConvertAll(embedding.Value.Data[0].Embedding.ToArray(), x => (double)x);
}

Console.WriteLine("Embeddings created...");

#!share --from c# vectors --as vectors
```

```
Embeddings created...
```
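
The stored vector files are comma-separated numbers. Because `String.Join` formats each `double` with the current culture, a machine whose culture uses a decimal comma could produce files that cannot be split back into values unambiguously. The sketch below shows one possible culture-invariant round trip. `SerializeVector` and `ParseVector` are hypothetical helpers and not part of the original notebook.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Hypothetical helper: writes each value with the invariant culture ("." as decimal separator).
static string SerializeVector(double[] vector) =>
    string.Join(",", vector.Select(v => v.ToString("R", CultureInfo.InvariantCulture)));

// Hypothetical helper: parses a comma-separated line written by SerializeVector.
static double[] ParseVector(string text) =>
    text.Split(',').Select(s => double.Parse(s, CultureInfo.InvariantCulture)).ToArray();

double[] original = { 0.1867628581649723, -0.25, 1e-7 };
double[] restored = ParseVector(SerializeVector(original));

// Reports whether every value survived the round trip unchanged.
Console.WriteLine(original.SequenceEqual(restored));
```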


### Step 4: Calculate Cosine distance 

The cosine distance between the query embedding and each of the other embeddings is calculated, and the results are compared. A smaller distance means the text is closer to the query.
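
As a point of reference, cosine distance is commonly defined as one minus the cosine similarity, that is 1 - (a · b) / (|a| |b|). The sketch below implements that definition by hand, on the assumption that `Distance.Cosine` from MathNet.Numerics follows the same definition. `CosineDistance` is a hypothetical name and the two-dimensional vectors are toy data.

```csharp
using System;

// Hypothetical helper: cosine distance = 1 - cosine similarity.
static double CosineDistance(double[] a, double[] b)
{
    if (a.Length != b.Length)
        throw new ArgumentException("Vectors must have the same length.");

    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];     // accumulates a · b
        normA += a[i] * a[i];   // accumulates |a|^2
        normB += b[i] * b[i];   // accumulates |b|^2
    }

    return 1.0 - dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

// Same direction: distance 0. Orthogonal vectors: distance 1.
Console.WriteLine(CosineDistance(new[] { 1.0, 0.0 }, new[] { 1.0, 0.0 })); // 0
Console.WriteLine(CosineDistance(new[] { 1.0, 0.0 }, new[] { 0.0, 1.0 })); // 1
```

Only the direction of the vectors matters for this measure, not their length, so scaling an embedding does not change its distance to the query.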

```csharp
//Calculate Distances
Console.WriteLine($"\nCosine distance:");
double distance_aci = Distance.Cosine(vectors["query"], vectors["acifacts"]);
Console.WriteLine($"Cosine distance between query and acifacts: {distance_aci}");

double distance_aks = Distance.Cosine(vectors["query"], vectors["aksfacts"]);
Console.WriteLine($"Cosine distance between query and aksfacts: {distance_aks}");

double distance_sp = Distance.Cosine(vectors["query"], vectors["shakespearefacts"]);
Console.WriteLine($"Cosine distance between query and shakespearefacts: {distance_sp}");

double distance_100_25 = Distance.Cosine(vectors["query"], vectors["distractedGrounding_100_25"]);
Console.WriteLine($"Cosine distance between query and distractedGrounding_100_25: {distance_100_25}");

double distance_100_50 = Distance.Cosine(vectors["query"], vectors["distractedGrounding_100_50"]);
Console.WriteLine($"Cosine distance between query and distractedGrounding_100_50: {distance_100_50}");

double distance_100_75 = Distance.Cosine(vectors["query"], vectors["distractedGrounding_100_75"]);
Console.WriteLine($"Cosine distance between query and distractedGrounding_100_75: {distance_100_75}");

double distance_100_100_aci = Distance.Cosine(vectors["query"], vectors["distractedGrounding_100_100_aci"]);
Console.WriteLine($"Cosine distance between query and distractedGrounding_100_100_aci: {distance_100_100_aci}");

double distance_25_100 = Distance.Cosine(vectors["query"], vectors["distractedGrounding_25_100"]);
Console.WriteLine($"Cosine distance between query and distractedGrounding_25_100: {distance_25_100}"); 

double distance_50_100 = Distance.Cosine(vectors["query"], vectors["distractedGrounding_50_100"]);
Console.WriteLine($"Cosine distance between query and distractedGrounding_50_100: {distance_50_100}");

double distance_75_100 = Distance.Cosine(vectors["query"], vectors["distractedGrounding_75_100"]);
Console.WriteLine($"Cosine distance between query and distractedGrounding_75_100: {distance_75_100}");

double distance_100_100_sp = Distance.Cosine(vectors["query"], vectors["distractedGrounding_100_100_sp"]);
Console.WriteLine($"Cosine distance between query and distractedGrounding_100_100_sp: {distance_100_100_sp}");

double distance_aci_aks = Distance.Cosine(vectors["query"], vectors["distracted_aci_aks"]);
Console.WriteLine($"Cosine distance between query and distracted_aci_aks: {distance_aci_aks}");

double distance_aks_aci = Distance.Cosine(vectors["query"], vectors["distracted_aks_aci"]);
Console.WriteLine($"Cosine distance between query and distracted_aks_aci: {distance_aks_aci} \n\n");


//Compare results
Console.WriteLine($"Compare results:");
Console.WriteLine($"Deviation: non matching facts: \t {Math.Round(Math.Abs(distance_aci - distance_sp) / distance_aci * 100, 4)} %");
Console.WriteLine($"Deviation: similar facts 'AKS': \t {Math.Round(Math.Abs(distance_aci - distance_aks) / distance_aci * 100, 4)} % \n");

Console.WriteLine($"Distracting data after and before facts");
Console.WriteLine($"Deviation:25 % 'distracting data' after facts: \t {Math.Round(Math.Abs(distance_aci - distance_100_25) / distance_aci * 100, 4)} %");
Console.WriteLine($"Deviation:25 % 'distracting data' before facts: \t {Math.Round(Math.Abs(distance_aci - distance_25_100) / distance_aci * 100, 4)} % \n");

Console.WriteLine($"Deviation:50 % 'distracting data' after facts: \t {Math.Round(Math.Abs(distance_aci - distance_100_50) / distance_aci * 100, 4)} %");
Console.WriteLine($"Deviation:50 % 'distracting data' before facts: \t {Math.Round(Math.Abs(distance_aci - distance_50_100) / distance_aci * 100, 4)} % \n");

Console.WriteLine($"Deviation:75 % 'distracting data' after facts: \t {Math.Round(Math.Abs(distance_aci - distance_100_75) / distance_aci * 100, 4)} %");
Console.WriteLine($"Deviation:75 % 'distracting data' before facts: \t {Math.Round(Math.Abs(distance_aci - distance_75_100) / distance_aci * 100, 4)} %  \n");

Console.WriteLine($"Deviation:100 % 'distracting data' after facts: \t {Math.Round(Math.Abs(distance_aci - distance_100_100_aci) / distance_aci * 100, 4)} %");
Console.WriteLine($"Deviation:100 % 'distracting data' before facts: \t {Math.Round(Math.Abs(distance_aci - distance_100_100_sp) / distance_aci * 100, 4)} %  \n");
```




### Results
```
Cosine distance:
Cosine distance between query and acifacts: 0.1867628581649723
Cosine distance between query and aksfacts: 0.2501981790410386
Cosine distance between query and shakespearefacts: 0.36902545449461466
Cosine distance between query and distractedGrounding_100_25: 0.18919742913467197
Cosine distance between query and distractedGrounding_100_50: 0.1942534476349791
Cosine distance between query and distractedGrounding_100_75: 0.19264550879180653
Cosine distance between query and distractedGrounding_100_100_aci: 0.18931963697269438
Cosine distance between query and distractedGrounding_25_100: 0.21192521407481446
Cosine distance between query and distractedGrounding_50_100: 0.22273735568962805
Cosine distance between query and distractedGrounding_75_100: 0.22685229402738571
Cosine distance between query and distractedGrounding_100_100_sp: 0.22190080773419452
Cosine distance between query and distracted_aci_aks: 0.19082755856911848
Cosine distance between query and distracted_aks_aci: 0.20782505559110165 


Compare results:
Deviation: non matching facts: 	 97.5904 %
Deviation: similar facts 'AKS':	 33.9657 % 

Distracting data after and before facts
Deviation:25 % 'distracting data' after facts: 	 1.3036 %
Deviation:25 % 'distracting data' before facts:  13.4729 % 

Deviation:50 % 'distracting data' after facts: 	 4.0107 %
Deviation:50 % 'distracting data' before facts:  19.2621 % 

Deviation:75 % 'distracting data' after facts: 	 3.1498 %
Deviation:75 % 'distracting data' before facts:  21.4654 %  

Deviation:100 % 'distracting data' after facts:  1.369 %
Deviation:100 % 'distracting data' before facts: 18.8142 % 
```
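
Each deviation is the relative change of a distance against the baseline distance between the query and `acifacts`: |d_aci - d_x| / d_aci * 100, rounded to four decimals. The sketch below recomputes a few of the figures from the distances listed above. `Deviation`, `baseline` and `distances` are hypothetical names introduced for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: relative deviation from the baseline distance, in percent.
static double Deviation(double baseline, double distance) =>
    Math.Round(Math.Abs(baseline - distance) / baseline * 100, 4);

// Baseline: cosine distance between the query and acifacts (taken from the results above).
double baseline = 0.1867628581649723;

// A subset of the distances reported in the results above.
var distances = new Dictionary<string, double>
{
    ["aksfacts"] = 0.2501981790410386,
    ["shakespearefacts"] = 0.36902545449461466,
    ["distractedGrounding_100_25"] = 0.18919742913467197,
    ["distractedGrounding_25_100"] = 0.21192521407481446,
};

foreach (var entry in distances)
{
    Console.WriteLine($"{entry.Key}: {Deviation(baseline, entry.Value)} %");
}
```

In this run, appending distracting data after the facts moved the distance by roughly 1 to 4 %, while placing the same data before the facts moved it by roughly 13 to 21 %.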