# Lab 1: Build a Similarity Search for YouTube Transcripts

In this notebook, you will use the AzureOpenAI client to get the text embedding of a query string and compare it, using cosine similarity, against the transcripts from the [Boston Azure YouTube channel](https://www.youtube.com/bostonazure) to find the videos with the highest similarity.

## Learning Objectives

* Load the variables in the .env file
* Connect to AzureOpenAI in C#
* Load the transcript file
* Calculate the similarity between each transcript chunk's embedding and the query text's embedding
* Output the most similar videos with a URL formatted to navigate to the 5-minute section that was found most similar

### Step 1: Load environment variables and create the AzureOpenAI client


Add references to the NuGet packages we are going to use, plus `using` directives to simplify our code:

In [None]:
#r "nuget: Azure.AI.OpenAI, *-*"
#r "nuget: Azure, *-*"
#r "nuget: dotenv.net, *-*"
#r "nuget: Microsoft.DotNet.Interactive.AIUtilities, *-*"

using Microsoft.DotNet.Interactive;
using Microsoft.DotNet.Interactive.AIUtilities;
using dotenv.net;
using Azure.AI.OpenAI;
using Azure;
using System;
using System.Text.Json;
using System.Text.Json.Serialization;
using System.IO;

Load the environment variables in the .env file and configure OpenAIClient:

In [None]:
DotEnv.Load();

var envVars = DotEnv.Read();

OpenAIClient client = new(new Uri(envVars["AZURE_OPENAI_ENDPOINT"]), 
    new AzureKeyCredential(envVars["AZURE_OPENAI_API_KEY"]));

var model = envVars["AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT"];
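
Indexing `envVars` with a key that is not in the .env file fails with an exception that does not say which setting to fix. The sketch below shows one way to check for missing settings up front. It assumes `DotEnv.Read()` returns a dictionary of strings, and it uses a stand-in dictionary with placeholder values so it runs without a .env file. `FindMissingSettings` is a hypothetical helper, not part of dotenv.net.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: returns the names of required settings that are absent or blank.
static string[] FindMissingSettings(IDictionary<string, string> settings, params string[] required) =>
    required
        .Where(name => !settings.TryGetValue(name, out var value) || string.IsNullOrWhiteSpace(value))
        .ToArray();

// Stand-in for the dictionary read from the .env file; the values are placeholders.
var sampleSettings = new Dictionary<string, string>
{
    ["AZURE_OPENAI_ENDPOINT"] = "https://example.invalid/",
    ["AZURE_OPENAI_API_KEY"] = "",
};

var missing = FindMissingSettings(sampleSettings,
    "AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY", "AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT");

Console.WriteLine(missing.Length == 0
    ? "All settings present."
    : $"Missing settings: {string.Join(", ", missing)}");
```

In the notebook, the same check could be run against `envVars` before constructing the client.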

### Step 2: Set the similarity score threshold and the transcript file to use

In [None]:
var SIMILARITIES_RESULTS_THRESHOLD = 0.70;
var DATASET_NAME = "./prep/output/master_enriched.json";
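
To make the role of a threshold concrete, here is a minimal sketch of how a cutoff like this can be combined with a top-N selection in LINQ. The titles and scores are made up purely for illustration, and the local names (`threshold`, `rows`, `scored`) are assumptions for this sketch rather than part of the lab code.

```csharp
using System;
using System.Linq;

const double threshold = 0.70; // same value as SIMILARITIES_RESULTS_THRESHOLD
const int rows = 2;            // how many results to keep

// Made-up (title, score) pairs, for illustration only.
var scored = new (string Title, float Score)[]
{
    ("Video A", 0.82f),
    ("Video B", 0.65f),
    ("Video C", 0.74f),
    ("Video D", 0.71f),
};

// Drop anything below the threshold, then keep the highest-scoring rows.
var top = scored
    .Where(s => s.Score >= threshold)
    .OrderByDescending(s => s.Score)
    .Take(rows)
    .ToArray();

foreach (var (title, score) in top)
    Console.WriteLine($"{title}: {score:F2}");
```

A higher threshold returns fewer but closer matches. A lower one returns more results, some of them only loosely related to the query.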

### Step 3: Create some utility methods

Create a couple of data structures: `DataRow` holds one transcript chunk together with its embedding, and `DataRow2` holds a search result together with its similarity score:

In [None]:
public record DataRow(string speaker, string title, string videoId, string description, string start, int seconds, string text, float[] ada_v2);
public record DataRow2(string speaker, string title, string videoId, string description, string start, int seconds, float similarity);
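
The record parameter names are lowercase because, by default, `System.Text.Json` matches JSON property names to the record's properties exactly, including case. The sketch below illustrates this with a single made-up entry and a cut-down, hypothetical `SampleRow` record. The values are placeholders, and a real embedding has far more than three numbers.

```csharp
using System;
using System.Text.Json;

// One made-up entry shaped like a subset of the transcript file's fields.
var json = @"[
  {
    ""title"": ""Example talk"",
    ""videoId"": ""abc123"",
    ""seconds"": 300,
    ""ada_v2"": [0.1, 0.2, 0.3]
  }
]";

// Deserialize returns null for a JSON null document, so fall back to an empty array.
var rows = JsonSerializer.Deserialize<SampleRow[]>(json) ?? Array.Empty<SampleRow>();

Console.WriteLine($"{rows[0].title} starts at {rows[0].seconds}s with {rows[0].ada_v2.Length} embedding values");

// Hypothetical cut-down version of DataRow, used only in this sketch.
public record SampleRow(string title, string videoId, int seconds, float[] ada_v2);
```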

Create utility methods:

In [None]:

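// Read the transcript JSON file and deserialize it into an array of DataRow records
Func<string, DataRow[]> LoadDataset = (string source) => {

    var transcripts = JsonSerializer.Deserialize<DataRow[]>(
        File.ReadAllText(source));
    // Deserialize returns null for a JSON null document; fall back to an empty array
    return transcripts ?? Array.Empty<DataRow>();
};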

Func<string, DataRow[], int, Task<DataRow2[]>> GetVideos = async (string query, DataRow[] data, int rows) => {
    
    // get the embedding for the query
    var response = await client.GetEmbeddingsAsync(new EmbeddingsOptions(model, new []{ query }));
    var queryEmbeddings = response.Value.Data[0].Embedding.ToArray();
    
    // score every transcript chunk against the query with CosineSimilarityComparer,
    // drop scores below SIMILARITIES_RESULTS_THRESHOLD, and keep the top `rows` matches
    var result = data
        .ScoreBySimilarityTo(queryEmbeddings, new CosineSimilarityComparer<float[]>(t => t), e => e.ada_v2)
        .Where(s => s.Score >= SIMILARITIES_RESULTS_THRESHOLD)
        .OrderByDescending(s => s.Score)
        .Take(rows)
        .Select(r => new DataRow2(r.Value.speaker, r.Value.title, r.Value.videoId, r.Value.description, r.Value.start, r.Value.seconds, r.Score));
    
    return result.ToArray();
};

Action<DataRow2[], string> DisplayResults = (DataRow2[] data, string query) => {
    Console.WriteLine($"\nVideos similar to '{query}':");
    Console.WriteLine("");
    foreach (var row in data)
    {
        var youtube_url = $"https://youtu.be/{row.videoId}?t={row.seconds}";
        Console.WriteLine($" - {row.title}");
        Console.WriteLine($"   YouTube: {youtube_url}");
        Console.WriteLine($"   Similarity: {row.similarity}");
    }
    Console.WriteLine("");
};
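
For intuition about the score itself: cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures how closely they point in the same direction, regardless of magnitude. The following is a minimal hand-written sketch of that formula, not the implementation used by `CosineSimilarityComparer`. Returning 0 for a zero-length vector is a convention chosen for this sketch.

```csharp
using System;

// Plain cosine similarity: dot(a, b) / (|a| * |b|).
static float CosineSimilarity(float[] a, float[] b)
{
    if (a.Length != b.Length)
        throw new ArgumentException("Vectors must have the same length.");

    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }

    // The similarity is undefined for a zero vector; this sketch returns 0 in that case.
    if (normA == 0 || normB == 0) return 0f;

    return (float)(dot / (Math.Sqrt(normA) * Math.Sqrt(normB)));
}

// Vectors pointing the same way score close to 1; perpendicular vectors score 0.
var same = CosineSimilarity(new[] { 1f, 2f, 3f }, new[] { 2f, 4f, 6f });
var orthogonal = CosineSimilarity(new[] { 1f, 0f }, new[] { 0f, 1f });

Console.WriteLine($"parallel: {same}, orthogonal: {orthogonal}");
```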


### Step 4: Load the transcript file (and take a look at what is in it)

In [None]:
var transcriptData = LoadDataset(DATASET_NAME);

Notice that this version of the transcript file already contains the embeddings for each 5-minute chunk of transcript text. The embeddings are in the `ada_v2` column.

In [None]:
transcriptData.Take(4).DisplayTable();

### Step 5: Try it out

I've put some default text in for a good example, but you should change the query to your own search and see what comes back.

In [None]:
var query = "What is RAG";

var videos = await GetVideos(query, transcriptData, 5);
DisplayResults(videos, query);