# Build a Similarity Search for YouTube Transcripts

In this notebook, you will use the AzureOpenAI client to get the text embedding of a query string and compare it, using cosine similarity, against the transcript embeddings from the [Boston Azure YouTube channel](https://www.youtube.com/bostonazure) to find the videos that best match the query.

## Learning Objectives

* Load the variables in the .env file
* Connect to AzureOpenAI in Python
* Load the transcript file and create a pandas DataFrame
* Calculate the similarity of each transcript segment's embedding to the query's embedding
* Output the most similar videos, each with a URL that jumps to the 5-minute section found most similar


## Similarity

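Similarity here means how close two pieces of text are in meaning. Each piece of text is converted into an embedding, which is a vector of numbers, and this notebook compares embeddings with cosine similarity, the cosine of the angle between two vectors. A score closer to 1 means the vectors point in nearly the same direction, which is treated as the texts being more alike.

TODO: Add some references to learn more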
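
A minimal sketch of the calculation, using made-up three-dimensional vectors rather than real embeddings: vectors pointing the same way score about 1, perpendicular vectors about 0, and opposite vectors about -1. The function mirrors the `cosine_similarity` helper defined in Step 3.

```python
import numpy as np

def cosine_similarity(a, b):
    # dot product divided by the product of the two vector lengths
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy vectors, invented for illustration only
same_direction = cosine_similarity(np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0]))
orthogonal = cosine_similarity(np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]))
opposite = cosine_similarity(np.array([1.0, 2.0, 3.0]), np.array([-1.0, -2.0, -3.0]))

print(same_direction, orthogonal, opposite)
```

Note that the length of a vector does not affect the score: `[1, 2, 3]` and `[2, 4, 6]` count as the same direction.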

### Step 1: Load Environment Variables and Create the AzureOpenAI client


In [1]:
import os
import pandas as pd
import numpy as np
from openai import AzureOpenAI
from dotenv import load_dotenv

load_dotenv()

client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2024-02-01",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
)

model = os.getenv("AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT")
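
`os.getenv` returns `None` for a variable that is not set, so it can help to check the settings explicitly and report which one is missing. The following is a sketch of such a check; the helper name `missing_settings` is hypothetical, and the variable names are the three read in the cell above.

```python
import os

REQUIRED_SETTINGS = [
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT",
]

def missing_settings(names, env=os.environ):
    """Return the names that are unset or empty in the given environment mapping."""
    return [name for name in names if not env.get(name)]

missing = missing_settings(REQUIRED_SETTINGS)
if missing:
    print("Missing settings:", ", ".join(missing))
else:
    print("All settings found.")
```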

### Step 2: Set the similarity threshold and the transcript file to use

In [2]:
SIMILARITIES_RESULTS_THRESHOLD = 0.70
DATASET_NAME = "./prep/output/master_enriched.json"
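
The threshold acts as a cutoff: rows scoring below it are dropped before the top results are chosen. The sketch below shows how the number of surviving rows changes with the cutoff. The scores are rounded values from this notebook's sample run, reused purely as illustrative data, and the helper name `count_at_or_above` is hypothetical.

```python
import pandas as pd

# Rounded scores from this notebook's sample run, used only as illustrative data
scores = pd.Series([0.742, 0.723, 0.721, 0.718, 0.714])

def count_at_or_above(values: pd.Series, threshold: float) -> int:
    """Count how many scores pass a given cutoff."""
    return int((values >= threshold).sum())

for threshold in (0.70, 0.72, 0.75):
    print(threshold, count_at_or_above(scores, threshold))
```

A higher threshold returns fewer but closer matches, and a threshold that is too high can return nothing at all.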

### Step 3: Create some utility methods

In [9]:
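def load_dataset(source: str) -> pd.DataFrame:
    # read the JSON file of transcript segments and their embeddings
    pd_vectors = pd.read_json(source)
    # drop the raw transcript text if present and replace missing values with ""
    return pd_vectors.drop(columns=["text"], errors="ignore").fillna("")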

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def get_videos(query: str, dataset: pd.DataFrame, rows: int) -> pd.DataFrame:
    # create a copy of the dataset
    video_vectors = dataset.copy()

    # get the embeddings for the query    
    query_embeddings = client.embeddings.create(input=query, model=model).data[0].embedding

    # create a new column with the calculated similarity for each row
    video_vectors["similarity"] = video_vectors["ada_v2"].apply(
        lambda x: cosine_similarity(np.array(query_embeddings), np.array(x))
    )

    # filter the videos by similarity
    mask = video_vectors["similarity"] >= SIMILARITIES_RESULTS_THRESHOLD
    video_vectors = video_vectors[mask].copy()

    # sort the videos by similarity, best match first, and return the top rows
    return video_vectors.sort_values(by="similarity", ascending=False).head(rows)

def display_results(videos: pd.DataFrame, query: str):
    def _gen_yt_url(video_id: str, seconds: int) -> str:
        """Build a YouTube URL that starts playback at the given offset in seconds."""
        return f"https://youtu.be/{video_id}?t={seconds}"

    print(f"\nVideos similar to '{query}':")
    print()
    for _, row in videos.iterrows():
        youtube_url = _gen_yt_url(row["videoId"], row["seconds"])
        print(f" - {row['title']}")
        print(f"   YouTube: {youtube_url}")
        print(f"   Similarity: {row['similarity']}")
        print()
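
`get_videos` computes the similarity one row at a time with `apply`. For larger datasets, an alternative is to stack the embeddings into a matrix and score every row in one NumPy operation. The following is a sketch of that alternative, not the method this notebook uses; the function name `cosine_similarity_matrix` and the toy two-dimensional vectors are assumptions for illustration.

```python
import numpy as np

def cosine_similarity_matrix(query: np.ndarray, matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and each row of a matrix."""
    row_norms = np.linalg.norm(matrix, axis=1)
    return (matrix @ query) / (row_norms * np.linalg.norm(query))

query_vec = np.array([1.0, 0.0])
rows = np.array([
    [2.0, 0.0],  # same direction as the query
    [0.0, 3.0],  # perpendicular to the query
    [1.0, 1.0],  # 45 degrees from the query
])

scores = cosine_similarity_matrix(query_vec, rows)
print(scores)
```

With the notebook's data, the matrix could be built with `np.array(video_vectors["ada_v2"].tolist())`, assuming every row holds a list of the same length.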

### Step 4: Load the transcript file (and take a look at what is in it)

In [12]:
pd_vectors = load_dataset(DATASET_NAME)

In [13]:
pd_vectors

| | speaker | title | videoId | description | start | seconds | ada_v2 |
|---|---|---|---|---|---|---|---|
| 0 | | Map Azure DevOps Runtime Variables to Terrafor... | -ssTKjHVP_Q | This is a recording of the March 29, 2023 virt... | 00:00:02 | 2 | [-0.019308112561702003, -0.024012072011828003,... |
| 1 | | Map Azure DevOps Runtime Variables to Terrafor... | -ssTKjHVP_Q | This is a recording of the March 29, 2023 virt... | 00:05:04 | 304 | [-0.0006771119078620001, -0.007956171408295, 0... |
| 2 | | Map Azure DevOps Runtime Variables to Terrafor... | -ssTKjHVP_Q | This is a recording of the March 29, 2023 virt... | 00:10:07 | 607 | [-0.01619478687644, -0.020837383344769003, -0.... |
| 3 | | Map Azure DevOps Runtime Variables to Terrafor... | -ssTKjHVP_Q | This is a recording of the March 29, 2023 virt... | 00:15:10 | 910 | [-0.011464371345937, -0.032427001744508, -0.01... |
| 4 | | Map Azure DevOps Runtime Variables to Terrafor... | -ssTKjHVP_Q | This is a recording of the March 29, 2023 virt... | 00:20:16 | 1216 | [-0.015697304159402, -0.015205482952296002, 0.... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 147 | | Udai Ramachandran: Azure Front Door | vTLZ3GoZZvI | This is a recording of the September 14, 2021 ... | 00:55:35 | 3335 | [0.010421303100883001, 0.022980673238635, 0.00... |
| 148 | | Udai Ramachandran: Azure Front Door | vTLZ3GoZZvI | This is a recording of the September 14, 2021 ... | 01:00:38 | 3638 | [0.011888379231095002, 0.0041570011526340005, ... |
| 149 | | Udai Ramachandran: Azure Front Door | vTLZ3GoZZvI | This is a recording of the September 14, 2021 ... | 01:05:45 | 3945 | [0.015738856047391, 0.008334751240909, 0.02017... |
| 150 | | Udai Ramachandran: Azure Front Door | vTLZ3GoZZvI | This is a recording of the September 14, 2021 ... | 01:10:48 | 4248 | [0.010681172832846001, 0.005410764832049, -0.0... |
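
The `start` and `seconds` columns hold the same offset in two forms: for example, `00:05:04` corresponds to 304 seconds. If a dataset only had the `HH:MM:SS` form, a conversion like this sketch could produce the value needed for the `t=` URL parameter. The function name `hms_to_seconds` is hypothetical.

```python
def hms_to_seconds(timestamp: str) -> int:
    """Convert a time in HH:MM:SS format to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in timestamp.split(":"))
    return hours * 3600 + minutes * 60 + seconds

# Values taken from the start column shown above
for start in ("00:00:02", "00:05:04", "01:10:48"):
    print(start, hms_to_seconds(start))
```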


### Step 5: Try it out

The query below is a default example; change it to your own search and see what comes back.

In [14]:
query = "What is langchain?"

videos = get_videos(query, pd_vectors, 5)
display_results(videos, query)


Videos similar to 'What is langchain?':

 - Pamela Fox: Building a RAG app to chat with your data
   YouTube: https://youtu.be/3Zh9MEuyTQo?t=4260
   Similarity: 0.7424003538820976

 - Deploy Your GO API to Azure Functions
   YouTube: https://youtu.be/1NcnkU403UE?t=305
   Similarity: 0.7225723356406512

 - Deploy Your GO API to Azure Functions
   YouTube: https://youtu.be/1NcnkU403UE?t=2
   Similarity: 0.7213426954451995

 - Deploy Your GO API to Azure Functions
   YouTube: https://youtu.be/1NcnkU403UE?t=1819
   Similarity: 0.7176949841897059

 - Monitor Azure Resources with Kusto Query Language with Taiob Ali
   YouTube: https://youtu.be/6u-yWHNBCAg?t=2125
   Similarity: 0.7135936233681075
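
The same video can appear several times in the results because each row is a separate segment of a transcript. If one entry per video is preferred, a sketch like the following keeps only the best-scoring segment for each `videoId`. The helper name `best_segment_per_video` is hypothetical, and the small frame reuses rounded values from the output above.

```python
import pandas as pd

def best_segment_per_video(videos: pd.DataFrame) -> pd.DataFrame:
    """Keep the highest-similarity row for each videoId, best match first."""
    ranked = videos.sort_values(by="similarity", ascending=False)
    return ranked.drop_duplicates(subset="videoId", keep="first")

# Illustrative data: rounded values from the sample output above
sample = pd.DataFrame({
    "videoId": ["3Zh9MEuyTQo", "1NcnkU403UE", "1NcnkU403UE", "1NcnkU403UE", "6u-yWHNBCAg"],
    "seconds": [4260, 305, 2, 1819, 2125],
    "similarity": [0.742, 0.723, 0.721, 0.718, 0.714],
})

print(best_segment_per_video(sample))
```

Applied to the output of `get_videos`, this would run after the top rows are selected, so asking for more rows up front leaves more distinct videos once duplicates are removed.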





### Reference
This code is a modified version of this notebook: [ai-beginners-embeddings](https://github.com/gloveboxes/ai-beginners-embeddings/blob/main/main.ipynb)