# Detailed Explanation of the Python Semantic Search Script
This script is a command-line tool designed to perform a semantic search on a dataset of videos. Instead of searching for exact keywords, it uses a pre-trained machine learning model to find videos that are conceptually or semantically similar to a user's query.

## How It Works, Function by Function

### Overall Purpose
The program's main loop prompts the user for a text query. It then uses the Azure OpenAI API to convert that query into a numerical representation called a vector embedding. It compares this query vector to pre-computed vector embeddings for a collection of videos, finds the most similar ones, and displays their details.

### load_dataset(source: str) -> pd.core.frame.DataFrame

This function is responsible for ingesting your data.

- It reads a JSON file (in this case, embedding_index_3m.json) that contains a list of video records. Judging by the fields the rest of the script reads, each record carries metadata (title, videoId, seconds, summary, speaker) and a pre-calculated embedding vector stored under the ada_v2 key.

- pd.read_json(source) reads the file into a pandas DataFrame, a powerful data structure for tabular data.

- .drop(columns=["text"], errors="ignore") removes a column named "text" if it exists. This is likely the original source text used to generate the embeddings, and it's no longer needed for the search.

- .fillna("") ensures that any missing values in the DataFrame are replaced with an empty string, preventing potential errors later in the code.

### cosine_similarity(a, b)

This is the core mathematical function for measuring similarity between two vectors.

- What is Cosine Similarity? It measures the cosine of the angle between two vectors in a multidimensional space. A score of 1 means the vectors point in exactly the same direction, a score of 0 means they are orthogonal, and a score of -1 means they point in opposite directions. For embeddings, a higher score is commonly taken to indicate that two pieces of text are more closely related in meaning.

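- The function first checks whether the input vectors a and b have different lengths. If so, it pads the shorter vector with zeros so the dot product can be computed. Padding only prevents a shape error: vectors of different lengths normally come from different embedding models, and a score between them is not meaningful, so the query and the dataset should be embedded with the same model.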

- It then calculates the dot product of the two vectors and divides it by the product of their magnitudes (norms), which gives you the cosine of the angle between them.
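
To make the three reference scores concrete, here is a small sketch using 2-dimensional vectors. The function name cosine is an illustrative stand-in for the script's cosine_similarity, and it leaves out the zero-padding step because all inputs have the same length.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector has zero length."""
    norm_product = np.linalg.norm(a) * np.linalg.norm(b)
    if norm_product == 0:
        return 0.0
    return float(np.dot(a, b) / norm_product)


x = np.array([1.0, 0.0])
print(cosine(x, np.array([2.0, 0.0])))   # same direction -> 1.0
print(cosine(x, np.array([0.0, 3.0])))   # orthogonal -> 0.0
print(cosine(x, np.array([-1.0, 0.0])))  # opposite direction -> -1.0
```

Note that the second vector in the first call is twice as long as x and still scores 1.0: cosine similarity depends on direction only, not magnitude.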

### get_videos(query: str, dataset: pd.core.frame.DataFrame, rows: int) -> pd.core.frame.DataFrame

This is the main search function.

- It takes a user's query, your dataset DataFrame, and the number of rows to return.

- client.embeddings.create(input=query, model=model) is the crucial API call. It sends the user's query to the Azure OpenAI service and receives a vector embedding in return.

- video_vectors["similarity"] = ... calculates the similarity score for every video in the dataset. It applies the cosine_similarity function to the query_embeddings and each video's pre-calculated ada_v2 embedding.

- mask = video_vectors["similarity"] >= SIMILARITIES_RESULTS_THRESHOLD filters the dataset, keeping only the videos whose similarity score is at or above the threshold (0.75 in this script). This helps to eliminate irrelevant results.

- .sort_values(by="similarity", ascending=False) sorts the filtered videos from most similar to least similar.

- .head(rows) returns the top N results, where N is the value of the rows parameter.
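
The score, filter, sort, and truncate steps can be exercised without calling the API. The sketch below assumes a hand-written query vector and toy 2-dimensional embeddings; the function name rank_by_similarity and the toy data are hypothetical, and the zero-vector guard from the script is omitted for brevity.

```python
import numpy as np
import pandas as pd

THRESHOLD = 0.75  # same value as SIMILARITIES_RESULTS_THRESHOLD in the script


def rank_by_similarity(query_vector, dataset: pd.DataFrame, rows: int) -> pd.DataFrame:
    """Score each row against query_vector, keep scores >= THRESHOLD, return the top rows."""
    q = np.asarray(query_vector, dtype=float)
    scored = dataset.copy()  # leave the caller's DataFrame untouched
    scored["similarity"] = scored["ada_v2"].apply(
        lambda v: float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
    )
    kept = scored[scored["similarity"] >= THRESHOLD]
    return kept.sort_values(by="similarity", ascending=False).head(rows)


# Toy 2-dimensional "embeddings"; real embedding vectors are much longer.
toy = pd.DataFrame(
    {
        "title": ["close match", "exact match", "unrelated"],
        "ada_v2": [[0.9, 0.1], [1.0, 0.0], [0.0, 1.0]],
    }
)
top = rank_by_similarity([1.0, 0.0], toy, rows=2)
print(top[["title", "similarity"]])
```

The "unrelated" row scores 0 and is removed by the threshold before sorting, so only the two matching rows are returned, best first.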

### display_results(videos: pd.core.frame.DataFrame, query: str)

This function handles the output presentation.

- It takes the DataFrame of top videos and the original query.

- It iterates through each video in the DataFrame.

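- For each video, it prints the title, the first 15 words of the summary followed by an ellipsis, a YouTube URL that opens at the matching timestamp, the similarity score formatted to four decimal places, and the speakers. If the DataFrame is empty, it prints a message saying that no videos met the similarity threshold.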
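
The two formatting steps, the timestamped URL and the truncated summary, are shown in isolation below. This is a sketch: youtube_url mirrors the script's nested helper, while summary_snippet is a hypothetical name for logic the script writes inline.

```python
def youtube_url(video_id: str, seconds: int) -> str:
    """Build a youtu.be link that starts playback at the given offset in seconds."""
    return f"https://youtu.be/{video_id}?t={seconds}"


def summary_snippet(summary: str, max_words: int = 15) -> str:
    """Return the first max_words words of summary followed by an ellipsis."""
    return " ".join(summary.split()[:max_words]) + "..."


print(youtube_url("NyWOfYKScUk", 553))
print(summary_snippet("one two three four five", max_words=3))
```

As in the script, the ellipsis is appended even when the summary is shorter than the word limit.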

### Main Execution Block

The code at the bottom of the script handles the program's lifecycle.

- It first loads the dataset using load_dataset(DATASET_NAME).

- It then enters an infinite while True: loop to continuously prompt the user for a new query.

- The loop breaks if the user types "exit" (the comparison ignores letter case).

- For each query, it calls get_videos to perform the search and display_results to print the output.
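
The control flow of the loop can be sketched without interactive input. The example below assumes the queries arrive as a list instead of through input(), and uses a hypothetical fake_search function in place of get_videos; neither name appears in the original script.

```python
from typing import Callable, Iterable, List


def run_queries(
    queries: Iterable[str], search: Callable[[str], List[str]]
) -> List[List[str]]:
    """Run search for each query until 'exit' is seen; return the results per query."""
    all_results = []
    for query in queries:
        if query.lower() == "exit":  # same case-insensitive check as the script
            break
        all_results.append(search(query))
    return all_results


def fake_search(query: str) -> List[str]:
    """Hypothetical stand-in for get_videos: one fake title per query."""
    return [f"result for {query}"]


print(run_queries(["jupyter", "EXIT", "never reached"], fake_search))
```

Passing the search function in as a parameter is a design choice made here for testability; the script calls get_videos and display_results directly.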

### Security for a Public GitHub Repository

The script keeps the API key out of the source code, which is the standard way to handle secrets in a shared repository.

- from dotenv import load_dotenv and load_dotenv() correctly load your key from a local .env file.

- os.environ['AZURE_OPENAI_API_KEY'] retrieves this key from the environment variables, meaning it is never hardcoded in the script itself.

This is a critical best practice. Add .env to your .gitignore so the file is never committed: as long as it stays out of your GitHub repository, the key does not appear in the published code and the script can be shared publicly.
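
One way to check the configuration before any API call is to list which required variables are missing. The sketch below assumes the two variable names the script reads; the helper missing_vars is hypothetical and takes the environment as a parameter so it can be tried with a plain dictionary.

```python
import os
from typing import List, Mapping, Sequence

# The two variables the script reads from the environment.
REQUIRED_VARS = ("AZURE_OPENAI_API_KEY", "AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT")


def missing_vars(
    env: Mapping[str, str], required: Sequence[str] = REQUIRED_VARS
) -> List[str]:
    """Return the names in required that are absent or empty in env."""
    return [name for name in required if not env.get(name)]


# A hypothetical environment in which only the key is set.
print(missing_vars({"AZURE_OPENAI_API_KEY": "placeholder"}))
# The real process environment (in the script, checked after load_dotenv()).
print(missing_vars(os.environ))
```

Reporting every missing name at once can be friendlier than failing on the first KeyError, though the script's try/except approach works as well.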

In [3]:
import os
import pandas as pd
import numpy as np
from openai import AzureOpenAI
from dotenv import load_dotenv

# Load environment variables from a .env file.
# This is a critical security practice for not exposing API keys
# and other sensitive information in a public repository.
load_dotenv()

# --- Configuration ---
# Get API key, model deployment name, and endpoint from environment variables.
# This makes the code secure and flexible.
try:
    api_key = os.environ['AZURE_OPENAI_API_KEY']
    # The deployment name is the name of the model you deployed in the Azure Portal.
    model = os.environ['AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT']
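    # The endpoint is not read here. When azure_endpoint is not passed to the
    # client, AzureOpenAI falls back to the AZURE_OPENAI_ENDPOINT environment
    # variable, so that variable must also be set in your .env file.
    # To pass it explicitly instead:
    # endpoint = os.environ['AZURE_OPENAI_ENDPOINT']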
except KeyError as e:
    print(f"Error: Missing environment variable {e}. Please check your .env file.")
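    # Exit with a non-zero status if a required environment variable is not set.
    # SystemExit is used instead of exit(), which is meant for interactive sessions.
    raise SystemExit(1)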

# Initialize the AzureOpenAI client with the API key.
# The client is the main interface for communicating with the Azure OpenAI service.
client = AzureOpenAI(
    api_key=api_key,
    api_version="2023-05-15"
)

# --- Constants ---
# Similarity threshold for filtering results. Only videos with a score at or
# above this value are kept. 0.75 is a starting point; tune it for your data.
SIMILARITIES_RESULTS_THRESHOLD = 0.75

# The name of the local JSON dataset file containing video metadata and embeddings.
# It is assumed this file is in the same directory as the script.
DATASET_NAME = "embedding_index_3m.json"


def load_dataset(source: str) -> pd.core.frame.DataFrame:
    """
    Loads the video session index from a JSON file into a pandas DataFrame.
    
    Args:
        source (str): The path to the JSON dataset file.
    
    Returns:
        pd.core.frame.DataFrame: The loaded and pre-processed DataFrame.
                                 Returns an empty DataFrame if the file is not found.
    """
    try:
        # Read the JSON file into a DataFrame.
        pd_vectors = pd.read_json(source)
        # Drop the original text column, as we only need the embeddings for search.
        # errors="ignore" prevents the code from failing if the column doesn't exist.
        return pd_vectors.drop(columns=["text"], errors="ignore").fillna("")
    except FileNotFoundError:
        print(f"Error: Dataset file not found at '{source}'. Please ensure the file exists.")
        return pd.DataFrame()


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """
    Calculates the cosine similarity between two numpy arrays.
    
    This function measures the conceptual similarity between two embeddings.
    A score of 1 means perfect similarity, 0 means no similarity.
    
    Args:
        a (np.ndarray): The first vector (e.g., the query embedding).
        b (np.ndarray): The second vector (e.g., a video's embedding).
        
    Returns:
        float: The calculated cosine similarity score.
    """
    # Ensure both vectors have the same length by padding the shorter one with zeros.
    if len(a) > len(b):
        b = np.pad(b, (0, len(a) - len(b)), 'constant')
    elif len(b) > len(a):
        a = np.pad(a, (0, len(b) - len(a)), 'constant')
    
    # Calculate the dot product.
    dot_product = np.dot(a, b)
    # Calculate the norms (magnitudes) of the vectors.
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)

    # Avoid division by zero in case of an empty vector.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    
    # Convert the NumPy scalar to a plain float to match the return annotation.
    return float(dot_product / (norm_a * norm_b))


def get_videos(
    query: str, dataset: pd.core.frame.DataFrame, rows: int
) -> pd.core.frame.DataFrame:
    """
    Performs a semantic search on the video dataset using an embeddings model.
    
    Args:
        query (str): The search query from the user.
        dataset (pd.core.frame.DataFrame): The DataFrame containing video data and embeddings.
        rows (int): The number of top results to return.
        
    Returns:
        pd.core.frame.DataFrame: A DataFrame of the top `rows` most similar videos.
                                 Returns an empty DataFrame on error.
    """
    # Create a copy of the dataset to avoid modifying the original DataFrame.
    video_vectors = dataset.copy()

    try:
        # Get the embeddings for the user's query from the Azure OpenAI API.
        # This is where the query is transformed into a vector.
        query_embeddings = client.embeddings.create(
            input=query, 
            model=model
        ).data[0].embedding
    except Exception as e:
        print(f"An error occurred while getting embeddings from Azure: {e}")
        return pd.DataFrame()

    # Convert the query embedding to a NumPy array once, then apply the cosine
    # similarity function to each row's embedding in the DataFrame.
    # This calculates the similarity score between the query and every video.
    query_vector = np.array(query_embeddings)
    video_vectors["similarity"] = video_vectors["ada_v2"].apply(
        lambda x: cosine_similarity(query_vector, np.array(x))
    )

    # Create a boolean mask to filter out videos with a low similarity score.
    mask = video_vectors["similarity"] >= SIMILARITIES_RESULTS_THRESHOLD
    video_vectors = video_vectors[mask].copy()

    # Sort the videos by their similarity score in descending order.
    video_vectors = video_vectors.sort_values(by="similarity", ascending=False)

    # Return the top N videos.
    return video_vectors.head(rows)


def display_results(videos: pd.core.frame.DataFrame, query: str):
    """
    Prints the search results in a user-friendly format to the console.
    
    Args:
        videos (pd.core.frame.DataFrame): The DataFrame of video results.
        query (str): The original search query.
    """
    def _gen_yt_url(video_id: str, seconds: int) -> str:
        """Helper function to generate a YouTube URL with a timestamp."""
        return f"https://youtu.be/{video_id}?t={seconds}"

    print(f"\nVideos similar to '{query}':")
    # Check if any videos were found after filtering.
    if videos.empty:
        print(" - No videos found with a high enough similarity score.")
        return

    # Loop through the top videos and print their details.
    for _, row in videos.iterrows():
        youtube_url = _gen_yt_url(row["videoId"], row["seconds"])
        print(f" - {row['title']}")
        print(f"   Summary: {' '.join(row['summary'].split()[:15])}...")
        print(f"   YouTube: {youtube_url}")
        # Format the similarity score to 4 decimal places for readability.
        print(f"   Similarity: {row['similarity']:.4f}")
        print(f"   Speakers: {row['speaker']}")


# --- Main Execution Block ---
if __name__ == "__main__":
    # Load the dataset once at the start of the program.
    pd_vectors = load_dataset(DATASET_NAME)

    # If the dataset failed to load (or is empty), exit with a non-zero status.
    if pd_vectors.empty:
        raise SystemExit(1)

    # Start a loop to get user queries.
    while True:
        # Prompt the user for input.
        query = input("Enter a query (or type 'exit' to quit): ")
        
        # Check for the exit command.
        if query.strip().lower() == "exit":
            break
        
        # Perform the search and display the top 5 results.
        videos = get_videos(query, pd_vectors, 5)
        display_results(videos, query)



Videos similar to 'rstudio and notebook':
 - Reproducible Data Science with Machine Learning
   Summary: A separate team is responsible for deploying machine learning models, which aligns well with the...
   YouTube: https://youtu.be/NyWOfYKScUk?t=553
   Similarity: 0.8509
   Speakers: Rafal Lukawiecki
 - Reproducible Data Science with Machine Learning
   Summary: In this video, Rafal Lukawiecki discusses reproducible data science with machine learning. He demonstrates how...
   YouTube: https://youtu.be/NyWOfYKScUk?t=1289
   Similarity: 0.8477
   Speakers: Rafal Lukawiecki
 - Edit and run Jupyter notebooks without leaving Azure Machine Learning studio
   Summary: The video demonstrates the features of Studio Notebooks in Azure Machine Learning. Users can easily...
   YouTube: https://youtu.be/AAj-Fz0uCNk?t=184
   Similarity: 0.8435
   Speakers: Abe Omorogbe
 - Using R with Azure Machine Learning
   Summary: The video demonstrates how to use Visual Studio Code (VS Code) for an interac