# Vector embeddings with OpenAI

## Overview

This notebook generates text embeddings with Azure OpenAI, compares sentences using cosine similarity, and runs a simple vector search over movie titles.



## Setup

In [None]:
%pip install openai python-dotenv numpy pandas

## Setup OpenAI API

In [None]:
import os

import openai
from dotenv import load_dotenv

# Set up OpenAI client based on environment variables
load_dotenv()
AZURE_OPENAI_SERVICE = os.getenv("AZURE_OPENAI_ENDPOINT")
AZURE_OPENAI_ADA_DEPLOYMENT = os.getenv("AZURE_OPENAI_ADA_DEPLOYMENT")
AZURE_OPENAI_API_KEY = os.getenv("AZURE_OPENAI_API_KEY")
AZURE_OPENAI_API_VERSION = os.getenv("AZURE_OPENAI_API_VERSION")

# Authenticate with an API key
openai_client = openai.AzureOpenAI(
    api_version=AZURE_OPENAI_API_VERSION,
    azure_endpoint=AZURE_OPENAI_SERVICE,
    api_key=AZURE_OPENAI_API_KEY
)
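
If one of the variables above is missing from `.env`, `os.getenv` returns `None`, and the resulting error can be hard to trace back to the missing setting. A minimal sketch of an early check, assuming the four variable names used in the setup cell; the helper `missing_env_vars` is hypothetical and not part of the `openai` package:

```python
import os

# Variable names taken from the setup cell above.
REQUIRED_VARS = [
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_ADA_DEPLOYMENT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_API_VERSION",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in env (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing settings:", ", ".join(missing))
```

Running this right after `load_dotenv()` lists the missing names without printing any values.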


In [None]:
print(AZURE_OPENAI_SERVICE)

In [None]:
# Confirm the settings loaded without echoing the API key itself
print("API key set:", bool(os.getenv("AZURE_OPENAI_API_KEY")))
print(os.getenv("AZURE_OPENAI_ENDPOINT"))
print(os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME"))
print(os.getenv("AZURE_OPENAI_API_VERSION"))
print(os.getenv("AZURE_OPENAI_MODEL"))
print(os.getenv("AZURE_OPENAI_ADA_DEPLOYMENT"))

![vector1.jpeg](../Assets/images/vector1.jpeg)
![vector2.png](../Assets/images/vector2.png)
![vector3.png](../Assets/images/vector3.png)

## Vector representations

This section demonstrates the fundamental concept of turning text into numbers that computers can work with.

- We have a regular sentence that humans can read and understand  
- Computers can't naturally understand the meaning of text – they need numbers
- We send our sentence to OpenAI's embedding model
- Think of this like a "meaning translator"
- The AI reads the sentence and converts it into a mathematical representation
- We get back a list of 1,536 numbers (called a vector)
- These numbers capture the meaning of our sentence
- It's like a "fingerprint" for the sentence's meaning

It's like translating a sentence into a secret code of numbers that captures what the sentence means.


In [None]:
sentence = "A dog just walked past my house and yipped yipped like a Martian"

response = openai_client.embeddings.create(model=AZURE_OPENAI_ADA_DEPLOYMENT, input=sentence)

vector = response.data[0].embedding

In [None]:
vector

In [None]:
len(vector)

### Document similarity modeled as cosine similarity

The next cell shows how embeddings can measure how similar two sentences are in meaning, even when they use different words.

1. Cosine Similarity Function
- This calculates how "similar" two vectors are
- Returns a score between -1 and 1 (closer to 1 = more similar)
- Think of it like measuring the angle between two arrows in high-dimensional space

2. Test Sentences
- Three pairs to compare:
  - Pair 1: Identical sentences (should get perfect similarity)
  - Pair 2: Similar meaning, different words (should get high similarity)
  - Pair 3: Meaningful vs random text (should get low similarity)

3. Convert Text to Vectors
- Sends sentences to Azure OpenAI
- Gets back high-dimensional vectors (1,536 numbers) that represent the meaning
- Each sentence becomes a list of numbers that captures its semantic meaning

4. Compare and Display Results
- Compares each pair of sentences
- Shows the similarity score
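
To make the "angle between two arrows" picture concrete, here is a small sketch using made-up 2D vectors instead of real embeddings. The vector names are illustrative only; the function is the same formula used in the cell that follows.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


east = np.array([1.0, 0.0])
also_east = np.array([3.0, 0.0])  # same direction, different length
north = np.array([0.0, 1.0])      # perpendicular to east
west = np.array([-1.0, 0.0])      # opposite direction to east

print(cosine_similarity(east, also_east))  # 1.0
print(cosine_similarity(east, north))      # 0.0
print(cosine_similarity(east, west))       # -1.0
```

The score depends only on direction, not length, which is why `east` and `also_east` score a perfect 1.0.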


In [None]:
import numpy as np
import pandas as pd


def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

sentences1 = ['The new movie is awesome',
              'The new movie is awesome',
              'The new movie is awesome']

sentences2 = ['The new movie is awesome',
              'This recent movie is so good',
              'djkshsjdkhfsjdfkhsd']

def get_embeddings(sentences):
    embeddings_response = openai_client.embeddings.create(model=AZURE_OPENAI_ADA_DEPLOYMENT, input=sentences)
    return [embedding_object.embedding for embedding_object in embeddings_response.data]

embeddings1 = get_embeddings(sentences1)
embeddings2 = get_embeddings(sentences2)

for i in range(len(sentences1)):
    print(f"{sentences1[i]} \t\t {sentences2[i]} \t\t Score: {cosine_similarity(embeddings1[i], embeddings2[i]):.4f}")

### Vector search

The file `openai_movies.json` contains a collection of movie titles and their corresponding vector embeddings.

In [None]:
import json

# Load in vectors for movie titles
with open('openai_movies.json') as json_file:
    movie_vectors = json.load(json_file)
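
The search code in this notebook treats the loaded data as a dictionary that maps each title to its vector. Before searching, it can help to confirm that every vector has the same length. The sketch below assumes that dictionary shape; `vector_dimension` is a hypothetical helper, and the toy titles and numbers are made up.

```python
def vector_dimension(vectors):
    """Return the shared length of all vectors, or raise if lengths differ."""
    lengths = {len(v) for v in vectors.values()}
    if len(lengths) != 1:
        raise ValueError(f"Expected one vector length, found {sorted(lengths)}")
    return lengths.pop()


# Toy stand-in for the loaded file: made-up titles mapped to short made-up vectors.
toy_vectors = {
    "Movie A": [0.1, 0.2, 0.3],
    "Movie B": [0.0, 0.5, 0.5],
}
print(len(toy_vectors), "titles, dimension", vector_dimension(toy_vectors))
```

Calling `vector_dimension(movie_vectors)` on the real data should return the same length as the query vector, since cosine similarity is only defined for vectors of equal length.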

### It's like building a smart movie search engine! 

1. User Types a Search Query
- Someone searches for "101 Dalmations" (notice the misspelling!)
- This is what a user might type into a search box

2. Convert the Search into Numbers
- Takes the search query and turns it into a vector (list of 1,536 numbers)
- Now the computer can compare this search against the movie database

3. Compare Against Every Movie
- Goes through every movie in the database
- Calculates how similar the search query is to each movie title
- Creates a list of (movie_name, similarity_score) pairs

4. Show the Best Matches
- Sorts the pairs by similarity score, highest first
- Displays the top 10 results

In [None]:
# Compute vector for query
query = "101 Dalmations"  # deliberately misspelled, as described above

embeddings_response = openai_client.embeddings.create(model=AZURE_OPENAI_ADA_DEPLOYMENT, input=[query])
vector = embeddings_response.data[0].embedding

# Compute cosine similarity between query and each movie title
scores = []
for movie in movie_vectors:
    scores.append((movie, cosine_similarity(vector, movie_vectors[movie])))

# Display the top 10 results
df = pd.DataFrame(scores, columns=['Movie', 'Score'])
df = df.sort_values('Score', ascending=False)
df.head(10)
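
The loop above computes one similarity at a time. As an alternative sketch, not the notebook's own method, the same ranking can be done with a single matrix product in NumPy. The function `top_matches` and the toy 2D vectors are hypothetical; with real data you would pass the query embedding and `movie_vectors`.

```python
import numpy as np


def top_matches(query_vector, vectors, k=10):
    """Rank titles by cosine similarity to query_vector using one matrix product."""
    titles = list(vectors)
    matrix = np.array([vectors[t] for t in titles], dtype=float)
    query = np.asarray(query_vector, dtype=float)
    # Cosine similarity: dot products divided by the product of the norms.
    scores = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    best = np.argsort(scores)[::-1][:k]  # indices of the k highest scores
    return [(titles[i], float(scores[i])) for i in best]


# Made-up 2D vectors standing in for real embeddings.
toy_vectors = {
    "Movie A": [1.0, 0.0],
    "Movie B": [0.0, 1.0],
    "Movie C": [1.0, 1.0],
}
for title, score in top_matches([1.0, 0.1], toy_vectors, k=2):
    print(f"{title}\t{score:.4f}")
```

For a few thousand titles either approach is fast enough; the matrix form mainly pays off as the collection grows.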