# Vectors, Vectors, Vectors

As with many things in life, it all boils down to linear algebra and a few non-linear functions.

Vector representations enable similarity calculations, and several applications follow from that: question answering, evaluation procedures, fetching related texts, etc. Because they are so useful, we want efficient ways of 1) obtaining vector representations, 2) operating on them, and 3) storing them for later use.
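
As a minimal sketch of what a similarity calculation looks like, the snippet below computes cosine similarity between toy 3-dimensional vectors. The vectors and the `cosine_similarity` helper are illustrative assumptions, not output from an embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors (made up for illustration).
query = [1.0, 0.0, 1.0]
close = [0.9, 0.1, 0.8]
far = [-1.0, 0.5, 0.0]

sim_close = cosine_similarity(query, close)
sim_far = cosine_similarity(query, far)
print(sim_close > sim_far)  # True
```

Values near 1 indicate vectors pointing in nearly the same direction; values near 0 or below indicate unrelated or opposing directions.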

In this notebook, we will discuss how to obtain embeddings from the OpenAI API and how to use a vector database to store and operate on vector representations.

In [1]:
%load_ext dotenv
%dotenv ../../05_src/.secrets

In [2]:
import os
from openai import OpenAI
client = OpenAI()

Our sample phrases cover three topics: freedom, friendship, and food.

In [3]:
phrases = [
    # Freedom
    "Freedom consists not in doing what we like, but in having the right to do what we ought.",
    "Those who deny freedom to others deserve it not for themselves.",
    "Liberty, when it begins to take root, is a plant of rapid growth.",
    "Freedom lies in being bold.",
    "Is freedom anything else than the right to live as we wish?",
    "I am no bird and no net ensnares me: I am a free human being with an independent will.",
    "The secret to happiness is freedom... And the secret to freedom is courage."
    "Freedom is the oxygen of the soul.", 
    "Life without liberty is like a body without spirit."
    # Friendship
    "There is nothing on this earth more to be prized than true friendship.",
    "There are no strangers here; Only friends you haven’t yet met.",
    "Friendship is the only cement that will ever hold the world together.",
    "A true friend is someone who is there for you when he'd rather be anywhere else.",
    "Friendship is the golden thread that ties the heart of all the world.", 
    "Your friend is the man who knows all about you and still likes you.",
    "A single rose can be my garden... a single friend, my world."
    # Food
    "One cannot think well, love well, sleep well, if one has not dined well.",
    "Let food be thy medicine and medicine be thy food.",
    "People who love to eat are always the best people.",
    "The only way to get rid of a temptation is to yield to it.",
    "Food is our common ground, a universal experience.",
    "Life is uncertain. Eat dessert first.",
    "All you need is love. But a little chocolate now and then doesn't hurt."
]

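The list holds 20 elements, not 23: three entries above have no trailing comma, so Python's implicit string-literal concatenation merges each of them with the string that follows it.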

In [4]:
len(phrases)

20

To obtain embeddings, we will use the `text-embedding-3-small` model, which by default returns one 1536-dimensional vector per input text.

The documentation for the embeddings API can be found [here](https://platform.openai.com/docs/guides/embeddings).

# A Simple Input

We start with a simple example using the first document/phrase:

In [5]:
phrases[0]

'Freedom consists not in doing what we like, but in having the right to do what we ought.'

In [6]:
client = OpenAI()
response = client.embeddings.create(
    input = phrases[0], 
    model = "text-embedding-3-small"
)

In [7]:
response.data

[Embedding(embedding=[0.026252003386616707, 0.021613840013742447, 0.03242075815796852, 0.07648330926895142, 0.08269844949245453, -0.04668311029672623, -0.007658766582608223, 0.026483910158276558, -0.008632780984044075, -0.005762917455285788, 0.026159239932894707, -0.01841350644826889, -0.02139352634549141, 0.006934053730219603, 0.03654872626066208, 0.006719538476318121, 0.005954241845756769, -0.026112858206033707, 0.018390316516160965, 0.052735913544893265, 0.04271748289465904, 0.004861374851316214, -0.007908067665994167, 0.02502289041876793, -0.047865841537714005, -0.0057194349355995655, 0.02052387222647667, -0.0036844408605247736, -0.029823388904333115, 0.02364303544163704, 0.036108098924160004, -0.00829071644693613, -0.019990483298897743, 0.03603852540254593, 0.04436402767896652, 0.026275193318724632, 0.019596239551901817, -0.06062079221010208, 0.0337890163064003, 0.02455907315015793, -0.009792321361601353, -0.06256882101297379, -7.786134665366262e-05, 0.05528690293431282, 0.0285710

# Loop through Simple Inputs

We will now try the example found in the [API documentation](https://platform.openai.com/docs/guides/embeddings/embeddings#obtaining-the-embeddings), which simply loops through the documents, calling the API each time. The function below first performs a simple cleanup (removes line breaks), then requests the embeddings.

In [8]:
def get_embedding(text, model="text-embedding-3-small"):
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding

Using Python's list comprehension syntax, we can run the function for each of our example phrases.

In [9]:
embeddings = [get_embedding(doc) for doc in phrases]


The statement above is roughly equivalent to:

In [10]:
embeddings = []
for doc in phrases:
    doc_emb = get_embedding(doc)
    embeddings.append(doc_emb)
embeddings

[[0.026299498975276947,
  0.021637948229908943,
  0.0324685163795948,
  0.07648655027151108,
  0.08270195126533508,
  -0.046661898493766785,
  -0.007664889562875032,
  0.02650822512805462,
  -0.008627348579466343,
  -0.0057573639787733555,
  0.02620673179626465,
  -0.018391096964478493,
  -0.021394433453679085,
  0.0069517414085567,
  0.03655027598142624,
  0.006719823461025953,
  0.005971888080239296,
  -0.026137156412005424,
  0.018367905169725418,
  0.05273814871907234,
  0.042719293385744095,
  0.004861580673605204,
  -0.007948989048600197,
  0.02502395026385784,
  -0.04782148823142052,
  -0.005728374235332012,
  0.020513145253062248,
  -0.0036787991411983967,
  -0.029847845435142517,
  0.023678826168179512,
  0.036132823675870895,
  -0.00828527007251978,
  -0.01996813900768757,
  0.03606324642896652,
  0.044365908950567245,
  0.026229923591017723,
  0.019585473462939262,
  -0.06053059548139572,
  0.03381364047527313,
  0.02456011436879635,
  -0.009792736731469631,
  -0.06257147341

# Sending Lists of Inputs to the API

We can also send a collection of inputs to the API:

In [11]:
client = OpenAI()
response = client.embeddings.create(
    input = phrases, 
    model = "text-embedding-3-small"
)
response.data

[Embedding(embedding=[0.02645343542098999, 0.021735403686761856, 0.03266686201095581, 0.07655499130487442, 0.08267568051815033, -0.04678618162870407, -0.007378445006906986, 0.026522988453507423, -0.00861301552504301, -0.005714962258934975, 0.026291145011782646, -0.018408438190817833, -0.021132608875632286, 0.006955329328775406, 0.036538660526275635, 0.006607562769204378, 0.0059236218221485615, -0.025989746674895287, 0.01818818598985672, 0.05281413346529007, 0.04291437938809395, 0.004845546092838049, -0.008149327710270882, 0.024714602157473564, -0.04799177125096321, -0.005671491380780935, 0.020413890480995178, -0.0036486496683210135, -0.029907915741205215, 0.023810410872101784, 0.036121342331171036, -0.008276841603219509, -0.020019754767417908, 0.03607497364282608, 0.04432862997055054, 0.02645343542098999, 0.019648805260658264, -0.06065047159790993, 0.033826082944869995, 0.024505943059921265, -0.009615742601454258, -0.0625515952706337, -0.00017723410564940423, 0.05541078746318817, 0.028
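
For larger corpora, one request per document is slow, while the API limits how much can be sent in a single request. A minimal sketch of a middle path is to split the documents into batches and send one request per batch. The `batched` helper, the placeholder texts, and the batch size of 8 are illustrative assumptions; the sketch only builds the batches and leaves the API call as a comment.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

docs = [f"document {i}" for i in range(20)]  # placeholder texts
batches = list(batched(docs, batch_size=8))
print([len(b) for b in batches])  # [8, 8, 4]

# Each batch could then be sent in one call, for example:
# response = client.embeddings.create(input=batch, model="text-embedding-3-small")
# embeddings.extend(item.embedding for item in response.data)
```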

# Vector DB

We can use a specialized database to store our embeddings, relate them to documents, and efficiently perform computations like cosine similarity.
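
To make the idea concrete, the sketch below is a toy in-memory store that keeps ids, documents, and embeddings side by side and answers queries by brute-force comparison. It is an illustration under our own assumptions (the `ToyVectorStore` class, the 2-dimensional vectors, squared Euclidean distance), not how a production vector database is implemented; real systems rely on indexes to avoid scanning every record.

```python
class ToyVectorStore:
    """Brute-force nearest-neighbour search over a small set of vectors."""

    def __init__(self):
        self.records = []  # list of (id, document, embedding) tuples

    def add(self, ids, documents, embeddings):
        self.records.extend(zip(ids, documents, embeddings))

    def query(self, query_embedding, n_results=2):
        def squared_l2(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))

        scored = [
            (rec_id, squared_l2(query_embedding, emb), doc)
            for rec_id, doc, emb in self.records
        ]
        # Smaller distance means more similar.
        return sorted(scored, key=lambda item: item[1])[:n_results]


store = ToyVectorStore()
store.add(
    ids=["id0", "id1", "id2"],
    documents=["about freedom", "about friendship", "about food"],
    embeddings=[[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]],  # made-up vectors
)
top = store.query([0.9, 0.1], n_results=2)
print([rec_id for rec_id, _, _ in top])  # ['id0', 'id2']
```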

![](img/02_chroma.png)

The vector database that we will use for our experiments is Chroma DB, a lightweight vector database that is commonly used for prototyping.

A few useful references are: 
- [ChromaDB Documentation](https://docs.trychroma.com/docs/overview/introduction).
- [ChromaDB Cookbook](https://cookbook.chromadb.dev/running/running-chroma/#chroma-cli).

Chroma can be run locally in memory, locally using file persistence, or using a Docker container.

## Running Chroma Locally in Memory

The simplest implementation is to run Chroma DB in memory without persistence.

In [12]:
import chromadb

chroma_client = chromadb.Client()

First, create a collection. A collection is a container that groups documents together, much as a table groups records in a relational database.

In [13]:
collection = chroma_client.create_collection(name = "nice_phrases")

Then, add documents to our collection. Each document will contain:

1. An identifier.
2. The phrase.
3. The embeddings.

In [14]:
embeddings = [item.embedding for item in response.data]
ids = [f"id{i}" for i in range(len(phrases))]

In [15]:
collection.add(embeddings = embeddings, 
               documents = phrases, 
               ids = ids)

Now, we can use Chroma DB's [`query`](https://docs.trychroma.com/docs/querying-collections/query-and-get) method to run a similarity search.

## Performing a Search Using Custom Embeddings

We could use a function such as the one below to provide our own embeddings of the query text.

In [16]:
def query_chromadb(query, top_n = 2):
    # Embed the query ourselves, then search the collection by vector.
    query_embedding = get_embedding(query)
    results = collection.query(query_embeddings = [query_embedding], n_results = top_n)
    # Only one query was sent, so take index 0 of each result list.
    return list(zip(results['ids'][0], results['distances'][0], results['documents'][0]))

In [17]:
query = "What is good food?"

query_chromadb(query, top_n=3)

[('id17',
  1.007392406463623,
  'Food is our common ground, a universal experience.'),
 ('id15',
  1.1379356384277344,
  'People who love to eat are always the best people.'),
 ('id14',
  1.1588244438171387,
  'Let food be thy medicine and medicine be thy food.')]
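
The second element of each tuple is a distance, so smaller values mean closer matches. As a hedged reading of these numbers: assuming the collection uses Chroma's default squared-L2 metric and that the embeddings are unit length (as OpenAI describes its embeddings), a distance d relates to cosine similarity by cos = 1 - d/2. The sketch below checks this identity on made-up unit vectors and then applies it to the top distance above; the helper names are our own.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Made-up vectors, normalized so the identity applies.
u = normalize([3.0, 4.0, 0.0])
v = normalize([1.0, 2.0, 2.0])

d = squared_l2(u, v)
# For unit vectors: |u - v|^2 = 2 - 2 * (u . v)
print(math.isclose(d, 2 - 2 * dot(u, v)))  # True

# Under the stated assumptions, the top hit's distance converts to:
top_distance = 1.007392406463623
print(round(1 - top_distance / 2, 3))  # 0.496
```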

## Performing a Search Using Embedding Function

Alternatively, we can attach an embedding function to the collection when we create it, so that Chroma DB embeds query texts for us.

If needed, list the existing collections and delete the ones you no longer need:

In [18]:
chroma_client.list_collections()

[Collection(name=nice_phrases)]

In [19]:
chroma_client.delete_collection("nice_phrases")

We can now reuse the collection name, this time with an OpenAI embedding function. Notice that we pass the `api_key` parameter explicitly, because Chroma DB and the OpenAI library look for the API key under different environment variable names.

In [20]:
import os
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

collection = chroma_client.create_collection(
    name = "nice_phrases",
    embedding_function = OpenAIEmbeddingFunction(
        api_key = os.getenv("OPENAI_API_KEY"),
        model_name="text-embedding-3-small")
)
collection.add(embeddings = embeddings, 
               documents = phrases, 
               ids = ids)

With the embedding function attached, we can now query the collection with plain text:

In [21]:
collection.query(
    query_texts = ["What is a friend?", "What is good food?"], 
    n_results = 2
)

{'ids': [['id10', 'id12'], ['id17', 'id15']],
 'embeddings': None,
 'documents': [["A true friend is someone who is there for you when he'd rather be anywhere else.",
   'Your friend is the man who knows all about you and still likes you.'],
  ['Food is our common ground, a universal experience.',
   'People who love to eat are always the best people.']],
 'uris': None,
 'included': ['metadatas', 'documents', 'distances'],
 'data': None,
 'metadatas': [[None, None], [None, None]],
 'distances': [[0.9254103899002075, 1.0471810102462769],
  [1.0074349641799927, 1.1379879713058472]]}
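
The result is a dictionary of parallel lists, with one inner list per query text. A small helper can regroup it into one list of `(id, distance, document)` tuples per query. The `unpack_results` name is our own, and the dictionary below is a copy of the relevant keys from the output above rather than a live query.

```python
def unpack_results(results):
    """Regroup Chroma-style query results into per-query lists of tuples."""
    return [
        list(zip(ids, distances, documents))
        for ids, distances, documents in zip(
            results["ids"], results["distances"], results["documents"]
        )
    ]

# Relevant keys copied from the output above.
results = {
    "ids": [["id10", "id12"], ["id17", "id15"]],
    "documents": [
        ["A true friend is someone who is there for you when he'd rather be anywhere else.",
         "Your friend is the man who knows all about you and still likes you."],
        ["Food is our common ground, a universal experience.",
         "People who love to eat are always the best people."],
    ],
    "distances": [[0.9254103899002075, 1.0471810102462769],
                  [1.0074349641799927, 1.1379879713058472]],
}

per_query = unpack_results(results)
print(per_query[1][0][0])  # id17
```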