# Vector Databases & RAGs Code-Along

While the pre-trained weights of most LLMs provide a great baseline for natural language, and even "intelligent" behavior, requests for contextual or granular information often fall flat.

For this reason, we use vector databases & RAGs to supply our models with contextual information. Follow along with the code below to find out more.

In [126]:
!pip install faiss-cpu sentence-transformers transformers



## Vector Database

A vector database is a specialized system for storing and searching numerical vectors efficiently. Each vector usually represents a piece of unstructured data after being processed by an embedding model (think back to `word2vec` or the `audio2vec` models we worked with).

Instead of searching by keywords, a vector DB lets you search by semantic similarity. That is, we extract information that is not only **directly** related to our input prompt, but also information that **might** be related to our prompt (think back to cosine similarity).
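
As a quick refresher on cosine similarity, here is a minimal sketch in plain Python. The 3-dimensional vectors in `toy_embeddings` are invented for illustration (real embedding models produce hundreds of dimensions), and `cosine_similarity` is a hypothetical helper name rather than a library function.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings", invented for illustration only.
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.8, 0.9, 0.2],
    "tractor": [0.1, 0.2, 0.9],
}

sim_kitten = cosine_similarity(toy_embeddings["cat"], toy_embeddings["kitten"])
sim_tractor = cosine_similarity(toy_embeddings["cat"], toy_embeddings["tractor"])

# Vectors pointing in similar directions score closer to 1.
print(f"cat vs kitten:  {sim_kitten:.3f}")
print(f"cat vs tractor: {sim_tractor:.3f}")
```

A semantic search simply ranks stored items by a score like this one instead of by exact keyword matches.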

We create a vector database by storing embeddings of the information we want our model to be aware of. This could be:
* legal documents
* copyrighted music
* or even PDFs of lecture slides

And one of the best parts is that if we use a private implementation of a vector database, we also don't have to share information with `Meta`, `Amazon`, `Microsoft` or any other large organization that would want to sniff around our data.

## Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation is a *technique* to improve an LLM's responses by:
* Retrieving relevant documents from a knowledge store (such as a vector DB).
* Augmenting the model's prompt with those documents.
* Generating an answer using the model with this extra context.

With RAG, the model is handed the right documents at generation time. That is, the model does not respond to a user's queries until it refers to a specified set of documents. 

The vector DB acts as the "memory" that stores your domain knowledge, whereas the RAG pipeline is the "brain" that searches the vector DB for relevant content, combines the top results into the user prompt, and feeds all information into the main LLM.
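
The three steps can be sketched as a single function that takes the retriever and the generator as arguments. Everything here is a hypothetical stand-in so the sketch runs on its own: `rag_answer` is not a library function, `keyword_retriever` ranks by crude word overlap (not the semantic search a vector DB provides), and `echo_generator` simply returns the prompt where a real LLM call would go.

```python
from typing import Callable, List

def rag_answer(
    question: str,
    retrieve: Callable[[str, int], List[str]],
    generate: Callable[[str], str],
    k: int = 3,
) -> str:
    """Retrieve k documents, add them to the prompt, then generate."""
    documents = retrieve(question, k)               # 1. retrieve
    context = "\n".join(documents)
    rag_prompt = f"Facts:\n{context}\n\n{question}"  # 2. augment
    return generate(rag_prompt)                      # 3. generate

# Stand-in retriever: ranks documents by how many words they share with
# the question. A vector DB would rank by embedding similarity instead.
def keyword_retriever(question: str, k: int) -> List[str]:
    store = [
        "Fortieth: Ronald Reagan served 1981–1989; Republican.",
        "First president: George Washington served 1789–1797; no formal party.",
    ]
    words = set(question.lower().split())
    ranked = sorted(
        store,
        key=lambda doc: len(words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

# Stand-in generator: returns the augmented prompt unchanged.
def echo_generator(rag_prompt: str) -> str:
    return rag_prompt

print(rag_answer("Ronald Reagan was a", keyword_retriever, echo_generator, k=1))
```

Keeping retrieval and generation as separate arguments means either piece can be swapped out without touching the rest of the pipeline.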

For example, let's explore our `distilgpt2` model again. As we recall, its basic capabilities are comparable to a hamster's at the moment.

In [127]:
from transformers import pipeline

model_id = "distilgpt2"

pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype="auto",
    device_map="auto"
)

Device set to use cpu


In [128]:
prompt = "The first president of the United States was "

out = pipe(prompt, max_new_tokens=10, do_sample=True, temperature=1.0)
print(out[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The first president of the United States was __________ in 1933, the second was in 1925


However, by giving the model a set of information that it can refer back to through a vector database, we can improve its response.

First, we specify a list of relevant historical information.

In [140]:
facts = [
    "First president: George Washington served 1789–1797; no formal party.",
    "Second president: John Adams served 1797–1801; Federalist.",
    "Third president: Thomas Jefferson served 1801–1809; Democratic-Republican.",
    "Sixteenth president: Abraham Lincoln served 1861–1865; Republican; led the Union during the Civil War.",
    "Twenty-sixth: Theodore Roosevelt served 1901–1909; Republican.",
    "Thirty-second: Franklin D. Roosevelt served 1933–1945; Democratic; led during the Great Depression and WWII.",
    "Thirty-fourth: Dwight D. Eisenhower served 1953–1961; Republican.",
    "Thirty-fifth: John F. Kennedy served 1961–1963; Democratic; succeeded by Lyndon B. Johnson after his assassination.",
    "Thirty-sixth: Lyndon B. Johnson served 1963–1969; Democratic.",
    "Thirty-seventh: Richard Nixon served 1969–1974; Republican.",
    "Thirty-eighth: Gerald Ford served 1974–1977; Republican.",
    "Thirty-ninth: Jimmy Carter served 1977–1981; Democratic.",
    "Fortieth: Ronald Reagan served 1981–1989; Republican.",
    "Forty-first: George H. W. Bush served 1989–1993; Republican.",
    "Forty-second: Bill Clinton served 1993–2001; Democratic.",
    "Forty-third: George W. Bush served 2001–2009; Republican.",
    "Forty-fourth: Barack Obama served 2009–2017; Democratic.",
    "Forty-fifth: Donald Trump served 2017–2021; Republican.",
    "Forty-sixth: Joe Biden began 2021; Democratic."
]

Now that we have a prepared set of facts that we want the model to "remember", we convert these sentences into embeddings that can then be used to inform the model. We use the [SentenceTransformer](https://sbert.net/) object to convert the sentences into vector embeddings, which will then be stored in a [Facebook AI Similarity Search](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/) (`FAISS`) index. `FAISS` provides efficient similarity search & clustering of dense vectors.

In [141]:
from sentence_transformers import SentenceTransformer
import faiss     

embedder = SentenceTransformer('bert-base-nli-mean-tokens')
embeddings = embedder.encode(facts, normalize_embeddings=True) 

embeddings.shape

(19, 768)

Note that we retain the length of our original `facts` list (19). This time, however, each fact is represented by a vector of 768 dimensions, which together express the semantic meaning of the sentence.

We could utilize this list of vector embeddings to find similar/dissimilar facts. Let's envision a user trying to find more information on Ronald Reagan. They begin their prompt with "Ronald Reagan was a..." which we could express as a vector of 768 dimensions using our transformer model.

In [142]:
test_prompt = "Ronald Reagan was a"

emb_prompt = embedder.encode([test_prompt], normalize_embeddings=True)
emb_prompt

array([[-8.82279221e-03,  1.82359405e-02,  5.11318482e-02,
         3.68423387e-03,  1.70134883e-02, -4.15227488e-02,
        -7.91858360e-02,  5.28977625e-02, -4.34197932e-02,
         1.13973329e-02,  9.66298208e-03,  6.25337288e-02,
         4.50947322e-03,  5.83700426e-02,  1.53515243e-03,
        -1.72120556e-02, -5.39328121e-02,  1.09018404e-02,
        -2.22240612e-02, -6.65859925e-03, -8.64041783e-03,
         5.65448776e-02, -2.01766137e-02, -1.77255347e-02,
         3.94262969e-02,  2.72729341e-03,  9.84456390e-03,
        -1.17754459e-01,  5.83857857e-02,  1.01187946e-02,
        -3.07621136e-02,  1.46105932e-02, -8.37213174e-03,
         1.85176712e-02, -9.66928806e-03, -7.75542622e-03,
         2.88648978e-02,  4.66978783e-03, -1.03346794e-03,
         2.77778916e-02,  7.33339787e-02, -3.56054530e-02,
         1.89625546e-02,  4.44491208e-02, -4.00660485e-02,
        -3.41446549e-02, -3.33153439e-04,  5.46278767e-02,
         3.67420726e-02, -1.54714942e-01, -3.28307152e-0

While this vector does not provide much information to us humans, we could calculate the cosine similarity of this embedding to all other sentences in our `facts` list and extract the fact that has the **most** similarity.

In [143]:
reagan_fact = facts[12]
washington_fact = facts[0]

print("Reagan fact,", reagan_fact)
print("Washington fact,", washington_fact)

Reagan fact, Fortieth: Ronald Reagan served 1981–1989; Republican.
Washington fact, First president: George Washington served 1789–1797; no formal party.


In [144]:
from numpy import dot
from numpy.linalg import norm

# get embeddings of the two facts
emb_reagan = embedder.encode([reagan_fact], normalize_embeddings=True)
emb_washington = embedder.encode([washington_fact], normalize_embeddings=True)

# calculate cosine similarity
sim_reagan = dot(emb_prompt, emb_reagan.T)/(norm(emb_prompt) * norm(emb_reagan))
sim_washington = dot(emb_prompt, emb_washington.T)/(norm(emb_prompt) * norm(emb_washington))

print("Similarity of original prompt to Reagan fact", sim_reagan)
print("Similarity of original prompt to Washington fact", sim_washington)

Similarity of original prompt to Reagan fact [[0.74524]]
Similarity of original prompt to Washington fact [[0.4231794]]
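
Because we encoded with `normalize_embeddings=True`, each embedding already has a length of (approximately) 1, so the norms in the denominator above are close to 1 and the cosine similarity reduces to a plain dot product. The sketch below illustrates this on two made-up 2-dimensional vectors; `l2_normalize` and `dot_product` are hypothetical helper names written in plain Python.

```python
import math

def l2_normalize(v):
    """Scale a vector so its Euclidean length is 1."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def dot_product(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical raw vectors, for illustration only.
vec_a = l2_normalize([3.0, 4.0])
vec_b = l2_normalize([4.0, 3.0])

norm_a = math.sqrt(dot_product(vec_a, vec_a))
norm_b = math.sqrt(dot_product(vec_b, vec_b))
cosine = dot_product(vec_a, vec_b) / (norm_a * norm_b)

# Both lengths are 1 up to floating-point rounding ...
print(norm_a, norm_b)
# ... so the plain dot product matches the cosine similarity.
print(dot_product(vec_a, vec_b), cosine)
```

This is why an inner-product search over normalized embeddings ranks results in the same order as cosine similarity.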


To speed up the extraction of similar strings (also known as documents), we utilize the `FAISS` library. This provides us with a database (similar to a SQL database) that specializes in searching high-dimensional vector embeddings.

In [145]:
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

Now that we've added our embeddings to our vector database, we can perform operations such as the extraction of the top 3 similar facts. Let's evaluate our test prompt and see which documents are extracted.

In [146]:
# use FAISS to get top 3 similar vectors
top_k = 3

scores, ids = index.search(emb_prompt, top_k)

print(scores)
print(ids)

[[0.74524    0.5902182  0.56826836]]
[[12 11 13]]
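
Conceptually, a flat inner-product index such as `IndexFlatIP` compares the query with every stored vector and keeps the highest-scoring ones. The sketch below mimics that idea in plain Python on made-up 2-dimensional vectors. `brute_force_search`, `toy_vectors`, and `toy_query` are hypothetical names, and this is only an illustration of the idea, not how `FAISS` is implemented internally.

```python
def brute_force_search(query, vectors, k):
    """Return (scores, ids) of the k vectors with the largest inner product."""
    scored = [
        (sum(q * x for q, x in zip(query, vec)), i)
        for i, vec in enumerate(vectors)
    ]
    # Highest inner product first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top = scored[:k]
    return [score for score, _ in top], [i for _, i in top]

# Hypothetical unit-length 2-d vectors standing in for sentence embeddings.
toy_vectors = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.6, 0.8],
    [0.8, 0.6],
]
toy_query = [1.0, 0.0]

toy_scores, toy_ids = brute_force_search(toy_query, toy_vectors, k=3)
print(toy_scores)
print(toy_ids)
```

As with `index.search`, the first id belongs to the stored vector most similar to the query, and the scores come back in descending order.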


In [162]:
print("Top k results for prompt:", test_prompt)

top_facts = []
for i in range(top_k):
    score = float(scores[0][i])
    ident = int(ids[0][i])
    fact = facts[ident]
    top_facts.append(fact)

    print("\nfact", fact, "score", score, "\nid", ident)

Top k results for prompt: Ronald Reagan was a

fact Fortieth: Ronald Reagan served 1981–1989; Republican. score 0.7452399730682373 
id 12

fact Thirty-ninth: Jimmy Carter served 1977–1981; Democratic. score 0.590218186378479 
id 11

fact Forty-first: George H. W. Bush served 1989–1993; Republican. score 0.568268358707428 
id 13


We can attach these top 3 facts to our original prompt (along with some system-level instructions) to help our LLM provide a more accurate answer. First let's generate our "context", which provides relevant information to our original prompt.

In [163]:
context = ""
for f in top_facts:
    context += "\n" + f

print(context)


Fortieth: Ronald Reagan served 1981–1989; Republican.
Thirty-ninth: Jimmy Carter served 1977–1981; Democratic.
Forty-first: George H. W. Bush served 1989–1993; Republican.


Next, let's generate our system-level instructions to "point" our model towards the context we constructed for it. 

In [None]:
rag_prompt = (
    "Use ONLY these facts to answer the question. If unknown, say 'Not in facts.'\n"
    f"Facts:\n{context}\n\n"
    f"{test_prompt}"
)

rag_prompt

"Use ONLY these facts to answer the question. If unknown, say 'Not in facts.'\nFacts:\n\nFortieth: Ronald Reagan served 1981–1989; Republican.\nThirty-ninth: Jimmy Carter served 1977–1981; Democratic.\nForty-first: George H. W. Bush served 1989–1993; Republican.\n\nRonald Reagan was a"

And now we can test out how our minimal "RAG" operates!

In [None]:
out = pipe(rag_prompt, max_new_tokens=10, do_sample=True, temperature=0.5)
print(out[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Use ONLY these facts to answer the question. If unknown, say 'Not in facts.'
Facts:

Fortieth: Ronald Reagan served 1981–1989; Republican.
Thirty-ninth: Jimmy Carter served 1977–1981; Democratic.
Forty-first: George H. W. Bush served 1989–1993; Republican.

Ronald Reagan was a popular figure in the Republican Party. He was the


Notice that this differs vastly from what the model's response would look like if we did not provide this contextual information.

In [174]:
out = pipe(test_prompt, max_new_tokens=10, do_sample=True, temperature=0.5)
print(out[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Ronald Reagan was a Republican presidential candidate in 1992 and was the third Republican


To confirm the validity of RAGs & Vector DB, let's try out this technique on our original prompt about the first US president. Except this time, let's make these techniques concrete by formalizing them into functions.

In [175]:
def get_top_k(emb_prompt, k=3):
    """Get the top k facts most similar to the original prompt embedding
    """
    _, ids = index.search(emb_prompt, k)

    top_facts = []
    for i in ids[0]:
        fact = facts[i]
        top_facts.append(fact)
    
    return top_facts

In [176]:
def construct_prompt(facts, og_prompt):
    """Create a system prompt using the retrieved facts to inform answers
    """
    context = ""
    for f in facts:
        context += "\n" + f

    prompt = (
        "Use ONLY these facts to answer the question. If unknown, say 'Not in facts.'\n"
        f"Facts:\n{context}\n\n"
        f"{og_prompt}"
    )

    return prompt

In [177]:
# generate embedding of original prompt
emb_prompt = embedder.encode([prompt], normalize_embeddings=True)

# get top 3 facts from original prompt
top_facts = get_top_k(emb_prompt, 3)

print("Top k results for prompt:", prompt)
top_facts

Top k results for prompt: The first president of the United States was 


['Second president: John Adams served 1797–1801; Federalist.',
 'First president: George Washington served 1789–1797; no formal party.',
 'Third president: Thomas Jefferson served 1801–1809; Democratic-Republican.']

In [179]:
# construct prompt using newly found context
rag_prompt = construct_prompt(top_facts, prompt)

print("RAG Prompt")
rag_prompt

RAG Prompt


"Use ONLY these facts to answer the question. If unknown, say 'Not in facts.'\nFacts:\n\nSecond president: John Adams served 1797–1801; Federalist.\nFirst president: George Washington served 1789–1797; no formal party.\nThird president: Thomas Jefferson served 1801–1809; Democratic-Republican.\n\nThe first president of the United States was "

In [185]:
out = pipe(rag_prompt, max_new_tokens=10, do_sample=True, temperature=0.5)
print(out[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Use ONLY these facts to answer the question. If unknown, say 'Not in facts.'
Facts:

Second president: John Adams served 1797–1801; Federalist.
First president: George Washington served 1789–1797; no formal party.
Third president: Thomas Jefferson served 1801–1809; Democratic-Republican.

The first president of the United States was _____________.
Second president: John Adams served


## Vector Databases & RAGs in OpenAI

As the example above shows, however, it would take a lot more work to get `distilgpt2` to the point where it can assist with everyday business automation. Therefore, for the remainder of this exercise we will utilize the `openai` API.

In particular, we will explore how to pass PDFs to the `openai` API. In this example, we will pass a PDF of "presidential facts."

We will use the `openai` API to interact with a vector store, which will then inform the answers of the model we call (`gpt-4-turbo` in the code below). While a vector store is not as comprehensive as a [vector database](https://myscale.com/blog/vector-store-vs-vector-database-comparison-guide/), this tool still allows us to efficiently search the semantic space to augment our responses.

Specifically, we will work in the OpenAI project with the `id` `proj_fHRnVJY0Oyfm1ufG1sffxa6W` and the vector store with the `id` `vs_689a38d0704081919c7e463d0efd9dfb`.

To begin running the code below, copy & paste the provided API key into the code block below.

**DO NOT PUSH THIS KEY TO GITHUB**  
**DO NOT PUSH THIS KEY TO GITHUB**  
**DO NOT PUSH THIS KEY TO GITHUB**  

In [None]:
api_key = "..."
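
One way to lower the risk of accidentally committing the key is to keep it out of the notebook entirely and read it from an environment variable. The sketch below assumes the key has been exported as `OPENAI_API_KEY` (the variable the `openai` client also looks for by default); `load_api_key` is a hypothetical helper, not part of the `openai` library.

```python
import os

def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in an environment variable, or raise."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running this notebook."
        )
    return key
```

With the variable set, `api_key = load_api_key()` could replace the pasted string above, so the key never appears in the saved notebook.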

In [None]:
from openai import OpenAI

# access the specific OpenAI project, authenticating with the key set above
client = OpenAI(api_key=api_key, project="proj_fHRnVJY0Oyfm1ufG1sffxa6W")
# specify vector store id
vec_id = "vs_689a38d0704081919c7e463d0efd9dfb"

Here we initialize our `openai` client and specify which vector store we will use.

We open the PDF file the same way we would for basic `I/O` operations. We can then upload it to OpenAI and attach it to our remote vector store.

In [None]:
# open the pdf file and create an object which could be interpreted by openai
with open("pres_facts.pdf", "rb") as file_obj:
    f = client.files.create(file=file_obj, purpose="assistants")

    # push pdf to vector store
    client.vector_stores.files.create(
        vector_store_id=vec_id,
        file_id=f.id,
    )

Now that we have our file within our vector store, we can utilize it to augment our responses. Note that the only thing we need to specify is our model's "tools." This allows our model to search a specific vector store for more information.

```
[
    {
        "type": "file_search",
        "vector_store_ids": [<id of the vector store that contains our file info>]
    }
]
```
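
If we end up querying several vector stores, a small helper can build this structure for us. This is only a sketch: `file_search_tools` is a hypothetical convenience function (not part of the `openai` library), and `"vs_example"` is a placeholder rather than a real vector store id.

```python
def file_search_tools(*vector_store_ids: str) -> list:
    """Build the `tools` argument for one or more vector store ids."""
    if not vector_store_ids:
        raise ValueError("At least one vector store id is required.")
    return [{"type": "file_search", "vector_store_ids": list(vector_store_ids)}]

# Placeholder id, not a real vector store.
tools = file_search_tools("vs_example")
print(tools)
```

The result has the same shape as the list shown above, so it could be passed as `tools=file_search_tools(vec_id)` when creating a response.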

In [202]:
resp = client.responses.create(
    model='gpt-4-turbo',
    input='How much did James Madison weigh?',
    tools=[{"type": "file_search", "vector_store_ids": [vec_id]}],
)

print(resp.output_text)

James Madison weighed less than 100 pounds.


Again, let's see how our model works without access to our vector store.

In [203]:
resp = client.responses.create(
    model='gpt-4-turbo',
    input='How much did James Madison weigh?'
)

print(resp.output_text)

There is no specific record of James Madison's exact weight throughout his life. However, he was known for his small stature. Historical sources often describe him as standing around 5 feet 4 inches tall and being quite slight in build, leading to estimates that his weight was under 100 pounds during his presidency. Madison's light frame and shorter height were distinctive characteristics noted by his contemporaries.


Now that we know how to augment our responses, let's try out the `rag_case_study.ipynb` exercise!