# Vectorstores and Embeddings

What are **embeddings**?
- They take a piece of text, and they create a numerical representation of that text. 
- Text with similar content will have similar vectors in this numeric space. 
- What that means is we can then compare those vectors and find pieces of text that are similar.

|         |               |
| ----------------------------------------------- | ------------------------------------- |
| ![](images/embeddings.png) | ![](images/vector_store.png) |

As a reminder of the full end-to-end workflow: we start with documents, create smaller splits of those documents, create embeddings of those splits, and then store all of them in a vector store.

- A **vector store** is a database where you can easily look up similar vectors later on. 
- This will become useful when we're trying to find documents that are relevant for a question at hand.
- We can then take the question at hand, create an embedding, and then do comparisons to all the different vectors in the vector store, and then pick the n most similar. 
- We then take those n most similar chunks, and pass them along with the question into an LLM, and get back an answer. 
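To make the lookup step concrete, here is a minimal sketch of "pick the n most similar" using NumPy. The three-dimensional vectors, the chunk texts, and the helper name `top_n_similar` are all made up for illustration; a real vector store such as Chroma does this over much larger vectors and typically with an index rather than a full scan.

```python
import numpy as np

def top_n_similar(query_vec, stored_vecs, n=2):
    """Return indices and scores of the n stored vectors most similar to the query."""
    stored = np.asarray(stored_vecs, dtype=float)
    query = np.asarray(query_vec, dtype=float)
    # Scale everything to unit length so the dot product equals cosine similarity.
    stored_unit = stored / np.linalg.norm(stored, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    scores = stored_unit @ query_unit
    # argsort is ascending, so negate the scores to get the highest first.
    order = np.argsort(-scores)[:n]
    return order.tolist(), scores[order].tolist()

# Hypothetical toy "embeddings" for three chunks (assumption: not real model output).
chunk_texts = ["contact the TAs by email", "homework uses MATLAB", "office hours and email help"]
chunk_vecs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.8, 0.3, 0.1]]
question_vec = [1.0, 0.2, 0.0]

indices, scores = top_n_similar(question_vec, chunk_vecs, n=2)
top_chunks = [chunk_texts[i] for i in indices]
print(top_chunks)
```

The two email-related chunks come back first because their toy vectors point in nearly the same direction as the question vector.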

![](images/vs_db.png)

We'll cover all of that later on. For now, it's time to deep dive on **embeddings** and **vector stores** themselves. 

In [1]:
import os
import openai
import sys
sys.path.append('../..')

# from dotenv import load_dotenv, find_dotenv
# _ = load_dotenv(find_dotenv()) # read local .env file
openai.api_key = os.environ.get('OPENAI_API_KEY')

Notice that we load the first lecture twice. This is to simulate some dirty data. \
After the documents are loaded, we can then use the **RecursiveCharacterTextSplitter** to create chunks.

In [3]:
from langchain.document_loaders import PyPDFLoader

# Load PDF
loaders = [
    # Duplicate documents on purpose - messy data
    PyPDFLoader("docs/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("docs/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("docs/MachineLearning-Lecture02.pdf"),
    PyPDFLoader("docs/MachineLearning-Lecture03.pdf")
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

In [4]:
# Split
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)

In [5]:
splits = text_splitter.split_documents(docs)

In [6]:
len(splits)

209

We can see that we've now created over 200 different chunks. 
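To build intuition for what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch that cuts a string into fixed-width character windows. This is not how `RecursiveCharacterTextSplitter` works internally (it splits on separators such as paragraphs and sentences first); the helper `split_with_overlap` is a hypothetical stand-in used only to show the overlap.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-width character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(toy_chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, which is the role the 150-character overlap plays for our 1500-character chunks.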

Time to move on to the next section and create **embeddings** for all of them. \
We'll use OpenAI to create these **embeddings**. Before jumping into a real-world example, let's try it out with a few toy test cases just to get a sense for what's going on underneath the hood.

## Embeddings

Let's start with the toy sentences before we embed our splits.

In [7]:
from langchain.embeddings.openai import OpenAIEmbeddings

embedding = OpenAIEmbeddings()

In [8]:
sentence1 = "i like dogs"
sentence2 = "i like canines"
sentence3 = "the weather is ugly outside"

In [9]:
embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)

We can use NumPy to compare them, and see which ones are most similar. 

In [10]:
import numpy as np

We'll use a dot product to compare each pair of embeddings. \
The important thing to know is that a higher value means the two texts are more similar. 

In [11]:
np.dot(embedding1, embedding2)

0.9631991217783938

In [12]:
np.dot(embedding1, embedding3)

0.7711219316134728

In [13]:
np.dot(embedding2, embedding3)

0.7596890092790902
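The first pair (dogs and canines) scores about 0.96, while both pairs involving the weather sentence score about 0.77 and 0.76. A plain dot product works as a similarity score here on the assumption that the embedding vectors have unit length; in that case it equals the cosine similarity. Here is a small sketch with made-up two-dimensional vectors (not real embeddings) that shows the relationship.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, independent of their lengths."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy vectors (assumption: chosen only to keep the arithmetic simple).
v1 = np.array([3.0, 4.0])
v2 = np.array([4.0, 3.0])

# After scaling to unit length, the plain dot product equals the cosine similarity.
u1 = v1 / np.linalg.norm(v1)
u2 = v2 / np.linalg.norm(v2)

print(cosine_similarity(v1, v2))
print(float(np.dot(u1, u2)))
```

Both values come out to 24/25 = 0.96 (up to floating-point rounding), even though the raw dot product of `v1` and `v2` is 24.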

Let's now get back to the real-world example. It's time to create **embeddings** for all the chunks of the PDFs and then store them in a **vector store**. \
- The vector store that we'll use for this lesson is **Chroma**.
- LangChain integrates with over 30 different vector stores. 
- We choose **Chroma** because it's **lightweight and in memory**, which makes it very easy to get started with. 

## Vectorstores

In [None]:
# !pip install chromadb

In [14]:
from langchain.vectorstores import Chroma

We're going to want to save this vector store so that we can use it in future lessons. So, let's create a variable called `persist_directory`, set to `docs/chroma/`, which we will use later on.

In [15]:
persist_directory = 'docs/chroma/'

Let's also just make sure that nothing is there already. If there's stuff there already, it can throw things off and we don't want that to happen.

In [16]:
import shutil
shutil.rmtree('./docs/chroma', ignore_errors=True)  # remove old database files if any (works on Windows too)


Let's now create the vector store. 

In [17]:
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory
)

If we take a look at the collection count after doing this, we can see that it's 209, which is the same as the number of splits that we had from before.

In [18]:
print(vectordb._collection.count())

209


### Similarity Search

Let's now start using it. Let's think of a question that we can ask of this data. We know that this is about a class lecture.

In [19]:
question = "is there an email i can ask for help"

- We're going to use the **similarity search method**: we pass in the question, and we also pass in `k=3`. 
- `k` specifies the number of documents that we want to return.

In [20]:
docs = vectordb.similarity_search(question,k=3)

In [21]:
len(docs)

3

If we take a look at the content of the first document, we can see that it is in fact about an email address, cs229-qa@cs.stanford.edu. This is the email that we can send questions to, and it is read by all the TAs.

In [22]:
docs[0].page_content

"cs229-qa@cs.stanford.edu. This goes to an acc ount that's read by all the TAs and me. So \nrather than sending us email individually, if you send email to this account, it will \nactually let us get back to you maximally quickly with answers to your questions.  \nIf you're asking questions about homework probl ems, please say in the subject line which \nassignment and which question the email refers to, since that will also help us to route \nyour question to the appropriate TA or to me  appropriately and get the response back to \nyou quickly.  \nLet's see. Skipping ahead — let's see — for homework, one midterm, one open and term \nproject. Notice on the honor code. So one thi ng that I think will help you to succeed and \ndo well in this class and even help you to enjoy this cla ss more is if you form a study \ngroup.  \nSo start looking around where you' re sitting now or at the end of class today, mingle a \nlittle bit and get to know your classmates. I strongly encourage you to f

After doing so, let's make sure to **persist** the vector database so that we can use it in future lessons by running `vectordb.persist()`. 

In [23]:
vectordb.persist()

This has covered the basics of semantic search and shown us that we can get pretty good results based on embeddings alone. But it isn't perfect, so here we'll go over a few edge cases and show where it can fail.

## Failure modes

This seems great, and basic similarity search will get you 80% of the way there very easily. 

But there are some failure modes that can creep up. 

Here are some edge cases that can arise - we'll fix them in the next class.

In [24]:
question = "what did they say about matlab?"

In [25]:
docs = vectordb.similarity_search(question,k=5)


Notice that we're getting duplicate chunks (because of the duplicate `MachineLearning-Lecture01.pdf` in the index).

Semantic search fetches all similar documents, but does not enforce diversity.

`docs[0]` and `docs[1]` are identical.

In [26]:
docs[0]

Document(page_content='those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people call it a free ve rsion of MATLAB, which it sort  of is, sort of isn\'t.  \nSo I guess for those of you that haven\'t s een MATLAB before, and I know most of you \nhave, MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to \nplot data. And it\'s sort of an extremely easy to  learn tool to use for implementing a lot of \nlearning algorithms.  \nAnd in case some of you want to work on your  own home computer or something if you \ndon\'t have a MATLAB license, for the purposes of  this class, there\'s also — [inaudible] \nwrite that down [inaudible] MATLAB — there\' s also a software package called Octave \nthat you can download for free off the Internet. And it has somewhat fewer features than MATLAB, but it\'s free, and for the purposes of  this class,

In [27]:
docs[1]

Document(page_content='those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people call it a free ve rsion of MATLAB, which it sort  of is, sort of isn\'t.  \nSo I guess for those of you that haven\'t s een MATLAB before, and I know most of you \nhave, MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to \nplot data. And it\'s sort of an extremely easy to  learn tool to use for implementing a lot of \nlearning algorithms.  \nAnd in case some of you want to work on your  own home computer or something if you \ndon\'t have a MATLAB license, for the purposes of  this class, there\'s also — [inaudible] \nwrite that down [inaudible] MATLAB — there\' s also a software package called Octave \nthat you can download for free off the Internet. And it has somewhat fewer features than MATLAB, but it\'s free, and for the purposes of  this class,
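As a rough illustration of what "enforcing diversity" could mean in the simplest case, here is a sketch that drops exact duplicate chunks from a result list. This is not the approach covered in the next lecture, and the helper `drop_duplicate_chunks` and the shortened strings are hypothetical; it only handles verbatim duplicates, not near-duplicates.

```python
def drop_duplicate_chunks(chunks):
    """Keep the first occurrence of each distinct text, preserving the original order."""
    seen = set()
    unique = []
    for text in chunks:
        if text not in seen:
            seen.add(text)
            unique.append(text)
    return unique

# Shortened stand-ins for retrieved page contents (assumption: not the real chunks).
retrieved = [
    "homeworks in MATLAB or Octave",
    "homeworks in MATLAB or Octave",
    "Octave is free",
]
unique_chunks = drop_duplicate_chunks(retrieved)
print(unique_chunks)
# ['homeworks in MATLAB or Octave', 'Octave is free']
```

With LangChain documents, the same idea would compare `doc.page_content` values, but removing duplicates after retrieval still wastes some of the `k` slots on repeated content.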

Here is another failure mode.

The question below asks about the third lecture, but the results include chunks from other lectures as well.

In [28]:
question = "what did they say about regression in the third lecture?"

In [29]:
docs = vectordb.similarity_search(question,k=5)

In [30]:
for doc in docs:
    print(doc.metadata)

{'page': 0, 'source': 'docs/MachineLearning-Lecture03.pdf'}
{'page': 2, 'source': 'docs/MachineLearning-Lecture02.pdf'}
{'page': 14, 'source': 'docs/MachineLearning-Lecture03.pdf'}
{'page': 4, 'source': 'docs/MachineLearning-Lecture03.pdf'}
{'page': 0, 'source': 'docs/MachineLearning-Lecture02.pdf'}


In [31]:
print(docs[4].page_content)

really makes a difference between a good so lution and amazing solution. And to give 
everyone to just how we do points assignments, or what is it that causes a solution to get 
full marks, or just how to write amazing so lutions. Becoming a grad er is usually a good 
way to do that.  
Graders are paid positions and you also get free  food, and it's usually fun for us to sort of 
hang out for an evening and grade all the a ssignments. Okay, so I will send email. So 
don't email me yet if you want to be a grader. I'll send email to the entire class later with 
the administrative details and to solicit app lications. So you can email us back then, to 
apply, if you'd be interested in being a grader.  
Okay, any questions about that? All right, okay, so let's get started with today's material. 
So welcome back to the second lecture. What  I want to do today is talk about linear 
regression, gradient descent, and the norma l equations. And I should also say, lecture 
notes have been posted
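The embedding captures the overall meaning of the question, but it does not treat "the third lecture" as a hard constraint, so chunks from the second lecture about regression still rank highly. To show that the information needed to fix this is already in the metadata, here is a sketch that filters results by `source` after retrieval. The plain dictionaries stand in for LangChain `Document` objects, and `filter_by_source` is a hypothetical helper, not a LangChain function.

```python
def filter_by_source(results, source):
    """Keep only results whose metadata 'source' matches the requested file."""
    return [r for r in results if r["metadata"].get("source") == source]

# Metadata values copied from the output above; the dict layout is an assumption.
results = [
    {"metadata": {"page": 0, "source": "docs/MachineLearning-Lecture03.pdf"}},
    {"metadata": {"page": 2, "source": "docs/MachineLearning-Lecture02.pdf"}},
    {"metadata": {"page": 14, "source": "docs/MachineLearning-Lecture03.pdf"}},
    {"metadata": {"page": 4, "source": "docs/MachineLearning-Lecture03.pdf"}},
    {"metadata": {"page": 0, "source": "docs/MachineLearning-Lecture02.pdf"}},
]

lecture3_only = filter_by_source(results, "docs/MachineLearning-Lecture03.pdf")
print([r["metadata"]["page"] for r in lecture3_only])
# [0, 14, 4]
```

Filtering after the search leaves us with only three of the five requested results, which is one reason it is better to apply such a constraint during retrieval.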

Approaches discussed in the next lecture can be used to address both!