# Vector Stores

After loading the documents and splitting them into chunks, we create an embedding for each chunk and store it in a vector store. A vector store is a database designed for looking up similar vectors, which is what we need when searching for documents relevant to a question. At query time we embed the question, compare that embedding with the vectors in the store, and pick the n most similar. We then pass those n chunks, together with the question, to an LLM and get back an answer.
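The lookup step described above can be sketched in plain Python. This is a minimal illustration, not how Chroma is implemented: the two-dimensional vectors, the chunk labels, and the helper names `dot` and `most_similar` are all made up for the example, and the vectors are chosen to be unit length so that a dot product works as the similarity score.

```python
import heapq

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def most_similar(question_vector, store, n):
    """Return the n (score, text) pairs with the highest dot-product score."""
    scored = ((dot(question_vector, vector), text) for text, vector in store)
    return heapq.nlargest(n, scored)

# Hypothetical unit-length 2-d "embeddings"; real ones have far more dimensions.
toy_store = [
    ("chunk about MATLAB", [1.0, 0.0]),
    ("chunk about study groups", [0.0, 1.0]),
    ("chunk about the course email", [0.6, 0.8]),
]
toy_question_vector = [0.8, 0.6]

# Score every stored vector against the question and keep the best two.
top_two = most_similar(toy_question_vector, toy_store, n=2)
for score, text in top_two:
    print(f"{score:.2f}  {text}")
```

A real vector store typically replaces this exhaustive scan with an index, but the inputs and outputs have the same shape: a question vector goes in, the n closest chunks come out.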

In [1]:
import os
import openai
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key  = os.environ['OPENAI_API_KEY']

In [2]:
from langchain_community.document_loaders import PyPDFLoader

# Load PDF
loaders = [
    # Duplicate documents on purpose - messy data
    PyPDFLoader("../public/cs229_lectures/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("../public/cs229_lectures/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("../public/cs229_lectures/MachineLearning-Lecture02.pdf"),
    PyPDFLoader("../public/cs229_lectures/MachineLearning-Lecture03.pdf")
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

In [3]:
# Split
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)

In [4]:
splits = text_splitter.split_documents(docs)

In [5]:
len(splits)

209
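To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sliding-window splitter. It is only a sketch: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph breaks, newlines, and spaces, which this hypothetical helper ignores. The function name and the tiny sizes are assumptions made for illustration.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into chunks of at most chunk_size characters, where each
    chunk repeats the last chunk_overlap characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early enough that the final chunk is never just repeated overlap.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]

# Tiny sizes so the overlap is easy to see; the notebook uses 1500 and 150.
toy_chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(toy_chunks)
```

The overlap means that a sentence falling on a chunk boundary still appears with some of its context in the neighboring chunk.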

## Embeddings 

Let's take the splits and embed them.

In [6]:
from langchain_openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings()

In [7]:
# The first two sentences are very similar; the third one is unrelated
sentence1 = "i like dogs"
sentence2 = "i like canines"
sentence3 = "the weather is ugly outside"

In [8]:
# We can then use the embedding class to create an embedding for each sentence:
embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)


In [9]:
# We can use NumPy to compare the similarity between the embeddings:
import numpy as np

In [10]:
# As expected the first two embeddings have a high similarity score:
np.dot(embedding1, embedding2)

0.9631227500523609

In [11]:
# If we compare the first and the third embedding, we get a lower similarity score:
np.dot(embedding1, embedding3)

0.7703257495981695

In [12]:
# If we compare the second and the third embedding, we also get a lower similarity score:
np.dot(embedding2, embedding3)

0.7591627401108028
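The cells above use a plain dot product as the similarity score. That matches cosine similarity only when the vectors have unit length, which OpenAI documents for its embeddings; for vectors of arbitrary length the two scores differ. The sketch below checks this on made-up two-dimensional vectors; the helper names are assumptions for illustration.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b, independent of their lengths."""
    return dot(a, b) / (norm(a) * norm(b))

def normalize(a):
    """Scale a vector to unit length."""
    length = norm(a)
    return [x / length for x in a]

# Hypothetical raw vectors that are NOT unit length.
raw_a = [3.0, 4.0]
raw_b = [4.0, 3.0]

print(dot(raw_a, raw_b))                        # grows with vector length
print(cosine_similarity(raw_a, raw_b))          # depends only on direction
print(dot(normalize(raw_a), normalize(raw_b)))  # agrees with the cosine score
```

If you switch to an embedding model that does not normalize its output, use the cosine form (or normalize first) so that scores stay comparable.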

## Vectorstores

In [13]:
# Chroma is a lightweight vector database that can run in memory, which makes it easy to get started with.
# Hosted solutions such as Pinecone are a good fit for larger datasets.
#! pip install chromadb

In [14]:
from langchain_community.vectorstores import Chroma
# Directory for Chroma's persist_directory keyword argument, so the embeddings are saved to disk:
persist_directory = 'docs/chroma/'

In [15]:
# Make sure no index is left over from a previous run
!rm -rf ./docs/chroma

In [16]:
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory
)

In [17]:
# The collection count matches the number of splits we had above:
print(vectordb._collection.count())


209


## Similarity Search

In [18]:
question = "is there an email i can ask for help?"

In [19]:
# Return the 3 documents that are most similar to the question:
docs = vectordb.similarity_search(question,k=3)

In [20]:
len(docs)

3

In [22]:
print(docs[0].page_content)

cs229-qa@cs.stanford.edu. This goes to an acc ount that's read by all the TAs and me. So 
rather than sending us email individually, if you send email to this account, it will 
actually let us get back to you maximally quickly with answers to your questions.  
If you're asking questions about homework probl ems, please say in the subject line which 
assignment and which question the email refers to, since that will also help us to route 
your question to the appropriate TA or to me  appropriately and get the response back to 
you quickly.  
Let's see. Skipping ahead — let's see — for homework, one midterm, one open and term 
project. Notice on the honor code. So one thi ng that I think will help you to succeed and 
do well in this class and even help you to enjoy this cla ss more is if you form a study 
group.  
So start looking around where you' re sitting now or at the end of class today, mingle a 
little bit and get to know your classmates. I strongly encourage you to form study gro

In [24]:
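# Let's save this so we can use it later.
# (Chroma 0.4 and later persist automatically when persist_directory is set, so this call may be unnecessary there.)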
vectordb.persist()

## Failure modes

This works well, and basic similarity search will get you roughly 80% of the way there with little effort.

But a few failure modes can creep in.

In [25]:
question = "what did they say about matlab?"

In [26]:
docs = vectordb.similarity_search(question,k=5)

In [36]:
print(docs[0].page_content)

those homeworks will be done in either MATLA B or in Octave, which is sort of — I 
know some people call it a free ve rsion of MATLAB, which it sort  of is, sort of isn't.  
So I guess for those of you that haven't s een MATLAB before, and I know most of you 
have, MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to 
plot data. And it's sort of an extremely easy to  learn tool to use for implementing a lot of 
learning algorithms.  
And in case some of you want to work on your  own home computer or something if you 
don't have a MATLAB license, for the purposes of  this class, there's also — [inaudible] 
write that down [inaudible] MATLAB — there' s also a software package called Octave 
that you can download for free off the Internet. And it has somewhat fewer features than MATLAB, but it's free, and for the purposes of  this class, it will work for just about 
everythin

In [37]:
print(docs[1].page_content)

those homeworks will be done in either MATLA B or in Octave, which is sort of — I 
know some people call it a free ve rsion of MATLAB, which it sort  of is, sort of isn't.  
So I guess for those of you that haven't s een MATLAB before, and I know most of you 
have, MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to 
plot data. And it's sort of an extremely easy to  learn tool to use for implementing a lot of 
learning algorithms.  
And in case some of you want to work on your  own home computer or something if you 
don't have a MATLAB license, for the purposes of  this class, there's also — [inaudible] 
write that down [inaudible] MATLAB — there' s also a software package called Octave 
that you can download for free off the Internet. And it has somewhat fewer features than MATLAB, but it's free, and for the purposes of  this class, it will work for just about 
everythin

Notice that the first two results are identical chunks, because `MachineLearning-Lecture01.pdf` was loaded twice and so appears twice in the index.
Semantic search returns the most similar chunks, but it does nothing to enforce diversity among them.
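One common way to add diversity is maximal marginal relevance (MMR), which LangChain vector stores expose as `max_marginal_relevance_search`. The block below is not that implementation; it is a small plain-Python sketch of the idea, with made-up unit-length vectors and a hypothetical `mmr_select` helper. Each step picks the candidate that is most relevant to the question after subtracting a penalty for resembling something already picked.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def mmr_select(question_vector, candidates, k, lambda_mult=0.5):
    """Return the indices of k candidates, trading relevance to the question
    against similarity to candidates already selected.
    Vectors are assumed to be unit length, so dot product = cosine similarity."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            relevance = dot(question_vector, candidates[i])
            redundancy = max(
                (dot(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical vectors: candidates 0 and 1 are exact duplicates.
candidate_vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
question_vector = [0.8, 0.6]

picked = mmr_select(question_vector, candidate_vectors, k=2)
print(picked)
```

Ranking by similarity alone would return the two duplicates here; with the redundancy penalty, the second pick is the less similar but distinct candidate. The `lambda_mult` weight controls the trade-off: values near 1 behave like plain similarity search, and values near 0 favor diversity.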


In [38]:
question = "what did they say about regression in the third lecture?"

In [39]:
docs = vectordb.similarity_search(question,k=5)

In [40]:
# Looping through the metadata of the results reveals another failure mode.
# The question asks about the third lecture, but the results include chunks from the second lecture as well.
# The search picks up on the word "regression" but ignores the "third lecture" part of the question,
# because that is structured information that the semantic embedding does not capture well.
for doc in docs:
    print(doc.metadata)

{'page': 0, 'source': '../public/cs229_lectures/MachineLearning-Lecture03.pdf'}
{'page': 2, 'source': '../public/cs229_lectures/MachineLearning-Lecture02.pdf'}
{'page': 14, 'source': '../public/cs229_lectures/MachineLearning-Lecture03.pdf'}
{'page': 4, 'source': '../public/cs229_lectures/MachineLearning-Lecture03.pdf'}
{'page': 0, 'source': '../public/cs229_lectures/MachineLearning-Lecture02.pdf'}


In [41]:
print(docs[4].page_content)

really makes a difference between a good so lution and amazing solution. And to give 
everyone to just how we do points assignments, or what is it that causes a solution to get 
full marks, or just how to write amazing so lutions. Becoming a grad er is usually a good 
way to do that.  
Graders are paid positions and you also get free  food, and it's usually fun for us to sort of 
hang out for an evening and grade all the a ssignments. Okay, so I will send email. So 
don't email me yet if you want to be a grader. I'll send email to the entire class later with 
the administrative details and to solicit app lications. So you can email us back then, to 
apply, if you'd be interested in being a grader.  
Okay, any questions about that? All right, okay, so let's get started with today's material. 
So welcome back to the second lecture. What  I want to do today is talk about linear 
regression, gradient descent, and the norma l equations. And I should also say, lecture 
notes have been posted
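A typical remedy is to treat "third lecture" as a metadata constraint instead of hoping the embedding captures it. LangChain's Chroma wrapper accepts a `filter` dictionary in `similarity_search` for this purpose. The sketch below only imitates the idea in plain Python on the metadata printed above; `filter_by_metadata` is a hypothetical helper, not a library function.

```python
def filter_by_metadata(records, required):
    """Keep only records whose metadata contains every key/value pair in `required`."""
    return [
        record for record in records
        if all(record.get(key) == value for key, value in required.items())
    ]

# Metadata of the five results from the regression query, in ranked order.
retrieved_metadata = [
    {"page": 0, "source": "../public/cs229_lectures/MachineLearning-Lecture03.pdf"},
    {"page": 2, "source": "../public/cs229_lectures/MachineLearning-Lecture02.pdf"},
    {"page": 14, "source": "../public/cs229_lectures/MachineLearning-Lecture03.pdf"},
    {"page": 4, "source": "../public/cs229_lectures/MachineLearning-Lecture03.pdf"},
    {"page": 0, "source": "../public/cs229_lectures/MachineLearning-Lecture02.pdf"},
]

third_lecture_only = filter_by_metadata(
    retrieved_metadata,
    {"source": "../public/cs229_lectures/MachineLearning-Lecture03.pdf"},
)
for metadata in third_lecture_only:
    print(metadata)
```

Filtering after retrieval, as done here, can leave fewer than k results. Passing the constraint to the vector store itself lets it return the k best matches among the documents that satisfy the filter.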