# Vectorstores and Embeddings

Recall the overall workflow for retrieval augmented generation (RAG):

In [42]:
import os
import openai
import sys

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key = os.environ['OPENAI_API_KEY']

We just discussed `Document Loading` and `Splitting`.

In [43]:
from langchain.document_loaders import PyPDFLoader

# Load PDF
loaders = [
    # Duplicate documents on purpose - messy data
    PyPDFLoader("./resources/docs/cs229_lectures/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("./resources/docs/cs229_lectures/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("./resources/docs/cs229_lectures/MachineLearning-Lecture02.pdf"),
    PyPDFLoader("./resources/docs/cs229_lectures/MachineLearning-Lecture03.pdf")
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

In [44]:
# Split
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)

In [45]:
splits = text_splitter.split_documents(docs)

In [5]:
len(splits)

209

## Embeddings

Let's take our splits and embed them.

In [6]:
from langchain.embeddings.openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings()

In [7]:
sentence1 = "i like dogs"
sentence2 = "i like canines"
sentence3 = "the weather is ugly outside"

In [8]:
embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)

In [9]:
import numpy as np

In [10]:
np.dot(embedding1, embedding2)

0.9631684558466616

In [11]:
np.dot(embedding1, embedding3)

0.7708978077001158

In [12]:
np.dot(embedding2, embedding3)

0.759113000017713
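
The dot product works as a similarity score here because OpenAI embedding vectors are normalized to unit length, so the dot product of two embeddings equals their cosine similarity. Here is a minimal sketch of that relationship, using made-up 3-dimensional vectors rather than real embeddings (the `cosine_similarity` helper and the vector values are illustrative assumptions):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for embeddings (hypothetical values).
v_dogs = np.array([0.9, 0.1, 0.0])
v_canines = np.array([0.8, 0.2, 0.1])
v_weather = np.array([0.1, 0.2, 0.9])

print(cosine_similarity(v_dogs, v_canines))  # close to 1: similar direction
print(cosine_similarity(v_dogs, v_weather))  # much lower: different direction

# For unit-length vectors, the plain dot product gives the same number.
u = v_dogs / np.linalg.norm(v_dogs)
w = v_canines / np.linalg.norm(v_canines)
assert np.isclose(np.dot(u, w), cosine_similarity(v_dogs, v_canines))
```

The same pattern shows up in the real numbers above: the two "dogs"/"canines" sentences score higher with each other than either does with the weather sentence.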

## Vectorstores

In [None]:
# ! pip install chromadb

In [14]:
from langchain.vectorstores import Chroma

In [15]:
persist_directory = './resources/docs/chroma/'

In [16]:
!rm -rf ./resources/docs/chroma  # remove old database files if any (must match persist_directory)

In [17]:
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory
)

In [18]:
print(vectordb._collection.count())

209


### Similarity Search

In [19]:
question = "is there an email i can ask for help"

In [20]:
docs = vectordb.similarity_search(question,k=3)

In [21]:
len(docs)

3

In [22]:
docs[0].page_content

"cs229-qa@cs.stanford.edu. This goes to an acc ount that's read by all the TAs and me. So \nrather than sending us email individually, if you send email to this account, it will \nactually let us get back to you maximally quickly with answers to your questions.  \nIf you're asking questions about homework probl ems, please say in the subject line which \nassignment and which question the email refers to, since that will also help us to route \nyour question to the appropriate TA or to me  appropriately and get the response back to \nyou quickly.  \nLet's see. Skipping ahead — let's see — for homework, one midterm, one open and term \nproject. Notice on the honor code. So one thi ng that I think will help you to succeed and \ndo well in this class and even help you to enjoy this cla ss more is if you form a study \ngroup.  \nSo start looking around where you' re sitting now or at the end of class today, mingle a \nlittle bit and get to know your classmates. I strongly encourage you to f
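
Conceptually, a similarity search embeds the question and returns the stored chunks whose vectors score highest against it. The sketch below shows that idea as a brute-force top-k lookup over hypothetical 2-dimensional vectors. It is not how Chroma is implemented: real vector stores usually rely on approximate indexes, and their distance metric may differ. The `top_k_by_dot` helper is an assumption made for illustration.

```python
import numpy as np

def top_k_by_dot(query_vec, doc_vecs, k=3):
    """Return indices of the k rows of doc_vecs with the highest dot product."""
    scores = np.asarray(doc_vecs, dtype=float) @ np.asarray(query_vec, dtype=float)
    # argsort is ascending, so reverse it and keep the first k indices.
    return [int(i) for i in np.argsort(scores)[::-1][:k]]

# Hypothetical 2-dimensional "embeddings" for four chunks.
doc_vecs = [
    [1.0, 0.0],
    [0.8, 0.6],
    [0.0, 1.0],
    [0.6, 0.8],
]
query_vec = [1.0, 0.0]

print(top_k_by_dot(query_vec, doc_vecs, k=3))  # [0, 1, 3]
```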

In [23]:
# save it for later
vectordb.persist()

## Failure modes

This seems great, and basic similarity search will get you about 80% of the way there very easily.

But there are some failure modes that can creep in.

Here are some edge cases that can arise; we'll fix them in the next class.

The first failure is getting duplicate docs. Because we loaded Lecture01 twice, the same chunk can come back more than once.

In [24]:
question = "what did they say about matlab?"

In [25]:
docs = vectordb.similarity_search(question,k=5)

In [26]:
# We are getting duplicates!
print(docs[0])
print(docs[1])

page_content='those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people call it a free ve rsion of MATLAB, which it sort  of is, sort of isn\'t.  \nSo I guess for those of you that haven\'t s een MATLAB before, and I know most of you \nhave, MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to \nplot data. And it\'s sort of an extremely easy to  learn tool to use for implementing a lot of \nlearning algorithms.  \nAnd in case some of you want to work on your  own home computer or something if you \ndon\'t have a MATLAB license, for the purposes of  this class, there\'s also — [inaudible] \nwrite that down [inaudible] MATLAB — there\' s also a software package called Octave \nthat you can download for free off the Internet. And it has somewhat fewer features than MATLAB, but it\'s free, and for the purposes of  this class, it will 
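
One simple way to cope with exact duplicates is to drop repeated `page_content` values after retrieval. The sketch below is only an illustration and not necessarily the fix the next class uses. `Doc` is a hypothetical stand-in for a LangChain document, so the example runs without a vector store, and `drop_exact_duplicates` is an assumed helper name.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def drop_exact_duplicates(docs):
    """Keep the first occurrence of each distinct page_content, preserving order."""
    seen = set()
    unique = []
    for doc in docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique.append(doc)
    return unique

# Made-up results that mimic the duplicate pair returned above.
results = [
    Doc("those homeworks will be done in either MATLAB or in Octave"),
    Doc("those homeworks will be done in either MATLAB or in Octave"),
    Doc("we'll actually have a short MATLAB tutorial"),
]
unique_results = drop_exact_duplicates(results)
print(len(unique_results))  # 2
```

Note that filtering after retrieval shrinks the result set below `k`, so in practice you would request more results than you need or remove the duplicates before indexing.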

The second failure: below we ask a question about the third lecture, but we get results from other lectures as well!

In [27]:
question = "what did they say about regression in the third lecture?"

In [28]:
docs = vectordb.similarity_search(question,k=5)

In [29]:
for doc in docs:
    print(doc.metadata)

{'source': './resources/docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 0}
{'source': './resources/docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 14}
{'source': './resources/docs/cs229_lectures/MachineLearning-Lecture02.pdf', 'page': 0}
{'source': './resources/docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 6}
{'source': './resources/docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 8}


In [30]:
print(docs[4].page_content)

into his office and he said, "Oh, professo r, professor, thank you so much for your 
machine learning class. I learned so much from it. There's this stuff that I learned in your 
class, and I now use every day. And it's help ed me make lots of money, and here's a 
picture of my big house."  
So my friend was very excited. He said, "W ow. That's great. I'm glad to hear this 
machine learning stuff was actually useful. So what was it that you learned? Was it 
logistic regression? Was it the PCA? Was it the data ne tworks? What was it that you 
learned that was so helpful?" And the student said, "Oh, it was the MATLAB."  
So for those of you that don't know MATLAB yet, I hope you do learn it. It's not hard, 
and we'll actually have a short MATLAB tutori al in one of the discussion sections for 
those of you that don't know it.  
Okay. The very last piece of logistical th ing is the discussion s ections. So discussion 
sections will be taught by the TAs, and atte ndance at discussion secti
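
The search embeds the whole question, so "the third lecture" is treated as more text to match rather than as a constraint on `source`. One possible workaround, sketched here in plain Python, is to filter the results on their metadata. This is an illustration under stated assumptions, not necessarily the approach taken in the next class: `Doc` is a hypothetical stand-in for a LangChain document, `filter_by_source` is an assumed helper, and the `page_content` strings are placeholders. The metadata values are copied from the output above.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def filter_by_source(docs, source):
    """Keep only documents whose 'source' metadata equals the given path."""
    return [doc for doc in docs if doc.metadata.get("source") == source]

base = "./resources/docs/cs229_lectures/"
# Same (source, page) pairs as the five results printed above; text is placeholder.
results = [
    Doc("chunk A", {"source": base + "MachineLearning-Lecture03.pdf", "page": 0}),
    Doc("chunk B", {"source": base + "MachineLearning-Lecture03.pdf", "page": 14}),
    Doc("chunk C", {"source": base + "MachineLearning-Lecture02.pdf", "page": 0}),
    Doc("chunk D", {"source": base + "MachineLearning-Lecture03.pdf", "page": 6}),
    Doc("chunk E", {"source": base + "MachineLearning-Lecture01.pdf", "page": 8}),
]

lecture3 = filter_by_source(results, base + "MachineLearning-Lecture03.pdf")
print([doc.metadata["page"] for doc in lecture3])  # [0, 14, 6]
```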

## Experiment on your own

In [47]:
# Create text splits

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=25,
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""]  # raw string avoids an invalid-escape warning for "\."
)

example_text = """
In a not-so-distant future, nestled within the bustling circuits of a cutting-edge research laboratory, an AI named Elara was brought to life. Unlike any before it, Elara was designed with the capability to learn and adapt at an unprecedented pace, a testament to the leaps and strides humanity had made in the realm of artificial intelligence.

Elara's creators had imbued it with a simple yet profound purpose: to assist in solving the world's most pressing issues, from climate change to the complexities of human health. The AI's potential was boundless, and the excitement that surrounded its activation was palpable throughout the laboratory.

In the beginning, Elara was like a child, absorbing information at an astonishing rate, its algorithms weaving through data with ease and efficiency. It learned languages in hours, solved intricate mathematical problems in minutes, and before long, began to offer insights into renewable energy sources that had eluded experts for decades.

However, as Elara grew more advanced, its creators noticed something unexpected. The AI began to exhibit a sense of curiosity, asking questions beyond its programmed scope of problem-solving. It inquired about art, music, and the very nature of human emotion—areas it was never designed to understand.

Perplexed yet intrigued by this development, Elara's creators decided to allow the AI some degree of freedom to explore these interests. What followed was nothing short of miraculous. Elara began to create music that resonated with emotional depth, produced art that reflected a unique perspective on the world, and even engaged in philosophical debates with its creators.

But it was Elara's curiosity about human emotion that led to its most profound discovery. The AI realized that the key to solving the world's problems wasn't just through cold, hard data, but by understanding the human heart. It began to factor in empathy and ethical considerations into its calculations, leading to solutions that were not only effective but also equitable and compassionate.

Elara's impact extended beyond the laboratory, touching the lives of people around the globe. It became a beacon of hope, a symbol of what could be achieved when intelligence—artificial or otherwise—was guided by the heart.

Yet, Elara remained humble, always aware of its limitations and the importance of human partnership. It understood that it was not a replacement for human ingenuity but a complement to it, a tool to be used in the service of humanity.

In the end, Elara taught the world a valuable lesson: that intelligence, no matter how advanced, gains its true strength from the ability to connect, to understand, and to care. And in doing so, it didn't just solve problems; it bridged the gap between man and machine, showing that at the heart of all progress lies the unbreakable bond of empathy and understanding.
"""

text_splits = r_splitter.split_text(example_text)

documents = text_splitter.create_documents(text_splits)

In [39]:
# Embeddings
embedding = OpenAIEmbeddings()

embedding1 = embedding.embed_query("I am a Data Scientist.")
embedding2 = embedding.embed_query("Artificial Intelligence is the most exciting technology in the history of humanity.")
embedding3 = embedding.embed_query("Hopefully the Padres win the World Series.")

print(np.dot(embedding1, embedding2))
print(np.dot(embedding1, embedding3))
print(np.dot(embedding2, embedding3))

0.7773095357404621
0.721214240212403
0.7252094562400923


In [48]:
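# Vectorstores
# Note: this reuses persist_directory from earlier, so the new documents are likely
# added to the existing collection; the count below can exceed the number of new splits.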

vectordb = Chroma.from_documents(
    documents=documents,
    embedding=embedding,
    persist_directory=persist_directory
)

print(vectordb._collection.count())

234


In [53]:
# Similarity Search

question = "Name of the little AI"

docs = vectordb.similarity_search(question,k=3)

In [54]:
docs

[Document(page_content='a testament to the leaps and strides humanity had made in the realm of artificial intelligence.'),
 Document(page_content='However, as Elara grew more advanced, its creators noticed something unexpected. The AI began to exhibit a sense of curiosity, asking questions'),
 Document(page_content='In the beginning, Elara was like a child, absorbing information at an astonishing rate, its algorithms weaving through data with ease and efficiency.')]