# **Vectorstores and Embeddings**

Recall the overall workflow for retrieval-augmented generation (RAG).

In [1]:
import os

os.environ['GROQ_API_KEY'] = 'YOUR_API_KEY'
os.environ['OPENAI_API_KEY'] = 'YOUR_API_KEY'

In [2]:
! pip install langchain langchain_groq openai


We just discussed `Document Loading` and `Splitting`.

In [5]:
! pip install pypdf langchain_community


In [6]:
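# Document loaders live in langchain_community as of LangChain 0.2
from langchain_community.document_loaders import PyPDFLoader

# Load PDFs
loaders = [
    # Lecture 1 is loaded twice, so the index will contain duplicate chunks
    PyPDFLoader("MachineLearning-Lecture01.pdf"),
    PyPDFLoader("MachineLearning-Lecture01.pdf"),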
    PyPDFLoader("MachineLearning-Lecture02.pdf"),
    PyPDFLoader("MachineLearning-Lecture03.pdf")
]

docs = []
for loader in loaders:
    docs.extend(loader.load())

In [8]:
print(len(docs))

78


We have $78$ pages in total. The next step is to split the documents into chunks.

In [10]:
# Split
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)

splits = text_splitter.split_documents(docs)

In [11]:
print(len(splits))

209


So we split our documents into $209$ chunks.
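To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-width chunking with overlap in plain Python. It is not what `RecursiveCharacterTextSplitter` does internally (that splitter prefers to break on separators such as paragraph and line breaks). The `chunk_text` helper is hypothetical and only illustrates how consecutive chunks share characters.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(text, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one. The overlap serves the same purpose in the notebook: a sentence that straddles a chunk boundary still appears whole in at least one chunk.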

## **Embeddings**

Let's take our splits and embed them.

In [24]:
! pip install sentence-transformers


In [25]:
from langchain_community.embeddings import HuggingFaceEmbeddings

hf = HuggingFaceEmbeddings(
    model_name = "sentence-transformers/all-mpnet-base-v2",
)


Consider $3$ different sentences and compute the similarity between their embeddings.

In [13]:
sentence1 = "Instant Family was a great movie."
sentence2 = "I Really loved Mark Wahlberg acting in Instant Familiy movie."
sentence3 = "I Prefer Arsenal over Liverpool."

In [16]:
! pip install tiktoken



In [26]:
embedding1 = hf.embed_query(sentence1)
embedding2 = hf.embed_query(sentence2)
embedding3 = hf.embed_query(sentence3)

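To measure how similar two embeddings are, we take their dot product: the larger the value, the closer the two sentences are in meaning. For unit-length vectors, the dot product equals the cosine similarity.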

In [27]:
import numpy as np

print("sentence1 and 2 similarity", np.dot(embedding1, embedding2))
print("sentence1 and 3 similarity", np.dot(embedding1, embedding3))
print("sentence2 and 3 similarity", np.dot(embedding2, embedding3))

sentence1 and 2 similarity 0.6511208382313867
sentence1 and 3 similarity 0.021908986391920886
sentence2 and 3 similarity -0.023979944087888654
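The raw dot product depends on vector length as well as direction, so it is only a reliable similarity score when the embeddings are normalized. A minimal sketch of cosine similarity in plain Python follows. The `cosine_similarity` helper and the three toy vectors are made up for illustration and are not real embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, made up for illustration (not real embeddings).
movie_a = [3.0, 4.0, 0.0]
movie_b = [6.0, 8.0, 0.0]   # same direction as movie_a, twice as long
football = [0.0, 0.0, 5.0]  # orthogonal to both

sim_same_direction = cosine_similarity(movie_a, movie_b)
sim_orthogonal = cosine_similarity(movie_a, football)
print(sim_same_direction, sim_orthogonal)
```

Since `embed_query` returns a list of floats, the same helper could be applied to `embedding1`, `embedding2`, and `embedding3` to check whether the dot products above already behave like cosine similarities.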


## **VectorStores**

In [28]:
! pip install chromadb


In [29]:
# Vector store integrations live in langchain_community as of LangChain 0.2
from langchain_community.vectorstores import Chroma

In [30]:
persist_directory = 'docs/chroma/'

vectordb = Chroma.from_documents(
    documents = splits,
    embedding = hf,
    persist_directory = persist_directory
)

In [34]:
print(vectordb._collection.count())

209


The collection holds $209$ entries, the same as the number of splits.

## **Similarity Search**

In [31]:
question = "is there an email i can ask for help"

docs = vectordb.similarity_search(question, k = 3)

In [32]:
print(len(docs))

print(docs[0].page_content)

3
So all right, online resources. The class has a home page, so it's in on the handouts. I 
won't write on the chalkboard — http:// cs229.stanford.edu. And so when there are 
homework assignments or things like that, we  usually won't sort of — in the mission of 
saving trees, we will usually not give out many handouts in class. So homework 
assignments, homework solutions will be posted online at the course home page.  
As far as this class, I've also written, a nd I guess I've also revised every year a set of 
fairly detailed lecture notes that cover the te chnical content of this  class. And so if you 
visit the course homepage, you'll also find the detailed lecture notes that go over in detail 
all the math and equations and so on  that I'll be doing in class.  
There's also a newsgroup, su.class.cs229, also written on the handout. This is a 
newsgroup that's sort of a forum for people in  the class to get to  know each other and 
have whatever discussions you want to ha ve amongst
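Conceptually, `similarity_search` embeds the question and returns the `k` stored chunks whose vectors score highest against it. The sketch below shows that idea as a brute-force scan over a tiny in-memory list. It is an illustration under assumptions, not how Chroma works internally (Chroma relies on an HNSW approximate nearest-neighbor index rather than a linear scan). The `top_k` helper, the texts, and the 2-dimensional vectors are all hypothetical.

```python
import heapq


def dot(a: list[float], b: list[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int) -> list[str]:
    """Return the texts of the k stored vectors with the highest dot product."""
    best = heapq.nlargest(k, store, key=lambda item: dot(query_vec, item[1]))
    return [text for text, _ in best]


# Hypothetical store of (text, unit-length vector) pairs.
store = [
    ("course home page", [1.0, 0.0]),
    ("newsgroup for questions", [0.8, 0.6]),
    ("linear regression", [0.0, 1.0]),
]
query_vec = [0.6, 0.8]  # stand-in for the embedded question
results = top_k(query_vec, store, k=2)
print(results)
```

A linear scan like this costs one dot product per stored chunk, which is fine for 209 chunks but is the reason real vector stores build an index.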

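Let's save this so we can use it later! With Chroma 0.4 and later, a store created with a `persist_directory` is written to disk automatically, so the explicit call below is only needed on older versions.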

In [33]:
vectordb.persist()

## **Failure modes**

- This seems great, and basic similarity search will get you 80% of the way there very easily.

- But there are some failure modes that can creep up.

- Here are some edge cases that can arise - we'll fix them in the next class.

In [35]:
question = "what did they say about matlab?"

docs = vectordb.similarity_search(question, k = 5)

In [39]:
docs[0]

Document(page_content="So later this quarter, we'll use the discussion sections to talk about things like convex \noptimization, to talk a little bit about hidde n Markov models, which is a type of machine \nlearning algorithm for modeling time series and a few other things, so  extensions to the \nmaterials that I'll be covering in the main  lectures. And attend ance at the discussion \nsections is optional, okay?  \nSo that was all I had from l ogistics. Before we move on to start talking a bit about \nmachine learning, let me check what questions you have. Yeah?  \nStudent : [Inaudible] R or something like that?  \nInstructor (Andrew Ng) : Oh, yeah, let's see, right. So our policy has been that you're \nwelcome to use R, but I would strongly advi se against it, mainly because in the last \nproblem set, we actually supply some code th at will run in Octave  but that would be \nsomewhat painful for you to translate into R yourself. So for your other assignments, if \nyou wanna submit 

In [41]:
docs[1].page_content

"So later this quarter, we'll use the discussion sections to talk about things like convex \noptimization, to talk a little bit about hidde n Markov models, which is a type of machine \nlearning algorithm for modeling time series and a few other things, so  extensions to the \nmaterials that I'll be covering in the main  lectures. And attend ance at the discussion \nsections is optional, okay?  \nSo that was all I had from l ogistics. Before we move on to start talking a bit about \nmachine learning, let me check what questions you have. Yeah?  \nStudent : [Inaudible] R or something like that?  \nInstructor (Andrew Ng) : Oh, yeah, let's see, right. So our policy has been that you're \nwelcome to use R, but I would strongly advi se against it, mainly because in the last \nproblem set, we actually supply some code th at will run in Octave  but that would be \nsomewhat painful for you to translate into R yourself. So for your other assignments, if \nyou wanna submit a solution in R, that'

Notice that we're getting duplicate chunks (because of the duplicate `MachineLearning-Lecture01.pdf` in the index).

Semantic search fetches all similar documents, but does not enforce diversity.
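One simple mitigation, assuming the duplicates are exact copies, is to drop repeated chunks by their text, either before indexing or after retrieval. The sketch below uses a hypothetical `Chunk` dataclass as a stand-in for a LangChain `Document`, and the example texts and page numbers are made up. It does not handle near-duplicates and does not by itself make the results more diverse.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def drop_exact_duplicates(chunks: list[Chunk]) -> list[Chunk]:
    """Keep the first occurrence of each distinct page_content, preserving order."""
    seen: set[str] = set()
    unique = []
    for chunk in chunks:
        if chunk.page_content not in seen:
            seen.add(chunk.page_content)
            unique.append(chunk)
    return unique


# Made-up retrieval results: the first two chunks are identical.
retrieved = [
    Chunk("chunk about R and Octave", {"page": 1}),
    Chunk("chunk about R and Octave", {"page": 1}),
    Chunk("chunk about discussion sections", {"page": 2}),
]
unique = drop_exact_duplicates(retrieved)
print(len(unique))
```

Deduplicating before calling `Chroma.from_documents` would also keep the duplicates out of the index, so they would not use up slots in the top `k`.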

There is a second failure mode.

The question below asks about the third lecture only, yet the results include chunks from other lectures as well.

In [44]:
question = "what did they say about regression, only in the third lecture?"

docs = vectordb.similarity_search(question, k = 5)

In [45]:
for doc in docs:
    print(doc.metadata)

{'page': 2, 'source': 'MachineLearning-Lecture02.pdf'}
{'page': 0, 'source': 'MachineLearning-Lecture03.pdf'}
{'page': 7, 'source': 'MachineLearning-Lecture03.pdf'}
{'page': 6, 'source': 'MachineLearning-Lecture03.pdf'}
{'page': 0, 'source': 'MachineLearning-Lecture03.pdf'}


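We can see that the **first document** comes from the **second lecture**, even though we asked for the third lecture only. Similarity search embeds the whole question and ranks chunks by vector closeness; it does not treat "only in the third lecture" as a filter on the `source` metadata.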
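A structured constraint like this is better expressed as a metadata filter than as part of the query text. The sketch below post-filters the metadata records printed above by their `source` field. The `filter_by_source` helper is hypothetical. LangChain's Chroma wrapper also accepts a `filter` argument on `similarity_search`, which applies the constraint at query time, and that is usually preferable because post-filtering can leave fewer than `k` results.

```python
def filter_by_source(results: list[dict], source: str) -> list[dict]:
    """Keep only the metadata records whose 'source' matches exactly."""
    return [meta for meta in results if meta.get("source") == source]


# Metadata of the five results printed above.
results = [
    {"page": 2, "source": "MachineLearning-Lecture02.pdf"},
    {"page": 0, "source": "MachineLearning-Lecture03.pdf"},
    {"page": 7, "source": "MachineLearning-Lecture03.pdf"},
    {"page": 6, "source": "MachineLearning-Lecture03.pdf"},
    {"page": 0, "source": "MachineLearning-Lecture03.pdf"},
]
third_lecture = filter_by_source(results, "MachineLearning-Lecture03.pdf")
print(len(third_lecture))
```

Here only four of the five retrieved chunks survive the filter, which shows why filtering after retrieval can return fewer results than requested.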