# **Retrieval**

Retrieval is the centerpiece of our retrieval-augmented generation (RAG) flow.

Let's rebuild our vector database from the previous notebook.

## **Vectorstore retrieval**


In [1]:
import os

os.environ["GROQ_API_KEY"] = "YOUR-API-KEY"

In [2]:
! pip install langchain langchain_groq langchain_community lark
! pip install sentence-transformers

Collecting langchain
  Downloading langchain-0.2.1-py3-none-any.whl (973 kB)
Collecting langchain_groq
  Downloading langchain_groq-0.1.4-py3-none-any.whl (11 kB)
Collecting langchain_community
  Downloading langchain_community-0.2.1-py3-none-any.whl (2.1 MB)
Collecting lark
  Downloading lark-1.1.9-py3-none-any.whl (111 kB)
Collecting langchain-core<0.3.0,>=0.2.0 (from langchain)
  Downloading langchain_core-0.2.3-py3-none-any.whl (310 kB)
Collecting langchain-text-splitters<0.3.0,>=0.2.0 

First, let's read the documents:

In [3]:
! pip install pypdf

Collecting pypdf
  Downloading pypdf-4.2.0-py3-none-any.whl (290 kB)
Installing collected packages: pypdf
Successfully installed pypdf-4.2.0


In [4]:
from langchain.document_loaders import PyPDFLoader

# Load PDFs
loaders = [
    # Lecture01 is listed twice, so its pages end up duplicated in the index
    PyPDFLoader("MachineLearning-Lecture01.pdf"),
    PyPDFLoader("MachineLearning-Lecture01.pdf"),
    PyPDFLoader("MachineLearning-Lecture02.pdf"),
    PyPDFLoader("MachineLearning-Lecture03.pdf")
]

docs = []
for loader in loaders:
    docs.extend(loader.load())

In [5]:
print(f"Number of pages: {len(docs)}")

Number of pages: 78


We have 78 pages in total. The next step is to split the documents into chunks.

In [6]:
# Split
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)

splits = text_splitter.split_documents(docs)

In [7]:
print(f"Number of chunks: {len(splits)}")

Number of chunks: 209


So, we split our documents into 209 chunks.
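To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-width sliding-window splitter. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and lines); the function name `sliding_window_chunks` and the sample string are made up for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut `text` into windows of `chunk_size` characters, where each window
    repeats the last `chunk_overlap` characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Illustrative input only: 26 characters, windows of 10 with an overlap of 3
sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each printed window starts with the last three characters of the previous one. The overlap exists so that a sentence cut at a chunk boundary still appears whole in at least one chunk.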

## **Embeddings**

In [8]:
from langchain.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings

hf = HuggingFaceEmbeddings(
    model_name = "sentence-transformers/all-mpnet-base-v2"
)


In [9]:
! pip install chromadb

Collecting chromadb
  Downloading chromadb-0.5.0-py3-none-any.whl (526 kB)
Collecting chroma-hnswlib==0.7.3 (from chromadb)
  Downloading chroma_hnswlib-0.7.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.4 MB)
Collecting fastapi>=0.95.2 (from chromadb)
  Downloading fastapi-0.111.0-py3-none-any.whl (91 kB)
Collecting uvicorn[standard]>=0.18.3 (from chromadb)
  Downloading uvicorn-0.30.1-py3-none-any.whl (62 kB)
Collecting posthog>=2.4.0 (from chromadb)
  Downloading posthog-3.5.0-py2.

In [10]:
persist_directory = 'docs/chroma/'

vectordb = Chroma.from_documents(
    documents = splits,
    embedding = hf,
    persist_directory = persist_directory
)

In [11]:
print(vectordb._collection.count())

209


## **Similarity Search**

In [12]:
texts = [
    """The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).""",
    """A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.""",
    """A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.""",
]

In [13]:
smalldb = Chroma.from_texts(
    texts = texts,
    embedding = hf
)

In [14]:
question = "Tell me about all-white mushrooms with large fruiting bodies"

smalldb.similarity_search(question, k = 2)

[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(page_content='A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.')]

In [15]:
smalldb.max_marginal_relevance_search(question, k = 2, fetch_k = 3)

[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(page_content='A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.')]

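We got the same results, weird! With this embedding model, plain similarity search already returns the sentence about toxicity next to the closest match, so MMR ends up choosing the same pair.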

### **Addressing Diversity: Maximum marginal relevance**

In the last class, we introduced one problem: how to enforce diversity in the search results.

`Maximum marginal relevance` strives to achieve both relevance to the query *and diversity* among the results.
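The trade-off between relevance and diversity can be sketched in a few lines of NumPy. This is a minimal greedy version of the general MMR technique, not the vector store's own implementation; the 2-D vectors are made-up stand-ins for embeddings, and `mmr_select` and `cosine` are hypothetical helper names.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedy MMR: repeatedly pick the document that is most relevant to the
    query while being least similar to the documents already picked."""
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        best_idx, best_score = None, -np.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet)
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


# Made-up vectors: documents 0 and 1 are near-duplicates, document 2 is different
query = np.array([1.0, 0.0])
doc_vecs = np.array([
    [0.9, 0.10],
    [0.9, 0.12],
    [0.6, -0.80],
])

by_similarity = sorted(range(len(doc_vecs)), key=lambda i: -cosine(query, doc_vecs[i]))[:2]
by_mmr = mmr_select(query, doc_vecs, k=2, lambda_mult=0.5)
print("similarity:", by_similarity)
print("mmr:", by_mmr)
```

Plain similarity ranking returns the two near-duplicates, while the MMR selection keeps the best match and swaps the duplicate for the less similar document. Setting `lambda_mult` to 1.0 in this sketch removes the diversity penalty and reproduces the plain ranking.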

In [16]:
question = "what did they say about matlab?"

docs_ss = vectordb.similarity_search(question, k = 3)

In [17]:
# First retrieved document
print(docs_ss[0].page_content[:100])

# Second retrieved document
print(docs_ss[1].page_content[:100])

So later this quarter, we'll use the discussion sections to talk about things like convex 
optimizat
So later this quarter, we'll use the discussion sections to talk about things like convex 
optimizat


As in the previous notebook, both documents are the same, because the first lecture was loaded twice and its chunks are duplicated in the index.

In [18]:
docs_mmr = vectordb.max_marginal_relevance_search(question, k = 3)

# First retrieved document
print(docs_mmr[0].page_content[:100])

# Second retrieved document
print(docs_mmr[1].page_content[:100])

So later this quarter, we'll use the discussion sections to talk about things like convex 
optimizat
many biologers are there here? Wow, just a few, not many. I'm surprised. Anyone from 
statistics? Ok


We can see that with **MMR**, the second result is no longer a duplicate of the first.

### **Addressing Specificity: working with metadata**

In the last lecture, we showed that a question about the third lecture can return results from other lectures as well.

To address this, many vectorstores support operations on `metadata`.

`metadata` provides context for each embedded chunk.

In [19]:
question = "what did they say about regression in the third lecture?"

In [20]:
docs = vectordb.similarity_search(
    question,
    k = 3,
    filter = {"source": "MachineLearning-Lecture03.pdf"}
)

In [21]:
for doc in docs:
    print(doc.metadata)

{'page': 0, 'source': 'MachineLearning-Lecture03.pdf'}
{'page': 7, 'source': 'MachineLearning-Lecture03.pdf'}
{'page': 6, 'source': 'MachineLearning-Lecture03.pdf'}


This time, all returned documents belong to the third lecture.
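Conceptually, the `filter` argument narrows the candidate set to chunks whose metadata matches before anything is returned. The sketch below illustrates that idea in plain Python; it is not how Chroma evaluates filters internally, and the chunk dictionaries, placeholder texts, and the helper name `filter_by_metadata` are hypothetical.

```python
def filter_by_metadata(chunks, metadata_filter):
    """Keep only chunks whose metadata equals the filter on every given key."""
    return [
        chunk
        for chunk in chunks
        if all(chunk["metadata"].get(key) == value for key, value in metadata_filter.items())
    ]


# Placeholder chunks; only the metadata layout mirrors the notebook's output
chunks = [
    {"text": "placeholder text A", "metadata": {"source": "MachineLearning-Lecture01.pdf", "page": 1}},
    {"text": "placeholder text B", "metadata": {"source": "MachineLearning-Lecture02.pdf", "page": 3}},
    {"text": "placeholder text C", "metadata": {"source": "MachineLearning-Lecture03.pdf", "page": 0}},
]

matches = filter_by_metadata(chunks, {"source": "MachineLearning-Lecture03.pdf"})
for match in matches:
    print(match["metadata"])
```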

### **Addressing Specificity: working with metadata using self-query retriever**

But we have an interesting challenge: we often want to infer the metadata from the query itself.

To address this, we can use `SelfQueryRetriever`, which uses an LLM to extract:

1. The `query` string to use for vector search
2. A metadata filter to pass in as well

Most vector databases support metadata filters, so this doesn't require any new databases or indexes.
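To show what this decomposition produces, here is a toy stand-in that splits a question into a search string and a metadata filter with a regular expression. `SelfQueryRetriever` asks an LLM to do this step; the regex, the `ORDINAL_TO_SOURCE` mapping, and the function name `toy_self_query` are assumptions made purely for illustration.

```python
import re

# Hypothetical mapping from ordinal words to the lecture files used in this notebook
ORDINAL_TO_SOURCE = {
    "first": "MachineLearning-Lecture01.pdf",
    "second": "MachineLearning-Lecture02.pdf",
    "third": "MachineLearning-Lecture03.pdf",
}


def toy_self_query(question: str):
    """Return (search_text, metadata_filter) extracted from the question.

    A regex stands in for the LLM here, so only phrases of the form
    "in the <ordinal> lecture" are recognised.
    """
    match = re.search(
        r"\s*\bin the (first|second|third) lecture\b", question, flags=re.IGNORECASE
    )
    if match is None:
        return question, {}
    # Remove the lecture phrase from the text that goes to vector search
    search_text = (question[:match.start()] + question[match.end():]).strip()
    return search_text, {"source": ORDINAL_TO_SOURCE[match.group(1).lower()]}


search_text, metadata_filter = toy_self_query(
    "what did they say about regression in the third lecture?"
)
print(search_text)
print(metadata_filter)
```

The search string keeps only the topical part of the question, and the filter has the same shape as the `filter` dictionary passed to `similarity_search` by hand earlier.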

In [22]:
from langchain_groq import ChatGroq
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

In [27]:
metadata_field_info = [
    AttributeInfo(
        name = "source",
        description="The lecture the chunk is from, should be one of `MachineLearning-Lecture01.pdf`, `MachineLearning-Lecture02.pdf`, or `MachineLearning-Lecture03.pdf`",
        type = "string"
    ),

    AttributeInfo(
        name = "page",
        description = "The page from the lecture",
        type = "integer"
    )

]

In [28]:
document_content_description = "Lecture notes"
llm = ChatGroq(
    model = "llama3-70b-8192",
    temperature = 0
)

retriever = SelfQueryRetriever.from_llm(
    llm,
    vectordb,
    document_content_description,
    metadata_field_info,
    verbose = True
)

In [29]:
question = "what did they say about regression in the third lecture?"

**You will receive a warning** about `predict_and_parse` being deprecated the first time you execute the next line. This can be safely ignored.

In [30]:
docs = retriever.get_relevant_documents(question)

for doc in docs:
    print(doc.metadata)

{'page': 0, 'source': 'MachineLearning-Lecture03.pdf'}
{'page': 5, 'source': 'MachineLearning-Lecture03.pdf'}
{'page': 2, 'source': 'MachineLearning-Lecture03.pdf'}
{'page': 0, 'source': 'MachineLearning-Lecture03.pdf'}


## **Other types of retrieval**

It's worth noting that a vector database is not the only kind of tool for retrieving documents.

The `LangChain` retriever abstraction includes other ways to retrieve documents, such as TF-IDF or SVM.
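For intuition about term-based retrieval, here is a minimal TF-IDF ranking sketch in plain Python. It is a simplified textbook weighting and may differ from what the library's retriever computes; the three example texts and the helper names `tokenize` and `tfidf_rank` are made up.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace (no stemming or punctuation handling)."""
    return text.lower().split()


def tfidf_rank(query: str, texts: list[str]) -> list[int]:
    """Return indices of `texts`, best match first, scored by the summed
    TF-IDF weight of the query terms in each text."""
    tokenized = [tokenize(t) for t in texts]
    n_docs = len(texts)
    # Document frequency: in how many texts each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term in counts:
                tf = counts[term] / len(tokens)          # term frequency in this text
                idf = math.log(n_docs / doc_freq[term])  # rarer terms weigh more
                score += tf * idf
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: -scores[i])


# Made-up texts for illustration
texts = [
    "matlab is used for the homework",
    "the class covers supervised learning",
    "the project is done in groups",
]
ranking = tfidf_rank("matlab homework", texts)
print(texts[ranking[0]])
```

Unlike embedding search, this scoring only rewards exact word overlap, which is why a term-based retriever can miss passages that discuss a topic in different words.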

In [31]:
from langchain.retrievers import SVMRetriever
from langchain.retrievers import TFIDFRetriever
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [33]:
# Load PDF
loader = PyPDFLoader("MachineLearning-Lecture01.pdf")
pages = loader.load()

all_page_text = [page.page_content for page in pages]
joined_page_text = " ".join(all_page_text)

# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1500, chunk_overlap = 150)
splits = text_splitter.split_text(joined_page_text)

In [34]:
print(len(splits))

45


In [35]:
# Retrieve
svm_retriever = SVMRetriever.from_texts(splits, hf)
tfidf_retriever = TFIDFRetriever.from_texts(splits)

In [39]:
from IPython.display import display, Markdown

question = "What are major topics for this class?"
docs_svm = svm_retriever.get_relevant_documents(question)

display(Markdown(docs_svm[0].page_content))

let me just check what questions you have righ t now. So if there are no questions, I'll just 
close with two reminders, which are after class today or as you start to talk with other 
people in this class, I just encourage you again to start to form project partners, to try to 
find project partners to do your project with. And also, this is a good time to start forming 
study groups, so either talk to your friends  or post in the newsgroup, but we just 
encourage you to try to star t to do both of those today, okay? Form study groups, and try 
to find two other project partners.  
So thank you. I'm looking forward to teaching this class, and I'll see you in a couple of 
days.   [End of Audio]  
Duration: 69 minutes

In [42]:
question = "what did they say about matlab?"
docs_tfidf = tfidf_retriever.get_relevant_documents(question)

display(Markdown(docs_tfidf[0].page_content))

Saxena and Min Sun here did, wh ich is given an image like this, right? This is actually a 
picture taken of the Stanford campus. You can apply that sort of cl ustering algorithm and 
group the picture into regions. Let me actually blow that up so that you can see it more 
clearly. Okay. So in the middle, you see the lines sort of groupi ng the image together, 
grouping the image into [inaudible] regions.  
And what Ashutosh and Min did was they then  applied the learning algorithm to say can 
we take this clustering and us e it to build a 3D model of the world? And so using the 
clustering, they then had a lear ning algorithm try to learn what the 3D structure of the 
world looks like so that they could come up with a 3D model that you can sort of fly 
through, okay? Although many people used to th ink it's not possible to take a single 
image and build a 3D model, but using a lear ning algorithm and that sort of clustering 
algorithm is the first step. They were able to.  
I'll just show you one more example. I like this  because it's a picture of Stanford with our 
beautiful Stanford campus. So again, taking th e same sort of clustering algorithms, taking 
the same sort of unsupervised learning algor ithm, you can group the pixels into different 
regions. And using that as a pre-processing step, they eventually built this sort of 3D model of Stanford campus in a single picture.  You can sort of walk  into the ceiling, look