# Retrieval

Retrieval is the centerpiece of our retrieval-augmented generation (RAG) flow.

Let's load the vector database (`vectordb`) we built earlier.

## Vectorstore retrieval


In [18]:
import os
import openai

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key  = os.environ['OPENAI_API_KEY']

embedding_model = "text-embedding-3-small"
llm_model = "gpt-4.1-nano"

### Similarity Search

In [2]:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
persist_directory = 'docs/chroma/'

In [3]:
embedding = OpenAIEmbeddings(model=embedding_model)
vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)

In [4]:
print(vectordb._collection.count())

208


In [5]:
texts = [
    """The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).""",
    """A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.""",
    """A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.""",
]

In [6]:
smalldb = Chroma.from_texts(texts, embedding=embedding)

In [8]:
print(smalldb._collection.count())

3


In [9]:
question = "Tell me about all-white mushrooms with large fruiting bodies"

In [10]:
smalldb.similarity_search(question, k=2)

[Document(id='78130b6a-8560-4c9b-9864-15cce48c96f2', metadata={}, page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(id='61018552-1ce9-45ff-924d-9a72ff4ef23c', metadata={}, page_content='The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).')]

In [11]:
smalldb.max_marginal_relevance_search(question, k=2, fetch_k=3)

[Document(id='78130b6a-8560-4c9b-9864-15cce48c96f2', metadata={}, page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(id='0314fc09-c020-44a4-a640-afb23a69a5b8', metadata={}, page_content='A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.')]

### Addressing Diversity: Maximum marginal relevance

Last class we introduced one problem: how to enforce diversity in the search results.
 
`Maximum marginal relevance` (MMR) strives to achieve both relevance to the query *and diversity* among the results.
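
To make the trade-off concrete, here is a minimal sketch of greedy MMR selection in plain Python. The toy 2-D vectors, the `mmr_select` helper, and the `lambda_mult` weight of 0.5 are illustrative assumptions, not the vectorstore's actual implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k document indices, trading relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: similarity to the closest document already selected.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected

# Hypothetical embeddings: documents 0 and 1 are near-duplicates.
query = [1.0, 0.0]
docs = [
    [1.0, 0.10],   # very relevant
    [1.0, 0.11],   # near-duplicate of document 0
    [0.6, -0.80],  # less relevant, but different from document 0
]

top_by_similarity = sorted(
    range(len(docs)), key=lambda i: cosine(query, docs[i]), reverse=True
)[:2]
top_by_mmr = mmr_select(query, docs, k=2)

print(top_by_similarity)  # [0, 1]
print(top_by_mmr)         # [0, 2]
```

Plain similarity returns both near-duplicates, while MMR swaps the second one for a less similar but more diverse document.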

In [5]:
question = "what did they say about matlab?"
docs_ss = vectordb.similarity_search(question, k=3)

In [8]:
print(docs_ss[0].page_content[:100])

those homeworks will be done in either MATLAB or in Octave, which is sort of — I 
know some people c


In [9]:
print(docs_ss[1].page_content[:100])

those homeworks will be done in either MATLAB or in Octave, which is sort of — I 
know some people c


Note the difference in results with `MMR`.

In [10]:
docs_mmr = vectordb.max_marginal_relevance_search(question, k=3)

In [11]:
print(docs_mmr[0].page_content[:100])

those homeworks will be done in either MATLAB or in Octave, which is sort of — I 
know some people c


In [12]:
print(docs_mmr[1].page_content[:100])

least squares regression being a bad idea for classification problems and then I did a 
bunch of mat


### Addressing Specificity: working with metadata

In the last lecture, we showed that a question about the third lecture can return results from other lectures as well.

To address this, many vectorstores support operations on `metadata`.

`metadata` provides context for each embedded chunk.
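
As a rough mental model, a metadata filter narrows the candidate chunks before they are ranked. The sketch below uses plain dictionaries, a word-overlap score in place of embedding similarity, and made-up file names; all of these are assumptions for illustration only.

```python
def word_overlap(query, text):
    """Toy relevance score: number of distinct words shared by query and text."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def matches(metadata, filter_spec):
    """True when every key in the filter has exactly the required value."""
    return all(metadata.get(key) == value for key, value in filter_spec.items())

def filtered_search(chunks, query, filter_spec, k=3):
    """Keep only chunks whose metadata passes the filter, then rank by score."""
    allowed = [c for c in chunks if matches(c["metadata"], filter_spec)]
    ranked = sorted(allowed, key=lambda c: word_overlap(query, c["text"]), reverse=True)
    return ranked[:k]

# Hypothetical chunks; the source names are placeholders.
chunks = [
    {"text": "regression with least squares", "metadata": {"source": "lecture02.pdf", "page": 4}},
    {"text": "locally weighted regression", "metadata": {"source": "lecture03.pdf", "page": 0}},
    {"text": "logistic regression for classification", "metadata": {"source": "lecture03.pdf", "page": 6}},
    {"text": "course logistics and homework", "metadata": {"source": "lecture01.pdf", "page": 1}},
]

query = "regression with least squares"
unfiltered = filtered_search(chunks, query, {}, k=1)
results = filtered_search(chunks, query, {"source": "lecture03.pdf"}, k=2)

for chunk in results:
    print(chunk["metadata"])
```

Without the filter, the best match comes from a different lecture; with it, every result is restricted to the requested source.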

In [13]:
question = "what did they say about regression in the third lecture?"

In [14]:
docs = vectordb.similarity_search(
    question,
    k=3,
    filter={"source":"docs/cs229_lectures/MachineLearning-Lecture03.pdf"}
)

In [15]:
for d in docs:
    print(d.metadata)

{'author': '', 'creationdate': '2008-07-11T11:25:03-07:00', 'creator': 'PScript5.dll Version 5.2.2', 'moddate': '2008-07-11T11:25:03-07:00', 'page': 0, 'page_label': '1', 'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'title': '', 'total_pages': 16}
{'author': '', 'creationdate': '2008-07-11T11:25:03-07:00', 'creator': 'PScript5.dll Version 5.2.2', 'moddate': '2008-07-11T11:25:03-07:00', 'page': 14, 'page_label': '15', 'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'title': '', 'total_pages': 16}
{'author': '', 'creationdate': '2008-07-11T11:25:03-07:00', 'creator': 'PScript5.dll Version 5.2.2', 'moddate': '2008-07-11T11:25:03-07:00', 'page': 6, 'page_label': '7', 'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'title': '', 'total_pages': 16}


### Addressing Specificity: working with metadata using self-query retriever

But we have an interesting challenge: we often want to infer the metadata from the query itself.

To address this, we can use `SelfQueryRetriever`, which uses an LLM to extract:
 
1. The `query` string to use for vector search
2. A metadata filter to pass in as well

Most vector databases support metadata filters, so this doesn't require any new databases or indexes.
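
The two outputs listed above can be pictured with a small rule-based stand-in. This is only a sketch: the real retriever asks an LLM to produce the query and filter, whereas the hypothetical `split_question` helper below uses a regular expression and handles just one phrasing.

```python
import re

ORDINALS = {"first": 1, "second": 2, "third": 3}

def split_question(question):
    """Return (query, metadata_filter) inferred from the question text."""
    match = re.search(r"\s*\bin the (first|second|third) lecture\b", question)
    if match is None:
        # Nothing to infer: search with the full question and no filter.
        return question, {}
    number = ORDINALS[match.group(1)]
    source = f"docs/cs229_lectures/MachineLearning-Lecture{number:02d}.pdf"
    # Drop the phrase that was turned into a filter from the search string.
    query = (question[:match.start()] + question[match.end():]).strip()
    return query, {"source": source}

query, metadata_filter = split_question(
    "what did they say about regression in the third lecture?"
)
print(query)            # what did they say about regression?
print(metadata_filter)  # {'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
```

The resulting pair can then be used like a manual `similarity_search` call with a `filter` argument.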

In [16]:
from langchain_openai import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

In [22]:
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The lecture the chunk is from, should be one of 'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'docs/cs229_lectures/MachineLearning-Lecture02.pdf', or 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page from the lecture",
        type="integer",
    ),
]

In [23]:
document_content_description = "Lecture notes"
llm = OpenAI(temperature=0, model=llm_model)
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectordb,
    document_content_description,
    metadata_field_info,
    verbose=True
)

In [24]:
question = "what did they say about regression in the third lecture?"

**You will receive a warning** about `predict_and_parse` being deprecated the first time you execute the next line. This can be safely ignored.

In [25]:
docs = retriever.invoke(question)

In [26]:
for d in docs:
    print(d.metadata)

{'author': '', 'creationdate': '2008-07-11T11:25:03-07:00', 'creator': 'PScript5.dll Version 5.2.2', 'moddate': '2008-07-11T11:25:03-07:00', 'page': 3, 'page_label': '4', 'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'title': '', 'total_pages': 16}
{'author': '', 'creationdate': '2008-07-11T11:25:03-07:00', 'creator': 'PScript5.dll Version 5.2.2', 'moddate': '2008-07-11T11:25:03-07:00', 'page': 3, 'page_label': '4', 'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'title': '', 'total_pages': 16}
{'author': '', 'creationdate': '2008-07-11T11:25:03-07:00', 'creator': 'PScript5.dll Version 5.2.2', 'moddate': '2008-07-11T11:25:03-07:00', 'page': 3, 'page_label': '4', 'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'title': '', 'total_pages': 16}


### Additional tricks: compression

Another approach for improving the quality of retrieved docs is compression.

Information most relevant to a query may be buried in a document with a lot of irrelevant text. 

Passing that full document through your application can lead to more expensive LLM calls and poorer responses.

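Contextual compression is meant to fix this: instead of returning each retrieved document in full, it keeps only the parts that are relevant to the query.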
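
The idea can be sketched without an LLM. The example below keeps only the sentences that share a keyword with the question; the keyword matching, the `STOPWORDS` set, and the sample document are assumptions standing in for the LLM-based extraction used in the cells that follow.

```python
import re

# Small, hand-picked stopword list for this example only.
STOPWORDS = {"what", "did", "they", "say", "about", "the", "a", "in"}

def words(text):
    """Lowercased alphabetic tokens of a text."""
    return re.findall(r"[a-z]+", text.lower())

def compress(document, question):
    """Keep only the sentences that mention a keyword from the question."""
    keywords = {w for w in words(question) if w not in STOPWORDS}
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    kept = [s for s in sentences if keywords & set(words(s))]
    return " ".join(kept)

document = (
    "The TAs hold office hours on Fridays. "
    "Homeworks will be done in MATLAB or Octave. "
    "The midterm is in week six."
)
compressed = compress(document, "what did they say about matlab?")
print(compressed)  # Homeworks will be done in MATLAB or Octave.
```

Only the relevant sentence would be passed on to the final LLM call, which keeps the prompt shorter and more focused.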

In [27]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

In [28]:
def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))


In [29]:
# Build an LLM-based compressor; it wraps the vectorstore retriever in the next cell
llm = OpenAI(temperature=0, model="gpt-4.1-nano")
compressor = LLMChainExtractor.from_llm(llm)

In [30]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)

In [31]:
question = "what did they say about matlab?"
compressed_docs = compression_retriever.invoke(question)
pretty_print_docs(compressed_docs)

Document 1:

those homeworks will be done in either MATLAB or in Octave, which is sort of — I know some people call it a free version of MATLAB, which it sort of is, sort of isn't.  
So I guess for those of you that haven't seen MATLAB before, and I know most of you 
have, MATLAB is I guess part of the programming language that makes it very easy to 
write codes using matrices, to write code for numerical routines, to move data around, to 
plot data. And it's sort of an extremely easy to learn tool to use for implementing a lot of 
learning algorithms.  
And in case some of you want to work on your own home computer or something if you 
don't have a MATLAB license, for the purposes of this class, there's also — [inaudible] 
write that down [inaudible] MATLAB — there' s also a software package called Octave 
that you can download for free off the Internet. And it has somewhat fewer features than 
MATLAB, but it's free, and for the purposes of this class, it will work for just about 
eve

## Combining various techniques

In [32]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type="mmr")
)

In [33]:
question = "what did they say about matlab?"
compressed_docs = compression_retriever.invoke(question)
pretty_print_docs(compressed_docs)

Document 1:

those homeworks will be done in either MATLAB or in Octave, which is sort of — I know some people call it a free version of MATLAB, which it sort of is, sort of isn't.  
So I guess for those of you that haven't seen MATLAB before, and I know most of you 
have, MATLAB is I guess part of the programming language that makes it very easy to 
write codes using matrices, to write code for numerical routines, to move data around, to 
plot data. And it's sort of an extremely easy to learn tool to use for implementing a lot of 
learning algorithms.  
And in case some of you want to work on your own home computer or something if you 
don't have a MATLAB license, for the purposes of this class, there's also — [inaudible] 
write that down [inaudible] MATLAB — there' s also a software package called Octave 
that you can download for free off the Internet. And it has somewhat fewer features than 
MATLAB, but it's free, and for the purposes of this class, it will work for just about 
eve

## Other types of retrieval

It's worth noting that a vectordb is not the only kind of tool for retrieving documents.

The `LangChain` retriever abstraction includes other ways to retrieve documents, such as TF-IDF or SVM.
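
To show what a non-embedding retriever does, here is a hand-rolled TF-IDF ranking in plain Python. It is a sketch of the general technique with a smoothed IDF formula chosen for this example, not the library's implementation, and the three sample texts are made up.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercased alphanumeric tokens of a text."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_rank(query, texts):
    """Return document indices ordered from most to least similar to the query."""
    tokenized = [tokenize(t) for t in texts]
    n_docs = len(tokenized)
    # Document frequency: in how many texts each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency (one common variant).
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    def vectorize(tokens):
        counts = Counter(tokens)
        # Terms never seen in the corpus are ignored.
        return {t: c * idf[t] for t, c in counts.items() if t in idf}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    query_vec = vectorize(tokenize(query))
    scores = [cosine(query_vec, vectorize(tokens)) for tokens in tokenized]
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)

texts = [
    "Homeworks will be done in MATLAB or Octave.",
    "Please form a study group with your classmates.",
    "Send questions to the course mailing list.",
]
ranking = tfidf_rank("what did they say about matlab?", texts)
print(texts[ranking[0]])  # Homeworks will be done in MATLAB or Octave.
```

Because the score depends only on shared terms, this kind of retriever needs no embedding model, but it cannot match paraphrases that use different words.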

In [34]:
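# The community retrievers and loaders live in langchain_community;
# the text splitters live in langchain_text_splitters.
from langchain_community.retrievers import SVMRetriever
from langchain_community.retrievers import TFIDFRetriever
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter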

In [35]:
# Load PDF
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()
all_page_text = [p.page_content for p in pages]
joined_page_text = " ".join(all_page_text)

# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)
splits = text_splitter.split_text(joined_page_text)


In [38]:
# Retrieve
svm_retriever = SVMRetriever.from_texts(splits, embedding)
tfidf_retriever = TFIDFRetriever.from_texts(splits)

In [41]:
question = "What are major topics for this class?"
docs_svm = svm_retriever.invoke(question)
print(docs_svm[0].page_content)

cs229-qa@cs.stanford.edu. This goes to an account that's read by all the TAs and me. So 
rather than sending us email individually, if you send email to this account, it will 
actually let us get back to you maximally quickly with answers to your questions.  
If you're asking questions about homework problems, please say in the subject line which 
assignment and which question the email refers to, since that will also help us to route 
your question to the appropriate TA or to me appropriately and get the response back to 
you quickly.  
Let's see. Skipping ahead — let's see — for homework, one midterm, one open and term 
project. Notice on the honor code. So one thing that I think will help you to succeed and 
do well in this class and even help you to enjoy this class more is if you form a study 
group.  
So start looking around where you're sitting now or at the end of class today, mingle a 
little bit and get to know your classmates. I strongly encourage you to form study groups 
a

In [42]:
question = "what did they say about matlab?"
docs_tfidf = tfidf_retriever.invoke(question)
print(docs_tfidf[0].page_content)

yourselves. You can also come and talk to me or the TAs if you want to brainstorm ideas 
with us.  
Okay. So one more organizational question. I'm curious, how many of you know 
MATLAB? Wow, cool, quite a lot. Okay. So as part of the — act ually how many of you 
know Octave or have used Octave? Oh, okay, much smaller number.  
So as part of this class, especially in the homeworks, we'll ask you to implement a few 
programs, a few machine learning algorithms as part of the homeworks. And most of those homeworks will be done in either MATLAB or in Octave, which is sort of — I 
know some people call it a free version of MATLAB, which it sort of is, sort of isn't.  
So I guess for those of you that haven't seen MATLAB before, and I know most of you 
have, MATLAB is I guess part of the programming language that makes it very easy to 
write codes using matrices, to write code for numerical routines, to move data around, to 
plot data. And it's sort of an extremely easy to learn tool to use f