# Retrieval

Retrieval is the centerpiece of our retrieval-augmented generation (RAG) flow.

Let's load the vector database (`vectordb`) we built earlier.

### Similarity Search

In [3]:
from langchain_community.llms import Ollama
from langchain.embeddings import OllamaEmbeddings
persist_directory = 'docs/chroma/'

In [2]:
llm = Ollama(model="llama3")
embedding = OllamaEmbeddings(model="nomic-embed-text")

In [4]:
from langchain.vectorstores import Chroma
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)

In [6]:
print(vectordb._collection.count())

228


In [7]:
texts = [
    """The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).""",
    """A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.""",
    """A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.""",
]

In [8]:
smalldb = Chroma.from_texts(texts=texts, embedding=embedding)

In [9]:
question = "Tell me about all-white mushrooms with large fruiting bodies"

In [11]:
smalldb.similarity_search(question, k=2)

[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(page_content='A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.')]

In [12]:
smalldb.max_marginal_relevance_search(question, k=2, fetch_k=3)

[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(page_content='A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.')]

### Addressing Diversity: Maximum marginal relevance

Last class we introduced one problem: how to enforce diversity in the search results.
 
`Maximum marginal relevance` strives to achieve both relevance to the query *and diversity* among the results.
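The idea can be sketched in a few lines: fetch the `fetch_k` candidates closest to the query, then greedily pick `k` of them, each time trading off similarity to the query against similarity to what has already been picked. The code below is a minimal pure-Python illustration under stated assumptions, not the implementation used by LangChain or Chroma; `cosine`, `mmr_select`, the `lambda_mult` weighting, and the toy 2-D vectors are all chosen here for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k=2, fetch_k=3, lambda_mult=0.5):
    """Return indices of k documents chosen by a greedy MMR rule (illustrative)."""
    # Step 1: keep only the fetch_k candidates most similar to the query.
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    candidates = ranked[:fetch_k]
    selected = []

    # Step 2: repeatedly pick the candidate with the best
    # relevance-minus-redundancy score.
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data (assumed): documents 0 and 1 are near-duplicates, document 2 differs.
query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.12], [0.8, -0.6]]

balanced = mmr_select(query, docs, k=2, fetch_k=3, lambda_mult=0.5)
relevance_only = mmr_select(query, docs, k=2, fetch_k=3, lambda_mult=1.0)
print(balanced, relevance_only)
```

With `lambda_mult=1.0` the rule reduces to plain similarity ranking and returns both near-duplicates; lowering it lets the more distinct document displace the duplicate.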

In [20]:
question = "what did they say about matlab?"

In [21]:
docs_ss = vectordb.similarity_search(question, k=3)

In [22]:
docs_ss[0].page_content[:100]

'into his office and he said, "Oh, professo r, professor, thank you so much for your \nmachine learnin'

In [23]:
docs_ss[1].page_content[:100]

'into his office and he said, "Oh, professo r, professor, thank you so much for your \nmachine learnin'

In [24]:
docs_mmr = vectordb.max_marginal_relevance_search(question, k=3)

In [29]:
docs_mmr[0].page_content[:100]

'into his office and he said, "Oh, professo r, professor, thank you so much for your \nmachine learnin'

In [30]:
docs_mmr[1].page_content[:100]

'those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people '

### Addressing Specificity: working with metadata

In the last lecture, we showed that a question about the third lecture can return results from other lectures as well.

To address this, many vectorstores support operations on `metadata`.

`metadata` provides context for each embedded chunk.
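Conceptually, a metadata filter narrows the set of candidate chunks to those whose metadata matches the requested values. A minimal sketch, assuming chunks are plain dictionaries: `filter_by_metadata`, the sample texts, and the `docs/other-lecture.pdf` path are hypothetical, and a real vector store applies the filter inside its index rather than in a Python loop.

```python
def filter_by_metadata(chunks, where):
    """Keep chunks whose metadata matches every key/value pair in `where`."""
    return [
        chunk
        for chunk in chunks
        if all(chunk["metadata"].get(key) == value for key, value in where.items())
    ]


# Hypothetical chunks; only the Lecture01 path appears in this notebook.
chunks = [
    {"text": "chunk A", "metadata": {"source": "docs/MachineLearning-Lecture01.pdf", "page": 8}},
    {"text": "chunk B", "metadata": {"source": "docs/other-lecture.pdf", "page": 2}},
    {"text": "chunk C", "metadata": {"source": "docs/MachineLearning-Lecture01.pdf", "page": 9}},
]

matches = filter_by_metadata(chunks, {"source": "docs/MachineLearning-Lecture01.pdf"})
for chunk in matches:
    print(chunk["metadata"])
```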

In [31]:
question = "what did they say about regression in the third lecture?"

In [35]:
# Restrict the search to chunks whose `source` metadata matches this file
docs = vectordb.similarity_search(
    question,
    k=3,
    filter={"source": "docs/MachineLearning-Lecture01.pdf"}
)

In [36]:
for d in docs:
    print(d.metadata)

{'page': 8, 'source': 'docs/MachineLearning-Lecture01.pdf'}
{'page': 8, 'source': 'docs/MachineLearning-Lecture01.pdf'}
{'page': 8, 'source': 'docs/MachineLearning-Lecture01.pdf'}


### Additional tricks: compression

Another approach for improving the quality of retrieved docs is compression.

Information most relevant to a query may be buried in a document with a lot of irrelevant text. 

Passing that full document through your application can lead to more expensive LLM calls and poorer responses.

Contextual compression is meant to fix this. 
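To make the idea concrete before bringing in an LLM, here is a toy stand-in: it keeps only the sentences of a retrieved chunk that share a keyword with the query. This is a rough sketch under stated assumptions, not how `LLMChainExtractor` works (that class asks an LLM to extract the relevant parts); `extract_relevant_sentences`, the stopword list, and the sample text are hypothetical.

```python
import re

# Small, assumed stopword list so that only content words act as keywords.
STOPWORDS = frozenset({"what", "did", "they", "say", "about", "the", "a", "an", "in", "of"})


def extract_relevant_sentences(text, query):
    """Return only the sentences of `text` that share a keyword with `query`."""
    keywords = {w for w in re.findall(r"\w+", query.lower()) if w not in STOPWORDS}
    # Naive sentence split on whitespace that follows ., ! or ?
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = [s for s in sentences if keywords & set(re.findall(r"\w+", s.lower()))]
    return " ".join(kept)


chunk = (
    "The class has a home page. "
    "Homeworks will be done in MATLAB or Octave. "
    "Handouts are posted online."
)
compressed = extract_relevant_sentences(chunk, "what did they say about matlab?")
print(compressed)
```

The compressed chunk is shorter than the original, so less irrelevant text reaches the LLM in the final prompt.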

In [37]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

In [38]:
def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))

In [39]:
compressor = LLMChainExtractor.from_llm(llm)

In [40]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)

In [41]:
question = "what did they say about matlab?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)

Document 1:

*AS IS*

"So... what was it that you learned? Was it logistic regression? Was it the PCA? Was it the data networks? What was it that you learned that was so helpful?" And the student said, "Oh, it was the MATLAB." So for those of you that don't know MATLAB yet, I hope you do learn it. It's not hard, and we'll actually have a short MATLAB tutorial in one of the discussion sections for those of you that don't know it.

Extracted relevant parts:
----------------------------------------------------------------------------------------------------
Document 2:

Here are the extracted relevant parts:

"So my friend was very excited. He said, 'W ow. That's great. I'm glad to hear this machine learning stuff was actually useful. So what was it that you learned? Was it logistic regression? Was it the PCA? Was it the data networks? What was it that you learned that was so helpful?' And the student said, 'Oh, it was the MATLAB.'"
--------------------------------------------------------

## Combining various techniques

In [42]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type="mmr")
)

In [43]:
question = "what did they say about matlab?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)

Document 1:

*AS IS*

"So my friend was very excited. He said, 'Wow. That's great. I'm glad to hear this machine learning stuff was actually useful. So what was it that you learned? Was it logistic regression? Was it the PCA? Was it the data networks? What was it that you learned that was so helpful?' And the student said, 'Oh, it was the MATLAB.'"

*NO OUTPUT*
----------------------------------------------------------------------------------------------------
Document 2:

> those homeworks will be done in either MATLA B or in Octave, which is sort of  — I know some people call it a free version of MATLAB, which it sort of is, sort of isn' t.
> So I guess for those of you that haven't seen MATLAB before, and I know most of you have, MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to plot data. And it's sort of an extremely easy to learn tool to use for implementing a lot

## Other types of retrieval

It's worth noting that a vector database is not the only kind of tool for retrieving documents.

The `LangChain` retriever abstraction includes other ways to retrieve documents, such as TF-IDF or SVM.
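Unlike embedding search, TF-IDF ranks documents by weighted word overlap, so it needs no embedding model. The sketch below is a simplified pure-Python illustration of that ranking, not the code behind LangChain's `TFIDFRetriever`; `tokenize`, `tfidf_rank`, the smoothed IDF formula, and the sample texts are assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_rank(query, docs):
    """Return document indices ordered from most to least similar to the query."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency (one common variant).
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

    def vectorize(tokens):
        tf = Counter(tokens)
        return {term: tf[term] * idf[term] for term in tf if term in idf}

    def cosine(a, b):
        dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    query_vec = vectorize(tokenize(query))
    scores = [cosine(query_vec, vectorize(tokens)) for tokens in tokenized]
    return sorted(range(n), key=lambda i: scores[i], reverse=True)


sample_docs = [
    "The class has a home page with lecture notes.",
    "Homeworks will be done in MATLAB or Octave.",
    "Form a study group of up to three people.",
]
ranking = tfidf_rank("What did they say about MATLAB?", sample_docs)
print(sample_docs[ranking[0]])
```

Because the match is purely lexical, a query that uses a synonym absent from the documents would score zero everywhere, which is where embedding-based retrieval tends to help.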

In [44]:
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [48]:
# Load
loader = PyPDFLoader(
    "docs/MachineLearning-Lecture01.pdf"
)
pages = loader.load()

all_page_content = [p.page_content for p in pages]
joined_page_content = " ".join(all_page_content)

In [55]:
# Split

r_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)
splits = r_splitter.split_text(joined_page_content)

In [57]:
# Retrieve

from langchain.retrievers import SVMRetriever
from langchain.retrievers import TFIDFRetriever

svm_retriever = SVMRetriever.from_texts(splits, embeddings=embedding)
# TF-IDF is a lexical method, so it takes no embedding model
tfid_retriever = TFIDFRetriever.from_texts(splits)


In [59]:
question = "What are major topics for this class?"
docs_svm = svm_retriever.get_relevant_documents(question)
docs_svm[0]



Document(page_content="course information handout. So let me just sa y a few words about parts of these. On the \nthird page, there's a section that says Online Resources.  \nOh, okay. Louder? Actually, could you turn up the volume? Testing. Is this better? \nTesting, testing. Okay, cool. Thanks.   So all right, online resources. The class has a home page, so it's in on the handouts. I \nwon't write on the chalkboard — http:// cs229.stanford.edu. And so when there are \nhomework assignments or things like that, we  usually won't sort of — in the mission of \nsaving trees, we will usually not give out many handouts in class. So homework \nassignments, homework solutions will be posted online at the course home page.  \nAs far as this class, I've also written, a nd I guess I've also revised every year a set of \nfairly detailed lecture notes that cover the te chnical content of this  class. And so if you \nvisit the course homepage, you'll also find the detailed lecture notes that go ove

In [60]:
question = "What did they say about Matlab?"
docs_tfid = tfid_retriever.get_relevant_documents(question)
docs_tfid[0]



Document(page_content="So as part of forming study groups, later t oday as you get to know your classmates, I \ndefinitely also encourage you to grab two ot her people and form a group of up to three \npeople for your project, okay? And just start brainstorming ideas for now amongst \nyourselves. You can also come and talk to me or the TAs if you want to brainstorm ideas \nwith us.  \nOkay. So one more organizational ques tion. I'm curious, how many of you know \nMATLAB? Wow, cool, quite a lot. Okay. So as part of the — act ually how many of you \nknow Octave or have used Octave ? Oh, okay, much smaller number.  \nSo as part of this class, especially in the homeworks, we'll ask you to implement a few \nprograms, a few machine learning algorithms as  part of the homeworks. And most of  those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people call it a free ve rsion of MATLAB, which it sort  of is, sort of isn't.  \nSo I guess for those of you 