# Advanced Retrieval Techniques in LangChain with Vector Databases and Compression
This notebook demonstrates advanced document retrieval techniques using LangChain, including similarity search, maximum marginal relevance (MMR), metadata filtering, self-querying with LLMs, contextual compression, and comparison with traditional methods like TF-IDF and SVM. Example documents are embedded into ChromaDB and retrieved using a variety of retrieval strategies.

## Vectorstore retrieval


In [9]:
# !pip install lark

In [8]:
# !pip install langchain_community

In [12]:
# !pip install langchain-openai

### Similarity Search

In [73]:
import os
os.environ["OPENAI_API_KEY"] = 'xxx'  # placeholder; set a real key before creating embeddings

In [74]:
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.schema import Document
embedding_model = OpenAIEmbeddings()  # reads OPENAI_API_KEY from the environment
persist_dir = 'docs/chroma/'

In [15]:
# !pip install chromadb

In [75]:
documents = [
    "The Transformer architecture has fundamentally reshaped natural language processing, enabling models such as BERT, GPT, and T5 to excel in tasks like translation, summarization, and question answering.",
    "In financial modeling, Monte Carlo simulations are used to understand the impact of risk and uncertainty by simulating a wide range of possible outcomes.",
    "The CRISPR-Cas9 gene-editing technology has revolutionized genetic engineering, allowing scientists to make precise modifications to DNA in living organisms."
]


In [76]:
vectordb = Chroma.from_texts(
    texts=documents,
    embedding=embedding_model,
    persist_directory=persist_dir
)
vectordb.persist()  # write the collection to persist_dir

In [77]:
print(vectordb._collection.count())

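6

The count covers everything stored in the persisted collection, so it can exceed the three texts added above when `docs/chroma/` already holds records from an earlier run.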


In [78]:
science_corpus = [
    "Quantum entanglement is a physical phenomenon where particles remain interconnected, sharing physical states even when separated by vast distances.",
    "A white dwarf is a stellar remnant composed mostly of electron-degenerate matter and forms when a low-mass star has exhausted all its central nuclear fuel.",
    "Schrödinger's cat is a thought experiment that illustrates the paradox of quantum superposition."
]
science_vectordb = Chroma.from_texts(science_corpus, embedding=embedding_model)

In [35]:
# !pip install tiktoken

In [79]:
query = "What happens to stars when they die?"
science_vectordb.similarity_search(query, k=2)

[Document(metadata={}, page_content='A white dwarf is a stellar remnant composed mostly of electron-degenerate matter and forms when a low-mass star has exhausted all its central nuclear fuel.'),
 Document(metadata={}, page_content='Quantum entanglement is a physical phenomenon where particles remain interconnected, sharing physical states even when separated by vast distances.')]
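
Conceptually, `similarity_search` ranks the stored vectors by how close they are to the query embedding and returns the top `k`. The following is a minimal sketch of that ranking, assuming cosine similarity and hypothetical three-dimensional vectors in place of real embeddings. The metric a vector store actually uses depends on its configuration, and the names `cosine_similarity` and `top_k` are illustrative only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]

# Hypothetical toy "embeddings"; real ones have hundreds or thousands of dimensions.
toy_doc_vecs = [
    [0.9, 0.1, 0.0],  # doc 0: points in a different direction from the query
    [0.1, 0.9, 0.1],  # doc 1: closest to the query
    [0.2, 0.8, 0.3],  # doc 2: second closest
]
toy_query_vec = [0.0, 1.0, 0.2]

top_indices = top_k(toy_query_vec, toy_doc_vecs, k=2)
print(top_indices)
```

Note that the second hit is returned simply because it is the next closest vector, whether or not it is actually relevant, which matches the quantum entanglement result above.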

In [80]:
science_vectordb.max_marginal_relevance_search(query, k=2, fetch_k=3)

[Document(metadata={}, page_content='A white dwarf is a stellar remnant composed mostly of electron-degenerate matter and forms when a low-mass star has exhausted all its central nuclear fuel.'),
 Document(metadata={}, page_content="Schrödinger's cat is a thought experiment that illustrates the paradox of quantum superposition.")]

### Addressing Diversity: Maximum marginal relevance

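Maximum marginal relevance (MMR) selects results that are relevant to the query *and diverse* with respect to one another, rather than returning near-duplicates of the top hit. It first fetches `fetch_k` candidates by similarity, then picks `k` of them.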
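
The trade-off can be sketched as a greedy selection loop. This is a simplified illustration, assuming cosine similarity, hypothetical two-dimensional vectors, and a weighting parameter named `lambda_mult` (1.0 means pure relevance, lower values favor diversity). It treats every document as a candidate and is not meant to reproduce the library's implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k indices, trading query relevance against redundancy."""
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: similarity to the closest already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected

# Hypothetical toy vectors: docs 0 and 1 are near-duplicates, doc 2 is different.
mmr_doc_vecs = [
    [1.0, 0.0],
    [0.99, 0.1],
    [0.6, 0.8],
]
mmr_query_vec = [0.95, 0.3]

mmr_pick = mmr_select(mmr_query_vec, mmr_doc_vecs, k=2, lambda_mult=0.5)
plain_pick = mmr_select(mmr_query_vec, mmr_doc_vecs, k=2, lambda_mult=1.0)
print(mmr_pick, plain_pick)
```

With `lambda_mult=1.0` the loop reduces to plain similarity ranking and returns both near-duplicates; with `lambda_mult=0.5` the second slot goes to the less similar but non-redundant document.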

In [81]:
query = "Explain the benefits of transformer-based models."
results = vectordb.similarity_search(query, k=3)

In [84]:
results[0].page_content


'The Transformer architecture has fundamentally reshaped natural language processing, enabling models such as BERT, GPT, and T5 to excel in tasks like translation, summarization, and question answering.'

In [85]:
results[1].page_content[:100]

'In financial modeling, Monte Carlo simulations are used to understand the impact of risk and uncerta'

Compare these with the results from `MMR`. On a collection this small, the top results can coincide with those of plain similarity search.

In [86]:
mmr_results = vectordb.max_marginal_relevance_search(query, k=3)



In [88]:
mmr_results[0].page_content[:100]

'The Transformer architecture has fundamentally reshaped natural language processing, enabling models'

In [89]:
mmr_results[1].page_content[:100]

'In financial modeling, Monte Carlo simulations are used to understand the impact of risk and uncerta'

### Addressing Specificity: working with metadata

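A query may target one specific source, such as a single lecture, but similarity search alone cannot guarantee that the results come only from that source. To address this, many vectorstores support filtering on `metadata`.

`metadata` provides context for each embedded chunk, such as its source file or page number. The filter below only matches chunks that were stored with a `source` field, for example through the `metadatas` argument of `Chroma.from_texts`.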
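
As a rough illustration of what an equality filter does, the sketch below keeps only records whose metadata matches every key in the filter. The records, file names, and the helper `filter_by_metadata` are hypothetical; a real vector store applies the filter inside its index and then ranks the remaining chunks by similarity.

```python
def filter_by_metadata(records, metadata_filter):
    """Keep only records whose metadata matches every key/value in the filter."""
    return [
        rec
        for rec in records
        if all(rec["metadata"].get(key) == value for key, value in metadata_filter.items())
    ]

# Hypothetical chunks with attached metadata.
toy_records = [
    {"text": "Intro to SVMs", "metadata": {"source": "lecture3.pdf", "page": 1}},
    {"text": "Linear regression basics", "metadata": {"source": "lecture2.pdf", "page": 4}},
    {"text": "Kernel tricks for SVMs", "metadata": {"source": "lecture3.pdf", "page": 7}},
]

lecture3_only = filter_by_metadata(toy_records, {"source": "lecture3.pdf"})
print([rec["text"] for rec in lecture3_only])
```

A record without the filtered field never matches, which is why chunks stored without metadata return nothing under a `source` filter.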

In [90]:
query = "What were the applications of SVMs discussed in Lecture 3?"
filtered_docs = vectordb.similarity_search(
    query,
    k=3,
    filter={"source": "docs/ai_course/lecture3.pdf"}
)

In [91]:
for doc in filtered_docs:
    print(doc.metadata)

### Addressing Specificity: working with metadata using self-query retriever


Often the metadata filter has to be inferred from the question itself rather than written by hand. For this, use `SelfQueryRetriever`, which uses an LLM to extract:

1. The `query` string to use for vector search
2. A metadata filter to pass in as well

Most vector databases support metadata filters, so this doesn't require any new databases or indexes.
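
To make the two outputs concrete, here is a toy sketch of the split. It assumes a regular expression in place of the LLM, recognizes only phrases of the form "Lecture N", and maps them to a hypothetical `lectureN.pdf` source name. It shows the shape of the result, not how `SelfQueryRetriever` works internally.

```python
import re

def split_query(user_query):
    """Split a question into (search_text, metadata_filter).

    Assumption: a regex stands in for the LLM and only understands "Lecture N".
    The "lectureN.pdf" naming scheme is hypothetical.
    """
    match = re.search(r"\blecture\s+(\d+)\b", user_query, flags=re.IGNORECASE)
    if match is None:
        # Nothing to filter on: search with the full question.
        return user_query, {}
    metadata_filter = {"source": f"lecture{match.group(1)}.pdf"}
    # Drop the lecture reference so it does not skew the vector search.
    remainder = user_query.replace(match.group(0), "")
    search_text = re.sub(r"\s+", " ", remainder).strip()
    return search_text, metadata_filter

search_text, metadata_filter = split_query(
    "Did Lecture 2 mention linear regression in any detail?"
)
print(search_text)
print(metadata_filter)
```

An LLM-based extractor handles far more varied phrasing, but the result has the same two parts: text for the vector search and a structured filter for the metadata.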

In [93]:
from langchain_openai import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

In [94]:
metadata_fields = [
    AttributeInfo(name="source", description="Source file path", type="string"),
    AttributeInfo(name="page", description="Page number within source", type="integer"),
]

In [98]:
document_content_description = "AI research papers and lecture notes"
llm_model = OpenAI(model='gpt-3.5-turbo-instruct', temperature=0)
self_query_retriever = SelfQueryRetriever.from_llm(
    llm_model,
    vectordb,
    document_content_description,
    metadata_field_info=metadata_fields,
    verbose=True
)

In [99]:
query = "Did Lecture 2 mention linear regression in any detail?"
retrieved_docs = self_query_retriever.get_relevant_documents(query)

**You may receive a warning** about `predict_and_parse` being deprecated the first time you run the retriever. This can be safely ignored.

In [100]:
for doc in retrieved_docs:
    print(doc.metadata)

### Additional tricks: compression

Another approach for improving the quality of retrieved docs is compression.

Information most relevant to a query may be buried in a document with a lot of irrelevant text.

Passing that full document through your application can lead to more expensive LLM calls and poorer responses.

Contextual compression is meant to fix this.
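
The idea can be sketched without an LLM. The example below assumes a simple word-overlap rule as a stand-in for the LLM-based extractor: it keeps only the sentences that share a meaningful word with the query. The helper `compress_document`, the stopword list, and the sample text are hypothetical.

```python
import re

def compress_document(text, query):
    """Keep only sentences that share at least one non-trivial word with the query."""
    stopwords = {"how", "are", "is", "the", "in", "a", "of", "to", "and", "used"}
    query_terms = {
        word for word in re.findall(r"[a-z0-9]+", query.lower()) if word not in stopwords
    }
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = [
        sentence
        for sentence in sentences
        if query_terms & set(re.findall(r"[a-z0-9]+", sentence.lower()))
    ]
    return " ".join(kept)

long_doc = (
    "The course starts at nine. "
    "GPT models can condense long reports into short summaries. "
    "Lunch is served at noon."
)
compressed = compress_document(long_doc, "How are GPT models used in summarization?")
print(compressed)
```

An LLM extractor judges relevance semantically rather than by shared words, but the effect is the same: less irrelevant text reaches the final prompt.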

In [102]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

In [103]:
def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + doc.page_content for i, doc in enumerate(docs)]))

In [104]:
compressor = LLMChainExtractor.from_llm(llm_model)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)

In [105]:
query = "How are GPT models used in summarization?"
compressed_docs = compression_retriever.get_relevant_documents(query)
pretty_print_docs(compressed_docs)

Document 1:

The Transformer architecture, models such as BERT, GPT, and T5, tasks like translation, summarization, and question answering.


## Combining various techniques

In [106]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type="mmr")
)
compressed_docs = compression_retriever.get_relevant_documents(query)
pretty_print_docs(compressed_docs)



Document 1:

The Transformer architecture, models such as BERT, GPT, and T5, tasks like translation, summarization, and question answering.


## Other types of retrieval

It's worth noting that a vector database is not the only kind of tool for retrieving documents.

The `LangChain` retriever abstraction includes other ways to retrieve documents, such as TF-IDF or SVM.
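
Unlike embedding-based search, TF-IDF scores documents by exact term overlap, weighting rare terms more heavily. The sketch below illustrates this with a simplified weighting scheme (term frequency times a smoothed inverse document frequency). It is an assumption-laden toy, not the formula used by `TFIDFRetriever`, and the documents are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_rank(query, docs):
    """Rank document indices by the summed TF-IDF weight of the query terms."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for idx, tokens in enumerate(tokenized):
        counts = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if counts[term]:
                tf = counts[term] / len(tokens)
                idf = math.log(n_docs / doc_freq[term]) + 1.0  # simplified smoothing
                score += tf * idf
        scores.append((score, idx))
    scores.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scores]

tfidf_docs = [
    "Monte Carlo simulations model risk in finance.",
    "Gene editing with CRISPR modifies DNA.",
    "Transformers power translation and summarization.",
]
ranking = tfidf_rank("CRISPR gene editing", tfidf_docs)
print(ranking[0])
```

Because the score depends on shared words, a query phrased with synonyms of the document's vocabulary scores zero here, which is the main trade-off against embedding-based retrieval.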

In [63]:
from langchain.retrievers import SVMRetriever, TFIDFRetriever
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [66]:
# !pip install pypdf

In [107]:
loader = PyPDFLoader("/content/sample_data/sample-local-pdf.pdf")
pages = loader.load()
full_text = " ".join([p.page_content for p in pages])

splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)
pdf_chunks = splitter.split_text(full_text)



In [111]:
svm_retriever = SVMRetriever.from_texts(pdf_chunks, embedding_model)
tfidf_retriever = TFIDFRetriever.from_texts(pdf_chunks)

In [117]:
question = "How long is this"
docs_svm=svm_retriever.get_relevant_documents(question)
docs_svm[0]

Document(metadata={}, page_content='1')

In [115]:
question = "Quisque volutpat condimentum velit"
docs_tfidf=tfidf_retriever.get_relevant_documents(question)
docs_tfidf[0]

Document(metadata={}, page_content='lacinia aliquet. Mauris ipsum. Nulla metus metus, ullamcorper vel, tincidunt sed, euismod in, nibh.   Quisque volutpat condimentum velit. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Nam nec ante. Sed lacinia, urna non tincidunt mattis, tortor neque adipiscing diam, a cursus ipsum ante quis turpis. Nulla facilisi. Ut fringilla. Suspendisse potenti. Nunc feugiat mi a tellus consequat imperdiet. Vestibulum sapien. Proin quam. Etiam ultrices.   Suspendisse in justo eu magna luctus suscipit. Sed lectus. Integer euismod lacus luctus magna. Quisque cursus, metus vitae pharetra auctor, sem massa mattis sem, at interdum magna augue eget diam. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Morbi lacinia molestie dui. Praesent blandit dolor. Sed non quam. In vel mi sit amet augue congue 3')