# **LangChain:** Retrieval

Retrieval is the centerpiece of our retrieval augmented generation (RAG) flow. 

Let's get our vectorDB from before.

## Vectorstore retrieval


In [31]:
import os
import openai
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key  = os.environ['OPENAI_API_KEY']

In [2]:
# lark is needed by the self-query retriever used later; uncomment to install it
#!pip install lark

### Similarity Search

In [32]:
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
persist_directory = '/home/centrox_ai/Desktop/ABDULLAH/langchain/LangChain-Chat-with-your-Data/chroma/'

In [33]:
embedding = OpenAIEmbeddings()
vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)

In [34]:
print(vectordb._collection.count())

2225


In [6]:
texts = [
    """The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).""",
    """A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.""",
    """A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.""",
]

In [7]:
smalldb = Chroma.from_texts(texts, embedding=embedding)

In [8]:
question = "Tell me about all-white mushrooms with large fruiting bodies"

In [9]:
smalldb.similarity_search(question, k=2)

[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(page_content='The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).')]

In [10]:
smalldb.max_marginal_relevance_search(question,k=2, fetch_k=3)

[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(page_content='A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.')]

In [30]:

question = "what did they say about regression in the third lecture?"
vectordb.max_marginal_relevance_search(question,k=2, fetch_k=3)

[Document(page_content='MachineLearning-Lecture03  \nInstructor (Andrew Ng) :Okay. Good morning and welcome b ack to the third lecture of \nthis class. So here’s what I want to do t oday, and some of the topics I do today may seem \na little bit like I’m jumping, sort  of, from topic to topic, but here’s, sort of, the outline for \ntoday and the illogical flow of ideas. In the last lecture, we  talked about linear regression \nand today I want to talk about sort of an  adaptation of that called locally weighted \nregression. It’s very a popular  algorithm that’s actually one of my former mentors \nprobably favorite machine learning algorithm.  \nWe’ll then talk about a probabl e second interpretation of linear regression and use that to \nmove onto our first classification algorithm, which is logistic regr ession; take a brief \ndigression to tell you about something cal led the perceptron algorithm, which is \nsomething we’ll come back to, again, later this  quarter; and time allowing

### Addressing Diversity: Maximum marginal relevance

Last class we introduced one problem: how to enforce diversity in the search results.
 
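`Maximum marginal relevance` (`MMR`) strives to achieve both relevance to the query *and diversity* among the results. It first fetches the `fetch_k` documents most similar to the query, then picks `k` of them that are relevant yet dissimilar to one another.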
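
To make the selection rule concrete, here is a minimal sketch of MMR in plain Python. It is not LangChain's implementation: the function names, the toy 2-D vectors, and the default `lambda_mult=0.5` are assumptions chosen for illustration. Each step picks the candidate with the best trade-off between similarity to the query and similarity to what has already been selected.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr_select(query_vec, doc_vecs, k=2, fetch_k=3, lambda_mult=0.5):
    """Return indices of up to k documents chosen by maximum marginal relevance."""
    # Step 1: shortlist the fetch_k documents most similar to the query.
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    candidates = sorted(range(len(doc_vecs)), key=lambda i: relevance[i], reverse=True)[:fetch_k]

    selected = []
    while candidates and len(selected) < k:
        def score(i):
            # Step 2: penalize similarity to anything already selected.
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy "embeddings" (hypothetical): documents 0 and 1 are near-duplicates, document 2 differs.
query = [1.0, 0.2]
docs = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]
picked = mmr_select(query, docs, k=2, fetch_k=3)
print(picked)
```

With these toy vectors, plain similarity ranking would return the two near-duplicates (indices 1 and 0), whereas the MMR selection returns indices 1 and 2. Setting `lambda_mult=1.0` removes the diversity penalty and reduces the function to ordinary similarity search.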

In [11]:
question = "what did they say about matlab?"
docs_ss = vectordb.similarity_search(question,k=3)

In [12]:
docs_ss[0].page_content[:100]

'algorithm then? So what’s different? How come  I was making all that noise earlier about \nleast squa'

In [13]:
docs_ss[1].page_content[:100]

"amount of notation. We'll probably all get used  to it in a few days and we'll standardize \nnotation"

Note the difference in results with `MMR`.

In [14]:
docs_mmr = vectordb.max_marginal_relevance_search(question,k=3)

In [15]:
docs_mmr[0].page_content[:100]

'algorithm then? So what’s different? How come  I was making all that noise earlier about \nleast squa'

In [16]:
docs_mmr[1].page_content[:100]

"Today, I'm also going to delve into a fair amount  – some amount of linear algebra, and so \nif you w"

In [24]:
print(vectordb._collection.count())

2225


### Addressing Specificity: working with metadata

In the last lecture, we showed that a question about the third lecture can include results from other lectures as well.

To address this, many vectorstores support operations on `metadata`.

`metadata` provides context for each embedded chunk, such as the source file and page number it came from.
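
As a rough illustration of what a metadata filter does, the sketch below applies an equality filter to a handful of chunks in plain Python. The chunk texts, file names, and helper functions are hypothetical; a real vectorstore applies the filter inside its own index and then ranks the remaining chunks by similarity.

```python
# Hypothetical chunks: each has text plus a metadata dictionary, mirroring the
# {'source': ..., 'page': ...} dictionaries stored alongside each embedded chunk.
chunks = [
    {"text": "linear regression recap", "metadata": {"source": "lecture02.pdf", "page": 1}},
    {"text": "locally weighted regression", "metadata": {"source": "lecture03.pdf", "page": 0}},
    {"text": "logistic regression", "metadata": {"source": "lecture03.pdf", "page": 4}},
]

def matches(metadata, flt):
    """True when every key in the filter equals the chunk's metadata value."""
    return all(metadata.get(key) == value for key, value in flt.items())

def filtered_search(chunks, flt):
    """Keep only the chunks whose metadata satisfies the equality filter."""
    return [c for c in chunks if matches(c["metadata"], flt)]

hits = filtered_search(chunks, {"source": "lecture03.pdf"})
for h in hits:
    print(h["metadata"], h["text"])
```

Only the two chunks from the (hypothetical) third-lecture file survive the filter, so results from other lectures can no longer appear.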

In [22]:
question = "what did they say about regression in the third lecture?"

In [35]:
docs = vectordb.similarity_search(
    question,
    k=3,
    filter={"source":"/home/centrox_ai/Desktop/ABDULLAH/llama2/rag/a.pdf"}
)
len(docs)

3

In [36]:
for d in docs:
    print(d.metadata)

{'page': 0, 'source': '/home/centrox_ai/Desktop/ABDULLAH/llama2/rag/a.pdf'}
{'page': 14, 'source': '/home/centrox_ai/Desktop/ABDULLAH/llama2/rag/a.pdf'}
{'page': 4, 'source': '/home/centrox_ai/Desktop/ABDULLAH/llama2/rag/a.pdf'}


### Addressing Specificity: working with metadata using self-query retriever

But we have an interesting challenge: we often want to infer the metadata from the query itself.

To address this, we can use `SelfQueryRetriever`, which uses an LLM to extract:
 
1. The `query` string to use for vector search
2. A metadata filter to pass in as well

Most vector databases support metadata filters, so this doesn't require any new databases or indexes.
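
The two outputs listed above can be pictured with a small rule-based stand-in for the LLM. Everything here is an assumption for illustration: the `ORDINALS` mapping, the file names, and the `split_query` helper are hypothetical, and the real retriever uses an LLM prompt rather than a regular expression.

```python
import re

# Hypothetical mapping from ordinal words to source file names.
ORDINALS = {"first": "Lecture01.pdf", "second": "Lecture02.pdf", "third": "Lecture03.pdf"}

def split_query(question):
    """Rule-based stand-in for the LLM: return (query string, metadata filter)."""
    flt = {}
    m = re.search(r"\bin the (\w+) lecture\b", question)
    if m and m.group(1) in ORDINALS:
        # The phrase names a lecture, so turn it into a metadata filter
        # and remove it from the text used for vector search.
        flt["source"] = ORDINALS[m.group(1)]
        question = question[:m.start()] + question[m.end():]
    query = question.strip(" ?")
    return query, flt

query, flt = split_query("what did they say about regression in the third lecture?")
print(query)
print(flt)
```

The query string goes to the vector search and the filter goes to the vectorstore's metadata filter, which is why no new database or index is needed.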

In [42]:
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

In [47]:
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The lecture the chunk is from, should be one of `/home/centrox_ai/Desktop/ABDULLAH/llama2/rag/a.pdf`, `/home/centrox_ai/Desktop/ABDULLAH/llama2/rag/MachineLearning-Lecture02.pdf`, or `/home/centrox_ai/Desktop/ABDULLAH/llama2/rag/1.pdf`",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page from the lecture",
        type="integer",
    ),
]

In [48]:
document_content_description = "Lecture notes"
llm = OpenAI(temperature=0)
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectordb,
    document_content_description,
    metadata_field_info,
    verbose=True
)

In [49]:
question = "what did they say about regression in the third lecture?"

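**You will receive a warning** about `predict_and_parse` being deprecated the first time you execute the next line. This can be safely ignored. In the run recorded below, however, the call fails with an `OutputParserException`: the LLM wrote the filter as `eq(...) and eq(...)`, and the query parser rejected the infix `and`.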

In [50]:
docs = retriever.get_relevant_documents(question)

OutputParserException: Parsing text
```json
{
    "query": "regression",
    "filter": "eq(\"source\", \"/home/centrox_ai/Desktop/ABDULLAH/llama2/rag/MachineLearning-Lecture02.pdf\") and eq(\"page\", 3)"
}
```
 raised following error:
Unexpected token Token('CNAME', 'and') at line 1, column 92.
Expected one of: 
	* $END
Previous tokens: [Token('RPAR', ')')]


In [41]:
len(docs)

0

In [14]:
for d in docs:
    print(d.metadata)

### Additional tricks: compression

Another approach for improving the quality of retrieved docs is compression.

Information most relevant to a query may be buried in a document with a lot of irrelevant text. 

Passing that full document through your application can lead to more expensive LLM calls and poorer responses.

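Contextual compression is meant to fix this: each retrieved document is passed through a compressor, here an LLM-based extractor, that keeps only the parts relevant to the query.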
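
The idea can be sketched without an LLM. In the example below a keyword-overlap rule stands in for the LLM extractor; the helper names, the keyword heuristic, and the toy documents are all assumptions, not how `LLMChainExtractor` works internally.

```python
import re

def extract_relevant(document, question):
    """Stand-in for an LLM extractor: keep sentences that share a keyword with the question."""
    keywords = {w for w in re.findall(r"[a-z]+", question.lower()) if len(w) > 3}
    sentences = re.split(r"(?<=[.!?])\s+", document)
    kept = [s for s in sentences if keywords & set(re.findall(r"[a-z]+", s.lower()))]
    return " ".join(kept)

def compressed_retrieve(base_retrieve, question):
    """Run the base retriever, compress each hit, and drop hits that compress to nothing."""
    compressed = (extract_relevant(doc, question) for doc in base_retrieve(question))
    return [c for c in compressed if c]

# Hypothetical base retriever that always returns the same two documents.
toy_docs = [
    "Aurelian was murdered in 275. The weather was mild that year.",
    "Trade routes ran east. Grain came from Egypt.",
]
result = compressed_retrieve(lambda q: toy_docs, "who was the successor of Aurelian after he was murdered?")
print(result)
```

The first document is shortened to its one relevant sentence and the second is dropped entirely, which is the behavior a compression retriever aims for: less irrelevant text reaches the final LLM call.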

In [51]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

In [52]:
def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))


In [53]:
# Build the compressor: an LLM chain that extracts the query-relevant parts of each document
llm = OpenAI(temperature=0)
compressor = LLMChainExtractor.from_llm(llm)

In [54]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)

In [55]:
question = "who was the successor of Aurelian after he was murdered?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)



Document 1:

Tacitus
----------------------------------------------------------------------------------------------------
Document 2:

Aurelian was murdered in 275 A. D., and was succeeded by Tacitus, who met a like fate after a rule of less than two years. He was followed by Marcus Aurelius Probus, an able Illyrian officer.
----------------------------------------------------------------------------------------------------
Document 3:

- Aurelian
- successor of Aurelian
- murdered
----------------------------------------------------------------------------------------------------
Document 4:

- Aurelian
- successor of Aurelian
- murdered


## Combining various techniques

In [56]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type="mmr")
)

In [57]:
question = "who was the successor of Aurelian after he was murdered?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)



Document 1:

Tacitus
----------------------------------------------------------------------------------------------------
Document 2:

Constantius
----------------------------------------------------------------------------------------------------
Document 3:

Gaius Valerius Aurelius Diocletianus
----------------------------------------------------------------------------------------------------
Document 4:

Aetius himself became master of the soldiers and the real ruler of the empire. However, the Augusta Placidia endeavored to compass his downfall by an appeal to Bonifacius, who after his revolt of 427 A. D.had fought in the imperial cause against the Vandals. In 432 Bonifacius returned to Italy and was appointed master of the soldiers in place of Aetius.


## Other types of retrieval

It's worth noting that a vector database is not the only kind of tool for retrieving documents.

The `LangChain` retriever abstraction includes other ways to retrieve documents, such as TF-IDF or SVM.
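
For intuition about how a TF-IDF retriever ranks text without embeddings, here is a minimal pure-Python sketch. It is an illustration under stated assumptions rather than the library's code: the tokenizer, the smoothed IDF formula, and the toy texts are choices made for this example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf_rank(texts, question):
    """Return indices of texts ordered by TF-IDF cosine similarity to the question."""
    docs = [Counter(tokenize(t)) for t in texts]
    n = len(docs)
    # Smoothed inverse document frequency: terms found in fewer texts weigh more.
    df = Counter(term for d in docs for term in d)
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

    def vectorize(counts):
        # Terms never seen in the corpus are ignored.
        return {t: c * idf[t] for t, c in counts.items() if t in idf}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
        return dot / norm if norm else 0.0

    q = vectorize(Counter(tokenize(question)))
    scores = [cosine(q, vectorize(d)) for d in docs]
    return sorted(range(n), key=lambda i: scores[i], reverse=True)

# Hypothetical toy corpus.
toy_texts = [
    "Aurelian was succeeded by Tacitus",
    "Claudius defeated the Goths",
    "Diocletian reorganized the empire",
]
ranking = tfidf_rank(toy_texts, "who succeeded Aurelian")
print(ranking[0])
```

Because the score depends only on shared terms, this kind of retriever finds exact word matches well but cannot match paraphrases the way an embedding-based search can.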

In [58]:
from langchain.retrievers import SVMRetriever
from langchain.retrievers import TFIDFRetriever
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [59]:
# Load PDF
loader = PyPDFLoader("/home/centrox_ai/Desktop/ABDULLAH/llama2/rag/1.pdf")
pages = loader.load()
all_page_text = [p.page_content for p in pages]
joined_page_text = " ".join(all_page_text)

# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)
splits = text_splitter.split_text(joined_page_text)


In [60]:
# Build the retrievers: SVM ranks by embeddings, TF-IDF ranks by term statistics
svm_retriever = SVMRetriever.from_texts(splits, embedding)
tfidf_retriever = TFIDFRetriever.from_texts(splits)

In [61]:
question = "who was the successor of Aurelian after he was murdered?"
docs_svm=svm_retriever.get_relevant_documents(question)
docs_svm[0]

Document(page_content='Milan, when he was killed at the instigation of his officers, who\nproclaimed as his successor one of their own number, Marcus\nAurelius Claudius.\nClaudius Gothicus, 268 –270 A. D. The rule of Claudius\nlasted only two years, in which his greatest achievement was the\ncrushing defeat which he inflicted upon the Goths who had again\noverrun Greece and the adjacent lands (269 A. D.). This victory\nwon him the name of Gothicus. Upon the death of Claudius\nin 270 A. D., the army chose Lucius Domitius Aurelianus as\nemperor.\nLucius Domitius Aurelianus, 270 –275 A. D. Aurelian’s first\ntask was to clear Italy and the Danubian provinces of barbarian\ninvaders. Two incursions of the Alamanni into Raetia and\nItaly were repulsed, the latter with great slaughter. But the\nemperor recognized that the security of Italy could no longer\nbe guaranteed and so he ordered the fortification of the Italian\ncities. The imposing wall which still marks the boundary of part\nof anci

In [62]:
question = "who was the successor of Aurelian after he was murdered?"
docs_tfidf=tfidf_retriever.get_relevant_documents(question)
docs_tfidf[0]

Document(page_content='Gaul had acknowledged the Roman emperor Claudius Gothicus.\nAfter several successors of Postumus had been overthrown by the\nmutinous Gallic soldiery, Publius Esuvius Tetricus was appointed\nemperor in Gaul and Britain. However, foreseeing the speedy\ndissolution of his empire, he secretly entered into negotiations\nwith Aurelian. The latter invaded Gaul and met the Gallic army\nat the plain of Chalons. In the course of the battle, Tetricus went\nover to Aurelian, who won a complete victory. Britain and Gaul\nsubmitted to the conqueror (274 A. D.). Thus the unity of the\nempire was restored and Aurelian assumed the title of “Restorer\nof the World ”(restitutor orbis ).\nDominus et deus natus. Not only was Aurelian one of\nthe greatest of Roman commanders; he also displayed sound\njudgment in his administration. Here his chief work was the 314 A History of Rome to 565 A. D.\nsuppression of the debased silver currency and the issuing of\na much improved coinage. Au