# LangChain’s indexes and retrievers

LangChain’s indexes and retrievers provide modular, adaptable, and customizable options for handling unstructured data with LLMs. The primary index types in LangChain are built on vector databases, with an emphasis on embedding-based indexes.
Retrievers fetch the documents most relevant to a query so that they can be included in a language model prompt. In LangChain, a retriever takes a query string as input and returns a list of documents relevant to that query. Older releases exposed this through a `get_relevant_documents` method; recent versions, including the one used here, call the retriever with `invoke`.

### Install the necessary Python packages and use the TextLoader class to load text files and create a LangChain Document object.

In [1]:
%pip install langchain==1.2.0 langchain-community==0.4.1 langchain-text-splitters==1.1.0 openai tiktoken python-dotenv "deeplake[enterprise]<4.0.0"

Note: you may need to restart the kernel to use updated packages.


In [2]:
from langchain_community.document_loaders import TextLoader

### text to write to a local file

In [3]:
text =""" Google opens up its AI language model PaLM to challenge OpenAI and GPT-3 Google offers developers access to one of its most advanced AI language models: PaLM. The search giant is launching an API for PaLM alongside a number of AI enterprise tools it says will help businesses "generate text, images, code, videos, audio, and more from simple natural language prompts."PaLM is a large language model, or LLM, similar to the GPT series created by OpenAI or Meta's LLaMA family of models. Google first announced PaLM in April 2022. Like other LLMs, PaLM is a flexible system that can potentially carry out all sorts of text generation and editing tasks. You could train PaLM to be a conversational chatbot like ChatGPT, for example, or you could use it for tasks like summarizing text or even writing code. (It's similar to features Google also announced today for its Workspace apps like Google Docs and Gmail.)"""
text

' Google opens up its AI language model PaLM to challenge OpenAI and GPT-3 Google offers developers access to one of its most advanced AI language models: PaLM. The search giant is launching an API for PaLM alongside a number of AI enterprise tools it says will help businesses "generate text, images, code, videos, audio, and more from simple natural language prompts."PaLM is a large language model, or LLM, similar to the GPT series created by OpenAI or Meta\'s LLaMA family of models. Google first announced PaLM in April 2022. Like other LLMs, PaLM is a flexible system that can potentially carry out all sorts of text generation and editing tasks. You could train PaLM to be a conversational chatbot like ChatGPT, for example, or you could use it for tasks like summarizing text or even writing code. (It\'s similar to features Google also announced today for its Workspace apps like Google Docs and Gmail.)'

### write text to local file

In [4]:
with open("my_file.txt", "w") as file:
    file.write(text)

### use TextLoader to load text from local file

In [5]:
loader = TextLoader("my_file.txt")
docs_from_file = loader.load()

print(len(docs_from_file))

1


### Use CharacterTextSplitter to split the documents into text snippets called “chunks.”

`chunk_overlap` is the number of characters shared by two consecutive chunks. The overlap preserves context by ensuring that important information is not cut off at chunk boundaries.

In [6]:
from langchain_text_splitters import CharacterTextSplitter

### create a text splitter

In [7]:
text_splitter = CharacterTextSplitter(chunk_size=200, chunk_overlap=20)

### split documents into chunks

In [8]:
docs = text_splitter.split_documents(docs_from_file)
docs

[Document(metadata={'source': 'my_file.txt'}, page_content='Google opens up its AI language model PaLM to challenge OpenAI and GPT-3 Google offers developers access to one of its most advanced AI language models: PaLM. The search giant is launching an API for PaLM alongside a number of AI enterprise tools it says will help businesses "generate text, images, code, videos, audio, and more from simple natural language prompts."PaLM is a large language model, or LLM, similar to the GPT series created by OpenAI or Meta\'s LLaMA family of models. Google first announced PaLM in April 2022. Like other LLMs, PaLM is a flexible system that can potentially carry out all sorts of text generation and editing tasks. You could train PaLM to be a conversational chatbot like ChatGPT, for example, or you could use it for tasks like summarizing text or even writing code. (It\'s similar to features Google also announced today for its Workspace apps like Google Docs and Gmail.)')]
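
The output above contains a single chunk even though `chunk_size=200`. `CharacterTextSplitter` splits on a separator (by default a blank line, `"\n\n"`) and only then merges the pieces up to `chunk_size`; the sample text has no blank lines, so it is not divided. To illustrate what `chunk_size` and `chunk_overlap` mean, here is a minimal pure-Python sketch of a fixed-size sliding window. The helper `chunk_text` is hypothetical and is not how LangChain's splitters are implemented.

```python
def chunk_text(text: str, chunk_size: int = 200, chunk_overlap: int = 20) -> list[str]:
    """Split text into fixed-size character windows that overlap (hypothetical helper)."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window has already reached the end of the text
    return chunks


sample = "abcdefghij" * 5  # 50 characters of toy input
chunks = chunk_text(sample, chunk_size=20, chunk_overlap=5)
print(len(chunks))
# The tail of one chunk is repeated at the head of the next one.
print(chunks[0][-5:] == chunks[1][:5])
```

In LangChain itself, `RecursiveCharacterTextSplitter` from `langchain_text_splitters` falls back to progressively smaller separators, which is one way to get several chunks from a text like this one.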

### Create a vector embedding for each text snippet.

These embeddings allow us to effectively search for documents or portions of documents that relate to our query by examining their semantic similarities.
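
To make "semantic similarity" concrete, the sketch below ranks a few toy vectors by cosine similarity to a query vector. The three-dimensional vectors and the document names are invented for illustration; the embeddings used in this notebook have 1,536 dimensions, and a vector store may use a different distance metric.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the names of the k vectors most similar to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy "embeddings" (assumed values, not produced by a real model).
toy_docs = {
    "palm_api": [0.9, 0.1, 0.0],
    "workspace": [0.2, 0.8, 0.1],
    "cooking": [0.0, 0.1, 0.9],
}
toy_query = [0.8, 0.3, 0.0]
print(top_k(toy_query, toy_docs, k=2))
```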

In [9]:
from dotenv import load_dotenv
load_dotenv()

from langchain_community.embeddings import OpenAIEmbeddings

In [10]:
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")



### Vectorstore

In [11]:
from langchain_community.vectorstores import DeepLake



### Create Deep Lake Dataset

In [12]:
import os

my_activeloop_org_id = os.getenv("ACTIVELOOP_ORG_ID")
my_activeloop_dataset_name = "langchain_course_indexers_retrievers"
dataset_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"
db = DeepLake(dataset_path=dataset_path, embedding=embeddings)

# Add documents to the Deep Lake dataset
db.add_documents(docs)

print(db)


Deep Lake Dataset in hub://rafaljacobmatthew/langchain_course_indexers_retrievers already exists, loading from the storage


Creating 1 embeddings in 1 batches of size 1:: 100%|██████████| 1/1 [00:42<00:00, 42.35s/it]

Dataset(path='hub://rafaljacobmatthew/langchain_course_indexers_retrievers', tensors=['embedding', 'id', 'metadata', 'text'])

  tensor      htype      shape     dtype  compression
  -------    -------    -------   -------  ------- 
 embedding  embedding  (1, 1536)  float32   None   
    id        text      (1, 1)      str     None   
 metadata     json      (1, 1)      str     None   
   text       text      (1, 1)      str     None   
<langchain_community.vectorstores.deeplake.DeepLake object at 0x7f59268f1b90>





### Create retriever from db

In [13]:
retriever = db.as_retriever()
retriever

VectorStoreRetriever(tags=['DeepLake', 'OpenAIEmbeddings'], vectorstore=<langchain_community.vectorstores.deeplake.DeepLake object at 0x7f59268f1b90>, search_kwargs={})

### Use the RetrievalQA class to define a question-answering chain over an external data source.

In [14]:
from langchain_classic.chains import RetrievalQA
from langchain_community.chat_models import ChatOpenAI

### Create a retrieval chain

In [24]:
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-3.5-turbo"),
    chain_type="stuff",
    retriever=retriever
)

### Query our document about a specific topic found in the documents.

In [16]:
query = "How Google plans to challenge OpenAI?"
response = qa_chain.invoke({"query": query})["result"]
response


"Google plans to challenge OpenAI by offering developers access to its advanced AI language model PaLM, which is similar to OpenAI's GPT series. By launching an API for PaLM and providing AI enterprise tools for businesses to generate text, images, code, videos, audio, and more from natural language prompts, Google aims to compete with OpenAI in the field of AI language models."

## Behind The Scenes

When creating the retrieval chain, we set `chain_type` to "stuff". This is the most straightforward document chain ("stuff" as in "to stuff" or "to fill"): it takes a list of documents, inserts them all into a prompt, and passes that prompt to an LLM. Because every document goes into a single prompt, this approach only works when the combined documents fit within the LLM's context window.
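
Conceptually, "stuffing" amounts to concatenating the retrieved documents into one prompt. The sketch below shows that idea in plain Python. The function `build_stuff_prompt` and its template are hypothetical and do not reproduce the prompt LangChain actually sends.

```python
def build_stuff_prompt(question: str, documents: list[str]) -> str:
    """Place every document into a single prompt, followed by the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Two short passages standing in for retrieved documents.
retrieved = [
    "Google is launching an API for PaLM.",
    "PaLM is a large language model, or LLM.",
]
prompt = build_stuff_prompt("How Google plans to challenge OpenAI?", retrieved)
print(prompt)
```

Since every document is included verbatim, the prompt grows with the number and size of the retrieved documents.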

The process also involves conducting a similarity search using embeddings to find documents that match and can be used as context for the LLM. While this might appear limited in scope with a single document, its effectiveness is enhanced when dealing with multiple documents segmented into "chunks." We can supply the LLM with the relevant information within its context size by selecting the most relevant documents based on semantic similarity.

This example highlights the role of indexes and retrievers in improving LLM performance on document-based data. Documents are transformed into numerical vectors (embeddings) and stored in a specialized database such as Deep Lake; at query time, the user's query is embedded in the same way so that the most relevant documents can be found and passed to the model.

The approach works because the retriever can pinpoint documents that lie close to the user's query in the embedding space, which gives the LLM relevant context to answer from.

## A Potential Problem

This method poses a notable challenge, especially with larger data sets. In the example, the text was split by character count rather than by meaning, so the text retrieved for a user's query can contain both relevant and irrelevant material.

Incorporating unrelated content in the LLM prompt can be problematic for two main reasons:

1. It may distract the LLM from focusing on essential details.
2. It consumes space in the prompt that could be allocated to more relevant information.

## Possible Solution: Contextual Compression

A `DocumentCompressor` can address this issue. Instead of returning retrieved documents as-is, you can compress them using the context of the given query so that only the relevant information is returned. "Compressing" here refers both to shortening the contents of an individual document and to filtering out documents wholesale.
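
The two operations, trimming a document and dropping it entirely, can be sketched without an LLM. The toy code below uses plain keyword overlap as a stand-in for the relevance judgment; this is an assumption made for illustration and is far cruder than an LLM-based compressor. The function names and the stopword list are hypothetical.

```python
import re


def _words(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def compress_document(query: str, document: str, min_overlap: int = 1) -> str:
    """Keep only sentences that share enough non-trivial words with the query."""
    stopwords = {"how", "to", "the", "a", "of", "is", "and", "or", "its"}
    query_words = _words(query) - stopwords
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    kept = [s for s in sentences if len(_words(s) & query_words) >= min_overlap]
    return " ".join(kept)


def compress_documents(query: str, documents: list[str]) -> list[str]:
    """Compress each document and drop those with nothing relevant left."""
    compressed = (compress_document(query, d) for d in documents)
    return [c for c in compressed if c]


toy_retrieved = [
    "Google is launching an API for PaLM. You could use it for summarizing text.",
    "The weather was sunny all week.",
]
result = compress_documents("How Google plans to challenge OpenAI?", toy_retrieved)
print(result)
```

The first document is shortened to the one sentence that mentions a query term, and the second is filtered out because none of its sentences do.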

The `ContextualCompressionRetriever` serves as a wrapper for another retriever within LangChain. It combines a base retriever with a `DocumentCompressor`, ensuring that only the most pertinent segments of the documents retrieved by the base retriever are presented in response to a specific query.

A commonly used compressor is `LLMChainExtractor`, which uses an LLM chain to extract only those statements from each document that are relevant to the query. Below, a `ContextualCompressionRetriever` wraps the base retriever with an `LLMChainExtractor`: the extractor reviews the initially retrieved documents and keeps only the content directly relevant to the user's query.

In [20]:
from langchain_classic.retrievers import ContextualCompressionRetriever
from langchain_classic.retrievers.document_compressors import LLMChainExtractor

# create GPT wrapper
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# create compressor for the retriever
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever
)

### Retrieve Compressed Documents

Once the `compression_retriever` is created, we can retrieve the relevant compressed documents for a query.

In [22]:
# retrieving compressed documents
retrieved_docs = compression_retriever.invoke(
    "How Google plans to challenge OpenAI?"
)
print(retrieved_docs[0].page_content)

Google opens up its AI language model PaLM to challenge OpenAI and GPT-3. Google offers developers access to one of its most advanced AI language models: PaLM. PaLM is a large language model, or LLM, similar to the GPT series created by OpenAI. Google first announced PaLM in April 2022.


Compressors aim to send only essential data to the LLM, which also leaves room in the prompt for more relevant information. Because the compressor takes care of precision, the initial retrieval step can focus on recall (for example, by increasing the number of documents returned).

We saw how it is possible to create a retriever from a `.txt` file; however, data can come in different types. The LangChain framework offers diverse classes that enable data to be loaded from multiple sources, including PDFs, URLs, and Google Drive, among others, which we will explore later.