# Retrievers

Retrievers go a step further than simple similarity search. The `Vector Store-Backed Retriever` does exactly what its name implies, but there are other, more powerful types of retrievers, namely the `Multi-Query Retriever`, the `Self-Querying Retriever`, and the `Parent Document Retriever`. This notebook explores all four of them.

In [None]:
%pip install --user "ibm-watsonx-ai==1.1.2" | tail -n 1
%pip install --user "langchain==0.2.1" | tail -n 1
%pip install --user "langchain-ibm==0.1.11" | tail -n 1
%pip install --user "langchain-community==0.2.1" | tail -n 1
%pip install --user "chromadb==0.4.24" | tail -n 1
%pip install --user "pypdf==4.3.1" | tail -n 1
%pip install --user "lark==1.1.9" | tail -n 1

First, we need to build an LLM, split the documents into chunks, and build an embedding model to process the chunks.

### Building LLM

In [None]:
from ibm_watsonx_ai.foundation_models import ModelInference
from ibm_watsonx_ai.metanames import GenTextParamsMetaNames as GenParams
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models.extensions.langchain import WatsonxLLM

In [None]:
def llm():
    model_id = 'mistralai/mixtral-8x7b-instruct-v01'
    
    parameters = {
        GenParams.MAX_NEW_TOKENS: 256,  # this controls the maximum number of tokens in the generated output
        GenParams.TEMPERATURE: 0.5,  # this controls the randomness or creativity of the model's responses
    }
    
    credentials = {
        "url": "https://us-south.ml.cloud.ibm.com"
    }
    
    
    project_id = "skills-network"
    
    model = ModelInference(
        model_id=model_id,
        params=parameters,
        credentials=credentials,
        project_id=project_id
    )
    
    mixtral_llm = WatsonxLLM(model=model)
    return mixtral_llm

### Splitting Text

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [None]:
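def text_splitter(data, chunk_size, chunk_overlap):
    # Split the loaded documents into chunks of at most chunk_size characters,
    # with chunk_overlap characters shared between neighbouring chunks.
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
    )
    chunks = splitter.split_documents(data)
    return chunks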
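
The effect of `chunk_size` and `chunk_overlap` can be seen in a minimal pure-Python sketch. It is a simplified fixed-width sliding window, not the separator-aware logic of `RecursiveCharacterTextSplitter`; the function name `split_with_overlap` and the sample string are made up for illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-width windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"  # illustrative input only
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(pieces)
```

Each chunk starts with the last `chunk_overlap` characters of the previous one, so text cut at a boundary still appears intact in at least one chunk.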

### Building Embedding Model

In [None]:
from ibm_watsonx_ai.metanames import EmbedTextParamsMetaNames
from langchain_ibm import WatsonxEmbeddings

In [None]:
def watsonx_embedding():
    embed_params = {
        # Maximum number of input tokens kept per text. 3 is very small,
        # so consider raising it if retrieval quality looks poor.
        EmbedTextParamsMetaNames.TRUNCATE_INPUT_TOKENS: 3,
        EmbedTextParamsMetaNames.RETURN_OPTIONS: {"input_text": True},
    }

    embedding_model = WatsonxEmbeddings(
        model_id="ibm/slate-125m-english-rtrvr",
        url="https://us-south.ml.cloud.ibm.com",
        project_id="skills-network",
        params=embed_params,
    )
    return embedding_model

Now we are ready to explore the retrievers.

## Exploring Retrievers

### Vector Store-Backed Retriever

In [None]:
!wget "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/MZ9z1lm-Ui3YBp3SYWLTAQ/companypolicies.txt"

In [None]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("companypolicies.txt")
txt_data = loader.load()

txt_data

In [None]:
chunks_txt = text_splitter(txt_data, 200, 20)

In [None]:
from langchain_community.vectorstores import Chroma

vectordb = Chroma.from_documents(chunks_txt, watsonx_embedding())

#### Similarity Search

In [None]:
query = "email policy"
retriever = vectordb.as_retriever()

In [None]:
docs = retriever.invoke(query)
docs

Alternatively, pass `search_kwargs` to limit the results to the top `k` documents:

In [None]:
retriever = vectordb.as_retriever(search_kwargs={"k": 1})
docs = retriever.invoke(query)
docs

You can also set a similarity score threshold, so that only documents scoring at or above it are returned:

In [None]:
retriever = vectordb.as_retriever(
    search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.4}
)
docs = retriever.invoke(query)
docs
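
The `k` and `score_threshold` options can be pictured with a small pure-Python sketch. It assumes cosine similarity and hand-made two-dimensional vectors in place of real embeddings; the helper names `cosine` and `search` and the toy store are hypothetical, and the scores Chroma actually reports depend on the configured distance metric.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, store, k=4, score_threshold=None):
    """Rank stored texts by similarity, optionally drop low scores, keep the top k."""
    scored = sorted(
        ((cosine(query_vec, vec), text) for text, vec in store.items()),
        reverse=True,
    )
    if score_threshold is not None:
        scored = [(score, text) for score, text in scored if score >= score_threshold]
    return [text for _, text in scored[:k]]


# Toy "embeddings" invented for this example.
store = {
    "email policy": [1.0, 0.0],
    "internet policy": [0.8, 0.6],
    "smoking policy": [0.0, 1.0],
}
query_vec = [1.0, 0.0]

print(search(query_vec, store, k=1))
print(search(query_vec, store, k=3, score_threshold=0.4))
```

With `k` alone, the weakest match is still returned as long as it ranks high enough; the threshold removes it regardless of rank.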

#### MMR Retrieval

According to IBM: "MMR in vector stores is a technique used to balance the relevance and diversity of retrieved results. It selects documents that are both highly relevant to the query and minimally similar to previously selected documents. This approach helps to avoid redundancy and ensures a more comprehensive coverage of different aspects of the query."

In [None]:
retriever = vectordb.as_retriever(search_type="mmr")
docs = retriever.invoke(query)
docs
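
The trade-off between relevance and diversity can be sketched as a greedy selection loop. This is a minimal illustration under stated assumptions, not the vector store's implementation: it uses cosine similarity, toy two-dimensional vectors, and a weight called `lambda_mult` (0.5 here) that balances the two terms. All names and data are invented for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, candidates, k=2, lambda_mult=0.5):
    """Greedily pick k items, rewarding relevance and penalising redundancy."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:

        def score(name):
            relevance = cosine(query_vec, candidates[name])
            # Similarity to the closest already-selected item (0 if none yet).
            redundancy = max(
                (cosine(candidates[name], candidates[s]) for s in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Two near-duplicate chunks and one different chunk (toy vectors).
candidates = {
    "email policy A": [1.0, 0.0],
    "email policy B": [0.99, -0.14],
    "internet policy": [0.6, 0.8],
}
query_vec = [1.0, 0.1]

by_similarity = sorted(
    candidates, key=lambda name: cosine(query_vec, candidates[name]), reverse=True
)[:2]
by_mmr = mmr(query_vec, candidates, k=2)
print(by_similarity)
print(by_mmr)
```

Plain similarity ranking returns both near-duplicate chunks, while the MMR loop keeps one of them and fills the second slot with the less similar but non-redundant chunk.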

### Multi-Query Retriever

The `MultiQueryRetriever` uses an LLM to generate multiple queries from different perspectives for a given user input query. For each query, it retrieves a set of relevant documents and then takes the unique union of these results to form a larger set of potentially relevant documents.
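
The "unique union" step can be sketched in plain Python. In this illustration the query variants are hand-written stand-ins for what the LLM would generate, and `keyword_retrieve` is a hypothetical keyword matcher standing in for the vector store retriever.

```python
def unique_union(result_lists):
    """Merge several result lists, keeping the first occurrence of each document."""
    seen = set()
    merged = []
    for results in result_lists:
        for doc in results:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


def multi_query_retrieve(queries, retrieve):
    """Run every query variant through the same retriever and merge the hits."""
    return unique_union(retrieve(q) for q in queries)


# Toy corpus and retriever, invented for this example.
corpus = [
    "LangChain chains LLM calls together",
    "LangChain supports retrievers and agents",
    "Chroma stores embeddings",
]


def keyword_retrieve(query):
    words = query.lower().split()
    return [doc for doc in corpus if any(word in doc.lower() for word in words)]


# Hand-written variants; in the real retriever an LLM produces these.
queries = ["chains", "retrievers", "langchain"]
merged = multi_query_retrieve(queries, keyword_retrieve)
print(merged)
```

The three variants produce four hits in total, but only two distinct documents survive the merge, which is why the combined set stays manageable even when the variants overlap.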

In [None]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/ioch1wsxkfqgfLLgmd-6Rw/langchain-paper.pdf")
pdf_data = loader.load()

pdf_data[1]

In [None]:
# Split
chunks_pdf = text_splitter(pdf_data, 500, 20)

# VectorDB
# Delete the embeddings of the previous document before storing the new ones.
ids = vectordb.get()["ids"]
vectordb.delete(ids)
vectordb = Chroma.from_documents(documents=chunks_pdf, embedding=watsonx_embedding())

In [None]:
from langchain.retrievers.multi_query import MultiQueryRetriever

query = "What does the paper say about langchain?"

retriever = MultiQueryRetriever.from_llm(
    retriever=vectordb.as_retriever(), llm=llm()
)

Set up logging so that the queries generated by the LLM are printed:

In [None]:
import logging

logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

In [None]:
docs = retriever.invoke(query)
docs

### Self-Querying Retriever

A Self-Querying Retriever, as the name suggests, has the ability to query itself. Specifically, given a natural language query, the retriever uses a query-constructing LLM chain to generate a structured query. It then applies this structured query to its underlying vector store.
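
To make "structured query" concrete, here is a minimal sketch of what such a query might look like once constructed, and how its filter is applied to document metadata. The `Condition` and `StructuredQuery` classes, the comparator names, and the filters are hand-written assumptions for illustration; in the real retriever the LLM chain builds the query and the vector store applies the filter. The metadata mirrors a few of the movie documents defined in the cells that follow.

```python
import operator
from dataclasses import dataclass
from typing import Any

# Comparator names chosen for this sketch.
COMPARATORS = {
    "eq": operator.eq,
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}


@dataclass
class Condition:
    attribute: str
    comparator: str
    value: Any


@dataclass
class StructuredQuery:
    query: str         # free-text part, used for similarity search
    conditions: list   # metadata conditions, all of which must hold


def matches(metadata, conditions):
    """Return True if the metadata satisfies every condition."""
    for cond in conditions:
        if cond.attribute not in metadata:
            return False  # a missing attribute cannot satisfy the filter
        if not COMPARATORS[cond.comparator](metadata[cond.attribute], cond.value):
            return False
    return True


movies = [
    {"year": 1993, "rating": 7.7, "genre": "science fiction"},
    {"year": 2006, "director": "Satoshi Kon", "rating": 8.6},
    {"year": 1979, "director": "Andrei Tarkovsky", "genre": "thriller", "rating": 9.9},
    {"year": 1995, "genre": "animated"},
]

# "a movie rated higher than 8.5" -> filter only, no free-text part
only_filter = StructuredQuery(query="", conditions=[Condition("rating", "gt", 8.5)])
high_rated = [m["year"] for m in movies if matches(m, only_filter.conditions)]

# A composite filter: two conditions joined with AND
composite = StructuredQuery(
    query="",
    conditions=[Condition("rating", "gt", 8.5), Condition("genre", "eq", "thriller")],
)
high_rated_thrillers = [m["year"] for m in movies if matches(m, composite.conditions)]

print(high_rated)
print(high_rated_thrillers)
```

Documents without the filtered attribute (such as the 1995 entry, which has no rating) are excluded here, which is one reason consistent metadata matters for this retriever.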

In [None]:
from langchain_core.documents import Document
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
import lark  # not used directly; the self-query constructor needs this parser installed

In [None]:
docs = [
    Document(
        page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata={"year": 1993, "rating": 7.7, "genre": "science fiction"},
    ),
    Document(
        page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
        metadata={"year": 2010, "director": "Christopher Nolan", "rating": 8.2},
    ),
    Document(
        page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
        metadata={"year": 2006, "director": "Satoshi Kon", "rating": 8.6},
    ),
    Document(
        page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them",
        metadata={"year": 2019, "director": "Greta Gerwig", "rating": 8.3},
    ),
    Document(
        page_content="Toys come alive and have a blast doing so",
        metadata={"year": 1995, "genre": "animated"},
    ),
    Document(
        page_content="Three men walk into the Zone, three men walk out of the Zone",
        metadata={
            "year": 1979,
            "director": "Andrei Tarkovsky",
            "genre": "thriller",
            "rating": 9.9,
        },
    ),
]

In [None]:
metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie. One of ['science fiction', 'comedy', 'drama', 'thriller', 'romance', 'action', 'animated']",
        type="string",
    ),
    AttributeInfo(
        name="year",
        description="The year the movie was released",
        type="integer",
    ),
    AttributeInfo(
        name="director",
        description="The name of the movie director",
        type="string",
    ),
    AttributeInfo(
        name="rating", description="A 1-10 rating for the movie", type="float"
    ),
]

In [None]:
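# Use a separate collection so the movie documents are kept apart from the PDF chunks stored earlier.
vectordb = Chroma.from_documents(docs, watsonx_embedding(), collection_name="movies")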

document_content_description = "Brief summary of a movie."

retriever = SelfQueryRetriever.from_llm(
    llm(),
    vectordb,
    document_content_description,
    metadata_field_info,
)

In [None]:
# Print each result explicitly: a notebook cell only displays its last expression.

# This example only specifies a filter
print(retriever.invoke("I want to watch a movie rated higher than 8.5"))

# This example specifies a query and a filter
print(retriever.invoke("Has Greta Gerwig directed any movies about women"))

# This example specifies a composite filter
print(retriever.invoke("What's a highly rated (above 8.5) science fiction film?"))

### Parent Document Retriever

When splitting documents for retrieval, there are often conflicting desires:

1. You may want to have small documents so that their embeddings can most accurately reflect their meaning. If the documents are too long, the embeddings can lose meaning.
2. You want to have long enough documents so that the context of each chunk is retained.

The `ParentDocumentRetriever` strikes that balance by splitting and storing small chunks of data. During retrieval, it first fetches the small chunks but then looks up the parent IDs for those chunks and returns those larger documents.
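
The lookup from small chunk to parent can be sketched without any libraries. This toy version assumes fixed-width splitting and a keyword match in place of embedding similarity; the class `ToyParentRetriever`, its methods, and the sample text are hypothetical and only illustrate the idea.

```python
class ToyParentRetriever:
    """Index small child chunks, but return the larger parent they came from."""

    def __init__(self, parent_size, child_size):
        self.parent_size = parent_size
        self.child_size = child_size
        self.docstore = {}     # parent_id -> parent text
        self.child_index = []  # (child text, parent_id) pairs

    @staticmethod
    def _split(text, size):
        return [text[i:i + size] for i in range(0, len(text), size)]

    def add_document(self, text):
        for parent in self._split(text, self.parent_size):
            parent_id = len(self.docstore)
            self.docstore[parent_id] = parent
            for child in self._split(parent, self.child_size):
                # Each child remembers which parent it was cut from.
                self.child_index.append((child, parent_id))

    def search_children(self, query):
        # Keyword match standing in for a vector similarity search.
        return [(child, pid) for child, pid in self.child_index if query in child]

    def invoke(self, query):
        parent_ids = []
        for _, pid in self.search_children(query):
            if pid not in parent_ids:  # return each parent only once
                parent_ids.append(pid)
        return [self.docstore[pid] for pid in parent_ids]


text = "smoking is banned indoors at all times. email accounts are for work use only."
toy = ToyParentRetriever(parent_size=40, child_size=10)
toy.add_document(text)

child_hits = toy.search_children("smoking")
parents = toy.invoke("smoking")
print(child_hits[0][0])
print(parents[0])
```

The search runs over the short children, where a match is precise, while the caller receives the longer parent, which carries the surrounding context.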

In [None]:
from langchain.retrievers import ParentDocumentRetriever
from langchain_text_splitters import CharacterTextSplitter
from langchain.storage import InMemoryStore

In [None]:
# Set two splitters. One is with big chunk size (parent) and one is with small chunk size (child)
parent_splitter = CharacterTextSplitter(chunk_size=2000, chunk_overlap=20, separator='\n')
child_splitter = CharacterTextSplitter(chunk_size=400, chunk_overlap=20, separator='\n')

vectordb = Chroma(
    collection_name="split_parents", embedding_function=watsonx_embedding()
)

# The storage layer for the parent documents
store = InMemoryStore()

In [None]:
retriever = ParentDocumentRetriever(
    vectorstore=vectordb,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

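# Add the full loaded document rather than the pre-split 200-character chunks,
# so that the parent and child splitters actually produce different sizes.
retriever.add_documents(txt_data)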

In [None]:
len(list(store.yield_keys()))

In [None]:
sub_docs = vectordb.similarity_search("smoking policy")
print(sub_docs[0].page_content)

Then we can retrieve the relevant larger chunk through the retriever:

In [None]:
retrieved_docs = retriever.invoke("smoking policy")
print(retrieved_docs[0].page_content)