# Self Retrievers

A self-querying retriever analyzes a query written in natural language and converts it into a structured query that it can interpret and process efficiently. It then uses that structured query to search and filter the relevant information in its database or stored documents. This means that, besides comparing the user's query against the documents to find matches, it can also filter the results by specific criteria extracted from the query.
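
As a rough illustration of the idea (not LangChain's implementation), the sketch below hard-codes the structured query that a language model might produce from a request such as "retriever docs that include code" and applies it to a toy in-memory collection. `StructuredQuery`, `run_query`, and the sample documents are hypothetical names introduced here, and the word-overlap score merely stands in for embedding similarity.

```python
from dataclasses import dataclass, field


@dataclass
class StructuredQuery:
    """Hypothetical structured form of a natural-language request."""

    query: str  # text compared against document content
    filters: dict = field(default_factory=dict)  # exact-match metadata constraints


def run_query(sq, docs):
    """Filter by metadata first, then rank the survivors by word overlap."""
    terms = set(sq.query.lower().split())
    candidates = [
        d
        for d in docs
        if all(d["metadata"].get(k) == v for k, v in sq.filters.items())
    ]
    return sorted(
        candidates,
        key=lambda d: len(terms & set(d["text"].lower().split())),
        reverse=True,
    )


toy_docs = [
    {"text": "how to build a retriever", "metadata": {"code_snippet": True}},
    {"text": "retriever overview and links", "metadata": {"code_snippet": False}},
    {"text": "chains with a code example", "metadata": {"code_snippet": True}},
]

# Semantic part ("retriever") plus a metadata filter (must include code).
sq = StructuredQuery(query="retriever", filters={"code_snippet": True})
results = run_query(sq, toy_docs)
print([d["text"] for d in results])
```

The second toy document mentions retrievers but is dropped by the filter, which is the behaviour a plain similarity search cannot provide on its own.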

![Self Retrievers](./diagrams/slide_diagrama_02.png)


## Libraries

In [1]:
from pprint import pprint

from dotenv import load_dotenv
from langchain.chains import create_tagging_chain_pydantic
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.indexes import SQLRecordManager, index
from langchain.retrievers import SelfQueryRetriever
from langchain.schema import Document
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from pydantic import BaseModel, Field

from src.langchain_docs_loader import LangchainDocsLoader, num_tokens_from_string

load_dotenv()

False

## Data loading

In [2]:
text_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.MARKDOWN,
    chunk_size=400,
    chunk_overlap=50,
    length_function=num_tokens_from_string,
)

loader = LangchainDocsLoader(include_output_cells=False)
docs = loader.load()
docs = text_splitter.split_documents(docs)
len(docs)

4398
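
The splitter settings above (`chunk_size=400`, `chunk_overlap=50`, with length measured by `num_tokens_from_string`) mean that consecutive chunks can share some content. A minimal sketch of that sliding-window idea follows, assuming a plain list of tokens. `split_with_overlap` is a hypothetical helper; the real `RecursiveCharacterTextSplitter` also splits on Markdown separators before merging pieces, so its chunk boundaries will differ.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Return windows of up to `chunk_size` tokens, each one repeating the
    last `chunk_overlap` tokens of the previous window."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks


# Integers stand in for tokens so the window boundaries are easy to read.
tokens = list(range(1000))
chunks = split_with_overlap(tokens, chunk_size=400, chunk_overlap=50)
print([(c[0], c[-1]) for c in chunks])
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of indexing some tokens twice.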

In [3]:
docs = [doc for doc in docs if doc.page_content != "```"]

## Language model initialization

In [5]:

llm = ChatOpenAI(temperature=0.1)

## Document tagging

Documents are useful on their own, but they become more useful when tagged with additional information. For example, if we tag documents with their language, we can filter out those that are not in the language we care about. If we tag them with their topic, we can filter out those unrelated to the topic we care about. This reduces the search space and yields better results.

### Creating the tag schema

In [6]:
class Tags(BaseModel):
    completness: str = Field(
        description="Describes how useful is the text in terms of self-explanation. It is critical to excel.",
        enum=["Very", "Quite", "Medium", "Little", "Not"],
    )
    code_snippet: bool = Field(
        default=False,
        description="Whether the text fragment includes a code snippet. Code snippets are valid markdown code blocks.",
    )
    description: bool = Field(
        default=False, description="Whether the text fragment includes a description."
    )
    talks_about_vectorstore: bool = Field(
        default=False,
        description="Whether the text fragment talks about a vectorstore.",
    )
    talks_about_retriever: bool = Field(
        default=False, description="Whether the text fragment talks about a retriever."
    )
    talks_about_chain: bool = Field(
        default=False, description="Whether the text fragment talks about a chain."
    )
    talks_about_expression_language: bool = Field(
        default=False,
        description="Whether the text fragment talks about the LangChain Expression Language (LCEL).",
    )
    contains_markdown_table: bool = Field(
        default=False,
        description="Whether the text fragment contains a markdown table.",
    )


pprint(Tags.schema())

{'properties': {'code_snippet': {'default': False,
                                 'description': 'Whether the text fragment '
                                                'includes a code snippet. Code '
                                                'snippets are valid markdown '
                                                'code blocks.',
                                 'title': 'Code Snippet',
                                 'type': 'boolean'},
                'completness': {'description': 'Describes how useful is the '
                                               'text in terms of '
                                               'self-explanation. It is '
                                               'critical to excel.',
                                'enum': ['Very',
                                         'Quite',
                                         'Medium',
                                         'Little',
                                         'Not'],

### Creating the tag generation chain (tagger)

In [7]:
tagging_prompt = """Extract the desired information from the following passage.

Only extract the properties mentioned in the 'information_extraction' function.
Completness should involve more than one sentence.
To consider that a passage talks about a property, it is enough that it mentions it once.
If there is no mention of a property, set it to False. It only applies for the talks_about_* properties.

For instance,
To set `talks_about_vectorstore` to True, document should contain the word 'vectorstore' at least once.
To set `talks_about_retriever` to True, document should contain the word 'retriever' at least once.
To set `talks_about_chain` to True, document should contain the word 'chain' at least once.
To set `talks_about_expression_language` to True, document should contain the word 'expression language' or 'LCEL' at least once.

Passage:
{input}
"""

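from langchain.prompts import ChatPromptTemplate

# Pass the custom prompt explicitly; without it the chain uses its default prompt
# and `tagging_prompt` is never applied.
tagging_chain = create_tagging_chain_pydantic(
    Tags,
    llm,
    prompt=ChatPromptTemplate.from_template(tagging_prompt),
)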
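
The keyword rules stated in the prompt can also be written as a deterministic baseline, which may help when spot-checking the tags the model returns. This is only a sketch under that assumption: `KEYWORDS` and `keyword_tags` are hypothetical names, and the notebook itself relies on the LLM rather than on string matching.

```python
# Hypothetical mapping from tag name to the words the prompt asks for.
KEYWORDS = {
    "talks_about_vectorstore": ("vectorstore",),
    "talks_about_retriever": ("retriever",),
    "talks_about_chain": ("chain",),
    "talks_about_expression_language": ("expression language", "lcel"),
}


def keyword_tags(passage):
    """Set each talks_about_* tag to True if any of its words occurs in the passage."""
    text = passage.lower()
    return {
        tag: any(word in text for word in words)
        for tag, words in KEYWORDS.items()
    }


print(keyword_tags("Compose a retriever and a prompt with LCEL."))
```

Note that plain substring matching would flag any passage containing "LangChain" as talking about a chain, which is one reason a model-based tagger can be preferable to this baseline.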

### Tagger usage examples

A fragment that contains only a list of links to other fragments that are also indexed is probably not very useful. It could cause us to retrieve a document that is not relevant to the query, while the document that is relevant fails to appear near the top of the results.

In [8]:
idx = 0

result = tagging_chain.invoke(input={"input": docs[idx].page_content})
print(result.get("input"))
pprint(result.get("text").dict())

# LangChain cookbook

Example code for building applications with LangChain, with an emphasis on more applied and end-to-end examples than contained in the [main documentation](https://python.langchain.com).
{'code_snippet': True,
 'completness': 'Medium',
 'contains_markdown_table': False,
 'description': True,
 'talks_about_chain': False,
 'talks_about_expression_language': False,
 'talks_about_retriever': False,
 'talks_about_vectorstore': False}


A fragment with a link to its documentation and a usage example would be more useful.

In [9]:
idx = 1000

result = tagging_chain.invoke(input={"input": docs[idx].page_content})
print(result.get("input"))
pprint(result.get("text").dict())

## Example​

To use the Nuclia document loader, you need to instantiate a
`NucliaUnderstandingAPI` tool:

```python
from langchain_community.tools.nuclia import NucliaUnderstandingAPI

nua = NucliaUnderstandingAPI(enable_ml=False)
```

```python
from langchain_community.document_loaders.nuclia import NucliaLoader

loader = NucliaLoader("./interview.mp4", nua)
```

You can now call the `load` the document in a loop until you get the
document.

```python
import time

pending = True
while pending:
    time.sleep(15)
    docs = loader.load()
    if len(docs) > 0:
        print(docs[0].page_content)
        print(docs[0].metadata)
        pending = False
    else:
        print("waiting...")
```
{'code_snippet': True,
 'completness': 'Very',
 'contains_markdown_table': False,
 'description': False,
 'talks_about_chain': False,
 'talks_about_expression_language': False,
 'talks_about_retriever': False,
 'talks_about_vectorstore': False}


In [10]:
idx = 1400

result = tagging_chain.invoke(input={"input": docs[idx].page_content})
print(result.get("input"))
pprint(result.get("text").dict())

YdT/AC967vQ/B1paOpgt2ubgf8tZBnafYdB/P3rtbPw+Bhrltx/ur0/OvGxWbxhpTIlUSOK0TwdaWjqYLc3NwP8AlrIM4PsOg/nXa2fh8cNctuP91en51vQWaRIFRAqjsOKsqgUdK+fr4ypVd2zCVRsrQWiRKFRAqjsBirIQKKdRXLuZhRRRSEGKMUUUwEooooAKKKKACiiigAooooGFFFFAB2pKWkoAKKWkoAKKKKACiiigC5RRRWRQUUUUAFFFFABRRRQAlFLSUAFFFFABRRRQAUlLSUAFFFFMQUUUUABANRvCGqSigDKvdKgulxLGCezdCPxrmdS8NP5bqqLcQt96N1BOPp0Nd1jNRvEGrWnXnTejKUmjwXWPANtMzvYMbWYdYnyUz/Mfr9K4
{'code_snippet': False,
 'completness': 'Not',
 'contains_markdown_table': False,
 'description': False,
 'talks_about_chain': False,
 'talks_about_expression_language': False,
 'talks_about_retriever': False,
 'talks_about_vectorstore': False}


### Tagging the documents

In [11]:
# Tag only the first 100 fragments.
docs_to_tag = docs[:100]

tagging_results = tagging_chain.batch(
    inputs=[{"input": doc.page_content} for doc in docs_to_tag],
    return_exceptions=True,  # failed calls come back as Exception objects
    config={"max_concurrency": 50},
)

# Merge the generated tags into each document's metadata, skipping failed calls.
docs_with_tags = [
    Document(
        page_content=doc.page_content,
        metadata={
            **doc.metadata,
            **result.get("text").dict(),
        },
    )
    for doc, result in zip(docs_to_tag, tagging_results)
    if not isinstance(result, Exception)
]

f"Documents with tags: {len(docs_with_tags)}"

'Documents with tags: 71'
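
Only 71 of the 100 submitted fragments were tagged, so the remaining calls presumably returned exceptions and were filtered out. The sketch below imitates that pattern in plain Python so the failures can be inspected instead of silently dropped. `run_batch` and `fake_tagger` are hypothetical stand-ins, not LangChain functions.

```python
def run_batch(func, inputs):
    """Call `func` on each input, storing raised exceptions as results."""
    results = []
    for item in inputs:
        try:
            results.append(func(item))
        except Exception as exc:  # mirror return_exceptions=True
            results.append(exc)
    return results


def fake_tagger(text):
    """Toy tagger that fails on blank passages."""
    if not text.strip():
        raise ValueError("empty passage")
    return {"code_snippet": "print(" in text}


inputs = ["print(1)", "   ", "plain prose"]
results = run_batch(fake_tagger, inputs)

# Keep successful pairs and remember which positions failed.
tagged = [(t, r) for t, r in zip(inputs, results) if not isinstance(r, Exception)]
failed = [i for i, r in enumerate(results) if isinstance(r, Exception)]
print(len(tagged), failed)
```

Keeping the failed positions makes it possible to retry only those fragments later rather than re-tagging the whole batch.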

## Document indexing

In [12]:
vectorstore = Chroma(
    collection_name="langchain_docs",
    embedding_function=OpenAIEmbeddings(),
)

record_manager = SQLRecordManager(
    db_url="sqlite:///:memory:",
    namespace="chroma/langchain_docs",
)

record_manager.create_schema()

index(
    docs_source=docs_with_tags,
    record_manager=record_manager,
    vector_store=vectorstore,
    cleanup="full",
    source_id_key="source",
)

{'num_added': 71, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}
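
The counters returned above suggest that indexing keeps a record of what has already been written, so that re-running it does not duplicate documents. The following is a conceptual sketch of that bookkeeping only, assuming content hashes stored in a set. `sync_index` is a hypothetical function, and the real `index` with `SQLRecordManager` works differently in its details.

```python
import hashlib


def sync_index(texts, seen_hashes):
    """Add unseen texts, skip known ones, and drop records that are absent
    from `texts` (loosely imitating a "full" cleanup)."""
    stats = {"num_added": 0, "num_skipped": 0, "num_deleted": 0}
    current = set()
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in current:
            continue  # duplicate inside the same batch
        current.add(digest)
        if digest in seen_hashes:
            stats["num_skipped"] += 1
        else:
            seen_hashes.add(digest)
            stats["num_added"] += 1
    stale = seen_hashes - current
    seen_hashes.difference_update(stale)
    stats["num_deleted"] = len(stale)
    return stats


records = set()
first = sync_index(["doc a", "doc b"], records)
second = sync_index(["doc a", "doc c"], records)
print(first)
print(second)
```

In this toy run the second call skips the unchanged text, adds the new one, and deletes the one that disappeared from the source.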

## Document retrieval with a `Self Retriever`

### Defining the interface for the metadata available in the index

In [13]:
metadata_field_info = [
    AttributeInfo(
        name="completness",
        description="Describes how useful is the text in terms of self-explanation. It is critical to excel.",
        type='enum=["Very", "Quite", "Medium", "Little", "Not"]',
    ),
    AttributeInfo(
        name="code_snippet",
        description="Whether the text fragment includes a code snippet. Code snippets are valid markdown code blocks.",
        type="bool",
    ),
    AttributeInfo(
        name="description",
        description="Whether the text fragment includes a description.",
        type="bool",
    ),
    AttributeInfo(
        name="talks_about_vectorstore",
        description="Whether the text fragment talks about a vectorstore.",
        type="bool",
    ),
    AttributeInfo(
        name="talks_about_retriever",
        description="Whether the text fragment talks about a retriever.",
        type="bool",
    ),
    AttributeInfo(
        name="talks_about_chain",
        description="Whether the text fragment talks about a chain.",
        type="bool",
    ),
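    # Declared so that queries mentioning the expression language can be
    # filtered on the corresponding tag, which exists in `Tags`.
    AttributeInfo(
        name="talks_about_expression_language",
        description="Whether the text fragment talks about the LangChain Expression Language (LCEL).",
        type="bool",
    ),
    AttributeInfo(
        name="contains_markdown_table",
        description="Whether the text fragment contains a markdown table.",
        type="bool",
    ),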
]

document_content_description = "Langchain documentation"

### Creating the `retriever`

In [14]:

retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vectorstore,
    document_contents=document_content_description,
    metadata_field_info=metadata_field_info,
    enable_limit=True,
    verbose=True,
)
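
Between the language model and the vector store, the structured filter has to be expressed in the store's own filter syntax. The sketch below shows one way such a translation could look for a `$`-operator style `where` dictionary like the one Chroma accepts. The tree helpers `comparison`, `operation`, and `to_where` are hypothetical and much simpler than LangChain's own query translators.

```python
def comparison(op, attribute, value):
    """Leaf node: a single condition on one metadata attribute."""
    return {"kind": "comparison", "op": op, "attribute": attribute, "value": value}


def operation(op, *args):
    """Inner node: combines several conditions with "and" or "or"."""
    return {"kind": "operation", "op": op, "args": list(args)}


def to_where(node):
    """Recursively convert the toy filter tree into a nested `$`-operator dict."""
    if node["kind"] == "comparison":
        return {node["attribute"]: {f"${node['op']}": node["value"]}}
    return {f"${node['op']}": [to_where(arg) for arg in node["args"]]}


# Filter a model might derive from "very complete documents about retrievers".
tree = operation(
    "and",
    comparison("eq", "talks_about_retriever", True),
    comparison("eq", "completness", "Very"),
)
where = to_where(tree)
print(where)
```

Only attributes listed in `metadata_field_info` can appear in such a filter, which is why the field descriptions given to the retriever matter.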

### Retrieving documents with the `retriever`

In [15]:
relevant_documents = retriever.get_relevant_documents(
    "useful documents that talk about expression language and retrievers"
)
relevant_documents

[Document(page_content='```\n\n```python\nfrom langchain.utils.math import cosine_similarity\nfrom langchain_core.output_parsers import StrOutputParser\nfrom langchain_core.prompts import PromptTemplate\nfrom langchain_core.runnables import RunnableLambda, RunnablePassthrough\nfrom langchain_openai import ChatOpenAI, OpenAIEmbeddings\n\nphysics_template = """You are a very smart physics professor. \\\nYou are great at answering questions about physics in a concise and easy to understand manner. \\\nWhen you don\'t know the answer to a question you admit that you don\'t know.\n\nHere is a question:\n{query}"""\n\nmath_template = """You are a very good mathematician. You are great at answering math questions. \\\nYou are so good because you are able to break down hard problems into their component parts, \\\nanswer the component parts, and then put them together to answer the broader question.\n\nHere is a question:\n{query}"""\n\nembeddings = OpenAIEmbeddings()\nprompt_templates = [phys

In [16]:
relevant_documents = retriever.get_relevant_documents(
    "useful documents that talk about expression language and retrievers or vectorstores"
)
relevant_documents

[Document(page_content='# Code writing\n\nExample of how to use LCEL to write Python code.\n\n```python\n%pip install --upgrade --quiet  langchain-core langchain-experimental langchain-openai\n```\n\n```python\nfrom langchain_core.output_parsers import StrOutputParser\nfrom langchain_core.prompts import (\n    ChatPromptTemplate,\n)\nfrom langchain_experimental.utilities import PythonREPL\nfrom langchain_openai import ChatOpenAI\n```\n\n```python\ntemplate = """Write some python code to solve the user\'s problem. \n\nReturn only python code in Markdown format, e.g.:\n\n```python\n....\n```"""\nprompt = ChatPromptTemplate.from_messages([("system", template), ("human", "{input}")])\n\nmodel = ChatOpenAI()\n```\n\n```python\ndef _sanitize_output(text: str):\n    _, after = text.split("```python")\n    return after.split("```")[0]\n```\n\n```python\nchain = prompt | model | StrOutputParser() | _sanitize_output | PythonREPL().run\n```\n\n```python\nchain.invoke({"input": "whats 2 plus 2"})\