# 4 - Retrievers

<img src="https://raw.githubusercontent.com/Hack-io-AI/ai_images/main/langchain.jpeg" style="width:400px;"/>

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#1---Recuperadores" data-toc-modified-id="1---Recuperadores-1">1 - Retrievers</a></span><ul class="toc-item"><li><span><a href="#1.1---Chroma-Retriever" data-toc-modified-id="1.1---Chroma-Retriever-1.1">1.1 - Chroma Retriever</a></span></li><li><span><a href="#1.2---MultiQueryRetriever" data-toc-modified-id="1.2---MultiQueryRetriever-1.2">1.2 - MultiQueryRetriever</a></span></li><li><span><a href="#1.3---ContextualCompressionRetriever" data-toc-modified-id="1.3---ContextualCompressionRetriever-1.3">1.3 - ContextualCompressionRetriever</a></span></li><li><span><a href="#1.4---EnsembleRetriever" data-toc-modified-id="1.4---EnsembleRetriever-1.4">1.4 - EnsembleRetriever</a></span></li></ul></li></ul></div>

## 1 - Retrievers

Retrievers in LangChain are interfaces that return documents in response to an unstructured query. They are more general than vector databases because they only need to retrieve documents, not store them. A vector database can serve as the backbone of a retriever, but other kinds of retrievers exist too. Let's see how to build a retriever with the tools we have covered so far.
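
As a minimal sketch of that idea, with no LangChain dependency: a retriever is anything that takes a query string and returns a list of documents. The `Document` dataclass and the `KeywordRetriever` class below are hypothetical stand-ins written for illustration. They only mimic the shape of the interface (an `invoke` method returning objects with a `page_content` attribute), not LangChain's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for a document: text plus optional metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class KeywordRetriever:
    """Toy retriever: ranks texts by how many distinct query words they contain."""

    def __init__(self, texts, k=2):
        self.docs = [Document(t) for t in texts]
        self.k = k

    def invoke(self, query):
        words = set(query.lower().split())
        # score = number of distinct query words present in the document
        scored = [(len(words & set(d.page_content.lower().split())), d) for d in self.docs]
        # keep documents with at least one match, best score first
        scored = [s for s in scored if s[0] > 0]
        scored.sort(key=lambda s: s[0], reverse=True)
        return [d for _, d in scored[: self.k]]


toy = KeywordRetriever([
    'the president nominated a judge',
    'new technology at the border',
    'the judge is a consensus builder',
])
toy_docs = toy.invoke('who is the judge')
```

Nothing here stores embeddings or talks to a database, which is the point: the retriever contract is only "query in, documents out".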

### 1.1 - Chroma Retriever

In [1]:
# first, load the OpenAI API key

from dotenv import load_dotenv
import os

# load environment variables from the .env file
load_dotenv()

# OpenAI API key; OPENAI_API_KEY is the variable name LangChain looks for by default
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')

In [2]:
# libraries

from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

In [3]:
# read a text file

with open('../../../files/state_of_the_union.txt', 'r', encoding='utf-8') as f:
    texto = f.read()

In [4]:
# create the splitter

splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

In [5]:
# split the text into chunks

trozos = splitter.split_text(texto)

In [6]:
len(trozos)

42
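
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch that slides a fixed-width window over the characters of a string. It is not how `CharacterTextSplitter` works internally: that class first splits on a separator (a blank line by default) and then merges the pieces up to `chunk_size`, so its chunks will differ. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters; consecutive
    windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError('chunk_overlap must be >= 0 and smaller than chunk_size')
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


toy_chunks = split_with_overlap('abcdefghij', chunk_size=4, chunk_overlap=1)
# toy_chunks == ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, so a sentence cut at a chunk boundary still appears with some context in the next chunk.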

In [7]:
# embedding model

modelo = OpenAIEmbeddings()

In [8]:
# create the vector database, reusing the embedding model defined above

db = Chroma.from_texts(texts=trozos,
                       embedding=modelo,
                       persist_directory='./db',
                       collection_name='estado')

In [9]:
# define the retriever

recuperador = db.as_retriever()

In [10]:
# user query

consulta = 'What did the president say about Ketanji Brown Jackson?'

In [11]:
# retrieve documents

docs = recuperador.invoke(consulta)

In [12]:
# 4 documents are returned by default

len(docs)

4

In [13]:
# content of the first document, the one most similar to our query

docs[0].page_content

'One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. \n\nA former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. \n\nAnd if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. \n\nWe can do both. At our border, we’ve installed new technology like cutting-edge scanners to better detect drug smuggling.  \n\nWe’ve set up joint patrols with Mexico and Guatemala to catch more human traffickers.'
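
Conceptually, a vector store retriever embeds the query and returns the `k` stored chunks whose vectors are closest to it. The sketch below illustrates that ranking step with hand-made 2-D vectors instead of real embeddings and with cosine similarity as the metric. Both are assumptions made for illustration: the metric and index that Chroma actually uses depend on how the collection is configured, and the names `cosine`, `top_k` and `toy_store` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(query_vec, store, k=4):
    """Return the k texts whose vectors are most similar to query_vec."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# made-up 2-D "embeddings", for illustration only
toy_store = [
    ('supreme court nomination', [0.9, 0.1]),
    ('border technology', [0.1, 0.9]),
    ('judicial legacy', [0.8, 0.3]),
]
toy_hits = top_k([1.0, 0.0], toy_store, k=2)
# toy_hits == ['supreme court nomination', 'judicial legacy']
```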

### 1.2 - MultiQueryRetriever

We can also use the `MultiQueryRetriever`, which automates prompt tuning: it uses an LLM to generate several rephrasings of the user's query, runs each one through the base retriever, and returns the unique union of the results. The goal is to make retrieval less sensitive to the exact wording of the query.

In [14]:
from langchain_openai import ChatOpenAI
from langchain.retrievers.multi_query import MultiQueryRetriever

In [15]:
llm = ChatOpenAI(temperature=0)

In [16]:
recuperador = MultiQueryRetriever.from_llm(retriever=db.as_retriever(),
                                           llm=llm)

In [17]:
consulta2 = 'What are the approaches to Task Decomposition?'

In [18]:
docs = recuperador.invoke(consulta2)

In [19]:
len(docs)

6
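
The document count is no longer fixed at 4 because the results of several queries are merged and each document is kept only once. A minimal sketch of that merge step follows. The query variants and their results are hard-coded stand-ins for what the LLM and the base retriever would produce, and `unique_union`, `multi_query_retrieve` and `toy_index` are hypothetical names.

```python
def unique_union(result_lists):
    """Merge several result lists, keeping the first occurrence of each document."""
    seen = set()
    merged = []
    for results in result_lists:
        for doc in results:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


def multi_query_retrieve(variants, search):
    """Run every query variant through `search` and merge the results."""
    return unique_union([search(q) for q in variants])


# hypothetical retrieval results for three rephrasings of one question
toy_index = {
    'q1': ['doc A', 'doc B'],
    'q2': ['doc B', 'doc C'],
    'q3': ['doc A', 'doc D'],
}
toy_merged = multi_query_retrieve(['q1', 'q2', 'q3'], toy_index.get)
# toy_merged == ['doc A', 'doc B', 'doc C', 'doc D']
```

Three queries returning two documents each yield four unique documents here, not six, because two of them were retrieved twice.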

### 1.3 - ContextualCompressionRetriever

Contextual compression in LangChain, provided by `ContextualCompressionRetriever`, compresses the retrieved documents using the context of the query so that only the relevant information is returned. This involves both shortening the content of each document and filtering out the less relevant documents.

In [20]:
from langchain_openai import OpenAI
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

In [21]:
llm = OpenAI(temperature=0)

In [22]:
compresor = LLMChainExtractor.from_llm(llm)

In [23]:
recuperador = ContextualCompressionRetriever(base_compressor=compresor, 
                                             base_retriever=db.as_retriever())

In [24]:
docs = recuperador.invoke(consulta)

In [25]:
len(docs)

1

In [26]:
docs[0].page_content

'I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson.'
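
The two effects seen above, a shorter document and fewer documents, can be sketched without a model. `LLMChainExtractor` asks an LLM to pull out the relevant passages. The hypothetical `compress` function below replaces that step with a plain keyword match on sentences, which is an assumption made only so the example runs offline.

```python
import re


def compress(documents, keywords):
    """Keep only sentences mentioning a keyword; drop documents left empty."""
    keywords = [k.lower() for k in keywords]
    compressed = []
    for text in documents:
        # naive sentence split: ., ! or ? followed by whitespace
        sentences = re.split(r'(?<=[.!?])\s+', text.strip())
        kept = [s for s in sentences if any(k in s.lower() for k in keywords)]
        if kept:
            compressed.append(' '.join(kept))
    return compressed


toy_texts = [
    'I nominated Judge Jackson. We must also fix the immigration system.',
    'We set up joint patrols. We installed new scanners.',
]
toy_compressed = compress(toy_texts, ['jackson'])
# toy_compressed == ['I nominated Judge Jackson.']
```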

### 1.4 - EnsembleRetriever

The `EnsembleRetriever` combines different retrieval algorithms to achieve better performance. Let's look at an example that combines a keyword-based `BM25` retriever with the embedding-based `Chroma` retriever. To use BM25, first run the following command:

```bash
pip install rank_bm25
```

In [27]:
from langchain.retrievers import BM25Retriever, EnsembleRetriever

In [28]:
# BM25 retriever

recuperador_bm25 = BM25Retriever.from_texts(trozos)

In [29]:
# ensemble of both retrievers with equal weights

recuperador_ensamblado = EnsembleRetriever(retrievers=[recuperador_bm25, db.as_retriever()], 
                                           weights=[0.5, 0.5])

In [30]:
docs = recuperador_ensamblado.invoke(consulta)

In [31]:
docs[0].page_content

'One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. \n\nA former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. \n\nAnd if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. \n\nWe can do both. At our border, we’ve installed new technology like cutting-edge scanners to better detect drug smuggling.  \n\nWe’ve set up joint patrols with Mexico and Guatemala to catch more human traffickers.'
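
One common way to merge ranked lists from different retrievers is weighted Reciprocal Rank Fusion: every document receives `weight / (rank + c)` from each list it appears in, and the contributions are summed. To the best of my understanding this is the scheme `EnsembleRetriever` applies, with `c = 60` as the default constant, but treat both points as assumptions and check the library source. The function `weighted_rrf` and the toy rankings below are hypothetical.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists: each document scores weight / (rank + c) per list."""
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += weight / (rank + c)
    # highest fused score first
    return sorted(scores, key=scores.get, reverse=True)


toy_bm25 = ['doc A', 'doc B', 'doc C']      # hypothetical keyword ranking
toy_vector = ['doc B', 'doc D', 'doc A']    # hypothetical embedding ranking
toy_fused = weighted_rrf([toy_bm25, toy_vector], weights=[0.5, 0.5])
# toy_fused == ['doc B', 'doc A', 'doc D', 'doc C']
```

`doc B` comes first because it ranks high in both lists, while documents found by only one retriever fall to the bottom. Changing `weights` shifts the balance toward one retriever.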