# Retrieval

- Many LLM applications require user-specific data outside of the model training set.
- E.g. ask a question on some documents that are private to the company.
- How do we do this? Two simple approaches come to mind:
  - Option 1: Fine-tune the LLM on the new data.
  - Option 2: Pass the internal documents as part of the prompt.
- However, both of these approaches have drawbacks:
  - Option 1
    - Fine-tuning is expensive (both in terms of time and money)
    - What if we want to add more documents later?
    - In general, remember: it is more useful to think of LLMs as **reasoning engines** than to assume they know everything!
  - Option 2
    - Now we are on the right track!
    - But what if the concatenation of all documents is too long for the context length of the LLM?
    - And even if it fits in the context length, it may not be the most efficient way to do it (we pay per token: USD 0.03/1k prompt tokens + USD 0.06/1k sampled tokens).
- So the idea is to apply **retrieval augmented generation** (RAG).
  - Fundamentally, this is the same architecture as classical Information Retrieval (IR) systems.
  - But the Reader is replaced by an LLM: like Option 2, except that the prompt contains only the set of relevant paragraphs.
- Reminder, the fundamental structure of an IR system is:
  - **Indexing**: Create an index of the documents/paragraphs. Historically, this was done with `FULLTEXT INDEX` in MySQL, then with more advanced tools like [Elasticsearch](https://www.elastic.co/), and more recently with **Dense Passage Retrieval**, i.e. vector embeddings produced by deep learning models (Sentence Transformers, GPT embeddings, ...).
  - **Retriever**: Given a new query, retrieve the most relevant documents/paragraphs. As we now use vector embeddings, this is done by computing the cosine similarity between the query embedding and all the precomputed document embeddings (runtime is typically very fast, as approximate nearest-neighbor search with clustering, dimensionality reduction, and other techniques can be used).
  - **Re-Ranker**: The goal of the Retriever is to very quickly fetch a handful of potentially relevant documents from a huge collection: speed is key, not accuracy. The goal of the Re-Ranker is to re-rank those documents very accurately. This can be done e.g. with a Cross-Encoder, which takes the query and a document as input and outputs a relevance score. Why not use the Cross-Encoder directly for retrieval? Because it is too slow: it has to be run once per (query, document) pair. Why is the Bi-Encoder so fast? Because only one embedding is computed per query; all the document embeddings are pre-computed!
  - **Reader**: Given the top-k documents, the Reader reads them and extracts the answer. This means that we have a context = top-k documents, and we can use that as the prompt (with the question) for the Reader. Historically, the Reader was e.g. a BERT model fine-tuned on SQuAD v2. Now, we can use an LLM, and the prompt is the concatenation of the question and the top-k documents.
- Even if it doesn't necessarily involve LLMs (e.g. you can use Sentence Transformers embeddings instead of GPT embeddings), Indexing is key to the success of the application. In particular, fast retrieval is essential when the number of documents grows very large. Hence, vector databases have become very popular!
  - Pinecone
  - Weaviate
  - Milvus
  - Qdrant
  - ...
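
The Retriever step described above can be sketched in a few lines. This is a minimal, library-free illustration, not how a vector database implements it: the 3-dimensional embeddings are made up, the names `cosine_similarity` and `retrieve` are hypothetical, and a real system would use approximate nearest-neighbor search instead of scoring every document.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_embedding, doc_embeddings, k=2):
    """Return the ids of the k documents most similar to the query (brute force)."""
    scored = [
        (cosine_similarity(query_embedding, emb), doc_id)
        for doc_id, emb in doc_embeddings.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Toy embeddings, made up for illustration only.
# In practice these are pre-computed once at indexing time.
doc_embeddings = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.7, 0.7, 0.0],
}

# Only this single embedding has to be computed at query time.
query_embedding = [0.9, 0.1, 0.0]

top_k = retrieve(query_embedding, doc_embeddings, k=2)
print(top_k)
```

A Re-Ranker would then re-score only the returned `top_k` documents with a slower, more accurate model.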

- How does LangChain fit in this picture? It provides implementations for all the main components!
  - **Document Loaders** — Load documents from any source (HTML, PDF, source code, ...) and location (local, S3, websites, ...).
  - **Document Transformers** — Split documents into smaller chunks, e.g. paragraphs, sentences, code blocks, ...
  - **Text Embedding Models** — Create embeddings for the text chunks, e.g. Sentence Transformers, GPT, ...
  - **Vector Stores** — Store the embeddings in a vector database, e.g. Pinecone, Weaviate, Milvus, Qdrant, ...
  - **Retrievers** — Retrieve the most relevant documents given a query, e.g. from Simple Semantic Search to more advanced algorithms like Parent Document Retriever, Self Query Retriever, Ensemble Retriever, ...
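
To make the Document Transformers step more concrete, here is a simplified sketch of splitting text into overlapping chunks. It is an assumption-laden toy: `split_text` is a hypothetical helper that cuts on raw character counts, whereas LangChain's own splitters are more careful (for example, they try to split on separators such as paragraph breaks).

```python
def split_text(text, chunk_size=40, chunk_overlap=10):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share chunk_overlap characters, so that a sentence
    cut at a chunk boundary still appears whole in at least one chunk
    (as long as it is shorter than the overlap).
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


# Synthetic 100-character input, for illustration only.
sample_text = "".join(str(i % 10) for i in range(100))
chunks = split_text(sample_text, chunk_size=40, chunk_overlap=10)
for chunk in chunks:
    print(len(chunk), chunk)
```

Each chunk is then embedded and stored separately, so the chunk size controls how much text the Retriever returns per hit.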

In [1]:
from dotenv import load_dotenv
import os
from pprint import pprint

In [2]:
# 1. Copy .env.example file as .env: `cp .env.example .env`
# 2. Open .env file and set all the env variables
load_dotenv(".env")
OPENAI_KEY = os.getenv("OPENAI_KEY")
assert OPENAI_KEY, "Please set your OPENAI_KEY environment variable."

In [3]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import TextLoader
from langchain.vectorstores import FAISS
from langchain.indexes import VectorstoreIndexCreator
from langchain.embeddings import OpenAIEmbeddings

## Querying Textual Documents with Vector Stores

- This is the most common application.
- We have a question, and we want to find the answer based on a number of textual documents.

In [4]:
# 1. Document Loader
file_path = "data/state_of_the_union_2023.txt"
document_loader = TextLoader(file_path=file_path)

In [5]:
# 2. Embedding Model
embedding_model = OpenAIEmbeddings(openai_api_key=OPENAI_KEY, model="text-embedding-ada-002")

In [6]:
# 3. Index
index = VectorstoreIndexCreator(
    vectorstore_cls=FAISS,
    embedding=embedding_model,
).from_loaders([document_loader])

In [7]:
# 4. Reader Model
chat_model = ChatOpenAI(
    openai_api_key=OPENAI_KEY,
    model_name="gpt-4",
    temperature=0,
    model_kwargs={"top_p":1},
)

In [8]:
response = index.query("Summarize in bullet points what the President said about Putin.", llm=chat_model)
print(response)

- Putin's invasion was a test for America and the world, challenging principles of sovereignty and freedom from tyranny.
- America, along with NATO and a global coalition, stood against Putin's aggression and in support of the Ukrainian people.
- The President spoke about the situation one year after Putin unleashed a brutal war against Ukraine, comparing it to the death and destruction Europe suffered in World War II.
- The President affirmed that if any country, including Russia, threatens America's sovereignty, they will act to protect their country.
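
Conceptually, the query above retrieves the most relevant chunks and passes them, together with the question, to the Reader model. The sketch below only illustrates that prompt assembly: `build_prompt`, the template wording, and the placeholder chunks are assumptions made for illustration, not the prompt LangChain actually sends.

```python
def build_prompt(question, chunks):
    """Concatenate the retrieved chunks and the question into a single prompt."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholders standing in for the top-k chunks returned by the Retriever.
retrieved_chunks = ["<text of chunk 1>", "<text of chunk 2>"]

prompt = build_prompt(
    "Summarize in bullet points what the President said about Putin.",
    retrieved_chunks,
)
print(prompt)
```

Because only the top-k chunks are included, the prompt stays within the context length and costs far fewer tokens than passing the whole document.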


## Querying a Graph with NetworkX

- Sometimes our data is in a graph format, and we want to query it.
- We can use NetworkX to do this.

In [9]:
from langchain.chains import GraphQAChain
from langchain.indexes.graph import NetworkxEntityGraph

file_path = "./data/people_and_jobs_graph.gml"
graph = NetworkxEntityGraph.from_gml(file_path)

chain = GraphQAChain.from_llm(llm=chat_model, graph=graph, verbose=True)

chain.run("Where does Irene Isaac work?")




> Entering new GraphQAChain chain...
Entities Extracted:
Irene Isaac
Full Context:
Irene Isaac works_at Ford
Irene Isaac lives_in GreenVillage

> Finished chain.


'Irene Isaac works at Ford.'

In [10]:
chain.run("Where is GreenVillage located?")



> Entering new GraphQAChain chain...
Entities Extracted:
GreenVillage
Full Context:
GreenVillage is_in Illinois

> Finished chain.


'GreenVillage is located in Illinois.'

In [11]:
chain.run("Where is the company where Irene Isaac works located?")



> Entering new GraphQAChain chain...
Entities Extracted:
Irene Isaac
Full Context:
Irene Isaac works_at Ford
Irene Isaac lives_in GreenVillage

> Finished chain.


"The text doesn't provide information on the location of the company where Irene Isaac works."
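
The last question needs two hops (person, then company, then location), but the `Full Context` printed by the chain only contains the edges directly attached to the extracted entity, so Ford's own edges never reach the LLM. That is most likely why the chain cannot answer. The library-free sketch below contrasts a one-hop and a two-hop lookup. `collect_context` is a hypothetical helper, not part of LangChain, and the triples are only those visible in the outputs above; since Ford's location does not appear in any of them, the example follows the `lives_in` edge instead.

```python
# Triples copied from the chain outputs above.
triples = [
    ("Irene Isaac", "works_at", "Ford"),
    ("Irene Isaac", "lives_in", "GreenVillage"),
    ("GreenVillage", "is_in", "Illinois"),
]


def collect_context(entity, triples, depth=1):
    """Collect the triples reachable from entity by following at most depth edges."""
    context = []
    frontier = [entity]
    seen = {entity}
    for _ in range(depth):
        next_frontier = []
        for subject, relation, obj in triples:
            if subject in frontier:
                context.append(f"{subject} {relation} {obj}")
                if obj not in seen:
                    seen.add(obj)
                    next_frontier.append(obj)
        frontier = next_frontier  # expand from the newly reached nodes only
    return context


one_hop = collect_context("Irene Isaac", triples, depth=1)
two_hop = collect_context("Irene Isaac", triples, depth=2)

print(one_hop)
print(two_hop)
```

With one hop the context says nothing about where GreenVillage is; with two hops the `GreenVillage is_in Illinois` triple is included, which is the kind of extra context a multi-hop question requires.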