# Build RAG pipelines with txtai

Large Language Models (LLMs) dominated the AI and machine learning space in 2023. The results have been impressive and they have captured the public imagination.

While LLMs have been impressive, they are not problem-free. The biggest challenge is hallucinations. A hallucination is when an LLM generates output that is factually incorrect. The alarming part is that, at a cursory glance, the output sounds like good content. The default behavior of LLMs is to produce plausible-sounding answers even when no good answer exists. LLMs are not great at saying "I don't know".

Retrieval augmented generation (RAG) helps reduce the risk of hallucinations by limiting the context in which an LLM can generate answers. This is typically done with a vector search query that hydrates a prompt with a relevant context. RAG is one of the most practical and production-ready use cases for *Generative AI*. It's so popular now that some are building entire companies around it.
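To make the "search, then hydrate a prompt" idea concrete, here is a minimal sketch in plain Python. It does not use txtai: a toy bag-of-words cosine similarity stands in for real vector search, and the `retrieve` and `hydrate` helpers and the sample documents are assumptions made up for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase word tokens; a toy stand-in for an embeddings model
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    # Cosine similarity between two term-count vectors
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, documents, limit=1):
    # Rank documents by similarity to the question and keep the best ones
    query = Counter(tokenize(question))
    ranked = sorted(documents, key=lambda d: cosine(query, Counter(tokenize(d))), reverse=True)
    return ranked[:limit]

def hydrate(question, documents, limit=1):
    # Insert only the retrieved context into the prompt
    context = "\n".join(retrieve(question, documents, limit))
    return (
        "Answer the question using only the context below.\n\n"
        f"question: {question}\n"
        f"context: {context}"
    )

# Toy documents for illustration only
documents = [
    "txtai is an all-in-one embeddings database.",
    "Bananas are rich in potassium.",
    "RAG limits the context an LLM uses to generate answers.",
]

prompt = hydrate("What is txtai?", documents)
print(prompt)
```

The LLM only ever sees the passages that `retrieve` selects, which is what narrows the room for made-up answers.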

[txtai](https://github.com/neuml/txtai) has long had question-answering pipelines, which employ the same process of retrieving a relevant context. LLMs are now the preferred approach for analyzing that context and RAG pipelines are one of the main features of txtai. One of the other main features of txtai is that it's a vector database! You can build your prompts and limit your context all with one library. Hence the phrase *all-in-one embeddings database*.

This notebook shows how to build RAG pipelines with txtai.

# Install dependencies

Install `txtai` and all dependencies. Since this notebook is using optional pipelines, we need to install the pipeline extras package.

In [None]:
# %%capture
# !pip install 'git+https://github.com/neuml/txtai#egg=txtai[pipeline]' autoawq==0.1.5

# # Get test data
# !wget -N https://github.com/neuml/txtai/releases/download/v6.2.0/tests.tar.gz
# !tar -xvzf tests.tar.gz

# # Install NLTK
# import nltk
# nltk.download('punkt')

# Start with the basics

Let's jump right in and start with a simple LLM pipeline. The [LLM pipeline](https://neuml.github.io/txtai/pipeline/text/llm/) supports loading local LLMs from the [Hugging Face Hub](https://huggingface.co/models).

For those using LLM API services (OpenAI, Cohere, etc), this call can easily be replaced with an API call.
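One way to keep that swap cheap is to treat the model as any callable that maps a prompt to an answer. The sketch below is an assumption about how that could be structured, not txtai's API: `execute_with` and `fake_api` are hypothetical names, and `fake_api` is a stub where a real API client call would go.

```python
def execute_with(generate, question, text):
    # Build the prompt, then hand it to any callable that maps prompt -> answer
    prompt = (
        "Answer the following question using only the context below.\n\n"
        f"question: {question}\n"
        f"context: {text}"
    )
    return generate(prompt)

# Hypothetical stand-in for a hosted API client; replace with a real client call
def fake_api(prompt):
    return f"[stub answer for a prompt of {len(prompt)} characters]"

print(execute_with(fake_api, "What is txtai?", "txtai is an all-in-one embeddings database."))
```

Because a local pipeline object is also called with a prompt, the rest of the code does not need to know which backend is in use.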

In [None]:
%%capture

from txtai.embeddings import Embeddings
from txtai.pipeline import Extractor
from txtai.pipeline import Textractor


# Create textractor model (text extraction is backed by Apache Tika)
textractor = Textractor(sentences=True)

# Create embeddings model, backed by sentence-transformers & transformers
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2"})

# Create extractor instance with an extractive QA model
# Note: llm, textractor and embeddings are all redefined in later cells
llm = Extractor(embeddings, "distilbert-base-cased-distilled-squad")

In [None]:
from txtai.pipeline import LLM

# Create LLM
llm = LLM("microsoft/phi-2")


Next, we'll load a document to query. The [Textractor pipeline](https://neuml.github.io/txtai/pipeline/data/textractor/) has support for extracting text from common document formats (docx, pdf, xlsx).

In [None]:
from txtai.pipeline import Textractor

# Create Textractor
textractor = Textractor()
text = textractor("txtai/生产安全事故应急预案 1.pdf")
print(text)

Now we'll define a simple LLM pipeline. It takes a question and context (which in this case is the whole file), creates a prompt and runs it with the LLM.

In [None]:
def execute(question, text):
  prompt = f"""<|im_start|>system
  You are a friendly assistant. You answer questions from users.<|im_end|>
  <|im_start|>user
  Answer the following question using only the context below. Only include information specifically discussed.

  question: {question}
  context: {text} <|im_end|>
  <|im_start|>assistant
  """

  return llm(prompt, maxlength=4096, pad_token_id=32000)

execute("Tell me about txtai in one sentence", text)

In [None]:
execute("What model does txtai recommend for transcription?", text)

In [None]:
execute("I don't know anything about txtai, what would be the best thing to read?", text)

If this is the first time you've seen *Generative AI*, then these answers are 🤯. Even if you've been in the space a while, it's still amazing how much a language model can understand and how high the quality of its answers is.

While this use case is fun, let's try to scale it to a larger set of documents.

# Build a RAG pipeline with vector search

Let's say we have a large number of documents: hundreds or even thousands. We can't just put all of those documents into a single prompt; we'd run out of GPU memory fast!

This is where retrieval augmented generation enters the picture. We can use a query step that finds the best candidates to add to the prompt.

Typically, this candidate query uses vector search but it can be anything that runs a search and returns results. In fact, many complex production systems have customized retrieval pipelines that feed a context into LLM prompts.

The first step in building our RAG pipeline is creating the knowledge store. In this case, it's a vector database of file content. The files will be split into paragraphs with each paragraph stored as a separate row.
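As a rough picture of "one paragraph per row", the sketch below splits text on blank lines and yields `(id, text)` rows. It is plain Python rather than the Textractor pipeline, and the `paragraphs` and `rows` helpers, the file names and the explicit ids are assumptions for illustration.

```python
import re

def paragraphs(text):
    # Split on blank lines and drop empty chunks
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def rows(files):
    # Yield (id, text) rows, one per paragraph, across all files
    uid = 0
    for name, text in files.items():
        for paragraph in paragraphs(text):
            yield uid, paragraph
            uid += 1

# Toy file contents for illustration only
files = {
    "a.txt": "First paragraph.\n\nSecond paragraph.",
    "b.txt": "Third paragraph.\n\n\n",
}

for row in rows(files):
    print(row)
```

Storing paragraphs rather than whole files keeps each row small enough that a handful of search results can fit into a single prompt.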

In [None]:
import os

from txtai import Embeddings

def stream(path):
  for f in sorted(os.listdir(path)):
    fpath = os.path.join(path, f)

    # Only accept documents
    if f.endswith(("docx", "xlsx", "pdf")):
      print(f"Indexing {fpath}")
      for paragraph in textractor(fpath):
        yield paragraph

# Document text extraction, split into paragraphs
textractor = Textractor(paragraphs=True)

# Vector Database
embeddings = Embeddings(content=True)
embeddings.index(stream("txtai"))

The next step is defining the RAG pipeline. This pipeline takes the input question, runs a vector search and builds a context using the search results. The context is then inserted into a prompt template and run with the LLM.

In [None]:
def context(question):
  return "\n".join(x["text"] for x in embeddings.search(question))

def rag(question):
  return execute(question, context(question))

rag("What model does txtai recommend for image captioning?")

In [None]:
result = rag("When was the BLIP model added for image captioning?")
print(result)

As we can see, the result is similar to what we had before without vector search. The difference is that we only used a relevant portion of the documents to generate the answer.

As we discussed before, this is important when dealing with large volumes of data. Not all of the data can be added to an LLM prompt. Additionally, having only the most relevant context helps the LLM generate higher quality answers.
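One simple way to enforce that limit is to add ranked passages until a size budget is reached. The sketch below is an assumption, not something txtai does for you: `build_context` is a hypothetical helper and the budget is counted in characters for simplicity, whereas a real system would more likely count tokens.

```python
def build_context(results, budget=200):
    # Join ranked passages until adding another would exceed the character budget
    selected, used = [], 0
    for passage in results:
        # Account for the newline separator between passages
        cost = len(passage) + (1 if selected else 0)
        if used + cost > budget:
            break
        selected.append(passage)
        used += cost
    return "\n".join(selected)

# Toy passages, already sorted from most to least relevant
ranked = ["a" * 120, "b" * 70, "c" * 50]
context = build_context(ranked, budget=200)
print(len(context))
```

Since the passages arrive in ranked order, whatever gets cut is the least relevant material.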

# Citations for LLMs

A healthy level of skepticism should be applied to answers generated by AI. We're far from the day where we can blindly trust answers from an AI model.

txtai has a couple of approaches for generating citations. The basic approach is to take the answer and search the vector database for the closest match.

In [None]:
for x in embeddings.search(result):
  print(x["text"])
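The basic approach above amounts to a nearest-match lookup: score every stored row against the answer and keep the best one. The sketch below illustrates that idea without txtai, using Jaccard token overlap as a toy stand-in for vector similarity. The `cite` helper and the sample rows are assumptions for illustration.

```python
import re

def tokens(text):
    # Lowercase word tokens as a set
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    # Size of the intersection divided by size of the union
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cite(answer, rows):
    # Return the (id, text) row whose tokens overlap most with the answer
    return max(rows, key=lambda row: jaccard(tokens(answer), tokens(row[1])))

# Toy rows for illustration only
rows = [
    (0, "Paragraphs are stored as separate rows."),
    (1, "The prompt is built from search results."),
]

answer = "Each paragraph is stored as a separate row."
print(cite(answer, rows))
```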

While the basic approach above works in this case, txtai has a more robust pipeline to handle citations and references.

The Extractor pipeline is defined below. An Extractor pipeline works in the same way as an LLM + vector search pipeline, except it has special logic for generating citations. This pipeline takes each answer and compares it to the context passed to the LLM to determine the most likely reference.

In [None]:
from txtai.pipeline import Extractor

# Extractor prompt
def prompt(question):
  return [{
    "query": question,
    "question": f"""
Answer the following question using only the context below. Only include information specifically discussed.

question: {question}
context:
"""
}]

# Create LLM with system prompt template
llm = LLM("TheBloke/Mistral-7B-OpenOrca-AWQ", template="""<|im_start|>system
You are a friendly assistant. You answer questions from users.<|im_end|>
<|im_start|>user
{text} <|im_end|>
<|im_start|>assistant
""")

# Create extractor instance
extractor = Extractor(embeddings, llm, output="reference")

In [None]:
result = extractor(prompt("What version of Python is supported?"), maxlength=4096, pad_token_id=32000)[0]
print("ANSWER:", result["answer"])
print("CITATION:", embeddings.search("select id, text from txtai where id = :id", limit=1, parameters={"id": result["reference"]}))

As we can see, the extractor pipeline returns not only the answer to the question but also a citation. This step is crucial in any line of work where answers must be verified (which is most lines of work).
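The citation lookup above passes the reference id as a named `:id` parameter instead of formatting it into the SQL string. As a loose analogy only, and not a claim about how txtai runs the query internally, the same pattern looks like this with Python's built-in `sqlite3` module. The `sections` table and `lookup` helper are hypothetical.

```python
import sqlite3

# In-memory database with a hypothetical table of indexed paragraphs
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE sections (id INTEGER PRIMARY KEY, text TEXT)")
connection.executemany(
    "INSERT INTO sections (id, text) VALUES (?, ?)",
    [(0, "First paragraph."), (1, "Second paragraph.")],
)

def lookup(reference):
    # Bind the reference id as a named parameter rather than building the SQL by hand
    cursor = connection.execute(
        "SELECT id, text FROM sections WHERE id = :id", {"id": reference}
    )
    return cursor.fetchone()

print(lookup(1))
```

Binding the id as a parameter keeps the statement fixed and avoids quoting problems when ids are strings.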

# Wrapping up

This notebook introduced retrieval augmented generation (RAG), explained why we need it and showed the options available for running RAG pipelines with txtai.

The advantages of building RAG pipelines with txtai are:

- **All-in-one database** - one library can handle LLM inference and vector search retrieval
- **Generating citations** - generating answers is useful but referencing where those answers came from is crucial in gaining the trust of users
- **Simple yet powerful** - building pipelines can be done in a small amount of Python. Options are available to build pipelines in YAML and/or run through the API