<center><a href="https://www.nvidia.cn/training/"><img src="https://dli-lms.s3.amazonaws.com/assets/general/DLI_Header_White.png" width="400" height="186" /></a></center>

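# Building a RAG Chain for the NVIDIA Triton Documentation Website

In this notebook, we demonstrate how to build a RAG pipeline with [NVIDIA AI Endpoints for LangChain](https://python.langchain.com/docs/integrations/text_embedding/nvidia_ai_endpoints). We create a vector store by downloading web pages and using FAISS to index their embeddings. We then show two different chat chains for querying the vector store. The example uses the NVIDIA Triton documentation website, but the code can easily be adapted to other sources.  

### Stage one: load the NVIDIA Triton documentation from the web, chunk the data, and generate embeddings with FAISS

To run this notebook, you need to complete the [setup](https://python.langchain.com/docs/integrations/text_embedding/nvidia_ai_endpoints#setup) and generate an API key.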

In [None]:
import os
from langchain.chains import ConversationalRetrievalChain, LLMChain
from langchain.chains.conversational_retrieval.prompts import CONDENSE_QUESTION_PROMPT, QA_PROMPT
from langchain.chains.question_answering import load_qa_chain

from langchain.memory import ConversationBufferMemory
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter

from langchain_nvidia_ai_endpoints import ChatNVIDIA
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

Run the cell below to provide your API key.

In [None]:
import getpass

if not os.environ.get("NVIDIA_API_KEY", "").startswith("nvapi-"):
    nvapi_key = getpass.getpass("Enter your NVIDIA API key: ")
    assert nvapi_key.startswith("nvapi-"), f"{nvapi_key[:5]}... is not a valid key"
    os.environ["NVIDIA_API_KEY"] = nvapi_key

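The following helper function loads an HTML page and returns its text. We use it later to load the relevant HTML documents from the Triton documentation website before converting them into a vector store.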

In [None]:
import re
from typing import List, Union

import requests
from bs4 import BeautifulSoup

def html_document_loader(url: Union[str, bytes]) -> str:
    """
    Loads the HTML content of a document from a given URL and returns its text content.

    Args:
        url: The URL of the document.

    Returns:
        The plain-text content of the document, or an empty string if the
        request or the parsing fails (errors are printed, not raised).

    """
    try:
        response = requests.get(url, timeout=30)
        # Treat HTTP error status codes (4xx/5xx) as load failures
        response.raise_for_status()
        html_content = response.text
    except Exception as e:
        print(f"Failed to load {url} due to exception {e}")
        return ""

    try:
        # Create a Beautiful Soup object to parse html
        soup = BeautifulSoup(html_content, "html.parser")

        # Remove script and style tags
        for script in soup(["script", "style"]):
            script.extract()

        # Get the plain text from the HTML document
        text = soup.get_text()

        # Remove excess whitespace and newlines
        text = re.sub(r"\s+", " ", text).strip()

        return text
    except Exception as e:
        print(f"Exception {e} while loading document")
        return ""
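The cleanup steps in `html_document_loader` (drop `script` and `style` content, extract the text, collapse whitespace) can be tried offline on an inline HTML string. The following is a minimal sketch that uses the standard library's `html.parser` as a stand-in for BeautifulSoup so that it runs without network access or extra packages. The `TextExtractor` class and the `html_to_text` helper are hypothetical names and are not part of the notebook.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped tag
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Return the visible text with runs of whitespace collapsed to one space."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


sample = """
<html><head><style>p { color: red; }</style></head>
<body><p>Triton   Inference
Server</p><script>console.log("ignored");</script></body></html>
"""
print(html_to_text(sample))
```

Unlike `soup.get_text()`, this sketch joins text nodes with a space, so text from adjacent tags never runs together; the whitespace-collapsing regular expression is the same one the notebook uses.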

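Read the HTML pages and split the text in preparation for generating embeddings.
Note that the chunk_size value must suit the specific model used to generate the embeddings.

Pay close attention to the chunk_size parameter of the TextSplitter. Choosing a suitable chunk size is critical to RAG performance, because much of RAG's success depends on the retrieval step finding the right context for generation. The whole prompt (retrieved chunks + user query) must fit in the LLM's context window, so avoid overly large chunks and balance the chunk size against the expected query size. For example, while OpenAI LLMs have context windows of 8k-32k tokens, Llama3 is limited to 8k tokens. Experiment with different chunk sizes; typical values are between 100 and 600 tokens, depending on the LLM. The splitter in the cell below uses length_function=len, so its chunk_size=1000 is measured in characters, not tokens.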
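To make the chunk-size trade-off concrete, here is a minimal sketch of the budget arithmetic. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers natural separators); `split_fixed` is a simpler fixed-window stand-in, and `fits_context` is a hypothetical helper. The numbers in the usage lines (4 retrieved chunks, a 100-token query, an 8000-token window) are assumptions chosen for illustration; only the 1000-token answer budget mirrors the `max_tokens=1000` used later in the notebook.

```python
def split_fixed(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Cut text into windows of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


def fits_context(chunk_tokens: int, top_k: int, query_tokens: int,
                 answer_tokens: int, context_window: int) -> bool:
    """Check that the retrieved chunks, the query, and the reserved answer fit."""
    return top_k * chunk_tokens + query_tokens + answer_tokens <= context_window


print([len(c) for c in split_fixed("a" * 2500, chunk_size=1000)])
print(fits_context(chunk_tokens=600, top_k=4, query_tokens=100,
                   answer_tokens=1000, context_window=8000))
print(fits_context(chunk_tokens=2000, top_k=4, query_tokens=100,
                   answer_tokens=1000, context_window=8000))
```

The second check fails because four 2000-token chunks alone already fill an 8000-token window, leaving no room for the query or the answer.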

In [None]:
def create_embeddings(embedding_path: str = "./data/nv_embedding"):
    """Download the Triton docs, split them into chunks, and index them at embedding_path."""
    print(f"Storing embeddings to {embedding_path}")

    # List of web pages containing NVIDIA Triton technical documentation
    urls = [
         "https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html",
         "https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/getting_started/quickstart.html",
         "https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_repository.html",
         "https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_analyzer.html",
         "https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/architecture.html",
    ]

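    documents = []
    sources = []
    for url in urls:
        document = html_document_loader(url)
        if not document:
            # html_document_loader returns "" on failure; skip pages that did not load
            continue
        documents.append(document)
        sources.append({"source": url})

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=0,
        length_function=len,
    )
    # Split each page into chunks and record the page URL in each chunk's metadata
    texts = text_splitter.create_documents(documents, metadatas=sources)
    # index_docs does not use its first argument; the last URL only satisfies the signature
    index_docs(url, text_splitter, texts, embedding_path)
    print("Generated embedding successfully")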

Generate the embeddings with NVIDIA AI Endpoints for LangChain and save them to an offline vector store in the ./data/nv_embedding directory for later reuse.

In [None]:
from langchain_core.documents import Document

def index_docs(url: Union[str, bytes], splitter, documents: List[Document], dest_embed_dir: str) -> None:
    """
    Split each document into chunks, embed the chunks, and save them to a FAISS index.

    Args:
        url: Source URL for the documents (currently unused).
        splitter: Text splitter used to split each document.
        documents: List of LangChain Document objects whose embeddings need to be created.
        dest_embed_dir: Destination directory for the FAISS index.

    Returns:
        None
    """
    embeddings = NVIDIAEmbeddings(model="NV-Embed-QA", truncate="END")

    for document in documents:
        texts = splitter.split_text(document.page_content)

        # attach the document's metadata to every chunk produced from it
        metadatas = [document.metadata] * len(texts)

        # create embeddings and add to vector store
        if os.path.exists(dest_embed_dir):
            update = FAISS.load_local(folder_path=dest_embed_dir, embeddings=embeddings, allow_dangerous_deserialization=True)
            update.add_texts(texts, metadatas=metadatas)
            update.save_local(folder_path=dest_embed_dir)
        else:
            docsearch = FAISS.from_texts(texts, embedding=embeddings, metadatas=metadatas)
            docsearch.save_local(folder_path=dest_embed_dir)

In [None]:
create_embeddings()

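### Stage two: load the embeddings from the vector store and build the RAG chain with NVIDIAEmbeddings

Create the embedding model through the NVIDIA Retrieval QA embedding endpoint. This model represents words, phrases, or other entities as numeric vectors and captures the relationships between them. For details, see: https://build.nvidia.com/nvidia/embed-qa-4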

In [None]:
# allow_dangerous_deserialization is a FAISS.load_local option, not an embedding-model option
embedding_model = NVIDIAEmbeddings(model="NV-Embed-QA", truncate="END")
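To illustrate what "capturing relationships" means for vectors, the sketch below ranks a few hand-written toy vectors by cosine similarity to a query vector. The three-dimensional vectors are made up for illustration and are not output from NV-Embed-QA; cosine similarity is one common measure, and the FAISS index in this notebook may use a different distance metric.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_vector: list[float], vectors: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Return (text, similarity) pairs sorted from most to least similar."""
    scored = [(text, cosine_similarity(query_vector, vec)) for text, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors: hand-picked so that the two Triton sentences point in similar directions
toy_vectors = {
    "deploy a model on Triton": [0.9, 0.1, 0.0],
    "serve a model with Triton": [0.8, 0.2, 0.1],
    "bake sourdough bread": [0.0, 0.1, 0.9],
}

for text, score in rank([1.0, 0.0, 0.0], toy_vectors):
    print(f"{score:.3f}  {text}")
```

A retriever applies the same idea at scale: the query is embedded, and the stored chunks whose vectors lie closest to it are returned as context.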

Load the documents from the vector database with FAISS.

In [None]:
# Load the saved FAISS index and expose it as a retriever
embedding_path = "./data/nv_embedding"
docsearch = FAISS.load_local(folder_path=embedding_path, embeddings=embedding_model, allow_dangerous_deserialization=True)
retriever = docsearch.as_retriever()

In [None]:
# This should return documents related to the test query
retriever.invoke("Deploy TensorRT-LLM Engine on Triton Inference Server")

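Create a conversational retrieval chain. In this chain, we demonstrate how to use two LLMs: one to condense the chat history and the follow-up question into a standalone question, and one for the conversation itself. This can improve overall results in more complex scenarios. The cells below instantiate mistralai/mixtral-8x7b-instruct-v0.1 for both roles (`llm` and `chat`); a different model, such as Llama3 70B, can be substituted for the first. The question-generation step uses CONDENSE_QUESTION_PROMPT to produce the query that is sent to the retriever. For details, see: https://python.langchain.com/docs/modules/chains/popular/chat_vector_db#conversationalretrievalchain-with-streaming-to-stdout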

In [None]:
print(f"{CONDENSE_QUESTION_PROMPT = }")
print(f"{QA_PROMPT = }")

In [None]:
llm = ChatNVIDIA(model='mistralai/mixtral-8x7b-instruct-v0.1')
chat = ChatNVIDIA(model="mistralai/mixtral-8x7b-instruct-v0.1", temperature=0.1, max_tokens=1000, top_p=1.0)

retriever = docsearch.as_retriever()

## Requires question and chat_history
qa_chain = (RunnablePassthrough()
    ## {question, chat_history} -> str
    | CONDENSE_QUESTION_PROMPT | llm | StrOutputParser()
    # | RunnablePassthrough(print)
    ## str -> {question, context}
    | {"question": lambda x: x, "context": retriever}
    # | RunnablePassthrough(print)
    ## {question, context} -> str
    | QA_PROMPT | chat | StrOutputParser()
)
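The data flow of the chain above (condense the question, retrieve context, answer) can be traced with plain functions. This is a minimal sketch under stated assumptions: `condense_question`, `retrieve`, `answer`, and `run_chain` are hypothetical stand-ins that call no model and no vector store, and the word-overlap ranking is a deliberately crude substitute for embedding search.

```python
def condense_question(question: str, chat_history: list[str]) -> str:
    """Stand-in for CONDENSE_QUESTION_PROMPT | llm: make the question standalone."""
    if not chat_history:
        return question
    return f"{question} (follow-up to: {chat_history[-1]})"


def retrieve(standalone_question: str, corpus: list[str], k: int = 1) -> list[str]:
    """Stand-in for the retriever: rank documents by the number of shared words."""
    words = set(standalone_question.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: len(words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def answer(question: str, context: list[str]) -> str:
    """Stand-in for QA_PROMPT | chat: return the prompt text instead of a model reply."""
    return "Context:\n" + "\n".join(context) + f"\nQuestion: {question}"


def run_chain(inputs: dict, corpus: list[str]) -> str:
    # {question, chat_history} -> str
    standalone = condense_question(inputs["question"], inputs["chat_history"])
    # str -> {question, context}
    state = {"question": standalone, "context": retrieve(standalone, corpus)}
    # {question, context} -> str
    return answer(state["question"], state["context"])


corpus = ["Triton supports HTTP and gRPC interfaces", "Bread needs flour"]
print(run_chain({"question": "What interfaces does Triton support?", "chat_history": []}, corpus))
```

The key design point is that retrieval runs on the condensed question rather than the raw one, so a short follow-up such as "But why?" can still pull in relevant context.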

Ask any question about Triton.

In [None]:
chat_history = []

query = "What is Triton?"
chat_history += [qa_chain.invoke({"question": query, "chat_history": chat_history})]
chat_history

Ask another question about Triton.

In [None]:
query = "What interfaces does Triton support?"
chat_history += [""]
for token in qa_chain.stream({"question": query, "chat_history": chat_history[:-1]}):
    print(token, end="")
    chat_history[-1] += token
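The streaming cell above appends an empty placeholder to `chat_history`, passes `chat_history[:-1]` to the chain, and then grows the placeholder token by token. The sketch below reproduces that bookkeeping without a model; `fake_stream` is a hypothetical stand-in for `qa_chain.stream`, and the example sentences are made up.

```python
from typing import Iterator


def fake_stream(answer: str) -> Iterator[str]:
    """Hypothetical stand-in for qa_chain.stream: yields the answer word by word."""
    for i, word in enumerate(answer.split(" ")):
        yield word if i == 0 else " " + word


chat_history = ["Triton is an inference server."]

# Reserve a slot for the answer that is about to be streamed
chat_history += [""]
# The slice is a copy that excludes the empty placeholder
history_for_prompt = chat_history[:-1]

for token in fake_stream("It supports HTTP and gRPC."):
    print(token, end="")
    chat_history[-1] += token  # grow the placeholder in place

print()
print(chat_history)
```

Because the slice is taken before streaming starts, the prompt sees only completed turns, while the full answer is available in `chat_history` once the loop ends.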

Finally, demonstrate the chat capability by asking a follow-up question that refers back to the previous query.

In [None]:
query = "But why?"
for token in qa_chain.stream({"question": query, "chat_history": chat_history}):
    print(token, end="")

Next, we show a simpler chain that uses only one LLM, the chat LLM.

In [None]:
chat = ChatNVIDIA(
    model='mistralai/mixtral-8x7b-instruct-v0.1', 
    temperature=0.1, 
    max_tokens=1000, 
    top_p=1.0
)

qa_prompt = ChatPromptTemplate.from_messages([
    ("user", 
        "Use the following pieces of context to answer the question at the end."
        " If you don't know the answer, just say that you don't know, don't try to make up an answer."
        "\n\nHISTORY: {history}\n\n{context}\n\nQuestion: {question}\nHelpful Answer:"
    )
])

## Requires question and history
qa_chain = (
    RunnablePassthrough.assign(context = (lambda state: state.get("question")) | retriever)
    # | RunnablePassthrough(print)
    | qa_prompt | chat | StrOutputParser()
)
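To see what the single-LLM prompt looks like once it is filled in, the sketch below formats the same template text with plain `str.format`. This is an illustration only: `render_prompt` is a hypothetical helper, the context strings are made up, and joining the history and context into readable text is a choice made here, not something the notebook's chain does.

```python
TEMPLATE = (
    "Use the following pieces of context to answer the question at the end."
    " If you don't know the answer, just say that you don't know, don't try to make up an answer."
    "\n\nHISTORY: {history}\n\n{context}\n\nQuestion: {question}\nHelpful Answer:"
)


def render_prompt(question: str, history: list[str], context: list[str]) -> str:
    """Join history and context into readable text before filling the template."""
    return TEMPLATE.format(
        history="\n".join(history) if history else "(none)",
        context="\n\n".join(context),
        question=question,
    )


print(render_prompt(
    question="What is Triton?",
    history=[],
    context=["Triton Inference Server is open-source inference serving software."],
))
```

Printing the rendered prompt is a quick way to check how much of the context window the history and the retrieved chunks consume before a request is sent.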

Now try asking a question about Triton with the simpler chain, and compare the answer with the result from the more complex chain above.

In [None]:
chat_history = []

query = "What is Triton?"
chat_history += [qa_chain.invoke({"question": query, "history": chat_history})]
chat_history

Ask another question about Triton.

In [None]:
query = "Does Triton support ONNX?"
chat_history += [""]
for token in qa_chain.stream({"question": query, "history": chat_history[:-1]}):
    print(token, end="")
    chat_history[-1] += token

Finally, demonstrate the chat capability by asking a follow-up question that refers back to the previous query.

In [None]:
query = "How come?"
for token in qa_chain.stream({"question": query, "history": chat_history}):
    print(token, end="")