[INST] Context: {context} Question: {question} Only return the helpful answer below and nothing else. Helpful answer:[/INST]\"\n",
+ ")\n",
+ "\n",
+ "LLAMA_PROMPT = PromptTemplate.from_template(LLAMA_PROMPT_TEMPLATE)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3310462b-f215-4d00-9d59-e613921bed0a",
+ "metadata": {},
+ "source": [
+ "### Step 3: Load Documents [*(Retrieval)*](https://python.langchain.com/docs/modules/data_connection/)\n",
+ "LangChain provides a variety of [document loaders](https://python.langchain.com/docs/integrations/document_loaders) that load various types of documents (HTML, PDF, code) from many different sources and locations (private s3 buckets, public websites).\n",
+ "\n",
+ "Document loaders load data from a source as **Documents**. A **Document** is a piece of text (the page_content) and associated metadata. Document loaders provide a ``load`` method for loading data as documents from a configured source. \n",
+ "\n",
+ "In this example, we use a LangChain [`UnstructuredFileLoader`](https://python.langchain.com/docs/integrations/document_loaders/unstructured_file) to load a research paper about Llama2 from Meta.\n",
+ "\n",
+ "[Here](https://python.langchain.com/docs/integrations/document_loaders) are some of the other document loaders available from LangChain."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "70c92132-4c34-44fc-af28-6aa0769b006c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "! wget -O \"llama2_paper.pdf\" -nc --user-agent=\"Mozilla\" https://arxiv.org/pdf/2307.09288.pdf"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b4382b61",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from langchain.document_loaders import UnstructuredFileLoader\n",
+ "loader = UnstructuredFileLoader(\"llama2_paper.pdf\")\n",
+ "data = loader.load()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4e0449e4",
+ "metadata": {},
+ "source": [
+ "### Step 4: Transform Documents [*(Retrieval)*](https://python.langchain.com/docs/modules/data_connection/)\n",
+ "Once documents have been loaded, they are often transformed. One method of transformation is known as **chunking**, which breaks down large pieces of text, for example, a long document, into smaller segments. This technique is valuable because it helps [optimize the relevance of the content returned from the vector database](https://www.pinecone.io/learn/chunking-strategies/). \n",
+ "\n",
+ "LangChain provides a [variety of document transformers](https://python.langchain.com/docs/integrations/document_transformers/), such as text splitters. In this example, we use a [``SentenceTransformersTokenTextSplitter``](https://api.python.langchain.com/en/latest/text_splitter/langchain.text_splitter.SentenceTransformersTokenTextSplitter.html#langchain.text_splitter.SentenceTransformersTokenTextSplitter). The ``SentenceTransformersTokenTextSplitter`` is a specialized text splitter for use with the sentence-transformer models. The default behaviour is to split the text into chunks that fit the token window of the sentence transformer model that you would like to use. This sentence transformer model is used to generate the embeddings from documents. \n",
+ "\n",
+ "There are some nuanced complexities to text splitting since semantically related text, in theory, should be kept together. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "21ec0438",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import time\n",
+ "from langchain.text_splitter import SentenceTransformersTokenTextSplitter\n",
+ "TEXT_SPLITTER_MODEL = \"intfloat/e5-large-v2\"\n",
+ "TEXT_SPLITTER_CHUNCK_SIZE = 510\n",
+ "TEXT_SPLITTER_CHUNCK_OVERLAP = 200\n",
+ "\n",
+ "text_splitter = SentenceTransformersTokenTextSplitter(\n",
+ " model_name=TEXT_SPLITTER_MODEL,\n",
+ " chunk_size=TEXT_SPLITTER_CHUNCK_SIZE,\n",
+ " chunk_overlap=TEXT_SPLITTER_CHUNCK_OVERLAP,\n",
+ ")\n",
+ "start_time = time.time()\n",
+ "documents = text_splitter.split_documents(data)\n",
+ "print(f\"--- {time.time() - start_time} seconds ---\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "183aaeeb-7461-4f58-9fc4-2a51fa723714",
+ "metadata": {},
+ "source": [
+ "Let's view a sample of content that is chunked together in the documents."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "46525e4e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "documents[40].page_content"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "3f580c54",
+ "metadata": {},
+ "source": [
+ "### Step 5: Generate Embeddings and Store Embeddings in the Vector Store [*(Retrieval)*](https://python.langchain.com/docs/modules/data_connection/)\n",
+ "\n",
+ "#### a) Generate Embeddings\n",
+ "[Embeddings](https://python.langchain.com/docs/modules/data_connection/text_embedding/) for documents are created by vectorizing the document text; this vectorization captures the semantic meaning of the text. This allows you to quickly and efficiently find other pieces of text that are similar. The embedding model used below is [intfloat/e5-large-v2](https://huggingface.co/intfloat/e5-large-v2).\n",
+ "\n",
+ "LangChain provides a wide variety of [embedding models](https://python.langchain.com/docs/integrations/text_embedding) from many providers and makes it simple to swap out the models. \n",
+ "\n",
+ "When a user sends in their query, the query is also embedded using the same embedding model that was used to embed the documents. As explained earlier, this allows to find similar (relevant) documents to the user's query. \n",
+ "\n",
+ "#### b) Store Document Embeddings in the Vector Store\n",
+ "Once the document embeddings are generated, they are stored in a vector store so that at query time we can:\n",
+ "1) Embed the user query and\n",
+ "2) Retrieve the embedding vectors that are most similar to the embedding query.\n",
+ "\n",
+ "A vector store takes care of storing the embedded data and performing a vector search.\n",
+ "\n",
+ "LangChain provides support for a [great selection of vector stores](https://python.langchain.com/docs/integrations/vectorstores/). \n",
+ "\n",
+ "\n",
+ " \n",
+ "⚠️ For this workflow, [Milvus](https://milvus.io/) vector database is running as a microservice. \n",
+ "\n",
+ "
"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9bd8b943",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from langchain.embeddings import HuggingFaceEmbeddings\n",
+ "from langchain.vectorstores import Milvus\n",
+ "import torch\n",
+ "import time\n",
+ "\n",
+ "#Running the model on CPU as we want to conserve gpu memory. \n",
+ "#In the production deployment (API server shown as part of the 5th notebook we run the model on GPU)\n",
+ "model_name = \"intfloat/e5-large-v2\"\n",
+ "model_kwargs = {\"device\": \"cpu\"}\n",
+ "encode_kwargs = {\"normalize_embeddings\": False}\n",
+ "hf_embeddings = HuggingFaceEmbeddings(\n",
+ " model_name=model_name,\n",
+ " model_kwargs=model_kwargs,\n",
+ " encode_kwargs=encode_kwargs,\n",
+ ")\n",
+ "start_time = time.time()\n",
+ "vectorstore = Milvus.from_documents(documents=documents, embedding=hf_embeddings, connection_args={\"host\": \"milvus\", \"port\": \"19530\"})\n",
+ "print(f\"--- {time.time() - start_time} seconds ---\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f7fa622f",
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [],
+ "source": [
+ "# Simple Example: Retrieve Documents from the Vector Database\n",
+ "# note: this is just for demonstration purposes of a similarity search \n",
+ "question = \"Can you talk about safety evaluation of llama2 chat?\"\n",
+ "docs = vectorstore.similarity_search(question)\n",
+ "print(docs[2].page_content)"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "f6960255",
+ "metadata": {},
+ "source": [
+ " > ### Simple Example: Retrieve Documents from the Vector Database [*(Retrieval)*](https://python.langchain.com/docs/modules/data_connection/)\n",
+ ">Given a user query, relevant splits for the question are returned through a **similarity search**. This is also known as a semantic search, and it is done with meaning. It is different from a lexical search, where the search engine looks for literal matches of the query words or variants of them, without understanding the overall meaning of the query. A semantic search tends to generate more relevant results than a lexical search.\n",
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9c8148dc",
+ "metadata": {},
+ "source": [
+ "### Step 6: Compose a streamed answer using a Chain\n",
+ "We have already integrated the Llama2 TRT LLM into LangChain with a custom wrapper, loaded and transformed documents, and generated and stored document embeddings in a vector database. To finish the pipeline, we need to add a few more LangChain components and combine all the components together with a [chain](https://python.langchain.com/docs/modules/chains/).\n",
+ "\n",
+ "A [LangChain chain](https://python.langchain.com/docs/modules/chains/) combines components together. In this case, we use a [RetrievalQA chain](https://js.langchain.com/docs/modules/chains/popular/vector_db_qa/), which is a chain type for question-answering against a vector index. It combines a *Retriever* and a *question answering (QA) chain*.\n",
+ "\n",
+ "We pass it 3 of our LangChain components:\n",
+ "- Our instance of the LLM (from step 1).\n",
+ "- A [retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/), which is an interface that returns documents given an unstructured query. In this case, we use our vector store as the retriever.\n",
+ "- Our prompt template constructed from the prompt format for Llama2 (from step 2)\n",
+ "\n",
+ "```\n",
+ "qa_chain = RetrievalQA.from_chain_type(\n",
+ " llm,\n",
+ " retriever=vectorstore.as_retriever(),\n",
+ " chain_type_kwargs={\"prompt\": LLAMA_PROMPT}\n",
+ ")\n",
+ "```\n",
+ "\n",
+ "Lastly, we pass a user query to the chain and stream the result. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "69de32a0",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from langchain.chains import RetrievalQA\n",
+ "import time\n",
+ "\n",
+ "qa_chain = RetrievalQA.from_chain_type(\n",
+ " llm,\n",
+ " retriever=vectorstore.as_retriever(),\n",
+ " chain_type_kwargs={\"prompt\": LLAMA_PROMPT}\n",
+ ")\n",
+ "start_time = time.time()\n",
+ "result = qa_chain({\"query\": question})\n",
+ "print(f\"\\n--- {time.time() - start_time} seconds ---\")"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.13"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/RetrievalAugmentedGeneration/notebooks/03_llama_index_simple.ipynb b/RetrievalAugmentedGeneration/notebooks/03_llama_index_simple.ipynb
new file mode 100644
index 000000000..3da624d3e
--- /dev/null
+++ b/RetrievalAugmentedGeneration/notebooks/03_llama_index_simple.ipynb
@@ -0,0 +1,461 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "9a4cb825-0940-44a7-9f79-c1ca73b37906",
+ "metadata": {},
+ "source": [
+ "# Notebook 3: Document Question-Answering with LlamaIndex\n",
+ "This notebook demonstrates how to use [LlamaIndex](https://gpt-index.readthedocs.io/en/stable/) to build a chatbot that references a custom knowledge base. \n",
+ "\n",
+ "Suppose you have some text documents (PDF, blog, Notion pages, etc.) and want to ask questions related to the contents of those documents. LLMs, given their proficiency in understanding text, are a great tool for this. \n",
+ "\n",
+ "\n",
+ " \n",
+ "⚠️ The notebook before this one, `02_langchain_index_simple.ipynb`, contains the same functionality as this notebook but uses some LangChain components instead of LlamaIndex components. \n",
+ "\n",
+ "Concepts that are used in this notebook are explained in-depth in the previous notebook. If you are new to retrieval augmented generation, it is recommended to go through the previous notebook before this one. \n",
+ "\n",
+ "Ultimately, we recommend reading about LangChain vs. LlamaIndex and picking the software/components of the software that makes the most sense to you. This is discussed a bit further below. \n",
+ "\n",
+ "
\n",
+ "\n",
+ "### [LlamaIndex](https://gpt-index.readthedocs.io/en/stable/)\n",
+ "[**LlamaIndex**](https://gpt-index.readthedocs.io/en/stable/) is a data framework for LLM applications to ingest, structure, and access private or domain-specific data. Since LLMs are both only trained up to a fixed point in time and do not contain knowledge that is proprietary to an Enterprise, they can't answer questions about new or proprietary knowledge. LlamaIndex helps solve this problem by providing data connectors to ingest data, indices to structure data for storage, and engines to communicate with data. \n",
+ "\n",
+ "\n",
+ "### [LlamaIndex](https://gpt-index.readthedocs.io/en/stable/) or [LangChain](https://python.langchain.com/docs/get_started/introduction)?\n",
+ "\n",
+ "It's recommended to read more about the unique strengths of both LlamaIndex and LangChain. At a high level, LangChain is a more general framework for building applications with LLMs. LangChain is (currently) more mature when it comes to multi-step chains and some other chat functionality such as conversational memory. LlamaIndex has plenty of overlap with LangChain, but is particularly strong for loading data from a wide variety of sources and indexing/querying tasks. \n",
+ "\n",
+ "Since LlamaIndex can be used *with* LangChain, the frameworks' unique capabilities can be leveraged together; the combination of the two is demonstrated in this notebook.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d76e8af7-2124-4cb6-8ade-e1c1c42d1701",
+ "metadata": {},
+ "source": [
+ "### Step 1: Integrate TensorRT-LLM to LangChain *and* LlamaIndex\n",
+ "#### Customized LangChain LLM in LlamaIndex\n",
+ "As noted in the previous notebook, Langchain allows you to [create custom wrappers for your LLM](https://python.langchain.com/docs/modules/model_io/models/llms/custom_llm) in case you want to use your own LLM or a different wrapper than the one that is supported in LangChain. Since we are using a custom Llama2 model hosted on Triton with TRT-LLM, we have written a custom wrapper for our LLM. \n",
+ "\n",
+ "We can easily take a custom LLM that has been wrapped for LangChain and plug it into [LlamaIndex as an LLM](https://docs.llamaindex.ai/en/stable/understanding/using_llms/using_llms.html#using-llms)! We use the [LlamaIndex LangChainLLM library](https://gpt-index.readthedocs.io/en/latest/api_reference/llms/langchain.html) so the LangChain LLM can be used in LlamaIndex. \n",
+ "\n",
+ "\n",
+ " \n",
+ "WARNING! Be sure to replace `server_url` with the address and port that Triton is running on. \n",
+ "\n",
+ "
\n",
+ "\n",
+ "Use the address and port that the Triton is available on; for example `localhost:8001`. **If you are running this notebook as part of the generative ai workflow, your can use the existing url.**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "8a80987e-1ddb-4248-b76c-f3ce16745ca3",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from trt_llm import TensorRTLLM\n",
+ "from llama_index.llms import LangChainLLM\n",
+ "trtllm =TensorRTLLM(server_url =\"triton:8001\", model_name=\"ensemble\", tokens=500)\n",
+ "llm = LangChainLLM(llm=trtllm)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bc57b68d-afd5-4a0c-832c-0ad8f3f475d5",
+ "metadata": {},
+ "source": [
+ "### Step 2: Create a Prompt Template\n",
+ "\n",
+ "A [**prompt template**](https://gpt-index.readthedocs.io/en/latest/core_modules/model_modules/prompts.html) is a common paradigm in LLM development.\n",
+ "\n",
+ "They are a pre-defined set of instructions provided to the LLM and guide the output produced by the model. They can contain few shot examples and guidance and are a quick way to engineer the responses from the LLM. Llama 2 accepts the [prompt format](https://huggingface.co/blog/llama2#how-to-prompt-llama-2) shown in `LLAMA_PROMPT_TEMPLATE`, which we manipulate to be constructed with:\n",
+ "- The system prompt\n",
+ "- The context\n",
+ "- The user's question\n",
+ " \n",
+ "Much like LangChain's abstraction of prompts, LlamaIndex has similar abstractions for you to create prompts."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "682ec812-33be-430f-8bb1-ae3d68690198",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from llama_index import Prompt\n",
+ "\n",
+ "LLAMA_PROMPT_TEMPLATE = (\n",
+ " \"[INST] <>\"\n",
+ " \"Use the following context to answer the user's question. If you don't know the answer, just say that you don't know, don't try to make up an answer.\"\n",
+ " \"<>\"\n",
+ " \"[INST] Context: {context_str} Question: {query_str} Only return the helpful answer below and nothing else. Helpful answer:[/INST]\"\n",
+ ")\n",
+ "\n",
+ "qa_template = Prompt(LLAMA_PROMPT_TEMPLATE)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "056850b3-70c6-438a-9c35-e017ab611252",
+ "metadata": {},
+ "source": [
+ "### Step 3: Load Documents\n",
+ "\n",
+ "\n",
+ "

\n",
+ "
\n",
+ "\n",
+ "LlamaIndex provides [**data loaders**](https://docs.llamaindex.ai/en/stable/module_guides/loading/connector/root.html#data-connectors-llamahub) through [Llama Hub](https://llamahub.ai/). These allow for custom data sources to be connected to your LLM via plugins. So, for example, plugins are available to load documents from [Jira](https://llamahub.ai/l/jira), [Outlook Calendar](https://llamahub.ai/l/outlook_localcalendar), [Slack](https://llamahub.ai/l/slack), [Trello](https://llamahub.ai/l/trello), and many other applications. \n",
+ "\n",
+ "At the core of each data loader is a `download_loader` function which downloads the loader file into a module that you can use in your application. Once the loader is downloaded, data is ingested through the loader. The output of this ingestion is data formatted as a LlamaIndex [**Document**](https://docs.llamaindex.ai/en/stable/module_guides/loading/documents_and_nodes/root.html#documents-nodes) (text and metadata). \n",
+ "\n",
+ "Similar to the previous notebook with LangChain, an [`UnstructuredReader`](https://llamahub.ai/l/file-unstructured) is used in this example. However, this time it's from from [Llama Hub](https://llamahub.ai/) (LlamaIndex). Again, we load a research paper about Llama2 from Meta. \n",
+ "\n",
+ "[Here](https://python.langchain.com/docs/integrations/document_loaders) are some of the other document loaders available from LangChain."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "e9457012-e436-4371-9157-56c1ce4be667",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "File ‘llama2_paper.pdf’ already there; not retrieving.\n"
+ ]
+ }
+ ],
+ "source": [
+ "! wget -O \"llama2_paper.pdf\" -nc --user-agent=\"Mozilla\" https://arxiv.org/pdf/2307.09288.pdf"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "4f9adbc8-2060-4b16-9252-ac6727b862ee",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "[nltk_data] Downloading package punkt to /root/nltk_data...\n",
+ "[nltk_data] Package punkt is already up-to-date!\n",
+ "[nltk_data] Downloading package averaged_perceptron_tagger to\n",
+ "[nltk_data] /root/nltk_data...\n",
+ "[nltk_data] Package averaged_perceptron_tagger is already up-to-\n",
+ "[nltk_data] date!\n"
+ ]
+ }
+ ],
+ "source": [
+ "from llama_hub.file.unstructured.base import UnstructuredReader\n",
+ "import time\n",
+ "\n",
+ "loader = UnstructuredReader()\n",
+ "start_time = time.time()\n",
+ "documents = loader.load_data(file=\"llama2_paper.pdf\")\n",
+ "print(f\"--- {time.time() - start_time} seconds ---\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f03d6d82-8157-4dbc-97dd-29e3b990f8aa",
+ "metadata": {},
+ "source": [
+ "### Step 4: Transform Documents with Text Splitting and a Node Parser\n",
+ "#### a) Generate Embeddings \n",
+ "Once documents have been loaded, they are often transformed. One method of transformation is known as **chunking**, which breaks down large pieces of text, for example, a long document, into smaller segments. This technique is valuable because it helps [optimize the relevance of the content returned from the vector database](https://www.pinecone.io/learn/chunking-strategies/). \n",
+ "\n",
+ "This is the same process as the previous notebook; again, we use a LangChain text splitter. In this example, we use a [``SentenceTransformersTokenTextSplitter``](https://api.python.langchain.com/en/latest/text_splitter/langchain.text_splitter.SentenceTransformersTokenTextSplitter.html#langchain.text_splitter.SentenceTransformersTokenTextSplitter). The ``SentenceTransformersTokenTextSplitter`` is a specialized text splitter for use with the sentence-transformer models. The default behavior is to split the text into chunks that fit the token window of the sentence transformer model that you would like to use. This sentence transformer model is used to generate the embeddings from documents.\n",
+ "\n",
+ "There are some nuanced complexities to text splitting since semantically related text, in theory, should be kept together. \n",
+ "\n",
+ "This time, we also use a [**LlamaIndex node parser**](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/root.html#node-parser) on top of the text splitter from LangChain. This is not required, but since LlamaIndex provides a [**node structure**](https://docs.llamaindex.ai/en/stable/module_guides/loading/documents_and_nodes/root.html#documents-nodes), we choose to use this functionality to level up our storage of documents. \n",
+ "\n",
+ "**Nodes** represent chunks of source documents, but they also contain metadata and relationship information with other nodes and index structures. Since nodes provide these additional forms of hierarchy and connections across the data, they can help generate more accurate answers upon retrieval."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "id": "fa366250-108e-45a0-88ce-e6f7274da8e1",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/usr/local/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
+ " from .autonotebook import tqdm as notebook_tqdm\n"
+ ]
+ }
+ ],
+ "source": [
+ "from langchain.text_splitter import SentenceTransformersTokenTextSplitter\n",
+ "from llama_index.node_parser import SimpleNodeParser\n",
+ "\n",
+ "\n",
+ "TEXT_SPLITTER_MODEL = \"intfloat/e5-large-v2\"\n",
+ "TEXT_SPLITTER_CHUNCK_SIZE = 510\n",
+ "TEXT_SPLITTER_CHUNCK_OVERLAP = 200\n",
+ "\n",
+ "text_splitter = SentenceTransformersTokenTextSplitter(\n",
+ " model_name=TEXT_SPLITTER_MODEL,\n",
+ " chunk_size=TEXT_SPLITTER_CHUNCK_SIZE,\n",
+ " chunk_overlap=TEXT_SPLITTER_CHUNCK_OVERLAP,\n",
+ ")\n",
+ "\n",
+ "node_parser = SimpleNodeParser.from_defaults(\n",
+ " text_splitter=text_splitter\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cf9e2595-ae85-4c00-b561-d7d1a40933bf",
+ "metadata": {},
+ "source": [
+ "Additionally, we use a LlamaIndex [``PromptHelper``](https://gpt-index.readthedocs.io/en/latest/api_reference/service_context/prompt_helper.html) to help deal with LLM context window token limitations. It calculates available context size to the LLM by taking the initial context token length and subtracting out reserved token space for the prompt template and output. It provides a utility for re-packing text chunks from the index to maximally use the context window to minimize requests sent to the LLM.\n",
+ "\n",
+ "- ``context_window``: context window for the LLM -- the context length for Llama2 is 4k tokens\n",
+ "- ``num_ouptut``: number of output tokens for the LLM\n",
+ "- ``chunk_overlap_ratio``: chunk overlap as a ratio to chunk size\n",
+ "- ``chunk_size_limit``: maximum chunk size to use"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "id": "dc9a6082-34a0-4aa7-964b-7fe3f2015aa9",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from llama_index import PromptHelper\n",
+ "\n",
+ "prompt_helper = PromptHelper(\n",
+ " context_window=4096, \n",
+ " num_output=256, \n",
+ " chunk_overlap_ratio=0.1, \n",
+ " chunk_size_limit=None\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b8dab583-a12d-4fb1-a9eb-3a1b1f04075d",
+ "metadata": {},
+ "source": [
+ "### Step 5: Generate and Store Embeddings\n",
+ "#### a) Generate Embeddings \n",
+ "[Embeddings](https://docs.llamaindex.ai/en/stable/module_guides/models/embeddings.html#embeddings) for documents are created by vectorizing the document text; this vectorization captures the semantic meaning of the text. This allows you to quickly and efficiently find other pieces of text that are similar. \n",
+ "\n",
+ "When a user sends in their query, the query is also embedded using the same embedding model that was used to embed the documents. As explained earlier, this allows us to find similar (relevant) documents to the user's query. \n",
+ "\n",
+ "Like other sections in this notebook, we can easily take a LangChain embedding object and use with LlamaIndex. We use the [LangchainEmbedding library](https://docs.llamaindex.ai/en/stable/api_reference/service_context/embeddings.html#langchainembedding), which acts as a wrapper around Langchain's embedding models. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "id": "e9011ba0-f3f6-41f0-8a15-48f264743545",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from langchain.embeddings import HuggingFaceEmbeddings\n",
+ "from llama_index import LangchainEmbedding, ServiceContext\n",
+ "import torch\n",
+ "\n",
+ "#Running the model on CPU as we want to conserve gpu memory. \n",
+ "#In the production deployment (API server shown as part of the 5th notebook we run the model on GPU)\n",
+ "model_name=\"intfloat/e5-large-v2\"\n",
+ "model_kwargs = {\"device\": \"cpu\"}\n",
+ "encode_kwargs = {\"normalize_embeddings\": False}\n",
+ "hf_embeddings = HuggingFaceEmbeddings(\n",
+ " model_name=model_name,\n",
+ " model_kwargs=model_kwargs,\n",
+ " encode_kwargs=encode_kwargs,\n",
+ ")\n",
+ "# Load in a specific embedding model\n",
+ "embed_model = LangchainEmbedding(hf_embeddings)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8db99124-e438-406d-880d-557501a461d3",
+ "metadata": {},
+ "source": [
+ "#### b) Store Embeddings \n",
+ "\n",
+ "LlamaIndex provides a supporting module, [`ServiceContext`](https://docs.llamaindex.ai/en/stable/module_guides/supporting_modules/service_context.html#servicecontext), to bundle commonly used resources during the indexing and querying stage. In this example, we bundle resources we've built: the LLM, the embedding model, the node parser, and the prompt helper. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "id": "0e493f9d-589a-4820-902d-f68932bfb0d8",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from llama_index import ServiceContext\n",
+ "service_context = ServiceContext.from_defaults(\n",
+ " llm=llm,\n",
+ " embed_model=embed_model,\n",
+ " node_parser=node_parser,\n",
+ " prompt_helper=prompt_helper\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d339a5b9-0d76-43e7-86d7-0f544f0805a2",
+ "metadata": {},
+ "source": [
+ "Set the service context globally, to avoid passing it to every llm call/"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "id": "ba0efae7-a8ad-4db0-80ea-7edd69bf4719",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from llama_index import set_global_service_context\n",
+ "set_global_service_context(service_context)"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "79c7923c-d778-4f32-be37-4314063ecd2f",
+ "metadata": {},
+ "source": [
+ "\n",
+ " \n",
+ "⚠️ in the deployment of this workflow, [Milvus](https://milvus.io/) is running as a vector database microservice.\n",
+ "
"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "id": "1e94e53e-41a9-47d3-a9d3-7c0af4c07f76",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from llama_index import VectorStoreIndex\n",
+ "from llama_index.storage.storage_context import StorageContext\n",
+ "from llama_index.vector_stores import MilvusVectorStore\n",
+ "\n",
+ "vector_store = MilvusVectorStore(uri=\"http://milvus:19530\", dim=1024, overwrite=False)\n",
+ "storage_context = StorageContext.from_defaults(vector_store=vector_store)\n",
+ "index = VectorStoreIndex.from_vector_store(vector_store)"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "5783e23b",
+ "metadata": {},
+ "source": [
+ "Let's load the documents into the vector database index"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "474b8820",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import time\n",
+ "start_time = time.time()\n",
+ "nodes = node_parser.get_nodes_from_documents(documents)\n",
+ "index.insert_nodes(nodes)\n",
+ "print(f\"--- {time.time() - start_time} seconds ---\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "57e7aa7f-a219-44fe-8757-432daf278f6a",
+ "metadata": {},
+ "source": [
+ "### Step 6: Build the Query Engine and Stream Response\n",
+ "\n",
+ "#### a) Build the Query Engine\n",
+ "\n",
+ "A query engine is an object that takes in a query and returns a response. Each vector index has a default corresponding query engine; for example, the default query engine for a vector index performs a standard top-k retrieval over the vector store.\n",
+ "\n",
+ "A query engine contains the following components:\n",
+ "- Retriever\n",
+ "- Node PostProcessor\n",
+ "- Response Synthesizer "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f56f37e0-341e-4d7d-b282-f374a16f55b2",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "query_engine = index.as_query_engine(text_qa_template=qa_template, streaming=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a2359014-ef1f-4d0f-bac9-8fdd37a93351",
+ "metadata": {},
+ "source": [
+ "#### b) Stream a Response from the Query Engine\n",
+ "Lastly, we pass the query engine a user's question and stream the response. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "38d23754-ea6b-47ce-8b3b-ebd37c0f5693",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import time\n",
+ "\n",
+ "start_time = time.time()\n",
+ "response = query_engine.query(\"what is the context length of llama2?\")\n",
+ "response.print_response_stream()\n",
+ "print(f\"\\n--- {time.time() - start_time} seconds ---\")"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.13"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/RetrievalAugmentedGeneration/notebooks/04_llamaindex_hier_node_parser.ipynb b/RetrievalAugmentedGeneration/notebooks/04_llamaindex_hier_node_parser.ipynb
new file mode 100644
index 000000000..5740c727d
--- /dev/null
+++ b/RetrievalAugmentedGeneration/notebooks/04_llamaindex_hier_node_parser.ipynb
@@ -0,0 +1,466 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "5298d283-47b0-4420-9fba-d64055e0cdd9",
+ "metadata": {},
+ "source": [
+ "# Notebook 4: Advanced Document Question-Answering with LlamaIndex\n",
+ "This notebook demonstrates how to use [LlamaIndex](https://gpt-index.readthedocs.io/en/stable/) to build a more complex retrieval for a chatbot. \n",
+ "\n",
+ "The retrieval method shown in this notebook works well for code documentation; it retrieves more contiguous document blocks that preserve both code snippets and explanations of code. \n",
+ "\n",
+ "\n",
+ " \n",
+ "⚠️ There are many node parsing and retrieval techniques supported in LlamaIndex and this notebook just shows how two of these techniques, [HierarchialNodeParser](https://gpt-index.readthedocs.io/en/stable/api_reference/service_context/node_parser.html#llama_index.node_parser.HierarchicalNodeParser) and [AutoMergingRetriever](https://gpt-index.readthedocs.io/en/latest/examples/retrievers/auto_merging_retriever.html), can be useful for chatting with code documentation. \n",
+ "
\n",
+ "\n",
+ "In this demo, we'll use the [`llama_docs_bot`](https://github.com/run-llama/llama_docs_bot/tree/main) GitHub repository as our sample documentation to query. This repository contains the content for a development series with LlamaIndex covering the following topics: \n",
+ "- LLMs\n",
+ "- Nodes and documents\n",
+ "- Evaluation\n",
+ "- Embeddings\n",
+ "- Retrieval"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e1472274-8f2a-4f28-ad00-fd2a84542335",
+ "metadata": {},
+ "source": [
+ "### Step 1: Prerequisite Setup\n",
+ "By now you should be familiar with these steps:\n",
+ "1. Create an LLM client.\n",
+ "2. Set the prompt template for the LLM.\n",
+ "3. Download embeddings.\n",
+ "4. Set the service context.\n",
+ "5. Split the text\n",
+ "\n",
+ "\n",
+ " \n",
+ "WARNING! Be sure to replace `server_url` with the address and port that Triton is running on. \n",
+ "\n",
+ "
\n",
+ "\n",
+ "Use the address and port that the Triton is available on; for example `localhost:8001`. **If you are running this notebook as part of the generative ai workflow, you can use the existing url."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "639738d6-6b97-49a4-b091-c3f2bd0b3c4b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from trt_llm import TensorRTLLM\n",
+ "from llama_index.llms import LangChainLLM\n",
+ "trtllm =TensorRTLLM(server_url =\"triton:8001\", model_name=\"ensemble\", tokens=500)\n",
+ "llm = LangChainLLM(llm=trtllm)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "11e79465-2ba5-41e6-8ed3-027146ac2289",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from llama_index import Prompt\n",
+ "\n",
+ "LLAMA_PROMPT_TEMPLATE = (\n",
+ " \"[INST] <>\"\n",
+ " \"Use the following context to answer the user's question. If you don't know the answer, just say that you don't know, don't try to make up an answer.\"\n",
+ " \"<>\"\n",
+ " \"[INST] Context: {context_str} Question: {query_str} Only return the helpful answer below and nothing else. Helpful answer:[/INST]\"\n",
+ ")\n",
+ "\n",
+ "qa_template = Prompt(LLAMA_PROMPT_TEMPLATE)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "cae4c5e2-7726-4ba3-8d37-c919c916e755",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ " from langchain.embeddings import HuggingFaceEmbeddings\n",
+ "from llama_index import LangchainEmbedding, ServiceContext, set_global_service_context\n",
+ "import torch\n",
+ "\n",
+ "model_kwargs = {\"device\": \"cpu\"}\n",
+ "encode_kwargs = {\"normalize_embeddings\": False}\n",
+ "hf_embeddings = HuggingFaceEmbeddings(\n",
+ " model_name=\"intfloat/e5-large-v2\",\n",
+ " model_kwargs=model_kwargs,\n",
+ " encode_kwargs=encode_kwargs,\n",
+ ")\n",
+ "# Load in a specific embedding model\n",
+ "embed_model = LangchainEmbedding(hf_embeddings)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "333fb0b6-8cb9-4b26-9b3b-9360d153a325",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ " service_context = ServiceContext.from_defaults(\n",
+ " llm=llm,\n",
+ " embed_model=embed_model\n",
+ ")\n",
+ "set_global_service_context(service_context)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9c6490da-b867-4104-b8ff-ec03928a2212",
+ "metadata": {},
+ "source": [
+ "When splitting the text, we split it into a parent node of 1024 tokens and two children nodes of 510 tokens. Our leaf nodes' maximum size is 512 tokens, so we need to make the largest leaves that can exist under 512 tokens. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "2a6036d7-ea65-4f35-87fb-737593acab2c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from llama_index.text_splitter import TokenTextSplitter\n",
+ "from transformers import AutoTokenizer\n",
+ "text_splitter_ids = [\"1024\", \"510\"]\n",
+ "text_splitter_map = {}\n",
+ "for ids in text_splitter_ids:\n",
+ " text_splitter_map[ids] = TokenTextSplitter(\n",
+ " chunk_size=int(ids),\n",
+ " chunk_overlap=200\n",
+ " )"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3cf80d8f-88fc-45ed-912c-8171601a19db",
+ "metadata": {},
+ "source": [
+ "### Step 2: Clone the Llama Docs Bot Repo \n",
+ "This repository will be our sample documentation that we chat with. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "7d748587-791b-484d-8e11-ede14d4d7984",
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [],
+ "source": [
+ "!git clone https://github.com/run-llama/llama_docs_bot.git"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d601d983-9dbc-46c4-ab45-8f83e4ae57bd",
+ "metadata": {},
+ "source": [
+ "### Step 3: Define Document Loading and Node Parsing Function\n",
+ "\n",
+ "Assuming hierarchical node parsing is set to true, this function:\n",
+ "- Parses each directory into a single giant document\n",
+ "- Chunks the document into a hierarchy of nodes with a top-level chunk size (1024) and children chunks that are smaller (aka **hierarchical node parsing**)\n",
+ " ```\n",
+ " 1024\n",
+ " /--------\\\n",
+ " 1024//2 1024//2\n",
+ "\n",
+ " ```\n",
+ "\n",
+ "#### Hierarchical Node Parser\n",
+ "The novel part of this step is using LlamaIndex's [**Hierarchical Node Parser**](https://gpt-index.readthedocs.io/en/stable/api_reference/service_context/node_parser.html#llama_index.node_parser.HierarchicalNodeParser). This parses nodes into several chunk sizes. \n",
+ "\n",
+ "During retrieval, if a majority of chunks are retrieved that have the same parent chunk, the larger parent chunk is returned instead of the smaller chunks.\n",
+ "\n",
+ "#### Simple Node Parser\n",
+ "If hierarchical parsing is false, a simple node structure is used and returned."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "fbd8f9b7-3294-4c22-8c0f-554f5457a566",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from llama_index import SimpleDirectoryReader, Document\n",
+ "from llama_index.node_parser import HierarchicalNodeParser, SimpleNodeParser, get_leaf_nodes\n",
+ "from llama_index.schema import MetadataMode\n",
+ "from llama_docs_bot.llama_docs_bot.markdown_docs_reader import MarkdownDocsReader\n",
+ "\n",
+ "# This function takes in a directory of files, puts them in a giant document, and parses and returns them as:\n",
+ "# - a hierarchical node structure if it's a hierarchical implementation\n",
+ "# - a simple node structure if it's a non-hierarchial implementation\n",
+ "def load_markdown_docs(filepath, hierarchical=True):\n",
+ " \"\"\"Load markdown docs from a directory, excluding all other file types.\"\"\"\n",
+ " loader = SimpleDirectoryReader(\n",
+ " input_dir=filepath, \n",
+ " required_exts=[\".md\"],\n",
+ " file_extractor={\".md\": MarkdownDocsReader()},\n",
+ " recursive=True\n",
+ " )\n",
+ "\n",
+ " documents = loader.load_data()\n",
+ "\n",
+ " if hierarchical:\n",
+ " # combine all documents into one\n",
+ " documents = [\n",
+ " Document(text=\"\\n\\n\".join(\n",
+ " document.get_content(metadata_mode=MetadataMode.ALL) \n",
+ " for document in documents\n",
+ " )\n",
+ " )\n",
+ " ]\n",
+ "\n",
+ " # chunk into 3 levels\n",
+ " # majority means 2/3 are retrieved before using the parent\n",
+ " large_chunk_size = 1536\n",
+ " node_parser = HierarchicalNodeParser.from_defaults(text_splitter_ids=text_splitter_ids, text_splitter_map=text_splitter_map)\n",
+ "\n",
+ " nodes = node_parser.get_nodes_from_documents(documents)\n",
+ " return nodes, get_leaf_nodes(nodes)\n",
+ " ########## This is NOT a hierarchical parser for demonstration purposes later in the notebook ##########\n",
+ " else:\n",
+ " node_parser = SimpleNodeParser.from_defaults()\n",
+ " nodes = node_parser.get_nodes_from_documents(documents)\n",
+ " return nodes"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c9d2797c-6f66-495f-9941-85c8b7eb942e",
+ "metadata": {},
+ "source": [
+ "### Step 4: Load and Parse Documents with Node Parser \n",
+ "\n",
+ "First, we define all of the documentation directories we want to pull from. \n",
+ "\n",
+ "Next, we load the documentation and store parent nodes in a `SimpleDocumentStore` and leaf nodes in a `VectorStoreIndex`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b22ecf62-617b-48c4-9bca-f62c0d775636",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "docs_directories = {\n",
+ " \"./llama_docs_bot/docs/community\": \"Useful for information on community integrations with other libraries, vector dbs, and frameworks.\", \n",
+ " \"./llama_docs_bot/docs/core_modules/agent_modules\": \"Useful for information on data agents and tools for data agents.\", \n",
+ " \"./llama_docs_bot/docs/core_modules/data_modules\": \"Useful for information on data, storage, indexing, and data processing modules.\",\n",
+ " \"./llama_docs_bot/docs/core_modules/model_modules\": \"Useful for information on LLMs, embedding models, and prompts.\",\n",
+ " \"./llama_docs_bot/docs/core_modules/query_modules\": \"Useful for information on various query engines and retrievers, and anything related to querying data.\",\n",
+ " \"./llama_docs_bot/docs/core_modules/supporting_modules\": \"Useful for information on supporting modules, like callbacks, evaluators, and other supporting modules.\",\n",
+ " \"./llama_docs_bot/docs/getting_started\": \"Useful for information on getting started with LlamaIndex.\", \n",
+ " \"./llama_docs_bot/docs/development\": \"Useful for information on contributing to LlamaIndex development.\",\n",
+ "}\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "71ebcb8b-a27e-4b22-b7c6-a813d66b4240",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from llama_index import VectorStoreIndex,StorageContext, load_index_from_storage\n",
+ "from llama_index.query_engine import RetrieverQueryEngine\n",
+ "\n",
+ "from llama_index.tools import QueryEngineTool, ToolMetadata\n",
+ "from llama_index.storage.docstore import SimpleDocumentStore\n",
+ "import os\n",
+ "import time\n",
+ "\n",
+ "start_time = time.time()\n",
+ "for directory, description in docs_directories.items():\n",
+ " nodes, leaf_nodes = load_markdown_docs(directory, hierarchical=True)\n",
+ " \n",
+ " docstore = SimpleDocumentStore()\n",
+ " docstore.add_documents(nodes)\n",
+ " storage_context = StorageContext.from_defaults(docstore=docstore)\n",
+ " \n",
+ " index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)\n",
+ " index.storage_context.persist(persist_dir=f\"./data_{os.path.basename(directory)}\")\n",
+ "\n",
+ "print(f\"--- {time.time() - start_time} seconds ---\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7910668e-6072-4786-9560-2fcc9559159c",
+ "metadata": {},
+ "source": [
+ "### Step 5: Define Custom Node Post-Processor\n",
+ "\n",
+ "A [**Node PostProcessor**](https://gpt-index.readthedocs.io/en/v0.5.27/how_to/query/node_postprocessor.html) takes a list of retrieved nodes and transforms them (filtering, replacement, etc). \n",
+ "\n",
+ "This custom node post-processor provides a simple approach to approximate token counts and returns the most nodes that fit within the token count (2500 tokens). Nodes are already sorted, so the most similar ones are returned first. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "4192140d-0f46-44df-b278-396a23c9ad12",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from typing import Callable, Optional\n",
+ "\n",
+ "from llama_index.utils import globals_helper\n",
+ "from llama_index.schema import MetadataMode\n",
+ "\n",
+ "class LimitRetrievedNodesLength:\n",
+ "\n",
+ " def __init__(self, limit: int = 2500, tokenizer: Optional[Callable] = None):\n",
+ " self._tokenizer = tokenizer or globals_helper.tokenizer\n",
+ " self.limit = limit\n",
+ "\n",
+ " def postprocess_nodes(self, nodes, query_bundle):\n",
+ " included_nodes = []\n",
+ " current_length = 0\n",
+ "\n",
+ " for node in nodes:\n",
+ " current_length += len(self._tokenizer(node.node.get_content(metadata_mode=MetadataMode.LLM)))\n",
+ " if current_length > self.limit:\n",
+ " break\n",
+ " included_nodes.append(node)\n",
+ "\n",
+ " return included_nodes"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7c229696-f9d4-4032-ae2a-b37bc842689f",
+ "metadata": {},
+ "source": [
+ "### Step 5: Build the Retriever and Query Engine\n",
+ "\n",
+ "#### AutoMergingRetriever\n",
+ "The [`AutoMergingRetriever`](https://gpt-index.readthedocs.io/en/v0.8.11.post2/examples/retrievers/auto_merging_retriever.html) takes in a set of leaf nodes and recursively merges subsets of leaf nodes that reference a parent node beyond a given threshold. This allows for a consolidation of potentially disparate, smaller contexts into a larger context that may help synthesize disparate information. \n",
+ "\n",
+ "#### Query Engine\n",
+ "A query engine is an object that takes in a query and returns a response.\n",
+ "\n",
+ "It may contain the following components:\n",
+ "- **Retriever**: Given a query, retrieves relevant nodes.\n",
+ " - We use an [`AutoMergingRetriever`](https://gpt-index.readthedocs.io/en/latest/examples/retrievers/auto_merging_retriever.html) if it's a hierarchial implementation.\n",
+ " *This replaces the retrieved nodes with the larger parent chunk*. \n",
+ "- **Node PostProcessor**: Takes a list of retrieved nodes and transforms them (filtering, replacement, etc.)\n",
+ " - We use a post-processor that filters the retrieved nodes to a limited length. \n",
+ "- **Response Synthesizer**: Takes a list of relevant nodes and synthesizes a response with an LLM."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "efd682ff-8760-4d82-a137-5c4ebb79796b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from llama_index.retrievers import AutoMergingRetriever\n",
+ "from llama_index.query_engine import RetrieverQueryEngine\n",
+ "from llama_index import (\n",
+ " VectorStoreIndex,\n",
+ " get_response_synthesizer,\n",
+ ")\n",
+ "\n",
+ "retriever = AutoMergingRetriever(\n",
+ " index.as_retriever(similarity_top_k=12), \n",
+ " storage_context=storage_context\n",
+ " )\n",
+ "\n",
+ "query_engine = RetrieverQueryEngine.from_args(\n",
+ " retriever,\n",
+ " text_qa_template=qa_template,\n",
+ " node_postprocessors=[LimitRetrievedNodesLength(limit=2500)],\n",
+ " streaming=True\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "aa87b7bd-5d0e-4e10-b0e5-c3010c522bab",
+ "metadata": {},
+ "source": [
+ "### Step 6: Stream Response"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e3552b80-fc46-4cd4-9f49-dfe3068eaa08",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "query = \"How do I setup a weaviate vector db? Give me a code sample please.\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0f04685a-115d-4b3c-a9b7-89baa253571e",
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [],
+ "source": [
+ "import time\n",
+ "\n",
+ "start_time = time.time()\n",
+ "response = query_engine.query(query)\n",
+ "response.print_response_stream()\n",
+ "print(f\"\\n--- {time.time() - start_time} seconds ---\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "83032355-29b8-4797-ac99-1ff1d700dc79",
+ "metadata": {},
+ "source": [
+ "To clear out cached data run:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "85b6ad54-088e-43ff-ad93-2a9ebc8134df",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "!rm -rf data_*"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.13"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/RetrievalAugmentedGeneration/notebooks/05_dataloader.ipynb b/RetrievalAugmentedGeneration/notebooks/05_dataloader.ipynb
new file mode 100644
index 000000000..f1e580039
--- /dev/null
+++ b/RetrievalAugmentedGeneration/notebooks/05_dataloader.ipynb
@@ -0,0 +1,181 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "2a74231c-df99-461f-8cdc-ea17f808d717",
+ "metadata": {},
+ "source": [
+ "### Notebook-5: Create a NVIDIA PR chatbot\n",
+ "As part of this generative AI workflow, we create a NVIDIA PR chatbot that answers questions from the nvidia news and blogs from years of 2022 and 2023. For this, we have created a REST FastAPI server that wraps llama-index. The API server has two methods, ```upload_document``` and ```generate```. The ```upload_document``` method takes a document from the user's computer and uploads it to a Milvus vector database after splitting, chunking and embedding the document. The ```generate``` API method generates an answer from the provided prompt optionally sourcing information from a vector database. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e2c8a538-b766-482d-9fde-9d20482fe7db",
+ "metadata": {},
+ "source": [
+ "#### Step-1: Load the pdf files from the dataset folder.\n",
+ "\n",
+ "You can upload the pdf files containing the NVIDIA blogs to ```query:8081/uploadDocument``` API endpoint"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d598b9e7-8a04-4220-a875-69e6bbe6a2ce",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%%capture\n",
+ "!unzip dataset.zip"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "cd70a746-9bc0-4025-b6d5-76bba2473ceb",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "import requests\n",
+ "import mimetypes\n",
+ "\n",
+ "def upload_document(file_path, url):\n",
+ " headers = {\n",
+ " 'accept': 'application/json'\n",
+ " }\n",
+ " mime_type, _ = mimetypes.guess_type(file_path)\n",
+ " files = {\n",
+ " 'file': (file_path, open(file_path, 'rb'), mime_type)\n",
+ " }\n",
+ " response = requests.post(url, headers=headers, files=files)\n",
+ "\n",
+ " return response.text\n",
+ "\n",
+ "def upload_pdf_files(folder_path, upload_url, num_files):\n",
+ " i = 0\n",
+ " for files in os.listdir(folder_path):\n",
+ " file_path = os.path.join(folder_path, files)\n",
+ " print(upload_document(file_path, upload_url))\n",
+ " i += 1\n",
+ " if i > num_files:\n",
+ " break"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "5819b268-5867-45fd-9f52-00d5d797e772",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import time\n",
+ "\n",
+ "start_time = time.time()\n",
+ "NUM_DOCS_TO_UPLOAD=100\n",
+ "upload_pdf_files(\"dataset\", \"http://query:8081/uploadDocument\", NUM_DOCS_TO_UPLOAD)\n",
+ "print(f\"--- {time.time() - start_time} seconds ---\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f9d813be-76f2-42d1-9b43-403625c0b4ce",
+ "metadata": {},
+ "source": [
+ "#### Step-2 : Ask a question without referring to the knowledge base\n",
+ "Ask Tensorrt LLM llama-2 13B model a question about \"the nvidia grace superchip\" without seeking help from the vectordb/knowledge base by setting ```use_knowledge_base``` to ```false```"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "04d43b1a-f9b2-4119-b6c1-66dd38130e98",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import time\n",
+ "\n",
+ "data = {\n",
+ " \"question\": \"how many cores are on the nvidia grace superchip?\",\n",
+ " \"context\": \"\",\n",
+ " \"use_knowledge_base\": \"false\",\n",
+ " \"num_tokens\": 256\n",
+ "}\n",
+ "\n",
+ "url = \"http://query:8081/generate\"\n",
+ "\n",
+ "start_time = time.time()\n",
+ "with requests.post(url, stream=True, json=data) as r:\n",
+ " for chunk in r.iter_content(16):\n",
+ " print(chunk.decode(\"UTF-8\"), end =\"\")\n",
+ "print(f\"--- {time.time() - start_time} seconds ---\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cb186d4b-61c9-4b48-8ec9-3d238db787b8",
+ "metadata": {},
+ "source": [
+ "Now ask it the same question by setting ```use_knowledge_base``` to ```true```"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "903635bb-8d60-40ae-b613-67a2b8a08eb9",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "data = {\n",
+ " \"question\": \"how many cores are on the nvidia grace superchip?\",\n",
+ " \"context\": \"\",\n",
+ " \"use_knowledge_base\": \"true\",\n",
+ " \"num_tokens\": 50\n",
+ "}\n",
+ "\n",
+ "url = \"http://query:8081/generate\"\n",
+ "\n",
+ "start_time = time.time()\n",
+ "tokens_generated = 0\n",
+ "with requests.post(url, stream=True, json=data) as r:\n",
+ " for chunk in r.iter_content(16):\n",
+ " tokens_generated += 1\n",
+ " print(chunk.decode(\"UTF-8\"), end =\"\")\n",
+ "total_time = time.time() - start_time\n",
+ "print(f\"\\n--- Generated {tokens_generated} tokens in {total_time} seconds ---\")\n",
+ "print(f\"--- {tokens_generated/total_time} tokens/sec\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d9c8fffa-97d1-4b27-b1cd-87067853a65f",
+ "metadata": {},
+ "source": [
+ "### Next steps\n",
+ "We have setup a playground UI for you to upload files and get answers from, the UI is available on the same IP address as the notebooks: `host_ip:8090/converse`"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.13"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/RetrievalAugmentedGeneration/notebooks/dataset.zip b/RetrievalAugmentedGeneration/notebooks/dataset.zip
new file mode 100644
index 000000000..21b1ac2c7
--- /dev/null
+++ b/RetrievalAugmentedGeneration/notebooks/dataset.zip
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:b09437e0eded0ca7736cb3b0d216748c5e53434385e9693521afabde808512c1
+size 5803988
diff --git a/RetrievalAugmentedGeneration/notebooks/imgs/data_connection_langchain.jpeg b/RetrievalAugmentedGeneration/notebooks/imgs/data_connection_langchain.jpeg
new file mode 100644
index 000000000..6ae42c480
Binary files /dev/null and b/RetrievalAugmentedGeneration/notebooks/imgs/data_connection_langchain.jpeg differ
diff --git a/RetrievalAugmentedGeneration/notebooks/imgs/llama_hub.png b/RetrievalAugmentedGeneration/notebooks/imgs/llama_hub.png
new file mode 100644
index 000000000..dc984f3f4
Binary files /dev/null and b/RetrievalAugmentedGeneration/notebooks/imgs/llama_hub.png differ
diff --git a/RetrievalAugmentedGeneration/notebooks/imgs/vector_stores.jpeg b/RetrievalAugmentedGeneration/notebooks/imgs/vector_stores.jpeg
new file mode 100644
index 000000000..37bfd15c9
Binary files /dev/null and b/RetrievalAugmentedGeneration/notebooks/imgs/vector_stores.jpeg differ
diff --git a/RetrievalAugmentedGeneration/notebooks/requirements.txt b/RetrievalAugmentedGeneration/notebooks/requirements.txt
new file mode 100644
index 000000000..9e84b8ab9
--- /dev/null
+++ b/RetrievalAugmentedGeneration/notebooks/requirements.txt
@@ -0,0 +1,14 @@
+fastapi==0.104.1
+uvicorn[standard]==0.24.0
+python-multipart==0.0.6
+langchain==0.0.330
+tritonclient[all]==2.39.0
+unstructured[all-docs]==0.10.28
+sentence-transformers==2.2.2
+llama-index==0.8.28
+dataclass-wizard==0.22.2
+opencv-python==4.8.0.74
+llama-hub==0.0.43
+pymilvus==2.3.1
+jupyterlab==4.0.8
+openai==0.28.1
\ No newline at end of file
diff --git a/RetrievalAugmentedGeneration/requirements.txt b/RetrievalAugmentedGeneration/requirements.txt
new file mode 100644
index 000000000..a8b762d1f
--- /dev/null
+++ b/RetrievalAugmentedGeneration/requirements.txt
@@ -0,0 +1,12 @@
+fastapi==0.104.1
+uvicorn[standard]==0.24.0
+python-multipart==0.0.6
+langchain==0.0.330
+tritonclient[all]==2.39.0
+unstructured[all-docs]==0.10.28
+sentence-transformers==2.2.2
+openai==0.28.1
+llama-index==0.8.28
+pymilvus==2.3.1
+dataclass-wizard==0.22.2
+opencv-python==4.8.0.74