# RAG Example Using NVIDIA API Catalog and LlamaIndex

This notebook introduces how to use LlamaIndex to interact with NVIDIA hosted NIM microservices like chat, embedding, and reranking models to build a simple retrieval-augmented generation (RAG) application.

Alternatively, for a more interactive experience with a graphical user interface, refer to our [code](https://github.com/jayrodge/llm-assistant-cloud-app/) and [YouTube video](https://www.youtube.com/watch?v=09uDCmLzYHA) for a Gradio-based RAG Q&A reference application that also uses NVIDIA hosted NIM microservices.

## Terminology

#### RAG

- RAG is a technique for augmenting LLM knowledge with additional data.
- LLMs can reason about wide-ranging topics, but their knowledge is limited to the public data up to a specific point in time that they were trained on.
- If you want to build AI applications that can reason about private data or data introduced after a model's cutoff date, you need to augment the knowledge of the model with the specific information it needs.
- The process of bringing the appropriate information and inserting it into the model prompt is known as retrieval augmented generation (RAG).

The preceding summary of RAG originates in the [Build a RAG App](https://python.langchain.com/v0.2/docs/tutorials/rag/) tutorial in the LangChain v0.2 documentation.

For comprehensive information, refer to the LlamaIndex documentation for [Building an LLM Application](https://docs.llamaindex.ai/en/stable/understanding/#:~:text=on%20your%20machine.-,Building%20a%20RAG%20pipeline,-%3A%20Retrieval%2DAugmented%20Generation).
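The retrieve-then-insert flow described above can be sketched in a few lines of plain Python. This is only an illustration: word overlap stands in for a real embedding-based retriever, and the function names (`tokenize`, `retrieve`, `build_prompt`) and sample passages are hypothetical, not part of LlamaIndex or the NVIDIA connectors.

```python
# Toy illustration of the RAG flow: retrieve relevant text, then insert it into the prompt.
# Word overlap stands in for a real embedding-based retriever (an assumption of this sketch).

def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of words, ignoring basic punctuation."""
    return {word.strip(".,?!'\"").lower() for word in text.split()} - {""}


def retrieve(question: str, passages: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k passages that share the most words with the question."""
    question_words = tokenize(question)
    ranked = sorted(
        passages,
        key=lambda passage: len(question_words & tokenize(passage)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question: str, context: list[str]) -> str:
    """Insert the retrieved passages into the prompt ahead of the question."""
    context_block = "\n".join(f"- {passage}" for passage in context)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context_block}\n"
        f"Question: {question}\n"
    )


# Hypothetical private data that the model was never trained on.
passages = [
    "The cafeteria opens at 8 am on weekdays.",
    "The library closes at 9 pm on weekdays.",
    "Parking permits are renewed every January.",
]
question = "When does the library close?"
prompt = build_prompt(question, retrieve(question, passages, top_k=1))
print(prompt)
```

In the rest of this notebook, LlamaIndex takes over both steps: an embedding model replaces the word-overlap scoring, and the query engine assembles the prompt.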

#### NIM

- [NIM microservices](https://developer.nvidia.com/blog/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale/) are containerized microservices that simplify the deployment of generative AI models like LLMs and are optimized to run on NVIDIA GPUs. 
- NIM microservices support models across domains like chat, embedding, reranking, and more from both the community and NVIDIA.

#### NVIDIA API Catalog

- [NVIDIA API Catalog](https://build.nvidia.com/explore/discover) is a hosted platform for accessing a wide range of microservices online.
- You can test models on the catalog and then export them with an NVIDIA AI Enterprise license for on-premises or cloud deployment.

#### LlamaIndex Concepts

 - `Data connectors` ingest your existing data from their native source and format.
 - `Data indexes` structure your data in intermediate representations that are easy and performant for LLMs to consume.
 - `Engines` provide natural language access to your data for building context-augmented LLM apps.

LlamaIndex also provides integrations like `llms-nvidia`, `embeddings-nvidia` & `nvidia-rerank` to work with NVIDIA microservices.

## Installation and Requirements

Create a Python environment (preferably with Conda) using Python version 3.10.14. 
To install Jupyter Lab, refer to the [installation](https://jupyter.org/install) page.

## Getting Started!

In [3]:
# Requirements
!pip install --upgrade pip
!pip install llama-index-core==0.10.50
!pip install llama-index-readers-file==0.1.25
!pip install llama-index-llms-nvidia==0.1.3
!pip install llama-index-embeddings-nvidia==0.1.4
!pip install llama-index-postprocessor-nvidia-rerank==0.1.2
!pip install ipywidgets==8.1.3

Collecting llama-index-core==0.10.50
  Using cached llama_index_core-0.10.50-py3-none-any.whl.metadata (2.4 kB)
Using cached llama_index_core-0.10.50-py3-none-any.whl (15.4 MB)
Installing collected packages: llama-index-core
  Attempting uninstall: llama-index-core
    Found existing installation: llama-index-core 0.10.68.post1
    Uninstalling llama-index-core-0.10.68.post1:
      Successfully uninstalled llama-index-core-0.10.68.post1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
llama-index-llms-openai 0.1.31 requires llama-index-core<0.11.0,>=0.10.57, but you have llama-index-core 0.10.50 which is incompatible.
Successfully installed llama-index-core-0.10.50
Collecting llama-index-core<0.11.0,>=0.10.0 (from llama-index-llms-nvidia==0.1.3)
  Using cached llama_index_core-0.10.68.post1-py3-none-any.whl.metadata (2.5 kB)
Using cached llama_

To get started, you need an `NVIDIA_API_KEY` to use NVIDIA AI Foundation models:

1) Create a free account with [NVIDIA](https://build.nvidia.com/explore/discover).
2) Click on your model of choice.
3) Under **Input**, select the Python tab, click **Get API Key**, and then click **Generate Key**.
4) Copy and save the generated key as `NVIDIA_API_KEY`. From there, you should have access to the endpoints.

In [28]:
import getpass
import os

if not os.environ.get("NVIDIA_API_KEY", "").startswith("nvapi-"):
    # Prompt for the key at runtime; never hardcode a real key in the notebook.
    nvidia_api_key = getpass.getpass("Enter your NVIDIA API key: ")
    assert nvidia_api_key.startswith("nvapi-"), f"{nvidia_api_key[:5]}... is not a valid key"
    os.environ["NVIDIA_API_KEY"] = nvidia_api_key

## RAG Example using LLM and Embedding

### 1) Initialize the LLM

`llama-index-llms-nvidia`, also known as NVIDIA's LLM connector,
allows you to connect to and generate from compatible models available on the NVIDIA API catalog.

Here we will use **mixtral-8x7b-instruct-v0.1**.

In [29]:
# Settings enables global configuration as a singleton object throughout your application.
# Here, it is used to set the LLM, embedding model, and text splitter configurations globally.
from llama_index.core import Settings
from llama_index.llms.nvidia import NVIDIA

# Here we are using mixtral-8x7b-instruct-v0.1 model from API Catalog
Settings.llm = NVIDIA(model="mistralai/mixtral-8x7b-instruct-v0.1")

### 2) Initialize the embedding

We selected **NV-Embed-QA** as the embedding model.

In [30]:
from llama_index.embeddings.nvidia import NVIDIAEmbedding
Settings.embed_model = NVIDIAEmbedding(model="NV-Embed-QA", truncate="END")

### 3) Obtain some toy text dataset
Here we load toy data from text documents. In a real application, data can be loaded from various sources.

Real-world documents can be very long, which makes them hard to fit in the context window of many models. Even models that can fit a full document in their context window can struggle to find information in very long inputs.

To handle this, we split each document into chunks for embedding and vector storage.
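As a rough illustration of what chunking does, the sketch below groups whole sentences into chunks under a size limit. It is a simplified stand-in and not how `SentenceSplitter` is implemented: size is counted in whitespace-separated words here (an assumption of this sketch, whereas `SentenceSplitter` measures `chunk_size` in tokens), and `split_into_chunks` is a hypothetical helper.

```python
# Simplified sentence-aware chunking. Size is measured in whitespace-separated words,
# which is an assumption of this sketch; SentenceSplitter measures chunk_size in tokens.
import re


def split_into_chunks(text: str, max_words: int) -> list[str]:
    """Group whole sentences into chunks of at most max_words words where possible."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for sentence in sentences:
        sentence_len = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and current_len + sentence_len > max_words:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += sentence_len
    if current:
        chunks.append(" ".join(current))
    return chunks


text = (
    "Sweden is a Nordic country. Its capital is Stockholm. "
    "The country has many lakes. Forests cover much of the land."
)
chunks = split_into_chunks(text, max_words=10)
for chunk in chunks:
    print(chunk)
```

Keeping sentences intact means each chunk stays readable on its own, which tends to help both the embedding model and the LLM that later reads the retrieved chunk.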

In [31]:
# For this example we load a toy data set (it's a simple text file with some information about Sweden)
TOY_DATA_PATH = "./data/"

from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import SimpleDirectoryReader
Settings.text_splitter = SentenceSplitter(chunk_size=400)
documents = SimpleDirectoryReader(TOY_DATA_PATH).load_data()

Note:
 - `SimpleDirectoryReader` takes care of storing basic file information such as the filename, filepath, and file type as metadata by default. This metadata can be used to keep track of the source file, allowing us to use it later for citation or metadata filtering.
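To make the citation and filtering idea concrete, here is a minimal sketch that uses plain dictionaries in place of loaded documents. The `file_name` and `file_type` keys mirror the kind of file information described in the note; the sample values and the helpers `filter_by_metadata` and `cite_sources` are hypothetical and not LlamaIndex APIs.

```python
# Plain dictionaries stand in for chunks and their metadata (an assumption of this sketch).
# The sample file names and texts are made up for illustration.

def filter_by_metadata(chunks: list[dict], key: str, value: str) -> list[dict]:
    """Keep only the chunks whose metadata has the given key set to the given value."""
    return [chunk for chunk in chunks if chunk["metadata"].get(key) == value]


def cite_sources(chunks: list[dict]) -> list[str]:
    """Return the distinct source file names in first-seen order."""
    seen: list[str] = []
    for chunk in chunks:
        name = chunk["metadata"]["file_name"]
        if name not in seen:
            seen.append(name)
    return seen


chunks = [
    {"text": "Revenue grew in the third quarter.",
     "metadata": {"file_name": "report.txt", "file_type": "text/plain"}},
    {"text": "The park opens at nine.",
     "metadata": {"file_name": "visitor-guide.txt", "file_type": "text/plain"}},
    {"text": "Costs fell in the fourth quarter.",
     "metadata": {"file_name": "report.txt", "file_type": "text/plain"}},
]

report_chunks = filter_by_metadata(chunks, "file_name", "report.txt")
print([chunk["text"] for chunk in report_chunks])
print(cite_sources(chunks))
```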

### 4) Process the documents into VectorStoreIndex

In RAG, your data is loaded and prepared for queries or "indexed". User queries act on the index, which filters your data down to the most relevant context. This context and your query then go to the LLM along with a prompt, and the LLM provides a response.
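The "filters your data down to the most relevant context" step is typically a nearest-neighbor search over embedding vectors. The sketch below shows the idea with cosine similarity over hand-written two-dimensional vectors. The vectors, chunk IDs, and the `top_k` helper are illustrative assumptions; a real index stores vectors produced by the embedding model.

```python
# Toy top-k retrieval by cosine similarity. The 2-D vectors are made up for illustration;
# in the notebook, vectors come from the embedding model configured in Settings.
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], index: dict[str, list[float]], k: int) -> list[str]:
    """Return the IDs of the k stored vectors most similar to the query vector."""
    ranked = sorted(
        index,
        key=lambda chunk_id: cosine_similarity(query_vec, index[chunk_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical chunk IDs mapped to their (toy) embedding vectors.
toy_index = {
    "chunk-a": [1.0, 0.0],
    "chunk-b": [0.7, 0.7],
    "chunk-c": [0.0, 1.0],
}
query_vec = [0.9, 0.1]
print(top_k(query_vec, toy_index, k=2))
```

The `similarity_top_k` argument used later in this notebook plays the same role as `k` here: it sets how many of the closest chunks are passed on as context.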

In [32]:
from llama_index.core import VectorStoreIndex
# When you use from_documents, your Documents are split into chunks and parsed into Node objects
# By default, VectorStoreIndex stores everything in memory
index = VectorStoreIndex.from_documents(documents)

### 5) Create a Query Engine to ask questions over your data

In [33]:
# Returns a query engine for this index.
# If no LLM is passed in, this automatically uses Settings.llm as the LLM.
query_engine = index.as_query_engine(similarity_top_k=10)

In [34]:
response = query_engine.query(
    "In the chinese article, instead of 'Sword Lake World' theme park. What other competitors are mentioned in the article?"
)
print(response)

In the Chinese article, rather than the "Sword Lake World" theme park, other competitors mentioned are various theme parks that utilize a one-ticket-all-inclusive pricing strategy, which makes it challenging for consumers to negotiate the price. However, no specific names of these theme parks are provided in the article.


## RAG Example with LLM, Embedding & Reranking

In [15]:
# Let's test a more complex query with the LLM-and-embedding query_engine above, then see if the reranker can help.
response = query_engine.query(
    "What is Nordic Channel?"
)
print(response)

I don't have information on a television service called "Nordic Channel." The original answer suggests that it was a Swedish-language satellite service that was launched in 1989 and later became known as Kanal 5. However, I couldn't find any additional context or updates about this channel in relation to the new context provided.


### Enhancing accuracy for single data sources

This example demonstrates how a re-ranking model can be used to combine retrieval results and improve accuracy during retrieval of documents.

Typically, reranking is a critical piece of high-accuracy, efficient retrieval pipelines. Generally, there are two important use cases:

- Combining results from multiple data sources
- Enhancing accuracy for single data sources

Here, we focus on demonstrating only the second use case.
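The two-stage pattern for a single data source can be sketched as: take a broad candidate set from the first-stage retriever, re-score each candidate against the query, and keep only the best few. In the sketch below, `toy_relevance` is a hypothetical word-overlap scorer standing in for a reranking model, and the candidate strings are illustrative.

```python
# Two-stage retrieval sketch: broad candidates first, then re-score and keep the top_n.
# toy_relevance is a hypothetical stand-in for a reranking model's relevance score.
from typing import Callable


def toy_relevance(query: str, text: str) -> float:
    """Return the fraction of query words that appear in the candidate text."""
    query_words = query.lower().split()
    text_words = set(text.lower().replace(".", "").split())
    return sum(word in text_words for word in query_words) / len(query_words)


def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    top_n: int,
) -> list[str]:
    """Re-order the candidates by score_fn and keep the top_n highest-scoring ones."""
    ranked = sorted(candidates, key=lambda text: score_fn(query, text), reverse=True)
    return ranked[:top_n]


# Candidates as they might arrive from a first-stage retriever, in no useful order.
candidates = [
    "Kanal 5 is a Swedish television channel.",
    "Stockholm is the capital of Sweden.",
    "Nordic Channel launched in 1989.",
]
best = rerank("nordic channel", candidates, toy_relevance, top_n=2)
print(best)
```

Retrieving a wide candidate set keeps recall high, while the reranker decides which few chunks actually reach the LLM prompt.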

In [42]:
# We will narrow the collection to 40 results and further narrow it to 5 with the reranker.
from llama_index.postprocessor.nvidia_rerank import NVIDIARerank

reranker_query_engine = index.as_query_engine(
    similarity_top_k=40, node_postprocessors=[NVIDIARerank(top_n=5)]
)

response = reranker_query_engine.query(
    "How many articles you have in the vector store? and how do you know which article content that I am asking?"
)
print(response)

I have five articles in the vector store, as can be inferred from the file paths provided in the context:

1. /Users/leechiyuan/python/nvidia-llamaindex/data/eva-air-swot-analysis.txt (mentioned 4 times)
2. https://reurl.cc/qrmWRy (TechNews科技新報, 2023)
3. https://reurl.cc/krVRpq (長榮航空111年度年報, 2023)
4. https://reurl.cc/mrqVzG (長榮航空, 2023)
5. https://reurl.cc/D4vR5j (鏡周刊, 2023)

Regarding your second question, I determine the article content based on the provided context and the specific query. I carefully extract and summarize the relevant information without directly referencing the context. This way, I can provide accurate and contextually appropriate answers to your questions.


#### Note:
 - In this notebook, we used NVIDIA NIM microservices from the NVIDIA API Catalog.
 - The classes used above, `NVIDIA` (LLM), `NVIDIAEmbedding`, and `NVIDIARerank`, also support self-hosted microservices.
 - Change the `base_url` to your deployed NIM URL.
 - Example: `NVIDIA(model="meta/llama3-8b-instruct", base_url="http://your-nim-host-address:8000/v1")`
 - NIM can be hosted locally using Docker, following the [NVIDIA NIM for LLMs](https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html) documentation.

In [None]:
# Example code snippet for using a self-hosted NIM
from llama_index.llms.nvidia import NVIDIA

llm = NVIDIA(model="meta/llama3-8b-instruct", base_url="http://your-nim-host-address:8000/v1")

In [39]:
# Example questions
"In the chinese article, what is the 'Sword Lake World' ticket price? Could you list all of them?"
"In the chinese article, instead of 'Sword Lake World' theme park. What other competitors are mentioned in the article?"

"In the chinese article, what is the 'Sword Lake World' ticket price? Could you list all of them?"

In [48]:
#from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# Step 1: Load and embed the old documents
#documents = SimpleDirectoryReader("path/to/old/documents").load_data()
#index = VectorStoreIndex.from_documents(documents)

# Step 2: Load new documents
new_documents = SimpleDirectoryReader("./new_data/").load_data()

# Step 3: Insert the new documents into the index.
for new_doc in new_documents:
    index.insert(new_doc)

# The index now contains both old and new data


In [49]:
# We will narrow the collection to 40 results and further narrow it to 5 with the reranker.
from llama_index.postprocessor.nvidia_rerank import NVIDIARerank

reranker_query_engine = index.as_query_engine(
    similarity_top_k=40, node_postprocessors=[NVIDIARerank(top_n=5)]
)

response = reranker_query_engine.query(
    "do you have the information about the tesla?"
)
print(response)

Yes, I can share some information about Tesla based on the provided context. Tesla recently unveiled two new self-driving cars that utilize their in-house FSD (Full Self Driving) technology. The first one to be showcased was the much-anticipated "Robotaxi self-driving taxi," named "Cybercab" by Elon Musk. The second vehicle introduced was a 20-person capacity self-driving shuttle bus, the "Robovan." Additionally, a group of next-generation Optimus intelligent robots made an appearance, demonstrating their abilities through dance performances and interacting with the event's attendees.

Elon Musk, CEO of Tesla, personally rode in the Cybercab on his way to the stage, showcasing the vehicle's autonomous driving capabilities and its lack of a steering wheel or pedals. The Cybercab navigated through pre-set road scenarios, displaying the technology's prowess. Musk announced that the Cybercab's price would be below $30,000 (approximately 960,000 TWD) and shared plans to begin mass productio