# RAG Example Using NVIDIA API Catalog and LlamaIndex

This notebook shows how to use LlamaIndex with NVIDIA-hosted NIM microservices, such as chat, embedding, and reranking models, to build a simple retrieval-augmented generation (RAG) application.

Alternatively, for a more interactive experience with a graphical user interface, refer to our [code](https://github.com/jayrodge/llm-assistant-cloud-app/) and [YouTube video](https://www.youtube.com/watch?v=09uDCmLzYHA) for a Gradio-based RAG Q&A reference application that also uses NVIDIA-hosted NIM microservices.

## Terminology

#### RAG

- RAG is a technique for augmenting LLM knowledge with additional data.
- LLMs can reason about wide-ranging topics, but their knowledge is limited to the public data they were trained on, up to a specific cutoff date.
- If you want to build AI applications that can reason about private data or data introduced after a model's cutoff date, you need to augment the knowledge of the model with the specific information it needs.
- The process of retrieving the appropriate information and inserting it into the model prompt is known as retrieval-augmented generation (RAG).

The preceding summary of RAG originates in the [Build a RAG App](https://python.langchain.com/v0.2/docs/tutorials/rag/) tutorial in the LangChain v0.2 documentation.

For comprehensive information, refer to the LlamaIndex documentation for [Building an LLM Application](https://docs.llamaindex.ai/en/stable/understanding/#:~:text=on%20your%20machine.-,Building%20a%20RAG%20pipeline,-%3A%20Retrieval%2DAugmented%20Generation).
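
The retrieve-then-insert flow summarized above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: word overlap stands in for a real retriever, the prompt wording and the helper names (`tokenize`, `retrieve`, `build_prompt`) are hypothetical, and no model is called.

```python
# Toy illustration of retrieval-augmented generation: pick the most relevant
# snippets for a question and insert them into the prompt sent to an LLM.
# Word overlap stands in for a real retriever; no model is called here.


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return set(cleaned.split())


def retrieve(question: str, snippets: list[str], k: int = 2) -> list[str]:
    """Return the k snippets that share the most words with the question."""
    q_words = tokenize(question)
    ranked = sorted(snippets, key=lambda s: len(q_words & tokenize(s)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Insert the retrieved context into a prompt template (hypothetical wording)."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


snippets = [
    "Bilbo Baggins is a hobbit who lives in Bag End.",
    "Smaug is a dragon who guards the treasure of the Lonely Mountain.",
    "Gandalf is a wizard who visits Bilbo.",
]
question = "Who guards the treasure of the Lonely Mountain?"
prompt = build_prompt(question, retrieve(question, snippets, k=1))
print(prompt)
```

In the rest of this notebook, LlamaIndex and the NVIDIA embedding model take over the retrieval step, and the query engine handles prompt construction.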

#### NIM

- [NIM microservices](https://developer.nvidia.com/blog/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale/) are containerized microservices that simplify the deployment of generative AI models like LLMs and are optimized to run on NVIDIA GPUs. 
- NIM microservices support models across domains like chat, embedding, reranking, and more from both the community and NVIDIA.

#### NVIDIA API Catalog

- [NVIDIA API Catalog](https://build.nvidia.com/explore/discover) is a hosted platform for accessing a wide range of microservices online.
- You can test models on the catalog and then export them with an NVIDIA AI Enterprise license for on-premises or cloud deployment.

#### LlamaIndex Concepts

 - `Data connectors` ingest your existing data from their native source and format.
 - `Data indexes` structure your data in intermediate representations that are easy and performant for LLMs to consume.
 - `Engines` provide natural language access to your data for building context-augmented LLM apps.

LlamaIndex also provides integrations such as `llama-index-llms-nvidia`, `llama-index-embeddings-nvidia`, and `llama-index-postprocessor-nvidia-rerank` to work with NVIDIA microservices.

## Installation and Requirements

Create a Python environment (preferably with Conda) using Python version 3.10.14. 
To install Jupyter Lab, refer to the [installation](https://jupyter.org/install) page.

## Getting Started!

In [1]:
# Requirements
!pip install --upgrade pip
!pip install llama-index-core==0.10.50
!pip install llama-index-readers-file==0.1.25
!pip install llama-index-llms-nvidia==0.1.3
!pip install llama-index-embeddings-nvidia==0.1.4
!pip install llama-index-postprocessor-nvidia-rerank==0.1.2
!pip install ipywidgets==8.1.3

Collecting pip
  Using cached pip-24.2-py3-none-any.whl (1.8 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 22.0.2
    Uninstalling pip-22.0.2:
      Successfully uninstalled pip-22.0.2
Successfully installed pip-24.2
Collecting llama-index-core==0.10.50
  Using cached llama_index_core-0.10.50-py3-none-any.whl.metadata (2.4 kB)
Collecting llama-cloud<0.0.7,>=0.0.6 (from llama-index-core==0.10.50)
  Using cached llama_cloud-0.0.6-py3-none-any.whl.metadata (750 bytes)


Using cached llama_index_core-0.10.50-py3-none-any.whl (15.4 MB)
Using cached llama_cloud-0.0.6-py3-none-any.whl (130 kB)
Installing collected packages: llama-cloud, llama-index-core
  Attempting uninstall: llama-cloud
    Found existing installation: llama-cloud 0.1.0
    Uninstalling llama-cloud-0.1.0:
      Successfully uninstalled llama-cloud-0.1.0
  Attempting uninstall: llama-index-core
    Found existing installation: llama-index-core 0.11.14
    Uninstalling llama-index-core-0.11.14:
      Successfully uninstalled llama-index-core-0.11.14
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
llama-index 0.11.14 requires llama-index-core<0.12.0,>=0.11.14, but you have llama-index-core 0.10.50 which is incompatible.
llama-index-agent-openai 0.3.4 requires llama-index-core<0.12.0,>=0.11.0, but you have llama-index-core 0.10.50 which is incompatible.
llama-i

Using cached llama_index_readers_file-0.1.25-py3-none-any.whl (37 kB)
Installing collected packages: llama-index-readers-file
  Attempting uninstall: llama-index-readers-file
    Found existing installation: llama-index-readers-file 0.2.2
    Uninstalling llama-index-readers-file-0.2.2:
      Successfully uninstalled llama-index-readers-file-0.2.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
llama-index 0.11.14 requires llama-index-core<0.12.0,>=0.11.14, but you have llama-index-core 0.10.50 which is incompatible.
llama-index 0.11.14 requires llama-index-readers-file<0.3.0,>=0.2.0, but you have llama-index-readers-file 0.1.25 which is incompatible.
Successfully installed llama-index-readers-file-0.1.25
Collecting llama-index-llms-nvidia==0.1.3
  Using cached llama_index_llms_nvidia-0.1.3-py3-none-any.whl.metadata (1.7 kB)
Collecting llama-i

Collecting llama-index-core<0.11.0,>=0.10.0 (from llama-index-llms-nvidia==0.1.3)
  Using cached llama_index_core-0.10.68.post1-py3-none-any.whl.metadata (2.5 kB)
Using cached llama_index_llms_nvidia-0.1.3-py3-none-any.whl (4.3 kB)
Using cached llama_index_llms_openai-0.1.31-py3-none-any.whl (12 kB)
Using cached llama_index_core-0.10.68.post1-py3-none-any.whl (1.6 MB)
Installing collected packages: llama-index-core, llama-index-llms-openai, llama-index-llms-nvidia
  Attempting uninstall: llama-index-core
    Found existing installation: llama-index-core 0.10.50
    Uninstalling llama-index-core-0.10.50:
      Successfully uninstalled llama-index-core-0.10.50
  Attempting uninstall: llama-index-llms-openai
    Found existing installation: llama-index-llms-openai 0.2.9
    Uninstalling llama-index-llms-openai-0.2.9:
      Successfully uninstalled llama-index-llms-openai-0.2.9


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
llama-index 0.11.14 requires llama-index-core<0.12.0,>=0.11.14, but you have llama-index-core 0.10.68.post1 which is incompatible.
llama-index 0.11.14 requires llama-index-llms-openai<0.3.0,>=0.2.9, but you have llama-index-llms-openai 0.1.31 which is incompatible.
llama-index 0.11.14 requires llama-index-readers-file<0.3.0,>=0.2.0, but you have llama-index-readers-file 0.1.25 which is incompatible.
llama-index-agent-openai 0.3.4 requires llama-index-core<0.12.0,>=0.11.0, but you have llama-index-core 0.10.68.post1 which is incompatible.
llama-index-agent-openai 0.3.4 requires llama-index-llms-openai<0.3.0,>=0.2.9, but you have llama-index-llms-openai 0.1.31 which is incompatible.
llama-index-cli 0.3.1 requires llama-index-core<0.12.0,>=0.11.0, but you have llama-index-core 0.10.68.post1 which is incompatible

Using cached llama_index_embeddings_nvidia-0.1.4-py3-none-any.whl (5.5 kB)
Installing collected packages: llama-index-embeddings-nvidia
Successfully installed llama-index-embeddings-nvidia-0.1.4
Collecting llama-index-postprocessor-nvidia-rerank==0.1.2
  Using cached llama_index_postprocessor_nvidia_rerank-0.1.2-py3-none-any.whl.metadata (4.1 kB)


Using cached llama_index_postprocessor_nvidia_rerank-0.1.2-py3-none-any.whl (6.4 kB)
Installing collected packages: llama-index-postprocessor-nvidia-rerank
Successfully installed llama-index-postprocessor-nvidia-rerank-0.1.2


To get started, you need an `NVIDIA_API_KEY` to use NVIDIA AI Foundation models:

1) Create a free account with [NVIDIA](https://build.nvidia.com/explore/discover).
2) Click on your model of choice.
3) Under Input, select the Python tab, click **Get API Key**, and then click **Generate Key**.
4) Copy and save the generated key as `NVIDIA_API_KEY`. From there, you should have access to the endpoints.

In [None]:
import getpass
import os

if not os.environ.get("NVIDIA_API_KEY", "").startswith("nvapi-"):
    nvidia_api_key = getpass.getpass("Enter your NVIDIA API key: ")
    assert nvidia_api_key.startswith("nvapi-"), f"{nvidia_api_key[:5]}... is not a valid key"
    os.environ["NVIDIA_API_KEY"] = nvidia_api_key

## RAG Example using LLM and Embedding

### 1) Initialize the LLM

`llama-index-llms-nvidia`, also known as NVIDIA's LLM connector,
allows you to connect to and generate from compatible models available on the NVIDIA API catalog.

Here we will use **meta/llama-3.1-405b-instruct**.

In [3]:
# Settings enables global configuration as a singleton object throughout your application.
# Here, it is used to set the LLM, embedding model, and text splitter configurations globally.
from llama_index.core import Settings
from llama_index.llms.nvidia import NVIDIA

# Here we are using the meta/llama-3.1-405b-instruct model from the API Catalog
Settings.llm = NVIDIA(model="meta/llama-3.1-405b-instruct")

[nltk_data] Downloading package punkt_tab to /home/polabs2/Code/RPG_te
[nltk_data]     acher/venv_llamaindex2/lib/python3.10/site-
[nltk_data]     packages/llama_index/core/_static/nltk_cache...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


### 2) Initialize the embedding

We selected **NV-Embed-QA** as the embedding model.

In [4]:
from llama_index.embeddings.nvidia import NVIDIAEmbedding
Settings.embed_model = NVIDIAEmbedding(model="NV-Embed-QA", truncate="END")

### 3) Obtain some toy text dataset
Here we load a toy data set from text documents. In a real application, data can be loaded from various sources.

Real-world documents can be very long, which makes them hard to fit in the context window of many models. Even models that can fit a full document in their context window can struggle to find information in very long inputs.

To handle this we’ll split the Document into chunks for embedding and vector storage.
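
To make the idea of chunking concrete, here is a minimal sketch of fixed-size splitting with overlap. It is not how `SentenceSplitter` works internally: whitespace-separated words stand in for tokens, sentence boundaries are ignored, and `split_into_chunks` and its parameters are hypothetical names chosen for this illustration.

```python
# Simplified chunking: fixed-size windows of words with a small overlap, so
# that text cut at a chunk boundary still appears intact in a neighbouring chunk.


def split_into_chunks(text: str, chunk_size: int = 8, overlap: int = 2) -> list[str]:
    """Split text into chunks of at most chunk_size words; consecutive chunks share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the window already reached the end of the text
    return chunks


text = " ".join(f"w{i}" for i in range(1, 15))  # 14 placeholder words: w1 ... w14
chunks = split_into_chunks(text, chunk_size=6, overlap=2)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last two words of the previous one, which is the role overlap plays when a relevant passage straddles a chunk boundary.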

In [13]:
# For this example we load a toy data set from a local directory of text files (the queries below ask about The Hobbit)
TOY_DATA_PATH = "/home/polabs2/Code/RPG_teacher/data/out"

from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import SimpleDirectoryReader
Settings.text_splitter = SentenceSplitter(chunk_size=400)
documents = SimpleDirectoryReader(TOY_DATA_PATH).load_data()

Note:
 - `SimpleDirectoryReader` takes care of storing basic file information such as the filename, filepath, and file type as metadata by default. This metadata can be used to keep track of the source file, allowing us to use it later for citation or metadata filtering.

### 4) Process the documents into VectorStoreIndex

In RAG, your data is loaded and prepared for queries or "indexed". User queries act on the index, which filters your data down to the most relevant context. This context and your query then go to the LLM along with a prompt, and the LLM provides a response.
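
As a rough picture of how an index "filters your data down to the most relevant context", the sketch below ranks a few stored chunks by cosine similarity to a query vector and keeps the top `k`. The three-dimensional vectors are made up for illustration (real embeddings come from the embedding model and have many more dimensions), cosine similarity is assumed as the similarity measure, and `cosine_similarity` and `top_k` are hypothetical helpers, not LlamaIndex APIs.

```python
import math

# Toy vector search: score every stored chunk against the query vector
# and keep the k highest-scoring chunks.


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], index: list[tuple[str, list[float]]], k: int) -> list[tuple[float, str]]:
    """Return (score, text) pairs for the k chunks most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up embeddings for three chunks (assumption: illustrative values only)
toy_index = [
    ("chunk about riddles", [0.9, 0.1, 0.0]),
    ("chunk about dragons", [0.1, 0.9, 0.1]),
    ("chunk about dwarves", [0.0, 0.2, 0.9]),
]
query_vec = [1.0, 0.2, 0.0]  # pretend embedding of a question about riddles

results = top_k(query_vec, toy_index, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

The `similarity_top_k` argument used later in this notebook plays the same role as `k` here: it controls how many of the best-matching chunks are passed to the LLM as context.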

In [6]:
from llama_index.core import VectorStoreIndex
# When you use from_documents, your Documents are split into chunks and parsed into Node objects
# By default, VectorStoreIndex stores everything in memory
index = VectorStoreIndex.from_documents(documents)

### 5) Create a Query Engine to ask questions over your data

In [7]:
# Returns a Query engine for this index.
query_engine = index.as_query_engine(similarity_top_k=15)

In [14]:
response = query_engine.query(
    "What was the riddle Bilbo Baggins used to win the Ring from Gollum?"
)
print(response)

The riddle Bilbo Baggins used to win the Ring from Gollum was not the one described in the passage, which was "Nolegs lay on oneleg twolegs sat near on threelegs fourlegs got some". This riddle was answered by Gollum, and then he asked a new riddle. The correct answer is still "What have I got in my pocket?"


In [9]:
response = query_engine.query(
    "Who is Bilbo Baggins?"
)
print(response)

Bilbo Baggins is a hobbit with a mix of courage and wisdom, who values simple pleasures like food, cheer, and song, and has a capacity for adventure that sets him apart from his comfort-loving family. Despite being a Baggins, he has inherited a sense of adventure and a bit of queerness from his Took side, which waited for a chance to emerge.


## RAG Example with LLM, Embedding & Reranking

In [10]:
response = query_engine.query(
    "Create three open-ended questions on web site design that are also thematically set in the Hobbit."
)
print(response)

Here are three open-ended questions on web site design that are thematically set in the Hobbit:

1. As Bilbo and the dwarves navigate the treacherous waters of the river, they come across a mysterious and ancient bridge. If you were tasked with designing a website for a company that specializes in bridge-building, how would you create a sense of stability and trust for visitors, while also showcasing the company's expertise and innovative approaches to bridge design?

2. The elves of Rivendell are known for their exceptional craftsmanship and attention to detail. If you were designing a website for an elven artisan, how would you balance the need to showcase their intricate and beautiful work with the need to create a clean and simple user experience?

3. The Lonely Mountain is a place of legend and myth, a symbol of the dwarves' rich history and culture. If you were designing a website for a museum or cultural institution dedicated to the history of the Lonely Mountain, how would you 

In [11]:
# Let's test a more complex query using the above LLM and embedding query_engine and see if the reranker can help.
response = query_engine.query("What are all the chapters in the Hobbit?")
print(response)

Here are all the chapters in the Hobbit:

1. Chapter I: AN UNEXPECTED PARTY
2. Chapter II: ROAST MUTTON
3. Chapter III: A SHORT REST
4. Chapter IV: OVER HILL AND UNDER HILL
5. Chapter V: RIDDLES IN THE DARK
6. Chapter VI: OUT OF THE FRYINGPAN INTO THE FIRE
7. Chapter VII: QUEER LODGINGS
8. Chapter VIII: FLIES AND SPIDERS
9. Chapter IX: BARRELS OUT OF BOND
10. Chapter X: A WARM WELCOME
11. Chapter XI: ON THE DOORSTEP
12. Chapter XII: INSIDE INFORMATION
13. Chapter XIII: NOT AT HOME
14. Chapter XIV: FIRE AND WATER
15. Chapter XV: THE GATHERING OF THE CLOUDS
16. Chapter XVI: A THIEF IN THE NIGHT
17. Chapter XVII: THE CLOUDS BURST
18. Chapter XVIII: THE RETURN JOURNEY
19. Chapter XIX: THE LAST STAGE


In [12]:
import pandas as pd

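# Load the chapter notes (tab-separated, with a header row)
doc_data = pd.read_csv('/home/polabs2/Code/RPG_teacher/data/chapter_summary_notes.csv', delimiter='\t', header=0)

# Check for the columns used below so a mismatch fails with a readable message
# (for example, if the file is not actually tab-separated)
required_columns = {'document', 'chapter', 'text'}
missing = required_columns - set(doc_data.columns)
if missing:
    raise KeyError(f"Missing columns {sorted(missing)}; found {list(doc_data.columns)}")

# Filter the data for the document "hobbit"
doc_data = doc_data[doc_data['document'] == 'hobbit']
print(doc_data.head(2))

# Iterate over the rows in the DataFrame
for _, row in doc_data.iterrows():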
    chapter = row['chapter']
    text = row['text']
    
    # Generate the query string
    q = f"You are a helpful book summarizing assistant. Please use the provided context to summarize the following chapter into about 10 events: chapter {chapter} {text}"
    
    # Send the query to the engine
    response = query_engine.query(q)
    
    # Print the response
    print(response)


KeyError: 'document'

### Enhancing accuracy for single data sources

This example demonstrates how a reranking model can be used to reorder retrieval results and improve the accuracy of document retrieval.

Typically, reranking is a critical piece of high-accuracy, efficient retrieval pipelines. Generally, there are two important use cases:

- Combining results from multiple data sources
- Enhancing accuracy for single data sources

Here, we focus on demonstrating only the second use case.
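
The two-stage pattern behind reranking can be sketched without any model: a cheap first stage gathers a broad candidate set, and a more careful scorer reorders the candidates and keeps only the best few. Both scoring functions below are toy stand-ins (assumptions for illustration), not what the embedding model or `NVIDIARerank` actually compute, and all helper names are hypothetical.

```python
# Toy retrieve-then-rerank pipeline: broad, cheap retrieval followed by a
# more selective second pass over the candidates.


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric words."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return cleaned.split()


def cheap_score(query: str, passage: str) -> int:
    """First stage: count occurrences of query words (tends to favour long passages)."""
    q_words = set(tokenize(query))
    return sum(word in q_words for word in tokenize(passage))


def careful_score(query: str, passage: str) -> float:
    """Second stage stand-in: fraction of the passage made up of query words."""
    words = tokenize(passage)
    return cheap_score(query, passage) / len(words) if words else 0.0


def first_stage(query: str, passages: list[str], top_k: int) -> list[str]:
    """Keep the top_k passages by the cheap score."""
    return sorted(passages, key=lambda p: cheap_score(query, p), reverse=True)[:top_k]


def rerank(query: str, candidates: list[str], top_n: int) -> list[str]:
    """Reorder the candidates by the careful score and keep the top_n."""
    return sorted(candidates, key=lambda p: careful_score(query, p), reverse=True)[:top_n]


passages = [
    "The dwarves sang of gold, and the dwarves spoke of many other things on the long road east.",
    "Thorin led the dwarves.",
    "Bilbo liked tea and cake.",
    "The dragon slept on gold.",
]
query = "Who led the dwarves?"

candidates = first_stage(query, passages, top_k=3)
best = rerank(query, candidates, top_n=1)
print(best[0])
```

In this toy data the long passage wins the first stage because it repeats query words, while the second stage promotes the short passage that actually answers the question. The `similarity_top_k` and `top_n` arguments in the next cell correspond to `top_k` and `top_n` here.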

In [None]:
# We will narrow the collection to 40 results and further narrow it to 4 with the reranker.
from llama_index.postprocessor.nvidia_rerank import NVIDIARerank

reranker_query_engine = index.as_query_engine(
    similarity_top_k=40, node_postprocessors=[NVIDIARerank(top_n=4)]
)

response = reranker_query_engine.query(
    "What are the names of all the dwarves on Bilbo's adventure?"
)
print(response)

#### Note:
 - In this notebook, we used NVIDIA NIM microservices from the NVIDIA API Catalog.
 - The classes used above, `NVIDIA` (LLMs), `NVIDIAEmbedding`, and `NVIDIARerank`, also support self-hosted microservices.
 - To use one, change the `base_url` to your deployed NIM URL.
 - Example: `NVIDIA(model="meta/llama3-8b-instruct", base_url="http://your-nim-host-address:8000/v1")`
 - NIM can be hosted locally using Docker, following the [NVIDIA NIM for LLMs](https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html) documentation.

In [None]:
# Example Code snippet if you want to use a self-hosted NIM
from llama_index.llms.nvidia import NVIDIA

llm = NVIDIA(model="meta/llama3-8b-instruct", base_url="http://your-nim-host-address:8000/v1")