# RAG Example Using NVIDIA API Catalog and LlamaIndex

This notebook introduces how to use LlamaIndex to interact with NVIDIA hosted NIM microservices like chat, embedding, and reranking models to build a simple retrieval-augmented generation (RAG) application.

Alternatively, for a more interactive experience with a graphical user interface, you can refer to our [code](https://github.com/jayrodge/llm-assistant-cloud-app/) and [YouTube video](https://www.youtube.com/watch?v=09uDCmLzYHA) for a Gradio-based RAG Q&A reference application that also uses NVIDIA hosted NIM microservices.

## Terminology

#### RAG

- RAG is a technique for augmenting LLM knowledge with additional data.
- LLMs can reason about wide-ranging topics, but their knowledge is limited to the public data up to a specific point in time that they were trained on.
- If you want to build AI applications that can reason about private data or data introduced after a model's cutoff date, you need to augment the knowledge of the model with the specific information it needs.
- The process of bringing the appropriate information and inserting it into the model prompt is known as retrieval augmented generation (RAG).

The preceding summary of RAG originates in the [Build a RAG App](https://python.langchain.com/v0.2/docs/tutorials/rag/) tutorial in the LangChain v0.2 documentation.

For comprehensive information, refer to the LlamaIndex documentation for [Building an LLM Application](https://docs.llamaindex.ai/en/stable/understanding/#:~:text=on%20your%20machine.-,Building%20a%20RAG%20pipeline,-%3A%20Retrieval%2DAugmented%20Generation).
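
As a rough illustration of the retrieve-then-insert idea summarized above, the following sketch ranks a few passages against a question and pastes the best ones into a prompt. It is a minimal sketch in plain Python, not how LlamaIndex implements retrieval: the word-overlap `score` function, `build_rag_prompt`, and the sample passages are all hypothetical stand-ins for an embedding model, a vector index, and real data.

```python
import re


def score(query: str, passage: str) -> int:
    """Count the distinct words shared by the query and the passage (toy relevance score)."""
    query_words = set(re.findall(r"\w+", query.lower()))
    passage_words = set(re.findall(r"\w+", passage.lower()))
    return len(query_words & passage_words)


def build_rag_prompt(query: str, passages: list[str], top_k: int = 2) -> str:
    """Pick the top_k most relevant passages and insert them into the prompt as context."""
    ranked = sorted(passages, key=lambda p: score(query, p), reverse=True)
    context = "\n".join(f"- {p}" for p in ranked[:top_k])
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}\n"
    )


# Hypothetical sample data standing in for private documents.
passages = [
    "The quarterly report was published in March.",
    "The office cafeteria serves lunch at noon.",
    "The March report shows revenue grew by ten percent.",
]
prompt = build_rag_prompt("What does the March report show about revenue?", passages)
print(prompt)
```

The prompt, not the model weights, carries the new knowledge: only the passages that score highest against the question are sent to the LLM.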

#### NIM

- [NIM microservices](https://developer.nvidia.com/blog/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale/) are containerized microservices that simplify the deployment of generative AI models like LLMs and are optimized to run on NVIDIA GPUs. 
- NIM microservices support models across domains like chat, embedding, reranking, and more from both the community and NVIDIA.

#### NVIDIA API Catalog

- [NVIDIA API Catalog](https://build.nvidia.com/explore/discover) is a hosted platform for accessing a wide range of microservices online.
- You can test models on the catalog and then export them with an NVIDIA AI Enterprise license for on-premises or cloud deployment.

#### LlamaIndex Concepts

 - `Data connectors` ingest your existing data from their native source and format.
 - `Data indexes` structure your data in intermediate representations that are easy and performant for LLMs to consume.
 - `Engines` provide natural language access to your data for building context-augmented LLM apps.

LlamaIndex also provides integrations like `llms-nvidia`, `embeddings-nvidia` & `nvidia-rerank` to work with NVIDIA microservices.

## Installation and Requirements

Create a Python environment (preferably with Conda) using Python version 3.10.14. 
To install Jupyter Lab, refer to the [installation](https://jupyter.org/install) page.

## Getting Started!

In [1]:
# Requirements
!pip install --upgrade pip
!pip install llama-index-core==0.10.50
!pip install llama-index-readers-file==0.1.25
!pip install llama-index-llms-nvidia==0.1.3
!pip install llama-index-embeddings-nvidia==0.1.4
!pip install llama-index-postprocessor-nvidia-rerank==0.1.2
!pip install ipywidgets==8.1.3

Collecting pip
  Using cached pip-24.2-py3-none-any.whl (1.8 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 22.0.2
    Uninstalling pip-22.0.2:
      Successfully uninstalled pip-22.0.2
Successfully installed pip-24.2
Collecting llama-index-core==0.10.50
  Downloading llama_index_core-0.10.50-py3-none-any.whl.metadata (2.4 kB)
Collecting SQLAlchemy>=1.4.49 (from SQLAlchemy[asyncio]>=1.4.49->llama-index-core==0.10.50)
  Downloading SQLAlchemy-2.0.35-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.6 kB)
Collecting dataclasses-json (from llama-index-core==0.10.50)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting deprecated>=1.2.9.3 (from llama-index-core==0.10.50)
  Downloading Deprecated-1.2.14-py2.py3-none-any.whl.metadata (5.4 kB)
Collecting dirtyjson<2.0.0,>=1.0.8 (from llama-index-core==0.10.50)
  Downloading dirtyjson-1.0.8-py3-none-any.whl.metadata (11 kB)
Collecting lla

To get started, you need an `NVIDIA_API_KEY` to use NVIDIA AI Foundation models:

1) Create a free account with [NVIDIA](https://build.nvidia.com/explore/discover).
2) Click on your model of choice.
3) Under Input select the Python tab, and click **Get API Key** and then click **Generate Key**.
4) Copy and save the generated key as `NVIDIA_API_KEY`. From there, you should have access to the endpoints.

In [3]:
import getpass
import os

if not os.environ.get("NVIDIA_API_KEY", "").startswith("nvapi-"):
    nvidia_api_key = getpass.getpass("Enter your NVIDIA API key: ")
    assert nvidia_api_key.startswith("nvapi-"), f"{nvidia_api_key[:5]}... is not a valid key"
    os.environ["NVIDIA_API_KEY"] = nvidia_api_key

## RAG Example using LLM and Embedding

### 1) Initialize the LLM

`llama-index-llms-nvidia`, also known as NVIDIA's LLM connector,
allows you to connect to and generate from compatible models available on the NVIDIA API catalog.

Here we will use **meta/llama-3.1-405b-instruct**.

In [4]:
# Settings enables global configuration as a singleton object throughout your application.
# Here, it is used to set the LLM, embedding model, and text splitter configurations globally.
from llama_index.core import Settings
from llama_index.llms.nvidia import NVIDIA

# Here we are using the meta/llama-3.1-405b-instruct model from the API Catalog
Settings.llm = NVIDIA(model="meta/llama-3.1-405b-instruct")

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/polabs2/venv_nvidia_llamaindex/lib/python3.10/si
[nltk_data]     te-packages/llama_index/core/_static/nltk_cache...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


### 2) Initialize the embedding

We selected **NV-Embed-QA** as the embedding model.

In [5]:
from llama_index.embeddings.nvidia import NVIDIAEmbedding
Settings.embed_model = NVIDIAEmbedding(model="NV-Embed-QA", truncate="END")

### 3) Obtain a toy text dataset
Here we load toy data from text documents; in a real application, data can be loaded from various sources.

Real-world documents can be very long, which makes them hard to fit in the context window of many models. Even when a model can fit the full document in its context window, it can struggle to find information in very long inputs.

To handle this, we’ll split the Document into chunks for embedding and vector storage.
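
To make the chunking step concrete, here is a minimal sketch of sentence-aware splitting in plain Python. It is not the `SentenceSplitter` implementation: the `split_into_chunks` helper is hypothetical, it counts words rather than tokens, and it ignores chunk overlap. It only shows the basic idea of packing whole sentences into size-limited chunks.

```python
import re


def split_into_chunks(text: str, max_words: int = 12) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words is kept intact as its own chunk.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and current_len + n_words > max_words:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


# Hypothetical sample text.
text = (
    "Hobbits live in holes. They enjoy quiet lives. "
    "Adventures make them late for dinner. A wizard may change that."
)
chunks = split_into_chunks(text, max_words=8)
for chunk in chunks:
    print(repr(chunk))
```

Keeping sentences whole means each chunk stays readable on its own, which tends to matter more for retrieval quality than hitting the size limit exactly.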

In [7]:
# For this example we load a toy data set from a local directory of text files (the queries below ask about The Hobbit)
TOY_DATA_PATH = "/home/polabs2/Code/RPG_teacher/data/out"

from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import SimpleDirectoryReader
Settings.text_splitter = SentenceSplitter(chunk_size=400)
documents = SimpleDirectoryReader(TOY_DATA_PATH).load_data()

Note:
 - `SimpleDirectoryReader` takes care of storing basic file information such as the filename, filepath, and file type as metadata by default. This metadata can be used to keep track of the source file, allowing us to use it later for citation or metadata filtering.
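
The sketch below illustrates how per-file metadata could support filtering and citation. It uses plain dictionaries rather than LlamaIndex `Document` objects, and the file names, metadata keys, sample text, and the `filter_by_metadata` and `cite` helpers are all assumptions made for illustration.

```python
# Hypothetical chunks, each carrying the kind of file metadata described above.
sample_chunks = [
    {
        "text": "In a hole in the ground there lived a hobbit.",
        "metadata": {"file_name": "chapter_01.txt", "file_type": "text/plain"},
    },
    {
        "text": "The trolls argued until dawn turned them to stone.",
        "metadata": {"file_name": "chapter_02.txt", "file_type": "text/plain"},
    },
]


def filter_by_metadata(chunks, **required):
    """Keep only chunks whose metadata matches every required key/value pair."""
    return [
        chunk
        for chunk in chunks
        if all(chunk["metadata"].get(key) == value for key, value in required.items())
    ]


def cite(chunk):
    """Append the source file name to a chunk's text as a simple citation."""
    return f'{chunk["text"]} [source: {chunk["metadata"]["file_name"]}]'


matches = filter_by_metadata(sample_chunks, file_name="chapter_02.txt")
for chunk in matches:
    print(cite(chunk))
```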

### 4) Process the documents into VectorStoreIndex

In RAG, your data is loaded and prepared for queries or "indexed". User queries act on the index, which filters your data down to the most relevant context. This context and your query then go to the LLM along with a prompt, and the LLM provides a response.
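
The way an index filters data down to the most relevant context can be sketched with a toy in-memory vector index. This is a simplified illustration under stated assumptions, not the `VectorStoreIndex` implementation: the bag-of-words `embed` function stands in for a real embedding model such as the one configured earlier, and `ToyVectorIndex` and the sample chunks are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a real embedding model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorIndex:
    """Stores (chunk, vector) pairs in memory and retrieves by similarity."""

    def __init__(self, chunks: list[str]):
        self.entries = [(chunk, embed(chunk)) for chunk in chunks]

    def retrieve(self, query: str, top_k: int = 2) -> list[tuple[float, str]]:
        """Return the top_k (score, chunk) pairs most similar to the query."""
        query_vec = embed(query)
        scored = [(cosine(query_vec, vec), chunk) for chunk, vec in self.entries]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]


# Hypothetical sample chunks.
toy_chunks = [
    "Bilbo found a ring in the dark tunnel.",
    "The dwarves sang songs about gold.",
    "Gandalf arrived at Bag End in spring.",
]
toy_index = ToyVectorIndex(toy_chunks)
for similarity, chunk in toy_index.retrieve("Who found the ring?", top_k=2):
    print(f"{similarity:.3f}  {chunk}")
```

The `top_k` argument plays the same role as `similarity_top_k` on a query engine: it caps how many retrieved chunks are passed to the LLM as context.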

In [8]:
from llama_index.core import VectorStoreIndex
# When you use from_documents, your Documents are split into chunks and parsed into Node objects
# By default, VectorStoreIndex stores everything in memory
index = VectorStoreIndex.from_documents(documents)

### 5) Create a Query Engine to ask questions over your data

In [9]:
# Returns a Query engine for this index.
query_engine = index.as_query_engine(similarity_top_k=15)

In [11]:
response = query_engine.query(
    "What was the riddle Bilbo Baggins used to win the Ring from Gollum?"
)
print(response)

"What have I got in my pocket?"


In [12]:
response = query_engine.query(
    "Who is Bilbo Baggins?"
)
print(response)

Bilbo Baggins is a hobbit who finds himself on a grand adventure with a group of dwarves, navigating treacherous landscapes and encountering strange creatures. He is a resourceful and clever individual who relies on his wits to overcome challenges, and has gained a reputation as a skilled burglar among the dwarves. Despite his initial reluctance to embark on adventures, Bilbo has proven himself to be a valuable companion to the dwarves and has earned the respect of their leader, Thorin.


## RAG Example with LLM, Embedding & Reranking

In [14]:
response = query_engine.query(
    "Create three open-ended questions on web site design that are also thematically set in the Hobbit."
)
print(response)

Here are three open-ended questions on web site design that are thematically set in the Hobbit:

1. If you were tasked with designing a website for the spider colony, how would you create a navigation system that mimics the intricate webs spun by the spiders, while also ensuring that users can easily find the information they need without getting caught in a virtual snare?

2. Imagine you are designing a website for Beorn's Hall, where visitors can learn about the art of beekeeping and the importance of protecting the natural world. How would you use visual elements and interactive features to convey the sense of a thriving ecosystem, while also providing a clear and easy-to-use interface for users?

3. If you were designing a website for the dwarves of Erebor, how would you balance the need to showcase their rich history and cultural heritage with the need to create a modern and functional website that appeals to a wide range of users, from hobbits to elves to humans?


In [15]:
# Let's test a more complex query using the LLM and embedding query_engine above, then see later whether the reranker can help.
response = query_engine.query("What are all the chapters in the Hobbit?")
print(response)

Here are the chapters in the Hobbit:

1. Chapter I: An Unexpected Party
2. Chapter II: Roast Mutton
3. Chapter III: A Short Rest
4. Chapter IV: Over Hill and Under Hill
5. Chapter V: Riddles in the Dark
6. Chapter VI: Out of the Frying-Pan into the Fire
7. Chapter VII: Queer Lodgings
8. Chapter VIII: Flies and Spiders
9. Chapter IX: Barrels out of Bond
10. Chapter X: A Warm Welcome
11. Chapter XI: On the Doorstep
12. Chapter XII: Inside Information
13. Chapter XIII: Not at Home
14. Chapter XIV: Fire and Water
15. Chapter XV: The Gathering of the Clouds
16. Chapter XVI: A Thief in the Night
17. Chapter XVII: The Clouds Burst
18. Chapter XVIII: The Return Journey
19. Chapter XIX: The Last Stage 

Note: The new context wasn't useful in refining the answer, but I was able to pull more information from the provided text and give a more complete answer.


In [41]:
import pandas as pd

# Load the data
doc_data = pd.read_csv('/home/polabs2/Code/RPG_teacher/data/chapter_summary_notes.csv', delimiter='\t', header=0)

# Filter the data for the document "hobbit"
doc_data = doc_data[doc_data['document'] == 'hobbit']
print(doc_data.head(2))

# Iterate over the rows in the DataFrame
for i, row in doc_data.iterrows():
    chapter = row['chapter']
    text = row['text']
    
    # Generate the query string
    q = f"You are a helpful book summarizing assistant. Please use the provided context to summarize the following chapter into about 10 events: chapter {chapter} {text}"
    
    # Send the query to the engine
    response = query_engine.query(q)
    
    # Print the response
    print(response)


    document_type document  chapter     data_type  \
34  fantasy_novel   hobbit      1.0  chapter_name   
35  fantasy_novel   hobbit      2.0  chapter_name   
36  fantasy_novel   hobbit      3.0  chapter_name   
37  fantasy_novel   hobbit      4.0  chapter_name   
38  fantasy_novel   hobbit      5.0  chapter_name   
39  fantasy_novel   hobbit      6.0  chapter_name   
40  fantasy_novel   hobbit      7.0  chapter_name   
41  fantasy_novel   hobbit      8.0  chapter_name   
42  fantasy_novel   hobbit      9.0  chapter_name   
43  fantasy_novel   hobbit     10.0  chapter_name   
44  fantasy_novel   hobbit     11.0  chapter_name   
45  fantasy_novel   hobbit     12.0  chapter_name   
46  fantasy_novel   hobbit     13.0  chapter_name   
47  fantasy_novel   hobbit     14.0  chapter_name   
48  fantasy_novel   hobbit     15.0  chapter_name   
49  fantasy_novel   hobbit     16.0  chapter_name   
50  fantasy_novel   hobbit     17.0  chapter_name   
51  fantasy_novel   hobbit     18.0  chapter_n

KeyboardInterrupt: 

### Enhancing accuracy for single data sources

This example demonstrates how a re-ranking model can be used to combine retrieval results and improve accuracy during retrieval of documents.

Typically, reranking is a critical piece of high-accuracy, efficient retrieval pipelines. Generally, there are two important use cases:

- Combining results from multiple data sources
- Enhancing accuracy for single data sources

Here, we focus on demonstrating only the second use case.
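
The two-stage pattern behind this use case can be sketched in plain Python: a cheap first stage retrieves a wide candidate set, and a more precise second stage reorders it and keeps only the best few. Both scoring functions below are hypothetical stand-ins (word overlap for embedding similarity, shared word pairs for a reranking model), and the passages are invented sample data.

```python
import re


def tokens(text: str) -> list[str]:
    return re.findall(r"\w+", text.lower())


def first_stage_score(query: str, passage: str) -> int:
    """Cheap, recall-oriented score: number of distinct shared words."""
    return len(set(tokens(query)) & set(tokens(passage)))


def rerank_score(query: str, passage: str) -> int:
    """Stand-in for a reranking model: number of shared adjacent word pairs,
    so word order matters and loose keyword matches score lower."""
    q, p = tokens(query), tokens(passage)
    return len(set(zip(q, q[1:])) & set(zip(p, p[1:])))


def retrieve_then_rerank(query, passages, top_k=3, top_n=1):
    """Retrieve top_k candidates with the cheap score, then keep the top_n after reranking."""
    candidates = sorted(passages, key=lambda p: first_stage_score(query, p), reverse=True)[:top_k]
    return sorted(candidates, key=lambda p: rerank_score(query, p), reverse=True)[:top_n]


# Hypothetical sample passages.
sample_passages = [
    "The old dwarves spoke of names and the mountain.",
    "Thorin read out the names of the dwarves one by one.",
    "Bilbo packed his pocket handkerchief.",
    "The elves sang in the valley.",
]
query = "the names of the dwarves"
best = retrieve_then_rerank(query, sample_passages, top_k=3, top_n=1)
print(best)
```

In this toy data the first two passages tie on the first-stage score, and only the second stage separates them. That is the reason for retrieving more candidates than you finally need: it gives the reranker room to promote a passage the first stage ranked too low.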

In [13]:
# We will narrow the collection to 40 results and further narrow it to 4 with the reranker.
from llama_index.postprocessor.nvidia_rerank import NVIDIARerank

reranker_query_engine = index.as_query_engine(
    similarity_top_k=40, node_postprocessors=[NVIDIARerank(top_n=4)]
)

response = reranker_query_engine.query(
    "What are the names of all the dwarves on Bilbo's adventure?"
)
print(response)

The names of the dwarves on Bilbo's adventure are:

1. Thorin
2. Balin
3. Dwalin
4. Fili
5. Kili
6. Dori
7. Nori
8. Ori
9. Oin
10. Gloin
11. Bifur
12. Bofur
13. Bombur


#### Note:
 - In this notebook, we used NVIDIA NIM microservices from the NVIDIA API Catalog.
 - The classes used above, `NVIDIA` (LLM), `NVIDIAEmbedding`, and `NVIDIARerank`, also support self-hosted microservices.
 - Change the `base_url` to your deployed NIM URL.
 - Example: `NVIDIA(model="meta/llama3-8b-instruct", base_url="http://your-nim-host-address:8000/v1")`
 - NIM can be hosted locally using Docker, following the [NVIDIA NIM for LLMs](https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html) documentation.

In [None]:
# Example Code snippet if you want to use a self-hosted NIM
from llama_index.llms.nvidia import NVIDIA

llm = NVIDIA(model="meta/llama3-8b-instruct", base_url="http://your-nim-host-address:8000/v1")