## Creating a RAG application with LLM, Embedding model, and Vector DataBase hosted locally

This notebook showcases how to create a Retrieval Augmented Generation (RAG) application where the LLM, embedding model, and Vector DataBase (VDB) are deployed locally.

We will be using NVIDIA NIM microservices to locally host the [Llama3-8B-instruct model](https://docs.nvidia.com/nim/large-language-models/latest/introduction.html). The notebook connects to the microservice through the [LangChain NVIDIA AI Endpoints](https://python.langchain.com/docs/integrations/chat/nvidia_ai_endpoints/) package.

To create embeddings from your proprietary documents, we will use an embedding model hosted on [Hugging Face](https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5), loaded through LangChain's [Hugging Face plugin](https://python.langchain.com/v0.2/docs/integrations/platforms/huggingface/).

Lastly for storing the embeddings, we will be using the Facebook AI Similarity Search (FAISS) plugin available in [LangChain](https://python.langchain.com/v0.2/docs/integrations/vectorstores/faiss/).

This notebook is divided into two parts: 
1. In the first part we will showcase how to create embeddings from your documents and store them in a VDB.
2. In the second part we will orchestrate the RAG application using the LangChain framework and create a simple Gradio-based UI to interact with this application.

### API Key generation:

Before we get started, generate the API keys needed to use the model from the NVIDIA NIM microservice and to download the embedding model from Hugging Face.

**To generate 'NVIDIA_API_KEY' for NVIDIA NIM microservice:**

1. Create a free account with [NVIDIA](https://build.nvidia.com/explore/discover).
2. Click on your model of choice.
3. Under Input select the Python tab, and click **Get API Key** and then click **Generate Key**.
4. Copy and save the generated key as NVIDIA_API_KEY. From there, you should have access to the endpoints.

**To generate 'ACCESS_TOKENS' for Hugging Face:**

1. Log in to [Hugging Face Hub](https://huggingface.co/)
2. Click your profile icon in the top right corner
3. Click Settings from the drop-down list
4. Click Access Tokens in the left-hand navigation panel
5. Click **New token**
6. Enter a name for your token and select a role
7. Click **Generate token**
8. Click Copy to copy the token to your clipboard
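
One way to make both keys available to the rest of the notebook is through environment variables. The sketch below assumes the NVIDIA key is exported as `NVIDIA_API_KEY` and the Hugging Face token as `HF_TOKEN`; both variable names and the helper `require_env` are assumptions made for illustration, so adapt them to your own setup.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Example usage (assumed variable names, uncomment once the keys are exported):
# nvidia_key = require_env("NVIDIA_API_KEY")
# hf_token = require_env("HF_TOKEN")
```

Failing early with a clear message is usually easier to debug than an authentication error raised later, during model download or inference.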


## Getting Started!

Install all the prerequisite libraries needed to orchestrate the chat application with LangChain.

In [1]:
!pip install langchain
!pip install langchain-core
!pip install langchain-community
!pip install langchain-huggingface
!pip install pypdf
!pip install transformers
!pip install faiss-gpu
!pip install gradio
!pip install langchain-nvidia-ai-endpoints

!pip install sentence-transformers

## Create Vector Data Base (VDB) using your Proprietary Documents (PART 1)
<div style="text-align:center">
  <img src="./data/imgs/offline.png" alt="Alternative text" />
</div>

In this section, we will create vector embeddings from your documents and store them in a vector database. We will follow the steps listed below:

1. Download the embedding model from Hugging Face using the LangChain plugin.
2. Parse all the PDF documents in a folder and break them into text chunks.
3. Pass the chunks to the embedding model to create embeddings.
4. Save the generated embeddings into a FAISS vector database.


### Import all the required libraries

In [2]:
import torch
import os
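# Import from the packages installed above; the older `langchain.*` paths are deprecated
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS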
import faiss
import warnings
warnings.filterwarnings('ignore')

### Embedding model and global variables

We provide the name of the embedding model and the location where the VDB will be stored. We also initialize a few global variables.

In [3]:
model_name = "Alibaba-NLP/gte-large-en-v1.5"
vectorDB_name = "./papers.db"

In [4]:
# Initialize global variables
vectorstore = None
embeddings = None

### Download embedding model

In this function, we download the embedding model from Hugging Face using the Hugging Face plugin in LangChain. We also check whether a GPU is available and pass the selected device through both `encode_kwargs` and `model_kwargs`.

In [5]:
def dowload_embedding_model(model_name):
    global embeddings
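    # Use a GPU if available: "cuda:1" on multi-GPU machines (as in the original setup),
    # "cuda:0" when only one GPU is present, otherwise fall back to the CPU
    if torch.cuda.device_count() > 1:
        device = "cuda:1"
    else:
        device = "cuda:0" if torch.cuda.is_available() else "cpu"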
    # pick the embedding model from huggingface
    encode_kwargs = {
                        "device": device, 
                        "normalize_embeddings": True
                    }
    model_kwargs =  {
                        "device": device,
                        "trust_remote_code":True
                    }
    # Create a custom HuggingFaceEmbeddings instance
    embeddings = HuggingFaceEmbeddings(
        model_name=model_name,
        model_kwargs=model_kwargs,
        encode_kwargs=encode_kwargs
    )
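
The `normalize_embeddings` flag matters for retrieval. Once vectors have unit length, their dot product equals their cosine similarity, and the squared L2 distance becomes `2 - 2 * cosine`, so an index that ranks by L2 distance returns the same ordering as one that ranks by cosine similarity. The following is a small pure-Python check of that identity; the helper names are illustrative and not part of LangChain or FAISS.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])
cosine = dot(a, b)  # for unit vectors the dot product is the cosine similarity

# For unit vectors: ||a - b||^2 == 2 - 2 * cos(a, b)
assert abs(squared_l2(a, b) - (2 - 2 * cosine)) < 1e-12
```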

### Document processing

In this function, we load the document and use the text splitter to split it into chunks. The `chunk_size` and `chunk_overlap` parameters set the maximum number of characters per chunk (the splitter measures length in characters by default, not tokens) and how many characters consecutive chunks share.

In [6]:
def process_pdf(pdf_path):
    # Load PDF using PyPDFLoader
    loader = PyPDFLoader(pdf_path)
    pages = loader.load()
    
    # Split text into chunks using RecursiveCharacterTextSplitter
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    chunks = text_splitter.split_documents(pages)
    return chunks
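
To build intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified sliding-window splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraphs and sentences); it only illustrates how overlap makes neighboring chunks share text. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining tail is already covered by the previous window
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, which is the same reason the notebook uses an overlap of 200: a sentence cut at a chunk boundary still appears whole in one of the neighboring chunks.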

### Loading and processing 

In this function, we iterate over the files in the selected directory and break every PDF document into chunks using the `process_pdf` function. We also call the `dowload_embedding_model` function to download the embedding model.

The document chunks and embedding model are passed to the FAISS plugin to create a Vector Database (VDB).

In [7]:
def load_documents(directory):
    global vectorstore, embeddings, model_name, vectorDB_name
    try:
        # Process all PDFs in the directory
        all_chunks = []
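        for file in directory:
            # Gradio passes either file paths (str) or temp-file objects, depending on version
            path = file if isinstance(file, str) else file.name
            if path.endswith(".pdf"):
                print(path)
                chunks = process_pdf(path)
                all_chunks.extend(chunks)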
                
        print(f'number of chunks: {len(all_chunks)}')
        dowload_embedding_model(model_name)
        # Create FAISS index and vector embeddings for chunks of data
        vectorstore = FAISS.from_documents(all_chunks, embeddings)
        # Save the index
        vectorstore.save_local(vectorDB_name)
        return "Successfully loaded documents."
    except Exception as e:
        return f"Error loading documents: {str(e)}"
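
`load_documents` receives its file list from the Gradio UI. To build the VDB without the UI, the PDF paths can be collected from disk directly. The sketch below uses only the standard library; `find_pdfs` is a hypothetical helper, not part of the notebook's original code.

```python
from pathlib import Path


def find_pdfs(directory):
    """Return sorted paths of all PDF files under directory (searched recursively)."""
    root = Path(directory)
    if not root.is_dir():
        raise NotADirectoryError(str(directory))
    return sorted(
        str(p) for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Each returned path can then be passed to `process_pdf` to obtain its chunks.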

### Load Vector DataBase (VDB)

In this function we load the saved VDB from disk and move its FAISS index to the GPU.

In [8]:
def load_vector_database(vectorDB_name):
    global vectorstore, embeddings
    vectorstore = FAISS.load_local(vectorDB_name, embeddings, allow_dangerous_deserialization=True)
    # Move the index to GPU
    res = faiss.StandardGpuResources()
    gpu_index = faiss.index_cpu_to_gpu(res, 0, vectorstore.index)
    vectorstore.index = gpu_index

### Test Vector DataBase (VDB)

We use this function to test the embeddings generated and stored in the VDB.

In [9]:
def test_vdb():
    global vectorstore, embeddings
    retriever = vectorstore.as_retriever()
    query = "What is dora ?"
    # `invoke` replaces the deprecated `get_relevant_documents`
    results = retriever.invoke(query)
    print(f"Number of retrieved documents: {len(results)}")
    for doc in results:
        print(doc.page_content[:100])  # Print first 100 characters of each document

In [10]:
# load_vector_database(vectorDB_name)
# test_vdb()
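
Conceptually, the retriever embeds the query and returns the chunks whose embeddings are closest to it. The toy example below does this by brute force over a handful of hand-written unit vectors. It is only a sketch of the idea, with a hypothetical `top_k` helper, and does not reflect how FAISS organizes or searches its index internally.

```python
def top_k(query_vec, doc_vecs, k=2):
    """Return indices of the k document vectors with the highest dot product."""
    scored = [
        (sum(q * d for q, d in zip(query_vec, vec)), idx)
        for idx, vec in enumerate(doc_vecs)
    ]
    # Highest score first; ties broken by the lower index
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


# Three toy "chunk embeddings" (already unit length) and one query
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
query_vec = [0.0, 1.0]
print(top_k(query_vec, doc_vecs))  # [1, 2]
```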

## Create Interactive Chat application (PART 2)

<div style="text-align:center">
  <img src="./data/imgs/online.png" alt="Alternative text" />
</div>

In this section, we will orchestrate the chat application using LangChain.

1. Before we move ahead, deploy the `Llama3-8b-instruct` model using [NVIDIA NIM](https://build.nvidia.com/meta/llama3-70b). Select the `docker` tab and follow the instructions to deploy it locally.
2. We will connect to the VDB to retrieve the relevant document chunks based on the user query.
3. Using the prompt and the retrieved document chunks, the LLM will generate a response for the user.
4. Create a simple Gradio-based UI to interact with this chat application
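
Before wiring the model into LangChain, it can help to confirm that the local NIM container is reachable. The sketch below assumes the microservice exposes an OpenAI-compatible `/v1/models` route on `http://localhost:8000`, the same base URL the chat function in this notebook uses; the helper names are hypothetical, and the response layout (`data` entries with an `id` field) is likewise an assumption about that API.

```python
import json
import urllib.request


def models_url(base_url: str) -> str:
    """Build the model-listing URL from an OpenAI-style base URL."""
    return base_url.rstrip("/") + "/models"


def list_served_models(base_url: str, timeout: float = 5.0):
    """Return the model ids reported by the server (performs a network call)."""
    with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]


# Example usage, once the container is running:
# print(list_served_models("http://localhost:8000/v1"))
```

If the call fails with a connection error, the container is most likely still starting or is listening on a different port.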

### Import all the required libraries

In [11]:
import gradio as gr
import torch
from langchain_nvidia_ai_endpoints import ChatNVIDIA
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain

### Prompt template

In this function, we build a prompt with LangChain's `ChatPromptTemplate`. The template has two placeholders: `{context}` for the retrieved document chunks and `{input}` for the question asked by the user.

In [12]:
def prompt_template():
    prompt = ChatPromptTemplate.from_template("""
    Answer the following question based on the given context:
    <context>
    {context}
    </context>
    Question: {input}
    """)
    return prompt
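
To see what the model eventually receives, the following pure-Python analogue fills the same template by hand. It assumes the retrieved chunks are joined with a blank line before being substituted into `{context}`, which is our understanding of how a "stuff" documents chain behaves; `stuff_prompt` and the sample chunks are made up for illustration.

```python
TEMPLATE = """Answer the following question based on the given context:
<context>
{context}
</context>
Question: {input}
"""


def stuff_prompt(chunks, question, separator="\n\n"):
    """Join the retrieved chunks and substitute them into the prompt template."""
    context = separator.join(chunks)
    return TEMPLATE.format(context=context, input=question)


prompt_text = stuff_prompt(
    ["First retrieved chunk.", "Second retrieved chunk."],
    "What do the chunks say?",
)
print(prompt_text)
```

Because every retrieved chunk is pasted into a single prompt, the number of chunks and their size must stay within the context window of the LLM.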

### Response to user query

In this function, we connect to the LLM using the `ChatNVIDIA` plugin and to the VDB using a retriever object. The LLM, VDB retriever, and prompt template are combined into a retrieval chain that generates the response to the user query.

In [13]:
def chat_response(message, history):
    global vectorstore, embeddings

    llm = ChatNVIDIA(base_url="http://localhost:8000/v1", model="meta/llama3-8b-instruct")
    prompt = prompt_template()
    print(message)
    try:
        document_chain = create_stuff_documents_chain(llm, prompt)
        retriever = vectorstore.as_retriever()
        retrieval_chain = create_retrieval_chain(retriever, document_chain)
        response = retrieval_chain.invoke({"input": message})
        print(response["answer"])
        # Return the complete response as part of the chat history
        return history + [(message, response["answer"])]
    except Exception as e:
        return history + [(message, f"Error processing query: {str(e)}")]

### UI to interact with chat application

In this block, we create a simple UI using Gradio to interact with the application. The `Select folder ...` control lets you choose the folder where your documents are located. The `Load Documents` button loads the document files in the selected folder and generates embeddings for those documents.

Once the documents are processed, you can ask questions about them in the `Enter your question` text box. The generated response is displayed in the chat window, which keeps the history of past queries.

In [14]:
with gr.Blocks() as demo:
    gr.Markdown("# RAG Q&A Chat Application")
    with gr.Row():
        folder_input = gr.File(file_count="directory", label="Select folder ... ")
        load_btn = gr.Button("Load Documents")
    
    load_output = gr.Textbox(label="Load Status")
    
    chatbot = gr.Chatbot()
    msg = gr.Textbox(label="Enter your question", interactive=True)
    clear = gr.Button("Clear")

    load_btn.click(load_documents, inputs=[folder_input], outputs=[load_output])
    msg.submit(chat_response, inputs=[msg, chatbot], outputs=[chatbot])
    msg.submit(lambda: "", outputs=[msg])  # Clear input box after submission
    clear.click(lambda: None, None, chatbot, queue=False)

In [None]:
if __name__ == "__main__":
    dowload_embedding_model(model_name)
    load_vector_database(vectorDB_name)
    demo.launch(share=True)