Copyright (c) 2024 Habana Labs, Ltd. an Intel Company.
SPDX-License-Identifier: Apache-2.0


# Retrieval Augmented Generation (RAG) application using Intel&reg; Gaudi&reg; 2 AI Processor

A scalable Retrieval Augmented Generation (RAG) application built with Hugging Face tools, as a way of deploying optimized applications on the Intel Gaudi 2 accelerator.

### Introduction
This tutorial shows how to build a RAG application on Intel Gaudi 2. The application is built from readily available Hugging Face tools: text-generation-inference (TGI) and text-embeddings-inference (TEI). LangChain is used to keep the code easy to follow. The user interface at the end of the tutorial is a small ipywidgets form for submitting your queries. The application runs in a Docker environment, but it can also be deployed to a Kubernetes cluster.

Retrieval-augmented generation (RAG) is a method that enhances the precision and dependability of generative AI models by incorporating facts from external sources. This technique addresses the limitations of large language models (LLMs), which, despite their ability to generate responses to general prompts rapidly, may not provide in-depth or specific information. By enabling access to external knowledge sources, RAG improves factual consistency, increases the reliability of generated responses, and helps to mitigate the issue of "hallucination" in more complex and knowledge-intensive tasks.

This tutorial walks through the full RAG pipeline on Intel Gaudi 2. First, it ingests a text file and runs an embedding model to build a vector store index and database. Then, for each query, it runs the embedding model again on the query, matches the result against the contents of the database, and sends the combined prompt (the query plus the retrieved context) to the Falcon 7B Instruct large language model to generate a complete response.
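
The flow described above can be sketched in a few lines of plain Python. This is only a toy illustration under stated assumptions: `toy_embed` is a made-up bag-of-words stand-in for the embedding model, an in-memory list stands in for the vector database, and the final prompt would be sent to the LLM rather than printed. None of these names come from the tutorial's code.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=1):
    """Return the k chunks most similar to the query."""
    q = toy_embed(query)
    return sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)[:k]

def build_prompt(query, context_chunks):
    """Combine the query and the retrieved context into one prompt string."""
    context = "\n".join(context_chunks)
    return f"Question: {query}\n\nContext: {context}\n\nAnswer: "

# Ingest: in the real pipeline these chunks would be embedded and stored in PGVector.
chunks = [
    "Gaudi 2 is an AI accelerator from Intel.",
    "PGVector adds vector search to PostgreSQL.",
]

# Query: embed the question, find the closest chunk, and assemble the prompt.
question = "What does PGVector add to PostgreSQL?"
top = retrieve(question, chunks)
prompt = build_prompt(question, top)
print(prompt)
```

In the rest of the tutorial, TEI plays the role of `toy_embed`, PGVector plays the role of `retrieve`, and TGI receives the prompt.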

<p style="text-align:center;">
    <img src="./RAG/img/rag-overview.png" />
</p>

## 1. Initial Setup

In [None]:
# Start with the `exit()` command to restart the Python kernel to ensure that there are no other processes holding the Intel Gaudi Accelerator as you start to run this notebook.
exit()

#### Set up the Docker environment in this notebook:
At this point you have cloned the Gaudi-tutorials notebooks inside your Docker container and opened this notebook. Now follow the steps below. Note that you need to install Docker again inside the Intel Gaudi container to manage the execution of the RAG tools.

In [None]:
# SKIP THIS STEP IF YOU ARE RUNNING DIRECTLY FROM THE JUPYTER NOTEBOOKS IN THE Intel Tiber Developer Cloud
#!apt-get update
#!apt-get install docker.io curl -y

## 2. Loading the Tools for RAG
Creating the RAG environment takes three steps: text generation, text embedding, and vectorization.

### Text Generation Inference (TGI)
The first building block of the application is text-generation-inference. It serves the LLM that answers a question based on context. To run it, first download the Docker image:

Please note: The Hugging Face Text Generation Inference depends on software that is subject to non-open source licenses.  If you use or redistribute this software, it is your sole responsibility to ensure compliance with such licenses.

In [None]:
!docker pull ghcr.io/huggingface/tgi-gaudi:2.0.5

### After downloading the image you will run it:

#### How to access and use the Falcon 7B Instruct model
For the LLM served by Text Generation Inference, the [Falcon 7B Instruct](https://huggingface.co/tiiuae/falcon-7b-instruct) model is used. This model performs comparably to similarly sized Llama-based models, but does not require an additional Hugging Face authentication token.

This docker command will start the Hugging Face TGI service and download the Falcon 7B model.  This may take a few minutes.

In [None]:
!docker run -d -p 9001:80 \
    --runtime=habana \
    --name gaudi-tgi \
    -e HABANA_VISIBLE_DEVICES=0 \
    -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
    -e HF_TOKEN="{hf_token}" \
    --cap-add=sys_nice \
    --ipc=host \
    ghcr.io/huggingface/tgi-gaudi:2.0.5 \
    --model-id tiiuae/falcon-7b-instruct \
    --max-input-tokens 1024 --max-total-tokens 2048

The `docker logs` command below shows the status of the `gaudi-tgi` container. If you see an error instead of the `Connected` message shown in the example further down, stop the container, remove it, and run the command above again.

To stop and remove the container, run this command in a terminal window:

`docker stop gaudi-tgi && docker rm gaudi-tgi`

In [None]:
!docker logs gaudi-tgi

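After the container starts, it takes some time to download the model and load it onto the device. To check the status, run `docker logs gaudi-tgi`; you should see output similar to the following (this sample log was captured with a different model, so the model name in your log will differ):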

```
2024-02-23T16:24:35.125179Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-02-23T16:24:40.729388Z  INFO shard-manager: text_generation_launcher: Shard ready in 65.710470677s rank=0
2024-02-23T16:24:40.796775Z  INFO text_generation_launcher: Starting Webserver
2024-02-23T16:24:42.589516Z  WARN text_generation_router: router/src/main.rs:355: `--revision` is not set
2024-02-23T16:24:42.589551Z  WARN text_generation_router: router/src/main.rs:356: We strongly advise to set it to a known supported commit.
2024-02-23T16:24:42.842098Z  INFO text_generation_router: router/src/main.rs:377: Serving revision e852bc2e78a3fe509ec28c6d76512df3012acba7 of model Intel/neural-chat-7b-v3-1
2024-02-23T16:24:42.845898Z  INFO text_generation_router: router/src/main.rs:219: Warming up model
2024-02-23T16:24:42.846613Z  WARN text_generation_router: router/src/main.rs:230: Model does not support automatic max batch total tokens
2024-02-23T16:24:42.846620Z  INFO text_generation_router: router/src/main.rs:252: Setting max batch total tokens to 16000
2024-02-23T16:24:42.846623Z  INFO text_generation_router: router/src/main.rs:253: Connected
2024-02-23T16:24:42.846626Z  WARN text_generation_router: router/src/main.rs:258: Invalid hostname, defaulting to 0.0.0.0
```
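
Rather than re-reading the logs by hand, readiness can also be polled from Python. The sketch below assumes that the TGI container answers a `GET` on a `/health` route on the mapped port 9001 once the model is loaded; verify this against the TGI version you run. `http_ok` and `wait_until_ready` are illustrative names, not part of any library.

```python
import time
import urllib.error
import urllib.request

def http_ok(url, timeout=5):
    """Return True if a GET on url answers with HTTP 200, False on any connection error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def wait_until_ready(url, max_wait=600, interval=10, probe=http_ok, sleep=time.sleep):
    """Poll url until probe succeeds, giving up after max_wait seconds of sleeping."""
    waited = 0
    while True:
        if probe(url):
            return True
        if waited >= max_wait:
            return False
        sleep(interval)
        waited += interval

# Example call (assumed endpoint, requires the gaudi-tgi container):
# wait_until_ready("http://127.0.0.1:9001/health")
```

The `probe` and `sleep` parameters exist only so the loop can be exercised without a running server.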


Once the setup is complete, you can verify that text generation is working by sending a request to it (note that the first request could be slow due to graph compilation):

In [4]:
!curl 127.0.0.1:9001/generate \
    -X POST \
    -d '{"inputs":"why is the earth round?","parameters":{"max_new_tokens":200}}' \
    -H 'Content-Type: application/json'

{"generated_text":"\nThe Earth is round because it is the result of the gravitational forces acting on the planet. Gravity pulls everything towards the center of the Earth, causing the planet to become round."}
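
The same request can be issued from Python using only the standard library. The payload mirrors the `curl` command above; `build_generate_request` is an illustrative helper name, and the lines that actually contact the server are left commented out so the snippet can be read without a running container.

```python
import json
import urllib.request

def build_generate_request(prompt, max_new_tokens=200, base_url="http://127.0.0.1:9001"):
    """Build a POST request for the /generate route used in the curl example."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_generate_request("why is the earth round?")

# With the gaudi-tgi container running:
# with urllib.request.urlopen(request) as resp:
#     print(json.loads(resp.read())["generated_text"])
```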

### Text Embeddings Inference (TEI)

The next building block is text-embeddings-inference. It serves an embeddings model that produces embeddings for the vector database. We use the standard Hugging Face text-embeddings-inference (TEI) image and run it on the CPU, since only one Intel Gaudi card is available in this notebook. To run it, start the following Docker container:

In [None]:
!docker run -d -p 9002:80 \
    --name cpu-tei \
    -v $PWD/RAG/data:/data \
    --pull always ghcr.io/huggingface/text-embeddings-inference:cpu-1.2 \
    --model-id BAAI/bge-large-en-v1.5

In [7]:
!docker logs cpu-tei

2024-10-22T01:14:57.796889Z  INFO text_embeddings_router: router/src/main.rs:140: Args { model_id: "BAA*/***-*****-**-v1.5", revision: None, tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: false, hf_api_token: None, hostname: "0505654999f7", port: 80, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/data"), payload_limit: 2000000, api_key: None, json_output: false, otlp_endpoint: None, cors_allow_origin: None }
2024-10-22T01:14:57.797002Z  INFO hf_hub: /usr/local/cargo/git/checkouts/hf-hub-1aadb4c6e2cbe1ba/b167f69/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"
2024-10-22T01:14:58.756011Z  INFO download_artifacts: text_embeddings_core::download:

Note that here, too, you need to wait for the model to load. You can run `docker logs cpu-tei` to confirm that the model is set up and running. To confirm that the model is working, send a request to it. Running the command below shows the embedding of the input prompt:

In [None]:
!curl 127.0.0.1:9002/embed \
    -X POST \
    -d '{"inputs":"What is Deep Learning?"}' \
    -H 'Content-Type: application/json'
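
The `/embed` route returns a JSON list with one vector per input. Retrieval then comes down to comparing such vectors, commonly with cosine similarity. The sketch below uses short made-up vectors in place of real TEI output (real `bge-large-en-v1.5` vectors are much longer); `cosine_similarity` is an illustrative helper, not part of TEI.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional vectors standing in for real embeddings.
query_vec = [1.0, 0.0, 1.0]
close_vec = [2.0, 0.0, 2.0]   # same direction as the query
far_vec = [0.0, 1.0, 0.0]     # orthogonal to the query

print(cosine_similarity(query_vec, close_vec))
print(cosine_similarity(query_vec, far_vec))
```

Vectors that point in the same direction score close to 1 regardless of their length, while unrelated (orthogonal) vectors score 0.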

### PGVector
The third building block is a vector database; this tutorial uses PGVector. Setting up the container is straightforward:

In [None]:
!docker pull pgvector/pgvector:pg16
!docker run \
    --name postgres_vectordb \
    -d \
    -e POSTGRES_PASSWORD=postgres \
    -p 9003:5432 \
    pgvector/pgvector:pg16

Check the status of the vectordb container. If startup was successful, you should see the message "database system is ready to accept connections".

In [None]:
!docker logs postgres_vectordb

### Application Front End 
The last building block is a front end for submitting queries. It is implemented in Python; the UI cell later in this notebook uses ipywidgets. To set up the environment, install the requirements:

In [18]:
!pip install --quiet -r RAG/requirements.txt


## 3. Data preparation

To build a good-quality RAG application, we need to prepare the data. The data processing stage for the vector database extracts text from documents (for example PDFs or CSVs), then splits it into chunks that do not exceed a maximum length, with additional metadata (for example the filename or file creation date). It then uploads the preprocessed data to the vector database.

In the process of data preprocessing, text splitting plays a crucial role. It involves breaking down the text into smaller, semantically meaningful chunks for further processing and analysis. Here are some common methods of text splitting:

- **By Character**: This method splits the text on a fixed separator character (or character sequence) and measures chunk size in characters. It's a straightforward approach, but it may not always be the most effective, as it doesn't take into account the semantic meaning of words or phrases.

- **Recursive**: Recursive splitting tries a list of separators in order (for example paragraphs, then lines, then words) and splits any piece that is still too large with the next separator, until every chunk fits the size limit. This method is particularly useful when dealing with complex structures in the text, as it allows for a more granular level of splitting.

- **HTML Specific**: When dealing with HTML content, text splitting can be done based on specific HTML tags or elements. This method is useful for extracting meaningful information from web pages or other HTML documents.

- **Code Specific**: In the context of programming code, text can be split based on specific code syntax or structures. This method is particularly useful for code analysis or for building tools that work with code.

- **By Tokens**: Tokenization is a common method of text splitting in Natural Language Processing (NLP). It breaks the text into tokens (words or sub-word units) and sizes chunks by token count, which matches how a language model measures the length of its input.

In conclusion, the choice of text splitting method depends largely on the nature of the text and the specific requirements of the task at hand. It's important to choose a method that effectively captures the semantic meaning of the text and facilitates further processing and analysis.

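The code in this tutorial uses LangChain's `CharacterTextSplitter`, which belongs to the **By Character** family; LangChain's `RecursiveCharacterTextSplitter` implements the **Recursive** method if you want to experiment with it. For a better understanding of the topic, you can try the https://langchain-text-splitter.streamlit.app/ app.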
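
To make the Recursive method concrete, here is a simplified pure-Python sketch. It is an illustration of the idea only and not LangChain's implementation: the function name `recursive_split`, the separator list, and the greedy merging rule are all assumptions chosen for brevity.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Split text into chunks of at most chunk_size characters.

    Separators are tried in order; a piece that is still too long is split
    again with the remaining separators, and finally by fixed-width slicing.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        # Greedily merge neighbouring pieces while the result still fits.
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: split it with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks

sample = "aaaa bbbb cccc\n\ndddd"
print(recursive_split(sample, chunk_size=9))
```

The paragraph break is tried first, so `dddd` stays in its own chunk, while the long first paragraph is split further at word boundaries.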

### Database Population
Database population is the step where we load documents, embed them, and store them in a database.

#### Data Loading
For ease of use, we'll use helper functions from langchain. Note that langchain_community is also required. 

In [12]:
from pathlib import Path

from langchain.docstore.document import Document
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_postgres import PGVector
from langchain_huggingface import HuggingFaceEndpointEmbeddings

#### Loading Documents with embeddings
Here we create the Hugging Face TEI client and the PGVector client. For PGVector, the collection name identifies the group of documents stored under it (it plays a role similar to a table name). The connection string consists of the connection protocol (`postgresql+psycopg2`), followed by the user, password, host, port, and database name. For ease of use, `pre_delete_collection` is set to `True` to prevent duplicates in the database.

**NOTE: the first time this is run, it may show a message "Collection not found", this is expected.**
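
The parts of the connection string can be made explicit with a small helper. This is a sketch only: `build_connection_string` is a hypothetical name, and the values are the ones used for the `postgres_vectordb` container in this tutorial. User and password are URL-escaped so that characters such as `@` do not break the string.

```python
from urllib.parse import quote_plus

def build_connection_string(user, password, host, port, database, driver="postgresql+psycopg2"):
    """Assemble driver://user:password@host:port/database, escaping user and password."""
    return f"{driver}://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{database}"

connection = build_connection_string("postgres", "postgres", "localhost", 9003, "postgres")
print(connection)
```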

In [None]:
embeddings = HuggingFaceEndpointEmbeddings(model="http://localhost:9002", huggingfacehub_api_token="EMPTY")
store = PGVector(
    collection_name="documents",
    connection="postgresql+psycopg2://postgres:postgres@localhost:9003/postgres",
    embeddings=embeddings,
    pre_delete_collection=True
)

#### Data Loading and Splitting
Data is loaded from the text files in `RAG/data/`. The documents are then split into chunks with a target size of 512 characters and loaded into the database. Note that documents can carry metadata, which is also stored in the vector database.

You can add a new text file to the `RAG/data/` folder and run the following cell again to run the RAG pipeline on new content. This cell populates the database that your queries run against. To avoid duplicate entries, rerun the previous cell first, since that is where the existing collection is deleted.

In [14]:
def load_file_to_db(path: str, store: PGVector):
    loader = TextLoader(path)
    document = loader.load()
    text_splitter = CharacterTextSplitter(chunk_size=512, chunk_overlap=0)
    for chunk in text_splitter.split_documents(document):
        store.add_documents([chunk])

for doc in Path("RAG/data/").glob("*.txt"):
    print(f"Loading {doc}...")
    load_file_to_db(str(doc), store)

print("Finished.")

Loading RAG/data/state_of_the_union.txt...
Finished.
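
`load_file_to_db` sends one chunk per `add_documents` call. Since `add_documents` takes a list, chunks could instead be sent in groups, which may reduce the number of round trips to the embedding service and the database. A minimal sketch of the grouping step follows; the helper name `batched` and the batch size of 32 are illustrative choices, not part of the tutorial's code.

```python
from itertools import islice

def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# Hypothetical use inside load_file_to_db:
# for batch in batched(text_splitter.split_documents(document), 32):
#     store.add_documents(batch)

print(list(batched(range(7), 3)))
```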


## 4. Running the Application
To start the application, run the cells below to define the functions that call the TGI, TEI, and vector database services.  
To use your own content, place a text file in the `./RAG/data` folder and rerun the data loading cell above; the application then answers questions about that document.  
The code calls the TGI and TEI services directly: it embeds the query, searches the vector database for matching chunks, and then uses the LLM to generate an answer to your query.  

In [None]:
# Use the same PGVector implementation (langchain_postgres) that populated the database above,
# so that the query side reads the collection with a matching schema and arguments.
from langchain_postgres import PGVector
from langchain_huggingface import HuggingFaceEndpointEmbeddings
from text_generation import Client

rag_prompt_intel_raw = """### System: You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise. 

### User: Question: {question}

Context: {context}

### Assistant: """

def get_sources(question):
    # Embed the query with the TEI service and return the two closest chunks.
    embeddings = HuggingFaceEndpointEmbeddings(model="http://localhost:9002", huggingfacehub_api_token="EMPTY")
    store = PGVector(
        collection_name="documents",
        connection="postgresql+psycopg2://postgres:postgres@localhost:9003/postgres",
        embeddings=embeddings,
    )
    return store.similarity_search(f"Represent this sentence for searching relevant passages: {question}", k=2)

def sources_to_str(sources):
    return "\n".join(f"{i+1}. {s.page_content}" for i, s in enumerate(sources))

def get_answer(question, sources):
    # TGI endpoint started earlier (host port 9001); adjust the port if you serve a model elsewhere.
    client = Client("http://localhost:9001")
    # Join the retrieved chunks into a single context block and fill in the prompt template.
    context = "\n".join(s.page_content for s in sources)
    prompt = rag_prompt_intel_raw.format(question=question, context=context)
    return client.generate(prompt, max_new_tokens=1024, stop_sequences=["### User:", "</s>"]).generated_text

default_question = "What is the summary of this document?"

def rag_answer(question, history):
    question = question["text"]
    sources = get_sources(question)
    answer = get_answer(question, sources)
    return f"{answer}"
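
`sources_to_str` is defined above but not called by `rag_answer`. One possible use, sketched here as an assumption rather than as part of the original application, is to append the retrieved passages to the answer so that the reader can check what the model was given. `format_answer_with_sources` is a hypothetical helper, and `SimpleNamespace` objects stand in for the retrieved documents.

```python
from types import SimpleNamespace

def sources_to_str(sources):
    # Same helper as in the cell above, repeated so the sketch runs on its own.
    return "\n".join(f"{i+1}. {s.page_content}" for i, s in enumerate(sources))

def format_answer_with_sources(answer, sources):
    """Append a numbered list of the retrieved passages to the answer text."""
    return f"{answer.strip()}\n\nSources:\n{sources_to_str(sources)}"

# Stand-in objects with a page_content attribute, mimicking retrieved documents.
fake_sources = [
    SimpleNamespace(page_content="First retrieved passage."),
    SimpleNamespace(page_content="Second retrieved passage."),
]
print(format_answer_with_sources(" An example answer. ", fake_sources))
```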
    

### Run the UI to query the RAG
This sets up a simple UI to submit queries to the RAG.

In [None]:
import ipywidgets as widgets
from IPython.display import display, HTML
import threading
import time

# Define colors
INTEL_BLUE = "#0071C5"
LIGHT_BLUE = "#5DADEC"

# Create UI elements
title = widgets.HTML(
    value="<h1 style='color:{}; text-align:center;'>Question Answering Mini App</h1>".format(INTEL_BLUE)
)
question_input = widgets.Textarea(
    placeholder='Enter your question here',
    description='Question:',
    layout=widgets.Layout(width='100%', height='100px')
)
submit_button = widgets.Button(
    description='Submit',
    button_style='primary',
    layout=widgets.Layout(width='100px'),
    style={'button_color': INTEL_BLUE}
)
output_area = widgets.Output()
loading_spinner = widgets.HTML(
    value="<i class='fa fa-spinner fa-spin' style='font-size:24px; color:{}'></i>".format(INTEL_BLUE),
    layout=widgets.Layout(display='none')
)

# Define the interaction function
def on_submit_button_clicked(b):
    with output_area:
        output_area.clear_output()
        display(loading_spinner)
        loading_spinner.layout.display = 'inline-block'
        
        def process_question():
            try:
                question = question_input.value
                history = []  # Assuming history is empty for simplicity
                answer = rag_answer({"text": question}, history)
                with output_area:
                    loading_spinner.layout.display = 'none'
                    output_area.append_stdout(f"Answer:\n{answer}\n")
            except Exception as e:
                with output_area:
                    loading_spinner.layout.display = 'none'
                    output_area.append_stdout(f"Error: {str(e)}\n")
        
        thread = threading.Thread(target=process_question)
        thread.start()

submit_button.on_click(on_submit_button_clicked)

# Display the UI elements
display(
    widgets.VBox([
        title,
        question_input,
        submit_button,
        loading_spinner,
        output_area
    ])
)

# HTML to enhance loading spinner style
display(HTML('''
<style>
    @keyframes spinner {
        to {transform: rotate(360deg);}
    }
    .fa-spinner {
        margin: 0 auto;
        display: block;
        animation: spinner 1s linear infinite;
    }
</style>
'''))

In [None]:
# Please be sure to run this cell so that the resources running on Intel Gaudi are released and the Docker containers are stopped and removed
!docker stop gaudi-tgi && docker rm gaudi-tgi
!docker stop cpu-tei && docker rm cpu-tei
!docker stop postgres_vectordb && docker rm postgres_vectordb
exit()