Copyright (c) 2023 Habana Labs, Ltd. an Intel Company.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

# Using LocalGPT with Retrieval Augmented Generation (RAG) on the Intel&reg; Gaudi&reg; 2 AI accelerator with the Llama 2 70B model
This tutorial will show how to use the [LocalGPT](https://github.com/PromtEngineer/localGPT) open source initiative on the Intel Gaudi 2 AI accelerator.  LocalGPT allows you to load your own documents and run an interactive chat session with this material using concepts from Retrieval Augmented Generation (RAG).  

This allows you to query and summarize your content: load any supported documents (for example .pdf or .txt files) into the `SOURCE_DOCUMENTS` folder, use the utilities from the `ingest.py` script to parse and embed your content, and then run the `run_localGPT.py` script to start the interaction.

The first section shows how RAG works and runs through the steps of indexing the local content, retrieval, and text generation with a single question and response.

The last section uses the full LocalGPT framework with the **meta-llama/Llama-2-70b-chat-hf** model as the reference model that manages inference on Gaudi 2.  Because of the size of the model, DeepSpeed inference is used.

To optimize this instantiation of LocalGPT, we have created new content on top of the existing Hugging Face based "text-generation" inference task and pipelines, including:

1. Using the Hugging Face Optimum Habana Library with the Llama 2 70B model, which is optimized on Gaudi 2.
2. Using LangChain to import the source documents with an embedding model, using the Hugging Face Optimum Habana Library.
3. Using a custom pipeline class, `GaudiTextGenerationPipeline`, that optimizes text-generation tasks for padding and indexing for static shapes, to improve performance.


##### Install DeepSpeed to run inference on the full Llama 2 70B model

In [None]:
!pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.18.0

##### Go to the LocalGPT folder

In [None]:
%cd /root/Gaudi-tutorials/PyTorch/localGPT_inference

##### Install the requirements for LocalGPT 

In [None]:
!pip install -q --upgrade pip
!pip install -q -r requirements.txt

##### Install the Optimum Habana Library from Hugging Face

In [None]:
!pip install -q optimum-habana==1.14.1

## Retrieval-Augmented Generation (RAG)
LocalGPT uses Retrieval-Augmented Generation (RAG) at its core. RAG is a relatively new AI technique that combines an information retrieval system with text-generation models/LLMs. It provides an effective way to ground LLMs by using retrieved contexts from an external knowledge base, without having to perform retraining or fine-tuning.
The LocalGPT application workflow can be broken down as follows:
* Document Ingestion: This step involves creating an external knowledge base via a vector database. The text present in the documents is parsed, split into chunks and converted to embeddings using an embedding model. The vector embeddings are finally stored in the vector database.

![](img/ingest.jpg)

* Text Generation: This step involves accepting a query from the user, converting the query to embeddings and retrieving appropriate contexts from the knowledge base. The input prompt to the LLM is the concatenation of the query, contexts and chat history.

![](img/documentqa.jpg)
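
The two stages above can be sketched without any model or vector database. The following is a toy illustration, not LocalGPT's implementation: it assumes a bag-of-words count vector in place of a real embedding model, a plain Python list in place of the vector database, and hypothetical helper names (`embed`, `ingest`, `retrieve`, `build_prompt`).

```python
import math
from collections import Counter


def embed(text):
    """Stand-in embedding: lowercase word counts (a real system uses a neural model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ingest(chunks):
    """Document ingestion: store each chunk next to its embedding."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(store, query, k=1):
    """Return the k chunks whose embeddings are most similar to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query, contexts, history=()):
    """Concatenate chat history, retrieved contexts and the query into one prompt."""
    parts = list(history) + list(contexts) + [f"Question: {query}"]
    return "\n".join(parts)


store = ingest([
    "Gaudi 2 is an AI accelerator from Intel.",
    "Chroma is a vector database for embeddings.",
])
query = "What is a vector database?"
contexts = retrieve(store, query)
prompt = build_prompt(query, contexts)
print(prompt)
```

The retrieved chunk is placed ahead of the question, so the LLM sees the relevant context before it answers.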

### Document Ingestion
Copy all of your files into the `SOURCE_DOCUMENTS` directory.

The current default file types are .txt, .pdf, .csv, and .xlsx. If you want to use any other file type, you will need to convert it to one of the default file types.

Run the following cells to ingest all the data. This notebook uses LangChain tools to parse the documents and create embeddings locally using the Hugging Face Optimum Habana Library. It then stores the result in a local vector database (DB) using the Chroma vector store.

If you want to start from an empty database, delete the DB folder and run the next few cells again. 

##### Load your files as LangChain Documents

In [None]:
from constants import SOURCE_DIRECTORY
from ingest import load_documents

documents = load_documents(SOURCE_DIRECTORY)
print(f"Loaded {len(documents)} documents from {SOURCE_DIRECTORY}")

##### Split the text into chunks

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(documents)
print(f"Created {len(texts)} chunks of text")
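
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a simplified sketch of overlapping chunking. It is an assumption-laden illustration, not the LangChain implementation: it cuts at fixed character offsets, whereas `RecursiveCharacterTextSplitter` also tries to split on separators such as paragraph and line breaks, so the real chunks will differ. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size chunks where neighbors share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)
```

The overlap means that a sentence falling on a chunk boundary still appears in full in at least one chunk, at the cost of storing some text twice.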

##### Create embeddings from chunks of text

In [None]:
from constants import EMBEDDING_MODEL_NAME
from langchain_huggingface import HuggingFaceEmbeddings

from habana_frameworks.torch.utils.library_loader import load_habana_module
from optimum.habana.sentence_transformers.modeling_utils import adapt_sentence_transformers_to_gaudi

load_habana_module()

adapt_sentence_transformers_to_gaudi()
embeddings = HuggingFaceEmbeddings(model_name=EMBEDDING_MODEL_NAME, model_kwargs={"device": "hpu"})

##### Create a Chroma vector database to store embeddings

In [None]:
import time
from constants import PERSIST_DIRECTORY, CHROMA_SETTINGS
from langchain_chroma import Chroma

start_time = time.perf_counter()
db = Chroma.from_documents(texts, embeddings, persist_directory=PERSIST_DIRECTORY, client_settings=CHROMA_SETTINGS)
end_time = time.perf_counter()
print(f"Time taken to create vector store: {(end_time-start_time)*1000} ms")

### How to Access and Use the Llama 2 Model
Use of the pretrained model is subject to compliance with third party licenses, including the “Llama 2 Community License Agreement” (LLAMAV2). For guidance on the intended use of the LLAMA2 model, what will be considered misuse and out-of-scope uses, who are the intended users and additional terms please review and read the instructions in this link https://ai.meta.com/llama/license/. Users bear sole liability and responsibility to follow and comply with any third party licenses, and Habana Labs disclaims and will bear no liability with respect to users’ use or compliance with third party licenses.

To be able to run gated models such as Llama-2-70b-chat-hf, you need to do the following:

* Have a Hugging Face account
* Agree to the terms of use of the model in its model card on the HF Hub
* Set a read token
* Log in to your account using the HF CLI: run `huggingface-cli login` before launching your script

In [None]:
#!huggingface-cli login --token <your token here>

### Text Generation
Once the Chroma vector database is ready, we can explore the text-generation component of LocalGPT.

The next few cells describe all the steps in the text generation process. We use the smallest Llama 2 chat model, **meta-llama/Llama-2-7b-chat-hf**, to perform augmented text generation after retrieving relevant contexts from the vector database.

##### Load the LLM

In [None]:
from run_localGPT import load_model

model_id = "meta-llama/Llama-2-7b-chat-hf"
llm, _ = load_model(device_type="hpu", model_id=model_id, temperature=0.2, top_p=0.95, model_basename=None)

##### Define the Retriever

In [None]:
db = Chroma(persist_directory=PERSIST_DIRECTORY, embedding_function=embeddings, client_settings=CHROMA_SETTINGS)
retriever = db.as_retriever()

##### Create the prompt template

In [None]:
from langchain_core.prompts import PromptTemplate

template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, \
just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Answer:"""

prompt = PromptTemplate(input_variables=["context", "question"], template=template)
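
To see what the LLM actually receives, the template can be filled by hand. This is a minimal sketch using plain `str.format` rather than LangChain; the chunk texts are hypothetical placeholders, and joining them with a blank line is an assumption about how the retrieved documents are combined.

```python
# Same wording as the notebook's template, written as concatenated literals.
template = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)

# Hypothetical retrieved chunks; in the notebook these come from the retriever.
retrieved_chunks = ["Chunk one text.", "Chunk two text."]

filled = template.format(
    context="\n\n".join(retrieved_chunks),
    question="What is this document about?",
)
print(filled)
```

The prompt ends with `Answer:` so that the model continues the text with its answer instead of restating the question.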

##### Initialize a LangChain object

In [None]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

qa = create_retrieval_chain(retriever, create_stuff_documents_chain(llm, prompt))

##### Ask a question

In [None]:
query = "What is this document about?"
# "input" feeds the retriever; "question" fills the {question} slot in the prompt template.
res = qa.invoke({"question": query, "input": query})
print(res["answer"])
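
Besides `"answer"`, the dictionary returned by a retrieval chain also carries the retrieved documents under the `"context"` key, which helps when checking where an answer came from. The sketch below uses a stand-in class instead of real LangChain documents so that it runs anywhere; `FakeDocument`, `summarize_sources`, and the presence of a `"source"` entry in the metadata are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Stand-in for a LangChain Document (page_content plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize_sources(result, max_chars=60):
    """Return one line per retrieved chunk: its source and a short preview."""
    lines = []
    for doc in result.get("context", []):
        source = doc.metadata.get("source", "unknown")
        preview = doc.page_content[:max_chars].replace("\n", " ")
        lines.append(f"{source}: {preview}")
    return lines


# Hypothetical result shaped like the dictionary returned by qa.invoke(...).
res_example = {
    "answer": "It describes the Gaudi 2 accelerator.",
    "context": [FakeDocument("Gaudi 2 is an AI accelerator.", {"source": "intro.txt"})],
}
for line in summarize_sources(res_example):
    print(line)
```

In the notebook, the same helper could be called as `summarize_sources(res)` after the question cell has run.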

##### Clean up before running Full LocalGPT below
To run the full LocalGPT example below, you need to restart the kernel in the Jupyter server to ensure that all the Intel Gaudi accelerators are released.  This can be accomplished by selecting the restart option in the Kernel menu or by running the `exit()` command at the bottom of this notebook.

### Running the LocalGPT full example with Llama 2 70B Chat 

### Set the Model Usage

To change the model, modify the `LLM_ID = <add model here>` line in the `constants.py` file. For this example, the default is `meta-llama/Llama-2-70b-chat-hf`.

Since this is interactive, it's a better experience to launch this from a terminal window.  The `run_localGPT.py` script uses a local LLM (Llama 2 in this case) to understand questions and create answers. The context for the answers is extracted from the local vector store using a similarity search to locate the right piece of context from the documentation.  This is the run command to use:

`PT_HPU_LAZY_ACC_PAR_MODE=1 PT_HPU_ENABLE_LAZY_COLLECTIVES=true python gaudi_spawn.py --use_deepspeed --world_size 8 run_localGPT.py --device_type hpu --temperature 0.7 --top_p 0.95`

Running the full 70B model takes up ~128 GB of disk space, so if your system is storage constrained, it may be best to run the Llama 2 7B or 13B chat models.  Change the LLM_ID variable in the `constants.py` file (example: `LLM_ID = "meta-llama/Llama-2-7b-chat-hf"`) and use the command below:

`python run_localGPT.py --device_type hpu --temperature 0.7 --top_p 0.95`
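
The two run commands differ only in the launcher prefix. As a small sketch, the hypothetical helper below (`build_launch_command` is not part of LocalGPT) assembles either form from the flags shown in this section. It does not set the `PT_HPU_LAZY_ACC_PAR_MODE` and `PT_HPU_ENABLE_LAZY_COLLECTIVES` environment variables, which still need to be supplied for the multi-card run.

```python
import shlex


def build_launch_command(world_size, temperature=0.7, top_p=0.95):
    """Assemble the run command as an argument list for the given number of cards."""
    script_args = [
        "run_localGPT.py", "--device_type", "hpu",
        "--temperature", str(temperature), "--top_p", str(top_p),
    ]
    if world_size > 1:
        # Multi-card run: launch through gaudi_spawn.py with DeepSpeed.
        return ["python", "gaudi_spawn.py", "--use_deepspeed",
                "--world_size", str(world_size)] + script_args
    # Single-card run: call the script directly.
    return ["python"] + script_args


multi_card = shlex.join(build_launch_command(world_size=8))
single_card = shlex.join(build_launch_command(world_size=1))
print(multi_card)
print(single_card)
```

Keeping the command as a list of arguments also makes it easy to pass to `subprocess.run` if you prefer to launch the script from Python.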

Note: Inference runs in sampling mode, so you can optionally modify the temperature and top_p settings.  The current settings are temperature=0.7 and top_p=0.95.  Type "exit" at the prompt to stop the execution.
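
To give a feel for what these two settings do, here is a simplified sketch of temperature scaling followed by top-p (nucleus) filtering over a tiny made-up vocabulary. It is an illustration under stated assumptions, not the code path used by the model: the logits are invented, and `nucleus_distribution` is a hypothetical helper that returns the distribution a token would then be sampled from.

```python
import math


def nucleus_distribution(logits, temperature, top_p):
    """Return renormalized token probabilities after temperature scaling and top-p filtering."""
    # Temperature scaling: lower values sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the peak for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}


logits = [2.0, 1.0, 0.0]  # made-up scores for a three-token vocabulary
warm = nucleus_distribution(logits, temperature=0.7, top_p=0.95)
cool = nucleus_distribution(logits, temperature=0.2, top_p=0.95)
print(warm)
print(cool)
```

With these example logits, the lower temperature leaves only the top token as a candidate, while the higher temperature keeps two candidates, which is why higher temperatures give more varied answers.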


In [None]:
# Run this command in a terminal window to start the interactive chat:
# PT_HPU_LAZY_ACC_PAR_MODE=1 PT_HPU_ENABLE_LAZY_COLLECTIVES=true python gaudi_spawn.py --use_deepspeed --world_size 8 run_localGPT.py --device_type hpu --temperature 0.7 --top_p 0.95
# The cell below shows the initial output.

In [None]:
!PT_HPU_LAZY_ACC_PAR_MODE=1 PT_HPU_ENABLE_LAZY_COLLECTIVES=true python gaudi_spawn.py --use_deepspeed --world_size 8 run_localGPT.py --device_type hpu --temperature 0.7 --top_p 0.95

In [None]:
exit()