Copyright (c) 2023 Habana Labs, Ltd. an Intel Company.

#### Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License. You may obtain a copy of the License at https://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

# Using LocalGPT on Gaudi2 with the Llama2 model
This tutorial shows how to use the [LocalGPT](https://github.com/PromtEngineer/localGPT) open source initiative on the Gaudi2 processor. LocalGPT lets you load your own documents and run an interactive chat session over that material, so you can query and summarize your content. Copy any .pdf or .txt documents into the `SOURCE_DOCUMENTS` folder, run the ingest.py script to parse and embed your content, and then run the run_localGPT.py script to start the interaction.

In this example, we're using the **meta-llama/Llama-2-13b-chat-hf** model as the reference model that will manage the inference on Gaudi2.

To optimize this instantiation of LocalGPT, we have created new content on top of the existing Hugging Face based "text-generation" inference task and pipelines, including:

1. Using the Hugging Face Optimum Habana Library with the Llama2-13B model, which is optimized on Gaudi2. 
2. Using LangChain to import the source document with a custom embedding model, using a `GaudiHuggingFaceInstructEmbeddings` class based on HuggingFaceInstructEmbeddings.
3. Using a custom pipeline class, `GaudiTextGenerationPipeline`, that improves text-generation performance by handling padding and indexing for static shapes.
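
To make the static-shape idea in item 3 concrete, here is a minimal sketch of padding a prompt to a fixed "bucket" length, so that many different prompt lengths collapse into a few tensor shapes. This is an illustration only, not the code of `GaudiTextGenerationPipeline`: the helper name `pad_to_bucket`, the bucket size, the pad token id, and the choice of right-padding are all assumptions.

```python
from typing import List


def pad_to_bucket(token_ids: List[int], bucket_size: int = 128, pad_id: int = 0) -> List[int]:
    """Right-pad token_ids so that its length is a multiple of bucket_size.

    Hypothetical helper: bucket_size and pad_id are illustrative values.
    """
    if bucket_size <= 0:
        raise ValueError("bucket_size must be positive")
    # Round the length up to the next bucket boundary (at least one bucket).
    n_buckets = max(1, -(-len(token_ids) // bucket_size))
    target_len = n_buckets * bucket_size
    return token_ids + [pad_id] * (target_len - len(token_ids))


# Prompts of length 3 and 10 are padded to one and two buckets of 8 tokens.
short_prompt = pad_to_bucket([5, 17, 42], bucket_size=8)
long_prompt = pad_to_bucket(list(range(1, 11)), bucket_size=8)
print(len(short_prompt), len(long_prompt))
```

With a scheme like this, only a handful of distinct input shapes ever reach the device, which is the general motivation for static shapes.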


In [1]:
%cd /root/Gaudi-tutorials/PyTorch/localGPT_inference
%env DEBIAN_FRONTEND=noninteractive
%env TZ=Etc/UTC

/root/Gaudi-tutorials/PyTorch/localGPT_inference


##### Install the requirements for LocalGPT 

In [None]:
!apt-get update
!apt-get install -y tzdata bash-completion python3-pip openssh-server vim git iputils-ping net-tools protobuf-compiler curl bc gawk tmux && rm -rf /var/lib/apt/lists/*
!pip install -q --upgrade pip
!pip install -q -r requirements.txt

##### Install the Optimum Habana Library from Hugging Face

In [3]:
!pip install -q --upgrade-strategy eager optimum[habana]


### Load your Local Content
Copy all of your files into the `SOURCE_DOCUMENTS` directory.

The current default file types are .txt, .pdf, .csv, and .xlsx. If you want to use any other file type, you will need to convert it to one of the default file types.
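
As a quick pre-flight check, the sketch below lists which files in a folder match the default file types and which would need converting first. It is not part of LocalGPT: the helper name `split_by_support` is hypothetical, and walking subfolders recursively is an assumption about how you organize your documents.

```python
from pathlib import Path
from typing import List, Tuple

# Default file types named in this tutorial.
SUPPORTED_EXTENSIONS = {".txt", ".pdf", ".csv", ".xlsx"}


def split_by_support(folder: str) -> Tuple[List[Path], List[Path]]:
    """Return (supported, unsupported) files found under folder."""
    supported: List[Path] = []
    unsupported: List[Path] = []
    root = Path(folder)
    if not root.is_dir():
        return supported, unsupported
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        # Compare extensions case-insensitively, so "report.PDF" counts as a PDF.
        if path.suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(path)
        else:
            unsupported.append(path)
    return supported, unsupported


ok, needs_conversion = split_by_support("SOURCE_DOCUMENTS")
print(f"{len(ok)} file(s) ready, {len(needs_conversion)} file(s) need conversion")
```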

Run the following command to ingest all the data. The ingest.py script uses LangChain tools to parse the documents and create embeddings locally using the `GaudiHuggingFaceInstructEmbeddings` class. It then stores the result in a local vector database (DB) using the Chroma vector store.
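
Before embedding, the parsed text is split into chunks (the log output further down reports the chunk count). A minimal sketch of fixed-size chunking with overlap is shown here to illustrate the idea. It is not the splitter that ingest.py uses; the function name `split_text` and the `chunk_size` and `overlap` values are assumptions chosen for illustration.

```python
from typing import List


def split_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters, so a sentence that straddles
    a boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


chunks = split_text("abcdefghij" * 5, chunk_size=20, overlap=5)
print(len(chunks), [len(c) for c in chunks])
```

Each chunk is then embedded on its own, which is why the number of stored embeddings tracks the number of chunks rather than the number of documents.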

If you want to start from an empty database, delete the DB folder and run the ingest script again. 
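
If you prefer to do the reset from Python rather than by hand, a small sketch follows. The helper name `reset_vector_db` is hypothetical, and it assumes the database lives in a folder named `DB` relative to the current working directory.

```python
import shutil
from pathlib import Path


def reset_vector_db(db_dir: str = "DB") -> bool:
    """Delete the vector database folder if it exists.

    Returns True if a folder was removed, False if there was nothing to remove.
    """
    path = Path(db_dir)
    if path.is_dir():
        shutil.rmtree(path)
        return True
    return False
```

Calling `reset_vector_db()` and then re-running the ingest command rebuilds the database from whatever is currently in `SOURCE_DOCUMENTS`.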

In [4]:
!python ingest.py --device_type hpu

2023-10-10 23:23:58,137 - INFO - ingest.py:124 - Loading documents from /root/Gaudi-tutorials/PyTorch/localGPT_inference/SOURCE_DOCUMENTS
2023-10-10 23:23:58,148 - INFO - ingest.py:37 - Loading document batch
2023-10-10 23:24:48,208 - INFO - ingest.py:133 - Loaded 1 documents from /root/Gaudi-tutorials/PyTorch/localGPT_inference/SOURCE_DOCUMENTS
2023-10-10 23:24:48,208 - INFO - ingest.py:134 - Split into 2227 chunks of text
Loading Habana modules from /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib
2023-10-10 23:24:49,625 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: all-MiniLM-L6-v2
2023-10-10 23:24:50,149 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: all-MiniLM-L6-v2
2023-10-10 23:24:50,453 - INFO - __init__.py:88 - Running Chroma using direct local API.
2023-10-10 23:24:50,719 - INFO - ctypes.py:22 - Successfully imported ClickHouse Connect C data optimizations
2023-10-10 23:24:50,723 - INFO - json_impl.py:45

### How to access and use the Llama2 model
Use of the pretrained model is subject to compliance with third party licenses, including the “Llama 2 Community License Agreement” (LLAMAV2). For guidance on the intended use of the LLAMA2 model, what will be considered misuse and out-of-scope uses, who are the intended users and additional terms please review and read the instructions in this link https://ai.meta.com/llama/license/. Users bear sole liability and responsibility to follow and comply with any third party licenses, and Habana Labs disclaims and will bear no liability with respect to users’ use or compliance with third party licenses.

To run gated models such as Llama-2-13b-chat-hf, you need to:

* Have a Hugging Face account
* Agree to the model's terms of use in its model card on the HF Hub
* Create a read token
* Log in to your account using the HF CLI: run `huggingface-cli login` before launching your script

In [5]:
!huggingface-cli login --token <your token here>

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


### Running the LocalGPT model with Llama2 13B Chat 

### Set the model Usage

To change the model, edit the `LLM_ID = <add model here>` line in the `constants.py` file. For this example, the default is `meta-llama/Llama-2-13b-chat-hf`.

Since this is interactive, it's a better experience to launch it from a terminal window. The run_localGPT.py script uses a local LLM (Llama2 in this case) to understand questions and create answers. The context for the answers is extracted from the local vector store, using a similarity search to locate the most relevant pieces of the ingested documents.
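
The similarity search can be pictured as ranking stored chunk embeddings by how close they are to the embedding of the question. The sketch below uses cosine similarity over toy 3-dimensional vectors. It is an illustration, not the vector store's implementation: the metric, the function names, and the example vectors are all assumptions, and real embeddings have hundreds of dimensions.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float],
          store: List[Tuple[str, Sequence[float]]],
          k: int = 2) -> List[Tuple[str, float]]:
    """Return the k stored chunks most similar to the query, best first."""
    scored = [(text, cosine_similarity(query, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "vector store": (chunk text, made-up embedding).
store = [
    ("Gaudi2 is an AI accelerator.", [0.9, 0.1, 0.0]),
    ("Chroma stores embeddings.", [0.1, 0.9, 0.1]),
    ("Llama 2 is a chat model.", [0.0, 0.2, 0.9]),
]
query_vec = [0.8, 0.2, 0.1]  # made-up embedding of a question about Gaudi2
for text, score in top_k(query_vec, store, k=2):
    print(f"{score:.3f}  {text}")
```

The top-ranked chunks are what get passed to the LLM as context alongside the question.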

`python run_localGPT.py --device_type hpu`

Note: Inference runs in sampling mode, so you can optionally modify the temperature and top_p settings in run_localGPT.py line 84 to change the output. The current settings are temperature=0.5, top_p=0.8. Type "exit" at the prompt to stop the execution.
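
To show what these two settings do, here is a small self-contained sketch of temperature scaling followed by top_p (nucleus) filtering over a toy vocabulary. It illustrates the general technique, not the code path used by run_localGPT.py; the function names and the example logits are assumptions.

```python
import math
import random
from typing import Dict, Optional


def nucleus_probs(logits: Dict[str, float], temperature: float = 0.5,
                  top_p: float = 0.8) -> Dict[str, float]:
    """Return the tokens kept by top_p filtering, with their softmax probabilities."""
    if not logits:
        raise ValueError("logits must not be empty")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda pair: pair[1], reverse=True)
    # Keep the smallest set of top tokens whose cumulative probability reaches top_p.
    kept: Dict[str, float] = {}
    cumulative = 0.0
    for tok, prob in ranked:
        kept[tok] = prob
        cumulative += prob
        if cumulative >= top_p:
            break
    return kept


def sample_token(logits: Dict[str, float], temperature: float = 0.5,
                 top_p: float = 0.8, rng: Optional[random.Random] = None) -> str:
    """Sample one token from the filtered distribution."""
    rng = rng or random.Random()
    kept = nucleus_probs(logits, temperature, top_p)
    return rng.choices(list(kept), weights=list(kept.values()), k=1)[0]


toy_logits = {"a": 2.0, "b": 1.0, "c": 0.0}
print(sorted(nucleus_probs(toy_logits, temperature=0.5, top_p=0.8)))
print(sorted(nucleus_probs(toy_logits, temperature=1.0, top_p=0.8)))
```

In this toy case, the lower temperature concentrates enough probability on the top token that it alone satisfies top_p, while the higher temperature lets a second token into the candidate set. Raising either setting generally makes the output more varied.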


In [None]:
# Run this in a terminal window to start the chat: `python run_localGPT.py --device_type hpu`. The example below shows the initial output:

In [7]:
python run_localGPT.py --device_type hpu

2023-10-10 23:29:55,812 - INFO - run_localGPT.py:186 - Running on: hpu
2023-10-10 23:29:55,812 - INFO - run_localGPT.py:187 - Display Source Documents set to: False
2023-10-10 23:29:56,315 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: all-MiniLM-L6-v2
2023-10-10 23:29:56,718 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: all-MiniLM-L6-v2
2023-10-10 23:29:56,922 - INFO - __init__.py:88 - Running Chroma using direct local API.
2023-10-10 23:29:56,935 - INFO - ctypes.py:22 - Successfully imported ClickHouse Connect C data optimizations
2023-10-10 23:29:56,938 - INFO - json_impl.py:45 - Using python library for writing JSON byte strings
2023-10-10 23:29:57,183 - INFO - duckdb.py:460 - loaded in 6681 embeddings
2023-10-10 23:29:57,184 - INFO - duckdb.py:472 - loaded in 1 collections
2023-10-10 23:29:57,185 - INFO - duckdb.py:89 - collection with name langchain already exists, returning existing collection
2023-10-10 23:29:57,186 - INFO 

In [None]:
exit()