# **An Open Source Document Querying AI Tool:**
## **Utilising Large Language Models (LLMs), Chroma, Huggingface, Langchain and Streamlit.**

This notebook details a Python project that uses open source Large Language Models (LLMs) to answer users' questions about a long PDF document. It relies on the Langchain, Chroma vector store, HuggingFace and Streamlit libraries.  
The PDF document used for this project is a 25-page bicycle insurance policy from cycleGuard.

This project was implemented in a Google Colab cloud notebook, so others can run the code without installing or computing anything locally. The notebook runs on a CPU runtime, but GPU acceleration is much faster: embedding the document takes roughly 20 seconds with a GPU compared with about 12 minutes with a CPU.

# **1. Import Required Libraries**

In [None]:
!pip install langchain streamlit langchain-community InstructorEmbedding sentence_transformers==2.2.2 pypdf chromadb

In [None]:
import os
import time
import langchain
from langchain.llms import HuggingFaceHub
from langchain.vectorstores import Chroma
from langchain.document_loaders import PyPDFLoader
from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceInstructEmbeddings
import urllib.request  # urlopen lives in the urllib.request submodule, so import it explicitly
!npm install localtunnel

# **2. Text Embedding Model & Chroma Embedding Database**
The text embedding model is used to process the input PDF text document so that sentences and words can be converted into a series of vectors that can be stored in a vector database. Chroma is the open source embedding database that will be used for this project (https://www.trychroma.com/).

User queries are embedded with the same text embedding model, so the embedding database can be searched for the closest stored vectors. This is a semantic search through the PDF document rather than a keyword search, so it can match passages that share meaning with the query even when the wording differs.
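
To make the idea of searching by vector closeness concrete, the following is a minimal sketch using toy three-dimensional vectors and cosine similarity. The vectors, the example texts and the `cosine_similarity` helper are assumptions made for illustration only; a real embedding model produces vectors with hundreds of dimensions, and the distance metric Chroma actually applies depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings for three passages (made-up numbers).
toy_embeddings = {
    "excess payable on a claim": [0.9, 0.1, 0.0],
    "registered office address": [0.1, 0.9, 0.1],
    "cover for theft from a shed": [0.2, 0.1, 0.9],
}

# Stands in for the embedded user question.
toy_query = [0.8, 0.2, 0.1]

# Rank passages from most to least similar to the query.
ranked = sorted(
    toy_embeddings,
    key=lambda text: cosine_similarity(toy_query, toy_embeddings[text]),
    reverse=True,
)
best_match = ranked[0]
print(best_match)
```

The passage whose vector points in nearly the same direction as the query vector is ranked first, regardless of whether the two texts share any exact words.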

Huggingface has a leaderboard of text embedding models that can be found at this link: https://huggingface.co/blog/mteb and for this project the text embedding model called "hkunlp/instructor-xl" will be used (https://huggingface.co/hkunlp/instructor-xl). This model is well rated for the embedding of text documents for multiple applications, so should be suitable for this application.

## **2.1 Load the PDF Document**

In [5]:
# Set your huggingface API token to allow access to models:
os.environ['HUGGINGFACEHUB_API_TOKEN'] = 'your_private_API_key'  # Required for the selected LLM model

text_embedding_model = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl")

# Load the PDF document:
input_document = PyPDFLoader('cycleGuard Policy Wording 2021-03.pdf') # Ensure this PDF document file has already been loaded into Colab working directory

# Split pages from the PDF
pdf_pages = input_document.load_and_split()

load INSTRUCTOR_Transformer
max_seq_length  512
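
The `load_and_split()` call above returns the PDF as a list of smaller text chunks. In Langchain this method falls back to a recursive character splitter when no splitter is supplied, although the exact rules and default sizes depend on the installed version. The sketch below is a deliberately simplified fixed-size splitter, not Langchain's implementation, and `split_with_overlap` is a hypothetical helper. It only illustrates why consecutive chunks usually share some overlapping text: a sentence cut at a chunk boundary still appears whole in one of the two chunks.

```python
def split_with_overlap(text, chunk_size, overlap):
    """Split text into character chunks; consecutive chunks share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample_text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample_text, chunk_size=10, overlap=3)
print(chunks)
```

Chunk size matters here because the loader output reports `max_seq_length 512`, which suggests that the embedding model only considers a limited number of tokens per input.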


## **2.2 Embed the PDF Document in Chroma Vector Database**
The input PDF document will be embedded in the next cell using the text embedding model and the Chroma database.

In [62]:
# Convert the uploaded document to vector format and print out the time taken:
t1 = time.time()

vector_store = Chroma.from_documents(pdf_pages, text_embedding_model, collection_name='cycle_insurance')

t2 = time.time()
time_taken = t2 - t1
print("Time to create vector store [mins]:", round(time_taken/60, 2)) # Takes ~20s with GPU, ~12mins with CPU

Time to create vector store [mins]: 0.18


## **2.3 Query the Chroma Vector Database**
Test queries will be run to verify that the PDF document has been embedded successfully. The vector store embeds each text query automatically with the same text embedding model, so the database can be interrogated directly with plain text.

In [7]:
question1 = "What is the name and address of the insurance company?"
vector_query_1 = vector_store.similarity_search(question1)
print(vector_query_1[0].page_content)

The Administrator
cycleGuard is a trading style of Thistle Insurance Services Limited. 
Thistle Insurance Services Limited is authorised and regulated by the Financial Conduct Authority. FRN 310419.  
Registered in England No. 00338645. 
Registered office: Rossington’s Business Park, West Carr Road, Retford, Nottinghamshire, DN22 7SW
Thistle Insurance Services Limited is part of the PIB Group.
The Underwriter
This insurance is underwritten by UK General Insurance Limited on behalf of Watford Insurance Company Europe 
Limited. Watford Insurance Company Europe Limited is a Gibraltar based insurance company with its registered office at; 
PO Box 1338, First Floor, Grand Ocean Plaza, Ocean Village, Gibraltar.  
UK General Insurance Limited is authorised and regulated by the Financial Conduct Authority. Firm Reference No. 
310101. You  can check our  details on the Financial Services Register https://register.fca.org.uk/
Watford Insurance Company Europe Limited is authorised and regulated b

The output from the first query is verbose, but it does contain all of the pertinent information from the document.

In [8]:
question2 = "What are the excess amounts?"
vector_query_2 = vector_store.similarity_search(question2)
print(vector_query_2[0].page_content)

Details of Y our Excess
All claims for insured items are subject to the following excess  unless otherwise stated on your  Insurance Schedule:
Claim amount Excess payable
£0 - £1,499 £50
£1,500 - £2,999 £100
£3,000 - £4,999 £150
£5,000 or above £200
Public Liability claims are subject to a £500 excess for all third-party property damage. 
7IMPORTANT 
INFORMATION


The second query has also returned the key information from the document relating to excess amounts.

# **3. Create an AI Tool Using Langchain**
Next, an LLM will be used to summarise the output from the Chroma vector store, which will then provide concise answers to user queries relating to the PDF document. This will be achieved through the use of a langchain retrieval chain, which chains together the text embedding model, vector store and LLM.

The LLM used for this part of the project is Google's "flan-t5-large", which can be viewed at this link: https://huggingface.co/google/flan-t5-large. The Flan-T5 models are reported to perform well relative to much larger models, and after some testing this one was selected for use in this project.

The purpose of this LLM is to summarise the returned information from the vector embedding database.
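
The retrieval chain in this project uses the "stuff" chain type, which places all of the retrieved chunks into a single prompt together with the question. The sketch below shows that idea in plain Python. The `build_stuff_prompt` helper, the prompt wording and the example chunks are assumptions for illustration; Langchain uses its own prompt template internally.

```python
def build_stuff_prompt(question, retrieved_chunks):
    """Concatenate every retrieved chunk into one context block ahead of the question."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical chunks standing in for the top-ranked results from the vector store.
example_chunks = [
    "Claim amount £0 - £1,499: excess £50.",
    "Claim amount £1,500 - £2,999: excess £100.",
    "Public Liability claims are subject to a £500 excess.",
]

prompt = build_stuff_prompt("What are the excess amounts?", example_chunks)
print(prompt)
```

Because every chunk is inserted into one prompt, this approach is simple but limited by the context length of the LLM, which is one reason to keep the number of retrieved chunks small.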


In [70]:
llm_hf = HuggingFaceHub(repo_id="google/flan-t5-large", model_kwargs={"temperature":0.2, "max_length":512})

retriever = vector_store.as_retriever(search_kwargs={"k": 3})

chain_qa = RetrievalQA.from_chain_type(
    llm=llm_hf,
    chain_type="stuff",
    retriever=retriever,
    input_key = 'question')

In [71]:
query1 = 'What are the excess amounts?'
chain_qa.invoke({"question": query1})

{'question': 'What are the excess amounts?',
 'result': '£0 - £1,499 £50 £1,500 - £2,999 £100 £3,000 - £4,999 £150 £5,000 or above £200 Public Liability claims are subject to a £500 excess for all third-party property damage.'}

In [73]:
query2 = 'What is the name of the insurance company?'
chain_qa.invoke({"question": query2})

{'question': 'What is the name of the insurance company?',
 'result': 'Thistle Insurance Services Limited'}

In [74]:
query3 = 'What are all of the circumstances where the insurance not valid?'
chain_qa.invoke({"question": query3})

{'question': 'What are all of the circumstances where the insurance not valid?',
 'result': '1. You engaging in any illegal or criminal act. 2. You being under the influence of drugs, solvents or alcohol, or the injection or ingestion of any substance except those prescribed by a registered medical doctor. 3. Suicide, attempted suicide or deliberate injury to you or putting yourself in unnecessary danger (unless trying to save human life). Pressure Waves This policy does not provide cover for claims, contributed to or caused by pressure waves from aircraft or other aerial devices travelling at supersonic speeds. Riot, Civil Commotion or Strikes This policy does not provide cover for claims, contributed to or caused by riot, civil commotion or strikes.'}

All of the above queries have pulled the relevant information from the insurance document.

# **4. Streamlit Application**
The langchain retrieval chain will be used within a streamlit application, which will provide a much more professional user interface for the document querying AI tool.



In [3]:
%%writefile streamlit_app.py

# Note: use GPU acceleration for this Streamlit app, because embedding the document is much faster on a GPU

import streamlit as st
import os
from langchain.vectorstores import Chroma
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.llms import HuggingFaceHub
from langchain.chains import RetrievalQA

# Set HuggingFace private API Key
os.environ['HUGGINGFACEHUB_API_TOKEN'] = 'your_private_API_key' # Your private huggingface API key

# Huggingface text embedding model:
text_embedding_model = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl")

# Load the PDF document:
input_document = PyPDFLoader('cycleGuard Policy Wording 2021-03.pdf') # Ensure this PDF document file has already been loaded into Colab working directory!
# Split pages from the PDF
pages = input_document.load_and_split()
# Load documents into chroma embedding database:
vector_store = Chroma.from_documents(pages, text_embedding_model, collection_name='cycle_insurance')

# Huggingface LLM
LLM = HuggingFaceHub(repo_id="google/flan-t5-large", model_kwargs={"temperature":0.2, "max_length":512})
retriever = vector_store.as_retriever(search_kwargs={"k": 3})
retrieval_QA_chain = RetrievalQA.from_chain_type(
    llm=LLM,
    chain_type="stuff",
    retriever=retriever,
    input_key = 'question')

#-----------------Streamlit App Functionality----------------------#
st.title('Using HuggingFace Open Source LLMs to Answer Queries on an Insurance Document') # App title
user_input = st.text_input('Enter your query here:') # User input box
if user_input: # If user enters a query via the app interface, pass the query to Huggingface LLM
    HF_response = retrieval_QA_chain.invoke({"question": user_input})
    st.write(HF_response["result"]) # Display the LLM response


Overwriting streamlit_app.py
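
The app script above stores the API token placeholder directly in the source file. A minimal sketch of an alternative is given below, assuming the token is exported as the `HUGGINGFACEHUB_API_TOKEN` environment variable before Streamlit starts. `get_hf_token` is a hypothetical helper and not part of any library.

```python
import os


def get_hf_token(env=None):
    """Return the Huggingface token from the environment, or raise a clear error."""
    env = os.environ if env is None else env
    token = env.get("HUGGINGFACEHUB_API_TOKEN", "").strip()
    # Treat a missing value or the unchanged placeholder as "not configured".
    if not token or token == "your_private_API_key":
        raise RuntimeError("Set HUGGINGFACEHUB_API_TOKEN before starting the app.")
    return token


# Usage with an explicit mapping so the sketch runs without a real token.
example_token = get_hf_token({"HUGGINGFACEHUB_API_TOKEN": "hf_example_placeholder"})
print("Token loaded:", bool(example_token))
```

Failing early with a clear message avoids a less obvious authentication error later, when the first request is sent to the Huggingface hub.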


The next cell runs the Streamlit app and prints a hyperlink in the Colab notebook. Click the URL, then enter the password (the IP address printed by the cell) on that page to reach the Streamlit app.

In [None]:
print("Password/Endpoint IP for localtunnel is:", urllib.request.urlopen('https://ipv4.icanhazip.com').read().decode('utf8').strip("\n"))
!streamlit run streamlit_app.py &>/content/logs.txt & npx localtunnel --port 8501 & curl ipv4.icanhazip.com

Password/Endpoint IP for localtunnel is: 34.125.193.56
34.125.193.56
npx: installed 22 in 2.286s
your url is: https://bumpy-cups-wonder.loca.lt


# **5. Conclusion**
This notebook has shown how a document querying AI tool can be created using open source models that are accessed via API calls to the Huggingface hub. This has been further developed into a Streamlit app, which provides a clean user interface and removes the need for the user to execute Python code directly.
For further work, different LLMs could be investigated, chat memory could be added and displayed, or support for multiple documents could be added.