# **An OpenAI Document Querying AI Tool:**
## **Utilising OpenAI Large Language Models (LLMs), Chroma, OpenAI API, Langchain and Streamlit.**

This notebook details a Python project that uses OpenAI Large Language Models (LLMs) to answer users' questions about a long PDF document. This is achieved using the Langchain, Chroma vector store, OpenAI API and Streamlit libraries.
The document used for this project is a 25-page bicycle insurance policy from cycleGuard.  

This project has been implemented within a Google Colab cloud notebook environment, so others can run the code without downloading it or using local compute. The notebook runs on a CPU-only runtime.

# **1. Import Required Libraries**

In [None]:
!pip install langchain langchain-openai streamlit langchain-community pypdf chromadb

In [None]:
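import os
import time
import urllib.request  # Import the submodule explicitly; urlopen is used in section 4
import langchain
from langchain_openai import OpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma  # langchain.vectorstores is a deprecated import path
from langchain_community.document_loaders import PyPDFLoader
from langchain.chains import RetrievalQA
!npm install localtunnel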

# **2. Text Embedding Model & Chroma Vector Database**
The text embedding model is used to process the input PDF text document so that sentences and words can be converted into a series of numerical vectors that can be stored in the vector database. Chroma is the open source vector database that will be used for this project (https://www.trychroma.com/).

User queries are embedded with the same text embedding model, which allows the vector database to be searched for closely related embeddings. This implements a semantic search through the PDF document rather than a keyword search, which should provide improved results.
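
As a rough illustration of the search step, the sketch below ranks a few stored vectors by their cosine similarity to a query vector. The three-dimensional vectors and chunk labels are made up for illustration, and cosine similarity is only one common closeness measure; Chroma applies its own indexing and distance metric internally, so this is not a description of how Chroma is implemented.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings" for three document chunks.
toy_store = {
    "excess table": [0.9, 0.1, 0.0],
    "complaints procedure": [0.1, 0.9, 0.1],
    "theft cover": [0.0, 0.2, 0.9],
}

def top_k(query_vector, store, k=2):
    """Rank stored chunks by similarity to the query, most similar first."""
    ranked = sorted(
        store.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [label for label, _ in ranked[:k]]

# A hypothetical query vector that lies close to the "excess table" chunk.
toy_query = [0.8, 0.2, 0.1]
print(top_k(toy_query, toy_store))
```

A keyword search would need the query and the chunk to share words, whereas this ranking depends only on how close the vectors are.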

The OpenAI text embedding model that will be used is "text-embedding-ada-002", which can be found at this link: https://platform.openai.com/docs/models/embeddings. This model was chosen for its low cost and suitability for prototyping NLP applications.

## **2.1 Load the PDF Document**

In [5]:
# Set your OpenAI API key to allow access to models:
os.environ['OPENAI_API_KEY'] = 'your_private_openAI_API_key'  # Required for the text embedding model and the LLM

text_embedding_model = OpenAIEmbeddings(model='text-embedding-ada-002') # Use the OpenAI text embedding model

# Load the PDF document:
input_document = PyPDFLoader('cycleGuard Policy Wording 2021-03.pdf') # Ensure this PDF document file has already been loaded into Colab working directory

# Split pages from the input PDF
pdf_pages = input_document.load_and_split()
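
The cell above sets the API key directly in the code, which makes it easy to leak when the notebook is shared. One possible alternative, sketched below, is to read the key from an environment variable that is already set and to prompt only when it is missing. The helper name `get_openai_api_key` and its parameters are hypothetical and not part of any library.

```python
import os
from getpass import getpass

def get_openai_api_key(env_var="OPENAI_API_KEY", prompt=getpass):
    """Return the API key from the environment, prompting only if it is absent."""
    key = os.environ.get(env_var)
    if not key:
        # getpass hides the typed key so it is not echoed into the notebook output.
        key = prompt(f"Enter a value for {env_var}: ")
        os.environ[env_var] = key
    return key
```

Calling `get_openai_api_key()` once before creating the embedding model would leave `OPENAI_API_KEY` set for the rest of the session without the key appearing in the notebook source.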

## **2.2 Embed the PDF Document with Chroma Vector Database**
The input PDF document will be embedded in the next cell using the OpenAI text embedding model and the Chroma database.

In [6]:
# Convert the uploaded document to vector format and print out the time taken:
t1 = time.time()

# The next line will consume ~$0.001 of OpenAI account credit [12,008 tokens]
vector_store = Chroma.from_documents(pdf_pages, text_embedding_model, collection_name='cycle_insurance')

t2 = time.time()
time_taken = t2 - t1
print("Time to create vector store [mins]:", round(time_taken/60, 2)) # Takes ~2s with CPU

Time to create vector store [mins]: 0.01
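
The cost quoted in the cell above can be checked with simple arithmetic. The sketch below assumes a price of $0.0001 per 1,000 tokens for text-embedding-ada-002; this figure is an assumption and should be checked against current OpenAI pricing. The function name `embedding_cost_usd` is illustrative.

```python
def embedding_cost_usd(token_count, usd_per_1k_tokens=0.0001):
    """Estimate the embedding cost; the default price is an assumption, not a quoted rate."""
    return token_count / 1000 * usd_per_1k_tokens

# 12,008 is the token count quoted in the cell above.
print(f"${embedding_cost_usd(12_008):.4f}")
```

Under that assumed price, the estimate is consistent with the ~$0.001 stated in the code comment.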


## **2.3 Query the Chroma Vector Database**
Test queries will be run to verify that the PDF document has been embedded successfully. Chroma automatically converts each text query into an embedding using the supplied text embedding model, so that the database can be interrogated.

In [7]:
question1 = "What is the name of the insurance company?"
store_similarity_1 = vector_store.similarity_search(question1)
print(store_similarity_1[0].page_content) # The most relevant result to the query is at index 0

It is our intention to give you  the best possible service however if you  do have any cause for complaint about this 
insurance or the handling of any claim you  should follow the complaints procedure below:
Policy Sales
If your complaint is about the sale of your policy, please email: complaints@Guardcover.co.uk 
call: 0333 004 3444
or write to:  
cycleGuard, Thistle Insurance Services Limited, Southgate House, Southgate Street, Gloucester, GL1 1UB
Policy Claims
If your complaint is about a claim, please email: claims@Guardcover.co.uk 
call: 0333 004 3444
or write to:  
Claims Department, Thistle Insurance Services Limited, Southgate House, Southgate Street, Gloucester, GL1 1UB
In all correspondence please state that your insurance is underwritten by UK General Insurance and quote your unique 
policy number from your policy schedule. 
Following our complaints procedure does not affect your  legal rights as a consumer. For further information you  can 
contact the Citizens Advice Bure

The output from the first query is verbose, but it does contain the pertinent information from the document relating to Thistle Insurance Services.

In [8]:
question2 = "What are the excess amounts?"
store_similarity_2 = vector_store.similarity_search(question2)
print(store_similarity_2[0].page_content)

Details of Y our Excess
All claims for insured items are subject to the following excess  unless otherwise stated on your  Insurance Schedule:
Claim amount Excess payable
£0 - £1,499 £50
£1,500 - £2,999 £100
£3,000 - £4,999 £150
£5,000 or above £200
Public Liability claims are subject to a £500 excess for all third-party property damage. 
7IMPORTANT 
INFORMATION


The second query has also returned the key information from the document relating to excess amounts.
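
The raw chunks printed above contain extraction artefacts from the PDF, such as doubled spaces. If cleaner text is wanted for display, a small helper along the following lines could be applied to `page_content` before printing. This is an optional sketch and not a step used in this notebook; the function name `tidy_extracted_text` is hypothetical.

```python
import re

def tidy_extracted_text(text):
    """Collapse runs of spaces/tabs and trim each line; line breaks are preserved."""
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(lines)

# A fragment of the first query's output, including its doubled spaces.
sample = "It is our intention to give you  the best possible service however if you  do have any cause"
print(tidy_extracted_text(sample))
```

The helper only normalises whitespace; it does not attempt to repair split words or remove page furniture from the PDF.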

# **3. Create an AI Tool Using Langchain**
Next, an LLM will be used to summarise the output from the Chroma vector store, which will then provide concise answers to user queries relating to the PDF document. This will be achieved through the use of a langchain retrieval chain, which chains together the OpenAI text embedding model, vector store and OpenAI LLM.

The OpenAI LLM that will be used for this part of the project is called "gpt-3.5-turbo-instruct", which can be viewed at this link: https://platform.openai.com/docs/models/gpt-3-5-turbo

The purpose of this LLM is to concisely summarise the information returned from the vector database. This model was chosen as it is the default OpenAI GPT model, while also being a fast and inexpensive model that is suited to basic tasks.
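
To make the retrieval chain less of a black box, the sketch below shows the general idea behind a "stuff"-style chain: every retrieved chunk is placed into a single prompt together with the question, and that prompt is sent to the LLM. The prompt wording, the function name `build_stuff_prompt` and the abbreviated example chunks are illustrative assumptions; Langchain uses its own prompt template internally.

```python
def build_stuff_prompt(question, retrieved_chunks):
    """Join all retrieved chunks into one context block, then append the question."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say that you do not know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Hypothetical, abbreviated chunks standing in for the retriever output.
example_chunks = [
    "Claim amount £0 - £1,499: excess payable £50.",
    "Public Liability claims are subject to a £500 excess for all third-party property damage.",
]
prompt = build_stuff_prompt("What are the excess amounts?", example_chunks)
print(prompt)
```

Because all chunks share one prompt, the number of retrieved chunks directly affects both the prompt length and the cost of each query.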


In [20]:
llm_openAI = OpenAI(model='gpt-3.5-turbo-instruct', temperature=0.2) # This uses the GPT-3.5-turbo-instruct model

retriever = vector_store.as_retriever(search_kwargs={"k": 3})

# Create the retrieval QA chain:
chain_qa = RetrievalQA.from_chain_type(
    llm=llm_openAI,
    chain_type="stuff",  # "stuff" places all retrieved chunks into a single prompt
    retriever=retriever,
    input_key='question')

Each query that is passed through the chain will consume approximately $0.002 OpenAI account credit.

In [21]:
query1 = 'What are the excess amounts?'
chain_qa.invoke({"question": query1})

{'question': 'What are the excess amounts?',
 'result': ' The excess amounts are £50 for claims up to £1,499, £100 for claims between £1,500 and £2,999, £150 for claims between £3,000 and £4,999, and £200 for claims of £5,000 or above. Public Liability claims have a £500 excess for third-party property damage.'}

In [22]:
query1b = 'Are there any excess amounts for public liability claims?'
chain_qa.invoke({"question": query1b})

{'question': 'Are there any excess amounts for public liability claims?',
 'result': ' Yes, there is a £500 excess for all third-party property damage for public liability claims.'}

In [23]:
query2 = 'What is the name of the insurance company?'
chain_qa.invoke({"question": query2})

{'question': 'What is the name of the insurance company?',
 'result': ' The insurance company is Thistle Insurance Services Limited.'}

In [24]:
query3 = 'What are all of the circumstances where the insurance is not valid?'
chain_qa.invoke({"question": query3})

{'question': 'What are all of the circumstances where the insurance is not valid?',
 'result': ' The insurance is not valid in the following circumstances: \n1. Any loss or damage that occurred prior to the commencement of the insurance.\n2. Claims or incidents caused by illegal or criminal acts, being under the influence of drugs or alcohol, or intentional self-harm.\n3. Claims caused by pressure waves from supersonic aircraft or devices.\n4. Claims caused by riot, civil commotion, or strikes.\n5. Fraudulent acts by the insured party.\n6. If there is another insurance policy covering the same loss, damage, or liability.\n7. Failure to take reasonable care to prevent accidental damage, theft, or comply with statutory obligations.\n8. If the insured value chosen is less than the full replacement value of the insured items.\n9. Any costs not approved by the insurance company.\n10. More than 3 claims or an aggregate total of £25,000 in any one period of insurance.\n11. Accidents or incidents

# **4. Streamlit Application**
The Langchain retrieval chain will be used within a Streamlit application, which provides a more polished user interface for the document querying AI tool.



In [30]:
%%writefile streamlit_app.py

import streamlit as st
import os
from langchain_community.vectorstores import Chroma  # langchain.vectorstores is a deprecated import path
from langchain_community.document_loaders import PyPDFLoader
from langchain_openai import OpenAI, OpenAIEmbeddings
from langchain.chains import RetrievalQA

# Set OpenAI private API Key:
os.environ['OPENAI_API_KEY'] = 'your_private_OpenAI_API_key'

# OpenAI text embedding model:
text_embedding_model = OpenAIEmbeddings(model='text-embedding-ada-002')

# Load the PDF document:
input_document = PyPDFLoader('cycleGuard Policy Wording 2021-03.pdf') # Ensure this PDF document file has already been loaded into Colab working directory!
# Split pages from the PDF
pages = input_document.load_and_split()
# Load documents into chroma embedding database:
vector_store = Chroma.from_documents(pages, text_embedding_model, collection_name='cycle_insurance')

# OpenAI LLM
LLM = OpenAI(model='gpt-3.5-turbo-instruct', temperature=0.2)
retriever = vector_store.as_retriever(search_kwargs={"k": 3})
retrieval_QA_chain = RetrievalQA.from_chain_type(
    llm=LLM,
    chain_type="stuff",
    retriever=retriever,
    input_key = 'question')

#-----------------Streamlit App Functionality----------------------#
st.title('Using OpenAI LLMs to Answer Queries on an Insurance Document') # App title
user_input = st.text_input('Enter your query here:') # User input box
if user_input: # If user enters a query via the app interface, pass the query to OpenAI LLM
    openAI_response = retrieval_QA_chain.invoke({"question": user_input})
    st.write(openAI_response["result"]) # Display the LLM response


Overwriting streamlit_app.py


The next cell runs the Streamlit app and exposes it through a localtunnel URL that will be displayed in the Colab notebook. Click the URL and then enter the password (the IP address printed by the cell) on that page to reach the Streamlit app.

In [None]:
print("Password/Endpoint IP for localtunnel is:", urllib.request.urlopen('https://ipv4.icanhazip.com').read().decode('utf8').strip("\n"))
!streamlit run streamlit_app.py &>/content/logs.txt & npx localtunnel --port 8501 & curl ipv4.icanhazip.com

Password/Endpoint IP for localtunnel is: 35.247.123.112
35.247.123.112
npx: installed 22 in 2.575s
your url is: https://busy-pigs-turn.loca.lt


# **5. Conclusion**
This notebook has shown how a document querying AI tool can be created using OpenAI models that are accessed via OpenAI API calls. This has been further developed into a Streamlit app, which provides a clean user interface and removes the need for the user to execute Python code directly.
As further work, an implementation of this project using open-source models in place of the OpenAI API could be investigated.