# **An Open Source Document Querying AI Tool - Streamlit App:**
## **Utilising Large Language Models (LLMs), Chroma, Hugging Face, LangChain and Streamlit.**

This notebook details a Python project that uses open source Large Language Models (LLMs) to answer users' questions about a long PDF document. It is built with the LangChain, Chroma vector store, Hugging Face and Streamlit libraries.

This notebook contains only the code required to run the Streamlit app from a Google Colab notebook, which should have GPU acceleration enabled.

Ensure that the 25-page cycleGuard bicycle insurance PDF document has been uploaded to the notebook working directory before running the cells.
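
A quick pre-flight check can catch a missing upload before the app spends time loading the embedding model. The snippet below is a minimal sketch: the helper name `find_missing_files` is hypothetical (it is not part of any library used here), and the filename is the one passed to `PyPDFLoader` in the app cell, so adjust it if your copy is named differently.

```python
from pathlib import Path

# Filename used by the app cell; change this if your PDF has a different name.
PDF_NAME = "cycleGuard Policy Wording 2021-03.pdf"


def find_missing_files(filenames, directory="."):
    """Return the names in `filenames` that are not regular files in `directory`.

    Hypothetical helper for illustration only.
    """
    base = Path(directory)
    return [name for name in filenames if not (base / name).is_file()]


if __name__ == "__main__":
    missing = find_missing_files([PDF_NAME])
    if missing:
        print("Upload these files before running the app:", missing)
    else:
        print("PDF found; ready to run the app.")
```

Running this in a cell before the app cell gives an immediate message, rather than a loader error buried in the Streamlit log file.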

In [None]:
!pip install langchain streamlit langchain-community InstructorEmbedding sentence_transformers==2.2.2 pypdf chromadb
!npm install localtunnel
import urllib.request  # urllib.request is a submodule and must be imported explicitly

In [2]:
%%writefile streamlit_app.py

## Note: Use GPU acceleration for this Streamlit app; embedding the document is much faster on a GPU

import streamlit as st
import os
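# These integrations live in the langchain-community package installed above
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import HuggingFaceInstructEmbeddings
from langchain_community.llms import HuggingFaceHub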
from langchain.chains import RetrievalQA

# Set HuggingFace private API Key
os.environ['HUGGINGFACEHUB_API_TOKEN'] = 'your_private_API_key' # Your private huggingface API key

# Streamlit reruns this script on every interaction, so build the chain once and cache it.
# Without caching, each query would reload the model and re-embed the whole PDF.
@st.cache_resource
def build_retrieval_qa_chain():
    # Huggingface text embedding model:
    text_embedding_model = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl")

    # Load the PDF document:
    input_document = PyPDFLoader('cycleGuard Policy Wording 2021-03.pdf') # Ensure this PDF document file has already been loaded into Colab working directory!
    # Load the PDF and split it into chunks
    pages = input_document.load_and_split()
    # Embed the chunks and store them in the Chroma vector store:
    vector_store = Chroma.from_documents(pages, text_embedding_model, collection_name='cycle_insurance')

    # Huggingface LLM
    LLM = HuggingFaceHub(repo_id="google/flan-t5-large", model_kwargs={"temperature":0.2, "max_length":512})
    # Retrieve the 3 most similar chunks for each query
    retriever = vector_store.as_retriever(search_kwargs={"k": 3})
    return RetrievalQA.from_chain_type(
        llm=LLM,
        chain_type="stuff",
        retriever=retriever,
        input_key='question')

retrieval_QA_chain = build_retrieval_qa_chain()

#-----------------Streamlit App Functionality----------------------#
st.title('Using HuggingFace Open Source LLMs to Answer Queries on an Insurance Document') # App title
user_input = st.text_input('Enter your query here:') # User input box
if user_input: # If user enters a query via the app interface, pass the query to Huggingface LLM
    HF_response = retrieval_QA_chain.invoke({"question": user_input})
    st.write(HF_response["result"]) # Display the LLM response


Writing streamlit_app.py
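
The app cell above combines a retriever with `k=3` and a `"stuff"` chain: the three chunks most similar to the query are retrieved and placed together into a single prompt for the LLM. The sketch below illustrates that idea in plain Python on hand-made toy vectors. It is not how Chroma or LangChain implement it internally: cosine similarity is assumed here for illustration (the metric Chroma actually uses depends on its configuration), the prompt layout is a simplification rather than LangChain's real template, and all function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, chunk_vecs, chunks, k=3):
    """Return the k chunks whose vectors are most similar to the query vector."""
    scored = sorted(
        zip(chunks, chunk_vecs),
        key=lambda pair: cosine_similarity(query_vec, pair[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in scored[:k]]


def stuff_prompt(question, retrieved_chunks):
    """'Stuff' all retrieved chunks into one prompt (simplified, assumed layout)."""
    context = "\n\n".join(retrieved_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    # Toy 2-D "embeddings"; a real embedding model produces far longer vectors.
    chunks = ["theft cover", "claims process", "exclusions", "contact details"]
    chunk_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
    query_vec = [1.0, 0.0]

    retrieved = top_k_chunks(query_vec, chunk_vecs, chunks, k=3)
    print(stuff_prompt("Is theft covered?", retrieved))
```

Because every retrieved chunk is placed in one prompt, `k` trades answer coverage against prompt length, which matters for a model with a limited input size.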


In [None]:
print("Password for localtunnel:", urllib.request.urlopen('https://ipv4.icanhazip.com').read().decode('utf8').strip("\n"))
!streamlit run streamlit_app.py &>/content/logs.txt & npx localtunnel --port 8501 & curl ipv4.icanhazip.com

Password for localtunnel: 34.142.236.232
34.142.236.232
[K[?25hnpx: installed 22 in 1.515s
your url is: https://lazy-lizards-hug.loca.lt
