Install the libraries

In [0]:
%pip install -U gradio==4.31.4 boto3==1.34.108 langchain==0.2.0 langchain-community==0.2.0 pypdf==4.2.0 sentence-transformers==2.7.0 chromadb mlflow==2.11.0 -q
# %pip install -U gradio==4.31.4 boto3==1.34.108 langchain==0.2.0 langchain-community==0.2.0 pypdf==4.2.0 sentence-transformers==2.7.0 faiss-cpu==1.8.0 mlflow==2.11.0 -q
dbutils.library.restartPython()

Note: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
databricks-feature-engineering 0.2.1 requires pyspark<4,>=3.1.2, which is not installed.
ydata-profiling 4.2.0 requires pydantic<2,>=1.8.1, but you have pydantic 2.8.2 which is incompatible.
spacy 3.7.2 requires typer<0.10.0,>=0.3.0, but you have typer 0.12.3 which is incompatible.


The next cell imports the classes used to build the index, from the langchain_community and langchain_text_splitters packages:

PyPDFLoader (from langchain_community.document_loaders) reads a PDF file and returns one Document per page, with the source path and page index stored in each document's metadata.

CharacterTextSplitter and RecursiveCharacterTextSplitter (from langchain_text_splitters) split text into chunks. CharacterTextSplitter splits on a single separator. RecursiveCharacterTextSplitter works through a list of separators in order (by default paragraph breaks, then line breaks, then spaces, then individual characters) until the chunks fit the requested size, so it tends to keep paragraphs and sentences together. Only RecursiveCharacterTextSplitter is used in this notebook.

SentenceTransformerEmbeddings and HuggingFaceEmbeddings (from langchain_community.embeddings.sentence_transformer) generate embeddings, which are dense vector representations of text that capture semantic meaning. In langchain-community, SentenceTransformerEmbeddings is an alias of HuggingFaceEmbeddings, and both wrap the Sentence Transformers library. This notebook uses SentenceTransformerEmbeddings.

Chroma (from langchain_community.vectorstores) is a vector store: it stores the embeddings and retrieves the ones most similar to a query, which is what makes semantic search possible.

Together, these imports cover the indexing pipeline: load the PDF, split its text into chunks, embed each chunk, and store the embeddings for retrieval.

In [0]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import CharacterTextSplitter, RecursiveCharacterTextSplitter

from langchain_community.embeddings.sentence_transformer import (
  SentenceTransformerEmbeddings, HuggingFaceEmbeddings
)
from langchain_community.vectorstores import Chroma
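To make the difference between the two splitters concrete, here is a minimal pure-Python sketch of the recursive idea. It is a simplification written for illustration, not the LangChain implementation: the function name `recursive_split` is hypothetical, and it ignores chunk overlap. It tries the coarsest separator first and falls back to finer ones only for pieces that are still too long.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified sketch: split text into pieces of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # No separators left: fall back to a hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    if sep not in text:
        # This separator does not occur, so try the next, finer one
        return recursive_split(text, chunk_size, finer)
    chunks, current = [], ""
    for part in text.split(sep):
        # Greedily merge neighbouring parts while they still fit in one chunk
        candidate = current + sep + part if current else part
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(part) <= chunk_size:
            current = part
        else:
            # The part alone is too long: split it with the finer separators
            current = ""
            chunks.extend(recursive_split(part, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "para one\n\nsecond paragraph is long"
pieces = recursive_split(sample, chunk_size=12)
print(pieces)
```

A splitter restricted to the single separator "\n\n" could not break the second paragraph any further, whereas the recursive version falls back to spaces and keeps every piece within the limit.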

#### Create Index Part

##### Load PDF

In [0]:

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("/Volumes/sarbani-rag-dbdemo/customer_support_bot/pdf_insurance/insurance-agent-faq.pdf")
pages = loader.load_and_split(text_splitter=RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200))
len(pages)

108
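The chunk_size=1000 and chunk_overlap=200 settings above mean that consecutive chunks share some text, so a sentence cut at a chunk boundary still appears whole in one of them. The sketch below illustrates the idea with a plain character window. It is an assumption-laden simplification (the helper name `chunk_with_overlap` is hypothetical), since the real splitter applies overlap at separator boundaries rather than at fixed character offsets.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows where each window repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    # Each new window starts this many characters after the previous one
    step = chunk_size - chunk_overlap
    # Stop once the remaining text is already covered by the previous window
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


windows = chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(windows)
```

A larger overlap reduces the chance of losing context at chunk boundaries, at the cost of more chunks to embed and store.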

In [0]:
pages[1]

Document(metadata={'source': '/Volumes/sarbani-rag-dbdemo/customer_support_bot/pdf_insurance/insurance-agent-faq.pdf', 'page': 0}, page_content='term plans, these plans are valid for the entirety of the policyholder’s life. Both types of policies have their own perks. You must assess your needs ﬁrst to decide which is better. If you seek high coverage at low premium rates, then term plans are a better option. However, if you want life cover as well as savings beneﬁts with a longer tenure, a whole life plan will be a perfect pick. InsuranceLife InsuranceCan I get life insurance at 62 years?Yes, you can buy life insurance at 62 years. Most life insurance policies have a maximum entry age ranging between 55 years and 60 years. However, there are numerous policies that are designed speciﬁcally for senior citizens. Such plans are useful for individuals who haven’t invested in a plan earlier in life. Certain plans for senior citizens also oﬀer retirement beneﬁts and pay outs. InsuranceLife I

In [0]:
# Initialize an embedding function using the SentenceTransformer model "all-MiniLM-L6-v2"
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# Create a Chroma object from the loaded document pages with the specified embedding function
db = Chroma.from_documents(pages, embedding_function)

2024-07-12 22:57:35.809073: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-07-12 22:57:35.855114: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-12 22:57:35.855159: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-12 22:57:35.855186: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-07-12 22:57:35.863716: I tensorflow/core/platform/cpu_feature_g

#### Read part

In [0]:
# Initialize a retriever from the Chroma object with a specified number of documents to retrieve (k=2)
retriever = db.as_retriever(search_kwargs={"k": 2})

# Retrieve the top 2 chunks relevant to the query about life insurance costs and the factors affecting the price
docs = retriever.invoke("What’s the Average Cost of a Life Insurance Plan and what Affects the Price?")

# Get the number of documents retrieved
len(docs)

2
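Conceptually, the retriever embeds the query and returns the k stored chunks whose vectors are closest to it. The following sketch shows that ranking step with hand-made three-dimensional vectors. Everything in it is illustrative: the vectors and texts are invented, the helper names are hypothetical, and cosine similarity is used for simplicity, whereas the metric Chroma actually applies depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, indexed, k=2):
    """Return the texts of the k entries most similar to query_vec."""
    ranked = sorted(indexed, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy "index": (chunk text, made-up embedding) pairs
toy_index = [
    ("premium late charges", [0.9, 0.1, 0.0]),
    ("tax benefits", [0.1, 0.9, 0.1]),
    ("policy reinstatement", [0.7, 0.2, 0.3]),
]
results = top_k([1.0, 0.0, 0.1], toy_index, k=2)
print(results)
```

A real vector store avoids scoring every entry by using an approximate nearest-neighbour index, but the ranking principle is the same.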

In [0]:
print(docs[1])

page_content='Loan Facility: People who obtain life insurance plans will have the option of borrowing money against it, which may assist them to cover unforeseen expenses as they progress through life without jeopardising the policy's bene ﬁts.
 Redemption of Mortgage: The ﬁnest tool for covering loans and mortgages taken out by the policyholder is a life insurance policy. The insurance can be used to pay oﬀ the loan or mortgage if there is ever an unanticipated circumstance that prevents the policyholder from being able to repay his or her loan or mortgage. In this case, the grieving family members will not be responsible for repayment.
Tax Beneﬁts: Life insurance policies oﬀer attractive tax beneﬁts and help you save a signiﬁcant amount of money which would otherwise be spent on taxes.
What’s the Average Cost of a Life Insurance Plan and what Aﬀects the Price?' metadata={'page': 2, 'source': '/Volumes/sarbani-rag-dbdemo/customer_support_bot/pdf_insurance/insurance-agent-faq.pdf'}


#### Generate chat or other task responses using an LLM

In [0]:
%pip install langchain_openai  # Install the langchain_openai package to use OpenAI models within the notebook

Collecting langchain_openai
  Using cached langchain_openai-0.1.16-py3-none-any.whl (46 kB)
Collecting openai<2.0.0,>=1.32.0
  Using cached openai-1.35.13-py3-none-any.whl (328 kB)
Collecting tiktoken<1,>=0.7
  Using cached tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
Installing collected packages: tiktoken, openai, langchain_openai
  Attempting uninstall: tiktoken
    Found existing installation: tiktoken 0.5.2
    Not uninstalling tiktoken at /databricks/python3/lib/python3.10/site-packages, outside environment /local_disk0/.ephemeral_nfs/envs/pythonEnv-c0b852a0-c1e1-4453-b60d-1e0f2c26c06c
    Can't uninstall 'tiktoken'. No files were found to uninstall.
  Attempting uninstall: openai
    Found existing installation: openai 0.28.1
    Not uninstalling openai at /databricks/python3/lib/python3.10/site-packages, outside environment /loca

In [0]:
# from langchain_openai import AzureOpenAI, AzureChatOpenAI

# azure_openai_llm = AzureChatOpenAI(
#     openai_api_version="2023-05-15",
#     azure_deployment="dbdemo-gpt35",
#     model_name="gpt-35-turbo",
#     temperature=0
# )

# # # llm = AzureChatOpenAI(
# # #     deployment_name="dbdemo-gpt35",
# # #     model_name="gpt-35-turbo",
# # #     temperature=0,
# # #     max_tokens=20
# # # )

#### Generate chat or other task responses using hosted LLMs

In [0]:
# from langchain.chat_models import ChatDatabricks

# chat_model = ChatDatabricks(endpoint="databricks-mixtral-8x7b-instruct", max_tokens=256)
# print(chat_model.invoke("What is the bank's yearly performance?"))

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File <command-2668253031794707>, line 4
      1 #loader = PyPDFLoader("/Volumes/sarbani_dbrx_catalog/dbrx_llm_schema/customer_doc_demo/hdfc_news_release.pdf")
----> 4 loader = PyPDFLoader("/Volumes/sarbani_dbrx_catalog/dbrx_llm_schema/customer_doc_demo/hdfc_earning.pdf")
      5 pages = loader.load_and_split(text_splitter=RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200))
      6 len(pages)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-1136a816-43f9-4303-ba11-22501c337a95/lib/python3.10/site-packages/langchain_community/docume

#### Use the Databricks DBRX model

In [0]:
# Import ChatDatabricks class from langchain_community.chat_models for chat model interactions
from langchain_community.chat_models import ChatDatabricks

# Initialize the Databricks Foundation LLM model with a specific endpoint and max_tokens setting
dbrx_chat_model = ChatDatabricks(endpoint="databricks-dbrx-instruct", max_tokens=200)

# Send a test question with invoke() and print the text of the returned message
print(f"Test chat model: {dbrx_chat_model.invoke('What are the Late Charges on Life Insurance Premiums?').content}")

* 'schema_extra' has been renamed to 'json_schema_extra'
  warn_deprecated(


Test chat model: Late charges on life insurance premiums refer to the additional fees that policyholders may incur if they fail to pay their premiums on time. The specifics of late charges can vary depending on the insurance provider and the policy agreement. Some providers may offer a grace period before applying late charges, while others may not. It's essential to review the policy documents or contact the insurance provider for accurate information regarding late charges.


#### Create a QA chain with LangChain

In [0]:
from langchain.chains import RetrievalQA

# Initialize a RetrievalQA chain with a specific LLM, chain type, retriever, and option to return source documents
qa_chain = RetrievalQA.from_chain_type(llm=dbrx_chat_model,  # Use the previously initialized Databricks Foundation LLM model
                                       chain_type="stuff",  # "stuff" inserts all retrieved chunks into a single prompt
                                       retriever=retriever,  # Specify the retriever to use for fetching documents
                                       return_source_documents=True)  # Option to return the documents used in generating the answer
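The "stuff" chain type simply places every retrieved chunk into one prompt alongside the question. The sketch below shows that idea in plain Python. The function name `build_stuff_prompt` and the prompt wording are assumptions made for illustration and are not the template LangChain uses internally.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate all retrieved chunks into one context block, followed by the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


retrieved = [
    "The penalty amount varies depending on policy and the insurer.",
    "Late payment charges can be avoided by selecting the auto-debit option.",
]
prompt = build_stuff_prompt("What are the late charges on premiums?", retrieved)
print(prompt)
```

Because every chunk goes into one prompt, this approach only works while k chunks of chunk_size characters fit within the model's context window.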

In [0]:
# Define the query to be answered
query = "What are the Late Charges on Life Insurance Premiums?"
# Run the RetrievalQA chain on the query; the response dict includes 'result' and 'source_documents'
llm_response = qa_chain.invoke({"query": query})

In [0]:
llm_response['result']

"Late charges on life insurance premiums vary depending on the policy and the insurer. Policyholders will need to pay penalties for late charges, and the penalty amount can differ based on the duration for which the premium payment is due. To revive the policy through 'reinstatement', policyholders need to pay all the outstanding premiums, and applicable interest rates will be charged. It's important to note that late payment charges can be avoided by selecting the auto-debit option, setting reminder alerts before the premium payment date, keeping track of the reminders, and opting for yearly premium payment instead of monthly payments."

In [0]:
page_list = []
for doc in llm_response['source_documents']:
    # PyPDFLoader stores a zero-based page index in each document's metadata
    print(doc.metadata['page'])
    # Convert it to the one-based page number a reader sees in the PDF
    page_list.append(doc.metadata['page'] + 1)

# Display the one-based page numbers
page_list

2
2


[3, 3]

In [0]:
# Generate a string of unique page numbers from page_list, separated by commas, and prefix it with 'Document Page Numbers ='
doc_pages = 'Document Page Numbers =' + ','.join(set([str(pgno) for pgno in page_list]))

# Generate a string of unique document source names from llm_response['source_documents'], separated by commas, and prefix it with 'Document ='
doc_name = "Document =" + ','.join(set([str(doc.metadata['source']) for doc in llm_response['source_documents']]))

# Generate a string of unique document page contents from llm_response['source_documents'], separated by a period and newline, and prefix it with 'Document Details ='
doc_content = "Document Details =" + '.\n'.join(set([str(doc.page_content) for doc in llm_response['source_documents']]))

In [0]:
# Print the string of unique page numbers
print(doc_pages)
# Print the string of unique document source names
print(doc_name)
# Print a separator for readability
print("**********content*********")
# Print the string of unique document page contents
print(doc_content)

Document Page Numbers =3
Document =/Volumes/sarbani-rag-dbdemo/customer_support_bot/pdf_insurance/insurance-agent-faq.pdf
**********content*********
Document Details =Cash value divides your premium into two parts:
o   One portion helps in wealth creation and earns interest
o   While the other portion helps in covering the cost of ﬁnancial security
Once enough cash value has been accumulated it can be received at the time of maturity or apply for loan in case of emergencyInsuranceLife InsuranceWhat are the Late Charges on Life Insurance Premiums?Here are some details regarding late charges on life insurance premiums:
Policyholders will need to pay penalties on late charges on life insurance premiums
The penalty amount varies depending on policy and the insurer
To revive the policy through ‘reinstatement’, policyholders need to pay all the outstanding premiums with applicable interest rates will be applicable
Late payment charges also depend on the duration for which premium payment is 
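Building these strings from a set removes duplicates, but a set has no defined order, so with several pages or chunks the output order can change between runs. A possible alternative, sketched here under the assumption that each source is reduced to a (path, zero-based page, content) tuple, sorts the page numbers numerically and keeps the other values in first-seen order. The helper name `summarize_sources` is hypothetical.

```python
def summarize_sources(sources):
    """sources: list of (source_path, zero_based_page, content) tuples."""
    # Sort pages as integers so that page 10 comes after page 3
    pages = sorted({page + 1 for _, page, _ in sources})
    # dict.fromkeys drops duplicates while preserving first-seen order
    names = list(dict.fromkeys(path for path, _, _ in sources))
    contents = list(dict.fromkeys(content for _, _, content in sources))
    return {
        "pages": ",".join(str(p) for p in pages),
        "names": ",".join(names),
        "content": ".\n".join(contents),
    }


summary = summarize_sources([
    ("faq.pdf", 2, "chunk A"),
    ("faq.pdf", 2, "chunk B"),
    ("faq.pdf", 9, "chunk A"),
])
print(summary)
```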

#### GRADIO with PDF source location + page number

In [0]:
import gradio as gr

def question_answer(question, image):
    # The image input only displays the architecture diagram in the UI; it is not used to answer the question
    output = qa_chain.invoke({"query": question})

    page_list = []
    for doc in output['source_documents']:
        # Increment page number by 1 to adjust for zero-based indexing
        tmp = doc.metadata['page'] + 1
        page_list.append(tmp)

    # Generate a string of unique page numbers, separated by commas
    doc_pages = ','.join(set([str(pgno) for pgno in page_list]))
    # Generate a string of unique document source names, separated by commas
    doc_name = ','.join(set([str(doc.metadata['source']) for doc in output['source_documents']]))
    # Generate a string of unique document page contents, separated by a period and newline
    doc_content = '.\n'.join(set([str(doc.page_content) for doc in output['source_documents']]))
    
    return output['result'], doc_pages, doc_name, doc_content

# HTML content for the title, including a Databricks logo and the title text
title_html = "<h1 style='text-align: center; display: flex; align-items: center;'>"
title_html += "<a href='https://seekvectorlogo.com/databricks-vector-logo-svg/' target='_blank'>"
title_html += "<img src='https://seekvectorlogo.com/wp-content/uploads/2022/02/databricks-vector-logo-2022.png' style='width: 100px; height: auto; margin-right: 10px;' />"
title_html += "</a>"
title_html += "LLM Powered Insurance Chatbot built with RAG architecture"
title_html += "</h1>"

# HTML content for the subtitle
subtitle_html = "<p style='text-align: center;'>AI chatbot that helps customers with insurance-related questions</p>"

# Create and launch the Gradio interface
gr.Interface(
    fn=question_answer,
    inputs=["text", gr.Image(value='https://raw.githubusercontent.com/databricks-industry-solutions/hls-llm-doc-qa/hls-llm-qa-gpu/images/solution-overview.jpeg')],
    outputs=[
        gr.Textbox(label='Chatbot Response'),
        gr.Textbox(label='Document Page#'),
        gr.Textbox(label='Document Location'),
        gr.Textbox(label='Document Context')
    ],
    title=title_html + subtitle_html,  # Combine title and subtitle HTML for the interface title
    
).launch(share=True)

Running on local URL:  http://127.0.0.1:7860
Running on public URL: https://5fcd7e14c10ba5f235.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


