## Purpose of the project

This project creates a chatbot to interact with my insurance provider. The goal is to show potential employers what an employee who is passionate about a specific field can achieve. I have been interested in artificial intelligence for the past two years and have completed numerous courses on artificial intelligence, big data processing, RAG, LLMs, and cloud training, but I have not yet had the opportunity to apply these skills in a professional context. Because I am currently interested in the insurance sector, I undertook this project to demonstrate what I am capable of.

## Configuration

Create your virtual environment by running `python -m venv ./venv`, then activate it by running `venv/Scripts/activate`. Press `CTRL + SHIFT + P` and select your interpreter (or kernel if you're working with Jupyter Notebook).

Create your OpenAI API key and your `.env` file to store your API key.
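
The `.env` file is expected to define `OPENAI_API_KEY`, the variable name read later in this notebook. As a minimal sketch, the lookup can be wrapped in a small helper that fails early with a readable message when the key is missing. The helper name `require_env` is hypothetical and not part of the project.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a readable error.

    Hypothetical helper: it only reads os.environ and does not load .env itself.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value
```

Once `load_dotenv(find_dotenv())` has run, calling `require_env("OPENAI_API_KEY")` would return the stored key.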

## 1. Installation of Required Packages

I have created a `requirements.txt` file containing all the libraries necessary to run this project. This file will be updated as the project progresses. Run the following command to install the packages: `pip install -r requirements.txt`

## 2. Helper functions

In [1]:
# t-SNE (t-distributed Stochastic Neighbor Embedding) is a statistical method for visualizing
# high-dimensional data by reducing it to a lower-dimensional space, typically two or three dimensions.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def tsne_plot(data):
    # Apply t-SNE to reduce to 3D (perplexity must be smaller than the number of samples)
    tsne = TSNE(n_components=3, random_state=42, perplexity=data.shape[0] - 1)
    data_3d = tsne.fit_transform(data)
    
    # Plotting
    fig = plt.figure(figsize=(10, 7))
    ax = fig.add_subplot(111, projection='3d')
    
    # Assign colors for each point based on its index
    num_points = len(data_3d)
    colors = plt.cm.tab20(np.linspace(0, 1, num_points))
    
    # Plot scatter with unique colors for each point
    for idx, point in enumerate(data_3d):
        ax.scatter(point[0], point[1], point[2], label=str(idx), color=colors[idx])
    
    # Adding labels and titles
    ax.set_xlabel('TSNE Component 1')
    ax.set_ylabel('TSNE Component 2')
    ax.set_zlabel('TSNE Component 3')
    plt.title('3D t-SNE Visualization')
    plt.legend(title='Input Order')
    plt.show()

## 3. Loading data

In [2]:
# Load module and API key
import os
import openai
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # Read .env file

openai.api_key  = os.environ['OPENAI_API_KEY']

In [3]:
# Load my insurance policy pdf file
from langchain.document_loaders import PyPDFLoader
# loader = PyPDFLoader("data/POLICE_ASSURANCE.pdf")
loader = PyPDFLoader("data/police_assurance.pdf")
pages = loader.load()

Each page is a document. A document contains text (`page_content`) and metadata (`metadata`).

In [4]:
# Number of pages of the document
len(pages)

1

In [5]:
# Information present in the first page of the document
page = pages[0]
page

Document(metadata={'source': 'data/police_assurance.pdf', 'page': 0}, page_content="1. Informations générales  \n  \nLa police d'assurance habitation de **TAKODJOU DJOKO JUSTIN JOEL**, identifiée par le numéro \n**0F10156H**, offre une couverture complète pour une période allant du **29 juin 2024** au **29 juin \n2025**. Cette assurance protège l'habitation située à l'adresse suivante : **8510 Rue de la Comtoise, \nAppartement 517, Ville de Québec, Province de Québec, Canada**, avec le code postal **G2C 0N4**. Le \ncontrat, de type **assurance habitation**, garantit une protection adaptée aux besoins du souscripteur, \ncouvrant divers risques liés à l'habitation, conformément aux conditions générales et particulières de la \npolice. Ce contrat constitue une solution fiable pour sécuriser le domicile et offrir une tranquillité d'esprit \ntout au long de la période de validité.  \n  \n2. GARANTIES POUR LES DOMMAGES AUX BIENS – AVEC VOL  \n  \nLa police d’assurance inclut une couverture é

In [6]:
# Display the first 100 characters present in the first page
print(page.page_content[0:100]) 

1. Informations générales  
  
La police d'assurance habitation de **TAKODJOU DJOKO JUSTIN JOEL**, i


In [7]:
# Strip the Markdown markers (* and ###) from the page text
page.page_content = page.page_content.replace("*", "").replace("###", "")

page.page_content

"1. Informations générales  \n  \nLa police d'assurance habitation de TAKODJOU DJOKO JUSTIN JOEL, identifiée par le numéro \n0F10156H, offre une couverture complète pour une période allant du 29 juin 2024 au 29 juin \n2025. Cette assurance protège l'habitation située à l'adresse suivante : 8510 Rue de la Comtoise, \nAppartement 517, Ville de Québec, Province de Québec, Canada, avec le code postal G2C 0N4. Le \ncontrat, de type assurance habitation, garantit une protection adaptée aux besoins du souscripteur, \ncouvrant divers risques liés à l'habitation, conformément aux conditions générales et particulières de la \npolice. Ce contrat constitue une solution fiable pour sécuriser le domicile et offrir une tranquillité d'esprit \ntout au long de la période de validité.  \n  \n2. GARANTIES POUR LES DOMMAGES AUX BIENS – AVEC VOL  \n  \nLa police d’assurance inclut une couverture étendue pour les dommages aux biens, incluant le vol, \noffrant une protection complète pour diverses situations

In [8]:
# Metadata
page.metadata

{'source': 'data/police_assurance.pdf', 'page': 0}

## 4. Document splitting

This process involves splitting a large document into smaller pieces. It is particularly useful for:

- Limiting context size: Language models (like GPT) have a limit on the amount of text they can process in a single query (the "context window"). By dividing a long text into smaller chunks, each chunk can be processed individually without exceeding this limit.
- Optimizing processing: Splitting a text into chunks allows for more targeted information extraction or answering questions by focusing on relevant sections rather than the entire document.
- Building logical chains: LangChain often uses chunks in complex workflows, such as information extraction, summary generation, or semantic search.

There are several splitting methods. Here I use **RecursiveCharacterTextSplitter** because my document is structured with paragraphs, titles, and sections. This method tries the separators in order, from coarse to fine, so that chunks stay as coherent as possible.
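
To make the recursive idea concrete, here is a simplified sketch in plain Python. It is not LangChain's implementation: it ignores `chunk_overlap`, and the function name `recursive_split` is hypothetical. It only illustrates the principle of trying coarse separators first and falling back to finer ones when a piece is still too long.

```python
def recursive_split(text, separators, chunk_size):
    """Split text into pieces of at most chunk_size characters.

    Separators are tried in order; a piece that is still too long is split
    again with the remaining (finer) separators. Simplified illustration only.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: cut at fixed positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    if sep == "" or sep not in text:
        return recursive_split(text, finer, chunk_size)

    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate          # keep merging small parts
            continue
        if current:
            chunks.append(current)       # flush what has been merged so far
        if len(part) <= chunk_size:
            current = part
        else:
            chunks.extend(recursive_split(part, finer, chunk_size))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "aaa bbb\n\nccc ddd eee\n\nfff"
pieces = recursive_split(sample, ["\n\n", "\n", " ", ""], chunk_size=8)
```

In this toy input, the middle paragraph is longer than 8 characters, so it is the only one that gets split further on spaces.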

In [9]:
# Merge the content of all pages into one string (needed when the PDF has multiple pages).
# TODO: create a function that automates loading and merging.
full_text = " ".join([page.page_content for page in pages])

In [10]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

r_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=300,
    chunk_overlap=50,
)
text_split = r_splitter.split_text(full_text)

In [11]:
for i in range(4):
    print(f"sample: {i} paragraph: {text_split[i]} \n" )
print(len(text_split))

sample: 0 paragraph: 1. Informations générales  
  
La police d'assurance habitation de TAKODJOU DJOKO JUSTIN JOEL, identifiée par le numéro 
0F10156H, offre une couverture complète pour une période allant du 29 juin 2024 au 29 juin 

sample: 1 paragraph: 2025. Cette assurance protège l'habitation située à l'adresse suivante : 8510 Rue de la Comtoise, 
Appartement 517, Ville de Québec, Province de Québec, Canada, avec le code postal G2C 0N4. Le 
contrat, de type assurance habitation, garantit une protection adaptée aux besoins du souscripteur, 

sample: 2 paragraph: couvrant divers risques liés à l'habitation, conformément aux conditions générales et particulières de la 
police. Ce contrat constitue une solution fiable pour sécuriser le domicile et offrir une tranquillité d'esprit 
tout au long de la période de validité. 

sample: 3 paragraph: tout au long de la période de validité.  
  
2. GARANTIES POUR LES DOMMAGES AUX BIENS – AVEC VOL  
  
La police d’assurance inclut une couvertur
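
Sample 3 starts by repeating the last line of sample 2, which is the effect of `chunk_overlap`. The sketch below illustrates the overlap idea with a plain sliding window over characters. It is an assumption-laden simplification (fixed-size windows, no separators), and `sliding_chunks` is a hypothetical name, not the splitter's actual algorithm.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows where each window repeats the
    last chunk_overlap characters of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    # Stop early enough that the last window is not just a repeat of the overlap.
    stop = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, stop, step)]


windows = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
```

Here each window shares its first two characters with the end of the previous window, so a sentence cut at a chunk boundary still appears whole in at least one chunk when it is short enough.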

## 5. Vector stores and embeddings

In [12]:
from langchain.vectorstores import Chroma
# pip install -U langchain-openai 
from langchain.embeddings.openai import OpenAIEmbeddings
embedding_model = OpenAIEmbeddings()


In [13]:
persist_directory = 'data/chroma/'

In [14]:
# Create vector database
# pip install -U langchain-chroma
vectordb = Chroma(
    persist_directory=persist_directory, # Where to save data locally, remove if not necessary
    embedding_function=embedding_model
)


In [15]:
smalldb = Chroma.from_texts(text_split, embedding=embedding_model)
smalldb._collection.count()

14

In [16]:
question = "Quel est le nom de la personne assure ?"

In [17]:
docs = smalldb.similarity_search(question, k=2)
docs

[Document(metadata={}, page_content='garanties visent à protéger l’assuré et à offrir une couverture adaptée aux besoins en cas de \nresponsabilité civile ou de situations imprévues.  \n  \n4. Garanties couvertes  \n  \nLes garanties couvertes par la police d’assurance offrent une protection complète contre divers risques.'),
 Document(metadata={}, page_content="1. Informations générales  \n  \nLa police d'assurance habitation de TAKODJOU DJOKO JUSTIN JOEL, identifiée par le numéro \n0F10156H, offre une couverture complète pour une période allant du 29 juin 2024 au 29 juin")]
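
Conceptually, a similarity search embeds the question and returns the `k` chunks whose vectors are closest to it. The sketch below shows that ranking step with toy two-dimensional vectors. It assumes cosine similarity as the measure; the metric actually used by the vector store depends on its configuration, and the names `cosine` and `top_k` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


toy_docs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
nearest = top_k([1.0, 0.0], toy_docs, k=2)
```

Note that the ranking depends only on vector closeness, which is why a chunk that merely mentions "l'assuré" can outrank the chunk that contains the insured person's name.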

In [18]:
smalldb.max_marginal_relevance_search(question,k=2, fetch_k=3)

[Document(metadata={}, page_content='garanties visent à protéger l’assuré et à offrir une couverture adaptée aux besoins en cas de \nresponsabilité civile ou de situations imprévues.  \n  \n4. Garanties couvertes  \n  \nLes garanties couvertes par la police d’assurance offrent une protection complète contre divers risques.'),
 Document(metadata={}, page_content="1. Informations générales  \n  \nLa police d'assurance habitation de TAKODJOU DJOKO JUSTIN JOEL, identifiée par le numéro \n0F10156H, offre une couverture complète pour une période allant du 29 juin 2024 au 29 juin")]
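
Maximal marginal relevance (MMR) first fetches `fetch_k` candidates by similarity, then picks `k` of them while penalizing candidates that resemble those already selected. The following is a minimal sketch of that selection loop with toy vectors. It assumes cosine similarity and a trade-off weight named `lambda_mult`; it is an illustration, not the library's code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Greedy MMR: balance relevance to the query against redundancy
    with the documents already selected. Returns selected indices."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 1.0]
docs_vecs = [[1.0, 0.9], [1.0, 0.8], [0.2, 1.0]]  # the first two are near-duplicates
diverse = mmr(query, docs_vecs, k=2, lambda_mult=0.5)
relevance_only = mmr(query, docs_vecs, k=2, lambda_mult=1.0)
```

With `lambda_mult=1.0` the redundancy term vanishes and the result matches a plain similarity ranking. In the notebook, `fetch_k=3` leaves very few candidates to diversify over, which is consistent with MMR returning the same two chunks as the similarity search.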

In [19]:
docs[0].page_content

'garanties visent à protéger l’assuré et à offrir une couverture adaptée aux besoins en cas de \nresponsabilité civile ou de situations imprévues.  \n  \n4. Garanties couvertes  \n  \nLes garanties couvertes par la police d’assurance offrent une protection complète contre divers risques.'

In [20]:
vectordb.persist()


## 6. Information Retrieval 

In [21]:
%pip install -U langchain-openai

Note: you may need to restart the kernel to use updated packages.





In [22]:
from langchain.chat_models import ChatOpenAI
llm_name = 'gpt-3.5-turbo'
llm = ChatOpenAI(model_name=llm_name, temperature=0)
llm.predict("Hello world!")



'Hello! How can I assist you today?'

In [23]:
# Build prompt
from langchain.prompts import PromptTemplate
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Use three sentences maximum. Keep the answer as concise as possible. 
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate(input_variables=["context", "question"],template=template,)

# Run chain
from langchain.chains import RetrievalQA
qa_chain = RetrievalQA.from_chain_type(llm,
                                       retriever=smalldb.as_retriever(),
                                       return_source_documents=True,
                                       chain_type_kwargs={"prompt": QA_CHAIN_PROMPT})


result = qa_chain({"query": question})
result["result"]


'Le nom de la personne assurée est TAKODJOU DJOKO JUSTIN JOEL.'
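
Behind the scenes, the chain fills the prompt template with the retrieved chunks and the question before calling the model. The sketch below reproduces that assembly step in plain Python, assuming the chunks are simply concatenated into `{context}` (the "stuff" approach) with a blank line between them. The function name `build_prompt` and the separator are assumptions for illustration.

```python
TEMPLATE = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Use three sentences maximum. Keep the answer as concise as possible. 
{context}
Question: {question}
Helpful Answer:"""


def build_prompt(chunks, question):
    """Concatenate the retrieved chunks and insert them into the template."""
    context = "\n\n".join(chunks)
    return TEMPLATE.format(context=context, question=question)


prompt_text = build_prompt(
    ["chunk about the policy holder", "chunk about the coverage"],
    "Quel est le nom de la personne assure ?",
)
```

Printing `prompt_text` shows exactly what the model receives, which is a convenient way to check whether the retrieved chunks actually contain the answer.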

In [24]:
# Memory
from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)


In [25]:
from langchain.chains import ConversationalRetrievalChain
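# Caution: vectordb was created in In [14] without adding any documents; the chunks were only
# indexed into smalldb. Unless data/chroma/ already holds data from a previous run, this retriever
# has nothing to return, which likely explains the "I don't know" answers in the cells that follow.
retriever = vectordb.as_retriever()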
qa = ConversationalRetrievalChain.from_llm(
    llm,
    retriever=retriever,
    memory=memory
)

In [26]:
question = "Quel est le numero de la police d'assurance ?"
result = qa({"question": question})

In [27]:
result['answer']

"Je ne sais pas, veuillez contacter votre compagnie d'assurance pour obtenir le numéro de police d'assurance."

In [28]:
question = "De quel type d'assurance s'agit-il ?"
result = qa({"question": question})

In [29]:
result['answer']

"Je suis désolé, mais vous n'avez pas fourni suffisamment d'informations pour que je puisse déterminer de quel type d'assurance il s'agit. Pouvez-vous donner plus de détails ou poser une question plus précise ?"

In [30]:
question = "Quel est le montant couvert pour les meubles, la franchise et la prime ? "
result = qa({"question": question})

In [31]:
result['answer']

'Je ne dispose pas des informations nécessaires pour répondre à votre question.'
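
A conversational retrieval chain works in three steps: it rewrites the follow-up question into a standalone question using the chat history, retrieves chunks for that standalone question, and then asks the model to answer from those chunks. The sketch below mirrors that flow with plain callables. The three callables (`condense`, `retrieve`, `answer`) are hypothetical stand-ins for the LLM and the retriever, not LangChain APIs.

```python
def conversational_answer(question, chat_history, condense, retrieve, answer):
    """Run one turn of a retrieval-based conversation.

    chat_history is a list of (user_question, bot_answer) tuples.
    """
    # Step 1: only rewrite the question when there is history to take into account.
    standalone = condense(question, chat_history) if chat_history else question
    # Step 2: fetch supporting chunks for the standalone question.
    source_documents = retrieve(standalone)
    # Step 3: answer from the retrieved chunks.
    reply = answer(standalone, source_documents)
    return {
        "answer": reply,
        "generated_question": standalone,
        "source_documents": source_documents,
    }


# Toy stand-ins so the flow can be run without any external service.
def fake_condense(question, history):
    return f"{question} (about: {history[-1][0]})"


def fake_retrieve(query):
    return [] if "empty" in query else ["chunk 1"]


def fake_answer(query, docs):
    return "answer from context" if docs else "I don't know"


first_turn = conversational_answer("What is covered?", [], fake_condense, fake_retrieve, fake_answer)
empty_store = conversational_answer("empty store?", [], fake_condense, fake_retrieve, fake_answer)
```

The second call shows why an empty retriever leads to "I don't know" style answers: step 3 has no context to work from, whatever the question is.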

## 7. Chatbot

In [32]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain.vectorstores import DocArrayInMemorySearch
from langchain.document_loaders import TextLoader, PyPDFLoader
from langchain.chains import RetrievalQA, ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain.chat_models import ChatOpenAI

In [34]:
police_insurance_path = 'data/police_assurance.pdf'

In [40]:
def load_db(file, chain_type, k):
    # Load the PDF and strip the Markdown markers from each page
    loader = PyPDFLoader(file)
    pages = loader.load()
    for page in pages:
        page.page_content = page.page_content.replace("*", "").replace("###", "")

    # Merge all pages into one string, then split it into chunks
    full_text = " ".join([page.page_content for page in pages])
    r_splitter = RecursiveCharacterTextSplitter(
        separators=["\n\n", "\n", " ", ""],
        chunk_size=300,
        chunk_overlap=50,
    )
    text_split = r_splitter.split_text(full_text)
    # Define the embedding model
    embeddings = OpenAIEmbeddings()
    # Create an in-memory vector database from the chunks
    db = DocArrayInMemorySearch.from_texts(text_split, embeddings)
    # Define the retriever
    retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": k})
    # Create the chatbot chain. Memory is managed externally: the caller passes
    # chat_history on every call, so no memory object is attached here.
    # return_source_documents=True is needed because the caller reads result["source_documents"].
    qa = ConversationalRetrievalChain.from_llm(
        llm=ChatOpenAI(model_name=llm_name, temperature=0),
        chain_type=chain_type,
        retriever=retriever,
        return_source_documents=True,
        return_generated_question=True,
    )
    return qa

In [None]:
import panel as pn
import param

class cbfs(param.Parameterized):
    chat_history = param.List([])
    answer = param.String("")
    db_query  = param.String("")
    db_response = param.List([])
    
    def __init__(self,  **params):
        super(cbfs, self).__init__( **params)
        self.panels = []
        self.loaded_file = "data/police_assurance.pdf"
        self.qa = load_db(self.loaded_file,"stuff", 4)
    
    def call_load_db(self, count):
        if count == 0 or file_input.value is None:  # init or no file specified :
            return pn.pane.Markdown(f"Loaded File: {self.loaded_file}")
        else:
            file_input.save("temp.pdf")  # local copy
            self.loaded_file = file_input.filename
            button_load.button_style="outline"
            self.qa = load_db("temp.pdf", "stuff", 4)
            button_load.button_style="solid"
        self.clr_history()
        return pn.pane.Markdown(f"Loaded File: {self.loaded_file}")

    def convchain(self, query):
        if not query:
            return pn.WidgetBox(pn.Row('User:', pn.pane.Markdown("", width=600)), scroll=True)
        result = self.qa({"question": query, "chat_history": self.chat_history})
        self.chat_history.extend([(query, result["answer"])])
        self.db_query = result["generated_question"]
        self.db_response = result["source_documents"]
        self.answer = result['answer'] 
        self.panels.extend([
            pn.Row('User:', pn.pane.Markdown(query, width=600)),
            pn.Row('ChatBot:', pn.pane.Markdown(self.answer, width=600, styles={'background-color': '#F6F6F6'}))
        ])
        inp.value = ''  #clears loading indicator when cleared
        return pn.WidgetBox(*self.panels,scroll=True)

    @param.depends('db_query')
    def get_lquest(self):
        if not self.db_query:
            return pn.Column(
                pn.Row(pn.pane.Markdown("Last question to DB:", styles={'background-color': '#F6F6F6'})),
                pn.Row(pn.pane.Str("no DB accesses so far"))
            )
        return pn.Column(
            pn.Row(pn.pane.Markdown("DB query:", styles={'background-color': '#F6F6F6'})),
            pn.pane.Str(self.db_query)
        )

    @param.depends('db_response', )
    def get_sources(self):
        if not self.db_response:
            return 
        rlist=[pn.Row(pn.pane.Markdown(f"Result of DB lookup:", styles={'background-color': '#F6F6F6'}))]
        for doc in self.db_response:
            rlist.append(pn.Row(pn.pane.Str(doc)))
        return pn.WidgetBox(*rlist, width=600, scroll=True)

    @param.depends('convchain', 'clr_history') 
    def get_chats(self):
        if not self.chat_history:
            return pn.WidgetBox(pn.Row(pn.pane.Str("No History Yet")), width=600, scroll=True)
        rlist=[pn.Row(pn.pane.Markdown(f"Current Chat History variable", styles={'background-color': '#F6F6F6'}))]
        for exchange in self.chat_history:
            rlist.append(pn.Row(pn.pane.Str(exchange)))
        return pn.WidgetBox(*rlist, width=600, scroll=True)

    def clr_history(self,count=0):
        self.chat_history = []
        return 


In [None]:
cb = cbfs()

file_input = pn.widgets.FileInput(accept='.pdf')
button_load = pn.widgets.Button(name="Load DB", button_type='primary')
button_clearhistory = pn.widgets.Button(name="Clear History", button_type='warning')
button_clearhistory.on_click(cb.clr_history)
inp = pn.widgets.TextInput( placeholder='Enter text here…')

bound_button_load = pn.bind(cb.call_load_db, button_load.param.clicks)
conversation = pn.bind(cb.convchain, inp) 

jpg_pane = pn.pane.Image( './img/convchain.jpg')

tab1 = pn.Column(
    pn.Row(inp),
    pn.layout.Divider(),
    pn.panel(conversation,  loading_indicator=True, height=300),
    pn.layout.Divider(),
)
tab2= pn.Column(
    pn.panel(cb.get_lquest),
    pn.layout.Divider(),
    pn.panel(cb.get_sources ),
)
tab3= pn.Column(
    pn.panel(cb.get_chats),
    pn.layout.Divider(),
)
tab4=pn.Column(
    pn.Row( file_input, button_load, bound_button_load),
    pn.Row( button_clearhistory, pn.pane.Markdown("Clears chat history. Can use to start a new topic" )),
    pn.layout.Divider(),
    pn.Row(jpg_pane.clone(width=400))
)
dashboard = pn.Column(
    pn.Row(pn.pane.Markdown('# ChatWithYourData_Bot')),
    pn.Tabs(('Conversation', tab1), ('Database', tab2), ('Chat History', tab3),('Configure', tab4))
)
pn.extension()
dashboard
