### In this file, we will use the processed data to build the helper functions and output the final chatbot.

### Import Modules and Data
First up, we will import the "paragraphs.csv" dataset which we created in the Text Analysis file.

In [30]:
import PyPDF2
import pandas as pd
import re

In [31]:
#specify pdf path for later use
pdf_path = 'deposit-account-agreement.pdf'

In [32]:
#import our data
paragraphs_df = pd.read_csv("paragraphs.csv")

In [33]:
#quick look at data
paragraphs_df.head()

Unnamed: 0.1,Unnamed: 0,Page,Formatted_content,Processed_content,keywords
0,0,5,Effective 10/15/2023Deposit Account Agreement,Effective 10/15/2023Deposit Account Agreement,"['2023deposit account agreement', 'account agr..."
1,1,5,This agreement is the contract that governs yo...,agreement contract governs account . Whether p...,"['agreement jpmorgan chase', 'basic agreement ..."
2,2,5,signing a signature card or submitting an acco...,signing signature card submitting account appl...,"['agreement applicable chase', 'chase agreemen..."
3,3,5,"fees, and (4) other disclosures, agreements, a...","fee , ( 4 ) disclosure , agreement , amendment...","['fee apply account', 'fee disclosure agreemen..."
4,4,5,Here are some important terms that we use thro...,important term use throughout agreement : Acco...,"['account covered agreement', 'agreement accou..."


In [34]:
# remove unwanted column
paragraphs_df = paragraphs_df.drop('Unnamed: 0', axis=1)

In [35]:
#take another look at df
paragraphs_df.head()

Unnamed: 0,Page,Formatted_content,Processed_content,keywords
0,5,Effective 10/15/2023Deposit Account Agreement,Effective 10/15/2023Deposit Account Agreement,"['2023deposit account agreement', 'account agr..."
1,5,This agreement is the contract that governs yo...,agreement contract governs account . Whether p...,"['agreement jpmorgan chase', 'basic agreement ..."
2,5,signing a signature card or submitting an acco...,signing signature card submitting account appl...,"['agreement applicable chase', 'chase agreemen..."
3,5,"fees, and (4) other disclosures, agreements, a...","fee , ( 4 ) disclosure , agreement , amendment...","['fee apply account', 'fee disclosure agreemen..."
4,5,Here are some important terms that we use thro...,important term use throughout agreement : Acco...,"['account covered agreement', 'agreement accou..."


Looks like we are good to go! Let's create all the algorithms and functions we will require to make the chatbot.

# All functions required

### Function to get keywords from query
Just as we had a function that extracts keywords from the data file, we need a function that extracts keywords from the user's query using the same KeyBERT library.

The only difference is that we will change the model parameters to return fewer keywords, since a query contains much less text than a paragraph.

In [36]:
#import keybert
from keybert import KeyBERT

In [37]:
#declare model
kw_model = KeyBERT(model='all-mpnet-base-v2')

In [38]:
#function to get keywords using model
def getKeywordsOfQuery(s):
    keywords = kw_model.extract_keywords(s,
                                         keyphrase_ngram_range=(1, 3),
                                         stop_words='english',
                                         highlight=False,
                                         top_n=3)
    #extract_keywords returns (phrase, score) pairs; keep only the phrases
    keywords_list = list(dict(keywords).keys())
    return keywords_list
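
For a single input string, KeyBERT's `extract_keywords` returns a list of `(phrase, score)` tuples, so `list(dict(keywords).keys())` simply keeps the phrases in ranked order. Here is a minimal sketch of that conversion, using made-up tuples in place of real model output (the phrases and scores below are hypothetical, not results from the model):

```python
# Hypothetical stand-in for kw_model.extract_keywords(...) output:
# a list of (phrase, score) tuples, highest score first.
sample_keywords = [
    ("overdraft fee", 0.71),
    ("checking account", 0.64),
    ("fee", 0.52),
]

# Same conversion as in getKeywordsOfQuery: keep only the phrases.
keywords_list = list(dict(sample_keywords).keys())

# Equivalent, and slightly more direct.
keywords_list_alt = [phrase for phrase, _score in sample_keywords]

print(keywords_list)
```

Both forms preserve the ranking, because Python dictionaries keep insertion order.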

### Function to remove stopwords and lemmatize query
Keep in mind that we also need to remove stopwords and lemmatize the query, as we did for the data file, so that the query keywords are comparable with the document keywords.

In [137]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [138]:
#initialize stopwords and lemmatizer
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

In [139]:
#function to lemmatize and remove stopwords
def preprocess_text(text):
    tokens = nltk.word_tokenize(text)
    filtered_tokens = [lemmatizer.lemmatize(token) for token in tokens if token.lower() not in stop_words]
    return ' '.join(filtered_tokens)

### Text highlight function
This function does not produce the chatbot's reply. Instead, it generates an extra output: a copy of the source document with the text relevant to the user's query highlighted.

This is useful when the chatbot's output is not clear enough, or when the user wants more information about the topic they asked about.

In [39]:
#import required libraries
import fitz  # PyMuPDF
import sys

Here's a quick rundown on what this function does:

1. The most important text to highlight is the first keyword in the list passed to the function, as it has the most significance.
2. If this keyword is not found on the page, we highlight the SECOND keyword instead, as it has the next most significance.
3. We get ALL the instances of the keyword that we want to highlight on the specified page.
4. We then highlight all of those instances.
5. Lastly, we save the highlighted PDF as "highlighted_" + original name of the PDF.

In [106]:

def highlight_texts_in_pdf(pdf_path, page_number, texts_to_highlight):
    #get document
    doc = fitz.open(pdf_path)

    #get specific page (page_number is 1-based, PyMuPDF pages are 0-based)
    page = doc[page_number-1]

    #find all instances of the first keyword on the page
    text_instances = page.search_for(texts_to_highlight[0])
    if not text_instances and len(texts_to_highlight) > 1:
        #if first keyword is not on the page, fall back to the second keyword
        #(the length check avoids an IndexError when only one keyword matched)
        text_instances = page.search_for(texts_to_highlight[1])

    #highlight every instance found
    for inst in text_instances:
        page.add_highlight_annot(inst)

    #output file name and save (lowercase prefix, matching the message shown by the chatbot)
    output = "highlighted_"+pdf_path
    doc.save(output, garbage=4, deflate=True, clean=False)
    doc.close()
    #return name of output file
    return output
        


### Functions to match keywords from query to keywords of document and print results

We have the user's query with keywords and paragraphs from the data file with keywords.

We need to match the user's query to paragraphs from data file using an algorithm. We cannot just show every paragraph that has one of the keywords from the query. If one of the keywords from the query was "bank", then we would end up displaying almost every paragraph in the document! 

The solution I created for this is:
1. A function count_match that gives the "score" of a query for a single paragraph (or row). This score is the number of keywords from the query that are also present in the keywords of that row.
2. A function match_with_query that takes in "maxcount", the maximum score that any row attained under the previous function.
3. match_with_query then checks which paragraph(s) attained this maximum score and appends all those paragraphs to the final result text.

In [107]:
#function to count number of query keywords found in row keywords
#takes in query's keyword list and one row at a time
def count_match(qlist, row):
    count=0
    for q in qlist:
        if q in row.keywords:
            count+=1
    return count
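
One detail worth knowing: because "paragraphs.csv" is read back with `pd.read_csv`, the `keywords` column holds the text of a list (for example `"['fee apply account', ...]"`) rather than a Python list, so `q in row.keywords` is a substring test. The sketch below uses a made-up keyword string and a hypothetical helper, `count_exact_match`, to contrast that behaviour with exact phrase matching after parsing the cell with `ast.literal_eval`. It only illustrates the difference; nothing else in this file relies on it.

```python
import ast

# Hypothetical cell value, shaped like the keywords column after read_csv.
row_keywords = "['fee apply account', 'fee disclosure agreement']"

def count_substring_match(qlist, keywords_text):
    """Mirror of count_match: counts query keywords found anywhere in the text."""
    return sum(1 for q in qlist if q in keywords_text)

def count_exact_match(qlist, keywords_text):
    """Hypothetical variant: parse the text into a list, then match whole phrases."""
    phrases = set(ast.literal_eval(keywords_text))
    return sum(1 for q in qlist if q in phrases)

qlist = ["fee", "fee apply account", "loan"]
substring_score = count_substring_match(qlist, row_keywords)
exact_score = count_exact_match(qlist, row_keywords)
print(substring_score, exact_score)
```

Here the substring test counts both "fee" and "fee apply account", while the exact test counts only the full phrase. Substring matching is the more forgiving of the two, which probably suits short user queries, but it does let a short keyword match inside longer phrases.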
            

In [143]:
#function to match paragraphs with query using maxcount.
#maxcount is obtained in the query_results function
def match_with_query(qlist, maxcount, row):
    count=0
    matchlist=[]
    result_text=""
    #check count of the row again (similar to previous function)
    for q in qlist:
        if q in row.keywords:
            count+=1
            #append keyword found to matchlist
            matchlist.append(q)
            
    # check if count matches maxcount, if yes then add content to result text
    if count==maxcount:
        result_text += "\n" + row['Formatted_content'] + "\n"
        result_text += f"Content found in Page {row['Page']} of document {pdf_path}\n"
        #call highlight function if matchlist has any keywords in it
        if  len(matchlist)>0:    
            highlighted_pdf_path = highlight_texts_in_pdf(pdf_path, row['Page'], matchlist)
    
    #return the final text
    return result_text
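
To see how the two functions work together, here is a toy run of the max-score selection on three made-up paragraphs. Plain dictionaries stand in for the dataframe rows, and the helper names `score` and `best_matches` are hypothetical. It also shows an edge case: when no query keyword matches anything, the maximum score is 0 and every paragraph ties for it, so the caller should treat a maximum of 0 as "no match".

```python
# Made-up rows standing in for paragraphs_df (keywords kept as text, as in the CSV).
rows = [
    {"Page": 5, "Formatted_content": "Fees that apply...",
     "keywords": "['fee apply account', 'account fee']"},
    {"Page": 6, "Formatted_content": "Overdraft rules...",
     "keywords": "['overdraft fee', 'overdraft account']"},
    {"Page": 7, "Formatted_content": "Closing your account...",
     "keywords": "['closing account']"},
]

def score(qlist, row):
    """Number of query keywords found in the row's keyword text."""
    return sum(1 for q in qlist if q in row["keywords"])

def best_matches(qlist, rows):
    """Return the maximum score and the pages of all rows that reach it."""
    scores = [score(qlist, row) for row in rows]
    maxcount = max(scores)
    pages = [row["Page"] for row, s in zip(rows, scores) if s == maxcount]
    return maxcount, pages

print(best_matches(["overdraft", "fee"], rows))  # one clear winner
print(best_matches(["mortgage"], rows))          # nothing matches: all rows tie at 0
```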


### Final function to integrate all:
This is the final function that will be called by the Gradio chatbot. It needs to integrate all the functions we have created so far. Here's a step-by-step breakdown of what happens in the "query_results" function:
1. We first call the preprocess_text function on the query.
2. We then get a list of keywords from the processed query using the getKeywordsOfQuery function
3. We then get the count (or score) of each row in the dataframe against the query list and store all these scores in a list
4. We get the maximum count (or score) from the list
5. We call the match_with_query function, and get the resulting paragraph texts.
6. We append the resulting texts and additional text we want the chatbot to display all into one result
7. We return this result to the chatbot, for it to display!

In [140]:
#partial lets us pre-fill the query arguments before applying a function to each row
from functools import partial

def query_results(query, history):
    #preprocess the query
    query=preprocess_text(query)
    #get keyword list of query
    qlist=getKeywordsOfQuery(query)
    #get keyword count for all rows
    l=list(paragraphs_df.apply(partial(count_match, qlist), axis=1))
    #get max count value
    maxcount=max(l)
    #if no query keyword matched any row, every row would tie at 0, so stop here
    if maxcount==0:
        return "Sorry, I could not find anything about that in the bank documents.\n Can I answer any more questions?"
    #get result texts from the data into a list
    result_list=list(paragraphs_df.apply(partial(match_with_query, qlist, maxcount),axis=1))
    
    #append all text, including result list, into one text variable result
    result = "Here's what I could gather from the bank documents..."
    result += "\n".join(result_list) #convert list to string
    result+= "\n " + f"Highlighted PDF saved at: highlighted_{pdf_path}"
    result+= "\n Can I answer any more questions?"
    return result

## Set up gradio link
To wrap the whole thing up, we will call the gradio chatbot for our function!

In [141]:
import gradio as gr

In [144]:
#call chatbot for our function
iface = gr.ChatInterface(query_results)

# Launch the interface
iface.launch(share=True)

Running on local URL:  http://127.0.0.1:7885

Could not create share link. Please check your internet connection or our status page: https://status.gradio.app.


