# Chatbot QA for APRA regulatory data
The aim of the QA chat is to experiment with various techniques for improving accuracy when answering questions from a private corpus of data. The pipeline starts by reading the documents, in docx format, from ./dataset/word_standards. Each document is then divided into smaller chunks for embedding. The embeddings are generated using Cohere on AWS Bedrock and stored in the Pinecone vector database hosted in the cloud. For answer generation, OpenAI is used.

![Design architecture](architecture.png)

## Techniques of embedding
The documents are divided into smaller chunks because embedding models and LLM context windows limit how much text can be processed at once. Accuracy is important when it comes to regulations, so the chunk size and the number of chunks are balanced against accuracy, retrieval latency and compute resources for the number of documents provided.

### Preserve relevant context
When a document is divided into chunks, each chunk loses some of its surrounding context, and context is very important in retrieval. To preserve relevancy, each document is split into long chunks (up to 2048 characters) with an overlap of 400 characters between consecutive chunks.
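The idea of overlapping chunks can be illustrated with a minimal sketch. It assumes a plain fixed-size character window, which is simpler than LangChain's `RecursiveCharacterTextSplitter` used in this notebook (that splitter also tries to break on separators such as paragraphs). The function name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size character windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Each chunk repeats the last two characters of the previous one.
print(chunk_with_overlap("abcdefghij", chunk_size=4, overlap=2))
```

A larger overlap keeps more shared context between neighbouring chunks, at the cost of producing more chunks to embed and store.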

### Prevent context cross contamination
To ensure precision, preserving relevant context within the same document is important. Chunking is done per individual document so that no chunk overlaps with text from another document, which would contain different regulation data. Cross-contaminated data would lead to inaccurate answers.
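A toy comparison may make the difference concrete. This is only a sketch with made-up document ids (`doc_a`, `doc_b`) and a hypothetical helper `split_fixed`; it is not the splitter used in this notebook.

```python
def split_fixed(text: str, size: int) -> list[str]:
    """Cut text into consecutive windows of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Two toy "documents" with hypothetical ids.
docs = {"doc_a": "AAAAAA", "doc_b": "BBBBBB"}

# Chunking the concatenation lets a window straddle the document boundary.
merged = split_fixed("".join(docs.values()), size=4)

# Chunking per document keeps every chunk inside one document and tags its origin.
per_doc = [
    {"doc_id": doc_id, "chunk_index": idx, "content": chunk}
    for doc_id, text in docs.items()
    for idx, chunk in enumerate(split_fixed(text, size=4))
]

print(merged)                           # the middle chunk mixes both documents
print([c["content"] for c in per_doc])  # no chunk mixes documents
```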

## AWS Bedrock 
Cohere is used for dense vector embedding of all the documents before they are stored in the vector database. Claude is used for generation. Using AWS Bedrock provides the performance and flexibility to change foundation models.

## Vector store
Pinecone is a cloud-based vector store. The embeddings generated from the documents are stored as a one-off task. These embeddings are later retrieved to generate answers based on queries. A cloud vector database is used for its scalability and its vector search performance.

## Retrieval and generation
Based on a query, a vector search is performed using cosine similarity between the query vector and the database vectors. OpenAI is used for prompting and answer generation based on the chunks retrieved from the Pinecone vector store.
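For intuition, cosine similarity and a top-k search can be sketched in a few lines of plain Python. This is an illustrative brute-force version over a toy in-memory index. Pinecone performs the search server-side, and the names `cosine_similarity`, `top_k` and `toy_index` are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: dict[str, list[float]], k: int):
    """Return the k (id, score) pairs with the highest similarity to the query."""
    scored = [(vec_id, cosine_similarity(query, vec)) for vec_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


toy_index = {"chunk-0": [1.0, 0.0], "chunk-1": [0.0, 1.0], "chunk-2": [1.0, 1.0]}
print(top_k([1.0, 0.1], toy_index, k=2))
```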

## Set up
Before running this notebook, please run <code>pip install -r requirements.txt</code>. To use the various cloud services, the following accounts need to be set up to obtain access keys. The following keys are needed in the .env file located in the same directory as this notebook.

<code>
PINECONE_API_KEY=</br>
PINECONE_ENV=</br>
PINECONE_INDEX_NAME=</br>
PINECONE_INDEX_HOST=</br>
ANTHROPIC_API_KEY=</br>
OPENAI_API_KEY=</br>
BEDROCK_REGION=</br>
AWS_ACCESS_KEY_ID=</br>
AWS_SECRET_ACCESS_KEY=</br>
</code>
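
A quick check for missing keys can catch configuration problems before any cloud call is made. The sketch below is an optional helper, not part of the original notebook. The key names come from the list above plus `PINECONE_API_KEY`, which the Pinecone cell reads, and the function name `missing_keys` is hypothetical.

```python
import os

REQUIRED_KEYS = [
    "PINECONE_API_KEY",
    "PINECONE_ENV",
    "PINECONE_INDEX_NAME",
    "PINECONE_INDEX_HOST",
    "ANTHROPIC_API_KEY",
    "OPENAI_API_KEY",
    "BEDROCK_REGION",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
]


def missing_keys(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [key for key in required if not env.get(key)]


# Run after load_dotenv(); an empty list means every key has a value.
print("Missing keys:", missing_keys(REQUIRED_KEYS))
```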



## Load environment variables

In [1]:
import os
from dotenv import load_dotenv
load_dotenv()

True

## Data preprocessing

#### NLTK for tagging

In [2]:
import nltk
print(nltk.data.find('taggers/averaged_perceptron_tagger'))

nltk_data_path = './nltk_data'
nltk.data.path.append(nltk_data_path)
nltk.download('averaged_perceptron_tagger', download_dir=nltk_data_path)
nltk.download('averaged_perceptron_tagger_eng', download_dir=nltk_data_path)
nltk.download('punkt_tab')

print(nltk.data.find('taggers/averaged_perceptron_tagger'))
print(nltk.data.find('taggers/averaged_perceptron_tagger_eng'))

/Users/jasper/nltk_data/taggers/averaged_perceptron_tagger
/Users/jasper/nltk_data/taggers/averaged_perceptron_tagger
/Users/jasper/nltk_data/taggers/averaged_perceptron_tagger_eng


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     ./nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     ./nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/jasper/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


#### Langchain framework for handling data preprocessing and retrieval

In [3]:
from langchain.document_loaders import DirectoryLoader

doc_loader = DirectoryLoader('./dataset/word_standards/', glob="*.docx")
apra_docs = doc_loader.load()

print("documents:", len(apra_docs))

documents: 91


In [4]:

import uuid

def get_file_id(doc):
    # the file name without its extension identifies the source document
    file_path = doc.metadata['source']
    file_id = os.path.splitext(os.path.basename(file_path))[0]
    
    return file_id

def chunk_doc(doc, text_splitter):
    chunk_data = []
    doc_id = get_file_id(doc)
    chunks = text_splitter.split_text(doc.page_content)  # Accessing the text of the document
    for idx, chunk in enumerate(chunks):
        # NOTE: LangChain's Pinecone vector store returns metadata['text'] as page_content
        # by default, so retrieval surfaces this "<doc_id> <idx>" label, not the chunk text
        data = {'id':  str(uuid.uuid4()), 
                'metadata': {'text': doc_id + " " + str(idx), 
                             'doc_id': doc_id, 
                             'chunk_index': idx}, 
                'content': chunk}     
        chunk_data.append(data)
    
    return chunk_data
    

In [5]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=2048, chunk_overlap=400)
chunk_data = [] # [{id:, metadata: {text:, doc_id:, chunk_index:}, content:}]

# chunk each document individually to prevent regulation data overlap that can give the wrong answer
for doc in apra_docs:
    doc_chunks = chunk_doc(doc, text_splitter)
    chunk_data.extend(doc_chunks)
    
print("Documents:", len(apra_docs))
print("Chunks:", len(chunk_data))
print("example", chunk_data[0])

Documents: 91
Chunks: 2531
example {'id': '6c7c4c74-4429-4673-a7a9-3531306c1ba6', 'metadata': {'text': 'F2021L01119 0', 'doc_id': 'F2021L01119', 'chunk_index': 0}, 'content': 'Financial Sector (Collection of Data) (reporting standard) determination No. 28 of 2021 \n\nReporting Standard ARS 720.3 ABS/RBA Intra-group Assets and Liabilities\n\nFinancial Sector (Collection of Data) Act 2001\n\nI, Alison Bliss, delegate of APRA, under paragraph 13(1)(a) of the Financial Sector (Collection of Data) Act 2001 (the Act) and subsection 33(3) of the Acts Interpretation Act 1901:\n\nrevoke Financial Sector (Collection of Data) (reporting standard) determination No. 5 of 2019, including Reporting Standard ARS 720.3 ABS/RBA Intra-group Assets and Liabilities made under that Determination; and\n\ndetermine Reporting Standard ARS 720.3 ABS/RBA Intra-group Assets and Liabilities, in the form set out in the Schedule, which applies to the financial sector entities to the extent provided in paragraph 4 of

## Vector store

#### Initialise Pinecone vector database

In [28]:
from pinecone import Pinecone, ServerlessSpec

# Fetch the Pinecone index created on the server
index_name = os.getenv("PINECONE_INDEX_NAME")
pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))
pc_index = pc.Index(index_name, host=os.getenv("PINECONE_INDEX_HOST"))

print("Index:", pc_index)

# # NOT NEEDED: Index created on the server website
# if index_name not in pc.list_indexes().names():
#     # dimensions are for cohere.embed-english-v3
#     pc.create_index(
#         name=index_name,
#         dimension=1024, # Replace with your model dimensions
#         metric="cosine", # Replace with your model metric
#         spec=ServerlessSpec(
#             cloud="aws",
#             region="us-east-1" # Virginia region (starter plan)
#         ) 
#     )

Index: <pinecone.data.index.Index object at 0x7f8b182a6a30>


#### Generate embeddings from the AWS Bedrock foundation model Cohere

In [7]:
import boto3
from langchain_community.embeddings import BedrockEmbeddings

# Load embedding LLM
bedrock_client = boto3.client("bedrock-runtime",
                              region_name=os.getenv("BEDROCK_REGION"),
                              aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
                              aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"))

embedding_model = BedrockEmbeddings(model_id="cohere.embed-english-v3", client=bedrock_client)

In [8]:
# testing
test_text = chunk_data[0]['content']
test_embeddings = embedding_model.embed_documents([test_text])
print(test_embeddings)

[[-0.037628174, -0.005771637, -0.107299805, -0.03527832, 0.008735657, -0.017974854, -0.021224976, 0.05404663, 0.044189453, 0.015594482, -0.08795166, 0.0017461777, -0.0096206665, 0.0077705383, -0.02809143, -0.05053711, 0.009849548, 0.007156372, 0.031433105, -2.6166439e-05, -0.018600464, 0.01626587, 0.013244629, -0.025527954, 0.030471802, -0.015853882, 0.026779175, 0.058563232, 0.0107040405, 0.012519836, 0.011314392, -0.034362793, 0.017715454, 0.01600647, 0.024032593, 0.04385376, 0.0025577545, 0.0073051453, 0.0209198, 0.03366089, 0.03704834, -0.038238525, 0.057891846, -0.049865723, 0.010894775, 0.035095215, 0.028381348, -0.0309906, 0.025817871, 0.04043579, -0.022460938, 0.024536133, -0.04437256, 0.099609375, -0.021987915, -0.024017334, -0.022628784, -0.011306763, 0.025268555, 0.044921875, 0.0017080307, -0.0032196045, 0.01222229, -0.028121948, -0.05316162, -0.004310608, 0.041809082, -0.0044288635, 0.054016113, -0.03945923, -0.023910522, 0.06329346, 0.0031757355, -0.015487671, -0.02104187,

In [9]:
# CAUTION: long (and expensive) task to generate embedding for all chunks on AWS Bedrock.
texts = [chunk["content"] for chunk in chunk_data] 
embeddings = embedding_model.embed_documents(texts)

print("Embeddings:", len(embeddings))

Embeddings: 2531
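
Because the embedding call above is slow and billed, it may be worth caching its result on disk so the cell can be re-run cheaply. The following is a minimal sketch of that idea, not part of the original notebook. The helper name `load_or_compute_embeddings` and the JSON cache format are assumptions, and the cache is validated only by the number of vectors, so it should be deleted whenever the chunks change.

```python
import json
import os
import tempfile


def load_or_compute_embeddings(cache_path, texts, embed_fn):
    """Return cached vectors if the cache matches len(texts); otherwise embed and cache."""
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as f:
            cached = json.load(f)
        if len(cached) == len(texts):
            return cached
    vectors = embed_fn(texts)  # the expensive call happens only on a cache miss
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(vectors, f)
    return vectors


# Demo with a stand-in embedding function instead of a Bedrock call.
def fake_embed(texts):
    return [[float(len(t))] for t in texts]


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "embeddings.json")
    first = load_or_compute_embeddings(path, ["a", "bb"], fake_embed)   # computes
    second = load_or_compute_embeddings(path, ["a", "bb"], fake_embed)  # reads cache
    print(first == second)
```

In this notebook the call might look like `load_or_compute_embeddings("embeddings.json", texts, embedding_model.embed_documents)`, where the file name is hypothetical.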


#### Insert data into Pinecone vector database

In [27]:
# clean the database
pc_index.delete(delete_all=True) 

{}

In [11]:
# construct vector entries
entries = []
for idx, embedding in enumerate(embeddings):
    entries.append({"id": chunk_data[idx]['id'], 
                    "values": embedding, 
                    "metadata": chunk_data[idx]['metadata']})

In [18]:
print("Entries:", len(entries))
print("Sample metadata:", entries[0]['metadata'])
print("Sample metadata:", entries[1]['metadata'])

Entries: 2531
Sample metadata: {'text': 'F2021L01119 0', 'doc_id': 'F2021L01119', 'chunk_index': 0}
Sample metadata: {'text': 'F2021L01119 1', 'doc_id': 'F2021L01119', 'chunk_index': 1}


In [30]:
# upsert in batches due to size limit in Pinecone
import math

batch_size = 200
batches = math.ceil(len(entries) / batch_size)
print(batches, "batches of", batch_size, "embeddings")

# step through every slice, including the final partial batch
for start in range(0, len(entries), batch_size):
    end = min(start + batch_size, len(entries))
    print("inserting ", start, "to", end)
    pc_index.upsert(entries[start:end])

13 batches of 200 embeddings
inserting  0 to 200
inserting  200 to 400
inserting  400 to 600
inserting  600 to 800
inserting  800 to 1000
inserting  1000 to 1200
inserting  1200 to 1400
inserting  1400 to 1600
inserting  1600 to 1800
inserting  1800 to 2000
inserting  2000 to 2200
inserting  2200 to 2400
inserting  2400 to 2531


In [33]:
# Vector store complete
pc_index.describe_index_stats()

{'dimension': 1024,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 2531}},
 'total_vector_count': 2531}

## QA query and answering

### Set up vector db connection and LLM

In [55]:
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
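# aliased so it does not shadow the Pinecone client class imported earlier
from langchain.vectorstores import Pinecone as PineconeVectorStore
from langchain_openai import OpenAI
from IPython.display import Markdown, display

# Fetch index to vector database
vectorstore_pinecone = PineconeVectorStore.from_existing_index(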
    embedding=embedding_model,
    index_name = index_name
)

# Retriever to the vector db 
# Each document averages about 2531 / 91 ≈ 27.8 chunks. Set top k to 60 to get enough context
retriever=vectorstore_pinecone.as_retriever(search_type="similarity", search_kwargs={"k": 60})
vector_index = VectorStoreIndexWrapper(vectorstore=vectorstore_pinecone)

# Decoder LLM for answer generation. Temperature set to 0 for consistent answer
llm = OpenAI(openai_api_key=os.getenv('OPENAI_API_KEY'), temperature=0.0, max_tokens=400)


In [73]:
test_query = "What is the quality control on International Banking Statistics Balance Sheet Items"

### Pinecone query and retrieval 

In [74]:
results = vectorstore_pinecone.similarity_search_with_score(query=test_query,k=60)
for doc, score in results[:5]: # peeking at 5 of the chunks retrieved
    print(f"* [Similarity={score:3f}] {doc.page_content} [{doc.metadata}]")

* [Similarity=0.522018] F2023L00410 16 [{'chunk_index': 16.0, 'doc_id': 'F2023L00410'}]
* [Similarity=0.473942] F2018L01116 16 [{'chunk_index': 16.0, 'doc_id': 'F2018L01116'}]
* [Similarity=0.469355] F2021L01470 12 [{'chunk_index': 12.0, 'doc_id': 'F2021L01470'}]
* [Similarity=0.462184] F2023L00417 29 [{'chunk_index': 29.0, 'doc_id': 'F2023L00417'}]
* [Similarity=0.462184] F2023L00417 17 [{'chunk_index': 17.0, 'doc_id': 'F2023L00417'}]
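
The results above carry only the `doc_id` and `chunk_index` of each hit, with `chunk_index` returned as a float. One way to recover the chunk text is to look the hit up in the in-memory `chunk_data` list. The sketch below shows that idea on toy data. The helper names `build_chunk_lookup` and `resolve_hit` are hypothetical, and it assumes the same `chunk_data` structure built earlier in this notebook.

```python
def build_chunk_lookup(chunk_data):
    """Index chunk contents by (doc_id, chunk_index)."""
    return {
        (c["metadata"]["doc_id"], int(c["metadata"]["chunk_index"])): c["content"]
        for c in chunk_data
    }


def resolve_hit(metadata, lookup):
    """Return the chunk text for a retrieved hit, or None if it is not in the lookup."""
    # Pinecone returns numeric metadata as floats (e.g. 16.0), so cast back to int.
    key = (metadata["doc_id"], int(metadata["chunk_index"]))
    return lookup.get(key)


# Toy stand-in for the notebook's chunk_data list.
toy_chunks = [
    {"metadata": {"doc_id": "DOC1", "chunk_index": 0}, "content": "first chunk"},
    {"metadata": {"doc_id": "DOC1", "chunk_index": 1}, "content": "second chunk"},
]
lookup = build_chunk_lookup(toy_chunks)
print(resolve_hit({"doc_id": "DOC1", "chunk_index": 1.0}, lookup))
```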


In [90]:
# examining the first chunk of the document that the top retrieval came from
top_doc, _ = results[0]
top_doc_id = top_doc.metadata['doc_id']
top_doc_content = ""

for chunk in chunk_data: 
    if chunk['metadata']['doc_id'] == top_doc_id:
        top_doc_content = chunk['content']
        break
    

display(Markdown("<b>Top chunk content:</b><br>" + top_doc_content))
    

<b>Top chunk content:</b><br>Financial Sector (Collection of Data) (reporting standard) determination No. 56 of 2023 

Reporting Standard ARS 223.0 Residential Mortgage Lending

Financial Sector (Collection of Data) Act 2001

I, Michael Murphy, delegate of APRA, under paragraph 13(1)(a) of the Financial Sector (Collection of Data) Act 2001 (the Act) and subsection 33(3) of the Acts Interpretation Act 1901:

revoke Financial Sector (Collection of Data) (reporting standard) determination No. 11 of 2022, including -	Reporting Standard ARS 223.0 Residential Mortgage Lending made under that Determination; and

determine Reporting Standard ARS 223.0 Residential Mortgage Lending, in the form set out in the Schedule, which applies to the financial sector entities to the extent provided in paragraph 3 of the reporting standard.

Under section 15 of the Act, I declare that the reporting standard shall begin to apply to those financial sector entities, and the revoked reporting standard shall cease to apply, on the day it is registered on the Federal Register of Legislation. 

This instrument commences upon registration on the Federal Register of Legislation.

Dated: 31 March 2023

Michael Murphy

Acting Chief Data Officer

Technology and Data Division

Interpretation

In this Determination:

APRA means the Australian Prudential Regulation Authority.



Federal Register of Legislation means the register established under section 15A of the Legislation Act 2003.

financial sector entity has the meaning given by section 5 of the Act. 

Schedule 

Reporting Standard ARS 223.0 Residential Mortgage Lending comprises the document commencing on the following page.

March 2023



Reporting Standard ARS 223.0

Residential Mortgage Lending

Objective of this Reporting Standard

This reporting standard outlines the requirements for the provision of information to APRA relating to an authorised deposit-taking institution’s residential mortgage lending.

It includes Reporting Form ARF 223.0 Residential Mortgage Lending and the associated specific instructions.

### QA prompt engineering with Pinecone vector store

In [76]:
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

prompt_template = """
You are an Australia APRA expert on regulations providing answers to customers.
Give clear responses to the following question: 
{question}. 
Do not make up answers.

{context}

Question: {question}
Answer:"""

prompt = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt}
)

test_answer = qa({"query": test_query})

In [87]:
# display the results
display(Markdown("<b>Question:</b>" + test_query))
display(Markdown("<b>Answer (Vectorstore):</b>" + test_answer['result']))

<b>Question:</b>What is the quality control on International Banking Statistics Balance Sheet Items

<b>Answer (Vectorstore):</b> The quality control on International Banking Statistics Balance Sheet Items is overseen by the Australian Prudential Regulation Authority (APRA). APRA sets and enforces prudential standards and requirements for banks and other financial institutions in Australia, including those related to the reporting and accuracy of balance sheet items. APRA conducts regular reviews and audits to ensure compliance with these standards and to identify any potential issues or discrepancies. Additionally, APRA works closely with other international regulatory bodies to ensure consistency and accuracy in reporting across borders.

## Compare answer to general prompt without vector store

In [86]:

prompt = PromptTemplate.from_template(template=prompt_template)

# format the prompt to add variable values
prompt_formatted_str: str = prompt.format(
    question=test_query,
    context=None)

# make a prediction
prediction = llm.predict(prompt_formatted_str)

display(Markdown("<b>Question:</b>" + test_query))
display(Markdown("<b>Answer (Vectorstore)</b>:" + test_answer['result']))
display(Markdown("<b>Answer (OpenAI):</b>" + prediction))

<b>Question:</b>What is the quality control on International Banking Statistics Balance Sheet Items

<b>Answer (Vectorstore)</b>: The quality control on International Banking Statistics Balance Sheet Items is overseen by the Australian Prudential Regulation Authority (APRA). APRA sets and enforces prudential standards and requirements for banks and other financial institutions in Australia, including those related to the reporting and accuracy of balance sheet items. APRA conducts regular reviews and audits to ensure compliance with these standards and to identify any potential issues or discrepancies. Additionally, APRA works closely with other international regulatory bodies to ensure consistency and accuracy in reporting across borders.

<b>Answer (OpenAI):</b> The quality control on International Banking Statistics Balance Sheet Items is overseen by the Bank for International Settlements (BIS). The BIS has established a set of guidelines and standards for reporting and compiling these statistics, which are regularly reviewed and updated. Additionally, national central banks and supervisory authorities also have their own quality control measures in place to ensure accuracy and consistency in reporting.