In [1]:
import sys
import os

# Add the parent directory (Auditbot_backend) to the system path
sys.path.append(
    os.path.abspath(
        os.path.join(
            os.path.dirname(f"{os.getcwd()}/RAG.ipynb"),
            '..'
        )
    )
)

# Using RAG to Build a Custom ChatBot
## 3. Retrieval-Augmented Generation

> **Notice:**  
> Before starting this tutorial series, read up on the RAG pipeline.

This tutorial series assumes prerequisite understanding of RAG and therefore goes through the implementation of an advanced and customized RAG pipeline, explaining the micro-decisions made along the way.

> **Data Corpus:** 
> This tutorial uses [AGO yearly audit reports](https://www.ago.gov.sg/publications/annual-reports/) as an example. However, this repo's code is applicable to most PDF documents. The code examples for other documents (such as the National Day Rally speech) will be referenced later. 

### Step 1: Load vector store

We need to load the embedding vectors stored in our ChromaDB database to perform dense retrieval. The cells below show how to do that.


In [2]:
# chromadb library
import chromadb
from chromadb.utils import embedding_functions

# custom helper functions
from utils.db_utils import chroma_get_or_create_collection 

# constants
from utils.initialisations import OPENAI_API_KEY

In [4]:
client_dense = chromadb.PersistentClient(path="../data/db")

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
                api_key=OPENAI_API_KEY,
                model_name="text-embedding-3-small"
            )

# make sure to set reset = False as we are only loading already saved data
collection = chroma_get_or_create_collection(client_dense, 
                                             name = "audit", 
                                             embedding_function = openai_ef, 
                                             reset = False)

### Step 2: Load index

We need to connect to the string index in the ElasticSearch container to perform sparse retrieval. The cells below show how to do that.

In [5]:
# elastic search library
from elasticsearch import Elasticsearch

# constants
from utils.initialisations import LOCAL_HOST_URL, HTTP_AUTH

In [6]:
client_sparce = Elasticsearch(
    LOCAL_HOST_URL,
    basic_auth=HTTP_AUTH
)

### Step 3: Perform dense retrieval

ChromaDB comes with the functionality of performing dense retrieval using our choice of bi-encoder. 

In [9]:
# custom helper functions
from utils.db_utils import chromadb_embedding_search

In [10]:
# Can be rerun whenever the query changes
query = "What are the findings pertaining to grant?"
top_k = 30

# dense embedding search
embedding_results = chromadb_embedding_search(collection, query, top_k)

An alternative to performing dense retrieval in ChromaDB is using [Facebook AI Similarity Search (FAISS)](https://ai.meta.com/tools/faiss/). Its efficient implementation of cosine similarity search allows for very quick dense retrieval. 

Check out ["../notebooks/inmemory_retriever.ipynb"](../notebooks/inmemory_retriever.ipynb) to try FAISS using the AGO documents.
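To make the idea behind dense retrieval concrete, here is a minimal sketch of a brute-force cosine similarity search in plain Python. It is not how ChromaDB or FAISS are implemented (both use optimised indexes); the function names `cosine_similarity` and `dense_top_k` and the toy 3-dimensional vectors are assumptions made purely for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def dense_top_k(query_vec, chunk_vecs, top_k):
    """Return (chunk_id, score) pairs for the top_k most similar chunks."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings (hypothetical); real embeddings have hundreds of dimensions
toy_vectors = {
    "grant lapses": [0.9, 0.1, 0.0],
    "procurement": [0.1, 0.9, 0.0],
    "IT controls": [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.0, 0.0]
dense_hits = dense_top_k(toy_query, toy_vectors, top_k=2)
print(dense_hits)
```

In a real pipeline the query vector comes from the same bi-encoder that embedded the chunks, which is why the collection is loaded with its embedding function in Step 1.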

### Step 4: Perform sparse retrieval

I chose a bag-of-words retrieval function (Okapi BM25) for sparse retrieval because of its track record in traditional NLP projects and its plethora of documentation. Other good alternatives include TF-IDF (Term Frequency-Inverse Document Frequency).

I used ElasticSearch's implementation of [BM25](https://www.elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables) to perform sparse retrieval. 
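For intuition, the BM25 scoring function can be sketched in a few lines of plain Python. This is an illustrative re-implementation under simplifying assumptions (whitespace tokenisation, no stemming or stop words), not the code ElasticSearch runs; `bm25_scores` is a hypothetical name, and the defaults `k1=1.2` and `b=0.75` are commonly used values.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates (k1) and is normalised by document length (b)
            tf = term_freq[term]
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


toy_docs = [
    "grant conditions were not met",
    "procurement lapses in contract management",
    "grant recipients and grant monitoring",
]
toy_scores = bm25_scores("grant monitoring", toy_docs)
print(toy_scores)
```

The third toy document scores highest because it contains both query terms, and the second scores zero because it contains neither.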

In [11]:
# import helper functions
from utils.db_utils import bm25_elasticsearch

# constants
from utils.initialisations import index_name

In [12]:
# perform bm_25 using elasticsearch
bm25_results = bm25_elasticsearch(client_sparce, index_name, HTTP_AUTH, query, top_k)

An alternative to performing sparse retrieval in ElasticSearch is [LangChain's BM25 function](https://python.langchain.com/v0.2/docs/integrations/retrievers/bm25/). However, it does not index the documents before performing the BM25 search, so it will not be as efficient as ElasticSearch. For LangChain's BM25 search, the data has to be loaded from in-code memory (there is no persistent data store, unlike ElasticSearch and ChromaDB).

Check out ["../notebooks/inmemory_retriever.ipynb"](../notebooks/inmemory_retriever.ipynb) to try BM25Retriever from LangChain using the AGO documents.

### Step 5: Combine results of both retrievals (Ranking)
Each retrieval produces its own ranking of the chunks it considers most closely related to the query. Since the two rankings will probably disagree, we need to fuse them into a single combined ranking. I went with [Reciprocal Rank Fusion (RRF)](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) as it lets us tune how the combined ranking is produced. 

A custom RRF function was written for this step as I added weights to each ranking. The definition of this function, `reciprocal_rank_fusion`, can be found in ["../utils/db_utils.py"](../utils/db_utils.py).
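As a rough illustration of weighted RRF, the sketch below scores each chunk as the weighted sum of `1 / (k + rank)` over the rankings it appears in, with ranks starting at 1. It is not the `reciprocal_rank_fusion` function from `db_utils.py`, whose internals are not shown here; the name `weighted_rrf`, its signature and the toy chunk ids are assumptions.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, k=60):
    """Fuse ranked lists: score(chunk) = sum_i weights[i] / (k + rank_i(chunk))."""
    fused = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, chunk in enumerate(ranking, start=1):
            fused[chunk] += weight / (k + rank)
    # Highest fused score first
    return sorted(fused, key=fused.get, reverse=True)


# Toy rankings (hypothetical chunk ids), best match first
sparse_ranking = ["chunk_a", "chunk_b", "chunk_c"]
dense_ranking = ["chunk_b", "chunk_d", "chunk_a"]

fused_ranking = weighted_rrf([sparse_ranking, dense_ranking], weights=[0.5, 0.5], k=60)
print(fused_ranking)
```

Chunks that appear in both rankings accumulate two contributions, so they tend to rise to the top. Raising one weight biases the fused ranking toward that retriever, and a larger `k` flattens the difference between adjacent ranks.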

In [14]:
# custom helper functions
from utils.db_utils import reciprocal_rank_fusion
from utils.custom_print import pretty_print_list

In [15]:
# weights for each retrieval for reciprocal rank fusion
weights = [0.5, 0.5]

# reciprocal ranking fusion constant
k = 60

# RRF
good_chunks  = reciprocal_rank_fusion(bm25_results, 
                                 embedding_results, 
                                 weights, 
                                 k)

print("COMBINED RANKING---------------------------------------------------------\n")
pretty_print_list(good_chunks[:3])

COMBINED RANKING---------------------------------------------------------

idx: 0

Details of the lapses pertaining to the enforcement of SDL collections are in the 
 
following paragraphs

----------------------------------------------------------

idx: 1

Stage 1: Grant Design and Setup
– whether there were processes and controls in place to ensure that 
grant programmes were authorised and administered in accordance 
with the objective(s) of the grant

----------------------------------------------------------

idx: 2

Audit findings are conveyed by AGO to the ministries and statutory boards audited 
by way of “management letters”

----------------------------------------------------------



An alternative to performing RRF is using [LangChain's Ensemble Retriever](https://python.langchain.com/docs/how_to/ensemble_retriever/). 

This function appears to behave as follows:
1. combine both sparse and dense retrievers into an ensemble retriever
2. perform a single retrieval using the newly created ensemble retriever 

However, that is not what happens. Under the hood, it still performs the dense and sparse retrievals separately and combines both results using RRF. Therefore, it is similar to our custom setup. Ensemble retrieval can, however, exhibit unexpected behaviours. 

I have carried out some experiments with this Ensemble Retriever at the end of ["../notebooks/inmemory_retriever.ipynb"](../notebooks/inmemory_retriever.ipynb) and explained how it works. This notebook also uses Ensemble Retriever on AGO reports as an example. 

### Step 6: ReRanking

I have picked 2 models. 

1. A familiar and light model, [RoBERTa](https://arxiv.org/abs/1907.11692). Various versions of [opensource RoBERTa](https://huggingface.co/docs/transformers/en/model_doc/roberta) exist so feel free to explore before choosing one.

2. For higher accuracy (at the cost of a harder setup), use the Microsoft Machine Reading Comprehension ([MS MARCO](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-2-v2)) leaderboard or the HuggingFace Massive Text Embedding Benchmark ([MTEB](https://huggingface.co/spaces/mteb/leaderboard)) leaderboard to pick the best cross-encoders for reranking. The second model chosen is `ms-marco-MiniLM-L-12-v2`, the best performing model on the MS MARCO leaderboard.
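The reranking step can be sketched independently of any particular model: score each (query, chunk) pair jointly, then sort by score. In the sketch below the scorer is a pluggable function, and `toy_overlap_score` is a stand-in that only counts shared words. With the sentence-transformers library, a real scorer would typically wrap `CrossEncoder(model_name).predict(...)` on the (query, chunk) pairs. The function `rerank` and its treatment of `top_n` (rerank only the first `top_n` fused chunks) are assumptions, not the `reranking` helper in `utils/retriever.py`.

```python
def rerank(chunks, query, score_fn, top_n):
    """Rerank the first top_n chunks by a joint (query, chunk) relevance score."""
    candidates = chunks[:top_n]
    scored = [(chunk, score_fn(query, chunk)) for chunk in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    ranked_chunks = [chunk for chunk, _ in scored]
    ranked_scores = [score for _, score in scored]
    return ranked_chunks, ranked_scores


def toy_overlap_score(query, chunk):
    """Stand-in scorer (hypothetical): number of query words found in the chunk."""
    chunk_words = set(chunk.lower().split())
    return sum(1 for word in set(query.lower().split()) if word in chunk_words)


toy_chunks = [
    "grant recipients failed checks",
    "procurement lapses",
    "findings on grant management",
]
toy_best, toy_scores = rerank(toy_chunks, "findings grant", toy_overlap_score, top_n=3)
print(toy_best, toy_scores)
```

Unlike the bi-encoder used for dense retrieval, a cross-encoder reads the query and the chunk together, which is more accurate but too slow to run over the whole corpus. That is why it is applied only to the short list produced by the fusion step.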

In [16]:
# custom helper functions
from utils.retriever import reranking
from utils.custom_print import pretty_print_rank

In [17]:
# Reranking ------------------------------------------------------------------

# top n matches for reranking
top_n = 20

# Cross encoder model

# claimed to be deprecated because it is bad but seems to still work fine
# model_name = "cross-encoder/stsb-roberta-base"

# best performing on Microsoft tests
model_name = "cross-encoder/ms-marco-MiniLM-L-12-v2"

# Can be rerun whenever the query changes
# Reranking
best_chunks, scores = reranking(model_name, good_chunks, query, top_n)

print("RERANKING-----------------------------------------------------------\n")
pretty_print_rank(best_chunks, scores)

RERANKING-----------------------------------------------------------

RANK: 1

Pertaining to the lack of checks on declarations by grant recipients, EDB 
explained that there were specific controls in place to ensure that grant recipients take 
ownership for accurate and credible reporting

SCORE: 3.238358

----------------------------------------------------------

RANK: 2

The audit examined whether there was a proper framework for grant 
management and whether due process was followed for the above stages

SCORE: -0.9363227

----------------------------------------------------------

RANK: 3

The audit examined whether there was a proper framework for grant 
management and whether due process was followed for the above stages by the two 
agencies

SCORE: -1.0825868

----------------------------------------------------------

RANK: 4

Stage 4: Grant Monitoring and Review
–	
Whether there were processes and controls in place to ensure 
that grants were managed in accordance with relev

### Step 7: Augment

["../utils/prompt_engineering.py"](../utils/prompt_engineering.py) performs prompt engineering to augment all the retrieved chunks together. There are 2 prompt engineering examples provided, for AGO reports and National Day Rally speeches. 

This notebook also provides an example of a bad prompt (without prompt engineering or added metadata). ["../notebooks/RAG_db.ipynb"](../notebooks/RAG_db.ipynb) compares the output of good and bad prompts (at the end). 

In [18]:
# custom helper functions
from utils.prompt_engineering import generate_prompt
from utils.json_parser import json_file_to_dict

# constants
from utils.initialisations import save_inverted_tree_path, s_p_pairs_path

In [19]:
# Chunk into sentences ('s') or paragraphs ('p') or fixed-size strings ('f')
chunking='s' 

# Group smaller chunks into a bigger chunk
grouping=1

# question by user
# query can be set to be same as user's question or 
#  by using HyDE, a hypothetical answer to the user's question
question = query

# RUN ONCE
# retrieve all required data structures

# load tree
inverted_tree = json_file_to_dict(save_inverted_tree_path)

# load chunks from tree's keys
chunks = list(inverted_tree.keys())
print("Number of unique chunks:", len(chunks))

# load sentence paragraph pairs. 
if (chunking == 's' or chunking == 'f') and grouping == 1:
    print("s_p_pairs will be filled")
    s_p_pairs = json_file_to_dict(s_p_pairs_path)
else:
    s_p_pairs = {}

Number of unique chunks: 8210
s_p_pairs will be filled


In [20]:

# Can be rerun whenever the query changes
prompt = generate_prompt(question, 
                         inverted_tree, 
                         best_chunks, 
                         chunking, 
                         s_p_pairs)

print(prompt)

Role:
You are a specialist who uses the context provided to answer the query.

Instruction:
Your response should cite sources' year and page number.
If possible, make ministries or government agencies the headings.
If you are unable to provide an answer, state "Unable to find, submit prompt again."

Background:
The context is taken from audit reports from the Auditor-General's Office (AGO) of Singapore. 
AGO is an independent organ of state and the national auditor. They play an important role in enhancing public accountability in the management and use of public funds and resources through their audits.

They audit
    government ministries and departments
    organs of state
    statutory boards
    government funds
    other public authorities and bodies administering public funds (upon their request for audit), e.g. government-owned companies.

They report their audit observations to the President, Parliament and the public through the Annual Report of the Auditor-General managemen

### Step 7.5: Prompt Expansion

Notice how our prompt contains paragraphs even though we used sentence-based chunking? That is because single sentences usually do not provide sufficient information to answer the user's questions. 

That's why I have used sentence_paragraph_pairs to replace the retrieved sentences with paragraphs for the final prompt. This also works for fixed-size chunks. 
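The replacement described above can be sketched as a dictionary lookup followed by de-duplication. This assumes the sentence-paragraph pairs are stored as a mapping from each sentence to its parent paragraph; the actual layout of `s_p_pairs` is not shown in this notebook, and `expand_to_paragraphs` is a hypothetical helper, not part of `generate_prompt`.

```python
def expand_to_paragraphs(retrieved_sentences, sentence_to_paragraph):
    """Swap each retrieved sentence for its parent paragraph, keeping rank order."""
    paragraphs = []
    seen = set()
    for sentence in retrieved_sentences:
        # Fall back to the sentence itself if no parent paragraph is recorded
        paragraph = sentence_to_paragraph.get(sentence, sentence)
        if paragraph not in seen:
            seen.add(paragraph)
            paragraphs.append(paragraph)
    return paragraphs


# Toy mapping (hypothetical data)
toy_pairs = {
    "Sentence A1.": "Sentence A1. Sentence A2.",
    "Sentence A2.": "Sentence A1. Sentence A2.",
    "Sentence B1.": "Sentence B1. Sentence B2.",
}
toy_retrieved = ["Sentence A2.", "Sentence B1.", "Sentence A1.", "Sentence C1."]
expanded = expand_to_paragraphs(toy_retrieved, toy_pairs)
print(expanded)
```

De-duplicating matters because several retrieved sentences often come from the same paragraph, and repeating that paragraph would waste context window without adding information.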

### Step 8: Generate

The final step of RAG is to feed the prompt into an LLM and generate an answer. I have chosen GPT-4o due to its large context window. LLMs will keep improving so remember to pick the LLM that best suits your use case. 

In order to keep track of LLM inputs and outputs, I have used LangSmith. The free tier is sufficient for development environments. LangSmith will also allow you to grade your LLM outputs, hence creating metrics for evaluating your chatbot performance in the future. Follow this [guide](https://docs.smith.langchain.com) to get started with LangSmith.

In the utilities file, ["../utils/langsmith_trace.py"](../utils/langsmith_trace.py), I have documented how to trace functions using LangSmith. Our final chatbot also uses LangSmith tracing for every API call.

In [21]:
# custom helper functions
from utils.langsmith_trace import rag_pipeline, llm

In [22]:
# pack parameters for tracing using LangSmith 

# control minimum chunk size
min_chunk_size=100

# add to database in batches
batch_size = 1000

# An optional add-on, to be explained in the experiments tutorial
HyDE = False
if HyDE:
    comments = "This is using HyDE"
else:
    comments = "None"

params = (question, chunking, grouping, min_chunk_size, batch_size, top_k, 
          weights, k, top_n, model_name, HyDE, comments)

In [23]:
# A tokenizers fork warning (below) appears when the LLM is initialised in another file.
# It does not affect the LLM outputs, so parallelism is simply disabled as it is not needed.
'''
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
'''

# prevents deadlocks (corrects for the above error)
os.environ["TOKENIZERS_PARALLELISM"] = "false"

response = rag_pipeline(params, prompt, llm)

# Ministry of Trade and Industry

## Economic Development Board
### Lapses in Administration of Grants
- **Year:** 2016-17
- **Page Number:** 46
- **Findings:** For 47 grant projects audited by AGO, there were seven projects where there was no evidence that EDB had followed up with the grant recipients to determine that the project conditions and milestones had been met by the stipulated due dates (AGO, 2016-17, p. 46).

### Lapses in Administration of Grants
- **Year:** 2016-17
- **Page Number:** 48
- **Findings:** AGO noted a lack of checks on declarations by grant recipients. EDB explained specific controls in place, such as sample checks with onsite visits by its Internal Audit and conduct of site visits by its Cluster Groups for certain incentive schemes. However, these site visits would apply to only five of the nine schemes audited by AGO (AGO, 2016-17, p. 48).

# Ministry of Transport

## Civil Aviation Authority of Singapore
### Lapses in Management of Grants
- **Year:** 2022-23
- **Page Number:** 42
- **Findings:** Certain eligibility criteria were either inaccurately stated or not included in the grant agreements with two companies (AGO, 2022-23, p. 42).

# Ministry of Sustainability and the Environment

## National Environment Agency
### Possible Irregularities in Quotations Submitted for Grant Applications
- **Year:** 2021-22
- **Page Number:** 46
- **Findings:** The grant scheme aims to raise operational efficiency and productivity in the environmental services industry through technology adoption. Among other things, applicants are required to submit a few quotations for the identified equipment or solution to demonstrate cost reasonableness (AGO, 2021-22, p. 46).

# Ministry of Health and Ministry of Social and Family Development

### Management of Social Grant Programmes
- **Year:** 2018-19
- **Page Number:** 7
- **Findings:** AGO identified gaps in the management of social grant programmes. $1.59 billion was disbursed by MOH and MSF under their social grant programmes to 1,058 Programme-Voluntary Welfare Organisations (VWOs) from 1 April 2016 to 31 March 2018. AGO test-checked 429 Programme-VWOs covering a disbursement value of $488.52 million (30.7 per cent). The audit covered five stages of grant management: grant design and setup, grant evaluation and approval, disbursement of grants, monitoring and review of grants, and cessation of grants (AGO, 2018-19, p. 7).

# Workforce Singapore (WSG) and Enterprise Singapore (ESG)

### Roles and Responsibilities in Grant Management
- **Year:** 2019-20
- **Page Number:** 54
- **Findings:** The audit examined whether there was a proper framework for grant management and whether due process was followed. For grants managed jointly by WSG and ESG with their programme partners (PPs), such as Trade Associations and Chambers, the audit focus was on the roles and responsibilities of WSG and ESG in grant management (AGO, 2019-20, p. 54).

### Administration of Grant Programmes by WSG
- **Year:** 2019-20
- **Page Number:** 57
- **Findings:** There was an inadequate assessment of proposed costs to be supported and verification of grant applicants' eligibility. AGO also noted instances where companies or individuals might have circumvented the grant requirements and controls. Despite the administration of grant programmes being outsourced to PPs, WSG remains accountable for how the funds are managed and should maintain adequate oversight of the PPs (AGO, 2019-20, p. 57).

### Oversight of Programme Partners
- **Year:** 2019-20
- **Page Number:** 57
- **Findings:** WSG needs better oversight of PPs administering grants on its behalf. PPs of varying sizes with different control systems led to inconsistent practices in stipulating requirements and performing checks on grant applications (AGO, 2019-20, p. 57).

# General Findings

### Common Weaknesses in Grant Administration
- **Year:** 2014-15
- **Page Number:** 3
- **Findings:** AGO uncovered several instances indicating laxity in the administration of grants. Common weaknesses include the failure to ensure that the correct amount of grants is disbursed and that conditions for grants are adhered to (AGO, 2014-15, p. 3).

### Procurement and Contract Management
- **Year:** 2011-12
- **Page Number:** 5
- **Findings:** A substantial portion of audit findings pertained to procurement and contract management, and financial administration. Lapses and irregularities point to the need for improvements in these areas (AGO, 2011-12, p. 5).

# Anonymous Observations

### Grant Evaluation and Approval
- **Year:** 2019-20
- **Page Number:** 53
- **Findings:** Audit examined whether controls ensured that grant applications were properly evaluated and approved and whether agreements with grant recipients were properly entered into (AGO, 2019-20, p. 53).

### Grant Design and Setup
- **Year:** 2019-20
- **Page Number:** 53
- **Findings:** Audit examined whether processes and controls were in place to ensure that grant programmes were authorised and administered in accordance with objectives (AGO, 2019-20, p. 53).

### Grant Design and Setup
- **Year:** 2022-23
- **Page Number:** 8
- **Findings:** The grant eligibility criteria and operational requirements were properly laid down in legislation or implementation documents. Proper contracts and agreements were entered into with parties administering the schemes' key processes. Approval was also obtained from the Ministry of Finance for the funding of the schemes (AGO, 2022-23, p. 8).

### Grant Monitoring and Review
- **Year:** 2022-23
- **Page Number:** 49
- **Findings:** Processes and controls were examined to ensure that grants were managed in accordance with relevant terms and conditions and that deliverables were achieved (AGO, 2022-23, p. 49).

Unable to find, submit prompt again.

### Complete RAG Pipeline
We have provided some notebooks where you can run the whole RAG pipeline:

1. ["../notebooks/RAG_db.ipynb"](../notebooks/RAG_db.ipynb) 
    - Uses AGO audit reports
    - Uses both chromadb and ElasticSearch
    - Can experiment with sentence, paragraph and fixed-size chunking
    - Experiment with extent of prompt engineering

2. ["../notebooks/RAG_db_NDR.ipynb"](../notebooks/RAG_db_NDR.ipynb) 
    - Uses National Day Rally (NDR) speech 2024. 
    - Does not use content pages (not available in NDR speech)

3. ["../notebooks/RAG_langchain.ipynb"](../notebooks/RAG_langchain.ipynb)
    - Does RAG using the langchain environment  
    - Langchain FAISS (dense retrieval)
    - Langchain BM25 (sparse retrieval)
    - Langchain Ensemble retriever
    