In [1]:
import sys
import os

# Add the parent directory (Auditbot_backend) to the system path
# so that the `utils` package can be imported from this notebook
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))

# Using RAG to Build a Custom ChatBot
## 1. Data Processing

> **Notice:**  
> Before starting this tutorial series, read up on the RAG pipeline.

This tutorial series assumes a prior understanding of RAG. It walks through the implementation of an advanced, customized RAG pipeline and explains the micro-decisions made along the way.

> **Data Corpus:** 
> This tutorial uses [AGO yearly audit reports](https://www.ago.gov.sg/publications/annual-reports/) as an example. However, the code in this repo is applicable to most PDF documents. Code examples for other documents (such as the National Day Rally) will be referenced later.

Also, have a look at ["../flowchart.png"](../flowchart.png) for an overview of the whole pipeline.

### Step 1: Building a data corpus
Using HTTP GET requests and the BeautifulSoup library, the download of all audit reports from the AGO website can be automated. The website also hosts infographic files, which I have decided to ignore as they repeat the content of the actual audit reports. However, feel free to use them as well.

The code for this can be found in ["../notebooks/web_scraper.ipynb"](../notebooks/web_scraper.ipynb). This notebook also extracts the content pages from the audit reports, as content pages provide useful hierarchical information (analysed later).
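
As a rough illustration of the link-collection step, the sketch below pulls PDF links out of an HTML page and skips infographics. It uses Python's built-in `html.parser` in place of BeautifulSoup so that it runs without extra installs. The sample HTML, the `infographic` filename check and the name `collect_report_links` are all assumptions made for illustration, not the logic used in `web_scraper.ipynb`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collects the href of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def collect_report_links(html, base_url):
    """Return absolute URLs of PDF links, skipping infographic files."""
    parser = _LinkCollector()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        lowered = href.lower()
        # Assumption: infographics can be recognised by their file name
        if lowered.endswith(".pdf") and "infographic" not in lowered:
            links.append(urljoin(base_url, href))
    return links


# Made-up sample page, standing in for the downloaded HTML
sample_html = """
<a href="/docs/ar_fy2022_23.pdf">Report</a>
<a href="/docs/ar_fy2022_23_infographic.pdf">Infographic</a>
<a href="/about">About</a>
"""
report_links = collect_report_links(sample_html, "https://example.org/publications/")
```

Each returned URL could then be fetched with an HTTP GET request and written to `data/documents/`.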

Your data directory should now be arranged as:

```md
Main_dir/
└── data/
    └── documents/
        ├── ar_fy2008_09_content_pages.pdf
        ├── ar_fy2008_09.pdf
        ├── ar_fy2009_10_content_pages.pdf
        ├── ar_fy2009_10.pdf
        .
        .
        .
        └── ar_fy2022_23.pdf
```



### Step 2: Converting pdf to text
There are two main approaches to this: Optical Character Recognition (OCR) and PDF readers.

#### OCR
OCR uses computer vision techniques to convert images to text. However, current OCR techniques are not accurate enough to convert a document without errors. They struggle with line breaks, paragraphs and page numbers, all of which are important for chunking. Even vision transformers and GPT-4o have not solved this.

Check out ["../notebooks/ocr.ipynb"](../notebooks/ocr.ipynb) to try OCR for yourself. It might be satisfactory for your use case.

#### PDF Readers
As of now, the most accurate PDF reader is PyMuPDF, which is imported under the name `fitz`.
```bash
pip install PyMuPDF
```
```python
import fitz  # PyMuPDF is imported as "fitz"

# pdf_path: path to one of the PDFs in data/documents/
pdf_document = fitz.open(pdf_path)
```
Helper functions `pdf_to_pages(pdf_path)` and `pdf_to_text(pdf_path)` have been provided in ["../utils/preprocessing.py"](../utils/preprocessing.py). They are also used later for chunking.
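
To give a feel for what page-wise extraction looks like, here is a minimal sketch. It is not the code in `utils/preprocessing.py`: the names `pages_from_document` and `extract_pages` are hypothetical, and the only PyMuPDF behaviour relied on is that an opened document can be iterated page by page and that each page has a `get_text()` method.

```python
def pages_from_document(doc):
    """Return [(page_number, text), ...] with 1-based page numbers.

    `doc` is anything that yields page objects exposing get_text(),
    which is how an opened PyMuPDF document behaves.
    """
    return [(number, page.get_text()) for number, page in enumerate(doc, start=1)]


def extract_pages(pdf_path):
    """Open a PDF with PyMuPDF and return its pages as text."""
    import fitz  # PyMuPDF, imported here so the helper above works without it

    with fitz.open(pdf_path) as doc:
        return pages_from_document(doc)
```

Keeping the page number next to the text at this stage makes it straightforward to attach page numbers to chunks later.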

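### Step 3: Generate Hierarchical Structure for Documents
The content pages extracted in Step 1 can be parsed into a tree that mirrors the structure of each report: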
```json
{
    "2022_23": {
        "OVERVIEW": {
            "SUMMARY": "pages 1 to 12"
        },
        "PART I A : AUDIT OF GOVERNMENT FINANCIAL STATEMENTS": {
            "SUMMARY": "pages 13 to 14"
        },
        "PART I B : AUDIT OF GOVERNMENT MINISTRIES, ORGANS OF STATE AND GOVERNMENT FUNDS": {
            "SUMMARY": "pages 15 to 16",
            "MINISTRY OF COMMUNICATIONS AND INFORMATION": {
                "Tenderers Appointed Despite Not Meeting Evaluation Criteria": "page 18"
                
            }
            .
            .
            .
        }
    }
    .
    .
    .
}
```

This tree will be extremely useful in determining where the chunks originated from and in providing metadata to the LLM. The code to generate this tree can be found in ["../notebooks/content_page_parser.ipynb"](../notebooks/content_page_parser.ipynb).

> **Notice:**  
> Only documents with content pages can generate this tree, as content pages provide the hierarchical information. Documents without content pages will lose out on the metadata provided by this tree.
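
One way such a tree could be used to find where a page sits in a report is sketched below. This is an illustration only, not the code in `content_page_parser.ipynb`. It assumes a single-year subtree whose leaves are strings like `"page 18"` or `"pages 15 to 16"`, and it treats each entry as running until the next entry starts. The names `first_page`, `flatten` and `locate` are hypothetical.

```python
import re
from bisect import bisect_right


def first_page(label):
    """Extract the starting page from 'page 18' or 'pages 15 to 16'."""
    match = re.search(r"\d+", label)
    return int(match.group()) if match else None


def flatten(tree, path=()):
    """Yield (start_page, path) for every leaf of the nested dict."""
    for key, value in tree.items():
        if isinstance(value, dict):
            yield from flatten(value, path + (key,))
        else:
            page = first_page(value)
            if page is not None:
                yield page, path + (key,)


def locate(tree, page_number):
    """Return the path of the last entry starting on or before page_number."""
    entries = sorted(flatten(tree))
    starts = [start for start, _ in entries]
    index = bisect_right(starts, page_number) - 1
    return entries[index][1] if index >= 0 else None


# Toy single-year tree in the same shape as the example above
toy_tree = {
    "OVERVIEW": {"SUMMARY": "pages 1 to 12"},
    "PART I B": {
        "SUMMARY": "pages 15 to 16",
        "MINISTRY OF COMMUNICATIONS AND INFORMATION": {
            "Tenderers Appointed Despite Not Meeting Evaluation Criteria": "page 18"
        },
    },
}
location = locate(toy_tree, 19)
```

Because only start pages are compared, a page that falls in a gap between listed entries is attributed to the preceding entry; a stricter version could also parse the end of each `pages X to Y` range.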

### Step 4: Chunking
I used sentence-based chunking as it provides the best results for the AGO reports. Other documents might perform better with alternative chunking methods. Explore the different chunking methods in [../notebooks/text_splitter.ipynb](../notebooks/text_splitter.ipynb).

> **FYI**  
> Although not proven, experiments suggest that dense retrievers rank chunks higher when they are similar in size to the question. It is therefore not surprising that sentence-based chunking produced the best results in this project.
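
For readers who want a concrete picture of sentence-based chunking, here is a simplified sketch. It is not the implementation behind `generate_chunks`. It assumes a naive regex sentence splitter, a `grouping` parameter that joins consecutive sentences, and a `min_chunk_size` measured in characters; the function name `sentence_chunks` is hypothetical.

```python
import re


def sentence_chunks(text, grouping=1, min_chunk_size=100):
    """Split text into sentence-based chunks.

    grouping: number of consecutive sentences joined into one chunk.
    min_chunk_size: chunks shorter than this many characters are merged
    into the following chunk (or the previous one at the end of the text).
    """
    # Naive split: break after ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    grouped = [
        " ".join(sentences[i:i + grouping])
        for i in range(0, len(sentences), grouping)
    ]

    chunks, buffer = [], ""
    for piece in grouped:
        buffer = f"{buffer} {piece}".strip()
        if len(buffer) >= min_chunk_size:
            chunks.append(buffer)
            buffer = ""
    if buffer:  # leftover short tail
        if chunks:
            chunks[-1] = f"{chunks[-1]} {buffer}"
        else:
            chunks.append(buffer)
    return chunks


sample = "Short one. This sentence is a little bit longer than the first. Tail."
chunks = sentence_chunks(sample, grouping=1, min_chunk_size=30)
```

A regex splitter like this will mis-split on abbreviations and numbered clauses, so a proper sentence tokenizer is likely a better choice for real reports.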


In [2]:
# main function for chunking
from utils.preprocessing import generate_chunks

# contains a set of constants, such as the save paths used below
from utils.initialisations import *

In [3]:
# HYPERPARAMETERS ============================================================
# preprocessing --------------------------------------------------------------

# Chunk into sentences ('s') or paragraphs ('p')
chunking='s' 

# Group smaller chunks into a bigger chunk
grouping=1

# control minimum chunk size
min_chunk_size=100

# RUN ONCE
# generate chunks and other useful data structures
generate_chunks(DOCUMENT_DIR,
                chunks_path,
                chunk_pageNum_pairs_path,
                s_p_pairs_path, 
                chunking, 
                grouping, 
                min_chunk_size,
                DOC_IDENTIFIER)


2.7536468505859375 seconds
number of chunks: 9127


### Step 5: Retrieving Metadata on chunks
Having more metadata on the chunks means more information is provided to the final LLM, and hence a more accurate answer. Example:

| Chunk   | Source           | Year | Location in Document | Page Number |
|---------|------------------|------|----------------------|-------------|
| chunk 1 | ar_fy2018_19.pdf | 2019 | **Part 1B: Audit of Government Ministries, Organs of State and Government Funds** <br> *Ministry of Education* <br> Connect Fund | 24 |

The source document, year and page number can be easily determined during chunking by using string methods. However, the location in the document, which potentially provides key information, is very difficult to obtain directly from PDF readers.
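
As an example of the string handling involved, the sketch below derives the source and year from a report file name. It assumes the `ar_fyYYYY_YY.pdf` naming pattern seen in the data directory and, following the table above, takes the year to be the one in which the financial year ends. The function name `source_metadata` is hypothetical.

```python
import re


def source_metadata(filename):
    """Derive source and year from a name like 'ar_fy2018_19.pdf'."""
    match = re.fullmatch(r"ar_fy(\d{4})_(\d{2})\.pdf", filename)
    if match is None:
        return None  # e.g. content-page files or unrelated documents
    start_year = int(match.group(1))
    # Assumption: the report year is the year in which the financial year ends
    return {"source": filename, "year": start_year + 1}


meta = source_metadata("ar_fy2018_19.pdf")
```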

Instead, I use the tree structure obtained from the content pages to fill in the "Location in Document" metadata. Documents that do not contain content pages will miss out on this metadata. The following code creates a dictionary (inverted tree) where the keys are the chunks and the values are the metadata.
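
Conceptually, the inverted tree can be pictured with the small sketch below. It is an assumption-laden illustration rather than the body of `generate_inverted_tree`: the name `build_inverted_tree`, the metadata field names, and the `locate` callable (anything that maps a page number to a location, or `None`) are all hypothetical.

```python
def build_inverted_tree(chunk_page_pairs, locate, source, year):
    """Map each chunk to a metadata dict.

    chunk_page_pairs: iterable of (chunk_text, page_number).
    locate: callable turning a page number into a location (or None).
    """
    inverted = {}
    for chunk, page in chunk_page_pairs:
        inverted[chunk] = {
            "source": source,
            "year": year,
            "location": locate(page),
            "page": page,
        }
    return inverted


# Made-up inputs for illustration
pairs = [("chunk 1", 24), ("chunk 2", 25)]
lookup = {24: "Ministry of Education > Connect Fund"}
inverted_tree = build_inverted_tree(pairs, lookup.get, "ar_fy2018_19.pdf", 2019)
```

Note that keying on the chunk text means two identical chunks would overwrite each other, so a real implementation may need to account for duplicates.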

In [4]:
from utils.content_page_parser import generate_inverted_tree

In [5]:
# generate inverted tree
has_content_page = True
generate_inverted_tree(chunk_pageNum_pairs_path, 
                       has_content_page, 
                       save_inverted_tree_path,
                       tree_path)

Use [../notebooks/preprocessing.ipynb](../notebooks/preprocessing.ipynb) to understand how metadata for chunks can be retrieved.