In [1]:
import sys
import os

# Add the parent directory (Auditbot_backend) to the system path
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))

# Using RAG to Build a Custom ChatBot
## 2. Database Setup

> **Notice:**  
> Before starting this tutorial series, read up on the RAG pipeline.

This tutorial series assumes a working understanding of RAG. It walks through the implementation of an advanced, customized RAG pipeline and explains the micro-decisions made along the way.

> **Data Corpus:** 
> This tutorial uses [AGO yearly audit reports](https://www.ago.gov.sg/publications/annual-reports/) as an example. However, this repo's code is applicable to most PDF documents. Code examples for other documents (such as the National Day Rally) will be referenced later.

### Step 1: Retrieve all required data structures

In the previous tutorial, we generated chunks and stored them as JSON files. For production, however, databases are required. The data therefore has to be retrieved from these JSON files and transferred to the appropriate databases.

In [2]:
# custom helper functions
from utils.json_parser import json_file_to_dict

# constants
from utils.initialisations import save_inverted_tree_path, s_p_pairs_path

In [3]:
# Chunk into sentences ('s') or paragraphs ('p') or fixed-size strings ('f')
chunking='s' 

# Group smaller chunks into a bigger chunk
grouping=1

# RUN ONCE
# retrieve all required data structures

# load tree
inverted_tree = json_file_to_dict(save_inverted_tree_path)

# load chunks from tree's keys
chunks = list(inverted_tree.keys())
print("Number of unique chunks:", len(chunks))

# load sentence paragraph pairs. 
if (chunking == 's' or chunking == 'f') and grouping == 1:
    print("s_p_pairs will be filled")
    s_p_pairs = json_file_to_dict(s_p_pairs_path)
else:
    s_p_pairs = {}

Number of unique chunks: 8210
s_p_pairs will be filled
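
As a rough illustration of the loading step above, here is a minimal sketch of what a helper like `json_file_to_dict` might do, assuming it is a thin wrapper around `json.load`. The function name `load_json_dict` and the toy file contents are hypothetical; the real helper in `utils/json_parser.py` may behave differently.

```python
import json
import tempfile
from pathlib import Path


def load_json_dict(path):
    """Hypothetical stand-in for json_file_to_dict: read a JSON object into a dict."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, dict):
        raise ValueError(f"expected a JSON object in {path}")
    return data


# Toy inverted tree: chunk text -> metadata (made-up values for illustration)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "inverted_tree.json"
    demo_path.write_text(
        json.dumps({"chunk A": {"page": 1}, "chunk B": {"page": 2}}),
        encoding="utf-8",
    )
    demo_tree = load_json_dict(demo_path)

# The chunks are the keys of the tree, so they are unique by construction
demo_chunks = list(demo_tree.keys())
print("Number of unique chunks:", len(demo_chunks))
```

Because the chunk text is used as the dictionary key, duplicate chunks collapse into a single entry, which is why the cell above reports the number of *unique* chunks.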


### Step 2: Fill up vector datastore

I used Chroma as the go-to vector store because it is free, easy to set up, and built on top of SQLite. It also provides various bi-encoders to generate vector embeddings.

In [4]:
# chromadb library
import chromadb
from chromadb.utils import embedding_functions
from chromadb.config import Settings

# custom helper functions
from utils.db_utils import (chroma_get_or_create_collection, 
                            chroma_fill_db,
                            chroma_preprocess_metadata)

# constants
from utils.initialisations import OPENAI_API_KEY

In [6]:
# vector store ---------------------------------------------------------------

# add to the database in batches
batch_size = 1000

# prepare metadata for chromadb
pre_metadata = list(inverted_tree.values())
metadata = chroma_preprocess_metadata(pre_metadata)

# RUN ONCE
# set up vector database for dense embedding search

# chromadb supported model
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
                api_key=OPENAI_API_KEY,
                model_name="text-embedding-3-small"
            )

# create db
# When creating the database for the first time, make sure it is
# reset (all collections are erased)
client_dense = chromadb.PersistentClient(path="../data/db", 
                                   settings = Settings(allow_reset=True))

# chromadb's embedding function needs streaming in batches
# Basically, chunks are added in batches
collection = chroma_get_or_create_collection(client_dense, 
                                             name = "audit", 
                                             embedding_function = openai_ef, 
                                             reset = True)

# fill db
chroma_fill_db(collection, chunks, metadata, batch_size)
print("number of embeddings in database:",collection.count())

number of embeddings in database: 8210
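
To make the batching idea concrete, here is a minimal sketch of how helpers like `chroma_preprocess_metadata` and `chroma_fill_db` might work. It assumes that Chroma wants flat metadata with scalar values and that `collection.add` is called once per batch; the names `flatten_metadata`, `fill_in_batches`, the `chunk-{i}` id scheme, and `FakeCollection` (a stand-in so the sketch runs without chromadb or an API key) are all hypothetical and not taken from `utils/db_utils.py`.

```python
import json


def flatten_metadata(meta):
    """Hypothetical: keep scalar values, serialise anything else to a JSON string."""
    flat = {}
    for key, value in meta.items():
        if isinstance(value, (str, int, float, bool)):
            flat[key] = value
        else:
            flat[key] = json.dumps(value)
    return flat


def fill_in_batches(collection, chunks, metadata, batch_size):
    """Add documents to a collection one batch at a time."""
    for start in range(0, len(chunks), batch_size):
        end = start + batch_size
        collection.add(
            documents=chunks[start:end],
            metadatas=[flatten_metadata(m) for m in metadata[start:end]],
            ids=[f"chunk-{i}" for i in range(start, min(end, len(chunks)))],
        )


class FakeCollection:
    """Stand-in for a Chroma collection; only records what it receives."""

    def __init__(self):
        self.batch_sizes = []
        self.metadatas = []

    def add(self, documents, metadatas, ids):
        assert len(documents) == len(metadatas) == len(ids)
        self.batch_sizes.append(len(documents))
        self.metadatas.extend(metadatas)

    def count(self):
        return sum(self.batch_sizes)


demo_collection = FakeCollection()
demo_chunks = [f"sentence {i}" for i in range(5)]
demo_metadata = [{"page": i, "headers": ["Part I", "Section A"]} for i in range(5)]

fill_in_batches(demo_collection, demo_chunks, demo_metadata, batch_size=2)
print(demo_collection.batch_sizes, demo_collection.count())
```

With a real Chroma collection, each `add` call also triggers the embedding function for that batch, so the batch size bounds how much text is sent to the embedding API per request.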


If you would like to quickly test ChromaDB, a mock-up RAG pipeline has been written in ["../notebooks/chroma_db.ipynb"](../notebooks/chroma_db.ipynb). It requires no setup other than chromadb.

That notebook also includes an alternative to the embedding functions provided by Chroma. This alternative (by LangChain) avoids streaming into chromadb in batches. If adding data in batches is not possible for your use case, use it instead.

### Step 3: Create index for sparse retrieval

Elasticsearch provides a wide range of sparse retrieval methods, as well as indexing to speed up sparse retrieval. Follow the [Elasticsearch tutorial](https://www.elastic.co/guide/en/elasticsearch/reference/current/run-elasticsearch-locally.html) to create a container. This notebook uses a local development setup for the container, but the Python code remains the same if a production setup is used.

Bash script to start a container:
```bash
export ELASTIC_PASSWORD="<ES_PASSWORD>"  # password for "elastic" username

# create the Docker network once (skip if it already exists)
docker network create elastic-net

docker run -p 127.0.0.1:9200:9200 -d --name elasticsearch --network elastic-net \
  -e ELASTIC_PASSWORD="$ELASTIC_PASSWORD" \
  -e "discovery.type=single-node" \
  -e "xpack.security.http.ssl.enabled=false" \
  -e "xpack.license.self_generated.type=trial" \
  docker.elastic.co/elasticsearch/elasticsearch:8.14.3
```

In [7]:
# elastic search library
from elasticsearch import Elasticsearch

# custom helper functions
from utils.db_utils import index_elastic_db

# constants
from utils.initialisations import LOCAL_HOST_URL, HTTP_AUTH, index_name

In [8]:
# RUN ONCE
# connect to the Elasticsearch cluster from python elasticsearch client
client_sparce = Elasticsearch(
    LOCAL_HOST_URL,
    basic_auth=HTTP_AUTH
)
# check that the client is connected to the docker container
# (credentials were already supplied via basic_auth above)
print(client_sparce.info())

# index chunks using elasticsearch (saved in docker)
index_elastic_db(client_sparce, index_name, HTTP_AUTH, chunks, reset = True)

{'name': '50cd72118574', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'l-puMSQDTvqL7nQDfYreOg', 'version': {'number': '8.14.3', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': 'd55f984299e0e88dee72ebd8255f7ff130859ad0', 'build_date': '2024-07-07T22:04:49.882652950Z', 'build_snapshot': False, 'lucene_version': '9.10.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'}
reset index: chromadb_documents
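
For readers curious about the indexing step, here is a minimal sketch of what a helper like `index_elastic_db` might do, assuming it optionally resets the index and then bulk-indexes one document per chunk. The function names `build_actions` and `index_chunks`, the `text` field name, and the use of the list position as the document id are assumptions for illustration; the real helper in `utils/db_utils.py` may differ.

```python
def build_actions(index_name, chunks):
    """Yield one bulk-index action per chunk (field name 'text' is an assumption)."""
    for i, chunk in enumerate(chunks):
        yield {
            "_index": index_name,
            "_id": str(i),
            "_source": {"text": chunk},
        }


def index_chunks(client, index_name, chunks, reset=False):
    """Hypothetical sketch: optionally drop the index, then bulk-index all chunks."""
    # Imported here so build_actions can be tried without the elasticsearch package
    from elasticsearch import helpers

    if reset and client.indices.exists(index=index_name):
        client.indices.delete(index=index_name)
    if not client.indices.exists(index=index_name):
        client.indices.create(index=index_name)

    # helpers.bulk returns (number of successes, list of errors)
    return helpers.bulk(client, build_actions(index_name, chunks))


# Inspect the actions without needing a running cluster
demo_actions = list(build_actions("demo_index", ["first chunk", "second chunk"]))
print(demo_actions[0])
```

Sending the chunks through a bulk request rather than one `index` call per chunk keeps the number of round trips to the container small, which matters with several thousand chunks.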


["../notebooks/elasticsearch.ipynb"](../notebooks/elasticsearch.ipynb) offers code to look into the index and explore its functionality.
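
As a quick way to try sparse retrieval against the index, here is a minimal sketch of a full-text `match` query, which Elasticsearch scores with BM25 by default. It assumes the chunks were indexed under a field called `text`; that field name, the function names `build_match_query` and `sparse_search`, and the example question are hypothetical and should be adapted to how the index was actually built.

```python
def build_match_query(question, field="text"):
    """Build a full-text match query on the (assumed) text field."""
    return {"match": {field: {"query": question}}}


def sparse_search(client, index_name, question, k=5):
    """Return the sources of the top-k hits for a question."""
    response = client.search(
        index=index_name,
        query=build_match_query(question),
        size=k,
    )
    return [hit["_source"] for hit in response["hits"]["hits"]]


# The query body can be inspected without a running cluster
demo_query = build_match_query("procurement lapses")
print(demo_query)
```

With the container from this step running, a call such as `sparse_search(client_sparce, index_name, "procurement lapses")` would return the best-matching chunks, assuming the field name matches the one used at indexing time.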