### Prepare vector database for RAG

Build a RAG pipeline over the Turing College [knowledge base Confluence pages](https://turingcollege.atlassian.net/wiki/spaces/DLG/overview) \
so that learners can ask this chatbot basic questions about \
learning at TC.

In [None]:
import os

import dotenv

dotenv.load_dotenv()
UPSTASH_TC_HYBRID_CHAT_TOKEN = os.getenv("UPSTASH_TC_HYBRID_CHAT_TOKEN")
UPSTASH_TC_HYBRID_INDEX_ENDPOINT = os.getenv("UPSTASH_TC_HYBRID_INDEX_ENDPOINT")

UPSTASH_TC_CHAT_DENSE_TOKEN = os.getenv("UPSTASH_TC_CHAT_DENSE_TOKEN")
UPSTASH_TC_CHAT_DENSE_ENDPOINT = os.getenv("UPSTASH_TC_CHAT_DENSE_ENDPOINT")

UPSTASH_TC_CHAT_DENSE_1024_TOKEN = os.getenv("UPSTASH_TC_CHAT_DENSE_1024_TOKEN")
UPSTASH_TC_CHAT_DENSE_1024_ENDPOINT = os.getenv("UPSTASH_TC_CHAT_DENSE_1024_ENDPOINT")
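
`os.getenv` returns `None` for a variable that is not set, so a missing `.env` entry only surfaces later, when the index client is created. A minimal sketch of an early check is shown below; the helper name `missing_env_vars` is hypothetical and not part of the original notebook.

```python
import os


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Names taken from the cell above; the check itself is an illustrative addition.
REQUIRED_VARS = [
    "UPSTASH_TC_HYBRID_CHAT_TOKEN",
    "UPSTASH_TC_HYBRID_INDEX_ENDPOINT",
]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```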

#### Scraping all links on the [Overview Page](https://turingcollege.atlassian.net/wiki/spaces/DLG/overview)

Use BeautifulSoup to scrape the following tags from the content area of each page:
- `h1`: title of the page
- `h3`: title of a passage
- `p`: a paragraph of body text


In [None]:
import requests
from bs4 import BeautifulSoup


url = "https://turingcollege.atlassian.net/wiki/spaces/DLG/overview"

response = requests.get(url)
soup2 = BeautifulSoup(response.text, "html.parser")

pages = []

for tag in soup2.find_all("a"):
    title = tag.text
    link = tag.get("href")

    if link and link.startswith("https") and "DLG" in link:
        pages.append({"title": title, "link": link})

#### Scrape content in the links and chunk the content using RecursiveCharacterTextSplitter

__Chunking delimiters__
1. Hard limit: no overlap
    - Page
    - Passage in page (`h3` tag)

2. Soft limit: with overlap
    - `. ` (end of sentence)
    - `\n`
    - `' '` (space)

__Chunk size__
- 800 characters, with a 200-character overlap (`chunk_overlap=200`)
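
The priority order of the delimiters above can be illustrated with a simplified pure-Python sketch. This is not LangChain's implementation: the hypothetical `split_by_priority` function drops the separators, does not merge small pieces back together, and applies no overlap. It only shows how a coarser delimiter is tried before a finer one.

```python
def split_by_priority(text, separators, chunk_size):
    """Split on the first separator found in `text`, recursing with the
    remaining separators into any piece still longer than `chunk_size`."""
    text = text.strip()
    if len(text) <= chunk_size or not separators:
        # Short enough, or nothing left to split on (piece may stay oversized).
        return [text] if text else []
    head, *rest = separators
    if head not in text:
        return split_by_priority(text, rest, chunk_size)
    chunks = []
    for piece in text.split(head):
        chunks.extend(split_by_priority(piece, rest, chunk_size))
    return chunks


separators = ["<article_end_tag>", "<h3_tag>", ". ", "\n", " "]
sample = (
    "<h3_tag> Intro one. Intro two. "
    "<h3_tag> Rules one. Rules two. <article_end_tag> "
)
print(split_by_priority(sample, separators, chunk_size=30))
```

With `chunk_size=30` the sample splits only at the `h3` markers; lowering the limit to 15 forces a further split at sentence ends.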

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

titles_ls = [page["title"] for page in pages]
links_ls = [page["link"] for page in pages]

articles_chunks_dict_ls = []
# loop through the links_ls
for article_id in range(len(links_ls)):
    response = requests.get(links_ls[article_id])
    soup = BeautifulSoup(response.text, "html.parser")
    # NOTE: content_div is not used below; tags are collected from the whole page
    content_div = soup.find("div", class_="ak-renderer-document")
    accumulated_article_str = ""  # reset the article string

    # preserve article subheadings structure
    for tag in soup.find_all(["h1", "h3", "p"]):
        if tag.name == "h3":
            tag_name = "<h3_tag> "
        elif tag.name == "h1":
            tag_name = "<h1_tag> "
        else:
            tag_name = ""
        text = tag_name + tag.text + " "
        accumulated_article_str += text
    # add article break tag
    accumulated_article_str += "<article_end_tag> "

    # cleaning up the joined texts
    accumulated_article_str = (
        accumulated_article_str.replace("\xa0", "")
        .replace("_______________\nTuring College", "")
        .replace("<h3_tag> Analytics ", "<h3_tag> ")
    )

    # split the joined text into chunks
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=800,
        chunk_overlap=200,
        separators=["<article_end_tag>", "<h3_tag>", ". ", "\n", " "],
    )

    splitted_article = splitter.split_text(accumulated_article_str)

    # create a dictionary for each chunk
    for chunk_id in range(len(splitted_article)):
        if len(splitted_article[chunk_id]) < 30:  # skip near-empty chunks (under 30 characters)
            continue
        articles_chunks_dict_ls.append(
            {
                "id": f"article_{article_id}_chunk_{chunk_id}",
                "data": splitted_article[chunk_id],
                "metadata": {
                    "title": titles_ls[article_id],
                    "link": links_ls[article_id],
                    "text": splitted_article[chunk_id],
                },
            }
        )

#### Upsert chunks to Upstash

__Embedding model__

- SentenceTransformer all-MiniLM-L6-v2
- dimension: 384

In [None]:
from upstash_vector import Index, Vector

# upsert
index = Index(url=UPSTASH_TC_HYBRID_INDEX_ENDPOINT, token=UPSTASH_TC_HYBRID_CHAT_TOKEN)

for i in articles_chunks_dict_ls:
    index.upsert(vectors=[Vector(id=i["id"], data=i["data"], metadata=i["metadata"])])
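
The loop above sends one request per chunk. Since `upsert` takes a list of vectors, the chunks could instead be sent in groups. The sketch below shows only the grouping step; the helper name `batched` and the batch size of 100 are assumptions for illustration, not values from the original notebook.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical usage with the objects defined in the cells above:
# for batch in batched(articles_chunks_dict_ls, 100):
#     index.upsert(
#         vectors=[
#             Vector(id=c["id"], data=c["data"], metadata=c["metadata"])
#             for c in batch
#         ]
#     )
```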

__Retrieval__

Because these are internal documents, they contain organisation-specific terms \
that are out-of-vocabulary (OOV) for general-purpose embedding models. We therefore combine \
dense-embedding approximate nearest neighbour (ANN) search with sparse-embedding keyword search.

- Dense embedding cosine similarity search
- Sparse embedding BM25
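
The notebook does not state how the dense and sparse result lists are combined inside the hybrid index. One common option is reciprocal rank fusion (RRF), sketched below purely as an illustration. The function name, the document ids, and the constant `k=60` are assumptions, not taken from the source.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids: each id scores the sum of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["a", "b", "c"]   # hypothetical cosine-similarity ranking
sparse_hits = ["c", "a", "d"]  # hypothetical BM25 ranking
print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
```

Documents that rank well in both lists (`a` and `c` here) move ahead of documents found by only one of the two searches.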

In [None]:
def retrieve_ref(query_str, top_k=5):
    """Return the metadata of the `top_k` chunks most similar to `query_str`."""
    index = Index(
        url=UPSTASH_TC_HYBRID_INDEX_ENDPOINT, token=UPSTASH_TC_HYBRID_CHAT_TOKEN
    )

    ref_ls = index.query(
        data=query_str,
        top_k=top_k,
        include_metadata=True,
    )
    metadata_ls = [ref.metadata for ref in ref_ls]
    return metadata_ls
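
Each metadata entry returned by `retrieve_ref` carries the `title`, `link`, and `text` keys stored at upsert time. A minimal sketch of turning that list into a context string for the chatbot prompt might look as follows; the helper name `format_context` and the output layout are hypothetical.

```python
def format_context(metadata_ls):
    """Join retrieved chunks into one numbered context string with their sources."""
    blocks = []
    for number, meta in enumerate(metadata_ls, start=1):
        blocks.append(f"[{number}] {meta['title']} ({meta['link']})\n{meta['text']}")
    return "\n\n".join(blocks)


# Made-up entries in the same shape as the stored metadata.
example_refs = [
    {"title": "T1", "link": "https://example.com/1", "text": "alpha"},
    {"title": "T2", "link": "https://example.com/2", "text": "beta"},
]
print(format_context(example_refs))
```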