# Preparing Text Data for RAG-Based Chat with Azure SDK Application

In this step-by-step guide, we will explore an example and discuss essential considerations when preparing text data for a retrieval-augmented chatbot.

## Required Libraries

This notebook needs a few Python libraries, which we install with `pip`:

In [None]:
!python -m pip install -qU langchain tiktoken tqdm beautifulsoup4

## Preparing Data

In this example, we will download the Azure SDK docs from [Azure SDK](https://azure.github.io/azure-sdk/general_introduction.html). We get all `.html` files located on the site like so:

In [None]:
!wsl wget --recursive -A.html -P docs https://azure.github.io/azure-sdk/general_introduction.html

Output: FINISHED --2023-09-20 10:38:51--
Total wall clock time: 4m 49s
Downloaded: 1129 files, 57M in 12s (4.55 MB/s)

This downloads all HTML into the `docs` directory. Now we can use LangChain itself to process these docs. We do this using the [ReadTheDocsLoader](https://python.langchain.com/docs/integrations/document_loaders/readthedocs_documentation) like so:

In [None]:
from langchain.document_loaders import ReadTheDocsLoader

loader = ReadTheDocsLoader("docs", encoding="utf-8", features="html.parser")
docs = loader.load()
len(docs)

The loader loops over all files under the given path and extracts the content of each file by retrieving its main HTML tag. The default main tags include `<main id="main-content">`, `<div role="main">`, and `<article role="main">`. If you need a different HTML tag, you can supply it through the `custom_html_tag` parameter, for example `custom_html_tag=('p', {})`.

In my case, the HTML files do not contain a specific main tag, and I need text from various HTML tags across all the files. To keep it simple, I extract the text from all HTML tags, as shown below:

In [None]:
import os
from langchain.docstore.document import Document
from typing import List
from bs4 import BeautifulSoup

def parse_html_content(path):
    docs: List[Document] = []

    for root, _, files in os.walk(path):
        for file_name in files:
            if file_name.endswith(".html"):
                file_path = os.path.join(root, file_name)

                with open(file_path, "r", encoding="utf-8") as file:
                    html_content = file.read()

                soup = BeautifulSoup(html_content, "html.parser")

                # Get the whole text content without modifications
                text = soup.get_text()

                # Collapse each group of three consecutive newlines into a single newline
                extracted_text = "\n".join([t for t in text.split("\n\n\n") if t])

                metadata = {"source": str(file_path)}
                docs.append(Document(page_content=extracted_text, metadata=metadata))

    return docs

docs = parse_html_content("docs")
len(docs)
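One detail about the whitespace cleanup above: splitting on `"\n\n\n"` only acts on groups of exactly three newlines, so longer or shorter runs are handled unevenly. If more uniform spacing is wanted, a regular expression is one alternative. The following is a minimal sketch under that assumption; the helper name `collapse_blank_lines` is hypothetical and not part of the original notebook.

```python
import re

def collapse_blank_lines(text: str) -> str:
    """Reduce any run of three or more newlines to exactly two (one blank line)."""
    return re.sub(r"\n{3,}", "\n\n", text)

# Made-up sample text for illustration only.
sample = "Title\n\n\n\n\nFirst paragraph.\n\nSecond paragraph."
print(repr(collapse_blank_lines(sample)))
```

Switching to this helper would change the extracted text slightly, and with it the token and chunk counts, so the rest of this guide keeps using the output of `parse_html_content` as written.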

This leaves us with `1089` processed doc pages. Let's take a look at the format each one contains:

In [None]:
docs[1]

We access the plaintext page content like so:

In [None]:
print(docs[1].page_content)

In [None]:
print(docs[5].page_content)

We can also find the source of each document:

In [None]:
docs[150].metadata['source'].replace('docs\\', 'https://')
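The `replace` call above only rewrites the leading `docs\` prefix, so it assumes Windows-style paths, and any remaining backslashes stay in the resulting URL. A more portable variant can be sketched with `pathlib`. The helper name `path_to_url` and the assumption that files live under `docs/<host>/<page>` (the layout `wget --recursive -P docs` produces) are illustrative, not part of the original notebook.

```python
from pathlib import PureWindowsPath

def path_to_url(file_path: str, root: str = "docs") -> str:
    """Map a local path such as docs/<host>/<page> to https://<host>/<page>."""
    # PureWindowsPath accepts both "\\" and "/" as separators on any OS.
    parts = PureWindowsPath(file_path).parts
    # Drop the download directory so the host name comes first.
    if parts and parts[0] == root:
        parts = parts[1:]
    return "https://" + "/".join(parts)

print(path_to_url("docs\\azure.github.io\\azure-sdk\\general_introduction.html"))
print(path_to_url("docs/azure.github.io/azure-sdk/general_introduction.html"))
```

Both calls print the same forward-slash URL, whichever operating system produced the path.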

The pages look good. We also need to consider the length of each page with respect to the number of tokens that will reasonably fit within the context window of the latest LLMs.

You can explore an interactive example by visiting https://platform.openai.com/tokenizer to get a basic understanding of how tokens are created. For instance, you can input a sentence like "Hi, how are you today? I am a Chiropractor." to see how common words are represented by a single token and less common words are divided into multiple tokens. The `tiktoken` library from OpenAI handles this automatically in Python.

We will use `gpt-4` as an example. To count the number of tokens that `gpt-4` will use for some text, we need to initialize the `tiktoken` tokenizer.

In [None]:
import tiktoken

tokenizer = tiktoken.get_encoding('cl100k_base')

# create the length function
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

In [None]:
tiktoken_len("I am a Chiropractor")

Output: I am a Chi rop ractor = Token Length 6


Note that we defined the tokenizer's encoder as `"cl100k_base"`. This is the specific tiktoken encoder used by `gpt-4`. Other encoders exist. At the time of writing, the OpenAI-specific tokenizers (using `tiktoken`) are summarized as:

| Encoder | Models |
| --- | --- |
| `cl100k_base` | `gpt-4`, `gpt-3.5-turbo`, `text-embedding-ada-002` |
| `p50k_base` | `text-davinci-003`, `code-davinci-002`, `code-cushman-002` |
| `r50k_base` | `text-davinci-001`, `davinci`, `text-similarity-davinci-001` |
| `gpt2` | `gpt2` |

You can find these details in the [Tiktoken `model.py` script](https://github.com/openai/tiktoken/blob/main/tiktoken/model.py), or using `tiktoken.encoding_for_model`:

In [None]:
tiktoken.encoding_for_model('gpt-4')

Using the `tiktoken_len` function, let's count and visualize the number of tokens across our webpages.

In [None]:
token_counts = [tiktoken_len(doc.page_content) for doc in docs]

Let's see `min`, average, and `max` values:

In [None]:
print(f"""Min: {min(token_counts)}
Avg: {int(sum(token_counts) / len(token_counts))}
Max: {max(token_counts)}""")
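Min, average, and max can hide how the lengths are distributed. A complementary view is the median plus the number of pages that exceed some size limit. The sketch below is illustrative only: the helper name `summarize_token_counts`, the example counts, and the limit of 1000 are all assumptions, and in the notebook you would pass the real `token_counts` list instead.

```python
import statistics

def summarize_token_counts(counts, limit):
    """Return the median and how many pages are longer than `limit` tokens."""
    over = sum(1 for c in counts if c > limit)
    return {
        "median": statistics.median(counts),
        "over_limit": over,
        "share_over_limit": over / len(counts),
    }

# Hypothetical counts for illustration; replace with `token_counts`.
example_counts = [120, 450, 800, 1500, 9000]
print(summarize_token_counts(example_counts, limit=1000))
```

A large share of pages over the limit is a sign that most of the corpus will need to be split before it can be used in a prompt.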

### Chunking the Text

At the time of writing, `gpt-4` supports a context window of 8192 tokens. That means input tokens plus generated (completion) output tokens cannot total more than 8192 without hitting an error.

We must therefore stay below this limit. If we assume a very safe margin of ~4000 tokens for the input prompt into `gpt-4`, that leaves ~4000 tokens for conversation history and response completion.

With this ~4000 token limit we may want to include *five* documents of relevant information, meaning each document can be no more than **800** tokens long.

![Alt text](Chunks-1.jpg)
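The budget arithmetic can be written out explicitly. This is only a sketch of the numbers given above; the constant names are chosen here for illustration.

```python
CONTEXT_WINDOW = 8192     # gpt-4 context window, as stated above
RETRIEVED_BUDGET = 4000   # approximate tokens reserved for retrieved documents
NUM_DOCUMENTS = 5         # documents we want to fit into one prompt

# Maximum size of each retrieved document (chunk), in tokens.
chunk_size = RETRIEVED_BUDGET // NUM_DOCUMENTS

# Tokens left over for conversation history and the completion.
remaining = CONTEXT_WINDOW - RETRIEVED_BUDGET

print(chunk_size)
print(remaining)
```

Changing `NUM_DOCUMENTS` or `RETRIEVED_BUDGET` shows the trade-off directly: more retrieved documents means smaller chunks, and a larger retrieval budget leaves less room for history and the response.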

To create these documents we use the `RecursiveCharacterTextSplitter` from LangChain. To measure the length of documents, we also need a *length function*: a function that consumes text, counts the number of tokens within it (after tokenization using the `gpt-4` tokenizer), and returns that number. We already defined one above as `tiktoken_len`.

With the length function defined we can initialize our `RecursiveCharacterTextSplitter` object like so:

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=50,  # number of tokens overlap between chunks
    length_function=tiktoken_len,
    separators=['\n\n', '\n', ' ', '']
)
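To build some intuition for what the `separators` list does, here is a simplified, dependency-free sketch of recursive splitting. It is not LangChain's implementation: it ignores `chunk_overlap`, measures length in characters via `len` by default, and the function name `recursive_split` is hypothetical. The idea is to split on the coarsest separator first, pack pieces together up to the size limit, and fall back to the next separator only for pieces that are still too long.

```python
def recursive_split(text, chunk_size, separators, length=len):
    """Simplified recursive splitter (no overlap), for illustration only."""
    if length(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        return [text]  # nothing left to split on; return the oversized text

    sep, rest = separators[0], separators[1:]
    # An empty separator means "split into individual characters".
    pieces = text.split(sep) if sep else list(text)

    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if length(candidate) <= chunk_size:
            current = candidate  # piece still fits in the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if length(piece) <= chunk_size:
            current = piece  # start a new chunk with this piece
        else:
            # Piece is too long on its own: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest, length))
    if current:
        chunks.append(current)
    return chunks

demo = "aaaa bbbb\n\ncccc dddd eeee"
print(recursive_split(demo, 10, ["\n\n", "\n", " ", ""]))
```

In this demo the first paragraph fits as one chunk, while the second is too long and gets split again on spaces. Passing `length=tiktoken_len` would measure the limit in tokens instead of characters, which is what the splitter configured above does.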

Then we split the text for a document like so:

In [None]:
tiktoken_len(docs[1].page_content)

In [None]:
chunks = text_splitter.split_text(docs[1].page_content)
len(chunks)

In [None]:
tiktoken_len(chunks[0]), tiktoken_len(chunks[1])

For `docs[1]` we created `2` chunks of token length `472` and `770`.

This is for a single document; we need to do this over all of our documents. While we iterate through the docs to create these chunks, we will reformat them into a structure that looks like:

```json
[
    {
        "id": "abc-0",
        "text": "some important document text",
        "source": "https://azure.github.io/azure-sdk/typescript_implementation.html"
    },
    {
        "id": "abc-1",
        "text": "the next chunk of important document text",
        "source": "https://azure.github.io/azure-sdk/typescript_implementation.html"
    }
    ...
]
```

The `"id"` will be created based on the URL of the text plus its chunk number.

In [None]:
import hashlib
m = hashlib.md5()  # this will convert URL into unique ID

url = docs[5].metadata['source'].replace('docs\\', 'https://')
print(url)

# convert URL to unique ID
m.update(url.encode('utf-8'))
uid = m.hexdigest()[:12]
print(uid)
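One detail of `hashlib` is worth keeping in mind when generating IDs for many URLs: `update()` is cumulative, so a hash object that is reused across URLs produces a digest of everything fed into it so far, not of the latest URL alone. The sketch below illustrates the difference; the helper name `url_to_uid` is hypothetical, and the two URLs are used purely as examples.

```python
import hashlib

def url_to_uid(url: str, length: int = 12) -> str:
    """Hash one URL with a fresh MD5 object so the ID depends only on that URL."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()[:length]

url_a = "https://azure.github.io/azure-sdk/general_introduction.html"
url_b = "https://azure.github.io/azure-sdk/typescript_implementation.html"

# A fresh hash per URL is reproducible regardless of processing order.
print(url_to_uid(url_a) == url_to_uid(url_a))

# A shared object accumulates every update() call: after two updates its
# digest equals the hash of url_a + url_b concatenated.
shared = hashlib.md5()
shared.update(url_a.encode("utf-8"))
shared.update(url_b.encode("utf-8"))
combined = hashlib.md5((url_a + url_b).encode("utf-8")).hexdigest()
print(shared.hexdigest() == combined)
```

Creating a new hash object per URL keeps each ID stable, so re-running the pipeline on a subset of pages, or in a different order, yields the same IDs.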

Then use the `uid` alongside chunk number and actual `url` to create the format needed:

In [None]:
data = [
    {
        'id': f'{uid}-{i}',
        'text': chunk,
        'source': url
    } for i, chunk in enumerate(chunks)
]
data

Now we repeat the same logic across our full dataset:

In [None]:
from tqdm.auto import tqdm

documents = []

for doc in tqdm(docs):
    url = doc.metadata['source'].replace('docs\\', 'https://')
    # hash each URL with a fresh MD5 object; a shared object would accumulate all earlier URLs
    uid = hashlib.md5(url.encode('utf-8')).hexdigest()[:12]
    chunks = text_splitter.split_text(doc.page_content)
    for i, chunk in enumerate(chunks):
        documents.append({
            'id': f'{uid}-{i}',
            'text': chunk,
            'source': url
        })

len(documents)

We're now left with `4057` documents. We can save them to a JSON lines (`.jsonl`) file like so:

In [None]:
import json

with open('AzureSDKDocuments.jsonl', 'w') as f:
    for doc in documents:
        f.write(json.dumps(doc) + '\n')

To load the data from file we'd write:

In [None]:
documents = []

with open('AzureSDKDocuments.jsonl', 'r') as f:
    for line in f:
        documents.append(json.loads(line))

len(documents)

In [None]:
documents[0]
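After loading, a quick sanity check can confirm that every record has the expected fields and that no `id` appears twice. This is a minimal sketch rather than part of the original notebook: the helper name `validate_records` and the sample records are hypothetical, and in practice you would pass the loaded `documents` list.

```python
from collections import Counter

REQUIRED_KEYS = {"id", "text", "source"}

def validate_records(records):
    """Return (IDs that occur more than once, indexes of records missing a key)."""
    id_counts = Counter(r.get("id") for r in records)
    duplicate_ids = [rid for rid, n in id_counts.items() if n > 1]
    incomplete = [
        idx for idx, r in enumerate(records)
        if not REQUIRED_KEYS.issubset(r)
    ]
    return duplicate_ids, incomplete

# Made-up records for illustration; replace with `documents`.
source = "https://azure.github.io/azure-sdk/typescript_implementation.html"
sample_records = [
    {"id": "abc-0", "text": "first chunk", "source": source},
    {"id": "abc-0", "text": "second chunk", "source": source},
    {"id": "abc-1", "text": "third chunk"},
]
print(validate_records(sample_records))
```

Two empty lists mean the file is consistent with the format described earlier.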