## User Guide on HTMLCleanser and HTMLSplitter
:: This guide introduces how to use the HTMLCleanser and HTMLSplitter.

- It introduces the `HTMLCleanser`, which preprocesses HTML (e.g. cleanses out invalid HTML tags and attributes).
- It introduces the `HTMLSplitter`, which splits documents before they are indexed with an IR model for RAG.
- It assumes that your document is in HTML format.



:: Main modules: [cleanser.py, splitter.py]

- The system pre-processes files in the html format. If your document is in another format, please convert it to an HTML file beforehand.

- The `HTMLCleanser` in cleanser.py removes invalid tags and attributes within the BeautifulSoup object.
- The `HTMLSplitter` in splitter.py divides the file into several documents without altering the HTML contents.
    - When developing an AI model or using an LLM, the token_max limit can make it difficult to feed an entire document into the model.
    - Consequently, you may have been splitting documents to stay under token_max, using strategies such as doc_stride.
    - If you simply truncate the document based on token length, the content may be cut off mid-structure, resulting in a loss of context.
    - The `HTMLSplitter` takes into account the token_max when splitting the file, ensuring that the HTML file is divided without compromising context.

- [as-is] The contents (e.g., tables) inside the document are truncated without regard to context because of the maximum token limit imposed by the AI model.

- [to-be] The contents (e.g., tables) inside the document will be split without losing context and structure, with a guarantee that the split chunks do not exceed the maximum tokens.

    First, split the target file into a set of documents before indexing, taking into account the maximum token limit accepted by the model.

    Second, feed the split documents into the model as usual.

- For detailed usage instructions for the modules, see the `guide.ipynb` file in the html directory.
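
The as-is / to-be contrast above can be sketched as follows. This is a minimal illustration, not the `HTMLSplitter` implementation: `truncate` and `split_table_by_rows` are hypothetical helpers, the rows are matched with a simple regular expression that assumes attribute-free `<tr>` tags, and length is measured in characters with `len`.

```python
import re


def truncate(html: str, token_max: int) -> list[str]:
    """Naive baseline: cut every token_max characters, ignoring tags."""
    return [html[i:i + token_max] for i in range(0, len(html), token_max)]


def split_table_by_rows(table_html: str, token_max: int, length_func=len) -> list[str]:
    """Greedily pack whole <tr> rows into chunks, each re-wrapped in <table>."""
    rows = re.findall(r"<tr>.*?</tr>", table_html, flags=re.DOTALL)
    wrap = "<table>{}</table>"
    chunks, current = [], []
    for row in rows:
        candidate = wrap.format("".join(current + [row]))
        if current and length_func(candidate) > token_max:
            # Adding this row would exceed the budget: close the current chunk.
            chunks.append(wrap.format("".join(current)))
            current = [row]
        else:
            current.append(row)
    if current:
        chunks.append(wrap.format("".join(current)))
    return chunks


# Hypothetical six-row table used only for this illustration.
table = "<table>" + "".join(
    f"<tr><td>Data {i}</td><td>Data {i + 1}</td></tr>" for i in range(1, 12, 2)
) + "</table>"

pieces = truncate(table, token_max=100)             # may cut through a row
chunks = split_table_by_rows(table, token_max=100)  # cuts only between rows

for chunk in chunks:
    print(len(chunk), chunk)
```

A row that alone exceeds `token_max` is still emitted as its own oversized chunk in this sketch, so the size limit holds only when every single row fits.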

[TODO]
- Add more functionality to `HTMLSplitter._split_chunk`
- Decide how to handle the context window

```
- Writer: Eungi Cho
- Last update: 23.11.21
```

In [1]:
from bs4 import BeautifulSoup
from splitter.cleanser import HTMLCleanser

# Suppose you have an HTML file like the one below:
txt = """
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

<table>
    <thead>
        <tr>
            <th>Header 1</th>
            <th>Header 2</th>
            <th>Header 3</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.</td>
            <td>Data 2</td>
            <td>Data 3</td>
        </tr>
        <tr>
            <td>Data 4</td>
            <td>Data 5</td>
            <td>Data 6</td>
        </tr>
        <tr>
            <td>Data 4</td>
            <td>Data 5</td>
            <td>Data 6</td>
        </tr>
        <tr>
            <td>Data 4</td>
            <td>Data 5</td>
            <td>Data 6</td>
        </tr>
        <tr>
            <td>Data 4</td>
            <td>Data 5</td>
            <td>Data 6</td>
        </tr>
        <tr>
            <td>Data 4</td>
            <td>Data 5</td>
            <td>Data 6</td>
        </tr>
    </tbody>
</table>
"""
# get soup object by specifying the parser
soup = BeautifulSoup(txt, "lxml")

# Usage of the cleanser: when it is initialized, it logs the default
# valid_tags and valid_attrs that will be kept in the text.
cleanser = HTMLCleanser()
# Add or remove any tags as you please.
# cleanser.add_valid_tags(["p", "img", "a", "span", "title"])
# cleanser.remove_valid_tags(["span"])
# Add or remove any attributes as you please.
# cleanser.add_valid_attrs(["href", "src", "alt", "font"])
# cleanser.remove_valid_attrs(["font"])

# Cleanse your HTML first.
soup = cleanser.cleanse_html(soup)

2024-03-18 17:49:04.990 | INFO | splitter.cleanser:__init__:40 - No valid_tags or valid_attrs provided.
                    Initialize the HTMLCleanser with default ones:
                    - valid_tags: ['table', 'tr', 'td', 'th']
                    - valid_attrs: ['rowspan', 'colspan']
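
To make the effect of the default `valid_tags` and `valid_attrs` concrete, here is a minimal sketch of whitelist-style cleansing using only the standard library. It is not the `HTMLCleanser` implementation: `WhitelistCleaner` and `cleanse` are hypothetical names, and the sketch assumes that a tag outside the whitelist is dropped while its text content is kept.

```python
from html import escape
from html.parser import HTMLParser


class WhitelistCleaner(HTMLParser):
    """Hypothetical stand-in: keep only whitelisted tags and attributes."""

    def __init__(self, valid_tags, valid_attrs):
        super().__init__(convert_charrefs=True)
        self.valid_tags = set(valid_tags)
        self.valid_attrs = set(valid_attrs)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.valid_tags:
            return  # drop the tag itself; its text is still kept by handle_data
        kept = "".join(
            f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
            if name in self.valid_attrs and value is not None
        )
        self.parts.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in self.valid_tags:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(escape(data, quote=False))


def cleanse(html, valid_tags=("table", "tr", "td", "th"),
            valid_attrs=("rowspan", "colspan")):
    """Return html with non-whitelisted tags and attributes removed."""
    cleaner = WhitelistCleaner(valid_tags, valid_attrs)
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.parts)


raw = (
    '<div class="x"><table border="1"><tbody><tr>'
    '<td colspan="2" style="color:red">Data 1</td>'
    "</tr></tbody></table></div>"
)
cleaned = cleanse(raw)
print(cleaned)
# -> <table><tr><td colspan="2">Data 1</td></tr></table>
```

With the defaults shown in the log above, wrappers such as `div` and `tbody` and attributes such as `border` and `style` disappear, while the table skeleton and `colspan` survive.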


In [2]:
from splitter.splitter import HTMLSplitter

splitter = HTMLSplitter(
    soup=soup, length_func=len, token_max=400, split_trial_max=15, raise_error=False
)
chunks = splitter.get_chunks()
chunks = splitter.split_chunks(chunks)
documents = splitter.make_documents(chunks)

The length of chunk `table: (4025, 4902)` exceeds the token_max, but failed to split the chunk.
A single row of the table may exceed the token_max, 
or the split function for table has not been developped.
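
The call above passes `length_func=len`, so `token_max=400` is measured in characters rather than model tokens. Below is a minimal sketch of a rough replacement, assuming `length_func` may be any callable that maps a string to an integer (that is how `len` is used here, but this guide does not document the exact contract). `approx_token_count` is a hypothetical helper, and a real model tokenizer would give different counts.

```python
import re

# Words and individual punctuation marks each count as one rough "token".
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def approx_token_count(text: str) -> int:
    """Count words and punctuation marks as a rough stand-in for model tokens."""
    return len(_TOKEN_RE.findall(text))


sentence = "Lorem Ipsum is simply dummy text of the printing and typesetting industry."
char_length = len(sentence)                  # what length_func=len reports
token_length = approx_token_count(sentence)  # what the rough counter reports
print(char_length, token_length)
```

Under that assumption, passing `length_func=approx_token_count` would turn `token_max` into a budget in these approximate tokens, so the same limit of 400 would admit much longer chunks than it does with `len`.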


In [3]:
documents

[Document(page=0, page_content="Lorem Ipsum is simply dummy text of the printing and typesetting industry.  Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book", length=245, metadata={}),
 Document(page=1, page_content=' It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.  It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum', length=329, metadata={}),
 Document(page=2, page_content="\nLorem Ipsum is simply dummy text of the printing and typesetting industry.  Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book", length=246, metadata={

In [4]:
print(f"Number of documents: {len(documents)}")

for doc in documents:
    print(f"----------- Page {doc.page} | Tokens {len(doc.page_content)} -----------")
    print(doc.page_content)
    print("----------------------------------------")
    print("\n\n")

Number of documents: 19
----------- Page 0 | Tokens 245 -----------
Lorem Ipsum is simply dummy text of the printing and typesetting industry.  Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book
----------------------------------------



----------- Page 1 | Tokens 329 -----------
 It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.  It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum
----------------------------------------



----------- Page 2 | Tokens 246 -----------

Lorem Ipsum is simply dummy text of the printing and typesetting industry.  Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer t