## User Guide on HTMLCleanser and HTMLSplitter
:: This guide shows how to use HTMLCleanser and HTMLSplitter.

- It introduces HTMLCleanser, which preprocesses HTML (e.g. strips invalid tags and attributes).
- It introduces HTMLSplitter, which splits documents before they are indexed by an IR model for RAG.
- It assumes your document is in `html` format.
- If your document is in another format, convert it to an HTML file first.
- It <b>prevents</b> the document from being split without regard to its contents during indexing.
    - `[as-is]` contents (e.g. a table) inside the document were cut off without regard to context because of the maximum token count the IR model allows.
    - `[to-be]` contents (e.g. a table) inside the document are split without losing context or structure, with a guarantee that the resulting chunks do not exceed the maximum token count.
        - First, split the target document into a set of Documents before indexing, taking into account the maximum token count the IR model accepts.
        - Second, feed the split documents into the IR model's indexing process.
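
The `[to-be]` behaviour above can be sketched as follows. This is a minimal illustration, not the actual HTMLSplitter implementation: `split_table_rows`, its parameters, and the greedy packing strategy are assumptions made for this example. Only the names `token_max` and `length_func` mirror arguments used later in this guide.

```python
from typing import Callable, List


def split_table_rows(
    header_row: str,
    data_rows: List[str],
    token_max: int,
    length_func: Callable[[str], int] = len,
) -> List[str]:
    """Greedily pack data rows into <table> chunks, repeating the header row in each.

    Hypothetical helper for illustration only.
    """

    def render(rows: List[str]) -> str:
        # Every chunk is a complete table that carries its own header row.
        return "<table>" + header_row + "".join(rows) + "</table>"

    chunks: List[str] = []
    current: List[str] = []
    for row in data_rows:
        if length_func(render([row])) > token_max:
            raise ValueError("a single row plus the header exceeds token_max")
        # Close the current chunk if adding this row would exceed the budget.
        if current and length_func(render(current + [row])) > token_max:
            chunks.append(render(current))
            current = []
        current.append(row)
    if current:
        chunks.append(render(current))
    return chunks


header = "<tr><th>Header 1</th><th>Header 2</th><th>Header 3</th></tr>"
rows = [
    "<tr><td>Data 1</td><td>Data 2</td><td>Data 3</td></tr>",
    "<tr><td>Data 4</td><td>Data 5</td><td>Data 6</td></tr>",
]
chunks = split_table_rows(header, rows, token_max=150)
for chunk in chunks:
    print(len(chunk), chunk)
```

Because the header row is repeated in every chunk, each piece stays interpretable on its own after indexing, at the cost of some duplicated text.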

[TODO]
- Add more functionality to HTMLSplitter._split_chunk
- Think about how to handle the context window

```
- Writer: Eungi Cho
- Last update: 23.11.21
```

In [1]:
from bs4 import BeautifulSoup
from _html.cleanser import HTMLCleanser

# suppose you have the html file as below:
txt = """
<table>
    <thead>
        <tr>
            <th>Header 1</th>
            <th>Header 2</th>
            <th>Header 3</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Data 1</td>
            <td>Data 2</td>
            <td>Data 3</td>
        </tr>
        <tr>
            <td>Data 4</td>
            <td>Data 5</td>
            <td>Data 6</td>
        </tr>
    </tbody>
</table>
"""
# get soup object by specifying the parser
soup = BeautifulSoup(txt, "lxml")

# usage of the cleanser: on initialization it logs the default valid_tags and
# valid_attrs that will be kept in the text.
cleanser = HTMLCleanser()
# add or remove any tags as you please.
cleanser.add_valid_tags(["p", "img", "a", "span", "title"])
cleanser.remove_valid_tags(["span"])
# add or remove any attributes as you please.
cleanser.add_valid_attrs(["href", "src", "alt", "font"])
cleanser.remove_valid_attrs(["font"])

# cleanse your html first.
soup = cleanser.cleanse_html(soup)

2023-11-21 17:12:25.196 | INFO | _html.cleanser:__init__:40 - No valid_tags or valid_attrs provided.
                    Initialize the HTMLCleanser with default ones:
                    - valid_tags: ['table', 'tr', 'td', 'th']
                    - valid_attrs: ['rowspan', 'colspan']
2023-11-21 17:12:25.197 | INFO | _html.cleanser:add_valid_tags:67 - valid_tags: ['a', 'p', 'title', 'img', 'tr', 'td', 'table', 'span', 'th']
2023-11-21 17:12:25.198 | INFO | _html.cleanser:remove_valid_tags:77 - valid_tags: ['a', 'p', 'title', 'img', 'tr', 'td', 'table', 'th']
2023-11-21 17:12:25.199 | INFO | _html.cleanser:add_valid_attrs:73 - valid_attrs: ['src', 'alt', 'colspan', 'href', 'font', 'rowspan']
2023-11-21 17:12:25.200 | INFO | _html.cleanser:remove_vali

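To make the idea of whitelist-based cleansing concrete, here is a rough standard-library sketch. It is not how HTMLCleanser is implemented: `WhitelistCleaner` and `cleanse` are hypothetical names, and the policy shown (drop a disallowed tag but keep the text and children it wraps) is an assumption for this example.

```python
from html import escape
from html.parser import HTMLParser
from typing import Iterable, List


class WhitelistCleaner(HTMLParser):
    """Keep only whitelisted tags and attributes; text content is always kept."""

    def __init__(self, valid_tags: Iterable[str], valid_attrs: Iterable[str]) -> None:
        super().__init__(convert_charrefs=True)
        self.valid_tags = set(valid_tags)
        self.valid_attrs = set(valid_attrs)
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.valid_tags:
            return  # drop the tag itself, keep whatever it wraps
        kept = ""
        for name, value in attrs:
            if name in self.valid_attrs:
                kept += ' {}="{}"'.format(name, escape(value or "", quote=True))
        self.parts.append("<{}{}>".format(tag, kept))

    def handle_endtag(self, tag):
        if tag in self.valid_tags:
            self.parts.append("</{}>".format(tag))

    def handle_data(self, data):
        # Re-escape text so the output remains valid HTML.
        self.parts.append(escape(data, quote=False))


def cleanse(html_text: str, valid_tags: Iterable[str], valid_attrs: Iterable[str]) -> str:
    cleaner = WhitelistCleaner(valid_tags, valid_attrs)
    cleaner.feed(html_text)
    cleaner.close()  # flush any buffered text
    return "".join(cleaner.parts)


raw = '<table class="x"><tr><td colspan="2" style="color:red"><span>Data 1</span></td></tr></table>'
cleaned = cleanse(raw, ["table", "tr", "td", "th"], ["rowspan", "colspan"])
print(cleaned)
```

The whitelist passed here matches the default `valid_tags` and `valid_attrs` reported in the log output above.
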
In [2]:
from _html.splitter import HTMLSplitter

splitter = HTMLSplitter(soup=soup, length_func=len, token_max=200)
chunks = splitter.get_chunks()
chunks = splitter.split_chunks(chunks)
documents = splitter.make_documents(chunks)
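
With `length_func=len`, lengths are measured in characters, so `token_max=200` acts as a character budget. If a token-like measure is preferred, a different callable could be passed instead, assuming HTMLSplitter accepts any function that maps a string to an integer (this guide does not confirm that). The counter below, `approx_token_count`, is a hypothetical stand-in for illustration; a real setup would more likely use the IR model's own tokenizer.

```python
import re


def approx_token_count(text: str) -> int:
    """Rough estimate: each HTML tag and each whitespace-separated word counts as one token."""
    return len(re.findall(r"</?[^>]+>|[^\s<>]+", text))


sample = "<tr><td>Data 1</td></tr>"
print(len(sample), approx_token_count(sample))
```

Under that assumption, the splitter would be constructed with `length_func=approx_token_count` and a `token_max` expressed in the same unit.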

In [4]:
print(f"Number of documents: {len(documents)}")

for doc in documents:
    print(f"----------- Page {doc.page} | Tokens {len(doc.page_content)} -----------")
    print(doc.page_content)
    print("----------------------------------------")
    print("\n\n")

Number of documents: 2
----------- Page 0 | Tokens 199 -----------
<table>
 <tr>
  <th>
   Header 1
  </th>
  <th>
   Header 2
  </th>
  <th>
   Header 3
  </th>
 </tr>
 <tr>
  <td>
   Data 1
  </td>
  <td>
   Data 2
  </td>
  <td>
   Data 3
  </td>
 </tr>
</table>

----------------------------------------



----------- Page 1 | Tokens 199 -----------
<table>
 <tr>
  <th>
   Header 1
  </th>
  <th>
   Header 2
  </th>
  <th>
   Header 3
  </th>
 </tr>
 <tr>
  <td>
   Data 4
  </td>
  <td>
   Data 5
  </td>
  <td>
   Data 6
  </td>
 </tr>
</table>

----------------------------------------



