Tool for parsing HTML into a chain of chunks with relevant headers.
The API entry-point is in src/html_cleaver/cleaver.
The logical algorithm and data-structures are in src/html_cleaver/handler.
This is a "tree-capitator" if you will,
cleaving headers together while cleaving text apart.
pip install html-cleaver
Optionally, if you're working with HTML that requires javascript rendering:
pip install selenium
Testing an example on the command-line:
python -m html_cleaver.cleaver https://plato.stanford.edu/entries/goedel/
Cleaving pages of varying difficulties:
from html_cleaver.cleaver import get_cleaver
# default parser is "lxml" for loose html
with get_cleaver() as cleaver:
# handle chunk-events directly
# (example of favorable structure yielding high-quality chunks)
cleaver.parse_events(
["https://plato.stanford.edu/entries/goedel/"],
print)
# get collection of chunks
# (example of moderate structure yielding medium-quality chunks)
for c in cleaver.parse_chunk_sequence(
["https://en.wikipedia.org/wiki/Kurt_G%C3%B6del"]):
print(c)
# sequence of chunks from sequence of pages
# (examples of challenging structure yielding poor-quality chunks)
l = [
"https://www.gutenberg.org/cache/epub/56852/pg56852-images.html",
"https://www.cnn.com/2023/09/25/opinions/opinion-vincent-doumeizel-seaweed-scn-climate-c2e-spc-intl"]
for c in cleaver.parse_chunk_sequence(l):
print(c)
# example of mitigating/improving challenging structure by focusing on certain headers
with get_cleaver("lxml", ["h4", "h5"]) as cleaver:
cleaver.parse_events(
["https://www.gutenberg.org/cache/epub/56852/pg56852-images.html"],
print)Using selenium on a page that requires javascript to load contents:
from html_cleaver.cleaver import get_cleaver
print("using default lxml produces very few chunks:")
with get_cleaver() as cleaver:
cleaver.parse_events(
["https://www.youtube.com/watch?v=rfscVS0vtbw"],
print)
print("using selenium produces many more chunks:")
with get_cleaver("selenium") as cleaver:
cleaver.parse_events(
["https://www.youtube.com/watch?v=rfscVS0vtbw"],
print)Testing without Poetry:
pip install lxml
pip install selenium
python -m unittest discover -s src
Testing with Poetry:
poetry install
poetry run pytest
Building from source:
rm dist/*
python -m build
Installing from the build:
pip install dist/*.whl
Publishing from the build:
python -m twine upload --skip-existing -u __token__ -p $TESTPYPI_TOKEN --repository testpypi dist/*
python -m twine upload --skip-existing -u __token__ -p $PYPI_TOKEN dist/*