Remove specific tags from an HTML file or stream and output the detagged, minified HTML to a file or stdout.
html_deltags is a Python module and shell script designed to detag, normalize, and 'minify' HTML documents. It removes specified HTML tags (including those containing certain keywords) and comments, so that further analysis of the remaining HTML works on clean input.
- Removes specified HTML tags and comments from an HTML document.
- Can target and delete tags based on contained keywords.
- Flexible usage as both a standalone script and an importable Python module.
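The tag and comment removal described above can be pictured with a few lines of BeautifulSoup. This is a minimal sketch, not the project's actual implementation: the function name `strip_tags` is hypothetical, and it uses `html.parser` only to avoid extra dependencies.

```python
from bs4 import BeautifulSoup, Comment


def strip_tags(html, tags, parser="html.parser"):
    """Return html with every tag named in `tags` removed.

    The pseudo-tag "comments" removes HTML comments instead of an element.
    (Hypothetical helper, for illustration only.)
    """
    soup = BeautifulSoup(html, parser)
    for name in tags:
        if name == "comments":
            # Comments are special string nodes, not elements.
            for node in soup.find_all(string=lambda s: isinstance(s, Comment)):
                node.extract()
        else:
            # extract() detaches the element together with everything inside it.
            for element in soup.find_all(name):
                element.extract()
    return str(soup)


if __name__ == "__main__":
    page = (
        "<html><head><title>t</title></head>"
        "<body><!-- note --><nav>menu</nav><p>keep me</p></body></html>"
    )
    print(strip_tags(page, ["head", "nav", "comments"]))
```

Tags are processed in the order given, which is why removing large containers such as `head` first can save work on the elements nested inside them.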
git clone https://github.com/Open-Technology-Foundation/html_deltags.git && sudo html_deltags/html_deltags.install
html_deltags.install will:
- Copy html_deltags files to /usr/local/share/html_deltags
- Create a Python virtual environment with all dependencies in the installation directory
- Create a symlink at /usr/local/bin/html_deltags
Options:
- -U, --upgrade: Download the latest version from the repository before installing
- -h, --help: Show help message
Example:
# Install normally
sudo ./html_deltags.install
# Install using the latest version from the repository
sudo ./html_deltags.install --upgrade
Root access is required for installation.
As a script:
html_deltags [options] [input_file]
input_file Path to HTML file to be detagged.
Reads from stdin if not provided.
Options:
-O|--output filename
Output file for detagged HTML.
Defaults to stdout.
-d|--delete tag[,tag,tag]
HTML tags to remove, as a comma-separated list.
Multiple -d options allowed.
Example: ... -d script,link,meta ...
-D|--delete-common
Add common tags to delete list in optimal order: doctype,head,header,footer,nav,
iframe,svg,script,style,noscript,comments,path,img,button.
Equivalent to -d with the above tags in this specific order.
-k|--kw-delete 'tag keyword'
Remove tags containing specific keywords.
Specify tag, space, then pattern/keyword.
Multiple -k options allowed.
Example: ... -k 'div sometext' ...
-p|--parser html5lib|lxml|html.parser
BS4 html parser to use.
Default: html5lib
-h|--help
Display this help message and exit.
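Because -d may be repeated and -D expands to a fixed list, the options have to be merged into a single ordered delete list. The sketch below shows one way to do that with `argparse`. It is an assumption about the behavior, not the script's real option handling: `build_delete_list` is a hypothetical name, the common tags are appended after any explicit -d tags, and duplicates are dropped.

```python
import argparse

# Tag order copied from the -D|--delete-common help text above.
COMMON_TAGS = [
    "doctype", "head", "header", "footer", "nav", "iframe", "svg",
    "script", "style", "noscript", "comments", "path", "img", "button",
]


def build_delete_list(argv):
    """Merge repeated -d options and -D into one ordered, duplicate-free list."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("-d", "--delete", action="append", default=[])
    parser.add_argument("-D", "--delete-common", action="store_true")
    # Other options and the input file are ignored in this sketch.
    args, _unknown = parser.parse_known_args(argv)

    tags = []
    for group in args.delete:
        # Each -d value is a comma-separated list such as "script,link,meta".
        tags.extend(t.strip().lower() for t in group.split(",") if t.strip())
    if args.delete_common:
        # Assumption: common tags come after explicitly listed ones.
        tags.extend(COMMON_TAGS)
    # dict.fromkeys keeps the first occurrence of each tag, in order.
    return list(dict.fromkeys(tags))


if __name__ == "__main__":
    print(build_delete_list(["my.html", "-d", "head,comments,nav", "-d", "svg,path"]))
```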
Each of the parsers has its strengths and weaknesses:
Speed: lxml is the fastest, followed by html.parser, then html5lib.
Error Tolerance: html5lib and lxml are more forgiving of bad or broken HTML compared to html.parser.
Dependencies: html.parser has the advantage of not requiring any external dependencies.
Standards Conformance: html5lib is best for parsing HTML in a way that's consistent with modern web browsers.
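Since html5lib and lxml are optional third-party packages, a caller may want to fall back to whichever parser is installed. The following is a hedged sketch of that idea using BeautifulSoup's `FeatureNotFound` exception. The helper `make_soup` is hypothetical, and this fallback is not something html_deltags is documented to do.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html, preferred="html5lib"):
    """Parse html with the preferred parser, falling back to what is installed.

    Returns a (soup, parser_name) pair so the caller can see which parser ran.
    """
    fallbacks = [p for p in ("html5lib", "lxml", "html.parser") if p != preferred]
    for name in [preferred] + fallbacks:
        try:
            return BeautifulSoup(html, name), name
        except FeatureNotFound:
            # Raised by BeautifulSoup when the requested parser is not installed.
            continue
    raise RuntimeError("no usable HTML parser found")


if __name__ == "__main__":
    soup, used = make_soup("<p>Hello<p>world")
    print(used)
    print(soup)
```

Note that the three parsers can build different trees from the same broken markup, so output may change depending on which one is selected.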
html_deltags my.html -d head,comments,nav
html_deltags -d head,comments,nav < my.html > mynew.html
html_deltags my.html -D -O clean.html
html_deltags my.html -d head,comments,nav -d svg,path -O mynew.html
html_deltags my.html -d head,nav -k 'div class="t1"'
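The last example uses -k with the spec 'div class="t1"'. How the keyword is matched internally is not documented here, so the sketch below makes an explicit assumption: the spec is split at the first space, and an element is removed when the keyword appears as a plain substring of its serialized markup. The name `delete_by_keyword` is hypothetical.

```python
from bs4 import BeautifulSoup


def delete_by_keyword(html, spec, parser="html.parser"):
    """Remove each <tag> whose serialized markup contains the keyword.

    `spec` follows the CLI form: tag name, one space, then the keyword.
    """
    tag_name, _, keyword = spec.partition(" ")
    soup = BeautifulSoup(html, parser)
    for element in soup.find_all(tag_name):
        # Assumption: plain substring match against the element's markup.
        if keyword in str(element):
            element.extract()
    return str(soup)


if __name__ == "__main__":
    page = '<div class="t1">advert</div><div class="t2">keep</div>'
    print(delete_by_keyword(page, 'div class="t1"'))
```

Under this substring rule, an outer element that merely contains a matching inner element is removed as well, so a narrow keyword is safer than a broad one.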
As a module:
from html_deltags import html_deltags
...
clean_html = html_deltags(input_source, output, deltags, deltagkws)
...
- Python 3
- BeautifulSoup4
- Bash 5
Contributions, issues, and feature requests are welcome. Check the issues page.
Distributed under the GPL3 License. See LICENSE for more information.
Project Link: https://github.com/Open-Technology-Foundation/html_deltags