A powerful, LLM-friendly HTML scraper and converter for seamless web content extraction.
OpenCrawl is an intelligent tool that empowers you to fetch HTML content from any URL and elegantly transform single or multiple pages into clean HTML, Markdown, or JSON formats. Generate detailed site maps with unlimited depth exploration to map entire domains. The script offers flexible integration through three powerful interfaces:
- Command-Line Interactive - User-friendly menu-driven interface for quick operations
- Direct Function Calls - Seamless integration with your Python codebase
- LLM Function Calls - Structured API designed for large language model interactions
- Dependencies
- Installation
- Script Overview
- Key Functions
- Dictionaries & Parameters
- Running the Script
- Examples
## Dependencies

The script uses the following external libraries:
- questionary (for CLI prompts)
- requests
- html2text
- rich
- beautifulsoup4
- playwright (for optional JavaScript rendering)
If you do not have them installed, run:

```bash
pip install questionary requests html2text rich beautifulsoup4 playwright
```

After installing, run `playwright install` to download the necessary browser binaries.
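These browser binaries are only needed when JavaScript rendering (`render_js`) is enabled. For orientation, rendering a page with Playwright looks roughly like the sketch below; the helper name is illustrative, not part of the script's API:

```python
# Illustrative sketch: fetching fully rendered HTML with Playwright.
from playwright.sync_api import sync_playwright

def fetch_rendered_html(url: str) -> str:
    """Return a page's HTML after its JavaScript has executed."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
    return html
```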
## Installation

1. Save the script to a file, for example `opencrawl.py`.
2. Make it executable (if you want to run it directly, on UNIX-like systems):

   ```bash
   chmod +x opencrawl.py
   ```

3. Run it with Python:

   ```bash
   python opencrawl.py
   ```
## Script Overview

This script fetches HTML from user-supplied URLs, extracts the main content (excluding `<header>`, `<nav>`, `<footer>`, and elements whose ID or class includes `"footer"`), removes doc-specific symbol elements, and converts the results into:
- Clean HTML,
- Markdown (with optional Table of Contents),
- JSON (containing the extracted text in a simple dictionary).
All requests send a custom User-Agent header of OpenCrawl/1.0 when fetching pages.
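For orientation, the fetch-and-clean step is roughly equivalent to the sketch below; the helper name and exact filtering are illustrative, not the script's internals:

```python
# Illustrative sketch of the fetch-and-clean step (not the script's exact code).
import requests
import html2text
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "OpenCrawl/1.0"}

def fetch_clean_markdown(url: str) -> str:
    resp = requests.get(url, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    def removable(tag):
        """Structural chrome, or any element whose id/class mentions 'footer'."""
        if tag.name in ("header", "nav", "footer"):
            return True
        ident = " ".join(tag.get("class") or []) + " " + (tag.get("id") or "")
        return "footer" in ident.lower()

    # Re-search after each removal so nested matches are never touched twice.
    while (tag := soup.find(removable)) is not None:
        tag.decompose()

    converter = html2text.HTML2Text()
    converter.ignore_links = False   # flip to honor keep_links=False
    converter.ignore_images = False  # flip to honor keep_images=False
    return converter.handle(str(soup))
```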
It also supports:
- Recursive Crawling of same-domain pages (via BFS with a user-defined depth).
- Site Map generation for all same-domain links (unlimited depth). The site map can be in HTML, Markdown, or JSON adjacency form.
File collisions are handled by automatically appending "_1", "_2", etc. whenever a filename already exists.
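The collision handling is equivalent to something like this minimal sketch (`unique_path` is an illustrative name, not the script's actual helper):

```python
# Illustrative sketch of the "_1", "_2", ... filename collision handling.
import os

def unique_path(directory: str, filename: str) -> str:
    """Return a path inside `directory` that does not overwrite an existing file."""
    base, ext = os.path.splitext(filename)
    candidate = os.path.join(directory, filename)
    counter = 1
    while os.path.exists(candidate):
        candidate = os.path.join(directory, f"{base}_{counter}{ext}")
        counter += 1
    return candidate
```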
## Key Functions

```python
def do_single_page_conversion(
    url,
    output_format="Markdown",
    keep_images=True,
    keep_links=True,
    keep_emphasis=True,
    generate_toc=False,
    custom_filename=None
) -> Optional[str]
```

- Description: Fetches one page from `url`, cleans it, and converts it to the chosen `output_format` (HTML, Markdown, or JSON).
- Output: The script writes the file into the `output` directory (no subdirectory for single pages).
- Returns: The absolute path to the newly created file, or `None` if an error occurred (e.g., an invalid URL).
```python
def do_recursive_crawling(
    url,
    max_depth=1,
    output_format="Markdown",
    keep_images=True,
    keep_links=True,
    keep_emphasis=True,
    generate_toc=False,
    custom_filename=None
) -> None
```

- Description: Recursively crawls all same-domain links up to `max_depth`, converting each page to the chosen `output_format`.
- Output: Files are saved in the `recursive_crawl` subdirectory.
- Returns: `None`.
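Conceptually, the same-domain BFS behaves like the sketch below (illustrative; the script's actual traversal may differ in details such as error handling and URL normalization):

```python
# Conceptual sketch of a same-domain BFS up to max_depth (not the exact code).
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def bfs_same_domain(start_url: str, max_depth: int = 1) -> list:
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited_in_order = []
    while queue:
        url, depth = queue.popleft()
        visited_in_order.append(url)
        if depth >= max_depth:
            continue
        try:
            resp = requests.get(url, headers={"User-Agent": "OpenCrawl/1.0"}, timeout=30)
            resp.raise_for_status()
        except requests.RequestException:
            continue  # skip unreachable pages
        for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]  # resolve and drop fragments
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited_in_order
```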
```python
def do_map_only(
    url,
    output_format="Markdown",
    keep_images=True,
    keep_links=True,
    keep_emphasis=True,
    generate_toc=False,
    custom_filename=None
) -> None
```

- Description: Creates a site map (only) by traversing all same-domain links to unlimited depth.
- Output:
  - If HTML, saves `site_map.html` (with nested lists).
  - If Markdown, saves `site_map.md` (nested bullet points).
  - If JSON, saves `site_map.json` (dictionary adjacency).
- Returns: `None`.
- Path: Written in the `site_map` subdirectory.
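As an illustration, the JSON adjacency maps each crawled page to the same-domain links found on it; the example below is hypothetical, and the exact structure depends on the site:

```json
{
  "https://example.com": [
    "https://example.com/about",
    "https://example.com/contact"
  ],
  "https://example.com/about": [],
  "https://example.com/contact": []
}
```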
```python
def do_recursive_crawling_and_map(
    url,
    max_depth=1,
    output_format="Markdown",
    keep_images=True,
    keep_links=True,
    keep_emphasis=True,
    generate_toc=False,
    custom_filename=None
) -> None
```

- Description: Combines recursive crawling (with user-defined depth) followed by a full site map (unlimited depth).
- Output:
  - Pages go into the `crawl_and_map` subdirectory.
  - The site map (HTML/Markdown/JSON) is also placed there.
- Returns: `None`.
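For example, to crawl two levels deep and then build the full site map in one call:

```python
from opencrawl import do_recursive_crawling_and_map

# Pages and the site map both land in the crawl_and_map subdirectory.
do_recursive_crawling_and_map(
    url="https://example.com",
    max_depth=2,
    output_format="JSON"
)
```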
```python
def monitor_page_for_changes(
    url,
    check_interval=60,
    checks=2,
    render_js=False,
    keep_links=True,
    keep_images=True,
    keep_emphasis=True
) -> None
```

- Description: Periodically checks a page and prints a diff when changes are detected.
- Output: Changes are displayed in the terminal.
- Returns: `None`.
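For example, to check a JavaScript-heavy page every five minutes, ten times in total:

```python
from opencrawl import monitor_page_for_changes

# Diffs are printed to the terminal whenever the page content changes.
monitor_page_for_changes(
    url="https://example.com",
    check_interval=300,  # seconds between checks
    checks=10,
    render_js=True
)
```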
```python
def llm_function_call(
    function_name: str,
    url: str,
    output_format: str = "Markdown",
    keep_images: bool = True,
    keep_links: bool = True,
    keep_emphasis: bool = True,
    generate_toc: bool = False,
    custom_filename: str = None,
    max_depth: int = 1,
    render_js: bool = False,
    check_interval: int = 60,
    checks: int = 2
) -> Any
```

- Description: A universal dispatcher so an LLM can call the script's features by name.
- Parameters:
  - `function_name`: A string in {`"do_single_page_conversion"`, `"do_recursive_crawling"`, `"do_map_only"`, `"do_recursive_crawling_and_map"`}.
  - `url`: The base URL to process.
  - `output_format`: `"HTML"`, `"Markdown"`, or `"JSON"`.
  - `keep_images`, `keep_links`, `keep_emphasis`: Booleans controlling what is kept in the output.
  - `generate_toc`: Whether to generate a table of contents when `output_format="Markdown"`.
  - `custom_filename`: Optional string for the file name.
  - `max_depth`: Only relevant for crawling functions (otherwise ignored).
  - `render_js`: Use Playwright to render JavaScript before scraping.
  - `check_interval`: Seconds between checks when monitoring a page.
  - `checks`: Number of checks when monitoring a page.
- Returns: The result of the underlying function, typically a file path or `None`.
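Internally, such a dispatcher can be little more than a name-to-function lookup. A minimal sketch, assuming the four functions above are in scope (illustrative, not necessarily the script's exact code):

```python
# Minimal dispatcher sketch: map a name to a function and forward arguments.
import inspect

def llm_function_call(function_name, url, **kwargs):
    dispatch = {
        "do_single_page_conversion": do_single_page_conversion,
        "do_recursive_crawling": do_recursive_crawling,
        "do_map_only": do_map_only,
        "do_recursive_crawling_and_map": do_recursive_crawling_and_map,
    }
    func = dispatch.get(function_name)
    if func is None:
        raise ValueError(f"Unknown function_name: {function_name!r}")
    # Drop arguments the target does not accept, so irrelevant parameters
    # (e.g. max_depth for a single-page conversion) are silently ignored.
    accepted = set(inspect.signature(func).parameters)
    return func(url=url, **{k: v for k, v in kwargs.items() if k in accepted})
```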
## Dictionaries & Parameters

When the script is run interactively, it uses:

```python
settings = {
    "output_format": <"HTML" | "Markdown" | "JSON">,
    "keep_images": <bool>,
    "keep_links": <bool>,
    "keep_emphasis": <bool>,
    "render_js": <bool>,
    "generate_toc": <bool>,
    "custom_filename": <bool>,        # Did the user choose a custom file name?
    "output_filename": <str or None>  # The actual custom file name if chosen
}
```

When calling `llm_function_call`, you pass in the `function_name` plus named arguments:

- `function_name`
- `url`
- `output_format` (default `"Markdown"`)
- `keep_images` (default `True`)
- `keep_links` (default `True`)
- `keep_emphasis` (default `True`)
- `generate_toc` (default `False`)
- `custom_filename` (default `None`)
- `max_depth` (default `1`)
- `render_js` (default `False`)
- `check_interval` (default `60`)
- `checks` (default `2`)
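If you expose the dispatcher to an LLM that supports tool/function calling, you describe it with a parameter schema. A trimmed, hypothetical example in the common OpenAI-style tool format (the script does not ship such a schema; this is an assumption for illustration):

```json
{
  "name": "llm_function_call",
  "description": "Fetch and convert web pages with OpenCrawl.",
  "parameters": {
    "type": "object",
    "properties": {
      "function_name": {
        "type": "string",
        "enum": ["do_single_page_conversion", "do_recursive_crawling",
                 "do_map_only", "do_recursive_crawling_and_map"]
      },
      "url": {"type": "string"},
      "output_format": {"type": "string", "enum": ["HTML", "Markdown", "JSON"]},
      "max_depth": {"type": "integer"}
    },
    "required": ["function_name", "url"]
  }
}
```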
## Running the Script

1. Launch:

   ```bash
   python opencrawl.py
   ```

2. Select your desired action from the main menu:
   - Single Page Conversion
   - Recursive Crawling
   - Map
   - Recursive Crawling & Map
   - Monitor Web Page for Changes
   - Exit

3. Answer the prompts (URL, advanced settings, depth if applicable).
Alternatively, call the functions directly:

- Import the script:

  ```python
  from opencrawl import (
      do_single_page_conversion,
      do_recursive_crawling,
      do_map_only,
      do_recursive_crawling_and_map,
      monitor_page_for_changes,
      llm_function_call,
  )
  ```

- Call any function with your parameters.
## Examples

### Single Page Example

1. Run:

   ```bash
   python opencrawl.py
   ```

2. Choose `1. Single Page Conversion`.
3. Enter a URL (e.g., `https://example.com`).
4. Choose advanced settings (e.g., output format `Markdown`, keep images = yes, etc.).
5. The script writes a `.md` file into the `output` folder.
Single Page Conversion programmatically:

```python
from opencrawl import do_single_page_conversion

result_path = do_single_page_conversion(
    url="https://example.com",
    output_format="Markdown",
    keep_images=True,
    keep_links=True,
    keep_emphasis=True,
    generate_toc=True,
    custom_filename=None
)
print(f"File saved to {result_path}")
```

Recursive Crawling programmatically:
```python
from opencrawl import do_recursive_crawling

do_recursive_crawling(
    url="https://example.com",
    max_depth=2,           # crawl 2 levels deep
    output_format="HTML",  # produce cleaned HTML for each page
    keep_images=False,
    keep_links=True
)
```

Hypothetical GPT-4-style function call in JSON (for a single page conversion):
```json
{
  "name": "llm_function_call",
  "arguments": {
    "function_name": "do_single_page_conversion",
    "url": "https://example.com",
    "output_format": "Markdown",
    "keep_images": true,
    "keep_links": true,
    "keep_emphasis": true,
    "generate_toc": true,
    "custom_filename": null,
    "max_depth": 1,
    "render_js": false
  }
}
```

Which translates in Python to:
```python
from opencrawl import llm_function_call

llm_function_call(
    function_name="do_single_page_conversion",
    url="https://example.com",
    output_format="Markdown",
    keep_images=True,
    keep_links=True,
    keep_emphasis=True,
    generate_toc=True,
    custom_filename=None,
    max_depth=1,
    render_js=False
)
```

This performs a single-page fetch and conversion to Markdown with a table of contents, saving the file in `output/`. If a filename collision is detected, a suffix (`_1`, `_2`, etc.) is appended automatically.
Enjoy this script for your multi-format scraping and domain exploration needs! Feel free to extend or adapt it for your specific workflows. If you have any questions, please let us know!
{ "name": "llm_function_call", "arguments": { "function_name": "do_single_page_conversion", "url": "https://example.com", "output_format": "Markdown", "keep_images": true, "keep_links": true, "keep_emphasis": true, "generate_toc": true, "custom_filename": null, "max_depth": 1, "render_js": false } }