OpenCrawl 🦊

A powerful, LLM-friendly HTML scraper and converter for seamless web content extraction.

OpenCrawl fetches HTML content from any URL and transforms single or multiple pages into clean HTML, Markdown, or JSON. It can also generate detailed site maps, exploring a domain to unlimited depth. The script offers three interfaces:

  1. Command-Line Interactive - User-friendly menu-driven interface for quick operations
  2. Direct Function Calls - Seamless integration with your Python codebase
  3. LLM Function Calls - Structured API designed for large language model interactions

Table of Contents

  1. Dependencies
  2. Installation
  3. Script Overview
  4. Key Functions
    1. Single Page Conversion
    2. Recursive Crawling
    3. Map (Site Map Only)
    4. Recursive Crawling & Map
    5. Monitor Web Page for Changes
    6. Universal LLM Function Call
  5. Dictionaries & Parameters
    1. The Settings Dictionary (CLI)
    2. LLM Function Call Parameters
  6. Running the Script
  7. Examples
    1. CLI Usage
    2. Python Imports
    3. LLM Function Call Example

1. Dependencies

The script uses the following external libraries:

  • questionary (for CLI prompts)
  • requests
  • html2text
  • rich
  • beautifulsoup4
  • playwright (for optional JavaScript rendering)

If you do not have them installed, run:

pip install questionary requests html2text rich beautifulsoup4 playwright

After installing, run playwright install to download the necessary browser binaries.


2. Installation

  1. Save the script to a file, for example:
    opencrawl.py
  2. Make it executable (if you want to run it directly, on UNIX-like systems):
    chmod +x opencrawl.py
  3. Run it with Python:
    python opencrawl.py

3. Script Overview

This script fetches HTML from user-supplied URLs, extracts the main content (excluding <header>, <nav>, <footer>, and elements whose ID or class includes "footer"), removes doc-specific symbol elements, and converts the results into:

  • Clean HTML,
  • Markdown (with optional Table of Contents),
  • JSON (containing the extracted text in a simple dictionary).

All requests send a custom User-Agent header of OpenCrawl/1.0 when fetching pages.
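
The sketch below shows roughly what this fetch-and-clean step might look like. It is illustrative only: the helper name fetch_and_clean and the exact selector logic are assumptions, not the script's actual implementation.

import requests
from bs4 import BeautifulSoup

def fetch_and_clean(url):
    # Fetch with the custom User-Agent described above.
    response = requests.get(url, headers={"User-Agent": "OpenCrawl/1.0"}, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Drop structural chrome entirely.
    for tag in soup.find_all(["header", "nav", "footer"]):
        tag.decompose()
    # Drop any element whose id or class mentions "footer".
    for tag in soup.find_all(lambda t: "footer" in (t.get("id") or "").lower()
                             or any("footer" in c.lower() for c in (t.get("class") or []))):
        tag.decompose()
    return soup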

It also supports:

  • Recursive Crawling of same-domain pages (via BFS with a user-defined depth; see the sketch after this list).
  • Site Map generation for all same-domain links (unlimited depth). The site map can be in HTML, Markdown, or JSON adjacency form.
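
A hedged sketch of the same-domain BFS mentioned above. Helper and variable names are assumptions for illustration, not the script's actual code.

from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl_same_domain(start_url, max_depth=1):
    domain = urlparse(start_url).netloc
    visited = {start_url}
    queue = deque([(start_url, 0)])   # (url, depth) pairs
    pages = []
    while queue:
        url, depth = queue.popleft()
        response = requests.get(url, headers={"User-Agent": "OpenCrawl/1.0"}, timeout=30)
        pages.append(url)
        if depth >= max_depth:
            continue
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"]).split("#")[0]
            # Only enqueue links on the same domain, once each.
            if urlparse(link).netloc == domain and link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return pages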

File collisions are handled by automatically appending "_1", "_2", etc. whenever a filename already exists.
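
The collision rule could be implemented along these lines (the helper name unique_path is an assumption):

import os

def unique_path(path):
    # Return the path unchanged if it is free; otherwise append _1, _2, ...
    if not os.path.exists(path):
        return path
    base, ext = os.path.splitext(path)
    counter = 1
    while os.path.exists(f"{base}_{counter}{ext}"):
        counter += 1
    return f"{base}_{counter}{ext}"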


4. Key Functions

4.1 Single Page Conversion

def do_single_page_conversion(
    url,
    output_format="Markdown",
    keep_images=True,
    keep_links=True,
    keep_emphasis=True,
    generate_toc=False,
    custom_filename=None
) -> Optional[str]
  • Description: Fetches one page from url, cleans it, and converts to the chosen output_format (HTML, Markdown, or JSON).
  • Output: The script writes the file into the output directory (no subdirectory for single page).
  • Returns: The absolute path to the newly created file, or None if an error occurred (e.g. invalid URL).

4.2 Recursive Crawling

def do_recursive_crawling(
    url,
    max_depth=1,
    output_format="Markdown",
    keep_images=True,
    keep_links=True,
    keep_emphasis=True,
    generate_toc=False,
    custom_filename=None
) -> None
  • Description: Recursively crawls all same-domain links up to max_depth, converting each page to the chosen output_format.
  • Output: Files are saved in the recursive_crawl subdirectory.
  • Returns: None.

4.3 Map (Site Map Only)

def do_map_only(
    url,
    output_format="Markdown",
    keep_images=True,
    keep_links=True,
    keep_emphasis=True,
    generate_toc=False,
    custom_filename=None
) -> None
  • Description: Creates a site map only (no page conversion) by traversing all same-domain links to unlimited depth.
  • Output (written to the site_map subdirectory):
    • If HTML, saves site_map.html (with nested lists).
    • If Markdown, saves site_map.md (nested bullet points).
    • If JSON, saves site_map.json (dictionary adjacency).
  • Returns: None.
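
For example, to write a JSON adjacency map of an entire domain:

from opencrawl import do_map_only

do_map_only(
    url="https://example.com",
    output_format="JSON"   # writes site_map.json into the site_map subdirectory
)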

4.4 Recursive Crawling & Map

def do_recursive_crawling_and_map(
    url,
    max_depth=1,
    output_format="Markdown",
    keep_images=True,
    keep_links=True,
    keep_emphasis=True,
    generate_toc=False,
    custom_filename=None
) -> None
  • Description: Runs Recursive Crawling (with user-defined depth), then generates a full site map (unlimited depth).
  • Output:
    • Pages go into crawl_and_map subdirectory.
    • The site map (html/md/json) is also placed there.
  • Returns: None.
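
For example, to crawl two levels deep and then map the whole domain:

from opencrawl import do_recursive_crawling_and_map

do_recursive_crawling_and_map(
    url="https://example.com",
    max_depth=2,                 # pages crawled to depth 2
    output_format="Markdown"     # pages and the site map land in crawl_and_map/
)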

4.5 Monitor Web Page for Changes

def monitor_page_for_changes(
    url,
    check_interval=60,
    checks=2,
    render_js=False,
    keep_links=True,
    keep_images=True,
    keep_emphasis=True
) -> None
  • Description: Periodically checks a page and prints a diff when changes are detected.
  • Output: Changes are displayed in the terminal.
  • Returns: None.
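
For example, to poll a page every five minutes, twelve times:

from opencrawl import monitor_page_for_changes

monitor_page_for_changes(
    url="https://example.com",
    check_interval=300,   # seconds between checks
    checks=12,            # stop after 12 checks (about an hour)
    render_js=False
)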

4.6 Universal LLM Function Call

def llm_function_call(
    function_name: str,
    url: str,
    output_format: str = "Markdown",
    keep_images: bool = True,
    keep_links: bool = True,
    keep_emphasis: bool = True,
    generate_toc: bool = False,
    custom_filename: str = None,
    max_depth: int = 1,
    render_js: bool = False,
    check_interval: int = 60,
    checks: int = 2
) -> Any
  • Description: A universal dispatcher so an LLM can call the script’s features by name.
  • Parameters:
    • function_name: A string in { "do_single_page_conversion", "do_recursive_crawling", "do_map_only", "do_recursive_crawling_and_map", "monitor_page_for_changes" }
    • url: The base URL to process.
    • output_format: "HTML", "Markdown", or "JSON".
    • keep_images, keep_links, keep_emphasis: Booleans controlling what is kept in the output.
    • generate_toc: Whether to generate a table of contents if output_format="Markdown".
    • custom_filename: Optional string for the file name.
    • max_depth: Only relevant for crawling functions (otherwise ignored).
    • render_js: Use Playwright to render JavaScript before scraping.
    • check_interval: Seconds between checks when monitoring a page.
    • checks: Number of checks when monitoring a page.
  • Returns: The result of the underlying function, typically a path or None.
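
Internally, a dispatcher like this can be as simple as a name-to-function table. The sketch below is one plausible shape, not the script's actual code; in particular, the signature-based filtering is an assumption about how parameters that do not apply get ignored.

import inspect
from opencrawl import (
    do_single_page_conversion,
    do_recursive_crawling,
    do_map_only,
    do_recursive_crawling_and_map,
    monitor_page_for_changes,
)

def llm_function_call_sketch(function_name, url, **kwargs):
    dispatch = {
        "do_single_page_conversion": do_single_page_conversion,
        "do_recursive_crawling": do_recursive_crawling,
        "do_map_only": do_map_only,
        "do_recursive_crawling_and_map": do_recursive_crawling_and_map,
        "monitor_page_for_changes": monitor_page_for_changes,
    }
    func = dispatch[function_name]
    # Forward only the arguments the target function actually accepts,
    # so parameters that do not apply (e.g. max_depth) are ignored.
    accepted = inspect.signature(func).parameters
    return func(url=url, **{k: v for k, v in kwargs.items() if k in accepted})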

5. Dictionaries & Parameters

5.1 The Settings Dictionary (CLI)

When the script is run interactively, it uses:

settings = {
    "output_format": <"HTML"|"Markdown"|"JSON">,
    "keep_images": <bool>,
    "keep_links": <bool>,
    "keep_emphasis": <bool>,
    "render_js": <bool>,
    "generate_toc": <bool>,
    "custom_filename": <bool>,  # Did user choose a custom file name?
    "output_filename": <str or None>  # The actual custom file name if chosen
}
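
A concrete example of what the interactive menu might collect (the values are illustrative):

settings = {
    "output_format": "Markdown",
    "keep_images": True,
    "keep_links": True,
    "keep_emphasis": True,
    "render_js": False,
    "generate_toc": True,
    "custom_filename": True,              # user chose a custom file name...
    "output_filename": "landing_page.md"  # ...and this is it
}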

5.2 LLM Function Call Parameters

When calling llm_function_call, you pass in the function_name plus named arguments:

  • function_name
  • url
  • output_format (default "Markdown")
  • keep_images (default True)
  • keep_links (default True)
  • keep_emphasis (default True)
  • generate_toc (default False)
  • custom_filename (default None)
  • max_depth (default 1)
  • render_js (default False)
  • check_interval (default 60)
  • checks (default 2)

6. Running the Script

Command-Line / Interactive Menu

  1. Launch:

    python opencrawl.py
  2. Select your desired action from the main menu:

    • Single Page Conversion
    • Recursive Crawling
    • Map
    • Recursive Crawling & Map
    • Monitor Web Page for Changes
    • Exit
  3. Answer the prompts (URL, advanced settings, depth if applicable).

Importing as a Module

  1. Import the script:
    from opencrawl import (
        do_single_page_conversion,
        do_recursive_crawling,
        do_map_only,
        do_recursive_crawling_and_map,
        monitor_page_for_changes,
        llm_function_call
    )
  2. Call any function with your parameters.

7. Examples

7.1 CLI Usage

Single Page Example

  1. Run:
    python opencrawl.py
  2. Choose 1. Single Page Conversion.
  3. Enter a URL (e.g., https://example.com).
  4. Choose advanced settings (e.g., output format Markdown, keep images = yes, etc.).
  5. The script writes a .md file into the output folder.

7.2 Python Imports

Single Page Conversion programmatically:

from opencrawl import do_single_page_conversion

result_path = do_single_page_conversion(
    url="https://example.com",
    output_format="Markdown",
    keep_images=True,
    keep_links=True,
    keep_emphasis=True,
    generate_toc=True,
    custom_filename=None
)

print(f"File saved to {result_path}")

Recursive Crawling programmatically:

from opencrawl import do_recursive_crawling

do_recursive_crawling(
    url="https://example.com",
    max_depth=2,              # crawl 2 levels deep
    output_format="HTML",     # produce cleaned HTML for each page
    keep_images=False,
    keep_links=True
)

7.3 LLM Function Call Example

Hypothetical GPT-4–style function call in JSON (for a single page conversion):

{
  "name": "llm_function_call",
  "arguments": {
    "function_name": "do_single_page_conversion",
    "url": "https://example.com",
    "output_format": "Markdown",
    "keep_images": true,
    "keep_links": true,
    "keep_emphasis": true,
    "generate_toc": true,
    "custom_filename": null,
    "max_depth": 1,
    "render_js": false
  }
}

Which translates in Python to:

from opencrawl import llm_function_call

llm_function_call(
    function_name="do_single_page_conversion",
    url="https://example.com",
    output_format="Markdown",
    keep_images=True,
    keep_links=True,
    keep_emphasis=True,
    generate_toc=True,
    custom_filename=None,
    max_depth=1,
    render_js=False
)

This performs a single-page fetch & convert to Markdown with a table of contents, saving the file in output/. If a collision is detected, a suffix (_1, _2, etc.) is appended automatically.


Enjoy this script for your multi-format scraping and domain exploration needs! Feel free to extend or adapt it for your specific workflows. If you have any questions, please let us know!
