OpenCrawl 🦊

A powerful, LLM-friendly HTML scraper and converter for seamless web content extraction.

OpenCrawl fetches HTML content from any URL and transforms single or multiple pages into clean HTML, Markdown, or JSON. It can also generate detailed site maps, exploring a domain to unlimited depth. The script offers three interfaces:

  1. Command-Line Interactive - User-friendly menu-driven interface for quick operations
  2. Direct Function Calls - Seamless integration with your Python codebase
  3. LLM Function Calls - Structured API designed for large language model interactions

Table of Contents

  1. Dependencies
  2. Installation
  3. Script Overview
  4. Key Functions
    1. Single Page Conversion
    2. Recursive Crawling
    3. Map (Site Map Only)
    4. Recursive Crawling & Map
    5. Monitor Web Page for Changes
    6. Universal LLM Function Call
  5. Dictionaries & Parameters
    1. The Settings Dictionary (CLI)
    2. LLM Function Call Parameters
  6. Running the Script
  7. Examples
    1. CLI Usage
    2. Python Imports
    3. LLM Function Call Example

1. Dependencies

The script uses the following external libraries:

  • questionary (for CLI prompts)
  • requests
  • html2text
  • rich
  • beautifulsoup4
  • playwright (for optional JavaScript rendering)

If you do not have them installed, run:

pip install questionary requests html2text rich beautifulsoup4 playwright

After installing, run playwright install to download the necessary browser binaries.


2. Installation

  1. Save the script to a file, for example:
    opencrawl.py
  2. Make it executable (if you want to run it directly, on UNIX-like systems):
    chmod +x opencrawl.py
  3. Run it with Python:
    python opencrawl.py

3. Script Overview

This script fetches HTML from user-supplied URLs, extracts the main content (excluding <header>, <nav>, <footer>, and elements whose ID or class includes "footer"), removes doc-specific symbol elements, and converts the results into:

  • Clean HTML,
  • Markdown (with optional Table of Contents),
  • JSON (containing the extracted text in a simple dictionary).

All requests send a custom User-Agent header of OpenCrawl/1.0 when fetching pages.
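
The sketch below shows roughly what this fetch-and-clean step might look like. It is illustrative only: the helper name fetch_and_clean and the exact selector logic are assumptions, not the script's actual implementation.

import requests
from bs4 import BeautifulSoup

def fetch_and_clean(url):
    # Fetch with the custom User-Agent described above.
    response = requests.get(url, headers={"User-Agent": "OpenCrawl/1.0"}, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Drop structural chrome entirely.
    for tag in soup.find_all(["header", "nav", "footer"]):
        tag.decompose()
    # Drop any element whose id or class mentions "footer".
    for tag in soup.find_all(lambda t: "footer" in (t.get("id") or "").lower()
                             or any("footer" in c.lower() for c in (t.get("class") or []))):
        tag.decompose()
    return soup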

It also supports:

  • Recursive Crawling of same-domain pages (via BFS with a user-defined depth; see the sketch after this list).
  • Site Map generation for all same-domain links (unlimited depth). The site map can be in HTML, Markdown, or JSON adjacency form.
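
A hedged sketch of the same-domain BFS mentioned above. Helper and variable names are assumptions for illustration, not the script's actual code.

from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl_same_domain(start_url, max_depth=1):
    domain = urlparse(start_url).netloc
    visited = {start_url}
    queue = deque([(start_url, 0)])   # (url, depth) pairs
    pages = []
    while queue:
        url, depth = queue.popleft()
        response = requests.get(url, headers={"User-Agent": "OpenCrawl/1.0"}, timeout=30)
        pages.append(url)
        if depth >= max_depth:
            continue
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"]).split("#")[0]
            # Only enqueue links on the same domain, once each.
            if urlparse(link).netloc == domain and link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return pages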

File collisions are handled by automatically appending "_1", "_2", etc. whenever a filename already exists.
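
The collision rule could be implemented along these lines (the helper name unique_path is an assumption):

import os

def unique_path(path):
    # Return the path unchanged if it is free; otherwise append _1, _2, ...
    if not os.path.exists(path):
        return path
    base, ext = os.path.splitext(path)
    counter = 1
    while os.path.exists(f"{base}_{counter}{ext}"):
        counter += 1
    return f"{base}_{counter}{ext}"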


4. Key Functions

4.1 Single Page Conversion

def do_single_page_conversion(
    url,
    output_format="Markdown",
    keep_images=True,
    keep_links=True,
    keep_emphasis=True,
    generate_toc=False,
    custom_filename=None
) -> Optional[str]
  • Description: Fetches one page from url, cleans it, and converts to the chosen output_format (HTML, Markdown, or JSON).
  • Output: The script writes the file into the output directory (no subdirectory for single page).
  • Returns: The absolute path to the newly created file, or None if an error occurred (e.g. invalid URL).

4.2 Recursive Crawling

def do_recursive_crawling(
    url,
    max_depth=1,
    output_format="Markdown",
    keep_images=True,
    keep_links=True,
    keep_emphasis=True,
    generate_toc=False,
    custom_filename=None
) -> None
  • Description: Recursively crawls all same-domain links up to max_depth, converting each page to the chosen output_format.
  • Output: Files are saved in the recursive_crawl subdirectory.
  • Returns: None.

4.3 Map (Site Map Only)

def do_map_only(
    url,
    output_format="Markdown",
    keep_images=True,
    keep_links=True,
    keep_emphasis=True,
    generate_toc=False,
    custom_filename=None
) -> None
  • Description: Creates a site map only (no page conversion) by traversing all same-domain links to unlimited depth.
  • Output (written to the site_map subdirectory):
    • If HTML, saves site_map.html (with nested lists).
    • If Markdown, saves site_map.md (nested bullet points).
    • If JSON, saves site_map.json (dictionary adjacency).
  • Returns: None.
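
For example, to write a JSON adjacency map of an entire domain:

from opencrawl import do_map_only

do_map_only(
    url="https://example.com",
    output_format="JSON"   # writes site_map.json into the site_map subdirectory
)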

4.4 Recursive Crawling & Map

def do_recursive_crawling_and_map(
    url,
    max_depth=1,
    output_format="Markdown",
    keep_images=True,
    keep_links=True,
    keep_emphasis=True,
    generate_toc=False,
    custom_filename=None
) -> None
  • Description: Runs Recursive Crawling (with user-defined depth), then generates a full site map (unlimited depth).
  • Output:
    • Pages go into crawl_and_map subdirectory.
    • The site map (html/md/json) is also placed there.
  • Returns: None.
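
For example, to crawl two levels deep and then map the whole domain:

from opencrawl import do_recursive_crawling_and_map

do_recursive_crawling_and_map(
    url="https://example.com",
    max_depth=2,                 # pages crawled to depth 2
    output_format="Markdown"     # pages and the site map land in crawl_and_map/
)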

4.5 Monitor Web Page for Changes

def monitor_page_for_changes(
    url,
    check_interval=60,
    checks=2,
    render_js=False,
    keep_links=True,
    keep_images=True,
    keep_emphasis=True
) -> None
  • Description: Periodically checks a page and prints a diff when changes are detected.
  • Output: Changes are displayed in the terminal.
  • Returns: None.
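
For example, to poll a page every five minutes, twelve times:

from opencrawl import monitor_page_for_changes

monitor_page_for_changes(
    url="https://example.com",
    check_interval=300,   # seconds between checks
    checks=12,            # stop after 12 checks (about an hour)
    render_js=False
)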

4.6 Universal LLM Function Call

def llm_function_call(
    function_name: str,
    url: str,
    output_format: str = "Markdown",
    keep_images: bool = True,
    keep_links: bool = True,
    keep_emphasis: bool = True,
    generate_toc: bool = False,
    custom_filename: str = None,
    max_depth: int = 1,
    render_js: bool = False,
    check_interval: int = 60,
    checks: int = 2
) -> Any
  • Description: A universal dispatcher so an LLM can call the script’s features by name.
  • Parameters:
    • function_name: A string in { "do_single_page_conversion", "do_recursive_crawling", "do_map_only", "do_recursive_crawling_and_map", "monitor_page_for_changes" }
    • url: The base URL to process.
    • output_format: "HTML", "Markdown", or "JSON".
    • keep_images, keep_links, keep_emphasis: Booleans controlling what is kept in the output.
    • generate_toc: Whether to generate a table of contents if output_format="Markdown".
    • custom_filename: Optional string for the file name.
    • max_depth: Only relevant for crawling functions (otherwise ignored).
    • render_js: Use Playwright to render JavaScript before scraping.
    • check_interval: Seconds between checks when monitoring a page.
    • checks: Number of checks when monitoring a page.
  • Returns: The result of the underlying function, typically a path or None.
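
Internally, a dispatcher like this can be as simple as a name-to-function table. The sketch below is one plausible shape, not the script's actual code; in particular, the signature-based filtering is an assumption about how parameters that do not apply get ignored.

import inspect
from opencrawl import (
    do_single_page_conversion,
    do_recursive_crawling,
    do_map_only,
    do_recursive_crawling_and_map,
    monitor_page_for_changes,
)

def llm_function_call_sketch(function_name, url, **kwargs):
    dispatch = {
        "do_single_page_conversion": do_single_page_conversion,
        "do_recursive_crawling": do_recursive_crawling,
        "do_map_only": do_map_only,
        "do_recursive_crawling_and_map": do_recursive_crawling_and_map,
        "monitor_page_for_changes": monitor_page_for_changes,
    }
    func = dispatch[function_name]
    # Forward only the arguments the target function actually accepts,
    # so parameters that do not apply (e.g. max_depth) are ignored.
    accepted = inspect.signature(func).parameters
    return func(url=url, **{k: v for k, v in kwargs.items() if k in accepted})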

5. Dictionaries & Parameters

5.1 The Settings Dictionary (CLI)

When the script is run interactively, it uses:

settings = {
    "output_format": <"HTML"|"Markdown"|"JSON">,
    "keep_images": <bool>,
    "keep_links": <bool>,
    "keep_emphasis": <bool>,
    "render_js": <bool>,
    "generate_toc": <bool>,
    "custom_filename": <bool>,  # Did user choose a custom file name?
    "output_filename": <str or None>  # The actual custom file name if chosen
}
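
A concrete example of what the interactive menu might collect (the values are illustrative):

settings = {
    "output_format": "Markdown",
    "keep_images": True,
    "keep_links": True,
    "keep_emphasis": True,
    "render_js": False,
    "generate_toc": True,
    "custom_filename": True,              # user chose a custom file name...
    "output_filename": "landing_page.md"  # ...and this is it
}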

5.2 LLM Function Call Parameters

When calling llm_function_call, you pass in the function_name plus named arguments:

  • function_name
  • url
  • output_format (default "Markdown")
  • keep_images (default True)
  • keep_links (default True)
  • keep_emphasis (default True)
  • generate_toc (default False)
  • custom_filename (default None)
  • max_depth (default 1)
  • render_js (default False)
  • check_interval (default 60)
  • checks (default 2)

6. Running the Script

Command-Line / Interactive Menu

  1. Launch:

    python opencrawl.py
  2. Select your desired action from the main menu:

    • Single Page Conversion
    • Recursive Crawling
    • Map
    • Recursive Crawling & Map
    • Monitor Web Page for Changes
    • Exit
  3. Answer the prompts (URL, advanced settings, depth if applicable).

Importing as a Module

  1. Import the script:
    from opencrawl import (
        do_single_page_conversion,
        do_recursive_crawling,
        do_map_only,
        do_recursive_crawling_and_map,
        monitor_page_for_changes,
        llm_function_call
    )
  2. Call any function with your parameters.

7. Examples

7.1 CLI Usage

Single Page Example

  1. Run:
    python opencrawl.py
  2. Choose 1. Single Page Conversion.
  3. Enter a URL (e.g., https://example.com).
  4. Choose advanced settings (e.g., output format Markdown, keep images = yes, etc.).
  5. The script writes a .md file into the output folder.

7.2 Python Imports

Single Page Conversion programmatically:

from opencrawl import do_single_page_conversion

result_path = do_single_page_conversion(
    url="https://example.com",
    output_format="Markdown",
    keep_images=True,
    keep_links=True,
    keep_emphasis=True,
    generate_toc=True,
    custom_filename=None
)

print(f"File saved to {result_path}")

Recursive Crawling programmatically:

from opencrawl import do_recursive_crawling

do_recursive_crawling(
    url="https://example.com",
    max_depth=2,              # crawl 2 levels deep
    output_format="HTML",     # produce cleaned HTML for each page
    keep_images=False,
    keep_links=True
)

7.3 LLM Function Call Example

Hypothetical GPT-4–style function call in JSON (for a single page conversion):

{
  "name": "llm_function_call",
  "arguments": {
    "function_name": "do_single_page_conversion",
    "url": "https://example.com",
    "output_format": "Markdown",
    "keep_images": true,
    "keep_links": true,
    "keep_emphasis": true,
    "generate_toc": true,
    "custom_filename": null,
    "max_depth": 1,
    "render_js": false
  }
}

Which translates in Python to:

from opencrawl import llm_function_call

llm_function_call(
    function_name="do_single_page_conversion",
    url="https://example.com",
    output_format="Markdown",
    keep_images=True,
    keep_links=True,
    keep_emphasis=True,
    generate_toc=True,
    custom_filename=None,
    max_depth=1,
    render_js=False
)

This performs a single-page fetch & convert to Markdown with a table of contents, saving the file in output/. If a collision is detected, a suffix (_1, _2, etc.) is appended automatically.


Enjoy this script for your multi-format scraping and domain exploration needs! Feel free to extend or adapt it for your specific workflows. If you have any questions, please let us know!
