Conversation

@enitrat (Collaborator) commented Aug 1, 2025

PR Summary: Add Documentation Snapshot Crawler Tool

📜 High-Level Summary

This PR introduces a new documentation snapshot crawler tool that extracts clean documentation content from websites and converts it to Markdown format. The tool prioritizes sitemap discovery, falls back to crawling when needed, filters non-documentation paths, and handles concurrent fetching with retries. It's designed to create portable snapshots of documentation sites for offline reading and ingestion into our Ingester Pipeline. The crawler includes URL filtering to only process pages that are children of the specified base URL, ensuring focused documentation extraction.

⚙️ Detailed Changeset Breakdown


Changeset 1: Add Core Documentation Crawler Implementation

Files Affected:

  • python/scripts/docs_crawler.py

Summary of Changes:

  • Created new DocsCrawler class with async context manager support for HTTP session management
  • Implemented sitemap discovery and parsing with support for nested sitemaps (sitemap index files)
  • Added URL validation logic that filters same-host URLs and ensures they are children of the base URL path (this check, together with the fetch/retry loop, is sketched after this list)
  • Implemented fallback crawling mechanism (BFS) when sitemap is unavailable, limited to 100 pages
  • Added concurrent page fetching with semaphore-based concurrency control (6 concurrent requests)
  • Implemented robust content extraction using multiple strategies: common doc selectors, largest div fallback, and body fallback
  • Added HTML to Markdown conversion with cleanup of boilerplate elements and formatting
  • Included comprehensive URL filtering to exclude admin, API, asset, and other non-documentation paths
  • Added exponential backoff retry mechanism for failed requests (up to 3 retries per URL)
  • Implemented logical URL sorting for crawled pages (prioritizes root pages, then sorts by depth and section)
  • Added command-line interface with argparse supporting base URL argument and help text
  • Included progress indication via tqdm for concurrent fetching operations
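
For context, here is a minimal sketch of the two mechanisms referenced above: the base-URL child check and semaphore-bounded fetching with exponential backoff. This is an illustrative approximation, not the code in docs_crawler.py; helper names like is_child_url and fetch_page are assumptions, while the limits (6 concurrent requests, up to 3 retries) come from this changeset.

```python
import asyncio
from urllib.parse import urlparse

import aiohttp

MAX_CONCURRENT = 6  # concurrency limit described in this changeset
MAX_RETRIES = 3     # per-URL retry cap described in this changeset


def is_child_url(url: str, base_url: str) -> bool:
    """Accept only same-host URLs whose path sits under the base URL's path."""
    url_parts, base_parts = urlparse(url), urlparse(base_url)
    if url_parts.netloc != base_parts.netloc:
        return False
    base_path = base_parts.path.rstrip("/") + "/"
    return url_parts.path == base_parts.path or url_parts.path.startswith(base_path)


async def fetch_page(session: aiohttp.ClientSession,
                     semaphore: asyncio.Semaphore, url: str) -> str | None:
    """Fetch one page, retrying with exponential backoff on failure."""
    async with semaphore:  # at most MAX_CONCURRENT requests in flight
        for attempt in range(MAX_RETRIES):
            try:
                async with session.get(url) as resp:
                    resp.raise_for_status()
                    return await resp.text()
            except aiohttp.ClientError:
                await asyncio.sleep(2 ** attempt)  # 1s, 2s, 4s between tries
    return None


async def fetch_all(urls: list[str]) -> list[str | None]:
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            *(fetch_page(session, semaphore, url) for url in urls)
        )
```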

[TRIAGE]: NEEDS_REVIEW


Changeset 2: Update Project Dependencies and Configuration

Files Affected:

  • python/pyproject.toml
  • python/uv.lock

Summary of Changes:

  • Added five new dependencies for the crawler: aiohttp>=3.9.0, beautifulsoup4>=4.12.0, lxml>=4.9.0, markdownify>=0.11.0, tqdm>=4.66.0
  • Added new CLI script entry point docs-crawler = "scripts.docs_crawler:main" for easy execution
  • Updated lock file with all transitive dependencies including soupsieve (BeautifulSoup dependency) and other related packages
  • Dependencies follow semantic versioning with minimum version constraints to ensure compatibility
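
For reference, a minimal sketch of what these pyproject.toml additions would look like; the version pins and the entry-point string are taken from this changeset, while the surrounding table layout is assumed:

```toml
[project]
# New crawler dependencies (minimum-version constraints)
dependencies = [
    "aiohttp>=3.9.0",
    "beautifulsoup4>=4.12.0",
    "lxml>=4.9.0",
    "markdownify>=0.11.0",
    "tqdm>=4.66.0",
]

[project.scripts]
# CLI entry point for the crawler
docs-crawler = "scripts.docs_crawler:main"
```

With the entry point installed, the tool can be run as `docs-crawler <base-url>` from the project environment.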

[TRIAGE]: NEEDS_REVIEW


Changeset 3: Add Sample Output Documentation

Files Affected:

  • python/doc_dump.md

Summary of Changes:

  • Added large sample output file (19,067 lines) demonstrating the crawler's functionality
  • Contains crawled content from OpenZeppelin Contracts for Cairo documentation
  • Shows the tool's ability to extract clean markdown content with proper source URL attribution
  • Demonstrates the structured output format with document headers, page sections, and separators
  • Includes comprehensive documentation covering access control, accounts, governance, security, token standards, and utilities
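
As an illustration only (the exact layout should be checked against doc_dump.md), each crawled page appears as a titled section with source attribution and a separator; the URL and headings below are hypothetical:

```markdown
# Documentation Snapshot

## Access Control

Source: https://docs.openzeppelin.com/contracts-cairo/access

...page content converted to Markdown...

---
```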

[TRIAGE]: APPROVED

@enitrat force-pushed the feat/web-doc-crawler branch from 00ae403 to 5f63616 on August 1, 2025 at 16:53
- Created DocsCrawler class with async HTTP session management
- Implemented sitemap discovery and parsing with nested sitemap support
- Added URL validation to only crawl children of base URL path
- Included fallback crawling mechanism limited to 100 pages
- Added concurrent fetching with semaphore control and retry logic
- Implemented multi-strategy content extraction and HTML to Markdown conversion
- Added comprehensive URL filtering for non-documentation paths
- Included CLI interface with argparse and progress indication
- Updated dependencies: aiohttp, beautifulsoup4, lxml, markdownify, tqdm
- Added docs-crawler script entry point to pyproject.toml
@enitrat force-pushed the feat/web-doc-crawler branch from 5f63616 to 8ebf2c9 on August 1, 2025 at 17:52
@ijusttookadnatest (Collaborator) left a comment


Issue Identified

AdapterParseError Frequency Increase

Problem: The AdapterParseError exceptions that were already present in the system have become more frequent since this PR, dropping the starklings success rate from 86-96% to 76%.

Error Evidence:

dspy.utils.exceptions.AdapterParseError: Adapter ChatAdapter failed to parse the LM response.

Impact:

  • Pre-existing AdapterParseError issues have increased in frequency
  • Success rate dropped from 86-96% to 76%
  • Multiple HTTP 500 errors from API calls during the starklings evaluation script

Error Stack Trace:

File "/app/python/src/cairo_coder/server/app.py", line 342, in _handle_chat_completion
  return await self._generate_chat_completion(agent, query, messages[:-1], mcp_mode)
File "/app/python/src/cairo_coder/server/app.py", line 444, in _generate_chat_completion
  response: dspy.Prediction = await agent.aforward(
File "/app/python/src/cairo_coder/core/rag_pipeline.py", line 212, in aforward
  processed_query, documents = await self._aprocess_query_and_retrieve_docs(
File "/app/python/src/cairo_coder/core/rag_pipeline.py", line 154, in _aprocess_query_and_retrieve_docs
  processed_query = await self.query_processor.aforward(query=query, chat_history=chat_history_str)
File "/app/python/src/cairo_coder/dspy/query_processor.py", line 154, in aforward
  result = await self.retrieval_program.aforward(query=query, chat_history=chat_history)

This leads to more frequent 500 Internal Server Error responses during the starklings evaluation:

[DEBUG] API call attempt 1/3 for starknet6 (feedback attempt 3)
[DEBUG] ❌ API call failed (attempt 1/3) for starknet6 (feedback attempt 3): HTTP error! status: 500 - {"detail":{"error":{"message":"Internal server error","type":"server_error","code":"internal_error","param":null}}}
[DEBUG] Waiting 3000ms before retry...
[DEBUG] API call attempt 2/3 for starknet6 (feedback attempt 3)
[DEBUG] ❌ API call failed (attempt 2/3) for starknet6 (feedback attempt 3): HTTP error! status: 500 - {"detail":{"error":{"message":"Internal server error","type":"server_error","code":"internal_error","param":null}}}
[DEBUG] Waiting 6000ms before retry...
[DEBUG] API call attempt 3/3 for starknet6 (feedback attempt 3)
[DEBUG] ❌ API call failed (attempt 3/3) for starknet6 (feedback attempt 3): HTTP error! status: 500 - {"detail":{"error":{"message":"Internal server error","type":"server_error","code":"internal_error","param":null}}}
[DEBUG] ❌ starknet6 - API call failed on attempt 3: HTTP error! status: 500 - {"detail":{"error":{"message":"Internal server error","type":"server_error","code":"internal_error","param":null}}}
[DEBUG] 
--- Attempt 4/4 for starknet6 ---
[DEBUG] API call attempt 1/3 for starknet6 (feedback attempt 4)
[DEBUG] ❌ API call failed (attempt 1/3) for starknet6 (feedback attempt 4): HTTP error! status: 500 - {"detail":{"error":{"message":"Internal server error","type":"server_error","code":"internal_error","param":null}}}
[DEBUG] Waiting 3000ms before retry...
[DEBUG] API call attempt 2/3 for starknet6 (feedback attempt 4)
[DEBUG] ❌ API call failed (attempt 2/3) for starknet6 (feedback attempt 4): HTTP error! status: 500 - {"detail":{"error":{"message":"Internal server error","type":"server_error","code":"internal_error","param":null}}}
[DEBUG] Waiting 6000ms before retry...
[DEBUG] API call attempt 3/3 for starknet6 (feedback attempt 4)
[DEBUG] ❌ API call failed (attempt 3/3) for starknet6 (feedback attempt 4): HTTP error! status: 500 - {"detail":{"error":{"message":"Internal server error","type":"server_error","code":"internal_error","param":null}}}
[DEBUG] ❌ starknet6 - API call failed on attempt 4: HTTP error! status: 500 - {"detail":{"error":{"message":"Internal server error","type":"server_error","code":"internal_error","param":null}}}
[DEBUG] Restored original file and cleaned up backup
[DEBUG] [STARKNET] starknet6: ❌ (4 attempts)
[DEBUG] [STARKNET] Completed: 1/7 (14.3%) - Avg attempts: 3.9 - Feedback successes: 1
[RUN 1] 43/55 exercises passed (78.2%) - Avg attempts: 1.8 - Feedback successes: 4 (7.3%)

Status

The PR implementation itself appears solid, but the impact on existing error patterns requires attention.

@enitrat (Collaborator, Author) commented Aug 5, 2025

The AdapterParseError exceptions that were already present in the system have become more frequent after this PR, causing a performance regression from 86-96% success rate to 76%.

Not related to this PR, hopefully it's just a temporary problem from Gemini...

@enitrat merged commit e805c5e into main on Aug 5, 2025 with 3 checks passed.