Conversation

@enitrat (Collaborator) commented Aug 1, 2025

PR Summary: Add Documentation Snapshot Crawler Tool

📜 High-Level Summary

This PR introduces a new documentation snapshot crawler tool that extracts clean documentation content from websites and converts it to Markdown format. The tool prioritizes sitemap discovery, falls back to crawling when needed, filters non-documentation paths, and handles concurrent fetching with retries. It's designed to create portable snapshots of documentation sites for offline reading and ingestion into our Ingester Pipeline. The crawler includes URL filtering to only process pages that are children of the specified base URL, ensuring focused documentation extraction.

⚙️ Detailed Changeset Breakdown


Changeset 1: Add Core Documentation Crawler Implementation

Files Affected:

  • python/scripts/docs_crawler.py

Summary of Changes:

  • Created new DocsCrawler class with async context manager support for HTTP session management
  • Implemented sitemap discovery and parsing with support for nested sitemaps (sitemap index files)
  • Added URL validation logic that filters same-host URLs and ensures they are children of the base URL path (this check, together with the fetch/retry loop, is sketched after this list)
  • Implemented fallback crawling mechanism (BFS) when sitemap is unavailable, limited to 100 pages
  • Added concurrent page fetching with semaphore-based concurrency control (6 concurrent requests)
  • Implemented robust content extraction using multiple strategies: common doc selectors, largest div fallback, and body fallback
  • Added HTML to Markdown conversion with cleanup of boilerplate elements and formatting
  • Included comprehensive URL filtering to exclude admin, API, asset, and other non-documentation paths
  • Added exponential backoff retry mechanism for failed requests (up to 3 retries per URL)
  • Implemented logical URL sorting for crawled pages (prioritizes root pages, then sorts by depth and section)
  • Added command-line interface with argparse supporting base URL argument and help text
  • Included progress indication via tqdm for concurrent fetching operations
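
For context, here is a minimal sketch of the two mechanisms referenced above: the base-URL child check and semaphore-bounded fetching with exponential backoff. This is an illustrative approximation, not the code in docs_crawler.py; helper names like is_child_url and fetch_page are assumptions, while the limits (6 concurrent requests, up to 3 retries) come from this changeset.

```python
import asyncio
from urllib.parse import urlparse

import aiohttp

MAX_CONCURRENT = 6  # concurrency limit described in this changeset
MAX_RETRIES = 3     # per-URL retry cap described in this changeset


def is_child_url(url: str, base_url: str) -> bool:
    """Accept only same-host URLs whose path sits under the base URL's path."""
    url_parts, base_parts = urlparse(url), urlparse(base_url)
    if url_parts.netloc != base_parts.netloc:
        return False
    base_path = base_parts.path.rstrip("/") + "/"
    return url_parts.path == base_parts.path or url_parts.path.startswith(base_path)


async def fetch_page(session: aiohttp.ClientSession,
                     semaphore: asyncio.Semaphore, url: str) -> str | None:
    """Fetch one page, retrying with exponential backoff on failure."""
    async with semaphore:  # at most MAX_CONCURRENT requests in flight
        for attempt in range(MAX_RETRIES):
            try:
                async with session.get(url) as resp:
                    resp.raise_for_status()
                    return await resp.text()
            except aiohttp.ClientError:
                await asyncio.sleep(2 ** attempt)  # 1s, 2s, 4s between tries
    return None


async def fetch_all(urls: list[str]) -> list[str | None]:
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            *(fetch_page(session, semaphore, url) for url in urls)
        )
```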

[TRIAGE]: NEEDS_REVIEW


Changeset 2: Update Project Dependencies and Configuration

Files Affected:

  • python/pyproject.toml
  • python/uv.lock

Summary of Changes:

  • Added five new dependencies for the crawler: aiohttp>=3.9.0, beautifulsoup4>=4.12.0, lxml>=4.9.0, markdownify>=0.11.0, tqdm>=4.66.0
  • Added new CLI script entry point docs-crawler = "scripts.docs_crawler:main" for easy execution
  • Updated lock file with all transitive dependencies including soupsieve (BeautifulSoup dependency) and other related packages
  • Dependencies follow semantic versioning with minimum version constraints to ensure compatibility
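
For reference, a minimal sketch of what these pyproject.toml additions would look like; the version pins and the entry-point string are taken from this changeset, while the surrounding table layout is assumed:

```toml
[project]
# New crawler dependencies (minimum-version constraints)
dependencies = [
    "aiohttp>=3.9.0",
    "beautifulsoup4>=4.12.0",
    "lxml>=4.9.0",
    "markdownify>=0.11.0",
    "tqdm>=4.66.0",
]

[project.scripts]
# CLI entry point for the crawler
docs-crawler = "scripts.docs_crawler:main"
```

With the entry point installed, the tool can be run as `docs-crawler <base-url>` from the project environment.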

[TRIAGE]: NEEDS_REVIEW


Changeset 3: Add Sample Output Documentation

Files Affected:

  • python/doc_dump.md

Summary of Changes:

  • Added large sample output file (19,067 lines) demonstrating the crawler's functionality
  • Contains crawled content from OpenZeppelin Contracts for Cairo documentation
  • Shows the tool's ability to extract clean markdown content with proper source URL attribution
  • Demonstrates the structured output format with document headers, page sections, and separators
  • Includes comprehensive documentation covering access control, accounts, governance, security, token standards, and utilities
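
As an illustration only (the exact layout should be checked against doc_dump.md), each crawled page appears as a titled section with source attribution and a separator; the URL and headings below are hypothetical:

```markdown
# Documentation Snapshot

## Access Control

Source: https://docs.openzeppelin.com/contracts-cairo/access

...page content converted to Markdown...

---
```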

[TRIAGE]: APPROVED

@enitrat force-pushed the feat/web-doc-crawler branch from 00ae403 to 5f63616 on August 1, 2025 at 16:53
- Created DocsCrawler class with async HTTP session management
- Implemented sitemap discovery and parsing with nested sitemap support
- Added URL validation to only crawl children of base URL path
- Included fallback crawling mechanism limited to 100 pages
- Added concurrent fetching with semaphore control and retry logic
- Implemented multi-strategy content extraction and HTML to Markdown conversion
- Added comprehensive URL filtering for non-documentation paths
- Included CLI interface with argparse and progress indication
- Updated dependencies: aiohttp, beautifulsoup4, lxml, markdownify, tqdm
- Added docs-crawler script entry point to pyproject.toml
@enitrat force-pushed the feat/web-doc-crawler branch from 5f63616 to 8ebf2c9 on August 1, 2025 at 17:52
@ijusttookadnatest (Collaborator) left a comment


Issue Identified

AdapterParseError Frequency Increase

Problem: The AdapterParseError exceptions that were already present in the system have become more frequent since this PR, dropping the starklings success rate from 86-96% to 76%.

Error Evidence:

dspy.utils.exceptions.AdapterParseError: Adapter ChatAdapter failed to parse the LM response.

Impact:

  • Pre-existing AdapterParseError issues have increased in frequency
  • Success rate dropped from 86-96% to 76%
  • Multiple HTTP 500 errors from API calls during the starklings evaluation script

Error Stack Trace:

File "/app/python/src/cairo_coder/server/app.py", line 342, in _handle_chat_completion
  return await self._generate_chat_completion(agent, query, messages[:-1], mcp_mode)
File "/app/python/src/cairo_coder/server/app.py", line 444, in _generate_chat_completion
  response: dspy.Prediction = await agent.aforward(
File "/app/python/src/cairo_coder/core/rag_pipeline.py", line 212, in aforward
  processed_query, documents = await self._aprocess_query_and_retrieve_docs(
File "/app/python/src/cairo_coder/core/rag_pipeline.py", line 154, in _aprocess_query_and_retrieve_docs
  processed_query = await self.query_processor.aforward(query=query, chat_history=chat_history_str)
File "/app/python/src/cairo_coder/dspy/query_processor.py", line 154, in aforward
  result = await self.retrieval_program.aforward(query=query, chat_history=chat_history)

This leads to more frequent 500 Internal Server Error responses during the starklings evaluation:

[DEBUG] API call attempt 1/3 for starknet6 (feedback attempt 3)
[DEBUG] ❌ API call failed (attempt 1/3) for starknet6 (feedback attempt 3): HTTP error! status: 500 - {"detail":{"error":{"message":"Internal server error","type":"server_error","code":"internal_error","param":null}}}
[DEBUG] Waiting 3000ms before retry...
[DEBUG] API call attempt 2/3 for starknet6 (feedback attempt 3)
[DEBUG] ❌ API call failed (attempt 2/3) for starknet6 (feedback attempt 3): HTTP error! status: 500 - {"detail":{"error":{"message":"Internal server error","type":"server_error","code":"internal_error","param":null}}}
[DEBUG] Waiting 6000ms before retry...
[DEBUG] API call attempt 3/3 for starknet6 (feedback attempt 3)
[DEBUG] ❌ API call failed (attempt 3/3) for starknet6 (feedback attempt 3): HTTP error! status: 500 - {"detail":{"error":{"message":"Internal server error","type":"server_error","code":"internal_error","param":null}}}
[DEBUG] ❌ starknet6 - API call failed on attempt 3: HTTP error! status: 500 - {"detail":{"error":{"message":"Internal server error","type":"server_error","code":"internal_error","param":null}}}
[DEBUG] 
--- Attempt 4/4 for starknet6 ---
[DEBUG] API call attempt 1/3 for starknet6 (feedback attempt 4)
[DEBUG] ❌ API call failed (attempt 1/3) for starknet6 (feedback attempt 4): HTTP error! status: 500 - {"detail":{"error":{"message":"Internal server error","type":"server_error","code":"internal_error","param":null}}}
[DEBUG] Waiting 3000ms before retry...
[DEBUG] API call attempt 2/3 for starknet6 (feedback attempt 4)
[DEBUG] ❌ API call failed (attempt 2/3) for starknet6 (feedback attempt 4): HTTP error! status: 500 - {"detail":{"error":{"message":"Internal server error","type":"server_error","code":"internal_error","param":null}}}
[DEBUG] Waiting 6000ms before retry...
[DEBUG] API call attempt 3/3 for starknet6 (feedback attempt 4)
[DEBUG] ❌ API call failed (attempt 3/3) for starknet6 (feedback attempt 4): HTTP error! status: 500 - {"detail":{"error":{"message":"Internal server error","type":"server_error","code":"internal_error","param":null}}}
[DEBUG] ❌ starknet6 - API call failed on attempt 4: HTTP error! status: 500 - {"detail":{"error":{"message":"Internal server error","type":"server_error","code":"internal_error","param":null}}}
[DEBUG] Restored original file and cleaned up backup
[DEBUG] [STARKNET] starknet6: ❌ (4 attempts)
[DEBUG] [STARKNET] Completed: 1/7 (14.3%) - Avg attempts: 3.9 - Feedback successes: 1
[RUN 1] 43/55 exercises passed (78.2%) - Avg attempts: 1.8 - Feedback successes: 4 (7.3%)

Status

The PR implementation itself appears solid, but the impact on existing error patterns requires attention.

@enitrat (Collaborator, Author) commented Aug 5, 2025

The AdapterParseError exceptions that were already present in the system have become more frequent after this PR, causing a performance regression from 86-96% success rate to 76%.

Not related to this PR, hopefully it's just a temporary problem from Gemini...

@enitrat merged commit e805c5e into main on Aug 5, 2025 with 3 checks passed.