feat: add documentation snapshot crawler tool #41
Conversation
Force-pushed from 00ae403 to 5f63616
- Created DocsCrawler class with async HTTP session management
- Implemented sitemap discovery and parsing with nested sitemap support
- Added URL validation to only crawl children of the base URL path
- Included fallback crawling mechanism limited to 100 pages
- Added concurrent fetching with semaphore control and retry logic (see the sketch below)
- Implemented multi-strategy content extraction and HTML to Markdown conversion
- Added comprehensive URL filtering for non-documentation paths
- Included CLI interface with argparse and progress indication
- Updated dependencies: aiohttp, beautifulsoup4, lxml, markdownify, tqdm
- Added docs-crawler script entry point to pyproject.toml
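For reference, a minimal sketch of the semaphore-controlled concurrent fetching with retry logic described above; the identifiers, limits, and backoff values here are illustrative assumptions, not the actual code in docs_crawler.py:

```python
import asyncio
import aiohttp

# Illustrative constants; the real crawler's limits may differ.
MAX_CONCURRENT = 10
RETRIES = 3

async def fetch_page(session: aiohttp.ClientSession,
                     semaphore: asyncio.Semaphore,
                     url: str) -> str | None:
    """Fetch one URL, bounded by the semaphore, retrying on failure."""
    async with semaphore:
        for attempt in range(1, RETRIES + 1):
            try:
                timeout = aiohttp.ClientTimeout(total=30)
                async with session.get(url, timeout=timeout) as resp:
                    resp.raise_for_status()
                    return await resp.text()
            except (aiohttp.ClientError, asyncio.TimeoutError):
                if attempt == RETRIES:
                    return None  # give up after the last retry
                await asyncio.sleep(2 ** attempt)  # exponential backoff

async def fetch_all(urls: list[str]) -> list[str | None]:
    """Fetch many URLs concurrently, at most MAX_CONCURRENT in flight."""
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            *(fetch_page(session, semaphore, u) for u in urls))
```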
Force-pushed from 5f63616 to 8ebf2c9
Issue Identified
AdapterParseError Frequency Increase
Problem: AdapterParseError exceptions that were already present in the system have become more frequent after this PR, causing a regression in success rate from 86-96% to 76%.
Error Evidence:
```
dspy.utils.exceptions.AdapterParseError: Adapter ChatAdapter failed to parse the LM response.
```
Impact:
- Pre-existing AdapterParseError issues have increased in frequency
- Success rate dropped from 86-96% to 76%
- Multiple HTTP 500 errors during API calls while running the starklings script
Error Stack Trace:
File "/app/python/src/cairo_coder/server/app.py", line 342, in _handle_chat_completion
return await self._generate_chat_completion(agent, query, messages[:-1], mcp_mode)
File "/app/python/src/cairo_coder/server/app.py", line 444, in _generate_chat_completion
response: dspy.Prediction = await agent.aforward(
File "/app/python/src/cairo_coder/core/rag_pipeline.py", line 212, in aforward
processed_query, documents = await self._aprocess_query_and_retrieve_docs(
File "/app/python/src/cairo_coder/core/rag_pipeline.py", line 154, in _aprocess_query_and_retrieve_docs
processed_query = await self.query_processor.aforward(query=query, chat_history=chat_history_str)
File "/app/python/src/cairo_coder/dspy/query_processor.py", line 154, in aforward
result = await self.retrieval_program.aforward(query=query, chat_history=chat_history)
This leads to more frequent 500 Internal Server Error responses during the starklings evaluation:
```
[DEBUG] API call attempt 1/3 for starknet6 (feedback attempt 3)
[DEBUG] ❌ API call failed (attempt 1/3) for starknet6 (feedback attempt 3): HTTP error! status: 500 - {"detail":{"error":{"message":"Internal server error","type":"server_error","code":"internal_error","param":null}}}
[DEBUG] Waiting 3000ms before retry...
[DEBUG] API call attempt 2/3 for starknet6 (feedback attempt 3)
[DEBUG] ❌ API call failed (attempt 2/3) for starknet6 (feedback attempt 3): HTTP error! status: 500 - {"detail":{"error":{"message":"Internal server error","type":"server_error","code":"internal_error","param":null}}}
[DEBUG] Waiting 6000ms before retry...
[DEBUG] API call attempt 3/3 for starknet6 (feedback attempt 3)
[DEBUG] ❌ API call failed (attempt 3/3) for starknet6 (feedback attempt 3): HTTP error! status: 500 - {"detail":{"error":{"message":"Internal server error","type":"server_error","code":"internal_error","param":null}}}
[DEBUG] ❌ starknet6 - API call failed on attempt 3: HTTP error! status: 500 - {"detail":{"error":{"message":"Internal server error","type":"server_error","code":"internal_error","param":null}}}
[DEBUG]
--- Attempt 4/4 for starknet6 ---
[DEBUG] API call attempt 1/3 for starknet6 (feedback attempt 4)
[DEBUG] ❌ API call failed (attempt 1/3) for starknet6 (feedback attempt 4): HTTP error! status: 500 - {"detail":{"error":{"message":"Internal server error","type":"server_error","code":"internal_error","param":null}}}
[DEBUG] Waiting 3000ms before retry...
[DEBUG] API call attempt 2/3 for starknet6 (feedback attempt 4)
[DEBUG] ❌ API call failed (attempt 2/3) for starknet6 (feedback attempt 4): HTTP error! status: 500 - {"detail":{"error":{"message":"Internal server error","type":"server_error","code":"internal_error","param":null}}}
[DEBUG] Waiting 6000ms before retry...
[DEBUG] API call attempt 3/3 for starknet6 (feedback attempt 4)
[DEBUG] ❌ API call failed (attempt 3/3) for starknet6 (feedback attempt 4): HTTP error! status: 500 - {"detail":{"error":{"message":"Internal server error","type":"server_error","code":"internal_error","param":null}}}
[DEBUG] ❌ starknet6 - API call failed on attempt 4: HTTP error! status: 500 - {"detail":{"error":{"message":"Internal server error","type":"server_error","code":"internal_error","param":null}}}
[DEBUG] Restored original file and cleaned up backup
[DEBUG] [STARKNET] starknet6: ❌ (4 attempts)
[DEBUG] [STARKNET] Completed: 1/7 (14.3%) - Avg attempts: 3.9 - Feedback successes: 1
[RUN 1] 43/55 exercises passed (78.2%) - Avg attempts: 1.8 - Feedback successes: 4 (7.3%)
```
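One possible mitigation, sketched below (this is not the actual app.py code; FastAPI is assumed from the {"detail": ...} error shape in the logs, and the wrapper name is hypothetical): catch AdapterParseError explicitly around the agent call so parse failures surface as a distinct error that clients can retry, instead of a generic 500.

```python
from dspy.utils.exceptions import AdapterParseError
from fastapi import HTTPException

async def generate_completion_safely(agent, query: str, chat_history: str):
    """Hypothetical wrapper around the agent call from the trace above."""
    try:
        return await agent.aforward(query=query, chat_history=chat_history)
    except AdapterParseError as exc:
        # 502 distinguishes "the LM returned an unparseable response"
        # from true internal errors, so callers can retry selectively.
        raise HTTPException(
            status_code=502,
            detail={"error": {"message": "LM response could not be parsed",
                              "type": "adapter_parse_error"}},
        ) from exc
```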
Status
The PR implementation itself appears solid, but the impact on existing error patterns requires attention.
Not related to this PR; hopefully it's just a temporary problem from Gemini...
PR Summary: Add Documentation Snapshot Crawler Tool
📜 High-Level Summary
This PR introduces a new documentation snapshot crawler tool that extracts clean documentation content from websites and converts it to Markdown format. The tool prioritizes sitemap discovery, falls back to crawling when needed, filters non-documentation paths, and handles concurrent fetching with retries. It's designed to create portable snapshots of documentation sites for offline reading and ingestion into our Ingester Pipeline. The crawler includes URL filtering to only process pages that are children of the specified base URL, ensuring focused documentation extraction.
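To illustrate the "children of the base URL" rule mentioned above, here is a hedged sketch (the helper name and normalization details are assumptions, not the crawler's actual code):

```python
from urllib.parse import urlparse

def is_child_of_base(url: str, base_url: str) -> bool:
    """True only if `url` is on the same host and under the base path."""
    u, base = urlparse(url), urlparse(base_url)
    if u.netloc != base.netloc:
        return False
    base_path = base.path.rstrip("/")
    return u.path == base_path or u.path.startswith(base_path + "/")

# Example, with base https://docs.example.com/guide/:
#   https://docs.example.com/guide/intro -> True
#   https://docs.example.com/blog/post   -> False
```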
⚙️ Detailed Changeset Breakdown
Changeset 1: Add Core Documentation Crawler Implementation
Files Affected:
python/scripts/docs_crawler.py
Summary of Changes:
- DocsCrawler class with async context manager support for HTTP session management (sketched below)

[TRIAGE]: NEEDS_REVIEW
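The async-context-manager pattern referenced above might look roughly like this (method bodies and attribute names are illustrative, not the real DocsCrawler internals):

```python
import aiohttp

class DocsCrawler:
    """Sketch: owns one aiohttp session for the lifetime of a crawl."""

    def __init__(self, base_url: str):
        self.base_url = base_url
        self.session: aiohttp.ClientSession | None = None

    async def __aenter__(self) -> "DocsCrawler":
        self.session = aiohttp.ClientSession()
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        if self.session is not None:
            await self.session.close()

# Usage:
#   async with DocsCrawler("https://docs.example.com/guide/") as crawler:
#       ...  # fetch pages through crawler.session
```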
Changeset 2: Update Project Dependencies and Configuration
Files Affected:
python/pyproject.toml
python/uv.lock
Summary of Changes:
- Added dependencies: aiohttp>=3.9.0, beautifulsoup4>=4.12.0, lxml>=4.9.0, markdownify>=0.11.0, tqdm>=4.66.0
- Added script entry point docs-crawler = "scripts.docs_crawler:main" for easy execution (a minimal main() sketch follows below)
- Updated uv.lock to include soupsieve (a BeautifulSoup dependency) and other related packages

[TRIAGE]: NEEDS_REVIEW
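Given the entry point docs-crawler = "scripts.docs_crawler:main", the CLI presumably has roughly the following shape (the flag names and the crawl() helper are hypothetical, not the actual argparse interface):

```python
import argparse
import asyncio

async def crawl(base_url: str, output: str) -> None:
    """Hypothetical placeholder for the crawl-and-convert pipeline."""
    ...

def main() -> None:
    parser = argparse.ArgumentParser(
        description="Snapshot a documentation site as Markdown.")
    parser.add_argument("base_url",
                        help="Base URL; only child pages are crawled")
    parser.add_argument("-o", "--output", default="doc_dump.md",
                        help="Output Markdown file")
    args = parser.parse_args()
    asyncio.run(crawl(args.base_url, args.output))

if __name__ == "__main__":
    main()
```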
Changeset 3: Add Sample Output Documentation
Files Affected:
python/doc_dump.md
Summary of Changes:
- Added doc_dump.md, a sample Markdown snapshot produced by the crawler

[TRIAGE]: APPROVED