@enitrat enitrat commented Aug 2, 2025

PR Summary: feat: add documentation crawler for OpenZeppelin docs

📜 High-Level Summary

This PR replaces the AsciiDoc-based OpenZeppelin documentation ingestion with a simpler approach that reads pre-crawled markdown files. A crawler fetches documentation pages from websites via their sitemaps and converts the HTML content to clean markdown, which makes it easier to keep the documentation sources for the Cairo Coder RAG pipeline up to date.
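
For reviewers unfamiliar with sitemap-driven crawling, here is a minimal sketch of the URL discovery step only. It is not the crawler's actual code: `extractSitemapUrls` and its `pathPrefix` parameter are hypothetical names, the `example.com` URLs are placeholders, and it assumes a plain `<urlset>` sitemap whose pages are listed in `<loc>` elements.

```typescript
// Hypothetical helper, for illustration only (not part of this PR).
// Collects the page URLs listed in a sitemap's <loc> elements, keeps those
// that start with `pathPrefix`, and drops duplicates while preserving order.
function extractSitemapUrls(xml: string, pathPrefix: string = ""): string[] {
  const urls: string[] = [];
  const seen: { [url: string]: boolean } = {};
  const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;

  let match: RegExpExecArray | null;
  while ((match = locPattern.exec(xml)) !== null) {
    // Sitemaps escape "&" as "&amp;" inside URLs.
    const url = match[1].replace(/&amp;/g, "&");
    if (url.indexOf(pathPrefix) === 0 && !seen[url]) {
      seen[url] = true;
      urls.push(url);
    }
  }
  return urls;
}

// Usage with a placeholder sitemap: keep only pages under /docs/.
const sampleSitemap = [
  "<urlset>",
  "  <url><loc>https://example.com/docs/a</loc></url>",
  "  <url><loc>https://example.com/blog/b</loc></url>",
  "  <url><loc>https://example.com/docs/a</loc></url>",
  "</urlset>",
].join("\n");
const docUrls = extractSitemapUrls(sampleSitemap, "https://example.com/docs/");
```

Each discovered URL would then be fetched and its HTML converted to markdown; that conversion step is not sketched here.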

Changeset 1: Replace AsciiDoc with Pre-Crawled Markdown for OpenZeppelin Docs

Files Affected:

  • packages/ingester/src/ingesters/OpenZeppelinDocsIngester.ts
  • packages/ingester/src/utils/RecursiveMarkdownSplitter.ts
  • packages/ingester/asciidoc/oz-playbook.yml (deleted)
  • packages/ingester/asciidoc/playbook.yml

Summary of Changes:

  • Modified OpenZeppelinDocsIngester to read from a pre-crawled markdown file instead of processing AsciiDoc
  • Added RecursiveMarkdownSplitter utility with comprehensive markdown-aware chunking logic
  • Removed the separate oz-playbook.yml configuration as it's no longer needed
  • Updated imports and chunk processing to use the new markdown splitter
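
To illustrate what "markdown-aware" chunking means here, the following is a simplified sketch and not the `RecursiveMarkdownSplitter` added in this PR. The names `splitMarkdown`, `splitAt`, and `maxChars` are hypothetical. It assumes two boundary levels (ATX headings, then blank lines) and treats any line starting with three backticks or tildes as a code fence delimiter, so a `#` line inside a fenced block is never mistaken for a heading.

```typescript
// Hypothetical sketch, for illustration only (not the PR's implementation).
type Boundary = (line: string) => boolean;

const isHeading: Boundary = (line) => /^#{1,6}\s/.test(line);
const isBlank: Boundary = (line) => line.trim() === "";

// Split `text` before every line matching `boundary`, ignoring lines that
// sit inside a fenced code block.
function splitAt(text: string, boundary: Boundary): string[] {
  const parts: string[] = [];
  let current: string[] = [];
  let inFence = false;
  for (const line of text.split("\n")) {
    if (/^\s*(`{3}|~{3})/.test(line)) inFence = !inFence;
    if (!inFence && boundary(line) && current.length > 0) {
      parts.push(current.join("\n"));
      current = [];
    }
    current.push(line);
  }
  if (current.length > 0) parts.push(current.join("\n"));
  return parts;
}

// Recursively split until each chunk fits in `maxChars`, trying coarser
// boundaries first. A chunk that still exceeds `maxChars` after the last
// boundary is returned as-is instead of being cut mid-paragraph.
function splitMarkdown(
  text: string,
  maxChars: number,
  boundaries: Boundary[] = [isHeading, isBlank],
): string[] {
  const trimmed = text.trim();
  if (trimmed === "") return [];
  if (trimmed.length <= maxChars || boundaries.length === 0) return [trimmed];

  const chunks: string[] = [];
  for (const part of splitAt(trimmed, boundaries[0])) {
    chunks.push(...splitMarkdown(part, maxChars, boundaries.slice(1)));
  }
  return chunks;
}
```

Trying headings before blank lines keeps each chunk aligned with a documentation section whenever the section fits, which tends to give retrieval more coherent context than fixed-size character windows.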

[TRIAGE]: NEEDS_REVIEW


Changeset 3: Add Pre-Crawled OpenZeppelin Documentation

Files Affected:

  • python/scripts/summarizer/generated/openzeppelin_docs_summary.md

Summary of Changes:

  • Added a pre-crawled snapshot of the OpenZeppelin documentation dated 2025-08-02
  • This file serves as the source for the OpenZeppelinDocsIngester

[TRIAGE]: NEEDS_REVIEW

@enitrat enitrat changed the title from "feat: use crawled OZ docs in ingester" to "feat: add documentation crawler for OpenZeppelin docs" Aug 2, 2025
@enitrat enitrat force-pushed the feat/ingest-oz-crawler branch from 06ff121 to ebeb3ed August 2, 2025 22:01
@enitrat enitrat force-pushed the feat/ingest-oz-crawler branch from ebeb3ed to 05fc44b August 5, 2025 16:44
@enitrat enitrat changed the base branch from feat/web-doc-crawler to main August 5, 2025 16:44
@enitrat enitrat marked this pull request as ready for review August 5, 2025 16:45
@ijusttookadnatest ijusttookadnatest left a comment

Great PR, everything's ok

@ijusttookadnatest ijusttookadnatest merged commit ffe4982 into main Aug 20, 2025