An intelligent document assembly system that combines targeted web crawling, AI-driven research compilation, and smart template-based document generation. DocAssembler helps users create comprehensive documents by automatically gathering, analyzing, and synthesizing information from various web sources.
- Crawls entire website directories to gather documentation
- Focuses on process documentation, API specs, and software instructions
- Supports various documentation types:
  - Technical documentation
  - API documentation
  - Process instructions
  - Wiki pages
  - Social media profiles
  - Knowledge bases
- Tag-based research gathering:
  - Accepts user summaries and topic keywords
  - Performs deep web searches on individual tags
  - Analyzes tag relationships and domain contexts
  - Generates comprehensive research reports in PDF or Markdown
  - Similar to OpenAI's Deep Research functionality
- Semi-automated document completion:
  - User provides partial information
  - AI completes missing sections
- Supported templates include:
  - Software Requirements Specification (SRS)
  - Executive Summaries
  - CREST Data Problem Reports
  - RFP Proposals
  - Report Abstracts
  - Plot Summaries
  - Research Synopses
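The tag-relationship analysis mentioned in the research-gathering feature above can be sketched as a simple co-occurrence count over retrieved text snippets. This is only an illustration of the idea; the function name and inputs are hypothetical, not the project's actual API.

```python
from collections import Counter
from itertools import combinations

# Illustrative sketch: score how strongly two tags are related by counting
# how often they co-occur in retrieved snippets. Hypothetical helper, not
# the airesearch package's real interface.
def tag_relationships(snippets, tags):
    pair_counts = Counter()
    for text in snippets:
        present = [t for t in tags if t.lower() in text.lower()]
        for a, b in combinations(sorted(present), 2):
            pair_counts[(a, b)] += 1
    return pair_counts
```

A real implementation would weight by source quality and snippet length rather than raw counts, but the correlation structure is the same.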
- Web Crawler (`packages/webcrawler`):
  - Intelligent web crawling with domain/subdomain support
  - Respects robots.txt and implements rate limiting
  - Concurrent crawling with proper session management
  - Content extraction and relationship mapping
- Documentation Generator (`packages/docgen`):
  - Markdown and HTML processing
  - Table of contents generation
  - PDF output support
  - Template-based document generation
  - Metadata handling
- Web Interface (`services/web`):
  - React/Vite-based web application
  - Real-time processing feedback
  - Document preview and editing
  - Configuration management
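The crawler's politeness layer (robots.txt compliance plus per-domain rate limiting) can be sketched with Python's standard library. This is a minimal illustration, not the webcrawler package's actual implementation; the class and method names are hypothetical.

```python
import time
import urllib.robotparser
from urllib.parse import urlsplit

# Hypothetical sketch of the politeness layer described above: check
# robots.txt before fetching and enforce a minimum delay per domain.
class PoliteFetcher:
    def __init__(self, user_agent="DocAssembler", delay=1.0):
        self.user_agent = user_agent
        self.delay = delay          # minimum seconds between requests per domain
        self.parsers = {}           # cached robots.txt parser per domain
        self.last_request = {}      # last request timestamp per domain

    def allowed(self, url):
        # Fetch and cache robots.txt for the url's domain, then consult it.
        domain = "{0.scheme}://{0.netloc}".format(urlsplit(url))
        if domain not in self.parsers:
            rp = urllib.robotparser.RobotFileParser(domain + "/robots.txt")
            try:
                rp.read()
            except OSError:
                rp.parse([])  # robots.txt unreachable: treat as no rules
            self.parsers[domain] = rp
        return self.parsers[domain].can_fetch(self.user_agent, url)

    def wait_turn(self, url):
        # Sleep just long enough to honor the per-domain delay.
        domain = urlsplit(url).netloc
        elapsed = time.monotonic() - self.last_request.get(domain, 0.0)
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self.last_request[domain] = time.monotonic()
```

A production crawler would also track concurrent sessions and honor `Crawl-delay` directives, as the component list above implies.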
DocAssembler stores raw text in a traditional relational database and vectors in a vector database. By default we recommend MySQL for document metadata and ChromaDB for vector search. Example setup scripts are located in `scripts/databases/`.
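The dual-store layout can be illustrated with a small self-contained sketch. Here stdlib `sqlite3` stands in for MySQL and a brute-force cosine search stands in for ChromaDB; the function names are hypothetical and only show how the two stores relate.

```python
import math
import sqlite3

# Illustrative sketch of the dual-store layout: relational metadata plus a
# separate vector index. sqlite3 substitutes for MySQL and a linear cosine
# scan substitutes for ChromaDB purely for demonstration.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE documents (id TEXT PRIMARY KEY, title TEXT, source_url TEXT)")

vectors = {}  # doc id -> embedding (would live in the vector database)

def add_document(doc_id, title, url, embedding):
    db.execute("INSERT INTO documents VALUES (?, ?, ?)", (doc_id, title, url))
    vectors[doc_id] = embedding

def search(query_vec, k=3):
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))
    ranked = sorted(vectors, key=lambda d: cosine(query_vec, vectors[d]), reverse=True)
    return [db.execute("SELECT title FROM documents WHERE id = ?", (d,)).fetchone()[0]
            for d in ranked[:k]]
```

The point of the split is that full-text metadata queries and nearest-neighbor vector queries have different access patterns, so each goes to the store built for it.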
- Python 3.12+
- Node.js 20+
- Docker (optional)
- Clone the repository:

  ```shell
  git clone https://github.com/cloudcurio/doc_assembler_web.git
  cd doc_assembler_web
  ```

- Set up Python packages:

  ```shell
  # Install Poetry
  curl -sSL https://install.python-poetry.org | python3 -

  # Install webcrawler package
  cd packages/webcrawler
  poetry install

  # Install docgen package
  cd ../docgen
  poetry install
  ```

- Set up the web interface:

  ```shell
  cd ../../services/web
  npm install
  ```

- Install pre-commit hooks:

  ```shell
  pip install pre-commit
  pre-commit install
  ```

- Configure the environment:

  ```shell
  cp .env.example .env
  # Edit .env with your settings
  ```
```shell
# Run webcrawler tests
cd packages/webcrawler
poetry run pytest

# Run docgen tests
cd ../docgen
poetry run pytest

# Run web interface tests
cd ../../services/web
npm test
```
```python
import asyncio

from webcrawler.core.config import CrawlerConfig
from webcrawler.core.crawler import Crawler

# Configure and run the documentation crawler
config = CrawlerConfig(
    start_url="https://docs.example.com",
    doc_types=["api", "wiki", "technical"],
    content_filters=["documentation", "guide", "manual"],
)

async def main():
    async with Crawler(config) as crawler:
        docs = await crawler.gather_documentation()
        print(f"Found {len(docs)} documentation pages")

asyncio.run(main())
```
```python
import asyncio

from airesearch.core.researcher import Researcher
from airesearch.models.topic import ResearchTopic

# Configure research parameters
topic = ResearchTopic(
    tags=["kubernetes", "service mesh", "istio"],
    context="cloud native architecture",
    depth="technical",
)

async def main():
    # Generate a research report
    researcher = Researcher()
    report = await researcher.compile_research(
        topic=topic,
        output_format="pdf",
        include_citations=True,
    )

asyncio.run(main())
```
```python
import asyncio

from docgen.core.assembler import DocumentAssembler
from docgen.models.template import Template

async def main():
    # Create an SRS document from a template
    assembler = DocumentAssembler()
    srs_doc = await assembler.create_document(
        template=Template.SRS,
        initial_content={
            "project_name": "MyProject",
            "project_scope": "Cloud-based service...",
        },
        auto_complete=True,
    )

asyncio.run(main())
```
Build and run using Docker:

```shell
# Build images
docker compose build

# Run services
docker compose up -d
```
Crawler settings:

- `doc_types`: Types of documentation to gather (api, wiki, technical, process, social)
- `content_filters`: Content type filters
- `depth`: Crawling depth configuration
- `extract_assets`: Include images and diagrams
- `rate_limits`: Domain-specific rate limiting

Research settings:

- `search_depth`: Research depth level
- `tag_relationships`: Tag correlation settings
- `source_quality`: Source validation rules
- `citation_style`: Citation format
- `analysis_level`: Research analysis depth
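As a rough illustration of how the crawler settings above might be grouped and validated, here is a hypothetical dataclass; the field names mirror the option list, but the real `CrawlerConfig` may be shaped differently.

```python
from dataclasses import dataclass, field

# Assumed set of valid documentation types, taken from the option list above.
VALID_DOC_TYPES = {"api", "wiki", "technical", "process", "social"}

# Hypothetical settings container; not the project's actual CrawlerConfig.
@dataclass
class CrawlerSettings:
    doc_types: list = field(default_factory=lambda: ["technical"])
    content_filters: list = field(default_factory=list)
    depth: int = 3                                   # crawl depth limit
    extract_assets: bool = False                     # include images/diagrams
    rate_limits: dict = field(default_factory=dict)  # domain -> requests/sec

    def __post_init__(self):
        # Reject unknown documentation types early, before crawling starts.
        bad = set(self.doc_types) - VALID_DOC_TYPES
        if bad:
            raise ValueError(f"unknown doc_types: {sorted(bad)}")
```

Validating at construction time keeps bad configurations from surfacing mid-crawl.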
- `templates/`: Customizable document templates
  - SRS Template
  - Executive Summary Template
  - RFP Template
  - Research Report Template
- `completion_rules/`: AI completion guidelines
- `style_guides/`: Document styling rules
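The interplay between templates and AI completion can be sketched with stdlib `string.Template`: user-supplied fields are filled directly, and any field left blank goes through a completion step. The completion pass is stubbed out here, and the field names are only illustrative of the real `docgen` templates.

```python
import re
from string import Template

# Illustrative SRS-style template; the real templates/ directory holds
# richer documents with many more fields.
SRS_TEMPLATE = Template(
    "Software Requirements Specification\n"
    "Project: $project_name\n"
    "Scope: $project_scope\n"
)

def complete_missing(field_name):
    # Stand-in for the AI completion pass that fills sections the user left out.
    return f"[to be completed: {field_name}]"

def render(template, provided):
    # Fill provided fields; route every remaining placeholder to completion.
    values = dict(provided)
    for name in re.findall(r"\$(\w+)", template.template):
        values.setdefault(name, complete_missing(name))
    return template.substitute(values)
```

In the real system the completion step would consult `completion_rules/` and `style_guides/` instead of emitting a placeholder.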
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Website: cloudcurio.cc
- Email: dev@cloudcurio.cc