Merged
1 change: 0 additions & 1 deletion .gitignore
@@ -47,7 +47,6 @@ packages/**/dist
!.trunk/trunk.yaml
!.trunk/.gitignore

starklings/
debug/

fixtures/runner_crate/target
4 changes: 4 additions & 0 deletions .trunk/trunk.yaml
@@ -16,6 +16,10 @@ runtimes:
- python@3.10.8
# This is the section where you manage your linters. (https://docs.trunk.io/check/configuration)
lint:
ignore:
- linters: [ALL]
paths:
- python/src/cairo_coder_tools/ingestion/generated
enabled:
- ruff@0.12.3
- actionlint@1.7.7
23 changes: 12 additions & 11 deletions AGENTS.md
@@ -6,23 +6,24 @@ This file documents conventions and checklists for making changes that affect th

When adding a new documentation source (e.g., a new docs site or SDK) make sure to complete all of the following steps:

1. TypeScript ingestion (packages/ingester)
1. TypeScript ingestion (ingesters)

- Create an ingester class extending `BaseIngester` or `MarkdownIngester` under `packages/ingester/src/ingesters/`.
- Register it in `packages/ingester/src/IngesterFactory.ts`.
- Create an ingester class extending `BaseIngester` or `MarkdownIngester` under `ingesters/src/ingesters/`.
- Register it in `ingesters/src/IngesterFactory.ts`.
- Ensure chunks carry correct metadata: `uniqueId`, `contentHash`, `sourceLink`, and `source`.
- Run `pnpm generate-embeddings` (or `generate-embeddings:yes`) to populate/update the vector store.
- Run the embeddings generator to populate/update the vector store:
- Prefer: `bun run src/generateEmbeddings.ts` (or `bun run src/generateEmbeddings.ts -y`)
- If you have scripts wired: `bun run generate-embeddings` (or `generate-embeddings:yes`)

2. Agents (TS)
2. Agents (Python)

- Add the new enum value to `packages/agents/src/types/index.ts` under `DocumentSource`.
- Verify Postgres vector store accepts the new `source` and filters on it (`packages/agents/src/db/postgresVectorStore.ts`).
- Add the new enum value to `python/src/cairo_coder/core/types.py` under `DocumentSource`.
- Ensure filtering by `metadata->>'source'` works with the new value in `python/src/cairo_coder/dspy/document_retriever.py`.
- Update the query processor resource descriptions in `python/src/cairo_coder/dspy/query_processor.py` (`RESOURCE_DESCRIPTIONS`). The module validates that every `DocumentSource` has a description.

3. Retrieval Pipeline (Python)

- Add the new enum value to `python/src/cairo_coder/core/types.py` under `DocumentSource`.
- Ensure filtering by `metadata->>'source'` works with the new value in `python/src/cairo_coder/dspy/document_retriever.py`.
- Update the query processor resource descriptions in `python/src/cairo_coder/dspy/query_processor.py` (`RESOURCE_DESCRIPTIONS`).
- No extra steps beyond the above; the retriever already supports filtering by `metadata->>'source'`.

4. Optimized Program Files (Python) — required

@@ -34,7 +35,7 @@ When adding a new documentation source (e.g., a new docs site or SDK) make sure

- Ensure the new source appears where appropriate (e.g., `/v1/agents` output and documentation tables):
- `API_DOCUMENTATION.md`
- `packages/ingester/README.md`
- `ingesters/README.md`
- Any user-facing lists of supported sources

6. Quick Sanity Check
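
For step 1 of the checklist, the four metadata fields can be pictured with a small sketch. The field names come from the checklist and the `sourceLink` layout mirrors the `AsciiDocIngester` hunk in this PR; the type name, the `uniqueId` scheme, and the stand-in hash are assumptions made for illustration, not the repository's actual helpers.

```typescript
// Hypothetical shape: only the four field names are taken from AGENTS.md.
type ChunkMetadata = {
  uniqueId: string;
  contentHash: string;
  sourceLink: string;
  source: string;
};

type ChunkInput = {
  source: string;
  baseUrl: string;
  urlSuffix: string;
  pageName: string;
  chunkIndex: number;
  content: string;
  anchor?: string;
};

// Stand-in hash (FNV-1a style over UTF-16 code units) so the sketch has no
// dependencies. The real ingesters call their own calculateHash helper.
function standInHash(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return ('0000000' + (hash >>> 0).toString(16)).slice(-8);
}

function buildChunkMetadata(input: ChunkInput): ChunkMetadata {
  const fragment = input.anchor ? '#' + input.anchor : '';
  return {
    // Assumed scheme: source + page + index. The real uniqueId may differ.
    uniqueId: `${input.source}-${input.pageName}-${input.chunkIndex}`,
    contentHash: standInHash(input.content),
    sourceLink: `${input.baseUrl}/${input.pageName}${input.urlSuffix}${fragment}`,
    source: input.source,
  };
}
```

The point of the sketch is that `contentHash` depends only on the chunk content, so unchanged chunks can be recognised on re-ingestion, while `source` carries the value the retriever filters on.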
22 changes: 17 additions & 5 deletions API_DOCUMENTATION.md
@@ -58,14 +58,25 @@ Lists every agent registered in Cairo Coder.
"openzeppelin_docs",
"corelib_docs",
"scarb_docs",
"starknet_js"
"starknet_js",
"starknet_blog"
]
},
{
"id": "scarb-assistant",
"name": "Scarb Assistant",
"description": "Specialized assistant for Scarb build tool",
"sources": ["scarb_docs"]
"id": "starknet-agent",
"name": "Starknet Agent",
"description": "Assistant for the Starknet ecosystem (contracts, tools, docs).",
"sources": [
"cairo_book",
"starknet_docs",
"starknet_foundry",
"cairo_by_example",
"openzeppelin_docs",
"corelib_docs",
"scarb_docs",
"starknet_js",
"starknet_blog"
]
}
]
```
@@ -82,6 +93,7 @@ Lists every agent registered in Cairo Coder.
| `corelib_docs` | Cairo core library docs |
| `scarb_docs` | Scarb package manager documentation |
| `starknet_js` | StarknetJS guides and SDK documentation |
| `starknet_blog` | Starknet blog posts and announcements |
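
A client consuming this listing might want to check that every agent only advertises sources from the table. The sketch below is one way to do that; the entry fields (`id`, `name`, `description`, `sources`) and the nine source ids are taken from the JSON in this hunk, while the type and function names are hypothetical.

```typescript
// Field names mirror the JSON entries shown in API_DOCUMENTATION.md.
type AgentInfo = {
  id: string;
  name: string;
  description: string;
  sources: string[];
};

// The nine source ids listed for "starknet-agent" in this diff.
const KNOWN_SOURCES: readonly string[] = [
  'cairo_book',
  'starknet_docs',
  'starknet_foundry',
  'cairo_by_example',
  'openzeppelin_docs',
  'corelib_docs',
  'scarb_docs',
  'starknet_js',
  'starknet_blog',
];

// Hypothetical helper: returns each source id that is not in the known list.
function findUnknownSources(
  agents: AgentInfo[],
  known: readonly string[] = KNOWN_SOURCES,
): string[] {
  const knownSet = new Set<string>(known);
  const unknown: string[] = [];
  for (const agent of agents) {
    for (const source of agent.sources) {
      if (!knownSet.has(source) && unknown.indexOf(source) === -1) {
        unknown.push(source);
      }
    }
  }
  return unknown;
}
```

An empty result means the listing and the documentation table agree; a non-empty one points at a source that was registered without the documentation update from the checklist.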

## Chat Completions

4 changes: 2 additions & 2 deletions CLAUDE.md
@@ -118,8 +118,8 @@ ingesters/

```text
Python Summarizer → Generated Markdown → Ingester → PostgreSQL → RAG Pipeline → Code Generation
(python/) (python/src/scripts/ (ingesters/) (pgvector) (python/)
summarizer/generated/)
(python/) (python/src/cairo_coder_tools/ (ingesters/) (pgvector) (python/)
ingestion/generated/)
```

## Configuration
2 changes: 1 addition & 1 deletion ingester.dockerfile
@@ -16,7 +16,7 @@ COPY ingesters ./ingesters


# Copy ingester files generated from python
COPY python/src/scripts/summarizer/generated ./python/src/scripts/summarizer/generated
COPY python/src/cairo_coder_tools/ingestion/generated ./python/src/cairo_coder_tools/ingestion/generated

# Install dependencies
WORKDIR /app/ingesters
14 changes: 7 additions & 7 deletions ingesters/README.md
@@ -32,6 +32,7 @@ The ingester currently supports the following documentation sources:
6. **Core Library Docs** (`corelib_docs`): Cairo core library documentation
7. **Scarb Docs** (`scarb_docs`): Scarb package manager documentation
8. **StarknetJS Guides** (`starknet_js`): StarknetJS guides and tutorials
9. **Starknet Blog** (`starknet_blog`): Starknet blog posts and announcements

## Architecture

@@ -112,17 +113,16 @@ const chunks = splitter.splitMarkdownToChunks(markdown);

## Usage

To use the ingester package, run the `generateEmbeddings.ts` script:
To use the ingester package, run the embeddings generator script:

```bash
# From the root of the package
pnpm run generate-embeddings

# From the root of the project
turbo run generate-embeddings
# Preferred (direct bun):
bun run src/generateEmbeddings.ts
# Non-interactive (yes to prompts):
bun run src/generateEmbeddings.ts -y
```

This will prompt you to select a documentation source to ingest. You can also select "Everything" to ingest all sources.
This will prompt you to select a documentation source to ingest (or use `-y` for all). You can also select "Everything" to ingest all sources.

## Adding a New Documentation Source

2 changes: 1 addition & 1 deletion ingesters/src/ingesters/AsciiDocIngester.ts
@@ -237,7 +237,7 @@ export abstract class AsciiDocIngester extends BaseIngester {
sections.forEach((section: ParsedSection, index: number) => {
const hash: string = calculateHash(section.content);
const sourceLink = `${this.config.baseUrl}/${page.name}${this.config.urlSuffix}${section.anchor ? '#' + section.anchor : ''}`;
console.debug(
logger.debug(
`Section Title: ${section.title}, source: ${this.source}, sourceLink: ${sourceLink}`,
);
chunks.push(
4 changes: 2 additions & 2 deletions ingesters/src/ingesters/CairoBookIngester.ts
@@ -46,8 +46,8 @@ export class CairoBookIngester extends MarkdownIngester {
async readSummaryFile(): Promise<string> {
const summaryPath = getPythonPath(
'src',
'scripts',
'summarizer',
'cairo_coder_tools',
'ingestion',
'generated',
'cairo_book_summary.md',
);
4 changes: 2 additions & 2 deletions ingesters/src/ingesters/CoreLibDocsIngester.ts
@@ -46,8 +46,8 @@ export class CoreLibDocsIngester extends MarkdownIngester {
async readCorelibSummaryFile(): Promise<string> {
const summaryPath = getPythonPath(
'src',
'scripts',
'summarizer',
'cairo_coder_tools',
'ingestion',
'generated',
'corelib_summary.md',
);
6 changes: 3 additions & 3 deletions ingesters/src/ingesters/StarknetBlogIngester.ts
@@ -47,10 +47,10 @@ export class StarknetBlogIngester extends MarkdownIngester {
async readSummaryFile(): Promise<string> {
const summaryPath = getPythonPath(
'src',
'scripts',
'summarizer',
'cairo_coder_tools',
'ingestion',
'generated',
'blog_summary.md',
'starknet-blog.md',
);

logger.info(`Reading Starknet blog summary from ${summaryPath}`);
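
The three ingesters above now resolve their summary files under the same `cairo_coder_tools/ingestion/generated` directory, which is also the path copied in `ingester.dockerfile`. A rough sketch of that layout follows. It assumes `getPythonPath` joins its arguments under the `python/` directory, which this diff does not show, so a stand-in is used; the shared-constant helper is hypothetical and not part of the PR.

```typescript
// Stand-in for the repository's getPythonPath. Assumption: it joins the
// given segments under python/. The real helper may resolve differently.
function getPythonPathStandIn(...segments: string[]): string {
  return ['python'].concat(segments).join('/');
}

// Directory segments as they appear in the updated ingesters.
const GENERATED_DIR_SEGMENTS: readonly string[] = [
  'src',
  'cairo_coder_tools',
  'ingestion',
  'generated',
];

// Hypothetical helper: one place to change if the directory moves again.
function generatedSummaryPath(fileName: string): string {
  return getPythonPathStandIn(...GENERATED_DIR_SEGMENTS, fileName);
}

// File names used by the three ingesters in this PR.
const summaryFiles = [
  'cairo_book_summary.md',
  'corelib_summary.md',
  'starknet-blog.md',
].map(generatedSummaryPath);
```

Keeping the segments in one constant is only a design suggestion; the PR itself repeats them in each ingester.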
13 changes: 9 additions & 4 deletions ingesters/src/utils/RecursiveMarkdownSplitter.ts
@@ -199,7 +199,8 @@ export class RecursiveMarkdownSplitter {
const sourceRanges = this.parseSourceRanges(markdown);

// Find all headers
const headerRegex = /^(#{1,6})\s+(.+?)(?:\s*#*)?$/gm;
// Allow up to 3 leading spaces before ATX headers per CommonMark
const headerRegex = /^\s{0,3}(#{1,6})\s+(.+?)(?:\s*#*)?$/gm;
let match: RegExpExecArray | null;

while ((match = headerRegex.exec(markdown)) !== null) {
@@ -214,10 +215,14 @@ export class RecursiveMarkdownSplitter {
// Find all code blocks
this.findCodeBlocks(markdown, codeBlocks);

// Filter out headers that are inside code blocks
// Filter out headers that are inside non-breakable code blocks
// Allow headers inside oversized or malformed (breakable) code blocks
const filteredHeaders = headers.filter((header) => {
return !codeBlocks.some(
(block) => header.start >= block.start && header.end <= block.end,
(block) =>
header.start >= block.start &&
header.end <= block.end &&
!block.breakable,
);
});
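
The effect of the extra `!block.breakable` condition can be checked in isolation. The sketch below reuses the predicate from this hunk on plain offset ranges; the `Span` and `CodeBlock` types and the `filterHeaders` name are assumptions, since the splitter's internal types are not part of the diff.

```typescript
// Hypothetical minimal types: offsets into the markdown string.
type Span = { start: number; end: number };
type CodeBlock = Span & { breakable: boolean };

// Drops a header only when it lies fully inside a NON-breakable code block.
// Headers inside breakable (oversized or malformed) blocks are kept.
function filterHeaders<T extends Span>(headers: T[], codeBlocks: CodeBlock[]): T[] {
  return headers.filter(
    (header) =>
      !codeBlocks.some(
        (block) =>
          header.start >= block.start &&
          header.end <= block.end &&
          !block.breakable,
      ),
  );
}

const exampleHeaders: Span[] = [
  { start: 0, end: 5 },   // outside any code block: kept
  { start: 20, end: 25 }, // inside a non-breakable block: dropped
  { start: 50, end: 55 }, // inside a breakable block: kept
];
const exampleBlocks: CodeBlock[] = [
  { start: 10, end: 30, breakable: false },
  { start: 40, end: 60, breakable: true },
];
const keptHeaders = filterHeaders(exampleHeaders, exampleBlocks);
```

Before this change the third header would also have been dropped, so an oversized code block could not be split at header-like lines inside it.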

@@ -950,7 +955,7 @@ export class RecursiveMarkdownSplitter {
}
}

console.debug(`Chunk Title: ${title}, Source link: ${sourceLink}`);
logger.debug(`Chunk Title: ${title}, Source link: ${sourceLink}`);

chunks.push({
content: rawChunk.content,
13 changes: 13 additions & 0 deletions ingesters/src/utils/__tests__/RecursiveMarkdownSplitter.test.ts
@@ -124,6 +124,19 @@ More content.`;
expect(chunks[0]!.meta.title).toBe('Header with trailing hashes');
});

it('should detect headers with up to 3 leading spaces', () => {
const splitter = new RecursiveMarkdownSplitter({
maxChars: 100,
minChars: 0,
overlap: 0,
headerLevels: [1, 2],
});
const text = ' ## Indented H2 Header\nBody under indented header.';
const chunks = splitter.splitMarkdownToChunks(text);
expect(chunks.length).toBeGreaterThanOrEqual(1);
expect(chunks[0]!.meta.title).toBe('Indented H2 Header');
});
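
The updated header pattern can also be exercised on its own, outside the splitter. The sketch below copies the regular expression from the `RecursiveMarkdownSplitter.ts` hunk; the `findHeaders` wrapper and its return shape are hypothetical and exist only for illustration.

```typescript
type FoundHeader = { level: number; title: string };

// Hypothetical wrapper around the regex introduced in this PR.
function findHeaders(markdown: string): FoundHeader[] {
  // Same pattern as the diff: up to 3 leading whitespace characters,
  // 1-6 '#', the title, and optional trailing hashes.
  const headerRegex = /^\s{0,3}(#{1,6})\s+(.+?)(?:\s*#*)?$/gm;
  const found: FoundHeader[] = [];
  let match: RegExpExecArray | null;
  while ((match = headerRegex.exec(markdown)) !== null) {
    found.push({
      level: (match[1] ?? '').length,
      title: match[2] ?? '',
    });
  }
  return found;
}

const sample = [
  '   ## Indented H2', // 3 leading spaces: still a header
  'Body',
  '    # Too deep',    // 4 leading spaces: not matched
  '# Top ##',          // trailing hashes are stripped from the title
].join('\n');
const sampleHeaders = findHeaders(sample);
```

With this input, `sampleHeaders` holds two entries: the indented level-2 header and the level-1 header with its trailing hashes removed. The line with four leading spaces is skipped, in line with the CommonMark rule cited in the code comment.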

it('should prefer deepest header of configured levels (e.g., H2) for title', () => {
const splitter = new RecursiveMarkdownSplitter({
maxChars: 80,
1 change: 0 additions & 1 deletion python/README.md
@@ -82,7 +82,6 @@ docker compose up postgres backend --build
```bash
curl -X POST "http://localhost:3001/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "x-api-key: YOUR_API_KEY" \
-d '{
"messages": [
{
14 changes: 9 additions & 5 deletions python/pyproject.toml
@@ -71,22 +71,26 @@ dev = [
]

[project.scripts]
# Main server
cairo-coder = "cairo_coder.server.app:main"
cairo-coder-api = "cairo_coder.api.server:run"

# Optimization tools
generate_starklings_dataset = "cairo_coder.optimizers.generation.generate_starklings_dataset:cli_main"
optimize_generation = "cairo_coder.optimizers.generation.optimize_generation:main"
starklings_evaluate = "scripts.starklings_evaluate:main"
cairo-coder-summarize = "scripts.summarizer.cli:app"
docs-crawler = "scripts.docs_crawler:main"
cairo-coder-datasets = "scripts.datasets.cli:app"

# Other scripts
eval = "scripts.eval:main"
ingest = "scripts.ingest:app"
dataset = "scripts.dataset:app"

[project.urls]
"Homepage" = "https://github.com/cairo-coder/cairo-coder"
"Bug Tracker" = "https://github.com/cairo-coder/cairo-coder/issues"

[tool.uv.build-backend]
module-root = "src"
module-name = ["cairo_coder", "scripts"]
module-name = ["cairo_coder", "cairo_coder_tools", "scripts"]

[tool.ruff]
line-length = 100
1 change: 1 addition & 0 deletions python/src/cairo_coder_tools/__init__.py
@@ -0,0 +1 @@
"""Cairo Coder Tools - Utilities for evaluation, ingestion, and dataset management."""
5 changes: 5 additions & 0 deletions python/src/cairo_coder_tools/datasets/__init__.py
@@ -0,0 +1,5 @@
"""Dataset utilities for Cairo Coder."""

from .analysis import DatasetAnalyzer, analyze_dataset

__all__ = ["DatasetAnalyzer", "analyze_dataset"]