35 changes: 34 additions & 1 deletion docs/guide/lite.md
@@ -139,6 +139,7 @@ await lite.index([
| `version` | `string?` | Library version |
| `sourceType` | `string?` | `"manual"` (default), `"library"`, `"topic"`, or `"model-generated"` |
| `topicId` | `string?` | Topic ID to associate the document with |
| `language` | `string?` | Language alias for code-aware tree-sitter chunking (e.g. `"typescript"`, `"cpp"`, `"go"`). When set and tree-sitter is available, chunks at function/class boundaries instead of text boundaries. |

### `indexRaw(input)`

@@ -269,7 +270,39 @@ The LLM provider must support streaming. Providers that don't expose a `complete

## Code Indexing

For source code files, use the tree-sitter chunker to split at function and class boundaries:
LibScope Lite can split source code files at function and class boundaries using tree-sitter rather than plain text chunking. The preferred way to enable this is to set the `language` field on a `LiteDoc` — no extra imports or chunking steps required.

### Preferred: set `language` on `LiteDoc`

```ts
// Preferred: just set language on LiteDoc — chunking is automatic
await lite.index([
{
title: "src/auth.cpp",
content: fileContent,
library: "my-repo",
language: "cpp", // enables tree-sitter chunking at function boundaries
},
]);
```

With `language` set, `index()` runs the tree-sitter chunker automatically, so `TreeSitterChunker` does not need to be imported. If tree-sitter is not installed or parsing fails, indexing falls back silently to the standard text chunker. A `language` value that the chunker does not support is ignored in the same way.
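
The fallback described above can be sketched in isolation. This is only an illustration of the pattern, not LibScope's internal code: the `Chunker` type and `chunkWithFallback` are hypothetical names, and both chunkers are passed in as plain functions.

```ts
// Hypothetical type for illustration: any async function that splits text into chunks.
type Chunker = (content: string) => Promise<string[]>;

/**
 * Tries the code-aware chunker first and falls back to the text chunker
 * when it throws or produces no chunks.
 */
async function chunkWithFallback(
  content: string,
  codeChunker: Chunker,
  textChunker: Chunker,
): Promise<string[]> {
  try {
    const chunks = await codeChunker(content);
    // An empty result is treated like a failure so the document is still chunked.
    if (chunks.length > 0) return chunks;
  } catch {
    // Parser missing or parse failed: fall through to the text chunker.
  }
  return textChunker(content);
}
```

Because the error is swallowed, a caller cannot tell from the result which chunker ran; that mirrors the "silent" fallback, at the cost of making a missing tree-sitter install harder to notice.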

Supported languages and their extension aliases:

| Language | Aliases |
|---|---|
| `typescript` | `ts`, `tsx` |
| `javascript` | `js`, `jsx`, `mjs`, `cjs` |
| `python` | `py` |
| `csharp` | `cs` |
| `cpp` | `cc`, `cxx`, `hpp`, `h` |
| `c` | — |
| `go` | — |
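
When indexing many files, the `language` value can be derived from each file's extension. The sketch below copies the alias table above into a lookup map; `LANGUAGE_ALIASES` and `languageFromPath` are hypothetical helpers written for this example, not part of the LibScope API.

```ts
// Alias table copied from the documentation above; names are illustrative only.
const LANGUAGE_ALIASES: Record<string, string[]> = {
  typescript: ["ts", "tsx"],
  javascript: ["js", "jsx", "mjs", "cjs"],
  python: ["py"],
  csharp: ["cs"],
  cpp: ["cc", "cxx", "hpp", "h"],
  c: [],
  go: [],
};

// Map every canonical name and every alias to its canonical language.
const CANONICAL = new Map<string, string>();
for (const [language, aliases] of Object.entries(LANGUAGE_ALIASES)) {
  CANONICAL.set(language, language);
  for (const alias of aliases) CANONICAL.set(alias, language);
}

/** Returns the canonical language for a file path, or undefined if the extension is not recognized. */
function languageFromPath(path: string): string | undefined {
  const dot = path.lastIndexOf(".");
  if (dot < 0 || dot === path.length - 1) return undefined;
  return CANONICAL.get(path.slice(dot + 1).toLowerCase());
}
```

The result can be passed straight through as `language` on a `LiteDoc`; an `undefined` result leaves the document on the standard text chunker.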

### Advanced: using `TreeSitterChunker` directly

Direct use of `TreeSitterChunker` is rarely needed when using `LibScopeLite` — setting `language` on `LiteDoc` covers most cases. Use `TreeSitterChunker` directly only when you need access to the raw `CodeChunk` objects (e.g., to extract line numbers for display, filter by node type, or build custom chunk titles):

```ts
import { LibScopeLite } from "libscope/lite";
32 changes: 31 additions & 1 deletion docs/reference/lite-api.md
@@ -77,7 +77,7 @@ interface LiteOptions {
async index(docs: LiteDoc[]): Promise<void>
```

Index an array of pre-parsed documents. Each document is chunked using the markdown-aware chunker, embedded, and stored.
Index an array of pre-parsed documents. Each document is chunked using the markdown-aware chunker (or code-aware tree-sitter chunker when `language` is set), embedded, and stored.

**`LiteDoc`**

@@ -107,9 +107,39 @@ interface LiteDoc {

/** Topic ID to associate the document with for topic-scoped search. */
topicId?: string;

/**
* Language alias for code-aware tree-sitter chunking.
* When set and tree-sitter is available, chunks at function/class boundaries
* instead of text boundaries. Falls back silently to the standard text chunker
* if tree-sitter is not installed or parsing fails.
*
* Supported languages and aliases:
* - `"typescript"` (aliases: `"ts"`, `"tsx"`)
* - `"javascript"` (aliases: `"js"`, `"jsx"`, `"mjs"`, `"cjs"`)
* - `"python"` (alias: `"py"`)
* - `"csharp"` (alias: `"cs"`)
* - `"cpp"` (aliases: `"cc"`, `"cxx"`, `"hpp"`, `"h"`)
* - `"c"`
* - `"go"`
*/
language?: string;
}
```

**`LiteDoc` properties:**

| Property | Type | Required | Description |
|---|---|---|---|
| `title` | `string` | Yes | Document title. Used in search result display and title boosting. |
| `content` | `string` | Yes | Full document text. Will be chunked before embedding. |
| `url` | `string` | No | Source URL for deduplication — replaced if content hash changed, skipped if unchanged. |
| `sourceType` | `string` | No | `"manual"` (default), `"library"`, `"topic"`, or `"model-generated"`. |
| `library` | `string` | No | Library namespace for scoped search. |
| `version` | `string` | No | Library version. Used with `library` for version-scoped search. |
| `topicId` | `string` | No | Topic ID to associate the document with for topic-scoped search. |
| `language` | `string` | No | Language alias for code-aware tree-sitter chunking (e.g. `"typescript"`, `"cpp"`, `"go"`). When set and tree-sitter is available, chunks at function/class boundaries instead of text boundaries. |

**Example:**

```ts
16 changes: 10 additions & 6 deletions src/core/indexing.ts
@@ -26,6 +28,8 @@ export interface IndexDocumentInput {
dedupOptions?: DedupOptions | undefined;
/** ISO 8601 expiry timestamp. Document will be pruned by pruneExpiredDocuments() after this time. */
expiresAt?: string | undefined;
/** If set, skip chunkContent() and use these directly as document chunks. */
preChunked?: string[] | undefined;
}

export interface IndexedDocument {
@@ -430,13 +432,15 @@ export async function indexDocument(
if (titleResult) return titleResult;

const docId = randomUUID();
const useStreaming = input.content.length > STREAMING_THRESHOLD;
const chunks = useStreaming ? chunkContentStreaming(input.content) : chunkContent(input.content);
let chunks: string[];
if (input.preChunked && input.preChunked.length > 0) {
chunks = input.preChunked;
} else {
const useStreaming = input.content.length > STREAMING_THRESHOLD;
chunks = useStreaming ? chunkContentStreaming(input.content) : chunkContent(input.content);
}

log.info(
{ docId, title: input.title, chunkCount: chunks.length, streaming: useStreaming },
"Indexing document",
);
log.info({ docId, title: input.title, chunkCount: chunks.length }, "Indexing document");

const metaPrefix = buildMetaPrefix(input);
const textsForEmbedding = chunks.map((c) => metaPrefix + c);
18 changes: 18 additions & 0 deletions src/lite/core.ts
@@ -11,6 +11,7 @@ import { bulkDelete } from "../core/bulk.js";
import { rateDocument } from "../core/ratings.js";
import { askQuestion, getContextForQuestion, type LlmProvider } from "../core/rag.js";
import { normalizeRawInput } from "./normalize.js";
import { TreeSitterChunker } from "./chunker-treesitter.js";
import type {
LiteOptions,
LiteDoc,
@@ -25,6 +26,11 @@ export class LibScopeLite {
private readonly db: Database.Database;
private readonly provider: EmbeddingProvider;
private readonly llmProvider: LlmProvider | null;
private _chunker: TreeSitterChunker | undefined;
private get chunker(): TreeSitterChunker {
this._chunker ??= new TreeSitterChunker();
return this._chunker;
}

constructor(opts: LiteOptions = {}) {
this.provider = opts.provider ?? new LocalEmbeddingProvider();
@@ -49,6 +55,17 @@ export class LibScopeLite {

async index(docs: LiteDoc[]): Promise<void> {
for (const doc of docs) {
let preChunked: string[] | undefined;

if (doc.language && this.chunker.supports(doc.language)) {
try {
const codeChunks = await this.chunker.chunk(doc.content, doc.language);
preChunked = codeChunks.map((c) => c.content);
} catch {
// tree-sitter not installed or parse failed — fall back to text chunker
}
}

await indexDocument(this.db, this.provider, {
title: doc.title,
content: doc.content,
@@ -57,6 +74,7 @@ export class LibScopeLite {
version: doc.version,
topicId: doc.topicId,
url: doc.url,
preChunked,
});
}
}
2 changes: 2 additions & 0 deletions src/lite/index.ts
@@ -1,4 +1,6 @@
export { LibScopeLite } from "./core.js";
export { TreeSitterChunker } from "./chunker-treesitter.js";
export type { CodeChunk } from "./chunker-treesitter.js";
export type {
LiteOptions,
LiteDoc,
69 changes: 68 additions & 1 deletion tests/unit/indexing.test.ts
@@ -1,9 +1,14 @@
import { describe, it, expect } from "vitest";
import { describe, it, expect, beforeEach, afterEach } from "vitest";
import {
chunkContent,
chunkContentStreaming,
STREAMING_THRESHOLD,
indexDocument,
} from "../../src/core/indexing.js";
import Database from "better-sqlite3";
import { runMigrations, createVectorTable } from "../../src/db/schema.js";
import { createDatabase } from "../../src/db/connection.js";
import { MockEmbeddingProvider } from "../fixtures/mock-provider.js";

describe("chunkContent", () => {
it("should split content by markdown headings", () => {
@@ -321,3 +326,65 @@ describe("STREAMING_THRESHOLD", () => {
expect(STREAMING_THRESHOLD).toBe(1024 * 1024);
});
});

describe("indexDocument preChunked", () => {
let db: Database.Database;
let provider: MockEmbeddingProvider;

beforeEach(() => {
db = createDatabase(":memory:");
runMigrations(db);
try {
createVectorTable(db, 4);
} catch {
/* sqlite-vec not available */
}
provider = new MockEmbeddingProvider();
});

afterEach(() => {
db.close();
});

it("uses preChunked when provided, bypassing chunkContent", async () => {
const preChunked = ["chunk one", "chunk two", "chunk three"];
const result = await indexDocument(db, provider, {
title: "Test File",
content: "some content",
sourceType: "manual",
preChunked,
});

expect(result.chunkCount).toBe(3);

const chunks = db
.prepare("SELECT content FROM chunks WHERE document_id = ? ORDER BY chunk_index")
.all(result.id) as Array<{ content: string }>;
expect(chunks).toHaveLength(3);
expect(chunks[0]?.content).toBe("chunk one");
expect(chunks[1]?.content).toBe("chunk two");
expect(chunks[2]?.content).toBe("chunk three");
});

it("falls back to text chunking when preChunked is empty", async () => {
const result = await indexDocument(db, provider, {
title: "Test File",
content: "Some content that will be chunked normally.",
sourceType: "manual",
preChunked: [],
});

// Should have chunked via chunkContent (at least 1 chunk)
expect(result.chunkCount).toBeGreaterThanOrEqual(1);
});

it("falls back to text chunking when preChunked is undefined", async () => {
const result = await indexDocument(db, provider, {
title: "Test File",
content: "Some content without preChunked.",
sourceType: "manual",
});

expect(result.chunkCount).toBeGreaterThanOrEqual(1);
});
});
76 changes: 76 additions & 0 deletions tests/unit/lite.test.ts
@@ -1,6 +1,7 @@
import { describe, it, expect, beforeEach, afterEach, vi } from "vitest";
import { LibScopeLite } from "../../src/lite/index.js";
import { MockEmbeddingProvider } from "../fixtures/mock-provider.js";
import { TreeSitterChunker } from "../../src/lite/chunker-treesitter.js";
import type { LlmProvider } from "../../src/core/rag.js";

function* fakeStream(): Generator<string> {
@@ -269,4 +270,79 @@ describe("LibScopeLite", () => {
expect(() => instance.close()).not.toThrow();
});
});

describe("index() with language/tree-sitter chunking", () => {
afterEach(() => {
vi.restoreAllMocks();
});

it("calls TreeSitterChunker.chunk() when language is set and supported", async () => {
vi.spyOn(TreeSitterChunker.prototype, "supports").mockReturnValue(true);
const chunkSpy = vi.spyOn(TreeSitterChunker.prototype, "chunk").mockResolvedValue([
{
content: "function foo() {}",
startLine: 1,
endLine: 3,
nodeType: "function_declaration",
},
{
content: "function bar() {}",
startLine: 5,
endLine: 7,
nodeType: "function_declaration",
},
]);

await lite.index([
{
title: "src/main.ts",
content: "function foo() {}\nfunction bar() {}",
language: "typescript",
},
]);

expect(chunkSpy).toHaveBeenCalledWith("function foo() {}\nfunction bar() {}", "typescript");
});

it("does not call chunk() when language is not set", async () => {
const chunkSpy = vi.spyOn(TreeSitterChunker.prototype, "chunk");

await lite.index([{ title: "Doc", content: "Some content here." }]);

expect(chunkSpy).not.toHaveBeenCalled();
});

it("falls back silently when tree-sitter throws", async () => {
vi.spyOn(TreeSitterChunker.prototype, "supports").mockReturnValue(true);
vi.spyOn(TreeSitterChunker.prototype, "chunk").mockRejectedValue(
new Error("tree-sitter not installed"),
);

// Should not throw — fallback to text chunker
await expect(
lite.index([
{
title: "src/main.go",
content: "package main\nfunc main() {}",
language: "go",
},
]),
).resolves.toBeUndefined();
});

it("does not call chunk() when language is set but not supported", async () => {
vi.spyOn(TreeSitterChunker.prototype, "supports").mockReturnValue(false);
const chunkSpy = vi.spyOn(TreeSitterChunker.prototype, "chunk");

await lite.index([
{
title: "src/main.rb",
content: "def hello; end",
language: "ruby",
},
]);

expect(chunkSpy).not.toHaveBeenCalled();
});
});
});
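
The mocked chunks in these tests carry `content`, `startLine`, `endLine`, and `nodeType`. As a rough sketch of what a caller might do with such objects when using the chunker directly, the helpers below build display titles and filter by node type. `CodeChunkLike`, `chunkTitle`, and `filterByNodeType` are hypothetical names; the shape is assumed from the mocks and the real `CodeChunk` type may carry more fields.

```ts
// Shape assumed from the mocked chunks in the tests; not the library's own type.
interface CodeChunkLike {
  content: string;
  startLine: number;
  endLine: number;
  nodeType: string;
}

/** Builds a display title in the form "<file>:<startLine>-<endLine> (<nodeType>)". */
function chunkTitle(file: string, chunk: CodeChunkLike): string {
  return `${file}:${chunk.startLine}-${chunk.endLine} (${chunk.nodeType})`;
}

/** Keeps only the chunks whose node type appears in the allowed list. */
function filterByNodeType(chunks: CodeChunkLike[], allowed: string[]): CodeChunkLike[] {
  const allowedSet = new Set(allowed);
  return chunks.filter((chunk) => allowedSet.has(chunk.nodeType));
}
```

Note that `index()` keeps only each chunk's `content`, so line numbers and node types are available only when the chunker is called directly.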