feat: URL spidering - crawl linked pages with configurable depth and page limits (#315) #343
…page limits (#315)

Adds opt-in spidering to URL indexing. A single seed URL can now crawl and index an entire documentation site or wiki section in one call.

New files:
- src/core/link-extractor.ts: indexOf-based <a href> extraction, relative URL resolution, fragment stripping, dedup, scheme filtering. No regex.
- src/core/spider.ts: BFS crawl engine with sameDomain, pathPrefix, excludePatterns (glob), maxPages (hard cap 200), maxDepth (hard cap 5), 10-min total timeout, robots.txt (User-agent: * and libscope), and 1s inter-request delay. Yields SpiderResult per page; returns SpiderStats.
- tests/unit/link-extractor.test.ts: 25 tests covering relative resolution, dedup, fragment stripping, scheme filtering, attribute order, edge cases.
- tests/unit/spider.test.ts: 20 tests covering BFS order, depth/page limits, domain + path + pattern filtering, cycle detection, robots.txt, partial failure recovery, stats, and abortReason reporting.

Modified:
- src/core/url-fetcher.ts: adds fetchRaw() export returning raw body + contentType + finalUrl before HTML-to-markdown conversion, so the spider can extract links from HTML before conversion.
- src/api/routes.ts: POST /api/v1/documents/url now accepts spider, maxPages, maxDepth, sameDomain, pathPrefix, excludePatterns. Returns { documents, pagesIndexed, pagesCrawled, pagesSkipped, errors, abortReason }.
- src/mcp/server.ts: submit-document tool gains spider, maxPages, maxDepth, sameDomain, pathPrefix, excludePatterns parameters.

Safety: all fetched URLs pass through the existing SSRF validation in fetchRaw() (DNS resolution, private IP blocking, scheme allowlist). Hard limits (200 pages, depth 5, 10min) cannot be overridden by callers. robots.txt is fetched once per origin and Disallow rules are honoured. Individual page failures do not abort the crawl.

Closes #315

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
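The BFS crawl with page and depth caps described in this commit message can be pictured roughly as below. This is a simplified, synchronous sketch and not the code in src/core/spider.ts: `crawlOrder`, `getLinks`, and `CrawlLimits` are hypothetical names, and the real spider is described as an async generator that also applies filtering, robots.txt rules, an inter-request delay, and a total timeout. Only the cap values (200 pages, depth 5) come from the description above.

```typescript
// Hypothetical sketch: breadth-first traversal with hard caps on pages and depth.
const HARD_MAX_PAGES = 200; // cap values taken from the PR description
const HARD_MAX_DEPTH = 5;

interface CrawlLimits {
  maxPages?: number;
  maxDepth?: number;
}

function crawlOrder(
  seed: string,
  getLinks: (url: string) => string[], // stand-in for fetch + link extraction
  limits: CrawlLimits = {},
): string[] {
  // Caller-supplied limits can lower the caps but never raise them.
  const maxPages = Math.min(limits.maxPages ?? HARD_MAX_PAGES, HARD_MAX_PAGES);
  const maxDepth = Math.min(limits.maxDepth ?? HARD_MAX_DEPTH, HARD_MAX_DEPTH);

  const visited = new Set<string>([seed]);
  const queue: Array<{ url: string; depth: number }> = [{ url: seed, depth: 0 }];
  const order: string[] = [];

  while (queue.length > 0 && order.length < maxPages) {
    const next = queue.shift();
    if (next === undefined) break;
    order.push(next.url);
    if (next.depth >= maxDepth) continue; // never enqueue children past maxDepth
    for (const link of getLinks(next.url)) {
      if (!visited.has(link)) {
        visited.add(link); // mark on enqueue so cycles are not queued twice
        queue.push({ url: link, depth: next.depth + 1 });
      }
    }
  }
  return order;
}
```

Marking a URL as visited when it is enqueued, rather than when it is fetched, is one way to keep cyclic links from filling the queue with duplicates.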
- Remove unnecessary type assertions (routes.ts, mcp/server.ts): TypeScript already narrows SpiderResult/SpiderStats from the generator
- Add explicit return type annotation on mock fetchRaw to satisfy no-unsafe-return rule in spider.test.ts
- Replace .resolves.not.toThrow() with a direct assertion: vitest .resolves requires a Promise, not an async function

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pull request overview
Adds opt-in URL spidering to the URL indexing flow so a single seed URL can crawl and index linked pages (within configurable depth/page limits), reusing existing SSRF protections and integrating with both the REST API and MCP tooling.
Changes:
- Introduces a link extractor and BFS spider engine to crawl and yield pages from a seed URL.
- Exposes spidering options via POST /api/v1/documents/url and the submit-document MCP tool.
- Adds fetchRaw() to the URL fetcher to provide raw HTML/text for link extraction, plus new unit test coverage for the spider and extractor.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| src/core/link-extractor.ts | New indexOf-based <a href> extraction, normalization, and scheme filtering. |
| src/core/spider.ts | New BFS spider async generator with limits, filtering, and robots.txt handling. |
| src/core/url-fetcher.ts | Adds fetchRaw() export returning raw body/contentType/finalUrl for spider use. |
| src/api/routes.ts | Adds spider options and spider-mode response fields to POST /api/v1/documents/url. |
| src/mcp/server.ts | Adds spider parameters and spider-mode behavior to submit-document tool. |
| tests/unit/link-extractor.test.ts | New unit tests for link extraction and normalization edge cases. |
| tests/unit/spider.test.ts | New unit tests covering BFS order, limits, filtering, robots, and failure handling. |
src/api/routes.ts
Outdated
| const spiderOptions: SpiderOptions = { | ||
| fetchOptions, | ||
| ...(typeof b["maxPages"] === "number" && { maxPages: b["maxPages"] }), | ||
| ...(typeof b["maxDepth"] === "number" && { maxDepth: b["maxDepth"] }), |
The API accepts maxPages/maxDepth as any number and passes them straight through. Non-integers, NaN, or negative values currently lead to surprising behavior (e.g., the crawl loop won’t run at all if maxPages becomes NaN/negative). Consider validating/coercing these inputs at the route level (e.g., ensure finite integers and clamp to sensible min/max) before building spiderOptions.
Suggested change:
| const rawMaxPages = b["maxPages"]; | |
| const rawMaxDepth = b["maxDepth"]; | |
| let maxPages: number | undefined; | |
| if (rawMaxPages !== undefined) { | |
| if (typeof rawMaxPages !== "number" || !Number.isFinite(rawMaxPages)) { | |
| sendError(res, 400, "VALIDATION_ERROR", "Field 'maxPages' must be a finite number"); | |
| return; | |
| } | |
| const normalized = Math.trunc(rawMaxPages); | |
| if (normalized <= 0) { | |
| sendError(res, 400, "VALIDATION_ERROR", "Field 'maxPages' must be a positive integer"); | |
| return; | |
| } | |
| // Clamp to the spider's hard cap (200 pages, per the PR description) | |
| const MAX_SPIDER_PAGES = 200; | |
| maxPages = Math.min(normalized, MAX_SPIDER_PAGES); | |
| } | |
| let maxDepth: number | undefined; | |
| if (rawMaxDepth !== undefined) { | |
| if (typeof rawMaxDepth !== "number" || !Number.isFinite(rawMaxDepth)) { | |
| sendError(res, 400, "VALIDATION_ERROR", "Field 'maxDepth' must be a finite number"); | |
| return; | |
| } | |
| const normalized = Math.trunc(rawMaxDepth); | |
| if (normalized <= 0) { | |
| sendError(res, 400, "VALIDATION_ERROR", "Field 'maxDepth' must be a positive integer"); | |
| return; | |
| } | |
| // Clamp to the spider's hard cap (depth 5, per the PR description) | |
| const MAX_SPIDER_DEPTH = 5; | |
| maxDepth = Math.min(normalized, MAX_SPIDER_DEPTH); | |
| } | |
| const spiderOptions: SpiderOptions = { | |
| fetchOptions, | |
| ...(maxPages !== undefined && { maxPages }), | |
| ...(maxDepth !== undefined && { maxDepth }), |
| // Convert to markdown | ||
| const isHtml = raw.contentType.includes("text/html"); | ||
| const content = isHtml ? htmlToMarkdown(raw.body) : raw.body; | ||
| const title = isHtml ? extractTitle(raw.body, url) : extractTextTitle(raw.body, url); | ||
| stats.pagesIndexed++; | ||
| yield { url, title, content, depth }; | ||
| // Extract and enqueue child links if we haven't hit maxDepth | ||
| if (depth < maxDepth) { | ||
| if (isHtml) { | ||
| const links = extractLinks(raw.body, raw.finalUrl || url); | ||
| for (const link of links) { | ||
| if (!visited.has(link)) { | ||
| queue.push({ url: link, depth: depth + 1 }); | ||
| } | ||
| } | ||
| log.debug({ url, linksFound: links.length }, "Spider: extracted links"); |
The spider tracks visited and yields/indexes using the pre-redirect URL (url), but link extraction uses raw.finalUrl. If a page redirects, this can lead to (1) indexing the wrong canonical URL and (2) fetching the final URL again later because visited contains only the pre-redirect URL. Consider normalizing to raw.finalUrl after fetch (use it for visited, for yielding, and as the base for further link extraction).
Suggested change:
| const finalUrl = raw.finalUrl || url; | |
| visited.add(finalUrl); | |
| // Convert to markdown | |
| const isHtml = raw.contentType.includes("text/html"); | |
| const content = isHtml ? htmlToMarkdown(raw.body) : raw.body; | |
| const title = isHtml | |
| ? extractTitle(raw.body, finalUrl) | |
| : extractTextTitle(raw.body, finalUrl); | |
| stats.pagesIndexed++; | |
| yield { url: finalUrl, title, content, depth }; | |
| // Extract and enqueue child links if we haven't hit maxDepth | |
| if (depth < maxDepth) { | |
| if (isHtml) { | |
| const links = extractLinks(raw.body, finalUrl); | |
| for (const link of links) { | |
| if (!visited.has(link)) { | |
| queue.push({ url: link, depth: depth + 1 }); | |
| } | |
| } | |
| log.debug({ url: finalUrl, linksFound: links.length }, "Spider: extracted links"); |
| export interface SpiderStats { | ||
| pagesIndexed: number; | ||
| pagesCrawled: number; | ||
| pagesSkipped: number; | ||
| errors: Array<{ url: string; error: string }>; | ||
| abortReason?: "maxPages" | "timeout"; | ||
| } |
SpiderStats.pagesIndexed is produced by the spider itself, but the spider doesn’t actually perform indexing (it fetches + converts + yields). This name is likely to mislead callers into thinking it represents successfully indexed documents. Consider renaming to something like pagesFetched/pagesYielded (and let API/MCP layers report their own "indexed" counts).
| // Index from URL (with optional spidering) | ||
| if (pathname === "/api/v1/documents/url" && method === "POST") { | ||
| const body = await parseJsonBody(req); |
This PR adds new request parameters/response fields for POST /api/v1/documents/url, but src/api/openapi.ts doesn’t appear to document them (no references to "spider" there). Please update the OpenAPI spec so clients can discover spider, maxPages, maxDepth, sameDomain, pathPrefix, excludePatterns, and the new response fields.
src/mcp/server.ts
Outdated
| ...(params.maxPages !== undefined && { maxPages: params.maxPages }), | ||
| ...(params.maxDepth !== undefined && { maxDepth: params.maxDepth }), | ||
| ...(params.sameDomain !== undefined && { sameDomain: params.sameDomain }), | ||
| ...(params.pathPrefix !== undefined && { pathPrefix: params.pathPrefix }), | ||
| ...(params.excludePatterns !== undefined && { excludePatterns: params.excludePatterns }), | ||
| }; | ||
The conditional spread pattern ...(condition && { ... }) will create a false | {...} type and can fail strict TypeScript checks (and it isn’t the pattern used elsewhere in this repo). Consider switching to condition ? { ... } : {} spreads or building the options object via explicit if assignments.
Suggested change:
| }; | |
| if (params.maxPages !== undefined) { | |
| spiderOptions.maxPages = params.maxPages; | |
| } | |
| if (params.maxDepth !== undefined) { | |
| spiderOptions.maxDepth = params.maxDepth; | |
| } | |
| if (params.sameDomain !== undefined) { | |
| spiderOptions.sameDomain = params.sameDomain; | |
| } | |
| if (params.pathPrefix !== undefined) { | |
| spiderOptions.pathPrefix = params.pathPrefix; | |
| } | |
| if (params.excludePatterns !== undefined) { | |
| spiderOptions.excludePatterns = params.excludePatterns; | |
| } |
| while (searchPos < lowerTag.length) { | ||
| const hrefIdx = lowerTag.indexOf("href", searchPos); | ||
| if (hrefIdx === -1) return null; | ||
| // Skip whitespace before = | ||
| let eqIdx = hrefIdx + 4; | ||
| while (eqIdx < tag.length && (tag[eqIdx] === " " || tag[eqIdx] === "\t")) eqIdx++; | ||
| if (tag[eqIdx] !== "=") { | ||
| searchPos = hrefIdx + 4; | ||
| continue; | ||
| } |
extractHref() searches for the substring "href" anywhere in the tag, which can incorrectly treat attributes like data-href / aria-href as real links (since "href" is a suffix and is followed by =). This can cause the spider to crawl URLs that aren’t actual anchors. Consider requiring an attribute boundary (e.g., preceding char is whitespace/< and the next non-whitespace char after href is =), or explicitly matching href as a full attribute name.
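One way to express the attribute-boundary check this comment asks for is sketched below. `isHrefAttributeAt` and `ATTR_WHITESPACE` are hypothetical names, not functions from src/core/link-extractor.ts, and the sketch assumes the tag text has already been lowercased. It only accepts whitespace before `href`, since inside an `<a ...>` tag the attribute name cannot directly follow `<`.

```typescript
// Hypothetical helper: true only when "href" at `idx` is a whole attribute name.
const ATTR_WHITESPACE = [" ", "\t", "\n", "\r", "\f"];

function isHrefAttributeAt(lowerTag: string, idx: number): boolean {
  if (!lowerTag.startsWith("href", idx)) return false;
  // The character before must be whitespace, so "data-href" is rejected.
  const before = idx > 0 ? lowerTag.charAt(idx - 1) : "";
  if (!ATTR_WHITESPACE.includes(before)) return false;
  // After optional whitespace, the next character must be "=",
  // so longer names such as "hreflang" are rejected too.
  let pos = idx + 4;
  while (pos < lowerTag.length && ATTR_WHITESPACE.includes(lowerTag.charAt(pos))) pos++;
  return lowerTag.charAt(pos) === "=";
}
```

A caller would keep scanning with `indexOf("href", idx + 4)` whenever the check returns false, much like the `continue` branch in the snippet quoted above.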
src/core/spider.ts
Outdated
| // Fetch robots.txt once for the origin | ||
| const disallowed = await fetchRobotsTxt(seedOrigin, fetchOptions); | ||
| log.debug({ origin: seedOrigin, rules: disallowed.size }, "Loaded robots.txt rules"); | ||
Robots.txt is fetched only once for the seed origin (fetchRobotsTxt(seedOrigin, ...)). When crawling subdomains (allowed by isSameDomain) or when sameDomain=false, this ends up applying the seed origin’s rules to other origins and ignores their own robots.txt entirely, which contradicts the stated behavior and can violate robots policies. Consider maintaining a per-origin cache (Map<origin, disallowSet>) and fetching/parsing robots.txt for each new origin encountered.
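The per-origin cache suggested here could look roughly like the sketch below. All names (`createRobotsCache`, `RobotsLoader`, `DisallowRules`) are hypothetical, the loader is injected in place of the repo's fetchRobotsTxt (whose exact signature is not assumed here), and Disallow rules are treated as plain path prefixes, which is a simplification.

```typescript
// Hypothetical sketch of a lazily filled, per-origin robots.txt cache.
type DisallowRules = Set<string>;
type RobotsLoader = (origin: string) => Promise<DisallowRules>;

function createRobotsCache(load: RobotsLoader): (pageUrl: string) => Promise<boolean> {
  // Caching the promise (not the resolved set) lets overlapping lookups share one fetch.
  const byOrigin = new Map<string, Promise<DisallowRules>>();

  return async (pageUrl: string): Promise<boolean> => {
    const parsed = new URL(pageUrl);
    let pending = byOrigin.get(parsed.origin);
    if (pending === undefined) {
      pending = load(parsed.origin); // first time this origin is seen
      byOrigin.set(parsed.origin, pending);
    }
    const rules = await pending;
    for (const prefix of rules) {
      if (parsed.pathname.startsWith(prefix)) return false; // disallowed path
    }
    return true;
  };
}
```

In this sketch a loader that rejects leaves a rejected promise in the cache; a real implementation would probably want the loader to resolve to an empty rule set when robots.txt cannot be fetched.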
src/core/spider.ts
Outdated
| ): Promise<Set<string>> { | ||
| const robotsUrl = origin + "/robots.txt"; | ||
| try { | ||
| const raw = await fetchRaw(robotsUrl, { timeout: 10_000, ...fetchOptions }); |
In fetchRobotsTxt(), the object spread order allows caller-provided fetchOptions.timeout to override the intended 10s robots.txt timeout ({ timeout: 10_000, ...fetchOptions }). If the goal is a hard cap for robots.txt fetches, spread fetchOptions first and then set timeout (or explicitly clamp it).
Suggested change:
| const effectiveTimeout = | |
| fetchOptions?.timeout !== undefined && Number.isFinite(fetchOptions.timeout) | |
| ? Math.min(fetchOptions.timeout, 10_000) | |
| : 10_000; | |
| const raw = await fetchRaw(robotsUrl, { ...fetchOptions, timeout: effectiveTimeout }); |
src/api/routes.ts
Outdated
| ...(typeof b["maxPages"] === "number" && { maxPages: b["maxPages"] }), | ||
| ...(typeof b["maxDepth"] === "number" && { maxDepth: b["maxDepth"] }), | ||
| ...(typeof b["sameDomain"] === "boolean" && { sameDomain: b["sameDomain"] }), | ||
| ...(typeof b["pathPrefix"] === "string" && { pathPrefix: b["pathPrefix"] }), | ||
| ...(Array.isArray(b["excludePatterns"]) && { | ||
| excludePatterns: (b["excludePatterns"] as unknown[]).filter( | ||
| (p): p is string => typeof p === "string", | ||
| ), | ||
| }), |
The conditional spread pattern ...(condition && { ... }) will produce a false | {...} type and is not used elsewhere in this codebase (which typically uses condition ? {...} : {} for spreads). In strict TS this can fail type-checking (Spread types may only be created from object types). Consider switching to ternaries or building spiderOptions via imperative if statements.
Suggested change:
| ...(typeof b["maxPages"] === "number" ? { maxPages: b["maxPages"] } : {}), | |
| ...(typeof b["maxDepth"] === "number" ? { maxDepth: b["maxDepth"] } : {}), | |
| ...(typeof b["sameDomain"] === "boolean" ? { sameDomain: b["sameDomain"] } : {}), | |
| ...(typeof b["pathPrefix"] === "string" ? { pathPrefix: b["pathPrefix"] } : {}), | |
| ...(Array.isArray(b["excludePatterns"]) | |
| ? { | |
| excludePatterns: (b["excludePatterns"] as unknown[]).filter( | |
| (p): p is string => typeof p === "string", | |
| ), | |
| } | |
| : {}), |
src/api/routes.ts
Outdated
| try { | ||
| const gen = spiderUrl(url, spiderOptions); | ||
| let result = await gen.next(); | ||
| while (!result.done) { | ||
| const page = result.value; | ||
| try { | ||
| const doc = await indexDocument(db, provider, { | ||
| content: page.content, | ||
| title: page.title, | ||
| sourceType: "manual", | ||
| url: page.url, | ||
| topicId, | ||
| }); | ||
| indexedDocs.push({ id: doc.id, title: page.title, url: page.url }); | ||
| } catch (indexErr) { | ||
| const msg = indexErr instanceof Error ? indexErr.message : String(indexErr); | ||
| errors.push({ url: page.url, error: msg }); | ||
| } | ||
| result = await gen.next(); | ||
| } | ||
| // result.value is SpiderStats when done (generator is exhausted) | ||
| if (result.done && result.value) { | ||
| stats = result.value; | ||
| stats.errors = errors; | ||
| } | ||
| } catch (err) { | ||
| const msg = err instanceof Error ? err.message : String(err); | ||
| sendError(res, 502, "FETCH_ERROR", msg); | ||
| return; | ||
| } |
In spider mode, fetch/init errors are caught and returned as HTTP 502, but single-URL mode relies on the outer handler (which currently doesn’t map FetchError to a 4xx/5xx explicitly). This creates inconsistent error semantics for the same endpoint. Consider removing the inner try/catch and letting the top-level handler decide, or adding consistent FetchError handling at the top level (and reusing it for both spider and non-spider paths).
Suggested change:
| const gen = spiderUrl(url, spiderOptions); | |
| let result = await gen.next(); | |
| while (!result.done) { | |
| const page = result.value; | |
| try { | |
| const doc = await indexDocument(db, provider, { | |
| content: page.content, | |
| title: page.title, | |
| sourceType: "manual", | |
| url: page.url, | |
| topicId, | |
| }); | |
| indexedDocs.push({ id: doc.id, title: page.title, url: page.url }); | |
| } catch (indexErr) { | |
| const msg = indexErr instanceof Error ? indexErr.message : String(indexErr); | |
| errors.push({ url: page.url, error: msg }); | |
| } | |
| result = await gen.next(); | |
| } | |
| // result.value is SpiderStats when done (generator is exhausted) | |
| if (result.done && result.value) { | |
| stats = result.value; | |
| stats.errors = errors; | |
| } |
link-extractor.ts (CodeQL #30, incomplete URL scheme check): Replace the enumerated scheme blocklist (javascript:, vbscript:, etc.) with a strict http/https allowlist check on the resolved URL protocol. An allowlist is exhaustive by definition; a blocklist will always miss obscure schemes like vbscript:, blob:, or future additions.

spider.ts (CodeQL #31, incomplete multi-character sanitization): Replace the regex-based tag stripper /<[^>]+>/g in extractTitle() with an indexOf-based stripTags() function. The regex stops at the first > which can be inside a quoted attribute value (e.g. <img alt="a>b">), potentially leaving partial tag content in the extracted title. The new implementation walks quoted attribute values explicitly so no tag content leaks through regardless of its internal structure.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
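For illustration, an http/https allowlist on the resolved protocol, as described in the commit message above, might look like the following. `resolveHttpLink` and `ALLOWED_PROTOCOLS` are hypothetical names; the actual code in link-extractor.ts may be structured differently. The sketch relies only on the standard WHATWG `URL` class, which lowercases the scheme during parsing.

```typescript
// Hypothetical helper: resolve a raw href against a base URL and accept it
// only when the resolved protocol is on the http/https allowlist.
const ALLOWED_PROTOCOLS = new Set(["http:", "https:"]);

function resolveHttpLink(rawHref: string, baseUrl: string): string | null {
  let resolved: URL;
  try {
    resolved = new URL(rawHref, baseUrl);
  } catch {
    return null; // unparseable href
  }
  if (!ALLOWED_PROTOCOLS.has(resolved.protocol)) return null;
  resolved.hash = ""; // strip the fragment so #section variants dedupe
  return resolved.href;
}
```

Because the check runs on the parsed protocol rather than on the raw string, mixed-case inputs such as `JAVASCRIPT:` are rejected without any extra normalization.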
- link-extractor: add word-boundary check in extractHref to prevent matching data-href, aria-href (false positives on non-href attributes)
- spider: rename pagesIndexed → pagesFetched throughout (SpiderStats interface already used pagesFetched; sync implementation + tests)
- spider: per-origin robots.txt cache (Map<origin, Set>) fetched lazily as new origins are encountered during crawl (was seed-only before)
- spider: normalize to raw.finalUrl after redirects: visited set, yielded URL, and link-extraction base all use the canonical URL
- routes: validate maxPages/maxDepth are finite positive integers
- routes: change conditional spread &&-patterns to ternaries
- routes: remove inner try/catch for spider fetch errors; add FetchError to top-level handler (consistent with single-URL mode → 502)
- mcp/server: replace conditional spreads with explicit if assignments
- mcp/server: validate spider=true requires url (throws ValidationError)
- openapi: document spider request fields in IndexFromUrlRequest schema, add SpiderResponse schema, update 201 response to oneOf

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
| { | ||
| documents: indexedDocs, | ||
| pagesFetched: indexedDocs.length, | ||
| pagesCrawled: stats.pagesCrawled, | ||
| pagesSkipped: stats.pagesSkipped, | ||
| errors, | ||
| abortReason: stats.abortReason ?? null, | ||
| }, |
API response naming/semantics look inconsistent with the PR description: the PR text says the endpoint returns pagesIndexed, but the implementation returns pagesFetched, and sets it to indexedDocs.length (indexed pages), not the spider’s fetched/yielded count. Consider aligning the response fields with the documented contract (e.g. pagesIndexed for indexedDocs length, and optionally include pagesFetched from SpiderStats.pagesFetched if you need both).
| let seedOrigin: string; | ||
| try { | ||
| const parsed = new URL(seedUrl); | ||
| seedHostname = parsed.hostname; | ||
| seedOrigin = parsed.origin; | ||
| } catch { | ||
| throw new FetchError("Invalid seed URL: " + seedUrl); |
spiderUrl treats an invalid seedUrl as a FetchError (throw new FetchError("Invalid seed URL...")). In the API handler, FetchError is mapped to HTTP 502, so malformed input will surface as a server error instead of a 400 validation error. Consider throwing ValidationError here (or another error that routes maps to 400) for URL parsing failures.
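A small sketch of the split this comment describes: malformed input becomes a validation error (400), while upstream fetch failures stay a 502. The two error classes below are local stand-ins so the example is self-contained; the repo's own ValidationError and FetchError may have different constructors. `parseSeedUrl` and `statusFor` are hypothetical names.

```typescript
// Local stand-ins for the repo's error classes; the real ones may differ.
class ValidationError extends Error {}
class FetchError extends Error {}

function parseSeedUrl(seedUrl: string): URL {
  try {
    return new URL(seedUrl);
  } catch {
    // Malformed caller input, not an upstream failure.
    throw new ValidationError("Invalid seed URL: " + seedUrl);
  }
}

// Hypothetical status mapping for a top-level handler.
function statusFor(err: unknown): number {
  if (err instanceof ValidationError) return 400;
  if (err instanceof FetchError) return 502;
  return 500;
}
```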
| // Normalize to the final URL after any redirects. | ||
| // This ensures the visited set, yielded URL, and link-extraction base are all consistent. | ||
| const canonicalUrl = raw.finalUrl || url; | ||
| if (canonicalUrl !== url) { | ||
| visited.add(canonicalUrl); | ||
| } | ||
| // Convert to markdown | ||
| const isHtml = raw.contentType.includes("text/html"); | ||
| const content = isHtml ? htmlToMarkdown(raw.body) : raw.body; | ||
| const title = isHtml | ||
| ? extractTitle(raw.body, canonicalUrl) | ||
| : extractTextTitle(raw.body, canonicalUrl); | ||
| stats.pagesFetched++; | ||
| yield { url: canonicalUrl, title, content, depth }; | ||
| // Extract and enqueue child links if we haven't hit maxDepth | ||
| if (depth < maxDepth) { | ||
| if (isHtml) { | ||
| const links = extractLinks(raw.body, canonicalUrl); |
Redirects can bypass crawl filters: filtering (sameDomain/pathPrefix/excludePatterns/robots) happens before fetch, but after fetchRaw you switch to canonicalUrl = raw.finalUrl and yield it without re-checking. If an in-scope URL redirects to an out-of-scope domain/path, it will still be fetched/yielded and used as the base for link extraction. Re-apply the filtering/robots checks to canonicalUrl (and update stats) before yielding/enqueueing children.
Suggested change:
| // Convert to markdown | |
| const isHtml = raw.contentType.includes("text/html"); | |
| const content = isHtml ? htmlToMarkdown(raw.body) : raw.body; | |
| const title = isHtml | |
| ? extractTitle(raw.body, url) | |
| : extractTextTitle(raw.body, url); | |
| stats.pagesFetched++; | |
| yield { url, title, content, depth }; | |
| // Extract and enqueue child links if we haven't hit maxDepth | |
| if (depth < maxDepth) { | |
| if (isHtml) { | |
| const links = extractLinks(raw.body, url); |
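An alternative to dropping the canonical URL is to keep it and re-run the scope filters on it after the fetch, as the comment text proposes. The sketch below shows that idea under stated assumptions: `isInScope`, `acceptFetched`, and `ScopeOptions` are hypothetical names, only the sameDomain and pathPrefix filters are modelled (excludePatterns and robots.txt are omitted), and subdomains of the seed host are treated as in scope, following the review note that isSameDomain allows them.

```typescript
// Hypothetical scope check, usable before fetching and again after a redirect.
interface ScopeOptions {
  seedHostname: string;
  sameDomain: boolean;
  pathPrefix?: string;
}

function isInScope(candidateUrl: string, scope: ScopeOptions): boolean {
  let parsed: URL;
  try {
    parsed = new URL(candidateUrl);
  } catch {
    return false;
  }
  if (scope.sameDomain) {
    const host = parsed.hostname;
    // Assumption: the seed host and its subdomains count as the same domain.
    const sameOrSub = host === scope.seedHostname || host.endsWith("." + scope.seedHostname);
    if (!sameOrSub) return false;
  }
  if (scope.pathPrefix !== undefined && !parsed.pathname.startsWith(scope.pathPrefix)) {
    return false;
  }
  return true;
}

// Returns the URL a fetched page should be recorded under, or null when a
// redirect moved it out of scope and it should be counted as skipped.
function acceptFetched(
  requestedUrl: string,
  finalUrl: string | undefined,
  scope: ScopeOptions,
): string | null {
  const canonicalUrl = finalUrl || requestedUrl;
  return isInScope(canonicalUrl, scope) ? canonicalUrl : null;
}
```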
| charAfterA !== ">" && | ||
| charAfterA !== "/" | ||
| ) { | ||
| pos = tagStart + 2; | ||
| continue; | ||
| } |
extractLinks finds the end of the <a ...> opening tag via html.indexOf(">", tagStart). This will terminate early if a > appears inside a quoted attribute value (valid HTML), causing extractHref to parse an incomplete tag and potentially miss/extract incorrect links. The tag-end scan should skip over quoted sections (similar to stripTags in spider.ts) instead of using a raw indexOf.
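A quote-aware scan for the end of the opening tag, in the spirit of this comment, could be sketched as follows. `findTagEnd` is a hypothetical name and this is not the stripTags code from spider.ts. It is a simplification: any quote character toggles quoted mode, so an apostrophe inside an unquoted attribute value would be misread.

```typescript
// Hypothetical helper: index of the ">" that closes the tag starting at
// tagStart, ignoring any ">" inside single- or double-quoted attribute values.
function findTagEnd(html: string, tagStart: number): number {
  let quote = ""; // the quote character we are currently inside, if any
  for (let i = tagStart + 1; i < html.length; i++) {
    const ch = html.charAt(i);
    if (quote !== "") {
      if (ch === quote) quote = ""; // closing quote
    } else if (ch === '"' || ch === "'") {
      quote = ch; // opening quote
    } else if (ch === ">") {
      return i;
    }
  }
  return -1; // unterminated tag
}
```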
| // which can't happen in a valid <a> tag and so we skip it). | ||
| const charBefore = hrefIdx > 0 ? lowerTag[hrefIdx - 1] : ""; | ||
| if (charBefore !== " " && charBefore !== "\t" && charBefore !== "\n" && charBefore !== "\r") { | ||
| searchPos = hrefIdx + 4; | ||
| continue; | ||
| } | ||
| // Skip whitespace before = | ||
| let eqIdx = hrefIdx + 4; | ||
| while (eqIdx < tag.length && (tag[eqIdx] === " " || tag[eqIdx] === "\t")) eqIdx++; | ||
| if (tag[eqIdx] !== "=") { | ||
| searchPos = hrefIdx + 4; |
extractHref only treats space/tab as whitespace around the = and before the value. Newlines/CR are also valid attribute whitespace in HTML, so href\n=... or href=\n"..." won't be parsed. Consider using the same whitespace set used elsewhere in this file (space/tab/\n/\r) for the loops that skip whitespace.
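The whitespace handling this comment asks for can be factored into one shared helper, sketched below. `skipWhitespace`, `readAttributeValue`, and `HTML_WHITESPACE` are hypothetical names rather than code from link-extractor.ts. The set includes form feed in addition to the space/tab/\n/\r mentioned above, since HTML also treats it as whitespace.

```typescript
// Characters HTML treats as whitespace between attribute tokens.
const HTML_WHITESPACE = " \t\n\r\f";

function skipWhitespace(text: string, pos: number): number {
  while (pos < text.length && HTML_WHITESPACE.includes(text.charAt(pos))) pos++;
  return pos;
}

// Hypothetical: given the index just past an attribute name, return the
// attribute's value, or null when no "=" or no value follows.
function readAttributeValue(tag: string, afterName: number): string | null {
  let pos = skipWhitespace(tag, afterName);
  if (tag.charAt(pos) !== "=") return null;
  pos = skipWhitespace(tag, pos + 1);
  const quote = tag.charAt(pos);
  if (quote === '"' || quote === "'") {
    const close = tag.indexOf(quote, pos + 1);
    return close === -1 ? null : tag.slice(pos + 1, close);
  }
  // Unquoted value: runs until whitespace or ">".
  let end = pos;
  while (end < tag.length && !HTML_WHITESPACE.includes(tag.charAt(end)) && tag.charAt(end) !== ">") {
    end++;
  }
  return end > pos ? tag.slice(pos, end) : null;
}
```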
| stats = result.value; | ||
| stats.errors = errors; |
In spider mode, stats.errors = errors overwrites the spider's own fetch/robots/etc error list returned from spiderUrl. This can silently drop crawl errors. Consider keeping these as separate arrays (e.g. crawlErrors vs indexErrors) or merging them rather than replacing the generator’s stats.
Suggested change:
| const generatorStats = result.value; | |
| const combinedErrors = | |
| Array.isArray(generatorStats.errors) && generatorStats.errors.length > 0 | |
| ? [...generatorStats.errors, ...errors] | |
| : errors; | |
| stats = { | |
| ...generatorStats, | |
| errors: combinedErrors, | |
| }; |
* chore: add development branch workflow (#327)

  * chore: add development branch workflow
    - Add merge-gate.yml: enforces only 'development' can merge into main
    - Update CI/CodeQL/Docker workflows to run on both main and development
    - Update dependabot.yml: target-branch set to development for all ecosystems
    - Update copilot-instructions.md: document branch workflow convention
    - Rulesets configured: Main (requires merge-gate + squash-only), Development (requires CI status checks + PR)
    - Default branch set to development
    - All open PRs retargeted to development
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

  * fix: skip Vercel preview deployments on non-main branches
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

  * chore: trigger check refresh

  ---------
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat: create-pack from local folder or URL sources (#329)

  * fix: comprehensive audit fixes - security, performance, resilience, API hardening
    Addresses findings from issue #314:
    - SSRF protection for webhook URLs (CRITICAL)
    - Scrub secrets from exports
    - Stored XSS prevention on document URL
    - O(n²) and N+1 fixes in bulk operations
    - Rate limit cache eviction improvement
    - SSE backpressure handling
    - Replace raw Error() with typed errors
    - Fetch timeouts on all network calls
    - Input validation on API parameters
    - Search query length limit
    - Silent catch block logging
    - DNS rebinding check fix
    - N+1 in Slack user resolution
    - Pagination on webhook/search list endpoints
    Closes #314
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

  * Address PR review comments: fix SSE listener leak, preserve caller signals, add SSRF validation to webhook test, chunk SQL params, use dynamic import in test
    - SSE backpressure: create single disconnect promise, race against drain (no listener accumulation)
    - http-utils.ts/onenote.ts: use AbortSignal.any() to combine caller signal with timeout
    - Webhook test endpoint: validate URL with validateWebhookUrlSsrf before fetch
    - bulk.ts: chunk IN clause to 999 params max (SQLite limit)
    - webhooks.test.ts: dynamic import after vi.mock() for deterministic DNS mocking
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

  * ci: consolidate and fix CI/CD workflows
    - Merge lint + typecheck into single job (saves one npm ci)
    - Add concurrency groups to ci, docker, codeql (cancel stale runs)
    - Add dependency-review-action on PRs (block vulnerable deps)
    - Add workflow_call trigger to ci.yml for reusability
    - Remove duplicate npm publish from release.yml (release-please owns it)
    - Fix SDK paths: sdk-go/ → sdk/go/, sdk-python/ → sdk/python/
    - Fix Dependabot paths to match actual SDK directories
    - Add github-actions ecosystem to Dependabot (keep actions up to date)
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

  * feat: add HTML file parser for .html/.htm document indexing
    Adds HtmlParser using the existing node-html-markdown dependency. Strips <script>, <style>, and <nav> tags before conversion. Registered for .html and .htm extensions. Includes 12 tests covering conversion, tag stripping, edge cases.
    Closes #317
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

  * fix: address CodeQL and review comments on HTML parser
    - Replace regex-based tag stripping with node-html-markdown's native ignore option - eliminates all 3 CodeQL alerts (incomplete sanitization, bad HTML filtering regexp)
    - Wrap translate() in try/catch, throw ValidationError (consistent with other parsers)
    - Use trimEnd() instead of trim() to preserve leading indentation
    - Reuse single NHM instance for efficiency
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

  * Revert "fix: skip Vercel preview deployments on non-main branches"
    This reverts commit eb48187.

  * feat: add --from option to pack create for folder/URL sources
    Adds createPackFromSource() that builds packs directly from local folders, files, or URLs without requiring database interaction.
    CLI: libscope pack create --name my-pack --from ~/docs/ [--extensions .md,.html] [--exclude pattern] [--no-recursive]
    Features:
    - Walks directories recursively using registered parsers
    - Fetches URLs via fetchAndConvert
    - Supports extension filtering, exclude patterns, progress callback
    - Multiple --from sources supported
    - Output format identical to DB export (pack install works unchanged)
    Closes #328
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

  * style: fix prettier formatting
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

  * feat: add gzip support for pack files (.json.gz)
    Pack files can now be compressed with gzip for smaller distribution:
    - writePackFile/readPackFile auto-detect gzip by extension or magic bytes
    - installPack accepts both .json and .json.gz files
    - createPackFromSource defaults to .json.gz output (source packs can be large)
    - createPack (DB export) still defaults to .json
    - Auto-detects gzip by magic bytes even if extension is .json
    5 new tests covering gzip write, install, magic byte detection, and round-trip.
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

  * feat: add progress logging and fix dedup handling in pack install
    - Log each document as it's indexed so large installs show progress
    - Change pack install to use dedup: 'skip' for graceful duplicate handling
    - Make title+content_length dedup check respect the dedup mode setting (previously it always threw ValidationError regardless of dedup mode)
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

  * feat: auto-generate tags during pack creation and apply on install
    - Export tokenize() and add suggestTagsFromText() in tags.ts for DB-free tag generation
    - createPackFromSource() now auto-generates tags per document via TF-IDF
    - installPack() applies doc.tags via addTagsToDocument() after indexing
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

  ---------
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat: add HTML file parser for .html/.htm document indexing (#318)

  * feat: add HTML file parser for .html/.htm document indexing
    Adds HtmlParser using the existing node-html-markdown dependency. Strips <script>, <style>, and <nav> tags before conversion. Registered for .html and .htm extensions. Includes 12 tests covering conversion, tag stripping, edge cases.
    Closes #317
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

  * fix: address CodeQL and review comments on HTML parser
    - Replace regex-based tag stripping with node-html-markdown's native ignore option - eliminates all 3 CodeQL alerts (incomplete sanitization, bad HTML filtering regexp)
    - Wrap translate() in try/catch, throw ValidationError (consistent with other parsers)
    - Use trimEnd() instead of trim() to preserve leading indentation
    - Reuse single NHM instance for efficiency
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

  * Revert "fix: skip Vercel preview deployments on non-main branches"
    This reverts commit eb48187.

  ---------
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* build(deps-dev): Bump eslint-config-prettier from 9.1.2 to 10.1.8 (#325)

  Bumps [eslint-config-prettier](https://github.com/prettier/eslint-config-prettier) from 9.1.2 to 10.1.8.
  - [Release notes](https://github.com/prettier/eslint-config-prettier/releases)
  - [Changelog](https://github.com/prettier/eslint-config-prettier/blob/main/CHANGELOG.md)
  - [Commits](https://github.com/prettier/eslint-config-prettier/commits/v10.1.8)

  ---
  updated-dependencies:
  - dependency-name: eslint-config-prettier
    dependency-version: 10.1.8
    dependency-type: direct:development
    update-type: version-update:semver-major
  ...

  Signed-off-by: dependabot[bot] <support@github.com>
  Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps-dev): Bump lint-staged from 16.3.1 to 16.3.2

  Bumps the minor-and-patch group with 1 update: [lint-staged](https://github.com/lint-staged/lint-staged).
  Updates `lint-staged` from 16.3.1 to 16.3.2
  - [Release notes](https://github.com/lint-staged/lint-staged/releases)
  - [Changelog](https://github.com/lint-staged/lint-staged/blob/main/CHANGELOG.md)
  - [Commits](lint-staged/lint-staged@v16.3.1...v16.3.2)

  ---
  updated-dependencies:
  - dependency-name: lint-staged
    dependency-version: 16.3.2
    dependency-type: direct:development
    update-type: version-update:semver-patch
    dependency-group: minor-and-patch
  ...
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * build(deps): Bump the actions group with 5 updates Bumps the actions group with 5 updates: | Package | From | To | | --- | --- | --- | | [actions/checkout](https://github.com/actions/checkout) | `4` | `6` | | [actions/setup-node](https://github.com/actions/setup-node) | `4` | `6` | | [actions/upload-artifact](https://github.com/actions/upload-artifact) | `4` | `7` | | [actions/setup-python](https://github.com/actions/setup-python) | `5` | `6` | | [actions/setup-go](https://github.com/actions/setup-go) | `5` | `6` | Updates `actions/checkout` from 4 to 6 - [Release notes](https://github.com/actions/checkout/releases) - [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md) - [Commits](actions/checkout@v4...v6) Updates `actions/setup-node` from 4 to 6 - [Release notes](https://github.com/actions/setup-node/releases) - [Commits](actions/setup-node@v4...v6) Updates `actions/upload-artifact` from 4 to 7 - [Release notes](https://github.com/actions/upload-artifact/releases) - [Commits](actions/upload-artifact@v4...v7) Updates `actions/setup-python` from 5 to 6 - [Release notes](https://github.com/actions/setup-python/releases) - [Commits](actions/setup-python@v5...v6) Updates `actions/setup-go` from 5 to 6 - [Release notes](https://github.com/actions/setup-go/releases) - [Commits](actions/setup-go@v5...v6) --- updated-dependencies: - dependency-name: actions/checkout dependency-version: '6' dependency-type: direct:production update-type: version-update:semver-major dependency-group: actions - dependency-name: actions/setup-node dependency-version: '6' dependency-type: direct:production update-type: version-update:semver-major dependency-group: actions - dependency-name: actions/upload-artifact dependency-version: '7' dependency-type: direct:production update-type: version-update:semver-major dependency-group: actions - 
dependency-name: actions/setup-python dependency-version: '6' dependency-type: direct:production update-type: version-update:semver-major dependency-group: actions - dependency-name: actions/setup-go dependency-version: '6' dependency-type: direct:production update-type: version-update:semver-major dependency-group: actions ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * build(deps-dev): Bump @types/node from 22.19.13 to 25.3.3 Bumps [@types/node](https://github.com/DefinitelyTyped/DefinitelyTyped/tree/HEAD/types/node) from 22.19.13 to 25.3.3. - [Release notes](https://github.com/DefinitelyTyped/DefinitelyTyped/releases) - [Commits](https://github.com/DefinitelyTyped/DefinitelyTyped/commits/HEAD/types/node) --- updated-dependencies: - dependency-name: "@types/node" dependency-version: 25.3.3 dependency-type: direct:development update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * build(deps): Bump better-sqlite3 from 11.10.0 to 12.6.2 Bumps [better-sqlite3](https://github.com/WiseLibs/better-sqlite3) from 11.10.0 to 12.6.2. - [Release notes](https://github.com/WiseLibs/better-sqlite3/releases) - [Commits](WiseLibs/better-sqlite3@v11.10.0...v12.6.2) --- updated-dependencies: - dependency-name: better-sqlite3 dependency-version: 12.6.2 dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * build(deps-dev): Bump eslint from 9.39.3 to 10.0.2 Bumps [eslint](https://github.com/eslint/eslint) from 9.39.3 to 10.0.2. 
- [Release notes](https://github.com/eslint/eslint/releases) - [Commits](eslint/eslint@v9.39.3...v10.0.2) --- updated-dependencies: - dependency-name: eslint dependency-version: 10.0.2 dependency-type: direct:development update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Robert DeRienzo <rderienzo@voloridge.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat: add passthrough LLM mode for ask-question tool (#335) * feat: add passthrough LLM mode for ask-question tool Adds llm.provider = "passthrough" so the ask-question MCP tool returns retrieved context chunks directly to the calling LLM instead of requiring a separate OpenAI/Ollama provider. This is the natural design for MCP tools where the client already has an LLM (e.g. Claude Code). - config.ts: add "passthrough" to llm.provider union type and env var handling - rag.ts: add isPassthroughMode() helper and getContextForQuestion() which retrieves and formats context without an LLM call - mcp/server.ts: ask-question checks passthrough first and returns context directly; falls through to existing LLM path otherwise Enable via config: { "llm": { "provider": "passthrough" } } Enable via env var: LIBSCOPE_LLM_PROVIDER=passthrough Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: format config.ts and include passthrough in provider override - Reformat long if-condition to satisfy prettier (printWidth: 100) - Fix logic bug: passthrough provider was checked in outer condition but not spread into overrides.llm.provider Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: address 9 audit findings from issue #332 (#333) * fix: address 9 audit findings from issue #332 Security - middleware: use timingSafeEqual for API key comparison (#2) - url-fetcher, 
http-utils: replace process-wide NODE_TLS_REJECT_UNAUTHORIZED mutation with per-request undici Agent to eliminate TLS race condition (#1) Bugs - indexing: re-throw unexpected embedding errors so transaction rolls back instead of silently committing chunks with no vector (#3) - search: replace correlated minRating subquery with avg_r.avg_rating from the pre-joined aggregate in FTS and LIKE search paths (#4) Performance - bulk: replace O(n²) docs.find() loops with pre-built Map; replace per-document getDocumentTags() calls with a single getDocumentTagsBatch() query (#5) - config: add 30-second TTL cache to loadConfig() so disk reads are not repeated on every request (#6) Code quality - routes: check res.write() return value to handle SSE backpressure (#7) - reindex: delegate to schema.createVectorTable() instead of duplicating the vec0 DDL inline (#8) - obsidian: replace hand-rolled parseSimpleYaml() with js-yaml, normalise Date objects back to ISO-8601 strings (#9) Docs - agents.md: expand architecture tree to include src/api/ and src/connectors/; add Security Patterns section with correct undici examples - CONTRIBUTING.md: fix check-suite command (npm test → npm run test:coverage) and correct coverage threshold (80% → actual 75%/74%) Tests - bulk.test: add dateFrom/dateTo filter coverage - config.test: add cache-hit test; call invalidateConfigCache() before env-var tests so TTL cache doesn't return stale results Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: remove unused warnIfTlsBypassMissing function Dead code after conflict resolution chose the per-request undici Agent approach (which doesn't need a warning about NODE_TLS_REJECT_UNAUTHORIZED). 
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: update tests for config cache and retry semantics - Add invalidateConfigCache() before loadConfig() in 4 env-override tests that were failing because the 30s TTL cache introduced in the config module was returning stale results from the previous test's cache entry - Update http-utils retry assertion: maxRetries=2 means 1 initial + 2 retries = 3 total calls (loop is attempt <= maxRetries) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * feat: CLI logging improvements and pack installation performance (#330) (#336) - Add `src/cli/reporter.ts`: PrettyReporter (ANSI colors + \r progress bar), SilentReporter (no-op for verbose/JSON mode), `isVerbose()` and `createReporter()` factory. `LIBSCOPE_VERBOSE=1` env var alternative. - Update `setupLogging` to default to "silent" in CLI mode (pretty reporter handles user-facing output). Verbose/`--log-level` flags still route to structured JSON pino logs. Fix duplicate `initLogger` calls in onenote connect/disconnect commands to use `setupLogging` consistently. - Update `installPack` in `packs.ts` to support batch embedding and progress reporting: - New `InstallOptions` interface with `batchSize`, `resumeFrom`, `onProgress` fields - Batch documents: chunk all → single `provider.embedBatch` call per batch → single SQLite transaction per batch (avoids N embedding calls) - `resumeFrom` skips the first N documents (enables partial install resume after failure) - `InstallResult` now includes `errors` count - Add `--batch-size` and `--resume-from` CLI options to `pack install` - Tests: `tests/unit/reporter.test.ts` (17 tests covering PrettyReporter, SilentReporter, isVerbose, env var detection); extended `tests/unit/packs.test.ts` with 7 new tests for progress callbacks, batch efficiency, resumeFrom, embedBatch failure handling. 
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * Claude/fix issue 331 s1qzu (#338) * build(deps): Bump the npm_and_yarn group across 1 directory with 2 updates (#337) Bumps the npm_and_yarn group with 2 updates in the / directory: [@hono/node-server](https://github.com/honojs/node-server) and [hono](https://github.com/honojs/hono). Updates `@hono/node-server` from 1.19.9 to 1.19.10 - [Release notes](https://github.com/honojs/node-server/releases) - [Commits](honojs/node-server@v1.19.9...v1.19.10) Updates `hono` from 4.12.3 to 4.12.5 - [Release notes](https://github.com/honojs/hono/releases) - [Commits](honojs/hono@v4.12.3...v4.12.5) --- updated-dependencies: - dependency-name: "@hono/node-server" dependency-version: 1.19.10 dependency-type: indirect dependency-group: npm_and_yarn - dependency-name: hono dependency-version: 4.12.5 dependency-type: indirect dependency-group: npm_and_yarn ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * fix: eliminate SSRF TOCTOU and ReDoS vulnerabilities (#341) * fix: eliminate SSRF TOCTOU and ReDoS vulnerabilities (closes #340) **SSRF (CWE-918 — CodeQL alert #28)** Replace the two-step validate-then-fetch approach in url-fetcher.ts with IP-pinned requests using node:http / node:https directly. validateUrl() resolves DNS and checks for private IPs, then the validated IP is passed straight to the TCP connection (hostname: pinnedIp, servername: original hostname for TLS SNI). There is now zero TOCTOU window between validation and the actual network request. The redundant post-fetch DNS rebinding check and the env-var-based TLS bypass wrapper are removed; rejectUnauthorized is now passed directly to the request options. An internal _setRequestImpl hook is exported for unit test injection so tests can stub responses without touching node:http / node:https. Tests are updated accordingly. 
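As a rough sketch of the IP-pinning idea in the SSRF paragraph above: once DNS has been resolved and the address checked, the request options carry the validated IP as `hostname` and the original host as `servername`. Everything named here (`PinnedRequestOptions`, `isPrivateIpv4`, `buildPinnedOptions`) is hypothetical and is not the code in url-fetcher.ts. The sketch assumes IPv4 only, covers only the most common private ranges, and adds a `Host` header as an assumption about what a pinned request would need.

```typescript
// Hypothetical shape; mirrors the subset of node:http(s) request options used here.
interface PinnedRequestOptions {
  hostname: string; // the validated IP the socket connects to
  servername: string; // original hostname, used for TLS SNI
  port: number;
  path: string;
  headers: Record<string, string>;
}

/** Conservative IPv4 check: anything unparseable is treated as unsafe. */
function isPrivateIpv4(ip: string): boolean {
  const octets = ip.split(".");
  if (octets.length !== 4) return true;
  const parts = octets.map((o) => (/^\d{1,3}$/.test(o) ? Number(o) : NaN));
  if (parts.some((p) => isNaN(p) || p > 255)) return true;
  const a = parts[0];
  const b = parts[1];
  return (
    a === 0 ||
    a === 10 ||
    a === 127 ||
    (a === 169 && b === 254) ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168)
  );
}

/** Build options that connect to the already-validated IP, so no second DNS lookup happens. */
function buildPinnedOptions(rawUrl: string, pinnedIp: string): PinnedRequestOptions {
  const url = new URL(rawUrl);
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    throw new Error(`Unsupported scheme: ${url.protocol}`);
  }
  if (isPrivateIpv4(pinnedIp)) {
    throw new Error(`Refusing to connect to private address ${pinnedIp}`);
  }
  const defaultPort = url.protocol === "https:" ? 443 : 80;
  return {
    hostname: pinnedIp,
    servername: url.hostname,
    port: url.port ? Number(url.port) : defaultPort,
    path: url.pathname + url.search,
    headers: { Host: url.host },
  };
}
```

Because the socket is opened against the IP that was just validated, a DNS answer that changes between the check and the request has nothing left to influence.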
**ReDoS (CWE-1333 — CodeQL alert #24)** Five regexes in confluence.ts used the pattern [^>]*ac:name="X"[^>]* — two [^>]* quantifiers around a fixed literal. For input that contains a large attribute blob without the target ac:name value, the engine must try all O(n²) splits before concluding no match (catastrophic backtracking). Fix: rewrite the leading [^>]* as (?:(?!ac:name="X")[^>])* — a negative lookahead prevents the quantifier from overlapping with the literal, making backtracking structurally impossible. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: replace remaining polynomial regexes in confluence.ts with indexOf helpers The conflict resolution left the original `/<ac:structured-macro [^>]*>([\s\S]*?) <\/ac:structured-macro>/gi` patterns in place for code blocks and info/tip panels (those were not part of the original security fix diff). These have the same O(n²) backtracking problem: with k opening tags and no closing tags, [\s\S]*? scans O(n - pos) chars per attempt, totalling O(n²). Replace the entire convertConfluenceStorage function with the indexOf-based approach (replaceStructuredMacros / replaceTagPairs / extractTagContent helpers) that eliminates all regex-based tag parsing. Also add removeSelfClosingMacros to handle the self-closing TOC case without regex, since the previous self-closing fix still used a [^>]*ac:name="toc"[^>]* pattern. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * feat: concurrent pack installation and -v verbose shorthand (issue #330) (#339) * feat: concurrent pack installation and -v verbose shorthand (issue #330) Add concurrent batch embedding to installPack for significant performance improvement on large packs, plus CLI ergonomics improvements. 
Key changes: - `InstallOptions.concurrency` (default: 4): controls how many embedBatch calls run simultaneously; embedding is I/O-bound so parallelism directly reduces wall-clock installation time - Refactor installPack to pre-chunk all documents upfront, then use a semaphore-based scheduler to run up to `concurrency` embedBatch calls concurrently while inserting completed batches in-order (SQLite requires serialised writes); progress callbacks fire after each batch as before - `pack install --concurrency <n>` CLI flag exposes the new option - `-v` shorthand for `--verbose` on the global program options - Fix transaction install-count tracking: count committed docs accurately without relying on subtract-on-failure arithmetic - Add 6 new tests covering concurrency=1 sequential, concurrency=4 parallel, multiple embedBatch calls per install, concurrency limit enforcement, incremental progress reporting, and partial-failure error counting https://claude.ai/code/session_019hzhbEgV1ysnGmFVXBWzkZ * fix: address all 4 Copilot review comments on PR #339 - Validate batchSize, concurrency, resumeFrom at the start of installPack and throw ValidationError for invalid values (comments 3 & 4). Concurrency <= 0 would silently hang the semaphore indefinitely. - Add CLI lower-bound guard: --concurrency < 1 exits with a user-facing error before ever calling installPack (comment 3). - Lazy chunking: pre-chunking all documents upfront held chunks for the entire pack in memory simultaneously. Batches now store only the raw documents; resolveBatch() chunks on demand right before embedBatch is called, so only one batch's worth of chunks is in memory at a time (comment 2). - Wrap provider.embedBatch() in try/catch so synchronous throws are converted to rejected Promises rather than escaping scheduleNext() and leaving the outer Promise permanently pending (comment 1). 
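One way to reconcile concurrent embedBatch calls with the in-order, serialised inserts described in the key changes above is a small reorder buffer: batches may finish in any order, but each is handed to the writer only once every earlier batch has been written. The sketch below is an assumption about how that could look, not the installPack implementation; `InOrderCommitter` and its `commit` callback are hypothetical names.

```typescript
// Hypothetical reorder buffer; batch indices are assumed to start at 0 and be contiguous.
class InOrderCommitter<T> {
  private readonly pending = new Map<number, T>();
  private readonly commit: (index: number, value: T) => void;
  private next = 0;

  constructor(commit: (index: number, value: T) => void) {
    this.commit = commit;
  }

  /** Record a finished batch, then commit it and any consecutive successors that are ready. */
  complete(index: number, value: T): void {
    this.pending.set(index, value);
    while (this.pending.has(this.next)) {
      const ready = this.pending.get(this.next) as T;
      this.pending.delete(this.next);
      this.commit(this.next, ready);
      this.next += 1;
    }
  }
}
```

In this arrangement the scheduler would call `complete()` as each embedding promise resolves, and `commit` would wrap the per-batch SQLite transaction and the progress callback.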
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude <noreply@anthropic.com> * fix: address 7 pre-release bugs from audit (#342) (#344) - Guard JSON.parse in rowToWebhook with try/catch, default to [] - Guard JSON.parse in rowToSavedSearch with try/catch, default to null - Guard JSON.parse in loadDbConnectorConfig with try/catch, throw ConfigError - Push dateFrom/dateTo filters into SQL WHERE in listDocuments (before LIMIT) - Validate negative limit in resolveSelector, throw ValidationError - Replace manual substring extension parsing with path.extname() in packs.ts - Verified reporter.ts is already tracked on development (no action needed) - Added tests for all fixes (corrupted JSON, SQL-level date filtering, negative limit) Closes #342 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat: URL spidering — crawl linked pages with configurable depth and page limits (#315) (#343) * feat: URL spidering — crawl linked pages with configurable depth and page limits (#315) Adds opt-in spidering to URL indexing. A single seed URL can now crawl and index an entire documentation site or wiki section in one call. New files: - src/core/link-extractor.ts: indexOf-based <a href> extraction, relative URL resolution, fragment stripping, dedup, scheme filtering. No regex. - src/core/spider.ts: BFS crawl engine with sameDomain, pathPrefix, excludePatterns (glob), maxPages (hard cap 200), maxDepth (hard cap 5), 10-min total timeout, robots.txt (User-agent: * and libscope), and 1s inter-request delay. Yields SpiderResult per page; returns SpiderStats. - tests/unit/link-extractor.test.ts: 25 tests covering relative resolution, dedup, fragment stripping, scheme filtering, attribute order, edge cases. - tests/unit/spider.test.ts: 20 tests covering BFS order, depth/page limits, domain + path + pattern filtering, cycle detection, robots.txt, partial failure recovery, stats, and abortReason reporting. 
Modified: - src/core/url-fetcher.ts: adds fetchRaw() export returning raw body + contentType + finalUrl before HTML-to-markdown conversion, so the spider can extract links from HTML before conversion. - src/api/routes.ts: POST /api/v1/documents/url now accepts spider, maxPages, maxDepth, sameDomain, pathPrefix, excludePatterns. Returns { documents, pagesIndexed, pagesCrawled, pagesSkipped, errors, abortReason }. - src/mcp/server.ts: submit-document tool gains spider, maxPages, maxDepth, sameDomain, pathPrefix, excludePatterns parameters. Safety: all fetched URLs pass through the existing SSRF validation in fetchRaw() (DNS resolution, private IP blocking, scheme allowlist). Hard limits (200 pages, depth 5, 10min) cannot be overridden by callers. robots.txt is fetched once per origin and Disallow rules are honoured. Individual page failures do not abort the crawl. Closes #315 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: resolve CI lint errors in spider implementation - Remove unnecessary type assertions (routes.ts, mcp/server.ts) — TypeScript already narrows SpiderResult/SpiderStats from the generator - Add explicit return type annotation on mock fetchRaw to satisfy no-unsafe-return rule in spider.test.ts - Replace .resolves.not.toThrow() with a direct assertion — vitest .resolves requires a Promise, not an async function Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: address CodeQL security findings in spider/link-extractor link-extractor.ts (CodeQL #30 — incomplete URL scheme check): Replace the enumerated scheme blocklist (javascript:, vbscript:, etc.) with a strict http/https allowlist check on the resolved URL protocol. An allowlist is exhaustive by definition; a blocklist will always miss obscure schemes like vbscript:, blob:, or future additions. spider.ts (CodeQL #31 — incomplete multi-character sanitization): Replace the regex-based tag stripper /<[^>]+>/g in extractTitle() with an indexOf-based stripTags() function. 
The regex stops at the first > which can be inside a quoted attribute value (e.g. <img alt="a>b">), potentially leaving partial tag content in the extracted title. The new implementation walks quoted attribute values explicitly so no tag content leaks through regardless of its internal structure. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: address all Copilot review comments on spider PR (#343) - link-extractor: add word-boundary check in extractHref to prevent matching data-href, aria-href (false positives on non-href attributes) - spider: rename pagesIndexed → pagesFetched throughout (SpiderStats interface already used pagesFetched; sync implementation + tests) - spider: per-origin robots.txt cache (Map<origin, Set>) fetched lazily as new origins are encountered during crawl (was seed-only before) - spider: normalize to raw.finalUrl after redirects — visited set, yielded URL, and link-extraction base all use the canonical URL - routes: validate maxPages/maxDepth are finite positive integers - routes: change conditional spread &&-patterns to ternaries - routes: remove inner try/catch for spider fetch errors; add FetchError to top-level handler (consistent with single-URL mode → 502) - mcp/server: replace conditional spreads with explicit if assignments - mcp/server: validate spider=true requires url (throws ValidationError) - openapi: document spider request fields in IndexFromUrlRequest schema, add SpiderResponse schema, update 201 response to oneOf Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * style: fix prettier formatting in spider files Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Claude Sonnet 4.6 
<noreply@anthropic.com>
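The indexOf-based tag stripping described in the CodeQL #31 fix above could look roughly like the sketch below. It is not the `stripTags()` in spider.ts, only an illustration of the same idea under simple assumptions: attribute values are quoted with `"` or `'`, and an unclosed tag swallows the rest of the input.

```typescript
// Illustrative only: a quote-aware tag stripper that uses indexOf instead of a regex.
function stripTags(html: string): string {
  let out = "";
  let i = 0;
  while (i < html.length) {
    const open = html.indexOf("<", i);
    if (open === -1) {
      out += html.slice(i); // no more tags: keep the remaining text
      break;
    }
    out += html.slice(i, open); // keep the text before the tag
    // Walk the tag, skipping quoted attribute values so a ">" inside quotes does not end it.
    let j = open + 1;
    let quote = "";
    while (j < html.length) {
      const ch = html.charAt(j);
      if (quote !== "") {
        if (ch === quote) quote = "";
      } else if (ch === '"' || ch === "'") {
        quote = ch;
      } else if (ch === ">") {
        break;
      }
      j += 1;
    }
    i = j + 1; // continue after the closing ">" (or past the end if the tag never closes)
  }
  return out;
}
```

For the input `<img alt="a>b">Title`, this returns `Title`, whereas the regex `/<[^>]+>/g` ends the tag at the first `>` and leaves `b">Title` behind.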
* chore: add development branch workflow (#327) * chore: add development branch workflow - Add merge-gate.yml: enforces only 'development' can merge into main - Update CI/CodeQL/Docker workflows to run on both main and development - Update dependabot.yml: target-branch set to development for all ecosystems - Update copilot-instructions.md: document branch workflow convention - Rulesets configured: Main (requires merge-gate + squash-only), Development (requires CI status checks + PR) - Default branch set to development - All open PRs retargeted to development * fix: skip Vercel preview deployments on non-main branches * chore: trigger check refresh --------- * feat: create-pack from local folder or URL sources (#329) * fix: comprehensive audit fixes — security, performance, resilience, API hardening Addresses findings from issue #314: - SSRF protection for webhook URLs (CRITICAL) - Scrub secrets from exports - Stored XSS prevention on document URL - O(n²) and N+1 fixes in bulk operations - Rate limit cache eviction improvement - SSE backpressure handling - Replace raw Error() with typed errors - Fetch timeouts on all network calls - Input validation on API parameters - Search query length limit - Silent catch block logging - DNS rebinding check fix - N+1 in Slack user resolution - Pagination on webhook/search list endpoints Closes #314 * Address PR review comments: fix SSE listener leak, preserve caller signals, add SSRF validation to webhook test, chunk SQL params, use dynamic import in test - SSE backpressure: create single disconnect promise, race against drain (no listener accumulation) - http-utils.ts/onenote.ts: use AbortSignal.any() to combine caller signal with timeout - Webhook test endpoint: validate URL with validateWebhookUrlSsrf before fetch - bulk.ts: chunk IN clause to 999 params max (SQLite limit) - webhooks.test.ts: dynamic import after vi.mock() for deterministic DNS mocking * ci: consolidate and fix CI/CD workflows - Merge lint + typecheck into 
single job (saves one npm ci) - Add concurrency groups to ci, docker, codeql (cancel stale runs) - Add dependency-review-action on PRs (block vulnerable deps) - Add workflow_call trigger to ci.yml for reusability - Remove duplicate npm publish from release.yml (release-please owns it) - Fix SDK paths: sdk-go/ → sdk/go/, sdk-python/ → sdk/python/ - Fix Dependabot paths to match actual SDK directories - Add github-actions ecosystem to Dependabot (keep actions up to date) * feat: add HTML file parser for .html/.htm document indexing Adds HtmlParser using the existing node-html-markdown dependency. Strips <script>, <style>, and <nav> tags before conversion. Registered for .html and .htm extensions. Includes 12 tests covering conversion, tag stripping, edge cases. Closes #317 * fix: address CodeQL and review comments on HTML parser - Replace regex-based tag stripping with node-html-markdown's native ignore option — eliminates all 3 CodeQL alerts (incomplete sanitization, bad HTML filtering regexp) - Wrap translate() in try/catch, throw ValidationError (consistent with other parsers) - Use trimEnd() instead of trim() to preserve leading indentation - Reuse single NHM instance for efficiency * Revert "fix: skip Vercel preview deployments on non-main branches" This reverts commit eb48187. * feat: add --from option to pack create for folder/URL sources Adds createPackFromSource() that builds packs directly from local folders, files, or URLs without requiring database interaction. 
CLI: libscope pack create --name my-pack --from ~/docs/ [--extensions .md,.html] [--exclude pattern] [--no-recursive] Features: - Walks directories recursively using registered parsers - Fetches URLs via fetchAndConvert - Supports extension filtering, exclude patterns, progress callback - Multiple --from sources supported - Output format identical to DB export (pack install works unchanged) Closes #328 * style: fix prettier formatting * feat: add gzip support for pack files (.json.gz) Pack files can now be compressed with gzip for smaller distribution: - writePackFile/readPackFile auto-detect gzip by extension or magic bytes - installPack accepts both .json and .json.gz files - createPackFromSource defaults to .json.gz output (source packs can be large) - createPack (DB export) still defaults to .json - Auto-detects gzip by magic bytes even if extension is .json 5 new tests covering gzip write, install, magic byte detection, and round-trip. * feat: add progress logging and fix dedup handling in pack install - Log each document as it's indexed so large installs show progress - Change pack install to use dedup: 'skip' for graceful duplicate handling - Make title+content_length dedup check respect the dedup mode setting (previously it always threw ValidationError regardless of dedup mode) * feat: auto-generate tags during pack creation and apply on install - Export tokenize() and add suggestTagsFromText() in tags.ts for DB-free tag generation - createPackFromSource() now auto-generates tags per document via TF-IDF - installPack() applies doc.tags via addTagsToDocument() after indexing --------- * feat: add HTML file parser for .html/.htm document indexing (#318) * feat: add HTML file parser for .html/.htm document indexing Adds HtmlParser using the existing node-html-markdown dependency. Strips <script>, <style>, and <nav> tags before conversion. Registered for .html and .htm extensions. Includes 12 tests covering conversion, tag stripping, edge cases. 
Closes #317 * fix: address CodeQL and review comments on HTML parser - Replace regex-based tag stripping with node-html-markdown's native ignore option — eliminates all 3 CodeQL alerts (incomplete sanitization, bad HTML filtering regexp) - Wrap translate() in try/catch, throw ValidationError (consistent with other parsers) - Use trimEnd() instead of trim() to preserve leading indentation - Reuse single NHM instance for efficiency * Revert "fix: skip Vercel preview deployments on non-main branches" This reverts commit eb48187. --------- * build(deps-dev): Bump eslint-config-prettier from 9.1.2 to 10.1.8 (#325) Bumps [eslint-config-prettier](https://github.com/prettier/eslint-config-prettier) from 9.1.2 to 10.1.8. - [Release notes](https://github.com/prettier/eslint-config-prettier/releases) - [Changelog](https://github.com/prettier/eslint-config-prettier/blob/main/CHANGELOG.md) - [Commits](https://github.com/prettier/eslint-config-prettier/commits/v10.1.8) --- updated-dependencies: - dependency-name: eslint-config-prettier dependency-version: 10.1.8 dependency-type: direct:development update-type: version-update:semver-major ... * build(deps-dev): Bump lint-staged from 16.3.1 to 16.3.2 Bumps the minor-and-patch group with 1 update: [lint-staged](https://github.com/lint-staged/lint-staged). Updates `lint-staged` from 16.3.1 to 16.3.2 - [Release notes](https://github.com/lint-staged/lint-staged/releases) - [Changelog](https://github.com/lint-staged/lint-staged/blob/main/CHANGELOG.md) - [Commits](lint-staged/lint-staged@v16.3.1...v16.3.2) --- updated-dependencies: - dependency-name: lint-staged dependency-version: 16.3.2 dependency-type: direct:development update-type: version-update:semver-patch dependency-group: minor-and-patch ... 
* build(deps): Bump the actions group with 5 updates Bumps the actions group with 5 updates: | Package | From | To | | --- | --- | --- | | [actions/checkout](https://github.com/actions/checkout) | `4` | `6` | | [actions/setup-node](https://github.com/actions/setup-node) | `4` | `6` | | [actions/upload-artifact](https://github.com/actions/upload-artifact) | `4` | `7` | | [actions/setup-python](https://github.com/actions/setup-python) | `5` | `6` | | [actions/setup-go](https://github.com/actions/setup-go) | `5` | `6` | Updates `actions/checkout` from 4 to 6 - [Release notes](https://github.com/actions/checkout/releases) - [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md) - [Commits](actions/checkout@v4...v6) Updates `actions/setup-node` from 4 to 6 - [Release notes](https://github.com/actions/setup-node/releases) - [Commits](actions/setup-node@v4...v6) Updates `actions/upload-artifact` from 4 to 7 - [Release notes](https://github.com/actions/upload-artifact/releases) - [Commits](actions/upload-artifact@v4...v7) Updates `actions/setup-python` from 5 to 6 - [Release notes](https://github.com/actions/setup-python/releases) - [Commits](actions/setup-python@v5...v6) Updates `actions/setup-go` from 5 to 6 - [Release notes](https://github.com/actions/setup-go/releases) - [Commits](actions/setup-go@v5...v6) --- updated-dependencies: - dependency-name: actions/checkout dependency-version: '6' dependency-type: direct:production update-type: version-update:semver-major dependency-group: actions - dependency-name: actions/setup-node dependency-version: '6' dependency-type: direct:production update-type: version-update:semver-major dependency-group: actions - dependency-name: actions/upload-artifact dependency-version: '7' dependency-type: direct:production update-type: version-update:semver-major dependency-group: actions - dependency-name: actions/setup-python dependency-version: '6' dependency-type: direct:production update-type: 
version-update:semver-major dependency-group: actions - dependency-name: actions/setup-go dependency-version: '6' dependency-type: direct:production update-type: version-update:semver-major dependency-group: actions ... * build(deps-dev): Bump @types/node from 22.19.13 to 25.3.3 Bumps [@types/node](https://github.com/DefinitelyTyped/DefinitelyTyped/tree/HEAD/types/node) from 22.19.13 to 25.3.3. - [Release notes](https://github.com/DefinitelyTyped/DefinitelyTyped/releases) - [Commits](https://github.com/DefinitelyTyped/DefinitelyTyped/commits/HEAD/types/node) --- updated-dependencies: - dependency-name: "@types/node" dependency-version: 25.3.3 dependency-type: direct:development update-type: version-update:semver-major ... * build(deps): Bump better-sqlite3 from 11.10.0 to 12.6.2 Bumps [better-sqlite3](https://github.com/WiseLibs/better-sqlite3) from 11.10.0 to 12.6.2. - [Release notes](https://github.com/WiseLibs/better-sqlite3/releases) - [Commits](WiseLibs/better-sqlite3@v11.10.0...v12.6.2) --- updated-dependencies: - dependency-name: better-sqlite3 dependency-version: 12.6.2 dependency-type: direct:production update-type: version-update:semver-major ... * build(deps-dev): Bump eslint from 9.39.3 to 10.0.2 Bumps [eslint](https://github.com/eslint/eslint) from 9.39.3 to 10.0.2. - [Release notes](https://github.com/eslint/eslint/releases) - [Commits](eslint/eslint@v9.39.3...v10.0.2) --- updated-dependencies: - dependency-name: eslint dependency-version: 10.0.2 dependency-type: direct:development update-type: version-update:semver-major ... * feat: add passthrough LLM mode for ask-question tool (#335) * feat: add passthrough LLM mode for ask-question tool Adds llm.provider = "passthrough" so the ask-question MCP tool returns retrieved context chunks directly to the calling LLM instead of requiring a separate OpenAI/Ollama provider. This is the natural design for MCP tools where the client already has an LLM (e.g. Claude Code). 
- config.ts: add "passthrough" to llm.provider union type and env var handling - rag.ts: add isPassthroughMode() helper and getContextForQuestion() which retrieves and formats context without an LLM call - mcp/server.ts: ask-question checks passthrough first and returns context directly; falls through to existing LLM path otherwise Enable via config: { "llm": { "provider": "passthrough" } } Enable via env var: LIBSCOPE_LLM_PROVIDER=passthrough * fix: format config.ts and include passthrough in provider override - Reformat long if-condition to satisfy prettier (printWidth: 100) - Fix logic bug: passthrough provider was checked in outer condition but not spread into overrides.llm.provider --------- * fix: address 9 audit findings from issue #332 (#333) * fix: address 9 audit findings from issue #332 Security - middleware: use timingSafeEqual for API key comparison (#2) - url-fetcher, http-utils: replace process-wide NODE_TLS_REJECT_UNAUTHORIZED mutation with per-request undici Agent to eliminate TLS race condition (#1) Bugs - indexing: re-throw unexpected embedding errors so transaction rolls back instead of silently committing chunks with no vector (#3) - search: replace correlated minRating subquery with avg_r.avg_rating from the pre-joined aggregate in FTS and LIKE search paths (#4) Performance - bulk: replace O(n²) docs.find() loops with pre-built Map; replace per-document getDocumentTags() calls with a single getDocumentTagsBatch() query (#5) - config: add 30-second TTL cache to loadConfig() so disk reads are not repeated on every request (#6) Code quality - routes: check res.write() return value to handle SSE backpressure (#7) - reindex: delegate to schema.createVectorTable() instead of duplicating the vec0 DDL inline (#8) - obsidian: replace hand-rolled parseSimpleYaml() with js-yaml, normalise Date objects back to ISO-8601 strings (#9) Docs - agents.md: expand architecture tree to include src/api/ and src/connectors/; add Security Patterns section with 
correct undici examples - CONTRIBUTING.md: fix check-suite command (npm test → npm run test:coverage) and correct coverage threshold (80% → actual 75%/74%) Tests - bulk.test: add dateFrom/dateTo filter coverage - config.test: add cache-hit test; call invalidateConfigCache() before env-var tests so TTL cache doesn't return stale results * fix: remove unused warnIfTlsBypassMissing function Dead code after conflict resolution chose the per-request undici Agent approach (which doesn't need a warning about NODE_TLS_REJECT_UNAUTHORIZED). * fix: update tests for config cache and retry semantics - Add invalidateConfigCache() before loadConfig() in 4 env-override tests that were failing because the 30s TTL cache introduced in the config module was returning stale results from the previous test's cache entry - Update http-utils retry assertion: maxRetries=2 means 1 initial + 2 retries = 3 total calls (loop is attempt <= maxRetries) --------- * feat: CLI logging improvements and pack installation performance (#330) (#336) - Add `src/cli/reporter.ts`: PrettyReporter (ANSI colors + \r progress bar), SilentReporter (no-op for verbose/JSON mode), `isVerbose()` and `createReporter()` factory. `LIBSCOPE_VERBOSE=1` env var alternative. - Update `setupLogging` to default to "silent" in CLI mode (pretty reporter handles user-facing output). Verbose/`--log-level` flags still route to structured JSON pino logs. Fix duplicate `initLogger` calls in onenote connect/disconnect commands to use `setupLogging` consistently. 
- Update `installPack` in `packs.ts` to support batch embedding and progress reporting:
  - New `InstallOptions` interface with `batchSize`, `resumeFrom`, `onProgress` fields
  - Batch documents: chunk all → single `provider.embedBatch` call per batch → single SQLite transaction per batch (avoids N embedding calls)
  - `resumeFrom` skips the first N documents (enables partial install resume after failure)
  - `InstallResult` now includes `errors` count
- Add `--batch-size` and `--resume-from` CLI options to `pack install`
- Tests: `tests/unit/reporter.test.ts` (17 tests covering PrettyReporter, SilentReporter, isVerbose, env var detection); extended `tests/unit/packs.test.ts` with 7 new tests for progress callbacks, batch efficiency, resumeFrom, embedBatch failure handling.

* Claude/fix issue 331 s1qzu (#338)

* build(deps): Bump the npm_and_yarn group across 1 directory with 2 updates (#337)

Bumps the npm_and_yarn group with 2 updates in the / directory: [@hono/node-server](https://github.com/honojs/node-server) and [hono](https://github.com/honojs/hono).

Updates `@hono/node-server` from 1.19.9 to 1.19.10
- [Release notes](https://github.com/honojs/node-server/releases)
- [Commits](honojs/node-server@v1.19.9...v1.19.10)

Updates `hono` from 4.12.3 to 4.12.5
- [Release notes](https://github.com/honojs/hono/releases)
- [Commits](honojs/hono@v4.12.3...v4.12.5)

---
updated-dependencies:
- dependency-name: "@hono/node-server"
  dependency-version: 1.19.10
  dependency-type: indirect
  dependency-group: npm_and_yarn
- dependency-name: hono
  dependency-version: 4.12.5
  dependency-type: indirect
  dependency-group: npm_and_yarn
...

* fix: eliminate SSRF TOCTOU and ReDoS vulnerabilities (#341)

* fix: eliminate SSRF TOCTOU and ReDoS vulnerabilities (closes #340)

**SSRF (CWE-918, CodeQL alert #28)**

Replace the two-step validate-then-fetch approach in url-fetcher.ts with IP-pinned requests using node:http / node:https directly. validateUrl() resolves DNS and checks for private IPs, then the validated IP is passed straight to the TCP connection (hostname: pinnedIp, servername: original hostname for TLS SNI). There is now zero TOCTOU window between validation and the actual network request.

The redundant post-fetch DNS rebinding check and the env-var-based TLS bypass wrapper are removed; rejectUnauthorized is now passed directly to the request options.

An internal _setRequestImpl hook is exported for unit test injection so tests can stub responses without touching node:http / node:https. Tests are updated accordingly.

**ReDoS (CWE-1333, CodeQL alert #24)**

Five regexes in confluence.ts used the pattern `[^>]*ac:name="X"[^>]*`: two `[^>]*` quantifiers around a fixed literal. For input that contains a large attribute blob without the target ac:name value, the engine must try all O(n²) splits before concluding no match (catastrophic backtracking).

Fix: rewrite the leading `[^>]*` as `(?:(?!ac:name="X")[^>])*`. A negative lookahead prevents the quantifier from overlapping with the literal, making backtracking structurally impossible.

* fix: replace remaining polynomial regexes in confluence.ts with indexOf helpers

The conflict resolution left the original `/<ac:structured-macro [^>]*>([\s\S]*?)<\/ac:structured-macro>/gi` patterns in place for code blocks and info/tip panels (those were not part of the original security fix diff). These have the same O(n²) backtracking problem: with k opening tags and no closing tags, `[\s\S]*?` scans O(n - pos) chars per attempt, totalling O(n²).

Replace the entire convertConfluenceStorage function with the indexOf-based approach (replaceStructuredMacros / replaceTagPairs / extractTagContent helpers) that eliminates all regex-based tag parsing. Also add removeSelfClosingMacros to handle the self-closing TOC case without regex, since the previous self-closing fix still used a `[^>]*ac:name="toc"[^>]*` pattern.
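
For illustration, the indexOf-based replacement of paired tags described above might look roughly like the following. This is a minimal sketch, not the code from confluence.ts: the name `replaceTagPairsSketch`, its signature, and the callback shape are assumptions, and it deliberately ignores nested and self-closing macros.

```typescript
// Hypothetical helper for illustration only; the real replaceTagPairs /
// replaceStructuredMacros helpers in confluence.ts may look different.
function replaceTagPairsSketch(
  input: string,
  openPrefix: string, // e.g. "<ac:structured-macro"
  closeTag: string, // e.g. "</ac:structured-macro>"
  replacer: (openTag: string, inner: string) => string,
): string {
  let out = "";
  let pos = 0;
  while (pos < input.length) {
    const open = input.indexOf(openPrefix, pos);
    if (open === -1) break;
    const openEnd = input.indexOf(">", open + openPrefix.length);
    if (openEnd === -1) break;
    const close = input.indexOf(closeTag, openEnd + 1);
    // If no closing tag exists after this point, no later opening tag can be
    // closed either, so stop instead of rescanning from every opening tag.
    if (close === -1) break;
    out += input.slice(pos, open);
    out += replacer(input.slice(open, openEnd + 1), input.slice(openEnd + 1, close));
    pos = close + closeTag.length;
  }
  // Anything not consumed (including unclosed tags) is passed through as-is.
  return out + input.slice(pos);
}

const storage =
  '<p>a</p><ac:structured-macro ac:name="info">hi</ac:structured-macro><p>b</p>';
const converted = replaceTagPairsSketch(
  storage,
  "<ac:structured-macro",
  "</ac:structured-macro>",
  (_openTag, inner) => `[${inner}]`,
);
console.log(converted);
```

Because the scan position only moves forward and the loop exits on the first missing closing tag, the work stays roughly linear in the input length, which is the property the regex versions lacked.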

---------

* feat: concurrent pack installation and -v verbose shorthand (issue #330) (#339)

* feat: concurrent pack installation and -v verbose shorthand (issue #330)

Add concurrent batch embedding to installPack for significant performance improvement on large packs, plus CLI ergonomics improvements.

Key changes:
- `InstallOptions.concurrency` (default: 4): controls how many embedBatch calls run simultaneously; embedding is I/O-bound so parallelism directly reduces wall-clock installation time
- Refactor installPack to pre-chunk all documents upfront, then use a semaphore-based scheduler to run up to `concurrency` embedBatch calls concurrently while inserting completed batches in-order (SQLite requires serialised writes); progress callbacks fire after each batch as before
- `pack install --concurrency <n>` CLI flag exposes the new option
- `-v` shorthand for `--verbose` on the global program options
- Fix transaction install-count tracking: count committed docs accurately without relying on subtract-on-failure arithmetic
- Add 6 new tests covering concurrency=1 sequential, concurrency=4 parallel, multiple embedBatch calls per install, concurrency limit enforcement, incremental progress reporting, and partial-failure error counting

https://claude.ai/code/session_019hzhbEgV1ysnGmFVXBWzkZ

* fix: address all 4 Copilot review comments on PR #339

- Validate batchSize, concurrency, resumeFrom at the start of installPack and throw ValidationError for invalid values (comments 3 & 4). Concurrency <= 0 would silently hang the semaphore indefinitely.
- Add CLI lower-bound guard: `--concurrency` < 1 exits with a user-facing error before ever calling installPack (comment 3).
- Lazy chunking: pre-chunking all documents upfront held chunks for the entire pack in memory simultaneously. Batches now store only the raw documents; resolveBatch() chunks on demand right before embedBatch is called, so only one batch's worth of chunks is in memory at a time (comment 2).
- Wrap provider.embedBatch() in try/catch so synchronous throws are converted to rejected Promises rather than escaping scheduleNext() and leaving the outer Promise permanently pending (comment 1).

---------

* fix: address 7 pre-release bugs from audit (#342) (#344)

- Guard JSON.parse in rowToWebhook with try/catch, default to []
- Guard JSON.parse in rowToSavedSearch with try/catch, default to null
- Guard JSON.parse in loadDbConnectorConfig with try/catch, throw ConfigError
- Push dateFrom/dateTo filters into SQL WHERE in listDocuments (before LIMIT)
- Validate negative limit in resolveSelector, throw ValidationError
- Replace manual substring extension parsing with path.extname() in packs.ts
- Verified reporter.ts is already tracked on development (no action needed)
- Added tests for all fixes (corrupted JSON, SQL-level date filtering, negative limit)

Closes #342

* feat: URL spidering — crawl linked pages with configurable depth and page limits (#315) (#343)

* feat: URL spidering — crawl linked pages with configurable depth and page limits (#315)

Adds opt-in spidering to URL indexing. A single seed URL can now crawl and index an entire documentation site or wiki section in one call.

New files:
- src/core/link-extractor.ts: indexOf-based `<a href>` extraction, relative URL resolution, fragment stripping, dedup, scheme filtering. No regex.
- src/core/spider.ts: BFS crawl engine with sameDomain, pathPrefix, excludePatterns (glob), maxPages (hard cap 200), maxDepth (hard cap 5), 10-min total timeout, robots.txt (`User-agent: *` and libscope), and 1s inter-request delay. Yields SpiderResult per page; returns SpiderStats.
- tests/unit/link-extractor.test.ts: 25 tests covering relative resolution, dedup, fragment stripping, scheme filtering, attribute order, edge cases.
- tests/unit/spider.test.ts: 20 tests covering BFS order, depth/page limits, domain + path + pattern filtering, cycle detection, robots.txt, partial failure recovery, stats, and abortReason reporting.
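
As a rough illustration of the BFS order, cycle detection, and hard caps listed above, the sketch below walks an in-memory link graph instead of fetching pages. It is not the spider.ts implementation: `bfsCrawlOrder`, `CrawlLimits`, and `LinkGraph` are hypothetical names, and using the hard caps as defaults is an assumption. Only the cap values (200 pages, depth 5) come from the commit message.

```typescript
// Hard caps as stated in the commit message; all names here are illustrative.
const HARD_MAX_PAGES = 200;
const HARD_MAX_DEPTH = 5;

interface CrawlLimits {
  maxPages?: number;
  maxDepth?: number;
}

// Hypothetical stand-in for "fetch a page and extract its links".
type LinkGraph = Map<string, string[]>;

function bfsCrawlOrder(seed: string, graph: LinkGraph, limits: CrawlLimits = {}): string[] {
  // Caller-supplied limits are clamped so they can never exceed the hard caps.
  const maxPages = Math.min(limits.maxPages ?? HARD_MAX_PAGES, HARD_MAX_PAGES);
  const maxDepth = Math.min(limits.maxDepth ?? HARD_MAX_DEPTH, HARD_MAX_DEPTH);

  const visited = new Set<string>([seed]); // cycle detection
  const queue: Array<{ url: string; depth: number }> = [{ url: seed, depth: 0 }];
  const order: string[] = [];

  while (queue.length > 0 && order.length < maxPages) {
    const page = queue.shift();
    if (page === undefined) break;
    order.push(page.url);
    if (page.depth >= maxDepth) continue; // do not follow links beyond maxDepth
    for (const link of graph.get(page.url) ?? []) {
      if (!visited.has(link)) {
        visited.add(link);
        queue.push({ url: link, depth: page.depth + 1 });
      }
    }
  }
  return order;
}

const site: LinkGraph = new Map([
  ["a", ["b", "c"]],
  ["b", ["d", "a"]], // links back to "a": must not be crawled twice
  ["c", ["d"]],
  ["d", ["e"]],
]);
console.log(bfsCrawlOrder("a", site, { maxDepth: 1 }));
```

A real crawl would additionally apply the sameDomain, pathPrefix, excludePatterns, robots.txt, timeout, and delay rules before enqueueing or fetching each URL.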

Modified:
- src/core/url-fetcher.ts: adds fetchRaw() export returning raw body + contentType + finalUrl before HTML-to-markdown conversion, so the spider can extract links from HTML before conversion.
- src/api/routes.ts: POST /api/v1/documents/url now accepts spider, maxPages, maxDepth, sameDomain, pathPrefix, excludePatterns. Returns { documents, pagesIndexed, pagesCrawled, pagesSkipped, errors, abortReason }.
- src/mcp/server.ts: submit-document tool gains spider, maxPages, maxDepth, sameDomain, pathPrefix, excludePatterns parameters.

Safety: all fetched URLs pass through the existing SSRF validation in fetchRaw() (DNS resolution, private IP blocking, scheme allowlist). Hard limits (200 pages, depth 5, 10min) cannot be overridden by callers. robots.txt is fetched once per origin and Disallow rules are honoured. Individual page failures do not abort the crawl.

Closes #315

* fix: resolve CI lint errors in spider implementation

- Remove unnecessary type assertions (routes.ts, mcp/server.ts): TypeScript already narrows SpiderResult/SpiderStats from the generator
- Add explicit return type annotation on mock fetchRaw to satisfy no-unsafe-return rule in spider.test.ts
- Replace .resolves.not.toThrow() with a direct assertion: vitest .resolves requires a Promise, not an async function

* fix: address CodeQL security findings in spider/link-extractor

link-extractor.ts (CodeQL #30, incomplete URL scheme check): Replace the enumerated scheme blocklist (javascript:, vbscript:, etc.) with a strict http/https allowlist check on the resolved URL protocol. An allowlist is exhaustive by definition; a blocklist will always miss obscure schemes like vbscript:, blob:, or future additions.

spider.ts (CodeQL #31, incomplete multi-character sanitization): Replace the regex-based tag stripper `/<[^>]+>/g` in extractTitle() with an indexOf-based stripTags() function. The regex stops at the first `>` which can be inside a quoted attribute value (e.g. `<img alt="a>b">`), potentially leaving partial tag content in the extracted title. The new implementation walks quoted attribute values explicitly so no tag content leaks through regardless of its internal structure.

* fix: address all Copilot review comments on spider PR (#343)

- link-extractor: add word-boundary check in extractHref to prevent matching data-href, aria-href (false positives on non-href attributes)
- spider: rename pagesIndexed → pagesFetched throughout (SpiderStats interface already used pagesFetched; sync implementation + tests)
- spider: per-origin robots.txt cache (`Map<origin, Set>`) fetched lazily as new origins are encountered during crawl (was seed-only before)
- spider: normalize to raw.finalUrl after redirects, so the visited set, yielded URL, and link-extraction base all use the canonical URL
- routes: validate maxPages/maxDepth are finite positive integers
- routes: change conditional spread &&-patterns to ternaries
- routes: remove inner try/catch for spider fetch errors; add FetchError to top-level handler (consistent with single-URL mode → 502)
- mcp/server: replace conditional spreads with explicit if assignments
- mcp/server: validate spider=true requires url (throws ValidationError)
- openapi: document spider request fields in IndexFromUrlRequest schema, add SpiderResponse schema, update 201 response to oneOf

* style: fix prettier formatting in spider files

---------

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
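
The http/https allowlist on the resolved URL protocol, together with relative resolution and fragment stripping, could be sketched as below. This is an assumption-laden illustration rather than the link-extractor.ts code: `resolveHttpLink` is a hypothetical name, and it relies only on the standard WHATWG `URL` class.

```typescript
// Hypothetical helper; the real link-extractor.ts function names and return
// types are not shown in this PR description.
function resolveHttpLink(href: string, baseUrl: string): string | null {
  let resolved: URL;
  try {
    // Resolves relative hrefs against the page URL; throws on unparseable input.
    resolved = new URL(href, baseUrl);
  } catch {
    return null;
  }
  // Allowlist: anything that is not http(s) after resolution is dropped,
  // so javascript:, mailto:, data:, blob: and unknown schemes all fail closed.
  if (resolved.protocol !== "http:" && resolved.protocol !== "https:") {
    return null;
  }
  resolved.hash = ""; // strip the fragment so #section variants deduplicate
  return resolved.href;
}

const pageUrl = "https://example.com/docs/api/";
console.log(resolveHttpLink("../guide#intro", pageUrl));
console.log(resolveHttpLink("JAVASCRIPT:alert(1)", pageUrl));
```

Checking the parsed `protocol` rather than the raw string means case differences in the scheme are normalised by the URL parser before the comparison.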
* chore: add development branch workflow (#327) * chore: add development branch workflow - Add merge-gate.yml: enforces only 'development' can merge into main - Update CI/CodeQL/Docker workflows to run on both main and development - Update dependabot.yml: target-branch set to development for all ecosystems - Update copilot-instructions.md: document branch workflow convention - Rulesets configured: Main (requires merge-gate + squash-only), Development (requires CI status checks + PR) - Default branch set to development - All open PRs retargeted to development Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: skip Vercel preview deployments on non-main branches Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * chore: trigger check refresh --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat: create-pack from local folder or URL sources (#329) * fix: comprehensive audit fixes — security, performance, resilience, API hardening Addresses findings from issue #314: - SSRF protection for webhook URLs (CRITICAL) - Scrub secrets from exports - Stored XSS prevention on document URL - O(n²) and N+1 fixes in bulk operations - Rate limit cache eviction improvement - SSE backpressure handling - Replace raw Error() with typed errors - Fetch timeouts on all network calls - Input validation on API parameters - Search query length limit - Silent catch block logging - DNS rebinding check fix - N+1 in Slack user resolution - Pagination on webhook/search list endpoints Closes #314 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Address PR review comments: fix SSE listener leak, preserve caller signals, add SSRF validation to webhook test, chunk SQL params, use dynamic import in test - SSE backpressure: create single disconnect promise, race against drain (no listener accumulation) - http-utils.ts/onenote.ts: use AbortSignal.any() to combine caller signal with timeout - Webhook test 
endpoint: validate URL with validateWebhookUrlSsrf before fetch - bulk.ts: chunk IN clause to 999 params max (SQLite limit) - webhooks.test.ts: dynamic import after vi.mock() for deterministic DNS mocking Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * ci: consolidate and fix CI/CD workflows - Merge lint + typecheck into single job (saves one npm ci) - Add concurrency groups to ci, docker, codeql (cancel stale runs) - Add dependency-review-action on PRs (block vulnerable deps) - Add workflow_call trigger to ci.yml for reusability - Remove duplicate npm publish from release.yml (release-please owns it) - Fix SDK paths: sdk-go/ → sdk/go/, sdk-python/ → sdk/python/ - Fix Dependabot paths to match actual SDK directories - Add github-actions ecosystem to Dependabot (keep actions up to date) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat: add HTML file parser for .html/.htm document indexing Adds HtmlParser using the existing node-html-markdown dependency. Strips <script>, <style>, and <nav> tags before conversion. Registered for .html and .htm extensions. Includes 12 tests covering conversion, tag stripping, edge cases. Closes #317 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: address CodeQL and review comments on HTML parser - Replace regex-based tag stripping with node-html-markdown's native ignore option — eliminates all 3 CodeQL alerts (incomplete sanitization, bad HTML filtering regexp) - Wrap translate() in try/catch, throw ValidationError (consistent with other parsers) - Use trimEnd() instead of trim() to preserve leading indentation - Reuse single NHM instance for efficiency Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Revert "fix: skip Vercel preview deployments on non-main branches" This reverts commit eb48187. 
* feat: add --from option to pack create for folder/URL sources Adds createPackFromSource() that builds packs directly from local folders, files, or URLs without requiring database interaction. CLI: libscope pack create --name my-pack --from ~/docs/ [--extensions .md,.html] [--exclude pattern] [--no-recursive] Features: - Walks directories recursively using registered parsers - Fetches URLs via fetchAndConvert - Supports extension filtering, exclude patterns, progress callback - Multiple --from sources supported - Output format identical to DB export (pack install works unchanged) Closes #328 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * style: fix prettier formatting Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat: add gzip support for pack files (.json.gz) Pack files can now be compressed with gzip for smaller distribution: - writePackFile/readPackFile auto-detect gzip by extension or magic bytes - installPack accepts both .json and .json.gz files - createPackFromSource defaults to .json.gz output (source packs can be large) - createPack (DB export) still defaults to .json - Auto-detects gzip by magic bytes even if extension is .json 5 new tests covering gzip write, install, magic byte detection, and round-trip. 
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat: add progress logging and fix dedup handling in pack install - Log each document as it's indexed so large installs show progress - Change pack install to use dedup: 'skip' for graceful duplicate handling - Make title+content_length dedup check respect the dedup mode setting (previously it always threw ValidationError regardless of dedup mode) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat: auto-generate tags during pack creation and apply on install - Export tokenize() and add suggestTagsFromText() in tags.ts for DB-free tag generation - createPackFromSource() now auto-generates tags per document via TF-IDF - installPack() applies doc.tags via addTagsToDocument() after indexing Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat: add HTML file parser for .html/.htm document indexing (#318) * feat: add HTML file parser for .html/.htm document indexing Adds HtmlParser using the existing node-html-markdown dependency. Strips <script>, <style>, and <nav> tags before conversion. Registered for .html and .htm extensions. Includes 12 tests covering conversion, tag stripping, edge cases. Closes #317 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: address CodeQL and review comments on HTML parser - Replace regex-based tag stripping with node-html-markdown's native ignore option — eliminates all 3 CodeQL alerts (incomplete sanitization, bad HTML filtering regexp) - Wrap translate() in try/catch, throw ValidationError (consistent with other parsers) - Use trimEnd() instead of trim() to preserve leading indentation - Reuse single NHM instance for efficiency Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Revert "fix: skip Vercel preview deployments on non-main branches" This reverts commit eb48187. 
--------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * build(deps-dev): Bump eslint-config-prettier from 9.1.2 to 10.1.8 (#325) Bumps [eslint-config-prettier](https://github.com/prettier/eslint-config-prettier) from 9.1.2 to 10.1.8. - [Release notes](https://github.com/prettier/eslint-config-prettier/releases) - [Changelog](https://github.com/prettier/eslint-config-prettier/blob/main/CHANGELOG.md) - [Commits](https://github.com/prettier/eslint-config-prettier/commits/v10.1.8) --- updated-dependencies: - dependency-name: eslint-config-prettier dependency-version: 10.1.8 dependency-type: direct:development update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * build(deps-dev): Bump lint-staged from 16.3.1 to 16.3.2 Bumps the minor-and-patch group with 1 update: [lint-staged](https://github.com/lint-staged/lint-staged). Updates `lint-staged` from 16.3.1 to 16.3.2 - [Release notes](https://github.com/lint-staged/lint-staged/releases) - [Changelog](https://github.com/lint-staged/lint-staged/blob/main/CHANGELOG.md) - [Commits](lint-staged/lint-staged@v16.3.1...v16.3.2) --- updated-dependencies: - dependency-name: lint-staged dependency-version: 16.3.2 dependency-type: direct:development update-type: version-update:semver-patch dependency-group: minor-and-patch ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * build(deps): Bump the actions group with 5 updates Bumps the actions group with 5 updates: | Package | From | To | | --- | --- | --- | | [actions/checkout](https://github.com/actions/checkout) | `4` | `6` | | [actions/setup-node](https://github.com/actions/setup-node) | `4` | `6` | | [actions/upload-artifact](https://github.com/actions/upload-artifact) | `4` | `7` | | [actions/setup-python](https://github.com/actions/setup-python) | `5` | `6` | | [actions/setup-go](https://github.com/actions/setup-go) | `5` | `6` | Updates `actions/checkout` from 4 to 6 - [Release notes](https://github.com/actions/checkout/releases) - [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md) - [Commits](actions/checkout@v4...v6) Updates `actions/setup-node` from 4 to 6 - [Release notes](https://github.com/actions/setup-node/releases) - [Commits](actions/setup-node@v4...v6) Updates `actions/upload-artifact` from 4 to 7 - [Release notes](https://github.com/actions/upload-artifact/releases) - [Commits](actions/upload-artifact@v4...v7) Updates `actions/setup-python` from 5 to 6 - [Release notes](https://github.com/actions/setup-python/releases) - [Commits](actions/setup-python@v5...v6) Updates `actions/setup-go` from 5 to 6 - [Release notes](https://github.com/actions/setup-go/releases) - [Commits](actions/setup-go@v5...v6) --- updated-dependencies: - dependency-name: actions/checkout dependency-version: '6' dependency-type: direct:production update-type: version-update:semver-major dependency-group: actions - dependency-name: actions/setup-node dependency-version: '6' dependency-type: direct:production update-type: version-update:semver-major dependency-group: actions - dependency-name: actions/upload-artifact dependency-version: '7' dependency-type: direct:production update-type: version-update:semver-major dependency-group: actions - 
dependency-name: actions/setup-python dependency-version: '6' dependency-type: direct:production update-type: version-update:semver-major dependency-group: actions - dependency-name: actions/setup-go dependency-version: '6' dependency-type: direct:production update-type: version-update:semver-major dependency-group: actions ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * build(deps-dev): Bump @types/node from 22.19.13 to 25.3.3 Bumps [@types/node](https://github.com/DefinitelyTyped/DefinitelyTyped/tree/HEAD/types/node) from 22.19.13 to 25.3.3. - [Release notes](https://github.com/DefinitelyTyped/DefinitelyTyped/releases) - [Commits](https://github.com/DefinitelyTyped/DefinitelyTyped/commits/HEAD/types/node) --- updated-dependencies: - dependency-name: "@types/node" dependency-version: 25.3.3 dependency-type: direct:development update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * build(deps): Bump better-sqlite3 from 11.10.0 to 12.6.2 Bumps [better-sqlite3](https://github.com/WiseLibs/better-sqlite3) from 11.10.0 to 12.6.2. - [Release notes](https://github.com/WiseLibs/better-sqlite3/releases) - [Commits](WiseLibs/better-sqlite3@v11.10.0...v12.6.2) --- updated-dependencies: - dependency-name: better-sqlite3 dependency-version: 12.6.2 dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * build(deps-dev): Bump eslint from 9.39.3 to 10.0.2 Bumps [eslint](https://github.com/eslint/eslint) from 9.39.3 to 10.0.2. 
* chore: add development branch workflow (#327)
* chore: add development branch workflow
- Add merge-gate.yml: enforces only 'development' can merge into main
- Update CI/CodeQL/Docker workflows to run on both main and development
- Update dependabot.yml: target-branch set to development for all ecosystems
- Update copilot-instructions.md: document branch workflow convention
- Rulesets configured: Main (requires merge-gate + squash-only),
Development (requires CI status checks + PR)
- Default branch set to development
- All open PRs retargeted to development
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* fix: skip Vercel preview deployments on non-main branches
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* chore: trigger check refresh
---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* feat: create-pack from local folder or URL sources (#329)
* fix: comprehensive audit fixes — security, performance, resilience, API hardening
Addresses findings from issue #314:
- SSRF protection for webhook URLs (CRITICAL)
- Scrub secrets from exports
- Stored XSS prevention on document URL
- O(n²) and N+1 fixes in bulk operations
- Rate limit cache eviction improvement
- SSE backpressure handling
- Replace raw Error() with typed errors
- Fetch timeouts on all network calls
- Input validation on API parameters
- Search query length limit
- Silent catch block logging
- DNS rebinding check fix
- N+1 in Slack user resolution
- Pagination on webhook/search list endpoints
Closes #314
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Address PR review comments: fix SSE listener leak, preserve caller signals, add SSRF validation to webhook test, chunk SQL params, use dynamic import in test
- SSE backpressure: create single disconnect promise, race against drain (no listener accumulation)
- http-utils.ts/onenote.ts: use AbortSignal.any() to combine caller signal with timeout
- Webhook test endpoint: validate URL with validateWebhookUrlSsrf before fetch
- bulk.ts: chunk IN clause to 999 params max (SQLite limit)
- webhooks.test.ts: dynamic import after vi.mock() for deterministic DNS mocking
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
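The IN-clause chunking mentioned for bulk.ts can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: `chunkParams`, `buildInClauses`, and the `documents` table name are hypothetical, and 999 is simply the figure the commit message cites as the SQLite parameter limit.

```typescript
// Hypothetical helpers; names and the table are assumptions for illustration.
const MAX_SQL_PARAMS = 999; // limit cited in the commit message

// Split a list into consecutive chunks of at most `size` items.
function chunkParams<T>(items: readonly T[], size: number = MAX_SQL_PARAMS): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Build one parameterised query per chunk so no statement exceeds the limit.
function buildInClauses(ids: readonly string[]): { sql: string; params: string[] }[] {
  return chunkParams(ids).map((chunk) => ({
    sql: `SELECT id FROM documents WHERE id IN (${chunk.map(() => "?").join(", ")})`,
    params: chunk,
  }));
}
```

Each returned entry can be run as its own prepared statement and the results concatenated by the caller.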
* ci: consolidate and fix CI/CD workflows
- Merge lint + typecheck into single job (saves one npm ci)
- Add concurrency groups to ci, docker, codeql (cancel stale runs)
- Add dependency-review-action on PRs (block vulnerable deps)
- Add workflow_call trigger to ci.yml for reusability
- Remove duplicate npm publish from release.yml (release-please owns it)
- Fix SDK paths: sdk-go/ → sdk/go/, sdk-python/ → sdk/python/
- Fix Dependabot paths to match actual SDK directories
- Add github-actions ecosystem to Dependabot (keep actions up to date)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* feat: add HTML file parser for .html/.htm document indexing
Adds HtmlParser using the existing node-html-markdown dependency.
Strips <script>, <style>, and <nav> tags before conversion.
Registered for .html and .htm extensions.
Includes 12 tests covering conversion, tag stripping, edge cases.
Closes #317
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* fix: address CodeQL and review comments on HTML parser
- Replace regex-based tag stripping with node-html-markdown's native
ignore option — eliminates all 3 CodeQL alerts (incomplete sanitization,
bad HTML filtering regexp)
- Wrap translate() in try/catch, throw ValidationError (consistent with
other parsers)
- Use trimEnd() instead of trim() to preserve leading indentation
- Reuse single NHM instance for efficiency
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Revert "fix: skip Vercel preview deployments on non-main branches"
This reverts commit eb481870c883f77278291b72245b1ca0b890a78c.
* feat: add --from option to pack create for folder/URL sources
Adds createPackFromSource() that builds packs directly from local
folders, files, or URLs without requiring database interaction.
CLI: libscope pack create --name my-pack --from ~/docs/ [--extensions .md,.html] [--exclude pattern] [--no-recursive]
Features:
- Walks directories recursively using registered parsers
- Fetches URLs via fetchAndConvert
- Supports extension filtering, exclude patterns, progress callback
- Multiple --from sources supported
- Output format identical to DB export (pack install works unchanged)
Closes #328
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* style: fix prettier formatting
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* feat: add gzip support for pack files (.json.gz)
Pack files can now be compressed with gzip for smaller distribution:
- writePackFile/readPackFile auto-detect gzip by extension or magic bytes
- installPack accepts both .json and .json.gz files
- createPackFromSource defaults to .json.gz output (source packs can be large)
- createPack (DB export) still defaults to .json
- Auto-detects gzip by magic bytes even if extension is .json
5 new tests covering gzip write, install, magic byte detection, and round-trip.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
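The magic-byte detection described above can be sketched like this. It is an assumption-labelled illustration, not the readPackFile/writePackFile code: both function names are hypothetical, and the only fact relied on is that gzip streams start with the bytes 0x1f 0x8b.

```typescript
// Hypothetical helpers for illustration only.

// gzip members begin with the two identification bytes 0x1f 0x8b.
function looksLikeGzip(bytes: Uint8Array): boolean {
  return bytes.length >= 2 && bytes[0] === 0x1f && bytes[1] === 0x8b;
}

// On write there are no bytes to inspect yet, so the extension decides.
function shouldCompressOnWrite(fileName: string): boolean {
  return fileName.toLowerCase().endsWith(".gz");
}
```

Checking the bytes on read means a compressed pack saved under a plain `.json` name is still handled, which matches the behaviour the commit describes.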
* feat: add progress logging and fix dedup handling in pack install
- Log each document as it's indexed so large installs show progress
- Change pack install to use dedup: 'skip' for graceful duplicate handling
- Make title+content_length dedup check respect the dedup mode setting
(previously it always threw ValidationError regardless of dedup mode)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* feat: auto-generate tags during pack creation and apply on install
- Export tokenize() and add suggestTagsFromText() in tags.ts for DB-free tag generation
- createPackFromSource() now auto-generates tags per document via TF-IDF
- installPack() applies doc.tags via addTagsToDocument() after indexing
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* feat: add HTML file parser for .html/.htm document indexing (#318)
* feat: add HTML file parser for .html/.htm document indexing
Adds HtmlParser using the existing node-html-markdown dependency.
Strips <script>, <style>, and <nav> tags before conversion.
Registered for .html and .htm extensions.
Includes 12 tests covering conversion, tag stripping, edge cases.
Closes #317
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* fix: address CodeQL and review comments on HTML parser
- Replace regex-based tag stripping with node-html-markdown's native
ignore option — eliminates all 3 CodeQL alerts (incomplete sanitization,
bad HTML filtering regexp)
- Wrap translate() in try/catch, throw ValidationError (consistent with
other parsers)
- Use trimEnd() instead of trim() to preserve leading indentation
- Reuse single NHM instance for efficiency
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Revert "fix: skip Vercel preview deployments on non-main branches"
This reverts commit eb481870c883f77278291b72245b1ca0b890a78c.
---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* build(deps-dev): Bump eslint-config-prettier from 9.1.2 to 10.1.8 (#325)
Bumps [eslint-config-prettier](https://github.com/prettier/eslint-config-prettier) from 9.1.2 to 10.1.8.
- [Release notes](https://github.com/prettier/eslint-config-prettier/releases)
- [Changelog](https://github.com/prettier/eslint-config-prettier/blob/main/CHANGELOG.md)
- [Commits](https://github.com/prettier/eslint-config-prettier/commits/v10.1.8)
---
updated-dependencies:
- dependency-name: eslint-config-prettier
dependency-version: 10.1.8
dependency-type: direct:development
update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* build(deps-dev): Bump lint-staged from 16.3.1 to 16.3.2
Bumps the minor-and-patch group with 1 update: [lint-staged](https://github.com/lint-staged/lint-staged).
Updates `lint-staged` from 16.3.1 to 16.3.2
- [Release notes](https://github.com/lint-staged/lint-staged/releases)
- [Changelog](https://github.com/lint-staged/lint-staged/blob/main/CHANGELOG.md)
- [Commits](https://github.com/lint-staged/lint-staged/compare/v16.3.1...v16.3.2)
---
updated-dependencies:
- dependency-name: lint-staged
dependency-version: 16.3.2
dependency-type: direct:development
update-type: version-update:semver-patch
dependency-group: minor-and-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* build(deps): Bump the actions group with 5 updates
Bumps the actions group with 5 updates:
| Package | From | To |
| --- | --- | --- |
| [actions/checkout](https://github.com/actions/checkout) | `4` | `6` |
| [actions/setup-node](https://github.com/actions/setup-node) | `4` | `6` |
| [actions/upload-artifact](https://github.com/actions/upload-artifact) | `4` | `7` |
| [actions/setup-python](https://github.com/actions/setup-python) | `5` | `6` |
| [actions/setup-go](https://github.com/actions/setup-go) | `5` | `6` |
Updates `actions/checkout` from 4 to 6
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/v4...v6)
Updates `actions/setup-node` from 4 to 6
- [Release notes](https://github.com/actions/setup-node/releases)
- [Commits](https://github.com/actions/setup-node/compare/v4...v6)
Updates `actions/upload-artifact` from 4 to 7
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](https://github.com/actions/upload-artifact/compare/v4...v7)
Updates `actions/setup-python` from 5 to 6
- [Release notes](https://github.com/actions/setup-python/releases)
- [Commits](https://github.com/actions/setup-python/compare/v5...v6)
Updates `actions/setup-go` from 5 to 6
- [Release notes](https://github.com/actions/setup-go/releases)
- [Commits](https://github.com/actions/setup-go/compare/v5...v6)
---
updated-dependencies:
- dependency-name: actions/checkout
dependency-version: '6'
dependency-type: direct:production
update-type: version-update:semver-major
dependency-group: actions
- dependency-name: actions/setup-node
dependency-version: '6'
dependency-type: direct:production
update-type: version-update:semver-major
dependency-group: actions
- dependency-name: actions/upload-artifact
dependency-version: '7'
dependency-type: direct:production
update-type: version-update:semver-major
dependency-group: actions
- dependency-name: actions/setup-python
dependency-version: '6'
dependency-type: direct:production
update-type: version-update:semver-major
dependency-group: actions
- dependency-name: actions/setup-go
dependency-version: '6'
dependency-type: direct:production
update-type: version-update:semver-major
dependency-group: actions
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* build(deps-dev): Bump @types/node from 22.19.13 to 25.3.3
Bumps [@types/node](https://github.com/DefinitelyTyped/DefinitelyTyped/tree/HEAD/types/node) from 22.19.13 to 25.3.3.
- [Release notes](https://github.com/DefinitelyTyped/DefinitelyTyped/releases)
- [Commits](https://github.com/DefinitelyTyped/DefinitelyTyped/commits/HEAD/types/node)
---
updated-dependencies:
- dependency-name: "@types/node"
dependency-version: 25.3.3
dependency-type: direct:development
update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* build(deps): Bump better-sqlite3 from 11.10.0 to 12.6.2
Bumps [better-sqlite3](https://github.com/WiseLibs/better-sqlite3) from 11.10.0 to 12.6.2.
- [Release notes](https://github.com/WiseLibs/better-sqlite3/releases)
- [Commits](https://github.com/WiseLibs/better-sqlite3/compare/v11.10.0...v12.6.2)
---
updated-dependencies:
- dependency-name: better-sqlite3
dependency-version: 12.6.2
dependency-type: direct:production
update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* build(deps-dev): Bump eslint from 9.39.3 to 10.0.2
Bumps [eslint](https://github.com/eslint/eslint) from 9.39.3 to 10.0.2.
- [Release notes](https://github.com/eslint/eslint/releases)
- [Commits](https://github.com/eslint/eslint/compare/v9.39.3...v10.0.2)
---
updated-dependencies:
- dependency-name: eslint
dependency-version: 10.0.2
dependency-type: direct:development
update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Robert DeRienzo <rderienzo@voloridge.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* feat: add passthrough LLM mode for ask-question tool (#335)
* feat: add passthrough LLM mode for ask-question tool
Adds llm.provider = "passthrough" so the ask-question MCP tool returns
retrieved context chunks directly to the calling LLM instead of requiring
a separate OpenAI/Ollama provider. This is the natural design for MCP tools
where the client already has an LLM (e.g. Claude Code).
- config.ts: add "passthrough" to llm.provider union type and env var handling
- rag.ts: add isPassthroughMode() helper and getContextForQuestion() which
retrieves and formats context without an LLM call
- mcp/server.ts: ask-question checks passthrough first and returns context
directly; falls through to existing LLM path otherwise
Enable via config: { "llm": { "provider": "passthrough" } }
Enable via env var: LIBSCOPE_LLM_PROVIDER=passthrough
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
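A rough sketch of how the two enable paths could resolve to the same check. The config shape, the `applyEnvOverride` helper, and the exact members of the provider union are assumptions; the commit only confirms the "passthrough" value, the `LIBSCOPE_LLM_PROVIDER` variable, and the existence of an `isPassthroughMode()` helper whose real signature is not shown.

```typescript
// Assumed shapes; the real config types are not shown in the commit.
type LlmProvider = "openai" | "ollama" | "passthrough";
interface AppConfig {
  llm?: { provider?: LlmProvider };
}

const KNOWN_PROVIDERS: readonly LlmProvider[] = ["openai", "ollama", "passthrough"];

// Hypothetical: apply the env var on top of the file config, ignoring unknown values.
function applyEnvOverride(
  config: AppConfig,
  env: Record<string, string | undefined>,
): AppConfig {
  const raw = env["LIBSCOPE_LLM_PROVIDER"];
  const provider = KNOWN_PROVIDERS.find((p) => p === raw);
  if (provider === undefined) return config;
  return { ...config, llm: { ...config.llm, provider } };
}

// Sketch of the helper named in the commit; signature is an assumption.
function isPassthroughMode(config: AppConfig): boolean {
  return config.llm?.provider === "passthrough";
}
```

In this sketch the env var takes precedence over the file config, and a tool handler would call `isPassthroughMode` before deciding whether to invoke an LLM at all.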
* fix: format config.ts and include passthrough in provider override
- Reformat long if-condition to satisfy prettier (printWidth: 100)
- Fix logic bug: passthrough provider was checked in outer condition but
not spread into overrides.llm.provider
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: address 9 audit findings from issue #332 (#333)
* fix: address 9 audit findings from issue #332
Security
- middleware: use timingSafeEqual for API key comparison (#2)
- url-fetcher, http-utils: replace process-wide NODE_TLS_REJECT_UNAUTHORIZED
mutation with per-request undici Agent to eliminate TLS race condition (#1)
Bugs
- indexing: re-throw unexpected embedding errors so transaction rolls back
instead of silently committing chunks with no vector (#3)
- search: replace correlated minRating subquery with avg_r.avg_rating from
the pre-joined aggregate in FTS and LIKE search paths (#4)
Performance
- bulk: replace O(n²) docs.find() loops with pre-built Map; replace
per-document getDocumentTags() calls with a single getDocumentTagsBatch()
query (#5)
- config: add 30-second TTL cache to loadConfig() so disk reads are not
repeated on every request (#6)
Code quality
- routes: check res.write() return value to handle SSE backpressure (#7)
- reindex: delegate to schema.createVectorTable() instead of duplicating
the vec0 DDL inline (#8)
- obsidian: replace hand-rolled parseSimpleYaml() with js-yaml, normalise
Date objects back to ISO-8601 strings (#9)
Docs
- agents.md: expand architecture tree to include src/api/ and src/connectors/;
add Security Patterns section with correct undici examples
- CONTRIBUTING.md: fix check-suite command (npm test → npm run test:coverage)
and correct coverage threshold (80% → actual 75%/74%)
Tests
- bulk.test: add dateFrom/dateTo filter coverage
- config.test: add cache-hit test; call invalidateConfigCache() before env-var
tests so TTL cache doesn't return stale results
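The 30-second TTL cache and its invalidation hook might look roughly like the sketch below. `createCachedLoader` and the injectable `now` clock are hypothetical names; the commit only confirms that loadConfig() is cached for 30 seconds and that invalidateConfigCache() exists.

```typescript
// Hypothetical stand-in for a cached loadConfig(): reuses the loaded value
// until ttlMs has elapsed or invalidate() is called.
function createCachedLoader<T>(
  load: () => T,
  ttlMs: number = 30000,
  now: () => number = Date.now,
) {
  let cached: { value: T; expiresAt: number } | undefined;
  return {
    get(): T {
      const t = now();
      if (cached === undefined || t >= cached.expiresAt) {
        cached = { value: load(), expiresAt: t + ttlMs }; // the only disk read
      }
      return cached.value;
    },
    // Tests that change env vars call this so they never see a stale entry.
    invalidate(): void {
      cached = undefined;
    },
  };
}
```

Injecting the clock keeps tests deterministic without waiting for real time to pass.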
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: remove unused warnIfTlsBypassMissing function
Dead code after conflict resolution chose the per-request undici Agent
approach (which doesn't need a warning about NODE_TLS_REJECT_UNAUTHORIZED).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: update tests for config cache and retry semantics
- Add invalidateConfigCache() before loadConfig() in 4 env-override tests
that were failing because the 30s TTL cache introduced in the config
module was returning stale results from the previous test's cache entry
- Update http-utils retry assertion: maxRetries=2 means 1 initial + 2
retries = 3 total calls (loop is attempt <= maxRetries)
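The call-count arithmetic follows directly from the loop bound. A minimal synchronous sketch, where `withRetries` is a hypothetical helper and not the actual http-utils function:

```typescript
// Hypothetical retry helper using the same "attempt <= maxRetries" loop shape:
// attempt 0 is the initial call, attempts 1..maxRetries are the retries.
function withRetries<T>(fn: () => T, maxRetries: number): T {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return fn();
    } catch (err) {
      lastError = err; // remember the failure and try again
    }
  }
  throw lastError;
}
```

With maxRetries=2 a function that always fails is invoked three times before the last error is rethrown.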
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat: CLI logging improvements and pack installation performance (#330) (#336)
- Add `src/cli/reporter.ts`: PrettyReporter (ANSI colors + \r progress
bar), SilentReporter (no-op for verbose/JSON mode), `isVerbose()` and
`createReporter()` factory. `LIBSCOPE_VERBOSE=1` env var alternative.
- Update `setupLogging` to default to "silent" in CLI mode (pretty
reporter handles user-facing output). Verbose/`--log-level` flags still
route to structured JSON pino logs. Fix duplicate `initLogger` calls in
onenote connect/disconnect commands to use `setupLogging` consistently.
- Update `installPack` in `packs.ts` to support batch embedding and
progress reporting:
- New `InstallOptions` interface with `batchSize`, `resumeFrom`,
`onProgress` fields
- Batch documents: chunk all → single `provider.embedBatch` call per
batch → single SQLite transaction per batch (avoids N embedding calls)
- `resumeFrom` skips the first N documents (enables partial install
resume after failure)
- `InstallResult` now includes `errors` count
- Add `--batch-size` and `--resume-from` CLI options to `pack install`
- Tests: `tests/unit/reporter.test.ts` (17 tests covering PrettyReporter,
SilentReporter, isVerbose, env var detection); extended
`tests/unit/packs.test.ts` with 7 new tests for progress callbacks,
batch efficiency, resumeFrom, embedBatch failure handling.
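The batching and resume behaviour described above can be sketched as follows. This is an assumption-laden illustration: the `onProgress` signature, the default batch size of 10, and `installInBatches` itself are hypothetical, and the embedding provider is replaced by an injected synchronous function.

```typescript
// Field names follow the commit; the exact types are assumed.
interface InstallOptions {
  batchSize?: number;
  resumeFrom?: number;
  onProgress?: (done: number, total: number) => void;
}

function installInBatches(
  docs: string[],
  embedBatch: (texts: string[]) => number[][],
  options: InstallOptions = {},
): { installed: number; errors: number } {
  const { batchSize = 10, resumeFrom = 0, onProgress } = options;
  if (!Number.isInteger(batchSize) || batchSize < 1) throw new RangeError("batchSize must be >= 1");
  if (!Number.isInteger(resumeFrom) || resumeFrom < 0) throw new RangeError("resumeFrom must be >= 0");

  const pending = docs.slice(resumeFrom); // skip documents installed earlier
  let installed = 0;
  let errors = 0;
  for (let start = 0; start < pending.length; start += batchSize) {
    const batch = pending.slice(start, start + batchSize);
    try {
      const vectors = embedBatch(batch); // one embedding call per batch
      if (vectors.length !== batch.length) throw new Error("embedding count mismatch");
      installed += batch.length; // a real install would write the batch in one transaction here
    } catch {
      errors += batch.length; // a failed batch is counted, later batches still run
    }
    onProgress?.(resumeFrom + start + batch.length, docs.length);
  }
  return { installed, errors };
}
```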
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* Claude/fix issue 331 s1qzu (#338)
* build(deps): Bump the npm_and_yarn group across 1 directory with 2 updates (#337)
Bumps the npm_and_yarn group with 2 updates in the / directory: [@hono/node-server](https://github.com/honojs/node-server) and [hono](https://github.com/honojs/hono).
Updates `@hono/node-server` from 1.19.9 to 1.19.10
- [Release notes](https://github.com/honojs/node-server/releases)
- [Commits](https://github.com/honojs/node-server/compare/v1.19.9...v1.19.10)
Updates `hono` from 4.12.3 to 4.12.5
- [Release notes](https://github.com/honojs/hono/releases)
- [Commits](https://github.com/honojs/hono/compare/v4.12.3...v4.12.5)
---
updated-dependencies:
- dependency-name: "@hono/node-server"
dependency-version: 1.19.10
dependency-type: indirect
dependency-group: npm_and_yarn
- dependency-name: hono
dependency-version: 4.12.5
dependency-type: indirect
dependency-group: npm_and_yarn
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* fix: eliminate SSRF TOCTOU and ReDoS vulnerabilities (#341)
* fix: eliminate SSRF TOCTOU and ReDoS vulnerabilities (closes #340)
**SSRF (CWE-918 — CodeQL alert #28)**
Replace the two-step validate-then-fetch approach in url-fetcher.ts with
IP-pinned requests using node:http / node:https directly. validateUrl()
resolves DNS and checks for private IPs, then the validated IP is passed
straight to the TCP connection (hostname: pinnedIp, servername: original
hostname for TLS SNI). There is now zero TOCTOU window between validation
and the actual network request. The redundant post-fetch DNS rebinding
check and the env-var-based TLS bypass wrapper are removed; rejectUnauthorized
is now passed directly to the request options.
An internal _setRequestImpl hook is exported for unit test injection so
tests can stub responses without touching node:http / node:https.
Tests are updated accordingly.
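A rough sketch of the pinning idea: validate the resolved address once, then build request options that connect to that exact address while keeping the original hostname for SNI and the Host header. `isPrivateIpv4` and `buildPinnedOptions` are hypothetical helpers, the range check is IPv4-only and deliberately incomplete, and the returned object is only shaped like the options accepted by node:https request().

```typescript
// Illustrative IPv4-only check; a real validator must also cover IPv6 and
// other reserved ranges. Unparseable input is treated as unsafe.
function isPrivateIpv4(ip: string): boolean {
  const parts = ip.split(".");
  if (parts.length !== 4 || !parts.every((p) => /^\d{1,3}$/.test(p) && Number(p) <= 255)) {
    return true;
  }
  const a = Number(parts[0]);
  const b = Number(parts[1]);
  return (
    a === 0 || a === 10 || a === 127 ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168) ||
    (a === 169 && b === 254)
  );
}

interface PinnedRequestOptions {
  hostname: string;
  servername: string;
  port: number;
  path: string;
  headers: { Host: string };
  rejectUnauthorized: boolean;
}

function buildPinnedOptions(url: URL, validatedIp: string, allowSelfSigned = false): PinnedRequestOptions {
  if (isPrivateIpv4(validatedIp)) throw new Error(`refusing to connect to ${validatedIp}`);
  return {
    hostname: validatedIp, // connect to the address that was validated, no second lookup
    servername: url.hostname, // TLS SNI and certificate checks still use the real name
    port: url.port ? Number(url.port) : url.protocol === "https:" ? 443 : 80,
    path: url.pathname + url.search,
    headers: { Host: url.host },
    rejectUnauthorized: !allowSelfSigned,
  };
}
```

Because the socket is opened against the validated address, a DNS answer that changes after validation has nothing to influence.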
**ReDoS (CWE-1333 — CodeQL alert #24)**
Five regexes in confluence.ts used the pattern [^>]*ac:name="X"[^>]*, that is,
two [^>]* quantifiers around a fixed literal. For input that contains a
large attribute blob without the target ac:name value, the engine does
O(n²) work trying splits before concluding there is no match (polynomial
backtracking).
Fix: rewrite the leading [^>]* as (?:(?!ac:name="X")[^>])*. The negative
lookahead stops the quantifier from running over the literal, so there is
only one way to split the input and the quadratic backtracking disappears.
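To make the rewrite concrete, here are the two shapes side by side for the `toc` macro name. The surrounding `<ac:structured-macro` prefix and the `isTocMacroTag` wrapper are assumptions for illustration and are not copied from confluence.ts.

```typescript
// Original shape: two [^>]* runs around a literal, so the engine can try many splits.
const risky = /<ac:structured-macro[^>]*ac:name="toc"[^>]*>/;

// Tempered shape: the leading run consumes a character only when the literal
// does not start at that position, so it cannot overlap with the literal.
const tempered = /<ac:structured-macro(?:(?!ac:name="toc")[^>])*ac:name="toc"[^>]*>/;

// Hypothetical wrapper used to compare the two on sample tags.
function isTocMacroTag(tag: string, pattern: RegExp = tempered): boolean {
  return pattern.test(tag);
}
```

Both patterns accept and reject the same tags; they differ only in how much work a non-matching input costs.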
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: replace remaining polynomial regexes in confluence.ts with indexOf helpers
The conflict resolution left the original
`/<ac:structured-macro [^>]*>([\s\S]*?)<\/ac:structured-macro>/gi` patterns in
place for code blocks and info/tip panels (those were not part of the original
security fix diff). These have the same O(n²) backtracking problem: with k
opening tags and no closing tags, [\s\S]*? scans O(n - pos) chars per attempt,
totalling O(n²).
Replace the entire convertConfluenceStorage function with the indexOf-based
approach (replaceStructuredMacros / replaceTagPairs / extractTagContent helpers)
that eliminates all regex-based tag parsing. Also add removeSelfClosingMacros
to handle the self-closing TOC case without regex, since the previous self-closing
fix still used a [^>]*ac:name="toc"[^>]* pattern.
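A minimal sketch of the indexOf-based approach. The name `replaceTagPairs` appears in the commit, but this signature and body are assumptions; the sketch does not handle nested tags or a `>` inside a quoted attribute.

```typescript
// Replace every <openPrefix ...>inner</close> pair using indexOf only.
// Each search moves strictly forward, and a missing closing tag ends the scan.
function replaceTagPairs(
  input: string,
  openPrefix: string,
  closeTag: string,
  render: (inner: string) => string,
): string {
  let out = "";
  let pos = 0;
  while (pos < input.length) {
    const open = input.indexOf(openPrefix, pos);
    if (open === -1) break;
    const openEnd = input.indexOf(">", open);
    if (openEnd === -1) break;
    const close = input.indexOf(closeTag, openEnd + 1);
    if (close === -1) break; // no closing tag ahead: stop instead of retrying per opener
    out += input.slice(pos, open) + render(input.slice(openEnd + 1, close));
    pos = close + closeTag.length;
  }
  return out + input.slice(pos); // copy whatever was not consumed
}
```

Stopping at the first missing closing tag is what keeps the many-openers, no-closers input from being rescanned once per opener.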
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat: concurrent pack installation and -v verbose shorthand (issue #330) (#339)
* feat: concurrent pack installation and -v verbose shorthand (issue #330)
Add concurrent batch embedding to installPack for significant performance
improvement on large packs, plus CLI ergonomics improvements.
Key changes:
- `InstallOptions.concurrency` (default: 4): controls how many embedBatch
calls run simultaneously; embedding is I/O-bound so parallelism directly
reduces wall-clock installation time
- Refactor installPack to pre-chunk all documents upfront, then use a
semaphore-based scheduler to run up to `concurrency` embedBatch calls
concurrently while inserting completed batches in order (SQLite requires
serialised writes); progress callbacks fire after each batch as before
- `pack install --concurrency <n>` CLI flag exposes the new option
- `-v` shorthand for `--verbose` on the global program options
- Fix transaction install-count tracking: count committed docs accurately
without relying on subtract-on-failure arithmetic
- Add 6 new tests covering concurrency=1 sequential, concurrency=4 parallel,
multiple embedBatch calls per install, concurrency limit enforcement,
incremental progress reporting, and partial-failure error counting
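The bounded-concurrency, in-order-commit idea can be sketched as below. This uses a small worker-pool loop instead of the semaphore-based scheduler the commit describes, and `runInOrder`, `worker`, and `commit` are hypothetical names.

```typescript
// Run `worker` over `batches` with at most `concurrency` calls in flight,
// and hand results to `commit` strictly in batch order.
async function runInOrder<T, R>(
  batches: T[],
  worker: (batch: T, index: number) => Promise<R>,
  commit: (result: R, index: number) => void,
  concurrency: number = 4,
): Promise<void> {
  if (!Number.isInteger(concurrency) || concurrency < 1) {
    throw new RangeError("concurrency must be a positive integer");
  }
  const finished = new Map<number, R>(); // done but not yet committed
  let nextToStart = 0;
  let nextToCommit = 0;

  const lane = async (): Promise<void> => {
    while (nextToStart < batches.length) {
      const index = nextToStart++;
      const result = await worker(batches[index], index);
      finished.set(index, result);
      // Flush every result that is now contiguous with what was committed.
      while (finished.has(nextToCommit)) {
        commit(finished.get(nextToCommit) as R, nextToCommit);
        finished.delete(nextToCommit);
        nextToCommit++;
      }
    }
  };

  const laneCount = Math.min(concurrency, batches.length);
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
}
```

Embedding calls overlap, while `commit` (the serialised database write in this analogy) always sees batch 0, then 1, then 2, regardless of which embedding finished first.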
https://claude.ai/code/session_019hzhbEgV1ysnGmFVXBWzkZ
* fix: address all 4 Copilot review comments on PR #339
- Validate batchSize, concurrency, resumeFrom at the start of installPack
and throw ValidationError for invalid values (comments 3 & 4). Concurrency
<= 0 would silently hang the semaphore indefinitely.
- Add CLI lower-bound guard: --concurrency < 1 exits with a user-facing
error before ever calling installPack (comment 3).
- Lazy chunking: pre-chunking all documents upfront held chunks for the
entire pack in memory simultaneously. Batches now store only the raw
documents; resolveBatch() chunks on demand right before embedBatch
is called, so only one batch's worth of chunks is in memory at a time
(comment 2).
- Wrap provider.embedBatch() in try/catch so synchronous throws are
converted to rejected Promises rather than escaping scheduleNext() and
leaving the outer Promise permanently pending (comment 1).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude <noreply@anthropic.com>
* fix: address 7 pre-release bugs from audit (#342) (#344)
- Guard JSON.parse in rowToWebhook with try/catch, default to []
- Guard JSON.parse in rowToSavedSearch with try/catch, default to null
- Guard JSON.parse in loadDbConnectorConfig with try/catch, throw ConfigError
- Push dateFrom/dateTo filters into SQL WHERE in listDocuments (before LIMIT)
- Validate negative limit in resolveSelector, throw ValidationError
- Replace manual substring extension parsing with path.extname() in packs.ts
- Verified reporter.ts is already tracked on development (no action needed)
- Added tests for all fixes (corrupted JSON, SQL-level date filtering, negative limit)
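The JSON.parse guards share one shape: parse inside try/catch and fall back to a safe default. A minimal sketch, where `parseJsonOr` and `isStringArray` are hypothetical helpers and not the functions named above:

```typescript
// Parse a stored JSON column; return `fallback` when the value is missing,
// corrupted, or not of the expected shape.
function parseJsonOr<T>(raw: string | null, fallback: T, isValid: (v: unknown) => v is T): T {
  if (raw === null) return fallback;
  try {
    const parsed: unknown = JSON.parse(raw);
    return isValid(parsed) ? parsed : fallback;
  } catch {
    return fallback; // corrupted JSON never escapes as an exception
  }
}

const isStringArray = (v: unknown): v is string[] =>
  Array.isArray(v) && v.every((x) => typeof x === "string");
```

A row mapper could then call `parseJsonOr<string[]>(row.events, [], isStringArray)`, where `row.events` is an assumed column name.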
Closes #342
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* feat: URL spidering — crawl linked pages with configurable depth and page limits (#315) (#343)
* feat: URL spidering — crawl linked pages with configurable depth and page limits (#315)
Adds opt-in spidering to URL indexing. A single seed URL can now crawl
and index an entire documentation site or wiki section in one call.
New files:
- src/core/link-extractor.ts: indexOf-based <a href> extraction, relative
URL resolution, fragment stripping, dedup, scheme filtering. No regex.
- src/core/spider.ts: BFS crawl engine with sameDomain, pathPrefix,
excludePatterns (glob), maxPages (hard cap 200), maxDepth (hard cap 5),
10-min total timeout, robots.txt (User-agent: * and libscope), and
1s inter-request delay. Yields SpiderResult per page; returns SpiderStats.
- tests/unit/link-extractor.test.ts: 25 tests covering relative resolution,
dedup, fragment stripping, scheme filtering, attribute order, edge cases.
- tests/unit/spider.test.ts: 20 tests covering BFS order, depth/page limits,
domain + path + pattern filtering, cycle detection, robots.txt, partial
failure recovery, stats, and abortReason reporting.
Modified:
- src/core/url-fetcher.ts: adds fetchRaw() export returning raw body +
contentType + finalUrl before HTML-to-markdown conversion, so the
spider can extract links from HTML before conversion.
- src/api/routes.ts: POST /api/v1/documents/url now accepts spider, maxPages,
maxDepth, sameDomain, pathPrefix, excludePatterns. Returns { documents,
pagesIndexed, pagesCrawled, pagesSkipped, errors, abortReason }.
- src/mcp/server.ts: submit-document tool gains spider, maxPages, maxDepth,
sameDomain, pathPrefix, excludePatterns parameters.
Safety: all fetched URLs pass through the existing SSRF validation in
fetchRaw() (DNS resolution, private IP blocking, scheme allowlist).
Hard limits (200 pages, depth 5, 10min) cannot be overridden by callers.
robots.txt is fetched once per origin and Disallow rules are honoured.
Individual page failures do not abort the crawl.
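A simplified sketch of the bounded BFS at the core of the spider. It is synchronous, takes link discovery as an injected function, and returns an array, whereas the commit describes a generator. It also omits robots.txt, the request delay, the total timeout, pathPrefix, and excludePatterns. The default values 25 and 2 and the names `crawl` and `getLinks` are assumptions; the hard caps of 200 and 5 are the ones stated above.

```typescript
interface CrawlOptions { maxPages?: number; maxDepth?: number; sameDomain?: boolean }

const HARD_MAX_PAGES = 200;
const HARD_MAX_DEPTH = 5;

// Breadth-first crawl from `seed`; `getLinks` stands in for fetch + link extraction
// and is expected to return absolute URLs.
function crawl(seed: string, getLinks: (url: string) => string[], options: CrawlOptions = {}): string[] {
  const maxPages = Math.min(options.maxPages ?? 25, HARD_MAX_PAGES); // callers cannot exceed the caps
  const maxDepth = Math.min(options.maxDepth ?? 2, HARD_MAX_DEPTH);
  const sameDomain = options.sameDomain ?? true;
  const seedHost = new URL(seed).hostname;

  const visited = new Set<string>([seed]); // cycle detection
  const queue: Array<{ url: string; depth: number }> = [{ url: seed, depth: 0 }];
  const fetched: string[] = [];

  while (queue.length > 0 && fetched.length < maxPages) {
    const { url, depth } = queue.shift()!;
    let links: string[];
    try {
      links = getLinks(url);
    } catch {
      continue; // one failing page does not abort the crawl
    }
    fetched.push(url);
    if (depth >= maxDepth) continue; // do not expand beyond the depth limit
    for (const link of links) {
      if (visited.has(link)) continue;
      if (sameDomain && new URL(link).hostname !== seedHost) continue;
      visited.add(link);
      queue.push({ url: link, depth: depth + 1 });
    }
  }
  return fetched;
}
```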
Closes #315
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: resolve CI lint errors in spider implementation
- Remove unnecessary type assertions (routes.ts, mcp/server.ts) —
TypeScript already narrows SpiderResult/SpiderStats from the generator
- Add explicit return type annotation on mock fetchRaw to satisfy
no-unsafe-return rule in spider.test.ts
- Replace .resolves.not.toThrow() with a direct assertion — vitest
.resolves requires a Promise, not an async function
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: address CodeQL security findings in spider/link-extractor
link-extractor.ts (CodeQL #30 — incomplete URL scheme check):
Replace the enumerated scheme blocklist (javascript:, vbscript:, etc.)
with a strict http/https allowlist check on the resolved URL protocol.
An allowlist is exhaustive by definition; a blocklist can always miss
obscure schemes such as blob: or future additions.
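A sketch of the allowlist check, applied to the resolved URL and not to the raw href string. `resolveCrawlableLink` is a hypothetical name and the real link-extractor.ts may be structured differently.

```typescript
const ALLOWED_PROTOCOLS = ["http:", "https:"];

// Resolve `href` against `base`, keep it only if the resolved protocol is
// http or https, and strip the fragment so equivalent links deduplicate.
function resolveCrawlableLink(href: string, base: string): string | null {
  let resolved: URL;
  try {
    resolved = new URL(href, base);
  } catch {
    return null; // unparseable href
  }
  if (ALLOWED_PROTOCOLS.indexOf(resolved.protocol) === -1) return null;
  resolved.hash = "";
  return resolved.toString();
}
```

Checking the parsed protocol means schemes such as javascript:, mailto:, or data: are rejected without being listed one by one.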
spider.ts (CodeQL #31 — incomplete multi-character sanitization):
Replace the regex-based tag stripper /<[^>]+>/g in extractTitle() with
an indexOf-based stripTags() function. The regex stops at the first >,
which can sit inside a quoted attribute value (e.g. <img alt="a>b">),
potentially leaving partial tag content in the extracted title.
The new implementation walks quoted attribute values explicitly so no
tag content leaks through regardless of its internal structure.
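A minimal sketch of a quote-aware tag stripper in the spirit of the change. The name `stripTags` comes from the commit, but this body is an assumption; it treats every `<` as the start of a tag, so it is not suited to text containing a bare `<`.

```typescript
// Remove tags by scanning for "<" with indexOf, then walking to the matching ">"
// while skipping over single- or double-quoted attribute values.
function stripTags(html: string): string {
  let out = "";
  let pos = 0;
  while (pos < html.length) {
    const lt = html.indexOf("<", pos);
    if (lt === -1) {
      out += html.slice(pos);
      break;
    }
    out += html.slice(pos, lt);
    let i = lt + 1;
    let quote = "";
    while (i < html.length) {
      const ch = html.charAt(i);
      if (quote !== "") {
        if (ch === quote) quote = ""; // closing quote
      } else if (ch === '"' || ch === "'") {
        quote = ch; // entering a quoted value: ">" inside it is not the tag end
      } else if (ch === ">") {
        break;
      }
      i++;
    }
    pos = i + 1; // an unterminated tag swallows the rest of the input
  }
  return out;
}
```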
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: address all Copilot review comments on spider PR (#343)
- link-extractor: add word-boundary check in extractHref to prevent
matching data-href, aria-href (false positives on non-href attributes)
- spider: rename pagesIndexed → pagesFetched throughout (SpiderStats
interface already used pagesFetched; sync implementation + tests)
- spider: per-origin robots.txt cache (Map<origin, Set>) fetched lazily
as new origins are encountered during crawl (was seed-only before)
- spider: normalize to raw.finalUrl after redirects — visited set,
yielded URL, and link-extraction base all use the canonical URL
- routes: validate maxPages/maxDepth are finite positive integers
- routes: change conditional spread &&-patterns to ternaries
- routes: remove inner try/catch for spider fetch errors; add FetchError
to top-level handler (consistent with single-URL mode → 502)
- mcp/server: replace conditional spreads with explicit if assignments
- mcp/server: validate spider=true requires url (throws ValidationError)
- openapi: document spider request fields in IndexFromUrlRequest schema,
add SpiderResponse schema, update 201 response to oneOf
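The word-boundary check in the first bullet can be sketched as follows. `extractHref` is named in the commit, but this signature and body are assumptions; the sketch takes the text of a single tag and does not guard against `href=` appearing inside another attribute's quoted value.

```typescript
const isSpace = (c: string): boolean => c.length === 1 && " \t\n\r\f".indexOf(c) !== -1;

// Find the first standalone href attribute in one tag, without regex.
function extractHref(tag: string): string | null {
  const lower = tag.toLowerCase(); // assumes ASCII markup, so indices line up with `tag`
  let from = 0;
  while (true) {
    const at = lower.indexOf("href", from);
    if (at === -1) return null;
    from = at + 4;
    // Word boundary: "href" must follow whitespace, which rejects data-href and aria-href.
    if (at === 0 || !isSpace(lower.charAt(at - 1))) continue;
    let i = from;
    while (isSpace(tag.charAt(i))) i++;
    if (tag.charAt(i) !== "=") continue;
    i++;
    while (isSpace(tag.charAt(i))) i++;
    const quote = tag.charAt(i);
    if (quote === '"' || quote === "'") {
      const close = tag.indexOf(quote, i + 1);
      return close === -1 ? null : tag.slice(i + 1, close);
    }
    // Unquoted value: read up to whitespace or the end of the tag.
    let stop = i;
    while (stop < tag.length && !isSpace(tag.charAt(stop)) && tag.charAt(stop) !== ">") stop++;
    return tag.slice(i, stop);
  }
}
```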
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* style: fix prettier formatting in spider files
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* docs: comprehensive documentation update for v1.3.0 (#347)
- README: fix license (BUSL-1.1, not MIT), expand MCP tools table to
all 26 tools, expand REST API table with all endpoints (webhooks,
links, analytics, connectors status, suggest-tags, bulk ops), add
webhooks section with HMAC signing example, add missing CLI commands
(bulk ops, saved searches, document links, docs update)
- getting-started: fix Node.js requirement (20, not 18), add sections
for web dashboard, organize/annotate features, REST API
- mcp-setup: expand available tools section to list all 26 tools
grouped by category instead of just 4
- mcp-tools reference: add 5 missing tools — update-document,
suggest-tags, link-documents, get-document-links, delete-link
- rest-api reference: add all missing endpoints, reorganize by category,
add examples for update, bulk retag, webhooks, links, saved searches
- configuration guide: document passthrough LLM provider
- configuration reference: add passthrough LLM, llm.ollamaUrl key,
expand config set examples to cover all settable keys
- cli reference: expand config set supported keys list
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: allow release-please PRs to pass merge gate and trigger CI (#348)
* fix: allow release-please PRs to pass merge gate and trigger CI
Two issues prevented PR #238 from getting CI runs:
1. merge-gate blocked release-please PRs — the gate only allowed
'development' as the source branch, but release-please uses
'release-please--branches--main--components--libscope'. Updated
to allow any branch matching 'release-please--*'.
2. CI never ran on the PR — GitHub does not trigger workflows when
GITHUB_TOKEN creates a PR (intentional security restriction to
prevent infinite loops). Fixed by passing a PAT via secrets.GH_TOKEN
to the release-please action so its PR creation triggers CI.
Note: requires a 'GH_TOKEN' secret in repo settings — a classic PAT
with repo and workflow scopes.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* Prepare for release (#345) (#350)
* chore: add development branch workflow (#327)
* chore: add development branch workflow
- Add merge-gate.yml: enforces only 'development' can merge into main
- Update CI/CodeQL/Docker workflows to run on both main and development
- Update dependabot.yml: target-branch set to development for all ecosystems
- Update copilot-instructions.md: document branch workflow convention
- Rulesets configured: Main (requires merge-gate + squash-only),
Development (requires CI status checks + PR)
- Default branch set to development
- All open PRs retargeted to development
* fix: skip Vercel preview deployments on non-main branches
* chore: trigger check refresh
---------
* feat: create-pack from local folder or URL sources (#329)
* fix: comprehensive audit fixes — security, performance, resilience, API hardening
Addresses findings from issue #314:
- SSRF protection for webhook URLs (CRITICAL)
- Scrub secrets from exports
- Stored XSS prevention on document URL
- O(n²) and N+1 fixes in bulk operations
- Rate limit cache eviction improvement
- SSE backpressure handling
- Replace raw Error() with typed errors
- Fetch timeouts on all network calls
- Input validation on API parameters
- Search query length limit
- Silent catch block logging
- DNS rebinding check fix
- N+1 in Slack user resolution
- Pagination on webhook/search list endpoints
Closes #314
* Address PR review comments: fix SSE listener leak, preserve caller signals, add SSRF validation to webhook test, chunk SQL params, use dynamic import in test
- SSE backpressure: create single disconnect promise, race against drain (no listener accumulation)
- http-utils.ts/onenote.ts: use AbortSignal.any() to combine caller signal with timeout
- Webhook test endpoint: validate URL with validateWebhookUrlSsrf before fetch
- bulk.ts: chunk IN clause to 999 params max (SQLite limit)
- webhooks.test.ts: dynamic import after vi.mock() for deterministic DNS mocking
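The IN-clause chunking can be sketched as below. `chunk` and `buildInQueries` are hypothetical helpers, the 999 figure is the limit quoted above, and the table and column names are interpolated directly, so they must be trusted identifiers and never user input.

```typescript
const SQLITE_MAX_PARAMS = 999;

// Split `items` into consecutive slices of at most `size` elements.
function chunk<T>(items: T[], size: number = SQLITE_MAX_PARAMS): T[][] {
  if (!Number.isInteger(size) || size < 1) throw new RangeError("size must be a positive integer");
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

// Build one parameterised query per slice so no statement exceeds the limit.
function buildInQueries(table: string, column: string, ids: string[]): Array<{ sql: string; params: string[] }> {
  return chunk(ids).map((part) => ({
    sql: `SELECT * FROM ${table} WHERE ${column} IN (${part.map(() => "?").join(", ")})`,
    params: part,
  }));
}
```

The caller runs each query in turn and concatenates the rows.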
* ci: consolidate and fix CI/CD workflows
- Merge lint + typecheck into single job (saves one npm ci)
- Add concurrency groups to ci, docker, codeql (cancel stale runs)
- Add dependency-review-action on PRs (block vulnerable deps)
- Add workflow_call trigger to ci.yml for reusability
- Remove duplicate npm publish from release.yml (release-please owns it)
- Fix SDK paths: sdk-go/ → sdk/go/, sdk-python/ → sdk/python/
- Fix Dependabot paths to match actual SDK directories
- Add github-actions ecosystem to Dependabot (keep actions up to date)
* feat: add HTML file parser for .html/.htm document indexing
Adds HtmlParser using the existing node-html-markdown dependency.
Strips <script>, <style>, and <nav> tags before conversion.
Registered for .html and .htm extensions.
Includes 12 tests covering conversion, tag stripping, edge cases.
Closes #317
* fix: address CodeQL and review comments on HTML parser
- Replace regex-based tag stripping with node-html-markdown's native
ignore option — eliminates all 3 CodeQL alerts (incomplete sanitization,
bad HTML filtering regexp)
- Wrap translate() in try/catch, throw ValidationError (consistent with
other parsers)
- Use trimEnd() instead of trim() to preserve leading indentation
- Reuse single NHM instance for efficiency
* Revert "fix: skip Vercel preview deployments on non-main branches"
This reverts commit eb481870c883f77278291b72245b1ca0b890a78c.
* feat: add --from option to pack create for folder/URL sources
Adds createPackFromSource() that builds packs directly from local
folders, files, or URLs without requiring database interaction.
CLI: libscope pack create --name my-pack --from ~/docs/ [--extensions .md,.html] [--exclude pattern] [--no-recursive]
Features:
- Walks directories recursively using registered parsers
- Fetches URLs via fetchAndConvert
- Supports extension filtering, exclude patterns, progress callback
- Multiple --from sources supported
- Output format identical to DB export (pack install works unchanged)
Closes #328
* style: fix prettier formatting
* feat: add gzip support for pack files (.json.gz)
Pack files can now be compressed with gzip for smaller distribution:
- writePackFile/readPackFile auto-detect gzip by extension or magic bytes
- installPack accepts both .json and .json.gz files
- createPackFromSource defaults to .json.gz output (source packs can be large)
- createPack (DB export) still defaults to .json
- Auto-detects gzip by magic bytes even if extension is .json
5 new tests covering gzip write, install, magic byte detection, and round-trip.
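Detection by extension or magic bytes can be sketched as follows. `looksGzipped` and `isGzipPack` are hypothetical names; the two-byte signature 0x1f 0x8b is the standard gzip header.

```typescript
// True when the buffer starts with the gzip signature bytes.
function looksGzipped(bytes: Uint8Array): boolean {
  return bytes.length >= 2 && bytes[0] === 0x1f && bytes[1] === 0x8b;
}

// A pack counts as gzip if the name says so or the content does,
// so a compressed file saved as .json is still read correctly.
function isGzipPack(fileName: string, head: Uint8Array): boolean {
  return fileName.toLowerCase().endsWith(".gz") || looksGzipped(head);
}
```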
* feat: add progress logging and fix dedup handling in pack install
- Log each document as it's indexed so large installs show progress
- Change pack install to use dedup: 'skip' for graceful duplicate handling
- Make title+content_length dedup check respect the dedup mode setting
(previously it always threw ValidationError regardless of dedup mode)
* feat: auto-generate tags during pack creation and apply on install
- Export tokenize() and add suggestTagsFromText() in tags.ts for DB-free tag generation
- createPackFromSource() now auto-generates tags per document via TF-IDF
- installPack() applies doc.tags via addTagsToDocument() after indexing
---------
* feat: add HTML file parser for .html/.htm document indexing (#318)
* feat: add HTML file parser for .html/.htm document indexing
Adds HtmlParser using the existing node-html-markdown dependency.
Strips <script>, <style>, and <nav> tags before conversion.
Registered for .html and .htm extensions.
Includes 12 tests covering conversion, tag stripping, edge cases.
Closes #317
* fix: address CodeQL and review comments on HTML parser
- Replace regex-based tag stripping with node-html-markdown's native
ignore option — eliminates all 3 CodeQL alerts (incomplete sanitization,
bad HTML filtering regexp)
- Wrap translate() in try/catch, throw ValidationError (consistent with
other parsers)
- Use trimEnd() instead of trim() to preserve leading indentation
- Reuse single NHM instance for efficiency
* Revert "fix: skip Vercel preview deployments on non-main branches"
This reverts commit eb481870c883f77278291b72245b1ca0b890a78c.
---------
* build(deps-dev): Bump eslint-config-prettier from 9.1.2 to 10.1.8 (#325)
Bumps [eslint-config-prettier](https://github.com/prettier/eslint-config-prettier) from 9.1.2 to 10.1.8.
- [Release notes](https://github.com/prettier/eslint-config-prettier/releases)
- [Changelog](https://github.com/prettier/eslint-config-prettier/blob/main/CHANGELOG.md)
- [Commits](https://github.com/prettier/eslint-config-prettier/commits/v10.1.8)
---
updated-dependencies:
- dependency-name: eslint-config-prettier
dependency-version: 10.1.8
dependency-type: direct:development
update-type: version-update:semver-major
...
* build(deps-dev): Bump lint-staged from 16.3.1 to 16.3.2
Bumps the minor-and-patch group with 1 update: [lint-staged](https://github.com/lint-staged/lint-staged).
Updates `lint-staged` from 16.3.1 to 16.3.2
- [Release notes](https://github.com/lint-staged/lint-staged/releases)
- [Changelog](https://github.com/lint-staged/lint-staged/blob/main/CHANGELOG.md)
- [Commits](https://github.com/lint-staged/lint-staged/compare/v16.3.1...v16.3.2)
---
updated-dependencies:
- dependency-name: lint-staged
dependency-version: 16.3.2
dependency-type: direct:development
update-type: version-update:semver-patch
dependency-group: minor-and-patch
...
* build(deps): Bump the actions group with 5 updates
Bumps the actions group with 5 updates:
| Package | From | To |
| --- | --- | --- |
| [actions/checkout](https://github.com/actions/checkout) | `4` | `6` |
| [actions/setup-node](https://github.com/actions/setup-node) | `4` | `6` |
| [actions/upload-artifact](https://github.com/actions/upload-artifact) | `4` | `7` |
| [actions/setup-python](https://github.com/actions/setup-python) | `5` | `6` |
| [actions/setup-go](https://github.com/actions/setup-go) | `5` | `6` |
Updates `actions/checkout` from 4 to 6
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/v4...v6)
Updates `actions/setup-node` from 4 to 6
- [Release notes](https://github.com/actions/setup-node/releases)
- [Commits](https://github.com/actions/setup-node/compare/v4...v6)
Updates `actions/upload-artifact` from 4 to 7
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](https://github.com/actions/upload-artifact/compare/v4...v7)
Updates `actions/setup-python` from 5 to 6
- [Release notes](https://github.com/actions/setup-python/releases)
- [Commits](https://github.com/actions/setup-python/compare/v5...v6)
Updates `actions/setup-go` from 5 to 6
- [Release notes](https://github.com/actions/setup-go/releases)
- [Commits](https://github.com/actions/setup-go/compare/v5...v6)
---
updated-dependencies:
- dependency-name: actions/checkout
dependency-version: '6'
dependency-type: direct:production
update-type: version-update:semver-major
dependency-group: actions
- dependency-name: actions/setup-node
dependency-version: '6'
dependency-type: direct:production
update-type: version-update:semver-major
dependency-group: actions
- dependency-name: actions/upload-artifact
dependency-version: '7'
dependency-type: direct:production
update-type: version-update:semver-major
dependency-group: actions
- dependency-name: actions/setup-python
dependency-version: '6'
dependency-type: direct:production
update-type: version-update:semver-major
dependency-group: actions
- dependency-name: actions/setup-go
dependency-version: '6'
dependency-type: direct:production
update-type: version-update:semver-major
dependency-group: actions
...
* build(deps-dev): Bump @types/node from 22.19.13 to 25.3.3
Bumps [@types/node](https://github.com/DefinitelyTyped/DefinitelyTyped/tree/HEAD/types/node) from 22.19.13 to 25.3.3.
- [Release notes](https://github.com/DefinitelyTyped/DefinitelyTyped/releases)
- [Commits](https://github.com/DefinitelyTyped/DefinitelyTyped/commits/HEAD/types/node)
---
updated-dependencies:
- dependency-name: "@types/node"
dependency-version: 25.3.3
dependency-type: direct:development
update-type: version-update:semver-major
...
* build(deps): Bump better-sqlite3 from 11.10.0 to 12.6.2
Bumps [better-sqlite3](https://github.com/WiseLibs/better-sqlite3) from 11.10.0 to 12.6.2.
- [Release notes](https://github.com/WiseLibs/better-sqlite3/releases)
- [Commits](https://github.com/WiseLibs/better-sqlite3/compare/v11.10.0...v12.6.2)
---
updated-dependencies:
- dependency-name: better-sqlite3
dependency-version: 12.6.2
dependency-type: direct:production
update-type: version-update:semver-major
...
* build(deps-dev): Bump eslint from 9.39.3 to 10.0.2
Bumps [eslint](https://github.com/eslint/eslint) from 9.39.3 to 10.0.2.
- [Release notes](https://github.com/eslint/eslint/releases)
- [Commits](https://github.com/eslint/eslint/compare/v9.39.3...v10.0.2)
---
updated-dependencies:
- dependency-name: eslint
dependency-version: 10.0.2
dependency-type: direct:development
update-type: version-update:semver-major
...
* feat: add passthrough LLM mode for ask-question tool (#335)
* feat: add passthrough LLM mode for ask-question tool
Adds llm.provider = "passthrough" so the ask-question MCP tool returns
retrieved context chunks directly to the calling LLM instead of requiring
a separate OpenAI/Ollama provider. This is the natural design for MCP tools
where the client already has an LLM (e.g. Claude Code).
- config.ts: add "passthrough" to llm.provider union type and env var handling
- rag.ts: add isPassthroughMode() helper and getContextForQuestion() which
retrieves and formats context without an LLM call
- mcp/server.ts: ask-question checks passthrough first and returns context
directly; falls through to existing LLM path otherwise
Enable via config: { "llm": { "provider": "passthrough" } }
Enable via env var: LIBSCOPE_LLM_PROVIDER=passthrough
* fix: format config.ts and include passthrough in provider override
- Reformat long if-condition to satisfy prettier (printWidth: 100)
- Fix logic bug: passthrough provider was checked in outer condition but
not spread into overrides.llm.provider
---------
* fix: address 9 audit findings from issue #332 (#333)
* fix: address 9 audit findings from issue #332
Security
- middleware: use timingSafeEqual for API key comparison (#2)
- url-fetcher, http-utils: replace process-wide NODE_TLS_REJECT_UNAUTHORIZED
mutation with per-request undici Agent to eliminate TLS race condition (#1)
Bugs
- indexing: re-throw unexpected embedding errors so transaction rolls back
instead of silently committing chunks with no vector (#3)
- search: replace correlated minRating subquery with avg_r.avg_rating from
the pre-joined aggregate in FTS and LIKE search paths (#4)
Performance
- bulk: replace O(n²) docs.find() loops with pre-built Map; replace
per-document getDocumentTags() calls with a single getDocumentTagsBatch()
query (#5)
- config: add 30-second TTL cache to loadConfig() so disk reads are not
repeated on every request (#6)
Code quality
- routes: check res.write() return value to handle SSE backpressure (#7)
- reindex: delegate to schema.createVectorTable() instead of duplicating
the vec0 DDL inline (#8)
- obsidian: replace hand-rolled parseSimpleYaml() with js-yaml, normalise
Date objects back to ISO-8601 strings (#9)
Docs
- agents.md: expand architecture tree to include src/api/ and src/connectors/;
add Security Patterns section with correct undici examples
- CONTRIBUTING.md: fix check-suite command (npm test → npm run test:coverage)
and correct coverage threshold (80% → actual 75%/74%)
Tests
- bulk.test: add dateFrom/dateTo filter coverage
- config.test: add cache-hit test; call invalidateConfigCache() before env-var
tests so TTL cache doesn't return stale results
* fix: remove unused warnIfTlsBypassMissing function
Dead code after conflict resolution chose the per-request undici Agent
approach (which doesn't need a warning about NODE_TLS_REJECT_UNAUTHORIZED).
* fix: update tests for config cache and retry semantics
- Add invalidateConfigCache() before loadConfig() in 4 env-override tests
that were failing because the 30s TTL cache introduced in the config
module was returning stale results from the previous test's cache entry
- Update http-utils retry assertion: maxRetries=2 means 1 initial + 2
retries = 3 total calls (loop is attempt <= maxRetries)
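The call count follows directly from the loop bound. A minimal synchronous sketch, assuming a hypothetical withRetries helper (the real http-utils code is asynchronous and is not reproduced here):

```typescript
// Hypothetical, synchronous stand-in for a retrying call.
// attempt 0 is the initial call; attempts 1..maxRetries are the retries,
// so a function that always fails is invoked maxRetries + 1 times.
function withRetries<T>(fn: (attempt: number) => T, maxRetries: number): T {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return fn(attempt);
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```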
---------
* feat: CLI logging improvements and pack installation performance (#330) (#336)
- Add `src/cli/reporter.ts`: PrettyReporter (ANSI colors + \r progress
bar), SilentReporter (no-op for verbose/JSON mode), `isVerbose()` and
`createReporter()` factory. `LIBSCOPE_VERBOSE=1` env var alternative.
- Update `setupLogging` to default to "silent" in CLI mode (pretty
reporter handles user-facing output). Verbose/`--log-level` flags still
route to structured JSON pino logs. Fix duplicate `initLogger` calls in
onenote connect/disconnect commands to use `setupLogging` consistently.
- Update `installPack` in `packs.ts` to support batch embedding and
progress reporting:
- New `InstallOptions` interface with `batchSize`, `resumeFrom`,
`onProgress` fields
- Batch documents: chunk all → single `provider.embedBatch` call per
batch → single SQLite transaction per batch (avoids N embedding calls)
- `resumeFrom` skips the first N documents (enables partial install
resume after failure)
- `InstallResult` now includes `errors` count
- Add `--batch-size` and `--resume-from` CLI options to `pack install`
- Tests: `tests/unit/reporter.test.ts` (17 tests covering PrettyReporter,
SilentReporter, isVerbose, env var detection); extended
`tests/unit/packs.test.ts` with 7 new tests for progress callbacks,
batch efficiency, resumeFrom, embedBatch failure handling.
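The interaction between batchSize and resumeFrom can be sketched as a pure planning step. planBatches is a hypothetical helper for illustration; the argument checks are an assumption added so the loop cannot spin forever on a zero batch size.

```typescript
// Hypothetical sketch: skip the first `resumeFrom` documents, then group the
// remainder into batches of at most `batchSize`. Each batch would then get
// one embedBatch call and one transaction.
function planBatches<T>(docs: readonly T[], batchSize: number, resumeFrom = 0): T[][] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  if (!Number.isInteger(resumeFrom) || resumeFrom < 0) {
    throw new RangeError("resumeFrom must be a non-negative integer");
  }
  const batches: T[][] = [];
  for (let i = resumeFrom; i < docs.length; i += batchSize) {
    batches.push(docs.slice(i, i + batchSize));
  }
  return batches;
}
```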
* Claude/fix issue 331 s1qzu (#338)
* build(deps): Bump the npm_and_yarn group across 1 directory with 2 updates (#337)
Bumps the npm_and_yarn group with 2 updates in the / directory: [@hono/node-server](https://github.com/honojs/node-server) and [hono](https://github.com/honojs/hono).
Updates `@hono/node-server` from 1.19.9 to 1.19.10
- [Release notes](https://github.com/honojs/node-server/releases)
- [Commits](https://github.com/honojs/node-server/compare/v1.19.9...v1.19.10)
Updates `hono` from 4.12.3 to 4.12.5
- [Release notes](https://github.com/honojs/hono/releases)
- [Commits](https://github.com/honojs/hono/compare/v4.12.3...v4.12.5)
---
updated-dependencies:
- dependency-name: "@hono/node-server"
dependency-version: 1.19.10
dependency-type: indirect
dependency-group: npm_and_yarn
- dependency-name: hono
dependency-version: 4.12.5
dependency-type: indirect
dependency-group: npm_and_yarn
...
* fix: eliminate SSRF TOCTOU and ReDoS vulnerabilities (#341)
* fix: eliminate SSRF TOCTOU and ReDoS vulnerabilities (closes #340)
**SSRF (CWE-918 — CodeQL alert #28)**
Replace the two-step validate-then-fetch approach in url-fetcher.ts with
IP-pinned requests using node:http / node:https directly. validateUrl()
resolves DNS and checks for private IPs, then the validated IP is passed
straight to the TCP connection (hostname: pinnedIp, servername: original
hostname for TLS SNI). There is now zero TOCTOU window between validation
and the actual network request. The redundant post-fetch DNS rebinding
check and the env-var-based TLS bypass wrapper are removed; rejectUnauthorized
is now passed directly to the request options.
An internal _setRequestImpl hook is exported for unit test injection so
tests can stub responses without touching node:http / node:https.
Tests are updated accordingly.
**ReDoS (CWE-1333 — CodeQL alert #24)**
Five regexes in confluence.ts used the pattern [^>]*ac:name="X"[^>]*, that is,
two [^>]* quantifiers around a fixed literal. For input that contains a
large attribute blob without the target ac:name value, the engine can try
O(n²) splits before concluding there is no match (polynomial backtracking).
Fix: rewrite the leading [^>]* as (?:(?!ac:name="X")[^>])* so that a negative
lookahead prevents the quantifier from overlapping with the literal,
which removes the ambiguity that caused the backtracking.
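The two pattern shapes can be compared side by side. This is a sketch only: "toc" stands in for the X placeholder, and the patterns are written from the description above rather than copied from confluence.ts.

```typescript
// Illustrative patterns; "toc" is a stand-in value for ac:name.
// naivePattern: the leading [^>]* may stop at many positions before the literal.
const naivePattern = /<ac:structured-macro[^>]*ac:name="toc"[^>]*>/;
// guardedPattern: the lookahead forbids the leading group from consuming the
// start of the literal, so it can only hand over at the first occurrence.
const guardedPattern = /<ac:structured-macro(?:(?!ac:name="toc")[^>])*ac:name="toc"[^>]*>/;

function isTocMacro(tag: string): boolean {
  return guardedPattern.test(tag);
}
```

Both patterns accept and reject the same well-formed tags; the difference is only in how much work a non-matching input can cause.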
* fix: replace remaining polynomial regexes in confluence.ts with indexOf helpers
The conflict resolution left the original
`/<ac:structured-macro [^>]*>([\s\S]*?)<\/ac:structured-macro>/gi`
patterns in place for code blocks and info/tip panels (those were not part
of the original security fix diff). These have the same backtracking
problem: with k opening tags and no closing tags, [\s\S]*? scans
O(n - pos) chars per attempt, totalling O(k·n), which is O(n²) in the worst case.
Replace the entire convertConfluenceStorage function with the indexOf-based
approach (replaceStructuredMacros / replaceTagPairs / extractTagContent helpers)
that eliminates all regex-based tag parsing. Also add removeSelfClosingMacros
to handle the self-closing TOC case without regex, since the previous self-closing
fix still used a [^>]*ac:name="toc"[^>]* pattern.
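The indexOf-based approach can be sketched in a few lines. replaceTagPairsSketch below is a hypothetical helper written for illustration; it is not the project's replaceTagPairs and ignores nesting and attributes.

```typescript
// Hypothetical sketch: replace every open...close pair using indexOf only.
// When no closing tag remains, scanning stops, so an input with many opening
// tags and no closing tag is handled in a single pass.
function replaceTagPairsSketch(
  input: string,
  open: string,
  close: string,
  render: (inner: string) => string,
): string {
  let out = "";
  let pos = 0;
  while (pos < input.length) {
    const start = input.indexOf(open, pos);
    if (start === -1) break;
    const innerStart = start + open.length;
    const end = input.indexOf(close, innerStart);
    if (end === -1) break; // unclosed tag: keep the remainder verbatim
    out += input.slice(pos, start) + render(input.slice(innerStart, end));
    pos = end + close.length;
  }
  return out + input.slice(pos);
}
```

The early break on a missing closing tag is the design point: once one search for the closing tag fails, no later opening tag can succeed either, so there is nothing to retry.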
---------
* feat: concurrent pack installation and -v verbose shorthand (issue #330) (#339)
* feat: concurrent pack installation and -v verbose shorthand (issue #330)
Add concurrent batch embedding to installPack for significant performance
improvement on large packs, plus CLI ergonomics improvements.
Key changes:
- `InstallOptions.concurrency` (default: 4): controls how many embedBatch
calls run simultaneously; embedding is I/O-bound so parallelism directly
reduces wall-clock installation time
- Refactor installPack to pre-chunk all documents upfront, then use a
semaphore-based scheduler to run up to `concurrency` embedBatch calls
concurrently while inserting completed batches in-order (SQLite requires
serialised writes); progress callbacks fire after each batch as before
- `pack install --concurrency <n>` CLI flag exposes the new option
- `-v` shorthand for `--verbose` on the global program options
- Fix transaction install-count tracking: count committed docs accurately
without relying on subtract-on-failure arithmetic
- Add 6 new tests covering concurrency=1 sequential, concurrency=4 parallel,
multiple embedBatch calls per install, concurrency limit enforcement,
incremental progress reporting, and partial-failure error counting
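Bounded concurrency with in-order results can be sketched as follows. Note the swap: this uses a fixed set of worker lanes pulling from a shared index rather than the semaphore-based scheduler the commit describes, and mapWithConcurrency is a hypothetical name.

```typescript
// Hypothetical sketch: run `worker` over `items` with at most `limit` calls
// in flight. Results are stored by input index, so the returned array is in
// input order regardless of which call finishes first.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results: R[] = new Array<R>(items.length);
  let next = 0;
  async function lane(): Promise<void> {
    // Each lane claims the next unprocessed index until none remain.
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }
  const lanes: Promise<void>[] = [];
  for (let i = 0; i < Math.min(limit, items.length); i++) lanes.push(lane());
  await Promise.all(lanes);
  return results;
}
```

Because results come back in input order, a caller could write them to SQLite sequentially after (or as) they resolve, which matches the serialised-writes constraint mentioned above.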
https://claude.ai/code/session_019hzhbEgV1ysnGmFVXBWzkZ
* fix: address all 4 Copilot review comments on PR #339
- Validate batchSize, concurrency, resumeFrom at the start of installPack
and throw ValidationError for invalid values (comments 3 & 4). Concurrency
<= 0 would silently hang the semaphore indefinitely.
- Add CLI lower-bound guard: --concurrency < 1 exits with a user-facing
error before ever calling installPack (comment 3).
- Lazy chunking: pre-chunking all documents upfront held chunks for the
entire pack in memory simultaneously. Batches now store only the raw
documents; resolveBatch() chunks on demand right before embedBatch
is called, so only one batch's worth of chunks is in memory at a time
(comment 2).
- Wrap provider.embedBatch() in try/catch so synchronous throws are
converted to rejected Promises rather than escaping scheduleNext() and
leaving the outer Promise permanently pending (comment 1).
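The last point, converting a synchronous throw into a rejected Promise, can be sketched like this. callAsPromise is a hypothetical wrapper, not a function from packs.ts.

```typescript
// Hypothetical sketch: a provider method typed as returning a Promise may
// still throw synchronously (for example, on argument validation). Wrapping
// the call guarantees the caller always receives a Promise to chain on.
function callAsPromise<T>(fn: () => Promise<T>): Promise<T> {
  try {
    return fn();
  } catch (err) {
    return Promise.reject(err);
  }
}
```

Without the wrapper, the exception would escape the scheduling function before any rejection handler was attached, and the outer Promise would never settle.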
---------
* fix: address 7 pre-release bugs from audit (#342) (#344)
- Guard JSON.parse in rowToWebhook with try/catch, default to []
- Guard JSON.parse in rowToSavedSearch with try/catch, default to null
- Guard JSON.parse in loadDbConnectorConfig with try/catch, throw ConfigError
- Push dateFrom/dateTo filters into SQL WHERE in listDocuments (before LIMIT)
- Validate negative limit in resolveSelector, throw ValidationError
- Replace manual substring extension parsing with path.extname() in packs.ts
- Verified reporter.ts is already tracked on development (no action needed)
- Added tests for all fixes (corrupted JSON, SQL-level date filtering, negative limit)
Closes #342
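Why the date filters have to run before LIMIT can be shown with an in-memory analogy. This sketch is an assumption-laden illustration: Doc, createdAt, and listDocumentsSketch are hypothetical, and the real fix lives in the SQL WHERE clause.

```typescript
// Hypothetical in-memory analogy of "filter in WHERE, then LIMIT".
// ISO-8601 timestamps in the same format compare correctly as strings.
interface Doc {
  id: number;
  createdAt: string;
}

function listDocumentsSketch(
  docs: readonly Doc[],
  opts: { dateFrom?: string; dateTo?: string; limit: number },
): Doc[] {
  return docs
    .filter(
      (d) =>
        (opts.dateFrom === undefined || d.createdAt >= opts.dateFrom) &&
        (opts.dateTo === undefined || d.createdAt <= opts.dateTo),
    )
    .slice(0, opts.limit); // limit is applied only after filtering
}
```

Applying the limit first and filtering afterwards could return fewer rows than requested, or none, even when matching documents exist further down the table.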
* feat: URL spidering — crawl linked pages with configurable depth and page limits (#315) (#343)
* feat: URL spidering — crawl linked pages with configurable depth and page limits (#315)
Adds opt-in spidering to URL indexing. A single seed URL can now crawl
and index an entire documentation site or wiki section in one call.
New files:
- src/core/link-extractor.ts: indexOf-based <a href> extraction, relative
URL resolution, fragment stripping, dedup, scheme filtering. No regex.
- src/core/spider.ts: BFS crawl engine with sameDomain, pathPrefix,
excludePatterns (glob), maxPages (hard cap 200), maxDepth (hard cap 5),
10-min total timeout, robots.txt (User-agent: * and libscope), and
1s inter-request delay. Yields SpiderResult per page; returns SpiderStats.
- tests/unit/link-extractor.test.ts: 25 tests covering relative resolution,
dedup, fragment stripping, scheme filtering, attribute order, edge cases.
- tests/unit/spider.test.ts: 20 tests covering BFS order, depth/page limits,
domain + path + pattern filtering, cycle detection, robots.txt, partial
failure recovery, stats, and abortReason reporting.
Modified:
- src/core/url-fetcher.ts: adds fetchRaw() export returning raw body +
contentType + finalUrl before HTML-to-markdown conversion, so the
spider can extract links from HTML before conversion.
- src/api/routes.ts: POST /api/v1/documents/url now accepts spider, maxPages,
maxDepth, sameDomain, pathPrefix, excludePatterns. Returns { documents,
pagesIndexed, pagesCrawled, pagesSkipped, errors, abortReason }.
- src/mcp/server.ts: submit-document tool gains spider, maxPages, maxDepth,
sameDomain, pathPrefix, excludePatterns parameters.
Safety: all fetched URLs pass through the existing SSRF validation in
fetchRaw() (DNS resolution, private IP blocking, scheme allowlist).
Hard limits (200 pages, depth 5, 10min) cannot be overridden by callers.
robots.txt is fetched once per origin and Disallow rules are honoured.
Individual page failures do not abort the crawl.
Closes #315
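The BFS order and the two limits can be sketched over an in-memory link graph. Everything here is a simplification under assumptions: crawlOrderSketch is hypothetical, the seed is taken to be depth 0, and fetching, robots.txt, the inter-request delay, and the timeout are left out.

```typescript
// Hypothetical sketch: breadth-first crawl order over a map from page URL to
// the links found on that page. A visited set provides cycle detection.
interface CrawlLimits {
  maxPages: number;
  maxDepth: number;
}

function crawlOrderSketch(
  seed: string,
  graph: Map<string, string[]>,
  limits: CrawlLimits,
): string[] {
  const visited = new Set<string>([seed]);
  const queue: Array<{ url: string; depth: number }> = [{ url: seed, depth: 0 }];
  const order: string[] = [];
  while (queue.length > 0 && order.length < limits.maxPages) {
    const { url, depth } = queue.shift()!;
    order.push(url);
    if (depth >= limits.maxDepth) continue; // do not follow links past maxDepth
    for (const link of graph.get(url) ?? []) {
      if (!visited.has(link)) {
        visited.add(link);
        queue.push({ url: link, depth: depth + 1 });
      }
    }
  }
  return order;
}
```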
* fix: resolve CI lint errors in spider implementation
- Remove unnecessary type assertions (routes.ts, mcp/server.ts) —
TypeScript already narrows SpiderResult/SpiderStats from the generator
- Add explicit return type annotation on mock fetchRaw to satisfy
no-unsafe-return rule in spider.test.ts
- Replace .resolves.not.toThrow() with a direct assertion — vitest
.resolves requires a Promise, not an async function
* fix: address CodeQL security findings in spider/link-extractor
link-extractor.ts (CodeQL #30 — incomplete URL scheme check):
Replace the enumerated scheme blocklist (javascript:, vbscript:, etc.)
with a strict http/https allowlist check on the resolved URL protocol.
An allowlist is exhaustive by definition; a blocklist will always miss
some obscure scheme (blob:, for example) or a future addition.
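An allowlist check on the resolved protocol might look like the sketch below. resolveHttpLink is a hypothetical helper built on the standard URL class; it is not the code in link-extractor.ts.

```typescript
// Hypothetical sketch: resolve an href against a base URL, accept only
// http/https, and strip the fragment. Returns null for anything else.
function resolveHttpLink(href: string, base: string): string | null {
  let resolved: URL;
  try {
    resolved = new URL(href, base);
  } catch {
    return null; // unparseable href
  }
  if (resolved.protocol !== "http:" && resolved.protocol !== "https:") {
    return null; // javascript:, mailto:, blob:, data:, and so on
  }
  resolved.hash = "";
  return resolved.href;
}
```

Checking the protocol after resolution also covers relative links, which inherit the scheme of the base URL.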
spider.ts (CodeQL #31 — incomplete multi-character sanitization):
Replace the regex-based tag stripper /<[^>]+>/g in extractTitle() with
an indexOf-based stripTags() function. The regex stops at the first >
which can be inside a quoted attribute value (e.g. <img alt="a>b">),
potentially leaving partial tag content in the extracted title.
The new implementation walks quoted attribute values explicitly so no
tag content leaks through regardless of its internal structure.
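A tag stripper that walks quoted attribute values could be sketched as follows. stripTagsSketch is a hypothetical reimplementation written from the description above, not the project's stripTags().

```typescript
// Hypothetical sketch: remove tags with indexOf plus a small quote-aware
// scan, so a ">" inside a quoted attribute value does not end the tag.
function stripTagsSketch(html: string): string {
  let out = "";
  let i = 0;
  while (i < html.length) {
    const lt = html.indexOf("<", i);
    if (lt === -1) {
      out += html.slice(i);
      break;
    }
    out += html.slice(i, lt);
    let j = lt + 1;
    let quote = "";
    while (j < html.length) {
      const ch = html[j];
      if (quote !== "") {
        if (ch === quote) quote = ""; // closing quote
      } else if (ch === '"' || ch === "'") {
        quote = ch; // opening quote
      } else if (ch === ">") {
        break; // real end of tag
      }
      j++;
    }
    i = j + 1; // skip the ">" (or run off the end for an unterminated tag)
  }
  return out;
}
```

On the example from the commit message, the regex leaves `b">` behind while the quote-aware scan removes the whole tag.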
* fix: address all Copilot review comments on spider PR (#343)
- link-extractor: add word-boundary check in extractHref to prevent
matching data-href, aria-href (false positives on non-href attributes)
- spider: rename pagesIndexed → pagesFetched throughout (the SpiderStats
interface already used pagesFetched; the implementation and tests now match)
- spider: per-origin robots.txt cache (Map<origin, Set>) fetched lazily
as new origins are encountered during crawl (was seed-only before)
- spider: normalize to raw.finalUrl after redirects — visited set,
yielded URL, and link-extraction base all use the canonical URL
- routes: validate maxPages/maxDepth are finite positive integers
- routes: change conditional spread &&-patterns to ternaries
- routes: remove inner try/catch for spider fetch errors; add FetchError
to top-level handler (consistent with single-URL mode → 502)
- mcp/server: replace conditional spreads with explicit if assignments
- mcp/server: validate spider=true requires url (throws ValidationError)
- openapi: document spider request fields in IndexFromUrlRequest schema,
add SpiderResponse schema, update 201 response to oneOf
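The maxPages/maxDepth validation combined with the hard caps might be sketched like this. The helper name, the fallback parameter, and the choice to clamp (rather than reject) values above the cap are all assumptions; the commit only states that the values must be finite positive integers and that the caps cannot be exceeded.

```typescript
// Hypothetical sketch: accept only finite positive integers, and never
// return more than the hard cap.
function resolveLimit(value: unknown, fallback: number, hardCap: number): number {
  if (value === undefined) return Math.min(fallback, hardCap);
  if (typeof value !== "number" || !Number.isInteger(value) || value < 1) {
    throw new RangeError("limit must be a finite positive integer");
  }
  return Math.min(value, hardCap);
}
```

Number.isInteger already rejects NaN and Infinity, so no separate finiteness check is needed.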
* style: fix prettier formatting in spider files
---------
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* chore: add development branch workflow (#327)
* chore: add development branch workflow
- Add merge-gate.yml: enforces only 'development' can merge into main
- Update CI/CodeQL/Docker workflows to run on both main and development
- Update dependabot.yml: target-branch set to development for all ecosystems
- Update copilot-instructions.md: document branch workflow convention
- Rulesets configured: Main (requires merge-gate + squash-only),
Development (requires CI status checks + PR)
- Default branch set to development
- All open PRs retargeted to development
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* fix: skip Vercel preview deployments on non-main branches
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* chore: trigger check refresh
---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* feat: create-pack from local folder or URL sources (#329)
* fix: comprehensive audit fixes — security, performance, resilience, API hardening
Addresses findings from issue #314:
- SSRF protection for webhook URLs (CRITICAL)
- Scrub secrets from exports
- Stored XSS prevention on document URL
- O(n²) and N+1 fixes in bulk operations
- Rate limit cache eviction improvement
- SSE backpressure handling
- Replace raw Error() with typed errors
- Fetch timeouts on all network calls
- Input validation on API parameters
- Search query length limit
- Silent catch block logging
- DNS rebinding check fix
- N+1 in Slack user resolution
- Pagination on webhook/search list endpoints
Closes #314
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Address PR review comments: fix SSE listener leak, preserve caller signals, add SSRF validation to webhook test, chunk SQL params, use dynamic import in test
- SSE backpressure: create single disconnect promise, race against drain (no listener accumulation)
- http-utils.ts/onenote.ts: use AbortSignal.any() to combine caller signal with timeout
- Webhook test endpoint: validate URL with validateWebhookUrlSsrf before fetch
- bulk.ts: chunk IN clause to 999 params max (SQLite limit)
- webhooks.test.ts: dynamic import after vi.mock() for deterministic DNS mocking
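Chunking an IN clause to a parameter ceiling can be sketched as below. chunkForInClause is a hypothetical helper; 999 is the figure given in the commit message and is used here as the default.

```typescript
// Hypothetical sketch: split a list of bound values into groups that fit
// under a parameter limit, producing the "?, ?, ..." placeholder string
// for each group.
function chunkForInClause<T>(
  ids: readonly T[],
  maxParams = 999,
): Array<{ placeholders: string; params: T[] }> {
  if (!Number.isInteger(maxParams) || maxParams < 1) {
    throw new RangeError("maxParams must be a positive integer");
  }
  const chunks: Array<{ placeholders: string; params: T[] }> = [];
  for (let i = 0; i < ids.length; i += maxParams) {
    const params = ids.slice(i, i + maxParams);
    chunks.push({ placeholders: params.map(() => "?").join(", "), params });
  }
  return chunks;
}
```

Each chunk would be run as its own `... WHERE id IN (<placeholders>)` query and the row sets concatenated.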
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* ci: consolidate and fix CI/CD workflows
- Merge lint + typecheck into single job (saves one npm ci)
- Add concurrency groups to ci, docker, codeql (cancel stale runs)
- Add dependency-review-action on PRs (block vulnerable deps)
- Add workflow_call trigger to ci.yml for reusability
- Remove duplicate npm publish from release.yml (release-please owns it)
- Fix SDK paths: sdk-go/ → sdk/go/, sdk-python/ → sdk/python/
- Fix Dependabot paths to match actual SDK directories
- Add github-actions ecosystem to Dependabot (keep actions up to date)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* feat: add HTML file parser for .html/.htm document indexing
Adds HtmlParser using the existing node-html-markdown dependency.
Strips <script>, <style>, and <nav> tags before conversion.
Registered for .html and .htm extensions.
Includes 12 tests covering conversion, tag stripping, edge cases.
Closes #317
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* fix: address CodeQL and review comments on HTML parser
- Replace regex-based tag stripping with node-html-markdown's native
ignore option — eliminates all 3 CodeQL alerts (incomplete sanitization,
bad HTML filtering regexp)
- Wrap translate() in try/catch, throw ValidationError (consistent with
other parsers)
- Use trimEnd() instead of trim() to preserve leading indentation
- Reuse single NHM instance for efficiency
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Revert "fix: skip Vercel preview deployments on non-main branches"
This reverts commit eb481870c883f77278291b72245b1ca0b890a78c.
* feat: add --from option to pack create for folder/URL sources
Adds createPackFromSource() that builds packs directly from local
folders, files, or URLs without requiring database interaction.
CLI: libscope pack create --name my-pack --from ~/docs/ [--extensions .md,.html] [--exclude pattern] [--no-recursive]
Features:
- Walks directories recursively using registered parsers
- Fetches URLs via fetchAndConvert
- Supports extension filtering, exclude patterns, progress callback
- Multiple --from sources supported
- Output format identical to DB export (pack install works unchanged)
Closes #328
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* style: fix prettier formatting
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* feat: add gzip support for pack files (.json.gz)
Pack files can now be compressed with gzip for smaller distribution:
- writePackFile/readPackFile auto-detect gzip by extension or by magic bytes,
so a gzipped pack is still recognised when its extension is .json
- installPack accepts both .json and .json.gz files
- createPackFromSource defaults to .json.gz output (source packs can be large)
- createPack (DB export) still defaults to .json
5 new tests covering gzip write, install, magic byte detection, and round-trip.
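Magic-byte detection relies on the fact that every gzip stream starts with the two bytes 0x1f 0x8b. A minimal sketch, with looksGzipped as a hypothetical name:

```typescript
// Hypothetical sketch: a gzip member always begins with 0x1f 0x8b, so the
// first two bytes are enough to tell gzip from plain JSON text.
function looksGzipped(bytes: Uint8Array): boolean {
  return bytes.length >= 2 && bytes[0] === 0x1f && bytes[1] === 0x8b;
}
```

A reader can apply this check to the file contents and fall back to the extension only when the buffer is too short to decide.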
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* feat: add progress logging and fix dedup handling in pack install
- Log each document as it's indexed so large installs show progress
- Change pack install to use dedup: 'skip' for graceful duplicate handling
- Make title+content_length dedup check respect the dedup mode setting
(previously it always threw ValidationError regardless of dedup mode)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* feat: auto-generate tags during pack creation and apply on install
- Export tokenize() and add suggestTagsFromText() in tags.ts for DB-free tag generation
- createPackFromSource() now auto-generates tags per document via TF-IDF
- installPack() applies doc.tags via addTagsToDocument() after indexing
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* feat: add HTML file parser for .html/.htm document indexing (#318)
---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* build(deps-dev): Bump eslint-config-prettier from 9.1.2 to 10.1.8 (#325)
Bumps [eslint-config-prettier](https://github.com/prettier/eslint-config-prettier) from 9.1.2 to 10.1.8.
- [Release notes](https://github.com/prettier/eslint-config-prettier/releases)
- [Changelog](https://github.com/prettier/eslint-config-prettier/blob/main/CHANGELOG.md)
- [Commits](https://github.com/prettier/eslint-config-prettier/commits/v10.1.8)
---
updated-dependencies:
- dependency-name: eslint-config-prettier
dependency-version: 10.1.8
dependency-type: direct:development
update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* build(deps-dev): Bump lint-staged from 16.3.1 to 16.3.2
Bumps the minor-and-patch group with 1 update: [lint-staged](https://github.com/lint-staged/lint-staged).
Updates `lint-staged` from 16.3.1 to 16.3.2
- [Release notes](https://github.com/lint-staged/lint-staged/releases)
- [Changelog](https://github.com/lint-staged/lint-staged/blob/main/CHANGELOG.md)
- [Commits](https://github.com/lint-staged/lint-staged/compare/v16.3.1...v16.3.2)
---
updated-dependencies:
- dependency-name: lint-staged
dependency-version: 16.3.2
dependency-type: direct:development
update-type: version-update:semver-patch
dependency-group: minor-and-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* build(deps): Bump the actions group with 5 updates
Bumps the actions group with 5 updates:
| Package | From | To |
| --- | --- | --- |
| [actions/checkout](https://github.com/actions/checkout) | `4` | `6` |
| [actions/setup-node](https://github.com/actions/setup-node) | `4` | `6` |
| [actions/upload-artifact](https://github.com/actions/upload-artifact) | `4` | `7` |
| [actions/setup-python](https://github.com/actions/setup-python) | `5` | `6` |
| [actions/setup-go](https://github.com/actions/setup-go) | `5` | `6` |
Updates `actions/checkout` from 4 to 6
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/v4...v6)
Updates `actions/setup-node` from 4 to 6
- [Release notes](https://github.com/actions/setup-node/releases)
- [Commits](https://github.com/actions/setup-node/compare/v4...v6)
Updates `actions/upload-artifact` from 4 to 7
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](https://github.com/actions/upload-artifact/compare/v4...v7)
Updates `actions/setup-python` from 5 to 6
- [Release notes](https://github.com/actions/setup-python/releases)
- [Commits](https://github.com/actions/setup-python/compare/v5...v6)
Updates `actions/setup-go` from 5 to 6
- [Release notes](https://github.com/actions/setup-go/releases)
- [Commits](https://github.com/actions/setup-go/compare/v5...v6)
---
updated-dependencies:
- dependency-name: actions/checkout
dependency-version: '6'
dependency-type: direct:production
update-type: version-update:semver-major
dependency-group: actions
- dependency-name: actions/setup-node
dependency-version: '6'
dependency-type: direct:production
update-type: version-update:semver-major
dependency-group: actions
- dependency-name: actions/upload-artifact
dependency-version: '7'
dependency-type: direct:production
update-type: version-update:semver-major
dependency-group: actions
- dependency-name: actions/setup-python
dependency-version: '6'
dependency-type: direct:production
update-type: version-update:semver-major
dependency-group: actions
- dependency-name: actions/setup-go
dependency-version: '6'
dependency-type: direct:production
update-type: version-update:semver-major
dependency-group: actions
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* build(deps-dev): Bump @types/node from 22.19.13 to 25.3.3
Bumps [@types/node](https://github.com/DefinitelyTyped/DefinitelyTyped/tree/HEAD/types/node) from 22.19.13 to 25.3.3.
- [Release notes](https://github.com/DefinitelyTyped/DefinitelyTyped/releases)
- [Commits](https://github.com/DefinitelyTyped/DefinitelyTyped/commits/HEAD/types/node)
---
updated-dependencies:
- dependency-name: "@types/node"
dependency-version: 25.3.3
dependency-type: direct:development
update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* build(deps): Bump better-sqlite3 from 11.10.0 to 12.6.2
Bumps [better-sqlite3](https://github.com/WiseLibs/better-sqlite3) from 11.10.0 to 12.6.2.
- [Release notes](https://github.com/WiseLibs/better-sqlite3/releases)
- [Commits](https://github.com/WiseLibs/better-sqlite3/compare/v11.10.0...v12.6.2)
---
updated-dependencies:
- dependency-name: better-sqlite3
dependency-version: 12.6.2
dependency-type: direct:production
update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* build(deps-dev): Bump eslint from 9.39.3 to 10.0.2
Bumps [eslint](https://github.com/eslint/eslint) from 9.39.3 to 10.0.2.
- [Release notes](https://github.com/eslint/eslint/releases)
- [Commits](https://github.com/eslint/eslint/compare/v9.39.3...v10.0.2)
---
updated-dependencies:
- dependency-name: eslint
dependency-version: 10.0.2
dependency-type: direct:development
update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Robert DeRienzo <rderienzo@voloridge.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* feat: add passthrough LLM mode for ask-question tool (#335)
* feat: add passthrough LLM mode for ask-question tool
Adds llm.provider = "passthrough" so the ask-question MCP tool returns
retrieved context chunks directly to the calling LLM instead of requiring
a separate OpenAI/Ollama provider. This is the natural design for MCP tools
where the client already has an LLM (e.g. Claude Code).
- config.ts: add "passthrough" to llm.provider union type and env var handling
- rag.ts: add isPassthroughMode() helper and getContextForQuestion() which
retrieves and formats context without an LLM call
- mcp/server.ts: ask-question checks passthrough first and returns context
directly; falls through to existing LLM path otherwise
Enable via config: { "llm": { "provider": "passthrough" } }
Enable via env var: LIBSCOPE_LLM_PROVIDER=passthrough
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: format config.ts and include passthrough in provider override
- Reformat long if-condition to satisfy prettier (printWidth: 100)
- Fix logic bug: passthrough provider was checked in outer condition but
not spread into overrides.llm.provider
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: address 9 audit findings from issue #332 (#333)
* fix: address 9 audit findings from issue #332
Security
- middleware: use timingSafeEqual for API key comparison (#2)
- url-fetcher, http-utils: replace process-wide NODE_TLS_REJECT_UNAUTHORIZED
mutation with per-request undici Agent to eliminate TLS race condition (#1)
Bugs
- indexing: re-throw unexpected embedding errors so transaction rolls back
instead of silently committing chunks with no vector (#3)
- search: replace correlated minRating subquery with avg_r.avg_rating from
the pre-joined aggregate in FTS and LIKE search paths (#4)
Performance
- bulk: replace O(n²) docs.find() loops with pre-built Map; replace
per-document getDocumentTags() calls with a single getDocumentTagsBatch()
query (#5)
- config: add 30-second TTL cache to loadConfig() so disk reads are not
repeated on every request (#6)
Code quality
- routes: check res.write() return value to handle SSE backpressure (#7)
- reindex: delegate to schema.createVectorTable() instead of duplicating
the vec0 DDL inline (#8)
- obsidian: replace hand-rolled parseSimpleYaml() with js-yaml, normalise
Date objects back to ISO-8601 strings (#9)
Docs
- agents.md: expand architecture tree to include src/api/ and src/connectors/;
add Security Patterns section with correct undici examples
- CONTRIBUTING.md: fix check-suite command (npm test → npm run test:coverage)
and correct coverage threshold (80% → actual 75%/74%)
Tests
- bulk.test: add dateFrom/dateTo filter coverage
- config.test: add cache-hit test; call invalidateConfigCache() before env-var
tests so TTL cache doesn't return stale results
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: remove unused warnIfTlsBypassMissing function
Dead code after conflict resolution chose the per-request undici Agent
approach (which doesn't need a warning about NODE_TLS_REJECT_UNAUTHORIZED).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: update tests for config cache and retry semantics
- Add invalidateConfigCache() before loadConfig() in 4 env-override tests
that were failing because the 30s TTL cache introduced in the config
module was returning stale results from the previous test's cache entry
- Update http-utils retry assertion: maxRetries=2 means 1 initial + 2
retries = 3 total calls (loop is attempt <= maxRetries)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat: CLI logging improvements and pack installation performance (#330) (#336)
- Add `src/cli/reporter.ts`: PrettyReporter (ANSI colors + \r progress
bar), SilentReporter (no-op for verbose/JSON mode), `isVerbose()` and
`createReporter()` factory. `LIBSCOPE_VERBOSE=1` env var alternative.
- Update `setupLogging` to default to "silent" in CLI mode (pretty
reporter handles user-facing output). Verbose/`--log-level` flags still
route to structured JSON pino logs. Fix duplicate `initLogger` calls in
onenote connect/disconnect commands to use `setupLogging` consistently.
- Update `installPack` in `packs.ts` to support batch embedding and
progress reporting:
- New `InstallOptions` interface with `batchSize`, `resumeFrom`,
`onProgress` fields
- Batch documents: chunk all → single `provider.embedBatch` call per
batch → single SQLite transaction per batch (avoids N embedding calls)
- `resumeFrom` skips the first N documents (enables partial install
resume after failure)
- `InstallResult` now includes `errors` count
- Add `--batch-size` and `--resume-from` CLI options to `pack install`
- Tests: `tests/unit/reporter.test.ts` (17 tests covering PrettyReporter,
SilentReporter, isVerbose, env var detection); extended
`tests/unit/packs.test.ts` with 7 new tests for progress callbacks,
batch efficiency, resumeFrom, embedBatch failure handling.
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* Claude/fix issue 331 s1qzu (#338)
* build(deps): Bump the npm_and_yarn group across 1 directory with 2 updates (#337)
Bumps the npm_and_yarn group with 2 updates in the / directory: [@hono/node-server](https://github.com/honojs/node-server) and [hono](https://github.com/honojs/hono).
Updates `@hono/node-server` from 1.19.9 to 1.19.10
- [Release notes](https://github.com/honojs/node-server/releases)
- [Commits](https://github.com/honojs/node-server/compare/v1.19.9...v1.19.10)
Updates `hono` from 4.12.3 to 4.12.5
- [Release notes](https://github.com/honojs/hono/releases)
- [Commits](https://github.com/honojs/hono/compare/v4.12.3...v4.12.5)
---
updated-dependencies:
- dependency-name: "@hono/node-server"
dependency-version: 1.19.10
dependency-type: indirect
dependency-group: npm_and_yarn
- dependency-name: hono
dependency-version: 4.12.5
dependency-type: indirect
dependency-group: npm_and_yarn
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* fix: eliminate SSRF TOCTOU and ReDoS vulnerabilities (#341)
* fix: eliminate SSRF TOCTOU and ReDoS vulnerabilities (closes #340)
**SSRF (CWE-918 — CodeQL alert #28)**
Replace the two-step validate-then-fetch approach in url-fetcher.ts with
IP-pinned requests using node:http / node:https directly. validateUrl()
resolves DNS and checks for private IPs, then the validated IP is passed
straight to the TCP connection (hostname: pinnedIp, servername: original
hostname for TLS SNI). There is now zero TOCTOU window between validation
and the actual network request. The redundant post-fetch DNS rebinding
check and the env-var-based TLS bypass wrapper are removed; rejectUnauthorized
is now passed directly to the request options.
An internal _setRequestImpl hook is exported for unit test injection so
tests can stub responses without touching node:http / node:https.
Tests are updated accordingly.
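A minimal sketch of the IP-pinning idea, assuming the IP has already been validated. `buildPinnedOptions` and `PinnedRequestOptions` are hypothetical names, and sending the original host in the `Host` header is an assumption not stated above. The fields are ones that `https.request` accepts.

```typescript
// Options that connect to a pre-validated IP while keeping the original
// hostname for TLS SNI and certificate checks.
interface PinnedRequestOptions {
  hostname: string; // the validated IP the socket connects to
  servername: string; // original hostname, used for TLS SNI
  port: number;
  path: string;
  headers: Record<string, string>;
  rejectUnauthorized: boolean;
}

function buildPinnedOptions(
  url: URL,
  pinnedIp: string,
  rejectUnauthorized = true,
): PinnedRequestOptions {
  const defaultPort = url.protocol === "https:" ? 443 : 80;
  return {
    hostname: pinnedIp,
    servername: url.hostname,
    port: url.port ? Number(url.port) : defaultPort,
    path: url.pathname + url.search,
    headers: { Host: url.host }, // assumption: keep virtual-host routing intact
    rejectUnauthorized,
  };
}

const pinned = buildPinnedOptions(new URL("https://example.com/docs?page=2"), "93.184.216.34");
console.log(pinned);
```

Because the socket is opened against the IP that was checked, no second DNS lookup happens between validation and connection.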
**ReDoS (CWE-1333 — CodeQL alert #24)**
Five regexes in confluence.ts used the pattern [^>]*ac:name="X"[^>]*,
i.e. two [^>]* quantifiers around a fixed literal. For input that contains a
large attribute blob without the target ac:name value, the engine tries
O(n²) splits before concluding there is no match (polynomial backtracking).
Fix: rewrite the leading [^>]* as (?:(?!ac:name="X")[^>])*. The negative
lookahead prevents the quantifier from overlapping with the literal,
so it can no longer backtrack across it.
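For illustration, here are the two pattern shapes side by side. The surrounding `<ac:structured-macro ...>` wrapper and the `toc` value are assumptions chosen for the example, not the exact regexes from confluence.ts.

```typescript
// Original shape: two [^>]* runs around a fixed literal.
const original = /<ac:structured-macro[^>]*ac:name="toc"[^>]*>/;

// Tempered shape: the leading run refuses to step over the literal,
// so there is only one place where the literal can begin to match.
const tempered = /<ac:structured-macro(?:(?!ac:name="toc")[^>])*ac:name="toc"[^>]*>/;

const samples = [
  '<ac:structured-macro ac:name="toc" />',
  '<ac:structured-macro ac:schema-version="1" ac:name="toc">',
  '<ac:structured-macro ac:name="code">',
  "<p>no macro here</p>",
];

// Both patterns should accept and reject the same inputs.
const agreement = samples.map((s) => original.test(s) === tempered.test(s));
console.log(agreement);
```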
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: replace remaining polynomial regexes in confluence.ts with indexOf helpers
The conflict resolution left the original `/<ac:structured-macro [^>]*>([\s\S]*?)
<\/ac:structured-macro>/gi` patterns in place for code blocks and info/tip panels
(those were not part of the original security fix diff). These have the same
O(n²) backtracking problem: with k opening tags and no closing tags, [\s\S]*?
scans O(n - pos) chars per attempt, totalling O(n²).
Replace the entire convertConfluenceStorage function with the indexOf-based
approach (replaceStructuredMacros / replaceTagPairs / extractTagContent helpers)
that eliminates all regex-based tag parsing. Also add removeSelfClosingMacros
to handle the self-closing TOC case without regex, since the previous self-closing
fix still used a [^>]*ac:name="toc"[^>]* pattern.
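A rough sketch of what an indexOf-based tag-pair replacement can look like. The helper name is borrowed from the description above, but the signature and behaviour here are assumptions. The sketch does not handle nested tags, `>` inside quoted attributes, or tag names that merely share the prefix.

```typescript
// Replace each <openPrefix ...>inner</closeTag> with render(inner).
// Every indexOf call only moves forward, and an unclosed tag ends the scan,
// so the total work stays linear in the input length.
function replaceTagPairs(
  input: string,
  openPrefix: string,
  closeTag: string,
  render: (inner: string) => string,
): string {
  let out = "";
  let pos = 0;
  while (pos < input.length) {
    const open = input.indexOf(openPrefix, pos);
    if (open === -1) break;
    const openEnd = input.indexOf(">", open + openPrefix.length);
    if (openEnd === -1) break;
    const close = input.indexOf(closeTag, openEnd + 1);
    if (close === -1) break; // unclosed tag: keep the rest verbatim
    out += input.slice(pos, open) + render(input.slice(openEnd + 1, close));
    pos = close + closeTag.length;
  }
  return out + input.slice(pos);
}

const converted = replaceTagPairs(
  'a<x-note type="tip">hi</x-note>b<x-note>unclosed',
  "<x-note",
  "</x-note>",
  (inner) => `[${inner}]`,
);
console.log(converted); // → a[hi]b<x-note>unclosed
```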
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat: concurrent pack installation and -v verbose shorthand (issue #330) (#339)
* feat: concurrent pack installation and -v verbose shorthand (issue #330)
Add concurrent batch embedding to installPack for significant performance
improvement on large packs, plus CLI ergonomics improvements.
Key changes:
- `InstallOptions.concurrency` (default: 4): controls how many embedBatch
calls run simultaneously; embedding is I/O-bound so parallelism directly
reduces wall-clock installation time
- Refactor installPack to pre-chunk all documents upfront, then use a
semaphore-based scheduler to run up to `concurrency` embedBatch calls
concurrently while inserting completed batches in-order (SQLite requires
serialised writes); progress callbacks fire after each batch as before
- `pack install --concurrency <n>` CLI flag exposes the new option
- `-v` shorthand for `--verbose` on the global program options
- Fix transaction install-count tracking: count committed docs accurately
without relying on subtract-on-failure arithmetic
- Add 6 new tests covering concurrency=1 sequential, concurrency=4 parallel,
multiple embedBatch calls per install, concurrency limit enforcement,
incremental progress reporting, and partial-failure error counting
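The scheduling idea (bounded parallel work, results committed in input order) can be sketched as below. Note that this uses a fixed pool of runners rather than an explicit semaphore, and `runOrdered`, `worker` and `commit` are hypothetical names. In the terms used above, `worker` stands in for an embedBatch call and `commit` for the serialised SQLite insert.

```typescript
// Run `worker` over `items` with at most `limit` calls in flight, and call
// `commit` strictly in input order as soon as each result becomes available.
async function runOrdered<T, R>(
  items: T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
  commit: (result: R, index: number) => void,
): Promise<void> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const done = new Map<number, R>();
  let nextToStart = 0;
  let nextToCommit = 0;

  const runner = async (): Promise<void> => {
    while (nextToStart < items.length) {
      const index = nextToStart++;
      const result = await worker(items[index], index);
      done.set(index, result);
      // Flush every finished result that is next in line.
      while (done.has(nextToCommit)) {
        commit(done.get(nextToCommit) as R, nextToCommit);
        done.delete(nextToCommit);
        nextToCommit++;
      }
    }
  };

  const poolSize = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: poolSize }, runner));
}
```

Rejecting a non-positive `limit` up front avoids the case where no runner ever starts and the caller waits forever.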
https://claude.ai/code/session_019hzhbEgV1ysnGmFVXBWzkZ
* fix: address all 4 Copilot review comments on PR #339
- Validate batchSize, concurrency, resumeFrom at the start of installPack
and throw ValidationError for invalid values (comments 3 & 4). Concurrency
<= 0 would silently hang the semaphore indefinitely.
- Add CLI lower-bound guard: --concurrency < 1 exits with a user-facing
error before ever calling installPack (comment 3).
- Lazy chunking: pre-chunking all documents upfront held chunks for the
entire pack in memory simultaneously. Batches now store only the raw
documents; resolveBatch() chunks on demand right before embedBatch
is called, so only one batch's worth of chunks is in memory at a time
(comment 2).
- Wrap provider.embedBatch() in try/catch so synchronous throws are
converted to rejected Promises rather than escaping scheduleNext() and
leaving the outer Promise permanently pending (comment 1).
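The last point can be illustrated with a small wrapper. `callSafely` is a hypothetical helper, not a function from the codebase.

```typescript
// Call a function that is expected to return a Promise. If it throws
// synchronously instead, turn the throw into a rejected Promise so callers
// only need to handle one failure path.
function callSafely<T>(fn: () => Promise<T>): Promise<T> {
  try {
    return fn();
  } catch (err) {
    return Promise.reject(err);
  }
}

// A provider that throws before returning a Promise.
const brokenEmbed = (): Promise<number[]> => {
  throw new Error("provider misconfigured");
};

callSafely(brokenEmbed).catch((err: Error) => {
  console.log("rejected:", err.message);
});
```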
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude <noreply@anthropic.com>
* fix: address 7 pre-release bugs from audit (#342) (#344)
- Guard JSON.parse in rowToWebhook with try/catch, default to []
- Guard JSON.parse in rowToSavedSearch with try/catch, default to null
- Guard JSON.parse in loadDbConnectorConfig with try/catch, throw ConfigError
- Push dateFrom/dateTo filters into SQL WHERE in listDocuments (before LIMIT)
- Validate negative limit in resolveSelector, throw ValidationError
- Replace manual substring extension parsing with path.extname() in packs.ts
- Verified reporter.ts is already tracked on development (no action needed)
- Added tests for all fixes (corrupted JSON, SQL-level date filtering, negative limit)
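The JSON.parse guards follow a common pattern, sketched here with a hypothetical helper `parseJsonOr`. The real row mappers may differ, and this sketch does not validate the parsed shape.

```typescript
// Parse a JSON column value, returning `fallback` when the stored text is
// missing or corrupted instead of letting the SyntaxError propagate.
function parseJsonOr<T>(raw: string | null, fallback: T): T {
  if (raw === null) return fallback;
  try {
    return JSON.parse(raw) as T;
  } catch {
    return fallback;
  }
}

const events = parseJsonOr<string[]>('["document.created"]', []);
const corrupted = parseJsonOr<string[]>("{not json", []);
console.log(events, corrupted); // → ["document.created"] []
```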
Closes #342
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* feat: URL spidering — crawl linked pages with configurable depth and page limits (#315) (#343)
* feat: URL spidering — crawl linked pages with configurable depth and page limits (#315)
Adds opt-in spidering to URL indexing. A single seed URL can now crawl
and index an entire documentation site or wiki section in one call.
New files:
- src/core/link-extractor.ts: indexOf-based <a href> extraction, relative
URL resolution, fragment stripping, dedup, scheme filtering. No regex.
- src/core/spider.ts: BFS crawl engine with sameDomain, pathPrefix,
excludePatterns (glob), maxPages (hard cap 200), maxDepth (hard cap 5),
10-min total timeout, robots.txt (User-agent: * and libscope), and
1s inter-request delay. Yields SpiderResult per page; returns SpiderStats.
- tests/unit/link-extractor.test.ts: 25 tests covering relative resolution,
dedup, fragment stripping, scheme filtering, attribute order, edge cases.
- tests/unit/spider.test.ts: 20 tests covering BFS order, depth/page limits,
domain + path + pattern filtering, cycle detection, robots.txt, partial
failure recovery, stats, and abortReason reporting.
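A simplified sketch of regex-free link extraction along the lines described for link-extractor.ts. It is not the actual implementation: it only handles quoted `href` values, ignores `>` inside attribute values, and all function names here are illustrative.

```typescript
function isSpace(ch: string | undefined): boolean {
  return ch === " " || ch === "\t" || ch === "\n" || ch === "\r";
}

// Find a quoted href value inside the text of a single tag.
function readHref(tag: string): string | null {
  for (let i = 1; i + 5 < tag.length; i++) {
    if (!isSpace(tag[i - 1])) continue; // skips data-href, aria-href
    if (tag.slice(i, i + 5).toLowerCase() !== "href=") continue;
    const quote = tag[i + 5];
    if (quote !== '"' && quote !== "'") return null; // unquoted: not handled
    const end = tag.indexOf(quote, i + 6);
    return end === -1 ? null : tag.slice(i + 6, end);
  }
  return null;
}

// Collect absolute http(s) links from <a href>, without fragments or duplicates.
function extractLinks(html: string, baseUrl: string): string[] {
  const links = new Set<string>();
  let pos = 0;
  for (;;) {
    const lt = html.indexOf("<", pos);
    if (lt === -1) break;
    const gt = html.indexOf(">", lt + 1);
    if (gt === -1) break;
    pos = gt + 1;
    const tag = html.slice(lt + 1, gt);
    if ((tag[0] !== "a" && tag[0] !== "A") || !isSpace(tag[1])) continue;
    const href = readHref(tag);
    if (href === null) continue;
    let resolved: URL;
    try {
      resolved = new URL(href, baseUrl); // resolves relative URLs
    } catch {
      continue;
    }
    if (resolved.protocol !== "http:" && resolved.protocol !== "https:") continue;
    resolved.hash = ""; // strip the fragment
    links.add(resolved.href);
  }
  return [...links];
}

const found = extractLinks(
  `<a href="/guide#intro">G</a> <a class="x" href='api/'>A</a> <a href="/guide">dup</a>
   <a href="mailto:x@example.com">m</a> <a data-href="/nope">n</a> <abbr title="t">x</abbr>`,
  "https://example.com/docs/",
);
console.log(found);
// → ["https://example.com/guide", "https://example.com/docs/api/"]
```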
Modified:
- src/core/url-fetcher.ts: adds fetchRaw() export returning raw body +
contentType + finalUrl before HTML-to-markdown conversion, so the
spider can extract links from HTML before conversion.
- src/api/routes.ts: POST /api/v1/documents/url now accepts spider, maxPages,
maxDepth, sameDomain, pathPrefix, excludePatterns. Returns { documents,
pagesIndexed, pagesCrawled, pagesSkipped, errors, abortReason }.
- src/mcp/server.ts: submit-document tool gains spider, maxPages, maxDepth,
sameDomain, pathPrefix, excludePatterns parameters.
Safety: all fetched URLs pass through the existing SSRF validation in
fetchRaw() (DNS resolution, private IP blocking, scheme allowlist).
Hard limits (200 pages, depth 5, 10min) cannot be overridden by callers.
robots.txt is fetched once per origin and Disallow rules are honoured.
Individual page failures do not abort the crawl.
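The crawl loop can be pictured with the following sketch. It is a simplification under stated assumptions: `getLinks` is an injected stand-in for "fetch the page and extract its links", and domain/path filtering, robots.txt, the inter-request delay and the total timeout are all omitted. The cap constants mirror the limits listed above.

```typescript
const HARD_MAX_PAGES = 200;
const HARD_MAX_DEPTH = 5;

interface CrawlOptions {
  maxPages: number;
  maxDepth: number;
}

// Breadth-first crawl from `seed`, returning URLs in the order fetched.
async function crawl(
  seed: string,
  getLinks: (url: string) => Promise<string[]>,
  opts: CrawlOptions,
): Promise<string[]> {
  // Caller-supplied limits can lower the caps but never raise them.
  const maxPages = Math.min(opts.maxPages, HARD_MAX_PAGES);
  const maxDepth = Math.min(opts.maxDepth, HARD_MAX_DEPTH);

  const visited = new Set<string>([seed]); // cycle detection
  const queue: Array<{ url: string; depth: number }> = [{ url: seed, depth: 0 }];
  const fetched: string[] = [];

  while (queue.length > 0 && fetched.length < maxPages) {
    const { url, depth } = queue.shift()!;
    let links: string[];
    try {
      links = await getLinks(url);
    } catch {
      continue; // one failed page does not abort the crawl
    }
    fetched.push(url);
    if (depth >= maxDepth) continue; // do not expand beyond the depth limit
    for (const link of links) {
      if (!visited.has(link)) {
        visited.add(link);
        queue.push({ url: link, depth: depth + 1 });
      }
    }
  }
  return fetched;
}
```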
Closes #315
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: resolve CI lint errors in spider implementation
- Remove unnecessary type assertions (routes.ts, mcp/server.ts) —
TypeScript already narrows SpiderResult/SpiderStats from the generator
- Add explicit return type annotation on mock fetchRaw to satisfy
no-unsafe-return rule in spider.test.ts
- Replace .resolves.not.toThrow() with a direct assertion — vitest
.resolves requires a Promise, not an async function
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: address CodeQL security findings in spider/link-extractor
link-extractor.ts (CodeQL #30 — incomplete URL scheme check):
Replace the enumerated scheme blocklist (javascript:, vbscript:, etc.)
with a strict http/https allowlist check on the resolved URL protocol.
An allowlist is exhaustive by definition; a blocklist misses any scheme
it does not enumerate, such as blob: or future additions.
spider.ts (CodeQL #31 — incomplete multi-character sanitization):
Replace the regex-based tag stripper /<[^>]+>/g in extractTitle() with
an indexOf-based stripTags() function. The regex stops at the first >,
which can sit inside a quoted attribute value (e.g. <img alt="a>b">),
leaving partial tag content in the extracted title.
The new implementation walks quoted attribute values explicitly so no
tag content leaks through regardless of its internal structure.
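A sketch of a quote-aware tag stripper in the spirit of the change described above. This is not the actual `stripTags()` from spider.ts. It drops everything after an unterminated `<`, which a production version might treat differently.

```typescript
// Remove tags by scanning to the closing '>' while skipping over quoted
// attribute values, so a '>' inside quotes does not end the tag early.
function stripTags(html: string): string {
  let out = "";
  let i = 0;
  while (i < html.length) {
    const lt = html.indexOf("<", i);
    if (lt === -1) {
      out += html.slice(i);
      break;
    }
    out += html.slice(i, lt);
    let j = lt + 1;
    let quote: string | null = null;
    while (j < html.length) {
      const ch = html[j];
      if (quote !== null) {
        if (ch === quote) quote = null; // closing quote
      } else if (ch === '"' || ch === "'") {
        quote = ch; // opening quote
      } else if (ch === ">") {
        break; // real end of tag
      }
      j++;
    }
    i = j + 1;
  }
  return out;
}

const tricky = 'Hello <img alt="a>b"> world';
console.log(tricky.replace(/<[^>]+>/g, "")); // → Hello b"> world
console.log(stripTags(tricky)); // → Hello  world
```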
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: address all Copilot review comments on spider PR (#343)
- link-extractor: add word-boundary check in extractHref to prevent
matching data-href, aria-href (false positives on non-href attributes)
- spider: rename pagesIndexed → pagesFetched throughout (the SpiderStats
interface already used pagesFetched; implementation and tests now match)
- spider: per-origin robots.txt cache (Map<origin, Set>) fetched lazily
as new origins are encountered during crawl (was seed-only before)
- spider: normalize to raw.finalUrl after redirects — visited set,
yielded URL, and link-extraction base all use the canonical URL
- routes: validate maxPages/maxDepth are finite positive integers
- routes: change conditional spread &&-patterns to ternaries
- routes: remove inner try/catch for spider fetch errors; add FetchError
to top-level handler (consistent with single-URL mode → 502)
- mcp/server: replace conditional spreads with explicit if assignments
- mcp/server: validate spider=true requires url (throws ValidationError)
- openapi: document spider request fields in IndexFromUrlRequest schema,
add SpiderResponse schema, update 201 response to oneOf
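The lazy per-origin robots.txt cache might look roughly like this. `createRobotsChecker` and `loadDisallowed` are hypothetical names, the loader stands in for fetching and parsing robots.txt, and simple prefix matching is an assumption. The sketch caches an array of prefixes per origin rather than a Set.

```typescript
// Returns an isAllowed(url) function that loads the Disallow prefixes for an
// origin the first time that origin is seen, then reuses the cached result.
function createRobotsChecker(loadDisallowed: (origin: string) => Promise<string[]>) {
  const cache = new Map<string, Promise<string[]>>();

  return async function isAllowed(url: string): Promise<boolean> {
    const { origin, pathname } = new URL(url);
    let rules = cache.get(origin);
    if (rules === undefined) {
      rules = loadDisallowed(origin); // lazily fetched per origin
      cache.set(origin, rules);
    }
    const prefixes = await rules;
    // An empty Disallow value means "allow everything", so skip it.
    return !prefixes.some((p) => p !== "" && pathname.startsWith(p));
  };
}
```

Caching the Promise rather than the resolved value means two concurrent checks for the same origin share a single load.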
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* style: fix prettier formatting in spider files
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* docs: comprehensive documentation update for v1.3.0 (#347)
- README: fix license (BUSL-1.1, not MIT), expand MCP tools table to
all 26 tools, expand REST API table with all endpoints (webhooks,
links, analytics, connectors status, suggest-tags, bulk ops), add
webhooks section with HMAC signing example, add missing CLI commands
(bulk ops, saved searches, document links, docs update)
- getting-started: fix Node.js requirement (20, not 18), add sections
for web dashboard, organize/annotate features, REST API
- mcp-setup: expand available tools section to list all 26 tools
grouped by category instead of just 4
- mcp-tools reference: add 5 missing tools — update-document,
suggest-tags, link-documents, get-document-links, delete-link
- rest-api reference: add all missing endpoints, reorganize by category,
add examples for update, bulk retag, webhooks, links, saved searches
- configuration guide: document passthrough LLM provider
- configuration reference: add passthrough LLM, llm.ollamaUrl key,
expand config set examples to cover all settable keys
- cli reference: expand config set supported keys list
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: allow release-please PRs to pass merge gate and trigger CI (#348)
* fix: allow release-please PRs to pass merge gate and trigger CI
Two issues prevented PR #238 from getting CI runs:
1. merge-gate blocked release-please PRs — the gate only allowed
'development' as the source branch, but release-please uses
'release-please--branches--main--components--libscope'. Updated
to allow any branch matching 'release-please--*'.
2. CI never ran on the PR — GitHub does not trigger workflows when
GITHUB_TOKEN creates a PR (intentional security restriction to
prevent infinite loops). Fixed by passing a PAT via secrets.GH_TOKEN
to the release-please action so its PR creation triggers CI.
Note: requires a 'GH_TOKEN' secret in repo settings — a classic PAT
with repo and workflow scopes.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* Prepare for release (#345) (#350)
* chore: add development branch workflow (#327)
* chore: add development branch workflow
- Add merge-gate.yml: enforces only 'development' can merge into main
- Update CI/CodeQL/Docker workflows to run on both main and development
- Update dependabot.yml: target-branch set to development for all ecosystems
- Update copilot-instructions.md: document branch workflow convention
- Rulesets configured: Main (requires merge-gate + squash-only),
Development (requires CI status checks + PR)
- Default branch set to development
- All open PRs retargeted to development
* fix: skip Vercel preview deployments on non-main branches
* chore: trigger check refresh
---------
* feat: create-pack from local folder or URL sources (#329)
* fix: comprehensive audit fixes — security, performance, resilience, API hardening
Addresses findings from issue #314:
- SSRF protection for webhook URLs (CRITICAL)
- Scrub secrets from exports
- Stored XSS prevention on document URL
- O(n²) and N+1 fixes in bulk operations
- Rate limit cache eviction improvement
- SSE backpressure handling
- Replace raw Error() with typed errors
- Fetch timeouts on all network calls
- Input validation on API parameters
- Search query length limit
- Silent catch block logging
- DNS rebinding check fix
- N+1 in Slack user resolution
- Pagination on webhook/search list endpoints
Closes #314
* Address PR review comments: fix SSE listener leak, preserve caller signals, add SSRF validation to webhook test, chunk SQL params, use dynamic import in test
- SSE backpressure: create single disconnect promise, race against drain (no listener accumulation)
- http-utils.ts/onenote.ts: use AbortSignal.any() to combine caller signal with timeout
- Webhook test endpoint: validate URL with validateWebhookUrlSsrf before fetch
- bulk.ts: chunk IN clause to 999 params max (SQLite limit)
- webhooks.test.ts: dynamic import after vi.mock() for deterministic DNS mocking
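The IN-clause chunking can be sketched as follows. The table and column names are placeholders rather than the actual schema in bulk.ts, and the 999 limit is the one quoted above.

```typescript
const SQLITE_MAX_PARAMS = 999;

// Split an array into consecutive slices of at most `size` elements.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Build one parameterised query per chunk so no statement exceeds the
// bound-parameter limit.
function buildInQueries(ids: string[]): Array<{ sql: string; params: string[] }> {
  return chunk(ids, SQLITE_MAX_PARAMS).map((part) => ({
    sql: `SELECT id, title FROM documents WHERE id IN (${part.map(() => "?").join(", ")})`,
    params: part,
  }));
}

const ids = Array.from({ length: 2500 }, (_, i) => `doc-${i}`);
const queries = buildInQueries(ids);
console.log(queries.map((q) => q.params.length)); // → [999, 999, 502]
```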
* ci: consolidate and fix CI/CD workflows
- Merge lint + typecheck into single job (saves one npm ci)
- Add concurrency groups to ci, docker, codeql (cancel stale runs)
- Add dependency-review-action on PRs (block vulnerable deps)
- Add workflow_call trigger to ci.yml for reusability
- Remove duplicate npm publish from release.yml (release-please owns it)
- Fix SDK paths: sdk-go/ → sdk/go/, sdk-python/ → sdk/python/
- Fix Dependabot paths to match actual SDK directories
- Add github-actions ecosystem to Dependabot (keep actions up to date)
* feat: add HTML file parser for .html/.htm document indexing
Adds HtmlParser using the existing node-html-markdown dependency.
Strips <script>, <style>, and <nav> tags before conversion.
Registered for .html and .htm extensions.
Includes 12 tests covering conversion, tag stripping, edge cases.
Closes #317
* fix: address CodeQL and review comments on HTML parser
- Replace regex-based tag stripping with node-html-markdown's native
ignore option — eliminates all 3 CodeQL alerts (incomplete sanitization,
bad HTML filtering regexp)
- Wrap translate() in try/catch, throw ValidationError (consistent with
other parsers)
- Use trimEnd() instead of trim() to preserve leading indentation
- Reuse single NHM instance for efficiency
* Revert "fix: skip Vercel preview deployments on non-main branches"
This reverts commit eb481870c883f77278291b72245b1ca0b890a78c.
* feat: add --from option to pack create for folder/URL sources
Adds createPackFromSource() that builds packs directly from local
folders, files, or URLs without requiring database interaction.
CLI: libscope pack create --name my-pack --from ~/docs/ [--extensions .md,.html] [--exclude pattern] [--no-recursive]
Features:
- Walks directories recursively using registered parsers
- Fetches URLs via fetchAndConvert
- Supports extension filtering, exclude patterns, progress callback
- Multiple --from sources supported
- Output format identical to DB export (pack install works unchanged)
Closes #328
* style: fix prettier formatting
* feat: add gzip support for pack files (.json.gz)
Pack files can now be compressed with gzip for smaller distribution:
- writePackFile/readPackFile auto-detect gzip by extension or magic bytes
- installPack accepts both .json and .json.gz files
- createPackFromSource defaults to .json.gz output (source packs can be large)
- createPack (DB export) still defaults to .json
- Auto-detects gzip by magic bytes even if extension is .json
5 new tests covering gzip write, install, magic byte detection, and round-trip.
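Detection by extension or magic bytes could be sketched like this. The function names are illustrative; the two-byte signature `0x1f 0x8b` is the standard gzip header.

```typescript
// A gzip stream always starts with the two bytes 0x1f 0x8b.
function looksGzipped(bytes: Uint8Array): boolean {
  return bytes.length >= 2 && bytes[0] === 0x1f && bytes[1] === 0x8b;
}

// Treat a pack as gzipped if the name says so or the content does.
function isGzipPack(path: string, head: Uint8Array): boolean {
  return path.toLowerCase().endsWith(".gz") || looksGzipped(head);
}

const gzHead = new Uint8Array([0x1f, 0x8b, 0x08, 0x00]);
const jsonHead = new Uint8Array([0x7b, 0x22]); // '{"'
console.log(isGzipPack("pack.json", gzHead)); // → true
console.log(isGzipPack("pack.json", jsonHead)); // → false
```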
* feat: add progress logging and fix dedup handling in pack install
- Log each document as it's indexed so large installs show progress
- Change pack install to use dedup: 'skip' for graceful duplicate handling
- Make title+content_length dedup check respect the dedup mode setting
(previously it always threw ValidationError regardless of dedup mode)
* feat: auto-generate tags during pack creation and apply on install
- Export tokenize() and add suggestTagsFromText() in tags.ts for DB-free tag generation
- createPackFromSource() now auto-generates tags per document via TF-IDF
- installPack() applies doc.tags via addTagsToDocument() after indexing
---------
* feat: add HTML file parser for .html/.htm document indexing (#318)
---------
* build(deps-dev): Bump eslint-config-prettier from 9.1.2 to 10.1.8 (#325)
Bumps [eslint-config-prettier](https://github.com/prettier/eslint-config-prettier) from 9.1.2 to 10.1.8.
- [Release notes](https://github.com/prettier/eslint-config-prettier/releases)
- [Changelog](https://github.com/prettier/eslint-config-prettier/blob/main/CHANGELOG.md)
- [Commits](https://github.com/prettier/eslint-config-prettier/commits/v10.1.8)
---
updated-dependencies:
- dependency-name: eslint-config-prettier
dependency-version: 10.1.8
dependency-type: direct:development
update-type: version-update:semver-major
...
* build(deps-dev): Bump lint-staged from 16.3.1 to 16.3.2
Bumps the minor-and-patch group with 1 update: [lint-staged](https://github.com/lint-staged/lint-staged).
Updates `lint-staged` from 16.3.1 to 16.3.2
- [Release notes](https://github.com/lint-staged/lint-staged/releases)
- [Changelog](https://github.com/lint-staged/lint-staged/blob/main/CHANGELOG.md)
- [Commits](https://github.com/lint-staged/lint-staged/compare/v16.3.1...v16.3.2)
---
updated-dependencies:
- dependency-name: lint-staged
dependency-version: 16.3.2
dependency-type: direct:development
update-type: version-update:semver-patch
dependency-group: minor-and-patch
...
* build(deps): Bump the actions group with 5 updates
Bumps the actions group with 5 updates:
| Package | From | To |
| --- | --- | --- |
| [actions/checkout](https://github.com/actions/checkout) | `4` | `6` |
| [actions/setup-node](https://github.com/actions/setup-node) | `4` | `6` |
| [actions/upload-artifact](https://github.com/actions/upload-artifact) | `4` | `7` |
| [actions/setup-python](https://github.com/actions/setup-python) | `5` | `6` |
| [actions/setup-go](https://github.com/actions/setup-go) | `5` | `6` |
Updates `actions/checkout` from 4 to 6
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/v4...v6)
Updates `actions/setup-node` from 4 to 6
- [Release notes](https://github.com/actions/setup-node/releases)
- [Commits](https://github.com/actions/setup-node/compare/v4...v6)
Updates `actions/upload-artifact` from 4 to 7
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](https://github.com/actions/upload-artifact/compare/v4...v7)
Updates `actions/setup-python` from 5 to 6
- [Release notes](https://github.com/actions/setup-python/releases)
- [Commits](https://github.com/actions/setup-python/compare/v5...v6)
Updates `actions/setup-go` from 5 to 6
- [Release notes](https://github.com/actions/setup-go/releases)
- [Commits](https://github.com/actions/setup-go/compare/v5...v6)
---
updated-dependencies:
- dependency-name: actions/checkout
dependency-version: '6'
dependency-type: direct:production
update-type: version-update:semver-major
dependency-group: actions
- dependency-name: actions/setup-node
dependency-version: '6'
dependency-type: direct:production
update-type: version-update:semver-major
dependency-group: actions
- dependency-name: actions/upload-artifact
dependency-version: '7'
dependency-type: direct:production
update-type: version-update:semver-major
dependency-group: actions
- dependency-name: actions/setup-python
dependency-version: '6'
dependency-type: direct:production
update-type: version-update:semver-major
dependency-group: actions
- dependency-name: actions/setup-go
dependency-version: '6'
dependency-type: direct:production
update-type: version-update:semver-major
dependency-group: actions
...
* build(deps-dev): Bump @types/node from 22.19.13 to 25.3.3
Bumps [@types/node](https://github.com/DefinitelyTyped/DefinitelyTyped/tree/HEAD/types/node) from 22.19.13 to 25.3.3.
- [Release notes](https://github.com/DefinitelyTyped/DefinitelyTyped/releases)
- [Commits](https://github.com/DefinitelyTyped/DefinitelyTyped/commits/HEAD/types/node)
---
updated-dependencies:
- dependency-name: "@types/node"
dependency-version: 25.3.3
dependency-type: direct:development
update-type: version-update:semver-major
...
* build(deps): Bump better-sqlite3 from 11.10.0 to 12.6.2
Bumps [better-sqlite3](https://github.com/WiseLibs/better-sqlite3) from 11.10.0 to 12.6.2.
- [Release notes](https://github.com/WiseLibs/better-sqlite3/releases)
- [Commits](https://github.com/WiseLibs/better-sqlite3/compare/v11.10.0...v12.6.2)
---
updated-dependencies:
- dependency-name: better-sqlite3
dependency-version: 12.6.2
dependency-type: direct:production
update-type: version-update:semver-major
...
* build(deps-dev): Bump eslint from 9.39.3 to 10.0.2
Bumps [eslint](https://github.com/eslint/eslint) from 9.39.3 to 10.0.2.
- [Release notes](https://github.com/eslint/eslint/releases)
- [Commits](https://github.com/eslint/eslint/compare/v9.39.3...v10.0.2)
---
updated-dependencies:
- dependency-name: eslint
dependency-version: 10.0.2
dependency-type: direct:development
update-type: version-update:semver-major
...
* feat: add passthrough LLM mode for ask-question tool (#335)
* feat: add passthrough LLM mode for ask-question tool
Adds llm.provider = "passthrough" so the ask-question MCP tool returns
retrieved context chunks directly to the calling LLM instead of requiring
a separate OpenAI/Ollama provider. This is the natural design for MCP tools
where the client already has an LLM (e.g. Claude Code).
- config.ts: add "passthrough" to llm.provider union type and env var handling
- rag.ts: add isPassthroughMode() helper and getContextForQuestion() which
retrieves and formats context without an LLM call
- mcp/server.ts: ask-question checks passthrough first and returns context
directly; falls through to existing LLM path otherwise
Enable via config: { "llm": { "provider": "passthrough" } }
Enable via env var: LIBSCOPE_LLM_PROVIDER=passthrough
* fix: format config.ts and include passthrough in provider override
- Reformat long if-condition to satisfy prettier (printWidth: 100)
- Fix logic bug: passthrough provider was checked in outer condition but
not spread into overrides.llm.provider
---------
* fix: address 9 audit findings from issue #332 (#333)
* fix: address 9 audit findings from issue #332
Security
- middleware: use timingSafeEqual for API key comparison (#2)
- url-fetcher, http-utils: replace process-wide NODE_TLS_REJECT_UNAUTHORIZED
mutation with per-request undici Agent to eliminate TLS race condition (#1)
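A minimal sketch of a constant-time key comparison. Hashing both values first is an assumption made here so the buffers have equal length (`timingSafeEqual` throws on a length mismatch); the middleware may handle length differently.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Compare two API keys without leaking, through timing, how many leading
// characters matched. Both inputs are hashed to fixed-length digests first.
function apiKeysMatch(provided: string, expected: string): boolean {
  const a = createHash("sha256").update(provided).digest();
  const b = createHash("sha256").update(expected).digest();
  return timingSafeEqual(a, b);
}

console.log(apiKeysMatch("secret-key", "secret-key")); // → true
console.log(apiKeysMatch("secret", "secret-key")); // → false
```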
Bugs
- indexing: re-throw unexpected embedding errors so transaction rolls back
instead of silently committing chunks with no vector (#3)
- search: replace correlated minRating subquery with avg_r.avg_rating from
the pre-joined aggregate in FTS and LIKE search paths (#4)
Performance
- bulk: replace O(n²) docs.find() loops with pre-built Map; replace
per-document getDocumentTags() calls with a single getDocumentTagsBatch()
query (#5)
- config: add 30-second TTL cache to loadConfig() so disk reads are not
repeated on every request (#6)
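The TTL cache idea, sketched with an injectable clock so it can be tested deterministically. `createTtlCache` is a hypothetical helper, not the actual loadConfig() code.

```typescript
// Cache the result of `load` for `ttlMs` milliseconds. `now` is injectable
// so tests can control time; it defaults to the real clock.
function createTtlCache<T>(load: () => T, ttlMs: number, now: () => number = Date.now) {
  let value: T | undefined;
  let loadedAt = 0;
  let hasValue = false;

  return {
    get(): T {
      if (!hasValue || now() - loadedAt >= ttlMs) {
        value = load();
        loadedAt = now();
        hasValue = true;
      }
      return value as T;
    },
    // Force the next get() to reload, e.g. between tests.
    invalidate(): void {
      hasValue = false;
    },
  };
}

let fakeTime = 0;
let diskReads = 0;
const configCache = createTtlCache(() => ++diskReads, 30000, () => fakeTime);
configCache.get();
fakeTime = 29999;
configCache.get();
console.log(diskReads); // → 1
```

An explicit `invalidate()` is what lets tests that change environment variables avoid reading a stale cached value.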
Code quality
- routes: check res.write() return value to handle SSE backpressure (#7)
- reindex: delegate to schema.createVectorTable() instead of duplicating
the vec0 DDL inline (#8)
- obsidian: replace hand-rolled parseSimpleYaml() with js-yaml, normalise
Date objects back to ISO-8601 strings (#9)
Docs
- agents.md: expand architecture tree to include src/api/ and src/connectors/;
add Security Patterns section with correct undici examples
- CONTRIBUTING.md: fix check-suite command (npm test → npm run test:coverage)
and correct coverage threshold (80% → actual 75%/74%)
Tests
- bulk.test: add dateFrom/dateTo filter coverage
- config.test: add cache-hit test; call invalidateConfigCache() before env-var
tests so TTL cache doesn't return stale results
* fix: remove unused warnIfTlsBypassMissing function
Dead code after conflict resolution chose the per-request undici Agent
approach (which doesn't need a warning about NODE_TLS_REJECT_UNAUTHORIZED).
* fix: update tests for config cache and retry semantics
- Add invalidateConfigCache() before loadConfig() in 4 env-override tests
that were failing because the 30s TTL cache introduced in the config
module was returning stale results from the previous test's cache entry
- Update http-utils retry assertion: maxRetries=2 means 1 initial + 2
retries = 3 total calls (loop is attempt <= maxRetries)
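The retry arithmetic follows from a loop of this shape. `withRetries` is a simplified stand-in for the helper in http-utils, with backoff and error filtering omitted.

```typescript
// Attempt `fn` once, then retry up to `maxRetries` more times.
// With maxRetries = 2 the loop runs for attempt = 0, 1, 2: three calls.
async function withRetries<T>(fn: () => Promise<T>, maxRetries: number): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```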
---------
* feat: CLI logging improvements and pack installation performance (#330) (#336)
- Add `src/cli/reporter.ts`: PrettyReporter (ANSI colors + \r progress
bar), SilentReporter (no-op for verbose/JSON mode), `isVerbose()` and
`createReporter()` factory. `LIBSCOPE_VERBOSE=1` env var alternative.
- Update `setupLogging` to default to "silent" in CLI mode (pretty
reporter handles user-facing output). Verbose/`--log-level` flags still
route to structured JSON pino logs. Fix duplicate `initLogger` calls in
onenote connect/disconnect commands to use `setupLogging` consistently.
- Update `installPack` in `packs.ts` to support batch embedding and
progress reporting:
- New `InstallOptions` interface with `batchSize`, `resumeFrom`,
`onProgress` fields
- Batch documents: chunk all → single `provider.embedBatch` call per
batch → single SQLite transaction per batch (avoids N embedding calls)
- `resumeFrom` skips the first N documents (enables partial install
resume after failure)
- `InstallResult` now includes `errors` count
- Add `--batch-size` and `--resume-from` CLI options to `pack install`
- Tests: `tests/unit/reporter.test.ts` (17 tests covering PrettyReporter,
SilentReporter, isVerbose, env var detection); extended
`tests/unit/packs.test.ts` with 7 new tests for progress callbacks,
batch efficiency, resumeFrom, embedBatch failure handling.
* Claude/fix issue 331 s1qzu (#338)
* build(deps): Bump the npm_and_yarn group across 1 directory with 2 updates (#337)
Bumps the npm_and_yarn group with 2 updates in the / directory: [@hono/node-server](https://github.com/honojs/node-server) and [hono](https://github.com/honojs/hono).
Updates `@hono/node-server` from 1.19.9 to 1.19.10
- [Release notes](https://github.com/honojs/node-server/releases)
- [Commits](https://github.com/honojs/node-server/compare/v1.19.9...v1.19.10)
Updates `hono` from 4.12.3 to 4.12.5
- [Release notes](https://github.com/honojs/hono/releases)
- [Commits](https://github.com/honojs/hono/compare/v4.12.3...v4.12.5)
---
updated-dependencies:
- dependency-name: "@hono/node-server"
dependency-version: 1.19.10
dependency-type: indirect
dependency-group: npm_and_yarn
- dependency-name: hono
dependency-version: 4.12.5
dependency-type: indirect
dependency-group: npm_and_yarn
...
* fix: eliminate SSRF TOCTOU and ReDoS vulnerabilities (#341)
* fix: eliminate SSRF TOCTOU and ReDoS vulnerabilities (closes #340)
**SSRF (CWE-918 — CodeQL alert #28)**
Replace the two-step validate-then-fetch approach in url-fetcher.ts with
IP-pinned requests using node:http / node:https directly. validateUrl()
resolves DNS and checks for private IPs, then the validated IP is passed
straight to the TCP connection (hostname: pinnedIp, servername: original
hostname for TLS SNI). There is now zero TOCTOU window between validation
and the actual network request. The redundant post-fetch DNS rebinding
check and the env-var-based TLS bypass wrapper are removed; rejectUnauthorized
is now passed directly to the request options.
An internal _setRequestImpl hook is exported for unit test injection so
tests can stub responses without touching node:http / node:https.
Tests are updated accordingly.
**ReDoS (CWE-1333 — CodeQL alert #24)**
Five regexes in confluence.ts used the pattern [^>]*ac:name="X"[^>]*,
i.e. two [^>]* quantifiers around a fixed literal. For input that contains a
large attribute blob without the target ac:name value, the engine tries
O(n²) splits before concluding there is no match (polynomial backtracking).
Fix: rewrite the leading [^>]* as (?:(?!ac:name="X")[^>])*. The negative
lookahead prevents the quantifier from overlapping with the literal,
so it can no longer backtrack across it.
* fix: replace remaining polynomial regexes in confluence.ts with indexOf helpers
The conflict resolution left the original `/<ac:structured-macro [^>]*>([\s\S]*?)
<\/ac:structured-macro>/gi` patterns in place for code blocks and info/tip panels
(those were not part of the original security fix diff). These have the same
O(n²) backtracking problem: with k opening tags and no closing tags, [\s\S]*?
scans O(n - pos) chars per attempt, totalling O(n²).
Replace the entire convertConfluenceStorage function with the indexOf-based
approach (replaceStructuredMacros / replaceTagPairs / extractTagContent helpers)
that eliminates all regex-based tag parsing. Also add removeSelfClosingMacros
to handle the self-closing TOC case without regex, since the previous self-closing
fix still used a [^>]*ac:name="toc"[^>]* pattern.
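A rough sketch of what an indexOf-based pair replacement can look like. The signature below is hypothetical and much simpler than the real helpers; it only illustrates why an unmatched opening tag no longer causes rescans.

```typescript
// Hypothetical helper: replace every open...close pair using indexOf only.
// The cursor only moves forward, so total work is linear in the input length.
function replaceTagPairs(
  input: string,
  open: string,
  close: string,
  render: (inner: string) => string,
): string {
  let out = "";
  let pos = 0;
  while (pos < input.length) {
    const start = input.indexOf(open, pos);
    if (start === -1) break;
    const end = input.indexOf(close, start + open.length);
    if (end === -1) break; // no closing tag: keep the remainder verbatim
    out += input.slice(pos, start) + render(input.slice(start + open.length, end));
    pos = end + close.length;
  }
  return out + input.slice(pos);
}
```

With k opening tags and no closing tag, the first failed search for the closing tag ends the loop, instead of one long scan per opening tag.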
---------
* feat: concurrent pack installation and -v verbose shorthand (issue #330) (#339)
* feat: concurrent pack installation and -v verbose shorthand (issue #330)
Add concurrent batch embedding to installPack for significant performance
improvement on large packs, plus CLI ergonomics improvements.
Key changes:
- `InstallOptions.concurrency` (default: 4): controls how many embedBatch
calls run simultaneously; embedding is I/O-bound so parallelism directly
reduces wall-clock installation time
- Refactor installPack to pre-chunk all documents upfront, then use a
semaphore-based scheduler to run up to `concurrency` embedBatch calls
concurrently while inserting completed batches in-order (SQLite requires
serialised writes); progress callbacks fire after each batch as before
- `pack install --concurrency <n>` CLI flag exposes the new option
- `-v` shorthand for `--verbose` on the global program options
- Fix transaction install-count tracking: count committed docs accurately
without relying on subtract-on-failure arithmetic
- Add 6 new tests covering concurrency=1 sequential, concurrency=4 parallel,
multiple embedBatch calls per install, concurrency limit enforcement,
incremental progress reporting, and partial-failure error counting
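The scheduling idea can be sketched as a small worker pool. mapWithConcurrency is a hypothetical name, not the installPack implementation; it keeps results in input order by writing each one to its own index, which stands in for the in-order insert step.

```typescript
// Hypothetical helper: run `worker` over `items` with at most `limit` calls in
// flight, returning results in input order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results = new Array<R>(items.length);
  let next = 0;
  // Each runner claims the next unprocessed index until none are left.
  const runner = async (): Promise<void> => {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i], i);
    }
  };
  const runners = Array.from({ length: Math.min(limit, items.length) }, () => runner());
  await Promise.all(runners);
  return results;
}
```

Rejecting a non-positive limit up front matters here: with zero runners nothing would ever be scheduled.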
https://claude.ai/code/session_019hzhbEgV1ysnGmFVXBWzkZ
* fix: address all 4 Copilot review comments on PR #339
- Validate batchSize, concurrency, resumeFrom at the start of installPack
and throw ValidationError for invalid values (comments 3 & 4). Concurrency
<= 0 would silently hang the semaphore indefinitely.
- Add CLI lower-bound guard: --concurrency < 1 exits with a user-facing
error before ever calling installPack (comment 3).
- Lazy chunking: pre-chunking all documents upfront held chunks for the
entire pack in memory simultaneously. Batches now store only the raw
documents; resolveBatch() chunks on demand right before embedBatch
is called, so only one batch's worth of chunks is in memory at a time
(comment 2).
- Wrap provider.embedBatch() in try/catch so synchronous throws are
converted to rejected Promises rather than escaping scheduleNext() and
leaving the outer Promise permanently pending (comment 1).
---------
* fix: address 7 pre-release bugs from audit (#342) (#344)
- Guard JSON.parse in rowToWebhook with try/catch, default to []
- Guard JSON.parse in rowToSavedSearch with try/catch, default to null
- Guard JSON.parse in loadDbConnectorConfig with try/catch, throw ConfigError
- Push dateFrom/dateTo filters into SQL WHERE in listDocuments (before LIMIT)
- Validate negative limit in resolveSelector, throw ValidationError
- Replace manual substring extension parsing with path.extname() in packs.ts
- Verified reporter.ts is already tracked on development (no action needed)
- Added tests for all fixes (corrupted JSON, SQL-level date filtering, negative limit)
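The guarded-parse pattern from the first two items can be sketched as follows. parseJsonOr is a hypothetical helper name; the real row mappers may inline the try/catch instead.

```typescript
// Hypothetical helper: corrupted JSON in a row yields the caller's fallback
// instead of throwing out of the row mapper.
function parseJsonOr<T>(raw: string | null, fallback: T): T {
  if (raw === null) return fallback;
  try {
    return JSON.parse(raw) as T;
  } catch {
    return fallback;
  }
}
```

For example, a webhook row would use an empty array as the fallback and a saved-search row would use null, matching the defaults listed above.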
Closes #342
* feat: URL spidering — crawl linked pages with configurable depth and page limits (#315) (#343)
* feat: URL spidering — crawl linked pages with configurable depth and page limits (#315)
Adds opt-in spidering to URL indexing. A single seed URL can now crawl
and index an entire documentation site or wiki section in one call.
New files:
- src/core/link-extractor.ts: indexOf-based <a href> extraction, relative
URL resolution, fragment stripping, dedup, scheme filtering. No regex.
- src/core/spider.ts: BFS crawl engine with sameDomain, pathPrefix,
excludePatterns (glob), maxPages (hard cap 200), maxDepth (hard cap 5),
10-min total timeout, robots.txt (User-agent: * and libscope), and
1s inter-request delay. Yields SpiderResult per page; returns SpiderStats.
- tests/unit/link-extractor.test.ts: 25 tests covering relative resolution,
dedup, fragment stripping, scheme filtering, attribute order, edge cases.
- tests/unit/spider.test.ts: 20 tests covering BFS order, depth/page limits,
domain + path + pattern filtering, cycle detection, robots.txt, partial
failure recovery, stats, and abortReason reporting.
Modified:
- src/core/url-fetcher.ts: adds fetchRaw() export returning raw body +
contentType + finalUrl before HTML-to-markdown conversion, so the
spider can extract links from HTML before conversion.
- src/api/routes.ts: POST /api/v1/documents/url now accepts spider, maxPages,
maxDepth, sameDomain, pathPrefix, excludePatterns. Returns { documents,
pagesIndexed, pagesCrawled, pagesSkipped, errors, abortReason }.
- src/mcp/server.ts: submit-document tool gains spider, maxPages, maxDepth,
sameDomain, pathPrefix, excludePatterns parameters.
Safety: all fetched URLs pass through the existing SSRF validation in
fetchRaw() (DNS resolution, private IP blocking, scheme allowlist).
Hard limits (200 pages, depth 5, 10min) cannot be overridden by callers.
robots.txt is fetched once per origin and Disallow rules are honoured.
Individual page failures do not abort the crawl.
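A simplified sketch of the BFS loop with the hard caps, using an in-memory link function in place of real fetching. The function and parameter names are assumptions; the actual spider.ts is asynchronous and also applies domain, path, pattern, robots.txt and timeout checks that are omitted here.

```typescript
const HARD_MAX_PAGES = 200;
const HARD_MAX_DEPTH = 5;

interface Page {
  url: string;
  depth: number;
}

// Hypothetical BFS: `linksOf` stands in for fetching a page and extracting
// its links. Caller limits are clamped to the hard caps.
function* crawl(
  seed: string,
  linksOf: (url: string) => string[],
  maxPages: number,
  maxDepth: number,
): Generator<Page> {
  const pageCap = Math.min(maxPages, HARD_MAX_PAGES);
  const depthCap = Math.min(maxDepth, HARD_MAX_DEPTH);
  const visited = new Set<string>([seed]);
  const queue: Page[] = [{ url: seed, depth: 0 }];
  let fetched = 0;
  while (queue.length > 0 && fetched < pageCap) {
    const page = queue.shift() as Page;
    fetched++;
    yield page;
    if (page.depth >= depthCap) continue; // do not expand past the depth cap
    for (const next of linksOf(page.url)) {
      if (visited.has(next)) continue; // cycle detection
      visited.add(next);
      queue.push({ url: next, depth: page.depth + 1 });
    }
  }
}
```

Marking a URL as visited when it is queued, rather than when it is fetched, keeps the same page from being queued twice by two different parents.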
Closes #315
* fix: resolve CI lint errors in spider implementation
- Remove unnecessary type assertions (routes.ts, mcp/server.ts) —
TypeScript already narrows SpiderResult/SpiderStats from the generator
- Add explicit return type annotation on mock fetchRaw to satisfy
no-unsafe-return rule in spider.test.ts
- Replace .resolves.not.toThrow() with a direct assertion — vitest
.resolves requires a Promise, not an async function
* fix: address CodeQL security findings in spider/link-extractor
link-extractor.ts (CodeQL #30 — incomplete URL scheme check):
Replace the enumerated scheme blocklist (javascript:, vbscript:, etc.)
with a strict http/https allowlist check on the resolved URL protocol.
An allowlist is exhaustive by definition; a blocklist can always miss
obscure schemes such as blob:, or future additions.
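A sketch of the allowlist check on the resolved URL. resolveHttpLink is a hypothetical name; it relies only on the standard URL class.

```typescript
// Hypothetical helper: resolve href against the page URL and accept only
// http/https results. Every other scheme, known or unknown, is rejected.
function resolveHttpLink(href: string, base: string): string | null {
  let resolved: URL;
  try {
    resolved = new URL(href, base);
  } catch {
    return null; // unparseable href
  }
  if (resolved.protocol !== "http:" && resolved.protocol !== "https:") return null;
  resolved.hash = ""; // a fragment does not identify a separate page
  return resolved.href;
}
```

Checking the protocol after resolution means relative links inherit the base scheme and pass, while absolute links with any other scheme fail without being named individually.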
spider.ts (CodeQL #31 — incomplete multi-character sanitization):
Replace the regex-based tag stripper /<[^>]+>/g in extractTitle() with
an indexOf-based stripTags() function. The regex stops at the first >
which can be inside a quoted attribute value (e.g. <img alt="a>b">),
potentially leaving partial tag content in the extracted title.
The new implementation walks quoted attribute values explicitly so no
tag content leaks through regardless of its internal structure.
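An illustrative version of a quote-aware tag stripper follows. It is a sketch under the assumption that input is reasonably well-formed HTML, not the actual stripTags() from spider.ts.

```typescript
// Hypothetical sketch: drop tags by scanning characters, treating '>' inside
// a quoted attribute value as data rather than as the end of the tag.
function stripTags(html: string): string {
  let out = "";
  let i = 0;
  while (i < html.length) {
    const lt = html.indexOf("<", i);
    if (lt === -1) return out + html.slice(i);
    out += html.slice(i, lt);
    let j = lt + 1;
    let quote = "";
    while (j < html.length) {
      const ch = html[j];
      if (quote !== "") {
        if (ch === quote) quote = ""; // closing quote ends the attribute value
      } else if (ch === '"' || ch === "'") {
        quote = ch;
      } else if (ch === ">") {
        break; // real end of tag
      }
      j++;
    }
    i = j + 1; // continue after '>' (or past the end for an unterminated tag)
  }
  return out;
}
```

On the example from the message, the regex leaves the tail of the attribute behind while the scanner removes the whole tag.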
* fix: address all Copilot review comments on spider PR (#343)
- link-extractor: add word-boundary check in extractHref to prevent
matching data-href, aria-href (false positives on non-href attributes)
- spider: rename pagesIndexed → pagesFetched throughout (SpiderStats
interface already used pagesFetched; sync implementation + tests)
- spider: per-origin robots.txt cache (Map<origin, Set>) fetched lazily
as new origins are encountered during crawl (was seed-only before)
- spider: normalize to raw.finalUrl after redirects — visited set,
yielded URL, and link-extraction base all use the canonical URL
- routes: validate maxPages/maxDepth are finite positive integers
- routes: change conditional spread &&-patterns to ternaries
- routes: remove inner try/catch for spider fetch errors; add FetchError
to top-level handler (consistent with single-URL mode → 502)
- mcp/server: replace conditional spreads with explicit if assignments
- mcp/server: validate spider=true requires url (throws ValidationError)
- openapi: document spider request fields in IndexFromUrlRequest schema,
add SpiderResponse schema, update 201 response to oneOf
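The word-boundary check from the first item can be sketched like this. The function below is a simplified, hypothetical extractHref that handles only quoted values inside a single tag's text.

```typescript
// Hypothetical sketch: find a real href attribute inside one tag's text,
// skipping look-alikes such as data-href or aria-href.
function extractHref(tag: string): string | null {
  const lower = tag.toLowerCase();
  let from = 0;
  while (true) {
    const at = lower.indexOf("href", from);
    if (at === -1) return null;
    from = at + 4;
    const before = at === 0 ? " " : lower[at - 1];
    if (before !== " " && before !== "\t" && before !== "\n" && before !== "\r") {
      continue; // "href" is the tail of a longer attribute name
    }
    let i = at + 4;
    while (lower[i] === " ") i++;
    if (lower[i] !== "=") continue;
    i++;
    while (lower[i] === " ") i++;
    const quote = tag[i];
    if (quote !== '"' && quote !== "'") continue; // unquoted values are out of scope here
    const end = tag.indexOf(quote, i + 1);
    return end === -1 ? null : tag.slice(i + 1, end);
  }
}
```

Requiring whitespace before the attribute name is what separates href from data-href; the rest is ordinary attribute parsing.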
* style: fix prettier formatting in spider files
---------
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: always target main branch in release-please, fall back to github.token (#356)
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* chore: add development branch workflow (#327)
* chore: add development branch workflow
- Add merge-gate.yml: enforces only 'development' can merge into main
- Update CI/CodeQL/Docker workflows to run on both main and development
- Update dependabot.yml: target-branch set to development for all ecosystems
- Update copilot-instructions.md: document branch workflow convention
- Rulesets configured: Main (requires merge-gate + squash-only),
Development (requires CI status checks + PR)
- Default branch set to development
- All open PRs retargeted to development
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* fix: skip Vercel preview deployments on non-main branches
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* chore: trigger check refresh
---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* feat: create-pack from local folder or URL sources (#329)
* fix: comprehensive audit fixes — security, performance, resilience, API hardening
Addresses findings from issue #314:
- SSRF protection for webhook URLs (CRITICAL)
- Scrub secrets from exports
- Stored XSS prevention on document URL
- O(n²) and N+1 fixes in bulk operations
- Rate limit cache eviction improvement
- SSE backpressure handling
- Replace raw Error() with typed errors
- Fetch timeouts on all network calls
- Input validation on API parameters
- Search query length limit
- Silent catch block logging
- DNS rebinding check fix
- N+1 in Slack user resolution
- Pagination on webhook/search list endpoints
Closes #314
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Address PR review comments: fix SSE listener leak, preserve caller signals, add SSRF validation to webhook test, chunk SQL params, use dynamic import in test
- SSE backpressure: create single disconnect promise, race against drain (no listener accumulation)
- http-utils.ts/onenote.ts: use AbortSignal.any() to combine caller signal with timeout
- Webhook test endpoint: validate URL with validateWebhookUrlSsrf before fetch
- bulk.ts: chunk IN clause to 999 params max (SQLite limit)
- webhooks.test.ts: dynamic import after vi.mock() for deterministic DNS mocking
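The IN-clause chunking can be sketched as below. The helper names and the documents table in the SQL string are assumptions for illustration, not the code in bulk.ts.

```typescript
// Hypothetical helper: split ids into groups of at most `size` so that each
// IN (?, ?, ...) clause stays within the bound-parameter limit.
function chunk<T>(items: readonly T[], size = 999): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Build one parameterised statement per chunk (table name is illustrative).
function inClauses(ids: readonly string[]): Array<{ sql: string; params: string[] }> {
  return chunk(ids).map((group) => ({
    sql: `SELECT id FROM documents WHERE id IN (${group.map(() => "?").join(", ")})`,
    params: group,
  }));
}
```

Each statement is run separately and the rows are concatenated, so callers see the same result as a single large IN clause.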
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* ci: consolidate and fix CI/CD workflows
- Merge lint + typecheck into single job (saves one npm ci)
- Add concurrency groups to ci, docker, codeql (cancel stale runs)
- Add dependency-review-action on PRs (block vulnerable deps)
- Add workflow_call trigger to ci.yml for reusability
- Remove duplicate npm publish from release.yml (release-please owns it)
- Fix SDK paths: sdk-go/ → sdk/go/, sdk-python/ → sdk/python/
- Fix Dependabot paths to match actual SDK directories
- Add github-actions ecosystem to Dependabot (keep actions up to date)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* feat: add HTML file parser for .html/.htm document indexing
Adds HtmlParser using the existing node-html-markdown dependency.
Strips <script>, <style>, and <nav> tags before conversion.
Registered for .html and .htm extensions.
Includes 12 tests covering conversion, tag stripping, edge cases.
Closes #317
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* fix: address CodeQL and review comments on HTML parser
- Replace regex-based tag stripping with node-html-markdown's native
ignore option — eliminates all 3 CodeQL alerts (incomplete sanitization,
bad HTML filtering regexp)
- Wrap translate() in try/catch, throw ValidationError (consistent with
other parsers)
- Use trimEnd() instead of trim() to preserve leading indentation
- Reuse single NHM instance for efficiency
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Revert "fix: skip Vercel preview deployments on non-main branches"
This reverts commit eb481870c883f77278291b72245b1ca0b890a78c.
* feat: add --from option to pack create for folder/URL sources
Adds createPackFromSource() that builds packs directly from local
folders, files, or URLs without requiring database interaction.
CLI: libscope pack create --name my-pack --from ~/docs/ [--extensions .md,.html] [--exclude pattern] [--no-recursive]
Features:
- Walks directories recursively using registered parsers
- Fetches URLs via fetchAndConvert
- Supports extension filtering, exclude patterns, progress callback
- Multiple --from sources supported
- Output format identical to DB export (pack install works unchanged)
Closes #328
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* style: fix prettier formatting
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* feat: add gzip support for pack files (.json.gz)
Pack files can now be compressed with gzip for smaller distribution:
- writePackFile/readPackFile auto-detect gzip by extension or magic bytes
- installPack accepts both .json and .json.gz files
- createPackFromSource defaults to .json.gz output (source packs can be large)
- createPack (DB export) still defaults to .json
- Auto-detects gzip by magic bytes even if extension is .json
5 new tests covering gzip write, install, magic byte detection, and round-trip.
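A small sketch of the detection logic. The function names are hypothetical; the magic bytes 0x1f 0x8b are the standard gzip header.

```typescript
// gzip streams begin with the two magic bytes 0x1f 0x8b.
function isGzip(bytes: Uint8Array): boolean {
  return bytes.length >= 2 && bytes[0] === 0x1f && bytes[1] === 0x8b;
}

// Hypothetical write-side rule: compress when the target name ends in .gz.
function shouldGzipOnWrite(fileName: string): boolean {
  return fileName.toLowerCase().endsWith(".gz");
}
```

On read, checking the content rather than the name is what lets a gzip file that was saved as .json still install correctly.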
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* feat: add progress logging and fix dedup handling in pack install
- Log each document as it's indexed so large installs show progress
- Change pack install to use dedup: 'skip' for graceful duplicate handling
- Make title+content_length dedup check respect the dedup mode setting
(previously it always threw ValidationError regardless of dedup mode)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* feat: auto-generate tags during pack creation and apply on install
- Export tokenize() and add suggestTagsFromText() in tags.ts for DB-free tag generation
- createPackFromSource() now auto-generates tags per document via TF-IDF
- installPack() applies doc.tags via addTagsToDocument() after indexing
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* feat: add HTML file parser for .html/.htm document indexing (#318)
* feat: add HTML file parser for .html/.htm document indexing
Adds HtmlParser using the existing node-html-markdown dependency.
Strips <script>, <style>, and <nav> tags before conversion.
Registered for .html and .htm extensions.
Includes 12 tests covering conversion, tag stripping, edge cases.
Closes #317
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* fix: address CodeQL and review comments on HTML parser
- Replace regex-based tag stripping with node-html-markdown's native
ignore option — eliminates all 3 CodeQL alerts (incomplete sanitization,
bad HTML filtering regexp)
- Wrap translate() in try/catch, throw ValidationError (consistent with
other parsers)
- Use trimEnd() instead of trim() to preserve leading indentation
- Reuse single NHM instance for efficiency
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Revert "fix: skip Vercel preview deployments on non-main branches"
This reverts commit eb481870c883f77278291b72245b1ca0b890a78c.
---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* build(deps-dev): Bump eslint-config-prettier from 9.1.2 to 10.1.8 (#325)
Bumps [eslint-config-prettier](https://github.com/prettier/eslint-config-prettier) from 9.1.2 to 10.1.8.
- [Release notes](https://github.com/prettier/eslint-config-prettier/releases)
- [Changelog](https://github.com/prettier/eslint-config-prettier/blob/main/CHANGELOG.md)
- [Commits](https://github.com/prettier/eslint-config-prettier/commits/v10.1.8)
---
updated-dependencies:
- dependency-name: eslint-config-prettier
dependency-version: 10.1.8
dependency-type: direct:development
update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* build(deps-dev): Bump lint-staged from 16.3.1 to 16.3.2
Bumps the minor-and-patch group with 1 update: [lint-staged](https://github.com/lint-staged/lint-staged).
Updates `lint-staged` from 16.3.1 to 16.3.2
- [Release notes](https://github.com/lint-staged/lint-staged/releases)
- [Changelog](https://github.com/lint-staged/lint-staged/blob/main/CHANGELOG.md)
- [Commits](https://github.com/lint-staged/lint-staged/compare/v16.3.1...v16.3.2)
---
updated-dependencies:
- dependency-name: lint-staged
dependency-version: 16.3.2
dependency-type: direct:development
update-type: version-update:semver-patch
dependency-group: minor-and-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* build(deps): Bump the actions group with 5 updates
Bumps the actions group with 5 updates:
| Package | From | To |
| --- | --- | --- |
| [actions/checkout](https://github.com/actions/checkout) | `4` | `6` |
| [actions/setup-node](https://github.com/actions/setup-node) | `4` | `6` |
| [actions/upload-artifact](https://github.com/actions/upload-artifact) | `4` | `7` |
| [actions/setup-python](https://github.com/actions/setup-python) | `5` | `6` |
| [actions/setup-go](https://github.com/actions/setup-go) | `5` | `6` |
Updates `actions/checkout` from 4 to 6
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/v4...v6)
Updates `actions/setup-node` from 4 to 6
- [Release notes](https://github.com/actions/setup-node/releases)
- [Commits](https://github.com/actions/setup-node/compare/v4...v6)
Updates `actions/upload-artifact` from 4 to 7
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](https://github.com/actions/upload-artifact/compare/v4...v7)
Updates `actions/setup-python` from 5 to 6
- [Release notes](https://github.com/actions/setup-python/releases)
- [Commits](https://github.com/actions/setup-python/compare/v5...v6)
Updates `actions/setup-go` from 5 to 6
- [Release notes](https://github.com/actions/setup-go/releases)
- [Commits](https://github.com/actions/setup-go/compare/v5...v6)
---
updated-dependencies:
- dependency-name: actions/checkout
dependency-version: '6'
dependency-type: direct:production
update-type: version-update:semver-major
dependency-group: actions
- dependency-name: actions/setup-node
dependency-version: '6'
dependency-type: direct:production
update-type: version-update:semver-major
dependency-group: actions
- dependency-name: actions/upload-artifact
dependency-version: '7'
dependency-type: direct:production
update-type: version-update:semver-major
dependency-group: actions
- dependency-name: actions/setup-python
dependency-version: '6'
dependency-type: direct:production
update-type: version-update:semver-major
dependency-group: actions
- dependency-name: actions/setup-go
dependency-version: '6'
dependency-type: direct:production
update-type: version-update:semver-major
dependency-group: actions
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* build(deps-dev): Bump @types/node from 22.19.13 to 25.3.3
Bumps [@types/node](https://github.com/DefinitelyTyped/DefinitelyTyped/tree/HEAD/types/node) from 22.19.13 to 25.3.3.
- [Release notes](https://github.com/DefinitelyTyped/DefinitelyTyped/releases)
- [Commits](https://github.com/DefinitelyTyped/DefinitelyTyped/commits/HEAD/types/node)
---
updated-dependencies:
- dependency-name: "@types/node"
dependency-version: 25.3.3
dependency-type: direct:development
update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* build(deps): Bump better-sqlite3 from 11.10.0 to 12.6.2
Bumps [better-sqlite3](https://github.com/WiseLibs/better-sqlite3) from 11.10.0 to 12.6.2.
- [Release notes](https://github.com/WiseLibs/better-sqlite3/releases)
- [Commits](https://github.com/WiseLibs/better-sqlite3/compare/v11.10.0...v12.6.2)
---
updated-dependencies:
- dependency-name: better-sqlite3
dependency-version: 12.6.2
dependency-type: direct:production
update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* build(deps-dev): Bump eslint from 9.39.3 to 10.0.2
Bumps [eslint](https://github.com/eslint/eslint) from 9.39.3 to 10.0.2.
- [Release notes](https://github.com/eslint/eslint/releases)
- [Commits](https://github.com/eslint/eslint/compare/v9.39.3...v10.0.2)
---
updated-dependencies:
- dependency-name: eslint
dependency-version: 10.0.2
dependency-type: direct:development
update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Robert DeRienzo <rderienzo@voloridge.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* feat: add passthrough LLM mode for ask-question tool (#335)
* feat: add passthrough LLM mode for ask-question tool
Adds llm.provider = "passthrough" so the ask-question MCP tool returns
retrieved context chunks directly to the calling LLM instead of requiring
a separate OpenAI/Ollama provider. This is the natural design for MCP tools
where the client already has an LLM (e.g. Claude Code).
- config.ts: add "passthrough" to llm.provider union type and env var handling
- rag.ts: add isPassthroughMode() helper and getContextForQuestion() which
retrieves and formats context without an LLM call
- mcp/server.ts: ask-question checks passthrough first and returns context
directly; falls through to existing LLM path otherwise
Enable via config: { "llm": { "provider": "passthrough" } }
Enable via env var: LIBSCOPE_LLM_PROVIDER=passthrough
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: format config.ts and include passthrough in provider override
- Reformat long if-condition to satisfy prettier (printWidth: 100)
- Fix logic bug: passthrough provider was checked in outer condition but
not spread into overrides.llm.provider
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: address 9 audit findings from issue #332 (#333)
* fix: address 9 audit findings from issue #332
Security
- middleware: use timingSafeEqual for API key comparison (#2)
- url-fetcher, http-utils: replace process-wide NODE_TLS_REJECT_UNAUTHORIZED
mutation with per-request undici Agent to eliminate TLS race condition (#1)
Bugs
- indexing: re-throw unexpected embedding errors so transaction rolls back
instead of silently committing chunks with no vector (#3)
- search: replace correlated minRating subquery with avg_r.avg_rating from
the pre-joined aggregate in FTS and LIKE search paths (#4)
Performance
- bulk: replace O(n²) docs.find() loops with pre-built Map; replace
per-document getDocumentTags() calls with a single getDocumentTagsBatch()
query (#5)
- config: add 30-second TTL cache to loadConfig() so disk reads are not
repeated on every request (#6)
Code quality
- routes: check res.write() return value to handle SSE backpressure (#7)
- reindex: delegate to schema.createVectorTable() instead of duplicating
the vec0 DDL inline (#8)
- obsidian: replace hand-rolled parseSimpleYaml() with js-yaml, normalise
Date objects back to ISO-8601 strings (#9)
Docs
- agents.md: expand architecture tree to include src/api/ and src/connectors/;
add Security Patterns section with correct undici examples
- CONTRIBUTING.md: fix check-suite command (npm test → npm run test:coverage)
and correct coverage threshold (80% → actual 75%/74%)
Tests
- bulk.test: add dateFrom/dateTo filter coverage
- config.test: add cache-hit test; call invalidateConfigCache() before env-var
tests so TTL cache doesn't return stale results
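For the API key comparison in the Security section, one common way to use timingSafeEqual is sketched here. The helper name is hypothetical, and hashing both sides first is an assumption made to satisfy the equal-length requirement; the middleware may handle length differently.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Hypothetical helper: hash both inputs so the buffers always have equal
// length (timingSafeEqual throws on a length mismatch), then compare.
function apiKeysMatch(provided: string, expected: string): boolean {
  const a = createHash("sha256").update(provided).digest();
  const b = createHash("sha256").update(expected).digest();
  return timingSafeEqual(a, b);
}
```

The comparison time then does not depend on how many leading characters of the provided key happen to be correct.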
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: remove unused warnIfTlsBypassMissing function
Dead code after conflict resolution chose the per-request undici Agent
approach (which doesn't need a warning about NODE_TLS_REJECT_UNAUTHORIZED).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: update tests for config cache and retry semantics
- Add invalidateConfigCache() before loadConfig() in 4 env-override tests
that were failing because the 30s TTL cache introduced in the config
module was returning stale results from the previous test's cache entry
- Update http-utils retry assertion: maxRetries=2 means 1 initial + 2
retries = 3 total calls (loop is attempt <= maxRetries)
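The call count follows directly from the loop bound. A synchronous sketch, with callWithRetry as a hypothetical stand-in for the http-utils retry loop (the real one is asynchronous and may add backoff):

```typescript
// Hypothetical retry loop using the `attempt <= maxRetries` bound: attempt 0
// is the initial call, attempts 1..maxRetries are the retries.
function callWithRetry<T>(fn: () => T, maxRetries: number): T {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return fn();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```

With maxRetries=2 the loop body runs for attempt 0, 1 and 2, which is the three calls the updated assertion expects.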
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat: CLI logging improvements and pack installation performance (#330) (#336)
- Add `src/cli/reporter.ts`: PrettyReporter (ANSI colors + \r progress
bar), SilentReporter (no-op for verbose/JSON mode), `isVerbose()` and
`createReporter()` factory. `LIBSCOPE_VERBOSE=1` env var alternative.
- Update `setupLogging` to default to "silent" in CLI mode (pretty
reporter handles user-facing output). Verbose/`--log-level` flags still
route to structured JSON pino logs. Fix duplicate `initLogger` calls in
onenote connect/disconnect commands to use `setupLogging` consistently.
- Update `installPack` in `packs.ts` to support batch embedding and
progress reporting:
- New `InstallOptions` interface with `batchSize`, `resumeFrom`,
`onProgress` fields
- Batch documents: chunk all → single `provider.embedBatch` call per
batch → single SQLite transaction per batch (avoids N embedding calls)
- `resumeFrom` skips the first N documents (enables partial install
resume after failure)
- `InstallResult` now includes `errors` count
- Add `--batch-size` and `--resume-from` CLI options to `pack install`
- Tests: `tests/unit/reporter.test.ts` (17 tests covering PrettyReporter,
SilentReporter, isVerbose, env var detection); extended
`tests/unit/packs.test.ts` with 7 new tests for progress callbacks,
batch efficiency, resumeFrom, embedBatch failure handling.
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* Claude/fix issue 331 s1qzu (#338)
* build(deps): Bump the npm_and_yarn group across 1 directory with 2 updates (#337)
Bumps the npm_and_yarn group with 2 updates in the / directory: [@hono/node-server](https://github.com/honojs/node-server) and [hono](https://github.com/honojs/hono).
Updates `@hono/node-server` from 1.19.9 to 1.19.10
- [Release notes](https://github.com/honojs/node-server/releases)
- [Commits](https://github.com/honojs/node-server/compare/v1.19.9...v1.19.10)
Updates `hono` from 4.12.3 to 4.12.5
- [Release notes](https://github.com/honojs/hono/releases)
- [Commits](https://github.com/honojs/hono/compare/v4.12.3...v4.12.5)
---
updated-dependencies:
- dependency-name: "@hono/node-server"
dependency-version: 1.19.10
dependency-type: indirect
dependency-group: npm_and_yarn
- dependency-name: hono
dependency-version: 4.12.5
dependency-type: indirect
dependency-group: npm_and_yarn
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* fix: eliminate SSRF TOCTOU and ReDoS vulnerabilities (#341)
* fix: eliminate SSRF TOCTOU and ReDoS vulnerabilities (closes #340)
**SSRF (CWE-918 — CodeQL alert #28)**
Replace the two-step validate-then-fetch approach in url-fetcher.ts with
IP-pinned requests using node:http / node:https directly. validateUrl()
resolves DNS and checks for private IPs, then the validated IP is passed
straight to the TCP connection (hostname: pinnedIp, servername: original
hostname for TLS SNI). There is now zero TOCTOU window between validation
and the actual network request. The redundant post-fetch DNS rebinding
check and the env-var-based TLS bypass wrapper are removed; rejectUnauthorized
is now passed directly to the request options.
An internal _setRequestImpl hook is exported for unit test injection so
tests can stub responses without touching node:http / node:https.
Tests are updated accordingly.
**ReDoS (CWE-1333 — CodeQL alert #24)**
Five regexes in confluence.ts used the pattern [^>]*ac:name="X"[^>]* —
two [^>]* quantifiers around a fixed literal. For input that contains a
large attribute blob without the target ac:name value, the engine must try
all O(n²) splits before concluding no match (polynomial backtracking).
Fix: rewrite the leading [^>]* as (?:(?!ac:name="X")[^>])* — a negative
lookahead prevents the quantifier from overlapping with the literal,
making backtracking structurally impossible.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: replace remaining polynomial regexes in confluence.ts with indexOf helpers
The conflict resolution left the original `/<ac:structured-macro [^>]*>([\s\S]*?)
<\/ac:structured-macro>/gi` patterns in place for code blocks and info/tip panels
(those were not part of the original security fix diff). These have the same
O(n²) backtracking problem: with k opening tags and no closing tags, [\s\S]*?
scans O(n - pos) chars per attempt, totalling O(n²).
Replace the entire convertConfluenceStorage function with the indexOf-based
approach (replaceStructuredMacros / replaceTagPairs / extractTagContent helpers)
that eliminates all regex-based tag parsing. Also add removeSelfClosingMacros
to handle the self-closing TOC case without regex, since the previous self-closing
fix still used a [^>]*ac:name="toc"[^>]* pattern.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat: concurrent pack installation and -v verbose shorthand (issue #330) (#339)
* feat: concurrent pack installation and -v verbose shorthand (issue #330)
Add concurrent batch embedding to installPack for significant performance
improvement on large packs, plus CLI ergonomics improvements.
Key changes:
- `InstallOptions.concurrency` (default: 4): controls how many embedBatch
calls run simultaneously; embedding is I/O-bound so parallelism directly
reduces wall-clock installation time
- Refactor installPack to pre-chunk all documents upfront, then use a
semaphore-based scheduler to run up to `concurrency` embedBatch calls
concurrently while inserting completed batches in-order (SQLite requires
serialised writes); progress callbacks fire after each batch as before
- `pack install --concurrency <n>` CLI flag exposes the new option
- `-v` shorthand for `--verbose` on the global program options
- Fix transaction install-count tracking: count committed docs accurately
without relying on subtract-on-failure arithmetic
- Add 6 new tests covering concurrency=1 sequential, concurrency=4 parallel,
multiple embedBatch calls per install, concurrency limit enforcement,
incremental progress reporting, and partial-failure error counting
https://claude.ai/code/session_019hzhbEgV1ysnGmFVXBWzkZ
* fix: address all 4 Copilot review comments on PR #339
- Validate batchSize, concurrency, resumeFrom at the start of installPack
and throw ValidationError for invalid values (comments 3 & 4). Concurrency
<= 0 would silently hang the semaphore indefinitely.
- Add CLI lower-bound guard: --concurrency < 1 exits with a user-facing
error before ever calling installPack (comment 3).
- Lazy chunking: pre-chunking all documents upfront held chunks for the
entire pack in memory simultaneously. Batches now store only the raw
documents; resolveBatch() chunks on demand right before embedBatch
is called, so only one batch's worth of chunks is in memory at a time
(comment 2).
- Wrap provider.embedBatch() in try/catch so synchronous throws are
converted to rejected Promises rather than escaping scheduleNext() and
leaving the outer Promise permanently pending (comment 1).
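A minimal sketch of the two guards described above, under assumed names (`InstallOptionsSketch`, `validateOptions`, and `callAsPromise` are hypothetical, and the local `ValidationError` stands in for the project's own class):

```typescript
class ValidationError extends Error {}

// Assumed shape; the real InstallOptions may carry more fields.
interface InstallOptionsSketch {
  batchSize: number;
  concurrency: number;
  resumeFrom: number;
}

// Reject bad values up front so a concurrency of 0 can never stall the scheduler.
function validateOptions(o: InstallOptionsSketch): void {
  if (!Number.isInteger(o.batchSize) || o.batchSize < 1) {
    throw new ValidationError("batchSize must be a positive integer");
  }
  if (!Number.isInteger(o.concurrency) || o.concurrency < 1) {
    throw new ValidationError("concurrency must be a positive integer");
  }
  if (!Number.isInteger(o.resumeFrom) || o.resumeFrom < 0) {
    throw new ValidationError("resumeFrom must be a non-negative integer");
  }
}

// Convert a synchronous throw into a rejected Promise so callers only
// ever have to handle one failure channel.
function callAsPromise<T>(fn: () => Promise<T>): Promise<T> {
  try {
    return fn();
  } catch (err) {
    return Promise.reject(err);
  }
}
```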
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude <noreply@anthropic.com>
* fix: address 7 pre-release bugs from audit (#342) (#344)
- Guard JSON.parse in rowToWebhook with try/catch, default to []
- Guard JSON.parse in rowToSavedSearch with try/catch, default to null
- Guard JSON.parse in loadDbConnectorConfig with try/catch, throw ConfigError
- Push dateFrom/dateTo filters into SQL WHERE in listDocuments (before LIMIT)
- Validate negative limit in resolveSelector, throw ValidationError
- Replace manual substring extension parsing with path.extname() in packs.ts
- Verified reporter.ts is already tracked on development (no action needed)
- Added tests for all fixes (corrupted JSON, SQL-level date filtering, negative limit)
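The JSON.parse guards above follow a common pattern, sketched here with a hypothetical `parseJsonOr` helper (not a function from the codebase). Per the list above, the loadDbConnectorConfig variant throws ConfigError rather than falling back.

```typescript
// Hypothetical helper: parse a stored JSON column, returning `fallback`
// when the value is missing or corrupted instead of throwing.
function parseJsonOr<T>(raw: string | null, fallback: T): T {
  if (raw === null) return fallback;
  try {
    return JSON.parse(raw) as T;
  } catch {
    return fallback;
  }
}
```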
Closes #342
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* feat: URL spidering — crawl linked pages with configurable depth and page limits (#315) (#343)
* feat: URL spidering — crawl linked pages with configurable depth and page limits (#315)
Adds opt-in spidering to URL indexing. A single seed URL can now crawl
and index an entire documentation site or wiki section in one call.
New files:
- src/core/link-extractor.ts: indexOf-based <a href> extraction, relative
URL resolution, fragment stripping, dedup, scheme filtering. No regex.
- src/core/spider.ts: BFS crawl engine with sameDomain, pathPrefix,
excludePatterns (glob), maxPages (hard cap 200), maxDepth (hard cap 5),
10-min total timeout, robots.txt (User-agent: * and libscope), and
1s inter-request delay. Yields SpiderResult per page; returns SpiderStats.
- tests/unit/link-extractor.test.ts: 25 tests covering relative resolution,
dedup, fragment stripping, scheme filtering, attribute order, edge cases.
- tests/unit/spider.test.ts: 20 tests covering BFS order, depth/page limits,
domain + path + pattern filtering, cycle detection, robots.txt, partial
failure recovery, stats, and abortReason reporting.
Modified:
- src/core/url-fetcher.ts: adds fetchRaw() export returning raw body +
contentType + finalUrl before HTML-to-markdown conversion, so the
spider can extract links from HTML before conversion.
- src/api/routes.ts: POST /api/v1/documents/url now accepts spider, maxPages,
maxDepth, sameDomain, pathPrefix, excludePatterns. Returns { documents,
pagesIndexed, pagesCrawled, pagesSkipped, errors, abortReason }.
- src/mcp/server.ts: submit-document tool gains spider, maxPages, maxDepth,
sameDomain, pathPrefix, excludePatterns parameters.
Safety: all fetched URLs pass through the existing SSRF validation in
fetchRaw() (DNS resolution, private IP blocking, scheme allowlist).
Hard limits (200 pages, depth 5, 10min) cannot be overridden by callers.
robots.txt is fetched once per origin and Disallow rules are honoured.
Individual page failures do not abort the crawl.
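For illustration, the BFS traversal with capped limits might look like the sketch below. It walks an in-memory link graph instead of fetching pages, and `bfsCrawl`, `LinkGraph`, and the convention that the seed sits at depth 0 are assumptions, not the spider.ts implementation.

```typescript
type LinkGraph = Record<string, string[]>;

const HARD_MAX_PAGES = 200; // caps quoted in the commit message
const HARD_MAX_DEPTH = 5;

// Breadth-first walk: returns URLs in the order they would be fetched.
function bfsCrawl(seed: string, links: LinkGraph, maxPages: number, maxDepth: number): string[] {
  const pageCap = Math.min(maxPages, HARD_MAX_PAGES); // callers cannot exceed the hard caps
  const depthCap = Math.min(maxDepth, HARD_MAX_DEPTH);
  const visited = new Set<string>([seed]); // cycle detection
  const queue: Array<{ url: string; depth: number }> = [{ url: seed, depth: 0 }];
  const fetched: string[] = [];

  for (let head = 0; head < queue.length && fetched.length < pageCap; head++) {
    const { url, depth } = queue[head];
    fetched.push(url);
    if (depth >= depthCap) continue; // do not expand links beyond the depth limit
    for (const next of links[url] ?? []) {
      if (!visited.has(next)) {
        visited.add(next);
        queue.push({ url: next, depth: depth + 1 });
      }
    }
  }
  return fetched;
}
```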
Closes #315
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: resolve CI lint errors in spider implementation
- Remove unnecessary type assertions (routes.ts, mcp/server.ts) —
TypeScript already narrows SpiderResult/SpiderStats from the generator
- Add explicit return type annotation on mock fetchRaw to satisfy
no-unsafe-return rule in spider.test.ts
- Replace .resolves.not.toThrow() with a direct assertion — vitest
.resolves requires a Promise, not an async function
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: address CodeQL security findings in spider/link-extractor
link-extractor.ts (CodeQL #30 — incomplete URL scheme check):
Replace the enumerated scheme blocklist (javascript:, vbscript:, etc.)
with a strict http/https allowlist check on the resolved URL protocol.
An allowlist rejects every scheme it does not name; a blocklist will always miss
some scheme, such as blob: or a future addition.
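A sketch of the allowlist check using the standard URL class. `resolveLink` is a hypothetical name and the real link-extractor may differ in detail.

```typescript
// Resolve `href` against `base`, strip the fragment, and accept only
// http/https. Returns null for anything else or for unparseable input.
function resolveLink(href: string, base: string): string | null {
  let url: URL;
  try {
    url = new URL(href, base);
  } catch {
    return null;
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return null;
  url.hash = ""; // fragment stripping, so #section variants dedupe to one URL
  return url.href;
}
```

Checking the protocol after resolution means relative links inherit the base scheme, while javascript:, mailto:, and similar schemes are rejected without being listed.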
spider.ts (CodeQL #31 — incomplete multi-character sanitization):
Replace the regex-based tag stripper /<[^>]+>/g in extractTitle() with
an indexOf-based stripTags() function. The regex stops at the first >,
which can be inside a quoted attribute value (e.g. <img alt="a>b">),
potentially leaving partial tag content in the extracted title.
The new implementation walks quoted attribute values explicitly so no
tag content leaks through regardless of its internal structure.
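A rough sketch of a quote-aware tag stripper in the spirit described above. It reuses the name `stripTags` for readability, but the body is an assumption rather than the code in spider.ts.

```typescript
// Remove <...> tags, treating > inside a quoted attribute value as part of the tag.
function stripTags(html: string): string {
  let out = "";
  let i = 0;
  while (i < html.length) {
    const lt = html.indexOf("<", i);
    if (lt === -1) {
      out += html.slice(i);
      break;
    }
    out += html.slice(i, lt);
    // Walk to the closing > of this tag, skipping over quoted values.
    let j = lt + 1;
    let quote: string | null = null;
    while (j < html.length) {
      const ch = html.charAt(j);
      if (quote !== null) {
        if (ch === quote) quote = null;
      } else if (ch === '"' || ch === "'") {
        quote = ch;
      } else if (ch === ">") {
        break;
      }
      j++;
    }
    i = j + 1; // an unterminated tag swallows the rest of the input
  }
  return out;
}
```

One caveat of this sketch: a bare < in running text (for example "a < b") is treated as the start of a tag.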
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: address all Copilot review comments on spider PR (#343)
- link-extractor: add word-boundary check in extractHref to prevent
matching data-href, aria-href (false positives on non-href attributes)
- spider: rename pagesIndexed → pagesFetched throughout (SpiderStats
interface already used pagesFetched; sync implementation + tests)
- spider: per-origin robots.txt cache (Map<origin, Set>) fetched lazily
as new origins are encountered during crawl (was seed-only before)
- spider: normalize to raw.finalUrl after redirects — visited set,
yielded URL, and link-extraction base all use the canonical URL
- routes: validate maxPages/maxDepth are finite positive integers
- routes: change conditional spread &&-patterns to ternaries
- routes: remove inner try/catch for spider fetch errors; add FetchError
to top-level handler (consistent with single-URL mode → 502)
- mcp/server: replace conditional spreads with explicit if assignments
- mcp/server: validate spider=true requires url (throws ValidationError)
- openapi: document spider request fields in IndexFromUrlRequest schema,
add SpiderResponse schema, update 201 response to oneOf
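The word-boundary check from the first item could be sketched as below. `findHrefAttr` is a hypothetical helper, not the actual extractHref.

```typescript
// Return the index of a real href attribute inside a tag, or -1.
// "href" preceded by a letter or hyphen (data-href, aria-href) is skipped.
function findHrefAttr(tag: string): number {
  const lower = tag.toLowerCase();
  let from = 0;
  for (;;) {
    const at = lower.indexOf("href", from);
    if (at === -1) return -1;
    const prev = at === 0 ? " " : lower.charAt(at - 1);
    const startsWord =
      prev === " " || prev === "\t" || prev === "\n" || prev === "\r" || prev === "/";
    // Allow optional whitespace between the name and the equals sign.
    let after = at + 4;
    while (after < lower.length && " \t\r\n".includes(lower.charAt(after))) after++;
    if (startsWord && lower.charAt(after) === "=") return at;
    from = at + 4;
  }
}
```

This sketch does not track quotes, so text such as ` href=` inside another attribute's value would still match.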
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* style: fix prettier formatting in spider files
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* docs: comprehensive documentation update for v1.3.0 (#347)
- README: fix license (BUSL-1.1, not MIT), expand MCP tools table to
all 26 tools, expand REST API table with all endpoints (webhooks,
links, analytics, connectors status, suggest-tags, bulk ops), add
webhooks section with HMAC signing example, add missing CLI commands
(bulk ops, saved searches, document links, docs update)
- getting-started: fix Node.js requirement (20, not 18), add sections
for web dashboard, organize/annotate features, REST API
- mcp-setup: expand available tools section to list all 26 tools
grouped by category instead of just 4
- mcp-tools reference: add 5 missing tools — update-document,
suggest-tags, link-documents, get-document-links, delete-link
- rest-api reference: add all missing endpoints, reorganize by category,
add examples for update, bulk retag, webhooks, links, saved searches
- configuration guide: document passthrough LLM provider
- configuration reference: add passthrough LLM, llm.ollamaUrl key,
expand config set examples to cover all settable keys
- cli reference: expand config set supported keys list
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: allow release-please PRs to pass merge gate and trigger CI (#348)
* docs: comprehensive documentation update for v1.3.0
- README: fix license (BUSL-1.1, not MIT), expand MCP tools table to
all 26 tools, expand REST API table with all endpoints (webhooks,
links, analytics, connectors status, suggest-tags, bulk ops), add
webhooks section with HMAC signing example, add missing CLI commands
(bulk ops, saved searches, document links, docs update)
- getting-started: fix Node.js requirement (20, not 18), add sections
for web dashboard, organize/annotate features, REST API
- mcp-setup: expand available tools section to list all 26 tools
grouped by category instead of just 4
- mcp-tools reference: add 5 missing tools — update-document,
suggest-tags, link-documents, get-document-links, delete-link
- rest-api reference: add all missing endpoints, reorganize by category,
add examples for update, bulk retag, webhooks, links, saved searches
- configuration guide: document passthrough LLM provider
- configuration reference: add passthrough LLM, llm.ollamaUrl key,
expand config set examples to cover all settable keys
- cli reference: expand config set supported keys list
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: allow release-please PRs to pass merge gate and trigger CI
Two issues prevented PR #238 from getting CI runs:
1. merge-gate blocked release-please PRs — the gate only allowed
'development' as the source branch, but release-please uses
'release-please--branches--main--components--libscope'. Updated
to allow any branch matching 'release-please--*'.
2. CI never ran on the PR — GitHub does not trigger workflows when
GITHUB_TOKEN creates a PR (intentional security restriction to
prevent infinite loops). Fixed by passing a PAT via secrets.GH_TOKEN
to the release-please action so its PR creation triggers CI.
Note: requires a 'GH_TOKEN' secret in repo settings — a classic PAT
with repo and workflow scopes.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* Prepare for release (#345) (#350)
* chore: add development branch workflow (#327)
* chore: add development branch workflow
- Add merge-gate.yml: enforces only 'development' can merge into main
- Update CI/CodeQL/Docker workflows to run on both main and development
- Update dependabot.yml: target-branch set to development for all ecosystems
- Update copilot-instructions.md: document branch workflow convention
- Rulesets configured: Main (requires merge-gate + squash-only),
Development (requires CI status checks + PR)
- Default branch set to development
- All open PRs retargeted to development
* fix: skip Vercel preview deployments on non-main branches
* chore: trigger check refresh
---------
* feat: create-pack from local folder or URL sources (#329)
* fix: comprehensive audit fixes — security, performance, resilience, API hardening
Addresses findings from issue #314:
- SSRF protection for webhook URLs (CRITICAL)
- Scrub secrets from exports
- Stored XSS prevention on document URL
- O(n²) and N+1 fixes in bulk operations
- Rate limit cache eviction improvement
- SSE backpressure handling
- Replace raw Error() with typed errors
- Fetch timeouts on all network calls
- Input validation on API parameters
- Search query length limit
- Silent catch block logging
- DNS rebinding check fix
- N+1 in Slack user resolution
- Pagination on webhook/search list endpoints
Closes #314
* Address PR review comments: fix SSE listener leak, preserve caller signals, add SSRF validation to webhook test, chunk SQL params, use dynamic import in test
- SSE backpressure: create single disconnect promise, race against drain (no listener accumulation)
- http-utils.ts/onenote.ts: use AbortSignal.any() to combine caller signal with timeout
- Webhook test endpoint: validate URL with validateWebhookUrlSsrf before fetch
- bulk.ts: chunk IN clause to 999 params max (SQLite limit)
- webhooks.test.ts: dynamic import after vi.mock() for deterministic DNS mocking
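The IN-clause chunking from the bulk.ts item can be illustrated as follows. The helper names are hypothetical; 999 is the limit quoted above.

```typescript
const MAX_SQL_PARAMS = 999; // limit cited in the commit message

// Split a list of bound values into groups that fit one statement each.
function chunk<T>(items: readonly T[], size: number = MAX_SQL_PARAMS): T[][] {
  if (!Number.isInteger(size) || size < 1) throw new RangeError("size must be >= 1");
  const groups: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    groups.push(items.slice(i, i + size));
  }
  return groups;
}

// Build the "?, ?, ?" placeholder list for one group.
function placeholders(count: number): string {
  return Array.from({ length: count }, () => "?").join(", ");
}
```

Each group then runs as its own `... WHERE id IN (${placeholders(group.length)})` statement and the results are merged.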
* ci: consolidate and fix CI/CD workflows
- Merge lint + typecheck into single job (saves one npm ci)
- Add concurrency groups to ci, docker, codeql (cancel stale runs)
- Add dependency-review-action on PRs (block vulnerable deps)
- Add workflow_call trigger to ci.yml for reusability
- Remove duplicate npm publish from release.yml (release-please owns it)
- Fix SDK paths: sdk-go/ → sdk/go/, sdk-python/ → sdk/python/
- Fix Dependabot paths to match actual SDK directories
- Add github-actions ecosystem to Dependabot (keep actions up to date)
* feat: add HTML file parser for .html/.htm document indexing
Adds HtmlParser using the existing node-html-markdown dependency.
Strips <script>, <style>, and <nav> tags before conversion.
Registered for .html and .htm extensions.
Includes 12 tests covering conversion, tag stripping, edge cases.
Closes #317
* fix: address CodeQL and review comments on HTML parser
- Replace regex-based tag stripping with node-html-markdown's native
ignore option — eliminates all 3 CodeQL alerts (incomplete sanitization,
bad HTML filtering regexp)
- Wrap translate() in try/catch, throw ValidationError (consistent with
other parsers)
- Use trimEnd() instead of trim() to preserve leading indentation
- Reuse single NHM instance for efficiency
* Revert "fix: skip Vercel preview deployments on non-main branches"
This reverts commit eb481870c883f77278291b72245b1ca0b890a78c.
* feat: add --from option to pack create for folder/URL sources
Adds createPackFromSource() that builds packs directly from local
folders, files, or URLs without requiring database interaction.
CLI: libscope pack create --name my-pack --from ~/docs/ [--extensions .md,.html] [--exclude pattern] [--no-recursive]
Features:
- Walks directories recursively using registered parsers
- Fetches URLs via fetchAndConvert
- Supports extension filtering, exclude patterns, progress callback
- Multiple --from sources supported
- Output format identical to DB export (pack install works unchanged)
Closes #328
* style: fix prettier formatting
* feat: add gzip support for pack files (.json.gz)
Pack files can now be compressed with gzip for smaller distribution:
- writePackFile/readPackFile auto-detect gzip by extension or magic bytes
- installPack accepts both .json and .json.gz files
- createPackFromSource defaults to .json.gz output (source packs can be large)
- createPack (DB export) still defaults to .json
- Auto-detects gzip by magic bytes even if extension is .json
5 new tests covering gzip write, install, magic byte detection, and round-trip.
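Detection by extension or magic bytes might look like this sketch. `isGzip` is a hypothetical name; the two-byte signature 0x1f 0x8b is the standard gzip header.

```typescript
// True when the path ends in .gz or the content starts with the gzip magic bytes.
function isGzip(path: string, bytes: Uint8Array): boolean {
  const hasMagic = bytes.length >= 2 && bytes[0] === 0x1f && bytes[1] === 0x8b;
  return hasMagic || path.toLowerCase().endsWith(".gz");
}
```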
* feat: add progress logging and fix dedup handling in pack install
- Log each document as it's indexed so large installs show progress
- Change pack install to use dedup: 'skip' for graceful duplicate handling
- Make title+content_length dedup check respect the dedup mode setting
(previously it always threw ValidationError regardless of dedup mode)
* feat: auto-generate tags during pack creation and apply on install
- Export tokenize() and add suggestTagsFromText() in tags.ts for DB-free tag generation
- createPackFromSource() now auto-generates tags per document via TF-IDF
- installPack() applies doc.tags via addTagsToDocument() after indexing
---------
* feat: add HTML file parser for .html/.htm document indexing (#318)
* feat: add HTML file parser for .html/.htm document indexing
Adds HtmlParser using the existing node-html-markdown dependency.
Strips <script>, <style>, and <nav> tags before conversion.
Registered for .html and .htm extensions.
Includes 12 tests covering conversion, tag stripping, edge cases.
Closes #317
* fix: address CodeQL and review comments on HTML parser
- Replace regex-based tag stripping with node-html-markdown's native
ignore option — eliminates all 3 CodeQL alerts (incomplete sanitization,
bad HTML filtering regexp)
- Wrap translate() in try/catch, throw ValidationError (consistent with
other parsers)
- Use trimEnd() instead of trim() to preserve leading indentation
- Reuse single NHM instance for efficiency
* Revert "fix: skip Vercel preview deployments on non-main branches"
This reverts commit eb481870c883f77278291b72245b1ca0b890a78c.
---------
* build(deps-dev): Bump eslint-config-prettier from 9.1.2 to 10.1.8 (#325)
Bumps [eslint-config-prettier](https://github.com/prettier/eslint-config-prettier) from 9.1.2 to 10.1.8.
- [Release notes](https://github.com/prettier/eslint-config-prettier/releases)
- [Changelog](https://github.com/prettier/eslint-config-prettier/blob/main/CHANGELOG.md)
- [Commits](https://github.com/prettier/eslint-config-prettier/commits/v10.1.8)
---
updated-dependencies:
- dependency-name: eslint-config-prettier
dependency-version: 10.1.8
dependency-type: direct:development
update-type: version-update:semver-major
...
* build(deps-dev): Bump lint-staged from 16.3.1 to 16.3.2
Bumps the minor-and-patch group with 1 update: [lint-staged](https://github.com/lint-staged/lint-staged).
Updates `lint-staged` from 16.3.1 to 16.3.2
- [Release notes](https://github.com/lint-staged/lint-staged/releases)
- [Changelog](https://github.com/lint-staged/lint-staged/blob/main/CHANGELOG.md)
- [Commits](https://github.com/lint-staged/lint-staged/compare/v16.3.1...v16.3.2)
---
updated-dependencies:
- dependency-name: lint-staged
dependency-version: 16.3.2
dependency-type: direct:development
update-type: version-update:semver-patch
dependency-group: minor-and-patch
...
* build(deps): Bump the actions group with 5 updates
Bumps the actions group with 5 updates:
| Package | From | To |
| --- | --- | --- |
| [actions/checkout](https://github.com/actions/checkout) | `4` | `6` |
| [actions/setup-node](https://github.com/actions/setup-node) | `4` | `6` |
| [actions/upload-artifact](https://github.com/actions/upload-artifact) | `4` | `7` |
| [actions/setup-python](https://github.com/actions/setup-python) | `5` | `6` |
| [actions/setup-go](https://github.com/actions/setup-go) | `5` | `6` |
Updates `actions/checkout` from 4 to 6
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/v4...v6)
Updates `actions/setup-node` from 4 to 6
- [Release notes](https://github.com/actions/setup-node/releases)
- [Commits](https://github.com/actions/setup-node/compare/v4...v6)
Updates `actions/upload-artifact` from 4 to 7
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](https://github.com/actions/upload-artifact/compare/v4...v7)
Updates `actions/setup-python` from 5 to 6
- [Release notes](https://github.com/actions/setup-python/releases)
- [Commits](https://github.com/actions/setup-python/compare/v5...v6)
Updates `actions/setup-go` from 5 to 6
- [Release notes](https://github.com/actions/setup-go/releases)
- [Commits](https://github.com/actions/setup-go/compare/v5...v6)
---
updated-dependencies:
- dependency-name: actions/checkout
dependency-version: '6'
dependency-type: direct:production
update-type: version-update:semver-major
dependency-group: actions
- dependency-name: actions/setup-node
dependency-version: '6'
dependency-type: direct:production
update-type: version-update:semver-major
dependency-group: actions
- dependency-name: actions/upload-artifact
dependency-version: '7'
dependency-type: direct:production
update-type: version-update:semver-major
dependency-group: actions
- dependency-name: actions/setup-python
dependency-version: '6'
dependency-type: direct:production
update-type: version-update:semver-major
dependency-group: actions
- dependency-name: actions/setup-go
dependency-version: '6'
dependency-type: direct:production
update-type: version-update:semver-major
dependency-group: actions
...
* build(deps-dev): Bump @types/node from 22.19.13 to 25.3.3
Bumps [@types/node](https://github.com/DefinitelyTyped/DefinitelyTyped/tree/HEAD/types/node) from 22.19.13 to 25.3.3.
- [Release notes](https://github.com/DefinitelyTyped/DefinitelyTyped/releases)
- [Commits](https://github.com/DefinitelyTyped/DefinitelyTyped/commits/HEAD/types/node)
---
updated-dependencies:
- dependency-name: "@types/node"
dependency-version: 25.3.3
dependency-type: direct:development
update-type: version-update:semver-major
...
* build(deps): Bump better-sqlite3 from 11.10.0 to 12.6.2
Bumps [better-sqlite3](https://github.com/WiseLibs/better-sqlite3) from 11.10.0 to 12.6.2.
- [Release notes](https://github.com/WiseLibs/better-sqlite3/releases)
- [Commits](https://github.com/WiseLibs/better-sqlite3/compare/v11.10.0...v12.6.2)
---
updated-dependencies:
- dependency-name: better-sqlite3
dependency-version: 12.6.2
dependency-type: direct:production
update-type: version-update:semver-major
...
* build(deps-dev): Bump eslint from 9.39.3 to 10.0.2
Bumps [eslint](https://github.com/eslint/eslint) from 9.39.3 to 10.0.2.
- [Release notes](https://github.com/eslint/eslint/releases)
- [Commits](https://github.com/eslint/eslint/compare/v9.39.3...v10.0.2)
---
updated-dependencies:
- dependency-name: eslint
dependency-version: 10.0.2
dependency-type: direct:development
update-type: version-update:semver-major
...
* feat: add passthrough LLM mode for ask-question tool (#335)
* feat: add passthrough LLM mode for ask-question tool
Adds llm.provider = "passthrough" so the ask-question MCP tool returns
retrieved context chunks directly to the calling LLM instead of requiring
a separate OpenAI/Ollama provider. This is the natural design for MCP tools
where the client already has an LLM (e.g. Claude Code).
- config.ts: add "passthrough" to llm.provider union type and env var handling
- rag.ts: add isPassthroughMode() helper and getContextForQuestion() which
retrieves and formats context without an LLM call
- mcp/server.ts: ask-question checks passthrough first and returns context
directly; falls through to existing LLM path otherwise
Enable via config: { "llm": { "provider": "passthrough" } }
Enable via env var: LIBSCOPE_LLM_PROVIDER=passthrough
* fix: format config.ts and include passthrough in provider override
- Reformat long if-condition to satisfy prettier (printWidth: 100)
- Fix logic bug: passthrough provider was checked in outer condition but
not spread into overrides.llm.provider
---------
* fix: address 9 audit findings from issue #332 (#333)
* fix: address 9 audit findings from issue #332
Security
- middleware: use timingSafeEqual for API key comparison (#2)
- url-fetcher, http-utils: replace process-wide NODE_TLS_REJECT_UNAUTHORIZED
mutation with per-request undici Agent to eliminate TLS race condition (#1)
Bugs
- indexing: re-throw unexpected embedding errors so transaction rolls back
instead of silently committing chunks with no vector (#3)
- search: replace correlated minRating subquery with avg_r.avg_rating from
the pre-joined aggregate in FTS and LIKE search paths (#4)
Performance
- bulk: replace O(n²) docs.find() loops with pre-built Map; replace
per-document getDocumentTags() calls with a single getDocumentTagsBatch()
query (#5)
- config: add 30-second TTL cache to loadConfig() so disk reads are not
repeated on every request (#6)
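A TTL cache of the kind described for loadConfig() can be sketched like this. `createTtlCache` and its injectable clock are assumptions made so the behaviour is easy to test; only the 30-second figure comes from the commit.

```typescript
// Cache the result of `load` for `ttlMs` milliseconds; `now` is injectable for tests.
function createTtlCache<T>(load: () => T, ttlMs: number, now: () => number = Date.now) {
  let value: T | undefined;
  let loadedAt = 0;
  let hasValue = false;
  return {
    get(): T {
      if (!hasValue || now() - loadedAt >= ttlMs) {
        value = load();
        loadedAt = now();
        hasValue = true;
      }
      return value as T;
    },
    // Drop the cached value so the next get() reloads (useful between tests).
    invalidate(): void {
      hasValue = false;
    },
  };
}
```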
Code quality
- routes: check res.write() return value to handle SSE backpressure (#7)
- reindex: delegate to schema.createVectorTable() instead of duplicating
the vec0 DDL inline (#8)
- obsidian: replace hand-rolled parseSimpleYaml() with js-yaml, normalise
Date objects back to ISO-8601 strings (#9)
Docs
- agents.md: expand architecture tree to include src/api/ and src/connectors/;
add Security Patterns section with correct undici examples
- CONTRIBUTING.md: fix check-suite command (npm test → npm run test:coverage)
and correct coverage threshold (80% → actual 75%/74%)
Tests
- bulk.test: add dateFrom/dateTo filter coverage
- config.test: add cache-hit test; call invalidateConfigCache() before env-var
tests so TTL cache doesn't return stale results
* fix: remove unused warnIfTlsBypassMissing function
Dead code after conflict resolution chose the per-request undici Agent
approach (which doesn't need a warning about NODE_TLS_REJECT_UNAUTHORIZED).
* fix: update tests for config cache and retry semantics
- Add invalidateConfigCache() before loadConfig() in 4 env-override tests
that were failing because the 30s TTL cache introduced in the config
module was returning stale results from the previous test's cache entry
- Update http-utils retry assertion: maxRetries=2 means 1 initial + 2
retries = 3 total calls (loop is attempt <= maxRetries)
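The call count follows directly from the loop bound. A synchronous sketch is given below; the real http-utils helper is not shown in this log, so `withRetries` is hypothetical.

```typescript
// Call `fn` once, then retry up to `maxRetries` more times on failure.
function withRetries<T>(fn: () => T, maxRetries: number): T {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return fn();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```

With maxRetries = 2 the loop body runs for attempt 0, 1, and 2, so a function that always fails is called three times.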
---------
* feat: CLI logging improvements and pack installation performance (#330) (#336)
- Add `src/cli/reporter.ts`: PrettyReporter (ANSI colors + \r progress
bar), SilentReporter (no-op for verbose/JSON mode), `isVerbose()` and
`createReporter()` factory. `LIBSCOPE_VERBOSE=1` env var alternative.
- Update `setupLogging` to default to "silent" in CLI mode (pretty
reporter handles user-facing output). Verbose/`--log-level` flags still
route to structured JSON pino logs. Fix duplicate `initLogger` calls in
onenote connect/disconnect commands to use `setupLogging` consistently.
- Update `installPack` in `packs.ts` to support batch embedding and
progress reporting:
- New `InstallOptions` interface with `batchSize`, `resumeFrom`,
`onProgress` fields
- Batch documents: chunk all → single `provider.embedBatch` call per
batch → single SQLite transaction per batch (avoids N embedding calls)
- `resumeFrom` skips the first N documents (enables partial install
resume after failure)
- `InstallResult` now includes `errors` count
- Add `--batch-size` and `--resume-from` CLI options to `pack install`
- Tests: `tests/unit/reporter.test.ts` (17 tests covering PrettyReporter,
SilentReporter, isVerbose, env var detection); extended
`tests/unit/packs.test.ts` with 7 new tests for progress callbacks,
batch efficiency, resumeFrom, embedBatch failure handling.
* Claude/fix issue 331 s1qzu (#338)
* build(deps): Bump the npm_and_yarn group across 1 directory with 2 updates (#337)
Bumps the npm_and_yarn group with 2 updates in the / directory: [@hono/node-server](https://github.com/honojs/node-server) and [hono](https://github.com/honojs/hono).
Updates `@hono/node-server` from 1.19.9 to 1.19.10
- [Release notes](https://github.com/honojs/node-server/releases)
- [Commits](https://github.com/honojs/node-server/compare/v1.19.9...v1.19.10)
Updates `hono` from 4.12.3 to 4.12.5
- [Release notes](https://github.com/honojs/hono/releases)
- [Commits](https://github.com/honojs/hono/compare/v4.12.3...v4.12.5)
---
updated-dependencies:
- dependency-name: "@hono/node-server"
dependency-version: 1.19.10
dependency-type: indirect
dependency-group: npm_and_yarn
- dependency-name: hono
dependency-version: 4.12.5
dependency-type: indirect
dependency-group: npm_and_yarn
...
* fix: eliminate SSRF TOCTOU and ReDoS vulnerabilities (#341)
* fix: eliminate SSRF TOCTOU and ReDoS vulnerabilities (closes #340)
**SSRF (CWE-918 — CodeQL alert #28)**
Replace the two-step validate-then-fetch approach in url-fetcher.ts with
IP-pinned requests using node:http / node:https directly. validateUrl()
resolves DNS and checks for private IPs, then the validated IP is passed
straight to the TCP connection (hostname: pinnedIp, servername: original
hostname for TLS SNI). There is now zero TOCTOU window between validation
and the actual network request. The redundant post-fetch DNS rebinding
check and the env-var-based TLS bypass wrapper are removed; rejectUnauthorized
is now passed directly to the request options.
An internal _setRequestImpl hook is exported for unit test injection so
tests can stub responses without touching node:http / node:https.
Tests are updated accordingly.
**ReDoS (CWE-1333 — CodeQL alert #24)**
Five regexes in confluence.ts used the pattern [^>]*ac:name="X"[^>]* —
two [^>]* quantifiers around a fixed literal. For input that contains a
large attribute blob without the target ac:name value, the engine must try
all O(n²) splits before concluding there is no match (polynomial backtracking).
Fix: rewrite the leading [^>]* as (?:(?!ac:name="X")[^>])* — a negative
lookahead prevents the quantifier from overlapping with the literal,
making backtracking structurally impossible.
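The rewrite can be illustrated with a simplified pair of patterns. These are assumed stand-ins for the confluence.ts regexes, using "toc" for X and a fixed tag prefix.

```typescript
// Before: two [^>]* runs can both consume the same characters around the literal.
const overlapping = /<ac:structured-macro[^>]*ac:name="toc"[^>]*>/;

// After: the leading run may not step over a position where the literal starts,
// so there is only one place the literal can begin.
const guarded = /<ac:structured-macro(?:(?!ac:name="toc")[^>])*ac:name="toc"[^>]*>/;

// Both patterns accept the same strings; only the search behaviour differs.
function sameVerdict(input: string): boolean {
  return overlapping.test(input) === guarded.test(input);
}
```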
* fix: replace remaining polynomial regexes in confluence.ts with indexOf helpers
The conflict resolution left the original `/<ac:structured-macro [^>]*>([\s\S]*?)
<\/ac:structured-macro>/gi` patterns in place for code blocks and info/tip panels
(those were not part of the original security fix diff). These have the same
O(n²) backtracking problem: with k opening tags and no closing tags, [\s\S]*?
scans O(n - pos) chars per attempt, totalling O(n²).
Replace the entire convertConfluenceStorage function with the indexOf-based
approach (replaceStructuredMacros / replaceTagPairs / extractTagContent helpers)
that eliminates all regex-based tag parsing. Also add removeSelfClosingMacros
to handle the self-closing TOC case without regex, since the previous self-closing
fix still used a [^>]*ac:name="toc"[^>]* pattern.
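An indexOf-based replacement of paired tags, in the spirit of the helpers named above, might look like this sketch. The signature and the name `replaceTagPairsSketch` are assumptions, not the confluence.ts code.

```typescript
// Replace every open...close pair with render(inner). An unterminated
// opening tag ends the scan, so the input is traversed at most once.
function replaceTagPairsSketch(
  input: string,
  open: string,
  close: string,
  render: (inner: string) => string,
): string {
  let out = "";
  let pos = 0;
  for (;;) {
    const start = input.indexOf(open, pos);
    if (start === -1) break;
    const end = input.indexOf(close, start + open.length);
    if (end === -1) break; // no closing tag anywhere ahead: leave the rest untouched
    out += input.slice(pos, start) + render(input.slice(start + open.length, end));
    pos = end + close.length;
  }
  return out + input.slice(pos);
}
```

Because a missing closing tag stops the loop instead of retrying from each later opening tag, the k-openers-no-closer input that troubled the regex costs a single scan here.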
---------
* feat: concurrent pack installation and -v verbose shorthand (issue #330) (#339)
* feat: concurrent pack installation and -v verbose shorthand (issue #330)
Add concurrent batch embedding to installPack for significant performance
improvement on large packs, plus CLI ergonomics improvements.
Key changes:
- `InstallOptions.concurrency` (default: 4): controls how many embedBatch
calls run simultaneously; embedding is I/O-bound so parallelism directly
reduces wall-clock installation time
- Refactor installPack to pre-chunk all documents upfront, then use a
semaphore-based scheduler to run up to `concurrency` embedBatch calls
concurrently while inserting completed batches in-order (SQLite requires
serialised writes); progress callbacks fire after each batch as before
- `pack install --concurrency <n>` CLI flag exposes the new option
- `-v` shorthand for `--verbose` on the global program options
- Fix transaction install-count tracking: count committed docs accurately
without relying on subtract-on-failure arithmetic
- Add 6 new tests covering concurrency=1 sequential, concurrency=4 parallel,
multiple embedBatch calls per install, concurrency limit enforcement,
incremental progress reporting, and partial-failure error counting
https://claude.ai/code/session_019hzhbEgV1ysnGmFVXBWzkZ
* fix: address all 4 Copilot review comments on PR #339
- Validate batchSize, concurrency, resumeFrom at the start of installPack
and throw ValidationError for invalid values (comments 3 & 4). Concurrency
<= 0 would silently hang the semaphore indefinitely.
- Add CLI lower-bound guard: --concurrency < 1 exits with a user-facing
error before ever calling installPack (comment 3).
- Lazy chunking: pre-chunking all documents upfront held chunks for the
entire pack in memory simultaneously. Batches now store only the raw
documents; resolveBatch() chunks on demand right before embedBatch
is called, so only one batch's worth of chunks is in memory at a time
(comment 2).
- Wrap provider.embedBatch() in try/catch so synchronous throws are
converted to rejected Promises rather than escaping scheduleNext() and
leaving the outer Promise permanently pending (comment 1).
---------
* fix: address 7 pre-release bugs from audit (#342) (#344)
- Guard JSON.parse in rowToWebhook with try/catch, default to []
- Guard JSON.parse in rowToSavedSearch with try/catch, default to null
- Guard JSON.parse in loadDbConnectorConfig with try/catch, throw ConfigError
- Push dateFrom/dateTo filters into SQL WHERE in listDocuments (before LIMIT)
- Validate negative limit in resolveSelector, throw ValidationError
- Replace manual substring extension parsing with path.extname() in packs.ts
- Verified reporter.ts is already tracked on development (no action needed)
- Added tests for all fixes (corrupted JSON, SQL-level date filtering, negative limit)
Closes #342
* feat: URL spidering — crawl linked pages with configurable depth and page limits (#315) (#343)
* feat: URL spidering — crawl linked pages with configurable depth and page limits (#315)
Adds opt-in spidering to URL indexing. A single seed URL can now crawl
and index an entire documentation site or wiki section in one call.
New files:
- src/core/link-extractor.ts: indexOf-based <a href> extraction, relative
URL resolution, fragment stripping, dedup, scheme filtering. No regex.
- src/core/spider.ts: BFS crawl engine with sameDomain, pathPrefix,
excludePatterns (glob), maxPages (hard cap 200), maxDepth (hard cap 5),
10-min total timeout, robots.txt (User-agent: * and libscope), and
1s inter-request delay. Yields SpiderResult per page; returns SpiderStats.
- tests/unit/link-extractor.test.ts: 25 tests covering relative resolution,
dedup, fragment stripping, scheme filtering, attribute order, edge cases.
- tests/unit/spider.test.ts: 20 tests covering BFS order, depth/page limits,
domain + path + pattern filtering, cycle detection, robots.txt, partial
failure recovery, stats, and abortReason reporting.
Modified:
- src/core/url-fetcher.ts: adds fetchRaw() export returning raw body +
contentType + finalUrl before HTML-to-markdown conversion, so the
spider can extract links from HTML before conversion.
- src/api/routes.ts: POST /api/v1/documents/url now accepts spider, maxPages,
maxDepth, sameDomain, pathPrefix, excludePatterns. Returns { documents,
pagesIndexed, pagesCrawled, pagesSkipped, errors, abortReason }.
- src/mcp/server.ts: submit-document tool gains spider, maxPages, maxDepth,
sameDomain, pathPrefix, excludePatterns parameters.
Safety: all fetched URLs pass through the existing SSRF validation in
fetchRaw() (DNS resolution, private IP blocking, scheme allowlist).
Hard limits (200 pages, depth 5, 10min) cannot be overridden by callers.
robots.txt is fetched once per origin and Disallow rules are honoured.
Individual page failures do not abort the crawl.
Closes #315
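As a rough illustration of the BFS crawl described above, the sketch below walks an in-memory link graph with a visited set and clamps caller limits to the hard caps (200 pages, depth 5). It is not the code in src/core/spider.ts: the names `crawlBfs`, `CrawlOptions`, and `LinkLookup` and the default limits are assumptions, and fetching, robots.txt, filters, the inter-request delay, and the timeout are all omitted.

```typescript
// Hard caps quoted in the commit message; everything else here is illustrative.
const HARD_MAX_PAGES = 200;
const HARD_MAX_DEPTH = 5;

interface CrawlOptions {
  maxPages?: number; // assumed default: 25
  maxDepth?: number; // assumed default: 2
}

interface CrawlPage {
  url: string;
  depth: number;
}

// Stand-in for "fetch the page and extract its links".
type LinkLookup = (url: string) => string[];

function* crawlBfs(
  seed: string,
  getLinks: LinkLookup,
  opts: CrawlOptions = {},
): Generator<CrawlPage> {
  // Caller-supplied limits can lower the caps but never raise them.
  const maxPages = Math.min(opts.maxPages ?? 25, HARD_MAX_PAGES);
  const maxDepth = Math.min(opts.maxDepth ?? 2, HARD_MAX_DEPTH);

  const visited = new Set<string>([seed]); // cycle detection
  const queue: CrawlPage[] = [{ url: seed, depth: 0 }];
  let fetched = 0;

  while (queue.length > 0 && fetched < maxPages) {
    const page = queue.shift();
    if (page === undefined) break;
    fetched++;
    yield page;

    if (page.depth >= maxDepth) continue; // do not expand links past the depth limit
    for (const link of getLinks(page.url)) {
      if (!visited.has(link)) {
        visited.add(link);
        queue.push({ url: link, depth: page.depth + 1 });
      }
    }
  }
}
```

Because the queue is first-in, first-out, every page at depth n is yielded before any page at depth n + 1, which is the ordering the BFS tests are described as checking.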
* fix: resolve CI lint errors in spider implementation
- Remove unnecessary type assertions (routes.ts, mcp/server.ts) —
TypeScript already narrows SpiderResult/SpiderStats from the generator
- Add explicit return type annotation on mock fetchRaw to satisfy
no-unsafe-return rule in spider.test.ts
- Replace .resolves.not.toThrow() with a direct assertion — vitest
.resolves requires a Promise, not an async function
* fix: address CodeQL security findings in spider/link-extractor
link-extractor.ts (CodeQL #30 — incomplete URL scheme check):
Replace the enumerated scheme blocklist (javascript:, vbscript:, etc.)
with a strict http/https allowlist check on the resolved URL protocol.
An allowlist is exhaustive by definition; a blocklist will always miss
obscure schemes such as blob:, or any scheme added in the future.
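A minimal sketch of the allowlist approach, assuming the standard WHATWG `URL` class is used for resolution. The function name `resolveCrawlableLink` is hypothetical and the actual link-extractor.ts may differ.

```typescript
// Only these protocols are ever followed; anything else is rejected outright.
const ALLOWED_PROTOCOLS = new Set(["http:", "https:"]);

// Resolve an href against the page URL, strip the fragment, and accept the
// result only if its protocol is on the allowlist. Returns null otherwise.
function resolveCrawlableLink(href: string, baseUrl: string): string | null {
  let resolved: URL;
  try {
    resolved = new URL(href, baseUrl);
  } catch {
    return null; // unparseable href
  }
  if (!ALLOWED_PROTOCOLS.has(resolved.protocol)) return null;
  resolved.hash = ""; // fragments point at the same document
  return resolved.href;
}
```

Checking the protocol of the resolved URL, rather than string-matching the raw href, means case variants and leading whitespace are handled by the URL parser instead of by hand-written prefix tests.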
spider.ts (CodeQL #31 — incomplete multi-character sanitization):
Replace the regex-based tag stripper /<[^>]+>/g in extractTitle() with
an indexOf-based stripTags() function. The regex stops at the first >
which can be inside a quoted attribute value (e.g. <img alt="a>b">),
potentially leaving partial tag content in the extracted title.
The new implementation walks quoted attribute values explicitly so no
tag content leaks through regardless of its internal structure.
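The quote-aware walk described above might look something like the following. This is a sketch written for illustration, not the stripTags() that ships in spider.ts, and it makes one simplifying assumption that is flagged in the comments: any "<" is treated as the start of a tag.

```typescript
// Remove HTML tags using indexOf plus a manual scan, tracking quoted attribute
// values so that a ">" inside quotes does not end the tag early.
// Simplification (assumption): every "<" opens a tag, so text such as "1 < 2"
// is truncated at the "<". An unterminated tag drops the rest of the input.
function stripTags(html: string): string {
  let out = "";
  let i = 0;
  while (i < html.length) {
    const open = html.indexOf("<", i);
    if (open === -1) {
      out += html.slice(i);
      break;
    }
    out += html.slice(i, open);

    let j = open + 1;
    let quote = ""; // "" means "not inside a quoted value"
    while (j < html.length) {
      const ch = html.charAt(j);
      if (quote !== "") {
        if (ch === quote) quote = "";
      } else if (ch === '"' || ch === "'") {
        quote = ch;
      } else if (ch === ">") {
        break;
      }
      j++;
    }
    i = j + 1; // continue after the closing ">"
  }
  return out;
}
```

For the input from the commit message, `<img alt="a>b">`, the regex /<[^>]+>/g would stop at the ">" inside the alt value and leave `b">` behind, whereas this walk consumes the whole tag.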
* fix: address all Copilot review comments on spider PR (#343)
- link-extractor: add word-boundary check in extractHref to prevent
matching data-href, aria-href (false positives on non-href attributes)
- spider: rename pagesIndexed → pagesFetched throughout (SpiderStats
interface already used pagesFetched; sync implementation + tests)
- spider: per-origin robots.txt cache (Map<origin, Set>) fetched lazily
as new origins are encountered during crawl (was seed-only before)
- spider: normalize to raw.finalUrl after redirects — visited set,
yielded URL, and link-extraction base all use the canonical URL
- routes: validate maxPages/maxDepth are finite positive integers
- routes: change conditional spread &&-patterns to ternaries
- routes: remove inner try/catch for spider fetch errors; add FetchError
to top-level handler (consistent with single-URL mode → 502)
- mcp/server: replace conditional spreads with explicit if assignments
- mcp/server: validate spider=true requires url (throws ValidationError)
- openapi: document spider request fields in IndexFromUrlRequest schema,
add SpiderResponse schema, update 201 response to oneOf
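The maxPages/maxDepth check in the list above can be illustrated as below. The helper name `parsePositiveInt` is hypothetical, and `RangeError` stands in for the project's ValidationError so that the snippet is self-contained.

```typescript
// Accept undefined (field omitted) or a finite positive integer; reject
// everything else. Number.isInteger is false for NaN, Infinity, and fractions.
function parsePositiveInt(value: unknown, field: string): number | undefined {
  if (value === undefined) return undefined;
  if (typeof value !== "number" || !Number.isInteger(value) || value <= 0) {
    // Stand-in for the project's ValidationError (assumption).
    throw new RangeError(`${field} must be a finite positive integer`);
  }
  return value;
}
```

Validating at the route boundary keeps values such as `Infinity` or `"200"` from reaching the crawler, where the hard caps would otherwise be the only line of defence.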
* style: fix prettier formatting in spider files
---------
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: always target main branch in release-please, fall back to github.token (#356)
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
) * Prepare for release (#345) * chore: add development branch workflow (#327) * chore: add development branch workflow - Add merge-gate.yml: enforces only 'development' can merge into main - Update CI/CodeQL/Docker workflows to run on both main and development - Update dependabot.yml: target-branch set to development for all ecosystems - Update copilot-instructions.md: document branch workflow convention - Rulesets configured: Main (requires merge-gate + squash-only), Development (requires CI status checks + PR) - Default branch set to development - All open PRs retargeted to development Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: skip Vercel preview deployments on non-main branches Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * chore: trigger check refresh --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat: create-pack from local folder or URL sources (#329) * fix: comprehensive audit fixes — security, performance, resilience, API hardening Addresses findings from issue #314: - SSRF protection for webhook URLs (CRITICAL) - Scrub secrets from exports - Stored XSS prevention on document URL - O(n²) and N+1 fixes in bulk operations - Rate limit cache eviction improvement - SSE backpressure handling - Replace raw Error() with typed errors - Fetch timeouts on all network calls - Input validation on API parameters - Search query length limit - Silent catch block logging - DNS rebinding check fix - N+1 in Slack user resolution - Pagination on webhook/search list endpoints Closes #314 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Address PR review comments: fix SSE listener leak, preserve caller signals, add SSRF validation to webhook test, chunk SQL params, use dynamic import in test - SSE backpressure: create single disconnect promise, race against drain (no listener accumulation) - http-utils.ts/onenote.ts: use AbortSignal.any() to combine caller signal 
with timeout - Webhook test endpoint: validate URL with validateWebhookUrlSsrf before fetch - bulk.ts: chunk IN clause to 999 params max (SQLite limit) - webhooks.test.ts: dynamic import after vi.mock() for deterministic DNS mocking Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * ci: consolidate and fix CI/CD workflows - Merge lint + typecheck into single job (saves one npm ci) - Add concurrency groups to ci, docker, codeql (cancel stale runs) - Add dependency-review-action on PRs (block vulnerable deps) - Add workflow_call trigger to ci.yml for reusability - Remove duplicate npm publish from release.yml (release-please owns it) - Fix SDK paths: sdk-go/ → sdk/go/, sdk-python/ → sdk/python/ - Fix Dependabot paths to match actual SDK directories - Add github-actions ecosystem to Dependabot (keep actions up to date) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat: add HTML file parser for .html/.htm document indexing Adds HtmlParser using the existing node-html-markdown dependency. Strips <script>, <style>, and <nav> tags before conversion. Registered for .html and .htm extensions. Includes 12 tests covering conversion, tag stripping, edge cases. Closes #317 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: address CodeQL and review comments on HTML parser - Replace regex-based tag stripping with node-html-markdown's native ignore option — eliminates all 3 CodeQL alerts (incomplete sanitization, bad HTML filtering regexp) - Wrap translate() in try/catch, throw ValidationError (consistent with other parsers) - Use trimEnd() instead of trim() to preserve leading indentation - Reuse single NHM instance for efficiency Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Revert "fix: skip Vercel preview deployments on non-main branches" This reverts commit eb481870c883f77278291b72245b1ca0b890a78c. 
* feat: add --from option to pack create for folder/URL sources Adds createPackFromSource() that builds packs directly from local folders, files, or URLs without requiring database interaction. CLI: libscope pack create --name my-pack --from ~/docs/ [--extensions .md,.html] [--exclude pattern] [--no-recursive] Features: - Walks directories recursively using registered parsers - Fetches URLs via fetchAndConvert - Supports extension filtering, exclude patterns, progress callback - Multiple --from sources supported - Output format identical to DB export (pack install works unchanged) Closes #328 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * style: fix prettier formatting Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat: add gzip support for pack files (.json.gz) Pack files can now be compressed with gzip for smaller distribution: - writePackFile/readPackFile auto-detect gzip by extension or magic bytes - installPack accepts both .json and .json.gz files - createPackFromSource defaults to .json.gz output (source packs can be large) - createPack (DB export) still defaults to .json - Auto-detects gzip by magic bytes even if extension is .json 5 new tests covering gzip write, install, magic byte detection, and round-trip. 
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat: add progress logging and fix dedup handling in pack install - Log each document as it's indexed so large installs show progress - Change pack install to use dedup: 'skip' for graceful duplicate handling - Make title+content_length dedup check respect the dedup mode setting (previously it always threw ValidationError regardless of dedup mode) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat: auto-generate tags during pack creation and apply on install - Export tokenize() and add suggestTagsFromText() in tags.ts for DB-free tag generation - createPackFromSource() now auto-generates tags per document via TF-IDF - installPack() applies doc.tags via addTagsToDocument() after indexing Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat: add HTML file parser for .html/.htm document indexing (#318) * feat: add HTML file parser for .html/.htm document indexing Adds HtmlParser using the existing node-html-markdown dependency. Strips <script>, <style>, and <nav> tags before conversion. Registered for .html and .htm extensions. Includes 12 tests covering conversion, tag stripping, edge cases. 
Closes #317 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: address CodeQL and review comments on HTML parser - Replace regex-based tag stripping with node-html-markdown's native ignore option — eliminates all 3 CodeQL alerts (incomplete sanitization, bad HTML filtering regexp) - Wrap translate() in try/catch, throw ValidationError (consistent with other parsers) - Use trimEnd() instead of trim() to preserve leading indentation - Reuse single NHM instance for efficiency Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Revert "fix: skip Vercel preview deployments on non-main branches" This reverts commit eb481870c883f77278291b72245b1ca0b890a78c. --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * build(deps-dev): Bump eslint-config-prettier from 9.1.2 to 10.1.8 (#325) Bumps [eslint-config-prettier](https://github.com/prettier/eslint-config-prettier) from 9.1.2 to 10.1.8. - [Release notes](https://github.com/prettier/eslint-config-prettier/releases) - [Changelog](https://github.com/prettier/eslint-config-prettier/blob/main/CHANGELOG.md) - [Commits](https://github.com/prettier/eslint-config-prettier/commits/v10.1.8) --- updated-dependencies: - dependency-name: eslint-config-prettier dependency-version: 10.1.8 dependency-type: direct:development update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * build(deps-dev): Bump lint-staged from 16.3.1 to 16.3.2 Bumps the minor-and-patch group with 1 update: [lint-staged](https://github.com/lint-staged/lint-staged). 
Updates `lint-staged` from 16.3.1 to 16.3.2 - [Release notes](https://github.com/lint-staged/lint-staged/releases) - [Changelog](https://github.com/lint-staged/lint-staged/blob/main/CHANGELOG.md) - [Commits](https://github.com/lint-staged/lint-staged/compare/v16.3.1...v16.3.2) --- updated-dependencies: - dependency-name: lint-staged dependency-version: 16.3.2 dependency-type: direct:development update-type: version-update:semver-patch dependency-group: minor-and-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * build(deps): Bump the actions group with 5 updates Bumps the actions group with 5 updates: | Package | From | To | | --- | --- | --- | | [actions/checkout](https://github.com/actions/checkout) | `4` | `6` | | [actions/setup-node](https://github.com/actions/setup-node) | `4` | `6` | | [actions/upload-artifact](https://github.com/actions/upload-artifact) | `4` | `7` | | [actions/setup-python](https://github.com/actions/setup-python) | `5` | `6` | | [actions/setup-go](https://github.com/actions/setup-go) | `5` | `6` | Updates `actions/checkout` from 4 to 6 - [Release notes](https://github.com/actions/checkout/releases) - [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md) - [Commits](https://github.com/actions/checkout/compare/v4...v6) Updates `actions/setup-node` from 4 to 6 - [Release notes](https://github.com/actions/setup-node/releases) - [Commits](https://github.com/actions/setup-node/compare/v4...v6) Updates `actions/upload-artifact` from 4 to 7 - [Release notes](https://github.com/actions/upload-artifact/releases) - [Commits](https://github.com/actions/upload-artifact/compare/v4...v7) Updates `actions/setup-python` from 5 to 6 - [Release notes](https://github.com/actions/setup-python/releases) - [Commits](https://github.com/actions/setup-python/compare/v5...v6) Updates `actions/setup-go` from 5 to 6 - [Release 
notes](https://github.com/actions/setup-go/releases) - [Commits](https://github.com/actions/setup-go/compare/v5...v6) --- updated-dependencies: - dependency-name: actions/checkout dependency-version: '6' dependency-type: direct:production update-type: version-update:semver-major dependency-group: actions - dependency-name: actions/setup-node dependency-version: '6' dependency-type: direct:production update-type: version-update:semver-major dependency-group: actions - dependency-name: actions/upload-artifact dependency-version: '7' dependency-type: direct:production update-type: version-update:semver-major dependency-group: actions - dependency-name: actions/setup-python dependency-version: '6' dependency-type: direct:production update-type: version-update:semver-major dependency-group: actions - dependency-name: actions/setup-go dependency-version: '6' dependency-type: direct:production update-type: version-update:semver-major dependency-group: actions ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * build(deps-dev): Bump @types/node from 22.19.13 to 25.3.3 Bumps [@types/node](https://github.com/DefinitelyTyped/DefinitelyTyped/tree/HEAD/types/node) from 22.19.13 to 25.3.3. - [Release notes](https://github.com/DefinitelyTyped/DefinitelyTyped/releases) - [Commits](https://github.com/DefinitelyTyped/DefinitelyTyped/commits/HEAD/types/node) --- updated-dependencies: - dependency-name: "@types/node" dependency-version: 25.3.3 dependency-type: direct:development update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * build(deps): Bump better-sqlite3 from 11.10.0 to 12.6.2 Bumps [better-sqlite3](https://github.com/WiseLibs/better-sqlite3) from 11.10.0 to 12.6.2. 
- [Release notes](https://github.com/WiseLibs/better-sqlite3/releases) - [Commits](https://github.com/WiseLibs/better-sqlite3/compare/v11.10.0...v12.6.2) --- updated-dependencies: - dependency-name: better-sqlite3 dependency-version: 12.6.2 dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * build(deps-dev): Bump eslint from 9.39.3 to 10.0.2 Bumps [eslint](https://github.com/eslint/eslint) from 9.39.3 to 10.0.2. - [Release notes](https://github.com/eslint/eslint/releases) - [Commits](https://github.com/eslint/eslint/compare/v9.39.3...v10.0.2) --- updated-dependencies: - dependency-name: eslint dependency-version: 10.0.2 dependency-type: direct:development update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Robert DeRienzo <rderienzo@voloridge.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat: add passthrough LLM mode for ask-question tool (#335) * feat: add passthrough LLM mode for ask-question tool Adds llm.provider = "passthrough" so the ask-question MCP tool returns retrieved context chunks directly to the calling LLM instead of requiring a separate OpenAI/Ollama provider. This is the natural design for MCP tools where the client already has an LLM (e.g. Claude Code). 
- config.ts: add "passthrough" to llm.provider union type and env var handling - rag.ts: add isPassthroughMode() helper and getContextForQuestion() which retrieves and formats context without an LLM call - mcp/server.ts: ask-question checks passthrough first and returns context directly; falls through to existing LLM path otherwise Enable via config: { "llm": { "provider": "passthrough" } } Enable via env var: LIBSCOPE_LLM_PROVIDER=passthrough Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: format config.ts and include passthrough in provider override - Reformat long if-condition to satisfy prettier (printWidth: 100) - Fix logic bug: passthrough provider was checked in outer condition but not spread into overrides.llm.provider Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: address 9 audit findings from issue #332 (#333) * fix: address 9 audit findings from issue #332 Security - middleware: use timingSafeEqual for API key comparison (#2) - url-fetcher, http-utils: replace process-wide NODE_TLS_REJECT_UNAUTHORIZED mutation with per-request undici Agent to eliminate TLS race condition (#1) Bugs - indexing: re-throw unexpected embedding errors so transaction rolls back instead of silently committing chunks with no vector (#3) - search: replace correlated minRating subquery with avg_r.avg_rating from the pre-joined aggregate in FTS and LIKE search paths (#4) Performance - bulk: replace O(n²) docs.find() loops with pre-built Map; replace per-document getDocumentTags() calls with a single getDocumentTagsBatch() query (#5) - config: add 30-second TTL cache to loadConfig() so disk reads are not repeated on every request (#6) Code quality - routes: check res.write() return value to handle SSE backpressure (#7) - reindex: delegate to schema.createVectorTable() instead of duplicating the vec0 DDL inline (#8) - obsidian: replace hand-rolled parseSimpleYaml() with js-yaml, 
normalise Date objects back to ISO-8601 strings (#9) Docs - agents.md: expand architecture tree to include src/api/ and src/connectors/; add Security Patterns section with correct undici examples - CONTRIBUTING.md: fix check-suite command (npm test → npm run test:coverage) and correct coverage threshold (80% → actual 75%/74%) Tests - bulk.test: add dateFrom/dateTo filter coverage - config.test: add cache-hit test; call invalidateConfigCache() before env-var tests so TTL cache doesn't return stale results Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: remove unused warnIfTlsBypassMissing function Dead code after conflict resolution chose the per-request undici Agent approach (which doesn't need a warning about NODE_TLS_REJECT_UNAUTHORIZED). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: update tests for config cache and retry semantics - Add invalidateConfigCache() before loadConfig() in 4 env-override tests that were failing because the 30s TTL cache introduced in the config module was returning stale results from the previous test's cache entry - Update http-utils retry assertion: maxRetries=2 means 1 initial + 2 retries = 3 total calls (loop is attempt <= maxRetries) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * feat: CLI logging improvements and pack installation performance (#330) (#336) - Add `src/cli/reporter.ts`: PrettyReporter (ANSI colors + \r progress bar), SilentReporter (no-op for verbose/JSON mode), `isVerbose()` and `createReporter()` factory. `LIBSCOPE_VERBOSE=1` env var alternative. - Update `setupLogging` to default to "silent" in CLI mode (pretty reporter handles user-facing output). Verbose/`--log-level` flags still route to structured JSON pino logs. Fix duplicate `initLogger` calls in onenote connect/disconnect commands to use `setupLogging` consistently. 
- Update `installPack` in `packs.ts` to support batch embedding and progress reporting: - New `InstallOptions` interface with `batchSize`, `resumeFrom`, `onProgress` fields - Batch documents: chunk all → single `provider.embedBatch` call per batch → single SQLite transaction per batch (avoids N embedding calls) - `resumeFrom` skips the first N documents (enables partial install resume after failure) - `InstallResult` now includes `errors` count - Add `--batch-size` and `--resume-from` CLI options to `pack install` - Tests: `tests/unit/reporter.test.ts` (17 tests covering PrettyReporter, SilentReporter, isVerbose, env var detection); extended `tests/unit/packs.test.ts` with 7 new tests for progress callbacks, batch efficiency, resumeFrom, embedBatch failure handling. Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * Claude/fix issue 331 s1qzu (#338) * build(deps): Bump the npm_and_yarn group across 1 directory with 2 updates (#337) Bumps the npm_and_yarn group with 2 updates in the / directory: [@hono/node-server](https://github.com/honojs/node-server) and [hono](https://github.com/honojs/hono). Updates `@hono/node-server` from 1.19.9 to 1.19.10 - [Release notes](https://github.com/honojs/node-server/releases) - [Commits](https://github.com/honojs/node-server/compare/v1.19.9...v1.19.10) Updates `hono` from 4.12.3 to 4.12.5 - [Release notes](https://github.com/honojs/hono/releases) - [Commits](https://github.com/honojs/hono/compare/v4.12.3...v4.12.5) --- updated-dependencies: - dependency-name: "@hono/node-server" dependency-version: 1.19.10 dependency-type: indirect dependency-group: npm_and_yarn - dependency-name: hono dependency-version: 4.12.5 dependency-type: indirect dependency-group: npm_and_yarn ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * fix: eliminate SSRF TOCTOU and ReDoS vulnerabilities (#341) * fix: eliminate SSRF TOCTOU and ReDoS vulnerabilities (closes #340) **SSRF (CWE-918 — CodeQL alert #28)** Replace the two-step validate-then-fetch approach in url-fetcher.ts with IP-pinned requests using node:http / node:https directly. validateUrl() resolves DNS and checks for private IPs, then the validated IP is passed straight to the TCP connection (hostname: pinnedIp, servername: original hostname for TLS SNI). There is now zero TOCTOU window between validation and the actual network request. The redundant post-fetch DNS rebinding check and the env-var-based TLS bypass wrapper are removed; rejectUnauthorized is now passed directly to the request options. An internal _setRequestImpl hook is exported for unit test injection so tests can stub responses without touching node:http / node:https. Tests are updated accordingly. **ReDoS (CWE-1333 — CodeQL alert #24)** Five regexes in confluence.ts used the pattern [^>]*ac:name="X"[^>]* — two [^>]* quantifiers around a fixed literal. For input that contains a large attribute blob without the target ac:name value, the engine must try all O(n²) splits before concluding no match (catastrophic backtracking). Fix: rewrite the leading [^>]* as (?:(?!ac:name="X")[^>])* — a negative lookahead prevents the quantifier from overlapping with the literal, making backtracking structurally impossible. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: replace remaining polynomial regexes in confluence.ts with indexOf helpers The conflict resolution left the original `/<ac:structured-macro [^>]*>([\s\S]*?) <\/ac:structured-macro>/gi` patterns in place for code blocks and info/tip panels (those were not part of the original security fix diff). 
These have the same O(n²) backtracking problem: with k opening tags and no closing tags, [\s\S]*? scans O(n - pos) chars per attempt, totalling O(n²). Replace the entire convertConfluenceStorage function with the indexOf-based approach (replaceStructuredMacros / replaceTagPairs / extractTagContent helpers) that eliminates all regex-based tag parsing. Also add removeSelfClosingMacros to handle the self-closing TOC case without regex, since the previous self-closing fix still used a [^>]*ac:name="toc"[^>]* pattern. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * feat: concurrent pack installation and -v verbose shorthand (issue #330) (#339) * feat: concurrent pack installation and -v verbose shorthand (issue #330) Add concurrent batch embedding to installPack for significant performance improvement on large packs, plus CLI ergonomics improvements. Key changes: - `InstallOptions.concurrency` (default: 4): controls how many embedBatch calls run simultaneously; embedding is I/O-bound so parallelism directly reduces wall-clock installation time - Refactor installPack to pre-chunk all documents upfront, then use a semaphore-based scheduler to run up to `concurrency` embedBatch calls concurrently while inserting completed batches in-order (SQLite requires serialised writes); progress callbacks fire after each batch as before - `pack install --concurrency <n>` CLI flag exposes the new option - `-v` shorthand for `--verbose` on the global program options - Fix transaction install-count tracking: count committed docs accurately without relying on subtract-on-failure arithmetic - Add 6 new tests covering concurrency=1 sequential, concurrency=4 parallel, multiple embedBatch calls per install, concurrency limit enforcement, incremental progress reporting, and partial-failure error counting https://claude.ai/code/session_019hzhbEgV1ysnGmFVXBWzkZ * fix: address all 4 Copilot review comments on PR #339 
- Validate batchSize, concurrency, resumeFrom at the start of installPack and throw ValidationError for invalid values (comments 3 & 4). Concurrency <= 0 would silently hang the semaphore indefinitely. - Add CLI lower-bound guard: --concurrency < 1 exits with a user-facing error before ever calling installPack (comment 3). - Lazy chunking: pre-chunking all documents upfront held chunks for the entire pack in memory simultaneously. Batches now store only the raw documents; resolveBatch() chunks on demand right before embedBatch is called, so only one batch's worth of chunks is in memory at a time (comment 2). - Wrap provider.embedBatch() in try/catch so synchronous throws are converted to rejected Promises rather than escaping scheduleNext() and leaving the outer Promise permanently pending (comment 1). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude <noreply@anthropic.com> * fix: address 7 pre-release bugs from audit (#342) (#344) - Guard JSON.parse in rowToWebhook with try/catch, default to [] - Guard JSON.parse in rowToSavedSearch with try/catch, default to null - Guard JSON.parse in loadDbConnectorConfig with try/catch, throw ConfigError - Push dateFrom/dateTo filters into SQL WHERE in listDocuments (before LIMIT) - Validate negative limit in resolveSelector, throw ValidationError - Replace manual substring extension parsing with path.extname() in packs.ts - Verified reporter.ts is already tracked on development (no action needed) - Added tests for all fixes (corrupted JSON, SQL-level date filtering, negative limit) Closes #342 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat: URL spidering — crawl linked pages with configurable depth and page limits (#315) (#343) * feat: URL spidering — crawl linked pages with configurable depth and page limits (#315) Adds opt-in spidering to URL indexing. A single seed URL can now crawl and index an entire documentation site or wiki section in one call. 
New files: - src/core/link-extractor.ts: indexOf-based <a href> extraction, relative URL resolution, fragment stripping, dedup, scheme filtering. No regex. - src/core/spider.ts: BFS crawl engine with sameDomain, pathPrefix, excludePatterns (glob), maxPages (hard cap 200), maxDepth (hard cap 5), 10-min total timeout, robots.txt (User-agent: * and libscope), and 1s inter-request delay. Yields SpiderResult per page; returns SpiderStats. - tests/unit/link-extractor.test.ts: 25 tests covering relative resolution, dedup, fragment stripping, scheme filtering, attribute order, edge cases. - tests/unit/spider.test.ts: 20 tests covering BFS order, depth/page limits, domain + path + pattern filtering, cycle detection, robots.txt, partial failure recovery, stats, and abortReason reporting. Modified: - src/core/url-fetcher.ts: adds fetchRaw() export returning raw body + contentType + finalUrl before HTML-to-markdown conversion, so the spider can extract links from HTML before conversion. - src/api/routes.ts: POST /api/v1/documents/url now accepts spider, maxPages, maxDepth, sameDomain, pathPrefix, excludePatterns. Returns { documents, pagesIndexed, pagesCrawled, pagesSkipped, errors, abortReason }. - src/mcp/server.ts: submit-document tool gains spider, maxPages, maxDepth, sameDomain, pathPrefix, excludePatterns parameters. Safety: all fetched URLs pass through the existing SSRF validation in fetchRaw() (DNS resolution, private IP blocking, scheme allowlist). Hard limits (200 pages, depth 5, 10min) cannot be overridden by callers. robots.txt is fetched once per origin and Disallow rules are honoured. Individual page failures do not abort the crawl. 
Closes #315 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: resolve CI lint errors in spider implementation - Remove unnecessary type assertions (routes.ts, mcp/server.ts) — TypeScript already narrows SpiderResult/SpiderStats from the generator - Add explicit return type annotation on mock fetchRaw to satisfy no-unsafe-return rule in spider.test.ts - Replace .resolves.not.toThrow() with a direct assertion — vitest .resolves requires a Promise, not an async function Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: address CodeQL security findings in spider/link-extractor link-extractor.ts (CodeQL #30 — incomplete URL scheme check): Replace the enumerated scheme blocklist (javascript:, vbscript:, etc.) with a strict http/https allowlist check on the resolved URL protocol. An allowlist is exhaustive by definition; a blocklist will always miss obscure schemes like vbscript:, blob:, or future additions. spider.ts (CodeQL #31 — incomplete multi-character sanitization): Replace the regex-based tag stripper /<[^>]+>/g in extractTitle() with an indexOf-based stripTags() function. The regex stops at the first > which can be inside a quoted attribute value (e.g. <img alt="a>b">), potentially leaving partial tag content in the extracted title. The new implementation walks quoted attribute values explicitly so no tag content leaks through regardless of its internal structure. 
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: address all Copilot review comments on spider PR (#343) - link-extractor: add word-boundary check in extractHref to prevent matching data-href, aria-href (false positives on non-href attributes) - spider: rename pagesIndexed → pagesFetched throughout (SpiderStats interface already used pagesFetched; sync implementation + tests) - spider: per-origin robots.txt cache (Map<origin, Set>) fetched lazily as new origins are encountered during crawl (was seed-only before) - spider: normalize to raw.finalUrl after redirects — visited set, yielded URL, and link-extraction base all use the canonical URL - routes: validate maxPages/maxDepth are finite positive integers - routes: change conditional spread &&-patterns to ternaries - routes: remove inner try/catch for spider fetch errors; add FetchError to top-level handler (consistent with single-URL mode → 502) - mcp/server: replace conditional spreads with explicit if assignments - mcp/server: validate spider=true requires url (throws ValidationError) - openapi: document spider request fields in IndexFromUrlRequest schema, add SpiderResponse schema, update 201 response to oneOf Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * style: fix prettier formatting in spider files Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * chore: bring main up to date with development for v1.3.0 (#353) * chore: add development branch workflow (#327) * chore: add development branch workflow - Add merge-gate.yml: enforces only 'development' can merge into main - Update CI/CodeQL/Docker workflows to run on both main and 
development - Update dependabot.yml: target-branch set to development for all ecosystems - Update copilot-instructions.md: document branch workflow convention - Rulesets configured: Main (requires merge-gate + squash-only), Development (requires CI status checks + PR) - Default branch set to development - All open PRs retargeted to development Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: skip Vercel preview deployments on non-main branches Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * chore: trigger check refresh --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat: create-pack from local folder or URL sources (#329) * fix: comprehensive audit fixes — security, performance, resilience, API hardening Addresses findings from issue #314: - SSRF protection for webhook URLs (CRITICAL) - Scrub secrets from exports - Stored XSS prevention on document URL - O(n²) and N+1 fixes in bulk operations - Rate limit cache eviction improvement - SSE backpressure handling - Replace raw Error() with typed errors - Fetch timeouts on all network calls - Input validation on API parameters - Search query length limit - Silent catch block logging - DNS rebinding check fix - N+1 in Slack user resolution - Pagination on webhook/search list endpoints Closes #314 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Address PR review comments: fix SSE listener leak, preserve caller signals, add SSRF validation to webhook test, chunk SQL params, use dynamic import in test - SSE backpressure: create single disconnect promise, race against drain (no listener accumulation) - http-utils.ts/onenote.ts: use AbortSignal.any() to combine caller signal with timeout - Webhook test endpoint: validate URL with validateWebhookUrlSsrf before fetch - bulk.ts: chunk IN clause to 999 params max (SQLite limit) - webhooks.test.ts: dynamic import after vi.mock() for deterministic DNS mocking Co-authored-by: 
Copilot <223556219+Copilot@users.noreply.github.com> * ci: consolidate and fix CI/CD workflows - Merge lint + typecheck into single job (saves one npm ci) - Add concurrency groups to ci, docker, codeql (cancel stale runs) - Add dependency-review-action on PRs (block vulnerable deps) - Add workflow_call trigger to ci.yml for reusability - Remove duplicate npm publish from release.yml (release-please owns it) - Fix SDK paths: sdk-go/ → sdk/go/, sdk-python/ → sdk/python/ - Fix Dependabot paths to match actual SDK directories - Add github-actions ecosystem to Dependabot (keep actions up to date) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat: add HTML file parser for .html/.htm document indexing Adds HtmlParser using the existing node-html-markdown dependency. Strips <script>, <style>, and <nav> tags before conversion. Registered for .html and .htm extensions. Includes 12 tests covering conversion, tag stripping, edge cases. Closes #317 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: address CodeQL and review comments on HTML parser - Replace regex-based tag stripping with node-html-markdown's native ignore option — eliminates all 3 CodeQL alerts (incomplete sanitization, bad HTML filtering regexp) - Wrap translate() in try/catch, throw ValidationError (consistent with other parsers) - Use trimEnd() instead of trim() to preserve leading indentation - Reuse single NHM instance for efficiency Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Revert "fix: skip Vercel preview deployments on non-main branches" This reverts commit eb481870c883f77278291b72245b1ca0b890a78c. * feat: add --from option to pack create for folder/URL sources Adds createPackFromSource() that builds packs directly from local folders, files, or URLs without requiring database interaction. 
CLI: libscope pack create --name my-pack --from ~/docs/ [--extensions .md,.html] [--exclude pattern] [--no-recursive] Features: - Walks directories recursively using registered parsers - Fetches URLs via fetchAndConvert - Supports extension filtering, exclude patterns, progress callback - Multiple --from sources supported - Output format identical to DB export (pack install works unchanged) Closes #328 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * style: fix prettier formatting Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat: add gzip support for pack files (.json.gz) Pack files can now be compressed with gzip for smaller distribution: - writePackFile/readPackFile auto-detect gzip by extension or magic bytes - installPack accepts both .json and .json.gz files - createPackFromSource defaults to .json.gz output (source packs can be large) - createPack (DB export) still defaults to .json - Auto-detects gzip by magic bytes even if extension is .json 5 new tests covering gzip write, install, magic byte detection, and round-trip. 
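A rough sketch of the magic-byte detection described above, using Node's built-in `node:zlib`. The helper names (`isGzip`, `encodePack`, `decodePack`) are hypothetical and are not the `writePackFile`/`readPackFile` functions from packs.ts; they only illustrate why content sniffing works even when the extension is misleading.

```typescript
import { gunzipSync, gzipSync } from "node:zlib";

// Every gzip stream starts with the two bytes 0x1f 0x8b (RFC 1952).
function isGzip(bytes: Uint8Array): boolean {
  return bytes.length >= 2 && bytes[0] === 0x1f && bytes[1] === 0x8b;
}

// Hypothetical helper: serialize a pack, optionally gzip-compressed.
function encodePack(pack: unknown, compress: boolean): Buffer {
  const json = Buffer.from(JSON.stringify(pack), "utf8");
  return compress ? gzipSync(json) : json;
}

// Hypothetical helper: decode by sniffing the content instead of trusting the file name.
function decodePack(bytes: Buffer): unknown {
  const raw = isGzip(bytes) ? gunzipSync(bytes) : bytes;
  return JSON.parse(raw.toString("utf8"));
}
```

Valid JSON text cannot begin with the control byte 0x1f, so the check cannot mistake a plain .json pack for a compressed one.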
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat: add progress logging and fix dedup handling in pack install - Log each document as it's indexed so large installs show progress - Change pack install to use dedup: 'skip' for graceful duplicate handling - Make title+content_length dedup check respect the dedup mode setting (previously it always threw ValidationError regardless of dedup mode) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat: auto-generate tags during pack creation and apply on install - Export tokenize() and add suggestTagsFromText() in tags.ts for DB-free tag generation - createPackFromSource() now auto-generates tags per document via TF-IDF - installPack() applies doc.tags via addTagsToDocument() after indexing Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat: add HTML file parser for .html/.htm document indexing (#318) * feat: add HTML file parser for .html/.htm document indexing Adds HtmlParser using the existing node-html-markdown dependency. Strips <script>, <style>, and <nav> tags before conversion. Registered for .html and .htm extensions. Includes 12 tests covering conversion, tag stripping, edge cases. 
Closes #317 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: address CodeQL and review comments on HTML parser - Replace regex-based tag stripping with node-html-markdown's native ignore option — eliminates all 3 CodeQL alerts (incomplete sanitization, bad HTML filtering regexp) - Wrap translate() in try/catch, throw ValidationError (consistent with other parsers) - Use trimEnd() instead of trim() to preserve leading indentation - Reuse single NHM instance for efficiency Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Revert "fix: skip Vercel preview deployments on non-main branches" This reverts commit eb481870c883f77278291b72245b1ca0b890a78c. --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * build(deps-dev): Bump eslint-config-prettier from 9.1.2 to 10.1.8 (#325) Bumps [eslint-config-prettier](https://github.com/prettier/eslint-config-prettier) from 9.1.2 to 10.1.8. - [Release notes](https://github.com/prettier/eslint-config-prettier/releases) - [Changelog](https://github.com/prettier/eslint-config-prettier/blob/main/CHANGELOG.md) - [Commits](https://github.com/prettier/eslint-config-prettier/commits/v10.1.8) --- updated-dependencies: - dependency-name: eslint-config-prettier dependency-version: 10.1.8 dependency-type: direct:development update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * build(deps-dev): Bump lint-staged from 16.3.1 to 16.3.2 Bumps the minor-and-patch group with 1 update: [lint-staged](https://github.com/lint-staged/lint-staged). 
Updates `lint-staged` from 16.3.1 to 16.3.2 - [Release notes](https://github.com/lint-staged/lint-staged/releases) - [Changelog](https://github.com/lint-staged/lint-staged/blob/main/CHANGELOG.md) - [Commits](https://github.com/lint-staged/lint-staged/compare/v16.3.1...v16.3.2) --- updated-dependencies: - dependency-name: lint-staged dependency-version: 16.3.2 dependency-type: direct:development update-type: version-update:semver-patch dependency-group: minor-and-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * build(deps): Bump the actions group with 5 updates Bumps the actions group with 5 updates: | Package | From | To | | --- | --- | --- | | [actions/checkout](https://github.com/actions/checkout) | `4` | `6` | | [actions/setup-node](https://github.com/actions/setup-node) | `4` | `6` | | [actions/upload-artifact](https://github.com/actions/upload-artifact) | `4` | `7` | | [actions/setup-python](https://github.com/actions/setup-python) | `5` | `6` | | [actions/setup-go](https://github.com/actions/setup-go) | `5` | `6` | Updates `actions/checkout` from 4 to 6 - [Release notes](https://github.com/actions/checkout/releases) - [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md) - [Commits](https://github.com/actions/checkout/compare/v4...v6) Updates `actions/setup-node` from 4 to 6 - [Release notes](https://github.com/actions/setup-node/releases) - [Commits](https://github.com/actions/setup-node/compare/v4...v6) Updates `actions/upload-artifact` from 4 to 7 - [Release notes](https://github.com/actions/upload-artifact/releases) - [Commits](https://github.com/actions/upload-artifact/compare/v4...v7) Updates `actions/setup-python` from 5 to 6 - [Release notes](https://github.com/actions/setup-python/releases) - [Commits](https://github.com/actions/setup-python/compare/v5...v6) Updates `actions/setup-go` from 5 to 6 - [Release 
notes](https://github.com/actions/setup-go/releases) - [Commits](https://github.com/actions/setup-go/compare/v5...v6) --- updated-dependencies: - dependency-name: actions/checkout dependency-version: '6' dependency-type: direct:production update-type: version-update:semver-major dependency-group: actions - dependency-name: actions/setup-node dependency-version: '6' dependency-type: direct:production update-type: version-update:semver-major dependency-group: actions - dependency-name: actions/upload-artifact dependency-version: '7' dependency-type: direct:production update-type: version-update:semver-major dependency-group: actions - dependency-name: actions/setup-python dependency-version: '6' dependency-type: direct:production update-type: version-update:semver-major dependency-group: actions - dependency-name: actions/setup-go dependency-version: '6' dependency-type: direct:production update-type: version-update:semver-major dependency-group: actions ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * build(deps-dev): Bump @types/node from 22.19.13 to 25.3.3 Bumps [@types/node](https://github.com/DefinitelyTyped/DefinitelyTyped/tree/HEAD/types/node) from 22.19.13 to 25.3.3. - [Release notes](https://github.com/DefinitelyTyped/DefinitelyTyped/releases) - [Commits](https://github.com/DefinitelyTyped/DefinitelyTyped/commits/HEAD/types/node) --- updated-dependencies: - dependency-name: "@types/node" dependency-version: 25.3.3 dependency-type: direct:development update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * build(deps): Bump better-sqlite3 from 11.10.0 to 12.6.2 Bumps [better-sqlite3](https://github.com/WiseLibs/better-sqlite3) from 11.10.0 to 12.6.2. 
- [Release notes](https://github.com/WiseLibs/better-sqlite3/releases) - [Commits](https://github.com/WiseLibs/better-sqlite3/compare/v11.10.0...v12.6.2) --- updated-dependencies: - dependency-name: better-sqlite3 dependency-version: 12.6.2 dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * build(deps-dev): Bump eslint from 9.39.3 to 10.0.2 Bumps [eslint](https://github.com/eslint/eslint) from 9.39.3 to 10.0.2. - [Release notes](https://github.com/eslint/eslint/releases) - [Commits](https://github.com/eslint/eslint/compare/v9.39.3...v10.0.2) --- updated-dependencies: - dependency-name: eslint dependency-version: 10.0.2 dependency-type: direct:development update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Robert DeRienzo <rderienzo@voloridge.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat: add passthrough LLM mode for ask-question tool (#335) * feat: add passthrough LLM mode for ask-question tool Adds llm.provider = "passthrough" so the ask-question MCP tool returns retrieved context chunks directly to the calling LLM instead of requiring a separate OpenAI/Ollama provider. This is the natural design for MCP tools where the client already has an LLM (e.g. Claude Code). 
- config.ts: add "passthrough" to llm.provider union type and env var handling - rag.ts: add isPassthroughMode() helper and getContextForQuestion() which retrieves and formats context without an LLM call - mcp/server.ts: ask-question checks passthrough first and returns context directly; falls through to existing LLM path otherwise Enable via config: { "llm": { "provider": "passthrough" } } Enable via env var: LIBSCOPE_LLM_PROVIDER=passthrough Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: format config.ts and include passthrough in provider override - Reformat long if-condition to satisfy prettier (printWidth: 100) - Fix logic bug: passthrough provider was checked in outer condition but not spread into overrides.llm.provider Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: address 9 audit findings from issue #332 (#333) * fix: address 9 audit findings from issue #332 Security - middleware: use timingSafeEqual for API key comparison (#2) - url-fetcher, http-utils: replace process-wide NODE_TLS_REJECT_UNAUTHORIZED mutation with per-request undici Agent to eliminate TLS race condition (#1) Bugs - indexing: re-throw unexpected embedding errors so transaction rolls back instead of silently committing chunks with no vector (#3) - search: replace correlated minRating subquery with avg_r.avg_rating from the pre-joined aggregate in FTS and LIKE search paths (#4) Performance - bulk: replace O(n²) docs.find() loops with pre-built Map; replace per-document getDocumentTags() calls with a single getDocumentTagsBatch() query (#5) - config: add 30-second TTL cache to loadConfig() so disk reads are not repeated on every request (#6) Code quality - routes: check res.write() return value to handle SSE backpressure (#7) - reindex: delegate to schema.createVectorTable() instead of duplicating the vec0 DDL inline (#8) - obsidian: replace hand-rolled parseSimpleYaml() with js-yaml, 
normalise Date objects back to ISO-8601 strings (#9) Docs - agents.md: expand architecture tree to include src/api/ and src/connectors/; add Security Patterns section with correct undici examples - CONTRIBUTING.md: fix check-suite command (npm test → npm run test:coverage) and correct coverage threshold (80% → actual 75%/74%) Tests - bulk.test: add dateFrom/dateTo filter coverage - config.test: add cache-hit test; call invalidateConfigCache() before env-var tests so TTL cache doesn't return stale results Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: remove unused warnIfTlsBypassMissing function Dead code after conflict resolution chose the per-request undici Agent approach (which doesn't need a warning about NODE_TLS_REJECT_UNAUTHORIZED). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: update tests for config cache and retry semantics - Add invalidateConfigCache() before loadConfig() in 4 env-override tests that were failing because the 30s TTL cache introduced in the config module was returning stale results from the previous test's cache entry - Update http-utils retry assertion: maxRetries=2 means 1 initial + 2 retries = 3 total calls (loop is attempt <= maxRetries) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * feat: CLI logging improvements and pack installation performance (#330) (#336) - Add `src/cli/reporter.ts`: PrettyReporter (ANSI colors + \r progress bar), SilentReporter (no-op for verbose/JSON mode), `isVerbose()` and `createReporter()` factory. `LIBSCOPE_VERBOSE=1` env var alternative. - Update `setupLogging` to default to "silent" in CLI mode (pretty reporter handles user-facing output). Verbose/`--log-level` flags still route to structured JSON pino logs. Fix duplicate `initLogger` calls in onenote connect/disconnect commands to use `setupLogging` consistently. 
- Update `installPack` in `packs.ts` to support batch embedding and progress reporting: - New `InstallOptions` interface with `batchSize`, `resumeFrom`, `onProgress` fields - Batch documents: chunk all → single `provider.embedBatch` call per batch → single SQLite transaction per batch (avoids N embedding calls) - `resumeFrom` skips the first N documents (enables partial install resume after failure) - `InstallResult` now includes `errors` count - Add `--batch-size` and `--resume-from` CLI options to `pack install` - Tests: `tests/unit/reporter.test.ts` (17 tests covering PrettyReporter, SilentReporter, isVerbose, env var detection); extended `tests/unit/packs.test.ts` with 7 new tests for progress callbacks, batch efficiency, resumeFrom, embedBatch failure handling. Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * Claude/fix issue 331 s1qzu (#338) * build(deps): Bump the npm_and_yarn group across 1 directory with 2 updates (#337) Bumps the npm_and_yarn group with 2 updates in the / directory: [@hono/node-server](https://github.com/honojs/node-server) and [hono](https://github.com/honojs/hono). Updates `@hono/node-server` from 1.19.9 to 1.19.10 - [Release notes](https://github.com/honojs/node-server/releases) - [Commits](https://github.com/honojs/node-server/compare/v1.19.9...v1.19.10) Updates `hono` from 4.12.3 to 4.12.5 - [Release notes](https://github.com/honojs/hono/releases) - [Commits](https://github.com/honojs/hono/compare/v4.12.3...v4.12.5) --- updated-dependencies: - dependency-name: "@hono/node-server" dependency-version: 1.19.10 dependency-type: indirect dependency-group: npm_and_yarn - dependency-name: hono dependency-version: 4.12.5 dependency-type: indirect dependency-group: npm_and_yarn ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * fix: eliminate SSRF TOCTOU and ReDoS vulnerabilities (#341) * fix: eliminate SSRF TOCTOU and ReDoS vulnerabilities (closes #340) **SSRF (CWE-918 — CodeQL alert #28)** Replace the two-step validate-then-fetch approach in url-fetcher.ts with IP-pinned requests using node:http / node:https directly. validateUrl() resolves DNS and checks for private IPs, then the validated IP is passed straight to the TCP connection (hostname: pinnedIp, servername: original hostname for TLS SNI). There is now zero TOCTOU window between validation and the actual network request. The redundant post-fetch DNS rebinding check and the env-var-based TLS bypass wrapper are removed; rejectUnauthorized is now passed directly to the request options. An internal _setRequestImpl hook is exported for unit test injection so tests can stub responses without touching node:http / node:https. Tests are updated accordingly. **ReDoS (CWE-1333 — CodeQL alert #24)** Five regexes in confluence.ts used the pattern [^>]*ac:name="X"[^>]* — two [^>]* quantifiers around a fixed literal. For input that contains a large attribute blob without the target ac:name value, the engine must try all O(n²) splits before concluding no match (catastrophic backtracking). Fix: rewrite the leading [^>]* as (?:(?!ac:name="X")[^>])* — a negative lookahead prevents the quantifier from overlapping with the literal, making backtracking structurally impossible. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: replace remaining polynomial regexes in confluence.ts with indexOf helpers The conflict resolution left the original `/<ac:structured-macro [^>]*>([\s\S]*?) <\/ac:structured-macro>/gi` patterns in place for code blocks and info/tip panels (those were not part of the original security fix diff). 
These have the same O(n²) backtracking problem: with k opening tags and no closing tags, [\s\S]*? scans O(n - pos) chars per attempt, totalling O(n²). Replace the entire convertConfluenceStorage function with the indexOf-based approach (replaceStructuredMacros / replaceTagPairs / extractTagContent helpers) that eliminates all regex-based tag parsing. Also add removeSelfClosingMacros to handle the self-closing TOC case without regex, since the previous self-closing fix still used a [^>]*ac:name="toc"[^>]* pattern. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * feat: concurrent pack installation and -v verbose shorthand (issue #330) (#339) * feat: concurrent pack installation and -v verbose shorthand (issue #330) Add concurrent batch embedding to installPack for significant performance improvement on large packs, plus CLI ergonomics improvements. Key changes: - `InstallOptions.concurrency` (default: 4): controls how many embedBatch calls run simultaneously; embedding is I/O-bound so parallelism directly reduces wall-clock installation time - Refactor installPack to pre-chunk all documents upfront, then use a semaphore-based scheduler to run up to `concurrency` embedBatch calls concurrently while inserting completed batches in-order (SQLite requires serialised writes); progress callbacks fire after each batch as before - `pack install --concurrency <n>` CLI flag exposes the new option - `-v` shorthand for `--verbose` on the global program options - Fix transaction install-count tracking: count committed docs accurately without relying on subtract-on-failure arithmetic - Add 6 new tests covering concurrency=1 sequential, concurrency=4 parallel, multiple embedBatch calls per install, concurrency limit enforcement, incremental progress reporting, and partial-failure error counting https://claude.ai/code/session_019hzhbEgV1ysnGmFVXBWzkZ * fix: address all 4 Copilot review comments on PR #339 
- Validate batchSize, concurrency, resumeFrom at the start of installPack and throw ValidationError for invalid values (comments 3 & 4). Concurrency <= 0 would silently hang the semaphore indefinitely. - Add CLI lower-bound guard: --concurrency < 1 exits with a user-facing error before ever calling installPack (comment 3). - Lazy chunking: pre-chunking all documents upfront held chunks for the entire pack in memory simultaneously. Batches now store only the raw documents; resolveBatch() chunks on demand right before embedBatch is called, so only one batch's worth of chunks is in memory at a time (comment 2). - Wrap provider.embedBatch() in try/catch so synchronous throws are converted to rejected Promises rather than escaping scheduleNext() and leaving the outer Promise permanently pending (comment 1). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude <noreply@anthropic.com> * fix: address 7 pre-release bugs from audit (#342) (#344) - Guard JSON.parse in rowToWebhook with try/catch, default to [] - Guard JSON.parse in rowToSavedSearch with try/catch, default to null - Guard JSON.parse in loadDbConnectorConfig with try/catch, throw ConfigError - Push dateFrom/dateTo filters into SQL WHERE in listDocuments (before LIMIT) - Validate negative limit in resolveSelector, throw ValidationError - Replace manual substring extension parsing with path.extname() in packs.ts - Verified reporter.ts is already tracked on development (no action needed) - Added tests for all fixes (corrupted JSON, SQL-level date filtering, negative limit) Closes #342 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat: URL spidering — crawl linked pages with configurable depth and page limits (#315) (#343) * feat: URL spidering — crawl linked pages with configurable depth and page limits (#315) Adds opt-in spidering to URL indexing. A single seed URL can now crawl and index an entire documentation site or wiki section in one call. 
New files:
- src/core/link-extractor.ts: indexOf-based <a href> extraction, relative URL resolution, fragment stripping, dedup, scheme filtering. No regex.
- src/core/spider.ts: BFS crawl engine with sameDomain, pathPrefix, excludePatterns (glob), maxPages (hard cap 200), maxDepth (hard cap 5), 10-min total timeout, robots.txt (User-agent: * and libscope), and 1s inter-request delay. Yields SpiderResult per page; returns SpiderStats.
- tests/unit/link-extractor.test.ts: 25 tests covering relative resolution, dedup, fragment stripping, scheme filtering, attribute order, edge cases.
- tests/unit/spider.test.ts: 20 tests covering BFS order, depth/page limits, domain + path + pattern filtering, cycle detection, robots.txt, partial failure recovery, stats, and abortReason reporting.

Modified:
- src/core/url-fetcher.ts: adds fetchRaw() export returning raw body + contentType + finalUrl before HTML-to-markdown conversion, so the spider can extract links from HTML before conversion.
- src/api/routes.ts: POST /api/v1/documents/url now accepts spider, maxPages, maxDepth, sameDomain, pathPrefix, excludePatterns. Returns { documents, pagesIndexed, pagesCrawled, pagesSkipped, errors, abortReason }.
- src/mcp/server.ts: submit-document tool gains spider, maxPages, maxDepth, sameDomain, pathPrefix, excludePatterns parameters.

Safety: all fetched URLs pass through the existing SSRF validation in fetchRaw() (DNS resolution, private IP blocking, scheme allowlist). Hard limits (200 pages, depth 5, 10min) cannot be overridden by callers. robots.txt is fetched once per origin and Disallow rules are honoured. Individual page failures do not abort the crawl.
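The link-extractor steps listed above (relative resolution, fragment stripping, dedup, scheme filtering) can be sketched with the standard `URL` class. `normalizeLinks` is a hypothetical helper, not an export of link-extractor.ts, and it takes already-extracted href strings rather than parsing HTML.

```typescript
// Hypothetical helper: turns raw href values into absolute, deduplicated http(s) URLs.
function normalizeLinks(hrefs: string[], baseUrl: string): string[] {
  const seen = new Set<string>();
  const out: string[] = [];
  for (const href of hrefs) {
    let resolved: URL;
    try {
      resolved = new URL(href, baseUrl); // resolves relative hrefs against the page URL
    } catch {
      continue; // unparseable href, skip it
    }
    // Allowlist on the resolved protocol: anything that is not http/https is dropped.
    if (resolved.protocol !== "http:" && resolved.protocol !== "https:") continue;
    resolved.hash = ""; // strip the fragment so /page#a and /page#b collapse to one URL
    const key = resolved.href;
    if (!seen.has(key)) {
      seen.add(key);
      out.push(key);
    }
  }
  return out;
}

const links = normalizeLinks(
  [
    "/guide/intro#setup",
    "intro.html",
    "javascript:void(0)",
    "mailto:a@example.com",
    "https://example.com/guide/intro",
    "../api/",
  ],
  "https://example.com/docs/index.html",
);
// links: https://example.com/guide/intro, https://example.com/docs/intro.html, https://example.com/api/
```

Checking the protocol of the resolved URL, rather than matching prefixes on the raw href, means schemes such as `javascript:` and `mailto:` are rejected without having to enumerate them.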
Closes #315
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: resolve CI lint errors in spider implementation
  - Remove unnecessary type assertions (routes.ts, mcp/server.ts); TypeScript already narrows SpiderResult/SpiderStats from the generator
  - Add explicit return type annotation on mock fetchRaw to satisfy no-unsafe-return rule in spider.test.ts
  - Replace .resolves.not.toThrow() with a direct assertion; vitest .resolves requires a Promise, not an async function
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address CodeQL security findings in spider/link-extractor
  link-extractor.ts (CodeQL #30, incomplete URL scheme check): Replace the enumerated scheme blocklist (javascript:, vbscript:, etc.) with a strict http/https allowlist check on the resolved URL protocol. An allowlist is exhaustive by definition; a blocklist will always miss obscure schemes like vbscript:, blob:, or future additions.
  spider.ts (CodeQL #31, incomplete multi-character sanitization): Replace the regex-based tag stripper /<[^>]+>/g in extractTitle() with an indexOf-based stripTags() function. The regex stops at the first > which can be inside a quoted attribute value (e.g. <img alt="a>b">), potentially leaving partial tag content in the extracted title. The new implementation walks quoted attribute values explicitly so no tag content leaks through regardless of its internal structure.
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address all Copilot review comments on spider PR (#343)
  - link-extractor: add word-boundary check in extractHref to prevent matching data-href, aria-href (false positives on non-href attributes)
  - spider: rename pagesIndexed → pagesFetched throughout (SpiderStats interface already used pagesFetched; sync implementation + tests)
  - spider: per-origin robots.txt cache (Map<origin, Set>) fetched lazily as new origins are encountered during crawl (was seed-only before)
  - spider: normalize to raw.finalUrl after redirects; visited set, yielded URL, and link-extraction base all use the canonical URL
  - routes: validate maxPages/maxDepth are finite positive integers
  - routes: change conditional spread &&-patterns to ternaries
  - routes: remove inner try/catch for spider fetch errors; add FetchError to top-level handler (consistent with single-URL mode → 502)
  - mcp/server: replace conditional spreads with explicit if assignments
  - mcp/server: validate spider=true requires url (throws ValidationError)
  - openapi: document spider request fields in IndexFromUrlRequest schema, add SpiderResponse schema, update 201 response to oneOf
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* style: fix prettier formatting in spider files
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs: comprehensive documentation update for v1.3.0 (#347)
  - README: fix license (BUSL-1.1, not MIT), expand MCP tools table to all 26 tools, expand REST API table with all endpoints (webhooks, links, analytics, connectors status, suggest-tags, bulk ops), add webhooks section with HMAC signing example, add missing CLI commands (bulk ops, saved searches, document links, docs update)
  - getting-started: fix Node.js requirement (20, not 18), add sections for web dashboard, organize/annotate features, REST API
  - mcp-setup: expand available tools section to list all 26 tools grouped by category instead of just 4
  - mcp-tools reference: add 5 missing tools: update-document, suggest-tags, link-documents, get-document-links, delete-link
  - rest-api reference: add all missing endpoints, reorganize by category, add examples for update, bulk retag, webhooks, links, saved searches
  - configuration guide: document passthrough LLM provider
  - configuration reference: add passthrough LLM, llm.ollamaUrl key, expand config set examples to cover all settable keys
  - cli reference: expand config set supported keys list
  Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: allow release-please PRs to pass merge gate and trigger CI (#348)

* docs: comprehensive documentation update for v1.3.0
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: allow release-please PRs to pass merge gate and trigger CI
  Two issues prevented PR #238 from getting CI runs:
  1. merge-gate blocked release-please PRs: the gate only allowed 'development' as the source branch, but release-please uses 'release-please--branches--main--components--libscope'. Updated to allow any branch matching 'release-please--*'.
  2. CI never ran on the PR: GitHub does not trigger workflows when GITHUB_TOKEN creates a PR (intentional security restriction to prevent infinite loops). Fixed by passing a PAT via secrets.GH_TOKEN to the release-please action so its PR creation triggers CI.
  Note: requires a 'GH_TOKEN' secret in repo settings, a classic PAT with repo and workflow scopes.
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* ci: trigger checks

---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: merge development into main for v1.3.0 release (#354)

* chore: add development branch workflow (#327)

* chore: add development branch workflow
  - Add merge-gate.yml: enforces only 'development' can merge into main
  - Update CI/CodeQL/Docker workflows to run on both main and development
  - Update dependabot.yml: target-branch set to development for all ecosystems
  - Update copilot-instructions.md: document branch workflow convention
  - Rulesets configured: Main (requires merge-gate + squash-only), Development (requires CI status checks + PR)
  - Default branch set to development
  - All open PRs retargeted to development
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: skip Vercel preview deployments on non-main branches
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* chore: trigger check refresh

---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat: create-pack from local folder or URL sources (#329)

* fix: comprehensive audit fixes - security, performance, resilience, API hardening
  Addresses findings from issue #31…
Summary
Implements opt-in URL spidering for the POST /api/v1/documents/url endpoint and the submit-document MCP tool. A single seed URL can now crawl and index an entire documentation site or wiki section in one call.
Closes #315.
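To make the request shape concrete, here is a minimal client-side sketch of how a spider request body might be assembled. The field names (spider, maxPages, maxDepth, sameDomain, pathPrefix, excludePatterns) and the hard caps (200 pages, depth 5) come from this PR; the helper name buildSpiderRequest, the SpiderRequestBody type, and the choice to clamp on the client are assumptions made for illustration, not code from the repository.

```typescript
// Hypothetical request-body type; field names follow the PR description.
interface SpiderRequestBody {
  url: string;
  spider: true;
  maxPages?: number;
  maxDepth?: number;
  sameDomain?: boolean;
  pathPrefix?: string;
  excludePatterns?: string[];
}

// Hard caps stated in the PR; callers cannot raise them.
const HARD_MAX_PAGES = 200;
const HARD_MAX_DEPTH = 5;

// True only for finite positive integers (rejects NaN, Infinity, 1.5, 0, -1).
function isPositiveInt(n: number): boolean {
  return Number.isInteger(n) && n > 0;
}

// Hypothetical helper: validates the numeric options and clamps them to the
// hard caps before the body is sent. Clamping here is an assumption; the
// server is described as enforcing the limits regardless.
function buildSpiderRequest(
  url: string,
  opts: Omit<SpiderRequestBody, "url" | "spider"> = {},
): SpiderRequestBody {
  const body: SpiderRequestBody = { url, spider: true };
  if (opts.maxPages !== undefined) {
    if (!isPositiveInt(opts.maxPages)) {
      throw new RangeError("maxPages must be a finite positive integer");
    }
    body.maxPages = Math.min(opts.maxPages, HARD_MAX_PAGES);
  }
  if (opts.maxDepth !== undefined) {
    if (!isPositiveInt(opts.maxDepth)) {
      throw new RangeError("maxDepth must be a finite positive integer");
    }
    body.maxDepth = Math.min(opts.maxDepth, HARD_MAX_DEPTH);
  }
  if (opts.sameDomain !== undefined) body.sameDomain = opts.sameDomain;
  if (opts.pathPrefix !== undefined) body.pathPrefix = opts.pathPrefix;
  if (opts.excludePatterns !== undefined) {
    body.excludePatterns = opts.excludePatterns;
  }
  return body;
}
```

The returned object would be serialized as JSON and sent as the body of POST /api/v1/documents/url; the host, port, and any authentication headers depend on the deployment and are not shown here.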
Changes
New files
- src/core/link-extractor.ts: indexOf-based <a href> extraction (no regex, ReDoS-safe). Resolves relative URLs, strips fragments, deduplicates, filters non-http/https schemes.
- src/core/spider.ts: BFS crawl engine as an async generator. Yields one SpiderResult per page and returns SpiderStats when done.
- tests/unit/link-extractor.test.ts: 25 tests (relative resolution, dedup, fragment stripping, scheme filtering, edge cases)
- tests/unit/spider.test.ts: 20 tests (BFS order, depth/page limits, domain + path + pattern filtering, cycle detection, robots.txt, partial failure recovery, abortReason)
Modified files
- src/core/url-fetcher.ts: adds fetchRaw() export returning the raw body + contentType + finalUrl before HTML conversion, used by the spider for link extraction
- src/api/routes.ts: POST /api/v1/documents/url accepts spider options; returns { documents, pagesIndexed, pagesCrawled, pagesSkipped, errors, abortReason }
- src/mcp/server.ts: submit-document tool gains spider, maxPages, maxDepth, sameDomain, pathPrefix, excludePatterns parameters
Parameters
- spider (default: false)
- maxPages (hard cap: 200)
- maxDepth (hard cap: 5)
- sameDomain (default: true)
- pathPrefix
- excludePatterns (default: [])
Safety
- All fetched URLs pass through the existing SSRF validation in fetchRaw() (DNS resolution, private IP blocking, scheme allowlist)
- robots.txt: User-agent: * and User-agent: libscope rules honoured
Test plan
- npm run typecheck: zero errors
- npm test: 1059 tests passing (52 files)
🤖 Generated with Claude Code
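For readers who want a feel for the indexOf-based link extraction described under "New files", the following is a rough sketch of the technique: scan for anchor tags without regex, apply a word-boundary check so data-href and aria-href are not mistaken for href, resolve against a base URL, strip fragments, deduplicate, and keep only http/https. It is not the code in src/core/link-extractor.ts; the names extractLinks, readHref, asciiLower and isSpace are hypothetical, and the sketch knowingly ignores edge cases such as a > or the text "href" appearing inside another attribute's quoted value.

```typescript
// ASCII-only lowercasing keeps string length stable, so indices found in the
// lowered copy remain valid in the original HTML.
function asciiLower(s: string): string {
  let out = "";
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    out += c >= 65 && c <= 90 ? String.fromCharCode(c + 32) : s[i];
  }
  return out;
}

function isSpace(c: string | undefined): boolean {
  return c === " " || c === "\n" || c === "\t" || c === "\r" || c === "\f";
}

// Finds the value of a standalone href attribute between `from` and `to`.
function readHref(html: string, lower: string, from: number, to: number): string | null {
  let i = from;
  for (;;) {
    const at = lower.indexOf("href", i);
    if (at === -1 || at >= to) return null;
    i = at + 4;
    // Word-boundary check: "href" must follow whitespace, which rejects
    // data-href and aria-href.
    if (!isSpace(lower[at - 1])) continue;
    let j = at + 4;
    while (j < to && isSpace(lower[j])) j++;
    if (lower[j] !== "=") continue;
    j++;
    while (j < to && isSpace(lower[j])) j++;
    const quote = html[j];
    if (quote === '"' || quote === "'") {
      const close = html.indexOf(quote, j + 1);
      return close === -1 || close > to ? null : html.slice(j + 1, close);
    }
    // Unquoted value: read up to the next whitespace or the end of the tag.
    let k = j;
    while (k < to && !isSpace(html[k])) k++;
    return html.slice(j, k);
  }
}

function extractLinks(html: string, baseUrl: string): string[] {
  const lower = asciiLower(html);
  const seen = new Set<string>();
  let pos = 0;
  for (;;) {
    const tagStart = lower.indexOf("<a", pos);
    if (tagStart === -1) break;
    pos = tagStart + 2;
    if (!isSpace(lower[pos])) continue; // skips <abbr>, <article>, bare <a>
    const tagEnd = lower.indexOf(">", pos);
    if (tagEnd === -1) break;
    const href = readHref(html, lower, pos, tagEnd);
    pos = tagEnd + 1;
    if (href === null) continue;
    let url: URL;
    try {
      url = new URL(href.trim(), baseUrl); // resolves relative references
    } catch {
      continue; // unparseable href
    }
    // Allowlist rather than blocklist: only http and https survive.
    if (url.protocol !== "http:" && url.protocol !== "https:") continue;
    url.hash = ""; // strip the fragment so /a and /a#top deduplicate
    seen.add(url.href);
  }
  return Array.from(seen);
}
```

Checking the protocol of the resolved URL, rather than pattern-matching the raw attribute text, follows the allowlist reasoning given in the CodeQL fix commit: any scheme other than http or https is dropped without having to enumerate it.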