feat: auto-suggest tags based on content analysis (#243)#273
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub. |
There was a problem hiding this comment.
Pull request overview
Adds content-based tag suggestions to LibScope, exposing the functionality through core APIs and multiple user-facing interfaces (CLI, REST, MCP) to improve document discoverability.
Changes:
- Implemented
suggestTags()insrc/core/tags.tsusing tokenization, stopword filtering, TF scoring, and boosting for existing system tags. - Exposed tag suggestion via MCP tool (
suggest-tags), CLI command (libscope tag suggest), and REST endpoint (GET /api/v1/documents/:id/suggest-tags). - Added unit tests covering
suggestTags()behavior and error handling.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
src/core/tags.ts |
Adds stopword list, tokenizer, and suggestTags() implementation. |
src/core/index.ts |
Re-exports suggestTags from the core entrypoint. |
src/mcp/server.ts |
Adds MCP tool for suggesting tags for a document. |
src/cli/index.ts |
Adds libscope tag suggest command to print suggested tags. |
src/api/routes.ts |
Adds REST route matcher + handler for suggest-tags endpoint. |
tests/unit/tags.test.ts |
Adds unit tests for suggestTags() including boosting, exclusion, limits, and missing-doc error. |
| const suggestDocId = matchDocumentSuggestTags(segments); | ||
| if (suggestDocId && method === "GET") { | ||
| const limitRaw = url.searchParams.get("limit"); | ||
| const limit = limitRaw ? parseInt(limitRaw, 10) : 5; |
There was a problem hiding this comment.
limit is parsed with parseInt but not validated for NaN/non-numeric values (unlike other endpoints in this file that convert NaN to undefined). As written, ?limit=foo results in limit = NaN and an empty suggestions list rather than a 400. Validate the query param (finite integer >= 1, and optionally cap it) and return a VALIDATION_ERROR on invalid input.
| const limit = limitRaw ? parseInt(limitRaw, 10) : 5; | |
| let limit = 5; | |
| if (limitRaw !== null) { | |
| const parsedLimit = Number.parseInt(limitRaw, 10); | |
| if (!Number.isFinite(parsedLimit) || !Number.isInteger(parsedLimit) || parsedLimit < 1) { | |
| sendError(res, 400, "VALIDATION_ERROR", "Query parameter 'limit' must be a positive integer"); | |
| return; | |
| } | |
| const MAX_LIMIT = 100; | |
| limit = Math.min(parsedLimit, MAX_LIMIT); | |
| } |
| // Suggest tags for a document | ||
| const suggestDocId = matchDocumentSuggestTags(segments); | ||
| if (suggestDocId && method === "GET") { | ||
| const limitRaw = url.searchParams.get("limit"); | ||
| const limit = limitRaw ? parseInt(limitRaw, 10) : 5; | ||
| const suggestions = suggestTags(db, suggestDocId, limit); | ||
| const took = Math.round(performance.now() - start); | ||
| sendJson(res, 200, { documentId: suggestDocId, suggestions }, took); | ||
| return; |
There was a problem hiding this comment.
This adds a new REST endpoint (GET /api/v1/documents/:id/suggest-tags) but there are existing API route tests in tests/unit/api.test.ts for other /api/v1/documents/* endpoints. Add unit coverage for the new route (success case, invalid limit, and 404 for missing document) to prevent regressions.
| const limit = maxSuggestions ?? 5; | ||
|
|
There was a problem hiding this comment.
maxSuggestions is used directly to compute limit, but callers can pass NaN, 0, negative numbers, or very large values (e.g., REST/CLI parseInt failures). This can lead to surprising behavior (e.g., slice(0, NaN) returns an empty list). Validate maxSuggestions inside suggestTags (finite integer, >= 1, and ideally a reasonable max) and throw ValidationError when invalid.
| const limit = maxSuggestions ?? 5; | |
| const defaultLimit = 5; | |
| const maxAllowedLimit = 100; | |
| let limit: number; | |
| if (maxSuggestions === undefined) { | |
| limit = defaultLimit; | |
| } else { | |
| if ( | |
| !Number.isFinite(maxSuggestions) || | |
| !Number.isInteger(maxSuggestions) || | |
| maxSuggestions < 1 || | |
| maxSuggestions > maxAllowedLimit | |
| ) { | |
| throw new ValidationError( | |
| `maxSuggestions must be an integer between 1 and ${maxAllowedLimit}, got ${String( | |
| maxSuggestions, | |
| )}`, | |
| ); | |
| } | |
| limit = maxSuggestions; | |
| } |
| // Get all known tags in the system for boosting | ||
| const knownTags = new Set( | ||
| (db.prepare("SELECT name FROM tags").all() as Array<{ name: string }>).map((r) => r.name), | ||
| ); |
There was a problem hiding this comment.
knownTags is built by selecting all tag names on every suggestTags call. If the tag table grows, this becomes an O(totalTags) read per request/tool invocation. Consider querying only the subset relevant to the extracted terms (e.g., SELECT name FROM tags WHERE name IN (...)) or otherwise caching known tag names for the lifetime of the process.
| // Get all known tags in the system for boosting | |
| const knownTags = new Set( | |
| (db.prepare("SELECT name FROM tags").all() as Array<{ name: string }>).map((r) => r.name), | |
| ); | |
| // Get known tags among the candidate terms for boosting | |
| const candidateTerms = [...tf.keys()]; | |
| const knownTags = | |
| candidateTerms.length === 0 | |
| ? new Set<string>() | |
| : new Set( | |
| ( | |
| db | |
| .prepare( | |
| `SELECT name FROM tags WHERE name IN (${candidateTerms | |
| .map(() => "?") | |
| .join(", ")})`, | |
| ) | |
| .all(...candidateTerms) as Array<{ name: string }> | |
| ).map((r) => r.name), | |
| ); |
| "its", | ||
| "our", | ||
| "their", | ||
| "your", | ||
| "his", | ||
| "her", | ||
| "our", | ||
| "out", | ||
| "then", | ||
| "there", | ||
| "these", | ||
| "those", | ||
| "them", | ||
| "they", | ||
| "you", | ||
| "your", | ||
| "only", |
There was a problem hiding this comment.
STOPWORDS includes duplicates (e.g., "our" and "your" appear twice). This is harmless at runtime because it's a Set, but it adds noise and makes future maintenance harder. Remove the duplicates to keep the list clean.
| .action((documentId: string, opts: { limit: string }) => { | ||
| const { db } = initializeApp(); | ||
| try { | ||
| const limit = parseInt(opts.limit, 10); |
There was a problem hiding this comment.
This command parses --limit with parseInt directly, but the CLI already has parseIntOption() which validates numeric options and exits with a clear error. Using raw parseInt here allows NaN (or negative/zero) to flow into suggestTags, producing confusing results. Use parseIntOption(opts.limit, "--limit") (and enforce >= 1) for consistent CLI behavior.
| const limit = parseInt(opts.limit, 10); | |
| const limit = parseIntOption(opts.limit, "--limit"); | |
| if (limit < 1) { | |
| console.error("--limit must be at least 1"); | |
| process.exitCode = 1; | |
| return; | |
| } |
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
f93c920 to
43146e4
Compare
Closes #243
Adds tag suggestion via TF-IDF-like keyword extraction:
suggestTags()in tags.ts — tokenize, stopword filter, TF scoring, known-tag boostingsuggest-tagstoollibscope tag suggest <docId> [--limit <n>]GET /api/v1/documents/:id/suggest-tags?limit=5