feat: support additional document formats (PDF, Word, CSV, YAML, JSON) (#249)#275
Add a parser abstraction layer that normalizes supported formats into markdown before entering the existing chunking pipeline.

New parsers:
- PDF (.pdf) via pdf-parse (optional dependency)
- Word (.docx) via mammoth (optional dependency)
- CSV (.csv) via csv-parse: converts to markdown table
- YAML (.yaml, .yml): validates and wraps in fenced block
- JSON (.json): validates and wraps in fenced block
- Markdown (.md, .markdown, .mdx): pass-through
- Plain text (.txt): pass-through

New `indexFile()` function auto-detects format from extension and pipes through the correct parser. CLI `add` command now supports all formats with a `--format` override option.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
Adds a document parser abstraction so LibScope can ingest non-Markdown document formats by normalizing them to text/markdown before passing into the existing chunking/indexing pipeline.
Changes:
- Introduces `DocumentParser` interface plus parser implementations for Markdown/text/JSON/YAML/CSV/PDF/Word.
- Adds `indexFile()` to parse local files based on extension (with a CLI `--format` override) and then index via `indexDocument()`.
- Updates CLI `add` and `import` flows to use `indexFile()` for local files; adds unit tests for parser selection and core parsers.
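The extension-based selection described above could look roughly like the sketch below. Only the name `DocumentParser` comes from the PR; the interface members, the registry, and the helpers `extensionOf` and `getParserForFile` are assumptions made for illustration, and only the two pass-through parsers are registered.

```typescript
// Assumed shape of the abstraction; the PR's real interface may differ.
interface DocumentParser {
  /** Lowercase extensions, each with a leading dot. */
  readonly extensions: readonly string[];
  /** Normalizes raw file bytes into markdown or plain text. */
  parse(content: Uint8Array): Promise<string>;
}

const decoder = new TextDecoder("utf-8");

// Pass-through parsers only decode the bytes.
const markdownParser: DocumentParser = {
  extensions: [".md", ".markdown", ".mdx"],
  parse: async (content) => decoder.decode(content),
};

const textParser: DocumentParser = {
  extensions: [".txt"],
  parse: async (content) => decoder.decode(content),
};

// Hypothetical registry keyed by extension.
const registry = new Map<string, DocumentParser>();
for (const parser of [markdownParser, textParser]) {
  for (const ext of parser.extensions) registry.set(ext, parser);
}

/** Returns the lowercased extension of the last path segment, or "". */
function extensionOf(fileName: string): string {
  const lastSeparator = Math.max(fileName.lastIndexOf("/"), fileName.lastIndexOf("\\"));
  const base = fileName.slice(lastSeparator + 1);
  const dot = base.lastIndexOf(".");
  // A leading dot (".gitignore") marks a hidden file, not an extension.
  return dot > 0 ? base.slice(dot).toLowerCase() : "";
}

/** Looks up a parser by file extension; undefined means unsupported. */
function getParserForFile(fileName: string): DocumentParser | undefined {
  return registry.get(extensionOf(fileName));
}
```

Keeping the lookup separate from the parsers means a format override only has to supply a different name to `getParserForFile`, without touching any parser.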
Reviewed changes
Copilot reviewed 13 out of 14 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `src/core/parsers/index.ts` | Defines `DocumentParser`, registers parsers, and adds extension-based parser selection utilities. |
| `src/core/parsers/markdown.ts` | Pass-through Markdown parser. |
| `src/core/parsers/text.ts` | Pass-through plain text parser. |
| `src/core/parsers/json-parser.ts` | JSON validation + formatting into fenced code block. |
| `src/core/parsers/yaml.ts` | YAML validation + wrapping into fenced code block. |
| `src/core/parsers/csv.ts` | Converts CSV into a Markdown table. |
| `src/core/parsers/pdf.ts` | Adds PDF parsing via dynamic import of `pdf-parse`. |
| `src/core/parsers/word.ts` | Adds Word parsing via dynamic import of `mammoth`. |
| `src/core/indexing.ts` | Adds `indexFile()` and options to parse+index files. |
| `src/core/index.ts` | Re-exports parser utilities and new indexing API. |
| `src/cli/index.ts` | Uses `indexFile()` for local file ingestion; adds `--format` override and updates import defaults. |
| `tests/unit/parsers.test.ts` | Adds unit tests for parser selection and parser behavior (JSON/YAML/CSV + error cases). |
| `package.json` / `package-lock.json` | Adds dependencies for new formats (csv/yaml/pdf/word). |
package.json
Outdated
| "mammoth": "^1.11.0", | ||
| "node-html-markdown": "^2.0.0", | ||
| "openai": "^6.25.0", | ||
| "pdf-parse": "^2.4.5", |
PR description says the PDF/Word parsers’ packages are optional and only loaded when needed, but mammoth and pdf-parse are added as required dependencies here. If the intent is optional installation (so environments without native deps can still install LibScope), consider moving these to optionalDependencies and keeping the current runtime error messaging.
| let PDFParse: typeof import("pdf-parse").PDFParse; | ||
| try { | ||
| const mod = await import("pdf-parse"); | ||
| PDFParse = mod.PDFParse; | ||
| } catch (err) { | ||
| throw new ValidationError( | ||
| 'PDF parsing requires the "pdf-parse" package. Install it with: npm install pdf-parse', | ||
| err, | ||
| ); | ||
| } | ||
|
|
||
| try { | ||
| const parser = new PDFParse({ data: new Uint8Array(content) }); | ||
| const result = await parser.getText(); | ||
| return result.text; |
pdf-parse doesn’t expose a PDFParse class in its public API/types (the @types/pdf-parse package typically models it as a callable default export). As written, mod.PDFParse will be undefined/typecheck will fail, and parsing will throw at runtime. Update the dynamic import to use the package’s exported parse function (and adjust the typing accordingly).
| const lines: string[] = []; | ||
| lines.push("| " + header.join(" | ") + " |"); | ||
| lines.push("| " + header.map(() => "---").join(" | ") + " |"); | ||
| for (const row of rows) { | ||
| lines.push("| " + row.join(" | ") + " |"); | ||
| } |
CSV cells are inserted into a Markdown table without escaping. Values containing | or newlines will break the table rendering and can also create rows with a different number of columns than the header (since relax_column_count allows variable widths). Consider normalizing each row to the header length (pad/truncate) and escaping cell text (e.g., replace | and newlines) before joining.
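A minimal sketch of the suggested fix, assuming the header and rows have already been parsed into string arrays. The helper names `escapeCell`, `normalizeRow`, and `toMarkdownTable` are illustrative and not taken from the PR.

```typescript
/** Escapes text so it is safe inside a markdown table cell. */
function escapeCell(value: string): string {
  return value
    .replace(/\\/g, "\\\\") // backslashes first, so the pipe escape below stays unambiguous
    .replace(/\|/g, "\\|") // a raw pipe would start a new column
    .replace(/\r?\n|\r/g, " "); // a raw line break would end the table row
}

/** Pads or truncates a row so it has exactly `width` cells. */
function normalizeRow(row: readonly string[], width: number): string[] {
  const cells = row.slice(0, width);
  while (cells.length < width) cells.push("");
  return cells;
}

/** Renders a header and data rows as a markdown table. */
function toMarkdownTable(header: readonly string[], rows: readonly string[][]): string {
  const formatRow = (cells: readonly string[]): string =>
    "| " + cells.map(escapeCell).join(" | ") + " |";
  const lines = [
    formatRow(header),
    "| " + header.map(() => "---").join(" | ") + " |",
    ...rows.map((row) => formatRow(normalizeRow(row, header.length))),
  ];
  return lines.join("\n");
}
```

Truncating long rows silently drops data; a stricter variant might instead reject rows wider than the header.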
src/core/indexing.ts
Outdated
| options: IndexFileOptions = {}, | ||
| ): Promise<IndexedDocument> { | ||
| const log = getLogger(); | ||
| const effectiveName = options.format ? `file${options.format}` : filePath; |
The --format override is concatenated as file${options.format}. If callers pass pdf (without a leading dot) or mixed-case extensions, format detection will fail silently and look unsupported. It would be more robust to normalize options.format (trim, lowercase, ensure it starts with .) before building effectiveName.
| const effectiveName = options.format ? `file${options.format}` : filePath; | |
| const rawFormat = options.format?.trim(); | |
| const normalizedFormat = | |
| rawFormat && rawFormat.length > 0 | |
| ? (rawFormat.startsWith(".") ? rawFormat : `.${rawFormat}`).toLowerCase() | |
| : undefined; | |
| const effectiveName = normalizedFormat ? `file${normalizedFormat}` : filePath; |
| throw new ValidationError( | ||
| `Unsupported file format: "${filePath}". Supported extensions: .md, .markdown, .mdx, .json, .yaml, .yml, .csv, .pdf, .docx`, | ||
| ); |
The unsupported-format error message hardcodes the supported extension list, which can drift from the actual parser registry. Consider generating this from getSupportedExtensions() (and/or including the --format hint) so the message stays accurate as formats are added/removed.
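One way to keep the message in sync with the registry is sketched below. The PR names `getSupportedExtensions()`, but its signature is not shown, so the version here (returning a string array from a hypothetical `Map`) and the helper `unsupportedFormatMessage` are assumptions.

```typescript
// Hypothetical registry: extension -> parser name.
const parserRegistry = new Map<string, string>([
  [".md", "markdown"],
  [".txt", "text"],
  [".csv", "csv"],
]);

/** Assumed signature: lists every registered extension in insertion order. */
function getSupportedExtensions(): string[] {
  return [...parserRegistry.keys()];
}

/** Builds the error text from the registry instead of a hardcoded list. */
function unsupportedFormatMessage(filePath: string): string {
  const supported = getSupportedExtensions().join(", ");
  return (
    `Unsupported file format: "${filePath}". ` +
    `Supported extensions: ${supported}. ` +
    "Use --format to override detection."
  );
}
```

Registering a new parser then updates the message automatically, with no second list to maintain.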
package.json
Outdated
| "pdf-parse": "^2.4.5", | ||
| "pino": "^10.3.1", | ||
| "sqlite-vec": "^0.1.0", | ||
| "zod": "^4.3.6" | ||
| }, |
pdf-parse v2.x declares a stricter Node engine range (see its entry in the lockfile: >=20.16.0 <21 || >=22.3.0). Adding it as a regular dependency can break installs for users on earlier Node 20.x versions even though this package’s engines.node is currently >=20. Consider either tightening engines.node to match the dependency, or making the PDF parser truly optional (e.g., move pdf-parse to optionalDependencies and keep the runtime "install it" error path).
| "pdf-parse": "^2.4.5", | |
| "pino": "^10.3.1", | |
| "sqlite-vec": "^0.1.0", | |
| "zod": "^4.3.6" | |
| }, | |
| "pino": "^10.3.1", | |
| "sqlite-vec": "^0.1.0", | |
| "zod": "^4.3.6" | |
| }, | |
| "optionalDependencies": { | |
| "pdf-parse": "^2.4.5" | |
| }, |
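To make the engine range concrete, here is a small sketch that checks a Node version string against `>=20.16.0 <21 || >=22.3.0`. The function names are hypothetical, and prerelease tags are ignored for simplicity; a real check would more likely use the `semver` package.

```typescript
/** Parses "v20.16.0" or "20.16.0" into [major, minor, patch]. */
function parseVersion(version: string): [number, number, number] {
  const match = /^v?(\d+)\.(\d+)\.(\d+)/.exec(version);
  if (!match) throw new Error(`Unrecognized version: ${version}`);
  return [Number(match[1]), Number(match[2]), Number(match[3])];
}

/** True when the version satisfies ">=20.16.0 <21 || >=22.3.0". */
function satisfiesPdfParseEngine(nodeVersion: string): boolean {
  const [major, minor] = parseVersion(nodeVersion);
  if (major === 20) return minor >= 16; // >=20.16.0 <21
  if (major === 22) return minor >= 3; // >=22.3.0 within the 22.x line
  return major > 22; // 21.x and anything below 20 are excluded
}
```

In a Node process the input would typically be `process.versions.node`. Note that a user on 20.11, for example, satisfies `engines.node >= 20` but not the dependency's range.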
- Move mammoth and pdf-parse to optionalDependencies
- Escape pipe characters and newlines in CSV cell values
- Normalize --format option (trim, lowercase, ensure leading dot)
- Generate unsupported format error dynamically from getSupportedExtensions()
- Handle mismatched CSV column counts by normalizing rows

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add connectors-config tests (file/DB config, sync tracker, deleteConnectorDocuments)
- Add api-server tests (startApiServer with various options)
- Add db-validation tests (validateRow, validateCountRow edge cases)
- Add dedup tests for semantic/both strategies (exercises catch blocks)
- Add schema tests for createVectorTable validation
- Add export tests for import validation branches
- Add config tests for env var overrides (ollama URL, private URLs, LLM)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add PdfParser and WordParser tests (invalid content error paths)
- Add CsvParser tests (pipe escaping, newline handling, row normalization)
- Add Saved Searches API tests (create, list, delete, run)
- Add Bulk Operations API tests (delete, retag, move + validation)
- Add Links API test (create + delete)
- Branch coverage now passes 74% threshold

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The envOverrides.indexing spread was missing from the config merge, so LIBSCOPE_ALLOW_PRIVATE_URLS and LIBSCOPE_ALLOW_SELF_SIGNED_CERTS environment variables were silently ignored. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
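The class of bug described here can be sketched as follows. Only the two environment variable names and `envOverrides.indexing` come from the commit message; the config shape, the field names, and the `"true"` string convention are assumptions.

```typescript
// Hypothetical config shape for illustration.
interface IndexingConfig {
  allowPrivateUrls: boolean;
  allowSelfSignedCerts: boolean;
}
interface Config {
  indexing: IndexingConfig;
  logging: { level: string };
}
type PartialConfig = { [K in keyof Config]?: Partial<Config[K]> };

/** Reads overrides from the environment (assumes "true" enables a flag). */
function envOverridesFrom(env: Record<string, string | undefined>): PartialConfig {
  const indexing: Partial<IndexingConfig> = {};
  const privateUrls = env["LIBSCOPE_ALLOW_PRIVATE_URLS"];
  const selfSigned = env["LIBSCOPE_ALLOW_SELF_SIGNED_CERTS"];
  if (privateUrls !== undefined) indexing.allowPrivateUrls = privateUrls === "true";
  if (selfSigned !== undefined) indexing.allowSelfSignedCerts = selfSigned === "true";
  return { indexing };
}

/** Later sources win: defaults < file config < environment. */
function mergeConfig(defaults: Config, file: PartialConfig, env: PartialConfig): Config {
  return {
    // Omitting `...env.indexing` here would silently drop the env overrides.
    indexing: { ...defaults.indexing, ...file.indexing, ...env.indexing },
    logging: { ...defaults.logging, ...file.logging, ...env.logging },
  };
}
```

Because each section is merged by hand, forgetting one spread produces no type error, which is why a test per overridable section is worth having.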
Fixes CodeQL alert: incomplete string escaping. Backslashes must be escaped first to prevent `\|` in input becoming `\\|` (misinterpreted as an escaped backslash followed by a literal pipe). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
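The difference between the two escaping orders can be shown on a single cell. Both function names below are hypothetical; the second is a sketch of the fixed behaviour, not the PR's exact code.

```typescript
/** Incomplete: escapes pipes but leaves existing backslashes alone. */
function escapePipesOnly(cell: string): string {
  return cell.replace(/\|/g, "\\|");
}

/** Complete: doubles backslashes first, then escapes pipes. */
function escapeBackslashesThenPipes(cell: string): string {
  return cell.replace(/\\/g, "\\\\").replace(/\|/g, "\\|");
}

// The cell text is the four characters: a \ | b
const cell = "a\\|b";

// Yields a\\|b: markdown reads "\\" as a literal backslash,
// so the pipe that follows is unescaped and splits the cell.
const incomplete = escapePipesOnly(cell);

// Yields a\\\|b: "\\" is a literal backslash and "\|" is a literal pipe.
const complete = escapeBackslashesThenPipes(cell);
```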
Adds Document Format Support section to README with a table showing supported formats, optional dependencies, and Node.js requirements. pdf-parse v2 requires Node >= 20.16 or >= 22.3. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- package.json: keep both feature deps (csv-parse, js-yaml, pdf-parse) and main deps (node-cron)
- schema.test.ts: add missing DatabaseError import from main
- db-validation.test.ts: take main's more comprehensive test suite
- api-server.test.ts: use await on result.close() from main

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Adds a parser abstraction layer so LibScope can ingest documents beyond markdown. New files are normalized to text/markdown before entering the existing chunking pipeline.
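As one illustration of that normalization step, the sketch below validates JSON and wraps it in a fenced block, with an async wrapper so it matches parsers that return `Promise<string>`. The function names are hypothetical, and the local `ValidationError` class is a stand-in for the project's own error type.

```typescript
// Built at runtime so this example does not contain a literal fence.
const FENCE = "`".repeat(3);

/** Stand-in for the project's ValidationError. */
class ValidationError extends Error {}

/** Validates JSON text and returns it pretty-printed inside a json fence. */
function jsonToMarkdown(raw: string): string {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    throw new ValidationError(`Invalid JSON: ${reason}`);
  }
  const pretty = JSON.stringify(value, null, 2);
  return `${FENCE}json\n${pretty}\n${FENCE}`;
}

/** Async wrapper so sync and async parsers share one return type. */
async function parseJsonDocument(raw: string): Promise<string> {
  return jsonToMarkdown(raw);
}
```

Wrapping a synchronous parser in an `async` function costs little and lets callers `await` every parser the same way.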
Supported Formats
| Extensions | Dependency |
|---|---|
| .md, .markdown, .mdx | |
| .txt | |
| .json | |
| .yaml, .yml | js-yaml |
| .csv | csv-parse |
| .pdf | pdf-parse (optional) |
| .docx | mammoth (optional) |

Changes
- `src/core/parsers/`: New parser module with `DocumentParser` interface and 7 parser implementations
- `src/core/indexing.ts`: New `indexFile()` function that auto-detects format and delegates to the correct parser
- `src/cli/index.ts`: Updated `add` command to use `indexFile()` for local files, added `--format` override option
- `src/core/index.ts`: Re-exports for new parser utilities
- `tests/unit/parsers.test.ts`: 23 tests covering all parsers including error cases

Design
- PDF and Word parsers are loaded via dynamic `import()`; packages are optional and only loaded when needed
- Parsers return `Promise<string>` to support async parsers (PDF/Word) alongside sync ones
- Format is detected from the file extension, with a `--format` CLI override

Closes #249