feat: support additional document formats (PDF, Word, CSV, YAML, JSON) (#249) #275

Merged
RobertLD merged 8 commits into main from feat/document-formats
Mar 3, 2026

Conversation

RobertLD (Owner) commented Mar 2, 2026

Summary

Adds a parser abstraction layer so LibScope can ingest documents beyond markdown. New files are normalized to text/markdown before entering the existing chunking pipeline.

Supported Formats

| Format | Extensions | Parser | Dependency |
| --- | --- | --- | --- |
| Markdown | .md, .markdown, .mdx | Pass-through | Built-in |
| Plain text | .txt | Pass-through | Built-in |
| JSON | .json | Validates & wraps in fenced block | Built-in |
| YAML | .yaml, .yml | Validates & wraps in fenced block | js-yaml |
| CSV | .csv | Converts to markdown table | csv-parse |
| PDF | .pdf | Text extraction | pdf-parse (optional) |
| Word | .docx | Text extraction | mammoth (optional) |
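
As an illustration of the "validates & wraps in fenced block" row, here is a minimal sketch of what a JSON normalizer could look like. The function name `jsonToMarkdown`, its synchronous signature, and the two-space re-indentation are assumptions made for this example, not the PR's actual code.

```typescript
// Hypothetical helper: the name and signature are assumptions, not LibScope's API.
function jsonToMarkdown(raw: string): string {
  let value: unknown;
  try {
    // Validation step: JSON.parse throws a SyntaxError on malformed input.
    value = JSON.parse(raw);
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    throw new Error(`Invalid JSON: ${reason}`);
  }
  // Re-serialize with two-space indentation so the chunked text stays readable.
  const pretty = JSON.stringify(value, null, 2);
  // Build the fence programmatically to avoid nesting literal backtick fences.
  const fence = "`".repeat(3);
  return `${fence}json\n${pretty}\n${fence}`;
}
```

A YAML variant would presumably follow the same shape, with js-yaml doing the validation instead of `JSON.parse`.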

Changes

  • src/core/parsers/ — New parser module with DocumentParser interface and 7 parser implementations
  • src/core/indexing.ts — New indexFile() function that auto-detects format and delegates to the correct parser
  • src/cli/index.ts — Updated add command to use indexFile() for local files, added --format override option
  • src/core/index.ts — Re-exports for new parser utilities
  • tests/unit/parsers.test.ts — 23 tests covering all parsers including error cases

Design

  • PDF and Word parsers use dynamic import() — packages are optional and only loaded when needed
  • Parser interface returns Promise<string> to support async parsers (PDF/Word) alongside sync ones
  • Format detection is by file extension with a --format CLI override
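
The design points above can be sketched roughly as follows. Everything here is illustrative: the `DocumentParser` shape beyond "returns `Promise<string>`", the `registry` map, and the helpers `extensionOf` and `selectParser` are assumed names, and the input is simplified to a string rather than raw file bytes.

```typescript
// All names below are illustrative assumptions, not LibScope's actual exports.
interface DocumentParser {
  readonly extensions: readonly string[];
  // Async return type lets sync parsers and async ones (PDF/Word) share one interface.
  parse(content: string): Promise<string>;
}

const markdownParser: DocumentParser = {
  extensions: [".md", ".markdown", ".mdx"],
  // Pass-through: a sync result wrapped in a resolved Promise.
  parse: (content) => Promise.resolve(content),
};

const textParser: DocumentParser = {
  extensions: [".txt"],
  parse: (content) => Promise.resolve(content),
};

// Registry keyed by lowercase extension (including the leading dot).
const registry = new Map<string, DocumentParser>();
for (const parser of [markdownParser, textParser]) {
  for (const ext of parser.extensions) registry.set(ext, parser);
}

// Returns the lowercase extension with its dot, or "" when there is none.
function extensionOf(filePath: string): string {
  const lastSep = Math.max(filePath.lastIndexOf("/"), filePath.lastIndexOf("\\"));
  const base = filePath.slice(lastSep + 1);
  const dot = base.lastIndexOf(".");
  return dot > 0 ? base.slice(dot).toLowerCase() : "";
}

// `format` stands in for the --format override and is assumed to be
// already normalized (for example ".md").
function selectParser(filePath: string, format?: string): DocumentParser | undefined {
  return registry.get(format ?? extensionOf(filePath));
}
```

With this sketch, `selectParser("docs/notes.MD")` resolves to the Markdown parser, while `selectParser("README", ".txt")` relies on the override because the path has no extension.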

Closes #249

#249)

Add a parser abstraction layer that normalizes supported formats into
markdown before entering the existing chunking pipeline.

New parsers:
- PDF (.pdf) via pdf-parse (optional dependency)
- Word (.docx) via mammoth (optional dependency)
- CSV (.csv) via csv-parse — converts to markdown table
- YAML (.yaml, .yml) — validates and wraps in fenced block
- JSON (.json) — validates and wraps in fenced block
- Markdown (.md, .markdown, .mdx) — pass-through
- Plain text (.txt) — pass-through

New `indexFile()` function auto-detects format from extension and
pipes through the correct parser. CLI `add` command now supports
all formats with a `--format` override option.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 2, 2026 21:28
vercel bot commented Mar 2, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| libscope | Ignored | Preview | Mar 3, 2026 0:13am |

Copilot AI left a comment

Pull request overview

Adds a document parser abstraction so LibScope can ingest non-Markdown document formats by normalizing them to text/markdown before passing into the existing chunking/indexing pipeline.

Changes:

  • Introduces DocumentParser interface plus parser implementations for Markdown/text/JSON/YAML/CSV/PDF/Word.
  • Adds indexFile() to parse local files based on extension (with a CLI --format override) and then index via indexDocument().
  • Updates CLI add and import flows to use indexFile() for local files; adds unit tests for parser selection and core parsers.

Reviewed changes

Copilot reviewed 13 out of 14 changed files in this pull request and generated 6 comments.

Show a summary per file
| File | Description |
| --- | --- |
| src/core/parsers/index.ts | Defines DocumentParser, registers parsers, and adds extension-based parser selection utilities. |
| src/core/parsers/markdown.ts | Pass-through Markdown parser. |
| src/core/parsers/text.ts | Pass-through plain text parser. |
| src/core/parsers/json-parser.ts | JSON validation + formatting into fenced code block. |
| src/core/parsers/yaml.ts | YAML validation + wrapping into fenced code block. |
| src/core/parsers/csv.ts | Converts CSV into a Markdown table. |
| src/core/parsers/pdf.ts | Adds PDF parsing via dynamic import of pdf-parse. |
| src/core/parsers/word.ts | Adds Word parsing via dynamic import of mammoth. |
| src/core/indexing.ts | Adds indexFile() and options to parse+index files. |
| src/core/index.ts | Re-exports parser utilities and new indexing API. |
| src/cli/index.ts | Uses indexFile() for local file ingestion; adds --format override and updates import defaults. |
| tests/unit/parsers.test.ts | Adds unit tests for parser selection and parser behavior (JSON/YAML/CSV + error cases). |
| package.json / package-lock.json | Adds dependencies for new formats (csv/yaml/pdf/word). |

package.json Outdated
Comment on lines +60 to +63
    "mammoth": "^1.11.0",
    "node-html-markdown": "^2.0.0",
    "openai": "^6.25.0",
    "pdf-parse": "^2.4.5",

Copilot AI Mar 2, 2026

PR description says the PDF/Word parsers’ packages are optional and only loaded when needed, but mammoth and pdf-parse are added as required dependencies here. If the intent is optional installation (so environments without native deps can still install LibScope), consider moving these to optionalDependencies and keeping the current runtime error messaging.

Copilot uses AI. Check for mistakes.
Comment on lines +9 to +23
let PDFParse: typeof import("pdf-parse").PDFParse;
try {
  const mod = await import("pdf-parse");
  PDFParse = mod.PDFParse;
} catch (err) {
  throw new ValidationError(
    'PDF parsing requires the "pdf-parse" package. Install it with: npm install pdf-parse',
    err,
  );
}

try {
  const parser = new PDFParse({ data: new Uint8Array(content) });
  const result = await parser.getText();
  return result.text;

Copilot AI Mar 2, 2026

pdf-parse doesn’t expose a PDFParse class in its public API/types (the @types/pdf-parse package typically models it as a callable default export). As written, mod.PDFParse will be undefined/typecheck will fail, and parsing will throw at runtime. Update the dynamic import to use the package’s exported parse function (and adjust the typing accordingly).

Comment on lines +20 to +25
const lines: string[] = [];
lines.push("| " + header.join(" | ") + " |");
lines.push("| " + header.map(() => "---").join(" | ") + " |");
for (const row of rows) {
  lines.push("| " + row.join(" | ") + " |");
}

Copilot AI Mar 2, 2026

CSV cells are inserted into a Markdown table without escaping. Values containing | or newlines will break the table rendering and can also create rows with a different number of columns than the header (since relax_column_count allows variable widths). Consider normalizing each row to the header length (pad/truncate) and escaping cell text (e.g., replace | and newlines) before joining.
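
The pad/truncate step suggested here could be sketched as below. The helper names `fitToWidth` and `toMarkdownTable` are hypothetical, and cell escaping is deliberately left out of this sketch so that it shows only the row normalization.

```typescript
// Hypothetical helpers: names are assumptions, not the PR's code.
// Cells are assumed to be escaped already; this sketch covers only row width.
function fitToWidth(row: readonly string[], width: number): string[] {
  const cells = row.slice(0, width); // drop cells beyond the header width
  while (cells.length < width) cells.push(""); // pad short rows with empty cells
  return cells;
}

function toMarkdownTable(
  header: readonly string[],
  rows: readonly (readonly string[])[],
): string {
  const line = (cells: readonly string[]): string => "| " + cells.join(" | ") + " |";
  return [
    line(header),
    line(header.map(() => "---")),
    // Every data row is forced to the header's column count.
    ...rows.map((row) => line(fitToWidth(row, header.length))),
  ].join("\n");
}
```

Whether to truncate extra cells or widen the header instead is a design choice; truncation keeps the table valid but silently drops data, so logging a warning in that case may be worth considering.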

  options: IndexFileOptions = {},
): Promise<IndexedDocument> {
  const log = getLogger();
  const effectiveName = options.format ? `file${options.format}` : filePath;

Copilot AI Mar 2, 2026

The --format override is concatenated as file${options.format}. If callers pass pdf (without a leading dot) or mixed-case extensions, format detection will fail silently and look unsupported. It would be more robust to normalize options.format (trim, lowercase, ensure it starts with .) before building effectiveName.

Suggested change
- const effectiveName = options.format ? `file${options.format}` : filePath;
+ const rawFormat = options.format?.trim();
+ const normalizedFormat =
+   rawFormat && rawFormat.length > 0
+     ? (rawFormat.startsWith(".") ? rawFormat : `.${rawFormat}`).toLowerCase()
+     : undefined;
+ const effectiveName = normalizedFormat ? `file${normalizedFormat}` : filePath;

Comment on lines +372 to +374
throw new ValidationError(
  `Unsupported file format: "${filePath}". Supported extensions: .md, .markdown, .mdx, .json, .yaml, .yml, .csv, .pdf, .docx`,
);

Copilot AI Mar 2, 2026

The unsupported-format error message hardcodes the supported extension list, which can drift from the actual parser registry. Consider generating this from getSupportedExtensions() (and/or including the --format hint) so the message stays accurate as formats are added/removed.
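
One way to derive the message from the registry is sketched below. `getSupportedExtensions()` is named in the comment above, but its body here, the `parsersByExtension` map, and the `unsupportedFormatMessage` helper are assumptions for illustration only.

```typescript
// Assumed registry: maps each extension to a parser name (illustrative subset).
const parsersByExtension = new Map<string, string>([
  [".md", "markdown"],
  [".txt", "text"],
  [".json", "json"],
  [".csv", "csv"],
]);

// Guess at the behavior of getSupportedExtensions(): sorted registry keys.
function getSupportedExtensions(): string[] {
  return Array.from(parsersByExtension.keys()).sort();
}

// Hypothetical helper that builds the error text from the live registry,
// so the list cannot drift when parsers are added or removed.
function unsupportedFormatMessage(filePath: string): string {
  return (
    `Unsupported file format: "${filePath}". ` +
    `Supported extensions: ${getSupportedExtensions().join(", ")}. ` +
    `Use --format to override detection.`
  );
}
```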

package.json Outdated
Comment on lines 63 to 67
    "pdf-parse": "^2.4.5",
    "pino": "^10.3.1",
    "sqlite-vec": "^0.1.0",
    "zod": "^4.3.6"
  },

Copilot AI Mar 2, 2026

pdf-parse v2.x declares a stricter Node engine range (see its entry in the lockfile: >=20.16.0 <21 || >=22.3.0). Adding it as a regular dependency can break installs for users on earlier Node 20.x versions even though this package’s engines.node is currently >=20. Consider either tightening engines.node to match the dependency, or making the PDF parser truly optional (e.g., move pdf-parse to optionalDependencies and keep the runtime "install it" error path).

Suggested change
-     "pdf-parse": "^2.4.5",
-     "pino": "^10.3.1",
-     "sqlite-vec": "^0.1.0",
-     "zod": "^4.3.6"
-   },
+     "pino": "^10.3.1",
+     "sqlite-vec": "^0.1.0",
+     "zod": "^4.3.6"
+   },
+   "optionalDependencies": {
+     "pdf-parse": "^2.4.5"
+   },

- Move mammoth and pdf-parse to optionalDependencies
- Escape pipe characters and newlines in CSV cell values
- Normalize --format option (trim, lowercase, ensure leading dot)
- Generate unsupported format error dynamically from getSupportedExtensions()
- Handle mismatched CSV column counts by normalizing rows

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
RobertLD and others added 6 commits March 2, 2026 22:09
- Add connectors-config tests (file/DB config, sync tracker, deleteConnectorDocuments)
- Add api-server tests (startApiServer with various options)
- Add db-validation tests (validateRow, validateCountRow edge cases)
- Add dedup tests for semantic/both strategies (exercises catch blocks)
- Add schema tests for createVectorTable validation
- Add export tests for import validation branches
- Add config tests for env var overrides (ollama URL, private URLs, LLM)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add PdfParser and WordParser tests (invalid content error paths)
- Add CsvParser tests (pipe escaping, newline handling, row normalization)
- Add Saved Searches API tests (create, list, delete, run)
- Add Bulk Operations API tests (delete, retag, move + validation)
- Add Links API test (create + delete)
- Branch coverage now passes 74% threshold

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The envOverrides.indexing spread was missing from the config merge,
so LIBSCOPE_ALLOW_PRIVATE_URLS and LIBSCOPE_ALLOW_SELF_SIGNED_CERTS
environment variables were silently ignored.
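
A minimal sketch of this kind of bug, under stated assumptions: the config field names, the `"true"` string convention, and the helpers `readEnvOverrides` and `mergeConfig` are all hypothetical. Only the two environment variable names come from the commit message.

```typescript
// Field names and accepted values are assumptions for illustration only.
interface IndexingConfig {
  allowPrivateUrls: boolean;
  allowSelfSignedCerts: boolean;
}
interface Config {
  indexing: IndexingConfig;
}
type ConfigOverrides = { indexing?: Partial<IndexingConfig> };

// Reads overrides from an env-like record (pass process.env in Node).
function readEnvOverrides(env: Record<string, string | undefined>): ConfigOverrides {
  const indexing: Partial<IndexingConfig> = {};
  const privateUrls = env["LIBSCOPE_ALLOW_PRIVATE_URLS"];
  const selfSigned = env["LIBSCOPE_ALLOW_SELF_SIGNED_CERTS"];
  if (privateUrls !== undefined) indexing.allowPrivateUrls = privateUrls === "true";
  if (selfSigned !== undefined) indexing.allowSelfSignedCerts = selfSigned === "true";
  return { indexing };
}

function mergeConfig(
  defaults: Config,
  fileConfig: ConfigOverrides,
  envOverrides: ConfigOverrides,
): Config {
  return {
    indexing: {
      ...defaults.indexing,
      ...fileConfig.indexing,
      // Without this spread, the env values parsed above never reach the result.
      ...envOverrides.indexing,
    },
  };
}
```

Because each nested section is merged with its own spread, a missing line for one section fails silently: the variables are parsed correctly but never applied.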

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fixes CodeQL alert: incomplete string escaping. Backslashes must be
escaped first; otherwise a \| in the input becomes \\| (interpreted as
an escaped backslash followed by a literal pipe).
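
The ordering concern can be sketched as follows. The helper name `escapeCell` and the choice to replace newlines with a space are assumptions; the commit only states that backslashes must be handled before pipes.

```typescript
// Hypothetical helper: the name and the newline replacement are assumptions.
function escapeCell(value: string): string {
  return value
    .replace(/\\/g, "\\\\") // 1. escape backslashes first
    .replace(/\|/g, "\\|") // 2. then escape pipes
    .replace(/\r?\n/g, " "); // 3. collapse newlines so the row stays on one line
}
```

If steps 1 and 2 were swapped, the backslash added in front of each pipe would itself be doubled by the backslash pass, leaving the pipe unescaped again.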

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds Document Format Support section to README with a table showing
supported formats, optional dependencies, and Node.js requirements.
pdf-parse v2 requires Node >=20.16.0 <21 or >=22.3.0.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- package.json: keep both feature deps (csv-parse, js-yaml, pdf-parse) and main deps (node-cron)
- schema.test.ts: add missing DatabaseError import from main
- db-validation.test.ts: take main's more comprehensive test suite
- api-server.test.ts: use await on result.close() from main

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
RobertLD merged commit 6c4b589 into main Mar 3, 2026
9 checks passed
RobertLD deleted the feat/document-formats branch March 3, 2026 00:14