feat: support additional document formats (PDF, Word, CSV, YAML, JSON) (#249) by RobertLD · Pull Request #275 · RobertLD/libscope

RobertLD · 2026-03-02T21:28:20Z

Summary

Adds a parser abstraction layer so LibScope can ingest documents beyond markdown. New files are normalized to text/markdown before entering the existing chunking pipeline.

Supported Formats

Format	Extensions	Parser	Dependency
Markdown	`.md`, `.markdown`, `.mdx`	Pass-through	Built-in
Plain text	`.txt`	Pass-through	Built-in
JSON	`.json`	Validates & wraps in fenced block	Built-in
YAML	`.yaml`, `.yml`	Validates & wraps in fenced block	`js-yaml`
CSV	`.csv`	Converts to markdown table	`csv-parse`
PDF	`.pdf`	Text extraction	`pdf-parse` (optional)
Word	`.docx`	Text extraction	`mammoth` (optional)

Changes

src/core/parsers/ — New parser module with DocumentParser interface and 7 parser implementations
src/core/indexing.ts — New indexFile() function that auto-detects format and delegates to the correct parser
src/cli/index.ts — Updated add command to use indexFile() for local files, added --format override option
src/core/index.ts — Re-exports for new parser utilities
tests/unit/parsers.test.ts — 23 tests covering all parsers including error cases

Design

PDF and Word parsers use dynamic import() — packages are optional and only loaded when needed
Parser interface returns Promise<string> to support async parsers (PDF/Word) alongside sync ones
Format detection is by file extension with a --format CLI override
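
The parser abstraction described above can be pictured with a small sketch. It is only an illustration under stated assumptions: the shape of `DocumentParser`, the `extensions` field, the `registry` map, and `getParserForFile` are hypothetical names, and content is taken as an already-decoded string to keep the example short (the PR's parsers appear to receive raw bytes).

```typescript
// Hypothetical sketch: names and shapes are illustrative, not the PR's actual code.
interface DocumentParser {
  readonly extensions: readonly string[];
  // Async return type so that sync and async parsers share one interface.
  parse(content: string): Promise<string>;
}

const FENCE = "`".repeat(3);

// Pass-through parser for formats that are already markdown or plain text.
const passThroughParser: DocumentParser = {
  extensions: [".md", ".markdown", ".mdx", ".txt"],
  parse: async (content) => content,
};

// JSON parser: validate by parsing, then wrap the pretty-printed value in a fenced block.
const jsonParser: DocumentParser = {
  extensions: [".json"],
  parse: async (content) => {
    const value: unknown = JSON.parse(content); // throws SyntaxError on invalid JSON
    return FENCE + "json\n" + JSON.stringify(value, null, 2) + "\n" + FENCE;
  },
};

// Registry keyed by lowercase extension (including the leading dot).
const registry = new Map<string, DocumentParser>();
for (const parser of [passThroughParser, jsonParser]) {
  for (const ext of parser.extensions) registry.set(ext, parser);
}

// Extension-based detection; returns undefined for unknown formats.
function getParserForFile(fileName: string): DocumentParser | undefined {
  const dot = fileName.lastIndexOf(".");
  if (dot < 0) return undefined;
  return registry.get(fileName.slice(dot).toLowerCase());
}
```

Keeping every parser behind a `Promise<string>` return type means the caller can `await` uniformly, whether the parser is a trivial pass-through or one that loads an optional package on demand.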

Closes #249

#249) Add a parser abstraction layer that normalizes supported formats into markdown before entering the existing chunking pipeline.

New parsers:
- PDF (.pdf) via pdf-parse (optional dependency)
- Word (.docx) via mammoth (optional dependency)
- CSV (.csv) via csv-parse: converts to markdown table
- YAML (.yaml, .yml): validates and wraps in fenced block
- JSON (.json): validates and wraps in fenced block
- Markdown (.md, .markdown, .mdx): pass-through
- Plain text (.txt): pass-through

New `indexFile()` function auto-detects format from extension and pipes through the correct parser. CLI `add` command now supports all formats with a `--format` override option.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

vercel · 2026-03-02T21:28:25Z

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment

Project	Deployment	Actions	Updated (UTC)
libscope	Ignored	Preview	Mar 3, 2026 0:13am

Copilot

Pull request overview

Adds a document parser abstraction so LibScope can ingest non-Markdown document formats by normalizing them to text/markdown before passing into the existing chunking/indexing pipeline.

Changes:

Introduces DocumentParser interface plus parser implementations for Markdown/text/JSON/YAML/CSV/PDF/Word.
Adds indexFile() to parse local files based on extension (with a CLI --format override) and then index via indexDocument().
Updates CLI add and import flows to use indexFile() for local files; adds unit tests for parser selection and core parsers.

Reviewed changes

Copilot reviewed 13 out of 14 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
`src/core/parsers/index.ts`	Defines `DocumentParser`, registers parsers, and adds extension-based parser selection utilities.
`src/core/parsers/markdown.ts`	Pass-through Markdown parser.
`src/core/parsers/text.ts`	Pass-through plain text parser.
`src/core/parsers/json-parser.ts`	JSON validation + formatting into fenced code block.
`src/core/parsers/yaml.ts`	YAML validation + wrapping into fenced code block.
`src/core/parsers/csv.ts`	Converts CSV into a Markdown table.
`src/core/parsers/pdf.ts`	Adds PDF parsing via dynamic import of `pdf-parse`.
`src/core/parsers/word.ts`	Adds Word parsing via dynamic import of `mammoth`.
`src/core/indexing.ts`	Adds `indexFile()` and options to parse+index files.
`src/core/index.ts`	Re-exports parser utilities and new indexing API.
`src/cli/index.ts`	Uses `indexFile()` for local file ingestion; adds `--format` override and updates import defaults.
`tests/unit/parsers.test.ts`	Adds unit tests for parser selection and parser behavior (JSON/YAML/CSV + error cases).
`package.json` / `package-lock.json`	Adds dependencies for new formats (csv/yaml/pdf/word).

Copilot · 2026-03-02T21:32:39Z

package.json

+    "mammoth": "^1.11.0",
    "node-html-markdown": "^2.0.0",
    "openai": "^6.25.0",
+    "pdf-parse": "^2.4.5",


PR description says the PDF/Word parsers’ packages are optional and only loaded when needed, but mammoth and pdf-parse are added as required dependencies here. If the intent is optional installation (so environments without native deps can still install LibScope), consider moving these to optionalDependencies and keeping the current runtime error messaging.

Copilot · 2026-03-02T21:32:39Z

src/core/parsers/pdf.ts

+    let PDFParse: typeof import("pdf-parse").PDFParse;
+    try {
+      const mod = await import("pdf-parse");
+      PDFParse = mod.PDFParse;
+    } catch (err) {
+      throw new ValidationError(
+        'PDF parsing requires the "pdf-parse" package. Install it with: npm install pdf-parse',
+        err,
+      );
+    }
+
+    try {
+      const parser = new PDFParse({ data: new Uint8Array(content) });
+      const result = await parser.getText();
+      return result.text;


pdf-parse doesn’t expose a PDFParse class in its public API/types (the @types/pdf-parse package typically models it as a callable default export). As written, mod.PDFParse will be undefined/typecheck will fail, and parsing will throw at runtime. Update the dynamic import to use the package’s exported parse function (and adjust the typing accordingly).

Copilot · 2026-03-02T21:32:39Z

src/core/parsers/csv.ts

+      const lines: string[] = [];
+      lines.push("| " + header.join(" | ") + " |");
+      lines.push("| " + header.map(() => "---").join(" | ") + " |");
+      for (const row of rows) {
+        lines.push("| " + row.join(" | ") + " |");
+      }


CSV cells are inserted into a Markdown table without escaping. Values containing | or newlines will break the table rendering and can also create rows with a different number of columns than the header (since relax_column_count allows variable widths). Consider normalizing each row to the header length (pad/truncate) and escaping cell text (e.g., replace | and newlines) before joining.
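
One way to apply this suggestion is sketched below. It is a hedged illustration rather than the code that was merged: `escapeCell`, `normalizeRow`, and `toMarkdownTable` are hypothetical helper names, and replacing newlines with a single space is an assumption (the comment only says newlines should be replaced). Backslashes are escaped before pipes, which matches the ordering a later commit in this PR calls out.

```typescript
// Escape characters that would break a markdown table cell.
function escapeCell(value: string): string {
  return value
    .replace(/\\/g, "\\\\") // backslashes first, so an existing "\|" is not misread
    .replace(/\|/g, "\\|") // literal pipes
    .replace(/\r?\n/g, " "); // newlines collapse to a space (assumed behaviour)
}

// Pad or truncate a row so it has exactly `width` cells.
function normalizeRow(row: string[], width: number): string[] {
  const cells = row.slice(0, width);
  while (cells.length < width) cells.push("");
  return cells;
}

// Build a markdown table whose rows always match the header width.
function toMarkdownTable(header: string[], rows: string[][]): string {
  const line = (cells: string[]): string =>
    "| " + cells.map(escapeCell).join(" | ") + " |";
  return [
    line(header),
    "| " + header.map(() => "---").join(" | ") + " |",
    ...rows.map((row) => line(normalizeRow(row, header.length))),
  ].join("\n");
}
```

Normalizing against the header length keeps the table rectangular even when the CSV parser is configured to tolerate ragged rows.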

Copilot · 2026-03-02T21:32:40Z

src/core/indexing.ts

+  options: IndexFileOptions = {},
+): Promise<IndexedDocument> {
+  const log = getLogger();
+  const effectiveName = options.format ? `file${options.format}` : filePath;


The --format override is concatenated as file${options.format}. If callers pass pdf (without a leading dot) or mixed-case extensions, format detection will fail silently and look unsupported. It would be more robust to normalize options.format (trim, lowercase, ensure it starts with .) before building effectiveName.

Suggested change

-  const effectiveName = options.format ? `file${options.format}` : filePath;
+  const rawFormat = options.format?.trim();
+  const normalizedFormat =
+    rawFormat && rawFormat.length > 0
+      ? (rawFormat.startsWith(".") ? rawFormat : `.${rawFormat}`).toLowerCase()
+      : undefined;
+  const effectiveName = normalizedFormat ? `file${normalizedFormat}` : filePath;

Copilot · 2026-03-02T21:32:40Z

src/core/indexing.ts

+    throw new ValidationError(
+      `Unsupported file format: "${filePath}". Supported extensions: .md, .markdown, .mdx, .json, .yaml, .yml, .csv, .pdf, .docx`,
+    );


The unsupported-format error message hardcodes the supported extension list, which can drift from the actual parser registry. Consider generating this from getSupportedExtensions() (and/or including the --format hint) so the message stays accurate as formats are added/removed.
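
A minimal sketch of the idea follows. The body of `getSupportedExtensions()` here is a hypothetical stand-in (the PR does not show its real signature or return value), and `unsupportedFormatMessage` is an illustrative helper name. The drift risk is concrete: the hardcoded list in the diff above already omits `.txt`, which the PR's format table lists as supported.

```typescript
// Hypothetical stand-in for the PR's getSupportedExtensions(); the real one
// would read from the parser registry instead of returning a literal.
function getSupportedExtensions(): string[] {
  return [".md", ".markdown", ".mdx", ".txt", ".json", ".yaml", ".yml", ".csv", ".pdf", ".docx"];
}

// Build the error text from the registry so it cannot drift from the parsers.
function unsupportedFormatMessage(filePath: string): string {
  const supported = getSupportedExtensions().slice().sort().join(", ");
  return (
    `Unsupported file format: "${filePath}". ` +
    `Supported extensions: ${supported}. ` +
    `Use --format to override detection.`
  );
}
```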

Copilot · 2026-03-02T21:32:40Z

package.json

+    "pdf-parse": "^2.4.5",
    "pino": "^10.3.1",
    "sqlite-vec": "^0.1.0",
    "zod": "^4.3.6"
  },


pdf-parse v2.x declares a stricter Node engine range (see its entry in the lockfile: >=20.16.0 <21 || >=22.3.0). Adding it as a regular dependency can break installs for users on earlier Node 20.x versions even though this package’s engines.node is currently >=20. Consider either tightening engines.node to match the dependency, or making the PDF parser truly optional (e.g., move pdf-parse to optionalDependencies and keep the runtime "install it" error path).

Suggested change

-    "pdf-parse": "^2.4.5",
     "pino": "^10.3.1",
     "sqlite-vec": "^0.1.0",
     "zod": "^4.3.6"
   },
+  "optionalDependencies": {
+    "pdf-parse": "^2.4.5"
+  },
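
To make the engine range in this comment concrete, the sketch below checks a Node version string against `>=20.16.0 <21 || >=22.3.0`. It is an illustration only: `parseVersion`, `gte`, and `supportsPdfParseV2` are hypothetical helpers, nothing in the PR indicates that such a runtime guard exists, and prerelease tags are ignored for simplicity.

```typescript
interface Version {
  major: number;
  minor: number;
  patch: number;
}

// Parse "v20.16.0" or "20.16.0" into numeric parts; missing parts count as 0.
function parseVersion(version: string): Version {
  const parts = version.replace(/^v/, "").split(".");
  const num = (s: string | undefined): number => parseInt(s ?? "0", 10) || 0;
  return { major: num(parts[0]), minor: num(parts[1]), patch: num(parts[2]) };
}

// True when a >= b, comparing major, then minor, then patch.
function gte(a: Version, b: Version): boolean {
  if (a.major !== b.major) return a.major > b.major;
  if (a.minor !== b.minor) return a.minor > b.minor;
  return a.patch >= b.patch;
}

// Mirrors the range quoted from the lockfile: >=20.16.0 <21 || >=22.3.0
function supportsPdfParseV2(nodeVersion: string): boolean {
  const v = parseVersion(nodeVersion);
  const inNode20Range = gte(v, { major: 20, minor: 16, patch: 0 }) && v.major < 21;
  return inNode20Range || gte(v, { major: 22, minor: 3, patch: 0 });
}
```

In a Node process the input could come from `process.versions.node`, which would let a caller explain why PDF support is unavailable instead of failing at install time.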

- Move mammoth and pdf-parse to optionalDependencies
- Escape pipe characters and newlines in CSV cell values
- Normalize --format option (trim, lowercase, ensure leading dot)
- Generate unsupported format error dynamically from getSupportedExtensions()
- Handle mismatched CSV column counts by normalizing rows

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Add connectors-config tests (file/DB config, sync tracker, deleteConnectorDocuments)
- Add api-server tests (startApiServer with various options)
- Add db-validation tests (validateRow, validateCountRow edge cases)
- Add dedup tests for semantic/both strategies (exercises catch blocks)
- Add schema tests for createVectorTable validation
- Add export tests for import validation branches
- Add config tests for env var overrides (ollama URL, private URLs, LLM)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Add PdfParser and WordParser tests (invalid content error paths)
- Add CsvParser tests (pipe escaping, newline handling, row normalization)
- Add Saved Searches API tests (create, list, delete, run)
- Add Bulk Operations API tests (delete, retag, move + validation)
- Add Links API test (create + delete)
- Branch coverage now passes 74% threshold

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The envOverrides.indexing spread was missing from the config merge, so LIBSCOPE_ALLOW_PRIVATE_URLS and LIBSCOPE_ALLOW_SELF_SIGNED_CERTS environment variables were silently ignored. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
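
The kind of merge this commit message describes can be sketched as follows. The config shape, the field names `allowPrivateUrls` and `allowSelfSignedCerts`, and the `mergeIndexing` helper are all assumptions made for illustration. Only the two environment variable names come from the commit message.

```typescript
// Hypothetical config shape; only the two flags named in the commit message are modelled.
interface IndexingConfig {
  allowPrivateUrls: boolean;
  allowSelfSignedCerts: boolean;
}

const defaults: IndexingConfig = {
  allowPrivateUrls: false,
  allowSelfSignedCerts: false,
};

// Later sources win: defaults < file config < environment overrides.
// Leaving out the final spread is the kind of omission the commit describes:
// values derived from the environment would never reach the merged config.
function mergeIndexing(
  fileConfig: Partial<IndexingConfig>,
  envOverrides: Partial<IndexingConfig>,
): IndexingConfig {
  return { ...defaults, ...fileConfig, ...envOverrides };
}
```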

Fixes CodeQL alert: incomplete string escaping. Backslashes must be escaped first to prevent \| in input becoming \\| (misinterpreted as escaped-backslash + literal pipe). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Adds Document Format Support section to README with a table showing supported formats, optional dependencies, and Node.js requirements. pdf-parse v2 requires Node >= 20.16 or >= 22.3. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- package.json: keep both feature deps (csv-parse, js-yaml, pdf-parse) and main deps (node-cron)
- schema.test.ts: add missing DatabaseError import from main
- db-validation.test.ts: take main's more comprehensive test suite
- api-server.test.ts: use await on result.close() from main

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings March 2, 2026 21:28

Copilot started reviewing on behalf of RobertLD March 2, 2026 21:29

Copilot AI reviewed Mar 2, 2026

RobertLD mentioned this pull request Mar 2, 2026

docs: fix agent instructions — coverage thresholds and PR quality rules #277

Merged

github-advanced-security bot found potential problems Mar 2, 2026

src/core/parsers/csv.ts: Fixed

RobertLD and others added 6 commits March 2, 2026 22:09

RobertLD merged commit 6c4b589 into main Mar 3, 2026
9 checks passed

RobertLD deleted the feat/document-formats branch March 3, 2026 00:14

This was referenced Mar 3, 2026

chore(main): release 1.3.0 #238

Closed

chore(development): release 1.3.0 #346

Closed

chore(development): release 1.3.0 #355

Closed

chore(main): release 1.3.0 #359

Merged

RobertLD merged 8 commits into main from feat/document-formats

2 participants
