format-converters

Detect and split structured content into what must stay verbatim and what can be compressed or summarized. Zero dependencies. 152 tests. Works everywhere JavaScript runs.

import { detect } from '@lisa/format-converters';

const result = detect(content);
// result?.format      → 'json' | 'xml' | 'html' | 'yaml' | 'toml' | 'markdown' | 'csv' | 'dockerfile'
// result?.converter.extractPreserved(content)    → structural skeleton
// result?.converter.extractCompressible(content) → prose segments to summarize
// result?.converter.reconstruct(preserved, summary) → reassembled output

Supports JSON, XML, HTML, YAML, TOML, Markdown, CSV, and Dockerfile. Node 18+, Deno, Bun, edge runtimes. ESM only.

The problem it solves

LLM pipelines regularly receive structured documents — Kubernetes manifests, API responses, changelogs, Dockerfiles — as message content. Treating them as flat prose wastes tokens and destroys the structure the LLM needs to reason about them.

format-converters splits each format at the right seam: keep the structure, compress the prose.

Before (a Kubernetes deployment, 480 chars):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: nginx
        image: nginx:1.25
        description: This container runs the nginx web server and handles all
          incoming HTTP and HTTPS traffic for the production environment,
          serving requests from thousands of concurrent users daily with
          automatic health checking and graceful restart on failure.

After (skeleton + summary comment, 148 chars):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: nginx
        image: nginx:1.25
# nginx web server for production, high-traffic, health-checked

The keys, versions, and replica count survive verbatim. The verbose prose description is replaced by a summary. 3.2x smaller, structure intact.

Install

npm install @lisa/format-converters

Quick start

Auto-detect and split

import { detect } from '@lisa/format-converters';

const result = detect(content);
if (result) {
  const preserved    = result.converter.extractPreserved(content);
  const compressible = result.converter.extractCompressible(content);
  const summary      = await myLlm.summarize(compressible.join('\n'));
  const output       = result.converter.reconstruct(preserved, summary);
  // output is always ≤ original length — check before using
}

Pick a specific converter

import { YamlConverter } from '@lisa/format-converters';

if (YamlConverter.detect(content)) {
  const skeleton = YamlConverter.extractPreserved(content);
  const prose    = YamlConverter.extractCompressible(content);
  const output   = YamlConverter.reconstruct(skeleton, summarize(prose));
}

Iterate all converters

import { converters } from '@lisa/format-converters';
// [JsonConverter, XmlConverter, HtmlConverter, YamlConverter,
//  TomlConverter, MarkdownConverter, CsvConverter, DockerfileConverter]
// First match wins — JSON and XML are checked before YAML.

for (const converter of converters) {
  if (converter.detect(content)) {
    // ...
  }
}

What each converter does

Format	Preserved verbatim	Compressible
JSON	All keys, numbers, booleans, short strings	String values ≥6 words and ≥100 chars
XML	Full tag skeleton, attributes, short text values	Prose text nodes and verbose XML comments
HTML	Full tag skeleton, `<script>`/`<style>` as `[code]`	Prose text nodes and verbose HTML comments
YAML	All keys, booleans, numbers, strings ≤60 chars	String values that are long prose
TOML	Section headers, all keys with atomic values	String values that are long prose
Markdown	All headings (`#`–`######`), table blocks	Paragraph prose between structural elements
CSV	Header row + row count annotation	All data rows
Dockerfile	All instruction lines, short/directive comments	Multi-line prose comment blocks

Detection is heuristic and fast — no external parsers, no AST. Each converter is designed to fail safe: if reconstruct() output is not shorter than the original, discard it and keep the original.

Format examples

JSON — API response

import { JsonConverter } from '@lisa/format-converters';

// Input: {"id": "req_1", "model": "gpt-4", "message": {"content": "The deployment
//   failed because the health check endpoint returned 503 repeatedly..."}}
//
// extractPreserved → {"id": "req_1", "model": "gpt-4", "message": {"content": "[…]"}}
// extractCompressible → ["The deployment failed because the health check..."]
// reconstruct → skeleton + {"_summary": "health check 503, deployment failed"}

XML — Maven POM

import { XmlConverter } from '@lisa/format-converters';

// Input: <project><description>This is a lengthy explanation of what the project
//   does and how it integrates with other systems...</description></project>
//
// extractPreserved → <project><description>[…]</description></project>
// extractCompressible → ["lengthy explanation of what the project does..."]
// reconstruct → skeleton + <!-- integrates with other systems -->

Dockerfile — multi-stage build

import { DockerfileConverter } from '@lisa/format-converters';

// Input:
//   # This stage compiles the Go binary and runs all unit tests to ensure
//   # correctness before packaging into the final distroless runtime image.
//   FROM golang:1.22-alpine AS builder
//   ...
//
// extractPreserved → FROM / RUN / COPY instructions only
// extractCompressible → ["This stage compiles the Go binary and runs..."]
// reconstruct → instructions + # Go build stage, distroless runtime

API reference

`detect(content: string): DetectResult | undefined`

Returns the first matching converter, or undefined if no format matches (plain prose, shell scripts, etc.).

interface DetectResult {
  converter: FormatConverter;
  format: 'json' | 'xml' | 'html' | 'yaml' | 'toml' | 'markdown' | 'csv' | 'dockerfile';
}

`FormatConverter` interface

interface FormatConverter {
  /** Unique name — appears in trace/debug output. */
  name: string;

  /** Returns true if this converter handles the given content. Called on every message — keep fast. */
  detect(content: string): boolean;

  /** Structural parts that must survive verbatim (tags, keys, headings, instructions). */
  extractPreserved(content: string): string[];

  /** Prose segments that can be summarized or compressed. */
  extractCompressible(content: string): string[];

  /**
   * Reassemble from preserved structure and a summary string.
   * Always check: if output >= original length, discard and keep original.
   */
  reconstruct(preserved: string[], summary: string): string;
}

Detection criteria

Export	`name`	Detects when…
`JsonConverter`	`'json'`	Valid JSON object (≥2 keys) or array (≥2 items)
`XmlConverter`	`'xml'`	Starts with `<?xml` or a letter-tag and has a closing tag
`HtmlConverter`	`'html'`	Has `<!DOCTYPE html>` or `<html` and structural closing tags
`YamlConverter`	`'yaml'`	≥4 non-empty lines, >35% match `key: value` pattern
`TomlConverter`	`'toml'`	≥2 `key = value` lines with section headers or density >40%
`MarkdownConverter`	`'markdown'`	≥2 heading lines and content ≥200 chars
`CsvConverter`	`'csv'`	≥3 lines with commas, consistent column count
`DockerfileConverter`	`'dockerfile'`	≥2 Dockerfile instruction keywords (FROM, RUN, COPY, …)

Low-level exports

All internal helpers are exported for use in custom pipelines:

import {
  detectJson,
  detectXml, xmlSkeleton, xmlProseNodes,
  detectHtml, htmlSkeleton, htmlProseNodes,
  detectYaml, isAtomicYamlValue,
  detectToml, isAtomicTomlValue,
  detectMarkdown,
  detectCsv,
  detectDockerfile,
  looksLikeProse,   // true if text has ≥6 words and ≥100 chars
} from '@lisa/format-converters';

Writing a custom converter

Implement FormatConverter and register it:

import type { FormatConverter } from '@lisa/format-converters';
import { converters, detect } from '@lisa/format-converters';

const IniConverter: FormatConverter = {
  name: 'ini',

  detect(content) {
    const lines = content.split('\n').filter(l => l.trim());
    return lines.some(l => /^\[.+\]$/.test(l)) && lines.some(l => /^\w+=/.test(l));
  },

  extractPreserved(content) {
    return content.split('\n').filter(l => {
      const t = l.trim();
      return !t || t.startsWith('[') || t.startsWith(';') || t.split('=')[1]?.trim().length <= 60;
    });
  },

  extractCompressible(content) {
    return content.split('\n').filter(l => {
      const val = l.split('=')[1]?.trim() ?? '';
      return val.length > 60;
    });
  },

  reconstruct(preserved, summary) {
    return summary ? `${preserved.join('\n')}\n; ${summary}` : preserved.join('\n');
  },
};

// Use standalone
IniConverter.detect(content);

// Or prepend to the built-in list
const myConverters = [IniConverter, ...converters];

Integration with context-compression-engine

FormatConverter (this library) and FormatAdapter (context-compression-engine) share the same four-method interface. TypeScript's structural typing lets you pass any converter directly as an adapter — no wrapper needed:

import { compress } from 'context-compression-engine';
import { converters } from '@lisa/format-converters';

compress(messages, { adapters: converters });

Contributing

Bug reports and pull requests are welcome. Open an issue first for anything beyond small fixes so we can align on direction before you invest time.

git clone https://github.com/SimplyLiz/FormatConverter.git
cd FormatConverter
npm install
npm test

Adding a new converter:

Create src/<format>.ts — implement FormatConverter, export the converter and any helpers
Register it in src/registry.ts — order matters, first match wins
Export from src/index.ts
Add tests/<format>.test.ts — cover detect, extractPreserved, extractCompressible, reconstruct, and an end-to-end shorter-than-input check

License

Free for personal projects, educational use, academic research, and organizations with annual revenue below USD 25,000.

Commercial license required for organizations with annual revenue of USD 25,000 or more. Contact lisa@tastehub.io for pricing.

See LICENSE for the full terms.

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
.github/workflows		.github/workflows
src		src
tests		tests
.gitignore		.gitignore
.prettierrc		.prettierrc
LICENSE		LICENSE
README.md		README.md
eslint.config.js		eslint.config.js
package-lock.json		package-lock.json
package.json		package.json
tsconfig.json		tsconfig.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

format-converters

The problem it solves

Install

Quick start

Auto-detect and split

Pick a specific converter

Iterate all converters

What each converter does

Format examples

JSON — API response

XML — Maven POM

Dockerfile — multi-stage build

API reference

`detect(content: string): DetectResult | undefined`

`FormatConverter` interface

Detection criteria

Low-level exports

Writing a custom converter

Integration with context-compression-engine

Contributing

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

format-converters

The problem it solves

Install

Quick start

Auto-detect and split

Pick a specific converter

Iterate all converters

What each converter does

Format examples

JSON — API response

XML — Maven POM

Dockerfile — multi-stage build

API reference

detect(content: string): DetectResult | undefined

FormatConverter interface

Detection criteria

Low-level exports

Writing a custom converter

Integration with context-compression-engine

Contributing

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`detect(content: string): DetectResult | undefined`

`FormatConverter` interface

Packages