Align ParseResult ZIP models with current worker output #21

@suguanYang

Goal

Align the Python SDK parse-result ZIP parser and public result models with the current Knowhere worker output contract.

This issue is limited to parsing completed job result ZIPs. It should not reimplement features that already exist on origin/main.

Current SDK Baseline

The Python SDK already has the following on origin/main:

  • Knowhere.parse(...) and AsyncKnowhere.parse(...) high-level helpers.
  • ParsingParams.
  • jobs.create(...), file upload, polling, and result loading.
  • Retrieval query options including channels, rerank, filters, signal paths, thresholds, document exclusions, and section exclusions.
  • Document lifecycle resource methods.
  • Default base URL https://api.knowhereto.ai.

Those areas are out of scope for this issue unless a regression is found while implementing the ZIP parser updates.

Current Worker ZIP Contract

Current worker output, verified against staging result job_922b79256307 and the worker contract test, emits:

  • manifest.json
  • chunks.json
  • full.md
  • doc_nav.json
  • images/*
  • tables/*

The current worker contract does not emit:

  • chunks_slim.json
  • hierarchy.json
  • hierarchy_slim.json
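As a rough illustration of the contract above, the following sketch classifies the entries of a result ZIP into current and legacy files. The helper name `classify_zip_entries` and the report keys are hypothetical and not part of the SDK; only the file names come from the lists above.

```python
import io
import zipfile

# File names copied from the contract lists above.
CURRENT_FILES = {"manifest.json", "chunks.json", "full.md", "doc_nav.json"}
LEGACY_FILES = {"chunks_slim.json", "hierarchy.json", "hierarchy_slim.json"}


def classify_zip_entries(zip_bytes: bytes) -> dict:
    """Report which current and legacy contract files a result ZIP contains.

    Hypothetical helper for illustration; not an SDK function.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        names = set(archive.namelist())

    def under(prefix: str) -> list:
        # Skip explicit directory entries such as "images/".
        return sorted(n for n in names if n.startswith(prefix) and not n.endswith("/"))

    return {
        "missing_current": sorted(CURRENT_FILES - names),
        "legacy_present": sorted(LEGACY_FILES & names),
        "images": under("images/"),
        "tables": under("tables/"),
    }
```

A check like this could run against a downloaded staging result before updating fixtures, to confirm that no legacy file is still present.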

Current SDK Mismatches

The Python SDK still models or reads legacy result files:

  • ParseResult.hierarchy is populated from hierarchy.json, which current worker output no longer emits.
  • ParseResult.chunks_slim and SlimChunk are tied to chunks_slim.json, which current worker output no longer emits.
  • TableChunk.table_type is exposed, but current table chunk metadata does not include table_type.

The SDK also misses current worker fields:

  • doc_nav.json, which contains the canonical navigation tree and resource summaries.
  • The HIERARCHY field in manifest.json.
  • metadata.document_top_summary on chunks.

Requirements

1. Parse and expose doc_nav.json

Add a typed public representation for doc_nav.json and expose it from ParseResult.

Expected shape:

  • sections
    • title
    • path
    • level
    • summary
    • chunk_count
    • children
  • resources.images
    • path
    • summary
  • resources.tables
    • path
    • summary

Acceptance criteria:

  • Given a result ZIP with doc_nav.json, when parseResultZip() loads it, then callers can access the parsed navigation object from ParseResult.
  • Given a result ZIP without doc_nav.json, when parseResultZip() loads it, then parsing still succeeds and the navigation field is None.
  • ParseResult.save() writes doc_nav.json when the parsed navigation object exists.
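One possible shape for the typed navigation object is sketched below. It uses stdlib dataclasses to stay dependency-free, whereas the SDK's own models may well be Pydantic. The class names (`NavSection`, `NavResource`, `DocNav`), the helper `load_doc_nav`, the field types, and the defaults are all assumptions; only the field names come from the expected shape above.

```python
import io
import json
import zipfile
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class NavSection:
    # Field names follow the expected shape; types and defaults are assumed.
    title: str
    path: str
    level: int
    summary: Optional[str] = None
    chunk_count: int = 0
    children: List["NavSection"] = field(default_factory=list)

    @classmethod
    def from_dict(cls, raw: dict) -> "NavSection":
        return cls(
            title=raw.get("title", ""),
            path=raw.get("path", ""),
            level=raw.get("level", 0),
            summary=raw.get("summary"),
            chunk_count=raw.get("chunk_count", 0),
            # Recurse so nested sections keep their tree structure.
            children=[cls.from_dict(c) for c in raw.get("children") or []],
        )


@dataclass
class NavResource:
    path: str
    summary: Optional[str] = None


@dataclass
class DocNav:
    sections: List[NavSection] = field(default_factory=list)
    images: List[NavResource] = field(default_factory=list)
    tables: List[NavResource] = field(default_factory=list)

    @classmethod
    def from_dict(cls, raw: dict) -> "DocNav":
        resources = raw.get("resources") or {}

        def parse(items) -> List[NavResource]:
            return [NavResource(i.get("path", ""), i.get("summary")) for i in items or []]

        return cls(
            sections=[NavSection.from_dict(s) for s in raw.get("sections") or []],
            images=parse(resources.get("images")),
            tables=parse(resources.get("tables")),
        )


def load_doc_nav(archive: zipfile.ZipFile) -> Optional[DocNav]:
    """Return the parsed doc_nav.json, or None when the ZIP does not contain it."""
    if "doc_nav.json" not in archive.namelist():
        return None
    return DocNav.from_dict(json.loads(archive.read("doc_nav.json")))
```

Returning None for a missing file, rather than raising, matches the second acceptance criterion and keeps older result ZIPs loadable.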

2. Expose manifest hierarchy from manifest.json

Represent the worker-emitted HIERARCHY field in the public Manifest model.

Acceptance criteria:

  • Given manifest.json contains HIERARCHY, when parseResultZip() loads it, then Python callers can access the field without reading raw manifest dictionaries.
  • The Pydantic model should handle the all-caps input key. An idiomatic model field such as hierarchy = Field(default=None, alias="HIERARCHY") is acceptable if serialization behavior is covered by tests.
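The alias approach could look roughly like the sketch below, which assumes Pydantic v2. The `Manifest` class here is a hypothetical minimal stand-in, not the SDK's real model, and the value of HIERARCHY is left untyped because its shape is not specified in this issue.

```python
from typing import Any, Optional

from pydantic import BaseModel, ConfigDict, Field


class Manifest(BaseModel):
    """Hypothetical minimal stand-in for the SDK's Manifest model."""

    # Accept both the worker's "HIERARCHY" key and the Python field name,
    # and ignore manifest keys this sketch does not model.
    model_config = ConfigDict(populate_by_name=True, extra="ignore")

    # Shape of the value is an assumption, so it stays untyped here.
    hierarchy: Optional[Any] = Field(default=None, alias="HIERARCHY")


manifest = Manifest.model_validate({"HIERARCHY": {"Intro": {}}})

# Default dump uses the Python field name; by_alias=True restores the worker key.
by_name = manifest.model_dump()
by_alias = manifest.model_dump(by_alias=True)
```

The serialization tests mentioned above would need to pin down which of the two dump forms ParseResult.save() writes, since only `by_alias=True` round-trips the worker's all-caps key.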

3. Surface document_top_summary on chunks

Expose metadata.document_top_summary as an optional chunk field.

Acceptance criteria:

  • Given any text, image, or table chunk metadata includes document_top_summary, when parseResultZip() loads chunks.json, then the corresponding chunk model exposes document_top_summary.
  • Existing chunk parsing continues to handle missing document_top_summary.
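A minimal sketch of the optional field follows, using a stdlib dataclass in place of the SDK's chunk models. `ChunkSketch`, `chunk_from_raw`, and the `chunk_id` and `type` keys are assumptions for illustration; only `metadata.document_top_summary` comes from this issue.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChunkSketch:
    """Hypothetical stand-in for the SDK's text, image, and table chunk models."""

    chunk_id: str
    type: str
    document_top_summary: Optional[str] = None


def chunk_from_raw(raw: dict) -> ChunkSketch:
    # Metadata may be missing or null on older results; treat both as empty.
    metadata = raw.get("metadata") or {}
    return ChunkSketch(
        chunk_id=raw.get("chunk_id", ""),  # key name assumed
        type=raw.get("type", "text"),  # key name assumed
        # .get() yields None when the worker did not emit the field.
        document_top_summary=metadata.get("document_top_summary"),
    )
```

Reading the value with `.get()` keeps chunks without the field parseable, which is what the second criterion asks for.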

4. Remove or clearly deprecate legacy/ghost fields

Clean up public result models and tests that still imply the current worker emits removed files or fields.

Acceptance criteria:

  • chunks_slim.json and hierarchy.json are no longer described as current worker outputs.
  • Tests no longer require chunks_slim.json or hierarchy.json to exist for the current contract fixture.
  • TableChunk.table_type is either removed or marked deprecated with documentation that the current worker does not emit table_type.
  • Any backward-compatible legacy parsing that remains is explicitly documented as legacy-only.
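If the deprecation route is chosen for table_type, one way to do it is a property that warns on access. The sketch below uses a plain class named `TableChunkSketch`, which is hypothetical; wiring the same behavior into the SDK's real TableChunk model would depend on how that model is defined.

```python
import warnings
from typing import Optional


class TableChunkSketch:
    """Hypothetical stand-in for TableChunk, showing a deprecated attribute."""

    def __init__(self, path: str, table_type: Optional[str] = None) -> None:
        self.path = path
        # Stored privately so every public read goes through the warning.
        self._table_type = table_type

    @property
    def table_type(self) -> Optional[str]:
        """Deprecated: the current worker does not emit table_type."""
        warnings.warn(
            "TableChunk.table_type is deprecated; the current worker does not emit it.",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not this property
        )
        return self._table_type
```

Legacy result ZIPs that still carry table_type would keep working, while callers see a DeprecationWarning when they read the attribute.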

Out of Scope

  • Adding high-level parse APIs, parsing params, job polling, retrieval options, document lifecycle resources, or base URL changes. These already exist on origin/main.
  • Changing the worker result ZIP contract.
  • Adding table_type to worker table metadata.
  • Removing backward compatibility, unless the implementation is done as a major-version cleanup.

Suggested Verification

  • Add or update result parser tests using a fixture that matches the current worker ZIP contract.
  • Assert that current-contract ZIPs parse without hierarchy.json or chunks_slim.json.
  • Assert that doc_nav.json, manifest HIERARCHY, and chunk document_top_summary are exposed.
  • Run the Python SDK type checks and test suite used by this repository.
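A current-contract fixture could be built in memory along these lines. The helper name, the placeholder file contents, and the assumption that chunks.json is a top-level list are illustrative only; a real fixture should mirror an actual worker result.

```python
import io
import json
import zipfile


def build_current_contract_zip() -> bytes:
    """Build an in-memory ZIP holding the files the current contract lists.

    Contents are placeholders, not real worker output.
    """
    doc_nav = {"sections": [], "resources": {"images": [], "tables": []}}
    # Assumes chunks.json is a list of chunk objects.
    chunks = [{"type": "text", "metadata": {"document_top_summary": "Example summary"}}]
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("manifest.json", json.dumps({"HIERARCHY": {}}))
        archive.writestr("chunks.json", json.dumps(chunks))
        archive.writestr("full.md", "# Example\n")
        archive.writestr("doc_nav.json", json.dumps(doc_nav))
    return buffer.getvalue()


def test_fixture_has_no_legacy_files() -> None:
    with zipfile.ZipFile(io.BytesIO(build_current_contract_zip())) as archive:
        names = set(archive.namelist())
    # Every required current-contract file is present.
    assert {"manifest.json", "chunks.json", "full.md", "doc_nav.json"} <= names
    # None of the legacy files is present.
    assert not names & {"chunks_slim.json", "hierarchy.json", "hierarchy_slim.json"}
```

Parser tests can then feed the returned bytes to the SDK's ZIP loader and assert on the navigation object, the manifest hierarchy, and document_top_summary.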
