Goal
Align the Python SDK parse-result ZIP parser and public result models with the current Knowhere worker output contract.
This issue is limited to parsing completed job result ZIPs. It should not reimplement features that already exist on origin/main.
Current SDK Baseline
The Python SDK already has the following on origin/main:
- Knowhere.parse(...) and AsyncKnowhere.parse(...) high-level helpers.
- ParsingParams.
- jobs.create(...), file upload, polling, and result loading.
- Retrieval query options including channels, rerank, filters, signal paths, thresholds, document exclusions, and section exclusions.
- Document lifecycle resource methods.
- Default base URL https://api.knowhereto.ai.
Those areas are out of scope for this issue unless a regression is found while implementing the ZIP parser updates.
Current Worker ZIP Contract
Current worker output, verified against staging result job_922b79256307 and the worker contract test, emits:
- manifest.json
- chunks.json
- full.md
- doc_nav.json
- images/*
- tables/*
The current worker contract does not emit:
- chunks_slim.json
- hierarchy.json
- hierarchy_slim.json
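The two lists above can be turned into a small check over a ZIP's member names. This is a minimal sketch, not SDK code: the helper names classify_zip_members and missing_current_files are hypothetical, and treating the four top-level files as "expected" is an assumption, since the contract only says the worker emits them.

```python
from fnmatch import fnmatch

# File names taken from the contract lists above.
CURRENT_FILES = {"manifest.json", "chunks.json", "full.md", "doc_nav.json"}
CURRENT_DIR_PATTERNS = ("images/*", "tables/*")
LEGACY_FILES = {"chunks_slim.json", "hierarchy.json", "hierarchy_slim.json"}


def classify_zip_members(names):
    """Split ZIP member names into current, legacy, and unknown groups."""
    report = {"current": [], "legacy": [], "unknown": []}
    for name in names:
        in_current_dir = any(fnmatch(name, p) for p in CURRENT_DIR_PATTERNS)
        if name in CURRENT_FILES or in_current_dir:
            report["current"].append(name)
        elif name in LEGACY_FILES:
            report["legacy"].append(name)
        else:
            report["unknown"].append(name)
    return report


def missing_current_files(names):
    """Return the top-level contract files that are absent, sorted by name."""
    return sorted(CURRENT_FILES - set(names))
```

The names would typically come from zipfile.ZipFile(...).namelist(). A non-empty "legacy" group suggests the ZIP came from an older worker rather than the current contract.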
Current SDK Mismatches
The Python SDK currently still models and/or reads legacy result files:
- ParseResult.hierarchy is populated from hierarchy.json, which current worker output no longer emits.
- ParseResult.chunks_slim and SlimChunk are tied to chunks_slim.json, which current worker output no longer emits.
- TableChunk.table_type is exposed, but current table chunk metadata does not include table_type.
The SDK also misses current worker fields:
- doc_nav.json, which contains the canonical navigation tree and resource summaries.
- manifest.json field HIERARCHY.
- metadata.document_top_summary on chunks.
Requirements
1. Parse and expose doc_nav.json
Add a typed public representation for doc_nav.json and expose it from ParseResult.
Expected shape:
- sections
  - title
  - path
  - level
  - summary
  - chunk_count
  - children
- resources.images
  - path
  - summary
- resources.tables
  - path
  - summary
Acceptance criteria:
- Given a result ZIP with doc_nav.json, when parseResultZip() loads it, then callers can access the parsed navigation object from ParseResult.
- Given a result ZIP without doc_nav.json, when parseResultZip() loads it, then parsing still succeeds and the navigation field is None.
- ParseResult.save() writes doc_nav.json when the parsed navigation object exists.
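One way to picture the first two criteria is the dependency-free sketch below. It uses dataclasses so it runs on the standard library alone; the SDK's real models would presumably be Pydantic. The class names (NavSection, NavResource, DocNav), the helper load_doc_nav, and all field types and defaults are assumptions; only the field names come from the expected shape above.

```python
import io
import json
import zipfile
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class NavSection:
    # Field types are assumed; the issue only lists the field names.
    title: str = ""
    path: str = ""
    level: int = 0
    summary: str = ""
    chunk_count: int = 0
    children: List["NavSection"] = field(default_factory=list)

    @classmethod
    def from_dict(cls, raw: Dict[str, Any]) -> "NavSection":
        return cls(
            title=raw.get("title", ""),
            path=raw.get("path", ""),
            level=raw.get("level", 0),
            summary=raw.get("summary", ""),
            chunk_count=raw.get("chunk_count", 0),
            # Recurse so nested sections become NavSection objects too.
            children=[cls.from_dict(c) for c in raw.get("children") or []],
        )


@dataclass
class NavResource:
    path: str = ""
    summary: str = ""


@dataclass
class DocNav:
    sections: List[NavSection] = field(default_factory=list)
    images: List[NavResource] = field(default_factory=list)
    tables: List[NavResource] = field(default_factory=list)

    @classmethod
    def from_dict(cls, raw: Dict[str, Any]) -> "DocNav":
        resources = raw.get("resources") or {}

        def to_resources(items):
            return [NavResource(i.get("path", ""), i.get("summary", "")) for i in items or []]

        return cls(
            sections=[NavSection.from_dict(s) for s in raw.get("sections") or []],
            images=to_resources(resources.get("images")),
            tables=to_resources(resources.get("tables")),
        )


def load_doc_nav(zip_bytes: bytes) -> Optional[DocNav]:
    """Return the parsed doc_nav.json, or None when the ZIP does not contain it."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        if "doc_nav.json" not in archive.namelist():
            return None
        return DocNav.from_dict(json.loads(archive.read("doc_nav.json")))
```

Returning None for a missing file, rather than raising, keeps older result ZIPs parseable, which is what the second criterion asks for.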
2. Expose manifest hierarchy from manifest.json
Represent the worker-emitted HIERARCHY field in the public Manifest model.
Acceptance criteria:
- Given manifest.json contains HIERARCHY, when parseResultZip() loads it, then Python callers can access the field without reading raw manifest dictionaries.
- The Pydantic model should handle the all-caps input key. An idiomatic model field such as hierarchy = Field(default=None, alias="HIERARCHY") is acceptable if serialization behavior is covered by tests.
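The round trip that the serialization tests would need to pin down can be illustrated without Pydantic. The sketch below is a standard-library stand-in, not the SDK's Manifest model: ManifestSketch, from_raw, and to_raw are hypothetical names, and the value of HIERARCHY is typed as Any because this issue does not specify its shape.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ManifestSketch:
    # Hypothetical stand-in for the public Manifest model; other fields omitted.
    hierarchy: Optional[Any] = None

    @classmethod
    def from_raw(cls, raw: Dict[str, Any]) -> "ManifestSketch":
        # Read the all-caps worker key into an idiomatic lowercase attribute.
        return cls(hierarchy=raw.get("HIERARCHY"))

    def to_raw(self) -> Dict[str, Any]:
        # Write the worker's key name back out so a saved manifest round-trips.
        out: Dict[str, Any] = {}
        if self.hierarchy is not None:
            out["HIERARCHY"] = self.hierarchy
        return out
```

Whichever model library is used, the tests should state explicitly whether serialization emits HIERARCHY or hierarchy, since an alias affects input and output separately.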
3. Surface document_top_summary on chunks
Expose metadata.document_top_summary as an optional chunk field.
Acceptance criteria:
- Given any text, image, or table chunk metadata includes document_top_summary, when parseResultZip() loads chunks.json, then the corresponding chunk model exposes document_top_summary.
- Existing chunk parsing continues to handle missing document_top_summary.
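Both criteria reduce to reading an optional key from the chunk's metadata. The sketch below assumes a raw chunk is a dictionary with "type", "content", and "metadata" keys; those key names, ChunkSketch, and parse_chunk are illustrative and not taken from the SDK. Only metadata.document_top_summary comes from this issue.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ChunkSketch:
    # Hypothetical minimal chunk model; the real SDK chunk classes carry more fields.
    chunk_type: str
    content: str
    document_top_summary: Optional[str] = None


def parse_chunk(raw: Dict[str, Any]) -> ChunkSketch:
    """Build a chunk, tolerating missing metadata or a missing summary."""
    metadata = raw.get("metadata") or {}
    return ChunkSketch(
        chunk_type=raw.get("type", "text"),  # assumed key name
        content=raw.get("content", ""),      # assumed key name
        document_top_summary=metadata.get("document_top_summary"),
    )
```

Because the field defaults to None, chunks from older result ZIPs parse exactly as before.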
4. Remove or clearly deprecate legacy/ghost fields
Clean up public result models and tests that still imply the current worker emits removed files or fields.
Acceptance criteria:
- chunks_slim.json and hierarchy.json are no longer described as current worker outputs.
- Tests no longer require chunks_slim.json or hierarchy.json to exist for the current contract fixture.
- TableChunk.table_type is either removed or marked deprecated with documentation that the current worker does not emit table_type.
- Any backward-compatible legacy parsing that remains is explicitly documented as legacy-only.
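If the deprecation route is chosen for table_type, one common pattern is a property that emits a DeprecationWarning on access. The sketch below shows the pattern on a hypothetical plain class, TableChunkSketch. How it would fit the SDK's actual TableChunk model is not decided here.

```python
import warnings
from typing import Optional


class TableChunkSketch:
    """Hypothetical stand-in for TableChunk, showing only the deprecated field."""

    def __init__(self, table_type: Optional[str] = None) -> None:
        self._table_type = table_type

    @property
    def table_type(self) -> Optional[str]:
        # Legacy-only: the current worker does not emit table_type.
        warnings.warn(
            "table_type is deprecated; the current worker does not emit it.",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller's line
        )
        return self._table_type
```

Python hides DeprecationWarning by default outside of test runners and code run directly in __main__, so the docstring or release notes should carry the same message.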
Out of Scope
- Adding high-level parse APIs, parsing params, job polling, retrieval options, document lifecycle resources, or base URL changes. These already exist on origin/main.
- Changing the worker result ZIP contract.
- Adding table_type to worker table metadata.
- Removing backward compatibility unless the implementation chooses a major-version cleanup.
Suggested Verification
- Add or update result parser tests using a fixture that matches the current worker ZIP contract.
- Assert that current-contract ZIPs parse without hierarchy.json or chunks_slim.json.
- Assert that doc_nav.json, manifest HIERARCHY, and chunk document_top_summary are exposed.
- Run the Python SDK type checks and test suite used by this repository.
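A fixture for these checks could be built in memory rather than committed as a binary file. The sketch below only reproduces the file layout from the contract. Every payload value is a placeholder, and the internal shapes are assumptions, not verified worker output: chunks.json as a list, the chunk keys, the HIERARCHY value, and the file names under images/ and tables/. The helper name build_current_contract_zip is hypothetical.

```python
import io
import json
import zipfile

LEGACY_FILES = ("chunks_slim.json", "hierarchy.json", "hierarchy_slim.json")


def build_current_contract_zip() -> bytes:
    """Build an in-memory ZIP whose layout follows the current worker contract."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # All payload values below are placeholders, not real worker output.
        archive.writestr("manifest.json", json.dumps({"HIERARCHY": {}}))
        archive.writestr("chunks.json", json.dumps([
            {
                "type": "text",
                "content": "placeholder",
                "metadata": {"document_top_summary": "placeholder summary"},
            },
        ]))
        archive.writestr("full.md", "# Placeholder\n")
        archive.writestr("doc_nav.json", json.dumps({
            "sections": [],
            "resources": {"images": [], "tables": []},
        }))
        archive.writestr("images/placeholder.png", b"")
        archive.writestr("tables/placeholder.html", b"")
    return buffer.getvalue()
```

A parser test can pass these bytes to the SDK's ZIP loader and assert on the resulting ParseResult. Deliberately omitting the legacy files from the fixture is what makes the "parses without hierarchy.json or chunks_slim.json" assertion meaningful.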