feat: add client.parse() for the Data Extraction API (/extraction/parse) #47
Merged
Adds first-class support for the Data Extraction API on NutrientClient.
Covers all four processing modes (text, structure, understand, agentic)
and both output shapes (spatial elements and whole-document Markdown).
The response surface is a fully typed ParseResponse TypedDict with a
discriminated union of element variants (paragraph, table, formula,
picture, keyValueRegion, handwriting) so callers can narrow on `type`.
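As a rough illustration of the narrowing described above, here is a minimal sketch with two of the six variants. The payload fields `text` and `rows` are hypothetical placeholders, not the actual ParseResponse schema; only the `type` discriminator comes from the description.

```python
from typing import Literal, TypedDict, Union


class ParagraphElement(TypedDict):
    type: Literal["paragraph"]
    text: str  # hypothetical payload field


class TableElement(TypedDict):
    type: Literal["table"]
    rows: list[list[str]]  # hypothetical payload field


# Discriminated union: the literal `type` key tells the variants apart.
ParseElement = Union[ParagraphElement, TableElement]


def summarize(element: ParseElement) -> str:
    if element["type"] == "table":
        # A type checker narrows `element` to TableElement in this branch.
        return f"table with {len(element['rows'])} row(s)"
    # Only ParagraphElement remains here.
    return f"paragraph: {element['text']}"
```

Calling `summarize({"type": "table", "rows": [["a", "b"]]})` takes the table branch without any `cast` or `isinstance` check.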
The Data Extraction API is billed against extraction credits, which are
a separate billing bucket from the processor API credits consumed by the
other endpoints used by this client (Build, sign, OCR, watermarking,
etc.). Docstrings, README, and changelog make that distinction explicit
so callers do not conflate the two buckets.
Verification:
- 16 new unit tests in tests/unit/test_parse.py (request shape per mode,
response parsing, error propagation for 401 / 400 / 402 / 500).
- mypy strict and ruff clean on src/.
Endpoint surface (httpx-multipart): POST /extraction/parse with a
'file' part and an optional 'instructions' part carrying the JSON
{mode, output:{format}} body. Extends the existing send_request infra
(RequestConfig + TypeGuard + overload) without churn to existing
endpoint paths.
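The request shape above could be assembled roughly as follows. This is a sketch, not the client's actual code: `build_parse_parts` is a hypothetical helper, the content types are assumptions, and the `(filename, content, content_type)` tuples follow the convention httpx accepts for its `files=` argument.

```python
import json
from typing import Literal, Optional

Mode = Literal["text", "structure", "understand", "agentic"]
OutputFormat = Literal["spatial", "markdown"]


def build_parse_parts(
    file_bytes: bytes,
    filename: str,
    mode: Optional[Mode] = None,
    output_format: Optional[OutputFormat] = None,
) -> dict:
    """Build the multipart parts for POST /extraction/parse (hypothetical helper)."""
    # The 'file' part is always present; the content type is an assumption.
    parts: dict = {"file": (filename, file_bytes, "application/octet-stream")}

    # The 'instructions' part is optional and carries {mode, output: {format}}.
    instructions: dict = {}
    if mode is not None:
        instructions["mode"] = mode
    if output_format is not None:
        instructions["output"] = {"format": output_format}
    if instructions:
        # filename=None marks a plain form field rather than a file upload.
        parts["instructions"] = (None, json.dumps(instructions), "application/json")
    return parts
```

The returned dict can be handed to an HTTP client that understands this tuple layout; when neither option is given, the `instructions` part is omitted entirely.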
The extraction-credits accounting shape (cost + remainingCredits) will surface on every future endpoint billed against the extraction-credits bucket, not just /extraction/parse. Factor it out of types/parse.py into its own module so other endpoints can import it without pulling in the whole parse type tree.
Also clarify ParseBounds: document that (x, y) is the top-left corner and that bounds share a coordinate space with the page dimensions in ParsePageRef.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
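A sketch of what the two shapes in this commit might look like. The key names `cost`, `remainingCredits`, `x`, and `y` come from the message; the numeric types, the `width`/`height` keys on ParseBounds, and the helper `fits_on_page` are assumptions made for illustration.

```python
from typing import TypedDict


class ExtractionCredits(TypedDict):
    """Accounting shape shared by endpoints billed against extraction credits."""

    cost: float
    remainingCredits: float


class ParseBounds(TypedDict):
    x: float  # left edge; (x, y) is the top-left corner
    y: float  # top edge
    width: float  # assumed key name
    height: float  # assumed key name


def fits_on_page(bounds: ParseBounds, page_width: float, page_height: float) -> bool:
    """Check a box against page dimensions given in the same coordinate space."""
    right = bounds["x"] + bounds["width"]
    bottom = bounds["y"] + bounds["height"]
    return bounds["x"] >= 0 and bounds["y"] >= 0 and right <= page_width and bottom <= page_height
```

Because bounds and page dimensions share one coordinate space, the containment check needs no unit conversion.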
Three small style nits surfaced in code review against the patterns set
by sign() and the other raw-send_request methods (get_account_info,
create_token, delete_token):
- Drop the redundant inner cast("ParseOutput", {"format": output_format}).
ParseOutput is a single-key TypedDict with total=False; the literal
already satisfies it structurally via the surrounding ParseInstructions
annotation. No other call site in client.py casts an inner literal
this way.
- Replace the RequestConfig(...) constructor call with an inline dict
literal at the send_request boundary, matching sign / create_token /
delete_token / get_account_info. RequestConfig is a generic TypedDict;
the constructor form is the outlier.
- Broaden the file parameter docstring to call out that the endpoint
accepts PDFs, Office documents, and images. Unlike sign(), parsing is
not PDF-only, and the previous docstring implicitly invited readers
to transplant sign()'s PDF-only mental model.
No behavior change.
format) combinations.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
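The first nit, dropping the inner cast, can be sketched like this. ParseOutput is described above as a single-key TypedDict with total=False; the shape of ParseInstructions and the helper name `build_instructions` are assumptions for the sake of the example.

```python
from typing import Literal, TypedDict


class ParseOutput(TypedDict, total=False):
    format: Literal["spatial", "markdown"]


class ParseInstructions(TypedDict, total=False):  # assumed shape
    mode: Literal["text", "structure", "understand", "agentic"]
    output: ParseOutput


def build_instructions(
    mode: Literal["text", "structure", "understand", "agentic"],
    output_format: Literal["spatial", "markdown"],
) -> ParseInstructions:
    # No cast("ParseOutput", ...) around the inner literal: the annotation on
    # `instructions` lets the type checker validate the nested dict structurally.
    instructions: ParseInstructions = {"mode": mode, "output": {"format": output_format}}
    return instructions
```

At runtime a TypedDict is a plain dict, so removing the cast changes nothing about the value that is sent.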
The README's Data Extraction section previously described WHAT parse() does (modes, output formats, billing) without explaining WHY a user would reach for it over the existing extract_* helpers. Rework so the positioning leads:
- New "designed for" bullets up top: RAG ingestion, search indexing, content migration, form/invoice extraction, layout-aware document understanding.
- New output-format selector table mapping each format to its primary use case (markdown → RAG/search; spatial → form/layout).
- Modes table reworded so each row says when to pick it, not just what it technically does (text = born-digital only; structure = OCR for scanned input; understand = AI-augmented for complex layouts; agentic = + VLM for image-heavy content).
- Two worked recipes: RAG ingestion (PDF → markdown → embed) and form extraction (PDF → spatial elements → structured dict).
Also adds a parse() entry to docs/METHODS.md (it was missing entirely) and a "Designed for" preamble to the parse() docstring so the method's positioning is visible in IDE hover popups.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
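The README recipe itself is not reproduced in this thread. As a rough sketch of the middle step of the RAG path (markdown → chunks ready for embedding), something like the following would work on the whole-document Markdown that parse() returns. `chunk_markdown` and the `max_chars` budget are hypothetical, and the embedding call is left out because it depends on the caller's stack.

```python
import re


def chunk_markdown(markdown: str, max_chars: int = 1000) -> list[str]:
    """Split Markdown at headings, then pack sections into chunks of roughly max_chars."""
    # Zero-width split: each section starts at a line beginning with 1-6 '#' and a space.
    sections = re.split(r"^(?=#{1,6} )", markdown, flags=re.MULTILINE)
    chunks: list[str] = []
    current = ""
    for section in sections:
        if not section.strip():
            continue
        # Start a new chunk when adding this section would exceed the budget.
        # A single section longer than max_chars is kept whole, not split further.
        if current and len(current) + len(section) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += section
    if current.strip():
        chunks.append(current.strip())
    return chunks
```

Splitting at headings keeps each chunk aligned with a document section, which tends to suit retrieval better than fixed-width slicing.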
HungKNguyen
reviewed
May 27, 2026
HungKNguyen
requested changes
May 27, 2026
Collaborator
HungKNguyen
left a comment
The main blocker is that DWS Extract actually requires a different API key from DWS Processor. Maybe the client can be initialized with multiple API keys for different products.
DWS Extract is a separate product from DWS Processor with its own API key and credit pool. Calling /extraction/parse with the Processor key returns 403. Add an optional extract_api_key constructor parameter (str or async callable) that parse() prefers over api_key when set; non-parse methods keep using api_key. Falling back to api_key keeps a single-key setup working once tenants get global DWS keys.
Also reject mode='text' + output_format='spatial' before the request goes out: the text mode only produces markdown, so the combination would 502 on the server side. Surface it as a ValidationError with guidance.
Addresses PR #47 review feedback from HungKNguyen.
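The key preference and the pre-flight guard described in this commit could look roughly like the sketch below. `resolve_parse_key` and `validate_parse_args` are hypothetical names, and `ValidationError` here is a local stand-in for the client's own exception class.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Union

# A key is either a plain string or an async callable that returns one.
ApiKey = Union[str, Callable[[], Awaitable[str]]]


class ValidationError(ValueError):
    """Local stand-in for the client's ValidationError."""


async def resolve_parse_key(api_key: ApiKey, extract_api_key: Optional[ApiKey] = None) -> str:
    # parse() prefers the Extract key when set and falls back to api_key otherwise.
    chosen = extract_api_key if extract_api_key is not None else api_key
    if callable(chosen):
        return await chosen()
    return chosen


def validate_parse_args(mode: str, output_format: str) -> None:
    # Reject the unsupported combination before any network round-trip.
    if mode == "text" and output_format == "spatial":
        raise ValidationError(
            "mode='text' only produces markdown; use output_format='markdown' "
            "or pick another mode for spatial output"
        )
```

With only `api_key` supplied, `asyncio.run(resolve_parse_key("processor-key"))` returns the Processor key, which matches the single-key fallback described above.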
HungKNguyen
approved these changes
May 27, 2026
The docstring promises pageIndex/width/height are always populated and only pageNumber may be absent, but the class was declared `total=False`, which contradicts that and forces type-strict callers to guard every subscript access on guaranteed-present fields. Switch to the default (`total=True`) shape with pageNumber explicitly `NotRequired`, matching the precedent set by ParseBounds in the same module. No runtime impact — the wire already populates these fields.
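A sketch of the resulting declaration, assuming numeric field types (the commit only names the keys). `page_area` is a hypothetical helper that shows why required keys matter for strict callers.

```python
try:
    from typing import NotRequired, TypedDict  # Python 3.11+
except ImportError:  # older interpreters
    from typing_extensions import NotRequired, TypedDict


class ParsePageRef(TypedDict):  # default total=True: keys are required
    pageIndex: int  # field types here are assumptions
    width: float
    height: float
    pageNumber: NotRequired[int]  # the only key that may be absent


def page_area(page: ParsePageRef) -> float:
    # No guard needed: width and height are required keys, so a strict
    # type checker accepts direct subscript access.
    return page["width"] * page["height"]
```

Under `total=False`, the same subscripts would be flagged by strict checkers as possibly missing keys, which is the friction this commit removes.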
Contributor
Author
Live smoke against the DWS APIs (commit
| Case | Status | Detail |
|---|---|---|
| P1 processor: extract_text | OK | 6 pages |
| P2 processor: get_account_info | OK | subscription=enterprise |
| E1 parse text + markdown | OK | cost=6.0, pages=6, md_len=1922 |
| E2 parse structure + spatial | OK | cost=9.0, pages=6, elements=72 |
| E3 parse structure + markdown | OK | cost=1.5, pages=1, md_len=2560 |
| E4 parse understand + spatial | OK | cost=54.0, pages=6, elements=124 |
| E5 parse understand + markdown | OK | cost=9.0, pages=1, md_len=5607 |
| E6 parse agentic + markdown | OK | cost=18.0, pages=1, md_len=6770, ~54s elapsed (first run hit a transient NetworkError; passed on retry with a longer timeout) |
| V1 parse text + spatial | OK | ValidationError raised pre-network; the new client-side guard works |
| V2 Processor key against /extraction/parse | OK | 403, the exact failure called out in review, now provably gated |
| B1 parse with bytes input | OK | elements=72, cost=9.0 |
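For reference, dividing each reported cost by its page count gives a consistent per-page figure within each mode. The snippet below only redoes that arithmetic on the E1-E6 rows; it reflects this single smoke run and should not be read as a published price list.

```python
# (mode, output_format, cost, pages) copied from rows E1-E6 of the smoke table.
runs = [
    ("text", "markdown", 6.0, 6),
    ("structure", "spatial", 9.0, 6),
    ("structure", "markdown", 1.5, 1),
    ("understand", "spatial", 54.0, 6),
    ("understand", "markdown", 9.0, 1),
    ("agentic", "markdown", 18.0, 1),
]

# Collect the distinct cost-per-page values observed for each mode.
per_page: dict[str, set[float]] = {}
for mode, _output_format, cost, pages in runs:
    per_page.setdefault(mode, set()).add(cost / pages)

for mode, rates in per_page.items():
    print(mode, sorted(rates))
```

In this run the output format did not change the per-page figure for `structure` or `understand`, the two modes exercised with both formats.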
What this verifies
- `extract_api_key` routes `/extraction/parse` through the Extract key end-to-end (E1-E6, B1 all billed against `data_extraction_credits`).
- Existing Processor methods on the same client keep using the Processor key (P1, P2).
- Passing only `api_key` (Processor) to `parse()` is correctly rejected with 403, i.e. the documented restriction holds (V2).
- `mode='text'` + `output_format='spatial'` raises `ValidationError` before any HTTP round-trip (V1).
- `bytes` input behaves identically to a file-path input (B1).
Why
The Data Extraction API (`/extraction/parse`) is now generally available. This PR adds first-class client support so users can call it directly from `NutrientClient` without constructing raw HTTP requests.
Summary
- `client.parse()` method covering all four processing modes (`text`, `structure`, `understand`, `agentic`) and both output formats (`spatial` element list, whole-document `markdown`).
- `ParseResponse` envelope with a discriminated union of element variants (paragraph, table, formula, picture, keyValueRegion, handwriting); `if element["type"] == "table": ...` narrows correctly via the `type` discriminator.
- `ExtractionCredits` type module to surface the extraction-credit billing bucket, which is separate from the processor-credit bucket consumed by existing endpoints. README, changelog, and method docstring all make the distinction explicit so callers do not conflate the two.
- Unit tests covering request shape per mode, response parsing, and error propagation (`401`/`400`/`402`/`500`).
Verification: static
- `mypy` clean on `src/` (strict)
- `ruff check` clean on touched files
- `pytest tests/unit`: 263 / 263 passing (16 new in `tests/unit/test_parse.py`)
Verification: live (prod)
A full sweep against the prod API using `tests/data/sample.pdf` (6 pages) covered every documented `(mode, output_format)` combination plus the spec-rejected case, both alternative input shapes (bytes, file-like), and both error paths. All 12 calls behaved as expected:
| Mode | Output format | Notes |
|---|---|---|
| text | markdown | |
| text | spatial | `ValidationError`, HTTP 400 (rejected per spec) |
| structure | spatial | |
| structure | markdown | |
| understand | spatial | |
| understand | markdown | |
| agentic | spatial | |
| agentic | markdown | |
| structure | spatial | bytes |
| structure | spatial | `FileNotFoundError` raised |
| structure | spatial | `AuthenticationError`, HTTP 401 raised |