Optimize performance, fix memory leak, add HF Spaces auto-deploy #7

Merged
lfoppiano merged 2 commits into dev from claude/perf-deploy-oQZPv on Apr 13, 2026

Conversation

@lfoppiano
Collaborator

Summary

Performance

  • Reuse ObjectMapper: DataTypeClassifier and DatasetContextClassifier were creating a new ObjectMapper() on every classify() call. ObjectMapper is thread-safe and expensive to instantiate. Replaced with a private static final instance shared across all calls.
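A minimal sketch of the hoisting pattern, using the JDK's java.util.regex.Pattern as a stand-in for Jackson's ObjectMapper (both are immutable, thread-safe, and relatively expensive to construct); the class and method names are illustrative, not taken from the codebase:

```java
import java.util.regex.Pattern;

// Sketch of the reuse pattern from this PR: hoist an expensive,
// thread-safe object into a static final field instead of building
// it on every call. In the PR the field is a Jackson ObjectMapper;
// java.util.regex.Pattern stands in here.
public class SharedInstanceDemo {
    // Compiled once at class-load time, shared safely by all threads.
    private static final Pattern JSON_KEY = Pattern.compile("\"(\\w+)\"\\s*:");

    // Before the fix the equivalent of Pattern.compile(...) (i.e.
    // new ObjectMapper()) ran on every invocation of this method.
    public static String firstKey(String json) {
        var m = JSON_KEY.matcher(json);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstKey("{\"has_dataset\": true}"));
    }
}
```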

Memory leak fix

  • Close DocumentSource after PDF processing: processDatasetPDF() calls DatasetParser.processPDF(), which returns a Document holding native PDF parser resources via DocumentSource. These were never released, leaking memory on every request. Added DocumentSource.close() in the finally block.
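The shape of the fix can be sketched with a hypothetical AutoCloseable stand-in for Grobid's DocumentSource (the class below is illustrative, not the real Grobid type):

```java
// Sketch of the leak fix: release native resources in a finally block
// so they are freed even when processing throws mid-way. This
// DocumentSource is a hypothetical stand-in for Grobid's class of the
// same name, which wraps native PDF parser resources.
public class CloseInFinallyDemo {
    static class DocumentSource implements AutoCloseable {
        boolean closed = false;
        @Override public void close() { closed = true; }
    }

    // Mirrors the shape of processDatasetPDF() after the fix.
    static DocumentSource process(boolean fail) {
        DocumentSource source = new DocumentSource();
        try {
            if (fail) throw new RuntimeException("parse error");
            // ... extract dataset mentions from the parsed Document ...
        } finally {
            source.close();   // previously missing: native memory leaked per request
        }
        return source;
    }
}
```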

CI / Deployment

  • JDK 17 → 21 in ci-build.yml build step (matches the java.toolchain.languageVersion = 21 in build.gradle)
  • HuggingFace Spaces auto-deploy — new deploy-hf-spaces job triggers a factory reboot of lfoppiano/datastet-dev Space after successful Docker build on non-PR pushes. Requires HF_TOKEN secret in repo settings.
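A hedged sketch of what such a workflow job could look like; the job layout, step name, and `needs:` target are assumptions rather than copies of ci-build.yml, and the restart endpoint with `factory_reboot=true` mirrors the semantics of huggingface_hub's restart_space:

```yaml
# Hypothetical sketch of the deploy-hf-spaces job described above.
deploy-hf-spaces:
  needs: docker-build          # assumed name of the Docker build job
  if: github.event_name != 'pull_request'
  runs-on: ubuntu-latest
  steps:
    - name: Factory-reboot the Space
      run: |
        curl -s -X POST \
          -H "Authorization: Bearer ${{ secrets.HF_TOKEN }}" \
          "https://huggingface.co/api/spaces/lfoppiano/datastet-dev/restart?factory_reboot=true"
```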

Test plan

  • ./gradlew clean compileJava passes
  • ./gradlew test passes
  • Verify no new ObjectMapper() in hot paths
  • Verify memory usage stabilizes under repeated PDF requests
  • Verify HF Space restarts after merge (requires HF_TOKEN secret)

https://claude.ai/code/session_018EBZhK2RtGtsvN4E1rp2tF

claude added 2 commits April 13, 2026 22:31
Dropwizard 4.x uses strict Jackson deserialization and rejects unknown
YAML keys. config.yml has version, corpusPath, models, entityFishingHost,
etc. at the top level, but DatastetServiceConfiguration only declared
grobidHome, datastetConfiguration, maxParallelRequests, and CORS fields.

Add all missing top-level fields to DatastetServiceConfiguration,
matching the pattern from software-mentions' SoftwareServiceConfiguration:
version, entityFishingHost/Port, corpusPath, templatePath, tmpPath,
pub2teiPath, useBinaryContextClassifiers, models.

Fixes: "Unrecognized field at: version" crash on startup.

https://claude.ai/code/session_018EBZhK2RtGtsvN4E1rp2tF
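The added fields can be sketched as a plain POJO (a subset only; field names come from the commit message, and the real class extends Dropwizard's Configuration, where Jackson binds them from config.yml and, in Dropwizard 4.x, rejects any top-level key without a matching field):

```java
// Hedged sketch of the fields added to DatastetServiceConfiguration.
// Field names are taken from the commit message; getters/setters for the
// remaining fields (tmpPath, pub2teiPath, models, ...) follow the same shape.
public class DatastetServiceConfigurationSketch {
    private String version;
    private String entityFishingHost;
    private int entityFishingPort;
    private String corpusPath;
    private String templatePath;
    private String tmpPath;
    private String pub2teiPath;
    private boolean useBinaryContextClassifiers;

    public String getVersion() { return version; }
    public void setVersion(String version) { this.version = version; }
    public String getEntityFishingHost() { return entityFishingHost; }
    public void setEntityFishingHost(String host) { this.entityFishingHost = host; }
    public int getEntityFishingPort() { return entityFishingPort; }
    public void setEntityFishingPort(int port) { this.entityFishingPort = port; }
}
```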
Performance:
- Reuse static ObjectMapper in DataTypeClassifier and
  DatasetContextClassifier instead of creating new instances per
  classify() call. ObjectMapper is thread-safe and expensive to create.

Memory leak:
- Close DocumentSource after PDF processing in processDatasetPDF().
  The grobid Document holds native PDF parser resources via
  DocumentSource that were never released, leaking memory per request.

CI:
- Update JDK 17 -> 21 in ci-build.yml (matches java toolchain)
- Add deploy-hf-spaces job: triggers factory reboot of HuggingFace
  Space lfoppiano/datastet-dev after successful Docker build on
  non-PR pushes. Requires HF_TOKEN secret in repo settings.

https://claude.ai/code/session_018EBZhK2RtGtsvN4E1rp2tF
@lfoppiano lfoppiano merged commit 9cd121e into dev Apr 13, 2026
3 checks passed
@lfoppiano lfoppiano deleted the claude/perf-deploy-oQZPv branch April 13, 2026 22:47
lfoppiano pushed a commit that referenced this pull request Apr 16, 2026
PR #7 fixed a leak in the PDF path by closing DocumentSource; TEI has no
DocumentSource, but heap and file-descriptor usage still grew under
sustained /processDatasetTEI load. Root causes:

1. JAXP factory churn. DocumentBuilderFactory.newInstance(),
   XPathFactory.newInstance(), TransformerFactory.newInstance(), and
   SAXParserFactory.newInstance() ran on every request (and, in
   XMLUtilities.segment(), on every sentence). Each call re-runs
   ServiceLoader discovery and produces factories whose classloader-
   backed caches are not reclaimed promptly.
2. DocumentBuilder.parse(File) defers FileInputStream closure to
   Xerces, accumulating FDs under sustained load.
3. The parsed DOM Document was left as a local reference until
   method return, delaying young-gen reclaim of a large node graph.

Fix:
- Cache factories as private static finals in DatasetParser and
  XMLUtilities, with synchronized accessors (newDocumentBuilder,
  newXPath, newSAXParser, newTransformer). Factories' new*() methods
  are not guaranteed thread-safe; synchronized access keeps contention
  negligible versus XML parse cost and avoids ThreadLocal leaks in
  the Dropwizard thread pool.
- Parse TEI via try-with-resources FileInputStream so the handle is
  released deterministically.
- Null the parsed Document reference in finally blocks to aid GC.
- Remove dead DocumentBuilderFactory allocation in processXML(File).

No PDF-path changes; PR #7's DocumentSource.close() is preserved.
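The caching-plus-try-with-resources fix can be sketched as follows (class and method names here are illustrative; the real accessors live in DatasetParser and XMLUtilities):

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;

// Sketch of the factory-caching pattern described above: the factory is
// created once (so ServiceLoader discovery runs once per JVM), and the
// accessor is synchronized because JAXP does not guarantee that
// newDocumentBuilder() is thread-safe on a shared factory instance.
public class CachedJaxpDemo {
    private static final DocumentBuilderFactory DBF = DocumentBuilderFactory.newInstance();

    public static synchronized DocumentBuilder newDocumentBuilder() {
        try {
            return DBF.newDocumentBuilder();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Parsing from an explicitly managed stream (here an in-memory one;
    // a FileInputStream in the real code) means the handle is closed
    // deterministically, unlike DocumentBuilder.parse(File).
    public static String rootName(String xml) {
        try (InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))) {
            Document doc = newDocumentBuilder().parse(in);
            return doc.getDocumentElement().getTagName();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```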
lfoppiano added a commit that referenced this pull request Apr 16, 2026
#20)

* Fix memory leak in TEI processing: cache JAXP factories, close streams


* Harden cached JAXP factories and preserve systemId on TEI parse

Addresses Copilot review feedback on the previous commit:

1. XXE/SSRF hardening. Because the cached DocumentBuilderFactory /
   SAXParserFactory / TransformerFactory parse user-supplied XML/TEI,
   the static initializers now apply OWASP-recommended hardening:
   - FEATURE_SECURE_PROCESSING
   - disallow-doctype-decl
   - external-general-entities = false
   - external-parameter-entities = false
   - nonvalidating/load-external-dtd = false
   - setXIncludeAware(false), setExpandEntityReferences(false)
   - TransformerFactory: ACCESS_EXTERNAL_DTD / ACCESS_EXTERNAL_STYLESHEET
     pinned to empty string
   Each feature/attribute is set in its own try/catch so unsupported
   options on a given JAXP implementation do not break class init.

2. Preserve systemId. DocumentBuilder.parse(File) previously set the
   document systemId to the file URI, which matters for relative
   references and error locations. Restore it by setting
   inputSource.setSystemId(file.toURI().toString()) on the InputSource
   wrapper in DatasetParser.processTEI while keeping the try-with-
   resources FileInputStream for deterministic stream closure.
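A minimal sketch of the DocumentBuilderFactory part of that hardening (the SAXParserFactory and TransformerFactory initializers follow the same per-feature try/catch pattern; the class name is illustrative):

```java
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;

// Sketch of the XXE/SSRF hardening applied in the static initializers.
// Each feature is set in its own try/catch so an option unsupported by
// a given JAXP implementation cannot break class initialization.
public class HardenedFactoryDemo {
    public static DocumentBuilderFactory hardened() {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        trySet(dbf, XMLConstants.FEATURE_SECURE_PROCESSING, true);
        trySet(dbf, "http://apache.org/xml/features/disallow-doctype-decl", true);
        trySet(dbf, "http://xml.org/sax/features/external-general-entities", false);
        trySet(dbf, "http://xml.org/sax/features/external-parameter-entities", false);
        trySet(dbf, "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        dbf.setXIncludeAware(false);
        dbf.setExpandEntityReferences(false);
        return dbf;
    }

    private static void trySet(DocumentBuilderFactory dbf, String feature, boolean value) {
        try {
            dbf.setFeature(feature, value);
        } catch (Exception e) {
            // Unsupported on this JAXP implementation; skip rather than fail.
        }
    }
}
```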

---------

Co-authored-by: Claude <noreply@anthropic.com>