Optimize performance, fix memory leak, add HF Spaces auto-deploy #7
Merged
Conversation
Dropwizard 4.x uses strict Jackson deserialization and rejects unknown YAML keys. config.yml has version, corpusPath, models, entityFishingHost, etc. at the top level, but DatastetServiceConfiguration only declared grobidHome, datastetConfiguration, maxParallelRequests, and the CORS fields.

Add all missing top-level fields to DatastetServiceConfiguration, matching the pattern from software-mentions' SoftwareServiceConfiguration: version, entityFishingHost/Port, corpusPath, templatePath, tmpPath, pub2teiPath, useBinaryContextClassifiers, models.

Fixes the "Unrecognized field at: version" crash on startup.

https://claude.ai/code/session_018EBZhK2RtGtsvN4E1rp2tF
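A minimal sketch of the added bindings, assuming plain fields with bean-style getters/setters (which Jackson binds by name). The real class extends Dropwizard's Configuration and keeps its existing grobidHome, datastetConfiguration, maxParallelRequests, and CORS fields; the models element type shown here is an assumption.

```java
import java.util.List;

// Sketch only: the real DatastetServiceConfiguration extends
// io.dropwizard.core.Configuration. Each field mirrors a top-level
// key in config.yml so strict Jackson deserialization accepts it.
class DatastetServiceConfiguration {
    private String version;
    private String entityFishingHost;
    private int entityFishingPort;
    private String corpusPath;
    private String templatePath;
    private String tmpPath;
    private String pub2teiPath;
    private boolean useBinaryContextClassifiers;
    private List<String> models;   // element type is an assumption

    public String getVersion() { return version; }
    public void setVersion(String version) { this.version = version; }

    public String getCorpusPath() { return corpusPath; }
    public void setCorpusPath(String corpusPath) { this.corpusPath = corpusPath; }

    public List<String> getModels() { return models; }
    public void setModels(List<String> models) { this.models = models; }

    // ... remaining getters/setters follow the same pattern
}
```

Without these accessors, Dropwizard's strict mapper throws on the first unknown key it meets, which is why the crash names the first top-level field, version.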
Performance:
- Reuse a static ObjectMapper in DataTypeClassifier and DatasetContextClassifier instead of creating new instances per classify() call. ObjectMapper is thread-safe and expensive to create.

Memory leak:
- Close DocumentSource after PDF processing in processDatasetPDF(). The grobid Document holds native PDF parser resources via DocumentSource that were never released, leaking memory per request.

CI:
- Update JDK 17 -> 21 in ci-build.yml (matches the java toolchain)
- Add deploy-hf-spaces job: triggers a factory reboot of the HuggingFace Space lfoppiano/datastet-dev after a successful Docker build on non-PR pushes. Requires HF_TOKEN secret in repo settings.

https://claude.ai/code/session_018EBZhK2RtGtsvN4E1rp2tF
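Both fixes are standard patterns. A stdlib-only sketch, where ExpensiveParser and NativeSource are hypothetical stand-ins for Jackson's ObjectMapper and grobid's DocumentSource (neither is on the classpath here):

```java
import java.util.concurrent.atomic.AtomicInteger;

class ReusePatternDemo {
    // Stand-in for Jackson's ObjectMapper: thread-safe, costly to construct.
    static class ExpensiveParser {
        static final AtomicInteger CONSTRUCTED = new AtomicInteger();
        ExpensiveParser() { CONSTRUCTED.incrementAndGet(); }
        String classify(String input) { return input.trim(); }
    }

    // Before: "new ExpensiveParser()" inside classify() on every call.
    // After: one shared instance, as with "private static final ObjectMapper".
    private static final ExpensiveParser PARSER = new ExpensiveParser();

    static String classify(String input) {
        return PARSER.classify(input);
    }

    // Stand-in for grobid's DocumentSource holding native PDF resources.
    static class NativeSource implements AutoCloseable {
        boolean closed = false;
        @Override public void close() { closed = true; }
    }

    // Pattern used in processDatasetPDF(): release the source in finally,
    // so a failure mid-processing cannot leak the native handle.
    static boolean processPdf() {
        NativeSource source = new NativeSource();
        try {
            // ... run the dataset parser over the document ...
        } finally {
            source.close();
        }
        return source.closed;
    }
}
```

However many times classify() runs, only one parser is ever constructed, and the native source is released on every code path, including exceptions.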
lfoppiano pushed a commit that referenced this pull request on Apr 16, 2026
PR #7 fixed a leak in the PDF path by closing DocumentSource; TEI has no DocumentSource, but heap and file-descriptor usage still grew under sustained /processDatasetTEI load.

Root causes:
1. JAXP factory churn. DocumentBuilderFactory.newInstance(), XPathFactory.newInstance(), TransformerFactory.newInstance(), and SAXParserFactory.newInstance() ran on every request (and, in XMLUtilities.segment(), on every sentence). Each call re-runs ServiceLoader discovery and produces factories whose classloader-backed caches are not reclaimed promptly.
2. DocumentBuilder.parse(File) defers FileInputStream closure to Xerces, accumulating FDs under sustained load.
3. The parsed DOM Document was left as a local reference until method return, delaying young-gen reclaim of a large node graph.

Fix:
- Cache factories as private static finals in DatasetParser and XMLUtilities, with synchronized accessors (newDocumentBuilder, newXPath, newSAXParser, newTransformer). Factories' new*() methods are not guaranteed thread-safe; synchronized access keeps contention negligible versus XML parse cost and avoids ThreadLocal leaks in the Dropwizard thread pool.
- Parse TEI via try-with-resources FileInputStream so the handle is released deterministically.
- Null the parsed Document reference in finally blocks to aid GC.
- Remove dead DocumentBuilderFactory allocation in processXML(File).

No PDF-path changes; PR #7's DocumentSource.close() is preserved.
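The caching pattern described above can be sketched with stdlib JAXP alone; accessor names mirror the ones the commit lists, and error handling is simplified:

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;

final class XmlFactories {
    // Created once at class load instead of per request/sentence,
    // avoiding repeated ServiceLoader discovery on every call.
    private static final DocumentBuilderFactory DBF = DocumentBuilderFactory.newInstance();
    private static final XPathFactory XPF = XPathFactory.newInstance();

    // The factories' new*() methods are not guaranteed thread-safe,
    // so hand out per-call products under a lock; the products
    // themselves (DocumentBuilder, XPath) are used by one thread only.
    static synchronized DocumentBuilder newDocumentBuilder() throws ParserConfigurationException {
        return DBF.newDocumentBuilder();
    }

    static synchronized XPath newXPath() {
        return XPF.newXPath();
    }
}
```

Each request still gets its own DocumentBuilder and XPath (cheap), while the expensive factory discovery happens exactly once per JVM, with no per-thread state parked in the Dropwizard pool.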
lfoppiano added a commit that referenced this pull request on Apr 16, 2026
#20) * Fix memory leak in TEI processing: cache JAXP factories, close streams

* Harden cached JAXP factories and preserve systemId on TEI parse

Addresses Copilot review feedback on the previous commit:

1. XXE/SSRF hardening.
Because the cached DocumentBuilderFactory / SAXParserFactory / TransformerFactory parse user-supplied XML/TEI, the static initializers now apply OWASP-recommended hardening:
- FEATURE_SECURE_PROCESSING
- disallow-doctype-decl
- external-general-entities = false
- external-parameter-entities = false
- nonvalidating/load-external-dtd = false
- setXIncludeAware(false), setExpandEntityReferences(false)
- TransformerFactory: ACCESS_EXTERNAL_DTD / ACCESS_EXTERNAL_STYLESHEET pinned to empty string

Each feature/attribute is set in its own try/catch so unsupported options on a given JAXP implementation do not break class init.

2. Preserve systemId. DocumentBuilder.parse(File) previously set the document systemId to the file URI, which matters for relative references and error locations. Restore it by setting inputSource.setSystemId(file.toURI().toString()) on the InputSource wrapper in DatasetParser.processTEI while keeping the try-with-resources FileInputStream for deterministic stream closure.

Co-authored-by: Claude <noreply@anthropic.com>
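A minimal sketch of both points for the DocumentBuilderFactory case, using the standard Xerces/SAX feature URIs named above (the SAXParserFactory and TransformerFactory hardening follows the same shape):

```java
import java.io.File;
import java.io.FileInputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class HardenedXml {
    private static final DocumentBuilderFactory DBF = DocumentBuilderFactory.newInstance();

    static {
        // Each option in its own try/catch: an unsupported feature on a
        // given JAXP implementation must not break class initialization.
        trySet("http://apache.org/xml/features/disallow-doctype-decl", true);
        trySet("http://xml.org/sax/features/external-general-entities", false);
        trySet("http://xml.org/sax/features/external-parameter-entities", false);
        trySet("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        try { DBF.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true); } catch (Exception ignored) { }
        DBF.setXIncludeAware(false);
        DBF.setExpandEntityReferences(false);
    }

    private static void trySet(String feature, boolean value) {
        try { DBF.setFeature(feature, value); } catch (Exception ignored) { }
    }

    // Deterministic stream closure plus a preserved systemId, as in
    // DatasetParser.processTEI: the try-with-resources closes the FD
    // immediately, and setSystemId keeps relative-reference resolution
    // and error locations working as with parse(File).
    static synchronized Document parse(File file) throws Exception {
        try (FileInputStream in = new FileInputStream(file)) {
            InputSource source = new InputSource(in);
            source.setSystemId(file.toURI().toString());
            return DBF.newDocumentBuilder().parse(source);
        }
    }
}
```

With disallow-doctype-decl active, any TEI input carrying a DOCTYPE (the XXE vector) is rejected at parse time instead of triggering entity resolution.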
Summary
Performance
DataTypeClassifier and DatasetContextClassifier were creating new ObjectMapper() on every classify() call. ObjectMapper is thread-safe and expensive to instantiate. Replaced with a private static final instance shared across all calls.

Memory leak fix

processDatasetPDF() calls DatasetParser.processPDF(), which returns a Document holding native PDF parser resources via DocumentSource. These were never released, leaking memory on every request. Added DocumentSource.close() in the finally block.

CI / Deployment

- Updated JDK 17 -> 21 in the ci-build.yml build step (matches the java.toolchain.languageVersion = 21 in build.gradle)
- New deploy-hf-spaces job triggers a factory reboot of the lfoppiano/datastet-dev Space after a successful Docker build on non-PR pushes. Requires the HF_TOKEN secret in repo settings.

Test plan

- ./gradlew clean compileJava passes
- ./gradlew test passes
- No remaining new ObjectMapper() in hot paths
- HF Spaces deploy (requires the HF_TOKEN secret)

https://claude.ai/code/session_018EBZhK2RtGtsvN4E1rp2tF
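For the deploy step, the request the deploy-hf-spaces job needs to issue can be sketched as below. The endpoint shape is an assumption mirroring huggingface_hub's restart_space() (POST /api/spaces/{owner}/{space}/restart?factory=true with a Bearer token); only request construction is shown, sending is omitted:

```java
import java.net.URI;
import java.net.http.HttpRequest;

class HfSpaceReboot {
    // Build the factory-reboot request for a Space such as
    // "lfoppiano/datastet-dev". Assumption: the REST endpoint matches
    // what huggingface_hub's restart_space(factory_reboot=True) calls.
    static HttpRequest buildRequest(String space, String token) {
        return HttpRequest.newBuilder()
                .uri(URI.create("https://huggingface.co/api/spaces/" + space + "/restart?factory=true"))
                .header("Authorization", "Bearer " + token)
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
    }
}
```

In the workflow itself this is a one-line curl with the HF_TOKEN repo secret, gated on the Docker build job succeeding and the event not being a pull request.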