Optimize performance, fix memory leak, add HF Spaces auto-deploy #7
Merged
Conversation
Dropwizard 4.x uses strict Jackson deserialization and rejects unknown YAML keys. config.yml has version, corpusPath, models, entityFishingHost, etc. at the top level, but DatastetServiceConfiguration only declared grobidHome, datastetConfiguration, maxParallelRequests, and the CORS fields.

Add all missing top-level fields to DatastetServiceConfiguration, matching the pattern from software-mentions' SoftwareServiceConfiguration: version, entityFishingHost/Port, corpusPath, templatePath, tmpPath, pub2teiPath, useBinaryContextClassifiers, models.

Fixes the "Unrecognized field at: version" crash on startup.

https://claude.ai/code/session_018EBZhK2RtGtsvN4E1rp2tF
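A minimal sketch of the added bindings, assuming plain fields with bean-style getters/setters (which Jackson binds by name). The real class extends Dropwizard's Configuration and keeps its existing grobidHome, datastetConfiguration, maxParallelRequests, and CORS fields; the models element type shown here is an assumption.

```java
import java.util.List;

// Sketch only: the real DatastetServiceConfiguration extends
// io.dropwizard.core.Configuration. Each field mirrors a top-level
// key in config.yml so strict Jackson deserialization accepts it.
class DatastetServiceConfiguration {
    private String version;
    private String entityFishingHost;
    private int entityFishingPort;
    private String corpusPath;
    private String templatePath;
    private String tmpPath;
    private String pub2teiPath;
    private boolean useBinaryContextClassifiers;
    private List<String> models;   // element type is an assumption

    public String getVersion() { return version; }
    public void setVersion(String version) { this.version = version; }

    public String getCorpusPath() { return corpusPath; }
    public void setCorpusPath(String corpusPath) { this.corpusPath = corpusPath; }

    public List<String> getModels() { return models; }
    public void setModels(List<String> models) { this.models = models; }

    // ... remaining getters/setters follow the same pattern
}
```

Without these accessors, Dropwizard's strict mapper throws on the first unknown key it meets, which is why the crash names the first top-level field, version.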
Performance:
- Reuse a static ObjectMapper in DataTypeClassifier and DatasetContextClassifier instead of creating new instances per classify() call. ObjectMapper is thread-safe and expensive to create.

Memory leak:
- Close DocumentSource after PDF processing in processDatasetPDF(). The grobid Document holds native PDF parser resources via DocumentSource that were never released, leaking memory per request.

CI:
- Update JDK 17 -> 21 in ci-build.yml (matches the java toolchain)
- Add deploy-hf-spaces job: triggers a factory reboot of the HuggingFace Space lfoppiano/datastet-dev after a successful Docker build on non-PR pushes. Requires HF_TOKEN secret in repo settings.

https://claude.ai/code/session_018EBZhK2RtGtsvN4E1rp2tF
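Both fixes are standard patterns. A stdlib-only sketch, where ExpensiveParser and NativeSource are hypothetical stand-ins for Jackson's ObjectMapper and grobid's DocumentSource (neither is on the classpath here):

```java
import java.util.concurrent.atomic.AtomicInteger;

class ReusePatternDemo {
    // Stand-in for Jackson's ObjectMapper: thread-safe, costly to construct.
    static class ExpensiveParser {
        static final AtomicInteger CONSTRUCTED = new AtomicInteger();
        ExpensiveParser() { CONSTRUCTED.incrementAndGet(); }
        String classify(String input) { return input.trim(); }
    }

    // Before: "new ExpensiveParser()" inside classify() on every call.
    // After: one shared instance, as with "private static final ObjectMapper".
    private static final ExpensiveParser PARSER = new ExpensiveParser();

    static String classify(String input) {
        return PARSER.classify(input);
    }

    // Stand-in for grobid's DocumentSource holding native PDF resources.
    static class NativeSource implements AutoCloseable {
        boolean closed = false;
        @Override public void close() { closed = true; }
    }

    // Pattern used in processDatasetPDF(): release the source in finally,
    // so a failure mid-processing cannot leak the native handle.
    static boolean processPdf() {
        NativeSource source = new NativeSource();
        try {
            // ... run the dataset parser over the document ...
        } finally {
            source.close();
        }
        return source.closed;
    }
}
```

However many times classify() runs, only one parser is ever constructed, and the native source is released on every code path, including exceptions.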
lfoppiano pushed a commit that referenced this pull request on Apr 16, 2026
PR #7 fixed a leak in the PDF path by closing DocumentSource; TEI has no DocumentSource, but heap and file-descriptor usage still grew under sustained /processDatasetTEI load.

Root causes:
1. JAXP factory churn. DocumentBuilderFactory.newInstance(), XPathFactory.newInstance(), TransformerFactory.newInstance(), and SAXParserFactory.newInstance() ran on every request (and, in XMLUtilities.segment(), on every sentence). Each call re-runs ServiceLoader discovery and produces factories whose classloader-backed caches are not reclaimed promptly.
2. DocumentBuilder.parse(File) defers FileInputStream closure to Xerces, accumulating FDs under sustained load.
3. The parsed DOM Document was left as a local reference until method return, delaying young-gen reclaim of a large node graph.

Fix:
- Cache factories as private static finals in DatasetParser and XMLUtilities, with synchronized accessors (newDocumentBuilder, newXPath, newSAXParser, newTransformer). Factories' new*() methods are not guaranteed thread-safe; synchronized access keeps contention negligible versus XML parse cost and avoids ThreadLocal leaks in the Dropwizard thread pool.
- Parse TEI via try-with-resources FileInputStream so the handle is released deterministically.
- Null the parsed Document reference in finally blocks to aid GC.
- Remove dead DocumentBuilderFactory allocation in processXML(File).

No PDF-path changes; PR #7's DocumentSource.close() is preserved.
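The caching pattern described above can be sketched with stdlib JAXP alone; accessor names mirror the ones the commit lists, and error handling is simplified:

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;

final class XmlFactories {
    // Created once at class load instead of per request/sentence,
    // avoiding repeated ServiceLoader discovery on every call.
    private static final DocumentBuilderFactory DBF = DocumentBuilderFactory.newInstance();
    private static final XPathFactory XPF = XPathFactory.newInstance();

    // The factories' new*() methods are not guaranteed thread-safe,
    // so hand out per-call products under a lock; the products
    // themselves (DocumentBuilder, XPath) are used by one thread only.
    static synchronized DocumentBuilder newDocumentBuilder() throws ParserConfigurationException {
        return DBF.newDocumentBuilder();
    }

    static synchronized XPath newXPath() {
        return XPF.newXPath();
    }
}
```

Each request still gets its own DocumentBuilder and XPath (cheap), while the expensive factory discovery happens exactly once per JVM, with no per-thread state parked in the Dropwizard pool.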
lfoppiano added a commit that referenced this pull request on Apr 16, 2026
#20) * Fix memory leak in TEI processing: cache JAXP factories, close streams

* Harden cached JAXP factories and preserve systemId on TEI parse

Addresses Copilot review feedback on the previous commit:

1. XXE/SSRF hardening.
Because the cached DocumentBuilderFactory / SAXParserFactory / TransformerFactory parse user-supplied XML/TEI, the static initializers now apply OWASP-recommended hardening:
- FEATURE_SECURE_PROCESSING
- disallow-doctype-decl
- external-general-entities = false
- external-parameter-entities = false
- nonvalidating/load-external-dtd = false
- setXIncludeAware(false), setExpandEntityReferences(false)
- TransformerFactory: ACCESS_EXTERNAL_DTD / ACCESS_EXTERNAL_STYLESHEET pinned to empty string

Each feature/attribute is set in its own try/catch so unsupported options on a given JAXP implementation do not break class init.

2. Preserve systemId. DocumentBuilder.parse(File) previously set the document systemId to the file URI, which matters for relative references and error locations. Restore it by setting inputSource.setSystemId(file.toURI().toString()) on the InputSource wrapper in DatasetParser.processTEI while keeping the try-with-resources FileInputStream for deterministic stream closure.

Co-authored-by: Claude <noreply@anthropic.com>
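A minimal sketch of both points for the DocumentBuilderFactory case, using the standard Xerces/SAX feature URIs named above (the SAXParserFactory and TransformerFactory hardening follows the same shape):

```java
import java.io.File;
import java.io.FileInputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class HardenedXml {
    private static final DocumentBuilderFactory DBF = DocumentBuilderFactory.newInstance();

    static {
        // Each option in its own try/catch: an unsupported feature on a
        // given JAXP implementation must not break class initialization.
        trySet("http://apache.org/xml/features/disallow-doctype-decl", true);
        trySet("http://xml.org/sax/features/external-general-entities", false);
        trySet("http://xml.org/sax/features/external-parameter-entities", false);
        trySet("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        try { DBF.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true); } catch (Exception ignored) { }
        DBF.setXIncludeAware(false);
        DBF.setExpandEntityReferences(false);
    }

    private static void trySet(String feature, boolean value) {
        try { DBF.setFeature(feature, value); } catch (Exception ignored) { }
    }

    // Deterministic stream closure plus a preserved systemId, as in
    // DatasetParser.processTEI: the try-with-resources closes the FD
    // immediately, and setSystemId keeps relative-reference resolution
    // and error locations working as with parse(File).
    static synchronized Document parse(File file) throws Exception {
        try (FileInputStream in = new FileInputStream(file)) {
            InputSource source = new InputSource(in);
            source.setSystemId(file.toURI().toString());
            return DBF.newDocumentBuilder().parse(source);
        }
    }
}
```

With disallow-doctype-decl active, any TEI input carrying a DOCTYPE (the XXE vector) is rejected at parse time instead of triggering entity resolution.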
Summary
Performance
DataTypeClassifier and DatasetContextClassifier were creating new ObjectMapper() on every classify() call. ObjectMapper is thread-safe and expensive to instantiate. Replaced with a private static final instance shared across all calls.

Memory leak fix

processDatasetPDF() calls DatasetParser.processPDF(), which returns a Document holding native PDF parser resources via DocumentSource. These were never released, leaking memory on every request. Added DocumentSource.close() in the finally block.

CI / Deployment

- Updated JDK 17 -> 21 in the ci-build.yml build step (matches the java.toolchain.languageVersion = 21 in build.gradle)
- New deploy-hf-spaces job triggers a factory reboot of the lfoppiano/datastet-dev Space after a successful Docker build on non-PR pushes. Requires the HF_TOKEN secret in repo settings.

Test plan

- ./gradlew clean compileJava passes
- ./gradlew test passes
- No remaining new ObjectMapper() in hot paths
- HF Spaces deploy (requires the HF_TOKEN secret)

https://claude.ai/code/session_018EBZhK2RtGtsvN4E1rp2tF
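For the deploy step, the request the deploy-hf-spaces job needs to issue can be sketched as below. The endpoint shape is an assumption mirroring huggingface_hub's restart_space() (POST /api/spaces/{owner}/{space}/restart?factory=true with a Bearer token); only request construction is shown, sending is omitted:

```java
import java.net.URI;
import java.net.http.HttpRequest;

class HfSpaceReboot {
    // Build the factory-reboot request for a Space such as
    // "lfoppiano/datastet-dev". Assumption: the REST endpoint matches
    // what huggingface_hub's restart_space(factory_reboot=True) calls.
    static HttpRequest buildRequest(String space, String token) {
        return HttpRequest.newBuilder()
                .uri(URI.create("https://huggingface.co/api/spaces/" + space + "/restart?factory=true"))
                .header("Authorization", "Bearer " + token)
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
    }
}
```

In the workflow itself this is a one-line curl with the HF_TOKEN repo secret, gated on the Docker build job succeeding and the event not being a pull request.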