Fast export implementation fails to close GZIPInputStream #2040

Closed
lmsurpre opened this issue Mar 9, 2021 · 1 comment
Labels
bug (Something isn't working)

Comments


lmsurpre commented Mar 9, 2021

Describe the bug
In our Kubernetes deployment, our pod kept getting killed for exceeding its memory limit. This was very strange because we have the max heap size set at 4 GB, so the excess memory had to be coming from outside the Java heap.

To Reproduce
Run multiple large exports at the same time.

Expected behavior
Export may slow down under load, but pods shouldn't be leaking memory like this.

Additional context
It looks like we're failing to close the GZIPInputStream used in this fast export implementation. GZIPInputStream uses an Inflater internally, which holds native zlib memory that is only freed on an explicit call to close().
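
For illustration, here's a minimal sketch of the fix pattern (a hypothetical method, not the actual `ResourcePayload.transferTo` code): wrapping the `GZIPInputStream` in try-with-resources guarantees `close()` runs, which ends the internal `Inflater` and releases its native zlib memory.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;

public class GzipTransferSketch {
    /**
     * Decompresses {@code compressed} into {@code out}.
     * try-with-resources guarantees GZIPInputStream.close() is called,
     * even on exceptions, so the Inflater's native buffers are released.
     */
    static long transferTo(InputStream compressed, OutputStream out) throws IOException {
        long total = 0;
        try (GZIPInputStream gzip = new GZIPInputStream(compressed)) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = gzip.read(buffer)) != -1) {
                out.write(buffer, 0, n);
                total += n;
            }
        }
        return total;
    }
}
```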

lmsurpre added the bug label Mar 9, 2021
lmsurpre added a commit that referenced this issue Mar 9, 2021
The main one was in ResourcePayload.transferTo because, without this, we
never free the non-heap memory being held by the GZIPInputStream.

Signed-off-by: Lee Surprenant <lmsurpre@us.ibm.com>
lmsurpre added a commit that referenced this issue Mar 10, 2021
Remove the dataStream.close() from BulkDataUtils.multiPartUpload and add
javadoc to make it clear this isn't closed. Also some formatting/cleanup
in classes that call this (while I was making sure none of them needed
to be closed).

Additionally, I updated the ImportTransientUserData.Builder builder
signatures to return ImportTransientUserData builders and objects
instead of ImportCheckPointData; accomplished this by following the same
pattern we use in fhir-model.

Signed-off-by: Lee Surprenant <lmsurpre@us.ibm.com>
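
As a sketch of the first change, a javadoc note can make the ownership contract explicit so callers know the stream is theirs to close (hypothetical signature; the real method is `BulkDataUtils.multiPartUpload`):

```java
import java.io.InputStream;

public interface UploadSketch {
    /**
     * Uploads the given stream as one part of a multi-part upload.
     * <p>
     * Note: this method does <em>not</em> close {@code dataStream};
     * the caller retains ownership and must close it.
     */
    void multiPartUpload(String bucketName, String objectKey, InputStream dataStream);
}
```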
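
And a minimal sketch of the builder change (hypothetical types standing in for `ImportCheckPointData` and `ImportTransientUserData`): each overridden builder method narrows its return type, so chained calls stay on the subtype's builder.

```java
class CheckPointData {
    static class Builder {
        int pageNum;

        Builder pageNum(int pageNum) {
            this.pageNum = pageNum;
            return this;
        }

        CheckPointData build() {
            return new CheckPointData();
        }
    }
}

class TransientUserData extends CheckPointData {
    static class Builder extends CheckPointData.Builder {
        @Override
        Builder pageNum(int pageNum) {
            super.pageNum(pageNum);
            return this; // covariant return keeps the chain on the subtype
        }

        @Override
        TransientUserData build() {
            return new TransientUserData();
        }
    }
}

// Without the overrides, the chain below would degrade to the super type
// after the first call, and build() would return a CheckPointData:
// TransientUserData data = new TransientUserData.Builder().pageNum(1).build();
```
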
lmsurpre added a commit that referenced this issue Mar 10, 2021
I think that we inadvertently changed the S3Provider.writeResources
logic as part of a late change in #1999 (the one where we set
`chunkData.isFinishCurrentUpload` when getPageNum > lastPageNum).
Specifically, in the NDJSON case, we still want to start writing parts
as soon as we have crossed the partUploadTriggerSize. This attempts to
fix that and removes some now-unnecessary `or` expressions from
S3Provider.

I also made a number of config-related changes (some of which are
probably more opinion-based than strictly needed):
* Reconciled the Fast impl's `partUploadTriggerSize` with the legacy
impl's `minPartSize` (by adopting the fast impl's name)
* Updated the maxFileSizeThreshold and minPartSize params to
`objectSizeThreshold` and `objectResourceCountThreshold` respectively
* Added "MB" after the size names to make it clear what the units are
without needing to look at the docs
* Changed ExportCheckpointAlgorithm to treat threshold values of 0 as
"no threshold" (formerly these were -1, but that is now impossible to
set because we multiply the config values in our getters)
* Fixed the javadoc in ConfigurationAdapter accordingly

Signed-off-by: Lee Surprenant <lmsurpre@us.ibm.com>
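
For illustration, here's a sketch of the two decisions described above (the field and method names are hypothetical, not the actual `S3Provider` code): part uploads trigger on buffered size alone, and a threshold of 0 disables that check entirely.

```java
public class ExportThresholdSketch {
    /** In the NDJSON case, start writing a part as soon as the buffered
     *  bytes cross the trigger size, independent of other conditions. */
    static boolean shouldWritePart(long bufferedBytes, long partUploadTriggerSizeMB) {
        return bufferedBytes >= partUploadTriggerSizeMB * 1024L * 1024L;
    }

    /** A threshold value of 0 means "no threshold": that check is skipped. */
    static boolean shouldFinishObject(long objectBytes, long objectSizeThresholdMB,
                                      long resourceCount, long objectResourceCountThreshold) {
        boolean sizeExceeded = objectSizeThresholdMB > 0
                && objectBytes >= objectSizeThresholdMB * 1024L * 1024L;
        boolean countExceeded = objectResourceCountThreshold > 0
                && resourceCount >= objectResourceCountThreshold;
        return sizeExceeded || countExceeded;
    }
}
```
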
lmsurpre added a commit that referenced this issue Mar 10, 2021
1. coreCosObjectResourceCountThreshold is now consistently an int
2. systemExportImpl is now "fast" in our default config

Signed-off-by: Lee Surprenant <lmsurpre@us.ibm.com>
prb112 added a commit that referenced this issue Mar 10, 2021
issue #2040 - close input streams in finally blocks
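
A minimal sketch of that pattern (a hypothetical method, not the actual commit's code), for call sites where try-with-resources doesn't fit the control flow:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

public class FinallyCloseSketch {
    static void consume(InputStream raw) throws IOException {
        GZIPInputStream gzip = null;
        try {
            gzip = new GZIPInputStream(raw);
            byte[] buf = new byte[8192];
            while (gzip.read(buf) != -1) {
                // process the decompressed bytes here
            }
        } finally {
            if (gzip != null) {
                gzip.close(); // ends the Inflater and frees native zlib memory
            }
        }
    }
}
```
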
lmsurpre (Member Author) commented

After finding and fixing the leak, I've launched numerous exports in our test environment and the pods are properly staying within their limits.

lmsurpre added a commit that referenced this issue Mar 14, 2021
1. Previously we were looping through all of the ReadResultDTOs
(presumably 1 per page of search results), and entering our logic for
each page. However, the chunkData buffer stream contains the content
from all the results, so this doesn't make much sense to me.

2. Renamed lastWritePageNum to lastWrittenPageNum to make its purpose
clearer (more distinct from lastPageNum)

3. Updated ExportOperationTest to look for an 'output' key in the
response JSON itself, not just anywhere in the response body

4. Added logging to ExportCheckpointAlgorithm

5. ExportTransientUserData.Builder methods now return
ExportTransientUserData instead of the super type (same change was made
to ImportTransientUserData previously)

6. Page number now matches the actual page number read; previously it
was always 1 higher

7. Set the checkpoint-policy of Group and Patient export to 'custom' to
indicate that the ExportCheckpointAlgorithm will be used to control
checkpointing

Signed-off-by: Lee Surprenant <lmsurpre@us.ibm.com>
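
For context on item 7, here's a minimal sketch of a JSR-352 custom checkpoint algorithm (placeholder decision logic, not the real `ExportCheckpointAlgorithm`): with `checkpoint-policy` set to `custom`, the batch container consults this artifact to decide when to checkpoint instead of counting items.

```java
import javax.batch.api.chunk.AbstractCheckpointAlgorithm;

public class CheckpointSketch extends AbstractCheckpointAlgorithm {
    private int itemsSinceCheckpoint = 0;

    /** The batch container calls this during chunk processing; returning
     *  true triggers a checkpoint. Real logic would consult the job's
     *  transient user data (buffered bytes, pages read, and so on). */
    @Override
    public boolean isReadyToCheckpoint() throws Exception {
        return ++itemsSinceCheckpoint % 10 == 0; // placeholder decision
    }
}
```
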
lmsurpre added a commit that referenced this issue Mar 15, 2021
issue #2040 - cleanup in S3Provider.writeResources