Fast export implementation fails to close GZIPInputStream #2040

Closed
lmsurpre opened this issue Mar 9, 2021 · 1 comment
Labels
bug (Something isn't working)

Comments


lmsurpre commented Mar 9, 2021

Describe the bug
In our Kubernetes deployment, our pod kept getting killed for exceeding its memory limit. This was very strange because we have the max heap size set at 4 GB, so the excess memory had to be coming from outside the Java heap.

To Reproduce
Run multiple large exports at the same time.

Expected behavior
Export may slow down under load, but pods shouldn't be leaking memory like this.

Additional context
It looks like we're failing to close the GZIPInputStream used in this fast export implementation. GZIPInputStream uses an Inflater internally, which holds native zlib memory that is only freed on an explicit call to close().
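
For illustration, here's a minimal sketch of the fix pattern (a hypothetical method, not the actual `ResourcePayload.transferTo` code): wrapping the `GZIPInputStream` in try-with-resources guarantees `close()` runs, which ends the internal `Inflater` and releases its native zlib memory.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;

public class GzipTransferSketch {
    /**
     * Decompresses {@code compressed} into {@code out}.
     * try-with-resources guarantees GZIPInputStream.close() is called,
     * even on exceptions, so the Inflater's native buffers are released.
     */
    static long transferTo(InputStream compressed, OutputStream out) throws IOException {
        long total = 0;
        try (GZIPInputStream gzip = new GZIPInputStream(compressed)) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = gzip.read(buffer)) != -1) {
                out.write(buffer, 0, n);
                total += n;
            }
        }
        return total;
    }
}
```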

lmsurpre added the bug label Mar 9, 2021
lmsurpre added a commit that referenced this issue Mar 9, 2021
The main one was in ResourcePayload.transferTo because, without this, we
never free the non-heap memory being held by the GZIPInputStream.

Signed-off-by: Lee Surprenant <lmsurpre@us.ibm.com>
lmsurpre added a commit that referenced this issue Mar 10, 2021
Remove the dataStream.close() from BulkDataUtils.multiPartUpload and add
javadoc to make it clear this isn't closed. Also some formatting/cleanup
in classes that call this (while I was making sure none of them needed
to be closed).

Additionally, I updated the ImportTransientUserData.Builder builder
signatures to return ImportTransientUserData builders and objects
instead of ImportCheckPointData; accomplished this by following the same
pattern we use in fhir-model.

Signed-off-by: Lee Surprenant <lmsurpre@us.ibm.com>
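
As a sketch of the first change, a javadoc note can make the ownership contract explicit so callers know the stream is theirs to close (hypothetical signature; the real method is `BulkDataUtils.multiPartUpload`):

```java
import java.io.InputStream;

public interface UploadSketch {
    /**
     * Uploads the given stream as one part of a multi-part upload.
     * <p>
     * Note: this method does <em>not</em> close {@code dataStream};
     * the caller retains ownership and must close it.
     */
    void multiPartUpload(String bucketName, String objectKey, InputStream dataStream);
}
```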
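
And a minimal sketch of the builder change (hypothetical types standing in for `ImportCheckPointData` and `ImportTransientUserData`): each overridden builder method narrows its return type, so chained calls stay on the subtype's builder.

```java
class CheckPointData {
    static class Builder {
        int pageNum;

        Builder pageNum(int pageNum) {
            this.pageNum = pageNum;
            return this;
        }

        CheckPointData build() {
            return new CheckPointData();
        }
    }
}

class TransientUserData extends CheckPointData {
    static class Builder extends CheckPointData.Builder {
        @Override
        Builder pageNum(int pageNum) {
            super.pageNum(pageNum);
            return this; // covariant return keeps the chain on the subtype
        }

        @Override
        TransientUserData build() {
            return new TransientUserData();
        }
    }
}

// Without the overrides, the chain below would degrade to the super type
// after the first call, and build() would return a CheckPointData:
// TransientUserData data = new TransientUserData.Builder().pageNum(1).build();
```
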
lmsurpre added a commit that referenced this issue Mar 10, 2021
I think that we inadvertently changed the S3Provider.writeResources
logic as part of a late change in #1999 (the one where we set
`chunkData.isFinishCurrentUpload` when getPageNum > lastPageNum).
Specifically, in the NDJSON case, we still want to start writing parts
as soon as we have crossed the partUploadTriggerSize. This attempts to
fix that and removes some now-unnecessary `or` expressions from
S3Provider.

I also made a number of config-related changes (some of which are
probably more opinion-based than strictly needed):
* Reconciled the Fast impl's `partUploadTriggerSize` with the legacy
impl's `minPartSize` (by adopting the fast impl's name)
* Updated the maxFileSizeThreshold and minPartSize params to
`objectSizeThreshold` and `objectResourceCountThreshold` respectively
* Added "MB" after the size names to make it clear what the units are
without needing to look at the docs
* Changed ExportCheckpointAlgorithm to treat threshold values of 0 as
"no threshold" (formerly these were -1, but that is now impossible to
set because we multiply the config values in our getters)
* Fixed the javadoc in ConfigurationAdapter accordingly

Signed-off-by: Lee Surprenant <lmsurpre@us.ibm.com>
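
For illustration, here's a sketch of the two decisions described above (the field and method names are hypothetical, not the actual `S3Provider` code): part uploads trigger on buffered size alone, and a threshold of 0 disables that check entirely.

```java
public class ExportThresholdSketch {
    /** In the NDJSON case, start writing a part as soon as the buffered
     *  bytes cross the trigger size, independent of other conditions. */
    static boolean shouldWritePart(long bufferedBytes, long partUploadTriggerSizeMB) {
        return bufferedBytes >= partUploadTriggerSizeMB * 1024L * 1024L;
    }

    /** A threshold value of 0 means "no threshold": that check is skipped. */
    static boolean shouldFinishObject(long objectBytes, long objectSizeThresholdMB,
                                      long resourceCount, long objectResourceCountThreshold) {
        boolean sizeExceeded = objectSizeThresholdMB > 0
                && objectBytes >= objectSizeThresholdMB * 1024L * 1024L;
        boolean countExceeded = objectResourceCountThreshold > 0
                && resourceCount >= objectResourceCountThreshold;
        return sizeExceeded || countExceeded;
    }
}
```
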
lmsurpre added a commit that referenced this issue Mar 10, 2021
1. coreCosObjectResourceCountThreshold is now consistently an int
2. systemExportImpl is now "fast" in our default config

Signed-off-by: Lee Surprenant <lmsurpre@us.ibm.com>
prb112 added a commit that referenced this issue Mar 10, 2021
issue #2040 - close input streams in finally blocks
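
A minimal sketch of that pattern (a hypothetical method, not the actual commit's code), for call sites where try-with-resources doesn't fit the control flow:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

public class FinallyCloseSketch {
    static void consume(InputStream raw) throws IOException {
        GZIPInputStream gzip = null;
        try {
            gzip = new GZIPInputStream(raw);
            byte[] buf = new byte[8192];
            while (gzip.read(buf) != -1) {
                // process the decompressed bytes here
            }
        } finally {
            if (gzip != null) {
                gzip.close(); // ends the Inflater and frees native zlib memory
            }
        }
    }
}
```
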
lmsurpre (Member Author) commented

After finding and fixing the leak, I've launched numerous exports in our test environment and the pods are properly staying within their limits.

lmsurpre added a commit that referenced this issue Mar 14, 2021
1. Previously we were looping through all of the ReadResultDTOs
(presumably 1 per page of search results), and entering our logic for
each page. However, the chunkData buffer stream contains the content
from all the results, so this doesn't make much sense to me.

2. Renamed lastWritePageNum to lastWrittenPageNum to make its purpose
clearer (more distinct from lastPageNum)

3. Updated ExportOperationTest to look for an 'output' key in the
response JSON itself, not just anywhere in the response body

4. Added logging to ExportCheckpointAlgorithm

5. ExportTransientUserData.Builder methods now return
ExportTransientUserData instead of the super type (same change was made
to ImportTransientUserData previously)

6. Page number now matches the actual page number read; previously it
was always 1 higher

7. Set the checkpoint-policy of Group and Patient export to 'custom' to
indicate that the ExportCheckpointAlgorithm will be used to control
checkpointing

Signed-off-by: Lee Surprenant <lmsurpre@us.ibm.com>
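
For context on item 7, here's a minimal sketch of a JSR-352 custom checkpoint algorithm (placeholder decision logic, not the real `ExportCheckpointAlgorithm`): with `checkpoint-policy` set to `custom`, the batch container consults this artifact to decide when to checkpoint instead of counting items.

```java
import javax.batch.api.chunk.AbstractCheckpointAlgorithm;

public class CheckpointSketch extends AbstractCheckpointAlgorithm {
    private int itemsSinceCheckpoint = 0;

    /** The batch container calls this during chunk processing; returning
     *  true triggers a checkpoint. Real logic would consult the job's
     *  transient user data (buffered bytes, pages read, and so on). */
    @Override
    public boolean isReadyToCheckpoint() throws Exception {
        return ++itemsSinceCheckpoint % 10 == 0; // placeholder decision
    }
}
```
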
lmsurpre added a commit that referenced this issue Mar 15, 2021
issue #2040 - cleanup in S3Provider.writeResources