non-fast bulk export writes to many files instead of honoring settings #2666

Closed
lmsurpre opened this issue Aug 5, 2021 · 1 comment
lmsurpre commented Aug 5, 2021

Describe the bug
I have a bulkdata config with an ibm-cos storageProvider and the following COS settings:

            "cos" : {
              "partUploadTriggerSizeMB": 10,
              "objectSizeThresholdMB": 200,
              "objectResourceCountThreshold": 200000
            },

The fast export seems to be working as desired, but when I add a typeFilter it falls back to the search-based export implementation and the result is that the resources are being split across many more COS objects than they should be.

I turned up tracing and it looks like the currentUploadSize is being computed incorrectly:

2021-08-05 16:47:32.530 wh-cmsiop-cthon-fhir-internal-799fc96897-65rg7 fhir-internal FINE isReadyToCheckpoint [readyToWrite=false (37343 >= 10485760), overSizeThreshold=false (37343 >= 209715200), overResourceCountThreshold=false (100 >= 200000), end=false (1 >= 328)]
2021-08-05 16:47:32.674 wh-cmsiop-cthon-fhir-internal-799fc96897-65rg7 fhir-internal FINE isReadyToCheckpoint [readyToWrite=false (73858 >= 10485760), overSizeThreshold=false (111201 >= 209715200), overResourceCountThreshold=false (200 >= 200000), end=false (2 >= 328)]
...
2021-08-05 16:47:44.294 wh-cmsiop-cthon-fhir-internal-799fc96897-65rg7 fhir-internal FINE isReadyToCheckpoint [readyToWrite=false (3796495 >= 10485760), overSizeThreshold=false (199361860 >= 209715200), overResourceCountThreshold=false (10400 >= 200000), end=false (104 >= 328)]
2021-08-05 16:47:45.738 wh-cmsiop-cthon-fhir-internal-799fc96897-65rg7 fhir-internal INFO pushFhirJsonsToCos: '3905969' bytes were successfully appended to COS object - '0vzo_7VIZWnkThMov9MmU8AO7JdwylM9jHTEEGd5JD0/Patient_1.ndjson' uploadId='0100017b-1849-57a1-aa48-8cc67e858702'

Specifically, note how the "overSizeThreshold" byte count grows faster than the "readyToWrite" byte count, even before we've started an upload and cleared the buffer.
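The trace values fit a model in which the whole buffer size is added to the running total after every page. A minimal Python sketch of that arithmetic follows; it is an illustration only, not the server's code, and the function name `total_when_adding_whole_buffer` is hypothetical.

```python
def total_when_adding_whole_buffer(buffer_sizes):
    """Sum the full buffer size after every page (the suspected miscount).

    buffer_sizes holds the buffer size observed after each page, with no
    upload (and therefore no buffer reset) in between.
    """
    total = 0
    for size in buffer_sizes:
        total += size
    return total


# Buffer sizes after page 1 and page 2, copied from the trace above.
pages = [37343, 73858]
print(total_when_adding_whole_buffer(pages))  # 111201
```

The result, 111201, equals the second "overSizeThreshold" value in the trace, while the buffer itself holds only 73858 bytes at that point.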

Environment
Which version of IBM FHIR Server?

To Reproduce
Steps to reproduce the behavior:

  1. configure the server with a bulkdata storageProvider that uses s3 (e.g. aws-s3 or ibm-cos)
  2. issue an export command with a typeFilter (e.g. GET [base]/$export?_type=Patient&_typeFilter=Patient?_elements=gender)
  3. note that the results are spread across more COS objects than they should be
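For step 2, the request URL can be assembled as in the sketch below. The base URL is a made-up placeholder and `build_export_url` is a hypothetical helper; the only assumption about the request itself is that the `_typeFilter` value is percent-encoded when sent.

```python
from urllib.parse import urlencode


def build_export_url(base, resource_type, type_filter):
    """Build a system-level $export URL with _type and _typeFilter."""
    query = urlencode({"_type": resource_type, "_typeFilter": type_filter})
    return f"{base}/$export?{query}"


# Placeholder base URL; substitute the [base] of your own server.
url = build_export_url(
    "https://example.com/fhir-server/api/v4",
    "Patient",
    "Patient?_elements=gender",
)
print(url)
# https://example.com/fhir-server/api/v4/$export?_type=Patient&_typeFilter=Patient%3F_elements%3Dgender
```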

Expected behavior
The system should continue writing to a single COS object until either the objectSizeThresholdMB or the objectResourceCountThreshold has been reached.
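As a rough sketch of that expectation, using the settings from the config above: the constant and function names here are hypothetical and only model the decision, they are not taken from the server.

```python
MB = 1024 * 1024

PART_UPLOAD_TRIGGER_BYTES = 10 * MB        # partUploadTriggerSizeMB: 10
OBJECT_SIZE_THRESHOLD_BYTES = 200 * MB     # objectSizeThresholdMB: 200
OBJECT_RESOURCE_COUNT_THRESHOLD = 200000   # objectResourceCountThreshold


def should_upload_part(buffer_bytes):
    """True once the in-memory buffer is large enough to upload as a part."""
    return buffer_bytes >= PART_UPLOAD_TRIGGER_BYTES


def should_start_new_object(object_bytes, object_resources):
    """True once the current COS object has hit either configured limit."""
    return (object_bytes >= OBJECT_SIZE_THRESHOLD_BYTES
            or object_resources >= OBJECT_RESOURCE_COUNT_THRESHOLD)


print(PART_UPLOAD_TRIGGER_BYTES)    # 10485760
print(OBJECT_SIZE_THRESHOLD_BYTES)  # 209715200
```

The two printed byte values are the same limits that appear on the right-hand side of the comparisons in the trace.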

Additional context

lmsurpre added the bug (Something isn't working) label Aug 5, 2021
lmsurpre added this to the Sprint 2021-11 milestone Aug 5, 2021
lmsurpre self-assigned this Aug 5, 2021
lmsurpre added a commit that referenced this issue Aug 6, 2021
Previously, we added the entire size of the buffer after each page of
results was read. This leads us to think that we have a lot more data
than we actually do. Now we will add only the new bytes.

Signed-off-by: Lee Surprenant <lmsurpre@us.ibm.com>
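
The change described in this commit message can be modeled roughly as follows. This is a Python sketch under stated assumptions, not the actual patch: the class `UploadSizeTracker` and its method names are hypothetical.

```python
class UploadSizeTracker:
    """Tracks the bytes attributed to the current COS object across pages."""

    def __init__(self):
        self.current_upload_size = 0  # bytes counted toward the current object
        self._last_buffer_size = 0    # buffer size seen after the previous page

    def page_read(self, buffer_size):
        # Add only the bytes appended since the previous page,
        # not the size of the whole buffer.
        self.current_upload_size += buffer_size - self._last_buffer_size
        self._last_buffer_size = buffer_size

    def buffer_flushed(self):
        # After a part upload the buffer starts from empty again,
        # while the running total for the object is kept.
        self._last_buffer_size = 0


tracker = UploadSizeTracker()
for size in (37343, 73858):  # buffer sizes after the first two pages
    tracker.page_read(size)
print(tracker.current_upload_size)  # 73858
```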
lmsurpre added a commit that referenced this issue Aug 6, 2021
issue #2666 - fix for currentUploadSize tracking
@kmbarton423

Ran a variety of $export scenarios (type=file) against 250K resources, focusing on system-level and patient exports. Also used _type and _typeFilter query parameter combinations in a subset of the $export requests. Varied the config parameters and confirmed they were honored (i.e. writeTriggerSizeMB, sizeThresholdMB, resourceCountThreshold).

Ran the bulk data sniff test, which uses type=ibm-cos.
