File-based export should start writing earlier #2112

Closed
lmsurpre opened this issue Mar 18, 2021 · 2 comments

@lmsurpre
Member

Is your feature request related to a problem? Please describe.
The file-based export uses the COS properties for checkpointing.
This is a bit confusing. Additionally, I think we'd want a different default for "when to start writing": in COS there are constraints on the size and count of the parts, while writing to local disk has different concerns.

Describe the solution you'd like
A better default for when to start writing a file. I was thinking it should just checkpoint at every item (i.e. each page of results), but Paul suggested this could contribute to fragmentation. Therefore, let's introduce a new config param for it and set it to 1MB as the default.
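
To illustrate the idea, here is a minimal sketch of a writer that buffers pages of results and only appends to the file once a size trigger is reached. The class name `ThresholdFileWriter` and its methods are hypothetical and are not taken from the project; only the notion of a trigger size in MB comes from this issue.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical sketch: buffers export output and appends it to a file once a size trigger is reached. */
final class ThresholdFileWriter {
    private final Path target;
    private final long writeTriggerBytes;
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private int flushCount = 0;

    ThresholdFileWriter(Path target, int writeTriggerSizeMB) {
        this.target = target;
        // Assumption for this sketch: the trigger is expressed in whole megabytes.
        this.writeTriggerBytes = writeTriggerSizeMB * 1024L * 1024L;
    }

    /** Buffers one page of results; writes to disk only when the trigger size is reached. */
    void addPage(String ndjsonPage) throws IOException {
        byte[] bytes = ndjsonPage.getBytes(StandardCharsets.UTF_8);
        buffer.write(bytes, 0, bytes.length);
        if (buffer.size() >= writeTriggerBytes) {
            flush();
        }
    }

    /** Appends any buffered bytes to the target file and clears the buffer. */
    void flush() throws IOException {
        if (buffer.size() == 0) {
            return; // nothing to write
        }
        Files.write(target, buffer.toByteArray(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        buffer.reset();
        flushCount++;
    }

    /** Number of times data was actually written to disk. */
    int flushCount() {
        return flushCount;
    }
}
```

With a trigger of 1, two 600 KB pages result in a single write after the second page, rather than one write per page. A caller would still need a final `flush()` at the end of the job to write any remainder.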

Describe alternatives you've considered
A new "fast" implementation that uses nio to asynchronously queue the bytes for the OS / disk controller (instead of synchronously writing chunks to the file). I still think that would be a better implementation, but it's more than I want to bite off right now.
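
For reference, a rough sketch of what such an asynchronous approach could look like with `java.nio.channels.AsynchronousFileChannel`. The class `AsyncExportWriter` and its methods are hypothetical names for illustration only; this is not an implementation from the project.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

/** Hypothetical sketch: queues writes asynchronously instead of blocking on each chunk. */
final class AsyncExportWriter implements AutoCloseable {
    /** One queued write: its buffer, its starting file offset, and the pending result. */
    private static final class Pending {
        final ByteBuffer buffer;
        final long start;
        final Future<Integer> result;

        Pending(ByteBuffer buffer, long start, Future<Integer> result) {
            this.buffer = buffer;
            this.start = start;
            this.result = result;
        }
    }

    private final AsynchronousFileChannel channel;
    private final List<Pending> pending = new ArrayList<>();
    private long position = 0;

    AsyncExportWriter(Path target) throws IOException {
        channel = AsynchronousFileChannel.open(target,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
    }

    /** Hands the bytes to the channel and returns without waiting for the write to finish. */
    void enqueue(String ndjson) {
        ByteBuffer buf = ByteBuffer.wrap(ndjson.getBytes(StandardCharsets.UTF_8));
        pending.add(new Pending(buf, position, channel.write(buf, position)));
        position += buf.capacity();
    }

    /** Blocks until every queued write has completed, retrying any partial writes. */
    void awaitAll() throws IOException, InterruptedException {
        try {
            for (Pending p : pending) {
                p.result.get();
                // A write may complete partially; finish the remainder at the right offset.
                while (p.buffer.hasRemaining()) {
                    channel.write(p.buffer, p.start + p.buffer.position()).get();
                }
            }
        } catch (ExecutionException e) {
            throw new IOException(e.getCause());
        } finally {
            pending.clear();
        }
    }

    @Override
    public void close() throws IOException, InterruptedException {
        awaitAll();
        channel.close();
    }
}
```

The trade-off is that the caller has to track file offsets and pending results itself, which is part of why this is more work than a simple size trigger.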

@lmsurpre lmsurpre added this to the Sprint 2021-04 milestone Mar 18, 2021
@lmsurpre lmsurpre self-assigned this Mar 18, 2021
@prb112
Contributor

prb112 commented Mar 18, 2021

I can Q/A this one.

@prb112
Contributor

prb112 commented Mar 19, 2021

Tested locally. I can confirm the export behaves correctly at the write boundaries we changed to. I lowered the configuration to

                "file" : { 
                    "writeTriggerSizeMB": 1,
                    "sizeThresholdMB": 1,
                    "resourceCountThreshold": 20
                },

and

                "file" : { 
                    "writeTriggerSizeMB": 10,
                    "sizeThresholdMB": 10,
                    "resourceCountThreshold": 20
                },

I imported a 37M NDJSON file.

This behaves as expected.

@prb112 prb112 closed this as completed Mar 22, 2021