improve buffering efficiency #427

Merged
mpenkov merged 3 commits into piskvorky:master from fill-buffer on Mar 15, 2020

Conversation

@mpenkov (Collaborator) commented Mar 9, 2020

Motivation

This fixes an edge case where the user performs multiple sequential reads of a small number of bytes, e.g. one byte at a time. The previous implementation would fill the buffer one byte at a time, which negates the benefit of using a buffer at all.

The new implementation fixes this by always reading in chunks at least as large as a sensible threshold (currently io.DEFAULT_BUFFER_SIZE).
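
To illustrate the approach, here is a minimal sketch of the fill-to-threshold idea only, not the actual smart_open code; `_raw_read` stands in for whatever fetches bytes from the underlying S3 object:

    import io


    class FillBufferReader:
        """Sketch only: serves small reads from a buffer topped up in large chunks."""

        def __init__(self, raw_read, min_fetch=io.DEFAULT_BUFFER_SIZE):
            self._raw_read = raw_read    # callable: fetch up to n bytes from the backend
            self._min_fetch = min_fetch  # never ask the backend for fewer bytes than this
            self._buffer = b''
            self._eof = False

        def read(self, size):
            # Top up the buffer in chunks of at least min_fetch, even for a 1-byte read.
            while len(self._buffer) < size and not self._eof:
                chunk = self._raw_read(max(size - len(self._buffer), self._min_fetch))
                if not chunk:
                    self._eof = True
                    break
                self._buffer += chunk
            result, self._buffer = self._buffer[:size], self._buffer[size:]
            return result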

Checklist

Before you create the PR, please make sure you have:

  • Picked a concise, informative and complete title
  • Clearly explained the motivation behind the PR
  • Linked to any existing issues that your PR will be solving
  • Included tests for any new functionality
  • Checked that all unit tests pass

@mpenkov mpenkov added this to the 1.10.0 milestone Mar 9, 2020
@mpenkov mpenkov requested a review from piskvorky March 9, 2020 03:18

@piskvorky (Owner) left a comment

LGTM, thanks!

Just curious – since when have we had this bug? It looks pretty serious (poor S3 performance).

@mpenkov (Collaborator, Author) commented Mar 9, 2020

Probably for a while now, as long as I can remember, anyway.

I'm not sure what the actual impact of this bug is. I'll run some benchmarks later and let you know if anything interesting comes up.

Before:

------------------------------------------- benchmark: 1 tests -------------------------------------------
Name (time in s)        Min      Max    Mean  StdDev  Median     IQR  Outliers     OPS  Rounds  Iterations
----------------------------------------------------------------------------------------------------------
test                 4.8925  10.1093  5.9906  2.3032  5.0104  1.3963       1;1  0.1669       5           1
----------------------------------------------------------------------------------------------------------

After:

------------------------------------------- benchmark: 1 tests ------------------------------------------
Name (time in s)        Min     Max    Mean  StdDev  Median     IQR  Outliers     OPS  Rounds  Iterations
---------------------------------------------------------------------------------------------------------
test                 4.9611  9.7707  5.9822  2.1190  5.0280  1.3168       1;1  0.1672       5           1
---------------------------------------------------------------------------------------------------------
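
The numbers above are pytest-benchmark output; a hypothetical benchmark exercising the one-byte-at-a-time read pattern might look roughly like the sketch below (the S3 URL and function names are placeholders, not the object used for the figures above):

    import smart_open


    def read_one_byte_at_a_time(url):
        with smart_open.open(url, 'rb') as fin:
            while fin.read(1):  # each call should now be served from the internal buffer
                pass


    def test(benchmark):
        # pytest-benchmark fixture; the URL below is purely illustrative
        benchmark(read_one_byte_at_a_time, 's3://my-bucket/some-large-object.bin')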

@mpenkov (Collaborator, Author) commented Mar 11, 2020

I haven't been able to measure any significant benefit in doing this. I suspect something else outside of our control is performing its own buffering.

@mpenkov (Collaborator, Author) commented Mar 11, 2020

Despite the lack of performance improvement, I think we should merge this anyway, as the new way of doing things makes more sense.

Let me know if you think otherwise.

@piskvorky (Owner)

What's the performance before/after?

@mpenkov (Collaborator, Author) commented Mar 11, 2020

It's in the commit message for the "add benchmarks" commit.

Before:

------------------------------------------- benchmark: 1 tests -------------------------------------------
Name (time in s)        Min      Max    Mean  StdDev  Median     IQR  Outliers     OPS  Rounds  Iterations
----------------------------------------------------------------------------------------------------------
test                 4.8925  10.1093  5.9906  2.3032  5.0104  1.3963       1;1  0.1669       5           1
----------------------------------------------------------------------------------------------------------

After:

------------------------------------------- benchmark: 1 tests ------------------------------------------
Name (time in s)        Min     Max    Mean  StdDev  Median     IQR  Outliers     OPS  Rounds  Iterations
---------------------------------------------------------------------------------------------------------
test                 4.9611  9.7707  5.9822  2.1190  5.0280  1.3168       1;1  0.1672       5           1
---------------------------------------------------------------------------------------------------------

@piskvorky (Owner) commented Mar 11, 2020

Is boto doing some buffering internally? How else could reading one byte at a time be the same speed? Strange.

If so, maybe we shouldn't buffer at all in smart_open, and leave it to boto instead.

@mpenkov (Collaborator, Author) commented Mar 11, 2020

Yes, I also suspect boto3 does its own buffering. We could investigate, and remove our own buffering and compare performance, but that's a much larger change than the current PR. I'd rather deal with that separately, when we have more time.
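
One possible way to investigate later, sketched here under the assumption that the object can be fetched with a plain get_object call (bucket and key are placeholders): time one-byte reads directly on boto3's StreamingBody, bypassing smart_open's buffer, and compare against large chunked reads.

    import time

    import boto3


    def time_reads(read_size, bucket='my-bucket', key='some-large-object.bin'):
        # Read the raw StreamingBody directly, bypassing smart_open's buffering.
        body = boto3.client('s3').get_object(Bucket=bucket, Key=key)['Body']
        start = time.perf_counter()
        while body.read(read_size):
            pass
        return time.perf_counter() - start


    print('1-byte reads: %.2fs' % time_reads(1))
    print('128KB reads:  %.2fs' % time_reads(128 * 1024))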

@mpenkov mpenkov self-assigned this Mar 11, 2020
@mpenkov mpenkov merged commit 85a67ee into piskvorky:master Mar 15, 2020
@mpenkov mpenkov deleted the fill-buffer branch March 15, 2020 05:35
@mpenkov mpenkov mentioned this pull request Apr 8, 2020
Successfully merging this pull request may close these issues.

S3 BufferedInputBase fills buffer during every read