improve buffering efficiency #427

Merged
mpenkov merged 3 commits into piskvorky:master from fill-buffer on Mar 15, 2020

Conversation

@mpenkov (Collaborator) commented Mar 9, 2020

Motivation

This fixes an edge case where the user performs multiple sequential reads of a small number of bytes, e.g. one byte at a time. The previous implementation would fill the buffer one byte at a time, which negates the benefit of using a buffer at all.

The new implementation fixes this by always reading in chunks at least as large as a sensible threshold (currently io.DEFAULT_BUFFER_SIZE).
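
To illustrate the approach, here is a minimal sketch of the fill-to-threshold idea only, not the actual smart_open code; `_raw_read` stands in for whatever fetches bytes from the underlying S3 object:

    import io


    class FillBufferReader:
        """Sketch only: serves small reads from a buffer topped up in large chunks."""

        def __init__(self, raw_read, min_fetch=io.DEFAULT_BUFFER_SIZE):
            self._raw_read = raw_read    # callable: fetch up to n bytes from the backend
            self._min_fetch = min_fetch  # never ask the backend for fewer bytes than this
            self._buffer = b''
            self._eof = False

        def read(self, size):
            # Top up the buffer in chunks of at least min_fetch, even for a 1-byte read.
            while len(self._buffer) < size and not self._eof:
                chunk = self._raw_read(max(size - len(self._buffer), self._min_fetch))
                if not chunk:
                    self._eof = True
                    break
                self._buffer += chunk
            result, self._buffer = self._buffer[:size], self._buffer[size:]
            return result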

Checklist

Before you create the PR, please make sure you have:

  • Picked a concise, informative and complete title
  • Clearly explained the motivation behind the PR
  • Linked to any existing issues that your PR will be solving
  • Included tests for any new functionality
  • Checked that all unit tests pass

@mpenkov mpenkov added this to the 1.10.0 milestone Mar 9, 2020
@mpenkov mpenkov requested a review from piskvorky March 9, 2020 03:18

@piskvorky (Owner) left a comment

LGTM, thanks!

Just curious – since when have we had this bug? It looks pretty serious (poor S3 performance).

@mpenkov (Collaborator, Author) commented Mar 9, 2020

Probably for a while now, as long as I can remember, anyway.

I'm not sure what the actual impact of this bug is. I'll run some benchmarks later and let you know if anything interesting comes up.

Before:

------------------------------------------- benchmark: 1 tests -------------------------------------------
Name (time in s)        Min      Max    Mean  StdDev  Median     IQR  Outliers     OPS  Rounds  Iterations
----------------------------------------------------------------------------------------------------------
test                 4.8925  10.1093  5.9906  2.3032  5.0104  1.3963       1;1  0.1669       5           1
----------------------------------------------------------------------------------------------------------

After:

------------------------------------------- benchmark: 1 tests ------------------------------------------
Name (time in s)        Min     Max    Mean  StdDev  Median     IQR  Outliers     OPS  Rounds  Iterations
---------------------------------------------------------------------------------------------------------
test                 4.9611  9.7707  5.9822  2.1190  5.0280  1.3168       1;1  0.1672       5           1
---------------------------------------------------------------------------------------------------------
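
The numbers above are pytest-benchmark output; a hypothetical benchmark exercising the one-byte-at-a-time read pattern might look roughly like the sketch below (the S3 URL and function names are placeholders, not the object used for the figures above):

    import smart_open


    def read_one_byte_at_a_time(url):
        with smart_open.open(url, 'rb') as fin:
            while fin.read(1):  # each call should now be served from the internal buffer
                pass


    def test(benchmark):
        # pytest-benchmark fixture; the URL below is purely illustrative
        benchmark(read_one_byte_at_a_time, 's3://my-bucket/some-large-object.bin')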

@mpenkov (Collaborator, Author) commented Mar 11, 2020

I haven't been able to measure any significant benefit in doing this. I suspect something else outside of our control is performing its own buffering.

@mpenkov (Collaborator, Author) commented Mar 11, 2020

Despite the lack of performance improvement, I think we should merge this anyway, as the new way of doing things makes more sense.

Let me know if you think otherwise.

@piskvorky (Owner)

What's the performance before/after?

@mpenkov (Collaborator, Author) commented Mar 11, 2020

It's in the commit message for the "add benchmarks" commit.

Before:

------------------------------------------- benchmark: 1 tests -------------------------------------------
Name (time in s)        Min      Max    Mean  StdDev  Median     IQR  Outliers     OPS  Rounds  Iterations
----------------------------------------------------------------------------------------------------------
test                 4.8925  10.1093  5.9906  2.3032  5.0104  1.3963       1;1  0.1669       5           1
----------------------------------------------------------------------------------------------------------

After:

------------------------------------------- benchmark: 1 tests ------------------------------------------
Name (time in s)        Min     Max    Mean  StdDev  Median     IQR  Outliers     OPS  Rounds  Iterations
---------------------------------------------------------------------------------------------------------
test                 4.9611  9.7707  5.9822  2.1190  5.0280  1.3168       1;1  0.1672       5           1
---------------------------------------------------------------------------------------------------------

@piskvorky (Owner) commented Mar 11, 2020

Is boto doing some buffering internally? How else could reading one byte at a time be the same speed? Strange.

If so, maybe we shouldn't buffer at all in smart_open, and leave it to boto instead.

@mpenkov (Collaborator, Author) commented Mar 11, 2020

Yes, I also suspect boto3 does its own buffering. We could investigate, and remove our own buffering and compare performance, but that's a much larger change than the current PR. I'd rather deal with that separately, when we have more time.
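
One possible way to investigate later, sketched here under the assumption that the object can be fetched with a plain get_object call (bucket and key are placeholders): time one-byte reads directly on boto3's StreamingBody, bypassing smart_open's buffer, and compare against large chunked reads.

    import time

    import boto3


    def time_reads(read_size, bucket='my-bucket', key='some-large-object.bin'):
        # Read the raw StreamingBody directly, bypassing smart_open's buffering.
        body = boto3.client('s3').get_object(Bucket=bucket, Key=key)['Body']
        start = time.perf_counter()
        while body.read(read_size):
            pass
        return time.perf_counter() - start


    print('1-byte reads: %.2fs' % time_reads(1))
    print('128KB reads:  %.2fs' % time_reads(128 * 1024))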

@mpenkov mpenkov self-assigned this Mar 11, 2020
@mpenkov mpenkov merged commit 85a67ee into piskvorky:master Mar 15, 2020
@mpenkov mpenkov deleted the fill-buffer branch March 15, 2020 05:35
@mpenkov mpenkov mentioned this pull request Apr 8, 2020
Successfully merging this pull request may close these issues.

S3 BufferedInputBase fills buffer during every read