Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: S3 ignore seek requests to the current position #748

Conversation

rustyconover
Copy link
Contributor

Motivation

When callers perform a seek() on a S3 backed file handle that seek can be ignored if it is to the current position. Python's ZipFile module often seeks to the current position causing performance to be quite slow when reading zip files from S3.

This change compares the current position vs the destination position and preserves the buffer if possible while still populating the EOF flag.

This addresses: #742

@mpenkov
Copy link
Collaborator

mpenkov commented Dec 19, 2022

Looks good to me. Thank you!

@rustyconover
Copy link
Contributor Author

Is there anything else I need to do to get it merged?

@mpenkov
Copy link
Collaborator

mpenkov commented Dec 19, 2022

It looks like some of the CI tests are failing. Are you able to make them pass?

@tooptoop4
Copy link
Contributor

tooptoop4 commented Dec 21, 2022

https://github.com/RaRe-Technologies/smart_open/blob/v6.3.0/smart_open/tests/test_s3_version.py#L70-L74 @rustyconover seems this no longer throws error because seek is skipped

@rustyconover
Copy link
Contributor Author

Please rerun tests and merge.

@mpenkov
Copy link
Collaborator

mpenkov commented Jan 27, 2023

I think in this case you should fix the behavior of the function instead of changing the test.

People have come to expect an open to raise an IOError immediately.

@tooptoop4
Copy link
Contributor

what if this new PR behavior was behind an opt-in flag? so we can get faster perf if we want, others can have existing functionality. as i think to get the error would slow down the benefit of this change

@mpenkov
Copy link
Collaborator

mpenkov commented Jan 30, 2023

I'd rather not add an additional configuration option for something so trivial. Besides, there's no reason to have one or the other (performance or compatibility) here. For example, we can:

  • ignore seeks to the current position, unless it is the very first seek
  • perform a GetObject call to ensure the object exists before doing anything else
  • anything else?

@@ -663,9 +663,10 @@ def seek(self, offset, whence=constants.WHENCE_START):
whence = constants.WHENCE_START
offset += self._current_pos

self._current_pos = self._raw_reader.seek(offset, whence)
if not (whence == constants.WHENCE_START and offset == self._current_pos):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if not (whence == constants.WHENCE_START and offset == self._current_pos):
if not (whence == constants.WHENCE_START and offset == self._current_pos):

The optimization idea is good, but we need to maintain the current behavior of the library. That is, the seek should always be performed when opening the file. Perhaps we could initialize self._current_pos to -1 to achieve this.

ananth1996 added a commit to HPC-HD/smart_open that referenced this pull request Apr 13, 2023
ananth1996 added a commit to HPC-HD/smart_open that referenced this pull request Apr 13, 2023
Changes the position to -1 as suggested in piskvorky#748 (comment) to ensure that the initial seek is carried out
@ananth1996
Copy link

We tried the changes proposed in this issue and got favourable results in our use-case. Is there any progress towards merging this PR request. We are currently using our forked and fixed version in poetry for the application we are building. However, it would be nice to have an official fix for this problem as we plan to use this solution further in the future.

@tooptoop4
Copy link
Contributor

🦗

@mpenkov
Copy link
Collaborator

mpenkov commented Sep 6, 2023

@tooptoop4 or @ananth1996 Are you able to push this PR over the line? It looks like the original author abandoned it.

with self.assertRaises(IOError):
open(self.url, 'rb', transport_params=params)
with open(self.url, 'rb', transport_params=params) as fin:
with self.assertRaises(IOError):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not change this test. Opening the URL to a non-existent object should raise an error, before any reading is attempted.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I hadn't changed this in my fork of the code. But then again I wasn't aiming to make a PR. I could check what errors the fix raise during tests and see if there is a sensible way to incorporate them without changing core functionality.

@ananth1996
Copy link

@tooptoop4 or @ananth1996 Are you able to push this PR over the line? It looks like the original author abandoned it.

I can help with this. What would be needed to get the PR over the line?

@beck3905
Copy link
Contributor

beck3905 commented Sep 6, 2023

I took a stab at fixing this PR. Just submitted #782 with my changes.

@rustyconover
Copy link
Contributor Author

I'm happy to see this PR moving forward. I didn't abandon the PR I just disagree with the maintainer's decision about the test and behavior.

@mpenkov
Copy link
Collaborator

mpenkov commented Sep 7, 2023

@rustyconover OK, let's agree to disagree. Closing in favor of #782

@mpenkov mpenkov closed this Sep 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants