New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: S3 ignore seek requests to the current position #748
fix: S3 ignore seek requests to the current position #748
Conversation
Looks good to me. Thank you! |
Is there anything else I need to do to get it merged? |
It looks like some of the CI tests are failing. Are you able to make them pass? |
https://github.com/RaRe-Technologies/smart_open/blob/v6.3.0/smart_open/tests/test_s3_version.py#L70-L74 @rustyconover seems this no longer throws error because seek is skipped |
Please rerun tests and merge. |
I think in this case you should fix the behavior of the function instead of changing the test. People have come to expect an open to raise an IOError immediately. |
what if this new PR behavior was behind an opt-in flag? so we can get faster perf if we want, others can have existing functionality. as i think to get the error would slow down the benefit of this change |
I'd rather not add an additional configuration option for something so trivial. Besides, there's no reason to have one or the other (performance or compatibility) here. For example, we can:
|
@@ -663,9 +663,10 @@ def seek(self, offset, whence=constants.WHENCE_START): | |||
whence = constants.WHENCE_START | |||
offset += self._current_pos | |||
|
|||
self._current_pos = self._raw_reader.seek(offset, whence) | |||
if not (whence == constants.WHENCE_START and offset == self._current_pos): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if not (whence == constants.WHENCE_START and offset == self._current_pos): | |
if not (whence == constants.WHENCE_START and offset == self._current_pos): |
The optimization idea is good, but we need to maintain the current behavior of the library. That is, the seek should always be performed when opening the file. Perhaps we could initialize self._current_pos to -1 to achieve this.
Adds the patch suggested in piskvorky#748
Changes the position to -1 as suggested in piskvorky#748 (comment) to ensure that the initial seek is carried out
We tried the changes proposed in this issue and got favourable results in our use-case. Is there any progress towards merging this PR request. We are currently using our forked and fixed version in poetry for the application we are building. However, it would be nice to have an official fix for this problem as we plan to use this solution further in the future. |
🦗 |
@tooptoop4 or @ananth1996 Are you able to push this PR over the line? It looks like the original author abandoned it. |
with self.assertRaises(IOError): | ||
open(self.url, 'rb', transport_params=params) | ||
with open(self.url, 'rb', transport_params=params) as fin: | ||
with self.assertRaises(IOError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not change this test. Opening the URL to a non-existent object should raise an error, before any reading is attempted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I hadn't changed this in my fork of the code. But then again I wasn't aiming to make a PR. I could check what errors the fix raise during tests and see if there is a sensible way to incorporate them without changing core functionality.
I can help with this. What would be needed to get the PR over the line? |
I took a stab at fixing this PR. Just submitted #782 with my changes. |
I'm happy to see this PR moving forward. I didn't abandon the PR I just disagree with the maintainer's decision about the test and behavior. |
@rustyconover OK, let's agree to disagree. Closing in favor of #782 |
Motivation
When callers perform a
seek()
on a S3 backed file handle that seek can be ignored if it is to the current position. Python's ZipFile module often seeks to the current position causing performance to be quite slow when reading zip files from S3.This change compares the current position vs the destination position and preserves the buffer if possible while still populating the EOF flag.
This addresses: #742