fix: S3 ignore seek requests to the current position #748

rustyconover · 2022-12-18T15:45:49Z

Motivation

When a caller performs a seek() on an S3-backed file handle, the seek can be ignored if it targets the current position. Python's ZipFile module often seeks to the current position, which makes reading zip files from S3 quite slow.
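
One rough way to observe this pattern locally, without S3, is to wrap an in-memory file and count the seeks that ZipFile issues. The `SeekCountingReader` class and `build_zip` helper below are hypothetical names for this illustration and are not part of smart_open. The exact counts depend on the Python version, so none are shown.

```python
import io
import zipfile


class SeekCountingReader(io.BytesIO):
    """In-memory stand-in for a remote file that counts seek() calls."""

    def __init__(self, data):
        super().__init__(data)
        self.total_seeks = 0
        self.redundant_seeks = 0

    def seek(self, offset, whence=io.SEEK_SET):
        self.total_seeks += 1
        # An absolute seek to the position we are already at is a no-op.
        if whence == io.SEEK_SET and offset == self.tell():
            self.redundant_seeks += 1
        return super().seek(offset, whence)


def build_zip(payload):
    """Return the bytes of a zip archive holding a single member."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("member.bin", payload)
    return buf.getvalue()


payload = bytes(range(256)) * 64
reader = SeekCountingReader(build_zip(payload))

chunks = []
with zipfile.ZipFile(reader) as archive:
    with archive.open("member.bin") as member:
        # Read in small chunks, the way a streaming consumer would.
        while True:
            chunk = member.read(1024)
            if not chunk:
                break
            chunks.append(chunk)
restored = b"".join(chunks)

print("total seeks:", reader.total_seeks)
print("seeks to the current position:", reader.redundant_seeks)
```

On a backend where every seek discards the read buffer or reopens a ranged request, each of the redundant seeks counted here would carry a real cost.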

This change compares the current position vs the destination position and preserves the buffer if possible while still populating the EOF flag.
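
A minimal sketch of that idea, assuming a buffered reader layered over a raw reader that performs the expensive work. `FakeRawReader` and `SketchReader` are hypothetical stand-ins, not the actual classes in smart_open/s3.py, and the buffering here is deliberately simplified.

```python
WHENCE_START, WHENCE_CURRENT, WHENCE_END = 0, 1, 2


class FakeRawReader:
    """Hypothetical raw reader; each seek() stands for a costly remote call."""

    def __init__(self, data):
        self._data = data
        self._pos = 0
        self.seek_calls = 0

    def seek(self, offset, whence=WHENCE_START):
        self.seek_calls += 1
        if whence == WHENCE_CURRENT:
            offset += self._pos
        elif whence == WHENCE_END:
            offset += len(self._data)
        self._pos = max(0, min(offset, len(self._data)))
        return self._pos

    def read(self, size):
        chunk = self._data[self._pos:self._pos + size]
        self._pos += len(chunk)
        return chunk


class SketchReader:
    """Buffered reader that skips seeks to its current position."""

    def __init__(self, raw, content_length, chunk_size=8):
        self._raw = raw
        self._content_length = content_length
        self._chunk_size = chunk_size
        self._current_pos = 0
        self._buffer = b""
        self._eof = False

    def seek(self, offset, whence=WHENCE_START):
        if whence == WHENCE_CURRENT:
            # Normalize relative seeks to absolute ones.
            whence = WHENCE_START
            offset += self._current_pos
        if not (whence == WHENCE_START and offset == self._current_pos):
            # Position really changes: hit the raw reader, drop the buffer.
            self._current_pos = self._raw.seek(offset, whence)
            self._buffer = b""
        # The EOF flag is refreshed whether or not the seek was skipped.
        self._eof = self._current_pos == self._content_length
        return self._current_pos

    def read(self, size):
        while len(self._buffer) < size:
            chunk = self._raw.read(self._chunk_size)
            if not chunk:
                break
            self._buffer += chunk
        out, self._buffer = self._buffer[:size], self._buffer[size:]
        self._current_pos += len(out)
        self._eof = self._current_pos == self._content_length
        return out


data = b"0123456789abcdef"
raw = FakeRawReader(data)
reader = SketchReader(raw, len(data))

first = reader.read(4)   # fills the buffer with one 8-byte chunk
reader.seek(4)           # same position: skipped, buffer preserved
second = reader.read(4)  # served from the preserved buffer
reader.seek(0)           # different position: reaches the raw reader
third = reader.read(2)
```

In this sketch only the seek back to offset 0 reaches the raw reader; the seek to offset 4 is answered from local state.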

This addresses: #742

mpenkov · 2022-12-19T07:13:47Z

Looks good to me. Thank you!

rustyconover · 2022-12-19T14:32:20Z

Is there anything else I need to do to get it merged?

mpenkov · 2022-12-19T14:44:18Z

It looks like some of the CI tests are failing. Are you able to make them pass?

tooptoop4 · 2022-12-21T05:35:39Z

https://github.com/RaRe-Technologies/smart_open/blob/v6.3.0/smart_open/tests/test_s3_version.py#L70-L74 @rustyconover It seems this test no longer raises an error because the seek is skipped.

rustyconover · 2023-01-27T04:19:51Z

Please rerun tests and merge.

mpenkov · 2023-01-27T11:51:47Z

I think in this case you should fix the behavior of the function instead of changing the test.

People have come to expect an open to raise an IOError immediately.

tooptoop4 · 2023-01-30T02:40:13Z

What if the new behavior in this PR were behind an opt-in flag? Those who want faster performance could enable it, and everyone else would keep the existing functionality. I think that raising the error up front would reduce the benefit of this change.

mpenkov · 2023-01-30T04:27:27Z

I'd rather not add an additional configuration option for something so trivial. Besides, there's no reason to have one or the other (performance or compatibility) here. For example, we can:

- ignore seeks to the current position, unless it is the very first seek
- perform a GetObject call to ensure the object exists before doing anything else
- anything else?

mpenkov · 2023-01-30T05:59:02Z

smart_open/s3.py

@@ -663,9 +663,10 @@ def seek(self, offset, whence=constants.WHENCE_START):
            whence = constants.WHENCE_START
            offset += self._current_pos

-        self._current_pos = self._raw_reader.seek(offset, whence)
+        if not (whence == constants.WHENCE_START and offset == self._current_pos):


Suggested change

if not (whence == constants.WHENCE_START and offset == self._current_pos):

The optimization idea is good, but we need to maintain the current behavior of the library. That is, the seek should always be performed when opening the file. Perhaps we could initialize self._current_pos to -1 to achieve this.
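
A minimal sketch of the -1 idea, assuming the reader's constructor performs an initial seek(0) and that the raw reader raises IOError when the object cannot be accessed. All class names here are hypothetical and only illustrate the control flow; they are not smart_open's classes.

```python
WHENCE_START, WHENCE_CURRENT = 0, 1


class ExistingObjectRaw:
    """Hypothetical raw reader for an object that exists."""

    def __init__(self):
        self.seek_calls = 0

    def seek(self, offset, whence=WHENCE_START):
        self.seek_calls += 1
        return offset


class MissingObjectRaw:
    """Hypothetical raw reader for an object that does not exist."""

    def seek(self, offset, whence=WHENCE_START):
        raise IOError("unable to access the object")


class SentinelReader:
    def __init__(self, raw):
        self._raw = raw
        # No valid offset equals -1, so the first seek is never skipped.
        self._current_pos = -1
        # Assumption: opening performs an initial seek, which surfaces
        # a missing object as an IOError right away.
        self.seek(0)

    def seek(self, offset, whence=WHENCE_START):
        if whence == WHENCE_CURRENT:
            whence = WHENCE_START
            offset += self._current_pos
        if not (whence == WHENCE_START and offset == self._current_pos):
            self._current_pos = self._raw.seek(offset, whence)
        return self._current_pos


raw = ExistingObjectRaw()
reader = SentinelReader(raw)  # the initial seek reaches the raw reader
reader.seek(0)                # later seeks to the same position are skipped

try:
    SentinelReader(MissingObjectRaw())
    open_failed = False
except IOError:
    open_failed = True
```

With the sentinel, the optimization applies to every seek except the first, so opening a non-existent object still fails before any read is attempted.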

ananth1996 · 2023-08-25T06:43:09Z

We tried the changes proposed in this PR and got favourable results in our use case. Is there any progress towards merging it? We are currently using our own fork with the fix, installed via poetry, for the application we are building. However, it would be nice to have an official fix for this problem, as we plan to keep using this solution in the future.

tooptoop4 · 2023-09-06T00:26:49Z

🦗

mpenkov · 2023-09-06T03:45:39Z

@tooptoop4 or @ananth1996 Are you able to push this PR over the line? It looks like the original author abandoned it.

mpenkov · 2023-09-06T14:45:16Z

smart_open/tests/test_s3_version.py

-        with self.assertRaises(IOError):
-            open(self.url, 'rb', transport_params=params)
+        with open(self.url, 'rb', transport_params=params) as fin:
+            with self.assertRaises(IOError):


We should not change this test. Opening the URL to a non-existent object should raise an error, before any reading is attempted.

ananth1996 · 2023-09-06

I hadn't changed this test in my fork of the code, but then again I wasn't aiming to make a PR. I could check what errors the fix raises during the tests and see if there is a sensible way to incorporate them without changing core functionality.

ananth1996 · 2023-09-06T15:28:41Z

> @tooptoop4 or @ananth1996 Are you able to push this PR over the line? It looks like the original author abandoned it.

I can help with this. What would be needed to get the PR over the line?

beck3905 · 2023-09-06T16:33:43Z

I took a stab at fixing this PR. Just submitted #782 with my changes.

rustyconover · 2023-09-07T01:20:49Z

I'm happy to see this PR moving forward. I didn't abandon the PR; I just disagree with the maintainer's decision about the test and behavior.

mpenkov · 2023-09-07T02:12:24Z

@rustyconover OK, let's agree to disagree. Closing in favor of #782

fix: ignore seek requests to the current position

e3fea83

rustyconover mentioned this pull request Dec 18, 2022

Random seeking from S3 is very slow #622

Open

3 tasks

mpenkov approved these changes Dec 19, 2022

fix: adjust test to match new seek behavior

c5f6954

mpenkov requested changes Jan 30, 2023

ananth1996 added a commit to HPC-HD/smart_open that referenced this pull request Apr 13, 2023

Ignores seek to current position

e90a43a

Adds the patch suggested in piskvorky#748

ananth1996 added a commit to HPC-HD/smart_open that referenced this pull request Apr 13, 2023

Fixes issue with initial seek

d66e135

Changes the position to -1 as suggested in piskvorky#748 (comment) to ensure that the initial seek is carried out

mpenkov mentioned this pull request Sep 6, 2023

Fix FTP write functionality by sending additional FTP command #781

Merged

mpenkov reviewed Sep 6, 2023

beck3905 mentioned this pull request Sep 6, 2023

Fix_s3_ignore_seeks_to_current_position #782

Merged

mpenkov closed this Sep 7, 2023
