Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

List obj fix #1060

Merged
merged 8 commits into from Apr 19, 2023
Merged

List obj fix #1060

merged 8 commits into from Apr 19, 2023

Conversation

sethiay
Copy link
Collaborator

@sethiay sethiay commented Apr 13, 2023

Description

Fix bug in list object method due to which some files are not listed if number of files in a directory are more than 5000 and contains one or more nested directory.

More details:

  • The listing of objects (includes both files and directories as directories are also object in GCS) happens one by one using the next function of this iterator.
  • The next function returns object, next page token and other attributes. E.g. if the listing is happening for 1st page, then the next function returns page token of 2nd page.

Before fix:

  • This for loop runs until the number of files are not equal to MaxResults (5000) which is problematic. E.g. if we are listing for 1st page which consists of 5000 objects including 4000 files and 1000 directories, then this for loop will not stop after listing 4000 files and 1000 directories but will go on till it lists 5000 files, which will go into 2nd page. Now assume 2nd page consists of 5000 files and 0 directories. At the end of listing 5000 files (4000 from 1st page and 1000 from 2nd page), the next page token will be of 3rd page. So the next ListObject method call from downstream methods will start from 3rd page and hence skipping 4000 files from 2nd page.

After fix:

  • This for loop will run until all the objects in requested page are fetched using (itr.Next() method). So, taking the same example as before, the first call to ListObject will list objects only from 1st page and return token of 2nd page. Hence will not miss 4000 files/objects of 2nd page.

Link to the issue in case of a bug fix.

This bug was identified in this issue: #1054

Testing details

  1. Manual
    a. Verified that correct number of files are listed for the exact issue specified by customer: This bug was identified in this issue: Large directory listing returns wrong results with --enable-storage-client-library=true #1054
    b. Reproduced the issue for lesser number of files i.e. 5500 and verified the fix against that many files as well.
    c. Also ran listing benchmarks and got similar performance.
  2. Unit tests
    a. Can't add unit test for testing this functionality specifically because the fake storage doesn't support ContinuationToken.
  3. Integration tests

@sethiay sethiay added the execute-perf-test Execute performance test in PR label Apr 13, 2023
@sethiay sethiay merged commit bcfff52 into master Apr 19, 2023
3 of 4 checks passed
@sethiay sethiay deleted the list_obj_fix branch May 14, 2024 13:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
execute-perf-test Execute performance test in PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants