
Bug: Program only retrieves the first 1000 images across all collections #9

Closed · rc-gr opened this issue Nov 7, 2023 · 25 comments · Fixed by #30
Labels: bug (Something isn't working)

rc-gr (Contributor) commented Nov 7, 2023

For example, if I have the following collections (assuming they're all valid) in the order they are shown in My Saves:
Collection 1 (550 items)
Collection 2 (500 items)
Collection 3 (1000 items)
etc.

Only all of Collection 1 and the first 450 items of Collection 2 would be retrieved. The rest of Collection 2 and all subsequent collections would come back empty.

It would seem that any value above 1000 for maxItemsToFetch is disregarded.
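
To make the reported behaviour concrete, here is a small self-contained illustration (plain Python, no API calls; the numbers are the ones from this report) of how a global 1000-item cap filled in My Saves order produces exactly this result:

```python
# Illustration only: a global cap of 1000 items, filled collection by collection
# in My Saves order, reproduces the counts described above.
collections = {"Collection 1": 550, "Collection 2": 500, "Collection 3": 1000}

cap = 1000
retrieved = {}
for name, size in collections.items():
    remaining = cap - sum(retrieved.values())
    retrieved[name] = max(min(size, remaining), 0)

print(retrieved)  # {'Collection 1': 550, 'Collection 2': 450, 'Collection 3': 0}
```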

rc-gr (Contributor, Author) commented Nov 7, 2023

As an informal suggestion for the interim, which sidesteps this issue, the previous clipboard method could be provided as an option so that the program would read from image_clipboard.txt as before. However, if any of the URLs fails to return an image, it would have to be skipped without any alternative.

And currently, without #7 addressed, any failed download would prevent subsequent files from being processed and zipped. From what I can tell, I'd get None from these downloads. Thus, if I have them like so on this line:
[foo, None, bar, None, None, baz, ...], where everything other than None is a valid element, only foo will be output to the zip file.
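
A minimal illustration of the difference (hypothetical list, mirroring the example above): skipping None entries keeps every valid download instead of stopping at the first failure.

```python
# Hypothetical downloads list as described above; None marks a failed download.
downloads = ["foo", None, "bar", None, None, "baz"]

# Behaviour described above: processing stops at the first failure.
kept_until_failure = []
for item in downloads:
    if item is None:
        break
    kept_until_failure.append(item)
print(kept_until_failure)  # ['foo']

# Skipping failures instead keeps every valid item.
kept_skipping = [item for item in downloads if item is not None]
print(kept_skipping)  # ['foo', 'bar', 'baz']
```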

Also, if 1000 is indeed the hard limit, perhaps an option could be added to offset the index numbers so that they can start at an arbitrary number. As an example, the option might be something like --index_start 551. That way, if I have the above-mentioned collections in My Saves, I could rearrange them (via the Collections button in the Edge browser itself) like so in another run:
Collection 2 (500 items)
Collection 1 (550 items)
Collection 3 (1000 items)
etc.

Now Collection 2 would have all items processed and the index could start from 551 onwards, going from 551 to 1050. With this, I could put COLLECTIONS_TO_INCLUDE=Collection 1 in .env on the first run, then swap it with Collection 2 on the second, extract the outputs to the same folder, and all the index numbers would be unique, even with more than 1000 items.

Additionally, as a bonus for keeping the index numbers unique, this would greatly benefit #8.

Richard-Weiss (Owner) commented Nov 8, 2023

@rc-gr You raise an important point.
I've heard of other users having this issue, and I'm currently working on a script that takes a collection_dict as input and imports collections, so I can test this myself.
However, the alternative you mentioned won't really work.
The API returns all collections of a certain type, like Generic for AI-generated images and Image for Bing images.
It returns all collections and seems to truncate after 1000 entries.
Maybe they are using some kind of pagination, which I can test for myself after I've finished the script for importing collections.
Could you check in the meantime if the copy button on the website returns more than 1000 entries?
You can just search for https://www.bing.com/images/create/ in the clipboard text, in Notepad++ for example.

rc-gr (Contributor, Author) commented Nov 8, 2023

I think I may have created a misunderstanding here. I was merely referring to offsetting the index in the file names. So after processing, instead of 0001_A.jpg, 0002_B.jpg, 0003_C.jpg, etc., I could apply an offset of e.g. 1000 so the names would be generated as 1001_A.jpg, 1002_B.jpg, 1003_C.jpg, etc. The images remain as-is.
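
A minimal sketch of what such an offset could look like (the helper function and the --index_start flag are only the example suggested above, not existing options):

```python
# Hypothetical helper: only the file names shift, the images stay untouched.
def numbered_name(index, suffix, offset=0, width=4):
    return f"{index + offset:0{width}d}_{suffix}"

print(numbered_name(1, "A.jpg"))               # 0001_A.jpg
print(numbered_name(1, "A.jpg", offset=1000))  # 1001_A.jpg
```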

> Could you check in the meantime if the copy button on the website returns more than 1000 entries?

On this front, I don't know how it is for others, but for me the button refuses to work if I attempt to copy more than 54 items. I don't know why the limit is such an arbitrary number.

For reference, the browser's Collections version is worse, as I'm only able to copy the first 24 items with "Copy Items to Clipboard".

It's strange, as I do recall for the longest time being able to copy more than 1000 items to the clipboard at one point, but only using the browser's Collections, which was why I had my presumptions and suggested the clipboard fallback in the first place.

Richard-Weiss (Owner) commented

@rc-gr I'll have to work on the import script first then.
What I meant is that there's no way to continue past the first 1000 items, so I don't really see what an offset would accomplish.
You will always get the same 1000 items.
But that made me think of something.
In theory you could chunk the collections: dump the collection_dict, delete the items from the dict via another API endpoint, repeat this until no images are left, and then reimport the concatenated collection_dict at the end.
That would maybe work.
But this would need extensive testing to prevent permanent data loss.
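
A rough outline of that chunking idea, with every helper stubbed out (the delete and import endpoints are undocumented, so this is only a sketch of the control flow, not working code against the real API):

```python
def fetch_collection_dict():
    return {"items": []}   # stub: the mysaves collections API, capped at 1000 items

def download_images(chunk):
    pass                   # stub: existing download logic

def delete_items(chunk):
    pass                   # stub: delete-from-collection endpoint

def reimport_collections(chunks):
    pass                   # stub: re-add everything that was exported

def download_in_chunks():
    exported = []
    while True:
        chunk = fetch_collection_dict()
        if not chunk["items"]:
            break
        exported.append(chunk)      # keep a copy so nothing is lost permanently
        download_images(chunk)
        delete_items(chunk)         # frees up the next 1000 items for the API
    reimport_collections(exported)  # restore the collections at the end
```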

Gabriele007xx commented

I tried to download the hard-limit number of images, but after changing the number from 500 to 1000 at the corresponding line, I get an error at the end and a zip file containing 239 images. When it was at 500, I got all 500 images in the zip. I am pretty sure DALL-E images don't get deleted, and my number of items hasn't gone down by even 1. Anybody know why it could say '404 not found'?
[screenshot: console output showing the 404 error]

rc-gr (Contributor, Author) commented Nov 9, 2023

@Gabriele007xx Have you tried rerunning the script at a later time? I just checked on one of the links for the failed downloads, and it shows me an image as intended.

In any case, I had suggested a workaround in #7 where the thumbnail would be downloaded instead in the event that the original fails to download for whatever reason. From what I can tell, a failed download is less likely to occur for thumbnails, as I've yet to see a broken image for them in my collections.
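
A hedged sketch of that fallback (function and key names are hypothetical, not the project's actual code): try the original URL first and only fall back to the thumbnail if it fails.

```python
import requests

def try_download(url):
    # Return the image bytes, or None if the request fails or is not OK.
    try:
        response = requests.get(url, timeout=30)
        return response.content if response.ok else None
    except requests.RequestException:
        return None

def download_with_fallback(item):
    image = try_download(item["image_url"])
    if image is None and item.get("thumbnail_url"):
        image = try_download(item["thumbnail_url"])
    return image  # may still be None if both URLs fail
```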

Richard-Weiss (Owner) commented

@rc-gr @Gabriele007xx The import functionality already looks promising. I had to manually throttle it because the backend is slow.
I still had issues with some specific images that I'll have to investigate.
Once the import feature is finished I'll start on the delete feature (using the delete-from-collection API), which would create the basis for this feature.
However, progress is slowed down by the API being a bit temperamental and by the fact that I don't have any docs for it.

vinnyreid commented

Hi, I'm trying to find out more information about the collection API for a similar project I'm working on (organizing collection image data in Google Sheets). Is there documentation somewhere? I'm running into the same problem with the 1000-image limit via the API; however, deletion isn't really an option for my situation (unless I migrate away from Bing collections ...)

Richard-Weiss (Owner) commented

@vinnyreid
There are no docs since it's not an official API.
The deletion was meant more as a workaround: operating in batches of 1000 images, downloading them, deleting them from the collections and then re-adding them.
But I'm already having issues with adding them, since there are server issues at about 600 images even when using a very low semaphore, so it's extremely slow too.
You can try out the BingCreatorCollectionImport class, but make sure to replace CollectionId with one of your own collections.
You can find the Id in a request in your browser, e.g. when adding or removing an image.
Also, it takes the dictionary from the https://www.bing.com/mysaves/collections/get API, so you can just write the collection_dict variable in the __gather_image_data method to a JSON file for testing.
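
For reference, dumping that variable could look like this (a minimal sketch meant to be dropped inside __gather_image_data, where collection_dict is already defined):

```python
import json

# Write the raw API response to disk so it can be reused as test input.
with open("collection_dict.json", "w", encoding="utf-8") as f:
    json.dump(collection_dict, f, ensure_ascii=False, indent=2)
```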

AblazingHeart commented

So the way I found to get around that problem is by creating collections of at most 1000 images. Currently I have 4 collections (named 1, 2, 3, and 4) of roughly 1000 gens each (originally from the default collection, but I moved them 1k at a time manually). Since it seems the downloader can only pull the most recent gens (I say this because trying to download collection 1 or 2 returns no images, while 3 and 4 return 350 and 650 respectively), I did the following:

  1. Select all images from collection 1 by selecting the last image at the bottom and then pressing the "select all" button (the button only selects loaded entries, which is why you have to scroll to the bottom of the collection before pressing it).
  2. Copy or move them to a new collection, "1a" in my case. This will reverse the index order as a side effect, but that's not a problem for me as it's easily fixed with renamer programs.
  3. Include "1a" in the "collections_to_include" line in config.toml.
  4. Run the downloader, which will now succeed as it detects that collection as the most recent one.
  5. Repeat the process with the following collections.

After that you can just delete the created collections (if you copied them), rename them (if you moved them), or just leave them be, but that's a waste of space as you will have duplicate collections lmao. Hope this helps!

Richard-Weiss (Owner) commented

@AblazingHeart
Sounds like a very interesting workaround.
Basically creating a new temp collection that's 1000 images max and using that instead.
Just wish the add collection API was more robust or the get collection API more flexible, so the user wouldn't have to do this manually.

Richard-Weiss (Owner) commented

Can someone also try these steps, so I can see if I should add back the old version?
If there are more than 1000 occurrences, I would add an option to use a .txt file instead.

AblazingHeart commented Dec 23, 2023

Also, my workaround sometimes throws this error:
[screenshot of the error]
although it seems to resolve itself with time: in the morning I followed my steps and it threw that error, but when I tried downloading again just now, it worked.

Btw @Richard-Weiss, I can indeed copy more than 1k items to the clipboard; the problem is that it takes a lot of time, for me roughly 10 minutes per 1k items. Also, I remember trying the clipboard version of the downloader when you published it, but using it made my PC bluescreen by using 100% of my CPU; I think it had something to do with illegal characters and/or emojis in my prompts.

rc-gr (Contributor, Author) commented Dec 24, 2023

@Richard-Weiss, now I can finally also confirm that the clipboard method works, albeit with fewer than 1000 images currently (about 200+ now). What changed this time was that I turned on clipboard history (using "Win+V" on Win 11) and observed it. It took about 10 seconds for the copied items to show up there. Before now, when I was unaware that my clipboard history was disabled, the clipboard method only seemed to work sporadically.

P.S. I have since pruned my collections because I found that my saved images started to expire randomly and sparsely if they were generated more than a month ago, which became severely apparent at >3 months. Thankfully, I had already downloaded them beforehand, so there's little reason for me to keep the 1000+ thumbnails without their original images lying around.

rc-gr (Contributor, Author) commented Dec 24, 2023

Also, to add on to @AblazingHeart's method, perhaps I should've clarified much earlier how I've been downloading my collections when I've had more than 1000 items, which I feel is more straightforward since it doesn't require loading all the images in a collection.

  1. Access Collections via the Collections icon in Edge ("Ctrl+Shift+Y"). You'll see your collections in a side panel, something like the following (this is on an alt account, btw):
    [screenshot: Collections side panel listing Collections 1 through 4]
    Using the image above as reference, assume that Collections 1 through 4 all have 1000 images. If I were to run the program as-is (with the instructions provided on the main page), I would get 1000 images from only Collection 4 due to API limitations. What if I wanted the program to get 1000 images from Collection 1?

  2. Open the 3 dots menu and select "Manage":
    [screenshot: the 3-dots menu with "Manage" highlighted]
    Each collection will now have a drag handle beside it:
    [screenshot: collections list in manage mode, showing the drag handles]

  3. Drag the handle (of Collection 1 in this example) to the very top, above all other collections. Save to apply the changes:
    [screenshot: Collection 1 dragged to the top of the list]
    You can also confirm that the changes have been applied by going to your full Collections page:
    [screenshot: full Collections page with Collection 1 listed first]

  4. Now if you run the program once more, assuming your cookie has been updated in the .env, you should have the 1000 images from Collection 1 downloaded.

Notice that there's also no need to specify which collection to download from in the config file. If Collection 1 at this point had fewer than 1000 images, the program would retrieve from subsequent collections down the list (i.e. Collection 4, then 3, then 2 in this case) until it hits 1000 images.

rc-gr (Contributor, Author) commented Dec 24, 2023

@AblazingHeart I think you might be seeing that error because it seems that thumbnails can expire pretty quickly, like so:
[screenshot: a thumbnail that no longer loads]
And this was just 4 days ago as of this post! Thankfully, the original image is still present in this case (via the generation page link by clicking through the thumbnail).

Because of this, with the thumbnail expired, the program is more likely to fail here if the original image also could not be retrieved for whatever reason. However, as long as the original image hasn't expired yet, a quick re-run or two should get the program to proceed successfully, as you've observed.

Richard-Weiss (Owner) commented

I've added the fallbacks and statistics now.
I've also added some code to prevent an error when the thumbnail property isn't there, so that error shouldn't be happening anymore.
I only have like 100 images or so in my own account, so can someone try having a collection with 2000+ images, using the clipboard again, and waiting 1-2 minutes?
If it returns all links, I would add back the alternative method with the improvements I've made so far. Please don't use the actual old method: it creates a new instance of Firefox for each image, which explains the hardware usage.
Using the text file would maybe take more total time, but less hands-on time for the user.

Richard-Weiss (Owner) commented

I've seen that someone made a Chrome plugin for it.
I've looked into it, and I think I'll create a userscript you can use with GreaseMonkey, TamperMonkey, etc., so it is browser-agnostic.
That should also work with 1000+ images in a single collection.
I'll let you know once it has parity with the vital existing features.

Ruffy314 commented Jan 9, 2024

> can someone try out having a collection with 2000+ images and using the clipboard again and wait for 1-2 minutes?

In case this question is still interesting, I was able to do so. It may be necessary to give bing.com access to the clipboard. I had to try a few times, then at some point I got the browser prompt for granting the permission. After that I was able to paste from the clipboard.
Three cases tested:

  1. 3500 images from a single collection, resulting in 17531 lines after pasting into a text file. (per image 3 lines of text + 2 empty lines, not sure where the extra 31 lines come from)
  2. 5536 images from 14 different collections, resulting in 27839 lines of text. These were, however, the same 3500 unique images from case 1, just split up over several collections for downloading with this. Some images were copied into multiple collections, because moving from a big collection apparently sometimes copies them instead. Checking the lines for duplicates results in the expected 3500 unique URLs.
  3. 5876 images from 16 different collections, resulting in 29549 lines of text. These were the same images+collections from case 2, plus two collections with 342 new unique images.

/edit
OS: Windows 10 Pro x64
Browser: Edge x64 Version 120.0.2210.121
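
For anyone post-processing the pasted text, a minimal sketch for pulling out the generation-page links (the file name follows the image_clipboard.txt convention mentioned earlier in this thread; the line layout is the one described above):

```python
# Extract and de-duplicate the https://www.bing.com/images/create/ links.
with open("image_clipboard.txt", encoding="utf-8") as f:
    lines = [line.strip() for line in f]

urls = [line for line in lines if line.startswith("https://www.bing.com/images/create/")]
unique_urls = list(dict.fromkeys(urls))  # keep order, drop duplicates
print(len(urls), "links,", len(unique_urls), "unique")
```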

Richard-Weiss (Owner) commented

@Ruffy314 Thanks for the info.
I forgot to mention it in the other comment.
Can you also measure the time it takes?
I tested it with around 1600 images and it took 2:45 minutes.
I think they are also using something similar to the detail API for it to work, because the HTML elements don't have the necessary data.
Using that API is actually the bottleneck right now, even for my planned userscript implementation.
I think it would be feasible to implement this using a text file, albeit a bit slow, because the API gets a bit fussy if I'm sending requests too quickly, and it doesn't even return a 429 but a 203.
It also gets worse the more images you have, so there might be some issues if I don't set the limit low enough.
I think I'll just add some code to use a predetermined text file and add an option to the .toml to switch between using the collection API and text file.
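
A hypothetical shape for that switch (the key names and fallback file name are assumptions, not the final config format):

```python
import tomllib  # Python 3.11+

with open("config.toml", "rb") as f:
    config = tomllib.load(f)

# e.g. image_source = "txt" in config.toml to read from a text file instead of the API
if config.get("image_source", "api") == "txt":
    with open(config.get("txt_file", "image_clipboard.txt"), encoding="utf-8") as f:
        urls = [line.strip() for line in f
                if line.startswith("https://www.bing.com/images/create/")]
else:
    urls = []  # fall through to the existing collection-API path
```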

Ruffy314 commented Jan 9, 2024

@Richard-Weiss
The 3500 images from the single collection (case 1) took about 45 seconds for me.
852 images over seven collections took about 15 seconds.
4352 images across eight collections (both sets above at once) took 57 seconds.
So the time is more or less linear in the number of images, with maybe some overhead for each collection.

CPU: Intel Core i5-8400T 2x 1.7 GHz, 16 GB RAM

It looks like I have to stay in the browser tab while it is copying. When I switched to a different program the clipboard did not get filled.

Richard-Weiss (Owner) commented Jan 9, 2024

@Ruffy314
It's quite random and I think it depends on the current load the server has.
You can see if you open the network tab that it requests the thumbnail for each item sequentially and in the end copies the text to the clipboard.
From my testing of the userscript, calling the details API yourself is actually faster, even if I restrict it to 5 parallel requests.
Maybe I can also improve the copy button while I'm at it.
I can create a new repo with a dev branch if you are curious and want to test it out. I'll just have to implement the download section and zipping, but I have all the data I need from the details API for all my 1600 images.
It's not really functional in that sense yet, just if you are curious.
The only downside of the userscript is that I don't have access to the thumbnail URL that still works sometimes.
I only have the thumbnail from the HTML and the two URLs from the details API, which are often the same.
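
For reference, capping concurrent detail-API calls with a semaphore could look like this (a Python sketch with a stubbed request; only the 5-parallel limit comes from the comment above, everything else is an assumption):

```python
import asyncio

async def fetch_detail(item):
    # Stub for the real detail-API request (the endpoint is undocumented).
    await asyncio.sleep(0.1)
    return {"item": item}

async def fetch_all_details(items, limit=5):
    semaphore = asyncio.Semaphore(limit)  # at most `limit` requests in flight

    async def fetch_one(item):
        async with semaphore:
            return await fetch_detail(item)

    return await asyncio.gather(*(fetch_one(item) for item in items))

# Example: asyncio.run(fetch_all_details(range(20)))
```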

Richard-Weiss (Owner) commented Jan 9, 2024

@Ruffy314 I'm actually a bit curious how well it works with even larger collections.
Can you try it for me?
It adds two new buttons, one to scroll down to load all images and the other to download the images.
The download button just collects the detailed information for now, you can see the progress in the browser console set to verbose.
Here's the current state:
https://gist.github.com/Richard-Weiss/66af2f13b248ff50cd1752b7789a833b
You can try it with an extension like ViolentMonkey or TamperMonkey.
The only issue I'm encountering sometimes is that the detail API returns a 203 for some images, and sometimes if I use it too often I can't even load something like bing.com for 5 minutes or so. I think I have to decrease the concurrency even further.

Ruffy314 commented Jan 9, 2024

@Richard-Weiss I have not encountered any problems. After selecting the first image of my 3500 collection and clicking the new scroll down button, it took the script 41 seconds to scroll to the end. (Very nice that it automatically returns to the top of the page).
Clicking the existing select-all button indeed shows 3500 elements, so nothing is missing.
Clicking the new download-selected button then fetches the data for 1000 images in ~14 seconds, and 49 seconds for the full 3500.

Richard-Weiss (Owner) commented

@Ruffy314
That sounds good, thanks.
I think I'll improve the logic for the scroll-down to be more dynamic instead of using a hardcoded wait time; it kinda annoyed me too.
But the rest seems to be working fine.
I'll add the .txt option to this repo in the coming days/weeks and create a repo for the userscript once I've implemented the download.
Thanks for your feedback. 🙂
