
reduce "redundancy", so the check only happens once and that value is stored, allowing that value to be compared. #1201

Merged
merged 3 commits into from
Dec 8, 2022

Conversation

NewUserHa
Contributor

Issue: #1200

@NewUserHa
Contributor Author

It seems there's a memory leak; the process has committed 1 GB+ of memory in a long-run test. But the cause may not be this change.

@NewUserHa
Contributor Author

Even if every image download is skipped (i.e. no files are written at all), the committed memory still grows.
So the memory leak is probably not related to this PR.

@Nandaka
Owner

Nandaka commented Dec 3, 2022

> memory leak

Can you compare before and after this PR? How do you test it? Last time this happened, it was usually because the BeautifulSoup page was not decompose()d and deleted.

It may also be because the downloaded page is cached for 1 hour (useful when it needs to get the same artist info multiple times, and it reduces access to the pixiv server).
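
For reference, a minimal sketch of the decompose-and-delete pattern being referred to (not PixivUtil2's actual code; the function and variable names are made up):

```python
from bs4 import BeautifulSoup

def extract_image_links(html: str):
    soup = BeautifulSoup(html, "html.parser")
    links = [a.get("href", "") for a in soup.find_all("a")]
    # Break the tree's internal parent/child references so the whole
    # parsed page can be garbage collected promptly, then drop the name.
    soup.decompose()
    del soup
    return links
```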

@NewUserHa
Contributor Author

Without this change, the working set was about 23 MB for a medium-length run (maybe a few hours, with the cool-down delay); I don't remember what the memory usage was in a long run.

BeautifulSoup objects should be garbage collected automatically by Python once nothing references them.

It also reaches ~300 MB of committed memory when 10,000 images are skipped via "Already downloaded in DB", so it may have nothing to do with file writing or BeautifulSoup.
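
One way to compare memory before and after the PR during a long run (just a suggestion on measurement, not how it was actually tested; requires the third-party psutil package):

```python
import os
import psutil

proc = psutil.Process(os.getpid())

def log_memory(tag: str) -> None:
    mem = proc.memory_info()
    # On Windows, rss is the working set and vms is the committed (pagefile) size.
    print(f"{tag}: working set {mem.rss / 2**20:.1f} MiB, "
          f"committed {mem.vms / 2**20:.1f} MiB")

# Call log_memory("after batch N") periodically in the download loop
# and compare the curves for builds with and without this change.
```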

@Nandaka Nandaka merged commit 8318f28 into Nandaka:master Dec 8, 2022
@biggestsonicfan
Contributor

In the future: these checks are not "useless"; each check itself is important to the code. What you've done here, however, is reduce "redundancy", so the check only happens once and that value is stored, allowing that value to be compared. "Remove redundant file checks" would have been a more apt title for this PR.
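
A minimal sketch of that pattern (hypothetical names, not the actual diff): the existence check hits the disk once, and the stored result is reused for the later comparison.

```python
import os

def should_download(filename: str, get_remote_size) -> bool:
    file_exists = os.path.isfile(filename)   # hit the disk once, store the result
    if not file_exists:
        # Nothing local to compare against, so the remote size check is skipped.
        print("Skipped getting remote file size because local file not exists.")
        return True
    # Reuse the stored result instead of calling os.path.isfile() again.
    return os.path.getsize(filename) != get_remote_size()
```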

@Nandaka Nandaka changed the title from "get rid of useless checks to speed up the download, if image not exists at all." to "reduce "redundancy", so the check only happens once and that value is stored, allowing that value to be compared." Dec 8, 2022
@Nandaka
Owner

Nandaka commented Dec 8, 2022

Well, grammar is not my strong point, but at least I can understand it 😄

@Nandaka Nandaka mentioned this pull request Dec 15, 2022
@biggestsonicfan
Contributor

biggestsonicfan commented Dec 27, 2022

Just firing up PixivUtil2 again, and the "Skipped getting remote file size because local file not exists" message is really getting to me...

It should be "Remote file size check skipped because local file does not exist" or something like that. Ironically, printing this message creates more latency than the "redundant" checks it was intended to remove, so it has actually managed to slow down the program where it intended to speed it up... If no message is printed, it should be implied that the file size check was skipped; otherwise it should print only when the conditions to check the file size are met. This notification for every new image is superfluous...

@NewUserHa
Contributor Author

NewUserHa commented Dec 27, 2022

  1. That message is printed just like the other similar lines in that same code block.
  2. Printing the message will NOT slow anything down in any way, because it is just a print(). If it bothers you, you can comment that print line out.
  3. But this PR does save time and disk I/O, and you can verify that with Task Manager or a similar tool by monitoring your disk.

@biggestsonicfan
Contributor

I had to see the numbers for myself, and you are right. Printing the line adds only an additional 0:00:00.000051 per image, while six checks to see if a file exists take a whopping 0:00:00.00015.

If you are able to perceive a difference of even 0:00:00.001 per image downloaded, not taking into account the random delay used in between downloads, I don't know what to tell you.
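
For anyone who wants to reproduce numbers like these, a rough sketch using the standard library (the path is hypothetical, and the absolute timings depend heavily on the terminal and the filesystem cache):

```python
import io
import os
import timeit
from contextlib import redirect_stdout

path = "downloads/12345_p0.jpg"  # hypothetical file path

# Redirect print() output to an in-memory buffer so the console stays clean.
with redirect_stdout(io.StringIO()):
    print_cost = timeit.timeit(lambda: print("skip message"), number=10_000) / 10_000

exists_cost = timeit.timeit(lambda: os.path.isfile(path), number=10_000) / 10_000

print(f"print():          ~{print_cost:.9f} s per call")
print(f"os.path.isfile(): ~{exists_cost:.9f} s per call")
```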

@NewUserHa
Contributor Author

  1. That's how programs work, and you can consider the print() to cost essentially no time.
  2. The time a disk check takes actually depends on your case, e.g. whether your disk is an SSD, whether the file's metadata is in your device's cache, etc. If you have a large folder, it can cost a lot of time.
  3. Why would you want your disk to check the same thing (whether the file exists) again and again for the same result? No program should do that, and this PR fixed it.
