Site Archive Inconsistencies #8

Open
JoshuaFern opened this issue Jan 29, 2023 · 3 comments
Comments

@JoshuaFern

I'm backing up games on Itch and I've noticed multiple inconsistencies with the archived pages generated by itch-dl.

  • Cyrillic text (all Unicode?) is garbled.
  • 'Updated' and 'Published' dates are missing under 'More information'
  • Some images, like the cover and screenshots, seem to be archived but are not linked to the local mirror, and other images, such as store banners and user avatars, are missing entirely.
  • Developer log posts are not saved, and if I understand correctly the devlogs can also include past versions of software that itch-dl does not download. It would be super slick if itch-dl could back up all the past versions as well, with an argument like --devlog.
  • Useful information in community posts is also not archived.

I tried with and without --mirror-web, but there was not much of a difference. Screenshots are saved when it is specified, but I did not notice any additional benefit.

@DragoonAethis
Owner

  • Looks like sites primarily in non-Latin scripts got their encoding guessed incorrectly and the output ended up garbled - I've released 0.3.3 with a quick fix, check if it resolves your issues. I've tested it on Cyrillic and CJK pages, which now look correct.
  • Looks like published/updated/etc dates don't show up for some games (not all...?) if the webpage request is unauthenticated (requires proper session cookies, not just the API key). Need to investigate.
  • Itch stores older versions separately on another API endpoint; they're not connected with devlogs directly. Either way, the downloader currently fetches just the latest version, but yup, it would be nice to be able to grab those as well. I've added "Support downloading non-latest versions" (#9) to track this in a separate issue.
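
For illustration, the garbling in the first point is what you get when UTF-8 bytes are decoded with a wrongly guessed legacy encoding. A minimal Python sketch of the general idea follows; `decode_page` and its `declared` parameter are hypothetical names, and this is not necessarily how the 0.3.3 fix works.

```python
from typing import Optional


def decode_page(raw: bytes, declared: Optional[str] = None) -> str:
    """Decode HTML bytes, trying a declared charset first, then UTF-8."""
    for encoding in (declared, "utf-8"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset: fall through to the next candidate.
            continue
    # Last resort: decode as UTF-8 and replace undecodable bytes.
    return raw.decode("utf-8", errors="replace")


raw = "Привет, мир".encode("utf-8")
garbled = raw.decode("latin-1")  # what a wrong encoding guess produces
fixed = decode_page(raw)         # decodes cleanly as UTF-8
```

Trying UTF-8 before any statistical guess is a reasonable default for modern web pages, but pages that really are in a legacy encoding would still need a declared charset.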

But in general, yeah, the webpage mirroring feature is very barebones - the intended use case was to scrape just the front page and screenshots attached there, as that often includes instructions not included with games themselves. Getting image links correct, all the devlogs/comments, etc would require a lot more postprocessing.
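
As a rough idea of what that postprocessing could look like for image links, here is a small Python sketch that points `<img>` tags at already-downloaded local copies. The function name, the URL-to-path mapping, and the file layout are all assumptions made up for this example, not itch-dl's actual code, and a regex is only good enough for simple, double-quoted `src` attributes.

```python
import re


def rewrite_image_links(html: str, local_files: dict) -> str:
    """Replace <img src="..."> URLs with local paths where a copy exists."""

    def swap(match: re.Match) -> str:
        url = match.group(2)
        # Keep the original URL if no local copy was downloaded.
        return match.group(1) + local_files.get(url, url) + match.group(3)

    return re.sub(r'(<img\b[^>]*?\bsrc=")([^"]*)(")', swap, html)


page = (
    '<img class="cover" src="https://img.example/cover.png">'
    '<img src="https://img.example/banner.png">'
)
mapping = {"https://img.example/cover.png": "screenshots/cover.png"}
rewritten = rewrite_image_links(page, mapping)
```

A real implementation would more likely use an HTML parser and also handle CSS backgrounds, `srcset`, and lazy-loading attributes, which is part of why this is more work than it first appears.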

I'll try to find some time to fix up dates and at least partial site parsing in the coming weeks, but I've got a lot on my plate right now until the end of February, so I can't say when :/

@JoshuaFern
Author

Thanks for your careful consideration.

Conceptually, I think I'm so keen on accurate mirroring because I can imagine a future where itch doesn't exist anymore, and this tool has been used to back up a bunch of games and post them on archive.org. It would be a shame if anything was lost.

@Akamaru

Akamaru commented Nov 2, 2024

Hey, I think a very good way to save webpages could be Monolith: https://github.com/Y2Z/monolith
It saves the complete page in a single HTML file.
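
A rough sketch of how that could be scripted from Python, assuming the `monolith` binary is installed and on PATH. The helper names and the example URL are made up for illustration.

```python
import shutil
import subprocess


def monolith_command(url: str, output: str) -> list:
    # `monolith <url> -o <file>` bundles the page and its assets into one HTML file.
    return ["monolith", url, "-o", output]


def save_page(url: str, output: str) -> None:
    """Run monolith on a URL; raises if the tool is missing or exits non-zero."""
    if shutil.which("monolith") is None:
        raise RuntimeError("monolith is not installed or not on PATH")
    subprocess.run(monolith_command(url, output), check=True)


# Hypothetical game page URL, for illustration only.
command = monolith_command("https://example.itch.io/some-game", "some-game.html")
```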
