Create a GamesDB auto scraper using hashing to fetch by ID. #355
Awesome work! I've merged this in commit 2bd9062 with a few small changes to make it build on Windows (commit 68a77d1). I'll try to improve the code around the scraper system so ES can use multiple scrapers soon: platform-specific scrapers like MameDB first; if none exist, try a checksum-based scraper; if that also fails or doesn't support the platform, try a by-name scraper like always. You've done the hard part though. :) If you think the hash.tsv file is "done" (and won't be updated much anymore) I can build it into the executable.
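The fallback order described above could be sketched roughly as follows. This is only an illustration in Python, not the EmulationStation code: the `Scraper` signature, `scrape_with_fallback`, and the three stand-in scrapers are all hypothetical names invented for the example.

```python
from typing import Callable, List, Optional

# Hypothetical interface: a scraper takes (platform, rom_path) and returns
# metadata, or None when it misses or does not support the platform.
Scraper = Callable[[str, str], Optional[dict]]

def scrape_with_fallback(scrapers: List[Scraper], platform: str,
                         rom_path: str) -> Optional[dict]:
    """Return the first non-None result, trying scrapers in priority order."""
    for scraper in scrapers:
        result = scraper(platform, rom_path)
        if result is not None:
            return result
    return None

def platform_scraper(platform, rom_path):
    # Stand-in for a platform-specific source such as MameDB.
    return {"name": "from platform scraper"} if platform == "arcade" else None

def checksum_scraper(platform, rom_path):
    # Stand-in for the hash-based scraper; pretend it knows a single ROM.
    return {"name": "from checksum scraper"} if rom_path == "known.nes" else None

def name_scraper(platform, rom_path):
    # Stand-in for the by-name search, which always returns some guess.
    return {"name": "from name scraper"}

# Priority order: platform-specific, then checksum, then by-name.
CHAIN = [platform_scraper, checksum_scraper, name_scraper]
```

Calling `scrape_with_fallback(CHAIN, "nes", "other.nes")` falls all the way through to the by-name stand-in, because neither earlier scraper recognizes the ROM.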
Thanks for the corrections. I'm using Linux, if you couldn't tell. The hash.tsv is "done" for the platforms that it supports. Sometimes I find mistakes or new games that got added to thegamesdb, but that is fairly rare. Including it seems like a good idea; adding new platforms typically requires some code changes, so those could include a patch to the hash file. I did it this way originally since Go didn't seem to have a good way to include static files in the binary, but my code stores the ETag (the md5sum of the hash file), then sends it with the request to see if the response is a 304.
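The ETag check described above might look roughly like this. It is a sketch in Python rather than the actual Go code, and it makes several assumptions: the `fetch` callable, `refresh`, and `make_fake_server` are hypothetical names, the ETag is recomputed from the cached bytes instead of being stored separately, and the network is replaced by an injected function so the logic can be exercised offline.

```python
import hashlib
from typing import Callable, Dict, Optional, Tuple

# Hypothetical interface: fetch(headers) -> (status_code, body).
Fetch = Callable[[Dict[str, str]], Tuple[int, bytes]]

def etag_of(data: bytes) -> str:
    """ETag as described in the comment: the MD5 hex digest of the file."""
    return hashlib.md5(data).hexdigest()

def refresh(fetch: Fetch, cached: Optional[bytes]) -> bytes:
    """Return up-to-date file contents, reusing the cached copy on a 304."""
    headers = {}
    if cached is not None:
        # Conditional request: the server answers 304 if the ETag still matches.
        headers["If-None-Match"] = '"%s"' % etag_of(cached)
    status, body = fetch(headers)
    if status == 304 and cached is not None:
        return cached
    if status == 200:
        return body
    raise RuntimeError("unexpected HTTP status %d" % status)

def make_fake_server(current: bytes) -> Fetch:
    """Build a stand-in server that serves `current` and honors If-None-Match."""
    def fetch(headers):
        if headers.get("If-None-Match") == '"%s"' % etag_of(current):
            return 304, b""
        return 200, current
    return fetch
```

With this shape, the first call downloads the file, and later calls transfer nothing unless the file on the server has changed.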
Nice! Where do the hash values for the ROMs come from? I know there are different ROMs out there, for instance the ones from the GoodTools collection, TOSEC or NO-INTRO.
@sselph Also, if any games aren't scraped when the scraper is run, what is the best way to ensure they are picked up the next time (whether for someone else in the future, or for yourself if you ever have to re-scrape)? Is it just a case of adding a new game entry to thegamesdb, or is there more to it than that?
@nilsbyte The hashes were from no-intro's parent/clone XML. @robertybob Sure, I can write up a guide on how it works and how to fix it. But the short answer is: if the scraper missed the game, it either didn't match the no-intro hash or the game wasn't in thegamesdb when I made my DB of hash-to-ID mappings. Adding it to thegamesdb wouldn't be enough for the auto scraper, since its hash-to-ID mapping also has to be added to my DB. You can always run the manual scraper with the "only scrape games without images" option and manually do the ones the scraper missed.
Ah ok. I think out of the 400+ Game Boy (GB) ROMs I scraped, only 130ish were found, and I know that thegamesdb has every game that I was trying to scrape. I guess one option would be to download them again from another source, which may mean the hashes match the no-intro database.
Maybe a simple web service could be built that collects users' ROM hashes + accompanying GamesDB entry IDs. If more than one person reports the same hash-ID pairing, add it to the "official" list.
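The promotion rule suggested above (accept a pairing once more than one person has reported it) could be sketched like this. It is a hypothetical in-memory model in Python, not an existing service: `PairingCollector`, its methods, and the user names are invented for the example, and a real service would also need storage, authentication, and some way to handle conflicting IDs for the same hash.

```python
from collections import defaultdict

class PairingCollector:
    """Collects (ROM hash, GamesDB ID) reports and promotes confirmed ones."""

    def __init__(self, min_reporters=2):
        # "More than one person" means at least two distinct reporters.
        self.min_reporters = min_reporters
        # (hash, game_id) -> set of users who reported that pairing
        self._reporters = defaultdict(set)

    def report(self, user, rom_hash, game_id):
        # Hashes are normalized to lowercase; a set ignores repeat reports
        # from the same user, so one person cannot confirm their own pairing.
        self._reporters[(rom_hash.lower(), game_id)].add(user)

    def official(self):
        """Return the sorted list of pairings with enough distinct reporters."""
        return sorted(
            pair for pair, users in self._reporters.items()
            if len(users) >= self.min_reporters
        )
```

For example, if "alice" reports hash `abc` with ID 10 twice, the pairing stays unconfirmed; it only becomes official once a second user such as "bob" reports the same pairing.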
@robertybob That seems low, and GB is just a binary dump, so there are no headers or odd encodings that I'm aware of to throw things off. I'll inspect my dataset to see if I made errors (opening an issue in my repo). The image flags I mentioned in the issue should work on the Windows version; it was just the command-line workaround to rename your images that was Linux-specific. @Aloshi I had the same idea a while back. When I get a chance I'll throw together a proof of concept in a standalone script.
I'm currently working on a searchable database that uses both ROM names and ROM titles to make data retrieval faster. It will be implemented as a responsive dataframe, allowing for versatile queries. Users will only need to download the data file from a location to be specified. I don't know whether you are still working on this, though, so I may be reinventing the wheel.
I've created this as a new scraper. It works best if you turn off the "user decides on conflict" option. The first time it runs, it downloads a hash.tsv file and loads it into memory, then starts hashing ROMs and getting the metadata by ID. This is my first C++ code, so I'm sure there are some mistakes, but I've compiled it and run it and it works.
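The flow described above (load hash.tsv into memory, hash each ROM, look up its ID) can be sketched as follows. This is a Python illustration, not the C++ from the pull request, and it rests on two assumptions the thread does not confirm: that each line of hash.tsv starts with a hash column followed by an ID column, and that the hash is SHA-1. The function names are hypothetical.

```python
import hashlib

def load_hash_table(tsv_text):
    """Parse TSV text into a dict of hash -> GamesDB ID.

    Assumed layout: first column is the hash, second is the ID;
    blank or malformed lines are skipped.
    """
    table = {}
    for line in tsv_text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2 and fields[0]:
            table[fields[0].lower()] = fields[1]
    return table

def hash_rom(data):
    # SHA-1 is an assumption made for this sketch.
    return hashlib.sha1(data).hexdigest()

def lookup_id(table, rom_data):
    """Return the ID for the ROM contents, or None if the hash is unknown."""
    return table.get(hash_rom(rom_data))
```

Once the ID is known, the metadata can be requested by ID rather than by name, which avoids ambiguous name matches and is why turning off "user decides on conflict" works well here.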