Create a GamesDB auto scraper using hashing to fetch by ID. #355

Closed

@sselph
sselph commented Jan 19, 2015

I've created this as a new scraper. It works best if you turn off the "user decides on conflict" option. The first time it runs, it downloads a hash.tsv file and loads it into memory, then starts hashing ROMs and fetching the metadata by ID. This is my first C++ code, so I'm sure there are some mistakes, but I've compiled it and run it, and it works.
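To illustrate the flow described above, here is a minimal Python sketch (not the C++ from this PR). It assumes hash.tsv holds one `hash<TAB>id` pair per line and that the hash is SHA-1; the thread states neither, and the function names are hypothetical.

```python
import hashlib


def load_hash_table(tsv_text):
    """Parse 'hash<TAB>id' lines into a dict (the column layout is an assumption)."""
    table = {}
    for line in tsv_text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2 and fields[0]:
            table[fields[0].lower()] = fields[1]
    return table


def hash_rom(path, chunk_size=1 << 20):
    """Hash a ROM file in chunks so large files are not read into memory at once."""
    digest = hashlib.sha1()  # assumption: the thread does not name the algorithm
    with open(path, "rb") as rom:
        for chunk in iter(lambda: rom.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def lookup_game_id(path, table):
    """Return the database ID mapped to this ROM's hash, or None if unknown."""
    return table.get(hash_rom(path))
```

A ROM whose hash is missing from the table returns `None`, which is the case where a scraper would have to fall back to a by-name search.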

@Aloshi
Owner
Aloshi commented Jan 20, 2015

Awesome work! I've merged this in commit 2bd9062 with a few small changes to make it build on Windows (commit 68a77d1).

I'll try and improve the stuff around the scraper system to let ES use multiple scrapers soon (platform-specific scrapers like MameDB first, if none exist then try a checksum-based scraper, if that also fails or doesn't support the platform, try a by-name scraper like always). You've done the hard part though. :)
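The fallback order described above could look roughly like the following Python sketch. The tuple layout and function names are hypothetical and are not part of ES; the point is only that each scraper is tried in priority order and skipped if it doesn't support the platform or finds nothing.

```python
def scrape_with_fallback(game, scrapers):
    """Try scrapers in priority order and return (scraper_name, result).

    `scrapers` is a list of (name, supports, scrape) tuples, where
    supports(platform) returns a bool and scrape(game) returns metadata
    or None. Returns (None, None) if every scraper is skipped or fails.
    """
    for name, supports, scrape in scrapers:
        if not supports(game["platform"]):
            continue  # this scraper doesn't handle the platform
        result = scrape(game)
        if result is not None:
            return name, result
    return None, None
```

Ordering the list as platform-specific, then checksum-based, then by-name gives the behaviour outlined in the comment.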

If you think the hash.tsv file is "done" (and won't be updated much anymore) I can build it into the executable.

@Aloshi Aloshi closed this Jan 20, 2015
@sselph
sselph commented Jan 20, 2015

Thanks for the corrections. I'm using Linux, if you couldn't tell. The hash.tsv is "done" for the platforms that it supports. Sometimes I find mistakes or new games that got added to thegamesdb, but that is fairly rare. Including it seems like a good idea; adding new platforms typically requires some code changes, so those could include a patch to the hash file.

I did it this way originally since Go didn't seem to have a good way to include static files in the binary, but my code stores the ETag (the md5sum of the hash file) and then sends it with the request to see if the response is a 304.
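For readers unfamiliar with that pattern, here is a rough Python sketch of a conditional download (not the Go code referred to above). It assumes the server accepts the bare MD5 hex digest in an `If-None-Match` header; some servers expect the value wrapped in double quotes. The function names are hypothetical.

```python
import hashlib
import urllib.error
import urllib.request


def local_etag(path):
    """Compute the MD5 hex digest of the cached file, used here as the ETag."""
    with open(path, "rb") as cached:
        return hashlib.md5(cached.read()).hexdigest()


def build_conditional_request(url, etag):
    """Build a GET request that asks the server to skip unchanged content."""
    request = urllib.request.Request(url)
    if etag:
        request.add_header("If-None-Match", etag)
    return request


def fetch_if_changed(url, etag):
    """Return the new file bytes, or None if the server answers 304 Not Modified."""
    try:
        with urllib.request.urlopen(build_conditional_request(url, etag)) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:  # cached copy is still current
            return None
        raise
```

When `fetch_if_changed` returns `None`, the cached hash.tsv can be reused as-is.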

@sselph sselph deleted the unknown repository branch Jan 20, 2015
@nilsbyte
Collaborator

Nice! Where do the hash values for the ROMs come from? I know there are different ROMs out there, for instance the ones from the GoodTools collection, TOSEC or NO-INTRO.

@robertybob

@sselph Also, if any games aren't scraped when the scraper is run, what is the best way to ensure they are picked up the next time (whether for someone else in the future, or for yourself if you ever have to re-scrape)? Is it just a case of adding a new game entry to thegamesdb, or is there more to it than that?
If there is already an entry on thegamesdb but it's not picked up by the scraper, despite having the correct filename, what can be done about that? Perhaps you could write a guide (?) :)

@sselph
sselph commented Jan 20, 2015

@nilsbyte The hashes were from no-intro's parent/clone xml.

@robertybob Sure, I can write up a guide on how it works and how to fix misses. The short answer is that if the scraper missed a game, either the ROM didn't match the no-intro hash or the game wasn't in thegamesdb when I made my DB of hash-to-ID mappings. Adding it to thegamesdb wouldn't be enough for the auto scraper, since its hash-to-ID mapping also has to be added to my DB. You can always run the manual scraper with the "only scrape games without images" option and manually do the ones the scraper missed.

@robertybob

Ah OK. I think that, out of the 400+ GB ROMs I scraped, only 130ish were found, and I know that thegamesdb has every game that I was trying to scrape. I guess one option would be to download them again from another source, which may mean the hashes match the no-intro database.
I've been using the Windows version of your scraper, which means I can't scrape without images. I guess I'll just have to bite the bullet and attempt to scrape the games from my Pi with that flag lol

@Aloshi
Owner
Aloshi commented Jan 20, 2015

Maybe a simple web service could be built that collects users' ROM hashes + accompanying GamesDB entry IDs. If more than one person reports the same hash-ID pairing, add it to the "official" list.
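The acceptance rule suggested above can be sketched in a few lines of Python. This only covers the in-memory bookkeeping, not the web service itself; the class and method names are hypothetical, and "more than one person" is read as at least two distinct reporters.

```python
from collections import defaultdict


class PairingCollector:
    """Collect user-reported (ROM hash, GamesDB ID) pairs and promote agreed ones."""

    def __init__(self, min_reporters=2):
        self.min_reporters = min_reporters
        # Maps (hash, game_id) to the set of users who reported that pairing.
        self._reporters = defaultdict(set)

    def report(self, user, rom_hash, game_id):
        """Record one report; repeats from the same user count only once."""
        self._reporters[(rom_hash.lower(), game_id)].add(user)

    def official_pairings(self):
        """Return the pairings confirmed by enough distinct users."""
        return {
            pair
            for pair, users in self._reporters.items()
            if len(users) >= self.min_reporters
        }
```

Tracking reporters as a set keeps a single user from promoting a pairing by submitting it repeatedly.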

@sselph
sselph commented Jan 20, 2015

@robertybob That seems low, and GB is just a binary dump, so there are no headers or odd encodings that I'm aware of to throw things off. I'll inspect my dataset to see if I made errors (I'm opening an issue in my repo). The image flags I mentioned in the issue should work on the Windows version; it was just the command-line workaround to rename your images that was Linux-specific.

@Aloshi I had the same idea a while back. When I get a chance I'll throw together a proof of concept in a standalone script.
