The ultimate HIBP downloader: curl #79
Comments
Hey @muzso, that is an awesome approach and I'm kind of wondering why we never did that in the first place! I've added a reference to the readme, well done!
Thanks for the mention in the readme.
It's not a regular expression. Curl calls it "globbing", which is a simple method for specifying string patterns. See the curl manual for details.
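For illustration, the globbing syntax looks roughly like this (a sketch; the URLs are placeholders, not from this thread):

```
# numeric/alphabetic ranges: expands to chapter1.html ... chapter9.html
curl -s -O "https://example.com/chapter[1-9].html"

# explicit sets: expands to page_index.html, page_about.html, page_faq.html
curl -s -O "https://example.com/page_{index,about,faq}.html"
```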
Fixed!
Thanks.
For some reason this skipped a single file, |
@dextercd I've also added the
I've tested this (with the …). It ran for 34 minutes (likely due to the more modest network parameters) and finished successfully: all 1048576 files were downloaded (with OK-looking file sizes). :)
Nice work. In my experience `sed` was slow here (it gets invoked once per file). However I managed to get a 25-30% speedup by moving the FILENAME manipulation and the removal of the `\r` out of awk (into `find -printf '%f\n'` and `tr -d '\r'` respectively):

```
oliver@alpha:~/hibp/data$ time find . -type f -printf '%f\n' | egrep -ia '^[0-9a-f]{5}$' | head -n10000 | xargs awk -F: '{ print FILENAME $1 ":" $2 }' | tr -d '\r' > hibp_all2.txt

real    0m13.353s
user    0m13.062s
sys     0m1.846s

oliver@alpha:~/hibp/data$ time find . -type f -print | egrep -ia '/[0-9a-f]{5}$' | head -n10000 | xargs -r -d '\n' awk -F: '{ sub(/\r$/,""); print substr(FILENAME, length(FILENAME)-4, 5) $1 ":" $2 }' > hibp_all.txt

real    0m18.645s
user    0m17.648s
sys     0m0.917s

oliver@alpha:~/hibp/data$ diff hibp_all.txt hibp_all2.txt
[ no differences ]
```

The above was measured on 10000 files. So the suggested "improved" command line is:

```
find . -type f -printf '%f\n' | egrep -ia '^[0-9a-f]{5}$' | xargs awk -F: '{ print FILENAME $1 ":" $2 }' | tr -d '\r' > hibp_all.txt
```
That's the exact reason (1 million invocations of sed) why I didn't even give it (sed) a try.
Great! :) You've motivated me to spend some time/effort on performance optimization too. :) As a first step, I've collected the input filenames into a file so the measurements are influenced as little as possible by other factors.
I've also created a ramdisk and copied the input files to it, so my SSD won't be a factor either. I've included runs of the same code both using my SSD and using the ramdisk, but the difference is within the margin of error (of measurement accuracy). I've done quite a few variations.
Here's the code of
Here's the code of
And here's the code of
It seems that the best performance came from this:
So the best commandline seems to be this (this one must be executed in the directory where the HIBP files were downloaded):
Great stuff... I particularly like the sort of filenames before the merge, that's essential!

You may also be interested in a little project I have been working on which is all about working with this dataset: https://github.com/oschonrock/hibp

The key is... it uses "binary storage", which means the on-disk size is halved. And binary means that each "record" is the same size, 24 bytes, and that means we can use "binary search". So it does that and provides a "local http server" you can connect to for queries. The
All the tools are very lean on memory and disk space. All you need is enough disk space for one copy of the binary DB (currently ~21GB) and a tiny amount of memory (about ~50MB during download, and ~5MB during serve). My goal was to make this feasible on a "small" virtual machine, which you can get for a few $ per month, say 50GB SSD, 2 cores, and 2GB memory. I have build instructions for Linux, FreeBSD and Windows, and it is tested on those. If this is popular I am happy to maintain packages. Feedback very welcome!
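As a rough illustration of why fixed-size records matter (a sketch assuming only the 24-byte record size mentioned above; the file name is hypothetical and the internal field layout is not shown):

```
# with fixed 24-byte records, record i always starts at byte offset i*24,
# so any record can be read directly without scanning the file; this
# constant-time access to the i-th record is what makes binary search
# over the sorted file cheap
i=123456
dd if=hibp.bin bs=24 skip="$i" count=1 2>/dev/null | xxd
```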
Great project! :)
You can get a VM with 4 vCPUs, 24 GB RAM and 50 GB storage (don't remember whether it's HDD or SSD) for free from Oracle Cloud. :) I have been using this for over a year now. They don't even charge you for egress traffic (Google does, even in its "always-free" plan) ... or at least not at the amount of traffic I'm generating.
I will check that out. Interesting that disk size is the limiting factor. This DB is so large that even just one (text) copy of it almost fills that VM. And your …

If you had time to build my project and try it out on your VM, I would really appreciate it, and the feedback will help improve it! Thanks!
I am amazed the go version didn't do better. Much slower than …

In any case, what I realised when coding my project is that there are 2 slow things here: network and disk. We want to use the disk as little as possible and write to it while we are downloading, so that time is essentially free. My code also does the conversion to binary, and that is free too.
I've found this interesting too. No idea why it came out this way. Didn't spend any time on optimizing the Python/Perl/Go implementations for performance.
Storing the data in a proprietary binary format is not necessarily an advantage. It consumes less space and can be more efficient to search, but it requires custom code/tooling to process, while a text-based format can be processed more easily with a number of tools. There are pros and cons on both sides.
That's true of course. I can easily add a switch to spit out the text version. It would still be faster and use half the disk space (because there is no need for a temp copy). Let me know if that is of interest.

What code/tooling exists to do something useful with the text file? That's what I never understood. I started using this dataset many years ago and used to insert it into a MySQL DB with the SHA1 as the primary index, converting it to binary on the way into the DB. That import process was super slow and almost started breaking as the data grew, and the queries averaged around 45ms, which is quite slow. That's why I started with this custom tool, for querying, and later (because of your curl idea) for the download as well.

What do you do with 38GB of text? Any script, language, etc. can make an HTTP request, that's why I chose that. But a "php extension" or a python/js module with a C backend could also be good.
If you need the hashes for continuous use (e.g. integrated into an authentication system), then I agree that storing it in a performance-optimized format makes sense. If you need it just for a one-off examination, research, etc., then dealing with a custom storage format might be an unnecessary overhead.
As demonstrated: you can grep it or write a shell/awk script to process it, etc. It depends on your use case.
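For example, a one-off lookup against the merged text file needs nothing more than standard tools (a sketch, assuming the merged file is named hibp_all.txt and contains full 40-character hashes):

```
# hash the candidate password and look it up; the dump uses uppercase hex
# while sha1sum prints lowercase, hence the case-insensitive grep; a full
# grep over ~38 GB is a linear scan, which is fine for one-off checks
hash=$(printf '%s' 'correct horse battery staple' | sha1sum | cut -d' ' -f1)
grep -i -m1 "^${hash}:" hibp_all.txt
```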
I agree, of course. It's the UNIX way. I just literally can't think of a use case for this data other than "look up a password and get the count". The SHA1 is non-intelligible by design. Your use case above is literally "download the data and combine it", i.e. "just getting the data in the first place". I can't think of another one. But I guess you can never imagine all use cases.

I will add a "text export" switch, both during download and also if you already have the binary version. That way you never have to "store the 38GB of text" unless you specifically want to, and can just pipe it to the next process.

Just working on "resuming downloads"... some users have reported that their internet is so flaky they can't get through the 38GB without a failure (even with retries), so "resume" does seem useful, and is not so hard. Will add --text-export after that.

Troy has made such a fabulous resource available here, but unless you are happy with his online API (security questions...), then using it is actually quite hard: the .NET downloader, the file size, what to query it with... That's what I was trying to solve.
FYI: resuming downloads and the text export now both work, although not together... i.e. you can't resume a text download.
Hi!
Thanks for HIBP and this downloader. At first I was considering using it, but the API of HIBP passwords is so easy that I wrote a small shell script for it instead. It was slow as hell because it had no parallelism at all, far too slow for my taste, so I started thinking about adding parallelism. And that's when I stumbled on curl's URL globbing feature.
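For context, such a serial script is only a few lines; a sketch of the general idea (not the original script):

```
# fetch all 16^5 = 1048576 ranges one at a time; at ~100 ms per request this
# takes well over a day, which is why parallelism matters so much here
for ((i = 0; i < 1048576; i++)); do
  prefix=$(printf '%05X' "$i")
  curl -s "https://api.pwnedpasswords.com/range/${prefix}" -o "${prefix}"
done
```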
curl is the Swiss Army knife of HTTP downloading: it supports patterns/globbing, massive parallelism, and pretty much every other aspect of HTTP downloads (proxies, HTTP/1/2/3, all SSL/TLS versions, etc.).
Here's a single curl commandline that downloads the entire HIBP password hash database into the current working directory:
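A sketch of a commandline along these lines (the exact flags and parallelism values shown here are illustrative, not necessarily the original):

```
# each {...} set expands to one hex digit, so the URL globs to all
# 16^5 = 1048576 ranges; --remote-name-all saves each response under its
# 5-character prefix; the --parallel-max value is just an example
H="{0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F}"
curl -s --remote-name-all --parallel --parallel-max 150 \
  "https://api.pwnedpasswords.com/range/${H}${H}${H}${H}${H}"
```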
If you want to be able to check the curl output/log, remove the `-s` option and redirect the output to a file. To debug issues, you can add the `-v` (or the `--trace-ascii -`) option to increase verbosity.

By default curl does not retry any requests.
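If your connection is flaky, it can therefore be worth enabling retries explicitly; a sketch with illustrative values (the rest of the command is as above):

```
H="{0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F}"
# up to 10 attempts per transfer, 2 seconds apart; --retry-all-errors
# (curl 7.71.0+) also retries errors that --retry alone treats as permanent
curl -s --remote-name-all --parallel --parallel-max 150 \
  --retry 10 --retry-delay 2 --retry-all-errors \
  "https://api.pwnedpasswords.com/range/${H}${H}${H}${H}${H}"
```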
On an always-free Oracle Cloud VM this finished (for me) in 13.5 minutes. ;-)
The URL globbing and the `--remote-name-all` options have been around in curl for ages (i.e. for over a decade), and the `--parallel*` options were added in Sep 2019 (v7.66.0). So pretty much all "recent" Linux distros already ship a curl version that fully supports this commandline. Curl is cross-platform, e.g. you can download a Windows version too.
If you don't have the necessary 30-40 GB free space for the entire hash dump, you can get away with less by downloading in smaller batches and instantly compressing them.
Here's a command that you can fire and forget (i.e. disconnect / log out) on any Linux PC/server and it'll do the job in batches:
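A sketch of one way to do this in 16 batches keyed on the first hex digit (file names and parallelism values are illustrative, not necessarily the original):

```
# nohup + & lets you log out while the job keeps running; at most one
# uncompressed batch (~2.4 GB) is on disk at a time
nohup bash -c '
  H="{0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F}"
  for d in 0 1 2 3 4 5 6 7 8 9 A B C D E F; do
    # download the 65536 ranges whose prefix starts with $d
    curl -s --remote-name-all --parallel --parallel-max 150 \
      "https://api.pwnedpasswords.com/range/${d}${H}${H}${H}${H}"
    # compress the batch and delete the plain files before the next batch
    tar -czf "hibp_${d}.tar.gz" "${d}"???? && rm -f "${d}"????
  done
' > hibp_batches.log 2>&1 &
```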
This doesn't support suspend and resume of the download job (other HIBP downloaders do), but since it finishes pretty quickly (if you have a good enough internet connection), I don't see any reason for this feature.
You can easily assemble the server responses into a single ~38 GB file with the following commandline (on Linux):
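A sketch of one way to do the merge, consistent with the variants discussed in the comments (it sorts the file list first, prepends each file's 5-character name to every line, and strips the `\r`):

```
# run in the download directory; assumes the files are named after their
# 5-character hash prefixes
find . -type f -printf '%f\n' | egrep -ia '^[0-9a-f]{5}$' | sort \
  | xargs awk -F: '{ print FILENAME $1 ":" $2 }' | tr -d '\r' > hibp_all.txt
```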
You can sort it easily based on the second field to get the most "popular" hashes:
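A sketch of such a sort (the output file name is illustrative):

```
# numeric, descending sort on the second ":"-separated field (the count);
# note that sorting a ~38 GB file needs roughly as much free temp space
sort -t: -k2,2nr hibp_all.txt > hibp_sorted_by_count.txt
```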
To get just the most popular hashes:
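For instance (the cutoff of 100 is arbitrary):

```
# the 100 most common hashes by breach count
sort -t: -k2,2nr hibp_all.txt | head -n 100
```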
You can feed these into a pre-computed lookup table (like what crackstation.net provides) to get the most popular breached passwords. :)