Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Example dataset: pwned passwords list #42363

Open
alexey-milovidov opened this issue Oct 17, 2022 · 7 comments
Open

Example dataset: pwned passwords list #42363

alexey-milovidov opened this issue Oct 17, 2022 · 7 comments

Comments

@alexey-milovidov
Copy link
Member

https://haveibeenpwned.com/Passwords
https://downloads.pwnedpasswords.com/passwords/pwned-passwords-sha1-ordered-by-hash-v8.7z

A degenerate example to test range requests with the primary key.

@alexey-milovidov
Copy link
Member Author

CREATE TABLE passwords
(
    `hash` FixedString(20) CODEC(ZSTD(6)),
    `count` UInt32 CODEC(ZSTD(6))
)
ENGINE = MergeTree
ORDER BY hash

clickhouse-client --progress --query "INSERT INTO passwords WITH splitByChar(':', line) AS columns SELECT unhex(columns[1]), replaceOne(columns[2], '\r', '') FROM input('line String') FORMAT LineAsString" < pwned-passwords-sha1-ordered-by-hash-v8.txt

@alexey-milovidov
Copy link
Member Author

SELECT count() FROM passwords

847223402

@alexey-milovidov
Copy link
Member Author

The performance is really great:

milovidov-desktop :) SELECT count FROM passwords WHERE hash = SHA1('qwezxc') LIMIT 10

SELECT count
FROM passwords
WHERE hash = SHA1('qwezxc')
LIMIT 10

Query id: 62977e3a-edf3-4a95-bda3-f49b29b63690

┌─count─┐
│ 18882 │
└───────┘

1 row in set. Elapsed: 0.004 sec. Processed 8.19 thousand rows, 196.62 KB (2.08 million rows/s., 49.81 MB/s.)

@alexey-milovidov
Copy link
Member Author

The data size in ClickHouse is 17 GB, just a little larger than the 7z archive (16.3 GB).

@alexey-milovidov
Copy link
Member Author

It works as fast in ClickHouse Cloud:

clickhouse-cloud :) SELECT count FROM passwords WHERE hash = SHA1('qwezxc') LIMIT 10

SELECT count
FROM passwords
WHERE hash = SHA1('qwezxc')
LIMIT 10

Query id: 86a27c4b-eff0-453a-9109-aed571824b3b

┌─count─┐
│ 18882 │
└───────┘

1 row in set. Elapsed: 0.003 sec. Processed 8.19 thousand rows, 196.62 KB (3.11 million rows/s., 74.75 MB/s.)

@alexey-milovidov
Copy link
Member Author

alexey-milovidov commented Oct 17, 2022

A benchmark for low-latency queries:

clickhouse-benchmark -c256 --host x36rme18yy.eu-central-1.aws.clickhouse.cloud --secure --password ... --query "
    SELECT count FROM passwords WHERE hash = SHA1((SELECT randomPrintableASCII(8)))"

@alexey-milovidov
Copy link
Member Author

Queries executed: 280219.

x36rme18yy.eu-central-1.aws.clickhouse.cloud:9440, queries 280219, QPS: 2848.831, RPS: 23340240.601, MiB/s: 534.175, result RPS: 0.000, result MiB/s: 0.000.

0.000%          0.003 sec.
10.000%         0.017 sec.
20.000%         0.035 sec.
30.000%         0.067 sec.
40.000%         0.073 sec.
50.000%         0.078 sec.
60.000%         0.098 sec.
70.000%         0.108 sec.
80.000%         0.155 sec.
90.000%         0.168 sec.
95.000%         0.174 sec.
99.000%         0.208 sec.
99.900%         0.462 sec.
99.990%         0.733 sec.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant