Small, production-quality Go module for fast 64-bit SimHash on HTML documents.
It includes:
- Core streaming SimHash hasher (`Hasher64`)
- Visible-text HTML tokenization (`TokenizeHTMLText`)
- DOM-structure tokenization (`TokenizeHTMLDOM`)
- Comparison helpers (`Hamming64`, `Similarity64`)

Install:

    go get github.com/evanleleux/simhash

Usage:

    package main

    import (
    	"fmt"
    	"log"

    	"github.com/evanleleux/simhash"
    )

    func main() {
    	a := []byte(`<html><body><h1>Checkout</h1><p>Pay securely</p></body></html>`)
    	b := []byte(`<html><body><h1>Checkout</h1><p>Pay securely today</p></body></html>`)

    	// Fingerprint the visible text of each document.
    	h1, err := simhash.FingerprintHTMLText64(a)
    	if err != nil {
    		log.Fatal(err)
    	}
    	h2, err := simhash.FingerprintHTMLText64(b)
    	if err != nil {
    		log.Fatal(err)
    	}

    	fmt.Printf("h1=0x%016x h2=0x%016x\n", h1, h2)
    	fmt.Printf("hamming=%d similarity=%.4f\n", simhash.Hamming64(h1, h2), simhash.Similarity64(h1, h2))
    }

API:
- `FingerprintTokens64(tokens TokenStream, opts ...Option) (uint64, error)`
- `FingerprintHTMLText64(html []byte, opts ...Option) (uint64, error)`
- `FingerprintHTMLDOM64(html []byte, opts ...Option) (uint64, error)`
- `Hamming64(a, b uint64) int`
- `Similarity64(a, b uint64) float64`

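
For intuition about the two comparison helpers, here is a minimal standard-library sketch of how a Hamming distance and a normalized similarity over 64-bit fingerprints are commonly computed. The lowercase names `hamming64` and `similarity64` are illustrative, and the `1 - distance/64` normalization is an assumption; the package's own `Similarity64` may be defined differently.

```go
package main

import (
	"fmt"
	"math/bits"
)

// hamming64 counts the bit positions where a and b differ.
func hamming64(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

// similarity64 maps the Hamming distance to [0, 1]: 1.0 means identical
// fingerprints, 0.0 means all 64 bits differ. This normalization is an
// assumption, not necessarily the one the package uses.
func similarity64(a, b uint64) float64 {
	return 1 - float64(hamming64(a, b))/64
}

func main() {
	a, b := uint64(0xF0F0), uint64(0xF0FF)
	// The two values differ only in their lowest four bits.
	fmt.Printf("hamming=%d similarity=%.4f\n", hamming64(a, b), similarity64(a, b))
	// prints: hamming=4 similarity=0.9375
}
```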
Hasher64 supports streaming accumulation without building token slices:

    h := simhash.NewHasher64()
    h.AddStringToken("checkout", 1)
    h.AddStringToken("payment", 1)
    fp := h.Sum64()
    _ = fp

Visible-text tokenization:
- Parses with `golang.org/x/net/html` (no regex parsing)
- Ignores text under `<script>` and `<style>`
- Best-effort hidden filtering (`hidden`, inline `display:none`, `visibility:hidden`) by default
- Collapses whitespace by tokenizing into word tokens
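
To show what whitespace collapsing by word tokenization can look like, here is a rough standard-library sketch of just that step. It does not parse HTML, and the function name `wordTokens`, the choice to split on every non-letter, non-digit rune, and the lowercasing are all assumptions made for illustration rather than the package's documented behavior.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// wordTokens splits text on any rune that is not a letter or digit and
// lowercases each token (both choices are assumptions). Runs of whitespace
// or punctuation never produce empty tokens.
func wordTokens(text string) []string {
	fields := strings.FieldsFunc(text, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	for i, f := range fields {
		fields[i] = strings.ToLower(f)
	}
	return fields
}

func main() {
	fmt.Println(wordTokens("  Pay \n\t securely,   today! "))
	// prints: [pay securely today]
}
```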

DOM-structure tokenization:
- Emits path tokens like `html/body/div/form/input`
- Ignores attributes by default (stable across changing classes/IDs)
- Uses configurable max depth (default `8`)
- Can focus on form tags with `WithDOMFormOnly(true)`
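
As an illustration of path tokens with a depth cap, the sketch below walks a tiny hand-built element tree. The `node` type and `pathTokens` function are hypothetical stand-ins; real code would walk `*html.Node` values from `golang.org/x/net/html`, and the package may treat depth limits differently.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a hypothetical stand-in for a parsed HTML element.
type node struct {
	tag      string
	children []*node
}

// pathTokens emits one slash-joined tag path per element, ignoring
// attributes, and does not emit paths deeper than maxDepth levels.
func pathTokens(root *node, maxDepth int) []string {
	var out []string
	var walk func(n *node, path []string)
	walk = func(n *node, path []string) {
		if len(path) >= maxDepth {
			return // depth cap reached; skip this element and its subtree
		}
		path = append(path, strings.ToLower(n.tag))
		out = append(out, strings.Join(path, "/"))
		for _, c := range n.children {
			walk(c, path)
		}
	}
	walk(root, nil)
	return out
}

func main() {
	doc := &node{tag: "html", children: []*node{
		{tag: "body", children: []*node{
			{tag: "FORM", children: []*node{{tag: "input"}}},
		}},
	}}
	for _, t := range pathTokens(doc, 3) {
		fmt.Println(t)
	}
	// prints:
	// html
	// html/body
	// html/body/form
}
```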

Options:
- `WithHashFunc(HashFunc64)` to override hashing function
- `WithWeightFunc(WeightFunc)` to override token weighting
- `WithMaxTextBytes(n int)` to cap visible text bytes processed
- `WithDOMMaxDepth(depth int)` to cap emitted DOM depth
- `WithIgnoreHidden(enabled bool)` to toggle hidden-node filtering
- `WithLowercaseTags(enabled bool)` to toggle lowercasing tag names
- `WithDOMFormOnly(enabled bool)` to emit only form-related DOM paths

Default token hash is `github.com/cespare/xxhash/v2`.
For 64-bit SimHash, near-duplicate detection often starts around Hamming distance <= 3 to 5, but this is dataset-dependent. Tune thresholds on your corpus and objective (precision vs recall).
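
One way to approach threshold tuning is to sweep candidate thresholds over a set of hand-labeled pairs and look at precision and recall at each one. The sketch below assumes a hypothetical `labeledPair` record, and the sample data in `main` is made up purely for illustration; it is not a measurement from any real corpus.

```go
package main

import "fmt"

// labeledPair is a hypothetical evaluation record: the Hamming distance
// between two fingerprints and whether a reviewer judged the two pages to
// be near-duplicates.
type labeledPair struct {
	distance  int
	duplicate bool
}

// precisionRecall scores the rule "distance <= threshold means duplicate".
func precisionRecall(pairs []labeledPair, threshold int) (precision, recall float64) {
	var tp, fp, fn int
	for _, p := range pairs {
		predicted := p.distance <= threshold
		switch {
		case predicted && p.duplicate:
			tp++
		case predicted && !p.duplicate:
			fp++
		case !predicted && p.duplicate:
			fn++
		}
	}
	if tp+fp > 0 {
		precision = float64(tp) / float64(tp+fp)
	}
	if tp+fn > 0 {
		recall = float64(tp) / float64(tp+fn)
	}
	return precision, recall
}

func main() {
	// Made-up sample data for illustration only.
	pairs := []labeledPair{
		{1, true}, {2, true}, {4, true}, {6, true},
		{5, false}, {12, false}, {30, false},
	}
	for t := 1; t <= 7; t++ {
		p, r := precisionRecall(pairs, t)
		fmt.Printf("threshold=%d precision=%.2f recall=%.2f\n", t, p, r)
	}
}
```

A lower threshold favors precision and a higher one favors recall, so the right cutoff depends on which kind of error costs more in your application.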
Do not shingle before SimHash in this workflow. Shingling is mainly useful for MinHash/Jaccard-style similarity; this package is intended for feeding token streams directly into SimHash.
Run:

    go run ./cmd/example [fileA.html fileB.html]

Without args, it compares:
- `https://evanleleux.dev/simhash/page-01` through `https://evanleleux.dev/simhash/page-10`

Default output includes per-page hashes plus adjacent-page similarity comparisons.
You can also pass two local file paths or URLs for direct pair comparison.