Composable fuzzy string matching for Polars, implemented as a native Polars plugin (Rust core + Python bindings via PyO3). All scoring runs in Rust — no Python loops over rows — and works in both eager and lazy frames.
- 24 metric expressions: edit distances (Levenshtein, Damerau-Levenshtein, OSA, Hamming), the Jaro family, token/n-gram (Jaccard, Sørensen-Dice, q-grams), LCS, and four phonetic encoders (Soundex, Metaphone, DoubleMetaphone, NYSIIS). See `ALGORITHMS.md` for the math and normalization behind each one.
- Hybrid scoring: combine multiple algorithms in a single Rust call (`hybrid_score`), plus 4 pre-built scorers (`name_default`, `phonetic_edit`, `token_char`, `prefix_ngram`).
- DataFrame helpers: `fuzzy_join`, `deduplicate`, `pairwise_compare` with blocking indexes (first-chars, char-bag) to avoid O(n²) cross joins.
- Explainability: `return_breakdown=True` returns per-metric scores alongside the combined score, so you can see why two strings matched.
- Ensembles: `weighted_avg`, `mean`, `max`, `min`, `median`, `vote`.
All similarity functions return Float64 in [0, 1]. Null in either input → null output.
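To make the [0, 1] range and the null rule concrete, here is a minimal pure-Python sketch of a normalized edit-distance similarity. It is not the plugin's Rust code: the normalization used here (1 minus distance divided by the longer length) is an assumption for illustration, and `ALGORITHMS.md` remains the reference for what each metric actually computes.

```python
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: Optional[str], b: Optional[str]) -> Optional[float]:
    """Similarity in [0, 1]; None if either input is None.

    Assumed normalization: 1 - distance / max(len(a), len(b)).
    """
    if a is None or b is None:
        return None  # null in either input gives a null output
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical in this sketch
    return 1.0 - levenshtein(a, b) / longest
```

Identical strings score 1.0, strings with nothing in common score 0.0, and a missing value on either side yields `None` rather than a misleading 0.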
Prebuilt wheels are published on PyPI for CPython 3.9–3.13 on Linux x86_64 and Windows x86_64. macOS and aarch64-linux wheels are planned (see Platform support); on those platforms pip falls back to a source build.
pip install polars-stringsim

That's it: `import polars_stringsim as pf` works out of the box.
If a wheel for your platform is missing, pip will fall back to a source build, which does require Rust (see below).
For hacking on the plugin itself, or for a platform without a prebuilt wheel:
git clone https://github.com/Pratham-26/rust_helpers.git
cd rust_helpers
# Iterative dev install (rebuild on every change):
pip install maturin
maturin develop --release
# Or: install directly from the repo's source (compiles Rust, needs a toolchain):
pip install git+https://github.com/Pratham-26/rust_helpers.git

Requires Rust stable. Pinned to polars = 0.54.4 / pyo3-polars = 0.27.
See RELEASE.md for how wheels are built and published.
import polars as pl
import polars_stringsim as pf
customers = pl.DataFrame({"name": ["Robert Smith", "Catherine Jones", "Jon Smyth"]})
db = pl.DataFrame({"name": ["Robert Smyth", "Katherine Jones", "William Brown"]})
# 1. Single metric
customers.join(db, how="cross").with_columns(
    s=pf.jaro_winkler("name", "name_right")
)
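# 2. Hybrid: spelling + phonetic + token, fused in one Rust call
# (`df` holds the candidate pairs in columns "a" and "b"; built here from the cross join)
df = customers.join(db, how="cross").rename({"name": "a", "name_right": "b"})
df.with_columns(
    hybrid=pf.hybrid_score(
        "a", "b",
        algorithms=["jaro_winkler", "double_metaphone", "trigram_jaccard"],
        weights=[0.5, 0.3, 0.2],
    )
)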
# 3. Pre-built scorer
df.with_columns(s=pf.name_default("a", "b")) # JW + Double Metaphone + trigram
# 4. Per-metric breakdown (explainability)
df.with_columns(
    bd=pf.combine(
        [pf.jaro_winkler("a", "b"), pf.double_metaphone_sim("a", "b")],
        weights=[0.6, 0.4], return_breakdown=True,
    )
)
# 5. fuzzy_join with blocking (avoids O(n*m) cross product)
pf.fuzzy_join(customers, db, left_on="name", right_on="name",
              algorithms=["jaro_winkler", "double_metaphone"], weights=[0.6, 0.4],
              threshold=0.75, top_k=1, block="first_chars", block_n=1)
# 6. deduplicate near-duplicate name variants
messy_names = pl.concat([customers, db])  # illustrative input: both name lists stacked
pf.deduplicate(messy_names, on="name",
               algorithms=["jaro_winkler"], weights=[1.0],
               composite_threshold=0.8, block="first_chars", block_n=1)
# 7. pairwise_compare for threshold tuning (returns combined score per pair)
pf.pairwise_compare(customers, db, left_on="name", right_on="name",
                    algorithms=["jaro_winkler", "trigram_jaccard"], weights=[0.6, 0.4])

Metric expressions:

jaro, jaro_winkler, levenshtein, levenshtein_norm, damerau_levenshtein,
damerau_levenshtein_norm, osa, hamming, hamming_norm, token_jaccard,
token_sorensen_dice, trigram_jaccard, trigram_sorensen_dice,
qgram_jaccard(left, right, q=3), lcs_sim, soundex_sim, soundex_jw_sim,
metaphone_sim, metaphone_jw_sim, double_metaphone_sim,
double_metaphone_jw_sim, nysiis_sim, nysiis_jw_sim.
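As an illustration of what the n-gram metrics in this list measure, the following plain-Python sketch computes a Jaccard similarity over character q-grams. The function names `qgrams` and `qgram_jaccard_py` are hypothetical stand-ins, and details such as the absence of padding and the handling of strings shorter than `q` are assumptions; the plugin's actual behaviour is documented in `ALGORITHMS.md`.

```python
def qgrams(text: str, q: int = 3) -> set:
    """Set of overlapping character q-grams (no padding; an assumption)."""
    if len(text) < q:
        # Assumed fallback: a short string is its own single gram.
        return {text} if text else set()
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def qgram_jaccard_py(left: str, right: str, q: int = 3) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| over the two q-gram sets."""
    a, b = qgrams(left, q), qgrams(right, q)
    if not a and not b:
        return 1.0  # two empty strings are treated as identical in this sketch
    return len(a & b) / len(a | b)
```

With `q=2`, "smith" and "smyth" share the bigrams `sm` and `th` out of six distinct bigrams, giving a score of 1/3. Smaller `q` values are more forgiving of single-character differences.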
- `pf.combine(metrics, *, weights=None, method="weighted_avg", threshold=None, return_breakdown=False)`: fuse pre-built metric expressions.
- `pf.hybrid_score(left, right, *, algorithms, weights=None, method="weighted_avg", threshold=None)`: same, but builds metrics in Rust (no intermediate struct column).
- Pre-built scorers: `pf.phonetic_edit`, `pf.token_char`, `pf.prefix_ngram`, `pf.name_default`.
- `pf.fuzzy_join(left, right, *, left_on, right_on, algorithms, weights, method, threshold, top_k, block, block_n, how, add_breakdown)`
- `pf.deduplicate(frame, *, on, algorithms, weights, method, composite_threshold, block, block_n)`
- `pf.pairwise_compare(left, right, *, left_on, right_on, algorithms, weights, method, block, block_n)`
- Blocking: `pf.block_first_chars(col, n=2)`, `pf.block_char_bag(col)`
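The idea behind first-chars blocking can be sketched in plain Python: only rows that share a blocking key are compared, so the candidate set is much smaller than the full cross product. This is a rough model, not the plugin's implementation; `block_key` and `candidate_pairs` are hypothetical names, and lowercasing the key is an assumption.

```python
from collections import defaultdict


def block_key(value: str, n: int = 2) -> str:
    """Blocking key: the first n characters, lowercased (casing is an assumption)."""
    return value[:n].lower()


def candidate_pairs(left, right, n=2):
    """Yield (left_value, right_value) pairs that share a blocking key."""
    buckets = defaultdict(list)
    for r in right:
        buckets[block_key(r, n)].append(r)
    for l in left:
        # Only the matching bucket is scanned, not the whole right side.
        for r in buckets.get(block_key(l, n), []):
            yield l, r


customers = ["Robert Smith", "Catherine Jones", "Jon Smyth"]
db = ["Robert Smyth", "Katherine Jones", "William Brown"]
pairs = list(candidate_pairs(customers, db, n=1))
```

Here blocking on one character leaves a single candidate pair instead of nine. It also shows the trade-off: "Catherine Jones" and "Katherine Jones" fall into different blocks and are never compared, so the blocking key should be chosen with the expected kinds of variation in mind.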
Supported `method` values: `weighted_avg` (default; weights are normalized to sum to 1), `mean`, `max`, `min`, `median`, and `vote` (the number of metrics scoring ≥ `threshold`, divided by the number of metrics N).
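The combine methods can be modelled on a single row of per-metric scores as follows. This is a plain-Python sketch of the behaviour described above, not the Rust `combine_row` code; the function name `combine_scores` and its error handling are assumptions made for illustration.

```python
from statistics import median


def combine_scores(scores, method="weighted_avg", weights=None, threshold=None):
    """Fuse per-metric scores in [0, 1] into one score (illustrative only)."""
    if not scores:
        raise ValueError("need at least one score")
    if method == "weighted_avg":
        if weights is None:
            weights = [1.0] * len(scores)
        if len(weights) != len(scores):
            raise ValueError("weights and scores must have the same length")
        total = sum(weights)  # dividing by the total normalizes weights to sum to 1
        return sum(s * w for s, w in zip(scores, weights)) / total
    if method == "mean":
        return sum(scores) / len(scores)
    if method == "max":
        return max(scores)
    if method == "min":
        return min(scores)
    if method == "median":
        return median(scores)
    if method == "vote":
        if threshold is None:
            raise ValueError("vote needs a threshold")
        # Fraction of metrics that reach the threshold.
        return sum(s >= threshold for s in scores) / len(scores)
    raise ValueError(f"unknown method: {method}")
```

For example, scores of 0.9 and 0.5 with weights 3 and 1 combine to 0.8 under `weighted_avg`, while `vote` with a threshold of 0.75 over scores 0.9, 0.5 and 0.8 gives 2/3.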
hybrid_score parallelizes its row scan across a dedicated worker pool that is
independent of the Polars engine pool (POLARS_MAX_THREADS), so you can
tune them separately. Throughput scales near-linearly with cores (≈7–8× on 16
threads vs 1).
import polars_stringsim as pf
pf.get_num_threads() # default = number of logical cores
pf.set_num_threads(8) # use 8 threads for hybrid_score
pf.set_num_threads(0) # restore default

Or set the default at process start: `POLARS_STRINGSIM_THREADS=8 python ...`
hybrid_score, fuzzy_join, deduplicate, and pairwise_compare accept algorithm names:
jaro, jaro_winkler, levenshtein, levenshtein_norm, damerau_levenshtein, damerau_levenshtein_norm, osa, hamming, hamming_norm, token_jaccard, token_sorensen_dice, trigram_jaccard, trigram_sorensen_dice, lcs_sim, soundex/soundex_jw, metaphone/metaphone_jw, double_metaphone/double_metaphone_jw, nysiis/nysiis_jw.
For what each metric computes and how it's normalized, see ALGORITHMS.md.
Prebuilt wheels on PyPI (no Rust toolchain needed):
| Platform | Wheel | Status |
|---|---|---|
| Linux x86_64 | manylinux_2_28_x86_64 | ✅ v0.1.0 |
| Windows x86_64 | win_amd64 | ✅ v0.1.0 |
| macOS x86_64 / arm64 | — | ⏳ planned (sdist fallback works; needs Rust) |
| Linux aarch64 | — | ⏳ planned (sdist fallback works; needs Rust) |
CPython 3.9–3.13 supported wherever a wheel exists. pip install polars-stringsim picks the right wheel automatically; on uncovered platforms it builds from the sdist (requires Rust stable).
cargo test --lib # 30 Rust unit tests
pytest tests/python # 38 Python end-to-end tests

Run the example end-to-end with uv:
maturin build --release
WHL=$(ls target/wheels/polars_stringsim-*.whl | head -n 1) # resolve the glob; a bare assignment keeps the literal "*"
uv run --with "$WHL" --with polars --with pytest python -m pytest tests/python/
uv run --with "$WHL" --with polars python examples/record_linkage.py

src/
├── algorithms/ # pure Rust: edit, jaro, token, lcs, phonetic
├── combiner.rs # CombineMethod enum + combine_row (weighted_avg/mean/max/min/median/vote)
├── expr/mod.rs # #[polars_expr] wrappers, one per metric
├── expr_combine.rs # combine_expr + combine_breakdown_expr (Struct output)
├── expr_hybrid.rs # hybrid_score_expr (multi-algo in one Rust call)
├── series_util.rs # str-column readers, Float64 builder, null handling
└── lib.rs # #[pymodule]
python/polars_stringsim/
├── _expression.py # per-metric expr builders + combine()
├── _registry.py # algorithm name → builder map
├── hybrid.py # hybrid_score + pre-built scorers
├── frame.py # fuzzy_join, deduplicate, pairwise_compare, blocking
└── __init__.py # public API
Everything in the original PRD is implemented and shipped:
- ✅ 24 metric expressions: edit distances, Jaro family, token/n-gram, LCS, and phonetic encoders. Full reference: `ALGORITHMS.md`.
- ✅ Composable ensembles: `combine()` with `weighted_avg`/`mean`/`max`/`min`/`median`/`vote`.
- ✅ Hybrid scoring: `hybrid_score()` (builds metrics in Rust, no intermediate struct column) + 4 pre-built scorers.
- ✅ DataFrame helpers: `fuzzy_join` (blocked/cross, threshold, `top_k`, `how`), `deduplicate` (union-find clustering), `pairwise_compare`, and blocking indexes.
- ✅ Explainability: `return_breakdown=True` returns a per-metric score breakdown.
- ✅ Native Polars plugin: all scoring in Rust via PyO3; works in eager and lazy frames.
- ✅ Prebuilt wheels on PyPI: `pip install polars-stringsim`, no Rust toolchain required (see Platform support).
- ⏳ Platform wheels: macOS (x86_64 + arm64) and aarch64-linux — currently sdist-only on those platforms.
- 🔲 Custom combiner registration (user-supplied Rust closures).
- 🔲 GPU acceleration.
- 🔲 More phonetic encoders (Caverphone, Beider-Morse).
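
The union-find clustering noted for `deduplicate` can be illustrated with a short plain-Python sketch. It assumes the pairs that scored above the threshold are already known and only shows how they are merged into clusters; the function name `cluster` and the choice of the smallest index as cluster label are assumptions, not the plugin's actual output format.

```python
def cluster(n_items, matched_pairs):
    """Group item indices into clusters from pairs that scored above threshold.

    Returns one cluster label per item; the label is the smallest index
    in that item's cluster.
    """
    parent = list(range(n_items))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in matched_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # smaller index becomes the root

    return [find(i) for i in range(n_items)]
```

Matches are transitive under this scheme: if row 0 matches row 1 and row 1 matches row 2, all three land in one cluster even when rows 0 and 2 were never compared directly, which is why a fairly strict threshold tends to work better for deduplication.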