polars-stringsim

Composable fuzzy string matching for Polars, implemented as a native Polars plugin (Rust core + Python bindings via PyO3). All scoring runs in Rust — no Python loops over rows — and works in both eager and lazy frames.

Features

24 metric expressions — edit distances (Levenshtein, Damerau-Levenshtein, OSA, Hamming), the Jaro family, token/n-gram (Jaccard, Sørensen-Dice, q-grams), LCS, and four phonetic encoders (Soundex, Metaphone, DoubleMetaphone, NYSIIS). See ALGORITHMS.md for the math and normalization behind each one.
Hybrid scoring — combine multiple algorithms in a single Rust call (hybrid_score), plus 4 pre-built scorers (name_default, phonetic_edit, token_char, prefix_ngram).
DataFrame helpers — fuzzy_join, deduplicate, pairwise_compare with blocking indexes (first-chars, char-bag) to avoid O(n²) cross joins.
Explainability — return_breakdown=True returns per-metric scores alongside the combined score, so you can see why two strings matched.
Ensembles — weighted_avg, mean, max, min, median, vote.

All similarity functions return Float64 in [0, 1]. Null in either input → null output.
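
To make the [0, 1] range and the null rule concrete, here is a minimal pure-Python sketch of a normalized edit-distance similarity. It is not the plugin's Rust code: the normalization (dividing by the longer string's length) and the treatment of two empty strings are assumptions, and ALGORITHMS.md remains the reference for what each metric actually does. `None` stands in for a Polars null.

```python
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: Optional[str], b: Optional[str]) -> Optional[float]:
    """Map the distance into [0, 1]; None in either input yields None."""
    if a is None or b is None:
        return None
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - levenshtein(a, b) / longest
```

With this sketch, "kitten" vs "sitting" has distance 3 and similarity 1 - 3/7, and any `None` input short-circuits to `None` instead of raising.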

Install

Users (no Rust toolchain required)

Prebuilt wheels are published on PyPI for CPython 3.9–3.13 on Linux x86_64 and Windows x86_64. macOS and aarch64-linux wheels are planned (see Platform support).

pip install polars-stringsim

That's it — import polars_stringsim as pf works out of the box.

If a wheel for your platform is missing, pip will fall back to a source build, which does require Rust (see below).

From source / development

For hacking on the plugin itself, or for a platform without a prebuilt wheel:

git clone https://github.com/Pratham-26/rust_helpers.git
cd rust_helpers

# Iterative dev install (rebuild on every change):
pip install maturin
maturin develop --release

# Or: install directly from the repo's source (compiles Rust, needs a toolchain):
pip install git+https://github.com/Pratham-26/rust_helpers.git

Requires Rust stable. Pinned to polars = 0.54.4 / pyo3-polars = 0.27. See RELEASE.md for how wheels are built and published.

Usage

import polars as pl
import polars_stringsim as pf

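customers = pl.DataFrame({"name": ["Robert Smith", "Catherine Jones", "Jon Smyth"]})
db = pl.DataFrame({"name": ["Robert Smyth", "Katherine Jones", "William Brown"]})

# Illustrative inputs for the snippets below:
# `df` holds candidate pairs in columns "a" and "b"; `messy_names` is a single column to deduplicate.
df = pl.DataFrame({"a": ["Robert Smith", "Jon Smyth"], "b": ["Robert Smyth", "John Smith"]})
messy_names = pl.DataFrame({"name": ["Robert Smith", "Robert Smyth", "Rob Smith", "William Brown"]})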

# 1. Single metric
customers.join(db, how="cross").with_columns(
    s=pf.jaro_winkler("name", "name_right")
)

# 2. Hybrid: spelling + phonetic + token, fused in one Rust call
df.with_columns(
    hybrid=pf.hybrid_score("a", "b",
        algorithms=["jaro_winkler", "double_metaphone", "trigram_jaccard"],
        weights=[0.5, 0.3, 0.2])
)

# 3. Pre-built scorer
df.with_columns(s=pf.name_default("a", "b"))   # JW + Double Metaphone + trigram

# 4. Per-metric breakdown (explainability)
df.with_columns(
    bd=pf.combine(
        [pf.jaro_winkler("a","b"), pf.double_metaphone_sim("a","b")],
        weights=[0.6, 0.4], return_breakdown=True,
    )
)

# 5. fuzzy_join with blocking (avoids O(n*m) cross product)
pf.fuzzy_join(customers, db, left_on="name", right_on="name",
    algorithms=["jaro_winkler", "double_metaphone"], weights=[0.6, 0.4],
    threshold=0.75, top_k=1, block="first_chars", block_n=1)

# 6. deduplicate near-duplicate name variants
pf.deduplicate(messy_names, on="name",
    algorithms=["jaro_winkler"], weights=[1.0],
    composite_threshold=0.8, block="first_chars", block_n=1)

# 7. pairwise_compare for threshold tuning (returns combined score per pair)
pf.pairwise_compare(customers, db, left_on="name", right_on="name",
    algorithms=["jaro_winkler", "trigram_jaccard"], weights=[0.6, 0.4])
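
The `threshold` and `top_k` arguments of fuzzy_join can be pictured as a filter followed by a per-row ranking. The sketch below works on already-scored pairs in plain Python; the function name `select_matches`, the inclusive `>=` comparison, and the scores themselves are illustrative assumptions, not output from the plugin.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

ScoredPair = Tuple[str, str, float]  # (left value, right value, combined score)


def select_matches(pairs: List[ScoredPair], threshold: float, top_k: int) -> List[ScoredPair]:
    """Keep, for each left value, the top_k right candidates scoring >= threshold."""
    by_left: Dict[str, List[ScoredPair]] = defaultdict(list)
    for pair in pairs:
        if pair[2] >= threshold:  # assumption: the threshold is inclusive
            by_left[pair[0]].append(pair)
    kept: List[ScoredPair] = []
    for candidates in by_left.values():
        candidates.sort(key=lambda p: p[2], reverse=True)  # best score first
        kept.extend(candidates[:top_k])
    return kept


# Made-up scores, for illustration only.
scored = [
    ("Robert Smith", "Robert Smyth", 0.93),
    ("Robert Smith", "William Brown", 0.41),
    ("Jon Smyth", "Robert Smyth", 0.78),
    ("Jon Smyth", "William Brown", 0.35),
    ("Catherine Jones", "Katherine Jones", 0.90),
]
matches = select_matches(scored, threshold=0.75, top_k=1)
```

Raising `top_k` keeps more candidates per left row for manual review, while raising `threshold` trades recall for precision.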

API reference

Per-metric expressions (`pf.<name>(left, right) → pl.Expr`)

jaro, jaro_winkler, levenshtein, levenshtein_norm, damerau_levenshtein, damerau_levenshtein_norm, osa, hamming, hamming_norm, token_jaccard, token_sorensen_dice, trigram_jaccard, trigram_sorensen_dice, qgram_jaccard(left, right, q=3), lcs_sim, soundex_sim, soundex_jw_sim, metaphone_sim, metaphone_jw_sim, double_metaphone_sim, double_metaphone_jw_sim, nysiis_sim, nysiis_jw_sim.
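
As an illustration of what an n-gram metric such as qgram_jaccard(left, right, q=3) measures, here is a small pure-Python sketch. It is a reading of the textbook definition, not the plugin's implementation: the absence of padding, the handling of strings shorter than q, and the `_sketch` function name are assumptions.

```python
from typing import Set


def qgrams(text: str, q: int = 3) -> Set[str]:
    """Return the set of overlapping character q-grams (no padding)."""
    if len(text) < q:
        # assumption: a short non-empty string contributes itself as one gram
        return {text} if text else set()
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def qgram_jaccard_sketch(left: str, right: str, q: int = 3) -> float:
    """Jaccard similarity: shared q-grams divided by all distinct q-grams."""
    a, b = qgrams(left, q), qgrams(right, q)
    if not a and not b:
        return 1.0  # assumption: two empty strings count as identical
    return len(a & b) / len(a | b)
```

Note how sensitive trigrams are to a single substitution in a short string: "smith" and "smyth" share no trigram at all, which is one reason to pair an n-gram metric with an edit-distance or phonetic one.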

Combiners / hybrid

pf.combine(metrics, *, weights=None, method="weighted_avg", threshold=None, return_breakdown=False) — fuse pre-built metric expressions.
pf.hybrid_score(left, right, *, algorithms, weights=None, method="weighted_avg", threshold=None) — same, but builds metrics in Rust (no intermediate struct column).
Pre-built scorers: pf.phonetic_edit, pf.token_char, pf.prefix_ngram, pf.name_default.

DataFrame helpers

pf.fuzzy_join(left, right, *, left_on, right_on, algorithms, weights, method, threshold, top_k, block, block_n, how, add_breakdown)
pf.deduplicate(frame, *, on, algorithms, weights, method, composite_threshold, block, block_n)
pf.pairwise_compare(left, right, *, left_on, right_on, algorithms, weights, method, block, block_n)
Blocking: pf.block_first_chars(col, n=2), pf.block_char_bag(col)
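
The idea behind first-chars blocking can be sketched in plain Python: bucket one side by a cheap key, then compare each row on the other side only against its own bucket. The helper names and the lower-casing of the key are assumptions made for this sketch; it does not call or reproduce pf.block_first_chars.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


def first_chars_key(value: str, n: int = 2) -> str:
    """Blocking key: the first n characters (lower-casing is an assumption)."""
    return value[:n].lower()


def blocked_pairs(left: List[str], right: List[str], n: int = 2) -> Iterator[Tuple[str, str]]:
    """Yield only the (left, right) pairs whose blocking keys agree."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for value in right:
        buckets[first_chars_key(value, n)].append(value)
    for value in left:
        for candidate in buckets.get(first_chars_key(value, n), []):
            yield value, candidate


customers = ["Robert Smith", "Catherine Jones", "Jon Smyth"]
db = ["Robert Smyth", "Katherine Jones", "William Brown"]

# 1 candidate pair instead of the 9 a cross join would score.
candidates = list(blocked_pairs(customers, db, n=1))
```

The same example shows the cost of blocking: "Catherine Jones" and "Katherine Jones" land in different buckets and are never compared, so a first-character key loses matches whose first letter differs.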

Combine methods

weighted_avg (default; weights normalized to sum to 1), mean, max, min, median, vote (count of metrics ≥ threshold, normalized by N).
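
The six methods reduce a row's per-metric scores to one number. The following pure-Python sketch mirrors the descriptions above for a single row; the function name `combine_scores` and its error handling are assumptions, and the actual combiner lives in Rust (combiner.rs).

```python
from statistics import median
from typing import Optional, Sequence


def combine_scores(
    scores: Sequence[float],
    method: str = "weighted_avg",
    weights: Optional[Sequence[float]] = None,
    threshold: Optional[float] = None,
) -> float:
    """Fuse the per-metric scores of one row into a single value."""
    if method == "weighted_avg":
        w = list(weights) if weights is not None else [1.0] * len(scores)
        # Dividing by the total is the same as normalizing weights to sum to 1.
        return sum(s * wi for s, wi in zip(scores, w)) / sum(w)
    if method == "mean":
        return sum(scores) / len(scores)
    if method == "max":
        return max(scores)
    if method == "min":
        return min(scores)
    if method == "median":
        return median(scores)
    if method == "vote":
        if threshold is None:
            raise ValueError("vote needs a threshold")
        # Fraction of metrics that reach the threshold.
        return sum(s >= threshold for s in scores) / len(scores)
    raise ValueError(f"unknown method: {method}")
```

For scores [0.9, 0.5, 0.7], weights [5, 3, 2] behave exactly like [0.5, 0.3, 0.2] and give 0.74, while vote with threshold 0.6 gives 2/3 because two of the three metrics clear the bar.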

Parallelism & thread control

hybrid_score parallelizes its row scan across a dedicated worker pool that is independent of the Polars engine pool (POLARS_MAX_THREADS), so you can tune the two separately. Throughput grows with thread count, though sublinearly (≈7–8× on 16 threads vs 1).

import polars_stringsim as pf

pf.get_num_threads()         # default = number of logical cores
pf.set_num_threads(8)        # use 8 threads for hybrid_score
pf.set_num_threads(0)        # restore default

Or set the default at process start: POLARS_STRINGSIM_THREADS=8 python ...
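
One plausible reading of how these settings interact is sketched below in plain Python. The precedence shown (an explicit positive value wins, then the environment variable, then the logical core count) is an assumption, as is the helper name `resolve_num_threads`; the README does not spell out the exact rules, so check the behavior with pf.get_num_threads() before relying on it.

```python
import os
from typing import Mapping


def resolve_num_threads(requested: int, env: Mapping[str, str]) -> int:
    """Return a worker count, where requested == 0 means 'use the default'."""
    if requested > 0:
        return requested  # explicit setting wins
    from_env = env.get("POLARS_STRINGSIM_THREADS", "")
    if from_env.isdigit() and int(from_env) > 0:
        return int(from_env)  # process-wide default from the environment
    return os.cpu_count() or 1  # fall back to the logical core count
```

Passing the environment as a mapping keeps the helper easy to test; in real use it would be called with `os.environ`.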

Algorithm name registry

hybrid_score, fuzzy_join, deduplicate, and pairwise_compare accept algorithm names:

jaro, jaro_winkler, levenshtein, levenshtein_norm, damerau_levenshtein, damerau_levenshtein_norm, osa, hamming, hamming_norm, token_jaccard, token_sorensen_dice, trigram_jaccard, trigram_sorensen_dice, lcs_sim, soundex/soundex_jw, metaphone/metaphone_jw, double_metaphone/double_metaphone_jw, nysiis/nysiis_jw.

For what each metric computes and how it's normalized, see ALGORITHMS.md.

Platform support

Prebuilt wheels on PyPI (no Rust toolchain needed):

Platform	Wheel	Status
Linux x86_64	`manylinux_2_28_x86_64`	✅ v0.1.0
Windows x86_64	`win_amd64`	✅ v0.1.0
macOS x86_64 / arm64	—	⏳ planned (sdist fallback works; needs Rust)
Linux aarch64	—	⏳ planned (sdist fallback works; needs Rust)

CPython 3.9–3.13 supported wherever a wheel exists. pip install polars-stringsim picks the right wheel automatically; on uncovered platforms it builds from the sdist (requires Rust stable).

Tests

cargo test --lib          # 30 Rust unit tests
pytest tests/python       # 38 Python end-to-end tests

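Run the Python tests and the example end-to-end with uv, against a locally built wheel:

maturin build --release
# Resolve the glob now; a quoted "$WHL" would otherwise pass the literal pattern to uv.
WHL=$(ls target/wheels/polars_stringsim-*.whl | head -n 1)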
uv run --with "$WHL" --with polars --with pytest python -m pytest tests/python/
uv run --with "$WHL" --with polars python examples/record_linkage.py

Architecture

src/
├── algorithms/   # pure Rust: edit, jaro, token, lcs, phonetic
├── combiner.rs   # CombineMethod enum + combine_row (weighted_avg/mean/max/min/median/vote)
├── expr/mod.rs   # #[polars_expr] wrappers, one per metric
├── expr_combine.rs   # combine_expr + combine_breakdown_expr (Struct output)
├── expr_hybrid.rs    # hybrid_score_expr (multi-algo in one Rust call)
├── series_util.rs    # str-column readers, Float64 builder, null handling
└── lib.rs        # #[pymodule]

python/polars_stringsim/
├── _expression.py  # per-metric expr builders + combine()
├── _registry.py    # algorithm name → builder map
├── hybrid.py       # hybrid_score + pre-built scorers
├── frame.py        # fuzzy_join, deduplicate, pairwise_compare, blocking
└── __init__.py     # public API

Features & status

Everything in the original PRD is implemented and shipped:

✅ 24 metric expressions — edit distances, Jaro family, token/n-gram, LCS, and phonetic encoders. Full reference: ALGORITHMS.md.
✅ Composable ensembles — combine() with weighted_avg / mean / max / min / median / vote.
✅ Hybrid scoring — hybrid_score() (builds metrics in Rust, no intermediate struct column) + 4 pre-built scorers.
✅ DataFrame helpers — fuzzy_join (blocked/cross, threshold, top_k, how), deduplicate (union-find clustering), pairwise_compare, and blocking indexes.
✅ Explainability — return_breakdown=True returns a per-metric score breakdown.
✅ Native Polars plugin — all scoring in Rust via PyO3; works in eager and lazy frames.
✅ Prebuilt wheels on PyPI — pip install polars-stringsim, no Rust toolchain required (see Platform support).
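
The union-find clustering mentioned for deduplicate can be illustrated with a short pure-Python sketch. It assumes the matching step has already produced pairs of row indices whose score cleared the threshold; the function name `cluster_ids` and the choice of the smallest row index as cluster id are assumptions for illustration, not the plugin's output format.

```python
from typing import Iterable, List, Tuple


def cluster_ids(n_rows: int, matched_pairs: Iterable[Tuple[int, int]]) -> List[int]:
    """Assign a cluster id to each row, merging rows linked by any matched pair."""
    parent = list(range(n_rows))  # every row starts as its own cluster

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # The smaller index becomes the root, so ids are stable and readable.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    return [find(i) for i in range(n_rows)]


# Rows 0-1 and 1-2 matched above the threshold; row 3 matched nothing.
ids = cluster_ids(4, [(0, 1), (1, 2)])
```

Here rows 0 and 2 end up in the same cluster even though they were never matched directly. That transitivity is the point of union-find, and also the reason a loose threshold can chain unrelated names into one large cluster.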

Not yet implemented

⏳ Platform wheels: macOS (x86_64 + arm64) and aarch64-linux — currently sdist-only on those platforms.
🔲 Custom combiner registration (user-supplied Rust closures).
🔲 GPU acceleration.
🔲 More phonetic encoders (Caverphone, Beider-Morse).
