Text-similarity toolkit for TypeScript providing high-performance string metrics via WASM bindings to rapidfuzz-rs, plus some useful string similarity functions

3leaps/string-metrics-wasm

string-metrics-wasm

High-performance string similarity and fuzzy matching via WASM bindings to rapidfuzz-rs.

Description

This library provides blazing-fast string similarity metrics through WASM bindings to the Rust rapidfuzz-rs library, plus TypeScript implementations of advanced fuzzy matching algorithms. It combines the performance of compiled Rust/WASM with the flexibility of TypeScript for a comprehensive text similarity toolkit.

Features:

  • WASM-powered distance metrics: Levenshtein, Damerau-Levenshtein, OSA, Jaro, Jaro-Winkler, Indel, LCS
  • Fuzzy matching: Token-based comparison (order-insensitive, set-based)
  • Process helpers: Find best matches from arrays with configurable scoring
  • Unified API: Consistent interface across all metrics
  • TypeScript extensions: Substring similarity, normalization presets, suggestions API
  • Multi-runtime: Node.js, Bun, Deno support

Prerequisites

  1. Rust toolchain via rustup
  2. wasm-pack (pinned to the version we build against)

Install wasm-pack once per machine:

cargo install wasm-pack --version 0.13.1

Installation

npm install string-metrics-wasm

Quick Start

import { levenshtein, ratio, tokenSortRatio, extractOne, score } from 'string-metrics-wasm';

// Basic edit distance
const dist = levenshtein('kitten', 'sitting');
console.log(dist); // 3

// Fuzzy matching (0-100 scale)
const fuzzy = ratio('hello', 'hallo');
console.log(fuzzy); // 80.0

// Order-insensitive comparison
const tokens = tokenSortRatio('new york mets', 'mets york new');
console.log(tokens); // 100.0

// Find best match from array
const choices = ['Atlanta Falcons', 'New York Jets', 'Dallas Cowboys'];
const best = extractOne('new york', choices);
console.log(best); // { choice: 'New York Jets', score: 57.14, index: 1 }

// Unified scoring API (0-1 scale)
const similarity = score('hello', 'world', 'jaroWinkler');
console.log(similarity); // 0.4666...

API Documentation

Compatibility: All examples use camelCase option names and metric identifiers. For ecosystems that standardize on snake_case (e.g., Fulmen/Crucible fixtures), the same snake_case names are accepted as aliases and normalized internally.
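As a rough illustration of how such alias handling could work, the sketch below maps a snake_case identifier to its camelCase form. The helper name `toCamelCase` is hypothetical and is not part of the documented API; the library's internal normalization may differ.

```typescript
// Hypothetical helper (not part of the library): converts a snake_case
// identifier to camelCase. Names that are already camelCase pass through.
function toCamelCase(name: string): string {
  return name.replace(/_+([a-z0-9])/g, (_match: string, ch: string) => ch.toUpperCase());
}

console.log(toCamelCase('damerau_levenshtein')); // 'damerauLevenshtein'
console.log(toCamelCase('score_cutoff'));        // 'scoreCutoff'
console.log(toCamelCase('jaroWinkler'));         // 'jaroWinkler'
```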

Distance Metrics (WASM)

Edit distance metrics return raw integer distances (lower = more similar):

levenshtein(a: string, b: string): number

Minimum edits (insertions, deletions, substitutions) to transform a into b.

levenshtein('kitten', 'sitting'); // 3
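For readers who want to see what the metric computes, here is a minimal pure-TypeScript reference sketch of the classic dynamic-programming algorithm. It is not the WASM implementation; the name `levenshteinSketch` and the choice to compare Unicode code points are assumptions made for illustration.

```typescript
// Reference sketch of Levenshtein distance using two rolling rows.
// Assumption: characters are compared as Unicode code points.
function levenshteinSketch(a: string, b: string): number {
  const s = Array.from(a);
  const t = Array.from(b);
  // prev[j] = distance between the processed prefix of s and the first j chars of t
  let prev: number[] = Array.from({ length: t.length + 1 }, (_, j) => j);
  for (let i = 1; i <= s.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= t.length; j++) {
      const cost = s[i - 1] === t[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost, // substitution (or match when cost is 0)
      );
    }
    prev = curr;
  }
  return prev[t.length];
}

console.log(levenshteinSketch('kitten', 'sitting')); // 3
```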

damerau_levenshtein(a: string, b: string): number

Levenshtein + transpositions (unrestricted).

damerau_levenshtein('abcd', 'abdc'); // 1

osa_distance(a: string, b: string): number

Optimal String Alignment (restricted Damerau-Levenshtein).

osa_distance('abcd', 'abdc'); // 1
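The two transposition-aware metrics agree on simple swaps but can differ once a transposed pair is edited again. The sketch below is a plain TypeScript version of OSA (the name `osaSketch` is illustrative, not a library export). For 'ca' and 'abc' it returns 3, whereas the unrestricted Damerau-Levenshtein distance is 2 (transpose to 'ac', then insert 'b').

```typescript
// Sketch of Optimal String Alignment (restricted Damerau-Levenshtein).
function osaSketch(a: string, b: string): number {
  const s = Array.from(a);
  const t = Array.from(b);
  // d[i][j] = distance between the first i chars of s and the first j chars of t
  const d: number[][] = Array.from({ length: s.length + 1 }, (_, i) =>
    Array.from({ length: t.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)),
  );
  for (let i = 1; i <= s.length; i++) {
    for (let j = 1; j <= t.length; j++) {
      const cost = s[i - 1] === t[j - 1] ? 0 : 1;
      d[i][j] = Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost);
      // Adjacent transposition; the swapped pair cannot be edited again afterwards
      if (i > 1 && j > 1 && s[i - 1] === t[j - 2] && s[i - 2] === t[j - 1]) {
        d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
      }
    }
  }
  return d[s.length][t.length];
}

console.log(osaSketch('abcd', 'abdc')); // 1
console.log(osaSketch('ca', 'abc'));    // 3
```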

indel_distance(a: string, b: string): number

Insertions and deletions only (no substitutions).

indel_distance('hello', 'hallo'); // 2

lcs_seq_distance(a: string, b: string): number

Longest Common Subsequence distance.

lcs_seq_distance('AGGTAB', 'GXTXAYB'); // 3
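Both of these distances can be derived from the length of the longest common subsequence. The sketch below assumes the definitions that are consistent with the example values above: Indel distance is `len(a) + len(b) - 2 * LCS`, and LCS distance is `max(len(a), len(b)) - LCS`. The function names are illustrative only.

```typescript
// Length of the longest common subsequence, computed with two rolling rows.
function lcsLength(a: string, b: string): number {
  const s = Array.from(a);
  const t = Array.from(b);
  let prev: number[] = new Array<number>(t.length + 1).fill(0);
  for (let i = 1; i <= s.length; i++) {
    const curr: number[] = [0];
    for (let j = 1; j <= t.length; j++) {
      curr[j] = s[i - 1] === t[j - 1] ? prev[j - 1] + 1 : Math.max(prev[j], curr[j - 1]);
    }
    prev = curr;
  }
  return prev[t.length];
}

// Insertions and deletions needed: every character outside the LCS is one edit.
function indelSketch(a: string, b: string): number {
  return Array.from(a).length + Array.from(b).length - 2 * lcsLength(a, b);
}

// Assumed definition: longer length minus LCS length.
function lcsSeqDistanceSketch(a: string, b: string): number {
  return Math.max(Array.from(a).length, Array.from(b).length) - lcsLength(a, b);
}

console.log(indelSketch('hello', 'hallo'));             // 2
console.log(lcsSeqDistanceSketch('AGGTAB', 'GXTXAYB')); // 3
```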

Similarity Metrics (WASM)

Normalized similarity scores (0.0-1.0 scale, higher = more similar):

normalized_levenshtein(a: string, b: string): number

Normalized Levenshtein similarity.

normalized_levenshtein('kitten', 'sitting'); // 0.5714

jaro(a: string, b: string): number

Jaro similarity.

jaro('kitten', 'sitting'); // 0.7460

jaro_winkler(a: string, b: string): number

Jaro-Winkler similarity (boosts prefix matches).

jaro_winkler('kitten', 'sitting'); // 0.7460
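The example above returns the same value for both metrics because 'kitten' and 'sitting' share no common prefix, so no boost applies. The reference sketch below follows the textbook definitions (prefix capped at 4 characters, scaling factor 0.1). The function names are illustrative, and the WASM implementation may differ in details such as applying the boost only above a similarity threshold.

```typescript
// Textbook Jaro similarity (sketch, compares Unicode code points).
function jaroSketch(a: string, b: string): number {
  const s = Array.from(a);
  const t = Array.from(b);
  if (s.length === 0 && t.length === 0) return 1;
  if (s.length === 0 || t.length === 0) return 0;
  // Characters match only if they are equal and at most `range` positions apart
  const range = Math.max(0, Math.floor(Math.max(s.length, t.length) / 2) - 1);
  const sMatched: boolean[] = new Array<boolean>(s.length).fill(false);
  const tMatched: boolean[] = new Array<boolean>(t.length).fill(false);
  let matches = 0;
  for (let i = 0; i < s.length; i++) {
    const hi = Math.min(t.length - 1, i + range);
    for (let j = Math.max(0, i - range); j <= hi; j++) {
      if (!tMatched[j] && s[i] === t[j]) {
        sMatched[i] = true;
        tMatched[j] = true;
        matches++;
        break;
      }
    }
  }
  if (matches === 0) return 0;
  // Walk both match sequences in order and count positions that disagree
  let outOfOrder = 0;
  let k = 0;
  for (let i = 0; i < s.length; i++) {
    if (!sMatched[i]) continue;
    while (!tMatched[k]) k++;
    if (s[i] !== t[k]) outOfOrder++;
    k++;
  }
  const transpositions = outOfOrder / 2;
  return (matches / s.length + matches / t.length + (matches - transpositions) / matches) / 3;
}

// Jaro-Winkler: adds a bonus for a shared prefix of up to 4 characters.
function jaroWinklerSketch(a: string, b: string, prefixWeight = 0.1): number {
  const base = jaroSketch(a, b);
  const s = Array.from(a);
  const t = Array.from(b);
  let prefix = 0;
  while (prefix < 4 && prefix < s.length && prefix < t.length && s[prefix] === t[prefix]) prefix++;
  return base + prefix * prefixWeight * (1 - base);
}

console.log(jaroSketch('kitten', 'sitting').toFixed(4));       // 0.7460
console.log(jaroSketch('martha', 'marhta').toFixed(4));        // 0.9444
console.log(jaroWinklerSketch('martha', 'marhta').toFixed(4)); // 0.9611
```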

indel_normalized_similarity(a: string, b: string): number

Normalized indel similarity.

indel_normalized_similarity('hello', 'hallo'); // 0.8

lcs_seq_normalized_similarity(a: string, b: string): number

Normalized LCS similarity.

lcs_seq_normalized_similarity('AGGTAB', 'GXTXAYB'); // 0.5714

Fuzzy Matching (WASM + TypeScript)

Fuzzy string comparison metrics (0-100 scale):

ratio(a: string, b: string): number (WASM)

Basic fuzzy similarity using Indel distance.

ratio('kitten', 'sitting'); // 61.54
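The score can be reproduced by hand from the Indel distance: `ratio = 100 * (1 - indel / (len(a) + len(b)))`, which simplifies to `200 * LCS / (len(a) + len(b))`. The sketch below is a plain TypeScript illustration of that formula; `ratioSketch` is not a library export, and treating two empty strings as identical is an assumption.

```typescript
// Length of the longest common subsequence (rolling-row dynamic programming).
function lcsLength(a: string, b: string): number {
  const s = Array.from(a);
  const t = Array.from(b);
  let prev: number[] = new Array<number>(t.length + 1).fill(0);
  for (let i = 1; i <= s.length; i++) {
    const curr: number[] = [0];
    for (let j = 1; j <= t.length; j++) {
      curr[j] = s[i - 1] === t[j - 1] ? prev[j - 1] + 1 : Math.max(prev[j], curr[j - 1]);
    }
    prev = curr;
  }
  return prev[t.length];
}

// Normalized Indel similarity scaled to 0-100.
function ratioSketch(a: string, b: string): number {
  const total = Array.from(a).length + Array.from(b).length;
  if (total === 0) return 100; // assumption: two empty strings count as identical
  return (200 * lcsLength(a, b)) / total;
}

console.log(ratioSketch('kitten', 'sitting').toFixed(2)); // 61.54
console.log(ratioSketch('hello', 'hallo').toFixed(2));    // 80.00
```

Note that the comparison is case-sensitive: 'new york' against 'New York Jets' shares only six characters in order, giving 1200 / 21, or about 57.14.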

partialRatio(a: string, b: string): number (TypeScript)

Best matching substring using sliding window.

partialRatio('fuzzy', 'fuzzy wuzzy was a bear'); // 100.0
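A simple way to picture the sliding-window idea: slide a window the length of the shorter string across the longer one, score each window with the basic ratio, and keep the best score. The sketch below is a simplified illustration under that assumption; the library's actual implementation may consider additional alignments, and all names here are hypothetical.

```typescript
// LCS length over arrays of code points.
function lcsLen(s: string[], t: string[]): number {
  let prev: number[] = new Array<number>(t.length + 1).fill(0);
  for (let i = 1; i <= s.length; i++) {
    const curr: number[] = [0];
    for (let j = 1; j <= t.length; j++) {
      curr[j] = s[i - 1] === t[j - 1] ? prev[j - 1] + 1 : Math.max(prev[j], curr[j - 1]);
    }
    prev = curr;
  }
  return prev[t.length];
}

// Indel-based ratio on a 0-100 scale.
function ratioOf(s: string[], t: string[]): number {
  const total = s.length + t.length;
  return total === 0 ? 100 : (200 * lcsLen(s, t)) / total;
}

// Best ratio between the shorter string and any equally long slice of the longer one.
function partialRatioSketch(a: string, b: string): number {
  const x = Array.from(a);
  const y = Array.from(b);
  const [shorter, longer] = x.length <= y.length ? [x, y] : [y, x];
  if (shorter.length === 0) return longer.length === 0 ? 100 : 0;
  let best = 0;
  for (let start = 0; start + shorter.length <= longer.length; start++) {
    const candidate = longer.slice(start, start + shorter.length);
    best = Math.max(best, ratioOf(shorter, candidate));
    if (best === 100) break; // cannot improve on an exact substring match
  }
  return best;
}

console.log(partialRatioSketch('fuzzy', 'fuzzy wuzzy was a bear')); // 100
```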

tokenSortRatio(a: string, b: string): number (TypeScript)

Order-insensitive token comparison (sorts tokens first).

tokenSortRatio('new york mets', 'mets york new'); // 100.0

tokenSetRatio(a: string, b: string): number (TypeScript)

Set-based token comparison (handles duplicates and order).

tokenSetRatio('hello world world', 'world hello'); // 100.0
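The two token-based metrics above can be sketched as preprocessing steps in front of the basic ratio. The version below follows the widely used fuzzy-matching recipe (sort the tokens; or split them into the shared set and the two remainders, then take the best pairwise ratio). It is an illustration under those assumptions, with hypothetical function names, and the library's tokenization rules may differ.

```typescript
// LCS-based ratio on a 0-100 scale (same formula as the basic Indel ratio).
function ratioSketch(a: string, b: string): number {
  const s = Array.from(a);
  const t = Array.from(b);
  if (s.length + t.length === 0) return 100;
  let prev: number[] = new Array<number>(t.length + 1).fill(0);
  for (let i = 1; i <= s.length; i++) {
    const curr: number[] = [0];
    for (let j = 1; j <= t.length; j++) {
      curr[j] = s[i - 1] === t[j - 1] ? prev[j - 1] + 1 : Math.max(prev[j], curr[j - 1]);
    }
    prev = curr;
  }
  return (200 * prev[t.length]) / (s.length + t.length);
}

// Assumption: tokens are separated by runs of whitespace.
function tokenize(s: string): string[] {
  return s.trim().split(/\s+/).filter((tok) => tok.length > 0);
}

// Sort tokens, rejoin, then compare: word order no longer matters.
function tokenSortRatioSketch(a: string, b: string): number {
  return ratioSketch(tokenize(a).sort().join(' '), tokenize(b).sort().join(' '));
}

// Compare the shared tokens against each side's full token set.
function tokenSetRatioSketch(a: string, b: string): number {
  const setA = new Set(tokenize(a));
  const setB = new Set(tokenize(b));
  const common = Array.from(setA).filter((tok) => setB.has(tok)).sort();
  const onlyA = Array.from(setA).filter((tok) => !setB.has(tok)).sort();
  const onlyB = Array.from(setB).filter((tok) => !setA.has(tok)).sort();
  const base = common.join(' ');
  const withA = [...common, ...onlyA].join(' ');
  const withB = [...common, ...onlyB].join(' ');
  return Math.max(ratioSketch(base, withA), ratioSketch(base, withB), ratioSketch(withA, withB));
}

console.log(tokenSortRatioSketch('new york mets', 'mets york new'));    // 100
console.log(tokenSetRatioSketch('hello world world', 'world hello'));   // 100
```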

Process Helpers (TypeScript)

Find best matches from arrays:

extractOne(query: string, choices: string[], options?): ExtractResult | null

Find the single best match.

Options:

  • scorer?: (a: string, b: string) => number - Scoring function (default: ratio)
  • processor?: (str: string) => string - Preprocessing function
  • scoreCutoff?: number - Minimum score threshold (default: 0)

const choices = ['Atlanta Falcons', 'New York Jets', 'Dallas Cowboys'];
const best = extractOne('jets', choices, { scoreCutoff: 30 });
// { choice: 'New York Jets', score: 35.29, index: 1 }

extract(query: string, choices: string[], options?): ExtractResult[]

Find top N matches (sorted by score).

Options:

  • scorer?: (a: string, b: string) => number - Scoring function
  • processor?: (str: string) => string - Preprocessing function
  • scoreCutoff?: number - Minimum score threshold
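  • limit?: number - Maximum results to return

// This example uses a candidate list that includes 'New York Giants' at index 2
const teams = ['Atlanta Falcons', 'New York Jets', 'New York Giants', 'Dallas Cowboys'];
const results = extract('new york', teams, { limit: 2, scoreCutoff: 40 });
// [
//   { choice: 'New York Jets', score: 57.14, index: 1 },
//   { choice: 'New York Giants', score: 52.17, index: 2 }
// ]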
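How the options interact can be sketched with a small stand-alone function: preprocess, score every choice, drop anything below the cutoff, sort by score, and truncate. Everything below is illustrative. `extractSketch`, `MatchSketch`, and the toy `prefixScore` scorer are hypothetical names, and the real helpers default to `ratio` rather than requiring a scorer.

```typescript
interface MatchSketch {
  choice: string; // the original (unprocessed) choice
  score: number;
  index: number;  // position in the input array
}

interface ExtractOptionsSketch {
  scorer: (a: string, b: string) => number;
  processor?: (s: string) => string;
  scoreCutoff?: number;
  limit?: number;
}

function extractSketch(query: string, choices: string[], options: ExtractOptionsSketch): MatchSketch[] {
  const { scorer, processor = (s: string) => s, scoreCutoff = 0, limit = choices.length } = options;
  const processedQuery = processor(query);
  return choices
    .map((choice, index) => ({ choice, score: scorer(processedQuery, processor(choice)), index }))
    .filter((match) => match.score >= scoreCutoff)
    .sort((x, y) => y.score - x.score || x.index - y.index) // best first, ties by input order
    .slice(0, limit);
}

// Toy scorer for the demo only: shared-prefix length as a percentage of the longer string.
function prefixScore(a: string, b: string): number {
  let n = 0;
  while (n < a.length && n < b.length && a[n] === b[n]) n++;
  return (100 * n) / Math.max(a.length, b.length, 1);
}

const sampleTeams = ['Atlanta Falcons', 'New York Jets', 'New York Giants'];
const top = extractSketch('new york', sampleTeams, {
  scorer: prefixScore,
  processor: (s) => s.toLowerCase(),
  scoreCutoff: 40,
  limit: 2,
});
console.log(top.map((m) => m.choice)); // [ 'New York Jets', 'New York Giants' ]
```

Under this reading, finding the single best match is the same pipeline with a limit of 1, returning null when nothing passes the cutoff.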

Unified API (TypeScript)

Metric-selectable interface with consistent scales:

distance(a: string, b: string, metric?: DistanceMetric): number

Calculate edit distance using any metric (returns raw distance).

Supported metrics: 'levenshtein' (default), 'damerauLevenshtein', 'osa', 'indel', 'lcsSeq'

distance('hello', 'world'); // 4 (default: levenshtein)
distance('hello', 'world', 'indel'); // 8

score(a: string, b: string, metric?: SimilarityMetric): number

Calculate similarity using any metric (returns 0-1 normalized score).

Supported metrics: 'jaroWinkler' (default), 'levenshtein', 'damerauLevenshtein', 'osa', 'jaro', 'indel', 'lcsSeq', 'ratio', 'partialRatio', 'tokenSortRatio', 'tokenSetRatio'

score('hello', 'world'); // 0.4666... (default: jaroWinkler)
score('new york mets', 'mets york new', 'tokenSortRatio'); // 1.0

// Fulmen/Crucible users: override default metric if needed
score('hello', 'world', 'levenshtein'); // 0.2 (edit distance-based: 1 - 4/5)

Normalization & Suggestions

normalize(input: string, preset?: NormalizationPreset, locale?: NormalizationLocale): string

Normalize text for comparison with optional locale-specific case folding.

Presets: 'none', 'minimal', 'default', 'aggressive'

Locales: 'tr' (Turkish), 'az' (Azerbaijani), 'lt' (Lithuanian), or undefined (default Unicode casefold)

normalize('Naïve Café', 'default'); // 'naïve café'

// Turkish/Azerbaijani: dotted/dotless I handling
normalize('İstanbul', 'default', 'tr'); // 'istanbul' (İ→i)
normalize('IĞDIR', 'default', 'tr'); // 'ığdır' (I→ı dotless)

// Default Unicode casefold (no locale)
normalize('İstanbul', 'default'); // 'i̇stanbul' (İ→i + combining dot)

Note: Most applications don't need locale-specific normalization. Use it only when processing Turkish, Azerbaijani, or Lithuanian text, where locale-specific casing such as the dotted/dotless I distinction matters.
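The casing difference itself can be observed with the standard JavaScript string methods, independent of this library. The sketch below uses the built-in `toLocaleLowerCase` and `toLowerCase`; the helper name `lowerForLocale` is hypothetical, and whether `normalize` relies on these built-ins internally is not stated here.

```typescript
// Hypothetical helper: lowercases with locale-specific rules when a locale is given.
function lowerForLocale(input: string, locale?: string): string {
  return locale ? input.toLocaleLowerCase(locale) : input.toLowerCase();
}

const dottedCapital = '\u0130stanbul'; // 'İstanbul' (U+0130, capital I with dot above)

console.log(lowerForLocale(dottedCapital, 'tr'));  // 'istanbul'
console.log(lowerForLocale('I\u011EDIR', 'tr'));   // 'ığdır' (plain I becomes dotless ı)
console.log(lowerForLocale(dottedCapital).length); // 9: 'i' + U+0307 combining dot + 'stanbul'
```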

suggest(query: string, candidates: string[], options?): Suggestion[]

Get ranked suggestions with detailed scoring.

const suggestions = suggest('pythn', ['python', 'java', 'javascript'], {
  metric: 'jaroWinkler',
  minScore: 0.6,
  maxSuggestions: 3,
});
// [
//   { value: 'python', score: 0.9555, ... },
//   ...
// ]

See Suggestions API docs for full details.

Implementation Details

WASM vs TypeScript

This library uses a hybrid approach for optimal performance and flexibility:

WASM Implementations (fastest):

  • Core distance metrics: levenshtein, damerau_levenshtein, osa_distance, jaro, jaro_winkler
  • RapidFuzz metrics: ratio, indel_*, lcs_seq_*

TypeScript Implementations (flexible):

  • Token-based fuzzy matching: partialRatio, tokenSortRatio, tokenSetRatio
  • Process helpers: extractOne, extract
  • Unified API: distance(), score()
  • Suggestions and normalization

Token-based metrics benefit from TypeScript's array operations and avoid WASM serialization overhead. The unified API provides a convenient abstraction over both WASM and TypeScript implementations.

Supported Runtimes

  • Node.js 16+ (ESM and CommonJS)
  • Bun (native ESM support)
  • Deno (use npm: specifier)

Building from Source

  1. Install dependencies and tooling: make bootstrap
  2. Build WASM: npm run build:wasm or make build
  3. Build TS: npm run build:ts

Development

This project uses a Makefile for common tasks:

make help           # Show all available targets
make build          # Build WASM and TypeScript (with version check)
make test           # Run tests
make clean          # Remove build artifacts

# Code quality
make quality        # Run all quality checks (format-check, lint, rust checks)
make format         # Format all code (Biome + Prettier + rustfmt)
make format-check   # Check formatting without changes
make lint           # Lint TypeScript code with Biome
make lint-fix       # Lint and auto-fix TypeScript code

# Version management
make version-check  # Verify package.json and Cargo.toml versions match
make bump-patch     # Bump patch version (0.1.0 -> 0.1.1)
make bump-minor     # Bump minor version (0.1.0 -> 0.2.0)
make bump-major     # Bump major version (0.1.0 -> 1.0.0)
make set-version VERSION=x.y.z  # Set explicit version

Explore the rest of the documentation under docs/. Start with the high-level overview or jump straight to the contributor guide in docs/development.md.

Code Quality Tools

This project uses modern, fast tooling for code quality:

  • TypeScript/JavaScript: Biome for linting and formatting
  • JSON/YAML/Markdown: Prettier for formatting
  • Rust: rustfmt for formatting, clippy for linting

Run make quality before committing to ensure all checks pass.

Version Management

This project maintains version sync between package.json (npm) and Cargo.toml (Rust). The Makefile provides targets to bump versions and keep them in sync. Additionally, the test suite includes a version consistency check that will fail if versions drift.

Important: Always use make bump-* or make set-version commands to update versions. This ensures both files stay synchronized.
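A consistency check of this kind can be quite small. The sketch below is an assumption about what such a check might look like, not the project's actual test: it reads the `version` field from package.json text and the `version` key in the `[package]` table of Cargo.toml text using a naive line scan rather than a full TOML parser. All function names are hypothetical.

```typescript
// Reads the "version" field from package.json contents.
function npmVersion(packageJsonText: string): string | null {
  const parsed: unknown = JSON.parse(packageJsonText);
  if (typeof parsed === 'object' && parsed !== null && 'version' in parsed) {
    const value = (parsed as { version: unknown }).version;
    return typeof value === 'string' ? value : null;
  }
  return null;
}

// Reads version = "x.y.z" from the [package] table of Cargo.toml contents.
// Naive line scan: ignores version keys in other tables such as [dependencies].
function cargoVersion(cargoTomlText: string): string | null {
  let inPackage = false;
  for (const rawLine of cargoTomlText.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line.startsWith('[')) {
      inPackage = line === '[package]';
      continue;
    }
    if (!inPackage) continue;
    const found = /^version\s*=\s*"([^"]+)"/.exec(line);
    if (found) return found[1];
  }
  return null;
}

function versionsMatch(packageJsonText: string, cargoTomlText: string): boolean {
  const npm = npmVersion(packageJsonText);
  return npm !== null && npm === cargoVersion(cargoTomlText);
}

const pkg = '{ "name": "demo", "version": "0.1.1" }';
const cargo = '[package]\nname = "demo"\nversion = "0.1.1"\n\n[dependencies]\nfoo = { version = "2.0.0" }\n';
console.log(versionsMatch(pkg, cargo)); // true
```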

Performance

In the project's benchmarks, every string comparison operation completes in well under 1 ms:

  • WASM metrics: 0.0003-0.0005 ms per operation
  • Token-based metrics: 0.0003-0.0017 ms per operation
  • Process helpers: 0.0008-0.001 ms per operation
  • Unified API: minimal dispatch overhead

Run node benchmark-phase1b.js for detailed benchmarks.

Testing

This project includes comprehensive test coverage:

  • 119 unit tests covering all functions
  • 80 YAML fixture test cases for reproducibility
  • 100% regression-free across all releases

Run tests with npm test or make test.

Related Projects

  • rapidfuzz-rs - Rust implementation of RapidFuzz
  • rapidfuzz - Original Python implementation
  • strsim-rs - String similarity metrics (deprecated in favor of rapidfuzz-rs)

Versioning

This project follows Semantic Versioning. Version history is maintained in CHANGELOG.md.

Current Status: See latest release for the current version and changes.

License

This project is licensed under the MIT License.

Contributing

Contributions welcome! Please see our contributing guidelines:

Governance



Fast Strings. Accurate Matches.

High-performance text similarity for modern TypeScript applications



Built with ⚡ by the 3 Leaps team

String Metrics • Fuzzy Matching • WASM Performance
