Data from multiple sources rarely align perfectly—names are misspelled, addresses differ, formats vary. RecordLink aims to make linking such data transparent and reproducible by implementing core string similarity and phonetic algorithms from scratch in Python. The goal is twofold:
- Provide an educational, open, and well-documented reference implementation.
- Serve as a foundation for scalable entity-resolution systems in research and industry domains such as healthcare, census, and customer analytics.
This project focuses on clarity, correctness, and extensibility rather than raw speed—each algorithm is implemented in readable Python to illustrate the underlying logic.
Record Linkage (or Entity Resolution) is the process of identifying records across datasets that refer to the same real-world entity (e.g., the same patient or customer), even when text fields differ slightly. It combines string matching and blocking:
- String Matching quantifies similarity between individual fields (like names).
- Blocking groups potentially similar records to avoid O(n²) comparisons.
Levenshtein Distance
Measures the minimal number of insertions, deletions, or substitutions needed to transform one string into another.
Example: kitten → sitting = 3 edits.
Used to capture typo-level differences.
Jaro–Winkler Similarity Considers matching characters and transpositions within a variable window; Winkler extension rewards common prefixes. Effective for short names and transposed letters.
Soundex
Encodes words into a 4-character alphanumeric code based on how they sound in English.
Example: Smith → S530, Smyth → S530.
Used for phonetic blocking.
Metaphone
Improved phonetic algorithm capturing more modern pronunciation patterns (e.g., Knight and Night → same code).
Better linguistic coverage than Soundex.
Together, these algorithms measure edit distance, character similarity, and phonetic resemblance—the core pillars of record linkage.
recordlink/
core/ # Normalization and text preprocessing
metrics/ # Levenshtein and Jaro–Winkler similarity
phonetic/ # Soundex and Metaphone encoders
blocking/ # Blocking keys using phonetic codes
linkage/ # Combined scorer for similarity weighting
tests/ # Unit tests and examples
Idea Minimum number of single-character edits (insert, delete, substitute) to turn string A into B.
Pseudocode (plain English)
-
Make a 2D table
dpof size(len(A)+1) x (len(B)+1). -
Initialize first row with
0..len(B)(cost to insert B’s prefix), first column with0..len(A)(cost to delete A’s prefix). -
For each
i=1..len(A)andj=1..len(B):- If
A[i-1] == B[j-1], setcost = 0, elsecost = 1. - Set
dp[i][j] = min( dp[i-1][j] + 1 (delete), dp[i][j-1] + 1 (insert), dp[i-1][j-1] + cost (substitute) ).
- If
-
Answer is
dp[len(A)][len(B)].
Complexity: O(n·m) time, O(n·m) space (or O(min(n,m)) with row-rolling). Edge cases: empty vs non-empty (distance = length of other); identical strings (0). Failure mode vs Damerau: adjacent swaps count as 2 edits (not 1).
Idea Compare characters within a match window and penalize out-of-order matches (transpositions). Winkler adds a bonus for common prefixes. Returns a similarity in [0,1].
Pseudocode (plain English)
-
If A == B → return 1. If either empty → 0.
-
Match window =
floor(max(len(A), len(B)) / 2) - 1, floored at 0. -
Find matches:
- For each char in A at index
i, search in B fromi - windowtoi + windowfor the same char that isn’t already matched. - Mark both positions matched and increment
matches.
- For each char in A at index
-
If
matches == 0→ return 0. -
Count transpositions:
- Iterate matched chars in order in A and B; count positions where the matched characters differ;
transpositions = count / 2.
- Iterate matched chars in order in A and B; count positions where the matched characters differ;
-
Jaro score = average of:
matches / len(A),matches / len(B), and(matches - transpositions) / matches.
-
Winkler bonus:
l = length of common prefix of A and B up to 4.JW = Jaro + l * p * (1 - Jaro)withp = 0.1typically.
Complexity: O(n + m) with the sliding window; worst-case O(n·window). Edge cases: works well for short names and transpositions; identical prefix increases score. Failure mode: prefix bonus can make long, diverging strings look slightly too similar.
Idea Encode an English name into a 4-character code: first letter + three digits representing consonant groups; collapses similar-sounding names.
Pseudocode (plain English)
-
Uppercase the input; keep only A–Z; if empty → return empty.
-
Keep the first letter as is.
-
For remaining letters, map each to a digit using groups:
- BFPV→1, CGJKQSXZ→2, DT→3, L→4, MN→5, R→6; vowels and H/W/Y are “no digit”.
-
Drop consecutive duplicate digits (e.g., “BB” should not double-count “1”).
-
Remove all “no digit” markers.
-
Construct
first_letter + digits, then:- Truncate to 4 chars, or pad with zeros to reach 4.
-
Return the 4-char code.
Complexity: O(n). Edge cases: non-ASCII letters should be normalized first; empty → empty string. Failure mode: very coarse—many different names share the same code (great for blocking, not for final matching).
Idea Encode a word into a consonant-like skeleton that approximates English pronunciation (handles silent letters, digraphs, and common patterns). Output length varies.
Pseudocode (plain English)
-
Uppercase word; optionally drop leading silent combos (KN, GN, PN, AE, WR).
-
Scan left→right with look-ahead (next 1–2 chars) and apply ordered rules:
- Vowels: keep only if at start.
PH → F;GH→ oftenF(depending on context) or silent;KN/GN/PN→ drop leading consonant.CH → X(like “sh”);TIO/TIA → X;SH → X.TH → 0(special token).CbeforeE/I/Y → S, elseK.DGbeforeE/I/Y → J; elseT.Q → K,X → KS,V → F,Z → S.- Drop repeated letters (except special cases like “CC”).
- Keep sonorants
L, R, M, N, Jas themselves when pronounced.
-
Concatenate emitted symbols to form the code (no fixed length).
-
Return the code.
Complexity: O(n). Edge cases: rules are English-centric; ambiguous cases (“Xavier”, “Javier”) vary by accent. Failure mode: still collapses distinct names; not robust for non-English phonology. Double Metaphone (not implemented here) handles more cases by producing primary/alternate codes.
- Normalize text first (Unicode normalize, case-fold, strip accents/punct, collapse spaces).
- Block candidate pairs using phonetic keys (Soundex/Metaphone) to reduce comparisons.
- Compare within blocks using Jaro–Winkler and/or Levenshtein.
- Score (e.g., weighted average) and threshold to decide matches.
- Evaluate with precision/recall; adjust normalization, weights, and thresholds.
from recordlink import normalize, levenshtein, jaro_winkler, soundex, metaphone, Scorer
a, b = normalize("José López"), normalize("Jose Lopez")
print("Levenshtein:", levenshtein(a, b))
print("Jaro–Winkler:", jaro_winkler(a, b))
print("Soundex:", soundex("Robert"), soundex("Rupert"))
print("Metaphone:", metaphone("Knight"), metaphone("Night"))
scorer = Scorer(weights={"jw": 0.7, "lev": 0.3})
print("Combined score:", scorer.score(a, b))uv run python -m record_link "José López" "Jose Lopez"
| Function | Output | Meaning |
|---|---|---|
levenshtein(a, b) |
Integer (≥0) | Number of edits required |
jaro_winkler(a, b) |
0–1 float | Similarity (1 = identical) |
soundex(name) |
4-char code | Phonetic code (for blocking) |
metaphone(name) |
String | Pronunciation-based key |
Scorer.score(a,b) |
0–1 float | Weighted similarity combining algorithms |
A high Jaro–Winkler score with identical Soundex/Metaphone codes strongly suggests two strings represent the same entity.
The current release focuses on algorithmic clarity rather than full-scale linkage pipelines. Future extensions may include:
- Double Metaphone for multilingual phonetic encoding.
- Damerau–Levenshtein (handling transpositions).
- Probabilistic linkage models (Fellegi–Sunter framework).
- Parallelized and vectorized computation for large datasets.
- Machine-learned similarity weighting for adaptive scoring.
RecordLink is meant to bridge theory and practice—a toolkit for learning, experimenting, and eventually scaling real-world record.