In [None]:
# uncomment and run if mismo is not installed
# %pip install -q git+https://github.com/NickCrews/mismo@main

In [1]:
from __future__ import annotations

import ibis

# Execute expressions eagerly so that tables are displayed when evaluated
ibis.options.interactive = True


Many real-world datasets contain errors caused by manual data entry, incorrect data processing, or inconsistent formatting. String similarity measures help here: they quantify how close two strings are to each other while accounting for common kinds of string edits. Each similarity is defined to give a score between 0 and 1, where 0 indicates minimal similarity and 1 indicates maximal similarity. mismo currently implements the following string similarity measures, which suit different use cases:

- `Jaro` - a measure of the similarity between two strings based on the number of matching characters, the number of transpositions, and the lengths of the strings.
- `Jaro-Winkler` - a modification of the Jaro similarity that uses a prefix scale to give a higher score to strings that match at the start.
- `Jaccard` - a measure of the overlap between the sets of words in two strings: the number of shared words divided by the number of distinct words across both strings.
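
To make these definitions concrete, here is a minimal pure-Python sketch of the three similarities. It is illustrative only and is not mismo's implementation: the function names, the whitespace tokenization and lowercasing used for Jaccard, and the defaults `prefix_scale=0.1` and `max_prefix=4` are assumptions taken from the textbook definitions, so mismo's scores may differ.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity (a sketch, not mismo's implementation)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching if they are within this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are transpositions.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted by the length of the common prefix."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return sim + prefix * prefix_scale * (1 - sim)


def jaccard_words(s1: str, s2: str) -> float:
    """Jaccard similarity over lowercased, whitespace-separated words (assumed tokenization)."""
    a, b = set(s1.lower().split()), set(s2.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


print(round(jaro("MARTHA", "MARHTA"), 4))          # 0.9444
print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
print(round(jaccard_words("Acme Corp Ltd", "acme corp"), 4))  # 0.6667
```

The shared prefix `MAR` is what lifts the Jaro-Winkler score above the plain Jaro score in this example.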

In addition, the following edit distance measures are defined, along with equivalent similarities that are normalized using the string lengths:

- `Levenshtein` - a measure of the distance between two strings based on the minimum number of deletions, insertions and substitutions needed to turn one string into the other.
- `Damerau-Levenshtein` - an extension of `Levenshtein` that also counts the transposition of two adjacent characters as a single edit.
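
The following sketch shows how the two edit distances differ and one way a distance could be turned into a similarity. It is not mismo's implementation: the transposition-aware function is the restricted "optimal string alignment" variant of Damerau-Levenshtein, and normalizing by the length of the longer string is an assumption made for illustration, so mismo's normalization may differ.

```python
def levenshtein(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def osa_distance(s1: str, s2: str) -> int:
    """Optimal string alignment distance: Levenshtein plus adjacent transpositions."""
    rows, cols = len(s1) + 1, len(s2) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            # Swapping two adjacent characters counts as a single edit.
            if i > 1 and j > 1 and s1[i - 1] == s2[j - 2] and s1[i - 2] == s2[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def normalized_similarity(distance: int, s1: str, s2: str) -> float:
    """Map a distance to [0, 1] using the longer string's length (assumed normalization)."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0
    return 1 - distance / longest


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("ab", "ba"))           # 2
print(osa_distance("ab", "ba"))          # 1
print(round(normalized_similarity(levenshtein("kitten", "sitting"), "kitten", "sitting"), 4))  # 0.5714
```

A swapped pair of letters costs two edits under `Levenshtein` but only one once transpositions are allowed, which is why the second measure is more forgiving of typing errors.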

Let's explore how these work in practice using the patents dataset. We will generate pairs by blocking on the `label_true` column, so that each pair consists of two records that share the same label.

In [2]:
from mismo.playdata import load_patents

# Load the example patents dataset
patents = load_patents()
patents

In [3]:
from mismo.block import KeyBlocker

# Pair up records that share the same `label_true` value
blocked = KeyBlocker("label_true")(patents, patents)
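
As a rough illustration of what blocking on a key means, the sketch below groups toy records by a shared key and pairs up the records within each group. The records, the `block_on_key` helper, and the choice to emit each unordered pair once are all hypothetical; this is not how `KeyBlocker` is implemented, and its output may differ (for example in whether both orderings of a pair appear).

```python
from collections import defaultdict
from itertools import combinations


def block_on_key(records: list[dict], key: str) -> list[tuple[dict, dict]]:
    """Return every unordered pair of records that share the same value for `key`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    pairs = []
    for group in groups.values():
        pairs.extend(combinations(group, 2))
    return pairs


# Hypothetical toy records, not taken from the patents dataset
toy_records = [
    {"record_id": 1, "label": "A", "name": "Acme Corp"},
    {"record_id": 2, "label": "A", "name": "ACME Corporation"},
    {"record_id": 3, "label": "A", "name": "Acme Corp."},
    {"record_id": 4, "label": "B", "name": "Widget Works"},
]

toy_pairs = block_on_key(toy_records, "label")
for left, right in toy_pairs:
    print(left["name"], "<->", right["name"])
```

Only records with the same label are compared, so the single record labelled `B` produces no pairs.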

A comparison table of these string similarity measures can be generated using `mismo.eda.string_comparator_scores`:

In [4]:
from mismo.eda import string_comparator_score_chart, string_comparator_scores

# Limit to 20 pairs to keep the comparison table small
scores = string_comparator_scores(blocked.limit(20), "name_l", "name_r")

In [5]:
scores

These scores can be visualized using `mismo.eda.string_comparator_score_chart`, which plots a heatmap of the similarity and distance measures.

In [6]:
chart = string_comparator_score_chart(blocked.limit(20), "name_l", "name_r")
chart