Short strings and fuzzy matching - maybe too sensitive! #106

@EllieKallmier

While writing up some tests for a helper function that canonicalises e.g. a pd.Series of strings to a set of standard options, I noticed that the current fuzz.ratio() approach gets pretty fussy for shorter strings. Side note/related: maybe we should switch to RapidFuzz (which thefuzz already uses internally)?

As it currently stands, these are the fuzz ratios for a few examples with the 4-char 'Wind':

>>> from thefuzz import fuzz
>>> fuzz.ratio("wind", "Wind")
75
>>> fuzz.ratio("wand", "Wind")
50
>>> fuzz.ratio("Wild", "Wind")
75

With only four characters, a single changed character (including a case difference) costs 25 points. That's not a big deal in itself because we can just set a lower threshold, but a threshold low enough to forgive this in short strings is more lenient than we might want for longer strings. We also might not want something like the second example above to be considered a match, so some threshold is still needed.
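To make the length effect concrete, here is a minimal pure-Python sketch. It assumes fuzz.ratio() behaves like a normalised Indel similarity, i.e. 100 * 2 * LCS / (len(a) + len(b)), which reproduces the three scores above. The helper name indel_ratio and the 'Photovoltaic' example are made up for illustration.

```python
def indel_ratio(a: str, b: str) -> float:
    """Normalised Indel similarity on a 0-100 scale (assumed to mirror fuzz.ratio)."""
    if not a and not b:
        return 100.0
    # Dynamic-programming length of the longest common subsequence (LCS).
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    lcs = prev[-1]
    return 100.0 * 2 * lcs / (len(a) + len(b))


# One substituted character in a 4-character string...
short_score = indel_ratio("Wild", "Wind")  # 75.0
# ...versus one substituted character in a 12-character string.
long_score = indel_ratio("Photovoltaik", "Photovoltaic")  # about 91.67

print(short_score, round(long_score, 2))
```

Under that assumption, the same single-character slip costs 25 points at length 4 but only about 8 points at length 12, which is why one fixed threshold treats the two cases so differently.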

Ideas:

  • Update to use RapidFuzz - try out some of the different metrics/functionality available (some built in pre-processing stuff could be handy for other cases too)
  • Apply dynamic threshold related to string length - not my favourite but the threshold is kinda arbitrary anyway (to a degree)
  • Explore some more weighting/normalisation options
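As a rough sketch of the second idea (combined with simple case/whitespace pre-processing), the threshold could be derived from how many substituted characters we are willing to forgive, so it relaxes for short options and tightens for long ones. Everything named here (score, length_aware_threshold, canonicalise, allowed_substitutions) is hypothetical and not the existing helper. difflib is used as a stand-in scorer so the example runs without extra dependencies; fuzz.ratio from thefuzz or RapidFuzz could be swapped in.

```python
from difflib import SequenceMatcher
from typing import Optional, Sequence


def score(a: str, b: str) -> float:
    """Stand-in 0-100 similarity (difflib); replace with fuzz.ratio if preferred."""
    return round(100.0 * SequenceMatcher(None, a, b).ratio(), 2)


def length_aware_threshold(length: int, allowed_substitutions: int = 1) -> float:
    """Score an equal-length string keeps after the allowed number of substitutions."""
    if length <= 0:
        return 100.0
    return round(100.0 * max(length - allowed_substitutions, 0) / length, 2)


def canonicalise(
    value: str, options: Sequence[str], allowed_substitutions: int = 1
) -> Optional[str]:
    """Return the best-matching option, or None if nothing clears its threshold."""
    cleaned = value.strip().lower()  # simple pre-processing: trim and ignore case
    best_option, best_score = None, -1.0
    for option in options:
        s = score(cleaned, option.lower())
        threshold = length_aware_threshold(len(option), allowed_substitutions)
        if s >= threshold and s > best_score:
            best_option, best_score = option, s
    return best_option


print(canonicalise(" wind ", ["Wind", "Solar"]))
print(canonicalise("Phxtovoltaik", ["Photovoltaic"]))
```

Note the trade-off this exposes: once case is ignored, 'wand' scores 75 against 'Wind', so forgiving one substitution at length 4 also lets 'wand' and 'Wild' through. Passing allowed_substitutions=0 for very short options is one way to stay strict, at the cost of rejecting genuine typos.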

Labels

  • category: data-validation (Relates to data validation practices across any module, e.g. tables, schema or enforcement)
  • type: technical-debt (Code could be improved)