Short strings and fuzzy matching - maybe too sensitive!

While writing tests for a helper function that canonicalises e.g. a pd.Series of strings to a set of standard options, I noticed that the current fuzz.ratio() approach gets pretty fussy for shorter strings: a single-character difference moves the score far more than it would in a longer string. Side note/related - maybe we should switch to using [RapidFuzz](https://github.com/rapidfuzz/RapidFuzz) directly (thefuzz already uses it internally)?

As it currently stands, these are the fuzz ratios for a few examples with the 4-char 'Wind':
```
>>> from thefuzz import fuzz
>>> fuzz.ratio("wind", "Wind")
75

>>> fuzz.ratio("wand", "Wind")
50

>>> fuzz.ratio("Wild", "Wind")
75
```

Not a big deal because we can just set a lower threshold, but a threshold low enough to accept these short-string near-misses would also be more lenient than we'd want for longer strings. At the same time, we probably don't want something like the second example above ("wand") to count as a match, so some threshold is still needed.
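To make the length effect concrete, here is a rough stdlib-only sketch of how I understand the score to be calculated. It assumes fuzz.ratio() is the Indel-style similarity `2 * LCS(a, b) / (len(a) + len(b))` scaled to 0-100 (worth double-checking against the RapidFuzz docs), and `indel_ratio` is a made-up name for illustration, not something in our codebase or in thefuzz. Under that assumption, one substituted character in two strings of length n scores `100 * (1 - 1/n)`, so the penalty shrinks as the strings get longer.

```python
def indel_ratio(a: str, b: str) -> float:
    """Hypothetical re-implementation: 100 * 2 * LCS(a, b) / (len(a) + len(b))."""
    if not a and not b:
        return 100.0
    # Longest common subsequence length via dynamic programming, one row at a time.
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return 200.0 * prev[-1] / (len(a) + len(b))


# Same one-character slip, different string lengths.
print(indel_ratio("Wild", "Wind"))                  # 4 chars
print(indel_ratio("Wild turbine", "Wind turbine"))  # 12 chars
```

The 4-char pair reproduces the 75 above, while the 12-char pair with the same single substitution lands above 90, which is the asymmetry a fixed threshold has to cope with.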

Ideas:
- Update to use RapidFuzz - try out some of the different metrics/functionality available (some of the built-in pre-processing could be handy for other cases too)
- Apply a dynamic threshold related to string length - not my favourite, but the threshold is kinda arbitrary anyway (to a degree)
- Explore some more weighting/normalisation options
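A possible shape for the dynamic threshold idea, purely as a sketch: relax the threshold for short options just enough to tolerate one substituted character, and never go above the usual fixed value. Everything here is hypothetical (`dynamic_threshold`, `best_match`, and the `base=85` / `floor=60` numbers are placeholders, not current settings), and the default scorer is difflib's `SequenceMatcher` as a stdlib stand-in so the snippet runs on its own; in the real helper we'd pass `fuzz.ratio` instead.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterable, Optional


def stdlib_ratio(a: str, b: str) -> float:
    # Stand-in scorer on a 0-100 scale; swap for fuzz.ratio in practice.
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def dynamic_threshold(option: str, base: float = 85.0, floor: float = 60.0) -> float:
    # Score of a single substitution in a string of this length: 100 * (1 - 1/n).
    n = max(len(option), 1)
    one_edit_score = 100.0 * (1 - 1 / n)
    # Relax below `base` for short options, but never drop below `floor`.
    return max(floor, min(base, one_edit_score))


def best_match(
    value: str,
    options: Iterable[str],
    scorer: Callable[[str, str], float] = stdlib_ratio,
) -> Optional[str]:
    # Return the highest-scoring option that clears its own threshold, else None.
    best, best_score = None, -1.0
    for option in options:
        score = scorer(value, option)
        if score >= dynamic_threshold(option) and score > best_score:
            best, best_score = option, score
    return best


print(best_match("Wild", ["Wind", "Solar"]))
print(best_match("wand", ["Wind", "Solar"]))
```

With these placeholder numbers, "Wild" is accepted as "Wind" while "wand" is rejected, matching the behaviour described above. Note that lower-casing both sides first would put "wand" one edit away from "wind", so pre-processing and the threshold rule would need to be decided together.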
