# Text Distance

Text distance measures the lexical similarity of words. For example, it lets us detect simple typos: when a user writes `hose` instead of `house`, we can tell that the two words are similar.

Several Python libraries implement text distance methods; we will use `jellyfish` for this tutorial.




In [1]:
!pip install jellyfish

Collecting jellyfish
  Downloading https://files.pythonhosted.org/packages/3f/29/56521d8f0acc49149175903ad41b87f0ca392fe56c0fb3bf079fd9912c4f/jellyfish-1.0.0-cp37-none-win_amd64.whl (206kB)
Installing collected packages: jellyfish
Successfully installed jellyfish-1.0.0


# Levenshtein Distance

Levenshtein distance represents the number of insertions, deletions, and substitutions required to change one word to another.

For example, `levenshtein_distance('berne', 'born') == 2`: one substitution changes the first `e` to `o`, and one deletion removes the second `e`.


In [2]:
import jellyfish

jellyfish.levenshtein_distance('jellyfish', 'smellyfish')

2
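To make the definition concrete, here is a minimal pure-Python sketch of the standard dynamic-programming approach to Levenshtein distance. The function name `levenshtein` is chosen for illustration only; this is not how `jellyfish` is implemented internally, just one way to compute the same quantity.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    # previous[j] is the distance between the prefix of a seen so far and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        # Turning a[:i] into the empty string takes i deletions.
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute, or match for free
            ))
        previous = current
    return previous[-1]


print(levenshtein("berne", "born"))
print(levenshtein("jellyfish", "smellyfish"))
```

Only two rows of the table are kept at a time, so memory grows with the length of the second string while running time grows with the product of both lengths.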

# Hamming Distance

Hamming distance is the measure of the number of characters that differ between two strings.

Typically, Hamming distance is undefined when strings are of different lengths, but this implementation counts extra characters as differing. For example, `hamming_distance('abc', 'abcd') == 1`.

In [8]:
jellyfish.hamming_distance("cat", "hat")

1

In [3]:
jellyfish.hamming_distance('jellyfish', 'smellyfish')

9
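The result of 9 above can look surprising, since the two words share most of their letters. A small pure-Python sketch shows why: characters are compared position by position, so the extra leading letter shifts everything out of alignment. The helper name `hamming` is an assumption made for this example, and the handling of unequal lengths follows the behaviour described above (extra characters count as differences).

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which a and b differ."""
    # Compare the overlapping part position by position.
    mismatches = sum(x != y for x, y in zip(a, b))
    # Count every extra character of the longer string as one difference.
    return mismatches + abs(len(a) - len(b))


print(hamming("cat", "hat"))
print(hamming("jellyfish", "smellyfish"))
```

For `jellyfish` and `smellyfish`, only the fourth position (`l` in both words) matches, which gives 8 mismatches plus 1 extra character.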

# Damerau-Levenshtein Distance

A modification of Levenshtein distance, Damerau-Levenshtein distance counts transpositions (such as `ifsh` for `fish`) as a single edit.

For example, `levenshtein_distance('fish', 'ifsh') == 2`, because it requires a deletion and an insertion, whereas `damerau_levenshtein_distance('fish', 'ifsh') == 1`, because swapping two adjacent characters counts as a single transposition.

In [4]:
jellyfish.damerau_levenshtein_distance('jellyfish', 'jellyfihs')

1
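As a rough illustration of how a transposition can be counted as one edit, the sketch below implements the restricted variant often called optimal string alignment. This is an assumption made for simplicity: it is not necessarily the algorithm `jellyfish` uses, and the two variants can give different results on some inputs. The function name `osa_distance` is hypothetical.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance where swapping two adjacent characters costs one edit."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] is the distance between a[:i] and b[:j].
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # i deletions to reach the empty string
    for j in range(cols):
        d[0][j] = j  # j insertions starting from the empty string
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent characters swapped: treat the pair as one transposition.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("fish", "ifsh"))
print(osa_distance("jellyfish", "jellyfihs"))
```

The only difference from plain Levenshtein distance is the extra check for a swapped pair, which is why the cost is only slightly higher.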

# Which Method Is Better?

The choice of text distance method depends on the specific problem you are trying to solve and on the characteristics of the data. Each of the methods above (Levenshtein distance, Hamming distance, and Damerau-Levenshtein distance) has its own strengths and limitations.

Here's a summary of when each method might be more suitable:

1. Levenshtein Distance:

• Strengths:

Measures the minimum number of edit operations (insertions, deletions, substitutions) needed to transform one string into another.
Useful when the order of characters matters and the strings may differ in length or contain several kinds of edits.

• Limitations:

More expensive to compute than Hamming distance, especially for longer strings: the standard dynamic-programming algorithm takes time proportional to the product of the two string lengths.
Counts a transposition (two adjacent characters swapped) as two edits, so it overstates the distance when swapped characters are common.

2. Hamming Distance:

• Strengths:

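Efficient for comparing strings of the same length, since it needs only a single pass over the characters.
Suitable when strings are expected to have the same length and you want to measure character-level differences.

• Limitations:

Classically undefined when strings have different lengths (`jellyfish` works around this by counting the extra characters as differences).
Compares characters position by position, so a single insertion or deletion shifts everything after it and inflates the distance; not suitable when strings can differ in length or require more complex transformations.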

3. Damerau-Levenshtein Distance:

• Strengths:

Extends Levenshtein distance by counting a transposition of two adjacent characters as a single edit.
Useful when transpositions are likely to occur, such as in spelling correction.

• Limitations:

Slightly more computationally expensive than standard Levenshtein distance.

If we're working with DNA sequences where insertions, deletions, and substitutions are common, Levenshtein distance may be suitable.

If we're working with fixed-length strings and want to measure character-level differences, Hamming distance might be a good choice.

If we're dealing with typos in text or need to account for transpositions, Damerau-Levenshtein distance is a good option.

In practice, we may need to experiment with different distance metrics and choose the one that best fits our particular problem and dataset.