## 24.4 Levenshtein distance

When spell checkers detect a misspelled word, they suggest only similar words
as replacements. They do this by computing the edit distance between
a valid word and the misspelled word: the number of editing operations
it takes to turn the misspelled word into the valid word.
If the number of edits is low, e.g. three or fewer, then
the valid word is added to the suggestions.
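The filtering step just described can be sketched as follows. This is only an illustration: the function `suggestions`, its parameters and the word list are hypothetical names, and `crude_distance` is a deliberately simple stand-in, not the edit distance developed in this section.

```python
from typing import Callable


def crude_distance(left: str, right: str) -> int:
    """Stand-in distance: mismatched positions plus the length difference.

    This is NOT the edit distance of this section, only a placeholder.
    """
    mismatches = sum(a != b for a, b in zip(left, right))
    return mismatches + abs(len(left) - len(right))


def suggestions(
    misspelled: str,
    vocabulary: list[str],
    distance: Callable[[str, str], int],
    max_edits: int = 3,
) -> list[str]:
    """Return the valid words at most max_edits away from the misspelled word."""
    return [word for word in vocabulary if distance(misspelled, word) <= max_edits]


found = suggestions('mimt', ['mint', 'mist', 'hello', 'rate'], crude_distance)
print(found)  # ['mint', 'mist']
```

Passing the distance function as an argument means the same filter could be reused with any of the `edit` implementations written in the exercises.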

Different edit distances can be defined by allowing different kinds of
operations. If the allowed operations are insertion, deletion and replacement,
then the edit distance is commonly known as the Levenshtein distance.
For example,
'fast' becomes 'haste' in two operations: replace 'f' with 'h', and insert 'e'.
Vice versa, deleting 'e' and replacing 'h' with 'f' turns 'haste' into 'fast'.
It's not possible to transform one string into the other with fewer operations,
so the Levenshtein distance of 'fast' and 'haste' is 2.
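To make the three operations concrete, here is a small sketch that applies them to the example above. The helper names `replace`, `insert` and `delete` and their index parameters are chosen for this illustration only.

```python
def replace(text: str, index: int, char: str) -> str:
    """Return text with the character at index replaced by char."""
    return text[:index] + char + text[index + 1:]


def insert(text: str, index: int, char: str) -> str:
    """Return text with char inserted before position index."""
    return text[:index] + char + text[index:]


def delete(text: str, index: int) -> str:
    """Return text without the character at index."""
    return text[:index] + text[index + 1:]


# 'fast' becomes 'haste' in two operations
step1 = replace('fast', 0, 'h')  # 'hast'
step2 = insert(step1, 4, 'e')    # 'haste'

# and 'haste' becomes 'fast' with the inverse operations
back1 = delete('haste', 4)       # 'hast'
back2 = replace(back1, 0, 'f')   # 'fast'

print(step1, step2, back1, back2)  # hast haste hast fast
```

Each insertion is undone by a deletion and each replacement by another replacement, which is why the distance is the same in both directions.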

The Levenshtein distance can be computed for any two strings:
the tests below include DNA strings and English words without misspellings.
We'll name the function edit(*left*, *right*) instead of
levenshtein(*left*, *right*) to make it faster and easier to type.
The tests are:

In [1]:
from algoesup import check_tests, test

edit_tests = [
    # case,                 left,           right,    distance
    ('same word',           'hello',        'hello',        0),
    ('insert',              'rate',         'grate',        1),
    ('delete',              'rate',         'ate',          1),
    ('replace',             'rate',         'fate',         1),
    ('replace',             'algorithm',    'logarithm',    3),
    ('replace & delete',    'yes',          'no',           3),
    ('delete & insert',     'great',        'grate',        2),
    ('replace & insert',    'fast',         'haste',        2),
    ('all three edits',     'GATACA',       'CATCAT',       3),
    # common typo: pressing a neighbouring letter on the keyboard
    ('replace',             'mimt',         'mint',         1),
    # common typo: typing adjacent letters in the wrong order
    ('swap letters',        'perquel',      'prequel',      2),
]

check_tests(edit_tests, [str, str, int])

OK: the test table passed the automatic checks.


### 24.4.1 Exercises

#### Exercise 24.4.1

Write a recursive definition of the Levenshtein distance,
using the head and tail operations on strings.
This is not trivial, so break it down and think of concrete examples:

- What are the base cases? What's the distance in such cases, i.e.
  how many operations does it take to transform one string into the other?
- What's the edit distance if both strings have the same head?
- If the heads are different, what are the three ways of transforming
  the left string into the right string?
  The edit distance will be the lowest of the corresponding edit distances.

Complete the following.

- if *left* = ...: edit(*left*, *right*) = ...
- if *right* = ...: edit(*left*, *right*) = ...
- if head(*left*) = head(*right*): edit(*left*, *right*) = ...
- otherwise: edit(*left*, *right*) = lowest of
   - ...
   - ...
   - ...

[Hint](../31_Hints/Hints_24_4_01.ipynb)
[Answer](../32_Answers/Answers_24_4_01.ipynb)

#### Exercise 24.4.2

Implement the recursive definition, using `...[0]` and `...[1:]`
for the head and tail operations.
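As a quick reminder of how those two expressions behave on strings, including the empty string, consider this short sketch. The variable names are only for illustration.

```python
word = 'haste'
head = word[0]   # the first character: 'h'
tail = word[1:]  # everything after the first character: 'aste'
print(head, tail)

empty = ''
empty_tail = empty[1:]  # slicing an empty string is safe and gives ''
print(empty_tail == '')

try:
    empty[0]  # indexing an empty string is not safe
    index_failed = False
except IndexError:
    index_failed = True
print(index_failed)
```

Since `...[0]` raises an error on an empty string, it's safest to check for empty strings before taking the head.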

The `timeit` line in this and the following code cells helps you see how
the different approaches improve the run-time.
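`%timeit` is an IPython command, so it only works in a notebook or IPython session. If you run your code as a plain Python script, a rough equivalent can be sketched with the standard `timeit` module. The function `stand_in` below is a hypothetical placeholder for the call you want to time, and the number of runs is an arbitrary choice.

```python
import timeit


def stand_in() -> list:
    """Hypothetical placeholder for the function call being timed."""
    return sorted('algorithm')


runs = 10_000
total = timeit.timeit(stand_in, number=runs)  # total seconds for all runs
print(f'{total / runs * 1e6:.2f} µs per run')
```

Unlike `%timeit`, this sketch doesn't choose the number of runs automatically or report the variation between runs.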

In [2]:
def edit(left: str, right: str) -> int:
    """Return the Levenshtein distance between the strings."""
    pass


test(edit, edit_tests)
%timeit edit('algorithm', 'logarithm')

[Hint](../31_Hints/Hints_24_4_02.ipynb)
[Answer](../32_Answers/Answers_24_4_02.ipynb)

#### Exercise 24.4.3

Copy your code to the next cell and modify it so that
it uses indices `l` and `r` instead of slicing the strings.
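The same kind of change can be sketched on a simpler problem, chosen only for illustration: the length of the longest common prefix of two strings. The first version slices the strings; the second keeps the strings intact and moves two indices instead.

```python
def prefix_slicing(left: str, right: str) -> int:
    """Return the length of the longest common prefix, using slicing."""
    if left == '' or right == '':
        return 0
    if left[0] != right[0]:
        return 0
    return 1 + prefix_slicing(left[1:], right[1:])


def prefix_indices(left: str, right: str) -> int:
    """Return the length of the longest common prefix, using indices."""

    def prefix(l: int, r: int) -> int:
        """Return the common prefix length of left[l:] and right[r:]."""
        if l == len(left) or r == len(right):  # one of the suffixes is empty
            return 0
        if left[l] != right[r]:  # the heads of the suffixes differ
            return 0
        return 1 + prefix(l + 1, r + 1)  # move on to the tails

    return prefix(0, 0)


print(prefix_slicing('grate', 'great'), prefix_indices('grate', 'great'))  # 2 2
```

The index version avoids creating a new string on every call: an empty suffix corresponds to an index equal to the string's length, the head to `left[l]`, and the tail to index `l + 1`.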

In [3]:
def edit_indices(left: str, right: str) -> int:
    """Return the Levenshtein distance between the strings."""

    def edit(l: int, r: int) -> int:
        """Return the Levenshtein distance of left[l:] and right[r:].

        Preconditions: 0 ≤ l ≤ len(left) and 0 ≤ r ≤ len(right)
        """
        pass

    pass  # call the inner function and return the result


test(edit_indices, edit_tests)
%timeit edit_indices('algorithm', 'logarithm')

[Hint](../31_Hints/Hints_24_4_03.ipynb)
[Answer](../32_Answers/Answers_24_4_03.ipynb)

#### Exercise 24.4.4

Explain why dynamic programming is applicable to
the Levenshtein distance problem.

_Write your answer here._

[Hint](../31_Hints/Hints_24_4_04.ipynb)
[Answer](../32_Answers/Answers_24_4_04.ipynb)

#### Exercise 24.4.5

Compute the Levenshtein distance with a top-down dynamic programming algorithm.
Copy your Exercise 24.4.3 code to the next cell and add a cache.
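If you need a reminder of the caching pattern, here is a minimal sketch on an unrelated toy problem, used here only as an assumption-free stand-in: counting the paths across a grid when each step goes right or down. The inner function takes two indices, like `edit`, and the cache is a dictionary with pairs of indices as keys.

```python
def count_paths(rows: int, columns: int) -> int:
    """Return the number of right/down paths across a rows x columns grid.

    Preconditions: rows > 0 and columns > 0
    """
    cache = {}  # maps (r, c) to the number of paths from that cell

    def paths(r: int, c: int) -> int:
        """Return the number of paths from cell (r, c) to the last cell."""
        if r == rows - 1 or c == columns - 1:  # only one way along an edge
            return 1
        if (r, c) not in cache:  # solve each subproblem only once
            cache[(r, c)] = paths(r + 1, c) + paths(r, c + 1)
        return cache[(r, c)]

    return paths(0, 0)


print(count_paths(3, 3))  # 6
```

The cache is created in the outer function, so that each call of `count_paths` starts with an empty cache while all recursive calls of `paths` share it.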

In [4]:
def edit_topdown(left: str, right: str) -> int:
    """Return the Levenshtein distance between the strings."""

    def edit(l: int, r: int) -> int:
        """Return the Levenshtein distance of left[l:] and right[r:].

        Preconditions: 0 ≤ l ≤ len(left) and 0 ≤ r ≤ len(right)
        """
        pass

    pass


test(edit_topdown, edit_tests)
%timeit edit_topdown('algorithm', 'logarithm')

[Hint](../31_Hints/Hints_24_4_05.ipynb)
[Answer](../32_Answers/Answers_24_4_05.ipynb)

#### Exercise 24.4.6

For bottom-up dynamic programming, in which order must the cache be filled?

_Write your answer here._

[Hint](../31_Hints/Hints_24_4_06.ipynb)
[Answer](../32_Answers/Answers_24_4_06.ipynb)

#### Exercise 24.4.7

Compute the Levenshtein distance with bottom-up dynamic programming.
Copy your Exercise 24.4.5 code to the next cell and modify it.

In [5]:
def edit_bottomup(left: str, right: str) -> int:
    """Return the Levenshtein distance between the strings."""
    pass


test(edit_bottomup, edit_tests)
%timeit edit_bottomup('algorithm', 'logarithm')

[Answer](../32_Answers/Answers_24_4_07.ipynb)

#### Exercise 24.4.8

What's the worst-case complexity of the bottom-up algorithm,
in terms of the lengths of the two strings?

_Write your answer here._

[Hint](../31_Hints/Hints_24_4_08.ipynb)
[Answer](../32_Answers/Answers_24_4_08.ipynb)
