# Computing the Levenshtein distance

As covered in the lecture materials, the **edit distance** between two strings `s` and `t` is the length of the shortest sequence of edits that transforms `s` into `t`.
Depending on what kind of edits are allowed, one obtains different metrics.
For the Levenshtein distance, three kinds of operations are allowed: substitution, deletion, and insertion.
Computing the Levenshtein distance is an optimization problem.
That is to say, it's not enough to find some sequence of edits; one has to find the shortest one.
Optimization problems can be very challenging.
The naive approach would be to look at all possible sequences of edits and then pick the shortest one among them.
But this is very inefficient, and just like with our `fibonacci` function, the inefficiency stems from the fact that many sequences of edits share parts that would be needlessly recomputed for each new sequence.

For this reason, any implementation of the Levenshtein distance needs to use dynamic programming techniques.
The rest of this notebook assumes that you already know the graph-based approach covered in the lecture notes.
If you don't, go back and reread that part.
Once you understand how it works, come back and we'll see how it can be implemented in Python.
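
Before moving on to the grid, it may help to see the three operations as plain string manipulations. The sketch below is only an illustration: the helper names `substitute`, `delete`, and `insert` are made up for this example and are not used anywhere else in the notebook.

```python
def substitute(s, i, symbol):
    """Replace the symbol at position i of s with symbol."""
    return s[:i] + symbol + s[i + 1:]


def delete(s, i):
    """Remove the symbol at position i of s."""
    return s[:i] + s[i + 1:]


def insert(s, i, symbol):
    """Insert symbol into s so that it ends up at position i."""
    return s[:i] + symbol + s[i:]


# one possible way of turning "fire" into "fry" with two edits
step1 = delete("fire", 1)          # remove the "i"
step2 = substitute(step1, 2, "y")  # replace the "e" with "y"
print(step1, step2)
```

Two edits are enough here, and no single edit does the job, so the Levenshtein distance of `"fire"` and `"fry"` is 2.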

## Representing the problem

The graph-based approach represents the computation of the Levenshtein distance between strings `s` and `t` as a grid of nodes that are connected by edges.
The size of the grid depends on `s` and `t`.
The number of columns is `len(s) + 1`, and the number of rows is `len(t) + 1`.
Run the cell below to get an approximation of what the grid looks like for the strings `"fire"` and `"fry"`.

In [None]:
from tabulate import tabulate

s = "fire"
t = "fry"

grid = [[(x,y) for x in range(len(s) + 1)] for y in range(len(t)+ 1)]
print(tabulate(grid))

As the code above already suggests, we can generate the nodes of this grid with a list comprehension where each node is a tuple `(x, y)` such that `x` indicates the column and `y` the row.

In [None]:
def construct_simple_grid(s, t):
    """Construct Levenshtein grid for strings s and t."""
    return [(x,y) 
            for x in range(len(s) + 1)
            for y in range(len(t) + 1)]

In [None]:
print(construct_simple_grid("fire", "fry"))

Note that this list comprehension contains two `for` clauses back to back.
This is perfectly fine.
The list comprehension is equivalent to the following:

In [None]:
grid = []
for x in range(len(s) + 1):
    for y in range(len(t) + 1):
        grid.append((x, y))

A node can be reached by at most three edges in this grid:

1. An edge coming from the left (`x-1`) deletes a symbol of `s`.
1. An edge coming from above (`y-1`) inserts a symbol of `t`.
1. An edge coming from above left (`x-1` and `y-1`) replaces a symbol of `s` with a symbol of `t`.

Which symbols are chosen depends on what node in the grid we are looking at.
But this does not really matter for the most part.
Deletion and insertion always come with a cost of 1 because they modify the string, no matter what exactly gets deleted or inserted.
Only for substitution does the choice of symbols matter.
Substitution can have a reduced cost of 0 if it replaces a symbol with the very same symbol.

As a concrete example, consider the node `(1, 1)` in the grid for `"fire"` and `"fry"`.
You can print it here again for your convenience.

In [None]:
from tabulate import tabulate

s = "fire"
t = "fry"

grid = [[(x,y) for x in range(len(s) + 1)] for y in range(len(t)+ 1)]
print(tabulate(grid))

If one enters the node from above left, coming from `(0,0)`, this is tantamount to replacing `"fire"[0]` with `"fry"[0]`.
But since that's exactly the same letter, the cost is `0`.

In general then, all edges have a cost of `1`, with one specific exception for substitution edges: if the edge originates in `(m, n)` and `s[m] == t[n]`, then its cost is `0`.
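
As a quick sanity check of this rule, here is a minimal sketch that computes the cost of a single edge directly from the coordinates of its two endpoints. The function name `edge_cost` and its signature are assumptions made for this illustration, and the function simply trusts that `source` and `target` really are connected by an edge in the grid.

```python
def edge_cost(s, t, source, target):
    """Cost of the edge from node source to node target.

    Assumes that source and target are connected by an edge in the grid.
    """
    (m, n), (x, y) = source, target
    # a substitution edge moves one step to the right and one step down
    is_substitution = (x, y) == (m + 1, n + 1)
    if is_substitution and s[m] == t[n]:
        return 0
    return 1


print(edge_cost("fire", "fry", (0, 0), (1, 1)))  # "f" replaced by "f"
print(edge_cost("fire", "fry", (1, 1), (2, 2)))  # "i" replaced by "r"
print(edge_cost("fire", "fry", (0, 0), (1, 0)))  # deletion of "f"
```

Only the first of the three edges is free. Computing costs one edge at a time like this is fine for spot checks, but for the actual algorithm it is more convenient to have all costs available at once.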

Let's implement this.
We will construct a dictionary where each node of the grid is a key.
The value is itself a dictionary with three keys.
Each one of the three keys is a neighboring node in the grid, and its value is the cost of entering the current node from said neighboring node.

In [2]:
def construct_grid(s, t):
    """Compute the default cost for each edge."""
    # grid is now a dictionary instead of a list
    grid = {(x,y): {}
            for x in range(len(s) + 1)
            for y in range(len(t) + 1)}
    for x, y in grid:
        # add deletion edge if there is a node to the left
        if x > 0:
            grid[(x,y)][(x-1, y)] = 1
        # add insertion edge if there is a node above
        if y > 0:
            grid[(x,y)][(x, y-1)] = 1
        # add substitution edge; check if cost is 0
        if x > 0 and y > 0:
            grid[(x,y)][(x-1, y-1)] =\
                0 if s[x-1] == t[y-1] else 1
    return grid

In [None]:
from pprint import pprint

s = "fire"
t = "fry"
pprint(construct_grid(s, t), width=1)

Alright, we now have a function `construct_grid` that returns a dictionary.
The dictionary encodes all the information we need about the grid:

1. its nodes,
1. for each node, the cost of entering it from a neighboring node via deletion, insertion, or substitution.

Everything is in place to calculate the Levenshtein distance of two strings.

## A recursive implementation of Levenshtein distance

We will start with a recursive implementation.
I know, I know, Pythonistas should shun recursion like the devil shuns holy water.
But the recursive implementation is a lot easier to understand than the alternatives, so we'll start with that one.

The Levenshtein distance is the cost of the optimal path from the top left of the grid to the bottom right.
Also, for the current node `c` the cost of the optimal path from the top left to `c` is obtained as follows:

1. For each neighboring node `n` of `c`, take the cost of the optimal path from the top left to `n` and add the cost of the edge from `n` to `c`.
   Let's call this the cost to `c` through `n`.
1. Once you have computed the cost to `c` through `n` for each neighbor `n`, pick the smallest.
   This is the cost of reaching `c` from the top left of the grid.
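
To make the two steps concrete, here is a worked example for the node `(1, 1)` in the grid for `"fire"` and `"fry"`. The numbers are filled in by hand rather than computed: reaching `(0, 1)` or `(1, 0)` from the top left takes one edit each, and the substitution edge from `(0, 0)` is free because both strings start with `"f"`.

```python
# cost of the optimal path from the top left to each neighbor of (1, 1)
path_cost = {(0, 1): 1, (1, 0): 1, (0, 0): 0}
# cost of the edge from each neighbor into (1, 1)
edge_cost = {(0, 1): 1, (1, 0): 1, (0, 0): 0}

# step 1: the cost to (1, 1) through each neighbor
through = {n: path_cost[n] + edge_cost[n] for n in path_cost}
# step 2: pick the smallest of these costs
best = min(through.values())

print(through)
print(best)
```

The cheapest way into `(1, 1)` is through `(0, 0)` at a total cost of 0, which matches the intuition that turning `"f"` into `"f"` requires no edits.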
   
Implementing this as a recursive function is easy-peasy.

In [None]:
def cost(node, grid):
    """Calculate the cost of the optimal path to node through the grid."""
    if node == (0,0):
        return 0
    else:
        lowest = min([cost(neighbor, grid) + edge_cost
                      for neighbor, edge_cost in grid[node].items()])
        return lowest
    

def levenshtein_distance(s, t):
    """Calculate Levenshtein distance between s and t."""
    return cost((len(s), len(t)), construct_grid(s, t))

All the work is done by `cost`.
Like every recursive function, it has a base case and a recursion step.
The base case is when the node is `(0, 0)`, the top left corner of the grid.
The cost of reaching this node is 0 since we start there.

In the recursion step, we use a list comprehension to collect the cost of all the paths that lead to the node.
We then use `min` to pick the lowest cost among them.
The cost of these paths is computed recursively: for each neighboring node, call `cost` with the node as its argument and add the cost of the edge from the neighbor on top of that.

In [None]:
pairs = [("fire", "fry"),
         ("fyre", "fry"),
         ("apple", "banana"),
         ("aaa", "bbb"),
         ("long string", "")]
for s, t in pairs:
    print(f"Levenshtein distance of \"{s}\" and \"{t}\" is {levenshtein_distance(s, t)}")

This code constantly recomputes cost values.
Consider once more the grid for *fire* and *fry*.

In [None]:
from tabulate import tabulate

s = "fire"
t = "fry"

grid = [[(x,y) for x in range(len(s) + 1)] for y in range(len(t)+ 1)]
print(tabulate(grid))

How often do we compute the cost of the optimal path for `(1, 1)`?
Well, for starters, it is needed to compute the cost for the optimal path for its three neighboring nodes, which are `(2, 1)`, `(1, 2)`, and `(2, 2)`.
In addition, it gets recomputed whenever the path for one of those three nodes is recomputed.
This is at least the case when one of their neighboring nodes has its path computed:

- For `(2, 1)` that's `(3, 1)`, `(2, 2)`, and `(3, 2)`.
- For `(1, 2)`, it's `(2, 2)` (again), `(1, 3)`, and `(2, 3)`.
- Finally, it's `(3, 2)`, `(2, 3)`, and `(3, 3)` for `(2, 2)`.

I think you can see the pattern here.
Every node in the rectangle with `(1, 1)` as its top left corner will trigger another computation of the optimal path for `(1, 1)`.
Many of those nodes will even do it multiple times, like `(2, 2)`.
This is incredibly wasteful.
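
To put a number on the waste, the sketch below mimics the call pattern of the recursive `cost` function without computing any costs: it only counts how often each node is visited. The name `count_calls` is made up for this illustration, and the neighbors are derived from the coordinates so that the cell runs on its own.

```python
from collections import Counter


def count_calls(node, calls):
    """Visit node and, recursively, all its neighbors, like cost does."""
    calls[node] += 1
    if node == (0, 0):
        return
    x, y = node
    if x > 0:
        count_calls((x - 1, y), calls)      # deletion edge
    if y > 0:
        count_calls((x, y - 1), calls)      # insertion edge
    if x > 0 and y > 0:
        count_calls((x - 1, y - 1), calls)  # substitution edge


calls = Counter()
# bottom right corner of the grid for "fire" and "fry"
count_calls((4, 3), calls)
print(calls[(1, 1)])
print(calls[(0, 0)])
```

Even for this tiny grid, `(1, 1)` is visited 25 times and `(0, 0)` 129 times, although a single computation per node would suffice.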

Why, then, were the Levenshtein distances computed so fast for the examples above?
Just because the example strings are very short.
If we compute the Levenshtein distance of much longer words, it takes far longer than the short examples may have led you to believe.

In [None]:
# this will not finish in any reasonable amount of time;
# interrupt the kernel once you are convinced

s = "supercalifragilisticexpialidocious"
t = s
print(f"Levenshtein distance of \"{s}\" and \"{t}\" is {levenshtein_distance(s, t)}")

Time for some memoization.

## Adding memoization

Just as with the recursive function for Fibonacci numbers, we want to avoid recomputing values by storing them in a dictionary.
And just like last time, we have to add a few hooks so that values get computed only if they aren't already in the dictionary.

In [3]:
def cost(node, grid):
    """Calculate the cost of the optimal path to node through the grid."""
    if node == (0, 0) or node in memo:
        return memo[node]
    else:
        lowest = min([cost(neighbor, grid) + edge_cost
                      for neighbor, edge_cost in grid[node].items()])
        memo[node] = lowest
        return lowest
    

def levenshtein_distance(s, t):
    """Calculate Levenshtein distance between s and t."""
    return cost((len(s), len(t)), construct_grid(s, t))

In [7]:
memo = {(0, 0): 0}
s = "supercalifragilisticexpialidocious"
t = s
print(f"Levenshtein distance of \"{s}\" and \"{t}\" is {levenshtein_distance(s, t)}")

Levenshtein distance of "supercalifragilisticexpialidocious" and "supercalifragilisticexpialidocious" is 0


Look at that, lightning speed with only a few minor changes.
First we extend the base case so that it captures both `(0, 0)` and any node for which we already have a value in `memo`.
Then we add a single line `memo[node] = lowest` to add the cost of a newly computed path to `memo`.
Two minor changes, but a tremendous speed-up.
Simple, easy code that's also fast. What's not to like?
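
One caveat: in the cells above, `memo` lives outside the function and is set to `{(0, 0): 0}` right before the call. If it is not reset, values stored for one pair of strings get reused for the next pair. A minimal sketch of one way around this is to create the dictionary inside the function. The name `levenshtein_distance_fresh_memo` is made up here, and the edge costs are computed on the fly instead of being looked up in the grid dictionary so that the cell runs on its own.

```python
def levenshtein_distance_fresh_memo(s, t):
    """Memoized recursion with a memo that is created anew on every call."""
    memo = {(0, 0): 0}

    def cost(x, y):
        # base case: the start node or any node with a stored value
        if (x, y) in memo:
            return memo[(x, y)]
        candidates = []
        if x > 0:
            candidates.append(cost(x - 1, y) + 1)  # deletion
        if y > 0:
            candidates.append(cost(x, y - 1) + 1)  # insertion
        if x > 0 and y > 0:
            substitution = 0 if s[x - 1] == t[y - 1] else 1
            candidates.append(cost(x - 1, y - 1) + substitution)
        memo[(x, y)] = min(candidates)
        return memo[(x, y)]

    return cost(len(s), len(t))


# consecutive calls with different strings do not interfere
print(levenshtein_distance_fresh_memo("fire", "fry"))
print(levenshtein_distance_fresh_memo("aaa", "bbb"))
```

The logic is the same as before; the only design change is that each call starts with its own empty memo.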

## A non-recursive implementation

Since we spent quite a bit of space in the previous notebook on why recursive functions are a dubious choice at best in Python, let's compare the recursive implementation with one that uses loops instead.

The strategy is a bit more convoluted here.
As before we will use a dictionary representation of the grid, with all nodes, edges, and costs.
We will also use `memo` to keep a running tally of the minimum cost for reaching any given node.
But in putting things together we have to use quite a few more `for`-loops and `if`-statements to make things work the way we want.

In [None]:
def cost(s, t, grid):
    """Calculate the cost of the optimal path to the bottom right node."""
    # memo maps each node to the lowest known cost of reaching it;
    # the start node (0, 0) costs nothing
    memo = {(0, 0): 0}

    # visit the grid row by row, left to right, so that the left, upper,
    # and upper-left neighbors of a node are always processed before it
    for y in range(len(t) + 1):
        for x in range(len(s) + 1):
            cn = (x, y)  # shorthand for current node
            # skip nodes whose cost is already known, i.e. (0, 0)
            if cn in memo:
                continue
            for neighbor, edge_cost in grid[cn].items():
                total_cost = edge_cost + memo[neighbor]
                # if this is the lowest cost so far, add it to memo
                if cn not in memo or total_cost < memo[cn]:
                    memo[cn] = total_cost

    # return stored value for bottom right corner of grid
    return memo[(len(s), len(t))]

def levenshtein_distance(s, t):
    """Calculate Levenshtein distance between s and t."""
    return cost(s, t, construct_grid(s, t))

In [None]:
pairs = [("fire", "fry"),
         ("fyre", "fry"),
         ("apple", "banana"),
         ("aaa", "bbb"),
         ("long string", "")]
for s, t in pairs:
    print(f"Levenshtein distance of \"{s}\" and \"{t}\" is {levenshtein_distance(s, t)}")

In [None]:
s = "supercalifragilisticexpialidocious"
t = s
print(f"Levenshtein distance of \"{s}\" and \"{t}\" is {levenshtein_distance(s, t)}")

Okay, it seems to work, and it's also very fast.
But the code is a lot messier and not nearly as transparent as the recursive implementation.
This is actually one of the few cases where a recursive implementation makes sense even with Python.
The main problem with Python's recursion is the limited recursion depth.
But the Levenshtein distance is usually used to compare words, and words aren't particularly long in most languages.
Even a deliberately long fantasy word like *supercalifragilisticexpialidocious* doesn't come anywhere near Python's recursion depth limit.
Considering that the recursive implementation is more intuitive and can easily be expanded, there is little reason to use the more convoluted implementation with `for`-loops.
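
A rough back-of-the-envelope check of this claim: every recursive call moves at least one step closer to `(0, 0)`, so the chain of nested calls to `cost` can be at most `len(s) + len(t) + 1` long. The sketch below compares that bound with the limit reported by `sys.getrecursionlimit()`. The exact limit depends on your Python setup (1000 is a common default), and the variable name `max_nested_calls` is just for illustration.

```python
import sys

s = "supercalifragilisticexpialidocious"
t = s

# each call to cost reduces x + y by at least 1, so the nesting of calls
# cannot exceed the number of steps from the bottom right to (0, 0)
max_nested_calls = len(s) + len(t) + 1
print(max_nested_calls)
print(sys.getrecursionlimit())
```

For strings that are thousands of symbols long the bound would exceed a limit of 1000, which is where the loop-based version becomes the safer choice.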

No matter which implementation you prefer, though, both are instances of dynamic programming.
In both cases, the problem of finding the optimal path is broken down into small subproblems (finding shorter optimal paths) whose solutions are stored for later retrieval (through the use of `memo`).
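
To see what these stored subproblem solutions look like, the sketch below fills a plain table of costs row by row and prints it. This is an illustrative variant rather than the implementation used above: the name `cost_table` is made up, and the table is a list of lists instead of the `memo` dictionary.

```python
def cost_table(s, t):
    """Return a table where table[y][x] is the distance of s[:x] and t[:y]."""
    table = [[0] * (len(s) + 1) for _ in range(len(t) + 1)]
    for y in range(len(t) + 1):
        for x in range(len(s) + 1):
            if x == 0 or y == 0:
                # first row or column: only deletions or only insertions
                table[y][x] = x + y
                continue
            substitution = 0 if s[x - 1] == t[y - 1] else 1
            table[y][x] = min(table[y][x - 1] + 1,      # deletion
                              table[y - 1][x] + 1,      # insertion
                              table[y - 1][x - 1] + substitution)
    return table


for row in cost_table("fire", "fry"):
    print(row)
```

Each entry is the solution to one subproblem, and the entry in the bottom right corner is the Levenshtein distance of the full strings.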

## Bullet point summary

- Use the grid representation of the Levenshtein distance to allow for dynamic programming techniques.
- Sometimes, recursive functions might be the right choice after all.