## Edit distance

The edit distance between two sequences (s1, s2) is the minimum number of edit operations required to transform s1 into s2. Three operations are allowed:

1. Insert an element into s1
2. Remove an element from s1
3. Replace element x in s1 with element y

In the literature this metric is called *Levenshtein distance*: the minimum total number of these operations required to transform s1 into s2.

This metric can be generalized by weighting operations differently: either by assigning different weights to the three operations above, or by making the weight depend on how likely the particular elements involved are.
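As a rough illustration of this generalization, here is a minimal sketch of a weighted edit distance. The function name `weighted_edit_distance` and the parameters `insert_cost`, `delete_cost`, and `substitute_cost` are hypothetical names chosen for this example, and the default weights are an assumption that reproduces plain Levenshtein distance. Because the weights may be asymmetric, the sketch does not swap the two sequences.

```python
def weighted_edit_distance(s1, s2, insert_cost=1.0, delete_cost=1.0,
                           substitute_cost=None):
    """Edit distance with per-operation weights (illustrative sketch)."""
    if substitute_cost is None:
        # Assumed default: replacing x with a different y costs 1, else 0.
        substitute_cost = lambda x, y: 0.0 if x == y else 1.0

    n, m = len(s1), len(s2)
    # Row 0: turning the empty prefix of s1 into s2[:j] takes j insertions.
    previous = [j * insert_cost for j in range(m + 1)]
    for i in range(1, n + 1):
        # Column 0: turning s1[:i] into the empty prefix takes i deletions.
        current = [i * delete_cost] + [0.0] * m
        for j in range(1, m + 1):
            current[j] = min(
                previous[j] + delete_cost,     # remove s1[i-1]
                current[j - 1] + insert_cost,  # insert s2[j-1]
                previous[j - 1] + substitute_cost(s1[i - 1], s2[j - 1]),
            )
        previous = current
    return previous[m]
```

With the default weights, `weighted_edit_distance("fast", "cats")` gives 3.0. Passing a `substitute_cost` function that returns a smaller value for likely confusions (for example, adjacent keys on a keyboard) is one way to model element-dependent weights.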

Computing this distance is relatively expensive: the standard dynamic-programming approach requires O(|s1| * |s2|) time.
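The quadratic cost follows from the recurrence that the dynamic program evaluates once per pair of prefixes. In the notation assumed here (not taken from the original), M(i, j) is the distance between the first i elements of s1 and the first j elements of s2, a_i is the i-th element of s1, and b_j is the j-th element of s2:

$$
M(i, 0) = i, \qquad M(0, j) = j, \qquad
M(i, j) = \min
\begin{cases}
M(i-1, j) + 1 \\
M(i, j-1) + 1 \\
M(i-1, j-1) + [a_i \neq b_j]
\end{cases}
$$

Here [a_i ≠ b_j] is 1 when the two elements differ and 0 otherwise. There are (|s1| + 1)(|s2| + 1) cells and each takes constant work, which gives the O(|s1| * |s2|) bound.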

In [81]:

def levenshtein_matrix(s1,s2):
    n, m = len(s1), len(s2)
    if n > m:
        s1,s2 = s2,s1
        n,m = m,n
    M = [[0 for i in range(m+1)] for j in range(n+1)]
    for i in range(n+1):
        M[i][0] = i
    for j in range(m+1):
        M[0][j] = j
    # the nested loops over both sequences give the O(n*m) running time
    for i in range(1, n+1):
        for j in range(1, m+1):
            M[i][j] = min(M[i-1][j-1] + (0 if s1[i-1]==s2[j-1] else 1),
                          M[i-1][j] + 1,
                          M[i][j-1] + 1)
    # M has n+1 rows and m+1 columns, so the answer is in the last row, last column
    return M[n][m]


def levenshtein(s1,s2):
    n, m = len(s1), len(s2)
    if n > m:
        s1,s2 = s2,s1
        n,m = m,n
        
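    # only two rows of the matrix are kept: the previous one and the current one
    current = list(range(n+1))
    # the nested loops still give O(n*m) time, but memory drops to O(n)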
    for i in range(1, m+1):
        previous, current = current, [i] + [0 for x in range(n)]
        for j in range(1,n+1):
            add, delete = previous[j]+1, current[j-1]+1
            change = previous[j-1]
            if s1[j-1] != s2[i-1]:
                change = change + 1
            current[j] = min(add, delete, change)
            
    return current[n]


s1 = "fast"
s2 = "cats"

print(levenshtein(s1, s2))
print(levenshtein_matrix(s1, s2))

3
3


Comparing a query term against every one of the N vocabulary terms is expensive, so simply scoring the whole vocabulary is impractical.

More efficient approaches first narrow the set of candidate vocabulary words using an index, such as a permuterm index, and compute edit distances only over that reduced set.

## n-gram indexes

A statistical approach to reducing the vocabulary over which edit distances need to be computed uses n-gram indexes. This involves computing n-grams over a training set and using that context information to identify probable words.
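One possible reading of this idea is sketched below, under the assumption that the n-grams are character bigrams computed over the vocabulary itself. The helper names `char_ngrams`, `build_ngram_index`, and `candidates`, the `$` boundary marker, and the `min_overlap` threshold are all hypothetical choices for this illustration. The index maps each n-gram to the words containing it, so a query only needs to be scored against words that share enough n-grams with it.

```python
from collections import defaultdict


def char_ngrams(word, n=2):
    """Return the set of character n-grams of word, padded with '$' markers."""
    padded = "$" + word + "$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def build_ngram_index(vocabulary, n=2):
    """Map each character n-gram to the set of vocabulary words containing it."""
    index = defaultdict(set)
    for word in vocabulary:
        for gram in char_ngrams(word, n):
            index[gram].add(word)
    return index


def candidates(query, index, n=2, min_overlap=2):
    """Return the words sharing at least min_overlap n-grams with the query."""
    counts = defaultdict(int)
    for gram in char_ngrams(query, n):
        for word in index.get(gram, ()):
            counts[word] += 1
    return {word for word, count in counts.items() if count >= min_overlap}


vocabulary = ["cats", "cast", "fast", "dog", "fist"]
index = build_ngram_index(vocabulary)
shortlist = candidates("fast", index)
```

For the query "fast", the shortlist contains "cast", "fast", and "fist"; "cats" and "dog" share no bigrams with the query and are never scored. The trade-off is that a filter like this can drop words that are within a small edit distance but share few n-grams, so `min_overlap` would need tuning for real data.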