<img src='img/anaconda-logo.png' align='left' style="padding:10px">
<br>
*Copyright Continuum 2012-2016 All Rights Reserved.*

# Accelerate Natural Language Processing: Damerau–Levenshtein distance

The Damerau–Levenshtein (DL) distance is the minimum number of single-character edits needed to turn one string into another.


> *"...the Damerau–Levenshtein distance is a distance (string metric) between two strings, i.e., finite sequence of symbols, given by counting the minimum number of operations needed to transform one string into the other, where an operation is defined as an insertion, deletion, or substitution of a single character, or a transposition of two adjacent characters."* [from Wikipedia](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance)

## Table of Contents
* [Accelerate Natural Language Processing: Damerau–Levenshtein distance](#Accelerate-Natural-Language-Processing:-Damerau–Levenshtein-distance)
	* [Set-up](#Set-up)
	* [Pairwise API](#Pairwise-API)
	* [Batch API](#Batch-API)
	* [Bigger Dictionary](#Bigger-Dictionary)


## Set-up

In [None]:
from __future__ import print_function

In [None]:
from dldist import damerau_levenshtein_distance
from dldist import BatchDamerauLevenshtein
import nltk

## Pairwise API

The ``damerau_levenshtein_distance()`` function computes the DL-distance between two words.

In [None]:
damerau_levenshtein_distance("apple", "aplp")

In [None]:
damerau_levenshtein_distance("orange", "oarnge")
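
As a cross-check on the two calls above, here is a minimal pure-Python sketch of the restricted variant of the distance (often called optimal string alignment). It is not the `dldist` implementation, and the name `osa_distance` is hypothetical. The source does not say whether `dldist` computes the restricted or the unrestricted variant, though the two variants agree on these inputs.

```python
def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i symbols of a and the first j of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i symbols
    for j in range(cols):
        d[0][j] = j  # insert all j symbols
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # transposition of two adjacent symbols
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("apple", "aplp"))     # 2: transpose "pl", delete "e"
print(osa_distance("orange", "oarnge"))  # 1: transpose "ra"
```

The table costs O(len(a) * len(b)) time and memory per pair, which is why a batched, optimized implementation pays off for large dictionaries.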

* The API accepts ``bytes`` arguments.  
* For convenience, ``str`` is also accepted when every character fits in a single byte.  
* For Unicode strings, encode with a fixed-length encoding and pass the number of bytes per character as ``byteperchar``.

In [None]:
unicode1 = u"яблоко"
unicode2 = u"бякло"
damerau_levenshtein_distance(unicode1.encode('utf16'), 
                             unicode2.encode('utf16'), byteperchar=2)
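
To see why a fixed-length encoding matters, the following sketch compares byte lengths under several standard Python codecs. It uses only the standard library and makes no claim about `dldist` internals. The point is that a bytes-per-character setting can only be meaningful when every character occupies the same number of bytes. Note that the `'utf16'` and `'utf32'` codecs prepend a byte-order mark (BOM), while the `-le` and `-be` variants do not.

```python
word = u"яблоко"  # 6 characters

print(len(word.encode('utf-8')))      # 12: 2 bytes per Cyrillic letter
print(len(u"aя".encode('utf-8')))     # 3: 1 byte + 2 bytes, so UTF-8 is variable-width
print(len(word.encode('utf-16-le')))  # 12: 2 bytes per character, no BOM
print(len(word.encode('utf-16')))     # 14: same, plus a 2-byte BOM
print(len(word.encode('utf-32-le')))  # 24: 4 bytes per character, no BOM
print(len(word.encode('utf-32')))     # 28: same, plus a 4-byte BOM

emoji = u"\U0001F34E"  # one character outside the Basic Multilingual Plane
print(len(emoji.encode('utf-16-le')))  # 4: a surrogate pair, so UTF-16 is not fixed-width here
print(len(emoji.encode('utf-32-le')))  # 4: UTF-32 stays at 4 bytes per character
```

Because the BOM is an identical one-character prefix on every string encoded with the same codec, it should not change the edit distance between them. UTF-16 with two bytes per character is only safe when all text stays within the Basic Multilingual Plane.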

## Batch API

The batch API provides a faster way to compute the DL distance of a word against a pre-filled dictionary.  

Computation is optimized by using multithreaded execution when needed.

In [None]:
dictionary = [
    'computer',
    'science',
    'programming',
    'program',
    'graph',
    'analytics',
]

In [None]:
bdl = BatchDamerauLevenshtein(dictionary)
bdl.query('komapter')

Each result is a pair: a dictionary word and its edit distance from the query string.

The `topk` closest words and their distances are returned as a list of 2-tuples.  

The list is always sorted with the closest word (the one with the smallest edit distance) first.
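
Conceptually, a query of this kind amounts to scoring every dictionary word and keeping the `topk` smallest. The sketch below does exactly that in pure Python. It is a brute-force illustration, not how `BatchDamerauLevenshtein` is implemented, and `brute_force_query` and `osa_distance` are hypothetical names. It uses the restricted (optimal string alignment) variant of the distance, returns every word when `topk` is `None`, and breaks ties alphabetically. All three choices are assumptions the source does not state.

```python
import heapq


def osa_distance(a, b):
    """Restricted Damerau-Levenshtein distance, keeping only three rows."""
    prev2, prev = None, list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                cur[j] = min(cur[j], prev2[j - 2] + 1)  # transposition
        prev2, prev = prev, cur
    return prev[-1]


def brute_force_query(dictionary, query, topk=None):
    """Return (word, distance) pairs, closest first (hypothetical helper)."""
    scored = [(osa_distance(query, word), word) for word in dictionary]
    if topk is None:
        best = sorted(scored)
    else:
        best = heapq.nsmallest(topk, scored)
    return [(word, dist) for dist, word in best]


dictionary = ['computer', 'science', 'programming', 'program', 'graph', 'analytics']
print(brute_force_query(dictionary, 'gram', topk=1))  # [('graph', 2)]
```

`heapq.nsmallest` avoids sorting the whole dictionary when only a few matches are wanted, which matters once the dictionary grows to tens of thousands of words.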

In [None]:
bdl.query('gram', topk=1)

To use multiple workers (threads), set `nworkers`.  The default is `1`.

In [None]:
bdl.query('apple', nworkers=2)
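
One way to picture what `nworkers` controls is a partition-and-merge scheme: split the dictionary into one slice per worker, score each slice independently, then merge the partial results. The sketch below is an assumption about the general idea, not a description of the library's internals. `parallel_query` and `levenshtein` are hypothetical names, and for brevity the scorer is plain Levenshtein distance (no transpositions) rather than the DL distance. In pure Python the global interpreter lock keeps threads from speeding up this CPU-bound loop, so the sketch shows only the structure.

```python
from concurrent.futures import ThreadPoolExecutor


def levenshtein(a, b):
    """Plain Levenshtein distance (stand-in scorer, no transpositions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def parallel_query(dictionary, query, nworkers=1):
    """Score the dictionary in nworkers slices; return (word, distance), closest first."""
    words = list(dictionary)
    slices = [words[i::nworkers] for i in range(nworkers)]  # round-robin partition

    def score(chunk):
        return [(levenshtein(query, w), w) for w in chunk]

    with ThreadPoolExecutor(max_workers=nworkers) as pool:
        partial = list(pool.map(score, slices))

    # Merge the per-worker results into one list sorted by (distance, word).
    merged = sorted(pair for part in partial for pair in part)
    return [(w, d) for d, w in merged]


dictionary = ['computer', 'science', 'programming', 'program', 'graph', 'analytics']
# The answer should not depend on how many workers share the work.
print(parallel_query(dictionary, 'apple', nworkers=2) ==
      parallel_query(dictionary, 'apple', nworkers=1))  # True
```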

Here's an example of using a multibyte character set with the batch API:

In [None]:
wide_dictionary = [
    u"яблоко",
    u"компьютер",
    u"наука",
]

Encode the words into bytes using UTF-32, which is a fixed-length encoding.

In [None]:
encoded = [x.encode('utf32') for x in wide_dictionary]

Create our batch D-L object

In [None]:
widebdl = BatchDamerauLevenshtein(encoded, byteperchar=4)

Query the batch D-L object for the distances and then print the results

In [None]:
# Query with a UTF-32 encoded string; decode each returned word before printing
result = widebdl.query(u"омюпьтер".encode('utf32'))

for word, dist in result:
    print(word.decode('utf32'), dist)

## Bigger Dictionary

Ensure we have the corpus downloaded

In [None]:
nltk.download('brown')

In [None]:
words = set(nltk.corpus.brown.words())

Normalize the data

In [None]:
def normalize(w):
    # lowercase and ascii bytes
    return w.lower().encode('ascii')

words = set(map(normalize, words))

Create a batch D-L object

In [None]:
%%time
print('# of words', len(words))
bdl = BatchDamerauLevenshtein(words)

Perform queries and time the execution when using different numbers of workers

In [None]:
%%time
bdl.query('macinth', nworkers=1)

In [None]:
%%time
bdl.query('macinth', nworkers=2)

In [None]:
%%time
bdl.query('macinth', nworkers=4)
