## Topic
Use the Levenshtein minimum edit distance to find the words in a vocabulary that are closest to the user's input.
 
## Reference
1. https://www.nltk.org/book/ch01.html

## Build a vocabulary set
Build a vocabulary (the set of all unique words) from any English corpus in nltk.book. The input string will be looked up in this vocabulary.

In [1]:
import nltk
# load the sample texts (text1 ... text9) into the local namespace
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [2]:
# use text3 (The Book of Genesis) as the vocabulary
vol_dict = sorted(set(text3))
print("The number of unique words in the book:", len(vol_dict))

The number of unique words in the book: 2789


## Find Frequency
Find the number of occurrences (frequency) of each unique word in the chosen corpus. Also find the total number of words in the chosen corpus (N).

In [37]:
num_words = len(text3)
print("The total number of words in the chosen corpus is:", num_words)

The total number of words in the chosen corpus is: 44764


In [38]:
fq_dic = FreqDist(text3)
print(fq_dic.most_common(50))

[(',', 3681), ('and', 2428), ('the', 2411), ('of', 1358), ('.', 1315), ('And', 1250), ('his', 651), ('he', 648), ('to', 611), (';', 605), ('unto', 590), ('in', 588), ('that', 509), ('I', 484), ('said', 476), ('him', 387), ('a', 342), ('my', 325), ('was', 317), ('for', 297), ('it', 290), ('with', 289), ('me', 282), ('thou', 272), ("'", 268), ('is', 267), ('thy', 267), ('s', 263), ('thee', 257), ('be', 254), ('shall', 253), ('they', 249), ('all', 245), (':', 238), ('God', 231), ('them', 230), ('not', 224), ('which', 198), ('father', 198), ('will', 195), ('land', 184), ('Jacob', 179), ('came', 177), ('her', 173), ('LORD', 166), ('were', 163), ('she', 161), ('from', 157), ('Joseph', 157), ('their', 153)]


## Find Relative Frequency
Find the relative frequency of each word x, where the relative frequency of x = (frequency of x) / N. This relative frequency can be interpreted as the probability of the word in the corpus.

In [48]:
from collections import Counter

# divide each raw count by N to get the word's relative frequency
re_fq_dic = {}
for k, v in fq_dic.items():
    re_fq_dic[k] = v / num_words
d = Counter(re_fq_dic)
print(d.most_common())

In [3]:
test_str = "this"
# tokens = nltk.word_tokenize(test_str)
# print(tokens)

In [4]:
# Dynamic-programming solution to the edit distance problem

def minDistance(word1, word2):
    # Levenshtein distance with insert/delete cost 1 and substitution cost 2
    
    m = len(word1)+1
    n = len(word2)+1

    dp = [[0 for _ in range(n)] for _ in range(m)]
    
    for i in range(m):
        dp[i][0] = i

    for j in range(n):
        dp[0][j] = j
        
    # Main Loop
    for i in range(1,m):
        for j in range(1,n):
            dp[i][j] = min((0 if word1[i-1] == word2[j-1] else 2) + dp[i-1][j-1],
                            dp[i-1][j] + 1, 
                            dp[i][j-1] + 1)

    return dp[-1][-1]

# test
str1 = "intention"
str2 = "execution" 
ans = minDistance(str1, str2)
print(ans)

8
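
The result 8 comes from charging 2 for a substitution (equivalent to one deletion plus one insertion). As a minimal sketch, the variant below makes that cost a parameter so the two conventions can be compared on the same pair of strings. The name `edit_distance` and the `sub_cost` parameter are illustrative and not part of the notebook above; the function keeps only two rows of the table instead of the full matrix.

```python
def edit_distance(source, target, sub_cost=1):
    """Minimum edit distance with insert/delete cost 1.

    sub_cost is the price of replacing one character with a different one
    (hypothetical parameter, added here for comparison only).
    """
    # prev[j] holds the distance between the empty prefix of source
    # and the first j characters of target
    prev = list(range(len(target) + 1))
    for i, s_ch in enumerate(source, start=1):
        curr = [i]  # distance from source[:i] to the empty string
        for j, t_ch in enumerate(target, start=1):
            substitute = prev[j - 1] + (0 if s_ch == t_ch else sub_cost)
            delete = prev[j] + 1
            insert = curr[j - 1] + 1
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


print(edit_distance("intention", "execution", sub_cost=1))
print(edit_distance("intention", "execution", sub_cost=2))
```

With `sub_cost=2` the function follows the same recurrence as `minDistance`, so it should agree with the value printed above.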


In [5]:
test_str = "this"
if test_str in vol_dict:
    print("[" + test_str + "]" + " is a complete and correct word in English.")

[this] is a complete and correct word in English.
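
When the input is not in the vocabulary, the edit distance can be used to rank candidate words. The following is a minimal sketch of that step, not a definitive implementation: `closest_words` and `toy_vocab` are hypothetical names, the small word list stands in for `vol_dict` so the example runs without downloading any corpus, and `min_distance` repeats the recurrence of `minDistance` (substitution cost 2).

```python
import heapq


def min_distance(word1, word2):
    # Same recurrence as minDistance above: insert/delete cost 1, substitution cost 2
    m, n = len(word1) + 1, len(word2) + 1
    dp = [[0] * n for _ in range(m)]
    for i in range(m):
        dp[i][0] = i
    for j in range(n):
        dp[0][j] = j
    for i in range(1, m):
        for j in range(1, n):
            sub = 0 if word1[i - 1] == word2[j - 1] else 2
            dp[i][j] = min(dp[i - 1][j - 1] + sub,
                           dp[i - 1][j] + 1,
                           dp[i][j - 1] + 1)
    return dp[-1][-1]


def closest_words(query, vocabulary, k=3):
    # Return the k (distance, word) pairs with the smallest distance;
    # ties are broken alphabetically because tuples compare element by element
    return heapq.nsmallest(k, ((min_distance(query, w), w) for w in vocabulary))


# Assumption: a toy vocabulary used in place of vol_dict
toy_vocab = ["this", "that", "thus", "thin", "the", "his", "is"]
print(closest_words("thsi", toy_vocab))
```

To search the real vocabulary, pass `vol_dict` instead of `toy_vocab`. Each query compares against every vocabulary word, which is acceptable for a few thousand entries but would need pruning for much larger vocabularies.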
