# Autocorrect



<a name='0'></a>
## 0. Overview

We use autocorrect every day on our phones and computers. This notebook explores what goes on behind the scenes. The model implemented here is not identical to the one on a phone, but it still produces sensible corrections. Along the way, we will:

- Get a word count given a corpus
- Get a word probability in the corpus 
- Manipulate strings 
- Filter strings 
- Implement Minimum edit distance to compare strings and to help find the optimal path for the edits. 
- Understand how dynamic programming works


Similar systems are used everywhere. For example, if you type **"I am lerningg"**, chances are very high that you meant to write **"learning"**.

The Turkish translation of *Sapiens: A Brief History of Humankind*, by Yuval Noah Harari, is used as the corpus.


# Part 1: Data Preprocessing 

In [69]:
import re
from collections import Counter
import numpy as np
import pandas as pd

In [70]:
def process_data(file_name):
    """
    Input: 
        file_name: name of a UTF-8 text file in the current directory. 
    Output: 
        words: a list containing all the words in the corpus (the text file) in lower case. 
    """
    with open(file_name, encoding='UTF-8') as f:
        contents = f.read()

    # lower-case the text, then extract every run of word characters as a token
    contents = contents.lower()
    words = re.findall(r'\w+', contents)
    
    return words

In [71]:

word_l = process_data('sapiens.txt')
vocab = set(word_l)  # this will be new vocabulary
print(f"The first ten words in the text are: \n{word_l[0:10]}")
print(f"There are {len(vocab)} unique words in the vocabulary.")

The first ten words in the text are: 
['hayvanlardan', 'tanrilara', 'sapiens', 'i', 'nsan', 'türünün', 'kısa', 'bir', 'tarihi', 'yuval']
There are 24730 unique words in the vocabulary.
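
The first ten words contain the fragments `'i'` and `'nsan'`, and `'tanrilara'` has a dotted `i` where Turkish spelling has `ı`. A likely cause is that Python's `str.lower()` is not locale-aware: it maps `İ` to `i` followed by a combining dot, and `I` to `i`. Below is a minimal sketch of a Turkish-aware lower-casing step. `turkish_lower` and `tokenize` are hypothetical helpers, not part of this notebook, and all counts reported in this notebook were produced without them.

```python
import re

def turkish_lower(text):
    """Lower-case text with the Turkish dotted/dotless i rules (hypothetical helper)."""
    # handle the two problematic capitals by hand, then apply the generic lower()
    return text.replace('İ', 'i').replace('I', 'ı').lower()

def tokenize(text):
    """Split lower-cased text into word tokens with the same regex as process_data."""
    return re.findall(r'\w+', turkish_lower(text))

print(tokenize('İnsan Türünün Kısa Bir Tarihi'))
print(tokenize('TANRILARA'))
```

Swapping this in would change the vocabulary size and word counts, so the numbers below would need to be recomputed.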


In [72]:
def get_count(word_l):
    '''
    Input:
        word_l: a list of words representing the corpus. 
    Output:
        word_count_dict: The wordcount dictionary where key is the word and value is its frequency.
    '''
    
    # Counter maps each distinct word to the number of times it occurs
    word_count_dict = Counter(word_l)

    return word_count_dict

In [73]:
word_count_dict = get_count(word_l)
print(f"There are {len(word_count_dict)} key-value pairs")
print(f"The count for the word 'insan' is {word_count_dict.get('insan',0)}")

There are 24730 key-value pairs
The count for the word 'insan' is 325


In [74]:
def get_probs(word_count_dict):
    '''
    Input:
        word_count_dict: The wordcount dictionary where key is the word and value is its frequency.
    Output:
        probs: A dictionary where keys are the words and the values are the probability that a word will occur. 
    '''
    probs = {}

    # total number of word tokens in the corpus, computed once outside the loop
    total = sum(word_count_dict.values())
    for word, count in word_count_dict.items():
        probs[word] = count / total

    return probs

In [75]:

probs = get_probs(word_count_dict)
print(f"Length of probs is {len(probs)}")
print(f"P('insan') is {probs['insan']:.4f}")

Length of probs is 24730
P('insan') is 0.0030
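
The probability computed by `get_probs` is a relative frequency: $P(w_i) = C(w_i) / M$, where $C(w_i)$ is the count of the word and $M$ is the total number of word tokens. A minimal sketch on a toy corpus (the four words below are made up for illustration and are not taken from `sapiens.txt`):

```python
from collections import Counter

toy_words = ['bir', 'insan', 'bir', 'tarih']   # toy corpus, made up for illustration
counts = Counter(toy_words)
total = sum(counts.values())                   # M: total number of tokens
toy_probs = {w: c / total for w, c in counts.items()}

print(toy_probs)                 # relative frequency of each word
print(sum(toy_probs.values()))   # the probabilities add up to 1
```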


# Part 2: String Manipulations

Now that $P(w_i)$ has been computed for every word in the corpus, the next step is a set of functions that manipulate strings, so that a misspelled string can be edited into candidate corrections. This section implements four functions: 

* `delete_letter`: given a word, it returns all the possible strings that have **one character removed**. 
* `switch_letter`: given a word, it returns all the possible strings that have **two adjacent letters switched**.
* `replace_letter`: given a word, it returns all the possible strings that have **one character replaced by another different letter**.
* `insert_letter`: given a word, it returns all the possible strings that have an **additional character inserted**. 
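
All four functions can share one idea: split the word at every position into a left and a right part, then edit the characters at the boundary. Here is a small sketch on the word `'at'`, assuming a plain ASCII alphabet for brevity. The variable names are illustrative and are not the notebook's functions.

```python
letters = 'abcdefghijklmnopqrstuvwxyz'   # assumption: ASCII alphabet only
word = 'at'

# all (left, right) splits, including the one with an empty right part
splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]

# delete: drop the first character of the right part
deletes = [left + right[1:] for left, right in splits if right]
# switch: swap the first two characters of the right part
switches = [left + right[1] + right[0] + right[2:]
            for left, right in splits if len(right) > 1]
# replace: substitute the first character of the right part with a different letter
replaces = [left + c + right[1:]
            for left, right in splits if right
            for c in letters if c != right[0]]
# insert: put a new letter between the left and right parts
inserts = [left + c + right for left, right in splits for c in letters]

print(splits)
print(deletes, switches)
print(len(replaces), len(inserts))
```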


In [76]:

def delete_letter(word, verbose=False):
    '''
    Input:
        word: the string/word from which one character will be removed
    Output:
        delete_l: a list of all possible strings obtained by deleting 1 character from word
    '''
    
    # every split of the word into (left, right) with a non-empty right part
    split_l = [(word[:i], word[i:]) for i in range(len(word))]
    
    # drop the first character of the right part
    delete_l = [left+right[1:] for left,right in split_l if right ]

    if verbose: print(f"input word {word}, \nsplit_l = {split_l}, \ndelete_l = {delete_l}")

    return delete_l

In [77]:
delete_word_l = delete_letter(word="insan",
                        verbose=True)

input word insan, 
split_l = [('', 'insan'), ('i', 'nsan'), ('in', 'san'), ('ins', 'an'), ('insa', 'n')], 
delete_l = ['nsan', 'isan', 'inan', 'insn', 'insa']


In [78]:
def switch_letter(word, verbose=False):
    '''
    Input:
        word: input string
    Output:
        switch_l: a list of all possible strings with two adjacent characters switched
    ''' 
    
    split_l = [(word[:i], word[i:]) for i in range(len(word))]
    # swap the last character of the left part with the first character of the right part
    switch_l = [left[:-1] + right[0] + left[-1] + right[1:] for left, right in split_l if left]
    
    if verbose: print(f"Input word = {word} \nsplit_l = {split_l} \nswitch_l = {switch_l}") 

    return switch_l

In [79]:
switch_word_l = switch_letter(word="insan",
                         verbose=True)

Input word = insan 
split_l = [('', 'insan'), ('i', 'nsan'), ('in', 'san'), ('ins', 'an'), ('insa', 'n')] 
switch_l = ['nisan', 'isnan', 'inasn', 'insna']


In [80]:
def replace_letter(word, verbose=False):
    '''
    Input:
        word: the input string/word 
    Output:
        replace_l: a sorted list of all possible strings where one letter of the original word is replaced. 
    ''' 
    
    # Turkish letters (the string also contains 'q')
    letters = 'abcçdefgğhıijklmnoöpqrsştuüvyz'
    
    split_l = [(word[:i], word[i:]) for i in range(len(word))] 
    # replace the first character of the right part with each letter, skipping the original word
    replace_set = {left + c + right[1:] for left, right in split_l for c in letters if left + c + right[1:] != word}
    
    # turn the set into a sorted list, for easier viewing
    replace_l = sorted(replace_set)
    
    if verbose: print(f"Input word = {word} \nsplit_l = {split_l} \nreplace_l {replace_l}")   
    
    return replace_l

In [81]:
replace_l = replace_letter(word='insan',
                              verbose=True)

Input word = insan 
split_l = [('', 'insan'), ('i', 'nsan'), ('in', 'san'), ('ins', 'an'), ('insa', 'n')] 
replace_l ['ansan', 'bnsan', 'cnsan', 'dnsan', 'ensan', 'fnsan', 'gnsan', 'hnsan', 'iasan', 'ibsan', 'icsan', 'idsan', 'iesan', 'ifsan', 'igsan', 'ihsan', 'iisan', 'ijsan', 'iksan', 'ilsan', 'imsan', 'inaan', 'inban', 'incan', 'indan', 'inean', 'infan', 'ingan', 'inhan', 'inian', 'injan', 'inkan', 'inlan', 'inman', 'innan', 'inoan', 'inpan', 'inqan', 'inran', 'insaa', 'insab', 'insac', 'insad', 'insae', 'insaf', 'insag', 'insah', 'insai', 'insaj', 'insak', 'insal', 'insam', 'insao', 'insap', 'insaq', 'insar', 'insas', 'insat', 'insau', 'insav', 'insay', 'insaz', 'insaç', 'insaö', 'insaü', 'insağ', 'insaı', 'insaş', 'insbn', 'inscn', 'insdn', 'insen', 'insfn', 'insgn', 'inshn', 'insin', 'insjn', 'inskn', 'insln', 'insmn', 'insnn', 'inson', 'inspn', 'insqn', 'insrn', 'inssn', 'instn', 'insun', 'insvn', 'insyn', 'inszn', 'insçn', 'insön', 'insün', 'insğn', 'insın', 'insşn', 'intan', 

In [82]:
def insert_letter(word, verbose=False):
    '''
    Input:
        word: the input string/word 
    Output:
        insert_l: a list of all possible strings with one new letter inserted at every offset
    ''' 
    # note: ASCII letters only here, whereas replace_letter uses Turkish letters
    letters = 'abcdefghijklmnopqrstuvwxyz'
    
    # len(word) + 1 splits, so a letter can also be appended after the last character
    split_l = [(word[:i],word[i:]) for i in range(len(word) + 1)]
    insert_l = [left + c + right for left,right in split_l for c in letters]
   
    if verbose: print(f"Input word {word} \nsplit_l = {split_l} \ninsert_l = {insert_l}")
    
    return insert_l

In [83]:
insert_l = insert_letter('insan', True)
print(f"Number of strings output by insert_letter('insan') is {len(insert_l)}")

Input word insan 
split_l = [('', 'insan'), ('i', 'nsan'), ('in', 'san'), ('ins', 'an'), ('insa', 'n'), ('insan', '')] 
insert_l = ['ainsan', 'binsan', 'cinsan', 'dinsan', 'einsan', 'finsan', 'ginsan', 'hinsan', 'iinsan', 'jinsan', 'kinsan', 'linsan', 'minsan', 'ninsan', 'oinsan', 'pinsan', 'qinsan', 'rinsan', 'sinsan', 'tinsan', 'uinsan', 'vinsan', 'winsan', 'xinsan', 'yinsan', 'zinsan', 'iansan', 'ibnsan', 'icnsan', 'idnsan', 'iensan', 'ifnsan', 'ignsan', 'ihnsan', 'iinsan', 'ijnsan', 'iknsan', 'ilnsan', 'imnsan', 'innsan', 'ionsan', 'ipnsan', 'iqnsan', 'irnsan', 'isnsan', 'itnsan', 'iunsan', 'ivnsan', 'iwnsan', 'ixnsan', 'iynsan', 'iznsan', 'inasan', 'inbsan', 'incsan', 'indsan', 'inesan', 'infsan', 'ingsan', 'inhsan', 'inisan', 'injsan', 'inksan', 'inlsan', 'inmsan', 'innsan', 'inosan', 'inpsan', 'inqsan', 'inrsan', 'inssan', 'intsan', 'inusan', 'invsan', 'inwsan', 'inxsan', 'inysan', 'inzsan', 'insaan', 'insban', 'inscan', 'insdan', 'insean', 'insfan', 'insgan', 'inshan', 'insian'

In [84]:

def edit_one_letter(word, allow_switches = True):
    """
    Input:
        word: the string/word for which we will generate all possible words that are one edit away.
    Output:
        edit_one_set: a set (not a list) of words that are one edit away from word.
    """
    
    edit_one_set = set()
    one_list = insert_letter(word) + delete_letter(word) + replace_letter(word)
    if allow_switches:
        edit_one_set = set(switch_letter(word) + one_list)
    else:
        edit_one_set = set(one_list)

    return edit_one_set

In [85]:
tmp_word = "insan"
tmp_edit_one_set = edit_one_letter(tmp_word)
# turn this into a list to sort it, in order to view it
tmp_edit_one_l = sorted(list(tmp_edit_one_set))

print(f"input word {tmp_word} \nedit_one_l \n{tmp_edit_one_l}\n")
print(f"The type of the returned object should be a set {type(tmp_edit_one_set)}")
print(f"Number of outputs from edit_one_letter('at') is {len(edit_one_letter('at'))}")

input word insan 
edit_one_l 
['ainsan', 'ansan', 'binsan', 'bnsan', 'cinsan', 'cnsan', 'dinsan', 'dnsan', 'einsan', 'ensan', 'finsan', 'fnsan', 'ginsan', 'gnsan', 'hinsan', 'hnsan', 'iansan', 'iasan', 'ibnsan', 'ibsan', 'icnsan', 'icsan', 'idnsan', 'idsan', 'iensan', 'iesan', 'ifnsan', 'ifsan', 'ignsan', 'igsan', 'ihnsan', 'ihsan', 'iinsan', 'iisan', 'ijnsan', 'ijsan', 'iknsan', 'iksan', 'ilnsan', 'ilsan', 'imnsan', 'imsan', 'inaan', 'inan', 'inasan', 'inasn', 'inban', 'inbsan', 'incan', 'incsan', 'indan', 'indsan', 'inean', 'inesan', 'infan', 'infsan', 'ingan', 'ingsan', 'inhan', 'inhsan', 'inian', 'inisan', 'injan', 'injsan', 'inkan', 'inksan', 'inlan', 'inlsan', 'inman', 'inmsan', 'innan', 'innsan', 'inoan', 'inosan', 'inpan', 'inpsan', 'inqan', 'inqsan', 'inran', 'inrsan', 'insa', 'insaa', 'insaan', 'insab', 'insabn', 'insac', 'insacn', 'insad', 'insadn', 'insae', 'insaen', 'insaf', 'insafn', 'insag', 'insagn', 'insah', 'insahn', 'insai', 'insain', 'insaj', 'insajn', 'insak', 'ins

In [86]:

def edit_two_letters(word, allow_switches = True):
    '''
    Input:
        word: the input string/word 
    Output:
        edit_two_set: a set of strings with all possible two edits
    '''    
    edit_two_set = set()

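    # note: this first pass always uses the default allow_switches=True;
    # only the second pass honours the allow_switches argument
    for new_word in edit_one_letter(word):
        edit_two_set |= edit_one_letter(new_word, allow_switches=allow_switches)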
   
    return edit_two_set

In [87]:
tmp_edit_two_set = edit_two_letters("insan")
tmp_edit_two_l = sorted(list(tmp_edit_two_set))
print(f"Number of strings with edit distance of two: {len(tmp_edit_two_l)}")
print(f"First 10 strings {tmp_edit_two_l[:10]}")
print(f"Last 10 strings {tmp_edit_two_l[-10:]}")
print(f"The data type of the returned object should be a set {type(tmp_edit_two_set)}")
print(f"Number of strings that are 2 edit distances from 'at' is {len(edit_two_letters('at'))}")

Number of strings with edit distance of two: 42709
First 10 strings ['aainsan', 'aansan', 'aasan', 'abinsan', 'abnsan', 'absan', 'acinsan', 'acnsan', 'acsan', 'adinsan']
Last 10 strings ['şynsan', 'şysan', 'şznsan', 'şzsan', 'şçsan', 'şösan', 'şüsan', 'şğsan', 'şısan', 'şşsan']
The data type of the returned object should be a set <class 'set'>
Number of strings that are 2 edit distances from 'at' is 8204


In [88]:
def get_corrections(word, probs, vocab, n=2, verbose = False):
    '''
    Input: 
        word: a user entered string to check for suggestions
        probs: a dictionary that maps each word to its probability in the corpus
        vocab: a set containing all the vocabulary
        n: number of possible word corrections you want returned in the dictionary
    Output: 
        n_best: a list of tuples with the most probable n corrected words and their probabilities.
    '''
    
    suggestions = []
    n_best = []

    if word not in vocab:
        # prefer in-vocabulary candidates one edit away; fall back to two edits
        edit_one_vocab = edit_one_letter(word).intersection(vocab)
        suggestions = list(edit_one_vocab) or list(edit_two_letters(word).intersection(vocab))
        prob_suggestions = [probs[i] for i in suggestions]
        # indices of the n most probable suggestions, most probable first
        n_best_inx = np.argsort(prob_suggestions)[-n:][::-1]
        n_best = [(suggestions[i],prob_suggestions[i]) for i in n_best_inx]
    else:
        # the word is already in the vocabulary, so it is returned unchanged
        n_best = [(word,1)]

    if verbose: print("entered word = ", word, "\nsuggestions = ", suggestions)

    return n_best

In [89]:
# Testing implementation - feel free to try other words in my_word
my_word = 'insson' 
tmp_corrections = get_corrections(my_word, probs, vocab, 2, verbose=True) # keep verbose=True
for i, word_prob in enumerate(tmp_corrections):
    print(f"word {i}: {word_prob[0]}, probability {word_prob[1]:.6f}")

print(f"data type of corrections {type(tmp_corrections)}")

entered word =  insson 
suggestions =  ['insan']
word 0: insan, probability 0.003040
data type of corrections <class 'list'>
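
The ranking step only needs the `n` most probable candidates. As an alternative sketch without NumPy, `heapq.nlargest` from the standard library can select them directly. The candidate words and probabilities below are invented for illustration, and `top_n` is a hypothetical helper.

```python
import heapq

toy_probs = {'insan': 0.5, 'inan': 0.3, 'insani': 0.2}   # made-up probabilities
candidates = {'insan', 'inan', 'insani'}

def top_n(candidates, probs, n=2):
    """Return the n most probable candidates as (word, probability) tuples."""
    pairs = ((w, probs[w]) for w in candidates)
    return heapq.nlargest(n, pairs, key=lambda wp: wp[1])

print(top_n(candidates, toy_probs, n=2))
```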


# Part 3: Minimum Edit Distance


In [90]:

def min_edit_distance(source, target, ins_cost = 1, del_cost = 1, rep_cost = 2):
    '''
    Input: 
        source: a string corresponding to the string you are starting with
        target: a string corresponding to the string you want to end with
        ins_cost: an integer setting the insert cost
        del_cost: an integer setting the delete cost
        rep_cost: an integer setting the replace cost
    Output:
        D: a matrix of len(source)+1 by len(target)+1 containing minimum edit distances
        med: the minimum edit distance (med) required to convert the source string to the target
    '''
   
    m = len(source) 
    n = len(target) 
    
    D = np.zeros((m+1, n+1), dtype=int)    

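    # first column: delete every character of the source prefix
    for row in range(1,m+1): 
        D[row,0] = D[row-1,0] + del_cost
        
    # first row: insert every character of the target prefix
    for col in range(1,n+1): 
        D[0,col] = D[0,col-1] + ins_cost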
        
    for row in range(1,m+1): 
        
        for col in range(1,n+1):
            
            r_cost = rep_cost          
                
            if source[row-1] == target[col-1]:
               
                r_cost = 0                
            
            D[row,col] = min(D[row-1,col]+del_cost,D[row,col-1]+ins_cost, D[row-1,col-1]+r_cost)
    
    med = D[m,n]
    
    return D, med
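
The table filled in by `min_edit_distance` follows the standard dynamic-programming recurrence for edit distance. Written out for a source of length $m$ and a target of length $n$, with characters indexed from 1:

$$
D[0,0] = 0, \qquad D[i,0] = i \cdot \text{del\_cost}, \qquad D[0,j] = j \cdot \text{ins\_cost}
$$

$$
D[i,j] = \min
\begin{cases}
D[i-1,j] + \text{del\_cost} \\
D[i,j-1] + \text{ins\_cost} \\
D[i-1,j-1] + r(i,j)
\end{cases}
\qquad
r(i,j) =
\begin{cases}
0 & \text{if } \text{source}_i = \text{target}_j \\
\text{rep\_cost} & \text{otherwise}
\end{cases}
$$

The minimum edit distance is $D[m,n]$. With the default insert and delete costs of 1, the first row and column are simply $0, 1, 2, \dots$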

In [91]:
source =  'insson'
target = 'insan'
matrix, min_edits = min_edit_distance(source, target)
print("minimum edits: ",min_edits, "\n")
idx = list('#' + source)
cols = list('#' + target)
df = pd.DataFrame(matrix, index=idx, columns= cols)
print(df)

minimum edits:  3 

   #  i  n  s  a  n
#  0  1  2  3  4  5
i  1  0  1  2  3  4
n  2  1  0  1  2  3
s  3  2  1  0  1  2
s  4  3  2  1  2  3
o  5  4  3  2  3  4
n  6  5  4  3  4  3
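
The distance table can also be used to recover one optimal sequence of edits by walking back from the bottom-right cell to the top-left. The sketch below is one way to do that. `edit_path` is a hypothetical helper that rebuilds the same table and then backtracks. When several moves tie, it prefers the diagonal, so it returns one optimal path among possibly many.

```python
import numpy as np

def edit_path(source, target, ins_cost=1, del_cost=1, rep_cost=2):
    """Return (distance, ops); ops is one optimal list of (operation, from, to)."""
    m, n = len(source), len(target)
    D = np.zeros((m + 1, n + 1), dtype=int)
    D[:, 0] = np.arange(m + 1) * del_cost   # delete every source character
    D[0, :] = np.arange(n + 1) * ins_cost   # insert every target character
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            r = 0 if source[i - 1] == target[j - 1] else rep_cost
            D[i, j] = min(D[i - 1, j] + del_cost,
                          D[i, j - 1] + ins_cost,
                          D[i - 1, j - 1] + r)

    # backtrack from D[m, n] to D[0, 0], recording the move that explains each cell
    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            r = 0 if source[i - 1] == target[j - 1] else rep_cost
            if D[i, j] == D[i - 1, j - 1] + r:
                ops.append(('keep' if r == 0 else 'replace', source[i - 1], target[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and D[i, j] == D[i - 1, j] + del_cost:
            ops.append(('delete', source[i - 1], ''))
            i -= 1
        else:
            ops.append(('insert', '', target[j - 1]))
            j -= 1
    return int(D[m, n]), ops[::-1]

distance, ops = edit_path('inson', 'insan', 1, 1, 1)
print(distance)
print(ops)
```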


With all costs set to 1, every in-vocabulary candidate that is one edit away has a minimum edit distance below 2.

In [99]:
source = "inson"
targets = edit_one_letter(source,allow_switches = False)  #disable switches since min_edit_distance does not include them
for t in targets:
    _, min_edits = min_edit_distance(source, t,1,1,1)  # set ins, del, sub costs all to one
    if min_edits <2 and t in vocab: print(source, t, min_edits)

inson insan 1


Among the in-vocabulary candidates generated by two edits, the minimum edit distance stays below 4.

In [97]:
source = "inson"
targets = edit_two_letters(source,allow_switches = False) #disable switches since min_edit_distance does not include them
for t in targets:
    _, min_edits = min_edit_distance(source, t,1,1,1)  # set ins, del, sub costs all to one
    if min_edits <4 and t in vocab: print(source, t, min_edits)

inson insek 2
inson insan 1
inson insana 2
inson son 2
inson insani 2
inson inin 2
inson anton 2
inson nsan 2
inson nisan 3
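
`nisan` is listed with distance 3 even though switches were meant to be disabled. A likely explanation is that `edit_two_letters` passes `allow_switches` only to its second call of `edit_one_letter`, so the first pass can still swap `i` and `n`. The sketch below uses compact, hypothetical helpers (`edits1` and `edits2`, with an ASCII-only alphabet as a simplifying assumption) to show the effect of forwarding the flag to both passes.

```python
LETTERS = 'abcdefghijklmnopqrstuvwxyz'   # assumption: ASCII alphabet only

def edits1(word, allow_switches=True):
    """All strings one edit away from word (hypothetical compact helper)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    replaces = [left + c + right[1:] for left, right in splits if right
                for c in LETTERS if c != right[0]]
    switches = []
    if allow_switches:
        switches = [left + right[1] + right[0] + right[2:]
                    for left, right in splits if len(right) > 1]
    return set(deletes + inserts + replaces + switches)

def edits2(word, allow_switches=True):
    """All strings two edits away; the flag is forwarded to both passes."""
    result = set()
    for first in edits1(word, allow_switches=allow_switches):
        result |= edits1(first, allow_switches=allow_switches)
    return result

print('nisan' in edits2('inson'))                         # switches allowed
print('nisan' in edits2('inson', allow_switches=False))   # switches disabled in both passes
```

With the flag forwarded, every candidate produced with `allow_switches=False` is reachable by at most two inserts, deletes, or replacements.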
