**Build an auto-corrector based on a (small) vocabulary of "known words" extracted from a text repository of your choice that takes as input a sentence (with possibly misspelled words) and replaces each of the words with the one from the vocabulary that minimizes the edit distance.
Please cite your sources, show your code, and include some input-output examples.
Discuss what other techniques (that we have discussed in previous sessions or that you can think of otherwise) could be used to make this simple auto-correct perform better in terms of inferring the intended meaning of the word or to take into account similarities in pronunciation between differently-spelled words.**

https://www.kaggle.com/code/bouweceunen/levenshtein-distance-spelling-correction-nlp/notebook

https://www.analyticsvidhya.com/blog/2021/06/how-to-load-kaggle-datasets-directly-into-google-colab/

In [1]:
# Install the Kaggle library
! pip install kaggle

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
# Make a directory named “.kaggle”
! mkdir ~/.kaggle

In [3]:
# Copy the “kaggle.json” into this new directory
! cp kaggle.json ~/.kaggle/

cp: cannot stat 'kaggle.json': No such file or directory


In [4]:
# Allocate the required permission for this file.
! chmod 600 ~/.kaggle/kaggle.json

chmod: cannot access '/root/.kaggle/kaggle.json': No such file or directory


In [None]:
# Downloading Datasets: https://www.kaggle.com/datasets/bouweceunen/smart-home-commands-dataset
! kaggle datasets download bouweceunen/smart-home-commands-dataset

In [6]:
# Upload the file in Google Colab
from google.colab import files
uploaded = files.upload()

Saving smart-home-commands-dataset.zip to smart-home-commands-dataset.zip


In [7]:
import os
from nltk import word_tokenize
import itertools
import pandas as pd
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [8]:
df = pd.read_csv('smart-home-commands-dataset.zip')
df.head()

Unnamed: 0,Number,Category,Action_needed,Question,Subcategory,Action,Time,Sentence
0,1,lights,1,0,kitchen,on,today,Illuminate the kitchen today.
1,2,lights,1,0,kitchen,on,tomorrow,Illuminate the kitchen tomorrow.
2,3,lights,1,0,kitchen,on,hour,Turn on the light in the kitchen in 10 hours.
3,4,lights,1,0,kitchen,on,day,Turn on the light in the kitchen in 1 day.
4,5,lights,1,0,diningroom,on,today,Illuminate the dining room today.


In [9]:
df_sent = df[['Sentence']]
df_sent.head(10)

Unnamed: 0,Sentence
0,Illuminate the kitchen today.
1,Illuminate the kitchen tomorrow.
2,Turn on the light in the kitchen in 10 hours.
3,Turn on the light in the kitchen in 1 day.
4,Illuminate the dining room today.
5,Don't illuminate the dining room today.
6,Illuminate the dining room tomorrow.
7,Turn on the light in the dinin room in 10 hours.
8,Turn on the light in the dining room in 1 day.
9,Turn on the light in the bathroom.


In [10]:
# Tokenize each sentence
sent = [word_tokenize(sentence['Sentence']) for index, sentence in df_sent.iterrows()]
sent[0]

['Illuminate', 'the', 'kitchen', 'today', '.']

In [11]:
# Merge the words of all sentences together into one flat list
merge_sent = list(itertools.chain.from_iterable(sent))
print(merge_sent)

['Illuminate', 'the', 'kitchen', 'today', '.', 'Illuminate', 'the', 'kitchen', 'tomorrow', '.', 'Turn', 'on', 'the', 'light', 'in', 'the', 'kitchen', 'in', '10', 'hours', '.', 'Turn', 'on', 'the', 'light', 'in', 'the', 'kitchen', 'in', '1', 'day', '.', 'Illuminate', 'the', 'dining', 'room', 'today', '.', 'Do', "n't", 'illuminate', 'the', 'dining', 'room', 'today', '.', 'Illuminate', 'the', 'dining', 'room', 'tomorrow', '.', 'Turn', 'on', 'the', 'light', 'in', 'the', 'dinin', 'room', 'in', '10', 'hours', '.', 'Turn', 'on', 'the', 'light', 'in', 'the', 'dining', 'room', 'in', '1', 'day', '.', 'Turn', 'on', 'the', 'light', 'in', 'the', 'bathroom', '.', 'Do', "n't", 'turn', 'on', 'the', 'light', 'right', 'now', '.', 'I', 'would', 'get', 'mad', 'if', 'you', "'d", 'put', 'on', 'the', 'light', 'right', 'now', '.', 'I', 'would', 'not', 'like', 'it', 'if', 'you', 'would', "n't", 'turn', 'on', 'the', 'light', 'right', 'now', '.', 'Is', 'the', 'light', 'on', 'in', 'the', 'living', 'room', '.', 'W

In [12]:
# Vocabulary of distinct known words
vocabulary = list(set(merge_sent))
print(vocabulary)

['I', 'sad', 'skies', 'more', 'Sun', 'the', 'be', 'Who', 'give', 'Facebook', 'tall', '112', 'and', 'getting', 'motion', 'weed', 'Turn', 'Give', 'sun', 'trench', 'garage', 'bieber', 'you', 'toaster', 'let', 'make', 'departure', 'days', 'Do', 'Make', 'Myplace', 'lower', 'lately', 'far', 'Raise', 'appliance', 'eighteen', 'meaning', 'Antwerp', 'location', 'Flicker', 'random', 'there', 'Home', 'want', 'has', 'earth', '50', 'Bathroom', 'all', 'done', 'falls', 'viewing', 'Open', 'turn', 'coming', 'here', 'shown', 'Can', '!', 'kitchen', 'past', 'music', 'question', 'drug', 'freezing', 'Maps', 'play', 'alive', 'movement', 'great', 'control', '45', 'less', 'Has', 'Antwerpen', 'movies', 'Update', 'Tell', 'station', 'up', 'travel', 'too', '?', 'did', 'not', 'locationtracker', 'six', 'power', '10', 'minute', 'bathroom', 'Leuven', 'refrigerator', 'ride', 'machine', 'dark', 'we', 'off', 'hit', 'pixels', 'bus', 'hail', 'fahrenheit', 'hours', 'blinding', 'living', 'live', 'does', 'down', 'Power', 'ca',

In [13]:
# Levenshtein distance

def editdist(p, q, elimination = 1, insertion = 1, defrep = 1, repcost = None):
    if repcost is None: # avoid sharing one mutable default dict between calls
        repcost = dict()
    d = dict()
    np = len(p) + 1 # length of first string plus one 
    nq = len(q) + 1 # length of second string plus one
    for i in range(np): # first column: delete the first i letters of p
        d[(i, 0)] = i * elimination
    for j in range(nq): # first row: insert the first j letters of q
        d[(0, j)] = j * insertion
    for i in range(1, np):
        for j in range(1, nq):
            lp = p[i - 1] # corresponding letter of the first string
            lq = q[j - 1] # corresponding letter of the second string
            eli = d[(i - 1, j)] + elimination
            ins = d[(i, j - 1)] + insertion
            ree = d[(i - 1, j - 1)] # no cost of replacement unless they differ
            if lp != lq:
                # include cost of that pair or default cost if undefined
                ree += repcost.get((lp, lq), defrep) 
            d[(i, j)] = min(eli, ins, ree) # dynamic programming step: the cheapest option wins
    return d[(np - 1, nq - 1)] # final cost
 
print(editdist("orthography", "ortografy"))

3


In [14]:
def auto_correction(sentence):

  # Tokenize the sentence to correct into words
  wt = word_tokenize(sentence)

  # For each word in the sentence
  for i, word in enumerate(wt):

        # Correct only words that are not in the vocabulary and are not digits
        # (known words and digits are left unchanged)
        if word not in vocabulary and not word.isdigit():

            # Compute the Levenshtein distance to every known word
            levdistances = [editdist(word, known) for known in vocabulary]

            # Replace the word with the known word at minimum distance
            # (on a tie, the first such word in the vocabulary list wins)
            wt[i] = vocabulary[levdistances.index(min(levdistances))]

  return ' '.join(wt)

In [15]:
# Word Illumminate & kitchean & todday corrected
print(auto_correction("Illumminate the kitchean todday."))

Illuminate the kitchen today .


In [16]:
# Word Turne & lihght & inn corrected
print(auto_correction("Turne on the lihght in the kitchen inn 1 day."))

Turn on the light in the kitchen in 1 day .


I used the smart-home-commands dataset from Kaggle. From it I built a vocabulary of known words, which is used to decide which words need correcting. If a word of the input sentence is not in the vocabulary, I compute its Levenshtein distance to every vocabulary word and replace it with the word at minimum distance.
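
As a compact illustration of the lookup described above, the sketch below returns every vocabulary word at minimum distance instead of only the first one, which makes ties between candidates visible. The names `levenshtein` and `closest_words` and the toy vocabulary are assumptions made for this example; they are not part of the notebook.

```python
def levenshtein(a, b):
    """Unit-cost edit distance, keeping only the previous row of the table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # replace, or match at no cost
        prev = curr
    return prev[-1]


def closest_words(word, vocab):
    """Return (minimum distance, sorted list of all words at that distance)."""
    distances = {known: levenshtein(word, known) for known in vocab}
    best = min(distances.values())
    return best, sorted(w for w, d in distances.items() if d == best)


toy_vocab = {"in", "on", "it", "kitchen", "light"}  # hypothetical vocabulary
print(closest_words("kitchean", toy_vocab))  # (1, ['kitchen'])
print(closest_words("im", toy_vocab))        # (1, ['in', 'it'])
```

The second call shows a tie: edit distance alone cannot choose between "in" and "it", so some additional signal is needed to pick one.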

https://fileadmin.cs.lth.se/cs/education/EDA171/Reports/2009/david.pdf
One possible improvement is to use a spelling-correction model that ranks the candidate words according to the context of the sentence (n-grams).
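
A minimal sketch of that idea, assuming the candidates at minimum edit distance are already known: among them, prefer the one that most often follows the previous word in the corpus. The function names and the toy corpus are hypothetical, and raw bigram counts are used here as the simplest possible context score.

```python
from collections import Counter


def bigram_counts(tokens):
    """Count adjacent word pairs in a token list."""
    return Counter(zip(tokens, tokens[1:]))


def pick_by_context(previous_word, candidates, counts):
    """Among equally close candidates, prefer the one seen most often
    after previous_word (the first candidate wins if counts are equal)."""
    return max(candidates, key=lambda cand: counts[(previous_word, cand)])


# Hypothetical toy corpus standing in for the notebook's merge_sent
corpus = "turn on the light in the kitchen . is it on".split()
counts = bigram_counts(corpus)

# "im" is one edit away from both "in" and "it"; the context decides
print(pick_by_context("light", ["in", "it"], counts))  # in
print(pick_by_context("is", ["in", "it"], counts))     # it
```

With a corpus as small as this one, many pairs are never observed, so a real system would probably need smoothing or a larger text collection.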


**Using either some n-gram based or another type of approach (remember to cite any sources you consult) and a text repository of your choice, implement a simple auto-complete that suggests possible options for what the next word could be, given a start of a sentence as input.
Please include code snippets and examples, as usual.
Discuss how the value for n affects the quality you observe (subjective or measured). Would
you actually need a range of values for n instead of a single value for this to work well?**

https://www.nltk.org/howto/collocations.html

In [17]:
print(merge_sent)

['Illuminate', 'the', 'kitchen', 'today', '.', 'Illuminate', 'the', 'kitchen', 'tomorrow', '.', 'Turn', 'on', 'the', 'light', 'in', 'the', 'kitchen', 'in', '10', 'hours', '.', 'Turn', 'on', 'the', 'light', 'in', 'the', 'kitchen', 'in', '1', 'day', '.', 'Illuminate', 'the', 'dining', 'room', 'today', '.', 'Do', "n't", 'illuminate', 'the', 'dining', 'room', 'today', '.', 'Illuminate', 'the', 'dining', 'room', 'tomorrow', '.', 'Turn', 'on', 'the', 'light', 'in', 'the', 'dinin', 'room', 'in', '10', 'hours', '.', 'Turn', 'on', 'the', 'light', 'in', 'the', 'dining', 'room', 'in', '1', 'day', '.', 'Turn', 'on', 'the', 'light', 'in', 'the', 'bathroom', '.', 'Do', "n't", 'turn', 'on', 'the', 'light', 'right', 'now', '.', 'I', 'would', 'get', 'mad', 'if', 'you', "'d", 'put', 'on', 'the', 'light', 'right', 'now', '.', 'I', 'would', 'not', 'like', 'it', 'if', 'you', 'would', "n't", 'turn', 'on', 'the', 'light', 'right', 'now', '.', 'Is', 'the', 'light', 'on', 'in', 'the', 'living', 'room', '.', 'W

In [18]:
from heapq import nlargest
from operator import itemgetter

In [19]:
def auto_complete_tri(sentence, n):

  find = nltk.TrigramCollocationFinder.from_words(merge_sent) 
  pmi = find.score_ngrams(nltk.TrigramAssocMeasures().pmi)

  # Use the last two words typed so far as context (needs at least two tokens)
  beg_sentence = nltk.word_tokenize(sentence)
  w1, w2 = beg_sentence[-2:]

  # Collect the third word of every trigram that starts with (w1, w2)
  top=[]
  for (tri, score) in pmi:
    (first, second, third) = tri

    if first == w1 and second == w2:
      top.append((third, score))

  # Keep the n suggestions with the highest PMI score
  top_max = nlargest(n, top, key=itemgetter(1))

  return (top_max)

In [20]:
auto_complete_tri('Illuminate the', 5)

[('dining', 9.821787027232098),
 ('kitchen', 9.049197523335174),
 ('attic', 8.59939460589565)]

In [21]:
auto_complete_tri('Give me', 5)

[('some', 12.872244452801626),
 ('info', 12.364097549131301),
 ('information', 11.779135048410144),
 ('coffee', 10.609210046967831)]

In [22]:
auto_complete_tri('Open the', 5)

[('Facebook', 11.406749527953256),
 ('window', 10.406749527953256),
 ('relay', 9.406749527953256)]

With the same data, I used the same technique as in the course, but with trigrams. Each sequence of 3 words is stored together with a PMI score (pointwise mutual information, an association measure that compares how often the words occur together in the corpus with how often they would be expected to if they were independent). Given the two selected words, the function returns the third words of the matching trigrams in decreasing order of score.
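
To make the score concrete, here is a small sketch that computes a trigram PMI directly from counts, using the common definition log2(P(w1, w2, w3) / (P(w1) P(w2) P(w3))). The function name `trigram_pmi` and the toy corpus are assumptions for illustration, and probabilities are simply estimated as counts divided by the number of tokens; NLTK's implementation may differ in such details.

```python
import math
from collections import Counter


def trigram_pmi(tokens):
    """Return {trigram: PMI} with probabilities estimated from raw counts."""
    n = len(tokens)
    word_counts = Counter(tokens)
    tri_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return {
        (w1, w2, w3): math.log2(
            c * n ** 2 / (word_counts[w1] * word_counts[w2] * word_counts[w3])
        )
        for (w1, w2, w3), c in tri_counts.items()
    }


# Hypothetical toy corpus
tokens = "open the window open the door close the door".split()
scores = trigram_pmi(tokens)

# Rank the possible continuations of "open the" by decreasing PMI
continuations = sorted(
    ((tri[2], score) for tri, score in scores.items() if tri[:2] == ("open", "the")),
    key=lambda item: item[1],
    reverse=True,
)
print(continuations)
```

In this toy corpus "open the window" and "open the door" each occur once, yet "window" ranks first because it is the rarer word overall. PMI rewards rare words in this way, which is worth keeping in mind when reading the rankings.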

In [23]:
def auto_complete_bi(word, n):

  find = nltk.BigramCollocationFinder.from_words(merge_sent) 
  pmi = find.score_ngrams(nltk.BigramAssocMeasures().pmi)

  # Collect the second word of every bigram that starts with the given word
  top=[]
  for (bi, score) in pmi:
    (first, second) = bi

    if first == word:
      top.append((second, score))

  # Keep the n suggestions with the highest PMI score
  top_max = nlargest(n, top, key=itemgetter(1))

  return (top_max)

In [24]:
auto_complete_bi('the', 15)

[('basement', 3.236824526510947),
 ('library', 3.236824526510947),
 ('attic', 3.2368245265109454),
 ('coffeemachine', 3.2368245265109454),
 ('dining', 3.2368245265109454),
 ('garage', 3.2368245265109454),
 ('nearest', 3.2368245265109454),
 ('tallest', 3.2368245265109454),
 ('toaster', 3.2368245265109454),
 ('toilet', 3.2368245265109454),
 ('trains', 3.2368245265109454),
 ('Facebook', 3.2368245265109437),
 ('Home', 3.2368245265109437),
 ('Malidives', 3.2368245265109437),
 ('backyard', 3.2368245265109437)]

With bigrams, there are more candidate words and the PMI scores are lower than with trigrams. The bigram model is therefore less precise than the trigram model, because it uses less of the sentence context.
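
One way to combine several values of n is a back-off scheme: try the longest context first and fall back to a shorter one when it was never seen. The sketch below is only an illustration of that idea. It ranks by raw counts rather than PMI, and the names `build_model` and `suggest` and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict


def build_model(tokens, max_n=3):
    """Map each context of 1 to max_n - 1 words to a Counter of next words."""
    model = defaultdict(Counter)
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *context, nxt = tokens[i:i + n]
            model[tuple(context)][nxt] += 1
    return model


def suggest(model, words, k=3, max_n=3):
    """Use the longest context that was seen, backing off to shorter ones."""
    for size in range(min(max_n - 1, len(words)), 0, -1):
        context = tuple(words[-size:])
        if context in model:
            return [w for w, _ in model[context].most_common(k)]
    return []


# Hypothetical toy corpus
corpus = "turn on the light . turn off the light . open the window .".split()
model = build_model(corpus)

print(suggest(model, ["turn", "on", "the"]))  # trigram context ("on", "the"): ['light']
print(suggest(model, ["close", "the"]))       # backs off to ("the",): ['light', 'window']
```

The second call shows the benefit: the context "close the" never appears in the corpus, so a pure trigram model would return nothing, while the back-off still offers the bigram suggestions.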

**Modify your auto-complete so that it never suggests names of people or places. Include a code snippet and examples in your response.**

https://www.analyticsvidhya.com/blog/2021/06/nlp-application-named-entity-recognition-ner-in-python-with-spacy/

In [25]:
import spacy
from spacy import displacy

NER = spacy.load("en_core_web_sm")

In [26]:
sent_nt = [sentence['Sentence'] for index, sentence in df_sent.iterrows()]
print(sent_nt)

['Illuminate the kitchen today.', 'Illuminate the kitchen tomorrow.', 'Turn on the light in the kitchen in 10 hours.', 'Turn on the light in the kitchen in 1 day.', 'Illuminate the dining room today.', "Don't illuminate the dining room today.", 'Illuminate the dining room tomorrow.', 'Turn on the light in the dinin room in 10 hours.', 'Turn on the light in the dining room in 1 day.', 'Turn on the light in the bathroom.', "Don't turn on the light right now.", "I would get mad if you'd put on the light right now.", "I would not like it if you wouldn't turn on the light right now.", 'Is the light on in the living room.', 'Would you illuminate the living room in one hour for me please.', "It would be great if you'd turn on the light for me.", 'Light on in kitchen.', 'Light off in living room in a few hours.', 'Turn the light off in the basement.', 'Turn the light off in the dining room.', 'I would like if you made it brighter.', 'I would like if you made it brighter in 3 minutes.', 'Can yo

In [27]:
text = ' '.join(sent_nt)
print(text)

Illuminate the kitchen today. Illuminate the kitchen tomorrow. Turn on the light in the kitchen in 10 hours. Turn on the light in the kitchen in 1 day. Illuminate the dining room today. Don't illuminate the dining room today. Illuminate the dining room tomorrow. Turn on the light in the dinin room in 10 hours. Turn on the light in the dining room in 1 day. Turn on the light in the bathroom. Don't turn on the light right now. I would get mad if you'd put on the light right now. I would not like it if you wouldn't turn on the light right now. Is the light on in the living room. Would you illuminate the living room in one hour for me please. It would be great if you'd turn on the light for me. Light on in kitchen. Light off in living room in a few hours. Turn the light off in the basement. Turn the light off in the dining room. I would like if you made it brighter. I would like if you made it brighter in 3 minutes. Can you make it brighter here in 8 minutes? Can you make it more dark here

In [28]:
text1= NER(text)

In [29]:
name=[]
for word in text1.ents:
    print(word.text,word.label_)
    if word.label_ == 'ORG' or word.label_ == 'PERSON' or word.label_ == 'LOC' or word.label_ == 'GPE':
      name.append(word.text)

today DATE
tomorrow DATE
10 hours TIME
1 day DATE
today DATE
today DATE
tomorrow DATE
10 hours TIME
1 day DATE
one hour TIME
a few hours TIME
3 minutes TIME
8 minutes TIME
a minute TIME
a few minutes TIME
a few minutes TIME
a few minutes TIME
5 minutes TIME
10 minutes TIME
10 minutes TIME
one hour TIME
45 minutes TIME
7 days DATE
tomorrow DATE
tomorrow DATE
one hour TIME
5 minutes TIME
85 minutes TIME
5 minutes TIME
45 minutes TIME
15 minutes TIME
50 minutes TIME
5 minutes TIME
15 hours TIME
50 hours TIME
an hour TIME
5 days DATE
5 weeks DATE
2 minutes TIME
5 minutes TIME
15 hours TIME
50 hours TIME
an hour TIME
5 days DATE
5 weeks DATE
2 minutes TIME
5 minutes TIME
between september and october DATE
yesterday DATE
yesterday DATE
yesterday DATE
yesterday DATE
yesterday DATE
yesterday DATE
yesterday DATE
tonight TIME
tonight TIME
between september and october DATE
today DATE
today DATE
yesterday DATE
yesterday DATE
the past hour TIME
yesterday DATE
today DATE
a few days DATE
yesterday D

In [30]:
names = list(set(name))
print(names)

['Google Maps', 'Justin Bieber', 'Balen', 'Antwerpen', 'Brussels', 'Donald Trump', 'Belgium', 'Antwerp', 'Hasselt', 'Adam Sandler', 'Leuven', 'Greece', 'justin', 'Spain']


In [31]:
names.append('Facebook')
print(names)

['Google Maps', 'Justin Bieber', 'Balen', 'Antwerpen', 'Brussels', 'Donald Trump', 'Belgium', 'Antwerp', 'Hasselt', 'Adam Sandler', 'Leuven', 'Greece', 'justin', 'Spain', 'Facebook']


In [32]:
from nltk.tokenize import word_tokenize

In [33]:
n = ' '.join(names)
names = nltk.word_tokenize(n)
names = list(set(names))
print(names)

['Sandler', 'Donald', 'Brussels', 'Adam', 'Bieber', 'Justin', 'Balen', 'Facebook', 'Antwerp', 'justin', 'Greece', 'Antwerpen', 'Belgium', 'Hasselt', 'Trump', 'Google', 'Leuven', 'Maps', 'Spain']


In [34]:
def auto_complete_tri(sentence, n):

  find = nltk.TrigramCollocationFinder.from_words(merge_sent) 
  pmi = find.score_ngrams(nltk.TrigramAssocMeasures().pmi)

  # Use the last two words typed so far as context (needs at least two tokens)
  beg_sentence = nltk.word_tokenize(sentence)
  w1, w2 = beg_sentence[-2:]

  top=[]
  for (tri, score) in pmi:
    (first, second, third) = tri

    # Skip any third word that is in the list of names of people or places
    if first == w1 and second == w2 and third not in names:
      top.append((third, score))
    
  top_max = nlargest(n, top, key=itemgetter(1))

  # Return only the words; this is an empty tuple when nothing matches
  return tuple(third for (third, score) in top_max)

In [35]:
auto_complete_tri('Open the', 5)

('window', 'relay')

I used spaCy ('en_core_web_sm') to look for names of people and places. It detects some names and places but is not perfect (for example, it missed the word Facebook, which I added manually). Using the same trigram function as in question 2, I added a condition so that it no longer suggests names or places. With the same two-word input, Facebook is now removed from the suggestions.
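
Because the detected list contains both "Justin" and "justin", one possible refinement is to compare suggestions against the name list without regard to case. The sketch below assumes a hypothetical helper `drop_blocked` and a small hand-written subset of the names; it does not depend on spaCy.

```python
def drop_blocked(suggestions, blocked_names):
    """Remove suggestions that match a blocked name, ignoring case."""
    blocked = {name.casefold() for name in blocked_names}
    return [word for word in suggestions if word.casefold() not in blocked]


blocked_names = ["Facebook", "Justin", "Antwerp"]  # hypothetical subset of names

print(drop_blocked(["Facebook", "window", "relay"], blocked_names))  # ['window', 'relay']
print(drop_blocked(["justin", "music"], blocked_names))              # ['music']
```

The trade-off is that case-insensitive matching can also remove ordinary words that happen to coincide with part of a name (for instance "maps" if "Maps" is on the list).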