**Build an auto-corrector based on a (small) vocabulary of "known words" extracted from a text repository of your choice that takes as input a sentence (with possibly misspelled words) and replaces each of the words with one from the vocabulary that minimizes the edit distance.
Please cite your sources, show your code, and include some input-output examples.
Discuss what other techniques (that we have discussed in previous sessions or that you can think of otherwise) could be used to make this simple auto-correct perform better in terms of inferring the intended meaning of the word or to take into account similarities in pronunciation between differently-spelled words.**

https://www.kaggle.com/code/bouweceunen/levenshtein-distance-spelling-correction-nlp/notebook

https://www.analyticsvidhya.com/blog/2021/06/how-to-load-kaggle-datasets-directly-into-google-colab/

In [None]:
# Install the Kaggle library
! pip install kaggle

In [None]:
# Make a directory named “.kaggle”
! mkdir ~/.kaggle

In [None]:
# Copy the “kaggle.json” into this new directory
! cp kaggle.json ~/.kaggle/

In [None]:
# Allocate the required permission for this file.
! chmod 600 ~/.kaggle/kaggle.json

In [None]:
# Downloading Datasets: https://www.kaggle.com/datasets/bouweceunen/smart-home-commands-dataset
! kaggle datasets download bouweceunen/smart-home-commands-dataset

In [None]:
# Upload the file in Google Colab
from google.colab import files
uploaded = files.upload()

In [None]:
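import pandas as pd

# pandas can read a CSV directly from a zip archive that contains a single file
df = pd.read_csv('smart-home-commands-dataset.zip')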
df.head()

In [None]:
import os
from nltk import word_tokenize
import itertools
import pandas as pd
import nltk
nltk.download('punkt')

In [None]:
df_sent = df[['Sentence']]
df_sent.head(10)

In [None]:
# Tokenize each sentence
sent = [word_tokenize(sentence) for sentence in df_sent['Sentence']]
sent[0]

In [None]:
# Merge the words of all sentences into one flat list
merge_sent = list(itertools.chain.from_iterable(sent))
print(merge_sent)

In [None]:
# Vocabulary of distinct known words
vocabulary = list(set(merge_sent))
print(vocabulary)

In [None]:
# Levenshtein distance

def editdist(p, q, elimination=1, insertion=1, defrep=1, repcost=None):
    repcost = repcost or {} # avoid sharing one mutable default dict across calls
    d = dict()
    n_p = len(p) + 1 # length of the first string plus one
    n_q = len(q) + 1 # length of the second string plus one
    for i in range(n_p): # first column: delete i letters of p to reach the empty string
        d[(i, 0)] = i * elimination
    for j in range(n_q): # first row: insert j letters of q starting from the empty string
        d[(0, j)] = j * insertion
    for i in range(1, n_p):
        for j in range(1, n_q):
            lp = p[i - 1] # corresponding letter of the first string
            lq = q[j - 1] # corresponding letter of the second string
            eli = d[(i - 1, j)] + elimination
            ins = d[(i, j - 1)] + insertion
            ree = d[(i - 1, j - 1)] # no replacement cost unless the letters differ
            if lp != lq:
                # use the cost defined for that pair, or the default cost if undefined
                ree += repcost.get((lp, lq), defrep)
            d[(i, j)] = min(eli, ins, ree) # dynamic programming step: the cheapest option wins
    return d[(n_p - 1, n_q - 1)] # final cost
 
print(editdist("orthography", "ortografy"))
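
The `repcost` argument of `editdist` is never exercised above. Below is a minimal sketch of how a table of cheaper substitutions could nudge the distance toward letters that sound alike. The function `weighted_editdist` and the table `SOUND_ALIKE` are hypothetical names introduced only for this illustration, and the letter pairs and their 0.5 cost are assumptions, not values taken from any source. The distance is restated compactly so that the block runs on its own.

```python
def weighted_editdist(p, q, repcost=None, default=1.0):
    """Edit distance in which some substitutions cost less than others."""
    repcost = repcost or {}
    # prev[j] is the cost of turning the already processed prefix of p into q[:j]
    prev = [float(j) for j in range(len(q) + 1)]
    for i, lp in enumerate(p, start=1):
        curr = [float(i)]  # turning p[:i] into the empty string takes i deletions
        for j, lq in enumerate(q, start=1):
            sub = prev[j - 1]
            if lp != lq:
                sub += repcost.get((lp, lq), default)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, sub))
        prev = curr
    return prev[-1]

# Hypothetical cost table: pairs and costs are assumptions chosen for illustration
SOUND_ALIKE = {("k", "c"): 0.5, ("c", "k"): 0.5, ("s", "z"): 0.5, ("z", "s"): 0.5}

# With unit costs "cat" and "bat" are equally far from "kat";
# with the table, "cat" becomes the closer candidate.
print(weighted_editdist("kat", "cat"), weighted_editdist("kat", "bat"))
print(weighted_editdist("kat", "cat", SOUND_ALIKE), weighted_editdist("kat", "bat", SOUND_ALIKE))
```

Single-letter substitution costs cannot capture multi-letter sounds such as "ph" versus "f", so this only approximates pronunciation similarity.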

In [None]:
def auto_correction(sentence):

    # Tokenize the sentence to correct, word by word
    wt = word_tokenize(sentence)

    # For each word in the sentence
    for i, word in enumerate(wt):

        # Only correct words that are not in the known vocabulary and are not digits
        if word not in vocabulary and not word.isdigit():

            # Compute the Levenshtein distance to every known word
            levdistances = [editdist(word, known) for known in vocabulary]

            # Replace the word with the known word at minimum distance
            wt[i] = vocabulary[levdistances.index(min(levdistances))]

        # Known words and digits are left unchanged (no auto-correction)

    return ' '.join(wt)

In [None]:
# The misspelled words "Illumminate", "kitchean" and "todday" should be corrected
print(auto_correction("Illumminate the kitchean todday."))

In [None]:
# The misspelled words "Turne", "lihght" and "inn" should be corrected
print(auto_correction("Turne on the lihght in the kitchen inn 1 day."))
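
When several known words are at the same minimum distance, `auto_correction` keeps whichever comes first in `vocabulary`, which is arbitrary because the list is built from a set. One possible refinement, sketched here under assumptions, is to break ties by how often each candidate occurs in the corpus. The names `levenshtein`, `correct_word`, `toy_tokens` and `toy_counts` are hypothetical, and the toy corpus merely stands in for the tokens in `merge_sent`.

```python
from collections import Counter

def levenshtein(a, b):
    """Plain unit-cost Levenshtein distance (two-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def correct_word(word, counts):
    """Return the closest known word; ties go to the more frequent word."""
    if word in counts:
        return word
    # Sort key: smallest distance first, then highest corpus frequency
    return min(counts, key=lambda cand: (levenshtein(word, cand), -counts[cand]))

# Toy corpus (an assumption standing in for merge_sent)
toy_tokens = "turn on the light turn off the light in the night".split()
toy_counts = Counter(toy_tokens)

# "ight" is one edit away from both "light" and "night";
# "light" occurs more often in the toy corpus, so it is preferred.
print(correct_word("ight", toy_counts))
print(correct_word("lihght", toy_counts))
```

Frequency is only a rough proxy for the intended word; using the neighbouring words as context would be a further step.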

**Using either some n-gram based or another type of approach (remember to cite any sources you consult) and a text repository of your choice, implement a simple auto-complete that suggests possible options for what the next word could be, given a start of a sentence as input.
Please include code snippets and examples, as usual.
Discuss how the value for n affects the quality you observe (subjective or measured). Would you actually need a range of values for n instead of a single value for this to work well?**

https://www.nltk.org/howto/collocations.html

In [None]:
print(merge_sent)

In [None]:
from heapq import nlargest
from operator import itemgetter

In [None]:
def auto_complete_tri(sentence, n):

  find = nltk.TrigramCollocationFinder.from_words(merge_sent)
  pmi = find.score_ngrams(nltk.TrigramAssocMeasures().pmi)

  # Use the last two words typed so far as the trigram context
  beg_sentence = nltk.word_tokenize(sentence)
  if len(beg_sentence) < 2:
    return []
  w1, w2 = beg_sentence[-2:]

  top = []
  for (first, second, third), score in pmi:
    if first == w1 and second == w2:
      top.append((third, score))

  # Keep the n candidates with the highest PMI score
  return nlargest(n, top, key=itemgetter(1))

In [None]:
auto_complete_tri('Illuminate the', 5)

In [None]:
auto_complete_tri('Give me', 5)

In [None]:
auto_complete_tri('Open the', 5)

In [None]:
def auto_complete_bi(word, n):

  find = nltk.BigramCollocationFinder.from_words(merge_sent)
  pmi = find.score_ngrams(nltk.BigramAssocMeasures().pmi)

  top = []
  for (first, second), score in pmi:
    # Keep the second word of every bigram that starts with the given word
    if first == word:
      top.append((second, score))

  # Keep the n candidates with the highest PMI score
  return nlargest(n, top, key=itemgetter(1))

In [None]:
auto_complete_bi('the', 15)
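
The two functions above each use a single value of n. On the question of whether a range of values is needed, one common idea is back-off: try the longest context first and fall back to a shorter one when that context was never seen. The sketch below uses raw counts rather than the PMI scores used above, and the names `build_ngram_tables`, `suggest` and `toy_tokens` are hypothetical; the toy corpus is an assumption standing in for `merge_sent`.

```python
from collections import Counter, defaultdict

def build_ngram_tables(tokens, max_n=3):
    """Map each context tuple (length 0 to max_n - 1) to a Counter of next words."""
    tables = defaultdict(Counter)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *context, nxt = tokens[i:i + n]
            tables[tuple(context)][nxt] += 1
    return tables

def suggest(prefix_tokens, tables, k=3, max_n=3):
    """Use the longest seen context; back off to shorter ones otherwise."""
    for size in range(min(max_n - 1, len(prefix_tokens)), -1, -1):
        # Slice from the end; size 0 gives the empty context (unigram counts)
        context = tuple(prefix_tokens[len(prefix_tokens) - size:])
        if context in tables:
            return [word for word, _ in tables[context].most_common(k)]
    return []

# Toy corpus (an assumption standing in for merge_sent)
toy_tokens = "open the door open the window close the door turn on the light".split()
tables = build_ngram_tables(toy_tokens)

print(suggest(["open", "the"], tables))  # trigram context was seen
print(suggest(["lock", "the"], tables))  # unseen trigram context, backs off to "the"
```

The toy corpus is one flat token list, so n-grams can span sentence boundaries, just as they do when the finders above are built from `merge_sent`.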

**Modify your auto-complete so that it never suggests names of people or places. Include a code snippet and examples in your response.**

https://www.analyticsvidhya.com/blog/2021/06/nlp-application-named-entity-recognition-ner-in-python-with-spacy/

In [None]:
import spacy
from spacy import displacy

NER = spacy.load("en_core_web_sm")

In [None]:
sent_nt = df_sent['Sentence'].tolist()
print(sent_nt)

In [None]:
text = ' '.join(sent_nt)
print(text)

In [None]:
text1= NER(text)

In [None]:
name=[]
for word in text1.ents:
    print(word.text,word.label_)
    if word.label_ in ('ORG', 'PERSON', 'LOC', 'GPE'):
      name.append(word.text)

In [None]:
names = list(set(name))
print(names)

In [None]:
names.append('Facebook')
print(names)

In [None]:
from nltk.tokenize import word_tokenize

In [None]:
n = ' '.join(names)
names = nltk.word_tokenize(n)
names = list(set(names))
print(names)

In [None]:
def auto_complete_tri(sentence, n):

  find = nltk.TrigramCollocationFinder.from_words(merge_sent)
  pmi = find.score_ngrams(nltk.TrigramAssocMeasures().pmi)

  # Use the last two words typed so far as the trigram context
  beg_sentence = nltk.word_tokenize(sentence)
  if len(beg_sentence) < 2:
    return []
  w1, w2 = beg_sentence[-2:]

  top = []
  for (first, second, third), score in pmi:
    # Skip any candidate that was recognized as a named entity
    if first == w1 and second == w2 and third not in names:
      top.append((third, score))

  top_max = nlargest(n, top, key=itemgetter(1))

  # Return only the suggested words (an empty list if nothing matched)
  return [word for word, score in top_max]

In [None]:
auto_complete_tri('Open the', 5)