<a href="https://colab.research.google.com/github/AyushiKashyapp/NLP/blob/main/AutoCorrectionModel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Autocorrection Model

This autocorrect model is based on the one found in smartphone keyboards. When a typed word exists in the smartphone's vocabulary, the keyboard assumes that the word is correct.

If the word doesn't exist in the vocabulary, the autocorrect looks for the most similar words in the smartphone's word history.

For this task, we use the ***textdistance*** library, which computes the similarity or distance between texts.

In [1]:
!pip install textdistance

Collecting textdistance
  Downloading textdistance-4.6.2-py3-none-any.whl (31 kB)
Installing collected packages: textdistance
Successfully installed textdistance-4.6.2


In [6]:
import pandas as pd
import numpy as np
import textdistance
import re
from collections import Counter

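# Read the corpus, lowercase it, and split it into word tokens.
with open('/autocorrect.txt', 'r') as f:
  file_name_data = f.read()
  file_name_data = file_name_data.lower()
  # Raw string so that \w reaches the regex engine instead of being read as a string escape.
  words = re.findall(r'\w+', file_name_data)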

V = set(words)
print(f"The first ten words in the text are: \n{words[0:10]}")
print(f"There are {len(V)} unique words in the vocabulary.")

The first ten words in the text are: 
['the', 'project', 'gutenberg', 'ebook', 'of', 'moby', 'dick', 'or', 'the', 'whale']
There are 17647 unique words in the vocabulary.


Building the word frequencies from the list above using the **Counter** class.

In [7]:
# Counter maps each word to the number of times it occurs in the corpus.
word_freq_dict = Counter(words)
print(word_freq_dict.most_common(10))

[('the', 14703), ('of', 6742), ('and', 6517), ('a', 4799), ('to', 4707), ('in', 4238), ('that', 3081), ('it', 2534), ('his', 2530), ('i', 2120)]


**Relative frequency of words**

The probability of occurrence of each word is its relative frequency: the word's count divided by the total number of words in the text.



In [8]:
probs = {}
total = sum(word_freq_dict.values())
for k in word_freq_dict.keys():
  probs[k] = word_freq_dict[k]/total
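
As a quick sanity check of the relative-frequency idea, here is a minimal sketch on a made-up four-word list. The names `toy_words` and `toy_probs` are hypothetical stand-ins and not part of the notebook's corpus. Each probability is the word's count divided by the total count, so the values should sum to 1.

```python
from collections import Counter

# Hypothetical toy corpus, used only to illustrate the calculation.
toy_words = ["the", "whale", "the", "sea"]

toy_counts = Counter(toy_words)
toy_total = sum(toy_counts.values())

# Relative frequency: count of the word divided by the total number of words.
toy_probs = {word: count / toy_total for word, count in toy_counts.items()}

print(toy_probs)  # {'the': 0.5, 'whale': 0.25, 'sea': 0.25}
```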

**Finding similar words**

We score every vocabulary word by its Jaccard similarity to the input word, computed over the character 2-grams (Q = 2) of the two words. We then return the 5 most similar words, ordered by similarity and then by probability.

The Jaccard similarity measures how much two sets overlap, that is, which members are shared and which are distinct. It is calculated by dividing the number of elements found in both sets (the intersection) by the number of elements found in either set (the union).
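
To make the definition concrete, the following sketch computes the Jaccard similarity over character bigrams in plain Python. The helper names `bigrams` and `jaccard_similarity` are ours, and counting repeated bigrams as a multiset is an assumption about how the metric is applied, not a copy of the textdistance implementation.

```python
from collections import Counter

def bigrams(word):
    """Return the multiset of overlapping character 2-grams in a word."""
    return Counter(word[i:i + 2] for i in range(len(word) - 1))

def jaccard_similarity(a, b):
    """Size of the bigram intersection divided by the size of the bigram union."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    shared = sum((grams_a & grams_b).values())  # bigrams present in both words
    either = sum((grams_a | grams_b).values())  # bigrams present in at least one word
    # Words shorter than two characters have no bigrams; returning 0.0 is our choice here.
    return shared / either if either else 0.0

# 'nevertheless' has 11 bigrams and 'neverteless' has 10; they share 9, so 9 / 12.
print(jaccard_similarity("nevertheless", "neverteless"))  # 0.75
```

A word that shares only a short prefix scores much lower: `never` contributes 4 bigrams, all of which appear in `neverteless`, giving 4 / 10 = 0.4.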

In [9]:
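def my_autocorrect(input_word):
  input_word = input_word.lower()

  # A word that is already in the vocabulary is assumed to be correct.
  if input_word in V:
    return 'Your word seems to be correct.'

  # Jaccard similarity over character 2-grams (qval=2) against every vocabulary word.
  jaccard = textdistance.Jaccard(qval=2)
  similarities = [1 - jaccard.distance(v, input_word) for v in word_freq_dict.keys()]
  # probs was built in the same key order as word_freq_dict, so the rows line up.
  df = pd.DataFrame.from_dict(probs, orient='index').reset_index()
  df = df.rename(columns={'index': 'Word', 0: 'Prob'})
  df['Similarity'] = similarities
  # Return the 5 best candidates, ranked by similarity and then by probability.
  return df.sort_values(['Similarity', 'Prob'], ascending=False).head(5)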
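
The ranking step can also be sketched without pandas or textdistance, which may help when checking the ordering by hand. Everything below is illustrative: `suggest` and `toy_probs` are hypothetical names, the four-word vocabulary and its probabilities are toy values, and the bigram Jaccard helper is a plain-Python approximation rather than the library's code.

```python
from collections import Counter

def jaccard_similarity(a, b):
    """Bigram Jaccard similarity: shared bigrams divided by all bigrams."""
    grams_a = Counter(a[i:i + 2] for i in range(len(a) - 1))
    grams_b = Counter(b[i:i + 2] for i in range(len(b) - 1))
    either = sum((grams_a | grams_b).values())
    return sum((grams_a & grams_b).values()) / either if either else 0.0

def suggest(input_word, probs, top_n=5):
    """Rank vocabulary words by similarity, breaking ties by probability."""
    scored = [(word, jaccard_similarity(word, input_word), prob)
              for word, prob in probs.items()]
    # Sort on the (similarity, probability) pair, highest first.
    scored.sort(key=lambda row: (row[1], row[2]), reverse=True)
    return scored[:top_n]

# Hypothetical toy vocabulary with illustrative probabilities.
toy_probs = {
    "level": 0.000108,
    "never": 0.000925,
    "boneless": 0.000013,
    "nevertheless": 0.000225,
}

ranked = suggest("neverteless", toy_probs, top_n=4)
print([word for word, _, _ in ranked])
# ['nevertheless', 'boneless', 'never', 'level']
```

Here `never` and `level` tie on similarity (0.4 each), so the more frequent word, `never`, is listed first.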

Finding similar words for a misspelled word using the autocorrect function.

In [10]:
my_autocorrect('neverteless')

| | Word | Prob | Similarity |
|---|---|---|---|
| 2571 | nevertheless | 0.000225 | 0.75 |
| 13657 | boneless | 1.3e-05 | 0.416667 |
| 12684 | elevates | 4e-06 | 0.416667 |
| 1105 | never | 0.000925 | 0.4 |
| 7136 | level | 0.000108 | 0.4 |
