1. Check if a word exists, using the NLTK Brown corpus (million-word corpus, Brown University, 1961): https://www.nltk.org/book/ch02.html
2. If not, suggest options, using the NLTK words corpus: https://www.nltk.org/howto/corpus.html and Jaccard/edit distance: https://www.nltk.org/api/nltk.metrics.distance.html
3. Select the highest-scoring option (likely) or the top-k options (candidates)
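
As a rough illustration of the three steps above, here is a minimal sketch that uses only the standard library. It swaps in `difflib.get_close_matches` for the NLTK distance metrics, and `toy_words` and `suggest` are hypothetical names that do not appear elsewhere in this notebook.

```python
from difflib import get_close_matches

# Hypothetical toy vocabulary standing in for the Brown/words corpora.
toy_words = ["cat", "vital", "title", "vile", "montreal", "toronto"]
toy_word_set = set(toy_words)

def suggest(word, k=3):
    """Return (likely, candidates); candidates is empty if the word is known."""
    # Step 1: check if the word exists.
    if word in toy_word_set:
        return word, []
    # Step 2: suggest up to k similar words.
    # Note: difflib uses a similarity ratio, not edit or Jaccard distance.
    candidates = get_close_matches(word, toy_words, n=k, cutoff=0.6)
    # Step 3: the highest-scoring candidate is the likely correction.
    likely = candidates[0] if candidates else None
    return likely, candidates

likely, candidates = suggest("vitle")
print(likely, candidates)
```

The function returns both the single best guess and the ranked list, mirroring the likely/candidates split in step 3.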

In [12]:
!pip install -q nltk
!pip install -q pyspellchecker
!pip install -q symspellpy

In [13]:
# drive access
from google.colab import drive
drive.mount('/content/drive')

# data handling and plotting (third-party libraries)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# for checking if a word exists
import nltk
from nltk.corpus import brown
from nltk.corpus import words
from nltk.metrics.distance import edit_distance  # does a terrible job on our data
from nltk.metrics.distance import jaccard_distance
from nltk.util import ngrams

# for dealing with punctuation
import string
import re

# pyspellchecker for spelling suggestions
from spellchecker import SpellChecker

# other spell checkers
from symspellpy import SymSpell, Verbosity
import pkg_resources  # locates the dictionaries bundled with symspellpy


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [14]:
# define brown and words corpus
nltk.download('brown')
nltk.download('words')

word_list = brown.words()
word_set = set(word_list)  # set of unique tokens; membership checks are case-sensitive

correct_words = words.words() #for spelling suggestions

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


# 1. Check if word exists in Brown corpus

In [15]:
# minimal example
print("smol" in word_set)
print("cat" in word_set)

False
True


In [16]:
def check_spelling(sentence):
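  '''
  Splits a sentence on whitespace, strips surrounding punctuation from each
  word, and checks whether the stripped word is in the NLTK Brown corpus.
  The lookup is case-sensitive, and each unique word is checked only once.

  Returns the list of words not found in the corpus. Each word is returned
  as it appears in the sentence, so any attached punctuation is kept.
  '''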
  seen_words = set()
  misspelled = []

  for word in sentence.split():
      stripped_word = word.strip(string.punctuation)

      if stripped_word == '':
          continue

      if stripped_word in seen_words:
          continue

      if stripped_word not in word_set:
          misspelled.append(word)

      seen_words.add(stripped_word)
  return misspelled

## test case

In [17]:
# test case with one of our sentences
check_spelling('He became more vitle part of him.')

['vitle']
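
Note that `check_spelling` appends the raw token rather than the stripped one, so a flagged word at the end of a sentence keeps its trailing period. A minimal sketch of a variant that returns the stripped form instead; `find_unknown_words` and `toy_vocabulary` are hypothetical names, and the toy set stands in for `word_set`.

```python
import string

def find_unknown_words(sentence, vocabulary):
    """Return unique words, stripped of surrounding punctuation, that are not in vocabulary."""
    seen = set()
    unknown = []
    for token in sentence.split():
        stripped = token.strip(string.punctuation)
        # Skip pure-punctuation tokens and words already checked.
        if not stripped or stripped in seen:
            continue
        seen.add(stripped)
        if stripped not in vocabulary:
            unknown.append(stripped)  # stripped form, not the raw token
    return unknown

# Hypothetical toy vocabulary; in the notebook this would be word_set.
toy_vocabulary = {"Your", "face", "was", "the", "personification", "of"}
unknown = find_unknown_words(
    "Your face was the personification of duplicity.", toy_vocabulary
)
print(unknown)  # ['duplicity']
```

Returning the stripped form means a later lookup does not count the period as an extra edit.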

## on 'cleaned_auto_transcription'

In [18]:
#data from simpleGEC model with high thresholds (0.9/0.1)
simple_high = pd.read_csv('/content/drive/MyDrive/266/Data/GEC_Data/0.SimpleGEC/0e_simple_gec_09_01_data.csv')
simple_high.head()

Unnamed: 0,filename,clean_filename,actor,gender,emotion,auto_transcription,label,cleaned_auto_transcription,cleaned_label,base_score,simpleGEC_transcription,simpleGEC_score
0,arctic_a0355.wav,a0355,clb,female,neutral,A BURST OF LAUGHTER WAS HIS REWARD,A burst of laughter was his reward.,A burst of laughter was his reward.,A burst of laughter was his reward.,0.996666,A burst of laughter was his reward.,0.996666
1,neutral_1-28_0023.wav,23,josh,male,neutral,A COMBINATION OF CANADIAN CAPITAL QUICKLY ORGA...,A combination of Canadian capital quickly orga...,"A combination of canadian capital, quickly org...",A combination of Canadian capital quickly orga...,0.031432,"A combination of canadian capital, companies o...",0.901733
2,sleepiness_113-140_0132.wav,132,bea,female,sleepy,A CRY OF JOY BURST FROM PHILIP'S LIPS,A cry of joy burst from Philip's lips.,A cry of joy burst from philip's lips.,A cry of joy burst from Philip's lips.,0.999009,A cry of joy burst from philip's lips.,0.999009
3,arctic_b0301.wav,b0301,slt,female,neutral,A FLYING ARROW PASSED BETWEEN US,A flying arrow passed between us.,A flying arrow passed between us.,A flying arrow passed between us.,0.998554,A flying arrow passed between us.,0.998554
4,amused_421-448_0434.wav,434,sam,male,amused,A HALF CASE OF TOBACCO WAS WORTH THREE POUNDS,A half a case of tobacco was worth three pounds.,A half case of tobacco was worth three pounds.,A half a case of tobacco was worth three pounds.,0.998099,A half case of tobacco was worth three pounds.,0.998099


In [19]:
# add misspelled words to our df
simple_high['misspelled'] = simple_high['cleaned_auto_transcription'].apply(check_spelling)
simple_high.head()

Unnamed: 0,filename,clean_filename,actor,gender,emotion,auto_transcription,label,cleaned_auto_transcription,cleaned_label,base_score,simpleGEC_transcription,simpleGEC_score,misspelled
0,arctic_a0355.wav,a0355,clb,female,neutral,A BURST OF LAUGHTER WAS HIS REWARD,A burst of laughter was his reward.,A burst of laughter was his reward.,A burst of laughter was his reward.,0.996666,A burst of laughter was his reward.,0.996666,[]
1,neutral_1-28_0023.wav,23,josh,male,neutral,A COMBINATION OF CANADIAN CAPITAL QUICKLY ORGA...,A combination of Canadian capital quickly orga...,"A combination of canadian capital, quickly org...",A combination of Canadian capital quickly orga...,0.031432,"A combination of canadian capital, companies o...",0.901733,[canadian]
2,sleepiness_113-140_0132.wav,132,bea,female,sleepy,A CRY OF JOY BURST FROM PHILIP'S LIPS,A cry of joy burst from Philip's lips.,A cry of joy burst from philip's lips.,A cry of joy burst from Philip's lips.,0.999009,A cry of joy burst from philip's lips.,0.999009,[philip's]
3,arctic_b0301.wav,b0301,slt,female,neutral,A FLYING ARROW PASSED BETWEEN US,A flying arrow passed between us.,A flying arrow passed between us.,A flying arrow passed between us.,0.998554,A flying arrow passed between us.,0.998554,[]
4,amused_421-448_0434.wav,434,sam,male,amused,A HALF CASE OF TOBACCO WAS WORTH THREE POUNDS,A half a case of tobacco was worth three pounds.,A half case of tobacco was worth three pounds.,A half a case of tobacco was worth three pounds.,0.998099,A half case of tobacco was worth three pounds.,0.998099,[]
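
The flagged words above (`canadian`, `philip's`) look like correctly spelled proper nouns that were lowercased during cleaning. Assuming the corpus stores them capitalized (not verified here), a case-insensitive check that also drops a possessive `'s` might avoid these false positives. The helper names and the toy word list below are hypothetical.

```python
def build_lowercase_vocabulary(words):
    """Lowercase every corpus word so that lookups ignore capitalization."""
    return {w.lower() for w in words}

def is_known(word, lowercase_vocabulary):
    """Case-insensitive membership check that ignores a trailing possessive 's."""
    word = word.lower()
    if word.endswith("'s"):
        word = word[:-2]
    return word in lowercase_vocabulary

# Hypothetical toy corpus; in the notebook this would be brown.words().
toy_vocabulary = build_lowercase_vocabulary(["Canadian", "Philip", "capital"])

print(is_known("canadian", toy_vocabulary))  # True
print(is_known("philip's", toy_vocabulary))  # True
print(is_known("vitle", toy_vocabulary))     # False
```

The trade-off is that genuine capitalization errors are no longer detected by the lookup.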


# 2. For misspelled words, suggest alternatives

Tried with:
- pyspellchecker
- NLTK edit distance
- NLTK Jaccard distance
- SymSpell

In [20]:
spell = SpellChecker()

In [21]:
def spelling_suggestion(misspelled):
  '''
  Input:
  - misspelled: list of words
  Outputs:
  - likely: list with the most likely suggestion for each word
  - candidates: list with the set of candidate suggestions for each word
  '''
  likely = []
  candidates = []

  for word in misspelled:
    likely.append(spell.correction(word))
    candidates.append(spell.candidates(word))

  return likely, candidates

## test case with pyspellchecker

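- Doesn't capitalize: suggestions come back in lowercase, so proper nouns lose their capital letter
- Not looking great: 'vitle' becomes 'title' rather than 'vital', and 'toranto' becomes 'tomato' rather than 'Toronto'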

In [22]:
# sample misspelled word taken from our data. Expected word: vital
spelling_suggestion(['vitle'])

(['title'], [{'litle', 'title', 'vile', 'ville', 'vitae', 'vite'}])

In [23]:
# sample misspelled word taken from our data. Expected words: Montreal, Toronto
spelling_suggestion(['montreall', 'toranto'])

(['montreal', 'tomato'],
 [{'montreal'},
  {'tanto', 'tomato', 'toretto', 'torino', 'tornado', 'tyrant', 'tyrants'}])
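
Since the suggestions come back in lowercase, one possible workaround is to restore capitalization afterwards. This is a minimal sketch under the assumption that we maintain our own list of proper nouns; `PROPER_NOUNS` and `restore_capitalization` are hypothetical and not part of pyspellchecker.

```python
# Hypothetical list of lowercase proper nouns we expect in the transcriptions.
PROPER_NOUNS = {"montreal", "toronto", "canadian", "philip"}

def restore_capitalization(original, suggestion):
    """Capitalize a suggestion if the original word was capitalized
    or the suggestion is a known proper noun."""
    if original[:1].isupper() or suggestion.lower() in PROPER_NOUNS:
        return suggestion[:1].upper() + suggestion[1:]
    return suggestion

print(restore_capitalization("montreall", "montreal"))  # Montreal
print(restore_capitalization("Vitle", "vital"))         # Vital
print(restore_capitalization("vitle", "title"))         # title
```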

## test case with NLTK

Example taken from: https://www.geeksforgeeks.org/correcting-words-using-nltk-in-python/



### with edit distance

- Not looking great either: none of the three words is corrected to the expected word

In [24]:
# list of incorrect spellings
# that need to be corrected
incorrect_words=['vitle','montreall', 'toranto']

# loop for finding correct spellings
# based on edit distance and
# printing the correct words
for word in incorrect_words:
    temp = [(edit_distance(word, w),w) for w in correct_words if w[0]==word[0]]
    print(sorted(temp, key = lambda val:val[0])[0][1])

vile
moneral
tomato
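
To see why edit distance prefers 'vile' over the expected 'vital', it helps to compute the distances by hand. The sketch below is a plain Levenshtein implementation (unit cost for insertion, deletion, and substitution, no transpositions), written here for illustration rather than taken from NLTK.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance with unit costs."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

print(levenshtein("vitle", "vile"))         # 1
print(levenshtein("vitle", "vital"))        # 2
print(levenshtein("toranto", "toronto"))    # 1
print(levenshtein("toranto", "tomato"))     # 2
print(levenshtein("montreall", "montreal")) # 1
print(levenshtein("montreall", "moneral"))  # 3
```

'vital' is two edits away from 'vitle', while 'vile' is only one, so the metric behaves as designed for that word. For the city names the expected spellings are the closer ones, which suggests (as an assumption, not checked here) that lowercase 'montreal' and 'toronto' are simply missing from the candidate list after the first-letter filter.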


### with jaccard

- Still terrible: again, none of the three suggestions matches the expected word

In [25]:
for word in incorrect_words:
    temp = [(jaccard_distance(set(ngrams(word, 2)),
                              set(ngrams(w, 2))),w)
            for w in correct_words if w[0]==word[0]]
    print(sorted(temp, key = lambda val:val[0])[0][1])

vile
monotremal
toran
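
The Jaccard results can be checked by hand with character bigrams. This is a small standard-library sketch of the same idea (distance = 1 - |A ∩ B| / |A ∪ B| over bigram sets); `bigram_jaccard` is a hypothetical helper, not the NLTK function.

```python
def bigrams(word):
    """Set of adjacent character pairs in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def bigram_jaccard(a, b):
    """Jaccard distance between the bigram sets of two words."""
    set_a, set_b = bigrams(a), bigrams(b)
    union = set_a | set_b
    if not union:  # both words shorter than two characters
        return 0.0
    return 1 - len(set_a & set_b) / len(union)

for candidate in ["vile", "vital"]:
    print("vitle", candidate, round(bigram_jaccard("vitle", candidate), 3))
for candidate in ["toran", "toronto"]:
    print("toranto", candidate, round(bigram_jaccard("toranto", candidate), 3))
```

Short words that share most of their bigrams with the input score well, which is why 'toran' (distance 0.2) would beat 'toronto' (about 0.571) even if both were available as candidates.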


## test case with SymSpell

- Still bad: 'title' ranks first for 'vitle', and 'vital' is not among the closest suggestions

In [27]:
# load dictionary and bigram dictionaries
sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt"
)
bigram_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_bigramdictionary_en_243_342.txt"
)


# term_index is the column of the term and count_index is the column of the term frequency
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
sym_spell.load_bigram_dictionary(bigram_path, term_index=0, count_index=2)


True

In [51]:
# lookup suggestions for single-word input strings
input_term = "vitle"
suggestions = sym_spell.lookup(input_term, Verbosity.CLOSEST, max_edit_distance=2)
print(f'Misspelled: {input_term}')
for suggestion in suggestions:
    print(f'suggestion term, edit distance, and term frequency: {suggestion}')

Misspelled: vitle
suggestion term, edit distance, and term frequency: title, 1, 196676017
suggestion term, edit distance, and term frequency: vitae, 1, 1754634
suggestion term, edit distance, and term frequency: vile, 1, 930726


In [52]:
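def check_spelling(terms):
    '''
    Takes a sequence of misspelled-word lists (one list per sentence) and
    returns the SymSpell suggestions for the first word of each non-empty list.

    Note: this redefines the check_spelling function from section 1.
    '''
    suggestions = []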
    for term in terms:
        if isinstance(term, list):
            if len(term) == 0:
                suggestions.append([])
            else:
                suggestion = sym_spell.lookup(term[0], Verbosity.CLOSEST, max_edit_distance=2)
                suggestions.append(suggestion)
        else:
            suggestions.append([])
    return suggestions

In [53]:
simple_high["suggestions"] = check_spelling(simple_high["misspelled"])

simple_high

Unnamed: 0,filename,clean_filename,actor,gender,emotion,auto_transcription,label,cleaned_auto_transcription,cleaned_label,base_score,simpleGEC_transcription,simpleGEC_score,misspelled,suggestions
0,arctic_a0355.wav,a0355,clb,female,neutral,A BURST OF LAUGHTER WAS HIS REWARD,A burst of laughter was his reward.,A burst of laughter was his reward.,A burst of laughter was his reward.,0.996666,A burst of laughter was his reward.,0.996666,[],[]
1,neutral_1-28_0023.wav,23,josh,male,neutral,A COMBINATION OF CANADIAN CAPITAL QUICKLY ORGA...,A combination of Canadian capital quickly orga...,"A combination of canadian capital, quickly org...",A combination of Canadian capital quickly orga...,0.031432,"A combination of canadian capital, companies o...",0.901733,[canadian],"[canadian, 0, 56492789]"
2,sleepiness_113-140_0132.wav,132,bea,female,sleepy,A CRY OF JOY BURST FROM PHILIP'S LIPS,A cry of joy burst from Philip's lips.,A cry of joy burst from philip's lips.,A cry of joy burst from Philip's lips.,0.999009,A cry of joy burst from philip's lips.,0.999009,[philip's],"[philips, 1, 14023116]"
3,arctic_b0301.wav,b0301,slt,female,neutral,A FLYING ARROW PASSED BETWEEN US,A flying arrow passed between us.,A flying arrow passed between us.,A flying arrow passed between us.,0.998554,A flying arrow passed between us.,0.998554,[],[]
4,amused_421-448_0434.wav,434,sam,male,amused,A HALF CASE OF TOBACCO WAS WORTH THREE POUNDS,A half a case of tobacco was worth three pounds.,A half case of tobacco was worth three pounds.,A half a case of tobacco was worth three pounds.,0.998099,A half case of tobacco was worth three pounds.,0.998099,[],[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
901,arctic_b0226.wav,b0226,clb,female,neutral,YOUR FACE WAS THE PERSONIFICATION OF DUPLICITY,Your face was the personification of duplicity.,Your face was the personification of duplicity.,Your face was the personification of duplicity.,0.994595,Your face was the personification of duplicity.,0.994595,[duplicity.],"[duplicity, 1, 189373]"
902,arctic_b0419.wav,b0419,bdl,male,neutral,YOUR FATHER'S FIFTH COMMAND HE NODDED,"Your father's fifth command, he nodded.",Your father's fifth command. He nodded.,"Your father's fifth command, he nodded.",0.928192,Your father's fifth command. He nodded.,0.928192,[],[]
903,neutral_365-392_0376.wav,376,jenie,female,neutral,YOUR PRICE MY SON IS JUST ABOUT THIRTY PER WEEK,"Your price, my son, is just about thirty per w...",Your price. My son is just about thirty per week.,"Your price, my son, is just about thirty per w...",0.341454,Your response My son is just getting thirty pe...,0.978574,[],[]
904,disgust_197-224_0208.wav,208,bea,female,disgust,YOUTH HAD COME BACK TO HER FREED FROM THE YOKE...,"Youth had come back to her, freed from the yok...","Youth had come back to her, freed from the yok...","Youth had come back to her, freed from the yok...",0.968045,"Youth had come back to her, freed from the yok...",0.968045,[],[]
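
The function above only looks up the first misspelled word of each row and passes it with any attached punctuation. A minimal sketch of a variant that covers every word follows. `suggest_all` and `toy_lookup` are hypothetical names, and `toy_lookup` stands in for a call such as `sym_spell.lookup(word, Verbosity.CLOSEST, max_edit_distance=2)`.

```python
import string

def suggest_all(misspelled_lists, lookup):
    """For each row, map every misspelled word to the suggestions for its stripped form."""
    results = []
    for words_in_row in misspelled_lists:
        row = {}
        for word in words_in_row:
            cleaned = word.strip(string.punctuation)
            row[word] = lookup(cleaned)
        results.append(row)
    return results

# Hypothetical stand-in for the SymSpell lookup, backed by a toy table.
toy_table = {"duplicity": ["duplicity"], "vitle": ["title", "vitae", "vile"]}

def toy_lookup(word):
    return toy_table.get(word, [])

rows = [[], ["duplicity."], ["vitle", "xyzzy"]]
all_suggestions = suggest_all(rows, toy_lookup)
print(all_suggestions)
```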


In [54]:
print("Case 1: inserted space")

sample_right_sentence = "I k new you where trouble when you walked in."
right_input_term = str(sample_right_sentence)
print("original sentence 1:",right_input_term)

# max edit distance per lookup (per single word, not per whole input string)
suggestions_1 = sym_spell.lookup_compound(right_input_term, max_edit_distance=2, transfer_casing=True)

# display suggestion term, edit distance, and term frequency in dictionary
for suggestion in suggestions_1:
    print("suggestion:",suggestion)

print("-"*100)

print("Case 2: omitted space")
sample_right_sentence = "I knewyou where troublewhen you walked in."
right_input_term = str(sample_right_sentence)
print("original sentence 2:",right_input_term)

# max edit distance per lookup (per single word, not per whole input string)
suggestions_2 = sym_spell.lookup_compound(right_input_term, max_edit_distance=2, transfer_casing=True)

# display suggestion term, edit distance, and term frequency (although term frequency always shows 0)
for suggestion in suggestions_2:
    print("suggestion 2:",suggestion)

Case 1: inserted space
original sentence 1: I k new you where trouble when you walked in.
suggestion: I knew you where trouble when you walked in, 2, 0
----------------------------------------------------------------------------------------------------
Case 2: omitted space
original sentence 2: I knewyou where troublewhen you walked in.
suggestion 2: I knew you where trouble when you walked in, 3, 0


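# Conclusions

Word-level suggestions were poor: no approach recovered 'vital' from 'vitle'. SymSpell's `lookup_compound` did fix both the inserted and the omitted space, so maybe we can use SymSpell for segmentation: https://github.com/wolfgarbe/SymSpell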
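
To make the segmentation idea concrete, here is a minimal dictionary-based sketch using dynamic programming. It is not SymSpell's algorithm (which also handles misspellings and word frequencies). It only splits a string into the fewest known words, and `segment` and `toy_vocabulary` are hypothetical.

```python
def segment(text, vocabulary, max_word_length=20):
    """Split text into the fewest vocabulary words, or return None if impossible."""
    n = len(text)
    # best[i] holds the best segmentation of text[:i], or None if there is none.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_length), end):
            piece = text[start:end]
            if best[start] is not None and piece in vocabulary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]

# Hypothetical toy vocabulary.
toy_vocabulary = {"k", "new", "knew", "you", "trouble", "when"}

print(segment("knewyou", toy_vocabulary))      # ['knew', 'you']
print(segment("troublewhen", toy_vocabulary))  # ['trouble', 'when']
print(segment("knewyoux", toy_vocabulary))     # None
```

Preferring the fewest words picks 'knew you' over 'k new you', which matches the repair we want for the omitted-space case.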