# PELIC spelling

This notebook adds further processing to `PELIC_compiled.csv`  in the [`PELIC-dataset`](https://github.com/ELI-Data-Mining-Group/PELIC-dataset) repo by creating a column of tok_POS whose spelling has been automatically corrected.

**Notebook contents:**
- [Building `non_words_df`](#Building-non_words_df)
- [Building `misspell_df`](#Building-misspell_df)
- [Possible segmentation](#Applying-segmentation)
- [Applying spelling correction](#Applying-spelling-correction)
- [Incorporating corrections into `pelic_df`](#Incorporating-corrections-into-pelic_df)

## Building non_words_df
In this section, we build a dataframe, `non_words_df`, which collects all of the non-words from the PELIC dataset (in `PELIC_compiled.csv`). The final dataframe has the following columns:
- `non_word`: tuples with the non-words and their parts of speech
- `sentence`: the complete sentence containing the non-word to provide context
- `answer_id`: the id of the text they come from

In [1]:
# Import necessary modules

import pandas as pd
import pprint
import numpy as np
from ast import literal_eval
import nltk
from tqdm import tqdm
import random

In [2]:
# Read in PELIC_compiled.csv

pelic_df = pd.read_csv("../PELIC-dataset/PELIC_compiled.csv", index_col = 'answer_id', # answer_id is unique
                      dtype = {'level_id':'object','question_id':'object','version':'object','course_id':'object'}, # str not ints
                               converters={'tokens':literal_eval,'tok_lem_POS':literal_eval}) # read in as lists
pelic_df.head()

Unnamed: 0_level_0,anon_id,L1,gender,course_id,level_id,class_id,question_id,version,text_len,text,tokens,tok_lem_POS
answer_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
1,eq0,Arabic,Male,149,4,g,5,1,177,I met my friend Nife while I was studying in a...,"[I, met, my, friend, Nife, while, I, was, stud...","((I, i, PRP), (met, meet, VBD), (my, my, PRP$)..."
2,am8,Thai,Female,149,4,g,5,1,137,"Ten years ago, I met a women on the train betw...","[Ten, years, ago, ,, I, met, a, women, on, the...","((Ten, ten, CD), (years, year, NNS), (ago, ago..."
3,dk5,Turkish,Female,115,4,w,12,1,64,In my country we usually don't use tea bags. F...,"[In, my, country, we, usually, do, n't, use, t...","((In, in, IN), (my, my, PRP$), (country, count..."
4,dk5,Turkish,Female,115,4,w,13,1,6,I organized the instructions by time.,"[I, organized, the, instructions, by, time, .]","((I, i, PRP), (organized, organize, VBD), (the..."
5,ad1,Korean,Female,115,4,w,12,1,59,"First, prepare a port, loose tea, and cup.\nSe...","[First, ,, prepare, a, port, ,, loose, tea, ,,...","((First, first, RB), (,, ,, ,), (prepare, prep..."


The focus here is the `tok_lem_POS` column, but all columns will be kept as the entire df will be written out at the end of the notebook.

In [3]:
# Creating small dataframe to be used for finding non-words

non_words = pelic_df[['text','tok_lem_POS']]

**Note:** For spelling correction, it is necessary to decide what list of words will be used for determining if a word is real or not.

Here, we use the [`SCOWL_condensed.txt`](https://github.com/ELI-Data-Mining-Group/PELIC-spelling/blob/master/SCOWL_condensed.txt) file which is a combination of wordlists available for download at http://wordlist.aspell.net/. We include items from all the dictionaries _except_ the abbreviations dictionary. For a detailed look at the compilation of this dictionary, please see the [SCOWL_wordlist](https://github.com/ELI-Data-Mining-Group/PELIC-spelling/blob/master/SCOWL_wordlist.ipynb) notebook.

In [4]:
#Reading in SCOWL_condensed as a set as a lookup list for spelling (500k words)

scowl = set(open("SCOWL_condensed.txt", "r").read().split('\n'))
print(random.sample(scowl,5))

['uberrima', 'integrates', 'brid', 'separable', 'ftnerr']


The following is a list of words which should be considered words but which were previously being labelled as non-words. These items have been manually added to this list based on output later in this notebook. Most of these items are food items, names, or abbreviations.

In [5]:
scowl_supp = open("scowl_supp.txt", "r").read().split(',')
scowl_supp = [x[2:-1] for x in scowl_supp]
print(len(scowl_supp))
print(scowl_supp)

97
['adha', 'adj', 'ahamed', 'alaikum', 'anonurlpage', 'antiretroviral', 'arpa', 'beyonce', 'bibimbap', 'bio', 'biodiesel', 'bioethanol', 'bulgogi', 'bundang', 'cafe', 'carnaval', 'cds', 'cf', 'co', 'comscore', 'cyber', 'ddukboggi', 'def', 'dr', 'eg', 'eid', 'electrospray', 'entrees', 'erectus', 'etc', 'fiance', 'fiancee', 'fiter', 'fitir', 'fitr', 'fl', 'freediving', 'fukubukuro', 'geolinguist', 'hikikomori', 'hp', 'ibt', 'iq', 'iriver', 'jetta', 'jul', 'kabsa', 'kaled', 'kawader', 'km', 'leisureville', 'll', 'maamool', 'mayumi', 'mcdonalds', 'min', 'mongongo', 'nc', 'neuro', 'nian', 'notting', 'okroshka', 'onsen', 'pajeon', 'pbt', 'pc', 'pcs', 'pp', 'pudim', 'puket', 'samear', 'shui', 'sq', 'st', 'staycation', 'sth', 'taoyuan', 'toefl', 'trans', 'transgene', 'tv', 'unsub', 'va', 'vol', 'vs', 'webaholic', 'webaholics', 'webaholism', 'wenjing', 'woong', 'yaoming', 'ying', 'yingdong', 'yugong', 'yuval', 'zi', '']


In [6]:
# Lower case all toks

non_words.tok_lem_POS = non_words.tok_lem_POS.apply(lambda row: [(x[0].lower(),x[1],x[2]) for x in row])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


In [7]:
# Function to find non-words

def spell_check(tok_lem_POS_list):
    word_list = scowl # Choose word_list here. Default is scowl described above.
    not_in_word_list = []
    for tok_lem_POS in tok_lem_POS_list:
        if tok_lem_POS[0] not in word_list and tok_lem_POS[0] not in scowl_supp:
            not_in_word_list.append(tok_lem_POS)
    return not_in_word_list

In [8]:
# Apply spell check function to find all misspelled-words. 

non_words['misspelled_words'] = non_words.tok_lem_POS.apply(spell_check)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until


In [9]:
non_words.head()

Unnamed: 0_level_0,text,tok_lem_POS,misspelled_words
answer_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,I met my friend Nife while I was studying in a...,"[(i, i, PRP), (met, meet, VBD), (my, my, PRP$)...","[(., ., .), (., ., .), (., ., .), (;, ;, :), (..."
2,"Ten years ago, I met a women on the train betw...","[(ten, ten, CD), (years, year, NNS), (ago, ago...","[(,, ,, ,), (,, ,, ,), (., ., .), (;, ;, :), (..."
3,In my country we usually don't use tea bags. F...,"[(in, in, IN), (my, my, PRP$), (country, count...","[(., ., .), (,, ,, ,), (., ., .), (., ., .), (..."
4,I organized the instructions by time.,"[(i, i, PRP), (organized, organize, VBD), (the...","[(., ., .)]"
5,"First, prepare a port, loose tea, and cup.\nSe...","[(first, first, RB), (,, ,, ,), (prepare, prep...","[(,, ,, ,), (,, ,, ,), (,, ,, ,), (., ., .), (..."


#### Adding context to the dataframe
Seeing the mistakes in the context of a sentence will allow for better manual checking if required.

In [10]:
# Sent-tokenizing the text

non_words['sentence'] = non_words['text'].apply(lambda x: nltk.sent_tokenize(x))

# And delete text column which is no longer needed

del non_words['text']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until


In [11]:
# Checking for hyphenated words tagged as misspellings because SCOWL doesn't contain hypenated words

hyphenated = set([x[0] for x in [x for y in non_words.misspelled_words.to_list() for x in y] if '-' in x[0]])
print(len(hyphenated))
print(list(hyphenated)[:10])

# These need to be removed from the non-words dataframe if composed of valid words

1182
['well-educated', 'one-bedroom', 'record-keeping', 'air-conditioning', 'low-calorie', 'long-distance', 'sixty-year-old', 're-read', 'oscar-worthy', 'white-haired']


In [12]:
# Hyphenated items whose components are not in scowl - possible misspellings or punctuation strings

sorted([y for y in [x.split('-') for x in hyphenated] if y[0] not in scowl or y[1] not in scowl])

[['', "'"],
 ['', '***', '****'],
 ['', '+'],
 ['', '.'],
 ['',
  '.',
  '.',
  '.',
  '.',
  '.',
  '.',
  '.',
  '.',
  '.',
  '.',
  '.',
  '.',
  '.',
  '.',
  '.',
  '.',
  '.',
  ''],
 ["'", ''],
 ['.', ''],
 ['/', ''],
 ['\\\\', ''],
 ['^', '^'],
 ['al', 'qaida'],
 ['austro', 'hungarian'],
 ['cd', 'rom'],
 ['co', 'authored'],
 ['co', 'ed'],
 ['co', 'educational'],
 ['co', 'exist'],
 ['co', 'existence'],
 ['co', 'founded'],
 ['co', 'founder'],
 ['co', 'founders'],
 ['co', 'host'],
 ['co', 'op'],
 ['co', 'operate'],
 ['co', 'operation'],
 ['co', 'pay'],
 ['co', 'pilot'],
 ['co', 'sleeping'],
 ['co', 'star'],
 ['co', 'worker'],
 ['co', 'workers'],
 ['co', 'written'],
 ['co', 'wrote'],
 ['mah', 'jong'],
 ['mid', '80s'],
 ['pay', 'tv'],
 ['roly', 'poly'],
 ['socio', 'cultural'],
 ['socio', 'economic'],
 ['trans', 'fat'],
 ['vis', 'a', 'vis'],
 ['wal', 'mart']]

After manual checking, all the hypenated words are punctuation, real words (or true productive use of affixes) and can be removed from the non-words df.

The following two cells 
1. remove all the hypenated words from the dataframe
2. remove all words that don't contain a letter

However, as all hyphenated word are fine, we will instead just eliminate all words that are not purely composed of letters. This will have the effect of removing the following categories from the dataframe:
- punctuation
- hyphenated words (e.g. well-known)
- contractions (e.g. 'll, 've)
- years (e.g. 1950s)
- ordinals (e.g. 1st, 2nd)

In [13]:
# Removing hypenated words

# non_words.misspelled_words = non_words.misspelled_words.apply(lambda row: [x for x in row if x[0] not in hyphenated])

In [14]:
# Removing items that are only numbers or punctuation
# .isalpha() cannot be used without 'any' as this also removes hyphenated words

# non_words.misspelled_words = non_words.misspelled_words.apply(lambda row: [x for x in row if any(y.isalpha() for y in x[0])])

In [15]:
# Checking initial length of non_words list

print(len([x for y in non_words.misspelled_words.to_list() for x in y]))
print(len(set([x for y in non_words.misspelled_words.to_list() for x in y])))

684453
20599


In [16]:
# Removing items that are not purely alpha

non_words.misspelled_words = non_words.misspelled_words.apply(lambda row: [x for x in row if x[0].isalpha()])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


In [17]:
# Checking affect of removal

print(len([x for y in non_words.misspelled_words.to_list() for x in y]))
print(len(set([x for y in non_words.misspelled_words.to_list() for x in y])))

30205
15993


In [18]:
# Removing proper names - NNP, NNPS

non_words.misspelled_words = non_words.misspelled_words.apply(lambda row: [x for x in row if x[2] != 'NNP' and x[1] != 'NNPS'])

In [19]:
# Checking affect of removal

print(len([x for y in non_words.misspelled_words.to_list() for x in y]))
print(len(set([x for y in non_words.misspelled_words.to_list() for x in y])))

16832
11094


In [20]:
# Removing all words with length 1

non_words.misspelled_words = non_words.misspelled_words.apply(lambda row: [x for x in row if len(x[0]) > 1])

In [21]:
# Checking affect of removal

print(len([x for y in non_words.misspelled_words.to_list() for x in y]))
print(len(set([x for y in non_words.misspelled_words.to_list() for x in y])))

16826
11090


In [22]:
non_words.head()

Unnamed: 0_level_0,tok_lem_POS,misspelled_words,sentence
answer_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,"[(i, i, PRP), (met, meet, VBD), (my, my, PRP$)...",[],[I met my friend Nife while I was studying in ...
2,"[(ten, ten, CD), (years, year, NNS), (ago, ago...",[],"[Ten years ago, I met a women on the train bet..."
3,"[(in, in, IN), (my, my, PRP$), (country, count...",[],"[In my country we usually don't use tea bags.,..."
4,"[(i, i, PRP), (organized, organize, VBD), (the...",[],[I organized the instructions by time.]
5,"[(first, first, RB), (,, ,, ,), (prepare, prep...",[],"[First, prepare a port, loose tea, and cup., S..."


Create new dataframe so that each misspelling token is a separate row.

In [23]:
# Removing rows with no misspellings

non_words2 = non_words.loc[non_words.misspelled_words.str.len() > 0,:]

In [24]:
# Exploding the lists in misspelled words so that each misspelling gets its own row

non_words2 = non_words2.explode('misspelled_words')

In [25]:
# Keeping the answer_id (which is no longer unique) as a separate column

non_words2 = non_words2.reset_index(drop = False)

In [26]:
# Checking final non_words2 dataframe

non_words2.head()

Unnamed: 0,answer_id,tok_lem_POS,misspelled_words,sentence
0,8,"[(i, i, PRP), (organized, organize, VBD), (the...","(beacause, beacause, NN)","[I organized the instructions by time, beacaus..."
1,11,"[(to, to, TO), (make, make, VB), (tea, tea, NN...","(wallmart, wallmart, NN)","[To make tea, nothing is easier, even if somet..."
2,13,"[(first, first, RB), (,, ,, ,), (you, you, PRP...","(dovn, dovn, NN)","[First, you should take some hot water, you ca..."
3,13,"[(first, first, RB), (,, ,, ,), (you, you, PRP...","(mircowave, mircowave, VBP)","[First, you should take some hot water, you ca..."
4,16,"[(every, every, DT), (paragragh, paragragh, NN...","(paragragh, paragragh, NN)",[Every paragragh's instructions depend on a ma...


In [27]:
# Total number of non-words (tokens)
print(len(non_words2))

# Total number of non-words (types)
print(non_words2.misspelled_words.nunique())

16826
11090


#### Creating a dataframe of misspellings
In the `non-words2` dataframe above, each row is an occurrence of a misspelling (i.e. _tokens_ ). We also want a dataframe where each row is a misspelling _type_ with frequency information attached.

In [28]:
# Gathering the total misspellings

total_misspellings = [x for x in non_words2['misspelled_words']]

In [29]:
# To keep an account of misspelling frequency, create a dictionary with frequencies

misspell_freq_dict = {}
for word in total_misspellings:
    if word not in misspell_freq_dict:
        misspell_freq_dict[word] = 1
    else:
        misspell_freq_dict[word] += 1

In [30]:
print(random.sample(list(misspell_freq_dict),5))

[('eperience', 'eperience', 'NN'), ('goodisms', 'goodisms', 'NN'), ('difrenet', 'difrenet', 'NN'), ('specaly', 'specaly', 'NN'), ('instrcture', 'instrcture', 'NN')]


In [31]:
# Remove duplicates

final_misspellings = sorted(list(set(total_misspellings)))
len(final_misspellings)

11090

In [32]:
# Constructing misspell_df

misspell_df = pd.DataFrame(final_misspellings)
misspell_df.head()

Unnamed: 0,0,1,2
0,aa,aa,VB
1,aabout,aabout,IN
2,aad,aad,JJ
3,aain,aain,VBP
4,aare,aare,IN


In [33]:
# Renaming columns to match other DataFrames in this notebook

misspell_df.rename(columns = {0: 'misspelling',1:'lemma',2:'POS'}, inplace = True)

In [34]:
# Recreating tok_lem_POS column to match dictionary

misspell_df['tok_lem_POS'] = list(zip(misspell_df.misspelling, misspell_df.lemma, misspell_df.POS))
misspell_df.head()

Unnamed: 0,misspelling,lemma,POS,tok_lem_POS
0,aa,aa,VB,"(aa, aa, VB)"
1,aabout,aabout,IN,"(aabout, aabout, IN)"
2,aad,aad,JJ,"(aad, aad, JJ)"
3,aain,aain,VBP,"(aain, aain, VBP)"
4,aare,aare,IN,"(aare, aare, IN)"


In [35]:
# Mapping dictionary to DataFrame

misspell_df['freq'] = misspell_df['tok_lem_POS'].map(misspell_freq_dict)

In [36]:
# Sorting by frequency

misspell_df = misspell_df.sort_values(by=['freq'], ascending=False)

In [37]:
# Resetting index
misspell_df = misspell_df.reset_index(drop = True)
misspell_df.head()

Unnamed: 0,misspelling,lemma,POS,tok_lem_POS,freq
0,alot,alot,NN,"(alot, alot, NN)",127
1,studing,studing,VBG,"(studing, studing, VBG)",74
2,goverment,goverment,NN,"(goverment, goverment, NN)",47
3,grammer,grammer,NN,"(grammer, grammer, NN)",29
4,becuase,becuase,NN,"(becuase, becuase, NN)",28


#### scowl_supp
The following is the basis for the 'scowl_supp' list used earlier. Here, errors with a frequency of 10 or more were manually checked, and if determined to be a real word, were added to the scowl_supp list. There were originally 267 items which met this criteria.

In [38]:
print(len(misspell_df.loc[misspell_df.freq >= 10]))
misspell_df.loc[misspell_df.freq >= 10]

77


Unnamed: 0,misspelling,lemma,POS,tok_lem_POS,freq
0,alot,alot,NN,"(alot, alot, NN)",127
1,studing,studing,VBG,"(studing, studing, VBG)",74
2,goverment,goverment,NN,"(goverment, goverment, NN)",47
3,grammer,grammer,NN,"(grammer, grammer, NN)",29
4,becuase,becuase,NN,"(becuase, becuase, NN)",28
...,...,...,...,...,...
72,contry,contry,NN,"(contry, contry, NN)",10
73,childrens,child,NNS,"(childrens, child, NNS)",10
74,apartement,apartement,NN,"(apartement, apartement, NN)",10
75,fastfood,fastfood,NN,"(fastfood, fastfood, NN)",10


### Possible segmentation

Selected segmenter and spellchecker: SymSpell https://github.com/mammothb/symspellpy

There is a dictionary file which which needs to be installed (saved to repo):
[frequency_dictionary_en_82_765.txt](https://symspellpy.readthedocs.io/en/latest/users/installing.html)

To install symspellpy the first time, use pip in command line: `pip install -U symspellpy`

Prior to spelling correct, we first consider using the segmenter. This is a potentially useful first step as misspellings like 'alot' or 'dogmeat' will be separated into 'a lot' and 'dog meat' rather than corrected to a single word like 'lot'.  

However, when segementing misspellings, the segmenter over performs, segmenting non-words into real words where it was clearly not intended, e.g. _improtant_ into _imp rot ant_ or _befor_ into _be for_. As such, the segmenting will not be automated. 

Instead, one manual segmentation will be carried out: _alot_ -> _a lot_ since _alot_ is the most common misspelling remaining in our dataframe (127 occurrences).

In [39]:
# Setting up symspell

from itertools import islice
import pkg_resources
from symspellpy import SymSpell
from symspellpy import Verbosity
sym_spell = SymSpell()
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt")
sym_spell.load_dictionary(dictionary_path, 0, 1)

# Print out first 5 elements to demonstrate that dictionary is successfully loaded
list(islice(sym_spell.words.items(), 5))

[('the', 23135851162),
 ('of', 13151942776),
 ('and', 12997637966),
 ('to', 12136980858),
 ('a', 9081174698)]

In [40]:
# Testing segmenter with 'alot' and 'dogmeat'

# Set max_dictionary_edit_distance to avoid spelling correction
sym_spell = SymSpell(max_dictionary_edit_distance=0, prefix_length=7)
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

# It is also possible to display frequency with result.distance_sum and edit distance with .log_prob_sum

True

In [41]:
# Creating function for applying the above code

def get_segments(word):
    segments = sym_spell.word_segmentation(word)
    if len(segments.corrected_string.split(' ')) > 1 \
    and segments.corrected_string.split(' ')[0] in scowl and segments.corrected_string.split(' ')[1] in scowl:
        return segments.corrected_string
    else:
        return word

In [42]:
# Testing function

print(get_segments('dogmeat')) # Should be segmented
print(get_segments('fireplace')) # Should not be segmented
print(get_segments('becuase')) # Should not be segmented

dog meat
fireplace
becuase


In [43]:
# Applying the function to create a new column

misspell_df['segments'] =  misspell_df['misspelling'].apply(get_segments)
misspell_df.head(10)

Unnamed: 0,misspelling,lemma,POS,tok_lem_POS,freq,segments
0,alot,alot,NN,"(alot, alot, NN)",127,a lot
1,studing,studing,VBG,"(studing, studing, VBG)",74,stu ding
2,goverment,goverment,NN,"(goverment, goverment, NN)",47,g over men t
3,grammer,grammer,NN,"(grammer, grammer, NN)",29,gramme r
4,becuase,becuase,NN,"(becuase, becuase, NN)",28,becuase
5,beatiful,beatiful,JJ,"(beatiful, beatiful, JJ)",28,beat if ul
6,differents,differents,NNS,"(differents, differents, NNS)",23,different s
7,resturant,resturant,NN,"(resturant, resturant, NN)",23,re stu rant
8,becouse,becouse,IN,"(becouse, becouse, IN)",23,becouse
9,lifes,life,NN,"(lifes, life, NN)",22,life s


In [44]:
# Deleting this new column as segmentation creates false segments of misspelled words

del misspell_df['segments']

### Applying spelling correction

In some ways SymSpell is not ideal as sentence context is not considered, only general frequencies. However, other well-known spellcheckers (hunspell, pyspell, etc.) use the same strategy - frequency based criteria for suggestions, without considering immediate cotext. As such, we have followed this common practice, but it is important to remember that accuracy of corrected tokens will not be 100% and must be taken into consideration.

In [45]:
# Testing spelling suggestions with 'becuase'

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt")
# term_index is the column of the term and count_index is the column of the term frequency
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

input_term = "becuase"
suggestions = sym_spell.lookup(input_term, Verbosity.CLOSEST, max_edit_distance=2, #Edit distance can be adjusted
                               transfer_casing=True, #Optional argument set to ignore case
                              include_unknown=True) #Return same word if unknown
for suggestion in suggestions:
    print(suggestion)  

because, 1, 271323986


In [46]:
# Creating function for applying the above code

def get_suggestions(word):
    if len(word) >= 4:
        suggestions = sym_spell.lookup(word, Verbosity.CLOSEST,max_edit_distance=2, transfer_casing=True)
    else:
        suggestions = sym_spell.lookup(word, Verbosity.CLOSEST,max_edit_distance=1, transfer_casing=True)
    return [str(x).split(',') for x in suggestions]

**Note**: The function has a variable edit distance: words of length 4 or more get edit distance of 2, shorter words get edit distance of 1. These preferences can be adjusted in the function if desired.

In [47]:
# Applying function to create new column

misspell_df['suggestions'] =  misspell_df['misspelling'].apply(get_suggestions)
misspell_df.head()

Unnamed: 0,misspelling,lemma,POS,tok_lem_POS,freq,suggestions
0,alot,alot,NN,"(alot, alot, NN)",127,"[[lot, 1, 106405208], [slot, 1, 21602762],..."
1,studing,studing,VBG,"(studing, studing, VBG)",74,"[[studying, 1, 9763653], [studding, 1, 345..."
2,goverment,goverment,NN,"(goverment, goverment, NN)",47,"[[government, 1, 206582673]]"
3,grammer,grammer,NN,"(grammer, grammer, NN)",29,"[[grammar, 1, 8019137], [grammes, 1, 13549..."
4,becuase,becuase,NN,"(becuase, becuase, NN)",28,"[[because, 1, 271323986]]"


In [48]:
# Checking how many items without suggestions

len(misspell_df.loc[(misspell_df.suggestions.str.len() == 0),:])

736

Items with no suggestions - these will be left in their original form though manual corrections could be applied if desired.

In [49]:
# Create new column with just the most likely correction (based on frequency)

misspell_df['correction'] = [x[0][0] if len(x) != 0 else np.NaN for x in misspell_df['suggestions']]

In [50]:
# If no correction, use original word

misspell_df.correction.fillna(misspell_df.misspelling, inplace=True)

In [51]:
misspell_df.head(1)

Unnamed: 0,misspelling,lemma,POS,tok_lem_POS,freq,suggestions,correction
0,alot,alot,NN,"(alot, alot, NN)",127,"[[lot, 1, 106405208], [slot, 1, 21602762],...",lot


In [52]:
# Create correction_POS column

misspell_df['correction_POS'] = list(zip(misspell_df.correction, misspell_df.POS))
misspell_df.head()

Unnamed: 0,misspelling,lemma,POS,tok_lem_POS,freq,suggestions,correction,correction_POS
0,alot,alot,NN,"(alot, alot, NN)",127,"[[lot, 1, 106405208], [slot, 1, 21602762],...",lot,"(lot, NN)"
1,studing,studing,VBG,"(studing, studing, VBG)",74,"[[studying, 1, 9763653], [studding, 1, 345...",studying,"(studying, VBG)"
2,goverment,goverment,NN,"(goverment, goverment, NN)",47,"[[government, 1, 206582673]]",government,"(government, NN)"
3,grammer,grammer,NN,"(grammer, grammer, NN)",29,"[[grammar, 1, 8019137], [grammes, 1, 13549...",grammar,"(grammar, NN)"
4,becuase,becuase,NN,"(becuase, becuase, NN)",28,"[[because, 1, 271323986]]",because,"(because, NN)"


In [53]:
# As described earlier - one manual correction for 'alot' will be added

misspell_df.iloc[0,6] = 'a lot'
misspell_df.at[0, 'correction_POS'] = ('a','DT'),('lot','NN')

In [54]:
# Deleting unnecessary columns with duplicate information already contained in the tuple

del misspell_df['lemma']
del misspell_df['POS']

In [55]:
misspell_df.sort_values(by=['freq'], ascending=False).head(50)

Unnamed: 0,misspelling,tok_lem_POS,freq,suggestions,correction,correction_POS
0,alot,"(alot, alot, NN)",127,"[[lot, 1, 106405208], [slot, 1, 21602762],...",a lot,"((a, DT), (lot, NN))"
1,studing,"(studing, studing, VBG)",74,"[[studying, 1, 9763653], [studding, 1, 345...",studying,"(studying, VBG)"
2,goverment,"(goverment, goverment, NN)",47,"[[government, 1, 206582673]]",government,"(government, NN)"
3,grammer,"(grammer, grammer, NN)",29,"[[grammar, 1, 8019137], [grammes, 1, 13549...",grammar,"(grammar, NN)"
4,becuase,"(becuase, becuase, NN)",28,"[[because, 1, 271323986]]",because,"(because, NN)"
5,beatiful,"(beatiful, beatiful, JJ)",28,"[[beautiful, 1, 58503804]]",beautiful,"(beautiful, JJ)"
6,differents,"(differents, differents, NNS)",23,"[[different, 1, 179794224]]",different,"(different, NNS)"
7,resturant,"(resturant, resturant, NN)",23,"[[restaurant, 1, 48255033]]",restaurant,"(restaurant, NN)"
8,becouse,"(becouse, becouse, IN)",23,"[[because, 1, 271323986]]",because,"(because, IN)"
9,lifes,"(lifes, life, NN)",22,"[[life, 1, 306559205], [lines, 1, 76806341...",life,"(life, NN)"


### Incorporating corrections into `pelic_df`

In [56]:
# First removing items from misspell_df where no correction will take place

print(len(misspell_df))
misspell_df = misspell_df.loc[misspell_df.misspelling != misspell_df.correction]
print(len(misspell_df))

11090
10245


In [57]:
misspell_df.head()

Unnamed: 0,misspelling,tok_lem_POS,freq,suggestions,correction,correction_POS
0,alot,"(alot, alot, NN)",127,"[[lot, 1, 106405208], [slot, 1, 21602762],...",a lot,"((a, DT), (lot, NN))"
1,studing,"(studing, studing, VBG)",74,"[[studying, 1, 9763653], [studding, 1, 345...",studying,"(studying, VBG)"
2,goverment,"(goverment, goverment, NN)",47,"[[government, 1, 206582673]]",government,"(government, NN)"
3,grammer,"(grammer, grammer, NN)",29,"[[grammar, 1, 8019137], [grammes, 1, 13549...",grammar,"(grammar, NN)"
4,becuase,"(becuase, becuase, NN)",28,"[[because, 1, 271323986]]",because,"(because, NN)"


In [58]:
# Creating dictionary for mappying - key = incorrect spelling, value = correct spelling

misspell_dict = pd.Series(misspell_df.correction_POS.values,misspell_df.tok_lem_POS).to_dict()

In [59]:
misspell_dict

{('alot', 'alot', 'NN'): (('a', 'DT'), ('lot', 'NN')),
 ('studing', 'studing', 'VBG'): ('studying', 'VBG'),
 ('goverment', 'goverment', 'NN'): ('government', 'NN'),
 ('grammer', 'grammer', 'NN'): ('grammar', 'NN'),
 ('becuase', 'becuase', 'NN'): ('because', 'NN'),
 ('beatiful', 'beatiful', 'JJ'): ('beautiful', 'JJ'),
 ('differents', 'differents', 'NNS'): ('different', 'NNS'),
 ('resturant', 'resturant', 'NN'): ('restaurant', 'NN'),
 ('becouse', 'becouse', 'IN'): ('because', 'IN'),
 ('lifes', 'life', 'NN'): ('life', 'NN'),
 ('shoud', 'shoud', 'VBP'): ('should', 'VBP'),
 ('sould', 'sould', 'VBP'): ('would', 'VBP'),
 ('befor', 'befor', 'NN'): ('before', 'NN'),
 ('somthing', 'somthing', 'VBG'): ('something', 'VBG'),
 ('jop', 'jop', 'NN'): ('top', 'NN'),
 ('apartament', 'apartament', 'NN'): ('apartment', 'NN'),
 ('becouse', 'becouse', 'NN'): ('because', 'NN'),
 ('everytime', 'everytime', 'NN'): ('overtime', 'NN'),
 ('tution', 'tution', 'NN'): ('tuition', 'NN'),
 ('referes', 'referes', 'VBZ'

In [60]:
# Incorporating back into pelic_df

pelic_df['tok_POS_corrected'] = pelic_df['tok_lem_POS'].apply\
(lambda row: [misspell_dict[(x[0].lower(),x[1],x[2])] if (x[0].lower(),x[1],x[2]) in misspell_dict else (x[0],x[2]) for x in row])

# One minor issue is that this will make misspelled items lower case when originally upper case.

In [61]:
# Checking with 'becuase'

print(pelic_df.loc[pelic_df.text.str.contains('becuase')].iloc[1,11]) #uncorrected
print(pelic_df.loc[pelic_df.text.str.contains('becuase')].iloc[1,12]) #corrected

(('My', 'my', 'PRP$'), ('friend', 'friend', 'NN'), ('is', 'be', 'VBZ'), ('realy', 'realy', 'JJ'), ('nise', 'nise', 'RB'), ('guy', 'guy', 'NN'), ('.', '.', '.'), ('I', 'i', 'PRP'), ('like', 'like', 'VBP'), ('hem', 'hem', 'JJ'), ('becuase', 'becuase', 'NN'), ('he', 'he', 'PRP'), ('is', 'be', 'VBZ'), ('friendlly', 'friendlly', 'RB'), ('and', 'and', 'CC'), ('lovliy', 'lovliy', 'NN'), ('.', '.', '.'))
[('My', 'PRP$'), ('friend', 'NN'), ('is', 'VBZ'), ('real', 'JJ'), ('nice', 'RB'), ('guy', 'NN'), ('.', '.'), ('I', 'PRP'), ('like', 'VBP'), ('hem', 'JJ'), ('because', 'NN'), ('he', 'PRP'), ('is', 'VBZ'), ('friendly', 'RB'), ('and', 'CC'), ('lovely', 'NN'), ('.', '.')]


We can see here that many approrpriate corrections have been made, including _beccuase_ -> _because_ , _nise_ -> _nice_ , and _lovily_ -> _lovely_ .  
Importantly, incorrect spellings that are actual words, e.g. _hem_ (should be _him_ in this case) are not corrected. In addition, as context is not considered, there will be some inaccuracies, e.g. _realy_ (marked as an adj) -> _real_ rather than _really_.

In [62]:
pelic_df.head()

Unnamed: 0_level_0,anon_id,L1,gender,course_id,level_id,class_id,question_id,version,text_len,text,tokens,tok_lem_POS,tok_POS_corrected
answer_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
1,eq0,Arabic,Male,149,4,g,5,1,177,I met my friend Nife while I was studying in a...,"[I, met, my, friend, Nife, while, I, was, stud...","((I, i, PRP), (met, meet, VBD), (my, my, PRP$)...","[(I, PRP), (met, VBD), (my, PRP$), (friend, NN..."
2,am8,Thai,Female,149,4,g,5,1,137,"Ten years ago, I met a women on the train betw...","[Ten, years, ago, ,, I, met, a, women, on, the...","((Ten, ten, CD), (years, year, NNS), (ago, ago...","[(Ten, CD), (years, NNS), (ago, RB), (,, ,), (..."
3,dk5,Turkish,Female,115,4,w,12,1,64,In my country we usually don't use tea bags. F...,"[In, my, country, we, usually, do, n't, use, t...","((In, in, IN), (my, my, PRP$), (country, count...","[(In, IN), (my, PRP$), (country, NN), (we, PRP..."
4,dk5,Turkish,Female,115,4,w,13,1,6,I organized the instructions by time.,"[I, organized, the, instructions, by, time, .]","((I, i, PRP), (organized, organize, VBD), (the...","[(I, PRP), (organized, VBD), (the, DT), (instr..."
5,ad1,Korean,Female,115,4,w,12,1,59,"First, prepare a port, loose tea, and cup.\nSe...","[First, ,, prepare, a, port, ,, loose, tea, ,,...","((First, first, RB), (,, ,, ,), (prepare, prep...","[(First, RB), (,, ,), (prepare, VB), (a, DT), ..."


In [63]:
# Write out new PELIC_compiled.csv

pelic_df.to_csv('PELIC_compiled_spellcorrected.csv', encoding='utf-8', index=True)

In [64]:
# Pickle new pelic_df dataframe

pelic_df.to_pickle('pelic_spellcorrected.pkl')

If preferred, this entire spelling correctin process can also be applied to [`answer.csv`]() instead of `PELIC_compiled`.

[Back to top](#Corrected-spelling)