# PELIC spelling

This notebook adds further processing to `PELIC_compiled.csv` in the [`PELIC-dataset`](https://github.com/ELI-Data-Mining-Group/PELIC-dataset) repo by creating a column of `tok_POS` items whose spelling has been automatically corrected.

**Notebook contents:**
- [Building `non_words_df`](#Building-non_words_df)
- [Building `misspell_df`](#Building-misspell_df)
- [Possible segmentation](#Possible-segmentation)
- [Applying spelling correction](#Applying-spelling-correction)
- [Incorporating corrections into `pelic_df`](#Incorporating-corrections-into-pelic_df)

## Building non_words_df
In this section, we build a dataframe, `non_words_df`, which collects all of the non-words from the PELIC dataset (in `PELIC_compiled.csv`). The final dataframe has the following columns:
- `non_word`: tuples with the non-words and their parts of speech
- `sentence`: the complete sentence containing the non-word to provide context
- `answer_id`: the id of the text they come from

In [1]:
# Import necessary modules

import pandas as pd
import pprint
import numpy as np
from ast import literal_eval
import nltk
import random
from pelitk import lex
import string
import re

In [2]:
# Read in PELIC_compiled.csv

pelic_df = pd.read_csv("../PELIC-dataset/PELIC_compiled.csv", index_col = 'answer_id', # answer_id is unique
                      dtype = {'level_id':'object','question_id':'object','version':'object','course_id':'object'}, # str not ints
                               converters={'tokens':literal_eval,'tok_lem_POS':literal_eval}) # read in as lists
pelic_df.head()

answer_id,anon_id,L1,gender,course_id,level_id,class_id,question_id,version,text_len,text,tokens,tok_lem_POS
1,eq0,Arabic,Male,149,4,g,5,1,177,I met my friend Nife while I was studying in a...,"[I, met, my, friend, Nife, while, I, was, stud...","((I, i, PRP), (met, meet, VBD), (my, my, PRP$)..."
2,am8,Thai,Female,149,4,g,5,1,137,"Ten years ago, I met a women on the train betw...","[Ten, years, ago, ,, I, met, a, women, on, the...","((Ten, ten, CD), (years, year, NNS), (ago, ago..."
3,dk5,Turkish,Female,115,4,w,12,1,64,In my country we usually don't use tea bags. F...,"[In, my, country, we, usually, do, n't, use, t...","((In, in, IN), (my, my, PRP$), (country, count..."
4,dk5,Turkish,Female,115,4,w,13,1,6,I organized the instructions by time.,"[I, organized, the, instructions, by, time, .]","((I, i, PRP), (organized, organize, VBD), (the..."
5,ad1,Korean,Female,115,4,w,12,1,59,"First, prepare a port, loose tea, and cup.\nSe...","[First, ,, prepare, a, port, ,, loose, tea, ,,...","((First, first, RB), (,, ,, ,), (prepare, prep..."


The focus here is the `tok_lem_POS` column, but all columns will be kept as the entire df will be written out at the end of the notebook.

In [3]:
# Creating small dataframe to be used for finding non-words

non_words = pelic_df[['text','tok_lem_POS']].copy()  # copy so that later column assignments do not act on a slice of pelic_df

**Note:** For spelling correction, it is necessary to decide what list of words will be used for determining if a word is real or not.

Here, we use the [`SCOWL_condensed.txt`](https://github.com/ELI-Data-Mining-Group/PELIC-spelling/blob/master/SCOWL_condensed.txt) file which is a combination of wordlists available for download at http://wordlist.aspell.net/. We include items from all the dictionaries _except_ the abbreviations dictionary. For a detailed look at the compilation of this dictionary, please see the [SCOWL_wordlist](https://github.com/ELI-Data-Mining-Group/PELIC-spelling/blob/master/SCOWL_wordlist.ipynb) notebook.

In [4]:
# Reading in SCOWL_condensed as a set for fast spelling lookups (500k words)

with open("SCOWL_condensed.txt", "r") as f:
    scowl = set(f.read().split('\n'))
print(random.sample(sorted(scowl), 5))  # random.sample needs a sequence, not a set (Python 3.11+)

['cadelles', 'autonegation', 'rhinion', 'mikvahs', 'buttery']


In [5]:
scowl = set([x.lower() for x in scowl])
len(scowl)

497552

The following is a list of words which should be considered words but which were previously being labelled as non-words. These items have been manually added to this list based on output later in this notebook. Most of these items are food items, names, or abbreviations.

In [6]:
scowl_supp = open("SCOWL_supp.txt", "r").read().split(',')
scowl_supp = [x[2:-1] for x in scowl_supp]
print(len(scowl_supp))
print(scowl_supp)

238
['adha', 'adj', 'ahamed', 'alaikum', 'anonurlpage', 'antiretroviral', 'arpa', 'atm', 'ave', 'beyonce', 'bibimbap', 'bio', 'biodiesel', 'bioethanol', 'bulgogi', 'bundang', 'cafe', 'carnaval', 'cds', 'cf', 'co', 'comscore', 'cyber', 'ddukboggi', 'def', 'dr', 'eg', 'eid', 'electrospray', 'entrees', 'erectus', 'etc', 'fiance', 'fiancee', 'fiter', 'fitir', 'fitr', 'fl', 'freediving', 'fukubukuro', 'geolinguist', 'hikikomori', 'hp', 'ibt', 'iq', 'iriver', 'jetta', 'jul', 'kabsa', 'kaled', 'kawader', 'kennywood', 'km', 'leisureville', 'll', 'maamool', 'mayumi', 'mcdonalds', 'min', 'mongongo', 'nc', 'neuro', 'nian', 'notting', 'okroshka', 'onsen', 'pajeon', 'pbt', 'pc', 'pcs', 'pp', 'pudim', 'puket', 'samear', 'shui', 'sq', 'st', 'staycation', 'sth', 'taoyuan', 'toefl', 'trans', 'transgene', 'tv', 'unsub', 'va', 'vol', 'vs', 'webaholic', 'webaholics', 'webaholism', 'wenjing', 'woong', 'yaoming', 'ying', 'yingdong', 'yugong', 'yuval', 'zi', 'abha', 'achuar', 'ae', 'afandi', 'aladha', 'alfat

In [7]:
# Lower case all toks

non_words.tok_lem_POS = non_words.tok_lem_POS.apply(lambda row: [(x[0].lower(),x[1],x[2]) for x in row])



In [8]:
# Function to find non-words

def spell_check(tok_lem_POS_list):
    word_list = scowl # Choose word_list here. Default is scowl described above.
    not_in_word_list = []
    for tok_lem_POS in tok_lem_POS_list:
        if tok_lem_POS[0] not in word_list and tok_lem_POS[0] not in scowl_supp:
            not_in_word_list.append(tok_lem_POS)
    return not_in_word_list
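
As a quick illustration of what `spell_check` returns, the sketch below runs the same membership test on a toy word list. `toy_words`, `toy_supp`, and `find_non_words` are made-up stand-ins for `scowl`, `scowl_supp`, and `spell_check`; they are assumptions for illustration and not part of the notebook.

```python
# Toy stand-ins (assumptions for illustration only)
toy_words = {"i", "met", "my", "friend"}   # plays the role of scowl
toy_supp = {"toefl"}                       # plays the role of scowl_supp

def find_non_words(tok_lem_pos_list, word_list=toy_words, supplement=toy_supp):
    """Return the (token, lemma, POS) tuples whose token is in neither list."""
    return [t for t in tok_lem_pos_list
            if t[0] not in word_list and t[0] not in supplement]

row = [("i", "i", "PRP"), ("met", "meet", "VBD"), ("my", "my", "PRP$"),
       ("freind", "freind", "NN"), ("toefl", "toefl", "NNP"), (".", ".", ".")]
flagged = find_non_words(row)
print(flagged)  # [('freind', 'freind', 'NN'), ('.', '.', '.')]
```

Note that punctuation is flagged as well, since it is not in the word list. This is why the non-alphabetic items are filtered out in a later step.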

In [9]:
# Apply spell check function to find all misspelled-words. 

non_words['misspelled_words'] = non_words.tok_lem_POS.apply(spell_check)



In [10]:
non_words.head()

answer_id,text,tok_lem_POS,misspelled_words
1,I met my friend Nife while I was studying in a...,"[(i, i, PRP), (met, meet, VBD), (my, my, PRP$)...","[(., ., .), (., ., .), (., ., .), (;, ;, :), (..."
2,"Ten years ago, I met a women on the train betw...","[(ten, ten, CD), (years, year, NNS), (ago, ago...","[(,, ,, ,), (,, ,, ,), (., ., .), (;, ;, :), (..."
3,In my country we usually don't use tea bags. F...,"[(in, in, IN), (my, my, PRP$), (country, count...","[(., ., .), (,, ,, ,), (., ., .), (., ., .), (..."
4,I organized the instructions by time.,"[(i, i, PRP), (organized, organize, VBD), (the...","[(., ., .)]"
5,"First, prepare a port, loose tea, and cup.\nSe...","[(first, first, RB), (,, ,, ,), (prepare, prep...","[(,, ,, ,), (,, ,, ,), (,, ,, ,), (., ., .), (..."


#### Adding context to the dataframe
Seeing the mistakes in the context of a sentence will allow for better manual checking if required.

In [11]:
# Sent-tokenizing the text

non_words['sentence'] = non_words['text'].apply(lambda x: nltk.sent_tokenize(x))

# And delete text column which is no longer needed

del non_words['text']



In [12]:
# Checking for hyphenated words tagged as misspellings because SCOWL doesn't contain hyphenated words

hyphenated = set([x[0] for x in [x for y in non_words.misspelled_words.to_list() for x in y] if '-' in x[0]])
print(len(hyphenated))
print(list(hyphenated)[:10])

# These need to be removed from the non-words dataframe if composed of valid words

1182
['city-state', 'well-rounded', 'standard-setting', 'trade-related', 'blood-curdling', 'hunter-gatherer', 'first-born', 'self-efficacy', 'cul-de-sac', 'self-expression']


In [13]:
# Hyphenated items whose components are not in scowl - possible misspellings or punctuation strings

sorted([y for y in [x.split('-') for x in hyphenated] if y[0] not in scowl or y[1] not in scowl])

[['', "'"],
 ['', '***', '****'],
 ['', '+'],
 ['', '.'],
 ['',
  '.',
  '.',
  '.',
  '.',
  '.',
  '.',
  '.',
  '.',
  '.',
  '.',
  '.',
  '.',
  '.',
  '.',
  '.',
  '.',
  '.',
  ''],
 ["'", ''],
 ['.', ''],
 ['/', ''],
 ['\\\\', ''],
 ['^', '^'],
 ['al', 'qaida'],
 ['austro', 'hungarian'],
 ['cd', 'rom'],
 ['co', 'authored'],
 ['co', 'ed'],
 ['co', 'educational'],
 ['co', 'exist'],
 ['co', 'existence'],
 ['co', 'founded'],
 ['co', 'founder'],
 ['co', 'founders'],
 ['co', 'host'],
 ['co', 'op'],
 ['co', 'operate'],
 ['co', 'operation'],
 ['co', 'pay'],
 ['co', 'pilot'],
 ['co', 'sleeping'],
 ['co', 'star'],
 ['co', 'worker'],
 ['co', 'workers'],
 ['co', 'written'],
 ['co', 'wrote'],
 ['mah', 'jong'],
 ['mid', '80s'],
 ['pay', 'tv'],
 ['roly', 'poly'],
 ['socio', 'cultural'],
 ['socio', 'economic'],
 ['trans', 'fat'],
 ['vis', 'a', 'vis'],
 ['wal', 'mart']]

After manual checking, all the hyphenated items are either punctuation strings or real words (including genuinely productive uses of affixes), so they can be removed from the non-words df.

Two commented-out cells further down (kept for reference) would
1. remove all the hyphenated words from the dataframe
2. remove all items that don't contain a letter

However, as all hyphenated words are fine, we instead simply eliminate all items that are not composed purely of letters. This has the effect of removing the following categories from the dataframe:
- punctuation
- hyphenated words (e.g. well-known)
- contractions (e.g. 'll, 've)
- years (e.g. 1950s)
- ordinals (e.g. 1st, 2nd)

## Tangent
The next three cells use the above information to create a more complete `PELIC-SCOWL.txt` wordlist for use with PELIC data.

In [14]:
%%capture

# Creating a text file of the hyphenated list for use elsewhere in creating a PELIC-SCOWL wordlist

hyphenated = {word for word in hyphenated if any(x.isalpha() for x in word)}
with open('hyphens.txt', 'w') as f:
    for item in hyphenated:
        f.write("%s\n" % item)

In [15]:
%%capture

# Creating a text file of contractions for use elsewhere in creating a PELIC-SCOWL wordlist

contractions = {"'ll","'ve","n't","'m","'s","'d","'re"}
with open('contractions.txt', 'w') as f:
    for item in contractions:
        f.write("%s\n" % item)

In [16]:
%%capture

# Combining SCOWL_condensed, hyphenated, and contraction lists

pelic_scowl = scowl|hyphenated|contractions
pelic_scowl.discard('')  # drop the empty string if present (discard does not raise if it is missing)
pelic_scowl = sorted(list(pelic_scowl))

with open('PELIC-SCOWL.txt', 'w') as f:
    for item in pelic_scowl:
        f.write("%s\n" % item)

In [17]:
# Removing hyphenated words

# non_words.misspelled_words = non_words.misspelled_words.apply(lambda row: [x for x in row if x[0] not in hyphenated])

In [18]:
# Removing items that are only numbers or punctuation
# .isalpha() cannot be used without 'any' as this also removes hyphenated words

# non_words.misspelled_words = non_words.misspelled_words.apply(lambda row: [x for x in row if any(y.isalpha() for y in x[0])])

In [19]:
# Checking initial length of non_words list

print(len([x for y in non_words.misspelled_words.to_list() for x in y]))
print(len(set([x for y in non_words.misspelled_words.to_list() for x in y])))

680898
20385


In [20]:
# Removing items that are not purely alpha

non_words.misspelled_words = non_words.misspelled_words.apply(lambda row: [x for x in row if x[0].isalpha()])



In [21]:
# Checking effect of removal

print(len([x for y in non_words.misspelled_words.to_list() for x in y]))
print(len(set([x for y in non_words.misspelled_words.to_list() for x in y])))

26650
15779


In [22]:
# Removing proper names - NNP, NNPS

# non_words.misspelled_words = non_words.misspelled_words.apply(lambda row: [x for x in row if x[2] != 'NNP' and x[2] != 'NNPS'])

After manual checking, it was decided to keep items tagged as NNP and NNPS, as some of them were in fact mistagged: they were capitalized common nouns (NN) which had been misspelled.

In [23]:
# Checking counts (unchanged, as the removal above was not applied)

print(len([x for y in non_words.misspelled_words.to_list() for x in y]))
print(len(set([x for y in non_words.misspelled_words.to_list() for x in y])))

26650
15779


In [24]:
# Removing all words with length 1

non_words.misspelled_words = non_words.misspelled_words.apply(lambda row: [x for x in row if len(x[0]) > 1])

In [25]:
# Checking effect of removal

print(len([x for y in non_words.misspelled_words.to_list() for x in y]))
print(len(set([x for y in non_words.misspelled_words.to_list() for x in y])))

26638
15770


In [26]:
# Removing all words with special characters (non_ascii)
# after checking, these are all foreign words with accents and other non-latin characters

# Creating function to check
def is_ascii(s):
    return all(ord(c) < 128 for c in s)
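
A separate ASCII check is needed here because `str.isalpha()` also accepts accented letters, so these items survived the earlier filter. The short sketch below shows this on three sample strings; it also compares the helper with the built-in `str.isascii()` (available from Python 3.7), which could be used instead.

```python
def is_ascii(s):
    # True only if every character is in the 7-bit ASCII range
    return all(ord(c) < 128 for c in s)

samples = ["beacause", "renée", "bülent"]
for s in samples:
    # columns: token, passes isalpha(), passes is_ascii(), built-in isascii()
    print(s, s.isalpha(), is_ascii(s), s.isascii())
```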

In [27]:
non_word_list = set([x for y in non_words.misspelled_words.to_list() for x in y])
foreign_words = [x for x in non_word_list if not is_ascii(x[0])]
foreign_words[:20]

[('çark', 'çark', 'JJ'),
 ('currículm', 'currículm', 'NN'),
 ('beatricé', 'beatricé', 'NNP'),
 ('óæçá', 'óæçá', 'NNP'),
 ('çáãçäóé', 'çáãçäóé', 'NNP'),
 ('êíáýæäì', 'êíáýæäì', 'NNP'),
 ('çáíçá', 'çáíçá', 'NNP'),
 ('arára', 'arára', 'NNP'),
 ('çáôíä', 'çáôíä', 'NNP'),
 ('çááçòãé', 'çááçòãé', 'NNP'),
 ('ôûáåã', 'ôûáåã', 'NNP'),
 ('ãú', 'ãú', 'NNP'),
 ('ýçääì', 'ýçääì', 'NNP'),
 ('renée', 'renée', 'NNP'),
 ('bohème', 'bohème', 'NN'),
 ('bülent', 'bülent', 'NNP'),
 ('ßçä', 'ßçä', 'NNP'),
 ('opéra', 'opéra', 'NNP'),
 ('inácio', 'inácio', 'NNP'),
 ('béla', 'béla', 'NNP')]

In [28]:
# Removing foreign words

non_words.misspelled_words = non_words.misspelled_words.apply(lambda row: [x for x in row if x not in foreign_words])

In [29]:
# Checking effect of removal

print(len([x for y in non_words.misspelled_words.to_list() for x in y]))
print(len(set([x for y in non_words.misspelled_words.to_list() for x in y])))

26315
15531


In [30]:
non_words.head()

answer_id,tok_lem_POS,misspelled_words,sentence
1,"[(i, i, PRP), (met, meet, VBD), (my, my, PRP$)...",[],[I met my friend Nife while I was studying in ...
2,"[(ten, ten, CD), (years, year, NNS), (ago, ago...",[],"[Ten years ago, I met a women on the train bet..."
3,"[(in, in, IN), (my, my, PRP$), (country, count...",[],"[In my country we usually don't use tea bags.,..."
4,"[(i, i, PRP), (organized, organize, VBD), (the...",[],[I organized the instructions by time.]
5,"[(first, first, RB), (,, ,, ,), (prepare, prep...",[],"[First, prepare a port, loose tea, and cup., S..."


Create new dataframe so that each misspelling token is a separate row.
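
For readers less familiar with `DataFrame.explode`, the reshaping amounts to the following plain-Python sketch. The `rows` data is made up for illustration, and the list comprehension is only an analogy for the pandas calls used in the cells that follow.

```python
# Toy rows (illustrative data, not taken from PELIC)
rows = [
    {"answer_id": 13, "misspelled_words": [("dovn", "dovn", "NN"),
                                           ("mircowave", "mircowave", "VBP")]},
    {"answer_id": 4, "misspelled_words": []},
    {"answer_id": 15, "misspelled_words": [("fitst", "fitst", "NNP")]},
]

# Drop rows without misspellings, then emit one row per misspelling
exploded = [
    {"answer_id": r["answer_id"], "misspelling": w}
    for r in rows if r["misspelled_words"]
    for w in r["misspelled_words"]
]

for r in exploded:
    print(r)
```

Filtering out the empty lists first matters in pandas too, because `explode` turns an empty list into a row containing NaN rather than dropping it.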

In [31]:
# Removing rows with no misspellings

non_words2 = non_words.loc[non_words.misspelled_words.str.len() > 0,:].copy()

In [32]:
# Exploding the lists in misspelled words so that each misspelling gets its own row

non_words2 = non_words2.explode('misspelled_words')

In [33]:
# Keeping only the sentence containing the error (the first occurrence of the error if repeated)

non_words2['sentence'] = list(zip([x[0] for x in non_words2.misspelled_words], non_words2.sentence))
non_words2['sentence'] = non_words2['sentence'].apply(
    lambda row: [i for i in row[1] if row[0] in lex.re_tokenize(i) or row[0]+"n't" in i.lower()])
non_words2['sentence'] = [x[0] for x in non_words2['sentence']]
non_words2 = non_words2.drop_duplicates(subset = ['misspelled_words','sentence'])

In [34]:
# Keeping the answer_id (which is no longer unique) as a separate column

non_words2 = non_words2.reset_index(drop = False)
non_words2.head()

Unnamed: 0,answer_id,tok_lem_POS,misspelled_words,sentence
0,8,"[(i, i, PRP), (organized, organize, VBD), (the...","(beacause, beacause, NN)","I organized the instructions by time, beacause..."
1,11,"[(to, to, TO), (make, make, VB), (tea, tea, NN...","(wallmart, wallmart, NN)","next, you need to buy a box of tea in wallmart..."
2,13,"[(first, first, RB), (,, ,, ,), (you, you, PRP...","(dovn, dovn, NN)","First, you should take some hot water, you can..."
3,13,"[(first, first, RB), (,, ,, ,), (you, you, PRP...","(mircowave, mircowave, VBP)","First, you should take some hot water, you can..."
4,15,"[(in, in, IN), (my, my, PRP$), (country, count...","(fitst, fitst, NNP)","Fitst, boil a water in a pot."


#### Adding a bigrams column, i.e. one token left and right of the misspelled word

In [35]:
# Creating a tokenized version of the sentence without punctuation and with the index for each token

non_words2['enumerated'] = non_words2.sentence.apply(lambda x: lex.re_tokenize(x)).apply(enumerate).apply(list)
non_words2.head()

Unnamed: 0,answer_id,tok_lem_POS,misspelled_words,sentence,enumerated
0,8,"[(i, i, PRP), (organized, organize, VBD), (the...","(beacause, beacause, NN)","I organized the instructions by time, beacause...","[(0, i), (1, organized), (2, the), (3, instruc..."
1,11,"[(to, to, TO), (make, make, VB), (tea, tea, NN...","(wallmart, wallmart, NN)","next, you need to buy a box of tea in wallmart...","[(0, next), (1, you), (2, need), (3, to), (4, ..."
2,13,"[(first, first, RB), (,, ,, ,), (you, you, PRP...","(dovn, dovn, NN)","First, you should take some hot water, you can...","[(0, first), (1, you), (2, should), (3, take),..."
3,13,"[(first, first, RB), (,, ,, ,), (you, you, PRP...","(mircowave, mircowave, VBP)","First, you should take some hot water, you can...","[(0, first), (1, you), (2, should), (3, take),..."
4,15,"[(in, in, IN), (my, my, PRP$), (country, count...","(fitst, fitst, NNP)","Fitst, boil a water in a pot.","[(0, fitst), (1, boil), (2, a), (3, water), (4..."


In [36]:
# Creating a function to extract the bigrams (1 word either side of misspelling)
# Returns [] for one-token sentences and None if the misspelling is not found

def get_bigrams(misspelled_word, enumerated_list):
    if len(enumerated_list) < 2:
        return []
    tokens = [tok for _, tok in enumerated_list]
    for i, tok in enumerate(tokens):
        if tok == misspelled_word[0]:
            if i == 0:  # sentence-initial: only a right-hand bigram
                return [' '.join(tokens[:2])]
            if i == len(tokens) - 1:  # sentence-final: only a left-hand bigram
                return [' '.join(tokens[-2:])]
            # otherwise: previous token + misspelling, misspelling + next token
            return [' '.join(tokens[i-1:i+1]), ' '.join(tokens[i:i+2])]

In [37]:
# Testing the function

test_list = non_words2.iloc[5,4]
print(test_list)

first_item = non_words2.iloc[5,1][0] # first item in list
middle_item = non_words2.iloc[5,1][4] # item in middle of list
last_item = non_words2.iloc[5,1][8] # item at end of list
print('\n',first_item, middle_item, last_item)

[(0, 'every'), (1, 'paragragh'), (2, 's'), (3, 'instructions'), (4, 'depend'), (5, 'on'), (6, 'a'), (7, 'main'), (8, 'idea')]

 ('every', 'every', 'DT') ('depend', 'depend', 'VBP') ('idea', 'idea', 'NN')


In [38]:
print(get_bigrams(first_item,test_list)) # One possible bigram
print(get_bigrams(middle_item,test_list)) # Two possible bigrams
print(get_bigrams(last_item,test_list)) # One possible bigram

['every paragragh']
['instructions depend', 'depend on']
['main idea']


In [39]:
# Applying the above function

non_words2['bigrams'] = non_words2[['misspelled_words','enumerated']].apply(lambda x: get_bigrams(x['misspelled_words'], x['enumerated']), axis=1)

In [40]:
# The 14 misspellings inside contractions did not return any bigrams, because n't is handled differently by
# the re_tokenize function (which strips punctuation) and the nltk tokenizer used to create the tok_lem_POS column

no_bigrams = non_words2.loc[(non_words2['bigrams'].isnull()) & (non_words2.enumerated.str.len() >1),:]
print(len(no_bigrams))
no_bigrams.head(20)

14


Unnamed: 0,answer_id,tok_lem_POS,misspelled_words,sentence,enumerated,bigrams
909,1469,"[(about, about, IN), (4, 4, CD), (days, day, N...","(shoudl, shoudl, VBP)","Anyway, we went to a hospital and the doctor s...","[(0, anyway), (1, we), (2, went), (3, to), (4,...",
1100,1797,"[(i, i, PRP), (failed, fail, VBD), (the, the, ...","(idid, idid, NNP)",I failed the test because Ididn't study.,"[(0, i), (1, failed), (2, the), (3, test), (4,...",
2446,4804,"[(bill, bill, NNP), (shoul, shoul, VBP), (n't,...","(shoul, shoul, VBP)",Bill shouln't have lied about the price of the...,"[(0, bill), (1, shouln), (2, t), (3, have), (4...",
2449,4830,"[(caroline, caroline, NNP), (must, must, MD), ...","(sould, sould, VBP)",Caroline souldn't have insulted Bill in front ...,"[(0, caroline), (1, souldn), (2, t), (3, have)...",
2725,6346,"[(last, last, JJ), (year, year, NN), (,, ,, ,)...","(sould, sould, MD)",I told the officer that he caused the accident...,"[(0, i), (1, told), (2, the), (3, officer), (4...",
2728,6346,"[(last, last, JJ), (year, year, NN), (,, ,, ,)...","(sould, sould, VBP)",I told the officer that he caused the accident...,"[(0, i), (1, told), (2, the), (3, officer), (4...",
3750,8640,"[(before, before, IN), (i, i, PRP), (came, com...","(daoe, daoe, VBP)","I like to play soccer,but in Korea daoen't hav...","[(0, i), (1, like), (2, to), (3, play), (4, so...",
4051,10001,"[(you, you, PRP), (shoud, shoud, VBP), (n't, n...","(shoud, shoud, VBP)",You shoudn't,"[(0, you), (1, shoudn), (2, t)]",
10334,23939,"[(i, i, PRP), (have, have, VBP), (an, a, DT), ...","(sould, sould, VBP)",I souldn't have gone to his wedding party.,"[(0, i), (1, souldn), (2, t), (3, have), (4, g...",
11393,26126,"[(i, i, PRP), (like, like, VBP), (cooking, coo...","(doed, doed, VBP)",But In pittsburgh doedn't have good traditiona...,"[(0, but), (1, in), (2, pittsburgh), (3, doedn...",


In [41]:
# Creating function to fix these items

misspelled_contractions = [x[0] for x in no_bigrams.misspelled_words]

def fix_contractions(word):
    if word == "t":
        word = "n't"
    if word[:-1] in misspelled_contractions:
        word = word[:-1]
    return word
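
The mismatch and the repair can be traced on one of the rows above. In this sketch, `simple_tokenize` is a hypothetical stand-in for `lex.re_tokenize` (its behaviour is assumed from the `enumerated` output shown earlier), and `misspelled_contractions` is reduced to a single item.

```python
import re

def simple_tokenize(sentence):
    # Hypothetical stand-in for lex.re_tokenize: lowercase, keep runs of letters only
    return re.findall(r"[a-z]+", sentence.lower())

misspelled_contractions = ["shoud"]  # the token as it appears in tok_lem_POS

def fix_contractions(word):
    if word == "t":                           # stranded 't' left over from n't
        word = "n't"
    if word[:-1] in misspelled_contractions:  # 'shoudn' -> 'shoud'
        word = word[:-1]
    return word

raw = simple_tokenize("You shoudn't")
fixed = [fix_contractions(w) for w in raw]
print(raw)    # ['you', 'shoudn', 't']
print(fixed)  # ['you', 'shoud', "n't"]
```

After the repair, the token `shoud` matches the entry in `tok_lem_POS`, so a bigram lookup on the same sentence can succeed.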

In [42]:
# Applying the above function to the selected rows

mask = non_words2.index.isin(no_bigrams.index)
non_words2.loc[mask, 'enumerated'] = non_words2.loc[mask, 'enumerated'].apply(
    lambda row: [(x[0],fix_contractions(x[1])) for x in row])

In [43]:
# Reapplying the bigrams function

non_words2['bigrams'] = non_words2[['misspelled_words','enumerated']].apply(lambda x: get_bigrams(x['misspelled_words'], x['enumerated']), axis=1)

In [44]:
# Re-checking that all rows have bigrams
non_words2.loc[(non_words2['bigrams'].isnull()) & (non_words2.enumerated.str.len() >1),:]

Unnamed: 0,answer_id,tok_lem_POS,misspelled_words,sentence,enumerated,bigrams


In [45]:
# Deleting the enumerated column as no longer necessary

del non_words2['enumerated']

In [46]:
# Renaming the 'misspelled_words' column as there is only one word in each row

non_words2 = non_words2.rename(columns={"misspelled_words": "misspelling"})

In [47]:
# Checking final non_words2 dataframe

non_words2.head()

Unnamed: 0,answer_id,tok_lem_POS,misspelling,sentence,bigrams
0,8,"[(i, i, PRP), (organized, organize, VBD), (the...","(beacause, beacause, NN)","I organized the instructions by time, beacause...","[time beacause, beacause to]"
1,11,"[(to, to, TO), (make, make, VB), (tea, tea, NN...","(wallmart, wallmart, NN)","next, you need to buy a box of tea in wallmart...","[in wallmart, wallmart or]"
2,13,"[(first, first, RB), (,, ,, ,), (you, you, PRP...","(dovn, dovn, NN)","First, you should take some hot water, you can...","[use dovn, dovn mircowave]"
3,13,"[(first, first, RB), (,, ,, ,), (you, you, PRP...","(mircowave, mircowave, VBP)","First, you should take some hot water, you can...","[dovn mircowave, mircowave or]"
4,15,"[(in, in, IN), (my, my, PRP$), (country, count...","(fitst, fitst, NNP)","Fitst, boil a water in a pot.",[fitst boil]


In [48]:
# Total number of non-words (tokens)
print(len(non_words2))

# Total number of non-words (types)
print(non_words2.misspelling.nunique())

21068
15531


#### Creating a dataframe of misspellings
In the `non_words2` dataframe above, each row is an occurrence of a misspelling (i.e. a _token_). We also want a dataframe where each row is a misspelling _type_ with frequency information attached.
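
The token/type distinction can be sketched with `collections.Counter`, which is a compact alternative to building a frequency dictionary by hand. The three tuples below are illustrative only.

```python
from collections import Counter

# Each tuple is one occurrence (token) of a misspelling
occurrences = [("alot", "alot", "NN"),
               ("studing", "studing", "VBG"),
               ("alot", "alot", "NN")]

freq = Counter(occurrences)  # maps each distinct tuple (type) to its count

print(len(occurrences))     # number of tokens: 3
print(len(freq))            # number of types: 2
print(freq.most_common(1))  # [(('alot', 'alot', 'NN'), 2)]
```

Because the key is the whole `(token, lemma, POS)` tuple, the same string with two different POS tags counts as two types.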

In [49]:
# Gathering the total misspellings and bigrams

total_unigram_misspell = [x for x in non_words2['misspelling']]
total_bigram_misspell = [x for y in non_words2['bigrams'] for x in y] #flattened list

In [50]:
print(total_unigram_misspell[:5])
print(total_bigram_misspell[:5])

[('beacause', 'beacause', 'NN'), ('wallmart', 'wallmart', 'NN'), ('dovn', 'dovn', 'NN'), ('mircowave', 'mircowave', 'VBP'), ('fitst', 'fitst', 'NNP')]
['time beacause', 'beacause to', 'in wallmart', 'wallmart or', 'use dovn']


In [51]:
# Creating frequency dictionaries for unigrams and bigrams

unigram_misspell_freq_dict = {}
for word in total_unigram_misspell:
    if word not in unigram_misspell_freq_dict:
        unigram_misspell_freq_dict[word] = 1
    else:
        unigram_misspell_freq_dict[word] += 1

In [52]:
bigram_misspell_freq_dict = {}
for bigram in total_bigram_misspell:
    if bigram not in bigram_misspell_freq_dict:
        bigram_misspell_freq_dict[bigram] = 1
    else:
        bigram_misspell_freq_dict[bigram] += 1

In [53]:
# Checking dictionaries

print(random.sample(list(unigram_misspell_freq_dict),5))
print(random.sample(list(bigram_misspell_freq_dict),5))

[('attitudt', 'attitudt', 'NN'), ('marrid', 'marrid', 'JJ'), ('littel', 'littel', 'NN'), ('hfcs', 'hfcs', 'NNP'), ('prefre', 'prefre', 'VBP')]
['ramin mostaghim', 'cuase of', 't understnad', 'so fiday', 'engeler heart']


In [54]:
# Remove duplicates

final_unigram_misspellings = sorted(list(set(total_unigram_misspell)))
final_bigram_misspellings = sorted(list(set(total_bigram_misspell)))
print(len(final_unigram_misspellings))
print(len(final_bigram_misspellings))

15531
31915


In [55]:
# Constructing misspell_df

misspell_df = pd.DataFrame(final_unigram_misspellings)
misspell_df.head()

Unnamed: 0,0,1,2
0,aa,aa,NNP
1,aa,aa,VB
2,aabout,aabout,IN
3,aad,aad,JJ
4,aain,aain,VBP


In [56]:
# Renaming columns to match other DataFrames in this notebook

misspell_df.rename(columns = {0: 'misspelling',1:'lemma',2:'POS'}, inplace = True)

In [57]:
# Recreating tok_lem_POS column to match dictionary

misspell_df['tok_lem_POS'] = list(zip(misspell_df.misspelling, misspell_df.lemma, misspell_df.POS))
misspell_df.head()

Unnamed: 0,misspelling,lemma,POS,tok_lem_POS
0,aa,aa,NNP,"(aa, aa, NNP)"
1,aa,aa,VB,"(aa, aa, VB)"
2,aabout,aabout,IN,"(aabout, aabout, IN)"
3,aad,aad,JJ,"(aad, aad, JJ)"
4,aain,aain,VBP,"(aain, aain, VBP)"


In [58]:
# Mapping dictionary to DataFrame

misspell_df['freq'] = misspell_df['tok_lem_POS'].map(unigram_misspell_freq_dict)

In [59]:
# Sorting by frequency

misspell_df = misspell_df.sort_values(by=['freq'], ascending=False)

In [60]:
# Resetting index and deleting unnecessary columns

misspell_df = misspell_df.reset_index(drop = True)
del misspell_df['lemma']
del misspell_df['POS']

misspell_df.head(20)

Unnamed: 0,misspelling,tok_lem_POS,freq
0,alot,"(alot, alot, NN)",103
1,studing,"(studing, studing, VBG)",62
2,tofel,"(tofel, tofel, NNP)",39
3,goverment,"(goverment, goverment, NN)",36
4,iam,"(iam, iam, NNP)",28
5,finaly,"(finaly, finaly, NNP)",26
6,beatiful,"(beatiful, beatiful, JJ)",24
7,becuase,"(becuase, becuase, NN)",24
8,nickell,"(nickell, nickell, NNP)",22
9,oss,"(oss, oss, NNP)",22


#### scowl_supp
The following is the basis for the `scowl_supp` list used earlier. Here, errors with a frequency of 10 or more were manually checked and, if determined to be real words, were added to the `scowl_supp` list. There were originally 267 items which met this criterion.

In [61]:
print(len(misspell_df.loc[misspell_df.freq >= 10]))
misspell_df.loc[misspell_df.freq >= 10]

63


Unnamed: 0,misspelling,tok_lem_POS,freq
0,alot,"(alot, alot, NN)",103
1,studing,"(studing, studing, VBG)",62
2,tofel,"(tofel, tofel, NNP)",39
3,goverment,"(goverment, goverment, NN)",36
4,iam,"(iam, iam, NNP)",28
...,...,...,...
58,childrens,"(childrens, child, NNS)",10
59,befor,"(befor, befor, IN)",10
60,differnt,"(differnt, differnt, JJ)",10
61,eatting,"(eatting, eatting, VBG)",10


### Possible segmentation

Selected segmenter and spellchecker: SymSpell https://github.com/mammothb/symspellpy

There is a dictionary file which needs to be installed (saved to repo):
[frequency_dictionary_en_82_765.txt](https://symspellpy.readthedocs.io/en/latest/users/installing.html)

To install symspellpy the first time, use pip on the command line: `pip install -U symspellpy`

Prior to spelling correction, we first consider using the segmenter. This is a potentially useful first step, as misspellings like 'alot' or 'dogmeat' would be separated into 'a lot' and 'dog meat' rather than corrected to a single word like 'lot'.

However, when applied to misspellings, the segmenter over-segments, splitting non-words into real words where this was clearly not intended, e.g. _improtant_ into _imp rot ant_ or _befor_ into _be for_. As such, segmentation will not be automated.

Instead, we rely on the bigram spelling correction which correctly segments items like _alot, iam, everytime,_ etc.
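
To see why a dictionary-based segmenter behaves this way, consider the toy segmenter below. It is not SymSpell's algorithm (which also weighs word frequencies); `segment` and `vocab` are made up for illustration. It simply looks for the fewest dictionary words that cover the whole string.

```python
def segment(word, vocab):
    """Split word into the fewest vocab items; return None if impossible."""
    best = {0: []}  # best[i] = shortest segmentation found for word[:i]
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if start in best and piece in vocab:
                candidate = best[start] + [piece]
                if end not in best or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best.get(len(word))

vocab = {"a", "lot", "be", "for", "before", "imp", "rot", "ant", "important"}
for w in ["alot", "befor", "improtant"]:
    print(w, "->", segment(w, vocab))
```

The segmenter has no way of knowing that _befor_ was meant as _before_: any string that happens to split into real words gets split, whether or not the writer intended two words.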

In [62]:
# Setting up symspell

from itertools import islice
import pkg_resources
from symspellpy import SymSpell
from symspellpy import Verbosity
sym_spell = SymSpell()
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt")
sym_spell.load_dictionary(dictionary_path, 0, 1)

# Print out first 5 elements to demonstrate that dictionary is successfully loaded
list(islice(sym_spell.words.items(), 5))

[('the', 23135851162),
 ('of', 13151942776),
 ('and', 12997637966),
 ('to', 12136980858),
 ('a', 9081174698)]

In [63]:
# Configuring the segmenter (tested below on 'dogmeat' and other items)

# Set max_dictionary_edit_distance to 0 to avoid spelling correction
sym_spell = SymSpell(max_dictionary_edit_distance=0, prefix_length=7)
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

# The segmentation result also exposes the summed edit distance (result.distance_sum)
# and the summed log probability of the segments (result.log_prob_sum)

True

In [64]:
# Creating function for applying the above code

def get_segments(word):
    segments = sym_spell.word_segmentation(word)
    if len(segments.corrected_string.split(' ')) > 1 \
    and segments.corrected_string.split(' ')[0] in scowl and segments.corrected_string.split(' ')[1] in scowl:
        return segments.corrected_string
    else:
        return word

In [65]:
# Testing function

print(get_segments('dogmeat')) # Should be segmented
print(get_segments('fireplace')) # Should not be segmented
print(get_segments('becuase')) # Should not be segmented

dog meat
fireplace
becuase


In [66]:
# Applying the function to create a new column

misspell_df['segments'] =  misspell_df['misspelling'].apply(get_segments)
misspell_df.head(10)

Unnamed: 0,misspelling,tok_lem_POS,freq,segments
0,alot,"(alot, alot, NN)",103,a lot
1,studing,"(studing, studing, VBG)",62,stu ding
2,tofel,"(tofel, tofel, NNP)",39,tofel
3,goverment,"(goverment, goverment, NN)",36,g over men t
4,iam,"(iam, iam, NNP)",28,i am
5,finaly,"(finaly, finaly, NNP)",26,final y
6,beatiful,"(beatiful, beatiful, JJ)",24,beat if ul
7,becuase,"(becuase, becuase, NN)",24,becuase
8,nickell,"(nickell, nickell, NNP)",22,nick ell
9,oss,"(oss, oss, NNP)",22,oss


In [67]:
# Deleting this new column as segmentation creates false segments of misspelled words

del misspell_df['segments']

### Applying spelling correction

In some ways SymSpell is not ideal, as full sentence context is not considered, only general frequencies. However, other well-known spellcheckers (hunspell, pyspell, etc.) use the same strategy: frequency-based criteria for suggestions, without considering the immediate co-text. As such, we have followed this common practice, but it is important to remember that the corrected tokens will not be 100% accurate, and this must be taken into consideration when using them.

As a compromise that takes some context into account, spelling correction based on bigrams is applied first. If no bigram suggestions are available, spelling correction based on unigrams is applied.
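
The fallback order described above can be sketched as a small helper. This is a minimal illustration, not the notebook's implementation: `pick_correction` is a hypothetical name, and it assumes a missing suggestion is represented as `None` (in the dataframe below, missing values are `NaN`, which would need an explicit `pd.isna` check instead).

```python
def pick_correction(original, bigram_correction=None, unigram_correction=None):
    """Prefer the bigram-based correction, then the unigram one, then the original word."""
    for candidate in (bigram_correction, unigram_correction):
        # Treat None and empty strings as "no suggestion available"
        if candidate:
            return candidate
    return original

print(pick_correction("becuase", "because", "because"))  # because
print(pick_correction("wallmart", None, "walmart"))      # walmart
print(pick_correction("yhryr"))                          # yhryr
```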

In [68]:
# Testing spelling suggestions with 'becuase'

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt")
# term_index is the column of the term and count_index is the column of the term frequency
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

input_term = "becuase"
suggestions = sym_spell.lookup(input_term, Verbosity.CLOSEST, max_edit_distance=2, # Edit distance can be adjusted
                               transfer_casing=True, # Optional: carry the casing of the input over to the suggestion
                               include_unknown=True) # Return the input word itself if no suggestion is found
for suggestion in suggestions:
    print(suggestion)  

because, 1, 271323986


In [69]:
# Creating function for finding unigram suggestions

def get_unigram_suggestions(word):
    if len(word) >= 4:
        suggestions = sym_spell.lookup(word, Verbosity.CLOSEST,max_edit_distance=2, transfer_casing=True)
    else:
        suggestions = sym_spell.lookup(word, Verbosity.CLOSEST,max_edit_distance=1, transfer_casing=True)
    return [str(x).split(',') for x in suggestions]

In [70]:
# Testing function

get_unigram_suggestions('becuase')

[['because', ' 1', ' 271323986']]

**Note**: The function has a variable edit distance: words of length 4 or more get edit distance of 2, shorter words get edit distance of 1. These preferences can be adjusted in the function if desired.
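
If the thresholds need to be tuned, the length rule can be pulled out into its own helper. The sketch below is an assumption about how that might look; `max_distance_for` and its parameter names are hypothetical, and the returned value would be passed as `max_edit_distance` to `sym_spell.lookup`.

```python
def max_distance_for(word, long_word_length=4, long_distance=2, short_distance=1):
    """Return the maximum edit distance to allow when looking up `word`."""
    return long_distance if len(word) >= long_word_length else short_distance

for word in ["iam", "dovn", "becuase"]:
    print(word, max_distance_for(word))
```

Keeping the rule in one place means the two `lookup` calls in `get_unigram_suggestions` could collapse into one.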

In [71]:
# Testing spelling suggestions with 'becuase of'

max_edit_distance_dictionary = 2
prefix_length = 7
sym_spell = SymSpell(max_edit_distance_dictionary, prefix_length)
bigram_path = pkg_resources.resource_filename("symspellpy", "frequency_bigramdictionary_en_243_342.txt")
if not sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1):
    print("Dictionary file not found")
if not sym_spell.load_bigram_dictionary(bigram_path, term_index=0, count_index=2):
    print("Bigram dictionary file not found")
input_term = 'becuase of'
max_edit_distance_lookup = 2
suggestions = sym_spell.lookup_compound(input_term, max_edit_distance_lookup)
for suggestion in suggestions:
    print(suggestion) 

because of, 1, 3481714


In [72]:
# Creating function for finding bigram suggestions

def get_bigram_suggestions(bigram):
    # lookup_compound returns a list of suggestions; convert each to [term, distance, count]
    suggestions = sym_spell.lookup_compound(bigram, max_edit_distance_lookup)
    return [str(x).split(',') for x in suggestions]

In [73]:
# Testing function
get_bigram_suggestions('worq harg')

[['work hard', ' 2', ' 53229']]

In [74]:
# Returning to non_words2 dataframe and applying functions to create new column

# Creating unigram suggestions column

non_words2['unigram_suggestions'] =  non_words2['misspelling'].apply(
    lambda x: get_unigram_suggestions(x[0]))

In [75]:
# Turning into tuples for easier processing

non_words2.unigram_suggestions = non_words2.unigram_suggestions.apply(
    lambda row: [tuple(x) for x in row])

In [76]:
# Creating bigram suggestions column

non_words2['bigram_suggestions'] =  non_words2['bigrams'].apply(
    lambda row: [get_bigram_suggestions(x) for x in row])

In [77]:
# Flattening and turning into tuples for easier processing

non_words2.bigram_suggestions = non_words2.bigram_suggestions.apply(
    lambda row: [tuple(x) for y in row for x in y])

In [78]:
non_words2.head()

Unnamed: 0,answer_id,tok_lem_POS,misspelling,sentence,bigrams,unigram_suggestions,bigram_suggestions
0,8,"[(i, i, PRP), (organized, organize, VBD), (the...","(beacause, beacause, NN)","I organized the instructions by time, beacause...","[time beacause, beacause to]","[(because, 1, 271323986)]","[(time because, 1, 240561), (because to, 1,..."
1,11,"[(to, to, TO), (make, make, VB), (tea, tea, NN...","(wallmart, wallmart, NN)","next, you need to buy a box of tea in wallmart...","[in wallmart, wallmart or]","[(walmart, 1, 2269839)]","[(in wall art, 1, 99805), (wall art or, 1, ..."
2,13,"[(first, first, RB), (,, ,, ,), (you, you, PRP...","(dovn, dovn, NN)","First, you should take some hot water, you can...","[use dovn, dovn mircowave]","[(down, 1, 224915894), (don, 1, 26003672),...","[(use down, 1, 157999), (down microwave, 2,..."
3,13,"[(first, first, RB), (,, ,, ,), (you, you, PRP...","(mircowave, mircowave, VBP)","First, you should take some hot water, you can...","[dovn mircowave, mircowave or]","[(microwave, 1, 8934594)]","[(down microwave, 2, 1960), (microwave or, ..."
4,15,"[(in, in, IN), (my, my, PRP$), (country, count...","(fitst, fitst, NNP)","Fitst, boil a water in a pot.",[fitst boil],"[(first, 1, 578161543), (fits, 1, 12942004...","[(first boil, 1, 1366)]"


In [79]:
# Checking how many items without suggestions

print(len(non_words2.loc[(non_words2.unigram_suggestions.str.len() == 0),:]))
print(len(non_words2.loc[(non_words2.bigram_suggestions.str.len() == 0),:]))
print(len(non_words2.loc[(non_words2.bigram_suggestions.str.len() == 0) & (non_words2.unigram_suggestions.str.len() == 0),:]))

1966
116
15


In [80]:
non_words2.loc[(non_words2.bigram_suggestions.str.len() == 0) & (non_words2.unigram_suggestions.str.len() == 0),:]

Unnamed: 0,answer_id,tok_lem_POS,misspelling,sentence,bigrams,unigram_suggestions,bigram_suggestions
71,105,"[(here, here, RB), (are, be, VBP), (a, a, DT),...","(asdfkjdlkfjadlfjalsdf, asdfkjdlkfjadlfjalsdf,...",asdfkjdlkfjadlfjalsdf,[],[],[]
4104,10490,"[(getting, get, VBG), (good, good, NNP), (grad...","(quintcareers, quintcareers, NNS)",Quintcareers.,[],[],[]
7223,18380,"[(hi, hi, NNP), (,, ,, ,), (adrian, adrian, NN...","(yeonjea, yeonjea, NN)",- Yeonjea.,[],[],[]
8328,20089,"[(uuuuuuu, uuuuuuu, NN)]","(uuuuuuu, uuuuuuu, NN)",uuuuuuu,[],[],[]
8329,20096,"[(yhryr, yhryr, NN)]","(yhryr, yhryr, NN)",yhryr,[],[],[]
8989,21585,"[(1, 1, CD), (., ., .), (individual, individua...","(shadizubeidi, shadizubeidi, NNP)",Shadizubeidi.,[],[],[]
9161,21845,"[(when, when, WRB), (i, i, PRP), ('m, be, VBP)...","(jaaaaaaaaaaaaaaaaaaajajjajajajaja, jaaaaaaaaa...",Jaaaaaaaaaaaaaaaaaaajajjajajajaja ;),[],[],[]
10457,24136,"[(hsthethghryjyjy, hsthethghryjyjy, NN)]","(hsthethghryjyjy, hsthethghryjyjy, NN)",hsthethghryjyjy,[],[],[]
11293,25988,"[(fibromyalgia, fibromyalgia, NNP), (?, ?, .),...","(fibromyalgia, fibromyalgia, NNP)",Fibromyalgia ?,[],[],[]
13222,30421,"[(stayhealthy, stayhealthy, NN)]","(stayhealthy, stayhealthy, NN)",StayHealthy,[],[],[]


The items above have no suggestions. They will be left in their original form, though manual corrections could be applied if desired.

Next, we create a new column with just the most likely correction (based on frequency). Bigram suggestions are given preference before unigram suggestions. If there is no suggestion, the original word is returned.

In [81]:
# Create new column with just the most likely correction (based on frequency)

def sort_tuple(tup):  
    tup.sort(key = lambda x: x[2], reverse=True)  
    return tup    

In [82]:
# Keeping the bigram correction with the highest frequency

non_words2['bigram_correction'] = [sort_tuple(x)[0][0] if len(x) != 0 else np.nan for x in non_words2['bigram_suggestions']]

In [83]:
# Keeping the unigram correction with the highest frequency

non_words2['unigram_correction'] = [sort_tuple(x)[0][0] if len(x) != 0 else np.nan for x in non_words2['unigram_suggestions']]
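
One thing to watch for: `get_unigram_suggestions` and `get_bigram_suggestions` return every field as a string (see `' 271323986'` in the earlier output), so sorting on `x[2]` compares text rather than numbers. This may explain why _dove_ rather than _down_ ends up as the unigram correction for `dovn` in the table further down. The sketch below uses toy data in the same format; `sort_by_count` is a hypothetical alternative to `sort_tuple`, not part of the notebook.

```python
# Toy suggestions in the same (term, distance, count) string format as above
suggestions = [
    ("dove", " 1", " 3253560"),
    ("donn", " 1", " 299470"),
    ("down", " 1", " 224915894"),
]

def sort_by_count(suggestions):
    """Return suggestions ordered by numeric frequency, highest first."""
    return sorted(suggestions, key=lambda s: int(s[2]), reverse=True)

print(sorted(suggestions, key=lambda s: s[2], reverse=True)[0][0])  # dove (string comparison)
print(sort_by_count(suggestions)[0][0])                             # down (numeric comparison)
```

Note that frequency alone ignores edit distance, so for bigram suggestions it may also be worth sorting on distance first and frequency second.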

# STUCK HERE
The goal here is to keep only the word in the bigram correction that was originally misspelled, not the entire bigram. The experiments below have not yet succeeded.

In [84]:
# Creating function to keep only the key word of the bigram correction

def key_word(bigrams,bigram_correction):
    key_words = []
    for word in bigram_correction.split():
        if word not in lex.re_tokenize(str(bigrams)):
            key_words.append(word)
    return key_words
    # return " ".join(key_words)  -- experimenting with both versions: return either a list or a single string

In [85]:
# Applying the above function to the selected rows

mask2 = non_words2.loc[~non_words2.bigram_correction.isnull()].index
non_words2['final_correction'] = non_words2.loc[mask2, ['bigrams','bigram_correction']].apply(
    lambda x: key_word(x['bigrams'], x['bigram_correction']), axis=1)

non_words2.head()

Unnamed: 0,answer_id,tok_lem_POS,misspelling,sentence,bigrams,unigram_suggestions,bigram_suggestions,bigram_correction,unigram_correction,final_correction
0,8,"[(i, i, PRP), (organized, organize, VBD), (the...","(beacause, beacause, NN)","I organized the instructions by time, beacause...","[time beacause, beacause to]","[(because, 1, 271323986)]","[(because to, 1, 3213023), (time because, 1...",because to,because,[because]
1,11,"[(to, to, TO), (make, make, VB), (tea, tea, NN...","(wallmart, wallmart, NN)","next, you need to buy a box of tea in wallmart...","[in wallmart, wallmart or]","[(walmart, 1, 2269839)]","[(in wall art, 1, 99805), (wall art or, 1, ...",in wall art,walmart,"[wall, art]"
2,13,"[(first, first, RB), (,, ,, ,), (you, you, PRP...","(dovn, dovn, NN)","First, you should take some hot water, you can...","[use dovn, dovn mircowave]","[(dove, 1, 3253560), (donn, 1, 299470), (d...","[(down microwave, 2, 1960), (use down, 1, ...",down microwave,dove,"[down, microwave]"
3,13,"[(first, first, RB), (,, ,, ,), (you, you, PRP...","(mircowave, mircowave, VBP)","First, you should take some hot water, you can...","[dovn mircowave, mircowave or]","[(microwave, 1, 8934594)]","[(microwave or, 1, 22584), (down microwave, ...",microwave or,microwave,[microwave]
4,15,"[(in, in, IN), (my, my, PRP$), (country, count...","(fitst, fitst, NNP)","Fitst, boil a water in a pot.",[fitst boil],"[(fist, 1, 7319405), (first, 1, 578161543)...","[(first boil, 1, 1366)]",first boil,fist,[first]


In [86]:
# Mystery to solve - why are these final corrections blank when there is a bigram correction?

non_words2.loc[(non_words2.final_correction.str.len() == 0) & (non_words2.bigram_correction.isnull()==False),:]

Unnamed: 0,answer_id,tok_lem_POS,misspelling,sentence,bigrams,unigram_suggestions,bigram_suggestions,bigram_correction,unigram_correction,final_correction
392,630,"[(i, i, NN), (am, be, VBP), (living, live, VBG...","(blvd, blvd, NNP)",i am living on ANON_NAME_0 blvd street,"[name blvd, blvd street]","[(blvd, 0, 13317989)]","[(name blvd, 0, 6036), (blvd street, 0, 19...",name blvd,blvd,[]
398,643,"[(good, good, JJ), (effects, effect, NNS), (of...","(sen, sen, NN)","""On"" is hot, and ""sen"" is springs in Japanese,...","[and sen, sen is]","[(sen, 0, 5230325)]","[(and sen, 0, 66329), (sen is, 0, 24014)]",and sen,sen,[]
502,796,"[(mismanagement, mismanagement, NN), (:, :, :)...","(est, est, NNP)",Mismanagement:\n\nP of S: N\n\nDefinition: if ...,[against est],"[(est, 0, 58112143)]","[(against est, 0, 8349)]",against est,est,[]
503,796,"[(mismanagement, mismanagement, NN), (:, :, :)...","(est, est, JJS)",Mismanagement:\n\nP of S: N\n\nDefinition: if ...,[against est],"[(est, 0, 58112143)]","[(against est, 0, 8349)]",against est,est,[]
517,846,"[(suspend, suspend, NN), (verb, verb, NNP), (d...","(esp, esp, FW)","to reprove or scold, esp.",[scold esp],"[(esp, 0, 4612780)]","[(scold esp, 0, 0)]",scold esp,esp,[]
...,...,...,...,...,...,...,...,...,...,...
20578,47525,"[(in, in, IN), (italy, italy, NNP), (,, ,, ,),...","(bunuel, bunuel, NNP)","The Spaniard Luis Bunuel, whose impressive fil...","[luis bunuel, bunuel whose]","[(bunuel, 0, 48835)]","[(bunuel whose, 0, 2), (luis bunuel, 0, 0)]",bunuel whose,bunuel,[]
20586,47541,"[(modern, modern, JJ), (family, family, NN), (...","(phil, phil, NNP)","Second family, which is the most common type, ...","[to phil, phil who]","[(phil, 0, 14637522)]","[(phil who, 0, 9010), (to phil, 0, 173337)]",phil who,phil,[]
20655,47839,"[(relationships, relationship, NNS), (are, be,...","(esp, esp, NN)",The Canadian Oxford definition says that a sou...,[bond esp],"[(esp, 0, 4612780)]","[(bond esp, 0, 105)]",bond esp,esp,[]
20689,47868,"[(annie, annie, NNP), (is, be, VBZ), (my, my, ...","(maths, maths, NNP)",She is good at Maths which always makes me con...,"[at maths, maths which]","[(maths, 0, 3231878)]","[(at maths, 0, 7165), (maths which, 0, 2555)]",at maths,maths,[]
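
A possible explanation for the empty lists: in every row shown above, the suggestion's edit distance is 0, which suggests the token is already in the SymSpell frequency dictionary (e.g. _blvd_, _maths_, _phil_). The bigram then comes back unchanged, so `key_word` finds no words that are new. The sketch below is one way to guard against this, under the assumption that each row is handled one bigram at a time; `key_word_or_original` is a hypothetical helper, and plain `str.split` stands in for `lex.re_tokenize`.

```python
def key_word_or_original(bigram, bigram_correction, misspelling):
    """Return the words introduced by the correction, or the original token if nothing changed."""
    original_words = bigram.split()
    new_words = [w for w in bigram_correction.split() if w not in original_words]
    # An unchanged bigram means the token was already a known word
    return new_words if new_words else [misspelling]

print(key_word_or_original("at maths", "at maths", "maths"))             # ['maths']
print(key_word_or_original("time beacause", "time because", "beacause")) # ['because']
```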


Manually checking items with more than one word in the final correction - bigrams with two misspellings will need to be trimmed, e.g. 'dovn mircowave'

In [87]:
# Rows from sentences with more than one misspelling whose final correction has more than one word
pd.concat(g for _, g in non_words2.groupby("sentence") if len(g) > 1).loc[(non_words2.final_correction.str.len() > 1),:]

Unnamed: 0,answer_id,tok_lem_POS,misspelling,sentence,bigrams,unigram_suggestions,bigram_suggestions,bigram_correction,unigram_correction,final_correction
1956,3622,"[(sultan, sultan, NNP), (alhimali, alhimali, V...","(alhimali, alhimali, VBZ)",\n\n\n \n\nSultan alhimali\n\n\n\n The beduons...,"[sultan alhimali, alhimali the]",[],"[(all mali the, 2, 244), (sultan all mali, ...",all mali the,,"[all, mali]"
1957,3622,"[(sultan, sultan, NNP), (alhimali, alhimali, V...","(beduons, beduons, NNS)",\n\n\n \n\nSultan alhimali\n\n\n\n The beduons...,"[the beduons, beduons have]","[(beacons, 2, 618822), (bedouins, 2, 55824...","[(the bed on, 2, 199223), (bed on have, 2, ...",the bed on,beacons,"[bed, on]"
13659,31540,"[(the, the, DT), (floence, floence, NN), (when...","(pronaionstion, pronaionstion, NN)",\n\nthe floence when i read and the right pron...,"[right pronaionstion, pronaionstion for]",[],"[(prozac option for, 5, 1), (right prozac op...",prozac option for,,"[prozac, option]"
8198,19783,"[(the, the, DT), (white, white, NNP), (house, ...","(cigaretts, cigaretts, NN)","\n \nThe White House\nWashington, D.C.20500 07...","[s cigaretts, cigaretts last]","[(cigarette, 1, 6890707), (cigarettes, 1, ...","[(a cigarette, 2, 61054), (cigarette last, ...",a cigarette,cigarette,"[a, cigarette]"
8199,19783,"[(the, the, DT), (white, white, NNP), (house, ...","(ecenbarger, ecenbarger, NNP)","\n \nThe White House\nWashington, D.C.20500 07...","[william ecenbarger, ecenbarger from]",[],"[(even larger from, 3, 46153), (william even...",even larger from,,"[even, larger]"
...,...,...,...,...,...,...,...,...,...,...
9026,21601,"[(yasterday, yasterday, NN), (when, when, WRB)...","(didnot, didnot, VBD)",yasterday when Iawoke up I called up my friend...,"[she didnot, didnot pick]","[(dicot, 2, 89663), (dido, 2, 839057), (mi...","[(did not pick, 1, 304251), (she did not, 1...",did not pick,dicot,"[did, not]"
9028,21601,"[(yasterday, yasterday, NN), (when, when, WRB)...","(numberand, numberand, NN)",yasterday when Iawoke up I called up my friend...,"[stange numberand, numberand i]","[(numbered, 2, 4773804), (cumberland, 2, 3...","[(stage number and, 2, 18406), (number and i...",stage number and,numbered,"[stage, number, and]"
10658,24794,"[(fantastic, fantastic, JJ), (galaxy, galaxy, ...","(earthscape, earthscape, NN)",you will see such an incredibly wonderful sigh...,"[beautiful earthscape, earthscape reservation]",[],"[(beautiful earth scape, 1, 0), (earth scape...",beautiful earth scape,,"[earth, scape]"
18084,40470,"[(yse, yse, NN), (,, ,, ,), (iam, iam, NNP), (...","(yse, yse, NN)","yse,Iam .I am recycling some staff.",[yse iam],"[(use, 1, 719980257), (lyse, 1, 62280), (y...","[(use i am, 2, 404936)]",use i am,use,"[use, i, am]"
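
One possible way to trim a corrected bigram down to the target word is to use the position of the misspelling within the original bigram instead of comparing word sets. The sketch below is an untested idea rather than a solution: `trim_to_misspelling` is a hypothetical helper, and it assumes the context word maps to exactly one token in the correction. That assumption fails for rows such as `yse iam` -> `use i am`, where the context word is itself split in two.

```python
def trim_to_misspelling(bigram, bigram_correction, misspelling):
    """Drop the context word from a corrected bigram, keeping the corrected target."""
    first, second = bigram.split()
    corrected = bigram_correction.split()
    if misspelling == first:
        # Context word is on the right, so drop the last corrected token
        kept = corrected[:-1]
    elif misspelling == second:
        # Context word is on the left, so drop the first corrected token
        kept = corrected[1:]
    else:
        raise ValueError("misspelling is not part of the bigram")
    return " ".join(kept)

print(trim_to_misspelling("dovn mircowave", "down microwave", "dovn"))  # down
print(trim_to_misspelling("she didnot", "she did not", "didnot"))       # did not
print(trim_to_misspelling("in wallmart", "in wall art", "wallmart"))    # wall art
```

Returning a single string rather than a list would also keep `final_correction` hashable for the later mapping step.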


# END OF STUCK SECTION

In [88]:
# If no correction, use original word

non_words2['final_correction'] = non_words2['final_correction'].fillna(non_words2.misspelling.apply(lambda x: x[0]))

In [89]:
# Create correction_POS column

non_words2['final_correction_POS'] = list(zip(non_words2.final_correction, non_words2.misspelling.apply(lambda x: x[2])))
non_words2.head()

Unnamed: 0,answer_id,tok_lem_POS,misspelling,sentence,bigrams,unigram_suggestions,bigram_suggestions,bigram_correction,unigram_correction,final_correction,final_correction_POS
0,8,"[(i, i, PRP), (organized, organize, VBD), (the...","(beacause, beacause, NN)","I organized the instructions by time, beacause...","[time beacause, beacause to]","[(because, 1, 271323986)]","[(because to, 1, 3213023), (time because, 1...",because to,because,[because],"([because], NN)"
1,11,"[(to, to, TO), (make, make, VB), (tea, tea, NN...","(wallmart, wallmart, NN)","next, you need to buy a box of tea in wallmart...","[in wallmart, wallmart or]","[(walmart, 1, 2269839)]","[(in wall art, 1, 99805), (wall art or, 1, ...",in wall art,walmart,"[wall, art]","([wall, art], NN)"
2,13,"[(first, first, RB), (,, ,, ,), (you, you, PRP...","(dovn, dovn, NN)","First, you should take some hot water, you can...","[use dovn, dovn mircowave]","[(dove, 1, 3253560), (donn, 1, 299470), (d...","[(down microwave, 2, 1960), (use down, 1, ...",down microwave,dove,"[down, microwave]","([down, microwave], NN)"
3,13,"[(first, first, RB), (,, ,, ,), (you, you, PRP...","(mircowave, mircowave, VBP)","First, you should take some hot water, you can...","[dovn mircowave, mircowave or]","[(microwave, 1, 8934594)]","[(microwave or, 1, 22584), (down microwave, ...",microwave or,microwave,[microwave],"([microwave], VBP)"
4,15,"[(in, in, IN), (my, my, PRP$), (country, count...","(fitst, fitst, NNP)","Fitst, boil a water in a pot.",[fitst boil],"[(fist, 1, 7319405), (first, 1, 578161543)...","[(first boil, 1, 1366)]",first boil,fist,[first],"([first], NNP)"


### Incorporating corrections into `pelic_df`

# THIS WON'T WORK UNTIL EARLIER ISSUE RESOLVED AND `final_correction_POS` contains a tuple in each row (no lists in position x[0]).

In [90]:
# Creating dictionary for mapping - key = incorrect spelling, value = correct spelling

misspell_dict = pd.Series(non_words2.final_correction_POS.values,non_words2.tok_lem_POS).to_dict()

TypeError: unhashable type: 'list'
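
Dictionary keys must be hashable. In `non_words2`, the `tok_lem_POS` column holds a list of tuples for the whole text, so using it as the Series index is a likely source of the `TypeError` above. The `misspelling` column holds a single `(token, lemma, POS)` tuple, which is hashable and matches the key that the lookup in the next cell expects. A minimal sketch with toy rows (assumptions for illustration, not PELIC data):

```python
# Toy rows: (misspelling tuple, corrected (token, POS) tuple)
rows = [
    (("beacause", "beacause", "NN"), ("because", "NN")),
    (("mircowave", "mircowave", "VBP"), ("microwave", "VBP")),
]

# Tuples are hashable, so they can be used as dictionary keys
misspell_map = {misspelling: correction for misspelling, correction in rows}

try:
    {[("a", "a", "DT")]: ("a", "DT")}  # a list as a key is rejected
except TypeError as err:
    print(err)

print(misspell_map[("beacause", "beacause", "NN")])
```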

In [None]:
misspell_dict

In [None]:
# Incorporating back into pelic_df

pelic_df['tok_POS_corrected'] = pelic_df['tok_lem_POS'].apply\
(lambda row: [misspell_dict[(x[0].lower(),x[1],x[2])] if (x[0].lower(),x[1],x[2]) in misspell_dict else (x[0],x[2]) for x in row])

# One minor issue is that this will make misspelled items lower case when originally upper case.
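
The casing issue noted above could be handled by copying the capitalisation of the original token onto the correction. The sketch below is one simple way to do that; `match_case` is a hypothetical helper that only distinguishes all-caps, initial-cap, and everything else.

```python
def match_case(original, corrected):
    """Copy the capitalisation pattern of `original` onto `corrected`."""
    if original.isupper():
        return corrected.upper()
    if original[:1].isupper():
        return corrected[:1].upper() + corrected[1:]
    return corrected

print(match_case("Fitst", "first"))      # First
print(match_case("BECUASE", "because"))  # BECAUSE
print(match_case("becuase", "because"))  # because
```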

In [None]:
# Checking with 'becuase'

print(pelic_df.loc[pelic_df.text.str.contains('becuase')].iloc[1,11]) #uncorrected
print(pelic_df.loc[pelic_df.text.str.contains('becuase')].iloc[1,12]) #corrected

We can see here that many appropriate corrections have been made, including _beccuase_ -> _because_ , _nise_ -> _nice_ , and _lovily_ -> _lovely_ .  
Importantly, incorrect spellings that are actual words, e.g. _hem_ (should be _him_ in this case), are not corrected. In addition, as context is not considered, there will be some inaccuracies, e.g. _realy_ (marked as an adj) -> _real_ rather than _really_.

In [None]:
pelic_df.head()

In [None]:
# Write out new PELIC_compiled.csv

pelic_df.to_csv('PELIC_compiled_spellcorrected.csv', encoding='utf-8', index=True)

In [None]:
# Pickle new pelic_df dataframe

pelic_df.to_pickle('pelic_spellcorrected.pkl')

If preferred, this entire spelling correction process can also be applied to [`answer.csv`]() instead of `PELIC_compiled`.

[Back to top](#Corrected-spelling)