# PELIC spelling

This notebook adds further processing to `answer.csv`  in the [`PELIC-dataset`](https://github.com/ELI-Data-Mining-Group/PELIC-dataset) repo ([`corpus_files` folder](https://github.com/ELI-Data-Mining-Group/PELIC-dataset/tree/master/corpus_files)) by creating a column of tok_POS and lemma_POS whose spelling has been automatically corrected.

**Notebook contents:**
- [Building `non_words_df`](#Building-non_words_df)
- [Building `misspell_df`](#Building-misspell_df)
- [Applying spelling correction](#Applying-spelling-correction)
- [Incorporating corrections into `answer_df`](#Incorporating-corrections-into-answer_df)

## Building non_words_df
In this section, we build a dataframe, `non_words_df`, which collects all of the non-words from the PELIC dataset (in `answer.csv`). The final dataframe has the following columns:
- `non_word`: tuples with the non-words and their parts of speech
- `sentence`: the complete sentence containing the non-word to provide context
- `answer_id`: the id of the text they come from

In [1]:
# Import necessary modules

import pandas as pd
import pprint
import numpy as np
from ast import literal_eval
import nltk
from tqdm import tqdm
import random

In [2]:
# Read in PELIC_compiled.csv

pelic_df = pd.read_csv("../PELIC-dataset/PELIC_compiled.csv", index_col = 'answer_id', # answer_id is unique
                      dtype = {'level_id':'object','question_id':'object','version':'object','course_id':'object'}, # str not ints
                               converters={'tokens':literal_eval,'tok_lem_POS':literal_eval}) # read in as lists
pelic_df.head()

Unnamed: 0_level_0,anon_id,L1,gender,course_id,level_id,class_id,question_id,version,text_len,text,tokens,tok_lem_POS
answer_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
1,eq0,Arabic,Male,149,4,g,5,1,177,I met my friend Nife while I was studying in a...,"[I, met, my, friend, Nife, while, I, was, stud...","((I, i, PRP), (met, meet, VBD), (my, my, PRP$)..."
2,am8,Thai,Female,149,4,g,5,1,137,"Ten years ago, I met a women on the train betw...","[Ten, years, ago, ,, I, met, a, women, on, the...","((Ten, ten, CD), (years, year, NNS), (ago, ago..."
3,dk5,Turkish,Female,115,4,w,12,1,64,In my country we usually don't use tea bags. F...,"[In, my, country, we, usually, do, n't, use, t...","((In, in, IN), (my, my, PRP$), (country, count..."
4,dk5,Turkish,Female,115,4,w,13,1,6,I organized the instructions by time.,"[I, organized, the, instructions, by, time, .]","((I, i, PRP), (organized, organize, VBD), (the..."
5,ad1,Korean,Female,115,4,w,12,1,59,"First, prepare a port, loose tea, and cup.\nSe...","[First, ,, prepare, a, port, ,, loose, tea, ,,...","((First, first, RB), (,, ,, ,), (prepare, prep..."


The focus here is the `tok_lem_POS` column, but all columns will be kept as the entire df will be written out at the end of the notebook.

In [3]:
# Creating small dataframe to be used for finding non-words

non_words = pelic_df[['text','tok_lem_POS']]

**Note:** For spelling correction, it is necessary to decide what list of words will be used for determining if a word is real or not.

Here, we use the [`SCOWL_condensed.txt`](https://github.com/ELI-Data-Mining-Group/PELIC-spelling/blob/master/SCOWL_condensed.txt) file which is a combination of wordlists available for download at http://wordlist.aspell.net/. We include items from all the dictionaries _except_ the abbreviations dictionary. For a detailed look at the compilation of this dictionary, please see the [SCOWL_wordlist](https://github.com/ELI-Data-Mining-Group/PELIC-spelling/blob/master/SCOWL_wordlist.ipynb) notebook.

In [4]:
#Reading in SCOWL_condensed as a set as a lookup list for spelling (500k words)

scowl = set(open("SCOWL_condensed.txt", "r").read().split('\n'))
print(random.sample(scowl,5))

['unsensitizing', 'irreversibilities', 'handjar', 'polyfenestral', 'inseverably']


The following is a list of words which should be considered words but which were previously being labelled as non-words. These items have been manually added to this list based on output later in this notebook. Most of these items are food items, names, or abbreviations.

In [5]:
actually_ok = open("actually_ok", "r").read().split(',')
actually_ok = [x[2:-1] for x in actually_ok]
print(len(actually_ok))
print(actually_ok)

96
['adha', 'adj', 'ahamed', 'alaikum', 'anonurlpage', 'antiretroviral', 'arpa', 'beyonce', 'bibimbap', 'bio', 'biodiesel', 'bioethanol', 'bulgogi', 'bundang', 'cafe', 'carnaval', 'cds', 'cf', 'co', 'comscore', 'cyber', 'ddukboggi', 'def', 'dr', 'eg', 'eid', 'electrospray', 'entrees', 'erectus', 'etc', 'fiance', 'fiancee', 'fiter', 'fitir', 'fitr', 'fl', 'freediving', 'fukubukuro', 'geolinguist', 'hikikomori', 'hp', 'ibt', 'iq', 'iriver', 'jetta', 'jul', 'kabsa', 'kaled', 'kawader', 'km', 'leisureville', 'll', 'maamool', 'mayumi', 'mcdonalds', 'min', 'mongongo', 'nc', 'neuro', 'nian', 'notting', 'okroshka', 'onsen', 'pajeon', 'pbt', 'pc', 'pcs', 'pp', 'pudim', 'puket', 'samear', 'shui', 'sq', 'st', 'staycation', 'sth', 'taoyuan', 'toefl', 'trans', 'transgene', 'tv', 'unsub', 'va', 'vol', 'vs', 'webaholic', 'webaholics', 'webaholism', 'wenjing', 'woong', 'yaoming', 'ying', 'yingdong', 'yugong', 'yuval', 'zi']


In [6]:
# Lower case all toks

non_words.tok_lem_POS = non_words.tok_lem_POS.apply(lambda row: [(x[0].lower(),x[1],x[2]) for x in row])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


In [7]:
# Function to find non-words

def spell_check(tok_lem_POS_list):
    word_list = scowl # Choose word_list here. Default is scowl described above.
    not_in_word_list = []
    for tok_lem_POS in tok_lem_POS_list:
        if tok_lem_POS[0] not in word_list and tok_lem_POS[0] not in actually_ok:
            not_in_word_list.append(tok_lem_POS)
    return not_in_word_list

In [8]:
# Apply spell check function to find all misspelled-words. 

non_words['misspelled_words'] = non_words.tok_lem_POS.apply(spell_check)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until


In [9]:
non_words.head()

Unnamed: 0_level_0,text,tok_lem_POS,misspelled_words
answer_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,I met my friend Nife while I was studying in a...,"[(i, i, PRP), (met, meet, VBD), (my, my, PRP$)...","[(., ., .), (., ., .), (., ., .), (;, ;, :), (..."
2,"Ten years ago, I met a women on the train betw...","[(ten, ten, CD), (years, year, NNS), (ago, ago...","[(,, ,, ,), (,, ,, ,), (., ., .), (;, ;, :), (..."
3,In my country we usually don't use tea bags. F...,"[(in, in, IN), (my, my, PRP$), (country, count...","[(., ., .), (,, ,, ,), (., ., .), (., ., .), (..."
4,I organized the instructions by time.,"[(i, i, PRP), (organized, organize, VBD), (the...","[(., ., .)]"
5,"First, prepare a port, loose tea, and cup.\nSe...","[(first, first, RB), (,, ,, ,), (prepare, prep...","[(,, ,, ,), (,, ,, ,), (,, ,, ,), (., ., .), (..."


#### Adding context to the dataframe
Seeing the mistakes in the context of a sentence will allow for better manual checking if required.

In [10]:
# Sent-tokenizing the text

non_words['sents'] = non_words['text'].apply(lambda x: nltk.sent_tokenize(x))

# And delete text column which is no longer needed

del non_words['text']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until


In [11]:
# Removing punctuation, numbers, and non-alphanumeric symbols from misspelled_words

non_words.misspelled_words = non_words.misspelled_words.apply(lambda row: [x for x in row if x[0].isalpha()])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


In [12]:
# Removing proper names - NNP, NNPS

non_words.misspelled_words = non_words.misspelled_words.apply(lambda row: [x for x in row if x[2] != 'NNP' and x[1] != 'NNPS'])

In [13]:
# Removing all words with length 1

non_words.misspelled_words = non_words.misspelled_words.apply(lambda row: [x for x in row if len(x[0]) > 1])

In [14]:
non_words.head()

Unnamed: 0_level_0,tok_lem_POS,misspelled_words,sents
answer_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,"[(i, i, PRP), (met, meet, VBD), (my, my, PRP$)...",[],[I met my friend Nife while I was studying in ...
2,"[(ten, ten, CD), (years, year, NNS), (ago, ago...",[],"[Ten years ago, I met a women on the train bet..."
3,"[(in, in, IN), (my, my, PRP$), (country, count...",[],"[In my country we usually don't use tea bags.,..."
4,"[(i, i, PRP), (organized, organize, VBD), (the...",[],[I organized the instructions by time.]
5,"[(first, first, RB), (,, ,, ,), (prepare, prep...",[],"[First, prepare a port, loose tea, and cup., S..."


And convert into more of a concordance format.

In [15]:
def get_misspellings():
    num = 1
    collections = []

    while num-1 < len(non_words):
        misspelled_words = non_words['misspelled_words'].iloc[num-1]
        sentences = non_words['sents'].iloc[num-1]
        sentences = [sent.lower() for sent in sentences]
    
        index = 0
        while index < len(sentences):
            for word in misspelled_words:
                if len(word[0]) == 1:
                    pass
                elif sentences[index][0] == '[':
                    pass
                elif word[0] in sentences[index]:
                    collections.append((word, sentences[index], num))
            index += 1
        num += 1
    
    return collections

In [16]:
tups = get_misspellings()

In [17]:
# Rebuilding the non_words df

non_words = pd.DataFrame(tups, columns = ['tok_lem_POS', 'sentence', 'answer_id'])
non_words.head()

Unnamed: 0,tok_lem_POS,sentence,answer_id
0,"(beacause, beacause, NN)","i organized the instructions by time, beacause...",8
1,"(wallmart, wallmart, NN)","next, you need to buy a box of tea in wallmart...",11
2,"(dovn, dovn, NN)","first, you should take some hot water, you can...",13
3,"(mircowave, mircowave, VBP)","first, you should take some hot water, you can...",13
4,"(paragragh, paragragh, NN)",every paragragh's instructions depend on a mai...,16


In [18]:
# Total number of non-words

len(non_words)

21368

#### Creating a dataframe of misspellings
In the `non-words` dataframe above, each row is an occurrence of a misspelling (i.e. _tokens_ ). We also want a dataframe where each row is a misspelling _type_ with frequency information attached.

In [19]:
# Gathering the total misspellings

total_misspellings = [x for x in non_words['tok_lem_POS']]
len(total_misspellings)

21368

#### Creating a dataframe of misspelled frequencies
This dataframe will be what we apply the spell correction too.

In [20]:
# To keep an account of misspelling frequency, create a dictionary with frequencies

misspell_freq_dict = {}
for word in total_misspellings:
    if word not in misspell_freq_dict:
        misspell_freq_dict[word] = 1
    else:
        misspell_freq_dict[word] += 1

In [21]:
print(random.sample(list(misspell_freq_dict),5))

[('instand', 'instand', 'NN'), ('raeder', 'raeder', 'NN'), ('stuednts', 'stuednts', 'NNS'), ('instraction', 'instraction', 'NN'), ('chooesd', 'chooesd', 'VBP')]


In [22]:
# Remove duplicates

final_misspellings = sorted(list(set(total_misspellings)))
len(final_misspellings)

11084

In [23]:
# Constructing misspell_df

misspell_df = pd.DataFrame(final_misspellings)
misspell_df.head()

Unnamed: 0,0,1,2
0,aa,aa,VB
1,aabout,aabout,IN
2,aad,aad,JJ
3,aain,aain,VBP
4,aare,aare,IN


In [24]:
# Renaming columns to match other DataFrames in this notebook

misspell_df.rename(columns = {0: 'misspelling',1:'lemma',2:'POS'}, inplace = True)

In [25]:
# Creating recreating tok_lem_POS column to match dictionary

misspell_df['tok_lem_POS'] = list(zip(misspell_df.misspelling, misspell_df.lemma, misspell_df.POS))
misspell_df.head()

Unnamed: 0,misspelling,lemma,POS,tok_lem_POS
0,aa,aa,VB,"(aa, aa, VB)"
1,aabout,aabout,IN,"(aabout, aabout, IN)"
2,aad,aad,JJ,"(aad, aad, JJ)"
3,aain,aain,VBP,"(aain, aain, VBP)"
4,aare,aare,IN,"(aare, aare, IN)"


In [26]:
# Mapping dictionary to DataFrame

misspell_df['freq'] = misspell_df['tok_lem_POS'].map(misspell_freq_dict)

In [27]:
# Sorting by frequency

misspell_df = misspell_df.sort_values(by=['freq'], ascending=False)

#### actually_ok
The following is the basis for the 'acutally_ok' list used earlier. Here, errors with a frequency of 10 or more were manually checked, and if determined to be a real word, were added to the actually_ok list. There were originally 267 items which met this criteria.

In [28]:
print(len(misspell_df.loc[misspell_df.freq >= 10]))
misspell_df.loc[misspell_df.freq >= 10]

189


Unnamed: 0,misspelling,lemma,POS,tok_lem_POS,freq
439,alot,alot,NN,"(alot, alot, NN)",173
2704,defition,defition,NN,"(defition, defition, NN)",128
6647,nepotizm,nepotizm,NN,"(nepotizm, nepotizm, NN)",126
619,apartament,apartament,NN,"(apartament, apartament, NN)",120
11036,yut,yut,NN,"(yut, yut, NN)",96
...,...,...,...,...,...
6462,mounths,mounths,NNS,"(mounths, mounths, NNS)",10
7210,peaple,peaple,NN,"(peaple, peaple, NN)",10
2947,differnt,differnt,NN,"(differnt, differnt, NN)",10
3880,experiance,experiance,NN,"(experiance, experiance, NN)",10


In [29]:
pd.options.display.max_rows = 1000
print(len(set(misspell_df.loc[misspell_df.freq >= 10].misspelling.to_list())))
set(misspell_df.loc[misspell_df.freq >= 10].misspelling.to_list())

173


{'acording',
 'addtion',
 'adventages',
 'advertisment',
 'airplan',
 'alfter',
 'alot',
 'analytes',
 'anoise',
 'apartament',
 'apartement',
 'appartment',
 'aquired',
 'ar',
 'aupair',
 'auther',
 'axé',
 'beacuse',
 'beatiful',
 'becasue',
 'becaue',
 'becaus',
 'becouse',
 'becuase',
 'befor',
 'begining',
 'belived',
 'benefites',
 'benifits',
 'beutifull',
 'bousporus',
 'bridege',
 'bridg',
 'brillient',
 'caffiene',
 'caral',
 'celefone',
 'ch',
 'chiken',
 'childern',
 'childrens',
 'chres',
 'chuan',
 'citys',
 'coffe',
 'comercials',
 'conclution',
 'confortable',
 'contry',
 'countery',
 'ddui',
 'defition',
 'dementi',
 'deprission',
 'desicion',
 'devorce',
 'diferent',
 'differents',
 'differnt',
 'diffirent',
 'diffrent',
 'dopperganger',
 'eatting',
 'etat',
 'eventhough',
 'everythings',
 'everytime',
 'evry',
 'experiance',
 'experince',
 'familly',
 'famos',
 'fastfood',
 'favorate',
 'favorit',
 'forigners',
 'freind',
 'freinds',
 'friens',
 'frind',
 'frinds',
 

### Applying spelling correction

Selected spell check: SymSpell https://github.com/mammothb/symspellpy

In some ways symspell is not ideal as sentence context is not considered, only general frequencies. However, other well-known spellcheckers (hunspell, pyspell, etc.) use the same strategy - frequency based criteria for suggestions, without considering immediate cotext. As such, we have followed this common practice, but it is important to remember that accuracy of corrected tokens will not be 100% and must be taken into consideration.

There is a dictionary file which which needs to be installed (saved to repo):
[frequency_dictionary_en_82_765.txt](https://symspellpy.readthedocs.io/en/latest/users/installing.html)

To install symspellpy the first time, use pip in command line: `pip install -U symspellpy`

In [30]:
# Setting up symspell

from itertools import islice
import pkg_resources
from symspellpy import SymSpell
from symspellpy import Verbosity
sym_spell = SymSpell()
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt")
sym_spell.load_dictionary(dictionary_path, 0, 1)

# Print out first 5 elements to demonstrate that dictionary is successfully loaded
list(islice(sym_spell.words.items(), 5))

[('the', 23135851162),
 ('of', 13151942776),
 ('and', 12997637966),
 ('to', 12136980858),
 ('a', 9081174698)]

In [31]:
# Testing with 'becuase'

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt")
# term_index is the column of the term and count_index is the column of the term frequency
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

input_term = "becuase"
suggestions = sym_spell.lookup(input_term, Verbosity.CLOSEST, max_edit_distance=2, #Edit distance can be adjusted
                               transfer_casing=True, #Optional argument set to ignore case
                              include_unknown=True) #Return same word if unknown
for suggestion in suggestions:
    print(suggestion)  

because, 1, 271323986


In [32]:
# Creating function for applying the above code

def get_suggestions(word):
    if len(word) >= 4:
        suggestions = sym_spell.lookup(word, Verbosity.CLOSEST,max_edit_distance=2, transfer_casing=True)
    else:
        suggestions = sym_spell.lookup(word, Verbosity.CLOSEST,max_edit_distance=1, transfer_casing=True)
    return [str(x).split(',') for x in suggestions]

**Note**: The function has a variable edit distance: words of length 4 or more get edit distance of 2, shorter words get edit distance of 1. These preferences can be adjusted in the function if desired.

In [33]:
# Applying function to create new column

misspell_df['suggestions'] =  misspell_df['misspelling'].apply(get_suggestions)
misspell_df.head()

Unnamed: 0,misspelling,lemma,POS,tok_lem_POS,freq,suggestions
439,alot,alot,NN,"(alot, alot, NN)",173,"[[lot, 1, 106405208], [slot, 1, 21602762],..."
2704,defition,defition,NN,"(defition, defition, NN)",128,"[[edition, 2, 110051463], [decision, 2, 71..."
6647,nepotizm,nepotizm,NN,"(nepotizm, nepotizm, NN)",126,"[[nepotism, 1, 174108]]"
619,apartament,apartament,NN,"(apartament, apartament, NN)",120,"[[apartment, 1, 30771172]]"
11036,yut,yut,NN,"(yut, yut, NN)",96,"[[but, 1, 999899654], [out, 1, 741601852],..."


In [34]:
len(misspell_df.loc[(misspell_df.suggestions.str.len() == 0),:])

#Items with no suggestions - these will be left in their original form though manual corrections could be applied if desired

736

In [35]:
# Create new column with just the most likely correction (based on frequency)

misspell_df['correction'] = [x[0][0] if len(x) != 0 else np.NaN for x in misspell_df['suggestions']]

In [36]:
# If no correction, use original word

misspell_df.correction.fillna(misspell_df.misspelling, inplace=True)

In [37]:
# Create correction_POS column

misspell_df['correction_POS'] = list(zip(misspell_df.correction, misspell_df.POS))
misspell_df.head()

Unnamed: 0,misspelling,lemma,POS,tok_lem_POS,freq,suggestions,correction,correction_POS
439,alot,alot,NN,"(alot, alot, NN)",173,"[[lot, 1, 106405208], [slot, 1, 21602762],...",lot,"(lot, NN)"
2704,defition,defition,NN,"(defition, defition, NN)",128,"[[edition, 2, 110051463], [decision, 2, 71...",edition,"(edition, NN)"
6647,nepotizm,nepotizm,NN,"(nepotizm, nepotizm, NN)",126,"[[nepotism, 1, 174108]]",nepotism,"(nepotism, NN)"
619,apartament,apartament,NN,"(apartament, apartament, NN)",120,"[[apartment, 1, 30771172]]",apartment,"(apartment, NN)"
11036,yut,yut,NN,"(yut, yut, NN)",96,"[[but, 1, 999899654], [out, 1, 741601852],...",but,"(but, NN)"


In [38]:
misspell_df.sort_values(by=['freq'], ascending=False).sample(50)

Unnamed: 0,misspelling,lemma,POS,tok_lem_POS,freq,suggestions,correction,correction_POS
4454,garaoke,garaoke,NN,"(garaoke, garaoke, NN)",2,"[[karaoke, 1, 7568648]]",karaoke,"(karaoke, NN)"
9203,squerilhill,squerilhill,NN,"(squerilhill, squerilhill, NN)",1,[],squerilhill,"(squerilhill, NN)"
8558,schildren,schildren,NN,"(schildren, schildren, NN)",1,"[[children, 1, 206538107]]",children,"(children, NN)"
7201,pdf,pdf,VB,"(pdf, pdf, VB)",2,"[[psf, 1, 950636]]",psf,"(psf, VB)"
9496,supermarke,supermarke,NN,"(supermarke, supermarke, NN)",1,"[[supermarket, 1, 2885844]]",supermarket,"(supermarket, NN)"
1881,citicen,citicen,JJ,"(citicen, citicen, JJ)",1,"[[citizen, 1, 16245917]]",citizen,"(citizen, JJ)"
810,assignement,assignement,NN,"(assignement, assignement, NN)",1,"[[assignment, 1, 14722710]]",assignment,"(assignment, NN)"
8734,seris,seris,VBZ,"(seris, seris, VBZ)",1,"[[series, 1, 161518557], [serif, 1, 210933...",series,"(series, VBZ)"
5350,inpiration,inpiration,NN,"(inpiration, inpiration, NN)",1,"[[inspiration, 1, 8460625]]",inspiration,"(inspiration, NN)"
3516,enjouy,enjouy,VBP,"(enjouy, enjouy, VBP)",2,"[[enjoy, 1, 50141455]]",enjoy,"(enjoy, VBP)"


### Incorporating corrections into `answer_df`

In [39]:
# First removing items from misspell_df where no correction will take place

print(len(misspell_df))
misspell_df = misspell_df.loc[misspell_df.misspelling != misspell_df.correction]
print(len(misspell_df))

11084
10239


In [40]:
misspell_df.head()

Unnamed: 0,misspelling,lemma,POS,tok_lem_POS,freq,suggestions,correction,correction_POS
439,alot,alot,NN,"(alot, alot, NN)",173,"[[lot, 1, 106405208], [slot, 1, 21602762],...",lot,"(lot, NN)"
2704,defition,defition,NN,"(defition, defition, NN)",128,"[[edition, 2, 110051463], [decision, 2, 71...",edition,"(edition, NN)"
6647,nepotizm,nepotizm,NN,"(nepotizm, nepotizm, NN)",126,"[[nepotism, 1, 174108]]",nepotism,"(nepotism, NN)"
619,apartament,apartament,NN,"(apartament, apartament, NN)",120,"[[apartment, 1, 30771172]]",apartment,"(apartment, NN)"
11036,yut,yut,NN,"(yut, yut, NN)",96,"[[but, 1, 999899654], [out, 1, 741601852],...",but,"(but, NN)"


In [41]:
# Creating dictionary for mappying - key = incorrect spelling, value = correct spelling

misspell_dict = pd.Series(misspell_df.correction_POS.values,misspell_df.tok_lem_POS).to_dict()

In [42]:
misspell_dict

{('alot', 'alot', 'NN'): ('lot', 'NN'),
 ('defition', 'defition', 'NN'): ('edition', 'NN'),
 ('nepotizm', 'nepotizm', 'NN'): ('nepotism', 'NN'),
 ('apartament', 'apartament', 'NN'): ('apartment', 'NN'),
 ('yut', 'yut', 'NN'): ('but', 'NN'),
 ('studing', 'studing', 'VBG'): ('studying', 'VBG'),
 ('frisby', 'frisby', 'NN'): ('frisky', 'NN'),
 ('grammer', 'grammer', 'NN'): ('grammar', 'NN'),
 ('brillient', 'brillient', 'NN'): ('brilliant', 'NN'),
 ('brillient', 'brillient', 'VBN'): ('brilliant', 'VBN'),
 ('tution', 'tution', 'NN'): ('tuition', 'NN'),
 ('goverment', 'goverment', 'NN'): ('government', 'NN'),
 ('ludwing', 'ludwing', 'VBG'): ('ludwig', 'VBG'),
 ('fastfood', 'fastfood', 'NN'): ('eastwood', 'NN'),
 ('becuase', 'becuase', 'NN'): ('because', 'NN'),
 ('appartment', 'appartment', 'NN'): ('apartment', 'NN'),
 ('marrige', 'marrige', 'NN'): ('marriage', 'NN'),
 ('axé', 'axé', 'JJ'): ('axe', 'JJ'),
 ('etat', 'etat', 'NN'): ('eat', 'NN'),
 ('beatiful', 'beatiful', 'JJ'): ('beautiful', 'J

In [43]:
# Incorporating back into pelic_df

pelic_df['tok_POS_corrected'] = pelic_df['tok_lem_POS'].apply\
(lambda row: [misspell_dict[(x[0].lower(),x[1],x[2])] if (x[0].lower(),x[1],x[2]) in misspell_dict else (x[0],x[2]) for x in row])

# One minor issue is that this will make misspelled items lower case when originally upper case.

In [44]:
# Checking with 'becuase'

print(pelic_df.loc[pelic_df.text.str.contains('becuase')].iloc[1,11]) #uncorrected
print(pelic_df.loc[pelic_df.text.str.contains('becuase')].iloc[1,12]) #corrected

(('My', 'my', 'PRP$'), ('friend', 'friend', 'NN'), ('is', 'be', 'VBZ'), ('realy', 'realy', 'JJ'), ('nise', 'nise', 'RB'), ('guy', 'guy', 'NN'), ('.', '.', '.'), ('I', 'i', 'PRP'), ('like', 'like', 'VBP'), ('hem', 'hem', 'JJ'), ('becuase', 'becuase', 'NN'), ('he', 'he', 'PRP'), ('is', 'be', 'VBZ'), ('friendlly', 'friendlly', 'RB'), ('and', 'and', 'CC'), ('lovliy', 'lovliy', 'NN'), ('.', '.', '.'))
[('My', 'PRP$'), ('friend', 'NN'), ('is', 'VBZ'), ('real', 'JJ'), ('nice', 'RB'), ('guy', 'NN'), ('.', '.'), ('I', 'PRP'), ('like', 'VBP'), ('hem', 'JJ'), ('because', 'NN'), ('he', 'PRP'), ('is', 'VBZ'), ('friendly', 'RB'), ('and', 'CC'), ('lovely', 'NN'), ('.', '.')]


We can see here that many approrpriate corrections have been made, including _beccuase_ -> _because_ , _nise_ -> _nice_ , and _lovily_ -> _lovely_ .  
Importantly, incorrect spellings that are actual words, e.g. _hem_ (should be _him_ in this case) are not corrected. In addition, if the POS is incorrectly tagged, often due to the learner language, then the result may not be correct, e.g. _realy_ (marked as an adj) -> _real_ rather than _really_.

In [45]:
pelic_df.head()

Unnamed: 0_level_0,anon_id,L1,gender,course_id,level_id,class_id,question_id,version,text_len,text,tokens,tok_lem_POS,tok_POS_corrected
answer_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
1,eq0,Arabic,Male,149,4,g,5,1,177,I met my friend Nife while I was studying in a...,"[I, met, my, friend, Nife, while, I, was, stud...","((I, i, PRP), (met, meet, VBD), (my, my, PRP$)...","[(I, PRP), (met, VBD), (my, PRP$), (friend, NN..."
2,am8,Thai,Female,149,4,g,5,1,137,"Ten years ago, I met a women on the train betw...","[Ten, years, ago, ,, I, met, a, women, on, the...","((Ten, ten, CD), (years, year, NNS), (ago, ago...","[(Ten, CD), (years, NNS), (ago, RB), (,, ,), (..."
3,dk5,Turkish,Female,115,4,w,12,1,64,In my country we usually don't use tea bags. F...,"[In, my, country, we, usually, do, n't, use, t...","((In, in, IN), (my, my, PRP$), (country, count...","[(In, IN), (my, PRP$), (country, NN), (we, PRP..."
4,dk5,Turkish,Female,115,4,w,13,1,6,I organized the instructions by time.,"[I, organized, the, instructions, by, time, .]","((I, i, PRP), (organized, organize, VBD), (the...","[(I, PRP), (organized, VBD), (the, DT), (instr..."
5,ad1,Korean,Female,115,4,w,12,1,59,"First, prepare a port, loose tea, and cup.\nSe...","[First, ,, prepare, a, port, ,, loose, tea, ,,...","((First, first, RB), (,, ,, ,), (prepare, prep...","[(First, RB), (,, ,), (prepare, VB), (a, DT), ..."


In [46]:
# Write out new PELIC_compiled.csv

pelic_df.to_csv('PELIC_compiled_spellcorrected.csv', encoding='utf-8', index=True)

In [47]:
# Pickle new pelic_df dataframe
pelic_df.to_pickle('pelic_spellcorrected.pkl')

If preferred, this entire spelling correctin process can also be applied to [`answer.csv`]() instead of `PELIC_compiled`.

[Back to top](#Corrected-spelling)