Version 1.0
Authors: Ben Naismith, John Starr, Eva Bacas
Contact: bnaismith@pitt.edu
This `README.md` file introduces the PELIC-spelling repository, which provides information and code about applying spelling correction to the PELIC dataset. To download and find out more about the PELIC dataset, see the PELIC-dataset repository. For information regarding publications and presentations based on PELIC data, as well as for information regarding the people and parties responsible for the corpus, please visit the Pitt ELI Corpus web page.
Spelling correction is an important element to consider in any corpus study involving learner data. The decision whether to correct texts or not will invariably impact results: in some instances it may be preferable to use the raw text, maintaining its integrity and avoiding an additional layer of processing. However, for other projects, corrected text may provide a more accurate representation of the language features being investigated.
There are three main components to the spelling correction process, each presented in its own Jupyter notebook:
- SCOWL_wordlist: In this notebook we decide on a list of what we consider to be real words, using an edited version of the SCOWL wordlists.
- PELIC_spelling: In this notebook we create a dataframe of misspellings, apply an automated spelling correction process, and re-incorporate the corrected text into our corpus.
- PELIC_spelling_validation: In this notebook we detail a validation of the spell checker. Manual checking of spelling is performed on a sample of PELIC and is then compared to the output of the automated spell checker. The results indicate that the spell checker is highly accurate in terms of the total tokens in PELIC, but conservative, resulting in lower precision. For details, please see the Jupyter notebook.
The PELIC-spelling repository contains 14 main files:
File | File type | Description |
---|---|---|
all_names.txt | text | list of over 90,000 names (first and last) from the 1990 US census data. Names collected by the names random name generator project |
contractions.txt | text | short list of contractions approved as legitimate tokens (not misspellings) |
frequency_bigramdictionary_en_243_342.txt | text | bigram frequency dictionary supplied by SymSpell spell correction module |
frequency_dictionary_en_82_765.txt | text | frequency dictionary supplied by SymSpell spell correction module |
hyphens.txt | text | list of hyphenated words which appear in PELIC and have been approved as legitimate tokens (not misspellings) |
PELIC_compiled_spellcorrected.csv | csv | final output of updated PELIC_compiled.csv with spelling correction |
PELIC_spelling.ipynb | Jupyter notebook | notebook demonstrating how spelling correction is applied to PELIC texts |
PELIC-SCOWL.txt | text | a combination of the SCOWL_condensed.txt, contractions.txt, and hyphens.txt lists |
README.md | markdown | this file describing the repository |
SCOWL_condensed.txt | text | final compiled word list based on SCOWL word lists |
SCOWL_supp.txt | text | short list of words manually approved as being legitimate words, e.g. proper names not found in SCOWL |
SCOWL_wordlist.ipynb | Jupyter notebook | notebook demonstrating how the SCOWL_condensed word list is created |
SCOWL_wordlist.txt | text | the full SCOWL wordlist before condensing |
PELIC_spelling_validation.ipynb | Jupyter notebook | manual validation of the spell checker |
The SCOWL_wordlist notebook produces a definitive list of 'real' words to use when deciding what to consider a word or a non-word. The final output is the SCOWL_condensed.txt file. The primary wordlists are from the SCOWL set of word lists, freely available at http://wordlist.aspell.net/.
The notebook is divided into two main sections:
- Exploratory Data Analysis: Here, we examine the various SCOWL dictionaries, which include different language varieties, proper nouns, slang, abbreviations, etc. From this exploration, we opt to include all available dictionaries except the abbreviation dictionaries, because they contain a high number of short strings of letters which may match learner errors. It is possible, however, to include these dictionaries if desired.
- Compiling and condensing dictionaries: In the second part of the notebook, SCOWL_condensed is created by combining the various SCOWL dictionaries and then removing duplicates, blanks, and possessives. The final wordlist contains slightly fewer than 500k words.
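The condensing step described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name `condense_wordlists` and the toy input lists are hypothetical, and it assumes that possessives are entries ending in `'s` and that case is left untouched.

```python
def condense_wordlists(wordlists):
    """Merge several word lists into one sorted list of unique entries.

    Blank entries, duplicates, and possessive forms (ending in 's) are dropped.
    Assumption: the real notebook may normalize case or handle other edge cases.
    """
    condensed = set()  # a set removes duplicates across lists
    for words in wordlists:
        for word in words:
            word = word.strip()
            if not word:
                continue  # skip blank entries
            if word.endswith("'s"):
                continue  # skip possessives such as "dog's"
            condensed.add(word)
    return sorted(condensed)


# Toy stand-ins for separate SCOWL dictionaries (not real SCOWL content)
american = ["color", "dog", "dog's", ""]
british = ["colour", "dog", "  "]
print(condense_wordlists([american, british]))  # ['color', 'colour', 'dog']
```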
The PELIC_spelling notebook adds further processing to `PELIC_compiled.csv` in the `PELIC-dataset` repo by creating a column of tokens and their parts of speech which have been corrected in terms of spelling.
The notebook is divided into four main sections:
- Building a `non_words` dataframe: We first collect all of the non-words from the PELIC dataset (in `PELIC_compiled.csv`) by extracting all words which are not found in `SCOWL_condensed`:
>>> non_words.head()
Index | tok_lem_POS | sentence | answer_id |
---|---|---|---|
0 | ('beacause', 'beacause', 'NN') | i organized the instructions by time, beacause to make tea people who want to make tea have to follow the instructions step by step. | 8 |
1 | ('wallmart', 'wallmart', 'NN') | next, you need to buy a box of tea in wallmart or giant eagle. | 11 |
2 | ('dovn', 'dovn', 'NN') | first, you should take some hot water, you can use dovn, mircowave or other ways. | 13 |
3 | ('mircowave', 'mircowave', 'VBP') | first, you should take some hot water, you can use dovn, mircowave or other ways. | 13 |
4 | ('paragragh', 'paragragh', 'NN') | every paragragh's instructions depend on a main idea. | 16 |
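The extraction behind this table can be approximated as below. This is a dependency-free sketch that uses plain lists and dicts instead of a pandas dataframe; `find_non_words`, the toy word list, and the shortened sentence are hypothetical. It assumes that tokens with no alphabetic characters (punctuation, numbers) are skipped and that lookup is case-insensitive, which may differ from the notebook.

```python
def find_non_words(answers, wordlist):
    """Return one row per token occurrence that is absent from the word list."""
    known = {w.lower() for w in wordlist}
    rows = []
    for answer_id, sentence, tagged_tokens in answers:
        for token, lemma, pos in tagged_tokens:
            # Assumption: ignore punctuation and numbers entirely
            if not any(ch.isalpha() for ch in token):
                continue
            if token.lower() not in known:
                rows.append({
                    "tok_lem_POS": (token, lemma, pos),
                    "sentence": sentence,
                    "answer_id": answer_id,
                })
    return rows


# Toy word list and a shortened version of one sentence from the table above
wordlist = ["you", "can", "use", "or", "other", "ways", "oven", "microwave"]
answers = [
    (13, "you can use dovn, mircowave or other ways.", [
        ("you", "you", "PRP"), ("can", "can", "MD"), ("use", "use", "VB"),
        ("dovn", "dovn", "NN"), (",", ",", ","),
        ("mircowave", "mircowave", "VBP"), ("or", "or", "CC"),
        ("other", "other", "JJ"), ("ways", "way", "NNS"), (".", ".", "."),
    ]),
]
rows = find_non_words(answers, wordlist)
print([row["tok_lem_POS"][0] for row in rows])  # ['dovn', 'mircowave']
```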
- Building a dataframe of misspellings and their frequencies: In the non-words dataframe above, each row is a single occurrence of a misspelling (i.e. a token). We then create a dataframe where each row is a misspelling type with frequency information attached:
>>> misspell_df.sample(5)
Index | misspelling | tok_lem_POS | freq |
---|---|---|---|
9164 | spel | ('spel', 'spel', 'VB') | 1 |
5495 | invesigate | ('invesigate', 'invesigate', 'VB') | 1 |
3645 | estmatied | ('estmatied', 'estmatied', 'JJ') | 1 |
9313 | straigten | ('straigten', 'straigten', 'VB') | 1 |
8455 | hobbys | ('hobbys', 'hobbys', 'NN') | 2 |
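Moving from token occurrences to misspelling types amounts to a frequency count. A minimal sketch using `collections.Counter` follows; the function name `count_misspelling_types` and the toy rows are hypothetical, and it assumes types are keyed on the full `(token, lemma, POS)` triple, so the same string with two different POS tags would be counted as two types.

```python
from collections import Counter


def count_misspelling_types(non_word_rows):
    """Collapse token-level rows into one row per misspelling type with a frequency."""
    counts = Counter(row["tok_lem_POS"] for row in non_word_rows)
    return [
        {"misspelling": tok_lem_pos[0], "tok_lem_POS": tok_lem_pos, "freq": freq}
        for tok_lem_pos, freq in counts.most_common()  # most frequent first
    ]


# Toy token-level rows (other columns omitted for brevity)
non_word_rows = [
    {"tok_lem_POS": ("hobbys", "hobbys", "NN")},
    {"tok_lem_POS": ("spel", "spel", "VB")},
    {"tok_lem_POS": ("hobbys", "hobbys", "NN")},
]
for row in count_misspelling_types(non_word_rows):
    print(row["misspelling"], row["freq"])
```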
- Applying spelling correction: Having collected and organized the misspellings, we then correct these occurrences using `SymSpell`. SymSpell does not consider complete sentence context, only bigrams and frequencies. Though this is not ideal, other well-known spellcheckers (hunspell, pyspell, etc.) use the same strategy: frequency-based criteria for suggestions, without considering co-text beyond bigrams. As such, it is important to remember that the accuracy of corrected tokens will not be 100%, and this must be taken into consideration.
>>> print(non_words2[['answer_id','misspelling','sentence','final_correction_POS']].sample(5))
# Sample of 5 rows and key columns
answer_id | misspelling | sentence | final_correction_POS |
---|---|---|---|
11487 | ('celemony', 'celemony', 'NN') | Third, the ANON_NAME_0-Ju international movie celemony is opened in my hometown. | ('ceremony', 'ceremony', 'NN') |
13444 | ('miliion', 'miliion', 'NN') | 200 miliion people | ('million', 'million', 'NN') |
17707 | ('korian', 'korian', 'JJ') | Korian pizza is healthier than American pizza. | ('korean', 'korean', 'JJ') |
35162 | ('grammer', 'grammer', 'NN') | Although my grammer was not impeccable, they could usually understand what I meant. | ('grammar', 'grammar', 'NN') |
10839 | ('comunity', 'comunity', 'NN') | Second, truth make our comunity be truthable sociaty. | ('community', 'community', 'NN') |
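To make the idea of frequency-based suggestions with bigram context concrete, here is a small self-contained sketch. It is not SymSpell and does not use SymSpell's symmetric-delete algorithm or its API; it is a simple edit-distance-1 candidate generator with invented toy counts, chosen only so that the example mirrors the behaviour described in this README. All names (`UNIGRAMS`, `BIGRAMS`, `edits1`, `correct`) are hypothetical.

```python
import string

# Invented toy counts for illustration only; the repository relies on the
# frequency dictionaries shipped with SymSpell instead.
UNIGRAMS = {"real": 500, "really": 400, "nice": 300, "ceremony": 50}
BIGRAMS = {("real", "nice"): 40, ("really", "nice"): 25}


def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def correct(word, next_word=None):
    """Pick the known candidate with the highest bigram count, then unigram count."""
    candidates = [c for c in edits1(word) if c in UNIGRAMS]
    if not candidates:
        return word  # no suggestion: leave the token unchanged
    return max(
        candidates,
        key=lambda c: (BIGRAMS.get((c, next_word), 0), UNIGRAMS[c]),
    )


print(correct("realy", "nice"))  # real
print(correct("celemony"))       # ceremony
```

With these toy counts, `realy` followed by `nice` becomes `real` rather than `really`, because only the immediate bigram is consulted and not the wider sentence.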
- Incorporating corrections into `pelic_df`: Finally, these corrected tokens are incorporated back into `pelic_df`, creating a new `tok_lem_POS` column for easy comparison to the original texts. Below is an example of an original and corrected text:
>>> print(pelic_df.loc[pelic_df.text.str.contains('becuase')].iloc[1,11]) #uncorrected
[(('My', 'my', 'PRP$'), ('friend', 'friend', 'NN'), ('is', 'be', 'VBZ'), ('realy', 'realy', 'JJ'), ('nise', 'nise', 'RB'), ('guy', 'guy', 'NN'), ('.', '.', '.'), ('I', 'i', 'PRP'), ('like', 'like', 'VBP'), ('hem', 'hem', 'JJ'), ('becuase', 'becuase', 'NN'), ('he', 'he', 'PRP'), ('is', 'be', 'VBZ'), ('friendlly', 'friendlly', 'RB'), ('and', 'and', 'CC'), ('lovliy', 'lovliy', 'NN'), ('.', '.', '.'))]
>>> print(pelic_df.loc[pelic_df.text.str.contains('becuase')].iloc[1,12]) #corrected
[('My', 'PRP$'), ('friend', 'NN'), ('is', 'VBZ'), ('real', 'JJ'), ('nice', 'RB'), ('guy', 'NN'), ('.', '.'), ('I', 'PRP'), ('like', 'VBP'), ('hem', 'JJ'), ('because', 'NN'), ('he', 'PRP'), ('is', 'VBZ'), ('friendly', 'RB'), ('and', 'CC'), ('lovely', 'NN'), ('.', '.')]
We can see here that many appropriate corrections have been made, including becuase -> because, nise -> nice, friendlly -> friendly, and lovliy -> lovely. Importantly, incorrect spellings that are actual words, e.g. hem (which should be him in this case), are not corrected. In addition, as limited context is considered, there will be some inaccuracies: for example, realy is corrected to real rather than really, because real nice is a frequent bigram.
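The re-incorporation step can be sketched as a lookup from misspelling to correction applied over each text's token list. This is an illustrative approximation only: `apply_corrections` and the `corrections` mapping are hypothetical, lookup is assumed to be case-insensitive, and POS tags are simply carried over rather than re-tagged.

```python
def apply_corrections(tok_lem_pos, corrections):
    """Build a corrected (token, POS) list from (token, lemma, POS) triples.

    Tokens missing from `corrections` are kept unchanged, which includes
    real-word errors such as 'hem' that never enter the non-word list.
    """
    return [
        (corrections.get(token.lower(), token), pos)
        for token, _lemma, pos in tok_lem_pos
    ]


# Part of the example text above, with a toy misspelling -> correction mapping
original = [
    ("I", "i", "PRP"), ("like", "like", "VBP"), ("hem", "hem", "JJ"),
    ("becuase", "becuase", "NN"), ("he", "he", "PRP"), ("is", "be", "VBZ"),
    ("friendlly", "friendlly", "RB"),
]
corrections = {"becuase": "because", "friendlly": "friendly"}
corrected = apply_corrections(original, corrections)
print(corrected)
```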
Overall, the application of spelling correction is an important resource as it allows for more accurate tracking of what learners may have been intending to write. For example, learners may know a word in every sense, except for its spelling. However, as with any automated text manipulation, the added layer of processing can introduce errors into the data, and this must be considered carefully when drawing conclusions from the data.
PELIC license:
PELIC dataset by Alan Juffs, Na-Rae Han, Ben Naismith is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Based on a work at https://github.com/ELI-Data-Mining-Group/PELIC-dataset.
SCOWL license: SCOWL Copyright and License Agreement
Spell Checking Oriented Word Lists (SCOWL) (http://wordlist.sourceforge.net/scowl-readme) The collective work is Copyright 2000-2011 by Kevin Atkinson as well as any of the copyrights mentioned below:
Copyright 2000-2011 by Kevin Atkinson Permission to use, copy, modify, distribute and sell these word lists, the associated scripts, the output created from the scripts, and its documentation for any purpose is hereby granted without fee, provided that the above copyright notice appears in all copies and that both that copyright notice and this permission notice appear in supporting documentation. Kevin Atkinson makes no representations about the suitability of this array for any purpose. It is provided "as is" without express or implied warranty.