# PELIC spelling validation

**Authors: Joey Livorno, Sean Steinle, & Ben Naismith  
Contact: bnaismith@pitt.edu**

**Last updated:** Feb 11, 2021

This notebook describes the validation of the automated spelling correction carried out in the [PELIC_spelling.ipynb](https://github.com/ELI-Data-Mining-Group/PELIC-spelling/blob/master/PELIC_spelling.ipynb) notebook. The original dataset is [the University of Pittsburgh English Language Institute Corpus (PELIC)](https://github.com/ELI-Data-Mining-Group/PELIC-dataset).

In brief, spelling is checked manually on a sample of PELIC and the result is then compared to the output of the automated spell checker. The results indicate that the spell checker is highly accurate in terms of the total tokens in PELIC, but conservative, resulting in lower precision.

## Table of Contents
1. [Introduction](#1.-Introduction)
2. [Initial setup](#2.-Initial-setup)
3. [Preparing the data](#3.-Preparing-the-data)
4. [Spell check comparison](#4.-Spell-check-comparison)
5. [Data analysis](#5.-Data-analysis)
6. [Conclusion](#6.-Conclusion)

## 1. Introduction

The [PELIC-spelling](https://github.com/ELI-Data-Mining-Group/PELIC-spelling) repository accompanies the main PELIC dataset and provides information and code about applying spelling correction to PELIC texts. This supplemental repository was created because the accuracy of learner spelling, and whether or not to correct misspellings, must be considered when using learner production data. Depending on the focus of the research, using uncorrected text may be preferable so as not to introduce potential error through an unnecessary level of text manipulation. In other instances, there might be a stronger justification for using corrected text so that spelling mistakes do not mask other salient linguistic features, i.e. what learners intended to write, or lessen the effectiveness of other automated processes. 

The spell-corrected version of PELIC data can be found in [`PELIC_compiled_spellcorrected.csv`](https://github.com/ELI-Data-Mining-Group/PELIC-spelling/blob/master/PELIC_compiled_spellcorrected.csv), which is a fork of the original [`PELIC_compiled.csv`](https://github.com/ELI-Data-Mining-Group/PELIC-dataset/blob/master/PELIC_compiled.csv). This spelling correction was performed through an automated process using the [SymSpell](https://pypi.org/project/symspellpy/) Python module (Garbe, 2020). The code and description of this process are in the [`PELIC_spelling.ipynb`](https://github.com/ELI-Data-Mining-Group/PELIC-spelling/blob/master/PELIC_spelling.ipynb) notebook.

In the current notebook, our goal is to validate this automated spelling process by comparing manual human spell checking against the automated spell checking for a sample of PELIC data. To do so, we assess the corrections made in `PELIC_compiled_spellcorrected.csv` by randomly sampling 50 entries and comparing them to the manual corrections made by two expert annotators (and adjudicated by a third).

## 2. Initial setup

In [1]:
# Import necessary modules

import pandas as pd
import numpy as np
from ast import literal_eval
from collections import Counter
import random

In [2]:
# Set preferred notebook format

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## 3. Preparing the data

In [`PELIC_compiled_spellcorrected.csv`](https://github.com/ELI-Data-Mining-Group/PELIC-spelling/blob/master/PELIC_compiled_spellcorrected.csv), the spelling changes are reflected in the column called `tok_lem_POS_corrected`. Each cell contains a list with a tuple for each token in the text. Each tuple has three parts: the token, the corresponding lemma, and its part of speech ([Penn Treebank tagset](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)).

### Formatting the Sample

**Note:** the `tok_lem_POS_corrected` value for an entry will be *NaN* if no corrections were made in the entry. If any token has been replaced with a corrected spelling, `tok_lem_POS_corrected` will contain a list of tuples for all tokens including both the correctly spelled ones and the misspelled ones.  
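As a minimal sketch of what this means in practice, the helper below turns one raw cell into a list of corrected tokens and returns `None` for the *NaN* case. The function name `corrected_tokens` and the sample cell are illustrative and not part of the original notebook.

```python
from ast import literal_eval
import math


def corrected_tokens(cell):
    """Return the token from each (token, lemma, POS) tuple, or None if the cell is NaN."""
    # pandas represents a missing cell as float('nan')
    if cell is None or (isinstance(cell, float) and math.isnan(cell)):
        return None
    # Cells read from CSV arrive as strings; parse them into Python objects
    tuples = literal_eval(cell) if isinstance(cell, str) else cell
    return [token for token, lemma, pos in tuples]


# Illustrative cell in the same shape as tok_lem_POS_corrected
sample_cell = "[('i', 'i', 'PRP'), ('was', 'be', 'VBD'), ('sleeping', 'sleep', 'VBG')]"
print(corrected_tokens(sample_cell))    # ['i', 'was', 'sleeping']
print(corrected_tokens(float('nan')))   # None
```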

In [3]:
# Read in PELIC_compiled_spellcorrected.csv

pelic_df = pd.read_csv("../PELIC-spelling/PELIC_compiled_spellcorrected.csv", index_col = 'answer_id', # answer_id is unique
                      dtype = {'level_id':'object','question_id':'object','version':'object','course_id':'object'}, # str not ints
                               converters={'tokens':literal_eval,'tok_lem_POS':literal_eval}) # read in as lists
pelic_df.head(2)

| answer_id | anon_id | L1 | gender | course_id | level_id | class_id | question_id | version | text_len | text | tokens | tok_lem_POS | tok_lem_POS_corrected |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | eq0 | Arabic | Male | 149 | 4 | g | 5 | 1 | 177 | I met my friend Nife while I was studying in a... | [I, met, my, friend, Nife, while, I, was, stud... | ((I, i, PRP), (met, meet, VBD), (my, my, PRP$)... | NaN |
| 2 | am8 | Thai | Female | 149 | 4 | g | 5 | 1 | 137 | Ten years ago, I met a women on the train betw... | [Ten, years, ago, ,, I, met, a, women, on, the... | ((Ten, ten, CD), (years, year, NNS), (ago, ago... | NaN |


To create the sample, we first limit the search to entries that have at least one corrected spelling mistake. Potential texts are also filtered to those with a text length of at least 50. The sample is seeded to ensure it can be reproduced.
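The filter-then-sample logic can be illustrated without pandas. The sketch below uses hypothetical records and the standard library's `random.Random`; the notebook itself relies on `DataFrame.sample` with `random_state=2`, so the entries selected here will not correspond to the real sample.

```python
import random

# Hypothetical records: (answer_id, text_len, has_correction)
records = [(i, 30 + 5 * i, i % 3 != 0) for i in range(1, 41)]

# Keep only entries with at least one correction and a text length of 50 or more
eligible = [r for r in records if r[2] and r[1] >= 50]


def draw_sample(rows, n, seed):
    """Draw n rows without replacement using a dedicated, seeded generator."""
    return random.Random(seed).sample(rows, n)


first = draw_sample(eligible, 5, seed=2)
second = draw_sample(eligible, 5, seed=2)
print(first == second)  # True: the same seed reproduces the same sample
```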

In [4]:
# Randomly sample 50 entries

pelic_df = pelic_df[pelic_df['tok_lem_POS_corrected'].notnull()]
sample_df = pelic_df[pelic_df['text_len'] >= 50].sample(n=50, random_state=2)
sample_df.head(2)

| answer_id | anon_id | L1 | gender | course_id | level_id | class_id | question_id | version | text_len | text | tokens | tok_lem_POS | tok_lem_POS_corrected |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 23739 | bv1 | Arabic | Male | 479 | 4 | g | 3074 | 2 | 323 | I was sleeping when my cell phone rang. Then I... | [I, was, sleeping, when, my, cell, phone, rang... | ((I, i, PRP), (was, be, VBD), (sleeping, sleep... | [('i', 'i', 'PRP'), ('was', 'be', 'VBD'), ('sl... |
| 23488 | fd6 | Korean | Female | 504 | 5 | g | 3052 | 1 | 169 | I had an experience that taught me intuitive d... | [I, had, an, experience, that, taught, me, int... | ((I, i, PRP), (had, have, VBD), (an, a, DT), (... | [('i', 'i', 'PRP'), ('had', 'have', 'VBD'), ('... |


In [5]:
# Create subset of dataframe that includes only relevant columns

sample_df = sample_df[['text_len', 'text', 'tokens', 'tok_lem_POS_corrected']]
sample_df['tokens'] = sample_df.tokens.map(lambda x: [w.lower() for w in x]) #lowercase for easier comparison
sample_df.head()

| answer_id | text_len | text | tokens | tok_lem_POS_corrected |
| --- | --- | --- | --- | --- |
| 23739 | 323 | I was sleeping when my cell phone rang. Then I... | [i, was, sleeping, when, my, cell, phone, rang... | [('i', 'i', 'PRP'), ('was', 'be', 'VBD'), ('sl... |
| 23488 | 169 | I had an experience that taught me intuitive d... | [i, had, an, experience, that, taught, me, int... | [('i', 'i', 'PRP'), ('had', 'have', 'VBD'), ('... |
| 16214 | 159 | The most interesting experience I have ever ha... | [the, most, interesting, experience, i, have, ... | [('the', 'the', 'DT'), ('most', 'most', 'RBS')... |
| 11926 | 119 | When I was a five years old. I had a very  pre... | [when, i, was, a, five, years, old, ., i, had,... | [('when', 'when', 'WRB'), ('i', 'i', 'PRP'), (... |
| 38157 | 128 | Traveling makes me happy. I am really like to ... | [traveling, makes, me, happy, ., i, am, really... | [('traveling', 'travel', 'VBG'), ('makes', 'ma... |


In [6]:
# Write out dataframe to csv for annotators to check independently

sample_df.to_csv('sample_blank.csv', encoding='utf-8', index=True)

### Reading in the annotated samples 

In the annotated samples, each annotator created a column called `checked` that duplicates the original `tokens` column. In this column, misspelled tokens were replaced with their correct spellings.

**Note:** Annotators were instructed to correct only non-words, i.e., real words which may not have been intended were left as is, e.g., _two_ and _tow_. Although in some cases the intended word is clear, learner language is often ambiguous as well, so a consistent and simple algorithm was adopted to avoid having to guess what the learner may have intended.

In [7]:
# Read samples from annotators 1 and 2

annotator1_df = pd.read_csv("annotator1.csv", index_col = 'answer_id',
                            converters={'tokens':literal_eval,'tok_lem_POS':literal_eval,'tok_lem_POS_corrected':literal_eval,'checked':literal_eval})
annotator1_df.head(2)

annotator2_df = pd.read_csv("annotator2.csv", index_col = 'answer_id',
                            converters={'tokens':literal_eval,'tok_lem_POS':literal_eval,'tok_lem_POS_corrected':literal_eval,'checked':literal_eval})
annotator2_df.head(2)

| answer_id | text | tokens | tok_lem_POS_corrected | checked |
| --- | --- | --- | --- | --- |
| 23739 | I was sleeping when my cell phone rang. Then I... | [I, was, sleeping, when, my, cell, phone, rang... | [(i, i, PRP), (was, be, VBD), (sleeping, sleep... | [I, was, sleeping, when, my, cell, phone, rang... |
| 23488 | I had an experience that taught me intuitive d... | [I, had, an, experience, that, taught, me, int... | [(i, i, PRP), (had, have, VBD), (an, a, DT), (... | [I, had, an, experience, that, taught, me, int... |

| answer_id | text | tokens | tok_lem_POS_corrected | checked |
| --- | --- | --- | --- | --- |
| 23739 | I was sleeping when my cell phone rang. Then I... | [I, was, sleeping, when, my, cell, phone, rang... | [(i, i, PRP), (was, be, VBD), (sleeping, sleep... | [I, was, sleeping, when, my, cell, phone, rang... |
| 23488 | I had an experience that taught me intuitive d... | [I, had, an, experience, that, taught, me, int... | [(i, i, PRP), (had, have, VBD), (an, a, DT), (... | [I, had, an, experience, that, taught, me, int... |


### Annotator agreement rates

To assess inter-annotator reliability, we calculate the simple agreement rate, which is frequently used in Natural Language Processing. The alternative, kappa, is not used because it may underestimate the agreement on a rare category, in this case misspelled words.
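To illustrate the reasoning, the sketch below computes both statistics on invented labels (1 = the annotator changed the token, 0 = the annotator left it unchanged). It assumes Cohen's kappa, one common variant for two annotators; the data and function names are hypothetical. The point is only that, when one category is rare, kappa can be low even though the annotators agree on nearly every token.

```python
def simple_agreement(a, b):
    """Proportion of positions where both annotators gave the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa for two annotators and binary labels (0 or 1)."""
    n = len(a)
    p_observed = simple_agreement(a, b)
    p_a1, p_b1 = sum(a) / n, sum(b) / n
    # Agreement expected by chance, from each annotator's label frequencies
    p_expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    return (p_observed - p_expected) / (1 - p_expected)


# 100 invented tokens: 96 unchanged by both, 1 changed by both, 3 disagreements
annotator_a = [0] * 96 + [1] + [1, 1, 0]
annotator_b = [0] * 96 + [1] + [0, 0, 1]

print(round(simple_agreement(annotator_a, annotator_b), 2))  # 0.97
print(round(cohens_kappa(annotator_a, annotator_b), 2))      # 0.39
```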

In [8]:
# Create a column with enumerated checked texts

annotator1_df['enumerated_checked'] = annotator1_df.checked.apply(enumerate).apply(list)
annotator2_df['enumerated_checked'] = annotator2_df.checked.apply(enumerate).apply(list)
annotator1_df.head(2)

| answer_id | text | tokens | tok_lem_POS_corrected | checked | enumerated_checked |
| --- | --- | --- | --- | --- | --- |
| 23739 | I was sleeping when my cell phone rang. Then I... | [I, was, sleeping, when, my, cell, phone, rang... | [(i, i, PRP), (was, be, VBD), (sleeping, sleep... | [I, was, sleeping, when, my, cell, phone, rang... | [(0, I), (1, was), (2, sleeping), (3, when), (... |
| 23488 | I had an experience that taught me intuitive d... | [I, had, an, experience, that, taught, me, int... | [(i, i, PRP), (had, have, VBD), (an, a, DT), (... | [I, had, an, experience, that, taught, me, int... | [(0, I), (1, had), (2, an), (3, experience), (... |


In [9]:
# Create lists of annotator tokens to compare

annotator1_toks = [x for y in annotator1_df.enumerated_checked.to_list() for x in y] # Make a list of tokens and flatten
annotator2_toks = [x for y in annotator2_df.enumerated_checked.to_list() for x in y]

In [10]:
# Calculate total number of tokens

len(annotator1_toks)
len(annotator2_toks) # Should match

11595

11595
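The length check matters because `zip` stops at the shorter input, so a token missing from one annotator's file would otherwise go unnoticed. The following sketch, using made-up token lists, shows the same enumerate, flatten, and compare steps with an explicit guard; the helper name `flatten_enumerated` is an assumption for illustration.

```python
def flatten_enumerated(texts):
    """Pair each token with its position in its own text, then flatten all texts."""
    return [pair for tokens in texts for pair in enumerate(tokens)]


# Two made-up texts as checked by two annotators
checked_1 = [["i", "met", "the", "teacher"], ["the", "experience"]]
checked_2 = [["i", "met", "te", "teacher"], ["the", "experience"]]

toks_1 = flatten_enumerated(checked_1)
toks_2 = flatten_enumerated(checked_2)

# Guard against silent truncation by zip
if len(toks_1) != len(toks_2):
    raise ValueError("Annotator token lists differ in length")

matches = [a == b for a, b in zip(toks_1, toks_2)]
print(sum(matches), len(matches))  # 5 6
```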

In [11]:
# Compare each item in each list to check if match

annotator_match = [i==j for i, j in zip(annotator1_toks, annotator2_toks)]
Counter(annotator_match)

Counter({True: 11537, False: 58})

In [12]:
print('Percentage of matching tokens: ', round(Counter(annotator_match)[True]/len(annotator1_toks)*100,2))

Percentage of matching tokens:  99.5


In [13]:
round(Counter(annotator_match)[True]/len(annotator1_toks)*100,2)

99.5

The 99.5% agreement rate between annotators indicates high inter-rater reliability. The 58 mismatches will be adjudicated by a third annotator to produce one final version.

### Annotator 3

In [14]:
# Create a new dataframe for annotator3 with the 'checked' and 'enumerated_checked' columns removed

annotator3_df = annotator1_df.iloc[:,:-2].copy() # copy so that new columns can be added safely

In [15]:
# Create a column with enumerated unchecked texts

annotator3_df['enumerated_unchecked'] = annotator3_df.tokens.apply(enumerate).apply(list)
annotator3_df.head(2)

| answer_id | text | tokens | tok_lem_POS_corrected | enumerated_unchecked |
| --- | --- | --- | --- | --- |
| 23739 | I was sleeping when my cell phone rang. Then I... | [I, was, sleeping, when, my, cell, phone, rang... | [(i, i, PRP), (was, be, VBD), (sleeping, sleep... | [(0, I), (1, was), (2, sleeping), (3, when), (... |
| 23488 | I had an experience that taught me intuitive d... | [I, had, an, experience, that, taught, me, int... | [(i, i, PRP), (had, have, VBD), (an, a, DT), (... | [(0, I), (1, had), (2, an), (3, experience), (... |


In [16]:
# Create a list of original enumerated tokens

original_toks = [x for y in annotator3_df.enumerated_unchecked.to_list() for x in y]

In [17]:
# Create list of enumerated mismatches

mismatches = [(i,j,k) for (i,j,k) in zip(annotator1_toks, annotator2_toks, original_toks) if i!=j]
len(mismatches)
len(set(mismatches))
mismatches[:5]

58

58

[((48, 'craigslist'), (48, 'craiglist'), (48, 'craiglist')),
 ((191, 'am'), (191, 'a.m'), (191, 'a.m')),
 ((99, 'experience'), (99, 'expierience'), (99, 'expierience')),
 ((19, 'the'), (19, 'te'), (19, 'te')),
 ((50, 'sightseeing'), (50, 'sight-seeing'), (50, 'sigthseeing'))]

In [18]:
# Create flat list of all mismatch items

flat_mismatches = [x for y in mismatches for x in y]
len(flat_mismatches) # 58*3
flat_mismatches = set(flat_mismatches)
len(flat_mismatches) # After duplicates removed
len(flat_mismatches)/2 # unique items to adjudicate

# The final count (61) is three more than the 58 mismatches because (19, 'the') occurs in three texts and (18, 'since') occurs in two texts.

174

122

61.0
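Because each item is only a `(position, token)` pair, identical pairs from different texts collapse into one set element and can match in texts where no disagreement occurred. One possible way to avoid this, sketched below on invented data, is to include the text identifier in the key. This is an alternative shown for illustration, not the approach used in this notebook.

```python
# Invented data: answer_id -> original tokens
texts = {
    101: ["we", "saw", "te", "sea"],
    102: ["we", "saw", "the", "sea"],
}

# Suppose the annotators disagreed only about position 2 in text 101,
# where one of them wrote 'the'
pair_keys = {(2, "te"), (2, "the")}               # (position, token)
triple_keys = {(101, 2, "te"), (101, 2, "the")}   # (answer_id, position, token)

hits_by_pair = [
    (aid, i, tok)
    for aid, toks in texts.items()
    for i, tok in enumerate(toks)
    if (i, tok) in pair_keys
]
hits_by_triple = [
    (aid, i, tok)
    for aid, toks in texts.items()
    for i, tok in enumerate(toks)
    if (aid, i, tok) in triple_keys
]

print(hits_by_pair)    # [(101, 2, 'te'), (102, 2, 'the')]
print(hits_by_triple)  # [(101, 2, 'te')]
```

With pair keys, text 102 is flagged although nobody disagreed about it; with the identifier in the key, only the genuinely disputed token remains.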

In [19]:
# Create column of items to adjudicate based on inclusion in mismatches

annotator3_df['to_check'] = annotator3_df.enumerated_unchecked.apply(lambda row: [x for x in row if x in flat_mismatches])
annotator3_df.loc[(annotator3_df.to_check.str.len() != 0),:].head() # Check rows with non-empty lists

| answer_id | text | tokens | tok_lem_POS_corrected | enumerated_unchecked | to_check |
| --- | --- | --- | --- | --- | --- |
| 23739 | I was sleeping when my cell phone rang. Then I... | [I, was, sleeping, when, my, cell, phone, rang... | [(i, i, PRP), (was, be, VBD), (sleeping, sleep... | [(0, I), (1, was), (2, sleeping), (3, when), (... | [(48, craiglist), (191, a.m)] |
| 38157 | Traveling makes me happy. I am really like to ... | [Traveling, makes, me, happy, ., I, am, really... | [(traveling, travel, VBG), (makes, make, VBZ),... | [(0, Traveling), (1, makes), (2, me), (3, happ... | [(19, the), (99, expierience)] |
| 10494 | Indonesia \n \n It is a piece of the paradiece... | [Indonesia, It, is, a, piece, of, the, paradie... | [(indonesia, indonesia, NN), (it, it, PRP), (i... | [(0, Indonesia), (1, It), (2, is), (3, a), (4,... | [(19, te), (50, sigthseeing)] |
| 47714 | Terrible Weekend\n  Last weekend was a terribl... | [Terrible, Weekend, Last, weekend, was, a, ter... | [(terrible, terrible, JJ), (weekend, weekend, ... | [(0, Terrible), (1, Weekend), (2, Last), (3, w... | [(48, shouid), (58, Idecided), (168, tidous)] |
| 28804 | The article "Reponsibility and learning"  base... | [The, article, ``, Reponsibility, and, learnin... | [(the, the, DT), (article, article, NN), (``, ... | [(0, The), (1, article), (2, ``), (3, Reponsib... | [(3, Reponsibility), (50, towords), (115, writ... |


In [20]:
# Double check number of mismatches

to_check = [x for y in annotator3_df.loc[(annotator3_df.to_check.str.len() != 0),:].to_check.to_list() for x in y]
len(to_check)

61

In [21]:
# Write out csv for annotator3

annotator3_df.to_csv('annotator3_blank.csv', encoding='utf-8', index=True)

In [22]:
# Read in checked annotator3 csv

final_df = pd.read_csv("annotator3_checked.csv", index_col = 'answer_id',
                            converters={'tokens':literal_eval,'tok_lem_POS_corrected':literal_eval,
                                        'enumerated':literal_eval, 'enumerated_checked':literal_eval})

final_df.head(2)

| answer_id | text | tokens | tok_lem_POS_corrected | enumerated_unchecked | to_check | enumerated_checked |
| --- | --- | --- | --- | --- | --- | --- |
| 23739 | I was sleeping when my cell phone rang. Then I... | [I, was, sleeping, when, my, cell, phone, rang... | [(i, i, PRP), (was, be, VBD), (sleeping, sleep... | [(0, 'I'), (1, 'was'), (2, 'sleeping'), (3, 'w... | [(48, 'craiglist'), (191, 'a.m')] | [(0, I), (1, was), (2, sleeping), (3, when), (... |
| 23488 | I had an experience that taught me intuitive d... | [I, had, an, experience, that, taught, me, int... | [(i, i, PRP), (had, have, VBD), (an, a, DT), (... | [(0, 'I'), (1, 'had'), (2, 'an'), (3, 'experie... | [] | [(0, I), (1, had), (2, an), (3, experience), (... |


### Formatting the final dataframe

In [23]:
# Create a lower-cased version of enumerated_checked with only the tokens (to match tok_lem_POS_corrected column)

final_df['human_checked'] = final_df.enumerated_checked.apply(lambda row: [x[1].lower() for x in row])

In [24]:
# Clean up dataframe for easier analysis

final_df = final_df[['tokens','tok_lem_POS_corrected','human_checked']]
final_df = final_df.rename(columns={"tokens": "original", "tok_lem_POS_corrected": "automated_checked"})
final_df['automated_checked'] = final_df.automated_checked.apply(lambda row: [x[0] for x in row])
final_df['original'] = final_df.original.apply(lambda row: [x.lower() for x in row])
final_df.head()

| answer_id | original | automated_checked | human_checked |
| --- | --- | --- | --- |
| 23739 | [i, was, sleeping, when, my, cell, phone, rang... | [i, was, sleeping, when, my, cell, phone, rang... | [i, was, sleeping, when, my, cell, phone, rang... |
| 23488 | [i, had, an, experience, that, taught, me, int... | [i, had, an, experience, that, taught, me, int... | [i, had, an, experience, that, taught, me, int... |
| 16214 | [the, most, interesting, experience, i, have, ... | [the, most, interesting, experience, i, have, ... | [the, most, interesting, experience, i, have, ... |
| 11926 | [when, i, was, a, five, years, old, ., i, had,... | [when, i, was, a, five, years, old, ., i, had,... | [when, i, was, a, five, years, old, ., i, had,... |
| 38157 | [traveling, makes, me, happy, ., i, am, really... | [traveling, makes, me, happy, ., i, am, really... | [traveling, makes, me, happy, ., i, am, really... |


## 4. Spell check comparison

When comparing the tokens from the three columns, there are five possibilities in total:
1. The original token was not misspelled and the spell checker recognized this. In this case, we would expect all three tokens to be equal to one another.
2. The original token was misspelled and the spell checker corrected it. In this case, we would expect *automated_checked* and *human_checked* to be equal, with *original* being the outlier.
3. The original token was misspelled but the spell checker did not recognize this. In this case, *original* and *automated_checked* would be equal, but neither would be equal to the correct spelling of *human_checked*. This would be a Type II error, i.e., a false negative.
4. None of the three columns are equal to one another. This means that there was a misspelling, but the automated check was wrong (as judged by the human checker).
5. The original token was spelled correctly, but the spellchecker still applied a correction to it. In this case, *original* and *human_checked* would be equal to one another, with *automated_checked* being a mismatch. This is a Type I error, i.e., a false positive.

We now create functions that use *if* statements to check each of these cases. The functions all work the same way: each one steps through the three lists in parallel and, when its condition is met, adds the token (with its index) to a list. The list is then returned.
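Since the five cases are mutually exclusive, they can also be expressed as a single decision function. The sketch below is an illustrative alternative to the separate functions used in this notebook; the function name `classify_token` and the category labels are chosen here for the example.

```python
def classify_token(original, automated, human):
    """Assign one (original, automated, human) token triple to one of the five cases."""
    if original == human:
        # The token needed no correction
        return "true_negative" if automated == human else "false_positive"
    if automated == human:
        return "true_positive"             # misspelled and corrected properly
    if automated == original:
        return "false_negative"            # misspelled but left unchanged
    return "true_positive_incorrect"       # misspelled, changed, but still wrong


original = ['a', 'f', 'f', 'f', 'e']
automated = ['a', 'b', 'f', 'g', 'f']
human = ['a', 'b', 'c', 'd', 'e']

for triple in zip(original, automated, human):
    print(triple, classify_token(*triple))
```

Each of the five toy triples falls into a different category, in the order in which the cases are listed above.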

### Creating functions for each case

In [25]:
def get_none(l1, l2, l3):
    """Returns a list of (index, original, automated, human) tuples for correctly spelled,
    unchanged tokens, where l1 is the original token list, l2 is the automated_checked list,
    and l3 is the human_checked list"""
    l0 = []
    ref = list(enumerate(l1))
    for i in range(len(l1)):
        if l1[i] == l3[i] and l2[i] == l3[i]:
            l0.append((ref[i][0], l1[i], l2[i], l3[i]))
    return l0

l1 = ['a', 'f', 'f', 'f', 'e']
l2 = ['a', 'b', 'f', 'g', 'f']
l3 = ['a', 'b', 'c', 'd', 'e']
get_none(l1, l2, l3)

[(0, 'a', 'a', 'a')]

In [26]:
def get_correct(l1, l2, l3):
    """Returns a list of (index, original, automated, human) tuples for properly corrected
    misspellings, where l1 is the original token list, l2 is the automated_checked list,
    and l3 is the human_checked list"""
    l0 = []
    ref = list(enumerate(l1))
    for i in range(len(l1)):
        if l1[i] != l3[i] and l2[i] == l3[i]:
            l0.append((ref[i][0], l1[i], l2[i], l3[i]))
    return l0

get_correct(l1, l2, l3)

[(1, 'f', 'b', 'b')]

In [27]:
def get_false_neg(l1, l2, l3):
    """Returns a list of (index, original, automated, human) tuples for false negatives
    (misspellings left unchanged), where l1 is the original token list, l2 is the
    automated_checked list, and l3 is the human_checked list"""
    l0 = []
    ref = list(enumerate(l1))
    for i in range(len(l1)):
        if l1[i] != l3[i] and l2[i] != l3[i] and l1[i] == l2[i]:
            l0.append((ref[i][0], l1[i], l2[i], l3[i]))
    return l0

get_false_neg(l1, l2, l3)

[(2, 'f', 'f', 'c')]

In [28]:
def get_incorrect(l1, l2, l3):
    """Returns a list of (index, original, automated, human) tuples for misspellings that
    were changed but not to the human-checked form, where l1 is the original token list,
    l2 is the automated_checked list, and l3 is the human_checked list"""
    l0 = []
    ref = list(enumerate(l1))
    for i in range(len(l1)):
        if l1[i] != l3[i] and l2[i] != l3[i] and l1[i] != l2[i]:
            l0.append((ref[i][0], l1[i], l2[i], l3[i]))
    return l0

get_incorrect(l1, l2, l3)

[(3, 'f', 'g', 'd')]

In [29]:
def get_false_pos(l1, l2, l3):
    """Returns a list of (index, original, automated, human) tuples for false positives
    (correct tokens that were changed), where l1 is the original token list, l2 is the
    automated_checked list, and l3 is the human_checked list"""
    l0 = []
    ref = list(enumerate(l1))
    for i in range(len(l1)):
        if l1[i] == l3[i] and l2[i] != l3[i]:
            l0.append((ref[i][0], l1[i], l2[i], l3[i]))
    return l0

get_false_pos(l1, l2, l3)

[(4, 'e', 'f', 'e')]

### Applying methods to final_df

In [31]:
none = []
correct = []
neg = []
incorrect = []
pos = []

# Iterate over every row and access columns by name rather than by position
for _, row in final_df.iterrows():
    toks = (row['original'], row['automated_checked'], row['human_checked'])
    none.append(get_none(*toks))
    correct.append(get_correct(*toks))
    neg.append(get_false_neg(*toks))
    incorrect.append(get_incorrect(*toks))
    pos.append(get_false_pos(*toks))

final_df['true_negatives'] = pd.Series(none).values
final_df['true_positives'] = pd.Series(correct).values
final_df['false_negatives'] = pd.Series(neg).values
final_df['true_positives_incorrect'] = pd.Series(incorrect).values
final_df['false_positives'] = pd.Series(pos).values
final_df.head()

| answer_id | original | automated_checked | human_checked | true_negatives | true_positives | false_negatives | true_positives_incorrect | false_positives |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 23739 | [i, was, sleeping, when, my, cell, phone, rang... | [i, was, sleeping, when, my, cell, phone, rang... | [i, was, sleeping, when, my, cell, phone, rang... | [(0, i, i, i), (1, was, was, was), (2, sleepin... | [] | [(191, a.m, a.m, a.m.)] | [(48, craiglist, craig list, craigslist)] | [] |
| 23488 | [i, had, an, experience, that, taught, me, int... | [i, had, an, experience, that, taught, me, int... | [i, had, an, experience, that, taught, me, int... | [(0, i, i, i), (1, had, had, had), (2, an, an,... | [] | [] | [] | [(8, decition, decision, decition), (95, liein... |
| 16214 | [the, most, interesting, experience, i, have, ... | [the, most, interesting, experience, i, have, ... | [the, most, interesting, experience, i, have, ... | [(0, the, the, the), (1, most, most, most), (2... | [] | [] | [] | [(20, merridian, meridian, merridian)] |
| 11926 | [when, i, was, a, five, years, old, ., i, had,... | [when, i, was, a, five, years, old, ., i, had,... | [when, i, was, a, five, years, old, ., i, had,... | [(0, when, when, when), (1, i, i, i), (2, was,... | [] | [] | [] | [(97, charecter, character, charecter), (103, ... |
| 38157 | [traveling, makes, me, happy, ., i, am, really... | [traveling, makes, me, happy, ., i, am, really... | [traveling, makes, me, happy, ., i, am, really... | [(0, traveling, traveling, traveling), (1, mak... | [(99, expierience, experience, experience)] | [] | [] | [(79, underware, under are, underware)] |


In [32]:
final_df['true_negatives_count'] = final_df.true_negatives.map(lambda x: len(x))
final_df['true_positives_count'] = final_df.true_positives.map(lambda x: len(x))
final_df['false_negatives_count'] = final_df.false_negatives.map(lambda x: len(x))
final_df['true_positives_incorrect_count'] = final_df.true_positives_incorrect.map(lambda x: len(x))
final_df['false_positives_count'] = final_df.false_positives.map(lambda x: len(x))
final_df.head()

| answer_id | original | automated_checked | human_checked | true_negatives | true_positives | false_negatives | true_positives_incorrect | false_positives | true_negatives_count | true_positives_count | false_negatives_count | true_positives_incorrect_count | false_positives_count |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 23739 | [i, was, sleeping, when, my, cell, phone, rang... | [i, was, sleeping, when, my, cell, phone, rang... | [i, was, sleeping, when, my, cell, phone, rang... | [(0, i, i, i), (1, was, was, was), (2, sleepin... | [] | [(191, a.m, a.m, a.m.)] | [(48, craiglist, craig list, craigslist)] | [] | 375 | 0 | 1 | 1 | 0 |
| 23488 | [i, had, an, experience, that, taught, me, int... | [i, had, an, experience, that, taught, me, int... | [i, had, an, experience, that, taught, me, int... | [(0, i, i, i), (1, had, had, had), (2, an, an,... | [] | [] | [] | [(8, decition, decision, decition), (95, liein... | 180 | 0 | 0 | 0 | 2 |
| 16214 | [the, most, interesting, experience, i, have, ... | [the, most, interesting, experience, i, have, ... | [the, most, interesting, experience, i, have, ... | [(0, the, the, the), (1, most, most, most), (2... | [] | [] | [] | [(20, merridian, meridian, merridian)] | 181 | 0 | 0 | 0 | 1 |
| 11926 | [when, i, was, a, five, years, old, ., i, had,... | [when, i, was, a, five, years, old, ., i, had,... | [when, i, was, a, five, years, old, ., i, had,... | [(0, when, when, when), (1, i, i, i), (2, was,... | [] | [] | [] | [(97, charecter, character, charecter), (103, ... | 135 | 0 | 0 | 0 | 2 |
| 38157 | [traveling, makes, me, happy, ., i, am, really... | [traveling, makes, me, happy, ., i, am, really... | [traveling, makes, me, happy, ., i, am, really... | [(0, traveling, traveling, traveling), (1, mak... | [(99, expierience, experience, experience)] | [] | [] | [(79, underware, under are, underware)] | 137 | 1 | 0 | 0 | 1 |


## 5. Data analysis
To judge the effectiveness of the automated checker, we calculate accuracy, precision, recall, and the F-measure.

In [33]:
true_neg = final_df["true_negatives_count"].sum()
true_pos = final_df["true_positives_count"].sum()
true_pos_incorrect = final_df["true_positives_incorrect_count"].sum()
false_neg = final_df["false_negatives_count"].sum()
false_pos = final_df["false_positives_count"].sum()
print(f"True Negatives: {true_neg}\nTrue Positives: {true_pos}\nIncorrect True Positives: {true_pos_incorrect}\
      \nFalse Negatives: {false_neg}\nFalse Positives: {false_pos}\n")

accuracy = round((true_neg + true_pos)/(true_neg + true_pos + false_neg + false_pos),2)
print("Accuracy: ", accuracy)
precision = round(true_pos /(true_pos + false_pos),2)
print("Precision: ", precision)
recall = round(true_pos / (true_pos + false_neg),2)
print("Recall: ", recall)
f = round((2 * precision * recall)/(precision + recall),2)
print("F-Measure", f)

True Negatives: 11458
True Positives: 21
Incorrect True Positives: 7      
False Negatives: 9
False Positives: 100

Accuracy:  0.99
Precision:  0.17
Recall:  0.7
F-Measure 0.27


## 6. Conclusion
We set out with the intent to see how well our spellchecking tool corrects ESL data. So how does it do?<br>
It does fairly well, performing very accurately but not nearly as precisely.

**Accuracy, Precision and Recall**

The spellchecker's performance is measured in two main ways. First, accuracy measures how well the spellchecker performs at correcting words. This score is computed by comparing each token the spellchecker outputs with the human-checked version. For example, in the case where the word is correctly spelled and the spellchecker does nothing, the accuracy is 1/1.

On the other hand, we have the measures of binary classification: precision, recall, and F-measure. These statistics show how well the spellchecker performs as a classifier of spelling errors. That is, how complete is the spellchecker's coverage of misspelled words? Does it overstate the set of misspelled words? Does it underestimate it? We answer these questions empirically with the word-level counts of true positives, true negatives, false positives, and false negatives, and then use these counts to calculate precision, recall, and the F-measure. In our case, precision represents the proportion of words tagged as incorrect that were actually incorrect (e.g., if the classifier corrected "don't" to "donut", it would lower precision). Recall, in turn, represents the proportion of actually incorrect words that the spellchecker tagged as incorrect. Finally, the F-measure is the harmonic mean of precision and recall, balancing the two to give a fuller picture of performance.
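These definitions can be written out directly. The sketch below plugs in the counts reported in the data analysis section; the function name `classification_metrics` is illustrative. Note that it computes the F-measure from unrounded precision and recall, which gives roughly 0.28 rather than the 0.27 obtained above from the rounded values.

```python
def classification_metrics(true_pos, false_pos, false_neg):
    """Return precision, recall, and F-measure (F1) from word-level counts."""
    precision = true_pos / (true_pos + false_pos)   # TP as a share of everything flagged
    recall = true_pos / (true_pos + false_neg)      # TP as a share of TP plus missed errors
    f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f_measure


# Counts reported in the data analysis section
precision, recall, f_measure = classification_metrics(true_pos=21, false_pos=100, false_neg=9)
print(round(precision, 2), round(recall, 2), round(f_measure, 2))  # 0.17 0.7 0.28
```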

For more on classifier evaluation, we recommend [this article](https://www.svds.com/the-basics-of-classifier-evaluation-part-1/) by Tom Fawcett.

**Performance**

First, we see that the accuracy of the tool is very good. Both annotators found an accuracy above 99%, meaning that over 99% of the spellchecker's output is spelled correctly. This is impressive as it means that the spell-corrected version of PELIC can be trusted to answer research questions. However, it is important to note that this figure represents the proportion of correctly spelled words, which would likely be very high even without a spellchecker (especially for more advanced students).  
Moving on to precision and recall, we see a noticeable drop-off. The F-measure (0.27) in this case tells us that our spell-checking classifier is fairly robust in terms of recall (0.70), i.e., in classifying errors from *all* of the tokens, but much less precise (precision of 0.17) when classifying errors from the smaller subset of tokens the humans tagged as errors. Overall, these figures indicate that the spell checking is conservative, leaving the data unchanged in many cases. This approach is preferred in the spirit of minimizing data manipulation and the introduction of possible errors, though researchers may wish to apply further, more fine-grained spelling correction depending upon their research needs.
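To put the caveat about accuracy in numbers, the counts from the data analysis section can be used to compare the checker's output with the uncorrected text. This is a back-of-the-envelope sketch: it assumes that a token counts as correct whenever it matches the adjudicated human version, and, unlike the accuracy figure above, it includes the seven incorrect corrections in the denominator.

```python
# Counts reported in the data analysis section
true_neg, true_pos, true_pos_incorrect, false_neg, false_pos = 11458, 21, 7, 9, 100
total = true_neg + true_pos + true_pos_incorrect + false_neg + false_pos

# The automated output matches the human version for true negatives and true positives
automated_match = (true_neg + true_pos) / total
# The original text matches the human version wherever no correction was needed
original_match = (true_neg + false_pos) / total

print(total)                       # 11595
print(round(automated_match, 4))   # 0.99
print(round(original_match, 4))    # 0.9968
```

Under these assumptions, both proportions are close to 1, which is why the accuracy figure is best read alongside precision and recall.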

[Back to top](#Table-of-Contents)