# Spell Checking via Text Prediction Exploration
## Nicholas Miklaucic & Peabody Work Duty Group

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

In [2]:
df = pd.read_csv("test.csv", index_col=0)
df.head()

Unnamed: 0,Cat Number,Site Number,Locality,Site,Name,Situation
0,Cat. No. 1,M50/1,"Locality Squibnocket Head, southwest side 0\nM...",Site Squibnocket Cliff .,Name Butt of arrowhead.,Situation on sand under shell just south of st...
1,Cat. No. 2,M50/1,"Lmnmw Squibnocket Head, southwest side 0\nMart...",Site Squibnocket Cliff.,Name Butt of quartz knife.,Situation Black sandy loam near stake 2.
2,| Cat. No. 3,M50/1,"Locality Squibnocket Head, southwest side of\n...",Site Squibnocket Cliff .,Name Crude quartz point.,"Situation Black sandy loam, 1M. east of stake A."
3,Cat. No. 4,M50/1,"Locality Squibnocket Head, southwest side 0\nM...",Site Squibnocket Cliff.,Name Urude quartzite Spear.,Situation Top of first shell layer.
4,I Cat. No. 5,M50/1,"Uxﬂky Squibnocket Head, southwest side of\nMar...",Site Squibnocket Cliff .,Name Crude chopper.,Situation Underneath lowest shell layer.


In [3]:
# Remove word at front of "Locality", "Site", etc.
def remove_first_word(string):
    """Removes the first word of a string."""
    return ' '.join(str(string).split(' ')[1:])

extra_word_cols = ["Locality", "Site", "Name", "Situation"]
df[extra_word_cols] = df[extra_word_cols].apply(np.vectorize(remove_first_word))
df.head()

Unnamed: 0,Cat Number,Site Number,Locality,Site,Name,Situation
0,Cat. No. 1,M50/1,"Squibnocket Head, southwest side 0\nMartha' 3 ...",Squibnocket Cliff .,Butt of arrowhead.,on sand under shell just south of stake 1. I
1,Cat. No. 2,M50/1,"Squibnocket Head, southwest side 0\nMartha's V...",Squibnocket Cliff.,Butt of quartz knife.,Black sandy loam near stake 2.
2,| Cat. No. 3,M50/1,"Squibnocket Head, southwest side of\nMartha' 3...",Squibnocket Cliff .,Crude quartz point.,"Black sandy loam, 1M. east of stake A."
3,Cat. No. 4,M50/1,"Squibnocket Head, southwest side 0\nMartha's V...",Squibnocket Cliff.,Urude quartzite Spear.,Top of first shell layer.
4,I Cat. No. 5,M50/1,"Squibnocket Head, southwest side of\nMartha' 3...",Squibnocket Cliff .,Crude chopper.,Underneath lowest shell layer.


In [4]:
from string import punctuation, whitespace
unwanted_chars = list(punctuation) + list(whitespace)
unwanted_chars.remove(' ')
unwanted_chars.remove('-')
unwanted_chars.remove("'")
def word_parse(string):
    """Lowercases, removes punctuation other than the characters that can appear inside words (hyphen, apostrophe), and removes extraneous whitespace."""
    parsed = string.strip().lower()
    for unwanted_char in unwanted_chars:
        parsed = parsed.replace(unwanted_char, '')
    return parsed

df[extra_word_cols] = df[extra_word_cols].apply(np.vectorize(word_parse))
df.head()

Unnamed: 0,Cat Number,Site Number,Locality,Site,Name,Situation
0,Cat. No. 1,M50/1,squibnocket head southwest side 0martha' 3 vin...,squibnocket cliff,butt of arrowhead,on sand under shell just south of stake 1 i
1,Cat. No. 2,M50/1,squibnocket head southwest side 0martha's vine...,squibnocket cliff,butt of quartz knife,black sandy loam near stake 2
2,| Cat. No. 3,M50/1,squibnocket head southwest side ofmartha' 3 vi...,squibnocket cliff,crude quartz point,black sandy loam 1m east of stake a
3,Cat. No. 4,M50/1,squibnocket head southwest side 0martha's vine...,squibnocket cliff,urude quartzite spear,top of first shell layer
4,I Cat. No. 5,M50/1,squibnocket head southwest side ofmartha' 3 vi...,squibnocket cliff,crude chopper,underneath lowest shell layer
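
The `Locality` column above contains tokens such as `0martha's` and `ofmartha'`: the newline between the two lines of the card was deleted outright, which fuses the neighboring words. Here is a minimal sketch of a variant that turns every run of whitespace into a single space instead. The name `word_parse_keep_breaks` is hypothetical and not part of the notebook, and like `word_parse` it only strips ASCII punctuation.

```python
import re
from string import punctuation

# Punctuation to strip; hyphen and apostrophe are kept because they occur inside words.
_strip_chars = ''.join(c for c in punctuation if c not in "-'")
_strip_table = str.maketrans('', '', _strip_chars)

def word_parse_keep_breaks(string):
    """Like word_parse, but whitespace (including newlines) becomes a single space."""
    parsed = str(string).lower().translate(_strip_table)
    return re.sub(r'\s+', ' ', parsed).strip()

print(word_parse_keep_breaks("Squibnocket Head, southwest side of\nMartha's Vineyard."))
# squibnocket head southwest side of martha's vineyard
```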


In [5]:
# now for each of the four main categories, get a list of all the words
from collections import Counter
wordlists = []
for col in extra_word_cols:
    wordlists.append(Counter(df[col].apply(lambda x: x + ' ').sum().strip().split(' ')))
wordlists

[Counter({'': 223,
          'yorth': 1,
          "bob's": 2,
          'inletl': 1,
          "quibnockemartha'": 8,
          '‘': 5,
          'r': 1,
          "0martha's": 5,
          'data': 1,
          'mi‘7': 1,
          'ranch“': 1,
          'li': 3,
          "80111martha's": 1,
          'llo': 1,
          '116': 2,
          'fallsﬁlaine': 1,
          'h1119': 2,
          '1‘': 3,
          "squibnoclumartha'": 1,
          'sp': 1,
          "eoutmartha'": 1,
          'howard': 2,
          "'reservoir": 1,
          'camp“': 2,
          "squibnocklmartha's": 1,
          'bluehil‘l‘': 1,
          'uppern-a1': 1,
          'labrac': 1,
          '1311': 1,
          'rang': 1,
          'iald': 1,
          'monolake': 1,
          'acalifornia': 1,
          'amndel': 4,
          "srmartha's": 1,
          'squibnocket': 115,
          'from': 4,
          'fails': 1,
          'jan': 1,
          '1521may': 1,
          'dicks': 2,
          'wit': 8,
       

In [6]:
wordlists[0].most_common(10)

[('falls', 1364),
 ('bluehill', 1181),
 ('me', 989),
 ('of', 530),
 ('vineyard', 479),
 ('east', 450),
 ('locality', 381),
 ('end', 353),
 ('', 223),
 ('lie', 199)]
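
The empty string is counted 223 times because splitting on a single space produces `''` wherever two spaces end up next to each other (for example, after an empty cell). A sketch of an alternative counting helper that avoids this by using `str.split()` with no argument; `count_words` is a hypothetical name and the sample strings are made up for illustration.

```python
from collections import Counter

def count_words(texts):
    """Count whitespace-separated tokens across an iterable of strings."""
    counts = Counter()
    for text in texts:
        # split() with no argument splits on any whitespace run and drops empty tokens
        counts.update(str(text).split())
    return counts

sample = ["squibnocket cliff ", "", "crude  quartz point", "squibnocket head"]
counts = count_words(sample)
print(counts["squibnocket"])  # 2
print("" in counts)           # False
```

In the notebook this could be used as `wordlists = [count_words(df[col]) for col in extra_word_cols]`.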

In [7]:
other_df = pd.read_csv("peabody_files/Peabody_Extended_Fields_Acc.1-12.csv")
other_df = other_df[["CATBY", "CATTYPE", "DESCRIP"]]
other_df["DESCRIP"] = other_df["DESCRIP"].apply(np.vectorize(lambda x: word_parse(str(x))))
other_df.head()

Unnamed: 0,CATBY,CATTYPE,DESCRIP
0,Work Duty,Archaeology,pot sherds
1,Work Duty,Archaeology,fragment of chipped arrowpoint
2,Work Duty,Archaeology,pot sherds
3,Work Duty,Archaeology,beaver tooth implement
4,Work Duty,Archaeology,barbed harpoon fragment appears to be unilater...


In [8]:
words = Counter(other_df["DESCRIP"].apply(lambda x: x + ' ').sum().strip().split(' '))

In [9]:
words.most_common(10)

[('of', 404),
 ('quartz', 212),
 ('broken', 205),
 ('1', 177),
 ('bone', 175),
 ('arrowpoint', 120),
 ('card', 118),
 ('original', 115),
 ('with', 109),
 ('point', 103)]

Pre-processing is done, so the question is what to do next.
I'm going to start with PyEnchant, an existing spell-check library: the plan is to extend it with our custom word list and let it run.

In [10]:
import enchant
enchant.list_languages()

[]

Well, I can't get PyEnchant to work right now (`list_languages()` comes back empty, so it isn't finding any dictionaries), so we'll go on to the next idea: the standard spell-check algorithm based on Levenshtein edit distance. This is much more annoying.

In [12]:
# algorithm for computing edit distance
# Good artists copy, great artists steal
# this is from https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Python
# and I take no credit for it whatsoever
def levenshtein(source, target):
    if len(source) < len(target):
        return levenshtein(target, source)

    # So now we have len(source) >= len(target).
    if len(target) == 0:
        return len(source)

    # We call tuple() to force strings to be used as sequences
    # ('c', 'a', 't', 's') - numpy uses them as values by default.
    source = np.array(tuple(source))
    target = np.array(tuple(target))

    # We use a dynamic programming algorithm, but with the
    # added optimization that we only need the last two rows
    # of the matrix.
    previous_row = np.arange(target.size + 1)
    for s in source:
        # Insertion (target grows longer than source):
        current_row = previous_row + 1

        # Substitution or matching:
        # Target and source items are aligned, and either
        # are different (cost of 1), or are the same (cost of 0).
        current_row[1:] = np.minimum(
                current_row[1:],
                np.add(previous_row[:-1], target != s))

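        # Deletion (target grows shorter than source):
        # a running minimum carries the +1 cost across runs of
        # consecutive positions; a single shifted comparison would
        # only propagate it one step.
        offsets = np.arange(current_row.size)
        current_row = np.minimum.accumulate(current_row - offsets) + offsets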

        previous_row = current_row

    return previous_row[-1]

In [14]:
# testing examples
print(levenshtein("history", "herstory"))
# should output 2: add 'r', change 'i' to 'e'
print(levenshtein("t3sstin", "testing"))
# should output 3: change '3' to 'e', delete 's', add a 'g'

2
3
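
For spot-checking the NumPy version, here is a minimal pure-Python sketch of the same two-row dynamic program. The name `levenshtein_py` is hypothetical; it is meant as a readable reference rather than a replacement, and the third example is one whose cheapest edit script needs two insertions in a row.

```python
def levenshtein_py(source, target):
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    previous_row = list(range(len(target) + 1))
    for i, s in enumerate(source, start=1):
        current_row = [i]
        for j, t in enumerate(target, start=1):
            current_row.append(min(
                previous_row[j] + 1,             # skip a source character
                current_row[j - 1] + 1,          # skip a target character
                previous_row[j - 1] + (s != t),  # substitute, or match at cost 0
            ))
        previous_row = current_row
    return previous_row[-1]

print(levenshtein_py("history", "herstory"))  # 2
print(levenshtein_py("t3sstin", "testing"))   # 3
print(levenshtein_py("xxabcd", "abcdyy"))     # 4: drop both x's, append both y's
```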


In [18]:
# To spell-check a single word, find the closest words in our word list
# and break ties by how common each word is in the list.
def correct(input_word):
    """Returns the words in our word list closest to input_word, most frequent first."""
    candidates = []
    curr_min_distance = float('inf')
    for word in words:
        distance = levenshtein(word, input_word)
        if distance < curr_min_distance:
            candidates = [word]
            curr_min_distance = distance
        elif distance == curr_min_distance:
            candidates.append(word)
    candidates.sort(key=lambda x: words[x], reverse=True)
    return candidates

In [23]:
print(correct("arrop oint"))
print(correct("or1g1nal3fdf"))
print(correct("history"))  # this is funny

['arrowpoint']
['original']
['stone', 'distal', 'sword', 'pottery', 'rotary', 'short', 'restored', 'oyster', 'vison', 'into']
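
The `history` result shows the weakness of always returning the nearest words: none of those candidates is actually close to the input. Here is a minimal sketch of a cutoff. The `max_distance` parameter, the function names, and the toy word counts are all hypothetical stand-ins, and the threshold of 2 is a guess that would need tuning on real OCR output.

```python
from collections import Counter

def edit_distance(a, b):
    """Plain unit-cost Levenshtein distance (two-row dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def correct_with_cutoff(input_word, word_counts, max_distance=2):
    """Return nearest words, most frequent first, or [] if none is within max_distance."""
    best, candidates = max_distance, []
    for word in word_counts:
        distance = edit_distance(word, input_word)
        if distance < best:
            best, candidates = distance, [word]
        elif distance == best:
            candidates.append(word)
    return sorted(candidates, key=lambda w: word_counts[w], reverse=True)

# Toy counts for illustration only; in the notebook this would be `words`.
toy_counts = Counter({"arrowpoint": 120, "original": 115, "stone": 40, "short": 12})
print(correct_with_cutoff("arrop oint", toy_counts))  # ['arrowpoint']
print(correct_with_cutoff("history", toy_counts))     # []
```

An empty result is a signal to leave the word alone, which matters for place names and other words the list will never contain.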


I feel really good about this system, especially once I get an actual English dictionary to back up the word list and figure out how to weight the two properly. Things to think about improving:
 * If the OCR splits one word into several or merges several words into one, that's really hard for this to catch.
 * I thought about doing totally next-level optical edit distance stuff (accounting for which characters look alike), but that seems like overkill.
 * Knowing when to apply this system is also important: locations and names have to be left as they are, and Massachusetts has some *weird* place names that can't be spell-checked and might change constantly.
 * Word embeddings à la `word2vec` would significantly improve this and I'm working on that.
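
For the first bullet, one partial remedy for fused tokens such as `ofmartha's` is to try every split point and accept a split only when both halves are known words. This is a sketch under that assumption: `split_merged` and the toy `known` counts are hypothetical, and it does nothing for the reverse problem of one word broken into several pieces.

```python
from collections import Counter

def split_merged(token, word_counts):
    """Split a fused token into two known words, or return it unchanged.

    Among all valid splits, prefer the one whose rarer half is most frequent.
    """
    if token in word_counts:
        return [token]
    best_split, best_score = None, 0
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        if left in word_counts and right in word_counts:
            score = min(word_counts[left], word_counts[right])
            if score > best_score:
                best_split, best_score = [left, right], score
    return best_split if best_split else [token]

# Toy counts for illustration only; in the notebook this would be a real word list.
known = Counter({"of": 530, "martha's": 400, "vineyard": 479})
print(split_merged("ofmartha's", known))   # ['of', "martha's"]
print(split_merged("squibnocket", known))  # ['squibnocket']
```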

The really nifty thing would be to use this algorithm to generate training data for a neural network that does the spell-checking (which has been done successfully), but that REALLY seems like overkill. I'll plug this into the whole pipeline and get the dictionary sorted out before I do that.