# Spell Checking via Text Prediction Exploration
## Nicholas Miklaucic & Peabody Work Duty Group

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

In [2]:
df = pd.read_csv("test.csv", index_col=0)
df.head()

Unnamed: 0,Cat Number,Site Number,Locality,Site,Name,Situation
0,Cat. No. 1,M50/1,"Locality Squibnocket Head, southwest side 0\nM...",Site Squibnocket Cliff .,Name Butt of arrowhead.,Situation on sand under shell just south of st...
1,Cat. No. 2,M50/1,"Lmnmw Squibnocket Head, southwest side 0\nMart...",Site Squibnocket Cliff.,Name Butt of quartz knife.,Situation Black sandy loam near stake 2.
2,| Cat. No. 3,M50/1,"Locality Squibnocket Head, southwest side of\n...",Site Squibnocket Cliff .,Name Crude quartz point.,"Situation Black sandy loam, 1M. east of stake A."
3,Cat. No. 4,M50/1,"Locality Squibnocket Head, southwest side 0\nM...",Site Squibnocket Cliff.,Name Urude quartzite Spear.,Situation Top of first shell layer.
4,I Cat. No. 5,M50/1,"Uxﬂky Squibnocket Head, southwest side of\nMar...",Site Squibnocket Cliff .,Name Crude chopper.,Situation Underneath lowest shell layer.


In [3]:
# Remove word at front of "Locality", "Site", etc.
def remove_first_word(string):
    """Removes the first word of a string."""
    return ' '.join(str(string).split(' ')[1:])

extra_word_cols = ["Locality", "Site", "Name", "Situation"]
df[extra_word_cols] = df[extra_word_cols].apply(np.vectorize(remove_first_word))
df.head()

Unnamed: 0,Cat Number,Site Number,Locality,Site,Name,Situation
0,Cat. No. 1,M50/1,"Squibnocket Head, southwest side 0\nMartha' 3 ...",Squibnocket Cliff .,Butt of arrowhead.,on sand under shell just south of stake 1. I
1,Cat. No. 2,M50/1,"Squibnocket Head, southwest side 0\nMartha's V...",Squibnocket Cliff.,Butt of quartz knife.,Black sandy loam near stake 2.
2,| Cat. No. 3,M50/1,"Squibnocket Head, southwest side of\nMartha' 3...",Squibnocket Cliff .,Crude quartz point.,"Black sandy loam, 1M. east of stake A."
3,Cat. No. 4,M50/1,"Squibnocket Head, southwest side 0\nMartha's V...",Squibnocket Cliff.,Urude quartzite Spear.,Top of first shell layer.
4,I Cat. No. 5,M50/1,"Squibnocket Head, southwest side of\nMartha' 3...",Squibnocket Cliff .,Crude chopper.,Underneath lowest shell layer.


In [4]:
from string import punctuation, whitespace
unwanted_chars = list(punctuation) + list(whitespace)
unwanted_chars.remove(' ')
unwanted_chars.remove('-')
unwanted_chars.remove("'")
def word_parse(string):
    """Lowercases, removes punctuation except characters that can appear inside words (hyphen, apostrophe), strips leading/trailing whitespace, and deletes newlines and tabs."""
    parsed = string.strip().lower()
    for unwanted_char in unwanted_chars:
        parsed = parsed.replace(unwanted_char, '')
    return parsed

df[extra_word_cols] = df[extra_word_cols].apply(np.vectorize(word_parse))
df.head()

Unnamed: 0,Cat Number,Site Number,Locality,Site,Name,Situation
0,Cat. No. 1,M50/1,squibnocket head southwest side 0martha' 3 vin...,squibnocket cliff,butt of arrowhead,on sand under shell just south of stake 1 i
1,Cat. No. 2,M50/1,squibnocket head southwest side 0martha's vine...,squibnocket cliff,butt of quartz knife,black sandy loam near stake 2
2,| Cat. No. 3,M50/1,squibnocket head southwest side ofmartha' 3 vi...,squibnocket cliff,crude quartz point,black sandy loam 1m east of stake a
3,Cat. No. 4,M50/1,squibnocket head southwest side 0martha's vine...,squibnocket cliff,urude quartzite spear,top of first shell layer
4,I Cat. No. 5,M50/1,squibnocket head southwest side ofmartha' 3 vi...,squibnocket cliff,crude chopper,underneath lowest shell layer
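
One thing the output above shows: because `word_parse` deletes newlines outright, the text on either side of a line break gets fused (`side 0\nMartha` turns into `side 0martha`). A possible variant, sketched under the assumption that every whitespace character should act as a word separator, swaps whitespace for a single space instead. The name `word_parse_spaced` is made up for this sketch and is not used elsewhere in the notebook; it also collapses repeated spaces, which `word_parse` does not.

```python
from string import punctuation

# Keep hyphens and apostrophes, which can appear inside words.
PUNCT_TO_DROP = [c for c in punctuation if c not in "-'"]

def word_parse_spaced(string):
    """Like word_parse, but turns newlines/tabs into spaces instead of deleting them."""
    parsed = string.lower()
    for char in PUNCT_TO_DROP:
        parsed = parsed.replace(char, '')
    # str.split() with no argument splits on any run of whitespace,
    # so joining with ' ' also collapses repeated spaces.
    return ' '.join(parsed.split())

print(word_parse_spaced("Squibnocket Head, southwest side of\nMartha's Vineyard."))
```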


In [5]:
# Now, for each of the four main text columns, count how often each word occurs.
from collections import Counter
wordlists = []
for col in extra_word_cols:
    wordlists.append(Counter(df[col].apply(lambda x: x + ' ').sum().strip().split(' ')))
wordlists

[Counter({'': 223,
          'm1': 10,
          'lana': 1,
          'acroas': 1,
          'indianvalley': 1,
          'livingston': 7,
          '0f': 1,
          'en': 1,
          'ranch“': 1,
          'no-': 1,
          'kev': 1,
          "squibnoﬁkemartha's": 1,
          '15521locality': 1,
          'lake': 6,
          'wiscdnsin': 1,
          'tn': 1,
          'nan': 1,
          '1‘0': 1,
          'llonroe': 1,
          '106reservoir': 1,
          'states': 2,
          'kil': 2,
          "soutmartha'": 3,
          'side': 66,
          'xanawha': 1,
          'out': 1,
          "squibnockcmartha's": 5,
          'labr‘': 1,
          'jen': 2,
          "squibnocksmartha's": 1,
          "squibnoclumartha's": 7,
          '119341': 1,
          'ben': 1,
          'fans': 19,
          'creservoir': 9,
          "squionockemartha'": 1,
          'pennsyh': 2,
          'mass': 171,
          '”77': 1,
          'lie’': 2,
          '737112i': 1,
          'ﬂad

In [6]:
wordlists[0].most_common(10)

[('falls', 1364),
 ('bluehill', 1181),
 ('me', 989),
 ('of', 530),
 ('vineyard', 479),
 ('east', 450),
 ('locality', 381),
 ('end', 353),
 ('', 223),
 ('lie', 199)]
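
The empty-string entry (`'': 223`) in these counts comes from splitting on a single space, which yields an empty token wherever two spaces sit side by side. A minimal sketch of an alternative, assuming empty tokens are not wanted: split each cell on any run of whitespace and feed the tokens straight into a `Counter`. `count_words` is a hypothetical helper, shown on a plain list so it runs on its own; passing a column such as `df[col]` should behave the same way.

```python
from collections import Counter

def count_words(texts):
    """Counts whitespace-separated tokens across an iterable of strings."""
    counts = Counter()
    for text in texts:
        # split() with no argument never produces empty strings.
        counts.update(str(text).split())
    return counts

sample = ["squibnocket cliff", "squibnocket  head", ""]
print(count_words(sample))
```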

In [7]:
other_df = pd.read_csv("peabody_files/Peabody_Extended_Fields_Acc.1-12.csv")
other_df = other_df[["CATBY", "CATTYPE", "DESCRIP"]]
# Series.apply already works element by element, so np.vectorize is not needed here.
other_df["DESCRIP"] = other_df["DESCRIP"].apply(lambda x: word_parse(str(x)))
other_df.head()

Unnamed: 0,CATBY,CATTYPE,DESCRIP
0,Work Duty,Archaeology,pot sherds
1,Work Duty,Archaeology,fragment of chipped arrowpoint
2,Work Duty,Archaeology,pot sherds
3,Work Duty,Archaeology,beaver tooth implement
4,Work Duty,Archaeology,barbed harpoon fragment appears to be unilater...


In [31]:
words = Counter(other_df["DESCRIP"].apply(lambda x: x + ' ').sum().strip().split(' '))

In [32]:
words.most_common(10)

[('of', 404),
 ('quartz', 212),
 ('broken', 205),
 ('1', 177),
 ('bone', 175),
 ('arrowpoint', 120),
 ('card', 118),
 ('original', 115),
 ('with', 109),
 ('point', 103)]

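My current idea for mixing the two frequency sources: any word already in the corpus keeps its observed count. Every other word from the English list gets a pseudo-count equal to a constant (`ALPHA`) times its position in the list counted from the least common end, so more common English words get larger values.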

In [33]:
ALPHA = .04 # determines how much normal English words are weighted
with open("google-10000-english-no-swears.txt", 'r') as corpusfile:
    for i, word in enumerate(reversed(list(corpusfile))):
        if word.strip() in words:
            continue
        else:
            words[word.strip()] = int(ALPHA * i)

In [34]:
words.most_common(50)

[('of', 404),
 ('i', 396),
 ('by', 396),
 ('page', 395),
 ('if', 395),
 ('us', 395),
 ('your', 395),
 ('my', 395),
 ('was', 395),
 ('home', 395),
 ('can', 395),
 ('about', 395),
 ('you', 395),
 ('new', 395),
 ('will', 395),
 ('we', 395),
 ('our', 394),
 ('search', 394),
 ('free', 394),
 ('they', 394),
 ('he', 394),
 ('what', 394),
 ('time', 394),
 ('see', 394),
 ('there', 394),
 ('information', 394),
 ('up', 394),
 ('do', 394),
 ('any', 394),
 ('their', 394),
 ('site', 394),
 ('news', 394),
 ('his', 393),
 ('so', 393),
 ('e', 393),
 ('get', 393),
 ('am', 393),
 ('pm', 393),
 ('here', 393),
 ('web', 393),
 ('contact', 393),
 ('how', 393),
 ('would', 393),
 ('who', 393),
 ('online', 393),
 ('me', 393),
 ('view', 393),
 ('when', 393),
 ('help', 393),
 ('business', 393)]
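
The long runs of identical counts above follow from the formula: with `ALPHA = .04`, `int(ALPHA * i)` only goes up by one about every 25 positions, so neighbouring words in the English list tie. Here is a toy version of the same mixing step, with a made-up four-word list and `alpha=0.5` chosen purely so the arithmetic is easy to follow (`mix_frequencies` is a hypothetical name, not something defined in this notebook).

```python
from collections import Counter

def mix_frequencies(counts, ranked_words, alpha):
    """Adds pseudo-counts for words missing from counts.

    ranked_words is ordered most common first; a word at position i of the
    reversed list gets int(alpha * i), so common words score higher.
    """
    mixed = Counter(counts)
    for i, word in enumerate(reversed(ranked_words)):
        if word not in mixed:
            mixed[word] = int(alpha * i)
    return mixed

corpus_counts = Counter({"quartz": 212, "of": 404})
english = ["the", "of", "and", "to"]  # most common first (toy data)
print(mix_frequencies(corpus_counts, english, alpha=0.5))
```

In the toy run, `of` keeps its corpus count of 404, while `to` and `and` both truncate to 0 and `the` gets 1, which is the same tie pattern seen in the real output.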

In [37]:
BETA = 10
with open("name_list.dat", 'r') as namelistfile:
    for line in namelistfile:
        processed = line.strip().lower()
        if processed in words:
            

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd2 in position 0: invalid continuation byte
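
The traceback says `name_list.dat` is not valid UTF-8, which is the encoding `open` used here. The real encoding of the file is not known from this notebook; the sketch below assumes Latin-1, which maps every byte to some character and so never raises, though it may show the wrong character if the file is actually in another encoding. It writes a tiny stand-in file so it can run on its own.

```python
import os
import tempfile

# Stand-in for name_list.dat: the second name starts with byte 0xd2,
# the byte the traceback complains about.
path = os.path.join(tempfile.mkdtemp(), "name_list_demo.dat")
with open(path, "wb") as f:
    f.write(b"Smith\n\xd2Brien\n")

# Assumption: Latin-1. Keeping UTF-8 with errors="replace" would be another option.
with open(path, "r", encoding="latin-1") as namelistfile:
    names = [line.strip().lower() for line in namelistfile]

print(names)
```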

In [10]:
# algorithm for computing edit distance
# Good artists copy, great artists steal
# this is from https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Python
# and I take no credit for it whatsoever
def levenshtein(source, target):
    if len(source) < len(target):
        return levenshtein(target, source)

    # So now we have len(source) >= len(target).
    if len(target) == 0:
        return len(source)

    # We call tuple() to force strings to be used as sequences
    # ('c', 'a', 't', 's') - numpy uses them as values by default.
    source = np.array(tuple(source))
    target = np.array(tuple(target))

    # We use a dynamic programming algorithm, but with the
    # added optimization that we only need the last two rows
    # of the matrix.
    previous_row = np.arange(target.size + 1)
    for s in source:
        # Insertion (target grows longer than source):
        current_row = previous_row + 1

        # Substitution or matching:
        # Target and source items are aligned, and either
        # are different (cost of 1), or are the same (cost of 0).
        current_row[1:] = np.minimum(
                current_row[1:],
                np.add(previous_row[:-1], target != s))

        # Deletion (target grows shorter than source):
        current_row[1:] = np.minimum(
                current_row[1:],
                current_row[0:-1] + 1)

        previous_row = current_row

    return previous_row[-1]

In [11]:
# testing examples
print(levenshtein("history", "herstory"))
# should output 2: add 'r', change 'i' to 'e'
print(levenshtein("t3sstin", "testing"))
# should output 3: change '3' to 'e', delete 's', add a 'g'

2
3
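
As a sanity check on the NumPy version, here is the same two-row dynamic programme in plain Python. It is a sketch for cross-checking, not a replacement, and `levenshtein_py` is a name made up here.

```python
def levenshtein_py(source, target):
    """Edit distance with unit costs for insert, delete and substitute."""
    # previous_row[j] = distance between the empty prefix of source and target[:j]
    previous_row = list(range(len(target) + 1))
    for i, s in enumerate(source, start=1):
        current_row = [i]  # distance from source[:i] to the empty string
        for j, t in enumerate(target, start=1):
            current_row.append(min(
                previous_row[j] + 1,             # delete s from source
                current_row[j - 1] + 1,          # insert t into source
                previous_row[j - 1] + (s != t),  # substitute, free on a match
            ))
        previous_row = current_row
    return previous_row[-1]

print(levenshtein_py("history", "herstory"))
print(levenshtein_py("t3sstin", "testing"))
```

Both calls should agree with the NumPy results above.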


In [12]:
# To spell-check a single word, find the known words at the smallest edit
# distance from it, then break ties by how common each word is.
def correct(input_word):
    """Returns the known words closest to input_word, most frequent first."""
    candidates = []
    curr_min_distance = float('inf')
    for word in words:
        distance = levenshtein(word, input_word)
        if distance < curr_min_distance:
            # Strictly closer: discard the previous candidates.
            candidates = [word]
            curr_min_distance = distance
        elif distance == curr_min_distance:
            candidates.append(word)
    candidates.sort(key=lambda x: words[x], reverse=True)
    return candidates

In [36]:
print(correct("arrop oint"))
print(correct("or1g1nal3fdf"))
print(correct("history"))

['arrowpoint']
['originally', 'original']
['history']


I feel really good about this system, especially if I ever get the actual English dictionary to back the wordlist up (EDIT: done!) and figure out how to properly mix the two. Things to think about improving:
 * If the OCR splits one word into several or merges several words into one, that's really hard for a word-by-word check like this to catch.
 * I thought about a fancier edit distance that charges less for characters the OCR tends to confuse visually (say `0` and `o`), but that seems like overkill.
 * Knowing when to apply this system is also important: locations and names have to be left as they are, and Massachusetts has some *weird* place names that can't be spell-checked and might change constantly.
 * Word embeddings à la `word2vec` would significantly improve this and I'm working on that.
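
For the merged-words case in the first bullet, one cheap trick is to try every split point of an unknown token and keep the splits where both halves are known words. This is only a sketch: `split_merged` is a hypothetical helper, the vocabulary is toy data, and it handles exactly two fused words.

```python
def split_merged(token, vocab):
    """Returns every (left, right) split of token where both parts are in vocab."""
    return [
        (token[:i], token[i:])
        for i in range(1, len(token))
        if token[:i] in vocab and token[i:] in vocab
    ]

vocab = {"of", "martha's", "sandy", "loam", "sand"}
print(split_merged("ofmartha's", vocab))
print(split_merged("sandyloam", vocab))
```

A token that comes back with no valid split could then fall through to `correct` as usual.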

The really nifty thing would be to use this algorithm's output to train a neural network for spell-checking (which has been done successfully), but that REALLY seems like overkill. I'll plug this into the whole pipeline and get the dictionary sorted before I do that.