# Spelling Correction using Locality Sensitive Hashing

Practical course material for the ASDM Class 09 (Text Mining) by Florian Leitner.

© 2016 Florian Leitner. All rights reserved.

This notebook combines yesterday's spelling correction example from Peter Norvig with LSH to achieve higher correction speeds (although at slightly worse accuracy). In an actual application with tens of thousands of words, you'd probably use the LSH approach with more lenient matching likelihoods, so that more spelling variants land in one bucket, and then use some other approach to find an exact match in that bucket (if any). In the same fashion as we match words by their characters, the approach is commonly used to match similar documents by their tokens. For example, LSH has been used to develop very efficient document plagiarism detection systems.

In [1]:
%precision 3

'%.3f'

Here is the LSH implementation again:

In [2]:
from collections import defaultdict


class MinHashSignature:
    """Hash signatures for sets/tuples using minhash."""

    def __init__(self, dim):
        """
        Define the dimension of the hash pool
        (number of hash functions).
        """
        self.dim = dim
        self.hashes = self.hash_functions()

    def hash_functions(self):
        """Return dim different hash functions."""
        def hash_factory(n):
            return lambda x: hash("salt" + str(n) + str(x) + "salt")
        
        return [ hash_factory(_) for _ in range(self.dim) ]

    def sign(self, item):
        """Return the minhash signatures for the `item`."""
        sig = [ float("inf") ] * self.dim
        
        for hash_ix, hash_fn in enumerate(self.hashes):
            # minhashing; requires item is iterable:
            sig[hash_ix] = min(hash_fn(i) for i in item)
        
        return sig


class LSH:
    """
    Locality sensitive hashing.

    Uses a banding approach to hash
    similar signatures to the same buckets.
    """

    def __init__(self, size, threshold):
        """
        LSH approximating a given similarity `threshold`
        with a given hash signature `size`.
        """
        self.size = size
        self.threshold = threshold
        self.bandwidth = self.get_bandwidth(size, threshold)

    @staticmethod
    def get_bandwidth(n, t):
        """
        Approximate the bandwidth (number of rows in each band)
        needed to get threshold.

        Threshold t = (1/b) ** (1/r)
        where
        b = # of bands
        r = # of rows per band
        n = b * r = size of signature
        """
        best = n # 1
        minerr = float("inf")

        for r in range(1, n + 1):
            try:
                b = 1. / (t ** r)
            except ZeroDivisionError:  # t ** r underflowed to zero: your signature is huge
                return best

            err = abs(n - b * r)

            if err < minerr:
                best = r
                minerr = err

        return best

    def hash(self, sig):
        """Generate hash values for this signature."""
        for band in zip(*(iter(sig),) * self.bandwidth):
            yield hash("salt" + str(band) + "tlas")

    @property
    def exact_threshold(self):
        """The exact threshold defined by the chosen bandwidth."""
        r = self.bandwidth
        b = self.size / r
        return (1. / b) ** (1. / r)

    def get_n_bands(self):
        """The number of bands."""
        return int(self.size / self.bandwidth)


class UnionFind:
    """
    Union-find data structure.

    Each unionFind instance X maintains a family of disjoint sets of
    hashable objects, supporting the following two methods:

    - X[item] returns a name for the set containing the given item.
    Each set is named by an arbitrarily-chosen one of its members; as
    long as the set remains unchanged it will keep the same name. If
    the item is not yet part of a set in X, a new singleton set is
    created for it.

    - X.union(item1, item2, ...) merges the sets containing each item
    into a single larger set. If any item is not yet part of a set
    in X, it is added to X as one of the members of the merged set.

    Source: http://www.ics.uci.edu/~eppstein/PADS/UnionFind.py

    Union-find data structure. Based on Josiah Carlson's code,
    http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/215912
    with significant additional changes by D. Eppstein.
    """

    def __init__(self):
        """Create a new empty union-find structure."""
        self.weights = {}
        self.parents = {}

    def __getitem__(self, object):
        """Find and return the name of the set containing the object."""
        # check for previously unknown object
        if object not in self.parents:
            self.parents[object] = object
            self.weights[object] = 1
            return object

        # find path of objects leading to the root
        path = [object]
        root = self.parents[object]

        while root != path[-1]:
            path.append(root)
            root = self.parents[root]

        # compress the path and return
        for ancestor in path:
            self.parents[ancestor] = root

        return root

    def __iter__(self):
        """Iterate through all items ever found or unioned by this structure."""
        return iter(self.parents)

    def union(self, *objects):
        """Find the sets containing the objects and merge them all."""
        roots = [self[x] for x in objects]
        heaviest = max([(self.weights[r],r) for r in roots])[1]
        for r in roots:
            if r != heaviest:
                self.weights[heaviest] += self.weights[r]
                self.parents[r] = heaviest

    def sets(self):
        """Return a list of each disjoint set"""
        ret = defaultdict(list)
        for k, _ in self.parents.items():
            ret[self[k]].append(k)
        return list(ret.values())


class Cluster:
    """
    Cluster items with a Jaccard similarity above
    some `threshold` with a high probability.

    Based on Rajaraman, "Mining of Massive Datasets":

    1. Generate items hash signatures
    2. Use LSH to map similar signatures to same buckets
    3. Use UnionFind to merge buckets containing same values
    """

    def __init__(self, threshold=0.5, size=20):
        """
        The `size` parameter controls the number of hash
        functions ("signature size") to create.
        """
        self.size = size
        self.unions = UnionFind()
        self.signer = MinHashSignature(size)
        self.hasher = LSH(size, threshold)
        self.hashmaps = [
            defaultdict(list) for _ in range(self.hasher.get_n_bands())
        ]

    def add(self, item, label=None):
        """
        Add an `item` to the cluster.

        Optionally, define a `label` to reference this `item`.
        Otherwise, the `item` itself is used as label.
        """
        # A label for this item
        if label is None:
            label = item

        # Add to unionfind structure
        self.unions[label]

        # Get signature
        sig = self.signer.sign(item)

        # Union labels with same LSH key in same band
        for band_idx, hashval in enumerate(self.hasher.hash(sig)):
            self.hashmaps[band_idx][hashval].append(label)
            self.unions.union(label, self.hashmaps[band_idx][hashval][0])

    def groups(self):
        """
        Get the clustering result.

        Returns sets of labels.
        """
        return self.unions.sets()

    def match(self, item):
        """
        Get the matching set of labels for `item`.

        Returns a (possibly empty) set of items.
        """
        # Get item signature
        sig = self.signer.sign(item)

        # Find matches
        matches = set()

        for band_idx, hashval in enumerate(self.hasher.hash(sig)):
            if hashval in self.hashmaps[band_idx]:
                matches.update(self.hashmaps[band_idx][hashval])

        return matches
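
To see why minhash signatures are useful here, recall that the probability of two sets sharing the same minimum under a random hash function equals their Jaccard similarity, so the fraction of agreeing signature positions estimates that similarity. The following is a minimal, self-contained sketch and not the class above: it assumes `hashlib.md5` in place of Python's built-in `hash` so that results are reproducible across sessions, and the names `stable_hash`, `minhash` and `estimate_similarity` are made up for illustration.

```python
import hashlib


def stable_hash(seed, value):
    """Deterministic salted hash (assumption: MD5 instead of built-in hash())."""
    data = ("%d:%s" % (seed, value)).encode("utf-8")
    return int(hashlib.md5(data).hexdigest(), 16)


def minhash(items, dim):
    """One minimum per salted hash function, as in MinHashSignature.sign."""
    return [min(stable_hash(seed, i) for i in items) for seed in range(dim)]


def estimate_similarity(sig_a, sig_b):
    """Fraction of signature positions on which both signatures agree."""
    agree = sum(a == b for a, b in zip(sig_a, sig_b))
    return agree / len(sig_a)


a, b = set("accommodation"), set("acomodation")
c = set("xyz")
sig_a, sig_b, sig_c = (minhash(s, 256) for s in (a, b, c))

print(estimate_similarity(sig_a, sig_b))  # 1.0: both words use the same letters
print(estimate_similarity(sig_a, sig_c))  # 0.0: no shared letters

# Partially overlapping sets: the estimate is close to the true
# Jaccard similarity of 4/12, but only approximately.
sig_x = minhash(set("abcdefgh"), 256)
sig_y = minhash(set("efghijkl"), 256)
print(estimate_similarity(sig_x, sig_y))
```

Longer signatures give a tighter estimate at the cost of more hashing per item.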

This is the evaluation function from Peter Norvig, as used yesterday, with some reporting tweaks for our particular problem.

In [3]:
# Development Tests
TESTS_1 = { 'access': 'acess', 'accessing': 'accesing', 'accommodation':
'accomodation acommodation acomodation', 'account': 'acount', 'address':
'adress adres', 'addressable': 'addresable', 'arranged': 'aranged arrainged',
'arrangeing': 'aranging', 'arrangement': 'arragment', 'articles': 'articals',
'aunt': 'annt anut arnt', 'auxiliary': 'auxillary', 'available': 'avaible',
'awful': 'awfall afful', 'basically': 'basicaly', 'beginning': 'begining',
'benefit': 'benifit', 'benefits': 'benifits', 'between': 'beetween', 'bicycle':
'bicycal bycicle bycycle', 'biscuits': 
'biscits biscutes biscuts bisquits buiscits buiscuts', 'built': 'biult', 
'cake': 'cak', 'career': 'carrer',
'cemetery': 'cemetary semetary', 'centrally': 'centraly', 'certain': 'cirtain',
'challenges': 'chalenges chalenges', 'chapter': 'chaper chaphter chaptur',
'choice': 'choise', 'choosing': 'chosing', 'clerical': 'clearical',
'committee': 'comittee', 'compare': 'compair', 'completely': 'completly',
'consider': 'concider', 'considerable': 'conciderable', 'contented':
'contenpted contende contended contentid', 'curtains': 
'cartains certans courtens cuaritains curtans curtians curtions', 'decide': 'descide', 'decided':
'descided', 'definitely': 'definately difinately', 'definition': 'defenition',
'definitions': 'defenitions', 'description': 'discription', 'desiccate':
'desicate dessicate dessiccate', 'diagrammatically': 'diagrammaticaally',
'different': 'diffrent', 'driven': 'dirven', 'ecstasy': 'exstacy ecstacy',
'embarrass': 'embaras embarass', 'establishing': 'astablishing establising',
'experience': 'experance experiance', 'experiences': 'experances', 'extended':
'extented', 'extremely': 'extreamly', 'fails': 'failes', 'families': 'familes',
'february': 'febuary', 'further': 'futher', 'gallery': 'galery gallary gallerry gallrey', 
'hierarchal': 'hierachial', 'hierarchy': 'hierchy', 'inconvenient':
'inconvienient inconvient inconvinient', 'independent': 'independant independant',
'initial': 'intial', 'initials': 'inetials inistals initails initals intials',
'juice': 'guic juce jucie juise juse', 'latest': 'lates latets latiest latist', 
'laugh': 'lagh lauf laught lugh', 'level': 'leval',
'levels': 'levals', 'liaison': 'liaision liason', 'lieu': 'liew', 'literature':
'litriture', 'loans': 'lones', 'locally': 'localy', 'magnificent': 
'magnificnet magificent magnifcent magnifecent magnifiscant magnifisent magnificant',
'management': 'managment', 'meant': 'ment', 'minuscule': 'miniscule',
'minutes': 'muinets', 'monitoring': 'monitering', 'necessary': 
'neccesary necesary neccesary necassary necassery neccasary', 'occurrence':
'occurence occurence', 'often': 'ofen offen offten ofton', 'opposite': 
'opisite oppasite oppesite oppisit oppisite opposit oppossite oppossitte', 'parallel': 
'paralel paralell parrallel parralell parrallell', 'particular': 'particulaur',
'perhaps': 'perhapse', 'personnel': 'personnell', 'planned': 'planed', 'poem':
'poame', 'poems': 'poims pomes', 'poetry': 'poartry poertry poetre poety powetry', 
'position': 'possition', 'possible': 'possable', 'pretend': 
'pertend protend prtend pritend', 'problem': 'problam proble promblem proplen',
'pronunciation': 'pronounciation', 'purple': 'perple perpul poarple',
'questionnaire': 'questionaire', 'really': 'realy relley relly', 'receipt':
'receit receite reciet recipt', 'receive': 'recieve', 'refreshment':
'reafreshment refreshmant refresment refressmunt', 'remember': 'rember remeber rememmer rermember',
'remind': 'remine remined', 'scarcely': 'scarcly scarecly scarely scarsely', 
'scissors': 'scisors sissors', 'separate': 'seperate',
'singular': 'singulaur', 'someone': 'somone', 'sources': 'sorces', 'southern':
'southen', 'special': 'speaical specail specal speical', 'splendid': 
'spledid splended splened splended', 'standardizing': 'stanerdizing', 'stomach': 
'stomac stomache stomec stumache', 'supersede': 'supercede superceed', 'there': 'ther',
'totally': 'totaly', 'transferred': 'transfred', 'transportability':
'transportibility', 'triangular': 'triangulaur', 'understand': 'undersand undistand', 
'unexpected': 'unexpcted unexpeted unexspected', 'unfortunately':
'unfortunatly', 'unique': 'uneque', 'useful': 'usefull', 'valuable': 'valubale valuble', 
'variable': 'varable', 'variant': 'vairiant', 'various': 'vairious',
'visited': 'fisited viseted vistid vistied', 'visitors': 'vistors',
'voluntary': 'volantry', 'voting': 'voteing', 'wanted': 'wantid wonted',
'whether': 'wether', 'wrote': 'rote wote'}

# Final Tests
TESTS_2 = {'forbidden': 'forbiden', 'decisions': 'deciscions descisions',
'supposedly': 'supposidly', 'embellishing': 'embelishing', 'technique':
'tecnique', 'permanently': 'perminantly', 'confirmation': 'confermation',
'appointment': 'appoitment', 'progression': 'progresion', 'accompanying':
'acompaning', 'applicable': 'aplicable', 'regained': 'regined', 'guidelines':
'guidlines', 'surrounding': 'serounding', 'titles': 'tittles', 'unavailable':
'unavailble', 'advantageous': 'advantageos', 'brief': 'brif', 'appeal':
'apeal', 'consisting': 'consisiting', 'clerk': 'cleark clerck', 'component':
'componant', 'favourable': 'faverable', 'separation': 'seperation', 'search':
'serch', 'receive': 'recieve', 'employees': 'emploies', 'prior': 'piror',
'resulting': 'reulting', 'suggestion': 'sugestion', 'opinion': 'oppinion',
'cancellation': 'cancelation', 'criticism': 'citisum', 'useful': 'usful',
'humour': 'humor', 'anomalies': 'anomolies', 'would': 'whould', 'doubt':
'doupt', 'examination': 'eximination', 'therefore': 'therefoe', 'recommend':
'recomend', 'separated': 'seperated', 'successful': 'sucssuful succesful',
'apparent': 'apparant', 'occurred': 'occureed', 'particular': 'paerticulaur',
'pivoting': 'pivting', 'announcing': 'anouncing', 'challenge': 'chalange',
'arrangements': 'araingements', 'proportions': 'proprtions', 'organized':
'oranised', 'accept': 'acept', 'dependence': 'dependance', 'unequalled':
'unequaled', 'numbers': 'numbuers', 'sense': 'sence', 'conversely':
'conversly', 'provide': 'provid', 'arrangement': 'arrangment',
'responsibilities': 'responsiblities', 'fourth': 'forth', 'ordinary':
'ordenary', 'description': 'desription descvription desacription',
'inconceivable': 'inconcievable', 'data': 'dsata', 'register': 'rgister',
'supervision': 'supervison', 'encompassing': 'encompasing', 'negligible':
'negligable', 'allow': 'alow', 'operations': 'operatins', 'executed':
'executted', 'interpretation': 'interpritation', 'hierarchy': 'heiarky',
'indeed': 'indead', 'years': 'yesars', 'through': 'throut', 'committee':
'committe', 'inquiries': 'equiries', 'before': 'befor', 'continued':
'contuned', 'permanent': 'perminant', 'choose': 'chose', 'virtually':
'vertually', 'correspondence': 'correspondance', 'eventually': 'eventully',
'lonely': 'lonley', 'profession': 'preffeson', 'they': 'thay', 'now': 'noe',
'desperately': 'despratly', 'university': 'unversity', 'adjournment':
'adjurnment', 'possibilities': 'possablities', 'stopped': 'stoped', 'mean':
'meen', 'weighted': 'wagted', 'adequately': 'adequattly', 'shown': 'hown',
'matrix': 'matriiix', 'profit': 'proffit', 'encourage': 'encorage', 'collate':
'colate', 'disaggregate': 'disaggreagte disaggreaget', 'receiving':
'recieving reciving', 'proviso': 'provisoe', 'umbrella': 'umberalla', 'approached':
'aproached', 'pleasant': 'plesent', 'difficulty': 'dificulty', 'appointments':
'apointments', 'base': 'basse', 'conditioning': 'conditining', 'earliest':
'earlyest', 'beginning': 'begining', 'universally': 'universaly',
'unresolved': 'unresloved', 'length': 'lengh', 'exponentially':
'exponentualy', 'utilized': 'utalised', 'set': 'et', 'surveys': 'servays',
'families': 'familys', 'system': 'sysem', 'approximately': 'aproximatly',
'their': 'ther', 'scheme': 'scheem', 'speaking': 'speeking', 'repetitive':
'repetative', 'inefficient': 'ineffiect', 'geneva': 'geniva', 'exactly':
'exsactly', 'immediate': 'imediate', 'appreciation': 'apreciation', 'luckily':
'luckeley', 'eliminated': 'elimiated', 'believe': 'belive', 'appreciated':
'apreciated', 'readjusted': 'reajusted', 'were': 'wer where', 'feeling':
'fealing', 'and': 'anf', 'false': 'faulse', 'seen': 'seeen', 'interrogating':
'interogationg', 'academically': 'academicly', 'relatively': 'relativly relitivly',
'traditionally': 'traditionaly', 'studying': 'studing',
'majority': 'majorty', 'build': 'biuld', 'aggravating': 'agravating',
'transactions': 'trasactions', 'arguing': 'aurguing', 'sheets': 'sheertes',
'successive': 'sucsesive sucessive', 'segment': 'segemnt', 'especially':
'especaily', 'later': 'latter', 'senior': 'sienior', 'dragged': 'draged',
'atmosphere': 'atmospher', 'drastically': 'drasticaly', 'particularly':
'particulary', 'visitor': 'vistor', 'session': 'sesion', 'continually':
'contually', 'availability': 'avaiblity', 'busy': 'buisy', 'parameters':
'perametres', 'surroundings': 'suroundings seroundings', 'employed':
'emploied', 'adequate': 'adiquate', 'handle': 'handel', 'means': 'meens',
'familiar': 'familer', 'between': 'beeteen', 'overall': 'overal', 'timing':
'timeing', 'committees': 'comittees commitees', 'queries': 'quies',
'econometric': 'economtric', 'erroneous': 'errounous', 'decides': 'descides',
'reference': 'refereence refference', 'intelligence': 'inteligence',
'edition': 'ediion ediition', 'are': 'arte', 'apologies': 'appologies',
'thermawear': 'thermawere thermawhere', 'techniques': 'tecniques',
'voluntary': 'volantary', 'subsequent': 'subsequant subsiquent', 'currently':
'curruntly', 'forecast': 'forcast', 'weapons': 'wepons', 'routine': 'rouint',
'neither': 'niether', 'approach': 'aproach', 'available': 'availble',
'recently': 'reciently', 'ability': 'ablity', 'nature': 'natior',
'commercial': 'comersial', 'agencies': 'agences', 'however': 'howeverr',
'suggested': 'sugested', 'career': 'carear', 'many': 'mony', 'annual':
'anual', 'according': 'acording', 'receives': 'recives recieves',
'interesting': 'intresting', 'expense': 'expence', 'relevant':
'relavent relevaant', 'table': 'tasble', 'throughout': 'throuout', 'conference':
'conferance', 'sensible': 'sensable', 'described': 'discribed describd',
'union': 'unioun', 'interest': 'intrest', 'flexible': 'flexable', 'refered':
'reffered', 'controlled': 'controled', 'sufficient': 'suficient',
'dissension': 'desention', 'adaptable': 'adabtable', 'representative':
'representitive', 'irrelevant': 'irrelavent', 'unnecessarily': 'unessasarily',
'applied': 'upplied', 'apologised': 'appologised', 'these': 'thees thess',
'choices': 'choises', 'will': 'wil', 'procedure': 'proceduer', 'shortened':
'shortend', 'manually': 'manualy', 'disappointing': 'dissapoiting',
'excessively': 'exessively', 'comments': 'coments', 'containing': 'containg',
'develop': 'develope', 'credit': 'creadit', 'government': 'goverment',
'acquaintances': 'aquantences', 'orientated': 'orentated', 'widely': 'widly',
'advise': 'advice', 'difficult': 'dificult', 'investigated': 'investegated',
'bonus': 'bonas', 'conceived': 'concieved', 'nationally': 'nationaly',
'compared': 'comppared compased', 'moving': 'moveing', 'necessity':
'nessesity', 'opportunity': 'oppertunity oppotunity opperttunity', 'thoughts':
'thorts', 'equalled': 'equaled', 'variety': 'variatry', 'analysis':
'analiss analsis analisis', 'patterns': 'pattarns', 'qualities': 'quaties', 'easily':
'easyly', 'organization': 'oranisation oragnisation', 'the': 'thw hte thi',
'corporate': 'corparate', 'composed': 'compossed', 'enormously': 'enomosly',
'financially': 'financialy', 'functionally': 'functionaly', 'discipline':
'disiplin', 'announcement': 'anouncement', 'progresses': 'progressess',
'except': 'excxept', 'recommending': 'recomending', 'mathematically':
'mathematicaly', 'source': 'sorce', 'combine': 'comibine', 'input': 'inut',
'careers': 'currers carrers', 'resolved': 'resoved', 'demands': 'diemands',
'unequivocally': 'unequivocaly', 'suffering': 'suufering', 'immediately':
'imidatly imediatly', 'accepted': 'acepted', 'projects': 'projeccts',
'necessary': 'necasery nessasary nessisary neccassary', 'journalism':
'journaism', 'unnecessary': 'unessessay', 'night': 'nite', 'output':
'oputput', 'security': 'seurity', 'essential': 'esential', 'beneficial':
'benificial benficial', 'explaining': 'explaning', 'supplementary':
'suplementary', 'questionnaire': 'questionare', 'employment': 'empolyment',
'proceeding': 'proceding', 'decision': 'descisions descision', 'per': 'pere',
'discretion': 'discresion', 'reaching': 'reching', 'analysed': 'analised',
'expansion': 'expanion', 'although': 'athough', 'subtract': 'subtrcat',
'analysing': 'aalysing', 'comparison': 'comparrison', 'months': 'monthes',
'hierarchal': 'hierachial', 'misleading': 'missleading', 'commit': 'comit',
'auguments': 'aurgument', 'within': 'withing', 'obtaining': 'optaning',
'accounts': 'acounts', 'primarily': 'pimarily', 'operator': 'opertor',
'accumulated': 'acumulated', 'extremely': 'extreemly', 'there': 'thear',
'summarys': 'sumarys', 'analyse': 'analiss', 'understandable':
'understadable', 'safeguard': 'safegaurd', 'consist': 'consisit',
'declarations': 'declaratrions', 'minutes': 'muinutes muiuets', 'associated':
'assosiated', 'accessibility': 'accessability', 'examine': 'examin',
'surveying': 'servaying', 'politics': 'polatics', 'annoying': 'anoying',
'again': 'agiin', 'assessing': 'accesing', 'ideally': 'idealy', 'scrutinized':
'scrutiniesed', 'simular': 'similar', 'personnel': 'personel', 'whereas':
'wheras', 'when': 'whn', 'geographically': 'goegraphicaly', 'gaining':
'ganing', 'requested': 'rquested', 'separate': 'seporate', 'students':
'studens', 'prepared': 'prepaired', 'generated': 'generataed', 'graphically':
'graphicaly', 'suited': 'suted', 'variable': 'varible vaiable', 'building':
'biulding', 'required': 'reequired', 'necessitates': 'nessisitates',
'together': 'togehter', 'profits': 'proffits'}


def spelltest(tests, verbose=False):
    """Use one of the two provided tests."""
    import time
    # time.clock() was removed in Python 3.8; time.perf_counter() replaces it
    n, bad, missed, junk, unknown, start = 0, 0, 0, 0, 0, time.perf_counter()
    # missed: how often the right word wasn't in the cluster
    
    for target, wrongs in tests.items():
        for wrong in wrongs.split():
            n += 1
            w = correct(wrong) # our "API" definition
            
            if w != target:
                bad += 1
                
                if target not in VOCAB:
                    unknown += 1
                elif target not in match(wrong):
                    missed += 1
                elif wrong in VOCAB:
                    junk += 1
                elif verbose:
                    print('correct(%r) => %r (%d); expected %r (%d)' % (
                        wrong, w, VOCAB[w], target, VOCAB[target]
                    ))
    
    print('%d%% correct' % (100. - 100. * bad / n))
    print('tested', n, 'words ->')
    print('wrong on', bad, 'of those words:')
    print(unknown, 'targets are not known to system (%d%%)' % (100.*unknown / n))
    print(junk, 'wrong words in vocabulary (%d%%)' % (100.*junk / n))
    print('possible leads:')
    print(missed, 'words have target not in matched group (%d%%)' % (100.*missed / n))
    print(bad - missed - unknown - junk, 'other types of mistakes (%d%%)' % (100.*(bad - missed - unknown - junk)/n))
    print('performance:')
    print(time.perf_counter() - start, 'seconds')

## Loading the known vocabulary

In [4]:
def jaccard(X, Y):
    """The Jaccard similarity between two sets."""
    x = set(X)
    y = set(Y)
    return float(len(x & y)) / len(x | y)

In [5]:
from nltk.corpus import brown
from collections import Counter

CORPUS = brown.words()
VOCAB = Counter(w.lower() for w in CORPUS if w.isalpha())
len(CORPUS), len(VOCAB)

(1161192, 40234)

## Clustering the vocabulary

Prepare the k-shingles for this vocabulary by setting the parameter K:

In [6]:
K = 2
    
def shingle(s, k):
    """Generate k-length shingles of string s."""
    k = min(len(s), k)
    for i in range(len(s) - k + 1):
        yield s[i:i+k]

NGRAMS = [(w, frozenset(shingle(w, K))) for w in VOCAB]
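
The choice of K changes how similar a misspelling looks to its target. As a small illustration (re-declaring `jaccard` and `shingle` from the cells above so the snippet runs on its own), compare the test pair 'access'/'acess' as plain character sets and as 2-shingles:

```python
def jaccard(X, Y):
    """The Jaccard similarity between two sets."""
    x, y = set(X), set(Y)
    return len(x & y) / len(x | y)


def shingle(s, k):
    """Generate k-length shingles of string s."""
    k = min(len(s), k)
    for i in range(len(s) - k + 1):
        yield s[i:i + k]


target, wrong = "access", "acess"

# Unigrams: both words consist of the letters {a, c, e, s}.
print(jaccard(target, wrong))  # 1.0

# 2-shingles: only "cc" is missing from the misspelling, so 4 of 5 are shared.
print(jaccard(shingle(target, 2), shingle(wrong, 2)))  # 0.8
```

Longer shingles are more discriminating, but a single dropped letter already pulls this pair below a similarity threshold of about 0.84, which noticeably reduces its chance of sharing a bucket.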

Choose to evaluate one of the next two cells:

**Either** evaluate this for k-shingling:

In [7]:
DICTIONARY = Cluster(.84, 80)
print("threshold =", DICTIONARY.hasher.exact_threshold)
print("bandwidth =", DICTIONARY.hasher.bandwidth)

for w, n in NGRAMS:
    DICTIONARY.add(n, w)

def match(word):
    return DICTIONARY.match(set(shingle(word, K)))

threshold = 0.834956560905386
bandwidth = 11


**Or** evaluate this cell to use unigram-based matching:

In [8]:
DICTIONARY = Cluster(.84, 80)
print("threshold =", DICTIONARY.hasher.exact_threshold)
print("bandwidth =", DICTIONARY.hasher.bandwidth)

for w in VOCAB:
    DICTIONARY.add(w, w)

def match(word):
    return DICTIONARY.match(word)

threshold = 0.834956560905386
bandwidth = 11
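
The reported threshold follows from the banding scheme. With `r` rows per band and `b` bands, two items whose signatures agree on a fraction `s` of positions share at least one band with probability `1 - (1 - s**r)**b`, following the analysis in "Mining of Massive Datasets". The sketch below plugs in the bandwidth of 11 reported above and the 7 full bands that `get_n_bands` yields for a signature size of 80; the function name `candidate_probability` is only illustrative.

```python
def candidate_probability(s, r, b):
    """Probability that two items with similarity s share at least one band."""
    return 1.0 - (1.0 - s ** r) ** b


size, r = 80, 11
b = size // r  # 7 full bands; the last 3 signature values are not used

for s in (0.5, 0.7, 0.8, 0.84, 0.9, 1.0):
    print("s = %.2f -> P(candidate) = %.3f" % (s, candidate_probability(s, r, b)))
```

The curve rises steeply around the threshold, which is what lets LSH separate likely matches from unlikely ones. Lowering the threshold passed to `Cluster` shifts that rise to the left and makes matching more lenient.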


In [9]:
num_groups = len(DICTIONARY.groups()) * 1.
num_groups, sum(len(g) for g in DICTIONARY.groups()) / num_groups

(1010.000, 39.836)
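
Note that `groups()` reports union-find components, which are transitive: two words end up in the same group if a chain of shared buckets links them, even when they never share a bucket themselves. `match()` only returns the direct bucket hits, so the average group size is probably best read as a rough indicator. A toy sketch with hand-made buckets (the bucket contents and the helper names `find` and `union` are invented for illustration) shows the difference:

```python
from collections import defaultdict

# Hypothetical bucket contents for two bands.
buckets = [
    {"h1": ["late", "later"]},    # band 0
    {"h2": ["later", "latter"]},  # band 1
]

parent = {}


def find(x):
    """Return the representative of x's set, compressing the path."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def union(x, y):
    """Merge the sets containing x and y."""
    parent[find(x)] = find(y)


# Merge all labels that share a bucket, as Cluster.add does incrementally.
for band in buckets:
    for labels in band.values():
        for label in labels[1:]:
            union(labels[0], label)

groups = defaultdict(set)
for label in list(parent):
    groups[find(label)].add(label)

print(sorted(map(sorted, groups.values())))
# [['late', 'later', 'latter']]

# Direct bucket hits for "late": it only ever shares a bucket with "later".
direct = {w for band in buckets for ws in band.values() if "late" in ws for w in ws}
print(sorted(direct))
# ['late', 'later']
```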

## Correcting mistakes

Once LSH has narrowed the search down to a small set of candidates, we calculate the edit distance between each candidate and the word being corrected. This is affordable only because LSH provides a significantly smaller search space than the full vocabulary. Note that, on average, the number of comparisons we need is roughly the mean group size computed above:

```python
sum(len(g) for g in DICTIONARY.groups()) / num_groups
```

To retain reasonable speed in this scenario, an average group size of about 100 candidates seems ideal.

In [10]:
def edit_distance(s1, s2):
    len1 = len(s1)
    len2 = len(s2)
    lev = _edit_dist_init(len1 + 1, len2 + 1)

    # iterate over the array
    for i in range(len1):
        for j in range(len2):
            _edit_dist_step(lev, i + 1, j + 1, s1, s2)
    return lev[len1][len2]

def _edit_dist_init(len1, len2):
    lev = []
    for i in range(len1):
        lev.append([0] * len2)  # initialize 2-D array to zero
    for i in range(len1):
        lev[i][0] = i           # column 0: 0,1,2,3,4,...
    for j in range(len2):
        lev[0][j] = j           # row 0: 0,1,2,3,4,...
    return lev

def _edit_dist_step(lev, i, j, s1, s2):
    c1 = s1[i - 1]
    c2 = s2[j - 1]

    # skipping a character in s1
    a = lev[i - 1][j] + 1
    # skipping a character in s2
    b = lev[i][j - 1] + 1
    # substitution
    c = lev[i - 1][j - 1] + (c1 != c2)

    # transposition
    d = c + 1  # never picked by default
    if i > 1 and j > 1:
        if s1[i - 2] == c2 and s2[j - 2] == c1:
            d = lev[i - 2][j - 2] + 1

    # pick the cheapest
    lev[i][j] = min(a, b, c, d)
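
The transposition step makes this a restricted Damerau-Levenshtein distance (often called optimal string alignment): swapping two adjacent characters costs one edit instead of two. A compact restatement of the same recurrence, written as a single function so the snippet runs on its own (the name `osa_distance` is chosen here for illustration), lets you check a few pairs from the test sets:

```python
def osa_distance(s1, s2):
    """Edit distance with insertions, deletions, substitutions and
    transpositions of adjacent characters, each at a cost of 1."""
    rows, cols = len(s1) + 1, len(s2) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of s1[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of s2[:j]

    for i in range(1, rows):
        for j in range(1, cols):
            cost = s1[i - 1] != s2[j - 1]
            d[i][j] = min(
                d[i - 1][j] + 1,         # skip a character in s1
                d[i][j - 1] + 1,         # skip a character in s2
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # transposition of the last two characters
            if i > 1 and j > 1 and s1[i - 2] == s2[j - 1] and s1[i - 1] == s2[j - 2]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)

    return d[-1][-1]


print(osa_distance("biult", "built"))      # 1: one transposition
print(osa_distance("acess", "access"))     # 1: one insertion
print(osa_distance("fisited", "visited"))  # 1: one substitution
```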

In [11]:
def correct(word):
    "Find the best spelling correction for this word."
    # do not correct known words:
    if word in VOCAB:
        return word
    
    candidates = match(word)
    
    if candidates:
        # measure distance to word and sort by increasing distance
        d_c_pairs = sorted((edit_distance(word, c), c) for c in candidates)
        # select the candidates with the shortest distance min_d
        min_d = d_c_pairs[0][0]
        candidates = [d_c[1] for d_c in filter(lambda i: i[0] == min_d, d_c_pairs)]
        # return the most frequent of the selected candidates
        return max(candidates, key=VOCAB.get)
    else:
        # no candidates fallback: do not correct
        return word
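
The selection logic in `correct` (closest candidates first, ties broken by corpus frequency) can also be expressed as a single `min` over a composite key. This is a sketch for illustration that should pick the same word, not a replacement: the helper name `pick_best` is hypothetical, and the toy frequencies and precomputed distances below are made up.

```python
from collections import Counter

# Made-up frequencies and precomputed edit distances for the misspelling "ther".
vocab = Counter({"there": 2724, "their": 2669, "other": 1702, "thereby": 30})
distances = {"there": 1, "their": 1, "other": 1, "thereby": 3}


def pick_best(candidates, distances, vocab):
    """Smallest distance wins; ties go to the more frequent word."""
    return min(candidates, key=lambda c: (distances[c], -vocab[c], c))


print(pick_best(distances.keys(), distances, vocab))  # there
```

The test sets list 'ther' as a misspelling of both 'there' and 'their', so a purely frequency-based tie-break can return the expected word for at most one of the two.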

## Running the evaluation

With the new evaluation method, we can immediately see whether we need to change the cluster similarity parameter (if many targets are "not in matched group") or the distance settings (shingle size, edit distance calculation, etc.).

In [12]:
spelltest(TESTS_1)

64% correct
tested 270 words ->
wrong on 95 of those words:
15 targets are not known to system (5%)
5 wrong words in vocabulary (1%)
possible leads:
59 words have target not in matched group (21%)
16 other types of mistakes (5%)
performance:
1.9214910000000032 seconds


In [13]:
spelltest(TESTS_2)

67% correct
tested 400 words ->
wrong on 132 of those words:
22 targets are not known to system (5%)
10 wrong words in vocabulary (2%)
possible leads:
74 words have target not in matched group (18%)
26 other types of mistakes (6%)
performance:
3.8779480000000035 seconds


It turns out that with simple unigrams we get roughly the same accuracy as Peter Norvig's method, but more than triple the correction speed.