# Spelling correction

This is adapted from Peter Norvig's [post](http://norvig.com/spell-correct.html) on spelling correction.

In [1]:
import requests

big = requests.get('http://norvig.com/big.txt').text


In [4]:
print(big[:1000])


The Project Gutenberg EBook of The Adventures of Sherlock Holmes
by Sir Arthur Conan Doyle
(#15 in our series by Sir Arthur Conan Doyle)

Copyright laws are changing all over the world. Be sure to check the
copyright laws for your country before downloading or redistributing
this or any other Project Gutenberg eBook.

This header should be the first thing seen when viewing this Project
Gutenberg file.  Please do not remove it.  Do not change or edit the
header without written permission.

Please read the "legal small print," and other information about the
eBook and Project Gutenberg at the bottom of this file.  Included is
important information about your specific rights and restrictions in
how the file may be used.  You can also find out about how to make a
donation to Project Gutenberg, and how to get involved.


**Welcome To The World of Free Plain Vanilla Electronic Texts**

**eBooks Readable By Both Humans and By Computers, Since 1971**

*****These eBooks Were Prepared By Thousan

In [50]:
import re
from collections import Counter

def words(text):
    return re.findall(r'\w+', text.lower())

WORDS = Counter(words(big))

def correct(word):
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)

def P(word, N=sum(WORDS.values())):
    "Probability of `word`."
    return WORDS[word] / N

def candidates(word):
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words):
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))


Some examples of spelling correction:

In [51]:
correct('speling')


'spelling'

In [52]:
correct('korrectud')


'corrected'

How does this work? The first step is to generate candidates: all strings that are one edit (a deletion, transposition, replacement, or insertion of a single letter) away from the original word. We then keep only the candidates that appear in our vocabulary. Next, we rank the remaining candidates by a score, which here is the frequency of the word in the corpus. Finally, we pick the candidate with the highest score.

In [53]:
import random

random.choices(list(edits1('candl')), k=10)


['canda',
 'cnadl',
 'candjl',
 'candzl',
 'candls',
 'candln',
 'canfl',
 'lcandl',
 'cacndl',
 'cabndl']
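To get a feel for how many strings `edits1` generates, here is a minimal sketch that counts the raw edits of each kind before duplicates are removed. The helper name `edit_counts` is hypothetical and not part of the code above; it only mirrors the list comprehensions in `edits1`. For a word of length n the raw total works out to n + (n - 1) + 26n + 26(n + 1) = 54n + 25.

```python
def edit_counts(word, letters='abcdefghijklmnopqrstuvwxyz'):
    """Count the raw (pre-deduplication) edits of each kind for `word`."""
    # Same split points as edits1: every way to cut the word in two.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    return {
        'deletes': sum(1 for L, R in splits if R),
        'transposes': sum(1 for L, R in splits if len(R) > 1),
        'replaces': sum(len(letters) for L, R in splits if R),
        'inserts': len(letters) * len(splits),
    }

counts = edit_counts('candl')
total = sum(counts.values())
n = len('candl')

print(counts)
print(total, 54 * n + 25)  # both are 295 for a five-letter word
```

The set returned by `edits1` is smaller than this raw total, because different edits can produce the same string.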

We can also consider words that are two edits away.

In [54]:
random.choices(list(edits2('candl')), k=10)


['caandil',
 'caazdl',
 'vaqndl',
 'candlfy',
 'calndm',
 'caqndd',
 'cagdl',
 'cawdo',
 'cacbdl',
 'canldla']

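The `candidates` function returns the first non-empty set among: the word itself if it is in the vocabulary, the 1-edits that are in the vocabulary, the 2-edits that are in the vocabulary, and, as a last resort, the original word unchanged.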

In [55]:
candidates('candl')


{'canal', 'candle', 'candy'}

So in this case, there are three of them, all one edit away from `candl`.
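The fallback order in `candidates` relies on Python's `or` returning its first truthy operand, and an empty set is falsy. The sketch below illustrates this with a hypothetical toy vocabulary and hand-picked edit lists (assumptions for illustration, not the real `WORDS` counter or the full output of `edits1` and `edits2`).

```python
toy_vocab = {'can', 'canal', 'candle', 'candy'}  # hypothetical vocabulary

def known(words):
    """Keep only the words that appear in the toy vocabulary."""
    return {w for w in words if w in toy_vocab}

exact = known(['candl'])                                  # empty: 'candl' is unknown
one_edit = known(['candle', 'candy', 'canal', 'candls'])  # hand-picked 1-edits
two_edits = known(['can', 'candles'])                     # hand-picked 2-edits

# `or` stops at the first non-empty set, so 2-edits are never consulted here.
chosen = exact or one_edit or two_edits or {'candl'}
print(sorted(chosen))
```

Because the chain stops early, a 2-edit such as `can` is never considered once any known 1-edit exists, no matter how frequent it is.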

In [56]:
for candidate in candidates('candl'):
    print(candidate, P(candidate))


candy 8.963906829152417e-07
canal 6.454012916989741e-05
candle 3.2270064584948705e-05


In [57]:
correct('candl')


'canal'

This model is actually a simple Bayes classifier. We want to find the most probable correction $c$ given the observed word $w$. We can do this by using Bayes' rule:

$$argmax_{c \in candidates} P(c|w) = argmax_{c \in candidates} \frac{P(c)P(w|c)}{P(w)}$$

We can ignore the denominator since it's the same for all candidates. So we can simplify this to:

$$argmax_{c \in candidates} P(c)P(w|c)$$

This basic model estimates $P(c)$ by the relative frequency of the word in the corpus:

In [58]:
print(f"{P('the'):.2%}")


7.15%
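As a small illustration of this estimate, the sketch below computes relative frequencies on a hypothetical toy corpus (an assumption for illustration, not `big.txt`). It also shows that a word missing from the corpus gets probability zero, because a `Counter` returns 0 for unseen keys.

```python
import re
from collections import Counter

text = "the cat sat on the mat and the dog sat too"  # hypothetical toy corpus
counts = Counter(re.findall(r'\w+', text.lower()))
total = sum(counts.values())  # 11 tokens

def P(word):
    """Relative frequency of `word` in the toy corpus."""
    return counts[word] / total

print(P('the'))   # 3 occurrences out of 11 tokens
print(P('bird'))  # unseen word, so the count (and probability) is 0
```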


The term $P(w|c)$ is more complicated. It is the likelihood of typing $w$ when $c$ was intended. The problem is that we don't have a list of typos people actually made. So our basic model simply assumes that a known word is most likely correct as typed, that a 1-edit is more likely than a 2-edit, and that all candidates at the same edit distance are equally likely. A more sophisticated model would use a corpus of misspellings to learn typical typos.
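To make the role of $P(w|c)$ concrete, here is a minimal sketch that scores the three `candl` candidates with $P(c)P(w|c)$. The priors are copied from the output shown earlier; the likelihood values are made-up numbers chosen purely for illustration, not learned from any data.

```python
# Priors P(c), copied from the `candl` example output earlier in the notebook.
prior = {
    'candy': 8.963906829152417e-07,
    'canal': 6.454012916989741e-05,
    'candle': 3.2270064584948705e-05,
}

# Hypothetical likelihoods P('candl' | c): invented values, not a trained error model.
likelihood = {
    'candy': 0.001,
    'canal': 0.0001,
    'candle': 0.01,
}

def score(c):
    """Unnormalised posterior P(c) * P(w|c)."""
    return prior[c] * likelihood[c]

best_uniform = max(prior, key=prior.get)  # equal likelihoods: the prior alone decides
best_weighted = max(prior, key=score)     # the assumed error model changes the ranking

print(best_uniform, best_weighted)
```

With equal likelihoods the most frequent word, `canal`, wins, as in the basic model. Under these invented likelihoods `candle` wins instead, which is the kind of shift a learned error model could produce.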

## Application

In [59]:
from bs4 import BeautifulSoup
import requests

url = 'https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines'

content = requests.get(url).content.decode()
soup = BeautifulSoup(content, 'html.parser')  # name the parser explicitly so bs4 does not have to guess one


In [60]:
from collections import defaultdict

typos = defaultdict(list)

for line in soup.find(name='pre').text.splitlines():
    typo, correction = line.split('->')
    typos[correction].append(typo)

typos = dict(typos)

print(f'{len(typos):,d} words')
print(f'{sum(map(len, typos.values())):,d} misspellings')


3,156 words
4,291 misspellings
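Some entries in this list give several comma-separated corrections for one typo (for example `should, shouldn't` for `shoudln`, which shows up in the evaluation output further down). The parsing above stores such an entry under a single combined key. A possible variant, sketched here on a few sample lines reconstructed from that output, splits the right-hand side so that each correction gets its own key. This is an illustrative alternative, not the parsing used for the results in this notebook.

```python
from collections import defaultdict

# Sample lines in the `typo->correction` format, reconstructed from the evaluation output.
sample = """busines->business
buisness->business
shoudln->should, shouldn't"""

typos = defaultdict(list)

for line in sample.splitlines():
    # partition splits on the first '->' only and never raises on odd lines.
    typo, _, corrections = line.partition('->')
    for correction in corrections.split(','):
        typos[correction.strip()].append(typo.strip())

typos = dict(typos)
print(typos)
```

With this layout, a typo counts as correctly fixed if the model returns any one of its listed corrections.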


In [63]:
import random

keys = random.sample(list(typos.keys()), 25)

for truth in keys:
    misspellings = typos[truth]
    for misspelling in misspellings:
        correction = correct(misspelling)
        if correction == truth:
            print(f'✅ {misspelling} -> {correction} == {truth}')
        else:
            print(f'❌ {misspelling} -> {correction} != {truth}')


✅ consituted -> constituted == constituted
❌ tast -> last != taste
✅ appearences -> appearances == appearances
✅ apperances -> appearances == appearances
✅ appereances -> appearances == appearances
❌ Pucini -> mucin != Puccini
❌ rechargable -> rechargable != rechargeable
✅ casette -> cassette == cassette
✅ verisons -> versions == versions
✅ nowe -> now == now
❌ regardes -> regarded != regards
✅ mataphysical -> metaphysical == metaphysical
✅ coform -> conform == conform
❌ shoudln -> shouldn != should, shouldn't
❌ homogeneize -> homogeneize != homogenize
✅ guerrila -> guerrilla == guerrilla
❌ implimented -> complimented != implemented
✅ threee -> three == three
✅ inocence -> innocence == innocence
❌ maneouvres -> maneuvers != manoeuvres
❌ deteoriated -> deteoriated != deteriorated
❌ scoll -> scold != scroll
❌ Malcom -> falcon != Malcolm
✅ buisness -> business == business
✅ busines -> business == business
✅ busness -> business == business
✅ bussiness -> business == business
❌ omniverously