# How to write a spelling corrector
Marco Herrero <me@marhs.de> 
(Original idea by Peter Norvig)

### Goal
Let's try to write a function called `correct(word)`:

```
>>> correct('madriz')
'madrid'
```

In [7]:
def correct(word):
    possible_corrections = []  # ?? which corrections are possible?
    best_correction = None     # ?? which one is the most likely?
    return best_correction

To define this function, we need to resolve two problems:

1. Given a word, what are the possible corrections of that word?
```
word = 'lates'
possible_corrections = ['late', 'latest']
```

2. Given a set of corrections, which one is most likely to be the intended one?
```
best_correction = 'late'
```

Let's use probability to solve both problems.

## A little probability theory

Given a word, we are trying to choose the most likely spelling correction for it (which could be the original word itself!). There is no way to know for sure, so we are going to use probabilities. We are trying to find the correction __c__, out of all possible corrections, that maximizes the probability of __c__ given the original word __w__:

argmax<sub>c</sub> P(c|w)

By Bayes' Theorem this is equivalent to:

argmax<sub>c</sub> P(w|c) P(c)<del> / P(w)</del>  

There are three parts of this expression:    
1. _P(c)_: __Language model__: the probability that the correction c appears on its own, regardless of the word that was typed.
2. _P(w|c)_: __Error model__: the probability that the author typed w when they meant c.
3. _argmax<sub>c</sub>_: __Control mechanism__: enumerate all feasible values of c and choose the one that gives the __best probability score__.

P(w) can be dropped because it is the same for every possible c, so it does not change which c wins.
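
As a minimal sketch of how the three parts fit together, the snippet below scores the two candidate corrections for `'lates'`. The probabilities are made-up numbers chosen only for illustration, and the names `language_model`, `error_model` and `score` are hypothetical.

```python
# Made-up numbers, for illustration only.
word = 'lates'

# P(c): how likely each candidate is to appear in a text.
language_model = {'late': 0.0005, 'latest': 0.0002}

# P(w|c): how likely the author is to type `word` when meaning c.
error_model = {'late': 0.02, 'latest': 0.03}

def score(c):
    # P(w) is left out: it is the same for every candidate.
    return error_model[c] * language_model[c]

# argmax over the candidates.
best_correction = max(language_model, key=score)
print(best_correction)  # late
```

With these numbers `'late'` wins even though its error probability is lower, because it is the more frequent word.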

### Language model

Let's assume that the probability that a correction __c__ is the right one equals the probability that the word __c__ appears in a text. In an English text:  

__P("the")__ > __P("coriander")__ > __P("xxyxyxxxy")__

To use that, let's build a default dictionary that maps each word to the number of times it appears in a text. 

In [27]:
import re, collections

def words(text):
    return re.findall('[a-z]+', text.lower())
    
def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

train(words("""Estais utilizando mi estirpe como ganado 
               para mantener al sol vivo.
               
                        El guardian de las estrellas
            """))

defaultdict(<function __main__.<lambda>>,
            {'al': 2,
             'como': 2,
             'de': 2,
             'el': 2,
             'estais': 2,
             'estirpe': 2,
             'estrellas': 2,
             'ganado': 2,
             'guardian': 2,
             'las': 2,
             'mantener': 2,
             'mi': 2,
             'para': 2,
             'sol': 2,
             'utilizando': 2,
             'vivo': 2})

If we train on a really big text, such as a book, we get a good __language model__. Keep in mind that the model is not language-independent: it only captures knowledge about the language of the training text.  

In [29]:
NWORDS = train(words(open('data/conde_montecristo.txt', 'r').read()))

At this point _NWORDS[**w**]_ holds one plus the number of times the word __w__ has been seen, because `train` starts every count at 1.
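
To turn these counts into an estimate of P(c), one option is to divide each count by the total. The sketch below assumes that normalisation; the program in this notebook never needs it, because `max` only depends on the relative order of the counts. The word list is a tiny made-up sample and `probability` is a hypothetical helper.

```python
import collections

def train(features):
    # Same training function as above: every count starts at 1.
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

# Tiny made-up sample, only for illustration.
counts = train(['el', 'sol', 'el', 'conde', 'el'])
total = sum(counts.values())

def probability(word):
    # Hypothetical helper: relative weight of a word.
    # .get avoids inserting unseen words into the defaultdict.
    return counts.get(word, 1) / total

print(counts['el'], total, probability('el'))  # 4 8 0.5
```

An unseen word gets the default weight of 1, so it is unlikely but not impossible.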

In [None]:
import re, collections

def words(text): return re.findall('[a-z]+', text.lower()) 

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

# file() only exists in Python 2; open() works in both Python 2 and 3.
NWORDS = train(words(open('data/conde_montecristo.txt').read()))
# NWORDS = train(words(open('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
   splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
   deletes    = [a + b[1:] for a, b in splits if b]
   transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
   replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
   inserts    = [a + c + b     for a, b in splits for c in alphabet]
   return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

In [7]:
correct('cspado')

'criado'
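
To see how the candidates are generated, the sketch below repeats `edits1` and applies it to `'madriz'`, then keeps only the candidates found in a tiny made-up vocabulary. `VOCAB` and its counts are hypothetical stand-ins for `NWORDS`.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    # All strings one edit away: delete, transpose, replace, insert.
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

# Hypothetical word counts standing in for NWORDS.
VOCAB = {'madrid': 12, 'madre': 30, 'padre': 25}

candidates = edits1('madriz')
known_candidates = {w for w in candidates if w in VOCAB}

print('madrid' in candidates)                # True
print(known_candidates)                      # {'madrid'}
print(max(known_candidates, key=VOCAB.get))  # madrid
```

`'madre'` is more frequent in this vocabulary, but it is two edits away from `'madriz'`, so it never appears among the one-edit candidates.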