# Context-sensitive Spelling Correction

The goal of the assignment is to implement context-sensitive spelling correction. The input of the code will be a set of text lines and the output will be the same lines with spelling mistakes fixed.

Useful links:
- [Norvig's solution](https://norvig.com/spell-correct.html)
- [Norvig's dataset](https://norvig.com/big.txt)
- [Ngrams data](https://www.ngrams.info/download_coca.asp)

Grading:
- 40 points - Implement spelling correction
- 20 points - Justify your decisions
- 20 points - Evaluate on a test set
- 20 points - Evaluate on Github Typo Corpus (for masters only)


Remarks: 
- Use Python 3 or greater
- Max is 80 points for bachelors, 100 points for masters

## Implement context-sensitive spelling correction

Your task is to implement context-sensitive spelling corrector using N-gram language model. The idea is to compute conditional probabilities of possible correction options. For example, the phrase "dking sport" should be fixed as "doing sport" not "dying sport", while "dking species" -- as "dying species".

The best way to start is to analyze [Norvig's solution](https://norvig.com/spell-correct.html).

[N-gram Language Models](https://web.stanford.edu/~jurafsky/slp3/3.pdf)

You may also wnat to implement:
- spell-checking for a concrete language - Russian, Tatar, Ukranian, etc. - any one you know, such that the solution accounts for language specifics,
- some recent (or not very recent) paper on this topic,
- solution which takes into account keyboard layout and associated misspellings,
- efficiency improvement to make the solution faster,
- any other idea of yours to improve the Norvig’s solution.

Important - your project should not be a mere code copy-paste from somewhere. Implement yourself, analyze why it was suggested this way, and think of improvements/customization.

Your solution should be able to perform 4 corrections per second (3-5 words in an example) on a typical cpu.

In [1]:
import re
from collections import Counter

class Norvig_corrector: # from https://norvig.com/spell-correct.html

    def words(self, text): return re.findall(r'\w+', text.lower())

    def __init__(self):
        self.WORDS = Counter(self.words(open('big.txt').read()))

    def P(self, word): 
        N=sum(self.WORDS.values())
        "Probability of `word`."
        return self.WORDS[word] / N

    def correction(self, word): 
        "Most probable spelling correction for word."
        return max(self.candidates(word), key=self.P)

    def candidates(self, word): 
        "Generate possible spelling corrections for word."
        return (self.known([word]) or self.known(self.edits1(word)) or self.known(self.edits2(word)) or set(word,))

    def known(self, words): 
        "The subset of `words` that appear in the dictionary of WORDS."
        return set(w for w in words if w in self.WORDS)

    def edits1(self, word):
        "All edits that are one edit away from `word`."
        letters    = 'abcdefghijklmnopqrstuvwxyz'
        splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
        deletes    = [L + R[1:]               for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
        replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
        inserts    = [L + c + R               for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def edits2(self, word): 
        "All edits that are two edits away from `word`."
        return (e2 for e1 in self.edits1(word) for e2 in self.edits1(e1))

In [2]:
from collections import defaultdict
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import random

class Corrector_with_weights(Norvig_corrector):

    def get_bigrams(self):
        with open("w2_.txt") as f:
            while (line := f.readline().rstrip()):
                val, w1, w2 = line.split()
                val = int(val)
                # if w2 in self.stop_words:
                #     val = 1
                self.bigrams[w1][w2] = val
        
        for w1 in self.bigrams:
            total_count = float(sum(self.bigrams[w1].values()))
            for w2 in self.bigrams[w1]:
                self.bigrams[w1][w2] /= total_count
            
    def __init__(self, w1 = 1, w2 = 0.2):
        super().__init__()
        self.WORDS = Counter(self.words(open('big.txt').read()))
        self.WORDS += Counter(self.words(open('w2_.txt').read()))
        self.stop_words = set(stopwords.words('english'))
        self.bigrams = defaultdict(lambda: defaultdict(lambda: 0))
        self.get_bigrams()
        self.weight1 = w1
        self.weight2 = w2

    def candidates_union(self, word): 
        # return self.candidates(word)
        "Generate possible spelling corrections for word."
        temp = self.known([word])
        if temp:
            return temp, [1]
        temp1 = self.known(self.edits1(word))
        temp2 = self.known(self.edits2(word))
        weights1 = [self.weight1 for i in range (len(temp1))]
        weights2 = [self.weight2 for i in range (len(temp2))]
        # print(weights1)
        # print(weights2)
        union = temp1 | temp2
        intersection = temp1 & temp2
        union_weights = weights1 + weights2[:len(weights2)-len(intersection)]
        # print(union_weights)
        
        if union:
            return union, union_weights
        return {word}, [1]

    def get_probable_word(self, w1_candidates, w2_candidates, w1_multipliers, w2_multipliers):
        w1 = w1_candidates[0]
        w2 = w2_candidates[0]
        prob = 0
        if (len(w1_candidates) != len(w1_multipliers)):
            print(len(w1_candidates), len(w1_multipliers), w1_candidates, w1_multipliers)
        if (len(w2_candidates) != len(w2_multipliers)):
            print(len(w2_candidates), len(w2_multipliers), w2_candidates, w2_multipliers)
        # print(len(w1_candidates), len(w1_multipliers), len(w2_candidates), len(w2_multipliers))
        for i in range(len(w1_candidates)):
            for j in range(len(w2_candidates)):
                if self.bigrams[w1_candidates[i]][w2_candidates[j]] * w1_multipliers[i] * w2_multipliers[j] > prob:
                    prob = self.bigrams[w1_candidates[i]][w2_candidates[j]] * w1_multipliers[i] * w2_multipliers[j]
                    w1 = w1_candidates[i]
                    w2 = w2_candidates[j]
        return w1, w2
    
    def fix_string(self, string, show=False):
        tokens = re.findall(r'\w+',string)
        corrections = self.fix_tokens(self.words(string))
        if show:
            print(tokens, corrections)
        for token, correction in zip(tokens, corrections):
            if token[0].isupper():
                correction = correction[0].upper() + correction[1:]
            idx = string.find(token)
            string = string[:idx] + correction + string[idx + len(token):]
        return string
    
    def perform_errors(self, string, prob=0.5, mistake_type_prob = 0.6):
        tokens = re.findall(r'\w+',string)
        curr_idx = 0
        for token in tokens:
            correction = token
            if random.random() < prob and not token.isnumeric():
                if random.random() < mistake_type_prob:
                    correction = random.choice(list(self.edits1(token)))
                else: 
                    correction = random.choice(list(self.edits2(token)))
                if correction=='':
                    correction = token
            if token[0].isupper():
                correction = correction[0].upper() + correction[1:]
            idx = string.find(token, curr_idx)
            string = string[:idx] + correction + string[idx + len(token):]
            curr_idx = idx + len(correction)
        return string

    def fix_tokens(self, tokens):
        tokens = tokens[:]
        possibilities = [[] for i in range (len(tokens))]
        multipliers = [[] for i in range (len(tokens))]
        for i in range (len(tokens)):
            if self.known([tokens[i]]) or tokens[i].isnumeric():
                possibilities[i].append(tokens[i])
                multipliers[i].append(1)
            else:
                arr, mul = self.candidates_union(tokens[i])
                # print(arr, mul)
                possibilities[i].extend(list(arr))
                multipliers[i].extend(mul)

        ## check for 1 word in sentence
        if len(tokens) == 1:
            return self.correction(tokens[0])
        
        for i in range (len(tokens)):
            if len(possibilities[i]) > 1:
                if i == 0:
                    tokens[i], _ = self.get_probable_word(possibilities[i], possibilities[i+1], multipliers[i], multipliers[i+1])
                else:
                    _, tokens[i] = self.get_probable_word([tokens[i-1]], possibilities[i], [1], multipliers[i])
            else:
                tokens[i] = possibilities[i][0]
        return tokens
    
    def compare_methods(self, correct_corpus, test_corpus):
        accuracy_Norvig = 0
        accuracy_Corrector = 0
        total_words = 0

        for i in range (len(test_corpus)):

            correct_tokens = self.words(correct_corpus[i])
            total_words += len(correct_tokens)
            test_tokens = self.words(test_corpus[i])
            corrector_answer = self.fix_tokens(test_tokens)

            for j in range (len(correct_tokens)):
                if self.correction(test_tokens[j]) == correct_tokens[j]:
                    accuracy_Norvig+=1

                if corrector_answer[j] == correct_tokens[j]:
                    accuracy_Corrector+=1
                    
        accuracy_Corrector /= total_words
        accuracy_Norvig /= total_words
        print(f"Norvig's corrector accuracy:\t{accuracy_Norvig}\nContext corrector accuracy:\t{accuracy_Corrector}")
        
    def context_fix(self, corpus):
        fixed_corpus = []
        for string in corpus:
            fixed_corpus.append(self.fix_string(string))
        return fixed_corpus

In [3]:
class Corrector(Norvig_corrector):

    def get_bigrams(self):
        with open("w2_.txt") as f:
            while (line := f.readline().rstrip()):
                val, w1, w2 = line.split()
                val = int(val)
                # if w2 in self.stop_words:
                #     val = 1
                self.bigrams[w1][w2] = val
        
        for w1 in self.bigrams:
            total_count = float(sum(self.bigrams[w1].values()))
            for w2 in self.bigrams[w1]:
                self.bigrams[w1][w2] /= total_count
            
    def __init__(self):
        super().__init__()
        self.WORDS = Counter(self.words(open('big.txt').read()))
        self.WORDS += Counter(self.words(open('w2_.txt').read()))
        self.stop_words = set(stopwords.words('english'))
        self.bigrams = defaultdict(lambda: defaultdict(lambda: 0))
        self.get_bigrams()

    def candidates_union(self, word): 
        return self.candidates(word)

    def get_probable_word(self, w1_candidates, w2_candidates):
        w1 = w1_candidates[0]
        w2 = w2_candidates[0]
        prob = 0
        # print(len(w1_candidates), len(w1_multipliers), len(w2_candidates), len(w2_multipliers),)
        for i in range(len(w1_candidates)):
            for j in range(len(w2_candidates)):
                if self.bigrams[w1_candidates[i]][w2_candidates[j]] > prob:
                    prob = self.bigrams[w1_candidates[i]][w2_candidates[j]]
                    w1 = w1_candidates[i]
                    w2 = w2_candidates[j]
        return w1, w2
    
    def fix_string(self, string, show=False):
        tokens = re.findall(r'\w+',string)
        corrections = self.fix_tokens(self.words(string))
        if show:
            print(tokens, corrections)
        for token, correction in zip(tokens, corrections):
            if token[0].isupper():
                correction = correction[0].upper() + correction[1:]
            idx = string.find(token)
            string = string[:idx] + correction + string[idx + len(token):]
        return string
    
    def perform_errors(self, string, prob=0.5, mistake_type_prob = 0.6):
        tokens = re.findall(r'\w+',string)
        curr_idx = 0
        for token in tokens:
            correction = token
            if random.random() < prob and not token.isnumeric():
                if random.random() < mistake_type_prob:
                    correction = random.choice(list(self.edits1(token)))
                else: 
                    correction = random.choice(list(self.edits2(token)))
                if correction=='':
                    correction = token
            if token[0].isupper():
                correction = correction[0].upper() + correction[1:]
            idx = string.find(token, curr_idx)
            string = string[:idx] + correction + string[idx + len(token):]
            curr_idx = idx + len(correction)
        return string

    def fix_tokens(self, tokens):
        tokens = tokens[:]
        possibilities = [[] for i in range (len(tokens))]
        for i in range (len(tokens)):
            if self.known([tokens[i]]) or tokens[i].isnumeric():
                possibilities[i].append(tokens[i])
            else:
                arr = self.candidates_union(tokens[i])
                possibilities[i].extend(list(arr))

        ## check for 1 word in sentence
        if len(tokens) == 1:
            return self.correction(tokens[0])
        
        for i in range (len(tokens)):
            if len(possibilities[i]) > 1:
                if i == 0:
                    tokens[i], _ = self.get_probable_word(possibilities[i], possibilities[i+1])
                else:
                    _, tokens[i] = self.get_probable_word([tokens[i-1]], possibilities[i])
            else:
                tokens[i] = possibilities[i][0]
        return tokens
    
    def compare_methods(self, correct_corpus, test_corpus):
        accuracy_Norvig = 0
        accuracy_Corrector = 0
        total_words = 0

        for i in range (len(test_corpus)):

            correct_tokens = self.words(correct_corpus[i])
            total_words += len(correct_tokens)
            test_tokens = self.words(test_corpus[i])
            corrector_answer = self.fix_tokens(test_tokens)
            # print(len(correct_tokens), len(test_tokens), len(corrector_answer))
            try:
                for j in range (len(correct_tokens)):
                    if self.correction(test_tokens[j]) == correct_tokens[j]:
                        accuracy_Norvig+=1

                    if corrector_answer[j] == correct_tokens[j]:
                        accuracy_Corrector+=1
            except:
                print("Some error")
                print(correct_tokens, test_tokens, corrector_answer)
                ans = []
                for j in range (len(correct_tokens)):
                    ans.append(self.correction(test_tokens[j]))
                print(ans)
                print(len(correct_tokens), len(test_tokens), len(corrector_answer), len(ans))
        accuracy_Corrector /= total_words
        accuracy_Norvig /= total_words
        print(f"Norvig's corrector accuracy:\t{accuracy_Norvig}\nContext corrector accuracy:\t{accuracy_Corrector}")
        
    def context_fix(self, corpus):
        fixed_corpus = []
        for string in corpus:
            fixed_corpus.append(self.fix_string(string))
        return fixed_corpus

## Justify your decisions
In your implementation you will need to decide which ngram dataset to use, which weights to assign for edit1, edit2 or absent words probabilities, beam search parameters and etc. Write down justificaitons for these choices.

I decided to use bigram dataset because it is not as sparse as other ngrams datasets and it also cover sentences that consists of two words (f.e. "Go outside!" or "Shut up."). 
I tried two different approaches:
- (Corrector with weights) Assign some additional weights for edit 2 and therefore change candidate function in order to operate with union of sets from edit1 and edit2. However performance for that approach is not better than another one.
- (Corrector) Do not assign any weights to edit1 and edit2 and operate with candidate function from Norvig's solution.
In any way both solutions wasn't able to perform better than Norvig's.  
Speaking about probabilities and for absent words. In all previously mentioned solutions if we face unknown word it firstly would consider it as a mistake and try to edit it. If there is no possibilities to convert this unknown word to known one through edit1 or edit2, then we leave this word as it is. In that case we have only one variant -> bigrams probabilities doesn't really matter (the maximum probability is 0 for any pair of an unknown word and any other word).

Now let's talk about beam search, to be honest I haven't came to any idea about how to use it in that problem.

Finally, let's consider performance of Corrector on texts with different error probabilities. The performance is better when the errors in the test data are mostly of the type edit1, this is because the word with the error edit2 has a larger number of possible candidates than with edit1. If there are few errors, the performance is not bad, but with an increase in the error rate, the probability of getting into a chain of incorrectly selected words increases. In this scenario, a larger number of candidates for some words means a greater probability of a significant change in this sentence. This could probably be fixed a bit by using several ngrams to determine the best candidate, but this leads us to exhausting tuning of hyperparameters (weights for each ngram) and other problems (for example, that with more n in ngrams, the matrix will be more sparse, therefore, this will require advanced smoothing, otherwise it will not lead to significant changes in the results).

One more note: If Corrector or Corrector with weights comes across a sentence consisting of a single word, it uses Norvig's solution to correct it if necessary.

## Evaluate on a test set

Your task is to generate a test set and evaluate your work. You may vary the noise probability to generate different datasets with varying compexity. Compare your solution to the Norvig's corrector, and report the accuracies.

In [4]:
corrector = Corrector()
corrector_with_weights = Corrector_with_weights()

In [5]:
from nltk.tokenize import sent_tokenize
s = "If you were poking around RT a week and a half or so ago, you might have come across a little poll we were taking on the site to try and determine the Scariest Movie Ever. Based on other lists and suggestions from the RT staff, we pulled together 40 of the scariest movies ever made and asked you to vote for the one that terrified you the most. As it happens, a British broadband service comparison website decided to conduct a science experiment to determine the same thing, and their results were… surprising, to say the least. Did Rotten Tomatoes readers agree with the findings? Read on to find out what our fans determined were the 10 Scariest Horror Movies Ever."
s += "You mr-director Aray not agree that The Exorcist is the scariest movie ever, but it probably also isn’t much of a surprise to see it at the top of our list — with a whopping 19% of all the votes cast. William Friedkin’s adaptation of the eponymous novel about a demon-possessed child and the attempts to banish said demon became the highest-grossing R-rated horror film ever and the first to be nominated for Best Picture at the Oscars (it earned nine other nominations and took home two trophies). But outside of its critical and commercial bona fides, the film is well-known for the mass hysteria it inspired across the country, from protests over its controversial subject matter to widespread reports of nausea and fainting in the audience. Its dramatic pacing and somewhat dated effects may seem quaint compared to some contemporary horror, but there’s no denying the power the film continues to have over those who see it for the first time."
s += "Writei Aster made a huge splash with his feature directorial debut, a dark family drama about the nature of grief couched within a supernatural horror film. Toni Collette earned a spot in the pantheon of great Oscar snubs with her slowly-ratcheted-up-to-11 performance as bedeviled mother Annie, but the movie’s biggest shock came courtesy of… Well, we won’t spoil that here. Suffice it to say Hereditary struck such a nerve with moviegoers that it instantly turned Aster into a director to watch and shot up to second place on our list."
s += 'Somewhat mystifyingly, some top-secret algorithmic function in DreamWorks Animation’s audience-reaction data analysis software has decreed that yet another comeback is in order for the sort of OK-ish and meh-plus character of Puss in Boots, smokily voiced by Antonio Banderas, originally seen in 2004 in Shrek 2, and then in the 2010 spinoff feature Puss in Boots. The numbers have come chuntering out of the side of some giant IBM-style computer, the suits have frowningly inspected them, and another tranche of Puss in Boots content has been greenlit. Once again, debonair outlaw Puss in Boots – a sort of cleaned-up southern European version of Jack Sparrow – is having sword-twirling adventures, again in the company of his paramour Kitty Softpaws (Salma Hayek); but now PiB must confront his own mortality, having used up eight of his nine lives. He is on a quest to put off the evil hour by finding the legendary wishing star which once fell to earth like a comet; he and Kitty join forces with the perky mutt Perrito (Harvey Guillén), but must battle other fairytale/nursery-rhyme honchos, including a Cockney crime family in the form of Goldilocks (Florence Pugh) and the Three Bears (Olivia Colman, Ray Winstone and Samson Kayo), and “Big” Jack Horner (John Mulaney) – to whom all the funny lines are given. Wagner Moura voices the Wolf, who is the grim reaper, wielding a couple of sickles.'
s += 'Really, this movie is a huge 102-minute additional scene, something that would go on the extras package of a Blu-ray edition of the previous Puss in Boots film, or possibly get its own video-on-demand release. It feels like something to put on your TV or iPad to pacify a toddler; nothing wrong with that, of course, and many stressed parents would call it the noblest artistic calling. But how bland and forgettable this film is, without in the smallest way harnessing the real performing power of Banderas, Colman, Pugh, Winstone et al.'
corpus = sent_tokenize(s)

In [6]:
mistake_probability = [0.1*i for i in range(1, 11)]
edit1_prob = [0.1, 0.33, 0.66, 1]
for i in range (10):
    for j in range(4):
        corpus_with_errors=[]
        for sentence in corpus:
            corpus_with_errors.append(corrector.perform_errors(sentence, prob=mistake_probability[i], mistake_type_prob=edit1_prob[j]))
        print(f"For probability of mistake = {mistake_probability[i]} and probability of performing edit1 type of mistake = {edit1_prob[j]}")
        corrector.compare_methods(corpus, corpus_with_errors)
        print()

For probability of mistake = 0.1 and probability of performing edit1 type of mistake = 0.1
Norvig's corrector accuracy:	0.9114799446749654
Context corrector accuracy:	0.9087136929460581

For probability of mistake = 0.1 and probability of performing edit1 type of mistake = 0.33
Norvig's corrector accuracy:	0.9308437067773168
Context corrector accuracy:	0.9266943291839558

For probability of mistake = 0.1 and probability of performing edit1 type of mistake = 0.66
Norvig's corrector accuracy:	0.9363762102351314
Context corrector accuracy:	0.9308437067773168

For probability of mistake = 0.1 and probability of performing edit1 type of mistake = 1
Norvig's corrector accuracy:	0.9336099585062241
Context corrector accuracy:	0.9294605809128631

For probability of mistake = 0.2 and probability of performing edit1 type of mistake = 0.1
Norvig's corrector accuracy:	0.8907330567081605
Context corrector accuracy:	0.8824343015214384

For probability of mistake = 0.2 and probability of performing ed

In [7]:
some_s = 'a a a aaaaaa a a aaa a a of a a aaaa a a of aa a a a a'
some_err = corrector.perform_errors(some_s, 1)
print(len(some_s.split()), len(some_err.split()))
some_err

21 21


'ac xae vh ajaaaa k vga aaaj ap lao yof c apn aauaa n ta ozfn yj ar aq ta qd'

In [8]:
corpus_with_errors = []
for sentence in corpus:
    corpus_with_errors.append(corrector.perform_errors(sentence, prob=0.33))
corpus_with_errors

['If you were poking around Ra i week alnd t half or so gago, yod mighth have cimec across a lirttle pol we were taking on the site to try and determine the Scariest PMovih Ever.',
 'Jasnd on other lists and suggestiond from he RT staff, we pulled together 40 of the scaryesq movies ever made and asked ydu eto vote for the one that terrified wxou bhe most.',
 'OhAs it happens, a British broadband service comparison website decidedu to conduct a science expeaiyent to determioe uhe same thing, and their results were… esurprisinga, to aay the least.',
 'Did Rotten Tomatoes zrvaders vgree with the findings?',
 'Rfead on uio fijnd out what oub farns deteyimined were thge 10 Scariest Horror Movies Ever.You mr-dirbctor Avrawy not agree thyast The Exorcisa is the jscoariest movie ever, nut it probably also isn’t mgsuch ofy a surprise to wvee it at tjhe top obf our list — with a whopiing 19% of all thwe vontjes cast.',
 'William Friedkin’dss adxptation ofdr the eponymous nnvel about a demon-poss

In [9]:
corrector.context_fix(corpus_with_errors)

['If you were poking around Ra i week and t half or so ago, you might have come across a little pol we were taking on the site to try and determine the Scariest Movie Ever.',
 'Based on other lists and suggestions from he Rt staff, we pulled together 40 of the scariest movies ever made and asked you to vote for the one that terrified whod she most.',
 'Has it happens, a British broadband service comparison website decided to conduct a science experiment to determine the same thing, and their results were… surprising, to say the least.',
 'Did Rotten Tomatoes traders agree with the findings?',
 'Read on io fiend out what our fans determined were the 10 Scariest Horror Movies Ever.You mr-director Array not agree that The Exorcist is the scariest movie ever, nut it probably also isn’t much of a surprise to weve it at the top of our list — with a whopping 19% of all the votes cast.',
 'William Friedkin’dss adaptation fdr the eponymous navel about a demon-possessed child and the attempts to

In [10]:
corrector.compare_methods(corpus, corpus_with_errors)

Norvig's corrector accuracy:	0.8575380359612724
Context corrector accuracy:	0.8450899031811895


In [11]:
l = [0.1*i for i in range(1, 11)]
for i in range (10):
    print(f"for weight for edit2 {l[i]}")
    corrector_with_weights.weight2 = l[i]
    corrector_with_weights.compare_methods(corpus, corpus_with_errors)
    print()

for weight for edit2 0.1
Norvig's corrector accuracy:	0.8575380359612724
Context corrector accuracy:	0.7952973720608575

for weight for edit2 0.2
Norvig's corrector accuracy:	0.8575380359612724
Context corrector accuracy:	0.7980636237897649

for weight for edit2 0.30000000000000004
Norvig's corrector accuracy:	0.8575380359612724
Context corrector accuracy:	0.7980636237897649

for weight for edit2 0.4
Norvig's corrector accuracy:	0.8575380359612724
Context corrector accuracy:	0.7994467496542186

for weight for edit2 0.5
Norvig's corrector accuracy:	0.8575380359612724
Context corrector accuracy:	0.7994467496542186

for weight for edit2 0.6000000000000001
Norvig's corrector accuracy:	0.8575380359612724
Context corrector accuracy:	0.7994467496542186

for weight for edit2 0.7000000000000001
Norvig's corrector accuracy:	0.8575380359612724
Context corrector accuracy:	0.7994467496542186

for weight for edit2 0.8
Norvig's corrector accuracy:	0.8575380359612724
Context corrector accuracy:	0.7994

# Some examples

In [27]:
corrector.fix_string("Ara-ara aunmt likes littl bous.", show=True)

['Ara', 'ara', 'aunmt', 'likes', 'littl', 'bous'] ['ara', 'ara', 'aunt', 'likes', 'little', 'boys']


'Ara-ara aunt likes little boys.'

In [15]:
corrector.fix_string("I ws bor on Novber 25th", show=True)

['I', 'ws', 'bor', 'on', 'Novber', '25th'] ['i', 'was', 'born', 'on', 'november', '25th']


'I was born on November 25th'

In [30]:
corrector.fix_string("Shhut uip!", show=True)

['Shhut', 'uip'] ['shut', 'up']


'Shut up!'