# Context-sensitive Spelling Correction

The goal of the assignment is to implement context-sensitive spelling correction. The input of the code will be a set of text lines and the output will be the same lines with spelling mistakes fixed.

Submit the solution of the assignment to Moodle as a link to your GitHub repository containing this notebook.

Useful links:
- [Norvig's solution](https://norvig.com/spell-correct.html)
- [Norvig's dataset](https://norvig.com/big.txt)
- [Ngrams data](https://www.ngrams.info/download_coca.asp)

Grading:
- 60 points - Implement spelling correction
- 20 points - Justify your decisions
- 20 points - Evaluate on a test set


## Implement context-sensitive spelling correction

Your task is to implement a context-sensitive spelling corrector using an N-gram language model. The idea is to compute the conditional probabilities of the candidate corrections. For example, the phrase "dking sport" should be corrected to "doing sport", not "dying sport", while "dking species" should be corrected to "dying species".
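
As a rough illustration of the idea, the sketch below scores each candidate by its joint bigram probability with the following word, P(candidate) * P(next word | candidate). All counts are invented toy values, not taken from any corpus, and the helper names are hypothetical.

```python
from collections import Counter

# Toy counts, invented for illustration only (not from a real corpus).
unigram_counts = Counter({"doing": 1000, "dying": 200})
bigram_counts = Counter({
    ("doing", "sport"): 50,
    ("dying", "sport"): 1,
    ("doing", "species"): 2,
    ("dying", "species"): 40,
})

def conditional_probability(next_word, word):
    """P(next_word | word), estimated as count(word, next_word) / count(word)."""
    if unigram_counts[word] == 0:
        return 0.0
    return bigram_counts[(word, next_word)] / unigram_counts[word]

def best_candidate(candidates, next_word):
    """Pick the candidate c that maximizes P(c) * P(next_word | c)."""
    total = sum(unigram_counts.values())

    def score(c):
        return (unigram_counts[c] / total) * conditional_probability(next_word, c)

    return max(candidates, key=score)

print(best_candidate(["doing", "dying"], "sport"))    # doing
print(best_candidate(["doing", "dying"], "species"))  # dying
```

With these toy counts the same misspelling resolves differently depending on its neighbour, which is the behaviour a context-free corrector cannot provide.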

The best way to start is to analyze [Norvig's solution](https://norvig.com/spell-correct.html) and [N-gram Language Models](https://web.stanford.edu/~jurafsky/slp3/3.pdf).

You may also want to implement:
- spell-checking for a specific language (Russian, Tatar, or any other language you know) so that the solution accounts for language specifics,
- an approach from a recent (or not so recent) paper on this topic,
- a solution that takes keyboard layout and the associated misspellings into account,
- efficiency improvements that make the solution faster,
- any other idea of yours that improves Norvig's solution.
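
To make the keyboard-layout idea from the list above more concrete, here is a minimal sketch of a substitution cost that is lower for keys that sit next to each other. The QWERTY rows, the neighbour rule (same or adjacent row, column index differing by at most one) and the 0.5 cost are simplifying assumptions chosen for illustration; a real layout is staggered and would need a more careful model.

```python
# Rows of a QWERTY keyboard. The layout and the neighbour rule are assumptions.
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

def build_neighbours(rows):
    """Map each key to the keys whose row and column index both differ by at most 1."""
    position = {ch: (r, c) for r, row in enumerate(rows) for c, ch in enumerate(row)}
    neighbours = {ch: set() for ch in position}
    for a, (ra, ca) in position.items():
        for b, (rb, cb) in position.items():
            if a != b and abs(ra - rb) <= 1 and abs(ca - cb) <= 1:
                neighbours[a].add(b)
    return neighbours

NEIGHBOURS = build_neighbours(QWERTY_ROWS)

def substitution_cost(typed, intended):
    """Hypothetical cost: 0 for the same key, 0.5 for neighbouring keys, 1 otherwise."""
    if typed == intended:
        return 0.0
    return 0.5 if intended in NEIGHBOURS.get(typed, set()) else 1.0

# In "dking", the typed "k" is next to "o" but not next to "y".
print(substitution_cost("k", "o"))  # 0.5
print(substitution_cost("k", "y"))  # 1.0
```

Such a cost could be combined with the language-model score so that, all else being equal, "doing" is preferred over "dying" for the typo "dking".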

IMPORTANT:  
Your project should not be a mere code copy-paste from somewhere. You must provide:
- Your implementation
- Analysis of why the implemented approach is suggested
- Improvements of the original approach that you have chosen to implement

In [4]:
import re
from collections import Counter
import pandas as pd
import numpy as np

def get_words(text): return re.findall(r'\w+', text.lower())

# Read the corpus with a context manager so the file handle is closed.
with open('big.txt') as f:
    WORDS = Counter(get_words(f.read()))

def P(word, N=sum(WORDS.values())): 
    "Probability of `word`."
    return WORDS[word] / N

def known(words): 
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word): 
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))


def candidates(word): 
    "Generate possible spelling corrections for word."
    return (known([word]).union(known(edits1(word))) or known(edits2(word)) or [word])

def correction(word): 
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)
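
To get a feel for the size of the candidate space produced by `edits1`: a word of length n has n deletions, n-1 transpositions, 26n replacements and 26(n+1) insertions, so 54n+25 strings before duplicates are removed. The sketch below re-implements the same four edit types in a standalone helper (the name `edit_counts` is hypothetical) and counts them.

```python
def edit_counts(word, letters="abcdefghijklmnopqrstuvwxyz"):
    """Return (raw_total, unique_total) of strings one edit away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    all_edits = deletes + transposes + replaces + inserts
    return len(all_edits), len(set(all_edits))

raw, unique = edit_counts("somthing")
print(raw, unique)  # 457 442
```

Applying the edits twice, as `edits2` does, multiplies this space, which is why `candidates` only falls back to two edits when no known word is found at distance one.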

In [5]:
from collections import defaultdict

bigrams = defaultdict(int)
trigrams = defaultdict(int)
fourgrams = defaultdict(int)
fivegrams = defaultdict(int)

In [6]:
with open('w2_.txt', 'r') as file:
    lines = file.read().splitlines()
    for line in lines:
        line = line.strip().split('\t')
        
        frequency = int(line[0])
        
        bigram = tuple(line[1:])
        
        bigrams[bigram] += frequency

In [7]:
with open('w5_.txt', 'r') as file:
    lines = file.read().splitlines()
    for line in lines:
        line = line.strip().split('\t')
        
        frequency = int(line[0])
        
        fivegram = tuple(line[1:])
        fourgram = [fivegram[:4], fivegram[1:]]
        trigram = [fivegram[:3], fivegram[1:4], fivegram[2:]]
        
        if fivegram not in fivegrams:
            fivegrams[fivegram] = frequency
            
        for t in trigram:
            trigrams[t] += frequency
            
        for t in fourgram:
            fourgrams[t] += frequency
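
The cell above derives trigram and four-gram counts by sliding a window over each five-gram. A small generic helper makes that windowing explicit; this is only a sketch and the name `sub_ngrams` is hypothetical. Counts obtained this way may differ from true corpus counts, because a lower-order n-gram is counted once for every five-gram in the file that contains it.

```python
def sub_ngrams(ngram, n):
    """All contiguous length-`n` windows of the tuple `ngram`, left to right."""
    return [ngram[i:i + n] for i in range(len(ngram) - n + 1)]

fivegram = ("a", "babe", "in", "the", "woods")
print(sub_ngrams(fivegram, 4))  # two four-grams
print(sub_ngrams(fivegram, 3))  # three trigrams
```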

# Rewrite

In [8]:
# N-gram tables indexed by the number of preceding context words they cover.
NGRAM_TABLES = {1: bigrams, 2: trigrams, 3: fourgrams, 4: fivegrams}

def find_word(word, context):
    "Return (context_length, frequency) for every known n-gram that ends in `word`."
    context = context[-4:]

    freq_context = []
    for i, table in NGRAM_TABLES.items():
        # The last i context words followed by the candidate word.
        subsentence = (*context[-i:], word)
        if subsentence in table:
            freq_context.append((i, table[subsentence]))

    return freq_context

In [9]:
find_word("woods", ["a", "babe", "in", "the"])

[(1, 8805), (2, 630), (3, 16), (4, 16)]

In [12]:
def find_max_context(contexts):
    max_len = 0
    max_freq = 0
    idx = -1
    for i in range(len(contexts)):
        if len(contexts[i]) > max_len:
            max_len = len(contexts[i])
            max_freq = sum(c for _, c in contexts[i])
            idx = i
        elif len(contexts[i]) == max_len:
            context_freq = sum(c for _, c in contexts[i])
            if context_freq > max_freq:
                max_freq = context_freq
                idx = i
    return idx
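
The selection rule in `find_max_context` prefers the candidate that matched the most n-gram orders and breaks ties by the summed frequency. The same rule can be written as a single `max` over a tuple key, as in the sketch below. The name `best_index` and the toy frequencies are invented for illustration, and the sketch assumes every entry is a non-empty list, which is what `correct` passes in.

```python
def best_index(contexts):
    """Index of the entry with the most matches; ties go to the larger frequency sum."""
    if not contexts:
        return -1
    return max(
        range(len(contexts)),
        key=lambda i: (len(contexts[i]), sum(freq for _, freq in contexts[i])),
    )

# Toy data: each inner list holds (context_length, frequency) pairs for one candidate.
toy_contexts = [
    [(1, 8805)],            # one match, very frequent
    [(1, 120), (2, 7)],     # two matches, sum 127
    [(1, 300), (2, 2)],     # two matches, sum 302
]
print(best_index(toy_contexts))  # 2
```

Note that the first candidate loses despite its large bigram count, because the number of matched orders is compared before the frequencies.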

In [13]:
def correct(word, context):
    candids = list(candidates(word))
    
    new_context = []
    non_empty = []
    for i, cand in enumerate(candids):
        new_word = find_word(cand, context)
        if new_word != []:
            new_context.append(new_word)
            non_empty.append(i)
    
    idx = find_max_context(new_context)
    
    if non_empty != []:
        correct_word = candids[non_empty[idx]]
    else:
        correct_word = correction(word)
    return correct_word

In [14]:
test_text_with_mistakes = \
"""
I red the book yesturday and it was really interessing.
He's alwais late for meetings, it's so frustraiting.
She baught a new dress for the occassion, it looks amasing on her.
We had a deliscious dinner last night, with lot's of different dishes.
Their cat is so cuite, with it's fluffy fur and big green eyes.
I can't belive how bueatiful the sunset was yesterday.
He's been working so hard latley, I hope he gets some rest soon.
She's such a gr8 friend, always there when I need her.
I herd the news about the accident, it was truely shoking.
We're going too the beach this weakend, I can't wait to relax in the sun.
"""

test_text_corrected = \
"""
I read the book yesterday and it was really interesting.
He's always late for meetings, it's so frustrating.
She bought a new dress for the occasion, it looks amazing on her.
We had a delicious dinner last night, with lots of different dishes.
Their cat is so cute, with its fluffy fur and big green eyes.
I can't believe how beautiful the sunset was yesterday.
He's been working so hard lately, I hope he gets some rest soon.
She's such a great friend, always there when I need her.
I heard the news about the accident, it was truly shocking.
We're going to the beach this weekend, I can't wait to relax in the sun.
"""

In [28]:
for line in test_text_with_mistakes.splitlines():
    words = get_words(line)
    for i in range(len(words)):
        if words[i] not in WORDS:
            incorrect_word = words[i]
            corrected_word = correct(words[i], words[:i])
            words[i] = corrected_word
            print(f"{incorrect_word} -> {corrected_word}")

yesturday -> yesterday
interessing -> interesting
alwais -> always
frustraiting -> frustraiting
baught -> bought
occassion -> occasion
amasing -> amazing
deliscious -> delicious
cuite -> quite
belive -> believe
bueatiful -> beautiful
latley -> lately
gr8 -> grm
truely -> truly
shoking -> showing
weakend -> weekend


In [29]:
def correct_mistakes(text):
    "Return `text` with every out-of-vocabulary word replaced by its correction."
    corrected_lines = []
    for line in text.splitlines():
        words = get_words(line)
        for i in range(len(words)):
            if words[i] not in WORDS:
                # The preceding (already corrected) words serve as context.
                words[i] = correct(words[i], words[:i])
        # get_words lowercases and drops punctuation, so the result is normalized text.
        corrected_lines.append(" ".join(words))
    return "\n".join(corrected_lines)

In [30]:
print(correct_mistakes(test_text_with_mistakes))



## Justify your decisions

Write down justifications for your implementation choices. For example, these choices could be:
- Which ngram dataset to use
- Which weights to assign for edit1, edit2 or absent words probabilities
- Beam search parameters
- etc.
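
One way to approach the weighting question from the list above is a "stupid backoff" style score: use the highest-order n-gram that was observed and multiply by a fixed penalty for every order that had to be dropped. This is not what the `correct` function in this notebook does; the function name, the 0.4 penalty and the toy counts below are assumptions made for illustration.

```python
from collections import Counter

def stupid_backoff(word, context, counts, total_unigrams, alpha=0.4):
    """Score `word` after `context` (tuple of preceding words), backing off to shorter contexts."""
    penalty = 1.0
    context = tuple(context)
    while context:
        ngram = context + (word,)
        if counts[ngram] > 0 and counts[context] > 0:
            return penalty * counts[ngram] / counts[context]
        context = context[1:]  # drop the earliest word and back off one order
        penalty *= alpha
    return penalty * counts[(word,)] / total_unigrams

# Toy counts keyed by word tuples of any length (invented values).
toy_counts = Counter({
    ("in",): 50, ("the",): 100, ("woods",): 10, ("words",): 30,
    ("in", "the"): 20, ("the", "woods"): 8,
    ("in", "the", "woods"): 5,
})
toy_total = 190  # sum of the toy unigram counts

print(stupid_backoff("woods", ("in", "the"), toy_counts, toy_total))  # 0.25
print(stupid_backoff("words", ("in", "the"), toy_counts, toy_total))  # backs off twice
```

Here "woods" is scored from the trigram directly, while "words" is only seen as a unigram and is penalized twice, so it ends up with a much smaller score even though its unigram count is higher.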

*Your text here...*

## Evaluate on a test set

Your task is to generate a test set and evaluate your work. You may vary the noise probability to generate datasets of varying complexity. Compare your solution to Norvig's corrector and report the accuracies.
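
A noise generator with an adjustable noise probability could look like the sketch below. Each word is corrupted by one random edit with probability `noise_prob`, and a seeded `random.Random` instance keeps the dataset reproducible. The function name, its parameters and the choice of edit operations are assumptions, and the sketch expects non-empty words such as those returned by `get_words`.

```python
import random

def add_noise(words, noise_prob, rng, letters="abcdefghijklmnopqrstuvwxyz"):
    """Return a copy of `words`; each word gets one random edit with probability `noise_prob`."""
    noisy = []
    for word in words:
        if rng.random() < noise_prob:
            pos = rng.randrange(len(word))
            op = rng.choice(["delete", "replace", "insert", "transpose"])
            if op == "delete" and len(word) > 1:
                word = word[:pos] + word[pos + 1:]
            elif op == "transpose" and pos < len(word) - 1:
                word = word[:pos] + word[pos + 1] + word[pos] + word[pos + 2:]
            elif op == "insert":
                word = word[:pos] + rng.choice(letters) + word[pos:]
            else:
                # Replacement; also the fallback when delete/transpose is not applicable.
                # The random letter may equal the original, leaving the word unchanged.
                word = word[:pos] + rng.choice(letters) + word[pos + 1:]
        noisy.append(word)
    return noisy

sentence = ["i", "love", "football"]
print(add_noise(sentence, 0.0, random.Random(0)))  # unchanged copy
print(add_noise(sentence, 0.5, random.Random(0)))  # some words may be corrupted
```

Raising `noise_prob` yields harder datasets, and reusing the same seed reproduces the same corruptions.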

In [18]:
with open('imdb_labelled.txt', 'r') as f:
    test = [line.split('\t')[0].strip() for line in f][:50]

In [32]:
import random


def make_mistake(words):
    "Replace the last word of `words` (in place) with a random string one or two edits away."
    choice = random.randint(0, 1)
    if choice == 0:
        incorrect_words = edits1(words[-1])
    else:
        incorrect_words = edits2(words[-1])
    incorrect_word = random.choice(list(incorrect_words))
    words[-1] = incorrect_word
    incorrect_sentence = " ".join(words)
    return incorrect_sentence, incorrect_word

In [35]:
make_mistake(get_words("I love football"))

('i love footballt', 'footballt')

### Test my solution

In [40]:
count = 0
for sentence in test:
    words = get_words(sentence)
    correct_word = words[-1]
    incorrect_sentence, incorrect_word = make_mistake(words)
    
    corrected_word = correct(incorrect_word, words[:-1])
    if corrected_word == correct_word:
        count += 1

print(count / len(test))

0.72


### Test Norvig's solution

In [41]:
count = 0
for sentence in test:
    words = get_words(sentence)
    correct_word = words[-1]
    incorrect_sentence, incorrect_word = make_mistake(words)
    
    corrected_word = correction(incorrect_word)
    if corrected_word == correct_word:
        count += 1

print(count / len(test))

0.62
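
The two accuracy figures above come from separate, unseeded calls to `make_mistake`, so each corrector was tested on a different random set of errors. A paired comparison generates the noisy examples once and passes the same examples to both correctors. The sketch below shows the shape of such an evaluation; `evaluate` and the toy correctors are hypothetical stand-ins so that the block runs on its own.

```python
def evaluate(examples, correctors):
    """Accuracy of each corrector on the same list of examples.

    examples:   list of (context_words, noisy_word, true_word) triples
    correctors: dict mapping a name to a function(noisy_word, context_words) -> word
    """
    return {
        name: sum(fix(noisy, context) == truth
                  for context, noisy, truth in examples) / len(examples)
        for name, fix in correctors.items()
    }

# Toy examples and toy correctors, invented for illustration.
examples = [
    (["i", "love"], "footbal", "football"),
    (["a", "dying"], "speces", "species"),
]
toy_fixes = {"footbal": "football", "speces": "species"}

def toy_lookup(word, context):
    return toy_fixes.get(word, word)

def toy_identity(word, context):
    return word

print(evaluate(examples, {"lookup": toy_lookup, "identity": toy_identity}))
```

In this notebook the dictionary of correctors would presumably be something like `{"ngram": correct, "norvig": lambda w, ctx: correction(w)}`, with `examples` built once from the test sentences after calling `random.seed` with a fixed value.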
