# Context-sensitive Spelling Correction

The goal of the assignment is to implement context-sensitive spelling correction. The input of the code will be a set of text lines and the output will be the same lines with spelling mistakes fixed.

Submit the solution of the assignment to Moodle as a link to your GitHub repository containing this notebook.

Useful links:
- [Norvig's solution](https://norvig.com/spell-correct.html)
- [Norvig's dataset](https://norvig.com/big.txt)
- [Ngrams data](https://www.ngrams.info/download_coca.asp)

Grading:
- 60 points - Implement spelling correction
- 20 points - Justify your decisions
- 20 points - Evaluate on a test set


## Implement context-sensitive spelling correction

Your task is to implement context-sensitive spelling corrector using N-gram language model. The idea is to compute conditional probabilities of possible correction options. For example, the phrase "dking sport" should be fixed as "doing sport" not "dying sport", while "dking species" -- as "dying species".

The best way to start is to analyze [Norvig's solution](https://norvig.com/spell-correct.html) and [N-gram Language Models](https://web.stanford.edu/~jurafsky/slp3/3.pdf).

When solving this task, we expect you'll face (and successfully deal with) some problems or make up the ideas of the model improvement. Some of them are: 

- solving a problem of n-grams frequencies storing for a large corpus;
- taking into account keyboard layout and associated misspellings;
- efficiency improvement to make the solution faster;
- ...

Please don't forget to describe such cases, and what you decided to do with them, in the Justification section.

##### IMPORTANT:  
Your project should not be a mere code copy-paste from somewhere. You must provide:
- Your implementation
- Analysis of why the implemented approach is suggested
- Improvements of the original approach that you have chosen to implement

In [4]:
import re
from collections import defaultdict, Counter

def tokenize(text):
    return re.findall(r'\w+', text.lower())

def generate_candidates(word):
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    candidates = set(deletes + transposes + replaces + inserts)
    return candidates

def filter_candidates(candidates, vocabulary):
    return [c for c in candidates if c in vocabulary]

def phrase_probability(phrase, ngram_model, n):
    tokens = tokenize(phrase)
    probability = 1.0
    for i in range(n-1, len(tokens)):
        context = tuple(tokens[i-n+1:i])
        word = tokens[i]
        if context in ngram_model and word in ngram_model[context]:
            probability *= ngram_model[context][word] / sum(ngram_model[context].values())
        else:
            probability *= 0.0001
    return probability

def hybrid_correct_spelling(phrase, bigram_model, fivegram_model, vocabulary):
    tokens = tokenize(phrase)
    corrected_phrase = []
    for i, token in enumerate(tokens):
        if token in vocabulary:
            corrected_phrase.append(token)
        else:
            candidates = generate_candidates(token)
            valid_candidates = filter_candidates(candidates, vocabulary)
            if not valid_candidates:
                corrected_phrase.append(token)
                continue
            if i > 0:
                prev_word = tokens[i-1]
                bigram_candidates = [(c, bigram_model[prev_word].get(c, 0)) for c in valid_candidates]
                fivegram_candidates = [(c, phrase_probability(' '.join(tokens[:i] + [c] + tokens[i+1:]), fivegram_model, 5)) for c in valid_candidates]
                combined_candidates = [(c, (bg_score + fg_score) / 2) for (c, bg_score), (_, fg_score) in zip(bigram_candidates, fivegram_candidates)]
                best_candidate = max(combined_candidates, key=lambda x: x[1])[0]
            else:
                best_candidate = max(valid_candidates, key=lambda c: phrase_probability(' '.join(tokens[:i] + [c] + tokens[i+1:]), fivegram_model, 5))
            corrected_phrase.append(best_candidate)
    return ' '.join(corrected_phrase)

bigram_model = defaultdict(Counter)
with open('bigrams (2).txt', 'r', encoding='latin1') as f:
    for line in f:
        parts = line.strip().split('\t')
        if len(parts) == 3:
            freq, word1, word2 = parts
            bigram_model[word1][word2] = int(freq)

fivegram_model = defaultdict(Counter)
with open('fivegrams (2).txt', 'r', encoding='latin1') as f:
    for line in f:
        parts = line.strip().split('\t')
        if len(parts) >= 6:
            freq, *words = parts
            context = tuple(words[:-1])
            word = words[-1]
            fivegram_model[context][word] = int(freq)

vocabulary = set()
with open('bigrams (2).txt', 'r', encoding='latin1') as f:
    for line in f:
        parts = line.strip().split('\t')
        if len(parts) == 3:
            word1, word2 = parts[1], parts[2]
            vocabulary.add(word1)
            vocabulary.add(word2)

with open('fivegrams (2).txt', 'r', encoding='latin1') as f:
    for line in f:
        parts = line.strip().split('\t')
        if len(parts) >= 6:
            *words, _ = parts[1:]
            vocabulary.update(words)

misspelled_phrase = "dking sport"
corrected_phrase = hybrid_correct_spelling(misspelled_phrase, bigram_model, fivegram_model, vocabulary)
print(f"Corrected phrase: {corrected_phrase}")

Corrected phrase: ding sport


## Justify your decisions

Write down justificaitons for your implementation choices. For example, these choices could be:
- Which ngram dataset to use
- Which weights to assign for edit1, edit2 or absent words probabilities
- Beam search parameters
- etc.

Why Bigrams and Fivegrams?

Bigrams: Bigrams capture local context, which is useful for correcting misspelled words based on the previous word. For example, in the phrase "dking sport", the bigram model helps prioritize "doing sport" over "dying sport" because "doing sport" is more likely in the corpus.
Fivegrams: Fivegrams capture longer-range dependencies, which are useful for correcting phrases with more complex context. For example, in the phrase "dking species", the fivegram model helps prioritize "dying species" over "doing species" because "dying species" is more likely in the corpus.

Why Use a Hybrid Approach?

The hybrid approach combines the strengths of both bigram and fivegram models:
Bigram Model: Provides local context (e.g., the previous word).
Fivegram Model: Provides longer-range context (e.g., the previous four words).
By combining the probabilities from both models, the system can make more accurate corrections in a variety of contexts.

Challenge: Context Sensitivity

The model must consider both local and global context to make accurate corrections.
Solution: Use a hybrid approach that combines bigram and fivegram models

## Evaluate on a test set

Your task is to generate a test set and evaluate your work. You may vary the noise probability to generate different datasets with varying compexity (or just take another dataset). Compare your solution to the Norvig's corrector, and report the accuracies.

In [5]:
import re
import random
from collections import defaultdict, Counter

def introduce_typos(word, num_typos=1):
    letters = 'abcdefghijklmnopqrstuvwxyz'
    for _ in range(num_typos):
        typo_type = random.choice(['delete', 'replace', 'swap'])
        pos = random.randint(0, len(word) - 1)
        if typo_type == 'delete' and len(word) > 1:
            word = word[:pos] + word[pos+1:]
        elif typo_type == 'replace':
            word = word[:pos] + random.choice(letters) + word[pos+1:]
        elif typo_type == 'swap' and len(word) > 1:
            if pos == len(word) - 1:
                pos -= 1
            word = word[:pos] + word[pos+1] + word[pos] + word[pos+2:]
    return word

def extract_phrases(file_path, encoding='latin1'):
    phrases = []
    with open(file_path, 'r', encoding=encoding) as f:
        for line in f:
            parts = line.strip().split('\t')
            if len(parts) >= 2:
                phrase = ' '.join(parts[1:])
                phrases.append(phrase)
    return phrases

def generate_misspelled_phrases(phrases, num_typos=1):
    test_cases = []
    for phrase in phrases:
        words = phrase.split()
        misspelled_words = [introduce_typos(word, num_typos) for word in words]
        misspelled_phrase = ' '.join(misspelled_words)
        test_cases.append((misspelled_phrase, phrase))
    return test_cases

def save_test_data(test_cases, file_path):
    with open(file_path, 'w', encoding='utf-8') as f:
        for misspelled, correct in test_cases:
            f.write(f"{misspelled}\t{correct}\n")

bigram_file = 'bigrams (2).txt'
coca_file = 'coca_all_links (2).txt'
fivegram_file = 'fivegrams (2).txt'
output_file = 'test_data.txt'
bigram_phrases = extract_phrases(bigram_file)
coca_phrases = extract_phrases(coca_file)
fivegram_phrases = extract_phrases(fivegram_file)
all_phrases = bigram_phrases + coca_phrases + fivegram_phrases
print(f"Total phrases extracted: {len(all_phrases)}")
test_cases = generate_misspelled_phrases(all_phrases, num_typos=1)
save_test_data(test_cases, output_file)
print(f"Test dataset saved to {output_file}")

Total phrases extracted: 2220962
Test dataset saved to test_data.txt


In [6]:
def load_test_data(file_path):
    test_cases = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            misspelled, correct = line.strip().split('\t')
            test_cases.append((misspelled, correct))
    return test_cases

def evaluate_model(test_cases, correct_spelling, bigram_model, fivegram_model, vocabulary):
    correct_count = 0
    for misspelled, correct in test_cases:
        corrected_phrase = correct_spelling(misspelled, bigram_model, fivegram_model, vocabulary)
        if corrected_phrase == correct:
            correct_count += 1
    accuracy = correct_count / len(test_cases)
    print(f"Accuracy: {accuracy * 100:.2f}%")

test_cases = load_test_data('test_data.txt')
evaluate_model(test_cases, correct_spelling, bigram_model, fivegram_model, vocabulary)

Accuracy: 8.02%


In [None]:
evaluate_model(test_cases, correct_spelling, bigram_model, fivegram_model, vocabulary)

#### Useful resources (also included in the archive in moodle):

1. [Possible dataset with N-grams](https://www.ngrams.info/download_coca.asp)
2. [Damerau–Levenshtein distance](https://en.wikipedia.org/wiki/Damerau–Levenshtein_distance#:~:text=Informally%2C%20the%20Damerau–Levenshtein%20distance,one%20word%20into%20the%20other.)