# Context-sensitive Spelling Correction

The goal of the assignment is to implement context-sensitive spelling correction. The input of the code will be a set of text lines and the output will be the same lines with spelling mistakes fixed.

Submit the solution of the assignment to Moodle as a link to your GitHub repository containing this notebook.

Useful links:
- [Norvig's solution](https://norvig.com/spell-correct.html)
- [Norvig's dataset](https://norvig.com/big.txt)
- [Ngrams data](https://www.ngrams.info/download_coca.asp)

Grading:
- 60 points - Implement spelling correction
- 20 points - Justify your decisions
- 20 points - Evaluate on a test set


## Implement context-sensitive spelling correction

Your task is to implement a context-sensitive spelling corrector using an N-gram language model. The idea is to compute the conditional probabilities of the possible corrections given the surrounding words. For example, the phrase "dking sport" should be corrected to "doing sport", not "dying sport", while "dking species" should be corrected to "dying species".
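
A minimal sketch of that idea, assuming made-up bigram counts: the numbers in `TOY_BIGRAMS` and the helper name `pick_by_context` are illustrative only and do not come from a real corpus. Each candidate is scored by how often it is followed by the neighbouring word.

```python
# Toy bigram counts; the numbers are invented for illustration only.
TOY_BIGRAMS = {
    ("doing", "sport"): 50,
    ("dying", "sport"): 1,
    ("doing", "species"): 1,
    ("dying", "species"): 40,
}

def pick_by_context(candidates, next_word, bigrams=TOY_BIGRAMS):
    """Return the candidate that most often precedes `next_word`."""
    # Unseen pairs get a count of 0, so any observed pair beats them.
    # sorted() makes tie-breaking deterministic.
    return max(sorted(candidates), key=lambda c: bigrams.get((c, next_word), 0))

print(pick_by_context({"doing", "dying"}, "sport"))
print(pick_by_context({"doing", "dying"}, "species"))
```

A real corrector would replace the raw counts with smoothed conditional probabilities and would usually look at the left context as well.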

The best way to start is to analyze [Norvig's solution](https://norvig.com/spell-correct.html) and [N-gram Language Models](https://web.stanford.edu/~jurafsky/slp3/3.pdf).

When solving this task, we expect that you will run into (and successfully deal with) some problems, or come up with ideas for improving the model. Some of them are:

- storing n-gram frequencies for a large corpus;
- taking the keyboard layout and the associated misspellings into account;
- improving efficiency to make the solution faster;
- ...

Please don't forget to describe such cases, and what you decided to do with them, in the Justification section.
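
As one possible way to approach the keyboard-layout point, the sketch below weights a substitution higher when the two keys are adjacent on a QWERTY keyboard. The grid approximation (row stagger ignored), the function names, and the `near`/`far` weights are all assumptions made for illustration.

```python
# Rows of a QWERTY keyboard; the stagger between rows is ignored (a simplification).
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

def build_neighbors(rows=QWERTY_ROWS):
    """Map each letter to the letters at most one row and one column away."""
    pos = {ch: (r, c) for r, row in enumerate(rows) for c, ch in enumerate(row)}
    neighbors = {ch: set() for ch in pos}
    for a, (ra, ca) in pos.items():
        for b, (rb, cb) in pos.items():
            if a != b and abs(ra - rb) <= 1 and abs(ca - cb) <= 1:
                neighbors[a].add(b)
    return neighbors

NEIGHBORS = build_neighbors()

def substitution_weight(typed, intended, near=1.0, far=0.3):
    """Hypothetical weight for having typed `typed` instead of `intended`."""
    return near if intended in NEIGHBORS.get(typed, set()) else far

print(substitution_weight("k", "o"))  # 'o' is next to 'k' on the grid
print(substitution_weight("k", "y"))  # 'y' is not
```

Such a weight could be multiplied into the candidate score, so that for "dking" the candidate "doing" (k typed for o) gets a small boost over "dying" (k typed for y) before the context is even considered.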

##### IMPORTANT:  
Your project should not be a mere code copy-paste from somewhere. You must provide:
- Your implementation
- Analysis of why the implemented approach is suggested
- Improvements of the original approach that you have chosen to implement

In [None]:
# Notes

# bigrams.txt
# Frequency | Word 1 | Word 2
#
# 275  a    a
# 29   a    all

# coca_all_links
# Frequency | Word 1 | Word 2 | Part of speech 1 | Part of speech 2
#
# 36  a-National  Rank    jj   nn1
# 92  abandoned   building  jj   nn1

# fivegrams.txt
# Frequency | Word 1 | Word 2 | Word 3 | Word 4 | Word 5
#
# 16  a    babe    in    the    woods
# 6   a    baby    at    her    breast


In [39]:
# Test bigrams
import nltk
from nltk.corpus import words  # needed by generate_candidates below

nltk.download('words', quiet=True)
nltk.download('reuters', quiet=True)
nltk.download('punkt_tab', quiet=True)

# Build the dictionary once; rebuilding the set for every word is slow
word_list = set(words.words())


def generate_candidates(word):
    """Return dictionary words within one delete, substitute, or insert of `word`."""
    abc = 'abcdefghijklmnopqrstuvwxyz'
    edits = set()

    # Delete a char
    for i in range(len(word)):
        edits.add(word[:i] + word[i+1:])

    # Substitute a char
    for i in range(len(word)):
        for char in abc:
            edits.add(word[:i] + char + word[i+1:])

    # Insert a char
    for i in range(len(word) + 1):
        for char in abc:
            edits.add(word[:i] + char + word[i:])
    return edits & word_list


def correct_sentence_bigram(sentence, bigram_freq):
    tokens = nltk.word_tokenize(sentence.lower())
    corrected_tokens = []

    for i in range(len(tokens)):
        word = tokens[i]
        if i > 0:
            prev_word = corrected_tokens[-1]
            candidates = generate_candidates(word)
            if candidates:
                best_candidate = max(candidates, key=lambda w: bigram_freq.get((prev_word, w), 1))
                corrected_tokens.append(best_candidate)
            else:
                corrected_tokens.append(word)
        else:
            corrected_tokens.append(word)

    return ' '.join(corrected_tokens)

bigram_freq = {}
with open("bigrams (2).txt", 'r', encoding='latin-1') as f:
    for line in f:
        parts = line.strip().split('\t')
        if len(parts) == 3:
            freq, w1, w2 = parts
            bigram_freq[(w1, w2)] = int(freq)


sample_text = "I am dking sport every day"
corrected_text = correct_sentence_bigram(sample_text, bigram_freq)
print(corrected_text)


i am doing spor revery dag


In [50]:
#Test coca
import nltk
from nltk.corpus import reuters, words


nltk.download('words',quiet=True)
nltk.download('reuters',quiet=True)
nltk.download('punkt_tab',quiet=True)

word_list = set(words.words())

# Function to generate candidate corrections
def generate_candidates(word):
    abc = 'abcdefghijklmnopqrstuvwxyz'
    edits = set()

    # Delete a char
    for i in range(len(word)):
        edits.add(word[:i] + word[i+1:])

    # Substitute a char
    for i in range(len(word)):
        for char in abc:
            edits.add(word[:i] + char + word[i+1:])

    # Insert a char
    for i in range(len(word) + 1):
        for char in abc:
            edits.add(word[:i] + char + word[i:])

    # Keep only real words
    valid_edits = {w for w in edits if w in word_list}
    return valid_edits if valid_edits else {word}  # No valid edits: keep the original


def correct_sentence_coca(sentence, coca_freq):
    tokens = nltk.word_tokenize(sentence.lower())
    corrected_tokens = []
    for i in range(len(tokens)):
        word = tokens[i]
        candidates = generate_candidates(word)
        if i > 0:
            prev_word = corrected_tokens[-1]
            best_candidate = max(candidates, key=lambda w: coca_freq.get((prev_word, w), 1), default=word)
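        else:
            # No left context for the first token: every candidate gets the same key,
            # so max() returns an arbitrary one
            best_candidate = max(candidates, key=lambda w: w in word_list, default=word)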
        corrected_tokens.append(best_candidate)
    return ' '.join(corrected_tokens)


coca_freq = {}
with open("coca_all_links (2).txt", 'r', encoding='latin-1') as f:
    for line in f:
        parts = line.strip().split('\t')
        if len(parts) >= 3:
            freq, w1, w2 = parts[:3]
            coca_freq[(w1, w2)] = int(freq)

sample_text = "I am dking sport every day"
corrected_text = correct_sentence_coca(sample_text, coca_freq)
print(corrected_text)

c om eking spor revery dag


In [56]:
#Test fivegrams
import nltk
from nltk.corpus import words
from nltk.metrics import edit_distance

# Download the required data quietly
nltk.download('words', quiet=True)
nltk.download('punkt', quiet=True)

word_list = set(words.words())

def correct_word(word):
    if word in word_list:
        return word  # The word is already correct, keep it
    closest_word = min(word_list, key=lambda w: edit_distance(word, w))
    return closest_word

def correct_sentence_fivegram(sentence, fivegram_freq):
    tokens = nltk.word_tokenize(sentence)
    corrected_tokens = []

    for i, word in enumerate(tokens):
        word_lower = word.lower()
        original_case = word[0].isupper() if word else False  # Check for a leading capital letter

        # Collect the last 4 corrected words (or fewer at the start of the sentence)
        context = tuple(corrected_tokens[max(0, i - 4):i])

        if context in fivegram_freq:
            candidates = fivegram_freq[context]
            best_candidate = max(candidates, key=candidates.get)  # Pick the word with the highest frequency
        else:
            best_candidate = correct_word(word_lower)  # No such context: fall back to the dictionary
        corrected_tokens.append(best_candidate.capitalize() if original_case else best_candidate)

    return ' '.join(corrected_tokens)

fivegram_freq = {}
with open("fivegrams (2).txt", 'r', encoding='latin-1') as f:
    for line in f:
        parts = line.strip().split('\t')
        if len(parts) == 6:
            freq, *ngram = parts  # named `ngram` so the nltk.corpus `words` import is not shadowed
            freq = int(freq)
            context, target_word = tuple(ngram[:-1]), ngram[-1]  # The first 4 words are the context, the last is the target word
            if context not in fivegram_freq:
                fivegram_freq[context] = {}
            fivegram_freq[context][target_word] = freq

sample_text = "I am dking sport every day"
corrected_text = correct_sentence_fivegram(sample_text, fivegram_freq)
print(corrected_text)

I am ding sport every day


In [61]:
import nltk
from collections import Counter

nltk.download('reuters', quiet=True)
nltk.download('words', quiet=True)

word_list = nltk.corpus.reuters.words()  # Use the Reuters news corpus
word_freq = Counter(word_list)

# Save to a file
with open("unigrams.txt", "w", encoding="utf-8") as f:
    for word, freq in word_freq.most_common():
        f.write(f"{freq}\t{word.lower()}\n")


In [68]:
import nltk
from collections import Counter, defaultdict
from nltk.corpus import words

nltk.download('words', quiet=True)
nltk.download('punkt', quiet=True)

word_list = set(words.words())

# Load word frequencies for P(w)
def load_unigrams(filename):
    word_freq = defaultdict(int)
    total_count = 0
    with open(filename, 'r', encoding='latin-1') as f:
        for line in f:
            parts = line.strip().split('\t')
            if len(parts) == 2:
                freq, word = parts
                word_freq[word] = int(freq)
                total_count += int(freq)
    return word_freq, total_count

# Load bigram counts for P(w2 | w1)
def load_bigrams(filename):
    bigram_freq = defaultdict(int)
    with open(filename, 'r', encoding='latin-1') as f:
        for line in f:
            parts = line.strip().split('\t')
            if len(parts) == 3:
                freq, w1, w2 = parts
                bigram_freq[(w1, w2)] = int(freq)
    return bigram_freq

# Load the frequencies
unigram_freq, total_unigrams = load_unigrams("unigrams.txt")
bigram_freq = load_bigrams("bigrams (2).txt")

# Probability P(w)
def P(word):
    return unigram_freq[word] / total_unigrams if word in unigram_freq else 1e-6

# Probability P(w2 | w1)
def P_bigram(w1, w2):
    return bigram_freq.get((w1, w2), 1) / unigram_freq.get(w1, 1)

# Generate candidate corrections (single-character typos)
def edits1(word):
    abc = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in abc]
    inserts = [L + c + R for L, R in splits for c in abc]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    return {w for w in words if w in unigram_freq}  # Keep only words from the dictionary

# Correct a word
def correct_word(word, prev_word):
    candidates = known([word]) or known(edits1(word)) or {word}
    return max(candidates, key=lambda w: P_bigram(prev_word, w) * P(w))

# Correct a sentence
def correct_sentence(sentence):
    tokens = nltk.word_tokenize(sentence.lower())
    corrected_tokens = []

    for i, word in enumerate(tokens):
        if i == 0:
            corrected_tokens.append(word)  # The first word is left uncorrected
        else:
            corrected_tokens.append(correct_word(word, corrected_tokens[-1]))

    return ' '.join(corrected_tokens)

sample_text = "I jsut wnat to say hello."
corrected_text = correct_sentence(sample_text)
print(corrected_text)

i just what to say hell .


## Justify your decisions

Write down justifications for your implementation choices. For example, these choices could be:
- Which ngram dataset to use
- Which weights to assign for edit1, edit2 or absent words probabilities
- Beam search parameters
- etc.
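
To make the beam-search item concrete, here is a rough sketch of how a beam over per-token candidates could work. The function name `beam_search`, the `"<s>"` start marker, and the use of `log(count + 1)` as a crude stand-in for a smoothed log-probability are assumptions for illustration, not a prescribed design.

```python
import math

def beam_search(candidate_lists, bigram_count, beam_width=3):
    """Pick one word per position, keeping the `beam_width` best partial sequences.

    candidate_lists: one list of candidate words per token position.
    bigram_count: dict mapping (prev_word, word) to a count.
    """
    beams = [(0.0, [])]  # (cumulative score, words chosen so far)
    for candidates in candidate_lists:
        expanded = []
        for score, seq in beams:
            prev = seq[-1] if seq else "<s>"  # hypothetical sentence-start marker
            for cand in candidates:
                # log(count + 1) keeps unseen bigrams finite (score 0)
                step = math.log(bigram_count.get((prev, cand), 0) + 1)
                expanded.append((score + step, seq + [cand]))
        expanded.sort(key=lambda item: item[0], reverse=True)
        beams = expanded[:beam_width]
    return beams[0][1]

# Toy counts, invented for illustration
toy_counts = {("am", "doing"): 30, ("doing", "sport"): 50,
              ("am", "dying"): 20, ("dying", "sport"): 1}
print(beam_search([["am"], ["doing", "dying", "ding"], ["sport", "spot"]], toy_counts))
```

The beam width trades speed for accuracy: a width of 1 is the greedy left-to-right choice, while a larger width lets an early, locally worse candidate win if it fits the later words better.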

*Your text here...*

## Evaluate on a test set

Your task is to generate a test set and evaluate your work. You may vary the noise probability to generate datasets of varying complexity (or just take another dataset). Compare your solution to Norvig's corrector and report the accuracies.
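
One possible shape for such an evaluation is sketched below. The helper names `add_noise` and `word_accuracy`, the set of noise operations, and the position-by-position accuracy measure are assumptions chosen for illustration; `corrector` stands for any function that maps a sentence string to a corrected sentence string.

```python
import random
import string

def add_noise(word, noise_prob, rng):
    """With probability `noise_prob`, attempt one random single-character edit."""
    if len(word) < 2 or rng.random() >= noise_prob:
        return word
    i = rng.randrange(len(word))
    op = rng.choice(["delete", "replace", "insert", "transpose"])
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "replace":  # may pick the same letter and leave the word unchanged
        return word[:i] + rng.choice(string.ascii_lowercase) + word[i + 1:]
    if op == "insert":
        return word[:i] + rng.choice(string.ascii_lowercase) + word[i:]
    # transpose: swap character i with its right neighbour (left one at the end)
    j = i + 1 if i + 1 < len(word) else i - 1
    chars = list(word)
    chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)

def word_accuracy(corrector, clean_sentences, noise_prob, seed=0):
    """Share of word positions where the corrected output equals the clean word."""
    rng = random.Random(seed)  # fixed seed so runs are comparable
    correct = total = 0
    for sentence in clean_sentences:
        clean = sentence.split()
        noisy = [add_noise(w, noise_prob, rng) for w in clean]
        fixed = corrector(" ".join(noisy)).split()
        total += len(clean)
        # zip() stops at the shorter list, so missing words count as errors
        correct += sum(c == f for c, f in zip(clean, fixed))
    return correct / total if total else 0.0

# An identity "corrector" on noise-free text scores 1.0
print(word_accuracy(lambda s: s, ["i am doing sport every day"], noise_prob=0.0))
```

Running `word_accuracy` with the same sentences, seed, and `noise_prob` for both your corrector and a Norvig-style baseline gives directly comparable numbers. Lowercase, punctuation-free test sentences avoid mismatches caused by tokenization and case folding.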

In [None]:
# Your code here

#### Useful resources (also included in the archive in Moodle):

1. [Possible dataset with N-grams](https://www.ngrams.info/download_coca.asp)
2. [Damerau–Levenshtein distance](https://en.wikipedia.org/wiki/Damerau–Levenshtein_distance#:~:text=Informally%2C%20the%20Damerau–Levenshtein%20distance,one%20word%20into%20the%20other.)
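
For reference, a compact sketch of the restricted Damerau-Levenshtein distance (also called optimal string alignment), which counts insertions, deletions, substitutions, and transpositions of adjacent characters. The function name `osa_distance` is chosen here for illustration; the unrestricted Damerau-Levenshtein distance needs a slightly more involved algorithm.

```python
def osa_distance(a, b):
    """Optimal string alignment distance between strings `a` and `b`."""
    # d[i][j] is the distance between a[:i] and b[:j]
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

print(osa_distance("jsut", "just"))
print(osa_distance("dking", "doing"))
```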