# Context-sensitive Spelling Correction

The goal of the assignment is to implement context-sensitive spelling correction. The input of the code will be a set of text lines and the output will be the same lines with spelling mistakes fixed.

Submit the solution of the assignment to Moodle as a link to your GitHub repository containing this notebook.

Useful links:
- [Norvig's solution](https://norvig.com/spell-correct.html)
- [Norvig's dataset](https://norvig.com/big.txt)
- [Ngrams data](https://www.ngrams.info/download_coca.asp)

Grading:
- 60 points - Implement spelling correction
- 20 points - Justify your decisions
- 20 points - Evaluate on a test set


## Implement context-sensitive spelling correction

Your task is to implement a context-sensitive spelling corrector using an N-gram language model. The idea is to compute the conditional probabilities of the possible corrections. For example, the phrase "dking sport" should be corrected to "doing sport", not "dying sport", while "dking species" should become "dying species".
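
As a minimal sketch of that idea, the snippet below picks between two candidates by the smoothed conditional probability of the following word. The counts are invented for illustration only, and `cond_prob` and `pick` are hypothetical helper names, not part of any library.

```python
from collections import Counter

# Toy counts, invented purely for illustration (not taken from any corpus).
unigram_counts = Counter({"doing": 50, "dying": 20, "sport": 30, "species": 10})
bigram_counts = Counter({("doing", "sport"): 8, ("dying", "species"): 5})
VOCAB_SIZE = len(unigram_counts)


def cond_prob(word, next_word):
    """P(next_word | word) with add-one smoothing, so unseen pairs stay non-zero."""
    return (bigram_counts[(word, next_word)] + 1) / (unigram_counts[word] + VOCAB_SIZE)


def pick(candidates, next_word):
    """Choose the candidate that makes the following word most likely."""
    return max(candidates, key=lambda c: cond_prob(c, next_word))


print(pick(["doing", "dying"], "sport"))
print(pick(["doing", "dying"], "species"))
```

With these toy counts the first call prefers "doing" and the second prefers "dying", because only the bigram evidence differs between the two contexts.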

The best way to start is to analyze [Norvig's solution](https://norvig.com/spell-correct.html) and [N-gram Language Models](https://web.stanford.edu/~jurafsky/slp3/3.pdf).

When solving this task, we expect you will face (and successfully deal with) some problems, or come up with ideas for improving the model. Some of them are:

- storing n-gram frequencies for a large corpus;
- taking the keyboard layout and the associated misspellings into account;
- improving efficiency to make the solution faster;
- ...

Please don't forget to describe such cases, and what you decided to do with them, in the Justification section.
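
One possible way to approach the storage problem, sketched under assumptions: index each n-gram by its context and drop n-grams seen fewer than `min_count` times. The function name `build_ngram_index` and the threshold are hypothetical choices for illustration.

```python
from collections import Counter, defaultdict


def build_ngram_index(tokens, n, min_count=2):
    """Map each (n-1)-word context to a Counter of next words, dropping rare n-grams."""
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    index = defaultdict(Counter)
    for ngram, count in counts.items():
        if count >= min_count:  # pruning keeps the table small on large corpora
            index[ngram[:-1]][ngram[-1]] = count
    return index


tokens = "the cat sat on the mat the cat sat on the sofa".split()
index = build_ngram_index(tokens, 2, min_count=2)
print(dict(index[("the",)]))
```

Looking up a context then costs one dictionary access, and all candidate next words for that context are available together.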

##### IMPORTANT:  
Your project should not be a mere code copy-paste from somewhere. You must provide:
- Your implementation
- Analysis of why you chose the implemented approach
- Improvements over the original approach that you have chosen to implement

In [None]:
# Notes

# bigrams.txt
# Frequency | Word 1 | Word 2
#
# 275  a    a
# 29   a    all

# coca_all_links
# Frequency | Word 1 | Word 2 | Part of speech 1 | Part of speech 2
#
# 36  a-National  Rank    jj   nn1
# 92  abandoned   building  jj   nn1

# fivegrams.txt
# Frequency | Word 1 | Word 2 | Word 3 | Word 4 | Word 5
#
# 16  a    babe    in    the    woods
# 6   a    baby    at    her    breast


In [57]:
# Checking how bigrams work (exploratory; safe to skip)

import nltk
from nltk.corpus import words

nltk.download('words',quiet=True)
nltk.download('reuters',quiet=True)
nltk.download('punkt_tab',quiet=True)


# Build the dictionary set once; rebuilding it on every call is very slow.
word_list = set(words.words())

def generate_candidates(word):
    abc = 'abcdefghijklmnopqrstuvwxyz'
    edits = set()

    # Deletions
    for i in range(len(word)):
        edits.add(word[:i] + word[i+1:])
    # Substitutions
    for i in range(len(word)):
        for char in abc:
            edits.add(word[:i] + char + word[i+1:])
    # Insertions
    for i in range(len(word) + 1):
        for char in abc:
            edits.add(word[:i] + char + word[i:])

    return edits & word_list

def correct_sentence_bigram(sentence, bigram_freq):
    tokens = nltk.word_tokenize(sentence.lower())
    corrected_tokens = []

    for i in range(len(tokens)):
        word = tokens[i]
        if i > 0:
            prev_word = corrected_tokens[-1]
            candidates = generate_candidates(word)
            if candidates:
                best_candidate = max(candidates, key=lambda w: bigram_freq.get((prev_word, w), 1))
                corrected_tokens.append(best_candidate)
            else:
                corrected_tokens.append(word)
        else:
            corrected_tokens.append(word)

    return ' '.join(corrected_tokens)

bigram_freq = {}
with open("bigrams (2).txt", 'r', encoding='latin-1') as f:
    for line in f:
        parts = line.strip().split('\t')
        if len(parts) == 3:
            freq, w1, w2 = parts
            bigram_freq[(w1, w2)] = int(freq)


sample_text = "I am dking sport every day"
corrected_text = correct_sentence_bigram(sample_text, bigram_freq)
print(corrected_text)


i am doing sport very day


In [58]:
# Checking how the COCA bigram file works (exploratory; safe to skip)
import nltk
from nltk.corpus import reuters, words


nltk.download('words',quiet=True)
nltk.download('reuters',quiet=True)
nltk.download('punkt_tab',quiet=True)
word_list = set(words.words())

def generate_candidates(word):
    abc = 'abcdefghijklmnopqrstuvwxyz'
    edits = set()
    for i in range(len(word)):
        edits.add(word[:i] + word[i+1:])
    for i in range(len(word)):
        for char in abc:
            edits.add(word[:i] + char + word[i+1:])
    for i in range(len(word) + 1):
        for char in abc:
            edits.add(word[:i] + char + word[i:])
    valid_edits = {w for w in edits if w in word_list}
    return valid_edits if valid_edits else {word}

def correct_sentence_coca(sentence, coca_freq):
    tokens = nltk.word_tokenize(sentence.lower())
    corrected_tokens = []
    for i in range(len(tokens)):
        word = tokens[i]
        candidates = generate_candidates(word)
        if i > 0:
            prev_word = corrected_tokens[-1]
            best_candidate = max(candidates, key=lambda w: coca_freq.get((prev_word, w), 1), default=word)
        else:
            best_candidate = max(candidates, key=lambda w: w in word_list, default=word)
        corrected_tokens.append(best_candidate)
    return ' '.join(corrected_tokens)
coca_freq = {}
with open("coca_all_links (2).txt", 'r', encoding='latin-1') as f:
    for line in f:
        parts = line.strip().split('\t')
        if len(parts) >= 3:
            freq, w1, w2 = parts[:3]
            coca_freq[(w1, w2)] = int(freq)

sample_text = "I am dking sport every day"
corrected_text = correct_sentence_coca(sample_text, coca_freq)
print(corrected_text)

pi oam dying sport very day


In [59]:
# Checking how five-grams work (exploratory; safe to skip)
import nltk
from nltk.corpus import words
from nltk.metrics import edit_distance

nltk.download('words', quiet=True)
nltk.download('punkt', quiet=True)
word_list = set(words.words())
def correct_word(word):
    if word in word_list:
        return word
    closest_word = min(word_list, key=lambda w: edit_distance(word, w))
    return closest_word
def correct_sentence_fivegram(sentence, fivegram_freq):
    tokens = nltk.word_tokenize(sentence)
    corrected_tokens = []
    for i, word in enumerate(tokens):
        word_lower = word.lower()
        original_case = word[0].isupper() if word else False

        context = tuple(corrected_tokens[max(0, i - 4):i])

        if context in fivegram_freq:
            candidates = fivegram_freq[context]
            best_candidate = max(candidates, key=candidates.get)
        else:
            best_candidate = correct_word(word_lower)
        corrected_tokens.append(best_candidate.capitalize() if original_case else best_candidate)

    return ' '.join(corrected_tokens)

fivegram_freq = {}
with open("fivegrams (2).txt", 'r', encoding='latin-1') as f:
    for line in f:
        parts = line.strip().split('\t')
        if len(parts) == 6:
            # Use a name other than `words` so the nltk.corpus import is not shadowed.
            freq, *ngram = parts
            freq = int(freq)
            context, target_word = tuple(ngram[:-1]), ngram[-1]
            if context not in fivegram_freq:
                fivegram_freq[context] = {}
            fivegram_freq[context][target_word] = freq

sample_text = "I am dking sport every day"
corrected_text = correct_sentence_fivegram(sample_text, fivegram_freq)
print(corrected_text)

I am ding sport every day


In [45]:
import re
import collections
from Levenshtein import distance as levenshtein_distance

def words(text):
    return re.findall(r'\w+', text)

def load_corpus(filename):
    with open(filename, 'r', encoding='utf-8', errors='ignore') as f:
        text = f.read()
    tokens = words(text)
    return tokens

def generate_ngrams(tokens, n):
    return [' '.join(tokens[i:i+n]) for i in range(len(tokens)-n+1)]

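def count_ngrams(tokens, n):
    # Despite the name, this returns relative frequencies (count / total n-grams),
    # keyed by the space-joined n-gram string.
    ngram_counts = collections.Counter(generate_ngrams(tokens, n))
    total = sum(ngram_counts.values())
    return {ngram: count / total for ngram, count in ngram_counts.items()}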

def save_ngrams(ngram_counts, filename):
    with open(filename, 'w', encoding='utf-8') as f:
        for ngram, freq in ngram_counts.items():
            f.write(f"{ngram}\t{freq}\n")


tokens = load_corpus('big.txt')
unigram_counts = collections.Counter(tokens)

total_tokens = sum(unigram_counts.values())
unigram_probs = {word: count / total_tokens for word, count in unigram_counts.items()}

bigram_counts = count_ngrams(tokens, 2)
trigram_counts = count_ngrams(tokens, 3)
fivegram_counts = count_ngrams(tokens, 5)


In [46]:
keyboard_adjacent = {
    'a': ['q', 'w', 's', 'z'],
    'b': ['v', 'g', 'h', 'n'],
    'c': ['x', 'd', 'f', 'v'],
    'd': ['s', 'e', 'r', 'f', 'c', 'x'],
    'e': ['w', 's', 'd', 'r'],
    'f': ['d', 'r', 't', 'g', 'v', 'c'],
    'g': ['f', 't', 'y', 'h', 'b', 'v'],
    'h': ['g', 'y', 'u', 'j', 'n', 'b'],
    'i': ['u', 'j', 'k', 'o'],
    'j': ['h', 'u', 'i', 'k', 'n', 'm'],
    'k': ['j', 'i', 'o', 'l', 'm'],
    'l': ['k', 'o', 'p'],
    'm': ['n', 'j', 'k'],
    'n': ['b', 'h', 'j', 'm'],
    'o': ['i', 'k', 'l', 'p'],
    'p': ['o', 'l'],
    'q': ['w', 'a'],
    'r': ['e', 'd', 'f', 't'],
    's': ['a', 'w', 'e', 'd', 'x', 'z'],
    't': ['r', 'f', 'g', 'y'],
    'u': ['y', 'h', 'j', 'i'],
    'v': ['c', 'f', 'g', 'b'],
    'w': ['q', 'a', 's', 'e'],
    'x': ['z', 's', 'd', 'c'],
    'y': ['t', 'g', 'h', 'u'],
    'z': ['a', 's', 'x']
}



def change1(word):
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def misclick_edits(word):
    return set(word[:i] + c + word[i+1:] for i, letter in enumerate(word) if letter in keyboard_adjacent for c in keyboard_adjacent[letter])


def change2(word):
    return set(e2 for e1 in change1(word) for e2 in change1(e1))


def known_words(words, word_dict):
    return set(w for w in words if w in word_dict)


def jaccard_similarity(word1, word2):
    set1, set2 = set(word1), set(word2)
    intersection = len(set1 & set2)
    union = len(set1 | set2)
    return intersection / union if union else 0

In [51]:
def correct_word(word, unigram_counts, bigram_counts, trigram_counts, fivegram_counts,
                 prev_word, prev_prev_word, context=()):
    # Candidate sets are tried in order; the first non-empty one is used.
    candidates = (known_words([word], unigram_counts) or
                  known_words(change1(word), unigram_counts) or
                  known_words(change2(word), unigram_counts) or
                  known_words(misclick_edits(word), unigram_counts) or
                  [word])

    def score(w):
        # Each n-gram term falls back to the next lower order when there is not
        # enough context; -10 is the penalty for an n-gram missing from the table.
        # If unigram_counts holds raw counts rather than probabilities, this term
        # is on a much larger scale than the others.
        unigram_prob = unigram_counts.get(w, -10)
        bigram_prob = bigram_counts.get(f"{prev_word} {w}", -10) if prev_word else unigram_prob
        trigram_prob = trigram_counts.get(f"{prev_prev_word} {prev_word} {w}", -10) if prev_prev_word else bigram_prob
        # A five-gram key needs the four preceding words.
        if len(context) >= 4:
            fivegram_prob = fivegram_counts.get(' '.join([*context[-4:], w]), -10)
        else:
            fivegram_prob = trigram_prob
        # Count positions where the candidate letter is a keyboard neighbour of the typed letter.
        misclick_penalty = sum(1 for i, c in enumerate(word) if i < len(w) and c in keyboard_adjacent and w[i] in keyboard_adjacent[c])
        return (unigram_prob + 2 * bigram_prob + 3 * trigram_prob + 5 * fivegram_prob
                - levenshtein_distance(word, w)
                + 2 * jaccard_similarity(word, w)
                - 0.5 * misclick_penalty)

    return max(candidates, key=score)

def correct_sentence(sentence, unigram_counts, bigram_counts, trigram_counts, fivegram_counts):
    # Split into word tokens and a small set of punctuation marks.
    words_with_punctuation = re.findall(r'\b\w+\b|[.,!?;]', sentence)
    corrected_words = []

    for i, word in enumerate(words_with_punctuation):
        if re.match(r'\w+', word):
            # Context comes from the already-corrected tokens to the left.
            prev_word = corrected_words[i-1] if i > 0 else None
            prev_prev_word = corrected_words[i-2] if i > 1 else None
            best_candidate = correct_word(word.lower(), unigram_counts, bigram_counts, trigram_counts, fivegram_counts,
                                          prev_word, prev_prev_word, context=corrected_words)
            # Restore the capitalization of the input token.
            if word.istitle():
                best_candidate = best_candidate.capitalize()
            corrected_words.append(best_candidate)
        else:
            corrected_words.append(word)

    return ' '.join(corrected_words)



## Justify your decisions

Write down justifications for your implementation choices. For example, these choices could be:
- Which ngram dataset to use
- Which weights to assign for edit1, edit2 or absent words probabilities
- Beam search parameters
- etc.

#### I'll describe how I wrote my code and what methods I tried to apply.

## Justification for Implementation Choices

1. ### N-gram Dataset Selection

    The big.txt corpus was chosen as the primary dataset for language modeling because it is a large text corpus containing a variety of English words and sentences.

    Unigrams, bigrams, trigrams, fivegrams were extracted to improve the context-aware spelling correction. Higher-order n-grams provide more accurate predictions by considering a broader linguistic context.

2.  ### Weighting of Different Probabilities

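    Unigram probability: Used as the base frequency of a word in the corpus (weight 1).

    Bigram probability: Given twice the weight of the unigram probability to capture the relationship with the preceding word.

    Trigram and five-gram probabilities: Weighted 3 and 5 respectively, on the same reasoning: the longer the matching context, the stronger the evidence for a candidate.

    Levenshtein distance: Subtracted from the score to penalize candidates with higher edit distances, so that corrections stay structurally close to the input.

    Jaccard similarity: Added with weight 2 to promote candidates that share more characters with the input.

    Keyboard adjacency: Positions where the candidate letter is a keyboard neighbour of the typed letter are counted and subtracted with weight 0.5.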

3. ### Error Correction Methods

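    Edit-distance candidates (change1, change2): These functions are borrowed from Norvig's edits1 and edits2. They generate every string that is one or two edits (deletion, transposition, replacement, insertion) away from the input.

    Incorrect keystrokes (misclick_edits): Replaces each letter with the keys adjacent to it on a standard QWERTY keyboard, which covers typical typos caused by hitting a neighbouring key.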

4. ### Sentence-Level Correction Strategy

    The words are corrected sequentially, from left to right, so that each correction can use the already-corrected preceding words as context.

    The previous word and the word before it are passed to the correct_word() function so that the bigram and trigram terms of the score can be looked up.

5. ### Scoring Function and Candidate Selection

    The scoring function provides a balance of n-gram probabilities, similarity measures, and error penalties.

    The candidate with the highest score is selected as the correction, which favors words that both fit the context and deviate minimally from the original word.
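
The weighting described in the list above can also be sketched in log space, which puts probabilities and distance penalties on a comparable scale. This is an alternative formulation under stated assumptions, not the code used in this notebook: `log_prob`, `combined_score`, the add-one smoothing, and all counts are hypothetical.

```python
import math


def log_prob(count, total, vocab_size):
    """Add-one smoothed log probability, so unseen n-grams get a finite penalty."""
    return math.log((count + 1) / (total + vocab_size))


def combined_score(ngram_log_probs, weights, edit_distance, distance_penalty=1.0):
    """Weighted sum of n-gram log probabilities minus an edit-distance penalty."""
    weighted = sum(w * lp for w, lp in zip(weights, ngram_log_probs))
    return weighted - distance_penalty * edit_distance


# Hypothetical counts for two candidates that are both one edit away from "dking".
TOTAL, VOCAB = 1000, 100
WEIGHTS = (1, 2)  # unigram and bigram weights, mirroring the 1:2 ratio described above
doing = combined_score(
    [log_prob(50, TOTAL, VOCAB), log_prob(8, TOTAL, VOCAB)], WEIGHTS, edit_distance=1
)
dying = combined_score(
    [log_prob(20, TOTAL, VOCAB), log_prob(1, TOTAL, VOCAB)], WEIGHTS, edit_distance=1
)
print("doing" if doing > dying else "dying")
```

Because both candidates have the same edit distance here, the n-gram terms alone decide the ranking.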





## Evaluate on a test set

Your task is to generate a test set and evaluate your work. You may vary the noise probability to generate different datasets with varying complexity (or just take another dataset). Compare your solution to Norvig's corrector and report the accuracies.
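
A minimal sketch of what varying the noise probability could look like, together with a word-level accuracy measure. Both helpers (`add_noise`, `word_accuracy`) are hypothetical and differ from the generator used in the cell below, which applies one error type per sentence.

```python
import random

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def add_noise(sentence, noise_prob, rng):
    """Replace one letter in each word with probability noise_prob."""
    noisy = []
    for word in sentence.split():
        if len(word) > 1 and rng.random() < noise_prob:
            i = rng.randrange(len(word))
            # Pick a letter different from the original so the word really changes.
            replacement = rng.choice([c for c in LETTERS if c != word[i]])
            word = word[:i] + replacement + word[i + 1:]
        noisy.append(word)
    return " ".join(noisy)


def word_accuracy(predicted, expected):
    """Fraction of positions where the predicted word equals the expected one."""
    pred, exp = predicted.split(), expected.split()
    matches = sum(p == e for p, e in zip(pred, exp))
    return matches / max(len(pred), len(exp), 1)


rng = random.Random(0)  # fixed seed so the generated test set is reproducible
print(add_noise("this is a simple test sentence", 0.3, rng))
```

Word-level accuracy gives partial credit when only one word of a sentence is wrong, whereas exact sentence matching counts the whole sentence as a failure.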

### Creating test sentences

In [23]:
import csv
import random

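# Note: "word_order" has no handler in introduce_errors, so sentences that draw it are left unchanged.
ERROR_TYPES = ["spelling", "homophone", "word_order", "extra_space", "missing_letter", "swapped_letters"]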

def introduce_errors(sentence):
    words = sentence.split()
    error_type = random.choice(ERROR_TYPES)

    if error_type == "spelling":
        if words:
            idx = random.randint(0, len(words) - 1)
            if len(words[idx]) > 1:
                words[idx] = words[idx][:max(1, len(words[idx]) - 1)]

    elif error_type == "homophone":
        homophones = {"their": "there", "there": "their", "your": "you're", "you're": "your", "to": "too", "too": "to"}
        words = [homophones.get(w, w) for w in words]

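    elif error_type == "extra_space":
        # The trailing split() collapses the doubled space again, so this branch leaves the sentence unchanged.
        words = " ".join(words).replace(" ", "  ", 1).split()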

    elif error_type == "missing_letter":
        if words:
            idx = random.randint(0, len(words) - 1)
            if len(words[idx]) > 1:
                char_idx = random.randint(0, len(words[idx]) - 1)
                words[idx] = words[idx][:char_idx] + words[idx][char_idx + 1:]
    elif error_type == "swapped_letters":
        if words:
            idx = random.randint(0, len(words) - 1)
            if len(words[idx]) > 1:
                char_idx = random.randint(0, len(words[idx]) - 2)
                word_list = list(words[idx])
                word_list[char_idx], word_list[char_idx + 1] = word_list[char_idx + 1], word_list[char_idx]
                words[idx] = "".join(word_list)

    return " ".join(words)

#I took these sentences from different sites and partially generated them
correct_sentences = [
    "i am going to the store later",
    "this is a simple test sentence",
    "she enjoys playing the piano",
    "the weather is really nice today",
    "we need to finish our project soon",
    "he quickly ran across the street to catch the bus",
    "my favorite color is blue but I also like red",
    "the dog barked loudly at the stranger",
    "we had a great time at the amusement park",
    "she bought a new dress for the party",
    "john loves reading books about history",
    "the cat is sleeping on the sofa",
    "we should leave early to avoid traffic",
    "can you help me carry these bags",
    "her voice was soft but clear",
    "the sun rises in the east and sets in the west",
    "this puzzle is difficult to solve",
    "our teacher gave us homework for the weekend",
    "the museum has a collection of ancient artifacts",
    "she practices yoga every morning",
    "the train arrived at the station on time",
    "my grandmother tells wonderful stories",
    "please turn off the lights before you leave",
    "we enjoyed our trip to the mountains",
    "he apologized for being late to the meeting",
    "the new restaurant serves delicious pasta",
    "his handwriting is very neat and clear",
    "they adopted a cute little puppy",
    "her favorite subject in school is mathematics",
    "we walked along the beach at sunset",
    "the scientist explained the theory in detail",
    "she is learning to play the violin",
    "the movie was both exciting and emotional",
    "we planted flowers in our garden last spring",
    "the bakery sells fresh bread every morning",
    "my brother is studying engineering at university",
    "their wedding was a beautiful ceremony",
    "the library has a vast collection of books",
    "i enjoy listening to classical music",
    "the artist painted a stunning landscape",
    "the city skyline looks amazing at night",
    "her dress was made of fine silk",
    "we celebrated my birthday with a big party",
    "the students are preparing for their exams",
    "he built a wooden birdhouse for his garden",
    "she loves watching the stars at night",
    "the chef prepared a delicious three-course meal",
    "the team won the championship last season",
    "a rainbow appeared after the heavy rain",
    "the firefighters bravely saved the family",
    "she bought a bouquet of fresh flowers",
    "the mechanic fixed the car engine quickly",
    "we took a boat ride across the lake",
    "the baby giggled when I tickled her feet",
    "my father taught me how to ride a bicycle",
    "the children built a sandcastle on the beach",
    "i prefer tea over coffee in the morning",
    "the dancer performed with grace and elegance",
    "our vacation to Italy was unforgettable",
    "she writes poetry in her free time",
    "we enjoyed watching the fireworks display",
    "the mountain peak was covered in snow",
    "the astronaut described life in space",
    "they watched the sunrise from the hilltop",
    "the athlete trained hard for the marathon",
    "the detective solved the mystery case",
    "she baked a chocolate cake for dessert",
    "the garden was filled with colorful butterflies",
    "they played chess by the fireplace",
    "the ancient ruins were fascinating to explore",
    "she wore a beautiful silver necklace",
    "he composed a symphony for the orchestra",
    "the scientist discovered a new planet",
    "we went camping under the starry sky",
    "the knight fought bravely in battle",
    "she enjoys long walks in the countryside",
    "the magician performed an amazing trick",
    "the violinist played a mesmerizing melody",
    "we adopted a kitten from the shelter",
    "the festival was full of joy and laughter",
    "i am going to the store later",
    "this is a simple test sentence",
    "she likes to read books",
    "he is reading a book",
    "i love eating apples",
    "the cat is sitting on the mat",
    "i will meet you at the park",
    "they're going to the mall",
    "i know the answer",
    "the weather is nice today",
    "i just want to say hello",
    "this is a great example",
    "he is a great teacher",
    "she bought a new phone",
    "i enjoy doing sport every day",
    "she was dyeing her hair",
    "he has a strong feeling about this",
    "their house is very big",
    "they're going to the store",
    "i am trying to study",
    "the car doesn't work",
    "he was very happy",
    "she does not like coffee",
    "this is an interesting book",
    "can you believe it?",
    "he tried to help",
    "we finally arrived",
    "this is a beautiful day",
    "she really wants to go",
    "let's go to the restaurant",
    "she gave me a gift",
    "he drove too fast",
    "i think we will win",
    "the doctor prescribed medicine",
    "i will definitely call you",
    "she always smiles",
    "he committed a mistake",
    "he surprised everyone",
    "you're welcome!",
    "their car is faster than ours",
    "i didn't see him",
    "she signed the contract",
    "the children are playing outside",
    "let's meet at 5 PM",
    "he is responsible for this",
    "she is a beautiful singer",
    "this is a difficult question",
    "we finally finished our work",
    "i hope you're doing well",
    "his advice was very helpful"
]

test_cases = [(f'{introduce_errors(sentence)}', f'{sentence}') for sentence in correct_sentences]


def save_test_cases(output_csv="testing_sentences_2.csv"):
    with open(output_csv, 'w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(["incorrect", "correct"])
        writer.writerows(test_cases)

    print(f"Generated {len(test_cases)} test cases and saved to {output_csv}")

save_test_cases()

Generated 130 test cases and saved to testing_sentences_2.csv


## Result. My algorithm and Norvig

In [52]:
# My corrector

import csv

def test_correction(csv_filename):
    num_sentences = 0
    num_correct = 0
    with open(csv_filename, 'r', encoding='utf-8') as file:
        reader = csv.reader(file)
        next(reader)  # skip the header row
        for incorrect, correct in reader:
            num_sentences += 1
            predicted = correct_sentence(incorrect, unigram_counts, bigram_counts, trigram_counts, fivegram_counts)
            # A sentence counts as correct only if it matches the expected text exactly.
            if predicted == correct:
                num_correct += 1
                print(f"Correct:\nInput: {incorrect}\nExpected: {correct}\nPredicted: {predicted}\n")
            else:
                print(f"Incorrect:\nInput: {incorrect}\nExpected: {correct}\nPredicted: {predicted}\n")
        print(f"Accuracy: {num_correct/num_sentences}")


test_correction("testing_sentences_2.csv")


Correct:
Input: i am goig to the store later
Expected: i am going to the store later
Predicted: i am going to the store later

Correct:
Input: this is a simple test sentence
Expected: this is a simple test sentence
Predicted: this is a simple test sentence

Correct:
Input: she enjoys playing the piano
Expected: she enjoys playing the piano
Predicted: she enjoys playing the piano

Correct:
Input: the weather is really nice today
Expected: the weather is really nice today
Predicted: the weather is really nice today

Incorrect:
Input: we need ot finish our project soon
Expected: we need to finish our project soon
Predicted: we need of finish our project soon

Incorrect:
Input: he quickly ran across the street too catch the bus
Expected: he quickly ran across the street to catch the bus
Predicted: he quickly ran across the street too catch the but

Correct:
Input: my favorite color is blue but I also like red
Expected: my favorite color is blue but I also like red
Predicted: my favorite co

## Norvig solution

In [40]:
import re
from collections import Counter
import csv

def words(text): return re.findall(r'\w+', text.lower())

WORDS = Counter(words(open('big.txt').read()))

def P(word, N=sum(WORDS.values())):
    "Probability of `word`."
    return WORDS[word] / N

def correction(word):
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)

def candidates(word):
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words):
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

def correct_sentence_function(sentence):
    """Corrects a sentence by applying word-level correction."""
    words = sentence.split()
    return ' '.join([correction(word) for word in words])

In [41]:
# Norvig's corrector

def test_correction(csv_filename):
    num_sentences = 0
    num_correct = 0
    with open(csv_filename, 'r', encoding='utf-8') as file:
        reader = csv.reader(file)
        next(reader)  # skip the header row
        for incorrect, correct in reader:
            num_sentences += 1
            predicted = correct_sentence_function(incorrect)
            # A sentence counts as correct only if it matches the expected text exactly.
            if predicted == correct:
                num_correct += 1
                print(f"Correct:\nInput: {incorrect}\nExpected: {correct}\nPredicted: {predicted}\n")
            else:
                print(f"Incorrect:\nInput: {incorrect}\nExpected: {correct}\nPredicted: {predicted}\n")
        print(f"Accuracy: {num_correct/num_sentences}")


test_correction("testing_sentences_2.csv")

Correct:
Input: i am goig to the store later
Expected: i am going to the store later
Predicted: i am going to the store later

Correct:
Input: this is a simple test sentence
Expected: this is a simple test sentence
Predicted: this is a simple test sentence

Correct:
Input: she enjoys playing the piano
Expected: she enjoys playing the piano
Predicted: she enjoys playing the piano

Correct:
Input: the weather is really nice today
Expected: the weather is really nice today
Predicted: the weather is really nice today

Incorrect:
Input: we need ot finish our project soon
Expected: we need to finish our project soon
Predicted: we need of finish our project soon

Incorrect:
Input: he quickly ran across the street too catch the bus
Expected: he quickly ran across the street to catch the bus
Predicted: he quickly ran across the street too catch the but

Incorrect:
Input: my favorite color is blue but I also like red
Expected: my favorite color is blue but I also like red
Predicted: my favorite 


## Results


My model performs better than Norvig's corrector because it considers a wider range of cases: it catches many common spelling mistakes and even some homophones. It is still not good enough, however. It struggles with subtle context-based errors and sometimes introduces new mistakes while correcting others (for example, bus → but in the output above). While it improves basic spelling, it needs refinement to handle small but important word-choice errors (like too → to). Overall it is an improvement, but more work is needed before it is reliable.
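
One possible direction for the word-choice errors mentioned above, sketched under assumptions: treat known words from a small confusion set as candidates too, and switch only when the following word gives stronger evidence. The confusion sets, the function names, and the bigram counts are all hypothetical.

```python
# Hypothetical confusion sets; a real list would need to be much larger.
CONFUSION_SETS = [{"to", "too", "two"}, {"their", "there"}]


def real_word_candidates(word):
    """Return the confusion set containing the word, or just the word itself."""
    for group in CONFUSION_SETS:
        if word in group:
            return group
    return {word}


def choose(word, next_word, bigram_counts):
    """Keep the word unless another member of its set fits the next word better."""
    scores = {c: bigram_counts.get((c, next_word), 0) for c in real_word_candidates(word)}
    best = max(sorted(scores), key=scores.get)
    return best if scores[best] > scores[word] else word


toy_bigrams = {("to", "catch"): 12, ("too", "much"): 9, ("to", "much"): 1}
print(choose("too", "catch", toy_bigrams))
print(choose("too", "much", toy_bigrams))
```

Keeping the original word on ties is a deliberately conservative choice, so that a correct word is not replaced without evidence.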

#### Useful resources (also included in the archive in Moodle):

1. [Possible dataset with N-grams](https://www.ngrams.info/download_coca.asp)
2. [Damerau–Levenshtein distance](https://en.wikipedia.org/wiki/Damerau–Levenshtein_distance#:~:text=Informally%2C%20the%20Damerau–Levenshtein%20distance,one%20word%20into%20the%20other.)