# Context-sensitive Spelling Correction

### Student: Ivan Golov
### Email: i.golov@innopolis.university
### Group: AI-01

### **References:**

1. **Norvig, P.**  
   *How to Write a Spelling Corrector.*  
   [https://norvig.com/spell-correct.html](https://norvig.com/spell-correct.html)

2. **Voita, L.**  
   *Theory of N-gram Models.*  
   [https://lena-voita.github.io/nlp_course/language_modeling.html](https://lena-voita.github.io/nlp_course/language_modeling.html)

3. **Corpora Provider**  
   *Holbrook-tagged.dat Test Dataset Provider.*  
   [https://titan.dcs.bbk.ac.uk/~roger/corpora.html](https://titan.dcs.bbk.ac.uk/~roger/corpora.html)

4. **Deshmukh, D.**  
   *Spelling Corrector Using N-gram Language Model (Kaggle).*  
   [https://www.kaggle.com/code/dhruvdeshmukh/spelling-corrector-using-n-gram-language-model](https://www.kaggle.com/code/dhruvdeshmukh/spelling-corrector-using-n-gram-language-model)

5. **Msamprovalaki**  
   *Context-Aware Spelling Corrector - Assignment1 (GitHub).*  
   [https://github.com/msamprovalaki/Context-Aware-Spelling-Corrector/blob/main/Assignment1.ipynb](https://github.com/msamprovalaki/Context-Aware-Spelling-Corrector/blob/main/Assignment1.ipynb)

# Norvig's solution evaluation

I started working on the Assignment by evaluating the Norvig's solution and hightlighting the main drawbacks of it.

### Obtain the train language corpus using nltk.corpus import gutenberg, reuters, brown

In [1]:
import nltk
nltk.download('brown')
nltk.download('reuters')
nltk.download('gutenberg')
from nltk.corpus import gutenberg, reuters, brown
from tqdm import tqdm

# Function to generate large corpus text
def generate_large_corpus():
    large_corpus_text = ""

    # List of all Gutenberg file IDs
    file_ids = gutenberg.fileids()

    # Generate the large corpus by combining all texts
    large_corpus_text = "\n".join(gutenberg.raw(file_id) for file_id in file_ids)

    # Add Reuters and Brown corpora to the large corpus
    reuters_text = " ".join(reuters.words())
    brown_text = " ".join(brown.words())

    large_corpus_text += f"\n{reuters_text}\n{brown_text}"

    return large_corpus_text

# Save the large corpus to a text file
def save_large_corpus(file_path, corpus_text):
    with open(file_path, "w") as file:
        file.write(corpus_text)

# Generate and save the large corpus
large_corpus_text = generate_large_corpus()
save_large_corpus("data/train/language_corpus.txt", large_corpus_text)
print("Large corpus generated and saved to 'data/train/language_corpus.txt'")

[nltk_data] Downloading package brown to /Users/ivangolov/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package reuters to
[nltk_data]     /Users/ivangolov/nltk_data...
[nltk_data]   Package reuters is already up-to-date!
[nltk_data] Downloading package gutenberg to
[nltk_data]     /Users/ivangolov/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


Large corpus generated and saved to 'data/train/language_corpus.txt'


Even at the very beginning of the paper, the author points out that he trained his model on a limited data set. Therefore, I decided to immediately collect a more diverse corpus from different texts using the nltk library. I guess it should hel me to:
1) **Obtain broader vocabulary & diversity**
2) **Build based on big corpus the improved N-gram Estimation (sense of context)**

### Implement the Norvig model

In [2]:
import re
from collections import Counter

In [3]:
class NorvigSpellingCorrector_v1:
    def __init__(self, corpus):
        self.WORDS = Counter(self.words(corpus))
        self.N = sum(self.WORDS.values())

    def words(self, text):
        return re.findall(r'\w+', text.lower())

    def P(self, word):
        "Probability of `word`."
        return self.WORDS[word] / self.N

    def correction(self, word):
        "Most probable spelling correction for word."
        return max(self.candidates(word), key=self.P)

    def candidates(self, word):
        "Generate possible spelling corrections for word."
        return (self.known([word]) or self.known(self.edits1(word)) or self.known(self.edits2(word)) or [word])

    def known(self, words):
        "The subset of `words` that appear in the dictionary of WORDS."
        return set(w for w in words if w in self.WORDS)

    def edits1(self, word):
        "All edits that are one edit away from `word`."
        letters    = 'abcdefghijklmnopqrstuvwxyz'
        splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
        deletes    = [L + R[1:]               for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
        replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
        inserts    = [L + c + R               for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def edits2(self, word):
        "All edits that are two edits away from `word`."
        return (e2 for e1 in self.edits1(word) for e2 in self.edits1(e1))

### Train the model

In [4]:
model = NorvigSpellingCorrector_v1(large_corpus_text)

### Test set preparation

In [5]:
import nltk
import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize
import re

In [6]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/ivangolov/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [7]:
test_data = ""
with open('data/test/test.txt') as f:
    test_data = f.read()

In [8]:
# Tokenize the document into sentences
sentences = sent_tokenize(test_data.lower())

In [9]:
import re

# Initialize lists to store the data for the DataFrame
original_sentences = []
misspelled_sentences = []
correct_words = []
misspelled_words = []

# Define a regex pattern to find the <ERR> tags
pattern = re.compile(r'<err targ=(.*?)>(.*?)</err>')

# Define a preprocessing function
def preprocess_sentence(sentence):
    # Convert to lowercase
    sentence = sentence.lower()
    # Remove punctuation
    sentence = re.sub(r'[^\w\s]', '', sentence)
    # Remove extra whitespaces
    sentence = re.sub(r'\s+', ' ', sentence).strip()
    return sentence

# Process each sentence
for sentence in sentences:
    matches = pattern.findall(sentence)
    if matches:
        original_sentence = sentence
        misspelled_sentence = sentence
        correct_word_list = []
        misspelled_word_list = []
        
        for match in matches:
            correct_word = match[0]
            misspelled_word = match[1]
            misspelled_word_mew = match[1].replace(' ', '')
            correct_word_list.append(correct_word)
            misspelled_word_list.append(misspelled_word_mew)
            misspelled_sentence = misspelled_sentence.replace(f'<err targ={correct_word}>{misspelled_word}</err>', misspelled_word_mew)
            original_sentence = original_sentence.replace(f'<err targ={correct_word}>{misspelled_word}</err>', correct_word)
        
        # Apply preprocessing
        original_sentence = preprocess_sentence(original_sentence)
        misspelled_sentence = preprocess_sentence(misspelled_sentence)
        
        original_sentences.append(original_sentence)
        misspelled_sentences.append(misspelled_sentence)
        correct_words.append(correct_word_list)
        misspelled_words.append(misspelled_word_list)

In [10]:
# Construct the pandas DataFrame
df = pd.DataFrame({
    'Original Sentence': original_sentences,
    'Misspelled Sentence': misspelled_sentences,
    'Correct Words': correct_words,
    'Misspelled Words': misspelled_words
})

# Display the DataFrame
print(df.head())

# Save the DataFrame to a CSV file
df.to_csv('data/test/test_data_processed.csv', index=False)

                                   Original Sentence  \
0  1 nigel thrush page 48 i have four in my famil...   
1                          my sister goes to tonbury   
2                          my mum goes out sometimes   
3  i go to bridgebrook i go out sometimes on tues...   
4  on thursday nights i go bellringing on saturda...   

                                 Misspelled Sentence      Correct Words  \
0  1 nigel thrush page 48 i have four in my famil...           [sister]   
1                             my siter go to tonbury     [sister, goes]   
2                          my mum goes out sometimes        [sometimes]   
3  i go to bridgebrook i go out sometimes on tues...  [sometimes, club]   
4  on thursday nights i go bellringing on saturda...      [bellringing]   

    Misspelled Words  
0            [siter]  
1        [siter, go]  
2        [sometimes]  
3  [sometimes, clob]  
4      [bellringing]  


In [11]:
df.head(10)

Unnamed: 0,Original Sentence,Misspelled Sentence,Correct Words,Misspelled Words
0,1 nigel thrush page 48 i have four in my famil...,1 nigel thrush page 48 i have four in my famil...,[sister],[siter]
1,my sister goes to tonbury,my siter go to tonbury,"[sister, goes]","[siter, go]"
2,my mum goes out sometimes,my mum goes out sometimes,[sometimes],[sometimes]
3,i go to bridgebrook i go out sometimes on tues...,i go to bridgebrook i go out sometimes on tues...,"[sometimes, club]","[sometimes, clob]"
4,on thursday nights i go bellringing on saturda...,on thursday nights i go bellringing on saturda...,[bellringing],[bellringing]
5,i go to bed at 10 o clock i watch tv at 5 o cl...,i go to bed at 10 o clock i wakh tv at 5 o clo...,[watch],[wakh]
6,the house is white it has stone up the front i...,the house is white it has stone up the frount ...,"[front, second]","[frount, sexeon]"
7,on monday i sometimes go down the farm in the ...,on monday i sometimes go down the farm in the ...,[watch],[wach]
8,we have got anglia like to watch cowboys,we have got anglia like to wach cowboys,"[watch, cowboys]","[wach, cowboys]"
9,on tuesday i get off the bus and sometimes in ...,on tuesday i get off the bus and sometimes in ...,"[sometimes, club]","[sometimes, colbe]"


### Metrics computation

In addition to evaluating accuracy and perplexity, I also incorporated Word Error Rate (WER) and Character Error Rate (CER) into the evaluation process. These metrics help measure how closely the model's corrected output aligns with the reference text, ensuring that the system corrects only genuine misspellings while preserving correctly spelled words. This comprehensive evaluation approach allows us to better understand the model's performance and its fidelity to the original content.

##  Word Error Rate (WER)

WER is a metric used to evaluate the performance of systems that generate or correct sequences of words, such as automatic speech recognition systems or spelling correction systems. It measures the number of word-level errors made by a system when comparing its output to the reference or ground truth text.

### $ WER = \frac{S + I + D}{N} $

* $ S $ : the number of substitutions
* $ i $ : the number of insertions
* $ D $ : the number of deletions
* N: the total number of words in the reference

## Character Error Rate (CER)
CER is similar to WER but operates at the character level. It measures the accuracy of character-level transcriptions or corrections. CER quantifies the number of character-level errors made by a system when comparing its output to the reference text. Errors include substitutions, insertions, and deletions of individual characters.

### $ WER = \frac{S + I + D}{N} $

* $ S $ : the number of character substitutions
* $ i $ : the number of character insertions
* $ D $ : the number of character deletions
* N: the total number of characterσ in the reference

In [12]:
import pandas as pd
import numpy as np
import math
from nltk.metrics import edit_distance as Levenshtein
from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor
import ast

# Define the functions to calculate WER and CER
def calculate_wer(reference, corrected):
    # Calculate Word Error Rate (WER)
    reference_words = reference.split()
    corrected_words = corrected.split()

    S = Levenshtein(reference_words, corrected_words)
    I = max(0, len(corrected_words) - len(reference_words))
    D = max(0, len(reference_words) - len(corrected_words))

    N = max(len(reference_words), len(corrected_words))

    wer = (S + I + D) / N

    return wer

def calculate_cer(reference, corrected):
    # Calculate Character Error Rate (CER)
    S = Levenshtein(reference, corrected)
    I = max(0, len(corrected) - len(reference))
    D = max(0, len(reference) - len(corrected))

    N = max(len(reference), len(corrected))

    cer = (S + I + D) / N

    return cer

# Define the function to calculate accuracy
def calculate_accuracy(df):
    correct_predictions = 0
    total_predictions = 0

    for index, row in tqdm(df.iterrows(), total=df.shape[0]):
        correct_words = ast.literal_eval(row['Correct Words'])
        corrected_words = row['Corrected Words']
        for correct_word, corrected_word in zip(correct_words, corrected_words):
            if correct_word == corrected_word:
                correct_predictions += 1
            total_predictions += 1

    accuracy = correct_predictions / total_predictions
    return accuracy

def calculate_perplexity_norvig(sentence, model):
    words = model.words(sentence)
    log_prob = 0

    for word in words:
        prob = model.P(word)
        if prob > 0:
            log_prob += np.log2(prob)  
        else:
            log_prob += np.log2(1 / model.N) 

    HC = -log_prob / len(words)  # Cross-entropy
    perpl = math.pow(2, HC)  # Perplexity

    return HC, perpl

def correct_sentence(sentence, model):
    return ' '.join([model.correction(word) for word in model.words(sentence)])

def correct_words(words, model):
    return [model.correction(word.strip()) for word in eval(words)]

def compute_stats(df, model, tag):
    WER = []
    CER = []
    accuracy = 0
    perplexities = []
    HCs = []

    if tag == "Norvig":
        
        print("Compute the corrected sentence and words")
        with ThreadPoolExecutor() as executor:
            df['Corrected Sentence'] = list(tqdm(executor.map(lambda sentence: correct_sentence(sentence, model), df['Misspelled Sentence']), total=len(df)))
            df['Corrected Words'] = list(tqdm(executor.map(lambda words: correct_words(words, model), df['Misspelled Words']), total=len(df)))
        
        print("Compute the WER, CER and Perplexity")
        for index, row in tqdm(df.iterrows(), total=df.shape[0]):
            reference = row['Original Sentence']
            corrected = row['Corrected Sentence']
            wer = calculate_wer(reference, corrected)
            cer = calculate_cer(reference, corrected)
            WER.append(wer)
            CER.append(cer)
            HC, perplexity = calculate_perplexity_norvig(corrected, model)
            perplexities.append(perplexity)
            HCs.append(HC)
            
        
        print("Compute the accuracy")
        accuracy = calculate_accuracy(df)
            
    print('Average Word Error Rate (WER):', np.mean(WER))
    print('Average Character Error Rate (CER):', np.mean(CER))
    print(f"Accuracy: {accuracy}")
    print('Average Perplexity:', np.mean(perplexities))
    print('Average Cross-Entropy:', np.mean(HCs))
    
    return df

In [13]:
import pandas as pd
test_df = pd.read_csv('data/test/test_data_processed.csv')

### Collect statistics

In [14]:
updated = compute_stats(test_df.copy(), model, "Norvig")

Compute the corrected sentence and words


100%|██████████| 666/666 [00:32<00:00, 20.44it/s] 
100%|██████████| 666/666 [00:30<00:00, 21.56it/s] 


Compute the WER, CER and Perplexity


100%|██████████| 666/666 [00:28<00:00, 23.06it/s]


Compute the accuracy


100%|██████████| 666/666 [00:00<00:00, 12108.24it/s]

Average Word Error Rate (WER): 0.12587380137217416
Average Character Error Rate (CER): 0.0669916914630596
Accuracy: 0.2054636398614852
Average Perplexity: 5198.027378935151
Average Cross-Entropy: 10.281997326281457





In [15]:
updated.head(10)

Unnamed: 0,Original Sentence,Misspelled Sentence,Correct Words,Misspelled Words,Corrected Sentence,Corrected Words
0,1 nigel thrush page 48 i have four in my famil...,1 nigel thrush page 48 i have four in my famil...,['sister'],['siter'],1 nigel thrush page 48 i have four in my famil...,[sister]
1,my sister goes to tonbury,my siter go to tonbury,"['sister', 'goes']","['siter', 'go']",my sister go to tilbury,"[sister, go]"
2,my mum goes out sometimes,my mum goes out sometimes,['sometimes'],['sometimes'],my mum goes out sometimes,[sometimes]
3,i go to bridgebrook i go out sometimes on tues...,i go to bridgebrook i go out sometimes on tues...,"['sometimes', 'club']","['sometimes', 'clob']",i go to bridgebrook i go out sometimes on tues...,"[sometimes, club]"
4,on thursday nights i go bellringing on saturda...,on thursday nights i go bellringing on saturda...,['bellringing'],['bellringing'],on thursday nights i go bellringing on saturda...,[bellringing]
5,i go to bed at 10 o clock i watch tv at 5 o cl...,i go to bed at 10 o clock i wakh tv at 5 o clo...,['watch'],['wakh'],i go to bed at 10 o clock i wash tv at 5 o clo...,[wash]
6,the house is white it has stone up the front i...,the house is white it has stone up the frount ...,"['front', 'second']","['frount', 'sexeon']",the house is white it has stone up the front i...,"[front, sexton]"
7,on monday i sometimes go down the farm in the ...,on monday i sometimes go down the farm in the ...,['watch'],['wach'],on monday i sometimes go down the farm in the ...,[each]
8,we have got anglia like to watch cowboys,we have got anglia like to wach cowboys,"['watch', 'cowboys']","['wach', 'cowboys']",we have got anglia like to each cowboys,"[each, cowboys]"
9,on tuesday i get off the bus and sometimes in ...,on tuesday i get off the bus and sometimes in ...,"['sometimes', 'club']","['sometimes', 'colbe']",on tuesday i get off the bus and sometimes in ...,"[sometimes, cole]"


### **The Norvig spelling corrector model (Baseline):**

* Average Word Error Rate (WER): 0.12587380137217416
* Average Character Error Rate (CER): 0.0669916914630596
* Accuracy: 0.2054636398614852
* Average Perplexity: 5198.027378935151
* Average Cross-Entropy: 10.281997326281457

### **Planned Future Improvements:**

1. **Enhanced Data Preprocessing:**  
   Refining the preprocessing steps—such as advanced tokenization, normalization, and noise reduction—will help create a cleaner dataset, which in turn can improve the reliability of the n-gram statistics.

2. **Contextual Modeling with N-grams:**  
   Moving beyond a simple unigram approach, incorporating bigrams, trigrams, and interpolated models will allow the corrector to better understand and leverage contextual information. This is crucial for making distinctions in ambiguous cases, such as "doing sport" versus "dying species."

3. **Beam Search Implementation:**  
   Integrating a beam search algorithm will enable more efficient exploration of correction candidates. This strategy can significantly enhance performance by balancing the search space with the quality of candidate sequences.

4. **Alpha-Lambda Parameter Tuning:**  
   Fine-tuning the smoothing parameters (alpha and lambda) in the interpolated models is expected to yield more accurate probability estimations. This tuning will help optimize the balance between n-gram levels, reducing both WER and CER while lowering perplexity and cross-entropy.

Overall, these improvements aim to create a more robust, context-sensitive spelling corrector that better aligns with the source text and minimizes unnecessary changes to correctly spelled words.

# My solution

### **Improvement №1** (add advanced text predprocessing):

In [None]:
class NorvigSpellingCorrector_v2:
    def __init__(self, corpus):
        self.WORDS = self.preprocess_corpus(corpus)
        self.N = sum(self.WORDS.values())
    
    # UPDATED: Preprocess the corpus to remove rare words and replace them with "<UNK>"
    def preprocess_corpus(self, corpus):
        # Convert the entire corpus to lowercase
        corpus = corpus.lower()

        # Remove all characters that are not letters, digits, or whitespace
        corpus = re.sub(r'[^a-z0-9\s]', '', corpus)

        # Tokenize the corpus into words
        tokens = self.words(corpus)  

        # Count the frequency of each word in the corpus
        word_counts = Counter(tokens)
    
        # Define a threshold for word frequency
        threshold = 10
        
        # Create a vocabulary set with words that appear at least 'threshold' times
        vocab = {word for word, count in word_counts.items() if count >= threshold}

        # Replace words not in the vocabulary with the "<UNK>" token
        tokens = [token if token in vocab else "<UNK>" for token in tokens]

        # Recount the frequency of each word after replacing rare words with "<UNK>"
        word_counts = Counter(tokens)
        
        # Return the final word counts
        return word_counts

    def words(self, text):
        return re.findall(r'\w+', text.lower())

    def P(self, word):
        "Probability of `word`."
        return self.WORDS[word] / self.N

    def correction(self, word):
        "Most probable spelling correction for word."
        return max(self.candidates(word), key=self.P)

    def candidates(self, word):
        "Generate possible spelling corrections for word."
        return (self.known([word]) or self.known(self.edits1(word)) or self.known(self.edits2(word)) or [word])

    def known(self, words):
        "The subset of `words` that appear in the dictionary of WORDS."
        return set(w for w in words if w in self.WORDS)

    def edits1(self, word):
        "All edits that are one edit away from `word`."
        letters    = 'abcdefghijklmnopqrstuvwxyz'
        splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
        deletes    = [L + R[1:]               for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
        replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
        inserts    = [L + c + R               for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def edits2(self, word):
        "All edits that are two edits away from `word`."
        return (e2 for e1 in self.edits1(word) for e2 in self.edits1(e1))

### Load data and train model

In [17]:
large_corpus_text = ""
with open('data/train/language_corpus.txt') as f:
    large_corpus_text = f.read()

In [18]:
model = NorvigSpellingCorrector_v2(large_corpus_text)

### Compute metrics

In [19]:
updated = compute_stats(test_df.copy(), model, "Norvig")

Compute the corrected sentence and words


100%|██████████| 666/666 [00:51<00:00, 13.03it/s] 
100%|██████████| 666/666 [00:28<00:00, 23.11it/s] 


Compute the WER, CER and Perplexity


100%|██████████| 666/666 [00:29<00:00, 22.84it/s]


Compute the accuracy


100%|██████████| 666/666 [00:00<00:00, 8219.89it/s]

Average Word Error Rate (WER): 0.12967006293761502
Average Character Error Rate (CER): 0.06937547142922464
Accuracy: 0.21969988457098885
Average Perplexity: 4751.986770267538
Average Cross-Entropy: 10.065830773255565





In [122]:
updated.head(10)

Unnamed: 0,Original Sentence,Misspelled Sentence,Correct Words,Misspelled Words,Corrected Sentence,Corrected Words
0,1 nigel thrush page 48 i have four in my famil...,1 nigel thrush page 48 i have four in my famil...,['sister'],['siter'],1 nigel thrust page 48 i have four in my famil...,[sister]
1,my sister goes to tonbury,my siter go to tonbury,"['sister', 'goes']","['siter', 'go']",my sister go to tonbury,"[sister, go]"
2,my mum goes out sometimes,my mum goes out sometimes,['sometimes'],['sometimes'],my sum goes out sometimes,[sometimes]
3,i go to bridgebrook i go out sometimes on tues...,i go to bridgebrook i go out sometimes on tues...,"['sometimes', 'club']","['sometimes', 'clob']",i go to bridgebrook i go out sometimes on tues...,"[sometimes, club]"
4,on thursday nights i go bellringing on saturda...,on thursday nights i go bellringing on saturda...,['bellringing'],['bellringing'],on thursday nights i go bellringing on saturda...,[bellringing]
5,i go to bed at 10 o clock i watch tv at 5 o cl...,i go to bed at 10 o clock i wakh tv at 5 o clo...,['watch'],['wakh'],i go to bed at 10 o clock i wash tv at 5 o clo...,[wash]
6,the house is white it has stone up the front i...,the house is white it has stone up the frount ...,"['front', 'second']","['frount', 'sexeon']",the house is white it has stone up the front i...,"[front, seen]"
7,on monday i sometimes go down the farm in the ...,on monday i sometimes go down the farm in the ...,['watch'],['wach'],on monday i sometimes go down the farm in the ...,[each]
8,we have got anglia like to watch cowboys,we have got anglia like to wach cowboys,"['watch', 'cowboys']","['wach', 'cowboys']",we have got angle like to each cowboy,"[each, cowboy]"
9,on tuesday i get off the bus and sometimes in ...,on tuesday i get off the bus and sometimes in ...,"['sometimes', 'club']","['sometimes', 'colbe']",on tuesday i get off the bus and sometimes in ...,"[sometimes, cole]"


---

**Key Aspects of the `preprocess_corpus` Function:**

- **Lowercasing:**  
  Converting the entire corpus to lowercase to ensure consistency.

- **Noise Reduction:**  
  Removing all characters except letters, digits, and whitespace using regular expressions.

- **Tokenization:**  
  Splitting the text into tokens to facilitate word-level analysis.

- **Frequency Filtering:**  
  Counting word occurrences and defining a threshold (set to 10) to build a robust vocabulary.

- **Handling Rare Words:**  
  Replacing words that do not meet the frequency threshold with an `<UNK>` token, thereby reducing the impact of rare or noisy tokens on the model.

---

**After Incorporating the Preprocessing Enhancements, the Updated Model Produced the Following Metrics:**

### The Norvig Spelling Corrector Model (Baseline + Corpus Preprocessing)
- **Average Word Error Rate (WER):** 0.12967006293761502  
- **Average Character Error Rate (CER):** 0.06937547142922464  
- **Accuracy:** 0.21969988457098885  
- **Average Perplexity:** 4751.986770267538  
- **Average Cross-Entropy:** 10.065830773255565  

---

**Observations:**

- **Accuracy:**  
  Increased from 0.20546 to 0.21970, indicating that the model now correctly handles a higher proportion of words.

- **Perplexity & Cross-Entropy:**  
  Both metrics decreased, suggesting that the language model has become more confident and is assigning higher probabilities to the correct word sequences.

- **WER & CER:**  
  Slight increases in error rates suggest that while the vocabulary standardization and unknown token replacement helped overall performance, they may have introduced a few extra misclassifications—likely due to aggressive filtering of infrequent words.

### **Improvement №2** (Add the notion of context using N-gram models):

### Load and predprocess the data

In [20]:
large_corpus_text = ""
with open('data/train/language_corpus.txt') as f:
    large_corpus_text = f.read()

In [None]:
def preprocess_corpus(corpus):
        corpus = corpus.lower()

        corpus = re.sub(r'[^a-z0-9\s]', '', corpus)

        tokens = re.findall(r'\w+', corpus) 
        
        # exclude single character words
        tokens = [token for token in tokens if len(token) > 1]

        word_counts = Counter(tokens)
        
        threshold = 10
        vocab = {word for word, count in word_counts.items() if count >= threshold}

        return tokens, vocab

corpus_words, vocab = preprocess_corpus(large_corpus_text)

__Note:__ I should note that since we do not have the ability to collect and curate a highly diverse text corpus, the introduction of the <UNK> token might be neglected. In our current setup, the probability of encountering infrequent words is already low, meaning that most words not meeting the frequency threshold would naturally be treated as rare. As a result, the impact of explicitly replacing them with <UNK> may not be significant.

In [22]:
def preprocess_sentences(sentences):
    preprocessed_sentences = []
    for sentence in tqdm(sentences, total=len(sentences)):
        sentence = sentence.lower()
        sentence = re.sub(r'[^a-z0-9\s]', '', sentence)
        tokens = re.findall(r'\w+', sentence)
        preprocessed_sentences.append(tokens)
    return preprocessed_sentences

In [23]:
corpus_sentences = nltk.sent_tokenize(large_corpus_text)

In [24]:
corpus_sentences_predprocessed = preprocess_sentences(corpus_sentences)

100%|██████████| 248174/248174 [00:04<00:00, 50922.44it/s]


### Compute the N-gram stats

In [25]:
from collections import defaultdict
from tqdm import tqdm

def build_ngram_counts(words, max_order):
    """
    Build n-gram counts for orders 1 through max_order.
    For unigrams, keys are one-element tuples.
    """
    ngram_counts = {i: defaultdict(int) for i in range(1, max_order + 1)}

    for i in tqdm(range(len(words)), desc="Building n-gram counts"):
        for order in range(1, max_order + 1):
            if i + order <= len(words):
                gram = tuple(words[i:i + order])
                ngram_counts[order][gram] += 1
    
    return ngram_counts

ngram_counts = build_ngram_counts(corpus_words, 3)

Building n-gram counts: 100%|██████████| 4427046/4427046 [00:13<00:00, 327640.35it/s]


We can observe that our current use of a defaultdict efficiently manages n-gram storage without demanding excessive time or memory resources. However, as our corpus scales, there are several strategies we could adopt to optimize storage further. For instance, using a trie (prefix tree) can facilitate quick retrieval and efficient storage of n-grams, while database solutions like SQLite or PostgreSQL—with proper indexing—can handle larger datasets. Additionally, compressed storage methods (e.g., gzip) and sparse matrix representations (using libraries like scipy.sparse) can reduce disk space usage and memory consumption for very large or sparse n-gram datasets.

In [26]:
import gzip
import pickle

In [27]:
# Save the n-gram counts with gzip
def save_ngram_counts(ngram_counts, filename):
    with gzip.open(filename, 'wb') as f:
        pickle.dump(ngram_counts, f)

# Save the n-gram counts to a file
save_ngram_counts(ngram_counts, 'data/ngrams/ngram_counts.pkl.gz')

In [28]:
# Load the n-gram counts from a file
def load_ngram_counts(filename):
    with gzip.open(filename, 'rb') as f:
        return pickle.load(f)

# Example of loading the n-gram counts
loaded_ngram_counts = load_ngram_counts('data/ngrams/ngram_counts.pkl.gz')

### Unigram probability model

In [33]:
def compute_unigram_probability(candidate, ngram_counts, alpha, total_words, V):
    """
    Computes smoothed unigram probability:
      P(candidate) = (C(candidate) + alpha) / (total_words + alpha * |V|)
    """
    unigram_count = ngram_counts[1].get((candidate,), 0)
    V_size = len(V)
    
    unigram_probability = (unigram_count + alpha) / (total_words + alpha * V_size)
    
    return unigram_probability

### Bigram probability model

In [34]:
# Compute the bigram probabilities
def compute_bigram_probabilities(w1, w2, ngram_counts, alpha):
    """
    Computes smoothed bigram probability:
      P(w2|w1) = (C(w1, w2) + alpha) / (C(w1) + alpha * |V|)
    """
    bigram_count = ngram_counts[2][(w1, w2)]
    unigram_count = ngram_counts[1][(w1,)]
    
    V = len(ngram_counts[1])
    
    bigram_probability = (bigram_count + alpha) / (unigram_count + alpha * V)
    
    return bigram_probability

In [35]:
def compute_perplexity_bigram(sentences, ngram_counts, alpha):
    """
    Compute the perplexity of the validation set using the bigram model.
    """
    total_log_prob = 0
    total_words = 0

    for sentence in tqdm(sentences, desc="Computing Perplexity"):
        for i in range(1, len(sentence)):
            w1 = sentence[i - 1]
            w2 = sentence[i]
            bigram_prob = compute_bigram_probabilities(w1, w2, ngram_counts, alpha)
            log_prob = np.log2(bigram_prob)
            total_log_prob += log_prob
            total_words += 1

    HC = -total_log_prob / total_words
    perplexity = math.pow(2, HC)

    return HC, perplexity

In [40]:
def compute_perplexity_bigram_avg(sentences, ngram_counts, alpha):
    """
    Compute the average perplexity of the validation set using the bigram model.
    """
    total_perplexity = 0
    num_sentences = len(sentences)

    for sentence in tqdm(sentences, desc="Computing Perplexity"):
        total_log_prob = 0
        total_words = 0
        for i in range(1, len(sentence)):
            w1 = sentence[i - 1]
            w2 = sentence[i]
            bigram_prob = compute_bigram_probabilities(w1, w2, ngram_counts, alpha)
            log_prob = np.log2(bigram_prob)
            total_log_prob += log_prob
            total_words += 1

        HC = -total_log_prob / total_words
        sentence_perplexity = math.pow(2, HC)
        total_perplexity += sentence_perplexity

    average_perplexity = total_perplexity / num_sentences

    return HC, average_perplexity

- **Corpus-Level vs. Sentence-Level Calculation:**  
  - **`compute_perplexity_bigram`:**  
    This function aggregates the log probabilities for all bigrams across the entire corpus. It computes a single cross-entropy (HC) over all words, and then derives the perplexity from that aggregate value. This gives a corpus-level measure of perplexity.
  
  - **`compute_perplexity_bigram_avg`:**  
    Here, the function calculates the cross-entropy and corresponding perplexity for each individual sentence. After computing each sentence’s perplexity, it averages these perplexities over all sentences. This approach treats each sentence independently before averaging the results.

- **Implications of the Aggregation Method:**  
  Because perplexity is derived via an exponentiation of cross-entropy (i.e., \( \text{Perplexity} = 2^{HC} \)), which is a non-linear transformation, averaging sentence-level perplexities does not yield the same result as computing perplexity over the entire corpus. Differences in sentence lengths and variance in probability distributions across sentences can lead to these discrepancies.

In summary, the first function provides an overall measure for the entire dataset, and the second function gives an average of per-sentence perplexities, which may differ due to the non-linear nature of the transformation from cross-entropy to perplexity.

_The same logic for the other N-gram prob models_

### Trigram probability model

In [36]:
# Compute the trigram probabilities
def compute_trigram_probabilities(w1, w2, w3, ngram_counts, alpha):
    """
    Computes smoothed trigram probability:
      P(w3|w1, w2) = (C(w1, w2, w3) + alpha) / (C(w1, w2) + alpha * |V|)
    """
    trigram_count = ngram_counts[3][(w1, w2, w3)]
    bigram_count = ngram_counts[2][(w1, w2)]
    
    V = len(ngram_counts[1])
    
    trigram_probability = (trigram_count + alpha) / (bigram_count + alpha * V)
    
    return trigram_probability

In [37]:
def compute_perplexity_trigram(sentences, ngram_counts, alpha):
    """
    Compute the perplexity of the validation set using the trigram model.
    """
    total_log_prob = 0
    total_words = 0

    for sentence in tqdm(sentences, desc="Computing Perplexity"):
        for i in range(2, len(sentence)):
            w1 = sentence[i - 2]
            w2 = sentence[i - 1]
            w3 = sentence[i]
            trigram_prob = compute_trigram_probabilities(w1, w2, w3, ngram_counts, alpha)
            log_prob = np.log2(trigram_prob)
            total_log_prob += log_prob
            total_words += 1

    HC = -total_log_prob / total_words
    perplexity = math.pow(2, HC)

    return HC, perplexity

### Interpolated bi-gram and tri-gram model

In [38]:
def compute_interpolated_prob(w1, w2, w3, ngram_counts, alpha, lamda):
    """
    Computes the interpolated probability:
      P(w3|w1,w2) = lam * P_trigram(w3|w1,w2) + (1 - lam) * P_bigram(w3|w2)
    where the bigram probability is computed as:
      P(w3|w2) = (C(w2, w3) + alpha) / (C(w2) + alpha * |V|)
    """
    # Trigram probability
    p_trigram = compute_trigram_probabilities(w1, w2, w3, ngram_counts, alpha)
    
    # Bigram probability
    p_bigram = compute_bigram_probabilities(w2, w3, ngram_counts, alpha)
    
    # Interpolated probability
    interpolated_prob = lamda * p_trigram + (1 - lamda) * p_bigram
    
    return interpolated_prob

In [39]:
def compute_perplexity_interpolated(sentences, ngram_counts, alpha, lamda):
    """
    Compute the perplexity of the validation set using the interpolated model.
    """
    total_log_prob = 0
    total_words = 0

    for sentence in tqdm(sentences, desc="Computing Perplexity"):
        for i in range(2, len(sentence)):
            w1 = sentence[i - 2]
            w2 = sentence[i - 1]
            w3 = sentence[i]
            interpolated_prob = compute_interpolated_prob(w1, w2, w3, ngram_counts, alpha, lamda)
            log_prob = np.log2(interpolated_prob)
            total_log_prob += log_prob
            total_words += 1

    HC = -total_log_prob / total_words
    perplexity = math.pow(2, HC)

    return HC, perplexity

In [41]:
def compute_perplexity_interpolated_avg(sentences, ngram_counts, alpha, lamda):
    """
    Compute the average perplexity of the validation set using the interpolated model.
    """
    total_perplexity = 0
    num_sentences = len(sentences)

    for sentence in tqdm(sentences, desc="Computing Perplexity"):
        total_log_prob = 0
        total_words = 0
        for i in range(2, len(sentence)):
            w1 = sentence[i - 2]
            w2 = sentence[i - 1]
            w3 = sentence[i]
            interpolated_prob = compute_interpolated_prob(w1, w2, w3, ngram_counts, alpha, lamda)
            log_prob = np.log2(interpolated_prob)
            total_log_prob += log_prob
            total_words += 1

        HC = -total_log_prob / total_words
        sentence_perplexity = math.pow(2, HC)
        total_perplexity += sentence_perplexity

    average_perplexity = total_perplexity / num_sentences

    return HC, average_perplexity

### General N-gram probability model

In [None]:
def compute_ngram_probability(context, candidate, ngram_counts, alpha, V):
    """
    Computes smoothed n-gram probability:
      P(candidate|context) = (C(context, candidate) + alpha) / (C(context) + alpha * |V|)
    """
    context_length = len(context)
    ngram_count = ngram_counts.get(context_length + 1, {}).get(context + (candidate,), 0)
    context_count = ngram_counts.get(context_length, {}).get(context, 0)
    V_size = len(V)
    
    ngram_probability = (ngram_count + alpha) / (context_count + alpha * V_size)
    
    return ngram_probability

### Tuning of hyperparameters

After introducing the bigram, trigram, and interpolated models, I proceeded to tune the hyperparameters—specifically the alpha values (smoothing parameters) and lambda values (for interpolation). By iterating over a range of alpha values for the bigram and trigram models and both alpha and lambda values for the interpolated model on a validation set of 10,000 sentences, I aimed to identify the parameter combinations that minimized cross-entropy and perplexity. This systematic tuning process helps ensure that the language models produce more accurate probability estimates and better capture the nuances of the training data, ultimately leading to improved performance on unseen text.

In [None]:
# Tune the hyperparameters for the bigram LM
validation_set = corpus_sentences[:10000]
alpha_values = [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3, 0.5, 1.0]
lambda_values = [0.1, 0.2, 0.3, 0.5, 0.7, 0.9]

best_bigram_params=None
best_bigram_ce = float('inf')
print("Tuning hyperparameters for the bigram LM:")
for alpha in (alpha_values):
    HC, perpl = compute_perplexity_bigram(validation_set, ngram_counts, alpha)
    print(f"Alpha: {alpha}, Cross-Entropy: {HC}, Perplexity: {perpl}")
    if HC < best_bigram_ce:
        best_bigram_ce = HC
        best_bigram_params = (alpha, HC, perpl)
        
# Tune the hyperparameters for the trigram LM
best_trigram_params=None
best_trigram_ce = float('inf')
print("Tuning hyperparameters for the trigram LM:")
for alpha in (alpha_values):
    HC, perpl = compute_perplexity_trigram(validation_set, ngram_counts, alpha)
    print(f"Alpha: {alpha}, Cross-Entropy: {HC}, Perplexity: {perpl}")
    if HC < best_trigram_ce:
        best_trigram_ce = HC
        best_trigram_params = (alpha, HC, perpl)
        
# Tune the hyperparameters for the interpolated LM
best_interpolated_params=None
best_interpolated_ce = float('inf')
print("Tuning hyperparameters for the interpolated LM:")
for alpha in (alpha_values):
    for lamda in lambda_values:
        HC, perpl = compute_perplexity_interpolated(validation_set, ngram_counts, alpha, lamda)
        print(f"Alpha: {alpha}, Lambda: {lamda}, Cross-Entropy: {HC}, Perplexity: {perpl}")
        if HC < best_interpolated_ce:
            best_interpolated_ce = HC
            best_interpolated_params = (alpha, lamda, HC, perpl)


Tuning hyperparameters for the bigram LM:


Computing Perplexity: 100%|██████████| 10000/10000 [00:01<00:00, 5044.64it/s]


Alpha: 0.0001, Cross-Entropy: 15.571766524590013, Perplexity: 48704.46861441161


Computing Perplexity: 100%|██████████| 10000/10000 [00:01<00:00, 5151.46it/s]


Alpha: 0.001, Cross-Entropy: 14.330228241264058, Perplexity: 20598.16559884154


Computing Perplexity: 100%|██████████| 10000/10000 [00:01<00:00, 5164.93it/s]


Alpha: 0.01, Cross-Entropy: 13.369777434082627, Perplexity: 10585.321255172768


Computing Perplexity: 100%|██████████| 10000/10000 [00:01<00:00, 5141.30it/s]


Alpha: 0.1, Cross-Entropy: 13.225124444839032, Perplexity: 9575.449149327107


Computing Perplexity: 100%|██████████| 10000/10000 [00:02<00:00, 4936.56it/s]


Alpha: 0.2, Cross-Entropy: 13.332784876705192, Perplexity: 10317.350239208281


Computing Perplexity: 100%|██████████| 10000/10000 [00:01<00:00, 5026.52it/s]


Alpha: 0.3, Cross-Entropy: 13.413407114210472, Perplexity: 10910.32996801804


Computing Perplexity: 100%|██████████| 10000/10000 [00:01<00:00, 5213.24it/s]


Alpha: 0.5, Cross-Entropy: 13.525765811997951, Perplexity: 11794.002723707026


Computing Perplexity: 100%|██████████| 10000/10000 [00:01<00:00, 5114.70it/s]


Alpha: 1.0, Cross-Entropy: 13.684862012267859, Perplexity: 13169.03500911066
Tuning hyperparameters for the trigram LM:


Computing Perplexity: 100%|██████████| 10000/10000 [00:02<00:00, 4065.17it/s]


Alpha: 0.0001, Cross-Entropy: 14.620505569248602, Perplexity: 25188.98849920805


Computing Perplexity: 100%|██████████| 10000/10000 [00:02<00:00, 4396.27it/s]


Alpha: 0.001, Cross-Entropy: 14.206140808074235, Perplexity: 18900.55277083627


Computing Perplexity: 100%|██████████| 10000/10000 [00:02<00:00, 4244.91it/s]


Alpha: 0.01, Cross-Entropy: 14.159908223480418, Perplexity: 18304.46800107179


Computing Perplexity: 100%|██████████| 10000/10000 [00:02<00:00, 4483.69it/s]


Alpha: 0.1, Cross-Entropy: 14.215964158776348, Perplexity: 19029.686298875287


Computing Perplexity: 100%|██████████| 10000/10000 [00:02<00:00, 4257.95it/s]


Alpha: 0.2, Cross-Entropy: 14.236179841975247, Perplexity: 19298.215691654135


Computing Perplexity: 100%|██████████| 10000/10000 [00:02<00:00, 4436.92it/s]


Alpha: 0.3, Cross-Entropy: 14.247777104562108, Perplexity: 19453.971710877144


Computing Perplexity: 100%|██████████| 10000/10000 [00:02<00:00, 4391.10it/s]


Alpha: 0.5, Cross-Entropy: 14.261719897904568, Perplexity: 19642.89427053955


Computing Perplexity: 100%|██████████| 10000/10000 [00:02<00:00, 4448.30it/s]


Alpha: 1.0, Cross-Entropy: 14.278660100343172, Perplexity: 19874.90164292532
Tuning hyperparameters for the interpolated LM:


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 3003.33it/s]


Alpha: 0.0001, Lambda: 0.1, Cross-Entropy: 13.74841975635381, Perplexity: 13762.16434002628


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 3085.07it/s]


Alpha: 0.0001, Lambda: 0.2, Cross-Entropy: 13.421845623129277, Perplexity: 10974.332889969628


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 3076.62it/s]


Alpha: 0.0001, Lambda: 0.3, Cross-Entropy: 13.251782886276027, Perplexity: 9754.031298146649


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 3071.61it/s]


Alpha: 0.0001, Lambda: 0.5, Cross-Entropy: 13.104681697956275, Perplexity: 8808.506410006923


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 3106.43it/s]


Alpha: 0.0001, Lambda: 0.7, Cross-Entropy: 13.126402758739374, Perplexity: 8942.129716459292


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 3051.85it/s]


Alpha: 0.0001, Lambda: 0.9, Cross-Entropy: 13.420482816570608, Perplexity: 10963.971149767824


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 2952.60it/s]


Alpha: 0.001, Lambda: 0.1, Cross-Entropy: 13.388771662904029, Perplexity: 10725.606896728907


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 3092.87it/s]


Alpha: 0.001, Lambda: 0.2, Cross-Entropy: 13.133015829305382, Perplexity: 8983.213017329448


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 3108.10it/s]


Alpha: 0.001, Lambda: 0.3, Cross-Entropy: 12.990919827534746, Perplexity: 8140.602319926089


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 3092.85it/s]


Alpha: 0.001, Lambda: 0.5, Cross-Entropy: 12.865048054124225, Perplexity: 7460.455769840212


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 3108.35it/s]


Alpha: 0.001, Lambda: 0.7, Cross-Entropy: 12.889988452461294, Perplexity: 7590.5486469191455


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 2993.29it/s]


Alpha: 0.001, Lambda: 0.9, Cross-Entropy: 13.170092411617906, Perplexity: 9217.069485687693


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 3098.93it/s]


Alpha: 0.01, Lambda: 0.1, Cross-Entropy: 13.054219315058122, Perplexity: 8505.72982676796


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 2903.07it/s]


Alpha: 0.01, Lambda: 0.2, Cross-Entropy: 12.95266768545199, Perplexity: 7927.595768650709


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 2895.11it/s]


Alpha: 0.01, Lambda: 0.3, Cross-Entropy: 12.898792610216251, Perplexity: 7637.012184923165


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 3069.01it/s]


Alpha: 0.01, Lambda: 0.5, Cross-Entropy: 12.874492605207696, Perplexity: 7509.455587400954


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 2707.68it/s]


Alpha: 0.01, Lambda: 0.7, Cross-Entropy: 12.956232928853806, Perplexity: 7947.210974771716


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 3135.24it/s]


Alpha: 0.01, Lambda: 0.9, Cross-Entropy: 13.264261707383362, Perplexity: 9838.766285362026


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 3074.05it/s]


Alpha: 0.1, Lambda: 0.1, Cross-Entropy: 13.171630429860954, Perplexity: 9226.900794201996


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 3035.82it/s]


Alpha: 0.1, Lambda: 0.2, Cross-Entropy: 13.165096248539044, Perplexity: 9185.205276259956


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 3094.58it/s]


Alpha: 0.1, Lambda: 0.3, Cross-Entropy: 13.176780061465879, Perplexity: 9259.894629423752


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 3134.30it/s]


Alpha: 0.1, Lambda: 0.5, Cross-Entropy: 13.243434492534057, Perplexity: 9697.750975674371


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 2921.69it/s]


Alpha: 0.1, Lambda: 0.7, Cross-Entropy: 13.381695482825537, Perplexity: 10673.128376878805


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 2810.90it/s]


Alpha: 0.1, Lambda: 0.9, Cross-Entropy: 13.688399665414488, Perplexity: 13201.366612424476


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 3093.42it/s]


Alpha: 0.2, Lambda: 0.1, Cross-Entropy: 13.313993041498236, Perplexity: 10183.832968208546


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 3032.68it/s]


Alpha: 0.2, Lambda: 0.2, Cross-Entropy: 13.322878081900287, Perplexity: 10246.745064096416


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 2722.21it/s]


Alpha: 0.2, Lambda: 0.3, Cross-Entropy: 13.344439653685741, Perplexity: 10401.036276166316


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 3096.49it/s]


Alpha: 0.2, Lambda: 0.5, Cross-Entropy: 13.422026880278523, Perplexity: 10975.711768527679


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 2755.93it/s]


Alpha: 0.2, Lambda: 0.7, Cross-Entropy: 13.560104693680929, Perplexity: 12078.089880640742


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 3124.12it/s]


Alpha: 0.2, Lambda: 0.9, Cross-Entropy: 13.836544304785765, Perplexity: 14629.007820328075


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 3059.63it/s]


Alpha: 0.3, Lambda: 0.1, Cross-Entropy: 13.408380719477208, Perplexity: 10872.38417705044


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 3082.23it/s]


Alpha: 0.3, Lambda: 0.2, Cross-Entropy: 13.423816716442913, Perplexity: 10989.336904925754


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 3075.38it/s]


Alpha: 0.3, Lambda: 0.3, Cross-Entropy: 13.449282648171627, Perplexity: 11185.03885910331


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 2978.48it/s]


Alpha: 0.3, Lambda: 0.5, Cross-Entropy: 13.529699897031351, Perplexity: 11826.207679021836


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 3116.57it/s]


Alpha: 0.3, Lambda: 0.7, Cross-Entropy: 13.66336565997831, Perplexity: 12974.269218489815


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 3077.34it/s]


Alpha: 0.3, Lambda: 0.9, Cross-Entropy: 13.915883410354882, Perplexity: 15456.04319086476


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 3131.74it/s]


Alpha: 0.5, Lambda: 0.1, Cross-Entropy: 13.532514455118235, Perplexity: 11849.301983139223


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 3095.41it/s]


Alpha: 0.5, Lambda: 0.2, Cross-Entropy: 13.553571800189744, Perplexity: 12023.520833735003


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 3046.43it/s]


Alpha: 0.5, Lambda: 0.3, Cross-Entropy: 13.581963683365942, Perplexity: 12262.48442318979


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 3097.51it/s]


Alpha: 0.5, Lambda: 0.5, Cross-Entropy: 13.662285258617828, Perplexity: 12964.55672183208


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 3107.96it/s]


Alpha: 0.5, Lambda: 0.7, Cross-Entropy: 13.786314718599488, Perplexity: 14128.44157281574


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 3093.16it/s]


Alpha: 0.5, Lambda: 0.9, Cross-Entropy: 14.004445186616572, Perplexity: 16434.559717363


Computing Perplexity: 100%|██████████| 10000/10000 [00:04<00:00, 2428.89it/s]


Alpha: 1.0, Lambda: 0.1, Cross-Entropy: 13.699871746504796, Perplexity: 13306.760257644217


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 2692.94it/s]


Alpha: 1.0, Lambda: 0.2, Cross-Entropy: 13.724104298991048, Perplexity: 13532.15792823409


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 3016.67it/s]


Alpha: 1.0, Lambda: 0.3, Cross-Entropy: 13.753030375680103, Perplexity: 13806.216338878241


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 2884.19it/s]


Alpha: 1.0, Lambda: 0.5, Cross-Entropy: 13.82730146463481, Perplexity: 14535.584492750848


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 2976.21it/s]


Alpha: 1.0, Lambda: 0.7, Cross-Entropy: 13.932713511762609, Perplexity: 15637.40513465185


Computing Perplexity: 100%|██████████| 10000/10000 [00:03<00:00, 2822.45it/s]

Alpha: 1.0, Lambda: 0.9, Cross-Entropy: 14.101701637133313, Perplexity: 17580.66031082511





In [172]:
# Print best paraneters
print("Best hyperparameters for the bigram LM:")
print(f"Alpha: {best_bigram_params[0]}, Cross-Entropy: {best_bigram_params[1]}, Perplexity: {best_bigram_params[2]}")
print("Best hyperparameters for the trigram LM:")
print(f"Alpha: {best_trigram_params[0]}, Cross-Entropy: {best_trigram_params[1]}, Perplexity: {best_trigram_params[2]}")
print("Best hyperparameters for the interpolated LM:")
print(f"Alpha: {best_interpolated_params[0]}, Lambda: {best_interpolated_params[1]}, Cross-Entropy: {best_interpolated_params[2]}, Perplexity: {best_interpolated_params[3]}")


Best hyperparameters for the bigram LM:
Alpha: 0.1, Cross-Entropy: 13.225124444839032, Perplexity: 9575.449149327107
Best hyperparameters for the trigram LM:
Alpha: 0.01, Cross-Entropy: 14.159908223480418, Perplexity: 18304.46800107179
Best hyperparameters for the interpolated LM:
Alpha: 0.001, Lambda: 0.5, Cross-Entropy: 12.865048054124225, Perplexity: 7460.455769840212


In [43]:
import json

In [None]:
# save the best hyperparameters
best_hyperparameters = {
    "bigram": best_bigram_params,
    "trigram": best_trigram_params,
    "interpolated": best_interpolated_params
}

with open('data/train/best_hyperparameters.json', 'w') as f:
    json.dump(best_hyperparameters, f)

### Context-Sensitive Correction based on N-gram model and beam seacrh approach

In this section I collect everything into a single system for Context-Sensitive Correction with beam seacrh approach.

In [44]:
best_hyperparameters = None
with open('data/train/best_hyperparameters.json', 'r') as f:
    best_hyperparameters = json.load(f)

In [45]:
# Load the best hyperparameters for the models
best_bigram_alpha = best_hyperparameters['bigram'][0]
best_trigram_alpha = best_hyperparameters['trigram'][0]
best_interpolated_alpha = best_hyperparameters['interpolated'][0]
best_interpolated_lambda = best_hyperparameters['interpolated'][1]

In [46]:
from functools import lru_cache

@lru_cache(maxsize=None)
def get_candidates(word, V):
    def known(words, V):
        return set(w for w in words if w in V)
    
    def edits1(word):
        "All edits that are one edit away from `word`."
        letters = 'abcdefghijklmnopqrstuvwxyz'
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [L + R[1:] for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
        inserts = [L + c + R for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)
    
    def edits2(word):
        "All edits that are two edits away from `word`."
        return (e2 for e1 in edits1(word) for e2 in edits1(e1))
    
    def edits3(word):
        "All edits that are three edits away from `word`."
        return (e3 for e2 in edits1(word) for e3 in edits1(e2))
    
    return known([word], V) or known(edits1(word), V) or known(edits2(word), V) or known(edits3(word), V) or {word}

The updated code leverages the `@lru_cache` decorator to cache results from the `get_candidates` function, significantly speeding up repeated lookups for the same word and vocabulary. Within this function, several helper functions are defined: 

- **`known(words, V)`:** Filters a list of words to include only those present in the vocabulary.  
- **`edits1(word)`:** Generates all possible words that are one edit away from the input word by performing deletions, transpositions, replacements, and insertions.  
- **`edits2(word)` and `edits3(word)`:** Extend this idea by generating candidates that are two or three edits away, respectively.

This hierarchical candidate generation process ensures that if no direct or one-edit correction is found, the function can still consider more distant alternatives, thereby increasing the robustness of the correction mechanism.

In comparison to the original implementation—which encapsulated methods like probability computation (`P`), candidate generation, and correction within a class—the improved version modularizes these functionalities and optimizes performance with caching. The addition of the `edits3` function further extends the search space for potential corrections. These improvements not only enhance clarity and maintainability but also lead to more efficient computation, especially when handling repeated queries.

In [60]:
import math
import heapq

def corrector_with_beam_search(sentence, V, ngram_counts, beam_width=2, n=3, alpha=0.01, lamda=0.5):
    # Preprocess the input sentence and tokenize it
    sentence_tokens = preprocess_sentences([sentence])[0]
    
    # Initialize the list of candidate sequences with an initial score of 0.0 and an empty sequence
    candidates_sequences = [(0.0, [])]
    
    # Total number of words in the vocabulary
    total_words = len(V)
    
    # Iterate over each word in the tokenized sentence
    for word in sentence_tokens:
        new_candidates_sequences = []
        
        # Get the list of candidate words for the current word
        candidate_list = get_candidates(word, V)
        
        # Iterate over each candidate word
        for candidate in candidate_list:
            # Iterate over each candidate sequence
            for score, sequence in candidates_sequences:
                # Determine the context length for n-gram probability calculation
                context_length = min(n - 1, len(sequence))
                
                # Compute the probability based on the context length
                if context_length == 0:
                    # Unigram probability
                    prob = compute_unigram_probability(candidate, ngram_counts, alpha, total_words, V)
                elif context_length == 1:
                    # Bigram probability
                    context = tuple(sequence[-1:])
                    prob = compute_bigram_probabilities(context[0], candidate, ngram_counts, best_bigram_alpha)
                elif context_length == 2:
                    # Trigram probability with interpolation
                    w1, w2 = sequence[-2], sequence[-1]
                    prob = compute_interpolated_prob(w1, w2, candidate, ngram_counts, best_interpolated_alpha, best_interpolated_lambda)
                else:
                    # General n-gram probability
                    context = tuple(sequence[-context_length:])
                    prob = compute_ngram_probability(context, candidate, ngram_counts, alpha, V)
                
                # Update the score with the log probability
                new_score = score + math.log(prob)
                
                # Create a new sequence by appending the candidate word
                new_seq = sequence + [candidate]
                
                # Add the new candidate sequence to the heap
                heapq.heappush(new_candidates_sequences, (new_score, new_seq))
                
                # Ensure the heap does not exceed the beam width
                if len(new_candidates_sequences) > beam_width:
                    heapq.heappop(new_candidates_sequences)
        
        # Update the list of candidate sequences for the next iteration
        candidates_sequences = new_candidates_sequences
    
    # Select the best sequence with the highest score
    best_score, best_seq = max(candidates_sequences, key=lambda x: x[0])
    
    # Return the best sequence as a single string
    return ' '.join(best_seq)

In [77]:
# Correct vocab into a frozenset for the lrucache decorator
vocabulary = frozenset(vocab)

In [94]:
# Test example for unigram
text = "Seh is ging t te perk"
corrector_with_beam_search(text, vocabulary, ngram_counts, n=1)

100%|██████████| 1/1 [00:00<00:00, 9341.43it/s]


'she is king to the per'

In [95]:
# Test example for unigram
text = "Seh is dking sport"
corrector_with_beam_search(text, vocabulary, ngram_counts, n=1)

100%|██████████| 1/1 [00:00<00:00, 6990.51it/s]


'she is king sport'

In [93]:
# Test example for bigram
text = "Seh is ging t te perk"
corrector_with_beam_search(text, vocabulary, ngram_counts, n=2)

100%|██████████| 1/1 [00:00<00:00, 3483.64it/s]


'she is going to the park'

In [96]:
# Test example for bigram
text = "Seh is dking sport"
corrector_with_beam_search(text, vocabulary, ngram_counts, n=2)

100%|██████████| 1/1 [00:00<00:00, 5915.80it/s]


'she is doing sport'

In [92]:
# Test example for trigram
text = "Seh is ging t te perk"
corrector_with_beam_search(text, vocabulary, ngram_counts, n=3)

100%|██████████| 1/1 [00:00<00:00, 7049.25it/s]


'she is going to the park'

In [91]:
# Test example for N-gram model
text = "Seh is ging t te perk"
corrector_with_beam_search(text, vocabulary, ngram_counts, n=4)

100%|██████████| 1/1 [00:00<00:00, 3609.56it/s]


'she is going tx ye peru'

In our tests, we applied beam search with different n-gram models to correct the sentence "Seh is ging t te perk". 

With the unigram model (n=1), the output was "she is king to the per". Here, the model relies solely on individual word frequencies, which often leads to inappropriate substitutions without context. Switching to the bigram model (n=2) produced "she is going to the park", demonstrating a significant improvement by taking into account the immediate context of adjacent words. 

However, when we moved to the trigram model (n=3), the output degraded to "she is going tx ye peru". This suggests that while adding context generally improves performance, using higher-order n-grams can suffer from data sparsity or overfitting issues if there isn’t enough reliable contextual data. In summary, the bigram and interpol model strikes a good balance, effectively using contextual information to enhance accuracy without the pitfalls encountered at higher n-gram orders.

### Computes metrics

In [65]:
# Define the functions to calculate WER and CER
def calculate_wer(reference, corrected):
    # Calculate Word Error Rate (WER)
    reference_words = reference.split()
    corrected_words = corrected.split()

    S = Levenshtein(reference_words, corrected_words)
    I = max(0, len(corrected_words) - len(reference_words))
    D = max(0, len(reference_words) - len(corrected_words))

    N = max(len(reference_words), len(corrected_words))

    wer = (S + I + D) / N

    return wer

def calculate_cer(reference, corrected):
    # Calculate Character Error Rate (CER)
    S = Levenshtein(reference, corrected)
    I = max(0, len(corrected) - len(reference))
    D = max(0, len(reference) - len(corrected))

    N = max(len(reference), len(corrected))

    cer = (S + I + D) / N

    return cer

def compute_metrics_for_ngram_models(df, vocabulary, ngram_counts, n = 3):
    WER = []
    CER = []
    
    corrected_sentences = []
    correct_predictions = 0
    total_predictions = 0
    # Correct the sentences using the beam search corrector
    for index, row in tqdm(df.iterrows(), total=df.shape[0]):
        misspelled_sentence = row['Misspelled Sentence']
        reference = row['Original Sentence']
        corrected_sentence = corrector_with_beam_search(misspelled_sentence, vocabulary, ngram_counts, n=n)
        corrected_sentences.append(corrected_sentence)
                
        # Calculate WER and CER
        wer = calculate_wer(reference, corrected_sentence)
        cer = calculate_cer(reference, corrected_sentence)
        WER.append(wer)
        CER.append(cer)
        
        # Compute the accuracy
        list_of_corrected_words = ast.literal_eval(row['Correct Words'])
        for word in list_of_corrected_words:
            if word in corrected_sentence:
                correct_predictions += 1
            total_predictions += 1
                    
    accuracy = correct_predictions / total_predictions
    
    if n == 2:
        sentences = corrected_sentences
        HC, perplexity = compute_perplexity_bigram_avg(sentences, ngram_counts, best_bigram_alpha)
        print(f"Perplexity: {perplexity}")
        print(f"Cross-Entropy: {HC}")
    if n == 3:
        sentences = corrected_sentences
        HC, perplexity = compute_perplexity_interpolated_avg(sentences, ngram_counts, best_interpolated_alpha, best_interpolated_lambda)
        print(f"Perplexity: {perplexity}")
        print(f"Cross-Entropy: {HC}")
        
        
    print('Average Word Error Rate (WER):', np.mean(WER))
    print('Average Character Error Rate (CER):', np.mean(CER))
    print(f"Accuracy: {accuracy}")
    df["Corrected Sentence"] = corrected_sentences
    return df

### Interpolated model perfomance

In [66]:
updated = compute_metrics_for_ngram_models(test_df.copy(), vocabulary, ngram_counts, n=3)

100%|██████████| 1/1 [00:00<00:00, 10727.12it/s]
100%|██████████| 1/1 [00:00<00:00, 8701.88it/s]
100%|██████████| 1/1 [00:00<00:00, 5949.37it/s]
100%|██████████| 1/1 [00:00<00:00, 13530.01it/s]
100%|██████████| 1/1 [00:00<00:00, 5178.15it/s]
100%|██████████| 1/1 [00:00<00:00, 3905.31it/s]
100%|██████████| 1/1 [00:00<00:00, 5706.54it/s]
100%|██████████| 1/1 [00:00<00:00, 5210.32it/s]
100%|██████████| 1/1 [00:00<00:00, 5440.08it/s]
100%|██████████| 1/1 [00:00<00:00, 13706.88it/s]
100%|██████████| 1/1 [00:00<00:00, 5504.34it/s]
100%|██████████| 1/1 [00:00<00:00, 12826.62it/s]
100%|██████████| 1/1 [00:00<00:00, 6668.21it/s]
100%|██████████| 1/1 [00:00<00:00, 12787.51it/s]
100%|██████████| 1/1 [00:00<00:00, 12157.40it/s]
100%|██████████| 1/1 [00:00<00:00, 5991.86it/s]
100%|██████████| 1/1 [00:00<00:00, 12520.31it/s]
100%|██████████| 1/1 [00:00<00:00, 13025.79it/s]
100%|██████████| 1/1 [00:00<00:00, 4975.45it/s]
100%|██████████| 1/1 [00:00<00:00, 12052.60it/s]
100%|██████████| 1/1 [00:00<00:

Perplexity: 98372.6070522449
Cross-Entropy: 16.585974779349506
Average Word Error Rate (WER): 0.1859610635597018
Average Character Error Rate (CER): 0.08163519000092576
Accuracy: 0.3605232781839169


In [67]:
# print examples of fixes
for i in range(10):
    print(f"Misspelled Sentence: {test_df['Misspelled Sentence'][i]}")
    print(f"Misspleed Words: {test_df['Misspelled Words'][i]}")
    print(f"Corrected: {updated['Corrected Sentence'][i]}")
    print(f"Target: {test_df['Original Sentence'][i]}")
    print("-----------------------------")

Misspelled Sentence: 1 nigel thrush page 48 i have four in my family dad mum and siter
Misspleed Words: ['siter']
Corrected: m1 nigel thrust page 48 in have four in my family dad sum and sister
Target: 1 nigel thrush page 48 i have four in my family dad mum and sister
-----------------------------
Misspelled Sentence: my siter go to tonbury
Misspleed Words: ['siter', 'go']
Corrected: my sister go to tonbury
Target: my sister goes to tonbury
-----------------------------
Misspelled Sentence: my mum goes out sometimes
Misspleed Words: ['sometimes']
Corrected: my um goes out sometimes
Target: my mum goes out sometimes
-----------------------------
Misspelled Sentence: i go to bridgebrook i go out sometimes on tuesday night i go to youth clob
Misspleed Words: ['sometimes', 'clob']
Corrected: it go to bridgebrook xi go out sometimes on tuesday night in go to youth club
Target: i go to bridgebrook i go out sometimes on tuesday night i go to youth club
-----------------------------
Misspelled

After applying the interpolated model, we observed a notable increase in accuracy—from **0.21 to 0.36**—indicating that the model now corrects a higher proportion of errors correctly. However, this improvement in accuracy comes with trade-offs in other metrics. Specifically, the interpolated model yielded:

- **Perplexity:** 98372.6070522449  
- **Cross-Entropy:** 16.585974779349506  
- **Average WER:** 0.1859610635597018  
- **Average CER:** 0.08163519000092576  
- **Accuracy:** 0.3605232781839169  

In contrast, our previous unigram model with preprocessing achieved a lower perplexity (4751.99), cross-entropy (10.07), and error rates, but its accuracy was lower. The high perplexity in the interpolated model is largely due to the relatively small corpus, which results in very low probability estimates when incorporating bigram and trigram counts. Additionally, the increased WER, CER, and cross-entropy indicate that the model may be over-correcting—modifying words that are already correct. 

This trade-off underscores the challenges in balancing rich contextual modeling with the preservation of correctly spelled words, emphasizing the need for a larger corpus or further tuning to optimize overall performance.

### Bigram model perfomance

In [68]:
updated = compute_metrics_for_ngram_models(test_df.copy(), vocabulary, ngram_counts, n=2)

100%|██████████| 1/1 [00:00<00:00, 1824.40it/s]
100%|██████████| 1/1 [00:00<00:00, 4848.91it/s]
100%|██████████| 1/1 [00:00<00:00, 13842.59it/s]
100%|██████████| 1/1 [00:00<00:00, 6288.31it/s]
100%|██████████| 1/1 [00:00<00:00, 13315.25it/s]
100%|██████████| 1/1 [00:00<00:00, 5262.61it/s]
100%|██████████| 1/1 [00:00<00:00, 9446.63it/s]
100%|██████████| 1/1 [00:00<00:00, 6141.00it/s]
100%|██████████| 1/1 [00:00<00:00, 4462.03it/s]
100%|██████████| 1/1 [00:00<00:00, 3858.61it/s]
100%|██████████| 1/1 [00:00<00:00, 3748.26it/s]
100%|██████████| 1/1 [00:00<00:00, 6990.51it/s]
100%|██████████| 1/1 [00:00<00:00, 7145.32it/s]
100%|██████████| 1/1 [00:00<00:00, 6502.80it/s]
100%|██████████| 1/1 [00:00<00:00, 2900.63it/s]
100%|██████████| 1/1 [00:00<00:00, 4854.52it/s]
100%|██████████| 1/1 [00:00<00:00, 12787.51it/s]
100%|██████████| 1/1 [00:00<00:00, 7943.76it/s]
100%|██████████| 1/1 [00:00<00:00, 13486.51it/s]
100%|██████████| 1/1 [00:00<00:00, 3194.44it/s]
100%|██████████| 1/1 [00:00<00:00, 1

Perplexity: 98373.00000000065
Cross-Entropy: 16.585974779349506
Average Word Error Rate (WER): 0.1869153014145821
Average Character Error Rate (CER): 0.08206445300109326
Accuracy: 0.3528280107733744





_Note:_ The interpolated model performs a little bit better due to its ability to combine both bigram and trigram probabilities. By interpolating these n-gram models, it effectively leverages additional context, which helps mitigate the issues of data sparsity that often affect higher-order n-grams.

In [69]:
# print examples of fixes
for i in range(10):
    print(f"Misspelled Sentence: {test_df['Misspelled Sentence'][i]}")
    print(f"Misspleed Words: {test_df['Misspelled Words'][i]}")
    print(f"Corrected: {updated['Corrected Sentence'][i]}")
    print(f"Target: {test_df['Original Sentence'][i]}")
    print("-----------------------------")

Misspelled Sentence: 1 nigel thrush page 48 i have four in my family dad mum and siter
Misspleed Words: ['siter']
Corrected: m1 nigel thrust page 48 in have four in my family dad sum and sister
Target: 1 nigel thrush page 48 i have four in my family dad mum and sister
-----------------------------
Misspelled Sentence: my siter go to tonbury
Misspleed Words: ['siter', 'go']
Corrected: my sister go to tonbury
Target: my sister goes to tonbury
-----------------------------
Misspelled Sentence: my mum goes out sometimes
Misspleed Words: ['sometimes']
Corrected: my um goes out sometimes
Target: my mum goes out sometimes
-----------------------------
Misspelled Sentence: i go to bridgebrook i go out sometimes on tuesday night i go to youth clob
Misspleed Words: ['sometimes', 'clob']
Corrected: it go to bridgebrook xi go out sometimes on tuesday night in go to youth club
Target: i go to bridgebrook i go out sometimes on tuesday night i go to youth club
-----------------------------
Misspelled

# **Summary and Future Directions**

In this assignment, we extended Norvig’s classic spelling corrector by incorporating context-sensitive techniques using n-gram language models. We built a more diverse corpus by combining texts from multiple sources (Gutenberg, Reuters, and Brown) and applied preprocessing steps—such as lowercasing, noise reduction, tokenization, and frequency filtering—to improve vocabulary quality. By integrating bigram and trigram models, and then employing an interpolated model with beam search and careful hyperparameter tuning (adjusting alpha and lambda), we achieved a notable increase in accuracy (from 0.21 to 0.36). However, the interpolated model also showed higher perplexity, cross-entropy, WER, and CER, reflecting the challenges of a small corpus and sparse data when using higher-order n-grams.

**Future Improvements:**

1. **Increasing Data Size:**  
   Expanding the corpus with more diverse and larger datasets will help improve probability estimates, reduce perplexity, and better capture the nuances of language.

2. **Handling Keyboard Misspellings:**  
   Incorporate techniques specifically designed for keyboard errors. For example:
   - **Proximity-Based Corrections:** Use the physical layout of keyboards to determine likely mistypes (e.g., substituting letters that are adjacent).
   - **Error Modeling:** Develop a model that learns common typing errors, such as transpositions or repeated characters, to better predict intended words.

3. **Advanced Data Structures and Storage:**  
   Explore more efficient storage mechanisms for n-grams—such as trie data structures, database solutions with indexing, or compressed storage—to manage larger corpora more effectively.

4. **Leveraging Advanced Models:**  
   Investigate modern approaches like transformer-based models or other deep learning techniques that may further improve context-sensitive corrections.

These future directions aim to enhance both the robustness and efficiency of the spelling correction system while addressing the limitations observed with the current dataset and n-gram approaches.


# **NOTE RELATED TO THE MAIN SOLUTION OF ASSIGNMENT, ONLY FOR ADDITIONAL POINTS OR RESPECT FOR THE CREATIVITY AND RESOURCEFULNESS!!!**

### N-gram model based on the Google Books n-gram API

During the assignment completing, I try to discover effective tools to:
1) Have the diverge and big base of N-gram data
2) Have the ability to not compute stats by own and collect train data

In [71]:
import re
import requests
import urllib

class ContextualSpellingCorrector:
    def __init__(self, vocabulary):
        self.vocabulary = set(vocabulary)

    def run_query(self, query, start_year=2010, end_year=2019, corpus=26, smoothing=3):
        """Fetches frequency data from the Google Books Ngram API."""
        query = urllib.parse.quote(query)
        url = f'https://books.google.com/ngrams/json?content={query}&year_start={start_year}&year_end={end_year}&corpus={corpus}&smoothing={smoothing}'
        response = requests.get(url)
        
        try:
            output = response.json()
        except:
            return {}

        if not output:
            return {}

        return {entry['ngram']: sum(entry['timeseries']) / len(entry['timeseries']) for entry in output}

    def average_frequency(self, phrase):
        """Gets the average frequency of a word or n-gram phrase from Google Ngrams."""
        data = self.run_query(phrase)
        return sum(data.values()) / len(data) if data else 0

    def words(self, text):
        """Tokenizes and lowercases the input text."""
        return re.findall(r'\w+', text.lower())

    def correction_with_context(self, word, context_window):
        """Finds the best spelling correction by considering context-based n-gram probabilities."""
        candidates = self.candidates(word)
        
        # Generate n-grams with the surrounding context
        context_phrases = {candidate: self.form_context_phrases(candidate, context_window) for candidate in candidates}

        # Get frequencies for each candidate within its context
        context_frequencies = {
            candidate: sum(self.average_frequency(phrase) for phrase in phrases)
            for candidate, phrases in context_phrases.items()
        }

        return max(context_frequencies, key=context_frequencies.get)  # Return the best correction

    def form_context_phrases(self, candidate, context_window):
        """Forms bigram and trigram phrases including the candidate."""
        left_context, right_context = context_window
        phrases = []

        if left_context:
            phrases.append(f"{left_context} {candidate}")
        if right_context:
            phrases.append(f"{candidate} {right_context}")
        if left_context and right_context:
            phrases.append(f"{left_context} {candidate} {right_context}")

        return phrases

    def candidates(self, word):
        """Generates possible spelling corrections based on known words."""
        return (self.known([word]) or self.known(self.edits1(word)) or self.known(self.edits2(word)) or [word])

    def known(self, words):
        """Filters words that exist in the vocabulary."""
        return set(w for w in words if w in self.vocabulary)

    def edits1(self, word):
        """Generates possible single-edit variations of a word."""
        letters = 'abcdefghijklmnopqrstuvwxyz'
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [L + R[1:] for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
        inserts = [L + c + R for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def edits2(self, word):
        """Generates possible double-edit variations of a word."""
        return (e2 for e1 in self.edits1(word) for e2 in self.edits1(e1))

# Load corpus and build vocabulary
with open("data/train/language_corpus.txt") as f:
    train_corpus = f.read()

vocabulary = set(re.findall(r'\w+', train_corpus.lower()))
vocabulary.update(["<START>", "<END>"])  # Add special tokens

# Initialize corrector with updated vocabulary
corrector = ContextualSpellingCorrector(vocabulary)

In [74]:
# Initialize corrector
corrector = ContextualSpellingCorrector(vocabulary)

# Example usage with context-aware spelling correction
sentence = ["this", "is", "a", "speling", "error"]
corrected_list = []

for i, word in enumerate(sentence):
    left_context = sentence[i - 1] if i > 0 else "<START>"
    right_context = sentence[i + 1] if i < len(sentence) - 1 else "<END>"
    
    corrected_word = corrector.correction_with_context(word, (left_context, right_context))
    corrected_list.append(corrected_word)

corrected_sentence = " ".join(corrected_list)
print(f"Corrected sentence: {corrected_sentence}")

Corrected sentence: this is a spelling error


We see that this solution have pontentia for future improvements, for me it take a lot to get the representative stats, due to the free API limitations (speed and search time). But I am very like to share with you this discovery.