# Language Model Benchmark for Word Autocorrection

In this section, we build and benchmark **autocorrection models** that can be used to
reconstruct words from noisy letter predictions (like EEG outputs).
Each model ranks candidate words by how likely they are, helping us
choose the most likely word given partial or uncertain inputs.



**Autocorrect (Python library)**

The autocorrect library implements a fast, word-level spell corrector based on Peter Norvig’s approach: it generates candidate words within a small edit distance of the input and uses word frequencies to pick the most likely one. It is lightweight (pure Python), easy to set up, and well suited to real-time use where you need quick, per-word fixes without heavy context modeling.
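
To make the idea concrete, here is a minimal sketch of a Norvig-style corrector. It is not the autocorrect library's code: the word counts in `WORD_COUNTS` are made up for illustration, and the function names are our own.

```python
from collections import Counter

# Toy word-frequency table (assumed values); a real corrector loads counts from a large corpus.
WORD_COUNTS = Counter({"spelling": 50, "would": 400, "world": 300, "how": 900, "are": 1000})
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
    inserts = [l + c + r for l, r in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    """Keep only the words that appear in the frequency table."""
    return {w for w in words if w in WORD_COUNTS}

def correct(word):
    """Prefer the word itself, then known words at distance 1, then at distance 2."""
    word = word.lower()
    candidates = (
        known([word])
        or known(edits1(word))
        or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
        or {word}  # nothing known nearby: leave the word unchanged
    )
    return max(candidates, key=lambda w: WORD_COUNTS[w])

print(correct("ould"), correct("speling"))
```

With this toy table, `correct("ould")` picks `would` because it is the only known word one edit away. Words that are already in the table are never changed, which is why isolated-word correctors cannot fix real-word errors.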

In [3]:
# === Autocorrect: end-to-end word-based autocorrection with timing ===
# This cell:
# 1) Ensures the autocorrect package is installed
# 2) Lets you input a sentence
# 3) Returns the autocorrected sentence (word-based)
# 4) Prints the time taken for the correction

import sys, subprocess, re, time

# --- 1) Ensure dependency is installed ---
try:
    from autocorrect import Speller
except Exception:
    print("Installing 'autocorrect'...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "autocorrect"])
    from autocorrect import Speller

# --- 2) Configure the word-level speller ---
spell = Speller(lang='en')  # change 'en' if you need another supported language

def autocorrect_sentence(sentence: str) -> str:
    """
    Word-based autocorrection that preserves whitespace and punctuation.
    Only alphabetic word tokens are corrected to avoid mangling numbers/symbols.
    """
    tokens = re.findall(r"\w+|[^\w\s]+|\s+", sentence)  # words, punctuation, spaces
    corrected_parts = []
    for tok in tokens:
        if tok.isalpha():  # correct only alphabetic tokens
            corrected_parts.append(spell(tok))
        else:
            corrected_parts.append(tok)
    return "".join(corrected_parts)

# --- 3) Get user input, run correction, measure latency ---
try:
    user_text = input("Enter a sentence to autocorrect: ").strip()
except EOFError:
    # Fallback for environments without stdin
    user_text = "I liek to wrok with pyhton and machne leraning."

start = time.perf_counter()
corrected = autocorrect_sentence(user_text)
elapsed_ms = (time.perf_counter() - start) * 1000.0

print("\nOriginal : ", user_text)
print("Corrected: ", corrected)
print(f"Latency  : {elapsed_ms:.2f} ms")


Enter a sentence to autocorrect: how ould are yo

Original :  how ould are yo
Corrected:  how would are yo
Latency  : 0.37 ms
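
A single `perf_counter` measurement on one short sentence is noisy. As a rough sketch, the helper below repeats the call and reports the median instead. The name `median_latency_ms`, the repeat counts, and the stand-in `fake_corrector` are assumptions for illustration; in the notebook you would pass `autocorrect_sentence`.

```python
import statistics
import time

def median_latency_ms(func, text, repeats=20, warmup=2):
    """Call func(text) repeatedly and return the median wall-clock time in milliseconds."""
    for _ in range(warmup):          # warm-up calls are not timed
        func(text)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(text)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)

# Stand-in corrector so the sketch runs on its own (hypothetical).
def fake_corrector(text):
    return text.replace("ould", "would")

latency = median_latency_ms(fake_corrector, "how ould are yo")
print(f"Median latency: {latency:.4f} ms")
```

The median is less sensitive than the mean to an occasional slow call caused by garbage collection or other processes.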


**PySpellChecker**

The pyspellchecker library is a pure Python implementation of Peter Norvig’s algorithm for spell correction. It works by checking words against a frequency dictionary and finding the closest candidates based on edit distance. It’s lightweight, dependency-free, and offers a good balance between accuracy and speed for single-word corrections.

In [6]:
# === PySpellChecker: end-to-end word-based autocorrection with timing ===
# This cell:
# 1) Ensures the pyspellchecker package is installed
# 2) Lets you input a sentence
# 3) Returns the autocorrected sentence (word-based)
# 4) Prints the time taken for the correction

import sys, subprocess, re, time

# --- 1) Ensure dependency is installed ---
try:
    from spellchecker import SpellChecker
except Exception:
    print("Installing 'pyspellchecker'...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "pyspellchecker"])
    from spellchecker import SpellChecker

# --- 2) Configure the spell checker ---
spell = SpellChecker(language='en')  # English by default

def autocorrect_sentence(sentence: str) -> str:
    """
    Word-based autocorrection using pyspellchecker.
    Preserves whitespace and punctuation.
    """
    tokens = re.findall(r"\w+|[^\w\s]+|\s+", sentence)
    corrected_parts = []
    for tok in tokens:
        if tok.isalpha():
            # Correct word if misspelled, else keep original
            corrected = spell.correction(tok)
            corrected_parts.append(corrected if corrected else tok)
        else:
            corrected_parts.append(tok)
    return "".join(corrected_parts)

# --- 3) Get user input, run correction, measure latency ---
try:
    user_text = input("Enter a sentence to autocorrect: ").strip()
except EOFError:
    user_text = "I liek to wrok with pyhton and machne leraning."

start = time.perf_counter()
corrected = autocorrect_sentence(user_text)
elapsed_ms = (time.perf_counter() - start) * 1000.0

print("\nOriginal : ", user_text)
print("Corrected: ", corrected)
print(f"Latency  : {elapsed_ms:.2f} ms")


Enter a sentence to autocorrect: i am fine thannk yo whar abour yo

Original :  i am fine thannk yo whar abour yo
Corrected:  i am fine thank yo what about yo
Latency  : 2.75 ms


**SymSpell**

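SymSpell is one of the fastest spell correction algorithms. Instead of generating every insertion, substitution, and transposition of the input at lookup time, it precomputes the delete-only variants of each dictionary word (up to a maximum edit distance) and stores them in a hash table. A lookup then only needs the deletes of the input word plus an edit-distance check on the few candidates that match, trading some extra index memory for speed. This allows real-time corrections even on large vocabularies, making it one of the best options for applications where speed and scalability are critical.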
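
The symmetric-delete idea can be sketched in a few lines of plain Python. This is an illustration only, not symspellpy's implementation: it uses plain Levenshtein distance, has no prefix optimization, and the names `build_index`, `lookup`, and the toy `WORD_COUNTS` are our own.

```python
from collections import defaultdict
from itertools import combinations

def deletes_within(word, max_distance):
    """Return `word` plus every string made by deleting up to `max_distance` characters."""
    variants = {word}
    for d in range(1, min(max_distance, len(word)) + 1):
        for keep in combinations(range(len(word)), len(word) - d):
            variants.add("".join(word[i] for i in keep))
    return variants

def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def build_index(word_counts, max_distance=2):
    """Precompute: delete-variant -> set of dictionary words that produce it."""
    index = defaultdict(set)
    for word in word_counts:
        for variant in deletes_within(word, max_distance):
            index[variant].add(word)
    return index

def lookup(term, index, word_counts, max_distance=2):
    """Candidates share a delete-variant with `term`; verify and rank them."""
    candidates = set()
    for variant in deletes_within(term, max_distance):
        candidates |= index.get(variant, set())
    scored = [(levenshtein(term, w), -word_counts[w], w) for w in candidates]
    return [w for dist, _, w in sorted(scored) if dist <= max_distance]

# Toy dictionary with assumed counts.
WORD_COUNTS = {"this": 100, "thus": 20, "is": 90, "impressive": 5}
INDEX = build_index(WORD_COUNTS)
print(lookup("thiis", INDEX, WORD_COUNTS))
```

Results are ordered by edit distance first and frequency second, which mirrors how the closest suggestion is chosen in the cell below.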

In [9]:
# === SymSpell: end-to-end word-based autocorrection with timing ===
# This cell:
# 1) Ensures the symspellpy package is installed
# 2) Initializes SymSpell with a frequency dictionary
# 3) Lets you input a sentence
# 4) Returns the autocorrected sentence (word-based)
# 5) Prints the time taken for the correction

import sys, subprocess, re, time, os

# --- 1) Ensure dependency is installed ---
try:
    from symspellpy import SymSpell, Verbosity
except Exception:
    print("Installing 'symspellpy'...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "symspellpy"])
    from symspellpy import SymSpell, Verbosity

# --- 2) Configure SymSpell ---
sym_spell = SymSpell()  # initialize with defaults

# Download dictionary if not already available
dict_path = "frequency_dictionary_en_82_765.txt"
if not os.path.exists(dict_path):
    import urllib.request
    print("Downloading frequency dictionary...")
    url = "https://raw.githubusercontent.com/mammothb/symspellpy/master/symspellpy/frequency_dictionary_en_82_765.txt"
    urllib.request.urlretrieve(url, dict_path)

# Load dictionary (word -> frequency)
sym_spell.load_dictionary(dict_path, term_index=0, count_index=1)

def autocorrect_sentence(sentence: str) -> str:
    """
    Word-based autocorrection using SymSpell.
    Preserves whitespace and punctuation.
    """
    tokens = re.findall(r"\w+|[^\w\s]+|\s+", sentence)
    corrected_parts = []
    for tok in tokens:
        if tok.isalpha():
            suggestions = sym_spell.lookup(tok, Verbosity.CLOSEST, max_edit_distance=2)
            corrected_parts.append(suggestions[0].term if suggestions else tok)
        else:
            corrected_parts.append(tok)
    return "".join(corrected_parts)

# --- 3) Get user input, run correction, measure latency ---
try:
    user_text = input("Enter a sentence to autocorrect: ").strip()
except EOFError:
    user_text = "I liek to wrok with pyhton and machne leraning."

start = time.perf_counter()
corrected = autocorrect_sentence(user_text)
elapsed_ms = (time.perf_counter() - start) * 1000.0

print("\nOriginal : ", user_text)
print("Corrected: ", corrected)
print(f"Latency  : {elapsed_ms:.2f} ms")


Enter a sentence to autocorrect: thiis ies actwallu veru impresive

Original :  thiis ies actwallu veru impresive
Corrected:  this is actually very impressive
Latency  : 0.97 ms


**Hunspell (via spylls, pure-Python)**

Hunspell underpins spell-checking in browsers and office suites. It’s morphology-aware (handles affixes and compounds) and very accurate with high-quality dictionaries. Native bindings can be tricky to install, so we’ll use spylls, a pure-Python implementation that loads standard Hunspell dictionaries. It is portable, but generating suggestions in pure Python is much slower than the other libraries here (about 200 ms for the eight-word sample run below), so real-time use may need caching or a native binding.
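
As a toy illustration of what "morphology-aware" means, the sketch below accepts a word if it is a stem or a stem plus a suffix that the stem is allowed to take. The flags, stems, and rules are invented for this example and are far simpler than the real Hunspell `.aff`/`.dic` format.

```python
# Toy stems, each with the (made-up) suffix classes it accepts.
STEMS = {"walk": {"V"}, "school": {"N"}, "quiet": {"A", "V"}}

# Suffix classes: flag -> suffixes that may be attached to stems carrying that flag.
SUFFIX_RULES = {
    "V": ["ed", "ing", "s"],
    "N": ["s"],
    "A": ["er", "est"],
}

def is_known(word):
    """True if `word` is a stem, or a stem plus a suffix allowed by one of its flags."""
    word = word.lower()
    if word in STEMS:
        return True
    for flag, suffixes in SUFFIX_RULES.items():
        for suffix in suffixes:
            if word.endswith(suffix):
                stem = word[: -len(suffix)]
                if flag in STEMS.get(stem, set()):
                    return True
    return False

print(is_known("walked"), is_known("quieter"), is_known("schooled"))
```

Storing stems plus rules keeps the dictionary small while still recognizing inflected forms; here `schooled` is rejected because the toy stem `school` does not carry the verb flag.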

In [13]:
# === Hunspell (spylls): end-to-end word-based autocorrection with timing (fixed) ===
# This cell:
# 1) Ensures 'spylls' is installed
# 2) Ensures/enables en_US Hunspell dictionary files
# 3) Lets you input a sentence
# 4) Returns the autocorrected sentence (word-based), preserving punctuation/whitespace
# 5) Prints latency

import sys, subprocess, os, re, time, urllib.request

# --- 1) Ensure dependency is installed ---
try:
    from spylls.hunspell import Dictionary
except Exception:
    print("Installing 'spylls' (pure-Python Hunspell)...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "spylls"])
    from spylls.hunspell import Dictionary

# --- 2) Ensure Hunspell dictionary files are present (en_US) ---
DICT_DIR = os.path.join(os.getcwd(), "hunspell_en_us")
os.makedirs(DICT_DIR, exist_ok=True)
AFF_PATH = os.path.join(DICT_DIR, "en_US.aff")
DIC_PATH = os.path.join(DICT_DIR, "en_US.dic")
BASE_PATH = os.path.join(DICT_DIR, "en_US")

if not (os.path.exists(AFF_PATH) and os.path.exists(DIC_PATH)):
    print("Downloading Hunspell en_US dictionary...")
    aff_url = "https://cgit.freedesktop.org/libreoffice/dictionaries/plain/en/en_US.aff"
    dic_url = "https://cgit.freedesktop.org/libreoffice/dictionaries/plain/en/en_US.dic"
    urllib.request.urlretrieve(aff_url, AFF_PATH)
    urllib.request.urlretrieve(dic_url, DIC_PATH)

# --- 3) Load dictionary ---
hun = Dictionary.from_files(BASE_PATH)

def hunspell_autocorrect_sentence(sentence: str) -> str:
    """
    Word-based autocorrection using Hunspell (spylls).
    - Corrects only alphabetic tokens
    - Preserves whitespace/punctuation
    - Preserves casing (UPPER/Title/lower) where possible
    """
    def apply_casing(src: str, dst: str) -> str:
        if src.isupper():
            return dst.upper()
        if src.istitle():
            return dst.capitalize()
        if src.islower():
            return dst.lower()
        return dst

    tokens = re.findall(r"\w+|[^\w\s]+|\s+", sentence)
    out = []
    for tok in tokens:
        if tok.isalpha():
            # If known, keep; else pick first suggestion (generator -> next)
            known = hun.lookup(tok)
            if known:
                out.append(tok)
            else:
                sugg_iter = hun.suggest(tok)
                first_suggestion = next(sugg_iter, None)
                out.append(apply_casing(tok, first_suggestion) if first_suggestion else tok)
        else:
            out.append(tok)
    return "".join(out)

# --- 4) Get user input, run correction, measure latency ---
try:
    user_text = input("Enter a sentence to autocorrect: ").strip()
except EOFError:
    user_text = "i wnt to go to mu schoul tomorow"

start = time.perf_counter()
corrected = hunspell_autocorrect_sentence(user_text)
elapsed_ms = (time.perf_counter() - start) * 1000.0

print("\nOriginal : ", user_text)
print("Corrected: ", corrected)
print(f"Latency  : {elapsed_ms:.2f} ms")


Enter a sentence to autocorrect: i wnt to go to mu schoul tommorrrow

Original :  i wnt to go to mu schoul tommorrrow
Corrected:  i wt to go to mu school tomorrow
Latency  : 198.50 ms
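
When a corrector is this slow per word, one cheap mitigation is to memoize per-word results, since everyday text repeats many words. The sketch below shows the pattern with a stand-in corrector; `slow_correct_word` and its tiny lookup table are hypothetical placeholders for a real per-word call.

```python
from functools import lru_cache

CALLS = {"count": 0}  # counts how often the expensive function actually runs

def slow_correct_word(word):
    """Stand-in for an expensive per-word corrector (hypothetical)."""
    CALLS["count"] += 1
    return {"schoul": "school", "tommorrrow": "tomorrow"}.get(word, word)

@lru_cache(maxsize=10_000)
def cached_correct_word(word):
    """Same result as slow_correct_word, but repeated words are served from the cache."""
    return slow_correct_word(word)

def correct_words(words):
    # Lowercase before caching so "School" and "school" share one cache entry;
    # the caller would need to restore the original casing afterwards.
    return [cached_correct_word(w.lower()) for w in words]

first = correct_words(["schoul", "to", "schoul", "to"])
print(first, "expensive calls:", CALLS["count"])
```

How much this helps depends on how often words repeat in the input; it does nothing for the first occurrence of each word.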


**SymSpell + Left-to-Right Context**


This setup keeps SymSpell’s ultra-fast candidate generation but selects the best correction using only past words via a left-to-right scorer. If kenlm is available, we use a compact n-gram LM to score P(word | history_so_far) and greedily keep the highest-scoring candidate at each position (there is no beam search). If kenlm isn’t available, we fall back to a pure-Python bigram model with add-k smoothing and unigram backoff, trained on a small embedded corpus, still enforcing the “use only previous words” rule.
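
For reference, the fallback bigram scorer in the cell below estimates the probability of a candidate word $w$ given the previous word $w_{\text{prev}}$ with add-k smoothing, as read from its `logp` method:

$$
P(w \mid w_{\text{prev}}) =
\begin{cases}
\dfrac{c(w_{\text{prev}}, w) + k}{c(w_{\text{prev}}) + kV} & \text{if } c(w_{\text{prev}}) > 0,\\[2ex]
\dfrac{c(w) + k}{N + kV} & \text{otherwise.}
\end{cases}
$$

Here $c(w_{\text{prev}}, w)$ is the bigram count, $c(w_{\text{prev}})$ is the number of bigrams that start with $w_{\text{prev}}$, $c(w)$ is the unigram count, $N$ is the total unigram count, $V$ is the vocabulary size (including `<s>` and `</s>`), and $k = 10^{-3}$. The sentence-start symbol `<s>` plays the role of $w_{\text{prev}}$ for the first word.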

In [14]:
# === Left-to-Right Context-Aware Autocorrection (past words only) ===
# Uses SymSpell for candidates + LM scorer that sees ONLY previous words.
# If 'kenlm' is available, it will be used for scoring; otherwise a tiny
# pure-Python bigram backoff model is used as a fallback.
#
# What this cell does:
# 1) Installs symspellpy if missing; tries to import kenlm (optional, not auto-installed).
# 2) Loads SymSpell dictionary.
# 3) Builds a scorer:
#       - Preferred: KenLM (if installed)
#       - Fallback : Tiny bigram model trained on an embedded mini-corpus
# 4) Processes input sentence left->right. For each token:
#       - If alphabetic: generate candidates with SymSpell (incl. original)
#       - Pick candidate that maximizes LM score given ONLY previous tokens.
# 5) Prints corrected sentence and latency (ms).
#
# Note: This is word-based, preserves punctuation/whitespace, and *never*
# looks at future words.

import sys, subprocess, os, re, time, math
from collections import defaultdict, Counter

# -----------------------------
# 1) Ensure dependencies
# -----------------------------
try:
    from symspellpy import SymSpell, Verbosity
except Exception:
    print("Installing 'symspellpy'...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "symspellpy"])
    from symspellpy import SymSpell, Verbosity

# Try kenlm (optional)
KENLM_OK = True
try:
    import kenlm  # type: ignore
except Exception:
    KENLM_OK = False

# -----------------------------
# 2) SymSpell dictionary
# -----------------------------
sym_spell = SymSpell()  # defaults
dict_path = "frequency_dictionary_en_82_765.txt"
if not os.path.exists(dict_path):
    import urllib.request
    print("Downloading SymSpell frequency dictionary...")
    url = "https://raw.githubusercontent.com/mammothb/symspellpy/master/symspellpy/frequency_dictionary_en_82_765.txt"
    urllib.request.urlretrieve(url, dict_path)

sym_spell.load_dictionary(dict_path, term_index=0, count_index=1)

# -----------------------------
# 3) Scorer(s)
# -----------------------------

# (A) KenLM scorer (if available). We'll train a tiny model on-the-fly from a mini-corpus
# to avoid external files. This is just to enable LM behavior; replace with your own LM binary
# for better quality.
def _train_tiny_arpa(corpus_text, arpa_path="tiny.arpa", order=3):
    """
    Minimal ARPA generator for demo purposes:
    Builds unigram/bigram/trigram with add-k smoothing (k=1e-3) from corpus.
    `order` is accepted for readability at the call site but is not used:
    the file always contains 1- to 3-grams.
    This is *very* small and only for illustration. For production, use a proper
    KenLM-built model trained on large text.
    """
    # Split into sentences (roughly) and tokenize each one into lowercase words
    sents = re.split(r"[.!?]+", corpus_text.lower())
    sents = [re.findall(r"[a-z]+", s) for s in sents if s.strip()]
    # Counts
    uni = Counter()
    bi  = Counter()
    tri = Counter()
    for s in sents:
        prev1 = "<s>"
        prev2 = None
        uni[prev1] += 1
        for w in s + ["</s>"]:
            uni[w] += 1
            bi[(prev1, w)] += 1
            if prev2 is not None:
                tri[(prev2, prev1, w)] += 1
            prev2, prev1 = prev1, w
    V = len(uni)
    k = 1e-3  # tiny smoothing

    # Write add-k smoothed log10 probabilities without backoff weights
    # (not a perfect ARPA file; intended to be just enough for kenlm.Model to load)
    def log10(x):
        return -99 if x <= 0 else math.log10(x)

    total_unigrams = sum(uni.values())

    with open(arpa_path, "w", encoding="utf-8") as f:
        f.write("\\data\\\n")
        f.write(f"ngram 1={len(uni)}\n")
        f.write(f"ngram 2={len(bi)}\n")
        f.write(f"ngram 3={len(tri)}\n\n")

        # Unigrams
        f.write("\\1-grams:\n")
        for w, c in uni.items():
            p = (c + k) / (total_unigrams + k * V)
            f.write(f"{log10(p):.6f}\t{w}\n")
        f.write("\n")

        # Bigrams
        f.write("\\2-grams:\n")
        prev_totals = defaultdict(int)
        for (w1, w2), c in bi.items():
            prev_totals[w1] += c
        for (w1, w2), c in bi.items():
            p = (c + k) / (prev_totals[w1] + k * V)
            f.write(f"{log10(p):.6f}\t{w1} {w2}\n")
        f.write("\n")

        # Trigrams
        f.write("\\3-grams:\n")
        prev2_totals = defaultdict(int)
        for (w0, w1, w2), c in tri.items():
            prev2_totals[(w0, w1)] += c
        for (w0, w1, w2), c in tri.items():
            p = (c + k) / (prev2_totals[(w0, w1)] + k * V)
            f.write(f"{log10(p):.6f}\t{w0} {w1} {w2}\n")
        f.write("\n\\end\\\n")

# Tiny embedded corpus (neutral, handcrafted sentences)
TINY_CORPUS = """
I want to go to school today. I went to school yesterday. I will go to school tomorrow.
Which of the two schools do you recommend for tomorrow?
I ate an apple today. I like to eat an apple every day.
He wants to go to the new school. She went to the old school.
"""

KENLM_MODEL = None
if KENLM_OK:
    try:
        if not os.path.exists("tiny.arpa"):
            _train_tiny_arpa(TINY_CORPUS, "tiny.arpa", order=3)
        KENLM_MODEL = kenlm.Model("tiny.arpa")
    except Exception as e:
        # If kenlm fails to load, disable it and fall back
        KENLM_MODEL = None
        KENLM_OK = False

# (B) Fallback: tiny bigram backoff scorer (pure Python)
class TinyBigramLM:
    def __init__(self, corpus_text):
        sents = [re.findall(r"[a-z]+", s) for s in re.split(r"[.!?]+", corpus_text.lower()) if s.strip()]
        self.uni = Counter()
        self.bi  = Counter()
        self.ctx = Counter()  # how many bigrams start with each word
        for s in sents:
            prev = "<s>"
            self.uni[prev] += 1
            for w in s + ["</s>"]:
                self.uni[w] += 1
                self.bi[(prev, w)] += 1
                self.ctx[prev] += 1
                prev = w
        self.V = max(1, len(self.uni))
        self.k = 1e-3
        self.total_uni = sum(self.uni.values())

    def logp(self, history_tokens, word):
        # Use only the immediate previous word for context (past-only)
        prev = history_tokens[-1] if history_tokens else "<s>"
        c_bigram = self.bi.get((prev, word), 0)
        c_prev   = self.ctx.get(prev, 0)  # precomputed; avoids scanning the vocabulary on every call
        if c_prev == 0:
            # backoff to unigram
            p = (self.uni.get(word, 0) + self.k) / (self.total_uni + self.k * self.V)
        else:
            p = (c_bigram + self.k) / (c_prev + self.k * self.V)
        return math.log(p + 1e-12)

FALLBACK_LM = TinyBigramLM(TINY_CORPUS)

def lm_logp(history_tokens, cand_word):
    # Only use past words; never peek ahead.
    if KENLM_MODEL is not None:
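        # Score history + candidate with KenLM. The history prefix is the same for every
        # candidate at this position, so comparing full-sequence scores ranks candidates
        # the same way as comparing P(candidate | history).
        # Using stateful scoring would be faster; for simplicity we recompute per step.
        seq = " ".join([t.lower() for t in history_tokens + [cand_word]])
        # Model.score returns the log10 probability of the sequence; eos=False leaves out
        # the end-of-sentence token because the sentence is not finished yet.
        return KENLM_MODEL.score(seq, bos=True, eos=False)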
    else:
        return FALLBACK_LM.logp([t.lower() for t in history_tokens], cand_word.lower())

# -----------------------------
# 4) Left-to-right correction
# -----------------------------
def tokenize(sentence):
    return re.findall(r"\w+|[^\w\s]+|\s+", sentence)

def generate_candidates(word, max_edit_distance=2, keep_original=True, top_k=4):
    # SymSpell candidates for the *single* word (not compound). Include original to avoid overcorrection.
    cands = []
    if keep_original:
        cands.append(word)
    suggestions = sym_spell.lookup(word, Verbosity.CLOSEST, max_edit_distance=max_edit_distance)
    for s in suggestions[:top_k]:
        if s.term not in cands:
            cands.append(s.term)
    return cands or [word]

def apply_casing(src, dst):
    if src.isupper(): return dst.upper()
    if src.istitle(): return dst.capitalize()
    if src.islower(): return dst.lower()
    return dst

def correct_left_to_right(text):
    tokens = tokenize(text)
    history = []        # past corrected words (alphabetic only) for LM
    out = []

    for tok in tokens:
        if tok.isalpha():
            cands = generate_candidates(tok)
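            # Score each candidate using ONLY history (past words).
            # Ties keep the earliest candidate. The original token is listed first, so it
            # wins whenever the LM cannot tell the candidates apart (e.g. all unseen bigrams).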
            best = None
            best_score = -1e18
            for c in cands:
                score = lm_logp(history, c)
                if score > best_score:
                    best_score = score
                    best = c
            corrected = apply_casing(tok, best)
            out.append(corrected)
            history.append(corrected)  # update history with corrected word
        else:
            out.append(tok)            # preserve punctuation/space as-is
    return "".join(out)

# -----------------------------
# 5) Run with timing
# -----------------------------
try:
    user_text = input("Enter a sentence (past-only context correction): ").strip()
except EOFError:
    user_text = "I ate an appel today but I will go to the schol tomorrow."

start = time.perf_counter()
corrected = correct_left_to_right(user_text)
elapsed_ms = (time.perf_counter() - start) * 1000.0

print("\nOriginal : ", user_text)
print("Corrected: ", corrected)
print(f"Latency  : {elapsed_ms:.2f} ms")
print(f"(Scorer: {'KenLM' if KENLM_MODEL is not None else 'Tiny Bigram Fallback'})")


Enter a sentence (past-only context correction): i waant to eet an appel todau

Original :  i waant to eet an appel todau
Corrected:  i want to eet an appel today
Latency  : 0.81 ms
(Scorer: Tiny Bigram Fallback)
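
With a language model this small, many candidates receive the same smoothed score after a given history, and the selection loop then keeps the first candidate listed, which is the original token. One possible refinement, sketched below under our own assumptions, adds a prior from the candidate's dictionary frequency and edit distance so that ties are broken by the spell checker's own evidence. The function name, the weights `lam` and `distance_penalty`, and the sample counts are hypothetical and untuned.

```python
import math

def pick_candidate(original, suggestions, lm_logp, history, lam=1.0, distance_penalty=2.0):
    """
    Choose among spelling suggestions using LM score plus a frequency/distance prior.

    suggestions: list of (term, edit_distance, corpus_count) tuples.
    lm_logp:     callable(history, word) -> log-probability.
    Returns `original` if there are no suggestions.
    """
    if not suggestions:
        return original
    total = sum(count for _, _, count in suggestions) or 1
    scored = []
    for term, dist, count in suggestions:
        # Prior: relative frequency among the suggestions, minus a cost per edit.
        prior = math.log(max(count, 1) / total) - distance_penalty * dist
        scored.append((lam * lm_logp(history, term) + prior, term))
    return max(scored)[1]

# Made-up suggestions for "eet" with assumed counts.
SUGGESTIONS = [("eat", 1, 500), ("get", 1, 5000), ("eel", 1, 10)]

def flat_lm(history, word):
    """An LM that cannot separate the candidates."""
    return -5.0

def context_lm(history, word):
    """An LM that has seen 'to eat' but none of the other continuations."""
    return -1.0 if history and history[-1] == "to" and word == "eat" else -10.0

print(pick_candidate("eet", SUGGESTIONS, flat_lm, ["to"]))
print(pick_candidate("eet", SUGGESTIONS, context_lm, ["to"]))
```

When the LM is uninformative the most frequent suggestion wins, and when the LM has a clear preference it overrides the frequency prior. To let the original token compete, the caller could pass it as an extra suggestion with distance 0.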


In [15]:
# === Benchmark: Five Autocorrection Models on the Same Paragraph ===
# Models (must already be initialized from previous cells):
# 1) Autocorrect (Python library)          -> expects a global `spell` that is an autocorrect.Speller, or a function
# 2) PySpellChecker                        -> expects a global `spell` that is a spellchecker.SpellChecker, or a function
# 3) SymSpell (basic, single-word)         -> expects a global `sym_spell` (from symspellpy)
# 4) Hunspell via spylls (pure-Python)     -> expects a function `hunspell_autocorrect_sentence`
# 5) SymSpell + Left-to-Right Context      -> expects a function `correct_left_to_right`
#
# What this cell does:
# - Defines a LONG paragraph with many spelling mistakes (P_MISSPELLED) and its corrected version (P_CORRECT).
# - Invokes each model (without redefining them) to correct the paragraph.
# - Measures average correction time per word.
# - Compares each output to the correct version (word-level accuracy).
#
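# NOTE:
# - This cell **calls** the previously defined objects/functions. If a model can't be set up,
#   it will be marked as "Unavailable".
# - We avoid rewriting the algorithms; we simply dispatch to already created instances/functions.
# - Caveat: the Autocorrect and PySpellChecker cells both bind the global `spell`, and three cells
#   define `autocorrect_sentence`, so each of those names refers to whichever cell ran last.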

import re, time, sys

# -----------------------------
# 0) Test paragraphs
# -----------------------------
P_MISSPELLED = (
    "Tday I went to the schoul to met with my freinds and discuus the projct. "
    "We had planed to start earliy, but evryone arived late becuase of the trafic. "
    "The libary was to noisy, so we moved to a quet classrom. "
    "I wrote sevral paragraps explaing our ideea, but there were many speling mistaks. "
    "At lunch, we at sandwhiches and drank cofee while we reviewd the requirments. "
    "Eventualy, we agredd to updat the timline and asign cleer responsibilites to each person. "
    "Tomorow, I will send the summery with the corect verion and confirm the scheduel with the instrutor."
)

P_CORRECT = (
    "Today I went to the school to meet with my friends and discuss the project. "
    "We had planned to start early, but everyone arrived late because of the traffic. "
    "The library was too noisy, so we moved to a quiet classroom. "
    "I wrote several paragraphs explaining our idea, but there were many spelling mistakes. "
    "At lunch, we ate sandwiches and drank coffee while we reviewed the requirements. "
    "Eventually, we agreed to update the timeline and assign clear responsibilities to each person. "
    "Tomorrow, I will send the summary with the correct version and confirm the schedule with the instructor."
)

# Helper: tokenize into word tokens (letters only) for accuracy/time-per-word metrics
def word_tokens(text):
    return re.findall(r"[A-Za-z]+", text)

def tokenize_preserve(text):
    # words, punctuation, spaces — used by some simple callers when needed
    return re.findall(r"\w+|[^\w\s]+|\s+", text)

TOTAL_WORDS = len(word_tokens(P_MISSPELLED))

# -----------------------------
# 1) Dispatchers for each model
# -----------------------------
# Build the Autocorrect speller once, outside the timed call. The global `spell` is reused
# only if it really is an autocorrect.Speller; the PySpellChecker cell rebinds that name.
try:
    from autocorrect import Speller as _Speller
    _AUTOCORRECT_SPELLER = (
        globals()['spell'] if isinstance(globals().get('spell'), _Speller) else _Speller(lang='en')
    )
except Exception:
    _AUTOCORRECT_SPELLER = None


def run_autocorrect_model(text):
    """
    Autocorrect (Python library).
    Uses `_AUTOCORRECT_SPELLER` (set up just above). We do not fall back to the generic
    `autocorrect_sentence`: later cells redefine it for other libraries, so the benchmark
    would silently measure the wrong model under the Autocorrect label.
    """
    if _AUTOCORRECT_SPELLER is None:
        raise RuntimeError("Autocorrect model not available (could not import autocorrect.Speller).")
    out = []
    for t in tokenize_preserve(text):
        out.append(_AUTOCORRECT_SPELLER(t) if t.isalpha() else t)
    return "".join(out)


# Build the PySpellChecker instance once, outside the timed call. The global `spell` is
# reused only if it really is a spellchecker.SpellChecker.
try:
    from spellchecker import SpellChecker as _SpellChecker
    _PYSPELL_CHECKER = (
        globals()['spell'] if isinstance(globals().get('spell'), _SpellChecker)
        else _SpellChecker(language='en')
    )
except Exception:
    _PYSPELL_CHECKER = None


def run_pyspellchecker_model(text):
    """
    PySpellChecker.
    Uses `_PYSPELL_CHECKER` (set up just above). We do not fall back to the generic
    `autocorrect_sentence`, because that name belongs to whichever cell defined it last.
    """
    if _PYSPELL_CHECKER is None:
        raise RuntimeError("PySpellChecker model not available (could not import spellchecker.SpellChecker).")
    out = []
    for t in tokenize_preserve(text):
        if t.isalpha():
            corr = _PYSPELL_CHECKER.correction(t)
            out.append(corr if corr else t)  # correction() may return None
        else:
            out.append(t)
    return "".join(out)


def run_symspell_basic_model(text):
    """
    SymSpell (basic single-word).
    We assume a global `sym_spell` from symspellpy is available, as per earlier cell.
    We'll call `sym_spell.lookup` per word and pick the closest suggestion.
    """
    if 'sym_spell' not in globals():
        raise RuntimeError("SymSpell instance `sym_spell` not available.")
    from symspellpy import Verbosity
    tokens = tokenize_preserve(text)
    out = []
    for t in tokens:
        if t.isalpha():
            suggs = sym_spell.lookup(t, Verbosity.CLOSEST, max_edit_distance=2)
            out.append(suggs[0].term if suggs else t)
        else:
            out.append(t)
    return "".join(out)


def run_hunspell_spylls_model(text):
    """
    Hunspell via spylls (pure-Python).
    We assume the earlier cell defined:
      - function `hunspell_autocorrect_sentence(text)` which performs the correction.
    """
    if 'hunspell_autocorrect_sentence' not in globals():
        raise RuntimeError("Hunspell function `hunspell_autocorrect_sentence` not available.")
    return hunspell_autocorrect_sentence(text)


def run_symspell_left2right_model(text):
    """
    SymSpell + Left-to-Right Context (past-only).
    We assume the earlier cell defined:
      - function `correct_left_to_right(text)`.
    """
    if 'correct_left_to_right' not in globals():
        raise RuntimeError("Left-to-Right function `correct_left_to_right` not available.")
    return correct_left_to_right(text)

# -----------------------------
# 2) Benchmark runner
# -----------------------------
def avg_time_per_word_seconds(func, text):
    start = time.perf_counter()
    out = func(text)
    total = time.perf_counter() - start
    return out, total / max(1, TOTAL_WORDS)

def word_accuracy(pred_text, gold_text):
    pred = word_tokens(pred_text)
    gold = word_tokens(gold_text)
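    # Align by position (simple exact-match rate). Only the first min(len) positions are
    # compared, but the denominator is len(gold), so missing words count as errors.
    # Caveat: a corrector that splits or merges a word shifts every later position.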
    n = min(len(pred), len(gold))
    if n == 0:
        return 0.0
    correct = sum(1 for i in range(n) if pred[i].lower() == gold[i].lower())
    return 100.0 * correct / len(gold)

# -----------------------------
# 3) Execute and report
# -----------------------------
results = []

models = [
    ("Autocorrect (Python library)", run_autocorrect_model),
    ("PySpellChecker",               run_pyspellchecker_model),
    ("SymSpell (basic)",             run_symspell_basic_model),
    ("Hunspell (spylls)",            run_hunspell_spylls_model),
    ("SymSpell + Left-to-Right",     run_symspell_left2right_model),
]

print("=== Original (Misspelled) Paragraph ===")
print(P_MISSPELLED, "\n")
print("=== Correct (Reference) Paragraph ===")
print(P_CORRECT, "\n")

for name, runner in models:
    try:
        output, avg_sec_per_word = avg_time_per_word_seconds(runner, P_MISSPELLED)
        acc = word_accuracy(output, P_CORRECT)
        results.append((name, f"{avg_sec_per_word*1000:.3f} ms/word", f"{acc:.1f}%"))
        print(f"\n--- {name} ---")
        print(output)
        print(f"[Avg time/word] {avg_sec_per_word*1000:.3f} ms | [Word accuracy vs. reference] {acc:.1f}%")
    except Exception as e:
        results.append((name, "Unavailable", "N/A"))
        print(f"\n--- {name} ---")
        print(f"Unavailable: {e}")

# -----------------------------
# 4) Summary table
# -----------------------------
print("\n=== Summary (Avg Time Per Word & Word Accuracy) ===")
colw = [28, 18, 18]
print(f"{'Model':{colw[0]}} {'Avg Time/Word':{colw[1]}} {'Word Accuracy':{colw[2]}}")
print("-" * sum(colw))
for name, timepw, acc in results:
    print(f"{name:{colw[0]}} {timepw:{colw[1]}} {acc:{colw[2]}}")


=== Original (Misspelled) Paragraph ===
Tday I went to the schoul to met with my freinds and discuus the projct. We had planed to start earliy, but evryone arived late becuase of the trafic. The libary was to noisy, so we moved to a quet classrom. I wrote sevral paragraps explaing our ideea, but there were many speling mistaks. At lunch, we at sandwhiches and drank cofee while we reviewd the requirments. Eventualy, we agredd to updat the timline and asign cleer responsibilites to each person. Tomorow, I will send the summery with the corect verion and confirm the scheduel with the instrutor. 

=== Correct (Reference) Paragraph ===
Today I went to the school to meet with my friends and discuss the project. We had planned to start early, but everyone arrived late because of the traffic. The library was too noisy, so we moved to a quiet classroom. I wrote several paragraphs explaining our idea, but there were many spelling mistakes. At lunch, we ate sandwiches and drank coffee while we re
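
The position-by-position accuracy used above is fragile when a correction changes the number of words (for example `setup` versus `set up`), because every later word is then compared against the wrong reference word. A more forgiving variant, sketched here with the standard-library `difflib`, aligns the two word sequences first. The function name `aligned_word_accuracy` is our own, and this is one possible metric rather than the one used for the reported numbers.

```python
import re
from difflib import SequenceMatcher

def word_tokens(text):
    """Letters-only word tokens, as in the benchmark cell."""
    return re.findall(r"[A-Za-z]+", text)

def aligned_word_accuracy(pred_text, gold_text):
    """Percentage of reference words matched by the prediction after sequence alignment."""
    pred = [w.lower() for w in word_tokens(pred_text)]
    gold = [w.lower() for w in word_tokens(gold_text)]
    if not gold:
        return 0.0
    matcher = SequenceMatcher(a=pred, b=gold, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * matched / len(gold)

# One merged word no longer ruins the score for the rest of the sentence.
acc = aligned_word_accuracy("we setup a shared calendar", "we set up a shared calendar")
print(f"{acc:.1f}%")
```

In this example the positional metric would credit only the first word, while the aligned metric credits the four words that genuinely match out of six.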

In [16]:
# === Benchmark (Round 2): Longer Paragraph with More Typos ===
# Assumes the following were already defined in previous cells:
#   - Autocorrect model: either global `spell` (autocorrect.Speller) or function `autocorrect_sentence`
#   - PySpellChecker model: either global `spell` (spellchecker.SpellChecker) or function `autocorrect_sentence`
#   - SymSpell instance: global `sym_spell`
#   - Hunspell (spylls): function `hunspell_autocorrect_sentence(text)`
#   - SymSpell + Left-to-Right Context: function `correct_left_to_right(text)`
#
# This cell:
#   1) Defines a LONGER misspelled paragraph and its corrected reference version.
#   2) Calls EACH model (without redefining them) to correct the paragraph.
#   3) Measures average correction time per word and word-level accuracy vs. reference.
#
# If any model hasn’t been initialized, it will be marked “Unavailable”.

import re, time

# -----------------------------
# 0) Longer test paragraphs
# -----------------------------
P_MISSPELLED = (
    "Yesturday mornng, I desided to wake up earli and take a quick walke to the libary, "
    "but the weathr was unpredicteble and it strated to rain hevily. On my way, I met two freinds "
    "who were also planing to study, yet we all relized we had fogoten our noteboks and pencials. "
    "At the coffe shop nerby, we ordred sandwitches and capachinos, but the barrista misspeled my "
    "name on the cup as Alxenderr, which made us laught for a whiel. When we finaly arived at the "
    "libary, it was overcrowded and very noizy, so we searched for a quiter clasroom in the oldr "
    "building. I began writting a draft of our reseach propasal, explaing the metodolgy and the "
    "expreimental desgin, but I kept makng speling mistkes becuse I was in a hurry. My freinds "
    "sugested we take a brek and reorgnize the timline, asign clerer resposnibilites, and setup "
    "a sharedd calender. Tomorow, we inted to meet agan with the instrutor to reveiw the feedbak, "
    "corect the erors, and submitt the finel versoin befor the dedline at midnigt."
)

P_CORRECT = (
    "Yesterday morning, I decided to wake up early and take a quick walk to the library, "
    "but the weather was unpredictable and it started to rain heavily. On my way, I met two friends "
    "who were also planning to study, yet we all realized we had forgotten our notebooks and pencils. "
    "At the coffee shop nearby, we ordered sandwiches and cappuccinos, but the barista misspelled my "
    "name on the cup as Alexander, which made us laugh for a while. When we finally arrived at the "
    "library, it was overcrowded and very noisy, so we searched for a quieter classroom in the older "
    "building. I began writing a draft of our research proposal, explaining the methodology and the "
    "experimental design, but I kept making spelling mistakes because I was in a hurry. My friends "
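    # "setup" is kept as one token here (not "set up") so the reference has the same
    # number of words as P_MISSPELLED; word_accuracy compares tokens by position.
    "suggested we take a break and reorganize the timeline, assign clearer responsibilities, and setup "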
    "a shared calendar. Tomorrow, we intend to meet again with the instructor to review the feedback, "
    "correct the errors, and submit the final version before the deadline at midnight."
)

# -----------------------------
# Helpers (no model code here)
# -----------------------------
def word_tokens(text):
    return re.findall(r"[A-Za-z]+", text)

def tokenize_preserve(text):
    return re.findall(r"\w+|[^\w\s]+|\s+", text)

TOTAL_WORDS = len(word_tokens(P_MISSPELLED))

def avg_time_per_word_seconds(func, text):
    start = time.perf_counter()
    out = func(text)
    total = time.perf_counter() - start
    return out, total / max(1, TOTAL_WORDS)

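def word_accuracy(pred_text, gold_text):
    """
    Percentage of reference words reproduced at the same position (case-insensitive).
    Tokens are compared index by index, so an inserted, dropped, split, or merged word
    shifts every later comparison and counts them all as errors.
    """
    pred = word_tokens(pred_text)
    gold = word_tokens(gold_text)
    n = min(len(pred), len(gold))
    if n == 0:
        return 0.0
    correct = sum(1 for i in range(n) if pred[i].lower() == gold[i].lower())
    return 100.0 * correct / len(gold)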

# -----------------------------
# Dispatchers (call-only)
# -----------------------------
def run_autocorrect_model(text):
    try:
        from autocorrect import Speller
    except Exception:
        Speller = None

    if 'spell' in globals() and Speller and isinstance(globals()['spell'], Speller):
        s = globals()['spell']
        tokens = tokenize_preserve(text)
        return "".join([s(t) if t.isalpha() else t for t in tokens])

    if 'autocorrect_sentence' in globals():
        return globals()['autocorrect_sentence'](text)

    raise RuntimeError("Autocorrect model not available.")


def run_pyspellchecker_model(text):
    try:
        from spellchecker import SpellChecker
    except Exception:
        SpellChecker = None

    if 'spell' in globals() and SpellChecker and isinstance(globals()['spell'], SpellChecker):
        sc = globals()['spell']
        out = []
        for t in tokenize_preserve(text):
            if t.isalpha():
                corr = sc.correction(t)
                out.append(corr if corr else t)
            else:
                out.append(t)
        return "".join(out)

    if 'autocorrect_sentence' in globals():
        return globals()['autocorrect_sentence'](text)

    raise RuntimeError("PySpellChecker model not available.")


def run_symspell_basic_model(text):
    if 'sym_spell' not in globals():
        raise RuntimeError("SymSpell instance `sym_spell` not available.")
    from symspellpy import Verbosity
    out = []
    for t in tokenize_preserve(text):
        if t.isalpha():
            suggs = sym_spell.lookup(t, Verbosity.CLOSEST, max_edit_distance=2)
            out.append(suggs[0].term if suggs else t)
        else:
            out.append(t)
    return "".join(out)


def run_hunspell_spylls_model(text):
    if 'hunspell_autocorrect_sentence' not in globals():
        raise RuntimeError("Hunspell function `hunspell_autocorrect_sentence` not available.")
    return hunspell_autocorrect_sentence(text)


def run_symspell_left2right_model(text):
    if 'correct_left_to_right' not in globals():
        raise RuntimeError("Left-to-Right function `correct_left_to_right` not available.")
    return correct_left_to_right(text)

# -----------------------------
# Execute & report
# -----------------------------
print("=== Original (Misspelled) Paragraph ===")
print(P_MISSPELLED, "\n")
print("=== Correct (Reference) Paragraph ===")
print(P_CORRECT, "\n")

results = []
models = [
    ("Autocorrect (Python library)", run_autocorrect_model),
    ("PySpellChecker",               run_pyspellchecker_model),
    ("SymSpell (basic)",             run_symspell_basic_model),
    ("Hunspell (spylls)",            run_hunspell_spylls_model),
    ("SymSpell + Left-to-Right",     run_symspell_left2right_model),
]

for name, runner in models:
    try:
        output, avg_sec_per_word = avg_time_per_word_seconds(runner, P_MISSPELLED)
        acc = word_accuracy(output, P_CORRECT)
        results.append((name, f"{avg_sec_per_word*1000:.3f} ms/word", f"{acc:.1f}%"))
        print(f"\n--- {name} ---")
        print(output)
        print(f"[Avg time/word] {avg_sec_per_word*1000:.3f} ms | [Word accuracy vs. reference] {acc:.1f}%")
    except Exception as e:
        results.append((name, "Unavailable", "N/A"))
        print(f"\n--- {name} ---")
        print(f"Unavailable: {e}")

# -----------------------------
# Summary table
# -----------------------------
print("\n=== Summary (Avg Time Per Word & Word Accuracy) ===")
colw = [28, 18, 18]
print(f"{'Model':{colw[0]}} {'Avg Time/Word':{colw[1]}} {'Word Accuracy':{colw[2]}}")
print("-" * sum(colw))
for name, timepw, acc in results:
    print(f"{name:{colw[0]}} {timepw:{colw[1]}} {acc:{colw[2]}}")


=== Original (Misspelled) Paragraph ===
Yesturday mornng, I desided to wake up earli and take a quick walke to the libary, but the weathr was unpredicteble and it strated to rain hevily. On my way, I met two freinds who were also planing to study, yet we all relized we had fogoten our noteboks and pencials. At the coffe shop nerby, we ordred sandwitches and capachinos, but the barrista misspeled my name on the cup as Alxenderr, which made us laught for a whiel. When we finaly arived at the libary, it was overcrowded and very noizy, so we searched for a quiter clasroom in the oldr building. I began writting a draft of our reseach propasal, explaing the metodolgy and the expreimental desgin, but I kept makng speling mistkes becuse I was in a hurry. My freinds sugested we take a brek and reorgnize the timline, asign clerer resposnibilites, and setup a sharedd calender. Tomorow, we inted to meet agan with the instrutor to reveiw the feedbak, corect the erors, and submitt the finel versoin 
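
The benchmark cells in this section score words position by position, which assumes a corrector never splits or merges tokens (for example, turning `setup` into `set up`). One possible alternative, sketched here under that concern, aligns the two word sequences with the standard library's `difflib.SequenceMatcher` before counting matches. `aligned_word_accuracy` is a hypothetical helper, not a function defined elsewhere in this notebook, and it measures a slightly different quantity than the positional metric, so its numbers are not directly comparable to the tables above.

```python
import re
from difflib import SequenceMatcher

def aligned_word_accuracy(pred_text, gold_text):
    """Percent of reference words matched after sequence alignment (case-insensitive)."""
    pred = [w.lower() for w in re.findall(r"[A-Za-z]+", pred_text)]
    gold = [w.lower() for w in re.findall(r"[A-Za-z]+", gold_text)]
    if not gold:
        return 0.0
    # autojunk=False disables the popularity heuristic, which could skip common words.
    matcher = SequenceMatcher(None, pred, gold, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * matched / len(gold)

# One extra reference token no longer penalizes every word that follows it.
pred = "we setup a shared calendar"
gold = "we set up a shared calendar"
print(f"{aligned_word_accuracy(pred, gold):.1f}%")
```

In this toy pair, four of the six reference words align ("we" and "a shared calendar"), whereas a positional comparison would credit only the first word.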

In [18]:
# === Final Benchmark: Evaluate ONLY on Misspelled Words ===
# Assumes the following models are already set up in previous cells and callable:
#   - Autocorrect (Python library):        run_autocorrect_model(text)
#   - PySpellChecker:                      run_pyspellchecker_model(text)
#   - SymSpell (basic):                    run_symspell_basic_model(text)
#   - Hunspell (spylls):                   run_hunspell_spylls_model(text)
#   - SymSpell + Left-to-Right Context:    run_symspell_left2right_model(text)
#
# This cell:
#   1) Defines a test paragraph (misspelled) and its corrected reference.
#   2) Identifies positions of words that are misspelled (compared to reference).
#   3) Runs each model and computes accuracy ONLY over those misspelled positions.
#   4) Reports average time per *misspelled* word: total latency for the whole paragraph
#      divided by the number of misspelled words (time spent on correct words is included).

import re, time

# -----------------------------
# 0) Test paragraph (reuse the "common sentences" set for realistic phrasing)
# -----------------------------
P_MISSPELLED = (
    "I tolld my frend how are you afer we went to scool. "
    "I didnt recieve your emial yesturday, can you resent it agan? "
    "We are going to the libary later to finsh our homwork. "
    "The wethar was relly nice so we walkd to the coffe shop nearbly. "
    "She said she wil call me tomorow morning befor class. "
    "I bought vegtables and bred from the grocry store on my way home. "
    "Please chekc the sheduele and let me no if your free on Thusday. "
    "After luch we met the instrutor to discus the projeckt detales. "
    "He allways forgets his pasword and needs to resset it evry weak. "
    "Thanks for your help, I appriciate your quick responce."
)

P_CORRECT = (
    "I told my friend how are you after we went to school. "
    "I didn’t receive your email yesterday, can you resend it again? "
    "We are going to the library later to finish our homework. "
    "The weather was really nice so we walked to the coffee shop nearby. "
    "She said she will call me tomorrow morning before class. "
    "I bought vegetables and bread from the grocery store on my way home. "
    "Please check the schedule and let me know if you’re free on Thursday. "
    "After lunch we met the instructor to discuss the project details. "
    "He always forgets his password and needs to reset it every week. "
    "Thanks for your help, I appreciate your quick response."
)

# -----------------------------
# Helpers
# -----------------------------
def word_tokens(text):
    # normalize typographic apostrophes to plain for alignment fairness
    text = text.replace("’", "'")
    return re.findall(r"[A-Za-z']+", text)

def indices_of_misspellings(src_text, gold_text):
    src = word_tokens(src_text)
    gold = word_tokens(gold_text)
    n = min(len(src), len(gold))
    idxs = [i for i in range(n) if src[i].lower() != gold[i].lower()]
    return idxs, src, gold

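def accuracy_on_indices(pred_text, gold_words, idxs):
    """Percent of the misspelled positions in `idxs` that the model restored to the gold word."""
    if not idxs:
        return 100.0  # nothing to fix, so the score is trivially perfect
    pred = word_tokens(pred_text)
    # Positions beyond either sequence count as incorrect.
    limit = min(len(pred), len(gold_words))
    correct = sum(
        1 for i in idxs
        if i < limit and pred[i].lower() == gold_words[i].lower()
    )
    return 100.0 * correct / len(idxs)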

def avg_time_per_target_word_seconds(func, text, n_targets):
    start = time.perf_counter()
    out = func(text)
    total = time.perf_counter() - start
    denom = max(1, n_targets)
    return out, total / denom

# -----------------------------
# Validate availability of model runners
# -----------------------------
required = [
    ("Autocorrect (Python library)", "run_autocorrect_model"),
    ("PySpellChecker",               "run_pyspellchecker_model"),
    ("SymSpell (basic)",             "run_symspell_basic_model"),
    ("Hunspell (spylls)",            "run_hunspell_spylls_model"),
    ("SymSpell + Left-to-Right",     "run_symspell_left2right_model"),
]
missing = [name for name, fn in required if fn not in globals()]
if missing:
    print("WARNING: The following models are not available and will be marked Unavailable:")
    for m in missing:
        print(" -", m)

# -----------------------------
# Compute misspelled indices once
# -----------------------------
miss_idx, src_words, gold_words = indices_of_misspellings(P_MISSPELLED, P_CORRECT)
N_MISS = len(miss_idx)
print(f"Detected {N_MISS} misspelled word positions (evaluating only these).")

# -----------------------------
# Run each model on misspell-only metric
# -----------------------------
def get_runner(fn_name):
    # Look the runner up by name so a missing model degrades to "Unavailable".
    return globals().get(fn_name)

# Reuse the (display name, function name) pairs from `required` above.
models = [(name, get_runner(fn_name)) for name, fn_name in required]

results = []
for name, runner in models:
    if runner is None:
        results.append((name, "Unavailable", "N/A"))
        print(f"\n--- {name} ---\nUnavailable: runner not found.")
        continue
    try:
        output, avg_sec_per_miss = avg_time_per_target_word_seconds(runner, P_MISSPELLED, N_MISS)
        acc = accuracy_on_indices(output, gold_words, miss_idx)
        results.append((name, f"{avg_sec_per_miss*1000:.3f} ms/misspelled-word", f"{acc:.1f}%"))
        print(f"\n--- {name} ---")
        print(output)
        print(f"[Avg time per misspelled word] {avg_sec_per_miss*1000:.3f} ms | "
              f"[Accuracy on misspelled words] {acc:.1f}%")
    except Exception as e:
        results.append((name, "Unavailable", "N/A"))
        print(f"\n--- {name} ---\nUnavailable: {e}")

# -----------------------------
# Summary
# -----------------------------
print("\n=== Summary (Avg Time Per Misspelled Word & Accuracy on Misspelled Words) ===")
colw = [28, 30, 26]
print(f"{'Model':{colw[0]}} {'Avg Time/Misspelled Word':{colw[1]}} {'Accuracy (Misspelled Only)':{colw[2]}}")
print("-" * sum(colw))
for name, timepw, acc in results:
    print(f"{name:{colw[0]}} {timepw:{colw[1]}} {acc:{colw[2]}}")


Detected 41 misspelled word positions (evaluating only these).

--- Autocorrect (Python library) ---
a told my friend how are you after we went to school. a didst receive your email yesterday, can you resent it again? be are going to the library later to fish our homework. the wether was reply nice so we walk to the coffee shop nearly. the said she will call me tomorrow morning before class. a bought vegetables and bred from the grocery store on my way home. please check the schedule and let me no if your free on thursday. after such we met the instructor to discus the project details. be always forgets his password and needs to reset it very weak. thanks for your help, a appreciate your quick response.
[Avg time per misspelled word] 0.271 ms | [Accuracy on misspelled words] 65.9%

--- PySpellChecker ---
I told my friend how are you after we went to school. I didn't receive your email yesterday, can you resent it again? We are going to the library later to finish our homework. The weth