# KNN Spell Checker Evaluation

This notebook evaluates the **K-Nearest Neighbors (KNN)** spell checker on synthetic Urdu typos, reporting exact-match correction accuracy and average correction time per word.

## 1. Setup

In [11]:
import sys
import os
import time
import random
import numpy as np

# Add project root to path
sys.path.append(os.path.abspath('..'))

from src.spell_checkers import KNNCorrector

DATA_PATH = '../data/urdu_words.txt'

print("Initializing KNNCorrector...")
knn = KNNCorrector(literature_path=DATA_PATH, k=1, cache_dir='../data')
print(f"Vocab Size: {len(knn.words_list)}")

Initializing KNNCorrector...
Vocab Size: 154781
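
The internals of `KNNCorrector` are not shown in this notebook. As a rough mental model only, a nearest-neighbor corrector can be sketched by turning each word into a bag of character bigrams and returning the vocabulary word whose bag is most similar to the query. Everything below (`ToyKNNCorrector`, `char_ngrams`, `cosine`, the bigram features, the cosine metric) is a hypothetical illustration and may differ from what `src.spell_checkers` actually does.

```python
import math
from collections import Counter


def char_ngrams(word, n=2):
    """Count character n-grams, with ^ and $ marking the word boundaries."""
    padded = f"^{word}$"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyKNNCorrector:
    """Hypothetical stand-in, NOT the project's KNNCorrector."""

    def __init__(self, vocabulary, k=1):
        self.k = k
        self.words = list(vocabulary)
        self.vectors = [char_ngrams(w) for w in self.words]

    def neighbors(self, word):
        """Return the k vocabulary words most similar to `word` (brute force)."""
        query = char_ngrams(word)
        scored = sorted(
            zip(self.words, self.vectors),
            key=lambda pair: cosine(query, pair[1]),
            reverse=True,
        )
        return [w for w, _ in scored[: self.k]]

    def correct(self, word):
        """With k=1 the correction is simply the single nearest neighbor."""
        return self.neighbors(word)[0]


toy = ToyKNNCorrector(["hello", "help", "world", "word"], k=1)
print(toy.correct("helo"))
print(toy.correct("wrld"))
```

A brute-force scan like this is linear in the vocabulary size for every query, which is one plausible reason a real implementation would cache precomputed vectors or an index (compare the `cache_dir` argument above).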


## 2. Test Data Generation
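We generate synthetic typos by applying one random edit (insertion, deletion, substitution, or transposition) to each of 500 words sampled from the vocabulary. Only words longer than three characters are sampled.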

In [12]:
def generate_typo(word):
    """Apply one random single-character edit to `word` and return the result."""
    if len(word) < 2:
        return word
    urdu_chars = 'ابپتٹثجچحخدڈذرڑزژسشصضطظعغفقکگلمنںوہیے'
    op = random.choice(['insert', 'delete', 'replace', 'transpose'])
    chars = list(word)
    idx = random.randint(0, len(chars) - 1)
    if op == 'insert':
        chars.insert(idx, random.choice(urdu_chars))
    elif op == 'delete':
        chars.pop(idx)
    elif op == 'replace':
        chars[idx] = random.choice(urdu_chars)
    elif op == 'transpose' and idx < len(chars) - 1:
        chars[idx], chars[idx + 1] = chars[idx + 1], chars[idx]
    # Note: a transpose drawn at the last index is skipped, and a replace can pick
    # the character already there, so the result may occasionally equal the input.
    return "".join(chars)

random.seed(42)
valid_words = [w for w in knn.words_list if len(w) > 3]
test_set = [(generate_typo(w), w) for w in random.sample(valid_words, 500)]

print(f"Generated {len(test_set)} test pairs.")

Generated 500 test pairs.
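
Because `generate_typo` can occasionally return its input unchanged (a transpose at the last index, or a replacement with the same character), some test pairs may not contain an actual typo. The following is a minimal sketch of a variant that retries until the output differs from the input. The name `generate_typo_strict` and the `alphabet`, `rng`, and `max_tries` parameters are assumptions introduced for illustration; the demo uses a Latin alphabet, and the notebook's `urdu_chars` string could be passed instead.

```python
import random


def generate_typo_strict(word, alphabet, rng=random, max_tries=20):
    """Return a one-edit typo that is guaranteed to differ from `word`."""
    if len(word) < 2:
        return word
    for _ in range(max_tries):
        chars = list(word)
        op = rng.choice(["insert", "delete", "replace", "transpose"])
        if op == "insert":
            # randint is inclusive, so insertion at the very end is possible.
            chars.insert(rng.randint(0, len(chars)), rng.choice(alphabet))
        elif op == "delete":
            chars.pop(rng.randrange(len(chars)))
        elif op == "replace":
            chars[rng.randrange(len(chars))] = rng.choice(alphabet)
        else:
            # Pick an index that always has a right-hand neighbor to swap with.
            i = rng.randrange(len(chars) - 1)
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
        typo = "".join(chars)
        if typo != word:
            return typo
    # Fallback: dropping the last character always changes the word.
    return word[:-1]


rng = random.Random(42)
for w in ["hello", "spelling", "aa"]:
    print(w, "->", generate_typo_strict(w, "abcdefghijklmnopqrstuvwxyz", rng))
```

Switching to a generator like this would change the sampled test set, so accuracy figures obtained with it are not directly comparable to the ones reported in this notebook.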


## 3. Evaluation Loop

In [13]:
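correct_count = 0
failed_cases = []

print("Running Evaluation...")
# perf_counter is a monotonic, high-resolution clock intended for timing code.
start_time = time.perf_counter()
for typo, truth in test_set:
    prediction = knn.correct(typo)
    if prediction == truth:
        correct_count += 1
    else:
        # Keep the full triple so failures can be inspected later.
        failed_cases.append((typo, truth, prediction))

duration = time.perf_counter() - start_time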
accuracy = (correct_count / len(test_set)) * 100

print(f"\nAccuracy: {accuracy:.2f}%")
print(f"Average Time: {(duration/len(test_set))*1000:.2f}ms")

Running Evaluation...

Accuracy: 60.20%
Average Time: 129.75ms
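
With only 500 test pairs, a single accuracy figure carries noticeable sampling uncertainty, and a mean latency can hide outliers. The sketch below wraps the loop in a helper that also reports a normal-approximation 95% interval for the accuracy and the median per-call latency. The `evaluate` function, its return keys, and the stand-in dictionary corrector are hypothetical and exist only to make the example runnable without the project code.

```python
import math
import statistics
import time


def evaluate(correct_fn, test_pairs, z=1.96):
    """Score a corrector on (typo, truth) pairs and time each call separately."""
    hits = 0
    latencies_ms = []
    for typo, truth in test_pairs:
        start = time.perf_counter()
        prediction = correct_fn(typo)
        latencies_ms.append((time.perf_counter() - start) * 1000)
        hits += prediction == truth
    n = len(test_pairs)
    p = hits / n
    # Normal-approximation (Wald) interval for a binomial proportion.
    half_width = z * math.sqrt(p * (1 - p) / n)
    return {
        "accuracy": p,
        "ci_low": max(0.0, p - half_width),
        "ci_high": min(1.0, p + half_width),
        "mean_ms": statistics.mean(latencies_ms),
        "median_ms": statistics.median(latencies_ms),
    }


# Stand-in corrector for demonstration only: a fixed lookup table.
lookup = {"helo": "hello", "wrld": "world"}
pairs = [("helo", "hello"), ("wrld", "world"), ("teh", "the"), ("cta", "cat")]
report = evaluate(lambda w: lookup.get(w, w), pairs)
print(report)
```

In this notebook the call would be `evaluate(knn.correct, test_set)`. Applying the same formula to the reported result (301 of 500 correct, 60.20%) gives an interval of roughly 55.9% to 64.5%.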


## 4. Error Analysis

In [14]:
print("Typo -> Truth | Prediction")
for typo, truth, pred in failed_cases[:10]:
    print(f"{typo} -> {truth} | {pred}")

Typo -> Truth | Prediction
چمر_گر -> چمر_رگ | زر_گر
پامدری -> پامردی | پادری
چنڈر -> چنڈور | نڈر
ناموسم -> ناموسوم | ناموس
رسیو -> برسیو | رسی
پیشیل -> پیشین | پیشی
گچنی -> چگنی | چنی
راجھ -> راجچھ | راج
نشرتی -> نشریت | شرتی
کھیڑجا -> کھیجڑا | کھیڑا
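
One way to read these failures is to compare how far the typo is from the truth with how far it is from the prediction. The sketch below uses the optimal string alignment distance (insert, delete, substitute, adjacent transpose, each costing 1), which matches the four edit types used to generate the typos. Whether `KNNCorrector` ranks candidates by this metric is not stated in the notebook, so treat `osa_distance` as an assumed yardstick rather than the corrector's own scoring.

```python
def osa_distance(a, b):
    """Optimal string alignment distance between strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ab" -> "ba".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


# Three of the failed cases listed above: (typo, truth, prediction).
failed_sample = [
    ("چنڈر", "چنڈور", "نڈر"),
    ("ناموسم", "ناموسوم", "ناموس"),
    ("رسیو", "برسیو", "رسی"),
]
for typo, truth, pred in failed_sample:
    print(typo, "| to truth:", osa_distance(typo, truth), "| to prediction:", osa_distance(typo, pred))
```

For these three cases both distances are 1: the truth is one insertion away from the typo and the prediction is one deletion away. Under this metric the corrector's answer is as close to the typo as the intended word, which suggests that at least some failures are ties that spelling distance alone cannot resolve.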
