# Hybrid Spell Checker Pipeline

This notebook demonstrates the **Hybrid Spell Checker** integrated into the project.

## Architecture
The pipeline consists of three stages:
1.  **Candidate Generation (Recall)**: A **KNN** search over character n-gram vectors retrieves the 50 dictionary words nearest to the input.
2.  **Candidate Ranking (Precision)**: A **Logistic Regression** classifier scores each candidate on how likely it is to be the intended word, using features such as *Edit Distance*, *Vector Distance*, and *Length Difference*.
3.  **Final Selection**: The Logistic Regression score is multiplied by the **Naive Bayes** prior (word frequency), and the highest-scoring candidate is returned as the correction.

$$ \text{Score}(c) = P_{LR}(\text{Correct} \mid \text{Features}) \times P_{\text{Prior}}(c) $$
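
To make the three stages concrete, here is a minimal, dependency-free sketch. It is not the project's `HybridCorrector` implementation: the brute-force neighbor scan stands in for a real KNN index, `lr_probability` uses made-up weights in place of the trained Logistic Regression classifier, and the boundary padding, function names, and toy word counts are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def char_ngrams(word, n=2):
    """Bag of character n-grams, padded so word boundaries are visible (assumed scheme)."""
    padded = f"^{word}$"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine_distance(a, b):
    """1 - cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0

def edit_distance(a, b):
    """Levenshtein distance using a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def top_k_candidates(typo, vocab, k=50):
    """Stage 1: brute-force nearest neighbors (a real index would avoid the full scan)."""
    query = char_ngrams(typo)
    return sorted(vocab, key=lambda w: cosine_distance(query, char_ngrams(w)))[:k]

def features(typo, cand):
    """Stage 2 inputs: edit distance, vector distance, length difference."""
    return [
        edit_distance(typo, cand),
        cosine_distance(char_ngrams(typo), char_ngrams(cand)),
        abs(len(typo) - len(cand)),
    ]

def lr_probability(feats, weights=(-1.5, -2.0, -0.5), bias=2.0):
    """Stand-in for the trained classifier: a sigmoid with hypothetical weights."""
    z = bias + sum(w * x for w, x in zip(weights, feats))
    return 1.0 / (1.0 + math.exp(-z))

def correct(typo, word_counts, k=50):
    """Stage 3: return the candidate that maximizes P_LR * P_Prior."""
    total = sum(word_counts.values())
    candidates = top_k_candidates(typo, word_counts, k)
    return max(
        candidates,
        key=lambda c: lr_probability(features(typo, c)) * word_counts[c] / total,
    )

# Toy frequency table (invented counts, not taken from the project's dictionary).
word_counts = {"کتاب": 120, "کتان": 3, "کام": 200, "کتابیں": 40}
print(correct("کتاپ", word_counts))
```

Multiplying by the prior lets a frequent word win when two candidates look equally plausible to the classifier, which is the situation for the two candidates at edit distance 1 in this toy table.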

## 1. Setup & Import

In [None]:
import sys
import os
import random
import time

# Add project root to path to import src modules
sys.path.append(os.path.abspath('..'))

from src.spell_checkers import HybridCorrector

DATA_PATH = '../data/urdu_words.txt'

print("Import successful.")

## 2. Initialization & Training
The `HybridCorrector` handles its own training and caching. On the first run, it will:
1. Load the Urdu dictionary.
2. Train the KNN Index.
3. Generate synthetic training data (typos).
4. Train the Logistic Regression classifier.
5. Save everything to `data/urdu_words.txt.hybrid.pkl`.
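
The train-once, cache-afterwards behavior described above can be sketched as follows. This is an illustrative pattern, not the actual `HybridCorrector` code: `load_or_train`, `train_fn`, and `toy_train` are hypothetical names, and the cache file name is assumed to be the dictionary's base name plus `.hybrid.pkl`, matching the path in step 5.

```python
import os
import pickle
import tempfile

def load_or_train(data_path, cache_dir, train_fn):
    """Return a cached model if one exists; otherwise train, cache, and return it."""
    # Assumed naming scheme: <dictionary file name>.hybrid.pkl inside cache_dir.
    cache_path = os.path.join(cache_dir, os.path.basename(data_path) + ".hybrid.pkl")
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as fh:
            return pickle.load(fh)  # later runs: skip training entirely
    model = train_fn(data_path)     # first run: the expensive step
    os.makedirs(cache_dir, exist_ok=True)
    with open(cache_path, "wb") as fh:
        pickle.dump(model, fh)
    return model

def toy_train(path):
    """Stand-in for the real training routine (hypothetical)."""
    print(f"training on {path} ...")
    return {"source": path, "n_words": 3}

with tempfile.TemporaryDirectory() as cache_dir:
    first = load_or_train("../data/urdu_words.txt", cache_dir, toy_train)   # trains and writes the cache
    second = load_or_train("../data/urdu_words.txt", cache_dir, toy_train)  # loads from the cache
    print(first == second)
```

With a pattern like this, deleting the `.pkl` file forces a retrain, for example after the dictionary changes. Pickle files should only be loaded from trusted locations, since unpickling can execute arbitrary code.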

In [None]:
start_time = time.time()
corrector = HybridCorrector(DATA_PATH, cache_dir='../data')
print(f"Model Ready in {time.time() - start_time:.2f}s")

# Check internal state
print(f"Dictionary Size: {len(corrector.lm_words)}")
print(f"Logistic Regression Classes: {corrector.clf.classes_}")

## 3. Interactive Demonstration
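Let's run the corrector on a few individual words. The last entry, `اردو`, is already spelled correctly, so it should ideally come back unchanged.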

In [None]:
typos = ['کتاپ', 'لیکھن', 'مشینے', 'اردو']

print(f"{'Typo':<15} -> {'Correction':<15}")
print("-" * 35)
for t in typos:
    corr = corrector.correct(t)
    print(f"{t:<15} -> {corr:<15}")

## 4. Evaluation on Synthetic Data
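We sample 500 dictionary words longer than three characters, corrupt each with one random edit (insert, delete, replace, or transpose), and measure how often the pipeline recovers the original word.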

In [None]:
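def generate_typo(word):
    """Apply one random edit (insert, delete, replace, or transpose) to `word`."""
    if len(word) < 2:
        return word
    urdu_chars = 'ابپتٹثجچحخدڈذرڑزژسشصضطظعغفقکگلمنںوہیے'
    op = random.choice(['insert', 'delete', 'replace', 'transpose'])
    chars = list(word)
    if op == 'insert':
        # Allow insertion at any position, including the end of the word.
        chars.insert(random.randint(0, len(chars)), random.choice(urdu_chars))
    elif op == 'delete':
        chars.pop(random.randrange(len(chars)))
    elif op == 'replace':
        idx = random.randrange(len(chars))
        # Exclude the current character so the replacement always changes the word.
        chars[idx] = random.choice([c for c in urdu_chars if c != chars[idx]])
    else:
        # Choose idx so that idx + 1 is always a valid position.
        idx = random.randrange(len(chars) - 1)
        chars[idx], chars[idx + 1] = chars[idx + 1], chars[idx]
    return "".join(chars)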

# Sample 500 words
random.seed(101)
valid_words = [w for w in corrector.lm_words.keys() if len(w) > 3]
test_set = [(generate_typo(w), w) for w in random.sample(valid_words, 500)]

print(f"Evaluating on {len(test_set)} words...")

correct_count = 0
start = time.time()

for typo, truth in test_set:
    pred = corrector.correct(typo)
    if pred == truth:
        correct_count += 1

duration = time.time() - start
acc = (correct_count / len(test_set)) * 100

print(f"\nHybrid Pipeline Accuracy: {acc:.2f}%")
print(f"Total Time: {duration:.2f}s")
print(f"Average Latency: {(duration/len(test_set))*1000:.2f}ms")