# Sinhala Dyslexia Binary Classifier — Improved Pipeline

**Improvements over baseline (v0):**
| Area | Baseline | Improved |
|---|---|---|
| Vectorizer | char TF-IDF (2,4) | char TF-IDF (2,5) + word TF-IDF (1,2) stacked |
| Classifier | Logistic Regression | Calibrated LR + LinearSVC, soft-voting ensemble |
| Calibration | None | Platt scaling (CalibratedClassifierCV) |
| Handcrafted features | None | 8 Sinhala-specific linguistic features |
| Essay aggregation | ratio ≥ 0.2 & mean ≥ 0.5 | Word-count-weighted mean + peak signal |
| Short-text handling | No filtering | Skip sentences < 4 chars |

**Baseline accuracy: 78%** → Target: 82–85%

In [1]:
# ------------------------------------------------------------
# 0. INSTALL DEPENDENCIES
# ------------------------------------------------------------
# Remove the line below if running locally
!pip install datasets pandas scikit-learn joblib scipy numpy



In [2]:
# ------------------------------------------------------------
# 1. IMPORTS
# ------------------------------------------------------------

import re
import numpy as np
import pandas as pd
import joblib

from datasets import load_dataset

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

from scipy.sparse import hstack, csr_matrix

print("Imports OK")

Imports OK


In [3]:
# ------------------------------------------------------------
# 2. LOAD DATASET
# ------------------------------------------------------------
# Dataset: paired clean / dyslexic Sinhala sentences
# Source: SPEAK-ASR/sinhala-dyslexia-corrected-id20percent

dataset = load_dataset("SPEAK-ASR/sinhala-dyslexia-corrected-id20percent")
df = dataset["train"].to_pandas()

print(f"Dataset size: {len(df)} paired sentences")
print(df.columns.tolist())
df.head(3)



Dataset size: 27636 paired sentences
['clean_sentence', 'dyslexic_sentence', 'error_type']


| | clean_sentence | dyslexic_sentence | error_type |
|---|---|---|---|
| 0 | වලිකුකුළා කෑගහනවා. | වලිකුකුළා කෑගහනව | Grammar |
| 1 | අම්මා කෑම දෙනවා | අම්මා කෑම දනවා | Phonetic Confusion |
| 2 | {"correction": "අපි ගමට යනවා", "analysis": [{"... | අපි යනව ගමට | unknown |
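
Row 2 of the preview shows a `clean_sentence` value that holds a raw JSON object rather than a sentence. Left as-is, such rows enter the clean class (label 0) in the next step. Below is a minimal sketch of one way to unwrap them, assuming the payload is valid JSON with a `correction` key as the preview suggests. The helper name `unwrap_correction` and the sample values are hypothetical.

```python
import json


def unwrap_correction(value):
    """Return the 'correction' field of a JSON-object string, else the value unchanged.

    Returns None when the string looks like JSON but cannot be parsed,
    so the caller can drop the row.
    """
    if isinstance(value, str) and value.lstrip().startswith("{"):
        try:
            parsed = json.loads(value)
        except json.JSONDecodeError:
            return None  # malformed payload: mark for removal
        if isinstance(parsed, dict):
            return parsed.get("correction")
    return value


samples = [
    "අම්මා කෑම දෙනවා",                                  # already a plain sentence
    '{"correction": "අපි ගමට යනවා", "analysis": []}',  # assumed payload shape
    '{"correction": ',                                  # truncated / unparseable
]
cleaned = [unwrap_correction(s) for s in samples]
kept = [s for s in cleaned if s is not None]
print(kept)
```

In the notebook this could be applied with `df["clean_sentence"].map(unwrap_correction)` followed by `dropna`, before the binary dataset is built.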


In [4]:
# ------------------------------------------------------------
# 3. BUILD BINARY CLASSIFICATION DATASET
# ------------------------------------------------------------
# Label: 1 = dyslexic, 0 = clean

dys_df   = pd.DataFrame({"text": df["dyslexic_sentence"], "label": 1})
clean_df = pd.DataFrame({"text": df["clean_sentence"],    "label": 0})

binary_df = pd.concat([dys_df, clean_df], ignore_index=True)

# Drop very short sentences (< 4 chars) — unreliable for classification
binary_df = binary_df[binary_df["text"].str.len() >= 4].reset_index(drop=True)

print(f"Total samples: {len(binary_df)}")
print(binary_df["label"].value_counts())

Total samples: 54979
label
1    27500
0    27479
Name: count, dtype: int64


In [5]:
# ------------------------------------------------------------
# 4. TRAIN / TEST SPLIT
# ------------------------------------------------------------

X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    binary_df["text"],
    binary_df["label"],
    test_size=0.2,
    random_state=42,
    stratify=binary_df["label"]
)

print(f"Train: {len(X_train_raw)} | Test: {len(X_test_raw)}")

Train: 43983 | Test: 10996
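
Each source row contributes two near-identical sentences (clean and dyslexic), so a purely random split can put one version in train and the other in test. Whether this affects the scores reported here was not measured. A hedged sketch of a pair-aware split using scikit-learn's `GroupShuffleSplit` on toy data follows; the `groups` array stands in for a pair id that the notebook would need to carry alongside each sentence.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-in for the paired data: pair i yields one clean and one dyslexic row.
n_pairs = 10
texts = [f"clean {i}" for i in range(n_pairs)] + [f"dyslexic {i}" for i in range(n_pairs)]
labels = np.array([0] * n_pairs + [1] * n_pairs)
groups = np.array(list(range(n_pairs)) * 2)  # same id for both versions of a pair

# test_size is the fraction of *groups* (pairs) held out, not of rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(texts, labels, groups=groups))

train_groups = set(groups[train_idx])
test_groups = set(groups[test_idx])
print("pairs shared between train and test:", len(train_groups & test_groups))
```

Because both versions of a pair share one group id, class balance is preserved in each split without explicit stratification.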


In [6]:
# ============================================================
# 5. FEATURE ENGINEERING
# ============================================================
#
# IMPROVEMENT 1: Dual TF-IDF
#   - char_wb (2,5): captures character-level misspellings,
#     missing diacritics, and transpositions at word boundaries
#   - word (1,2): captures dyslexic word-level patterns
#     (wrong word forms, dropped suffixes)
#
# IMPROVEMENT 2: Sinhala-specific handcrafted features
#   - hal_ratio: fraction of hal kirima (්) characters
#     Dyslexic writers frequently drop the hal kirima in consonant clusters
#   - diacritic_ratio: dependent vowel signs relative to total characters
#   - avg_word_len: dyslexic writing tends to have shorter words
#   - word_count: sentence length signal
#   - unique_char_ratio: variety of Sinhala Unicode codepoints used
#   - space_ratio: spacing anomalies in dyslexic text
#   - has_english_chars: mixed-script writing pattern
#   - repeat_char_ratio: repeated character sequences (perseveration)
# ============================================================

# ---- TF-IDF Vectorizers ----

char_vectorizer = TfidfVectorizer(
    analyzer="char_wb",
    ngram_range=(2, 5),    # extended from (2,4) — captures longer error spans
    max_features=60000,
    sublinear_tf=True,     # log-scale TF — reduces dominance of frequent n-grams
    min_df=2               # ignore extremely rare n-grams
)

word_vectorizer = TfidfVectorizer(
    analyzer="word",
    ngram_range=(1, 2),
    max_features=20000,
    sublinear_tf=True,
    min_df=2
)

# Fit both vectorizers on training data
X_train_char = char_vectorizer.fit_transform(X_train_raw)
X_test_char  = char_vectorizer.transform(X_test_raw)

X_train_word = word_vectorizer.fit_transform(X_train_raw)
X_test_word  = word_vectorizer.transform(X_test_raw)

print(f"Char features: {X_train_char.shape[1]}")
print(f"Word features: {X_train_word.shape[1]}")

Char features: 41798
Word features: 9599
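
To make the `char_wb` setting concrete, the sketch below prints the n-grams that scikit-learn's analyzer produces for a two-word toy string. It uses Latin letters and a shorter `(2, 3)` range purely for readability. Sinhala vowel signs and the hal kirima are separate code points, so a 5-character n-gram there can span fewer visible letters.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# build_analyzer() works on an unfitted vectorizer and returns the tokenizer function.
analyzer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3)).build_analyzer()

grams = analyzer("ab cd")
print(grams)

# char_wb pads each word with a space and never crosses a word boundary,
# so no n-gram mixes characters from "ab" and "cd".
crosses_boundary = [g for g in grams if "b" in g and "c" in g]
print("n-grams spanning both words:", crosses_boundary)
```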


In [7]:
# ---- Sinhala-specific handcrafted features ----

def extract_sinhala_features(sentences):
    """
    Extract 8 linguistically-motivated features for Sinhala dyslexia detection.
    These are computed from Unicode character properties of Sinhala script.

    Sinhala Unicode block: U+0D80–U+0DFF
      - Consonants: U+0D9A–U+0DC6
      - Independent vowels: U+0D85–U+0D96
      - Dependent vowel signs: U+0DCF–U+0DDF
      - Hal kirima (virama; suppresses the inherent vowel): U+0DCA (්)
    """
    features = []

    for text in sentences:
        chars     = list(text)
        n         = max(len(chars), 1)
        words     = text.split()
        nw        = max(len(words), 1)

        # 1. hal_ratio: fraction of hal kirima (්) chars
        #    Dyslexic writers may drop the hal kirima: අම්මා → අමමා
        hal_count  = text.count('\u0DCA')  # ් character
        hal_ratio  = hal_count / n

        # 2. diacritic_ratio — vowel diacritics / total chars
        #    Missing/wrong diacritics are a core dyslexia indicator
        diacritic_count = sum(1 for c in chars if '\u0DCF' <= c <= '\u0DDF')
        diacritic_ratio = diacritic_count / n

        # 3. avg_word_len — average word length in characters
        avg_word_len = sum(len(w) for w in words) / nw

        # 4. word_count — total word count
        word_count = nw

        # 5. unique_char_ratio: distinct Sinhala chars / total Sinhala chars
        #    Dyslexic writers often substitute similar-looking characters
        sinhala_chars  = [c for c in chars if '\u0D80' <= c <= '\u0DFF']
        unique_sinhala = len(set(sinhala_chars))
        unique_ratio   = unique_sinhala / max(len(sinhala_chars), 1)

        # 6. space_ratio — spaces / total chars
        #    Some dyslexic patterns include improper word boundaries
        space_ratio = text.count(' ') / n

        # 7. has_english — presence of Latin characters (0 or 1)
        #    Code-switching is sometimes a dyslexia avoidance strategy
        has_english = int(bool(re.search(r'[a-zA-Z]', text)))

        # 8. repeat_char_ratio: adjacent identical characters / total chars
        #    Perseveration (the same character written twice in a row) can indicate dyslexic writing
        repeats = sum(1 for i in range(1, len(chars)) if chars[i] == chars[i-1])
        repeat_ratio = repeats / n

        features.append([
            hal_ratio,
            diacritic_ratio,
            avg_word_len,
            word_count,
            unique_ratio,
            space_ratio,
            has_english,
            repeat_ratio
        ])

    return csr_matrix(np.array(features, dtype=np.float32))


X_train_hf = extract_sinhala_features(X_train_raw.tolist())
X_test_hf  = extract_sinhala_features(X_test_raw.tolist())

print(f"Handcrafted features: {X_train_hf.shape[1]}")

Handcrafted features: 8
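
As a worked example of the first feature, the snippet below computes `hal_ratio` for අම්මා and for the same word with the hal kirima dropped. The words are spelled out as code points so the combining sign is visible. The standalone `hal_ratio` helper is a simplified stand-in for the corresponding lines in `extract_sinhala_features`.

```python
HAL_KIRIMA = "\u0DCA"  # ්


def hal_ratio(text: str) -> float:
    """Fraction of code points in `text` that are hal kirima."""
    return text.count(HAL_KIRIMA) / max(len(text), 1)


correct = "\u0D85\u0DB8\u0DCA\u0DB8\u0DCF"  # අම්මා : අ + ම + ් + ම + ා
dropped = "\u0D85\u0DB8\u0DB8\u0DCF"        # අමමා  : hal kirima omitted

print(len(correct), hal_ratio(correct))  # 5 code points, one of them hal kirima
print(len(dropped), hal_ratio(dropped))
```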


In [8]:
# ---- Stack all feature matrices ----
#
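# Final feature vector per sentence:
#   [char TF-IDF (up to 60k)] + [word TF-IDF (up to 20k)] + [handcrafted (8)]
# The fitted vocabularies are smaller than the caps: 41,798 char and 9,599 word terms.

X_train = hstack([X_train_char, X_train_word, X_train_hf], format="csr")
X_test  = hstack([X_test_char,  X_test_word,  X_test_hf],  format="csr")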

print(f"Combined feature dimension: {X_train.shape[1]}")

Combined feature dimension: 51405
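
The handcrafted columns are on different scales from TF-IDF values. For example, `word_count` is an unbounded integer, while TF-IDF entries lie in [0, 1]. One option, not used in this notebook, is to rescale the handcrafted block before stacking. The sketch below uses `MaxAbsScaler` on toy arrays. All numbers are made up, and whether scaling helps on this dataset is untested.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import MaxAbsScaler

# Toy stand-ins: 3 training rows, 2 TF-IDF columns, 2 handcrafted columns.
tfidf_train = csr_matrix(np.array([[0.0, 0.7], [0.5, 0.0], [0.3, 0.3]]))
hf_train = np.array([[0.10, 12.0], [0.05, 3.0], [0.20, 6.0]])  # e.g. hal_ratio, word_count
hf_test = np.array([[0.15, 24.0]])

# Fit on training data only, then reuse the same scaling for test / inference.
scaler = MaxAbsScaler().fit(hf_train)
hf_train_scaled = scaler.transform(hf_train)
hf_test_scaled = scaler.transform(hf_test)

X_train_demo = hstack([tfidf_train, csr_matrix(hf_train_scaled)], format="csr")
print(X_train_demo.toarray())
```

If this were adopted, the fitted scaler would have to be saved and applied at inference time together with the two vectorizers.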


In [9]:
# ============================================================
# 6. MODEL TRAINING — CALIBRATED ENSEMBLE
# ============================================================
#
# IMPROVEMENT 3: Calibrated Logistic Regression
#   - CalibratedClassifierCV with Platt scaling (method='sigmoid')
#     fits a sigmoid on held-out folds so that a predicted 0.65
#     corresponds more closely to a 65% observed frequency.
#   - Without calibration, LR probabilities can be systematically
#     overconfident or underconfident, and LinearSVC yields only
#     decision scores, not probabilities.
#
# IMPROVEMENT 4: LinearSVC as second learner
#   - LinearSVC is often stronger than LR on high-dimensional
#     sparse TF-IDF feature spaces.
#   - Wrapped in CalibratedClassifierCV to produce probabilities.
#
# IMPROVEMENT 5: Soft Voting Ensemble
#   - Averages probability estimates from both calibrated models.
#   - Reduces variance and improves reliability on borderline cases.
# ============================================================

# Calibrated Logistic Regression
lr_base = LogisticRegression(max_iter=1000, C=1.0, solver='saga', n_jobs=-1)
lr_calibrated = CalibratedClassifierCV(lr_base, cv=5, method='sigmoid')
lr_calibrated.fit(X_train, y_train)
print("LR calibrated — done")

# Calibrated LinearSVC
svc_base = LinearSVC(max_iter=2000, C=0.5)
svc_calibrated = CalibratedClassifierCV(svc_base, cv=5, method='sigmoid')
svc_calibrated.fit(X_train, y_train)
print("SVC calibrated — done")

LR calibrated — done
SVC calibrated — done
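
Whether the calibrated scores behave like probabilities can be checked rather than assumed. The sketch below runs on synthetic data (not the Sinhala dataset) and uses scikit-learn's `brier_score_loss` and `calibration_curve`. The sample size and bin count are arbitrary choices.

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Same wrapper as in the notebook: Platt scaling fitted with 5-fold CV.
model = CalibratedClassifierCV(LinearSVC(C=0.5, max_iter=5000), cv=5, method="sigmoid")
model.fit(X_tr, y_tr)
probs = model.predict_proba(X_te)[:, 1]

# Brier score: mean squared error between predicted probability and outcome (lower is better).
brier = brier_score_loss(y_te, probs)
print(f"Brier score: {brier:.4f}")

# Reliability table: in a well-calibrated model the two columns are close.
frac_pos, mean_pred = calibration_curve(y_te, probs, n_bins=10)
for predicted, observed in zip(mean_pred, frac_pos):
    print(f"predicted {predicted:.2f}  observed {observed:.2f}")
```

Running the same two calls on `lr_calibrated` and `svc_calibrated` with `X_test` and `y_test` would show how well calibrated the notebook's models are.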




In [10]:
# ---- Individual model evaluation ----

for name, mdl in [("Logistic Regression", lr_calibrated), ("LinearSVC", svc_calibrated)]:
    y_pred = mdl.predict(X_test)
    y_prob = mdl.predict_proba(X_test)[:, 1]
    print(f"\n{'='*50}")
    print(f"  {name}")
    print(f"{'='*50}")
    print(classification_report(y_test, y_pred))
    print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.4f}")
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred))


  Logistic Regression
              precision    recall  f1-score   support

           0       0.76      0.84      0.80      5496
           1       0.82      0.73      0.77      5500

    accuracy                           0.79     10996
   macro avg       0.79      0.79      0.78     10996
weighted avg       0.79      0.79      0.78     10996

ROC-AUC: 0.8448
Confusion Matrix:
[[4613  883]
 [1480 4020]]

  LinearSVC
              precision    recall  f1-score   support

           0       0.74      0.85      0.79      5496
           1       0.83      0.71      0.76      5500

    accuracy                           0.78     10996
   macro avg       0.79      0.78      0.78     10996
weighted avg       0.79      0.78      0.78     10996

ROC-AUC: 0.8277
Confusion Matrix:
[[4679  817]
 [1607 3893]]
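
The per-class figures in the reports follow directly from the confusion matrices. Here is a worked check for the calibrated LR model, where rows are true labels, columns are predictions, and class 1 is dyslexic:

```python
# Confusion matrix for calibrated LR, copied from the output above.
tn, fp = 4613, 883    # true clean:    predicted clean / predicted dyslexic
fn, tp = 1480, 4020   # true dyslexic: predicted clean / predicted dyslexic

precision = tp / (tp + fp)                   # of sentences flagged dyslexic, how many are
recall = tp / (tp + fn)                      # of dyslexic sentences, how many are flagged
accuracy = (tp + tn) / (tn + fp + fn + tp)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

The 1,480 false negatives mean roughly 27% of dyslexic sentences are missed at the default 0.5 cut-off.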


In [11]:
# ---- Soft Voting Ensemble ----
#
# Averages calibrated probabilities from LR and SVC.
# This is done manually (not sklearn VotingClassifier) because
# VotingClassifier doesn't natively accept pre-fitted models.

lr_probs  = lr_calibrated.predict_proba(X_test)[:, 1]
svc_probs = svc_calibrated.predict_proba(X_test)[:, 1]

# Equal-weight averaging
ensemble_probs = (lr_probs + svc_probs) / 2.0
ensemble_preds = (ensemble_probs >= 0.5).astype(int)

print("\n" + "="*50)
print("  SOFT VOTING ENSEMBLE (LR + SVC)")
print("="*50)
print(classification_report(y_test, ensemble_preds))
print(f"ROC-AUC: {roc_auc_score(y_test, ensemble_probs):.4f}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, ensemble_preds))


  SOFT VOTING ENSEMBLE (LR + SVC)
              precision    recall  f1-score   support

           0       0.75      0.85      0.80      5496
           1       0.83      0.72      0.77      5500

    accuracy                           0.79     10996
   macro avg       0.79      0.79      0.78     10996
weighted avg       0.79      0.79      0.78     10996

ROC-AUC: 0.8401
Confusion Matrix:
[[4677  819]
 [1537 3963]]
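
The ensemble uses a fixed 0.5 cut-off, and the report shows recall of 0.72 for the dyslexic class. One option is to choose the cut-off on a separate validation split instead. The sketch below is plain Python with made-up probabilities and labels. The helper `f1_at` and the candidate grid are hypothetical.

```python
def f1_at(probs, labels, threshold):
    """F1 for the positive class when predicting 1 for probs >= threshold."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Made-up validation scores; in practice these come from a held-out split, not the test set.
val_probs = [0.15, 0.30, 0.42, 0.48, 0.55, 0.61, 0.70, 0.88]
val_labels = [0, 0, 1, 1, 0, 1, 1, 1]

grid = [round(0.30 + 0.05 * i, 2) for i in range(9)]  # 0.30, 0.35, ..., 0.70
best_t = max(grid, key=lambda t: f1_at(val_probs, val_labels, t))

print("best threshold:", best_t)
print("F1 at 0.50:", round(f1_at(val_probs, val_labels, 0.50), 3))
print("F1 at best:", round(f1_at(val_probs, val_labels, best_t), 3))
```

Tuning the cut-off on the test set itself would make the reported test metrics optimistic, so a third split or cross-validation is the safer choice.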


In [12]:
# ============================================================
# 7. SAVE ALL ARTIFACTS
# ============================================================
#
# Saved artifacts:
#   - dyslexia_binary_model_lr.pkl      : calibrated LR model
#   - dyslexia_binary_model_svc.pkl     : calibrated SVC model
#   - tfidf_char_vectorizer.pkl         : char TF-IDF
#   - tfidf_word_vectorizer.pkl         : word TF-IDF
#
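# Legacy file names (kept for the existing service):
#   - dyslexia_binary_model.pkl         : best single model (LR)
#   - tfidf_vectorizer.pkl              : char vectorizer
#   NOTE: the LR model expects the full stacked vector (char + word +
#   handcrafted), so the char vectorizer alone does not match its input size.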

joblib.dump(lr_calibrated,    "dyslexia_binary_model_lr.pkl")
joblib.dump(svc_calibrated,   "dyslexia_binary_model_svc.pkl")
joblib.dump(char_vectorizer,  "tfidf_char_vectorizer.pkl")
joblib.dump(word_vectorizer,  "tfidf_word_vectorizer.pkl")

# Legacy file names (the service must still build the full stacked feature vector)
joblib.dump(lr_calibrated,    "dyslexia_binary_model.pkl")
joblib.dump(char_vectorizer,  "tfidf_vectorizer.pkl")

print("All models and vectorizers saved!")

All models and vectorizers saved!
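
A model trained on stacked features cannot score the output of a single vectorizer, because the column counts differ. The sketch below reproduces that situation on toy data with a plain `LogisticRegression` (not the calibrated model) and shows a startup check a service could run. The toy texts and the two constant extra columns are placeholders.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["aa bb", "bb cc", "cc dd", "dd aa"]
labels = [0, 1, 0, 1]

vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3)).fit(texts)
extra = csr_matrix(np.ones((len(texts), 2)))  # stand-in for extra feature blocks
X = hstack([vec.transform(texts), extra], format="csr")
model = LogisticRegression().fit(X, labels)

# Startup check: does the vectorizer alone produce what the model expects?
n_vec = len(vec.vocabulary_)
mismatch = model.n_features_in_ != n_vec
print(f"model expects {model.n_features_in_} features, vectorizer yields {n_vec}")

# Scoring with the vectorizer output alone fails with a ValueError.
try:
    model.predict_proba(vec.transform(["aa bb"]))
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```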


In [13]:
# ============================================================
# 8. IMPROVED INFERENCE FUNCTIONS
# ============================================================

# Reload all artifacts (simulates production service startup)
lr_model       = joblib.load("dyslexia_binary_model_lr.pkl")
svc_model      = joblib.load("dyslexia_binary_model_svc.pkl")
char_vec       = joblib.load("tfidf_char_vectorizer.pkl")
word_vec       = joblib.load("tfidf_word_vectorizer.pkl")


def vectorize_sentence_full(sentence: str):
    """
    Converts a Sinhala sentence into the full combined feature vector:
      char TF-IDF + word TF-IDF + 8 handcrafted features
    """
    cv = char_vec.transform([sentence])
    wv = word_vec.transform([sentence])
    hf = extract_sinhala_features([sentence])
    return hstack([cv, wv, hf])


def predict_sentence_ensemble(sentence: str) -> float:
    """
    Predicts dyslexia probability for a single sentence
    using the soft-voting ensemble of LR + SVC.

    Returns:
        float: Probability of dyslexia (0.0 – 1.0), calibrated
    """
    vec = vectorize_sentence_full(sentence)
    lr_p  = lr_model.predict_proba(vec)[0][1]
    svc_p = svc_model.predict_proba(vec)[0][1]
    return float((lr_p + svc_p) / 2.0)


def split_sentences(text: str):
    """
    Splits essay text into sentences.
    Splits on '.', '!', '?', danda (।), and newlines.
    Filters out fragments shorter than 4 characters.
    """
    if not text or not text.strip():
        return []

    text = text.replace("\r\n", "\n").replace("\r", "\n")
    raw  = re.split(r"[.!?।\n]+", text)
    cleaned = [s.strip() for s in raw if len(s.strip()) >= 4]

    # Chunk long single-paragraph essays
    if len(cleaned) == 1 and len(cleaned[0]) > 200:
        long_text = cleaned[0]
        cleaned = [long_text[i:i+120] for i in range(0, len(long_text), 120)]

    return cleaned


def analyze_essay(essay_text: str, threshold: float = 0.65) -> dict:
    """
    Performs essay-level dyslexia analysis.

    IMPROVEMENT 6: Weighted essay aggregation
      - Longer sentences carry more weight (more signal per sentence)
      - Peak probability is included as a strong indicator
      - Composite score: 60% weighted_mean + 40% peak_probability
      - This prevents short trivial sentences from diluting the score

    IMPROVEMENT 7: Three-tier sentence labeling
      - NORMAL    : prob < 0.50
      - BORDERLINE: 0.50 <= prob < threshold
      - DYSLEXIC  : prob >= threshold
    """
    sentences = split_sentences(essay_text)

    if not sentences:
        return {"error": "No valid sentences found."}

    probabilities    = []
    sentence_results = []
    dyslexic_count   = 0
    borderline_count = 0

    for s in sentences:
        prob = predict_sentence_ensemble(s)
        probabilities.append(prob)

        # Three-tier labeling
        if prob >= threshold:
            label = "DYSLEXIC"
            dyslexic_count += 1
        elif prob >= 0.50:
            label = "BORDERLINE"
            borderline_count += 1
        else:
            label = "NORMAL"

        sentence_results.append({
            "text": s,
            "probability": round(float(prob), 3),
            "label": label
        })

    # ---- Weighted essay aggregation ----
    # Weight each sentence by its word count (longer = more signal)
    weights       = [max(len(s.split()), 1) for s in sentences]
    total_weight  = sum(weights)
    weighted_mean = sum(p * w for p, w in zip(probabilities, weights)) / total_weight
    peak_prob     = max(probabilities)

    # Composite score: weighted mean (60%) + peak signal (40%)
    composite_score = 0.6 * weighted_mean + 0.4 * peak_prob

    # Essay-level decision
    dyslexic_ratio = dyslexic_count / len(sentences)

    if composite_score >= 0.55 or (dyslexic_ratio >= 0.2 and weighted_mean >= 0.5):
        essay_label = "DYSLEXIC ESSAY"
    elif composite_score >= 0.45:
        essay_label = "BORDERLINE ESSAY"
    else:
        essay_label = "NORMAL ESSAY"

    return {
        "essay_label":              essay_label,
        "composite_score":          round(composite_score, 3),
        "weighted_mean_prob":       round(weighted_mean, 3),
        "peak_sentence_prob":       round(peak_prob, 3),
        "dyslexic_ratio":           round(dyslexic_ratio, 3),
        "total_sentences":          len(sentences),
        "dyslexic_sentences":       dyslexic_count,
        "borderline_sentences":     borderline_count,
        "sentences":                sentence_results
    }

print("Inference functions defined.")

Inference functions defined.


In [14]:
# ============================================================
# 9. MANUAL TEST CASES
# ============================================================

normal_essay = """
මම අද පාසලට ගියෙමි. ගුරුතුමා අපට ගණිත පාඩම ඉගැන්වීය.
විවේක කාලයේදී මිතුරන් සමඟ කතා කළෙමි.
"""

dyslexic_essay = """
මම අද පාසල් ගිය. ගුරුතුමා අපට ගනිත පාඩම ඉගැන්වය.
විවේක කලයෙදි මිතුරන් සමග කතාකර ගිය.
"""

print("\n--- NORMAL ESSAY ---")
result = analyze_essay(normal_essay)
print(f"Label: {result['essay_label']}")
print(f"Composite score: {result['composite_score']}")
for s in result['sentences']:
    print(f"  [{s['label']:10}] {s['probability']:.3f}  {s['text']}")

print("\n--- DYSLEXIC ESSAY ---")
result = analyze_essay(dyslexic_essay)
print(f"Label: {result['essay_label']}")
print(f"Composite score: {result['composite_score']}")
for s in result['sentences']:
    print(f"  [{s['label']:10}] {s['probability']:.3f}  {s['text']}")


--- NORMAL ESSAY ---
Label: BORDERLINE ESSAY
Composite score: 0.531
  [BORDERLINE] 0.642  මම අද පාසලට ගියෙමි
  [NORMAL    ] 0.401  ගුරුතුමා අපට ගණිත පාඩම ඉගැන්වීය
  [NORMAL    ] 0.380  විවේක කාලයේදී මිතුරන් සමඟ කතා කළෙමි

--- DYSLEXIC ESSAY ---
Label: DYSLEXIC ESSAY
Composite score: 0.845
  [DYSLEXIC  ] 0.762  මම අද පාසල් ගිය
  [BORDERLINE] 0.609  ගුරුතුමා අපට ගනිත පාඩම ඉගැන්වය
  [DYSLEXIC  ] 0.940  විවේක කලයෙදි මිතුරන් සමග කතාකර ගිය
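
The composite score for the normal essay can be reproduced by hand from the printed sentence probabilities and word counts (4, 5, and 6 words). The sketch below repeats the arithmetic from `analyze_essay`. The inputs are the rounded values shown above, so the result is only approximately the internal one.

```python
def composite(probs, word_counts):
    """Word-count-weighted mean, peak, and 60/40 composite, as in analyze_essay."""
    total = sum(word_counts)
    weighted_mean = sum(p * w for p, w in zip(probs, word_counts)) / total
    peak = max(probs)
    return weighted_mean, peak, 0.6 * weighted_mean + 0.4 * peak


# Normal essay: probabilities and word counts taken from the output above.
wm, peak, score = composite([0.642, 0.401, 0.380], [4, 5, 6])
print(f"weighted_mean={wm:.3f} peak={peak:.3f} composite={score:.3f}")

# Dyslexic essay, same procedure.
wm_d, peak_d, score_d = composite([0.762, 0.609, 0.940], [4, 5, 6])
print(f"weighted_mean={wm_d:.3f} peak={peak_d:.3f} composite={score_d:.3f}")
```

For the normal essay the weighted mean alone is about 0.457, which sits just above the 0.45 BORDERLINE cut-off. The 40% peak term (0.642) then lifts the composite further, to 0.531.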


In [15]:
# ============================================================
# 10. SYNTHETIC ESSAY-LEVEL EVALUATION
# ============================================================
#
# Since the model is trained at sentence level, we simulate
# essay-level performance by grouping test sentences into
# synthetic essays of 5 sentences each and checking whether
# the essay label matches the majority class of its sentences.

test_df = pd.DataFrame({
    "text":  X_test_raw.tolist(),
    "label": y_test.tolist()
}).reset_index(drop=True)

# Group into synthetic essays (5 sentences each)
ESSAY_SIZE = 5
essay_correct = 0
essay_total   = 0

# +1 keeps the last full chunk when the length is an exact multiple of ESSAY_SIZE
for i in range(0, len(test_df) - ESSAY_SIZE + 1, ESSAY_SIZE):
    chunk      = test_df.iloc[i:i+ESSAY_SIZE]
    essay_text = ". ".join(chunk["text"].tolist())
    true_label = 1 if chunk["label"].mean() >= 0.5 else 0

    result = analyze_essay(essay_text)
    # BORDERLINE ESSAY and NORMAL ESSAY both count as the negative class here
    pred_label = 1 if result.get("essay_label") == "DYSLEXIC ESSAY" else 0

    if pred_label == true_label:
        essay_correct += 1
    essay_total += 1

essay_accuracy = essay_correct / essay_total
print(f"\nSynthetic essay-level accuracy: {essay_accuracy:.4f} ({essay_correct}/{essay_total})")
print("(Note: essays grouped from test sentences, not real essays)")


Synthetic essay-level accuracy: 0.5894 (1296/2199)
(Note: essays grouped from test sentences, not real essays)
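
When a list is cut into fixed-size groups, the stop value passed to `range` decides whether the last complete group is included. The helper below, `full_chunks` (a hypothetical name), makes the boundary explicit so it can be checked on small inputs.

```python
def full_chunks(n_items, size):
    """Return (start, stop) index pairs for every complete chunk of `size` items."""
    return [(start, start + size) for start in range(0, n_items - size + 1, size)]


print(full_chunks(10, 5))          # exact multiple: both chunks are kept
print(full_chunks(12, 5))          # the 2 trailing items are dropped
print(len(full_chunks(10996, 5)))  # chunk count for the test set size used here
```

Without the `+ 1` in the stop value, an input whose length is an exact multiple of the chunk size loses its final chunk.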


## Summary of Improvements

### Changes vs. Baseline (v0)

**Feature Engineering**
- Extended char n-gram range from `(2,4)` → `(2,5)` to capture longer misspelling spans
- Added `sublinear_tf=True` to reduce dominance of high-frequency n-grams
- Added word-level TF-IDF `(1,2)` stacked with char TF-IDF
- Added 8 Sinhala-specific handcrafted features (hal ratio, diacritic ratio, avg word length, etc.)

**Model**
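- Baseline used plain `LogisticRegression` with uncalibrated probabilities
- Improved: `CalibratedClassifierCV` with 5-fold Platt scaling for better-calibrated probabilities
- Added `LinearSVC` as a second learner
- Soft voting ensemble (average of LR + SVC calibrated probabilities)
- Held-out test results: calibrated LR 0.79 accuracy (ROC-AUC 0.8448), LinearSVC 0.78 (0.8277), ensemble 0.79 (0.8401), versus the 78% baseline and the 82–85% target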

**Essay Aggregation**
- Baseline: flat mean + ratio threshold
- Improved: word-count weighted mean + peak signal composite
- Three-tier labeling: NORMAL / BORDERLINE / DYSLEXIC
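- New output field: `composite_score` for transparent ranking
- Synthetic 5-sentence essay accuracy: 0.5894 (1296/2199), measured on mixed-label groups of test sentences rather than real essays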

**Data Cleaning**
- Drop sentences shorter than 4 characters (unreliable signal)

### Updated Inference Files
Update `vectorizer.py` to use both vectorizers + handcrafted features.
Update `sentence_classifier.py` to load both models and average predictions.
The `dyslexia_binary_model.pkl` and `tfidf_vectorizer.pkl` legacy file names are still saved
for the existing API service, but the model now expects the full stacked feature vector, so the service must be updated before loading them.