## Sinhala Dyslexic Writing-Pattern Classifier
### Part B: Essay-level dyslexic writing-pattern profiling (rule-based)

⚠️ Note:
This module performs rule-based profiling of dyslexic writing patterns, not supervised classification.
Essay-level patterns are inferred via dominance-weighted aggregation
of sentence-level surface error patterns.
Because the labels are inferred rather than annotated, no accuracy or other classification metrics are reported.


# Essay-Level Dyslexic Writing Pattern Profiling

This notebook aggregates finalized **sentence-level dyslexic writing patterns**
into fixed-size essay abstractions.

- Input: sentence-level pattern annotations
- Output: essay-level dominance profiles
- Approach: rule-based aggregation (no machine learning)

This stage is designed for **explainability**, not prediction.


The dataset does not provide explicit essay boundaries. Therefore, essays are approximated by grouping consecutive sentences into fixed-size segments of five sentences. These segments are referred to as pseudo-essays and are used for essay-level pattern aggregation.

In [4]:
import pandas as pd


In [6]:
feature_df = pd.read_csv("sentence_level_patterns.csv")

print("Loaded sentence-level patterns")
feature_df.head()


Loaded sentence-level patterns


index,char_addition,char_omission,char_substitution,clean_sentence,dyslexic_sentence,has_addition,has_omission,has_substitution,word_count_diff,has_spacing_issue,has_diacritic_loss,writing_pattern,writing_pattern_v2,writing_pattern_v3
0,0,2,0,වලිකුකුළා කෑගහනවා.,වලිකුකුළා කෑගහනව,False,True,False,0,False,True,Orthographic Instability,Orthographic Instability,Orthographic Instability
1,0,1,0,අම්මා කෑම දෙනවා,අම්මා කෑම දනවා,False,True,False,0,False,True,Orthographic Instability,Orthographic Instability,Orthographic Instability
2,0,0,0,එයා එනකන් ඉඩපන්,එයා එනකන් ඉඩපන්,False,False,False,0,False,False,No Dominant Pattern,No Dominant Pattern,No Dominant Pattern
3,0,2,0,රුපියල් දෙදාහක් තියෙනවා,රුපියල් දෙදාහක් තියනව,False,True,False,0,False,True,Orthographic Instability,Orthographic Instability,Orthographic Instability
4,1,0,0,ගාල්ලට යන්න ඕනෙ,ගාල්ලට යන්න ඕනෙඩ,True,False,False,0,False,False,No Dominant Pattern,No Dominant Pattern,No Dominant Pattern


## Essay Abstraction Strategy

Since the dataset is sentence-level, essays are **simulated** by grouping
a fixed number of consecutive sentences.

This is a **design abstraction** used solely for pattern profiling
and does not claim to represent real student essays.


In [32]:
ESSAY_SIZE = 5

feature_df["essay_id"] = feature_df.index // ESSAY_SIZE

feature_df[["essay_id", "clean_sentence", "writing_pattern"]].head(10)


index,essay_id,clean_sentence,writing_pattern
0,0,වලිකුකුළා කෑගහනවා.,Orthographic Instability
1,0,අම්මා කෑම දෙනවා,Orthographic Instability
2,0,එයා එනකන් ඉඩපන්,No Dominant Pattern
3,0,රුපියල් දෙදාහක් තියෙනවා,Orthographic Instability
4,0,ගාල්ලට යන්න ඕනෙ,No Dominant Pattern
5,1,පැන දෙන්න,Phonetic Confusion
6,1,කළ දුරකථනය දෙනවා.,Mixed Dyslexic Pattern
7,1,උදේට කෑම කනවද,No Dominant Pattern
8,1,පන්ති යන්න ඕනෙද,Phonetic Confusion
9,1,ගානවා,Orthographic Instability


In [21]:
# STEP 2 — Aggregate sentence-level patterns per essay
essay_pattern_counts = (
    feature_df
    .groupby("essay_id")["writing_pattern"]
    .value_counts()
    .unstack(fill_value=0)
)

essay_pattern_counts.head()


essay_id,Mixed Dyslexic Pattern,No Dominant Pattern,Orthographic Instability,Phonetic Confusion,Word Boundary Confusion
0,0,2,3,0,0
1,1,1,1,2,0
2,1,1,1,2,0
3,0,1,2,2,0
4,1,1,3,0,0
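As a cross-check on the aggregation step, `pd.crosstab` should give the same essay-by-pattern count table. The sketch below is not part of the original notebook; it rebuilds the labels of the first ten sentences shown earlier into a small frame (`toy_df` is a name introduced here) and compares the two routes. The toy frame has no "Word Boundary Confusion" sentence, so that column is absent from both results.

```python
import pandas as pd

# Labels of the first ten sentences, copied from the table shown earlier.
labels = [
    "Orthographic Instability", "Orthographic Instability", "No Dominant Pattern",
    "Orthographic Instability", "No Dominant Pattern",
    "Phonetic Confusion", "Mixed Dyslexic Pattern", "No Dominant Pattern",
    "Phonetic Confusion", "Orthographic Instability",
]
toy_df = pd.DataFrame({"essay_id": [0] * 5 + [1] * 5, "writing_pattern": labels})

# Route used in the notebook: per-essay value counts pivoted into columns.
via_groupby = (
    toy_df.groupby("essay_id")["writing_pattern"]
    .value_counts()
    .unstack(fill_value=0)
)

# Alternative route: a contingency table of essay_id against writing_pattern.
via_crosstab = pd.crosstab(toy_df["essay_id"], toy_df["writing_pattern"])

# Compare as dictionaries so that column order does not matter.
print(via_crosstab.to_dict() == via_groupby.to_dict())
```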


In [22]:
# STEP 3 — Infer dominant pattern per essay
essay_summary = essay_pattern_counts.copy()

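# idxmax returns the leftmost column holding the row maximum, so a tie
# between two patterns is resolved by column order, not by any weighting.
essay_summary["dominant_pattern"] = essay_summary.idxmax(axis=1)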

essay_summary[["dominant_pattern"]].head()


essay_id,dominant_pattern
0,Orthographic Instability
1,Phonetic Confusion
2,Phonetic Confusion
3,Orthographic Instability
4,Orthographic Instability
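Essay 3 in the counts table holds two "Orthographic Instability" and two "Phonetic Confusion" sentences, yet it is reported as "Orthographic Instability". The sketch below, which is an illustration rather than part of the original pipeline, reproduces essays 0 and 3 and flags rows where more than one pattern shares the maximum count. The `is_tied` column is a name introduced here.

```python
import pandas as pd

# Counts copied from essay 0 and essay 3 in the counts table above.
counts = pd.DataFrame(
    {
        "Mixed Dyslexic Pattern": [0, 0],
        "No Dominant Pattern": [2, 1],
        "Orthographic Instability": [3, 2],
        "Phonetic Confusion": [0, 2],
        "Word Boundary Confusion": [0, 0],
    },
    index=pd.Index([0, 3], name="essay_id"),
)

# idxmax picks the first column that holds the row maximum.
dominant = counts.idxmax(axis=1)

# A row is tied when more than one column equals its maximum.
row_max = counts.max(axis=1)
is_tied = counts.eq(row_max, axis=0).sum(axis=1) > 1

print(pd.DataFrame({"dominant_pattern": dominant, "is_tied": is_tied}))
```

Flagging ties this way keeps the rule-based label unchanged while making it visible which essays were decided by column order alone.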


In [24]:
# STEP 4.1 — Confidence calculation (FIXED)

pattern_columns = [
    "Mixed Dyslexic Pattern",
    "No Dominant Pattern",
    "Orthographic Instability",
    "Phonetic Confusion",
    "Word Boundary Confusion"
]

essay_summary["max_count"] = essay_summary[pattern_columns].max(axis=1)
essay_summary["total_sentences"] = essay_summary[pattern_columns].sum(axis=1)

essay_summary["confidence"] = (
    essay_summary["max_count"] / essay_summary["total_sentences"]
)

essay_summary[["dominant_pattern", "confidence"]].head()


essay_id,dominant_pattern,confidence
0,Orthographic Instability,0.6
1,Phonetic Confusion,0.4
2,Phonetic Confusion,0.4
3,Orthographic Instability,0.4
4,Orthographic Instability,0.6
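In symbols, writing $c_{e,p}$ for the number of sentences in pseudo-essay $e$ that carry pattern $p$ (notation introduced here for illustration, not taken from the notebook), the confidence computed above is the share of sentences carrying the most frequent label. Essay 0, with counts 0, 2, 3, 0, 0, serves as the worked case:

```latex
\[
\mathrm{confidence}(e) = \frac{\max_{p} c_{e,p}}{\sum_{p} c_{e,p}},
\qquad
\mathrm{confidence}(0) = \frac{3}{0 + 2 + 3 + 0 + 0} = \frac{3}{5} = 0.6 .
\]
```

Read this way, the value is a label share within the segment rather than a probability estimated by a model.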


In [26]:
def dominance_strength(conf):
    if conf >= 0.6:
        return "Strong"
    elif conf >= 0.4:
        return "Moderate"
    else:
        return "Weak / Mixed"

essay_summary["dominance_strength"] = (
    essay_summary["confidence"].apply(dominance_strength)
)

essay_summary[
    ["dominant_pattern", "confidence", "dominance_strength"]
].head()


essay_id,dominant_pattern,confidence,dominance_strength
0,Orthographic Instability,0.6,Strong
1,Phonetic Confusion,0.4,Moderate
2,Phonetic Confusion,0.4,Moderate
3,Orthographic Instability,0.4,Moderate
4,Orthographic Instability,0.6,Strong
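Because a full pseudo-essay has five sentences, its confidence can only take the values k/5, where k is the count of the most frequent label. The sketch below enumerates these values against the same thresholds. It is a small illustration, assuming every segment is complete.

```python
ESSAY_SIZE = 5  # sentences per pseudo-essay, as in the notebook


def dominance_strength(conf):
    # Same thresholds as the cell above.
    if conf >= 0.6:
        return "Strong"
    elif conf >= 0.4:
        return "Moderate"
    else:
        return "Weak / Mixed"


# k is the count of the most frequent label in a five-sentence segment.
labels_by_max_count = {
    k: dominance_strength(k / ESSAY_SIZE) for k in range(1, ESSAY_SIZE + 1)
}
print(labels_by_max_count)
```

Under this assumption, "Weak / Mixed" is reached only when k = 1, that is, when no label occurs more than once in the segment, which may explain why that bucket stays very small.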


In [27]:
essay_summary["dominant_pattern"].value_counts(normalize=True)


dominant_pattern,proportion
Orthographic Instability,0.374954
Phonetic Confusion,0.29226
No Dominant Pattern,0.249179
Mixed Dyslexic Pattern,0.083425
Word Boundary Confusion,0.000183


In [29]:
print("Dominance Strength Distribution:")
essay_summary["dominance_strength"].value_counts()


Dominance Strength Distribution:


dominance_strength,count
Strong,2835
Moderate,2637
Weak / Mixed,6


In [30]:
essay_summary.sample(5)


essay_id,Mixed Dyslexic Pattern,No Dominant Pattern,Orthographic Instability,Phonetic Confusion,Word Boundary Confusion,dominant_pattern,max_count,total_sentences,confidence,dominance_strength
5216,0,0,1,4,0,Phonetic Confusion,4,5,0.8,Strong
1697,1,0,1,3,0,Phonetic Confusion,3,5,0.6,Strong
4299,0,2,2,1,0,No Dominant Pattern,2,5,0.4,Moderate
429,0,2,2,1,0,No Dominant Pattern,2,5,0.4,Moderate
2104,1,0,2,2,0,Orthographic Instability,2,5,0.4,Moderate


In [31]:
def profile_essay(essay_id, feature_df, essay_summary):
    """Return the essay-level summary together with its underlying sentences."""
    # Sentence labels come from writing_pattern_v3, whereas the counts in
    # essay_summary were aggregated from writing_pattern.
    sentences = feature_df[feature_df["essay_id"] == essay_id][
        ["clean_sentence", "dyslexic_sentence", "writing_pattern_v3"]
    ]

    summary = essay_summary.loc[essay_id]

    return {
        "essay_id": essay_id,
        "dominant_pattern": summary["dominant_pattern"],
        "confidence": summary["confidence"],
        "dominance_strength": summary["dominance_strength"],
        # Drop every derived column so that only per-pattern counts remain.
        "pattern_counts": summary.drop(
            ["dominant_pattern", "max_count", "total_sentences",
             "confidence", "dominance_strength"]
        ).to_dict(),
        "sentences": sentences.to_dict(orient="records")
    }


In [18]:
profile_essay(0, feature_df, essay_summary)


{'essay_id': 0,
 'dominant_pattern': 'Orthographic Instability',
 'confidence': np.float64(0.6),
 'dominance_strength': 'Strong',
 'pattern_counts': {'Mixed Dyslexic Pattern': 0,
  'No Dominant Pattern': 2,
  'Orthographic Instability': 3,
  'Phonetic Confusion': 0,
  'Word Boundary Confusion': 0},
 'sentences': [{'clean_sentence': 'වලිකුකුළා කෑගහනවා.',
   'dyslexic_sentence': 'වලිකුකුළා කෑගහනව',
   'writing_pattern_v3': 'Orthographic Instability'},
  {'clean_sentence': 'අම්මා කෑම දෙනවා',
   'dyslexic_sentence': 'අම්මා කෑම දනවා',
   'writing_pattern_v3': 'Orthographic Instability'},
  {'clean_sentence': 'එයා එනකන් ඉඩපන්',
   'dyslexic_sentence': 'එයා එනකන් ඉඩපන්',
   'writing_pattern_v3': 'No Dominant Pattern'},
  {'clean_sentence': 'රුපියල් දෙදාහක් තියෙනවා',
   'dyslexic_sentence': 'රුපියල් දෙදාහක් තියනව',
   'writing_pattern_v3': 'Orthographic Instability'},
  {'clean_sentence': 'ගාල්ලට යන්න ඕනෙ',
   'dyslexic_sentence': 'ගාල්ලට යන්න ඕනෙඩ',
   'writing_pattern_v3': 'No Dominant Pattern'}]}

### Case Study 1 – Orthographic Instability (Strong)

- Dominant pattern: Orthographic Instability
- Confidence: 0.6 (3 of 5 sentences)
- Observed issues:
  - Diacritic loss
  - Character omission
- Interpretation:
  The repeated diacritic loss and character omission suggest an unstable orthographic representation, consistent with dyslexic writing behavior.


### Limitations

- Essays are simulated using fixed-size sentence grouping.
- Pattern labels are rule-based and depend on surface error extraction quality.
- The system does not claim diagnostic validity.
- Intended for assistive and analytical use, not clinical diagnosis.


In [33]:
essay_summary.columns


Index(['Mixed Dyslexic Pattern', 'No Dominant Pattern',
       'Orthographic Instability', 'Phonetic Confusion',
       'Word Boundary Confusion', 'dominant_pattern', 'max_count',
       'total_sentences', 'confidence', 'dominance_strength'],
      dtype='object', name='writing_pattern')

In [35]:
essay_summary.head(10)



essay_id,Mixed Dyslexic Pattern,No Dominant Pattern,Orthographic Instability,Phonetic Confusion,Word Boundary Confusion,dominant_pattern,max_count,total_sentences,confidence,dominance_strength
0,0,2,3,0,0,Orthographic Instability,3,5,0.6,Strong
1,1,1,1,2,0,Phonetic Confusion,2,5,0.4,Moderate
2,1,1,1,2,0,Phonetic Confusion,2,5,0.4,Moderate
3,0,1,2,2,0,Orthographic Instability,2,5,0.4,Moderate
4,1,1,3,0,0,Orthographic Instability,3,5,0.6,Strong
5,0,1,2,2,0,Orthographic Instability,2,5,0.4,Moderate
6,0,0,1,4,0,Phonetic Confusion,4,5,0.8,Strong
7,0,2,1,2,0,No Dominant Pattern,2,5,0.4,Moderate
8,0,2,2,1,0,No Dominant Pattern,2,5,0.4,Moderate
9,0,2,1,2,0,No Dominant Pattern,2,5,0.4,Moderate


In [36]:
essay_summary.tail(10)

essay_id,Mixed Dyslexic Pattern,No Dominant Pattern,Orthographic Instability,Phonetic Confusion,Word Boundary Confusion,dominant_pattern,max_count,total_sentences,confidence,dominance_strength
5468,2,0,2,1,0,Mixed Dyslexic Pattern,2,5,0.4,Moderate
5469,0,0,3,2,0,Orthographic Instability,3,5,0.6,Strong
5470,0,0,3,2,0,Orthographic Instability,3,5,0.6,Strong
5471,2,0,1,2,0,Mixed Dyslexic Pattern,2,5,0.4,Moderate
5472,0,0,2,3,0,Phonetic Confusion,3,5,0.6,Strong
5473,0,0,4,1,0,Orthographic Instability,4,5,0.8,Strong
5474,2,0,1,2,0,Mixed Dyslexic Pattern,2,5,0.4,Moderate
5475,0,1,1,3,0,Phonetic Confusion,3,5,0.6,Strong
5476,2,0,1,2,0,Mixed Dyslexic Pattern,2,5,0.4,Moderate
5477,1,1,0,2,0,Phonetic Confusion,2,4,0.5,Moderate
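The last pseudo-essay above (5477) contains four sentences rather than five, which is what `index // ESSAY_SIZE` produces when the number of sentences is not a multiple of the segment size. Its confidence of 0.5 is therefore computed over a smaller denominator than the k/5 values of the other segments. A minimal sketch, on hypothetical toy data (`toy_df` and the placeholder sentences `s0` to `s11` are not from the dataset), of how such short segments could be flagged:

```python
import pandas as pd

ESSAY_SIZE = 5  # sentences per pseudo-essay, as in the notebook

# Hypothetical toy data: 12 placeholder sentences with a default integer index.
toy_df = pd.DataFrame({"clean_sentence": [f"s{i}" for i in range(12)]})

# Rows 0-4 fall into essay 0, rows 5-9 into essay 1, the remainder into essay 2.
toy_df["essay_id"] = toy_df.index // ESSAY_SIZE

# Sentences per pseudo-essay, and the segments that fall short of ESSAY_SIZE.
essay_sizes = toy_df.groupby("essay_id").size()
short_essays = essay_sizes[essay_sizes < ESSAY_SIZE]

print(essay_sizes.tolist())
print(short_essays.index.tolist())
```

Whether a short trailing segment is kept, dropped, or reported separately is a design choice the notebook leaves open.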


In [37]:
essay_id = 0
essay_pattern_counts.loc[essay_id]


writing_pattern,0
Mixed Dyslexic Pattern,0
No Dominant Pattern,2
Orthographic Instability,3
Phonetic Confusion,0
Word Boundary Confusion,0


In [38]:
essay_summary.loc[essay_id]


writing_pattern,0
Mixed Dyslexic Pattern,0
No Dominant Pattern,2
Orthographic Instability,3
Phonetic Confusion,0
Word Boundary Confusion,0
dominant_pattern,Orthographic Instability
max_count,3
total_sentences,5
confidence,0.6
dominance_strength,Strong


In [43]:
essay_id = 0
row = essay_summary.loc[essay_id]

result = {
    "dominant_pattern": row["dominant_pattern"],
    "confidence": row["confidence"],
    "dominance_strength": row["dominance_strength"]
}

result


{'dominant_pattern': 'Orthographic Instability',
 'confidence': np.float64(0.6),
 'dominance_strength': 'Strong'}

In [44]:
for essay_id in essay_summary.sample(3).index:
    row = essay_summary.loc[essay_id]
    print({
        "essay_id": essay_id,
        "dominant_pattern": row["dominant_pattern"],
        "confidence": row["confidence"],
        "dominance_strength": row["dominance_strength"]
    })


{'essay_id': 4993, 'dominant_pattern': 'No Dominant Pattern', 'confidence': np.float64(0.6), 'dominance_strength': 'Strong'}
{'essay_id': 2949, 'dominant_pattern': 'Orthographic Instability', 'confidence': np.float64(0.4), 'dominance_strength': 'Moderate'}
{'essay_id': 1499, 'dominant_pattern': 'Orthographic Instability', 'confidence': np.float64(0.6), 'dominance_strength': 'Strong'}


In [45]:
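def profile_essay(essay_id, essay_summary):
    """Compact profile: summary fields only, without counts or sentences.

    Note: this redefines the three-argument profile_essay from the earlier cell.
    """
    row = essay_summary.loc[essay_id]
    return {
        "essay_id": essay_id,
        "dominant_pattern": row["dominant_pattern"],
        "confidence": row["confidence"],
        "dominance_strength": row["dominance_strength"]
    }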


In [47]:
profile_essay(0, essay_summary)



{'essay_id': 0,
 'dominant_pattern': 'Orthographic Instability',
 'confidence': np.float64(0.6),
 'dominance_strength': 'Strong'}

In [48]:
profile_essay(3717, essay_summary)

{'essay_id': 3717,
 'dominant_pattern': 'No Dominant Pattern',
 'confidence': np.float64(0.4),
 'dominance_strength': 'Moderate'}
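The confidence values in these profiles print as `np.float64(...)` because they are NumPy scalars taken from the summary row. If plain Python numbers are preferred, for display or export, a small conversion step could be applied to the returned dictionary. The helper `to_plain` below is a hypothetical name introduced for this sketch, and the profile is a hand-written copy of the output above rather than a call into the notebook's data.

```python
import json

import numpy as np


def to_plain(value):
    """Convert a NumPy scalar to the matching built-in type; pass anything else through."""
    return value.item() if isinstance(value, np.generic) else value


# Hand-written copy of the profile shown above.
profile = {
    "essay_id": 3717,
    "dominant_pattern": "No Dominant Pattern",
    "confidence": np.float64(0.4),
    "dominance_strength": "Moderate",
}

plain_profile = {key: to_plain(value) for key, value in profile.items()}

print(plain_profile)
print(json.dumps(plain_profile))
```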