
# AI-Based Reading Assistant for Children Using Wav2Vec2
## Student: Nataliia Kobrii
## UTORid: qq577503
This project builds a prototype **“Reading Assistant”** that helps children practice reading aloud. 
I use a pretrained **deep learning model (Wav2Vec2-Base-960h)** to transcribe the child’s speech, 
compare it to **the expected sentence using Word Error Rate (WER)**, and detect which words were 
spoken incorrectly. Finally, I provide phonics-based feedback to help the child improve reading 
and pronunciation. **The goal is to evaluate how well an adult-trained ASR model performs on 
children’s speech and to design a feedback mechanism that supports learning.**



## Dataset

The dataset consists of 10 short English sentences designed for early readers. 
Each sentence was recorded once at home by a Grade 1 child who is still developing 
English reading and pronunciation skills. For privacy reasons, the audio files are not included in this submission, 
but the notebook contains full logs of the model predictions for each sentence. 
The model never stores or learns from the audio, and the recordings are used only locally 
for evaluation.

## 1. Imports & Config

In [1]:
import os
import pandas as pd
import numpy as np
import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
from jiwer import wer
from difflib import SequenceMatcher
import matplotlib.pyplot as plt
import pyttsx3

# Project paths
DATA_DIR = "../data"
CHILD_DIR = os.path.join(DATA_DIR, "child_recordings")
SENTENCES_CSV = os.path.join(DATA_DIR, "sentences.csv")

# Audio settings
SAMPLE_RATE = 16000
MODEL_NAME = "facebook/wav2vec2-base-960h"

# Load sentence prompts
sentences_df = pd.read_csv(SENTENCES_CSV)
sentences_df.head()





|   | id  | text                     |
|---|-----|--------------------------|
| 0 | s01 | The cat is sleeping.     |
| 1 | s02 | The dog is very happy.   |
| 2 | s03 | The sun is shining.      |
| 3 | s04 | She has a red ball.      |
| 4 | s05 | The bird is on the tree. |



## 2. Load Wav2Vec2 Model

In [2]:
processor = Wav2Vec2Processor.from_pretrained(MODEL_NAME)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_NAME)
model.eval()

print("Model and processor loaded.")

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model and processor loaded.



## 3. Audio Loading & ASR

In [3]:
def load_audio(path, sr=SAMPLE_RATE):
    """Load an audio file and resample to the target sampling rate."""
    audio, sr = librosa.load(path, sr=sr)
    return audio, sr


def transcribe(path):
    """Transcribe an audio file to text using Wav2Vec2."""
    audio, sr = load_audio(path)
    inputs = processor(audio, sampling_rate=sr, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(**inputs).logits

    pred_ids = torch.argmax(logits, dim=-1)
    text = processor.batch_decode(pred_ids)[0]
    return text.lower().strip()

In [4]:
# Test transcription on the first sentence
test_path = os.path.join(CHILD_DIR, "s01_child.wav")
if os.path.exists(test_path):
    print("ASR:", transcribe(test_path))
else:
    print("Record s01_child.wav in data/child_recordings first.")

ASR: that ket is slip
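
The `transcribe` function takes the arg-max label for every audio frame and lets `processor.batch_decode` turn that frame sequence into text. The sketch below illustrates the general idea of greedy CTC decoding (collapse repeated labels, then drop the blank symbol) on a toy frame sequence. The `_` blank, the `|` word separator, and the helper name `ctc_greedy_decode` are assumptions made for illustration; the real processor works with the model's own vocabulary.

```python
from itertools import groupby


def ctc_greedy_decode(frame_labels, blank="_", word_sep="|"):
    """Collapse a per-frame label sequence into text (toy CTC decoding)."""
    # Step 1: merge runs of identical consecutive labels into one label.
    collapsed = [label for label, _ in groupby(frame_labels)]
    # Step 2: remove the blank symbol that separates repeated characters.
    chars = [c for c in collapsed if c != blank]
    # Step 3: turn the word separator into spaces.
    return "".join(chars).replace(word_sep, " ").strip()


# Hypothetical frame labels, one character per audio frame.
print(ctc_greedy_decode(list("tthh_ee|_cc_aa_tt")))  # the cat
# A blank between the two 'e' frames keeps the double letter.
print(ctc_greedy_decode(list("sl_e_e_p")))           # sleep
# Without the blank, the repeated 'e' frames merge into one.
print(ctc_greedy_decode(list("sleeep")))             # slep
```

Because each frame is decoded independently, a single unclear sound can change a whole word, which is one way short outputs such as "slip" for "sleeping" can arise.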



## 4. Text Normalization & Pronunciation Score (WER)

In [5]:
def normalize_text(t):
    """Lowercase and remove simple punctuation."""
    t = t.lower().strip()
    for ch in [",", ".", "?", "!", ";", ":", "'", "\""]:
        t = t.replace(ch, "")
    return t


def pronunciation_score(target_text, audio_path):
    """
    Compute WER-based pronunciation score between the target sentence
    and the child's spoken audio.
    """
    target_norm = normalize_text(target_text)
    predicted_text = transcribe(audio_path)
    pred_norm = normalize_text(predicted_text)

    # WER: 0 = perfect; values above 1 are possible when the
    # transcript contains many extra words
    error = wer(target_norm, pred_norm)

    # Convert to a 0-1 score (higher = better), clamping at 0
    score = max(0.0, 1.0 - error)

    return {
        "target": target_text,
        "predicted": predicted_text,
        "wer": error,
        "score": score
    }

In [6]:
row = sentences_df.iloc[0]
sid, text = row["id"], row["text"]
child_path = os.path.join(CHILD_DIR, f"{sid}_child.wav")

if os.path.exists(child_path):
    result = pronunciation_score(text, child_path)
    # A bare expression inside an if-block is not displayed, so print it
    print(result)
else:
    print(f"Missing file: {child_path}")
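
To make the WER number concrete, here is a minimal dependency-free sketch of the computation: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` is illustrative and this is not `jiwer`'s implementation, though it should agree with it on already-normalized text like the examples here.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra word heard)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# s01 from the logs: 3 substitutions over 4 reference words.
print(word_error_rate("the cat is sleeping", "that ket is slip"))  # 0.75
# Extra words can push WER above 1, which is why the score is clamped at 0.
print(word_error_rate("the cat", "the big fat cat is"))            # 1.5
```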


## 5. Word-Level Alignment (Correct vs Wrong Words)

In [7]:
def word_diff(target_text, predicted_text):
    """
    Align words between target and predicted sentences.
    Returns a list of (target_word, status, child_word):
      - status: "correct", "wrong", "missing"
    """
    t_words = normalize_text(target_text).split()
    p_words = normalize_text(predicted_text).split()

    matcher = SequenceMatcher(None, t_words, p_words)
    aligned = []

    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            for i in range(i1, i2):
                aligned.append((t_words[i], "correct", t_words[i]))

        elif tag == "replace":
            for i in range(i1, i2):
                child_word = p_words[j1] if j1 < len(p_words) else None
                aligned.append((t_words[i], "wrong", child_word))

        elif tag == "delete":
            for i in range(i1, i2):
                aligned.append((t_words[i], "missing", None))

    return aligned
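
Note that in the `replace` branch above, every target word in the block is paired with `p_words[j1]`, the first predicted word of that block. This is why the logs for s01 report "cat" as heard "that" rather than "ket". A possible variant, sketched below under the assumption that position-by-position pairing is the intended behavior, also records extra words from `insert` blocks. The name `word_diff_pairwise` and the `"extra"` status are hypothetical and not part of the notebook.

```python
from difflib import SequenceMatcher


def word_diff_pairwise(t_words, p_words):
    """Align two word lists, pairing replaced words position by position."""
    aligned = []
    matcher = SequenceMatcher(None, t_words, p_words, autojunk=False)

    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            aligned += [(w, "correct", w) for w in t_words[i1:i2]]
        elif tag == "replace":
            for k, w in enumerate(t_words[i1:i2]):
                j = j1 + k
                # If the predicted block is shorter, no word was heard here.
                heard = p_words[j] if j < j2 else None
                aligned.append((w, "wrong", heard))
        elif tag == "delete":
            aligned += [(w, "missing", None) for w in t_words[i1:i2]]
        elif tag == "insert":
            # Words the child (or the ASR model) added; hypothetical status.
            aligned += [(None, "extra", w) for w in p_words[j1:j2]]

    return aligned


print(word_diff_pairwise("the cat is sleeping".split(),
                         "that ket is slip".split()))
```

With this pairing, the spoken feedback for "cat" would quote "ket" instead of "that".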

In [8]:
# Test word_diff using the previous pronunciation_score result
if 'result' in globals():
    diff_example = word_diff(result["target"], result["predicted"])
    # Print explicitly; a bare expression inside an if-block is not displayed
    print(diff_example)
else:
    print("Run the pronunciation_score test cell first.")


## 6. Phonics Rule Definitions

In [9]:
RULES = {
    "th_initial": {
        "title": "The 'th' sound",
        "explain": (
            "The letters T and H together make the sound 'th'. "
            "Put your tongue gently between your teeth and blow air: th."
        ),
        "examples": "the, this, that"
    },
    "magic_e": {
        "title": "The magic 'e'",
        "explain": (
            "When a word ends with a silent 'e', the vowel often says its name. "
            "For example, 'cap' becomes 'cape'."
        ),
        "examples": "make, ride, bike"
    },
    "long_ee": {
        "title": "The long 'ee' sound",
        "explain": (
            "The letters E E together make a long E sound: 'ee', "
            "like in 'sleep', 'feet', 'keep'."
        ),
        "examples": "sleep, feet, keep"
    },
    "short_a": {
        "title": "The short 'a' sound",
        "explain": (
            "Short 'a' is a quick sound: 'a', like in 'cat', 'hat', 'bat'."
        ),
        "examples": "cat, hat, bat"
    }
}


## 7. Rule Matching & Text Feedback

In [10]:
def find_rule_for_word(word):
    """Return the RULES entry that matches the word, or None if no rule applies."""
    w = word.lower()

    if w.startswith("th"):
        return RULES["th_initial"]
    if "ee" in w:
        return RULES["long_ee"]
    # Magic 'e' heuristic: vowel + consonant + final 'e' (make, ride, bike)
    if len(w) >= 4 and w.endswith("e") and w[-2] not in "aeiou" and w[-3] in "aeiou":
        return RULES["magic_e"]
    if len(w) == 3 and w[1] == "a" and w[-1] in "ptdn":
        return RULES["short_a"]

    return None


def feedback_for_word(target_word, child_word=None):
    """Generate a child-friendly explanation for target_word."""
    rule = find_rule_for_word(target_word)

    if rule is None:
        if child_word:
            return (
                f"We say '{target_word}', not '{child_word}'. "
                f"Let's repeat '{target_word}' together."
            )
        else:
            return (
                f"Let's try the word '{target_word}' again. "
                f"Read it slowly and clearly."
            )

    return (
        f"We say '{target_word}'. "
        f"{rule['title']}. {rule['explain']} "
        f"For example: {rule['examples']}."
    )


## 9. Text-to-Speech (TTS) for Explanations

In [21]:
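def speak(text):
    """Speak the text aloud; fall back to printing if no TTS engine is available."""
    try:
        # Create the engine inside the try block so a missing audio
        # driver triggers the text fallback instead of crashing the cell
        engine = pyttsx3.init()
        engine.say(text)
        engine.runAndWait()
    except Exception:
        print("[AUDIO] " + text)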

def explain_rule_with_audio(target_word, child_word=None):
    """Print and speak a phonics explanation for the target word."""
    msg = feedback_for_word(target_word, child_word)
    print("Audio explanation:", msg)
    speak(msg)
    return msg

In [22]:
# Test TTS explanation
_ = explain_rule_with_audio("sleeping", "sliping")

Audio explanation: We say 'sleeping'. The long 'ee' sound. The letters E E together make a long E sound: 'ee', like in 'sleep', 'feet', 'keep'. For example: sleep, feet, keep.



## 10. End-to-End Evaluation for One Sentence

In [29]:
def evaluate_sentence(row, child_suffix="child"):
    """
    Evaluate one sentence for the child recording sXX_child.wav.
    """
    sid, text = row["id"], row["text"]
    child_path = os.path.join(CHILD_DIR, f"{sid}_{child_suffix}.wav")

    if not os.path.exists(child_path):
        print(f"Missing audio file: {child_path}")
        return None

    # 1. Overall pronunciation score
    pron = pronunciation_score(text, child_path)
    diff = word_diff(pron["target"], pron["predicted"])

    print("=" * 60)
    print(f"Sentence {sid}")
    print(f"Target:    {pron['target']}")
    print(f"Predicted: {pron['predicted']}")
    print(f"WER: {pron['wer']:.2f} -> Score: {pron['score']:.2f}")

    print("\nWord-level feedback:")
    for w, status, child_w in diff:
        mark = "Correct:" if status == "correct" else "Needs correction:"
        print(f"{mark}  {w}   (heard: {child_w})")

    print("\nAudio help for tricky words:")
    for w, status, child_w in diff:
        if status != "correct":
            explain_rule_with_audio(w, child_w)

    # Overall summary
    score = pron["score"]
    if score >= 0.8:
        print("\nOverall: Excellent! Great reading.")
    elif score >= 0.6:
        print("\nOverall: Good, but some words need practice.")
    else:
        print("\nOverall: Needs practice with this sentence.")

    return {**pron, "id": sid, "diff": diff}

In [30]:
# Evaluate the first sentence
evaluate_sentence(sentences_df.iloc[0], child_suffix="child")

Sentence s01
Target:    The cat is sleeping.
Predicted: that ket is slip
WER: 0.75 -> Score: 0.25

Word-level feedback:
Needs correction:  the   (heard: that)
Needs correction:  cat   (heard: that)
Correct:  is   (heard: is)
Needs correction:  sleeping   (heard: slip)

Audio help for tricky words:
Audio explanation: We say 'the'. The 'th' sound. The letters T and H together make the sound 'th'. Put your tongue gently between your teeth and blow air: th. For example: the, this, that.
Audio explanation: We say 'cat'. The short 'a' sound. Short 'a' is a quick sound: 'a', like in 'cat', 'hat', 'bat'. For example: cat, hat, bat.
Audio explanation: We say 'sleeping'. The long 'ee' sound. The letters E E together make a long E sound: 'ee', like in 'sleep', 'feet', 'keep'. For example: sleep, feet, keep.

Overall: Needs practice with this sentence.


{'target': 'The cat is sleeping.',
 'predicted': 'that ket is slip',
 'wer': 0.75,
 'score': 0.25,
 'id': 's01',
 'diff': [('the', 'wrong', 'that'),
  ('cat', 'wrong', 'that'),
  ('is', 'correct', 'is'),
  ('sleeping', 'wrong', 'slip')]}


## 11. Evaluate All Sentences

In [31]:
# Evaluate all sentences where audio is available
all_results = []

for _, row in sentences_df.iterrows():
    res = evaluate_sentence(row, child_suffix="child")
    if res is not None:
        all_results.append(res)

summary_df = pd.DataFrame([
    {
        "id": r["id"],
        "wer": r["wer"],
        "score": r["score"],
        "target": r["target"],
        "predicted": r["predicted"]
    }
    for r in all_results
])
print("Average WER:", summary_df["wer"].mean())
print("Average Score:", summary_df["score"].mean())
print("Best Sentence:", summary_df.loc[summary_df["score"].idxmax(), ["id", "score"]].to_dict())
print("Worst Sentence:", summary_df.loc[summary_df["score"].idxmin(), ["id", "score"]].to_dict())

# Only the final expression of a cell is rendered, so show the table last
summary_df


Sentence s01
Target:    The cat is sleeping.
Predicted: that ket is slip
WER: 0.75 -> Score: 0.25

Word-level feedback:
Needs correction:  the   (heard: that)
Needs correction:  cat   (heard: that)
Correct:  is   (heard: is)
Needs correction:  sleeping   (heard: slip)

Audio help for tricky words:
Audio explanation: We say 'the'. The 'th' sound. The letters T and H together make the sound 'th'. Put your tongue gently between your teeth and blow air: th. For example: the, this, that.
Audio explanation: We say 'cat'. The short 'a' sound. Short 'a' is a quick sound: 'a', like in 'cat', 'hat', 'bat'. For example: cat, hat, bat.
Audio explanation: We say 'sleeping'. The long 'ee' sound. The letters E E together make a long E sound: 'ee', like in 'sleep', 'feet', 'keep'. For example: sleep, feet, keep.

Overall: Needs practice with this sentence.
Sentence s02
Target:    The dog is very happy.
Predicted: sa dok reddy happy
WER: 0.80 -> Score: 0.20

Word-level feedback:
Needs correction:  the 

## Error Analysis

The Wav2Vec2 model performs reasonably well on some clearly articulated sentences 
(e.g., s06 and s08), but struggles on others (e.g., s01-s03, s07), even though s01-s03 are only four or five words long.  
The most common error patterns include:

- **Substitution or misrecognition of “th”** (e.g., “the” → “sa”, “that”), a known challenge in child speech.
- **Vowel confusion** in words like “cat”, “ball”, “sun”.
- **Dropping or simplifying endings**, such as “-ing” → “in”, or “sleeping” → “slip”.
- **Catastrophic decoding errors** where multiple words collapse into one output (e.g., “zasan”).

These results match findings in speech research that adult-trained ASR models 
have difficulty with children’s higher pitch, shorter phoneme duration, and inconsistent articulation.
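
Error patterns like these could be counted from the `diff` lists that `evaluate_sentence` returns, rather than read off the logs by hand. The sketch below is one possible way to do it, shown only on the s01 alignment that appears in the logs. The helper `tally_errors` is hypothetical and not part of the notebook.

```python
from collections import Counter

# Word-level alignment for sentence s01, copied from the evaluation log.
s01_diff = [
    ("the", "wrong", "that"),
    ("cat", "wrong", "that"),
    ("is", "correct", "is"),
    ("sleeping", "wrong", "slip"),
]


def tally_errors(diffs):
    """Count word statuses and (target, heard) confusions across sentences."""
    status_counts = Counter()
    confusions = Counter()
    for diff in diffs:
        for target, status, heard in diff:
            status_counts[status] += 1
            if status == "wrong":
                confusions[(target, heard)] += 1
    return status_counts, confusions


status_counts, confusions = tally_errors([s01_diff])
print(status_counts)
print(confusions.most_common(3))
```

Passing the `diff` lists of all evaluated sentences instead of `[s01_diff]` would show which target words are misrecognized most often.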


## Interpretation of Results

The average WER across the 10 sentences was approximately **0.62**, meaning that
the number of word errors (substitutions, deletions, and insertions) was about
62% of the number of words in the target sentences. With only one child and ten
recordings this is a small sample, but it is consistent with findings in speech
recognition research that adult-trained ASR models perform poorly on children’s
speech due to differences in pitch, articulation, speech rate, and phoneme duration.
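
The 0.62 figure is the mean of the per-sentence WER values (`summary_df["wer"].mean()`). A pooled WER, which divides the total number of errors by the total number of reference words, weights longer sentences more heavily and can differ slightly. The sketch below contrasts the two using only s01 and s02, the two sentences whose logs are shown. The error counts are inferred from the logged WER values and sentence lengths, and the helper names are illustrative.

```python
# (word errors, reference words) per sentence.
# s01: WER 0.75 on 4 words -> 3 errors; s02: WER 0.80 on 5 words -> 4 errors.
counts = {"s01": (3, 4), "s02": (4, 5)}


def mean_sentence_wer(counts):
    """Average of per-sentence WER values (what the notebook reports)."""
    return sum(errors / words for errors, words in counts.values()) / len(counts)


def pooled_wer(counts):
    """Total errors divided by total reference words."""
    total_errors = sum(errors for errors, _ in counts.values())
    total_words = sum(words for _, words in counts.values())
    return total_errors / total_words


print(f"Mean of per-sentence WER: {mean_sentence_wer(counts):.3f}")  # 0.775
print(f"Pooled WER:               {pooled_wer(counts):.3f}")         # 0.778
```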

The model performed best on clearly articulated sentences such as:

- *“The fox runs fast.”* (WER = 0.25)
- *“The mouse is hiding under the table.”* (WER = 0.29)

Errors were especially frequent on the sentence-initial
"the", which the model often transcribed as “that”, “sa”, or “zasan”.
The sentence *“The sun is shining.”* was not recognized correctly at all
(WER = 1.00), which suggests difficulty with the child's pronunciation of
word-initial consonant sounds and long vowels.

Despite limited transcription accuracy, the system successfully:
1. Detected incorrect words,
2. Mapped errors to simple phonics rules (e.g., “th”, “short a”, “long ee”),
3. Generated accessible spoken feedback for the child.

This shows that combining a pretrained ASR model with rule-based phonics 
explanations can provide meaningful instructional feedback, even when the 
transcriptions are imperfect.


## Conclusion

This project demonstrates that a pretrained Wav2Vec2 speech model can be used to provide 
automatic reading feedback for children. Although transcription accuracy on child speech 
is limited (average WER ≈ 0.62), by combining ASR output with word alignment and a phonics
rule engine, the prototype identifies difficult words and provides
child-friendly spoken explanations. This hybrid approach shows promise for
building supportive early-literacy tools.
