# Dataset Exploration: Mozilla Common Voice (English – Australian)

This notebook performs a structured exploratory analysis of the **Mozilla Common Voice v24 (English – Australian)** dataset.  
The goal is to understand the **statistical, linguistic, and speaker-level characteristics** of the corpus before designing an Automatic Speech Recognition (ASR) pipeline.

Careful dataset exploration is critical in speech research because:
- acoustic duration varies widely across speakers,
- textual complexity affects language modeling,
- speaker imbalance can bias acoustic models,
- accent and demographic distributions influence generalization.

This analysis informs **feature extraction choices**, **batching strategies**, and **model design decisions** in later stages.


In [None]:
import sys, os
PROJECT_ROOT = os.path.abspath("..")
if PROJECT_ROOT not in sys.path:
    sys.path.insert(0, PROJECT_ROOT)

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

from src.dataset import CommonVoiceAUSDataset


## Dataset Structure and Metadata

The dataset consists of:
- a directory of audio recordings (`.mp3` format),
- a metadata file containing transcription and speaker information, and
- split definitions for downstream experimentation.

Each row in the metadata represents a **single spoken utterance**, paired with its transcription and speaker attributes.

We load the dataset using a custom dataset class to ensure:
- explicit path handling,
- format validation,
- reproducibility across environments.


In [None]:
dataset = CommonVoiceAUSDataset(
    root_dir="../data/raw/commonvoice_en_au"
)

print("Total samples:", len(dataset))
dataset.df.head()


## Dataset Overview

After loading the metadata, we inspect the dataset size and schema.

Key metadata fields include:
- `client_id`: anonymized speaker identifier,
- `path`: relative path to the audio file,
- `sentence`: ground-truth transcription,
- `accents`, `gender`, `age`: speaker attributes,
- `duration_ms`: audio duration in milliseconds.

This information allows us to analyze **textual complexity**, **speaker imbalance**, and **temporal properties** of the corpus.
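
Speaker attributes in crowd-sourced corpora may be missing for many rows, so it can help to check per-column missing rates before relying on them. The following is a minimal sketch on a toy DataFrame. Only the column names are taken from the list above; the rows and values are invented for illustration and do not come from the corpus.

```python
import pandas as pd

# Toy rows invented for illustration; only the column names mirror the metadata.
toy_df = pd.DataFrame(
    {
        "client_id": ["a", "a", "b", "c"],
        "sentence": ["hello there", "good morning", "how are you", None],
        "gender": ["female", "female", None, None],
        "duration_ms": [2100, 3400, 1800, 2600],
    }
)

# Fraction of missing values per column, sparsest fields first.
missing_rate = toy_df.isna().mean().sort_values(ascending=False)
print(missing_rate)

# Number of distinct speakers versus number of utterances.
n_speakers = toy_df["client_id"].nunique()
print(f"{n_speakers} speakers, {len(toy_df)} utterances")
```

In the notebook, the same two calls could be applied to `dataset.df` directly.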


## Textual Complexity Analysis

Before training an ASR model, it is important to understand the **distribution of transcription lengths**.

Sentence length affects:
- decoding difficulty,
- memory requirements during training,
- padding efficiency in batch processing,
- alignment stability between audio and text.

We compute sentence length as a **derived feature**, measured as the number of characters in each transcription.


In [None]:
# Create derived feature: sentence length (number of characters)
dataset.df["sentence_length"] = dataset.df["sentence"].astype(str).str.len()

# Sanity check
dataset.df[["sentence", "sentence_length"]].head()


### Sentence Length Distribution

The raw sentence length distribution is **long-tailed**: a small number of very long sentences coexist with a large number of short utterances.

To make the distribution interpretable:
- we visualize the central mass using percentile-based clipping,
- and complement it with log-scaled visualizations where appropriate.

This prevents extreme outliers from dominating the plot, and stating the clipping threshold in the plot title keeps the truncation explicit.
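
To see how much a percentile cut-off hides, it can be useful to report a few quantiles and the fraction of samples beyond the threshold alongside the plot. The sketch below uses synthetic lengths invented for illustration, not values from the corpus.

```python
import numpy as np

# Synthetic lengths invented for illustration: mostly short, one extreme value.
lengths = np.array([20] * 50 + [40] * 40 + [80] * 9 + [1000])

# Median, 95th and 99th percentiles of the length distribution.
p50, p95, p99 = np.percentile(lengths, [50, 95, 99])

# Fraction of samples that a plot clipped at the 99th percentile would omit.
clipped_fraction = np.mean(lengths > p99)

# A mean well above the median is a simple indicator of a long right tail.
mean_length = lengths.mean()

print(f"median={p50:.1f}, p95={p95:.1f}, p99={p99:.1f}, mean={mean_length:.1f}")
print(f"fraction beyond p99: {clipped_fraction:.3f}")
```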


In [None]:
from src.graph_utils import save_and_show

# Use 99th percentile to limit extreme outliers
max_len = np.percentile(dataset.df["sentence_length"], 99)

fig = plt.figure(figsize=(8, 4))
plt.hist(
    dataset.df["sentence_length"],
    bins=50,
    range=(0, max_len)
)
plt.title("Sentence Length Distribution (up to 99th percentile)")
plt.xlabel("Characters")
plt.ylabel("Count")

save_and_show(fig, "sentence_length_distribution.png")


## Audio Duration Analysis

Audio duration directly impacts:
- feature sequence length,
- GPU memory consumption,
- batch padding efficiency,
- training stability.

Understanding the duration distribution allows us to:
- choose appropriate frame sizes,
- define maximum sequence lengths,
- avoid excessive padding or truncation.
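
To make the link between duration and feature sequence length concrete, the sketch below converts a duration in milliseconds into a frame count. The 16 kHz sample rate, 25 ms window, and 10 ms hop are illustrative assumptions rather than values taken from this project, and the formula assumes no padding at the signal edges.

```python
def num_frames(duration_ms: float, sample_rate: int = 16_000,
               win_ms: float = 25.0, hop_ms: float = 10.0) -> int:
    """Number of frames from a sliding window without edge padding."""
    n_samples = int(duration_ms * sample_rate / 1000)
    win = int(win_ms * sample_rate / 1000)   # window length in samples
    hop = int(hop_ms * sample_rate / 1000)   # hop length in samples
    if n_samples < win:
        return 0  # clip is shorter than a single window
    return 1 + (n_samples - win) // hop


for ms in (1_000, 5_000, 10_000):
    print(ms, "ms ->", num_frames(ms), "frames")
```

Under these assumptions the frame count grows roughly linearly with duration, which is why a maximum duration translates directly into a maximum sequence length.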


In [None]:
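# Convert milliseconds to seconds so the data matches the axis label
duration_s = dataset.df["duration_ms"] / 1000.0

# Use 99th percentile to limit extreme outliers
max_duration_s = np.percentile(duration_s, 99)

fig = plt.figure(figsize=(8, 4))
plt.hist(duration_s, bins=50, range=(0, max_duration_s))
plt.title("Audio Duration Distribution (up to 99th percentile)")
plt.xlabel("Duration (s)")
plt.ylabel("Count")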

save_and_show(fig, "duration_distribution.png")


## Accent Distribution Analysis

Accent variation plays a critical role in speech recognition performance.
Acoustic realizations of the same phoneme can differ significantly across accents due to changes in pronunciation, intonation, and prosody.

Understanding the accent distribution in the dataset is important for several reasons:
- ASR models trained on a dominant accent may generalize poorly to underrepresented accents.
- Accent imbalance can introduce systematic bias in recognition accuracy.
- Feature representations may capture accent-specific patterns if not carefully normalized.

In this dataset, each utterance is annotated with an accent label derived from speaker metadata.
We analyze the frequency of each accent to quantify representation imbalance and assess the potential impact on model training and evaluation.

This analysis informs future decisions such as:
- accent-aware sampling strategies,
- domain adaptation techniques,
- evaluation protocols that account for accent diversity.
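
One simple form of accent-aware sampling is inverse-frequency weighting, sketched below. The accent labels and counts are hypothetical and invented for illustration; this is one possible scheme under those assumptions, not the method used in this project.

```python
from collections import Counter

# Hypothetical accent labels per utterance, invented for illustration.
accents = ["accent_a"] * 6 + ["accent_b"] * 3 + ["accent_c"] * 1

counts = Counter(accents)
n_classes = len(counts)

# Inverse-frequency weight per utterance: every accent ends up with
# the same total sampling probability (1 / n_classes).
weights = [1.0 / (n_classes * counts[a]) for a in accents]

# Total sampling probability assigned to each accent.
mass = Counter()
for accent, weight in zip(accents, weights):
    mass[accent] += weight

print(dict(mass))
```

Weights of this kind could be passed to a weighted sampler, for example `random.choices(items, weights=weights, k=n)` from the standard library. Fully equalizing accents can oversample very rare groups, so a softened variant may be preferable in practice.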


In [None]:
accent_counts = dataset.df["accents"].value_counts().head(10)

fig = plt.figure(figsize=(10, 4))
accent_counts.plot(kind="bar")
plt.title("Top Accent Distribution")
plt.xlabel("Accent")
plt.ylabel("Count")

save_and_show(fig, "accent_distribution.png")


## Speaker Distribution and Imbalance

In crowd-sourced speech datasets, speaker contributions are rarely uniform.
Some speakers contribute hundreds of utterances, while others appear only once or twice.

Speaker imbalance is critical because:
- models may overfit to frequent speakers,
- rare speakers may be underrepresented,
- evaluation performance may appear inflated if the same frequent speakers occur in both the training and evaluation splits.

We analyze the number of utterances contributed per speaker to quantify this imbalance.
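
Beyond summary statistics, imbalance can be condensed into single numbers such as the share of utterances from the most active speakers or a Gini coefficient. The sketch below uses hypothetical speaker IDs invented for illustration; `top_k_share` and `gini` are helper names introduced here, not part of the project code.

```python
from collections import Counter

# Hypothetical speaker IDs per utterance, invented for illustration.
client_ids = ["s1"] * 70 + ["s2"] * 20 + ["s3"] * 5 + ["s4"] * 3 + ["s5"] * 2

counts = Counter(client_ids)
sorted_counts = sorted(counts.values(), reverse=True)
total = sum(sorted_counts)


def top_k_share(k: int) -> float:
    """Fraction of all utterances contributed by the k most active speakers."""
    return sum(sorted_counts[:k]) / total


def gini(values) -> float:
    """Gini coefficient: 0 for equal contributions, larger when few speakers dominate."""
    xs = sorted(values)
    n = len(xs)
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * weighted / (n * sum(xs)) - (n + 1) / n


print("top-1 share:", top_k_share(1))
print("gini:", round(gini(sorted_counts), 3))
```

With the notebook's data, `speaker_counts.values` could be passed to `gini` in place of the toy counts.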


In [None]:
# Number of utterances per speaker
speaker_counts = (
    dataset.df
    .groupby("client_id")
    .size()
)

# Sanity check
speaker_counts.describe()


### Utterances per Speaker

We visualize the distribution of utterances per speaker.
Due to the heavy skew in contributions, a **log-scaled histogram** is used to clearly expose the imbalance structure.

This analysis motivates later decisions such as:
- speaker-balanced sampling,
- data augmentation strategies,
- curriculum-based training.
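
As one illustration of speaker-balanced sampling, the sketch below caps the number of utterances kept per speaker. The function name `cap_per_speaker`, the cap value, and the toy rows are assumptions made for this example only.

```python
import random
from collections import defaultdict


def cap_per_speaker(rows, max_per_speaker, seed=0):
    """Keep at most `max_per_speaker` randomly chosen utterances per speaker."""
    rng = random.Random(seed)  # fixed seed for a reproducible selection
    by_speaker = defaultdict(list)
    for speaker, utterance in rows:
        by_speaker[speaker].append(utterance)

    kept = []
    for speaker, utterances in by_speaker.items():
        if len(utterances) > max_per_speaker:
            utterances = rng.sample(utterances, max_per_speaker)
        kept.extend((speaker, u) for u in utterances)
    return kept


# Toy (speaker, utterance) pairs invented for illustration.
rows = [("s1", f"u{i}") for i in range(50)]
rows += [("s2", "u50"), ("s3", "u51"), ("s3", "u52")]

balanced = cap_per_speaker(rows, max_per_speaker=5)
print(len(rows), "->", len(balanced), "utterances")
```

A hard cap discards data from prolific speakers; weighting utterances instead of dropping them is an alternative that keeps the full corpus available.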


In [None]:

# Use 99th percentile to limit extreme outliers
max_utt = np.percentile(speaker_counts.values, 99)

fig = plt.figure(figsize=(8, 4))
plt.hist(
    speaker_counts.values,
    bins=50,
    range=(0, max_utt),
    log=True,  # log-scaled y-axis so sparsely populated bins remain visible
)
plt.title("Speaker Utterance Count Distribution (≤99th percentile)")
plt.xlabel("Utterances per Speaker")
plt.ylabel("Number of Speakers (log scale)")

save_and_show(fig, "speaker_imbalance_distribution.png")


## Key Observations and Implications

From this exploratory analysis, we observe that:

- Sentence lengths exhibit a long-tailed distribution.
- Speaker contributions are highly imbalanced.
- Audio durations vary significantly across utterances.

These characteristics have **direct implications** for downstream ASR modeling:
- feature normalization and padding strategies must be carefully designed,
- batching should consider duration-based grouping,
- speaker imbalance should be mitigated during training.
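
The benefit of duration-based grouping can be estimated before any training by comparing padding overhead for arbitrary versus duration-sorted batches. The sketch below uses synthetic durations and a batch size of 16, both chosen only for illustration.

```python
import random


def padding_waste(durations, batch_size):
    """Fraction of padded positions when each batch is padded to its longest item."""
    padded = real = 0
    for start in range(0, len(durations), batch_size):
        batch = durations[start:start + batch_size]
        padded += max(batch) * len(batch)  # size after padding to the longest item
        real += sum(batch)                 # actual content in the batch
    return 1 - real / padded


rng = random.Random(0)
# Synthetic durations in milliseconds, invented for illustration.
durations = [rng.randint(1_000, 10_000) for _ in range(1_000)]

waste_random = padding_waste(durations, batch_size=16)
waste_sorted = padding_waste(sorted(durations), batch_size=16)

print(f"arbitrary order: {waste_random:.1%} padding")
print(f"sorted by duration: {waste_sorted:.1%} padding")
```

Fully sorted batches remove randomness from training order, so a common compromise is to group utterances into duration buckets and shuffle within and across buckets.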

In the next stage, we move from metadata analysis to **acoustic feature extraction**, beginning with **Log-Mel Spectrograms**, a widely used input representation in modern ASR systems.


In [None]:
print("Data Exploration completed and graphs saved in graphs folder")