# Log-Mel Feature Extraction for ASR

This notebook implements **Log-Mel spectrogram extraction**, a standard acoustic
input representation in modern Automatic Speech Recognition (ASR) systems.

The objective is to convert raw audio waveforms into a compact, perceptually motivated
time–frequency representation suitable for neural network training.


In [None]:
import sys, os
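# Make the project root importable so that `src` resolves.
# This assumes the notebook runs from a directory one level below the root.
PROJECT_ROOT = os.path.abspath("..")
if PROJECT_ROOT not in sys.path:
    sys.path.insert(0, PROJECT_ROOT)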

import matplotlib.pyplot as plt
import librosa.display

from src.dataset import CommonVoiceAUSDataset
from src.features import extract_log_mel, pad_log_mel
from src.graph_utils import save_and_show


## Loading a Sample Utterance

We select a single utterance from the dataset to demonstrate
Log-Mel feature extraction and visualization.


In [None]:
dataset = CommonVoiceAUSDataset("../data/raw/commonvoice_en_au")

sample = dataset.get_sample(0)
audio_path = sample["audio_path"]
text = sample["text"]

print("Transcript:")
print(text)
print("\nAudio path:")
print(audio_path)


## Log-Mel Spectrogram Extraction

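A Log-Mel spectrogram is obtained by taking a short-time Fourier transform of the
waveform, pooling the power spectrum into overlapping triangular bands spaced on
the mel scale, and log-compressing the result. It shows how energy is distributed
across perceptually motivated frequency bands over time and is a common input
to modern ASR architectures.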
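
The computation behind a Log-Mel spectrogram can be sketched in plain NumPy. This is an illustrative sketch only, not the implementation of `extract_log_mel` in `src.features`, which is not shown in this notebook. The function names, `n_fft=400`, and `n_mels=80` are assumptions; `sr=16000` and `hop_length=160` mirror the values this notebook passes to `specshow`. Library implementations such as librosa differ in details (mel-scale variant, frame centering, filter normalization), so the numbers will not match exactly.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel formula (an assumption; librosa defaults to the Slaney variant)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Band edges are equally spaced on the mel scale between 0 Hz and Nyquist
    hz_edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    fft_freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, mid, hi = hz_edges[i], hz_edges[i + 1], hz_edges[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        # Triangle: rises from lo to mid, falls from mid to hi, zero elsewhere
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_sketch(y, sr=16000, n_fft=400, hop_length=160, n_mels=80):
    # Slice the waveform into overlapping frames (no centering or edge padding)
    n_frames = 1 + (len(y) - n_fft) // hop_length
    idx = np.arange(n_fft)[None, :] + hop_length * np.arange(n_frames)[:, None]
    frames = y[idx] * np.hanning(n_fft)
    # Power spectrum per frame: shape (n_frames, n_fft // 2 + 1)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Pool FFT bins into mel bands: shape (n_mels, n_frames)
    mel = mel_filterbank(sr, n_fft, n_mels) @ power.T
    # Log-compress to decibels, flooring to avoid log(0)
    return 10.0 * np.log10(np.maximum(mel, 1e-10))
```

The output has one row per mel band and one column per frame, which is the layout `specshow` expects when `y_axis="mel"` and `x_axis="time"`.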


In [None]:
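log_mel = extract_log_mel(audio_path)

fig = plt.figure(figsize=(10, 4))
# sr and hop_length only label the axes; they should match the settings
# used inside extract_log_mel, otherwise the time axis will be wrong.
librosa.display.specshow(
    log_mel,
    sr=16000,
    hop_length=160,
    x_axis="time",
    y_axis="mel"
)
# The dB label assumes extract_log_mel returns values on a decibel scale.
plt.colorbar(format="%+2.0f dB")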
plt.title("Log-Mel Spectrogram")
plt.tight_layout()

save_and_show(fig, "logmel_example.png")


## Padding and Masking for ASR

Speech utterances vary in length, but stacking them into a batch requires every
example to have the same shape. Padding extends each feature matrix to a common
number of frames, and a mask marks which frames are real so that the model does
not learn from the artificial silence introduced by padding.
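
One way padding and masking can work is sketched below. This is a hypothetical stand-in, not the actual `pad_log_mel` from `src.features`, whose implementation is not shown here. It assumes features are shaped `(n_mels, n_frames)`, that padding is added on the right with a constant value, and that utterances longer than `max_frames` are truncated; the real function may handle any of these differently.

```python
import numpy as np

def pad_log_mel_sketch(log_mel, max_frames, pad_value=0.0):
    """Right-pad (or truncate) the time axis and return a validity mask."""
    n_mels, n_frames = log_mel.shape
    valid = min(n_frames, max_frames)  # frames that carry real audio

    padded = np.full((n_mels, max_frames), pad_value, dtype=log_mel.dtype)
    padded[:, :valid] = log_mel[:, :valid]

    # True for real frames, False for padding
    mask = np.zeros(max_frames, dtype=bool)
    mask[:valid] = True
    return padded, mask

# Usage with a dummy 250-frame feature matrix
features = np.ones((80, 250), dtype=np.float32)
padded, mask = pad_log_mel_sketch(features, max_frames=1000)
print(padded.shape, int(mask.sum()))  # (80, 1000) 250
```

Returning the mask alongside the padded features lets later stages ignore padded frames, for example by averaging a per-frame loss only over positions where the mask is `True`.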


In [None]:
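# Target length in frames: about 10 s if the hop is 10 ms
# (160 samples at 16 kHz, the values used for the plot above).
max_frames = 1000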

padded_log_mel, mask = pad_log_mel(log_mel, max_frames)

print("Original shape:", log_mel.shape)
print("Padded shape:", padded_log_mel.shape)
print("Valid frames:", int(mask.sum()))


## Summary

In this step, we:
- Extracted Log-Mel spectrograms from raw audio
- Visualized time–frequency acoustic patterns
- Applied padding and masking for batch compatibility

These features serve as the acoustic foundation for downstream ASR modeling.
The next step introduces **text processing and tokenization**, bridging
acoustic features with language representations.
