Discovering Bias in Dutch Automatic Speech Recognition by Clustering Interpretable Acoustic and Prosodic Features
This repository contains the codebase for my BSc thesis titled "Discovering Bias in Dutch Automatic Speech Recognition by Clustering Interpretable Acoustic and Prosodic Features" for the Research Project 2024 at the TU Delft. The project proposes an interpretable approach to bias discovery by clustering speakers based on acoustic and prosodic features.
The repository consists of multiple components:
- Feature Extraction using Praat: Scripts for extracting acoustic and prosodic features from speech data using Praat.
- Bias Discovery in ASR using Python: Code to cluster the extracted features into speaker groups and quantify biases in ASR.
- Config file that connects the parts: a config.json file listing the locations of the feature extraction results and of the ASR recognition files.
This part of the project focuses on extracting acoustic and prosodic features from speech data using Praat. The feature extraction script enables extraction of the following features:
- Mean Pitch (Hz)
- Mean Speech Rate (Phonemes per Minute): the number of phonemes per minute of audio
- Mean Articulation Rate (Phonemes per Minute): the number of phonemes per minute of speech
- Mean durations (in seconds) of specified phonemes
- Mean Formant Frequencies (F1, F2, ...) at midpoints (50%) of chosen phoneme(s)
- Mean difference between Formant Frequencies (F1, F2, ...) at 20% vs 80% of diphthong(s)
Formant frequencies can be extracted in hertz or in Bark. Up to 5 formants can be chosen.
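As an illustration of the last two feature types, the sketch below computes a diphthong's 20%-vs-80% formant shift from formant samples and converts hertz to Bark. This is a hypothetical Python sketch, not the Praat script itself: the uniform-sampling convention and the use of Schroeder's Hz-to-Bark formula (reportedly the form Praat's hertzToBark uses) are assumptions.

```python
import math

def hz_to_bark(f):
    # Schroeder's formula: bark = 7 * asinh(f / 650)
    # (an assumption here; Praat's hertzToBark reportedly has this form)
    return 7 * math.asinh(f / 650)

def formant_shift(samples):
    """Difference between a formant's value at 80% and at 20% of a
    phoneme's duration, given uniformly spaced samples over the phoneme."""
    i20 = round(0.2 * (len(samples) - 1))
    i80 = round(0.8 * (len(samples) - 1))
    return samples[i80] - samples[i20]

def mean_formant_shift(tokens):
    """Mean 20%-vs-80% shift over all occurrences of a diphthong."""
    return sum(formant_shift(t) for t in tokens) / len(tokens)
```

For example, F2 samples `[500, 600, 700, 800, 900, 1000]` over a single diphthong token give a shift of 300 Hz.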
The extraction setup file calls the extraction script with the experimental setup from the BSc thesis. However, the feature extraction script can measure any phonemes, so it can even be used on languages other than Dutch!
This part of the project focuses on clustering the extracted features into speaker groups. It expects ASR recognition output files with insertions, deletions, substitutions and word counts for each speaker, and output files from the feature extraction using Praat.
The Python component is split into three parts:
- Clustering: functionality for data preparation and for clustering speakers into speaker groups. The scaling method and clustering algorithm are configurable, and predefined groups can be supplied as well. The folder also provides visualizations of clusters, of speech characteristics per cluster (boxplots), and of feature correlation matrices.
- Evaluation: functionality for evaluating ASR models under a given speaker grouping, using different metrics.
- Comparison: functionality for comparing the biases of ASR models when using different feature sets.
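The clustering step can be pictured with a minimal, dependency-free sketch: standardize the feature columns, then group speakers with plain k-means. In the actual code the scaling method and clustering algorithm are configurable; z-score scaling and k-means below are just one assumed combination, and all function names are illustrative.

```python
import random

def zscore(rows):
    """Column-wise standardisation (zero mean, unit variance)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [max((sum((x - m) ** 2 for x in c) / len(c)) ** 0.5, 1e-12)
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(r, means, stds)] for r in rows]

def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, k, iters=50, seed=0):
    """Plain k-means; returns one cluster label per speaker row."""
    rng = random.Random(seed)
    centers = rng.sample(rows, k)
    labels = [0] * len(rows)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist2(r, centers[j]))
                  for r in rows]
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:  # recompute centroid of each non-empty cluster
                centers[j] = [sum(c) / len(c) for c in zip(*members)]
    return labels
```

Each row would hold one speaker's extracted features (e.g. mean pitch, speech rate, formant values); the returned labels are the discovered speaker groups.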
In main.py, the different parts are combined into a pipeline that, given a valid config file, extracted features, and ASR recognition output, clusters speakers into speaker groups and then quantifies bias and ASR performance for these groups, as well as for demographic groups when those are also present.
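Given per-speaker insertion, deletion, substitution and word counts, group-level ASR performance and a simple bias measure can be computed as sketched below. The dictionary field names and the gap-to-best-group bias metric are illustrative assumptions, not necessarily the exact metrics implemented in the thesis code.

```python
def group_wer(speakers):
    """Pooled word error rate for a speaker group:
    WER = (S + D + I) / N, summing edit counts over all speakers."""
    s = sum(x["sub"] for x in speakers)
    d = sum(x["del"] for x in speakers)
    i = sum(x["ins"] for x in speakers)
    n = sum(x["words"] for x in speakers)
    return (s + d + i) / n

def bias_gaps(group_wers):
    """Bias per group: the gap between its WER and the
    best-performing group's WER (0 for the best group)."""
    best = min(group_wers.values())
    return {g: w - best for g, w in group_wers.items()}
```

A model with group WERs of 0.10 and 0.15, for instance, shows a 5-percentage-point gap for the second group; the same computation applies whether the groups come from clustering or from demographic metadata.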
In the BSc project, demographic groups were available even though the goal was to discover bias without demographic metadata. Due to time constraints, the current codebase still assumes demographic metadata in several places for plotting and comparisons.
The config.json File
This file specifies what data the code should expect and where to find it. The following things can be controlled:
- ASR models: names of the ASR models under evaluation
- Speaker groups: names of the predefined demographic groups
- Speaking styles, each containing an `id`, `name` and `abbreviation`
- Filepaths to the extracted features. Expects one file per speaking style. The value of the `speaking_style` field should be equal to the corresponding speaking style's `id`.
- Filepaths to the ASR recognition output. A filepath template can be given. The current template derives the name of each necessary file from the ASR model name(s) and speaking style abbreviation(s).
- Filepaths to the demographic metadata. A filepath template can be given. The current template derives the name of each necessary file from the speaking style abbreviation(s).
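The description above might translate into a config.json along the following lines. This is a hedged sketch only: every field name, value and path below is an illustrative assumption, not the actual schema used by the code.

```json
{
  "asr_models": ["model_a", "model_b"],
  "speaker_groups": ["group_1", "group_2"],
  "speaking_styles": [
    { "id": 1, "name": "read speech", "abbreviation": "rd" }
  ],
  "feature_files": [
    { "speaking_style": 1, "path": "praat/output/features_rd.csv" }
  ],
  "recognition_file_template": "asr/{model}_{style}.txt",
  "metadata_file_template": "metadata/{style}.csv"
}
```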
For more information on the functionality, please check the relevant files. All files in the python folder and the praat folder are documented.
This project is licensed under the MIT License. See the LICENSE file for details.