
[BUG] Significant drop in performance of DH / ð phoneme #440

Closed
kdonbekci opened this issue Apr 28, 2022 · 2 comments

@kdonbekci

Debugging checklist

[x] Have you updated to latest MFA version?
Running 2.0.0rc6
[x] Have you tried rerunning the command with the --clean flag?
Yes

Describe the issue
For both the ARPA and the MFA phone sets, the performance of the DH phoneme has deteriorated significantly between the current acoustic models and the one described by the meta.yaml file below:

authors: Michael McAuliffe
language: English
citation: "M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger (2017). Montreal Forced Aligner: an accurate and trainable aligner using Kaldi. Presented at the 91st Annual Meeting of the Linguistic Society of America, Austin, TX."
URL: https://montrealcorpustools.github.io/Montreal-Forced-Aligner/

version: 0.9.0
architecture: gmm-hmm
features: mfcc+deltas

phones: [AA0, AA1, AA2, AE0, AE1, AE2, AH0, AH1, AH2, AO0, AO1, AO2,
           AW0, AW1, AW2, AY0, AY1, AY2, EH0, EH1, EH2, ER0, ER1, ER2,
           EY0, EY1, EY2, IH0, IH1, IH2, IY0, IY1, IY2, OW0, OW1, OW2,
           OY0, OY1, OY2, UH0, UH1, UH2, UW0, UW1, UW2,
           B, CH, D, DH, F, G, HH, JH, K, L, M, N, NG, P, R,
           S, SH, T, TH, V, W, Y, Z, ZH]

On the new version, most phonemes perform similarly to the 0.9.0 version (albeit a little worse), but the DH / ð phoneme's recall and precision are severely degraded. Running classification_report from scikit-learn, we are looking at 0.09 precision and 0.11 recall. In 0.9.0, running the same test using the same dictionary (librispeech-lexicon), the scores were 0.45 precision and 0.83 recall.
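For context, a per-phone precision/recall comparison like the one described can be sketched as follows. This is a minimal illustration, not the reporter's actual evaluation script; the label sequences are hypothetical and assume reference (TIMIT) and predicted (MFA) phone labels have been sampled at a common frame rate so they align one-to-one:

```python
from sklearn.metrics import classification_report

# Hypothetical frame-level phone labels: reference (TIMIT) vs. predicted (MFA),
# resampled so the two sequences line up one-to-one.
reference = ["DH", "AH0", "DH", "D", "AH0", "S", "DH", "IY1"]
predicted = ["D",  "AH0", "DH", "D", "AH0", "S", "D",  "IY1"]

# Per-phone precision/recall; output_dict=True makes the scores easy to inspect.
report = classification_report(reference, predicted,
                               output_dict=True, zero_division=0)
print(report["DH"]["precision"], report["DH"]["recall"])
```

In this toy example, the aligner over-predicts [d] where the reference has [ð], so DH recall drops even though its precision stays high, which is the same failure shape the scores above suggest.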

To reproduce the issue
Please fill out the following:

  1. Corpus structure
    Standard TIMIT corpus
  2. Dictionary
Using the librispeech-lexicon (but the issue persists with both the english_us_mfa and english_us_arpa dictionaries)
  3. Acoustic model
Using the english_us_arpa and the english_us_mfa acoustic models.

Additional context
Implementation of #439 can help in the future for catching such issues earlier.

@mmcauliffe
Member

mmcauliffe commented May 1, 2022

After looking back through TIMIT, I'm actually not surprised; I think this is mostly an issue with TIMIT's annotation guidelines. They treat "dh" as not really having any pronunciation variants, other than when it is fully stopped, in which case they recommend dcl dh, but there are only 118 of those across 3879 dh tokens. In reality, /ð/ is almost never realized as a proper fricative in English, especially in the word "the". In any sort of connected speech, it assimilates to the previous sound and geminates, so you'll get a long [nː] or [mː] or what have you (see https://en.wikipedia.org/wiki/Pronunciation_of_English_%E2%9F%A8th%E2%9F%A9#Assimilation for more details). Additionally, many dialects and sociolects have stopped or affricated realizations of /ð/, but that dialectal variation isn't incorporated into TIMIT. The best they do for dialectal differences is in vowels, presumably because so much ink has been spilled on vowels in different dialects. It's a little mind-boggling to me that they have 4 realizations of unstressed vowels, have transcribed monophthongal /aj/ as [aa], and have included flaps and nasal flaps, but just one consistent [ð].

Here are some examples across a range of files:

[Six alignment screenshots from 2022-04-30 showing /ð/ tokens across a range of TIMIT files]

In general, I don't think a lot of tokens of /ð/ in "the" should be transcribed with [ð] as a rule; where they exist, they're an assimilation/gemination or [d̪]. I'll try a round of training that treats the stopped realization explicitly as a dental [d̪], separate from the alveolar [d]; that might help with the over-generalization to [d], since the stopped realization is much more frequent than the fricative version in the training data. But I really don't think that TIMIT is a good benchmark for ASR in general, let alone for alignment. All the sentences are stilted because they tried to phonetically balance the corpus, and that's just an impossible task in natural language. Trying to have the same amount of /t/ as /ʒ/ really doesn't make sense. And then if you take a look at their speaker info section: https://catalog.ldc.upenn.edu/docs/LDC93S1/SPKRINFO.TXT, hooo boy, biases abound if you look at those notes (and more concretely, of the 630 speakers, 578 were marked as White, 26 as Black, 3 as Asian, 2 as Native American, 4 as Hispanic, and 17 as ???).

With that said, most of my benchmarking and analysis has been on Buckeye, which has its own fair share of issues that I've corrected over the years.

Anyway, I have thoughts and opinions about TIMIT, but my point is that they are encoding an analysis of English that I have not adhered to with the dictionaries that I've written and have no desire to replicate.

Here are some benchmarks I've done. I'm not surprised if ARPA models outperform IPA-based ones, because the benchmarks for both TIMIT and Buckeye are inspired by ARPAbet (and Buckeye's transcription set is directly derived from a reduced version of TIMIT's phone set), but again, I'm really not interested in models that work well on American English ARPAbet in particular.

[Figures: mfa2_alignment_score, mfa2_phone_error_rate]

You can see the scripts that generated these here: https://github.com/mmcauliffe/memcauliffe-blog-scripts/tree/main/aligning/mfa_2.0_evaluations, along with https://github.com/mmcauliffe/corpus-creation-scripts/blob/main/buckeye/create_buckeye_benchmark.py and https://github.com/mmcauliffe/corpus-creation-scripts/blob/main/timit/create_timit_benchmark.py. Let me know if I've done something wrong, but otherwise I'll close this out.

@mmcauliffe
Member

mmcauliffe commented Jun 8, 2022

Updated models (2.0.0a, available here: https://github.com/MontrealCorpusTools/mfa-models/releases) are now live for all languages. They should show some improvement in performance across the board, including for [ð], but for the reasons listed above, I wouldn't be surprised if the MFA phone set models do not predict a fricative in places where TIMIT has one transcribed.
