
Issues with the model for PronunciationAssessment #1530

Closed
fabswt opened this issue Jun 13, 2022 · 15 comments
Assignees
Labels
accepted Issue moved to product team backlog. Will be closed when addressed. enhancement New feature or request pronunciation assessment

Comments

@fabswt

fabswt commented Jun 13, 2022

Describe the bug
PronunciationAssessment reports as erroneous pronunciations that are perfectly correct.

To Reproduce
Steps to reproduce the behavior:

  1. Take any of the sentences provided below
  2. Generate TTS for it using en-US, ChristopherNeural
  3. Use PronunciationAssessment to rate the pronunciation
  4. Observe how PronunciationAssessment is giving low ratings to pronunciations that may be deemed perfect.

Expected behavior
I would expect PronunciationAssessment to return a perfect score for the audio produced by the TTS (or for identical pronunciations from native speakers.)

Instead, some phonemes are being rated with very low scores, because the model does not seem to understand that some words accept multiple pronunciations.

Version of the Cognitive Services Speech SDK
azure-cognitiveservices-speech Python 1.21.0

Platform, Operating System, and Programming Language

  • macOS / ARM M1 / Python 3.9.12

Sentences

  • What are you thinking about?
From NBest[0].Words[X].Phonemes: Phoneme and AccuracyScore:
w       ɑ       t            ɑ       r            j       u            θ       ɪ       ŋ       k       ɪ       ŋ            ə       b       aʊ      t           
87      4       69           1       100          100     3            100     100     100     100     100     100          100     100     100     100         

From NBest[0].Words[X].Phonemes.NBestPhonemes[0]: Phoneme and Score:
w       ʌ       t            ə       r            j       ʊ            θ       ɪ       ŋ       k       ɪ       ŋ            ə       b       aʊ      t           
100     100     100          100     100          100     100          100     100     100     100     100     100          100     100     100     100                 

In other words: PronunciationAssessment considers the pronunciation of "what are" as [wɑt ɑr] to be the only one correct, even though pronouncing it as [wʌt ər] is correct – it's correct in the sense that the TTS does it, that Wiktionary gives these transcriptions, or that I perceive it as correct among native speakers of American English.

Likewise, it's considering "you" as [jʊ] to be incorrect, even though it's fine.

  • Hello, world!
From NBest[0].Words[X].Phonemes: Phoneme and AccuracyScore:
 h       ɛ       l       oʊ           w       ɝ       r       l       d           
100     59      100     100          100     100     100     100     100    

From NBest[0].Words[X].Phonemes.NBestPhonemes[0]: Phoneme and Score:
h       ə       l       oʊ           w       ɝ       r       l       d           
100     100     100     100          100     100     100     100     100

In other words: PronunciationAssessment considers "Hello" as [hɛloʊ] to be the only one correct, even though [həloʊ] is common and correct (and, again, given by the TTS API.)

(On the bright side, it confirms the API's accuracy: it did detect the phones realized by the TTS, it just doesn't know that such pronunciations are correct, if not common.)

I'm stopping at these two or three examples, but I know I could produce dozens if not hundreds more. This seems common with words that accept multiple pronunciations.
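The pattern described above (the expected phoneme gets a near-zero AccuracyScore while the recognizer's own top NBestPhonemes hypothesis scores 100) can be detected programmatically from the result JSON. Below is a minimal sketch; the field paths follow the `NBest[0].Words[X].Phonemes` / `NBestPhonemes` structure quoted in this report, but the sample data is hand-built to mirror the "What are you thinking about?" tables, not real API output.

```python
# Sketch: flag phonemes where the expected phoneme's AccuracyScore is very low
# but the recognizer's own top hypothesis (NBestPhonemes[0]) scores high --
# the signature of an accepted variant being treated as a mistake.
# The JSON layout mirrors the NBest[0].Words[X].Phonemes paths in this issue;
# the sample data is a hand-built stand-in, not real API output.

def suspect_variants(words, low=20, high=90):
    """Return (expected, heard, score) triples that look like accepted
    pronunciation variants rather than genuine mistakes."""
    flagged = []
    for word in words:
        for ph in word["Phonemes"]:
            expected = ph["Phoneme"]
            score = ph["PronunciationAssessment"]["AccuracyScore"]
            top = ph["NBestPhonemes"][0]  # recognizer's best guess for this slot
            if score <= low and top["Phoneme"] != expected and top["Score"] >= high:
                flagged.append((expected, top["Phoneme"], score))
    return flagged

# Hand-built sample matching the "what are" phonemes in the tables above.
words = [
    {"Phonemes": [
        {"Phoneme": "ɑ",
         "PronunciationAssessment": {"AccuracyScore": 4},
         "NBestPhonemes": [{"Phoneme": "ʌ", "Score": 100}]},
        {"Phoneme": "ɑ",
         "PronunciationAssessment": {"AccuracyScore": 1},
         "NBestPhonemes": [{"Phoneme": "ə", "Score": 100}]},
    ]},
]

print(suspect_variants(words))  # [('ɑ', 'ʌ', 4), ('ɑ', 'ə', 1)]
```

Run over a full result, this surfaces every slot where the API "heard" a known variant but scored it as wrong.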

@yulin-li
Contributor

@wangkenpu could you take a look?
cc @yinhew

@wangkenpu
Contributor

Thanks, @fabswt. We will look into your feedback.

@yulin-li yulin-li assigned wangkenpu and unassigned yulin-li Jun 22, 2022
@wangkenpu wangkenpu removed their assignment Jun 23, 2022
@pankopon
Contributor

@wangkenpu @yulin-li Please update with status.

@wangkenpu
Contributor

transfer to @yinhew

@yinhew
Contributor

yinhew commented Jul 4, 2022

Hi, @fabswt

This is a known gap in our API.
We currently apply only a single expected pronunciation.
We used to be more tolerant and allowed multiple pronunciations.
But that introduced a problem that a customer complained about.
e.g.: for the word "read", we used to allow both / r i d / and / r ɛ d /.
But in a given context, only one pronunciation should be allowed.
For the sentence "I read a book yesterday.", the customer expected us to give a low score when a kid spoke "read" as / r i d / and a good score when the kid spoke it as / r ɛ d /. But at the time, we gave a good score for both.

We are currently not able to distinguish "multiple allowed pronunciations for the same context" from "multiple allowed pronunciations for different contexts". We will need to improve this capability.
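The distinction above can be illustrated with a toy lexicon: some words (like "caramel") have variants that are all acceptable anywhere, while others (like "read") have variants tied to a grammatical context. This sketch is purely illustrative; the dictionaries, phoneme strings, and the `allowed` function are hypothetical, not part of the real service.

```python
# Illustrative sketch of the gap described above: "caramel"'s variants are fine
# in any context, while "read"'s allowed variant depends on context (tense).
# All names and data here are hypothetical.

CONTEXT_FREE = {            # any listed variant is fine, anywhere
    "caramel": {"k ɛ r ə m ɛ l", "k ɑ r m ə l"},
}
CONTEXT_BOUND = {           # the allowed variant depends on context
    "read": {"present": {"r i d"}, "past": {"r ɛ d"}},
}

def allowed(word, spoken, context=None):
    """Is `spoken` (a space-separated phoneme string) acceptable for `word`?"""
    if word in CONTEXT_FREE:
        return spoken in CONTEXT_FREE[word]
    if word in CONTEXT_BOUND:
        return spoken in CONTEXT_BOUND[word].get(context, set())
    return False

print(allowed("read", "r i d", "past"))      # False: wrong tense
print(allowed("read", "r ɛ d", "past"))      # True
print(allowed("caramel", "k ɑ r m ə l"))     # True: variant, any context
```

The point is that "I read a book yesterday" constrains "read" to one variant, while "caramel" never needs a context at all; the two cases require different handling.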

BTW, are you working on the prototype or product?
How much potential usage do you estimate if this is a product?

Thanks,
Yinhe

@fabswt
Author

fabswt commented Jul 29, 2022

Hi @yinhew,

Function words and strong/weak forms... everywhere

The problem is that function words are everywhere and many of them accept both a weak form that's more common (e.g. ‘as’ as [əz]) and a strong form that is less common, but still would not be considered a mistake (in this case, ‘as’ as [æz].)

I was about to share a demo with my list to start launching the product, until I realized that about every other sentence I tried returned a false positive (really, any sentence with a function word) because of this very issue. Just consider:

[five screenshots of example sentences, with the phonemes the API flagged shown in red]

I just generated the sentences above with TTS (en-US-ChristopherNeural, if that matters) and then fed the audio to the PronunciationAPI. In red are the sounds the PronunciationAPI considered wrong, in most cases with a score close to zero. In most (all?) cases, what happened is that the TTS used a schwa (like a normal speaker would), i.e. the weak form, whereas the PronunciationAPI expected whatever symbol is shown, i.e. the strong form.

(Only the word ‘the’ seems to be unaffected by this issue. Imagine if every occurrence of it were reported as wrong for using unstressed [ðə] in place of stressed [ði]... that is exactly what we have above.)

These are pretty basic sentences and the PronunciationAPI is off.

Other words

Other words that accept multiple pronunciations suffer from the same issue. e.g.:

  • caramel
    • the model expects EH even though a schwa is fine.
    • caramel
    • Wiktionary gives us: [screenshot of the Wiktionary pronunciation entry]
  • data
  • business
    • if using a schwa for the second syllable (which is a thing), the API will complain.
  • chocolate
    • the TTS uses AA, the API expects AO.
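One client-side workaround for these variant words would be to score against every accepted pronunciation and keep the best match, so a weak form ([əz]) is not penalized just because the strong form ([æz]) was expected. The sketch below is illustrative only: the variant lists are hand-picked and the similarity measure (`difflib.SequenceMatcher` over phoneme sequences) stands in for the real acoustic scoring, which it does not replicate.

```python
# Sketch of the tolerance being requested: score a word against every accepted
# pronunciation variant and keep the best match. The variant lists and the
# similarity measure are illustrative, not part of the real API.
from difflib import SequenceMatcher

VARIANTS = {
    "as":    ["æ z", "ə z"],            # strong and weak forms
    "hello": ["h ɛ l oʊ", "h ə l oʊ"],  # both are correct
}

def best_variant_score(word, heard):
    """Best 0-100 similarity between the heard phoneme string and any
    accepted variant of `word` (falls back to a self-match if unknown)."""
    candidates = VARIANTS.get(word, [heard])
    return max(
        round(100 * SequenceMatcher(None, heard.split(), v.split()).ratio())
        for v in candidates
    )

print(best_variant_score("hello", "h ə l oʊ"))  # 100: weak form accepted
print(best_variant_score("hello", "h ɛ l oʊ"))  # 100: strong form accepted
```

Taking the max over variants is exactly the "tolerant" behavior the API used to have; as noted above, it breaks on context-bound words like "read", which is why the two word classes need to be kept separate.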

read, read, read

About the ‘read’ example:

  1. it's an interesting example for sure, but there are far fewer words of this type than there are function words. Function words, in their weak form, are simply everywhere.
  2. The PronunciationAPI still returns a false positive on “I read a lot” in the present tense (it will only accept the past-tense pronunciation, with EH.)

Variable syllable count

  • Catholic.
    • The TTS uses a 2-syllable pronunciation
    • The PronunciationAssessment API expects three syllables, and will not tolerate a 2-syllable pronunciation.
  • broccoli
    • The TTS almost elides the schwa, but the API expects it.
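The variable-syllable-count cases are a slightly different problem: the phoneme sequences differ in length, so a fix would need an alignment that treats elision of an unstressed schwa as acceptable. A minimal sketch of that idea, with hand-transcribed phoneme strings for "Catholic" (the function and data are hypothetical, for illustration only):

```python
# Sketch for the variable-syllable-count cases above: compare phoneme
# sequences while treating deletion of a schwa in the reference as free,
# so a 2-syllable "Catholic" still matches the 3-syllable reference.
# Purely illustrative; transcriptions are hand-written approximations.

def matches_with_schwa_elision(expected, heard):
    """True if `heard` equals `expected` after optionally dropping schwas
    from `expected`. Both are space-separated phoneme strings."""
    e, h = expected.split(), heard.split()

    def align(i, j):
        if i == len(e) and j == len(h):
            return True
        if i < len(e) and e[i] == "ə" and align(i + 1, j):   # elide the schwa
            return True
        if i < len(e) and j < len(h) and e[i] == h[j]:       # exact match
            return align(i + 1, j + 1)
        return False

    return align(0, 0)

# 3-syllable reference vs 2-syllable realization of "Catholic"
print(matches_with_schwa_elision("k æ θ ə l ɪ k", "k æ θ l ɪ k"))  # True
print(matches_with_schwa_elision("k æ θ ə l ɪ k", "k æ l ɪ k"))    # False
```

Only the schwa is optional here; any other missing phoneme still fails, so genuine errors would still be caught.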

I'm close to launching (or at least so I thought) an early version of the product to about 100 paying customers.

This issue is really a big deal. If the PronunciationAPI cannot accurately rate output from the TTS (and the TTS is fine), then how is it supposed to rate learners?

@fabswt
Author

fabswt commented Aug 5, 2022

Hey,
Any news on this? It would help me figure out what I can plan with the API.

@yinhew
Contributor

yinhew commented Aug 5, 2022

Hi, @fabswt

We appreciate your thorough testing of our API.
There are indeed some gaps pending improvement. However, we have limited resources and need to prioritize carefully.

Can you please answer my question above?
Are you working on the prototype or product?
How much potential usage do you estimate if this is a product?

The answer can help us prioritize the work.

Thanks
Yinhe

@fabswt
Author

fabswt commented Aug 5, 2022

Hey @yinhew

I'm close to launching [...] an early version of the product to about 100 paying customers.

Planning to improve the product with the users themselves, in early access.

Got 1,500+ customers who bought my previous product and for whom this would be a good fit.

But I feel handicapped by the false positives.

I'd submit more constructive feedback, but I guess I'll wait until after the above improvements.

Best –

@pankopon pankopon assigned yinhew and unassigned yulin-li Aug 31, 2022
@pankopon
Contributor

@yinhew Can you please comment on the plans for support, based on customer info from fabswt?

@yinhew
Contributor

yinhew commented Sep 2, 2022

We are still doing some internal research to determine which behavior is the best one to apply.

@pankopon
Contributor

@yinhew Is there any feature work expected due to this? If yes please provide a work item id and we can mark this as an accepted enhancement request.

@pankopon pankopon added enhancement New feature or request accepted Issue moved to product team backlog. Will be closed when addressed. pronunciation assessment and removed pronunciation assessment in-review In review labels Feb 28, 2023
@pankopon
Contributor

Internal work item ref. 4930020.

@pankopon
Contributor

Closing the issue as the enhancement request is now being tracked with a task on the team backlog, no ETA. This item will be updated with information on availability after changes have been implemented and deployed.

@wangkenpu
Contributor

This has been fixed. @fabswt
