
add fastconformer timestamp (reopen) #8438

Closed
wants to merge 5 commits

Conversation

@biscayan commented Feb 16, 2024

What does this PR do?

I closed pull request #8341 because of the many conflicts in its branch.
I tried to resolve them, but in the end I deleted the branch, created a new one, and added the sign-off to the commit.
Sorry for my mistakes.

Collection: [Note which collection this PR will affect]

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 
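
A minimal illustrative sketch of the kind of usage this enables (the helper and token format are hypothetical, not NeMo's actual API): grouping greedy-CTC token offsets into word-level timestamps with the 0.08 s FastConformer stride.

```python
# Illustrative only: group greedy-CTC token offsets into word-level timestamps.
# "_" stands in for the SentencePiece word-boundary marker; 0.08 s per output
# frame is the FastConformer stride this PR sets via model_stride_sec.

MODEL_STRIDE_SEC = 0.08  # 80 ms per output frame for FastConformer

def words_with_timestamps(tokens, frame_offsets, stride=MODEL_STRIDE_SEC):
    """tokens: decoded sub-words ("_" marks a word start); frame_offsets: their output-frame indices."""
    words = []
    for tok, off in zip(tokens, frame_offsets):
        t = off * stride
        if tok.startswith("_") or not words:      # new word begins
            words.append({"word": tok.lstrip("_"), "start": t, "end": t + stride})
        else:                                     # continuation of the current word
            words[-1]["word"] += tok
            words[-1]["end"] = t + stride
    return words

# "hello" starts at ~0.8 s, "world" at ~2.4 s (up to float rounding).
print(words_with_timestamps(["_he", "llo", "_world"], [10, 12, 30]))
```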

Jenkins CI

To run Jenkins, a NeMo User with write access must comment jenkins on the PR.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open a "Draft" PR.

Who can review?

@titu1994 @nithinraok @tango4j
Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

  • Related to # (issue)

Signed-off-by: biscayan <skgudwn34@gmail.com>
github-actions bot added the ASR label Feb 16, 2024
github-actions bot (Contributor) commented Mar 2, 2024

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

github-actions bot added the stale label Mar 2, 2024
github-actions bot removed the stale label Mar 5, 2024
@tango4j (Collaborator) commented Mar 5, 2024

@biscayan

Do you remember how you came up with model_stride_sec = 0.08? (Link to the line)
With 0.08 sec, FastConformer is not yielding the right timestamps; it generates slightly longer sample lengths.

This needs more investigation into timestamp generation for the FastConformer model.

@biscayan (Author) commented Mar 6, 2024

@tango4j
Thank you for reviewing.
The FastConformer paper explains that the model uses a downsampling schema with an 80 ms frame rate, meaning each output frame covers 80 ms.
I checked the extracted timestamps while listening to the wav file.
With model_stride_sec=0.04 the timestamps do not match the wav file, but after changing model_stride_sec from 0.04 to 0.08 they match it accurately.
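
As a minimal illustrative sketch (not NeMo's actual API), the stride simply scales output-frame indices into seconds, which is why the wrong value rescales every timestamp:

```python
# Illustrative sketch: map CTC output-frame indices to times.
# FastConformer downsamples 10 ms features by 8x -> 80 ms per output frame,
# so model_stride_sec should be 0.08 (Conformer's 4x downsampling gives 0.04).

def frame_to_seconds(frame_index: int, model_stride_sec: float = 0.08) -> float:
    """Time offset (in seconds) of a given CTC output frame."""
    return frame_index * model_stride_sec

# A word whose first non-blank token appears at output frame 25 starts at
# 25 * 0.08 = 2.0 s; with the old 0.04 value it would be reported at 1.0 s.
print(frame_to_seconds(25))        # 2.0
print(frame_to_seconds(25, 0.04))  # 1.0
```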

@nithinraok (Collaborator) commented:

jenkins

@tango4j (Collaborator) commented Mar 8, 2024

@biscayan
Did you check the cpWER, or did you only check the DER?
When I tested with CallHome data, FastConformer showed around 50% cpWER while the other CTC models show 18~20% errors.
I think there is some issue with the chunk-based ASR system.
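
For reference, a minimal sketch of how cpWER (concatenated minimum-permutation WER) can be computed; the helpers are illustrative, not NeMo's implementation, and it assumes equal reference/hypothesis speaker counts:

```python
from itertools import permutations

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                            # deletion
                       d[j - 1] + 1,                        # insertion
                       prev + (ref[i - 1] != hyp[j - 1]))   # substitution
            prev = cur
    return d[-1]

def cpwer(ref_by_spk, hyp_by_spk):
    """Concatenate each speaker's words, then take the speaker permutation
    that minimizes the total word errors."""
    refs = [r.split() for r in ref_by_spk]
    hyps = [h.split() for h in hyp_by_spk]
    n_ref = sum(len(r) for r in refs)
    best = min(
        sum(edit_distance(r, hyps[p]) for r, p in zip(refs, perm))
        for perm in permutations(range(len(hyps)))
    )
    return best / max(n_ref, 1)

# cpWER picks the best speaker mapping, so a consistent label swap scores 0.0;
# words attributed to the wrong speaker mid-conversation still count as errors.
print(cpwer(["hello there", "good morning"], ["good morning", "hello there"]))  # 0.0
```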

@biscayan (Author) commented:
@tango4j
I didn't compute the cpWER for the FastConformer model.
I only compared the ASR result with the transcript on clean speech data, and the result is almost accurate.
However, I also ran into the problem on real reverberated meeting data, where the model's ASR performance is low.

@nithinraok
Copy link
Collaborator

@tango4j which other CTC models were you comparing, and are both of similar size?

@tango4j (Collaborator) commented Mar 11, 2024

@nithinraok
Model size does not affect the timestamp issue with chunk-based ASR. Only FastConformer CTC models return inaccurate timestamps with chunk-based ASR.
We use Conformer-CTC large (stt_en_conformer_ctc_large) for testing purposes.

@biscayan (Author) commented:
@tango4j
Is there anything more I should do for the review?

github-actions bot (Contributor) commented:
This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

github-actions bot added the stale label Mar 30, 2024
github-actions bot (Contributor) commented Apr 6, 2024

This PR was closed because it has been inactive for 7 days since being marked as stale.

github-actions bot closed this Apr 6, 2024
@nithinraok (Collaborator) commented:
@KunalDhawan please test this PR on a known set to compare against a similarly sized Conformer model.

@nithinraok reopened this May 8, 2024
@nithinraok (Collaborator) commented:
Also add an integration test for testing purposes! Testing on an English dataset is good enough.

@KunalDhawan (Collaborator) left a comment:

Hi @biscayan,
I ran offline diarization+ASR with timestamps using your proposed changes for the following 2 models:

  1. conformer-ctc-large: 120M params trained on 24.5K hours of English speech
  2. fastconformer-ctc-large: 115M params trained on 24.5K hours of English speech

Both models have a similar number of parameters and were trained on the same data.

For evaluation, I used a cleaned version of CallHome-109.

The performance of the two models is as follows:

  1. conformer-ctc-large:
    DER : 0.0474
    FA : 0.0111
    MISS : 0.0285
    CER : 0.0078
    Spk. counting acc. : 0.9083
    cpWER : 0.4640
    WER : 0.2383

  2. fastconformer-ctc-large:
    DER : 0.0474
    FA : 0.0111
    MISS : 0.0285
    CER : 0.0078
    Spk. counting acc. : 0.9083
    cpWER : 0.5158
    WER : 0.3153

Clearly, the FastConformer model with the parameters proposed in this PR doesn't perform up to expectations. I also verified the performance of the Conformer and FastConformer models on non-chunked ASR (LibriSpeech test); there, the FastConformer model performs better than the Conformer model, as expected.

Looking at the model predictions on the CH109 samples used above, it appears that the FastConformer model misses a few words in the chunk-based prediction, leading to the high observed WER and hence cpWER. Chunked ASR with FastConformer in offline_diar_with_asr_infer would have to be fixed before we can support timestamp calculation with FastConformer in this script.

For the reasons highlighted above, we'll close this incomplete PR and open another one with all the fixes.

@biscayan (Author) commented:
Hi @KunalDhawan,
Thank you for your report.
I will continue working on the PR later.
Thank you.
