
Update logic to get segment from features before encoding #705

Merged
merged 1 commit into SYSTRAN:master on Feb 29, 2024

Conversation

trungkienbkhn
Collaborator

Quick fix for #646

I tested with a poor-quality audio file (192s) and measured:

  • ctranslate2==3.24.0 with faster-whisper==0.10.1: execution time 12.9s
  • ctranslate2==3.24.0 with faster-whisper==1.0.1: execution time 28.9s

Below is my testing code:

from faster_whisper import WhisperModel

model = WhisperModel('large-v3', device='cuda')
# transcribe returns a generator; the actual decoding runs as segments are consumed
segments, info = model.transcribe(jfk_path, word_timestamps=True)

After investigating, I found that the logic changes below increased the latency:
Old logic:

segment = features[:, seek : seek + self.feature_extractor.nb_max_frames]
segment_size = min(
    self.feature_extractor.nb_max_frames, content_frames - seek
)

New update logic in #646:

segment_size = min(
    self.feature_extractor.nb_max_frames,
    content_frames - seek,
    seek_clip_end - seek,
)
segment = features[:, seek : seek + segment_size]

Cutting the segment this way changes the encode_output. It only happens in the last loop iteration, because only there is segment_size < self.feature_extractor.nb_max_frames.
As a result, the decode_result also changes, and its quality was reduced in my tests.
Specifically, I got the following error in my log:

Compression ratio threshold is not met with temperature 0.0 (3.9903846153846154 > 2.4)
Compression ratio threshold is not met with temperature 0.2 (3.9903846153846154 > 2.4)
Compression ratio threshold is not met with temperature 0.4 (3.9903846153846154 > 2.4)
Compression ratio threshold is not met with temperature 0.6 (3.7184466019417477 > 2.4)

=> The decode_result is poor and repetitive, so more decode iterations are needed to export the final result, which increases transcription time.
For more info, you can check this logic.
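To make the latency mechanism concrete, here is a hypothetical sketch (not the library's actual code) of a Whisper-style temperature-fallback loop: every time the compression-ratio check fails, as in the log above, a full extra decode pass runs at a higher temperature, so degraded encoder output directly inflates transcription time. The names decode_fn, transcribe_with_fallback, and the return shape are assumptions for illustration.

```python
def transcribe_with_fallback(
    decode_fn,
    temperatures=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    compression_ratio_threshold=2.4,
):
    """Hypothetical sketch of a temperature-fallback retry loop.

    decode_fn(temperature) -> (text, compression_ratio) stands in for one
    full decode pass; each failed quality check costs another pass.
    """
    for temperature in temperatures:
        text, compression_ratio = decode_fn(temperature)
        if compression_ratio <= compression_ratio_threshold:
            # Quality check passed: stop retrying.
            return text, temperature
    # All temperatures failed; keep the last attempt.
    return text, temperatures[-1]
```

With the truncated segment, four decode passes failed the 2.4 threshold before a usable result appeared, which matches the roughly doubled execution time reported above.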

@thomasmol

I tried this and it fixes the latency/speed bug I mentioned in #703.
Thanks!

@trungkienbkhn
Collaborator Author

@kale4eat, thanks for your recommendation in #712. I updated the logic to use the pad_or_trim function, and it works as expected.
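For readers unfamiliar with the helper: a minimal sketch of what a pad_or_trim function does, modeled on the utility in openai/whisper (the exact signature in this repo may differ). It pads with zeros or trims the mel segment so the encoder always sees a fixed number of frames, which avoids the encode_output change caused by feeding a shorter final segment.

```python
import numpy as np

def pad_or_trim(array, length=3000, axis=-1):
    """Pad with zeros or trim along `axis` to exactly `length` frames.

    Sketch of a Whisper-style helper: 3000 mel frames correspond to the
    fixed 30-second window the encoder expects.
    """
    if array.shape[axis] > length:
        # Too long: keep only the first `length` frames.
        array = array.take(indices=range(length), axis=axis)
    elif array.shape[axis] < length:
        # Too short: zero-pad at the end up to `length`.
        pad_widths = [(0, 0)] * array.ndim
        pad_widths[axis] = (0, length - array.shape[axis])
        array = np.pad(array, pad_widths)
    return array
```

So the last, shorter segment is padded back to nb_max_frames before encoding, and segment_size is still used afterwards to bound the timestamps.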

@doublex

doublex commented Feb 24, 2024

@trungkienbkhn
Thanks for your efforts!

@nguyendc-systran nguyendc-systran merged commit 16141e6 into SYSTRAN:master Feb 29, 2024
3 checks passed