
Update logic to get segment from features before encoding #705

Merged
merged 1 commit into SYSTRAN:master on Feb 29, 2024

Conversation

trungkienbkhn
Collaborator

Quick fix for #646

I tested with a poor-quality audio file (192s) and measured:

  • ctranslate2==3.24.0 with faster-whisper==0.10.1: execution time 12.9s
  • ctranslate2==3.24.0 with faster-whisper==1.0.1: execution time 28.9s

Below is my testing code:

from faster_whisper import WhisperModel

model = WhisperModel('large-v3', device='cuda')
# transcribe returns a generator; the actual decoding runs as segments are consumed
segments, info = model.transcribe(jfk_path, word_timestamps=True)

After investigating, I found that the logic changes below increased the latency:
Old logic:

segment = features[:, seek : seek + self.feature_extractor.nb_max_frames]
segment_size = min(
    self.feature_extractor.nb_max_frames, content_frames - seek
)

New update logic in #646:

segment_size = min(
    self.feature_extractor.nb_max_frames,
    content_frames - seek,
    seek_clip_end - seek,
)
segment = features[:, seek : seek + segment_size]

Cutting the segment this way changes the encode_output. It only happens in the last loop iteration, because only there is segment_size < self.feature_extractor.nb_max_frames.
As a result, the decode_result also changes, and its quality was reduced in my tests.
Specifically, I got the following error in my log:

Compression ratio threshold is not met with temperature 0.0 (3.9903846153846154 > 2.4)
Compression ratio threshold is not met with temperature 0.2 (3.9903846153846154 > 2.4)
Compression ratio threshold is not met with temperature 0.4 (3.9903846153846154 > 2.4)
Compression ratio threshold is not met with temperature 0.6 (3.7184466019417477 > 2.4)

=> The decode_result is poor and repetitive, so more decode iterations are needed to export the final result, which increases transcription time.
For more info, you can check this logic.
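To make the latency mechanism concrete, here is a hypothetical sketch (not the library's actual code) of a Whisper-style temperature-fallback loop: every time the compression-ratio check fails, as in the log above, a full extra decode pass runs at a higher temperature, so degraded encoder output directly inflates transcription time. The names decode_fn, transcribe_with_fallback, and the return shape are assumptions for illustration.

```python
def transcribe_with_fallback(
    decode_fn,
    temperatures=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    compression_ratio_threshold=2.4,
):
    """Hypothetical sketch of a temperature-fallback retry loop.

    decode_fn(temperature) -> (text, compression_ratio) stands in for one
    full decode pass; each failed quality check costs another pass.
    """
    for temperature in temperatures:
        text, compression_ratio = decode_fn(temperature)
        if compression_ratio <= compression_ratio_threshold:
            # Quality check passed: stop retrying.
            return text, temperature
    # All temperatures failed; keep the last attempt.
    return text, temperatures[-1]
```

With the truncated segment, four decode passes failed the 2.4 threshold before a usable result appeared, which matches the roughly doubled execution time reported above.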

@thomasmol

I tried this and it fixes the latency/speed bug I mentioned in #703.
Thanks!

@trungkienbkhn
Collaborator Author

@kale4eat, thanks for your recommendation in #712. I updated the logic to use the pad_or_trim function, and it works as expected.
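For readers unfamiliar with the helper: a minimal sketch of what a pad_or_trim function does, modeled on the utility in openai/whisper (the exact signature in this repo may differ). It pads with zeros or trims the mel segment so the encoder always sees a fixed number of frames, which avoids the encode_output change caused by feeding a shorter final segment.

```python
import numpy as np

def pad_or_trim(array, length=3000, axis=-1):
    """Pad with zeros or trim along `axis` to exactly `length` frames.

    Sketch of a Whisper-style helper: 3000 mel frames correspond to the
    fixed 30-second window the encoder expects.
    """
    if array.shape[axis] > length:
        # Too long: keep only the first `length` frames.
        array = array.take(indices=range(length), axis=axis)
    elif array.shape[axis] < length:
        # Too short: zero-pad at the end up to `length`.
        pad_widths = [(0, 0)] * array.ndim
        pad_widths[axis] = (0, length - array.shape[axis])
        array = np.pad(array, pad_widths)
    return array
```

So the last, shorter segment is padded back to nb_max_frames before encoding, and segment_size is still used afterwards to bound the timestamps.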

@doublex

doublex commented Feb 24, 2024

@trungkienbkhn
Thanks for your efforts!

@nguyendc-systran nguyendc-systran merged commit 16141e6 into SYSTRAN:master Feb 29, 2024
3 checks passed