Deduplication of language detection functions by MahmoudAshraf97 · Pull Request #1146 · SYSTRAN/faster-whisper

MahmoudAshraf97 · 2024-11-15T16:38:41Z

This PR unifies language detection across WhisperModel and BatchedInferencePipeline.

Summary:

Added support for new options in batched transcription:
- language_detection_threshold
- language_detection_segments
Updated the WhisperModel.detect_language function to include the improved language detection from "Improve language detection" (#732) and added docstrings. It is now used inside the transcribe function.
Removed the following functions as they are no longer needed:
- WhisperModel.detect_language_multi_segment and its test
- BatchedInferencePipeline.get_language_and_tokenizer
Added tests for empty audio inputs.
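
To illustrate how the two options listed above might interact, here is a minimal, self-contained sketch of threshold-plus-voting language detection. It is not the code from this PR: the function name `detect_language_sketch`, the `detect_segment` callable, and the choice to raise on empty input are all hypothetical, and the default values are assumptions. The idea is that up to `language_detection_segments` segments are inspected, detection stops early once a segment's top language exceeds `language_detection_threshold`, and otherwise the language that ranked first most often is chosen.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical per-segment detector: maps a segment to (language, probability) pairs.
SegmentDetector = Callable[[object], List[Tuple[str, float]]]


def detect_language_sketch(
    segments: Sequence[object],
    detect_segment: SegmentDetector,
    language_detection_segments: int = 1,      # assumed default
    language_detection_threshold: float = 0.5,  # assumed default
) -> Tuple[str, float]:
    """Return (language, probability) using early exit plus majority vote."""
    votes: Dict[str, List[float]] = defaultdict(list)

    # Inspect at most `language_detection_segments` segments.
    for segment in segments[:language_detection_segments]:
        ranked = sorted(detect_segment(segment), key=lambda p: p[1], reverse=True)
        language, probability = ranked[0]

        # Early exit: one sufficiently confident segment settles the result.
        if probability > language_detection_threshold:
            return language, probability

        votes[language].append(probability)

    if not votes:
        # Sketch-only choice; how the library treats empty audio is not shown here.
        raise ValueError("no segments to inspect")

    # No segment was confident enough: pick the language that ranked first most often.
    language = max(votes, key=lambda lang: len(votes[lang]))
    return language, max(votes[language])


# Usage with a fake detector backed by a lookup table.
fake_probs = {
    "seg_a": [("en", 0.9), ("fr", 0.1)],
    "seg_b": [("fr", 0.4), ("en", 0.35)],
    "seg_c": [("fr", 0.45), ("de", 0.3)],
    "seg_d": [("de", 0.49), ("en", 0.2)],
}
result = detect_language_sketch(
    ["seg_b", "seg_c", "seg_d"],
    fake_probs.__getitem__,
    language_detection_segments=3,
)
print(result)
```

In this sketch, raising `language_detection_segments` trades extra work for robustness on audio whose opening seconds are silent or ambiguous, while the threshold keeps the common case cheap by stopping at the first confident segment.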

MahmoudAshraf97 added 3 commits November 15, 2024 18:31

initial commit

1c3dd3b

update readme

186fed3

.

b8c3ba2

MahmoudAshraf97 marked this pull request as draft November 15, 2024 17:12

MahmoudAshraf97 added 2 commits November 16, 2024 01:37

add test for empty audios

042d369

fixes for empty audios

853aa71

MahmoudAshraf97 marked this pull request as ready for review November 15, 2024 23:49

MahmoudAshraf97 merged commit a6f8fba into SYSTRAN:master Nov 16, 2024

MahmoudAshraf97 deleted the language_detection_refactor branch November 16, 2024 12:45
