
Ways to transcribe real-time/detect the end of speaking in indefinite file? #151

Closed
kingcharlezz opened this issue Apr 15, 2023 · 5 comments

@kingcharlezz

Hello all. I am working on a project pertaining to ASR in phone calls. After being dissatisfied with some of the commercial options, I wanted to try this. Is there a built-in way to know when the other party is not talking? Or something like whisper.cpp's stream function? I have seen mention of VAD in the docs, but I am not sure how to elegantly integrate it into my project. Any comments are appreciated.

Thanks.

@EtienneAb3d

@kingcharlezz, you may have a look at this project:
https://github.com/mallorbc/whisper_mic

@JonathanFly commented Apr 15, 2023

> Hello all. I am working on a project pertaining to ASR in phone calls. After being dissatisfied with some of the commercial options, I wanted to try this. Is there a built-in way to know when the other party is not talking? Or something like whisper.cpp's stream function? I have seen mention of VAD in the docs, but I am not sure how to elegantly integrate it into my project. Any comments are appreciated.
>
> Thanks.

I use faster-whisper on a real-time livestream (so, effectively infinite duration) and it works great. (Actually more than great: I can run two large faster-whisper models simultaneously and get both transcription and translation, it's that fast!)

For the VAD, you can pass in vad_filter=True; by default it breaks on 2-second silences (min_silence_duration_ms = 2000).

Also check out the non-VAD no_speech_threshold and log_prob_threshold options.
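For example, a minimal sketch (the model size and "call.wav" are placeholders; the threshold values shown are just the library defaults made explicit):

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v2")  # placeholder choice; any model size works

segments, info = model.transcribe(
    "call.wav",               # placeholder audio file
    vad_filter=True,          # enable the silero VAD pre-filter
    no_speech_threshold=0.6,  # default; raise it to drop more "silent" segments
    log_prob_threshold=-1.0,  # default; segments below this avg logprob count as failed
)

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```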

Here are the more specific VAD options, from vad.py. You pass these exact names to faster-whisper and the values get passed through to the VAD:


```python
def get_speech_timestamps(
    audio: np.ndarray,
    *,
    threshold: float = 0.5,
    min_speech_duration_ms: int = 250,
    max_speech_duration_s: float = float("inf"),
    min_silence_duration_ms: int = 2000,
    window_size_samples: int = 1024,
    speech_pad_ms: int = 200,
) -> List[dict]:
    """This method is used for splitting long audios into speech chunks using silero VAD.

    Args:
      audio: One dimensional float array.
      threshold: Speech threshold. Silero VAD outputs speech probabilities for each audio chunk;
        probabilities ABOVE this value are considered SPEECH. It is better to tune this
        parameter for each dataset separately, but a "lazy" 0.5 is pretty good for most datasets.
      min_speech_duration_ms: Final speech chunks shorter than min_speech_duration_ms are thrown out.
      max_speech_duration_s: Maximum duration of speech chunks in seconds. Chunks longer
        than max_speech_duration_s will be split at the timestamp of the last silence that
        lasts more than 100ms (if any), to prevent aggressive cutting. Otherwise, they will be
        split aggressively just before max_speech_duration_s.
      min_silence_duration_ms: At the end of each speech chunk, wait for min_silence_duration_ms
        before separating it.
      window_size_samples: Audio chunks of window_size_samples size are fed to the silero VAD model.
        WARNING! Silero VAD models were trained using 512, 1024, and 1536 samples for a 16000
        sample rate. Values other than these may affect model performance!
      speech_pad_ms: Final speech chunks are padded by speech_pad_ms on each side.

    Returns:
      List of dicts containing begin and end samples of each speech chunk.
    """
```

For livestreams the biggest bottleneck in my opinion, after the VAD, is noise reduction. I pipe the live audio through OBS with the NVIDIA noise reduction filter before sending it to faster-whisper. It's a night-and-day difference in Whisper performance on audio with lots of background music or noise. For phone calls you can probably get away without doing that, though.

@kingcharlezz (Author)

Appreciate this! Seems to accomplish what I need it to do. Thanks for the in-depth responses.

@lpy-ET commented Apr 25, 2023

> I use faster-whisper on a real-time livestream (so, effectively infinite duration) and it works great. (Actually more than great: I can run two large faster-whisper models simultaneously and get both transcription and translation, it's that fast!)

Hi @JonathanFly, could you give more info on how you use this "real-time livestream" with infinite duration, please?

@JonathanFly commented Apr 26, 2023

> Hi @JonathanFly, could you give more info on how you use this "real-time livestream" with infinite duration, please?

I threw it up here: https://github.com/JonathanFly/faster-whisper-livestream-translator

I left it in a rough state, but you can get the idea. It's a messy fork of https://github.com/fortypercnt/stream-translator
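The core loop boils down to something like this (an untested sketch, not the repo's actual code; STREAM_URL is a placeholder that in practice you'd resolve with streamlink or yt-dlp first, and you need ffmpeg on your PATH):

```python
import subprocess

import numpy as np
from faster_whisper import WhisperModel

SAMPLE_RATE = 16000  # faster-whisper expects 16 kHz mono audio as float32
CHUNK_SECONDS = 10   # arbitrary buffer length for this sketch

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# ffmpeg decodes the stream to raw 16-bit mono PCM on stdout.
ffmpeg = subprocess.Popen(
    ["ffmpeg", "-loglevel", "quiet", "-i", "STREAM_URL",
     "-f", "s16le", "-ac", "1", "-ar", str(SAMPLE_RATE), "-"],
    stdout=subprocess.PIPE,
)

while True:
    raw = ffmpeg.stdout.read(SAMPLE_RATE * CHUNK_SECONDS * 2)  # 2 bytes per sample
    if not raw:
        break
    # Convert 16-bit PCM to the float32 array transcribe() accepts.
    audio = np.frombuffer(raw, np.int16).astype(np.float32) / 32768.0
    segments, _ = model.transcribe(audio, vad_filter=True)
    for segment in segments:
        print(segment.text, flush=True)
```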
