
Ways to transcribe real-time/detect the end of speaking in indefinite file? #151

Closed
kingcharlezz opened this issue Apr 15, 2023 · 5 comments

@kingcharlezz

Hello all. I am working on a project pertaining to ASR in phone calls. After being dissatisfied with some of the commercial options, I wanted to try this. Is there a built-in way to know when the other party is not talking? Or something like whisper.cpp's stream function? I have seen mention of VAD in the docs, but I am not sure how to elegantly integrate it into my project. Any comments are appreciated.

Thanks.

@EtienneAb3d

@kingcharlezz, you may have a look at this project:
https://github.com/mallorbc/whisper_mic

@JonathanFly commented Apr 15, 2023

> Hello all. I am working on a project pertaining to ASR in phone calls. After being dissatisfied with some of the commercial options, I wanted to try this. Is there a built-in way to know when the other party is not talking? Or something like whisper.cpp's stream function? I have seen mention of VAD in the docs, but I am not sure how to elegantly integrate it into my project. Any comments are appreciated.
>
> Thanks.

I use faster-whisper on a real-time livestream (so, effectively infinite duration) and it works great. (Actually more than great: I can run two large faster-whisper models simultaneously and get both transcription and translation, it's that fast!)

For the VAD, you can pass in vad_filter=True; by default it breaks on 2-second silences (min_silence_duration_ms = 2000).

Also check out the non-VAD no_speech_threshold and log_prob_threshold options.
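For example, a minimal sketch (the model size and "call.wav" are placeholders; the threshold values shown are just the library defaults made explicit):

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v2")  # placeholder choice; any model size works

segments, info = model.transcribe(
    "call.wav",               # placeholder audio file
    vad_filter=True,          # enable the silero VAD pre-filter
    no_speech_threshold=0.6,  # default; raise it to drop more "silent" segments
    log_prob_threshold=-1.0,  # default; segments below this avg logprob count as failed
)

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```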

Here are the more specific VAD options, from vad.py. You pass these exact names to faster-whisper and the values get passed through to the VAD:


```python
def get_speech_timestamps(
    audio: np.ndarray,
    *,
    threshold: float = 0.5,
    min_speech_duration_ms: int = 250,
    max_speech_duration_s: float = float("inf"),
    min_silence_duration_ms: int = 2000,
    window_size_samples: int = 1024,
    speech_pad_ms: int = 200,
) -> List[dict]:
    """This method is used for splitting long audios into speech chunks using silero VAD.

    Args:
      audio: One dimensional float array.
      threshold: Speech threshold. Silero VAD outputs speech probabilities for each audio chunk;
        probabilities ABOVE this value are considered SPEECH. It is better to tune this
        parameter for each dataset separately, but a "lazy" 0.5 is pretty good for most datasets.
      min_speech_duration_ms: Final speech chunks shorter than min_speech_duration_ms are thrown out.
      max_speech_duration_s: Maximum duration of speech chunks in seconds. Chunks longer
        than max_speech_duration_s will be split at the timestamp of the last silence that
        lasts more than 100ms (if any), to prevent aggressive cutting. Otherwise, they will be
        split aggressively just before max_speech_duration_s.
      min_silence_duration_ms: At the end of each speech chunk, wait for min_silence_duration_ms
        before separating it.
      window_size_samples: Audio chunks of window_size_samples size are fed to the silero VAD model.
        WARNING! Silero VAD models were trained using 512, 1024, and 1536 samples for a 16000
        sample rate. Values other than these may affect model performance!
      speech_pad_ms: Final speech chunks are padded by speech_pad_ms on each side.

    Returns:
      List of dicts containing begin and end samples of each speech chunk.
    """
```

For livestreams the biggest bottleneck in my opinion, after the VAD, is noise reduction. I pipe the live audio through OBS with the NVIDIA noise reduction filter before sending it to faster-whisper. It's a night-and-day difference in Whisper performance on audio with lots of background music or noise. For phone calls you can probably get away without doing that, though.

@kingcharlezz (Author)

Appreciate this! Seems to accomplish what I need it to do. Thanks for the in-depth responses.

@lpy-ET commented Apr 25, 2023

> I use faster-whisper on a real-time livestream (so, effectively infinite duration) and it works great. (Actually more than great: I can run two large faster-whisper models simultaneously and get both transcription and translation, it's that fast!)

Hi @JonathanFly, could you give more info on how you use this "real-time livestream" with infinite duration, please?

@JonathanFly commented Apr 26, 2023

> Hi @JonathanFly, could you give more info on how you use this "real-time livestream" with infinite duration, please?

I threw it up here: https://github.com/JonathanFly/faster-whisper-livestream-translator

I left it in a rough state, but you can get the idea. It's a messy fork of https://github.com/fortypercnt/stream-translator
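The core loop boils down to something like this (an untested sketch, not the repo's actual code; STREAM_URL is a placeholder that in practice you'd resolve with streamlink or yt-dlp first, and you need ffmpeg on your PATH):

```python
import subprocess

import numpy as np
from faster_whisper import WhisperModel

SAMPLE_RATE = 16000  # faster-whisper expects 16 kHz mono audio as float32
CHUNK_SECONDS = 10   # arbitrary buffer length for this sketch

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# ffmpeg decodes the stream to raw 16-bit mono PCM on stdout.
ffmpeg = subprocess.Popen(
    ["ffmpeg", "-loglevel", "quiet", "-i", "STREAM_URL",
     "-f", "s16le", "-ac", "1", "-ar", str(SAMPLE_RATE), "-"],
    stdout=subprocess.PIPE,
)

while True:
    raw = ffmpeg.stdout.read(SAMPLE_RATE * CHUNK_SECONDS * 2)  # 2 bytes per sample
    if not raw:
        break
    # Convert 16-bit PCM to the float32 array transcribe() accepts.
    audio = np.frombuffer(raw, np.int16).astype(np.float32) / 32768.0
    segments, _ = model.transcribe(audio, vad_filter=True)
    for segment in segments:
        print(segment.text, flush=True)
```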
