
Feature/add hotwords #731

Merged
7 commits merged into SYSTRAN:master on May 4, 2024

Conversation

jax-explorer
Contributor

hello!
During the transcription process, I often encounter some proprietary or new vocabulary, and Whisper cannot handle it well. I searched for solutions, and the community provided two options:

Fine-tuning the model: This approach is costly, and it's not practical to fine-tune the model every time a new term emerges.

Using initial_prompt: However, initial_prompt only applies to the first window. If specialized terms don't appear at the beginning, this method is ineffective.

Upon reviewing other transcription models, I found it's common practice to support hotwords, so I implemented this feature. My approach is to add a hotword-related prompt before each transcription window. Since the prompt has a maximum length limit, the hotwords occupy the space normally used by the prefix, so they take effect only when no prefix is set. In my testing, this resolved the issue with specialized vocabulary in my scenario.
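To make the mechanism concrete, here is a minimal sketch of the idea (illustrative only: the function and tokenizer names are assumptions, not the actual faster-whisper internals, and full prefix handling is omitted):

```python
def build_prompt(tokenizer, previous_tokens, max_length, prefix=None, hotwords=None):
    """Sketch of per-window prompt construction with hotwords.

    Unlike initial_prompt (which only conditions the first window),
    the hotwords are injected into *every* window's prompt, reusing
    the token budget normally reserved for the prefix.
    """
    prompt = [tokenizer.sot_prev]

    if hotwords is not None and prefix is None:
        # Hotwords take the slot normally used by the prefix,
        # truncated to half of the prompt's token budget.
        hotword_tokens = tokenizer.encode(" " + hotwords.strip())
        prompt.extend(hotword_tokens[: max_length // 2 - 1])

    # Condition on the tail of the previous window's output as usual.
    prompt.extend(previous_tokens[-(max_length // 2 - 1):])
    return prompt
```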

The following is the community discussion on this issue:
openai/whisper#1477
https://discuss.huggingface.co/t/adding-custom-vocabularies-on-whisper/29311
https://stackoverflow.com/questions/73833916/how-can-i-give-some-hint-phrases-to-openais-whisper-asr

Since my project uses faster-whisper, I'm submitting the change to this repository first. If it's accepted, I'll port it to the openai/whisper project as well.

@jax-explorer
Contributor Author

@nguyendc-systran hello, please review this PR.

@trungkienbkhn
Collaborator

@jax-explorer, hello. Can you please provide an example of your test cases?

@jax-explorer
Contributor Author

@trungkienbkhn

OK. comfyUI is a new term; it is "the most powerful and modular stable diffusion GUI and backend".

The test video is https://www.youtube.com/watch?v=Ybu6qTbEsew

Without hotwords:

```python
segments, info = model.transcribe(
    input_file,
    beam_size=5,
    language="en",
    vad_filter=False,
    vad_parameters=dict(min_silence_duration_ms=1000),
)
```

the result is:
“[261.76s -> 263.12s] The first thing you need to do is,
[263.12s -> 265.36s] of course, to copy the web address
[265.36s -> 266.12s] up here.
[266.12s -> 267.84s] Then you go into your Conf UI
[267.84s -> 270.04s] folder, again in the Conf UI
[270.04s -> 272.08s] folder, in there in the custom
[272.08s -> 274.08s] nodes folder and then up here
[274.08s -> 276.28s] in the address bar type CMD,
[276.28s -> 277.40s] hit enter.
[277.40s -> 279.40s] This opens up your command
[279.40s -> 281.24s] window. In here you type
[281.24s -> 283.36s] git clone and then
[283.36s -> 285.32s] put the web address and hit
[285.32s -> 287.36s] enter to clone the git
[287.36s -> 289.68s] project into your custom
[289.68s -> 290.56s] nodes folder.
[290.56s -> 291.60s] After you've done this, you're going
[291.60s -> 293.32s] to find in here the Conf UI”

comfyUI is incorrectly transcribed as "Conf UI".

With hotwords:

```python
segments, info = model.transcribe(
    input_file,
    hotwords="the video is about comfyUI",
    beam_size=5,
    language="en",
    vad_filter=False,
    vad_parameters=dict(min_silence_duration_ms=1000),
)
```

the result is:
"
[261.76s -> 263.12s] The first thing you need to do is,
[263.12s -> 264.84s] of course, to copy the web
[264.84s -> 266.68s] address up here, then you go
[266.68s -> 268.48s] into your comfyUI folder,
[268.48s -> 270.80s] again in the comfyUI folder,
[270.80s -> 272.48s] in there in the custom nodes
[272.48s -> 274.28s] folder, and then up here in the
[274.28s -> 276.28s] address bar type cmd,
[276.28s -> 277.40s] hit enter.
[277.40s -> 279.40s] This opens up your command
[279.40s -> 281.20s] window. In here you type
[281.20s -> 283.08s] git clone and
[283.08s -> 285.00s] then put the web address and
[285.00s -> 286.92s] hit enter to clone
[286.92s -> 288.88s] the git project into
[288.88s -> 290.56s] your custom nodes folder.
[290.56s -> 291.48s] After you've done this, you're
[291.48s -> 293.32s] going to find in here the comfyUI
"
It is now correctly transcribed as comfyUI.

@jax-explorer
Contributor Author

@trungkienbkhn hello,
Please let me know if this PR needs any other changes or information; if not, I'm going to submit the same change to Whisper.

@trungkienbkhn
Collaborator

@jax-explorer , thanks for your PR. LGTM.

@jax-explorer
Contributor Author

ok, thanks.

@RichardQin1

@jax-explorer I encountered an issue when using faster-whisper where a person's name in initial_prompt only takes effect in the first part. Can your method solve this problem? How should I use it? Thanks.

@jax-explorer
Contributor Author

@RichardQin1 Yes, this PR will solve your problem. Here is an example:

```python
segments, info = model.transcribe(
    input_file,
    hotwords="the video is about comfyUI",
    beam_size=5,
    language="en",
    vad_filter=False,
    vad_parameters=dict(min_silence_duration_ms=1000),
)
```

@arabcoders

arabcoders commented Mar 9, 2024

I tested the patch and it does seem to improve recognition of the vocabulary when given appropriate words.

Edit:

I've noticed a small side effect: when the model is hallucinating, it outputs the hotwords. I can clean this up with a post-processor, but it's worth mentioning: the hallucinated line is an exact copy of the hotwords given.

Edit2:

After longer testing, the hallucination still happens, with some variety: sometimes it's an exact copy of the hotwords, other times a slight variation of them.
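For anyone hitting the same artifact, a post-processor like the one mentioned above could be sketched as follows (a hypothetical helper, not part of this PR): drop segments whose text is essentially a copy of the hotwords string, e.g. by fuzzy matching.

```python
import difflib

def drop_hotword_hallucinations(segment_texts, hotwords, threshold=0.85):
    """Filter out segment texts that closely match the hotwords string,
    which is the hallucination pattern described above."""
    cleaned = []
    for text in segment_texts:
        similarity = difflib.SequenceMatcher(
            None, text.strip().lower(), hotwords.strip().lower()
        ).ratio()
        if similarity < threshold:
            cleaned.append(text)
    return cleaned
```

The threshold is a judgment call: 1.0 would only catch exact copies, while a lower value also catches the slight variations reported here.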

@jax-explorer
Contributor Author

@arabcoders hello. It is true that hallucinated output is affected by this setting: it changes from repeating the last few sentences of the previous window to hotword-related sentences. But I don't think this is a side effect, because when a hallucination occurs we should be concerned with resolving the hallucination itself (e.g. by using VAD), not with its content.

@arabcoders

> @arabcoders hello. It is true that hallucinated output is affected by this setting: it changes from repeating the last few sentences of the previous window to hotword-related sentences. But I don't think this is a side effect, because when a hallucination occurs we should be concerned with resolving the hallucination itself (e.g. by using VAD), not with its content.

Hi, this was with Silero VAD filtering out the silence segments. I noticed it occurs when the voice pitch changes, e.g. right before a song. This rarely happens without this patch. The prompt resets when that happens, and because this patch adds hotwords whenever the prompt is empty, the hallucination occurs more frequently.

I suggest implementing a more state-aware injection instead of blindly adding the hotwords whenever the prompt is empty.

Thank you.
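One rough sketch of such a state-aware injection (hypothetical names and logic; not this PR's actual code) would be to inject hotwords only when the previous window produced real speech, so that a prompt reset, e.g. right before a song, does not immediately re-seed hallucination material:

```python
def maybe_inject_hotwords(hotword_tokens, previous_tokens, last_window_had_speech):
    """Hypothetical sketch of a state-aware hotword injection: only add
    the hotwords when the decoder just produced real speech, instead of
    whenever the prompt is empty (the behavior criticized above)."""
    if last_window_had_speech and previous_tokens:
        return hotword_tokens + previous_tokens
    # After a prompt reset (no prior speech), skip the hotwords so a
    # hallucinating window has nothing hotword-shaped to copy.
    return previous_tokens
```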

@jax-explorer
Contributor Author

@arabcoders hi, got it. So hotwords are triggering hallucinations where none appeared before, right? Is there a link to an audio file that reproduces this problem? I'll try to modify and test.

@arabcoders

> @arabcoders hi, got it. So hotwords are triggering hallucinations where none appeared before, right? Is there a link to an audio file that reproduces this problem? I'll try to modify and test.

Sure, try this partial clip. I couldn't upload the entire thing as it's 2h+; this is a 10-minute clip of that concert, and it shows the problem I'm talking about. You can download this clip. The hotwords I used were "the video is about #Babababambi an all girls idol group from Japan". The parameters were:

```json
{
  "task": "translate",
  "language": "Japanese",
  "temperature": [0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
  "best_of": 5,
  "beam_size": 5,
  "patience": 2,
  "length_penalty": null,
  "suppress_tokens": "-1",
  "initial_prompt": null,
  "condition_on_previous_text": true,
  "compression_ratio_threshold": 2.4,
  "logprob_threshold": -1.0,
  "no_speech_threshold": 0.6,
  "word_timestamps": false,
  "prepend_punctuations": "\"'“¿([{-",
  "append_punctuations": "\"'.。,,!!??::”)]}、"
}
```

@JH90iOS

JH90iOS commented Apr 30, 2024

Thanks for your PR! It's very useful to me.
I have tested it with more than 300 speech samples, and it works fine. I also did not find any significant increase in hallucinations.

@jax-explorer
Contributor Author

@JH90iOS Thanks for the confirmation. I've been busy lately and haven't yet looked into the earlier feedback about added hallucinations.

@trungkienbkhn trungkienbkhn merged commit 847fec4 into SYSTRAN:master May 4, 2024
3 checks passed
@WeiFangping

WeiFangping commented May 23, 2024

@jax-explorer Thanks for your PR! This helps me a lot. But I also encountered these two problems; please tell me if there is any solution:

  1. Hallucination: sometimes hallucinated text appears in the transcription.
  2. The timestamps change when hotwords are set: the whole speech is divided into only 2 or 3 segments instead of being divided by sentence. I only added the hotwords setting; I tried changing the VAD parameters, but it doesn't help.
     [screenshot: timestamps before setting hotwords]
     [screenshot: timestamps after setting hotwords]
     Do you have any idea about this problem?
