
Some problems with large-v3 #100

Closed
despairTK opened this issue Nov 9, 2023 · 48 comments

@despairTK commented Nov 9, 2023

I'm glad you're quick to support large-v3. Your work is outstanding!

I had some problems testing large-v3-int8 and large-v3-fp16.

I have transcribed dozens of videos in English, Japanese, Korean and Russian, ranging in length from under 10 minutes to over an hour. The transcription settings were unchanged except for the model selection. The following problems occur when transcribing these videos:

  1. During transcription, some sentences get repeated, far more often than with large-v2, where repeated sentences are very, very rare.
  2. Timestamp accuracy is terrible compared to large-v2, and some timestamps are even misplaced.
  3. The [Compression ratio threshold is not met with temperature 0.0] message appears frequently; it is hardly ever seen with large-v2.
despairTK changed the title from "Error with large-v3 model." to "Some problems with large-v3" on Nov 9, 2023
@despairTK (Author)

I ran some more tests, this time focusing on adjusting some parameters to see if I could mitigate the problems described above with the large-v3 model. The results are still not good: short audio is acceptable, but long audio is still worse than with the large-v2 model.

@Purfview (Owner) commented Nov 9, 2023

Can you run the reference Whisper large-v3? How are the results there?

@despairTK (Author)

Reference Whisper's large-v3 runs even worse. I tried large-v3 on the first day it came out, and even Japanese audio under ten minutes long repeated sentences.

I also saw a test of the conversion in whisper.cpp yesterday, and it had the same problem. For example: ggerganov/whisper.cpp#1444 and openai/whisper#1783.
People have also asked me questions there about repeated sentences and bad timestamps.

@despairTK (Author)

By the way, there are some problems with punctuation as well. Even using --initial_prompt doesn't correct these issues...

@Purfview (Owner) commented Nov 9, 2023

I did only a few quick English tests; some places were better, some worse, but I too noticed some weird repetitions for no reason...

I wonder if the model was converted correctly. I can't run large in reference Whisper; could you do some tests for me?

Test these short audios in reference Whisper's console exe with the command below [use ffmpeg v5, not v6]:
--language English --model large-v3 --beam_size 1 --temperature_increment_on_fallback None --fp16 False

And upload all output files somewhere.
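
For context, a minimal Python equivalent of that console command (a sketch assuming the openai-whisper package; "short_test.wav" is a placeholder file name):

import whisper

# Reference Whisper, mirroring:
# --language English --model large-v3 --beam_size 1
# --temperature_increment_on_fallback None --fp16 False
model = whisper.load_model("large-v3")
result = model.transcribe(
    "short_test.wav",   # placeholder for one of the test audios
    language="en",
    beam_size=1,
    temperature=0.0,    # a single value = no fallback ladder
    fp16=False,
)
for seg in result["segments"]:
    print(f"[{seg['start']:7.2f} -> {seg['end']:7.2f}]{seg['text']}")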

@despairTK (Author)

--language English --model large-v3 --beam_size 1 --temperature_increment_on_fallback None --fp16 False

Sorry, I don't know some programming terms and sometimes I may not understand some sentences translated from the web page.

Can you check if these are the right output files?
test2.zip

@Purfview (Owner) commented Nov 9, 2023

The reference results are somewhat similar, somewhat not; something weird is going on at chunk ends/starts. Not sure if it's the model or faster-whisper's code. Needs more investigation.

@despairTK (Author)

OK, I have a request: is it possible to add a download option for large-v3-fp16 to Subtitle Edit before the official release of Subtitle Edit 4.02 goes live?

large-v3-fp16 is better than large-v3-int8 at selecting words for some sentences; at least that's what I saw when I tested Japanese audio.

@Purfview (Owner)

What compute type do you use?

@Purfview (Owner)

The reference results are somewhat similar, somewhat not...

False alarm: the discrepancies were caused by a bug; -prompt=None doesn't work. It will be fixed in the next release.

@despairTK (Author) commented Nov 11, 2023

What compute type do you use?

I choose a different --compute_type depending on the language I'm transcribing: for English I use the auto-selected int8_float16, and for Japanese and Korean I choose bfloat16. English on its own transcribes very well, with good punctuation and segmentation, but in my experience only bfloat16 gives Asian languages the same quality of sentence segmentation and punctuation as English. Sometimes I choose float16 or float32 instead, depending on how the transcription turns out in a given language; there are a lot of differences between languages.

Using --initial_prompt also helps somewhat, but after testing a lot of audio I've concluded that it causes some sentence splits and punctuation to be placed incorrectly, which leads to the wrong meaning in the final translation.

@Purfview (Owner) commented Nov 11, 2023

So far I'm not convinced that large-v3 is better than v2, but so far I've tested only English; btw, there shouldn't be much improvement for English according to OpenAI's tests. I see that it's a bit more accurate in some places, but v3 hits more fallbacks, it wants to repeat things a lot, it wants to hallucinate... I have a feeling that v3 is a flop [for English], same as v1... maybe it performs better in other languages.

I choose a different --compute_type depending on the language...

Choosing the compute type by language is not right; you could get that impression only from looking at some short samples. Extensive tests would show you that accuracy is almost the same across all the different types. Just don't bother with it.

Choose the fastest type for your hardware, that's it.

Disable the fallback when benchmarking different types; the "Transcription speed" printed at the end is an accurate benchmark of transcription.
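
For anyone doing this through the Python API instead of the console exe, a rough benchmarking sketch (faster-whisper; passing a single temperature disables the fallback ladder, and speed is audio seconds divided by wall-clock seconds; file names are placeholders):

import time
from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

start = time.perf_counter()
segments, info = model.transcribe("long_audio.mp3", temperature=0.0)  # no fallback
_ = [s.text for s in segments]  # segments is a generator; consume it to do the work
elapsed = time.perf_counter() - start

print(f"Transcription speed: {info.duration / elapsed:.2f} audio seconds/s")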

@Purfview (Owner)

Here are OpenAI's test results, v2 vs v3:

The large-v3 model shows improved performance over a wide variety of languages, and the plot below includes all languages where Whisper large-v3 performs lower than 60% error rate on Common Voice 15 and Fleurs, showing 10% to 20% reduction of errors compared to large-v2:

[Figure: language-breakdown plot of error rates by language, large-v3 vs large-v2, on Common Voice 15 and Fleurs]

@gerrywastaken commented Nov 11, 2023

According to that, Cantonese has gone from a 30% error rate (which I'm told is pretty much unusable) to 10%, putting it close to English performance.

@despairTK (Author) commented Nov 11, 2023

--compute_type

Actually, this involves another issue: no matter whether the audio is transcribed in Japanese, English, Korean or any other language, many people eventually have to translate those subtitles into their native language.

I use DeepL Pro, ChatGPT 3.5 and Google for translation. Machine translation differs from human translation in that it can only translate sentence by sentence, not in context. So I use a different --compute_type for different languages because the sentence and punctuation segmentation comes out more accurate and complete.

For example, int8_float16 works when transcribing English, but if I use it to transcribe Japanese, the Japanese sentences in the result are unpunctuated and very fragmented. When I use bfloat16 to transcribe Japanese, the resulting sentences are well separated and punctuated.

The point is to let the machine translator better understand what each subtitle line means. After all, every language has homophones; if the sentences in the transcription are too fragmented, the machine translator will render those homophones with the wrong meaning, and bad sentence breaks hurt the machine translation as well.

Although filling --initial_prompt with some example sentences can produce seemingly normal sentence division and punctuation, it still affects the segmentation: some sentences get forcibly broken with a punctuation mark inserted in the middle, which again changes the meaning in machine translation.

As a simple example, the second one has one more punctuation mark than the first:
[Netizens offered their blessings one after another, wishing this great father a speedy recovery and a happy life with his family.]
[Netizens offered their blessings one after another, wishing this great father. a speedy recovery and a happy life with his family.]

On the subject of large-v3: at present, large-v2 is already very good, especially running under faster-whisper, and your whisper-standalone-win project makes faster-whisper much simpler and easier to use. The disadvantage of large-v2 is that it can't pick the right homophones. That's forgivable when translating English subtitles, but not for other languages. For example, the Japanese words 髪 and 紙 are written differently but pronounced exactly the same; one means hair and the other means paper. (You can listen to them on Google Translate...) Of course, it's also a matter of the amount of training data. large-v3 does select some homophones better than large-v2, and if it didn't have the hallucination and timestamp problems that plague its current transcriptions so badly, it would be a perfect model upgrade.

@despairTK (Author)

By the way, you are right about one thing: large-v3's transcription turns out to be very close to large-v1's. This is confirmed in my multilingual transcription tests.

@Purfview (Owner) commented Nov 11, 2023

--compute_type for different languages because the sentence and punctuation segmentation comes out more accurate and complete.

I don't think "punctuation" somehow relates to a particular --compute_type; I think that's just a random subjective observation. On some other audio you could probably see the opposite effect on "punctuation".

When I use bfloat16 to transcribe Japanese, the resulting sentences are well separated and punctuated

How much data did you test to reach this conclusion?

@Purfview (Owner) commented Nov 11, 2023

I think if you want a small improvement, you could increase beam_size to 8 or 10, and do the same for best_of.
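
In the Python API that would look roughly like this (a sketch; note that best_of only comes into play once a fallback raises the temperature above 0; the file name is a placeholder):

from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")
segments, info = model.transcribe(
    "audio.mp3",     # placeholder
    beam_size=10,    # wider beam search at temperature 0
    best_of=10,      # candidates sampled when temperature > 0 (fallback)
)
for s in segments:
    print(s.text)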

@despairTK (Author)

I've tested close to 300 hours of varied audio so far: speakers with orderly pauses, speakers with intermittent pauses, speakers with very short pauses from the beginning to the end of a sentence, and so on. I also tried the same audio in different formats, e.g. WAV, MP3 and AAC.

I have tried both beam_size and best_of; the improvement is very small, and most of the homophones stay wrong.

The most interesting thing is large-v1's performance: in my tests, where large-v2's transcription has wrong homophones, large-v1 gets them right, and where large-v1 makes a homophone mistake, large-v2 gets it right. Sometimes I wonder if there is a way to make large-v1 and large-v2 complement each other's deficiencies, or to load both at the same time for transcription, so the accuracy would be much higher.

@despairTK (Author)

I don't know how to code, but I'd love to know if there's a way for large-v2 and large-v1 to transcribe an audio file at the same time, taking the best sentence based on plausibility. Or some other way of having both models transcribe the audio simultaneously to get a better result...

This is just a personal guess... If it's not right, please take it as a joke :)

@Purfview (Owner)

I don't know how to code, but I'd love to know if there's a way for large-v2 and large-v1 to transcribe an audio file at the same time, taking the best sentence based on plausibility. Or some other way of having both models transcribe the audio simultaneously to get a better result...

This is just a personal guess... If it's not right, please take it as a joke :)

Then we'd need a third model that can evaluate that "plausibility". If there were such a model, we would use IT to transcribe; we wouldn't need "v1" and "v2". 😉
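
Purely to illustrate the idea being floated here (not a real feature of either project), one could compare the two models' per-segment avg_logprob and keep whichever scores higher. avg_logprob is at best a crude plausibility proxy, and segments from two models don't align one-to-one in general, so pairing them by index as below is a naive assumption:

from faster_whisper import WhisperModel

def run(model_name, path):
    model = WhisperModel(model_name, device="cuda", compute_type="float16")
    segments, _info = model.transcribe(path)
    return list(segments)  # consume the generator

v1 = run("large-v1", "audio.mp3")  # loading two large models needs plenty of VRAM
v2 = run("large-v2", "audio.mp3")

for s1, s2 in zip(v1, v2):
    # Keep the segment whose average token log-probability is higher.
    best = s1 if s1.avg_logprob >= s2.avg_logprob else s2
    print(f"[{best.start:.2f} -> {best.end:.2f}]{best.text}")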

@despairTK (Author)

I will continue to wait for a more stable version and also look forward to faster-whisper's updates, which I can only help test. Please don't hesitate to let me know if you have any requests for testing!

@Purfview (Owner)

Please don't hesitate to let me know if you have any requests for testing!

Hmm, maybe I would be interested in some tests, but not related to large-v3.
What is your GPU and VRAM size?

@Purfview (Owner)

I would like to see benchmarks of all 3 versions of cuBLAS and cuDNN libs.

Test in a console, not in SE.
With "--compute_type=float16".
Post "Transcription speed", not "Operation" time.

@despairTK (Author) commented Nov 12, 2023

Should I use one of the two audio files from before?

@Purfview (Owner)

Just use audio long enough that the test runs for at least ~3-5 minutes.

@despairTK (Author)

Audio duration: 6 minutes 38 seconds
--language=Japanese --model=large-v2 --compute_type=float16 --verbose true
V1: Transcription speed: 11.38 audio seconds/s
V2: Transcription speed: 13.57 audio seconds/s
V3: Transcription speed: 13.43 audio seconds/s

@Purfview (Owner)

I didn't mean "Audio Duration", I meant "test duration"; looking at those speeds, you should use ~50 minutes of audio.

Looks like V1's speed is much worse; I will delete it from the repo after SE updates to the new version.

@despairTK (Author)

6 minutes and 38 seconds is the audio duration I used... not the [Transcription speed].

V1: [screenshot]

V2: [screenshot]

V3: [screenshot]

@Purfview (Owner)

Never mind that previous test... [probably Google Translate issues 😉]

@despairTK (Author)

Test results for 1 hour, 15 minutes and 42 seconds of audio.
--language=Japanese --model=large-v2 --compute_type=float16 --verbose true

V1: [screenshot]

V2: [screenshot]

V3: [screenshot]

@Purfview (Owner)

Thanks for the retest; now we can see that V3 is a bit faster. Longer tests are more accurate.

@Purfview (Owner)

I'm interested in testing the speeds of different compute_types.

Post only "Transcription speed"; no need for images.
Use v3 libs for all tests.
Use that "1 hour, 15 minutes and 42 seconds" audio.

Benchmarks with these settings:

--model=large-v2 --temperature_increment_on_fallback=None --compute_type=int8_float16
--model=large-v2 --temperature_increment_on_fallback=None --compute_type=int8_float32
--model=large-v2 --temperature_increment_on_fallback=None --compute_type=int8_bfloat16
--model=large-v2 --temperature_increment_on_fallback=None --compute_type=int16
--model=large-v2 --temperature_increment_on_fallback=None --compute_type=float16
--model=large-v2 --temperature_increment_on_fallback=None --compute_type=float32
--model=large-v2 --temperature_increment_on_fallback=None --compute_type=bfloat16
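
If it helps, the matrix can be automated with a small loop like this (a sketch; the executable name "whisper-faster.exe" and the audio path are assumptions, adjust them to your setup):

import subprocess

COMPUTE_TYPES = ["int8_float16", "int8_float32", "int8_bfloat16",
                 "int16", "float16", "float32", "bfloat16"]

for ct in COMPUTE_TYPES:
    cmd = ["whisper-faster.exe", "audio_75min.mp3",
           "--model=large-v2",
           "--temperature_increment_on_fallback=None",
           f"--compute_type={ct}"]
    result = subprocess.run(cmd, capture_output=True, text=True)
    # Keep only the benchmark line from the output.
    for line in (result.stdout + result.stderr).splitlines():
        if "Transcription speed" in line:
            print(f"{ct}: {line.strip()}")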

@despairTK (Author)

int8_float16: Transcription speed: 25.48 audio seconds/s
int8_float32: Transcription speed: 24.0 audio seconds/s
int8_bfloat16: Transcription speed: 26.9 audio seconds/s
float16: Transcription speed: 27.46 audio seconds/s
float32: Transcription speed: 19.87 audio seconds/s
bfloat16: Transcription speed: 27.99 audio seconds/s
int8: Transcription speed: 25.99 audio seconds/s

int16 reports an error:

Traceback (most recent call last):
  File "D:\whisper-fast\__main__.py", line 687, in <module>
  File "D:\whisper-fast\__main__.py", line 562, in cli
  File "faster_whisper\transcribe.py", line 134, in __init__
ValueError: Requested int16 compute type, but the target device or backend do not support efficient int16 computation.
[27364] Failed to execute script '__main__' due to unhandled exception!
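
That error is expected when the device has no efficient int16 kernels. If you want to know in advance which types will work, ctranslate2 (the library underneath faster-whisper) can be queried directly; a sketch, assuming the ctranslate2 Python package is importable:

import ctranslate2

# Returns the set of compute types the device supports efficiently,
# e.g. {'float32', 'float16', 'bfloat16', 'int8', 'int8_float16', ...}
print(ctranslate2.get_supported_compute_types("cuda"))
print(ctranslate2.get_supported_compute_types("cpu"))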

@Purfview (Owner) commented Nov 12, 2023

Thanks for these GeForce RTX 4090 tests, here I sorted results by speed for readability:

bfloat16:     Transcription speed: 27.99 audio seconds/s
float16:      Transcription speed: 27.46 audio seconds/s
int8_bfloat16:Transcription speed: 26.90 audio seconds/s
int8_float16: Transcription speed: 25.48 audio seconds/s
int8_float32: Transcription speed: 24.00 audio seconds/s
float32:      Transcription speed: 19.87 audio seconds/s
int16:         Not supported.

int8: Transcription speed: 25.99 audio seconds/s

No need to test int8; it resolves to one of the 3 int8 types, AKA int8_xxxxx. On your GPU it resolved to int8_float16.
You can see the actual quantization type in use with --verbose; for example, at the top you'll see something like this:

[2023-11-12 14:52:41.312] [ctranslate2] [thread 4848] [info]  - Selected compute type: int8_float32
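
The same line can be surfaced from the Python API by raising ctranslate2's log level before loading the model (a sketch; ctranslate2.set_log_level accepts standard logging levels, as far as I know):

import logging
import ctranslate2
from faster_whisper import WhisperModel

ctranslate2.set_log_level(logging.INFO)  # ctranslate2 info logs go to stderr

# With compute_type="int8", stderr should now show which
# int8_xxxxx type "int8" actually resolved to on this device.
model = WhisperModel("large-v2", device="cuda", compute_type="int8")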

@Purfview (Owner)

Could you do the same tests but with the --device=cpu parameter added?
Use shorter audio, maybe that "6 minutes and 38 seconds" audio.

@despairTK (Author)

Tomorrow; because of the time difference it's late at night here, and I'm going to bed soon.

@despairTK (Author)

CPU: 13700K
--language=Japanese --model=large-v2 --compute_type=X --device=cpu
Test sample: 6 minutes and 38 seconds of audio

bfloat16: Not supported
float16: Not supported
int8_bfloat16: Not supported
int8_float16: Not supported
int8_float32: Transcription speed: 0.8 audio seconds/s
float32: Transcription speed: 0.55 audio seconds/s
int16: Transcription speed: 0.56 audio seconds/s

@Purfview (Owner)

Did you forget --temperature_increment_on_fallback=None?

@despairTK (Author)

I didn’t forget, I just forgot to copy it when I was replying to you.

To be honest, I don't recommend using a CPU for transcription... My 13700K took close to 12 minutes to transcribe this 6-minute audio sample.

@Purfview (Owner)

OK, I have a request: is it possible to add a download option for large-v3-fp16 to Subtitle Edit before the official release of Subtitle Edit 4.02 goes live?

Switched to the fp16 model by default in r160.5.

@Purfview (Owner)

Closing this.
Post feedback directly to the producer of the large-v3 model. 😛

@Purfview (Owner)

@despairTK You can check if new fixed config for v3 improves anything: https://we.tl/t-13kKOAs9Vi

@despairTK (Author)

@despairTK You can check if new fixed config for v3 improves anything: https://we.tl/t-13kKOAs9Vi

I tested ten audio samples, including Japanese and English. The long audio samples were about 30 minutes and the short ones were less than 10 minutes.

The result is that there is no improvement: on short audio samples, the results of both configs are exactly the same. The results for long audio files are slightly different, with only small changes in timestamps and sentence segmentation, but no change in overall accuracy. Errors in repeated sentences, timestamps and punctuation still exist.

In addition, when I woke up this morning I saw two configuration changes at https://huggingface.co/openai/whisper-large-v3. I wonder if they will be helpful to you:
https://huggingface.co/openai/whisper-large-v3/discussions/24
https://huggingface.co/openai/whisper-large-v3/discussions/23

@Purfview (Owner)

The result is that there is no improvement

Same here, no difference.

@despairTK (Author) commented Nov 25, 2023

Hi! I see faster-whisper has finally been updated:
https://github.com/SYSTRAN/faster-whisper/releases/tag/0.10.0
They also uploaded a new V3 model:
https://huggingface.co/Systran/faster-whisper-large-v3

If you update here as well, please don't forget to @ me. I'd love to test whether it solves the previous problems.

@Purfview (Owner) commented Nov 25, 2023

Nothing new there; we've already had v3 for almost a month.

@despairTK (Author)

All right... I thought something new had been updated...
