
Some problems with large-v3 #100

Closed
despairTK opened this issue Nov 9, 2023 · 48 comments

@despairTK commented Nov 9, 2023

I'm glad you're quick to support large-v3. Your work is outstanding!

I had some problems testing large-v3-int8 and large-v3-fp16.

I have transcribed dozens of videos in English, Japanese, Korean and Russian, ranging in length from under 10 minutes to over an hour. The transcription settings were unchanged except for the model selection. The following problems occur when transcribing these videos:

  1. During transcription, some sentences get repeated, far more often than with large-v2, where repeated sentences are very, very rare.
  2. Timestamp accuracy is terrible compared to large-v2, and some timestamps are even misplaced.
  3. The [Compression ratio threshold is not met with temperature 0.0] message appears frequently; it is hardly ever seen with large-v2.
despairTK changed the title from "Error with large-v3 model." to "Some problems with large-v3" on Nov 9, 2023
@despairTK (Author)

I ran some more tests, this time focusing on adjusting some parameters to see if I could mitigate the problems described above with the large-v3 model. The results are still not good: short audio is acceptable, but long audio is still worse than with the large-v2 model.

@Purfview (Owner) commented Nov 9, 2023

Can you run the reference Whisper large-v3? How are the results there?

@despairTK (Author)

Reference Whisper's large-v3 runs even worse. I tried large-v3 on the first day it came out, and even Japanese audio under ten minutes long repeated sentences.

I also saw a test of the conversion in whisper.cpp yesterday, and it had the same problem. For example: ggerganov/whisper.cpp#1444 and openai/whisper#1783.
People have also asked me questions there about repeated sentences and bad timestamps.

@despairTK (Author)

By the way, there are some problems with punctuation as well. Even using --initial_prompt doesn't correct these issues...

@Purfview (Owner) commented Nov 9, 2023

I did only a few quick English tests; some places were better, some worse, but I too noticed some weird repetitions for no reason...

I wonder if the model was converted correctly. I can't run large in reference Whisper; could you do some tests for me?

Test these short audios in reference Whisper's console exe with the command below [use ffmpeg v5, not v6]:
--language English --model large-v3 --beam_size 1 --temperature_increment_on_fallback None --fp16 False

And upload all output files somewhere.
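
For context, a minimal Python equivalent of that console command (a sketch assuming the openai-whisper package; "short_test.wav" is a placeholder file name):

import whisper

# Reference Whisper, mirroring:
# --language English --model large-v3 --beam_size 1
# --temperature_increment_on_fallback None --fp16 False
model = whisper.load_model("large-v3")
result = model.transcribe(
    "short_test.wav",   # placeholder for one of the test audios
    language="en",
    beam_size=1,
    temperature=0.0,    # a single value = no fallback ladder
    fp16=False,
)
for seg in result["segments"]:
    print(f"[{seg['start']:7.2f} -> {seg['end']:7.2f}]{seg['text']}")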

@despairTK (Author)

--language English --model large-v3 --beam_size 1 --temperature_increment_on_fallback None --fp16 False

Sorry, I don't know some programming terms and sometimes I may not understand some sentences translated from the web page.

Can you check if these are the right output files?
test2.zip

@Purfview (Owner) commented Nov 9, 2023

The reference results are somewhat similar, somewhat not; something weird is going on at chunk ends/starts. Not sure if it's the model or faster-whisper's code. Needs more investigation.

@despairTK (Author)

OK, I have a request: is it possible to add a download option for large-v3-fp16 to Subtitle Edit before the official release of Subtitle Edit 4.02 goes live?

large-v3-fp16 is better than large-v3-int8 at selecting words for some sentences; at least that's what I saw when I tested Japanese audio.

@Purfview (Owner)

What compute type do you use?

@Purfview (Owner)

The reference results are somewhat similar, somewhat not...

False alarm: the discrepancies were caused by a bug; -prompt=None doesn't work. It will be fixed in the next release.

@despairTK (Author) commented Nov 11, 2023

What compute type do you use?

I choose a different --compute_type depending on the language I'm transcribing: for English I use the auto-selected int8_float16, and for Japanese and Korean I choose bfloat16. English on its own transcribes very well, with good punctuation and segmentation, but in my experience only bfloat16 gives Asian languages the same quality of sentence segmentation and punctuation as English. Sometimes I choose float16 or float32 instead, depending on how the transcription turns out in a given language; there are a lot of differences between languages.

Using --initial_prompt also helps somewhat, but after testing a lot of audio I've concluded that it causes some sentence splits and punctuation to be placed incorrectly, which leads to the wrong meaning in the final translation.

@Purfview (Owner) commented Nov 11, 2023

So far I'm not convinced that large-v3 is better than v2, but so far I've tested only English; btw, there shouldn't be much improvement for English according to OpenAI's tests. I see that it's a bit more accurate in some places, but v3 hits more fallbacks, it wants to repeat things a lot, it wants to hallucinate... I have a feeling that v3 is a flop [for English], same as v1... maybe it performs better in other languages.

I choose a different --compute_type depending on the language...

Choosing the compute type by language is not right; you could get that impression only from looking at some short samples. Extensive tests would show you that accuracy is almost the same across all the different types. Just don't bother with it.

Choose the fastest type for your hardware, that's it.

Disable the fallback when benchmarking different types; the "Transcription speed" printed at the end is an accurate benchmark of transcription.
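
For anyone doing this through the Python API instead of the console exe, a rough benchmarking sketch (faster-whisper; passing a single temperature disables the fallback ladder, and speed is audio seconds divided by wall-clock seconds; file names are placeholders):

import time
from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

start = time.perf_counter()
segments, info = model.transcribe("long_audio.mp3", temperature=0.0)  # no fallback
_ = [s.text for s in segments]  # segments is a generator; consume it to do the work
elapsed = time.perf_counter() - start

print(f"Transcription speed: {info.duration / elapsed:.2f} audio seconds/s")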

@Purfview (Owner)

Here are OpenAI's test results, v2 vs v3:

The large-v3 model shows improved performance over a wide variety of languages, and the plot below includes all languages where Whisper large-v3 performs lower than 60% error rate on Common Voice 15 and Fleurs, showing 10% to 20% reduction of errors compared to large-v2:

[Figure: language-breakdown plot of error rates by language, large-v3 vs large-v2, on Common Voice 15 and Fleurs]

@gerrywastaken commented Nov 11, 2023

According to that, Cantonese has gone from a 30% error rate (which I'm told is pretty much unusable) to 10%, putting it close to English performance.

@despairTK (Author) commented Nov 11, 2023

--compute_type

Actually, this involves another issue: no matter whether the audio is transcribed in Japanese, English, Korean or any other language, many people eventually have to translate those subtitles into their native language.

I use DeepL Pro, ChatGPT 3.5 and Google for translation. Machine translation differs from human translation in that it can only translate sentence by sentence, not in context. So I use a different --compute_type for different languages because the sentence and punctuation segmentation comes out more accurate and complete.

For example, int8_float16 works when transcribing English, but if I use it to transcribe Japanese, the Japanese sentences in the result are unpunctuated and very fragmented. When I use bfloat16 to transcribe Japanese, the resulting sentences are well separated and punctuated.

The point is to let the machine translator better understand what each subtitle line means. After all, every language has homophones; if the sentences in the transcription are too fragmented, the machine translator will render those homophones with the wrong meaning, and bad sentence breaks hurt the machine translation as well.

Although filling --initial_prompt with some example sentences can produce seemingly normal sentence division and punctuation, it still affects the segmentation: some sentences get forcibly broken with a punctuation mark inserted in the middle, which again changes the meaning in machine translation.

As a simple example, the second one has one more punctuation mark than the first:
[Netizens offered their blessings one after another, wishing this great father a speedy recovery and a happy life with his family.]
[Netizens offered their blessings one after another, wishing this great father. a speedy recovery and a happy life with his family.]

On the subject of large-v3: at present, large-v2 is already very good, especially running under faster-whisper, and your whisper-standalone-win project makes faster-whisper much simpler and easier to use. The disadvantage of large-v2 is that it can't pick the right homophones. That's forgivable when translating English subtitles, but not for other languages. For example, the Japanese words 髪 and 紙 are written differently but pronounced exactly the same; one means hair and the other means paper. (You can listen to them on Google Translate...) Of course, it's also a matter of the amount of training data. large-v3 does select some homophones better than large-v2, and if it didn't have the hallucination and timestamp problems that plague its current transcriptions so badly, it would be a perfect model upgrade.

@despairTK (Author)

By the way, you are right about one thing: large-v3's transcription turns out to be very close to large-v1's. This is confirmed in my multilingual transcription tests.

@Purfview (Owner) commented Nov 11, 2023

--compute_type for different languages because the sentence and punctuation segmentation comes out more accurate and complete.

I don't think "punctuation" somehow relates to a particular --compute_type; I think that's just a random subjective observation. On some other audio you could probably see the opposite effect on "punctuation".

When I use bfloat16 to transcribe Japanese, the resulting sentences are well separated and punctuated

How much data did you test to reach this conclusion?

@Purfview (Owner) commented Nov 11, 2023

I think if you want a small improvement, you could increase beam_size to 8 or 10, and do the same for best_of.
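
In the Python API that would look roughly like this (a sketch; note that best_of only comes into play once a fallback raises the temperature above 0; the file name is a placeholder):

from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")
segments, info = model.transcribe(
    "audio.mp3",     # placeholder
    beam_size=10,    # wider beam search at temperature 0
    best_of=10,      # candidates sampled when temperature > 0 (fallback)
)
for s in segments:
    print(s.text)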

@despairTK (Author)

I've tested close to 300 hours of varied audio so far: speakers with orderly pauses, speakers with intermittent pauses, speakers with very short pauses from the beginning to the end of a sentence, and so on. I also tried the same audio in different formats, e.g. WAV, MP3 and AAC.

I have tried both beam_size and best_of; the improvement is very small, and most of the homophones stay wrong.

The most interesting thing is large-v1's performance: in my tests, where large-v2's transcription has wrong homophones, large-v1 gets them right, and where large-v1 makes a homophone mistake, large-v2 gets it right. Sometimes I wonder if there is a way to make large-v1 and large-v2 complement each other's deficiencies, or to load both at the same time for transcription, so the accuracy would be much higher.

@despairTK (Author)

I don't know how to code, but I'd love to know if there's a way for large-v2 and large-v1 to transcribe an audio file at the same time, taking the best sentence based on plausibility. Or some other way of having both models transcribe the audio simultaneously to get a better result...

This is just a personal guess... If it's not right, please take it as a joke :)

@Purfview (Owner)

I don't know how to code, but I'd love to know if there's a way for large-v2 and large-v1 to transcribe an audio file at the same time, taking the best sentence based on plausibility. Or some other way of having both models transcribe the audio simultaneously to get a better result...

This is just a personal guess... If it's not right, please take it as a joke :)

Then we'd need a third model that can evaluate that "plausibility". If there were such a model, we would use IT to transcribe; we wouldn't need "v1" and "v2". 😉
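
Purely to illustrate the idea being floated here (not a real feature of either project), one could compare the two models' per-segment avg_logprob and keep whichever scores higher. avg_logprob is at best a crude plausibility proxy, and segments from two models don't align one-to-one in general, so pairing them by index as below is a naive assumption:

from faster_whisper import WhisperModel

def run(model_name, path):
    model = WhisperModel(model_name, device="cuda", compute_type="float16")
    segments, _info = model.transcribe(path)
    return list(segments)  # consume the generator

v1 = run("large-v1", "audio.mp3")  # loading two large models needs plenty of VRAM
v2 = run("large-v2", "audio.mp3")

for s1, s2 in zip(v1, v2):
    # Keep the segment whose average token log-probability is higher.
    best = s1 if s1.avg_logprob >= s2.avg_logprob else s2
    print(f"[{best.start:.2f} -> {best.end:.2f}]{best.text}")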

@despairTK (Author)

I will continue to wait for a more stable version and also look forward to faster-whisper's updates, which I can only help test. Please don't hesitate to let me know if you have any requests for testing!

@Purfview (Owner)

Please don't hesitate to let me know if you have any requests for testing!

Hmm, maybe I would be interested in some tests, but not related to large-v3.
What is your GPU and VRAM size?

@Purfview (Owner)

I would like to see benchmarks of all 3 versions of cuBLAS and cuDNN libs.

Test in a console, not in SE.
With "--compute_type=float16".
Post "Transcription speed", not "Operation" time.

@despairTK (Author) commented Nov 12, 2023

Should I use one of the two audio files from before?

@Purfview (Owner)

Just use audio long enough that the test runs for at least ~3-5 minutes.

@despairTK (Author)

Audio duration: 6 minutes 38 seconds
--language=Japanese --model=large-v2 --compute_type=float16 --verbose true
V1: Transcription speed: 11.38 audio seconds/s
V2: Transcription speed: 13.57 audio seconds/s
V3: Transcription speed: 13.43 audio seconds/s

@Purfview (Owner)

I didn't mean "Audio Duration", I meant "test duration"; looking at those speeds, you should use ~50 minutes of audio.

Looks like V1's speed is much worse; I will delete it from the repo after SE updates to the new version.

@despairTK (Author)

6 minutes and 38 seconds is the audio duration I used... not the [Transcription speed].

V1: [screenshot]

V2: [screenshot]

V3: [screenshot]

@Purfview (Owner)

Never mind that previous test... [probably Google Translate issues 😉]

@despairTK (Author)

Test results for 1 hour, 15 minutes and 42 seconds of audio.
--language=Japanese --model=large-v2 --compute_type=float16 --verbose true

V1: [screenshot]

V2: [screenshot]

V3: [screenshot]

@Purfview (Owner)

Thanks for the retest; now we can see that V3 is a bit faster. Longer tests are more accurate.

@Purfview (Owner)

I'm interested in testing the speeds of different compute_types.

Post only "Transcription speed"; no need for images.
Use v3 libs for all tests.
Use that "1 hour, 15 minutes and 42 seconds" audio.

Benchmarks with these settings:

--model=large-v2 --temperature_increment_on_fallback=None --compute_type=int8_float16
--model=large-v2 --temperature_increment_on_fallback=None --compute_type=int8_float32
--model=large-v2 --temperature_increment_on_fallback=None --compute_type=int8_bfloat16
--model=large-v2 --temperature_increment_on_fallback=None --compute_type=int16
--model=large-v2 --temperature_increment_on_fallback=None --compute_type=float16
--model=large-v2 --temperature_increment_on_fallback=None --compute_type=float32
--model=large-v2 --temperature_increment_on_fallback=None --compute_type=bfloat16
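
If it helps, the matrix can be automated with a small loop like this (a sketch; the executable name "whisper-faster.exe" and the audio path are assumptions, adjust them to your setup):

import subprocess

COMPUTE_TYPES = ["int8_float16", "int8_float32", "int8_bfloat16",
                 "int16", "float16", "float32", "bfloat16"]

for ct in COMPUTE_TYPES:
    cmd = ["whisper-faster.exe", "audio_75min.mp3",
           "--model=large-v2",
           "--temperature_increment_on_fallback=None",
           f"--compute_type={ct}"]
    result = subprocess.run(cmd, capture_output=True, text=True)
    # Keep only the benchmark line from the output.
    for line in (result.stdout + result.stderr).splitlines():
        if "Transcription speed" in line:
            print(f"{ct}: {line.strip()}")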

@despairTK (Author)

int8_float16: Transcription speed: 25.48 audio seconds/s
int8_float32: Transcription speed: 24.0 audio seconds/s
int8_bfloat16: Transcription speed: 26.9 audio seconds/s
float16: Transcription speed: 27.46 audio seconds/s
float32: Transcription speed: 19.87 audio seconds/s
bfloat16: Transcription speed: 27.99 audio seconds/s
int8: Transcription speed: 25.99 audio seconds/s

int16 reports an error:

Traceback (most recent call last):
  File "D:\whisper-fast\__main__.py", line 687, in <module>
  File "D:\whisper-fast\__main__.py", line 562, in cli
  File "faster_whisper\transcribe.py", line 134, in __init__
ValueError: Requested int16 compute type, but the target device or backend do not support efficient int16 computation.
[27364] Failed to execute script '__main__' due to unhandled exception!
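
That error is expected when the device has no efficient int16 kernels. If you want to know in advance which types will work, ctranslate2 (the library underneath faster-whisper) can be queried directly; a sketch, assuming the ctranslate2 Python package is importable:

import ctranslate2

# Returns the set of compute types the device supports efficiently,
# e.g. {'float32', 'float16', 'bfloat16', 'int8', 'int8_float16', ...}
print(ctranslate2.get_supported_compute_types("cuda"))
print(ctranslate2.get_supported_compute_types("cpu"))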

@Purfview (Owner) commented Nov 12, 2023

Thanks for these GeForce RTX 4090 tests, here I sorted results by speed for readability:

bfloat16:     Transcription speed: 27.99 audio seconds/s
float16:      Transcription speed: 27.46 audio seconds/s
int8_bfloat16:Transcription speed: 26.90 audio seconds/s
int8_float16: Transcription speed: 25.48 audio seconds/s
int8_float32: Transcription speed: 24.00 audio seconds/s
float32:      Transcription speed: 19.87 audio seconds/s
int16:         Not supported.

int8: Transcription speed: 25.99 audio seconds/s

No need to test int8; it resolves to one of the 3 int8 types, AKA int8_xxxxx. On your GPU it resolved to int8_float16.
You can see the actual quantization type in use with --verbose; for example, at the top you'll see something like this:

[2023-11-12 14:52:41.312] [ctranslate2] [thread 4848] [info]  - Selected compute type: int8_float32
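
The same line can be surfaced from the Python API by raising ctranslate2's log level before loading the model (a sketch; ctranslate2.set_log_level accepts standard logging levels, as far as I know):

import logging
import ctranslate2
from faster_whisper import WhisperModel

ctranslate2.set_log_level(logging.INFO)  # ctranslate2 info logs go to stderr

# With compute_type="int8", stderr should now show which
# int8_xxxxx type "int8" actually resolved to on this device.
model = WhisperModel("large-v2", device="cuda", compute_type="int8")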

@Purfview (Owner)

Could you do the same tests but with the --device=cpu parameter added?
Use shorter audio, maybe that "6 minutes and 38 seconds" audio.

@despairTK (Author)

Tomorrow; because of the time difference it's late at night here, and I'm going to bed soon.

@despairTK (Author)

CPU: 13700K
--language=Japanese --model=large-v2 --compute_type=X --device=cpu
Test sample: 6 minutes and 38 seconds of audio

bfloat16: Not supported
float16: Not supported
int8_bfloat16: Not supported
int8_float16: Not supported
int8_float32: Transcription speed: 0.8 audio seconds/s
float32: Transcription speed: 0.55 audio seconds/s
int16: Transcription speed: 0.56 audio seconds/s

@Purfview (Owner)

Did you forget --temperature_increment_on_fallback=None?

@despairTK (Author)

I didn’t forget, I just forgot to copy it when I was replying to you.

To be honest, I don't recommend using a CPU for transcription... My 13700K took close to 12 minutes to transcribe this 6-minute audio sample.

@Purfview (Owner)

OK, I have a request: is it possible to add a download option for large-v3-fp16 to Subtitle Edit before the official release of Subtitle Edit 4.02 goes live?

Switched to the fp16 model by default in r160.5.

@Purfview (Owner)

Closing this.
Post feedback directly to the producer of the large-v3 model. 😛

@Purfview (Owner)

@despairTK You can check if new fixed config for v3 improves anything: https://we.tl/t-13kKOAs9Vi

@despairTK (Author)

@despairTK You can check if new fixed config for v3 improves anything: https://we.tl/t-13kKOAs9Vi

I tested ten audio samples, including Japanese and English. The long audio samples were about 30 minutes and the short ones were less than 10 minutes.

The result is that there is no improvement: on short audio samples, the results of both configs are exactly the same. The results for long audio files are slightly different, with only small changes in timestamps and sentence segmentation, but no change in overall accuracy. Errors in repeated sentences, timestamps and punctuation still exist.

In addition, when I woke up this morning I saw two configuration changes at https://huggingface.co/openai/whisper-large-v3. I wonder if they will be helpful to you:
https://huggingface.co/openai/whisper-large-v3/discussions/24
https://huggingface.co/openai/whisper-large-v3/discussions/23

@Purfview (Owner)

The result is that there is no improvement

Same here, no difference.

@despairTK (Author) commented Nov 25, 2023

Hi! I see faster-whisper has finally been updated:
https://github.com/SYSTRAN/faster-whisper/releases/tag/0.10.0
They also uploaded a new V3 model:
https://huggingface.co/Systran/faster-whisper-large-v3

If you update here as well, please don't forget to @ me. I'd love to test whether it solves the previous problems.

@Purfview (Owner) commented Nov 25, 2023

Nothing new there; we've already had v3 for almost a month.

@despairTK (Author)

All right... I thought something new had been updated...
