Best Open Source Subtitle Generator? Canary Qwen 2.5B + Whisper Full Guide #369
FurkanGozukara announced in Tutorials
Full tutorial: https://www.youtube.com/watch?v=4lAk6sf1qF8
NVIDIA NeMo Canary-Qwen-2.5B is an English speech recognition model that achieves state-of-the-art performance on multiple English speech benchmarks. The Canary model is the new king that dethroned the famous Whisper.
Full tutorial for the Whisper TTS Premium speech-to-text app by SECourses with new NVIDIA Canary Qwen 2.5B support. In this video, I demo local subtitle generation, compare Canary Qwen 2.5B against Whisper Large V3, show output formats, batch processing, presets, YouTube URL and live microphone options, then install the app from scratch on Windows.
You will also see RunPod and Massed Compute notes, first-run model download, RTX 5000/CUDA 13 driver requirements, subprocess mode for preventing VRAM/RAM leaks, and when to use Whisper instead of Canary.
Links:
Download App and the source post: [ https://www.patreon.com/posts/whisper-webui-to-145395299 ]
Discord: [ https://discord.com/channels/772774097734074388/1079506787734134844 ]
Patreon app index: [ https://github.com/FurkanGozukara/Stable-Diffusion/blob/main/Patreon-Posts-Index.md ]
Related RunPod/Massed Compute setup tutorial: [ https://youtu.be/ZRrzvD4wNys ]
In my tutorial-video tests, Canary Qwen 2.5B achieved a 5.91% global WER and transcribed up to 46x faster than real time, making it my new recommended default for English speech-to-text. Whisper remains useful when you need broader spoken-language support or word-level timestamps.
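For readers unfamiliar with the metric, here is a minimal sketch of how a word error rate can be computed. It assumes that "global" WER means total word errors divided by total reference words pooled across all test files; the post does not state the exact method, so treat the pooling and the simple lowercase/whitespace normalization as assumptions. The function names are illustrative and not part of the app.

```python
from typing import Iterable, List, Tuple


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Minimum number of word substitutions, deletions, and insertions."""
    # prev[j] holds the distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing in hypothesis)
                curr[j - 1] + 1,     # insertion (extra word in hypothesis)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1]


def global_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Pooled WER over (reference, hypothesis) transcript pairs (assumed definition)."""
    total_errors = 0
    total_words = 0
    for reference, hypothesis in pairs:
        ref_words = reference.lower().split()
        hyp_words = hypothesis.lower().split()
        total_errors += word_edit_distance(ref_words, hyp_words)
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("reference transcripts contain no words")
    return total_errors / total_words


if __name__ == "__main__":
    tests = [
        ("the cat sat on the mat", "the cat sat on a mat"),  # 1 error in 6 words
        ("hello world", "hello world"),                       # 0 errors in 2 words
    ]
    print(f"{global_wer(tests):.2%}")
```

Pooling weights long recordings more heavily than short ones, which is one reason a global figure can differ from the average of per-file scores.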
Chapters:
00:00:00 Intro to the local open-source speech-to-text app and new Canary support
00:00:20 Quick demo setup with NVIDIA Canary Qwen 2.5B
00:00:33 Maximum-quality defaults and starting subtitle generation
00:00:48 Live transcription speed and accuracy preview
00:01:03 Chunk length settings for smaller or larger subtitle segments
00:01:18 Fast generation, supported exports, and restarting with all formats
00:01:31 Multiple subtitle file formats explained
00:01:47 Batch processing folders, output paths, subfolders, and overwrite mode
00:01:58 YouTube URLs, microphone/live transcription, translation, and BGM separation
00:02:09 Saving presets and using advanced parameters
00:02:24 Auto-optimized defaults for Whisper and Canary models
00:02:39 Canary Qwen 2.5B vs Whisper Large V3 comparison begins
00:02:54 Real-world WER benchmark and 5.91% Canary result
00:03:10 Why non-native English speech is harder to transcribe accurately
00:03:24 Canary speed advantage and 46x real-time transcription explained
00:03:43 Test averages across long and short tutorial videos
00:03:59 Cases where Whisper slightly wins and final Canary recommendation
00:04:14 Opening the output folder after transcription completes
00:04:27 VTT output, matching filenames, capitalization, and punctuation
00:04:44 Accuracy examples inside the generated transcript
00:04:58 TXT, TSV, SRT, LRC exports and word-level timestamp note
00:05:20 Download page, latest ZIP, and installation overview
00:05:31 Windows requirements: Python 3.11, Git, CUDA, and C++ notes
00:05:51 Choosing install location and keeping the app isolated in venv
00:06:04 Extracting the ZIP and running Windows install/update BAT
00:06:23 Automatic model downloads on first run
00:06:34 RunPod, Massed Compute, and Linux installation files
00:06:50 Where to learn RunPod and Massed Compute setup in the related guide
00:07:29 UV-powered Windows installation completes quickly
00:07:41 Starting the app with Windows start app BAT
00:07:58 Selecting video/audio input and generating subtitles on a fresh install
00:08:10 First-run Canary model download and 5GB model size
00:08:35 Easy setup goal and automatic fresh-install workflow
00:08:53 Discord, Patreon index, and 100+ SECourses applications
00:09:13 RTX 5000 support and updated NVIDIA driver requirement
00:09:35 Fresh-install transcription starts successfully
00:09:47 Automatic downloads for Canary, Whisper, diarization, and extra tools
00:10:16 Canary becomes the new default model recommendation
00:10:36 Subprocess mode to prevent VRAM and RAM leaks
00:10:51 Why running transcription as a subprocess is recommended
00:11:04 Switching back to Whisper models when needed
00:11:20 Whisper language coverage vs Canary and audio/video support
00:11:42 Real recording benchmark: 27 minutes transcribed in about 2 minutes
00:11:56 Model loading overhead and clean RAM/VRAM release
00:12:08 Final notes, subscribe reminder, and downloading the full transcript ZIP
#CanaryQwen #Whisper #SpeechToText #SubtitleGenerator #OpenSourceAI #LocalAI #NVIDIA #RunPod #MassedCompute #SECourses
Video Transcription
00:00:00 Greetings everyone. Today I am going to show you a state-of-the-art, open-source,
00:00:06 locally runnable speech-to-text application. It is called the Whisper TTS Premium application
00:00:13 by SECourses. With today's update, we now support the Canary model as well:
00:00:20 NVIDIA Canary Qwen 2.5B, with 2.5 billion parameters. Let's make a quick demonstration. So this
00:00:26 is my latest tutorial video. I have dragged and dropped it. The default parameters are all set for maximum
00:00:34 quality. So generate subtitle, and let's watch live how it is being transcribed. It is starting.
00:00:43 Okay, and the transcription started quickly. It is really, really fast, and it is the best-quality,
00:00:50 best-accuracy speech-to-text model right now. So as we can see live, it is currently transcribing
00:00:59 my video fully automatically. Currently it is transcribing in 40-second chunks, but if
00:01:06 you want smaller chunks for better subtitles or for any other reason, you can go to advanced parameters
00:01:14 and change the chunk length to whichever value you like, such as 10, 20, 30, or 50 seconds. It is
00:01:21 up to you. And the file is about to be generated. So this application is amazing. It supports
00:01:27 so many features. For example, you can export multiple file formats, but we should have started that
00:01:34 way. Okay, let's cancel the generation and start again, because it is fast. And generate subtitle file. So
00:01:43 it will generate all these formats and save the output in each of them. It supports full batch processing, so
00:01:50 you can give an input folder path and an output folder path, and it will process all the video and audio files
00:01:55 inside that folder. You can also include subdirectories and overwrite existing files. It supports YouTube
00:02:01 URLs, microphone input, live transcription, some translation, although it is not very good,
00:02:06 and BGM separation. You can also save and load your presets from here. Fully saving and
00:02:13 loading presets. You can also set all of the advanced parameters, but these parameters
00:02:20 are automatically set to the best defaults for both the Whisper models and the Canary model. For example,
00:02:30 if I change the model from here, from Canary to Whisper, the advanced parameters will
00:02:36 change according to the model. Meanwhile, it is transcribing again, so let's see the
00:02:43 difference between NVIDIA Canary Qwen 2.5B and Whisper Large V3, which is the
00:02:51 latest version of the Whisper model. So currently Canary Qwen 2.5B has a 5.91 percent global word
00:03:02 error rate. I have tested these on my tutorial videos, so this is a real test. I am not a
00:03:08 native English speaker, therefore it is harder to accurately transcribe my voice.
00:03:16 However, it did an excellent job. You see the global word error rate of Whisper is much
00:03:22 higher than that of Canary. Moreover, the Canary speech-to-text speed is much faster. It is
00:03:29 46 times faster than real time, which means 46 minutes of speech is transcribed in 1 minute.
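The speed figure is just audio duration divided by processing time. A small sketch of that arithmetic follows; the function names are illustrative and not taken from the app.

```python
def speed_factor(audio_seconds: float, processing_seconds: float) -> float:
    """How many times faster than real time the transcription ran."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def estimated_processing_seconds(audio_seconds: float, factor: float) -> float:
    """Expected processing time for a given audio length at a given speed factor."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return audio_seconds / factor


if __name__ == "__main__":
    # 46 minutes of speech transcribed in 1 minute.
    print(speed_factor(46 * 60, 60))  # 46.0
    # A 1-hour video at the same speed would need roughly 78 seconds.
    print(round(estimated_processing_seconds(3600, 46)))  # 78
```

This counts only the transcription itself. If model loading is included in the measured time, the effective factor for a single file will be lower.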
00:03:39 You can see the values. So the Canary model is much better on every overall statistic compared to
00:03:47 Whisper. There are individual test results as well, as you can see. I ran tests of various lengths,
00:03:55 such as 1 hour, 5 minutes, 13 minutes, and 55 minutes, and I calculated the overall averages. In
00:04:02 a few cases, Whisper performed very slightly better, as you can see. So this new model is my
00:04:10 recommendation; use it as the default model. Moreover, it is also set as the default in our
00:04:18 application. So let's look at the results, because the transcription has completed. When you
00:04:24 click open outputs folder, it will open the outputs folder, and you will see that the transcription
00:04:31 file names are the same as your input file name. So we have the VTT format. You can see the VTT format. And as
00:04:39 you can see, it also has proper uppercase and lowercase characters. It also has
00:04:48 accurate punctuation. I mean, you see "installers zip file" here. This is amazing.
00:04:55 It also supports the TXT format, as you can see, the TSV format, our classic subtitle
00:05:03 format SRT, and the LRC format. So we support all of these formats. Currently the Canary model
00:05:11 doesn't have word-level timestamps, but Whisper has them if you need word-level timestamps.
00:05:19 So this application is the ultimate, very best application right now. How are you going to
00:05:24 install and use it? We have an excellent page as usual. The link will be in the description
00:05:29 of the video. Download the latest zip file. Before starting the installation, make sure that
00:05:34 you have read and followed the requirements. So the requirements are here, the Windows requirements.
00:05:41 We need Python 3.11. Moreover, you need Git. CUDA and C++ tools probably won't be necessary, since I
00:05:49 have pre-compiled all the libraries, and all of my applications install into a virtual environment.
00:05:56 So move the latest zip file to wherever you want to install. So let's install into our
00:06:02 Q drive T2. Do not have space characters in your installation path. Extract files here. Then all I need
00:06:10 to do is run the Windows install and update .bat file. It will fully automatically install everything
00:06:16 for me. It will generate a virtual environment and install all the libraries inside it,
00:06:21 so it is fully isolated. The models will be automatically downloaded into the models
00:06:26 folder the first time you run them. All of them will be fully automatically downloaded.
00:06:31 For RunPod, we have RunPod Simple Pod instructions. Please follow these files.
00:06:36 They have all the instructions, as you can see. We also support Massed Compute. Please follow
00:06:41 this. If you are a Linux user, use the Massed Compute install sh file. If you don't know how
00:06:47 to use RunPod or Massed Compute, our latest tutorial explains in detail how to use them.
00:06:55 So you can follow the ultimate guide to AI voice cloning tutorial. This tutorial has full chapters,
00:07:02 so you can go to the video description and find the Simple Pod RunPod part, if you don't know it,
00:07:09 or the Massed Compute part. For example, let's search for Massed Compute. So the Massed Compute part
00:07:15 starts here and the RunPod part starts here. So you can follow this tutorial to learn
00:07:20 how to install and use this application, or any of my applications, on RunPod and Massed Compute.
00:07:27 The installation on Windows is very fast, as you can see, because we are using
00:07:31 UV for installation, and it is already completed. Now all you need to do is double-click the
00:07:36 Windows start app .bat file, and it will start the application, which will be ready
00:07:43 to use right away. So the application started. Let's make a test and see how it automatically downloads
00:07:49 the model. So I have selected my video. This supports both video and audio files. Canary is
00:07:56 selected. Everything is default. You can also set any of the options that you want and generate
00:08:01 the subtitle file. This is a fresh installation, so first it needs to download the model. Okay,
00:08:07 as you can see, the model download started. So it will download the model, which is around 5 gigabytes
00:08:13 for this new model. It is downloading pretty fast; it depends on your internet connection. Once the
00:08:19 download has completed, it will start. I can see that it is using my entire internet bandwidth right
00:08:24 now. It is using 1 gigabit per second. So after this download we will see the
00:08:31 transcription start on a fresh installation. I am making all of my applications extremely easy
00:08:39 to install, set up, and use. You can find our Discord server through the link in the source post.
00:08:48 You see we have the Patreon exclusive post index, plus we have the SECourses Discord. When you click
00:08:55 this link, you will get to our Discord page. And on the index page you can see all of
00:09:02 our applications. We have more than 100 unique applications, and we are adding more and keeping
00:09:07 all of them up to date and working. All of my applications work on the RTX 5000 series as well;
00:09:13 just make sure that you have an updated NVIDIA driver, because CUDA 13 requires an updated NVIDIA driver.
00:09:20 For example, when I type nvidia-smi, I should see an updated driver like this, and I should
00:09:26 see CUDA version 13 or higher on this screen. Make sure that yours looks like this.
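The same check can be scripted. Below is a minimal sketch that reads the "Driver Version" and "CUDA Version" fields from the `nvidia-smi` banner and compares the CUDA major version against 13. The helper names and the sample banner line (including its version numbers) are illustrative assumptions, not output copied from the video.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple


def parse_nvidia_smi_banner(text: str) -> Tuple[str, str]:
    """Extract (driver_version, cuda_version) from nvidia-smi banner text."""
    driver = re.search(r"Driver Version:\s*([\d.]+)", text)
    cuda = re.search(r"CUDA Version:\s*([\d.]+)", text)
    if driver is None or cuda is None:
        raise ValueError("could not find driver/CUDA versions in nvidia-smi output")
    return driver.group(1), cuda.group(1)


def cuda_is_new_enough(cuda_version: str, required_major: int = 13) -> bool:
    """True when the reported CUDA major version is at least required_major."""
    return int(cuda_version.split(".")[0]) >= required_major


def check_local_driver(required_major: int = 13) -> Optional[bool]:
    """Run nvidia-smi if it is on PATH; return None when it is not installed."""
    if shutil.which("nvidia-smi") is None:
        return None
    output = subprocess.run(
        ["nvidia-smi"], capture_output=True, text=True, check=True
    ).stdout
    _, cuda_version = parse_nvidia_smi_banner(output)
    return cuda_is_new_enough(cuda_version, required_major)


if __name__ == "__main__":
    # Illustrative banner line; real version numbers will differ per machine.
    sample = "| NVIDIA-SMI 580.88    Driver Version: 580.88    CUDA Version: 13.0  |"
    print(parse_nvidia_smi_banner(sample))
    print(check_local_driver())
```

The CUDA version shown by `nvidia-smi` is the highest version the installed driver supports, which is why updating the driver is what changes it.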
00:09:35 Alright, so our transcription started on the fresh installation. As you can see,
00:09:40 it is working amazingly. This model has a lot of requirements, but I have made it so easy and
00:09:47 so smooth to install that you can start using it right away. The models are automatically downloaded,
00:09:53 both for Canary, the NVIDIA Canary Qwen 2.5B model, and for Whisper. Even the other models, like
00:10:03 diarization and background removal, or other features that we support, are all fully
00:10:11 automatically downloaded and set up. The transcription is almost completed. You can see the speed;
00:10:16 it is just amazingly fast. I like this model, so it is now my new default model, but
00:10:23 you can always use other models to test. If you encounter any issues, just message me on Patreon
00:10:30 or Discord, or leave a comment on YouTube. Check out all the features of the application.
00:10:36 One other thing is that we are starting the transcription as a subprocess; therefore, after the
00:10:43 transcription has completed, there will be absolutely zero VRAM or RAM leak.
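The reason a subprocess helps is that the operating system reclaims all of a process's memory, including its GPU allocations, when that process exits. The sketch below illustrates the general pattern only; it is not the app's actual code. The worker is a stand-in that allocates some memory and prints a placeholder result, and names such as `transcribe_in_subprocess` are hypothetical.

```python
import json
import subprocess
import sys

# Stand-in worker: a real one would load the model and transcribe the file here.
WORKER_CODE = """
import json, sys
media_path = sys.argv[1]
scratch = bytearray(50_000_000)  # memory that exists only inside the child process
print(json.dumps({"input": media_path, "text": "placeholder transcript"}))
"""


def transcribe_in_subprocess(media_path: str, timeout: float = 600.0) -> dict:
    """Run the worker in a separate Python process and return its JSON result."""
    completed = subprocess.run(
        [sys.executable, "-c", WORKER_CODE, media_path],
        capture_output=True,
        text=True,
        check=True,      # raise if the worker exits with an error
        timeout=timeout,
    )
    # When the child exits, everything it allocated is released by the OS,
    # so the parent (for example, the web UI) stays at its baseline usage.
    return json.loads(completed.stdout)


if __name__ == "__main__":
    print(transcribe_in_subprocess("example_video.mp4"))
```

The trade-off is that the model has to be loaded again for every run, since nothing stays cached in the parent process.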
00:10:51 So this is a very big advantage. If you don't want to run it as a subprocess, you can uncheck that box,
00:10:57 but I recommend running it as a subprocess because it is extremely effective. The transcription speed is
00:11:04 amazing with this new Canary model, but you can always switch to Whisper and use any of
00:11:12 these Whisper models. One advantage of Whisper is that it supports 100 different spoken languages.
00:11:20 However, if you are going to transcribe English audio, I recommend that you use Canary. Our
00:11:29 interface, as I have said, supports both audio and video files. It also shows a preview like this;
00:11:35 this is a feature that I have coded. So this application has been extensively improved by me. It has so
00:11:42 many new features that you won't find this application anywhere else. And you can see that while I was
00:11:47 recording a video, the transcription speed was 12 times real time. So 27 minutes took only about 2 minutes
00:11:56 to transcribe. Most of the time is spent loading the model into VRAM and starting the
00:12:02 transcription. We can also see that the memory is entirely cleared. The RAM, VRAM, everything is cleared.
00:12:08 So thank you so much for watching. Hopefully I will start making more videos. Please turn on
00:12:13 the bell and subscribe. You can also download the transcription from here. You see it is
00:12:18 downloaded as a zip file, and it includes all of the transcription files. This is very useful when
00:12:24 you are working with RunPod or Massed Compute. Thank you so much. Hopefully see you later.
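To make the relationship between chunk length and subtitle segments concrete, here is a minimal sketch that turns fixed-length chunk transcripts into SRT cues. It assumes each chunk becomes exactly one cue whose timing is derived from the chunk index; the app's real segmentation logic is not described in the video, so this is an illustration only, and the function names are hypothetical.

```python
from typing import List


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def chunks_to_srt(texts: List[str], chunk_length: float, total_duration: float) -> str:
    """Build SRT text with one cue per chunk (assumed one-to-one mapping)."""
    cues = []
    for index, text in enumerate(texts):
        start = index * chunk_length
        # The last chunk may be shorter than chunk_length.
        end = min(start + chunk_length, total_duration)
        cues.append(
            f"{index + 1}\n"
            f"{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
            f"{text.strip()}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(cues)


if __name__ == "__main__":
    print(chunks_to_srt(["Greetings everyone.", "Let's make a demo."], 40, 65))
```

With this mapping, halving the chunk length from 40 to 20 seconds doubles the number of cues, so each subtitle stays on screen for less time.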