Audiotext

A desktop application that transcribes audio from files, microphone input or YouTube videos with the option to translate the content and create subtitles.


Report Bug · Request Feature · Ask Question


About the Project

Main

Audiotext transcribes the audio from an audio file, video file, microphone input, or YouTube video into one of the 74 different languages it supports, along with some of their dialects. You can transcribe using the Google Speech-to-Text API or WhisperX, which can even translate the transcription or generate subtitles!

You can also choose the theme you like best. It can be dark, light, or the one configured in the system.

Supported Languages

Click here to display
  • Afrikaans
  • Amharic (አማርኛ)
  • Arabic (العربية)
  • Armenian (հայերեն)
  • Azerbaijani (Azərbaycan)
  • Basque (Euskara)
  • Belarusian (беларуская)
  • Bengali (বাংলা)
  • Bulgarian (Български)
  • Catalan (Català)
  • Chinese (China) (中文(中国))
  • Chinese (Hong Kong) (中文(香港))
  • Chinese (Taiwan) (中文(台灣))
  • Croatian (Hrvatski)
  • Czech (Čeština)
  • Danish (Dansk)
  • Dutch (Nederlands)
  • English
  • Estonian (Eesti keel)
  • Farsi (فارسی)
  • Filipino
  • Finnish (Suomi)
  • French (Français)
  • Galician (Galego)
  • Georgian (ქართული)
  • German (Deutsch)
  • German (Swiss Standard) (Schweizer Hochdeutsch)
  • Greek (Ελληνικά)
  • Gujarati (ગુજરાતી)
  • Hebrew (עברית)
  • Hindi (हिन्दी)
  • Hungarian (Magyar)
  • Icelandic (Íslenska)
  • Indonesian (Bahasa Indonesia)
  • Italian (Italiano)
  • Italian (Switzerland) (Italiano (Svizzera))
  • Japanese (日本語)
  • Javanese (Basa Jawa)
  • Kannada (ಕನ್ನಡ)
  • Kazakh (Қазақ)
  • Khmer (ខ្មែរ)
  • Korean (한국어)
  • Lao (ລາວ)
  • Latvian (Latviešu)
  • Lithuanian (Lietuvių)
  • Malay (Bahasa Melayu)
  • Malayalam (മലയാളം)
  • Maltese (Malti)
  • Marathi (मराठी)
  • Mongolian (Монгол)
  • Nepali (नेपाली)
  • Norwegian (Bokmål)
  • Norwegian Nynorsk (Norsk (Nynorsk))
  • Polish (Polski)
  • Portuguese (Português)
  • Punjabi (ਪੰਜਾਬੀ)
  • Romanian (Română)
  • Russian (Русский)
  • Serbian (Српски)
  • Sinhala (සිංහල)
  • Slovak (Slovenčina)
  • Slovenian (Slovenščina)
  • Spanish (Español)
  • Sundanese (Basa Sunda)
  • Swahili (Kiswahili)
  • Swedish (Svenska)
  • Tamil (தமிழ்)
  • Telugu (తెలుగు)
  • Thai (ไทย)
  • Turkish (Türkçe)
  • Ukrainian (Українська)
  • Urdu (اردو)
  • Vietnamese (Tiếng Việt)
  • Zulu (isiZulu)

Project Structure

ASCII folder structure
│   .gitignore
│   audiotext.spec
│   LICENSE
│   README.md
│   requirements.txt
│
├───.github
│   │   CONTRIBUTING.md
│   │
│   ├───ISSUE_TEMPLATE
│   │       bug_report_template.md
│   │       feature_request_template.md
│   │
│   └───PULL_REQUEST_TEMPLATE
│           pull_request_template.md
│
├───res
│   ├───img
│   │       icon.ico
│   │
│   └───locales
│       │   main_controller.pot
│       │   main_window.pot
│       │
│       ├───en
│       │   └───LC_MESSAGES
│       │           app.mo
│       │           app.po
│       │           main_controller.po
│       │           main_window.po
│       │
│       └───es
│           └───LC_MESSAGES
│                   app.mo
│                   app.po
│                   main_controller.po
│                   main_window.po
│
└───src
    │   app.py
    │
    ├───controller
    │       __init__.py
    │       main_controller.py
    │
    ├───model
    │   │   __init__.py
    │   │   transcription.py
    │   │     
    │   └───config
    │           __init__.py
    │           config_google_api.py
    │           config_subtitles.py
    │           config_whisperx.py
    │
    ├───utils
    │       __init__.py
    │       audio_utils.py
    │       config_manager.py
    │       constants.py
    │       dict_utils.py
    │       enums.py
    │       i18n.py
    │       path_helper.py
    │
    └───view
        │   __init__.py   
        │   main_window.py
        │    
        └───custom_widgets
                __init__.py
                ctk_scrollable_dropdown/
                ctk_input_dialog.py

Built With

  • CTkScrollableDropdown for the scrollable option menu to display the full list of supported languages.
  • CustomTkinter for the GUI.
  • moviepy for video processing, from which the program extracts the audio to be transcribed.
  • PyAudio for recording microphone audio.
  • pydub for audio processing.
  • PyTorch for building and training neural networks.
  • PyTorch-CUDA for enabling GPU support (CUDA) with PyTorch. CUDA is a parallel computing platform and application programming interface model created by NVIDIA.
  • pytube for downloading the audio of YouTube videos.
  • SpeechRecognition for converting audio into text.
  • Torchaudio for audio processing tasks, including speech recognition and audio classification.
  • WhisperX for fast automatic speech recognition. This product includes software developed by Max Bain. Uses faster-whisper, which is a reimplementation of OpenAI's Whisper model using CTranslate2.

(back to top)

Getting Started

Installation

  1. Install FFmpeg. Without it, the program won't be able to process audio files.

    To check if you have it installed on your system, run ffmpeg -version. It should return something similar to this:

    ffmpeg version 5.1.2-essentials_build-www.gyan.dev Copyright (c) 2000-2022 the FFmpeg developers
    built with gcc 12.1.0 (Rev2, Built by MSYS2 project)
    configuration: --enable-gpl --enable-version3 --enable-static --disable-w32threads --disable-autodetect --enable-fontconfig --enable-iconv --enable-gnutls --enable-libxml2 --enable-gmp --enable-lzma --enable-zlib --enable-libsrt --enable-libssh --enable-libzmq --enable-avisynth --enable-sdl2 --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxvid --enable-libaom --enable-libopenjpeg --enable-libvpx --enable-libass --enable-libfreetype --enable-libfribidi --enable-libvidstab --enable-libvmaf --enable-libzimg --enable-amf --enable-cuda-llvm --enable-cuvid --enable-ffnvcodec --enable-nvdec --enable-nvenc --enable-d3d11va --enable-dxva2 --enable-libmfx --enable-libgme --enable-libopenmpt --enable-libopencore-amrwb --enable-libmp3lame --enable-libtheora --enable-libvo-amrwbenc --enable-libgsm --enable-libopencore-amrnb --enable-libopus --enable-libspeex --enable-libvorbis --enable-librubberband
    libavutil      57. 28.100 / 57. 28.100
    libavcodec     59. 37.100 / 59. 37.100
    libavformat    59. 27.100 / 59. 27.100
    libavdevice    59.  7.100 / 59.  7.100
    libavfilter     8. 44.100 /  8. 44.100
    libswscale      6.  7.100 /  6.  7.100
    libswresample   4.  7.100 /  4.  7.100
    

    If the command returns an error, your system cannot find ffmpeg, most likely because it isn't installed. To install it, open a command prompt and run one of the following commands, depending on your operating system:

    # on Ubuntu or Debian
    sudo apt update && sudo apt install ffmpeg
    
    # on Arch Linux
    sudo pacman -S ffmpeg
    
    # on macOS using Homebrew (https://brew.sh/)
    brew install ffmpeg
    
    # on Windows using Chocolatey (https://chocolatey.org/)
    choco install ffmpeg
    
    # on Windows using Scoop (https://scoop.sh/)
    scoop install ffmpeg
    
  2. Go to releases and download the latest version.

  3. Decompress the downloaded file.

  4. Open the audiotext folder and double-click the Audiotext executable file.

To Set Up the Project Locally

  1. Clone the repository by running git clone https://github.com/HenestrosaDev/audiotext.git.
  2. Change the current working directory to audiotext by running cd audiotext.
  3. (Optional but recommended) Create a Python virtual environment in the project root. If you're using virtualenv, you would run virtualenv venv.
  4. (Optional but recommended) Activate the virtual environment:
    # on Windows
    . venv/Scripts/activate
    # if you get the error `FullyQualifiedErrorId : UnauthorizedAccess`, run this:
    Set-ExecutionPolicy Unrestricted -Scope Process
    # and then . venv/Scripts/activate
    
    # on macOS and Linux
    source venv/bin/activate
  5. Run cat requirements.txt | xargs -n 1 pip install to install the dependencies.

    [!WARNING] For some reason, pip install -r requirements.txt throws the error "Could not find a version that satisfies the requirement [PACKAGE_NAME]==[PACKAGE_VERSION] (from versions: none)", which is why step 5 installs the dependencies one at a time.

  6. Run python src/app.py to start the program.

Notes

  • You cannot generate a single executable file for this project with PyInstaller due to its dependency on the CustomTkinter package (reason here).
  • For Mac computers with Apple silicon: An error occurs when trying to install the pyaudio package. Here is a StackOverflow post explaining how to solve this issue.
  • I had to comment out the line pprint(response_text, indent=4) in the recognize_google function of the SpeechRecognition package's __init__.py to avoid opening a command line window along with the GUI. Otherwise, the program couldn't use the Google API transcription method, because pprint throws an error if it cannot print to the CLI, which prevents the transcription from being generated. The same applies to the lines that use the logger package in the moviepy/audio/io/ffmpeg_audiowriter file of the moviepy package. There is also a change in line 169 of that file: logger=logger has been changed to logger=None to avoid further console-related errors.
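
For reference, the edits described above look roughly like this. This is an illustrative sketch, not a patch to apply verbatim; exact line numbers and surrounding code vary between package versions:

    # In speech_recognition/__init__.py, inside recognize_google():
    # pprint(response_text, indent=4)  # <- commented out so nothing is printed to a CLI

    # In moviepy/audio/io/ffmpeg_audiowriter.py, the lines that use the logger
    # package are commented out, and in line 169 the keyword argument
    # logger=logger is changed to logger=None.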

(back to top)

Usage

Once you open the Audiotext executable file (explained in the getting started section), you'll see something like this:

Main

Transcribe From

You can transcribe from three audio sources:

  • File (see image above): Click on the file explorer icon to select the file you want to transcribe. You can also manually enter the path to the file into the input field. You can transcribe audio from both audio and video files. Note that the file explorer has the All supported files option selected by default. To select only audio files or video files, click the combo box in the lower right corner of the file explorer to change the file type, as marked in red in the following image:

    File explorer

    Supported files

    Supported audio file formats
    • .mp3
    • .mpeg
    • .wav
    • .wma
    • .aac
    • .flac
    • .ogg
    • .oga
    • .opus
    Supported video file formats
    • .mp4
    • .m4a
    • .m4v
    • .f4v
    • .f4a
    • .m4b
    • .m4r
    • .f4b
    • .mov
    • .avi
    • .webm
    • .flv
    • .mkv
    • .3gp
    • .3gp2
    • .3g2
    • .3gpp
    • .3gpp2
    • .ogv
    • .ogx
    • .wmv
    • .asf
  • Microphone: Click the Start recording button to begin recording. The button's text will change to Stop recording and its color will change to red. Click it again to stop recording and generate the transcription.

    Note that your operating system must recognize an input source; otherwise, an error will appear in the text box indicating that no input source was detected.

    Here is a video demonstrating this feature:

    english.mp4

    Video from v2.1.0

  • YouTube video: Enter the video URL in the upper input field. When finished, click on the Generate transcription button.

    From microphone

Save Transcription

Once the program has generated the transcription, you'll see a green Save transcription button below the text box. If you click it, a file explorer will open where you can name the file and select the path where you want to save it. The file extension is .txt by default, but you can change it to any other text file type.

If you used WhisperX to generate the transcription and checked the Generate subtitles option, you'll notice that two files are also saved along with the text file: a .vtt file and a .srt file. Both contain the subtitles for the transcribed file, as explained in the Generate Subtitles section.

Transcribe Using

Before you start transcribing, it's important to understand what each transcription method offers:

  • WhisperX: Selected by default. This method runs locally on your machine and can run on CPUs and CUDA GPUs, although it performs better on the latter. The transcriptions generated by WhisperX are generally much more accurate than those generated by the Google API, although this may vary depending on the model size and computation type selected. In addition, WhisperX offers a wider range of features, including subtitle generation and translation into any other supported language. It's fast, especially when transcribing large files, and has no usage restrictions while remaining completely free.
  • Google Speech-To-Text API (hereafter referred to as Google API): Audiotext sends the audio to the remote Google API to get the transcription. It doesn't punctuate sentences, and the resulting transcriptions are of lower quality than WhisperX's, so they often require manual adjustment. In its free tier, usage is limited to 60 minutes per month, but this limit can be extended by adding an API key. Unlike WhisperX, the Google API is much less demanding on hardware resources, because the transcription process is handled entirely on remote servers (a minimal usage sketch follows this list).
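
For context, here is a minimal sketch of how an application can call the Google recognizer through the SpeechRecognition package (one of the libraries listed in Built With). The file name and language code are placeholders, and this is not Audiotext's actual code:

    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile("sample.wav") as source:
        audio = recognizer.record(source)  # read the entire file into memory

    # Without `key`, SpeechRecognition falls back to a default, rate-limited key.
    text = recognizer.recognize_google(audio, key=None, language="en-US")
    print(text)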

WhisperX Options

The WhisperX options appear when the selected transcription method is WhisperX. You can choose whether to translate the audio into English and whether to generate subtitles from the transcription.

WhisperX options

Transcription Translation

To translate the audio into English, simply check the Translate to English checkbox before generating the transcription, as shown in the video below.

french-to-english.mp4

Note

Video from v2.1.0

However, there is another unofficial way to translate audio into any supported language by setting the Audio language to the target translation language. For example, if the audio is in English and you want to translate it into Spanish, you would set the Audio language to "Spanish".

Here is a practical example using the microphone:

english-to-spanish.mp4

Note

Video from v2.1.0

Make sure to double-check the generated translations.

Generate Subtitles

To generate subtitles, simply check the Generate subtitles option before generating the transcription, as you would with the Translate to English option.

When you select this option, you'll see a Subtitle options frame like the one below with these three options (a code sketch showing how they map onto subtitle writers follows the image):

  • Highlight words: Underline each word as it's spoken in .srt and .vtt subtitle files. Not checked by default.
  • Max. line count: The maximum number of lines in a segment. 2 by default.
  • Max. line width: The maximum number of characters in a line before breaking the line. 42 by default.

Subtitle options
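
The three options above correspond to the subtitle writer options used by Whisper-style writers, on which WhisperX builds. Below is a minimal sketch using openai-whisper's get_writer API; the result dict and file name are placeholders, and exact writer signatures vary between whisper/WhisperX versions:

    from whisper.utils import get_writer

    # Placeholder result in the shape Whisper-style writers expect
    # (word-level timestamps are needed for the line and highlight options).
    result = {
        "language": "en",
        "segments": [
            {
                "start": 0.0,
                "end": 1.8,
                "text": " Hello world.",
                "words": [
                    {"word": " Hello", "start": 0.0, "end": 0.9},
                    {"word": " world.", "start": 0.9, "end": 1.8},
                ],
            }
        ],
    }

    writer = get_writer("srt", ".")  # "vtt" is also supported
    writer(result, "audio.mp3", {
        "highlight_words": False,  # underline each word as it's spoken
        "max_line_count": 2,       # maximum lines per subtitle segment
        "max_line_width": 42,      # maximum characters per line
    })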

To get the files after the audio is transcribed, click Save transcription and select the path where you want to save them, as explained in the Save Transcription section.

The output formats are .vtt and .srt, two of the most common subtitle file formats. Unfortunately, there is currently no support for the .ass file type, but it will be added once WhisperX fixes a bug that prevents it from being created correctly.

WhisperX Advanced Options

When you click the Show advanced options button in the WhisperX options frame, the Advanced options frame appears, as shown in the figure below.

WhisperX advanced options

It's highly recommended that you don't change the default configuration unless you're having problems with WhisperX or you know exactly what you're doing, especially the "Compute type" and "Batch size" options. Change them at your own risk and be aware that you may experience problems, such as having to reboot your system if the GPU runs out of VRAM.

Model Size

There are five main model sizes that offer tradeoffs between speed and accuracy. The larger the model size, the more VRAM it uses and the longer it takes to transcribe. Unfortunately, WhisperX hasn't provided specific performance data for each model, so the table below is based on the one detailed in OpenAI's Whisper README. According to WhisperX, the large-v2 model requires <8GB of GPU memory and batches inference for 70x real-time transcription (taken from the project's README).

Model  | Parameters | Required VRAM
tiny   | 39 M       | ~1 GB
base   | 74 M       | ~1 GB
small  | 244 M      | ~2 GB
medium | 769 M      | ~5 GB
large  | 1550 M     | <8 GB

Note

large is divided into three versions: large-v1, large-v2, and large-v3. The default model size is large-v2, since large-v3 has some bugs that weren't as common in large-v2, such as hallucination and repetition, especially for certain languages like Japanese. There are also more prevalent problems with missing punctuation and capitalization. See the announcements for the large-v2 and the large-v3 models for more insight into their differences and the issues encountered with each.

The larger the model size, the lower the WER (Word Error Rate in %). The table below is taken from this Medium article, which analyzes the performance of pre-trained Whisper models on common Dutch speech.

Model    | WER
tiny     | 50.98
small    | 17.90
large-v2 | 7.81

Compute Type

This term refers to different data types used in computing, particularly in the context of numerical representation. It determines how numbers are stored and represented in a computer's memory. The higher the precision, the more resources will be needed and the better the transcription will be.

There are three possible values for Audiotext:

  • int8: Default if using CPU. It represents whole numbers without any fractional part. Its size is 8 bits (1 byte) and it can represent integer values from -128 to 127 (signed) or 0 to 255 (unsigned). It is used in scenarios where memory efficiency is critical, such as in quantized neural networks or edge devices with limited computational resources.
  • float16: Default if using CUDA GPU. It's a half precision type representing 16-bit floating point numbers. Its size is 16 bits (2 bytes). It has a smaller range and precision compared to float32. It's often used in applications where memory is a critical resource, such as in deep learning models running on GPUs or TPUs.
  • float32: Recommended for CUDA GPUs with more than 8 GB of VRAM. It's a single precision type representing 32-bit floating point numbers, which is a standard for representing real numbers in computers. Its size is 32 bits (4 bytes). It can represent a wide range of real numbers with a reasonable level of precision.

Batch Size

This option determines how many audio segments are processed in parallel. It doesn't affect the quality of the transcription, only the generation speed (the smaller the batch, the slower the transcription).

For simplicity, let's divide the possible batch size values into two groups:

  • Small batch size (<=8): Fewer segments are processed at once, so transcription is slower, but less memory is used, which can be important when working with limited resources. 8 is the default value.
  • Large batch size (>8): Speeds up transcription, especially on hardware optimized for parallel processing, such as GPUs. 16 is the maximum recommended value.
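
Taken together, here is how the Model size, Compute type, and Batch size options (plus the Use CPU option described next) map onto the WhisperX Python API. This is a minimal sketch adapted from the usage example in the WhisperX README; the file name is a placeholder:

    import whisperx

    device = "cuda"           # "cpu" if the Use CPU option is checked
    compute_type = "float16"  # "int8" on CPU; "float32" for GPUs with >8 GB of VRAM
    batch_size = 8            # the default; reduce it if you run out of memory

    model = whisperx.load_model("large-v2", device, compute_type=compute_type)
    audio = whisperx.load_audio("audio.mp3")
    result = model.transcribe(audio, batch_size=batch_size)
    print(result["segments"])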

Use CPU

Checked by default if there is no CUDA GPU. WhisperX will use the CPU for transcription if checked.

As noted in the Compute Type section, the default compute type value for the CPU is int8, since many CPUs don't support efficient float16 or float32 computation, which would result in an error. Change it at your own risk.

Google Speech-To-Text API Options

The Google API options frame appears if the selected transcription method is Google API.

google-api-options

API Key

Since the program uses the free Google API tier by default, which allows you to transcribe up to 60 minutes of audio per month for free, you may need to add an API key if you want to make extensive use of this feature. To do so, click the Set API key button. You'll be presented with a dialog box where you can enter your API key, which will only be used to make requests to the API.

google-api-key-dialog

Remember that, unlike the Google API, WhisperX provides fast, unlimited audio transcription for free, with support for translation and subtitle generation. Also note that Google charges for the use of the API key; Audiotext is not responsible for those charges.

Appearance Mode

The program supports three appearance modes:

  • System (default)
  • Dark
  • Light
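
These modes map onto CustomTkinter's appearance-mode setting (CustomTkinter is the GUI library listed in Built With). A minimal sketch:

    import customtkinter as ctk

    ctk.set_appearance_mode("System")  # also accepts "Dark" and "Light"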

Troubleshooting

  • Generating a transcription may take some time, depending on the length and size of the audio and whether it's extracted from a video file. Do not close the program, even if it appears to be unresponsive.
  • The first transcription created by WhisperX will take a while, because Audiotext needs to load the model, which can take up to a few minutes depending on the hardware the program is running on. Once the model is loaded, however, you'll notice a dramatic increase in the speed of subsequent transcriptions using this method.
  • If you get the error RuntimeError: CUDA Out of memory or want to reduce GPU/CPU memory requirements, try any of the following (2 and 3 can affect quality; taken from the WhisperX README):
    1. Reduce batch size, e.g. 4
    2. Use a smaller ASR model, e.g. base
    3. Use lighter compute type, e.g. int8
  • If the program takes too long to generate the transcription (i.e., about three times the length of the original audio), try using a smaller ASR model and/or a lighter compute type, as indicated in the point above. Keep in mind that the first WhisperX transcription takes extra time because the model has to be loaded, and that the transcription process depends heavily on your system's hardware, so don't expect instant results on modest CPUs. You can also use the Google API transcription method, which is much less hardware-demanding than WhisperX.

(back to top)

Roadmap

  • Add support for WhisperX.
  • Generate .srt and .vtt files for subtitles (only for WhisperX).
  • Add "Stop recording" button state when recording from the microphone.
  • Add a dialogue to let users input their Google Speech-To-Text API key.
  • Add subtitle options.
  • Add advanced options for WhisperX.
  • Add the option to transcribe YouTube videos.
  • Change the "Generate transcription" button to "Cancel transcription" when a transcription is in progress.
  • Generate executables for macOS and Linux.
  • Add pre-commit config for using Black, isort, and mypy.
  • Add tests.

You can propose a new feature by creating an issue.

Authors

See also the list of contributors who participated in this project.

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated. Please read the CONTRIBUTING.md file, where you can find more detailed information about how to contribute to the project.

Acknowledgments

I have made use of the following resources to make this project:

License

Distributed under the BSD-4-Clause license. See LICENSE for more information.

Support

Would you like to support the project? That's very kind of you! However, I would suggest that you consider supporting the packages I've used to build this project first. If you still want to support this particular project, you can go to my Ko-fi profile by clicking the button below!

ko-fi

(back to top)
