
AutoLang: Audio Language Detector

Welcome to the AutoLang repository! This tool transcribes audio files and automatically detects the language of the transcribed text. It's built with Python, pairing a Wav2Vec2-based speech recognizer with a CatBoost gradient-boosted tree model for language prediction.

Table of Contents

  • Features
  • Quick Start
  • Command-line Arguments
  • Components
  • Model Training Notebooks
  • License
  • Acknowledgements

Features

  • Transcribes audio files into text
  • Predicts the language of the transcribed text
  • Handles audio processing (changing speed and volume)
  • Provides results in a user-friendly JSON format
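
For example, a downstream script might consume the result file like this (a minimal sketch; the field names transcription and language are assumptions for illustration, not confirmed by the repository):

import json

# Read the result produced by main.py. The field names below are
# assumed for illustration; check your actual output.json.
with open("output.json") as f:
    result = json.load(f)

print(result.get("transcription"))
print(result.get("language"))  # e.g. "EN" or "RU"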

Quick Start

Before you start, make sure you have Python and pip installed on your machine. This project was tested with Python 3.10.12 on a Pop!_OS machine.

  1. Clone the repository:
git clone https://github.com/EmotionEngineer/AutoLang.git
cd AutoLang
  2. Install the necessary Python libraries by running:
pip install -r requirements.txt

Here are the required libraries:

  • tensorflow
  • torch
  • kenlm
  • pyctcdecode
  • catboost
  • transformers
  • datasets
  • numpy
  • onnxruntime
  • scipy

If you encounter any issues during the installation, please check the respective library's documentation.

  3. Download the model files from the Releases section of the GitHub repository. The model files have been split into four parts: models_part_aa, models_part_ab, models_part_ac, and models_part_ad.

  4. Reassemble and unzip the model files (a cross-platform Python alternative is sketched after the examples below):

cat models_part_aa models_part_ab models_part_ac models_part_ad > models.zip
unzip models.zip
  5. Run the main.py script with the necessary command-line arguments, including the -l or --language option to set the language manually ('EN', 'RU', or 'Auto'):
python main.py -i <input_file_path> -m <model_path> -o <output_file_path> -s <speed> -v <volume> -p <processed_audio_path> -l <language>

For example, to transcribe an audio file and detect the language automatically:

python main.py -i path/to/audio.wav -o output.json

And to transcribe an audio file with a manually set language:

python main.py -i path/to/audio.wav -o output.json -l EN
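
If cat or unzip are unavailable (for example, on Windows), step 4 can be done with a short standard-library Python sketch instead:

import zipfile

# Concatenate the split archive parts into a single zip, then extract it --
# a portable equivalent of `cat models_part_a* > models.zip && unzip models.zip`.
parts = ["models_part_aa", "models_part_ab", "models_part_ac", "models_part_ad"]
with open("models.zip", "wb") as out:
    for part in parts:
        with open(part, "rb") as f:
            out.write(f.read())

with zipfile.ZipFile("models.zip") as zf:
    zf.extractall(".")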

Command-line Arguments

  • -i, --input: Path to the input audio file (Required)
  • -m, --model: Path to the trained CatBoost LanguageDetector model (Default: './models/LangModel.cbm')
  • -o, --output: Path to the output JSON file
  • -s, --speed: Speed change of the audio file (Default: 1.0)
  • -v, --volume: Volume change of the audio file (Default: 1.0)
  • -p, --path: Path to save the processed audio file
  • -l, --language: Set the language manually (EN, RU) or leave it as Auto for auto-detection (Default: 'Auto')
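
For reference, here is a minimal argparse sketch consistent with the flags above; the actual parser in main.py may differ in details:

import argparse

parser = argparse.ArgumentParser(description="AutoLang: transcribe audio and detect its language")
parser.add_argument("-i", "--input", required=True, help="Path to the input audio file")
parser.add_argument("-m", "--model", default="./models/LangModel.cbm", help="Path to the CatBoost LanguageDetector model")
parser.add_argument("-o", "--output", help="Path to the output JSON file")
parser.add_argument("-s", "--speed", type=float, default=1.0, help="Speed change of the audio file")
parser.add_argument("-v", "--volume", type=float, default=1.0, help="Volume change of the audio file")
parser.add_argument("-p", "--path", help="Path to save the processed audio file")
parser.add_argument("-l", "--language", default="Auto", choices=["EN", "RU", "Auto"], help="Language override, or Auto for auto-detection")
args = parser.parse_args()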

Components

The program is composed of several classes:

  • Normalizer: Normalizes audio data
  • TextProcessor: Processes text such as adding spaces between characters
  • LanguageModel: Predicts the transcription of a given audio file
  • TextLanguagePredictor: Predicts the language of a given text
  • AudioLanguageDetector: Detects the language of a given audio file
  • AudioProcessor: Processes the audio file with the given speed and volume
  • JSONWriter: Writes the result into a JSON file
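
As an illustration of the kind of step Normalizer performs, here is a minimal zero-mean, unit-variance normalization sketch (a common choice for Wav2Vec2-style models; the repository's exact scheme may differ):

import numpy as np

def normalize(samples: np.ndarray) -> np.ndarray:
    # Shift to zero mean and scale to unit variance; the small epsilon
    # guards against division by zero on silent audio.
    return (samples - samples.mean()) / (samples.std() + 1e-7)

audio = np.array([0.1, -0.5, 0.25], dtype=np.float32)
print(normalize(audio))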

Model Training Notebooks

The models used in this project were trained and evaluated in Kaggle notebooks, which detail the process of training the language models, running inference, and converting the models to ONNX format.
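
Since onnxruntime is among the dependencies, an exported ONNX model can be exercised along these lines (a sketch; the file name and input shape are assumptions for illustration):

import numpy as np
import onnxruntime as ort

# Load an exported model and run a dummy forward pass over one second
# of silent 16 kHz mono audio.
session = ort.InferenceSession("models/model.onnx")
input_name = session.get_inputs()[0].name
dummy_audio = np.zeros((1, 16000), dtype=np.float32)
outputs = session.run(None, {input_name: dummy_audio})
print([o.shape for o in outputs])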

License

This project is licensed under the terms of the MIT license. See LICENSE for additional details.

Acknowledgements

This project uses the Wav2Vec2ProcessorWithLM from Hugging Face Transformers, the CatBoost library, and ONNX Runtime.