# 🧐 TTS & ASR Dataset Pipeline

An open-source pipeline designed to simplify the creation of datasets for Text-to-Speech (TTS) and Automatic Speech Recognition (ASR), supporting both specific (biblical) and generic (any corpus/language) use cases.

## 🎯 Motivation & Inspirations

This project was born out of a desire to expand access to speech technologies for all languages, especially **low-resource languages**. Inspired by Meta AI's **Massively Multilingual Speech (MMS)** project, this open-source pipeline aims to make TTS/ASR dataset creation easier by providing a full set of tools for:

- preprocessing text and audio data,
- automatic alignment,
- data filtering and formatting.

## 🧠 Pipeline Architecture

```
Raw Data (audio/text)
      │
      ▼
Preprocessing (conversion + cleaning)
      │
      ▼
Text-audio alignment
      │
      ▼
Alignment filtering
      │
      ▼
Final formatting (CSV, JSON, etc.)
```

## 📚 Use Cases

This pipeline supports **two main scenarios**:

- 📖 **Biblical case**: Texts from scriptures or other books, often structured using book/chapter/verse segmentation.
- 🌍 **Generic case**: Any data with transcription and audio (podcasts, stories, interviews, etc.).

## ⚙️ Installation

```
!git clone https://github.com/MendoLeo/tts-dataset-pipeline.git
%cd tts-dataset-pipeline
!pip install -r requirements.txt
```

### 📌 C++ Alignment Dependency (Generic case only)

```
%cd forced-alignhf-model
!pip install pybind11
!python setup.py build_ext --inplace
%cd ..
```

## 🧾 Data Preparation

### 📖 Biblical Data

The required format is a JSON file with book/chapter/verse structure:

```
[
    {
        "numVerset": "MAT.1.1",
        "verset": "Kalate éndane Yésus Krist, e mona David, e mon Abraham."
    },
    {
        "numVerset": "MAT.1.2",
        "verset": "Abraham a nga biaé Izak, Izak a nga biaé Yakob, Yakob a nga biaé Yuda baa be bobenyañ."
    }
]
```

Structure:

```
audio_dir/
├── MAT
│   ├── MAT_001.wav
├── 1CO
...
transcripts/
├── MAT.json
├── PSA.json
...
```
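Before running alignment, it can help to sanity-check a transcript file against the format above. The following is a minimal sketch, not part of the repository: the helper names `parse_verse_id` and `validate_verses` are hypothetical, and it assumes every ID follows the `BOOK.chapter.verse` pattern shown in the example.

```python
def parse_verse_id(verse_id: str) -> tuple[str, int, int]:
    """Split an ID such as "MAT.1.1" into (book, chapter, verse)."""
    # Raises ValueError if there are not exactly three dot-separated parts
    # or if chapter/verse are not integers.
    book, chapter, verse = verse_id.split(".")
    return book, int(chapter), int(verse)


def validate_verses(entries: list[dict]) -> list[str]:
    """Return a list of human-readable problems found in the verse entries."""
    problems = []
    for i, entry in enumerate(entries):
        missing = {"numVerset", "verset"} - entry.keys()
        if missing:
            problems.append(f"entry {i}: missing keys {sorted(missing)}")
            continue
        try:
            parse_verse_id(entry["numVerset"])
        except ValueError:
            problems.append(f"entry {i}: bad verse id {entry['numVerset']!r}")
        if not entry["verset"].strip():
            problems.append(f"entry {i}: empty verse text")
    return problems


# Hypothetical sample: the second entry has a malformed ID and no text.
sample = [
    {"numVerset": "MAT.1.1", "verset": "Kalate ..."},
    {"numVerset": "MAT.1", "verset": ""},
]
for problem in validate_verses(sample):
    print(problem)
```

For a real file, the list passed to `validate_verses` would come from `json.load` on something like `transcripts/MAT.json`.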

### 🌍 Generic Data

Text/audio file pairs. Structure:

```
text_dir/
├── AV1.txt
├── AV2.txt
audio_dir/
├── AV1.wav
├── AV2.wav
```
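Since the generic case relies on text and audio files sharing a base name (`AV1.txt` with `AV1.wav`), a quick check for unpaired files may save a failed alignment run. This is an illustrative sketch only; `find_unpaired` is a hypothetical helper, and it assumes `.txt` and `.wav` extensions as in the layout above.

```python
import tempfile
from pathlib import Path


def find_unpaired(text_dir: Path, audio_dir: Path) -> tuple[set[str], set[str]]:
    """Return (texts without audio, audio without text), compared by base name."""
    text_stems = {p.stem for p in Path(text_dir).glob("*.txt")}
    audio_stems = {p.stem for p in Path(audio_dir).glob("*.wav")}
    return text_stems - audio_stems, audio_stems - text_stems


# Demo on a throwaway directory: AV2 has a transcript but no audio.
with tempfile.TemporaryDirectory() as tmp:
    text_dir = Path(tmp) / "text_dir"
    audio_dir = Path(tmp) / "audio_dir"
    text_dir.mkdir()
    audio_dir.mkdir()
    for name in ("AV1.txt", "AV2.txt"):
        (text_dir / name).touch()
    (audio_dir / "AV1.wav").touch()
    missing_audio, missing_text = find_unpaired(text_dir, audio_dir)

print("Transcripts without audio:", missing_audio)
print("Audio without transcripts:", missing_text)
```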

## Mount Your Drive

```
from google.colab import drive
drive.mount('/content/drive')
```

## ⚙️ Common Preprocessing Steps

### 🔄 Audio Conversion

```
%cd data_prep
!python convert_audio.py --audio_dir ./audio_files --sample 16000 --output_dir wav_output
```
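After conversion, it may be worth confirming that the output files really have the requested sample rate. The sketch below uses only Python's standard `wave` module and assumes that `--sample 16000` means a 16 kHz sample rate and that the output is PCM WAV; `wav_info` is a hypothetical helper, not part of the repository.

```python
import os
import tempfile
import wave


def wav_info(path: str) -> tuple[int, int, float]:
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
        duration = wf.getnframes() / rate
    return rate, channels, duration


# Demo: write one second of 16-bit mono silence at 16 kHz, then inspect it.
tmp_path = os.path.join(tempfile.mkdtemp(), "silence.wav")
with wave.open(tmp_path, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
    wf.setframerate(16000)
    wf.writeframes(b"\x00\x00" * 16000)

rate, channels, duration = wav_info(tmp_path)
print(f"{rate} Hz, {channels} channel(s), {duration:.2f} s")
```

Running `wav_info` over every file in `wav_output` and flagging any rate other than 16000 would catch files the conversion step skipped.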

### 🔇 Audio Denoising

```
!python denoising.py --src_path /noised/audios --output_dir /denoised/audios
```

## 📖 Biblical Case

### 📌 Alignment

```
# Align a single book
!python run_segmentation.py --json_path /transcripts/json-file/PSA.json --audio_dir /audio_dir/PSA --output_dir /outputs/PSA --language 'bum' --chunk_size_s 15
```

```
# Align multiple books
%cd scripts-bash
!bash segmentation.sh -j /to/json_files -a /to/audio_files -o /path/to/output_dir -b "GEN EXO PSA" -c 15
```

### 🧹 Filtering

```
# Filter a single book
!python run_filter.py --audio_dir /aligned/GEN/ --output_dir /output/filtered --language 'bum' --chunk_size_s 15 --probability_difference_threshold -0.2
```

```
# Filter multiple books
%cd scripts-bash
!bash run_filter.sh -a /path/to/audio_files -o /path/to/output_dir -b "GEN EXO PSA" -t -0.2
```
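The threshold flag is not explained here, so the following is only one plausible reading, modeled on MMS-style filtering rather than taken from this repository's code: a segment is kept when the length-normalized gap between the log-probability of the aligned transcript and that of the model's own greedy decoding stays above the threshold. The function name and its arguments are hypothetical.

```python
def keep_segment(
    aligned_logprob: float,
    greedy_logprob: float,
    num_frames: int,
    threshold: float = -0.2,
) -> bool:
    """Decide whether to keep a segment (assumed MMS-style criterion).

    aligned_logprob: log-probability of the reference text under forced alignment.
    greedy_logprob:  log-probability of the model's greedy decoding.
    num_frames:      number of acoustic frames, used for length normalization.
    """
    if num_frames <= 0:
        return False  # nothing to score; treat as unusable
    difference = (aligned_logprob - greedy_logprob) / num_frames
    return difference > threshold


# A transcript nearly as likely as the greedy path is kept...
print(keep_segment(aligned_logprob=-50.0, greedy_logprob=-45.0, num_frames=100))
# ...while one far less likely than the greedy path is dropped.
print(keep_segment(aligned_logprob=-90.0, greedy_logprob=-45.0, num_frames=100))
```

Under this reading, a threshold closer to zero (for example `-0.1`) filters more aggressively, while a more negative one (such as the `-0.25` used in the generic case) is more permissive.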


## 🌍 Generic Case

### 🔧 Alignment Setup and 🔁 Alignment

```
%cd /content/tts-dataset-pipeline/forced-alignhf-model
!python align_batch.py --audio_dir "/content/tts-dataset-pipeline/data/audios" --text_dir "/content/tts-dataset-pipeline/data/transcripts" --output_dir "/content/sample_data/pipeline-align" --language "bum" --romanize --segment_audio --generate_txt --split_size "sentence"
```

### 🧹 Generic Filtering

```
!python generic_filter.py --audio_dir /path/to/align/data --output_dir /path/to/cleaned/alignment --language "bum" --probability_difference_threshold -0.25
```