# 🧐 TTS & ASR Dataset Pipeline

An open-source pipeline designed to simplify the creation of datasets for Text-to-Speech (TTS) and Automatic Speech Recognition (ASR), supporting both specific (biblical) and generic (any corpus/language) use cases.

## 🎯 Motivation & Inspirations

This project was born out of a desire to expand access to speech technologies for all languages, especially **low-resource languages**. Inspired by Meta AI’s **Massively Multilingual Speech (MMS)** project, this open-source pipeline aims to make TTS/ASR dataset creation easier by providing a full set of tools for:

- preprocessing text and audio data,
- automatic alignment,
- data filtering and formatting.

## 🧠 Pipeline Architecture

```
Raw Data (audio/text)
      │
      ▼
Preprocessing (conversion + cleaning)
      │
      ▼
Text-audio alignment
      │
      ▼
Alignment filtering
      │
      ▼
Final formatting (CSV, JSON, etc.)
```
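
As an illustration of the final formatting stage, here is a minimal sketch that writes aligned segments to a CSV manifest. The column names `audio_path` and `text` and the segment file name are hypothetical; the pipeline's actual output schema may differ.

```python
import csv
import io

# Hypothetical manifest columns; not taken from the pipeline's own scripts.
FIELDNAMES = ["audio_path", "text"]

def write_manifest(rows, handle):
    """Write one CSV row per aligned segment, preceded by a header row."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)

# Illustrative segment; the file name is made up for this example.
rows = [
    {
        "audio_path": "MAT_001_0001.wav",
        "text": "Kalate éndane Yésus Krist, e mona David, e mon Abraham.",
    }
]

buffer = io.StringIO()
write_manifest(rows, buffer)
print(buffer.getvalue())
```

Passing an open file handle instead of the `StringIO` buffer would write the manifest to disk.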

## 📚 Use Cases

This pipeline supports **two main scenarios**:

- 📖 **Biblical case**: Texts from scriptures or other books, often segmented by book, chapter, and verse.
- 🌍 **Generic case**: Any data with transcription and audio (podcasts, stories, interviews, etc.).

## ⚙️ Installation

In [None]:
!git clone https://github.com/MendoLeo/tts-dataset-pipeline.git

In [None]:
%cd tts-dataset-pipeline
!pip install -r requirements.txt

### 📌 C++ Alignment Dependency (Generic case only)

In [None]:
%cd forced-alignhf-model

In [None]:
!pip install pybind11
!python setup.py build_ext --inplace

In [None]:
%cd ..

## 🧾 Data Preparation

### 📖 Biblical Data

The required format is one JSON file per book. Each entry holds a verse identifier in `BOOK.CHAPTER.VERSE` form (`numVerset`) and the verse text (`verset`):

[
    {
        "numVerset": "MAT.1.1",
        "verset": "Kalate éndane Yésus Krist, e mona David, e mon Abraham."
    },
    {
        "numVerset": "MAT.1.2",
        "verset": "Abraham a nga biaé Izak, Izak a nga biaé Yakob, Yakob a nga biaé Yuda baa be bobenyañ."
    }
]
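
Before running alignment, it can help to check that a transcript file matches this layout. The sketch below is not part of the pipeline; it assumes every `numVerset` has exactly three dot-separated parts with an integer chapter and verse, which may not hold for every corpus.

```python
import json

REQUIRED_KEYS = {"numVerset", "verset"}

def parse_verse_id(verse_id: str) -> tuple[str, int, int]:
    """Split an ID such as 'MAT.1.1' into (book, chapter, verse)."""
    book, chapter, verse = verse_id.split(".")
    return book, int(chapter), int(verse)

def load_verses(json_text: str) -> list[dict]:
    """Parse transcript JSON and check that every entry has both keys."""
    verses = []
    for entry in json.loads(json_text):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"entry {entry!r} is missing {sorted(missing)}")
        book, chapter, verse = parse_verse_id(entry["numVerset"])
        verses.append(
            {"book": book, "chapter": chapter, "verse": verse,
             "text": entry["verset"].strip()}
        )
    return verses

sample = """[
    {"numVerset": "MAT.1.1",
     "verset": "Kalate éndane Yésus Krist, e mona David, e mon Abraham."}
]"""
print(load_verses(sample)[0])
```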

Expected directory structure:

audio_dir/
├── MAT
│   ├── MAT_001.wav
├── 1CO
...
transcripts/
├── MAT.json
├── PSA.json
...
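
A quick consistency check on this layout could look like the sketch below. It assumes, based on the tree above, that each sub-folder of `audio_dir` is named after a book and that its transcript is `<BOOK>.json`; the function name is illustrative.

```python
from pathlib import Path

def books_missing_transcripts(audio_dir, transcripts_dir) -> list[str]:
    """Return book codes that have an audio folder but no <BOOK>.json file."""
    audio_books = {p.name for p in Path(audio_dir).iterdir() if p.is_dir()}
    json_books = {p.stem for p in Path(transcripts_dir).glob("*.json")}
    return sorted(audio_books - json_books)
```

Calling `books_missing_transcripts("audio_dir", "transcripts")` returns an empty list when every book folder has a transcript.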

### 🌍 Generic Data

Text/audio file pairs that share the same base name (for example `AV1.txt` and `AV1.wav`). Expected directory structure:

text_dir/
├── AV1.txt
├── AV2.txt
audio_dir/
├── AV1.wav
├── AV2.wav
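
The pairing by base name can be verified with a short sketch like the following. It assumes `.txt` transcripts and `.wav` audio, as in the tree above; it is a helper for illustration, not a script shipped with the pipeline.

```python
from pathlib import Path

def pair_files(text_dir, audio_dir):
    """Match transcripts and audio by file stem.

    Returns (pairs, unpaired): pairs is a list of (text_path, audio_path),
    unpaired lists the stems present in only one of the two folders.
    """
    texts = {p.stem: p for p in Path(text_dir).glob("*.txt")}
    audios = {p.stem: p for p in Path(audio_dir).glob("*.wav")}
    common = sorted(texts.keys() & audios.keys())
    unpaired = sorted(texts.keys() ^ audios.keys())
    return [(texts[s], audios[s]) for s in common], unpaired
```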

## Mount your drive

If your data is stored on Google Drive, mount it in Colab first:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## ⚙️ Common Preprocessing Steps

### 🔄 Audio Conversion

In [None]:
%cd data_prep

In [None]:
!python convert_audio.py --audio_dir ./audio_files --sample 16000 --output_dir wav_output
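
After conversion, the output sample rate can be spot-checked with the standard-library `wave` module. This is a minimal sketch that assumes the converted files are uncompressed PCM WAV; the function name is illustrative.

```python
import wave
from pathlib import Path

def wrong_sample_rate(wav_dir, expected_hz: int = 16000) -> dict[str, int]:
    """Map file name -> actual rate for every WAV that is not at expected_hz."""
    mismatches = {}
    for path in sorted(Path(wav_dir).rglob("*.wav")):
        with wave.open(str(path), "rb") as wav:
            rate = wav.getframerate()
        if rate != expected_hz:
            mismatches[path.name] = rate
    return mismatches
```

An empty result from `wrong_sample_rate("wav_output")` means every file is at 16 kHz.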

### 🔇 Audio Denoising

In [None]:
!python denoising.py --src_path /noised/audios --output_dir /denoised/audios

## 📖 Biblical Case

### 📌 Alignment

In [None]:
# Align a single book
!python run_segmentation.py --json_path /transcripts/json-file/PSA.json --audio_dir /audio_dir/PSA --output_dir /outputs/PSA --language 'bum' --chunk_size_s 15

In [None]:
# Align multiple books
%cd scripts-bash
!bash segmentation.sh -j /to/json_files -a /to/audio_files -o /path/to/output_dir -b "GEN EXO PSA" -c 15

### 🧹 Filtering

In [None]:
# Filter a single book
!python run_filter.py --audio_dir /aligned/GEN/ --output_dir /output/filtered --language 'bum' --chunk_size_s 15 --probability_difference_threshold -0.2

In [None]:
# Filter multiple books
%cd scripts-bash
!bash run_filter.sh -a /path/to/audio_files -o /path/to/output_dir -b "GEN EXO PSA" -t -0.2
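
To give an intuition for the threshold, here is a minimal sketch of how such a filter might work. It assumes the score is the length-normalized difference between the log-probability of the aligned transcript and that of the model's unconstrained best guess; the actual computation in `run_filter.py` may differ, and the segment values below are made up.

```python
def probability_difference(aligned_logprob: float,
                           greedy_logprob: float,
                           num_frames: int) -> float:
    """Per-frame gap between the aligned transcript and the model's best guess."""
    return (aligned_logprob - greedy_logprob) / num_frames

def keep_segment(score: float, threshold: float = -0.2) -> bool:
    """Keep a segment when its score is not below the threshold."""
    return score >= threshold

# Hypothetical segments with made-up log-probabilities.
segments = [
    {"id": "seg_a", "aligned": -12.0, "greedy": -10.0, "frames": 100},
    {"id": "seg_b", "aligned": -80.0, "greedy": -20.0, "frames": 100},
]

kept = [
    s["id"] for s in segments
    if keep_segment(probability_difference(s["aligned"], s["greedy"], s["frames"]))
]
print(kept)
```

Under this assumption, a threshold closer to zero is stricter: it discards more segments whose transcript the model finds much less likely than its own guess.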


## 🌍 Generic Case

### 🔧 Alignment Setup and 🔁 Alignment

In [None]:
%cd /content/tts-dataset-pipeline/forced-alignhf-model

In [None]:
!python align_batch.py --audio_dir "/content/tts-dataset-pipeline/data/audios" --text_dir "/content/tts-dataset-pipeline/data/transcripts" --output_dir "/content/sample_data/pipeline-align" --language "bum" --romanize --segment_audio --generate_txt --split_size "sentence"

### 🧹 Generic Filtering

In [None]:
!python generic_filter.py --audio_dir /path/to/align/data --output_dir /path/to/cleaned/alignment --language "bum" --probability_difference_threshold -0.25