# ManaTTS Dataset Processing
This notebook provides the complete processing pipeline for [ManaTTS](https://aclanthology.org/2025.naacl-long.464/).

To run the pipeline, first execute the Environment Setup cells. Then place the raw audio and text files in a directory named `raw` and run the remaining cells in sequence. Note that the original pipeline was not executed on Colab. To offer an executable demo on Colab's free tier, which has limited memory, we have commented out two of the ASR models (Vosk and Whisper). The rest of the pipeline remains unchanged.



# Environment Setup

In [None]:
! pip install hazm  # Requires Restart

In [None]:
! pip install hezar  # Requires Restart

In [None]:
! pip install pydub

In [None]:
! pip install pyaudioconvert

In [None]:
! pip install jiwer

## Set Up parsi.io

In [None]:
! git clone https://github.com/language-ml/parsi.io.git

In [None]:
mv parsi.io parsi_io

## Set Up Spleeter
Spleeter is the audio source separation tool used later to remove background music.

In [None]:
! git clone https://github.com/deezer/spleeter

In [None]:
! pip3 install typer
! pip3 install numpy
! pip3 install "tensorflow>=2.0.0"
! pip3 install librosa
! pip3 install pandas
! pip3 install ffmpeg-python
! sudo apt install -y ffmpeg
! pip3 install "httpx[http2]"

Collecting numpy<2.1.0,>=1.26.0 (from tensorflow>=2.0.0)
  Downloading numpy-2.0.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Downloading numpy-2.0.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.5 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.24.3
    Uninstalling numpy-1.24.3:
      Successfully uninstalled numpy-1.24.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gensim 4.3.3 requires numpy<2.0,>=1.18.5, but you have numpy 2.0.2 which is incompatible.
hazm 0.10.0 requires numpy==1.24.3, but you h

In [None]:
! mkdir -p spleeter/pretrained_models/2stems

In [None]:
! wget https://github.com/deezer/spleeter/releases/download/v1.4.0/2stems.tar.gz
! mv 2stems.tar.gz spleeter/pretrained_models/2stems/2stems.tar.gz

In [None]:
! tar xvzf spleeter/pretrained_models/2stems/2stems.tar.gz -C spleeter/pretrained_models/2stems/

./._checkpoint
checkpoint
model.data-00000-of-00001
model.index
model.meta


In [None]:
!pip install numpy==1.26.0
# Restart Session

Collecting numpy==1.26.0
  Downloading numpy-1.26.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (58 kB)
Downloading numpy-1.26.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.2 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 2.2.5
    Uninstalling numpy-2.2.5:
      Successfully uninstalled numpy-2.2.5
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
hazm 0.10.0 requires numpy==1.24.3, but you hav

# Process Text

In [None]:
import os
import re
from functools import reduce
from hazm import Normalizer
from parsi_io.parsi_io.modules.number_extractor import NumberExtractor
from parsi_io.parsi_io.modules.convert_number_to_text import ConvertNumberToText

## Normalization

In [None]:
normalizer = Normalizer()

def normalize_text(text):
  return normalizer.normalize(text)

## Symbol Substitution
This step is designed to unify various forms of symbols into their more commonly used counterparts.

In [None]:
substitution_dict = {'ﯽ': 'ی', '—': '–', '\u200f': '\u200c', '\xad': '\u200c', '\u200e': '\u200c', '\u200d': '\u200c'}

def substitute_symbols(text):
    translation_table = str.maketrans(substitution_dict)
    substituted_text = text.translate(translation_table)
    return substituted_text
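
As a quick check of the substitution step, the sketch below applies a subset of the mapping to a short sample. The mapping is written with explicit escapes for readability, and the sample string is a made-up illustration rather than dataset text.

```python
# Subset of the substitution mapping, written with explicit code points
demo_substitutions = {
    '\u2014': '\u2013',  # em dash -> en dash
    '\xad': '\u200c',    # soft hyphen -> zero-width non-joiner
    '\u200d': '\u200c',  # zero-width joiner -> zero-width non-joiner
}
demo_table = str.maketrans(demo_substitutions)

# Hypothetical sample containing a soft hyphen and an em dash
sample = "a\xadb \u2014 c"
result = sample.translate(demo_table)
print(repr(result))
```

Because `str.translate` works character by character, every key must be a single character. Multi-character replacements would need a different approach, such as `re.sub`.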

## Remove In-text References
This step is designed to remove the references that come inside the text but are not read aloud. For example:
> They have introduced a new tool [1] which ...

In [None]:
def remove_inline_references(text):
    # Define patterns to match references like "[NUM]" with Persian or ASCII digits
    pattern_fa = r"\[[۰-۹]+\]"
    pattern_en = r"\[[0-9]+\]"

    # Use regular expression to remove references
    text_without_refs_fa = re.sub(pattern_fa, " ", text)
    text_without_refs_en = re.sub(pattern_en, " ", text_without_refs_fa)

    return text_without_refs_en
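
The following minimal sketch shows the intended effect on a sample sentence. It uses a single character class covering both digit sets, which is an illustrative simplification. The sample sentence and the name `inline_reference_pattern` are assumptions.

```python
import re

# Bracketed numbers written with ASCII or Persian digits
inline_reference_pattern = re.compile(r"\[[0-9۰-۹]+\]")

sample = "They have introduced a new tool [1] which is fast [۱۲]."
cleaned = inline_reference_pattern.sub(" ", sample)
print(cleaned)
```

Each reference is replaced by a space, so the result contains runs of spaces. The later white-space step in the pipeline collapses them.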

## Remove Reference Lines
This step is designed to remove the references that come at the end of the text but are not read aloud. For example:
> [1] Roshan-AI. Hazm. https://www.roshan-ai.ir/hazm/docs/index.html. Accessed: May 3, 2024.
>
> [2] ...



In [None]:
def remove_references_lines(text):
    # Define patterns to match lines starting with references like "[NUM] " in Persian or ASCII digits
    pattern_fa = r"^\s*\[[۰-۹]+\]"
    pattern_en = r"^\s*\[[0-9]+\]"

    # Split text into lines
    lines = text.split('\n')

    # Remove lines starting with references
    cleaned_lines = [line for line in lines if not re.match(pattern_fa, line.strip()) and not re.match(pattern_en, line.strip())]

    # Join cleaned lines back into text
    cleaned_text = '\n'.join(cleaned_lines)

    return cleaned_text
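
To illustrate the line filter, here is a small sketch on a made-up four-line text. The combined pattern and the sample content are assumptions chosen for brevity.

```python
import re

# Lines that begin with a bracketed number (ASCII or Persian digits)
reference_line_pattern = re.compile(r"^\s*\[[0-9۰-۹]+\]")

sample = "Main text of the article.\n[1] Roshan-AI. Hazm.\n[۲] Another source.\nClosing sentence."
kept = [line for line in sample.split("\n") if not reference_line_pattern.match(line)]
print("\n".join(kept))
```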

## Remove Link Lines
This step is designed to remove the lines containing only links (URLs) that come at the end of the text but are not read aloud. For example:
> Resources:
>
> https://www.roshan-ai.ir/hazm/docs/index.html
>
> https://virgool.io/
>
> ...

In [None]:
def remove_link_lines(text):
    # Define the pattern to match lines starting with http or www
    pattern = r"^\s*(?:http|www)"

    # Split text into lines
    lines = text.split('\n')

    # Remove lines starting with link
    cleaned_lines = [line for line in lines if not re.match(pattern, line.strip())]

    # Join cleaned lines back into text
    cleaned_text = '\n'.join(cleaned_lines)

    return cleaned_text
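
The sketch below applies the same pattern to a few sample lines to show which lines are dropped. The `www.example.com` entry and the mid-line case are hypothetical inputs added for illustration.

```python
import re

link_line_pattern = re.compile(r"^\s*(?:http|www)")

lines = [
    "Resources:",
    "https://virgool.io/",
    "  www.example.com",            # hypothetical URL with leading spaces
    "See https://virgool.io/ too",  # URL in the middle of a line
]
kept = [line for line in lines if not link_line_pattern.match(line)]
print(kept)
```

Only lines that start with a link are removed. A URL embedded in the middle of a sentence stays in the text.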

## Convert Numbers to Text
This step is designed to convert the numbers in digit format into their spoken version. For example:

> 22 → twenty two

Converting numbers to their spoken form would be as simple as the `replace_numbers_with_text_normally` function, but the [parsi.io](https://github.com/language-ml/parsi.io) library has a bug reported [here](https://github.com/language-ml/parsi.io/issues/50): it cannot process text containing certain numeric phrases. The following are some examples of these phrases:

> قرن سوم و چهارم هجری
>
> کلاس سوم و چهارم
>
> نوزده و بیست میلادی

We detected such phrases in the dataset by catching the exceptions they raise, and we handle them separately.

In [None]:
persian_digits_pattern = re.compile(r'[۰۱۲۳۴۵۶۷۸۹0123456789]')
num2text = ConvertNumberToText()
extractor = NumberExtractor()

In [None]:
def replace_numbers_with_text_at_exception(text):
  reg = r'(سوم و چهارم|نوزده و بیست)'
  # Split the text based on problematic expressions
  splits = re.split(reg, text)
  # Process each split individually, keeping the problematic expressions unchanged
  processed_splits = [split if re.fullmatch(reg, split) else replace_numbers_with_text(split) for split in splits]
  # Concatenate the processed splits
  return ''.join(processed_splits)
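
The splitting step relies on a detail of `re.split`: when the pattern is wrapped in a capturing group, the matched phrases are returned in the result list alongside the surrounding text. A small sketch on one of the example phrases:

```python
import re

reg = r'(سوم و چهارم|نوزده و بیست)'
text = 'قرن سوم و چهارم هجری'

# With a capturing group, the matched phrase appears at the odd indices
splits = re.split(reg, text)
print(splits)

# Joining every piece reproduces the original text exactly
rebuilt = ''.join(splits)
```

Any piece that is filtered out before joining is therefore missing from the output, so the problematic phrases have to be passed through explicitly if they should remain in the text.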

In [None]:
def replace_numbers_with_text_normally(text):
  # Find all number spans in the text
  number_spans = extractor.run(text)

  # Keep only the spans that contain digits (numbers already written as words are left untouched)
  filtered_spans = [span for span in number_spans if persian_digits_pattern.search(span['phrase'])]

  # Convert the filtered numbers to text and replace them in the text
  offset = 0  # Track the offset due to previous replacements
  for span in filtered_spans:
      start, end = span['span']
      start -= offset  # Adjust start position based on previous replacements
      end -= offset  # Adjust end position based on previous replacements
      number_text = span['phrase']
      number_value = span['value']

      # Convert the number to text
      text_value = num2text.run(str(number_value))

      # Replace the number in the text with its textual equivalent
      text = text[:start] + text_value + text[end:]

      # Update the offset
      offset += len(number_text) - len(text_value)

  return text
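
The offset bookkeeping can be seen in isolation in the sketch below. It replaces the extractor and converter with hand-written spans, so `replace_spans` and the sample spans are hypothetical stand-ins, not the parsi.io API.

```python
def replace_spans(text, spans):
    """Replace (start, end, replacement) spans given in original-text
    coordinates and sorted by start, tracking the accumulated length drift."""
    offset = 0  # original length minus new length, summed over replacements
    for start, end, replacement in spans:
        start -= offset
        end -= offset
        original_len = end - start
        text = text[:start] + replacement + text[end:]
        offset += original_len - len(replacement)
    return text

text = "I have 22 apples and 5 pears"
spans = [(7, 9, "twenty two"), (21, 22, "five")]
result = replace_spans(text, spans)
print(result)
```

After the first replacement the text grows by eight characters, so the offset becomes -8 and the second span shifts from position 21 to 29.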

In [None]:
def replace_numbers_with_text(text):
    # The number extraction library has a bug reported here: https://github.com/language-ml/parsi.io/issues/50
    # Examples that trigger it: 'قرن سوم و چهارم هجری', 'کلاس سوم و چهارم', 'نوزده و بیست میلادی'
    # We process such texts separately
    try:
        return replace_numbers_with_text_normally(text)
    except (ValueError, IndexError):
        return replace_numbers_with_text_at_exception(text)

## Remove Symbols
This step is designed to remove some of the symbols that are not very common or do not affect the TTS-ASR models' outputs. This helps simplify the input to the models.

In [None]:
symbols_to_remove = "«»*[]\"'^&<>{}|٫《》•\x9d\u200b\x7f"

def remove_symbols(text):
    pattern = "[" + re.escape(symbols_to_remove) + "]"
    return re.sub(pattern, ' ', text)

## Remove Extra White Spaces
This step is designed to remove extra white spaces, including multiple consecutive white spaces and new lines.

In [None]:
def remove_extra_white_spaces(text):
    cleaned_text = re.sub(r'\s+', ' ', text)
    return cleaned_text.strip()
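
The symbol removal and white-space steps work together: symbols are replaced by spaces, and the resulting runs of white space are then collapsed. A minimal sketch with a shortened symbol list and a made-up sample string:

```python
import re

# Shortened symbol list for illustration; the notebook uses a longer one
demo_symbols = "«»*[]\"'^&<>{}|"
symbol_pattern = "[" + re.escape(demo_symbols) + "]"

sample = "«Hello»   *world*\n\n[ok]"
no_symbols = re.sub(symbol_pattern, " ", sample)
cleaned = re.sub(r"\s+", " ", no_symbols).strip()
print(cleaned)
```

Replacing symbols with a space rather than an empty string avoids gluing together words that were separated only by a symbol.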

## Full Pipeline
Here we define the complete text processing pipeline and the code that applies it to a file.

In [None]:
pipeline = [
    normalize_text,
    substitute_symbols,
    remove_inline_references,
    remove_references_lines,
    remove_link_lines,
    replace_numbers_with_text,
    remove_symbols,
    remove_extra_white_spaces
  ]

In [None]:
def process_text(input_file_path, output_dir_path):
  input_file_name = input_file_path.split('/')[-1].split('.')[0]
  output_file_path = os.path.join(output_dir_path, input_file_name + '.txt')

  # Check if the output file already exists
  if os.path.exists(output_file_path):
    print(f"Skipping file {input_file_name}.txt. Processed text file already exists.")
    return output_file_path

  # Apply the text processing pipeline
  with open(input_file_path, 'r', encoding='utf-8') as f:
    text = reduce(lambda txt, func: func(txt), pipeline, f.read())

  # Export the processed text
  with open(output_file_path, 'w', encoding='utf-8') as f:
    f.write(text)

  return output_file_path
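
The `reduce` call threads the text through each pipeline function in order, feeding the output of one step into the next. The toy pipeline below uses simple string functions as stand-ins for the real steps to show the mechanism.

```python
from functools import reduce

# Toy stand-ins for the real processing steps
toy_pipeline = [
    str.strip,
    str.lower,
    lambda t: t.replace("  ", " "),
]

result = reduce(lambda txt, func: func(txt), toy_pipeline, "  Hello  World ")
print(repr(result))
```

The order of the list matters. In the real pipeline, for example, inline references are removed before `remove_symbols` strips the square brackets they depend on.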

# Convert MP3 format to WAV
MP3 format is lossy, meaning each time an audio file is encoded in MP3 format, it undergoes some quality degradation. To preserve the original quality of audio files during the processing pipeline, we convert them to the lossless WAV format.

In [None]:
import os
from pydub import AudioSegment

In [None]:
def convert_mp3_to_wav(input_file_path, output_dir_path):
    # Get the input file name without the extension
    input_file_name = os.path.splitext(os.path.basename(input_file_path))[0]

    # Construct the output file path
    output_file_path = os.path.join(output_dir_path, f"{input_file_name}.wav")

    # Load the MP3 file
    audio = AudioSegment.from_mp3(input_file_path)

    # Export the audio as a WAV file
    audio.export(output_file_path, format="wav")

    return output_file_path

# Convert Stereo to Mono
Converting audio from stereo to mono for a TTS dataset ensures consistency, simplifies processing, reduces storage needs, and aligns with the design of most TTS models, which are optimized for mono input. Mono audio eliminates unnecessary spatial effects, providing clear and intelligible speech essential for TTS applications.

In [None]:
import os
from pydub import AudioSegment

In [None]:
def convert_stereo_to_mono(input_file_path, output_dir_path):
    # Get the input file name with the extension
    input_file_name = os.path.basename(input_file_path)

    # Construct the output file path
    output_file_path = os.path.join(output_dir_path, input_file_name)

    # Load the audio file
    audio = AudioSegment.from_wav(input_file_path)

    # Convert stereo audio to mono
    mono_audio = audio.set_channels(1)

    # Save the mono audio file
    mono_audio.export(output_file_path, format="wav")

    return output_file_path

# Remove Background Music
To have a clean speech file, it is essential to remove any potential background music. We use the source separation tool [spleeter](https://github.com/deezer/spleeter) for this purpose.

In the following code, we split the audio into chunks of at most 10 minutes, a size that the background music removal process can handle, and then concatenate the processed chunks back into the resulting audio file.

In [None]:
import os
import subprocess
import shutil
from pydub import AudioSegment

In [None]:
def get_file_name(full_path):
    file_name_with_extension = os.path.basename(full_path)
    file_name_without_extension, _ = os.path.splitext(file_name_with_extension)
    return file_name_without_extension

In [None]:
def remove_background_music(input_file_path, output_dir_path, log_idx):

    input_file_name = input_file_path.split('/')[-1].split('_mono.wav')[0]
    output_file_path = os.path.join(output_dir_path, input_file_name + '.wav')

    # Check if the output file already exists
    if os.path.exists(output_file_path):
        print(f"Skipping file {input_file_name}. Vocals file exists.")
        return output_file_path

    # Define duration of each chunk in milliseconds (10 minutes)
    chunk_duration = 10 * 60 * 1000

    # Load input audio
    audio = AudioSegment.from_file(input_file_path)

    # Get total duration of input audio in milliseconds
    total_duration = len(audio)

    # Initialize empty list to store processed audio chunks
    processed_chunks = []

    # Split the audio into chunks of 10 minutes, process each chunk, and store processed chunks
    for i in range(0, total_duration, chunk_duration):
        start_time = i
        end_time = min(i + chunk_duration, total_duration)

        # Extract the chunk
        input_file_name = get_file_name(input_file_path)
        basename = f'{input_file_name}_temp_chunk_{log_idx}.wav'

        chunk = audio[start_time:end_time]
        temp_input_file_path = os.path.join('spleeter', basename)
        chunk.export(temp_input_file_path, format='wav')

        # Process the chunk to remove background music
        temp_output_folder_name = f'temp_output_{log_idx}'

        while True:     # Retry until spleeter succeeds: on a CUDA out-of-memory error it does not create the output file
            subprocess.run(['python3', '-m', 'spleeter', 'separate', basename, '-o', temp_output_folder_name, '-c', 'wav', '-b', '128k'], cwd='spleeter')

            try:
                # Load processed chunk
                processed_chunk = AudioSegment.from_file(f'spleeter/{temp_output_folder_name}/{basename.split(".")[0]}/vocals.wav')
                break
            except FileNotFoundError:
                pass



        # Add processed chunk to the list
        processed_chunks.append(processed_chunk)

    # Concatenate processed chunks
    concatenated_audio = processed_chunks[0]
    for processed_chunk in processed_chunks[1:]:
        concatenated_audio += processed_chunk

    # Export concatenated audio to output file

    concatenated_audio.export(output_file_path, format='wav')

    os.remove(input_file_path)
    os.remove(temp_input_file_path)
    shutil.rmtree(f'spleeter/{temp_output_folder_name}/{basename.split(".")[0]}/')

    return output_file_path
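
The chunk boundaries follow directly from the `range` and `min` calls in the loop. The helper below, `chunk_bounds`, is a hypothetical name used only to show which millisecond ranges a 25-minute file would be split into.

```python
def chunk_bounds(total_duration_ms, chunk_duration_ms=10 * 60 * 1000):
    """Return (start, end) pairs in milliseconds covering the whole audio."""
    return [
        (start, min(start + chunk_duration_ms, total_duration_ms))
        for start in range(0, total_duration_ms, chunk_duration_ms)
    ]

bounds = chunk_bounds(25 * 60 * 1000)
print(bounds)
```

The last chunk is simply shorter than the others, and a zero-length input yields no chunks at all.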

# Remove Silent Moments
It is desirable to have audio files that do not have long periods of silence. So we remove the silent parts of the audio longer than a second using the [pydub](https://github.com/jiaaro/pydub) library as a post-processing step.

In [None]:
import os
import pandas as pd
from pydub import AudioSegment
from pydub.silence import split_on_silence

In [None]:
def remove_silent_parts(input_file_path, output_dir_path, metadata, min_silence_len=1000, silence_thresh=-50, keep_silence=1000):
    input_file_name = input_file_path.split('/')[-1]
    output_file_path = os.path.join(output_dir_path, input_file_name)

    # Check if the output file already exists
    if os.path.exists(output_file_path):
        print(f"Skipping file {input_file_name}. Silence removed file exists.")
        return output_file_path

    # Split the audio at silences longer than `min_silence_len` ms, keeping up to `keep_silence` ms of silence around each part
    audio = AudioSegment.from_file(input_file_path, format='wav')
    parts = split_on_silence(audio, min_silence_len=min_silence_len, silence_thresh=silence_thresh, keep_silence=keep_silence)

    # Concat the silence removed parts
    output = AudioSegment.empty()
    for part in parts:
        output += part

    # Write the silence removed audio file
    output.export(output_file_path, format='wav')

    # Update the duration in metadata
    df = pd.read_csv(metadata)
    row_index = df.index[df['Audio'] == input_file_name].tolist()
    df.at[row_index[0], 'Silence Removed Duration'] = output.duration_seconds
    df.to_csv(metadata, index=False)

    return output_file_path

# Transcription Module
This module uses several Persian ASR models to get a list of reliable transcripts. Depending on your computational resources, particularly RAM, you can comment out some of the ASR models. We have commented out Vosk and Whisper for this notebook.

As explained in the original paper (link to be updated), the ASR models have been evaluated and sorted in order of their reliability. To check out the evaluation of these models, please refer to this repository (link to be updated).

In [None]:
import os
import re
import numpy as np
import librosa
import torch
import torchaudio
import wave
import json
# import pyaudioconvert as pac
import uuid
import csv
import pandas as pd
from pydub import AudioSegment
from hazm import word_tokenize
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
from hezar.models import Model as HezarModel
# from vosk import Model as VoskModel
# from vosk import KaldiRecognizer, SetLogLevel
# from speechbrain.inference.ASR import WhisperASR


In [None]:
cuda = "cuda:0"
wav2vec_v3_model_name = "m3hrdadfi/wav2vec2-large-xlsr-persian-v3"
device = torch.device(cuda if torch.cuda.is_available() else "cpu")
wav2vec_v3_processor = Wav2Vec2Processor.from_pretrained(wav2vec_v3_model_name)
wav2vec_v3_model = Wav2Vec2ForCTC.from_pretrained(wav2vec_v3_model_name).to(device)

wav2vec_fa_model_name = "masoudmzb/wav2vec2-xlsr-multilingual-53-fa"
wav2vec_fa_processor = Wav2Vec2Processor.from_pretrained(wav2vec_fa_model_name)
wav2vec_fa_model = Wav2Vec2ForCTC.from_pretrained(wav2vec_fa_model_name).to(device)

hezar_model = HezarModel.load("hezarai/whisper-small-fa").to(device)

# # You can set log level to -1 to disable debug messages
# SetLogLevel(0)
# vosk_model = VoskModel(model_name="vosk-model-fa-0.5")

# whisper_model = WhisperASR.from_hparams(source="speechbrain/asr-whisper-large-v2-commonvoice-fa", run_opts={"device":cuda}).to(device)

In [None]:
# Regular expression pattern to match words containing at least one word character
word_pattern = re.compile(r'\w+')

def get_word_count(text):
    words = word_tokenize(text)
    valid_words = [word for word in words if word_pattern.match(word)]
    return len(valid_words)

In [None]:
def wav2vec_v3_speech_file_to_array_fn(path):
    speech_array, sampling_rate = torchaudio.load(path)
    speech_array = speech_array.squeeze().numpy()
    speech_array = librosa.resample(np.asarray(speech_array), orig_sr=sampling_rate, target_sr=wav2vec_v3_processor.feature_extractor.sampling_rate)

    return speech_array

In [None]:
def wav2vec_v3_transcript(audio_file_path):
    speech = wav2vec_v3_speech_file_to_array_fn(audio_file_path)

    features = wav2vec_v3_processor(
        speech,
        sampling_rate=wav2vec_v3_processor.feature_extractor.sampling_rate,
        return_tensors="pt",
        padding=True
    )

    input_values = features.input_values.to(device)
    attention_mask = features.attention_mask.to(device)

    with torch.no_grad():
        logits = wav2vec_v3_model(input_values, attention_mask=attention_mask).logits

    pred_ids = torch.argmax(logits, dim=-1)

    predicted = wav2vec_v3_processor.batch_decode(pred_ids)
    return predicted[0]

In [None]:
def wav2vec_fa_speech_file_to_array_fn(path):
    speech_array, sampling_rate = torchaudio.load(path)
    speech_array = speech_array.squeeze().numpy()
    speech_array = librosa.resample(np.asarray(speech_array), orig_sr=sampling_rate, target_sr=wav2vec_fa_processor.feature_extractor.sampling_rate)

    return speech_array

In [None]:
def wav2vec_fa_transcript(audio_file_path):
    speech = wav2vec_fa_speech_file_to_array_fn(audio_file_path)

    features = wav2vec_fa_processor(
        speech,
        sampling_rate=wav2vec_fa_processor.feature_extractor.sampling_rate,
        return_tensors="pt",
        padding=True
    )

    input_values = features.input_values.to(device)
    attention_mask = features.attention_mask.to(device)

    with torch.no_grad():
        logits = wav2vec_fa_model(input_values, attention_mask=attention_mask).logits

    pred_ids = torch.argmax(logits, dim=-1)

    predicted = wav2vec_fa_processor.batch_decode(pred_ids)
    return predicted[0]

In [None]:
def hezar_transcript(audio_file_path):
    transcript = hezar_model.predict(audio_file_path)
    transcript = transcript[0]['text']

    return transcript.strip()

In [None]:
# def vosk_transcript(audio_file_path):
#     # Generate a unique identifier for the file names
#     unique_id = uuid.uuid4()

#     # Create an output file name with a random suffix
#     temp_output_file_name = f"{unique_id}.{audio_file_path.split('.')[-1]}"

#     # Load the audio file
#     audio = AudioSegment.from_mp3(audio_file_path)
#     # Export the audio in WAV format
#     audio.export(temp_output_file_name, format="wav")

#     # Convert the WAV file to 16-bit
#     pac.convert_wav_to_16bit_mono(temp_output_file_name, temp_output_file_name)

#     wf = wave.open(temp_output_file_name, "rb")

#     if wf.getnchannels() != 1 or wf.getsampwidth() != 2 or wf.getcomptype() != "NONE":
#         print("Audio file must be WAV format mono PCM.")
#         return ''

#     rec = KaldiRecognizer(vosk_model, wf.getframerate())
#     rec.SetWords(True)
#     rec.SetPartialWords(True)

#     while True:
#         data = wf.readframes(4000)
#         if len(data) == 0:
#             break
#         if rec.AcceptWaveform(data):
#             rec.Result()
#         else:
#             rec.PartialResult()

#     os.remove(temp_output_file_name)
#     return json.loads(rec.FinalResult())['text']

In [None]:
# def whisper_transcript(audio_file_path):
#     transcript = whisper_model.transcribe_file(audio_file_path)

#     symbolic_link_file_path = audio_file_path.split('/')[-1]
#     if os.path.islink(symbolic_link_file_path):
#         os.remove(symbolic_link_file_path)

#     return transcript

In [None]:
# Reliability rank of each ASR model: a lower value means a lower average CER
# asrs_dict = {'Vosk': 0, 'Wav2Vec-V3': 1, 'Wav2Vec-FA': 2, 'Whisper': 3, 'Hezar': 4}
asrs_dict = {'Wav2Vec-V3': 1, 'Wav2Vec-FA': 2, 'Hezar': 3}

# asr_transcripts = [('Hezar', hezar_transcript), ('Wav2Vec-V3', wav2vec_v3_transcript), ('Whisper', whisper_transcript), ('Vosk', vosk_transcript), ('Wav2Vec-FA', wav2vec_fa_transcript)]
asr_transcripts = [('Hezar', hezar_transcript), ('Wav2Vec-V3', wav2vec_v3_transcript), ('Wav2Vec-FA', wav2vec_fa_transcript)]

In [None]:
# Used to store the transcript of audio chunks
transcripts_metadata_path = 'transcripts_metadata.csv'

# Create the metadata file with header
if not os.path.exists(transcripts_metadata_path):
    with open(transcripts_metadata_path, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['Audio', *[asr_name for asr_name, _ in asr_transcripts]])

transcripts_metadata = pd.read_csv(transcripts_metadata_path)

In [None]:
def get_best_transcripts(audio_file_path, output_chunk_path='', log=False):
    transcripts = []

    # Get transcript of all ASRs
    for asr, asr_transcript in asr_transcripts:
        transcript = asr_transcript(audio_file_path)
        transcripts.append((asr, transcript, get_word_count(transcript)))

    # If log is True, save the transcripts to the transcripts_metadata csv file defined above
    if log:
        with open(transcripts_metadata_path, 'a', newline='') as csvfile:
            writer = csv.writer(csvfile)
            writer.writerow([output_chunk_path, *[transcript for _, transcript, _ in transcripts]])

    # Filter out corrupt transcripts like "من من من من من من من" with repetitive patterns
    transcripts_filter_for_repetition = [(asr, transcript, word_count) for (asr, transcript, word_count) in transcripts if not re.search(r'(.{1,5})(\1{5,})|(.{5,30})(\3{3,})|(.{30,})(\5{2,})', transcript)]

    # Sort transcripts based on length and get the max length
    transcripts_sorted_by_word_count = sorted(transcripts_filter_for_repetition, key=lambda x: x[2], reverse=True)
    max_word_count = transcripts_sorted_by_word_count[0][2]

    # Filter out transcripts that are shorter than 80% of the max length of transcripts
    transcripts_filtered_for_truncated = [(asr, transcript) for (asr, transcript, word_count) in transcripts_sorted_by_word_count if word_count >= 0.8 * max_word_count]

    # Sort transcripts based on the ASR reliability order defined above
    transcripts_sorted_by_least_avg_cer = sorted(transcripts_filtered_for_truncated, key=lambda x: asrs_dict[x[0]])

    return transcripts_sorted_by_least_avg_cer
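
The selection logic can be traced on hand-written transcripts without loading any model. In this sketch the transcripts are invented, `str.split` stands in for hazm's `word_tokenize`, and the ranks follow the five-model ordering from the commented-out dictionary above.

```python
import re

repetition_pattern = re.compile(r'(.{1,5})(\1{5,})|(.{5,30})(\3{3,})|(.{30,})(\5{2,})')
asr_rank = {'Vosk': 0, 'Wav2Vec-V3': 1, 'Wav2Vec-FA': 2, 'Whisper': 3, 'Hezar': 4}

# Hypothetical transcripts of the same audio chunk
candidates = [
    ('Hezar', 'the quick brown fox jumps over the lazy dog today'),
    ('Wav2Vec-V3', 'the quick brown fox jumps over the lazy dog'),
    ('Wav2Vec-FA', 'the quick brown'),           # truncated
    ('Whisper', 'من من من من من من من'),          # repetitive
]

# 1) Drop transcripts with repetitive patterns
non_repetitive = [(asr, t) for asr, t in candidates if not repetition_pattern.search(t)]

# 2) Drop transcripts shorter than 80% of the longest one (word count via str.split)
max_words = max(len(t.split()) for _, t in non_repetitive)
long_enough = [(asr, t) for asr, t in non_repetitive if len(t.split()) >= 0.8 * max_words]

# 3) Order the survivors by reliability rank
best = sorted(long_enough, key=lambda item: asr_rank[item[0]])
print([asr for asr, _ in best])
```

If every transcript were filtered out in the first step, `max` would raise here, just as indexing the empty sorted list would in the function above. Callers may want to guard against that case.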

# Start-End Alignment
There might be some mismatches at the start and end of the audio and text files. As an example, the speaker might read the title and author name while it is not in the text, or the text might have some additional sentences at the end which are not read by the speaker. This section attempts to match the audio and text files in the start and end by removing a few seconds/words from each.

In [None]:
import re
import os
from pydub import AudioSegment
from pydub.silence import split_on_silence
from hazm import Normalizer
from jiwer import cer

In [None]:
non_word_char_pattern = r"[\s«»()!*[\]\"'^\-+=_–—&<>;٫,.?:؛/{}،؟]"

def match_special_character(char):
    return bool(re.match(non_word_char_pattern, char))

In [None]:
def remove_non_word_chars(text):
    pattern = r'[^\w\s]'
    cleaned_text = re.sub(pattern, ' ', text)
    return cleaned_text

In [None]:
def remove_white_spaces(text):
    cleaned_text = re.sub(r'\s+', ' ', text)
    return cleaned_text.strip()

In [None]:
def get_word_only_text(text):
  word_only_text = remove_non_word_chars(text)
  extra_space_removed_text = remove_white_spaces(word_only_text)

  return extra_space_removed_text
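
As a quick check of what the cleaning helpers above do to Persian text, here is a compact sketch that folds the two steps into one function and runs it on two made-up samples. It also shows a side effect worth knowing: the zero-width non-joiner (U+200C) is neither a word character nor whitespace for Python's `re`, so it is turned into a regular space.

```python
import re

def get_word_only_text(text):
    # Replace every character that is neither a word character nor whitespace with a space
    without_punctuation = re.sub(r'[^\w\s]', ' ', text)
    # Collapse runs of whitespace and trim both ends
    return re.sub(r'\s+', ' ', without_punctuation).strip()

# Made-up sample with Persian punctuation and a double space
cleaned = get_word_only_text("«سلام»، دنیا!  خوبی؟")

# A word written with a zero-width non-joiner between its two parts
zwnj_word = get_word_only_text("می\u200cروم")

print(cleaned)
print(zwnj_word)
```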

In [None]:
def get_texts_cer_with_lookup(reference_text, transcript, lookup, direction='start'):
  """
  Returns the CER between the first or last `lookup` characters of the two texts, depending on `direction`
  """

  # Preprocess input texts to only contain word characters
  word_only_reference_text = get_word_only_text(reference_text)
  word_only_transcript = get_word_only_text(transcript)

  # Return +infinity for CER if any of the texts is empty
  if not word_only_reference_text.strip() or not word_only_transcript.strip():
    return float('inf')

  # Get CER of #lookup start characters of the texts
  if direction == 'start':
    return cer(word_only_reference_text[:lookup], word_only_transcript[:lookup])

  # Get CER of #lookup last characters of the texts
  else:
    return cer(word_only_reference_text[-lookup:], word_only_transcript[-lookup:])

In [None]:
def match_texts_from_start(reference_text, transcript, max_index=500, min_lookup=50, max_lookup=150):
    """
    Tries to find the best match to the transcript from the start of the reference text

    Arguments:
    - reference_text: ground truth text from which the true transcript is extracted
    - transcript: the hypothesis transcript from ASR
    - max_index: the maximum index to search for starting cut index
    - min_lookup: for short texts, the search stops early so that at least min_lookup characters remain for comparison
    - max_lookup: the maximum number of characters compared between the selected reference text and the hypothesis

    Returns:
    - CER of the parts matched
    - the index for which reference_text[i:] is the best match for the transcript
    """

    min_cer = float('inf')
    best_index = 0

    for i in range(min(len(reference_text) - min_lookup, max_index)):
        # Ensures the reference text is split in word boundaries only
        if i != 0 and not (match_special_character(reference_text[i - 1]) and not match_special_character(reference_text[i])): continue

        # Calculate maximum number of characters that are available to compare the selected text and hypothesis
        min_left = min(len(transcript), len(reference_text) - (i + 1))
        lookup = min(min_left, max_lookup)

        cer_value = get_texts_cer_with_lookup(reference_text[i:], transcript, lookup)

        # Store the best search result
        if cer_value < min_cer:
            min_cer = cer_value
            best_index = i

    return min_cer, best_index
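
To see the start search in action without the ASR models, the following simplified sketch slides over word starts in the reference and keeps the offset with the lowest CER. It is not the function above: `char_error_rate` is a plain Levenshtein-based stand-in assumed here in place of `jiwer.cer`, word boundaries are approximated with `str.isalnum`, and the names `best_start_offset` and `is_word_start` as well as the English sample text are hypothetical.

```python
def char_error_rate(reference, hypothesis):
    """Levenshtein distance divided by reference length (assumed stand-in for jiwer.cer)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(previous[j] + 1,          # drop a reference character
                               current[j - 1] + 1,       # insert a hypothesis character
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1] / len(reference)

def is_word_start(text, i):
    # A cut is allowed at index 0 or where a non-word character precedes a word character
    return i == 0 or (not text[i - 1].isalnum() and text[i].isalnum())

def best_start_offset(reference_text, transcript, max_index=500, lookup=40):
    best_cer, best_index = float('inf'), 0
    for i in range(min(len(reference_text), max_index)):
        if not is_word_start(reference_text, i):
            continue
        # Compare only the first `lookup` characters on both sides
        cer_value = char_error_rate(reference_text[i:i + lookup], transcript[:lookup])
        if cer_value < best_cer:
            best_cer, best_index = cer_value, i
    return best_cer, best_index

# The reference starts with a title line that the transcript does not contain
reference = "Chapter One, by Jane Doe. the quick brown fox jumps over the lazy dog"
transcript = "the quick brown fox jumps over the lazy dog"
best_cer, best_index = best_start_offset(reference, transcript)
print(best_cer, reference[best_index:])
```

With a real ASR transcript the best CER is rarely zero, which is why the pipeline works with a CER threshold rather than an exact match.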

In [None]:
# Keep the transcript fixed and tune the end index of the reference text for the lowest CER
def match_texts_from_end(reference_text, transcript, max_index=600, min_lookup=50, max_lookup=150):
    """
    Tries to find the best match to the transcript from the end of the reference text

    Arguments:
    - reference_text: ground truth text from which the true transcript is extracted
    - transcript: the hypothesis transcript from ASR
    - max_index: the maximum index to search for ending cut index
    - min_lookup: for short texts, the search stops early so that at least min_lookup characters remain for comparison
    - max_lookup: the maximum number of characters compared between the selected reference text and the hypothesis

    Returns:
    - CER of the parts matched
    - the index for which reference_text[:i] is the best match for the transcript
    """

    min_cer = float('inf')
    best_index = 0

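    # Walk candidate end indices backwards over at most the last max_index characters,
    # while keeping at least min_lookup characters before the cut for comparison
    for i in range(len(reference_text) - 1, max(min_lookup, len(reference_text) - max_index) - 1, -1):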
        # Calculate maximum number of characters that are available to compare the selected text and hypothesis
        min_left = min(len(transcript), i + 1)
        lookup = min(min_left, max_lookup)

        # Ensures the reference text is split in word boundaries only
        if i != len(reference_text) - 1 and not (match_special_character(reference_text[i]) and not match_special_character(reference_text[i + 1])): continue

        cer_value = get_texts_cer_with_lookup(reference_text[:i], transcript, lookup, direction='end')

        # Store the best search result
        if cer_value < min_cer:
            min_cer = cer_value
            best_index = i

    return min_cer, best_index

In [None]:
def match_with_transcript(reference_text, transcript, match_function):
    """
    Searches for the index of the reference text at which it best matches the transcript

    Arguments:
    - reference_text: ground truth text from which the true transcript is extracted
    - transcript: the hypothesis transcript from ASR
    - match_function: either match_texts_from_start or match_texts_from_end

    Returns:
    - CER of the best match found
    - index for which reference_text[i:] or reference_text[:i] is the best match to the transcript
    """

    processed_transcript = normalize_text(transcript)
    cer_value, index = match_function(reference_text, processed_transcript)
    return cer_value, index

In [None]:
def fill_parts_transcript(parts, transcripts, start_index, end_index, log_idx):
  """
    Takes a list of audio chunks with a corresponding list of extracted transcripts
    and fills in the missing transcripts in the range transcripts[start_index:end_index]
  """

  out_file = f"./temp_constructed_audio_{log_idx}.wav"


  for i in range(start_index, end_index):
    if transcripts[i] is None:
      parts[i].export(out_file, format="wav")
      audio_transcript = get_best_transcripts(out_file)[0][1]
      transcripts[i] = audio_transcript

  if os.path.exists(out_file): os.remove(out_file)

In [None]:
def get_parts_by_max_duration(parts, max_duration, direction='start'):
    """
    Takes a list of audio chunks and selects chunks from the start (or end) of the list until their total duration first exceeds `max_duration`
    """

    # Determines if the parts are going to be selected from start or end of list
    slice_step = 1 if direction == 'start' else -1

    selected_parts = []
    current_duration_seconds = 0

    for part in parts[::slice_step]:
      selected_parts.append(part)
      current_duration_seconds += part.duration_seconds

      # Stop as soon as the total duration of the selected chunks exceeds max_duration
      if current_duration_seconds > max_duration:
        break

    return selected_parts[::slice_step]
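
The selection rule is easiest to see on plain numbers. In the sketch below, `FakeChunk` is a hypothetical stand-in for a pydub `AudioSegment` that only carries `duration_seconds`, and the function is a condensed copy of the one above.

```python
from dataclasses import dataclass

@dataclass
class FakeChunk:
    """Hypothetical stand-in for an AudioSegment; only duration_seconds is used."""
    duration_seconds: float

def get_parts_by_max_duration(parts, max_duration, direction='start'):
    step = 1 if direction == 'start' else -1
    selected, total = [], 0
    for part in parts[::step]:
        selected.append(part)
        total += part.duration_seconds
        # The chunk that crosses max_duration is still included
        if total > max_duration:
            break
    # Restore the original chunk order
    return selected[::step]

chunks = [FakeChunk(4), FakeChunk(5), FakeChunk(6), FakeChunk(7)]
from_start = [c.duration_seconds for c in get_parts_by_max_duration(chunks, 8)]
from_end = [c.duration_seconds for c in get_parts_by_max_duration(chunks, 8, direction='end')]
print(from_start, from_end)
```

Because the crossing chunk is kept, the selected audio can be longer than `max_duration` by up to one chunk.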

In [None]:
def find_best_start_match(parts, reference_text, max_lookup_duration, max_segment_duration, cer_threshold=0.2, text_len_diff_threshold=10, log_idx=0):
    """
    Finds the best starting point for an audio regarding how well it matches the reference text

    Arguments:
    - parts: input audio chunks
    - reference_text: the ground truth text for most of the audio except for perhaps its start and end
    - max_lookup_duration: maximum time that the function searches in the audio to find best starting point for audio
    - max_segment_duration: maximum time that the function matches the audio from a start point with the text
    - cer_threshold: acceptable CER value showing a good match between audio and text
    - text_len_diff_threshold: the number of characters we may discard in the reference text to find a better CER value.
      More specifically, once a CER below cer_threshold has been found in the search, the audio and text are most probably a good match,
      and a better CER value with a smaller subpart of them may only be random because the ASR is not 100% correct. So there is a threshold
      on the number of characters we allow to be skipped for an even better CER value.

    Returns:
    - CER value of the found match
    - text index that gives best match
    - audio chunk index that gives best match
    """

    print("\tMatching Start...")
    # Selects as many input chunks for search as max_lookup_duration
    starting_parts = get_parts_by_max_duration(parts, max_lookup_duration)
    starting_transcripts = [None for _ in range(len(starting_parts))]

    # History of found matches
    results = []

    # If acceptable CER is already found in search
    cer_threshold_met = False

    # How many further characters do the matching functions suggest to remove from reference text
    text_len_diff = 0

    # best CER value, starting index of the reference text, and starting chunk of the audio
    best_cer, best_text_idx, best_audio_idx = (float('inf'), -1, -1)

    for i in range(len(starting_parts)):
        # Starting from an input audio chunk, select as many chunks for matching as max_segment_duration
        current_parts = get_parts_by_max_duration(starting_parts[i:], max_segment_duration)
        constructed_audio = sum(current_parts, AudioSegment.empty())
        # temp_constructed_audio_path = 'temp_constructed_audio.wav'
        temp_constructed_audio_path = f'temp_constructed_audio_{log_idx}.wav'
        constructed_audio.export(temp_constructed_audio_path, format='wav')

        # Get transcript for chunks that haven't already been processed
        fill_parts_transcript(starting_parts, starting_transcripts, i, i + len(current_parts), log_idx=log_idx)
        audio_transcript = ' '.join(starting_transcripts[i:i+len(current_parts)])

        # Given the selected chunks and the reference text, find best index of text that reference_text[idx:] is the best match
        cer_value, text_index = match_with_transcript(reference_text, audio_transcript, match_texts_from_start)

        print(f"\t\tPart [{i}]: Transcript= {audio_transcript}")
        print(f"\t\tMatched Text= {reference_text[text_index:text_index + len(audio_transcript)]}")
        print(f'\t\tCER={cer_value}')

        # Save to search history
        results.append((cer_value, text_index, i))

        if len(results) > 1:
          text_len_diff = abs(results[-2][1] - results[-1][1])
          if cer_threshold_met and (cer_value > results[-2][0] or text_len_diff > text_len_diff_threshold):
            # Once an acceptable CER has been found, stop when the CER gets worse or the text index jumps too far
            break

        # Set the flag that shows whether an acceptable CER has been found in the search results
        if cer_value < cer_threshold:
            cer_threshold_met = True

        # Save best search result
        best_cer, best_text_idx, best_audio_idx = (cer_value, text_index, i) if cer_value < best_cer else (best_cer, best_text_idx, best_audio_idx)

    if os.path.exists(temp_constructed_audio_path): os.remove(temp_constructed_audio_path)
    return best_cer, best_text_idx, best_audio_idx
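
The stopping rule described in the docstring can be isolated from the audio handling. The sketch below is one possible reading of it, run over precomputed `(cer_value, text_index)` pairs: once a CER below the threshold has been seen, the search stops at the first candidate that is worse than its predecessor or that moves the text index by more than `text_len_diff_threshold`. The function name `pick_best_candidate` and the sample values are hypothetical, and comparing against the previous candidate's CER is an assumption about the intended rule.

```python
def pick_best_candidate(candidates, cer_threshold=0.2, text_len_diff_threshold=10):
    """candidates: list of (cer_value, text_index) in search order."""
    best = (float('inf'), -1, -1)
    threshold_met = False
    previous = None
    for audio_index, (cer_value, text_index) in enumerate(candidates):
        if previous is not None and threshold_met:
            got_worse = cer_value > previous[0]
            jumped_too_far = abs(text_index - previous[1]) > text_len_diff_threshold
            if got_worse or jumped_too_far:
                # A later "improvement" at this point is more likely noise than a better match
                break
        if cer_value < cer_threshold:
            threshold_met = True
        if cer_value < best[0]:
            best = (cer_value, text_index, audio_index)
        previous = (cer_value, text_index)
    return best

# Made-up search history: the last candidate has the lowest CER but skips 93 characters
history = [(0.60, 0), (0.15, 42), (0.10, 47), (0.02, 140)]
best = pick_best_candidate(history)
print(best)
```

In this example the third candidate wins: the fourth has a lower CER, but it is rejected because it would discard far more of the reference text than the threshold allows.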

In [None]:
def find_best_end_match(parts, reference_text, max_lookup_duration, max_segment_duration, cer_threshold=0.2, text_len_diff_threshold=10, log_idx=0):
    """
    Finds the best ending point for an audio regarding how well it matches the reference text

    Arguments:
    - parts: input audio chunks
    - reference_text: the ground truth text for most of the audio except for perhaps its start and end
    - max_lookup_duration: maximum time that the function searches in the audio to find best ending point for audio
    - max_segment_duration: maximum time that the function matches the audio from an end point with the text
    - cer_threshold: acceptable CER value showing a good match between audio and text
    - text_len_diff_threshold: the number of characters we may discard in the reference text to find a better CER value.
      More specifically, once a CER below cer_threshold has been found in the search, the audio and text are most probably a good match,
      and a better CER value with a smaller subpart of them may only be random because the ASR is not 100% correct. So there is a threshold
      on the number of characters we allow to be skipped for an even better CER value.

    Returns:
    - CER value of the found match
    - text index that gives best match
    - audio chunk index that gives best match
    """

    print("\tMatching End...")
    # Selects as many input chunks for search as max_lookup_duration
    ending_parts = get_parts_by_max_duration(parts, max_lookup_duration, direction='end')
    ending_transcripts = [None for _ in range(len(ending_parts))]

    # History of found matches
    results = []

    # If acceptable CER is already found in search
    cer_threshold_met = False

    # How many further characters do the matching functions suggest to remove from reference text
    text_len_diff = 0

    # best CER value, ending index of the reference text, and ending chunk of the audio
    best_cer, best_text_idx, best_audio_idx = (float('inf'), -1, -1)

    for i in range(len(ending_parts)):
        # Ending at an input audio chunk, select as many chunks for matching as max_segment_duration
        current_parts = get_parts_by_max_duration(ending_parts[:len(ending_parts) - i], max_segment_duration, direction='end')
        constructed_audio = sum(current_parts, AudioSegment.empty())

        # constructed_audio.export('temp_constructed_audio.wav', format='wav')
        constructed_audio.export(f'temp_constructed_audio_{log_idx}.wav', format='wav')


        # Get transcript for chunks that haven't already been processed
        fill_parts_transcript(ending_parts, ending_transcripts, len(ending_parts) - i - len(current_parts), len(ending_parts) - i, log_idx=log_idx)

        audio_transcript = ' '.join(ending_transcripts[len(ending_parts) - i - len(current_parts):len(ending_parts) - i])
        cer_value, text_index = match_with_transcript(reference_text, audio_transcript, match_texts_from_end)

        print(f"\t\tPart [{i}]: Transcript= {audio_transcript}")
        print(f"\t\tMatched Text= {reference_text[text_index - len(audio_transcript):text_index]}")
        print(f'\t\tCER={cer_value}')

        # Save to search history
        results.append((cer_value, text_index, i))

        if len(results) > 1:
          text_len_diff = abs(results[-2][1] - results[-1][1])
          if cer_threshold_met and (cer_value > results[-2][0] or text_len_diff > text_len_diff_threshold):
            # Once an acceptable CER has been found, stop when the CER gets worse or the text index jumps too far
            break

        if cer_value < cer_threshold:
            cer_threshold_met = True

        # Save best search result
        best_cer, best_text_idx, best_audio_idx = (cer_value, text_index, i) if cer_value < best_cer else (best_cer, best_text_idx, best_audio_idx)

    return best_cer, best_text_idx, best_audio_idx

In [None]:
def split_audio_on_silence(audio_file, silence_threshold=-40, min_silence_len=300):
    audio = AudioSegment.from_file(audio_file, format="wav")
    parts = split_on_silence(audio, min_silence_len=min_silence_len, silence_thresh=silence_threshold, keep_silence=True)
    return parts

In [None]:
def start_end_align(audio_file_path, reference_text_path, output_dir_path, max_lookup_duration=60, max_segment_duration=15, cer_threshold=0.2, log_idx=0):
    """
    Finds best starting/ending point for a pair of audio and text files regarding how well they match

    Arguments:
    - audio_file_path: path to the audio to match with the text
    - reference_text_path: path to the ground truth text to match with the audio
    - output_dir_path: directory where the aligned audio (.wav) and text (.txt), potentially trimmed from start/end, are written
    - max_lookup_duration: max time duration to search from start/end of the audio for best match
    - max_segment_duration: duration of the audio that is compared with the text to decide whether they match
    - cer_threshold: CER value below which a match is considered acceptable
    - log_idx: index used in the names of temporary files

    Returns:
    - path to the aligned audio file and path to the aligned text file
    """
    input_audio_name = audio_file_path.split('/')[-1].split('.')[0]
    output_audio_path = os.path.join(output_dir_path, input_audio_name + '.wav')
    output_text_path = os.path.join(output_dir_path, input_audio_name + '.txt')

    if os.path.exists(output_audio_path):
       print(f"Skipping file {input_audio_name}.wav. Start-End aligned audio exists.")
       return output_audio_path, output_text_path

    reference_text = ''

    with open(reference_text_path, 'r') as f:
      reference_text = f.read()

    # Match start of audio with start of reference text
    parts = split_audio_on_silence(audio_file_path)

    best_match_start = find_best_start_match(parts, reference_text, max_lookup_duration, max_segment_duration, cer_threshold, log_idx=log_idx)

    # The search returns -1 as the audio chunk index when no candidate match was found
    if best_match_start[2] < 0:
      print("\x1b[31m\"The audio and text are not a match!\"\x1b[0m")
      raise Exception("The audio and text are not a match!")

    # Output start matching result in a temp file to use for end matching
    constructed_audio_start = sum(parts[best_match_start[2]:], AudioSegment.empty())
    temp_start_aligned_filename = f"./temp_aligned_start_audio_{log_idx}.wav"
    constructed_audio_start.export(temp_start_aligned_filename, format="wav")

    # Update the reference text with the start match result
    reference_text_start = reference_text[best_match_start[1]:]

    # Match end of audio with end of reference text
    parts = split_audio_on_silence(temp_start_aligned_filename)
    best_match_end = find_best_end_match(parts, reference_text_start, max_lookup_duration, max_segment_duration, cer_threshold, log_idx=log_idx)

    # The search returns -1 as the audio chunk index when no candidate match was found
    if best_match_end[2] < 0:
      print("\x1b[31m\"The audio and text are not a match!\"\x1b[0m")
      raise Exception("The audio and text are not a match!")

    # Export the final start/end aligned audio file
    constructed_audio_end = sum(parts[:len(parts) - best_match_end[2]], AudioSegment.empty())
    constructed_audio_end.export(output_audio_path, format="wav")

    # Export the final start/end aligned text file
    aligned_reference_text = reference_text_start[:best_match_end[1]]
    with open(output_text_path, 'w') as f:
      f.write(aligned_reference_text)

    if os.path.exists(temp_start_aligned_filename): os.remove(temp_start_aligned_filename)

    return output_audio_path, output_text_path

# Forced Alignment
To prepare the audio and text files for tasks such as text-to-speech and speech-to-text, we must break them into smaller chunks of a few seconds of audio paired with the few matching words. This is the forced alignment task. [Aeneas](https://github.com/readbeyond/aeneas) is an excellent tool for this task, and you can see an example of Persian forced alignment using this tool in [this repository](https://github.com/MahtaFetrat/VirgoolInformal-Speech-Dataset). However, as stated in its limitations, Aeneas does not work well when the audio and text files are not exact matches. So we implemented our own method of forced alignment as follows.

In [None]:
import os
import re
import time
import csv
from jiwer import cer
from pydub import AudioSegment
from pydub.silence import split_on_silence
from hazm import Normalizer
from hazm import word_tokenize

In [None]:
def get_texts_cer(reference_text, transcript):
  """
  Returns the CER between the two texts
  """

  # Preprocess input texts to only contain word characters
  word_only_reference_text = get_word_only_text(reference_text)
  word_only_transcript = get_word_only_text(transcript)

  # Return +infinity for CER if any of the texts is empty
  if not word_only_reference_text.strip() or not word_only_transcript.strip():
    return float('inf')

  return cer(word_only_reference_text, word_only_transcript)


In [None]:
def concat_parts_to_min_duration(parts, min_part_duration=4):
  """
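  Merges consecutive input audio chunks until each merged chunk reaches a minimum duration and returns the list of merged chunks.
  Trailing chunks that together stay below the minimum duration are not included in the result.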
  """

  new_parts = []
  current_duration = 0
  current_parts = []

  for part in parts:
    current_parts.append(part)
    current_duration += part.duration_seconds

    if current_duration >= min_part_duration:
      new_parts.append(sum(current_parts))
      current_parts = []
      current_duration = 0

  return new_parts

In [None]:
def cut_part_in_half(part, min_duration=2, min_silence_len=10, silence_thresh=-50):
  """
  Divides a given audio chunk into two almost equal subparts

  Arguments:
  - part: audio chunk to divide
  - min_duration: min duration of resulting chunks (seconds)
  - min_silence_len: min silence len used to split audio on silent moments (milliseconds)
  - silence_thresh: silence threshold used to split audio on silent moments (dBFS)
  """

  # Get total duration of audio
  part_duration = part.duration_seconds

  # Split the audio into subparts on silent moments that are candidate points for dividing the audio
  subparts = split_on_silence(part, min_silence_len=min_silence_len, silence_thresh=silence_thresh, keep_silence=True)

  # Duration of currently selected subparts to construct first half
  current_duration = 0

  for i, subpart in enumerate(subparts):
    current_duration += subpart.duration_seconds

    # Check if currently selected subparts have passed half of the original duration
    if current_duration > part.duration_seconds / 2:

      # Accept this division if the remainder of the audio still satisfies min_duration
      if part_duration - current_duration > min_duration:
        return [sum(subparts[:i + 1]), sum(subparts[i + 1:])]

      # If selecting the last chunk has left less than min_duration for the second part,
      # select chunks before it if min_duration is not violated
      if current_duration - subpart.duration_seconds > min_duration:
        return [sum(subparts[:i]), sum(subparts[i:])]

      # If neither of natural (silence based) division above worked, split audio from middle moment
      half = len(part) // 2
      return [part[:half], part[half:]]

In [None]:
def split_part_to_max_duration(part, min_duration=2, max_duration=12, min_silence_len=10, silence_thresh=-50):
  """
  Splits the audio into smaller chunks by recursively dividing it into two almost equal subparts until all chunks are no longer than max_duration

  Arguments:
  - min_duration: minimum acceptable chunk duration
  - max_duration: maximum acceptable chunk duration
  - min_silence_len: min silence len used to split audio on silent moments
  - silence_thresh: silence threshold used to split audio on silent moments
  """

  # Return if input audio is already below max_duration
  if part.duration_seconds <= max_duration:
    return [part]

  # Recursively divide the audio to two almost equal subparts otherwise
  part1, part2 = cut_part_in_half(part, min_duration, min_silence_len, silence_thresh)
  return  split_part_to_max_duration(part1, min_duration, max_duration, min_silence_len, silence_thresh) + \
          split_part_to_max_duration(part2, min_duration, max_duration, min_silence_len, silence_thresh)

In [None]:
def split_parts_to_max_duration(parts, min_duration=2, max_duration=12, min_silence_len=10, silence_thresh=-50):
  """
  Merely calls split_part_to_max_duration on a list of input audio and returns a list of audio chunks all below max_duration
  """

  new_parts = []

  for part in parts:
    new_parts.extend(split_part_to_max_duration(part, min_duration, max_duration, min_silence_len, silence_thresh))

  return new_parts

In [None]:
def get_audio_chunks(audio_file_path, min_chunk_duration=2, max_chunk_duration=12, min_silence_len_normal=150, silence_thresh=-50, min_silence_len_exessive=10):
  """
  Splits the input audio file to chunks on silent moments, then merges the chunks to ensure a minimum duration
  and breaks down large chunks to ensure a maximum duration

  Arguments:
  - audio_file_path: path to audio to be chunked
  - min_chunk_duration: minimum acceptable chunk duration
  - max_chunk_duration: maximum acceptable chunk duration
  - min_silence_len_normal: min silence len used to split the original audio into chunks at the first attempt
    More specifically, we first try to split the audio into chunks at silent moments that are longer and correspond to more natural stops.
    Then we relax this min duration to meet the min/max chunk durations with riskier silent moments.
  - silence_thresh: silence threshold used to split audio on silent moments
  - min_silence_len_exessive: min silence len used to split audio chunks further to ensure max_duration

  Returns:
  - a list of properly sized audio chunks, split on natural silent moments whenever possible
  """

  # Use the file extension as the audio format (avoids shadowing the built-in format)
  audio_format = audio_file_path.split('.')[-1]
  audio = AudioSegment.from_file(audio_file_path, format=audio_format)

  # Split on the more reliable and longer silent moments
  parts = split_on_silence(audio, min_silence_len=min_silence_len_normal, silence_thresh=silence_thresh, keep_silence=True)

  # Merge small audio chunks to ensure min_duration
  parts_concat_to_min_duration = concat_parts_to_min_duration(parts, min_chunk_duration)

  # Further split large chunks to ensure max_duration
  parts_split_to_max_duration = split_parts_to_max_duration(parts_concat_to_min_duration, min_chunk_duration, max_chunk_duration, min_silence_len_exessive, silence_thresh)

  return parts_split_to_max_duration
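
The effect of the merge-then-split strategy on chunk sizes can be sketched on durations alone. The example below is a simplification under stated assumptions: chunks are plain numbers of seconds instead of audio segments, and an oversized chunk is always cut exactly in half, whereas the functions above prefer a silent moment near the middle. All function names in the sketch are hypothetical.

```python
def merge_to_min_duration(durations, min_duration=2):
    """Accumulate consecutive chunks until the running total reaches min_duration."""
    merged, current = [], 0.0
    for duration in durations:
        current += duration
        if current >= min_duration:
            merged.append(current)
            current = 0.0
    return merged

def split_to_max_duration(duration, max_duration=12):
    """Halve a chunk recursively until every piece is at most max_duration."""
    if duration <= max_duration:
        return [duration]
    half = duration / 2
    return split_to_max_duration(half, max_duration) + split_to_max_duration(half, max_duration)

def chunk_durations(durations, min_duration=2, max_duration=12):
    chunks = []
    for merged in merge_to_min_duration(durations, min_duration):
        chunks.extend(split_to_max_duration(merged, max_duration))
    return chunks

# Three short chunks, one 30-second chunk without usable pauses, and one normal chunk
result = chunk_durations([0.5, 1.0, 1.0, 30.0, 3.0])
print(result)
```

The three short chunks are merged into one 2.5-second chunk, the 30-second chunk is halved twice into four 7.5-second chunks, and the 3-second chunk passes through unchanged.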

In [None]:
def match_transcript(transcript, text, max_lookup=500, max_removal=300, cer_threshold_1=0.05, cer_threshold_2=0.2, backtrack_cer=0.8, bactrack_search1_coeff=1.7, backtrack_search2_coeff=1.2, log_idx=0):
  """
  Given a hypothesis transcript and the ground truth text, selects a subpart of the text that best matches the transcript.

  Two search methods are used:
  1- Interval Search: looks for all possible intervals like text[s:i]
  2- Gapped Search: looks for all subparts formed as text[s:j] + text[k:i]

  The first search is less computationally expensive, so getting a good result in this search is preferred.
  In case no good match like text[s:i] is found, the second search method is run.

  Arguments:
  - transcript: the hypothesis (uncertain/ASR output) transcript for which we want to find a matching text
  - text: the ground truth text in which to look for the transcript
  - max_lookup: number of characters on which we search for matching texts
  - max_removal: the maximum difference between j and k, i.e. the maximum gap size in the second search
  - cer_threshold_1: the CER value that causes an early stop of the search whenever reached
  - cer_threshold_2: the CER value that prevents the Gapped Search if reached in Interval Search,
    this value is also the maximum CER that is acceptable after the entire search process
  - backtrack_cer: used to backtrack some combinations in the Gapped Search. If the CER of text[s:i] is already more than this value,
    it won't look for j and k indices
  - bactrack_search1_coeff: also used to backtrack some combinations in the Gapped Search. If the CER of text[s:i] is already more than
   the best CER in Interval Search multiplied by this value, it won't look for j and k indices
  - backtrack_search2_coeff: also used to backtrack some combinations in the Gapped Search. If the CER of text[s:i] is already more than
   the best CER in Gapped Search multiplied by this value, it won't look for j and k indices

  Returns:
  - min_cer: CER of the matching text found
  - matching_text: the text subpart selected from the input text as the best matching transcript
  - end_pointer: pointer indicating how much we have proceeded in the input text already to match with audio files. Used in the upper function call.
  - quality: a quality level, either "HIGH" for CER values up to cer_threshold_1, "MIDDLE" for CER values between cer_threshold_1 and cer_threshold_2, or "REJECT" for CER values above cer_threshold_2 or when the search times out
  - search stage: 1 if the result was returned after the Interval Search, 2 if it was returned after the Gapped Search, 0 if the search timed out
  """

  start_time = time.time()

  min_cer = float('inf')
  matching_text = ''
  end_pointer = -1

  print(f"\tTranscript:\t\t {transcript}")

  # Interval Search: search on text[s:i] subparts
  for s in range(min(len(text), max_removal)):

    # Ensures the reference text is split in word boundaries only
    if s != 0 and not (match_special_character(text[s - 1]) and not match_special_character(text[s])): continue

    # Include i == len(text) so that an interval may end at the very end of the text
    for i in range(s + 1, min(len(text) + 1, s + 1 + max_lookup)):

        # Ensures the reference text is split in word boundaries only
        if not i == len(text) and not (match_special_character(text[i - 1]) and not match_special_character(text[i])): continue

        cer_value = get_texts_cer(text[s:i], transcript)

        # Save best search results
        if cer_value < min_cer:
            min_cer = cer_value
            end_pointer = i
            matching_text = text[s:i]

        # Early stop the search if CER value below cer_threshold_1
        if cer_value <= cer_threshold_1:
            print(f"\033[92m\tInterval Search: Matching Text:{matching_text}\033[0m")
            print(f"\033[92m\tInterval Search: CER: {min_cer}, End Pointer: {end_pointer}\033[0m")
            return min_cer, matching_text, end_pointer, "HIGH", 1

  # Dont proceed to Gapped Search if the CER value from Interval Search is below cer_threshold_2
  if min_cer <= cer_threshold_2:
    print(f"\033[93m\tInterval Search: Matching Text: {matching_text}\033[0m")
    print(f"\033[93m\tInterval Search: CER: {min_cer}, End Pointer: {end_pointer}\033[0m")
    return min_cer, matching_text, end_pointer, "MIDDLE", 1

  # If Interval Search not accepted, just log the best results
  print(f"\tInterval Search: Matching Text: {matching_text}")
  print(f"\tInterval Search: CER: {min_cer}, End Pointer: {end_pointer}")
  interval_search_cer = min_cer


  # Gapped Search: search on text[s:j] + text[k:i] subparts
  gapped_search_cer = float('inf')
  for s in range(min(len(text), max_removal)):

    # Ensures the reference text is split in word boundaries only
    if s != 0 and not (match_special_character(text[s - 1]) and not match_special_character(text[s])): continue

    for i in range(s + 1, min(len(text) + 1, s + 1 + max_lookup)):

      # Ensures the reference text is split in word boundaries only
      if not i == len(text) and not (match_special_character(text[i - 1]) and not match_special_character(text[i])): continue

      # Back track from searching j and k if CER value above the defined thresholds
      interval_cer = get_texts_cer(text[s:i], transcript)
      if interval_cer > backtrack_cer or\
        interval_cer > interval_search_cer * bactrack_search1_coeff or \
        interval_cer > gapped_search_cer * backtrack_search2_coeff: continue

      for j in range(s + 1, i):

        # Ensures the reference text is split in word boundaries only
        if not (match_special_character(text[j - 1]) and not match_special_character(text[j])): continue

        for k in range(j + 1, min(j + max_removal, i)):

          # Ensures the reference text is split in word boundaries only
          if not (match_special_character(text[k - 1]) and not match_special_character(text[k])): continue

          cer_value = get_texts_cer(text[s:j] + text[k:i], transcript)

          # Update Gapped Search best results used for backtracking
          gapped_search_cer = min(gapped_search_cer, cer_value)

          # Save best results of entire search
          if cer_value < min_cer:
              min_cer = cer_value
              end_pointer = i
              matching_text = text[s:j] + text[k:i]

          # Early stop the search if CER value below cer_threshold_1
          if cer_value <= cer_threshold_1:
              print(f"\033[92m\tGapped Search: Matching Text:{matching_text}\033[0m")
              print(f"\033[92m\tGapped Search: CER: {min_cer}, End Pointer: {end_pointer}\033[0m")
              return min_cer, matching_text, end_pointer, "HIGH", 2

          if time.time() - start_time > 900:
            print("Search timeout. Quitting...")
            return min_cer, matching_text, end_pointer, "REJECT", 0

  # Accept the result with "MIDDLE" quality if the best CER value of both searches is at most cer_threshold_2
  if min_cer <= cer_threshold_2:
    print(f"\033[93m\tBoth Searches: Matching Text:{matching_text}\033[0m")
    print(f"\033[93m\tBoth Searches: CER: {min_cer}, End Pointer: {end_pointer}\033[0m")

    return min_cer, matching_text, end_pointer, "MIDDLE", 2


  # Return with "REJECT" status meaning no good matching text found
  print(f"\033[91m\tBoth Searches: Matching Text:{matching_text}\033[0m")
  print(f"\033[91m\tBoth Searches: CER: {min_cer}, End Pointer: {end_pointer}\033[0m")
  return min_cer, matching_text, end_pointer, "REJECT", 2
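
The Gapped Search above can be illustrated on a toy example. This is only a minimal sketch under stated assumptions: `edit_distance`, `toy_cer`, and `toy_gapped_search` are hypothetical helpers written for this illustration, the CER is assumed to be a plain character edit distance divided by the reference length (which may differ from `get_texts_cer`), word starts are detected with spaces only, and no thresholds, backtracking, or timeout are applied.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def toy_cer(reference, hypothesis):
    """Edit distance normalised by the reference length (assumed CER definition)."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)


def toy_gapped_search(text, transcript):
    """Try every text[:j] + text[k:i] whose cut points fall on word starts."""
    # A cut point is valid if the previous character is a space and the current one is not.
    starts = [p for p in range(1, len(text)) if text[p - 1] == " " and text[p] != " "]
    best = (float("inf"), "")
    for i in starts + [len(text)]:  # the end of the text is also a valid end point
        for j in starts:
            for k in starts:
                if not j < k < i:
                    continue
                # Drop text[j:k] (the gap) and compare the remainder with the transcript.
                candidate = (text[:j] + text[k:i]).strip()
                best = min(best, (toy_cer(candidate, transcript), candidate))
    return best


# The reference contains a page marker that the speaker did not read aloud.
reference = "the quick brown [page 12] fox jumps over"
hypothesis = "the quick brown fox jumps"
best_cer, best_text = toy_gapped_search(reference, hypothesis)
print(best_cer, repr(best_text))
```

In this toy case the search removes the unread `[page 12]` span and stops before the trailing word, which is the situation the gap indices `j` and `k` are meant to handle.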

In [None]:
def get_word_count(text):
    words = word_tokenize(text)
    valid_words = [word for word in words if word_pattern.match(word)]
    return len(valid_words)

In [None]:
def match_audio(audio, output_chunk_path, text, max_lookup=500, max_removal=300, cer_threshold_1=0.05, cer_threshold_2=0.2, backtrack_cer=0.8, bactrack_search1_coeff=1.7, backtrack_search2_coeff=1.2, log_idx=0):
    """
      Given an audio chunk and the ground truth text, selects the subpart of the text that best matches the audio.
      It tries the transcripts from the ASR models in order until one is accepted.

      Arguments:
      - audio: the audio segment for which we want to find a matching text
      - output_chunk_path: path of the exported audio chunk, forwarded to get_best_transcripts
      - text: the ground truth text in which to look for the transcript
      - max_lookup: number of characters over which we search for matching texts
      - max_removal: the maximum difference between j and k, i.e. the maximum gap size in the Gapped Search
      - cer_threshold_1: the CER value that causes an early stop of the search whenever it is reached
      - cer_threshold_2: the CER value that prevents the Gapped Search if reached in the Interval Search;
        this value is also the maximum CER that is acceptable after the entire search process
      - backtrack_cer: used to prune some combinations in the Gapped Search. If the CER of text[s:i] already exceeds this value,
        the search will not look for j and k indices
      - bactrack_search1_coeff: also used to prune some combinations in the Gapped Search. If the CER of text[s:i] already exceeds
        the best CER of the Interval Search multiplied by this value, the search will not look for j and k indices
      - backtrack_search2_coeff: also used to prune some combinations in the Gapped Search. If the CER of text[s:i] already exceeds
        the best CER of the Gapped Search multiplied by this value, the search will not look for j and k indices
      - log_idx: suffix used to name the temporary WAV file, so that parallel runs do not overwrite each other's files

      Returns:
      - min_cer: CER of the matching text found
      - transcript: the hypothesis transcript from ASR for the given audio chunk
      - matching_text: the text subpart selected from the input text as the best matching transcript
      - word count: number of valid words in matching_text (see get_word_count)
      - end_pointer: pointer indicating how far we have advanced in the input text while matching audio files. Used by the calling function.
      - quality: a quality level, either "HIGH" for CER values up to cer_threshold_1, "MIDDLE" for CER values between cer_threshold_1 and cer_threshold_2, or "REJECT" for CER values above cer_threshold_2
      - search_type: identifier of the search that produced the result, as returned by match_transcript (-1 if no transcript was available)
      - time: processing time of the input audio chunk alignment
      - asrs: the ASR models that were tried, in order
    """

    start_time = time.time()

    tempfile = f"temp_{log_idx}.wav"
    audio.export(tempfile, format="wav")
    transcripts = get_best_transcripts(tempfile, output_chunk_path, log=True)
    os.remove(tempfile)

    # No ASR model produced a usable transcript: reject the chunk
    if not transcripts:
        return float('inf'), '', '', 0, -1, 'REJECT', -1, time.time() - start_time, []

    asrs = []
    for asr, transcript in transcripts:
      min_cer, matching_text, end_pointer, status, search_type = match_transcript(transcript, text, max_lookup, max_removal, cer_threshold_1, cer_threshold_2, backtrack_cer, bactrack_search1_coeff, backtrack_search2_coeff, log_idx)
      asrs.append(asr)
      if status != 'REJECT': break

    return min_cer, transcript, matching_text, get_word_count(matching_text), end_pointer, status, search_type, time.time() - start_time, asrs
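
The quality levels and the ASR fallback described in the docstring above can be sketched as follows. This is a simplified illustration, not the notebook's implementation: `quality_from_cer` and `first_accepted` are hypothetical names, the ASR names are placeholders, and each candidate is assumed to arrive with an already computed CER instead of going through `match_transcript`.

```python
def quality_from_cer(cer, cer_threshold_1=0.05, cer_threshold_2=0.2):
    """Map a CER value to the quality labels used in the metadata."""
    if cer <= cer_threshold_1:
        return "HIGH"
    if cer <= cer_threshold_2:
        return "MIDDLE"
    return "REJECT"


def first_accepted(candidates):
    """candidates: list of (asr_name, cer) pairs, ordered from most to least preferred."""
    tried = []
    quality = "REJECT"
    for asr_name, cer in candidates:
        tried.append(asr_name)
        quality = quality_from_cer(cer)
        if quality != "REJECT":
            break  # stop at the first transcript that is not rejected
    return quality, tried


# Placeholder ASR names and CER values, for illustration only.
quality, tried = first_accepted([("asr_a", 0.40), ("asr_b", 0.12), ("asr_c", 0.01)])
print(quality, tried)
```

Note that the loop stops at the first accepted transcript, so a later model with a lower CER (`asr_c` here) is never consulted.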


In [None]:
def forced_align(audio_file_path, text_file_path, alignment_dir, metadata, processed_files, log_idx):
  """
  Given an input audio file path and a ground truth text file path, splits the audio into chunks and searches the text for the subpart that matches each chunk.
  It exports each audio chunk and its matching text as files to the alignment dir and appends the results to the metadata file given as input.
  """
  # Get file name used for logging
  audio_filename = audio_file_path.split('/')[-1].split('.')[-2]

  # Create a directory for aligned chunk files of the audio
  audio_alignment_dir = os.path.join(alignment_dir, audio_filename)

  # Stop if already aligned
  if audio_filename in processed_files:
    print(f"Skipping file {audio_filename}.wav. Forced alignment directory exists.")
    return audio_alignment_dir

  os.makedirs(audio_alignment_dir, exist_ok=True)

  text = ''
  with open(text_file_path, 'r') as f: text = f.read()
  audio_chunks = get_audio_chunks(audio_file_path)

  text_pointer = 0
  max_lookup = 500  # number of characters on which we search for matching texts
  max_lookup_stepsize = 500  # constant increase added to max_lookup to recover from previously unmatched characters in the reference text

  for i, chunk in enumerate(audio_chunks, start=1):
    print(f"Matching chunk {i}/{len(audio_chunks)} of {audio_filename}.wav...")

    chunk_audio_path = os.path.join(audio_alignment_dir, f'{audio_filename}-{i}.wav')

    # Search for best matching text from reference text
    min_cer, transcript, matching_text, matching_text_word_count, pointer_shift, quality, search_type, processing_time, asrs = match_audio(chunk, chunk_audio_path, text[text_pointer:], max_lookup, log_idx=log_idx)

    # Write search results to metadata
    with open(metadata, 'a', newline='') as csvfile:
      writer = csv.writer(csvfile)
      writer.writerow([audio_alignment_dir, f'{audio_filename}-{i}.wav', matching_text, matching_text_word_count, chunk.duration_seconds, 0, min_cer, quality, transcript, processing_time, search_type, asrs])

    # Export the audio chunk
    print(f"\tWriting audio chunk {i}...")
    chunk.export(chunk_audio_path, format="wav")

    if quality == "REJECT":
      # If no matching text was found, there might be some leftover text at start.
      # We increase max_lookup to recover from this mismatch
      max_lookup += max_lookup_stepsize

    else:
      # If a matching text was found, simply move the starting pointer of the reference text and reset max_lookup to normal
      max_lookup = max_lookup_stepsize
      text_pointer += pointer_shift - 15  # Let about one word be mistakenly included in last match


    # Export the matching text to a file
    print(f"\tWriting text chunk {i}...")
    segment_text_path = os.path.join(audio_alignment_dir, f"{audio_filename}-{i}.txt")
    with open(segment_text_path, 'w', encoding='utf-8') as segment_file:
        segment_file.write(matching_text.strip())

  return audio_alignment_dir
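
The pointer and lookup-window bookkeeping in the loop above can be isolated into a small sketch. `advance` is a hypothetical helper introduced only for this illustration; the step size of 500 and the overlap of 15 characters are the values used in `forced_align`, and the sequence of results is made up.

```python
def advance(text_pointer, max_lookup, quality, pointer_shift, step=500, overlap=15):
    """Return the (text_pointer, max_lookup) pair to use for the next chunk."""
    if quality == "REJECT":
        # Nothing matched: keep the pointer and widen the search window.
        return text_pointer, max_lookup + step
    # A match was found: move forward, but step back by roughly one word,
    # and reset the search window to its normal size.
    return text_pointer + pointer_shift - overlap, step


# Made-up (quality, pointer_shift) results for four consecutive chunks.
results = [("HIGH", 120), ("REJECT", -1), ("REJECT", -1), ("MIDDLE", 300)]
state = (0, 500)
history = []
for quality, shift in results:
    state = advance(*state, quality, shift)
    history.append(state)
print(history)
```

Each rejected chunk leaves the pointer in place and widens the window by one step, and the first accepted chunk afterwards resets the window.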

# Process Data
Here we define the full processing pipeline. It contains three main steps:



1.   **Pre-processing:** Format conversion, background music removal, and text pre-processing
2.   **Alignment:** Start-end alignment and forced alignment
3.   **Post-processing:** Silence removal and stereo-to-mono conversion



In [None]:
import os
import shutil
import csv
import argparse

In [None]:
def pre_process(input_audio_path, input_text_path, output_dir_path, log_idx):
    # pre-process audio
    output_audio_file_path = convert_mp3_to_wav(input_audio_path, output_dir_path)
    output_audio_file_path = remove_background_music(output_audio_file_path, output_dir_path, log_idx=log_idx)

    # pre-process text
    output_text_file_path = process_text(input_text_path, output_dir_path)

    return output_audio_file_path, output_text_file_path

In [None]:
def align(input_audio_path, input_text_path, output_start_end_align_dir_path, output_forced_align_dir_path, metadata, force_aligned_files, log_idx):
    output_audio_file_path, output_text_file_path = start_end_align(input_audio_path, input_text_path, output_start_end_align_dir_path, log_idx=log_idx)
    audio_forced_align_dir_path = forced_align(output_audio_file_path, output_text_file_path, output_forced_align_dir_path, metadata, force_aligned_files, log_idx=log_idx)

    return audio_forced_align_dir_path

In [None]:
def post_process(forced_aligned_dir_path, post_processed_dir_path, metadata):
    for filename in os.listdir(forced_aligned_dir_path):
        src_path = os.path.join(forced_aligned_dir_path, filename)

        if filename.endswith(".txt"):
            # Copy text files unchanged
            dest_path = os.path.join(post_processed_dir_path, filename)
            shutil.copyfile(src_path, dest_path)

        elif filename.endswith(".wav"):
            # Remove silent parts, then convert the result from stereo to mono
            output_file_path = remove_silent_parts(src_path, post_processed_dir_path, metadata)
            convert_stereo_to_mono(output_file_path, post_processed_dir_path)

In [None]:
def process_data(data_root, stage_dirs, metadata, force_aligned_files_path, range_start=None, range_end=None, log_idx=None):
    # Read the names of the already force-aligned audio files into a set
    with open(force_aligned_files_path, 'r') as f:
        force_aligned_files = {line.strip() for line in f}


    # Collect the MP3 files in the source directory
    audio_files = [audio_file for audio_file in os.listdir(data_root) if audio_file.endswith('.mp3')]


    # Filter the audio files if the processing is limited to some range
    if range_start is not None and range_end is not None:
        audio_files = [audio_file for audio_file in audio_files if range_start <= int(audio_file.split('.')[0]) < range_end]


    total_files = len(audio_files)


    # Iterate through all audio files in the source directory
    for idx, audio_file in enumerate(audio_files, start=1):
        progress = f"({idx}/{total_files})"
        print(f"{progress}: Processing file {audio_file}...")


        # Get corresponding text file
        audio_file_path = os.path.join(data_root, audio_file)
        audio_file_name = audio_file.split('.')[0]
        text_file_path = os.path.join(data_root, audio_file_name + '.txt')


        # Pass the raw data through pipeline
        audio_path, text_path = pre_process(audio_file_path, text_file_path, stage_dirs[0], log_idx=log_idx)

        force_alignment_path = align(audio_path, text_path, stage_dirs[1], stage_dirs[2], metadata, force_aligned_files, log_idx=log_idx)
        if audio_file_name not in force_aligned_files:
            with open(force_aligned_files_path, 'a') as f:
                f.write(audio_file_name + '\n')

        post_process(force_alignment_path, stage_dirs[3], metadata)
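
The resume mechanism used by `process_data` (a plain text file with one finished file name per line) can be sketched in isolation. `load_done` and `mark_done` are hypothetical helper names and `done.txt` is a placeholder file name; the sketch runs in a temporary directory so it does not touch the pipeline's own tracking file.

```python
import os
import tempfile


def load_done(ledger_path):
    """Return the set of names recorded in the ledger (empty if it does not exist)."""
    if not os.path.exists(ledger_path):
        return set()
    with open(ledger_path, "r", encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def mark_done(ledger_path, name):
    """Append one finished file name to the ledger."""
    with open(ledger_path, "a", encoding="utf-8") as f:
        f.write(name + "\n")


with tempfile.TemporaryDirectory() as tmp:
    ledger = os.path.join(tmp, "done.txt")  # placeholder ledger file name
    for name in ["1", "2"]:
        mark_done(ledger, name)
    done = load_done(ledger)
    # Only files that are not in the ledger still need the expensive step.
    todo = [n for n in ["1", "2", "3"] if n not in done]

print(sorted(done), todo)
```

Appending a name only after the expensive step has finished means an interrupted run repeats at most the file it was working on.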

## Run the Processing Pipeline
Four directories are defined as stage dirs. The results at the different stages of the pipeline (the pre-processed data, the start-end aligned data, the forced-aligned data, and the post-processed data) are stored in these dirs.

The arguments `start`, `end`, and `log` are used for parallel processing of data. You define the range of files to process in each thread by setting the `start` and `end` index of the files, and you label that run with a number for the `log` argument. This `log` number is used to avoid conflicts in naming temporary and log files. The range is half-open: `start` is inclusive and `end` is exclusive. If you are running a single thread, simply set `start` to 1, `end` to one past the index of the last file, and `log` to an arbitrary number like 1.

NOTE: we assume the raw audio and text files have numeric names as in ManaTTS data.
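
One possible way to choose the `start`, `end`, and `log` values for several parallel runs is sketched below. `split_ranges` is a hypothetical helper, not part of the original pipeline; it assumes the files are numbered consecutively and relies on `end` being exclusive, as in the range filter of `process_data`.

```python
def split_ranges(first, last, workers):
    """Split the file indices first..last into contiguous half-open (start, end) ranges."""
    total = last - first + 1
    base, extra = divmod(total, workers)
    ranges, start = [], first
    for w in range(workers):
        # The first `extra` workers take one additional file each.
        size = base + (1 if w < extra else 0)
        if size == 0:
            continue  # more workers than files: skip empty ranges
        ranges.append((start, start + size))
        start += size
    return ranges


# Ten files numbered 1..10, split across three runs labelled log=1, 2, 3.
for log, (start, end) in enumerate(split_ranges(1, 10, 3), start=1):
    print(f"run {log}: start={start}, end={end}")
```

Because each range ends where the next one begins, no file is processed twice and none is skipped.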

In [None]:
# # Get file processing range if provided
# parser = argparse.ArgumentParser(description="Process files within a specified range.")
# parser.add_argument("--start", type=int, default=None, help="Start range value (inclusive)")
# parser.add_argument("--end", type=int, default=None, help="End range value (exclusive)")
# parser.add_argument("--log", type=int, default=None, help="Log files suffix")
# args = parser.parse_args()
# start = args.start
# end = args.end
# log = args.log

start = 1
end = 600
log = 1

data_root = 'raw'

# Define directories to save different states of the processing data
stage_dirs = [
    'pre-processed',
    'start-end-alignemnt',
    'forced-alignment',
    'post-processed'
]
for stage_dir in stage_dirs: os.makedirs(stage_dir, exist_ok=True)


# Since forced alignment is computationally expensive, we keep track of the aligned audio files in a file
force_aligned_files_path = f'force_aligned_files_path-{log}.txt'
if not os.path.exists(force_aligned_files_path): open(force_aligned_files_path, 'a').close()


metadata = f'metadata-{log}.csv'

# Create the metadata file with header
if not os.path.exists(metadata):
  with open(metadata, 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['Path', 'Audio', 'Transcript', 'Transcript Word Count', 'Duration', 'Silence Removed Duration', 'CER', 'Status', 'Hypothesis', 'Processing Time', 'Search Type', 'ASRs'])


process_data(data_root, stage_dirs, metadata, force_aligned_files_path, start, end, log)


(1/2): Processing file 4.mp3...
Skipping file 4.wav. Vocals file exists.
Skipping file 4.txt. Processed text file already exists.
Skipping file 4.wav. Start-End aligned audio exists.
Skipping file 4.wav. Forced alignment directory exists.
Skipping file 4-170.wav. Silence removed file exists.
Skipping file 4-43.wav. Silence removed file exists.
Skipping file 4-77.wav. Silence removed file exists.
Skipping file 4-87.wav. Silence removed file exists.
Skipping file 4-184.wav. Silence removed file exists.
Skipping file 4-38.wav. Silence removed file exists.
Skipping file 4-60.wav. Silence removed file exists.
Skipping file 4-203.wav. Silence removed file exists.
Skipping file 4-94.wav. Silence removed file exists.
Skipping file 4-116.wav. Silence removed file exists.
Skipping file 4-136.wav. Silence removed file exists.
Skipping file 4-130.wav. Silence removed file exists.
Skipping file 4-219.wav. Silence removed file exists.
Skipping file 4-2.wav. Silence removed file exists.
Skipping file