# Audio curation in NeMo Curator

In this notebook, we explore the basic functionality that NeMo Curator provides for audio dataset curation. NeMo Curator has built-in modules for:

- Downloading and preparing FLEURS data

- Running ASR (Automatic Speech Recognition) inference with the NeMo toolkit

- Calculating pairwise WER (Word Error Rate)

- Getting the duration of each audio file

- Saving metadata to a JSONL file
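
As background for the WER step in the list above, here is a minimal sketch of how a word error rate can be computed: the word-level edit distance between a reference and a predicted transcript, divided by the number of reference words. The function name `word_error_rate` is hypothetical and is not part of NeMo Curator. Reporting the value in percent is an assumption, suggested by the threshold of 5.5 used later in this notebook.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance / reference length * 100.

    Illustrative only; this is not the NeMo Curator implementation.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over six reference words
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because the denominator is the reference length, a hypothesis with many inserted words can produce a WER above 100.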

We'll cover all of these modules in this tutorial notebook. First, we need to install the NeMo toolkit and NeMo Curator.

NOTE: Please ensure you meet the requirements before proceeding!

## Install NeMo Curator

If you have not already, please install NeMo Curator by following the README; you should install either nemo-curator[all] or nemo-curator[audio] for this tutorial. If you are using the NeMo Framework Container, then NeMo Curator is already installed and no action is needed.

We also need to install some additional libraries for helper functions in the notebook:

# Install Ray Curator

In [3]:
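import os
import sys

# sys.path does not expand "~", so resolve the home directory explicitly
CURATORPATH = os.path.expanduser("~/workspace/Curator/ray-curator")
sys.path.append(CURATORPATH)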

# Install cosmos_xenna

In [None]:
! pip install cosmos_xenna

# Start Ray Cluster 

In [None]:
!

In [None]:
import argparse
import os
import shutil
import tempfile

from loguru import logger

from ray_curator.backends.xenna import XennaExecutor
from ray_curator.pipeline import Pipeline
from ray_curator.stages.audio.common import GetAudioDurationStage, PreserveByValueStage
from ray_curator.stages.audio.datasets.fleurs.create_initial_manifest import CreateInitialManifestFleursStage
from ray_curator.stages.audio.inference.asr_nemo import InferenceAsrNemoStage
from ray_curator.stages.audio.io.object_to_batch import ObjectToBatchStage
from ray_curator.stages.audio.metrics.get_wer import GetPairwiseWerStage
from ray_curator.stages.resources import Resources
from ray_curator.stages.text.io.writer import JsonlWriter

In [None]:
def create_audio_pipeline(args: argparse.Namespace) -> Pipeline:
    # Define pipeline
    pipeline = Pipeline(name="audio_inference", description="Inference audio and filter by WER threshold.")

    # Add stages
    # Add the composite stage that combines reading and downloading
    pipeline.add_stage(
        CreateInitialManifestFleursStage(
            lang=args.lang,
            split=args.split,
            raw_data_dir=args.raw_data_dir,
        )
    )
    # Transcribe each audio file with the NeMo ASR model on one GPU
    pipeline.add_stage(
        InferenceAsrNemoStage(model_name=args.model_name).with_(batch_size=16, resources=Resources(gpus=1.0))
    )
    # Compare the reference text with the predicted text
    pipeline.add_stage(GetPairwiseWerStage(text_key="text", pred_text_key="pred_text", wer_key="wer"))
    # Record the duration of each audio file
    pipeline.add_stage(GetAudioDurationStage(audio_filepath_key="audio_filepath", duration_key="duration"))
    # Keep only entries whose WER is less than or equal to the threshold
    pipeline.add_stage(PreserveByValueStage(input_value_key="wer", target_value=args.wer_threshold, operator="le"))
    pipeline.add_stage(ObjectToBatchStage().with_(batch_size=1))
    # Write the remaining entries as JSONL into an empty result folder
    result_dir = os.path.join(args.raw_data_dir, "result")
    if os.path.isdir(result_dir):
        shutil.rmtree(result_dir)  # remove results left over from a previous run
    pipeline.add_stage(JsonlWriter(output_dir=result_dir, force_ascii=False))
    return pipeline
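
The filtering step in the pipeline keeps entries by comparing one field against a target value. The sketch below illustrates that idea in plain Python, assuming `"le"` means "less than or equal". The helper `preserve_by_value` and the `COMPARATORS` table are hypothetical names for illustration and do not reproduce the internals of `PreserveByValueStage`. The sample entries are made up.

```python
import operator

# Hypothetical mapping from operator names to comparison functions
COMPARATORS = {
    "lt": operator.lt,
    "le": operator.le,
    "eq": operator.eq,
    "ne": operator.ne,
    "ge": operator.ge,
    "gt": operator.gt,
}


def preserve_by_value(entries, key, target, op="eq"):
    """Return the entries for which `entry[key] <op> target` holds."""
    compare = COMPARATORS[op]
    return [entry for entry in entries if compare(entry[key], target)]


# Made-up manifest entries, for illustration only
entries = [
    {"audio_filepath": "a.wav", "wer": 0.0},
    {"audio_filepath": "b.wav", "wer": 5.5},
    {"audio_filepath": "c.wav", "wer": 12.3},
]

kept = preserve_by_value(entries, key="wer", target=5.5, op="le")
print([entry["audio_filepath"] for entry in kept])  # ['a.wav', 'b.wav']
```

With `op="le"` the entry whose WER equals the threshold is kept; `op="lt"` would drop it.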

In [None]:
tmpdir = tempfile.TemporaryDirectory()

args = argparse.Namespace(
    raw_data_dir=os.path.join(tmpdir.name, "armenian/fleurs"),  # .name is the directory path
    model_name="nvidia/stt_hy_fastconformer_hybrid_large_pc",
    lang="hy_am",
    split="dev",
    wer_threshold=5.5,
)

In [None]:
"""
Prepare the FLEURS dataset, run ASR inference, and filter by WER threshold.
"""
pipeline = create_audio_pipeline(args)

# Print pipeline description
logger.info(pipeline.describe())
logger.info("\n" + "=" * 50 + "\n")

# Create executor
executor = XennaExecutor()

# Execute pipeline
logger.info("Starting pipeline execution...")
pipeline.run(executor)

# Report completion; the temporary folder is removed in the last cell,
# so the results stay available for inspection until then.
logger.info("\nPipeline completed!")
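
Before the temporary folder is cleared, the written metadata can be inspected. The sketch below reads JSONL files back into Python dictionaries and sums the audio duration. It assumes the writer produces files ending in `.jsonl` directly inside the result folder; `load_manifest` is a hypothetical helper, and the demo data is made up so the example runs without the pipeline output.

```python
import glob
import json
import os
import tempfile


def load_manifest(result_dir: str) -> list[dict]:
    """Read every *.jsonl file in result_dir into a list of dicts."""
    entries = []
    for path in sorted(glob.glob(os.path.join(result_dir, "*.jsonl"))):
        with open(path, encoding="utf-8") as f:
            entries.extend(json.loads(line) for line in f if line.strip())
    return entries


# Demo on made-up data; field names follow the keys used in the pipeline above
with tempfile.TemporaryDirectory() as demo_dir:
    sample = [
        {"audio_filepath": "a.wav", "text": "hello", "pred_text": "hello", "wer": 0.0, "duration": 2.5},
        {"audio_filepath": "b.wav", "text": "good day", "pred_text": "good bay", "wer": 50.0, "duration": 1.5},
    ]
    with open(os.path.join(demo_dir, "part-0.jsonl"), "w", encoding="utf-8") as f:
        for entry in sample:
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")

    manifest = load_manifest(demo_dir)
    total_seconds = sum(entry["duration"] for entry in manifest)
    print(f"{len(manifest)} entries, {total_seconds:.1f} seconds of audio")
```

In this notebook, the folder to pass would be `os.path.join(args.raw_data_dir, "result")`, which is where the pipeline writes its output.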

In [None]:
# Clear temporary folder
tmpdir.cleanup()