<h1 style="text-align: center; font-size: 50px;"> 🎙️English to Spanish Audio Translation </h1>

This notebook shows how to use NVIDIA NeMo (https://github.com/NVIDIA/NeMo) to build a toy demo that translates an English audio file into a Spanish one.

The demo shows how to:

* Instantiate pre-trained NeMo models from NVIDIA NGC as AI Studio assets.
* Transcribe audio with an English speech recognition model.
* Translate text to Spanish with a machine translation model.
* Generate audio with text-to-speech models fine-tuned on Spanish speech.
* Deploy these models locally using MLflow and AI Studio deployments.
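
The steps listed above form a three-stage chain: speech recognition, translation, and speech synthesis. As a rough sketch of that data flow only, the snippet below wires three stand-in functions together. The names `run_pipeline`, `fake_transcribe`, `fake_translate`, and `fake_synthesize` are hypothetical and exist purely for illustration; the notebook uses the real models in their place.

```python
from typing import Callable, Tuple


def run_pipeline(
    audio_path: str,
    transcribe: Callable[[str], str],
    translate: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> Tuple[str, str, bytes]:
    """Chain the stages: audio -> English text -> Spanish text -> audio."""
    english_text = transcribe(audio_path)
    spanish_text = translate(english_text)
    spanish_audio = synthesize(spanish_text)
    return english_text, spanish_text, spanish_audio


# Stand-in stages (hypothetical); they only mimic the shape of each step.
def fake_transcribe(path: str) -> str:
    return "hello world"


def fake_translate(text: str) -> str:
    return {"hello world": "hola mundo"}.get(text, text)


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


result = run_pipeline("sample.wav", fake_transcribe, fake_translate, fake_synthesize)
print(result)
```

Passing the stages in as callables keeps each one replaceable, which mirrors how the notebook runs them as separate cells.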

# Notebook Overview
- Imports
- Configurations
- Verify Assets
- Loading from local saved models
- Play the Original English Audio
- Transcribe the Audio
- Translate the Text
- Convert Text to Audio
- Play the Generated Spanish Audio
- Register the Models to MLflow

# Imports

In [1]:
%pip install -r ../requirements.txt --quiet


In [2]:
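# ------------------------- Hugging Face Translation Model -------------------------
from transformers import MarianMTModel, MarianTokenizer  # MarianMT model and tokenizer for translation

# ------------------------- NeMo Core Imports -------------------------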
import nemo                             # NVIDIA NeMo core package
import nemo.collections.asr as nemo_asr # Speech Recognition (ASR) collection
import nemo.collections.nlp as nemo_nlp # Natural Language Processing (NLP) collection
import nemo.collections.tts as nemo_tts # Text-to-Speech (TTS) collection

# ------------------------- Audio Processing Utilities -------------------------

import IPython                          # For playing audio inside Jupyter Notebooks
import soundfile                        # For reading and writing audio files
from pathlib import Path                # Filesystem path management

# ------------------------- System Utilities -------------------------

import os                               # Operating system interfaces
import shutil                           # High-level file operations
import uuid                             # Unique ID generation
import io                               # Input/Output core tools
import base64                           # Encoding and decoding base64 strings
import json                             # JSON serialization and deserialization
import logging                          # Logging support
import warnings                         # Suppressing and managing warnings
import numpy as np                      # Numerical array operations
import torch

# ------------------------- MLflow Integration -------------------------

import mlflow                           # MLflow experiment tracking and model management
from mlflow.types.schema import Schema, ColSpec
from mlflow.types import ParamSchema, ParamSpec
from mlflow.models import ModelSignature



# Configurations

In [3]:
# Suppress Python warnings
warnings.filterwarnings("ignore")

# Suppress NeMo internal logging
logging.getLogger('nemo_logger').setLevel(logging.ERROR)

In [4]:
# Create logger
logger = logging.getLogger("tourism_logger")
logger.setLevel(logging.INFO)

formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s", 
                              datefmt="%Y-%m-%d %H:%M:%S")  

stream_handler = logging.StreamHandler()
stream_handler.setFormatter(formatter)
logger.addHandler(stream_handler)
logger.propagate = False
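
Re-running the logger cell in the same kernel attaches another handler each time, so every message is then printed more than once. One possible guard, shown here as a minimal sketch, is to add the handler only when the logger has none. The helper name `get_notebook_logger` and the logger name `demo_logger` are hypothetical.

```python
import logging


def get_notebook_logger(name: str = "tourism_logger") -> logging.Logger:
    """Return a configured logger, attaching the stream handler only once."""
    log = logging.getLogger(name)
    log.setLevel(logging.INFO)
    log.propagate = False
    if not log.handlers:  # skip when a previous call already configured it
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter(
            "%(asctime)s - %(levelname)s - %(message)s",
            datefmt="%Y-%m-%d %H:%M:%S",
        ))
        log.addHandler(handler)
    return log


# Calling the helper twice simulates running the cell twice.
first = get_notebook_logger("demo_logger")
second = get_notebook_logger("demo_logger")
print(len(second.handlers))
```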

In [5]:
# ------------------------- Model File Paths -------------------------

MT_MODEL = "Helsinki-NLP/opus-mt-en-es"
ASR_MODEL_PATH = "/home/jovyan/datafabric/STT_En_Citrinet_1024_Gamma_0.25/stt_en_citrinet_1024_gamma_0_25.nemo"        # Speech-to-Text (ASR) model
SPECTROGRAM_GENERATOR_PATH = "/home/jovyan/datafabric/TTS_Es_Multispeaker_FastPitch_HiFiGAN/tts_es_fastpitch_multispeaker.nemo"  # Spectrogram generator model (FastPitch)
VOCODER_PATH = "/home/jovyan/datafabric/TTS_Es_Multispeaker_FastPitch_HiFiGAN/tts_es_hifigan_ft_fastpitch_multispeaker.nemo"     # Vocoder model (HiFiGAN)

# ------------------------- Sample Audio Path -------------------------

AUDIO_SAMPLE_PATH = "../data/ForrestGump.mp3"     # Path to the input English audio sample

# ------------------------- MLflow Experiment Configuration -------------------------

EXPERIMENT_NAME = "NeMo_Translation_Experiment"   # MLflow experiment name
RUN_NAME = "NeMo_en_es_Translation_Run"            # Specific run name inside the experiment
MODEL_NAME = "nemo_en_es"                          # Registered model name in MLflow
DEMO_PATH = "../demo"                              # Path to save demo outputs

In [6]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


In [7]:
logger.info('Notebook execution started.')

2025-06-16 21:14:40 - INFO - Notebook execution started.


# Verify Assets

In [8]:
def log_asset_status(asset_path: str, asset_name: str, success_message: str, failure_message: str) -> None:
    """
    Logs the status of a given asset based on its existence.

    Parameters:
        asset_path (str): File or directory path to check.
        asset_name (str): Name of the asset for logging context.
        success_message (str): Message to log if asset exists.
        failure_message (str): Message to log if asset does not exist.
    """
    if Path(asset_path).exists():
        logger.info(f"{asset_name} is properly configured. {success_message}")
    else:
        # A missing asset is a problem, so report it at WARNING level
        logger.warning(f"{asset_name} is not properly configured. {failure_message}")
        
log_asset_status(
    asset_path=ASR_MODEL_PATH,
    asset_name="ASR model",
    success_message="",
    failure_message="Please create and download the required assets in your project on AI Studio."
)

log_asset_status(
    asset_path=SPECTROGRAM_GENERATOR_PATH,
    asset_name="Spectrogram generator",
    success_message="",
    failure_message="Please create and download the required assets in your project on AI Studio."
)

log_asset_status(
    asset_path=VOCODER_PATH,
    asset_name="Vocoder",
    success_message="You can now proceed with running the entire notebook.",
    failure_message="Please create and download the required assets in your project on AI Studio."
)

log_asset_status(
    asset_path=AUDIO_SAMPLE_PATH,
    asset_name="Audio Sample",
    success_message="You can now proceed with running the entire notebook.",
    failure_message="Please check if the data folder was properly downloaded in your project on AI Studio."
)

2025-06-16 21:14:40 - INFO - ASR model is properly configured. 
2025-06-16 21:14:40 - INFO - Spectrogram generator is properly configured. 
2025-06-16 21:14:40 - INFO - Vocoder is properly configured. You can now proceed with running the entire notebook.
2025-06-16 21:14:40 - INFO - Audio Sample is properly configured. You can now proceed with running the entire notebook.
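
The checks above log one line per asset. When a single go/no-go answer is more convenient, the same existence test can be collected into a list of missing names. This is a sketch only: `find_missing_assets` is a hypothetical helper, and the demo uses placeholder files in a temporary directory rather than the notebook's real paths.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def find_missing_assets(assets: Dict[str, str]) -> List[str]:
    """Return the names of assets whose paths do not exist on disk."""
    return [name for name, path in assets.items() if not Path(path).exists()]


# Demo with a throwaway directory; names and paths are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / "present.nemo"
    present.write_bytes(b"")
    missing = find_missing_assets({
        "Present asset": str(present),
        "Absent asset": str(Path(tmp) / "absent.nemo"),
    })

print(missing)
```

In the notebook, the dictionary would map names such as "ASR model" to `ASR_MODEL_PATH`, and an empty result would mean every asset is in place.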


# Loading from local saved models

Instead of downloading the NeMo models from NGC in code, this section loads the copies that were previously downloaded through the AI Studio asset manager. The translation model is the exception: it is loaded by its Hugging Face model ID (`MT_MODEL`).
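
A `.nemo` checkpoint is a tar archive that bundles the model configuration with its weights, so a downloaded asset can be inspected without loading the model. The sketch below lists the members of such an archive using only the standard library. The helper `list_archive_members` is hypothetical, and the demo builds a tiny stand-in archive whose member names are illustrative, not taken from the notebook's actual files.

```python
import io
import tarfile
import tempfile
from pathlib import Path
from typing import List


def list_archive_members(archive_path) -> List[str]:
    """List the files stored inside a tar archive (plain or compressed)."""
    with tarfile.open(archive_path, "r:*") as archive:
        return sorted(archive.getnames())


# Build a small stand-in archive; the member names are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.nemo"
    with tarfile.open(demo_path, "w") as archive:
        for name, payload in [("model_config.yaml", b"target: demo\n"),
                              ("model_weights.ckpt", b"\x00")]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    members = list_archive_members(demo_path)

print(members)
```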

In [9]:
%%time

# ------------------------- Restore Pre-trained NeMo Models -------------------------

# Restore the Speech-to-Text (ASR) model - English Citrinet-1024 (CTC)
asr_model = nemo_asr.models.EncDecCTCModel.restore_from(ASR_MODEL_PATH)

# Load the Neural Machine Translation (NMT) model - MarianMT English-to-Spanish checkpoint from Hugging Face

tokenizer = MarianTokenizer.from_pretrained(MT_MODEL)
mt_model = MarianMTModel.from_pretrained(MT_MODEL)

# Restore the Spectrogram Generator model (FastPitch) - Converts text to mel-spectrograms
spectrogram_generator = nemo_tts.models.FastPitchModel.restore_from(SPECTROGRAM_GENERATOR_PATH)

# Restore the Vocoder model (HiFiGAN) - Synthesizes audio waveform from spectrograms
vocoder = nemo_tts.models.HifiGanModel.restore_from(VOCODER_PATH)

CPU times: user 11.6 s, sys: 4.32 s, total: 15.9 s
Wall time: 25.7 s


# Play the Original English Audio

In [10]:
# ------------------------- Load and Play the Original English Audio -------------------------

# Play the input English audio sample
IPython.display.Audio(AUDIO_SAMPLE_PATH)

# Transcribe the Audio

In [11]:
# ------------------------- Step 1: Speech-to-Text (ASR) -------------------------

# Move the ASR model to the selected device (GPU if available)
asr_model = asr_model.to(device)

# Transcribe the audio
transcribed = asr_model.transcribe([AUDIO_SAMPLE_PATH])

# Extract the text from the first hypothesis
transcribed_text = transcribed[0].text

print("Transcribed text:", transcribed_text)

Transcribing: 100%|██████████| 1/1 [00:02<00:00,  2.15s/it]

Transcribed text: my mom always said life like a box of chocolates never know what you're going to get

# Translate the Text

In [12]:
# ------------------------- Step 2: Neural Machine Translation (NMT) -------------------------

# Move the NMT model to the selected device (GPU if available)
mt_model = mt_model.to(device)

# Tokenize the transcribed text
inputs = tokenizer(transcribed_text, return_tensors="pt", padding=True)

# Move inputs to the same device as the model
inputs = {key: value.to(device) for key, value in inputs.items()}

# Generate the translation
translated = mt_model.generate(**inputs)

# Decode the translated tokens into text
translated_text = tokenizer.decode(translated[0], skip_special_tokens=True)

# Print the translated text
print(f"Translated Text:\n{translated_text}\n")


Translated Text:
mi mamá siempre decía que la vida como una caja de chocolates nunca sabe lo que vas a conseguir
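
The cell above passes the whole transcript to the tokenizer in one call, which suits a single sentence. Translation models accept a bounded input length, so a longer transcript may need to be split first. The sketch below splits on word count as a rough stand-in for token count; `chunk_words` and the `max_words` limit are hypothetical and would need tuning against the actual tokenizer.

```python
from typing import List


def chunk_words(text: str, max_words: int = 40) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


chunks = chunk_words("my mom always said life like a box of chocolates", max_words=4)
print(chunks)
```

Each chunk could then be translated separately and the results joined with spaces. Because the ASR output has no punctuation, splitting on sentence boundaries is not available here without an extra punctuation step.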



# Convert Text to Audio

In [13]:
# ------------------------- Step 3: Text-to-Speech (TTS) -------------------------

# Move the Spectrogram Generator and Vocoder models to the selected device (GPU if available)
spectrogram_generator = spectrogram_generator.to(device)
vocoder = vocoder.to(device)

# Parse the translated text into tokens for spectrogram generation

tokens = spectrogram_generator.parse(translated_text)


# Generate a mel-spectrogram for the parsed tokens (speaker ID 2 used here)
spectrogram = spectrogram_generator.generate_spectrogram(tokens=tokens, speaker=2)

# Convert the generated spectrogram into audio waveform using the vocoder
audio_tensor = vocoder.convert_spectrogram_to_audio(spec=spectrogram)

# Play the Generated Spanish Audio

In [14]:
# ------------------------- Play the Generated Spanish Audio -------------------------

# Play the generated Spanish audio
IPython.display.Audio(audio_tensor.to('cpu').detach().numpy(), rate=44100)
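
The player above receives a waveform and a sample rate of 44100 Hz. To keep the result as a file or as bytes, the same two pieces of information are enough to produce a WAV container. The sketch below does this with the standard library only, assuming mono float samples in the range [-1, 1]. The function name `float_samples_to_wav_bytes` is hypothetical, and a generated test tone stands in for the model output.

```python
import io
import math
import struct
import wave
from typing import Iterable


def float_samples_to_wav_bytes(samples: Iterable[float], sample_rate: int = 44100) -> bytes:
    """Encode mono float samples in [-1, 1] as 16-bit PCM WAV bytes."""
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, sample))  # guard against overshoot
        frames += struct.pack("<h", int(round(clipped * 32767)))
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)       # mono
        wav_file.setsampwidth(2)       # 16-bit samples
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
    return buffer.getvalue()


# One second of a 440 Hz tone as stand-in audio.
rate = 44100
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
wav_bytes = float_samples_to_wav_bytes(tone, sample_rate=rate)

# Read the header back to confirm the length and sample rate.
with wave.open(io.BytesIO(wav_bytes), "rb") as check:
    frame_count = check.getnframes()
    stored_rate = check.getframerate()

print(frame_count / stored_rate, "seconds")
```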

# Register the Models to MLflow

In [15]:
class NemoTranslationModel(mlflow.pyfunc.PythonModel):
    """
    A custom MLflow pyfunc model for performing end-to-end audio translation using NVIDIA NeMo models.
    """

    def load_context(self, context):
        """Load the ASR, translation, and TTS models and prepare the temporary working directory."""
        model_dir = context.artifacts["model"]
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

        self.asr_model = nemo_asr.models.EncDecCTCModel.restore_from(f"{model_dir}/enc_dec_CTC.nemo").to(self.device)
        self.spectrogram_generator = nemo_tts.models.FastPitchModel.restore_from(f"{model_dir}/fast_pitch.nemo").to(self.device)
        self.vocoder = nemo_tts.models.HifiGanModel.restore_from(f"{model_dir}/hifi_gan.nemo").to(self.device)

        # Translation uses the same Hugging Face MarianMT checkpoint as MT_MODEL earlier in this notebook
        self.tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")
        self.mt_model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-es").to(self.device)

        # Sample rate of the synthesized audio (the notebook plays the TTS output at 44100 Hz)
        self.framerate = 44100

        os.makedirs("/phoenix/mlflow/tmp", exist_ok=True)

    def transcribe_audio(self, model_input):
        """Deserialize base64-encoded audio, save it temporarily, and perform speech-to-text."""
        serialized_audio = model_input['source_serialized_audio'][0]
        audio_buffer = io.BytesIO(base64.b64decode(serialized_audio))
        # Keep the input sample rate local so it does not overwrite the TTS output rate
        audio_array, input_rate = soundfile.read(audio_buffer)

        # Ensure mono-channel audio
        if audio_array.ndim > 1:
            audio_array = audio_array[:, 0]

        temp_wave_path = f"/phoenix/mlflow/tmp/{self.file_id}.wav"
        soundfile.write(temp_wave_path, audio_array, input_rate)

        # Perform ASR and return the text of the first hypothesis
        hypothesis = self.asr_model.transcribe([temp_wave_path])[0]
        return getattr(hypothesis, "text", hypothesis)

    def text_to_audio(self, text: str):
        """Generate audio waveform from text using TTS models."""
        parsed_tokens = self.spectrogram_generator.parse(text)
        spectrogram = self.spectrogram_generator.generate_spectrogram(tokens=parsed_tokens, speaker=2)
        audio_tensor = self.vocoder.convert_spectrogram_to_audio(spec=spectrogram)

        return audio_tensor.to('cpu').detach().numpy()

    def serialize_audio(self, audio_array: np.ndarray):
        """Serialize a NumPy audio array into a base64-encoded WAV file."""
        wave_path = f"/phoenix/mlflow/tmp/out_{self.file_id}.wav"
        soundfile.write(wave_path, audio_array, samplerate=self.framerate, format='WAV')

        with io.BytesIO() as buffer:
            soundfile.write(buffer, audio_array, samplerate=self.framerate, format='WAV')
            buffer.seek(0)
            audio_base64 = base64.b64encode(buffer.read()).decode('utf-8')

        return audio_base64

    def predict(self, context, model_input, params=None):
        """
        Perform inference:
        1. Transcribe audio (if input is audio)
        2. Translate text
        3. Synthesize translated text into speech (if input is audio)
        4. Serialize the audio if needed
        """
        self.file_id = uuid.uuid1()
        use_audio = (params or {}).get("use_audio", False)

        if use_audio:
            source_text = self.transcribe_audio(model_input)
        else:
            source_text = model_input['source_text'][0]

        # Translate with MarianMT: tokenize, generate, and decode
        inputs = self.tokenizer(source_text, return_tensors="pt", padding=True)
        inputs = {key: value.to(self.device) for key, value in inputs.items()}
        translated = self.mt_model.generate(**inputs)
        translated_text = self.tokenizer.decode(translated[0], skip_special_tokens=True)

        translated_audio_base64 = ""
        if use_audio:
            audio_array = self.text_to_audio(translated_text)
            translated_audio_base64 = self.serialize_audio(audio_array[0])

        return {
            "original_text": source_text,
            "translated_text": translated_text,
            "translated_serialized_audio": translated_audio_base64
        }

    @classmethod
    def log_model(cls, model_name: str, nemo_models: dict, demo_folder: str):
        """
        Log the translation model to MLflow with model artifacts and signatures.
        
        Args:
            model_name: Name under which to register the model.
            nemo_models: Dictionary mapping component names to their local .nemo file paths.
            demo_folder: Path to the demo files folder.
        """
        
        input_schema = Schema([
            ColSpec("string", "source_text"),
            ColSpec("string", "source_serialized_audio"),
        ])

        output_schema = Schema([
            ColSpec("string", "original_text"),
            ColSpec("string", "translated_text"),
            ColSpec("string", "translated_serialized_audio"),
        ])

        params_schema = ParamSchema([
            ParamSpec("use_audio", "boolean", False)
        ])

        signature = ModelSignature(
            inputs=input_schema,
            outputs=output_schema,
            params=params_schema
        )

        os.makedirs(model_name, exist_ok=True)

        # Copy NeMo model artifacts
        if "enc_dec_CTC" in nemo_models:
            shutil.copyfile(nemo_models["enc_dec_CTC"], f"{model_name}/enc_dec_CTC.nemo")
        if "MT_enc_dec" in nemo_models:
            shutil.copyfile(nemo_models["MT_enc_dec"], f"{model_name}/MT_enc_dec.nemo")
        if "fast_pitch" in nemo_models:
            shutil.copyfile(nemo_models["fast_pitch"], f"{model_name}/fast_pitch.nemo")
        if "hifi_gan" in nemo_models:
            shutil.copyfile(nemo_models["hifi_gan"], f"{model_name}/hifi_gan.nemo")

        # Log model to MLflow
        mlflow.pyfunc.log_model(
            artifact_path=model_name,
            python_model=cls(),
            artifacts={"model": model_name, "demo": demo_folder},
            signature=signature
        )

        # Clean up temporary files
        shutil.rmtree(model_name)
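
The signature defined in `log_model` expects two string columns, `source_text` and `source_serialized_audio`, plus a boolean `use_audio` parameter, and the output carries the synthesized audio as a base64 string. The sketch below shows one way a client might build a matching request body and decode the audio from a response. The helper names are hypothetical, and the `dataframe_records` and `params` envelope is an assumption about how the deployed endpoint accepts JSON; check it against the serving setup in use.

```python
import base64
import json


def build_audio_request(wav_bytes: bytes) -> str:
    """Build a JSON body carrying base64-encoded audio and use_audio=True."""
    record = {
        "source_text": "",  # unused when use_audio is True
        "source_serialized_audio": base64.b64encode(wav_bytes).decode("utf-8"),
    }
    # Envelope keys are an assumption about the scoring endpoint.
    return json.dumps({"dataframe_records": [record], "params": {"use_audio": True}})


def decode_audio_response(prediction: dict) -> bytes:
    """Recover WAV bytes from the translated_serialized_audio field."""
    return base64.b64decode(prediction["translated_serialized_audio"])


# Round trip with placeholder bytes standing in for a real WAV file.
request_body = build_audio_request(b"RIFFdemo")
payload = json.loads(request_body)
echoed = decode_audio_response({
    "translated_serialized_audio": payload["dataframe_records"][0]["source_serialized_audio"]
})
print(echoed)
```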

In [16]:
# ------------------------- MLflow Model Logging and Registration -------------------------

mlflow.set_tracking_uri('/phoenix/mlflow')
# Set the MLflow experiment
mlflow.set_experiment(experiment_name=EXPERIMENT_NAME)

# Start a new MLflow run
with mlflow.start_run(run_name=RUN_NAME) as run:
    # Define the set of NeMo model components to be logged.
    # The MarianMT translation model is loaded from Hugging Face (MT_MODEL), so it has no .nemo artifact here.
    nemo_model_artifacts = {
        "enc_dec_CTC": ASR_MODEL_PATH,
        "fast_pitch": SPECTROGRAM_GENERATOR_PATH,
        "hifi_gan": VOCODER_PATH,
    }

    # Log the custom translation model with specified artifacts and demo folder
    NemoTranslationModel.log_model(
        model_name=MODEL_NAME,
        nemo_models=nemo_model_artifacts,
        demo_folder=DEMO_PATH
    )

    # Register the logged model in MLflow Model Registry
    mlflow.register_model(
        model_uri=f"runs:/{run.info.run_id}/{MODEL_NAME}",
        name=MODEL_NAME
    )

In [None]:
# ------------------------- Success Confirmation -------------------------

print(f"✅ Model '{MODEL_NAME}' successfully logged and registered under experiment '{EXPERIMENT_NAME}'.")
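
Once registration succeeds, the model can be addressed either by the run that logged it or by its registry name and version. Both forms are URIs that `mlflow.pyfunc.load_model` accepts. The sketch below only builds the strings; the helper names, the run ID, and the version number are placeholders, not values from this notebook's run.

```python
def run_model_uri(run_id: str, artifact_path: str) -> str:
    """URI of a model logged under a specific run."""
    return f"runs:/{run_id}/{artifact_path}"


def registry_model_uri(name: str, version: int) -> str:
    """URI of a specific version in the MLflow Model Registry."""
    return f"models:/{name}/{version}"


# Placeholder run ID and version; the real values come from MLflow.
print(run_model_uri("0123456789abcdef", "nemo_en_es"))
print(registry_model_uri("nemo_en_es", 1))
```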

In [None]:
logger.info('Notebook execution completed.')

Built with ❤️ using [**Z by HP AI Studio**](https://zdocs.datascience.hp.com/docs/aistudio/overview).