<h1 style="text-align: center; font-size: 50px;"> 🎙️English to Spanish Audio Translation </h1>

This notebook uses NVIDIA NeMo (https://github.com/NVIDIA/NeMo) to build a toy demo that translates an English audio file into Spanish audio.

The demo shows how to:

* Instantiate pre-trained NeMo models from NVIDIA NGC as AI Studio assets.
* Transcribe audio with an English speech recognition model.
* Translate text to Spanish with a MarianMT machine translation model from Hugging Face.
* Generate audio with text-to-speech models fine-tuned for Spanish.
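
The steps above chain into one pipeline: audio in, English text, Spanish text, audio out. The sketch below shows only that data flow. `translate_audio` and its three stage parameters are hypothetical names, and the lambdas are stand-ins for the real models loaded later in the notebook.

```python
from typing import Any, Callable


def translate_audio(
    audio_path: str,
    transcribe: Callable[[str], str],
    translate: Callable[[str], str],
    synthesize: Callable[[str], Any],
) -> Any:
    """Chain ASR -> MT -> TTS; each stage is passed in as a plain callable."""
    english_text = transcribe(audio_path)
    spanish_text = translate(english_text)
    return synthesize(spanish_text)


# Stand-in stages (assumptions for illustration only, not real models)
result = translate_audio(
    "sample.wav",
    transcribe=lambda path: "hello world",
    translate=lambda text: {"hello world": "hola mundo"}[text],
    synthesize=lambda text: f"<audio for: {text}>",
)
print(result)
```

Passing the stages as callables keeps the flow independent of any particular model, which makes it easy to swap one stage without touching the others.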

# Notebook Overview
- Start Execution
- Install and Import Libraries
- Configure Settings
- Verify Assets
- Loading from local saved models
- Play the Original English Audio
- Transcribe the Audio
- Translate the Text
- Convert Text to Audio
- Play the Generated Spanish Audio

# Start Execution

In [1]:
import logging  # For application-level logging
import time     # For runtime measurement (wall clock)

# Configure logger
logger: logging.Logger = logging.getLogger("run_workflow_logger")
logger.setLevel(logging.INFO)
logger.propagate = False  # Prevent duplicate logs from parent loggers

# Set formatter
formatter: logging.Formatter = logging.Formatter(
    fmt="%(asctime)s - %(levelname)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S"
)

# Configure and attach stream handler (only once, so re-running this cell does not duplicate log lines)
if not logger.handlers:
    stream_handler: logging.StreamHandler = logging.StreamHandler()
    stream_handler.setFormatter(formatter)
    logger.addHandler(stream_handler)

In [2]:
start_time = time.time()  

logger.info("Notebook execution started.")

2025-09-10 16:46:33 - INFO - Notebook execution started.


# Install and Import Libraries

In [3]:
%%time

%pip install -r ../requirements.txt --quiet

Note: you may need to restart the kernel to use updated packages.
CPU times: user 45.2 ms, sys: 13.8 ms, total: 58.9 ms
Wall time: 3.19 s


In [4]:
# ------------------------- NeMo Core Imports -------------------------
import nemo                             # NVIDIA NeMo core package
import nemo.collections.asr as nemo_asr # Speech Recognition (ASR) collection
import nemo.collections.tts as nemo_tts # Text-to-Speech (TTS) collection

# ------------------------- Transformers -------------------------
from transformers import MarianMTModel, MarianTokenizer

# ------------------------- Audio Processing Utilities -------------------------
import IPython                          # For playing audio inside Jupyter Notebooks
import soundfile                        # For reading and writing audio files
from pathlib import Path                # Filesystem path management

# ------------------------- System Utilities -------------------------

import os                               # Operating system interfaces
import shutil                           # High-level file operations
import uuid                             # Unique ID generation
import io                               # Input/Output core tools
import base64                           # Encoding and decoding base64 strings
import json                             # JSON serialization and deserialization
import warnings                         # Suppressing and managing warnings
import numpy as np                      # Numerical array operations
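np.float_ = np.float64                  # Restore the alias removed in NumPy 2.0, for dependencies that may still reference it
import torch                            # Tensor operations and device selection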

    


# Configure Settings

In [5]:
# Suppress Python warnings
warnings.filterwarnings("ignore")

# Suppress NeMo internal logging
logging.getLogger('nemo_logger').setLevel(logging.ERROR)

In [6]:
# ------------------------- Model Identifiers and File Paths -------------------------
MT_MODEL = "Helsinki-NLP/opus-mt-en-es"  # Hugging Face model ID for English-to-Spanish translation (MarianMT)
ASR_MODEL_PATH = "/home/jovyan/datafabric/STT_En_Citrinet_1024_Gamma_0.25/stt_en_citrinet_1024_gamma_0_25.nemo"        # Speech-to-Text (ASR) model
SPECTROGRAM_GENERATOR_PATH = "/home/jovyan/datafabric/TTS_Es_Multispeaker_FastPitch_HiFiGAN/tts_es_fastpitch_multispeaker.nemo"  # Spectrogram generator model (FastPitch)
VOCODER_PATH = "/home/jovyan/datafabric/TTS_Es_Multispeaker_FastPitch_HiFiGAN/tts_es_hifigan_ft_fastpitch_multispeaker.nemo"     # Vocoder model (HiFiGAN)

# ------------------------- Sample Audio Path -------------------------

AUDIO_SAMPLE_PATH = "../data/ForrestGump.mp3"     # Path to the input English audio sample

In [7]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


# Verify Assets

In [8]:
def log_asset_status(asset_path: str, asset_name: str, success_message: str, failure_message: str) -> None:
    """
    Logs the status of a given asset based on its existence.

    Parameters:
        asset_path (str): File or directory path to check.
        asset_name (str): Name of the asset for logging context.
        success_message (str): Message to log if asset exists.
        failure_message (str): Message to log if asset does not exist.
    """
    if Path(asset_path).exists():
        logger.info(f"{asset_name} is properly configured. {success_message}")
    else:
        logger.warning(f"{asset_name} is not properly configured. {failure_message}")
        
log_asset_status(
    asset_path=ASR_MODEL_PATH,
    asset_name="ASR model",
    success_message="",
    failure_message="Please create and download the required assets in your project on AI Studio."
)

log_asset_status(
    asset_path=SPECTROGRAM_GENERATOR_PATH,
    asset_name="Spectrogram generator",
    success_message="",
    failure_message="Please create and download the required assets in your project on AI Studio."
)

log_asset_status(
    asset_path=VOCODER_PATH,
    asset_name="Vocoder",
    success_message="",
    failure_message="Please create and download the required assets in your project on AI Studio."
)

log_asset_status(
    asset_path=AUDIO_SAMPLE_PATH,
    asset_name="Audio Sample",
    success_message="",
    failure_message="Please check if the data folder was properly downloaded in your project on AI Studio."
)

2025-09-10 16:46:52 - INFO - ASR model is properly configured. 
2025-09-10 16:46:52 - INFO - Spectrogram generator is properly configured. 
2025-09-10 16:46:52 - INFO - Vocoder is properly configured. 
2025-09-10 16:46:52 - INFO - Audio Sample is properly configured. 
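
The check above only logs the status of each asset, so a missing file surfaces later as a model-loading error. A minimal sketch of a fail-fast variant follows; `find_missing_assets` and `require_assets` are hypothetical helpers, not part of the original notebook.

```python
from pathlib import Path
from typing import Dict, List


def find_missing_assets(assets: Dict[str, str]) -> List[str]:
    """Return the names of assets whose paths do not exist."""
    return [name for name, path in assets.items() if not Path(path).exists()]


def require_assets(assets: Dict[str, str]) -> None:
    """Raise FileNotFoundError naming every missing asset at once."""
    missing = find_missing_assets(assets)
    if missing:
        raise FileNotFoundError("Missing assets: " + ", ".join(missing))
```

In this notebook it could be called with a dictionary such as `{"ASR model": ASR_MODEL_PATH, "Vocoder": VOCODER_PATH}`, so that execution stops before the slower model-loading cell runs.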


# Loading from local saved models

Instead of downloading the NeMo models from NGC in code, this notebook restores the checkpoints that were previously downloaded through the AI Studio asset manager. The translation model is the exception: it is loaded from Hugging Face by its model ID.
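
If a notebook should also work where the local checkpoint is absent, a local-first loader with a download fallback is one option. The sketch below is an illustration under assumptions: `load_local_first` is a hypothetical helper, and the lambdas stand in for real loaders.

```python
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def load_local_first(
    local_path: str,
    restore: Callable[[str], T],
    download: Callable[[], T],
) -> T:
    """Restore from local_path when the file exists; otherwise call the download fallback."""
    if Path(local_path).is_file():
        return restore(local_path)
    return download()


# Stand-in loaders (illustrative only); the file does not exist, so the fallback runs
model = load_local_first(
    "/nonexistent/model.nemo",
    restore=lambda path: f"restored from {path}",
    download=lambda: "downloaded",
)
print(model)
```

With NeMo, `restore` could be a model class's `restore_from` method and `download` could wrap its `from_pretrained` call with the matching NGC model name.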

In [9]:
%%time

# ------------------------- Restore Pre-trained NeMo Models -------------------------

# Restore the Speech-to-Text (ASR) model - English Citrinet-1024 (gamma 0.25), restored as a CTC model
asr_model = nemo_asr.models.EncDecCTCModel.restore_from(ASR_MODEL_PATH)

# Load the Neural Machine Translation (NMT) model and tokenizer from Hugging Face - English to Spanish MarianMT

tokenizer = MarianTokenizer.from_pretrained(MT_MODEL)
mt_model = MarianMTModel.from_pretrained(MT_MODEL)

# Restore the Spectrogram Generator model (FastPitch) - Converts text to mel-spectrograms
spectrogram_generator = nemo_tts.models.FastPitchModel.restore_from(SPECTROGRAM_GENERATOR_PATH)

# Restore the Vocoder model (HiFiGAN) - Synthesizes audio waveform from spectrograms
vocoder = nemo_tts.models.HifiGanModel.restore_from(VOCODER_PATH)

CPU times: user 18.9 s, sys: 4.91 s, total: 23.8 s
Wall time: 59.7 s


# Play the Original English Audio

In [10]:
# ------------------------- Load and Play the Original English Audio -------------------------

# Play the input English audio sample
IPython.display.Audio(AUDIO_SAMPLE_PATH)

# Transcribe the Audio

In [11]:
# ------------------------- Step 1: Speech-to-Text (ASR) -------------------------

# Move the ASR model to the selected device (GPU if available, otherwise CPU)
asr_model = asr_model.to(device)

# Transcribe the audio
transcribed = asr_model.transcribe([AUDIO_SAMPLE_PATH])

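# Extract the text from the first result.
# Newer NeMo releases (e.g. 25.04) return objects with a .text attribute;
# older ones (e.g. 23.10) return plain strings.
first_result = transcribed[0]
transcribed_text = first_result if isinstance(first_result, str) else first_result.text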

print("Transcribed text:", transcribed_text)


Transcribing:   0%|          | 0/1 [00:00<?, ?it/s]

Transcribed text: my mom always said life like a box of chocolates never know what you're going to get


# Translate the Text

In [12]:
# ------------------------- Step 2: Neural Machine Translation (NMT) -------------------------

# Move the NMT model to the selected device (GPU if available, otherwise CPU)
mt_model = mt_model.to(device)

# Tokenize the transcribed text
inputs = tokenizer(transcribed_text, return_tensors="pt", padding=True)

# Move inputs to the same device as the model
inputs = {key: value.to(device) for key, value in inputs.items()}

# Generate the translation
translated = mt_model.generate(**inputs)

# Decode the translated tokens into text
translated_text = tokenizer.decode(translated[0], skip_special_tokens=True)

# Print the translated text
print(f"Translated Text:\n{translated_text}\n")


Translated Text:
mi mamá siempre decía que la vida como una caja de chocolates nunca sabe lo que vas a conseguir
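
The cell above translates the whole transcript in one call, which suits a single sentence. A longer transcript may exceed the translation model's maximum input length (an assumption worth checking against `tokenizer.model_max_length`). Because the ASR output carries no punctuation, the sketch below splits on word count instead of sentence boundaries; `chunk_words` and the `max_words` value are hypothetical.

```python
from typing import List


def chunk_words(text: str, max_words: int = 40) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


chunks = chunk_words("my mom always said life like a box of chocolates", max_words=4)
print(chunks)
# ['my mom always said', 'life like a box', 'of chocolates']
```

The resulting list can be passed to the tokenizer as a batch (it already uses `padding=True`) and the outputs decoded with `tokenizer.batch_decode`. Splitting mid-sentence can hurt translation quality, so this is a fallback, not a replacement for proper sentence segmentation.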



# Convert Text to Audio

In [13]:
# ------------------------- Step 3: Text-to-Speech (TTS) -------------------------

# Move the Spectrogram Generator and Vocoder models to the selected device (GPU if available, otherwise CPU)
spectrogram_generator = spectrogram_generator.to(device)
vocoder = vocoder.to(device)

# Parse the translated text into tokens for spectrogram generation
tokens = spectrogram_generator.parse(translated_text)

# Generate a mel-spectrogram for the parsed tokens (speaker ID 2 used here)
spectrogram = spectrogram_generator.generate_spectrogram(tokens=tokens, speaker=2)

# Convert the generated spectrogram into audio waveform using the vocoder
audio_tensor = vocoder.convert_spectrogram_to_audio(spec=spectrogram)

# Play the Generated Spanish Audio

In [14]:
# ------------------------- Play the Generated Spanish Audio -------------------------

# Play the generated Spanish audio
IPython.display.Audio(audio_tensor.to('cpu').detach().numpy(), rate=44100)
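
The generated audio exists only in memory and in the notebook player. To keep it, the samples can be written to a WAV file. The sketch below uses only the standard library and assumes mono samples as floats in [-1.0, 1.0]; `write_wav` is a hypothetical helper, and the 44100 Hz default mirrors the playback rate used above.

```python
import struct
import wave
from typing import Iterable


def write_wav(path: str, samples: Iterable[float], sample_rate: int = 44100) -> None:
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # guard against overflow
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)       # mono
        wav_file.setsampwidth(2)       # 2 bytes = 16-bit samples
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
```

In this notebook the samples could come from `audio_tensor.squeeze().detach().cpu().tolist()`. The `soundfile` package, already imported above, offers `soundfile.write` as an alternative that accepts NumPy arrays directly.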

In [15]:
end_time: float = time.time()
elapsed_time: float = end_time - start_time
elapsed_minutes: int = int(elapsed_time // 60)
elapsed_seconds: float = elapsed_time % 60

logger.info(f"⏱️ Total execution time: {elapsed_minutes}m {elapsed_seconds:.2f}s")
logger.info("✅ Notebook execution completed successfully.")

2025-09-10 16:48:00 - INFO - ⏱️ Total execution time: 1m 27.13s
2025-09-10 16:48:00 - INFO - ✅ Notebook execution completed successfully.
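
The total above does not show which stage dominates the runtime. A small context manager can record per-stage timings; this is a sketch, and `timed` and the `timings` dictionary are hypothetical names.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def timed(stage: str, results: Dict[str, float]) -> Iterator[None]:
    """Record the wall-clock duration of the enclosed block under results[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[stage] = time.perf_counter() - start


timings: Dict[str, float] = {}
with timed("example stage", timings):
    sum(range(1000))  # placeholder for a real stage such as transcription

for stage, seconds in timings.items():
    print(f"{stage}: {seconds:.4f}s")
```

Wrapping the transcription, translation, and synthesis cells this way would show how the total splits across the three stages.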


Built with ❤️ using [**Z by HP AI Studio**](https://zdocs.datascience.hp.com/docs/aistudio/overview).