# Audio translator



This notebook demonstrates NeMo's ASR, NLP and TTS collections by chaining them into an audio translator. The translator takes an audio file as input and produces an audio file in the target language.

# How does it work?

The audio translator:

*   Converts audio to written text using ASR
*   Translates the written text to the target language
*   Creates a TTS audio file in the target language from the translated text
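
The three steps above can be sketched as a simple chain of functions. This is only an illustration of the data flow, not NeMo code: `translate_audio` and its three callable parameters are hypothetical names, standing in for the models loaded later in this notebook.

```python
from typing import Callable, List, Sequence


def translate_audio(
    audio_path: str,
    transcribe: Callable[[List[str]], List[str]],
    translate: Callable[[List[str]], List[str]],
    synthesize: Callable[[str], Sequence[float]],
) -> Sequence[float]:
    """Chain the three stages: ASR -> translation -> TTS.

    The three callables are placeholders (assumptions for this sketch);
    any functions with these shapes can be plugged in.
    """
    source_text = transcribe([audio_path])  # list with one transcript
    target_text = translate(source_text)    # list with one translation
    return synthesize(target_text[0])       # waveform for the translated text
```

Once the models below are loaded, the corresponding stages in this notebook are `asr_model.transcribe`, `nmt_model.translate` and the `text_to_audio` helper.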


This notebook needs a GPU, so it works best on Google Colab, with a short recording (under 1 minute).

If you run into issues with your own recording, try a shorter one to see whether that works.

# Importing the tools used

First, let's install and import the right collections.

In [None]:
!pip install "nemo_toolkit[all]"

In [None]:
# From NeMo, we import the following:

# Import NeMo and its ASR, NLP and TTS collections
import nemo
# Import Speech Recognition collection
import nemo.collections.asr as nemo_asr
# Import Natural Language Processing collection
import nemo.collections.nlp as nemo_nlp
# Import Speech Synthesis collection
import nemo.collections.tts as nemo_tts

# To listen to our audio files
import IPython.display

Next, we need to specify which models from our collections we'd like to use. In our example, we use a Spanish recording as input, and we want our output to be in English.

We need:

*   An ASR model in the language of our audio file
*   A translation model that translates from the language of our audio file to our target language
*   A spectrogram generator in our target language
*   A vocoder that can turn our spectrogram into an audio file





In [None]:
# Speech Recognition model 
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="stt_es_quartznet15x5").cuda()

# Neural Machine Translation model 
nmt_model = nemo_nlp.models.MTEncDecModel.from_pretrained(model_name='nmt_es_en_transformer12x2').cuda()

# Spectrogram generator which takes text as input and produces a spectrogram
spectrogram_generator = nemo_tts.models.FastPitchModel.from_pretrained(model_name="tts_en_fastpitch").cuda()

# Vocoder model which takes a spectrogram and produces the actual audio
vocoder = nemo_tts.models.HifiGanModel.from_pretrained(model_name="tts_hifigan").cuda()

In [None]:
# To use this notebook with other models, uncomment the lines below to list the available ones
# nemo_nlp.models.MTEncDecModel.list_available_models()
# nemo_asr.models.EncDecCTCModel.list_available_models()

# Let's start translating

Add the path to the audio file you would like to have translated.

In [None]:
# Feel free to add your own audio here, but if you don't have an audio sample yet, you can use the following
!wget 'https://www.lightbulblanguages.co.uk/resources/sp-audio/tengo-once-anos.mp3'

In [None]:
# Path to the audio sample we'll translate
# IMPORTANT: The audio must be mono with a 16 kHz sampling rate
audio_sample = 'tengo-once-anos.mp3'

# To listen to it, click the play button below
IPython.display.Audio(audio_sample)
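
If you bring your own recording as a WAV file, the mono / 16 kHz requirement can be checked before transcription. The helper below is a minimal sketch using only Python's standard `wave` module; `check_wav_format` is a hypothetical name, and it reads uncompressed PCM WAV files only, so it does not apply to the MP3 sample above.

```python
import wave


def check_wav_format(path, expected_rate=16000, expected_channels=1):
    """Return (ok, channels, rate) for a PCM WAV file.

    Hypothetical helper for this notebook: `ok` is True only when the file
    has the expected channel count and sampling rate.
    """
    with wave.open(path, "rb") as wav_file:
        channels = wav_file.getnchannels()
        rate = wav_file.getframerate()
    ok = channels == expected_channels and rate == expected_rate
    return ok, channels, rate
```

If `ok` comes back `False`, the returned channel count and rate show what needs to be converted before the recording is passed to the ASR model.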

Next, we'll transcribe the audio sample and print the resulting text.

In [None]:
transcribed_text = asr_model.transcribe([audio_sample])
print(transcribed_text)

Transcribing:   0%|          | 0/1 [00:00<?, ?it/s]

['tengo once años']


Then, we translate the transcribed text to our target language.

In [None]:
english_text = nmt_model.translate(transcribed_text)
print(english_text)

["I'm eleven years old"]


Lastly, we convert the translated text into speech using a spectrogram generator and a vocoder.

In [None]:
# A helper function which combines FastPitch and HifiGan to go directly from 
# text to audio
def text_to_audio(text):
  parsed = spectrogram_generator.parse(text)
  spectrogram = spectrogram_generator.generate_spectrogram(tokens=parsed)
  audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)
  return audio.to('cpu').detach().numpy()

Now we have our output:

In [None]:
# Listen to generated audio in English
IPython.display.Audio(text_to_audio(english_text[0]), rate=22050)
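
The cell above only plays the result. To keep it as an audio file, the samples can be written to disk. The following is a minimal sketch using only the standard library; `save_wav` is a hypothetical helper, and it assumes a one-dimensional sequence of float samples in the range [-1.0, 1.0] (values outside that range are clipped).

```python
import array
import wave


def save_wav(path, samples, rate=22050):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file.

    Hypothetical helper: `samples` is any 1-D iterable of floats.
    """
    # Clip each sample to [-1.0, 1.0] and scale it to the signed 16-bit range.
    pcm = array.array(
        "h",
        (int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples),
    )
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)      # mono
        wav_file.setsampwidth(2)      # 2 bytes = 16-bit samples
        wav_file.setframerate(rate)   # 22050 Hz matches the playback rate above
        wav_file.writeframes(pcm.tobytes())
```

Assuming `text_to_audio` returns an array with a leading batch dimension of size 1, a call such as `save_wav('translated.wav', text_to_audio(english_text[0])[0])` would store the English audio; the file name is only an example.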