## Hands-on exercise

Your objective is to take the cascaded speech-to-speech translation Gradio demo from the first section of this Unit and update it to translate to any non-English language. That is, the demo should take speech in language X and translate it to speech in language Y, where the target language Y is not English. Start by duplicating the template under your Hugging Face namespace. You don't need a GPU accelerator: the free CPU tier works just fine 🤗 However, you must set the visibility of your demo to public so that we can access it and check it for correctness.

Tips for updating the speech translation function to perform multilingual speech translation are provided in the section on speech-to-speech translation. By following these instructions, you should be able to update the demo to translate from speech in language X to text in language Y, which is half of the task!

To synthesise speech in language Y from text in language Y, where Y is a non-English language, you will need a TTS checkpoint that supports that language. You can either use the SpeechT5 TTS checkpoint that you fine-tuned in the previous hands-on exercise, or a pre-trained multilingual TTS checkpoint. There are two options for pre-trained checkpoints: sanchit-gandhi/speecht5_tts_vox_nl, a SpeechT5 checkpoint fine-tuned on the Dutch split of the VoxPopuli dataset, or an MMS TTS checkpoint (see the section on pretrained models for TTS).

In our experience with Dutch, an MMS TTS checkpoint performs better than a fine-tuned SpeechT5 one, but you might find that your fine-tuned TTS checkpoint is preferable in your language. If you decide to use an MMS TTS checkpoint, you will need to update the `requirements.txt` file of your demo to install `transformers` from the PR branch:
git+https://github.com/hollance/transformers.git@6900e8ba6532162a8613d2270ec2286c3f58f57b
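
As a rough illustration of the MMS option, the sketch below loads an MMS TTS checkpoint through the `VitsModel` and `VitsTokenizer` classes in `transformers`. It assumes a `transformers` install that includes VITS support. The checkpoint name `facebook/mms-tts-fra` (French) and the helper names `load_mms` and `synthesise_mms` are illustrative choices, not part of the template demo.

```python
# Sketch only: the checkpoint below is an assumed example for French;
# swap in the MMS TTS checkpoint for your own target language.
MMS_CHECKPOINT = "facebook/mms-tts-fra"

_mms_cache = {}


def load_mms(checkpoint=MMS_CHECKPOINT):
    """Load and cache the VITS model and tokenizer for an MMS TTS checkpoint."""
    # Imported lazily so the helpers can be defined before the heavy
    # dependencies are installed.
    from transformers import VitsModel, VitsTokenizer

    if checkpoint not in _mms_cache:
        model = VitsModel.from_pretrained(checkpoint)
        tokenizer = VitsTokenizer.from_pretrained(checkpoint)
        _mms_cache[checkpoint] = (model, tokenizer)
    return _mms_cache[checkpoint]


def synthesise_mms(text, checkpoint=MMS_CHECKPOINT):
    """Return a 1-D float waveform tensor (on CPU) for `text`."""
    import torch

    model, tokenizer = load_mms(checkpoint)
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # `waveform` has shape (batch, samples); keep the single batch item.
    return outputs.waveform[0]
```

Unlike SpeechT5, this path needs no speaker embeddings or separate vocoder. The output sampling rate can be read from `model.config.sampling_rate` rather than hard-coded.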

Your demo should take an audio file as input and return another audio file as output, matching the signature of the `speech_to_speech_translation` function in the template demo. We therefore recommend that you leave the main function `speech_to_speech_translation` as is, and only update the `translate` and `synthesise` functions as required.

Once you have built your demo as a Gradio demo on the Hugging Face Hub, you can submit it for assessment. Head to the Space audio-course-u7-assessment and provide the repository id of your demo when prompted. This Space will check that your demo has been built correctly by sending a sample audio file to your demo and checking that the returned audio file is indeed non-English. If your demo works correctly, you’ll get a green tick next to your name on the overall progress space ✅

In [48]:
!pip install torch gradio sentencepiece sacremoses

Collecting sacremoses
  Downloading sacremoses-0.1.1-py3-none-any.whl.metadata (8.3 kB)
Downloading sacremoses-0.1.1-py3-none-any.whl (897 kB)
Installing collected packages: sacremoses
Successfully installed sacremoses-0.1.1


In [49]:
# Speech to Speech Translation (English to French) from tutorial

import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
pipe = pipeline(
    "automatic-speech-recognition", model="openai/whisper-base", device=device
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [22]:
# This is just for Text translation

# import torch
# from transformers import pipeline, MarianMTModel, MarianTokenizer

# model_checkpoint = "facebook/" # 'facebook/m2m100_418M'
# pipe = pipeline("translation", model=model_checkpoint) # device=device translator



JackismyShephard/speecht5_tts-finetuned-nst-da

In [50]:
# Load the English validation split of VoxPopuli in streaming mode

from datasets import load_dataset

dataset = load_dataset("facebook/voxpopuli", "en", split="validation", streaming=True, trust_remote_code=True)
sample = next(iter(dataset))

In [51]:
# Play example

from IPython.display import Audio

Audio(sample["audio"]["array"], rate=sample["audio"]["sampling_rate"])

In [53]:
# Translate speech to French text: forcing Whisper's "fr" language token with the
# "transcribe" task makes it decode the audio as French text.

def translate(audio):
    outputs = pipe(
        audio, max_new_tokens=256, generate_kwargs={"task": "transcribe", "language": "fr"}
    )
    return outputs["text"]
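
The function above hard-codes French as the target. One way to keep the same Whisper call while making the target language configurable is a small factory, sketched below. `make_translate` is a hypothetical helper, and `asr_pipe` is assumed to be a Whisper automatic-speech-recognition pipeline like the `pipe` object created earlier.

```python
def make_translate(asr_pipe, target_language="fr", max_new_tokens=256):
    """Build a translate(audio) function bound to one target language.

    `asr_pipe` is assumed to be a Whisper ASR pipeline; `target_language`
    is a Whisper language code such as "fr" or "nl".
    """

    def translate(audio):
        outputs = asr_pipe(
            audio,
            max_new_tokens=max_new_tokens,
            generate_kwargs={"task": "transcribe", "language": target_language},
        )
        return outputs["text"]

    return translate
```

For example, `translate = make_translate(pipe, target_language="nl")` would produce a drop-in replacement for `translate` that targets Dutch, leaving `speech_to_speech_translation` unchanged.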

In [54]:
# Check the model

translate(sample["audio"].copy())


" Il y a des mesures qui ont été décourées, pas en septembre, mais aussi en marches. Et bien, nous verrons des mesures, peut-être pas encore, mais il y a des mesures qui ont été décourées. Et la situation pourrait être en worse si nous n'avons pas été décourées."

In [29]:
# Alternative (not used): speech translation with Speech2Text.
# The function reads from its `audio` argument, because a streaming dataset cannot be indexed as ds[0].

# import torch
# from transformers import Speech2TextProcessor, Speech2TextForConditionalGeneration
# from datasets import load_dataset

# model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-medium-mustc-multilingual-st")
# processor = Speech2TextProcessor.from_pretrained("facebook/s2t-medium-mustc-multilingual-st")

# def translate(audio):
#     inputs = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt")
#     generated_ids = model.generate(
#         inputs["input_features"],
#         attention_mask=inputs["attention_mask"],
#         forced_bos_token_id=processor.tokenizer.lang_code_to_id["fr"],
#     )
#     return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# translate(sample["audio"])

In [50]:
# This worked for text

# def translate(text, target_language):
#     model_name = f"Helsinki-NLP/opus-mt-en-{target_language}"
#     tokenizer = MarianTokenizer.from_pretrained(model_name)
#     model = MarianMTModel.from_pretrained(model_name)

#     inputs = tokenizer.encode(text, return_tensors="pt")
#     # Note: max_length=50 cuts off longer translations
#     outputs = model.generate(inputs, num_beams=4, max_length=50, early_stopping=True)
#     translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

#     return translated_text

In [51]:
# This worked for text

# input_text = sample["raw_text"]
# target_language = 'fr'
# translated_text = translate(input_text, target_language)
# print(translated_text)



De nombreuses mesures ont été prises, non seulement en septembre, mais aussi en mars, et nous voyons bien sûr certains effets de ces mesures, peut-être pas assez, mais il y a des effets de ces mesures, et la situation aurait


In [55]:
# Compare the French translation with the original English transcript

sample["raw_text"]

'Many measures have been taken, not only in September but also in March, and of course we see some effects of those measures perhaps not enough, but there are effects of those measures, and the situation could have been worse if we did not have taken those measures.'

In [56]:
# Load model for TTS

from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")

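# Move the model and vocoder to the same device as the inputs used in `synthesise`
model = SpeechT5ForTextToSpeech.from_pretrained("ccourc23/fine_tuned_SpeechT5").to(device)
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan").to(device)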

In [57]:
# Load speaker embeddings

embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

In [58]:
# Function that takes text and returns speech

def synthesise(text):
    inputs = processor(text=text, return_tensors="pt")
    speech = model.generate_speech(
        inputs["input_ids"].to(device), speaker_embeddings.to(device), vocoder=vocoder
    )
    return speech.cpu()

In [59]:
# Check it works

speech = synthesise("Hey there! This is a test!")

Audio(speech, rate=16000)

In [60]:
# Concatenate the two functions

import numpy as np

target_dtype = np.int16
max_range = np.iinfo(target_dtype).max


def speech_to_speech_translation(audio):
    translated_text = translate(audio)
    synthesised_speech = synthesise(translated_text)
    synthesised_speech = (synthesised_speech.numpy() * max_range).astype(target_dtype)
    return 16000, synthesised_speech
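
The scaling step above multiplies the waveform by the int16 maximum without bounding it first. If a TTS model ever emits samples slightly outside [-1, 1], the cast to int16 can yield wrapped or otherwise invalid values. A minimal sketch of a clipped conversion follows; `to_int16` is a hypothetical helper name, not part of the template.

```python
import numpy as np


def to_int16(waveform):
    """Convert a float waveform in [-1, 1] to 16-bit PCM, clipping out-of-range samples."""
    waveform = np.asarray(waveform, dtype=np.float32)
    max_range = np.iinfo(np.int16).max  # 32767
    clipped = np.clip(waveform, -1.0, 1.0)
    return (clipped * max_range).astype(np.int16)
```

Inside `speech_to_speech_translation`, the scaling line could then become `synthesised_speech = to_int16(synthesised_speech.numpy())`.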

In [61]:
# Make sure you get the expected result

sampling_rate, synthesised_speech = speech_to_speech_translation(sample["audio"])

Audio(synthesised_speech, rate=sampling_rate)
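
Listening to the result is the main check, but a quick programmatic test of the return value can catch format problems before you submit. The sketch below assumes the demo returns a `(sampling_rate, waveform)` pair with a mono int16 waveform, as in the function above. `check_demo_output` is a hypothetical helper, and it cannot tell which language is being spoken.

```python
import numpy as np


def check_demo_output(result):
    """Validate a (sampling_rate, waveform) pair and return its duration in seconds."""
    sampling_rate, samples = result
    samples = np.asarray(samples)
    if not isinstance(sampling_rate, int) or sampling_rate <= 0:
        raise ValueError("sampling rate must be a positive integer")
    if samples.dtype != np.int16:
        raise ValueError(f"expected int16 samples, got {samples.dtype}")
    if samples.ndim != 1 or samples.size == 0:
        raise ValueError("expected a non-empty mono waveform")
    if not np.any(samples):
        raise ValueError("waveform is completely silent")
    return samples.size / sampling_rate
```

For example, `check_demo_output((sampling_rate, synthesised_speech))` returns the clip length in seconds, or raises `ValueError` with a message that names the problem.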

In [62]:
# Create Gradio demo to use mic input

import gradio as gr

demo = gr.Blocks()

mic_translate = gr.Interface(
    fn=speech_to_speech_translation,
    inputs=gr.Audio(sources="microphone", type="filepath"),
    outputs=gr.Audio(label="Generated Speech", type="numpy"),
)

file_translate = gr.Interface(
    fn=speech_to_speech_translation,
    inputs=gr.Audio(sources="upload", type="filepath"),
    outputs=gr.Audio(label="Generated Speech", type="numpy"),
)

with demo:
    gr.TabbedInterface([mic_translate, file_translate], ["Microphone", "Audio File"])

demo.launch(debug=True, share=True)

# To create a public link, set `share=True` in `launch()`

Running on local URL:  http://127.0.0.1:7860
Running on public URL: https://23c34c9a40458f2de6.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7860 <> https://23c34c9a40458f2de6.gradio.live


