# SeamlessM4Tv2 🤗

## Goal of this notebook

This notebook shows how to easily use [SeamlessM4T v2](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t_v2), a foundational multimodal model for speech translation, with 🤗 Transformers.

## TL;DR pointers

1. [Installation in one line](#installation) -> `!pip install --quiet git+https://github.com/huggingface/transformers sentencepiece`
2. [Speech to speech](#s2st)
3. [Speech to text](#s2tt)
4. [Text to speech](#t2ts)
5. [Text to text](#t2tt)


## Resources

1. [SeamlessM4T v2 docs in 🤗 Transformers](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t_v2)
2. [Demo on 🤗 Spaces](https://huggingface.co/spaces/facebook/seamless-m4t-v2-large)
3. [Model card](https://huggingface.co/facebook/seamless-m4t-v2-large)
4. [Original repository](https://github.com/facebookresearch/seamless_communication)

## Presentation of the model

SeamlessM4T is designed to provide high quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text.

SeamlessM4T enables multiple tasks without relying on separate models:

- Speech-to-speech translation (S2ST)
- Speech-to-text translation (S2TT)
- Text-to-speech translation (T2ST)
- Text-to-text translation (T2TT)
- Automatic speech recognition (ASR)

SeamlessM4T-v2 features a versatile architecture that smoothly handles the sequential generation of text and speech. This setup comprises two sequence-to-sequence (seq2seq) models. The first model translates the input modality into translated text, while the second model generates speech tokens, known as "unit tokens," from the translated text.

Each modality has its own dedicated encoder with a unique architecture. Additionally, for speech output, a vocoder inspired by the [HiFi-GAN](https://arxiv.org/abs/2010.05646) architecture is placed on top of the second seq2seq model.

Here's how the generation process works:

- Input text or speech is processed through its specific encoder.
- A decoder creates text tokens in the desired language.
- If speech generation is required, the second seq2seq model generates unit tokens in a non-autoregressive way.
- These unit tokens are then passed through the final vocoder to produce the actual speech.
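
The four steps above can be read as a simple control flow. The sketch below is schematic only: `run_pipeline`, `make_toy_stages`, and the stage names are hypothetical placeholders introduced for illustration, not 🤗 Transformers APIs, and each toy stage merely records that it was called.

```python
# Schematic sketch only: every name here is a hypothetical placeholder,
# not part of the Transformers API.

def run_pipeline(inputs, modality, stages, generate_speech=True):
    """Chain the stages in the order described in the list above."""
    # 1. Each modality goes through its own encoder.
    encoder = stages["speech_encoder"] if modality == "speech" else stages["text_encoder"]
    hidden_states = encoder(inputs)
    # 2. The text decoder produces text tokens in the target language.
    text_tokens = stages["text_decoder"](hidden_states)
    if not generate_speech:
        return text_tokens, None
    # 3. The second seq2seq model turns text tokens into unit tokens.
    unit_tokens = stages["text_to_unit"](text_tokens)
    # 4. The vocoder turns unit tokens into a waveform.
    waveform = stages["vocoder"](unit_tokens)
    return text_tokens, waveform


def make_toy_stages():
    """Build identity stages that log their name when called."""
    calls = []

    def make_stage(name):
        def stage(x):
            calls.append(name)
            return x
        return stage

    names = ("speech_encoder", "text_encoder", "text_decoder", "text_to_unit", "vocoder")
    return {name: make_stage(name) for name in names}, calls


stages, calls = make_toy_stages()
run_pipeline([0.0, 0.1], "speech", stages)
print(calls)  # ['speech_encoder', 'text_decoder', 'text_to_unit', 'vocoder']
```

Note that the text decoder always runs, even for speech output, which is why translated text can be returned alongside the waveform.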

## Prepare the Environment

Throughout this tutorial, we'll use a GPU. The runtime is already configured to use the free 16GB T4 GPU provided through Google Colab Free Tier, so all you need to do is hit "Connect T4" in the top right-hand corner of the screen.

##### <a name="installation"> We just need to install the 🤗 Transformers package from the main branch and the sentencepiece package:</a>




In [None]:
!pip install --quiet git+https://github.com/huggingface/transformers sentencepiece

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


We'll also install the `datasets` package to have convenient examples at hand:

In [None]:
!pip install --quiet datasets

## Preprocessing



### Load the model

The pre-trained checkpoint can be loaded from [the pre-trained weights](https://huggingface.co/facebook/seamless-m4t-v2-large) on the 🤗 Hugging Face Hub.

In [1]:
import os
mycache_dir= r"D:\Cache\huggingface"
os.environ['HF_HOME'] = mycache_dir

In [3]:
from transformers import SeamlessM4Tv2Model

model = SeamlessM4Tv2Model.from_pretrained("facebook/seamless-m4t-v2-large", cache_dir=mycache_dir)

Loading checkpoint shards: 100%|██████████| 2/2 [00:09<00:00,  4.79s/it]


Move the model to an accelerator device if one is available.

In [4]:
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = model.to(device)

### Load the Processor

Before anything else, load a `SeamlessM4TProcessor` so that you can pre-process the inputs. The Transformers package has a wide range of processors, so we'll use the `AutoProcessor` class, which can recognize which processor to load from the repository id.

The processor plays two roles here:
1. It prepares the inputs. It tokenizes the input text, i.e. cuts it into small pieces that the model can understand, and transforms the audio into a format more suitable for the model.
2. It processes the model outputs. Here, it "detokenizes" the generated tokens, i.e. performs the opposite operation to the one described above.

In [5]:
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large", cache_dir=mycache_dir)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


You can seamlessly use this model on text or on audio, to generate either translated text or translated audio.

### Preparing audio
Here is how to use the processor to process audio. We'll use an audio sample taken from a Hindi speech corpus.

**Note that you don't need to specify the source language: the model infers it automatically!**


In [6]:
# let's load an audio sample from a Hindi speech corpus
from datasets import load_dataset
dataset = load_dataset("google/fleurs", "hi_in", split="train", streaming=True)
audio_sample = next(iter(dataset))["audio"]

print(f"Sampling rate: {audio_sample['sampling_rate']}")

Downloading builder script: 100%|██████████| 12.6k/12.6k [00:00<00:00, 6.27MB/s]
Downloading readme: 100%|██████████| 13.3k/13.3k [00:00<00:00, 6.67MB/s]


Sampling rate: 16000


In [8]:
audio_sample.keys()

dict_keys(['path', 'array', 'sampling_rate'])

In [9]:
audio_sample['array'].shape

(138240,)

In [10]:
model.config.sampling_rate

16000

In [32]:
from IPython.display import Audio

sample_rate = audio_sample['sampling_rate']
Audio(audio_sample['array'], rate=sample_rate)

**Note:** The [sampling rate](https://en.wikipedia.org/wiki/Sampling_(signal_processing)) of the input audio must be 16 kHz. If your sampling rate is different, you'll need an additional step to prepare the input audio.

**Here is how to do it:**
```python
# we need an additional library
# you can install it with:
# !pip install torchaudio
import torchaudio, torch

# you need to convert the audio from a numpy array to a torch tensor first
audio = torch.tensor(audio_sample["array"])

# now resample the audio to the sampling rate the model expects
audio = torchaudio.functional.resample(audio, orig_freq=audio_sample['sampling_rate'], new_freq=model.config.sampling_rate)
```

Now, use the processor:

In [7]:
audio_inputs = processor(audios=audio_sample["array"], return_tensors="pt").to(device)

It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


In [11]:
audio_inputs.keys()

dict_keys(['input_features', 'attention_mask'])

In [12]:
audio_inputs['input_features'].shape

torch.Size([1, 431, 160])
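
The shape `[1, 431, 160]` can be traced back to the 138240 input samples. The sketch below assumes the feature extractor computes 80 mel filter-bank features over 25 ms windows with a 10 ms hop, without padding the edges, and then stacks every 2 consecutive frames. These values are assumptions about the default configuration rather than something stated in this notebook, and `expected_feature_shape` is a hypothetical helper.

```python
def expected_feature_shape(num_samples, sampling_rate=16000, win_ms=25,
                           hop_ms=10, num_mel_bins=80, stride=2):
    """Estimate (frames, features) for one waveform under the assumed defaults."""
    win = sampling_rate * win_ms // 1000   # 400 samples per window at 16 kHz
    hop = sampling_rate * hop_ms // 1000   # 160 samples between windows
    if num_samples < win:
        return (0, num_mel_bins * stride)
    num_frames = 1 + (num_samples - win) // hop
    # Stacking `stride` frames divides the time axis and multiplies the feature axis.
    return (num_frames // stride, num_mel_bins * stride)


print(expected_feature_shape(138240))  # (431, 160)
```

Under these assumptions, each of the 431 positions covers about 20 ms of audio, so the encoder sees a sequence several hundred times shorter than the raw waveform.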

### Preparing text

Preparing text is much easier: you just have to give it to the processor, along with the language of the text. Here the text is in English, so we'll set `src_lang="eng"`.

In [13]:
# now, process some English text as well
text_inputs = processor(text = "Hello, my dog is cute", src_lang="eng", return_tensors="pt").to(device)

In [14]:
text_inputs.keys()

dict_keys(['input_ids', 'attention_mask'])

In [16]:
text_inputs['input_ids'].shape

torch.Size([1, 8])

## Model usage

Now we have everything ready to actually use the model!

### Generate translated speech

`SeamlessM4Tv2Model` can *seamlessly* generate text or speech with few or no changes. Let's translate to Mandarin Chinese speech:

In [20]:
audio_array_from_text = model.generate(**text_inputs, tgt_lang="cmn")
audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="cmn")

In [22]:
audio_array_from_text #tuple

(tensor([[ 4.0736e-04,  6.2483e-04,  6.9649e-04,  ..., -2.6668e-05,
          -7.4610e-05,  1.0283e-05]], device='cuda:0'),
 tensor(35200, device='cuda:0'))

In [23]:
audio_array_from_text[0].shape

torch.Size([1, 35200])

In [24]:
audio_array_from_text = audio_array_from_text[0].cpu().numpy().squeeze()

In [25]:
audio_array_from_text.shape

(35200,)

In [26]:
audio_array_from_audio

(tensor([[ 7.6779e-05,  2.0793e-04,  2.0927e-04,  ..., -9.6151e-05,
          -1.4052e-04, -8.8349e-05]], device='cuda:0'),
 tensor(120960, device='cuda:0'))

In [27]:
audio_array_from_audio[0].shape

torch.Size([1, 120960])

In [28]:
audio_array_from_audio = audio_array_from_audio[0].cpu().numpy().squeeze()

In [29]:
audio_array_from_audio.shape

(120960,)
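
The sample counts printed so far are easier to interpret as durations. The arithmetic below assumes every waveform uses the 16000 Hz rate reported by `model.config.sampling_rate`; `duration_seconds` is a hypothetical helper, not a library function.

```python
def duration_seconds(num_samples, sampling_rate):
    """Convert a number of samples into a duration in seconds."""
    if sampling_rate <= 0:
        raise ValueError("sampling_rate must be positive")
    return num_samples / sampling_rate


# Sample counts taken from the shapes printed in this notebook.
for label, num_samples in [
    ("Hindi input audio", 138240),
    ("speech generated from text", 35200),
    ("speech generated from audio", 120960),
]:
    print(f"{label}: {duration_seconds(num_samples, 16000):.2f} s")
# Hindi input audio: 8.64 s
# speech generated from text: 2.20 s
# speech generated from audio: 7.56 s
```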

**With the exact same code but different inputs, I’ve translated English text and Hindi speech to Mandarin Chinese speech samples.**

Now, let's listen to the generated audio!

#### From text

In [30]:
from IPython.display import Audio

sample_rate = model.config.sampling_rate
Audio(audio_array_from_text, rate=sample_rate)

#### From audio

In [31]:
Audio(audio_array_from_audio, rate=sample_rate)

You can also save audio as .wav files using a third-party library, e.g. scipy (note here that we also need to remove the channel dimension from our audio tensor):

In [36]:
import scipy.io.wavfile  # import the submodule explicitly so that scipy.io.wavfile is always available

scipy.io.wavfile.write("seamless_m4t_out.wav", rate=sample_rate, data=audio_array_from_text) # audio_array_from_audio
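
When given a float32 array, `scipy.io.wavfile.write` stores 32-bit floating-point samples, which some audio players may not handle. As an alternative, here is a minimal sketch that writes 16-bit PCM using only the Python standard library. `write_pcm16_wav` is a hypothetical helper, and it assumes mono samples in the range [-1.0, 1.0]; a synthetic tone stands in for the generated waveform.

```python
import math
import os
import struct
import tempfile
import wave


def write_pcm16_wav(path, samples, sample_rate):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # avoid integer overflow
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))


# Stand-in for a generated waveform: 0.1 s of a 440 Hz tone at 16 kHz.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "seamless_m4t_out_pcm16.wav")
    write_pcm16_wav(out_path, tone, 16000)
    with wave.open(out_path, "rb") as wav_file:
        print(wav_file.getnframes(), wav_file.getframerate())  # 1600 16000
```

With the arrays from this notebook, the call would look like `write_pcm16_wav("seamless_m4t_out.wav", audio_array_from_text, sample_rate)`.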


### Generate translated text

Similarly, you can generate translated text from audio files or from text with the same model. You only have to pass `generate_speech=False` to `SeamlessM4Tv2Model.generate`.

This time, let's translate the Hindi audio to English (I personally don't speak Hindi 🤗) and the English text to Mandarin Chinese.

In [37]:
# from audio
output_tokens = model.generate(**audio_inputs, tgt_lang="eng", generate_speech=False)

In [40]:
output_tokens.keys()

odict_keys(['sequences', 'scores', 'encoder_hidden_states', 'decoder_hidden_states', 'past_key_values'])

In [42]:
output_tokens[0].shape

torch.Size([1, 27])

In [45]:
output_tokens[0]

tensor([[     3, 256022, 151200,  73420,  14375,  10143,  24842,  73368,   2539,
           7047,   1208,     70,    321,  43195, 149432,    243,    128,    249,
           2959, 228400, 172675,    321,  20077,   2361,  94931, 247676,      3]],
       device='cuda:0')

In [43]:
output_tokens['sequences'].shape

torch.Size([1, 27])

In [44]:
output_tokens['sequences']

tensor([[     3, 256022, 151200,  73420,  14375,  10143,  24842,  73368,   2539,
           7047,   1208,     70,    321,  43195, 149432,    243,    128,    249,
           2959, 228400, 172675,    321,  20077,   2361,  94931, 247676,      3]],
       device='cuda:0')

In [48]:
len(output_tokens['scores']) #tuple

25

In [50]:
output_tokens['scores'][0].shape

torch.Size([1, 256102])
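
The sequence holds 27 token ids while `scores` holds only 25 entries. One plausible reading, assumed here and not confirmed by this notebook, is that the first two ids (the leading `3` and what looks like a target-language token) form the decoder prompt, and that each remaining token has one score vector over the 256102-entry vocabulary. The toy example below illustrates that bookkeeping under greedy decoding; `greedy_tokens` and the toy data are hypothetical.

```python
def greedy_tokens(scores):
    """Pick the highest-scoring token id at every generation step."""
    return [max(range(len(step)), key=step.__getitem__) for step in scores]


# Toy data: a 2-token decoder prompt and 3 steps over a 4-token vocabulary.
prompt = [3, 1]
toy_scores = [
    [0.1, 0.2, 2.5, 0.0],
    [1.7, 0.3, 0.2, 0.1],
    [0.0, 0.1, 0.2, 3.0],
]
sequence = prompt + greedy_tokens(toy_scores)
print(sequence)  # [3, 1, 2, 0, 3]

# One score vector per generated token, none for the prompt.
assert len(sequence) - len(prompt) == len(toy_scores)
# Same bookkeeping with the numbers printed above: 27 tokens, 25 score vectors.
assert 27 - 2 == 25
```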

In [55]:
output_tokens[0].tolist()

[[3,
  256022,
  151200,
  73420,
  14375,
  10143,
  24842,
  73368,
  2539,
  7047,
  1208,
  70,
  321,
  43195,
  149432,
  243,
  128,
  249,
  2959,
  228400,
  172675,
  321,
  20077,
  2361,
  94931,
  247676,
  3]]

In [56]:
len(output_tokens[0].tolist()[0])

27

In [57]:
translated_text_from_audio = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
print(f"Translation from audio: {translated_text_from_audio}")

Translation from audio: Politicians said they found enough ambiguity in the Afghan constitution to unnecessarily determine the decisive vote.


In [60]:
output_tokens[0].squeeze()

tensor([     3, 256022, 151200,  73420,  14375,  10143,  24842,  73368,   2539,
          7047,   1208,     70,    321,  43195, 149432,    243,    128,    249,
          2959, 228400, 172675,    321,  20077,   2361,  94931, 247676,      3],
       device='cuda:0')

In [61]:
# decode expects a single 1-D sequence of token ids
translated_text_from_audio = processor.decode(output_tokens[0].squeeze(), skip_special_tokens=True)
print(f"Translation from audio: {translated_text_from_audio}")

Translation from audio: Politicians said they found enough ambiguity in the Afghan constitution to unnecessarily determine the decisive vote.


In [63]:
output_tokens[0].shape

torch.Size([1, 27])

In [62]:
# batch_decode expects a [batch_size, sequence_length] input and returns a list of strings
translated_text_from_audio = processor.batch_decode(output_tokens[0], skip_special_tokens=True)
print(f"Translation from audio: {translated_text_from_audio}")

Translation from audio: ['Politicians said they found enough ambiguity in the Afghan constitution to unnecessarily determine the decisive vote.']


In [64]:
# from text, to text
output_tokens = model.generate(**text_inputs, tgt_lang="cmn", generate_speech=False)

In [65]:
output_tokens

GenerateEncoderDecoderOutput(sequences=tensor([[     3, 256016,  15079, 249071, 247681,  23559, 251474, 249315, 248812,
         250321,      3]], device='cuda:0'), scores=(tensor([[2.0167, 4.6405, 1.9794,  ..., 1.9157, 1.6959, 1.9943]],
       device='cuda:0'), tensor([[ 1.3537,  4.9796,  1.2878,  ...,  1.2913,  0.8213, -0.3426]],
       device='cuda:0'), tensor([[2.1103, 8.9505, 2.1238,  ..., 2.0964, 2.5877, 2.1784]],
       device='cuda:0'), tensor([[1.0962, 4.9048, 1.0854,  ..., 1.0492, 1.2545, 1.0371]],
       device='cuda:0'), tensor([[1.5307, 6.4930, 1.5911,  ..., 1.5024, 1.3599, 1.0185]],
       device='cuda:0'), tensor([[1.5632, 4.7460, 1.6296,  ..., 1.6065, 0.6436, 0.0816]],
       device='cuda:0'), tensor([[1.8889, 8.3170, 1.9302,  ..., 1.8464, 2.1612, 0.7674]],
       device='cuda:0'), tensor([[1.5138, 5.5752, 1.5303,  ..., 1.4857, 1.2686, 0.4793]],
       device='cuda:0'), tensor([[ 2.0143, 13.9362,  1.9946,  ...,  2.0041,  2.1285,  1.9348]],
       device='cuda:0')), logi

In [66]:
output_tokens.keys()

odict_keys(['sequences', 'scores', 'encoder_hidden_states', 'decoder_hidden_states', 'past_key_values'])

In [67]:
output_tokens['sequences'].shape

torch.Size([1, 11])

In [68]:
output_tokens[0].shape

torch.Size([1, 11])

In [69]:
translated_text_from_text = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
print(f"Translation from text: {translated_text_from_text}")

Translation from text: 你好,我的狗很可爱


In [70]:
translated_text_from_text = processor.decode(output_tokens[0].squeeze(), skip_special_tokens=True)
print(f"Translation from text: {translated_text_from_text}")

Translation from text: 你好,我的狗很可爱


In [71]:
translated_text_from_text = processor.batch_decode(output_tokens[0], skip_special_tokens=True)
print(f"Translation from text: {translated_text_from_text}")

Translation from text: ['你好,我的狗很可爱']


## Intermediary conclusion

Now you know how to use [SeamlessM4T v2 with 🤗 Transformers](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t_v2).

Let's wrap it up:
1. SeamlessM4T v2 can **translate text/speech to text/speech.**
2. It **supports numerous languages** and is a great step towards reducing language barriers in AI.
3. It is **fast and efficient**.
4. The rest of this notebook will share some **tips** on how to best use the model! (**Spoiler:** You can also do batched inference!)
5. You can try this new version of SeamlessM4T in this [demo on 🤗 Spaces](https://huggingface.co/spaces/facebook/seamless-m4t-v2-large).

**Don't hesitate to share how you think this model should be used!**


## Tips



### 1. Use dedicated models

`SeamlessM4Tv2Model` is the top-level Transformers model that generates both speech and text, but you can also use dedicated models that perform a single task without the additional components, thus reducing the memory footprint.
For example, you can replace the audio-to-audio generation snippet with the model dedicated to the S2ST task. The rest of the code stays exactly the same:

```python
>>> from transformers import SeamlessM4Tv2ForSpeechToSpeech
>>> model = SeamlessM4Tv2ForSpeechToSpeech.from_pretrained("facebook/seamless-m4t-v2-large")
```

Or you can replace the text-to-text generation snippet with the model dedicated to the T2TT task. You only have to remove `generate_speech=False`:

```python
>>> from transformers import SeamlessM4Tv2ForTextToText
>>> model = SeamlessM4Tv2ForTextToText.from_pretrained("facebook/seamless-m4t-v2-large")
```

Feel free to try out `SeamlessM4Tv2ForSpeechToText` and `SeamlessM4Tv2ForTextToSpeech` as well.

### 2. Change the speaker identity

You can change the speaker used for speech synthesis with the `speaker_id` argument. Some `speaker_id` values work better than others for some languages!


In [72]:
# let's test with, say, speaker_id=5 and tgt_lang="eng"
audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="eng", speaker_id=5)[0].cpu().numpy().squeeze()

Audio(audio_array_from_audio, rate=sample_rate)


### 3. Change the generation strategy

You can use different [generation strategies](https://huggingface.co/docs/transformers/main/en/generation_strategies) for text and speech generation, e.g. `.generate(input_ids=input_ids, text_num_beams=4, speech_do_sample=True)`, which performs beam-search decoding on the text model and then multinomial sampling on the speech model.


In [73]:
audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="eng", speaker_id=7, text_num_beams=4, speech_do_sample=True, speech_temperature=0.6)[0].cpu().numpy().squeeze()

Audio(audio_array_from_audio, rate=sample_rate)
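
To make the `text_` and `speech_` prefixes concrete, here is a minimal pure-Python sketch of how such keyword arguments could be routed to the two sub-models. It rests on the assumption that unprefixed arguments go to both sub-models and that prefixed ones take priority. `split_generation_kwargs` is a hypothetical helper written for illustration, not the library's implementation, which may treat some arguments specially.

```python
def split_generation_kwargs(**kwargs):
    """Split kwargs into (text_kwargs, speech_kwargs) based on their prefix."""
    text_kwargs, speech_kwargs = {}, {}
    # Unprefixed arguments are shared by both sub-models (assumption).
    for key, value in kwargs.items():
        if not key.startswith(("text_", "speech_")):
            text_kwargs[key] = value
            speech_kwargs[key] = value
    # Prefixed arguments are applied last, so they override shared ones.
    for key, value in kwargs.items():
        if key.startswith("text_"):
            text_kwargs[key[len("text_"):]] = value
        elif key.startswith("speech_"):
            speech_kwargs[key[len("speech_"):]] = value
    return text_kwargs, speech_kwargs


text_kwargs, speech_kwargs = split_generation_kwargs(
    text_num_beams=4, speech_do_sample=True, speech_temperature=0.6
)
print(text_kwargs)    # {'num_beams': 4}
print(speech_kwargs)  # {'do_sample': True, 'temperature': 0.6}
```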


### 4. Generate speech and text at the same time

Use `return_intermediate_token_ids=True` with `SeamlessM4Tv2Model` to return both speech and text!

In [74]:
output = model.generate(**audio_inputs, return_intermediate_token_ids=True, tgt_lang="eng", speaker_id=7, text_num_beams=4, speech_do_sample=True, speech_temperature=0.6)

audio_array_from_audio = output[0].cpu().numpy().squeeze()
text_tokens = output[2]
translated_text_from_audio = processor.decode(text_tokens.tolist()[0], skip_special_tokens=True)
print(f"TRANSLATION: {translated_text_from_audio}")

Audio(audio_array_from_audio, rate=sample_rate)

TRANSLATION: Politicians said they had found enough ambiguity in the Afghan constitution to unnecessarily determine the decisive vote.


### 5. Use batching for increased throughput

Batching with SeamlessM4T is supported in 🤗 Transformers. Here is an example with two French sentences translated to English!

In [None]:
text_inputs = processor(text = ["J'aime HF de tout mon coeur.", "La vie est belle."], src_lang="fra", return_tensors="pt").to(device)

audio_array_from_text = model.generate(**text_inputs, tgt_lang="eng", speaker_id=7, num_beams=5, speech_do_sample=True, speech_temperature=0.6)

When batching, you can get the length of each generated waveform by accessing `audio_array_from_text[1]`.

In [None]:
# first sentence
length = audio_array_from_text[1][0]
audio = audio_array_from_text[0][0]
Audio(audio[:length].cpu().numpy().squeeze(), rate=sample_rate)

In [None]:
# second sentence
length = audio_array_from_text[1][1]
audio = audio_array_from_text[0][1]
Audio(audio[:length].cpu().numpy().squeeze(), rate=sample_rate)
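
The same trimming can be applied to every item of the batch at once. The sketch below uses plain Python lists in place of tensors so that it runs without the model; `trim_padded_batch` is a hypothetical helper, and the toy values stand in for `audio_array_from_text[0]` (padded waveforms) and `audio_array_from_text[1]` (their lengths).

```python
def trim_padded_batch(waveforms, lengths):
    """Cut each padded waveform down to its true length."""
    if len(waveforms) != len(lengths):
        raise ValueError("expected one length per waveform")
    return [list(waveform[:length]) for waveform, length in zip(waveforms, lengths)]


# Toy batch: the first waveform is padded with two trailing zeros.
padded = [
    [0.1, 0.2, 0.3, 0.0, 0.0],
    [0.5, 0.4, 0.3, 0.2, 0.1],
]
lengths = [3, 5]
trimmed = trim_padded_batch(padded, lengths)
print([len(waveform) for waveform in trimmed])  # [3, 5]
```

Trimming matters because padding values appended to the shorter waveform would otherwise be played back as extra audio at the end of the clip.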


## Summary


This last part groups together the various code snippets so that you can use the model even more easily.

### <a name="s2st"> Speech to translated speech</a>

`audio_sample["array"]` is an audio waveform that has been loaded [here](#preparing-audio) using `datasets`. You can replace it with your own one-dimensional audio waveform numpy array.
  

In [None]:
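import torch
from transformers import SeamlessM4Tv2Model, AutoProcessor

# keep the model and its inputs on the same device
device = "cuda:0" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
model = SeamlessM4Tv2Model.from_pretrained("facebook/seamless-m4t-v2-large").to(device)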

# process input
audio_inputs = processor(audios=audio_sample["array"], return_tensors="pt").to(device)

# generate translation
audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



Now, let's listen to the generated audio!

In [None]:
from IPython.display import Audio

sample_rate = model.config.sampling_rate
Audio(audio_array_from_audio, rate=sample_rate)

You can also save audio as .wav files using a third-party library, e.g. scipy (note here that we also need to remove the channel dimension from our audio tensor):

```python
import scipy.io.wavfile  # import the submodule explicitly so that scipy.io.wavfile is always available

scipy.io.wavfile.write("seamless_m4t_out.wav", rate=sample_rate, data=audio_array_from_audio)
```

### <a name="s2tt"> Speech to translated text</a>

In [None]:
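import torch
from transformers import SeamlessM4Tv2Model, AutoProcessor

# keep the model and its inputs on the same device
device = "cuda:0" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
model = SeamlessM4Tv2Model.from_pretrained("facebook/seamless-m4t-v2-large").to(device)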

# process input
audio_inputs = processor(audios=audio_sample["array"], return_tensors="pt").to(device)

# generate translation
output_tokens = model.generate(**audio_inputs, tgt_lang="eng", generate_speech=False)
translated_text_from_audio = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
print(f"Translation from audio: {translated_text_from_audio}")

### <a name="t2ts"> Text to translated speech</a>

In [None]:
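import torch
from transformers import SeamlessM4Tv2Model, AutoProcessor

# keep the model and its inputs on the same device
device = "cuda:0" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
model = SeamlessM4Tv2Model.from_pretrained("facebook/seamless-m4t-v2-large").to(device)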

# process input
text_inputs = processor(text = "Hello, my dog is cute", src_lang="eng", return_tensors="pt").to(device)

# generate translation
audio_array_from_text = model.generate(**text_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()


Now, let's listen to the generated audio!

In [None]:
from IPython.display import Audio

sample_rate = model.config.sampling_rate
Audio(audio_array_from_text, rate=sample_rate)

You can also save audio as .wav files using a third-party library, e.g. scipy (note here that we also need to remove the channel dimension from our audio tensor):

```python
import scipy.io.wavfile  # import the submodule explicitly so that scipy.io.wavfile is always available

scipy.io.wavfile.write("seamless_m4t_out.wav", rate=sample_rate, data=audio_array_from_text)
```

### <a name="t2tt"> Text to translated text</a>

In [None]:
import torch
from transformers import SeamlessM4Tv2Model, AutoProcessor

# keep the model and its inputs on the same device
device = "cuda:0" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
model = SeamlessM4Tv2Model.from_pretrained("facebook/seamless-m4t-v2-large").to(device)

# process input
text_inputs = processor(text = "Hello, my dog is cute", src_lang="eng", return_tensors="pt").to(device)

# generate translation
output_tokens = model.generate(**text_inputs, tgt_lang="fra", generate_speech=False)
translated_text_from_text = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
print(f"Translation from text: {translated_text_from_text}")