<a href="https://colab.research.google.com/github/R3gm/InsightSolver-Colab/blob/main/SeamlessM4T.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SeamlessM4T

| Code Credits | Link |
| ----------- | ---- |
| 🎉 seamless_communication | [![GitHub Repository](https://img.shields.io/github/stars/facebookresearch/seamless_communication?style=social)](https://github.com/facebookresearch/seamless_communication) |
| 🚀 Online inference | [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/facebook/seamless_m4t) |
| 🔥 Discover More Colab Notebooks | [![GitHub Repository](https://img.shields.io/badge/GitHub-Repository-black?style=flat-square&logo=github)](https://github.com/R3gm/InsightSolver-Colab/) |


SeamlessM4T (Massively Multilingual & Multimodal Machine Translation) is a single model that translates between speech and text for up to 100 languages.

Audio-to-audio translation is usually done as a cascade of separate steps: transcription, text translation, and finally speech synthesis, as in [SoniTranslate](https://github.com/R3gm/SoniTranslate). SeamlessM4T can perform all of these tasks with one model.
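
To make the contrast concrete, here is a minimal sketch of a cascaded system as a chain of three stages. The stage functions (`transcribe`, `translate_text`, `synthesize`) and the `cascade` helper are hypothetical stand-ins, not part of seamless_communication or SoniTranslate. They only illustrate that each stage consumes the output of the previous one.

```python
from typing import Callable

# Hypothetical stand-ins for the three stages of a cascaded system.
def transcribe(audio: str) -> str:
    return f"text({audio})"

def translate_text(text: str) -> str:
    return f"translated({text})"

def synthesize(text: str) -> str:
    return f"audio({text})"

def cascade(*stages: Callable[[str], str]) -> Callable[[str], str]:
    """Chain stages so that each one receives the previous stage's output."""
    def run(value: str) -> str:
        for stage in stages:
            value = stage(value)
        return value
    return run

cascaded_s2st = cascade(transcribe, translate_text, synthesize)
print(cascaded_s2st("speech.wav"))  # audio(translated(text(speech.wav)))
```

In the rest of this notebook, a single `translator.predict(...)` call takes the place of such a chain.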

In [None]:
!pip install fairseq2==0.1 pydub yt-dlp
!git clone https://github.com/facebookresearch/seamless_communication.git
%cd seamless_communication
!git checkout 01c1042841f9bce66902eb2c7512dbdd71d42112 # We will use a stable version; if you want to use the latest version, comment out this line.
!pip install .

Utility Functions and Libraries

In [None]:
import os

import torch
import torchaudio
from IPython.display import Audio, display
from pydub import AudioSegment
from pydub.silence import split_on_silence
from seamless_communication.models.inference import Translator

def save_and_play_audio(path_save, audio, sample_rate):
    torchaudio.save(
        path_save,
        audio[0].cpu(),
        sample_rate=sample_rate,
    )

    audio_play = Audio(path_save, rate=sample_rate, autoplay=True, normalize=True)
    display(audio_play)

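def split_audio_with_max_duration(input_file, output_directory, min_silence_len=2500, silence_thresh=-60, max_chunk_duration=15000):
    """Split a WAV file on silence, then cap every chunk at max_chunk_duration (ms).

    min_silence_len is in milliseconds and silence_thresh in dBFS.
    The chunks are written to output_directory as chunk0.wav, chunk1.wav, ...
    """
    sound = AudioSegment.from_wav(input_file)

    # Split on silence
    audio_chunks = split_on_silence(sound, min_silence_len=min_silence_len, silence_thresh=silence_thresh)

    # Further split any chunk longer than max_chunk_duration into equal parts
    final_audio_chunks = []
    for chunk in audio_chunks:
        if len(chunk) > max_chunk_duration:
            num_subchunks = len(chunk) // max_chunk_duration + 1
            # Ceiling division, so the last subchunk reaches the end of the chunk
            subchunk_size = -(-len(chunk) // num_subchunks)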
            for i in range(num_subchunks):
                start_idx = i * subchunk_size
                end_idx = (i + 1) * subchunk_size
                subchunk = chunk[start_idx:end_idx]
                final_audio_chunks.append(subchunk)
        else:
            final_audio_chunks.append(chunk)

    # Export wav
    for i, chunk in enumerate(final_audio_chunks):
        output_file = f"{output_directory}/chunk{i}.wav"
        print("Exporting file", output_file)
        chunk.export(output_file, format="wav")
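
The length-capping arithmetic can be checked without any audio. The sketch below uses a hypothetical helper, `subchunk_bounds`, that is not part of the notebook. It reproduces the subchunk computation on plain integers (durations in milliseconds) and assumes ceiling division, so that the pieces cover the whole chunk.

```python
def subchunk_bounds(length_ms: int, max_ms: int = 15000) -> list[tuple[int, int]]:
    """Return (start, end) offsets in ms covering length_ms in pieces of at most max_ms."""
    if length_ms <= max_ms:
        return [(0, length_ms)]
    num = length_ms // max_ms + 1
    size = -(-length_ms // num)  # ceiling division
    # Clamp the last end offset to the total length.
    return [(i * size, min((i + 1) * size, length_ms)) for i in range(num)]

print(subchunk_bounds(40000))
# [(0, 13334), (13334, 26668), (26668, 40000)]
```

A 40-second chunk is cut into three pieces of roughly 13.3 seconds each, rather than two 15-second pieces and a short 10-second remainder.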

Load the model

In [None]:
# Initialize a Translator object with a multitask model, vocoder on the GPU.
translator = Translator(
    "seamlessM4T_large",
    "vocoder_36langs",
    torch.device("cuda:0")
)

Downloading the checkpoint of the model 'seamlessM4T_large'...
100%|██████████| 10.7G/10.7G [00:57<00:00, 200MB/s]
Downloading the tokenizer of the model 'seamlessM4T_large'...
100%|██████████| 4.93M/4.93M [00:00<00:00, 104MB/s]
Downloading the checkpoint of the model 'vocoder_36langs'...
100%|██████████| 160M/160M [00:00<00:00, 244MB/s]


We will process the audio from a YouTube video.

In [None]:
# Download the video
video_url = 'www.youtube.com/watch?v=g_9rPvbENUw'
!yt-dlp -f "mp4"  --force-overwrites --max-downloads 1 --no-warnings --no-abort-on-error --ignore-no-formats-error --restrict-filenames -o Video.mp4  $video_url

[generic] Extracting URL: www.youtube.com/watch?v=g_9rPvbENUw
[youtube] Extracting URL: http://www.youtube.com/watch?v=g_9rPvbENUw
[youtube] g_9rPvbENUw: Downloading webpage
[youtube] g_9rPvbENUw: Downloading ios player API JSON
[youtube] g_9rPvbENUw: Downloading android player API JSON
[youtube] g_9rPvbENUw: Downloading m3u8 information
[youtube] g_9rPvbENUw: Downloading MPD manifest
[info] g_9rPvbENUw: Downloading 1 format(s): 22
[download] Destination: Video.mp4
[download] 100% of    8.68MiB in 00:00:00 at 12.17MiB/s
[info] Maximum number of downloads reached, stopping due to --max-downloads
Aborting remaining downloads


In [None]:
# Convert to wav
!ffmpeg -y -i Video.mp4 -vn -acodec pcm_s16le -ar 44100 -ac 2 audio.wav

ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enab
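
If you prefer to run the conversion from Python rather than with a `!` shell escape, the same command can be built as an argument list. This is only a sketch: `ffmpeg_to_wav_cmd` is a hypothetical helper, and the flags are exactly the ones used in the cell above.

```python
import subprocess

def ffmpeg_to_wav_cmd(src: str, dst: str, sample_rate: int = 44100, channels: int = 2) -> list[str]:
    """Build the argument list for the ffmpeg conversion used in this notebook."""
    return [
        "ffmpeg", "-y",            # overwrite the output file without asking
        "-i", src,                 # input file
        "-vn",                     # drop the video stream
        "-acodec", "pcm_s16le",    # 16-bit little-endian PCM
        "-ar", str(sample_rate),   # output sample rate in Hz
        "-ac", str(channels),      # number of audio channels
        dst,
    ]

cmd = ffmpeg_to_wav_cmd("Video.mp4", "audio.wav")
print(" ".join(cmd))

# To actually run it (requires ffmpeg on the PATH):
# subprocess.run(cmd, check=True)
```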

# Split the audio

The model works on short inputs, so we split the audio on silence and cap each chunk at 15 seconds (the default `max_chunk_duration`).

In [None]:
input_audio_file = "/content/seamless_communication/audio.wav"
output_directory = "/content/seamless_communication/split_segments"

!mkdir -p split_segments
!rm -rf /content/seamless_communication/split_segments/*
split_audio_with_max_duration(input_audio_file, output_directory)

Exporting file /content/seamless_communication/split_segments/chunk0.wav
Exporting file /content/seamless_communication/split_segments/chunk1.wav
Exporting file /content/seamless_communication/split_segments/chunk2.wav
Exporting file /content/seamless_communication/split_segments/chunk3.wav
Exporting file /content/seamless_communication/split_segments/chunk4.wav


In [None]:
# Play a split
audio_path = '/content/seamless_communication/split_segments/chunk1.wav'
audio = Audio(audio_path, rate=44100, autoplay=True, normalize=True)
display(audio)

## Speech-to-speech translation

In [None]:
# Example
translated_text, wav, sr = translator.predict(
    input='/content/seamless_communication/split_segments/chunk1.wav',
    task_str='s2st',
    tgt_lang='eng',  # target language
    src_lang='spa',  # source language; specifying it improves the model's result
    spkr=-1,
)

# Save the audio and play
save_and_play_audio(
    '/content/seamless_communication/audiot.wav',
    wav,
    sr,
)
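
The four task codes used in this notebook (`s2st`, `t2st`, `t2tt`, `s2tt`) follow a "source 2 target" naming pattern, where `s` stands for speech, `t` for text, and the final `t` for translation. The sketch below decodes them. `describe_task` is a hypothetical helper written for illustration, not a seamless_communication function.

```python
# Modalities encoded in the task codes used in this notebook.
MODALITY = {"s": "speech", "t": "text"}

def describe_task(task_str: str) -> tuple[str, str]:
    """Split a code such as 's2st' into (input modality, output modality)."""
    source, target = task_str.split("2")
    # The target part is e.g. 'st' (speech translation): its first letter is the modality.
    return MODALITY[source], MODALITY[target[0]]

for task in ("s2st", "t2st", "t2tt", "s2tt"):
    print(task, describe_task(task))
```

This is why `predict` receives an audio file path for the `s2…` tasks and a plain string for the `t2…` tasks.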

Now we will translate all the segments and combine them into a new audio file.

In [None]:
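import re

segments = []

# Collect the chunk files and sort them numerically, so chunk10.wav follows chunk9.wav.
chunk_files = [f for f in os.listdir(output_directory) if re.fullmatch(r"chunk\d+\.wav", f)]
chunk_files.sort(key=lambda f: int(re.search(r"\d+", f).group()))

for filename in chunk_files:
    segment_path = os.path.join(output_directory, filename)

    translated_text, wav, sr = translator.predict(
        input=segment_path,
        task_str='s2st',
        tgt_lang='eng',
        src_lang='spa',
    )
    print(translated_text, segment_path)

    # Note: this overwrites the source-language chunk with its translation.
    torchaudio.save(
        segment_path,
        wav[0].cpu(),
        sample_rate=sr,
    )

    segment = AudioSegment.from_file(segment_path)
    segments.append(segment)

# Concatenate and export once, after every segment has been translated.
combined_audio = sum(segments)
combined_audio.export('/content/seamless_communication/audio_eng.mp3', format="mp3")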

Good afternoon. We're meeting with Gesser. /content/seamless_communication/split_segments/chunk0.wav
And he's going to answer some questions. First, what are the most polluted areas by solid waste or packages that are in the school? /content/seamless_communication/split_segments/chunk1.wav
Approximately how many cameras do you have? /content/seamless_communication/split_segments/chunk2.wav
(Applause from the audience) /content/seamless_communication/split_segments/chunk3.wav
I'm sure most of you have seen it. /content/seamless_communication/split_segments/chunk4.wav
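
The order in which the translated segments are concatenated depends on how the chunk file names sort. With ten or more chunks, plain string sorting and numeric sorting disagree, as this small sketch shows. `chunk_index` is a hypothetical helper that assumes the `chunk<N>.wav` naming used above.

```python
import re

def chunk_index(filename: str) -> int:
    """Extract the integer from names like 'chunk12.wav'."""
    match = re.fullmatch(r"chunk(\d+)\.wav", filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    return int(match.group(1))

names = ["chunk10.wav", "chunk2.wav", "chunk1.wav"]
print(sorted(names))                   # ['chunk1.wav', 'chunk10.wav', 'chunk2.wav']
print(sorted(names, key=chunk_index))  # ['chunk1.wav', 'chunk2.wav', 'chunk10.wav']
```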


In [None]:
audio_path = '/content/seamless_communication/audio_eng.mp3'
audio = Audio(audio_path, rate=44100, autoplay=True, normalize=True)
display(audio)

## Text-to-speech translation

In [None]:
text = 'En el bosque encantado'

In [None]:
translated_text, wav, sr = translator.predict(
    text,
    "t2st",
    tgt_lang='eng',
    src_lang='spa'
)

save_and_play_audio(
    '/content/seamless_communication/text2speech.wav',
    wav,
    sr,
)

## Text-to-text translation

In [None]:
text = 'En el bosque encantado, un zorro curioso halló un reloj antiguo. Al tocarlo, quedó atrapado en un bucle temporal. Buscó ayuda de un búho sabio, quien reveló que solo resolviendo acertijos podría romper el hechizo. Juntos descifraron enigmas, liberando al zorro y tejiendo una amistad eterna.'

In [None]:
translated_text, _, _ = translator.predict(text, "t2tt", 'eng', src_lang='spa')
translated_text

CString('In the enchanted forest, a curious fox found an ancient clock. When he touched it, he was trapped in a time loop. He sought help from a wise owl, who revealed that only by solving riddles could he break the spell. Together they solved riddles, freeing the fox and forging an eternal friendship.')

## Speech-to-text translation

In [None]:
# Resample the audio to 16 kHz, the input sample rate SeamlessM4T expects
resample_rate = 16000
waveform, sample_rate = torchaudio.load('/content/seamless_communication/split_segments/chunk1.wav')
resampler = torchaudio.transforms.Resample(sample_rate, resample_rate, dtype=waveform.dtype)
resampled_waveform = resampler(waveform)
torchaudio.save('/content/seamless_communication/split_segments/resample_chunk1.wav', resampled_waveform, resample_rate)
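
Before resampling, it can be useful to check what sample rate a file already has. The sketch below reads it from the WAV header with the standard-library `wave` module. `wav_sample_rate` and `target_rate` are hypothetical names, and `wave` only handles uncompressed integer PCM files, so it may reject WAV files saved in floating-point format.

```python
import os
import tempfile
import wave

def wav_sample_rate(path: str) -> int:
    """Read the sample rate (Hz) from a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()

# Demo on a throwaway file: one second of 16-bit mono silence at 44.1 kHz.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)       # 2 bytes per sample = 16-bit
    wav_file.setframerate(44100)
    wav_file.writeframes(b"\x00\x00" * 44100)

target_rate = 16000  # assumed target; set it to the rate you resample to
needs_resampling = wav_sample_rate(demo_path) != target_rate
print(wav_sample_rate(demo_path), needs_resampling)  # 44100 True
```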

In [None]:
translated_text, _, _ = translator.predict('/content/seamless_communication/split_segments/resample_chunk1.wav', "s2tt", 'eng')
translated_text

CString('And he's going to answer some questions: First, what are the most polluted areas from solid waste or packages that are in the school?')

License: Attribution-NonCommercial 4.0 International. See https://github.com/facebookresearch/seamless_communication/blob/main/LICENSE