<a href="https://colab.research.google.com/github/R3gm/Colab-resources/blob/main/Massively_Multilingual_Speech_MMS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Massively Multilingual Speech (MMS)

https://huggingface.co/spaces/mms-meta/MMS

The Massively Multilingual Speech (MMS) project, led by Meta, aims to expand the language coverage of speech technology from roughly one hundred languages to over 1,000. To get there, it relies on a new dataset derived from publicly available religious texts and on self-supervised learning.

The project has released several models: pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition (ASR) model for 1,107 languages, speech synthesis models for the same number of languages, and a language identification (LID) model covering 4,017 languages.

According to the project, the MMS models outperform existing models while covering ten times as many languages.

In [5]:
!pip install transformers datasets


In [17]:
import numpy as np
from IPython.display import Audio as audio_show
from datasets import load_dataset, Audio

## Multilingual Automatic Speech Recognition (ASR).
Adapter models to transcribe 1000+ languages

| Model | Languages | Dataset | Checkpoint | Dictionary* | Supported languages | Hub |
|---|---|---|---|---|---|---|
| MMS-1B:FL102 | 102 | FLEURS | [download](https://dl.fbaipublicfiles.com/mms/asr/mms1b_fl102.pt) | [download](https://dl.fbaipublicfiles.com/mms/asr/dict/mms1b_fl102/eng.txt) | [download](https://dl.fbaipublicfiles.com/mms/asr/mms1b_fl102_langs.html) | [🤗 Hub](https://huggingface.co/facebook/mms-1b-fl102) |
| MMS-1B:L1107 | 1107 | MMS-lab | [download](https://dl.fbaipublicfiles.com/mms/asr/mms1b_l1107.pt) | [download](https://dl.fbaipublicfiles.com/mms/asr/dict/mms1b_l1107/eng.txt) | [download](https://dl.fbaipublicfiles.com/mms/asr/mms1b_l1107_langs.html) | [🤗 Hub](https://huggingface.co/facebook/mms-1b-l1107) |
| MMS-1B-all | 1162 | MMS-lab + FLEURS <br>+ CV + VP + MLS | [download](https://dl.fbaipublicfiles.com/mms/asr/mms1b_all.pt) | [download](https://dl.fbaipublicfiles.com/mms/asr/dict/mms1b_all/eng.txt) | [download](https://dl.fbaipublicfiles.com/mms/asr/mms1b_all_langs.html) | [🤗 Hub](https://huggingface.co/facebook/mms-1b-all) |




Load audio samples in different languages using the 🤗 Datasets library.

In [1]:
from datasets import load_dataset, Audio

# The audio needs a sampling rate of 16,000 Hz
# English
stream_data = load_dataset("mozilla-foundation/common_voice_11_0", "en", split="test", streaming=True)
stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
en_sample = next(iter(stream_data))["audio"]["array"]

# French
stream_data = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="test", streaming=True)
stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
fr_sample = next(iter(stream_data))["audio"]["array"]



In [8]:
audio_show(en_sample, rate=16000)
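The `cast_column("audio", Audio(sampling_rate=16000))` calls above make sure every clip is resampled to 16 kHz when it is decoded. As a rough illustration of what resampling means (not how the library does it internally), here is a naive linear-interpolation sketch in plain Python. The function name `resample_linear` is hypothetical.

```python
def resample_linear(samples, orig_sr, target_sr=16_000):
    """Naively resample a mono signal by linear interpolation (illustration only)."""
    if orig_sr == target_sr or len(samples) < 2:
        return list(samples)
    duration = len(samples) / orig_sr
    n_out = max(1, round(duration * target_sr))
    out = []
    for i in range(n_out):
        # Position of output sample i on the input sample grid.
        pos = i * orig_sr / target_sr
        left = min(int(pos), len(samples) - 1)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out

# One second of a 48 kHz ramp becomes 16,000 samples.
ramp = [i / 48_000 for i in range(48_000)]
resampled = resample_linear(ramp, orig_sr=48_000)
print(len(resampled))  # 16000
```

Plain interpolation applies no low-pass filter, so downsampling real audio this way can introduce aliasing. For actual use, keep the `cast_column` approach shown above.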

Load the model and processor

In [5]:
from transformers import Wav2Vec2ForCTC, AutoProcessor
import torch

model_id = "facebook/mms-1b-all"

processor = AutoProcessor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

Pass the processed audio data to the model and transcribe the model output

In [9]:
inputs = processor(en_sample, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs).logits

ids = torch.argmax(outputs, dim=-1)[0]
transcription = processor.decode(ids)
transcription

'joe keton disapproved of films and buster also had reservations about the media'
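The `argmax` followed by `processor.decode` in the cell above amounts to greedy CTC decoding: pick the most likely token for each frame, merge consecutive repeats, then drop the blank token. A minimal sketch on a toy vocabulary, assuming blank id 0 and `|` as the word delimiter (the vocabulary and frame ids below are made up for illustration and are not the model's real ones):

```python
from itertools import groupby

def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse per-frame argmax ids into text (greedy CTC decoding)."""
    # 1. Merge runs of identical consecutive ids.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2. Drop blanks, which only separate repeated characters.
    tokens = [id_to_token[i] for i in collapsed if i != blank_id]
    # 3. Turn the word-delimiter token into spaces.
    return "".join(tokens).replace(word_delimiter, " ").strip()

toy_vocab = {0: "<pad>", 1: "|", 2: "h", 3: "e", 4: "l", 5: "o"}
frames = [2, 2, 0, 3, 4, 4, 0, 4, 5, 5, 1, 1, 2, 3]
print(ctc_greedy_decode(frames, toy_vocab))  # hello he
```

The blank between the two runs of `l` is what lets the decoder keep the double letter in "hello" instead of merging it into one.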

### Language adapters
We can now keep the same model in memory and simply swap the language adapter by calling `load_adapter()` on the model and `set_target_lang()` on the tokenizer. Both take the target language code as input, for example `"fra"` for French.
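Note that the Common Voice configurations in this notebook use two-letter codes (`"en"`, `"fr"`, `"ar"`), while the MMS adapters are selected with three-letter codes such as `"fra"`. A small lookup can keep the two in sync. The helper name `to_mms_code` and the mapping are illustrative and cover only the three languages used here; `"eng"` is assumed to be the English adapter code.

```python
# Common Voice config name -> MMS three-letter adapter code.
# Only the languages used in this notebook; extend as needed.
CV_TO_MMS = {"en": "eng", "fr": "fra", "ar": "ara"}

def to_mms_code(cv_code):
    """Return the MMS adapter code for a Common Voice language code."""
    try:
        return CV_TO_MMS[cv_code.lower()]
    except KeyError:
        raise ValueError(f"No MMS code known for {cv_code!r}") from None

print(to_mms_code("fr"))  # fra
```

Passing the same returned code to both `processor.tokenizer.set_target_lang()` and `model.load_adapter()` keeps the tokenizer vocabulary and the adapter weights matched.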

In [10]:
audio_show(fr_sample, rate=16000)

In [12]:
processor.tokenizer.set_target_lang("fra")
model.load_adapter("fra")

inputs = processor(fr_sample, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs).logits

ids = torch.argmax(outputs, dim=-1)[0]
transcription = processor.decode(ids)
transcription

"ce dernier est volé tout au long de l'histoire romaine"

In [None]:
## Alternative with pipeline
# from transformers import pipeline

# model_id = "facebook/mms-1b-all"
# target_lang = "fra"

# pipe = pipeline(model=model_id, model_kwargs={"target_lang": target_lang, "ignore_mismatched_sizes": True})

In [None]:
## Alternative: set the language before loading the model
# from transformers import Wav2Vec2ForCTC, AutoProcessor

# model_id = "facebook/mms-1b-all"
# target_lang = "fra"

# processor = AutoProcessor.from_pretrained(model_id, target_lang=target_lang)
# model = Wav2Vec2ForCTC.from_pretrained(model_id, target_lang=target_lang, ignore_mismatched_sizes=True)

### Dict of supported languages

In [None]:
processor.tokenizer.vocab.keys()

## Spoken Language Identification (LID).

Classifies raw audio input into a probability distribution over the output classes, each class representing a language (4,017 classes for the largest model).


| Languages | Dataset | Model | Dictionary | Supported languages | Hub |
|---|---|---|---|---|---|
| 126 | FLEURS + VL + MMS-lab-U + MMS-unlab | [download](https://dl.fbaipublicfiles.com/mms/lid/mms1b_l126.pt) | [download](https://dl.fbaipublicfiles.com/mms/lid/dict/l126/dict.lang.txt) | [download](https://dl.fbaipublicfiles.com/mms/lid/mms1b_l126_langs.html) | [🤗 Hub](https://huggingface.co/facebook/mms-lid-126) |
| 256 | FLEURS + VL + MMS-lab-U + MMS-unlab | [download](https://dl.fbaipublicfiles.com/mms/lid/mms1b_l256.pt) | [download](https://dl.fbaipublicfiles.com/mms/lid/dict/l256/dict.lang.txt) | [download](https://dl.fbaipublicfiles.com/mms/lid/mms1b_l256_langs.html) | [🤗 Hub](https://huggingface.co/facebook/mms-lid-256) |
| 512 | FLEURS + VL + MMS-lab-U + MMS-unlab | [download](https://dl.fbaipublicfiles.com/mms/lid/mms1b_l512.pt) | [download](https://dl.fbaipublicfiles.com/mms/lid/dict/l512/dict.lang.txt) | [download](https://dl.fbaipublicfiles.com/mms/lid/mms1b_l512_langs.html) | [🤗 Hub](https://huggingface.co/facebook/mms-lid-512) |
| 1024 | FLEURS + VL + MMS-lab-U + MMS-unlab | [download](https://dl.fbaipublicfiles.com/mms/lid/mms1b_l1024.pt) | [download](https://dl.fbaipublicfiles.com/mms/lid/dict/l1024/dict.lang.txt) | [download](https://dl.fbaipublicfiles.com/mms/lid/mms1b_l1024_langs.html) | [🤗 Hub](https://huggingface.co/facebook/mms-lid-1024) |
| 2048 | FLEURS + VL + MMS-lab-U + MMS-unlab | [download](https://dl.fbaipublicfiles.com/mms/lid/mms1b_l2048.pt) | [download](https://dl.fbaipublicfiles.com/mms/lid/dict/l2048/dict.lang.txt) | [download](https://dl.fbaipublicfiles.com/mms/lid/mms1b_l2048_langs.html) | [🤗 Hub](https://huggingface.co/facebook/mms-lid-2048) |
| 4017 | FLEURS + VL + MMS-lab-U + MMS-unlab | [download](https://dl.fbaipublicfiles.com/mms/lid/mms1b_l4017.pt) | [download](https://dl.fbaipublicfiles.com/mms/lid/dict/l4017/dict.lang.txt) | [download](https://dl.fbaipublicfiles.com/mms/lid/mms1b_l4017_langs.html) | [🤗 Hub](https://huggingface.co/facebook/mms-lid-4017) |



Load the audio

In [20]:
# The audio needs a sampling rate of 16,000 Hz
stream_data = load_dataset("mozilla-foundation/common_voice_11_0", "ar", split="test", streaming=True)
stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
ar_sample = next(iter(stream_data))["audio"]["array"]
audio_show(ar_sample, rate=16000)



Load the model and processor

In [21]:
from transformers import Wav2Vec2ForSequenceClassification, AutoFeatureExtractor
import torch

model_id = "facebook/mms-lid-4017"

processor = AutoFeatureExtractor.from_pretrained(model_id)
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_id)


Pass the processed audio data to the model to classify it into a language

In [23]:
inputs = processor(ar_sample, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs).logits

lang_id = torch.argmax(outputs, dim=-1)[0].item()
detected_lang = model.config.id2label[lang_id]
detected_lang

'ara'
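The cell above keeps only the single most likely class. Because the model scores every language, it can help to look at the top few candidates together with their probabilities. A minimal sketch in plain Python, using made-up logits and a made-up three-entry label map in place of `outputs` and `model.config.id2label` (the helper name `top_k_languages` is hypothetical):

```python
import math

def top_k_languages(logits, id2label, k=3):
    """Softmax the logits and return the k most probable (label, prob) pairs."""
    # Subtract the max before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]

# Hypothetical values, for illustration only.
toy_logits = [2.0, 0.5, -1.0]
toy_labels = {0: "ara", 1: "fra", 2: "eng"}
for label, prob in top_k_languages(toy_logits, toy_labels, k=2):
    print(f"{label}: {prob:.3f}")
```

With the real model, `outputs[0].tolist()` and `model.config.id2label` would take the place of the toy values.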

Look up the language name for the detected ISO code

In [58]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

response = requests.get('https://dl.fbaipublicfiles.com/mms/lid/mms1b_l4017_langs.html')
response.raise_for_status()  # fail early if the download did not succeed
soup = BeautifulSoup(response.content, 'html.parser')

# Each <p> holds an ISO code and a language name separated by an em space (\u2003);
# split each paragraph once and skip the header row
texts = [p.get_text() for p in soup.find_all('p')]
data = [tuple(part.strip() for part in text.split('\u2003')[:2])
        for text in texts if 'Iso Code' not in text and 'Language Name' not in text]

df = pd.DataFrame(data, columns=['Iso Code', 'Language Name'])

print(len(df), 'languages')
df[df['Iso Code'].isin(['ara'])]

4017 languages


|   | Iso Code | Language Name |
|---|---|---|
| 0 | ara | Arabic |
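The parsing step above assumes that each `<p>` element holds an ISO code and a language name separated by an em space (`\u2003`). Here is a small sketch of that split on already-extracted text lines, which splits each line only once and skips lines that do not match. The helper name `parse_language_lines` and the sample lines are illustrative and are not taken from the downloaded page.

```python
EM_SPACE = "\u2003"

def parse_language_lines(lines):
    """Map ISO codes to language names from 'code<EM SPACE>name' lines."""
    table = {}
    for line in lines:
        code, sep, name = line.partition(EM_SPACE)
        # Skip the header row and anything without the separator.
        if not sep or code.strip() == "Iso Code":
            continue
        table[code.strip()] = name.strip()
    return table

sample = [
    "Iso Code\u2003Language Name",
    "ara\u2003Arabic",
    "fra\u2003French",
    "malformed line",
]
names = parse_language_lines(sample)
print(names["ara"])  # Arabic
```

A plain dictionary like this makes the final lookup a one-liner, for example `names[detected_lang]`, at the cost of the filtering conveniences a DataFrame offers.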


## Multilingual Text-To-Speech (TTS).
Speech technology across a diverse range of languages

https://github.com/jaywalnut310/vits