# ⚠️ EXPERIMENTAL: Transcribe audio to any language w/ 🤗 Transformers

by: [Vaibhav (VB) Srivastav](https://twitter.com/reach_vb) and [Prince Canuma](https://twitter.com/CanumaGdt)

In this notebook, we'll evaluate Whisper's ability to transcribe audio from English to any language.

By default, Whisper can either transcribe audio in its spoken language or translate it into English. However, with 🤗 Transformers we can force it to output text in any supported language directly from English audio, without any fine-tuning.

We'll use Google's [FLEURS dataset](https://huggingface.co/datasets/google/fleurs) to evaluate Whisper's translation capabilities. FLEURS is an n-way parallel speech dataset covering 102 languages, built on top of the FLoRes-101 machine translation benchmark.

To get the reference translations, we'll use Facebook's [NLLB-200](https://huggingface.co/facebook/nllb-200-distilled-1.3B) model. NLLB-200 is a machine translation model primarily intended for research in machine translation, especially for low-resource languages. It supports single-sentence translation among 200 languages.

Let's get started!

The environment setup is straightforward: we'll use `transformers` to load the [Whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) checkpoint in half precision (`fp16`) on Colab's free GPU! 🏃‍♂️

In [116]:
!pip -q install transformers datasets huggingface_hub soundfile librosa evaluate 

Note: Authentication is only needed to access gated datasets such as the [Common Voice dataset](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0). You can safely skip this step if you are using your own audio dataset.

You can find your access token at: [hf.co/settings/token](https://hf.co/settings/token)

In [5]:
from huggingface_hub import notebook_login

notebook_login()


In [6]:
batch_size = 8 # Increase this if you have a GPU with more memory

For this Colab demo, we'll use the Whisper-large-v2 checkpoint in half precision (`fp16`). If you have a GPU with more VRAM, you can remove the `torch_dtype` argument to load the model in full precision 🤗
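
As a rough, back-of-the-envelope check on why half precision helps: the Whisper model card lists the large checkpoints at about 1,550M parameters (a figure taken from the Whisper documentation, not measured in this notebook). The sketch below counts the weights only; activations and decoding caches need extra memory on top, so treat the numbers as a lower bound.

```python
# Rough estimate of the memory needed just to hold the model weights.
# ASSUMPTION: ~1.55B parameters for Whisper large, as listed on the model card.
NUM_PARAMS = 1_550_000_000

BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Return the size of the weights in GiB for the given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gib(NUM_PARAMS, dtype):.2f} GiB")
```

Under that assumption the weights take roughly 5.8 GiB in `float32` and 2.9 GiB in `float16`, which leaves more headroom for batching on a free Colab GPU.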

In [7]:
import torch
from transformers import pipeline

whisper_asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v2",
    torch_dtype=torch.float16,
    device="cuda:0",
    batch_size=batch_size
    )

In [8]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

TASK = "translation"
CKPT = "facebook/nllb-200-distilled-1.3B"

model = AutoModelForSeq2SeqLM.from_pretrained(CKPT)
tokenizer = AutoTokenizer.from_pretrained(CKPT)

To keep things simple, we'll load the FLEURS dataset from the 🤗 Hub in `streaming` mode and resample the audio to 16 kHz, as expected by Whisper.

In [222]:
from datasets import Audio, load_dataset

source_lang = "en_us"
fleurs_en = load_dataset("google/fleurs", source_lang,
                        split="train", # You can add the validation and test sets here
                        streaming=True,
                        use_auth_token=True)



In [223]:
fleurs_en = fleurs_en.cast_column("audio", Audio(sampling_rate=16000))

In [224]:
fleurs_en.features

{'id': Value(dtype='int32', id=None),
 'num_samples': Value(dtype='int32', id=None),
 'path': Value(dtype='string', id=None),
 'audio': Audio(sampling_rate=16000, mono=True, decode=True, id=None),
 'transcription': Value(dtype='string', id=None),
 'raw_transcription': Value(dtype='string', id=None),
 'gender': ClassLabel(names=['male', 'female', 'other'], id=None),
 'lang_id': ClassLabel(names=['af_za', 'am_et', 'ar_eg', 'as_in', 'ast_es', 'az_az', 'be_by', 'bg_bg', 'bn_in', 'bs_ba', 'ca_es', 'ceb_ph', 'ckb_iq', 'cmn_hans_cn', 'cs_cz', 'cy_gb', 'da_dk', 'de_de', 'el_gr', 'en_us', 'es_419', 'et_ee', 'fa_ir', 'ff_sn', 'fi_fi', 'fil_ph', 'fr_fr', 'ga_ie', 'gl_es', 'gu_in', 'ha_ng', 'he_il', 'hi_in', 'hr_hr', 'hu_hu', 'hy_am', 'id_id', 'ig_ng', 'is_is', 'it_it', 'ja_jp', 'jv_id', 'ka_ge', 'kam_ke', 'kea_cv', 'kk_kz', 'km_kh', 'kn_in', 'ko_kr', 'ky_kg', 'lb_lu', 'lg_ug', 'ln_cd', 'lo_la', 'lt_lt', 'luo_ke', 'lv_lv', 'mi_nz', 'mk_mk', 'ml_in', 'mn_mn', 'mr_in', 'ms_my', 'mt_mt', 'my_mm', 'nb

In [18]:
import re
list_of_languages = [re.sub(r'[^a-zA-Z0-9\s]', '', t) for t in whisper_asr.tokenizer.additional_special_tokens if len(t) == 6]
print(list_of_languages)
print(len(list_of_languages))
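
To make the filter in the cell above easier to follow, here is a small offline sketch of the same idea. Whisper language tokens look like `<|en|>`, so a two-letter code gives a token of exactly six characters, and the regex strips the `<|` and `|>` wrapper. The token list below is a hand-written, hypothetical sample; the real one comes from `whisper_asr.tokenizer.additional_special_tokens`.

```python
import re

# HYPOTHETICAL sample of special tokens, for illustration only.
sample_tokens = ["<|endoftext|>", "<|en|>", "<|af|>", "<|haw|>", "<|transcribe|>"]


def two_letter_language_codes(tokens):
    """Keep tokens shaped like <|xx|> and strip the <| |> wrapper."""
    return [re.sub(r"[^a-zA-Z0-9\s]", "", t) for t in tokens if len(t) == 6]


codes = two_letter_language_codes(sample_tokens)
print(codes)  # ['en', 'af']
```

Note that the length check also drops three-letter language tokens such as `<|haw|>`, so those languages are not evaluated with this filter.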

In [7]:
import pandas as pd
df_lang_codes = pd.read_csv("matched_languages (1).csv") # NLLB-200 and Whisper language codes 
df_lang_codes = df_lang_codes[df_lang_codes['code'].isin(list_of_languages)] # Filter out languages not in Whisper
df_lang_codes

Unnamed: 0,language,composite_code,code
0,Acehnese (Arabic script),ace_Arab,ac
1,Tunisian Arabic,aeb_Arab,ae
2,Afrikaans,afr_Latn,af
3,South Levantine Arabic,ajp_Arab,aj
4,Akan,aka_Latn,ak
...,...,...,...
134,Yoruba,yor_Latn,yo
135,Yue Chinese,yue_Hant,yu
136,Chinese (Simplified),zho_Hans,zh
137,Standard Malay,zsm_Latn,zs


Alrighty! Let's listen to our audio file and take a look at its transcription!

In [13]:
import IPython.display as ipd

sample = next(iter(fleurs_en))  # fetch the first example from the stream once
ipd.Audio(sample["audio"]["array"], rate=sample["audio"]["sampling_rate"])  # play the NumPy array

## Generate transcriptions and translations

In [225]:
from typing import List, Any, Dict
import evaluate

def translate_texts(texts: List[str], source_language_code: str, target_language_code: str) -> List[str]:
    # Translate the text from English to the target language
    try:
        nllb_translate = pipeline(
            TASK, 
            model=model,
            tokenizer=tokenizer,
            src_lang=source_language_code,
            tgt_lang=target_language_code,
            torch_dtype=torch.float16,
            device="cuda:0",
            batch_size=batch_size
        
        )
        return [text["translation_text"] for text in nllb_translate(texts)]

    except Exception as e:
        print(f"Error during translation: {e}")
        return [" "] * len(texts)


def translate_audios(audio_arrays: List[Any], target_language_code: str) -> List[str]:
    # Translate the audio from English to the target language
    # Set the decoder prompt to the target language
    whisper_asr.model.config.forced_decoder_ids = (
        whisper_asr.tokenizer.get_decoder_prompt_ids(
            language=target_language_code,
            task="transcribe"
        )
    )
    return [
        text["text"] for text in whisper_asr(
            audio_arrays, 
            generate_kwargs =
                 {
                      "penalty_alpha": 0.6,
                      "top_k": 5,
                 } 
        )]

def compute_metrics(metric, predictions: List[str], references: List[str], batch_size: int) -> List[float]:
    # Sentence-level BLEU: score each prediction against its own reference
    return [metric.compute(predictions=[predictions[i]], references=[references[i]])['bleu'] for i in range(batch_size)]

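def get_translations(items: Dict[str, List[Any]]) -> Dict[str, List[Any]]:
    # With batched=True, `items` maps each column name to the list of values in the batch
    # Get batched transcriptions and predictions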
    source_lang_composite_code = df_lang_codes[df_lang_codes['code'] == source_lang.split("_")[0]]["composite_code"].values[0]
    bleu = evaluate.load("bleu")
    
    for _, row in df_lang_codes.iterrows():
        if row["code"] != source_lang.split("_")[0]:
            
            transcriptions = items["transcription"]
            audio_arrays = [item["array"] for item in items["audio"]]
            
            target_lang_composite_code = row["composite_code"]
            language = row["code"]
            
            translated_transcriptions = translate_texts(transcriptions, source_lang_composite_code, target_lang_composite_code)
            translated_predictions = translate_audios(audio_arrays, language)
            scores = compute_metrics(bleu, translated_predictions, translated_transcriptions, len(transcriptions))
            
            items[f"{language}_transcription"] = translated_transcriptions
            items[f"{language}_prediction"] = translated_predictions
            items[f"{language}_bleu_score"] = scores

    return items
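
Whisper's decoder is conditioned on a short prefix of special tokens: a start token, a language token, a task token and, when timestamps are disabled, a no-timestamps token. `get_decoder_prompt_ids` returns the ids for everything after the start token as `(position, token_id)` pairs. The sketch below mirrors that layout with token strings instead of ids, so it runs without downloading the tokenizer; `build_decoder_prompt` is a hypothetical helper, not part of 🤗 Transformers.

```python
from typing import List, Tuple


def build_decoder_prompt(language: str, task: str = "transcribe") -> List[Tuple[int, str]]:
    """Hypothetical helper: lay out Whisper's forced prefix as (position, token) pairs.

    Position 0 holds <|startoftranscript|>, so the forced tokens start at position 1.
    """
    forced_tokens = [f"<|{language}|>", f"<|{task}|>", "<|notimestamps|>"]
    return [(rank + 1, token) for rank, token in enumerate(forced_tokens)]


# English audio, but the language slot is set to French
print(build_decoder_prompt("fr"))
# [(1, '<|fr|>'), (2, '<|transcribe|>'), (3, '<|notimestamps|>')]
```

Putting a non-English language token in front of English audio is what nudges the model toward writing in that language; nothing else about the model changes, which is why the results should be treated as experimental.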



In [226]:
fleurs_en = fleurs_en.map(
    get_translations,
    batched=True,
    batch_size=batch_size,
    remove_columns=["path", "audio", "gender", "lang_id", "lang_group_id", "language", "num_samples", "raw_transcription"]
)

# If streaming is False, the dataset is already loaded in memory
# and we can directly convert it to a pandas DataFrame
# results = fleurs_en.to_pandas()
# results.to_csv("translation_results.csv", index=False) 
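
With `batched=True`, `map` hands the function a dict that maps each column name to a list of values (one per example in the batch) and expects a dict of lists back. The toy sketch below shows that column-wise layout in plain Python; `add_upper` and the sample batch are hypothetical and only stand in for `get_translations` and real FLEURS rows.

```python
from typing import Any, Dict, List


def add_upper(batch: Dict[str, List[Any]]) -> Dict[str, List[Any]]:
    """Toy batched function: read one column, add another of the same length."""
    batch["upper"] = [text.upper() for text in batch["transcription"]]
    return batch


# HYPOTHETICAL batch of two examples, laid out column-wise as map() would pass it
batch = {"id": [903, 279], "transcription": ["a tornado", "former speaker"]}
out = add_upper(batch)
print(out["upper"])  # ['A TORNADO', 'FORMER SPEAKER']
```

This is why `get_translations` can index `items["transcription"]` directly and attach new per-language columns as lists.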

In [227]:
from torch.utils.data import DataLoader
import torch
from tqdm import tqdm # if streaming is False

results = []
dataloader = DataLoader(fleurs_en, batch_size=batch_size)

for i, batch in enumerate(dataloader):
    results.append(pd.DataFrame.from_dict(batch, orient='index')) # Store the results in a list of DataFrames

100%|██████████| 1/1 [04:05<00:00, 245.48s/it]


## BLEU score per language

In [230]:
def tensor_to_num(cell):
    return cell.item() if isinstance(cell, torch.Tensor) else cell

results = pd.concat(results, axis=1, ignore_index=True).T # Concatenate all batches
results = results.applymap(tensor_to_num) # Convert single-element tensors to plain Python numbers
results.to_csv("translation_results.csv", index=False) # Save results to CSV
results

Unnamed: 0,id,transcription,af_transcription,af_prediction,af_bleu_score,am_transcription,am_prediction,am_bleu_score,ar_transcription,ar_prediction,ar_bleu_score
0,903,a tornado is a spinning column of very low-pre...,'n Tornado is 'n draaiende kolom van baie lae-...,Een tornado is een roerende kolum van erg laa...,0.0,አውሎ ነፋስ በዙሪያው ያለውን አየር ወደ ውስጥና ወደ ላይ የሚስበው በጣም...,បាក់នែ is a spinning column of very low-press...,0.0,العاصفة هي عمود تدور من الهواء منخفض الضغط جدا...,طيرات محطمة قطعة تحرير دائرة اضراع محددة و تس...,0.0
1,279,former u.s. speaker of the house newt gingrich...,Voormalige Amerikaanse Speaker van die Huis Ne...,Vroeger U.S. House Speaker Newt Gingrich kwam...,0.0,የቀድሞው የአሜሪካ ምክር ቤት አፈ ጉባኤ ኒውተን ጊንጊሪች በ32 በመቶ ሁ...,ḏḵḏḴḥḍḱḱḱḱḱḱḱḱḱḱḱḱḱḱḱḱḱḱḱḱḱḱḱḱḱḱḱḱḱḱḱḱḱḱḱḱḱḱḱ...,0.0,رئيس مجلس النواب السابق نيوتن جينغريش جاء في ا...,سيد المنطقة الأمريكية المشهور نوت جينجريتش أت...,0.0
2,722,the island was first inhabited by the taínos a...,Die eiland is eers bewoon deur die Tainos en C...,Het eiland word die Tiano's en Caribbe's verl...,0.0,ደሴቲቱ መጀመሪያ ላይ በታይኖዎችና በካሪቢስ ሰዎች ትኖር ነበር። ካሪቢስ ...,Ἀλλανδὲ ʻἰἰἰἰ ἐἰἰἰἰἰἰἰἰἰἰἰἰἰἰἰἰἰἰἰἰἰἰἰἰἰἰἰἰἰἰ...,0.0,كانت الجزيرة مأهولة من قبل التاينوس والكاريبس....,المدينة أولاً تم الحضور بها من طرائل التيانو ...,0.0
3,581,these nerve impulses can be sent so quickly th...,Hierdie senuweepulse kan so vinnig deur die li...,Diese nerf impulsies kan so sneg gydaardien l...,0.0,እነዚህ የነርቭ ቅስቀሳዎች በፍጥነት በሰውነት ውስጥ ሊላኩ ይችላሉ ይህም ...,სარისირირირ can be sensed so quickly througho...,0.0,هذه النبضات العصبية يمكن أن ترسل بسرعة في جميع...,يمكن أن يرسل هذه النيران الاغلقات بسرعة حول ا...,0.0
4,46,on september 24 1759 arthur guinness signed a ...,Op 24 September 1759 het Arthur Guinness 'n 90...,1759 Arthur Guinness sîn ein 9000-jaarlies vo...,0.0,መስከረም 24 ቀን 1759 አርተር ጊነስ በዱብሊን አየርላንድ ውስጥ ለስም...,"ὀἰ᾽ᾶ ἡ᾽ᾶ ᾽᾽᾽ 1759-᾽᾽᾽, Arthur Guinness tʰiʻin...",0.0,في 24 سبتمبر 1759، وقع آرثر غينيس عقد إيجار لم...,عشرين سبتمبر 1759 اعطى ارثر جينيس قراءة وصفة ...,0.0
5,1177,today timbuktu is an impoverished town althoug...,Vandag is Timbuktu 'n arm stad alhoewel sy rep...,Vandaag is Timbuktu een onwilrelijke stad hoe...,0.0,ዛሬ ቲምቡክቱ ድሃ ከተማ ናት፤ ምንም እንኳን ታዋቂነቱ የቱሪስት መስህብ ...,"სარი, Timbuktu əs tʰɨ ʔɪmpovɹʂ tʰɨʔən, altho ...",0.0,اليوم تمبوكتو هي مدينة فقيرة على الرغم من سمعت...,اليوم تمبكتو مدينة غامضة ولكن رجلتها تجعلها م...,0.0
6,742,with the same time zone as hawaii the islands ...,met dieselfde tydsone as Hawaii word die eilan...,"With the same time zone as Hawaii, the island...",0.0,ከሃዋይ ጋር ተመሳሳይ የጊዜ ቀጠና ያላቸው ደሴቶች አንዳንዴ ከሃዋይ በታች...,ḏḵᵐᵗ ʻ�ᵃᵗ ᵗᵒᵐᵗ ᵗᵒᵐᵗ ᵗᵒᵐᵗ ᵗᵒᵐᵗ ᵗᵒᵐᵗ ᵗᵒᵐᵗ ᵗᵒᵐᵗ ...,0.0,مع نفس المنطقة الزمنية مثل هاواي ، يتم التفكير...,بالمناسبة للوقت الحالي في هاوائي المدينة تعتق...,0.0
7,1281,hokuriku electric power co reported no effects...,Hokuriku Electric Power Co. het geen gevolge v...,Hokuruki elektrische voedingskamp nie genees ...,0.139513,የሆኩሪኩ ኤሌክትሪክ ኃይል ኩባንያ ከድርቅ ምንም ጉዳት እንደሌለ እና በሺ...,Ḥukuruki Electric Power Co. Ḥukuruki Electric...,0.0,"شركة ""هوكوريكو"" الكهربائية لم تبلغ عن أي آثار ...",تقرير حركة الطاقة الإلكترونية الهوكوروكية لم ...,0.0
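
The many exact zeros above are less surprising once BLEU is spelled out. BLEU is the geometric mean of the 1- to 4-gram precisions times a brevity penalty, so, without smoothing, a sentence that shares no 4-gram with its reference scores 0 however many individual words match. The sketch below is a simplified single-reference version with whitespace tokenization, written for illustration; it is not the implementation behind `evaluate.load("bleu")`, which uses its own tokenizer.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(prediction: str, reference: str, max_order: int = 4) -> float:
    """Simplified, unsmoothed BLEU for one prediction/reference pair.

    ASSUMPTION: whitespace tokenization and a single reference.
    """
    pred, ref = prediction.split(), reference.split()
    if not pred or not ref:
        return 0.0
    precisions = []
    for n in range(1, max_order + 1):
        pred_counts, ref_counts = ngrams(pred, n), ngrams(ref, n)
        overlap = sum((pred_counts & ref_counts).values())  # clipped matches
        total = max(len(pred) - n + 1, 0)
        precisions.append(overlap / total if total > 0 else 0.0)
    if min(precisions) == 0:
        return 0.0  # a single empty n-gram order zeroes the geometric mean
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_order)
    brevity_penalty = 1.0 if len(pred) > len(ref) else math.exp(1 - len(ref) / len(pred))
    return geo_mean * brevity_penalty


reference = "the island was first inhabited by the tainos"
print(sentence_bleu(reference, reference))  # 1.0
print(sentence_bleu("the island was lived in by tainos first", reference))  # 0.0
```

The second prediction shares six of its eight words with the reference but no run of four consecutive words, so it scores 0. Keep in mind as well that the references in this notebook are NLLB-200 machine translations, so a low score can reflect a mismatch with NLLB's wording as much as a Whisper error.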


In [17]:
bleu_scores = results.filter(regex="bleu_score")
bleu_scores.columns = [col.split("_")[0] for col in bleu_scores.columns]
bleu_scores.head()

average_scores = bleu_scores.mean()
std_dev_scores = bleu_scores.std()

# Create a dataframe to display the results
summary_df_optimized = pd.DataFrame({
    "Language": average_scores.index,
    "Average BLEU Score": average_scores.values,
    "Standard Deviation": std_dev_scores.values
})

summary_df_optimized

Unnamed: 0,Language,Average BLEU Score,Standard Deviation
0,af,0.013452,0.049527
1,am,0.0,0.0
2,ar,0.014756,0.052431
