<a href="https://colab.research.google.com/github/Iqbalca/speechrecognition/blob/main/Evaluation_of%20XLSR053_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Model: facebook/wav2vec2-large-xlsr-53-german
# Evaluation on **Common Voice German Test**

# The XLSR model uses the following datasets for multilingual pretraining

MLS: Multilingual LibriSpeech (8 languages, 50.7k hours): 
Dutch, English, French, German, Italian, Polish, Portuguese, Spanish

CommonVoice (36 languages, 3.6k hours): Arabic, Basque, Breton, Chinese (CN), Chinese (HK), Chinese (TW), Chuvash, Dhivehi, Dutch, English, Esperanto, Estonian, French, German, Hakh-Chin, Indonesian, Interlingua, Irish, Italian, Japanese, Kabyle, Kinyarwanda, Kyrgyz, Latvian, Mongolian, Persian, Portuguese, Russian, Sakha, Slovenian, Spanish, Swedish, Tamil, Tatar, Turkish, Welsh (see also finetuning splits from this paper).

Babel (17 languages, 1.7k hours): Assamese, Bengali, Cantonese, Cebuano, Georgian, Haitian, Kazakh, Kurmanji, Lao, Pashto, Swahili, Tagalog, Tamil, Tok Pisin, Turkish, Vietnamese, Zulu

Several models were also fine-tuned on languages from CommonVoice.





In [7]:
!pip install transformers
!pip install datasets
!pip install jiwer  # needed by the "wer" metric loaded below

In [8]:
import torchaudio
from datasets import load_dataset, load_metric

In [9]:
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
import re
import sys

In [11]:
model_name = "facebook/wav2vec2-large-xlsr-53-german"
# Punctuation stripped from the reference transcripts before scoring
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"]'

model = Wav2Vec2ForCTC.from_pretrained(model_name)
processor = Wav2Vec2Processor.from_pretrained(model_name)

In [12]:
ds = load_dataset("common_voice", "de", split="test", data_dir="./cv-corpus-6.1-2020-12-11")

Using custom data configuration de-24242e0bc11eddb3


Downloading and preparing dataset common_voice/de (download: 21.68 GiB, generated: 34.69 GiB, post-processed: Unknown size, total: 56.37 GiB) to /root/.cache/huggingface/datasets/common_voice/de-24242e0bc11eddb3/6.1.0/a1dc74461f6c839bfe1e8cf1262fd4cf24297e3fbd4087a711bd090779023a5e...


Downloading data:   0%|          | 0.00/23.3G [00:00<?, ?B/s]

Generating train split:   0%|          | 0/246525 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/15588 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/15588 [00:00<?, ? examples/s]

Generating other split:   0%|          | 0/10095 [00:00<?, ? examples/s]

Generating validated split:   0%|          | 0/565186 [00:00<?, ? examples/s]

Generating invalidated split:   0%|          | 0/32789 [00:00<?, ? examples/s]

Dataset common_voice downloaded and prepared to /root/.cache/huggingface/datasets/common_voice/de-24242e0bc11eddb3/6.1.0/a1dc74461f6c839bfe1e8cf1262fd4cf24297e3fbd4087a711bd090779023a5e. Subsequent calls will reuse this data.


In [13]:
resampler = torchaudio.transforms.Resample(orig_freq=48_000, new_freq=16_000)
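
The resampler above assumes that every clip is recorded at 48 kHz. As a minimal sketch of the arithmetic, using only the standard library: converting 48 kHz to 16 kHz keeps the duration and divides the sample count by three. The helper name `resampled_length` is hypothetical and the rounding is an assumption of this sketch; the exact count produced by `torchaudio` may differ by a sample.

```python
import math


def resampled_length(num_samples: int, orig_freq: int, new_freq: int) -> int:
    """Approximate sample count after resampling (hypothetical helper)."""
    # Duration is preserved, so the count scales by new_freq / orig_freq.
    # Rounding up is an assumption made for this sketch.
    return math.ceil(num_samples * new_freq / orig_freq)


# A 5-second clip at 48 kHz holds 240,000 samples
n_in = 5 * 48_000
n_out = resampled_length(n_in, orig_freq=48_000, new_freq=16_000)
print(n_out, n_out / 16_000)  # 80000 samples, still 5.0 seconds
```

If a clip were recorded at a different rate, the hard-coded `orig_freq=48_000` would yield audio at the wrong speed, so checking the rate returned by `torchaudio.load` is a reasonable safeguard.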

In [14]:
def map_to_array(batch):
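    # Load the clip; the resampler assumes a 48 kHz recording
    speech, _ = torchaudio.load(batch["path"])
    # Resample to the 16 kHz rate the model expects
    batch["speech"] = resampler(speech.squeeze(0)).numpy()
    batch["sampling_rate"] = resampler.new_freq
    # Normalize the reference: strip punctuation, lowercase, unify apostrophes
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower().replace("’", "'")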
    return batch
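
To see what the transcript normalization in `map_to_array` does on its own, here is a small standalone sketch. The helper name `normalize` and the example sentences are made up for illustration; only the regex and the chained string operations come from the cell above.

```python
import re

# Same pattern as in the notebook: punctuation to strip from transcripts
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"]'


def normalize(sentence: str) -> str:
    """Strip punctuation, lowercase, and unify apostrophes (hypothetical helper)."""
    cleaned = re.sub(chars_to_ignore_regex, '', sentence)
    return cleaned.lower().replace("’", "'")


# Made-up example sentence, not taken from the dataset
print(normalize('Hallo, wie geht’s dir?'))  # hallo wie geht's dir
```

Because the hyphen is in the ignore set, a hyphenated word such as `E-Mail` becomes `email`. This affects word counts in the reference and therefore the WER.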

In [15]:
ds = ds.map(map_to_array)



In [16]:
def map_to_pred(batch):
    # Pad the batch to a common length and build the attention mask
    features = processor(batch["speech"], sampling_rate=batch["sampling_rate"][0], padding=True, return_tensors="pt")
    input_values = features.input_values
    attention_mask = features.attention_mask
    # Inference only, so skip gradient tracking
    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits
    # Greedy decoding: take the most likely token at each frame
    pred_ids = torch.argmax(logits, dim=-1)
    batch["predicted"] = processor.batch_decode(pred_ids)
    batch["target"] = batch["sentence"]
    return batch
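
The `argmax` followed by `batch_decode` in `map_to_pred` amounts to greedy CTC decoding: pick the best token per frame, collapse consecutive repeats, drop the blank token, and turn word delimiters into spaces. The sketch below illustrates that idea in plain Python. The toy vocabulary, the ids, and the function name `greedy_ctc_decode` are assumptions for illustration, not the tokenizer's actual implementation.

```python
from itertools import groupby

# Hypothetical toy vocabulary; the real one comes from the processor's tokenizer
vocab = {0: "<pad>", 1: "|", 2: "h", 3: "a", 4: "l", 5: "o"}
BLANK_ID = 0          # CTC blank (assumed here to be the pad token)
WORD_DELIMITER = "|"  # marks word boundaries


def greedy_ctc_decode(pred_ids):
    """Collapse repeats, drop blanks, then map ids to text."""
    collapsed = [token_id for token_id, _ in groupby(pred_ids)]
    tokens = [vocab[i] for i in collapsed if i != BLANK_ID]
    return "".join(tokens).replace(WORD_DELIMITER, " ").strip()


# Per-frame argmax ids for a made-up utterance
frame_ids = [2, 2, 0, 3, 4, 4, 0, 4, 5, 5, 1, 1, 2, 3, 0, 0]
print(greedy_ctc_decode(frame_ids))  # hallo ha
```

The blank between the two runs of `l` is what preserves the double letter; without it the repeats would collapse into a single `l`.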
    

In [None]:
result = ds.map(map_to_pred, batched=True, batch_size=16, remove_columns=list(ds.features.keys()))

In [None]:
wer = load_metric("wer")
wer

In [None]:
print(wer.compute(predictions=result["predicted"], references=result["target"]))
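
For reference, word error rate is the word-level edit distance between prediction and reference, divided by the number of reference words. The following is a minimal pure-Python sketch of that definition, aggregated over the whole corpus. It is not the code behind `load_metric("wer")`, and the function names and example sentences are made up for illustration.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(references, predictions):
    """Corpus-level WER: total word errors / total reference words."""
    errors = sum(edit_distance(r.split(), p.split())
                 for r, p in zip(references, predictions))
    total = sum(len(r.split()) for r in references)
    return errors / total


refs = ["das ist ein test", "guten morgen"]
hyps = ["das ist test", "guten morgan"]
print(word_error_rate(refs, hyps))  # 2 errors / 6 reference words
```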

Model card: [facebook/wav2vec2-large-xlsr-53-german](https://huggingface.co/facebook/wav2vec2-large-xlsr-53-german)

WER = 18.5%