# Common ML tasks: Automatic Speech Recognition

The `pipeline()` function has many benefits:
- a pre-trained model may exist that already solves your task really well, saving you plenty of time
- pipeline() takes care of all the pre/post-processing for you, so you don’t have to worry about getting the data into the right format for a model
- if the result isn’t ideal, this still gives you a quick baseline for future fine-tuning
- once you fine-tune a model on your custom data and share it on the Hub, the whole community can use it quickly and effortlessly via `pipeline()`, making AI more accessible

In [34]:
%pip install torch torchvision torchaudio  "datasets[audio]" transformers

Note: you may need to restart the kernel to use updated packages.


Let's start by running automatic speech recognition (ASR) on English audio files in which the speaker has an Australian accent.

In [35]:
from datasets import load_dataset
from datasets import Audio

minds = load_dataset("PolyAI/minds14", name="en-AU", split="train")
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))


In [36]:
from transformers import pipeline

asr = pipeline("automatic-speech-recognition")

No model was supplied, defaulted to facebook/wav2vec2-base-960h and revision 22aad52 (https://huggingface.co/facebook/wav2vec2-base-960h).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu
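
The log above warns against relying on the default model. A minimal sketch of pinning the checkpoint explicitly, using the model id and revision reported in that log message; the names `ASR_CHECKPOINT` and `build_asr` are hypothetical helpers introduced here for illustration:

```python
# Task, model id and revision copied from the log message above.
ASR_CHECKPOINT = {
    "task": "automatic-speech-recognition",
    "model": "facebook/wav2vec2-base-960h",
    "revision": "22aad52",
}


def build_asr():
    """Create the ASR pipeline with an explicit model and revision."""
    # Imported lazily so the settings can be inspected without loading transformers.
    from transformers import pipeline

    return pipeline(**ASR_CHECKPOINT)
```

Calling `asr = build_asr()` should load the same checkpoint as the default above, but the choice is now visible in the notebook and does not change if the library's default changes.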


In [37]:
import IPython.display as ipd

# Pick one example, play it back, and keep its reference transcription
example = minds[150]
ipd.display(ipd.Audio(example["audio"]["array"], rate=example["audio"]["sampling_rate"]))
actual = example["transcription"]
example


{'path': '/home/svernys/.cache/huggingface/datasets/downloads/extracted/006687c72f62d73aea93cc14b4904e5e42a68ac457f745dc2b866ae093470839/en-AU~ATM_LIMIT/response_39.wav',
 'audio': {'path': '/home/svernys/.cache/huggingface/datasets/downloads/extracted/006687c72f62d73aea93cc14b4904e5e42a68ac457f745dc2b866ae093470839/en-AU~ATM_LIMIT/response_39.wav',
  'array': array([ 9.47322405e-07,  1.17699674e-05, -4.20695869e-08, ...,
          7.58675305e-05,  9.06501828e-06, -4.91019091e-05], shape=(91478,)),
  'sampling_rate': 16000},
 'transcription': "I'm wondering how much money I can withdraw from the ATM",
 'english_transcription': "I'm wondering how much money I can withdraw from the ATM",
 'intent_class': 3,
 'lang_id': 2}

In [38]:
output = asr(example["audio"]["array"])
print(f"Transcript: {output}")
print(f"Actual: {actual}")


Transcript: {'text': "I'M WONDERING HOW MUCH MONEY I CAN WITHDRAW FROM ME IDAM"}
Actual: I'm wondering how much money I can withdraw from the ATM
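
To put a number on how close the transcript is to the reference, one common measure is the word error rate (WER): the word-level edit distance divided by the number of reference words. The following is a rough sketch, not a standard implementation; `normalize` and `word_error_rate` are hypothetical helpers, and lowercasing plus punctuation stripping is an assumed normalization (the model emits uppercase text, the dataset does not). The two strings are copied from the output above.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation except apostrophes, and split into words."""
    text = re.sub(r"[^\w\s']", " ", text.lower())
    return text.split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


reference = "I'm wondering how much money I can withdraw from the ATM"
hypothesis = "I'M WONDERING HOW MUCH MONEY I CAN WITHDRAW FROM ME IDAM"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.2f}")  # WER: 0.18
```

Two of the eleven reference words ("the" and "ATM") are wrong, which gives a WER of 2/11.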


Let's now try another language: German.

In [39]:
from datasets import load_dataset
from datasets import Audio

minds = load_dataset("PolyAI/minds14", name="de-DE", split="train")
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))

In [40]:
import IPython.display as ipd

# Pick one German example, play it back, and keep its reference transcription
example = minds[150]
ipd.display(ipd.Audio(example["audio"]["array"], rate=example["audio"]["sampling_rate"]))
actual = example["transcription"]
example

{'path': '/home/svernys/.cache/huggingface/datasets/downloads/extracted/006687c72f62d73aea93cc14b4904e5e42a68ac457f745dc2b866ae093470839/de-DE~HIGH_VALUE_PAYMENT/response_19.wav',
 'audio': {'path': '/home/svernys/.cache/huggingface/datasets/downloads/extracted/006687c72f62d73aea93cc14b4904e5e42a68ac457f745dc2b866ae093470839/de-DE~HIGH_VALUE_PAYMENT/response_19.wav',
  'array': array([-1.41472265e-05, -1.56384718e-04, -2.29944068e-04, ...,
         -1.94719993e-02, -1.57685764e-02, -7.14429654e-03], shape=(179816,)),
  'sampling_rate': 16000},
 'transcription': 'jedes Mal wenn ich eine größere Bezahlung tätigen möchte bekomme ich danach eine SMS aber ich weiß nicht ob ich mit den abholen oder mit dem Code machen soll',
 'english_transcription': "every time I want to make a larger payment, I get a text message afterwards, but I don't know whether to pick it up or use the code",
 'intent_class': 10,
 'lang_id': 1}

In [41]:
output = asr(example["audio"]["array"])
print(f"Transcript: {output}")
print(f"Actual: {actual}")

Transcript: {'text': 'LOR YET IS MIE VENIH AND COISOT BATTALON CREATING ISTER BOCOMISANA AN S M S AND BA FHASNA HIPOTIC MADIN ATAMAS BUD EMEDIN COLD MAHUNZOY'}
Actual: jedes Mal wenn ich eine größere Bezahlung tätigen möchte bekomme ich danach eine SMS aber ich weiß nicht ob ich mit den abholen oder mit dem Code machen soll


As we can see, the model fails to recognize the German words. This is expected: the model is not multilingual and was trained only on English audio.

Now let's try an ASR model that was trained on German.

In [None]:
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="maxidl/wav2vec2-large-xlsr-german")
asr(example["audio"]["array"])

In [None]:
output = asr(example["audio"]["array"])
print(f"Transcript: {output}")
print(f"Actual: {actual}")