# Detecting audio issues in the Common Voice dataset
This notebook shows how to use sliceguard to detect issues in audio datasets, using the Common Voice dataset as an example. It focuses on the basic workflow and on how to plug in different embedding models from the Hugging Face Hub.

To run this example you need a few **dependencies**. Install them as follows.

In [32]:
!pip install sliceguard librosa soundfile datasets tqdm jiwer










## Step 1: Generate predictions for the Common Voice dataset

**IMPORTANT NOTE**: To access the Common Voice dataset you have to accept its terms and conditions. To do this, create a Hugging Face account and accept the terms and conditions [HERE](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0). You then need to **create an access token** to access the dataset programmatically. Follow the steps for configuring one [HERE](https://huggingface.co/docs/hub/security-tokens). It only takes a few minutes. Paste your access token into a file called **access_token.txt** and place it in the same directory as this notebook.
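
A stray newline or a missing file is an easy way to end up with a confusing authentication error. The sketch below shows one defensive way to load the token. The helper name `read_access_token` is hypothetical and not part of any library; it only assumes the token is stored as plain text in the file described above.

```python
from pathlib import Path


def read_access_token(path="access_token.txt"):
    """Return the token stored in `path`, without surrounding whitespace."""
    token_file = Path(path)
    if not token_file.is_file():
        raise FileNotFoundError(
            f"{token_file} not found. Create it and paste your access token into it."
        )
    # Editors often append a trailing newline, which is not part of the token.
    token = token_file.read_text(encoding="utf-8").strip()
    if not token:
        raise ValueError(f"{token_file} is empty.")
    return token
```

The returned string can then be passed wherever the notebook uses `access_token`.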

In [1]:
# Configure this example here.
# As configured, it is optimized for fast execution by using only Whisper tiny.
HF_MODEL = "openai/whisper-tiny"
ACCESS_TOKEN_FILE = "access_token.txt"
AUDIO_SAVE_DIR = "audios"
NUM_SAMPLES = 2000

In [2]:
# Some imports you will need to execute this
import uuid
import shutil
from pathlib import Path
import pandas as pd
from tqdm import tqdm
import torch
import librosa
import soundfile as sf
from jiwer import wer
from datasets import load_dataset, Audio
from transformers import pipeline
from transformers import WhisperProcessor, WhisperFeatureExtractor, WhisperTokenizer, WhisperForConditionalGeneration

In [3]:
# Read the access token for downloading the dataset
access_token = Path(ACCESS_TOKEN_FILE).read_text().strip()  # drop a trailing newline, if any
cv_13 = load_dataset("mozilla-foundation/common_voice_13_0", "en", use_auth_token=access_token, streaming=True)

In [4]:
# Instantiate an ASR pipeline with the configured model
device = "cuda:0" if torch.cuda.is_available() else "cpu"

feature_extractor = WhisperFeatureExtractor.from_pretrained(HF_MODEL)
tokenizer = WhisperTokenizer.from_pretrained(HF_MODEL, language="en", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained(HF_MODEL).to(device)

model.config.forced_decoder_ids = tokenizer.get_decoder_prompt_ids() # Specify the task as we always want to use English and transcribe
model.config.language = "<|en|>"
model.config.task = "transcribe"

pipe = pipeline("automatic-speech-recognition", model=model, tokenizer=tokenizer, feature_extractor=feature_extractor, device=device)

In [5]:
keys_to_save = ["sentence", "up_votes", "down_votes", "age", "gender", "accent", "locale", "segment", "variant"]

audio_save_dir = Path(AUDIO_SAVE_DIR)
# Always start from an empty output directory
if audio_save_dir.is_dir():
    shutil.rmtree(audio_save_dir)
audio_save_dir.mkdir()

num_samples = 0
data = []
for sample in tqdm(cv_13["train"], total=NUM_SAMPLES):
    new_audio = librosa.resample(sample["audio"]["array"], orig_sr=sample["audio"]["sampling_rate"], target_sr=16000)
    file_stem = str(uuid.uuid4())
    cur_data = {}
    for k in keys_to_save:
        cur_data[k] = sample[k]
    prediction = pipe(new_audio)["text"]
    cur_data["prediction"] = prediction
    
    sample_wer = wer(sample["sentence"], prediction)
    cur_data["wer"] = sample_wer
    
    target_path = audio_save_dir / (file_stem + ".wav")
    cur_data["audio"] = target_path
    sf.write(target_path, new_audio, 16000)
    data.append(cur_data)
    num_samples += 1
    # This check runs after the increment, so the loop stops after NUM_SAMPLES + 1 samples
    if num_samples > NUM_SAMPLES:
        break


In [6]:
df = pd.DataFrame(data)
df["audio"] = df["audio"].astype("string") # otherwise overflow in serializing json
df.to_json("dataset.json", orient="records")

## Step 2: Detect issues caused by environmental noise
The first check is whether some recordings differ so much from the rest of the data that they cannot be transcribed properly. Here we mostly target **general audio properties and environmental noise**, such as background sounds.

To do this, we use **general-purpose audio embeddings** from a model trained on AudioSet.

In [7]:
# Some imports you will need for this step
import pandas as pd
import numpy as np
from jiwer import wer
from sliceguard import SliceGuard
from renumics.spotlight import Audio

In [8]:
# Read the generated dataset including the predictions
df = pd.read_json("dataset.json")

In [9]:
# Define the metric function
def wer_metric(y_true, y_pred):
    return np.mean([wer(s_y, s_pred) for s_y, s_pred in zip(y_true, y_pred)])
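
To make the metric concrete, here is a minimal pure-Python sketch of a word error rate: the word-level edit distance divided by the number of reference words, averaged over samples. It is an illustration, not jiwer's implementation. The names `word_error_rate` and `mean_wer` are hypothetical, and the sketch assumes non-empty references and no text normalization (case and punctuation count as differences).

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    # prev[j] is the edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = cur[j - 1] + 1
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref_words)


def mean_wer(references, hypotheses):
    """Average of the per-sample word error rates."""
    scores = [word_error_rate(r, h) for r, h in zip(references, hypotheses)]
    return sum(scores) / len(scores)


print(word_error_rate(
    "There are three pubs and a Social Club.",
    "There are print hobs and social forms.",
))
```

For this pair the sketch counts four substitutions and one deletion against eight reference words, which agrees with the `wer` value listed for the same sentence in the report further down. Note that averaging per-sample values weights every recording equally, which is not the same as a corpus-level WER that divides total errors by total reference words.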

In [10]:
# Perform an initial detection aiming for relatively small clusters of minimum 5 similar samples
sg = SliceGuard()
issue_df = sg.find_issues(
        df,
        ["audio"],
        "sentence",
        "prediction",
        wer_metric,
        metric_mode="min",
        embedding_models={"audio": "MIT/ast-finetuned-audioset-10-10-0.4593"},
        min_support=5,
        min_drop=0.2,
    )

Feature audio was inferred as referring to raw data. If this is not the case, please specify in feature_types!
Using MIT/ast-finetuned-audioset-10-10-0.4593 for computing embeddings for feature audio.


Some weights of the model checkpoint at MIT/ast-finetuned-audioset-10-10-0.4593 were not used when initializing ASTModel: ['classifier.dense.bias', 'classifier.layernorm.weight', 'classifier.layernorm.bias', 'classifier.dense.weight']
- This IS expected if you are initializing ASTModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ASTModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
100%|███████████████████████████████████████| 2001/2001 [01:59<00:00, 16.78it/s]


The overall metric value is 0.3165315929558808
Using 5 as minimum support for determining problematic clusters.
Using 0.2 as minimum drop for determining problematic clusters.
Identified 9 problematic slices.
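
The log lines above suggest how `min_support` and `min_drop` interact. The sketch below is a simplified reading of that rule, not sliceguard's actual implementation: it assumes a cluster is flagged when it contains at least `min_support` samples and its mean WER is worse than the overall mean by at least `min_drop`. The function name `flag_clusters` and the toy values are hypothetical.

```python
from statistics import mean


def flag_clusters(wers, cluster_ids, min_support, min_drop):
    """Return {cluster_id: mean WER} for clusters that look problematic."""
    overall = mean(wers)
    clusters = {}
    for value, cluster_id in zip(wers, cluster_ids):
        clusters.setdefault(cluster_id, []).append(value)

    flagged = {}
    for cluster_id, values in clusters.items():
        # With metric_mode="min" lower is better, so a "drop" in quality
        # shows up as a cluster mean that is higher than the overall mean.
        if len(values) >= min_support and mean(values) - overall >= min_drop:
            flagged[cluster_id] = mean(values)
    return flagged


# Toy data: cluster 1 is small but clearly worse than the rest.
toy_wers = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0]
toy_clusters = [0, 0, 0, 0, 1, 1]
print(flag_clusters(toy_wers, toy_clusters, min_support=2, min_drop=0.3))
print(flag_clusters(toy_wers, toy_clusters, min_support=3, min_drop=0.3))
```

Under this reading, lowering `min_support` surfaces smaller groups (down to single outliers), while raising `min_drop` keeps only groups that are much worse than average.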


In [11]:
# Report the issues using Renumics Spotlight
sg.report(spotlight_dtype={"audio": Audio})

| | sentence | up_votes | down_votes | age | gender | accent | locale | segment | variant | prediction | wer | audio | issue | issue_metric | issue_explanation | sg_emb_audio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | This device has a cathode inside an anode wire... | 2 | 0 | | | | en | | | This device has a cathode inside an anode wir... | 0.000000 | audios/fd07fefa-9a45-4b8d-ab3a-996b260e8df5.wav | -1 | | | [-0.8767191767692566, 0.8408157825469971, -1.0... |
| 1 | This product is almost always produced by the ... | 2 | 0 | | | | en | | | This product is almost always produced by the... | 0.000000 | audios/ef3a59ef-1993-4be6-a471-a90de6ef6d3f.wav | -1 | | | [-0.564724326133728, 0.7514095306396484, -0.86... |
| 2 | It is named after Edward Singleton Holden. | 2 | 0 | | | | en | | | It is named after Edward Singleton Hold Them. | 0.285714 | audios/6cad4f24-efed-43b7-9b00-91f2a8e16d65.wav | -1 | | | [-0.7435396313667297, 0.9744898080825806, 0.18... |
| 3 | It is north west of the regional centre of Clare. | 2 | 0 | | | | en | | | This northwest of the regional center of Claire. | 0.600000 | audios/45ab62fa-ce5e-4e5f-a0d1-a6372eeed758.wav | -1 | | | [-1.1273648738861084, 1.2430431842803955, -0.6... |
| 4 | He was a nephew of Rear-Admiral Sir Francis Au... | 2 | 0 | twenties | female | United States English | en | | | He was a nephew of rear admiral surferances A... | 0.300000 | audios/81e9d54f-23b1-499b-81ff-3ed05bfe7325.wav | -1 | | | [-0.7308357357978821, 1.0558786392211914, -0.0... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1996 | At this time, Charles Henrotin was elected the... | 2 | 0 | | | | en | | | At his time Charles Henk wrote in, was electe... | 0.454545 | audios/c79bffc0-16fb-4652-95fb-56b0b99a5f96.wav | -1 | | | [0.038931772112846375, 0.36863601207733154, -0... |
| 1997 | They held Catholic beliefs but were consistent... | 2 | 1 | | | | en | | | They held cap in police, but it's just a play... | 1.000000 | audios/b62d03ee-4bc9-4ec9-bc88-c0c2b8ddb000.wav | -1 | | | [-0.24036164581775665, -0.05433245375752449, 0... |
| 1998 | My feelings are not puffed about with every at... | 2 | 0 | | | | en | | | My feelings are not left about whatever you a... | 0.333333 | audios/439b6105-2801-475a-a6bb-0bde80d47703.wav | -1 | | | [-0.7008544206619263, 0.6815686821937561, 0.16... |
| 1999 | There are three pubs and a Social Club. | 2 | 0 | | | | en | | | There are print hobs and social forms. | 0.625000 | audios/46d0e502-6d56-478d-a588-71ef762fb4d9.wav | -1 | | | [0.2030886560678482, 0.0592283271253109, -0.21... |
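
Besides browsing the report interactively, the returned data frame can be summarized per issue. The sketch below assumes the column names shown in the report above (`issue`, `wer`) and assumes that `issue == -1` marks samples that were not assigned to any problematic slice, as the visible rows suggest. The helper `summarize_issues` and the toy frame are hypothetical.

```python
import pandas as pd


def summarize_issues(issue_df):
    """Per-issue sample count and mean WER, worst issue first."""
    flagged = issue_df[issue_df["issue"] != -1]
    summary = flagged.groupby("issue")["wer"].agg(num_samples="count", mean_wer="mean")
    return summary.sort_values("mean_wer", ascending=False)


# Tiny hypothetical frame using the same column names as the report.
toy_df = pd.DataFrame(
    {"issue": [-1, -1, 0, 0, 1], "wer": [0.0, 0.25, 1.0, 0.5, 0.5]}
)
print(summarize_issues(toy_df))
```

Calling `summarize_issues(issue_df)` on the real result gives a quick ranking of which slices to listen to first.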


In [12]:
# Of course if you want to run additional checks you don't need to recompute the embeddings all the time.
# Just save them here, and supply the precomputed embeddings in the next call
# where we will check for smaller clusters aka outliers.
computed_embeddings = sg.embeddings

In [13]:
# Perform an additional detection, targeting outliers with significant drops (see min_support and min_drop)
# We even allow for clusters containing single samples here.
sg = SliceGuard()
issue_df = sg.find_issues(
        df,
        ["audio"],
        "sentence",
        "prediction",
        wer_metric,
        metric_mode="min",
        min_support=1,
        min_drop=0.3,
        precomputed_embeddings=computed_embeddings
    )

The overall metric value is 0.3165315929558808
Using 1 as minimum support for determining problematic clusters.
Using 0.3 as minimum drop for determining problematic clusters.
Identified 35 problematic slices.


In [14]:
# Report the issues using Renumics Spotlight
sg.report(spotlight_dtype={"audio": Audio})

## Step 3: Detect issues caused by (uncommon) speakers
While the previous example targeted general audio conditions that can cause issues, this is not always the criterion we want to check. One way to define other criteria is to **change the underlying embedding** so that it **captures different properties of the data**. In this case, we use a model for **speaker identification** as the embedding model. This should allow us to **detect uncommon speakers**, even though they are not explicitly labeled.
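
One way to picture why a speaker embedding can surface uncommon speakers: recordings of a rare voice tend to sit far away from all other recordings in embedding space. The sketch below ranks samples by cosine distance to their nearest neighbor. It is only an illustration of that intuition and not the clustering sliceguard performs; the function names and the 2-d toy vectors are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two equally long, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest_neighbor_distances(embeddings):
    """For every embedding, the cosine distance to its closest other embedding."""
    distances = []
    for i, emb in enumerate(embeddings):
        others = (
            cosine_distance(emb, other)
            for j, other in enumerate(embeddings)
            if j != i
        )
        distances.append(min(others))
    return distances


# Three similar "voices" and one that points in a different direction.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.0, 1.0]]
distances = nearest_neighbor_distances(toy_embeddings)
most_isolated = max(range(len(distances)), key=distances.__getitem__)
print(most_isolated, distances[most_isolated])
```

Swapping the embedding model changes what "far away" means: with a general audio model it reflects recording conditions, with a speaker model it reflects voice characteristics.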

In [15]:
# Perform a detection using a speaker identification model for computing embeddings.
# This will help to recover problematic speakers even though they are not explicitly labeled.
sg = SliceGuard()
issue_df = sg.find_issues(
        df,
        ["audio"],
        "sentence",
        "prediction",
        wer_metric,
        metric_mode="min",
        embedding_models={"audio": "superb/wav2vec2-base-superb-sid"},
        min_support=1,
        min_drop=0.3,
    )

Feature audio was inferred as referring to raw data. If this is not the case, please specify in feature_types!
Using superb/wav2vec2-base-superb-sid for computing embeddings for feature audio.


Some weights of the model checkpoint at superb/wav2vec2-base-superb-sid were not used when initializing Wav2Vec2Model: ['projector.weight', 'classifier.weight', 'layer_weights', 'projector.bias', 'classifier.bias']
- This IS expected if you are initializing Wav2Vec2Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
100%|███████████████████████████████████████| 2001/2001 [00:43<00:00, 46.53it/s]


The overall metric value is 0.3165315929558808
Using 1 as minimum support for determining problematic clusters.
Using 0.3 as minimum drop for determining problematic clusters.
Identified 28 problematic slices.


In [16]:
# Report the issues using Renumics Spotlight
sg.report(spotlight_dtype={"audio": Audio})

| | sentence | up_votes | down_votes | age | gender | accent | locale | segment | variant | prediction | wer | audio | issue | issue_metric | issue_explanation | sg_emb_audio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | This device has a cathode inside an anode wire... | 2 | 0 | | | | en | | | This device has a cathode inside an anode wir... | 0.000000 | audios/fd07fefa-9a45-4b8d-ab3a-996b260e8df5.wav | -1 | | | [0.06265784054994583, 0.11362244933843613, 0.2... |
| 1 | This product is almost always produced by the ... | 2 | 0 | | | | en | | | This product is almost always produced by the... | 0.000000 | audios/ef3a59ef-1993-4be6-a471-a90de6ef6d3f.wav | -1 | | | [0.044952478259801865, 0.2273407131433487, 0.2... |
| 2 | It is named after Edward Singleton Holden. | 2 | 0 | | | | en | | | It is named after Edward Singleton Hold Them. | 0.285714 | audios/6cad4f24-efed-43b7-9b00-91f2a8e16d65.wav | -1 | | | [0.038528237491846085, 0.2174566090106964, 0.3... |
| 3 | It is north west of the regional centre of Clare. | 2 | 0 | | | | en | | | This northwest of the regional center of Claire. | 0.600000 | audios/45ab62fa-ce5e-4e5f-a0d1-a6372eeed758.wav | -1 | | | [0.08671652525663376, 0.09317027032375336, 0.2... |
| 4 | He was a nephew of Rear-Admiral Sir Francis Au... | 2 | 0 | twenties | female | United States English | en | | | He was a nephew of rear admiral surferances A... | 0.300000 | audios/81e9d54f-23b1-499b-81ff-3ed05bfe7325.wav | -1 | | | [-0.19993984699249268, 0.07098667323589325, 0.... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1996 | At this time, Charles Henrotin was elected the... | 2 | 0 | | | | en | | | At his time Charles Henk wrote in, was electe... | 0.454545 | audios/c79bffc0-16fb-4652-95fb-56b0b99a5f96.wav | -1 | | | [-0.14808976650238037, 0.1537473499774933, 0.2... |
| 1997 | They held Catholic beliefs but were consistent... | 2 | 1 | | | | en | | | They held cap in police, but it's just a play... | 1.000000 | audios/b62d03ee-4bc9-4ec9-bc88-c0c2b8ddb000.wav | -1 | | | [-0.29085758328437805, 0.14806108176708221, 0.... |
| 1998 | My feelings are not puffed about with every at... | 2 | 0 | | | | en | | | My feelings are not left about whatever you a... | 0.333333 | audios/439b6105-2801-475a-a6bb-0bde80d47703.wav | -1 | | | [-0.06850507855415344, 0.24348363280296326, 0.... |
| 1999 | There are three pubs and a Social Club. | 2 | 0 | | | | en | | | There are print hobs and social forms. | 0.625000 | audios/46d0e502-6d56-478d-a588-71ef762fb4d9.wav | -1 | | | [-0.002221673261374235, 0.349068284034729, 0.2... |
