# Detecting audio issues in the Common Voice dataset
This notebook shows how to use sliceguard to detect issues in audio datasets, using the Common Voice dataset as an example. It focuses on the basic workflow and on how to use different embedding models from the Hugging Face Hub.

In order to run this example you will need some **dependencies**. Install them as follows.

In [32]:
!pip install sliceguard librosa soundfile datasets tqdm jiwer










## Step 1: Generate predictions for the Common Voice dataset

**IMPORTANT NOTE**: To access the Common Voice dataset you have to accept its terms and conditions. To do this, create a Hugging Face account and accept the terms [HERE](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0). You then need to **create an access token** to access the dataset programmatically. Follow the steps for configuring one [HERE](https://huggingface.co/docs/hub/security-tokens). This only takes a few minutes. Paste your access token into a file called **access_token.txt** and place it in the same directory as this notebook.
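
Token files often end with a trailing newline or are missing altogether, and either case tends to surface later as a confusing authentication error. The following is a minimal sketch of a more defensive way to read the file. The helper name `read_access_token` is hypothetical and not part of sliceguard or the Hugging Face libraries.

```python
from pathlib import Path


def read_access_token(token_file: str = "access_token.txt") -> str:
    """Return the token stored in `token_file` without surrounding whitespace.

    Hypothetical helper for this notebook; it only reads a local text file.
    """
    path = Path(token_file)
    if not path.is_file():
        raise FileNotFoundError(
            f"Expected a Hugging Face access token in '{path}'. "
            "Create the file next to this notebook and paste your token into it."
        )
    # Editors commonly append a newline, so strip whitespace before use.
    token = path.read_text(encoding="utf-8").strip()
    if not token:
        raise ValueError(f"'{path}' is empty.")
    return token
```

The returned string can then be passed wherever the notebook expects the access token.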

In [1]:
# Configure this example here.
# These defaults are chosen for fast execution and therefore only use Whisper tiny.
HF_MODEL = "openai/whisper-tiny"
ACCESS_TOKEN_FILE = "access_token.txt"
AUDIO_SAVE_DIR = "audios"
NUM_SAMPLES = 500

In [33]:
# Imports needed to run this example
import uuid
import shutil
from pathlib import Path
import pandas as pd
from tqdm import tqdm
import torch
import librosa
import soundfile as sf
from jiwer import wer
from datasets import load_dataset, Audio
from transformers import pipeline
from transformers import WhisperProcessor, WhisperFeatureExtractor, WhisperTokenizer, WhisperForConditionalGeneration

In [15]:
# Read the access token for downloading the dataset (strip the trailing newline most editors add)
access_token = Path(ACCESS_TOKEN_FILE).read_text().strip()
cv_13 = load_dataset("mozilla-foundation/common_voice_13_0", "en", use_auth_token=access_token, streaming=True)

In [19]:
# Instantiate an ASR pipeline with the configured model
device = "cuda:0" if torch.cuda.is_available() else "cpu"

feature_extractor = WhisperFeatureExtractor.from_pretrained(HF_MODEL)
tokenizer = WhisperTokenizer.from_pretrained(HF_MODEL, language="en", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained(HF_MODEL).to(device)

model.config.forced_decoder_ids = tokenizer.get_decoder_prompt_ids() # Specify the task as we always want to transcribe English
model.config.language = "<|en|>"
model.config.task = "transcribe"

pipe = pipeline("automatic-speech-recognition", model=model, tokenizer=tokenizer, feature_extractor=feature_extractor, device=device)

In [37]:
keys_to_save = ["sentence", "up_votes", "down_votes", "age", "gender", "accent", "locale", "segment", "variant"]

audio_save_dir = Path(AUDIO_SAVE_DIR)
# Start from an empty output directory on every run
if audio_save_dir.is_dir():
    shutil.rmtree(audio_save_dir)
audio_save_dir.mkdir()

num_samples = 0
data = []
for sample in tqdm(cv_13["train"], total=NUM_SAMPLES):
    new_audio = librosa.resample(sample["audio"]["array"], orig_sr=sample["audio"]["sampling_rate"], target_sr=16000)
    file_stem = str(uuid.uuid4())
    cur_data = {}
    for k in keys_to_save:
        cur_data[k] = sample[k]
    prediction = pipe(new_audio)["text"]
    cur_data["prediction"] = prediction
    
    sample_wer = wer(sample["sentence"], prediction)
    cur_data["wer"] = sample_wer
    
    target_path = audio_save_dir / (file_stem + ".wav")
    cur_data["audio"] = target_path
    sf.write(target_path, new_audio, 16000)
    data.append(cur_data)
    num_samples += 1
    if num_samples >= NUM_SAMPLES:  # stop after exactly NUM_SAMPLES samples
        break


In [38]:
df = pd.DataFrame(data)
df["audio"] = df["audio"].astype("string") # otherwise overflow in serializing json
df.to_json("dataset.json", orient="records")
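
The `wer` column stored above is the word error rate of each prediction against its reference sentence. As a rough illustration of what that number means, here is a small pure-Python sketch that computes a word-level edit distance and divides it by the number of reference words. The function `word_error_rate` is a hypothetical stand-in written for this explanation; it is not the `jiwer` implementation and may differ from it in how text is normalized.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = cur[j - 1] + 1
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref_words)
```

In this sketch, `word_error_rate("Hello world", "hello world")` gives 0.5, because the comparison is case-sensitive and one of the two reference words counts as a substitution. Differences in casing or punctuation can therefore inflate the score, so it may be worth normalizing both strings before comparing them.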

## Step 2: Detect issues caused by environmental noise
The first check is whether there are audio recordings that differ so much from the rest of the data that they cannot be transcribed properly. Here we mostly target general audio properties and environmental noise such as background sounds.

To do this, we use general-purpose audio embeddings from a model trained on AudioSet.

In [42]:
from sliceguard import SliceGuard


In [40]:
df = pd.read_json("dataset.json")

In [41]:
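# Mean per-sample WER of a slice; lower is better, hence metric_mode="min".
def mean_wer(y_true, y_pred):
    scores = [wer(ref, hyp) for ref, hyp in zip(y_true, y_pred)]
    return sum(scores) / len(scores)

sg = SliceGuard()
issue_df = sg.find_issues(
    df,
    ["audio"],  # column with the audio file paths written in Step 1
    "sentence",  # reference transcription
    "prediction",  # model transcription
    mean_wer,
    metric_mode="min",
    # feature_types={"age": "ordinal"},
    # feature_orders={"age": ["", "teens", "twenties", "thirties", "fourties", "fifties", "sixties", "seventies", "eighties", "nineties"]},
    embedding_models={"audio": "superb/wav2vec2-base-superb-sid"},
    min_support=5,
    min_drop=0.1,
)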

Index(['sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent',
       'locale', 'segment', 'variant', 'prediction', 'wer', 'audio'],
      dtype='object')
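
To give an intuition for the `min_support` and `min_drop` arguments, the toy sketch below flags groups of samples whose mean WER is clearly worse than the overall mean. It is an illustration under simplifying assumptions, not sliceguard's implementation: the groups are given by hand here, whereas sliceguard derives its slices from the features or embeddings. The function name `flag_slices` and the example values are made up.

```python
from statistics import mean


def flag_slices(groups, min_support=5, min_drop=0.1):
    """Return (name, size, mean_wer) for groups that are large enough and clearly worse.

    `groups` maps a slice name to the list of per-sample WER values in it.
    Lower WER is better, so "worse" means a higher mean than the overall mean.
    """
    all_values = [v for values in groups.values() for v in values]
    overall = mean(all_values)
    flagged = []
    for name, values in groups.items():
        if len(values) < min_support:
            continue  # too few samples to trust the slice
        slice_mean = mean(values)
        if slice_mean - overall >= min_drop:
            flagged.append((name, len(values), slice_mean))
    return flagged


# Made-up example data: one large clean group, one noisy group, one tiny group.
example_groups = {
    "clean": [0.1] * 20,
    "noisy": [0.6] * 5,
    "tiny": [1.0] * 2,
}
flagged_example = flag_slices(example_groups)
```

With these made-up values only the `noisy` group is flagged: the `tiny` group is worse on average but falls below `min_support`, and the `clean` group is better than the overall mean.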

## Step 3: Detect issues caused by (uncommon) speakers