# Detecting audio issues in the Common Voice dataset
This notebook shows how you can use sliceguard to detect issues in audio datasets, using the Common Voice dataset as an example. It focuses on the basic workflow and on how to use different embedding models from the Hugging Face Hub.

In order to run this example you will need some **dependencies**. Install them as follows.

In [None]:
!pip install sliceguard librosa soundfile datasets tqdm jiwer

## Step 1: Generate predictions for the Common Voice dataset

**IMPORTANT NOTE**: To access the Common Voice dataset you have to accept its terms and conditions. To do this, create a Hugging Face account and accept the terms and conditions [HERE](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0). You then need to **create an access token** to access your datasets programmatically. Follow the steps for configuring one [HERE](https://huggingface.co/docs/hub/security-tokens). This only takes a few minutes. Paste your access token into a file called **access_token.txt** and place it in the same directory as this notebook.
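
A token file saved from an editor often ends with a newline, so it is safest to strip whitespace before passing the token on. The following is a minimal sketch of reading the file defensively; the helper name `read_access_token` is hypothetical and not part of any library.

```python
from pathlib import Path


def read_access_token(token_file: str) -> str:
    """Read a token from a text file (hypothetical helper, not a library function)."""
    path = Path(token_file)
    if not path.is_file():
        raise FileNotFoundError(f"Token file not found: {path}")
    # Strip surrounding whitespace such as a trailing newline added by an editor.
    token = path.read_text(encoding="utf-8").strip()
    if not token:
        raise ValueError(f"Token file is empty: {path}")
    return token
```

Failing early with a clear message makes a missing or empty token file easier to diagnose than an authentication error later in the notebook.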

In [None]:
# Configure this example here.
# As configured, it is optimized for fast execution by using only Whisper tiny.
HF_MODEL = "openai/whisper-tiny"
ACCESS_TOKEN_FILE = "access_token.txt"
AUDIO_SAVE_DIR = "audios"
NUM_SAMPLES = 2000

In [None]:
# Some imports you will need to execute this
import uuid
import shutil
from pathlib import Path
import pandas as pd
from tqdm import tqdm
import torch
import librosa
import soundfile as sf
from jiwer import wer
from datasets import load_dataset, Audio
from transformers import pipeline
from transformers import WhisperProcessor, WhisperFeatureExtractor, WhisperTokenizer, WhisperForConditionalGeneration

In [None]:
# Read the access token for downloading the dataset (strip a possible trailing newline)
access_token = Path(ACCESS_TOKEN_FILE).read_text().strip()
cv_13 = load_dataset("mozilla-foundation/common_voice_13_0", "en", use_auth_token=access_token, streaming=True)

In [None]:
# Instantiate an ASR pipeline with the configured model
device = "cuda:0" if torch.cuda.is_available() else "cpu"

feature_extractor = WhisperFeatureExtractor.from_pretrained(HF_MODEL)
tokenizer = WhisperTokenizer.from_pretrained(HF_MODEL, language="en", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained(HF_MODEL).to(device)

model.config.forced_decoder_ids = tokenizer.get_decoder_prompt_ids() # Specify the task, as we always want to transcribe English
model.config.language = "<|en|>"
model.config.task = "transcribe"

pipe = pipeline("automatic-speech-recognition", model=model, tokenizer=tokenizer, feature_extractor=feature_extractor, device=device)

In [None]:
keys_to_save = ["sentence", "up_votes", "down_votes", "age", "gender", "accent", "locale", "segment", "variant"]

audio_save_dir = Path(AUDIO_SAVE_DIR)
# Start from an empty output directory
if audio_save_dir.is_dir():
    shutil.rmtree(audio_save_dir)
audio_save_dir.mkdir()

num_samples = 0
data = []
for sample in tqdm(cv_13["train"], total=NUM_SAMPLES):
    new_audio = librosa.resample(sample["audio"]["array"], orig_sr=sample["audio"]["sampling_rate"], target_sr=16000)
    file_stem = str(uuid.uuid4())
    cur_data = {}
    for k in keys_to_save:
        cur_data[k] = sample[k]
    prediction = pipe(new_audio)["text"]
    cur_data["prediction"] = prediction
    
    sample_wer = wer(sample["sentence"], prediction)
    cur_data["wer"] = sample_wer
    
    target_path = audio_save_dir / (file_stem + ".wav")
    cur_data["audio"] = target_path
    sf.write(target_path, new_audio, 16000)
    data.append(cur_data)
    num_samples += 1
    if num_samples >= NUM_SAMPLES:  # stop after exactly NUM_SAMPLES samples
        break
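
When capping how many samples are taken from a streaming dataset, `itertools.islice` is an alternative to a manual counter. The sketch below uses a hypothetical generator, `fake_stream`, as a stand-in for a streamed split, and a small `NUM_SAMPLES` for illustration only.

```python
from itertools import islice


def fake_stream():
    """Hypothetical stand-in for a streaming dataset split that yields samples forever."""
    index = 0
    while True:
        yield {"sentence": f"sample {index}"}
        index += 1


NUM_SAMPLES = 5  # small value for illustration only

# islice stops after exactly NUM_SAMPLES items, so no manual counter is needed.
samples = list(islice(fake_stream(), NUM_SAMPLES))
print(len(samples))
```

Because the limit lives in one place, there is no counter comparison to get off by one.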

In [None]:
df = pd.DataFrame(data)
df["audio"] = df["audio"].astype("string") # convert Path objects to strings, otherwise JSON serialization overflows
df.to_json("dataset.json", orient="records")

## Step 2: Detect issues caused by environmental noise
The first check is whether there are audio recordings that differ so much from the rest of the data that they cannot be properly transcribed. Here we mostly target **general audio properties and environmental noise**, such as background sounds.

To do this, we use **general-purpose audio embeddings** from a model trained on AudioSet.

In [None]:
# Some imports you will need for this step
import pandas as pd
import numpy as np
from jiwer import wer
from sliceguard import SliceGuard
from renumics.spotlight import Audio

In [None]:
# Read the generated dataset including the predictions
df = pd.read_json("dataset.json")

In [None]:
# Define the metric function
def wer_metric(y_true, y_pred):
    return np.mean([wer(s_y, s_pred) for s_y, s_pred in zip(y_true, y_pred)])
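
To make the metric more concrete, here is a minimal pure-Python sketch of word error rate: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function names `word_error_rate` and `mean_wer` are illustrative; this is not jiwer's implementation, and jiwer's handling of whitespace and normalization may differ in detail.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


def mean_wer(references, hypotheses):
    """Average the per-sample WER, mirroring the structure of wer_metric."""
    scores = [word_error_rate(r, h) for r, h in zip(references, hypotheses)]
    return sum(scores) / len(scores)


# One substitution (sat -> sit) and one deletion (the) over six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Averaging per-sample scores weights short and long sentences equally, which differs from a corpus-level WER computed over all words at once. Note also that WER can exceed 1 when the hypothesis contains many inserted words.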

In [None]:
# Perform an initial detection aiming for relatively small clusters of at least 5 similar samples
sg = SliceGuard()
issues = sg.find_issues(
        df,
        ["audio"],
        "sentence",
        "prediction",
        wer_metric,
        metric_mode="min",
        embedding_models={"audio": "MIT/ast-finetuned-audioset-10-10-0.4593"},
        min_support=5,
        min_drop=0.2,
    )
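
One plausible way to read `min_support` and `min_drop` is sketched below: a cluster is reported only if it contains at least `min_support` samples and its mean WER is worse than the overall mean by at least `min_drop`. This is an illustration under that assumption, not sliceguard's actual algorithm; the function `flag_slices` and the toy cluster labels are hypothetical.

```python
from collections import defaultdict


def flag_slices(cluster_labels, sample_wers, min_support, min_drop):
    """Return {label: mean WER} for clusters that look problematic (illustrative only)."""
    overall = sum(sample_wers) / len(sample_wers)
    by_cluster = defaultdict(list)
    for label, score in zip(cluster_labels, sample_wers):
        by_cluster[label].append(score)

    flagged = {}
    for label, scores in by_cluster.items():
        cluster_mean = sum(scores) / len(scores)
        # Lower WER is better, so a "drop" here means the cluster's WER is higher than overall.
        if len(scores) >= min_support and cluster_mean - overall >= min_drop:
            flagged[label] = cluster_mean
    return flagged


# Toy data: cluster 1 is consistently bad, cluster 2 is a single bad sample.
labels = [0, 0, 0, 1, 1, 1, 2]
wers = [0.0, 0.25, 0.0, 1.0, 0.75, 1.0, 1.0]
result = flag_slices(labels, wers, min_support=3, min_drop=0.25)
print(result)
```

Under this reading, lowering `min_support` to 1 would also surface single-sample outliers such as cluster 2 in the toy data.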

In [None]:
# Report the issues using Renumics Spotlight
sg.report(spotlight_dtype={"audio": Audio})

In [None]:
# Of course if you want to run additional checks you don't need to recompute the embeddings all the time.
# Just save them here, and supply the precomputed embeddings in the next call
# where we will check for smaller clusters aka outliers.
computed_embeddings = sg.embeddings

In [None]:
# Perform an additional detection, targeting outliers with significant drops (see min_support and min_drop)
# We even allow for clusters containing single samples here.
sg = SliceGuard()
issues = sg.find_issues(
        df,
        ["audio"],
        "sentence",
        "prediction",
        wer_metric,
        metric_mode="min",
        min_support=1,
        min_drop=0.3,
        precomputed_embeddings=computed_embeddings
    )

In [None]:
# Report the issues using Renumics Spotlight
sg.report(spotlight_dtype={"audio": Audio})

## Step 3: Detect issues caused by (uncommon) speakers
While the previous detection targeted general audio conditions that can cause issues, this is not always the criterion we want to check. One way to define other criteria is **changing the underlying embedding** to **capture different properties of the data**. In this case, we use a model for **speaker identification** as the embedding model. This should allow us to **detect uncommon speakers**, although they are not explicitly labeled.

In [None]:
# Perform a detection using a speaker identification model for computing embeddings.
# This will help to recover problematic speakers even though they are not explicitly labeled.
sg = SliceGuard()
issues = sg.find_issues(
        df,
        ["audio"],
        "sentence",
        "prediction",
        wer_metric,
        metric_mode="min",
        embedding_models={"audio": "superb/wav2vec2-base-superb-sid"},
        min_support=1,
        min_drop=0.3,
    )

In [None]:
# Report the issues using Renumics Spotlight
sg.report(spotlight_dtype={"audio": Audio})