In [None]:
%load_ext autoreload
%autoreload 2

# Exploratory Data Analysis - Bengali.AI Kaggle Competition

This notebook performs a brief **exploratory data analysis (EDA)** of the data available for the Bengali.AI competition. Working through it gives you the following:
1. **Insights** on the available training and validation data (basic information, biases, outliers, duplicates, ...)
2. A **template for doing interactive EDA** in the EDA tool [Spotlight](https://github.com/Renumics/spotlight)
3. An **embedding- and feature-enriched dataset** as a starting point for your own analysis

**NOTE**: You have to **adjust the path to your dataset** in the **Setup section** below!

## Setup

In [None]:
# Install these dependencies
!pip install -U sliceguard pandas numpy plotly datasets scikit-learn tqdm

In [None]:
# Configure the path to your dataset here
INPUT_DIR = "/home/daniel/data/bengaliai/bengaliai-speech" # CHANGE DATASET PATH HERE!!!

In [None]:
# The imports you will need
from pathlib import Path
from tqdm import tqdm
import pandas as pd
import numpy as np
import datasets
import plotly.express as px
from sklearn.neighbors import KDTree
from renumics import spotlight
from renumics.spotlight import Audio, Embedding
from sliceguard import SliceGuard

## Load the data
This code simply loads the data. Afterwards, *df* contains a feature- and embedding-enriched dataset.

In [None]:
# Load the raw data from your machine
df = pd.read_csv(Path(INPUT_DIR) / "train.csv")

# Pull features and embeddings from huggingface dataset hub
dataset = datasets.load_dataset("renumics/bengaliai-competition-features-embeddings")
feature_df = dataset["train"].to_pandas()

# Merge the two datasets
additional_columns = feature_df.columns.difference(df.columns).tolist() + ["id"]

df = pd.merge(df, feature_df[additional_columns], on='id')

if not INPUT_DIR.endswith("/"):
    INPUT_DIR = INPUT_DIR + "/"
df["audio"] = INPUT_DIR + df["audio"]

## Basics about the raw data

In [None]:
# Sample count and columns
print(f"Sample count is {len(df)}.")
print(f"Dataframe contains the columns {df.columns.tolist()}.")

**Dataframe Structure:**
* The dataframe contains almost 1 million rows
* It contains the following basic columns:
    * *id*: The sample id which maps to the corresponding audio file in "train_mp3s"
    * *sentence*: The ground-truth transcription of the audio file
* The enrichment adds the following columns:
    * *audio_length_s*: Length of the audio file in seconds
    * *audio_rms_max*: Maximum signal energy of the sample
    * *audio_rms_mean*: Mean signal energy of the sample
    * *audio_rms_std*: Standard deviation of the signal energy of the sample
    * *audio_spectral_flatness_mean*: Audio spectral flatness mean (the higher the more noise like)
    * *audio_embedding*: Audio embeddings computed using an embedding model trained on AudioSet
    * *text_embedding*: Multilingual text embeddings
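
The enrichment code itself is not part of this notebook. As a rough sketch of what the simple audio features measure, the snippet below computes frame-wise RMS and spectral flatness with plain NumPy. The function name `simple_audio_features`, the frame and hop sizes, and the synthetic signals are assumptions made for illustration; the published features may have been computed differently, for example with a dedicated audio library.

```python
import numpy as np


def simple_audio_features(signal, sample_rate, frame_length=2048, hop_length=512):
    """Compute illustrative per-clip features from a mono float signal.

    Hypothetical helper: frame sizes and exact definitions are assumptions,
    not the pipeline used to build the published feature dataset.
    """
    signal = np.asarray(signal, dtype=np.float64)
    length_s = len(signal) / sample_rate

    # Zero-pad very short clips so that at least one full frame exists.
    if len(signal) < frame_length:
        signal = np.pad(signal, (0, frame_length - len(signal)))

    # Slice the signal into overlapping frames of shape (n_frames, frame_length).
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    starts = np.arange(n_frames) * hop_length
    frames = signal[starts[:, None] + np.arange(frame_length)[None, :]]

    # Root-mean-square amplitude per frame.
    rms = np.sqrt(np.mean(frames**2, axis=1))

    # Spectral flatness per frame: geometric mean divided by arithmetic mean
    # of the power spectrum. Near 1 is noise-like, near 0 is tonal.
    window = np.hanning(frame_length)
    power = np.abs(np.fft.rfft(frames * window, axis=1)) ** 2 + 1e-10
    flatness = np.exp(np.mean(np.log(power), axis=1)) / np.mean(power, axis=1)

    return {
        "audio_length_s": float(length_s),
        "audio_rms_max": float(rms.max()),
        "audio_rms_mean": float(rms.mean()),
        "audio_rms_std": float(rms.std()),
        "audio_spectral_flatness_mean": float(flatness.mean()),
    }


# Synthetic example: a pure tone should be far less "flat" than white noise.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = 0.5 * np.sin(2 * np.pi * 440 * t)
noise = np.random.default_rng(0).normal(0.0, 0.1, sample_rate)
print(simple_audio_features(tone, sample_rate))
print(simple_audio_features(noise, sample_rate))
```

Reading the two printed dictionaries side by side gives a feel for how the *audio_spectral_flatness_mean* column separates tonal from noise-like recordings.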

In [None]:
# Split ratio
print("##### Distribution between splits #####")
px.histogram(df, x="split")

**Data Split Ratio:**
* Around 30k samples of all public data are part of the validation set.

In [None]:
# Check if the split is random or group-wise
spotlight.show(df.groupby("split").sample(3000), dtype={"audio_embedding": Embedding, "text_embedding": Embedding, "audio": Audio})

*Result when ordered by audio_embedding*:

![Split Image](images/split.png)

**Which type of split is this?**
* The split appears to be sample-wise, but not completely random.
* Most of the audio embedding space is covered by samples from both splits.
* However, there are groups that exist only in the train split.
* There are also regions where the validation data is much denser, although they are still represented to some extent in the train data.
* It can therefore make sense to track specific regions in your evaluation and to adjust the sample distribution if your model is weak on certain types of samples.
* Something similar is observable for the text_embedding field, but the embedding quality is probably quite poor, so the effect is less pronounced there.
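
The visual impression can be backed by a number. A minimal sketch, assuming a subsample small enough for brute-force distances: for every sample, compute the fraction of its nearest neighbors in embedding space that belong to the validation split. Values of 0 point to train-only regions, and values far above the global validation share point to regions where validation data is denser. The helper name `neighbor_split_fraction` and the synthetic clusters are illustrative assumptions.

```python
import numpy as np


def neighbor_split_fraction(embeddings, is_valid, k=10):
    """For every sample, return the fraction of its k nearest neighbors
    (excluding itself) that belong to the validation split.

    Uses brute-force O(n^2) distances, so apply it to a subsample only.
    """
    embeddings = np.asarray(embeddings, dtype=np.float64)
    is_valid = np.asarray(is_valid, dtype=bool)

    # Pairwise squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_norms = np.sum(embeddings**2, axis=1)
    dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * embeddings @ embeddings.T
    np.fill_diagonal(dists, np.inf)  # a sample is never its own neighbor

    # Indices of the k closest samples per row (order within the k is irrelevant).
    neighbor_idx = np.argpartition(dists, k, axis=1)[:, :k]
    return is_valid[neighbor_idx].mean(axis=1)


# Synthetic example: one cluster shared by both splits, one train-only cluster.
rng = np.random.default_rng(0)
mixed = rng.normal(0.0, 1.0, size=(200, 8))
train_only = rng.normal(50.0, 1.0, size=(100, 8))
embeddings = np.vstack([mixed, train_only])
is_valid = np.zeros(len(embeddings), dtype=bool)
is_valid[:200:2] = True  # every second sample of the mixed cluster is validation

fraction = neighbor_split_fraction(embeddings, is_valid, k=10)
print("mixed cluster:", fraction[:200].mean())
print("train-only cluster:", fraction[200:].mean())
```

On the real data, `embeddings` would be `np.vstack(...)` of the *audio_embedding* column for a subsample, and `is_valid` a boolean mask built from the *split* column.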

## Distribution of Simple Audio Features

In [None]:
print("##### Distribution of audio lengths (s) #####")
audio_length_fig = px.histogram(df, x="audio_length_s", nbins=200)
audio_length_fig.show()

print("##### Distribution of rms (signal power) means #####")
audio_rms_means_fig = px.histogram(df, x="audio_rms_mean", nbins=200)
audio_rms_means_fig.show()

print("##### Distribution of rms (signal power) maxs #####")
audio_rms_maxs_fig = px.histogram(df, x="audio_rms_max", nbins=200)
audio_rms_maxs_fig.show()

print("##### Distribution of rms (signal power) stds #####")
audio_rms_stds_fig = px.histogram(df, x="audio_rms_std", nbins=200)
audio_rms_stds_fig.show()


print("##### Distribution of spectral flatness (noisiness) #####")
audio_spectral_flatness_fig = px.histogram(df, x="audio_spectral_flatness_mean", nbins=200)
audio_spectral_flatness_fig.show()



**Simple Audio Feature Distributions:**
* Most of the audio files are around 1.8-6.5 seconds long.
* There are very few samples shorter than 1.6 seconds and almost none longer than 10.6 seconds.
* The mean and max distributions of the signal energy each seem to have two peaks, possibly caused by different normalizations of two source datasets.
* The signal energy distribution also has a long tail towards the high end, so there is a wide range of louder samples with varying signal energy.
* Most of the data seems to be fairly tonal. However, according to spectral flatness, around 300k samples are a little noisy and quite a few samples have a high noisiness level.

**For all these findings:** Consider checking whether the distribution is the same in train and validation, or try to determine experimentally whether it seems to be the same in test. If not, consider adjusting the distribution, e.g., by normalizing the loudness of the signal.
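
One way to run such a check, sketched here under the assumption that a single number per feature is enough for a first look, is the two-sample Kolmogorov-Smirnov statistic: the largest gap between the empirical cumulative distributions of a feature in two splits. The helper `ks_statistic` below is a plain NumPy illustration; `scipy.stats.ks_2samp` computes the same statistic and additionally reports a p-value.

```python
import numpy as np


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest absolute gap between the two empirical cumulative
    distribution functions: 0 means identical, 1 means completely separated.
    """
    a = np.sort(np.asarray(sample_a, dtype=np.float64))
    b = np.sort(np.asarray(sample_b, dtype=np.float64))
    # Evaluate both empirical CDFs at every observed value.
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))


# Synthetic example: same distribution vs. a shifted one.
rng = np.random.default_rng(0)
train_like = rng.normal(0.0, 1.0, 5000)
valid_same = rng.normal(0.0, 1.0, 5000)
valid_shifted = rng.normal(1.0, 1.0, 5000)
print("same distribution:   ", ks_statistic(train_like, valid_same))
print("shifted distribution:", ks_statistic(train_like, valid_shifted))
```

Applied to this dataset, the two inputs would be a feature column such as *audio_rms_mean* restricted to the train rows and to the validation rows, respectively.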

## Biases and Embedding-based Distributions

### Interactive Bias Detection

In [None]:
spotlight.show(df.sample(10000), dtype={"audio_embedding": Embedding, "text_embedding": Embedding, "audio": Audio})

**Embedding-based Bias Detection**:
* There seems to be a slight bias towards male speakers.
* ...more biases to be identified via baseline model.

## Duplicates and Near Duplicates

In [None]:
# Exact duplicates in sentences
print("##### Distribution of exact duplicates in sentences #####")
px.histogram(df["sentence"].value_counts())

**Exact duplicates (Text):**
* Around 350k sentences are exactly duplicated in the dataset.
* Around 67k sentences are duplicated twice.
* Only a few sentences are duplicated significantly more often.
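
When reading the histogram it is easy to mix up "distinct sentences that have duplicates" and "samples that belong to a duplicate group". The following sketch makes both counts explicit. The helper `duplicate_summary` and the toy sentences are illustrative assumptions; the function accepts any iterable of strings, so a column such as `df["sentence"]` could be passed directly.

```python
from collections import Counter


def duplicate_summary(sentences):
    """Summarise exact duplicates in an iterable of strings."""
    counts = Counter(sentences)             # sentence -> number of occurrences
    occurrences = Counter(counts.values())  # occurrence count -> number of distinct sentences
    duplicated = {s: c for s, c in counts.items() if c > 1}
    return {
        "n_samples": sum(counts.values()),
        "n_unique_sentences": len(counts),
        # Distinct sentences that appear more than once.
        "n_sentences_with_duplicates": len(duplicated),
        # Rows that share their sentence with at least one other row.
        "n_samples_in_duplicate_groups": sum(duplicated.values()),
        "n_sentences_occurring_exactly_twice": occurrences.get(2, 0),
        "most_common": counts.most_common(3),
    }


# Toy example with one sentence occurring three times and one occurring twice.
toy_sentences = ["a", "b", "a", "c", "b", "a", "d"]
for key, value in duplicate_summary(toy_sentences).items():
    print(f"{key}: {value}")
```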

In [None]:
# Near duplicates in sentences
text_embeddings = np.vstack(df["text_embedding"].sample(20000))
kdtree = KDTree(text_embeddings)

distances = []
for emb in tqdm(text_embeddings):
    dist, ind = kdtree.query([emb], k=10)
    distances.append(dist[0])

distances = np.array(distances)
distances = distances[:,1:]

In [None]:
px.histogram(distances.min(1))

**Near Duplicates (Text):**
* About 2.5% of the samples seem to have a near or exact duplicate.
* A very small number of additional samples may be highly similar, but they do not make up a significant portion.
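
A percentage like this can be derived from the nearest-neighbor distances by choosing a distance threshold and counting the samples below it. The sketch below does this with brute-force NumPy distances instead of the `KDTree` used above, purely to stay dependency-free; the helper names and the threshold value are illustrative assumptions, and a sensible threshold depends on the embedding model. Note that on a random subsample a sample's duplicate is often not part of the subsample, so the true share in the full dataset may be higher.

```python
import numpy as np


def nearest_neighbor_distances(embeddings):
    """Distance from every row to its closest other row (brute force, O(n^2))."""
    x = np.asarray(embeddings, dtype=np.float64)
    sq_norms = np.sum(x**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * x @ x.T
    np.fill_diagonal(sq_dists, np.inf)  # ignore the trivial self-match
    # Clip tiny negative values caused by floating point round-off.
    return np.sqrt(np.clip(sq_dists.min(axis=1), 0.0, None))


def near_duplicate_fraction(embeddings, threshold):
    """Share of samples whose nearest neighbor is closer than `threshold`."""
    return float(np.mean(nearest_neighbor_distances(embeddings) < threshold))


# Synthetic example: 95 random vectors plus slightly perturbed copies of 5 of them.
rng = np.random.default_rng(0)
base = rng.normal(size=(95, 16))
copies = base[:5] + rng.normal(scale=1e-6, size=(5, 16))
embeddings = np.vstack([base, copies])
print(near_duplicate_fraction(embeddings, threshold=0.01))
```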

In [None]:
# Near duplicates in audio
audio_embeddings = np.vstack(df["audio_embedding"].sample(20000))
kdtree = KDTree(audio_embeddings)

distances = []
for emb in tqdm(audio_embeddings):
    dist, ind = kdtree.query([emb], k=10)
    distances.append(dist[0])

distances = np.array(distances)
distances = distances[:,1:]

In [None]:
px.histogram(distances.min(1))

**Near Duplicates (Audio):**
* There seem to be no distances close to zero.
* This makes near duplicates in the audio data unlikely.

## Outliers/Anomalies/Errors

In [None]:
# Note that for calculating outliers for the full dataset you will need around 40GB of RAM.
# To decrease the amount of memory needed, downsample the data. Of course, this will throw away some potential outliers,
# however enough will be left to get a feel for typical problematic cases.
NUM_SAMPLES = 30000

if NUM_SAMPLES is not None:
    # Sample without replacement so that no row is selected more than once.
    selected_indices = np.random.choice(np.arange(len(df)), size=NUM_SAMPLES, replace=False)
else:
    selected_indices = np.arange(len(df))

In [None]:
# Detect outliers based on the audio_embedding column.
# The library will essentially fit an outlier detection model and search for clusters of data that are more anomalous than their parent clusters.
sg = SliceGuard()
issues = sg.find_issues(df.iloc[selected_indices], ["audio_embedding"], min_drop=0.035, min_support=10, drop_reference="parent", precomputed_embeddings={"audio_embedding": np.vstack(df["audio_embedding"].iloc[selected_indices])})

In [None]:
# Display only the found issues, no other samples.
# Remove the parameter if you want to see all data, but beware the interactive report gets jerky above 50k samples.
sg.report(non_issue_portion=0.0)

In [None]:
# Detect outliers based on the text_embedding column.
# The library will essentially fit an outlier detection model and search for clusters of data that are more anomalous than their parent clusters.
sg = SliceGuard()
sg.find_issues(df, ["text_embedding"], min_drop=0.1, min_support=10, precomputed_embeddings={"text_embedding": np.vstack(df["text_embedding"])}, drop_reference="parent")

In [None]:
# Display only the found issues, no other samples.
# Remove the parameter if you want to see all data, but beware the interactive report gets jerky above 50k samples.
sg.report(non_issue_portion=0.0)

## Free EDA in Spotlight
Now it's your turn. Use Spotlight to uncover even more hidden patterns.

In [None]:
spotlight.show(df, dtype={"audio_embedding": Embedding, "text_embedding": Embedding, "audio": Audio})