# Interactive Visualization, Features, Embeddings

![Spotlight](https://github.com/Renumics/sliceguard/blob/main/static/img/bengaliai_spotlight.png?raw=true)

This notebook provides two resources that can help you succeed in this competition:
1. An **enriched dataset version** containing **audio features, as well as audio and text embeddings**
2. Code for **interactive exploration** in the data curation tool [Spotlight](https://github.com/Renumics/spotlight) to conduct your own **EDA and Evaluation**

Note that for the interactive exploration to work, you should **RUN THIS LOCALLY**, not in the Kaggle environment.

The dataset contains the following columns:
* *audio_length_s*: Length of the audio file in seconds
* *audio_rms_max*: Maximum signal energy of the sample
* *audio_rms_mean*: Mean signal energy of the sample
* *audio_rms_std*: Standard deviation of the signal energy of the sample
* *audio_spectral_flatness_mean*: Mean spectral flatness of the sample
* *audio_embedding*: Audio embeddings computed using an embedding model trained on AudioSet
* *text_embedding*: Multilingual text embeddings
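
To make the energy columns concrete, here is a minimal sketch of how RMS statistics like these could be derived from a mono signal. It is not the code used to build the dataset: the helper names `frame_rms` and `summarize_audio`, the frame and hop sizes, and the synthetic test tone are all assumptions chosen for illustration. Spectral flatness and the embeddings are omitted.

```python
import math
import statistics


def frame_rms(samples, frame_length=2048, hop_length=512):
    """Return the RMS energy of each frame of a mono signal (hypothetical helper)."""
    if not samples:
        raise ValueError("samples must not be empty")
    if len(samples) < frame_length:
        # Signal shorter than one frame: treat it as a single frame.
        frames = [samples]
    else:
        last_start = len(samples) - frame_length
        frames = [
            samples[start:start + frame_length]
            for start in range(0, last_start + 1, hop_length)
        ]
    return [math.sqrt(sum(x * x for x in frame) / len(frame)) for frame in frames]


def summarize_audio(samples, sample_rate):
    """Summarize a signal with column names matching the list above (assumed definitions)."""
    rms = frame_rms(samples)
    return {
        "audio_length_s": len(samples) / sample_rate,
        "audio_rms_max": max(rms),
        "audio_rms_mean": statistics.fmean(rms),
        "audio_rms_std": statistics.pstdev(rms),
    }


# Synthetic example: one second of a 440 Hz sine wave with amplitude 0.5.
sample_rate = 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]
features = summarize_audio(tone, sample_rate)
print(features)
```

For a steady tone the per-frame RMS barely changes, so the standard deviation is close to zero. Recordings with long pauses or sudden loud segments would show a larger spread between the mean and maximum values.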

In [None]:
# IMPORTANT (!): Change this if you are executing this locally for interactive exploration.
# Set your directory containing the train.csv file
INPUT_DIR = "/kaggle/input/bengaliai-speech"

In [None]:
# Install these dependencies to get and view the enriched dataset
!pip install -U pandas datasets renumics-spotlight==1.3.0rc6

In [None]:
# All imports
from pathlib import Path
import pandas as pd
import datasets
from renumics import spotlight
from renumics.spotlight import Audio, Embedding

In [None]:
# Load the raw data from your machine
df = pd.read_csv(Path(INPUT_DIR) / "train.csv")

# Pull features and embeddings from the Hugging Face Hub
dataset = datasets.load_dataset("renumics/bengaliai-competition-features-embeddings")
feature_df = dataset["train"].to_pandas()

# Merge the two datasets
additional_columns = feature_df.columns.difference(df.columns).tolist() + ["id"]
df = pd.merge(df, feature_df[additional_columns], on='id')
# Prefix the audio paths with INPUT_DIR so the files can be found on disk
df["audio"] = df["audio"].map(lambda p: str(Path(INPUT_DIR) / p))
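
The merge step can be checked on a toy example. The sketch below uses made-up frames standing in for `train.csv` and the feature table; the values and the `train_mp3s/...` path layout are assumptions, not the real data. It shows that `columns.difference` keeps only the columns missing from `df`, and that the default inner join drops rows that have no matching `id` in the feature table.

```python
import pandas as pd

# Toy stand-ins for train.csv and the feature table (all values are made up).
df = pd.DataFrame(
    {
        "id": ["a", "b", "c"],
        "sentence": ["s1", "s2", "s3"],
        "split": ["train", "train", "valid"],
    }
)
feature_df = pd.DataFrame(
    {
        "id": ["a", "b"],
        "split": ["train", "train"],
        "audio": ["train_mp3s/a.mp3", "train_mp3s/b.mp3"],
        "audio_length_s": [1.5, 2.0],
    }
)

# Keep only feature columns that df lacks, plus the join key.
additional_columns = feature_df.columns.difference(df.columns).tolist() + ["id"]
merged = pd.merge(df, feature_df[additional_columns], on="id")

print(additional_columns)
print(merged)
```

Because `split` exists in both frames, it is excluded from `additional_columns`, which avoids duplicated `split_x` / `split_y` columns. Row `c` has no features and is absent from `merged`; passing `how="left"` to `pd.merge` would keep it with missing values instead.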

In [None]:
# Display the dataframe containing additional features and embeddings
df

In [None]:
# Open the dataset for interactive exploration (subsampled to at most 5000 samples)
spotlight.show(
    df.sample(min(5000, len(df))),
    dtype={"audio": Audio, "audio_embedding": Embedding, "text_embedding": Embedding},
)

**[Docs for the Exploration Tool on GitHub](https://github.com/Renumics/spotlight)**