# MLPC2025 Dataset
- `metadata.csv` lists the individual audio files in the dataset and their metadata (keywords, descriptions, title, license, download link of the original file, ...)
- `metadata_keywords_embeddings.npz` holds one text embedding vector for each list of keywords in `metadata.csv`; the rows of `metadata.csv` and `metadata_keywords_embeddings.npz` are aligned, so row `i` of the CSV corresponds to row `i` of the embedding array
- `metadata_title_embeddings.npz` holds one text embedding vector for each title in `metadata.csv`; the rows of `metadata.csv` and `metadata_title_embeddings.npz` are aligned, so row `i` of the CSV corresponds to row `i` of the embedding array
- `annotations.csv` lists all temporal annotations together with the text description of each annotated region
- `annotations_text_embeddings.npz` holds one text embedding vector for each annotation in `annotations.csv`; the rows of `annotations.csv` and `annotations_text_embeddings.npz` are aligned, so row `i` of the CSV corresponds to row `i` of the embedding array
- folder `audio` contains the audio recordings in mp3 format
- folder `audio_features` contains the audio features we extracted for you from the waveforms; each feature file holds multiple feature arrays.
  - See the example below for how to access the individual arrays.
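
Because the CSV rows and the embedding rows are aligned, a lookup by position is enough to pair a file with its embedding. The following is a minimal sketch on synthetic stand-in data rather than the real files: the helper `embedding_for_file`, the sample filenames and titles, and the embedding width of 4 are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for metadata.csv and metadata_title_embeddings.npz.
# The sample rows and the embedding width (4) are made up for this sketch.
metadata_df = pd.DataFrame({
    "filename": ["a.mp3", "b.mp3", "c.mp3"],
    "title": ["dog barking", "rain on window", "door slam"],
})
title_embeddings = np.arange(12, dtype=float).reshape(3, 4)

# The alignment can only hold if both have the same number of rows.
assert len(metadata_df) == title_embeddings.shape[0]

def embedding_for_file(df, embeddings, filename):
    """Return the embedding row of the first metadata row with this filename.

    Hypothetical helper: it uses the row *position*, so it still works
    if the DataFrame index was changed by filtering or sorting.
    """
    matches = np.flatnonzero(df["filename"].to_numpy() == filename)
    if matches.size == 0:
        raise KeyError(filename)
    return embeddings[matches[0]]

vec = embedding_for_file(metadata_df, title_embeddings, "b.mp3")
print(vec)
```

Looking up by position rather than by index label avoids silent mismatches after the DataFrame has been filtered or shuffled.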

In [2]:
import numpy as np
import pandas as pd
import os

# load the metadata
metadata_df = pd.read_csv("MLPC2025_dataset/metadata.csv")
title_embeddings = np.load("MLPC2025_dataset/metadata_title_embeddings.npz")["embeddings"]
keywords_embeddings = np.load("MLPC2025_dataset/metadata_keywords_embeddings.npz")["embeddings"]

# load the annotations
annotations_df = pd.read_csv("MLPC2025_dataset/annotations.csv")
annotations_embeddings = np.load("MLPC2025_dataset/annotations_text_embeddings.npz")["embeddings"]

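# load audio features
# swap only the file extension; a plain str.replace("mp3", "npz") would also
# alter any other "mp3" substring in the filename
feature_filename = os.path.splitext(metadata_df.loc[0, "filename"])[0] + ".npz"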
features = np.load(os.path.join("MLPC2025_dataset/audio_features", feature_filename))
print(list(features.keys()))

print("Shape of ZCR feature (time, n_features)", features["zerocrossingrate"].shape)
print("Shape of MFCC features (time, n_features)", features["mfcc"].shape)

['embeddings', 'melspectrogram', 'mfcc', 'mfcc_delta', 'mfcc_delta2', 'flatness', 'centroid', 'flux', 'energy', 'power', 'bandwidth', 'contrast', 'zerocrossingrate']
Shape of ZCR feature (time, n_features) (233, 1)
Shape of MFCC features (time, n_features) (233, 32)
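
Since each array has the shape `(time, n_features)`, several features can be stacked along the feature axis to get one matrix per recording. The sketch below uses random stand-in arrays with the shapes printed above instead of a real feature file; the helper `stack_features` is hypothetical, and the assumption that all arrays in a file share the same number of frames is only confirmed here for ZCR and MFCC.

```python
import numpy as np

# Random stand-in for one loaded feature file; the shapes mirror the printed
# output above (233 frames, 1 ZCR value and 32 MFCCs per frame).
rng = np.random.default_rng(0)
features = {
    "zerocrossingrate": rng.random((233, 1)),
    "mfcc": rng.random((233, 32)),
}

def stack_features(features, keys):
    """Concatenate the selected (time, n_features) arrays along the feature axis."""
    arrays = [np.asarray(features[k]) for k in keys]
    n_frames = {a.shape[0] for a in arrays}
    if len(n_frames) != 1:
        # Assumption check: every selected array must have the same frame count.
        raise ValueError(f"frame counts differ: {sorted(n_frames)}")
    return np.concatenate(arrays, axis=1)

frames = stack_features(features, ["zerocrossingrate", "mfcc"])

# One fixed-length vector per recording: mean and standard deviation over time.
clip_vector = np.concatenate([frames.mean(axis=0), frames.std(axis=0)])
print(frames.shape, clip_vector.shape)
```

The same function accepts the object returned by `np.load`, because it only relies on key-based access.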
