# MusicCaps Explorer

This notebook shows how to use `yt-dlp` to download clips from Google's MusicCaps dataset. MusicCaps pairs music clips with text captions, so you could use a dataset like this to train a text-to-audio generation model 😉!

This notebook is based on https://github.com/nateraw/download-musiccaps-dataset, with some additional annotations, so please give that repo a star. It shows how to load the dataset, download the underlying clips, and explore them.

Let's get started! 🔥

## Introduction and setup

Let's kick things off by installing some dependencies and loading the dataset. **Note that Kaggle comes with an old `datasets` version but we need a newer one, so you might need to restart the notebook after installing to make sure it's using the latest version.**

In [1]:
%%capture
! pip install -U datasets[audio]
! pip install yt-dlp

# For the interactive interface we'll need gradio
! pip install gradio



In [2]:
# ! pip install --upgrade pyarrow==12.0.1


In [3]:
import pyarrow

# Print the pyarrow version
pyarrow_version = pyarrow.__version__
print("PyArrow version:", pyarrow_version)


PyArrow version: 8.0.0
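
The same kind of check can be extended to the other dependencies. The snippet below is a minimal sketch, not part of the original notebook: the helper name `installed_version` is hypothetical, and it relies on the standard-library `importlib.metadata` module to look up what is installed.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Report the versions of the packages this notebook installs.
for name in ("datasets", "pyarrow", "yt-dlp", "gradio"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

Note that this reports what is installed on disk. If it disagrees with a module's `__version__` attribute (as printed for `pyarrow` above), the kernel is likely still holding the old import and needs a restart.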


We'll use the Hugging Face `datasets` library to load the dataset version hosted over there in the [google/MusicCaps](https://huggingface.co/datasets/google/MusicCaps) repository. 

In [4]:
from datasets import load_dataset  # datasets is the Hugging Face library

ds = load_dataset('google/MusicCaps', split='train')
ds

Downloading readme:   0%|          | 0.00/5.06k [00:00<?, ?B/s]

Downloading and preparing dataset csv/google--MusicCaps to /root/.cache/huggingface/datasets/google___csv/google--MusicCaps-b454a6f2b5e1bcf5/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/2.94M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/google___csv/google--MusicCaps-b454a6f2b5e1bcf5/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d. Subsequent calls will reuse this data.


Dataset({
    features: ['ytid', 'start_s', 'end_s', 'audioset_positive_labels', 'aspect_list', 'caption', 'author_id', 'is_balanced_subset', 'is_audioset_eval'],
    num_rows: 5521
})

We see that there are 5,521 music samples. Each sample contains information such as the audio caption and, perhaps surprisingly, a YouTube ID. Rather than exposing the audio files directly, this dataset stores the ID of a YouTube video (the `ytid` field) together with `start_s` and `end_s`, which mark the time range of the sample within that video. This makes it a bit harder to work with than other datasets.
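
To make the role of these three fields concrete, here is a small sketch that turns them into a watch URL and a clip length. The helper names `clip_url` and `clip_duration` are hypothetical, the `&t=<seconds>s` start-time parameter is an assumption about YouTube's URL convention, and the sample values are copied from the first row of the dataset.

```python
def clip_url(ytid, start_s, url_base="https://www.youtube.com/watch?v="):
    """Build a watch URL that starts playback at the clip's first second."""
    return f"{url_base}{ytid}&t={int(start_s)}s"


def clip_duration(start_s, end_s):
    """Length of the annotated clip in seconds."""
    return end_s - start_s


# Values taken from the first row of the dataset.
sample = {"ytid": "-0Gj8-vB1q4", "start_s": 30, "end_s": 40}
print(clip_url(sample["ytid"], sample["start_s"]))
print(clip_duration(sample["start_s"], sample["end_s"]), "seconds")
```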

## Loading audio data

Since the goal is just to load some data and explore it, we'll only load a small subset (60 samples in the cell below). Feel free to change the `samples_to_load` variable, but keep in mind that downloading the whole dataset can take a long time. Kaggle notebooks have 4 cores, and the `cores` variable controls how many processes `ds.map` uses to run the downloads in parallel.

Let's go and download the data! 🚀

In [5]:
# Helper functions for downloading and processing audio clips


import subprocess
import os
from pathlib import Path


def download_clip(
    video_identifier,
    output_filename,
    start_time,
    end_time,
    tmp_dir='/tmp/musiccaps',
    num_attempts=5,
    url_base='https://www.youtube.com/watch?v='
):
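    # Download a specific audio clip from YouTube.
    # Inputs include the video identifier, the output filename, and the start and end times.
    # The download is done with yt-dlp, and the audio is converted to wav.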
   
    
    status = False

    command = f"""
        yt-dlp --quiet --no-warnings -x --audio-format wav -f bestaudio -o "{output_filename}" --download-sections "*{start_time}-{end_time}" {url_base}{video_identifier}
    """.strip()

    attempts = 0
    while True:
        # Retry the download up to num_attempts times.
        try:
            output = subprocess.check_output(command, shell=True,
                                                stderr=subprocess.STDOUT)
        except subprocess.CalledProcessError as err:
            attempts += 1
            if attempts == num_attempts:
                return status, err.output
        else:
            break

    # Check if the video was successfully saved.
    status = os.path.exists(output_filename)
    return status, 'Downloaded'

def process(example):
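    # Process a single dataset sample.
    # Check whether the audio file already exists locally; if not, call download_clip to fetch it.
    # Then update the sample dict with the audio file path and the download status.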
    
    outfile_path = str(data_dir / f"{example['ytid']}.wav")
    status = True
    if not os.path.exists(outfile_path):
        status = False
        status, log = download_clip(
            example['ytid'],
            outfile_path,
            example['start_s'],
            example['end_s'],
        )

    example['audio'] = outfile_path
    example['download_status'] = status
    return example
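
`download_clip` builds its command as a single string and runs it through the shell. As an alternative sketch (not the approach used in this notebook), the same yt-dlp flags can be assembled as a list of arguments, which avoids shell quoting issues with unusual filenames. The helper name `build_clip_command` is hypothetical; the flags are the ones already used above.

```python
def build_clip_command(video_identifier, output_filename, start_time, end_time,
                       url_base="https://www.youtube.com/watch?v="):
    """Return the yt-dlp invocation as a list of arguments (no shell needed)."""
    return [
        "yt-dlp", "--quiet", "--no-warnings",
        "-x", "--audio-format", "wav",
        "-f", "bestaudio",
        "-o", output_filename,
        "--download-sections", f"*{start_time}-{end_time}",
        f"{url_base}{video_identifier}",
    ]


command = build_clip_command("-0Gj8-vB1q4", "music_data/-0Gj8-vB1q4.wav", 30, 40)
print(command)

# To actually run it (requires yt-dlp and network access):
# import subprocess
# subprocess.run(command, check=True, capture_output=True)
```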

In [6]:
from datasets import Audio

samples_to_load = 60      # How many samples to load
cores = 20                 # How many processes to use for the loading
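sampling_rate = 44100     # Sampling rate for the audio; keep it at 44100
writer_batch_size = 1000  # Number of examples each worker process keeps in memory. Reduce this if you run out of memory.
data_dir = "./music_data" # Where to save the data

# Select a subset of the samples to process
ds = ds.select(range(samples_to_load))

# Create the directory where the data will be saved
data_dir = Path(data_dir)
data_dir.mkdir(exist_ok=True, parents=True)

# Map the process function over every sample in the dataset using multiple processes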
ds = ds.map(
        process,
        num_proc=cores,
        writer_batch_size=writer_batch_size,
        keep_in_memory=False
    ).cast_column('audio', Audio(sampling_rate=sampling_rate))

Map (num_proc=20):   0%|          | 0/60 [00:00<?, ? examples/s]

Done! Let's look at the data for one example.

In [7]:
ds[0]

{'ytid': '-0Gj8-vB1q4',
 'start_s': 30,
 'end_s': 40,
 'audioset_positive_labels': '/m/0140xf,/m/02cjck,/m/04rlf',
 'aspect_list': "['low quality', 'sustained strings melody', 'soft female vocal', 'mellow piano melody', 'sad', 'soulful', 'ballad']",
 'caption': 'The low quality recording features a ballad song that contains sustained strings, mellow piano melody and soft female vocal singing over it. It sounds sad and soulful, like something you would hear at Sunday services.',
 'author_id': 4,
 'is_balanced_subset': False,
 'is_audioset_eval': True,
 'audio': {'path': 'music_data/-0Gj8-vB1q4.wav',
  'array': array([-0.00195292,  0.00100993,  0.00316163, ..., -0.01966176,
         -0.02357896,  0.        ]),
  'sampling_rate': 44100},
 'download_status': True}

Interesting! Let's see what we have:
* The `audio` key maps to a dictionary that contains the path to the `.wav` file, the audio already decoded as a `numpy` array, and the sampling rate.
* `is_audioset_eval` specifies whether the clip comes from the AudioSet eval split or the train split.
* The `caption` field has the description of the audio: "The low quality recording features a ballad song that contains sustained strings, mellow piano melody and soft female vocal singing over it. It sounds sad and soulful, like something you would hear at Sunday services."
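
Two other fields in the example above are stored as plain strings: `aspect_list` is the text form of a Python list, and `audioset_positive_labels` is a comma-separated list of label IDs. The sketch below shows one way to turn them into real lists. The helper names are hypothetical, and the values are copied from the output of `ds[0]`.

```python
import ast


def parse_aspects(aspect_list):
    """Turn the stringified list stored in `aspect_list` into a real list."""
    return ast.literal_eval(aspect_list)


def parse_labels(audioset_positive_labels):
    """Split the comma-separated AudioSet label IDs."""
    return audioset_positive_labels.split(",")


# Values copied from the first example shown above.
example = {
    "audioset_positive_labels": "/m/0140xf,/m/02cjck,/m/04rlf",
    "aspect_list": "['low quality', 'sustained strings melody', 'soft female vocal', "
                   "'mellow piano melody', 'sad', 'soulful', 'ballad']",
}
aspects = parse_aspects(example["aspect_list"])
labels = parse_labels(example["audioset_positive_labels"])
print(aspects)
print(labels)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for strings that come from a downloaded dataset.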

## Interactive explorer

We can use [Gradio](https://gradio.app/), an open-source library for building ML demos, to build an interface in which the user selects the index of a sample and can then listen to the audio and read the caption. Gradio has a nice `Interface` class with three key components:
* `inputs`: the input components. In this case, we want a slider that represents the index.
* `outputs`: the output components. In this case, we want an audio player and a textarea.
* A function that receives values matching the `inputs` types and returns values matching the `outputs` types.

Let's see it in action!

In [8]:
import gradio as gr

def get_example(idx):
    ex = ds[idx]
    return ex['audio']['path'], ex['caption']

gr.Interface(
    get_example,
    inputs=gr.Slider(0, len(ds) - 1, value=0, step=1),
    outputs=['audio', 'textarea'],
    allow_flagging="never",
    live=True
).launch(share=True)

Running on local URL:  http://127.0.0.1:7860
Running on public URL: https://1b794bcb2a174b321d.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
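
The callback passed to `gr.Interface` is an ordinary Python function, so it can be exercised without launching the UI. The sketch below is an illustrative variant rather than the notebook's code: `make_get_example` is a hypothetical factory, `fake_ds` is made-up stand-in data with the same shape as the real rows, and the `int()` cast is a defensive assumption in case the slider value arrives as a float.

```python
def make_get_example(dataset):
    """Build a callback that maps a slider value to (audio path, caption)."""
    def get_example(idx):
        idx = int(idx)  # defensive cast in case the slider value arrives as a float
        if not 0 <= idx < len(dataset):
            raise IndexError(f"index {idx} is outside 0..{len(dataset) - 1}")
        ex = dataset[idx]
        return ex["audio"]["path"], ex["caption"]
    return get_example


# Stand-in rows with the same shape as the dataset used above (hypothetical data).
fake_ds = [
    {"audio": {"path": "music_data/a.wav"}, "caption": "first caption"},
    {"audio": {"path": "music_data/b.wav"}, "caption": "second caption"},
]
get_example = make_get_example(fake_ds)
print(get_example(1.0))
```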




That's it! I hope you find this notebook useful! 

Hugs!🤗