<a href="https://colab.research.google.com/github/Huang-Yongzhi/musiclm-pytorch/blob/main/musiclm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install musiclm-pytorch


# Usage
`MuLaN` first needs to be trained.

In [2]:
import torch
from musiclm_pytorch import MuLaN, AudioSpectrogramTransformer, TextTransformer

audio_transformer = AudioSpectrogramTransformer(
    dim = 512,
    depth = 6,
    heads = 8,
    dim_head = 64,
    spec_n_fft = 128,
    spec_win_length = 24,
    spec_aug_stretch_factor = 0.8
)

text_transformer = TextTransformer(
    dim = 512,
    depth = 6,
    heads = 8,
    dim_head = 64
)

mulan = MuLaN(
    audio_transformer = audio_transformer,
    text_transformer = text_transformer
)

# get a ton of <sound, text> pairs and train

wavs = torch.randn(2, 1024)
texts = torch.randint(0, 20000, (2, 256))

loss = mulan(wavs, texts)
loss.backward()

# after much training, you can embed sounds and text into a joint embedding space
# for conditioning the audio LM

embeds = mulan.get_audio_latents(wavs)  # during training

embeds = mulan.get_text_latents(texts)  # during inference



spectrogram yielded shape of (65, 86), but had to be cropped to (64, 80) to be patchified for transformer
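
The message above says the spectrogram was cropped before it could be split into patches. The 65 frequency bins follow from `spec_n_fft = 128` (a one-sided STFT yields `n_fft // 2 + 1` bins). The sketch below reproduces the cropping arithmetic under the assumption of a square patch size of 16. The patch size is not shown in this notebook, and `crop_to_patch_multiple` is a hypothetical helper, not part of `musiclm-pytorch`.

```python
def crop_to_patch_multiple(height, width, patch_size=16):
    """Round each spectrogram dimension down to the nearest multiple of patch_size.

    patch_size=16 is an assumption for illustration; it is not stated in this notebook.
    """
    cropped_height = (height // patch_size) * patch_size
    cropped_width = (width // patch_size) * patch_size
    return cropped_height, cropped_width


n_fft = 128
freq_bins = n_fft // 2 + 1                     # 65 frequency bins for n_fft = 128
print(crop_to_patch_multiple(freq_bins, 86))   # (64, 80)
```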


To obtain the conditioning embeddings for the three transformers that are part of AudioLM, use the `MuLaNEmbedQuantizer` as follows:

In [None]:
from musiclm_pytorch import MuLaNEmbedQuantizer

# setup the quantizer with the namespaced conditioning embeddings, unique per quantizer as well as namespace (per transformer)

quantizer = MuLaNEmbedQuantizer(
    mulan = mulan,                          # pass in trained mulan from above
    conditioning_dims = (1024, 1024, 1024), # say all three transformers have model dimensions of 1024
    namespaces = ('semantic', 'coarse', 'fine')
)

# now say you want the conditioning embeddings for semantic transformer

wavs = torch.randn(2, 1024)
conds = quantizer(wavs = wavs, namespace = 'semantic') # (2, 8, 1024) - 8 is number of quantizers

In [4]:
!pip install audiolm_pytorch

# Loading the dataset

1. Dataset contents

The `.csv` file used here looks like this:
```
ytid,start_s,end_s,audioset_positive_labels,aspect_list,caption,author_id,is_balanced_subset,is_audioset_eval
-0Gj8-vB1q4,30,40,"/m/0140xf,/m/02cjck,/m/04rlf","['low quality', 'sustained strings melody', 'soft female vocal', 'mellow piano melody', 'sad', 'soulful', 'ballad']","The low quality recording features a ballad song that contains sustained strings, mellow piano melody and soft female vocal singing over it. It sounds sad and soulful, like something you would hear at Sunday services.",4,False,True
...
```

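**Explanation**:
The dataset is a metadata file in CSV format that describes audio clips. Each row contains the YouTube video identifier (`ytid`), the start and end times of the clip in seconds (`start_s` and `end_s`), the AudioSet labels (`audioset_positive_labels`), and further fields such as the aspect list and the caption.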
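
As an illustration of the row layout described above, here is a minimal sketch that parses the sample row with the standard library. The field names come from the CSV header; `parse_row` is a hypothetical helper, and the choice to split the labels on commas and to evaluate `aspect_list` as a Python literal is an assumption based on how the sample row is written.

```python
import ast
import csv
import io

# Header and first row copied from the sample above
SAMPLE = '''ytid,start_s,end_s,audioset_positive_labels,aspect_list,caption,author_id,is_balanced_subset,is_audioset_eval
-0Gj8-vB1q4,30,40,"/m/0140xf,/m/02cjck,/m/04rlf","['low quality', 'sustained strings melody', 'soft female vocal', 'mellow piano melody', 'sad', 'soulful', 'ballad']","The low quality recording features a ballad song that contains sustained strings, mellow piano melody and soft female vocal singing over it. It sounds sad and soulful, like something you would hear at Sunday services.",4,False,True
'''


def parse_row(row):
    """Convert one CSV row (a dict of strings) into typed fields."""
    return {
        "ytid": row["ytid"],
        "start_s": int(row["start_s"]),
        "end_s": int(row["end_s"]),
        "labels": row["audioset_positive_labels"].split(","),
        "aspects": ast.literal_eval(row["aspect_list"]),  # stored as a Python-style list literal
        "caption": row["caption"],
    }


rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
clip = rows[0]
print(clip["ytid"], clip["end_s"] - clip["start_s"], clip["aspects"][:2])
```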

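Use a tool such as youtube-dl to download the video, then use an audio processing library (for example librosa or pydub) to trim the audio. The steps are roughly as follows:

# 1. Download YouTube audio with youtube-dl
First, install youtube-dl. In Colab, you can install it with the following command: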

In [5]:
# !pip install youtube-dl


In [6]:
# !pip install --upgrade youtube-dl


In [7]:
# import librosa
# import soundfile as sf
# import os
# import pandas as pd
# from youtube_dl import YoutubeDL

# def trim_audio(file_path, start_time, end_time, output_path):
#     y, sr = librosa.load(file_path, sr=None, offset=start_time, duration=end_time - start_time)
#     sf.write(output_path, y, sr)

# def download_youtube_audio(ytid, output_dir):
#     ydl_opts = {
#         'format': 'bestaudio/best',
#         'postprocessors': [{
#             'key': 'FFmpegExtractAudio',
#             'preferredcodec': 'wav',
#             'preferredquality': '192',

#         }],
#         'verbose': True,
#         'outtmpl': os.path.join(output_dir, '%(id)s.%(ext)s')
#     }

#     try:
#         with YoutubeDL(ydl_opts) as ydl:
#             ydl.download([f'http://www.youtube.com/watch?v={ytid}'])
#     except Exception as e:
#         print(f"Error downloading video {ytid}: {e}")
#         return False  # Indicate failure
#     return True  # Indicate success


# # Load the CSV file
# csv_file = 'musiccaps-public.csv'
# df = pd.read_csv(csv_file)

# # Iterate over the CSV file, downloading and trimming each audio clip
# for index, row in df.iterrows():
#     ytid = row['ytid']
#     start_s = row['start_s']
#     end_s = row['end_s']
#     if download_youtube_audio(ytid, 'downloaded_audios'):
#         try:
#             trim_audio(f'downloaded_audios/{ytid}.wav', start_s, end_s, f'trimmed_audios/{ytid}.wav')
#         except Exception as e:
#             print(f"Error trimming audio for video {ytid}: {e}")

## youtube-dl raises an error
Switch to you-get instead.

In [8]:
!pip install you-get

Collecting you-get
  Downloading you_get-0.4.1650-py3-none-any.whl (231 kB)
Installing collected packages: you-get
Successfully installed you-get-0.4.1650


Test you-get:

In [9]:
# !you-get -i 'https://www.youtube.com/watch?v=jNQXAC9IVRw'

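Check the available formats: run `you-get` with the `-i` option (info mode) to list every format available for the video. This shows whether a particular audio format can be downloaded. The command is: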

In [10]:
!you-get -i "https://www.youtube.com/watch?v=-0Gj8-vB1q4"


site:                YouTube
title:               lds music - perfect love
streams:             # Available quality and codecs
    [ DASH ] ____________________________________
    - itag:          244
      container:     webm
      quality:       640x480 (480p)
      size:          7.6 MiB (7975173 bytes)
    # download-with: you-get --itag=244 [URL]

    - itag:          397
      container:     mp4
      quality:       640x480 (480p)
      size:          6.4 MiB (6670720 bytes)
    # download-with: you-get --itag=397 [URL]

    - itag:          243
      container:     webm
      quality:       480x360 (360p)
      size:          6.1 MiB (6354867 bytes)
    # download-with: you-get --itag=243 [URL]

    - itag:          396
      container:     mp4
      quality:       480x360 (360p)
      size:          5.5 MiB (5723034 bytes)
    # download-with: you-get --itag=396 [URL]

    - itag:          135
      contai
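
The stream listing above can be turned into structured data when choosing an `itag` programmatically. The following is a minimal sketch that assumes the text layout shown in the output (one `- itag:` line per stream followed by indented fields). `parse_streams` is a hypothetical helper, not part of you-get, and it tolerates terminal color codes with or without the escape byte.

```python
import re

# Terminal color codes such as ESC[7m; the ESC byte may already have been stripped
ANSI_RE = re.compile(r"\x1b?\[[0-9;]+m")
FIELD_RE = re.compile(r"^\s*(?:-\s*)?(itag|container|quality|size):\s*(.+?)\s*$")


def parse_streams(listing):
    """Return one dict per stream found in the text printed by `you-get -i`."""
    streams = []
    for raw_line in listing.splitlines():
        line = ANSI_RE.sub("", raw_line)
        match = FIELD_RE.match(line)
        if not match:
            continue  # headers, separators and "# download-with" hints
        key, value = match.groups()
        if key == "itag":
            streams.append({"itag": int(value)})  # a new stream entry starts here
        elif streams:
            streams[-1][key] = value
    return streams


listing = """\
    - itag:          244
      container:     webm
      quality:       640x480 (480p)
      size:          7.6 MiB (7975173 bytes)
    # download-with: you-get --itag=244 [URL]
"""
print(parse_streams(listing))
```

With the parsed list, picking a stream becomes an ordinary filter, for instance selecting the entries whose `container` is `mp4`.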

In [None]:
!you-get --no-caption -o "./downloaded_videos" --itag=160 "https://www.youtube.com/watch?v=-0Gj8-vB1q4"


site:                YouTube
title:               lds music - perfect love
stream:
    - itag:          160
      container:     mp4
      quality:       192x144 (144p)
      size:          4.0 MiB (4186336 bytes)
    # download-with: you-get --itag=160 [URL]

Downloading lds music - perfect love.mp4 ...
52.3% (  2.1/  4.0MB) ├█████████████████████───────────────────┤[2/2]   64 kB/s

Install **ffmpeg** (or a similar tool) to extract the audio from the downloaded video files.

In [1]:
!sudo apt-get install ffmpeg


Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
ffmpeg is already the newest version (7:4.4.2-0ubuntu0.22.04.1).
0 upgraded, 0 newly installed, 0 to remove and 46 not upgraded.
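
With ffmpeg available, the extraction step can also cut out just the captioned clip, using the `start_s` and `end_s` columns of the CSV. The sketch below only builds the command list; it is a hedged illustration that uses ffmpeg for trimming rather than librosa or pydub. `build_clip_command` is a hypothetical helper, and the AAC codec choice is an assumption.

```python
def build_clip_command(video_path, audio_path, start_s, end_s):
    """Build an ffmpeg command that extracts the audio between start_s and end_s."""
    if end_s <= start_s:
        raise ValueError("end_s must be greater than start_s")
    return [
        "ffmpeg",
        "-y",                          # overwrite the output file without prompting
        "-ss", str(start_s),           # seek to the clip start in the input
        "-t", str(end_s - start_s),    # read only the clip duration
        "-i", video_path,
        "-vn",                         # drop the video stream
        "-acodec", "aac",
        audio_path,
    ]


cmd = build_clip_command("downloaded_videos/example.mp4", "downloaded_audios/example.m4a", 30, 40)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once ffmpeg is on the `PATH`.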


In [5]:
import subprocess
import librosa
import soundfile as sf
import os
import pandas as pd
import glob  # pattern matching on file path names
from datetime import datetime

# def trim_audio(file_path, start_time, end_time, output_path):
#     y, sr = librosa.load(file_path, sr=None, offset=start_time, duration=end_time - start_time)
#     sf.write(output_path, y, sr)




# def download_lowest_resolution_video(ytid, video_output_dir):
#     video_url = f'https://www.youtube.com/watch?v={ytid}'
#     try:
#         # Use you-get to download the lowest-resolution video
#         subprocess.run(['you-get', '--no-caption', '-o', video_output_dir, '--itag=160', video_url], check=True)
#     except subprocess.CalledProcessError as e:
#         print(f"Error downloading video {ytid}: {e}")
#         return False  # Indicate failure
#     return True  # Indicate success


def get_latest_file_in_dir(directory):
    """Return the most recently modified file in the given directory."""
    list_of_files = glob.glob(os.path.join(directory, '*'))
    if not list_of_files:  # the directory is empty
        return None
    latest_file = max(list_of_files, key=os.path.getmtime)
    return latest_file



def download_lowest_resolution_video(ytid, video_output_dir):
    video_url = f'https://www.youtube.com/watch?v={ytid}'
    try:
        # Use you-get to download the lowest-resolution video (itag 160, 144p mp4)
        subprocess.run(['you-get', '--no-caption', '-o', video_output_dir, '--itag=160', video_url], check=True)
        print(f"Downloaded video {ytid}.")
    except subprocess.CalledProcessError as e:
        print(f"Error downloading video {ytid}: {e}")
        return None
    # Locate the downloaded video file (the newest file in the output directory)
    return get_latest_file_in_dir(video_output_dir)



def extract_audio_from_video(video_path, output_audio_path):
    try:
        # -vn drops the video stream, -acodec aac encodes the audio as AAC,
        # -y overwrites an existing output file instead of prompting
        subprocess.run(['ffmpeg', '-y', '-i', video_path, '-vn', '-acodec', 'aac', output_audio_path], check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        file_name = os.path.basename(video_path)
        print(f"Extracted audio from {file_name}.")
    except subprocess.CalledProcessError as e:
        print(f"Error extracting audio from video {video_path}: {e}\nOutput: {e.stdout.decode()}\nError: {e.stderr.decode()}")





# Load the CSV file
csv_file = '/kaggle/input/musiccaps/musiccaps-public.csv'
df = pd.read_csv(csv_file)

video_output_dir = './downloaded_videos'
audio_output_dir = './downloaded_audios'

# Make sure the output directories exist
os.makedirs(video_output_dir, exist_ok=True)
os.makedirs(audio_output_dir, exist_ok=True)

test_n = 0
# Iterate over the CSV file, downloading each video and extracting its audio
for index, row in df.iterrows():
    if test_n >= 3:  # only process the first 3 rows as a test
        break
    ytid = row['ytid']
    downloaded_video = download_lowest_resolution_video(ytid, video_output_dir)  # download the video
    if downloaded_video:
        # video_path = os.path.join(video_output_dir, f'{ytid}.mp4')  # assumes a .mp4 extension; you-get may not name the file this way
        audio_path = os.path.join(audio_output_dir, f'{ytid}.m4a')  # write the audio as .m4a
        extract_audio_from_video(downloaded_video, audio_path)
    test_n += 1

FileNotFoundError: [Errno 2] No such file or directory: 'you-get'
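
The `FileNotFoundError` above is raised by `subprocess.run` when the `you-get` executable cannot be found on the `PATH`, for example when the package was installed in a different session or environment. Note that the `except subprocess.CalledProcessError` clause does not catch it. A minimal sketch of a pre-flight check follows; `missing_tools` is a hypothetical helper.

```python
import shutil


def missing_tools(tools=("you-get", "ffmpeg")):
    """Return the names of command-line tools that are not available on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print(f"Install these tools before running the download loop: {', '.join(missing)}")
else:
    print("All required tools are available.")
```

Running this check in the same kernel as the download loop, after `!pip install you-get`, avoids starting a long loop that fails on the first row.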

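Downloading the model files

`hubert_base_ls960.pt` is a pretrained weights file for the **HuBERT (Hidden-Unit BERT) model**. HuBERT is a **self-supervised speech representation model** developed by Facebook AI Research. It applies BERT-style masked prediction to audio and is designed for speech processing tasks.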

In [None]:
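import requests

def download_file(url, filename, chunk_size=1 << 20):
    """Stream `url` to `filename` so large checkpoints are never held fully in memory."""
    with requests.get(url, stream=True, timeout=60) as response:
        response.raise_for_status()  # fail early if the request was not successful
        with open(filename, 'wb') as f:
            for chunk in response.iter_content(chunk_size=chunk_size):
                f.write(chunk)

# URL of the HuBERT checkpoint and the local file name to save it under
file_url = "https://dl.fbaipublicfiles.com/hubert/hubert_base_ls960.pt"
file_name = "hubert_base_ls960.pt"

# Download the file
download_file(file_url, file_name)

# URL of the matching k-means file and the local file name to save it under
file_url = "https://dl.fbaipublicfiles.com/hubert/hubert_base_ls960_L9_km500.bin"
file_name = "hubert_base_ls960_L9_km500.bin"

# Download the file
download_file(file_url, file_name)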



`SemanticTransformerTrainer` is the training class from `audiolm-pytorch` for the semantic transformer.

In [None]:
# Not the audio link we want
# import requests

# url = "https://github.com/hsfzxjy/models.storage/releases/download/HRNet-OCR/hrnet_cs_8090_torch11.pth"
# response = requests.get(url)
# response.raise_for_status()

# file_name = url.split('/')[-1]

# with open(file_name, 'wb') as f:
#   f.write(response.content)






To train (or finetune) the three transformers that are part of `AudioLM`, follow the training instructions at `audiolm-pytorch`, but pass the `MuLaNEmbedQuantizer` instance to the training classes under the keyword `audio_conditioner`.

For example, with `SemanticTransformerTrainer`:

In [None]:
import torch
from audiolm_pytorch import HubertWithKmeans, SemanticTransformer, SemanticTransformerTrainer

wav2vec = HubertWithKmeans(
    checkpoint_path = 'hubert_base_ls960.pt',
    kmeans_path = 'hubert_base_ls960_L9_km500.bin'
)


semantic_transformer = SemanticTransformer(
    num_semantic_tokens = wav2vec.codebook_size,
    dim = 1024,
    depth = 6,
    audio_text_condition = True      # this must be set to True (same for CoarseTransformer and FineTransformer)
).cuda()

trainer = SemanticTransformerTrainer(
    transformer = semantic_transformer,
    wav2vec = wav2vec,
    audio_conditioner = quantizer,   # pass in the MuLaNEmbedQuantizer instance above
    folder = '/content/downloaded_audios',
    batch_size = 1,
    data_max_length = 320 * 32,
    num_train_steps = 1
)

trainer.train()

In [None]:
# you need the trained AudioLM (audio_lm) from above
# with the MuLaNEmbedQuantizer (mulan_embed_quantizer)

from musiclm_pytorch import MusicLM

musiclm = MusicLM(
    audio_lm = audio_lm,                 # `AudioLM` from https://github.com/lucidrains/audiolm-pytorch
    mulan_embed_quantizer = quantizer    # the `MuLaNEmbedQuantizer` from above
)

music = musiclm('the crystalline sounds of the piano in a ballroom', num_samples = 4) # sample 4 and pick the top match with mulan
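
The comment on the last line mentions sampling several candidates and keeping the one that best matches the text prompt. The following is a minimal sketch of that selection idea in plain Python, assuming candidates are ranked by cosine similarity between a text embedding and each audio embedding. It is an illustration only, not the library's implementation; `cosine_similarity` and `pick_top_match` are hypothetical helpers and the vectors are toy values.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pick_top_match(text_embed, audio_embeds):
    """Return the index of the audio embedding closest to the text embedding, plus all scores."""
    scores = [cosine_similarity(text_embed, embed) for embed in audio_embeds]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores


# Toy 2-dimensional embeddings standing in for MuLaN latents
text_embed = [1.0, 0.0]
audio_embeds = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.1]]
best_index, scores = pick_top_match(text_embed, audio_embeds)
print(best_index)  # 2
```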