# **Music detection using DistilHuBERT**

**DistilHuBERT** is a distilled version of HuBERT (Hidden-Unit BERT) that offers a more efficient, lightweight alternative while maintaining high performance. It was released in April 2022 by Heng-Jui Chang, Shu-wen Yang and Hung-yi Lee. You can learn more about DistilHuBERT [here](https://arxiv.org/abs/2110.01900).

In this notebook, we fine-tune DistilHuBERT for a **music detection** task: classifying each audio clip as either Music or Non Music.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

%env LC_ALL=C.UTF-8
%env LANG=C.UTF-8
%env TRANSFORMERS_CACHE=/content/cache
%env HF_DATASETS_CACHE=/content/cache
%env CUDA_LAUNCH_BLOCKING=1

!pip install evaluate
#!pip install git+https://github.com/huggingface/datasets.git
#!pip install git+https://github.com/huggingface/transformers.git
!pip install jiwer
!pip install torchaudio
!pip install librosa
!pip install transformers[torch]

from transformers import AutoFeatureExtractor
from transformers import AutoModelForAudioClassification
from transformers import TrainingArguments
from transformers import Trainer

import evaluate
import numpy as np
import pandas as pd
from pathlib import Path
from tqdm import tqdm
import torchaudio
from sklearn.model_selection import train_test_split
import os
import sys

Mounted at /content/drive
env: LC_ALL=C.UTF-8
env: LANG=C.UTF-8
env: TRANSFORMERS_CACHE=/content/cache
env: HF_DATASETS_CACHE=/content/cache
env: CUDA_LAUNCH_BLOCKING=1
Collecting evaluate
  Downloading evaluate-0.4.2-py3-none-any.whl (84 kB)
Collecting datasets>=2.0.0 (from evaluate)
  Downloading datasets-2.20.0-py3-none-any.whl (547 kB)
Collecting dill (from evaluate)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting xxhash (from evaluate)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)



In [None]:
data = []

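for path in tqdm(Path("/content/drive/MyDrive/data2").glob("**/*.wav")):
  # File name up to the first '.', so titles that contain a dot are cut short
  name = str(path).split('/')[-1].split('.')[0]
  print(name)
  # The parent folder name ("Music" or "Non Music") is the class label
  label = str(path).split('/')[-2]
  print(label)

  data.append({
    "name": name,
    "path": path,
    "audio": label  # note: this column holds the class label, not audio samples
    })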

df = pd.DataFrame(data)
df.head()

print("Labels: ", df["audio"].unique())
print()
df.groupby("audio").count()[["path"]]

247it [00:00, 1551.32it/s]

Urban Ambient Sound
Non Music
V8 Engine Startup and Revving ｜ HQ Sound Effects
Non Music
MOTORCYCLE RIDE SOUND EFFECT ｜ Free Sound Effects ｜ SPEED
Non Music
V8 Engine Startup and Revving ｜ HQ Sound Effects (1)
Non Music
(No Music) Oddly Satisfying Video With Original Sound #11 ｜ Original Relaxing Videos for Deep Sleep
Non Music
Co se děje s tělem, když jíte cukr
Non Music
👔 Ťažký týždeň： Sklamania voliča Smeru ｜ Aktuality
Non Music
20 Wild Animals - Animal Sounds for Kids to Learn
Non Music
EPIC demolition sound effect
Non Music
Biking⧸Cycling ｜ HQ Sound Effect
Non Music
Autá - bager, buldozér, miešačka ｜ Stavba ｜ stavebné stroje pre deti ｜ Hanička a Murko
Non Music
HIMYM moments that live rent free in my head
Non Music
The Uncomfortable Truth of Life
Non Music
10 Minutes People Conversing Crowds Talking
Non Music
[Eng Sub] Japanese Listening Practice ｜ Walk and Talk in Jinbouchou in Tokyo
Non Music
(10 Minutes) Ultimate Lawn Mower Sound for Relaxation, Focus, and Prod

474it [00:00, 1472.00it/s]

Non Music
Memorable lines about love in How I Met Your Mother_9
Non Music
Girl Crying Sound Effects All Sounds_14
Non Music
Memorable lines about love in How I Met Your Mother_20
Non Music
Memorable lines about love in How I Met Your Mother_15
Non Music
Memorable lines about love in How I Met Your Mother_21
Non Music
Memorable lines about love in How I Met Your Mother_18
Non Music
Memorable lines about love in How I Met Your Mother_12
Non Music
Memorable lines about love in How I Met Your Mother_17
Non Music
Memorable lines about love in How I Met Your Mother_13
Non Music
Memorable lines about love in How I Met Your Mother_16
Non Music
Girl Crying Sound Effects All Sounds_21
Non Music
Memorable lines about love in How I Met Your Mother_14
Non Music
Girl Crying Sound Effects All Sounds_20
Non Music
Memorable lines about love in How I Met Your Mother_19
Non Music
Girl Crying Sound Effects All Sounds_19
Non Music
City Car Driving - Tesla Model Y [Steering wheel gameplay]_1
Non Music
City 

810it [00:00, 1483.81it/s]

60#105# What The French - Excuse My French [Drum & Bass]_9
Music
60#105# What The French - Excuse My French [Drum & Bass]_3
Music
80#125# Birdy - Wings (Nu：Logic Remix)_2
Music
80#125# Birdy - Wings (Nu：Logic Remix)_3
Music
80#125# Birdy - Wings (Nu：Logic Remix)_1
Music
80#125# Birdy - Wings (Nu：Logic Remix)_4
Music
80#125# Birdy - Wings (Nu：Logic Remix)_6
Music
80#125# Birdy - Wings (Nu：Logic Remix)_5
Music
55#100# RUFFNECK (FULL FLEX) - SKRILLEX_2
Music
55#100# RUFFNECK (FULL FLEX) - SKRILLEX_1
Music
55#100# RUFFNECK (FULL FLEX) - SKRILLEX_6
Music
55#100# RUFFNECK (FULL FLEX) - SKRILLEX_3
Music
55#100# RUFFNECK (FULL FLEX) - SKRILLEX_4
Music
55#100# RUFFNECK (FULL FLEX) - SKRILLEX_7
Music
55#100# RUFFNECK (FULL FLEX) - SKRILLEX_5
Music
70#115# Lindsey Stirling - Crystallize (Dubstep Violin Original Song)_7
Music
70#115# Virtual Riot - Never Let Me Go_1
Music
70#115# Virtual Riot - Never Let Me Go_4
Music
70#115# Lindsey Stirling - Crystallize (Dubstep Violin Original Song)_9
Music
70

1068it [00:00, 1533.09it/s]


60#105#Fetty Wap  - Trap Queen (Official Video) Prod
Music
60#105#Fetty Wap  - Trap Queen (Official Video) Prod
Music
21 Savage - redrum (Official Music Video)_1
Music
60#105#Fetty Wap  - Trap Queen (Official Video) Prod
Music
60#105#Fetty Wap  - Trap Queen (Official Video) Prod
Music
60#105#Fetty Wap  - Trap Queen (Official Video) Prod
Music
60#105#Fetty Wap  - Trap Queen (Official Video) Prod
Music
60#105#Fetty Wap  - Trap Queen (Official Video) Prod
Music
60#105#Fetty Wap  - Trap Queen (Official Video) Prod
Music
60#105#NF - The Search_1
Music
30#75#Collie Buddz - Love & Reggae (Official Music Video)_4
Music
30#75#Collie Buddz - Love & Reggae (Official Music Video)_7
Music
30#75#Collie Buddz - Love & Reggae (Official Music Video)_1
Music
30#75#Collie Buddz - Love & Reggae (Official Music Video)_5
Music
30#75#Collie Buddz - Love & Reggae (Official Music Video)_6
Music
30#75#Collie Buddz - Love & Reggae (Official Music Video)_2
Music
30#75#Collie Buddz - Love & Reggae (Official Music




audio,path
Music,595
Non Music,473
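
The counts above show a mildly imbalanced dataset. As a reference point for the accuracy figures later in the notebook, here is a minimal sketch in plain Python, with the two counts copied from the table. It computes the class shares and the accuracy that a trivial "always predict the most frequent class" classifier would reach. The variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the groupby output above.
counts = {"Music": 595, "Non Music": 473}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a classifier that always predicts the most frequent class.
majority_label = max(counts, key=counts.get)
majority_baseline = counts[majority_label] / total

print(f"total clips: {total}")
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"majority baseline ({majority_label}): {majority_baseline:.3f}")
```

Any fine-tuned model should clearly beat this baseline for the accuracy metric to be meaningful.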


In [None]:
save_path = "/content/drive/MyDrive/data2"

train_df, test_df = train_test_split(df, test_size=0.2, random_state=101, stratify=df["audio"])

train_df = train_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)

train_df.to_csv(f"{save_path}/train.csv", sep="\t", encoding="utf-8", index=False)
test_df.to_csv(f"{save_path}/test.csv", sep="\t", encoding="utf-8", index=False)


print(train_df.shape)
print(test_df.shape)

data_files = {
    "train": r"/content/drive/MyDrive/data2/train.csv",
    "validation": r"/content/drive/MyDrive/data2/test.csv",
}

from datasets import load_dataset  # load_metric is not used here; metrics come from the evaluate library

dataset = load_dataset("csv", data_files=data_files, delimiter="\t", )
train_dataset = dataset["train"]
eval_dataset = dataset["validation"]

print(train_dataset)
print(eval_dataset)

input_column = "path"
output_column = "audio"

label_list = train_dataset.unique(output_column)
label_list.sort(reverse=True)  # Fixed order for determinism: 'Non Music' -> 0, 'Music' -> 1
num_labels = len(label_list)
print(f"A classification problem with {num_labels} classes: {label_list}")

(854, 3)
(214, 3)


Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Dataset({
    features: ['name', 'path', 'audio'],
    num_rows: 854
})
Dataset({
    features: ['name', 'path', 'audio'],
    num_rows: 214
})
A classification problem with 2 classes: ['Non Music', 'Music']
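
Many file names in the listing above end in `_1`, `_2`, and so on. This suggests that they are segments cut from the same recording, though that is an assumption and the notebook does not state it. With a purely random split, segments of one recording can land in both the train and validation sets, which may make validation accuracy look better than it would on unseen recordings. The sketch below shows one way to check for this. `source_key` and `shared_sources` are hypothetical helpers, and the suffix pattern is a guess based on the printed names.

```python
import re

def source_key(name):
    """Strip a trailing '_<number>' segment suffix, if present."""
    return re.sub(r"_\d+$", "", name)

def shared_sources(train_names, eval_names):
    """Return the source recordings that occur in both splits."""
    train_sources = {source_key(n) for n in train_names}
    eval_sources = {source_key(n) for n in eval_names}
    return train_sources & eval_sources

# Toy example using names from the listing above.
train_names = [
    "Girl Crying Sound Effects All Sounds_14",
    "80#125# Birdy - Wings (Nu：Logic Remix)_2",
    "Urban Ambient Sound",
]
eval_names = [
    "Girl Crying Sound Effects All Sounds_21",
    "EPIC demolition sound effect",
]

overlap = shared_sources(train_names, eval_names)
print(overlap)
```

In this notebook the check could be run as `shared_sources(train_df["name"], test_df["name"])`. If the overlap turns out to be large, a group-aware split such as scikit-learn's `GroupShuffleSplit`, keyed on `source_key`, is one option to consider.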


In [None]:
model = "ntu-spml/distilhubert"
feature_extractor = AutoFeatureExtractor.from_pretrained(model, do_normalize = True, return_attention_mask = True)
sampling_rate = feature_extractor.sampling_rate
print(f'distilhubert Sampling Rate: {sampling_rate} Hz')

distilhubert Sampling Rate: 16000 Hz


In [None]:
import librosa
import torch

first_sample = train_dataset[0]
print(first_sample)
#inputs = feature_extractor(train_dataset)

def speech_file_to_array_fn(path, sampling_rate=16000):
    audio, sr = librosa.load(path, sr=sampling_rate)
    return audio

audio = speech_file_to_array_fn(first_sample['path'])
inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt", padding=True)

# Sanity check: with do_normalize=True the extracted values should have mean ~0 and variance ~1
mean = torch.mean(inputs['input_values'])
variance = torch.var(inputs['input_values'])

print(f"Mean: {mean.item()}")
print(f"Variance: {variance.item()}")


{'name': 'AC_DC - Back In Black (Official 4K Video)_1', 'path': '/content/drive/MyDrive/data2/Music/AC_DC - Back In Black (Official 4K Video)_1.wav', 'audio': 'Music'}
Mean: -8.416176200398695e-09
Variance: 0.9999973773956299
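
The near-zero mean and near-unit variance come from the `do_normalize=True` option. To make the operation concrete, here is a minimal sketch of zero-mean, unit-variance normalization in plain Python. The function name, the toy waveform, and the epsilon value are assumptions for illustration. The feature extractor's own implementation works on NumPy arrays and may differ in detail.

```python
import math

def zero_mean_unit_var(samples, eps=1e-7):
    """Shift to zero mean and scale to unit variance (eps avoids division by zero)."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    scale = math.sqrt(var + eps)
    return [(s - mean) / scale for s in samples]

waveform = [0.1, -0.2, 0.4, 0.0]  # tiny made-up waveform
normalized = zero_mean_unit_var(waveform)

new_mean = sum(normalized) / len(normalized)
new_var = sum((s - new_mean) ** 2 for s in normalized) / len(normalized)
print(f"mean={new_mean:.6f}, variance={new_var:.6f}")
```

Because of the small epsilon in the denominator, the resulting variance can land very slightly below 1 rather than exactly on it.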


In [None]:
max_duration = 30.0 # 30 seconds

def label_to_id(label, label_list):

    if len(label_list) > 0:
        return label_list.index(label) if label in label_list else -1

    return label

def preprocess_function(examples):
    audio_list = [speech_file_to_array_fn(path) for path in examples[input_column]]
    target_list = [label_to_id(label, label_list) for label in examples[output_column]]

    # Preprocessing audio inputs
    inputs = feature_extractor(audio_list,
                              sampling_rate = feature_extractor.sampling_rate,
                              max_length = int(feature_extractor.sampling_rate * max_duration),
                              truncation = True,
                              return_attention_mask = True)
    inputs["labels"] = list(target_list)

    return inputs
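
To make the `max_length` argument concrete: at 16 kHz, a 30-second cap corresponds to 480,000 samples. The sketch below reproduces that arithmetic and mimics the effect of `truncation=True` on dummy clips. The `truncate` helper and the clip lengths are illustrative, not part of the notebook.

```python
sampling_rate = 16_000   # Hz, as reported by the feature extractor above
max_duration = 30.0      # seconds, as set in the cell above

max_length = int(sampling_rate * max_duration)

def truncate(samples, max_length):
    """Keep at most max_length samples, mirroring truncation=True."""
    return samples[:max_length]

clip_45s = [0.0] * (45 * sampling_rate)  # dummy 45-second clip
clip_10s = [0.0] * (10 * sampling_rate)  # dummy 10-second clip

print(max_length)
print(len(truncate(clip_45s, max_length)))
print(len(truncate(clip_10s, max_length)))
```

Since `preprocess_function` does not request padding, shorter clips keep their original length at this stage. Padding to a common length within each batch is presumably left to the data collator used during training.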

In [None]:
train_dataset = train_dataset.map(
    preprocess_function,
    batch_size=100,
    batched=True,
    num_proc=4
)
eval_dataset = eval_dataset.map(
    preprocess_function,
    batch_size=100,
    batched=True,
    num_proc=4
)



Map (num_proc=4):   0%|          | 0/854 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/214 [00:00<?, ? examples/s]

In [None]:
hubert_model = AutoModelForAudioClassification.from_pretrained(
    model,
    num_labels=num_labels,
    label2id={label: i for i, label in enumerate(label_list)},
    id2label={i: label for i, label in enumerate(label_list)},
    ignore_mismatched_sizes=True,
  )

Some weights of HubertForSequenceClassification were not initialized from the model checkpoint at ntu-spml/distilhubert and are newly initialized: ['classifier.bias', 'classifier.weight', 'encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'projector.bias', 'projector.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
idx = 0
#print(f"Training input_values: {train_dataset[idx]['input_values']}")
print(f"Training labels: {train_dataset[idx]['labels']} - {train_dataset[idx]['audio']}")

Training labels: 1 - Music
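
The label `1` for a Music clip follows from the reverse-sorted `label_list`. Here is a small sketch of how the label-to-id mapping comes out, using the two class names from the dataset. The variable `labels_seen` is illustrative.

```python
labels_seen = ["Music", "Non Music"]  # the two classes found in the dataset

# Same ordering as label_list.sort(reverse=True) in the notebook.
label_list = sorted(labels_seen, reverse=True)

# Same comprehensions as those passed to from_pretrained above.
label2id = {label: i for i, label in enumerate(label_list)}
id2label = {i: label for i, label in enumerate(label_list)}

print(label_list)
print(label2id)
print(id2label)
```

Because `"Non Music"` sorts after `"Music"` alphabetically, the reverse sort places it first, so Non Music maps to 0 and Music maps to 1.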


In [None]:
model_output_dir = "/content/drive/MyDrive/hubert detection/"

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir = model_output_dir,
    evaluation_strategy = 'steps',
    save_strategy = 'steps',
    load_best_model_at_end = True,
    metric_for_best_model = 'accuracy',
    save_steps = 500,
    eval_steps = 500,
    logging_steps = 500,
    learning_rate = 5e-5,
    seed = 42,
    per_device_train_batch_size = 4,
    per_device_eval_batch_size = 4,
    gradient_accumulation_steps = 1,
    num_train_epochs = 17,
    warmup_ratio = 0.1,
    fp16 = True,
    save_total_limit = 2,
    report_to = 'none',
    adam_epsilon = 1e-08,
    adam_beta1 = 0.9,
    adam_beta2 = 0.999,
)
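
As a rough check on what these settings imply, the sketch below works out the number of optimizer steps, warmup steps, and evaluations. It assumes a single training device and that the warmup step count is rounded up. Both are assumptions, not something stated in the notebook.

```python
import math

num_train_examples = 854  # size of the training split
batch_size = 4            # per_device_train_batch_size
grad_accum = 1            # gradient_accumulation_steps
num_epochs = 17           # num_train_epochs
warmup_ratio = 0.1
eval_every = 500          # eval_steps

# One optimizer step per (batch_size * grad_accum) examples; the last batch may be partial.
steps_per_epoch = math.ceil(num_train_examples / (batch_size * grad_accum))
total_steps = steps_per_epoch * num_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)
num_evaluations = total_steps // eval_every

print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps:     {total_steps}")
print(f"warmup steps:    {warmup_steps}")
print(f"evaluations:     {num_evaluations}")
```

The total of 3638 steps agrees with the `global_step` reported by `trainer.train()` later in the notebook, and seven evaluations match the seven rows of the training log.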



In [None]:
metric = evaluate.load('accuracy')
# Creating function to compute accuracy
def compute_metrics(eval_pred):
    predictions = np.argmax(eval_pred.predictions, axis = 1)
    return metric.compute(predictions = predictions, references = eval_pred.label_ids)
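
For readers who want to see what `compute_metrics` calculates without the `evaluate` and NumPy dependencies, here is a plain-Python sketch of the same idea: take the index of the largest logit per example and compare it to the reference label. The function names and the toy logits are made up for illustration.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_from_logits(logits, references):
    """Fraction of examples whose highest-scoring class equals the reference."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == r for p, r in zip(predictions, references))
    return {"accuracy": correct / len(references)}

# Toy logits for four clips; columns are [Non Music, Music].
logits = [[2.1, -0.3], [-1.0, 0.7], [0.2, 0.1], [-0.5, 1.5]]
references = [0, 1, 1, 1]

result = accuracy_from_logits(logits, references)
print(result)
```

In this toy case the third clip is misclassified, so three of four predictions are correct.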

In [None]:
trainer = Trainer(
    model=hubert_model,
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = eval_dataset,
    tokenizer = feature_extractor,
    compute_metrics = compute_metrics)

In [None]:
trainer.train()

Step,Training Loss,Validation Loss,Accuracy
500,0.3506,0.104931,0.96729
1000,0.0627,0.052843,0.990654
1500,0.018,0.028386,0.990654
2000,0.0064,0.112834,0.985981
2500,0.0001,0.113428,0.985981
3000,0.0001,0.107309,0.985981
3500,0.0,0.109167,0.985981
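
To read the log above in more concrete terms, the sketch below converts each accuracy into a number of misclassified validation clips (out of 214) and finds the first step with the highest accuracy. It assumes that ties are resolved in favor of the earlier checkpoint, which is an assumption about how the best model is selected.

```python
# (step, validation accuracy) pairs copied from the training log above.
eval_log = [
    (500, 0.96729),
    (1000, 0.990654),
    (1500, 0.990654),
    (2000, 0.985981),
    (2500, 0.985981),
    (3000, 0.985981),
    (3500, 0.985981),
]
num_eval_examples = 214  # size of the validation split

# First step with the highest accuracy (strict improvement only).
best_step, best_acc = eval_log[0]
for step, acc in eval_log[1:]:
    if acc > best_acc:
        best_step, best_acc = step, acc

# Misclassified clips per evaluation.
errors = {step: num_eval_examples - round(acc * num_eval_examples)
          for step, acc in eval_log}

print(f"best step: {best_step} (accuracy {best_acc})")
print(errors)
```

The differences between checkpoints amount to only a few clips, and validation loss is lowest at step 1500 while accuracy stays flat or drops slightly afterwards. This suggests that most of the later epochs add little on this validation set.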


TrainOutput(global_step=3638, training_loss=0.06017447706596611, metrics={'train_runtime': 9237.3163, 'train_samples_per_second': 1.572, 'train_steps_per_second': 0.394, 'total_flos': 9.9049514465664e+17, 'train_loss': 0.06017447706596611, 'epoch': 17.0})

In [None]:
trainer.save_model(model_output_dir)
feature_extractor.save_pretrained(model_output_dir)

['/content/drive/MyDrive/hubert detection/preprocessor_config.json']
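
Once the model and feature extractor are saved, they can be loaded again for inference. The following is a minimal sketch, assuming the saved directory is readable and that the audio file can be decoded in the runtime. `classify_clip` and `top_label` are hypothetical helpers, and the example scores are made up rather than real model output.

```python
def top_label(predictions):
    """Pick the highest-scoring entry from a list of {'label', 'score'} dicts."""
    return max(predictions, key=lambda p: p["score"])["label"]

def classify_clip(audio_path, model_dir="/content/drive/MyDrive/hubert detection/"):
    """Classify one audio file with the fine-tuned model saved above."""
    # Imported lazily so top_label can be used without transformers installed.
    from transformers import pipeline

    classifier = pipeline("audio-classification", model=model_dir)
    return top_label(classifier(audio_path))

# Example with a hand-written prediction list (scores are made up):
example = [
    {"label": "Music", "score": 0.98},
    {"label": "Non Music", "score": 0.02},
]
print(top_label(example))
```

In the notebook, `classify_clip` would be called with the path of a `.wav` file. Creating the pipeline once and reusing it is preferable when classifying many clips.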