## HF AUDIO CLASSIFIERS

### KEYWORD SPOTTING WITH MINDS-14

The full Colab notebook is available at [this link](https://colab.research.google.com/github/Leofltt/Audio-Analysis-Experiments/blob/master/Deep_Learning/music_genre_classifier/audio_classifier_huggingface.ipynb).

In [None]:
!pip install transformers
!pip install gradio
!pip install datasets
!pip install accelerate -U
!pip install evaluate
!pip install ipython
!pip install huggingface_hub

In [None]:
!nvidia-smi

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [3]:
from datasets import load_dataset

minds14 = load_dataset("PolyAI/minds14", name="en-AU", split="train")

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


In [4]:
from transformers import pipeline

classifier = pipeline("audio-classification", model="anton-l/xtreme_s_xlsr_300m_minds14")

classifier(minds14[0]["audio"])

Some weights of the model checkpoint at anton-l/xtreme_s_xlsr_300m_minds14 were not used when initializing Wav2Vec2ForSequenceClassification: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing Wav2Vec2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForSequenceClassification were not initialized from the model checkpoint at anton-l/xtreme_s_xlsr_300m_minds14 and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos

[{'score': 0.9611988067626953, 'label': 'pay_bill'},
 {'score': 0.029601896181702614, 'label': 'freeze'},
 {'score': 0.003550308756530285, 'label': 'card_issues'},
 {'score': 0.0021323112305253744, 'label': 'abroad'},
 {'score': 0.0008829658618196845, 'label': 'high_value_payment'}]
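
The scores above sum to roughly 1 and arrive sorted, which is what a softmax over the classifier logits followed by a top-5 selection would produce. The sketch below illustrates that post-processing in plain Python. The logits and the `other_intent` label are made up for illustration, and `softmax` / `top_k` are hypothetical helpers, not the pipeline's own code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, labels, k=5):
    """Return the k most probable labels, best first."""
    probs = softmax(logits)
    ranked = sorted(zip(probs, labels), reverse=True)
    return [{"score": p, "label": label} for p, label in ranked[:k]]


# Hypothetical logits, NOT taken from the model above.
example_logits = [4.0, 0.5, -1.0, -1.5, -2.5, -3.0]
example_labels = [
    "pay_bill", "freeze", "card_issues",
    "abroad", "high_value_payment", "other_intent",
]

ranked = top_k(example_logits, example_labels)
for entry in ranked:
    print(entry)
```

Because the softmax is taken over all classes before truncating to five, the five printed scores add up to slightly less than 1.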

### AUDIO CLASSIFICATION WITH SPEECH COMMANDS AND AUDIO SPECTROGRAM TRANSFORMER

In [5]:
speech_commands = load_dataset(
    "speech_commands", "v0.02", split="validation", streaming=True
)
sample = next(iter(speech_commands))

classifier2 = pipeline("audio-classification", model="MIT/ast-finetuned-speech-commands-v2")
classifier2(sample["audio"].copy())

[{'score': 0.9999892711639404, 'label': 'backward'},
 {'score': 1.750490582708153e-06, 'label': 'happy'},
 {'score': 6.703052690681943e-07, 'label': 'follow'},
 {'score': 5.805895852972753e-07, 'label': 'stop'},
 {'score': 5.614546694232558e-07, 'label': 'up'}]

In [2]:
from IPython.display import Audio

# Play back the Speech Commands sample classified above
# (run this cell after the previous one so that `sample` is defined).
Audio(sample["audio"]["array"], rate=sample["audio"]["sampling_rate"])

### LANGUAGE IDENTIFICATION WITH FLEURS

In [7]:
fleurs = load_dataset("google/fleurs", "all", split="validation", streaming=True)
sample = next(iter(fleurs))
classifier = pipeline("audio-classification", model="sanchit-gandhi/whisper-medium-fleurs-lang-id")
classifier(sample["audio"])

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


[{'score': 0.9999328851699829, 'label': 'Afrikaans'},
 {'score': 7.093015938153258e-06, 'label': 'Northern-Sotho'},
 {'score': 4.269149030733388e-06, 'label': 'Icelandic'},
 {'score': 3.266116891609272e-06, 'label': 'Danish'},
 {'score': 3.2580719562247396e-06, 'label': 'Cantonese Chinese'}]

### ZERO SHOT AUDIO CLASSIFICATION WITH CLAP AND ESC50

In [1]:
from datasets import load_dataset

dataset = load_dataset("ashraq/esc50", split="train", streaming=True)
audio = next(iter(dataset))["audio"]
audio_sample = audio["array"]

In [13]:
candidate_labels = ["Sound of a cat", "Sound of an airplane"]

classifier = pipeline("zero-shot-audio-classification", model="laion/clap-htsat-unfused")
# Print the scores explicitly; only the last expression of a cell is displayed.
print(classifier(audio_sample, candidate_labels=candidate_labels))

# Use the sampling rate stored with the sample rather than a hard-coded value.
Audio(audio_sample, rate=audio["sampling_rate"])
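
Zero-shot classification needs no label-specific training because the audio clip and each candidate label are embedded into a shared space and then compared. A minimal sketch of that scoring step, assuming L2-normalised embeddings, a scaled dot product and a softmax over the candidates; the 3-dimensional vectors and the `logit_scale` value are toy numbers, not CLAP outputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_scores(audio_emb, text_embs, logit_scale=1.0):
    """Softmax over scaled cosine similarities between one audio
    embedding and several candidate-label embeddings."""
    audio_emb = l2_normalize(audio_emb)
    logits = []
    for text_emb in text_embs:
        text_emb = l2_normalize(text_emb)
        similarity = sum(a * t for a, t in zip(audio_emb, text_emb))
        logits.append(logit_scale * similarity)
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy embeddings (hypothetical): the audio vector points mostly along "cat".
toy_audio = [0.9, 0.1, 0.0]
toy_texts = {"cat": [1.0, 0.0, 0.0], "airplane": [0.0, 1.0, 0.0]}

scores = dict(
    zip(toy_texts, zero_shot_scores(toy_audio, list(toy_texts.values()), logit_scale=10.0))
)
print(scores)
```

Adding or removing candidate labels only changes the set of text embeddings being compared, which is why the label list can be chosen freely at inference time.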


### GTZAN GENRE CLASSIFICATION

In [1]:
from datasets import load_dataset

gtzan = load_dataset("marsyas/gtzan", "all")
gtzan = gtzan["train"].train_test_split(seed=42, shuffle=True, test_size=0.1)

gtzan

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


DatasetDict({
    train: Dataset({
        features: ['file', 'audio', 'genre'],
        num_rows: 899
    })
    test: Dataset({
        features: ['file', 'audio', 'genre'],
        num_rows: 100
    })
})

In [2]:
id2label_fn = gtzan["train"].features["genre"].int2str
id2label_fn(gtzan["train"][0]["genre"])


'pop'

In [3]:
import gradio as gr

def generate_audio():
    example = gtzan["train"].shuffle()[0]
    audio = example["audio"]
    return (audio["sampling_rate"], audio["array"]), id2label_fn(example["genre"])

with gr.Blocks() as demo:
    with gr.Column():
        for _ in range(4):
            audio, label = generate_audio()
            output = gr.Audio(audio, label=label)

demo.launch(debug=True)



Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.

Keyboard interruption in main thread... closing server.

In [4]:
from transformers import AutoFeatureExtractor
from datasets import Audio  # note: shadows IPython.display.Audio imported earlier

model_id = "ntu-spml/distilhubert"

feature_extractor = AutoFeatureExtractor.from_pretrained(model_id, do_normalize=True, return_attention_mask=True)

# Resample the audio column to the rate the feature extractor expects (16 kHz here)
sr = feature_extractor.sampling_rate
gtzan = gtzan.cast_column("audio", Audio(sampling_rate=sr))

gtzan['train'][0]

{'file': '/Users/leofltt/.cache/huggingface/datasets/downloads/extracted/a6e225c10bd43d04eece393f2c4c96844bb8510ba4335c16f1b78f28b988c3ee/genres/pop/pop.00098.wav',
 'audio': {'path': '/Users/leofltt/.cache/huggingface/datasets/downloads/extracted/a6e225c10bd43d04eece393f2c4c96844bb8510ba4335c16f1b78f28b988c3ee/genres/pop/pop.00098.wav',
  'array': array([ 0.08735079,  0.20183371,  0.47908673, ..., -0.18743171,
         -0.23294398, -0.13517424]),
  'sampling_rate': 16000},
 'genre': 7}

In [5]:
import numpy as np

sample = gtzan["train"][0]["audio"]

inputs = feature_extractor(sample["array"], sampling_rate=sample["sampling_rate"])

print(f"inputs keys: {list(inputs.keys())}")

print(
    f"Mean: {np.mean(inputs['input_values']):.3}, Variance: {np.var(inputs['input_values']):.3}"
)

inputs keys: ['input_values', 'attention_mask']
Mean: -6.75e-09, Variance: 1.0
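
A mean near 0 and a variance of 1.0 are what `do_normalize=True` is meant to produce. The sketch below shows zero-mean, unit-variance scaling on a toy waveform in plain Python. The `eps` value and the `zero_mean_unit_var` helper are assumptions made for illustration, not the feature extractor's actual implementation.

```python
import statistics


def zero_mean_unit_var(samples, eps=1e-7):
    """Shift to zero mean and scale to unit variance.
    eps guards against division by zero for silent clips (assumed value)."""
    mean = statistics.fmean(samples)
    var = statistics.pvariance(samples, mu=mean)
    scale = (var + eps) ** 0.5
    return [(x - mean) / scale for x in samples]


# Toy waveform (hypothetical values, not the GTZAN clip).
toy_waveform = [0.08, 0.20, 0.48, -0.19, -0.23, -0.14]

normalized = zero_mean_unit_var(toy_waveform)
norm_mean = statistics.fmean(normalized)
norm_var = statistics.pvariance(normalized)
print(f"Mean: {norm_mean:.3}, Variance: {norm_var:.3}")
```

As in the cell output above, the mean comes out as a tiny floating-point residue rather than an exact zero.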


In [6]:
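max_duration = 30.0  # seconds; GTZAN clips are about 30 s long


def preprocess_samples(examples):
    """Turn a batch of raw audio arrays into model inputs, truncated to max_duration."""
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(
        audio_arrays,
        sampling_rate=feature_extractor.sampling_rate,
        # maximum length in samples = sampling rate (Hz) * duration (s)
        max_length=int(feature_extractor.sampling_rate * max_duration),
        truncation=True,
        return_attention_mask=True,
    )
    return inputs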
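
The `max_length` passed to the feature extractor is a sample count, not a duration. A small sketch of that arithmetic and of what truncation does to a clip that runs slightly long, assuming the 16 kHz rate shown earlier; `truncate_clip` is a hypothetical stand-in for the extractor's truncation step.

```python
def truncate_clip(samples, sampling_rate, max_duration_s):
    """Keep at most max_duration_s seconds of audio; shorter clips pass through."""
    max_length = int(sampling_rate * max_duration_s)
    return samples[:max_length]


sampling_rate = 16_000  # Hz, as reported by the resampled dataset
max_duration = 30.0     # seconds

max_length = int(sampling_rate * max_duration)

# Dummy clips: one 10 ms too long, one much shorter than the limit.
slightly_long_clip = [0.0] * (max_length + 160)
short_clip = [0.0] * 1000

print(max_length)
print(len(truncate_clip(slightly_long_clip, sampling_rate, max_duration)))
print(len(truncate_clip(short_clip, sampling_rate, max_duration)))
```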

In [7]:
gtzan_encoded = gtzan.map(
    preprocess_samples,
    remove_columns=["audio", "file"],
    batched=True,
    batch_size=100,
    num_proc=1,
)
gtzan_encoded

DatasetDict({
    train: Dataset({
        features: ['genre', 'input_values', 'attention_mask'],
        num_rows: 899
    })
    test: Dataset({
        features: ['genre', 'input_values', 'attention_mask'],
        num_rows: 100
    })
})

In [8]:
gtzan_encoded = gtzan_encoded.rename_column("genre", "label")

# Map class indices (as strings) to genre names, and genre names back to indices.
id2label = {
    str(i): id2label_fn(i)
    for i in range(len(gtzan_encoded["train"].features["label"].names))
}
label2id = {v: k for k, v in id2label.items()}

id2label["7"]

'pop'
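
For reference, this is roughly what the two mappings look like once built. The sketch assumes the ten GTZAN genres in alphabetical order, which is consistent with index 7 mapping to `'pop'` above; the hard-coded `genre_names` list is an assumption, not something read from the dataset.

```python
# Assumed label order (alphabetical); not read from the dataset itself.
genre_names = [
    "blues", "classical", "country", "disco", "hiphop",
    "jazz", "metal", "pop", "reggae", "rock",
]

# Same construction as in the cell above, with string ids as keys.
id2label = {str(i): name for i, name in enumerate(genre_names)}
label2id = {name: idx for idx, name in id2label.items()}

print(id2label["7"])
print(label2id["pop"])
```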

In [9]:
from transformers import AutoModelForAudioClassification

num_labels = len(id2label)

model = AutoModelForAudioClassification.from_pretrained(model_id, num_labels=num_labels, label2id=label2id, id2label=id2label)

Some weights of HubertForSequenceClassification were not initialized from the model checkpoint at ntu-spml/distilhubert and are newly initialized: ['classifier.bias', 'classifier.weight', 'encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'projector.bias', 'projector.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [11]:
from transformers import TrainingArguments

model_name = model_id.split("/")[-1]
batch_size = 8
gradient_accumulation_steps = 1
num_train_epochs = 10

training_args = TrainingArguments(
    f"{model_name}-finetuned-gtzan",
    evaluation_strategy="epoch",  # named `eval_strategy` in newer transformers releases
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_train_epochs,
    warmup_ratio=0.1,
    logging_steps=5,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    fp16=False,
    push_to_hub=False,
)
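
These settings determine both the total number of optimisation steps and the learning rate at each step. A minimal sketch of that arithmetic, assuming the Trainer's default linear schedule with warmup, a single device and no gradient accumulation; `linear_schedule_lr` is an illustrative helper, not a library function.

```python
import math


def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


num_train_examples = 899  # rows in gtzan["train"]
batch_size = 8
num_train_epochs = 10
warmup_ratio = 0.1
learning_rate = 5e-5

steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_train_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(steps_per_epoch, total_steps, warmup_steps)
print(linear_schedule_lr(5, learning_rate, warmup_steps, total_steps))
print(linear_schedule_lr(115, learning_rate, warmup_steps, total_steps))
```

Under these assumptions there are 113 steps per epoch and 1130 steps in total, matching the training progress bar. Steps 5 and 115 give about 2.21e-06 and 4.99e-05, in line with the learning rates logged at epochs 0.04 and 1.02.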

In [13]:
import evaluate
import numpy as np

metric = evaluate.load("accuracy")


def compute_metrics(eval_pred):
    """Computes accuracy on a batch of predictions"""
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return metric.compute(predictions=predictions, references=eval_pred.label_ids)
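
The metric itself is simple: take the highest-scoring class per example and count how often it matches the reference label. A dependency-free sketch of the same computation on toy logits; `accuracy_from_logits` is a hypothetical helper that mirrors `compute_metrics` without `evaluate` or NumPy.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, label_ids):
    """Fraction of rows whose argmax equals the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, label_ids))
    return {"accuracy": correct / len(label_ids)}


# Toy batch: four examples, three classes (made-up numbers).
toy_logits = [
    [2.0, 0.1, -1.0],   # predicts class 0
    [0.2, 0.3, 0.1],    # predicts class 1
    [-0.5, 0.0, 1.5],   # predicts class 2
    [1.0, 0.9, 0.8],    # predicts class 0
]
toy_labels = [0, 1, 2, 2]

result = accuracy_from_logits(toy_logits, toy_labels)
print(result)  # {'accuracy': 0.75}
```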


In [14]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=gtzan_encoded["train"],
    eval_dataset=gtzan_encoded["test"],
    tokenizer=feature_extractor,
    compute_metrics=compute_metrics,
)

trainer.train()

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


  0%|          | 0/1130 [00:00<?, ?it/s]

{'loss': 2.2993, 'grad_norm': 227318.828125, 'learning_rate': 2.2123893805309734e-06, 'epoch': 0.04}
{'loss': 2.2918, 'grad_norm': 1950517.625, 'learning_rate': 4.424778761061947e-06, 'epoch': 0.09}
{'loss': 2.3011, 'grad_norm': 237503.421875, 'learning_rate': 6.6371681415929215e-06, 'epoch': 0.13}
{'loss': 2.2945, 'grad_norm': 1686734.5, 'learning_rate': 8.849557522123894e-06, 'epoch': 0.18}
{'loss': 2.2927, 'grad_norm': 182207.671875, 'learning_rate': 1.1061946902654869e-05, 'epoch': 0.22}
{'loss': 2.3193, 'grad_norm': 1709513.0, 'learning_rate': 1.3274336283185843e-05, 'epoch': 0.27}
{'loss': 2.285, 'grad_norm': 100947.6171875, 'learning_rate': 1.5486725663716813e-05, 'epoch': 0.31}
{'loss': 2.2893, 'grad_norm': 1621544.5, 'learning_rate': 1.7699115044247787e-05, 'epoch': 0.35}
{'loss': 2.3034, 'grad_norm': 16674.4296875, 'learning_rate': 1.991150442477876e-05, 'epoch': 0.4}
{'loss': 2.2948, 'grad_norm': 1602154.625, 'learning_rate': 2.2123893805309738e-05, 'epoch': 0.44}
{'loss': 2

  0%|          | 0/13 [00:00<?, ?it/s]

{'eval_loss': 2.2177350521087646, 'eval_accuracy': 0.26, 'eval_runtime': 22.6961, 'eval_samples_per_second': 4.406, 'eval_steps_per_second': 0.573, 'epoch': 1.0}
{'loss': 2.252, 'grad_norm': 1134124.25, 'learning_rate': 4.990167158308752e-05, 'epoch': 1.02}
{'loss': 2.2002, 'grad_norm': 1163342.375, 'learning_rate': 4.9655850540806295e-05, 'epoch': 1.06}
{'loss': 2.2426, 'grad_norm': 1102841.75, 'learning_rate': 4.941002949852507e-05, 'epoch': 1.11}
{'loss': 2.259, 'grad_norm': 1076151.875, 'learning_rate': 4.9164208456243856e-05, 'epoch': 1.15}
{'loss': 2.2495, 'grad_norm': 1063478.25, 'learning_rate': 4.891838741396263e-05, 'epoch': 1.19}
{'loss': 2.3421, 'grad_norm': 1008705.8125, 'learning_rate': 4.867256637168142e-05, 'epoch': 1.24}
{'loss': 2.3098, 'grad_norm': 934664.5625, 'learning_rate': 4.8426745329400195e-05, 'epoch': 1.28}
{'loss': 2.2594, 'grad_norm': 910355.8125, 'learning_rate': 4.818092428711898e-05, 'epoch': 1.33}
{'loss': 2.315, 'grad_norm': 891480.875, 'learning_rate

  0%|          | 0/13 [00:00<?, ?it/s]

{'eval_loss': 2.2727584838867188, 'eval_accuracy': 0.19, 'eval_runtime': 31.8275, 'eval_samples_per_second': 3.142, 'eval_steps_per_second': 0.408, 'epoch': 2.0}
{'loss': 2.2049, 'grad_norm': 978035.625, 'learning_rate': 4.4247787610619477e-05, 'epoch': 2.04}
{'loss': 2.3321, 'grad_norm': 1120861.875, 'learning_rate': 4.4001966568338254e-05, 'epoch': 2.08}
{'loss': 2.3033, 'grad_norm': 1280494.75, 'learning_rate': 4.375614552605704e-05, 'epoch': 2.12}
{'loss': 2.3348, 'grad_norm': 1334316.25, 'learning_rate': 4.351032448377581e-05, 'epoch': 2.17}
{'loss': 2.3284, 'grad_norm': 1438968.5, 'learning_rate': 4.326450344149459e-05, 'epoch': 2.21}
{'loss': 2.3976, 'grad_norm': 1563783.75, 'learning_rate': 4.301868239921337e-05, 'epoch': 2.26}
{'loss': 2.1774, 'grad_norm': 1788497.375, 'learning_rate': 4.2772861356932154e-05, 'epoch': 2.3}
{'loss': 2.282, 'grad_norm': 1893414.25, 'learning_rate': 4.252704031465093e-05, 'epoch': 2.35}
{'loss': 2.375, 'grad_norm': 1890093.0, 'learning_rate': 4.2

  0%|          | 0/13 [00:00<?, ?it/s]

{'eval_loss': 2.304631233215332, 'eval_accuracy': 0.16, 'eval_runtime': 39.1366, 'eval_samples_per_second': 2.555, 'eval_steps_per_second': 0.332, 'epoch': 3.0}
{'loss': 2.3433, 'grad_norm': 3213755.0, 'learning_rate': 3.883972468043265e-05, 'epoch': 3.01}
{'loss': 2.3343, 'grad_norm': 3622021.5, 'learning_rate': 3.859390363815143e-05, 'epoch': 3.05}
{'loss': 2.3046, 'grad_norm': 3390234.0, 'learning_rate': 3.834808259587021e-05, 'epoch': 3.1}
{'loss': 2.2452, 'grad_norm': 4264022.5, 'learning_rate': 3.810226155358899e-05, 'epoch': 3.14}
{'loss': 2.372, 'grad_norm': 3784628.75, 'learning_rate': 3.7856440511307774e-05, 'epoch': 3.19}
{'loss': 2.3209, 'grad_norm': 4426461.5, 'learning_rate': 3.7610619469026545e-05, 'epoch': 3.23}
{'loss': 2.2862, 'grad_norm': 4252684.5, 'learning_rate': 3.736479842674533e-05, 'epoch': 3.27}


KeyboardInterrupt: 
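
One way to read the log above is to compare it against a classifier that guesses uniformly. The sketch below assumes 10 genre classes and copies the evaluation figures from the log; it is a sanity check, not a diagnosis of why training stalled.

```python
import math

num_classes = 10  # assumed number of GTZAN genres

# Cross-entropy and accuracy of a uniform guess over num_classes.
chance_loss = math.log(num_classes)
chance_accuracy = 1 / num_classes

# (eval_loss, eval_accuracy) per epoch, copied from the log above.
logged_eval = {
    1: (2.2177350521087646, 0.26),
    2: (2.2727584838867188, 0.19),
    3: (2.304631233215332, 0.16),
}

print(f"chance loss: {chance_loss:.4f}, chance accuracy: {chance_accuracy:.2f}")
for epoch, (loss, acc) in logged_eval.items():
    print(f"epoch {epoch}: loss gap {loss - chance_loss:+.4f}, accuracy gap {acc - chance_accuracy:+.2f}")
```

The evaluation loss stays within about 0.1 of the uniform-guess value and accuracy drifts down towards 0.1. Together with the very large `grad_norm` values, this suggests the run was not converging when it was interrupted.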

In [None]:
kwargs = {
    "dataset_tags": "marsyas/gtzan",
    "dataset": "GTZAN",
    "model_name": f"{model_name}-finetuned-gtzan",
    "finetuned_from": model_id,
    "tasks": "audio-classification",
}

trainer.push_to_hub(**kwargs)