# Assignment for Unit 4 content

This assignment is meant to ensure you have experience with fine-tuning a pretrained model for an audio classification task. You can use the following resources to help you:
- *https://huggingface.co/learn/audio-course/chapter4/classification_models*
- *https://huggingface.co/learn/audio-course/chapter4/fine-tuning*
- *https://huggingface.co/learn/audio-course/chapter4/demo*
- Any of the previous notebooks

The Hugging Face tutorials above demonstrate how to fine-tune a Hubert model on the marsyas/gtzan dataset for music classification. Their example achieved 83% accuracy. Your task is to improve upon this accuracy metric.

Feel free to choose any model on the HuggingFace Hub that you think is suitable for audio classification. However, you should use the exact same dataset (marsyas/gtzan) to build your own classifier.

Your goal is to achieve at least 87% accuracy on this dataset with your classifier. You can keep the exact same model and tune the training hyperparameters, or pick an entirely different model - it’s up to you!

Here are some additional resources that you may find helpful when working on this exercise:

- Audio classification task guide in Transformers documentation: *https://huggingface.co/docs/transformers/tasks/audio_classification*
- Hubert model documentation: *https://huggingface.co/docs/transformers/model_doc/hubert*
- M-CTC-T model documentation: *https://huggingface.co/docs/transformers/model_doc/mctct*
- Audio Spectrogram Transformer documentation: *https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer*
- Wav2Vec2 documentation: *https://huggingface.co/docs/transformers/model_doc/wav2vec2*

This exercise can be hard for people new to machine learning/audio programming! If you are struggling, make sure to ask for help in the group Teams chat!

In [None]:
!pip install datasets[audio]

In [2]:
from datasets import load_dataset

mg = load_dataset("marsyas/gtzan", "all")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


In [3]:
mg = mg["train"].train_test_split(seed=42, shuffle=True, test_size=0.1)


In [4]:
mg["train"][1]

{'file': '/root/.cache/huggingface/datasets/downloads/extracted/5022b0984afa7334ff9a3c60566280b08b5179d4ac96a628052bada7d8940244/genres/classical/classical.00080.wav',
 'audio': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/5022b0984afa7334ff9a3c60566280b08b5179d4ac96a628052bada7d8940244/genres/classical/classical.00080.wav',
  'array': array([-0.04049683, -0.0402832 , -0.0397644 , ...,  0.01278687,
          0.01135254,  0.00958252]),
  'sampling_rate': 22050},
 'genre': 1}

In [5]:
id2label_fn = mg["train"].features["genre"].int2str
id2label_fn(mg["train"][0]["genre"])

'pop'

In [6]:
from transformers import AutoFeatureExtractor

model_id = "ntu-spml/distilhubert"
feature_extractor = AutoFeatureExtractor.from_pretrained(
    model_id, do_normalize=True, return_attention_mask=True
)

In [7]:
sampling_rate = feature_extractor.sampling_rate
sampling_rate

16000

In [8]:
from datasets import Audio

mg = mg.cast_column("audio", Audio(sampling_rate=sampling_rate))

In [9]:
mg["train"][1]

{'file': '/root/.cache/huggingface/datasets/downloads/extracted/5022b0984afa7334ff9a3c60566280b08b5179d4ac96a628052bada7d8940244/genres/classical/classical.00080.wav',
 'audio': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/5022b0984afa7334ff9a3c60566280b08b5179d4ac96a628052bada7d8940244/genres/classical/classical.00080.wav',
  'array': array([-0.03288342, -0.04257958, -0.03760961, ...,  0.0126256 ,
          0.01197463,  0.        ]),
  'sampling_rate': 16000},
 'genre': 1}
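The output above shows the same clip after `cast_column` changed its sampling rate from 22050 Hz to 16000 Hz. As a rough illustration of what resampling does to the number of samples, here is a minimal sketch using linear interpolation. The helper `naive_resample` is hypothetical and is not the algorithm that `datasets` uses internally; it only shows how the array length scales with the sampling rate.

```python
import numpy as np


def naive_resample(signal, orig_sr, target_sr):
    """Resample by linear interpolation (illustration only, no anti-aliasing)."""
    duration = len(signal) / orig_sr                 # clip length in seconds
    n_target = int(round(duration * target_sr))      # samples at the new rate
    orig_times = np.arange(len(signal)) / orig_sr    # timestamps of input samples
    target_times = np.arange(n_target) / target_sr   # timestamps of output samples
    return np.interp(target_times, orig_times, signal)


# One second of a 440 Hz tone sampled at 22050 Hz
one_second = np.sin(2 * np.pi * 440 * np.arange(22050) / 22050)
resampled = naive_resample(one_second, 22050, 16000)

print(len(one_second), len(resampled))  # 22050 16000
```

By the same arithmetic, a 30-second clip shrinks from 661500 samples at 22050 Hz to 480000 samples at 16000 Hz.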

In [10]:
import numpy as np

sample = mg["train"][0]["audio"]
inputs = feature_extractor(sample["array"], sampling_rate=sample["sampling_rate"])

print(f"inputs keys: {list(inputs.keys())}")

print(
    f"Mean: {np.mean(inputs['input_values']):.3}, Variance: {np.var(inputs['input_values']):.3}"
)

inputs keys: ['input_values', 'attention_mask']
Mean: -7.45e-09, Variance: 1.0
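The mean near zero and variance near one reported above come from the `do_normalize=True` setting. A minimal NumPy sketch of that kind of zero-mean, unit-variance scaling follows. The function name and the small `eps` constant are assumptions made for this illustration, not values read from the feature extractor.

```python
import numpy as np


def zero_mean_unit_variance(x, eps=1e-7):
    """Shift to zero mean and scale to unit variance.

    eps is an assumed small constant that guards against division by zero.
    """
    x = np.asarray(x, dtype=np.float64)
    return (x - x.mean()) / np.sqrt(x.var() + eps)


# Synthetic stand-in for one second of raw audio at 16 kHz
rng = np.random.default_rng(0)
raw = rng.normal(loc=0.3, scale=0.1, size=16000)

normalized = zero_mean_unit_variance(raw)
print(f"Mean: {normalized.mean():.3}, Variance: {normalized.var():.3}")
```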


In [11]:
# Audio clips are expected to be about 30 seconds long.
# Any longer clip is truncated using the max_length
# and truncation arguments of the feature extractor.
max_duration = 30.0


def preprocess_function(examples):
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(
        audio_arrays,
        sampling_rate=feature_extractor.sampling_rate,
        max_length=int(feature_extractor.sampling_rate * max_duration),
        truncation=True,
        return_attention_mask=True,
    )
    return inputs
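To make the effect of `max_length` and `truncation=True` concrete, here is a small sketch that applies the same length limit with plain NumPy slicing. It covers truncation only; the `truncate` helper and the two synthetic clips are hypothetical and exist just for this illustration.

```python
import numpy as np

sampling_rate = 16000
max_duration = 30.0
max_length = int(sampling_rate * max_duration)  # 480000 samples


def truncate(clip, max_length):
    """Keep at most max_length samples; shorter clips are returned unchanged."""
    return clip[:max_length]


# Two synthetic clips: 29 seconds and 31 seconds of silence
clips = [np.zeros(29 * sampling_rate), np.zeros(31 * sampling_rate)]
lengths = [len(truncate(clip, max_length)) for clip in clips]

print(max_length, lengths)  # 480000 [464000, 480000]
```

The 31-second clip is cut to 480000 samples, while the 29-second clip keeps its original length.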

In [12]:
gtzan_encoded = mg.map(
    preprocess_function,
    remove_columns=["audio", "file"],
    batched=True,
    batch_size=100,
    num_proc=1,
)
gtzan_encoded

DatasetDict({
    train: Dataset({
        features: ['genre', 'input_values', 'attention_mask'],
        num_rows: 899
    })
    test: Dataset({
        features: ['genre', 'input_values', 'attention_mask'],
        num_rows: 100
    })
})

In [13]:
gtzan_encoded = gtzan_encoded.rename_column("genre", "label")
id2label = {
    str(i): id2label_fn(i)
    for i in range(len(gtzan_encoded["train"].features["label"].names))
}
label2id = {v: k for k, v in id2label.items()}

id2label["7"]

'pop'
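The two mappings built above can be sketched with plain dictionaries. The `genre_names` list below is an assumption based on the alphabetical order of the GTZAN genres; only `1 → classical` and `7 → pop` are confirmed by the outputs in this notebook. As in the cell above, the ids are stored as strings.

```python
# Assumed genre order (alphabetical); only indices 1 and 7 are confirmed above
genre_names = [
    "blues", "classical", "country", "disco", "hiphop",
    "jazz", "metal", "pop", "reggae", "rock",
]

# Map string ids to genre names, and genre names back to string ids
id2label = {str(i): name for i, name in enumerate(genre_names)}
label2id = {name: idx for idx, name in id2label.items()}

print(id2label["7"])          # pop
print(label2id["classical"])  # 1
```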

In [14]:
from transformers import AutoModelForAudioClassification

num_labels = len(id2label)

model = AutoModelForAudioClassification.from_pretrained(
    model_id,
    num_labels=num_labels,
    label2id=label2id,
    id2label=id2label,
)

Some weights of HubertForSequenceClassification were not initialized from the model checkpoint at ntu-spml/distilhubert and are newly initialized: ['encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'projector.bias', 'classifier.bias', 'encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'classifier.weight', 'projector.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [15]:
from huggingface_hub import notebook_login

notebook_login()

In [16]:
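# After upgrading, restart the runtime before continuing;
# otherwise the versions that are already imported stay in use.
!pip install -U accelerate
!pip install -U transformers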

Collecting accelerate
  Downloading accelerate-0.26.1-py3-none-any.whl (270 kB)
     270.9/270.9 kB 4.7 MB/s eta 0:00:00
Installing collected packages: accelerate
Successfully installed accelerate-0.26.1
Collecting transformers
  Downloading transformers-4.36.2-py3-none-any.whl (8.2 MB)
     8.2/8.2 MB 49.6 MB/s eta 0:00:00
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.35.2
    Uninstalling transformers-4.35.2:
      Successfully uninstalled transformers-4.35.2
Successfully installed transformers-4.36.2


In [17]:
import accelerate
import transformers

transformers.__version__, accelerate.__version__

('4.35.2', '0.26.1')
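The output above still reports `transformers` 4.35.2 even though pip installed 4.36.2, which usually means the running kernel imported the older version before the upgrade. The sketch below shows one way to detect such a mismatch. The helpers `needs_restart` and `check_package` are hypothetical, and `check_package` assumes the module name matches the name of the installed distribution.

```python
from importlib import metadata


def needs_restart(imported_version, installed_version):
    """True when the version in memory differs from the one installed on disk."""
    return imported_version != installed_version


def check_package(module):
    """Compare module.__version__ with the installed distribution metadata.

    Assumes the module name equals the distribution name.
    """
    installed = metadata.version(module.__name__)
    return needs_restart(module.__version__, installed), module.__version__, installed


# Version strings taken from the cell outputs above
print(needs_restart("4.35.2", "4.36.2"))  # True  (transformers)
print(needs_restart("0.26.1", "0.26.1"))  # False (accelerate)
```

In a notebook, `check_package(transformers)` could be called after the install cell; a `True` in the first position suggests restarting the runtime.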

In [18]:
from transformers import TrainingArguments

model_name = model_id.split("/")[-1]
batch_size = 8
gradient_accumulation_steps = 1
num_train_epochs = 10

training_args = TrainingArguments(
    f"{model_name}-finetuned-gtzan",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_train_epochs,
    warmup_ratio=0.1,
    logging_steps=5,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    push_to_hub=True,
)

ImportError: Using the `Trainer` with `PyTorch` requires `accelerate>=0.20.1`: Please run `pip install transformers[torch]` or `pip install accelerate -U`
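Once the import problem is resolved, it can help to estimate the size of the training schedule that the `TrainingArguments` values above imply. The following is a back-of-the-envelope sketch assuming a single device and the 899 training examples reported earlier. The warmup figure is approximate, since the exact rounding used by the `Trainer` may differ.

```python
import math

num_train_examples = 899          # size of the train split shown earlier
batch_size = 8
gradient_accumulation_steps = 1
num_train_epochs = 10
warmup_ratio = 0.1

# Batches per epoch; the last batch may be smaller than batch_size
batches_per_epoch = math.ceil(num_train_examples / batch_size)

# Optimizer updates per epoch and in total
updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
total_updates = updates_per_epoch * num_train_epochs

# Approximate number of warmup updates
approx_warmup = round(total_updates * warmup_ratio)

print(batches_per_epoch, total_updates, approx_warmup)  # 113 1130 113
```

Doubling `gradient_accumulation_steps` would roughly halve the number of optimizer updates while doubling the effective batch size.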

In [None]:
!pip install evaluate

In [None]:
import evaluate
import numpy as np

metric = evaluate.load("accuracy")


def compute_metrics(eval_pred):
    """Computes accuracy on a batch of predictions"""
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return metric.compute(predictions=predictions, references=eval_pred.label_ids)
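The `compute_metrics` function above takes the argmax over the logits and compares the result with the reference labels. The same calculation can be sketched with NumPy alone; the logits and labels below are made-up values for illustration.

```python
import numpy as np

# Made-up logits for 4 examples and 3 classes
logits = np.array([
    [2.0, 0.1, -1.0],
    [0.3, 0.2, 1.5],
    [0.0, 0.9, 0.1],
    [1.2, 1.1, 0.0],
])
labels = np.array([0, 2, 0, 0])

# Predicted class is the index of the largest logit in each row
predictions = np.argmax(logits, axis=1)

# Accuracy is the fraction of predictions that match the labels
accuracy = float(np.mean(predictions == labels))

print(predictions.tolist(), accuracy)  # [0, 2, 1, 0] 0.75
```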

In [None]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=gtzan_encoded["train"],
    eval_dataset=gtzan_encoded["test"],
    tokenizer=feature_extractor,
    compute_metrics=compute_metrics,
)

trainer.train()