Here we implement a zero-shot audio classifier, i.e. one that needs no fine-tuning.

To install the required libraries:
```
!pip install transformers
!pip install datasets
!pip install soundfile
!pip install librosa
```
In addition, you might need to have [ffmpeg](https://www.ffmpeg.org/download.html) installed on your operating system (OS) for [librosa](https://pypi.org/project/librosa/) to work. Please follow the installation instructions for your OS.

In [15]:
from transformers.utils import logging
logging.set_verbosity_error() # suppress warning messages

from datasets import load_dataset, load_from_disk, Audio
from IPython.display import Audio as IPythonAudio
from transformers import pipeline

We'll use the ESC-50 dataset, a labelled collection of 5-second environmental sound clips, such as:
- sounds made by humans
- sounds made by animals
- nature sounds
- indoor sounds
- urban noises

In [2]:
# load the dataset from the Hugging Face Hub
dataset = load_dataset("ashraq/esc50", split="train[0:10]") # we'll load only a few examples
#dataset = load_from_disk("./models/ashraq/esc50/train")

In [4]:
# examine the audio of the first sample
first_sample = dataset[0]
IPythonAudio(first_sample["audio"]["array"], rate=first_sample["audio"]["sampling_rate"])

In [5]:
# listen to the audio above and check for yourself if it matches the 'category' label
first_sample

{'filename': '1-100032-A-0.wav',
 'fold': 1,
 'target': 0,
 'category': 'dog',
 'esc10': True,
 'src_file': 100032,
 'take': 'A',
 'audio': {'path': None,
  'array': array([0., 0., 0., ..., 0., 0., 0.]),
  'sampling_rate': 44100}}

In [8]:
# build the classification pipeline using the transformers library
sound_classifier = pipeline(task="zero-shot-audio-classification", model="laion/clap-htsat-unfused")

For audio classification, we use a CLAP (Contrastive Language-Audio Pretraining) model, [laion/clap-htsat-unfused](https://huggingface.co/laion/clap-htsat-unfused). One thing to be mindful of is that the sampling rate of the audio must match the rate the model expects, because each model is trained on data with a specific sampling rate.

Physics crash course: a microphone converts a continuous sound wave into an electrical signal, which an analog-to-digital converter turns into a digital representation by sampling it at regular intervals. The sampling rate of an audio file is therefore the number of samples taken per second, measured in hertz (Hz), the [SI unit](https://en.wikipedia.org/wiki/International_System_of_Units) of frequency. Typical values:
- 8 kHz: telephone
- 16 kHz: human speech recording
- 192 kHz: high resolution audio

Image below from Wikipedia: signal sampling representation. The continuous signal $S(t)$ is shown as a green line, while the discrete samples are indicated by the blue vertical lines.
![](https://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Signal_Sampling.svg/1920px-Signal_Sampling.svg.png)
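
To make the idea of sampling concrete, here is a minimal sketch in plain Python. The helper `sample_sine` and the 440 Hz test tone are illustrative assumptions, not part of the notebook; the point is only that the number of samples equals the sampling rate times the duration.

```python
import math

def sample_sine(freq_hz, sampling_rate, duration_s):
    """Sample a sine wave of freq_hz at sampling_rate for duration_s seconds."""
    n_samples = round(sampling_rate * duration_s)
    # sample n is taken at time n / sampling_rate
    return [math.sin(2 * math.pi * freq_hz * n / sampling_rate)
            for n in range(n_samples)]

# the same 10 ms tone captured at telephone and speech-recording rates
telephone = sample_sine(440.0, 8_000, 0.01)
speech = sample_sine(440.0, 16_000, 0.01)
print(len(telephone), len(speech))
```

Doubling the sampling rate doubles the number of values stored for the same stretch of sound.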

In [10]:
# a Whisper model, for example, is trained on audio sampled at 16 kHz
# without resampling, 5 seconds of high-resolution audio (192 kHz) would appear to it as 60 seconds of audio
(5 * 192000)/16000

60.0

In [12]:
# get the sampling rate the model is trained on
sound_classifier.feature_extractor.sampling_rate

48000

In [13]:
# get the sampling rate of the dataset
first_sample["audio"]["sampling_rate"]

44100
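
With both rates known, the effect of the mismatch can be estimated with the same arithmetic as in the Whisper example. The function name `apparent_duration` is a hypothetical helper introduced here for illustration.

```python
def apparent_duration(true_duration_s, actual_rate, assumed_rate):
    """Duration a clip seems to have when its samples are read at assumed_rate."""
    n_samples = true_duration_s * actual_rate
    return n_samples / assumed_rate

# a 5-second clip recorded at 44.1 kHz, interpreted as if it were 48 kHz
print(apparent_duration(5, 44_100, 48_000))
```

Under these assumptions the clip would seem shorter, and therefore higher-pitched, than it really is, which is why the audio is resampled before it is passed to the model.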

In [16]:
# resample the dataset's audio to the sampling rate the model expects
dataset = dataset.cast_column("audio", Audio(sampling_rate=sound_classifier.feature_extractor.sampling_rate))

In [17]:
first_sample = dataset[0]

In [18]:
first_sample

{'filename': '1-100032-A-0.wav',
 'fold': 1,
 'target': 0,
 'category': 'dog',
 'esc10': True,
 'src_file': 100032,
 'take': 'A',
 'audio': {'path': None,
  'array': array([0., 0., 0., ..., 0., 0., 0.]),
  'sampling_rate': 48000}}
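
Casting the column changes the sampling rate because the audio is resampled when it is decoded. The sketch below shows the idea with simple linear interpolation in plain Python. This is a deliberate simplification and an assumption for illustration: it is not how the `datasets` library resamples, and real resamplers use band-limited filters to avoid aliasing.

```python
def resample_linear(samples, orig_rate, target_rate):
    """Resample a list of samples by linear interpolation (illustrative only)."""
    if not samples:
        return []
    n_out = round(len(samples) * target_rate / orig_rate)
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        # position of output sample i on the original sample grid
        pos = i * orig_rate / target_rate
        left = min(int(pos), last)
        right = min(left + 1, last)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out

one_second = [0.0] * 44_100          # one second of silence at 44.1 kHz
resampled = resample_linear(one_second, 44_100, 48_000)
print(len(resampled))                # one second at 48 kHz
```

The duration stays the same; only the number of values used to describe it changes.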

Besides the audio, we also need to pass candidate labels to the CLAP model, which scores each label by its similarity to the audio.
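
As a rough sketch of what "similarity" means here: the audio and each label are mapped to embedding vectors, the vectors are compared, and the comparison scores are turned into probabilities with a softmax. The three-dimensional embeddings, the `scale` value, and the helper names below are made up for illustration; the real model produces much larger learned embeddings and its internals may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(logits):
    """Turn arbitrary scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rank_labels(audio_embedding, label_embeddings, scale=10.0):
    """Score each label against the audio; `scale` is a hypothetical temperature."""
    labels = list(label_embeddings)
    logits = [scale * cosine_similarity(audio_embedding, label_embeddings[name])
              for name in labels]
    ranked = sorted(zip(labels, softmax(logits)), key=lambda p: p[1], reverse=True)
    return [{"score": score, "label": name} for name, score in ranked]

# toy embeddings, invented for this example
toy_audio = [0.9, 0.1, 0.0]
toy_labels = {
    "dog": [1.0, 0.0, 0.0],
    "bird singing": [0.0, 1.0, 0.0],
    "airplane": [0.0, 0.0, 1.0],
}
ranking = rank_labels(toy_audio, toy_labels)
print(ranking[0]["label"])
```

Because the scores come from a softmax over the candidate labels, they always sum to 1, so a high score only means "most similar among the labels offered".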

In [23]:
candidate_labels = ["child crying",
                    "vacuum cleaner",
                    "bird singing",
                    "airplane",
                    "dog"]
sound_classifier(first_sample["audio"]["array"], candidate_labels=candidate_labels)

[{'score': 0.9934026598930359, 'label': 'dog'},
 {'score': 0.0034067188389599323, 'label': 'bird singing'},
 {'score': 0.0017710168613120914, 'label': 'vacuum cleaner'},
 {'score': 0.0009858678095042706, 'label': 'airplane'},
 {'score': 0.00043370015919208527, 'label': 'child crying'}]

In [24]:
second_sample = dataset[1]

In [25]:
second_sample

{'filename': '1-100038-A-14.wav',
 'fold': 1,
 'target': 14,
 'category': 'chirping_birds',
 'esc10': False,
 'src_file': 100038,
 'take': 'A',
 'audio': {'path': None,
  'array': array([-0.01288922, -0.09524129, -0.14230728, ...,  0.03312215,
          0.00153297,  0.        ]),
  'sampling_rate': 48000}}

In [26]:
IPythonAudio(second_sample["audio"]["array"], rate=second_sample["audio"]["sampling_rate"])

In [27]:
candidate_labels = ["child crying",
                    "vacuum cleaner",
                    "bird singing",
                    "airplane",
                    "dog"]
sound_classifier(second_sample["audio"]["array"], candidate_labels=candidate_labels)

[{'score': 0.9913629293441772, 'label': 'bird singing'},
 {'score': 0.004377551376819611, 'label': 'airplane'},
 {'score': 0.0018786629661917686, 'label': 'dog'},
 {'score': 0.0017541953129693866, 'label': 'child crying'},
 {'score': 0.0006266148993745446, 'label': 'vacuum cleaner'}]
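
The result is a list of dictionaries, so picking out the winning label takes a few lines of plain Python. The helpers `top_prediction` and `is_confident`, and the 0.5 threshold, are hypothetical additions for illustration; the scores are copied from the output above.

```python
def top_prediction(results):
    """Return (label, score) for the highest-scoring entry."""
    best = max(results, key=lambda r: r["score"])
    return best["label"], best["score"]

def is_confident(results, threshold=0.5):
    """True if the best score reaches the (arbitrary) threshold."""
    _, score = top_prediction(results)
    return score >= threshold

# output of the classifier for the second sample, copied from above
bird_results = [
    {"score": 0.9913629293441772, "label": "bird singing"},
    {"score": 0.004377551376819611, "label": "airplane"},
    {"score": 0.0018786629661917686, "label": "dog"},
    {"score": 0.0017541953129693866, "label": "child crying"},
    {"score": 0.0006266148993745446, "label": "vacuum cleaner"},
]
label, score = top_prediction(bird_results)
print(label, is_confident(bird_results))
```

Note that the predicted label is one of our candidate labels ("bird singing"), not the dataset's own category name ("chirping_birds"), so comparing predictions with the dataset labels would need a mapping between the two.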