### **SET B1 - Zero-Shot Audio Classification**

-----

In [None]:
!pip install transformers
!pip install datasets
!pip install soundfile
!pip install librosa

The librosa library may require [ffmpeg](https://www.ffmpeg.org/download.html) to be installed.

The [librosa](https://pypi.org/project/librosa/) page on PyPI provides installation instructions for ffmpeg.

In [None]:
# Suppress transformers warning messages by logging errors only

from transformers.utils import logging
logging.set_verbosity_error()

Preparing the dataset of audio recordings

In [None]:
from datasets import load_dataset, load_from_disk

# This dataset is a collection of different sounds, each 5 seconds long.
# The commented-out call below is what the tutorial used; the load_from_disk
# line after it was added to the provided notebook.
# dataset = load_dataset("ashraq/esc50",
#                       split="train[0:10]")
dataset = load_from_disk("./models/ashraq/esc50/train")

In [None]:
audio_sample = dataset[0]
audio_sample

Playing the audio file

In [None]:
from IPython.display import Audio as IPythonAudio
IPythonAudio(audio_sample["audio"]["array"], rate=audio_sample["audio"]["sampling_rate"])

**Building the audio classification pipeline using the Hugging Face Transformers library**

In [None]:
from transformers import pipeline

In [None]:
zero_shot_classifier = pipeline(
    task="zero-shot-audio-classification",
    model="./models/laion/clap-htsat-unfused")

**Sampling Rate for Transformer Models**

How long does 1 second of high-resolution audio (192,000 Hz) appear to be to a model such as Whisper, which is trained to expect audio sampled at 16,000 Hz? \
*(1 * 192000) / 16000 = 12.0* \
To the model, 1 second of high-resolution audio looks like 12 seconds of audio.

What about 5 seconds of audio? \
*(5 * 192000) / 16000 = 60.0* \
To the model, 5 seconds of high-resolution audio looks like 60 seconds of audio.
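
The arithmetic above can be written as a small helper. This is only a sketch of the calculation; the function name `apparent_duration` and its parameter names are made up for illustration and are not part of any library.

```python
def apparent_duration(true_seconds, actual_rate_hz, expected_rate_hz):
    """Seconds of audio the model 'perceives' when samples recorded at
    actual_rate_hz are interpreted as if recorded at expected_rate_hz."""
    # Total number of samples in the recording
    num_samples = true_seconds * actual_rate_hz
    # The model assumes expected_rate_hz samples make up one second
    return num_samples / expected_rate_hz


print(apparent_duration(1, 192_000, 16_000))  # 12.0
print(apparent_duration(5, 192_000, 16_000))  # 60.0
```

When the two rates are equal the function returns the true duration, which is why the audio is resampled to the model's expected rate before inference.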

Checking the sampling rate that our model expects:

In [None]:
zero_shot_classifier.feature_extractor.sampling_rate

Now checking the sampling rate of our audio sample:

In [None]:
audio_sample["audio"]["sampling_rate"]

The two rates need to match, so we resample the dataset's audio to the sampling rate the model expects.

In [None]:
from datasets import Audio

In [None]:
dataset = dataset.cast_column("audio", Audio(sampling_rate=48_000))

In [None]:
audio_sample = dataset[0]
audio_sample

Classifying the audio sample against a set of candidate labels:

In [None]:
candidate_labels = ["Sound of a dog",
                    "Sound of vacuum cleaner"]

In [None]:
zero_shot_classifier(audio_sample["audio"]["array"], candidate_labels=candidate_labels)

In [None]:
candidate_labels = ["Sound of a child crying",
                    "Sound of vacuum cleaner",
                    "Sound of a bird singing",
                    "Sound of an airplane"]

In [None]:
zero_shot_classifier(audio_sample["audio"]["array"], candidate_labels=candidate_labels)
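
To read off the winning label programmatically, a small helper along these lines may be useful. It assumes the pipeline returns a list of dictionaries with `label` and `score` keys; the helper name `top_prediction` is hypothetical, and the scores in `example_results` are made up purely to show the shape of the data, not real model output.

```python
def top_prediction(results):
    """Return (label, score) for the highest-scoring entry in results."""
    best = max(results, key=lambda r: r["score"])
    return best["label"], best["score"]


# Made-up scores for illustration only; real values come from the classifier
example_results = [
    {"score": 0.02, "label": "Sound of vacuum cleaner"},
    {"score": 0.98, "label": "Sound of a dog"},
]

label, score = top_prediction(example_results)
print(f"{label} ({score:.2f})")
```

In the notebook, the same helper could be applied to the return value of `zero_shot_classifier(...)` in place of `example_results`.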

------