# Zero-Shot Audio Classification

In this example, the required libraries are already installed. To run it yourself, you would need to install them, download the model you want from Hugging Face, and run it with the Transformers library.

Some use cases and services that can be built using these models are:



In [None]:
# !pip install transformers
# !pip install datasets
# !pip install soundfile
# !pip install librosa

# The `librosa` library is used for audio processing and needs ffmpeg
# https://pypi.org/project/librosa/

### Prepare the dataset of audio recordings

In [1]:
# Suppress warning messages
from transformers.utils import logging
logging.set_verbosity_error()

In [None]:
from datasets import load_dataset, load_from_disk

# This dataset is a collection of 5-second environmental sound recordings
# dataset = load_dataset("ashraq/esc50",
#                       split="train[0:10]")
dataset = load_from_disk("./models/ashraq/esc50/train")

In [None]:
# Take the first sample and see what it looks like
audio_sample = dataset[0]
audio_sample    # A dictionary; its 'audio' entry holds 'array' and 'sampling_rate'

# Play the audio
from IPython.display import Audio as IPythonAudio
IPythonAudio(audio_sample["audio"]["array"],
             rate=audio_sample["audio"]["sampling_rate"])


#### Build the `audio classification` pipeline using the Transformers library

In [None]:
from transformers import pipeline

Info about the model used: [clap-htsat-unfused](https://huggingface.co/laion/clap-htsat-unfused).

The model must be downloaded before it can be used. Other suitable models can be found by searching for `clap` on Hugging Face.

In [None]:
# Load the model
zero_shot_classifier = pipeline(
    task="zero-shot-audio-classification",
    model="./models/laion/clap-htsat-unfused")

##### Sampling Rate for Transformer Models

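Depending on the sampling rate (Hz) a model was trained with and the audio resolution of the samples, the length of a recording as the model perceives it can differ from its real duration. The input audio should therefore be resampled to the model's sampling rate before classification.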
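As a rough illustration of why the rates need to match, the sketch below computes how long a clip would appear to a model if it were passed in without resampling. The function name `perceived_duration` is hypothetical, and the 44,100 Hz source rate is only an assumed example value; the 48,000 Hz model rate matches the value used later in this notebook.

```python
def perceived_duration(true_seconds: float,
                       audio_rate_hz: int,
                       model_rate_hz: int) -> float:
    """Duration the model effectively 'hears' when audio recorded at
    audio_rate_hz is fed in as if it were sampled at model_rate_hz."""
    n_samples = true_seconds * audio_rate_hz   # samples actually in the clip
    return n_samples / model_rate_hz           # seconds the model assumes

# A 5-second clip at an assumed 44.1 kHz, read by a 48 kHz model
print(perceived_duration(5.0, 44_100, 48_000))   # 4.59375

# One second of 192 kHz audio, read by a 16 kHz model
print(perceived_duration(1.0, 192_000, 16_000))  # 12.0
```

The larger the gap between the two rates, the more the clip appears stretched or compressed, which is why the audio column is cast to the model's rate before classification.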

In [None]:
# Sampling rate the model expects
print(zero_shot_classifier.feature_extractor.sampling_rate)

# Sampling rate of the audio sample
print(audio_sample["audio"]["sampling_rate"])

In [None]:
# Set the correct sampling rate for the input audio
from datasets import Audio

dataset = dataset.cast_column(
    "audio",
    Audio(sampling_rate=48_000))

# Take an example from the dataset to check sampling rate is changed
audio_sample = dataset[0]
audio_sample
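To give a feel for what resampling does to the sample array, here is a minimal pure-Python sketch that uses naive linear interpolation. This is not how the `datasets` library resamples internally (it relies on dedicated audio libraries with proper filtering); `naive_resample` is a hypothetical helper, and the 44,100 Hz source rate is an assumed example value.

```python
import math

def naive_resample(samples, orig_rate, target_rate):
    """Resample a 1-D signal by linear interpolation (illustration only)."""
    n_target = round(len(samples) * target_rate / orig_rate)
    last = len(samples) - 1
    out = []
    for i in range(n_target):
        pos = i * orig_rate / target_rate   # fractional index in the original
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

# 0.1 s of a 440 Hz tone at an assumed 44.1 kHz source rate
tone = [math.sin(2 * math.pi * 440 * n / 44_100) for n in range(4_410)]
resampled = naive_resample(tone, 44_100, 48_000)

# Same duration, more samples: 4410 at 44.1 kHz becomes 4800 at 48 kHz
print(len(tone), len(resampled))
```

The duration of the clip is unchanged; only the number of samples per second differs, so the array length grows in proportion to the target rate.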

In [None]:
# Pass candidate labels
candidate_labels = ["Sound of a dog",
                    "Sound of vacuum cleaner",
                    "Sound of a car",
                    "Sound of a cat",]

# Classify the audio
zero_shot_classifier(audio_sample["audio"]["array"],
                     candidate_labels=candidate_labels)
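The classifier returns a list of dictionaries, each with a `score` and a `label`. The sketch below shows one way to pick the most likely label from such a result. The scores in `example_result` are made up for illustration and are not output from the model, and `top_prediction` is a hypothetical helper.

```python
# Hypothetical result in the shape the pipeline returns;
# the scores are invented for illustration only.
example_result = [
    {"score": 0.91, "label": "Sound of a dog"},
    {"score": 0.05, "label": "Sound of a cat"},
    {"score": 0.03, "label": "Sound of a car"},
    {"score": 0.01, "label": "Sound of vacuum cleaner"},
]

def top_prediction(result):
    """Return the entry with the highest score."""
    return max(result, key=lambda entry: entry["score"])

best = top_prediction(example_result)
print(f"{best['label']} ({best['score']:.2f})")  # Sound of a dog (0.91)
```

Because the model only compares the audio against the candidate labels it is given, the scores are relative to that list: changing the labels changes the ranking.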