# Lesson 5: Zero-Shot Audio Classification 🎧

- If you are running this code on your own machine, please install the following:
``` 
    !pip install transformers
    !pip install datasets
    !pip install soundfile
    !pip install librosa
```

- The `transformers` library is needed to use the pipeline API (documented on the Hugging Face website).

- Using a dataset of audio recordings in Hugging Face typically involves working with the datasets library and the Hugging Face model hub, which provides access to numerous pre-trained models that can be used for tasks like speech recognition, sound classification, or speaker identification. Hugging Face also offers tools to preprocess audio data and easily use models for inference. 

- Librosa is a Python library used for analyzing and processing audio files. It is widely used in music and audio signal processing tasks due to its simple interface and rich functionality. 
The `librosa` library may need to have [ffmpeg](https://www.ffmpeg.org/download.html) installed. This page on [librosa](https://pypi.org/project/librosa/) provides installation instructions for ffmpeg.

- The soundfile library is a simple and efficient Python library for reading and writing sound files (audio files). 

In [1]:
%pip install transformers
%pip install datasets
%pip install soundfile
%pip install librosa

- Here is some code that suppresses warning messages.

In [2]:
# Suppress non-critical log messages
from transformers.utils import logging
logging.set_verbosity_error()

### Prepare the dataset of audio recordings

- To get an overview of an audio dataset with Hugging Face's `datasets` library, load it with `load_dataset` and inspect its features. The audio is typically stored in an "audio" column, and examining a record shows the dataset's structure and metadata.
- A *split* is a named subset of the dataset used for a particular stage of model development. The most common is the training split (`train`), usually the largest, from which the model learns by adjusting its weights and parameters to minimize error. Validation and test splits, when present, are held out for evaluation.
- The cell below passes `split="train[0:10]"`, which loads only the first 10 records of the training split.

In [3]:
#from datasets import load_dataset, load_from_disk
from datasets import load_dataset  #load the dataset of interest

# This dataset is a collection of different sounds of 5 seconds
dataset = load_dataset("ashraq/esc50",
                     split="train[0:10]")
#dataset = load_from_disk("./models/ashraq/esc50/train") 

https://huggingface.co/datasets/ashraq/esc50

- EXAMPLE 1

In [4]:
audio_sample = dataset[0]

In [5]:
audio_sample

In [6]:
from IPython.display import Audio as IPythonAudio
IPythonAudio(audio_sample["audio"]["array"],
             rate=audio_sample["audio"]["sampling_rate"])
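
A quick sanity check on a record is to derive the clip length from the decoded audio: the number of samples divided by the sampling rate. The sketch below uses a hypothetical stand-in record (`fake_sample`, a plain list at an assumed 44,100 Hz) rather than the real dataset, so it runs without downloading anything. The same function should work on `audio_sample`, whose `"array"` is a NumPy array.

```python
def clip_duration_seconds(sample):
    """Return the clip length in seconds: number of samples / samples per second."""
    audio = sample["audio"]
    return len(audio["array"]) / audio["sampling_rate"]

# Hypothetical record mimicking the dataset's structure:
# 5 seconds of silence at an assumed sampling rate of 44,100 Hz.
fake_sample = {
    "audio": {
        "array": [0.0] * (5 * 44_100),
        "sampling_rate": 44_100,
    }
}

print(clip_duration_seconds(fake_sample))  # 5.0
```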

### Build the `audio classification` pipeline using 🤗 Transformers Library

In [7]:
from transformers import pipeline

In [8]:
zero_shot_classifier = pipeline(
    task="zero-shot-audio-classification",
    model="laion/clap-htsat-unfused")

More info on [laion/clap-htsat-unfused](https://huggingface.co/laion/clap-htsat-unfused).

### Sampling Rate for Transformer Models
- How long does 1 second of high resolution audio (192,000 Hz) appear to the Whisper model (which is trained to expect audio files at 16,000 Hz)? 

In [9]:
(1 * 192000) / 16000

- The 1 second of high resolution audio appears to the model as if it is 12 seconds of audio.

- How about 5 seconds of audio?

In [10]:
(5 * 192000) / 16000

- 5 seconds of high resolution audio appears to the model as if it is 60 seconds of audio.
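
The same arithmetic can be wrapped in a small helper. This is only a sketch of the reasoning above, and the function name `apparent_duration_seconds` is made up for illustration: a model that assumes a fixed sampling rate interprets every sample as lasting `1 / expected_rate` seconds, so a clip recorded at a higher rate appears stretched.

```python
def apparent_duration_seconds(true_seconds, source_rate_hz, expected_rate_hz):
    """How long a clip appears to a model that assumes expected_rate_hz."""
    n_samples = true_seconds * source_rate_hz  # samples actually in the clip
    return n_samples / expected_rate_hz        # duration as the model interprets it

print(apparent_duration_seconds(1, 192_000, 16_000))  # 12.0
print(apparent_duration_seconds(5, 192_000, 16_000))  # 60.0
```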

In [11]:
zero_shot_classifier.feature_extractor.sampling_rate

In [12]:
audio_sample["audio"]["sampling_rate"]

- If the two rates differ, resample the input audio to the sampling rate the model expects (48,000 Hz in the cell below).

In [13]:
from datasets import Audio

In [14]:
dataset = dataset.cast_column(
    "audio",
     Audio(sampling_rate=48_000))
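
To build intuition for what resampling does, here is a deliberately naive resampler based on linear interpolation. It is not what `datasets` uses internally (proper resamplers apply band-limited filtering to avoid aliasing), and the function name and the 44,100 Hz source rate are assumptions for illustration. The point is only that the same clip, expressed at a new rate, has a different number of samples but the same duration.

```python
def resample_linear(samples, source_rate_hz, target_rate_hz):
    """Naive resampling by linear interpolation (illustration only)."""
    n_out = round(len(samples) * target_rate_hz / source_rate_hz)
    step = source_rate_hz / target_rate_hz  # input samples per output sample
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * step                  # fractional position in the input
        left = min(int(pos), last)      # nearest input sample at or before pos
        right = min(left + 1, last)     # next input sample, clamped at the end
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out

# One second of silence at an assumed 44,100 Hz, resampled to 48,000 Hz.
one_second = [0.0] * 44_100
resampled = resample_linear(one_second, 44_100, 48_000)
print(len(resampled))  # 48000 samples, still one second of audio
```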

In [15]:
audio_sample = dataset[0]

In [16]:
audio_sample

In [17]:
candidate_labels = ["Sound of a dog",
                    "Sound of vacuum cleaner"]

In [18]:
zero_shot_classifier(audio_sample["audio"]["array"],
                     candidate_labels=candidate_labels)

The scores show that "Sound of a dog" is the best match among the candidate labels: its score is far higher than that of the other label (0.99 vs. 0.001).
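
The pipeline returns a list of dictionaries, each with a `"score"` and a `"label"` key. A small helper can pick the winner programmatically. In this sketch, `top_prediction` is a hypothetical helper and `example_results` holds illustrative values that mirror the rounded scores quoted above, not actual model output.

```python
def top_prediction(results):
    """Return the entry with the highest score from a pipeline-style result list."""
    return max(results, key=lambda r: r["score"])

# Illustrative values in the pipeline's output format (not real model output).
example_results = [
    {"score": 0.99, "label": "Sound of a dog"},
    {"score": 0.001, "label": "Sound of vacuum cleaner"},
]

best = top_prediction(example_results)
print(best["label"])  # Sound of a dog
```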

In [19]:
candidate_labels = ["Sound of a child crying",
                    "Sound of vacuum cleaner",
                    "Sound of a bird singing",
                    "Sound of an airplane"]

In [20]:
zero_shot_classifier(audio_sample["audio"]["array"],
                     candidate_labels=candidate_labels)

In this example, the highest score (0.61) goes to "Sound of a bird singing". Note, however, that the correct label ("Sound of a dog") is not among the candidate labels, so the classifier can only choose the closest of the options it was given.
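
One way to see why some label always wins: as far as I understand the pipeline, it normalizes the audio-text similarity values with a softmax over the candidate labels, so the scores always sum to 1 even when no label fits well. The sketch below demonstrates this with hypothetical similarity values (`logits`); they are invented for illustration and are not taken from the model.

```python
import math

def softmax(logits):
    """Convert raw similarity values into scores that sum to 1."""
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = [
    "Sound of a child crying",
    "Sound of vacuum cleaner",
    "Sound of a bird singing",
    "Sound of an airplane",
]
# Hypothetical raw similarities: none is a strong match, one is merely the least bad.
logits = [1.0, 0.2, 1.9, 0.4]

scores = softmax(logits)
best = labels[scores.index(max(scores))]
print(best, round(sum(scores), 6))
```

A high relative score therefore means "best among the candidates", not "definitely correct", which is why the choice of candidate labels matters.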

- EXAMPLE 2

Let's try the same model on another audio file from the same dataset:
https://huggingface.co/datasets/ashraq/esc50

In [21]:
audio_sample1 = dataset[1]

In [22]:
audio_sample1

In [23]:
from IPython.display import Audio as IPythonAudio
IPythonAudio(audio_sample1["audio"]["array"],
             rate=audio_sample1["audio"]["sampling_rate"])

In [24]:
from transformers import pipeline
zero_shot_classifier = pipeline(
    task="zero-shot-audio-classification",
    model="laion/clap-htsat-unfused")

- As before, resample the input audio to the sampling rate the model expects.

In [25]:
from datasets import Audio

In [26]:
dataset = dataset.cast_column(
    "audio",
     Audio(sampling_rate=48_000))

In [27]:
audio_sample1 = dataset[1]

In [28]:
audio_sample1

When the candidate labels include "Sound of a bird", the classifier gives that label by far the highest score (0.99, versus 0.000006 for the alternative), so it is the best match for this sample.

In [31]:
candidate_labels2 = ["Sound of a child crying",
                    "Sound of vacuum cleaner",
                    "Sound of a dog",
                    "Sound of an airplane"]

In [32]:
zero_shot_classifier(audio_sample1["audio"]["array"],
                     candidate_labels=candidate_labels2)

Note that `candidate_labels2` does not include a bird label, so the classifier can only distribute its scores among the four labels listed, and the top-scoring label here cannot be the correct one. As in Example 1, a high score only means that a label is the best match among the candidates provided.

### Try it yourself! 
- Try this model with some other labels and audio files!