# Lesson 5: Zero-Shot Audio Classification

- In the classroom, the libraries have already been installed for you.
- If you are running this code on your own machine, please install the following:
``` 
    !pip install transformers
    !pip install datasets
    !pip install soundfile
    !pip install librosa
```

The `librosa` library may need to have [ffmpeg](https://www.ffmpeg.org/download.html) installed. 
- This page on [librosa](https://pypi.org/project/librosa/) provides installation instructions for ffmpeg.

- Here is some code that suppresses warning messages.

In [2]:
from transformers.utils import logging
logging.set_verbosity_error()

### Prepare the dataset of audio recordings

In [3]:
from datasets import load_dataset, load_from_disk

# This dataset is a collection of 5-second recordings of different sounds
dataset = load_dataset("ashraq/esc50",
                       split="train[0:10]")
# dataset = load_from_disk("./models/ashraq/esc50/train")

Repo card metadata block was not found. Setting CardData to empty.


In [4]:
audio_sample = dataset[0]

In [5]:
audio_sample

{'filename': '1-100032-A-0.wav',
 'fold': 1,
 'target': 0,
 'category': 'dog',
 'esc10': True,
 'src_file': 100032,
 'take': 'A',
 'audio': {'path': None,
  'array': array([0., 0., 0., ..., 0., 0., 0.]),
  'sampling_rate': 44100}}
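The `array` and `sampling_rate` fields together determine how long a clip is: duration equals the number of samples divided by the sampling rate. The helper below is a minimal sketch of that relationship. The function name `clip_duration_seconds` is hypothetical, and the sample count of 220,500 is an assumption chosen to match a 5-second clip at 44,100 Hz, not a value read from the dataset.

```python
def clip_duration_seconds(num_samples, sampling_rate):
    """Return the clip length in seconds for a given sample count and rate (Hz)."""
    return num_samples / sampling_rate

# Assumed sample count: 5 seconds * 44,100 samples per second
print(clip_duration_seconds(220_500, 44_100))  # 5.0
```

In the notebook itself, the same check could be written as `len(audio_sample["audio"]["array"]) / audio_sample["audio"]["sampling_rate"]`.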

In [6]:
from IPython.display import Audio as IPythonAudio
IPythonAudio(audio_sample["audio"]["array"],
             rate=audio_sample["audio"]["sampling_rate"])

### Build the `audio classification` pipeline using 🤗 Transformers Library

In [7]:
from transformers import pipeline

In [22]:
zero_shot_classifier = pipeline(
    task="zero-shot-audio-classification",
    model="laion/clap-htsat-unfused",
    device=-1)

More info on [laion/clap-htsat-unfused](https://huggingface.co/laion/clap-htsat-unfused).

### Sampling Rate
- A sound wave is a continuous signal. This means it contains an infinite number of signal values in any given time span. But the audio your computer can work with is a series of discrete values, known as a digital representation.
- To get the digital representation of a continuous audio signal, we first capture the sound with a microphone, which converts the sound waves into an analog electrical signal. The electrical signal is then sampled to get the digital representation.
- Sampling means measuring the value of a continuous signal at fixed time steps. As a result, the sampled waveform is discrete with a finite number of values at uniform intervals.
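The sampling step described above can be sketched in a few lines. This is only an illustration: the `tone` function stands in for a continuous signal, and the names `sample_signal` and `tone`, the 2 Hz frequency, and the 8 Hz sampling rate are all made up for the example.

```python
import math

def sample_signal(signal, sampling_rate, duration):
    """Measure a continuous signal every 1 / sampling_rate seconds."""
    num_samples = int(round(sampling_rate * duration))
    return [signal(n / sampling_rate) for n in range(num_samples)]

def tone(t):
    # Hypothetical "continuous" signal: a 2 Hz sine wave defined for any time t
    return math.sin(2 * math.pi * 2 * t)

# Sampling 1 second at 8 Hz yields 8 discrete values at uniform intervals
samples = sample_signal(tone, sampling_rate=8, duration=1.0)
print(len(samples))  # 8
print([round(s, 3) for s in samples])
```

Doubling `sampling_rate` doubles the number of values stored for the same second of sound.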

![image.png](attachment:f2160da9-3dec-435b-b18c-e02e4537e8ce.png)

![image.png](attachment:2f29c9f3-62be-4079-8ba1-ab79bc546b78.png)

### Sampling Rate for Transformer Models
- A very important characteristic of the digitized audio is the sampling rate. It is the number of samples taken in 1 second and it is measured in hertz or kilohertz.
- Examples: telephone/walkie-talkie - 8000 Hz, human speech recording - 16000 Hz, high-resolution audio - 192000 Hz

![image.png](attachment:64960709-e8d3-445b-a804-a6e0b4ca4adc.png)

In [9]:
# How long does 1 second of high-resolution audio (192,000 Hz) appear to be
# to the Whisper model, which is trained to expect audio sampled at 16,000 Hz?
(1 * 192000) / 16000

12.0

- 1 second of high-resolution audio appears to the model as if it were 12 seconds of audio.

- How about 5 seconds of audio?

In [10]:
(5 * 192000) / 16000

60.0

- 5 seconds of high-resolution audio appears to the model as if it were 60 seconds of audio.
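The two calculations above follow the same rule: the number of samples in the clip, divided by the rate the model assumes. A small helper makes that explicit. The name `apparent_duration` is hypothetical and only wraps the arithmetic already shown.

```python
def apparent_duration(seconds, actual_rate, expected_rate):
    """Duration (in seconds) a model perceives when it assumes `expected_rate`
    but the audio was actually sampled at `actual_rate` (both in Hz)."""
    num_samples = seconds * actual_rate
    return num_samples / expected_rate

print(apparent_duration(1, 192_000, 16_000))  # 12.0
print(apparent_duration(5, 192_000, 16_000))  # 60.0
# A 5-second clip at 44,100 Hz read as if it were 48,000 Hz audio
print(apparent_duration(5, 44_100, 48_000))   # 4.59375
```

Whenever the result differs from the true duration, the audio needs to be resampled before it is passed to the model.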

In [11]:
zero_shot_classifier.feature_extractor.sampling_rate

48000

In [12]:
audio_sample["audio"]["sampling_rate"]

44100

- The dataset's sampling rate (44,100 Hz) does not match the rate the model's feature extractor expects (48,000 Hz), so resample the input audio to 48,000 Hz.

In [13]:
from datasets import Audio

In [14]:
dataset = dataset.cast_column(
    "audio",
     Audio(sampling_rate=48_000))
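Casting the column to `Audio(sampling_rate=48_000)` makes the dataset return audio resampled to the new rate when an example is accessed. To show what resampling means, here is a deliberately naive sketch based on linear interpolation. It is not the method the library uses (real resamplers typically apply band-limited filtering), and the function name `resample_linear` and the silent one-second clip are made up for the example.

```python
def resample_linear(samples, orig_rate, target_rate):
    """Naive resampling by linear interpolation (illustration only)."""
    duration = len(samples) / orig_rate
    num_out = int(round(duration * target_rate))
    out = []
    for n in range(num_out):
        # Position of the n-th output sample on the original sample grid
        pos = n * orig_rate / target_rate
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append((1 - frac) * samples[left] + frac * samples[right])
    return out

# Hypothetical 1-second clip of silence recorded at 44,100 Hz
one_second = [0.0] * 44_100
resampled = resample_linear(one_second, 44_100, 48_000)
print(len(resampled))  # 48000
```

The clip still lasts one second, but it is now described by 48,000 values instead of 44,100, which is what the model expects.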

In [15]:
audio_sample = dataset[0]

In [16]:
audio_sample

{'filename': '1-100032-A-0.wav',
 'fold': 1,
 'target': 0,
 'category': 'dog',
 'esc10': True,
 'src_file': 100032,
 'take': 'A',
 'audio': {'path': None,
  'array': array([0., 0., 0., ..., 0., 0., 0.]),
  'sampling_rate': 48000}}

In [17]:
# The pipeline needs a list of candidate labels.
# CLAP takes both audio and text as input and computes the similarity between the two.
candidate_labels = ["Sound of a dog",
                    "Sound of vacuum cleaner"]

In [23]:
zero_shot_classifier(audio_sample["audio"]["array"],
                     candidate_labels=candidate_labels)


In [20]:
candidate_labels = ["Sound of a child crying",
                    "Sound of vacuum cleaner",
                    "Sound of a bird singing",
                    "Sound of an airplane"]

In [24]:
zero_shot_classifier(audio_sample["audio"]["array"],
                     candidate_labels=candidate_labels)

[{'score': 0.6172527074813843, 'label': 'Sound of a bird singing'},
 {'score': 0.21602587401866913, 'label': 'Sound of vacuum cleaner'},
 {'score': 0.1254725605249405, 'label': 'Sound of an airplane'},
 {'score': 0.041248906403779984, 'label': 'Sound of a child crying'}]
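The scores above sum to 1 across the candidate labels, which is consistent with comparing the audio to each label's text and normalizing the similarities with a softmax. The sketch below illustrates that idea with toy numbers. The three-dimensional embeddings, the `scale` value, and the function names are all invented for illustration; the real model produces its own learned embeddings and scaling.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(values):
    """Turn arbitrary scores into probabilities that sum to 1."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_scores(audio_embedding, label_embeddings, scale=10.0):
    """Rank labels by scaled audio-text similarity (toy version)."""
    labels = list(label_embeddings)
    sims = [scale * cosine_similarity(audio_embedding, label_embeddings[label])
            for label in labels]
    probs = softmax(sims)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return [{"score": score, "label": label} for label, score in ranked]

# Made-up embeddings: a real model would compute these from audio and text
audio_embedding = [0.9, 0.1, 0.2]
label_embeddings = {
    "Sound of a dog": [1.0, 0.0, 0.1],
    "Sound of vacuum cleaner": [0.0, 1.0, 0.3],
}
results = zero_shot_scores(audio_embedding, label_embeddings)
for result in results:
    print(result)
```

Because the softmax is taken over the candidate labels only, the top label is the best match among the options given, even when none of them describes the sound well.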

### Try it yourself! 
- Try this model with some other labels and audio files!