# Audio classification on the Urban Sounds dataset with CLAP and AST

In this notebook we do audio classification on the Urban Sounds dataset. 
This dataset contains nine classes of audio events, such as motorcycles, screaming people and gunshots.

The dataset is hosted on the Hugging Face Hub at: https://huggingface.co/datasets/UrbanSounds/urban_sounds_small

Two AI models are used in this notebook:
- CLAP 
- Audio Spectrogram Transformer (AST)

### Contents
0. Install packages
1. CLAP on the Urban Sounds Amsterdam dataset
2. AST on the Urban Sounds Amsterdam dataset

## Introduction 

### Using datasets
To classify sounds, I've uploaded a sub-sample of the Amsterdam Sounds Database to the Hugging Face Hub. 

In this notebook we will use Hugging Face's ```datasets``` library to load this dataset. 

**links & sources**:
- https://pypi.org/project/datasets/
- https://huggingface.co/docs/datasets/audio_process
- https://huggingface.co/docs/datasets/v2.14.5/en/package_reference/main_classes#datasets.Audio
- https://huggingface.co/learn/audio-course/chapter2/audio_classification_pipeline?fw=pt

# 0. Install packages

In [None]:
#!pip install datasets

In [None]:
%pip install soundfile

In [None]:
%pip install datasets\[audio\]

In [None]:
#check numpy version is 1.24
import numpy as np
np.__version__

In [None]:
!pip show matplotlib

# 1. CLAP on the Urban Sounds Amsterdam dataset

Source: https://huggingface.co/laion/larger_clap_general

Paper: https://arxiv.org/abs/2211.06687

CLAP (Contrastive Language-Audio Pretraining) is a neural network trained on a variety of (audio, text) pairs. It can be instructed in natural language to predict the most relevant text snippet for a given audio clip, without being directly optimized for that task.
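
To make the zero-shot idea concrete, here is a minimal sketch of the scoring step, assuming the audio clip and every candidate label have already been embedded into the same vector space. The three-dimensional embeddings are made-up toy numbers, not CLAP outputs, and the function name `zero_shot_scores` and the fixed `logit_scale` are illustrative assumptions.

```python
import numpy as np

def zero_shot_scores(audio_emb, text_embs, logit_scale=10.0):
    """Softmax over scaled cosine similarities between one audio embedding and each text embedding."""
    a = audio_emb / np.linalg.norm(audio_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = logit_scale * (t @ a)            # one similarity score per candidate label
    exp = np.exp(logits - logits.max())       # subtract the max for numerical stability
    return exp / exp.sum()

labels = ["Gunshot", "Moped", "Music"]
# Hypothetical embeddings, for illustration only
text_embs = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
audio_emb = np.array([0.1, 0.9, 0.2])

probs = zero_shot_scores(audio_emb, text_embs)
best = labels[int(np.argmax(probs))]
print(best, probs.round(3))
```

The label whose text embedding points in the direction closest to the audio embedding gets the highest score, which is why new labels can be tried without retraining.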



### Inspection of the dataset

In [None]:
from datasets import load_dataset

dataset = load_dataset("UrbanSounds/urban_sounds_small")

In [None]:
#inspect the dataset
dataset

In [None]:
#Inspect one sample from the training split
example = dataset['train'][0]['audio']
label = dataset['train'][0]['label']

You may notice that the audio column contains several features. Here’s what they are:

- path: the path to the downloaded (and converted) audio file.
- array: the decoded audio data, represented as a 1-dimensional NumPy array.
- sampling_rate: the sampling rate of the audio file.
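
As a small illustration of how these three fields relate, the sketch below builds a stand-in sample by hand and derives the clip duration from `array` and `sampling_rate`. The path, the 16 kHz rate and the sine wave are assumptions chosen for the example, not values from the dataset.

```python
import numpy as np

# Hypothetical sample mimicking the structure of one entry in the audio column
sampling_rate = 16000                                  # assumed rate, in Hz
t = np.arange(sampling_rate) / sampling_rate           # one second of timestamps
sample = {
    "path": "example.wav",                             # placeholder path
    "array": np.sin(2 * np.pi * 440 * t),              # 440 Hz sine wave
    "sampling_rate": sampling_rate,
}

# The duration in seconds is the number of samples divided by the sampling rate
duration_s = len(sample["array"]) / sample["sampling_rate"]
print(f"{len(sample['array'])} samples at {sample['sampling_rate']} Hz = {duration_s:.2f} s")
```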

In [None]:
example

In [None]:
#print the label data
print(dataset['train']['label'])
print(len(dataset['train']['label']))

In [None]:
#inspecting the audio array
array = dataset["train"]["audio"][0]["array"]
sampling_rate = example["sampling_rate"]
print(array.shape)
print(array)
print(type(array))
print(sampling_rate)

In [None]:
import librosa
import matplotlib.pyplot as plt
import librosa.display

plt.figure().set_figwidth(10)
librosa.display.waveshow(array, sr=sampling_rate)

In [None]:
#the display script
from IPython.display import Audio

Audio(example["array"], rate=example['sampling_rate'])

In [None]:
#check length of dataset
print(len(dataset['train']['audio']))
print(type(dataset['train']['audio']))

In [None]:
#Script to classify a randomly chosen sample from the dataset
from transformers import ClapModel, ClapProcessor
from datasets import load_dataset
from transformers import pipeline
import IPython
import random

#pick a random valid index (randrange excludes the upper bound, unlike randint)
random_number = random.randrange(len(dataset['train']))

example = dataset['train'][random_number]['audio']
audio = example['array']

audio_classifier = pipeline(task="zero-shot-audio-classification", model="laion/larger_clap_music_and_speech")
output = audio_classifier(audio, candidate_labels=["Motorcycle", "Moped", 'Claxon','Alarm', 'Silence','Loud people','Talking','Gunshot', 'Slamming door','Music'])
print(output[0],'\n',output[1])
print(random_number)
IPython.display.Audio(example["array"], rate=example['sampling_rate'])
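
A note on sampling rates: when a raw NumPy array is passed to the pipeline, the sampling rate is not part of the input, so the array is interpreted at the rate the model's feature extractor is configured for (48 kHz for the CLAP checkpoints, to my knowledge). If the dataset audio has a different rate, it should be resampled first. The sketch below shows the idea with plain linear interpolation. `resample_linear` is a hypothetical helper and the 16 kHz source rate is an assumption for the example.

```python
import numpy as np

def resample_linear(array, orig_sr, target_sr):
    """Resample a 1-D signal by linear interpolation (no anti-aliasing filter)."""
    if orig_sr == target_sr:
        return array
    duration = len(array) / orig_sr
    n_target = int(round(duration * target_sr))
    old_times = np.arange(len(array)) / orig_sr        # timestamps of the original samples
    new_times = np.arange(n_target) / target_sr        # timestamps of the resampled signal
    return np.interp(new_times, old_times, array)

# Hypothetical one-second clip at 16 kHz, resampled to 48 kHz
clip = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
clip_48k = resample_linear(clip, orig_sr=16000, target_sr=48000)
print(len(clip), len(clip_48k))
```

For real use, a proper resampler is the better choice, for example `librosa.resample` or casting the column with `dataset.cast_column("audio", Audio(sampling_rate=48000))` from the `datasets` library.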

## CLAP on the urban_sounds_small dataset

In [None]:
from datasets import load_dataset

dataset = load_dataset("UrbanSounds/urban_sounds_small")

In [None]:
#create a dictionary that converts the numeric class labels to readable names
label_dict ={0:'Gunshot', 1:'Moped alarm', 2:'Moped', 3:'Claxon', 4:'Slamming door', 5:'Screaming', 6:'Motorcycle', 7:'Talking', 8:'Music'}
print('The given labels are: ')
for i in range(0,9):
    print(label_dict[i])

In [None]:
#Choose the index of the item to classify (must be < 223)
i = 20

In [None]:
#larger_clap_general
from transformers import ClapModel, ClapProcessor
from transformers import pipeline
from datasets import load_dataset
import IPython

#dataset = load_dataset("MichielBontenbal/UrbanSounds")

example=dataset['train']['audio'][i]
audio = dataset["train"]["audio"][i]['array']

audio_classifier = pipeline(task="zero-shot-audio-classification", model="laion/larger_clap_general")
output = audio_classifier(audio, candidate_labels=["Gunshot", "Moped", 'Moped alarm','Claxon','Screaming', 'Motorcycle','Talking', 'Slamming door','Music', 'Silence'])

predicted_label = output[0]['label']
print(f'Predicted label: {predicted_label}')

label_name =label_dict[dataset['train']['label'][i]]
print(f'The given label: {label_name}')

if label_name == output[0]['label']:
    print("This is correct")
else:
    print('This is false')
print(f'Probability: {round(output[0]["score"],3)}')

IPython.display.Audio(example['array'], rate=example['sampling_rate'])

## Code as a function (same code as above)

In [None]:
#the code above as a function
from transformers import pipeline
from datasets import load_dataset
import IPython
from IPython.display import Audio
from IPython.display import display #use display to create an audio player inside a function
import pandas as pd

dataset = load_dataset("UrbanSounds/urban_sounds_small")

def process_audio(i, dataset):
    example = dataset['train']['audio'][i]
    audio = dataset["train"]["audio"][i]['array']
    return example, audio

def classify_audio(audio, model="laion/larger_clap_general"):
    audio_classifier = pipeline(task="zero-shot-audio-classification", model=model)
    output_var = audio_classifier(audio, candidate_labels=["Gunshot", "Moped", 'Moped alarm','Claxon','Screaming', 'Motorcycle','Talking', 'Slamming door','Music', 'Silence'])
    print(output_var[0])
    return output_var

def display_results(output, i, dataset, label_dict):
    predicted_label = output[0]['label']
    print(f'Predicted label: {predicted_label}')
    
    label_name = label_dict[dataset['train']['label'][i]]
    print(f'The given label: {label_name}')
    
    if label_name == output[0]['label']:
        print("This is correct")
    else:
        print('This is wrong')

    probability = output[0]['score']
    print(f'Probability: {round(probability,3)}')
    #IPython.display.display.Audio(dataset['train']['audio'][i]['array'], rate=dataset['train']['audio'][i]['sampling_rate'])

    display(Audio(dataset['train']['audio'][i]['array'], rate=dataset['train']['audio'][i]['sampling_rate']))
    
    return predicted_label, label_name, probability

def main(i):
    #dataset = load_audio_dataset()
    
    # Replace 'i' with the appropriate index
    #i = 0
    
    example, audio = process_audio(i, dataset)
    output = classify_audio(audio)
    display_results(output, i, dataset, label_dict)


In [None]:
main(0)
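
`classify_audio` builds a new pipeline, and therefore reloads the model, on every call. A possible refinement, sketched here under the assumption that the same checkpoint is reused for many clips, is to cache the pipeline object. `build_classifier` and `get_classifier` are hypothetical helper names, not part of the notebook or of transformers.

```python
from functools import lru_cache

def build_classifier(model_name):
    """Create the zero-shot pipeline; this is the slow step that loads the model weights."""
    from transformers import pipeline  # imported lazily so defining the helpers stays cheap
    return pipeline(task="zero-shot-audio-classification", model=model_name)

@lru_cache(maxsize=None)
def get_classifier(model_name="laion/larger_clap_general"):
    """Return one shared pipeline per model name, building it only on the first call."""
    return build_classifier(model_name)
```

Inside `classify_audio`, the line that creates the pipeline could then become `audio_classifier = get_classifier(model)`, so that looping over the whole dataset loads the model once instead of once per clip.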

In [None]:
import threading
import psutil
import time

# KeyboardInterrupt is only raised in the main thread, so the worker threads
# watch a shared Event to know when to stop.
stop_event = threading.Event()

def cpu_usage_monitor(interval=1):
    """
    Prints the CPU usage every 'interval' seconds until stop_event is set.
    """
    while not stop_event.is_set():
        cpu_usage = psutil.cpu_percent(interval=interval)
        print(f"CPU Usage: {cpu_usage}%")
    print("CPU monitoring stopped.")

def another_task():
    """
    Repeatedly classifies item i of the dataset until stop_event is set.
    """
    while not stop_event.is_set():
        example, audio = process_audio(i, dataset)
        output = classify_audio(audio)
        display_results(output, i, dataset, label_dict)
        time.sleep(2)  # Example delay
    print("Another task stopped.")

# Create threads
thread_cpu_monitor = threading.Thread(target=cpu_usage_monitor, args=(1,))
thread_another_task = threading.Thread(target=another_task)

# Start threads
thread_cpu_monitor.start()
thread_another_task.start()

# Run until the cell is interrupted, then signal both threads to stop and wait for them
try:
    while thread_another_task.is_alive():
        time.sleep(0.5)
except KeyboardInterrupt:
    pass
finally:
    stop_event.set()
    thread_cpu_monitor.join()
    thread_another_task.join()


In [None]:
print(output)

print(50 * '-')
print(output[0])

In [None]:
def predict_label(output):
    predicted_label = output[0]['label']
    print(f'Predicted label: {predicted_label}')
    return predicted_label

predict_label(output)

In [None]:
def get_given_label(dataset, i):
    label_name = label_dict[dataset['train']['label'][i]]
    print(f'The given label: {label_name}')
    return label_name

get_given_label(dataset, i)

In [None]:
def get_prob(output):
    probability = output[0]['score']
    print(probability)
    return probability

get_prob(output)

In [None]:
predicted_label = predict_label(output)
given_label = get_given_label(dataset, i)
probability = get_prob(output)

In [None]:
print(predicted_label)
print(given_label)
print(probability)

In [None]:
#the code above as a function
from transformers import pipeline
from datasets import load_dataset
import IPython
from IPython.display import Audio
from IPython.display import display #use display to create a audio player in a function
import pandas as pd

dataset = load_dataset("UrbanSounds/urban_sounds_small")

def process_audio(i, dataset):
    example = dataset['train']['audio'][i]
    audio = dataset["train"]["audio"][i]['array']
    return example, audio

def classify_audio(audio, model="laion/larger_clap_general"):
    audio_classifier = pipeline(task="zero-shot-audio-classification", model=model)
    output = audio_classifier(audio, candidate_labels=["Gunshot", "Moped", 'Moped alarm','Claxon','Screaming', 'Motorcycle','Talking', 'Slamming door','Music', 'Silence'])
    return output

def predict_label(output):
    predicted_label = output[0]['label']
    print(f'Predicted label: {predicted_label}')
    return predicted_label
    
def get_given_label(dataset, i):
    label_name = label_dict[dataset['train']['label'][i]]
    print(f'The given label: {label_name}')
    return label_name
    
def get_prob(output):
    probability = output[0]['score']
    print(probability)
    return probability

def display_audio(dataset, i):
    display(Audio(dataset['train']['audio'][i]['array'], rate=dataset['train']['audio'][i]['sampling_rate']))

def main2(i):
    #classify item i, print the predicted and given labels, and play the audio
    example, audio = process_audio(i, dataset)
    output = classify_audio(audio)
    predicted_label = predict_label(output)
    given_label = get_given_label(dataset, i)
    probability = get_prob(output)
    display_audio(dataset, i)
    print(output)
    print('done')
    return output, predicted_label

In [None]:
main2(1)

In [None]:
result_list=[]
for i in range(len(dataset['train'])):
    example, audio = process_audio(i, dataset)
    output = classify_audio(audio)
    predicted_label = predict_label(output)
    given_label = get_given_label(dataset, i)
    probability = get_prob(output)
    result_list.append([predicted_label, given_label, probability])
print(result_list)
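
Each row of `result_list` has the form `[predicted_label, given_label, probability]`, so an overall and a per-class accuracy can be derived from it. The sketch below shows one way to do that. `accuracy_report` is a hypothetical helper, and the rows in `demo_results` are invented for illustration, not results from the model.

```python
from collections import defaultdict

def accuracy_report(results):
    """Return overall accuracy and per-class accuracy for rows of [predicted, given, probability]."""
    correct = 0
    per_class = defaultdict(lambda: [0, 0])    # given label -> [correct, total]
    for predicted, given, _prob in results:
        hit = predicted == given
        correct += hit
        per_class[given][0] += hit
        per_class[given][1] += 1
    overall = correct / len(results) if results else 0.0
    by_class = {label: c / n for label, (c, n) in per_class.items()}
    return overall, by_class

# Hypothetical rows, for illustration only
demo_results = [
    ["Moped", "Moped", 0.91],
    ["Music", "Talking", 0.40],
    ["Gunshot", "Gunshot", 0.75],
    ["Moped", "Motorcycle", 0.55],
]
overall, by_class = accuracy_report(demo_results)
print(overall, by_class)
```

Calling `accuracy_report(result_list)` after the loop above would give the same summary for the real predictions.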

In [None]:
import json

# Specify the file path where you want to save the JSON file
file_path = 'result_list.json'

# Open the file in write mode and use json.dump to write the list to the file
with open(file_path, 'w') as json_file:
    json.dump(result_list, json_file)

In [None]:
%pycat result_list.json

In [None]:
print(example)
print(audio)
print(output[0])
print(output[0]['label'])
print(predicted_label)
print(given_label)
print(probability)

In [None]:
for i in range(0, 222, 20):
    main(i)

# 2. Audio Spectrogram Transformer with Urban Sounds Amsterdam Dataset

Model: https://huggingface.co/MIT/ast-finetuned-audioset-10-10-0.4593

Paper: https://arxiv.org/abs/2104.01778


In [None]:
from transformers import AutoFeatureExtractor, ASTForAudioClassification
from datasets import load_dataset, Audio
import torch

dataset = load_dataset("UrbanSounds/urban_sounds_small")
# resample to 16 kHz, the rate this AST feature extractor expects
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

feature_extractor = AutoFeatureExtractor.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")
model = ASTForAudioClassification.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")

# audio file is decoded (and resampled) on the fly
sample = dataset['train'][1]["audio"]
inputs = feature_extractor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted_class_ids = torch.argmax(logits, dim=-1).item()
predicted_label = model.config.id2label[predicted_class_ids]
print(predicted_label)

# compute loss - target_label is the label with id 0, used here only as an example target
target_label = model.config.id2label[0]
inputs["labels"] = torch.tensor([model.config.label2id[target_label]])
loss = model(**inputs).loss
print(round(loss.item(), 2))
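
The cell above keeps only the single highest-scoring class. Since the model was fine-tuned on AudioSet, whose label set differs from the nine Urban Sounds classes, it can help to look at several top candidates. The sketch below ranks classes by their raw logits. `top_k_labels` is a hypothetical helper, and the logits and label map are made-up stand-ins so the example runs without downloading the model.

```python
import numpy as np

def top_k_labels(logits, id2label, k=3):
    """Return the k (label, logit) pairs with the highest logits, best first."""
    order = np.argsort(logits)[::-1][:k]       # indices sorted from highest to lowest logit
    return [(id2label[int(idx)], float(logits[idx])) for idx in order]

# Hypothetical logits and label map, for illustration only
demo_id2label = {0: "Speech", 1: "Music", 2: "Vehicle", 3: "Gunshot, gunfire"}
demo_logits = np.array([0.2, -1.3, 2.5, 1.1])

print(top_k_labels(demo_logits, demo_id2label, k=2))
```

With the real model, the same function could be called as `top_k_labels(logits[0].numpy(), model.config.id2label)`.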

In [None]:
import IPython
example=dataset['train']["audio"][1]
IPython.display.Audio(example["array"], rate=example['sampling_rate'])