# Audio Analysis and Sound Event Detection Using a 2-Chunk Model
### Author: Rohit Dhanda

## Overview

This script performs audio analysis and sound event detection, primarily to support testing of a two-chunk model. It runs the pre-trained YAMNet deep learning model over an audio file, classifies the detected events, and keeps only the events of interest, such as animal-related sounds, for further classification.

## Libraries and Models

### Libraries:
- **pickle:** Deserializes the class names and the label encoder stored on disk.
- **numpy and pandas:** Essential libraries for data handling and manipulation.
- **soundfile and librosa:** Libraries used to process and handle audio files.
- **tensorflow and tensorflow_hub:** To work with the pre-trained YAMNet model from TensorFlow Hub.

### Models:
- **YAMNet:** A deep learning model pre-trained to classify a wide array of audio events. It is loaded twice: from local weights (`yamnet/yamnet.h5`), used to score each 1-second chunk, and from TensorFlow Hub, used to extract embeddings for the custom model.
- **model_2_79.h5:** A custom-trained model for specific classifications, used in tandem with YAMNet for precise event detection.

## Functions

### 1. load_audio_file
- **Input:** File path of the audio file.
- **Output:** A NumPy array of shape `(1, n_samples)` holding the audio data.
- **Description:** It uses librosa to load an audio file and resample it to 16,000 Hz, the sampling rate YAMNet expects.

### 2. extract_features
- **Input:** The YAMNet model and an array of audio data.
- **Output:** An array with one feature vector per input waveform.
- **Description:** This function uses the YAMNet model to extract an embedding for each frame of a waveform, then returns the mean of those embeddings over all frames.
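
The mean-pooling step can be illustrated without loading YAMNet. The sketch below assumes an embeddings array of shape `(num_frames, 1024)`, the shape YAMNet produces for a single waveform. The random values and the helper name `mean_pool_embeddings` are placeholders for illustration and are not part of the script.

```python
import numpy as np

def mean_pool_embeddings(embeddings):
    """Collapse per-frame embeddings (num_frames, dim) into one vector (dim,)."""
    embeddings = np.asarray(embeddings, dtype=np.float32)
    if embeddings.ndim != 2 or embeddings.shape[0] == 0:
        raise ValueError("expected a non-empty (num_frames, dim) array")
    return embeddings.mean(axis=0)

# Stand-in for YAMNet output: 5 frames of 1024-dimensional embeddings.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(5, 1024)).astype(np.float32)

clip_vector = mean_pool_embeddings(fake_embeddings)
features = np.array([clip_vector])  # one row per audio clip
```

Averaging over frames gives the custom model a fixed-size input regardless of how long the audio segment is.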

### 3. predict_on_audio
- **Input:** The raw bytes of an audio file (for example, the contents of a WAV file).
- **Output:** The top two class predictions along with their corresponding probabilities.
- **Description:** It writes the binary audio data to a temporary file, extracts features from that file, and passes them to the custom model to make predictions.
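
Selecting the top two predictions comes down to sorting the probability vector. The following is a minimal sketch of that selection; the helper name `top_two`, the label list, and the probabilities are hypothetical values chosen for illustration.

```python
import numpy as np

def top_two(probabilities, labels):
    """Return [(label, prob), (label, prob)] for the two highest probabilities."""
    probabilities = np.asarray(probabilities)
    order = np.argsort(probabilities)[::-1][:2]  # indices, highest first
    return [(labels[i], float(probabilities[i])) for i in order]

# Hypothetical labels and model output for one clip.
labels = ["Felis Catus", "Sus_Scrofa", "Rattus Norvegicus"]
probs = [0.92, 0.03, 0.05]
result = top_two(probs, labels)
```

The most likely class comes first in the returned list, followed by the runner-up.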

### 4. sound_event_detection
- **Input:** File path of the audio file.
- **Output:** A pandas dataframe recording the start and end times of detected sound events along with the top two class predictions and their probabilities.
- **Description:**
  - **Step 1:** It divides the audio file into 1-second chunks; a final chunk shorter than 1 second is kept as well.
  - **Step 2:** It runs YAMNet on each chunk and averages the per-frame scores to get one score per class.
  - **Step 3:** If any animal-related class scores at or above the threshold (0.10 in the current code), the chunk is stored in a buffer.
  - **Step 4:** When a non-animal-related chunk follows, the buffered chunks are joined into one segment and the custom model makes the final prediction on it.
  - **Step 5:** It then records the start and end of the sound event (as chunk indices, which equal seconds for 1-second chunks) and the top two class predictions in a dataframe.
  - **Step 6:** If the audio ends while the buffer still holds animal-related chunks, that last segment is processed in the same way.
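
The chunking and buffering logic in the steps above can be sketched independently of the models. This is an illustrative sketch, not the script's own code: `split_into_chunks` and `group_flagged_chunks` are hypothetical helper names, and the boolean `flags` list stands in for the per-chunk YAMNet decision.

```python
def split_into_chunks(samples, chunk_len):
    """Split a sequence into chunks of chunk_len; the last one may be shorter."""
    return [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]

def group_flagged_chunks(flags):
    """Merge consecutive True flags into (start, end) index pairs, end exclusive."""
    events = []
    start = None
    for idx, flagged in enumerate(flags):
        if flagged and start is None:
            start = idx                      # a new event begins here
        elif not flagged and start is not None:
            events.append((start, idx))      # event ended before this chunk
            start = None
    if start is not None:                    # audio ended while an event was open
        events.append((start, len(flags)))
    return events

# Ten samples split into chunks of four: the last chunk holds the remainder.
chunks = split_into_chunks(list(range(10)), 4)

# Stand-in for "chunk contains an animal-related sound" decisions.
flags = [False, True, True, False, False, True]
events = group_flagged_chunks(flags)
```

With 1-second chunks, each `(start, end)` pair reads directly as start and end times in seconds.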

### Animal-Related Classes
A predefined set of labels used to identify and isolate animal-related sounds from the predictions generated by YAMNet.
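
As a rough sketch of how such a label set can be checked against per-class scores, the example below uses a tiny hypothetical class map and made-up scores. The helper name `is_animal_chunk` is an assumption for illustration. Labels that do not appear in the class map are skipped, so a misspelled or non-matching label never triggers a detection.

```python
def is_animal_chunk(scores, class_names, animal_classes, threshold=0.10):
    """Return True if any known animal class scores at or above the threshold."""
    class_names = list(class_names)
    for name in animal_classes:
        if name not in class_names:
            continue  # label absent from the class map: silently ignored
        if scores[class_names.index(name)] >= threshold:
            return True
    return False

# Hypothetical four-class map; the real YAMNet class map is much larger.
class_names = ["Speech", "Dog", "Cat", "Music"]
animal_classes = ["Dog", "Cat", "Pets"]  # "Pets" is not in the map above

barking = is_animal_chunk([0.70, 0.15, 0.01, 0.20], class_names, animal_classes)
talking = is_animal_chunk([0.90, 0.05, 0.02, 0.30], class_names, animal_classes)
```

Because non-matching labels are ignored rather than reported, it is worth confirming that each label in the set matches the class map's display names exactly.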

## Usage

To utilize this script for sound event detection and testing on a two-chunk model:
1. Set the file path of the audio file you wish to analyze in the `filename` variable.
2. Call the `sound_event_detection` function with the `filename` as the parameter.
3. The function will return a pandas dataframe with details of each detected sound event.
4. Print the dataframe to visualize the results, noting each sound event's start and end times, and the top two predictions with their probabilities.
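
Once the dataframe is available, its rows can be post-processed. The sketch below works on a plain list of dictionaries, which is what `df.to_dict('records')` returns, so it runs without the models. The helper name `summarize_events` and the 0.9 confidence cutoff are assumptions for illustration; the sample rows reuse the column names produced by `sound_event_detection`.

```python
def summarize_events(rows, min_confidence=0.9):
    """Keep confident events and add a duration in chunks (seconds here)."""
    summary = []
    for row in rows:
        if row["echonet_confidence_1"] >= min_confidence:
            summary.append({
                "label": row["echonet_label_1"],
                "start_time": row["start_time"],
                "duration": row["end_time"] - row["start_time"],
            })
    return summary

# Rows in the same layout as the dataframe returned by sound_event_detection.
rows = [
    {"start_time": 4, "end_time": 9,
     "echonet_label_1": "Strepera Graculina", "echonet_confidence_1": 0.999611},
    {"start_time": 20, "end_time": 22,
     "echonet_label_1": "Felis Catus", "echonet_confidence_1": 0.966280},
    {"start_time": 32, "end_time": 34,
     "echonet_label_1": "Rattus Norvegicus", "echonet_confidence_1": 0.618285},
]
confident = summarize_events(rows)
```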

## Note
- Ensure the necessary files and models are correctly loaded at the beginning of the script.
- Adjust the `threshold` value in the `sound_event_detection` function (0.10 in the current code) as necessary: raising it flags fewer chunks as animal-related, lowering it flags more.


In [None]:
import pickle
import numpy as np
import pandas as pd
import soundfile as sf
import yamnet.params as params
import yamnet.yamnet as yamnet_model
import librosa
from tensorflow.keras.models import load_model
import tempfile
import tensorflow_hub as hub

In [None]:
# Load the necessary data and models
with open('yamnet/class_names.pkl', 'rb') as f:
    class_names = pickle.load(f)

with open('yamnet/label_encoder.pkl', 'rb') as f:
    le = pickle.load(f)

yamnet = yamnet_model.yamnet_frames_model(params)
yamnet.load_weights('yamnet/yamnet.h5')
yamnet_classes = yamnet_model.class_names('yamnet/yamnet_class_map.csv')
model = load_model('models/model_2_79.h5')

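# Load the TF Hub YAMNet model, which is used to extract embeddings.
# NOTE: this rebinds `yamnet_model`, which until now referred to the imported
# yamnet module; the module must not be used after this point.
yamnet_model_handle = 'https://tfhub.dev/google/yamnet/1'
yamnet_model = hub.load(yamnet_model_handle)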


In [None]:

def load_audio_file(file_path):
    # Load as mono and resample to 16 kHz, the rate YAMNet expects.
    wav, sr = librosa.load(file_path, sr=16000)
    return np.array([wav])  # shape (1, n_samples)

def extract_features(model, X):
    # One mean-pooled YAMNet embedding per waveform in X.
    features = []
    for wav in X:
        scores, embeddings, spectrogram = model(wav)
        features.append(embeddings.numpy().mean(axis=0))
    return np.array(features)

def predict_on_audio(binary_audio):
    # Write the bytes to a temporary file so librosa can decode them;
    # the directory and the file are removed when the block exits.
    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp_path = f"{tmp_dir}/clip.wav"
        with open(tmp_path, 'wb') as f:
            f.write(binary_audio)
        X_new = load_audio_file(tmp_path)

    X_new_features = extract_features(yamnet_model, X_new)
    predictions = model.predict(X_new_features)

    # Indices of the two highest probabilities, best first.
    top_two_prob_indices = np.argsort(predictions[0])[::-1][:2]
    return [(class_names[i], predictions[0][i]) for i in top_two_prob_indices]

In [6]:


def sound_event_detection(filepath):
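    # YAMNet expects 16 kHz audio, so resample on load; at 48 kHz each
    # "1-second" chunk would be interpreted by YAMNet as 3 seconds.
    data, sr = librosa.load(filepath, sr=16000)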
    frame_len = int(sr * 1)
    num_chunks = len(data) // frame_len
    chunks = [data[i*frame_len:(i+1)*frame_len] for i in range(num_chunks)]

    # Adding the last chunk which can be less than 1 second
    last_chunk = data[num_chunks*frame_len:]
    if len(last_chunk) > 0:
        chunks.append(last_chunk)

    animal_related_classes = [
        'Dog', 'Cat', 'Bird', 'Animal', 'Birdsong', 'Canidae', 'Feline', 'Livestock',
        'Rodents, Mice', 'Wild animals', 'Pets', 'Frogs', 'Insect', 'Snake',
        'Domestic animals, pets', 'crow'
    ]
    threshold = 0.10  # minimum YAMNet score for a chunk to count as animal-related

    def classify_segment(buffer, start, end):
        # Join the buffered chunks, round-trip them through a WAV file and
        # classify the whole segment with the custom model.
        segment_data = np.concatenate(buffer, axis=1)[0]
        with tempfile.TemporaryDirectory() as tmp_dir:
            tmp_path = f"{tmp_dir}/segment.wav"
            sf.write(tmp_path, segment_data, sr)
            with open(tmp_path, 'rb') as binary_file:
                top2_predictions = predict_on_audio(binary_file.read())

        df_row = {'start_time': start, 'end_time': end}
        for i, pred in enumerate(top2_predictions[:2]):
            df_row[f'echonet_label_{i+1}'] = pred[0]
            df_row[f'echonet_confidence_{i+1}'] = pred[1]
        return df_row

    df_rows = []
    buffer = []
    start_time = None
    for cnt, frame_data in enumerate(chunks):
        frame_data = np.reshape(frame_data, (1, -1))  # batch of one waveform
        outputs = yamnet(frame_data)
        yamnet_prediction = np.mean(outputs[0], axis=0)  # average scores over frames
        # Labels missing from the YAMNet class map are skipped.
        is_animal = any(
            yamnet_prediction[np.where(yamnet_classes == cls)[0][0]] >= threshold
            for cls in animal_related_classes if cls in yamnet_classes
        )
        if is_animal:
            if start_time is None:
                start_time = cnt
            buffer.append(frame_data)
        elif start_time is not None:
            # A non-animal chunk closes the current event.
            df_rows.append(classify_segment(buffer, start_time, cnt))
            buffer = []
            start_time = None

    # Handling the case where the last chunk contains an animal-related sound
    if start_time is not None:
        df_rows.append(classify_segment(buffer, start_time, len(chunks)))

    return pd.DataFrame(df_rows)


# Use the function
filename = 'test5.m4a'
df = sound_event_detection(filename)
print(df)


   start_time  end_time     echonet_label_1  echonet_confidence_1  \
0           4         9  Strepera Graculina              0.999611   
1          20        22         Felis Catus              0.966280   
2          32        34   Rattus Norvegicus              0.618285   
3          51        56          Sus_Scrofa              0.673910   
4          63        66          Sus_Scrofa              0.668935   
5          67        69         Felis Catus              0.919543   

           echonet_label_2  echonet_confidence_2  
0  Colluricincla Harmonica              0.000279  
1        Corvus Coronoides              0.024591  
2  Colluricincla Harmonica              0.146041  
3              Felis Catus              0.134276  
4              Felis Catus              0.330683  
5        Rattus Norvegicus              0.035131  




In [7]:
df

Unnamed: 0,start_time,end_time,echonet_label_1,echonet_confidence_1,echonet_label_2,echonet_confidence_2
0,4,9,Strepera Graculina,0.999611,Colluricincla Harmonica,0.000279
1,20,22,Felis Catus,0.96628,Corvus Coronoides,0.024591
2,32,34,Rattus Norvegicus,0.618285,Colluricincla Harmonica,0.146041
3,51,56,Sus_Scrofa,0.67391,Felis Catus,0.134276
4,63,66,Sus_Scrofa,0.668935,Felis Catus,0.330683
5,67,69,Felis Catus,0.919543,Rattus Norvegicus,0.035131
