

### Documentation

#### Overview:
This script performs sound event detection: it identifies segments of an audio file that likely contain animal-related sounds. A pre-trained `YAMNet` model screens the audio for candidate segments, and a second pre-trained model (2-3chunk), loaded from `model_3_78_48000.h5`, classifies each candidate.
### Author: Rohit Dhanda 
---

#### Imports:
- **pickle**: For loading serialized class names and label encoder objects.
- **numpy, pandas**: For numerical and data manipulation.
- **soundfile**: To handle audio file I/O.
- **yamnet**: To access the YAMNet model and its parameters.
- **librosa**: For audio processing tasks.
- **tensorflow.keras**: To load the deep learning model.
- **tempfile**: To create temporary files.
- **tensorflow_hub**: To access TensorFlow Hub models.

---

#### Global Variables and Model Initialization:
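1. **class_names.pkl**: A serialized list of class names for the second-stage classifier.
2. **label_encoder.pkl**: A serialized label encoder that converts class names to integers and back.
3. **yamnet.h5**: Weights for the Keras YAMNet model.
4. **model_3_78_48000.h5**: Pre-trained Keras model for sound classification.
5. **yamnet_class_map.csv**: The YAMNet class map, used to look up the class name behind each YAMNet score.
6. **https://tfhub.dev/google/yamnet/1**: The TensorFlow Hub copy of YAMNet, used to extract embeddings.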

---

#### Functions:

1. **load_audio_file(file_path)**:
    - **Input**: Path to the audio file.
    - **Output**: A NumPy array of shape `(1, n_samples)` holding the waveform.
    - **Purpose**: Loads the audio file using librosa, resampled to 48000 Hz.

2. **extract_features(model, X)**:
    - **Input**: A model and an array X containing audio data samples.
    - **Output**: Extracted feature array.
    - **Purpose**: For each audio sample in X, embeddings are extracted using the given model. The mean of the embeddings is then computed to get a feature vector.

3. **predict_on_audio(binary_audio)**:
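    - **Input**: The raw bytes of an audio file (for example, the contents of a WAV file).
    - **Output**: A list of the top two predictions as `(class name, probability)` pairs, most probable first.
    - **Purpose**: Writes the bytes to a temporary file, loads it with `load_audio_file`, extracts YAMNet embeddings with `extract_features`, and classifies them with the pre-trained Keras model.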

4. **sound_event_detection(filepath)**:
    - **Input**: Path to an audio file.
    - **Output**: A pandas DataFrame with one row per detected segment. The columns are `start_time` and `end_time` (chunk indices, which equal seconds because each chunk is 1 second long; `end_time` is exclusive), plus `echonet_label_1`, `echonet_confidence_1`, `echonet_label_2`, and `echonet_confidence_2` for the top two predictions.
    - **Purpose**: The main function to detect animal-related sounds in an audio file. It divides the audio into 1-second chunks and scores each chunk with the YAMNet model. A chunk is flagged when any animal-related YAMNet class scores at least 0.05. Each run of consecutive flagged chunks is merged into one segment, classified with the second model, and returned as a row of the DataFrame.
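
The chunking and segment-merging logic described for `sound_event_detection` can be sketched without any model. This is a minimal illustration rather than the script's own code: the helper names `split_into_chunks` and `group_flagged_chunks` are hypothetical, and the boolean flags stand in for the per-chunk YAMNet decision.

```python
def split_into_chunks(samples, chunk_len):
    """Split a sequence into consecutive chunks; the last one may be shorter."""
    return [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]


def group_flagged_chunks(flags):
    """Merge runs of consecutive True flags into (start, end) index pairs.

    `end` is exclusive, matching the start_time/end_time columns of the script.
    """
    segments = []
    start = None
    for idx, flagged in enumerate(flags):
        if flagged and start is None:
            start = idx                      # a new run begins
        elif not flagged and start is not None:
            segments.append((start, idx))    # the run ended before this chunk
            start = None
    if start is not None:
        segments.append((start, len(flags)))  # run reaches the end of the audio
    return segments


# Ten samples with a chunk length of four: the last chunk holds only two samples
chunks = split_into_chunks(list(range(10)), 4)
print(chunks)    # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

# Hypothetical per-chunk decisions (True = animal-related sound suspected)
flags = [True, True, False, True, False, False, True]
segments = group_flagged_chunks(flags)
print(segments)  # [(0, 2), (3, 4), (6, 7)]
```

Because `end` is exclusive, `(0, 2)` covers chunks 0 and 1; with 1-second chunks that corresponds to seconds 0 to 2 of the recording.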

---

#### Usage:
At the end of the script, an example usage is provided using the file 'test9.m4a'. The `sound_event_detection` function is called with this file, and the results are printed to the console.
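
Once the results are available, a natural next step is to keep only the confident detections. The sketch below assumes the rows have been converted to plain dictionaries (for example with `df.to_dict('records')`). The helper name `confident_segments` and the 0.9 cut-off are illustrative assumptions, and the three sample rows are copied from the printed output of the script.

```python
def confident_segments(rows, min_confidence):
    """Keep rows whose top prediction meets the threshold and add a duration."""
    kept = []
    for row in rows:
        if row["echonet_confidence_1"] >= min_confidence:
            # start_time and end_time are chunk indices, i.e. seconds
            duration = row["end_time"] - row["start_time"]
            kept.append({**row, "duration_s": duration})
    return kept


# Three rows taken from the printed results of the script
rows = [
    {"start_time": 0, "end_time": 3,
     "echonet_label_1": "Asian Koel", "echonet_confidence_1": 0.562831},
    {"start_time": 40, "end_time": 44,
     "echonet_label_1": "Savanna Nightjar", "echonet_confidence_1": 0.949137},
    {"start_time": 55, "end_time": 56,
     "echonet_label_1": "Australian Brushturkey", "echonet_confidence_1": 0.948759},
]

# 0.9 is an arbitrary example threshold, not a value used by the script
for row in confident_segments(rows, min_confidence=0.9):
    print(row["echonet_label_1"], row["start_time"], row["end_time"], row["duration_s"])
```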

---

**Note**: It's essential to ensure all the necessary files and weights are available in the respective directories as mentioned in the script before running it.
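
A quick pre-flight check can catch missing files before the models are loaded. The following is a minimal sketch: the helper `find_missing` is hypothetical, and the paths are the relative ones the script opens, assumed to be resolved against the current working directory.

```python
from pathlib import Path

# Relative paths opened by the script
REQUIRED_FILES = [
    "yamnet/class_names.pkl",
    "yamnet/label_encoder.pkl",
    "yamnet/yamnet.h5",
    "yamnet/yamnet_class_map.csv",
    "models/model_3_78_48000.h5",
]


def find_missing(paths, root="."):
    """Return the entries of `paths` that are not existing files under `root`."""
    return [p for p in paths if not (Path(root) / p).is_file()]


missing = find_missing(REQUIRED_FILES)
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All required files are present.")
```

The TensorFlow Hub model is downloaded at load time, so it is not part of this check.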

In [8]:
import pickle
import numpy as np
import pandas as pd
import soundfile as sf
import yamnet.params as params
import yamnet.yamnet as yamnet_model
import librosa
from tensorflow.keras.models import load_model
import tempfile
import tensorflow_hub as hub

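# Load the class names and label encoder for the second-stage classifier
with open('yamnet/class_names.pkl', 'rb') as f:
    class_names = pickle.load(f)

with open('yamnet/label_encoder.pkl', 'rb') as f:
    le = pickle.load(f)

# Keras YAMNet (frame-level class scores), used to screen each 1-second chunk
yamnet = yamnet_model.yamnet_frames_model(params)
yamnet.load_weights('yamnet/yamnet.h5')
yamnet_classes = yamnet_model.class_names('yamnet/yamnet_class_map.csv')

# Second-stage classifier that labels the detected segments
model = load_model('models/model_3_78_48000.h5')

# TensorFlow Hub YAMNet, used to extract embeddings in extract_features().
# Note: this rebinds `yamnet_model`, which referred to the yamnet.yamnet module above.
yamnet_model_handle = 'https://tfhub.dev/google/yamnet/1'
yamnet_model = hub.load(yamnet_model_handle)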

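def load_audio_file(file_path):
    """Load an audio file as a (1, n_samples) array resampled to 48 kHz."""
    wav, sr = librosa.load(file_path, sr=48000)
    return np.array([wav])

def extract_features(model, X):
    """Return one mean-pooled embedding vector per waveform in X."""
    features = []
    for wav in X:
        # The TF Hub YAMNet model returns (scores, embeddings, log-mel spectrogram)
        scores, embeddings, spectrogram = model(wav)
        # Average the per-frame embeddings over time
        features.append(embeddings.numpy().mean(axis=0))
    return np.array(features)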

def predict_on_audio(binary_audio):
    """Classify raw audio bytes; return the top two (class name, probability) pairs."""
    # librosa needs a file path, so write the bytes to a temporary WAV file.
    # The temporary directory and its contents are deleted on exit.
    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp_path = f"{tmp_dir}/audio.wav"
        with open(tmp_path, 'wb') as f:
            f.write(binary_audio)
        X_new = load_audio_file(tmp_path)

    X_new_features = extract_features(yamnet_model, X_new)
    predictions = model.predict(X_new_features)

    # Indices of the two highest probabilities, most probable first
    top_two_prob_indices = np.argsort(predictions[0])[-2:][::-1]
    return [(class_names[i], predictions[0][i]) for i in top_two_prob_indices]

def sound_event_detection(filepath):
    """Return a DataFrame of segments in which animal-related sounds were detected."""
    data, sr = librosa.load(filepath, sr=48000)

    # Split the signal into 1-second chunks; the last chunk may be shorter
    frame_len = int(sr * 1)
    chunks = [data[i:i + frame_len] for i in range(0, len(data), frame_len)]

    animal_related_classes = [
        'Dog', 'Cat', 'Bird', 'Animal', 'Birdsong', 'Canidae', 'Feline', 'Livestock',
        'Rodents, Mice', 'Wild animals', 'Pets', 'Frogs', 'Insect', 'Snake',
        'Domestic animals, pets', 'crow'
    ]
    # Score indices of the listed classes that exist in the YAMNet class map
    animal_indices = [np.where(yamnet_classes == cls)[0][0]
                      for cls in animal_related_classes if cls in yamnet_classes]
    threshold = 0.05

    def classify_segment(buffer, start, end):
        # Join the buffered chunks, write them to a temporary WAV file and
        # classify the segment with the second-stage model
        segment_data = np.concatenate(buffer, axis=1)[0]
        with tempfile.TemporaryDirectory() as tmp_dir:
            tmp_path = f"{tmp_dir}/segment.wav"
            sf.write(tmp_path, segment_data, sr)
            with open(tmp_path, 'rb') as binary_file:
                top2_predictions = predict_on_audio(binary_file.read())

        df_row = {'start_time': start, 'end_time': end}
        for i, (label, confidence) in enumerate(top2_predictions[:2]):
            df_row[f'echonet_label_{i+1}'] = label
            df_row[f'echonet_confidence_{i+1}'] = confidence
        return df_row

    df_rows = []
    buffer = []
    start_time = None
    for cnt, frame_data in enumerate(chunks):
        frame_data = np.reshape(frame_data, (1, -1))  # shape (1, n_samples)
        outputs = yamnet(frame_data)
        yamnet_prediction = np.mean(outputs[0], axis=0)  # mean class scores over frames
        if any(yamnet_prediction[idx] >= threshold for idx in animal_indices):
            if start_time is None:
                start_time = cnt
            buffer.append(frame_data)
        elif start_time is not None:
            # The run of animal-related chunks has ended: classify it
            df_rows.append(classify_segment(buffer, start_time, cnt))
            buffer = []
            start_time = None

    # Handle a segment that runs to the end of the audio
    if start_time is not None:
        df_rows.append(classify_segment(buffer, start_time, len(chunks)))

    return pd.DataFrame(df_rows)


# Use the function
filename = 'test9.m4a'
df = sound_event_detection(filename)
print(df)



    start_time  end_time           echonet_label_1  echonet_confidence_1  \
0            0         3                Asian Koel              0.562831   
1            5         8           common pheasant              0.621891   
2           11        12              Spotted Dove              0.632553   
3           15        17                 Dama dama              0.552381   
4           20        21               Felis Catus              0.526123   
5           29        31            Red Junglefowl              0.744855   
6           40        44          Savanna Nightjar              0.949137   
7           45        47            Red Junglefowl              0.749519   
8           55        56    Australian Brushturkey              0.948759   
9           59        60    Australian Brushturkey              0.642718   
10          61        63                Asian Koel              0.637242   
11          68        70    Australian Brushturkey              0.574619   
12          

In [9]:
import pandas as pd

# Set the display options to show all rows
pd.set_option('display.max_rows', None)

# Now print the DataFrame
print(df)

