

### Documentation

#### Overview:
This script performs sound event detection: it identifies segments of an audio file that are likely to contain animal-related sounds. A pre-trained `YAMNet` model flags the candidate segments, and a second pre-trained model (referred to as "2-3chunk") then classifies each flagged segment.
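
The first stage can be pictured as a simple gate on YAMNet's class scores. The sketch below is illustrative only: `has_animal_sound` is a hypothetical helper, the scores are made up, and the class list is a short subset of the one in the script. The 0.2 threshold is the value the script uses.

```python
import numpy as np

# Subset of the animal-related class names used in the script.
ANIMAL_CLASSES = ["Dog", "Cat", "Bird", "Animal"]

def has_animal_sound(frame_scores, class_names, threshold=0.2):
    """Return True if any animal-related class has a mean score >= threshold."""
    # Average the per-frame scores so each class gets one value per chunk.
    mean_scores = np.asarray(frame_scores).mean(axis=0)
    index_of = {name: i for i, name in enumerate(class_names)}
    return any(
        mean_scores[index_of[name]] >= threshold
        for name in ANIMAL_CLASSES
        if name in index_of
    )

# Toy input: 2 frames x 3 classes, with made-up scores.
toy_classes = ["Speech", "Dog", "Music"]
toy_scores = [[0.1, 0.5, 0.0], [0.3, 0.1, 0.0]]
print(has_animal_sound(toy_scores, toy_classes))
```

Only chunks that pass this gate are handed to the second model, which keeps the more specific classifier from running on silence, speech, or music.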
### Author: Rohit Dhanda 
---

#### Imports:
- **pickle**: For loading serialized class names and label encoder objects.
- **numpy, pandas**: For numerical and data manipulation.
- **soundfile**: To handle audio file I/O.
- **yamnet_dir.params, yamnet_dir.yamnet**: To access the local YAMNet model definition and its parameters.
- **librosa**: For audio processing tasks.
- **tensorflow.keras**: To load the deep learning model.
- **tempfile**: To create temporary files.
- **tensorflow_hub**: To access TensorFlow Hub models.

---

#### Global Variables and Model Initialization:
1. **class_names.pkl**: A serialized list of class names.
2. **label_encoder.pkl**: A serialized label encoder which can convert class names to integers and vice versa.
3. **yamnet.h5**: Weights for the YAMNet model.
4. **model_3_82_16000.h5**: Pre-trained Keras model for sound classification.

---

#### Functions:

1. **load_audio_file(file_path)**:
    - **Input**: Path to the audio file.
    - **Output**: A NumPy array of shape (1, n_samples) containing the audio samples.
    - **Purpose**: Loads the audio file using librosa, resampled to 16000 Hz.

2. **extract_features(model, X)**:
    - **Input**: A model and an array X containing audio data samples.
    - **Output**: Extracted feature array.
    - **Purpose**: For each audio sample in X, embeddings are extracted using the given model. The mean of the embeddings is then computed to get a feature vector.

3. **predict_on_audio(binary_audio)**:
    - **Input**: The raw bytes of an audio file (for example, the contents of a WAV file).
    - **Output**: Top two class predictions and their associated probabilities.
    - **Purpose**: To make predictions on the given audio data using the pre-trained model.

4. **sound_event_detection(filepath)**:
    - **Input**: Path to an audio file.
    - **Output**: A pandas DataFrame of the audio segments in which animal-related sounds were detected. Each row represents one segment and contains its start time and end time (as 1-second chunk indices), the top two predicted class labels, and their associated confidences.
    - **Purpose**: The main function to detect animal-related sounds in an audio file. The function first divides the audio into 1-second chunks, detects potential animal sounds in each chunk using the YAMNet model, and then refines the results using another model. Detected segments are then combined and returned as a DataFrame.
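
The chunking, segment-merging, and top-two steps described above can be sketched in isolation. This is a simplified illustration, not the script's exact code: `split_into_chunks`, `merge_flagged_chunks`, and `top_two` are hypothetical helper names, and the per-chunk flags stand in for the YAMNet decision.

```python
import numpy as np

def split_into_chunks(samples, sample_rate, chunk_seconds=1.0):
    """Split a 1-D signal into fixed-length chunks; the last one may be shorter."""
    size = int(sample_rate * chunk_seconds)
    return [samples[i:i + size] for i in range(0, len(samples), size)]

def merge_flagged_chunks(flags):
    """Turn per-chunk booleans into (start, end) chunk-index pairs, end exclusive."""
    segments, start = [], None
    for i, flagged in enumerate(flags):
        if flagged and start is None:
            start = i                    # a new segment begins
        elif not flagged and start is not None:
            segments.append((start, i))  # the segment ended before chunk i
            start = None
    if start is not None:                # the audio ended inside a segment
        segments.append((start, len(flags)))
    return segments

def top_two(probabilities, labels):
    """Return the two most probable (label, probability) pairs, best first."""
    order = np.argsort(probabilities)[::-1][:2]
    return [(labels[i], float(probabilities[i])) for i in order]

chunks = split_into_chunks(np.zeros(40000), 16000)
print([len(c) for c in chunks])
print(merge_flagged_chunks([True, True, False, True]))
print(top_two(np.array([0.1, 0.7, 0.2]), ["a", "b", "c"]))
```

Because each chunk is one second long, the chunk indices in each `(start, end)` pair can be read directly as times in seconds, which is how the `start_time` and `end_time` columns are filled.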

---

#### Usage:
At the end of the script, an example call runs `sound_event_detection` on the file 'yamnet_dir/cat-ul-goat.wav' and prints the resulting DataFrame to the console.

---

**Note**: It's essential to ensure all the necessary files and weights are available in the respective directories as mentioned in the script before running it.
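
One way to check this up front is a small pre-flight test. The sketch below is an assumption about how such a check could look: `missing_files` is a hypothetical helper, and the file names are the ones the script's loading code opens under `yamnet_dir/`.

```python
from pathlib import Path

# File names opened by the script's loading code (all under yamnet_dir/).
REQUIRED_FILES = [
    "class_names.pkl",
    "label_encoder.pkl",
    "yamnet.h5",
    "yamnet_class_map.csv",
    "model_3_82_16000.h5",
]

def missing_files(base_dir="yamnet_dir", names=REQUIRED_FILES):
    """Return the names from `names` that do not exist as files in `base_dir`."""
    base = Path(base_dir)
    return [name for name in names if not (base / name).is_file()]

missing = missing_files()
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All required files are present.")
```

Running this before loading the models gives a readable list of what is absent instead of a failure partway through initialization.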

In [3]:
import pickle
import numpy as np
import pandas as pd
import soundfile as sf

import librosa

import tempfile
import tensorflow_hub as hub
import tensorflow as tf
from tensorflow.keras.models import load_model

import sys
sys.path.append("yamnet_dir/")
# yamnet related imports
from yamnet_dir import params as params
from yamnet_dir import yamnet as yamnet_model


# Load the necessary data and models
with open('yamnet_dir/class_names.pkl', 'rb') as f:
    class_names = pickle.load(f)

with open('yamnet_dir/label_encoder.pkl', 'rb') as f:
    le = pickle.load(f)

yamnet = yamnet_model.yamnet_frames_model(params)
yamnet.load_weights('yamnet_dir/yamnet.h5')
yamnet_classes = yamnet_model.class_names('yamnet_dir/yamnet_class_map.csv')
model = load_model('yamnet_dir/model_3_82_16000.h5')

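# Load the YAMNet model from TensorFlow Hub; it is used for embedding extraction.
# Note: this rebinds `yamnet_model`, which until now referred to the local yamnet module.
yamnet_model_handle = 'https://tfhub.dev/google/yamnet/1'
yamnet_model = hub.load(yamnet_model_handle)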

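def load_audio_file(file_path):
    # Decode the file and resample to 16 kHz, the rate YAMNet expects.
    wav, sr = librosa.load(file_path, sr=16000)
    # Return a batch of one waveform, shape (1, n_samples).
    return np.array([wav])

def extract_features(model, X):
    features = []
    for wav in X:
        # The TF Hub YAMNet model returns (scores, embeddings, log-mel spectrogram).
        scores, embeddings, spectrogram = model(wav)
        # Average the per-frame embeddings into one feature vector per waveform.
        features.append(embeddings.numpy().mean(axis=0))
    return np.array(features)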

def predict_on_audio(binary_audio):
    # Write the raw bytes to a temporary WAV file so librosa can decode them.
    # delete=False keeps the file on disk after the handle is closed.
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as temp_audio_file:
        temp_audio_file.write(binary_audio)

    X_new = load_audio_file(temp_audio_file.name)
    X_new_features = extract_features(yamnet_model, X_new)

    predictions = model.predict(X_new_features)
    # Indices of the two highest probabilities, most probable first.
    top_two_prob_indices = np.argsort(predictions[0])[::-1][:2]

    return [(class_names[i], predictions[0][i]) for i in top_two_prob_indices]

def sound_event_detection(filepath):
    data, sr = librosa.load(filepath, sr=16000)
    frame_len = int(sr * 1)
    num_chunks = len(data) // frame_len
    chunks = [data[i*frame_len:(i+1)*frame_len] for i in range(num_chunks)]

    # Adding the last chunk which can be less than 1 second
    last_chunk = data[num_chunks*frame_len:]
    if len(last_chunk) > 0:
        chunks.append(last_chunk)

    animal_related_classes = [
        'Dog', 'Cat', 'Bird', 'Animal', 'Birdsong', 'Canidae', 'Feline', 'Livestock',
        'Rodents, Mice', 'Wild animals', 'Pets', 'Frogs', 'Insect', 'Snake', 
        'Domestic animals, pets', 'crow'
    ]

    df_rows = []
    buffer = []
    start_time = None
    for cnt, frame_data in enumerate(chunks):
        frame_data = np.reshape(frame_data, (-1,)) # Flatten the array to 1D
        frame_data = np.array([frame_data]) # Wrapping it back into a 2D array
        outputs = yamnet(frame_data)
        yamnet_prediction = np.mean(outputs[0], axis=0)  # average class scores over frames
        threshold = 0.2  # minimum mean score for an animal-related class
        if any(yamnet_prediction[np.where(yamnet_classes == cls)[0][0]] >= threshold for cls in animal_related_classes if cls in yamnet_classes):
            if start_time is None:
                start_time = cnt
            buffer.append(frame_data)
        else:
            if start_time is not None:
                segment_data = np.concatenate(buffer, axis=1)[0]
                with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as temp_audio_file:
                    sf.write(temp_audio_file.name, segment_data, sr)
                    with open(temp_audio_file.name, 'rb') as binary_file:
                        top2_predictions = predict_on_audio(binary_file.read())

                df_row = {'start_time': start_time, 'end_time': cnt}
                
                for i, pred in enumerate(top2_predictions[:2]):
                    df_row[f'echonet_label_{i+1}'] = pred[0]
                    df_row[f'echonet_confidence_{i+1}'] = pred[1]

                df_rows.append(df_row)
                buffer = []
                start_time = None

    # Handling the case where the last chunk contains an animal-related sound
    if start_time is not None:
        segment_data = np.concatenate(buffer, axis=1)[0]
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as temp_audio_file:
            sf.write(temp_audio_file.name, segment_data, sr)
            with open(temp_audio_file.name, 'rb') as binary_file:
                top2_predictions = predict_on_audio(binary_file.read())

        df_row = {'start_time': start_time, 'end_time': len(chunks)}
        
        for i, pred in enumerate(top2_predictions[:2]):
            df_row[f'echonet_label_{i+1}'] = pred[0]
            df_row[f'echonet_confidence_{i+1}'] = pred[1]

        df_rows.append(df_row)

    df = pd.DataFrame(df_rows)
    return df


# Use the function
filename = 'yamnet_dir/cat-ul-goat.wav'
df = sound_event_detection(filename)
print(df)



print(tf.__version__)


   start_time  end_time      echonet_label_1  echonet_confidence_1  \
0           0         3          Felis catus              0.999516   
1           4         5          Felis catus              1.000000   
2           6         8          Felis catus              0.999991   
3          11        12  Uperoleia laevigata              0.721224   
4          15        17         Capra hircus              0.998273   

          echonet_label_2  echonet_confidence_2  
0       Canis lupus dingo          4.835153e-04  
1       Canis lupus dingo          1.608894e-09  
2  Menura novaehollandiae          6.932215e-06  
3           Anas gracilis          2.277969e-01  
4           Anas gracilis          1.726876e-03  
2.13.0




In [1]:
import json
import pickle

import librosa
import numpy as np
import pandas as pd
import soundfile as sf
import tensorflow as tf
import tensorflow_hub as hub

# For yamnet, you might have to import it differently if the submodules are not directly exposed.
# I'm providing a generic approach here.

modules = {
    'pickle': pickle,
    'numpy': np,
    'pandas': pd,
    'soundfile': sf,
    'librosa': librosa,
    'tensorflow': tf,
    'tensorflow_hub': hub
    # 'yamnet.params': params,
    # 'yamnet.yamnet': yamnet_model
}

for name, module in modules.items():
    try:
        print(f"{name}: {module.__version__}")
    except AttributeError:
        print(f"{name}: version not found")

# Set the display options to show all rows
pd.set_option('display.max_rows', None)

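# Now print the DataFrame.
# `df` must already exist in this session (it is created by calling sound_event_detection);
# otherwise this line raises a NameError.
print(df)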


# 1. Read the pickle file
with open('yamnet_dir/class_names.pkl', 'rb') as f:
    data = pickle.load(f)

# Ensure that the data is serializable.
# If your data contains any non-serializable parts, you'll need to handle those separately.

# 2. Convert and save data as JSON
with open('yamnet_dir/class_names.json', 'w') as f:
    json.dump(data, f)




pickle: version not found
numpy: 1.24.3
pandas: 2.1.0
soundfile: 0.12.1
librosa: 0.10.1
tensorflow: 2.13.0
tensorflow_hub: 0.14.0


NameError: name 'df' is not defined
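
The pickle-to-JSON cell notes that non-serializable data has to be handled separately. The source does not say what `class_names.pkl` contains, so the following is only a sketch under the assumption that it may hold NumPy arrays or NumPy scalars. `to_jsonable` is a hypothetical helper passed through the `default` hook of `json.dumps`.

```python
import json
import numpy as np

def to_jsonable(obj):
    """Fallback for json.dump(s): convert NumPy arrays and scalars to built-ins."""
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    if isinstance(obj, np.generic):
        return obj.item()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

# Made-up example data standing in for the unpickled object.
example = {
    "names": np.array(["Felis catus", "Capra hircus"]),
    "count": np.int64(2),
}
text = json.dumps(example, default=to_jsonable)
print(text)
```

The same `default=to_jsonable` argument can be given to `json.dump` when writing to a file. Types the helper does not recognize still raise a `TypeError`, so unexpected content is not silently dropped.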