

## Optimised model and Event Detection

*Author: Rohit and Andrew*

### Overview

This code comprises two primary sections. The first section is a pre-trained model(optmiised model) for predicting animal classes using audio files, while the second utilizes the YAMNet model to detect events in an audio file.

### Components

#### Configuration Parameters

A dictionary, `config`, houses key settings and parameters that are essential for audio processing and the generation of mel-spectrogram images. It includes details like the audio sample rate, clip duration, NFFT settings, mel-spectrogram settings, and model input configurations.

#### Animal Classes

A list, `class_names`, contains the names of the animals that the pre-trained model can predict.

#### Model Loading

The model is loaded from a pre-defined path (`models/echo_model/1/`). This model is used to predict specific animal sounds from a given audio clip.

#### Audio Processing and Prediction Functions

Several helper functions preprocess the audio clips:

- `combined_pipeline()`: Processes an audio clip and returns the mel-spectrogram image.
- `predict_class()`: Takes in model predictions and returns the predicted class name and the probability.
- `load_random_subsection()`: Retrieves a random subsection of an audio clip.
- `predict_on_audio()`: Combines the above functions to predict an animal class given an audio binary.

#### YAMNet Event Detection

The YAMNet model is loaded and used to detect events in audio files. The main aim is to detect the 'Animal' event. This section uses the following main components:

- `yamnet_frames_model`: A YAMNet model designed to predict events from frames of audio data.
- `yamnet_classes`: A list of class names that YAMNet can predict.

### Workflow

1. The code starts by setting up the animal prediction model.
2. An audio file (`test.m4a`) is loaded and divided into 1-second chunks.
3. YAMNet is then used to detect the 'Animal' event in these chunks. Intervals containing the 'Animal' event are stored.
4. For each interval detected by YAMNet, the specific animal sound is predicted using the pre-trained model.
5. The start and end times, YAMNet's prediction, YAMNet's probability, the specific animal prediction, and its associated probability are all logged in a DataFrame (`df`).

### Notes

- Redundant code: A function, `predict_class`, is defined twice. Ensure to remove the redundant definition.
- Unused functions: `audio_to_string` and `string_to_audio` are defined but not used in the provided context.
- File paths: Ensure that paths to YAMNet weights and class maps are correct.
- Image shapes: Ensure that the mel-spectrogram image shape aligns with the model's expectations.

### Conclusion

The code provides a comprehensive way of detecting generic animal sounds using YAMNet and then refines the detection by predicting the specific animal type using a custom model.

--- 

This markdown provides a thorough walkthrough of your code. Adjustments can be made based on specific requirements or additional details.

In [4]:
import numpy as np
import tensorflow as tf
import librosa
import base64
import io
import json


# Make sure to use the correct configuration
config = {
    'AUDIO_SAMPLE_RATE': 48000,
    'AUDIO_CLIP_DURATION': 5,
    'AUDIO_NFFT': 2048,
    'AUDIO_STRIDE': 200,
    'AUDIO_MELS': 260,
    'AUDIO_FMIN': 20,
    'AUDIO_FMAX': 13000,
    'AUDIO_WINDOW': None,
    'AUDIO_TOP_DB': 80,
    'MODEL_INPUT_IMAGE_CHANNELS': 3,
    'MODEL_INPUT_IMAGE_WIDTH': 260,
    'MODEL_INPUT_IMAGE_HEIGHT': 260
}


class_names= ['Aegotheles Cristatus', 'Alauda Arvensis', 'Caligavis Chrysops', 'Capra Hircus', 'Cervus Unicolour', 'Colluricincla Harmonica', 'Corvus Coronoides',
              'Dama Dama', 'Eopsaltria Australis', 'Felis Catus', 'Pachycephala Rufiventris', 'Ptilotula Penicillata', 'Rattus Norvegicus', 'Strepera Graculina', 'Sus Scrofa']


# Load the model
model = tf.keras.models.load_model('models/echo_model/1/')

# Define the preprocessing steps as functions.



#####################################################################################
    # this function is adapted from generic_engine_pipeline.ipynb
    # TODO: need to create a pipeline library and link same code into engine
    ########################################################################################
def combined_pipeline(config, audio_clip):
    # Create a file-like object from the bytes.
    #file = io.BytesIO(audio_clip)
    

    # Load the audio data with librosa
    audio_clip, sample_rate = librosa.load(audio_clip, sr=config['AUDIO_SAMPLE_RATE'])
        
    # keep right channel only
    if audio_clip.ndim == 2 and audio_clip.shape[0] == 2:
        audio_clip = audio_clip[1, :]
        
    # cast to float32 type
    audio_clip = audio_clip.astype(np.float32)
        
    # analyse a random 5 second subsection
    audio_clip = load_random_subsection(audio_clip, duration_secs=config['AUDIO_CLIP_DURATION'])

    # Compute the mel-spectrogram
    image = librosa.feature.melspectrogram(
        y=audio_clip, 
        sr=config['AUDIO_SAMPLE_RATE'], 
        n_fft=config['AUDIO_NFFT'], 
        hop_length=config['AUDIO_STRIDE'], 
        n_mels=config['AUDIO_MELS'],
        fmin=config['AUDIO_FMIN'],
        fmax=config['AUDIO_FMAX'],
        win_length=config['AUDIO_WINDOW'])

    # Optionally convert the mel-spectrogram to decibel scale
    image = librosa.power_to_db(
        image, 
        top_db=config['AUDIO_TOP_DB'], 
        ref=1.0)
        
    # Calculate the expected number of samples in a clip
    expected_clip_samples = int(config['AUDIO_CLIP_DURATION'] * config['AUDIO_SAMPLE_RATE'] / config['AUDIO_STRIDE'])
        
    # swap axis and clip to expected samples to avoid rounding errors
    image = np.moveaxis(image, 1, 0)
    image = image[0:expected_clip_samples,:]
        
    # reshape into standard 3 channels to add the color channel
    image = tf.expand_dims(image, -1)
        
    # most pre-trained model classifer model expects 3 color channels
    image = tf.repeat(image, config['MODEL_INPUT_IMAGE_CHANNELS'], axis=2)
        
    # calculate the image shape and ensure it is correct
    expected_clip_samples = int(config['AUDIO_CLIP_DURATION'] * config['AUDIO_SAMPLE_RATE'] / config['AUDIO_STRIDE'])
    image = tf.ensure_shape(image, [expected_clip_samples, config['AUDIO_MELS'], config['MODEL_INPUT_IMAGE_CHANNELS']])
        
    # note here a high quality LANCZOS5 is applied to resize the image to match model image input size
    image = tf.image.resize(image, (config['MODEL_INPUT_IMAGE_WIDTH'], config['MODEL_INPUT_IMAGE_HEIGHT']), 
                            method=tf.image.ResizeMethod.LANCZOS5)


    # rescale to range [0,1]
    image = image - tf.reduce_min(image) 
    image = image / (tf.reduce_max(image)+0.0000001)
        
    return image, audio_clip, sample_rate



 ########################################################################################
    # Function to predict class and probability given a prediction
    ########################################################################################
def predict_class( predictions):
    # Get the index of the class with the highest predicted probability
    predicted_index = int(tf.argmax(tf.squeeze(predictions)).numpy())
    print(predicted_index, type(predicted_index))

    # Get the class name using the predicted index
    predicted_class = self.class_names[predicted_index]
    # Calculate the predicted probability for the selected class
    predicted_probability = 100.0 * tf.nn.softmax(predictions)[predicted_index].numpy()
    # Round the probability to 2 decimal places
    predicted_probability = round(predicted_probability, 2)
    return predicted_class, predicted_probability

# this method takes in binary audio data and encodes to string
def audio_to_string( audio_binary):
    base64_encoded_data = base64.b64encode(audio_binary)
    base64_message = base64_encoded_data.decode('utf-8')
    return base64_message    


########################################################################################
    # this method takes in string and ecodes to audio binary data
    ########################################################################################
def string_to_audio( audio_string):
    base64_img_bytes = audio_string.encode('utf-8')
    decoded_data = base64.decodebytes(base64_img_bytes)
    return decoded_data
    
def predict_class(predictions):
    predicted_index = int(tf.argmax(tf.squeeze(predictions)).numpy())
    predicted_class = class_names[predicted_index]
    predicted_probability = 100.0 * tf.nn.softmax(predictions)[0, predicted_index].numpy()
    predicted_probability = round(predicted_probability, 2)
    return predicted_class, predicted_probability


def load_random_subsection(audio_clip, duration_secs):
    clip_length = len(audio_clip)
    subsection_length = duration_secs * config['AUDIO_SAMPLE_RATE']
    if clip_length > subsection_length:
        start_idx = np.random.randint(0, clip_length - subsection_length)
        return audio_clip[start_idx:start_idx+subsection_length]
    else:
        return audio_clip


def predict_on_audio(audio_binary):
    # Preprocess the audio to be suitable for your model
    image, audio_clip, sample_rate = combined_pipeline(config, audio_binary)
    
    # Add a dimension to match the model's input shape
    image = tf.expand_dims(image, 0)
    
    # Make the prediction
    predictions = model.predict(image)
    print(predictions.shape, predictions)

    
    # Predict class and probability using the prediction function
    predicted_class, predicted_probability = predict_class(predictions)
    
    print(f'Predicted class: {predicted_class}')
    print(f'Predicted probability: {predicted_probability}')




# Now you can use predict_on_audio function to predict on your audio binary data.


In [21]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import soundfile as sf
import yamnet.params as params
import yamnet.yamnet as yamnet_model
import librosa
import tempfile
from collections import defaultdict
# Load YAMNet model
yamnet = yamnet_model.yamnet_frames_model(params)
yamnet.load_weights('yamnet/yamnet.h5')
yamnet_classes = yamnet_model.class_names('yamnet/yamnet_class_map.csv')

frame_len = int(params.SAMPLE_RATE * 1)  # 1sec
# Read the whole audio file
filename = 'test.m4a'
data, sr = librosa.load(filename, sr=params.SAMPLE_RATE)

# Split the audio data into 1 second chunks
chunks = np.array_split(data, len(data) // frame_len)

intervals = []
current_interval = None

for cnt, frame_data in enumerate(chunks):
    start_time = cnt
    end_time = cnt + 1
    scores, _ = yamnet.predict(np.reshape(frame_data, [1, -1]), steps=1)
    yamnet_prediction = np.mean(scores, axis=0)
    top5_i = np.argsort(yamnet_prediction)[::-1][:5]

    if yamnet_classes[top5_i[0]] == 'Animal' and yamnet_prediction[top5_i[0]] > 0.2:
        if current_interval is None:
            current_interval = {'start': cnt, 'end': cnt+1}
        else:
            current_interval['end'] = cnt+1
    else:
        if current_interval:
            intervals.append(current_interval)
            current_interval = None

if current_interval:
    intervals.append(current_interval)

df = pd.DataFrame(columns=['start_time', 'end_time', 'yamnet_label', 'yamnet_probability', 'your_model_label', 'your_model_probability'])

for interval in intervals:  
    segment_data = data[interval['start']*frame_len : interval['end']*frame_len]

    with tempfile.NamedTemporaryFile(suffix=".wav", delete=True) as temp_audio_file:
        sf.write(temp_audio_file.name, segment_data, params.SAMPLE_RATE)
        
        with open(temp_audio_file.name, 'rb') as binary_file:
            image, _, _ = combined_pipeline(config, binary_file.read())
            # Check the image shape and adjust if necessary
            if image.shape != (500, 64, 3):
                image = pad_tensor_to_shape(image, (500, 64, 3))
            your_model_prediction, your_model_probability = predict_on_audio(binary_file.read())
            
        df = df.append({
            'start_time': interval['start'],
            'end_time': interval['end'],
            'yamnet_label': 'Animal',
            'yamnet_probability': np.mean(yamnet_prediction[top5_i]),
            'your_model_label': your_model_prediction,
            'your_model_probability': your_model_probability
        }, ignore_index=True)

# print the DataFrame
print(df)


  data, sr = librosa.load(filename, sr=params.SAMPLE_RATE)
	Deprecated as of librosa version 0.10.0.
	It will be removed in librosa version 1.0.
  y, sr_native = __audioread_load(path, offset, duration, dtype)




InvalidArgumentError: {{function_node __wrapped__EnsureShape_device_/job:localhost/replica:0/task:0/device:GPU:0}} Shape of tensor input [101,64,3] is not compatible with expected shape [500,64,3]. [Op:EnsureShape] name: 

In [20]:
df


Unnamed: 0,start_time,end_time,yamnet_label,yamnet_probability,your_model_label,your_model_probability


In [21]:
import numpy as np
import tensorflow as tf
import librosa
import base64
import io
import json


# Make sure to use the correct configuration
config = {
    'AUDIO_SAMPLE_RATE': 48000,
    'AUDIO_CLIP_DURATION': 5,
    'AUDIO_NFFT': 2048,
    'AUDIO_STRIDE': 200,
    'AUDIO_MELS': 260,
    'AUDIO_FMIN': 20,
    'AUDIO_FMAX': 13000,
    'AUDIO_WINDOW': None,
    'AUDIO_TOP_DB': 80,
    'MODEL_INPUT_IMAGE_CHANNELS': 3,
    'MODEL_INPUT_IMAGE_WIDTH': 260,
    'MODEL_INPUT_IMAGE_HEIGHT': 260
}


class_names= ['Aegotheles cristatus owlet-nightjar', 'Alauda arvensis European Skylark', 'Caligavis chrysops Yellow-faced honeyeater', 'Capra hircus Feral goat', 'Cervus unicolour Sambar deer', 'Colluricincla harmonica Grey shrikethrush', 'Corvus coronoides Australian raven',
              'Dama dama Fallow Deer', 'Eopsaltria australis Eastern yellow robin', 'Felis Catus Cat', 'Pachycephala rufiventris Rufous whistler', 'Ptilotula penicillata White-plumed honeyeater', 'Rattus norvegicus Brown rat', 'Strepera graculina Pied currawong', 'sus scrofa Wild pig']

# Load the model
model = tf.keras.models.load_model('models/echo_model/1/')

# Define the preprocessing steps as functions.



#####################################################################################
    # this function is adapted from generic_engine_pipeline.ipynb
    # TODO: need to create a pipeline library and link same code into engine
    ########################################################################################
def combined_pipeline(config, audio_clip):

    # Load the audio data with librosa(works only while give direct audio to it)
    #audio_clip, sample_rate = librosa.load(audio_clip, sr=config['AUDIO_SAMPLE_RATE'])
    
    #to use it with yamnet
    file = io.BytesIO(audio_clip)
    audio_clip, sample_rate = librosa.load(file, sr=config['AUDIO_SAMPLE_RATE'])
        
    # keep right channel only
    if audio_clip.ndim == 2 and audio_clip.shape[0] == 2:
        audio_clip = audio_clip[1, :]
        
    # cast to float32 type
    audio_clip = audio_clip.astype(np.float32)
        
    # analyse a random 5 second subsection
    audio_clip = load_random_subsection(audio_clip, duration_secs=config['AUDIO_CLIP_DURATION'])

    # Compute the mel-spectrogram
    image = librosa.feature.melspectrogram(
        y=audio_clip, 
        sr=config['AUDIO_SAMPLE_RATE'], 
        n_fft=config['AUDIO_NFFT'], 
        hop_length=config['AUDIO_STRIDE'], 
        n_mels=config['AUDIO_MELS'],
        fmin=config['AUDIO_FMIN'],
        fmax=config['AUDIO_FMAX'],
        win_length=config['AUDIO_WINDOW'])

    # Optionally convert the mel-spectrogram to decibel scale
    image = librosa.power_to_db(
        image, 
        top_db=config['AUDIO_TOP_DB'], 
        ref=1.0)
        
    # Calculate the expected number of samples in a clip
    expected_clip_samples = int(config['AUDIO_CLIP_DURATION'] * config['AUDIO_SAMPLE_RATE'] / config['AUDIO_STRIDE'])
        
    # swap axis and clip to expected samples to avoid rounding errors
    image = np.moveaxis(image, 1, 0)
    image = image[0:expected_clip_samples,:]
        
    # reshape into standard 3 channels to add the color channel
    image = tf.expand_dims(image, -1)
        
    # most pre-trained model classifer model expects 3 color channels
    image = tf.repeat(image, config['MODEL_INPUT_IMAGE_CHANNELS'], axis=2)
        
    # calculate the image shape and ensure it is correct
    expected_clip_samples = int(config['AUDIO_CLIP_DURATION'] * config['AUDIO_SAMPLE_RATE'] / config['AUDIO_STRIDE'])
    image = tf.ensure_shape(image, [expected_clip_samples, config['AUDIO_MELS'], config['MODEL_INPUT_IMAGE_CHANNELS']])
        
    # note here a high quality LANCZOS5 is applied to resize the image to match model image input size
    image = tf.image.resize(image, (config['MODEL_INPUT_IMAGE_WIDTH'], config['MODEL_INPUT_IMAGE_HEIGHT']), 
                            method=tf.image.ResizeMethod.LANCZOS5)


    # rescale to range [0,1]
    image = image - tf.reduce_min(image) 
    image = image / (tf.reduce_max(image)+0.0000001)
        
    return image, audio_clip, sample_rate



 ########################################################################################
    # Function to predict class and probability given a prediction
    ########################################################################################
def predict_class( predictions):
    # Get the index of the class with the highest predicted probability
    predicted_index = int(tf.argmax(tf.squeeze(predictions)).numpy())
    print(predicted_index, type(predicted_index))

    # Get the class name using the predicted index
    predicted_class = self.class_names[predicted_index]
    # Calculate the predicted probability for the selected class
    predicted_probability = 100.0 * tf.nn.softmax(predictions)[predicted_index].numpy()
    # Round the probability to 2 decimal places
    predicted_probability = round(predicted_probability, 2)
    return predicted_class, predicted_probability

# this method takes in binary audio data and encodes to string
def audio_to_string( audio_binary):
    base64_encoded_data = base64.b64encode(audio_binary)
    base64_message = base64_encoded_data.decode('utf-8')
    return base64_message    


########################################################################################
    # this method takes in string and ecodes to audio binary data
    ########################################################################################
def string_to_audio( audio_string):
    base64_img_bytes = audio_string.encode('utf-8')
    decoded_data = base64.decodebytes(base64_img_bytes)
    return decoded_data
    
def predict_class(predictions):
    predicted_index = int(tf.argmax(tf.squeeze(predictions)).numpy())
    predicted_class = class_names[predicted_index]
    predicted_probability = 100.0 * tf.nn.softmax(predictions)[0, predicted_index].numpy()
    predicted_probability = round(predicted_probability, 2)
    return predicted_class, predicted_probability



def load_random_subsection(audio_clip, duration_secs):
    clip_length = len(audio_clip)
    subsection_length = duration_secs * config['AUDIO_SAMPLE_RATE']
    
    if clip_length > subsection_length:
        start_idx = np.random.randint(0, clip_length - subsection_length)
        return audio_clip[start_idx:start_idx+subsection_length]
    elif clip_length < subsection_length:
        padding = np.zeros(int(subsection_length - clip_length))
        return np.concatenate((audio_clip, padding))
    else:
        return audio_clip



#tis is standartd , works with audio more then 5 sec 

def load_random_subsection(audio_clip, duration_secs):
    clip_length = len(audio_clip)
    subsection_length = duration_secs * config['AUDIO_SAMPLE_RATE']
    if clip_length > subsection_length:
        start_idx = np.random.randint(0, clip_length - subsection_length)
        return audio_clip[start_idx:start_idx+subsection_length]
    else:
        return audio_clip











#works with when you directly give audio to predicyt 


def predict_on_audio(audio_binary):
    # Preprocess the audio to be suitable for your model
    image, audio_clip, sample_rate = combined_pipeline(config, audio_binary)
    
    # Add a dimension to match the model's input shape
    image = tf.expand_dims(image, 0)
    
    # Make the prediction
    predictions = model.predict(image)
    print(predictions.shape, predictions)

    
    # Predict class and probability using the prediction function
    predicted_class, predicted_probability = predict_class(predictions)
    
    #print(f'Predicted class: {predicted_class}')
    #print(f'Predicted probability: {predicted_probability}')
    # Return the results
    return predicted_class, predicted_probability


def predict_on_audio(audio_binary):
    # Preprocess the audio to be suitable for your model
    image, audio_clip, sample_rate = combined_pipeline(config, audio_binary)
    
    # Add a dimension to match the model's input shape
    image = tf.expand_dims(image, 0)
    
    # Make the prediction
    predictions_array = model.predict(image)[0]  # Assuming the model returns 2D array, take the first element
    
    # Pair the class names with the predictions
    paired_predictions = list(zip(class_names, predictions_array))
    
    # Sort the paired predictions based on probability
    sorted_predictions = sorted(paired_predictions, key=lambda x: x[1], reverse=True)
    
    return sorted_predictions[:3]



# Now you can use predict_on_audio function to predict on your audio binary data.


In [67]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import soundfile as sf
import yamnet.params as params
import yamnet.yamnet as yamnet_model
import librosa
import tempfile
from collections import defaultdict
# Load YAMNet model
yamnet = yamnet_model.yamnet_frames_model(params)
yamnet.load_weights('yamnet/yamnet.h5')
yamnet_classes = yamnet_model.class_names('yamnet/yamnet_class_map.csv')

frame_len = int(params.SAMPLE_RATE * 1)  # 1sec
# Read the whole audio file
filename = 'test.m4a'
data, sr = librosa.load(filename, sr=params.SAMPLE_RATE)

# Split the audio data into 1 second chunks
chunks = np.array_split(data, len(data) // frame_len)

intervals = []
current_interval = None

for cnt, frame_data in enumerate(chunks):
    start_time = cnt
    end_time = cnt + 1
    scores, _ = yamnet.predict(np.reshape(frame_data, [1, -1]), steps=1)
    yamnet_prediction = np.mean(scores, axis=0)
    top5_i = np.argsort(yamnet_prediction)[::-1][:5]

    if yamnet_classes[top5_i[0]] == 'Animal' and yamnet_prediction[top5_i[0]] > 0.2:
        if current_interval is None:
            current_interval = {'start': cnt, 'end': cnt+1}
        else:
            current_interval['end'] = cnt+1
    else:
        if current_interval:
            intervals.append(current_interval)
            current_interval = None

if current_interval:
    intervals.append(current_interval)

df = pd.DataFrame(columns=['start_time', 'end_time', 'yamnet_label', 'yamnet_probability', 'your_model_label', 'your_model_probability'])

for interval in intervals:  
    segment_data = data[interval['start']*frame_len : interval['end']*frame_len]

    with tempfile.NamedTemporaryFile(suffix=".wav", delete=True) as temp_audio_file:
        sf.write(temp_audio_file.name, segment_data, params.SAMPLE_RATE)
        with open(temp_audio_file.name, 'rb') as binary_file:
            predicted_class, predicted_probability = predict_on_audio(binary_file.read())
            
        df = df.append({
            'start_time': interval['start'],
            'end_time': interval['end'],
            'yamnet_label': 'Animal',
            'yamnet_probability': np.mean(yamnet_prediction[top5_i]),
            'your_model_label': predicted_class,
            'your_model_probability': predicted_probability
        }, ignore_index=True)

print(df)

  data, sr = librosa.load(filename, sr=params.SAMPLE_RATE)
	Deprecated as of librosa version 0.10.0.
	It will be removed in librosa version 1.0.
  y, sr_native = __audioread_load(path, offset, duration, dtype)


(1, 15) [[ 0.7405799  -0.1215523  -3.3350132   2.3348048   3.4266036   0.7246958
   0.3850828  -0.7148199  -4.122674    9.141884   -3.6063426  -1.2421137
  -0.25307405 -2.0979192  -0.1465312 ]]


  df = df.append({


(1, 15) [[-0.2944098  -2.3573754   1.0701752  -0.4338679   2.1659954  -0.2703556
   2.1128     -1.1273507  -1.835596    1.7622149  -0.7875129  -0.72051674
  -0.15483226  0.9607781  -0.41029572]]


  df = df.append({


(1, 15) [[ 0.93047553 -1.6647632  -1.2673922  -0.4354485  -0.31507757 -1.6520909
  -1.2490606  -0.05418235  4.0025015   1.6061516   2.4417086  -2.505131
   0.11330611 -2.9678547   0.7121459 ]]

  df = df.append({


(1, 15) [[ 1.2425679   0.14110121 -3.3561156  -0.55636466  0.06118035 -0.836532
  -2.3813221  -0.5147153   0.5666767   2.772831   -0.8777431  -0.6715298
   1.8792427  -2.782655    1.1477658 ]]


  df = df.append({


(1, 15) [[-1.088898   -0.7406041   0.38333064 -0.01174553  2.6887362  -0.72896075
   0.85128415 -2.555808   -2.3863757   3.1148634  -1.2484775  -1.1477927
   0.08408052 -0.45379427 -0.49087188]]


  df = df.append({


(1, 15) [[-0.14137267 -0.7150199  -1.5210944  -0.8631474   0.9455284  -1.1775297
  -0.57695854 -1.0064554  -0.60633785  2.2444437  -1.0986612  -1.2718554
   1.1775382  -0.11007012 -0.2058347 ]]
  start_time end_time yamnet_label  yamnet_probability  \
0          5        6       Animal            0.159492   
1         16       18       Animal            0.159492   
2         20       21       Animal            0.159492   
3         27       28       Animal            0.159492   
4         31       32       Animal            0.159492   
5         40       41       Animal            0.159492   

                            your_model_label  your_model_probability  
0                            Felis Catus Cat                   99.46  
1               Cervus unicolour Sambar deer                   26.00  
2  Eopsaltria australis Eastern yellow robin                   68.15  
3                            Felis Catus Cat                   44.62  
4                            Felis Catus Cat

  df = df.append({


In [None]:
#Trying different output

In [89]:
import os
import pandas as pd
import numpy as np
import soundfile as sf
import yamnet.params as params
import yamnet.yamnet as yamnet_model
import librosa
import tempfile

# Load YAMNet model
yamnet = yamnet_model.yamnet_frames_model(params)
yamnet.load_weights('yamnet/yamnet.h5')
yamnet_classes = yamnet_model.class_names('yamnet/yamnet_class_map.csv')

frame_len = int(params.SAMPLE_RATE * 1)  # 1sec

# Read the whole audio file
filename = 'test.m4a'
data, sr = librosa.load(filename, sr=params.SAMPLE_RATE)

# Split the audio data into 1 second chunks
chunks = np.array_split(data, len(data) // frame_len)

intervals = []
current_interval = None

yamnet_predictions = []
top_indices = []
# ... [initial imports and model loading here]

df_rows = []

for cnt, frame_data in enumerate(chunks):
    # Get YAMNet predictions
    scores, _ = yamnet.predict(np.reshape(frame_data, [1, -1]), steps=1)
    yamnet_prediction = np.mean(scores, axis=0)
    top5_i = np.argsort(yamnet_prediction)[::-1][:5]

    if (yamnet_classes[top5_i[0]] in ['Animal', 'Bird'] and yamnet_prediction[top5_i[0]] > 0.2) or (yamnet_classes[top5_i[1]] in ['Animal', 'Bird'] and yamnet_prediction[top5_i[1]] > 0.2):

        # Extract segment data for your model
        segment_data = data[cnt*frame_len : (cnt+1)*frame_len]

        with tempfile.NamedTemporaryFile(suffix=".wav", delete=True) as temp_audio_file:
            sf.write(temp_audio_file.name, segment_data, params.SAMPLE_RATE)
            with open(temp_audio_file.name, 'rb') as binary_file:
                top3_predictions = predict_on_audio(binary_file.read())

        # Build a row for our dataframe
        df_row = {
            'start_time': cnt,
            'end_time': cnt+1,
            'yamnet_label_1': yamnet_classes[top5_i[0]],
            'yamnet_probability_1': yamnet_prediction[top5_i[0]],
            'yamnet_label_2': yamnet_classes[top5_i[1]],
            'yamnet_probability_2': yamnet_prediction[top5_i[1]],
            'yamnet_label_3': yamnet_classes[top5_i[2]],
            'yamnet_probability_3': yamnet_prediction[top5_i[2]],
        }

        for i, pred in enumerate(top3_predictions):
            df_row[f'your_model_label_{i+1}'] = pred[0] if len(pred) > 0 else None
            df_row[f'your_model_probability_{i+1}'] = pred[1] if len(pred) > 1 else None

        df_rows.append(df_row)

df = pd.DataFrame(df_rows)

print(df)



  data, sr = librosa.load(filename, sr=params.SAMPLE_RATE)
	Deprecated as of librosa version 0.10.0.
	It will be removed in librosa version 1.0.
  y, sr_native = __audioread_load(path, offset, duration, dtype)


    start_time  end_time                            yamnet_label_1  \
0            5         6                                    Animal   
1           15        16                              Wild animals   
2           16        17                                    Animal   
3           17        18                                    Animal   
4           20        21                                    Animal   
5           27        28                                    Animal   
6           30        31                              Wild animals   
7           31        32                                    Animal   
8           32        33                                      Bird   
9           33        34                                      Bird   
10          40        41                                    Animal   
11          41        42  Livestock, farm animals, working animals   
12          42        43   Bird vocalization, bird call, bird song   

    yamnet_probabil

In [90]:
df

Unnamed: 0,start_time,end_time,yamnet_label_1,yamnet_probability_1,yamnet_label_2,yamnet_probability_2,yamnet_label_3,yamnet_probability_3,your_model_label_1,your_model_probability_1,your_model_label_2,your_model_probability_2,your_model_label_3,your_model_probability_3
0,5,6,Animal,0.968866,"Livestock, farm animals, working animals",0.939845,Fowl,0.924176,Felis Catus Cat,9.141884,Cervus unicolour Sambar deer,3.426604,Capra hircus Feral goat,2.334805
1,15,16,Wild animals,0.56379,Bird,0.477609,Animal,0.391069,Felis Catus Cat,2.22996,Rattus norvegicus Brown rat,0.863575,Cervus unicolour Sambar deer,0.706902
2,16,17,Animal,0.773564,Wild animals,0.638726,Bird,0.621845,Felis Catus Cat,7.215248,Cervus unicolour Sambar deer,2.56625,Capra hircus Feral goat,1.487156
3,17,18,Animal,0.903251,Wild animals,0.899348,Bird,0.871762,Felis Catus Cat,2.530861,Cervus unicolour Sambar deer,1.385957,sus scrofa Wild pig,1.051897
4,20,21,Animal,0.388485,Wild animals,0.33428,Bird,0.206926,Eopsaltria australis Eastern yellow robin,4.002501,Pachycephala rufiventris Rufous whistler,2.441709,Felis Catus Cat,1.606152
5,27,28,Animal,0.386347,Bird,0.30135,"Outside, rural or natural",0.284338,Felis Catus Cat,2.772831,Rattus norvegicus Brown rat,1.879243,Aegotheles cristatus owlet-nightjar,1.242568
6,30,31,Wild animals,0.798768,Bird,0.794923,"Bird vocalization, bird call, bird song",0.785069,Felis Catus Cat,4.783803,Cervus unicolour Sambar deer,1.872087,Corvus coronoides Australian raven,1.49245
7,31,32,Animal,0.8922,Wild animals,0.87533,Bird,0.851301,Felis Catus Cat,3.114863,Cervus unicolour Sambar deer,2.688736,Corvus coronoides Australian raven,0.851284
8,32,33,Bird,0.677822,Wild animals,0.667248,Animal,0.658058,Felis Catus Cat,3.450251,Rattus norvegicus Brown rat,1.835899,Alauda arvensis European Skylark,0.970522
9,33,34,Bird,0.616023,"Bird vocalization, bird call, bird song",0.576408,Wild animals,0.563931,Rattus norvegicus Brown rat,1.809888,Felis Catus Cat,1.304349,sus scrofa Wild pig,0.725475


In [91]:
df.shape

(13, 14)