Going to start a ground-up rework of the data, based on features suggested by our favourite co-pilot. It seems like we can get a decent answer from a lightweight model by processing the audio files with librosa. The features we're looking to extract are:
- Mel-Frequency Cepstral Coefficients (MFCC)
- Spectral Centroid
- Spectral Bandwidth
- Spectral Flatness
- Spectral Contrast
- Time-domain features: Zero-Crossing Rate (ZCR), Root Mean Square Energy (RMSE)
- Temporal features: Short-time energy, Tempo

Tips for Efficient Processing on Raspberry Pi 5

    Downsampling:
        Use a lower sampling rate (e.g., 16 kHz) unless high-frequency information is crucial.
    Windowing:
        Apply short-time Fourier transform (STFT) with small windows (e.g., 20–50 ms) for manageable computational loads.
    Batch Processing:
        Process audio in chunks to avoid memory and CPU spikes.
    Feature Dimensionality Reduction:
        Reduce the number of features using techniques like Principal Component Analysis (PCA) after extraction.
    Quantized Models:
        Use a lightweight, quantized ML model optimized for edge devices.
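
As a rough sanity check on the windowing tip, the 20–50 ms window lengths can be converted to sample counts at 16 kHz. This is a minimal sketch: the helper names and the 50% overlap are assumptions for illustration, not settings taken from this notebook.

```python
def window_samples(window_ms, sample_rate=16000):
    """Convert a window length in milliseconds to a whole number of samples."""
    return int(round(window_ms * sample_rate / 1000))

def frame_count(n_samples, win, hop):
    """Number of full frames that fit in the signal without padding."""
    if n_samples < win:
        return 0
    return 1 + (n_samples - win) // hop

for ms in (20, 25, 50):
    win = window_samples(ms)
    hop = win // 2  # assumed 50% overlap between neighbouring windows
    n = frame_count(2 * 16000, win, hop)
    print(f"{ms} ms -> {win} samples, {n} frames in a 2 s clip")
```

For comparison, `librosa.stft` defaults to `n_fft=2048`, which is 128 ms at 16 kHz, so a shorter window has to be requested explicitly.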

Import basic packages for data processing. We're going to use librosa to process the data, as it seems to be more lightweight than the alternatives.

In [1]:
import librosa
import numpy as np
from numpy import nan
import pandas as pd
import matplotlib
from matplotlib import pyplot as plt
import seaborn as sns
import math

sample_rate = 16000

**Pre-Processing**

Pre-processing of the audio: we are resampling, normalising and trimming silence.

Resampling puts every clip at the same sampling rate, removing one source of variation between datasets. I assume this means we will also want to resample audio in the input stream.

Normalising standardises the audio amplitude, apparently minimising the influence of recording volume, so that unusually loud or quiet recordings do not skew the model training.

Trimming silence removes useless pieces of data to save on computation, and reduces the effect silence could have on our model weights. (Need to check this, feel like it should be heavily caveated.) I have now checked this: it makes sense that silence being a recurring feature across classes could lead to issues, but having a separate class specifically for silence (called "no event") would potentially be good.
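
To make the normalising and trimming steps concrete, here is a minimal numpy-only sketch. librosa provides `librosa.util.normalize` and `librosa.effects.trim` for this kind of job; the version below is a simplified stand-in, and the frame length and the -40 dB threshold are assumptions rather than tuned values.

```python
import numpy as np

def peak_normalise(audio, eps=1e-9):
    """Scale the waveform so its largest absolute sample is (almost exactly) 1.0."""
    peak = np.max(np.abs(audio))
    return audio / (peak + eps)

def trim_silence(audio, frame_len=400, threshold_db=-40.0):
    """Drop leading and trailing frames whose RMS is more than
    threshold_db below the loudest frame. Samples after the last
    full frame are discarded."""
    n_frames = len(audio) // frame_len
    if n_frames == 0:
        return audio
    frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    ref = rms.max()
    if ref == 0:
        return audio[:0]  # the whole clip is silent
    db = 20 * np.log10(np.maximum(rms, 1e-12) / ref)
    keep = np.flatnonzero(db > threshold_db)
    return audio[keep[0] * frame_len : (keep[-1] + 1) * frame_len]

# Synthetic clip: 0.25 s silence, 0.5 s tone, 0.25 s silence at 16 kHz
sr = 16000
t = np.arange(sr // 2) / sr
tone = 0.3 * np.sin(2 * np.pi * 440 * t)
clip = np.concatenate([np.zeros(4000), tone, np.zeros(4000)])

trimmed = trim_silence(peak_normalise(clip))
print(len(clip), "->", len(trimmed))
```

If a "no event" class is added later, the trimmed-off silence could be kept as training material for it rather than thrown away.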

Below is the initial set-up of the functions and the import of the metadata for the dataset.

In [64]:
import os
def initialise_array(x):
    # Specify the folder path
    folder_path = f'./Audio/fold{x}'

    # List all files in the folder
    files = os.listdir(folder_path)

    # Filter out non-file entries (e.g., directories), and select only .wav files
    files = [f for f in files if os.path.isfile(os.path.join(folder_path, f)) and f.endswith('.wav')]

    # Count the number of .wav files
    num_files = len(files)

    # Initialise a placeholder array with one slot per file.
    # dtype=object lets each slot later hold a waveform of any length.
    blank_array = np.empty(num_files, dtype=object)
    blank_array[:] = np.nan  # NaN marks slots that have not been loaded yet

    # Print the result
    print(f"Number of .wav files: {num_files}")
    print("Initialized NumPy array:", np.size(blank_array))
    return blank_array

In [72]:
AUDIO_DIR = './Audio' # path to audio files directory
x = 1
folders = ['fold{}'.format(x)] # array of paths to each audio folder
slices = initialise_array(x=x)

# Load the metadata:
metadata = pd.read_csv('./data-description.csv')
metadata['length'] = metadata['end'] - metadata['start']

# Label map of the different sound classes:
label_map = {
    'air_conditioner': 0,
    'car_horn': 1,
    'children_playing': 2,
    'dog_bark': 3,
    'drilling': 4,
    'engine_idling': 5,
    'gun_shot': 6,
    'jackhammer': 7,
    'siren': 8,
    'street_music': 9,
}

def get_class_name(idx):
    return list(label_map.keys())[list(label_map.values()).index(idx)]


Number of .wav files: 873
Initialized NumPy array: 873


In [71]:
classes = np.array(metadata['class']) # list of audio classes

In [68]:
for fold in folders:
    print('collecting {}...'.format(fold), end = "")
    files = librosa.util.find_files('{}/{}'.format(AUDIO_DIR, fold), ext=['wav'])
    files = np.asarray(files)
    for file in files:
        if '.wav' in file:
            name = file.split('/').pop()
            wave_arr, sr = librosa.load(file, sr = sample_rate, mono = True)
            idx = list(metadata.index[metadata['slice_file_name'] == name])[0]
            slices[idx] = wave_arr.astype(object)
    print("done!")


collecting fold1...

IndexError: index 877 is out of bounds for axis 0 with size 873
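
The IndexError above appears to come from a size mismatch: `slices` has one slot per file in fold1 (873), while `idx` is a row label from the full metadata table, which covers every fold. One way to keep the two aligned, sketched below, is to give each file in the fold its own local position. The file names and the tiny metadata stand-in are made up for illustration.

```python
import numpy as np

# Hypothetical stand-in for the metadata table: file name -> (global row, class)
metadata_rows = {
    "a.wav": (877, "dog_bark"),
    "b.wav": (12, "siren"),
    "c.wav": (4031, "gun_shot"),
}
fold_files = ["a.wav", "c.wav"]  # files actually present in the fold being loaded

# Map each file in this fold to a position 0..len(fold_files)-1
local_index = {name: i for i, name in enumerate(fold_files)}

slices = np.empty(len(fold_files), dtype=object)
classes = np.empty(len(fold_files), dtype=object)

for name in fold_files:
    wave_arr = np.zeros(16000, dtype=np.float32)  # placeholder for the librosa.load output
    i = local_index[name]
    slices[i] = wave_arr                  # local position, never the global row
    classes[i] = metadata_rows[name][1]   # keep the label aligned with the slice

print(slices.shape, list(classes))
```

The alternative is to size `slices` to the full metadata table, but that brings back the memory problem of reserving a slot for every clip.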

In [31]:
print(slices)

[nan nan nan ... nan nan nan]


In [46]:
np.savetxt("output.csv", array_no_nan, delimiter=",")

TypeError: Mismatch between array dtype ('object') and format specifier ('%.18e')

In [44]:
array_no_nan = np.nan_to_num(slices, nan=0.0)
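
`np.savetxt` expects a rectangular numeric array, so it cannot write an object array whose entries are waveforms of different lengths, which would explain the TypeError above. One option, sketched below with made-up data and a temporary file, is to store the object array in NumPy's own `.npy` format instead. Object arrays are pickled, so `allow_pickle=True` is needed when loading.

```python
import os
import tempfile
import numpy as np

# Made-up clips of different lengths standing in for loaded audio
clips = np.empty(3, dtype=object)
clips[0] = np.zeros(8000, dtype=np.float32)
clips[1] = np.ones(32000, dtype=np.float32)
clips[2] = np.full(5081, 0.5, dtype=np.float32)

path = os.path.join(tempfile.mkdtemp(), "slices.npy")
np.save(path, clips, allow_pickle=True)

restored = np.load(path, allow_pickle=True)
print([len(c) for c in restored])
```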

Generally, it is advisable to split the formatted audio files into several **equal** brief segments of a predetermined length. For this project, every clip is brought to a target length of 2 seconds. Two seconds is a suitable target because it is half the maximum clip length. It is also slightly above the mean length of the class of interest (1.65 s) and below the mean length across all classes (3.61 s).

In [23]:
goal_len = 2
length_of_wave_arr = round(goal_len * sample_rate) # the length of the desired slices segments arrays
length_of_wave_arr

32000

Due to issues with RAM, I'm only loading a small number of the 8700 files at a time, which leaves NaN values in the initialised array. This will break the program, because the split function below expects every entry to be a waveform.

In [24]:
def split_audio_slices(slices, classes, split_length=length_of_wave_arr):
    X = slices.copy()
    y = classes.copy()
    
    idx = 0
    while idx < X.shape[0]:
        _slice = X[idx]
        _class = y[idx]
        
        if(_slice.shape[0] == split_length): # If it is already 2 seconds long, skip.
            idx += 1
            continue
            
        elif(_slice.shape[0] < split_length):  # If it is less than 2 seconds long
            diff = split_length - _slice.shape[0]
            silence = np.zeros(split_length) # silence
            lead = silence[0 : math.ceil(diff / 2)]
            trail = silence[0 : math.floor(diff / 2)]
            
            X[idx] = np.concatenate((lead, _slice, trail)) # pad with silence
        
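        else:
            # Longer than the target: keep the first split_length samples
            seg1 = _slice[:split_length]
            seg2 = _slice[split_length:]

            X[idx] = seg1

            # If the remainder is long enough and belongs to one of the small classes,
            # add it to the queue to undergo another split / silence padding.
            # NOTE: MIN_LEN is not defined anywhere in this notebook and must be set first.
            if seg2.shape[0] >= MIN_LEN and _class in ('gun_shot', 'car_horn'):
                # Grow the object array by one slot; rebuilding it with np.array(...)
                # can collapse to a 2-D array when every clip has the same length.
                X_grown = np.empty(X.shape[0] + 1, dtype=object)
                X_grown[:-1] = X
                X_grown[-1] = seg2
                X = X_grown
                y = np.append(y, _class)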
                
        idx += 1
        
    return X, y

In [27]:
print(type(slices.copy()))

<class 'numpy.ndarray'>


In [25]:
X, y = split_audio_slices(slices, classes)

AttributeError: 'float' object has no attribute 'shape'
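
The AttributeError above is the NaN placeholder problem: unloaded slots hold a float, which has no `.shape`. A minimal sketch of skipping those slots before padding is shown below. The helper names are hypothetical and the clip lengths are toy values.

```python
import numpy as np

def is_loaded(entry):
    """True for slots that hold a waveform rather than the NaN placeholder."""
    return isinstance(entry, np.ndarray)

def pad_to_length(clip, target_len):
    """Centre a short clip in silence so it is exactly target_len samples long."""
    diff = target_len - len(clip)
    lead = diff // 2 + diff % 2  # ceil(diff / 2), as in the split function above
    trail = diff // 2            # floor(diff / 2)
    return np.pad(clip, (lead, trail))

# Toy array: slots 1 and 3 were never loaded
slices = np.empty(4, dtype=object)
slices[:] = np.nan
slices[0] = np.ones(10)
slices[2] = np.ones(7)

loaded = [i for i, s in enumerate(slices) if is_loaded(s)]
padded = [pad_to_length(slices[i], 12) for i in loaded]
print(loaded, [len(p) for p in padded])
```

The matching entries of `classes` would need to be filtered with the same `loaded` indices so labels stay aligned with the clips.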

**Feature Extraction**

Going to start with a single wav file as a test.

In [2]:
TestAudioPath = "audio/fold1/7061-6-0-0.wav"

def extract_features(file_path, sr=16000, n_mfcc=13):
    # Load the audio file, resampling to sr
    audio, sample_rate = librosa.load(file_path, sr=sr)

    # Extract the MFCC features; the result has shape (n_mfcc, n_frames)
    mfccs = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=n_mfcc)

    # Compute the mean of the extracted features across frames
    mfccs_mean = np.mean(mfccs, axis=1)
    return mfccs_mean

I've questioned ChatGPT on why it wants to warp the MFCC frequencies to a human hearing scale (the mel scale). It has agreed this doesn't really make sense. I think I will try building both mel-scale and non-mel-scale feature sets, train a model on each and see how they compare. It may potentially be worth stacking them? I don't see how applying a log scale is going to help much though...
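
To see what the mel warping actually does to the frequency axis, here is a small sketch using the HTK-style formula `mel = 2595 * log10(1 + f / 700)`. This is one common definition and is used here as an assumption; librosa's default conversion uses the Slaney variant, which differs in detail.

```python
import numpy as np

def hz_to_mel_htk(freq_hz):
    """HTK-style Hz -> mel conversion (assumed formula, see note above)."""
    return 2595.0 * np.log10(1.0 + np.asarray(freq_hz, dtype=float) / 700.0)

freqs = np.array([100, 200, 1000, 2000, 7000, 8000])
mels = hz_to_mel_htk(freqs)

low_step = mels[1] - mels[0]   # width of 100 -> 200 Hz on the mel axis
high_step = mels[5] - mels[4]  # width of 7000 -> 8000 Hz on the mel axis

for f, m in zip(freqs, mels):
    print(f"{f:5d} Hz -> {m:7.1f} mel")
print(f"100 Hz step at the bottom: {low_step:.1f} mel")
print(f"1000 Hz step at the top:   {high_step:.1f} mel")
```

The top 1000 Hz of a 16 kHz recording's range ends up about as wide as a 100 Hz step at the bottom, so high-frequency detail is compressed. That is a concrete reason to compare against a linear-frequency feature set.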

In [3]:
def extract_spectral_features(audio, sr = 16000):
    # `audio` comes in as the path to the file; after loading it holds the waveform.
    audio, sample_rate = librosa.load(audio, sr=sr)
    print("Audio data type:", type(audio))
    # Short-Time Fourier Transform (STFT):
    # transforms the waveform into the time-frequency domain.
    # np.abs takes the magnitude of each complex STFT value,
    # i.e. the "strength" of each frequency in each frame.
    stft = np.abs(librosa.stft(audio))

    # Spectral features. The spectral centroid is seen as the "brightness" of a sound.
    # Gunshots typically(?) have higher spectral centroids due to high energy at high frequencies.
    centroid = np.mean(librosa.feature.spectral_centroid(S=stft, sr=sample_rate))
    # The bandwidth shows the width of the distribution of frequencies. According to ChatGPT,
    # gunshots will have a higher than typical bandwidth.
    bandwidth = np.mean(librosa.feature.spectral_bandwidth(S=stft, sr=sample_rate))
    # Apparently gunshots also generally have high spectral flatness.
    flatness = np.mean(librosa.feature.spectral_flatness(S=stft))

    # Temporal features. The onset envelope seems like a really important one:
    # it's the sharpness/prominence of the onset of the sound.
    # Due to what a gunshot is, I expect this to be MASSIVE compared to a usual sound.
    # onset_strength returns one value per frame, so take the maximum to get one number per clip.
    onset_env = np.max(librosa.onset.onset_strength(y=audio, sr=sample_rate))
    # Captures the maximum intensity of the sound, which will probably be higher than an average sound.
    peak_amplitude = np.max(np.abs(audio))

    # Combine features into a single list representing the audio clip
    features = [centroid, bandwidth, flatness, onset_env, peak_amplitude]
    return features

I'm going to try the feature set as described above, but want to bear in mind Mariam's paper. The authors in [1] stress that the order in which these features are stacked has a significant impact on the final result. Therefore, following their recommended order, the features are horizontally stacked as follows: spectral contrast, tonnetz, chromagram, Mel-spectrogram, and MFCC. The resulting vectors are then fed into the classifier.

Will try and test both.
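
A minimal sketch of the horizontal stacking in the recommended order, using random placeholder vectors instead of real features. The vector sizes are assumptions for illustration only. Keeping the order in one explicit list makes it easy to swap when testing the alternative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder per-clip feature vectors; sizes are illustrative assumptions
feature_blocks = {
    "mfcc": rng.random(13),
    "chromagram": rng.random(12),
    "spectral_contrast": rng.random(7),
    "mel_spectrogram": rng.random(128),
    "tonnetz": rng.random(6),
}

# Stacking order as recommended in [1]
order = ["spectral_contrast", "tonnetz", "chromagram", "mel_spectrogram", "mfcc"]

stacked = np.hstack([feature_blocks[name] for name in order])
print(stacked.shape)
```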

In [4]:
all_features = []  # A list to hold features for each audio clip

# Single test audio file for now; need to do some processing of files first.
TestAudioPath = "audio/fold1/7061-6-0-0.wav"

# For each audio clip, append its features
features = extract_spectral_features(TestAudioPath, sr=16000)
all_features.append(features)

# Create a DataFrame
df = pd.DataFrame(all_features, columns=["Centroid", "Bandwidth", "Flatness","Onset Env", "PeakAmplitude"])

# Save to CSV
df.to_csv("spectral_features.csv", index=False)

# Check the extracted features
df.head()

  audio, sample_rate = librosa.load(audio, sr=sr)
	Deprecated as of librosa version 0.10.0.
	It will be removed in librosa version 1.0.
  y, sr_native = __audioread_load(path, offset, duration, dtype)


FileNotFoundError: [Errno 2] No such file or directory: 'audio/fold1/7061-6-0-0.wav'

I want to run a sample of different sounds and see what the different types of sound look like visually. First I need to load the metadata.

In [None]:
# Load the metadata:
metadata = pd.read_csv('./data-description.csv')

Each row in the dataframe named `metadata` represents the metadata for one audio file. For each audio file, the following details are included:
- Name of the file.
- The recording sound ID.
- The start time of the slice in the original audio.
- The end time of the slice in the original audio.
- salience, an indicator for whether the sound slice is a foreground sound (1) or background sound (2).
- ID of the class to which the sound slice belongs.
- Name of the class to which the sound slice belongs.

In [None]:
# Add column calculating the length of the clips
metadata['length'] = metadata['end'] - metadata['start']


Unnamed: 0,slice_file_name,fsID,start,end,salience,fold,classID,class,length
0,100032-3-0-0.wav,100032,0.0,0.317551,1,5,3,dog_bark,0.317551
1,100263-2-0-117.wav,100263,58.5,62.5,1,5,2,children_playing,4.0
2,100263-2-0-121.wav,100263,60.5,64.5,1,5,2,children_playing,4.0
3,100263-2-0-126.wav,100263,63.0,67.0,1,5,2,children_playing,4.0
4,100263-2-0-137.wav,100263,68.5,72.5,1,5,2,children_playing,4.0


In [None]:
# Label map of the different sound classes:
label_map = {
    'air_conditioner': 0,
    'car_horn': 1,
    'children_playing': 2,
    'dog_bark': 3,
    'drilling': 4,
    'engine_idling': 5,
    'gun_shot': 6,
    'jackhammer': 7,
    'siren': 8,
    'street_music': 9,
}

def get_class_name(idx):
    return list(label_map.keys())[list(label_map.values()).index(idx)]

In [None]:
# Get all the folders 
# Array of paths to each audio folder
AUDIO_DIR = './audio' # path to audio files directory
folders = ['fold{}'.format(x) for x in range(1, 12)] 



In [None]:
for fold in folders:
    print('collecting {}...'.format(fold), end = "")
    files = librosa.util.find_files('{}/{}'.format(AUDIO_DIR, fold), ext=['wav'])
    files = np.asarray(files)
    for file in files:
        if '.wav' in file:
            name = file.split('/').pop()
            idx = list(metadata.index[metadata['slice_file_name'] == name])[0]
    print("done!")


collecting fold1...done!
collecting fold2...done!
collecting fold3...done!
collecting fold4...done!
collecting fold5...done!
collecting fold6...done!
collecting fold7...done!
collecting fold8...done!
collecting fold9...done!
collecting fold10...done!
collecting fold11...done!


In [None]:
# import os

# def rename_wav_files(directory):
#     """
#     Renames .wav files in the specified directory by chopping the first 4 digits,
#     the last 2 digits, and appending an incremental counter.
    
#     Args:
#         directory (str): Path to the directory containing .wav files.
#     """
#     counter = 0  # Initialize the counter
    
#     # Iterate through all files in the directory
#     for filename in os.listdir(directory):
#         if filename.endswith(".wav"):  # Process only .wav files
#             # Chop the filename as specified
#             parts = filename.split("-")
#             if len(parts) >= 4:
#                 new_name = f"{counter}-{parts[1]}.wav"
#                 old_path = os.path.join(directory, filename)
#                 new_path = os.path.join(directory, new_name)
                
#                 # Rename the file
#                 os.rename(old_path, new_path)
#                 print(f"Renamed: {filename} -> {new_name}")
                
#                 # Increment the counter
#                 counter += 1

# # Specify the directory containing your .wav files
# # directory = "audio/fold1"  # Replace with the actual path
# counter = 3
# for fold in folders:
#     directory = f"audio/fold{counter}"
#     counter += 1
#     rename_wav_files(directory)


All files renamed to be more usable for my purposes.
Going to redo the function to loop through the folders. I think what I need to do is completely rewrite the function below so that it takes in a file, extracts the features, appends them to a CSV and then moves on to the next file. Hopefully this will get around the RAM issues I was having. Going to write the processing functions first.
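
A rough sketch of that one-file-at-a-time idea, so only a single waveform is in memory at any point. The feature function, column names and file names are hypothetical stand-ins, and random noise takes the place of the `librosa.load` output.

```python
import csv
import os
import tempfile
import numpy as np

def simple_features(audio):
    """Hypothetical stand-in for the real extractor: returns a few scalars."""
    return [float(np.max(np.abs(audio))), float(np.sqrt(np.mean(audio ** 2)))]

def append_features(csv_path, file_name, label, features):
    """Append one row to the CSV, writing the header only if the file is new."""
    is_new = not os.path.exists(csv_path)
    with open(csv_path, "a", newline="") as handle:
        writer = csv.writer(handle)
        if is_new:
            writer.writerow(["file", "class", "peak", "rms"])
        writer.writerow([file_name, label] + list(features))

csv_path = os.path.join(tempfile.mkdtemp(), "features.csv")
rng = np.random.default_rng(0)

for i, label in enumerate(["dog_bark", "siren", "gun_shot"]):
    audio = rng.standard_normal(16000).astype(np.float32)  # placeholder for librosa.load
    append_features(csv_path, f"clip{i}.wav", label, simple_features(audio))
    del audio  # the waveform is dropped before the next file is read

with open(csv_path, newline="") as handle:
    rows = list(csv.reader(handle))
print(len(rows) - 1, "feature rows written")
```

Because each row is written as soon as it is computed, a crash partway through a fold keeps the rows already done.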

In [None]:
# Initialise some slices and classes arrays.

slices = np.zeros(metadata.shape[0], dtype=object)
slices[:] = np.nan # placeholder array for audio slices
classes = np.array(metadata['class']) # list of audio classes for each of the audio in order


8732
8732


In [None]:
for fold in folders:  
    # Loop through each folder in the 'folders' list.
    
    print('collecting {}...'.format(fold), end="")  
    # Print a message indicating the folder being processed, without a new line.
    
    files = librosa.util.find_files('{}/{}'.format(AUDIO_DIR, fold), ext=['wav'])  
    # Find all `.wav` files in the specified folder (inside AUDIO_DIR).
    # Returns a list of full file paths.
    
    files = np.asarray(files)  
    # Convert the list of file paths to a NumPy array, possibly for compatibility with downstream operations.

    for file in files:  
        # Loop through each `.wav` file found in the current folder.

        if '.wav' in file:  
            # Check if the file has the `.wav` extension (redundant here, as librosa already filters for it).

            name = file.split('/').pop()  
            # Extract the file name by splitting the file path and taking the last element (the name).

            wave_arr, sr = librosa.load(file, sr=sample_rate, mono=True)  
            # Load the audio file into `wave_arr` (waveform as a NumPy array) and `sr` (sample rate).
            # - `sr=sample_rate` resamples the audio to the rate set in the import cell (16 kHz).
            # - `mono=True` converts audio to a single-channel (mono) format.

            idx = list(metadata.index[metadata['slice_file_name'] == name])[0]  
            # Locate the row in the `metadata` DataFrame where 'slice_file_name' matches the file name.
            # Assumes there is exactly one match in `metadata`.

            slices[idx] = wave_arr.astype(object)  
            # Store the waveform data (`wave_arr`) in the `slices` array at the corresponding index (`idx`).
            # The `.astype(object)` ensures compatibility with the `slices` array (which is of `dtype=object`).

    print("done!")  
    # Indicate the completion of processing for the current folder.


Mani is suggesting the use of an LSTM, as they work well with time-series data.
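
An LSTM consumes a sequence of feature vectors rather than one averaged vector per clip, so the per-frame values would need to be kept instead of averaged across frames. Below is a minimal numpy sketch of shaping a clip into such a sequence, using per-frame RMS energy and zero-crossing rate. The frame and hop sizes are assumptions, the input is random noise, and no model is defined here.

```python
import numpy as np

def frame_signal(audio, frame_len=400, hop=200):
    """Slice a 1-D signal into overlapping frames, shape (n_frames, frame_len).
    Assumes len(audio) >= frame_len."""
    n_frames = 1 + (len(audio) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return audio[idx]

def frame_features(audio, frame_len=400, hop=200):
    """Per-frame RMS energy and zero-crossing rate, shape (n_frames, 2)."""
    frames = frame_signal(audio, frame_len, hop)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    # Fraction of neighbouring sample pairs whose sign differs
    zcr = np.mean(np.signbit(frames[:, 1:]) != np.signbit(frames[:, :-1]), axis=1)
    return np.stack([rms, zcr], axis=1)

clip = np.random.default_rng(0).standard_normal(32000)  # stand-in for a 2 s clip at 16 kHz
sequence = frame_features(clip)  # (time steps, features)
batch = sequence[None, :, :]     # (batch, time steps, features)
print(sequence.shape, batch.shape)
```

The (batch, time, features) layout is one common convention for recurrent layers; the framework chosen later may expect a different axis order.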