# Mann's Planet

The objective is to classify the given audio files into one of the following categories:
1) strathea_iv
2) aegir_27
3) solmara_vi
4) zephyrion_9
5) veyrah_theta
6) xyphos_1

## The approach:

The audio files in the sample data are converted into Mel spectrograms, which are then used to train a Convolutional Neural Network (CNN).
A Mel spectrogram is a visual representation of the frequency content of an audio signal over time, i.e. a representation of the signal in the time-frequency domain.
It is calculated by applying the Short-Time Fourier Transform (STFT) to the audio signal and then mapping the result onto the Mel scale, which approximates the way humans perceive pitch.
Steps Involved:
1) Load Audio: The audio file (e.g., .wav) is loaded into memory. This can be done using libraries like librosa in Python.

2) Short-Time Fourier Transform (STFT): The audio signal is divided into small overlapping frames, and the Fourier Transform is applied to each frame. This process extracts the frequency information for each frame.

3) Mel Filter Bank: The frequency bins from the STFT are mapped to the Mel scale using a filter bank. This scale approximates the human ear’s response to different frequencies, emphasizing lower frequencies and compressing higher frequencies.

4) Logarithmic Compression: The Mel spectrogram is often compressed logarithmically to reduce the dynamic range and make the features more suitable for neural network-based learning.

5) Resulting Output: The output is a 2D matrix, where the x-axis represents time and the y-axis represents frequency (in Mel scale). Each cell contains the signal energy at that particular time frame and Mel band (in decibels once the logarithmic compression has been applied).
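
The steps above can be sketched in plain NumPy to show what happens at each stage. This is only an illustration under some assumptions: the helper names (`hz_to_mel`, `mel_filter_bank`, `log_mel_spectrogram`) are made up for this sketch, the Mel formula is the common HTK-style one, and the window is a simple symmetric Hann window, so the numbers will not match librosa's output exactly. The notebook itself uses `librosa.feature.melspectrogram` for the real work.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style Mel formula (an assumption for this sketch)
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filter_bank(sr, n_fft, n_mels):
    """Triangular filters evenly spaced on the Mel scale, shape (n_mels, 1 + n_fft // 2)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, 1 + n_fft // 2)
    hz_points = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    bank = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lower, centre, upper = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - lower) / (centre - lower)
        falling = (upper - fft_freqs) / (upper - centre)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def log_mel_spectrogram(y, sr, n_fft=2048, hop_length=512, n_mels=128):
    # Step 2 (STFT): centre the frames by padding, window each frame, take the FFT
    y = np.pad(y, n_fft // 2, mode="reflect")
    n_frames = 1 + (len(y) - n_fft) // hop_length
    window = np.hanning(n_fft)
    frames = np.stack([y[i * hop_length:i * hop_length + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # (n_frames, 1 + n_fft // 2)
    # Step 3 (Mel filter bank): pool the FFT bins into Mel bands
    mel = mel_filter_bank(sr, n_fft, n_mels) @ power.T  # (n_mels, n_frames)
    # Step 4 (log compression): decibels relative to the loudest cell
    return 10.0 * np.log10(np.maximum(mel, 1e-10) / max(mel.max(), 1e-10))

# Step 1 stand-in: a synthetic 1-second 440 Hz tone instead of a loaded .wav file
sr = 16000
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
spec = log_mel_spectrogram(tone, sr)
print(spec.shape)  # (128, 32): 128 Mel bands by 32 time frames
```

The loudest Mel band for the test tone sits near the bottom of the 128 bands, which reflects how the Mel scale gives more resolution to lower frequencies.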



### Imports

In [79]:
import h5py
import librosa
import numpy as np
import os
import pandas as pd
import tensorflow as tf
from tensorflow.keras import layers, models

## Data pre-processing

#### Setting train directory

In [112]:
train_dir = "train/" 

#### Constants

In [51]:
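SR = 16000  # Sampling rate (Hz)
TARGET_DURATION = 30  # Target audio length in seconds
MAX_PAD_LENGTH = 1 + (SR * TARGET_DURATION) // 512  # Time steps per Mel spectrogram: 938 with librosa's default hop_length of 512
FEATURE_SIZE = 128 * MAX_PAD_LENGTH  # Size of flattened Mel spectrogram (128 Mel bands x 938 time steps = 120064)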

#### Checking the training dataset for empty/corrupted files

Empty files are not discarded. They are substituted with placeholder (all-zero) Mel spectrograms so that the model can also handle the empty files that occur in the test dataset.

In [65]:
valid_files = 0
empty_files = 0

for category in os.listdir(train_dir):
    category_path = os.path.join(train_dir, category)
    if os.path.isdir(category_path):
        for file in os.listdir(category_path):
            file_path = os.path.join(category_path, file)
            try:
                y, sr = librosa.load(file_path, sr=16000)
                if len(y) == 0:
                    print(f"Empty file: {file_path}")
                    empty_files += 1
                else:
                    valid_files += 1
            except Exception as e:
                print(f"Error loading {file_path}: {e}")

print(f"✅ Valid audio files: {valid_files}")
print(f"❌ Empty files: {empty_files}")

Empty file: train/strathea_iv\strathea_iv_32.wav
Empty file: train/strathea_iv\strathea_iv_34.wav
Empty file: train/strathea_iv\strathea_iv_35.wav
Empty file: train/strathea_iv\strathea_iv_36.wav
Empty file: train/strathea_iv\strathea_iv_38.wav
Empty file: train/strathea_iv\strathea_iv_39.wav
Empty file: train/strathea_iv\strathea_iv_40.wav
Empty file: train/strathea_iv\strathea_iv_41.wav
Empty file: train/strathea_iv\strathea_iv_42.wav
Empty file: train/strathea_iv\strathea_iv_70.wav
Empty file: train/strathea_iv\strathea_iv_71.wav
Empty file: train/strathea_iv\strathea_iv_78.wav
✅ Valid audio files: 468
❌ Empty files: 12


#### Function to extract mel spectrogram features of the audio files

The tasks performed by this function are:
1) Check the length of the audio file. If it is 0, mark it as an empty file and return a zeroed NumPy array as a placeholder.
2) So that the CNN is trained on uniform data, each spectrogram is made from exactly 30 seconds of audio. If the audio clip is longer than 30 seconds, the function splits it into consecutive 30-second segments and makes one Mel spectrogram per full segment. Note that the number of segments is `len(y) // max_length`, so a trailing remainder shorter than 30 seconds is dropped: a 40-second clip yields one spectrogram (its first 30 seconds), and a 65-second clip yields two.
3) If the audio clip is shorter than 30 seconds, it is zero-padded up to 30 seconds and then the Mel spectrogram is extracted.
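
To see how clip length maps to the number of spectrograms, here is a small sketch of just the splitting and padding logic, using the same `len(y) // max_length` rule as `extract_mel_spectrogram`. The helper name `split_into_segments` and the all-ones test clips are made up for illustration. No audio files are needed.

```python
import numpy as np

def split_into_segments(y, sr=16000, duration=30):
    """Split a waveform into fixed-length segments using the floor-division rule."""
    max_length = sr * duration
    # Floor division: only full segments are counted, with a minimum of one
    num_segments = max(1, len(y) // max_length)
    segments = []
    for i in range(num_segments):
        segment = y[i * max_length:(i + 1) * max_length]
        if len(segment) < max_length:
            # Only reached when the whole clip is shorter than one segment
            segment = np.pad(segment, (0, max_length - len(segment)))
        segments.append(segment)
    return segments

for seconds in (10, 40, 65):
    clip = np.ones(16000 * seconds, dtype=np.float32)
    print(seconds, "s ->", len(split_into_segments(clip)), "segment(s)")
# 10 s -> 1 segment(s)
# 40 s -> 1 segment(s)
# 65 s -> 2 segment(s)
```

If the leftover audio at the end of a long clip should also become a (padded) segment, the segment count would need to be rounded up instead of down. That would be a change to the notebook's behaviour, not something it currently does.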

In [67]:
def extract_mel_spectrogram(file_path):
    """
    Extracts Mel spectrograms from audio, handles empty files with a placeholder.
    """
    try:
        y, sr = librosa.load(file_path, sr=16000)

        if len(y) == 0:
            print(f"⚠️ Empty file: {file_path}")
            return [np.zeros((FEATURE_SIZE,))]  # Placeholder for empty files

        max_length = sr * 30  # 30 seconds
        num_segments = max(1, len(y) // max_length)

        features = []
        for i in range(num_segments):
            start, end = i * max_length, min((i + 1) * max_length, len(y))
            segment = y[start:end]

            if len(segment) < max_length:
                segment = np.pad(segment, (0, max_length - len(segment)))

            mel_spec = librosa.feature.melspectrogram(y=segment, sr=sr, n_mels=128)
            mel_spec = librosa.power_to_db(mel_spec, ref=np.max)
            features.append(mel_spec.flatten())

        print(f"✅ {file_path}: {len(features)} spectrograms extracted")
        return features

    except Exception as e:
        print(f"❌ Error processing {file_path}: {e}")
        return [np.zeros((FEATURE_SIZE,))]  # Return placeholder on error

#### The `process_audio_dataset` function below loops through the following directory structure:
##### Manns-Planet.ipynb
##### train/
##### ├── strathea_iv/
##### │    ├── strathea_iv_1.wav
##### │    ├── strathea_iv_2.wav
##### │    ├── ...
##### │    └── strathea_iv_80.wav
##### ├── aegir_27/
##### │    ├── aegir_27_1.wav
##### │    ├── aegir_27_2.wav
##### │    ├── ...
##### │    └── aegir_27_80.wav
##### ├── solmara_vi/
##### │    ├── solmara_vi_1.wav
##### │    ├── solmara_vi_2.wav
##### │    ├── ...
##### │    └── solmara_vi_80.wav
##### ├── zephyrion_9/
##### │    ├── zephyrion_9_1.wav
##### │    ├── zephyrion_9_2.wav
##### │    ├── ...
##### │    └── zephyrion_9_80.wav
##### ├── veyrah_theta/
##### │    ├── veyrah_theta_1.wav
##### │    ├── veyrah_theta_2.wav
##### │    ├── ...
##### │    └── veyrah_theta_80.wav
##### └── xyphos_1/
#####      ├── xyphos_1_1.wav
#####      ├── xyphos_1_2.wav
#####      ├── ...
#####      └── xyphos_1_80.wav

Inside the loop, the function passes each audio file to `extract_mel_spectrogram(file_path)` and builds a list of features and a matching list of labels, which are then written to the output file, train_dataset.csv.

In [99]:
def process_audio_dataset(data_dir, save_path):
    data = []
    labels = []

    for category in os.listdir(data_dir):
        category_path = os.path.join(data_dir, category)
        if os.path.isdir(category_path):
            label = category  # Use category folder name as label
            for file in os.listdir(category_path):
                file_path = os.path.join(category_path, file)
                mel_features = extract_mel_spectrogram(file_path)
                
                for feature in mel_features:
                    data.append(feature)
                    labels.append(label)
    
    # Convert to DataFrame
    df = pd.DataFrame(data)
    df["label"] = labels  # Add labels column

    if len(df) > 0:
        print("📊 First row of dataset:", df.iloc[0])  # Debugging
        df.to_csv(save_path, index=False)
        print(f"✅ Dataset saved to {save_path} with {len(df)} samples.")
    else:
        print("❌ No data to save! Check audio processing.")

# Run the function for your train and test directories
train_dir = "train/"
test_dir = "test/"

#### Processes the training data and outputs train_dataset.csv

In [101]:
process_audio_dataset(train_dir, "train_dataset.csv")

✅ train/aegir_27\aegir_27_0.wav: 1 spectrograms extracted
✅ train/aegir_27\aegir_27_1.wav: 1 spectrograms extracted
✅ train/aegir_27\aegir_27_10.wav: 1 spectrograms extracted
✅ train/aegir_27\aegir_27_11.wav: 1 spectrograms extracted
✅ train/aegir_27\aegir_27_12.wav: 1 spectrograms extracted
✅ train/aegir_27\aegir_27_13.wav: 1 spectrograms extracted
✅ train/aegir_27\aegir_27_14.wav: 1 spectrograms extracted
✅ train/aegir_27\aegir_27_15.wav: 1 spectrograms extracted
✅ train/aegir_27\aegir_27_16.wav: 1 spectrograms extracted
✅ train/aegir_27\aegir_27_17.wav: 1 spectrograms extracted
✅ train/aegir_27\aegir_27_18.wav: 1 spectrograms extracted
✅ train/aegir_27\aegir_27_19.wav: 1 spectrograms extracted
✅ train/aegir_27\aegir_27_2.wav: 1 spectrograms extracted
✅ train/aegir_27\aegir_27_20.wav: 1 spectrograms extracted
✅ train/aegir_27\aegir_27_21.wav: 1 spectrograms extracted
✅ train/aegir_27\aegir_27_22.wav: 1 spectrograms extracted
✅ train/aegir_27\aegir_27_23.wav: 1 spectrograms extracted


#### Due to the large number of features, CSV files become inefficient, so the CSV file is converted to an .h5 (HDF5) file using the h5py library

Advantages:
1) Optimized for large datasets: HDF5 is very efficient for storing large amounts of data, especially multi-dimensional arrays.
2) Supports compression: You can compress data in HDF5, which can help reduce storage requirements.
3) Efficient reading/writing: HDF5 is designed for efficient random access to large datasets, which is great for training.
4) Works well with machine learning frameworks: h5py returns datasets as NumPy arrays, which TensorFlow and Keras accept directly, making it easy to integrate with the CNN training pipeline.
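
As a rough illustration of points 2 and 3, the sketch below writes a small synthetic array with gzip compression and then reads back only a slice of it. The file name `demo.h5`, the array sizes, and the compression level are arbitrary choices for this example and are not what the notebook uses.

```python
import os
import tempfile

import h5py
import numpy as np

# Synthetic stand-ins for the real features and labels
features = np.zeros((8, 1024), dtype=np.float32)
labels = np.array([b"aegir_27"] * 8)  # fixed-length byte strings

path = os.path.join(tempfile.mkdtemp(), "demo.h5")

with h5py.File(path, "w") as hf:
    # compression="gzip" stores the dataset in compressed chunks
    hf.create_dataset("features", data=features, compression="gzip", compression_opts=4)
    hf.create_dataset("labels", data=labels)

with h5py.File(path, "r") as hf:
    # Slicing reads only the requested rows from disk, not the whole dataset
    first_rows = hf["features"][:2]
    first_label = hf["labels"][0].decode("utf-8")

print(first_rows.shape, first_label)
```

Reading by slice is what makes HDF5 convenient when the full feature matrix is too large to hold in memory comfortably.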

In [103]:
df = pd.read_csv("train_dataset.csv")

In [105]:
with h5py.File('train_dataset.h5', 'w') as hf:
    hf.create_dataset('features', data=df.drop(columns=['label']).values)
    hf.create_dataset('labels', data=df['label'].values)

In [138]:
file_path = 'train_dataset.h5' 
with h5py.File(file_path, 'r') as file:
    print("Keys in the file:", list(file.keys()))

    dataset_name = 'labels'  
    if dataset_name in file:
        dataset = file[dataset_name]
        print(f"Dataset shape: {dataset.shape}")
        print(f"Dataset dtype: {dataset.dtype}")
        print(f"Dataset contents (first 5 elements): {dataset[:5]}")

Keys in the file: ['features', 'labels']
Dataset shape: (528,)
Dataset dtype: object
Dataset contents (first 5 elements): [b'aegir_27' b'aegir_27' b'aegir_27' b'aegir_27' b'aegir_27']


#### We'll reshape the dataset to better suit the CNN training process

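After reshaping, the features have shape (528, 256, 469): 528 samples, each represented by a 2D array of size (256, 469), which can be used as input to a CNN. Each flattened spectrogram holds 128 × 938 = 120064 values (128 Mel bands by 938 time frames), and 256 × 469 is another factorization of the same count. In this layout every Mel band occupies two consecutive rows, one for each half of the clip. Reshaping to (128, 938) instead would keep the original Mel-by-time layout.

The labels are stored in the HDF5 file as byte strings (e.g. b'aegir_27'). They are decoded to regular strings with UTF-8, which gives one label (such as 'aegir_27') for each sample.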
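
The numbers behind the feature size can be checked with a few lines of arithmetic. This sketch assumes librosa's default `hop_length` of 512 and centred frames, since the notebook does not override either. It uses a dummy array in place of a real spectrogram and compares the (256, 469) layout with a (128, 938) layout.

```python
import numpy as np

SR, DURATION, N_MELS, HOP_LENGTH = 16000, 30, 128, 512  # hop 512 assumed (librosa default)

# Number of frames for a centred STFT over a 30-second clip
n_frames = 1 + (SR * DURATION) // HOP_LENGTH
print(n_frames, N_MELS * n_frames)  # 938 120064

# Dummy "spectrogram" whose values encode their own position
spec = np.arange(N_MELS * n_frames).reshape(N_MELS, n_frames)
flat = spec.flatten()  # same row-major flattening the notebook applies

as_128 = flat.reshape(N_MELS, n_frames)  # restores the Mel-by-time layout
as_256 = flat.reshape(256, 469)          # layout used in this notebook

# In the (256, 469) view, rows 0 and 1 are the two halves of Mel band 0
print(np.array_equal(as_128, spec))
print(np.array_equal(as_256[0], spec[0, :469]), np.array_equal(as_256[1], spec[0, 469:]))
```

Both reshapes are valid for the same 120064 values. They differ only in whether one row corresponds to a whole Mel band or to half of one.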

In [140]:
with h5py.File(file_path, 'r') as file:
    features_dataset = file['features']
    print(f"Features dataset shape: {features_dataset.shape}")
    print(f"Features dataset dtype: {features_dataset.dtype}")
    
    reshaped_features = features_dataset[:].reshape(-1, 256, 469)  
    print(f"Reshaped features shape: {reshaped_features.shape}")

    labels_dataset = file['labels']
    print(f"Labels dataset shape: {labels_dataset.shape}")
    print(f"Labels dataset dtype: {labels_dataset.dtype}")
    
    decoded_labels = [label.decode('utf-8') for label in labels_dataset]
    print(f"Decoded labels (first 5): {decoded_labels[:5]}")

Features dataset shape: (528, 120064)
Features dataset dtype: float64
Reshaped features shape: (528, 256, 469)
Labels dataset shape: (528,)
Labels dataset dtype: object
Decoded labels (first 5): ['aegir_27', 'aegir_27', 'aegir_27', 'aegir_27', 'aegir_27']
