# Mel Spectrograms from 3-Second Windows

We can extend our dataset by splitting each audio clip into shorter windows and generating a mel spectrogram for each window. With 3-second windows, each clip of roughly 30 seconds yields up to 10 windows, which grows the dataset from 1000 samples to about 10000 and gives the model more training data.

Splitting clips this way can be seen as a simple form of data augmentation, a common machine learning technique for enlarging the training set. Each window becomes a new training sample, which can help the model generalize to unseen data.
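The window arithmetic can be sketched on a synthetic signal. This is a minimal illustration, not part of the pipeline: the 30-second clip length and the all-zeros array `y` are assumptions standing in for a real recording, while the sample rate and window length match the defaults used later in this notebook.

```python
import numpy as np

sr = 22050          # sample rate (samples per second)
window_sec = 3.0    # window length in seconds
clip_sec = 30.0     # assumed clip length, for illustration only

window_samples = int(window_sec * sr)  # samples per window

# Stand-in for a loaded audio clip (silence of the assumed length).
y = np.zeros(int(clip_sec * sr), dtype=np.float32)

# Floor division keeps only complete windows; leftover samples are dropped.
num_windows = len(y) // window_samples
windows = [
    y[w * window_samples:(w + 1) * window_samples]
    for w in range(num_windows)
]

# A clip that is even one sample short of 30 seconds loses its last window.
num_windows_short = (len(y) - 1) // window_samples

print(window_samples, num_windows, num_windows_short)
```

Because only complete windows are kept, a clip slightly shorter than 30 seconds contributes 9 windows rather than 10, so the final sample count can fall a little below 10000.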

## Generating Mel Spectrogram Windows

Let's start by loading the required libraries.

In [3]:
import os
import numpy as np
import pandas as pd
import librosa
import librosa.display
import matplotlib.pyplot as plt

Next we will define a function that processes the dataset and generates a mel spectrogram for every window of every audio file. It takes the dataset directory, an output directory for the images, an output CSV path, and the window length in seconds. It does not return anything; its results are written to disk.

The function iterates over each class folder and each `.wav` file, splits the audio into non-overlapping windows of the specified length, and computes a mel spectrogram for each window. Any trailing audio shorter than a full window is dropped. Each spectrogram is saved as a PNG, and a row of summary features for the window is collected and written to the CSV at the end.

We use librosa to load the audio, compute the mel spectrograms, and extract the features. numpy supplies the reference value for the decibel conversion, and pandas stores the per-window features.
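To get a feel for the size of each per-window spectrogram, the expected array shape can be worked out from the parameters alone. This is a sketch that assumes librosa's default centered framing (`center=True`), under which the number of frames is `1 + len(y) // hop_length`; the parameter values are the function defaults used in this notebook.

```python
sr = 22050
window_sec = 3.0
n_mels = 128
hop_length = 512

window_samples = int(window_sec * sr)

# With centered framing (assumed default), the signal is padded so that
# frames are centered on multiples of hop_length.
n_frames = 1 + window_samples // hop_length

mel_shape = (n_mels, n_frames)
print(mel_shape)  # (128, 130)
```

This is the shape of the array passed to the plotting step. The pixel dimensions of the saved PNG are set by the figure size and DPI, so they differ from this shape.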

In [1]:
def process_dataset_windowed(
    dataset_dir: str,
    output_img_dir: str,
    output_csv: str,
    window_sec: float = 3.0,
    sr: int = 22050,
    n_mels: int = 128,
    n_fft: int = 2048,
    hop_length: int = 512
):
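    """
    Processes a categorized audio dataset using fixed-length, non-overlapping
    windows (3 seconds by default). Trailing audio shorter than one window
    is dropped.

    For each window, saves a mel spectrogram PNG under output_img_dir/<label>/
    and extracts librosa features, which are written to output_csv.
    """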

    os.makedirs(output_img_dir, exist_ok=True)
    records = []

    window_samples = int(window_sec * sr)

    for label in sorted(os.listdir(dataset_dir)):
        class_dir = os.path.join(dataset_dir, label)
        if not os.path.isdir(class_dir):
            continue

        class_img_dir = os.path.join(output_img_dir, label)
        os.makedirs(class_img_dir, exist_ok=True)

        for file in os.listdir(class_dir):
            if not file.lower().endswith(".wav"):
                continue


            try:
                wav_path = os.path.join(class_dir, file)
                # Audio is resampled to the requested rate, so the returned rate is ignored.
                y, _ = librosa.load(wav_path, sr=sr, mono=True)

                num_windows = len(y) // window_samples

                for w in range(num_windows):
                    start = w * window_samples
                    end = start + window_samples
                    y_win = y[start:end]

                    # Strip the extension without assuming its length.
                    win_id = f"{os.path.splitext(file)[0]}_w{w:03d}"
                    img_path = os.path.join(class_img_dir, f"{win_id}.png")

                    # ------------------ Mel Spectrogram ------------------
                    mel = librosa.feature.melspectrogram(
                        y=y_win,
                        sr=sr,
                        n_fft=n_fft,
                        hop_length=hop_length,
                        n_mels=n_mels
                    )
                    mel_db = librosa.power_to_db(mel, ref=np.max)

                    plt.figure(figsize=(10, 4))
                    librosa.display.specshow(
                        mel_db,
                        sr=sr,
                        hop_length=hop_length,
                        x_axis=None,
                        y_axis=None
                    )
                    plt.axis("off")
                    plt.tight_layout()
                    plt.savefig(img_path, dpi=300, bbox_inches="tight", pad_inches=0)
                    plt.close()

                    # ------------------ Feature Extraction ------------------
                    row = {
                        "label": label,
                        "file": file,
                        "window": w
                    }

                    chroma = librosa.feature.chroma_stft(y=y_win, sr=sr)
                    row["chroma_stft_mean"] = chroma.mean()
                    row["chroma_stft_var"] = chroma.var()

                    rms = librosa.feature.rms(y=y_win)
                    row["rms_mean"] = rms.mean()
                    row["rms_var"] = rms.var()

                    centroid = librosa.feature.spectral_centroid(y=y_win, sr=sr)
                    row["spectral_centroid_mean"] = centroid.mean()
                    row["spectral_centroid_var"] = centroid.var()

                    bandwidth = librosa.feature.spectral_bandwidth(y=y_win, sr=sr)
                    row["spectral_bandwidth_mean"] = bandwidth.mean()
                    row["spectral_bandwidth_var"] = bandwidth.var()

                    rolloff = librosa.feature.spectral_rolloff(y=y_win, sr=sr)
                    row["rolloff_mean"] = rolloff.mean()
                    row["rolloff_var"] = rolloff.var()

                    zcr = librosa.feature.zero_crossing_rate(y_win)
                    row["zero_crossing_rate_mean"] = zcr.mean()
                    row["zero_crossing_rate_var"] = zcr.var()

                    y_harm = librosa.effects.harmonic(y_win)
                    row["harmony_mean"] = y_harm.mean()
                    row["harmony_var"] = y_harm.var()

                    perceptr = librosa.feature.spectral_contrast(y=y_win, sr=sr)
                    row["perceptr_mean"] = perceptr.mean()
                    row["perceptr_var"] = perceptr.var()

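                    tempo, _ = librosa.beat.beat_track(y=y_win, sr=sr)
                    # Depending on the librosa version, tempo may be a scalar or a
                    # one-element array; np.atleast_1d handles both cases.
                    row["tempo"] = float(np.atleast_1d(tempo)[0])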

                    mfcc = librosa.feature.mfcc(y=y_win, sr=sr, n_mfcc=20)
                    for i in range(20):
                        row[f"mfcc{i+1}_mean"] = mfcc[i].mean()
                        row[f"mfcc{i+1}_var"] = mfcc[i].var()

                    records.append(row)

            except Exception as e:
                # Report the cause so that skipped files can be diagnosed later.
                print(f"Error processing file {file}: {e}. Skipping.")

    df = pd.DataFrame(records)
    df.to_csv(output_csv, index=False)
    print(f"Saved {len(df)} windowed samples → {output_csv}")


In [None]:
audio_dir = r"C:\Users\JTWit\Documents\ECE 579\Datasets\GTZAN Dataset\genres_original"
img_dir = r"C:\Users\JTWit\Desktop\GTZAN 3 Seconds"

process_dataset_windowed(
    dataset_dir=audio_dir,
    output_img_dir=img_dir,
    output_csv="audio_features.csv",
    window_sec=3.0
)


Running this cell printed warnings that point at the line storing `row["tempo"]` and at the `librosa.load` call. The second warning comes from librosa's audioread-based loader (`__audioread_load`), which is deprecated as of librosa version 0.10.0 and will be removed in librosa version 1.0.

One file, `jazz.00054.wav`, raised an error during processing and was skipped.
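
Since a file can be skipped or yield fewer windows than expected, it may be worth checking how many windows were written per source file. The sketch below uses a small hypothetical DataFrame that mimics the `label`, `file`, and `window` columns of the CSV; the helper name `windows_per_file` and the sample rows are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical rows mimicking the identifying columns of the output CSV.
sample = pd.DataFrame(
    {
        "label": ["blues"] * 3 + ["jazz"] * 2,
        "file": ["blues.00000.wav"] * 3 + ["jazz.00001.wav"] * 2,
        "window": [0, 1, 2, 0, 1],
    }
)


def windows_per_file(df: pd.DataFrame) -> pd.Series:
    """Count how many windows were written for each (label, file) pair."""
    return df.groupby(["label", "file"])["window"].count()


counts = windows_per_file(sample)
print(counts)

# On the real output, the same helper could be applied to the saved CSV:
# counts = windows_per_file(pd.read_csv("audio_features.csv"))
```

Files missing from the result were skipped entirely, and files with a lower count than the rest were shorter than the others.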
