# Mel Spectrograms: 3-Second Windows

We can extend our dataset by slicing each audio clip into shorter windows and generating a mel spectrogram for each window. With 3-second windows, each 30-second clip yields 10 spectrograms, which grows the dataset from 1,000 samples to roughly 10,000 and gives the model more examples to learn from.

Splitting clips into windows in this way is a form of data augmentation, a common machine learning technique for enlarging the training set. Each window is treated as a new training sample, which can help the model generalize to unseen data.
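
As a quick check of the arithmetic, the sketch below counts how many full 3-second windows fit in a clip. It assumes a 30-second clip sampled at 22,050 Hz; `count_full_windows` is a hypothetical helper used only for illustration, not part of the notebook.

```python
def count_full_windows(num_samples: int, window_sec: float = 3.0, sr: int = 22050) -> int:
    """Return how many complete, non-overlapping windows fit in a clip."""
    window_samples = int(window_sec * sr)  # samples per window
    return num_samples // window_samples   # trailing partial window is dropped


# Assumed example: a 30-second clip sampled at 22,050 Hz
clip_samples = 30 * 22050
print(count_full_windows(clip_samples))      # 10
print(count_full_windows(clip_samples - 1))  # 9 (one sample short of 30 s)
```

A clip that is even slightly shorter than 30 seconds yields only 9 windows, so the final sample count can fall a little below 10,000.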

## Generating Mel Spectrogram Windows

Let's start by loading the required libraries.

In [11]:
import os
import cv2
import numpy as np
import pandas as pd
import librosa
import librosa.display
import matplotlib.pyplot as plt
import matplotlib.cm as cm

Next we will define a function that processes the dataset and generates a mel spectrogram for every window of every sample. It takes the dataset directory, an output directory for the images, an output CSV path, and the window length in seconds. It does not return anything; all results are written to disk.

The function iterates through each audio file, slices it into windows of the specified length, generates a mel spectrogram for each window, and saves it as a 128x128 PNG. It also extracts summary features for each window and writes one row per window to the CSV file.

Additionally, we will use librosa to extract features from the audio files and create the mel spectrograms. We will also use numpy to handle the data manipulation and storage of the generated windows.
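
The windowing step itself is plain array slicing. Here is a minimal sketch on a synthetic signal, assuming non-overlapping windows and dropping any trailing partial window; `split_into_windows` is a hypothetical helper used only for illustration.

```python
import numpy as np


def split_into_windows(y: np.ndarray, window_samples: int) -> np.ndarray:
    """Split a 1-D signal into non-overlapping windows of equal length."""
    num_windows = len(y) // window_samples
    # Trim the trailing partial window, then reshape to (num_windows, window_samples)
    trimmed = y[: num_windows * window_samples]
    return trimmed.reshape(num_windows, window_samples)


# Synthetic stand-in for an audio clip: 10 samples, windows of 3 samples
signal = np.arange(10)
windows = split_into_windows(signal, window_samples=3)
print(windows.shape)  # (3, 3)
print(windows[1])     # [3 4 5]
```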

In [12]:
def process_dataset_windowed(
    dataset_dir: str,
    output_img_dir: str,
    output_csv: str,
    window_sec: float = 3.0,
    sr: int = 22050,
    n_mels: int = 128,
    n_fft: int = 2048,
    hop_length: int = 512
):
    """
    Processes a categorized audio dataset using fixed-length windows
    (3 seconds by default). Saves a mel spectrogram PNG per window under
    output_img_dir and writes per-window librosa features to output_csv.
    """

    os.makedirs(output_img_dir, exist_ok=True)
    records = []

    window_samples = int(window_sec * sr)

    for label in sorted(os.listdir(dataset_dir)):
        class_dir = os.path.join(dataset_dir, label)
        if not os.path.isdir(class_dir):
            continue

        class_img_dir = os.path.join(output_img_dir, label)
        os.makedirs(class_img_dir, exist_ok=True)

        for file in os.listdir(class_dir):
            if not file.lower().endswith(".wav"):
                continue

            try:
                wav_path = os.path.join(class_dir, file)
                y, sr = librosa.load(wav_path, sr=sr, mono=True)

                num_windows = len(y) // window_samples

                for w in range(num_windows):
                    start = w * window_samples
                    end = start + window_samples
                    y_win = y[start:end]

                    win_id = f"{file[:-4]}_w{w:03d}"
                    img_path = os.path.join(class_img_dir, f"{win_id}.png")

                    # ------------------ Mel Spectrogram (RGB, 128x128) ------------------
                    mel = librosa.feature.melspectrogram(
                        y=y_win,
                        sr=sr,
                        n_fft=n_fft,
                        hop_length=hop_length,
                        n_mels=n_mels
                    )
                    mel_db = librosa.power_to_db(mel, ref=np.max)

                    # Normalize to [0, 1]
                    mel_norm = (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-6)

                    # Apply colormap (returns RGBA)
                    mel_rgb = cm.viridis(mel_norm)

                    # Drop alpha channel, convert to uint8
                    mel_rgb = (mel_rgb[:, :, :3] * 255).astype(np.uint8)

                    # Resize to EXACT 128x128
                    mel_rgb = cv2.resize(
                        mel_rgb,
                        (128, 128),
                        interpolation=cv2.INTER_AREA
                    )

                    # Save PNG (RGB)
                    cv2.imwrite(img_path, cv2.cvtColor(mel_rgb, cv2.COLOR_RGB2BGR))


                    # ------------------ Feature Extraction ------------------
                    row = {
                        "label": label,
                        "file": file,
                        "window": w
                    }

                    chroma = librosa.feature.chroma_stft(y=y_win, sr=sr)
                    row["chroma_stft_mean"] = chroma.mean()
                    row["chroma_stft_var"] = chroma.var()

                    rms = librosa.feature.rms(y=y_win)
                    row["rms_mean"] = rms.mean()
                    row["rms_var"] = rms.var()

                    centroid = librosa.feature.spectral_centroid(y=y_win, sr=sr)
                    row["spectral_centroid_mean"] = centroid.mean()
                    row["spectral_centroid_var"] = centroid.var()

                    bandwidth = librosa.feature.spectral_bandwidth(y=y_win, sr=sr)
                    row["spectral_bandwidth_mean"] = bandwidth.mean()
                    row["spectral_bandwidth_var"] = bandwidth.var()

                    rolloff = librosa.feature.spectral_rolloff(y=y_win, sr=sr)
                    row["rolloff_mean"] = rolloff.mean()
                    row["rolloff_var"] = rolloff.var()

                    zcr = librosa.feature.zero_crossing_rate(y_win)
                    row["zero_crossing_rate_mean"] = zcr.mean()
                    row["zero_crossing_rate_var"] = zcr.var()

                    y_harm = librosa.effects.harmonic(y_win)
                    row["harmony_mean"] = y_harm.mean()
                    row["harmony_var"] = y_harm.var()

                    perceptr = librosa.feature.spectral_contrast(y=y_win, sr=sr)
                    row["perceptr_mean"] = perceptr.mean()
                    row["perceptr_var"] = perceptr.var()

                    tempo, _ = librosa.beat.beat_track(y=y_win, sr=sr)
                    # tempo may be returned as a 1-element array, so take its first element
                    row["tempo"] = float(np.atleast_1d(tempo)[0])

                    mfcc = librosa.feature.mfcc(y=y_win, sr=sr, n_mfcc=20)
                    for i in range(20):
                        row[f"mfcc{i+1}_mean"] = mfcc[i].mean()
                        row[f"mfcc{i+1}_var"] = mfcc[i].var()

                    records.append(row)

            except Exception as e:
                print(f"Error processing file {file}: {e}. Skipping.......")

    df = pd.DataFrame(records)
    df.to_csv(output_csv, index=False)
    print(f"Saved {len(df)} windowed samples → {output_csv}")
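
For intuition about the image size: with librosa's default centered framing, a signal of `n` samples produces `1 + n // hop_length` frames. The sketch below applies that formula to the defaults used above; `mel_frame_count` is a hypothetical helper, and the formula assumes `center=True`.

```python
def mel_frame_count(num_samples: int, hop_length: int = 512) -> int:
    """Frames produced by a centered STFT (librosa's default, center=True)."""
    return 1 + num_samples // hop_length


window_samples = int(3.0 * 22050)         # 66150 samples per 3-second window
frames = mel_frame_count(window_samples)  # length of the spectrogram's time axis
print((128, frames))                      # (128, 130)
```

Under these assumptions each window yields a 128 x 130 spectrogram, so the resize to 128 x 128 only compresses the time axis slightly.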


In [13]:
audio_dir = r"C:\Users\JTWit\Documents\ECE 579\Datasets\GTZAN Dataset\genres_original"
img_dir = r"C:\Users\JTWit\Desktop\GTZAN 3 Seconds"

process_dataset_windowed(
    dataset_dir=audio_dir,
    output_img_dir=img_dir,
    output_csv="audio_features.csv",
    window_sec=3.0
)



[Warning output truncated in export: one warning points at the `tempo` float conversion, and another comes from the audioread fallback used by `librosa.load`, which is deprecated as of librosa 0.10.0 and will be removed in librosa 1.0.]

Error processing file jazz.00054.wav: . Skipping.......
Saved 9981 windowed samples → audio_features.csv
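
The total of 9981 is 19 short of the expected 10,000. The skipped file would account for 10 of those windows; the remaining 9 presumably come from clips that are slightly shorter than 30 seconds. Below is a sketch for finding such files, using a small in-memory stand-in for the rows of `audio_features.csv` (the sample rows are made up for illustration).

```python
from collections import Counter

# Made-up stand-in rows; in practice these would be read from audio_features.csv
rows = [
    {"file": "blues.00000.wav", "window": 0},
    {"file": "blues.00000.wav", "window": 1},
    {"file": "blues.00000.wav", "window": 2},
    {"file": "blues.00001.wav", "window": 0},
    {"file": "blues.00001.wav", "window": 1},
]
expected_windows = 3  # would be 10 for 30-second clips and 3-second windows

# Count how many windows each source file produced
windows_per_file = Counter(row["file"] for row in rows)

# Files that yielded fewer windows than expected
short_files = {f: n for f, n in windows_per_file.items() if n < expected_windows}
print(short_files)  # {'blues.00001.wav': 2}
```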


In [24]:
import pandas as pd

df = pd.read_csv(r"C:\Users\JTWit\Documents\ECE 579\Datasets\GTZAN Dataset\audio_features.csv")


# Function to create image filename
def get_image_filename(row):
    label = row['label']
    number = row['file'].split('.')[1]
    window_padded = f"{row['window']:03d}"  # Pad window with 3 digits
    return f"{label}.{number}_w{window_padded}.png"

# Add new column
df['image_file'] = df.apply(get_image_filename, axis=1)

print(df.head())

# index=False avoids writing the DataFrame index as an extra unnamed column
df.to_csv(r"C:\Users\JTWit\Documents\ECE 579\Datasets\GTZAN Dataset\audio_features.csv", index=False)

   label             file  window  chroma_stft_mean  chroma_stft_var  \
0  blues  blues.00000.wav       0          0.335555         0.090997   
1  blues  blues.00000.wav       1          0.343523         0.086782   
2  blues  blues.00000.wav       2          0.347746         0.092495   
3  blues  blues.00000.wav       3          0.363863         0.087207   
4  blues  blues.00000.wav       4          0.335481         0.088482   

   rms_mean   rms_var  spectral_centroid_mean  spectral_centroid_var  \
0  0.130189  0.003559             1773.358004          169450.829707   
1  0.112119  0.001491             1817.244034           90766.297514   
2  0.130895  0.004552             1790.722358          110071.206762   
3  0.131349  0.002338             1660.545231          109496.936309   
4  0.142370  0.001734             1634.465076           77425.419156   

   spectral_bandwidth_mean  ...  mfcc16_var  mfcc17_mean  mfcc17_var  \
0              1972.334258  ...   39.547077    -3.230046   36.
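
Since `process_dataset_windowed` names each image `{file[:-4]}_w{w:03d}.png`, the same name can also be derived directly from the `file` and `window` columns. The following is a hedged alternative sketch; `image_filename` is a hypothetical helper and assumes the `.wav` naming shown above.

```python
import os


def image_filename(file: str, window: int) -> str:
    """Rebuild the PNG name written for a given source file and window index."""
    stem = os.path.splitext(file)[0]  # "blues.00000.wav" -> "blues.00000"
    return f"{stem}_w{window:03d}.png"


print(image_filename("blues.00000.wav", 4))  # blues.00000_w004.png
```

With the DataFrame loaded earlier, this could be applied as `df["image_file"] = [image_filename(f, w) for f, w in zip(df["file"], df["window"])]`.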

Now that we have 3-second windows for each sample, we will split the dataset into training and testing sets. This lets us evaluate the model on data it was not trained on and check whether it is overfitting to the training data.

We will reuse some code from dataset.py.

In [10]:
import os
import shutil
import numpy as np

In [17]:
IMAGES_PATH = r"C:\Users\JTWit\Desktop\GTZAN 3 Seconds"

SPLIT_BASE_PATH = r'C:\Users\JTWit\Documents\ECE 579\Datasets\Split GTZAN Dataset 3s'
SPLIT_TRAIN_PATH = os.path.join(SPLIT_BASE_PATH, 'train')
SPLIT_TEST_PATH = os.path.join(SPLIT_BASE_PATH, 'test')

# Make the target base path and the train and test split directories
os.makedirs(SPLIT_BASE_PATH, exist_ok=True)
os.makedirs(SPLIT_TRAIN_PATH, exist_ok=True)
os.makedirs(SPLIT_TEST_PATH, exist_ok=True)

#Let's also include all the subfolders for train and test
for label in os.listdir(IMAGES_PATH):

    train_path = os.path.join(SPLIT_TRAIN_PATH,label)
    test_path = os.path.join(SPLIT_TEST_PATH,label)

    os.makedirs(train_path,exist_ok = True)
    os.makedirs(test_path,exist_ok = True)

In [21]:
images = {}
for root, dirs, files in os.walk(IMAGES_PATH):
    # Skip directories without files (e.g., the top-level folder itself)
    if not files:
        continue

    # The genre label is the name of the folder that holds the images
    key = os.path.basename(root)
    images[key] = [os.path.join(root, file) for file in files]


In [22]:
for key in images.keys():
    np.random.shuffle(images[key])

    # Compute the 80/20 boundary once per genre
    split_index = int(0.8 * len(images[key]))

    for i, image in enumerate(images[key]):
        # The first 80% of the shuffled images go to train, the rest to test
        split_path = SPLIT_TRAIN_PATH if i < split_index else SPLIT_TEST_PATH
        destination_path = os.path.join(split_path, key, os.path.basename(image))
        shutil.copyfile(image, destination_path)
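
One caveat with splitting at the window level: windows cut from the same track can land in both train and test, which may make test accuracy look better than it would be on truly unseen songs. Below is a minimal sketch of a track-level split, assuming image names follow the `genre.NNNNN_wWWW.png` pattern produced earlier; `split_by_track` and the fixed seed are illustrative choices, not part of the original code.

```python
import random
from collections import defaultdict


def split_by_track(image_names, train_fraction=0.8, seed=0):
    """Split image names so all windows of a track stay in the same subset."""
    # Group windows by source track: "blues.00000_w003.png" -> "blues.00000"
    tracks = defaultdict(list)
    for name in image_names:
        tracks[name.rsplit("_w", 1)[0]].append(name)

    # Shuffle tracks (not windows) with a fixed seed for reproducibility
    track_ids = sorted(tracks)
    random.Random(seed).shuffle(track_ids)

    split_index = int(train_fraction * len(track_ids))
    train = [n for t in track_ids[:split_index] for n in tracks[t]]
    test = [n for t in track_ids[split_index:] for n in tracks[t]]
    return train, test


# Illustrative names: 5 tracks with 2 windows each
names = [f"blues.{t:05d}_w{w:03d}.png" for t in range(5) for w in range(2)]
train, test = split_by_track(names)
print(len(train), len(test))  # 8 2
```

Applied once per genre, this keeps roughly the same 80/20 ratio while ensuring that no track contributes windows to both subsets.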
    

    