## Feature Extraction

### Convolutional Neural Network
CNNs recognize patterns in spatial data, so they work best with images. We therefore convert the original audio data into spectrograms, which are graphs that show how the frequency content of a signal changes over time.<br>
We start with 1000 wav files and convert them into mel-spectrograms. We chose mel-spectrograms specifically because they plot the mel scale instead of raw frequency along the y-axis and color each point by its level in decibels rather than by the amplitude of the wave. Both scales approximate how humans perceive pitch and loudness, so these spectrograms focus on what listeners actually hear, which should make them better suited to genre classification.
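
To make the two scales concrete, here is a minimal sketch of the conversions involved. It assumes the HTK mel formula, which is one common definition; librosa's default mel scale uses the Slaney variant, so the exact values in our spectrograms will differ slightly. The helper names `hz_to_mel_htk` and `power_to_db` are illustrative, and this `power_to_db` is a simplified scalar stand-in for what `librosa.power_to_db` does.

```python
import math

def hz_to_mel_htk(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels using the HTK formula (an assumption,
    librosa defaults to the Slaney variant)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def power_to_db(power: float, ref: float = 1.0, floor: float = 1e-10) -> float:
    """Convert a power value to decibels relative to `ref`.
    `floor` avoids taking the log of zero."""
    return 10.0 * math.log10(max(power, floor) / ref)

# The mel scale compresses high frequencies: equal steps in Hz
# become smaller steps in mels as the frequency rises.
for f in (100, 1000, 5000, 10000):
    print(f"{f:>6} Hz -> {hz_to_mel_htk(f):7.1f} mel")

# Decibels compress loudness: each factor of 10 in power is a 10 dB step.
for p in (1.0, 0.1, 0.01):
    print(f"power {p:<5} -> {power_to_db(p):6.1f} dB")
```

Both conversions are logarithmic, which is why quiet, low-frequency detail stays visible in the image instead of being drowned out by a few loud peaks.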

In [None]:
import os
# generate genres_img folder for spectrograms

main_dir = "data"
genres_dir = os.path.join(main_dir, "genres_img")
genres = ["blues", "classical", "country", "disco", "hiphop", "jazz", "metal", "pop", "reggae", "rock"]

# data/, data/genres_img/, and one subfolder per genre
directories = [main_dir, genres_dir] + [os.path.join(genres_dir, genre) for genre in genres]

for directory in directories:
    if not os.path.exists(directory):
        os.makedirs(directory)
        print(f"Created directory: {directory}")
    else:
        print(f"Directory already exists: {directory}")

In [None]:
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Function to convert the wav file to mel-spectrogram
def save_mel_spectrogram(wav_path, output_image_path, sr=22050, n_mels=128):
    # Load audio file
    y, sr = librosa.load(wav_path, sr=sr)

    # Generate Mel Spectrogram
    mel_spec = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    
    # Convert to decibels for better visualization
    mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)

    # Create the plot
    plt.figure(figsize=(10, 4))
    librosa.display.specshow(mel_spec_db, sr=sr, x_axis='time', y_axis='mel')

    # Remove axes for a clean image
    plt.axis('off')

    # Save as an image
    plt.savefig(output_image_path, bbox_inches='tight', pad_inches=0)
    plt.close()

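PATH_WAV = "./data/genres_original/"  # source wav files, one subfolder per genre
PATH_IMG = "./data/genres_img/"       # output spectrogram images

# Convert every wav file to a mel-spectrogram image
for genre in os.listdir(PATH_WAV):
    for music in os.listdir(os.path.join(PATH_WAV, genre)):
        name, ext = os.path.splitext(music)
        if ext.lower() != ".wav":
            continue  # skip anything that is not a wav file
        save_mel_spectrogram(
            os.path.join(PATH_WAV, genre, music),
            os.path.join(PATH_IMG, genre, name + ".png"),
        )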


## Creating the Model

### Convolutional Neural Network

In [1]:
import tensorflow as tf
print(tf.config.list_physical_devices())

[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]


The first model reached 47% accuracy on the test data.

Increasing the epochs to 50 and the dense layer from 128 to 256 neurons dropped accuracy to 11%.

Next changes: decreased the epochs from 50 to 20, changed the dropout rate from 0.5 to 0.2, and reduced the dense layer from 256 back to 128 neurons.

Further changes: doubled the number of filters at each convolutional layer (32 -> 64 and so on), added BatchNormalization() after each convolutional layer, and increased the dense layer at the end from 128 to 512 neurons.

In [6]:
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout, BatchNormalization

# Set parameters
img_height = 308  # Resize all images to 308x775 (height x width)
img_width = 775
batch_size = 32  # Process 32 images at a time
data_dir = "./data/genres_img/"  # Path to dataset folder

# Create data generators
train_datagen =  ImageDataGenerator(rescale=1./255, validation_split=0.2)

train_generator = train_datagen.flow_from_directory(
    data_dir,
    target_size=(img_height, img_width),  
    batch_size=batch_size,
    class_mode='categorical',  
    subset='training'  
)

val_generator = train_datagen.flow_from_directory(
    data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='categorical',
    subset='validation'  # Uses 20% for validation
)

model = Sequential([
    # creates 32 filters: small 3x3 grids that slide over the image looking for patterns
    # relu is the Rectified Linear Unit activation
    Conv2D(32, (3, 3), activation='relu', input_shape=(img_height, img_width, 3)),
    BatchNormalization(),
    # reduces the size of the images taking the max values it found in each region
    MaxPooling2D(pool_size=(2, 2)),

    Conv2D(64, (3, 3), activation='relu'),
    BatchNormalization(),
    MaxPooling2D(pool_size=(2, 2)),

    Conv2D(128, (3, 3), activation='relu'),
    BatchNormalization(),
    MaxPooling2D(pool_size=(2, 2)),

    Flatten(),
    # fully connected layer that combines the extracted features
    Dense(128, activation='relu'),
    # randomly drops 20% of the neurons during training to prevent overfitting
    Dropout(0.2),
    # assigns probabilities to each of the 10 genres
    Dense(10, activation='softmax')  # 10 output classes (one for each genre)
])

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

epochs = 20  # Number of times model sees the dataset

history = model.fit(
    train_generator,  # Training data
    validation_data=val_generator,  # Validation data
    epochs=epochs
)

val_loss, val_acc = model.evaluate(val_generator)
print(f"Validation Accuracy: {val_acc:.4f}")

model.save("music_genre_cnn.h5")

Found 800 images belonging to 10 classes.
Found 200 images belonging to 10 classes.
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Validation Accuracy: 0.0950
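
A validation accuracy of 0.0950 is easier to interpret next to the accuracy a model would get without learning anything. The sketch below computes two such baselines: guessing uniformly at random and always predicting the most common class. It assumes the 200 validation images are split evenly at 20 per genre, which the output above does not confirm; the function name `chance_baselines` is illustrative.

```python
def chance_baselines(class_counts):
    """Return (uniform_guess_accuracy, majority_class_accuracy)
    for a dataset described by its per-class image counts."""
    total = sum(class_counts.values())
    uniform = 1.0 / len(class_counts)               # pick a class at random
    majority = max(class_counts.values()) / total   # always pick the largest class
    return uniform, majority

genres = ["blues", "classical", "country", "disco", "hiphop",
          "jazz", "metal", "pop", "reggae", "rock"]

# Assumption: 200 validation images, evenly split across the 10 genres.
val_counts = {genre: 20 for genre in genres}

uniform, majority = chance_baselines(val_counts)
print(f"Uniform guess: {uniform:.4f}, majority class: {majority:.4f}")
```

Under that assumption both baselines are 0.10, so this run (and the earlier 11% run) performs about as well as guessing, which suggests the model is not learning from the images rather than learning slowly.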


In [7]:
model.save('music_genre_cnn.keras')
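
One thing worth checking before further tuning is how large the network is relative to the 800 training images. The sketch below works through the layer arithmetic for the architecture above, assuming the Keras defaults that the model code relies on: `Conv2D` with `'valid'` padding and stride 1, and `MaxPooling2D` with a stride equal to its pool size. The helper names are illustrative.

```python
def conv_pool_out(size, kernel=3, pool=2):
    """Output length along one axis after a 'valid' 3x3 convolution
    followed by 2x2 max pooling (floor division, as Keras does)."""
    return (size - kernel + 1) // pool

def flatten_size(height, width, last_filters, n_blocks=3):
    """Number of values produced by Flatten() after n_blocks conv+pool blocks."""
    for _ in range(n_blocks):
        height = conv_pool_out(height)
        width = conv_pool_out(width)
    return height * width * last_filters

def dense_params(inputs, units):
    """Weights plus biases of a fully connected layer."""
    return inputs * units + units

flat = flatten_size(308, 775, last_filters=128)
print(f"Flatten output size: {flat:,}")
print(f"Dense(128) parameters: {dense_params(flat, 128):,}")
print(f"Dense(512) parameters: {dense_params(flat, 512):,}")
```

With 308x775 inputs the feature map shrinks to 36x95 with 128 channels, so the first dense layer alone holds tens of millions of weights. Smaller input images, an extra pooling stage, or global average pooling in place of `Flatten()` are possible ways to bring that down, though none of them has been tried here.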

## Testing Model