
Music Genre Classification

Project Created By Jack Kelly, Zad Khan, Luke Gries, and Dylan Kiratli

Introduction

Music is constantly evolving. With the digital age making it easier than ever to produce music and reach an audience, the pace of musical innovation has accelerated and the variety of styles has exploded. The music landscape and its genres have become more intertwined, and more convoluted, than ever. It can be difficult to classify which genre a particular song falls into when the edges between categories have become so blurred; it may even be difficult to recognize two distinct songs as belonging to the same genre. We turned to neural networks to help us tackle this problem. Our project investigates whether a neural network can be useful in detecting which genre a given song falls under. Specifically, we examined whether a convolutional neural network could distinguish between ten of the most popular genres: blues, classical, country, disco, hiphop, jazz, metal, pop, reggae, and rock.

Our Approach

Because our work centered on categorizing music genres from audio files, we elected to use a convolutional neural network (CNN). We collected data from a Kaggle dataset that provided 1,000 unique audio files spanning the ten genres we set out to investigate: 100 files per genre, each 30 seconds in length. To increase the size of our training set, we split each 30-second audio clip into 26 overlapping 5-second clips, one beginning at each 1-second mark (the first clip from 0–5 seconds, the second from 1–6 seconds, and so on). This increased the size of our dataset by a factor of 26, for a total of 26,000 audio clips. Before training the model, we preprocessed the data by converting these audio signals into mel spectrograms and standardizing their values to the range 0–1.
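
To make the windowing concrete, here is a minimal sketch of the start and end times produced by the overlapping split described above; the variable names are illustrative.

    # Illustrative: one 30-second clip, 5-second windows, 1-second step
    clip_length = 30   # seconds
    window = 5         # seconds per segment
    step = 1           # seconds between consecutive window starts

    # Start times 0, 1, ..., 25 give 26 overlapping windows
    starts = range(0, clip_length - window + 1, step)
    windows = [(s, s + window) for s in starts]

    print(len(windows))  # 26
    print(windows[:3])   # [(0, 5), (1, 6), (2, 7)]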

Spectrograms

In researching how to represent audio for a CNN, we discovered spectrograms: visual representations of an audio file. A spectrogram displays how the frequency content of a signal changes over time, with the x-axis representing time, the y-axis representing frequency, and color denoting intensity. The spectrograms below highlight the difference between a pop and a reggae track. The spectrogram of the reggae track shows a low-intensity color scheme, reflecting the song's focus on rhythm and its laid-back tempo. The pop track, on the other hand, displays intense colors during the chorus and a break, indicating a more dynamic and repetitive composition that emphasizes high frequencies. Spectrograms can help us understand how different genres of music are composed and can be a useful tool in genre classification.

Spectrograms of Pop.20 and Reggae.04
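
For reference, a spectrogram image like the ones above can be generated with librosa. The sketch below is illustrative rather than our exact plotting code, and the file path is a placeholder for one of the GTZAN tracks.

    import librosa
    import librosa.display
    import matplotlib.pyplot as plt
    import numpy as np

    # Placeholder path to one of the GTZAN tracks
    wav_path = "../data/genres_original/pop/pop.00020.wav"

    y, sr = librosa.load(wav_path, sr=None)
    mel = librosa.feature.melspectrogram(y=y, sr=sr)
    mel_db = librosa.power_to_db(mel, ref=np.max)  # convert power to decibels for display

    librosa.display.specshow(mel_db, sr=sr, x_axis="time", y_axis="mel")
    plt.colorbar(format="%+2.0f dB")
    plt.title("Mel spectrogram")
    plt.tight_layout()
    plt.show()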

Since the Kaggle dataset only provided spectrograms for the full 30-second samples, we wrote a helper function to generate mel spectrograms for the five-second splices. Below is the code for the helper function, which both splices the larger wav file into smaller segments and saves a mel spectrogram for each:

    import librosa
    import numpy as np

    def get_melspectrogram(wav_file_path, length=30, duration_of_segments=5, overlap=False, duration_of_step=1):
    
        """
        Get mel spectrogram for a given wav file and divide it into parts.
    
        :param wav_file_path: Path to the source wav file
        :param length: length in seconds of the source audio file. Defaults to 30.
        :param duration_of_segments: duration of segments in seconds. number of segments = length/duration_of_segments. Defaults to 5.
        :param overlap: boolean determining whether slices of audio file will be overlapped or distinct. Defaults to False
        :param duration_of_step: step size from the beginning of one segment to the beginning of the next. Defaults to 1 second. Unused if overlap is False.
        :return: Mel spectrogram of the full source wav file. Each segment is saved to its own .npy file.
        """
        y, sr = librosa.load(wav_file_path, sr=None, duration=length)
        melspectrogram = librosa.feature.melspectrogram(y=y, sr=sr)
    
        # Determine the number of samples in the duration of segments
        samples_per_segment = sr * duration_of_segments
    
        if overlap:
            samples_per_step = sr * duration_of_step
            num_segments = (length - duration_of_segments) // duration_of_step + 1
        else:
            num_segments = length // duration_of_segments
    
        # Loop through the audio signal and extract the segments
        for i in range(num_segments):
            # Get the start and end indices of the segment
            if overlap:
                start = i * samples_per_step
            else:
                start = i * samples_per_segment
            end = start + samples_per_segment
    
            # Extract the segment from the audio signal
            segment = y[start:end]
    
            # Compute the mel spectrogram of the segment
            mel_spec_segment = librosa.feature.melspectrogram(y=segment, sr=sr)
    
            sample_name = wav_file_path.replace("../data/genres_original/","").replace(".wav","")
            sample_name = f'{sample_name.split("/")[0]}/npy/{sample_name.split("/")[1]}'
    
            if overlap:
                directory = "overlap"
            else:
                directory = "distinct"
    
            save_path = f'../data/mel_spec_samples/{directory}/{sample_name}_{i}.npy'
            np.save(save_path, mel_spec_segment)
            """ print(f'Saved segment /mel_spec_samples/{directory}/{sample_name}_{i}.npy') """
    
        return melspectrogram
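
As a usage example, the helper is called once per source file. The call below is a sketch and assumes the output directory (here ../data/mel_spec_samples/overlap/pop/npy/) already exists:

    # Sketch: generate overlapping 5-second mel-spectrogram segments for one track
    full_spec = get_melspectrogram(
        "../data/genres_original/pop/pop.00020.wav",
        length=30,
        duration_of_segments=5,
        overlap=True,
        duration_of_step=1,
    )
    # Writes pop.00020_0.npy through pop.00020_25.npy to the directory above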

Building the Convolutional Neural Network

Our first attempt at building a CNN for genre classification used multiple convolutional layers and can be seen below. We based this initial architecture on our previous work classifying handwritten digits from the MNIST dataset.

    # First attempt at our CNN architecture
    import tensorflow as tf
    from tensorflow import keras
    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import Conv2D, Dense, Dropout, Flatten, MaxPooling2D

    net = Sequential([
        keras.Input(shape=(128, 216, 1)),
        Conv2D(28, kernel_size=(3, 3), activation="relu", kernel_regularizer=tf.keras.regularizers.l2(0.01)),
        MaxPooling2D(pool_size=(2, 2)),
        Dropout(0.2),
        Conv2D(14, kernel_size=(3, 3), activation="relu", kernel_regularizer=tf.keras.regularizers.l2(0.02)),
        MaxPooling2D(pool_size=(2, 2)),
        Flatten(),
        Dropout(0.2),
        Dense(10, activation="softmax")])

    net.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
    net.fit(train_dataset, batch_size=BATCH_SIZE, epochs=10)
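
The train_dataset and BATCH_SIZE used above are not shown in the snippet; below is a minimal sketch of one way such a dataset could be assembled from the saved .npy segments. The directory layout follows the helper function above, but the loading code and label encoding are illustrative, everything is loaded into memory for brevity, and a held-out test split is omitted.

    import glob

    import numpy as np
    import tensorflow as tf

    GENRES = ["blues", "classical", "country", "disco", "hiphop",
              "jazz", "metal", "pop", "reggae", "rock"]
    BATCH_SIZE = 32  # illustrative value

    features, labels = [], []
    for label, genre in enumerate(GENRES):
        for path in glob.glob(f"../data/mel_spec_samples/overlap/{genre}/npy/*.npy"):
            mel = np.load(path)                    # expected shape (128, 216) for 5 s at 22.05 kHz
            features.append(mel[..., np.newaxis])  # add a channel axis -> (128, 216, 1)
            labels.append(label)

    train_dataset = tf.data.Dataset.from_tensor_slices(
        (np.array(features, dtype=np.float32), np.array(labels))
    ).shuffle(len(labels)).batch(BATCH_SIZE)
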
  1. We quickly found our model to be overfitting the data and needed to pivot. Our first solution was to add regularization; we thought that regularizing the model would at least reduce how badly it overfit. This was not the case, however: the model still overfit the data, and training only became slower.

  2. Realizing that regularization alone was not going to give us the desired results, we opted to remove a layer and tweak the dropout percentages. This iteration of our model reached a test accuracy of 60%.

  3. Because the test accuracy was still low, we then sought to standardize the data. We did this by first converting the spectrogram values from a power scale to decibels (dB) and then dividing all values by the largest entry (see the sketch after this list). This iteration yielded a test accuracy of 70%, which we were initially satisfied with.

  4. Our research on ResearchGate indicated that other machine learning music genre classifiers typically achieved a test accuracy of 74–78%. To see if we could compete with these models, we implemented early stopping with a patience of 5, tracking validation loss. We allowed the model to run until termination, resulting in a test accuracy of 80%, although the training accuracy was over 95%. While the model still overfit the data, these were the best results we were able to achieve.
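
As referenced in step 3, below is a minimal sketch of the kind of standardization described there: converting mel-spectrogram power values to decibels and then scaling them into the 0–1 range. We divided by the largest entry; the min-max scaling shown here is one common variant and is illustrative rather than our exact code.

    import librosa
    import numpy as np

    def standardize_mel(mel_power):
        """Convert a mel spectrogram from power to dB, then scale into [0, 1]."""
        mel_db = librosa.power_to_db(mel_power, ref=np.max)  # roughly [-80, 0] dB
        return (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min())

    # Example: standardized = standardize_mel(np.load("pop.00020_0.npy"))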

Here is our final network architecture that achieved an 80% test accuracy.

    from tensorflow.keras.regularizers import L1L2, L2

    net = Sequential([
        keras.Input(shape=(128, 216, 1)),
        Conv2D(32, kernel_size=(3, 3), activation="relu",
               kernel_regularizer=L1L2(l1=0.01, l2=0.01),
               bias_regularizer=L2(0.01),
               activity_regularizer=L2(0.01)),
        MaxPooling2D(pool_size=(2, 2)),
        Dropout(0.5),
        Flatten(),
        Dense(10, activation="softmax")])

    net.compile(loss="sparse_categorical_crossentropy", optimizer="adam",
                metrics=["accuracy"])

    early_stopping = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                                      mode="min", restore_best_weights=True)

    history = net.fit(b_train_dataset, batch_size=BATCH_SIZE, verbose=1, epochs=100,
                      callbacks=[early_stopping], validation_data=b_test_dataset, shuffle=True)
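
With the network trained, the test accuracy can be read off with Keras' evaluate; a short sketch, assuming the same b_test_dataset used for validation above:

    # Returns [loss, accuracy] given the metrics passed to compile()
    test_loss, test_acc = net.evaluate(b_test_dataset, verbose=0)
    print(f"Test accuracy: {test_acc:.2%}")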

Analyzing Our Results

Are we happy with 80% test accuracy?

We're relatively happy, given the limitations of our dataset. Without splicing the audio files into five-second clips, our model would have been trained on only 1,000 data points, a very small amount by machine learning standards. Unlike the 30-second clips, which capture a fuller picture of a song's tempo, rhythm, and vocals, some five-second increments may lack sufficient characteristics to help the model distinguish between genres; for example, a sample from 0–5 seconds may not capture the same information as a sample from 12–17 seconds. Ultimately, we achieved a test accuracy of 80%, which is comparable to or higher than the accuracy achieved by other music genre classifiers in our findings on ResearchGate.

Do we think we could have continued to improve the model substantially?

While we were able to improve the model’s test accuracy to 80%, further substantial improvements may have been difficult to achieve due to limitations in the data and model architecture. As we previously mentioned, our dataset had certain information gaps due to the splicing of audio files into five-second clips, which may have affected the model’s ability to distinguish between certain genres. Additionally, our CNN architecture may have been limited in its ability to capture certain features of the audio files.

If we had more time, are there other avenues we would have explored?

For sure. The amount of tweaking and the number of different models we could have tried would have been an entire project in itself. For example, incorporating more advanced data augmentation techniques, such as pitch shifting or time stretching (sketched below), could help generate more diverse data for the model to learn from. Additionally, experimenting with different model architectures, or even exploring alternative approaches such as recurrent neural networks (RNNs) or transformers, could yield better results. While further substantial improvements would have been challenging with our current model, there are still many opportunities for optimizing its performance.
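
For example, librosa already ships pitch shifting and time stretching; the sketch below shows how augmented signals might be generated from a loaded clip (the parameters and path are illustrative):

    import librosa

    y, sr = librosa.load("../data/genres_original/pop/pop.00020.wav", sr=None)

    # Shift the pitch up two semitones without changing the tempo
    y_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)

    # Speed the clip up by 10% without changing the pitch (rate < 1 slows it down)
    y_stretched = librosa.effects.time_stretch(y, rate=1.1)

    # Each augmented signal could then be spliced and converted to mel spectrograms
    # exactly like the original clips.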

Takeaways

Ultimately, our project aimed to investigate whether a convolutional neural network could accurately categorize songs into one of ten popular genres. After collecting a dataset of 1,000 audio files and splitting them into 5-second clips, we converted them into mel spectrograms to feed into our CNN. Although our initial network overfit the data, we improved our results by removing a layer, tweaking dropout percentages, standardizing the data, and adding early stopping, eventually reaching a test accuracy of 80%. This result is relatively good but still limited by the size of our dataset and the splicing method we used. Overall, our project shows that a convolutional neural network can be useful in genre classification, but further improvements could be made with more diverse and representative data.


Unexpected Results

This model, found during testing, was as effective as our final model. It uses a 2 x 2 kernel and was trained on unstandardized data. We do not know why it worked, but we chose to include it because it did.

    net = Sequential([
        keras.Input(shape=(128, 216, 1)),
        Conv2D(16, kernel_size=(2, 2), activation="relu", kernel_regularizer=L2(0.1)),
        Flatten(),
        Dropout(0.6),
        Dense(10, activation="softmax")])
    
    net.compile(loss="sparse_categorical_crossentropy", optimizer="adam",
                metrics=["accuracy"])
    
    early_stopping = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                                      mode="min", restore_best_weights=True)
    history = net.fit(train_dataset, batch_size=BATCH_SIZE, verbose=1, epochs=10,
                      callbacks=[early_stopping], validation_data=test_dataset)

Project Repository: https://github.com/JKelly423/music-genre-classification

Repository Documentation: https://jkelly423.github.io/music-genre-classification/

Kaggle Dataset: https://www.kaggle.com/datasets/andradaolteanu/gtzan-dataset-music-genre-classification
