# CSC475/575 Spring 2024 - Assignment 4

This assignment covers topics related to genre classification and tag annotation:

* A4.1: Dataset preparation  
* A4.2: Feature extraction + SVM 
* A4.3: Classifier comparison with classification report and confusion matrix 
* A4.4: Misclassification audio 
* A4.5: musicnn tags -> Naive Bayes classifier 

Each question is worth 2 points for a total of 10 points for the assignment. 

#### **Question A4.1 (basic): Dataset preparation** 
 
You can work on this particular question in coordination with other students or members of your group, as it is mostly logistics but useful to know. The next two assignments are based on the dataset you will prepare. We will be using FMA: A Dataset For Music Analysis (https://github.com/mdeff/fma). The repository contains the datasets as well as various code examples for different tasks. You are welcome to use, consult, or modify any code from the repository. 

You will need to download the fma_small dataset (fma_small.zip) from the repository, which contains a balanced dataset with 8 genres and 1000 tracks per genre. The size of the dataset is 7.2GB. Create a new fma_smaller dataset that consists of 4 genres with 1000 tracks per genre, in which each track is 6 seconds long instead of 30 seconds. You will need to load the 30-second tracks and write the 6-second tracks to disk. For the remainder of this assignment and the next assignment you will be using this smaller dataset, so that it takes less hard disk space and is easier to deal with. 
The four genres you will use are Instrumental, HipHop, Rock, and Folk. Show that your code works by plotting the time domain waveforms of a HipHop track and a Folk track from the new fma_smaller dataset. Once you have things working you can erase the original fma_small data to free up space if needed.


Note: librosa uses audioread as a backend, which can in turn use many native libraries (e.g. ffmpeg). Resampling is very slow by default, so use res_type="kaiser_fast".
Use .wav files for the smaller dataset. 

 (**Basic: 2 points**)
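
The trimming step comes down to converting a duration into a sample count and slicing. A minimal sketch, assuming the audio is already loaded as a 1-D NumPy array (for example by `librosa.load`); the helper name `trim_to_seconds` and the synthetic signal are illustrative and not part of the FMA code:

```python
import numpy as np

def trim_to_seconds(audio, sr, seconds=6.0):
    """Return the first `seconds` of a mono signal sampled at `sr` Hz."""
    n_samples = int(round(seconds * sr))
    return audio[:n_samples]

# Synthetic stand-in for a 30-second track at librosa's default 22050 Hz rate
sr = 22050
track = np.random.default_rng(0).uniform(-1.0, 1.0, size=30 * sr)
clip = trim_to_seconds(track, sr)
print(len(clip) / sr)  # duration of the clip in seconds
```

A signal shorter than the requested duration comes back unchanged, so short or truncated files can be detected by checking the clip length before writing it to disk.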


In [1]:
# your code goes here 

# Dataset generated by group member Alyssa Blair, as George said in class that we could do that
# Four files are missing because they're shorter than 30 seconds and somewhat buggy

#Alyssa Blair's code:

# import librosa
# import fma.utils as utils
# import soundfile as sf

# def resample_file(input_file):
#     try:
#         arr, sr = librosa.load(input_file)
#         new_sample = librosa.resample(arr[:int(len(arr)/5)], orig_sr=sr, target_sr=sr, res_type="kaiser_fast")  

#     except Exception as e:
#         print("Error resampling file: ", input_file, e)
#         new_sample = None
#         sr = 0

#     return new_sample, sr


# genres = utils.load('fma_metadata/genres.csv')
# tracks = utils.load('fma_metadata/tracks.csv')
# features = utils.load('fma_metadata/features.csv')
# small = tracks[tracks['set', 'subset'] <= 'small']


# genre1 = small[small['track', 'genre_top'] == 'Instrumental'].index.values
# genre2 = small[small['track', 'genre_top'] == 'Hip-Hop'].index.values
# genre3 = small[small['track', 'genre_top'] == 'Rock'].index.values
# genre4 = small[small['track', 'genre_top'] == 'Folk'].index.values

# genres = {'instrumental': genre1, 'hiphop': genre2, 'rock': genre3, 'folk': genre4}


# def filter_genre(track_ids, genre_name):
#     counter = 0
    
#     for track in track_ids:
#         track_id = str(track)

#         track_name = (6 - len(track_id)) * '0' + track_id
    
#         input_file = f'fma_small/{track_name[:3]}/{track_name}.mp3'
#         audio, sr = resample_file(input_file)

#         if audio is None:
#             continue
        
#         output_file = f'fma_smaller/{genre_name}/{track_name}.wav'
#         sf.write(output_file, audio, sr)
    
#         # counter to keep track of progress
#         counter += 1

#         if counter % 200 == 0:
#             print(counter)


# def filter_genre_list(genres):
#     for genre in genres:
#         try:
#             filter_genre(genres[genre], genre)
#         except Exception as e:
#             print("Error filtering genre: ", e)

# filter_genre_list(genres)

# missing files after classification
# fma_small/098/098565.mp3
# fma_small/098/098567.mp3
# fma_small/098/098569.mp3
# fma_small/108/108925.mp3

#### **Question A4.2 (basic): Genre classification** 

For each track in the new fma_smaller dataset compute the Mel-Frequency Cepstral Coefficients (MFCCs) using the librosa Python library for Music Information Retrieval (https://librosa.org/doc/main/index.html). Represent each track as a single vector consisting of the mean of the MFCC vectors across the track. 
Perform k-fold cross-validation and calculate the classification report and display the confusion matrix for genre classification using this dataset. Use the Support Vector Classifier (SVC) from scikit-learn (https://scikit-learn.org/stable/) with gamma='auto'. 

 (**Basic: 2 points**)


In [2]:
# your code goes here 
import glob
import librosa
import soundfile
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn import svm, metrics, neighbors, ensemble, linear_model
from sklearn.model_selection import cross_val_predict

In [3]:
fnames = glob.glob('../fma_smaller/*/*.wav')

print(fnames[0])

../fma_smaller\folk\000140.wav


In [4]:
genres = ['folk', 'hiphop', 'instrumental', 'rock']

# allocate matrix for audio features and target 
audio_features = np.zeros((len(fnames), 20))
target = np.zeros(len(fnames))

# compute the features 
for (i,fname) in enumerate(fnames): 
    for (label,genre) in enumerate(genres): 
        if genre in fname: 
            audio, srate = librosa.load(fname)
            mfcc_matrix = librosa.feature.mfcc(y=audio, sr=srate)
            mean_mfcc = np.mean(mfcc_matrix,axis=1)
            audio_features[i] = mean_mfcc
            target[i] = label
print(audio_features.shape)


(3996, 20)


In [5]:
scaler = MinMaxScaler()
features = scaler.fit_transform(audio_features)
# note: gamma is ignored by the linear kernel; it only affects 'rbf', 'poly' and 'sigmoid'
clf_mfcc = svm.SVC(gamma='auto', kernel='linear')
# cross_val_predict clones and fits the classifier on each fold, so no separate fit is needed
predicted = cross_val_predict(clf_mfcc, features, target, cv=10)

# print classification report
print(metrics.classification_report(target, predicted, target_names=genres))

# print confusion matrix and accuracy
print("Confusion matrix:\n%s" % metrics.confusion_matrix(target, predicted))
print("Accuracy :\n%s\n" % (metrics.accuracy_score(target, predicted)))

              precision    recall  f1-score   support

        folk       0.63      0.58      0.60      1000
      hiphop       0.71      0.74      0.72       997
instrumental       0.58      0.56      0.57      1000
        rock       0.65      0.70      0.67       999

    accuracy                           0.64      3996
   macro avg       0.64      0.64      0.64      3996
weighted avg       0.64      0.64      0.64      3996

Confusion matrix:
[[576  74 239 111]
 [ 52 742  62 141]
 [208 110 556 126]
 [ 80 124  94 701]]
Accuracy :
0.6443943943943944
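
The numbers in the classification report can be recovered directly from the confusion matrix. The following sketch recomputes accuracy, precision, and recall with plain NumPy from the matrix printed above, assuming rows are true genres and columns are predicted genres (the scikit-learn convention):

```python
import numpy as np

# Confusion matrix from the 10-fold run above (rows: true genre, columns: predicted)
genres = ['folk', 'hiphop', 'instrumental', 'rock']
cm = np.array([[576,  74, 239, 111],
               [ 52, 742,  62, 141],
               [208, 110, 556, 126],
               [ 80, 124,  94, 701]])

accuracy = np.trace(cm) / cm.sum()          # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)       # correct / all tracks of that true genre
precision = np.diag(cm) / cm.sum(axis=0)    # correct / all tracks predicted as that genre

print("accuracy: %.4f" % accuracy)
for name, p, r in zip(genres, precision, recall):
    print("%-12s precision=%.2f recall=%.2f" % (name, p, r))
```

The largest off-diagonal entries are folk predicted as instrumental (239) and instrumental predicted as folk (208), so these two genres are the pair most often confused with each other.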



#### **Question A4.3 (expected): Feature and Classifier Comparison** 

In this question you will explore how different features and classifiers affect the accuracy of the genre classification. In addition to the mean MFCCs across the track, consider the concatenation of the mean MFCCs and the standard deviation of the MFCCs (this vector will be double the size of the mean MFCC vector). Do the same process with the chroma_cqt features from librosa (librosa.feature.chroma_cqt), i.e. the mean and the concatenation mean_std.
  
Therefore you will have 4 features: meanMFCC, mean_stdMFCC, meanChromaCQT, mean_stdChroma_CQT. 
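
All four feature vectors come from the same summarization step applied to a frame-level matrix. A small sketch of that step, using random matrices as stand-ins for the librosa outputs (the helper name `summarize_frames` is illustrative; the 20 and 12 row counts match librosa's default MFCC and chroma sizes):

```python
import numpy as np

def summarize_frames(feature_matrix, with_std=False):
    """Collapse a (n_coefficients, n_frames) matrix into one vector per track."""
    mean_vec = np.mean(feature_matrix, axis=1)
    if not with_std:
        return mean_vec
    std_vec = np.std(feature_matrix, axis=1)
    return np.concatenate([mean_vec, std_vec])

# Random stand-ins: 20 MFCCs or 12 chroma bins, one column per analysis frame
rng = np.random.default_rng(0)
fake_mfcc = rng.normal(size=(20, 259))
fake_chroma = rng.uniform(size=(12, 259))

print(summarize_frames(fake_mfcc).shape)                   # meanMFCC
print(summarize_frames(fake_mfcc, with_std=True).shape)    # mean_stdMFCC
print(summarize_frames(fake_chroma).shape)                 # meanChromaCQT
print(summarize_frames(fake_chroma, with_std=True).shape)  # mean_stdChroma_CQT
```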
  
In addition consider the following classifiers: Support Vector Classifier (SVC), 3-Nearest Neighbor, RandomForestClassifier, and Logistic Regression. You can use the default settings of each of these classifiers and you don't need to do any hyper-parameter tuning. 

  
For each combination of configurations calculate the classification accuracy using 5-fold cross-validation. The result should be a 4 by 4 table with 16 accuracies, one for each combination of feature front-end and classifier. 

 (**Expected: 2 points**)
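
One possible way to organize the feature-by-classifier grid is a pair of nested loops over dictionaries. This is a sketch on synthetic data, not the solution used in the cells that follow: the feature names `featA` and `featB` are hypothetical, `max_iter=1000` and `random_state=0` are choices made here rather than the defaults the question allows, and the scaler is placed inside a pipeline so that it is refit on the training part of each fold.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-ins for two feature front-ends (hypothetical data, 4 classes)
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(4), 50)
feature_sets = {
    'featA': rng.normal(size=(200, 20)) + labels[:, None],
    'featB': rng.normal(size=(200, 12)) + 0.5 * labels[:, None],
}
classifiers = {
    'SVM': SVC(gamma='auto'),
    'KNN': KNeighborsClassifier(n_neighbors=3),
    'Random Forest': RandomForestClassifier(random_state=0),
    'Logistic Regression': LogisticRegression(max_iter=1000),
}

accuracy_table = np.zeros((len(feature_sets), len(classifiers)))
for row, (feat_name, X) in enumerate(feature_sets.items()):
    for col, (clf_name, clf) in enumerate(classifiers.items()):
        # The pipeline refits the scaler on the training part of each fold
        model = make_pipeline(MinMaxScaler(), clf)
        scores = cross_val_score(model, X, labels, cv=5)
        accuracy_table[row, col] = scores.mean()

print(list(classifiers))
for feat_name, row_values in zip(feature_sets, accuracy_table):
    print(feat_name, np.round(row_values, 3))
```

`cross_val_score` averages the per-fold accuracies, which can differ slightly from computing one accuracy over pooled `cross_val_predict` output when the folds are not all the same size.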


In [6]:
# your code goes here 

meanMFCC = np.zeros((len(fnames), 20))
mean_stdMFCC = np.zeros((len(fnames), 40))
meanChromaCQT = np.zeros((len(fnames), 12))
mean_stdChromaCQT = np.zeros((len(fnames), 24))

target = np.zeros(len(fnames))

# compute the features 
for (i,fname) in enumerate(fnames): 
    for (label,genre) in enumerate(genres): 
        if genre in fname: 
            audio, srate = librosa.load(fname)


            mfcc_matrix = librosa.feature.mfcc(y=audio, sr=srate)
            mean_mfcc = np.mean(mfcc_matrix,axis=1)
            std_mfcc = np.std(mfcc_matrix, axis=1)

            meanMFCC[i] = mean_mfcc
            mfcc_fvec = np.concatenate([mean_mfcc, std_mfcc])
            mean_stdMFCC[i] = mfcc_fvec

            chroma_matrix = librosa.feature.chroma_cqt(y=audio, sr=srate)
            mean_chroma = np.mean(chroma_matrix,axis=1)
            std_chroma = np.std(chroma_matrix, axis=1)

            meanChromaCQT[i] = mean_chroma
            chroma_fvec = np.concatenate([mean_chroma, std_chroma])
            mean_stdChromaCQT[i] = chroma_fvec


            target[i] = label




In [7]:
scaler = MinMaxScaler()

meanMFCC = scaler.fit_transform(meanMFCC)
mean_stdMFCC = scaler.fit_transform(mean_stdMFCC)
meanChromaCQT = scaler.fit_transform(meanChromaCQT)
mean_stdChromaCQT = scaler.fit_transform(mean_stdChromaCQT)

In [8]:
svm_classifier = svm.SVC(gamma='auto')

# cross_val_predict clones and fits the classifier on each fold,
# so no separate fit call is needed before it

predicted = cross_val_predict(svm_classifier, meanMFCC, target, cv=5)
meanMFCC_SVM_acc = metrics.accuracy_score(target, predicted)
print("Accuracy :\n%s\n" % meanMFCC_SVM_acc)


predicted = cross_val_predict(svm_classifier, mean_stdMFCC, target, cv=5)
mean_stdMFCC_SVM_acc = metrics.accuracy_score(target, predicted)
print("Accuracy :\n%s\n" % mean_stdMFCC_SVM_acc)


predicted = cross_val_predict(svm_classifier, meanChromaCQT, target, cv=5)
meanChromaCQT_SVM_acc = metrics.accuracy_score(target, predicted)
print("Accuracy :\n%s\n" % meanChromaCQT_SVM_acc)


predicted = cross_val_predict(svm_classifier, mean_stdChromaCQT, target, cv=5)
mean_stdChromaCQT_SVM_acc = metrics.accuracy_score(target, predicted)
print("Accuracy :\n%s\n" % mean_stdChromaCQT_SVM_acc)



Accuracy :
0.6211211211211212

Accuracy :
0.6621621621621622

Accuracy :
0.48073073073073075

Accuracy :
0.501001001001001



In [9]:
neighbor_classifier = neighbors.KNeighborsClassifier(n_neighbors=3)

# cross_val_predict clones and fits the classifier on each fold,
# so no separate fit call is needed before it

predicted = cross_val_predict(neighbor_classifier, meanMFCC, target, cv=5)
meanMFCC_KNN_acc = metrics.accuracy_score(target, predicted)
print("Accuracy :\n%s\n" % meanMFCC_KNN_acc)


predicted = cross_val_predict(neighbor_classifier, mean_stdMFCC, target, cv=5)
mean_stdMFCC_KNN_acc = metrics.accuracy_score(target, predicted)
print("Accuracy :\n%s\n" % mean_stdMFCC_KNN_acc)


predicted = cross_val_predict(neighbor_classifier, meanChromaCQT, target, cv=5)
meanChromaCQT_KNN_acc = metrics.accuracy_score(target, predicted)
print("Accuracy :\n%s\n" % meanChromaCQT_KNN_acc)


predicted = cross_val_predict(neighbor_classifier, mean_stdChromaCQT, target, cv=5)
mean_stdChromaCQT_KNN_acc = metrics.accuracy_score(target, predicted)
print("Accuracy :\n%s\n" % mean_stdChromaCQT_KNN_acc)


Accuracy :
0.5793293293293293

Accuracy :
0.6413913913913913

Accuracy :
0.4451951951951952

Accuracy :
0.47647647647647645



In [10]:
random_forest_classifier = ensemble.RandomForestClassifier()

# cross_val_predict clones and fits the classifier on each fold,
# so no separate fit call is needed before it

predicted = cross_val_predict(random_forest_classifier, meanMFCC, target, cv=5)
meanMFCC_RF_acc = metrics.accuracy_score(target, predicted)
print("Accuracy :\n%s\n" % meanMFCC_RF_acc)


predicted = cross_val_predict(random_forest_classifier, mean_stdMFCC, target, cv=5)
mean_stdMFCC_RF_acc = metrics.accuracy_score(target, predicted)
print("Accuracy :\n%s\n" % mean_stdMFCC_RF_acc)


predicted = cross_val_predict(random_forest_classifier, meanChromaCQT, target, cv=5)
meanChromaCQT_RF_acc = metrics.accuracy_score(target, predicted)
print("Accuracy :\n%s\n" % meanChromaCQT_RF_acc)


predicted = cross_val_predict(random_forest_classifier, mean_stdChromaCQT, target, cv=5)
mean_stdChromaCQT_RF_acc = metrics.accuracy_score(target, predicted)
print("Accuracy :\n%s\n" % mean_stdChromaCQT_RF_acc)

Accuracy :
0.6466466466466466

Accuracy :
0.698948948948949

Accuracy :
0.5007507507507507

Accuracy :
0.541041041041041



In [11]:
logistic_classifier = linear_model.LogisticRegression()

# cross_val_predict clones and fits the classifier on each fold,
# so no separate fit call is needed before it

predicted = cross_val_predict(logistic_classifier, meanMFCC, target, cv=5)
meanMFCC_LR_acc = metrics.accuracy_score(target, predicted)
print("Accuracy :\n%s\n" % meanMFCC_LR_acc)


predicted = cross_val_predict(logistic_classifier, mean_stdMFCC, target, cv=5)
mean_stdMFCC_LR_acc = metrics.accuracy_score(target, predicted)
print("Accuracy :\n%s\n" % mean_stdMFCC_LR_acc)


predicted = cross_val_predict(logistic_classifier, meanChromaCQT, target, cv=5)
meanChromaCQT_LR_acc = metrics.accuracy_score(target, predicted)
print("Accuracy :\n%s\n" % meanChromaCQT_LR_acc)


predicted = cross_val_predict(logistic_classifier, mean_stdChromaCQT, target, cv=5)
mean_stdChromaCQT_LR_acc = metrics.accuracy_score(target, predicted)
print("Accuracy :\n%s\n" % mean_stdChromaCQT_LR_acc)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Accuracy :
0.6333833833833834

Accuracy :
0.6779279279279279

Accuracy :
0.4647147147147147

Accuracy :
0.4954954954954955



In [12]:
num_columns = ['Feature', 'SVM', 'KNN', 'Random Forest', 'Logistic Regression']

classifier_acc = np.zeros([4,5])

classifier_acc[0][1] = meanMFCC_SVM_acc
classifier_acc[0][2] = meanMFCC_KNN_acc
classifier_acc[0][3] = meanMFCC_RF_acc
classifier_acc[0][4] = meanMFCC_LR_acc

classifier_acc[1][1] = mean_stdMFCC_SVM_acc
classifier_acc[1][2] = mean_stdMFCC_KNN_acc
classifier_acc[1][3] = mean_stdMFCC_RF_acc
classifier_acc[1][4] = mean_stdMFCC_LR_acc

classifier_acc[2][1] = meanChromaCQT_SVM_acc
classifier_acc[2][2] = meanChromaCQT_KNN_acc
classifier_acc[2][3] = meanChromaCQT_RF_acc
classifier_acc[2][4] = meanChromaCQT_LR_acc

classifier_acc[3][1] = mean_stdChromaCQT_SVM_acc
classifier_acc[3][2] = mean_stdChromaCQT_KNN_acc
classifier_acc[3][3] = mean_stdChromaCQT_RF_acc
classifier_acc[3][4] = mean_stdChromaCQT_LR_acc


classification_accuracy = pd.DataFrame(data=classifier_acc, columns = num_columns)
classification_accuracy['Feature'] = ['MFCC', 'MFCC + STD', 'Chroma CQT', 'Chroma CQT + STD']

classification_accuracy

|   | Feature | SVM | KNN | Random Forest | Logistic Regression |
|---|---|---|---|---|---|
| 0 | MFCC | 0.621121 | 0.579329 | 0.646647 | 0.633383 |
| 1 | MFCC + STD | 0.662162 | 0.641391 | 0.698949 | 0.677928 |
| 2 | Chroma CQT | 0.480731 | 0.445195 | 0.500751 | 0.464715 |
| 3 | Chroma CQT + STD | 0.501001 | 0.476476 | 0.541041 | 0.495495 |


#### **Question A4.4 (expected): Misclassification Audio** 

In this question the goal is to listen to some of the misclassifications. Write a function 
**misclassification_audio(ground_label, predicted_label)** that takes as input a ground truth label and a predicted label and returns an audio file with a maximum of 10 misclassified audio tracks that had the ground truth label but were predicted as the predicted_label. For example, let's say that 
there are 23 tracks that were originally labeled as Instrumental but were predicted as HipHop. The result of the function will be an audio file consisting of 10 6-second audio tracks that were misclassified that way. By listening to this one-minute audio file you will get a sense of what type of Instrumental tracks were misclassified as HipHop. 

 (**Expected: 2 points**)
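
The selection part of this task can be expressed as a boolean mask over the label arrays. A minimal sketch with toy labels and constant arrays standing in for audio clips; the helper name `misclassified_indices` is hypothetical:

```python
import numpy as np

def misclassified_indices(target, predicted, ground_label, predicted_label, limit=10):
    """Indices of up to `limit` items with true label `ground_label`
    that were predicted as `predicted_label`."""
    target = np.asarray(target)
    predicted = np.asarray(predicted)
    mask = (target == ground_label) & (predicted == predicted_label)
    return np.flatnonzero(mask)[:limit]

# Toy labels (hypothetical): 2 = instrumental, 1 = hiphop
target = np.array([2, 2, 1, 2, 0, 2])
predicted = np.array([1, 2, 1, 1, 1, 1])
idx = misclassified_indices(target, predicted, ground_label=2, predicted_label=1)
print(idx)

# Stand-in clips: concatenating them end to end gives the audio to write out
clips = [np.full(3, float(i)) for i in idx]
joined = np.concatenate(clips) if clips else np.zeros(0)
print(joined.shape)
```

Guarding the concatenation for the empty case matters because `np.concatenate` raises an error on an empty list, which happens whenever no track was misclassified in the requested way.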


In [13]:
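def misclassification_audio(ground_label, predicted_label):
    # Uses the globals fnames, target and predicted. predicted holds the most
    # recent cross_val_predict output (here logistic regression on mean_stdChromaCQT).
    clips = []
    srate = None
    for i in range(len(fnames)):
        if target[i] == ground_label and predicted[i] == predicted_label:
            audio, srate = librosa.load(fnames[i])
            clips.append(audio)
            if len(clips) == 10:
                break

    # np.concatenate fails on an empty list, so handle the no-match case
    if not clips:
        print("No tracks were misclassified this way.")
        return None

    out_name = 'misclassified_audio.wav'
    soundfile.write(out_name, np.concatenate(clips), srate)
    return out_name

# 2 = instrumental, 1 = hiphop
misclassification_audio(2, 1)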

#### **Question A4.5 (advanced): musicnn tags and Naive Bayes classification** 

For this question you will need to be able to run the musicnn auto-tagger developed by Jordi Pons: https://github.com/jordipons/musicnn

For each track in the 4 genres we have been exploring in this assignment, calculate the topN (N=10) tags using the MSD_musicnn model. Find the 15 most 'popular' tags, that is, the tags that appear most often across the tracks we are examining. Convert each track to a binary bag-of-words representation using these 15 most popular tags. Each track should then be represented by 15 binary numbers.
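
Once the per-track tag lists are available, the bag-of-words step is independent of how the tags were produced. A minimal sketch, assuming each track's tags are held in a Python list; the tag strings below are made up for illustration and the helper name `binary_bag_of_tags` is hypothetical:

```python
from collections import Counter
import numpy as np

def binary_bag_of_tags(track_tags, vocab_size=15):
    """Build a binary track-by-tag matrix from the most frequent tags."""
    # Count each tag at most once per track
    counts = Counter(tag for tags in track_tags for tag in set(tags))
    vocabulary = [tag for tag, _ in counts.most_common(vocab_size)]
    column = {tag: j for j, tag in enumerate(vocabulary)}
    bow = np.zeros((len(track_tags), len(vocabulary)), dtype=int)
    for i, tags in enumerate(track_tags):
        for tag in tags:
            if tag in column:
                bow[i, column[tag]] = 1
    return vocabulary, bow

# Hypothetical tag lists standing in for the per-track topN tagger output
track_tags = [
    ['rock', 'guitar', 'male vocalists'],
    ['folk', 'guitar', 'acoustic'],
    ['rock', 'guitar', 'indie'],
]
vocabulary, bow = binary_bag_of_tags(track_tags, vocab_size=2)
print(vocabulary)
print(bow)
```

With `vocab_size=15` and the real tag lists, each row of `bow` is the 15-number binary representation the question asks for.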

Adapt the lyrics-based classification code shown at the bottom of the following notebook to perform Naive Bayes classification using the tag bag-of-words representation:

https://github.com/gtzan/csc421_tzanetakis/blob/main/csc421_tzanetakis_quantifying_uncertainty.ipynb

 (**Advanced: 2 points**)
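
For binary tag vectors, Naive Bayes takes the Bernoulli form: each class stores a prior and, for every tag, the probability that the tag is present. The sketch below is a hand-rolled NumPy version with Laplace smoothing on toy data. It is an illustration of the technique under those assumptions and not the code from the linked notebook.

```python
import numpy as np

def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate class priors and per-class tag probabilities with Laplace smoothing."""
    classes = np.unique(y)
    log_prior = np.log(np.array([np.mean(y == c) for c in classes]))
    # P(tag present | class), smoothed so no probability is exactly 0 or 1
    theta = np.array([(X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
                      for c in classes])
    return classes, log_prior, theta

def predict_bernoulli_nb(model, X):
    """Pick the class with the highest log posterior for each binary row of X."""
    classes, log_prior, theta = model
    # Present tags contribute log(theta), absent tags contribute log(1 - theta)
    log_lik = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
    return classes[np.argmax(log_prior + log_lik, axis=1)]

# Toy binary tag vectors (hypothetical): 3 tags, 2 genres
X = np.array([[1, 0, 1], [1, 0, 0], [1, 1, 1], [0, 1, 0], [0, 1, 1], [0, 1, 0]])
y = np.array([0, 0, 0, 1, 1, 1])
model = fit_bernoulli_nb(X, y)
print(predict_bernoulli_nb(model, np.array([[1, 0, 0], [0, 1, 0]])))
```

Working in log space avoids numerical underflow when the per-tag probabilities are multiplied across all 15 tags.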
