# Computing distances with RBF Kernel

### These are the steps to calculate the disruption:
1. Extract the feature representation of the audio
2. Calculate the "distance" between each song and every other song and store the result in a matrix
3. Use this "distance" or "similarity" matrix to build the network

The dataset is too big to use in its entirety, so we use 1/3 of it, which is the most I can process with the compute available.

Note: the EDA visualizations suggest that the reduced dataset is still a good representation of the complete dataset.
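
To give a sense of why the full dataset is out of reach, the sketch below estimates the memory a dense n × n similarity matrix needs. The song counts are the ones reported in this notebook (29,986 songs in the reduced dataset, 109,269 extracted feature vectors in total). The 4-byte `float32` element size is an assumption, and `dense_matrix_gib` is a hypothetical helper written for this example.

```python
def dense_matrix_gib(n_songs, bytes_per_value=4):
    """Approximate size in GiB of a dense n_songs x n_songs matrix.

    bytes_per_value=4 assumes float32 entries (an assumption, not confirmed).
    """
    return n_songs * n_songs * bytes_per_value / 1024 ** 3


# Song counts taken from the notebook: reduced dataset and full feature extraction.
for n_songs in (29986, 109269):
    print(f"{n_songs} songs: about {dense_matrix_gib(n_songs):.1f} GiB")
```

The cost grows with the square of the number of songs, so using roughly a third of the songs cuts the matrix to roughly a ninth of the size.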

### Loading dataset

The first step is to load the dataset

In [2]:
import numpy as np
import pandas as pd
from tqdm import tqdm
import matplotlib.pyplot as plt
from pathlib import Path

This treated dataset contains the id, artists, song, album name, genres list, popularity, release and duration of each song

In [3]:
DATASETS_FOLDER = Path("./dataset")
DATAFRAME_PATH = DATASETS_FOLDER / "input" / "csvs"
DF_FILENAME = "song_info_release_dataset_29986_entries_filtered.csv"

dataframe = pd.read_csv(DATAFRAME_PATH / DF_FILENAME)
print(f"columns: {dataframe.columns}\nsize: {len(dataframe)}")

columns: Index(['id', 'artist', 'song', 'album_name', 'genres', 'popularity', 'release',
       'duration_ms'],
      dtype='object')
size: 29986


It is good practice to work on a copy of the dataframe so that the original is not modified

In [4]:
# Copy the dataframe
working_dataframe = dataframe.copy(deep=True)

We'll have to fix some wrong release dates before we work with this dataset

In [5]:
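# Assign the result back instead of using inplace=True on the selected column,
# which does not reliably modify the dataframe in recent pandas versions
# Los Espiritus album was released in 2013
working_dataframe['release'] = working_dataframe['release'].replace(1013, 2013)
# Bukka White - Parchman Farm Blues was released in 1940
working_dataframe['release'] = working_dataframe['release'].replace(1899, 1940)
# Sidney Polak - Otwieram Wino (feat. Pezet) was released in 2018 according to Amazon
working_dataframe['release'] = working_dataframe['release'].replace(1900, 2018)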
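
Wrong years like the ones fixed above can be spotted with a simple range check. The sketch below runs on a tiny made-up dataframe. The bounds `EARLIEST_PLAUSIBLE_YEAR` and `LATEST_PLAUSIBLE_YEAR` are assumptions chosen for illustration and would need adapting to the real data.

```python
import pandas as pd

# Hypothetical bounds, chosen only for this illustration.
EARLIEST_PLAUSIBLE_YEAR = 1920
LATEST_PLAUSIBLE_YEAR = 2025

# Tiny made-up sample; the notebook itself would use working_dataframe.
sample = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "release": [1013, 1899, 1965, 2013],
})

# Keep the rows whose release year falls outside the plausible range.
in_range = sample["release"].between(EARLIEST_PLAUSIBLE_YEAR, LATEST_PLAUSIBLE_YEAR)
suspicious = sample[~in_range]
print(suspicious)
```

A check like this only flags impossible years. A wrong year that is still plausible has to be verified by hand.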

In [6]:
# Sort the dataframe by release date (as this is going to be important when generating the similarity matrix)
dataframe_sorted = working_dataframe.sort_values(by=['release'])
dataframe_sorted.head()

| | id | artist | song | album_name | genres | popularity | release | duration_ms |
|---|---|---|---|---|---|---|---|---|
| 5987 | 3MEb9LZbB80nQ1a8 | Louis Armstrong | St. James Infirmary | The Complete Hot Five And Hot Seven Recordings... | jazz,blues | 29.0 | 1928 | 191867 |
| 24349 | DqO2fLBqdVsERa1Z | Louis Armstrong | Mack the Knife | The Great American Songbook | jazz,swing,jazz,blues,swing | 43.0 | 1929 | 201467 |
| 2841 | 1Z7Pb158yANCZ7zN | Billie Holiday | Georgia On My Mind | Lady Day: The Complete Billie Holiday On Colum... | jazz,vocal jazz,blues | 24.0 | 1933 | 198560 |
| 822 | 0SI6oF0XlACvZdQT | Billie Holiday | All Of Me | Lady Day: The Complete Billie Holiday On Colum... | jazz,vocal jazz,blues,jazz,blues | 54.0 | 1933 | 181440 |
| 15583 | 8rCzU7kVpoJ0Z37D | Billie Holiday | A Fine Romance | Lady Day: The Complete Billie Holiday On Colum... | jazz,jazz,blues | 24.0 | 1933 | 171467 |


## Mapping our dataframe to the transfer learning features

The transfer learning features are stored in the order of the files in the folder they were extracted from, which does not match the order of the dataframe (now sorted by release date)

That means that to use them we have to map each song to its corresponding index in the feature array

In [7]:
# Loading Transfer Learning Features
print("Loading Transfer Learning Features...")
transfer_learning_features = np.load(DATASETS_FOLDER / "input" / "extracted_features" / "transfer_learning" / "features.npy")
print("Shape of the transfer learning features: ", np.shape(transfer_learning_features))

# Open list of files.txt
print("Open list of files to make the mapping...")
list_of_files = []
with open(DATASETS_FOLDER / "input" / "extracted_features" / "transfer_learning" / "list_of_files.txt", "r") as files_list:
    # split by line ending, each path is a line in this file
    list_of_files = files_list.read().split(sep="\n")

# The common information we have is the ID, so we can use it to map to our dataset.
print("Getting only the IDs from file paths...")
only_ids = []
# For all paths in the files list, get only the file name (which is the ID!)
for file_name in tqdm(list_of_files):
    temp_path = Path(file_name)
    only_ids.append(temp_path.stem) # returns only the filename without the extension

# Get every ID of our sorted dataframe
print("Creating the mappings of song to the feature vector indexes ")
ids_sorted = dataframe_sorted["id"].to_numpy()

# Now we only need to create a new column containing the indexes corresponding to the feature vector
mapping_of_indexes = []
# Make the mapping of the indexes
for song_id in tqdm(ids_sorted):
    mapping_of_indexes.append(only_ids.index(song_id))

# Adding as a column
print("Adding the new column with the mapping")
dataframe_sorted["mapping_to_fv_index"] = mapping_of_indexes

# Resetting the index so that iloc works
print("Reseting the index so that iloc works in the sorted dataframe...")
df_sorted_reset_index = dataframe_sorted.reset_index()
print("All done!")
df_sorted_reset_index.head()

Loading Transfer Learning Features...
Shape of the transfer learning features:  (109269, 160)
Open list of files to make the mapping...
Getting only the IDs from file paths...


100%|██████████| 109269/109269 [00:02<00:00, 53206.48it/s]


Creating the mappings of song to the feature vector indexes 


100%|██████████| 29986/29986 [00:45<00:00, 661.21it/s]

Adding the new column with the mapping
Reseting the index so that iloc works in the sorted dataframe...
All done!





| | index | id | artist | song | album_name | genres | popularity | release | duration_ms | mapping_to_fv_index |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5987 | 3MEb9LZbB80nQ1a8 | Louis Armstrong | St. James Infirmary | The Complete Hot Five And Hot Seven Recordings... | jazz,blues | 29.0 | 1928 | 191867 | 71045 |
| 1 | 24349 | DqO2fLBqdVsERa1Z | Louis Armstrong | Mack the Knife | The Great American Songbook | jazz,swing,jazz,blues,swing | 43.0 | 1929 | 201467 | 20950 |
| 2 | 2841 | 1Z7Pb158yANCZ7zN | Billie Holiday | Georgia On My Mind | Lady Day: The Complete Billie Holiday On Colum... | jazz,vocal jazz,blues | 24.0 | 1933 | 198560 | 91346 |
| 3 | 822 | 0SI6oF0XlACvZdQT | Billie Holiday | All Of Me | Lady Day: The Complete Billie Holiday On Colum... | jazz,vocal jazz,blues,jazz,blues | 54.0 | 1933 | 181440 | 65996 |
| 4 | 15583 | 8rCzU7kVpoJ0Z37D | Billie Holiday | A Fine Romance | Lady Day: The Complete Billie Holiday On Colum... | jazz,jazz,blues | 24.0 | 1933 | 171467 | 108063 |
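
The mapping loop in the cell above calls `list.index` once per song, which scans the list from the start each time. A possible alternative, sketched here on made-up IDs, is to build a dictionary from ID to position once and then look up each song directly. `position_by_id` is a name introduced for this example, and the sketch assumes each ID appears only once in the file list.

```python
# Made-up stand-ins for only_ids (file order) and ids_sorted (dataframe order).
only_ids = ["id_c", "id_a", "id_d", "id_b"]
ids_sorted = ["id_a", "id_b", "id_c"]

# Build the lookup table once: ID -> position in the feature array.
position_by_id = {song_id: position for position, song_id in enumerate(only_ids)}

# Each lookup is now a dictionary access instead of a scan of the list.
mapping_of_indexes = [position_by_id[song_id] for song_id in ids_sorted]
print(mapping_of_indexes)
```

If an ID could appear more than once, the two approaches would differ: `list.index` returns the first position, while the dictionary keeps the last one.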


## Loading all features in a feature matrix

The next step is to collect the feature vectors in a list that follows the new ordering of the dataset.
That way we'll have a one-to-one mapping from each song to its corresponding feature vector

This gives us two arrays of feature vectors:
- One of MFCC features 
- One of Transfer Learning features 

Both will then be ready to use to build a `similarity matrix`

In [8]:
MFCC_FEATURES_PATH = Path("../../dataset/dataset_mfcc")
FILE_ENDING = "_mfcc.npy"

def get_mfcc_feature_vector(df):
    """ Load and append the feature vector extracted to a variable 
    This function is slower because every npy is in its separate file, that means heavy IO usage.
    """
    feature_vector = []
    ids_list = df['id'].to_list()
    print("Loading MFCC Features...")
    for song_id in tqdm(ids_list):
        file_name = song_id + FILE_ENDING
        file_path = MFCC_FEATURES_PATH / file_name
        feature_vector.append(np.load(file_path).tolist())
    print("All done!")
    return feature_vector

def get_transfer_learning_feature_vector(df, transfer_learning_feature_vector):
    feature_vector = []
    print("Loading Transfer Learning Features...")
    indexes_list = df['mapping_to_fv_index'].to_list()
    for index in tqdm(indexes_list):
        feature_vector.append(transfer_learning_feature_vector[index])
    print("All done!")
    return feature_vector

def get_feature_vector(df, feature_type, transfer_learning_feature_vector=None):
    if feature_type.lower() == "mfcc":
        return get_mfcc_feature_vector(df)
    elif feature_type.lower() == "transfer_learning":
        # type(x) is never None, so the original check could not trigger; test the value itself
        if transfer_learning_feature_vector is None:
            raise ValueError("transfer_learning_feature_vector cannot be empty!")
        return get_transfer_learning_feature_vector(df, transfer_learning_feature_vector)
    else:
        raise TypeError("Not a valid feature vector type!")

In [59]:
mfcc_feature_vector = get_feature_vector(df_sorted_reset_index, "mfcc")
print(np.shape(mfcc_feature_vector))

Loading MFCC Features...


100%|██████████| 29986/29986 [00:17<00:00, 1759.50it/s]

All done!
(29986, 120)





In [9]:
transfer_learning_feature_vector = get_feature_vector(df_sorted_reset_index, "transfer_learning", transfer_learning_features)
print(np.shape(transfer_learning_feature_vector))

Loading Transfer Learning Features...


100%|██████████| 29986/29986 [00:00<00:00, 1015940.61it/s]

All done!
(29986, 160)
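
The loop in `get_transfer_learning_feature_vector` picks rows one at a time. NumPy can do the same selection in a single step by indexing with an array of positions, which also returns an array rather than a list. A minimal sketch on made-up data (the array contents and sizes are invented for the example):

```python
import numpy as np

# Made-up feature array: 5 songs with 3 features each.
features = np.arange(15, dtype=np.float32).reshape(5, 3)
# Made-up mapping column: which feature row belongs to each dataframe row.
mapping_to_fv_index = np.array([3, 0, 4])

# Select and reorder the rows in one indexing operation.
reordered = features[mapping_to_fv_index]
print(reordered.shape)
```

Row `k` of `reordered` is the feature vector of the song in row `k` of the sorted dataframe, which is the same one-to-one mapping the loop produces.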





## Export to avoid computations again

In [13]:
def export_feature_vector(feature_vector, feat_type):
    """ Saves a feature vector to avoid making this list again """
    size = np.shape(feature_vector)[0]
    np.save( DATASETS_FOLDER / "input" / "feature_vectors" / feat_type / f"{feat_type}_feature_vector_{size}_samples.npy", feature_vector)
    print(f"Saving {feat_type} feature vector of {size} samples complete!")

export_feature_vector(transfer_learning_feature_vector, feat_type="transfer_learning")
# export_feature_vector(mfcc_feature_vector, feat_type="mfcc")

Saving transfer_learning feature vector of 29986 samples complete!


## Computing the similarity matrix using the RBF Kernel

In [14]:
from sklearn.metrics.pairwise import rbf_kernel

gamma = 0.1
similarity_matrix = rbf_kernel(transfer_learning_feature_vector, gamma=gamma)
np.fill_diagonal(similarity_matrix, 0) # Zero the diagonal so that no song is compared to itself when generating the network
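
For reference, the RBF kernel between two vectors is K(x, y) = exp(-gamma * ||x - y||^2). Identical vectors score 1, and the score decays towards 0 as the squared Euclidean distance grows. The sketch below computes it with plain NumPy on made-up vectors. `rbf_similarity` is a hypothetical helper written for this example, not the scikit-learn implementation.

```python
import numpy as np


def rbf_similarity(features, gamma):
    """Pairwise exp(-gamma * squared Euclidean distance) between rows."""
    features = np.asarray(features, dtype=np.float64)
    # Differences between every pair of rows: shape (n, n, d).
    differences = features[:, None, :] - features[None, :, :]
    squared_distances = np.sum(differences ** 2, axis=-1)
    return np.exp(-gamma * squared_distances)


# Three made-up 2-dimensional feature vectors.
toy_features = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
toy_similarity = rbf_similarity(toy_features, gamma=0.1)
print(np.round(toy_similarity, 3))
```

The broadcasting step builds an n × n × d intermediate array, so this version is only suitable for small inputs.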

In [15]:
def save_similarity_matrix(gamma, matrix, feat_type):
    size = np.shape(matrix)[0]
    np.save(DATASETS_FOLDER / "input" / "similarity_matrices" / f"{feat_type}_{size}_samples_{gamma}_gamma.npy", matrix)
    print(f"Saving similarity matrix of size {size} complete!")

save_similarity_matrix(gamma, similarity_matrix, "transfer_learning")

Saving similarity matrix of size 29986 complete!
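
Since the saved matrix is several GiB, it can help to read it back without loading everything into memory. `np.load` accepts `mmap_mode="r"`, which maps the file and reads only the parts that are accessed. A small sketch with a made-up matrix and a temporary file (the file name is hypothetical):

```python
import tempfile
from pathlib import Path

import numpy as np

with tempfile.TemporaryDirectory() as temp_dir:
    # Hypothetical file name, used only for this example.
    matrix_path = Path(temp_dir) / "toy_similarity.npy"
    np.save(matrix_path, np.eye(4, dtype=np.float32))

    # Memory-map the file instead of reading it all at once.
    mapped = np.load(matrix_path, mmap_mode="r")
    # Copy out a single row; only this part is read from disk.
    first_row = np.array(mapped[0])
    # Release the mapping before the temporary folder is deleted.
    del mapped

print(first_row)
```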


# Helper functions to analyze/filter the dataset 

In [16]:
from math import ceil

def show_similarity(df, similarity_matrix, song_index, song_two_index):
    """ Shows the similarity of two entries of a similarity matrix """
    print()
    print(df.iloc[song_index])
    print(df.iloc[song_two_index])
    print()
    similarity = similarity_matrix[song_index, song_two_index]
    print(f"Similarity: {similarity}")
    
def get_songs_list(df, similarity_matrix, similarity_threshold, limit=10):
    """ Show up to `limit` song pairs whose similarity exceeds a threshold """
    rows, columns = np.where(similarity_matrix > similarity_threshold)
    print(len(rows))
    # The matrix is symmetric, so every pair appears twice: as (i, j) and as (j, i).
    # Keeping only i < j lists each pair once (taking the first half of the
    # row-ordered results does not guarantee that).
    unique_pairs = rows < columns
    songs_list = np.array((rows[unique_pairs], columns[unique_pairs])).T
    for song in songs_list[:limit]:
        show_similarity(df, similarity_matrix, song[0], song[1])


# Exporting the filtered dataset

In [19]:
import sys
import gc

def sizeof_fmt(num, suffix='B'):
    ''' by Fred Cirera,  https://stackoverflow.com/a/1094933/1870254, modified'''
    for unit in ['','Ki','Mi','Gi','Ti','Pi','Ei','Zi']:
        if abs(num) < 1024.0:
            return "%3.1f %s%s" % (num, unit, suffix)
        num /= 1024.0
    return "%.1f %s%s" % (num, 'Yi', suffix)

for name, size in sorted(((name, sys.getsizeof(value)) for name, value in locals().items()),
                            key= lambda x: -x[1])[:10]:
    print("{:>30}: {:>8}".format(name, sizeof_fmt(size)))

#del similarity_matrix
gc.collect()


             similarity_matrix:  3.3 GiB
    transfer_learning_features: 66.7 MiB
                            __: 11.7 MiB
         df_sorted_reset_index: 11.7 MiB
                           _10: 11.7 MiB
                           ___: 11.7 MiB
              dataframe_sorted: 11.7 MiB
                            _9: 11.7 MiB
                     dataframe: 11.2 MiB
                            _7: 11.2 MiB


0
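
One caveat with the listing above: `sys.getsizeof` includes an array's data buffer only when the array owns that buffer, so views show up as tiny. The `nbytes` attribute of a NumPy array reports the size of its elements directly. A short sketch on a made-up array:

```python
import sys

import numpy as np

matrix = np.zeros((1000, 1000), dtype=np.float32)
view = matrix[:500]  # a view shares the buffer of `matrix`

# nbytes counts the elements: rows * columns * bytes per element.
print("matrix.nbytes:", matrix.nbytes)
print("view.nbytes:  ", view.nbytes)
# getsizeof includes the buffer only for the array that owns it.
print("getsizeof(matrix):", sys.getsizeof(matrix))
print("getsizeof(view):  ", sys.getsizeof(view))
```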

# Showing some songs with high similarity

### When using gamma = 0.01, only masters and remasters of the same songs show as similar
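
The effect of gamma on what counts as "similar" can be made concrete. Since the similarity is exp(-gamma * d^2), where d is the Euclidean distance between two feature vectors, a pair passes a threshold s only if d^2 < -ln(s) / gamma. The sketch below evaluates that bound for a few gamma values. `max_squared_distance` is a hypothetical helper, and the gamma values are just examples.

```python
from math import log


def max_squared_distance(similarity_threshold, gamma):
    """Largest squared distance whose RBF similarity still exceeds the threshold."""
    return -log(similarity_threshold) / gamma


# Threshold 0.8 is the one used in the cell below; gammas are example values.
for gamma in (0.1, 0.01, 0.001):
    bound = max_squared_distance(0.8, gamma)
    print(f"gamma={gamma}: squared distance must be below {bound:.2f}")
```

A smaller gamma gives a larger bound, so more pairs pass the same threshold. With a larger gamma only near-identical feature vectors, such as remasters of the same recording, stay above it.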

In [14]:
get_songs_list(dataframe, similarity_matrix, 0.8)

140

id                                             06SC55N0G2GjCNuS
artist                                              David Bowie
song                                         Station to Station
album_name         Station To Station (2016 Remastered Version)
genres        rock,progressive rock,classic rock,art rock,ro...
Name: 159, dtype: object
id                                        7JmWTjyaKLzHmKBo
artist                                         David Bowie
song          Station To Station (2016 Remastered Version)
album_name    Station To Station (2016 Remastered Version)
genres                         glam rock,rock,classic rock
Name: 12886, dtype: object

Similarity: 0.9999999999995635

id                    0EvGCDSvdzIZdcQ3
artist                      Kid Abelha
song                      Peito Aberto
album_name                   Pega Vida
genres        mpb,pop nacional,mpb,pop
Name: 403, dtype: object
id            6xboh6SVR8Z0qHXj
artist              Kid Abelha
song         

# Showing the same information but for The Beatles

In [15]:
beatles_df = dataframe.query("artist == 'The Beatles'")
print(len(beatles_df))
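# feature_type is a required argument; "mfcc" is assumed here because beatles_df
# has no mapping_to_fv_index column, which the transfer learning path needs
beatles_feature_vector = get_feature_vector(beatles_df, "mfcc")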
beatles_similarity_matrix = rbf_kernel(beatles_feature_vector, gamma=0.001)
np.fill_diagonal(beatles_similarity_matrix, 0)
get_songs_list(beatles_df, beatles_similarity_matrix, 0.7)

37
2

id                                      6NRxkIqKybHZZGvZ
artist                                       The Beatles
song          Why Don't We Do It In The Road? - 2018 Mix
album_name                                   The Beatles
genres                             classic rock,rock,pop
Name: 11252, dtype: object
id                                             BD1AtNm4DiD3tjQ0
artist                                              The Beatles
song          Why Don't We Do It In The Road? - Remastered 2009
album_name                             The Beatles (Remastered)
genres                   rock,classic rock,pop,psychedelic rock
Name: 19734, dtype: object

Similarity: 1.0
