<h2 style="color: #0074D9;">Table of Contents</h2>
<ol>
    <li><a href="#ModPack" style="font-size: larger; color: #0074D9; font-weight: bold;">Modules and Packages</a></li>
    <li><a href="#dataloading" style="font-size: larger; color: #0074D9; font-weight: bold;">Data Loading</a></li>
    <li><a href="#tav" style="font-size: larger; color: #0074D9; font-weight: bold;">TAV Approach</a></li>
    <li><a href="#early_fusion" style="font-size: larger; color: #0074D9; font-weight: bold;">Early Fusion</a></li>
    <li><a href="#late_fusion" style="font-size: larger; color: #0074D9; font-weight: bold;">Late Fusion</a></li>
    <li><a href="#evaluation" style="font-size: larger; color: #0074D9; font-weight: bold;">Evaluation and Results</a></li>
</ol>

<h1 style="color:rgb(0,120,170)">Modules and Packages</h1>

------------------------------------------------------

<a id="ModPack"></a>



In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time
import ast

from tqdm import tqdm
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize
from sklearn.decomposition import PCA
from math import log2
from IPython.display import display_html
import utils2 as helper

#disable annoying warnings
import warnings
warnings.filterwarnings('ignore')

<h1 style="color:rgb(0,120,170)">Data Loading</h1>

------------------------------------------------------

<a id="dataloading"></a>


In [10]:
start_time = time.time()
# Task 1 & 2
songs_df = pd.read_csv('ws23_exercise1/id_information_mmsr.tsv', sep='\t')
genre_df = pd.read_csv('ws23_exercise2/id_genres_mmsr.tsv', sep='\t')
full_dataset = pd.merge(songs_df, genre_df, on='id')
# Task 3
vgg19_df = pd.read_csv('task3/ws23_exercise3/id_vgg19_mmsr.tsv', sep='\t')

print(f"Read all files successfully after {time.time()-start_time}s.")

Read all files successfully after 15.661694049835205s.


<h1 style="color:rgb(0,120,170)">TAV Approach</h1>

------------------------------------------------------

<a id="tav"></a>

TAV stands for Text-Audio-Video features. The main retrieval function, `find_top_n_similar_songs`, takes as input:<br><br>
`feature_df`: the embeddings as a dataframe<br>
`N`: the number of items to retrieve for each query<br>
`data_name`: a string describing the feature dataframe
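
The implementation of `helper.find_top_n_similar_songs` is not shown in this notebook. As a minimal sketch of what such a retrieval step can look like, assuming cosine similarity over all feature columns and an `id` column that identifies each song, the hypothetical function `top_n_similar` below returns the `n` nearest neighbours of every item. The function name, the output column names, and the toy data are illustrative assumptions, and the `data_name` argument (presumably used only for logging) is left out.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity


def top_n_similar(feature_df: pd.DataFrame, n: int) -> pd.DataFrame:
    """Return the n most cosine-similar items for every row of feature_df.

    Hypothetical stand-in for the helper used in this notebook.
    """
    ids = feature_df["id"].to_numpy()
    vectors = feature_df.drop(columns="id").to_numpy(dtype=float)

    sims = cosine_similarity(vectors)   # (items x items) similarity matrix
    np.fill_diagonal(sims, -np.inf)     # a song must not retrieve itself

    # Column indices of the n largest similarities per row, best first.
    top_idx = np.argsort(-sims, axis=1)[:, :n]

    rows = []
    for q, neighbours in enumerate(top_idx):
        for r in neighbours:
            rows.append((ids[q], ids[r], sims[q, r]))
    return pd.DataFrame(rows, columns=["query_id", "retrieved_id", "similarity"])


# Toy data: items "a" and "b" point in almost the same direction.
toy = pd.DataFrame({
    "id": ["a", "b", "c"],
    "f0": [1.0, 0.9, 0.0],
    "f1": [0.0, 0.1, 1.0],
})
toy_result = top_n_similar(toy, n=1)
```

Building the full similarity matrix is quadratic in the number of songs, which would be consistent with the long runtimes reported below for high-dimensional features; for larger collections the queries could be processed in batches.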

In [5]:
# General parameters for data
features = [tfidf_df, bert_df, w_to_vec, mfcc_stats_df, blf_df, ivec_df, dnn_df, vgg19_df]
data_names = ['tfidf', 'bert', 'w_to_vec', 'mfcc_stats', 'blf', 'ivec', 'dnn', 'vgg19']
N = 10

# Additional parameters for main function
SAVE_TO_CSV = True

In [6]:
for feature_df, data_name in zip(features, data_names):
    result_df = helper.find_top_n_similar_songs(feature_df, N, data_name)
    if SAVE_TO_CSV:
        result_df.to_csv(f'{data_name}_top_{N}_similar_songs.csv', index=False)


Find TOP 10 recommendations for tfidf...
DONE after 2.75s
Find TOP 10 recommendations for bert...
DONE after 19.56s
Find TOP 10 recommendations for w_to_vec...
DONE after 8.45s
Find TOP 10 recommendations for mfcc_stats...
DONE after 3.59s
Find TOP 10 recommendations for blf...
DONE after 84.55s
Find TOP 10 recommendations for ivec...
DONE after 3.69s
Find TOP 10 recommendations for dnn...
DONE after 2.26s
Find TOP 10 recommendations for vgg19...
DONE after 133.15s

In [16]:
#show first top 10 songs for the first 2 query songs
vgg_data.head(20)

Unnamed: 0,query_song,query_artist,query_genre,retrieved_song,retrieved_artist,retrieved_genre,similarity
0,Somebody's Gotta Die,The Notorious B.I.G.,"['hip hop', 'rap', 'grindcore', 'death metal']",Last Night on Earth,Green Day,"['rock', 'alternative rock', 'pop punk', 'punk...",0.965113
1,Somebody's Gotta Die,The Notorious B.I.G.,"['hip hop', 'rap', 'grindcore', 'death metal']",This River Is Wild,The Killers,"['rock', 'indie rock', 'alternative rock', 'po...",0.964264
2,Somebody's Gotta Die,The Notorious B.I.G.,"['hip hop', 'rap', 'grindcore', 'death metal']",Dímelo,Enrique Iglesias,"['pop', 'latin', 'latin pop', 'emo']",0.959671
3,Somebody's Gotta Die,The Notorious B.I.G.,"['hip hop', 'rap', 'grindcore', 'death metal']",Haunt (demo),Bastille,"['indie rock', 'rock']",0.956429
4,Somebody's Gotta Die,The Notorious B.I.G.,"['hip hop', 'rap', 'grindcore', 'death metal']",I'm Done,The Pussycat Dolls,"['pop', 'singer songwriter', 'rock']",0.955086
5,Somebody's Gotta Die,The Notorious B.I.G.,"['hip hop', 'rap', 'grindcore', 'death metal']",Talk About You,Mika,"['pop', 'pop rock', 'uk pop', 'pop dance']",0.95252
6,Somebody's Gotta Die,The Notorious B.I.G.,"['hip hop', 'rap', 'grindcore', 'death metal']",Crawl,Chris Brown,"['soul', 'r b', 'hip hop', 'pop', 'rap', 'pop ...",0.950749
7,Somebody's Gotta Die,The Notorious B.I.G.,"['hip hop', 'rap', 'grindcore', 'death metal']",Se bastasse una canzone,Eros Ramazzotti,"['pop', 'easy listening', 'rock', 'singer song...",0.949707
8,Somebody's Gotta Die,The Notorious B.I.G.,"['hip hop', 'rap', 'grindcore', 'death metal']",Factory Girl,The Pretty Reckless,"['rock', 'alternative rock', 'post grunge', 'p...",0.947297
9,Somebody's Gotta Die,The Notorious B.I.G.,"['hip hop', 'rap', 'grindcore', 'death metal']",What Am I to Say,Sum 41,"['rock', 'punk', 'alternative rock', 'hard rock']",0.947024


<h1 style="color:rgb(0,120,170)">Early Fusion</h1>

------------------------------------------------------

<a id="early_fusion"></a><br>
**Word2Vec + Block Level Features**<br>
First I normalize the feature vectors to prepare them for concatenation.<br>


In [7]:
_w_to_vec = w_to_vec[w_to_vec['id'] != "03Oc9WeMEmyLLQbj"] #song not present in all df
feature_1 = _w_to_vec.drop('id',axis='columns') #text
feature_2 = blf_df.drop('id',axis='columns') #audio 


In [8]:
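feature_1_normalized = normalize(feature_1, norm='l2')
feature_2_normalized = normalize(feature_2, norm='l2')
concatenated_feat = np.concatenate((feature_1_normalized, feature_2_normalized), axis=1)
concatenated_feat = pd.DataFrame(concatenated_feat)
# Take the ids from the filtered frame. Assigning w_to_vec['id'] would align on the
# original integer index and mislabel every row from the dropped song onward.
# Assumes the rows of blf_df are in the same order as those of _w_to_vec.
concatenated_feat['id'] = _w_to_vec['id'].to_numpy()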
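
The reason for normalizing each modality before concatenation can be checked on toy data: when both blocks have unit L2 norm, the cosine similarity of the concatenated vectors is exactly the mean of the two per-modality cosine similarities, so neither modality dominates because of its scale. The sketch below uses random stand-in features (not the Word2Vec or BLF data), with the second block deliberately scaled up by a factor of 100.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
text_feat = rng.normal(size=(4, 5))            # stand-in for a text embedding
audio_feat = 100.0 * rng.normal(size=(4, 8))   # stand-in for audio features on a larger scale

# Same steps as in the cell above: L2-normalize each block, then concatenate.
text_n = normalize(text_feat, norm="l2")
audio_n = normalize(audio_feat, norm="l2")
fused = np.concatenate((text_n, audio_n), axis=1)

sim_fused = cosine_similarity(fused)
sim_mean = (cosine_similarity(text_n) + cosine_similarity(audio_n)) / 2

# Both matrices agree up to floating-point error.
same = np.allclose(sim_fused, sim_mean)
```

Under this view, early fusion with equal-norm blocks behaves like an unweighted average of the two similarity scores; weighting the modalities differently would require scaling one block before concatenation.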

In [19]:
for N in [10, 100]:
    result_df = helper.find_top_n_similar_songs(concatenated_feat, N, 'early_fusion')
    if SAVE_TO_CSV:
        result_df.to_csv(f'early_fusion_top_{N}_similar_songs.csv', index=False)


Find TOP 10 recommendations for early_fusion...
DONE after 8.66s
Find TOP 100 recommendations for early_fusion...
DONE after 468.66s


In [17]:
#show first top 10 songs for the first 2 query songs
early_fusion.head(20)

Unnamed: 0,query_song,query_artist,query_genre,retrieved_song,retrieved_artist,retrieved_genre,similarity
0,Sarcasm,Get Scared,"['post hardcore', 'emocore', 'rock', 'alternat...",The Cuddler,Dance Gavin Dance,"['post hardcore', 'metalcore', 'hardcore', 'em...",0.93326
1,Sarcasm,Get Scared,"['post hardcore', 'emocore', 'rock', 'alternat...",Venom,Little Simz,"['hip hop', 'rap']",0.9316
2,Sarcasm,Get Scared,"['post hardcore', 'emocore', 'rock', 'alternat...",Sword,ASHES dIVIDE,"['alternative rock', 'rock', 'progressive rock...",0.929498
3,Sarcasm,Get Scared,"['post hardcore', 'emocore', 'rock', 'alternat...",The Throne of Agony,Foetus,"['industrial', 'experimental', 'no wave', 'roc...",0.926988
4,Sarcasm,Get Scared,"['post hardcore', 'emocore', 'rock', 'alternat...",CALM ENVY,the GazettE,"['j rock', 'visual kei', 'rock']",0.926488
5,Sarcasm,Get Scared,"['post hardcore', 'emocore', 'rock', 'alternat...",Miod,Natalia Przybysz,"['soul', 'blues']",0.926401
6,Sarcasm,Get Scared,"['post hardcore', 'emocore', 'rock', 'alternat...","Kochaj mnie, a będę twoją",Kult,"['rock', 'hard rock', 'polish rock']",0.926308
7,Sarcasm,Get Scared,"['post hardcore', 'emocore', 'rock', 'alternat...",Relations,Jackson C. Frank,"['folk', 'singer songwriter', 'folk rock']",0.926128
8,Sarcasm,Get Scared,"['post hardcore', 'emocore', 'rock', 'alternat...",Cabinet Man,Lemon Demon,"['rap', 'indie rock', 'screamo', 'comedy', 'ch...",0.925967
9,Sarcasm,Get Scared,"['post hardcore', 'emocore', 'rock', 'alternat...",June on the West Coast,Bright Eyes,"['folk', 'singer songwriter', 'indie rock', 'e...",0.925048


<h1 style="color:rgb(0,120,170)">Late Fusion</h1>

------------------------------------------------------

<a id="late_fusion"></a><br>
**Word2Vec (40%) + Block Level Features (60%)**<br>
I introduce two parameters, `alpha` and `beta`, that increase the similarity value of retrieved items under the following conditions, before the results are reordered and concatenated:<br>
1. If the titles of the query song and the retrieved song share words, we assume a semantic similarity and increase the similarity value by `alpha` (set to 0.003 for top 10 and 0.06 for top 100).
2. If a song was retrieved by both the Word2Vec-based and the BLF-based model for the same query, its similarity is increased by `beta` (set to 1, as this is an extremely rare event).

**N.B.**: parameter optimization has not been implemented and should be researched further. The similarity values are no longer bounded between -1 and 1, but this is not a problem because the evaluation metrics used below do not take similarity into account. Different evaluation metrics might require further normalization.
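
The two boosting rules can be sketched as follows. This is not the code of `helper.increase_similarity_based_on_artist_and_song`, which is not shown here; the function name `boost_similarity`, the word-matching rule (lower-cased, split on whitespace), and the toy rows are assumptions made for illustration. Only the column names are taken from the result tables in this notebook.

```python
import numpy as np
import pandas as pd


def boost_similarity(results_a: pd.DataFrame, results_b: pd.DataFrame,
                     alpha: float, beta: float):
    """Apply both boosting rules to copies of two result lists (hypothetical sketch)."""
    a, b = results_a.copy(), results_b.copy()
    key = ["query_song", "query_artist", "retrieved_song", "retrieved_artist"]

    # Rule 2: (query, retrieved) pairs that occur in both result lists.
    pairs_a = set(map(tuple, a[key].to_numpy()))
    pairs_b = set(map(tuple, b[key].to_numpy()))
    shared = pairs_a & pairs_b

    for df in (a, b):
        # Rule 1: query title and retrieved title have at least one word in common.
        same_word = [
            bool(set(str(q).lower().split()) & set(str(r).lower().split()))
            for q, r in zip(df["query_song"], df["retrieved_song"])
        ]
        in_both = [tuple(row) in shared for row in df[key].to_numpy()]
        df["similarity"] = (df["similarity"].to_numpy()
                            + alpha * np.array(same_word, dtype=float)
                            + beta * np.array(in_both, dtype=float))
    return a, b


# Toy result lists with made-up titles and artists.
blf_toy = pd.DataFrame({
    "query_song": ["Blue Sky", "Blue Sky"],
    "query_artist": ["Artist X", "Artist X"],
    "retrieved_song": ["Blue Moon", "Harbor"],
    "retrieved_artist": ["Artist Y", "Artist Z"],
    "similarity": [0.90, 0.80],
})
w2v_toy = pd.DataFrame({
    "query_song": ["Blue Sky"],
    "query_artist": ["Artist X"],
    "retrieved_song": ["Harbor"],
    "retrieved_artist": ["Artist Z"],
    "similarity": [0.70],
})
blf_boosted, w2v_boosted = boost_similarity(blf_toy, w2v_toy, alpha=0.003, beta=1.0)
```

With `beta = 1`, any song found by both models ends up above every song found by only one of them, since unboosted cosine similarities cannot exceed 1.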

In [3]:
h_w2v_data = pd.read_csv("w_to_vec_top_100_similar_songs.csv", sep=",") #top 100
h_blf_data = pd.read_csv("blf_top_100_similar_songs.csv", sep=",")
#-------
t_w2v_data = pd.read_csv("w_to_vec_top_10_similar_songs.csv", sep=",") #top 10
t_blf_data = pd.read_csv("blf_top_10_similar_songs.csv", sep=",")

In [8]:
#top100
hundredth_res = h_blf_data.iloc[99::100]  # rank-100 result of every query (100 rows per query)
first_res = h_blf_data.iloc[::100]        # rank-1 result of every query
# alpha: half the mean similarity gap between rank 1 and rank 100
alpha = (first_res['similarity'].mean() - hundredth_res['similarity'].mean())/2
h_blf_data, h_w2v_data = helper.increase_similarity_based_on_artist_and_song(h_blf_data, h_w2v_data)
late_fusion = helper.concatenate_songs(h_blf_data, h_w2v_data, N=60, M=40)
late_fusion.to_csv("late_fusion_top_100_similar_songs.csv", index=False)

#--------

#top10
tenth_res = t_blf_data.iloc[9::10]  # rank-10 result of every query (10 rows per query)
first_res = t_blf_data.iloc[::10]   # rank-1 result of every query
# alpha: half the mean similarity gap between rank 1 and rank 10
alpha = (first_res['similarity'].mean() - tenth_res['similarity'].mean())/2
t_blf_data, t_w2v_data = helper.increase_similarity_based_on_artist_and_song(t_blf_data, t_w2v_data)
late_fusion2 = helper.concatenate_songs(t_blf_data, t_w2v_data, N=6, M=4)
late_fusion2.to_csv("late_fusion_top_10_similar_songs.csv", index=False)
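
The 60/40 split performed by `helper.concatenate_songs` can be pictured with the sketch below. It assumes that the helper keeps the `N` best rows of the first list and the `M` best rows of the second list per query and then re-ranks the merged list by similarity; the actual helper is not shown, and `fuse_top_results` as well as the toy rows are hypothetical. Duplicates across the two lists are not removed in this sketch.

```python
import pandas as pd


def fuse_top_results(first: pd.DataFrame, second: pd.DataFrame,
                     n: int, m: int) -> pd.DataFrame:
    """Keep the n best rows of `first` and the m best rows of `second` per query, then re-rank."""
    key = ["query_song", "query_artist"]

    def best(df: pd.DataFrame, count: int) -> pd.DataFrame:
        ranked = df.sort_values("similarity", ascending=False)
        return ranked.groupby(key, sort=False).head(count)

    fused = pd.concat([best(first, n), best(second, m)], ignore_index=True)
    # Re-rank the merged list of every query by the (possibly boosted) similarity.
    return fused.sort_values(key + ["similarity"],
                             ascending=[True, True, False],
                             ignore_index=True)


# Toy result lists for a single made-up query.
first_toy = pd.DataFrame({
    "query_song": ["q1"] * 3,
    "query_artist": ["a1"] * 3,
    "retrieved_song": ["r1", "r2", "r3"],
    "similarity": [0.90, 0.80, 0.70],
})
second_toy = pd.DataFrame({
    "query_song": ["q1"] * 2,
    "query_artist": ["a1"] * 2,
    "retrieved_song": ["r4", "r5"],
    "similarity": [0.95, 0.60],
})
fused_toy = fuse_top_results(first_toy, second_toy, n=2, m=1)
```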

In [14]:
#show first top 10 songs for the first 2 query songs
late_fusion.head(20)

Unnamed: 0,query_song,query_artist,query_genre,retrieved_song,retrieved_artist,retrieved_genre,similarity,relevant
23242,"$1,000,000,000",Everyone Everywhere,"['midwest emo', 'math rock', 'emo', 'indie rock']",People Help The People,Cherry Ghost,"['indie rock', 'easy listening', 'britpop', 'i...",1.975388,False
23243,"$1,000,000,000",Everyone Everywhere,"['midwest emo', 'math rock', 'emo', 'indie rock']",Looking Out,Brandi Carlile,"['rock', 'singer songwriter', 'country', 'folk...",1.974273,False
23246,"$1,000,000,000",Everyone Everywhere,"['midwest emo', 'math rock', 'emo', 'indie rock']",Baby,Devendra Banhart,"['folk', 'singer songwriter', 'freak folk', 'i...",1.906103,False
23247,"$1,000,000,000",Everyone Everywhere,"['midwest emo', 'math rock', 'emo', 'indie rock']",Death On Two Legs,Queen,"['rock', 'classic rock', 'hard rock', 'progres...",1.905264,False
23248,"$1,000,000,000",Everyone Everywhere,"['midwest emo', 'math rock', 'emo', 'indie rock']",Moody's Mood for Love,Amy Winehouse,"['jazz', 'soul', 'singer songwriter', 'neo sou...",1.903547,False
23249,"$1,000,000,000",Everyone Everywhere,"['midwest emo', 'math rock', 'emo', 'indie rock']",Relations,Jackson C. Frank,"['folk', 'singer songwriter', 'folk rock']",1.902776,False
23240,"$1,000,000,000",Everyone Everywhere,"['midwest emo', 'math rock', 'emo', 'indie rock']",Under the Milky Way,The Church,"['new wave', 'rock', 'post punk', 'soundtrack'...",0.979856,False
23241,"$1,000,000,000",Everyone Everywhere,"['midwest emo', 'math rock', 'emo', 'indie rock']",Take Your Shot,The Pineapple Thief,"['alternative rock', 'progressive rock', 'art ...",0.977607,False
23244,"$1,000,000,000",Everyone Everywhere,"['midwest emo', 'math rock', 'emo', 'indie rock']",Fight Song,The Appleseed Cast,"['post rock', 'indie rock', 'emo', 'rock', 'ex...",0.973874,True
23245,"$1,000,000,000",Everyone Everywhere,"['midwest emo', 'math rock', 'emo', 'indie rock']",Deathly,Aimee Mann,"['singer songwriter', 'soundtrack', 'rock', 'p...",0.973457,True


<h1 style="color:rgb(0,120,170)">Evaluation and Results</h1>

------------------------------------------------------

<a id="evaluation"></a>

In [4]:
# load top 100
early_data = pd.read_csv("early_fusion_top_100_similar_songs.csv", sep=",")
late_data = pd.read_csv("late_fusion_top_100_similar_songs.csv", sep=",")
bert_data = pd.read_csv("bert_top_100_similar_songs.csv", sep=",")
blf_data = pd.read_csv("blf_top_100_similar_songs.csv", sep=",")
dnn_data = pd.read_csv("dnn_top_100_similar_songs.csv", sep=",")
ivec_data = pd.read_csv("ivec_top_100_similar_songs.csv", sep=",")
mfcc_data = pd.read_csv("mfcc_stats_top_100_similar_songs.csv", sep=",")
tfidf_data = pd.read_csv("tfidf_top_100_similar_songs.csv", sep=",")
vgg_data = pd.read_csv("vgg19_top_100_similar_songs.csv", sep=",")
w2v_data = pd.read_csv("w_to_vec_top_100_similar_songs.csv", sep=",")
random_data = pd.read_csv("random_baseline_top_100_similar_songs.csv", sep=",")

In [5]:
systems_data = [random_data, tfidf_data, w2v_data, bert_data, mfcc_data, ivec_data, blf_data, dnn_data, vgg_data, early_data, late_data]
systems_legend = ['random', 'tfidf', 'w2v', 'bert', 'mfcc', 'ivec', 'blf', 'dnn', 'vgg', 'early', 'late']

In [18]:
zipped_systems = zip(systems_legend, systems_data)
for name, data in zipped_systems:
    helper.check_relevance(data)

#### Plot of the **Precision-Recall Curve** for each of the 11 systems at k, with k in the range [1, 100]

In [6]:
def get_precision_recall_curve(system_data):
    pr_at_k = helper.calculate_precision_recall_at_k(system_data)
    recalls = [pr_at_k[k][1] for k in pr_at_k]  
    precisions = [pr_at_k[k][0] for k in pr_at_k]  
    return precisions, recalls
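
`get_precision_recall_curve` expects `helper.calculate_precision_recall_at_k` to return a mapping from `k` to a `(precision, recall)` pair. The helper itself is not shown; the sketch below computes such a mapping for a single ranked result list, assuming binary relevance flags and a known total number of relevant items for the query. The function name and the toy input are hypothetical, and the helper presumably averages these values over all queries.

```python
def precision_recall_at_k(relevant_flags, total_relevant, max_k):
    """Return {k: (precision@k, recall@k)} for one ranked result list."""
    curve = {}
    hits = 0
    for k in range(1, max_k + 1):
        hits += int(relevant_flags[k - 1])   # relevant items among the first k results
        precision = hits / k
        recall = hits / total_relevant if total_relevant else 0.0
        curve[k] = (precision, recall)
    return curve


# Toy ranking: relevant items at ranks 1 and 3, four relevant items in the collection.
toy_curve = precision_recall_at_k([True, False, True, False], total_relevant=4, max_k=4)
toy_precisions = [toy_curve[k][0] for k in toy_curve]
toy_recalls = [toy_curve[k][1] for k in toy_curve]
```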

In [9]:
plt.figure(figsize=(12, 8))

color_palette = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf', '#1a55FF']

# One curve per system; systems_data and systems_legend are defined above.
for idx, system in enumerate(systems_data):
    precisions, recalls = get_precision_recall_curve(system)
    color = color_palette[idx % len(color_palette)]
    plt.plot(recalls, precisions, label=systems_legend[idx], color=color)

plt.xlabel('Recall@k')
plt.ylabel('Precision@k')
plt.title('Precision-Recall Curve of Retrieval Systems at k for k in range [1,100]')
plt.legend()
plt.grid(True)
plt.show()

In [15]:
#load top 10
early_data = pd.read_csv("early_fusion_top_10_similar_songs.csv", sep=",")
late_data = pd.read_csv("late_fusion_top_10_similar_songs.csv", sep=",")
vgg_data = pd.read_csv("vgg19_top_10_similar_songs.csv", sep=",")
w2v_data = pd.read_csv("w_to_vec_top_10_similar_songs.csv", sep=",")
mfcc_data = pd.read_csv("mfcc_stats_top_10_similar_songs.csv", sep=",")
blf_data = pd.read_csv("blf_top_10_similar_songs.csv", sep=",")
bert_data = pd.read_csv("bert_top_10_similar_songs.csv", sep=",")
tfidf_data = pd.read_csv("tfidf_top_10_similar_songs.csv", sep=",")
random_data = pd.read_csv("random_baseline_top_10_similar_songs.csv", sep=",")
ivec_data = pd.read_csv("ivec_top_10_similar_songs.csv", sep=",")
dnn_data = pd.read_csv("dnn_top_10_similar_songs.csv", sep=",")


In [44]:
systems_data = [random_data, tfidf_data, w2v_data, bert_data, mfcc_data, ivec_data, blf_data, dnn_data, vgg_data, early_data, late_data]
systems_legend = ['random', 'tfidf', 'w2v', 'bert', 'mfcc', 'ivec', 'blf', 'dnn', 'vgg', 'early', 'late']

In [19]:
zipped_systems = zip(systems_legend, systems_data)
for name, data in zipped_systems:
    helper.check_relevance(data)

#### Table of the Evaluation of **Precision, Recall, nDCG, Genre coverage and Genre diversity @ k** on all 11 retrieval systems, with k=10.
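
For reference, nDCG@k with binary relevance can be sketched as below. This is a generic formulation, not necessarily what `helper.calculate_ndcg_at_10` does: the function name is hypothetical, and the ideal ranking is built from the retrieved list only, which is a common simplification (a stricter variant would use all relevant items in the collection).

```python
from math import log2


def ndcg_at_k(relevant_flags, k):
    """nDCG@k for one ranked list of binary relevance flags."""
    gains = [float(r) for r in relevant_flags[:k]]
    # Discounted cumulative gain: a hit at rank i contributes 1 / log2(i + 1).
    dcg = sum(g / log2(rank + 1) for rank, g in enumerate(gains, start=1))
    # Ideal DCG: the same gains with all hits moved to the top.
    ideal = sorted(gains, reverse=True)
    idcg = sum(g / log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


perfect = ndcg_at_k([True, True, False], k=3)    # hits already at the top
delayed = ndcg_at_k([False, True, True], k=3)    # same hits, one rank lower
empty = ndcg_at_k([False, False], k=2)           # no relevant item retrieved
```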

In [36]:
k = 10
zipped_systems = zip(systems_legend, systems_data)
results = []

for name, data in zipped_systems:
    precision_at_k = helper.calculate_average_precision_at_k(data, k)
    recall_at_k = helper.calculate_recall_at_k(data, k)
    ndcg = helper.calculate_ndcg_at_10(data)
    gen_cov = helper.genre_coverage_at_k(data, k)
    gen_div = helper.genre_diversity_at_k(data, k)
    results.append((name, precision_at_k, recall_at_k, ndcg, gen_cov, gen_div))
    print(f"{name}: ok")


random: ok
tfidf: ok
w2v: ok
bert: ok
mfcc: ok
ivec: ok
blf: ok
dnn: ok
vgg: ok
early: ok
late: ok
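
The two genre metrics are computed by helper functions that are not shown here. One common way to define them, given as an assumption rather than as the helper's actual definition, is: coverage as the share of all known genres that occur at least once among the retrieved items, and diversity as the base-2 Shannon entropy of the genre distribution within a result list. The sketch below also parses the stringified genre lists stored in the CSV files with `ast.literal_eval`; all function names and the toy cells are hypothetical.

```python
import ast
from collections import Counter
from math import log2


def parse_genres(cell):
    """Turn a stringified list such as "['rock', 'pop']" into a Python list."""
    return ast.literal_eval(cell) if isinstance(cell, str) else list(cell)


def genre_coverage(retrieved_genre_cells, all_genres):
    """Share of the known genres that occur at least once among the retrieved items."""
    seen = set()
    for cell in retrieved_genre_cells:
        seen.update(parse_genres(cell))
    known = set(all_genres)
    return len(seen & known) / len(known)


def genre_entropy(retrieved_genre_cells):
    """Shannon entropy (base 2) of the genre distribution within one result list."""
    counts = Counter(g for cell in retrieved_genre_cells for g in parse_genres(cell))
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values()) if total else 0.0


toy_cells = ["['rock', 'pop']", "['rock']", "['jazz']"]
toy_cov = genre_coverage(toy_cells, all_genres=["rock", "pop", "jazz", "folk"])
toy_div = genre_entropy(toy_cells)
```

In the toy example three of four genres are covered, and the genre counts (rock twice, pop and jazz once each) give an entropy of 1.5 bits.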


In [48]:
results_df = pd.DataFrame(results, columns=['System Name', 'Precision@10', 'Recall@10', "NDCG@10", "Genre Coverage@10", "Genre Diversity@10"])
display(results_df)

Unnamed: 0,System Name,Precision@10,Recall@10,NDCG@10,Genre Coverage@10,Genre Diversity@10
0,random,0.079675,0.000103,0.40867,1.0,4.249072
1,tfidf,0.467055,0.00011,0.408358,0.977922,4.226594
2,w2v,0.33747,0.000147,0.40852,0.91342,4.197717
3,bert,0.163822,0.000118,0.41091,0.938528,4.194742
4,mfcc,0.506016,0.000106,0.408932,0.956691,4.19336
5,ivec,0.493113,0.000129,0.409474,0.999134,4.220663
6,blf,0.166173,0.000119,0.409535,0.94673,4.211745
7,dnn,0.140966,0.000173,0.410037,0.996535,4.194467
8,vgg,0.502929,0.000154,0.413847,0.961022,4.227351
9,early,0.445296,0.000131,0.407947,0.899567,4.208084
