# MMSR Evaluation Notebook

**Evaluation is run on**
- Random baseline
- Unimodal (audio / lyrics / video)
- Early fusion 
- Late fusion

**Metrics**
Precision, Recall, MRR, and nDCG are computed at k = 10 under two relevance definitions: the Jaccard-threshold relevance and the overlap relevance definition from the slides. A popularity score (Pop@10) is reported as well.
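
The two relevance definitions and the rank metrics can be sketched as follows. This is a minimal illustration, not the code inside `common.Evaluator`: it assumes that each track carries a set of tags (for example genres), that Jaccard-threshold relevance means a Jaccard similarity of at least the threshold, and that the slide rule treats any shared tag as relevant. All function and variable names below are hypothetical.

```python
import math


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; 0.0 if both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def precision_at_k(rel: list, k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(rel[:k]) / k


def recall_at_k(rel: list, k: int, n_relevant: int) -> float:
    """Fraction of all relevant tracks that appear in the top-k."""
    return sum(rel[:k]) / n_relevant if n_relevant else 0.0


def mrr_at_k(rel: list, k: int) -> float:
    """Reciprocal rank of the first relevant result in the top-k, else 0."""
    for rank, r in enumerate(rel[:k], start=1):
        if r:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(gains: list, k: int, ideal_gains: list) -> float:
    """DCG of the ranking divided by the DCG of the best possible ranking."""
    def dcg(g):
        return sum(x / math.log2(i + 2) for i, x in enumerate(g[:k]))

    idcg = dcg(sorted(ideal_gains, reverse=True))
    return dcg(gains) / idcg if idcg else 0.0


# Toy example (made-up tags, assumption: relevance is tag-based).
query = {"rock", "indie", "pop"}
retrieved = [
    {"rock", "indie"},
    {"jazz"},
    {"pop", "dance", "electro", "house"},
    {"rock", "indie", "pop"},
]
threshold = 0.25  # same value as jaccard_relevant_threshold in the configuration

jaccard_rel = [int(jaccard(query, t) >= threshold) for t in retrieved]
overlap_rel = [int(len(query & t) > 0) for t in retrieved]

print("Jaccard relevance:", jaccard_rel, "P@4 =", precision_at_k(jaccard_rel, 4))
print("Overlap relevance:", overlap_rel, "P@4 =", precision_at_k(overlap_rel, 4))
```

The third retrieved track shares one tag with the query but has a low Jaccard similarity, so it counts as relevant only under the overlap rule. This is consistent with the overlap-relevance scores in the results being higher than the Jaccard-threshold scores.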


## 1) Configuration


In [3]:
import numpy as np

from common import Evaluator, evaluate_system, MODALITIES
from baseline import RandomBaselineRetrievalSystem
from unimodal import UnimodalRetrievalSystem
from early_fusion import EarlyFusionRetrievalSystem
from late_fusion import LateFusionRetrievalSystem

data_dir = "./data"
k = 10

# Evaluator shared by all systems; 0.25 is the threshold for the Jaccard-threshold relevance
evaluator = Evaluator(data_dir, jaccard_relevant_threshold=0.25)

def banner(title: str) -> None:
    """Print a section title framed by two separator lines."""
    print("\n" + "=" * 70)
    print(title)
    print("=" * 70)


## 2) Evaluate Random baseline


In [4]:
banner("Random Baseline")
rs = RandomBaselineRetrievalSystem(evaluator, seed=0)
evaluate_system(evaluator, rs, k=k)


Random Baseline


Evaluating RandomBaselineRetrievalSystem: 100%|██████████| 4148/4148 [00:14<00:00, 278.41track/s]


JACCARD-THRESHOLD RELEVANCE
METRIC       MEAN   |  STD
-----------------------------
Precision@10: 0.0441 | 0.0741
Recall@10:    0.0022 | 0.0054
MRR@10:       0.1064 | 0.2217
nDCG@10:      0.1009 | 0.0794

OVERLAP RELEVANCE (slide rule)
METRIC       MEAN   |  STD
-----------------------------
Precision@10: 0.4201 | 0.2750
Recall@10:    0.0024 | 0.0023
MRR@10:       0.5784 | 0.3905
nDCG@10:      0.4199 | 0.2829

POPULARITY
METRIC            MEAN   |  STD
--------------------------------
Pop@10 (log1p): 7.9105 | 0.4194





## 3) Evaluate Unimodal (each modality)


In [5]:
unimodal_rs = UnimodalRetrievalSystem(data_dir, evaluator)

for modality in MODALITIES:
    banner(f"Unimodal | modality={modality}")
    unimodal_rs.set_modality(modality)
    evaluate_system(evaluator, unimodal_rs, k=k)



Unimodal | modality=audio


Evaluating UnimodalRetrievalSystem: 100%|██████████| 4148/4148 [00:16<00:00, 254.94track/s]



JACCARD-THRESHOLD RELEVANCE
METRIC       MEAN   |  STD
-----------------------------
Precision@10: 0.1184 | 0.1445
Recall@10:    0.0089 | 0.0286
MRR@10:       0.2607 | 0.3436
nDCG@10:      0.1788 | 0.1213

OVERLAP RELEVANCE (slide rule)
METRIC       MEAN   |  STD
-----------------------------
Precision@10: 0.5734 | 0.3051
Recall@10:    0.0045 | 0.0180
MRR@10:       0.7153 | 0.3670
nDCG@10:      0.5774 | 0.3110

POPULARITY
METRIC            MEAN   |  STD
--------------------------------
Pop@10 (log1p): 7.8570 | 0.5388

Unimodal | modality=lyrics


Evaluating UnimodalRetrievalSystem: 100%|██████████| 4148/4148 [00:18<00:00, 224.58track/s]



JACCARD-THRESHOLD RELEVANCE
METRIC       MEAN   |  STD
-----------------------------
Precision@10: 0.1024 | 0.1425
Recall@10:    0.0086 | 0.0228
MRR@10:       0.2182 | 0.3181
nDCG@10:      0.1627 | 0.1175

OVERLAP RELEVANCE (slide rule)
METRIC       MEAN   |  STD
-----------------------------
Precision@10: 0.5511 | 0.2934
Recall@10:    0.0047 | 0.0117
MRR@10:       0.6984 | 0.3693
nDCG@10:      0.5544 | 0.3003

POPULARITY
METRIC            MEAN   |  STD
--------------------------------
Pop@10 (log1p): 8.0593 | 0.5095

Unimodal | modality=video


Evaluating UnimodalRetrievalSystem: 100%|██████████| 4148/4148 [01:05<00:00, 63.26track/s]


JACCARD-THRESHOLD RELEVANCE
METRIC       MEAN   |  STD
-----------------------------
Precision@10: 0.0859 | 0.1139
Recall@10:    0.0063 | 0.0211
MRR@10:       0.2419 | 0.3540
nDCG@10:      0.1548 | 0.1181

OVERLAP RELEVANCE (slide rule)
METRIC       MEAN   |  STD
-----------------------------
Precision@10: 0.5014 | 0.3003
Recall@10:    0.0036 | 0.0164
MRR@10:       0.6788 | 0.3794
nDCG@10:      0.5094 | 0.3044

POPULARITY
METRIC            MEAN   |  STD
--------------------------------
Pop@10 (log1p): 7.9458 | 0.5481





## 4) Evaluate Early fusion (all 3 modalities + random 2)


In [6]:
# TODO: change unimodal script so that it also accepts a subset of the three modalities (i.e., random two)

banner("Early Fusion | ALL modalities")
early_rs = EarlyFusionRetrievalSystem(data_dir, evaluator)
evaluate_system(evaluator, early_rs, k=k)



Early Fusion | ALL modalities


Evaluating EarlyFusionRetrievalSystem: 100%|██████████| 4148/4148 [00:18<00:00, 223.01track/s]


JACCARD-THRESHOLD RELEVANCE
METRIC       MEAN   |  STD
-----------------------------
Precision@10: 0.0444 | 0.0754
Recall@10:    0.0022 | 0.0059
MRR@10:       0.1071 | 0.2224
nDCG@10:      0.1008 | 0.0795

OVERLAP RELEVANCE (slide rule)
METRIC       MEAN   |  STD
-----------------------------
Precision@10: 0.4182 | 0.2737
Recall@10:    0.0023 | 0.0020
MRR@10:       0.5726 | 0.3873
nDCG@10:      0.4176 | 0.2798

POPULARITY
METRIC            MEAN   |  STD
--------------------------------
Pop@10 (log1p): 7.8855 | 0.4013





## 5) Evaluate Late fusion (all strategies, all 3 modalities)


In [8]:
# Note: this cell takes a while because every fusion configuration is evaluated

modality_sets = [
    ("ALL modalities", ["audio", "lyrics", "video"]),
]

late_configs = [
    ("RRF", dict(fusion="rrf", rrf_k=60)),
    ("norm_sum (zscore + equal)", dict(fusion="norm_sum", norm="zscore", weighting="equal")),
    ("norm_sum (zscore + auto_agreement)", dict(fusion="norm_sum", norm="zscore", weighting="auto")),
    ("norm_sum (minmax + equal)", dict(fusion="norm_sum", norm="minmax", weighting="equal")),
    ("norm_sum (minmax + auto_agreement)", dict(fusion="norm_sum", norm="minmax", weighting="auto")),
]

for mod_label, mods in modality_sets:
    for cfg_label, cfg in late_configs:
        banner(f"Late Fusion | {cfg_label} | {mod_label} | modalities={mods}")
        rs = LateFusionRetrievalSystem(data_dir, evaluator, modalities=mods, **cfg)
        evaluate_system(evaluator, rs, k=k)



Late Fusion | RRF | ALL modalities | modalities=['audio', 'lyrics', 'video']


Evaluating LateFusionRetrievalSystem: 100%|██████████| 4148/4148 [01:11<00:00, 57.72track/s]



JACCARD-THRESHOLD RELEVANCE
METRIC       MEAN   |  STD
-----------------------------
Precision@10: 0.1377 | 0.1558
Recall@10:    0.0111 | 0.0236
MRR@10:       0.3019 | 0.3639
nDCG@10:      0.2011 | 0.1295

OVERLAP RELEVANCE (slide rule)
METRIC       MEAN   |  STD
-----------------------------
Precision@10: 0.6061 | 0.2948
Recall@10:    0.0051 | 0.0123
MRR@10:       0.7454 | 0.3501
nDCG@10:      0.6108 | 0.3004

POPULARITY
METRIC            MEAN   |  STD
--------------------------------
Pop@10 (log1p): 8.0441 | 0.5487

Late Fusion | norm_sum (zscore + equal) | ALL modalities | modalities=['audio', 'lyrics', 'video']


Evaluating LateFusionRetrievalSystem: 100%|██████████| 4148/4148 [01:12<00:00, 57.36track/s]



JACCARD-THRESHOLD RELEVANCE
METRIC       MEAN   |  STD
-----------------------------
Precision@10: 0.1383 | 0.1610
Recall@10:    0.0111 | 0.0257
MRR@10:       0.3134 | 0.3774
nDCG@10:      0.2059 | 0.1359

OVERLAP RELEVANCE (slide rule)
METRIC       MEAN   |  STD
-----------------------------
Precision@10: 0.6156 | 0.2963
Recall@10:    0.0053 | 0.0127
MRR@10:       0.7562 | 0.3479
nDCG@10:      0.6216 | 0.3011

POPULARITY
METRIC            MEAN   |  STD
--------------------------------
Pop@10 (log1p): 8.0919 | 0.5646

Late Fusion | norm_sum (zscore + auto_agreement) | ALL modalities | modalities=['audio', 'lyrics', 'video']


Evaluating LateFusionRetrievalSystem: 100%|██████████| 4148/4148 [01:16<00:00, 54.46track/s]



JACCARD-THRESHOLD RELEVANCE
METRIC       MEAN   |  STD
-----------------------------
Precision@10: 0.1411 | 0.1658
Recall@10:    0.0116 | 0.0270
MRR@10:       0.3132 | 0.3753
nDCG@10:      0.2075 | 0.1379

OVERLAP RELEVANCE (slide rule)
METRIC       MEAN   |  STD
-----------------------------
Precision@10: 0.6176 | 0.2958
Recall@10:    0.0054 | 0.0125
MRR@10:       0.7577 | 0.3456
nDCG@10:      0.6234 | 0.3003

POPULARITY
METRIC            MEAN   |  STD
--------------------------------
Pop@10 (log1p): 8.0872 | 0.5711

Late Fusion | norm_sum (minmax + equal) | ALL modalities | modalities=['audio', 'lyrics', 'video']


Evaluating LateFusionRetrievalSystem: 100%|██████████| 4148/4148 [01:18<00:00, 53.04track/s]



JACCARD-THRESHOLD RELEVANCE
METRIC       MEAN   |  STD
-----------------------------
Precision@10: 0.1384 | 0.1595
Recall@10:    0.0110 | 0.0246
MRR@10:       0.3172 | 0.3798
nDCG@10:      0.2070 | 0.1362

OVERLAP RELEVANCE (slide rule)
METRIC       MEAN   |  STD
-----------------------------
Precision@10: 0.6165 | 0.2963
Recall@10:    0.0051 | 0.0102
MRR@10:       0.7632 | 0.3462
nDCG@10:      0.6241 | 0.3011

POPULARITY
METRIC            MEAN   |  STD
--------------------------------
Pop@10 (log1p): 8.1173 | 0.5588

Late Fusion | norm_sum (minmax + auto_agreement) | ALL modalities | modalities=['audio', 'lyrics', 'video']


Evaluating LateFusionRetrievalSystem: 100%|██████████| 4148/4148 [01:16<00:00, 54.24track/s]


JACCARD-THRESHOLD RELEVANCE
METRIC       MEAN   |  STD
-----------------------------
Precision@10: 0.1418 | 0.1665
Recall@10:    0.0114 | 0.0257
MRR@10:       0.3130 | 0.3751
nDCG@10:      0.2079 | 0.1382

OVERLAP RELEVANCE (slide rule)
METRIC       MEAN   |  STD
-----------------------------
Precision@10: 0.6174 | 0.2981
Recall@10:    0.0052 | 0.0108
MRR@10:       0.7620 | 0.3454
nDCG@10:      0.6246 | 0.3021

POPULARITY
METRIC            MEAN   |  STD
--------------------------------
Pop@10 (log1p): 8.1100 | 0.5682
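
The fusion strategies named in the configurations above (`rrf` with `rrf_k=60`, and `norm_sum` with `zscore` or `minmax` normalization) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `late_fusion.LateFusionRetrievalSystem`: the function names and input shapes are hypothetical, only equal weighting is shown, and the `auto` (agreement-based) weighting is not reproduced because the notebook does not define it.

```python
import numpy as np


def rrf_fuse(rankings, rrf_k=60):
    """Reciprocal rank fusion: each list adds 1 / (rrf_k + rank) per item."""
    scores = {}
    for ranking in rankings:  # one ranked list of item ids per modality, best first
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (rrf_k + rank)
    return sorted(scores, key=scores.get, reverse=True)


def normalize(scores, norm):
    """Bring one modality's scores onto a comparable scale."""
    scores = np.asarray(scores, dtype=float)
    if norm == "zscore":
        std = scores.std()
        return (scores - scores.mean()) / std if std > 0 else np.zeros_like(scores)
    if norm == "minmax":
        span = scores.max() - scores.min()
        return (scores - scores.min()) / span if span > 0 else np.zeros_like(scores)
    raise ValueError(f"unknown norm: {norm}")


def norm_sum_fuse(score_matrix, norm="zscore", weights=None):
    """Weighted sum of per-modality normalized scores; returns item indices, best first."""
    normed = np.stack([normalize(row, norm) for row in score_matrix])
    if weights is None:  # equal weighting across modalities
        weights = np.full(len(normed), 1.0 / len(normed))
    fused = np.asarray(weights, dtype=float) @ normed
    return np.argsort(-fused, kind="stable")


# Toy example with made-up rankings and similarity scores.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["b", "c", "a"]]
rrf_order = rrf_fuse(rankings)

scores = np.array([
    [0.9, 0.1, 0.5],  # modality 1: similarity of items 0..2 to the query
    [0.6, 0.2, 0.7],  # modality 2
])
zscore_order = norm_sum_fuse(scores, norm="zscore")
minmax_order = norm_sum_fuse(scores, norm="minmax")

print("RRF order:", rrf_order)
print("norm_sum (zscore) order:", zscore_order.tolist())
print("norm_sum (minmax) order:", minmax_order.tolist())
```

RRF uses only rank positions, so it does not depend on how each modality scales its similarity scores. `norm_sum` keeps the score magnitudes, which is why a normalization step is needed before summing.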



