Most similarity models collapse melody, rhythm, and timbre into a single undifferentiated score. MERIT exposes all three as independent, interpretable signals from the same audio query.
Given two audio clips, MERIT returns three independent cosine similarities — one per musical factor:
| Score | Captures | Example query |
|---|---|---|
| S_mel | Melodic contour & pitch identity | "Find songs with the same melody" |
| S_rhy | Rhythmic groove & beat pattern | "Find songs with the same drum feel" |
| S_tim | Instrument timbre & sonic character | "Find songs played on the same instrument" |
A solo piano cover of a rock anthem scores high on S_mel but low on S_rhy and S_tim. MERIT makes this distinction explicit and computable.
| Resource | Link | Description |
|---|---|---|
| Pre-trained heads | amaai-lab/merit | 3 × ~11 MB projection heads |
| Training dataset | datasets/amaai-lab/merit | 296K factor-controlled triplets |
```bash
# Download pre-trained heads only (~33 MB total)
huggingface-cli download amaai-lab/merit \
  head_mel/best_head.pt head_rhy/best_head.pt head_tim/best_head.pt \
  --local-dir ./models
```

No training or dataset required. Download the three pre-trained heads (~11 MB each) and encode any audio in a few lines of Python.
```bash
pip install torch torchaudio transformers huggingface_hub
huggingface-cli download amaai-lab/merit \
  head_mel/best_head.pt head_rhy/best_head.pt head_tim/best_head.pt \
  --local-dir ./models
```

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchaudio
from transformers import AutoModel, Wav2Vec2FeatureExtractor

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
EXTRACT_LAYERS = (3, 4, 5, 6, 23)
MODEL_ID = "m-a-p/MERT-v1-330M"

# Load MERT backbone (shared for all three factors)
processor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_ID, trust_remote_code=True)
mert = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True).to(DEVICE).eval()


class ProjectionHead(nn.Module):
    def __init__(self, in_dim=5120, hidden_dim=512, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, out_dim, bias=False),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)


def load_head(path):
    ckpt = torch.load(path, map_location=DEVICE, weights_only=True)
    head = ProjectionHead(ckpt["in_dim"], ckpt["hidden_dim"], ckpt["out_dim"])
    head.load_state_dict(ckpt["state_dict"])
    return head.to(DEVICE).eval()


head_mel = load_head("models/head_mel/best_head.pt")
head_rhy = load_head("models/head_rhy/best_head.pt")
head_tim = load_head("models/head_tim/best_head.pt")


def load_audio(path, sr=24_000, max_sec=30):
    wav, orig_sr = torchaudio.load(path)
    if orig_sr != sr:
        wav = torchaudio.functional.resample(wav, orig_sr, sr)
    wav = wav.mean(0)                                    # stereo → mono
    wav = wav[: sr * max_sec]                            # truncate
    wav = F.pad(wav, (0, sr * max_sec - wav.shape[0]))   # zero-pad
    return wav


@torch.no_grad()
def get_merit_embeddings(audio_path):
    """Return (melody, rhythm, timbre) embeddings, each a (1, 128) unit vector."""
    wav = load_audio(audio_path)
    inputs = processor(wav.numpy(), sampling_rate=24_000, return_tensors="pt")
    inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
    out = mert(**inputs, output_hidden_states=True)
    parts = [out.hidden_states[l].mean(dim=1) for l in EXTRACT_LAYERS]
    backbone = torch.cat(parts, dim=-1)  # (1, 5120)
    return head_mel(backbone), head_rhy(backbone), head_tim(backbone)


# Get embeddings for any two audio files
emb_a = get_merit_embeddings("song_a.wav")
emb_b = get_merit_embeddings("song_b.wav")

melody_sim = (emb_a[0] * emb_b[0]).sum().item()  # cosine sim in [-1, 1]
rhythm_sim = (emb_a[1] * emb_b[1]).sum().item()
timbre_sim = (emb_a[2] * emb_b[2]).sum().item()
```

Tip: For large collections, use `evaluation/encode_folder.py` to batch-encode an entire directory to a single pkl file, which is much faster than encoding file by file.
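Once embeddings exist for a collection, per-factor retrieval reduces to sorting by dot product, since every embedding is already L2-normalized. The following is a minimal sketch rather than part of the repository: `rank_by_factor`, `FACTOR_INDEX`, and the toy 2-dimensional vectors are hypothetical, and it assumes each embedding tuple has been converted to plain lists (for example with `.squeeze(0).tolist()` on each tensor).

```python
# Hypothetical helper, not part of MERIT: rank a small library by one factor.
# Embedding tuples follow the (melody, rhythm, timbre) order used above.
FACTOR_INDEX = {"melody": 0, "rhythm": 1, "timbre": 2}


def dot(u, v):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_by_factor(query, library, factor):
    """Return (name, similarity) pairs sorted from most to least similar.

    query   -- tuple of three unit vectors (lists of floats)
    library -- dict mapping a track name to such a tuple
    factor  -- "melody", "rhythm", or "timbre"
    """
    i = FACTOR_INDEX[factor]
    scores = [(name, dot(query[i], emb[i])) for name, emb in library.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy 2-dimensional unit vectors standing in for real 128-dimensional embeddings.
query = ([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
library = {
    "piano_cover": ([1.0, 0.0], [1.0, 0.0], [0.0, 1.0]),  # same melody only
    "same_band": ([0.0, 1.0], [0.0, 1.0], [1.0, 0.0]),    # same rhythm and timbre
}

for factor in FACTOR_INDEX:
    print(factor, rank_by_factor(query, library, factor))
```

For collections beyond a few thousand tracks, stacking the embeddings into a matrix and using a single matrix-vector product would be the more practical route.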
```
Audio (24 kHz mono)
    └─► MERT-v1-330M  [FROZEN]
            Layers 3, 4, 5, 6, 23
            └─► mean-pool over time → concat → 5120-dim

5120-dim backbone vector
    ├─► H_mel  Linear(5120→512) → ReLU → Linear(512→128) → L2-norm → S_mel
    ├─► H_rhy  Linear(5120→512) → ReLU → Linear(512→128) → L2-norm → S_rhy
    └─► H_tim  Linear(5120→512) → ReLU → Linear(512→128) → L2-norm → S_tim
```
Each head is trained independently with Circle Loss on triplets where only one musical factor varies at a time.
| Component | Detail |
|---|---|
| Backbone | MERT-v1-330M, 330M params, frozen |
| Layers extracted | 3, 4, 5, 6, 23 (5 × 1024-dim → 5120-dim) |
| Head architecture | Linear → ReLU → Linear → L2-norm |
| Embedding dim | 128 per factor |
| Loss | Circle Loss (γ=10, m=0.2) |
| Optimizer | AdamW, lr=1e-3 |
| Schedule | Cosine annealing, 200 epochs |
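To make the loss row concrete, here is a worked sketch of Circle Loss for a single anchor, using the pairwise formulation from the original Circle Loss paper with the γ=10 and m=0.2 values from the table. It is written in plain Python for readability and is an assumption about the general form, not the repository's training code. The function name `circle_loss` and the example similarities are illustrative only.

```python
import math


def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    top = max(xs)
    return top + math.log(sum(math.exp(x - top) for x in xs))


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def circle_loss(sim_pos, sim_neg, gamma=10.0, margin=0.2):
    """Circle Loss for one anchor.

    sim_pos -- cosine similarities between the anchor and its positives
    sim_neg -- cosine similarities between the anchor and its negatives
    """
    o_p, o_n = 1.0 + margin, -margin          # similarity optima
    delta_p, delta_n = 1.0 - margin, margin   # decision margins
    # Each pair is weighted by how far it still is from its optimum.
    pos_logits = [-gamma * max(o_p - s, 0.0) * (s - delta_p) for s in sim_pos]
    neg_logits = [gamma * max(s - o_n, 0.0) * (s - delta_n) for s in sim_neg]
    # log(1 + sum_exp(neg) * sum_exp(pos)) == softplus(lse(neg) + lse(pos))
    return softplus(logsumexp(neg_logits) + logsumexp(pos_logits))


# A well-separated anchor versus a confused one (made-up similarities).
easy = circle_loss(sim_pos=[0.9], sim_neg=[0.1])
hard = circle_loss(sim_pos=[0.1], sim_neg=[0.9])
print(f"easy={easy:.4f} hard={hard:.4f}")
```

The loss stays small when positives sit near similarity 1 and negatives near 0, and grows roughly linearly as the two are confused. A training implementation would compute the same quantities on batched tensors.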
```bash
# Clone this repository
git clone https://github.com/AMAAI-Lab/MERIT.git
cd MERIT

# Create conda environment
conda create -n merit python=3.10 -y
conda activate merit

# Install dependencies
pip install -r requirements.txt
```

Triplet generation uses the JASCO music generation model (Meta AI). Follow their installation instructions and then set:

```bash
export JASCO_ROOT=/path/to/jasco-audiocraft
```

Note: JASCO is only needed to re-generate triplets.
The factor-controlled training triplets are published on HuggingFace:
| Factor | Folders | Triplets |
|---|---|---|
| Melody | 5,000 | 125,000 |
| Rhythm | 5,000 | 125,000 |
| Timbre | 1,855 | 46,241 |
```bash
# Download all three factor archives (~50 GB melody, ~50 GB rhythm, ~10 GB timbre)
huggingface-cli download --repo-type dataset amaai-lab/merit \
  melody_triplets.tar.gz rhythm_triplets.tar.gz timbre_triplets.tar.gz \
  --local-dir ./data/triplets
```

Each archive extracts to `triplets_*/triplet/{anchor.wav, positive_01-05.wav, negative.wav, triplet_meta.json}`. Within each folder, only the target factor is shared between anchor and positives; key, genre, and instrumentation vary freely.
Dataset licensed under CC BY-NC-SA 4.0 — derived from MoisesDB. Non-commercial use only.
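The folder layout described above can be walked with the standard library alone. The sketch below is illustrative and not taken from the repository: `list_triplets` is a hypothetical helper, it assumes the positives are stored as separate files matching `positive_*.wav`, and it assumes each positive forms one triplet with the folder's shared anchor and negative. The contents of `triplet_meta.json` are returned untouched because its schema is not documented here.

```python
import json
from pathlib import Path


def list_triplets(folder):
    """Return ([(anchor, positive, negative), ...], meta) for one triplet folder.

    Assumes the folder holds anchor.wav, negative.wav, and files matching
    positive_*.wav; triplet_meta.json is loaded if present, else meta is {}.
    """
    folder = Path(folder)
    anchor = folder / "anchor.wav"
    negative = folder / "negative.wav"
    positives = sorted(folder.glob("positive_*.wav"))

    meta_path = folder / "triplet_meta.json"
    meta = json.loads(meta_path.read_text()) if meta_path.exists() else {}

    triplets = [(anchor, positive, negative) for positive in positives]
    return triplets, meta
```

Calling `list_triplets` on each extracted folder yields path tuples that can be fed to an audio loader such as the `load_audio` function from the quick start.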
- Request access and download MoisesDB from Moises Inc.
- Unpack so that the structure is: `/your/path/moisesdb/moisesdb_v0.1/<song_id>/...`
- Export the environment variable: `export MOISESDB_ROOT=/your/path/moisesdb`
Download the three probe datasets and place them under a common root:
| Dataset | Used for | Source |
|---|---|---|
| MUSDB18-HQ | Timbral probe | Zenodo |
| Ballroom | Rhythmic probe | MTG-UPF |
| Covers80 | Melodic/cover probe | LabROSA |
```bash
export PROBES_ROOT=/your/path/probes
# Expected layout:
#   $PROBES_ROOT/musdb18hq/   (stems as .wav)
#   $PROBES_ROOT/ballroom/    (subdirs named by class)
#   $PROBES_ROOT/covers80/    (subdirs with 2 files = one cover pair)
```

The three trained projection heads (melody, rhythm, timbre) are available on HuggingFace (~11 MB each):

```bash
huggingface-cli download amaai-lab/merit head_mel/best_head.pt head_rhy/best_head.pt head_tim/best_head.pt --local-dir ./models
```

Want to run MERIT on your own audio? This is all you need; no training is required. Download the heads, encode your audio with `evaluation/encode_folder.py`, and project with the heads. No MoisesDB, no JASCO, no GPU-days of training.
To reproduce the paper evaluations:
```bash
# 3×3 disentanglement table (Table 1)
export EMBEDDINGS_DIR=./data/embeddings
bash scripts/3_extract_embeddings.sh
bash scripts/5_evaluate.sh

# Probe evaluations (Table 2 / Table 3)
export PROBES_ROOT=/your/path/probes
bash scripts/6_run_probes.sh
```

To run the full pipeline from scratch:

```bash
# Step 1: Build MoisesDB input indexes
export MOISESDB_ROOT=/your/path/moisesdb
bash scripts/1_build_indexes.sh

# Step 2: Generate triplets (requires JASCO)
export JASCO_ROOT=/path/to/jasco-audiocraft
bash scripts/2_generate_triplets.sh

# Step 3: Extract MERT embeddings
bash scripts/3_extract_embeddings.sh

# Step 4: Train heads
bash scripts/4_train_heads.sh

# Step 5: Evaluate (3×3 disentanglement table)
bash scripts/5_evaluate.sh

# Step 6: Probe evaluations
export PROBES_ROOT=/your/path/probes
bash scripts/6_run_probes.sh
```

`extract_embeddings.py` supports sharding across multiple GPUs to speed up extraction. Skip this if running on a single GPU; `scripts/3_extract_embeddings.sh` handles that case directly.
```bash
# Run on 4 GPUs (adjust CUDA_VISIBLE_DEVICES accordingly)
for I in 1 2 3 4; do
  CUDA_VISIBLE_DEVICES=$((I-1)) python training/extract_embeddings.py \
    --encoder mert --triplets-dir ./data/melody_triplets \
    --split-file splits/melody_split.json \
    --out ./data/embeddings/mel_shard_${I}.pkl \
    --shard ${I}/4 &
done
wait

# Merge shards
python training/merge_pkl.py \
  --shards ./data/embeddings/mel_shard_*.pkl \
  --triplets-dir ./data/melody_triplets \
  --out ./data/embeddings/mel_mert.pkl
```

If you use this code, please cite:
```bibtex
@article{merit2026,
  title   = {Learning Disentangled Music Representations for Audio Similarity},
  author  = {},
  journal = {arXiv preprint arXiv:coming soon},
  year    = {2026},
}
```

This code is released under the MIT License.
The datasets used (MoisesDB, MUSDB18-HQ, Ballroom, Covers80) are subject to their own respective licenses. See each dataset's homepage for terms of use.
