MERIT

Multi-Factor Disentangled Music Similarity

Most similarity models collapse melody, rhythm, and timbre into a single undifferentiated score. MERIT exposes all three as independent, interpretable signals from the same audio query.

What is MERIT?

Given two audio clips, MERIT returns three independent cosine similarities — one per musical factor:

Score	Captures	Example query
`S_mel`	Melodic contour & pitch identity	"Find songs with the same melody"
`S_rhy`	Rhythmic groove & beat pattern	"Find songs with the same drum feel"
`S_tim`	Instrument timbre & sonic character	"Find songs played on the same instrument"

A solo piano cover of a rock anthem scores high on `S_mel` but low on `S_rhy` and `S_tim`. MERIT makes this distinction explicit and computable.
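As a minimal sketch of what the three scores are, the snippet below computes one cosine similarity per factor from pairs of embedding vectors. The 2-D vectors are made-up toy values chosen to mimic the piano-cover case, not MERIT outputs, and `cosine` and `factor_scores` are illustrative names rather than functions from this repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def factor_scores(emb_a, emb_b):
    """emb_a / emb_b: dicts mapping factor name -> embedding vector."""
    return {f"S_{name}": cosine(emb_a[name], emb_b[name]) for name in emb_a}


# Toy embeddings (hypothetical values, for illustration only).
rock_anthem = {"mel": [1.0, 0.0], "rhy": [1.0, 0.0], "tim": [1.0, 0.0]}
piano_cover = {"mel": [0.9, 0.1], "rhy": [0.0, 1.0], "tim": [-0.2, 1.0]}

scores = factor_scores(rock_anthem, piano_cover)
# S_mel is close to 1, while S_rhy is 0 and S_tim is slightly negative.
```

Because each factor has its own embedding, a high score on one factor places no constraint on the other two.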

HuggingFace Resources

Resource	Link	Description
Pre-trained heads	amaai-lab/merit	3 × ~11 MB projection heads
Training dataset	datasets/amaai-lab/merit	296K factor-controlled triplets

# Download pre-trained heads only (~33 MB total)
huggingface-cli download amaai-lab/merit \
    head_mel/best_head.pt head_rhy/best_head.pt head_tim/best_head.pt \
    --local-dir ./models

Quick Inference — Get MERIT Embeddings for Your Audio

No training or dataset required. Download the three pre-trained heads (~11 MB each) and encode any audio in a few lines of Python.

Step 1 — Download pre-trained heads

pip install torch torchaudio transformers huggingface_hub

huggingface-cli download amaai-lab/merit \
    head_mel/best_head.pt head_rhy/best_head.pt head_tim/best_head.pt \
    --local-dir ./models

Step 2 — Encode audio

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchaudio
from transformers import AutoModel, Wav2Vec2FeatureExtractor

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
EXTRACT_LAYERS = (3, 4, 5, 6, 23)
MODEL_ID = "m-a-p/MERT-v1-330M"

# Load MERT backbone (shared for all three factors)
processor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_ID, trust_remote_code=True)
mert = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True).to(DEVICE).eval()


class ProjectionHead(nn.Module):
    def __init__(self, in_dim=5120, hidden_dim=512, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, out_dim, bias=False),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)


def load_head(path):
    ckpt = torch.load(path, map_location=DEVICE, weights_only=True)
    head = ProjectionHead(ckpt["in_dim"], ckpt["hidden_dim"], ckpt["out_dim"])
    head.load_state_dict(ckpt["state_dict"])
    return head.to(DEVICE).eval()


head_mel = load_head("models/head_mel/best_head.pt")
head_rhy = load_head("models/head_rhy/best_head.pt")
head_tim = load_head("models/head_tim/best_head.pt")


def load_audio(path, sr=24_000, max_sec=30):
    wav, orig_sr = torchaudio.load(path)
    if orig_sr != sr:
        wav = torchaudio.functional.resample(wav, orig_sr, sr)
    wav = wav.mean(0)                                    # stereo → mono
    wav = wav[: sr * max_sec]                            # truncate
    wav = F.pad(wav, (0, sr * max_sec - wav.shape[0]))   # zero-pad
    return wav


@torch.no_grad()
def get_merit_embeddings(audio_path):
    """Return (melody, rhythm, timbre) embeddings — each a (1, 128) unit vector."""
    wav = load_audio(audio_path)
    inputs = processor(wav.numpy(), sampling_rate=24_000, return_tensors="pt")
    inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
    out = mert(**inputs, output_hidden_states=True)
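    # hidden_states[0] is the embedding output, so index i is the output of transformer layer i
    parts = [out.hidden_states[layer].mean(dim=1) for layer in EXTRACT_LAYERS]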
    backbone = torch.cat(parts, dim=-1)  # (1, 5120)
    return head_mel(backbone), head_rhy(backbone), head_tim(backbone)


# Get embeddings for any two audio files
emb_a = get_merit_embeddings("song_a.wav")
emb_b = get_merit_embeddings("song_b.wav")

melody_sim = (emb_a[0] * emb_b[0]).sum().item()  # cosine sim in [-1, 1]
rhythm_sim = (emb_a[1] * emb_b[1]).sum().item()
timbre_sim = (emb_a[2] * emb_b[2]).sum().item()

Tip: For large collections, use evaluation/encode_folder.py to batch-encode an entire directory to a single pkl file — much faster than encoding file-by-file.
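To illustrate how per-factor retrieval could work once embeddings are stored, here is a minimal sketch that ranks a small library against a query on a single factor. It assumes the embeddings are already L2-normalised (as the heads' outputs are), so a dot product equals cosine similarity. The `rank_by_factor` helper, the dictionary layout, and the toy vectors are hypothetical and not part of the repository.

```python
def rank_by_factor(query, library, factor):
    """Rank library items by dot product with the query on one factor.

    query:   dict factor -> unit vector, e.g. {"mel": [...], "rhy": [...], "tim": [...]}
    library: dict track name -> dict of the same shape
    factor:  "mel", "rhy", or "tim"
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    scored = [(name, dot(query[factor], emb[factor])) for name, emb in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy unit vectors standing in for 128-dim MERIT embeddings.
query = {"mel": [1.0, 0.0], "rhy": [0.0, 1.0], "tim": [1.0, 0.0]}
library = {
    "track_a": {"mel": [0.6, 0.8], "rhy": [0.0, 1.0], "tim": [0.0, 1.0]},
    "track_b": {"mel": [1.0, 0.0], "rhy": [1.0, 0.0], "tim": [0.8, 0.6]},
}

by_melody = rank_by_factor(query, library, "mel")   # track_b ranks first
by_rhythm = rank_by_factor(query, library, "rhy")   # track_a ranks first
```

The same query yields a different ordering per factor, which is the reason for keeping the three scores separate.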

Architecture

Audio (24 kHz mono)
  └─► MERT-v1-330M [FROZEN]
        Layers 3, 4, 5, 6, 23
        └─► mean-pool over time → concat → 5120-dim

5120-dim backbone vector
  ├─► H_mel  Linear(5120→512) → ReLU → Linear(512→128) → L2-norm  →  S_mel
  ├─► H_rhy  Linear(5120→512) → ReLU → Linear(512→128) → L2-norm  →  S_rhy
  └─► H_tim  Linear(5120→512) → ReLU → Linear(512→128) → L2-norm  →  S_tim
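
The head dimensions in the diagram are consistent with the quoted checkpoint size. A rough count, assuming float32 weights, a bias on the first linear layer only (as in the `ProjectionHead` class above), and ignoring checkpoint metadata:

```python
def linear_params(in_dim, out_dim, bias=True):
    """Number of parameters in a fully connected layer."""
    return in_dim * out_dim + (out_dim if bias else 0)


backbone_dim = 5 * 1024                      # 5 layers x 1024-dim hidden states
head_params = (
    linear_params(backbone_dim, 512, bias=True)
    + linear_params(512, 128, bias=False)
)
size_mb = head_params * 4 / 1e6              # 4 bytes per float32 weight

print(backbone_dim)        # 5120
print(head_params)         # 2687488
print(round(size_mb, 2))   # 10.75
```

That is roughly 10.7 MB per head, in line with the ~11 MB figure.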

Each head is trained independently with Circle Loss on triplets where only one musical factor varies at a time.

Component	Detail
Backbone	MERT-v1-330M, 330M params, frozen
Layers extracted	3, 4, 5, 6, 23 (5 × 1024-dim → 5120-dim)
Head architecture	Linear → ReLU → Linear → L2-norm
Embedding dim	128 per factor
Loss	Circle Loss (γ=10, m=0.2)
Optimizer	AdamW, lr=1e-3
Schedule	Cosine annealing, 200 epochs
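
The training code itself is not reproduced in this README, so the following is only a sketch of the published Circle Loss formulation (Sun et al., 2020) for a single anchor, using the γ=10 and m=0.2 values from the table above. The function name and the per-anchor, list-based interface are assumptions; the repository's implementation may batch and reduce differently.

```python
import math


def circle_loss(sim_pos, sim_neg, gamma=10.0, margin=0.2):
    """Circle Loss for one anchor.

    sim_pos: cosine similarities between the anchor and its positives (non-empty)
    sim_neg: cosine similarities between the anchor and its negatives (non-empty)
    """
    o_p, o_n = 1.0 + margin, -margin      # optimum for positive / negative similarities
    d_p, d_n = 1.0 - margin, margin       # decision margins
    # Each similarity is weighted by how far it is from its optimum.
    logit_p = [-gamma * max(o_p - s, 0.0) * (s - d_p) for s in sim_pos]
    logit_n = [gamma * max(s - o_n, 0.0) * (s - d_n) for s in sim_neg]
    lse_p = math.log(sum(math.exp(x) for x in logit_p))
    lse_n = math.log(sum(math.exp(x) for x in logit_n))
    z = lse_p + lse_n
    # softplus(z) = log(1 + exp(z)), written in a numerically stable form
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


well_separated = circle_loss([0.9], [0.0])
poorly_separated = circle_loss([0.1], [0.8])
# The loss is small when positives are close and negatives are far, large otherwise.
```

With these settings a positive stops contributing gradient near similarity 1 + m and a negative near -m, so already well-placed pairs are weighted down.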

Installation

# Clone this repository
git clone https://github.com/AMAAI-Lab/MERIT.git
cd MERIT

# Create conda environment
conda create -n merit python=3.10 -y
conda activate merit

# Install dependencies
pip install -r requirements.txt

JASCO (required for triplet generation only)

Triplet generation uses the JASCO music generation model (Meta AI). Follow their installation instructions and then set:

export JASCO_ROOT=/path/to/jasco-audiocraft

Note: JASCO is only needed to re-generate triplets.

Training Data

The factor-controlled training triplets are published on HuggingFace:

Factor	Folders	Triplets
Melody	5,000	125,000
Rhythm	5,000	125,000
Timbre	1,855	46,241

# Download all three factor archives (~50 GB melody, ~50 GB rhythm, ~10 GB timbre)
huggingface-cli download --repo-type dataset amaai-lab/merit \
    melody_triplets.tar.gz rhythm_triplets.tar.gz timbre_triplets.tar.gz \
    --local-dir ./data/triplets

Each archive extracts to triplets_*/triplet/{anchor.wav, positive_01-05.wav, negative.wav, triplet_meta.json}. Within each folder, the anchor and its positives share only the target factor; the other attributes, such as key, genre, or instrumentation where it is not the target, vary freely.
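
Given that layout, walking the extracted archives could look like the sketch below. It assumes the positives are individual files matching `positive_*.wav` and pairs each one with the folder's single anchor and negative. `iter_triplets` is a hypothetical helper, and how the published triplet counts map onto files is not specified here.

```python
from pathlib import Path


def iter_triplets(root):
    """Yield (anchor, positive, negative) paths from each triplet folder.

    Assumes every folder containing an anchor.wav also holds
    positive_*.wav files and a negative.wav; incomplete folders are skipped.
    """
    for anchor in sorted(Path(root).rglob("anchor.wav")):
        folder = anchor.parent
        negative = folder / "negative.wav"
        positives = sorted(folder.glob("positive_*.wav"))
        if not negative.exists() or not positives:
            continue
        for positive in positives:
            yield anchor, positive, negative
```

For example, `list(iter_triplets("./data/triplets"))` would collect every complete triplet under the download directory used above.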

Dataset licensed under CC BY-NC-SA 4.0 — derived from MoisesDB. Non-commercial use only.

Data Setup

MoisesDB

Request access and download MoisesDB from Moises Inc.

Unpack so that the structure is:

/your/path/moisesdb/moisesdb_v0.1/<song_id>/...

Export the environment variable:

export MOISESDB_ROOT=/your/path/moisesdb

Probe Datasets (for Step 6)

Download the three probe datasets and place them under a common root:

Dataset	Used for	Source
MUSDB18-HQ	Timbral probe	Zenodo
Ballroom	Rhythmic probe	MTG-UPF
Covers80	Melodic/cover probe	LabROSA

export PROBES_ROOT=/your/path/probes
# Expected layout:
#   $PROBES_ROOT/musdb18hq/   (stems as .wav)
#   $PROBES_ROOT/ballroom/    (subdirs named by class)
#   $PROBES_ROOT/covers80/    (subdirs with 2 files = one cover pair)
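
As a sanity check on that layout before running the probes, a short sketch like the following can list what the Covers80 and Ballroom directories would contribute. Both helpers are hypothetical and only mirror the conventions stated in the comments above (two files per cover subdirectory, class taken from the subdirectory name); the probe scripts may read the data differently.

```python
from pathlib import Path


def list_cover_pairs(covers_root):
    """Return (file_a, file_b) for every subdirectory holding exactly two files."""
    pairs = []
    for sub in sorted(p for p in Path(covers_root).iterdir() if p.is_dir()):
        files = sorted(f for f in sub.iterdir() if f.is_file())
        if len(files) == 2:
            pairs.append((files[0], files[1]))
    return pairs


def list_labelled_files(ballroom_root):
    """Return (path, class_name) pairs, taking the class from the subdirectory name."""
    return [
        (f, sub.name)
        for sub in sorted(p for p in Path(ballroom_root).iterdir() if p.is_dir())
        for f in sorted(sub.iterdir())
        if f.is_file()
    ]
```

Subdirectories with more or fewer than two files are silently skipped by `list_cover_pairs`, so comparing the returned count with the number of subdirectories is a quick way to spot an incomplete download.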

Reproduction

Using Pre-trained Heads (recommended)

The three trained projection heads (melody, rhythm, timbre) are available on HuggingFace (~11 MB each):

huggingface-cli download amaai-lab/merit head_mel/best_head.pt head_rhy/best_head.pt head_tim/best_head.pt --local-dir ./models

Want to run MERIT on your own audio? This is all you need — no training required. Download the heads, encode your audio with evaluation/encode_folder.py, and project with the heads. No MoisesDB, no JASCO, no GPU-days of training.

To reproduce the paper evaluations:

# 3×3 disentanglement table (Table 1)
export EMBEDDINGS_DIR=./data/embeddings
bash scripts/3_extract_embeddings.sh
bash scripts/5_evaluate.sh

# Probe evaluations (Table 2 / Table 3)
export PROBES_ROOT=/your/path/probes
bash scripts/6_run_probes.sh

Full Reproduction (re-generate everything from scratch)

# Step 1: Build MoisesDB input indexes
export MOISESDB_ROOT=/your/path/moisesdb
bash scripts/1_build_indexes.sh

# Step 2: Generate triplets (requires JASCO)
export JASCO_ROOT=/path/to/jasco-audiocraft
bash scripts/2_generate_triplets.sh

# Step 3: Extract MERT embeddings
bash scripts/3_extract_embeddings.sh

# Step 4: Train heads
bash scripts/4_train_heads.sh

# Step 5: Evaluate (3×3 disentanglement table)
bash scripts/5_evaluate.sh

# Step 6: Probe evaluations
export PROBES_ROOT=/your/path/probes
bash scripts/6_run_probes.sh
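
Since several steps depend on exported paths, a small preflight check can catch a missing variable before a long run starts. The sketch below covers only the variables this README shows being exported for steps 1, 2, and 6; the mapping and the `missing_env` helper are assumptions, and the scripts may read further settings.

```python
import os

# Variables each step relies on, as shown in this README.
REQUIRED_ENV = {
    1: ["MOISESDB_ROOT"],
    2: ["JASCO_ROOT"],
    6: ["PROBES_ROOT"],
}


def missing_env(step, environ=os.environ):
    """Return the variables for `step` that are unset or do not point at a directory."""
    return [
        name
        for name in REQUIRED_ENV.get(step, [])
        if not os.path.isdir(environ.get(name, ""))
    ]
```

Calling `missing_env(2)` before step 2, for instance, returns `["JASCO_ROOT"]` if that variable is unset or points at a path that does not exist.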

Multi-GPU Embedding Extraction (Optional — Advanced)

training/extract_embeddings.py supports sharding across multiple GPUs to speed up extraction. Skip this if running on a single GPU; scripts/3_extract_embeddings.sh handles that case directly.

# Run on 4 GPUs (adjust CUDA_VISIBLE_DEVICES accordingly)
for I in 1 2 3 4; do
  CUDA_VISIBLE_DEVICES=$((I-1)) python training/extract_embeddings.py \
    --encoder mert --triplets-dir ./data/melody_triplets \
    --split-file splits/melody_split.json \
    --out ./data/embeddings/mel_shard_${I}.pkl \
    --shard ${I}/4 &
done
wait

# Merge shards
python training/merge_pkl.py \
  --shards ./data/embeddings/mel_shard_*.pkl \
  --triplets-dir ./data/melody_triplets \
  --out ./data/embeddings/mel_mert.pkl
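
To make the `--shard I/4` idea concrete, here is a sketch of one way a work list can be split so that the shards are disjoint and together cover everything. The strided assignment is an assumption for illustration; `extract_embeddings.py` may partition its inputs differently, and `shard` and `parse_shard` are hypothetical names.

```python
def shard(items, index, total):
    """Return the slice of `items` assigned to shard `index` of `total` (1-based)."""
    if not 1 <= index <= total:
        raise ValueError(f"shard index {index} outside 1..{total}")
    return items[index - 1 :: total]


def parse_shard(spec):
    """Parse an 'I/N' string such as '2/4' into (index, total)."""
    index, total = (int(part) for part in spec.split("/"))
    return index, total


folders = [f"triplet_{i:03d}" for i in range(10)]
shards = [shard(folders, i, 4) for i in range(1, 5)]
# Every folder lands in exactly one shard, so merging loses nothing.
assert sorted(sum(shards, [])) == folders
```

A strided split keeps shard sizes within one item of each other, which helps when all GPUs should finish at about the same time.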

Citation

If you use this code, please cite:

@article{merit2026,
  title   = {Learning Disentangled Music Representations for Audio Similarity},
  author  = {},
  journal = {arXiv preprint arXiv:coming soon},
  year    = {2026},
}

License

This code is released under the MIT License.

The datasets used (MoisesDB, MUSDB18-HQ, Ballroom, Covers80) are subject to their own respective licenses. See each dataset's homepage for terms of use.
