# SmashChords Scoring Model

Heuristic scoring functions for matching song snippets by key and by chord progression.

- **`key_score`** — similarity based on circle-of-fifths distance between two keys
- **`progression_score`** — similarity based on chord-sequence comparison (pluggable distance function)
- **`score_all`** — given a query snippet, score every snippet in the dataset

## Load Data

In [1]:
import ast
import pandas as pd

df = pd.read_csv("toy_smashchords_tswift.csv")
df["roman_progression"] = df["roman_progression"].apply(ast.literal_eval)
df["chord_progression_raw"] = df["chord_progression_raw"].apply(ast.literal_eval)

print(f"Loaded {len(df)} snippets across {df['song_title'].nunique()} songs")
df.head()

Loaded 1163 snippets across 234 songs


| Unnamed: 0 | snippet_id | song_title | section | artist | key | key_mode | roman_progression | chord_progression_raw |
|---|---|---|---|---|---|---|---|---|
| 0 | TS_001 | seven | Verse | Taylor Swift | G | major | [I, v, IV] | [G, Dm, C] |
| 1 | TS_002 | seven | Chorus | Taylor Swift | D | minor | [i, VII, III, IV] | [Dm, C, F, G] |
| 2 | TS_003 | seven | Bridge | Taylor Swift | A | minor | [i, III, VII] | [Am, C, G] |
| 3 | TS_004 | seven | Outro | Taylor Swift | A | minor | [i, III, VII, III] | [Am, C, G, C] |
| 4 | TS_005 | it's nice to have a friend | Intro | Taylor Swift | A | minor | [i, VII] | [Am, G] |


---
## Key Score

Each key is a node on the circle of fifths (12 nodes). Distance is the shortest arc between two nodes (0–6 steps). Only the tonic in the `key` column is compared; `key_mode` (major/minor) is not used.

Linear mapping: `key_score = 1 - distance / 6`
- Same key → distance 0 → score **1.0**
- Opposite key → distance 6 → score **0.0**
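
Written out, the two rules above amount to the following. The symbols $i$ and $j$ (positions of the two keys in the 12-element circle) and $d$ (the arc distance) are notation introduced here for illustration only:

```latex
d(a, b) = \min\bigl(|i - j|,\; 12 - |i - j|\bigr), \qquad
\mathrm{key\_score}(a, b) = 1 - \frac{d(a, b)}{6}
```

For example, C and G are adjacent on the circle, so $d = 1$ and the score is $1 - 1/6 \approx 0.833$.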

In [2]:
# Circle of fifths (12 nodes, ordered)
CIRCLE_OF_FIFTHS = ["C", "G", "D", "A", "E", "B", "F#", "C#", "Ab", "Eb", "Bb", "F"]

# Enharmonic equivalents → canonical form
ENHARMONIC = {
    "Db": "C#", "Gb": "F#", "Cb": "B",
    "Fb": "E",  "E#": "F", "B#": "C",
    "A#": "Bb", "D#": "Eb", "G#": "Ab",
}

def _normalize_key(key: str) -> str | None:
    """Resolve enharmonic spellings to the canonical circle-of-fifths label."""
    # Missing values from pandas arrive as float NaN, so any float means "no key".
    if key is None or isinstance(key, float):
        return None
    key = str(key).strip()
    return ENHARMONIC.get(key, key)

def circle_distance(key_a: str, key_b: str) -> int | None:
    """
    Shortest arc distance between two keys on the circle of fifths.
    Returns an int in [0, 6], or None if either key is unrecognized.
    """
    a = _normalize_key(key_a)
    b = _normalize_key(key_b)
    if a not in CIRCLE_OF_FIFTHS or b not in CIRCLE_OF_FIFTHS:
        return None
    i, j = CIRCLE_OF_FIFTHS.index(a), CIRCLE_OF_FIFTHS.index(b)
    d = abs(i - j)
    return min(d, len(CIRCLE_OF_FIFTHS) - d)  # shortest arc

def key_score(key_a: str, key_b: str) -> float:
    """
    Similarity score in [0, 1] based on circle-of-fifths distance.
    Same key → 1.0.  Opposite key (tritone, 6 steps) → 0.0.
    Returns 0.5 if either key is unrecognized.
    """
    d = circle_distance(key_a, key_b)
    if d is None:
        return 0.5  # unknown key: neutral score
    return 1.0 - d / 6.0

# --- Quick sanity check ---
tests = [("C", "C"), ("C", "G"), ("C", "F#"), ("C", "Db")]
for a, b in tests:
    print(f"key_score({a!r}, {b!r}) = {key_score(a, b):.4f}  (distance={circle_distance(a, b)})")

key_score('C', 'C') = 1.0000  (distance=0)
key_score('C', 'G') = 0.8333  (distance=1)
key_score('C', 'F#') = 0.0000  (distance=6)
key_score('C', 'Db') = 0.1667  (distance=5)


In [3]:
def key_scores_for_song(song_key: str, df: pd.DataFrame) -> pd.DataFrame:
    """
    Given a song's key, return a copy of df with a `key_score` column
    showing how closely each snippet's key matches.
    """
    result = df.copy()
    result["key_score"] = result["key"].apply(lambda k: key_score(song_key, k))
    return result.sort_values("key_score", ascending=False)

# Demo: score all snippets against the key of "C"
scored = key_scores_for_song("C", df)
print("All snippets scored by key proximity to C major:")
print(scored[["snippet_id", "song_title", "section", "key", "key_score"]].to_string(index=False))

All snippets scored by key proximity to C major:
snippet_id                                                           song_title     section key  key_score
    TS_605                                                                karma      Bridge   C   1.000000
    TS_949                                                            labyrinth       Verse   C   1.000000
    TS_227                                                      thank you aimee       Verse   C   1.000000
    TS_226                                                      thank you aimee       Intro   C   1.000000
    TS_944                                                           the archer  Pre-Chorus   C   1.000000
    TS_945                                                           the archer      Chorus   C   1.000000
    TS_946                                                           the archer      Bridge   C   1.000000
    TS_947                                                           the archer       Outro   C

---
## Progression Score

Similarity between two chord progressions (sequences of Roman numerals).

### Pluggable design
`progression_score(prog_a, prog_b, sim_fn=lcs_similarity)` accepts any function with the signature:
```python
sim_fn(seq_a: list[str], seq_b: list[str]) -> float  # in [0, 1]
```
Three built-in options are provided below.
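
As a rough illustration of the plug-in contract, the sketch below defines a fourth, hypothetical similarity function based on Jaccard overlap of the distinct chords. `jaccard_similarity` is not part of this notebook. It ignores order and repetition, so treat it as an example of the expected signature rather than a recommended metric.

```python
def jaccard_similarity(seq_a: list[str], seq_b: list[str]) -> float:
    """Hypothetical sim_fn: overlap of the distinct chords, ignoring order."""
    set_a, set_b = set(seq_a), set(seq_b)
    if not set_a and not set_b:
        return 1.0  # two empty progressions count as identical
    return len(set_a & set_b) / len(set_a | set_b)


# A rotation contains exactly the same chords, so the order-blind score is 1.0.
rotation_score = jaccard_similarity(["I", "V", "vi", "IV"], ["V", "vi", "IV", "I"])
# Only "I" and "V" are shared among five distinct chords: 2 / 5 = 0.4.
partial_score = jaccard_similarity(["I", "V", "vi", "IV"], ["ii", "V", "I"])
print(rotation_score, partial_score)
```

Any function of this shape could then be passed as `progression_score(prog_a, prog_b, sim_fn=jaccard_similarity)`.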

In [4]:
# ── Similarity function 1: Longest Common Subsequence (DEFAULT) ──────────────
# Allows gaps; good at catching shared harmonic motifs regardless of insertions.

def lcs_length(a: list, b: list) -> int:
    """Standard LCS length via dynamic programming."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def lcs_similarity(a: list, b: list) -> float:
    """LCS length normalized by the longer sequence. Range: [0, 1]."""
    if not a and not b:
        return 1.0
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


# ── Similarity function 2: Longest Common Substring ──────────────────────────
# Must be contiguous; rewards exact shared riffs.

def lcsubstring_length(a: list, b: list) -> int:
    """Length of the longest contiguous common block."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                best = max(best, dp[i][j])
    return best

def lcsubstring_similarity(a: list, b: list) -> float:
    """Longest common substring normalized by the longer sequence. Range: [0, 1]."""
    if not a and not b:
        return 1.0
    if not a or not b:
        return 0.0
    return lcsubstring_length(a, b) / max(len(a), len(b))


# ── Similarity function 3: Levenshtein (edit distance) ───────────────────────
# Counts substitutions, insertions, deletions; penalizes reordering.

def levenshtein_distance(a: list, b: list) -> int:
    """Edit distance (substitutions, insertions, deletions) using a rolling DP row."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev = dp[:]
        dp[0] = i
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[j] = prev[j - 1]
            else:
                dp[j] = 1 + min(prev[j], dp[j - 1], prev[j - 1])
    return dp[n]

def levenshtein_similarity(a: list, b: list) -> float:
    """1 - normalized edit distance. Range: [0, 1]."""
    if not a and not b:
        return 1.0
    if not a or not b:
        return 0.0
    return 1.0 - levenshtein_distance(a, b) / max(len(a), len(b))


# ── Quick comparison of all three ────────────────────────────────────────────
examples = [
    (["I", "V", "vi", "IV"], ["I", "V", "vi", "IV"]),   # identical
    (["I", "V", "vi", "IV"], ["I", "vi", "IV"]),          # one chord dropped
    (["I", "V", "vi", "IV"], ["V", "vi", "IV", "I"]),     # rotation
    (["I", "V", "vi", "IV"], ["ii", "V", "I"]),           # partial overlap
]

fns = [("LCS", lcs_similarity), ("Substring", lcsubstring_similarity), ("Levenshtein", levenshtein_similarity)]

print(f"{'Pair':<45} {'LCS':>6} {'Substr':>8} {'Leven':>8}")
print("-" * 70)
for a, b in examples:
    label = f"{a} vs {b}"
    scores = [fn(a, b) for _, fn in fns]
    print(f"{label:<45} {scores[0]:>6.3f} {scores[1]:>8.3f} {scores[2]:>8.3f}")

Pair                                             LCS   Substr    Leven
----------------------------------------------------------------------
['I', 'V', 'vi', 'IV'] vs ['I', 'V', 'vi', 'IV']  1.000    1.000    1.000
['I', 'V', 'vi', 'IV'] vs ['I', 'vi', 'IV']    0.750    0.500    0.750
['I', 'V', 'vi', 'IV'] vs ['V', 'vi', 'IV', 'I']  0.750    0.750    0.500
['I', 'V', 'vi', 'IV'] vs ['ii', 'V', 'I']     0.250    0.250    0.250


In [5]:
def progression_score(
    prog_a: list[str],
    prog_b: list[str],
    sim_fn=lcs_similarity,   # swap to lcsubstring_similarity or levenshtein_similarity
) -> float:
    """
    Similarity score in [0, 1] between two Roman-numeral progressions.

    Parameters
    ----------
    prog_a, prog_b : list[str]
        Chord progressions as Roman-numeral token lists.
    sim_fn : callable
        Any function (list, list) -> float in [0, 1].
        Built-ins: lcs_similarity, lcsubstring_similarity, levenshtein_similarity.
    """
    return sim_fn(prog_a, prog_b)

---
## Score All Snippets Against a Query

Given a query `(key, progression)`, compute `key_score` and `progression_score` for every snippet in the dataset.
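
The `score_all` function defined below also blends the two scores into a `total_score`, a weighted sum with default weights of 0.5 each. The arithmetic for a single pair can be sketched as follows. `combine_scores` is a hypothetical helper used only for this illustration, and the inputs reuse two cases from the sanity checks earlier in this notebook (C vs G for the key, one dropped chord under LCS for the progression).

```python
def combine_scores(
    key_score: float,
    progression_score: float,
    key_weight: float = 0.5,
    prog_weight: float = 0.5,
) -> float:
    """Hypothetical helper: weighted sum of two scores that each lie in [0, 1]."""
    return key_weight * key_score + prog_weight * progression_score


# C vs G is one step on the circle of fifths: 1 - 1/6.
key_part = 1.0 - 1 / 6.0
# ['I', 'V', 'vi', 'IV'] vs ['I', 'vi', 'IV'] under LCS: 3 / 4.
prog_part = 3 / 4

total = combine_scores(key_part, prog_part)
print(f"{total:.4f}")  # 0.7917
```

The weights are not normalized, so the result is guaranteed to stay in [0, 1] when they are non-negative and sum to 1, as the defaults do.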

In [8]:
def score_all(
    query_key: str,
    query_progression: list[str],
    df: pd.DataFrame,
    sim_fn=lcs_similarity,
    key_weight: float = 0.5,
    prog_weight: float = 0.5,
) -> pd.DataFrame:
    """
    Score every snippet in df against a query (key, progression).

    Returns a copy of df with three added columns:
      - key_score         : circle-of-fifths similarity
      - progression_score : sequence similarity via sim_fn
      - total_score       : weighted sum of the two
    Sorted by total_score descending.
    """
    result = df.copy()
    result["key_score"] = result["key"].apply(lambda k: key_score(query_key, k))
    result["progression_score"] = result["roman_progression"].apply(
        lambda p: progression_score(query_progression, p, sim_fn=sim_fn)
    )
    result["total_score"] = (
        key_weight * result["key_score"] + prog_weight * result["progression_score"]
    )
    return result.sort_values("total_score", ascending=False).reset_index(drop=True)


# ── Demo: All Too Well (Taylor's Version) — Chorus ───────────────────────────
query = df[
    (df["song_title"] == "all too well (taylor's version)") &
    (df["section"] == "Chorus")
].iloc[0]

print(f"Query: [{query['snippet_id']}] {query['song_title']} — {query['section']}")
print(f"  Key: {query['key']}  |  Progression: {query['roman_progression']}")
print()

results = score_all(query["key"], query["roman_progression"], df)
print(results[["snippet_id", "song_title", "section", "key", "key_score", "progression_score", "total_score"]].head(25).to_string(index=False))


Query: [TS_887] all too well (taylor's version) — Chorus
  Key: C  |  Progression: ['I', 'V', 'vi', 'IV']

snippet_id                                                           song_title    section key  key_score  progression_score  total_score
    TS_913                                                      cornelia street      Outro   C   1.000000                1.0     1.000000
    TS_902                                                   champagne problems     Chorus   C   1.000000                1.0     1.000000
    TS_884                                      all too well (taylor's version)      Intro   C   1.000000                1.0     1.000000
    TS_885                                      all too well (taylor's version)      Verse   C   1.000000                1.0     1.000000
    TS_887                                      all too well (taylor's version)     Chorus   C   1.000000                1.0     1.000000
    TS_888                                      all too well (tay

In [None]:
# ── Compare results across different similarity functions ─────────────────────
SIM_FNS = {
    "LCS": lcs_similarity,
    "Substring": lcsubstring_similarity,
    "Levenshtein": levenshtein_similarity,
}

query = df[
    (df["song_title"] == "all too well (taylor's version)") &
    (df["section"] == "Chorus")
].iloc[0]

print(f"Query: [{query['snippet_id']}] {query['song_title']} — {query['section']}")
print(f"  Key: {query['key']}  |  Progression: {query['roman_progression']}")
print()

for fn_name, fn in SIM_FNS.items():
    top5 = score_all(query["key"], query["roman_progression"], df, sim_fn=fn).head(5)
    print(f"── Top 5 with {fn_name} ──")
    print(top5[["snippet_id", "song_title", "section", "key_score", "progression_score", "total_score"]].to_string(index=False))
    print()


Query: [TS_887] all too well (taylor's version) — Chorus
  Key: C  |  Progression: ['I', 'V', 'vi', 'IV']

── Top 5 with LCS ──
snippet_id                      song_title section  key_score  progression_score  total_score
    TS_913                 cornelia street   Outro        1.0                1.0          1.0
    TS_902              champagne problems  Chorus        1.0                1.0          1.0
    TS_884 all too well (taylor's version)   Intro        1.0                1.0          1.0
    TS_885 all too well (taylor's version)   Verse        1.0                1.0          1.0
    TS_887 all too well (taylor's version)  Chorus        1.0                1.0          1.0

── Top 5 with Substring ──
snippet_id                      song_title    section  key_score  progression_score  total_score
    TS_887 all too well (taylor's version)     Chorus        1.0                1.0          1.0
    TS_902              champagne problems     Chorus        1.0                1.0   