## BERT prompt-to-playlist generator

In this section, we use the transformer BERT (all-mpnet-base-v2) to encode tracks and playlists as embeddings, as well as the given user prompt.  
Using FAISS similarity scores, we generate a new playlist with k tracks which are semantically close to the user prompt.

In [25]:
import pandas as pd 
import numpy as np 
import faiss 
from sentence_transformers import SentenceTransformer 
from tqdm import tqdm 

#### Prompt input (ex. "UK alternative rock 80s") and size of created playlist (ex. 30)

Here we define a list of prompts to generate playlists. Usually these would contain at least the genre or an indicative period of time.  

In [77]:
#prompt = "UK alternative rock 80s"
prompts = [
    "Iconic rock 80s", 
    "Summer vibe 2010s", 
    "Girl power",
    "study vibe", 
    "songs to scream to", #a friend's recommandation
    "Kendrick Lamar"
]

playlist_size = 30

features = [
    "popularity","genre","danceability","energy","key","loudness","mode",
    "speechiness","acousticness","instrumentalness","liveness","valence","tempo",
    "duration_ms","time_signature"
]

Importing tracks data

In [None]:
path = "./data.csv"
df = pd.read_csv(path) 

#deletes invalid tracks, we need a minimum of info
df = df.dropna(subset=["track_name", "artist_name"])  

#fill NaNs with "unknown" for genres and -1 for years
df["genre"] = df["genre"].fillna("unknown")
df["year"] = df["year"].fillna(-1).astype(int)

print(f'Loaded {len(df)} tracks')

Loaded 171038 tracks


  df = pd.read_csv(path)


### Tracks as NL descriptions

To let BERT embed tracks in a meaningful way, each track in our dataset is converted to a NL description containing basic metadata, audio features and playlist context (name + position).

*ex: 'Keeper of the Flame' by Miranda Lambert. From album: The Weight of These Wings. Released in 2016. [...] From playlist: Miranda Lambert.* 

This allows BERT to capture semantic similarity between tracks. 

In [28]:
#fill other NaNs with unknown to make sure this is understandable for BERT
def safe(x, default="unknown"): 
    return x if pd.notna(x) else default 

def track_to_text(row): 
    """
    Converts each row (track) of the dataset into a rich natural language 
    descritption which BERT can use efficiently and embed meaningfully.
    """
    #
    features_text = ", ".join(
        f"{col} {safe(row[col])}" for col in features if col in row
    )
    return (
        f"{safe(row['track_name'])} by {safe(row['artist_name'])}. "
        f"From album: {safe(row['album_name'])}. "
        f"Released in {safe(row['year'])}. "
        f"Position in playlist: {safe(row['position'])}. "
        f"{features_text}. "
        f"From playlist: {safe(row['playlist_name'])}."
    )

df["text_rpz"] = df.apply(track_to_text, axis=1) 
texts = df["text_rpz"].tolist() 

Creating tracks embeddings with BERT

In [None]:
#loading BERT model (17 mins)
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

#encoding tracks
#track_embeddings = model.encode(texts, normalize_embeddings=True, show_progress_bar=True)
#track_embeddings = track_embeddings.astype(np.float32)

#np.save("track_embeddings.npy", track_embeddings)

track_embeddings = np.load("./embeddings/track_embeddings.npy")

dim = track_embeddings.shape[1]
print(f"Track embeddings shape: {track_embeddings.shape}")

Track embeddings shape: (171038, 768)


### Building FAISS index

FAISS (Facebook AI Similarity Search) is an optimized library used for similarity search between vectors. In this section of our project, each track is represented as a BERT generated embedding, and FAISS finds the tracks nearest to a prompt very fast. 

- faiss.IndexFlatIP(dim) is an exact index of size dim based on the Inner Product. Since our embeddings are normalised (*normalise_embeddings=True*), this inner product is the cosine similarity. 

- index.add keeps those vectors in the optimised index structure.

- index.search(query, k) (used later) returns two objects: **I**, the k nearest vectors (tracks), and **D**, the similarity score.  

Documentation: https://faiss.ai/index.html

In [30]:
index = faiss.IndexFlatIP(dim) #cosine similarity
index.add(track_embeddings)

### Playlist embeddings

Since our dataset is playlist-based, and not just random tracks that BERT and FAISS would correlate, we want to encode the information the playlists give us. This could help decide which song would fit better in a new generated playlist, based on whether they are already associated in existing playlists.

To do so, we encode the playlists as the mean of the tracks embeddings. 
This provides a signal of track co-occurence, improving the relevance of our generated playlists. 

In [None]:
#playlist_groups = df.groupby("playlist_id")
#playlist_embeddings = {}
#for pname, group in tqdm(playlist_groups, desc="Encoding playlist embeddings"):
#    emb = model.encode(group["text_rpz"].tolist(), normalize_embeddings=True)
#    playlist_embeddings[pname] = np.mean(emb, axis=0)  # mean of tracks

#np.save("playlist_embeddings.npy", playlist_embeddings)

playlist_embeddings = np.load("./embeddings/playlist_embeddings.npy", allow_pickle=True)

### Generating new playlist

To generate a new playlist, we first encode the prompt as a BERT embedding.

Then we combine the two types of information: the similarity between the user prompt and the tracks, and the correlation of the tracks based on the playlists in which they appear.
We then mix these scores with a weight alpha.

When using FAISS's index.search(), notice we use k\*3. This is because we are filtering by playlists **after**, and we would not want to miss relevant tracks. We thus have $3 \times k$ candidate tracks and we will choose the k most relevant. 



In [None]:
def generate_playlist(prompt: str, k: int, alpha=0.7): 
    """
    Given a natural language prompt, return the k most similar tracks.

    alpha: weight btw prompt similarity & playlist correlation 
    """
    prompt_embedding = model.encode(prompt, normalize_embeddings=True).astype(np.float32) 
    
    #faiss search
    D, I = index.search(prompt_embedding.reshape(1, -1), k*3)
    candidate_tracks = df.iloc[I[0]].copy()

    #normalized faiss scores D
    D = (D - D.min()) / (D.max() - D.min() + 1e-9) #to avoid errors with 0
    candidate_tracks["faiss_score"] = D[0]

    #playlist correlation score (for each track)
    def correlation_score_playlist(track_row):
        score = 0.0
        #we list the playlists in which the track appears
        for pid in df[df["track_id"]==track_row["track_id"]]["playlist_id"].unique():
            if pid in playlist_embeddings:
                #we add a score for each time the track appears in a playlist, 
                #based on a similarity score between the user prompt and the playlist embedding
                score += np.dot(prompt_embedding, playlist_embeddings[pid])
        return score

    #playlist correlation score
    candidate_tracks["playlist_corr"] = candidate_tracks.apply(correlation_score_playlist, axis=1)
    #normalize playlist correlation scores to avoid bias 
    pc = candidate_tracks["playlist_corr"].values
    pc_norm = (pc - pc.min()) / (pc.max() - pc.min() + 1e-9)
    candidate_tracks["playlist_corr"] = pc_norm

    #combined with prompt similarity score, with weight alpha for prompt (ie. 1-alpha for playlist)
    candidate_tracks["combined_score"] = alpha*candidate_tracks["faiss_score"] + (1-alpha)*candidate_tracks["playlist_corr"]

    #sort & remove duplicates
    result = candidate_tracks.sort_values("combined_score", ascending=False)
    result = result.drop_duplicates(subset=["track_name","artist_name"]).head(k) #keep k best

    df_tracks = result[[
        "playlist_name", "track_name","artist_name", "year", "genre", "album_name"
    ]]

    track_ids = result["track_id"].tolist()

    return df_tracks, track_ids

Use case example: 

In [78]:
playlists = [] 

for prompt in prompts: 
    playlist, _ = generate_playlist(prompt, playlist_size)
    playlists.append( playlist )

print("\ngenerated playlists:")
for i, pl in enumerate(playlists):
    print(f"\nPrompt: {prompts[i]}")
    print(pl.to_string(index=False)) 


generated playlists:

Prompt: Iconic rock 80s
                          playlist_name                track_name artist_name  year genre album_name
             80s Hits - Best of the 80s            Crocodile Rock                -1   pop           
     Classic Rock Songs 60s 70s 80s 90s                  Rockstar                -1  rock           
             80s Hits - Best of the 80s            I Remember You                -1  rock           
             80s Hits - Best of the 80s                   Running   Retrofile    -1  rock           
     Classic Rock Songs 60s 70s 80s 90s                Photograph                -1  rock           
             80s Hits - Best of the 80s      Jump - 2015 Remaster                -1  rock           
     Classic Rock Songs 60s 70s 80s 90s    1979 - Remastered 2012                -1  rock           
     Classic Rock Songs 60s 70s 80s 90s  Rockstar - 2020 Remaster                -1  rock           
         Rock Classics (80s, 90s 2000s)     

## Evaluation methods 

It is very hard to evaluate a recommendation system accurately, especially ours, since there is no "true" perfectly generated playlist. We have no ground-truth for subjective similarity to evaluate our performance.

What we can do is evaluate playlist **reconstruction**, using several metrics.   
Source: https://www.evidentlyai.com/ranking-metrics/evaluating-recommender-systems

### 1. Recall@k 
 
$$ Recall@k = \frac{\text{nb of 'real' tracks found in the k returned tracks}}{\text{total nb of real tracks}}$$
Since we generate a list with no specific order, this gives us a simple and robust objective measure of the playlist reconstruction.

In [34]:
# garder uniquement les playlist_id qui sont des entiers
df = df[df["playlist_id"].apply(lambda x: str(x).isdigit())]
df["playlist_id"] = df["playlist_id"].astype(int)


#filtering playlists with between 20-40 tracks
#this allows to evaluate with a minimum of material and consistency
playlist_sizes = df.groupby("playlist_id")["track_id"].count()
valid_pids = playlist_sizes[(playlist_sizes >= 20) & (playlist_sizes <= 40)].index.tolist()

print(f"{len(valid_pids)} retained playlists for evaluation (20-40 tracks)")

np.random.seed(42) 
selected_playlists = np.random.choice(valid_pids, size=min(10, len(valid_pids)), replace=False)

#returns a list of the predicted track_ids (generated playlist), 
#using the playlist name as a prompt for evaluating purposes
def predict_playlist_from_id(pid, alpha=0.7):
    playlist_df = df[df["playlist_id"] == pid]

    if playlist_df.empty:
        return [], None

    prompt = playlist_df["playlist_name"].iloc[0]
    k = len(playlist_df)

    df_playlist, ids = generate_playlist(prompt, k=k, alpha=alpha)

    return df_playlist, ids

1403 retained playlists for evaluation (20-40 tracks)


In [35]:
def recall_at_k(true_ids, pred_ids, k):
    true_set = set(true_ids)
    pred_set = set(pred_ids[:k])
    return len(true_set & pred_set) / len(true_set)

### Additional Metrics

To obtain a more complete evaluation, we compute:

**Precision@k**: How many of the returned tracks were correct? 

**MAP@k**: Rewars correct predictions at earlier ranks

**nDCG@k**: Measures ranking quality with positional discounting. 

These metrics give a more nuanced analysis of playlist reconstruction quality.

In [36]:
def precision_at_k(recommended, ground_truth, k):
    recommended_k = recommended[:k]
    hits = sum([1 for r in recommended_k if r in ground_truth])
    return hits / k


def mean_average_precision_at_k(recommended, ground_truth, k):
    recommended_k = recommended[:k]
    score = 0.0
    hits = 0
    
    for i, r in enumerate(recommended_k, start=1):
        if r in ground_truth:
            hits += 1
            score += hits / i
    
    if len(ground_truth) == 0:
        return 0.0

    return score / min(len(ground_truth), k)


def ndcg_at_k(recommended, ground_truth, k):
    recommended_k = recommended[:k]
    dcg = 0.0
    
    for i, r in enumerate(recommended_k, start=1):
        if r in ground_truth:
            dcg += 1 / np.log2(i + 1)

    #ideal dcg
    ideal_hits = min(len(ground_truth), k)
    idcg = sum([1 / np.log2(i + 1) for i in range(1, ideal_hits + 1)])

    return dcg / idcg if idcg > 0 else 0.0


#### Full evaluation loop

In [37]:

results = []

for pid in tqdm(valid_pids, desc="Evaluating model metrics"):

    playlist_df = df[df["playlist_id"] == pid]

    if playlist_df.empty:
        continue

    true_ids = playlist_df["track_id"].tolist()
    k = len(true_ids)

    _, pred_ids = predict_playlist_from_id(pid, alpha=0.7)

    res = {
        "playlist_id": pid,
        "playlist_name": playlist_df["playlist_name"].iloc[0],
        "num_tracks": k,
        "recall@k": recall_at_k(true_ids, pred_ids, k),
        "precision@k": precision_at_k(pred_ids, true_ids, k),
        "map@k": mean_average_precision_at_k(pred_ids, true_ids, k),
        "ndcg@k": ndcg_at_k(pred_ids, true_ids, k)
    }

    results.append(res)


#displaying some results
for r in results[0:10]:
    print("\n" + "="*70)
    print(f"Playlist: {r['playlist_name']} (id={r['playlist_id']}, {r['num_tracks']} tracks)")
    print(f"Recall@k:    {r['recall@k']:.3f}")
    print(f"Precision@k: {r['precision@k']:.3f}")
    print(f"MAP@k:       {r['map@k']:.3f}")
    print(f"NDCG@k:      {r['ndcg@k']:.3f}")


Evaluating model metrics: 100%|â–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆ| 1403/1403 [30:16<00:00,  1.29s/it] 


Playlist: summer 2k17 (id=1043, 34 tracks)
Recall@k:    0.000
Precision@k: 0.000
MAP@k:       0.000
NDCG@k:      0.000

Playlist: The Glitch Mob (id=1192, 25 tracks)
Recall@k:    0.240
Precision@k: 0.240
MAP@k:       0.086
NDCG@k:      0.237

Playlist: country (id=1240, 32 tracks)
Recall@k:    0.000
Precision@k: 0.000
MAP@k:       0.000
NDCG@k:      0.000

Playlist: country (id=1915, 23 tracks)
Recall@k:    0.000
Precision@k: 0.000
MAP@k:       0.000
NDCG@k:      0.000

Playlist: starbucks  (id=2976, 26 tracks)
Recall@k:    0.423
Precision@k: 0.423
MAP@k:       0.171
NDCG@k:      0.356

Playlist: Country (id=3029, 23 tracks)
Recall@k:    0.000
Precision@k: 0.000
MAP@k:       0.000
NDCG@k:      0.000

Playlist: best of country (id=4181, 31 tracks)
Recall@k:    0.032
Precision@k: 0.032
MAP@k:       0.002
NDCG@k:      0.026

Playlist: Country (id=5005, 40 tracks)
Recall@k:    0.025
Precision@k: 0.025
MAP@k:       0.004
NDCG@k:      0.032

Playlist: ignite (id=5193, 32 tracks)
Recall@k:  




#### Global average scores

In [38]:
print("Average performance over tested playlists:\n")

print("Recall@k:    ", np.mean([r["recall@k"] for r in results]))
print("Precision@k: ", np.mean([r["precision@k"] for r in results]))
print("MAP@k:       ", np.mean([r["map@k"] for r in results]))
print("NDCG@k:      ", np.mean([r["ndcg@k"] for r in results]))

Average performance over tested playlists:

Recall@k:     0.060265029668296756
Precision@k:  0.059304849562614975
MAP@k:        0.02726536850022075
NDCG@k:       0.06647384816894805


### 2. LLM-as-a-judge

Beyond quantitative metrics, we can ask an LLM to evaluate thematic coherence, genre consistency, mood alignement and subjective musical similarity.   
It provides a complementary qualitative angle. 

We follow this example: https://towardsdatascience.com/llm-as-a-judge-what-it-is-why-it-works-and-how-to-use-it-to-evaluate-ai-models/  
With the framework found on github: https://github.com/PieroPaialungaAI/LLMAsAJudge/tree/main

In [None]:
from openai import OpenAI
import os
from llm_judge import LLMJudge 

os.environ["OPENAI_API_KEY"] = "secret_key" #im hiding this

client = OpenAI()

judge_role = """You are an expert evaluator of music playlist recommendation systems. 
You have 10 years of experience in playlist generation systems and understand the wide 
choice of possible relevant songs. You understand that the dataset may contain missing 
or unknown metadata (notably values marked as -1 or 'unknown'), and that tracks span 
from 2000 to 2023. Ignore missing metadata like 'unknown' genre or -1 year. Focus on 
whether the tracks match the style, mood, and era indicated by the user prompt or 
original playlist name. Your job is to assess whether the generated playlists 
are accurate and appropriate given the user prompt."""

# Define the evaluation task
evaluation_task = """For each generated playlist, you must:
1. Read the user prompt and understand its expectations
2. Examine the tracks in the generated playlist
3. Determine if the playlist is coherent with regard to the user prompt
4. Provide a quality score (0-100), verdict, and detailed reasoning"""

# Define evaluation criteria
evaluation_criteria = """
A generated playlist is excellent if:
- The genres of the tracks are coherent with the prompt
- The mood and vibe are coherent 
- There is a temporal and style coherence (if the prompt mentions era or style) 
- There is some diversity of artists, albums, energy...
- It is appropriate overall"""

# Valid verdicts
verdicts = ["Excellent", "Good", "Average", "Poor"]

In [40]:
playlist_judge = LLMJudge(
    llm_client= client, 
    role= judge_role, 
    task_description= evaluation_task, 
    evaluation_criteria= evaluation_criteria, 
    valid_verdicts= verdicts, 
    model_name= "gpt-4o-mini", 
    temperature= 0.2
)

In [94]:
#we need a function to create the playlist object we will give to the llm 
def format_playlist_llm(prompt, playlist):  
    
    def safe_year(year): 
        return "unknown" if year == -1 else year 

    text = f"Given prompt: {prompt}\n"
    text += f"Number of tracks: {len(playlist)}\n\n"

    text += "Tracks:\n"

    for i, row in playlist.iterrows():
        track_name = (row.get("track_name"))
        artists = (row.get("artist_name"))
        genre = (row.get("genre"))
        year = (row.get("year"))
        original_playlist = (row.get("playlist_name"))

        track_line = f"- '{track_name}' by {artists}"
        track_line += f" â€” Genre: {genre}"
        track_line += f" year: {safe_year(year)}"
        track_line += f" fetched from playlist: {original_playlist}"

        text += track_line + "\n"

    text += (
        "\nNotes: Some metadata may be missing or labelled as 'unknown'. "
        "This is normal for this dataset and should NOT influence quality judgment.\n"
    )

    return {"input": prompt, "model_output": text}

In [95]:
generated_playlists = [] 
for prompt in prompts: 
    df_generated, _ = generate_playlist(prompt, k=30)
    formatted = format_playlist_llm(prompt, df_generated)

    generated_playlists.append( formatted ) 


In [96]:
print("=" *50)
print("Judging generated playlists")
print("=" *50)

for i, pred in enumerate(generated_playlists, 1): 
    print(f"\n### Case {i} ###")
    print(f"Prompt: {pred['input']}")
    print(f"Generated playlist: {pred['model_output']}")


    judgment = playlist_judge.judge_single(
        input_text=pred['input'],
        model_output=pred['model_output']
    )

    print(f"\n{'Judge Verdict:':<20} {judgment.verdict}")
    print(f"{'Quality Score:':<20} {judgment.score}/100")
    print(f"{'Confidence:':<20} {judgment.confidence}%")
    print(f"\nReasoning: {judgment.reasoning}")
    
    if judgment.notes:
        print(f"Notes: {judgment.notes}")
    
    print("\n" + "=" * 100)

Judging generated playlists

### Case 1 ###
Prompt: Iconic rock 80s
Generated playlist: Given prompt: Iconic rock 80s
Number of tracks: 30

Tracks:
- 'Crocodile Rock' by   â€” Genre: pop year: unknown fetched from playlist: 80s Hits - Best of the 80s
- 'Rockstar' by   â€” Genre: rock year: unknown fetched from playlist: Classic Rock Songs 60s 70s 80s 90s
- 'I Remember You' by   â€” Genre: rock year: unknown fetched from playlist: 80s Hits - Best of the 80s
- 'Running' by Retrofile â€” Genre: rock year: unknown fetched from playlist: 80s Hits - Best of the 80s
- 'Photograph' by   â€” Genre: rock year: unknown fetched from playlist: Classic Rock Songs 60s 70s 80s 90s
- 'Jump - 2015 Remaster' by   â€” Genre: rock year: unknown fetched from playlist: 80s Hits - Best of the 80s
- '1979 - Remastered 2012' by   â€” Genre: rock year: unknown fetched from playlist: Classic Rock Songs 60s 70s 80s 90s
- 'Rockstar - 2020 Remaster' by   â€” Genre: rock year: unknown fetched from playlist: Classic R