## Deep Learning Model Analysis, Discussion, and Conclusions

### Models Used

For this project, I explored the task of analyzing and comparing songs based on their lyrical and metadata similarities. The goal was to identify which deep learning-based sentence embedding models best capture the overall "vibe" of a song, measured through a custom similarity function. I evaluated the following pre-trained transformer-based models from the `sentence-transformers` library:

- `all-MiniLM-L12-v2`
- `all-mpnet-base-v2`
- `sentence-t5-base`

These models are widely used for sentence-level semantic similarity tasks and were chosen for their balance of performance and efficiency.

---

### Custom Similarity Function

A novel aspect of the analysis was the development of a **custom similarity metric** that incorporates both deep semantic embeddings and structured metadata. The final similarity score between two songs was computed as a weighted average of:

- Cosine similarity of **lyrics**
- Cosine similarity of **titles**
- Genre match
- Artist match
- Year similarity (modeled with an exponential decay)
- View-based popularity score
- Feature overlap (shared entries in the `clean_features` column, i.e., featured artists)

This allowed the system to go beyond basic keyword or text similarity and consider a richer context around each song.
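
Written out, the score is a weighted sum of the seven components. The weights and component definitions below are taken from the default values in the `custom_similarity` implementation later in this notebook; the symbols are notation introduced here for illustration only.

$$
S(a, b) = 0.4\, s_{\text{lyrics}} + 0.1\, s_{\text{title}} + 0.1\, m_{\text{genre}} + 0.1\, m_{\text{artist}} + 0.1\, s_{\text{year}} + 0.1\, s_{\text{views}} + 0.1\, s_{\text{feat}}
$$

$$
s_{\text{year}} = \exp\!\left(-\frac{|y_a - y_b|}{10}\right), \qquad
s_{\text{views}} = \frac{\min(v_a, v_b)}{\max(v_a, v_b, 1)}, \qquad
s_{\text{feat}} = \frac{|F_a \cap F_b|}{\max(|F_a|, |F_b|, 1)}
$$

Here $s_{\text{lyrics}}$ and $s_{\text{title}}$ are cosine similarities, and $m_{\text{genre}}$ and $m_{\text{artist}}$ are 1 on an exact match and 0 otherwise. The weights sum to 1, and two songs released five years apart receive $s_{\text{year}} = e^{-0.5} \approx 0.61$. Because cosine similarity can be negative, $S$ lies in $[-0.5, 1]$ rather than $[0, 1]$.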

---

### Experimental Setup

To evaluate each model’s ability to identify semantically and contextually similar songs, I randomly generated 500 pairs of songs over 5 rounds (100 per round). For each pair, the similarity score was computed using the custom function, and average scores were recorded per model.

The evaluation was not supervised in a traditional classification or regression sense due to the lack of labeled ground truth for “similarity,” but it provided a comparative insight into the stability and sensitivity of each model in a real-world, unstructured use case.
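
The pair generation in this notebook uses `random.sample` without a fixed seed, so the recorded numbers will vary somewhat between runs. A minimal sketch of a reproducible variant is shown below; the helper `sample_index_pairs` and the choice of one seed per round are assumptions for illustration, not part of the original code.

```python
import random

def sample_index_pairs(n_items, n_pairs, seed):
    """Return n_pairs tuples (i, j) with i != j, drawn with a seeded RNG."""
    rng = random.Random(seed)  # local generator; the global random state is untouched
    return [tuple(rng.sample(range(n_items), 2)) for _ in range(n_pairs)]

# Five rounds of 100 pairs each, one seed per round (seed values are arbitrary).
rounds = [sample_index_pairs(n_items=100_000, n_pairs=100, seed=r) for r in range(5)]
total_pairs = sum(len(r) for r in rounds)
print(total_pairs)  # 500
```

The returned indices could then be looked up with `df.loc[i]` and `df.loc[j]`, as the existing `generate_song_pairs` function does.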

---

### Results Summary

| Model               | Mean Similarity | Std Deviation | Total Pairs |
|---------------------|----------------:|--------------:|------------:|
| `all-MiniLM-L12-v2` |          0.2994 |        0.0939 |         500 |
| `all-mpnet-base-v2` |          0.3371 |        0.1042 |         500 |
| `sentence-t5-base`  |          0.5908 |        0.0808 |         500 |


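From these results, `sentence-t5-base` produced a markedly higher average similarity score (0.59, versus 0.30 and 0.34) and the lowest standard deviation, suggesting it was more consistent in its representation of song content and metadata alignment. Within each round all models were scored on the same pairs, so the metadata terms are identical across models and the gap comes entirely from the lyric and title embedding terms.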
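
The mean and standard deviation in the table are computed over all valid scores pooled across rounds. The sketch below shows that aggregation with the standard library only; `summarize` is a hypothetical helper, and the toy scores are made up for illustration rather than taken from the experiment.

```python
import statistics

def summarize(scores):
    """Mean and population standard deviation, skipping failed (None) comparisons."""
    valid = [s for s in scores if s is not None]
    return {
        "mean": statistics.fmean(valid),
        "std": statistics.pstdev(valid),  # population std, like numpy.std with its default ddof=0
        "n": len(valid),
    }

# Toy scores for illustration only; these are not values from the experiment.
toy_scores = [0.25, 0.35, None, 0.30, 0.30]
summary = summarize(toy_scores)
print(summary)
```

Note that `np.std`, as used in the notebook, reports the population standard deviation; passing `ddof=1` would give the sample estimate instead.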

---

### Search Engine Demo (Bonus)

In addition to the similarity analysis, I built a **console-based song search engine** that takes a user-defined phrase (or “vibe”) as input and returns the top 10 most relevant songs based on FAISS-powered semantic search. This component uses the precomputed song embeddings to efficiently return high-quality results in real time.

To keep the project lightweight and ensure faster experimentation, I limited the dataset to **100,000 songs** sampled from the full corpus. For the search engine backend, I selected the **`all-MiniLM-L12-v2`** model. Although this model did not perform as strongly in the evaluation compared to alternatives like `sentence-t5-base`, it offered **significantly faster encoding times and lower memory usage**, making it ideal for prototyping and real-time search applications.


**Sample use cases:**

- Entering `heartbreak` retrieves emotional, slow-paced tracks.
- Entering `hype` brings back upbeat, high-energy songs.

This demo helps demonstrate the practical value of the embeddings and similarity models used in the main analysis.
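
The search code in this notebook L2-normalizes the embeddings and then queries an inner-product index. This works because the inner product of two unit vectors equals their cosine similarity. The pure-Python sketch below reproduces the same ranking by brute force on toy 2-D vectors; the function names and data are hypothetical and stand in for the FAISS calls.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def top_k_by_cosine(query, vectors, k):
    """Return the k (index, score) pairs with the highest cosine similarity."""
    q = l2_normalize(query)
    scored = []
    for idx, vec in enumerate(vectors):
        v = l2_normalize(vec)
        # Inner product of unit vectors is the cosine similarity
        scored.append((sum(a * b for a, b in zip(q, v)), idx))
    scored.sort(reverse=True)
    return [(idx, score) for score, idx in scored[:k]]

toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = top_k_by_cosine([2.0, 0.0], toy_vectors, k=2)
print(hits)
```

On the toy data the first vector ranks highest (it points in the same direction as the query), followed by the diagonal vector.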

---

### Discussion & Insights

- **Model Strengths**: All three models produced usable scores, but `sentence-t5-base` gave a markedly higher mean (0.59, versus 0.30 and 0.34) with the lowest standard deviation. The cause was not investigated; it may stem from the model's T5-based encoder and training data rather than from model size alone.
- **Limitations**: The custom similarity function involves multiple handcrafted weightings, which could be further optimized with labeled data or learned end-to-end. Also, there's no objective “ground truth” for similarity without user studies or annotations.
- **Hyperparameter Considerations**: While deep learning models were used as-is, the primary tuning effort involved adjusting weights in the similarity function to reflect reasonable assumptions (e.g., lyrics > titles).
- **Scalability**: FAISS kept search fast on the 100,000-song sample, and embedding computation benefits from GPU acceleration. The index used here (`IndexFlatIP`) performs exact search, so query cost grows linearly with the number of songs.
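
When experimenting with the handcrafted weights, it helps to keep them summing to 1 so that scores stay comparable between settings. The sketch below is one possible helper for that; `normalize_weights` and the example values are hypothetical and were not part of the tuning described above.

```python
def normalize_weights(weights):
    """Rescale non-negative weights so they sum to 1."""
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {name: w / total for name, w in weights.items()}

# Example: emphasise lyrics more heavily (illustrative values, not tuned).
raw = {'lyric_sim': 6, 'title_sim': 1, 'genre_match': 1, 'artist_match': 1,
       'year_score': 1, 'view_score': 1, 'feature_match': 1}
weights = normalize_weights(raw)
print(weights['lyric_sim'])  # 0.5
```

The resulting dictionary uses the same keys as the defaults, so it can be passed as the `weights` argument of `custom_similarity`.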

---

In [1]:
import pandas as pd

# Read the cleaned dataset from the Kaggle input directory, then sample 100,000 rows
df_clean = pd.read_csv('/kaggle/input/df-clean/df_clean_output.csv')
df_clean = df_clean.sample(n=100000, random_state=42).reset_index(drop=True)

In [2]:
import numpy as np
import pandas as pd
import torch
from sentence_transformers import SentenceTransformer
import random

# ----------- 1. Cosine similarity helper -----------
def cosine_sim(model, text1, text2):
    # show_progress_bar=False is enough to silence the progress bar
    emb1 = model.encode(text1, convert_to_tensor=True, show_progress_bar=False)
    emb2 = model.encode(text2, convert_to_tensor=True, show_progress_bar=False)
    return float(torch.nn.functional.cosine_similarity(emb1, emb2, dim=0))

# ----------- 2. Custom similarity function -----------
def custom_similarity(song1, song2, model=None, weights=None):
    if weights is None:
        weights = {
            'lyric_sim': 0.4,
            'title_sim': 0.1,
            'genre_match': 0.1,
            'artist_match': 0.1,
            'year_score': 0.1,
            'view_score': 0.1,
            'feature_match': 0.1
        }

    def safe_get(s, key, default=""):
        return s.get(key, default) if pd.notna(s.get(key, default)) else default

    lyric_sim = cosine_sim(model, str(safe_get(song1, 'clean_lyrics')), str(safe_get(song2, 'clean_lyrics')))
    title_sim = cosine_sim(model, str(safe_get(song1, 'title')), str(safe_get(song2, 'title')))
    genre_match = 1.0 if safe_get(song1, 'tag') == safe_get(song2, 'tag') else 0.0
    artist_match = 1.0 if safe_get(song1, 'artist') == safe_get(song2, 'artist') else 0.0

    try:
        year1 = int(float(safe_get(song1, 'year', 0)))
        year2 = int(float(safe_get(song2, 'year', 0)))
    except (TypeError, ValueError, OverflowError):
        year1, year2 = 0, 0
    year_diff = abs(year1 - year2)
    year_score = np.exp(-year_diff / 10.0)

    try:
        views1 = int(float(safe_get(song1, 'views', 1)))
        views2 = int(float(safe_get(song2, 'views', 1)))
    except (TypeError, ValueError, OverflowError):
        views1, views2 = 1, 1
    max_views = max(views1, views2, 1)
    view_score = min(views1, views2) / max_views

    def feature_set(song):
        # Split the comma-separated feature string, dropping empty tokens
        raw = song.get('clean_features')
        return {f.strip() for f in raw.split(",") if f.strip()} if isinstance(raw, str) else set()

    features1, features2 = feature_set(song1), feature_set(song2)
    feature_match = len(features1 & features2) / max(len(features1), len(features2), 1)

    final_score = (
        weights['lyric_sim'] * lyric_sim +
        weights['title_sim'] * title_sim +
        weights['genre_match'] * genre_match +
        weights['artist_match'] * artist_match +
        weights['year_score'] * year_score +
        weights['view_score'] * view_score +
        weights['feature_match'] * feature_match
    )
    return final_score

# ----------- 3. Load models -----------
model_names = ['all-MiniLM-L12-v2', 'all-mpnet-base-v2', 'sentence-t5-base']
models = {name: SentenceTransformer(f'sentence-transformers/{name}') for name in model_names}

# ----------- 4. Generate song pairs -----------
def generate_song_pairs(df, n_pairs=100):
    indices = list(df.index)
    pairs = []
    for _ in range(n_pairs):
        i, j = random.sample(indices, 2)
        pairs.append((df.loc[i], df.loc[j]))
    return pairs

# ----------- 5. Evaluate models on song pairs -----------
def evaluate_custom_similarity(models, df, n_pairs=100):
    song_pairs = generate_song_pairs(df, n_pairs)
    results = {}

    for name, model in models.items():
        print(f"Evaluating model: {name}")
        sim_scores = []
        for song1, song2 in song_pairs:
            try:
                score = custom_similarity(song1, song2, model=model)
                sim_scores.append(score)
            except Exception as e:
                print(f"Error comparing songs: {e}")
                sim_scores.append(None)
        results[name] = sim_scores
    return results

# ----------- 6. Multi-round Evaluation -----------
n_rounds = 5
n_pairs_per_round = 100
aggregated_scores = {name: [] for name in models.keys()}

for round_num in range(n_rounds):
    print(f"\n=== Round {round_num + 1} of {n_rounds} ===")
    round_results = evaluate_custom_similarity(models, df_clean, n_pairs=n_pairs_per_round)
    for name, scores in round_results.items():
        scores_filtered = [s for s in scores if s is not None]
        aggregated_scores[name].extend(scores_filtered)

# ----------- 7. Final Results -----------
print("\n===== Average Custom Similarity per Model (Across Rounds) =====")
for name, scores in aggregated_scores.items():
    mean_sim = np.mean(scores)
    std_sim = np.std(scores)
    print(f"{name}: Mean = {mean_sim:.4f} | Std = {std_sim:.4f} | Total Pairs = {len(scores)}")



=== Round 1 of 5 ===
Evaluating model: all-MiniLM-L12-v2
Evaluating model: all-mpnet-base-v2
Evaluating model: sentence-t5-base

=== Round 2 of 5 ===
Evaluating model: all-MiniLM-L12-v2
Evaluating model: all-mpnet-base-v2
Evaluating model: sentence-t5-base

=== Round 3 of 5 ===
Evaluating model: all-MiniLM-L12-v2
Evaluating model: all-mpnet-base-v2
Evaluating model: sentence-t5-base

=== Round 4 of 5 ===
Evaluating model: all-MiniLM-L12-v2
Evaluating model: all-mpnet-base-v2
Evaluating model: sentence-t5-base

=== Round 5 of 5 ===
Evaluating model: all-MiniLM-L12-v2
Evaluating model: all-mpnet-base-v2
Evaluating model: sentence-t5-base

===== Average Custom Similarity per Model (Across Rounds) =====
all-MiniLM-L12-v2: Mean = 0.2994 | Std = 0.0939 | Total Pairs = 500
all-mpnet-base-v2: Mean = 0.3371 | Std = 0.1042 | Total Pairs = 500
sentence-t5-base: Mean = 0.5908 | Std = 0.0808 | Total Pairs = 500


In [3]:
%pip install faiss-gpu-cu12

Collecting faiss-gpu-cu12
Installing collected packages: faiss-gpu-cu12
Successfully installed faiss-gpu-cu12-1.10.0
Note: you may need to restart the kernel to use updated packages.


In [4]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
import torch
import faiss

# ---------- 1. Load model and move to GPU if available ----------
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")
model = model.to(device)

# ---------- 2. Batch encode and combine lyric + title ----------
def precompute_embeddings(df_clean, model):
    lyrics = df_clean['clean_lyrics'].fillna("").astype(str).tolist()
    titles = df_clean['title'].fillna("").astype(str).tolist()

    # Batch encode (use GPU + batching)
    lyric_embeddings = model.encode(lyrics, batch_size=64, convert_to_tensor=True, device=device, show_progress_bar=True)
    title_embeddings = model.encode(titles, batch_size=64, convert_to_tensor=True, device=device, show_progress_bar=True)

    # Combine both (average)
    combined = (lyric_embeddings + title_embeddings) / 2
    return combined

print("Precomputing song embeddings...")
combined_embeddings_tensor = precompute_embeddings(df_clean, model)
combined_embeddings = combined_embeddings_tensor.cpu().numpy().astype('float32')

# ---------- 3. Build FAISS index ----------
print("Building FAISS index...")
faiss.normalize_L2(combined_embeddings)  # Important for cosine similarity
index = faiss.IndexFlatIP(combined_embeddings.shape[1])
index.add(combined_embeddings)


Precomputing song embeddings...
Building FAISS index...


In [None]:
# ---------- 4. Search function ----------
def search_songs(query, model, index, df_clean, top_k=10):
    query_emb = model.encode(query, convert_to_tensor=True, device=device)
    query_emb = query_emb.cpu().numpy().astype('float32').reshape(1, -1)
    faiss.normalize_L2(query_emb)

    D, I = index.search(query_emb, top_k)
    results = df_clean.iloc[I[0]].copy()
    results['score'] = D[0]
    return results

# ---------- 5. Console-based search (Kaggle-friendly) ----------
def song_search_console(df_clean):
    while True:
        query = input("\nEnter a vibe to search for (or type 'exit' to quit): ").strip()
        if query.lower() == 'exit':
            print("Goodbye!")
            break
        if query:
            print(f"\nSearching for songs similar to: '{query}'")
            top_matches = search_songs(query, model, index, df_clean)

            print("\nTop Matching Songs:")
            for _, row in top_matches.iterrows():
                print(f"\n**{row['title']}** by *{row['artist']}*")
                print(f"Score: {row['score']:.3f} | Year: {row['year']} | Genre: {row['tag']}")
                print("-" * 40)

# Run the interactive search loop
song_search_console(df_clean)


Enter a vibe to search for (or type 'exit' to quit):  motivation



Searching for songs similar to: 'motivation'

Top Matching Songs:

**Motivation** by *Ashland*
Score: 0.815 | Year: 2019 | Genre: pop
----------------------------------------

**Motivation** by *Rivets*
Score: 0.795 | Year: 2015 | Genre: pop
----------------------------------------

**Motivation** by *Dommyy*
Score: 0.779 | Year: 2021 | Genre: rap
----------------------------------------

**Motivation** by *DeQuince*
Score: 0.779 | Year: 2012 | Genre: rap
----------------------------------------

**Motivation Alternate Version** by *Dope*
Score: 0.722 | Year: 2005 | Genre: pop
----------------------------------------

**Motivation** by *Shad da God*
Score: 0.706 | Year: 2015 | Genre: rap
----------------------------------------

**Motivated** by *Sez Batters*
Score: 0.698 | Year: 2011 | Genre: rap
----------------------------------------

**A Little Motivation** by *T-Bruin*
Score: 0.646 | Year: 2013 | Genre: rap
----------------------------------------

**Greedy** by *Mia Stegner*
Score: 0.598 | Year: 2020 | Genre: pop
----------------------------------------

### Conclusion

This project delivers on the goal of building a **vibe-based song search engine**: a system that goes beyond traditional filters like artist or genre by matching the *meaning* and *emotion* behind both user queries and song lyrics.

By leveraging **transformer-based sentence embeddings**, I was able to map lyrics and search phrases into a shared semantic space, enabling matches based on themes and tone. The **custom similarity function** complemented this in the pairwise analysis by incorporating additional metadata such as genre, artist, views, year, and featured artists; the search demo itself ranks songs by embedding similarity alone.

To manage compute efficiently, I used a **sample of 100,000 rows** and the **`all-MiniLM-L12-v2`** model. While this model didn't achieve the highest similarity scores during evaluation, it encoded text significantly faster and allowed for rapid prototyping of the system. Models like **`sentence-t5-base`** produced higher similarity scores in the evaluation and could be integrated in future iterations, at the cost of slower encoding.

With access to **more compute**, I could scale this project further by:

- Processing the **entire dataset** of over 3 million songs  
- Switching to **higher-performing models** like `sentence-t5-base`  
- **Storing embeddings** in a persistent vector database for scalable, real-time search  

This work lays a solid foundation for expressive music discovery powered by modern NLP. 

**Future directions include:**

- Fine-tuning embedding models on music-specific data  
- Creating labeled training pairs for supervised similarity learning  
- Deploying the system as a **fully interactive web app** (e.g., via Streamlit or Gradio)
