<a href="https://colab.research.google.com/github/Soumya-Xd/AHPS_frontend/blob/main/spotify.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Remove conflicting versions (IMPORTANT)
!pip uninstall -y torch transformers huggingface_hub sentence-transformers

# Install compatible versions (STRICT)
!pip install torch==2.1.2 --no-cache-dir
!pip install huggingface_hub==0.19.4 --no-cache-dir
!pip install transformers==4.35.2 --no-cache-dir
!pip install sentence-transformers==2.2.2 --no-cache-dir



Found existing installation: transformers 4.35.2
Uninstalling transformers-4.35.2:
  Successfully uninstalled transformers-4.35.2
Found existing installation: huggingface-hub 0.17.3
Uninstalling huggingface-hub-0.17.3:
  Successfully uninstalled huggingface-hub-0.17.3
[0m[31mERROR: Could not find a version that satisfies the requirement torch==2.1.2 (from versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.4.0, 2.4.1, 2.5.0, 2.5.1, 2.6.0, 2.7.0, 2.7.1, 2.8.0, 2.9.0, 2.9.1)[0m[31m
[0m[31mERROR: No matching distribution found for torch==2.1.2[0m[31m
[0mCollecting huggingface_hub==0.19.4
  Downloading huggingface_hub-0.19.4-py3-none-any.whl.metadata (14 kB)
Downloading huggingface_hub-0.19.4-py3-none-any.whl (311 kB)
[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m311.7/311.7 kB[0m [31m67.8 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: huggingface_hub
[31mERROR: pi
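
The failed pin above is easier to diagnose when the versions actually present in the runtime are listed first. A minimal sketch, assuming only the standard library; the helper name `installed_versions` and the `PACKAGES` list are hypothetical and not part of the notebook:

```python
from importlib.metadata import PackageNotFoundError, version

# Packages touched by the install cell (assumed list, taken from the pip commands above).
PACKAGES = ["torch", "transformers", "huggingface_hub", "sentence-transformers"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```

A `None` entry means the package is absent, which is the state the uninstall step leaves behind when a later install fails.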

In [1]:
# Remove conflicting versions first. This matters most for local installs; in Colab it may be unnecessary when re-running.
# The install commands sit at the top of the cell so the dependencies are in place before the imports below run.
!pip uninstall -y torch transformers huggingface_hub sentence-transformers accelerate

# Install compatible versions
# Torch 2.1.2 was not found; 2.9.0 was installed by sentence-transformers later, so we explicitly pin to 2.9.0.
!pip install torch==2.9.0 --no-cache-dir

# huggingface_hub needs to be >= 0.21.0 for accelerate (a dependency of transformers).
# transformers 4.35.2 is compatible with huggingface_hub >= 0.16.4 and < 1.0, so 0.21.0 fits.
!pip install huggingface_hub==0.21.0 --no-cache-dir

# Install transformers 4.35.2
!pip install transformers==4.35.2 --no-cache-dir

# Install sentence-transformers 2.2.2
!pip install sentence-transformers==2.2.2 --no-cache-dir

import numpy as np
import pandas as pd
import re
import pickle
import warnings
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

warnings.filterwarnings("ignore")


class SpotifyLyricSearch:
    """
    Spotify Lyric Search using Sentence-BERT (Semantic Similarity)
    """

    def __init__(self):
        print("Loading Sentence-BERT model...")
        self.model = SentenceTransformer("all-MiniLM-L6-v2")
        self.df = None
        self.embeddings = None

    # -----------------------------
    # Text Preprocessing
    # -----------------------------
    def preprocess_text(self, text):
        if pd.isna(text):
            return ""

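        text = str(text).lower()
        # Drop URLs; "http\S+" already covers "https" links.
        text = re.sub(r"http\S+|www\S+", "", text)
        # Keep letters and whitespace only (digits and punctuation are removed).
        text = re.sub(r"[^a-zA-Z\s]", "", text)
        # Collapse runs of whitespace into single spaces.
        text = re.sub(r"\s+", " ", text).strip()
        return text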

    # -----------------------------
    # Load & Prepare Dataset
    # -----------------------------
    def load_data(self, csv_path, sample_size=None):
        print("\nLoading dataset...")
        df = pd.read_csv(csv_path)

        print(f"Original shape: {df.shape}")
        print(f"Columns: {df.columns.tolist()}")

        df = df.dropna(subset=["artist", "song", "text"])

        if sample_size and sample_size < len(df):
            df = df.sample(sample_size, random_state=42)
            print(f"Sampled {sample_size} rows")

        df["processed_text"] = df["text"].apply(self.preprocess_text)
        df = df[df["processed_text"].str.len() > 20]

        df = df.reset_index(drop=True)

        self.df = df

        print(f"Final dataset shape: {df.shape}")
        print(f"Unique songs: {df[['song','artist']].drop_duplicates().shape[0]}")

    # -----------------------------
    # Encode Lyrics
    # -----------------------------
    def encode_lyrics(self):
        print("\nEncoding lyrics using Sentence-BERT...")
        self.embeddings = self.model.encode(
            self.df["processed_text"].tolist(),
            show_progress_bar=True
        )
        print("Embeddings shape:", self.embeddings.shape)

    # -----------------------------
    # Search Function
    # -----------------------------
    def search(self, lyric_snippet, top_k=5):
        if self.embeddings is None:
            raise RuntimeError("Embeddings not generated. Call encode_lyrics() first.")

        query = self.preprocess_text(lyric_snippet)
        query_embedding = self.model.encode([query])

        scores = cosine_similarity(query_embedding, self.embeddings)[0]
        top_indices = scores.argsort()[-top_k:][::-1]

        results = []
        for idx in top_indices:
            results.append({
                "song": self.df.iloc[idx]["song"],
                "artist": self.df.iloc[idx]["artist"],
                # Convert the float32 score to a Python float so the rounded value prints cleanly.
                "confidence": round(float(scores[idx]) * 100, 2)
            })

        return results

    # -----------------------------
    # Save & Load
    # -----------------------------
    def save(self, path="spotify_lyric_embeddings.pkl"):
        with open(path, "wb") as f:
            pickle.dump({
                "df": self.df,
                "embeddings": self.embeddings
            }, f)
        print(f"Saved embeddings to {path}")

    def load(self, path="spotify_lyric_embeddings.pkl"):
        with open(path, "rb") as f:
            data = pickle.load(f)
            self.df = data["df"]
            self.embeddings = data["embeddings"]
        print("Embeddings loaded successfully!")


# ======================================
# MAIN EXECUTION
# ======================================
def main():
    print("=" * 60)
    print("SPOTIFY LYRIC SEARCH - SEMANTIC ML MODEL")
    print("=" * 60)

    csv_path = "spotify_data.csv"  # update if needed

    searcher = SpotifyLyricSearch()

    # Load & encode
    searcher.load_data(csv_path, sample_size=5000)
    searcher.encode_lyrics()

    # Save embeddings
    searcher.save()

    print("\n" + "=" * 60)
    print("TESTING PREDICTIONS")
    print("=" * 60)

    test_lyrics = [
        "I want to hold your hand",
        "We will we will rock you",
        "Is this the real life is this just fantasy",
        "Hello from the other side",
        "I got the eye of the tiger"
    ]

    for lyric in test_lyrics:
        print(f"\nüìù Input Lyrics: '{lyric}'")
        print("-" * 60)
        results = searcher.search(lyric, top_k=3)
        for i, r in enumerate(results, 1):
            print(f"{i}. {r['song']} - {r['artist']} ({r['confidence']}%) affinity")

    print("\n" + "=" * 60)
    print("INTERACTIVE MODE")
    print("=" * 60)
    print("Type lyrics (or 'quit' to exit)")

    while True:
        user_input = input("\n🎵 Enter lyrics: ").strip()
        if user_input.lower() in ["quit", "exit", "q"]:
            break
        if not user_input:
            continue

        results = searcher.search(user_input, top_k=5)
        print("\nTop Results:")
        print("-" * 60)
        for i, r in enumerate(results, 1):
            print(f"{i}. {r['song']} - {r['artist']} ({r['confidence']}%) affinity")


if __name__ == "__main__":
    main()
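
The ranking step inside `search()` relies on scikit-learn's `cosine_similarity` followed by `argsort`. The same idea can be sketched without any dependencies on made-up 2-D vectors; the names `cosine`, `top_k`, `toy_corpus`, and `toy_query` are hypothetical, and the vectors are illustrative only, not real lyric embeddings:

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(query, corpus, k):
    """Return (index, score) pairs for the k corpus vectors most similar to the query."""
    scores = [cosine(query, vec) for vec in corpus]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:k]]


# Toy data: three "documents" and one query in two dimensions.
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [1.0, 0.2]

ranked = top_k(toy_query, toy_corpus, k=2)
for index, score in ranked:
    print(index, round(score, 3))
```

Note that these scores are cosine similarities in the range -1 to 1, not calibrated probabilities, so a "confidence" near 50% only means the vectors point in moderately similar directions.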

Found existing installation: torch 2.9.0
Uninstalling torch-2.9.0:
  Successfully uninstalled torch-2.9.0
Found existing installation: huggingface-hub 0.21.0
Uninstalling huggingface-hub-0.21.0:
  Successfully uninstalled huggingface-hub-0.21.0
Collecting torch==2.9.0
  Downloading torch-2.9.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (30 kB)
Downloading torch-2.9.0-cp312-cp312-manylinux_2_28_x86_64.whl (899.7 MB)
Installing collected packages: torch
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
timm 1.0.22 requires huggingface_hub, which is not installed.
peft 0.18.0 requires accelerate>=0.21.0, which is not installed.
peft 0.18.0 requires

  _torch_pytree._register_pytree_node(


SPOTIFY LYRIC SEARCH - SEMANTIC ML MODEL
Loading Sentence-BERT model...

Loading dataset...
Original shape: (57650, 4)
Columns: ['artist', 'song', 'link', 'text']
Sampled 5000 rows
Final dataset shape: (5000, 5)
Unique songs: 5000

Encoding lyrics using Sentence-BERT...


Batches:   0%|          | 0/157 [00:00<?, ?it/s]

Embeddings shape: (5000, 384)
Saved embeddings to spotify_lyric_embeddings.pkl

TESTING PREDICTIONS

📝 Input Lyrics: 'I want to hold your hand'
------------------------------------------------------------
1. Hold Me - Stevie Wonder (48.5%) affinity
2. I'm Free - Rolling Stones (46.15999984741211%) affinity
3. Hand Of God - Soundgarden (46.09000015258789%) affinity

📝 Input Lyrics: 'We will we will rock you'
------------------------------------------------------------
1. Rock Love - Steve Miller Band (47.43000030517578%) affinity
2. Get On Your Boots - U2 (46.18000030517578%) affinity
3. Don't Bang The Drum - Waterboys (45.43000030517578%) affinity

📝 Input Lyrics: 'Is this the real life is this just fantasy'
------------------------------------------------------------
1. Make It Real - Scorpions (50.79999923706055%) affinity
2. In The Real World - Roy Orbison (44.439998626708984%) affinity
3. Caught In The Middle - Yngwie Malmsteen (44.34000015258789%) affinity

📝 Input Lyr
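
The long decimal tails in the recorded results (for example 46.15999984741211%) are consistent with `round()` being applied to a NumPy `float32` score: the rounded value stays 32-bit and is widened to a Python float only when it is printed. A stdlib-only sketch of that effect, using `struct` to emulate 32-bit storage; `as_float32` is a hypothetical helper, not something the notebook defines:

```python
import struct


def as_float32(x):
    """Round-trip a Python float through 32-bit storage, as a float32 array would hold it."""
    return struct.unpack("f", struct.pack("f", x))[0]


score = as_float32(46.16)  # nearest 32-bit float to 46.16, widened back to a Python float
print(score)               # shows a long tail rather than 46.16
print(round(score, 2))     # rounding the widened Python float gives 46.16
print(as_float32(48.5))    # 48.5 is exactly representable, so no tail appears
```

Converting each score with `float()` before rounding avoids the tails, since the rounding then happens in double precision.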