### General Premise

- **Goal**: Provide song recommendations to users based on criteria they select. Moreover, users can choose to weigh different criteria differently.
    - Inspiration: conversations with friends about what they value in songs they listen to. Personal interest in finding music that aligns with the melodies and themes I enjoy.
- Current Song Criteria Outline
    - Vocals/Pitch - quantify how similar two singers are to one another.
    - Melody - are the same notes/pitches recurring in a somewhat similar order?
    - Theme/Lyrics - what topics/keywords do the songs share? Is there a similar story? How can we use data to pick up on that?
    - Genre - relatively predetermined, but helpful nonetheless. How does society bucket the song? Could play around with creating my own Bayes classifier from scratch.
    - Instrumental/Beat/Effects - are similar instruments being used? Is there a similar beat and rhythm? Can we identify any other production similarities?
- For now, we will assume that people value each of these equally. **However**, we can later train/update the model to learn a better baseline weighting.
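
The equal-weight baseline can be sketched as a weighted average of per-criterion similarity scores. This is only a minimal sketch, not the project's implementation: the criterion keys, the `combine_scores` helper, and the example scores are hypothetical, and it assumes each criterion already produces a similarity in [0, 1].

```python
# Hypothetical criterion keys, named after the outline above.
CRITERIA = ["vocals", "melody", "theme", "genre", "instrumental"]


def combine_scores(scores, weights=None):
    """Weighted average of per-criterion similarities (each assumed in [0, 1])."""
    if weights is None:
        # Baseline assumption: every criterion matters equally.
        weights = {c: 1.0 for c in CRITERIA}
    total = sum(weights[c] for c in CRITERIA)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(weights[c] * scores[c] for c in CRITERIA) / total


# Made-up similarity scores between two songs, for illustration only.
example = {"vocals": 0.2, "melody": 0.9, "theme": 0.6, "genre": 1.0, "instrumental": 0.3}

equal = combine_scores(example)
# A user who cares four times as much about melody as about anything else.
melody_heavy = combine_scores(example, {**{c: 1.0 for c in CRITERIA}, "melody": 4.0})
print(round(equal, 3), round(melody_heavy, 3))
```

Dividing by the sum of the weights keeps the combined score in [0, 1] no matter how a user scales their preferences.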

### Theme/Lyrics/Storyline
- How can we quantify the topics, themes, and stories of songs? Some kind of NLP.
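
One possible starting point, offered as an assumption rather than a chosen method, is a bag-of-words TF-IDF representation compared with cosine similarity. The sketch below uses only the standard library; the toy lyrics and the helper names `tfidf_vectors` and `cosine` are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {word: tf-idf weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many lyrics does each word appear?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            # Smoothed idf so words present in every document keep a small weight.
            w: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[w])) + 1)
            for w, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy lyrics, invented for illustration.
lyrics = [
    "heartbreak rain lonely night heartbreak",
    "lonely night missing you heartbreak",
    "party dance tonight friends dance",
]
vecs = tfidf_vectors(lyrics)
sim_sad_pair = cosine(vecs[0], vecs[1])
sim_sad_party = cosine(vecs[0], vecs[2])
print(sim_sad_pair, sim_sad_party)
```

Plain word overlap captures shared keywords but not storyline; "night" and "tonight" do not match here, which hints at why stemming or embeddings might be worth exploring later.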

### Recreating Basic Recommender System Techniques from DS340
- With extra help from Aggarwal's *Recommender Systems* textbook

#### Content Based Filtering 
- Main Idea:
    - If users like song x, then they will probably like song y since it has the same genre, artist, etc.
- Steps
    - Represent items as a matrix. Row = song. Column = features. Features often binary (present or not present). 
    - Choose a similarity metric between rows. (Dot product, cosine similarity, Euclidean distance, etc.)
        - Things to consider:
        - 1. Items that appear more frequently in the training set tend to have embeddings with larger norms --> the dot product makes sense if you want to capture popularity info. Must be wary that these do not entirely dominate --> use other measures that put less emphasis on the norm.
        - 2. Rare items may not get updated a lot during training. Could initialize these with larger norms, but this runs the risk of rare items being recommended too often. Careful initialization and regularization needed.
    - Calculate similarity metrics between rows/items to determine which songs are most similar.
    - Also represent the user in the same feature space/matrix as another row. Represent users with the features they're interested in.
        - User features/interests can be explicitly provided by the user, or implicitly learned based on past history.
    - Calculate similarity metrics between the user row and item rows to decide what to recommend --> calculate similarity between embeddings/vectors.
- Tradeoffs
    - Must find a way to fill in features of products and features/interests of users.
        - Go through products and tag them.
        - Ask users to select interests.
    - Must decide relevant features ahead of time.
        - Could potentially forget one.
    - No pleasant surprises, just continues the trend --> lack of diversity.
        - Can we make this an adjustable setting?
- Considerations
    - Need an external dataset to tag songs! 
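
The steps above can be sketched with a small binary item-feature matrix. This is a minimal sketch under stated assumptions: the songs, feature tags, and liked-song list are invented, and building the user row as the mean of the liked item rows is just one simple way to learn interests implicitly, not something these notes prescribe.

```python
import numpy as np

# Rows = songs, columns = binary features (hypothetical tags).
features = ["pop", "rock", "acoustic", "female_vocals"]
songs = ["song_a", "song_b", "song_c", "song_d"]
items = np.array([
    [1, 0, 0, 1],   # song_a
    [1, 0, 1, 1],   # song_b
    [0, 1, 0, 0],   # song_c
    [0, 1, 1, 0],   # song_d
], dtype=float)


def cosine_to_rows(vec, matrix):
    """Cosine similarity between one vector and every row of a matrix."""
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(vec)
    # Leave the score at 0 wherever a norm is zero instead of dividing by zero.
    return np.divide(matrix @ vec, norms, out=np.zeros(len(matrix)), where=norms > 0)


# Represent the user in the same feature space: mean of the songs they liked.
liked = [0]                       # the user liked song_a
user = items[liked].mean(axis=0)

cos_scores = cosine_to_rows(user, items)
dot_scores = items @ user         # the dot product is sensitive to row norms

# Recommend the most similar song the user has not already liked.
candidates = [i for i in np.argsort(-cos_scores) if i not in liked]
recommendation = songs[candidates[0]]
print(recommendation, cos_scores.round(3), dot_scores)
```

In this toy case the dot product scores song_a and song_b identically, because song_b's extra tag is not penalized, while cosine similarity ranks song_a higher. That is the norm sensitivity mentioned in the considerations above.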

### Data Loading/Extraction 
- Goal: Combine data from the Spotify Million Playlist Dataset with external data about song features --> allow for better content-based filtering.
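
Combining the two sources can be sketched as a join on the track id. The frames below are tiny stand-ins with invented ids and values; the real inputs would be the playlist track ids and an external feature table, and every column name other than `track_id` is an assumption.

```python
import pandas as pd

# Stand-in for track ids extracted from the playlists (invented ids).
playlist_tracks = pd.DataFrame({"track_id": ["t1", "t2", "t3"]})

# Stand-in for an external song-feature table (invented values).
song_features = pd.DataFrame({
    "track_id": ["t1", "t3", "t9"],
    "danceability": [0.71, 0.42, 0.55],
    "energy": [0.80, 0.31, 0.64],
})

# A left join keeps every playlist track; `indicator` records whether features were found.
merged = playlist_tracks.merge(song_features, on="track_id", how="left", indicator=True)

# Share of playlist tracks that have external features available.
coverage = (merged["_merge"] == "both").mean()
print(merged)
print(f"Tracks with features: {coverage:.0%}")
```

Checking coverage first shows how many playlist tracks would be left without features, which matters for content-based filtering.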

In [9]:
# Load in playlist data - all data is in a data folder, which is split into many JSON files, each with 1,000 playlists. 1 million playlists total
import pandas as pd
import json 
import os 
data_folder_path = 'data/spotify_million_playlist_dataset/data'
def load_playlists(data_folder, num_files=None): #maximum num_files is 1000. 1000 files * 1000 playlists = 1 million playlists total
    playlists = []
    files = sorted(os.listdir(data_folder))
    
    if num_files:
        files = files[:num_files]  # limit how many files you load
    
    for filename in files:
        if filename.endswith('.json'):
            with open(os.path.join(data_folder, filename), 'r') as f:
                data = json.load(f)
                playlists.extend(data['playlists'])
    
    return playlists
df_playlists = load_playlists(data_folder_path, num_files = 1) 

In [14]:
track_ids = pd.read_csv('unique_track_uris.csv')

In [15]:
track_ids.head()

Unnamed: 0,track_id
0,4zAPTVyn2AA8S1yHvxo6bh
1,1wzaWJPvxcvshiUbOCKg1b
2,6aqheaqglL3yJypQeBbjXe
3,4DCJ43BgL8gfIN6avBqKGe
4,7w2h4L8AysiYqdfQK9uUss


In [21]:
# Imports
import csv
import json
import os
import time

import pandas as pd
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
print("Imports completed")

Imports completed


In [None]:
# Load in song data using spotipy 
# First loop through all Million Playlist Dataset files, extract all the track_uris (unique song identifier), save to csv
def extract_unique_track_uris(mpd_dir, output_file="unique_track_uris.csv"):
    seen = set()
    count = 0
    for fname in sorted(os.listdir(mpd_dir)):
        if not fname.endswith('.json'):
            continue
        with open(os.path.join(mpd_dir, fname)) as f:
            data = json.load(f)
            for playlist in data['playlists']:
                for track in playlist['tracks']:
                    uri = track['track_uri'].split(':')[-1]
                    if uri not in seen:
                        seen.add(uri)
                        count += 1
        print(f"Processed {fname}, total unique tracks so far: {count}")

    # Save to CSV
    with open(output_file, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['track_id'])
        for uri in seen:
            writer.writerow([uri])

    print(f"\nSaved {len(seen)} unique track URIs to {output_file}")

# Use it on the data folder
extract_unique_track_uris(data_folder_path, 'unique_track_uris.csv')

In [33]:
# Set up spotipy to collect content/song features
# Read credentials from environment variables instead of hardcoding secrets in the notebook
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(
    client_id=os.environ['SPOTIPY_CLIENT_ID'],
    client_secret=os.environ['SPOTIPY_CLIENT_SECRET']
))

# First some checks: is the audio-features endpoint reachable with these credentials?
sp.audio_features(['4zAPTVyn2AA8S1yHvxo6bh'])

HTTP Error for GET to https://api.spotify.com/v1/audio-features/?ids=4zAPTVyn2AA8S1yHvxo6bh with Params: {} returned 403 due to None


SpotifyException: http status: 403, code: -1 - https://api.spotify.com/v1/audio-features/?ids=4zAPTVyn2AA8S1yHvxo6bh:
 None, reason: None

In [29]:
# Next, use the unique ids to query Spotify's API and collect content/song features
# Note: sp.tracks accepts at most 50 ids per call (audio_features allows 100), so batch by 50

def enrich_tracks(track_ids, batch_size=50, sleep_time=2, output_path='enriched_tracks.csv'):
    data = []
    total = len(track_ids)
    for i in range(0, total, batch_size):
        batch = track_ids[i:i+batch_size]
        try:
            features = sp.audio_features(batch)
            meta = sp.tracks(batch)['tracks']
        except Exception as e:
            print(f"Error at index {i}: {e}")
            time.sleep(10)
            continue

        for f, m in zip(features, meta):
            if f is None or m is None:
                continue
            data.append({
                'track_id': m['id'],
                'track_name': m['name'],
                'artist_name': m['artists'][0]['name'] if m['artists'] else None,
                'album_name': m['album']['name'],
                'popularity': m['popularity'],
                'duration_ms': f['duration_ms'],
                'explicit': m['explicit'],
                'danceability': f['danceability'],
                'energy': f['energy'],
                'valence': f['valence'],
                'tempo': f['tempo'],
                'acousticness': f['acousticness'],
                'instrumentalness': f['instrumentalness'],
                'liveness': f['liveness'],
                'speechiness': f['speechiness'],
                'mode': f['mode'],
                'key': f['key'],
                'time_signature': f['time_signature'],
                'genre': None  # can enrich later
            })

        print(f"Processed {min(i + batch_size, total)} / {total}")
        time.sleep(sleep_time)

    # Save once at the end
    df = pd.DataFrame(data)
    df.to_csv(output_path, index=False)
    print(f"\nSaved {len(df)} tracks to {output_path}")



In [30]:
# First some checks 
sp.audio_features(['4zAPTVyn2AA8S1yHvxo6bh'])

HTTP Error for GET to https://api.spotify.com/v1/audio-features/?ids=4zAPTVyn2AA8S1yHvxo6bh with Params: {} returned 403 due to None


SpotifyException: http status: 403, code: -1 - https://api.spotify.com/v1/audio-features/?ids=4zAPTVyn2AA8S1yHvxo6bh:
 None, reason: None

In [None]:
# Get the song features
track_id_list = track_ids['track_id'].tolist()
enrich_tracks(track_id_list)

In [8]:
# Load in song feature data from a pre-built dataset instead, since the audio-features API call returned 403
file_path = 'data/SpotifySongDataset/dataset.csv'
df_songs = pd.read_csv(file_path)
df_songs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 114000 entries, 0 to 113999
Data columns (total 21 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Unnamed: 0        114000 non-null  int64  
 1   track_id          114000 non-null  object 
 2   artists           113999 non-null  object 
 3   album_name        113999 non-null  object 
 4   track_name        113999 non-null  object 
 5   popularity        114000 non-null  int64  
 6   duration_ms       114000 non-null  int64  
 7   explicit          114000 non-null  bool   
 8   danceability      114000 non-null  float64
 9   energy            114000 non-null  float64
 10  key               114000 non-null  int64  
 11  loudness          114000 non-null  float64
 12  mode              114000 non-null  int64  
 13  speechiness       114000 non-null  float64
 14  acousticness      114000 non-null  float64
 15  instrumentalness  114000 non-null  float64
 16  liveness          11

In [9]:
# Test data
import pandas as pd
file_path = 'data/spotify_million_playlist_dataset_challenge/challenge_set.json'
df = pd.read_json(file_path)
print(df.head(3))
df.info()

                        date  ...                                      description
0 2018-01-16 08:47:28.198015  ...  the challenge set for the RecSys Challenge 2018
1 2018-01-16 08:47:28.198015  ...  the challenge set for the RecSys Challenge 2018
2 2018-01-16 08:47:28.198015  ...  the challenge set for the RecSys Challenge 2018

[3 rows x 5 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   date         10000 non-null  datetime64[ns]
 1   version      10000 non-null  object        
 2   playlists    10000 non-null  object        
 3   name         10000 non-null  object        
 4   description  10000 non-null  object        
dtypes: datetime64[ns](1), object(4)
memory usage: 390.8+ KB


### Direct Data Loading Here - Skip Extraction

In [16]:
# Load in the unique track ids
track_ids = pd.read_csv('unique_track_uris.csv')
track_ids.head(5)

Unnamed: 0,track_id
0,4zAPTVyn2AA8S1yHvxo6bh
1,1wzaWJPvxcvshiUbOCKg1b
2,6aqheaqglL3yJypQeBbjXe
3,4DCJ43BgL8gfIN6avBqKGe
4,7w2h4L8AysiYqdfQK9uUss


### Collaborative Filtering
- Main Idea:
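    - If users A and B have liked many of the same songs, then A will probably like other songs that B likes. Relies on interaction data (e.g. playlists) rather than song features.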
- Steps:
- Brainstorm
    - Create a DNN to learn the embeddings for each user

### A Hybrid Approach, With Transformers.
- Hybrid systems, which combine content-based and collaborative signals, often outperform single-method models and are common in practical deployments.

- Adding transformers...