# Content filtering using cosine similarity of tracks

The following notebook illustrates our content filtering approach that uses track similarity (measured by cosine distance) to recommend tracks to playlists.

Cosine similarity measures the orientation of two *n*-dimensional sample vectors irrespective to their magnitude. It is calculated by the dot product of two numeric vectors, and it is normalized by the product of the vector lengths. 
The output value ranges from 0 to 1, with 1 as the highest similarity.

We compute a similarity matrix for tracks by using sklearn pairwise distance method, with cosine similairty:

<h><center>$cos(\pmb track_1, \pmb track_2) = \frac {\pmb track_1 \cdot \pmb track_2}{||\pmb track_1|| \cdot ||\pmb track_2||}$ </h>

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn.utils import shuffle

subset100 = pd.read_csv("../raw_data/track_meta_100subset_new.csv")
subset100 = shuffle(subset100)

## 1. Data Processing

### 1.1 Train-test split
We split the data into training and test set by 80-20.

In [2]:
train, test = train_test_split(subset100, test_size=0.2, random_state=42, stratify = subset100['Playlistid'])
# train, val = train_test_split(train, test_size=0.2, random_state=42, stratify = train['Playlistid'])

### 1.2 Data cleaning
We drop some non-numeric features in order to calculate the cosine similarity matrix.

In [3]:
# Drop features here
features_drop = ["Playlistid","Playlist","Album", "Track", "Artist", "Trackid", "Artist_Name", "Track_Name", "Album_Name", "Artist_uri", "Track_uri", "Album_uri", "artist_genres", "explicit"]
train_cleaned, test_cleaned = train.drop(features_drop, axis =1), test.drop(features_drop, axis=1)
train = train.reset_index(drop=True)
train_cleaned = train_cleaned.reset_index(drop=True)

In [4]:
train_cleaned.head()

Unnamed: 0,Track_Duration,acousticness,artist_popularity,danceability,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
0,219683,0.716,84,0.505,0.397,0.0,5,0.0853,-9.349,1,0.324,95.063,1,0.558
1,249533,0.0161,76,0.68,0.687,0.0,9,0.261,-6.162,0,0.0709,150.053,4,0.467
2,220573,0.00502,67,0.551,0.836,2.1e-05,10,0.0425,-3.838,0,0.0524,185.063,4,0.758
3,275840,0.19,100,0.735,0.41,0.0,11,0.341,-8.735,0,0.2,114.812,4,0.164
4,173493,0.238,53,0.67,0.558,0.0,2,0.106,-9.159,1,0.0251,80.511,4,0.63


### 1.3 Create a cosine-similarity matrix

In [5]:
from sklearn.metrics.pairwise import cosine_similarity 
from sklearn.preprocessing import MinMaxScaler

# Standardize the data
scaler = MinMaxScaler()
scaler.fit(train_cleaned)
train_scaled = scaler.transform(train_cleaned)
test_scaled = scaler.transform(test_cleaned)

train_scaled_cos_matrix = cosine_similarity(train_scaled)

The shape of the cosine matrix shows 1970 unique tracks (in 100 playlists).

In [6]:
train_scaled_cos_matrix.shape

(2463, 2463)

We wrote a function to compute prediction set per playlist.

The function takes in a pre-calculated track cosine similarity matrix, training set, the target playlist id and the prediction set size (which we pre-determine it to be test set size * 15). It returns a list of tracks (prediction list) to recommend per playlist. The prediction list contains top k similar songs (based on cosine similarity) per track in the playlist.

In [7]:
def cos_similar_songs_playlist(cos_matrix, orig_df, target_playlist_id, cand_list_size):
    """
    Input:
    cos_matrix: cosine matrix of the tracks
    orig_df: original df with tracks as rows, but with playlistid and other features (e.g., train)
    target_playlist_id: id of the target playlist
    target_playlist_inx: index of playlist in the training set
    cand_list_size: candidate list of songs to recommend size (= test-set size * 15)
    
    Output:
    k_song_to_recommend: the most similar tracks per track
    """
    target_track_inx = np.where(train["Playlistid"] == target_playlist_id)[0] # index of tracks in training playlist of target playlist
    candidate_cos_matrix = cos_matrix

    ## For each song in the playlist, find k similar songs
    cand_list = []
    # cand_list_size = k*15
    k = np.floor(cand_list_size/len(target_track_inx)) # round(cand_list_size/len(target_track_inx))
    k_rest = cand_list_size - k*len(target_track_inx)
    # e.g., for a candidate list size of 30, get 3 songs for each track first
    for inx, i in enumerate(target_track_inx):
        candidate_song_rec = candidate_cos_matrix[i, ] #ith row of matrix
        candidate_song_rec_inx = np.argsort(candidate_song_rec)
        unique_candidate_song_sorted = train['Track_uri'][candidate_song_rec_inx][::-1].drop_duplicates()
        tracks_in_target_playlist = train.loc[train["Playlistid"] == target_playlist_id, "Track_uri"]
        song_to_recommend = np.array(unique_candidate_song_sorted.loc[~unique_candidate_song_sorted.isin(tracks_in_target_playlist)])

        if (k_rest != 0 & inx <= k_rest): # 30-24 = 6; for the first 6 tracks recommend k + 1 songs
            k_song_to_recommend = song_to_recommend[:int(k+1)]
        else:
            k_song_to_recommend = song_to_recommend[:int(k)]
            
        if inx == 0:
            cand_list = k_song_to_recommend
        else:
            cand_list = np.append(cand_list, k_song_to_recommend)
    return list(cand_list) # turn np array into list

## 2. Model Performance 
### 2.1 Metrics

In [8]:
def nholdout(playlist_id, df):
    '''Pass in a playlist id to get number of songs held out in val/test set'''
    
    return len(df[df.Playlistid == playlist_id].Track_uri)

In [9]:
def r_precision(prediction, val_set):
# prediction should be a list of predictions
# val_set should be pandas Series of ground truths
    score = np.sum(val_set.isin(prediction))/val_set.shape[0]
    return score

In [10]:
### NDCG Code Source: https://gist.github.com/bwhite/3726239
def dcg_at_k(r, k, method=0):
    r = np.asfarray(r)[:k]
    if r.size:
        if method == 0:
            return r[0] + np.sum(r[1:] / np.log2(np.arange(2, r.size + 1)))
        elif method == 1:
            return np.sum(r / np.log2(np.arange(2, r.size + 2)))
        else:
            raise ValueError('method must be 0 or 1.')
    return 0.


def ndcg_at_k(r, k, method=0):
    dcg_max = dcg_at_k(sorted(r, reverse=True), k, method)
    if not dcg_max:
        return 0.
    return dcg_at_k(r, k, method) / dcg_max

### 2.2 Model Test-Set Performance on 100 playlists

In [11]:
unique_playlistid = train['Playlistid'].drop_duplicates()

In [14]:
rps = []
ndcgs = []
for pid in unique_playlistid: # loop through each playlist
#     print(pid)
    ps = cos_similar_songs_playlist(train_scaled_cos_matrix, train, pid, nholdout(pid, test)*15)# predictions
    vs = test[test.Playlistid == pid].Track_uri # ground truth
    
#     print(r_precision(ps, vs))
    rps.append(r_precision(ps, vs)) # append individual r-precision score
    
    # NDCG
    r = np.zeros(len(ps))
    for i, p in enumerate(ps):
        if np.any(vs.isin([p])):
            r[i] = 1
    ndcgs.append(ndcg_at_k(r, len(r)))

In [15]:
avg_rp = np.mean(rps)
avg_ndcg = np.mean(ndcgs)
print('Avg. R-Precision: ', avg_rp)
print('Avg. NDCG: ', avg_ndcg)
print('Total Sum: ', np.mean([avg_rp, avg_ndcg]))

Avg. R-Precision:  0.05171355144110563
Avg. NDCG:  0.054487469476922304
Total Sum:  0.05310051045901397


### 2.3 Model Performance on 10k playlists

In [17]:
subset10k_seed = pd.read_csv("../raw_data/track_meta_milestone3.csv", index_col="Unnamed: 0")
np.random.seed(123)

In [21]:
subset10k_id = np.random.choice(subset10k_seed['Playlistid'].unique(), size = 10000, replace = False)
subset10k = subset10k_seed[subset10k_seed['Playlistid'].isin(subset10k_id)]

#### 2.3.1 Data Processing on 10k playlists

In [None]:
train, test = train_test_split(subset10k, test_size=0.2, random_state=42, stratify = subset10k['Playlistid'])
# Drop features here
features_drop = ["Playlistid","Playlist","Album", "Track", "Artist", "Trackid", "Artist_Name", "Track_Name", "Album_Name", "Artist_uri", "Track_uri", "Album_uri", "artist_genres", "explicit"]
train_cleaned, test_cleaned = train.drop(features_drop, axis =1), test.drop(features_drop, axis=1)
train = train.reset_index(drop=True)
train_cleaned = train_cleaned.reset_index(drop=True)

# Standardize the data
scaler = MinMaxScaler()
scaler.fit(train_cleaned)
train_scaled = scaler.transform(train_cleaned)
test_scaled = scaler.transform(test_cleaned)

train_scaled_cos_matrix = cosine_similarity(train_scaled)

In [1]:
subset10k.shape

NameError: name 'subset10k' is not defined

In [None]:
unique_10kplaylistid = train['Playlistid'].drop_duplicates()

In [None]:
rps = []
ndcgs = []
for pid in unique_10kplaylistid: # loop through each playlist
#     print(pid)
    ps = cos_similar_songs_playlist(train_scaled_cos_matrix, train, pid, nholdout(pid, test)*15)# predictions
    vs = test[test.Playlistid == pid].Track_uri # ground truth
    
#     print(r_precision(ps, vs))
    rps.append(r_precision(ps, vs)) # append individual r-precision score
    
    # NDCG
    r = np.zeros(len(ps))
    for i, p in enumerate(ps):
        if np.any(vs.isin([p])):
            r[i] = 1
    ndcgs.append(ndcg_at_k(r, len(r)))

In [22]:
avg_rp = np.mean(rps)
avg_ndcg = np.mean(ndcgs)
print('Avg. R-Precision: ', avg_rp)
print('Avg. NDCG: ', avg_ndcg)
print('Total Sum: ', np.mean([avg_rp, avg_ndcg]))

10000