# MLBD: Project on the music dataset

The base subject is predicting a playlist that satisfies all members of a group (e.g., to decide which music to play at a party). By playlist we mean a set of songs, which can come from a single artist or from several.

Research questions:
- How do we compute similarity metrics between users?
- What is the impact of the number of plays?
- What is the impact of gender, age and country on the kind of music people listen to?
- Not a question: possibly use the Spotify API (through Spotipy) to access genres.
- See what else we can get from the Spotify API.
- Can we generate a music playlist for multiple users based on what they have listened to?

### 1. Imports

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### 2. Load the data

In [5]:
DATA_FOLDER = 'data/'
TOP_FOLDER = DATA_FOLDER + 'lastfm-dataset-360k/'
TIMELINE_FOLDER = DATA_FOLDER + 'lastfm-dataset-1k/'

# error_bad_lines was removed in pandas 2.0; on_bad_lines = 'warn' (pandas >= 1.3) skips malformed lines and reports them
top_user = pd.read_csv(TOP_FOLDER + 'usersha1-profile.tsv', sep = '\t', on_bad_lines = 'warn', header = None)
top_data = pd.read_csv(TOP_FOLDER + 'usersha1-artmbid-artname-plays.tsv', sep = '\t', on_bad_lines = 'warn', header = None)

timeline_user = pd.read_csv(TIMELINE_FOLDER + 'userid-profile.tsv', sep = '\t', on_bad_lines = 'warn', header = 0)
timeline_data = pd.read_csv(TIMELINE_FOLDER + 'userid-timestamp-artid-artname-traid-traname.tsv', sep = '\t', on_bad_lines = 'warn', header = None)

b'Skipping line 2120260: expected 6 fields, saw 8\n'
b'Skipping line 2446318: expected 6 fields, saw 8\n'
b'Skipping line 11141081: expected 6 fields, saw 8\n'
b'Skipping line 11152099: expected 6 fields, saw 12\nSkipping line 11152402: expected 6 fields, saw 8\n'
b'Skipping line 11882087: expected 6 fields, saw 8\n'
b'Skipping line 12902539: expected 6 fields, saw 8\nSkipping line 12935044: expected 6 fields, saw 8\n'
b'Skipping line 17589539: expected 6 fields, saw 8\n'


### 3. EDA

In [6]:
top_user.rename(columns = {0 : 'ID', 1 : 'Gender', 2 : 'Age', 3 : 'Country', 4 : 'Registered'}, inplace = True)
top_data.rename(columns = {0 : 'ID', 1 : 'Artist_ID', 2 : 'Artist', 3 : 'Plays'}, inplace = True)
timeline_user.rename(columns = {'#id' : 'ID', 'gender' : 'Gender', 'age' : 'Age', 'country' : 'Country', 'registered' : 'Registered'}, inplace = True)
timeline_data.rename(columns = {0 : 'ID', 1 : 'Timestamp', 2 : 'Artist_ID', 3 : 'Artist', 4 : 'Track_ID', 5 : 'Track'}, inplace = True)

In [7]:
# we'll check the number of NaNs for each dataset
print(top_user.isna().sum(), '\n')
print(top_data.isna().sum(), '\n')
print(timeline_user.isna().sum(), '\n')
print(timeline_data.isna().sum())

ID                0
Gender        32775
Age           74900
Country           0
Registered        0
dtype: int64 

ID                0
Artist_ID    226137
Artist          204
Plays             0
dtype: int64 

ID              0
Gender        108
Age           706
Country        85
Registered      8
dtype: int64 

ID                 0
Timestamp          0
Artist_ID     600848
Artist             0
Track_ID     2162719
Track             12
dtype: int64


In [8]:
top_merged = top_data.merge(top_user, left_on='ID', right_on='ID')
top_merged = top_merged.drop(columns=['Artist_ID', 'Registered'])
top_merged = top_merged.drop(top_merged[top_merged.isna().any(axis = 1)].index)

In [9]:
# Note: it seems that we lose approximately 100k users by merging the two datasets above.
# We probably cannot do anything about it, but it is worth mentioning in the EDA. ~rap

# We eliminate users who have fewer than 6 favourite artists (as per Aleandro's advice)
top_merged_IDs = top_merged.groupby(['ID']).size().reset_index()
users_id = top_merged_IDs[top_merged_IDs[0] > 5]['ID']
top_merged = top_merged[top_merged['ID'].isin(users_id)]
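
The comment above notes that about 100k users disappear, without saying at which step. The sketch below shows one possible way to check this, using `merge(..., indicator=True)` and a before/after count around the NaN drop. The frames `toy_data` and `toy_user` are hypothetical stand-ins for `top_data` and `top_user`, not the real data.

```python
import pandas as pd

# Hypothetical toy frames standing in for top_data and top_user.
toy_data = pd.DataFrame({
    "ID": ["u1", "u1", "u2", "u3"],
    "Artist": ["muse", "coldplay", "muse", "radiohead"],
    "Plays": [10, 5, 7, 3],
})
toy_user = pd.DataFrame({
    "ID": ["u1", "u2", "u4"],
    "Gender": ["f", None, "m"],
    "Age": [22.0, 30.0, 41.0],
})

# An outer merge with indicator=True adds a '_merge' column telling whether
# each key was found in the left frame only, the right frame only, or both.
merged = toy_data.merge(toy_user, on="ID", how="outer", indicator=True)
users_by_origin = merged.drop_duplicates("ID")["_merge"].value_counts()

# Users that survive the inner merge, and users then lost to the NaN drop.
inner = toy_data.merge(toy_user, on="ID")
after_dropna = inner.dropna()
users_lost_to_nans = inner["ID"].nunique() - after_dropna["ID"].nunique()

print(users_by_origin)
print("users lost to NaN rows:", users_lost_to_nans)
```

Applied to the real frames, the same counts would tell whether the users are lost in the merge itself or in the subsequent NaN filtering on `Gender` and `Age`.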

In [10]:
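# we have almost 300k artists
# count how many times each artist occurs in the dataset, i.e. how many users list it
# (note: despite its name, the 'Sum of plays' column holds this row count, not a sum of plays)
top_plays = top_merged.groupby(['Artist']).size().sort_values(ascending = True).reset_index(name = 'Sum of plays')
top_plays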

| | Artist | Sum of plays |
|---|---|---|
| 0 | 04)] | 1 |
| 1 | john mcdonough | 1 |
| 2 | john mcdaniel iii | 1 |
| 3 | john mcbain | 1 |
| 4 | john mcarthur | 1 |
| ... | ... | ... |
| 249789 | muse | 37313 |
| 249790 | red hot chili peppers | 38358 |
| 249791 | coldplay | 50624 |
| 249792 | the beatles | 57372 |


In [16]:
# to reduce the dataset, we keep only the artists that more than 1000 users have listened to
top_artists = top_plays[top_plays['Sum of plays'] > 1000]['Artist']
top_merged = top_merged.loc[top_merged['Artist'].isin(top_artists)].reset_index().drop(columns = ['index'])

In [17]:
top_artists

247646              los piratas
247647     charlotte gainsbourg
247648             edvard grieg
247649          the wallflowers
247650                    milow
                  ...          
249789                     muse
249790    red hot chili peppers
249791                 coldplay
249792              the beatles
249793                radiohead
Name: Artist, Length: 2148, dtype: object

## Computing Prediction

### Predicting only based on the user average listens

As a trivial baseline, predict the mean play count of the user: $\large pred(u,i) = \mu_{u} = \frac{1}{|I(u)|} \sum_{k \in I(u)} Plays(u, k)$

where $u$ is the user we are making the prediction for, $i$ is the artist whose number of plays we want to predict, and $I(u)$ is the set of artists the user has listened to.

We still have to measure the performance of this predictor; how to do so is an open question.
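
One possible answer to the performance question, given only as a sketch: hold out one (user, artist) pair per user, predict it from the remaining rows, and report RMSE and MAE. The split rule (last row of each user) and the toy frame `toy` are assumptions made for illustration, not a choice taken in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same columns as pred_data.
toy = pd.DataFrame({
    "ID": ["u1", "u1", "u1", "u2", "u2", "u2"],
    "Artist": ["a", "b", "c", "a", "b", "d"],
    "Plays": [10, 20, 30, 100, 200, 150],
})

# Hold out the last listed artist of every user as the test set.
test = toy.groupby("ID").tail(1)
train = toy.drop(test.index)

# Baseline: predict each held-out play count with the user's training mean.
user_mean = train.groupby("ID")["Plays"].mean()
pred = test["ID"].map(user_mean)

errors = pred - test["Plays"]
rmse = float(np.sqrt((errors ** 2).mean()))
mae = float(errors.abs().mean())
print(f"RMSE = {rmse:.3f}, MAE = {mae:.3f}")
```

The same split can be reused for every predictor in this section so that their errors are comparable. Since play counts are heavy-tailed, evaluating on a log scale may also be worth trying.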

In [31]:
top_merged = top_data.merge(top_user, left_on='ID', right_on='ID')
top_merged = top_merged.drop(columns=['Artist_ID', 'Registered'])
top_merged = top_merged.drop(top_merged[top_merged.isna().any(axis = 1)].index)
top_merged.head()

| | ID | Artist | Plays | Gender | Age | Country |
|---|---|---|---|---|---|---|
| 0 | 00000c289a1829a808ac09c00daf10bc3c4e223b | betty blowtorch | 2137 | f | 22.0 | Germany |
| 1 | 00000c289a1829a808ac09c00daf10bc3c4e223b | die Ärzte | 1099 | f | 22.0 | Germany |
| 2 | 00000c289a1829a808ac09c00daf10bc3c4e223b | melissa etheridge | 897 | f | 22.0 | Germany |
| 3 | 00000c289a1829a808ac09c00daf10bc3c4e223b | elvenking | 717 | f | 22.0 | Germany |
| 4 | 00000c289a1829a808ac09c00daf10bc3c4e223b | juliette & the licks | 706 | f | 22.0 | Germany |


In [32]:
users_plays = top_merged.groupby('ID')['Plays'].sum()
users_with_n_listens = users_plays[users_plays > 2000]  # keep only users with more than 2000 plays in total across all artists

In [33]:
users_with_n_listens

ID
00000c289a1829a808ac09c00daf10bc3c4e223b    16716
00007a47085b9aab8af55f52ec8846ac479ac4fe     6115
0000c176103e538d5c9828e695fed4f7ae42dd01    25424
0000ee7dd906373efa37f4e1185bfe1e3f8695ae     7252
0000ef373bbd0d89ce796abae961f2705e8c1faf     3240
                                            ...  
fffe7823f67b433b45f22056467db921c1d3d7d0     3224
fffe8637bd8234309e871409c7ebef99a720afc1    11207
fffe8c7f952d9b960a56ed4dcb40a415d924b224    11434
ffff9af9ae04d263dae91cb838b1f3a6725f5ffb     5408
ffff9ef87a7d9494ada2f9ade4b9ff637c0759ac     7440
Name: Plays, Length: 216163, dtype: int64

In [34]:
top_users = top_merged.loc[top_merged['ID'].isin(users_with_n_listens.index)].reset_index()

In [35]:
top_users

| | index | ID | Artist | Plays | Gender | Age | Country |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 00000c289a1829a808ac09c00daf10bc3c4e223b | betty blowtorch | 2137 | f | 22.0 | Germany |
| 1 | 1 | 00000c289a1829a808ac09c00daf10bc3c4e223b | die Ärzte | 1099 | f | 22.0 | Germany |
| 2 | 2 | 00000c289a1829a808ac09c00daf10bc3c4e223b | melissa etheridge | 897 | f | 22.0 | Germany |
| 3 | 3 | 00000c289a1829a808ac09c00daf10bc3c4e223b | elvenking | 717 | f | 22.0 | Germany |
| 4 | 4 | 00000c289a1829a808ac09c00daf10bc3c4e223b | juliette & the licks | 706 | f | 22.0 | Germany |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 10796591 | 17535564 | ffff9ef87a7d9494ada2f9ade4b9ff637c0759ac | tristania | 61 | m | 21.0 | Belgium |
| 10796592 | 17535565 | ffff9ef87a7d9494ada2f9ade4b9ff637c0759ac | xandria | 61 | m | 21.0 | Belgium |
| 10796593 | 17535566 | ffff9ef87a7d9494ada2f9ade4b9ff637c0759ac | alice cooper | 59 | m | 21.0 | Belgium |
| 10796594 | 17535567 | ffff9ef87a7d9494ada2f9ade4b9ff637c0759ac | lamb of god | 58 | m | 21.0 | Belgium |


In [41]:
top_artists = top_plays[top_plays['Sum of plays'] > 500]['Artist']

In [42]:
top = top_users.loc[top_users['Artist'].isin(top_artists)].reset_index().drop(columns = ['index'])

In [45]:
pred_data = top.drop(columns = ['Gender', 'Age', 'Country']).copy()

In [46]:
pred_data

| | level_0 | ID | Artist | Plays |
|---|---|---|---|---|
| 0 | 1 | 00000c289a1829a808ac09c00daf10bc3c4e223b | die Ärzte | 1099 |
| 1 | 2 | 00000c289a1829a808ac09c00daf10bc3c4e223b | melissa etheridge | 897 |
| 2 | 3 | 00000c289a1829a808ac09c00daf10bc3c4e223b | elvenking | 717 |
| 3 | 5 | 00000c289a1829a808ac09c00daf10bc3c4e223b | red hot chili peppers | 691 |
| 4 | 7 | 00000c289a1829a808ac09c00daf10bc3c4e223b | the black dahlia murder | 507 |
| ... | ... | ... | ... | ... |
| 7805496 | 10796591 | ffff9ef87a7d9494ada2f9ade4b9ff637c0759ac | tristania | 61 |
| 7805497 | 10796592 | ffff9ef87a7d9494ada2f9ade4b9ff637c0759ac | xandria | 61 |
| 7805498 | 10796593 | ffff9ef87a7d9494ada2f9ade4b9ff637c0759ac | alice cooper | 59 |
| 7805499 | 10796594 | ffff9ef87a7d9494ada2f9ade4b9ff637c0759ac | lamb of god | 58 |


In [None]:
# TODO: decide how to reduce the dataset; for now we take the first 10,000 rows
data_user_avg = pred_data.iloc[0:10000]

In [None]:
# average only the 'Plays' column; mean('Plays') would pass 'Plays' as the numeric_only argument
avg_listens = data_user_avg.groupby("ID")['Plays'].mean().reset_index()

In [None]:
def compute_pred_avg_user(user, artist_i, avg_listens):
    # artist_i is unused: this baseline predicts the user's mean for every artist
    return int(avg_listens.loc[avg_listens['ID'] == user, 'Plays'].iloc[0])

In [None]:
user = data_user_avg.iloc[72]['ID']
artist = "the strokes"
compute_pred_avg_user(user, artist, avg_listens)

### Predicting only based on the artist average listens


As a second trivial baseline, predict the mean play count of the artist, whatever the user: $\large pred(u,i) = \mu_{i} = \frac{1}{|U(i)|} \sum_{v \in U(i)} Plays(v,i)$

where $U(i)$ is the set of users that have listened to artist $i$.



In [None]:
# TODO: decide how to reduce the dataset; for now we take the first 10,000 rows
data_artist_avg = pred_data.iloc[0:10000]

In [None]:
# average only the 'Plays' column; mean('Plays') would pass 'Plays' as the numeric_only argument
avg_artist_listens = data_artist_avg.groupby("Artist")['Plays'].mean().reset_index()

In [None]:
def compute_pred_avg_artist(user, artist_i, avg_listens):
    # user is unused: this baseline predicts the artist's mean for every user
    return int(avg_listens.loc[avg_listens['Artist'] == artist_i, 'Plays'].iloc[0])

In [None]:
user = data_artist_avg.iloc[72]['ID']
artist = "the strokes"
compute_pred_avg_artist(user, artist, avg_artist_listens)

###  Using User similarity

**First idea**: Compute the similarity with the Jaccard index. Each user is represented by the set of artists they have listened to.

$\large sim(u,v) = \mathrm{Jacc}(I(u), I(v)) = \Large \frac{|I(u) \cap I(v)|}{|I(u) \cup I(v)|}$

Once we have this similarity, we estimate how much a user will listen to a new artist from the play counts of all the other users who have listened to that artist, weighted by their similarity to the user. This is the user-specific weighted sum:

$\large p(u, i) = \frac{\sum_{v \in U(i)} sim(u,v) \cdot Plays(v,i)}{ \sum_{v \in U(i)} sim(u,v)}$

where $U(i)$ is the set of users that have listened to artist $i$.
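
To make the two formulas concrete, here is a small worked example in plain Python. The users `u`, `v1`, `v2`, their artist sets and their play counts are invented for illustration, and the helper names `jaccard` and `predict` are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index of two sets; defined as 0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Invented listening sets I(.) for three users.
artists_of = {
    "u": {"muse", "coldplay", "radiohead"},
    "v1": {"muse", "coldplay", "the strokes"},
    "v2": {"radiohead", "the strokes"},
}
# Invented play counts Plays(v, i) for the artist we want to predict.
plays = {("v1", "the strokes"): 40, ("v2", "the strokes"): 100}

def predict(user, artist):
    """Similarity-weighted average of the plays of `artist` by the other users."""
    num = denom = 0.0
    for (v, a), p in plays.items():
        if a == artist and v != user:
            s = jaccard(artists_of[user], artists_of[v])
            num += s * p
            denom += s
    return num / denom if denom else 0.0

print(jaccard(artists_of["u"], artists_of["v1"]))  # 2 shared artists out of 4
print(jaccard(artists_of["u"], artists_of["v2"]))  # 1 shared artist out of 4
print(predict("u", "the strokes"))
```

Here $sim(u, v_1) = 2/4$ and $sim(u, v_2) = 1/4$, so the prediction is $(0.5 \cdot 40 + 0.25 \cdot 100) / 0.75 = 60$.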

In [None]:
# Replace data_similarity with the data we finally decide to use; we must ensure that each user has listened to at least one artist
data_similarity = pred_data.iloc[0:10000]

In [None]:
# one row per user, holding the set of artists that user has listened to
group = data_similarity[['ID','Artist']].groupby('ID').agg(set)

def compute_sim_users_artist(user_set):
    """Return the upper-triangular Jaccard similarity matrix and a user ID -> row index mapping."""
    artist_sets = user_set['Artist'].tolist()
    n = len(artist_sets)
    matrice = np.zeros((n, n))
    mapping = {user: i for i, user in enumerate(user_set.index)}

    for i in range(n):
        # only the upper triangle (j > i) is filled, since the similarity is symmetric
        for j in range(i + 1, n):
            union = artist_sets[i] | artist_sets[j]
            matrice[i, j] = len(artist_sets[i] & artist_sets[j]) / len(union)
    return matrice, mapping

In [None]:
# This technique is too slow: it runs a quadratic number of set operations in pure Python
user_sim, mapping = compute_sim_users_artist(group)
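
A possible way to speed up the pairwise loop above, sketched under assumptions: encode the listening sets as a binary user-by-artist matrix, so that all intersection sizes come from a single matrix product. The frame `toy` is a hypothetical stand-in for `data_similarity`. The result is a dense $n \times n$ matrix, so for the full set of users a sparse representation (e.g. `scipy.sparse`) or a subsample would still be needed.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data_similarity[['ID', 'Artist']].
toy = pd.DataFrame({
    "ID": ["u", "u", "u", "v1", "v1", "v1", "v2", "v2"],
    "Artist": ["muse", "coldplay", "radiohead",
               "muse", "coldplay", "the strokes",
               "radiohead", "the strokes"],
})

# Binary user x artist matrix: 1 if the user has listened to the artist.
B = pd.crosstab(toy["ID"], toy["Artist"]).clip(upper=1)
M = B.to_numpy(dtype=float)

# |I(u) ∩ I(v)| for every pair of users in one matrix product.
inter = M @ M.T
sizes = M.sum(axis=1)
# |I(u) ∪ I(v)| = |I(u)| + |I(v)| - |I(u) ∩ I(v)|
union = sizes[:, None] + sizes[None, :] - inter

# Element-wise division, leaving 0 where the union is empty.
sim = np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)
sim_df = pd.DataFrame(sim, index=B.index, columns=B.index)
print(sim_df)
```

Unlike the loop, this fills the full symmetric matrix (with ones on the diagonal), so lookups no longer need to order the two indices.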

In [None]:
# keep only the non-zero similarities (boolean indexing already returns a flat array)
values = user_sim[user_sim != 0]

sns.histplot(values)
plt.yscale("log")
plt.title("User non-zero similarity distribution")

In [None]:
def compute_pred_sim_users(user, artist, user_has_artist, sim_matrix):
    """Similarity-weighted average of the plays of `artist` over the users in user_has_artist."""
    # relies on the global `mapping` and `data_similarity` defined above
    user_id = mapping[user]
    num = 0
    denom = 0
    for i in user_has_artist:
        # the similarity matrix is upper triangular: always index it as [smaller, larger]
        row, col = sorted((user_id, mapping[i]))
        sim = sim_matrix[row, col]
        plays = data_similarity.loc[(data_similarity['Artist'] == artist)
                                    & (data_similarity['ID'] == i), 'Plays'].iloc[0]
        num += plays * sim
        denom += sim

    if denom == 0:
        return 0
    return int(num / denom)

In [None]:
user = data_similarity.iloc[72]['ID']
artist = "the strokes"
user_has_artist = set(data_similarity[(data_similarity['Artist'] == artist) & (data_similarity['ID'] != user)]['ID'])
compute_pred_sim_users(user, artist, user_has_artist, user_sim)

**Second idea**: Add the user-specific mean to the prediction.

The previous method does not take the target user's own listening level into account: it only averages the plays of the users who have listened to the artist. We integrate it as follows.

We first compute the similarity-weighted deviation of the other users from their own means:
$\large \tilde{u} =  \frac{\sum_{v \in U(i)} sim(u,v) \cdot (Plays(v,i) - \mu_{v})}{ \sum_{v \in U(i)} sim(u,v)} $

Then we compute:

$\large p(u, i) = \mu_{u} + \tilde{u} $
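
A small worked example of the mean-centred prediction, in plain Python. All users and play counts are invented, and the helper names (`jaccard`, `mean_plays`, `predict_centered`) are hypothetical.

```python
# Invented play counts: plays[user][artist] = Plays(user, artist).
plays = {
    "u": {"muse": 10, "coldplay": 20, "radiohead": 30},
    "v1": {"muse": 30, "coldplay": 50, "the strokes": 40},
    "v2": {"radiohead": 60, "the strokes": 100},
}

def jaccard(a, b):
    """Jaccard index of two sets; defined as 0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def mean_plays(user):
    """mu_user: average play count over the artists the user has listened to."""
    counts = plays[user].values()
    return sum(counts) / len(counts)

def predict_centered(user, artist):
    """mu_u plus the similarity-weighted deviation of the other listeners."""
    num = denom = 0.0
    for v, listened in plays.items():
        if v == user or artist not in listened:
            continue
        s = jaccard(set(plays[user]), set(listened))
        num += s * (listened[artist] - mean_plays(v))
        denom += s
    base = mean_plays(user)
    # With no similar listener, fall back to the user's own mean.
    return base + num / denom if denom else base

print(predict_centered("u", "the strokes"))
```

With $\mu_u = 20$, $\mu_{v_1} = 40$, $\mu_{v_2} = 80$ and similarities $0.5$ and $0.25$, we get $\tilde{u} = (0.5 \cdot 0 + 0.25 \cdot 20) / 0.75 \approx 6.67$, so the prediction is about $26.67$. It stays on the scale of $u$'s own listening, whereas the uncentred weighted average would give $60$.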


In [None]:
def compute_pred_user_item_with_matrix(user, artist, user_has_artist, sim_matrix):
    """Mean-centred prediction: the user's mean plus the similarity-weighted deviation of other users."""
    # relies on the global `mapping` and `data_similarity` defined above
    user_id = mapping[user]
    user_mean = data_similarity[data_similarity['ID'] == user]['Plays'].mean()
    num = 0
    denom = 0
    for i in user_has_artist:
        # the similarity matrix is upper triangular: always index it as [smaller, larger]
        row, col = sorted((user_id, mapping[i]))
        sim = sim_matrix[row, col]
        plays_i = data_similarity[data_similarity['ID'] == i]
        deviation = plays_i.loc[plays_i['Artist'] == artist, 'Plays'].iloc[0] - plays_i['Plays'].mean()
        num += sim * deviation
        denom += sim

    if denom == 0:
        # no similar user has listened to the artist: fall back to the user's own mean,
        # as compute_pred_user_item does, instead of returning 0
        return int(user_mean)
    return int(user_mean + num / denom)

In [None]:
user = data_similarity.iloc[9996]['ID']
artist = "david bowie"
user_has_artist = set(data_similarity[(data_similarity['Artist'] == artist) & (data_similarity['ID'] != user)]['ID'])

compute_pred_user_item_with_matrix(user, artist, user_has_artist, user_sim)

A second implementation that does not use the similarity matrix and instead computes the similarities on the fly:

In [None]:
def compute_pred_user_item(user, artist, df):
    """Mean-centred prediction computing the Jaccard similarities on the fly."""
    user_rows = df[df['ID'] == user]
    user_artists = set(user_rows['Artist'])
    user_mean = user_rows['Plays'].mean()
    # the other users who have listened to the artist
    artist_has_user = set(df[(df['Artist'] == artist) & (df['ID'] != user)]['ID'])

    num = 0
    denom = 0

    for x in artist_has_user:
        rows_x = df[df['ID'] == x]
        user_x_artists = set(rows_x['Artist'])
        sim_with_x = len(user_artists & user_x_artists) / len(user_artists | user_x_artists)

        plays_x = rows_x.loc[rows_x['Artist'] == artist, 'Plays'].iloc[0]
        num += sim_with_x * (plays_x - rows_x['Plays'].mean())
        denom += sim_with_x

    if denom == 0:
        # no similar user has listened to the artist: fall back to the user's own mean
        return user_mean
    return user_mean + num / denom