# MLBD : Project on the music dataset

The base subject is: predicting a playlist that satisfies all members of a group (e.g., to decide which music to play at a party). By playlist we mean a set of artists.

Research question:
- Can we generate a playlist of artists for multiple users based on what they have listened to?

## 1. Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import time

import ast
import os

## 2. Load the data

In [None]:
DATA_FOLDER = 'data/'
TOP_FOLDER = DATA_FOLDER + 'lastfm-dataset-360k/'
GROUP_FOLDER = 'data/groups/'
# Using just the artists
# Note: error_bad_lines was removed in pandas 2.0; on_bad_lines = 'skip' (pandas >= 1.3) also skips malformed lines
top_user = pd.read_csv(TOP_FOLDER + 'usersha1-profile.tsv', sep = '\t', on_bad_lines = 'skip', header = None)
top_data = pd.read_csv(TOP_FOLDER + 'usersha1-artmbid-artname-plays.tsv', sep = '\t', on_bad_lines = 'skip', header = None)

# This file was created using the data expansion done in part 3
spotify_data = pd.read_csv(DATA_FOLDER + 'full_spotify_info.csv', on_bad_lines = 'skip', header = 0)

In [None]:
top_user.head()

In [None]:
top_data.head()

As we can see, the columns do not have proper names, so we rename them in this cell:

In [None]:
#Need to rename the columns
top_user.rename(columns = {0 : 'ID', 1 : 'Gender', 2 : 'Age', 3 : 'Country', 4 : 'Registered'}, inplace = True)
top_data.rename(columns = {0 : 'ID', 1 : 'Artist_ID', 2 : 'Artist', 3 : 'Plays'}, inplace = True)

In [None]:
top_user.shape

In [None]:
top_data.shape

We have 359'347 users in the users dataset and 17'535'655 entries of type (user, Artist, Plays)

In [None]:
# we'll check the number of NaNs for each dataset
print(top_user.isna().sum(), '\n')
print(top_data.isna().sum(), '\n')

There are no missing entries in the user dataset. However, many artist IDs and some artist names are missing. We therefore remove the 204 entries with no artist name, as we cannot recommend them.

In [None]:
to_drop = top_data[top_data['Artist'].isna()].index
top_data = top_data.drop(to_drop)

Let's see how many artists we have, and how many users have listened to at least one artist:

In [None]:
print(f"The dataset has {len(top_data.groupby('Artist').count())} artists")

In [None]:
nb_users_in_top_data = len(top_data.groupby('ID').count())
print(f"The music dataset has {nb_users_in_top_data} users, meaning that {359347-nb_users_in_top_data} did not listen to anything and have therefore no matching entries")

## 3. Spotify API Data analysis

We first need to change the Info column from a string to a dictionary

In [None]:
add_data = spotify_data.copy()
add_data['Info'] = add_data['Info'].map(lambda x: x if isinstance(x, str) else "{}").map(lambda x: ast.literal_eval(x))
add_data.head()
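
As a small illustration of the conversion above, the sketch below parses a made-up `Info` value. The sample string and the helper name `parse_info` are hypothetical and only mimic the shape of an artist record; the real column may contain more fields.

```python
import ast

# Hypothetical raw values, as they might appear in the 'Info' column
raw_info = ["{'genres': ['punk', 'skate punk'], 'popularity': 60}", float('nan')]

def parse_info(value):
    # Non-string entries (e.g. NaN for artists without a match) become an empty dict
    return ast.literal_eval(value) if isinstance(value, str) else {}

parsed = [parse_info(v) for v in raw_info]
print(parsed[0]['genres'])
print(parsed[1])
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this kind of stored dictionary.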

In [None]:
add_data['Genres'] = add_data['Info'].map(lambda x: x['genres'] if len(x) > 0 else [])
add_data.head()

In [None]:
genres = add_data['Genres'].tolist()
all_genres = [item for sublist in genres for item in sublist]
unique_genres = set(all_genres)
len(unique_genres)

In [None]:
from collections import Counter
c = Counter(all_genres)
c.most_common(25)

In [None]:
def most_common_genre(l, c) :
    best = ""
    best_num = 0
    for elem in l:
        if(c[elem] > best_num) :
            best = elem
            best_num = c[elem]
    return best

In [None]:
add_data['Most_common_genre'] = add_data['Genres'].map(lambda x: most_common_genre(x, c))
add_data.head()
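
To make the selection rule concrete, here is a compact sketch of the same idea on toy data: among the genres of one artist, keep the genre that is the most frequent over the whole dataset. The function name `most_common_genre_compact` and the toy genres are assumptions made for this example.

```python
from collections import Counter

# Toy list of all genres over all artists (hypothetical values)
all_genres_toy = ['rock', 'rock', 'pop', 'rock', 'pop', 'jazz']
counts = Counter(all_genres_toy)

def most_common_genre_compact(genres, counts):
    # Genre of this artist with the highest global count; "" if the artist has no genre
    return max(genres, key=lambda g: counts[g], default="")

print(most_common_genre_compact(['jazz', 'pop'], counts))
print(repr(most_common_genre_compact([], counts)))
```

Like the loop above, `max` keeps the first genre in case of a tie.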

In [None]:

add_data.groupby('Most_common_genre').size().hist(bins = 25, figsize = (13, 5))
plt.yscale('log')
plt.title('Number of artists which have the same best genre')
plt.xlabel('Number of artists')
_ = plt.ylabel('Number of genres')

In [None]:
add_data.groupby('Most_common_genre').size()

In this case, we see that there are 11522 artists which do not have any genres attached to them, this probably comes from a lack of information about these artists in general from Spotify.

In [None]:
add_data = add_data[add_data['Genres'].map(len) > 0]

In [None]:
add_data.groupby('Most_common_genre').size().hist(bins = 25, figsize = (13, 5))
plt.yscale('log')
plt.title('Number of artists which have the same best genre (without no genre)')
plt.xlabel('Number of artists')
_ = plt.ylabel('Number of genres')

We can see that there are some issues with the Spotify API matching: for example, an artist such as She is matched to Ed Sheeran. We will mitigate this by setting a higher threshold on the number of users listening to the same artist.

In [None]:
top_data_percent = top_data.copy()
top_data_percent['Percent'] = top_data['Plays'] / top_data[['ID', 'Plays']].groupby('ID').Plays.transform('sum')
top_data_percent.head()
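
The `Percent` column is the share of a user's total plays that goes to each artist. A minimal sketch on toy data (the values are made up) shows what `transform('sum')` does: it broadcasts each user's total back to every row of that user.

```python
import pandas as pd

toy = pd.DataFrame({
    'ID': ['u1', 'u1', 'u2'],
    'Artist': ['a', 'b', 'a'],
    'Plays': [30, 10, 5],
})

# Per-user total, aligned with the original rows: [40, 40, 5]
user_totals = toy.groupby('ID')['Plays'].transform('sum')
toy['Percent'] = toy['Plays'] / user_totals
print(toy)
```

Each user's `Percent` values sum to 1, which makes users with very different listening volumes comparable.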

In [None]:
all_data = top_data_percent.merge(add_data, on = 'Artist')
all_data.head()

In [None]:
all_data[['Most_common_genre', 'Percent']].groupby('Most_common_genre').sum().hist(bins = 100, figsize = (13, 5))
plt.yscale('log')
plt.title('Sum of percent of listening time per genre')
plt.xlabel('Sum of percent')
_ = plt.ylabel('Number of genres')

In [None]:
all_data[['Most_common_genre', 'Percent']].groupby('Most_common_genre').sum().sort_values(by = 'Percent')

In [None]:
all_data

Instead of the 17'535'655 entries, we now have 15'553'756 entries that carry a lot more information.

## 4. Exploratory data analysis

Now we look at the number of users that have listened to each artist

In [None]:
artists_nb_users_listen= all_data.groupby(['Artist']).size().sort_values(ascending = True).reset_index(name = 'nb_users')
# we check how many times an artist occurs in dataset
artists_nb_users_listen

In [None]:
artists_nb_users_listen.describe()

As expected, the distribution is heavily right-skewed. Among the artists that have a matching entry in the Spotify API, 25% have been listened to by at most 37 users, 50% by at most 80, and 75% by at most 242.


In [None]:
sns.histplot(artists_nb_users_listen['nb_users'], log_scale = True)
plt.title("Nb users listening to one artist")
plt.ylabel("Nb of artists")
plt.xlabel("Nb of users")


#### Now look at the users

In [None]:
users_nb_artists_listen = top_data.groupby('ID').size().sort_values().reset_index(name = 'nb_artists_listen')

In [None]:
users_nb_artists_listen.describe()

Interestingly, although we have a lot of artists, each user listens to only a few of them. On average, a user has listened to 49 different artists, and the quartiles are close to each other, so users are comparable in terms of the number of artists they listen to.

Let's now see where our users come from. It is possible that a user's country has an impact on what they listen to.

In [None]:
nb_users_per_country = top_user.groupby('Country').size().reset_index(name = 'nb_users')

In [None]:
nb_users_per_country = nb_users_per_country.sort_values('nb_users')

In [None]:
#Only plotting the countries with more than 2000 users in the database
fig, ax = plt.subplots(figsize=(10,10))
sns.barplot(ax= ax, data = nb_users_per_country[nb_users_per_country['nb_users']>2000], x = 'nb_users', y = 'Country', )

We can observe that most of our users come from the United States, Germany and the United Kingdom. More generally, most of the well-represented countries belong to European music culture, with the exception of Brazil, Japan, Turkey, Mexico, Chile and Argentina. We therefore expect the most listened artists to come from this culture.

Let's now merge the two datasets to continue the exploration

In [None]:
top_merged = all_data.merge(top_user, left_on='ID', right_on='ID')
top_merged = top_merged.drop(columns=['Artist_ID']) # Drop the Artist_ID column

In [None]:
# Note: we lose approximately 100k users when merging the two datasets above.
# We cannot recover them, but it is worth mentioning in the EDA.

# We eliminate users with 10 or fewer artists left after the merge (we'll use a trivial recommendation for them)
top_merged_IDs = top_merged.groupby(['ID']).size().reset_index()
users_id = top_merged_IDs[top_merged_IDs[0] > 10]['ID']
top_merged = top_merged[top_merged['ID'].isin(users_id)]

Let's see whether artists are listened to more in some countries than in others

In [None]:
artists_per_country = top_merged[['Artist', 'Country', 'ID', 'Plays']].groupby(['Artist','Country']).agg({'ID': len,
                             'Plays': 'sum'}).rename(columns = {'ID':'nb_users'})

In [None]:
artists_per_country

With this, we can also look at the most listened artist in one country:

In [None]:
top1_artist_per_country = artists_per_country.unstack(1, fill_value = 0)

In [None]:
top1_artist_per_country = top1_artist_per_country.reset_index()
top1_artist_per_country

In [None]:
top1_artist_per_country = top1_artist_per_country.set_index(top1_artist_per_country['Artist']).drop(columns = 'Artist')

In [None]:
for country in top1_artist_per_country['nb_users']:
    print(f"Most listened artist in {country}: {top1_artist_per_country['nb_users'][country].idxmax()}, with {top1_artist_per_country['nb_users'][country].max()} users")

Look at it in terms of plays:

In [None]:
for country in top1_artist_per_country['Plays']:
    print(f"Most listened artist in {country}: {top1_artist_per_country['Plays'][country].idxmax()}, with {top1_artist_per_country['Plays'][country].max()} plays")

We observe that the number of plays and the number of users listening to an artist do not always lead to the same top result. However, there seem to be cultural differences between countries, leading to different top groups and genres. For example, most Nordic countries have a rock/metal group in the top position, while Western European countries tend to listen to pop.


We therefore inspect, for each user, the maximum number of plays they have for a single artist.

In [None]:
max_top = top_data.groupby(['ID'])['Plays'].max()

In [None]:
max_top.reset_index().hist(bins = 100, figsize = (13, 5))
plt.yscale('log')
plt.title('Number of plays for most listened artist per user')
plt.xlabel('Number of plays')
plt.ylabel('Number of users')

In [None]:
top_data[top_data['Plays'] == top_data['Plays'].max()]

Most users are found at the start of the scale (notice the logarithmic y-axis), but some users stand out, with the maximum number of plays sitting at 419157. After some research, nofx, the artist this user has been listening to, mainly makes tracks of about 2 minutes; even so, this user has listened to roughly 1.6 years of nofx in about 4 years. We strongly suspect that this is due to a bot. To avoid this kind of bias, we argue that the number of users listening to an artist is more representative of the artist's fame in the corresponding country than the number of plays.
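
The figure of roughly 1.6 years can be checked with a quick back-of-the-envelope computation. The 2-minute track length is the approximation stated above, not a measured value.

```python
max_plays = 419157        # maximum number of plays found in the dataset
minutes_per_track = 2     # assumed average track length

total_minutes = max_plays * minutes_per_track
minutes_per_year = 60 * 24 * 365
years_of_listening = total_minutes / minutes_per_year
print(f"{years_of_listening:.2f} years of continuous listening")
```

Spread over about 4 years, this corresponds to listening to the same artist for around 40% of every day, which is what makes a bot the most plausible explanation.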

In [None]:
top_user[top_user['ID'] == '8d0384537845e7f2b1b8b3e8a9f67eb8d9439794']

## 5. Measurement of the quality of the individual recommendation

In [None]:
# Randomly splits the rows of each user into a train and a test set
def split_train_test(df, train_size = 0.9, seed = 42, apply_seed = False):
    if apply_seed:
        np.random.seed(seed)
    train_parts, test_parts = [], []
    
    # sort = False keeps the users in their order of first appearance
    for _, user_sub in df.groupby('ID', sort = False):
        randomization = np.random.permutation(user_sub.index)
        cut = int(len(randomization)*train_size)
        train_parts.append(user_sub.loc[randomization[:cut]])
        test_parts.append(user_sub.loc[randomization[cut:]])
    
    # DataFrame.append was removed in pandas 2.0, so we concatenate once at the end
    return pd.concat(train_parts), pd.concat(test_parts)


Computes just mean absolute error

In [None]:
def compute_mae(pred_method, helper_df):
    mae = 0
    for i, row in test.iterrows():
        user = row.ID
        artist = row.Artist
        prediction = pred_method(user, artist, helper_df)
        mae += abs(row.Plays - prediction)
    return mae/len(test)

Computes accuracy, precision and recall

In [None]:
def compute_appreciation(pred_method, helper_df, user_specific_threshold = None, threshold = 100):
    predictions = np.zeros(len(test))
    reals = np.zeros(len(test))
    indice = 0
    for i, row in test.iterrows():
        user = row.ID
        artist = row.Artist
        prediction = pred_method(user, artist, helper_df)
        predictions[indice] = prediction > threshold
        reals[indice] = row.Plays > threshold
        indice += 1
    
    tp = np.sum(np.bitwise_and(predictions==1, reals == 1))
    fp = np.sum(np.bitwise_and(predictions==1, reals == 0))
    
    fn = np.sum(np.bitwise_and(predictions== 0, reals == 1))
    
    acc = np.sum(predictions == reals)
    
    # Relies on the global `test` dataframe, like compute_mae above
    return acc/len(test), tp/(tp+fp), tp/(tp + fn)

Computes mae, accuracy, precision and recall in one go

In [None]:
def compute_mae_and_app(test_df, pred_method, helper_df, threshold = 250):
    predictions = np.zeros(len(test_df))
    reals = np.zeros(len(test_df))
    indice = 0
    mae = 0
    for i, row in test_df.iterrows():
        user = row.ID
        artist = row.Artist
        prediction = pred_method(user, artist, helper_df)
        mae += abs(row.Plays - prediction)
        predictions[indice] = prediction > threshold
        reals[indice] = row.Plays > threshold
        indice += 1
    
    tp = np.sum(np.bitwise_and(predictions==1, reals == 1))
    fp = np.sum(np.bitwise_and(predictions==1, reals == 0))
    
    fn = np.sum(np.bitwise_and(predictions== 0, reals == 1))
    
    acc = np.sum(predictions == reals)
    
    return mae/len(test_df), acc/len(test_df), tp/(tp+fp), tp/(tp + fn)
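
To make the metrics concrete, here is a small worked example on made-up play counts. A "like" is a number of plays above the threshold, so the regression output is turned into a binary classification before accuracy, precision and recall are computed.

```python
import numpy as np

threshold = 250
real_plays = np.array([300, 100, 400, 50, 260])   # hypothetical true plays
pred_plays = np.array([280, 270, 100, 40, 500])   # hypothetical predictions

real_like = real_plays > threshold   # [True, False, True, False, True]
pred_like = pred_plays > threshold   # [True, True, False, False, True]

tp = np.sum(pred_like & real_like)    # predicted like, really liked
fp = np.sum(pred_like & ~real_like)   # predicted like, not liked
fn = np.sum(~pred_like & real_like)   # predicted dislike, really liked

mae = np.mean(np.abs(real_plays - pred_plays))
accuracy = np.mean(pred_like == real_like)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(mae, accuracy, precision, recall)
```

Note that precision is undefined when no prediction exceeds the threshold (`tp + fp == 0`), which can happen with very conservative predictors.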

In [None]:
def compute_mae_and_app_knn(test_df, pred_method, threshold = 100):
    predictions = np.zeros(len(test_df))
    reals = np.zeros(len(test_df))
    indice = 0
    mae = 0
    for i, row in test_df.iterrows():
        user = row.ID
        artist = row.Artist
        prediction = pred_method(user, artist, test_df)
        if type(prediction) == np.float64:
            mae += abs(row.Plays - prediction)
            predictions[indice] = prediction > threshold
            reals[indice] = row.Plays > threshold
            indice += 1
    
    tp = np.sum(np.bitwise_and(predictions==1, reals == 1))
    fp = np.sum(np.bitwise_and(predictions==1, reals == 0))

    fn = np.sum(np.bitwise_and(predictions== 0, reals == 1))

    acc = np.sum(predictions == reals)
    
    return mae/len(test_df), acc/len(test_df), tp, fp

In [None]:
def compute_n_rounds(pred_method, nb_users_selected = 500, n = 10, seed = 42):
    """
    Computes the mae, accuracy, precision and recall on n round of the pred_method
    
    pred_method = method that allows to compute the prediction
    nb_users_selected = number of users in the sub-sample
    n = number of rounds
    seed = seed for random generation of sub-samples
    """
    np.random.seed(seed)
    maes, accs, precs, recs = [],[],[],[]
    nb_users = len(top_merged[['ID']].groupby('ID'))
    df = top_merged.groupby('ID').size()
    rng = np.random.default_rng(seed = seed)
    
    for i in range(n):
        print(f"===== Epoch {i} =====")
        selected_users = rng.choice(nb_users, nb_users_selected, replace = False) # Random positions of nb_users_selected distinct users
        subset_500_users = df.iloc[selected_users] # Positional indexing, since df is indexed by user ID
        subset = top_merged[top_merged.ID.isin(subset_500_users.index)]
        train, test = split_train_test(subset, train_size=0.95)

        if pred_method == compute_pred_avg_user:
            avg_user_listens = train.groupby("ID")['Plays'].mean().reset_index()
            mae, acc, prec, rec = compute_mae_and_app(test, pred_method, avg_user_listens)

        elif pred_method == compute_pred_avg_artist:
            avg_artist_listens = train.groupby("Artist")['Plays'].mean().reset_index()
            mae, acc, prec, rec = compute_mae_and_app(test, pred_method, avg_artist_listens)
        elif pred_method == compute_pred_knn:
            # Note: compute_mae_and_app_knn returns (mae, acc, tp, fp), so prec and rec hold raw counts here
            mae, acc, prec, rec = compute_mae_and_app_knn(subset, pred_method)
        else:
            mae, acc, prec, rec = compute_mae_and_app(test, pred_method, train) # Similarity measures use the train set as helper
        
        maes.append(mae)
        accs.append(acc)
        precs.append(prec)
        recs.append(rec)
        
    return maes, accs, precs, recs


## 6. Measurement of the quality of the group recommender systems

For the group recommendation system, we use a least misery principle, meaning that we want to satisfy as many users as possible. For each user we take the number of plays of an artist: if it is above a certain threshold, we consider that the user likes the artist, otherwise not. We then aggregate all the individual predictions and rank the artists by their number of likes.

To measure the performance of this ranking, we use the Discounted Cumulative Gain (DCG), normalized by the ideal DCG (IDCG).
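
As a rough sketch of this measure, the example below computes the ratio DCG/IDCG for a predicted ordering of artists, where each value is the true number of likes of the artist placed at that rank. It assumes a discount of $\log_2(rank)$, with rank 1 using a discount of 1; the helper names `discount` and `ndcg` are hypothetical.

```python
import math

def discount(rank):
    # log2(rank), except that rank 1 (where log2 is 0) uses a discount of 1
    return math.log2(rank) if rank > 1 else 1.0

def ndcg(true_likes_in_predicted_order):
    # DCG of the predicted ordering
    dcg = sum(g / discount(r)
              for r, g in enumerate(true_likes_in_predicted_order, start=1))
    # IDCG: same gains, sorted in the ideal (descending) order
    ideal = sorted(true_likes_in_predicted_order, reverse=True)
    idcg = sum(g / discount(r) for r, g in enumerate(ideal, start=1))
    return 1.0 if idcg == 0 else dcg / idcg

print(ndcg([5, 3, 1]))   # ordering identical to the ideal one
print(ndcg([1, 5, 3]))   # least liked artist ranked first
```

With this discount, ranks 1 and 2 are weighted equally, so swapping the first two artists does not change the score.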

In [None]:
def get_group_recommendation_according_to_our_algorithm(users, same_artists, df, threshold = 250):
    rows = []
    for artist in same_artists:
        nb_likes = 0
        for user in users:
            if df[(df['ID'] == user) & (df['Artist'] == artist)].iloc[0]['Plays'] > threshold:
                nb_likes += 1
        rows.append({"Artist": artist, "nb_likes" : nb_likes})
    # Build the dataframe once; DataFrame.append was removed in pandas 2.0
    recommendations = pd.DataFrame(rows, columns = ["Artist", "nb_likes"])
    return recommendations.sort_values("nb_likes", ascending=False).reset_index(drop = True)
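
The aggregation itself can be illustrated without pandas. In this sketch, the nested dictionary `plays` and the function `rank_artists` are made up for the example; the logic counts, for each artist, how many group members have more plays than the threshold.

```python
threshold = 250

# Hypothetical play counts: plays[user][artist]
plays = {
    'u1': {'a': 300, 'b': 100},
    'u2': {'a': 260, 'b': 400},
    'u3': {'a': 10, 'b': 90},
}

def rank_artists(plays, artists, threshold):
    # Number of group members whose plays exceed the threshold, per artist
    likes = {artist: sum(user_plays[artist] > threshold
                         for user_plays in plays.values())
             for artist in artists}
    # Most liked artist first
    return sorted(likes.items(), key=lambda item: item[1], reverse=True)

ranking = rank_artists(plays, ['a', 'b'], threshold)
print(ranking)
```

Here artist `a` is liked by two members and artist `b` by one, so `a` is recommended first.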

In [None]:
def get_nb_plays(user, artist, df):
    res = df[(df['ID'] == user) & (df['Artist'] == artist)]
    if len(res) == 0:
        return -1
    else:
        return res.iloc[0]['Plays']

def test_recommender(pred_method, df, helper_df, group_of_users, same_artists, threshold = 250):
    list_of_artists = dict()
    
    for user in group_of_users:
        nb_likes = 0
        for artist in df[(df['ID'] == user) & (df['Artist'].isin(same_artists))]['Artist']:
            list_of_artists[artist] = 0
    
    for artist in list_of_artists.keys():
        for user in group_of_users:
            if(user == group_of_users[0]): #Always takes first user because selection is randomized
                nb_plays = get_nb_plays(user, artist, df)
            
            else: nb_plays = -1
                
            if nb_plays != -1:
                if nb_plays > threshold:
                    list_of_artists[artist] += 1
            
            else:
                prediction = pred_method(user, artist, helper_df)
                if prediction > threshold:
                    list_of_artists[artist] += 1
    
    # Build the dataframe once; DataFrame.append was removed in pandas 2.0
    recommendation = pd.DataFrame(list(list_of_artists.items()), columns = ["Artist", "nb_likes"])
    
    return recommendation.sort_values("nb_likes", ascending = False).reset_index(drop = True)

In [None]:
def dcg_idcg(reals, preds):
    reals['Rank'] = [i for i in range(1, len(reals)+ 1)]
    preds['Rank'] = [i for i in range(1, len(reals)+ 1)] #Not same rank for items rated equally
    final = reals.merge(preds, on = 'Artist')
    log_ranks_pred = np.log2(final['Rank_y'])
    log_ranks_pred = log_ranks_pred.where(log_ranks_pred > 0, 1)
    log_ranks_real = np.log2(final['Rank_x'])
    log_ranks_real = log_ranks_real.where(log_ranks_real > 0, 1)
    DCG = np.sum(final['nb_likes_x']/log_ranks_pred)
    IDCG = np.sum(final['nb_likes_x']/log_ranks_real)
    if IDCG == 0: #Case nobody has liked anything
        return 1.0
    return DCG/IDCG

In [None]:
def measure_group_recommendation(pred_method, group_df, helper_df, rng):
    users = group_df['ID'].unique()
    measures = []
    for nb_pred in range(10):
        random_6_users = rng.choice(len(users), 6, replace = False)
        my_6_users_group = users[random_6_users]
        
        # Artists that every selected user has listened to
        aggregated_set = group_df[group_df['ID'].isin(my_6_users_group)][['ID','Artist']].groupby('ID').agg(set)
        same_artists = aggregated_set.iloc[0]['Artist']
        for i,row in aggregated_set.iterrows():
            same_artists = same_artists.intersection(row['Artist'])
        
        random_5_artists = np.array(list(same_artists))
        random_5_artists = random_5_artists[rng.choice(len(same_artists), 5, replace = False)]
        
        # Use the group_df argument and the selected users, not the global `group` variable
        reals = get_group_recommendation_according_to_our_algorithm(my_6_users_group, random_5_artists, group_df)
        preds = test_recommender(pred_method, group_df, helper_df, my_6_users_group, random_5_artists)
        measures.append(dcg_idcg(reals, preds))
    return measures

## 7. Testing different prediction methods

### Predicting only based on the user average listens

Trivially predict the mean of the user: $\large pred(u,i) = \mu_{u} = \sum_{k \in I(u)} \frac{Plays(u, k)}{|I(u)|}$

where $u$ is the user we are making the prediction for, $i$ is the artist whose number of plays we want to predict, and $I(u)$ is the set of artists the user has listened to.

The performance of this baseline is measured below with the functions defined in sections 5 and 6.

In [None]:
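def compute_pred_avg_user(user, artist_i, avg_listens):
    # .iloc[0] extracts the scalar; calling int() on a one-element Series is deprecated in recent pandas
    return int(avg_listens[avg_listens['ID'] == user]['Plays'].iloc[0])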
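
A toy example of this baseline, with made-up values: the prediction for a user is the same for every artist, namely the mean of the user's plays in the training data. The names `user_means` and `predict_user_mean` are assumptions for this sketch.

```python
import pandas as pd

train_toy = pd.DataFrame({
    'ID': ['u1', 'u1', 'u2'],
    'Artist': ['a', 'b', 'a'],
    'Plays': [100, 300, 50],
})

# One mean per user, indexed by user ID
user_means = train_toy.groupby('ID')['Plays'].mean()

def predict_user_mean(user):
    # The artist is ignored: the prediction only depends on the user
    return user_means.loc[user]

print(predict_user_mean('u1'), predict_user_mean('u2'))
```

Indexing the means by user ID avoids filtering the whole dataframe at each prediction.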

#### Results on individual prediction

In [None]:
start = time.time()
maes_users, accs_users, precs_users, recs_users = compute_n_rounds(compute_pred_avg_user)
end = time.time()
print(f"Time required to do the prediction on 10 rounds {end - start}")

In [None]:
sns.barplot(data = pd.DataFrame(maes_users, columns = ["Mae_User_avg"]))
plt.ylabel("Mean absolute error")
plt.title("Mean absolute error on 10 runs")

In [None]:
sns.barplot(data = pd.DataFrame({"Accuracy_User_avg":accs_users, "Precision_User_avg":precs_users, "Recall_User_avg":recs_users}))

#### Results on group prediction

In [None]:
start = time.time()
group_pred_measures = []
rng = np.random.default_rng(seed = 42)
for i in range(10):
    measures = []
    for file in os.listdir(GROUP_FOLDER):
        group = pd.read_csv(f'{GROUP_FOLDER}{file}').drop(columns = ['Unnamed: 0'])
        measures.append(measure_group_recommendation(compute_pred_avg_user, group, group.groupby('ID')['Plays'].mean().reset_index(), rng))
    group_pred_measures.append(np.hstack(np.array(measures)).mean())
    print(f"Finished round {i}")
groups_user_avg = group_pred_measures
end = time.time()
print(f"Time required to do the predictions on 10 rounds {end - start}")

In [None]:
sns.barplot(data = pd.DataFrame(groups_user_avg, columns = ["DCG"]))
plt.title("DCG on 120 groups for 10 epochs (User Avg)")
plt.ylim(0.8, 1.0)

### Predicting only based on the artist average listens


Trivially predict the mean of the artist for each user: $\large pred(u,i) = \mu_{i} =\sum_{v \in U(i)} \frac{Plays(v,i)}{|U(i)|}$

where $U(i)$ is the set of users that have listened to artist $i$



In [None]:
def compute_pred_avg_artist(user, artist_i, avg_listens):
    prediction = avg_listens[avg_listens['Artist'] == artist_i]
    if len(prediction) == 0: 
        return int(avg_listens['Plays'].mean()) # Artist never seen: fall back to the mean over all artists
    else:
        return int(prediction['Plays'].iloc[0])
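
The same baseline on toy data, including the fallback for an artist that does not appear in the training set. The values and the names `artist_means` and `predict_artist_mean` are made up for this sketch.

```python
import pandas as pd

train_toy = pd.DataFrame({
    'ID': ['u1', 'u2', 'u1'],
    'Artist': ['a', 'a', 'b'],
    'Plays': [100, 300, 50],
})

# One mean per artist, indexed by artist name
artist_means = train_toy.groupby('Artist')['Plays'].mean()
# Fallback for unseen artists: the mean of the per-artist means
global_mean = artist_means.mean()

def predict_artist_mean(artist):
    return artist_means.get(artist, global_mean)

print(predict_artist_mean('a'), predict_artist_mean('unknown'))
```

The fallback is one possible choice among several; the mean over all individual plays would be another reasonable default.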

#### Results on individual prediction

In [None]:
start_artist = time.time()
maes_artists, accs_artists, precs_artists, recs_artists = compute_n_rounds(compute_pred_avg_artist)
end_artist = time.time()
print(f"Time required to do the prediction on 10 rounds {end_artist - start_artist}")

In [None]:
sns.barplot(data = pd.DataFrame(maes_artists, columns = ["Mae_Artist_avg"]))
plt.ylabel("Mean absolute error")
plt.title("Mean absolute error on 10 runs")

In [None]:
sns.barplot(data = pd.DataFrame({"Accuracy_Artist_avg":accs_artists, "Precision_Artist_avg":precs_artists, "Recall_Artist_avg":recs_artists}))

#### Results on Group prediction

In [None]:
start = time.time()
group_pred_measures = []
rng = np.random.default_rng(seed = 42)
avg_artist_plays = top_merged.groupby('Artist')['Plays'].mean().reset_index()
for i in range(10):
    measures = []
    for file in os.listdir(GROUP_FOLDER):
        group = pd.read_csv(f'{GROUP_FOLDER}{file}').drop(columns = ['Unnamed: 0'])
        measures.append(measure_group_recommendation(compute_pred_avg_artist, group,
                                                     avg_artist_plays[avg_artist_plays['Artist'].isin(group['Artist'])] , rng))
    group_pred_measures.append(np.hstack(np.array(measures)).mean())
    print(f"Finished round {i}")
groups_artist_avg = group_pred_measures
end = time.time()
print(f"Time required to do the predictions on 10 rounds {end - start}")

In [None]:
sns.barplot(data = pd.DataFrame(groups_artist_avg, columns = ["DCG"]))
plt.title("DCG on 120 groups for 10 epochs")
plt.ylim(0.8, 1.0)

###  Using User similarity

**User specific prediction**: Compute the similarity between two users as the Jaccard similarity of the sets of artists they have listened to.

$\large sim(u,v)$ = Jacc$(I(u), I(v))$ = $\Large \frac{|I(u) \cap I(v)|}{|I(u) \cup I(v)|}$

Once we have this similarity, we estimate whether a new artist will be listened to a lot by the user by looking at all the other users that have listened to this artist. This gives the user-specific term:

$\large \tilde{u} =  \frac{\sum_{v \in U(i)} sim(u,v) \cdot (Plays(v,i) - \mu_{v})}{ \sum_{v \in U(i)} sim(u,v)} $

where $U(i)$ is the set of users that have listened to the artist $i$

and add the mean of the user to it so we have:

$\large pred(u, i) = \mu_{u} + \tilde{u} $
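
The formulas above can be followed step by step on a tiny example. The play counts are made up, and the helper names (`mean_plays`, `jaccard`, `predict`) are assumptions for this sketch.

```python
# Hypothetical play counts: plays[user][artist]
plays = {
    'u': {'a': 100, 'b': 300},
    'v': {'a': 200, 'b': 400, 'c': 600},
    'w': {'c': 50, 'd': 150},
}

def mean_plays(user):
    return sum(plays[user].values()) / len(plays[user])

def jaccard(u, v):
    items_u, items_v = set(plays[u]), set(plays[v])
    return len(items_u & items_v) / len(items_u | items_v)

def predict(u, artist):
    # U(i): all other users who have listened to the artist
    others = [v for v in plays if v != u and artist in plays[v]]
    num = sum(jaccard(u, v) * (plays[v][artist] - mean_plays(v)) for v in others)
    denom = sum(jaccard(u, v) for v in others)
    # Fall back to the user's mean when no similar user exists
    return mean_plays(u) + (num / denom if denom else 0.0)

print(jaccard('u', 'v'), jaccard('u', 'w'))
print(predict('u', 'c'))
```

User `v` shares two of three artists with `u` (similarity 2/3) and plays `c` 200 times more than their own mean, while `w` shares nothing with `u` and has no influence. The prediction is therefore the mean of `u` (200) plus 200.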

In [None]:
def compute_pred_user_item(user, artist, df):
    user_artists = set(df[df['ID'] == user]['Artist']) # Artists listened to by the user we predict for
    artist_has_user = set(df[(df['Artist'] == artist) & (df['ID'] != user)]['ID']) # Other users who have listened to the artist
    user_mean = df[df['ID'] == user]['Plays'].mean()
    
    num = 0
    denom = 0
    
    for x in artist_has_user:
        other = df[df['ID'] == x]
        user_i_artists = set(other['Artist'])
        sim_with_i = len(user_artists.intersection(user_i_artists))/len(user_artists.union(user_i_artists))
        
        # Mean-centered plays of the other user for this artist, weighted by the Jaccard similarity
        num += sim_with_i * (int(other[other['Artist'] == artist]['Plays'].iloc[0]) - int(other['Plays'].mean()))
        denom += sim_with_i
    
    if denom == 0:
        return user_mean
    else:
        return user_mean + num/denom
    
    
    

#### Results on individual prediction

In [None]:
start_artist = time.time()
maes_sim, accs_sim, precs_sim, recs_sim = compute_n_rounds(compute_pred_user_item)
end_artist = time.time()
print(f"Time required to do the prediction on 10 rounds {end_artist - start_artist}")

In [None]:
sns.barplot(data = pd.DataFrame(maes_sim, columns = ["Mae_User_sim"]))
plt.ylabel("Mean absolute error")
plt.title("Mean absolute error on 10 runs")

In [None]:
sns.barplot(data = pd.DataFrame({"Accuracy_User_sim":accs_sim, "Precision_User_sim":precs_sim, "Recall_User_sim":recs_sim}))

#### Results on group prediction

**This takes too much time to run.**

Furthermore, we observed that this method performed worse than the previous ones on individual predictions, so running the group evaluation would be of little use.

In [None]:
"""
start = time.time()
group_pred_measures = []
df = top_merged.groupby('ID').size()
nb_users = len(df)
rng = np.random.default_rng(seed = 42)
for i in range(10):
    measures = []
    
    selected_users = rng.choice(nb_users, 500, replace = False) #Generate a random list of nb_users_select users
    subset_500_users = df[selected_users]
    subset = top_merged[top_merged.ID.isin(subset_500_users.index)] #Subset to compare our group members too
    
    for file in os.listdir(GROUP_FOLDER):
        group = pd.read_csv(f'{GROUP_FOLDER}{file}').drop(columns = ['Unnamed: 0'])
        measures.append(measure_group_recommendation(compute_pred_user_item, group,
                                                      subset, rng))
    group_pred_measures.append(np.array(measures).flatten().mean())
    print(f"Finished round {i}")
group_user_avg = group_pred_measures
end = time.time()
print(f"Time required to do the predictions on 10 rounds {end - start}")
"""

###  Using KNN

**User specific prediction**: Still computes the similarity between users, but uses only the k nearest neighbors to make the predictions.

$\large \tilde{u} =  \frac{\sum_{v \in KNN(i)} sim(u,v) \cdot (Plays(v,i) - \mu_{v})}{ \sum_{v \in KNN(i)} sim(u,v)} $

where $KNN(i)$ is the set of the k nearest neighbors of $u$ among the users that have listened to the artist $i$.

and then do as before:

$\large pred(u, i) = \mu_{u} + \tilde{u} $
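
Before the scikit-learn version, here is a minimal pure-Python sketch of the same idea, assuming the Jaccard similarity defined in the previous section. The data and the function `knn_predict` are made up for the illustration; only the k most similar users among those who listened to the artist contribute.

```python
# Hypothetical play counts: plays[user][artist]
plays = {
    'u':  {'a': 100, 'b': 300},
    'v1': {'a': 200, 'b': 400, 'c': 600},
    'v2': {'a': 100, 'c': 300},
    'v3': {'c': 50, 'd': 150},
}

def mean_plays(user):
    return sum(plays[user].values()) / len(plays[user])

def jaccard(u, v):
    items_u, items_v = set(plays[u]), set(plays[v])
    return len(items_u & items_v) / len(items_u | items_v)

def knn_predict(u, artist, k):
    candidates = [v for v in plays if v != u and artist in plays[v]]
    # Keep the k candidates most similar to u
    nearest = sorted(candidates, key=lambda v: jaccard(u, v), reverse=True)[:k]
    num = sum(jaccard(u, v) * (plays[v][artist] - mean_plays(v)) for v in nearest)
    denom = sum(jaccard(u, v) for v in nearest)
    return mean_plays(u) + (num / denom if denom else 0.0)

print(knn_predict('u', 'c', k=2))
print(knn_predict('u', 'c', k=1))
```

With `k=2`, the neighbors are `v1` (similarity 2/3, deviation +200) and `v2` (similarity 1/3, deviation +100); with `k=1`, only `v1` is used.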

In [None]:
from sklearn.neighbors import NearestNeighbors
    
def compute_pred_knn(user_knn, artist, df, neighbors = 20):
    artist_has_user = np.unique(df[(df['Artist'] == artist)]['ID']) #Set of users who have listened to the chosen artist
    df = df[df['ID'].isin(artist_has_user)] 
    users = np.unique(df['ID'])
    
    if len(users) >= neighbors:

        knn_pivot = df.pivot(index = 'ID', columns = 'Artist', values = 'Plays').fillna(0)
        model_knn = NearestNeighbors(metric='jaccard', algorithm='brute', n_neighbors=neighbors, n_jobs=-1)
        model_knn.fit(knn_pivot)

        distances, indices =model_knn.kneighbors(knn_pivot, n_neighbors= neighbors)
        distances = distances[:,1:] #eliminate comparing user_i with himself
        indices = indices[:,1:] #eliminate comparing user_i with himself

        indices_lists = zip(users, indices)
        distances_lists = zip(users, distances)
        indices_dict = dict(indices_lists)
        distances_dict = dict(distances_lists)

        if artist in list(df[df['ID'] == user_knn]['Artist']):
            closest_users = users[indices_dict[user_knn]]
            # kneighbors returns Jaccard distances, so similarity = 1 - distance
            similarities = 1 - distances_dict[user_knn]
            sum_of_similarities = similarities.sum()
            target_mean = df[df['ID'] == user_knn]['Plays'].mean()
            prediction = []

            # Separate loop variable, so that user_knn is not overwritten
            for index, neighbor in enumerate(closest_users):
                neighbor_artists = df[df['ID'] == neighbor]
                neighbor_mean = neighbor_artists['Plays'].mean()
                neighbor_plays = int(neighbor_artists[neighbor_artists['Artist'] == artist]['Plays'].iloc[0])
                prediction.append(similarities[index]*(neighbor_plays - neighbor_mean))

            if sum_of_similarities == 0:
                return 0
            else:
                # pred(u, i) = mu_u + weighted deviation of the neighbors
                return target_mean + (np.sum(prediction))/sum_of_similarities
        else:
            return 0
    else:
        return 0

In [None]:
# takes around 15 mins to compute
start = time.time()
maes_knn, accs_knn, precs_knn, recs_knn = compute_n_rounds(compute_pred_knn)
end = time.time()
print(f"Time required to do the prediction on 1 round {end - start}")

In [None]:
sns.barplot(data = pd.DataFrame(maes_knn, columns = ["KNN_sim"]))
plt.ylabel("Mean absolute error")
plt.title("Mean absolute error for knn approach")

In [None]:
sns.barplot(data = pd.DataFrame(accs_knn, columns = ["KNN_sim"]))
plt.ylabel("Accuracy")
plt.title("Accuracy for knn approach")

### Predicting based on the average number of plays per genre and user

The mean number of plays per genre and user is computed as : $\large pred(u,i) = \mu_{u,g} = \sum_{k \in I(u, g)} \frac{Plays(u, k)}{|I(u, g)|}$

where $u$ the user, $i$ the artist, $g$ the most common genre associated with the artist, and $I(u, g)$ the set of artists listened to by the user which have the same most common genre as the artist for which we are computing the prediction.


In [None]:
mean_genres_users = top_merged.groupby(['ID', 'Most_common_genre'])['Plays'].mean()
mean_genres_users = mean_genres_users.reset_index()
mean_genres_users

In [None]:
def compute_pred_genre_avg_user(user, artist, mean_genres_users, artist_most_common) :
    most_common = artist_most_common[artist_most_common['Artist'] == artist].Most_common_genre.iloc[0]
    return mean_genres_users[(mean_genres_users['ID'] == user) & (mean_genres_users['Most_common_genre'] == most_common)].Plays.iloc[0]
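
A toy version of this predictor, with made-up values. Unlike the function above, this sketch returns a fallback value when the user has never listened to the genre; that fallback is an assumption of the example, not something the notebook defines.

```python
import pandas as pd

listens = pd.DataFrame({
    'ID': ['u1', 'u1', 'u1', 'u2'],
    'Artist': ['a', 'b', 'c', 'a'],
    'Most_common_genre': ['rock', 'rock', 'pop', 'rock'],
    'Plays': [100, 300, 50, 10],
})

# Mean plays per (user, genre) pair, indexed by both keys
genre_means = listens.groupby(['ID', 'Most_common_genre'])['Plays'].mean()

def predict_genre_mean(user, genre, fallback=None):
    key = (user, genre)
    # Hypothetical fallback when the user never listened to this genre
    return genre_means.loc[key] if key in genre_means.index else fallback

print(predict_genre_mean('u1', 'rock'))
print(predict_genre_mean('u2', 'pop'))
```

Without such a fallback, a lookup for an unseen (user, genre) pair fails, which matters when the means are computed on the training rows only.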

**Testing individual predictions**

**The previous evaluation functions do not match the signature of this method, so we define dedicated ones.**

In [None]:
def compute_mae_and_app_genre(test_df, mean_genres_users, artist_most_common, threshold = 100):
    predictions = np.zeros(len(test_df))
    reals = np.zeros(len(test_df))
    indice = 0
    mae = 0
    for i, row in test_df.iterrows():
        user = row.ID
        artist = row.Artist
        prediction = compute_pred_genre_avg_user(user, artist, mean_genres_users, artist_most_common)
        mae += abs(row.Plays - prediction)
        predictions[indice] = prediction > threshold
        reals[indice] = row.Plays > threshold
        indice += 1
    
    tp = np.sum(np.bitwise_and(predictions==1, reals == 1))
    fp = np.sum(np.bitwise_and(predictions==1, reals == 0))
    
    fn = np.sum(np.bitwise_and(predictions== 0, reals == 1))
    
    acc = np.sum(predictions == reals)
    
    return mae/len(test_df), acc/len(test_df), tp/(tp+fp), tp/(tp + fn)

In [None]:
def compute_n_rounds_genre(nb_users_selected = 500, n = 10, seed = 42):
    """
    Computes the MAE, accuracy, precision and recall over n rounds of the genre-average prediction
    
    nb_users_selected = number of users in the sub-sample
    n = number of rounds
    seed = seed for random generation of sub-samples
    """
    np.random.seed(seed)
    maes, accs, precs, recs = [],[],[],[]
    df = top_merged[['ID', 'Artist', 'Plays']]
    df2 = df.groupby('ID').size()
    nb_users = len(top_merged.ID.unique())
    
    for i in range(n):
        print(f"===== Epoch {i} =====")
        # Draw nb_users_selected random user positions (with replacement)
        selected_users = np.random.randint(0, nb_users, nb_users_selected)
        subset_500_users = df2.iloc[selected_users]
        subset = df[df.ID.isin(subset_500_users.index)]
        train, test = split_train_test(subset, train_size=0.95)
        test_genre_users = mean_genres_users[mean_genres_users.ID.isin(test.ID)]
        
        mae, acc, prec, rec = compute_mae_and_app_genre(test, test_genre_users, add_data[['Artist', 'Most_common_genre']])
        
        maes.append(mae)
        accs.append(acc)
        precs.append(prec)
        recs.append(rec)
        
    return maes, accs, precs, recs

In [None]:
def test_recommender_genre(df, mean_genres_users, artist_most_common, group_of_users, same_artists, threshold = 250):
    list_of_artists = dict()
    
    for user in group_of_users:
        nb_likes = 0
        for artist in df[(df['ID'] == user) & (df['Artist'].isin(same_artists))]['Artist']:
            list_of_artists[artist] = 0
    
    for artist in list_of_artists.keys():
        for user in group_of_users:
            if(user == group_of_users[0]): #Always takes first user because selection is randomized
                nb_plays = get_nb_plays(user, artist, df)
            
            else: nb_plays = -1
                
            if nb_plays != -1:
                if nb_plays > threshold:
                    list_of_artists[artist] += 1
            
            else:
                prediction = compute_pred_genre_avg_user(user, artist, mean_genres_users, artist_most_common)
                if prediction > threshold:
                    list_of_artists[artist] += 1
    
    # Build the result in one go instead of appending row by row
    recommendation = pd.DataFrame({"Artist": list(list_of_artists.keys()), "nb_likes": list(list_of_artists.values())})
    
    return recommendation.sort_values("nb_likes", ascending = False).reset_index(drop = True)

In [None]:
def measure_group_recommendation_genre(mean_genres_users, artist_most_common, group_df, rng):
    users = group_df['ID'].unique()
    measures = []
    for nb_pred in range(10):
        random_6_users = rng.choice(len(users), 6, replace = False)
        my_6_users_group = users[random_6_users]
        
        aggregated_set = group_df[group_df['ID'].isin(users)][['ID','Artist']].groupby('ID').agg(set)
        same_artists = aggregated_set.iloc[0]['Artist']
        for i,row in aggregated_set.iterrows():
            same_artists = same_artists.intersection(row['Artist'])
        
        random_5_artists = np.array(list(same_artists))
        random_5_artists = random_5_artists[rng.choice(len(same_artists), 5, replace = False)]
        
        reals = get_group_recommendation_according_to_our_algorithm(my_6_users_group, random_5_artists, group_df)
        preds = test_recommender_genre(group_df, mean_genres_users, artist_most_common, my_6_users_group, random_5_artists)
        measures.append(dcg_idcg(reals, preds))
    return measures

#### Individual prediction

In [None]:
import time
start = time.time()
maes_genres, accs_genres, precs_genres, recs_genres = compute_n_rounds_genre()
end = time.time()
print(f"Time required to do the prediction on 10 rounds {end - start}")

In [None]:
sns.barplot(data = pd.DataFrame(maes_genres, columns = ["MAE_Genre_avg"]))
plt.ylabel("Mean absolute error")
plt.title("Mean absolute error on 10 runs")

In [None]:
sns.barplot(data = pd.DataFrame({"Accuracy_Genre_avg":accs_genres, "Precision_Genre_avg":precs_genres, "Recall_Genre_avg":recs_genres}))

In [None]:
start = time.time()
group_pred_measures = []
rng = np.random.default_rng(seed = 42)

for i in range(10):
    measures = []
    for file in os.listdir(GROUP_FOLDER):
        group = pd.read_csv(f'{GROUP_FOLDER}{file}').drop(columns = ['Unnamed: 0'])
        measures.append(measure_group_recommendation_genre(mean_genres_users, add_data[['Artist', 'Most_common_genre']], group, rng))
    group_pred_measures.append(np.hstack(np.array(measures)).mean())
    print(f"Finished round {i}")
groups_genres = group_pred_measures
end = time.time()
print(f"Time required to do the predictions on 10 rounds {end - start}")

In [None]:
sns.barplot(data = pd.DataFrame(groups_genres, columns = ["DCG"]))
plt.title("DCG on 120 groups for 10 epochs")
plt.ylim(0.8, 1.0)

## 8. Comparison between techniques
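To plot several methods and metrics side by side, the measurements need to be in long format: one row per (method, metric, round). Below is a minimal sketch in plain Python of how such rows can be collected as a list of records. The helper `to_long_records` is hypothetical and the numbers are placeholders, not results from this notebook. Passing the list to `pd.DataFrame(records)` would then give a frame with the same three columns used in this section.

```python
def to_long_records(method, metrics_by_name):
    """Flatten {'MAE': [...], 'Accuracy': [...]} into one record per (metric, round)."""
    records = []
    for measurement, values in metrics_by_name.items():
        for value in values:
            records.append({
                "method": method,
                "mae_acc_prec_rec": float(value),
                "measurement_method": measurement,
            })
    return records

# Placeholder values for two rounds of two methods (not real results).
records = to_long_records("Genre Similarity", {"MAE": [101.2, 98.7], "Accuracy": [0.71, 0.69]})
records += to_long_records("User average", {"MAE": [110.4, 107.9], "Accuracy": [0.68, 0.70]})

print(len(records))  # 8
```

Collecting all records first and building the frame once also avoids growing a DataFrame row by row.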

In [None]:
fig = plt.figure(figsize= (15,6))
measurments_df = pd.DataFrame(columns = ['method', 'mae_acc_prec_rec', 'measurement_method']) 


for i in range(len(maes_genres)):
    measurments_df= measurments_df.append({'method':'Genre Similarity', 'mae_acc_prec_rec':maes_genres[i], 'measurement_method':'MAE'}, ignore_index = True)
    measurments_df= measurments_df.append({'method':'Genre Similarity', 'mae_acc_prec_rec':accs_genres[i], 'measurement_method':'Accuracy'}, ignore_index = True)
    measurments_df= measurments_df.append({'method':'Genre Similarity', 'mae_acc_prec_rec':precs_genres[i], 'measurement_method':'Precision'}, ignore_index = True)
    measurments_df= measurments_df.append({'method':'Genre Similarity', 'mae_acc_prec_rec':recs_genres[i], 'measurement_method':'Recall'}, ignore_index = True)

for i in range(len(maes_users)):
    measurments_df= measurments_df.append({'method':'User average', 'mae_acc_prec_rec':maes_users[i], 'measurement_method':'MAE'}, ignore_index = True)
    measurments_df= measurments_df.append({'method':'User average', 'mae_acc_prec_rec':accs_users[i], 'measurement_method':'Accuracy'}, ignore_index = True)
    measurments_df= measurments_df.append({'method':'User average', 'mae_acc_prec_rec':precs_users[i], 'measurement_method':'Precision'}, ignore_index = True)
    measurments_df= measurments_df.append({'method':'User average', 'mae_acc_prec_rec':recs_users[i], 'measurement_method':'Recall'}, ignore_index = True)

for i in range(len(maes_sim)):
    measurments_df= measurments_df.append({'method':'Full-Similarity', 'mae_acc_prec_rec':maes_sim[i], 'measurement_method':'MAE'}, ignore_index = True)
    measurments_df= measurments_df.append({'method':'Full-Similarity', 'mae_acc_prec_rec':accs_sim[i], 'measurement_method':'Accuracy'}, ignore_index = True)
    measurments_df= measurments_df.append({'method':'Full-Similarity', 'mae_acc_prec_rec':precs_sim[i], 'measurement_method':'Precision'}, ignore_index = True)
    measurments_df= measurments_df.append({'method':'Full-Similarity', 'mae_acc_prec_rec':recs_sim[i], 'measurement_method':'Recall'}, ignore_index = True)

for i in range(len(maes_artists)):
    measurments_df= measurments_df.append({'method':'Artist average', 'mae_acc_prec_rec':maes_artists[i], 'measurement_method':'MAE'}, ignore_index = True)
    measurments_df= measurments_df.append({'method':'Artist average', 'mae_acc_prec_rec':accs_artists[i], 'measurement_method':'Accuracy'}, ignore_index = True)
    measurments_df= measurments_df.append({'method':'Artist average', 'mae_acc_prec_rec':precs_artists[i], 'measurement_method':'Precision'}, ignore_index = True)
    measurments_df= measurments_df.append({'method':'Artist average', 'mae_acc_prec_rec':recs_artists[i], 'measurement_method':'Recall'}, ignore_index = True)


sns.barplot(data =  measurments_df[measurments_df['measurement_method']!= 'MAE'] , x= 'method', y='mae_acc_prec_rec', hue='measurement_method')
plt.ylabel('Score (fraction between 0 and 1)')

plt.savefig('data/all_measurements.png')

### MAE comparison

In [None]:
sns.barplot(data =  measurments_df[measurments_df['measurement_method']== 'MAE'] , x= 'method', y='mae_acc_prec_rec')
plt.ylabel('MAE')
plt.savefig('data/mae.png')

### Group measurements (only on the best methods)
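As a reminder of what the group score measures, here is a minimal sketch of a normalized DCG in plain Python. It assumes the usual definition, gain divided by $\log_2(rank + 1)$ and normalized by the ideal ordering. The `dcg_idcg` helper used in this notebook may differ in its details, and the artists and relevance values below are hypothetical.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: rel / log2(rank + 1), ranks starting at 1."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))

def ndcg(predicted_ranking, true_relevance):
    """DCG of the predicted ranking divided by the DCG of the ideal ranking."""
    gains = [true_relevance.get(item, 0) for item in predicted_ranking]
    ideal = sorted(true_relevance.values(), reverse=True)[:len(predicted_ranking)]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical number of group members who really like each artist.
true_likes = {"Artist A": 5, "Artist B": 3, "Artist C": 1}

perfect = ndcg(["Artist A", "Artist B", "Artist C"], true_likes)
swapped = ndcg(["Artist C", "Artist B", "Artist A"], true_likes)
print(perfect, swapped)
```

A ranking identical to the ideal one scores 1.0, and any misordering of artists with different relevance lowers the score.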

In [None]:
NDCG_measures_df =  pd.DataFrame()

for i in range(len(groups_genres)):
    NDCG_measures_df = NDCG_measures_df.append({'method':'Genre Similarity', 'ndcg':groups_genres[i], 'measurement_method':'NDCG'}, ignore_index = True)
    NDCG_measures_df = NDCG_measures_df.append({'method':'User average', 'ndcg':groups_user_avg[i], 'measurement_method':'NDCG'}, ignore_index = True)

In [None]:
sns.barplot(data = NDCG_measures_df, x="method", y="ndcg")
plt.ylim(0.8, 1)

In [None]:
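# Plot the different techniques next to each other.
# Assumption: the values were appended row by row, so the column is cast to float before plotting;
# MAE is left out because it lives on a different scale than the other metrics.
float_measurements_df = measurments_df[measurments_df['measurement_method'] != 'MAE'].astype({'mae_acc_prec_rec': float})
sns.barplot(data = float_measurements_df, x = "measurement_method", y = "mae_acc_prec_rec", hue = "method")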