# Music Artist PersonalRank Recommender based on social network and tag system



**Team 5 : Bo Tong, Honglin Li, Junyi Chen, Li-chen Lin**

## TOC
* [Section 1: Download and Split data set](#section1)
* [Section 2: Tag Clustering](#section2)
    * [Section 2.1: Tag preprocessing](#section21)
    * [Section 2.2: Tag clustering: BERT](#section22)
    * [Section 2.3: Tag clustering: Levenshtein distance](#section23)
    * [Section 2.4: Tag clustering: user artist correlation](#section24)
    * [Section 2.5: Tag clustering Result](#section25)
* [Section 3: PersonalRank-based Recommender](#section3)
    * [Section 3.1: User similarity computation](#section31)
    * [Section 3.2: Graph Construction](#section32)
    * [Section 3.3: PersonalRank implementation: Based on Iteration & Matrix](#section33)
    * [Section 3.4: Example](#section34)
* [Section 4: User-based CF & Item-based CF Recommenders](#section4)
    * [Section 4.1: Calculate similarity matrix for user & item](#section41)
    * [Section 4.2: Predict rating based on item-based or user-based methods](#section42)
    * [Section 4.3: Recommend artists based on user preference](#section43)
* [Section 5: Content-based Recommender](#section5)
* [Section 6: Evaluation](#section6)
    * [Section 6.1: User-based metrics](#section61)
    * [Section 6.2: Item-based metrics](#section62)
    * [Section 6.3: Content-based](#section63)
    * [Section 6.4: PersonalRank without tags](#section64)
    * [Section 6.5: PersonalRank with tags](#section65)

<a class="anchor" id="section1"></a>

## Section 1. Download and Split data set
- For users who have listened to more than 5 artists, we randomly move 40% of their user-artist listening pairs into a held-out (validation/test) data set.
The remaining pairs are used as the train data set.
- At the same time, we copy the corresponding tag records from the user_taggedartists table into the tag train data set.
- Of the 186479 tagging records, only 73358 come from users who have listened to the artist they tagged.
We therefore assume that only tags from users who have listened to the tagged artist are used.
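
The tag assumption above can be illustrated with a minimal sketch on toy data. The frames `listens` and `tags` and all their values are made up for illustration; only the column names `userID`, `artistID` and `tagID` follow the dataset. An inner merge on the (userID, artistID) pair keeps a tag only when the same user also has a listening record for that artist.

```python
import pandas as pd

# Toy listening records; values are hypothetical.
listens = pd.DataFrame({
    "userID":   [1, 1, 2],
    "artistID": [10, 11, 10],
    "weight":   [5, 3, 7],
})

# Toy tagging records; user 2 tagged artist 12 without a listening record.
tags = pd.DataFrame({
    "userID":   [1, 2, 2],
    "artistID": [10, 10, 12],
    "tagID":    [100, 101, 102],
})

# Keep only tags whose (userID, artistID) pair also appears in the listening data.
kept = tags.merge(
    listens[["userID", "artistID"]],
    how="inner",
    on=["userID", "artistID"],
)
print(kept)
```

In this toy case the tag on artist 12 is dropped, because user 2 never listened to that artist.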

In [None]:
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/My\ Drive/WMP

Mounted at /content/drive
/content/drive/My Drive/WMP


In [None]:
import pandas as pd
import random
import os
import zipfile

### Data set download and make directories

File Structure:
- notebook
- data
 - dataset
 - split
 - tags
 - interim # intermediate outputs such as translated_tags, similarity_matrix
 - external
 - result

In [None]:
def make_dir(directory):
  if not os.path.exists(directory):
    os.makedirs(directory)

make_dir('data/split')
make_dir('data/tags')
make_dir('data/interim')
make_dir('data/dataset')
make_dir('data/external')
make_dir('data/result')

In [None]:
# download original data set
zip_path = 'data/dataset/hetrec2011-lastfm-2k.zip'
!wget --no-check-certificate -P 'data/dataset' 'https://files.grouplens.org/datasets/hetrec2011/hetrec2011-lastfm-2k.zip'

# extract every member of the archive, closing the file automatically
with zipfile.ZipFile(zip_path) as zip_file:
    zip_file.extractall('data/dataset')
os.remove(zip_path)

--2021-06-14 09:53:02--  https://files.grouplens.org/datasets/hetrec2011/hetrec2011-lastfm-2k.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:443... connected.
  Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 200 OK
Length: 2589075 (2.5M) [application/zip]
Saving to: ‘data/dataset/hetrec2011-lastfm-2k.zip’


2021-06-14 09:53:02 (13.3 MB/s) - ‘data/dataset/hetrec2011-lastfm-2k.zip’ saved [2589075/2589075]



### Split Data Set

In [None]:
# HONGLIN
artists_df = pd.read_table('data/dataset/artists.dat')
tags_df = pd.read_table('data/dataset/tags.dat', encoding = "ISO-8859-1")
user_artists_df = pd.read_table('data/dataset/user_artists.dat')  # to be split
user_friends_df = pd.read_table('data/dataset/user_friends.dat')
user_tag_artists_df = pd.read_table('data/dataset/user_taggedartists.dat')  # to be split

artists_df.drop(['pictureURL', 'url'], inplace=True, axis=1)
user_tag_artists_df.drop(['day', 'month', 'year'], inplace=True, axis=1)

In [None]:
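def split_dataset(user_artists_df, user_tag_artists_df, train_prefix='train', test_prefix='test'):
  """Hold out 40% of the listening pairs of users with more than 5 artists and save the splits to data/split.

  Uses the global artists_df to add artist names when the input has no 'name' column.
  """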
  df = user_artists_df
  if 'name' not in user_artists_df.columns:
    df = pd.merge(artists_df, user_artists_df, how="inner", left_on='id', right_on='artistID').drop(['id'], axis = 1)  # add name
  print('User_Artist Shape: ', df.shape)

  # Count how many artists each user has listened to
  user_artists_count = user_artists_df.groupby(['userID']).agg({'artistID' : 'count'}).rename(columns={'artistID': 'artistCount'})

  # 1.Splitting train and validation/test data on user_artists
  # 1.1 test/validation set
  user_candidate_val = user_artists_count[user_artists_count.artistCount>5]
  test_user_artists = df.merge(user_candidate_val, how='inner', on='userID').drop(['artistCount'], axis = 1)
  test_user_artists = test_user_artists.sample(frac=0.4, random_state=41)

  # 1.2 train set
  train_user_artists = pd.concat([df,test_user_artists]).drop_duplicates(keep=False)
  print('Shape of user_artists train set: ', train_user_artists.shape)
  print('Coverage rate of User in train set:', len(train_user_artists['userID'].unique())/len(df['userID'].unique()))
  print('Coverage rate of artist in train set: ', len(train_user_artists['artistID'].unique())/len(df['artistID'].unique()))

  # 2. Split user_taggedartists: tags are only needed for the train set, not the test set
  train_user_taggedartists = user_tag_artists_df.merge(train_user_artists.drop(columns = ['name','weight']), how='inner', on=['userID', 'artistID'])
  #train_user_taggedartists = train_user_taggedartists.drop(columns = ['name','weight'])
  print('Shape of train set of user_artist_tag: ', train_user_taggedartists.shape)

  # save to local
  if not os.path.exists('data/split'):
    os.makedirs('data/split')

  test_user_artists.to_csv(f'data/split/{test_prefix}_user_artists.csv', index=False)
  train_user_artists.to_csv(f'data/split/{train_prefix}_user_artists.csv', index=False)
  train_user_taggedartists.to_csv(f'data/split/{train_prefix}_user_taggedartists.csv', index=False)
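
The train set in `split_dataset` is obtained by concatenating the full frame with the held-out sample and dropping every row that then appears twice. The sketch below replays that step on toy data and compares it with an index-based alternative. The frame `pairs` and its values are hypothetical, and `train_by_index` is shown only for comparison; it is not part of the notebook.

```python
import pandas as pd

# Toy user-artist pairs; values are made up for illustration.
pairs = pd.DataFrame({
    "userID":   [1, 1, 1, 2, 2],
    "artistID": [10, 11, 12, 10, 13],
    "weight":   [5, 3, 1, 7, 2],
})

# Hold out 40% of the rows, as split_dataset does.
held_out = pairs.sample(frac=0.4, random_state=41)

# Same trick as split_dataset: sampled rows occur twice after concat,
# and keep=False removes every copy of a duplicated row.
train_by_dedup = pd.concat([pairs, held_out]).drop_duplicates(keep=False)

# Alternative for comparison: drop the sampled rows by their index labels.
train_by_index = pairs.drop(index=held_out.index)

print(len(held_out), len(train_by_dedup), len(train_by_index))
```

Both approaches agree here because `pairs` contains no repeated rows. If the input held exact duplicate rows, `drop_duplicates(keep=False)` would remove them from the train set even when they were not sampled, whereas the index-based variant would keep them.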


In [None]:
# split dataset to train set and test set
split_dataset(user_artists_df, user_tag_artists_df)

# split train set to train set and validation set
train_user_artists = pd.read_csv('data/split/train_user_artists.csv')
train_user_taggedartists = pd.read_csv('data/split/train_user_taggedartists.csv')
split_dataset(train_user_artists, train_user_taggedartists, train_prefix='train_tune', test_prefix='val')

User_Artist Shape:  (92834, 4)
Shape of user_artists train set:  (55716, 4)
Coverage rate of User in train set: 1.0
Coverage rate of artist in train set:  0.7296960072595281
Shape of train set of user_artist_tag:  (44009, 3)
User_Artist Shape:  (55716, 4)
Shape of user_artists train set:  (33454, 4)
Coverage rate of User in train set: 1.0
Coverage rate of artist in train set:  0.7264106948546557
Shape of train set of user_artist_tag:  (26188, 3)


In [None]:
# coverage after splitting
artist_count = len(user_artists_df['artistID'].unique())
user_count = len(user_artists_df['userID'].unique())
tag_count = len(user_tag_artists_df['tagID'].unique())

train_user = pd.read_csv('data/split/train_user_artists.csv')
train_tag = pd.read_csv('data/split/train_user_taggedartists.csv')

train_tune_user = pd.read_csv('data/split/train_tune_user_artists.csv')
train_tune_tag = pd.read_csv('data/split/train_tune_user_taggedartists.csv')

val_user = pd.read_csv('data/split/val_user_artists.csv')

test_user = pd.read_csv('data/split/test_user_artists.csv')

In [None]:
train_user_count = len(train_user.userID.unique())
train_artist_count = len(train_user.artistID.unique())
train_tag_count = len(train_tag.tagID.unique())

train_tune_user_count = len(train_tune_user.userID.unique())
train_tune_artist_count = len(train_tune_user.artistID.unique())
train_tune_tag_count = len(train_tune_tag.tagID.unique())

val_user_count = len(val_user.userID.unique())
val_artist_count = len(val_user.artistID.unique())

test_user_count = len(test_user.userID.unique())
test_artist_count = len(test_user.artistID.unique())

print(f'user ratio in train set: {train_user_count / user_count}')
print(f'user ratio in train tune set: {train_tune_user_count / user_count}')
print(f'user ratio in val set: {val_user_count / user_count}')
print(f'user ratio in test set: {test_user_count / user_count}')

print(f'artist ratio in train set: {train_artist_count / artist_count}')
print(f'artist ratio in train tune set: {train_tune_artist_count / artist_count}')
print(f'artist ratio in val set: {val_artist_count / artist_count}')
print(f'artist ratio in test set: {test_artist_count / artist_count}')

print(f'tag ratio in train set: {train_tag_count / tag_count}')
print(f'tag ratio in train tune set: {train_tune_tag_count / tag_count}')

user ratio in train set: 1.0
user ratio in train tune set: 1.0
user ratio in val set: 0.9889006342494715
user ratio in test set: 0.9915433403805497
artist ratio in train set: 0.7296960072595281
artist ratio in train tune set: 0.5300589836660617
artist ratio in val set: 0.40625
artist ratio in test set: 0.5688520871143375
tag ratio in train set: 0.6080623653708073
tag ratio in train tune set: 0.4573802441276028
