If you run into any problems during the procedure, please let me know at lorenzo.porcaro at gmail.com
After cloning the repo, you can analyze your own dataset. To do so, start by creating the dataset folder:
cd playlist/data
mkdir THE_NAME_OF_YOUR_DATASET
Open the file playlist/src/utils.py and modify line 9, adding the name of your dataset to the list ALLOWED_DATASETS:
import logging
import csv
import numpy as np
from annoy import AnnoyIndex
ALLOWED_DATASETS = ["AOTM", "CORN", "DEEZ", "SPOT", "THE_NAME_OF_YOUR_DATASET"]
You need to add two files to the folder playlist/data/THE_NAME_OF_YOUR_DATASET:
tracklist.tsv: a TSV file with the complete list of tracks contained in the playlists. It needs three columns: 1) Track ID; 2) Artist Name; 3) Track Name (no header needed). Make sure the track IDs are the same as those used in the playlists file.
playlists.tsv: a TSV file with the complete list of playlists in your dataset. Each line represents one playlist, given as tab-separated track IDs (e.g. 1\t35\t56\t...).
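Before running the analysis, it can help to check that the two files agree on track IDs. The following is a minimal sketch and not part of the repository: the function name find_unknown_track_ids is hypothetical, and it assumes both files are plain tab-separated text without headers, as described above.

```python
import csv


def find_unknown_track_ids(tracklist_lines, playlists_lines):
    """Return the track IDs used in the playlists but missing from the tracklist.

    Hypothetical helper (not part of the repository). Both arguments are
    iterables of text lines, e.g. open file objects.
    """
    # Each tracklist row: track ID, artist name, track name (no header).
    tracks = csv.reader(tracklist_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    known_ids = {row[0] for row in tracks if row}

    unknown = set()
    # Each playlists row is one playlist: tab-separated track IDs.
    playlists = csv.reader(playlists_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for playlist in playlists:
        unknown.update(tid for tid in playlist if tid and tid not in known_ids)
    return unknown
```

The function only iterates over lines, so you can pass it the two files opened in text mode from playlist/data/THE_NAME_OF_YOUR_DATASET. An empty result means every track ID in playlists.tsv also appears in tracklist.tsv.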
Apart from the files needed for the Popularity Analysis, you need the tag embeddings and the list of tracks with their associated tags.
First, create a folder for the embeddings
cd playlist/data
mkdir tag_embeds
Then, add to the folder the file with the previously computed embeddings. You don't have to use GloVe; you can try other embeddings or architectures (and let us know how it works!). The embeddings file should be a .txt file in the following format:
tag_name value1 value2 .... valueN
Note that the file is not tab-separated, just whitespace-separated. The default embedding dimension is 100. If you want to use embeddings with a different dimension, open the file playlist/src/utils.py and modify line 156, changing the value of the sanity check (100 by default):
if len(embs) != THE_DIMENSION_OF_YOUR_EMBEDDINGS:
    logging.error('Problem importing the embeddings')
    break
The embeddings file should be located in the tag_embeds folder and named after the dataset:
playlist/data/tag_embeds/THE_NAME_OF_YOUR_DATASET.txt
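To catch dimension mismatches before the sanity check in utils.py trips, you can scan the embeddings file yourself. This is a minimal sketch, not the repository's loader: check_embedding_lines is a hypothetical name, and it assumes that each tag name contains no whitespace, so that everything after the first token is a numeric value.

```python
def check_embedding_lines(lines, expected_dim=100):
    """Return (line_number, tag, found_dim) for every line whose vector
    length differs from expected_dim.

    Hypothetical helper (not part of the repository). Assumes the tag name
    is the first whitespace-separated token on each line.
    """
    problems = []
    for line_number, line in enumerate(lines, start=1):
        parts = line.split()  # splits on any run of whitespace
        if not parts:
            continue  # skip blank lines
        tag, values = parts[0], parts[1:]
        # float() raises ValueError if a value is not numeric.
        vector = [float(value) for value in values]
        if len(vector) != expected_dim:
            problems.append((line_number, tag, len(vector)))
    return problems
```

Pass it the opened .txt file, and set expected_dim to the same value you use in the sanity check if your embeddings are not 100-dimensional.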
Next, create a folder for the tag-tracks
cd playlist/data
mkdir lastfm_tags
Then, add to the folder the file with the previously retrieved tags. You don't have to use LastFm; you can try other tags (and let us know how it works!). The tags file should be a TSV file in the following format:
ArtistName|TrackName\t[('tag1', w1), ('tag2', w2), ('tag3', w3), ('tag4', w4), ('tag5', w5)]
Example
randy+roos|ray's+passage\t[('guitar virtuoso', 100), ('Mindracers', 100), ('Still Behind the Wheel', 100), ('guitar bud', 100), ('mindracer', 50)]
where each w is the weight associated with the corresponding tag. If you don't have weights, or don't want to take them into account, just set every w to 100 (if you are wondering why, just read section 3.3.2 of the paper ;) ). ArtistName and TrackName should be formatted using the norm_str function defined in playlist/src/utils.py (line 81). There is room for improving the normalization function, but for now it is the only way to map strings used in different places.
The tags file should be located in the lastfm_tags folder and named as follows:
playlist/data/lastfm_tags/lastfm_tags.tsv
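If you are generating lastfm_tags.tsv from your own tag source, the line format above can be produced as follows. This is a minimal sketch under stated assumptions: format_tag_line and parse_tag_line are hypothetical helpers, the artist and track strings are assumed to be already normalized with norm_str, and parsing with ast.literal_eval is only a way to check the output, not necessarily how the repository reads the file.

```python
import ast


def format_tag_line(artist, track, tags, weights=None):
    """Build one line: 'Artist|Track<TAB>[(tag, weight), ...]'.

    Hypothetical helper. `artist` and `track` are assumed to be already
    normalized with norm_str from playlist/src/utils.py.
    """
    if weights is None:
        # No weights available: give every tag the same weight of 100.
        weights = [100] * len(tags)
    if len(weights) != len(tags):
        raise ValueError("tags and weights must have the same length")
    pairs = list(zip(tags, weights))
    # The repr of a list of (str, int) tuples matches the expected layout.
    return f"{artist}|{track}\t{pairs!r}"


def parse_tag_line(line):
    """Split a line back into its key and its list of (tag, weight) tuples."""
    key, raw_pairs = line.rstrip("\n").split("\t", 1)
    return key, ast.literal_eval(raw_pairs)
```

Writing one such line per track, each terminated by a newline, gives a file in the format described above. Parsing a few lines back with parse_tag_line is a quick way to confirm that the tab and the tag list are well formed.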