# HarperDB Song Recommender Data Cleaner

This notebook cleans the original [Million Song Dataset](http://millionsongdataset.com/) files and reduces them to the users, songs, and play counts needed for this project.

The following data is used:

**Download these files into the `data/original` directory and extract the zip archive** (the code below reads `train_triplets.txt`, not the `.zip`)
1. A list of all song IDs, titles, and artists - [unique_tracks.txt](http://millionsongdataset.com/sites/default/files/AdditionalFiles/unique_tracks.txt)
2. The [Echo Nest Taste Profile Subset](http://millionsongdataset.com/tasteprofile/) with user IDs, song IDs, and play counts - [train_triplets.txt.zip](http://millionsongdataset.com/sites/default/files/challenge/train_triplets.txt.zip)
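
Since the cells below open `train_triplets.txt` directly, the downloaded archive has to be extracted first. A minimal sketch using the standard `zipfile` module, assuming the archive was saved to `../data/original` and holds `train_triplets.txt` at its top level (the helper name `extract_archive` is hypothetical):

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, out_dir):
    """Extract every member of zip_path into out_dir and return their names."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
        return archive.namelist()


if __name__ == "__main__":
    # Assumed location of the downloaded archive, relative to the notebook.
    archive_path = Path("../data/original/train_triplets.txt.zip")
    if archive_path.exists():
        print(extract_archive(archive_path, archive_path.parent))
```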

The code below writes the following files to `../data/clean`:
1. song_idxs.csv - maps each song ID to its index (not used in training)
2. user_idxs.csv - maps each user ID to its index (not used in training)
3. users_songs.csv - one row per user index, song index, and play count value (main data for training)
4. songs.csv - maps each song index to its title and artist (used for mapping the output of the model)

In [1]:
import csv

In [3]:
# create dictionary of users and songs
min_play_count = 10
min_songs = 10

user_songs = {}
with open('../data/original/train_triplets.txt') as file:
    for line in file:
        data = line.strip().split('\t')
        play_count = int(data[2])
        if play_count < min_play_count:
            continue
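        # setdefault creates the user's list the first time the user is seen
        user_songs.setdefault(data[0], []).append((data[1], play_count))

# keep only users with at least min_songs qualifying songs
user_songs = {user: songs for user, songs in user_songs.items() if len(songs) >= min_songs}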
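
The cell above applies two thresholds: `min_play_count` drops individual (user, song) pairs, and `min_songs` then drops users with too few remaining songs. As a rough illustration of the same two-step filter on made-up data, with lowered thresholds so the effect is visible (the function name `filter_users` and the sample triplets are hypothetical):

```python
from collections import defaultdict


def filter_users(triplets, min_play_count=10, min_songs=10):
    """Group (user, song, play_count) triplets by user, applying both thresholds."""
    user_songs = defaultdict(list)
    for user, song, play_count in triplets:
        if play_count >= min_play_count:
            user_songs[user].append((song, play_count))
    # Keep only users with enough qualifying songs.
    return {u: s for u, s in user_songs.items() if len(s) >= min_songs}


# Made-up sample data, for illustration only.
sample = [
    ("u1", "s1", 12), ("u1", "s2", 3), ("u1", "s3", 15),
    ("u2", "s1", 11),
]
filtered = filter_users(sample, min_play_count=10, min_songs=2)
print(filtered)  # {'u1': [('s1', 12), ('s3', 15)]}
```

Here `u1` loses the song played only 3 times but keeps two songs, while `u2` is removed entirely because only one song remains.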

In [4]:
# create list of all songs
all_songs = []
for songs_counts in user_songs.values():
    for song, _ in songs_counts:
        all_songs.append(song)
all_songs = list(set(all_songs))
len(all_songs)

130791

In [6]:
# creates song_idxs.csv
song_idxs = {}
with open('../data/clean/song_idxs.csv', 'w') as file:
    file.write('song,index\n')
    for idx, song in enumerate(all_songs):
        song_idxs[song] = idx
        file.write('{},{}\n'.format(song, idx))
len(song_idxs)

130791

In [13]:
# reads song_idxs.csv into a dictionary (used if rerunning the code after already writing the file)
# song_idxs = {}
# with open('../data/clean/song_idxs.csv') as file:
#     next(file)
#     for line in file:
#         info = line.strip().split(',')
#         song_idxs[info[0]] = int(info[1])        

In [7]:
# creates user_idxs.csv
user_idxs = {}
with open('../data/clean/user_idxs.csv', 'w') as file:
    file.write('user,index\n')
    for idx, user in enumerate(user_songs.keys()):
        user_idxs[user] = idx
        file.write('{},{}\n'.format(user, idx))
len(user_idxs)

64774

In [8]:
# creates users_songs.csv
lines = 0
with open('../data/clean/users_songs.csv', 'w') as file:
    file.write('user_idx,song_idx,play_count\n')
    for user, songs_playcounts in user_songs.items():
        user_idx = user_idxs[user]
        play_counts = [x[1] for x in songs_playcounts]
        avg_play_count = sum(play_counts) / len(play_counts)
        for song, play_count in songs_playcounts:
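            # keep only songs played at least as often as this user's average
            if play_count >= avg_play_count:
                # play count relative to the average; it is always >= 1 here,
                # so the cap below writes 1 for every kept song
                pcrta = play_count / avg_play_count
                pcrta = min(pcrta, 1)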
                song_idx = song_idxs[song]
                file.write('{},{},{}\n'.format(user_idx, song_idx, pcrta))
                lines += 1
                
lines

397841
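
To make the per-user arithmetic in the cell above concrete, here is a small worked sketch on a made-up listening history (the function name `relative_play_counts` and the data are hypothetical). It returns both the raw ratio and the capped value that the notebook writes to the file:

```python
def relative_play_counts(songs_playcounts):
    """Return (song, ratio, capped) for songs played at least the user's average."""
    play_counts = [count for _, count in songs_playcounts]
    avg_play_count = sum(play_counts) / len(play_counts)
    rows = []
    for song, play_count in songs_playcounts:
        if play_count >= avg_play_count:
            ratio = play_count / avg_play_count
            rows.append((song, ratio, min(ratio, 1)))
    return rows


# Made-up listening history for a single user; the average is 20.
history = [("s1", 10), ("s2", 20), ("s3", 30)]
for row in relative_play_counts(history):
    print(row)
# ('s2', 1.0, 1.0)
# ('s3', 1.5, 1)
```

Because only songs at or above the average are kept, the ratio is never below 1, so the capped value is 1 for every row. The `play_count` column therefore acts as a binary "above average" flag rather than a graded score.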

In [14]:
# creates songs.csv
good = 0
bad = 0
# newline='' lets csv.writer control the line endings
with open('../data/clean/songs.csv', 'w', newline='') as songs_file:
    writer = csv.writer(songs_file)
    writer.writerow(['index', 'song', 'artist', 'search'])
    with open('../data/original/unique_tracks.txt') as ut_file:
        for line in ut_file:
            try:
                # each line is track_id<SEP>song_id<SEP>artist<SEP>title
                info = line.strip().split('<SEP>')
                song_id = info[1]
                song_idx = song_idxs[song_id]
                artist = info[2]
                song = info[3]
                search = song.lower() + ' by ' + artist.lower()
                writer.writerow([song_idx, song, artist, search])
                good += 1
            except (KeyError, IndexError):
                # the song was filtered out earlier, or the line is malformed
                bad += 1
print('good', good)
print('bad', bad)

good 131397
bad 868603
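
Since songs.csv is meant for mapping model output back to song information, a lookup by index might look like the sketch below. It assumes the header written above (`index,song,artist,search`) and uses an in-memory stand-in with made-up rows instead of the real file; the function name `load_songs` is hypothetical.

```python
import csv
import io


def load_songs(file):
    """Build {index: (song, artist)} from a file with the songs.csv header."""
    lookup = {}
    for row in csv.DictReader(file):
        lookup[int(row["index"])] = (row["song"], row["artist"])
    return lookup


# In-memory stand-in for ../data/clean/songs.csv with made-up rows.
sample = io.StringIO(
    "index,song,artist,search\n"
    '7,"Hello, World",Some Artist,"hello, world by some artist"\n'
    "3,Another Song,Other Artist,another song by other artist\n"
)
songs = load_songs(sample)
print(songs[7])  # ('Hello, World', 'Some Artist')
```

The counts above show more rows written (131397) than distinct songs (130791), so some indexes appear more than once in songs.csv. In this sketch a later row simply overwrites an earlier one with the same index.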
