# Mosaico Musical

### Musical recommender by Alberto Antón as a final project for the Master in Data Science of KSchool

The goal of this project is to build a music recommender based on collaborative filtering (recommending based on the tastes of similar users) using the Million Song Dataset.

In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import io
import os
import sys
import random

In [3]:
# display results to 2 decimal places, not in scientific notation, with a thousands separator
pd.set_option("display.float_format", lambda x: "{:,.2f}".format(x))

In [4]:
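# Set random seeds. pandas' sample() draws from NumPy's global generator,
# so seeding Python's random module alone does not make the samples reproducible
random.seed(666)
np.random.seed(666)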

### Loading data

In [5]:
data_root = "data"

In [6]:
# Load training dataset
columns = ["user_id", "song_id", "num_plays"]
datafile = os.path.join(data_root, "train_triplets.txt")

data = pd.read_csv(datafile, 
                   sep="\t", 
                   header = None,
                   names = columns)

In [7]:
# Let's get a glimpse of the data
data.head()

Unnamed: 0,user_id,song_id,num_plays
0,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOAKIMP12A8C130995,1
1,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOAPDEY12A81C210A9,1
2,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBBMDR12A8C13253B,2
3,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBFNSP12AF72A0E22,1
4,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBFOVM12A58A7D494,1


In [8]:
data.describe()

Unnamed: 0,num_plays
count,48373586.0
mean,2.87
std,6.44
min,1.0
25%,1.0
50%,1.0
75%,3.0
max,9667.0


In [9]:
# Let's analyze num_plays column a little deeper
data.num_plays.describe()

count   48,373,586.00
mean             2.87
std              6.44
min              1.00
25%              1.00
50%              1.00
75%              3.00
max          9,667.00
Name: num_plays, dtype: float64

At least half of the user-song pairs have a play count of 1 (the median is 1), and there are very large outliers (the maximum is 9,667), so we will not use the num_plays field.

We already have the listening dataset; now let's load the song information dataset.

In [10]:
columns = ["foo", "song_id", "artist", "title"]
datafile = os.path.join(data_root, "unique_tracks.txt")

all_songs = pd.read_csv(datafile, 
                        header = None,
                        sep = "<SEP>",
                        names = columns,
                        usecols = ["song_id", "artist", "title"],
                        encoding =  "utf-8",
                        engine = "python")

In [11]:
all_songs.sample(10)

Unnamed: 0,song_id,artist,title
916994,SOZJXGU12AAA15EAC4,Delbert McClinton,Under Suspicion
508990,SOJYVER12AB018D1C3,Jeff Mills,Solid Sleep
82173,SORPUBJ12A8C1454F2,Gianni Drudi,Jambè
682853,SOTFPPY12AB0183D2F,Freemasons,When You Touch Me
900073,SOINDMY12A8C1388D3,XU WEI,Li Wu
805707,SOWKETB12A8C13A9FF,Chopin,Ballade n° 2
870541,SOUWMCT12A8C143272,Aeoliah,Nirvana
675396,SOVJETW12A6D4FA6C9,Slam,Lifetimes (Slam's Schizoid Dub)
275354,SOCTZEJ12A67020722,Kraftwerk,Pocket Calculator (Live)
334120,SOFIVON12A8C13FF93,Dexter Danger,Guilty as Charged


In [12]:
all_songs.describe()

Unnamed: 0,song_id,artist,title
count,1000000,1000000,999985
unique,999056,72665,702000
top,SOBPAEP12A58A77F49,Michael Jackson,Intro
freq,3,194,1511


In [13]:
# Some song_ids seem to be repeated. Let's find them
all_songs[all_songs.duplicated(subset=["song_id"], keep=False)].sort_values("song_id").head(10)

Unnamed: 0,song_id,artist,title
963681,SOAAEFC12AB01852F1,Tineke Schouten/Linda De Mol/Franklin Brown,De Tongbreker
304966,SOAAEFC12AB01852F1,Tineke Schouten,De Tongbreker (Tineke Schouten & Linda de Mol)
572832,SOACGAQ12A58A79805,Arctic Monkeys,Fire And The Thud
141269,SOACGAQ12A58A79805,Arctic Monkeys,Fire And The Thud
15028,SOADFGH12A6D4F74F7,Red Hot Chili Peppers,Falling Into Grace (Album Version)
522461,SOADFGH12A6D4F74F7,Red Hot Chili Peppers,Falling Into Grace (Album Version)
68224,SOADYVX12A8A9D9462,Rihanna,We Ride
655219,SOADYVX12A8A9D9462,Rihanna,We Ride
182693,SOAEIFW12A8C1391E4,Franz Ferdinand,Michael
525920,SOAEIFW12A8C1391E4,Franz Ferdinand,Michael


In [14]:
# Let's remove those duplicates
all_songs.drop_duplicates(subset = "song_id", inplace = True)
all_songs.describe()

Unnamed: 0,song_id,artist,title
count,999056,999056,999041
unique,999056,72652,701922
top,SOLLVGO12AC46872FA,Johnny Cash,Intro
freq,1,191,1511


Now count and unique show the same amount.

In [15]:
# There is information about one million songs. Let's see how many of these songs
# are in the training dataset
data.song_id.unique().shape[0]

384546

We don't need information for so many songs, so let's create a dataframe with information only on the songs that are in the training dataframe.

In [16]:
# Let's keep only the songs that are in data
unique_songs_df = pd.DataFrame(data.song_id.unique(), columns = ["song_id"])

In [17]:
songs = unique_songs_df.merge(all_songs, on = "song_id", how = "left")[["song_id", "artist", "title"]]

In [18]:
songs.describe()

Unnamed: 0,song_id,artist,title
count,384546,384546,384542
unique,384546,42055,306720
top,SOYRTBJ12A67AD930B,Beastie Boys,Intro
freq,1,136,526


### 1st recommender

If we want to recommend something to a new user, either we know something about them or we can only recommend the most popular songs.

Let's go for the first option: we'll choose some songs and have the recommender select new ones for us.

In [19]:
# Function that shows the songs of an artist
def songs_by(artist_name):
    return songs[songs["artist"] == artist_name][["song_id", "title"]]
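
`songs_by` only matches the artist name exactly, so a credit such as "Metallica / Marianne Faithfull" would be missed when searching for "Metallica". As a hedged sketch, a more forgiving lookup could use a case-insensitive substring match. The helper name `songs_by_like` and the toy catalog below are hypothetical and not part of the notebook:

```python
import pandas as pd

# Hypothetical miniature catalog with the same columns as the songs dataframe
toy_songs = pd.DataFrame({
    "song_id": ["S1", "S2", "S3"],
    "artist": ["Metallica", "Metallica / Marianne Faithfull", "Rammstein"],
    "title": ["One", "The Memory Remains", "Du Hast"],
})

def songs_by_like(catalog, text):
    """Return songs whose artist contains `text`, ignoring case."""
    # regex=False treats the text literally; na=False skips missing artists
    mask = catalog["artist"].str.contains(text, case=False, na=False, regex=False)
    return catalog.loc[mask, ["song_id", "title"]]

found = songs_by_like(toy_songs, "metallica")
print(found)
```

The trade-off is that a substring match can also return unrelated artists whose names merely contain the search text.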

In [20]:
# Let's try to find some rock songs and see the recommendation...
songs.artist.sample(25)

220624                                        Fanfarlo
95369                                    The Hot Pants
302218                                      Ray Davies
209432                                      Dave Weckl
114455                                          Winger
234974                                           MU330
213282                                Hawksley Workman
154627                   Southern Culture On The Skids
101264                                     Phil Lynott
381245                              Aceituna sin hueso
332473                                   Presto Ballet
214696                                       Value Pac
184071                                  Jennifer Paige
139604    ...And You Will Know Us By The Trail Of Dead
15909               Brigitte Bardot / Serge Gainsbourg
294777            Soul II Soul Featuring Caron Wheeler
68191                                    Julian Lennon
353958                                       Leon Ware
323021    

In [21]:
songs_by("Metallica")

Unnamed: 0,song_id,title
1146,SOOEEPE12A8AE459A4,The Unforgiven III
1407,SOZATKE12A6D4F5915,2 X 4
1594,SOGAUIQ12A6D4F8262,Hit The Lights
1641,SOUGBIM12A6D4F8247,The Four Horsemen
1713,SOCHYVZ12A6D4F5908,Enter Sandman
1787,SOZDGEW12A8C13E748,One
1794,SOGMBXD12A6D4F5920,Ronnie
4702,SOJSRYJ12A6D4F824C,Phantom Lord
4712,SOMTBXX12AF729F5A6,Am I Evil?
4726,SORIEXB12A6D4F824D,No Remorse


We list some artists. When we find one we like, we list its songs and take the song ID. We repeat this until we have a small set of songs we like.

This is our metalhead selection: 

| artist | title | song_id   |
|------|------|------|
| Thrice | Blood Clots And Black Holes | SORJRTI12A6D4F7D67 |
| Clawfinger | Do What I Say | SOSOPGB12A8C13C185 |
| Rammstein | Du Hast | SOSYHME12A8C135DD8 | 
| Rancid | Time Bomb (Album Version) | SOSBJSU12A8C138469 | 
| Against Me! | Thrash Unreal (Album Version) | SONJQZM12A6D4FBE30 |
| Millencolin | Every Breath You Take (Album Version) | SOZVBUH12A8AE4745C |
| Van Halen | Jump (Album Version) | SOVMGEX12AC9070FF2 |
| Led Zeppelin | Stairway To Heaven (2007 Remastered LP Version) | SOEHJKJ12A8C13CA4D |
| Staind | Mudshovel (Explicit Album Version) | SOUNBBX12A6D4F338E |
| Monster Magnet | Space Lord | SOJKARY12A6701ED3F |
| Whitesnake | Here I Go Again (2007 Digital Remaster) | SOPCNEA12A67ADF48B |
| Puddle Of Mudd | She Hates Me | SOTQVSE12A6D4F8200 |
| Green Day | Basket Case (Album Version) | SOTNYYH12A6701F94B |
| Metallica | Master Of Puppets | SOSJRJP12A6D4F826F |

In [22]:
selected_songs = ["SORJRTI12A6D4F7D67", "SOSOPGB12A8C13C185", "SOSYHME12A8C135DD8", "SOSBJSU12A8C138469", \
                 "SONJQZM12A6D4FBE30", "SOZVBUH12A8AE4745C", "SOVMGEX12AC9070FF2", "SOEHJKJ12A8C13CA4D", \
                 "SOUNBBX12A6D4F338E", "SOJKARY12A6701ED3F", "SOPCNEA12A67ADF48B", "SOTQVSE12A6D4F8200", \
                 "SOTNYYH12A6701F94B", "SOSJRJP12A6D4F826F"]

In [23]:
# Now let's find other users that listened to these songs.
similar_users = data[data["song_id"].isin(selected_songs)]["user_id"].to_frame()

In [24]:
similar_users.head()

Unnamed: 0,user_id
6041,bd64f193f0f53f09d44ff48fd52830ff2fded392
7067,a520488fcf049bbb5cd847cfa4f884c740692780
7970,0ef42a19efb74d0a05c308d00636c8d8d41bec0c
8466,7661038e3e655fd31961ad18aea13dded963eedf
9524,12497e138741a0b94bb36a14bef32c9d0ee20fec


In [25]:
similar_users.user_id.describe()

count                                        36391
unique                                       35253
top       be59c5b281f8b714c4d4d4bfb877715a93b3c64d
freq                                             4
Name: user_id, dtype: object

In [26]:
# Let's see how many of our songs those users have played

data[data["song_id"].isin(selected_songs)].groupby(["user_id"]).size().sort_values(ascending = False)

user_id
8fc187765f25645e802bd5137f641c8de7df17b8    4
52542a715ba72e52eec99b277a42532c88615469    4
be59c5b281f8b714c4d4d4bfb877715a93b3c64d    4
58c846a9d19a9345bffe62b212436cb49363278a    3
a883218d1e6171d4913b1dec6c083eb3fea5f914    3
b73bc9b4732c8edf790e257df7395973f8d085ef    3
3e6dd161e97e7bd0e20986e7f5e391e5d24e0a62    3
b67f2d3bea6a313bc55695517cc9b38ff5f920fa    3
07e8066fc9c82f5e700023f3c963117e874e0188    3
3f52fdf255f7043eb170a49606bebe14f6c7a08a    3
ed824982bb5d17465708f5bfbd8589af81ad4de0    3
217b76adb93cdb5d221408ad9f9c5c244a65b038    3
f106f63e74ba0648ed27e2fd59094a53c8c9c534    3
78fb080641b1b1f9b85ceffd9c1686eb8db7c765    3
748096044d04f6736c6921203f711f57fe6e31ee    3
00fc9d7d12f74bcd93fa787cc26a9c61a0904ac7    3
b161e27efcd0135dabd0cc2cfea477498667b191    3
53175b45ba820a33ac8f833a85a986a7c0f7d3d4    3
b60c2902ab24963f33d8a431bee8676a14ceb003    3
8c24607fcd3b2ca28a8eb5924c7c26d8d40e82c4    3
127c8ac775ebdce42de94ff5783ab8d8e333711f    3
58dc40ef3b13f15b889f72d6e3

We can see that most users have played only one of the songs in our selection, and that even the users closest to us have played only 4 of the 14.

Now that we know the users that have played our music, we need to know what other songs they have listened to.

In [27]:
prediction_df = similar_users.merge(data[~data.song_id.isin(selected_songs)], on = "user_id")
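
One thing to keep in mind about this merge: `similar_users` holds one row per (user, selected song) pair (36,391 rows for 35,253 distinct users), so a user who played several selected songs appears several times and their other songs are repeated in the result. A minimal sketch on made-up toy data (the frames and IDs below are hypothetical) shows the effect and how `drop_duplicates` would avoid it:

```python
import pandas as pd

# Toy listening log: user "u1" played two of the selected songs, "u2" only one
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u2"],
    "song_id": ["A", "B", "X", "A", "Y"],
})
toy_selected = ["A", "B"]

toy_similar = toy[toy["song_id"].isin(toy_selected)]["user_id"].to_frame()
toy_others = toy[~toy["song_id"].isin(toy_selected)]

# Merging on a key that repeats on the left repeats the matching right rows
with_repeats = toy_similar.merge(toy_others, on="user_id")
without_repeats = toy_similar.drop_duplicates().merge(toy_others, on="user_id")

print(len(with_repeats))     # u1's song "X" is listed twice
print(len(without_repeats))  # each (user, song) pair appears once
```

Whether the repetition is wanted depends on the intent: it implicitly counts a user once per shared song.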

In [28]:
prediction_df.head()

Unnamed: 0,user_id,song_id,num_plays
0,bd64f193f0f53f09d44ff48fd52830ff2fded392,SOACIPG12A8AE47E1C,1
1,bd64f193f0f53f09d44ff48fd52830ff2fded392,SOAHEEC12A6BD4DAA4,1
2,bd64f193f0f53f09d44ff48fd52830ff2fded392,SOAKQBB12A8C1413A0,1
3,bd64f193f0f53f09d44ff48fd52830ff2fded392,SOAVFMF12A6D4F92E6,1
4,bd64f193f0f53f09d44ff48fd52830ff2fded392,SOBOLEI12A58A7E386,1


In [29]:
prediction_df.describe()

Unnamed: 0,num_plays
count,3233753.0
mean,2.99
std,6.43
min,1.0
25%,1.0
50%,1.0
75%,3.0
max,2213.0


In [30]:
prediction_df.user_id.describe()

count                                      3233753
unique                                       35253
top       4e73d9e058d2b1f2dba9c1fe4a8f416f9f58364f
freq                                          4623
Name: user_id, dtype: object

Now we have all the songs our similar users have played. We only have to select the most popular ones. We can do this in two ways:

1. Order the songs by play count (repetition).
2. Order the songs by the number of users that have played them (popularity).

Ranking by repetition is very sensitive to outliers. As we saw at the beginning, the maximum value of num_plays is 9,667. Every time that user is selected, that song would come first in our recommendation, so we are going with the popularity option.
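
The difference between the two orderings can be sketched on a small, made-up log (all values below are hypothetical), where a single heavy listener is enough to flip the ranking by play count:

```python
import pandas as pd

# Hypothetical toy log: one user played song "X" thousands of times
toy = pd.DataFrame({
    "user_id":   ["u1", "u2", "u3", "u4", "u4"],
    "song_id":   ["X",  "Y",  "Y",  "Y",  "X"],
    "num_plays": [9667, 2,    1,    3,    1],
})

# Option 1: rank by total plays (repetition)
by_plays = toy.groupby("song_id")["num_plays"].sum().sort_values(ascending=False)

# Option 2: rank by number of distinct users (popularity)
by_users = toy.groupby("song_id")["user_id"].nunique().sort_values(ascending=False)

print(by_plays.index[0])  # song favoured by the single heavy listener
print(by_users.index[0])  # song played by the most users
```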

In [31]:
# Let's now find the songs played by the most users in the prediction dataframe. This will be our recommendation
predicted_songs = prediction_df.\
                    groupby(["song_id"]).\
                    size().\
                    sort_values(ascending = False).\
                    head(20).\
                    to_frame("popularity").\
                    reset_index()

In [32]:
predicted_songs.head()

Unnamed: 0,song_id,popularity
0,SOEGIYH12A6D4FC0E3,8851
1,SOAUWYT12A81C206F1,8014
2,SOSXLTC12AF72A7F54,7285
3,SOBONKR12A58A7A7E0,6980
4,SOFRQTD12A81C233C0,6086


In [33]:
# Let's see the songs in a human readable way
predicted_songs.merge(songs, on = "song_id")[["artist", "title"]]

Unnamed: 0,artist,title
0,Barry Tuckwell/Academy of St Martin-in-the-Fie...,Horn Concerto No. 4 in E flat K495: II. Romanc...
1,Björk,Undo
2,Kings Of Leon,Revelry
3,Dwight Yoakam,You're The One
4,Harmonia,Sehr kosmisch
5,Alliance Ethnik,Représente
6,Cartola,Tive Sim
7,OneRepublic,Secrets
8,Lil Wayne / Eminem,Drop The World
9,Florence + The Machine,Dog Days Are Over (Radio Edit)


Mmmmm... There doesn't seem to be much metal in that list, does it? The result is totally different from the set of songs we chose. 

Let's investigate. 

First we are going to take a look at the most popular songs in the entire dataset to see if it's similar to this.

In [34]:
# Let's find out the most played songs in the whole dataset
most_popular_songs = data.\
                    groupby(["song_id"]).\
                    size().\
                    sort_values(ascending = False).\
                    head(20).\
                    to_frame("popularity").\
                    reset_index()

In [35]:
most_popular_songs.merge(songs, on = "song_id")[["song_id", "artist", "title", "popularity"]]

Unnamed: 0,song_id,artist,title,popularity
0,SOFRQTD12A81C233C0,Harmonia,Sehr kosmisch,110479
1,SOAUWYT12A81C206F1,Björk,Undo,90476
2,SOAXGDH12A8C13F8A1,Florence + The Machine,Dog Days Are Over (Radio Edit),90444
3,SOBONKR12A58A7A7E0,Dwight Yoakam,You're The One,84000
4,SOSXLTC12AF72A7F54,Kings Of Leon,Revelry,80656
5,SONYKOW12AB01849C9,OneRepublic,Secrets,78353
6,SOEGIYH12A6D4FC0E3,Barry Tuckwell/Academy of St Martin-in-the-Fie...,Horn Concerto No. 4 in E flat K495: II. Romanc...,69487
7,SOLFXKT12AB017E3E0,Charttraxx Karaoke,Fireflies,64229
8,SODJWHY12A8C142CCE,Train,Hey_ Soul Sister,63809
9,SOFLJQZ12A6D4FADA6,Cartola,Tive Sim,58610


Ok. Now we see a couple of things about our recommender:

1. It is heavily influenced by popularity. There are 35,253 users who have listened to at least one of our songs, and the vast majority have listened to just one of them, meaning that they are not very similar to us, so their recommendation is almost random. This forces us to give more weight to the songs of the users most similar to us.

2. Harmonia the most popular artist? Björk the second? Dwight Yoakam the fourth? This is hard to believe in real life. The dataset looks rotten in this respect, but anyway, to our recommender they are just names.



### 2nd recommender

Now we are going to implement a scoring function that puts more weight on the songs of the users most similar to us than on those of the least similar. We will create a score for each song based on how similar to us each user is. Then we will sum all the scores and return a list ordered by score, which will be our new recommendation.

First we need to know each user's similarity to us:

In [46]:
# Create a Dataframe with the number of similar songs by user
user_similarity = data[data["song_id"].\
                       isin(selected_songs)].\
                       groupby(["user_id"]).\
                       size().\
                       to_frame("similarity").\
                       reset_index()

In [48]:
# Let's see the ten users most similar to us
user_similarity.sort_values("similarity", ascending = False).head(10)

Unnamed: 0,user_id,similarity
19668,8fc187765f25645e802bd5137f641c8de7df17b8,4
11318,52542a715ba72e52eec99b277a42532c88615469,4
26131,be59c5b281f8b714c4d4d4bfb877715a93b3c64d,4
4544,217b76adb93cdb5d221408ad9f9c5c244a65b038,3
7031,33825a5d5b1b2ea935a9fc2f4a3cbf8e97e6280a,3
1105,07e8066fc9c82f5e700023f3c963117e874e0188,3
24623,b31888d485ddff26572ffdab1c947bcc067ff3a1,3
32657,ed824982bb5d17465708f5bfbd8589af81ad4de0,3
11440,53175b45ba820a33ac8f833a85a986a7c0f7d3d4,3
8567,3e6dd161e97e7bd0e20986e7f5e391e5d24e0a62,3


In [49]:
# Let's see how many users there are in each similarity group
similarity_groups = user_similarity.groupby("similarity").size()

In [50]:
similarity_groups

similarity
1    34153
2     1065
3       32
4        3
dtype: int64

In [51]:
# Now we are going to create a new dataframe with the user_id, song_id and similarity excluding the songs
# we have selected for our recommendation.
prediction_df = similar_users.\
    merge(data[~data.song_id.isin(selected_songs)], on = "user_id").\
    merge(user_similarity, on = "user_id")[["user_id", "song_id", "similarity"]]

In [52]:
prediction_df.sample(10)

Unnamed: 0,user_id,song_id,similarity
1573249,9d022acbda0278d1d6536d99cfb54d7b51cc8ece,SOUSVDG12AB018B099,2
1393475,d4e2a1187bd88b3d1799371325e0b3565cffa423,SOUXSKA12A6D4FDE52,1
772944,bbd3b37ac4cdcfa3f452b71927b6f90f8d8c411a,SOOHIBL12A8C14588A,1
101693,4d74d697b1250a9c3de064c6a4c5b09f56d31e7f,SOAFCAS12A58A79547,1
690294,a438c56c21325878c02bc85dfa336833b58bb051,SOQJWXU12AB0186E62,2
479534,9a98d3e33a0ed6d98ed8c580f0aaf750c6dcc2d8,SOKKYFO12A6D4F7B7C,1
2423382,9f2a3eca4bfe14b4018479b4b9b5c1a49f8372dc,SOFCPOU12A8C13BF40,1
1529577,ffc518f2af0946e6beebf753fbe16ddcc618aec8,SOWCOQE12AAA15F058,1
2258551,d3df25570d5c9854427f8be47c7920353ead50cd,SOUJVIT12A8C1451C1,1
1694759,fc72aa189c5b58bbda2ca6f9cd4e0ac40c1d8d64,SOFSLOC12A8C13F8BD,2


In [53]:
# We use the inverse of the group size as the weight for each similarity level, so that songs from the
# (few) most similar users count more than songs from the (many) least similar ones.
def song_scoring(similarity):
    return 1 / similarity_groups[similarity]

In [54]:
# Let's add a new column to the prediction dataframe with the scoring function value
prediction_df["score"] = prediction_df["similarity"].map(song_scoring)

In [55]:
prediction_df.head(10)

Unnamed: 0,user_id,song_id,similarity,score
0,bd64f193f0f53f09d44ff48fd52830ff2fded392,SOACIPG12A8AE47E1C,1,0.0
1,bd64f193f0f53f09d44ff48fd52830ff2fded392,SOAHEEC12A6BD4DAA4,1,0.0
2,bd64f193f0f53f09d44ff48fd52830ff2fded392,SOAKQBB12A8C1413A0,1,0.0
3,bd64f193f0f53f09d44ff48fd52830ff2fded392,SOAVFMF12A6D4F92E6,1,0.0
4,bd64f193f0f53f09d44ff48fd52830ff2fded392,SOBOLEI12A58A7E386,1,0.0
5,bd64f193f0f53f09d44ff48fd52830ff2fded392,SOBYNII12A58291CDC,1,0.0
6,bd64f193f0f53f09d44ff48fd52830ff2fded392,SOCALRI12A58A7BBC5,1,0.0
7,bd64f193f0f53f09d44ff48fd52830ff2fded392,SOCHYVZ12A6D4F5908,1,0.0
8,bd64f193f0f53f09d44ff48fd52830ff2fded392,SODQGBE12A6D4F6BAB,1,0.0
9,bd64f193f0f53f09d44ff48fd52830ff2fded392,SOEAQHH12A58A78F59,1,0.0
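
The scores above are shown as 0.0 only because of the two-decimal display format set at the start of the notebook. The underlying values are small but non-zero. A minimal sketch, using the group sizes printed by `similarity_groups` and plain Python instead of pandas, shows the weights that the inverse-of-group-size rule produces:

```python
# Group sizes copied from the similarity_groups output above
group_sizes = {1: 34153, 2: 1065, 3: 32, 4: 3}

# Weight of one song row from a user in each group: the inverse of the group size
weights = {sim: 1 / size for sim, size in group_sizes.items()}

for sim, w in weights.items():
    # The notebook's "{:,.2f}" display format rounds the smallest weights to 0.00
    print(sim, w, "{:,.2f}".format(w))
```

With this rule, and counting each user once, every similarity group can contribute at most 1 to a song's total score, so the 3 users in group 4 together weigh as much as the 34,153 users in group 1.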


In [59]:
# Let's sum the score of every song and order it
predicted_songs = prediction_df.\
                    groupby(["song_id"])["score"].\
                    sum().\
                    sort_values(ascending = False).\
                    head(20).\
                    to_frame("popularity").\
                    reset_index()

In [60]:
predicted_songs.head(20)

Unnamed: 0,song_id,popularity
0,SOFGIVB12A6D4F5923,4.01
1,SOZDGEW12A8C13E748,3.97
2,SOITRTA12A6D4F8261,3.75
3,SOGHFDV12A6D4F7E0D,3.61
4,SOSWDMO12A8AE45996,3.52
5,SOOEEPE12A8AE459A4,3.35
6,SOMBAVX12AF72AC99C,3.2
7,SOTNHIP12AB0183131,2.88
8,SOIEHEL12AAA8C6628,2.81
9,SOPQLBY12A6310E992,2.76


In [61]:
# Let's see the songs in a human readable way
predicted_songs.merge(songs, on = "song_id")[["song_id", "artist", "title"]].head(20)

Unnamed: 0,song_id,artist,title
0,SOFGIVB12A6D4F5923,Metallica / Marianne Faithfull,The Memory Remains
1,SOZDGEW12A8C13E748,Metallica,One
2,SOITRTA12A6D4F8261,Metallica,Ride The Lightning
3,SOGHFDV12A6D4F7E0D,Rammstein,AMERIKA
4,SOSWDMO12A8AE45996,Metallica,The Day That Never Comes
5,SOOEEPE12A8AE459A4,Metallica,The Unforgiven III
6,SOMBAVX12AF72AC99C,Alice In Chains,Man In The Box
7,SOTNHIP12AB0183131,Kid Cudi / Kanye West / Common,Make Her Say
8,SOIEHEL12AAA8C6628,Rammstein,Der Meister
9,SOPQLBY12A6310E992,Radiohead,Creep (Explicit)


Now we're talking! This is a good recommendation even if Kanye West is in it.

Let's compare it with a statistical model!

### 3rd recommender

Since this dataset is too big to build a co-occurrence matrix from it, we'll use a subset that we can find [here](https://static.turi.com/datasets/millionsong/10000.txt)

In [62]:
# Load training dataset
columns = ["user_id", "song_id", "num_plays"]
datafile = os.path.join(data_root, "10000.txt")

data_small = pd.read_csv(datafile, 
                   sep="\t", 
                   header = None,
                   names = columns)

In [63]:
data_small.describe()

Unnamed: 0,num_plays
count,2000000.0
mean,3.05
std,6.58
min,1.0
25%,1.0
50%,1.0
75%,3.0
max,2213.0


In [64]:
data_small.head()

Unnamed: 0,user_id,song_id,num_plays
0,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOAKIMP12A8C130995,1
1,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBBMDR12A8C13253B,2
2,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBXHDL12A81C204C0,1
3,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBYHAJ12A6701BF1D,1
4,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SODACBL12A8C13C273,1


Let's see how many users and songs this dataset has

In [65]:
data_small.user_id.describe()

count                                      2000000
unique                                       76353
top       6d625c6557df84b60d90426c0116138b617b9449
freq                                           711
Name: user_id, dtype: object

In [66]:
data_small.song_id.describe()

count                2000000
unique                 10000
top       SOFRQTD12A81C233C0
freq                    8277
Name: song_id, dtype: object

To use the collaborative filtering model we need to create a song co-occurrence matrix (item-item), and the indexes of this matrix have to be consecutive integers. The IDs in the songs dataframe are strings, so we have to build a new dataframe with the songs in data_small and a consecutive numeric index.
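
The notebook builds this index by hand (unique IDs, a `range` column, then a merge). As an alternative sketch, pandas' `factorize` does the same string-to-integer mapping in one call. The song IDs below are made up for illustration:

```python
import pandas as pd

# Hypothetical song IDs in the order they appear in a listening log
song_ids = pd.Series(["SOB", "SOA", "SOB", "SOC", "SOA"])

# codes: one integer per row; uniques: the ID that each integer stands for
codes, uniques = pd.factorize(song_ids)

print(codes.tolist())  # consecutive integers starting at 0, in order of first appearance
print(list(uniques))   # position i holds the ID encoded as i
```

Keeping `uniques` around makes it possible to translate matrix positions back into song IDs later.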

In [67]:
# Let's keep only the songs that are in data_small
unique_songs_df = pd.DataFrame(data_small.song_id.unique(), columns = ["song_id"])

In [68]:
songs_small = unique_songs_df.merge(all_songs, on = "song_id", how = "left")[["song_id", "artist", "title"]]

In [69]:
songs_small.describe()

Unnamed: 0,song_id,artist,title
count,10000,10000,10000
unique,10000,3375,9567
top,SOVEFXA12A58A7942A,The Black Keys,Breathe
freq,1,71,8


In [70]:
songs_small.head()

Unnamed: 0,song_id,artist,title
0,SOAKIMP12A8C130995,Jack Johnson,The Cove
1,SOBBMDR12A8C13253B,Paco De Lucia,Entre Dos Aguas
2,SOBXHDL12A81C204C0,Kanye West,Stronger
3,SOBYHAJ12A6701BF1D,Jack Johnson,Constellations
4,SODACBL12A8C13C273,Foo Fighters,Learn To Fly


In [71]:
# Let's add a column with the new index
songs_small["custom_index"] = range(0, len(songs_small))

In [72]:
songs_small.head()

Unnamed: 0,song_id,artist,title,custom_index
0,SOAKIMP12A8C130995,Jack Johnson,The Cove,0
1,SOBBMDR12A8C13253B,Paco De Lucia,Entre Dos Aguas,1
2,SOBXHDL12A81C204C0,Kanye West,Stronger,2
3,SOBYHAJ12A6701BF1D,Jack Johnson,Constellations,3
4,SODACBL12A8C13C273,Foo Fighters,Learn To Fly,4


In [73]:
# Now we'll add the new index also to the data_small dataframe to be able to join it with songs_small
data_small = data_small.merge(songs_small, on = "song_id")[["user_id", "song_id", "num_plays", "custom_index"]]

In [74]:
data_small.head()

Unnamed: 0,user_id,song_id,num_plays,custom_index
0,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOAKIMP12A8C130995,1,0
1,7c86176941718984fed11b7c0674ff04c029b480,SOAKIMP12A8C130995,1,0
2,76235885b32c4e8c82760c340dc54f9b608d7d7e,SOAKIMP12A8C130995,3,0
3,250c0fa2a77bc6695046e7c47882ecd85c42d748,SOAKIMP12A8C130995,1,0
4,3f73f44560e822344b0fb7c6b463869743eb9860,SOAKIMP12A8C130995,6,0


In [75]:
data_small.tail()

Unnamed: 0,user_id,song_id,num_plays,custom_index
1999995,8d5be34165a0d2d20878abd6a48bb87af29b9f7a,SOPBPHJ12AAF3B59B6,2,9999
1999996,23f8ab814cd41e4a3394e762cc7360eb6c04cbd7,SOPBPHJ12AAF3B59B6,2,9999
1999997,4cc239fd4ab90eb599b2263e21dceebb252cf340,SOPBPHJ12AAF3B59B6,1,9999
1999998,e0039fa2e1d0c51c729d2521a48eccc52a375cc2,SOPBPHJ12AAF3B59B6,2,9999
1999999,6a5f74c28e6d091b31027965d402c93a6c7667e2,SOPBPHJ12AAF3B59B6,1,9999


In [76]:
# Now we are going to create a dictionary of songs per user
songs_per_user = (data_small.groupby("user_id")["custom_index"].\
                 apply(np.array).\
                 to_dict())

In [77]:
dict(list(songs_per_user.items())[0:5])

{'00003a4459f33b92906be11abe0e93efc423c0ff': array([1233, 2298, 3426, 3947, 4701, 8708, 8867]),
 '00005c6177188f12fb5e2e82cdbd93e8a3f35e64': array([1293, 2827, 3228, 6222, 9223]),
 '00030033e3a2f904a48ec1dd53019c9969b6ef1f': array([  91,  102, 1122, 1515, 1975, 3037, 3439, 6333, 7665]),
 '0007235c769e610e3d339a17818a5708e41008d9': array([2120, 2767, 3845, 3953, 4992, 5571, 6212, 8674, 9167, 9725]),
 '0007c0e74728ca9ef0fe4eb7f75732e8026a278b': array([  59,  322, 1047, 3012, 3395, 3400, 5399, 5585, 7040])}

In [78]:
# Let's get the number of songs
n_songs = len(data_small.song_id.unique())
n_songs

10000

In [79]:
# Let's create the co-occurrence matrix with a shape of n_songs * n_songs
co_matrix = np.zeros((n_songs, n_songs))

In [80]:
# Let's fill the matrix with the data from songs_per_user
# (the loop variable is named user_songs so it does not overwrite the songs dataframe)
for user, user_songs in songs_per_user.items():
    for song in user_songs:
        co_matrix[song, user_songs] += 1

In [81]:
co_matrix

array([[  1.94000000e+02,   3.00000000e+00,   3.00000000e+00, ...,
          0.00000000e+00,   1.00000000e+00,   2.00000000e+00],
       [  3.00000000e+00,   1.57000000e+02,   5.00000000e+00, ...,
          0.00000000e+00,   2.00000000e+00,   0.00000000e+00],
       [  3.00000000e+00,   5.00000000e+00,   1.08200000e+03, ...,
          1.00000000e+00,   0.00000000e+00,   2.00000000e+00],
       ..., 
       [  0.00000000e+00,   0.00000000e+00,   1.00000000e+00, ...,
          9.00000000e+01,   0.00000000e+00,   0.00000000e+00],
       [  1.00000000e+00,   2.00000000e+00,   0.00000000e+00, ...,
          0.00000000e+00,   7.10000000e+01,   0.00000000e+00],
       [  2.00000000e+00,   0.00000000e+00,   2.00000000e+00, ...,
          0.00000000e+00,   0.00000000e+00,   6.30000000e+01]])

Not the best representation we can get, but we can still see some values and the diagonal.
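
To make the structure of this matrix easier to read, here is a minimal sketch on a made-up binary user-item matrix. Assuming each user lists a song at most once, the double loop above is equivalent to the matrix product of the transposed user-item matrix with itself:

```python
import numpy as np

# Hypothetical binary user-item matrix: 3 users (rows) x 4 songs (columns)
toy_ui = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
])

# Entry (i, j) counts the users who played both song i and song j
toy_co = toy_ui.T @ toy_ui

print(toy_co)
print(np.diagonal(toy_co))  # number of users who played each song
```

The diagonal holds each song's own listener count, which is why it dominates every row in the output above.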

In [82]:
# Let's create a function that returns the N most similar songs to a given one based on a co-occurrence matrix
def song_similarity(song_id, matrix, ntop=10):
    # get the row of the song (song_id is the integer custom_index, not the string ID)
    similar_songs = matrix[song_id,:]
    # return indexes of most similar songs in descending order
    most_similar = np.argsort(similar_songs)[::-1]  
    # start from the 2nd element, assuming the first element is the item itself
    # (its diagonal count is normally the largest value in the row)
    most_similar = most_similar[1:ntop+1]
    
    # return a numpy array with the index and the value of the most similar items
    return np.dstack((most_similar,similar_songs[most_similar]))[0]

In [83]:
# Let's try this for one song (randomly selected)
songs_small[songs_small.custom_index == 1234]

Unnamed: 0,song_id,artist,title,custom_index
1234,SOWMELX12A8C13277A,Cake,Pretty Pink Ribbon,1234


In [84]:
sim_songs = song_similarity(1234, co_matrix)

In [85]:
# Let's see the similar songs
songs_small[songs_small.custom_index.isin(sim_songs[:,0])]

Unnamed: 0,song_id,artist,title,custom_index
137,SODAQMD12A8C131D57,Cake,Meanwhile_ Rick James...,137
152,SOLJSEJ12A8C132F61,Cake,Opera Singer,152
183,SOWUTFF12A8C138AB2,Cake,Frank Sinatra,183
933,SOCXBTX12A8C132F5A,Cake,Shadow Stabbing,933
934,SOEEYNQ12A8C132864,Cake,Comfort Eagle,934
935,SOEGOAB12A8C13BAE4,Cake,Never There,935
936,SOETHQM12A8C1366A9,Cake,Short Skirt/Long Jacket,936
939,SONZMEO12AF72AA717,Cake,Commissioning A Symphony In C,939
943,SOXOJHJ12AF72A4ABD,Cake,Long Line Of Cars,943
8764,SOCXBQG12AF72A7CBD,Cake,Arco Arena,8764


:) all songs of the same artist....

We don't want to recommend based on just one song but on many, so let's write a function that takes several songs, gathers the most similar songs of each, and returns the most similar overall.

In [86]:
def song_recommendation(songs_id, matrix, ntop=10):
    list_sim_songs = np.vstack([song_similarity(id, matrix, ntop) for id in songs_id])
    sorted_list = np.sort(list_sim_songs, axis=0)[::-1]
    # Removing duplicates
    unique_items = np.unique(sorted_list[:,0])[:ntop]
    return unique_items 
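
A caveat about the function above: `np.sort(..., axis=0)` sorts each column independently, so indexes and counts no longer line up, and `np.unique` returns its values in ascending order, so the function ends up returning the `ntop` smallest candidate indexes. A hedged alternative sketch sums the co-occurrence rows of the seed songs and ranks candidates by that total. The helper name `recommend_by_total_cooccurrence` and the toy matrix are hypothetical and not part of the notebook:

```python
import numpy as np

def recommend_by_total_cooccurrence(seed_indices, matrix, ntop=10):
    """Rank songs by their summed co-occurrence with the seed songs."""
    seed_indices = np.asarray(seed_indices)
    # Total co-occurrence of every song with the whole seed set
    totals = matrix[seed_indices, :].sum(axis=0).astype(float)
    # Never recommend the seeds themselves
    totals[seed_indices] = -np.inf
    # Indexes of the highest totals, best first (stable sort keeps ties in index order)
    ranked = np.argsort(-totals, kind="stable")
    return ranked[:ntop]

# Toy symmetric co-occurrence matrix for 5 songs
toy_co = np.array([
    [9, 1, 4, 0, 2],
    [1, 8, 3, 0, 5],
    [4, 3, 7, 1, 0],
    [0, 0, 1, 6, 0],
    [2, 5, 0, 0, 9],
])

top = recommend_by_total_cooccurrence([0, 1], toy_co, ntop=2)
print(top)
```

This sketch still favours popular songs, since raw counts are used. It is only meant to keep each index tied to its score.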

In [87]:
# Let's try with a group of songs similar to our first recommender
songs_small[songs_small.song_id.isin(selected_songs)]

Unnamed: 0,song_id,artist,title,custom_index
2536,SOTQVSE12A6D4F8200,Puddle Of Mudd,She Hates Me,2536
2826,SOSBJSU12A8C138469,Rancid,Time Bomb (Album Version),2826
3035,SOSJRJP12A6D4F826F,Metallica,Master Of Puppets,3035
6055,SOSYHME12A8C135DD8,Rammstein,Du Hast,6055
8819,SOJKARY12A6701ED3F,Monster Magnet,Space Lord,8819


Let's get these five and add some others

| artist | title | song_id | custom_index |
|------|------|------|------|
| Thrice | The Artist In The Ambulance | SOEBKPB12AB0182150 | 9118 |
| Guns N' Roses | Paradise City | SOQGVCS12AF72A078D | 1105 |
| Rammstein | Du Hast | SOSYHME12A8C135DD8 | 6055 |
| Rancid | Time Bomb (Album Version) | SOSBJSU12A8C138469 | 2826 |
| Against Me! | I Was A Teenage Anarchist (Album Version) | SORXMRX12AC468D5BB | 6299 |
| Led Zeppelin | Tangerine (Album Version) | SOSUZFA12A8C13C04A | 1286 |
| Staind | It's Been Awhile (Clean Edit) | SOAGNRU12A58A7AC5C | 3375 |
| Monster Magnet | Space Lord | SOJKARY12A6701ED3F | 8819 |
| Whitesnake | Fool For Your Loving | SOHWVJJ12AB0185F6D | 2449 |
| Puddle Of Mudd | She Hates Me | SOTQVSE12A6D4F8200 | 2536 |
| Green Day | American Idiot [feat. Green Day & The Cast Of ... | SODEAWL12AB0187032 | 301 |
| Metallica | Master Of Puppets | SOSJRJP12A6D4F826F | 3035 |




In [88]:
selected_songs = [9118, 1105, 6055, 2826, 6299, 1286, 3375, 8819, 2449, 2536, 301, 3035]

In [89]:
recommended_songs = song_recommendation(selected_songs, co_matrix, 20)

In [90]:
recommended_songs

array([   8.,   28.,   65.,   84.,   85.,   88.,   89.,   91.,   93.,
         94.,   95.,  100.,  101.,  102.,  104.,  105.,  106.,  107.,
        108.,  110.])

In [91]:
# Let's see the recommended songs
songs_small[songs_small.custom_index.isin(recommended_songs)]

Unnamed: 0,song_id,artist,title,custom_index
8,SOFRQTD12A81C233C0,Harmonia,Sehr kosmisch,8
28,SORWLTW12A670208FA,Jimmy Eat World,The Middle,28
65,SONQEYS12AF72AABC9,Counting Crows,Mr. Jones,65
84,SOWEHOM12A6BD4E09E,The Crests,16 Candles,84
85,SOWGXOP12A6701E93A,Eminem,Without Me,85
88,SOAUWYT12A81C206F1,Björk,Undo,88
89,SOAXGDH12A8C13F8A1,Florence + The Machine,Dog Days Are Over (Radio Edit),89
91,SOBONKR12A58A7A7E0,Dwight Yoakam,You're The One,91
93,SODEOCO12A6701E922,Nirvana,Come As You Are,93
94,SODJWHY12A8C142CCE,Train,Hey_ Soul Sister,94


This recommendation has nothing to do with the songs we have selected. It is totally popularity biased, so we can determine one thing:

* An item to item co-occurrence matrix doesn't include information about users, so the weight of every song against every other song is the same. To make a good recommender we need to take into consideration similarity among users, not just songs being played by any user.
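
One common way to soften this popularity effect while staying item-to-item is to normalize each count by the two songs' own listener counts, which is what the `kind='item'` mode of the `cosineSimilarity` function further down computes. A minimal sketch with made-up counts (this is an illustration, not something the notebook runs):

```python
import numpy as np

# Hypothetical co-occurrence counts: song 0 is very popular, songs 1 and 2 are niche
toy_co = np.array([
    [1000.0, 20.0, 20.0],
    [  20.0, 25.0, 15.0],
    [  20.0, 15.0, 30.0],
])

# Divide each count by the geometric mean of the two songs' own listener counts
listeners = np.sqrt(np.diagonal(toy_co))
normalized = toy_co / np.outer(listeners, listeners)

candidates = np.array([0, 2])  # every song except song 1 itself
raw_best = candidates[np.argmax(toy_co[1, candidates])]
norm_best = candidates[np.argmax(normalized[1, candidates])]

print(raw_best)   # raw counts pick the popular song
print(norm_best)  # normalized counts pick the niche song
```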

### 4th recommender

Let's try with a user to user matrix

In [92]:
# For that, first we need to create a consecutive integer index for the users like we did with the songs
# Let's create a dataframe with the users and the new id
unique_users = pd.DataFrame(data_small.user_id.unique(), columns=["user_id"])

In [93]:
# Let's add a column with the new index
unique_users["custom_user_id"] = range(0, len(unique_users))

In [94]:
unique_users.head()

Unnamed: 0,user_id,custom_user_id
0,b80344d063b5ccb3212f76538f3d9e43d87dca9e,0
1,7c86176941718984fed11b7c0674ff04c029b480,1
2,76235885b32c4e8c82760c340dc54f9b608d7d7e,2
3,250c0fa2a77bc6695046e7c47882ecd85c42d748,3
4,3f73f44560e822344b0fb7c6b463869743eb9860,4


In [95]:
# Now let's insert this new index in data_small
data_small = data_small.merge(unique_users, on = "user_id")\
    [["user_id", "song_id", "num_plays", "custom_index", "custom_user_id"]]

In [96]:
data_small.head()

Unnamed: 0,user_id,song_id,num_plays,custom_index,custom_user_id
0,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOAKIMP12A8C130995,1,0,0
1,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBBMDR12A8C13253B,2,1,0
2,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBXHDL12A81C204C0,1,2,0
3,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBYHAJ12A6701BF1D,1,3,0
4,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SODACBL12A8C13C273,1,4,0


In [97]:
# First we create a user-item matrix
n_users = len(data_small.user_id.unique())

ui_matrix = np.zeros((n_users, n_songs))

In [98]:
# Now we fill it with the songs every user has played
# (columns 3 and 4 of data_small are custom_index and custom_user_id)
for row in data_small.values[:,3:]:
    ui_matrix[row[1], row[0]] = 1

To define a similarity measure we are going to use cosine similarity:

#### $$\mathrm{sim}({\bf a},{\bf b})=\frac{{\bf a}\cdot{\bf b}}{\sqrt{{\bf a}\cdot{\bf a}}\sqrt{{\bf  b}\cdot{\bf b}}}$$
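
As a small worked example of the formula, consider two hypothetical users described by binary listening vectors over five songs. The dot product counts the songs they share, and each norm is the square root of that user's song count:

```python
import numpy as np

# Hypothetical listening vectors over 5 songs (1 = played, 0 = not played)
a = np.array([1, 1, 0, 1, 0])
b = np.array([1, 0, 0, 1, 1])

# Shared songs divided by the product of the two vector norms
similarity = a.dot(b) / (np.sqrt(a.dot(a)) * np.sqrt(b.dot(b)))

print(similarity)  # 2 shared songs / (sqrt(3) * sqrt(3))
```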

In [99]:
# This function calculates the cosine similarity between the rows (users) or the columns (items)
# of a user x items matrix
# It returns a square matrix with the similarities
# It accepts 2 modes: item and user
def cosineSimilarity(matrix, kind='user', epsilon=1e-9):
    # epsilon is a small number for handling divide-by-zero errors
    if kind == 'user':
        sim = matrix.dot(matrix.T) + epsilon
    elif kind == 'item':
        sim = matrix.T.dot(matrix) + epsilon
    else:
        raise ValueError("kind must be 'user' or 'item'")
    norms = np.array([np.sqrt(np.diagonal(sim))])
    return (sim / norms / norms.T)

In [None]:
# we use cosine similarity
userSimilarity = cosineSimilarity(ui_matrix, kind='user')
userSimilarity.shape

In [None]:
# Unfortunately this process takes too long and my time is running out...
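
Part of the cost is probably sheer size: with the 76,353 users and 10,000 songs reported for data_small, the dense matrices involved are very large. The sketch below estimates those sizes and, as one possible workaround, compares a single listening vector against all users instead of building the full user-user matrix. The helper name `similarity_to_query` and the toy data are hypothetical:

```python
import numpy as np

n_users, n_songs = 76353, 10000  # sizes reported above for data_small
bytes_per_float = 8              # NumPy's default float64

ui_gb = n_users * n_songs * bytes_per_float / 1e9
uu_gb = n_users * n_users * bytes_per_float / 1e9
print(round(ui_gb, 1), round(uu_gb, 1))  # dense sizes in GB

def similarity_to_query(ui, query, epsilon=1e-9):
    """Cosine similarity between one listening vector and every row of ui."""
    dots = ui @ query
    norms = np.sqrt((ui * ui).sum(axis=1)) * np.sqrt(query @ query)
    # epsilon avoids dividing by zero for users with no songs
    return dots / (norms + epsilon)

# Toy check: 3 users x 4 songs, the query shares both of its songs with user 0
toy_ui = np.array([[1, 1, 0, 0],
                   [0, 0, 1, 1],
                   [1, 0, 1, 0]], dtype=float)
query = np.array([1, 1, 0, 0], dtype=float)
sims = similarity_to_query(toy_ui, query)
print(sims.argmax())
```

Since a recommendation only needs the users most similar to us, one row of the similarity matrix is enough, which turns a users x users computation into a single matrix-vector product.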