# Songs Recommender System    

Recommender systems are a popular class of information filtering systems. The goal of such a system is to predict the preference a user would give to an item or service and thus "recommend" relevant items to that user. Recommender systems are known to improve user experience on many web portals, especially those that involve a lot of social interaction or shopping. Here, we explore **recommenders** that leverage:   
1. Popularity of items   
2. Matrix Factorization (Singular Value Decomposition)

<strong>Dataset:</strong>    
We use the "Taste Profile Subset" dataset, which is auxiliary to the popular Million Song Dataset, available at: [The Million Song Dataset](https://labrosa.ee.columbia.edu/millionsong/)
The Million Song Dataset is a freely available collection of audio features and metadata for a million contemporary popular music tracks. The purpose of choosing such a large dataset is to build and test <font color="blue">recommender systems at scale</font>. Although the entire dataset is about 280 GB in size, we deal only with **3 GB** of available sample data, which still exhibits the characteristics of a large dataset while being moderately computationally intensive to model. 

In [1]:
import sqlite3
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


### 1. Import Data from log files and basic exploration 
Since the overall dataset has more than a million unique users and records for about 384,000 songs, we first work with a small subset (10,000 records) of the data.

In [2]:
dfsongs = pd.read_csv("train_triplets.txt", sep="\t", nrows=10000, names=['user', 'song', 'play_count'], header=None)

Let us inspect some of the imported records and other characteristics of data before we proceed further

In [3]:
dfsongs.head(5)

Unnamed: 0,user,song,play_count
0,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOAKIMP12A8C130995,1
1,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOAPDEY12A81C210A9,1
2,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBBMDR12A8C13253B,2
3,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBFNSP12AF72A0E22,1
4,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBFOVM12A58A7D494,1


### Power-Law in real world systems
Next, we want to know how many unique users we need to consider to build a useful recommender system. In most real-world web applications, a small number of users consume the majority of the services, while most other users use them very rarely. In such a scenario, we need to take into account only the frequent listeners to build a meaningful recommender.   

<strong>Efficiently summarizing 3.5 million records:</strong>    
To decide the percentage of such users, we need to group the records by user id. However, it can be inefficient to use pandas grouping functions over 3.5 million records. Hence, we iterate over the rows of records and accumulate the play count for each user.   

### 2. Summarizing records by users and songs:

We first summarize the records by users and then by songs.    
<strong>2.1 Users - play count summary </strong>

In [4]:
from collections import defaultdict
usersDict = defaultdict(int)
with open("train_triplets.txt", "r") as songFile:
    for record in songFile:
        #Each record is a tab delimited line: user id, song id, play count
        user, _, play_count = record.rstrip("\n").split("\t")
        #Update total play count for the user
        usersDict[user] += int(play_count)
        
userPlayCountList = [{'user': u, 'play_count': p} for u,p in usersDict.items()]
dfSongsPlayCount = pd.DataFrame(userPlayCountList)
#Arrange the user songs played records in descending order to facilitate further analysis
dfSongsPlayCount = dfSongsPlayCount.sort_values(by = 'play_count', ascending = False)
dfSongsPlayCount.to_csv("songs_play_count.csv", index=None)
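
An alternative that stays within pandas while keeping memory bounded is to read the file in chunks and fold each chunk's grouped totals into a running counter. The following is only a sketch under assumptions: the helper name `summarize_play_counts`, the chunk size and the in-memory sample rows are all made up for illustration and are not part of the original notebook.

```python
import io
from collections import Counter

import pandas as pd

def summarize_play_counts(source, key, chunksize=1_000_000):
    """Sum play_count per `key` ('user' or 'song'), reading the file in chunks."""
    totals = Counter()
    reader = pd.read_csv(source, sep="\t", header=None,
                         names=["user", "song", "play_count"],
                         chunksize=chunksize)
    for chunk in reader:
        # Aggregate within the chunk, then add the result to the running totals
        totals.update(chunk.groupby(key)["play_count"].sum().to_dict())
    summary = pd.DataFrame(list(totals.items()), columns=[key, "play_count"])
    return summary.sort_values("play_count", ascending=False).reset_index(drop=True)

# Tiny in-memory stand-in for train_triplets.txt (made-up rows)
sample = io.StringIO("u1\ts1\t3\nu2\ts1\t1\nu1\ts2\t2\nu3\ts3\t4\n")
user_summary = summarize_play_counts(sample, key="user", chunksize=2)
print(user_summary)
```

Passing `key="song"` would give the per-song summary with the same function, so a single pass pattern could serve both summaries.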

As noted earlier, we now arbitrarily target about 40% of the total play count and need to determine the number of users accounting for this share. We consider only the top 100,000 users, ranked by their total play count.

In [5]:
totalPlayCount = dfSongsPlayCount.play_count.sum()
dfUserSubset = dfSongsPlayCount.head(n=100000)
#Percentage of the total play count accounted for by the top 100,000 users
(float(dfUserSubset.play_count.sum())/totalPlayCount)*100
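
The number of users needed to reach a target share of the plays can also be computed directly instead of being chosen by trial. The sketch below assumes a hypothetical helper, `users_needed_for_share`, and uses made-up play counts; it sorts the counts in descending order, takes the cumulative share and finds the first position at which the target is reached.

```python
import numpy as np
import pandas as pd

def users_needed_for_share(play_counts, target_share):
    """Smallest number of top users whose plays reach `target_share` of the total."""
    counts = np.sort(np.asarray(play_counts))[::-1]      # descending order
    cumulative_share = np.cumsum(counts) / counts.sum()
    # Index of the first position where the cumulative share reaches the target
    return int(np.searchsorted(cumulative_share, target_share, side="left")) + 1

toy_counts = pd.Series([50, 20, 10, 10, 5, 5])  # made-up play counts per user
print(users_needed_for_share(toy_counts, 0.40))
print(users_needed_for_share(toy_counts, 0.75))
```

On the real summary, the same call could be made with `dfSongsPlayCount.play_count` and a target of `0.40`.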

<strong>2.2 Songs - play count summary:</strong>

In [6]:
songsDict = defaultdict(int)
with open("train_triplets.txt", "r") as songFile:
    for record in songFile:
        #Each record is a tab delimited line: user id, song id, play count
        _, song, play_count = record.rstrip("\n").split("\t")
        #Update total play count for the song
        songsDict[song] += int(play_count)
        
songPlayCountList = [{'song': s, 'play_count': p} for s,p in songsDict.items()]
dfSongsPlaySummary = pd.DataFrame(songPlayCountList)
#Arrange the song play count records in descending order to facilitate further analysis
dfSongsPlaySummary = dfSongsPlaySummary.sort_values(by = 'play_count', ascending = False)
dfSongsPlaySummary.to_csv("songs_summary.csv", index=None)

In [7]:
(float(dfSongsPlaySummary.head(n=30000).play_count.sum())/totalPlayCount)*100

78.39315366645269

<font color="blue">Observation:</font>     
We observe that only the top 30% of the songs account for almost 80% of all plays, and fewer than 40% of the users listen to almost all the songs.   
So, we subset the songs and users summary datasets accordingly.   

In [8]:
dfSongPlayCountSubset = dfSongsPlaySummary.head(n=30000)

Now we have obtained the top 30% of the popular songs that users listen to. Next, we need to identify these songs in the original dataset, which contains the play counts of each user for these songs.

In [9]:
dfAllSongs = pd.read_csv("train_triplets.txt", sep="\t", names=['user', 'song', 'play_count'], header=None)
#Filter records for top 40% users using the users subset found earlier
dfSongsTopUsers = dfAllSongs[dfAllSongs.user.isin(list(dfUserSubset.user))]
dfSongsTopSongs = dfSongsTopUsers[dfSongsTopUsers.song.isin(list(dfSongPlayCountSubset.song))]

In [10]:
#Save the top 30% songs dataset to a file
dfSongsTopSongs.to_csv("top_Songs_Subset.csv", index=False)
#Find the number of records in this subset
dfSongsTopSongs.shape

(10774558, 3)

So, we have obtained about 10.8 million records related to the top 30% of the popular songs. We also delete the unwanted intermediate dataframes to free up memory.

In [11]:
del dfAllSongs
del dfSongsTopUsers

Let us look at a few records from the newly obtained subset of popular songs.

In [12]:
dfSongsTopSongs.head(5)

Unnamed: 0,user,song,play_count
498,d6589314c0a9bcbca4fee0c93b14bc402363afea,SOADQPP12A67020C82,12
499,d6589314c0a9bcbca4fee0c93b14bc402363afea,SOAFTRR12AF72A8D4D,1
500,d6589314c0a9bcbca4fee0c93b14bc402363afea,SOANQFY12AB0183239,1
501,d6589314c0a9bcbca4fee0c93b14bc402363afea,SOAYATB12A6701FD50,1
502,d6589314c0a9bcbca4fee0c93b14bc402363afea,SOBOAFP12A8C131F36,7


### 3. Enhance Dataset with Song Track Details
Though we have obtained a subset of popular songs, it lacks song titles, artist names and other such details. These details can be retrieved from an available supporting metadata dataset in SQLite format at [Million Songs Metadata](https://labrosa.ee.columbia.edu/millionsong/pages/getting-dataset#subset)

In [13]:
conn = sqlite3.connect('track_metadata.db')
cur = conn.cursor()
#Find the name of tables in the database
cur.execute("SELECT name FROM sqlite_master WHERE type='table'")
cur.fetchall()

[('songs',)]

In [14]:
#Retrieve details of song tracks from the metadata file
dfTrackMetadata = pd.read_sql(con=conn, sql='select * from songs')
dfTrackMetadata_sub = dfTrackMetadata[dfTrackMetadata.song_id.isin(list(dfSongPlayCountSubset.song))]

Observe the song details dataset

In [15]:
dfTrackMetadata_sub.head(3)

Unnamed: 0,track_id,title,song_id,release,artist_id,artist_mbid,artist_name,duration,artist_familiarity,artist_hotttnesss,year,track_7digitalid,shs_perf,shs_work
115,TRMMGCB128E079651D,Get Along (Feat: Pace Won) (Instrumental),SOHNWIM12A67ADF7D9,Charango,ARU3C671187FB3F71B,067102ea-9519-4622-9077-57ca4164cfbb,Morcheeba,227.47383,0.819087,0.533117,2002,185967,-1,0
123,TRMMGTX128F92FB4D9,Viejo,SOECFIW12A8C144546,Caraluna,ARPAAPH1187FB3601B,f69d655c-ffd6-4bee-8c2a-3086b2be2fc6,Bacilos,307.51302,0.595554,0.400705,0,6825058,-1,0
145,TRMMGDP128F933E59A,I Say A Little Prayer,SOGWEOB12AB018A4D0,The Legendary Hi Records Albums_ Volume 3: Ful...,ARNNRN31187B9AE7B7,fb7272ba-f130-4f0a-934d-6eeea4c18c9a,Al Green,133.58975,0.77949,0.59921,1978,5211723,-1,11898


### 4. Preprocessing the data 
Now that we have all the relevant information, we need to combine the song details with the user ids and remove the unwanted columns from the final cleaned table.

In [16]:
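#Drop the columns we do not need; reassigning (rather than modifying the filtered
#slice in place) avoids pandas' SettingWithCopyWarning
dfTrackMetadata_sub = dfTrackMetadata_sub.drop(columns=['track_id', 'artist_mbid'])
#Drop duplicate songs ids, as we are concerned with only unique songs
dfTrackMetadata_sub = dfTrackMetadata_sub.drop_duplicates(['song_id'])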
#Join the earlier obtained top popular songs dataset and this metadata dataframe on song id
dfPopularSongMetaDataMerged = pd.merge(dfSongsTopSongs, dfTrackMetadata_sub, how="left", 
                                      left_on="song", right_on = "song_id")
#Keep only relevant columns
dfPopularSongMetaDataMerged = dfPopularSongMetaDataMerged[['user', 'song', 'title',
                                                           'play_count', 'release', 'artist_name', 'year']]
dfPopularSongMetaDataMerged.head(3)


Unnamed: 0,user,song,title,play_count,release,artist_name,year
0,d6589314c0a9bcbca4fee0c93b14bc402363afea,SOADQPP12A67020C82,You And Me Jesus,12,Tribute To Jake Hess,Jake Hess,2004
1,d6589314c0a9bcbca4fee0c93b14bc402363afea,SOAFTRR12AF72A8D4D,Harder Better Faster Stronger,1,Discovery,Daft Punk,2007
2,d6589314c0a9bcbca4fee0c93b14bc402363afea,SOANQFY12AB0183239,Uprising,1,Uprising,Muse,0


### 5. Recommendation Engine
So far, we have imported and preprocessed the songs and users data for the most popular songs. We use this clean dataset to build different variants of recommendation engines:    

### 5.1. Popularity based recommendation:    
This is a basic type of recommendation offered to users. It relies on the simple logic that if some content is highly popular among users, that content is a natural candidate to be introduced to other users.    
Thus, we build a recommender that suggests the most popular songs to users. It does not take into account a user's taste or listening history, so every user receives the same list.

In [17]:
def create_popularity_recommendation(songData, user, song):
    #Get a count of users for each unique song as recommendation score
    songDataGrouped = songData.groupby([song]).agg({user: 'count'}).reset_index()
    songDataGrouped.rename(columns = {user: 'score'},inplace=True)
    
    #Sort the songs by score (descending), breaking ties by song name (ascending)
    songDataSorted = songDataGrouped.sort_values(['score', song], ascending = [False, True])
    
    #Generate a recommendation rank based upon score
    songDataSorted['Rank'] = songDataSorted['score'].rank(ascending=False, method='first')
        
    #Get the top 10 recommendations
    popularityBasedRecommendations = songDataSorted.head(10)
    return popularityBasedRecommendations

The top 10 popularity based recommendations for songs are:

In [18]:
recommendations = create_popularity_recommendation(dfPopularSongMetaDataMerged,'user','title')
recommendations

Unnamed: 0,title,score,Rank
19580,Sehr kosmisch,18626,1.0
5780,Dog Days Are Over (Radio Edit),17635,2.0
27314,You're The One,16085,3.0
19542,Secrets,15138,4.0
18636,Revelry,14945,5.0
25070,Undo,14687,6.0
7530,Fireflies,13085,7.0
9640,Hey_ Soul Sister,12993,8.0
25216,Use Somebody,12793,9.0
9921,Horn Concerto No. 4 in E flat K495: II. Romanc...,12346,10.0
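
A small variant of the popularity recommender can skip songs that a given user has already played. This is a sketch, not the notebook's method: the function name `recommend_unheard_popular`, its parameters and the toy data are assumptions introduced for illustration; the scoring (count of user records per song) mirrors the function above.

```python
import pandas as pd

def recommend_unheard_popular(songData, user_id, user="user", song="title", top_n=10):
    """Rank songs by listener count, skipping songs `user_id` has already played."""
    # Count of user records per song is the popularity score
    scores = songData.groupby(song)[user].count().rename("score").reset_index()
    # Songs that this user has already listened to
    heard = set(songData.loc[songData[user] == user_id, song])
    unheard = scores[~scores[song].isin(heard)]
    ranked = unheard.sort_values(["score", song], ascending=[False, True])
    return ranked.head(top_n).reset_index(drop=True)

# Made-up listening records
toy = pd.DataFrame({
    "user":  ["a", "a", "b", "b", "c", "c", "c"],
    "title": ["X", "Y", "X", "Z", "X", "Y", "Z"],
})
print(recommend_unheard_popular(toy, "a"))
```

Filtering per user makes the result depend on listening history, at the cost of one extra lookup per request.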


### 5.2. Matrix Factorization based Recommendation
Recommendations based on popular content may be simple to implement, but recommendations that take into account users' tastes through their historical activity can be more effective. However, it is not always easy to represent mathematically the features of the content liked by a user. For example, features like beats, tempo, bars and danceability of music often influence a listener's choice and can be leveraged to find songs with similar features. However, deriving such features from content almost always requires a high level of domain expertise and time.    
The relationship between users and their favorite content can be represented in matrix form. Matrix factorization is the mathematical process of decomposing such a matrix into two or more component matrices whose product approximates the original. Thus, matrix factorization helps in learning the latent (initially unknown) features that relate two different entities; in this case, the latent features linking users and their preferred songs.   
We use <font color="blue"><strong>Singular Value Decomposition (SVD)</strong></font> to determine the factorization of the matrix.    
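
In symbols (the notation here is introduced only for illustration): let $R$ be the $m \times n$ matrix whose entry $r_{us}$ holds the preference of user $u$ for song $s$. A rank-$k$ truncated SVD approximates it as

$$
R \approx U_k \Sigma_k V_k^{\top}, \qquad \hat{r}_{us} = \sum_{f=1}^{k} \sigma_f \, U_{uf} \, V_{sf}
$$

where $U_k$ is $m \times k$ (users by latent features), $V_k$ is $n \times k$ (songs by latent features), and $\Sigma_k$ is the $k \times k$ diagonal matrix of the $k$ largest singular values $\sigma_1 \ge \dots \ge \sigma_k$. The reconstructed value $\hat{r}_{us}$ can then serve as a recommendation score for songs the user has not played.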
    
<font color="brown"><strong>Implicit feedback:</strong></font>     
The absence of *explicit ratings* is a well-known problem when building a recommendation engine. We do not have any user ratings for the songs apart from their play counts. We compute the *fractional play counts* and treat them as <font color="brown">implicit feedback</font> in determining the recommendations. These fractional counts give a sense of how much a user likes a song, in the range [0,1].    
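
Written out, with $c_{us}$ denoting the play count of song $s$ by user $u$ (notation assumed here for illustration), the fractional play count is

$$
f_{us} = \frac{c_{us}}{\sum_{s'} c_{us'}}
$$

where the sum runs over the songs retained for that user, so each user's fractions sum to 1. For example, a user with 329 total plays who played one song 12 times gets $f_{us} = 12 / 329 \approx 0.0365$ for that song.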

In [19]:
#Aggregate by user and calculate the total play count across all songs listened to by a user
dfPopularSongMetaDataMergedSum  = dfPopularSongMetaDataMerged[['user','play_count']].groupby('user').sum().reset_index()
dfPopularSongMetaDataMergedSum.rename(columns={'play_count':'total_play_count'},inplace=True)
#Attach each user's total to every record of that user
dfPopularSongMetaDataMerged = pd.merge(dfPopularSongMetaDataMerged,dfPopularSongMetaDataMergedSum,on='user')
#Calculate the fractional play count
dfPopularSongMetaDataMerged['fractional_play_count'] = dfPopularSongMetaDataMerged['play_count']/dfPopularSongMetaDataMerged['total_play_count']

In [20]:
#Examine the structure of the newly formed dataset
dfPopularSongMetaDataMerged.head()

Unnamed: 0,user,song,title,play_count,release,artist_name,year,total_play_count,fractional_play_count
0,d6589314c0a9bcbca4fee0c93b14bc402363afea,SOADQPP12A67020C82,You And Me Jesus,12,Tribute To Jake Hess,Jake Hess,2004,329,0.036474
1,d6589314c0a9bcbca4fee0c93b14bc402363afea,SOAFTRR12AF72A8D4D,Harder Better Faster Stronger,1,Discovery,Daft Punk,2007,329,0.00304
2,d6589314c0a9bcbca4fee0c93b14bc402363afea,SOANQFY12AB0183239,Uprising,1,Uprising,Muse,0,329,0.00304
3,d6589314c0a9bcbca4fee0c93b14bc402363afea,SOAYATB12A6701FD50,Breakfast At Tiffany's,1,Home,Deep Blue Something,1993,329,0.00304
4,d6589314c0a9bcbca4fee0c93b14bc402363afea,SOBOAFP12A8C131F36,Lucky (Album Version),7,We Sing. We Dance. We Steal Things.,Jason Mraz & Colbie Caillat,0,329,0.021277


In [21]:
dfPopularSongMetaDataMerged.to_csv("popular_songs_metadata.csv", index=False)

In [22]:
dfPopularSongMetaDataMerged.shape

(10774558, 9)
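
To make the factorization step concrete, here is a minimal sketch of SVD-based scoring on a toy user-song matrix of fractional play counts. Everything in it is an assumption for illustration: the data are made up, the `recommend` helper is hypothetical, and a dense `numpy.linalg.svd` is used only because the example is tiny. A matrix with millions of records would more plausibly need a sparse truncated solver such as `scipy.sparse.linalg.svds`.

```python
import numpy as np
import pandas as pd

# Made-up fractional play counts; each user's fractions sum to 1
toy = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u2", "u3", "u3"],
    "song": ["s1", "s2", "s1", "s3", "s2", "s3"],
    "fractional_play_count": [0.7, 0.3, 0.6, 0.4, 0.5, 0.5],
})

# Users as rows, songs as columns; songs a user never played are filled with 0
R = toy.pivot_table(index="user", columns="song",
                    values="fractional_play_count", fill_value=0.0)

# Thin SVD: R = U @ diag(s) @ Vt, singular values in descending order
U, s, Vt = np.linalg.svd(R.to_numpy(), full_matrices=False)

k = 2  # number of latent features to keep (arbitrary for this toy example)
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
scores = pd.DataFrame(R_hat, index=R.index, columns=R.columns)

def recommend(user_id, top_n=1):
    """Highest-scoring songs that the user has not played yet."""
    unheard = R.loc[user_id] == 0
    return scores.loc[user_id][unheard].sort_values(ascending=False).head(top_n)

print(recommend("u1"))
```

The choice of `k` trades reconstruction accuracy against generalization; it is a tuning parameter rather than something the data fix.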
