### DATASETS USED FOR THE PROJECT

#### Million Song Dataset
##### Source: http://labrosa.ee.columbia.edu/millionsong/
##### Paper: http://ismir2011.ismir.net/papers/OS6-1.pdf


#### Imported modules

In [2]:
import pandas  # pandas is used for creating data frames
from sklearn.model_selection import train_test_split  # scikit-learn helper that splits data into training and test sets
import numpy as np  # NumPy stores the values of a data frame in N-dimensional arrays
import Recommenders as Recommenders  # Recommenders is a local Python module whose classes recommend songs for a playlist using score and rank
import time
import joblib  # sklearn.externals.joblib was removed from recent scikit-learn releases; the standalone joblib package replaces it
import Evaluation as Evaluation  # Evaluation is a local Python file that helps calculate precision and recall scores

### GETTING MUSIC DATA

In [3]:
triplets_file = 'file:///D:/winter%20sem/mini%20project/10000.txt'
songs_metadata_file = r'D:\winter sem\mini project\song_data.csv'  # raw string keeps the backslashes literal
# Read the (user_id, song_id, listen_count) triplets
song_df_1 = pandas.read_table(triplets_file, header=None)
song_df_1.columns = ['user_id', 'song_id', 'listen_count']
# Read the song metadata
song_df_2 = pandas.read_csv(songs_metadata_file)

##### Two data frames are loaded, one from each of the two data sets

In [4]:
song_df_1.head(5)

Unnamed: 0,user_id,song_id,listen_count
0,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOAKIMP12A8C130995,1
1,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBBMDR12A8C13253B,2
2,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBXHDL12A81C204C0,1
3,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBYHAJ12A6701BF1D,1
4,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SODACBL12A8C13C273,1


In [5]:
song_df_2.head(5)

Unnamed: 0,song_id,title,release,artist_name,year
0,SOQMMHC12AB0180CB8,Silent Night,Monster Ballads X-Mas,Faster Pussy cat,2003
1,SOVFVAK12A8C1350D9,Tanssi vaan,Karkuteillä,Karkkiautomaatti,1995
2,SOGTUKN12AB017F4F1,No One Could Ever,Butter,Hudson Mohawke,2006
3,SOBNYVR12A8C13558C,Si Vos Querés,De Culo,Yerba Brava,2003
4,SOHSBXH12A8C13B0DF,Tangle Of Aspens,Rene Ablaze Presents Winter Sessions,Der Mystic,0


# Preprocessing

### Merge the two data frames on song_id, dropping duplicate song_id rows from the metadata first so that each song is matched only once

In [6]:
song_df = pandas.merge(song_df_1, song_df_2.drop_duplicates(['song_id']), on="song_id", how="left")
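
The effect of dropping duplicates before the merge can be seen on a tiny example. This is only a sketch: the two toy frames `plays` and `metadata` are made-up stand-ins for `song_df_1` and `song_df_2`, not rows from the real files.

```python
import pandas as pd

# Toy stand-ins (assumed data) for song_df_1 and song_df_2.
plays = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "song_id": ["S1", "S2", "S1"],
    "listen_count": [1, 2, 5],
})
metadata = pd.DataFrame({
    "song_id": ["S1", "S1", "S2"],  # S1 is listed twice
    "title": ["Stronger", "Stronger", "Undo"],
    "artist_name": ["Kanye West", "Kanye West", "Björk"],
})

# Without de-duplication, every play of S1 is matched twice.
naive = pd.merge(plays, metadata, on="song_id", how="left")

# With de-duplication, the left merge keeps exactly one row per play.
clean = pd.merge(plays, metadata.drop_duplicates(["song_id"]),
                 on="song_id", how="left")

print(len(plays), len(naive), len(clean))
```

With `how="left"` every row of the play data is kept, so after de-duplication the merged frame has the same length as the play data.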

# Explore data

Music data shows how many times a user listened to a song, as well as the details of the song.

In [7]:
song_df.head(10)

Unnamed: 0,user_id,song_id,listen_count,title,release,artist_name,year
0,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOAKIMP12A8C130995,1,The Cove,Thicker Than Water,Jack Johnson,0
1,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBBMDR12A8C13253B,2,Entre Dos Aguas,Flamenco Para Niños,Paco De Lucia,1976
2,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBXHDL12A81C204C0,1,Stronger,Graduation,Kanye West,2007
3,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBYHAJ12A6701BF1D,1,Constellations,In Between Dreams,Jack Johnson,2005
4,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SODACBL12A8C13C273,1,Learn To Fly,There Is Nothing Left To Lose,Foo Fighters,1999
5,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SODDNQT12A6D4F5F7E,5,Apuesta Por El Rock 'N' Roll,Antología Audiovisual,Héroes del Silencio,2007
6,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SODXRTY12AB0180F3B,1,Paper Gangsta,The Fame Monster,Lady GaGa,2008
7,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOFGUAY12AB017B0A8,1,Stacked Actors,There Is Nothing Left To Lose,Foo Fighters,1999
8,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOFRQTD12A81C233C0,1,Sehr kosmisch,Musik von Harmonia,Harmonia,0
9,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOHQWYZ12A6D4FA701,1,Heaven's gonna burn your eyes,Hôtel Costes 7 by Stéphane Pompougnac,Thievery Corporation feat. Emiliana Torrini,2002


## Length of the dataset

In [8]:
len(song_df)

2000000

## Creating a subset of the dataset as a sample, so that the best-performing method can later be applied to the main data


In [9]:
song_df = song_df.head(10000).copy()  # copy so that adding a column does not trigger SettingWithCopyWarning

# Merge the title and artist_name columns into a single 'song' column
song_df['song'] = song_df['title'].map(str) + " - " + song_df['artist_name']

In [10]:
song_df['song'].head(10)

0                              The Cove - Jack Johnson
1                      Entre Dos Aguas - Paco De Lucia
2                                Stronger - Kanye West
3                        Constellations - Jack Johnson
4                          Learn To Fly - Foo Fighters
5    Apuesta Por El Rock 'N' Roll - Héroes del Sile...
6                            Paper Gangsta - Lady GaGa
7                        Stacked Actors - Foo Fighters
8                             Sehr kosmisch - Harmonia
9    Heaven's gonna burn your eyes - Thievery Corpo...
Name: song, dtype: object

## Grouping the songs and counting the listen records for each song in the dataset

In [11]:
song_grouped = song_df.groupby(['song']).agg({'listen_count': 'count'}).reset_index()


In [12]:
song_grouped.head(10)

Unnamed: 0,song,listen_count
0,#40 - DAVE MATTHEWS BAND,1
1,& Down - Boys Noize,4
2,'97 Bonnie & Clyde - Eminem,2
3,'Round Midnight - Miles Davis,3
4,'Till I Collapse - Eminem / Nate Dogg,6
5,(Anaesthesia) Pulling Teath - Metallica,1
6,(I Cant Get No) Satisfaction - Cat Power,1
7,(I Can't Get Me No) Satisfaction - Devo,1
8,(I Just) Died In Your Arms - Cutting Crew,2
9,(Nice Dream) - Radiohead,2


In [13]:
grouped_sum = song_grouped['listen_count'].sum()


grouped_sum

### Calculating listen count and percentage, then sorting the grouped songs

In [15]:
song_grouped['percentage']  = song_grouped['listen_count'].div(grouped_sum)*100
song_grouped.sort_values(['listen_count', 'song'], ascending=[False, True]).head(10)

Unnamed: 0,song,listen_count,percentage
3660,Sehr kosmisch - Harmonia,45,0.45
4678,Undo - Björk,32,0.32
5105,You're The One - Dwight Yoakam,32,0.32
1071,Dog Days Are Over (Radio Edit) - Florence + Th...,28,0.28
3655,Secrets - OneRepublic,28,0.28
4378,The Scientist - Coldplay,27,0.27
4712,Use Somebody - Kings Of Leon,27,0.27
3476,Revelry - Kings Of Leon,26,0.26
1387,Fireflies - Charttraxx Karaoke,24,0.24
1862,Horn Concerto No. 4 in E flat K495: II. Romanc...,23,0.23
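
Note that `agg({'listen_count': 'count'})` counts rows per song rather than adding up the plays, so the `listen_count` column in the grouped frame is the number of user-song records and `percentage` is each song's share of those records. That reading appears consistent with the table above, where 45 records in the 10,000-row sample give 0.45. A minimal sketch on made-up data (the frame `toy` is an assumption for illustration):

```python
import pandas as pd

# Made-up sample: song A has three listen records, song B has one.
toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u1"],
    "song": ["A", "A", "A", "B"],
    "listen_count": [1, 7, 2, 3],
})

# 'count' counts the rows in each group; it does not sum the plays.
grouped = toy.groupby(["song"]).agg({"listen_count": "count"}).reset_index()

# The total of the per-song counts equals the number of rows in the sample.
total = grouped["listen_count"].sum()
grouped["percentage"] = grouped["listen_count"].div(total) * 100

print(grouped)
```

Replacing `'count'` with `'sum'` would rank songs by total plays instead, which is a different popularity measure.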


## Counting number of unique users in the dataset

In [16]:
users = song_df['user_id'].unique()

In [17]:
len(users)

365

##  Counting the number of unique songs in the dataset

In [18]:
songs = song_df['song'].unique()
len(songs)

5151

## Create a song recommender for generating playlist

## Next, split the data for training and testing, then calculate precision and recall scores for the different algorithms, which use two attributes (user_id and song)

In [19]:
train_data, test_data = train_test_split(song_df, test_size = 0.20, random_state=0)
print(train_data.head(5))

                                       user_id             song_id  \
7389  94d5bdc37683950e90c56c9b32721edb5d347600  SOXNZOW12AB017F756   
9275  1012ecfd277b96487ed8357d02fa8326b13696a5  SOXHYVQ12AB0187949   
2995  15415fa2745b344bce958967c346f2a89f792f63  SOOSZAZ12A6D4FADF8   
5316  ffadf9297a99945c0513cd87939d91d8b602936b  SOWDJEJ12A8C1339FE   
356   5a905f000fc1ff3df7ca807d57edb608863db05d  SOAMPRJ12A8AE45F38   

      listen_count                 title  \
7389             2      Half Of My Heart   
9275             1  The Beautiful People   
2995             1     Sanctify Yourself   
5316             4     Heart Cooks Brain   
356             20                 Rorol   

                                                release      artist_name  \
7389                                     Battle Studies       John Mayer   
9275             Antichrist Superstar (Ecopac Explicit)   Marilyn Manson   
2995                             Glittering Prize 81/92     Simple Minds   
5316  Ever

## Now applying a simple popularity-based recommender class

#### This popularity-based recommender uses listen count as the popularity parameter, scores each song on it, and recommends a playlist for a user

##### Recommenders.popularity_recommender_py

### Create an instance of popularity based recommender class

In [20]:
pm = Recommenders.popularity_recommender_py()
pm.create(train_data, 'user_id', 'song')

### Use the popularity model to make some predictions

In [21]:
user_id = users[5]
pm.recommend(user_id)

Unnamed: 0,user_id,song,score,Rank
3194,4bd88bfb25263a75bbdd467e74018f4ae570e5df,Sehr kosmisch - Harmonia,37,1.0
4083,4bd88bfb25263a75bbdd467e74018f4ae570e5df,Undo - Björk,27,2.0
931,4bd88bfb25263a75bbdd467e74018f4ae570e5df,Dog Days Are Over (Radio Edit) - Florence + Th...,24,3.0
4443,4bd88bfb25263a75bbdd467e74018f4ae570e5df,You're The One - Dwight Yoakam,24,4.0
3034,4bd88bfb25263a75bbdd467e74018f4ae570e5df,Revelry - Kings Of Leon,21,5.0
3189,4bd88bfb25263a75bbdd467e74018f4ae570e5df,Secrets - OneRepublic,21,6.0
4112,4bd88bfb25263a75bbdd467e74018f4ae570e5df,Use Somebody - Kings Of Leon,21,7.0
1207,4bd88bfb25263a75bbdd467e74018f4ae570e5df,Fireflies - Charttraxx Karaoke,20,8.0
1577,4bd88bfb25263a75bbdd467e74018f4ae570e5df,Hey_ Soul Sister - Train,19,9.0
1626,4bd88bfb25263a75bbdd467e74018f4ae570e5df,Horn Concerto No. 4 in E flat K495: II. Romanc...,19,10.0


In [22]:
user_id = users[6]
pm.recommend(user_id)

Unnamed: 0,user_id,song,score,Rank
3194,e006b1a48f466bf59feefed32bec6494495a4436,Sehr kosmisch - Harmonia,37,1.0
4083,e006b1a48f466bf59feefed32bec6494495a4436,Undo - Björk,27,2.0
931,e006b1a48f466bf59feefed32bec6494495a4436,Dog Days Are Over (Radio Edit) - Florence + Th...,24,3.0
4443,e006b1a48f466bf59feefed32bec6494495a4436,You're The One - Dwight Yoakam,24,4.0
3034,e006b1a48f466bf59feefed32bec6494495a4436,Revelry - Kings Of Leon,21,5.0
3189,e006b1a48f466bf59feefed32bec6494495a4436,Secrets - OneRepublic,21,6.0
4112,e006b1a48f466bf59feefed32bec6494495a4436,Use Somebody - Kings Of Leon,21,7.0
1207,e006b1a48f466bf59feefed32bec6494495a4436,Fireflies - Charttraxx Karaoke,20,8.0
1577,e006b1a48f466bf59feefed32bec6494495a4436,Hey_ Soul Sister - Train,19,9.0
1626,e006b1a48f466bf59feefed32bec6494495a4436,Horn Concerto No. 4 in E flat K495: II. Romanc...,19,10.0
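
Both users above receive the same ten songs, which is what a popularity model is expected to do: the ranking does not depend on the user. The `Recommenders` module itself is not shown in this notebook, so the following is only a sketch of how such a model might work, assuming the score is the number of training rows per song. The function name `popularity_recommendations` and the toy data are hypothetical.

```python
import pandas as pd

def popularity_recommendations(train, user_id, user_col="user_id",
                               item_col="song", top_n=10):
    """Hypothetical sketch: rank items by their number of training rows."""
    scores = (
        train.groupby(item_col)[user_col]
        .count()
        .reset_index(name="score")
        .sort_values(["score", item_col], ascending=[False, True])
    )
    # Ties are ranked in sorted order, so ranks run 1, 2, 3, ...
    scores["Rank"] = scores["score"].rank(ascending=False, method="first")
    top = scores.head(top_n).copy()
    # The user id is only attached as a label; it does not affect the ranking.
    top.insert(0, user_col, user_id)
    return top

# Made-up training data for illustration.
train = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u1", "u2", "u3"],
    "song": ["A", "A", "A", "B", "B", "C"],
})
rec_u1 = popularity_recommendations(train, "u1")
rec_u2 = popularity_recommendations(train, "u2")
print(rec_u1)
```

Because the user id is used only as a label, any two users get identical lists, which motivates the personalized model in the next section.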


## Build a song recommender with personalization

We now create an item similarity based collaborative filtering model that allows us to make personalized recommendations to each user. 

## Class for an item similarity based personalized recommender system

#### Recommenders.item_similarity_recommender_py

### Create an instance of item similarity based recommender class

In [23]:
is_model = Recommenders.item_similarity_recommender_py()
is_model.create(train_data, 'user_id', 'song')

### Use the personalized model to make some song recommendations

In [24]:
#Print the songs for the user in training data
user_id = users[5]
user_items = is_model.get_user_items(user_id)
#
print("------------------------------------------------------------------------------------")
print("Training data songs for the user userid: %s:" % user_id)
print("------------------------------------------------------------------------------------")

for user_item in user_items:
    print(user_item)

print("----------------------------------------------------------------------")
print("Recommendation process going on:")
print("----------------------------------------------------------------------")

#Recommend songs for the user using personalized model
is_model.recommend(user_id)

------------------------------------------------------------------------------------
Training data songs for the user userid: 4bd88bfb25263a75bbdd467e74018f4ae570e5df:
------------------------------------------------------------------------------------
Just Lose It - Eminem
Without Me - Eminem
16 Candles - The Crests
Speechless - Lady GaGa
Push It - Salt-N-Pepa
Ghosts 'n' Stuff (Original Instrumental Mix) - Deadmau5
Say My Name - Destiny's Child
My Dad's Gone Crazy - Eminem / Hailie Jade
The Real Slim Shady - Eminem
Somebody To Love - Justin Bieber
Forgive Me - Leona Lewis
Missing You - John Waite
Ya Nada Queda - Kudai
----------------------------------------------------------------------
Recommendation process going on:
----------------------------------------------------------------------
No. of unique songs for the user: 13
no. of unique songs in the training set: 4483
Non zero values in cooccurence_matrix :2097


Unnamed: 0,user_id,song,score,rank
0,4bd88bfb25263a75bbdd467e74018f4ae570e5df,Superman - Eminem / Dina Rae,0.088692,1.0
1,4bd88bfb25263a75bbdd467e74018f4ae570e5df,Mockingbird - Eminem,0.067663,2.0
2,4bd88bfb25263a75bbdd467e74018f4ae570e5df,I'm Back - Eminem,0.065385,3.0
3,4bd88bfb25263a75bbdd467e74018f4ae570e5df,U Smile - Justin Bieber,0.064525,4.0
4,4bd88bfb25263a75bbdd467e74018f4ae570e5df,Here Without You - 3 Doors Down,0.062293,5.0
5,4bd88bfb25263a75bbdd467e74018f4ae570e5df,Hellbound - J-Black & Masta Ace,0.055769,6.0
6,4bd88bfb25263a75bbdd467e74018f4ae570e5df,The Seed (2.0) - The Roots / Cody Chestnutt,0.052564,7.0
7,4bd88bfb25263a75bbdd467e74018f4ae570e5df,I'm The One Who Understands (Edit Version) - War,0.052564,8.0
8,4bd88bfb25263a75bbdd467e74018f4ae570e5df,Falling - Iration,0.052564,9.0
9,4bd88bfb25263a75bbdd467e74018f4ae570e5df,Armed And Ready (2009 Digital Remaster) - The ...,0.052564,10.0


###  Use the personalized model to make recommendations for the following user id. (Note the difference in recommendations from the first user id.)

### We can also apply the model to find similar songs to any song in the dataset

In [26]:
is_model.get_similar_items(['U Smile - Justin Bieber'])

no. of unique songs in the training set: 4483
Non zero values in cooccurence_matrix :271


Unnamed: 0,user_id,song,score,rank
0,,Somebody To Love - Justin Bieber,0.428571,1.0
1,,Bad Company - Five Finger Death Punch,0.375,2.0
2,,Love Me - Justin Bieber,0.333333,3.0
3,,One Time - Justin Bieber,0.333333,4.0
4,,Here Without You - 3 Doors Down,0.333333,5.0
5,,Stuck In The Moment - Justin Bieber,0.333333,6.0
6,,Teach Me How To Dougie - California Swag District,0.333333,7.0
7,,Paper Planes - M.I.A.,0.333333,8.0
8,,Already Gone - Kelly Clarkson,0.333333,9.0
9,,The Funeral (Album Version) - Band Of Horses,0.3,10.0
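
The implementation of `item_similarity_recommender_py` is not shown here. Scores such as 0.428571 (3/7) and 0.375 (3/8) are ratios of small integers, which would be consistent with a Jaccard index over the sets of listeners of two songs, but that is an inference rather than something the notebook confirms. A minimal sketch under that assumption, with a hypothetical function name and made-up data:

```python
import pandas as pd

def jaccard_similar_items(train, song, user_col="user_id",
                          item_col="song", top_n=10):
    """Hypothetical sketch: rank songs by Jaccard overlap of their listeners."""
    # Map every song to the set of users who listened to it.
    listeners = train.groupby(item_col)[user_col].apply(set)
    target = listeners[song]
    rows = []
    for other, users in listeners.items():
        if other == song:
            continue
        # Jaccard index: shared listeners divided by all listeners of either song.
        score = len(target & users) / len(target | users)
        if score > 0:
            rows.append({item_col: other, "score": score})
    result = pd.DataFrame(rows, columns=[item_col, "score"])
    result = result.sort_values(["score", item_col], ascending=[False, True])
    return result.head(top_n).reset_index(drop=True)

# Made-up training data for illustration.
train = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u1", "u2", "u4", "u3", "u5"],
    "song": ["A", "A", "A", "B", "B", "B", "C", "D"],
})
similar = jaccard_similar_items(train, "A")
print(similar)
```

In this toy data, songs A and B share two of four distinct listeners, A and C share one of three, and D shares none, so D is left out of the result.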


## Quantitative comparison between the models
We now formally compare the popularity and the personalized models using precision-recall curves.
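
As a reminder of what is being measured, the sketch below computes precision and recall for the top k recommendations of a single user. It is not the `Evaluation.precision_recall_calculator` class, whose code is not shown; the function name `precision_recall_at_k` and the sample lists are assumptions for illustration, and dividing precision by k is one common convention among several.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Hypothetical helper: precision and recall of the top-k recommendations."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    # Precision: fraction of the k recommended songs that the user listened to.
    precision = hits / k if k > 0 else 0.0
    # Recall: fraction of the user's test songs that were recommended.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Made-up example: ranked recommendations and the user's songs in the test set.
recommended = ["A", "B", "C", "D"]
relevant = {"B", "D", "E"}

p2, r2 = precision_recall_at_k(recommended, relevant, 2)
p4, r4 = precision_recall_at_k(recommended, relevant, 4)
print(p2, r2, p4, r4)
```

Averaging these values over a sample of users for several values of k gives the points of a precision-recall curve for each model.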

### Class to calculate precision and recall 

### Use the above precision recall calculator class to calculate the evaluation measures

In [118]:
start = time.time()

#Define what percentage of users to use for precision recall calculation
user_sample = 0.05

#Instantiate the precision_recall_calculator class
pr = Evaluation.precision_recall_calculator(test_data, train_data, pm, is_model)

#Call method to calculate precision and recall values
(pm_avg_precision_list, pm_avg_recall_list, ism_avg_precision_list, ism_avg_recall_list) = pr.calculate_measures(user_sample)

end = time.time()
print(end - start)

Length of user_test_and_training:319
Length of user sample:15
Getting recommendations for user:9b887e10a4711486085c4fae2d2599fc0d2c484d
No. of unique songs for the user: 128
no. of unique songs in the training set: 4483
Non zero values in cooccurence_matrix :25830
Getting recommendations for user:d66f2f66f2bdc9aa3d0362a35fc91ccc844101f7
No. of unique songs for the user: 20
no. of unique songs in the training set: 4483
Non zero values in cooccurence_matrix :913
Getting recommendations for user:4208d4ac45e7caab7167a4ea6d34e759a6b9a1fc
No. of unique songs for the user: 33
no. of unique songs in the training set: 4483
Non zero values in cooccurence_matrix :6769
Getting recommendations for user:497f5a58ffeaa953d619e95ca5b8736e74b99127
No. of unique songs for the user: 11
no. of unique songs in the training set: 4483
Non zero values in cooccurence_matrix :2573
Getting recommendations for user:ae99321be6e9a79bb9c2dd4b2aa8b49fdb9efdf8
No. of unique songs for the user: 11
no. of unique songs in

### Code to plot precision recall curve

In [119]:
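import matplotlib.pyplot as pl  # pyplot provides every plotting call used below; the pylab interface is discouraged

# Function to plot the precision-recall curves of two models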
def plot_precision_recall(m1_precision_list, m1_recall_list, m1_label, m2_precision_list, m2_recall_list, m2_label):
    pl.clf()    
    pl.plot(m1_recall_list, m1_precision_list, label=m1_label)
    pl.plot(m2_recall_list, m2_precision_list, label=m2_label)
    pl.xlabel('Recall')
    pl.ylabel('Precision')
    pl.ylim([0.0, 0.20])
    pl.xlim([0.0, 0.20])
    pl.title('Precision-Recall curve')
    #pl.legend(loc="upper right")
    pl.legend(loc=9, bbox_to_anchor=(0.5, -0.2))
    pl.show()

In [None]:
print("Plotting precision recall curves.")

plot_precision_recall(pm_avg_precision_list, pm_avg_recall_list, "popularity_model",ism_avg_precision_list, ism_avg_recall_list, "item_similarity_model")

Plotting precision recall curves.
