## Dataset Information

The Million Song Dataset consists of two files: triplet_file and metadata_file. The triplet_file contains user_id, song_id and listen count. The metadata_file contains song_id, title, release, year and artist_name. The dataset combines songs from various websites with the ratings that users gave after listening to them.

There are three types of recommendation system: content-based, collaborative, and popularity-based. This notebook builds a popularity-based recommender and an item-similarity recommender.

## Import modules

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
for dirname, _, filenames in os.walk('/kaggle/usr/lib'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

/kaggle/input/innomatics-music-recom/kaggle_songs.txt
/kaggle/input/innomatics-music-recom/unique_tracks.txt
/kaggle/input/innomatics-music-recom/taste_profile_song_to_tracks.txt
/kaggle/input/innomatics-music-recom/kaggle_users.txt
/kaggle/input/innomatics-music-recom/kaggle_visible_evaluation_triplets.txt
/kaggle/usr/lib/recommenders_py/recommenders_py.py


In [2]:
import pandas as pd
import numpy as np
from recommenders_py import recommenders_py as Recommenders

## Loading the dataset

In [3]:
song_df_1 = pd.read_csv('../input/innomatics-music-recom/kaggle_visible_evaluation_triplets.txt', sep='\t',names=['user_id','song_id','listen_count'])
song_df_1.head()

Unnamed: 0,user_id,song_id,listen_count
0,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SOBONKR12A58A7A7E0,1
1,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SOEGIYH12A6D4FC0E3,1
2,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SOFLJQZ12A6D4FADA6,1
3,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SOHTKMO12AB01843B0,1
4,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SODQZCY12A6D4F9D11,1


In [4]:
# a multi-character separator needs the Python parsing engine (avoids a ParserWarning)
song_df_2 = pd.read_csv('../input/innomatics-music-recom/unique_tracks.txt', sep='<SEP>', engine='python', names=['track_id','song_id','artist_name','song_release'])
song_df_2.head()

Unnamed: 0,track_id,song_id,artist_name,song_release
0,TRMMMYQ128F932D901,SOQMMHC12AB0180CB8,Faster Pussy cat,Silent Night
1,TRMMMKD128F425225D,SOVFVAK12A8C1350D9,Karkkiautomaatti,Tanssi vaan
2,TRMMMRX128F93187D9,SOGTUKN12AB017F4F1,Hudson Mohawke,No One Could Ever
3,TRMMMCH128F425532C,SOBNYVR12A8C13558C,Yerba Brava,Si Vos Querés
4,TRMMMWA128F426B589,SOHSBXH12A8C13B0DF,Der Mystic,Tangle Of Aspens
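
The track file uses the literal string `<SEP>` as its field delimiter. pandas treats any separator longer than one character as a regular expression, which only the Python parsing engine supports. The sketch below shows this on two made-up rows held in memory; the IDs, artists and titles are invented for illustration and are not taken from the dataset.

```python
import io
import pandas as pd

# Two made-up rows in the same '<SEP>'-delimited layout as unique_tracks.txt
sample = io.StringIO(
    "TR0001<SEP>SO0001<SEP>Artist A<SEP>Title One\n"
    "TR0002<SEP>SO0002<SEP>Artist B<SEP>Title Two\n"
)

tracks = pd.read_csv(
    sample,
    sep="<SEP>",
    engine="python",  # multi-character separators need the Python parser
    header=None,      # the file has no header row
    names=["track_id", "song_id", "artist_name", "song_release"],
)
print(tracks)
```

Passing `engine="python"` explicitly avoids the fallback warning that pandas otherwise emits for this separator.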


In [5]:
# merge the listen triplets with the track metadata, keeping one track row per song_id
song_df = pd.merge(song_df_1, song_df_2.drop_duplicates(['song_id']), on='song_id', how='left')
song_df.head()

Unnamed: 0,user_id,song_id,listen_count,track_id,artist_name,song_release
0,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SOBONKR12A58A7A7E0,1,TRAEHHJ12903CF492F,Dwight Yoakam,You're The One
1,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SOEGIYH12A6D4FC0E3,1,TRLGMFJ128F4217DBE,Barry Tuckwell/Academy of St Martin-in-the-Fie...,Horn Concerto No. 4 in E flat K495: II. Romanc...
2,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SOFLJQZ12A6D4FADA6,1,TRTNDNE128F1486812,Cartola,Tive Sim
3,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SOHTKMO12AB01843B0,1,TRASTUE128F930D488,Lonnie Gordon,Catch You Baby (Steve Pitron & Max Sanna Radio...
4,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SODQZCY12A6D4F9D11,1,TRFPLWO128F1486B9E,Miguel Calo,El Cuatrero


In [6]:
print(len(song_df_1), len(song_df_2))

1450933 1000000


In [7]:
len(song_df)

1450933
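
The merged frame has exactly as many rows as the triplet file because the track metadata is deduplicated on `song_id` before the left merge. A small sketch with toy data (all values invented) shows what happens with and without that step:

```python
import pandas as pd

plays = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "song_id": ["s1", "s2", "s1"],
    "listen_count": [1, 3, 2],
})
tracks = pd.DataFrame({
    "track_id": ["t1", "t2", "t3"],
    "song_id": ["s1", "s1", "s2"],  # s1 maps to two tracks
    "song_release": ["Song One", "Song One (Live)", "Song Two"],
})

# Without deduplication, every play of s1 is repeated once per matching track
naive = plays.merge(tracks, on="song_id", how="left")

# With deduplication, each play matches at most one track row;
# validate= makes pandas raise if the right side still has duplicate keys
deduped = plays.merge(
    tracks.drop_duplicates(["song_id"]),
    on="song_id",
    how="left",
    validate="many_to_one",
)

print(len(plays), len(naive), len(deduped))  # 3 5 3
```

Note that `drop_duplicates` keeps the first track it sees for each `song_id`, so which title survives depends on file order.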

## Data Preprocessing

In [8]:
# keep only the first 10,000 rows for quick results
song_df = song_df.head(10000)

In [9]:
# number of listen records (user-song rows) per song title; 'count' counts rows, it does not sum listen_count
song_grouped = song_df.groupby(['song_release']).agg({'listen_count':'count'}).reset_index()
song_grouped.head()

Unnamed: 0,song_release,listen_count
0,#40,1
1,$in$,1
2,& Down,1
3,&And The World Will Cease To Be,1
4,'A Cimma,1
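
The aggregation above uses `'count'`, which tallies how many rows (user-song pairs) each title has, rather than adding up the `listen_count` values. The toy example below, with invented data, puts the two aggregations side by side so the difference is visible:

```python
import pandas as pd

plays = pd.DataFrame({
    "song_release": ["A", "A", "B"],
    "listen_count": [1, 5, 2],
})

stats = (
    plays.groupby("song_release")["listen_count"]
    .agg(rows="count", total_listens="sum")  # named aggregation
    .reset_index()
)
print(stats)
```

Song `A` has 2 rows but 6 total listens, so the two measures can rank songs differently.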


In [10]:
grouped_sum = song_grouped['listen_count'].sum()
song_grouped['percentage'] = (song_grouped['listen_count'] / grouped_sum ) * 100
song_grouped.sort_values(['listen_count', 'song_release'], ascending=[False, True])

Unnamed: 0,song_release,listen_count,percentage
7113,You're The One,40,0.40
6524,Undo,38,0.38
5089,Sehr kosmisch,36,0.36
4814,Revelry,32,0.32
1468,Dog Days Are Over (Radio Edit),29,0.29
...,...,...,...
7178,Árboles de la barranca,1,0.01
7179,Ännu En Dag,1,0.01
7180,Ça Marche,1,0.01
7181,Örökké Tart,1,0.01


## Popularity Recommendation Engine

In [11]:
pr = Recommenders.popularity_recommender_py()

In [12]:
pr.create(song_df, 'user_id', 'song_release')

In [13]:
# display the top 10 popular songs
pr.recommend(song_df['user_id'][5])

Unnamed: 0,user_id,song_release,score,Rank
7113,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,You're The One,40,1.0
6524,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,Undo,38,2.0
5089,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,Sehr kosmisch,36,3.0
4814,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,Revelry,32,4.0
1468,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,Dog Days Are Over (Radio Edit),29,5.0
650,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,Billionaire [feat. Bruno Mars] (Explicit Albu...,24,6.0
1581,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,Drop The World,22,7.0
1934,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,Fireflies,22,8.0
2581,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,Horn Concerto No. 4 in E flat K495: II. Romanc...,22,9.0
4799,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,Représente,20,10.0


In [14]:
pr.recommend(song_df['user_id'][100])

Unnamed: 0,user_id,song_release,score,Rank
7113,fdf6afb5daefb42774617cf223475c6013969724,You're The One,40,1.0
6524,fdf6afb5daefb42774617cf223475c6013969724,Undo,38,2.0
5089,fdf6afb5daefb42774617cf223475c6013969724,Sehr kosmisch,36,3.0
4814,fdf6afb5daefb42774617cf223475c6013969724,Revelry,32,4.0
1468,fdf6afb5daefb42774617cf223475c6013969724,Dog Days Are Over (Radio Edit),29,5.0
650,fdf6afb5daefb42774617cf223475c6013969724,Billionaire [feat. Bruno Mars] (Explicit Albu...,24,6.0
1581,fdf6afb5daefb42774617cf223475c6013969724,Drop The World,22,7.0
1934,fdf6afb5daefb42774617cf223475c6013969724,Fireflies,22,8.0
2581,fdf6afb5daefb42774617cf223475c6013969724,Horn Concerto No. 4 in E flat K495: II. Romanc...,22,9.0
4799,fdf6afb5daefb42774617cf223475c6013969724,Représente,20,10.0
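
Both calls return the same ten songs because a popularity recommender ignores the user and simply ranks items by how many users have them. The source of `recommenders_py` is not shown in this notebook, so the following is only a minimal sketch of that idea. `popularity_top_n` is a hypothetical helper, and scoring by distinct-user count with alphabetical tie-breaking is an assumption that happens to match the ordering of the output above.

```python
import pandas as pd

def popularity_top_n(df, user_col, item_col, n=10):
    """Rank items by number of distinct users (hypothetical helper)."""
    scores = (
        df.groupby(item_col)[user_col]
        .nunique()
        .reset_index(name="score")
        # highest score first; ties broken alphabetically by item name
        .sort_values(["score", item_col], ascending=[False, True])
    )
    scores["rank"] = range(1, len(scores) + 1)
    return scores.head(n).reset_index(drop=True)

# Toy data, invented for illustration
plays = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u1", "u2", "u3"],
    "song_release": ["A", "A", "A", "B", "B", "C"],
})
top = popularity_top_n(plays, "user_id", "song_release", n=2)
print(top)
```

Because the ranking does not depend on the user, this approach works even for users with no history, at the cost of no personalisation.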


## Item Similarity Recommendation

In [15]:
ir = Recommenders.item_similarity_recommender_py()
ir.create(song_df, 'user_id', 'song_release')

In [16]:
user_items = ir.get_user_items(song_df['user_id'][5])

In [17]:
# display the user's listening history
for user_item in user_items:
    print(user_item)

You're The One
Horn Concerto No. 4 in E flat K495: II. Romance (Andante cantabile)
Tive Sim
Catch You Baby (Steve Pitron & Max Sanna Radio Edit)
El Cuatrero
Unite (2009 Digital Remaster)


In [18]:
# give song recommendation for that user
ir.recommend(song_df['user_id'][5])

No. of unique songs for the user: 6
no. of unique songs in the training set: 7183
Non zero values in cooccurence_matrix :1403


Unnamed: 0,user_id,song,score,rank
0,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,Représente,0.06874,1
1,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,Revelry,0.067948,2
2,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,Sayonara-Nostalgia,0.066639,3
3,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,Undo,0.064033,4
4,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,Secrets,0.060982,5
5,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,Rianna,0.052746,6
6,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,Gears,0.04394,7
7,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,The Gift,0.043168,8
8,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,Invalid,0.042351,9
9,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,16 Candles,0.041779,10
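
The log lines above mention a co-occurrence matrix between the user's songs and all songs in the training set. One common way to fill such a matrix is Jaccard similarity: for two songs, the number of users who listened to both divided by the number who listened to either. The sketch below assumes that formulation and averages the similarity over the user's songs. `jaccard_recommend` is a hypothetical helper on invented toy data, not the code inside `recommenders_py`.

```python
import pandas as pd

def jaccard_recommend(df, user, user_col, item_col, n=10):
    """Score unseen items by mean Jaccard similarity to the user's items."""
    # Set of users who listened to each item
    users_by_item = df.groupby(item_col)[user_col].apply(set)
    user_items = set(df.loc[df[user_col] == user, item_col])

    scores = {}
    for candidate, cand_users in users_by_item.items():
        if candidate in user_items:
            continue  # skip songs the user already has
        sims = []
        for owned in user_items:
            owned_users = users_by_item[owned]
            union = len(cand_users | owned_users)
            sims.append(len(cand_users & owned_users) / union if union else 0.0)
        scores[candidate] = sum(sims) / len(sims) if sims else 0.0

    # Highest score first; ties broken alphabetically
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
    return pd.DataFrame(ranked, columns=["song", "score"])

plays = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2", "u2", "u3", "u3", "u4"],
    "song_release": ["A", "B", "A", "B", "C", "B", "C", "D"],
})
recs = jaccard_recommend(plays, "u1", "user_id", "song_release")
print(recs)
```

Here song `C` scores 0.5 for user `u1` (similarity 1/3 to `A` and 2/3 to `B`), while `D` shares no listeners with the user's songs and scores 0.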


In [21]:
# find songs similar to the given song titles
ir.get_similar_items(["You're The One", 'Fireflies'])

no. of unique songs in the training set: 7183
Non zero values in cooccurence_matrix :789


Unnamed: 0,user_id,song,score,rank
0,,Undo,0.100932,1
1,,OMG,0.084124,2
2,,Secrets,0.080556,3
3,,Hey_ Soul Sister,0.074132,4
4,,Fix You,0.068387,5
5,,16 Candles,0.067091,6
6,,The Only Exception (Album Version),0.063988,7
7,,Revelry,0.056544,8
8,,Billionaire [feat. Bruno Mars] (Explicit Albu...,0.056061,9
9,,Dog Days Are Over (Radio Edit),0.053977,10
