# Million Song Data - Recommendation Engine

In this project, we work with the Million Song Dataset, supplied as two CSV files: a triplet file and a metadata file. The triplet file contains the user ID, the song ID and the number of times that user listened to the song. The metadata file contains the song ID, title, release, artist name and year. The Million Song Dataset is a mixture of songs from various websites, together with the ratings users gave after listening to them.

We will implement three types of recommendation systems in this project: content-based, collaborative and popularity-based.

**1. Import Modules**

The first step is to import the packages used in this project: pandas, numpy and Recommenders (a custom module, `Recommenders.py`).

In [1]:
import pandas as pd
import numpy as np
import Recommenders

**2. Loading the dataset**

We create a pandas DataFrame from each of the two CSV files, then merge them into a single DataFrame on the common field `song_id`. Next, we combine the song title and the artist name into a single `song` key for each row.

In [2]:
song_df_1 = pd.read_csv('triplets_file.csv')
song_df_1.head()

Unnamed: 0,user_id,song_id,listen_count
0,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOAKIMP12A8C130995,1
1,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBBMDR12A8C13253B,2
2,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBXHDL12A81C204C0,1
3,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBYHAJ12A6701BF1D,1
4,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SODACBL12A8C13C273,1


In [3]:
song_df_2 = pd.read_csv('song_data.csv')
song_df_2.head()

Unnamed: 0,song_id,title,release,artist_name,year
0,SOQMMHC12AB0180CB8,Silent Night,Monster Ballads X-Mas,Faster Pussy cat,2003
1,SOVFVAK12A8C1350D9,Tanssi vaan,Karkuteillä,Karkkiautomaatti,1995
2,SOGTUKN12AB017F4F1,No One Could Ever,Butter,Hudson Mohawke,2006
3,SOBNYVR12A8C13558C,Si Vos Querés,De Culo,Yerba Brava,2003
4,SOHSBXH12A8C13B0DF,Tangle Of Aspens,Rene Ablaze Presents Winter Sessions,Der Mystic,0


In [4]:
song_df = pd.merge(song_df_1, song_df_2.drop_duplicates(['song_id']), on = 'song_id', how = 'left')
song_df.head()

Unnamed: 0,user_id,song_id,listen_count,title,release,artist_name,year
0,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOAKIMP12A8C130995,1,The Cove,Thicker Than Water,Jack Johnson,0
1,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBBMDR12A8C13253B,2,Entre Dos Aguas,Flamenco Para Niños,Paco De Lucia,1976
2,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBXHDL12A81C204C0,1,Stronger,Graduation,Kanye West,2007
3,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBYHAJ12A6701BF1D,1,Constellations,In Between Dreams,Jack Johnson,2005
4,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SODACBL12A8C13C273,1,Learn To Fly,There Is Nothing Left To Lose,Foo Fighters,1999
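
To see why the metadata is de-duplicated on `song_id` before the left merge, here is a minimal sketch on made-up rows (the IDs and titles are hypothetical, not taken from the dataset). A left merge keeps every triplet row, but duplicate keys on the right side would multiply rows:

```python
import pandas as pd

# Toy stand-ins for the triplet and metadata frames (values are invented).
triplets = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "song_id": ["S1", "S2", "S1"],
    "listen_count": [1, 2, 5],
})
metadata = pd.DataFrame({
    "song_id": ["S1", "S1", "S2"],  # S1 appears twice
    "title": ["Song A", "Song A (dup)", "Song B"],
})

# Without de-duplication, each S1 triplet matches two metadata rows.
naive = pd.merge(triplets, metadata, on="song_id", how="left")

# De-duplicating first keeps exactly one output row per triplet.
clean = pd.merge(triplets, metadata.drop_duplicates(["song_id"]),
                 on="song_id", how="left")

print(len(triplets), len(naive), len(clean))
```

This matches the check in the next cells, where the merged frame has the same length as the triplet frame.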


In [5]:
len(song_df)

2000000

In [6]:
print(len(song_df_1), len(song_df_2))

2000000 1000000


# Data Preprocessing

In [7]:
song_df['song'] = song_df['title'] + "-" + song_df['artist_name']
song_df.head()

Unnamed: 0,user_id,song_id,listen_count,title,release,artist_name,year,song
0,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOAKIMP12A8C130995,1,The Cove,Thicker Than Water,Jack Johnson,0,The Cove-Jack Johnson
1,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBBMDR12A8C13253B,2,Entre Dos Aguas,Flamenco Para Niños,Paco De Lucia,1976,Entre Dos Aguas-Paco De Lucia
2,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBXHDL12A81C204C0,1,Stronger,Graduation,Kanye West,2007,Stronger-Kanye West
3,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBYHAJ12A6701BF1D,1,Constellations,In Between Dreams,Jack Johnson,2005,Constellations-Jack Johnson
4,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SODACBL12A8C13C273,1,Learn To Fly,There Is Nothing Left To Lose,Foo Fighters,1999,Learn To Fly-Foo Fighters
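
One thing worth checking at this step: pandas string concatenation propagates missing values, so a row with a missing `title` or `artist_name` ends up with a missing `song` key. The sketch below uses invented rows to show the behavior and one possible guard (`fillna`); whether the real data contains such rows is not shown in this notebook.

```python
import pandas as pd

# Invented rows; the second one has no title.
df = pd.DataFrame({
    "title": ["Stronger", None],
    "artist_name": ["Kanye West", "Unknown Artist"],
})

# Plain concatenation: the missing title makes the whole key missing.
df["song"] = df["title"] + "-" + df["artist_name"]

# One possible guard: replace missing parts with an empty string first.
df["song_safe"] = df["title"].fillna("") + "-" + df["artist_name"].fillna("")

print(df[["song", "song_safe"]])
```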


In [8]:
# Number of listen records (user-song rows) per song; 'count' counts rows, it does not sum the plays
song_grouped = song_df.groupby(['song']).agg({'listen_count':'count'}).reset_index()
song_grouped.head()

Unnamed: 0,song,listen_count
0,#!*@ You Tonight [Featuring R. Kelly] (Explici...,78
1,#40-DAVE MATTHEWS BAND,338
2,& Down-Boys Noize,373
3,' Cello Song-Nick Drake,103
4,'97 Bonnie & Clyde-Eminem,93
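
The `'count'` aggregation counts rows per song, that is, how many user-song records mention it, rather than adding up the plays. A small sketch on invented rows contrasts it with `'sum'` (the column names `listeners` and `total_plays` are chosen here for illustration):

```python
import pandas as pd

# Invented rows: song A has two listener records, song B has one.
plays = pd.DataFrame({
    "song": ["A", "A", "B"],
    "listen_count": [1, 10, 3],
})

per_song = plays.groupby("song").agg(
    listeners=("listen_count", "count"),   # number of rows per song
    total_plays=("listen_count", "sum"),   # total plays per song
).reset_index()

print(per_song)
```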


In [9]:
grouped_sum = song_grouped['listen_count'].sum()
song_grouped['percentage'] = (song_grouped['listen_count']/grouped_sum) * 100
song_grouped.sort_values(['listen_count', 'song'], ascending=[False, True])

Unnamed: 0,song,listen_count,percentage
7127,Sehr kosmisch-Harmonia,8277,0.41385
9084,Undo-Björk,7032,0.35160
2068,Dog Days Are Over (Radio Edit)-Florence + The ...,6949,0.34745
9877,You're The One-Dwight Yoakam,6412,0.32060
6774,Revelry-Kings Of Leon,6145,0.30725
...,...,...,...
3526,Historia Del Portero-Ricardo Arjona,51,0.00255
7072,Scared-Three Days Grace,51,0.00255
2147,Don´t Leave Me Now-Amparanoia,50,0.00250
2991,Ghosts (Toxic Avenger Mix)-Ladytron,48,0.00240
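
As a quick sanity check on the `percentage` column, the top row can be reproduced by hand from numbers already printed in this notebook: 8277 records for the top song out of 2,000,000 rows in `song_df`.

```python
total_rows = 2_000_000   # len(song_df) from the earlier cell
top_count = 8277         # records for "Sehr kosmisch-Harmonia"

share = top_count / total_rows * 100
print(round(share, 5))
```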


# Popularity Recommendation Engine

Every recommendation model is driven by an underlying algorithm. Here the algorithms are already defined in the `Recommenders.py` file, so we simply call them as an end user would.

In [10]:
pr = Recommenders.popularity_recommender_py()

In [11]:
pr.create(song_df, 'user_id', 'song')

In [12]:
pr.recommend(song_df['user_id'][5])

Unnamed: 0,user_id,song,score,Rank
7127,b80344d063b5ccb3212f76538f3d9e43d87dca9e,Sehr kosmisch-Harmonia,8277,1.0
9084,b80344d063b5ccb3212f76538f3d9e43d87dca9e,Undo-Björk,7032,2.0
2068,b80344d063b5ccb3212f76538f3d9e43d87dca9e,Dog Days Are Over (Radio Edit)-Florence + The ...,6949,3.0
9877,b80344d063b5ccb3212f76538f3d9e43d87dca9e,You're The One-Dwight Yoakam,6412,4.0
6774,b80344d063b5ccb3212f76538f3d9e43d87dca9e,Revelry-Kings Of Leon,6145,5.0
7115,b80344d063b5ccb3212f76538f3d9e43d87dca9e,Secrets-OneRepublic,5841,6.0
3613,b80344d063b5ccb3212f76538f3d9e43d87dca9e,Horn Concerto No. 4 in E flat K495: II. Romanc...,5385,7.0
2717,b80344d063b5ccb3212f76538f3d9e43d87dca9e,Fireflies-Charttraxx Karaoke,4795,8.0
3485,b80344d063b5ccb3212f76538f3d9e43d87dca9e,Hey_ Soul Sister-Train,4758,9.0
8847,b80344d063b5ccb3212f76538f3d9e43d87dca9e,Tive Sim-Cartola,4548,10.0


In [13]:
pr.recommend(song_df['user_id'][10])

Unnamed: 0,user_id,song,score,Rank
7127,b80344d063b5ccb3212f76538f3d9e43d87dca9e,Sehr kosmisch-Harmonia,8277,1.0
9084,b80344d063b5ccb3212f76538f3d9e43d87dca9e,Undo-Björk,7032,2.0
2068,b80344d063b5ccb3212f76538f3d9e43d87dca9e,Dog Days Are Over (Radio Edit)-Florence + The ...,6949,3.0
9877,b80344d063b5ccb3212f76538f3d9e43d87dca9e,You're The One-Dwight Yoakam,6412,4.0
6774,b80344d063b5ccb3212f76538f3d9e43d87dca9e,Revelry-Kings Of Leon,6145,5.0
7115,b80344d063b5ccb3212f76538f3d9e43d87dca9e,Secrets-OneRepublic,5841,6.0
3613,b80344d063b5ccb3212f76538f3d9e43d87dca9e,Horn Concerto No. 4 in E flat K495: II. Romanc...,5385,7.0
2717,b80344d063b5ccb3212f76538f3d9e43d87dca9e,Fireflies-Charttraxx Karaoke,4795,8.0
3485,b80344d063b5ccb3212f76538f3d9e43d87dca9e,Hey_ Soul Sister-Train,4758,9.0
8847,b80344d063b5ccb3212f76538f3d9e43d87dca9e,Tive Sim-Cartola,4548,10.0
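
`Recommenders.py` is not reproduced in this notebook, so the following is only a sketch of how a popularity recommender producing the same output columns (`score`, `Rank`) could be written. The function name `popularity_recommend` and its parameters are hypothetical. In a scheme like this the ranking does not depend on the user, so every user receives the same list.

```python
import pandas as pd

def popularity_recommend(df, user_col, item_col, user_id, top_n=10):
    """Hypothetical helper: rank items by how many rows mention them."""
    # Score each item by its number of user records.
    scores = (df.groupby(item_col)[user_col]
                .count()
                .reset_index(name="score"))
    # Highest score first; ties broken alphabetically by item.
    scores = scores.sort_values(["score", item_col], ascending=[False, True])
    scores["Rank"] = scores["score"].rank(ascending=False, method="first")
    top = scores.head(top_n).copy()
    # Attach the requested user ID; it does not influence the ranking.
    top.insert(0, user_col, user_id)
    return top

# Invented interactions: A has 3 listeners, B has 2, C has 1.
toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u1", "u2", "u3"],
    "song":    ["A",  "A",  "A",  "B",  "B",  "C"],
})
recs_u1 = popularity_recommend(toy, "user_id", "song", "u1", top_n=2)
recs_u2 = popularity_recommend(toy, "user_id", "song", "u2", top_n=2)
print(recs_u1)
```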


# Item Similarity Recommendation System

In [14]:
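ir = Recommenders.item_similarity_recommender_py()
ir.create(song_df, 'user_id', 'song')
# Keep only the first 10,000 rows for the cells below. This runs after create(),
# so the recommender above was still given the full frame; move this line above
# create() if the model should be built on the subset instead.
song_df = song_df.head(10000)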
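
The internals of `item_similarity_recommender_py` are not shown in this notebook. One common way to build an item-similarity recommender is to score each candidate song by the Jaccard overlap between its listener set and the listener sets of the songs the user has already played. The sketch below assumes that approach; every name in it (`jaccard`, `recommend_similar`, the toy data) is hypothetical.

```python
from collections import defaultdict

def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend_similar(plays, user, top_n=3):
    """plays: list of (user_id, song) pairs. Returns [(song, score), ...]."""
    listeners = defaultdict(set)  # song -> set of users who played it
    for u, s in plays:
        listeners[s].add(u)
    user_songs = {s for u, s in plays if u == user}

    scores = {}
    for cand, cand_users in listeners.items():
        if cand in user_songs:
            continue  # skip songs the user already played
        # Average similarity between the candidate and each of the user's songs.
        sims = [jaccard(cand_users, listeners[s]) for s in user_songs]
        scores[cand] = sum(sims) / len(sims) if sims else 0.0

    # Highest score first; ties broken alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]

# Invented interactions.
plays = [("u1", "A"), ("u2", "A"), ("u2", "B"), ("u3", "B"), ("u3", "C")]
result = recommend_similar(plays, "u1")
print(result)
```

Unlike a popularity ranking, the result here depends on which songs the given user has played.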

In [15]:
user_items = ir.get_user_items(song_df['user_id'][5])
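
Judging by the printed list, `get_user_items` appears to return the distinct songs a user has played (no repeats are visible in the output). A pandas sketch of that behavior, using a hypothetical helper name and invented rows:

```python
import pandas as pd

def distinct_user_items(df, user_col, item_col, user_id):
    """Hypothetical helper: distinct items for one user, in first-seen order."""
    return df.loc[df[user_col] == user_id, item_col].unique().tolist()

# Invented rows: u1 played A twice and C once.
toy_plays = pd.DataFrame({
    "user_id": ["u1", "u2", "u1", "u1"],
    "song":    ["A",  "B",  "C",  "A"],
})
items_u1 = distinct_user_items(toy_plays, "user_id", "song", "u1")
print(items_u1)
```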

In [16]:
for x in user_items:
    print(x)

The Cove-Jack Johnson
Entre Dos Aguas-Paco De Lucia
Stronger-Kanye West
Constellations-Jack Johnson
Learn To Fly-Foo Fighters
Apuesta Por El Rock 'N' Roll-Héroes del Silencio
Paper Gangsta-Lady GaGa
Stacked Actors-Foo Fighters
Sehr kosmisch-Harmonia
Heaven's gonna burn your eyes-Thievery Corporation feat. Emiliana Torrini
Let It Be Sung-Jack Johnson / Matt Costa / Zach Gill / Dan Lebowitz / Steve Adams
I'll Be Missing You (Featuring Faith Evans & 112)(Album Version)-Puff Daddy
Love Shack-The B-52's
Clarity-John Mayer
I?'m A Steady Rollin? Man-Robert Johnson
The Old Saloon-The Lonely Island
Behind The Sea [Live In Chicago]-Panic At The Disco
Champion-Kanye West
Breakout-Foo Fighters
Ragged Wood-Fleet Foxes
Mykonos-Fleet Foxes
Country Road-Jack Johnson / Paula Fuga
Oh No-Andrew Bird
Love Song For No One-John Mayer
Jewels And Gold-Angus & Julia Stone
83-John Mayer
Neon-John Mayer
The Middle-Jimmy Eat World
High and dry-Jorge Drexler
All That We Perceive-Thievery Corporation
The Christmas 

In [17]:
user_items = ir.get_user_items(song_df['user_id'][150])

In [18]:
for x in user_items:
    print(x)

Harder Better Faster Stronger-Daft Punk
Jumping Jack Flash-The Rolling Stones
Aerodynamic-Daft Punk
You Know What You Are?-Nine Inch Nails
Indo Silver Club-Daft Punk
Steam Machine-Daft Punk
High Life-Daft Punk
D.A.N.C.E. [Radio Edit]-Justice
Emotion-Daft Punk
Meanwhile_ Rick James...-Cake
Face To Face-Daft Punk
Digital Love-Daft Punk
Fresh-Daft Punk
Sad Songs And Waltzes-Cake
Especially In Michigan (Album Version)-Red Hot Chili Peppers
Rock'n Roll-Daft Punk
Nightvision-Daft Punk
The Brainwasher-Daft Punk
C'mon Girl (Album Version)-Red Hot Chili Peppers
It's Tricky-RUN-DMC
Around The World (Radio Edit)-Daft Punk
Strip My Mind (Album Version)-Red Hot Chili Peppers
Top Down-Swizz Beatz
The Real Slim Shady-Eminem
I Could Have Lied (Album Version)-Red Hot Chili Peppers
Opera Singer-Cake
The Prime Time Of Your Life-Daft Punk
I'm Back-Eminem
Sinisten tähtien alla-J. Karjalainen & Mustat Lasit
One More Time (Short Radio Edit)-Daft Punk
Da Funk-Daft Punk
Crescendolls-Daft Punk
Angie (1993 Digit