## Music Recommendation System (Machine Learning)

This project builds a music recommendation system that suggests music to a user based on their taste, inferred from the music and playlists they have already listened to. Two approaches are used: 'User-to-User Recommendation' and 'Item-to-Item Recommendation'. The KMeans algorithm is used along with the 'Surprise' module to compute the similarity between the recommendations and the user's existing playlist for evaluation.

### Obtaining Data

In [266]:
import pandas as pd
import numpy as np

In [267]:
final = pd.read_csv(r'../../assets/final.csv')
metadata = pd.read_csv(r'../../assets/metadata.csv')

final = final.drop(final.columns[0], axis="columns")  # drop the first column (the row index saved in the CSV)


In [268]:
final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13129 entries, 0 to 13128
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   track_id            13129 non-null  int64  
 1   acousticness        13129 non-null  float64
 2   danceability        13129 non-null  float64
 3   energy              13129 non-null  float64
 4   instrumentalness    13129 non-null  float64
 5   liveness            13129 non-null  float64
 6   speechiness         13129 non-null  float64
 7   tempo               13129 non-null  float64
 8   valence             13129 non-null  float64
 9   artist_discovery    13129 non-null  float64
 10  artist_familiarity  13129 non-null  float64
 11  artist_hotttnesss   13129 non-null  float64
 12  song_currency       13129 non-null  float64
 13  song_hotttnesss     13129 non-null  float64
dtypes: float64(13), int64(1)
memory usage: 1.4 MB


In [269]:
final.head(3)

Unnamed: 0,track_id,acousticness,danceability,energy,instrumentalness,liveness,speechiness,tempo,valence,artist_discovery,artist_familiarity,artist_hotttnesss,song_currency,song_hotttnesss
0,2,0.416675,0.675894,0.634476,0.010628,0.177647,0.15931,165.922,0.576661,0.38899,0.38674,0.40637,0.0,0.0
1,3,0.374408,0.528643,0.817461,0.001851,0.10588,0.461818,126.957,0.26924,0.38899,0.38674,0.40637,0.0,0.0
2,5,0.043567,0.745566,0.70147,0.000697,0.373143,0.124595,100.26,0.621661,0.38899,0.38674,0.40637,0.0,0.0


In [270]:
metadata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13129 entries, 0 to 13128
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   track_id     13129 non-null  int64 
 1   album_title  13129 non-null  object
 2   artist_name  13129 non-null  object
 3   genre        13129 non-null  object
 4   track_title  13128 non-null  object
dtypes: int64(1), object(4)
memory usage: 513.0+ KB


In [271]:
metadata.head()

Unnamed: 0,track_id,album_title,artist_name,genre,track_title
0,2,AWOL - A Way Of Life,AWOL,HipHop,Food
1,3,AWOL - A Way Of Life,AWOL,HipHop,Electric Ave
2,5,AWOL - A Way Of Life,AWOL,HipHop,This World
3,10,Constant Hitmaker,Kurt Vile,Pop,Freeway
4,134,AWOL - A Way Of Life,AWOL,HipHop,Street Music


### Model Selection - K Means Algorithm

In [272]:
from sklearn.cluster import KMeans
from sklearn.utils import shuffle


In [273]:
final = shuffle(final, random_state=100)

In [274]:
num_enlisted = 2000  # number of songs in the user's playlist

# X is the user's playlist (iloc is integer-position based)
X = final.iloc[:num_enlisted]

# Y is the music reservoir used for the recommendations
Y = final.iloc[num_enlisted:]

In [275]:
X = shuffle(X, random_state=100)
Y = shuffle(Y, random_state=100)

In [276]:
X.head()

Unnamed: 0,track_id,acousticness,danceability,energy,instrumentalness,liveness,speechiness,tempo,valence,artist_discovery,artist_familiarity,artist_hotttnesss,song_currency,song_hotttnesss
5257,20578,0.857307,0.380751,0.143179,0.904447,0.371113,0.036546,112.611,0.172942,0.148544,0.24743,0.155181,0.0,0.0
410,1105,0.001511,0.119617,0.932086,0.930931,0.065884,0.075716,166.922,0.175888,0.160536,0.26377,0.167709,0.0,0.0
4711,18753,0.10344,0.176934,0.751613,0.21459,0.332937,0.048927,136.965,0.524496,0.527548,0.355465,0.551119,3.9e-05,0.070213
11922,81368,0.960197,0.403913,0.204022,0.900791,0.107306,0.04342,108.77,0.129756,0.25473,0.185473,0.266111,0.0,0.0
7336,31881,0.007057,0.55503,0.764131,0.880281,0.064819,0.070115,93.362,0.262522,0.284627,0.25296,0.297344,0.0,0.0


In [277]:
kmeans = KMeans(n_clusters=6)

In [278]:
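def fit(df, algo, flag=0):
    """Fit `algo` on the features of `df` and attach the cluster labels."""
    df = df.set_index('track_id')
    if flag:
        algo.fit(df)  # full-batch fit
    else:
        # Incremental fit: needs an estimator that has partial_fit
        # (e.g. MiniBatchKMeans); plain KMeans does not provide it.
        algo.partial_fit(df)
    df['label'] = algo.labels_  # cluster assigned to each track
    return (df, algo)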
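
The `partial_fit` branch above only works with an estimator that supports incremental fitting. As a minimal sketch, assuming scikit-learn's `MiniBatchKMeans` is an acceptable stand-in for `KMeans`, the difference looks roughly like this. The toy `features` array and the batch count are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans

rng = np.random.default_rng(0)
# Hypothetical toy features: 200 rows, 3 columns (stand-ins for audio features)
features = rng.random((200, 3))

# KMeans only offers a full-batch fit; it has no partial_fit method.
has_partial_fit_kmeans = hasattr(KMeans(n_clusters=3), "partial_fit")

# MiniBatchKMeans can be updated incrementally, one batch at a time.
mbk = MiniBatchKMeans(n_clusters=3, n_init=3, random_state=0)
for batch in np.array_split(features, 4):
    mbk.partial_fit(batch)

# After partial_fit, labels_ only covers the last batch, so use predict
# to label the full data set.
labels = mbk.predict(features)
```

Note that with this stand-in, `algo.labels_` would no longer line up with the whole frame after several `partial_fit` calls, so the `df['label'] = algo.labels_` step would need `predict` instead.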

In [279]:
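def predict(t, Y):
    """Return the fitted tracks that share the most common cluster of Y."""
    fitted, algo = t  # (labelled frame, fitted model) as returned by fit()
    Y = Y.set_index('track_id')
    y_pred = algo.predict(Y)
    mode = pd.Series(y_pred).mode()  # most frequent cluster label(s)
    return fitted[fitted['label'] == mode.iloc[0]]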
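
To make the idea behind `predict` concrete, here is a small self-contained sketch of the same "most common cluster" selection on toy data. The `catalogue` and `history` frames, their two feature columns, and the explicit `n_init` and `random_state` settings are assumptions made for illustration only.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical fitted set: two well-separated groups of tracks.
catalogue = pd.DataFrame(
    {"energy": [0.10, 0.15, 0.20, 0.90, 0.85, 0.95],
     "valence": [0.20, 0.10, 0.15, 0.80, 0.90, 0.85]},
    index=pd.Index([1, 2, 3, 4, 5, 6], name="track_id"),
)
# Hypothetical tracks to classify: mostly high-energy.
history = pd.DataFrame(
    {"energy": [0.88, 0.92, 0.12], "valence": [0.82, 0.87, 0.18]},
    index=pd.Index([7, 8, 9], name="track_id"),
)

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(catalogue)
labelled = catalogue.assign(label=model.labels_)

# Most common cluster among the classified tracks.
top_cluster = pd.Series(model.predict(history)).mode().iloc[0]

# Fitted tracks in that cluster are the candidates returned.
candidates = labelled[labelled["label"] == top_cluster]
print(sorted(candidates.index))
```

Only the cluster membership matters here; the numeric label itself is arbitrary and can change between runs unless `random_state` is fixed.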

In [280]:
Y.head()

Unnamed: 0,track_id,acousticness,danceability,energy,instrumentalness,liveness,speechiness,tempo,valence,artist_discovery,artist_familiarity,artist_hotttnesss,song_currency,song_hotttnesss
6159,25172,0.995769,0.575536,0.072498,0.95181,0.362845,0.156567,134.222,0.749784,0.474547,0.401179,0.496548,0.0,0.024744
4928,19622,0.995796,0.363499,0.064849,0.873007,0.268156,0.056632,74.648,0.569499,0.351244,0.288913,0.366938,0.0,0.0
12692,112277,0.643805,0.450673,0.465502,0.091072,0.120271,0.212198,175.354,0.891171,0.409123,0.3049,0.427403,0.0,0.0
2378,10035,0.954748,0.162086,0.000669,5.5e-05,0.121634,0.041126,125.588,0.089176,0.340716,0.176642,0.355939,0.0,0.0
3139,12519,0.010644,0.863181,0.422751,0.932807,0.112729,0.113547,113.052,0.483683,0.440652,0.338019,0.46034,0.000178,0.100569


In [281]:
t = fit(X, kmeans, 1)
t



(          acousticness  danceability    energy  instrumentalness  liveness  \
 track_id                                                                     
 20578         0.857307      0.380751  0.143179          0.904447  0.371113   
 1105          0.001511      0.119617  0.932086          0.930931  0.065884   
 18753         0.103440      0.176934  0.751613          0.214590  0.332937   
 81368         0.960197      0.403913  0.204022          0.900791  0.107306   
 31881         0.007057      0.555030  0.764131          0.880281  0.064819   
 ...                ...           ...       ...               ...       ...   
 14800         0.000032      0.892513  0.253932          0.879037  0.332298   
 37921         0.548195      0.491913  0.735015          0.955849  0.076965   
 4199          0.902837      0.110838  0.264158          0.970930  0.086866   
 1799          0.040998      0.613058  0.870231          0.819554  0.110854   
 642           0.946599      0.420579  0.239794     

In [282]:
recommendations = predict(t, Y)  # fitted tracks in the cluster most common among Y's predicted labels

track_id,acousticness,danceability,energy,instrumentalness,liveness,speechiness,tempo,valence,artist_discovery,artist_familiarity,artist_hotttnesss,song_currency,song_hotttnesss,label
31881,0.007057,0.555030,0.764131,0.880281,0.064819,0.070115,93.362,0.262522,0.284627,0.252960,0.297344,0.0,0.0,2
33711,0.395812,0.736020,0.575406,0.246747,0.090738,0.412314,91.945,0.666975,0.262898,0.327014,0.274644,0.0,0.0,2
33044,0.960250,0.420011,0.190370,0.915435,0.336118,0.036872,84.151,0.061257,0.374602,0.262303,0.391339,0.0,0.0,2
105018,0.008160,0.533155,0.709440,0.300927,0.347076,0.027579,100.030,0.245441,0.335636,0.208720,0.350632,0.0,0.0,2
49931,0.495373,0.561037,0.388667,0.068159,0.128947,0.255811,105.824,0.202095,0.376508,0.410917,0.399240,0.0,0.0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34406,0.063832,0.395921,0.926013,0.885229,0.194911,0.042750,96.218,0.126521,0.310875,0.275075,0.324765,0.0,0.0,2
18026,0.995796,0.215221,0.129007,0.972870,0.339714,0.030085,84.796,0.537703,0.248466,0.053097,0.259567,0.0,0.0,2
18027,0.995765,0.539181,0.264597,0.972724,0.246283,0.032837,102.617,0.775923,0.076188,0.034536,0.079592,0.0,0.0,2
36302,0.296315,0.889528,0.232582,0.932233,0.111159,0.317319,100.087,0.745758,0.285311,0.279443,0.298059,0.0,0.0,2


In [283]:
recommendations.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 520 entries, 31881 to 23172
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   acousticness        520 non-null    float64
 1   danceability        520 non-null    float64
 2   energy              520 non-null    float64
 3   instrumentalness    520 non-null    float64
 4   liveness            520 non-null    float64
 5   speechiness         520 non-null    float64
 6   tempo               520 non-null    float64
 7   valence             520 non-null    float64
 8   artist_discovery    520 non-null    float64
 9   artist_familiarity  520 non-null    float64
 10  artist_hotttnesss   520 non-null    float64
 11  song_currency       520 non-null    float64
 12  song_hotttnesss     520 non-null    float64
 13  label               520 non-null    int32  
dtypes: float64(13), int32(1)
memory usage: 58.9 KB


In [284]:
recommendations.head()

track_id,acousticness,danceability,energy,instrumentalness,liveness,speechiness,tempo,valence,artist_discovery,artist_familiarity,artist_hotttnesss,song_currency,song_hotttnesss,label
31881,0.007057,0.55503,0.764131,0.880281,0.064819,0.070115,93.362,0.262522,0.284627,0.25296,0.297344,0.0,0.0,2
33711,0.395812,0.73602,0.575406,0.246747,0.090738,0.412314,91.945,0.666975,0.262898,0.327014,0.274644,0.0,0.0,2
33044,0.96025,0.420011,0.19037,0.915435,0.336118,0.036872,84.151,0.061257,0.374602,0.262303,0.391339,0.0,0.0,2
105018,0.00816,0.533155,0.70944,0.300927,0.347076,0.027579,100.03,0.245441,0.335636,0.20872,0.350632,0.0,0.0,2
49931,0.495373,0.561037,0.388667,0.068159,0.128947,0.255811,105.824,0.202095,0.376508,0.410917,0.39924,0.0,0.0,2


In [285]:
metadata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13129 entries, 0 to 13128
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   track_id     13129 non-null  int64 
 1   album_title  13129 non-null  object
 2   artist_name  13129 non-null  object
 3   genre        13129 non-null  object
 4   track_title  13128 non-null  object
dtypes: int64(1), object(4)
memory usage: 513.0+ KB


In [286]:
metadata = metadata.set_index('track_id')

In [288]:
Y.head()

Unnamed: 0,track_id,acousticness,danceability,energy,instrumentalness,liveness,speechiness,tempo,valence,artist_discovery,artist_familiarity,artist_hotttnesss,song_currency,song_hotttnesss
6159,25172,0.995769,0.575536,0.072498,0.95181,0.362845,0.156567,134.222,0.749784,0.474547,0.401179,0.496548,0.0,0.024744
4928,19622,0.995796,0.363499,0.064849,0.873007,0.268156,0.056632,74.648,0.569499,0.351244,0.288913,0.366938,0.0,0.0
12692,112277,0.643805,0.450673,0.465502,0.091072,0.120271,0.212198,175.354,0.891171,0.409123,0.3049,0.427403,0.0,0.0
2378,10035,0.954748,0.162086,0.000669,5.5e-05,0.121634,0.041126,125.588,0.089176,0.340716,0.176642,0.355939,0.0,0.0
3139,12519,0.010644,0.863181,0.422751,0.932807,0.112729,0.113547,113.052,0.483683,0.440652,0.338019,0.46034,0.000178,0.100569


In [289]:
Y.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11129 entries, 6159 to 12157
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   track_id            11129 non-null  int64  
 1   acousticness        11129 non-null  float64
 2   danceability        11129 non-null  float64
 3   energy              11129 non-null  float64
 4   instrumentalness    11129 non-null  float64
 5   liveness            11129 non-null  float64
 6   speechiness         11129 non-null  float64
 7   tempo               11129 non-null  float64
 8   valence             11129 non-null  float64
 9   artist_discovery    11129 non-null  float64
 10  artist_familiarity  11129 non-null  float64
 11  artist_hotttnesss   11129 non-null  float64
 12  song_currency       11129 non-null  float64
 13  song_hotttnesss     11129 non-null  float64
dtypes: float64(13), int64(1)
memory usage: 1.3 MB


In [290]:
#Y = Y.reset_index(level=0)

In [291]:
#recommendations = recommendations.reset_index(level=0)

In [292]:
def recommend(recommendations, meta, Y):
    """Return (same-genre tracks, same-artist tracks, cluster tracks) from meta."""
    recommendations = recommendations.reset_index(level=0)  # track_id back to a column
    track_ids = Y['track_id'].tolist()
    # Most common genre and artist among the tracks in Y
    genre_mode = meta.loc[track_ids, 'genre'].mode()
    artist_mode = meta.loc[track_ids, 'artist_name'].mode()
    return (
        meta[meta['genre'] == genre_mode.iloc[0]],
        meta[meta['artist_name'] == artist_mode.iloc[0]],
        meta.loc[recommendations['track_id']],
    )
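
The genre and artist filters in `recommend` both rely on `Series.mode()`. The sketch below shows that step in isolation on a hypothetical five-row `meta` frame and a made-up `listened` list; neither comes from the notebook's data.

```python
import pandas as pd

# Hypothetical metadata, indexed by track_id like the notebook's `metadata`.
meta = pd.DataFrame(
    {"artist_name": ["A", "A", "B", "C", "C"],
     "genre": ["Rock", "Rock", "Jazz", "Rock", "Folk"]},
    index=pd.Index([10, 11, 12, 13, 14], name="track_id"),
)
listened = [10, 12, 13]  # hypothetical track ids the listener has played

played = meta.loc[listened]
top_genre = played["genre"].mode().iloc[0]
# All three artists tie here; mode() returns the tied values sorted,
# so iloc[0] picks the alphabetically first one.
top_artist = played["artist_name"].mode().iloc[0]

same_genre = meta[meta["genre"] == top_genre]
same_artist = meta[meta["artist_name"] == top_artist]
```

The tie-breaking behaviour matters for short listening histories: when every artist appears once, the "most common" artist is simply the first one in sorted order.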

In [293]:
output = recommend(recommendations, metadata, Y)

In [294]:
genre_recommend, artist_name_recommend, mixed_recommend = output[0], output[1], output[2]

In [295]:
genre_recommend.shape

(3892, 4)

In [296]:
artist_name_recommend.shape

(94, 4)

In [297]:
mixed_recommend.shape

(520, 4)

In [298]:
# Genre wise recommendations
genre_recommend.head()

track_id,album_title,artist_name,genre,track_title
153,Arc and Sender,Arc and Sender,Rock,Hundred-Year Flood
154,Arc and Sender,Arc and Sender,Rock,Squares And Circles
155,unreleased demo,Arc and Sender,Rock,Maps of the Stars Homes
169,Boss of Goth,Argumentix,Rock,Boss of Goth
170,Nightmarcher,Argumentix,Rock,Industry Standard Massacre


In [299]:
# Artist wise recommendations
artist_name_recommend.head()

track_id,album_title,artist_name,genre,track_title
10381,Big Blood & The Bleedin' Hearts,Big Blood,Folk,Baron in the Trees
10382,Big Blood & The Bleedin' Hearts,Big Blood,Folk,New Dish Rag
10383,Big Blood & The Bleedin' Hearts,Big Blood,Folk,Graceless Lady
10384,Big Blood & The Bleedin' Hearts,Big Blood,Folk,Blood Mumble
10385,Big Blood & The Bleedin' Hearts,Big Blood,Folk,Curee


In [300]:
# Mixed Recommendations
mixed_recommend.head()

track_id,album_title,artist_name,genre,track_title
31881,Disco Pirata,Pisu,AvantGarde|International|Blues|,Commodore
33711,Húsares de la Muerte,H.D.M.,HipHop,Rapero mula
33044,Later Days,Asthmaboy,Rock,The Traffic Still Moves
105018,Violent,Talk Less Say More,AvantGarde|International|Blues|,"Yeah, That's Right"
49931,"Live at WFMU on Nat Roe's Show June 29, 2011",Pika,Rock,Intro


In [301]:
recommendations

track_id,acousticness,danceability,energy,instrumentalness,liveness,speechiness,tempo,valence,artist_discovery,artist_familiarity,artist_hotttnesss,song_currency,song_hotttnesss,label
31881,0.007057,0.555030,0.764131,0.880281,0.064819,0.070115,93.362,0.262522,0.284627,0.252960,0.297344,0.0,0.0,2
33711,0.395812,0.736020,0.575406,0.246747,0.090738,0.412314,91.945,0.666975,0.262898,0.327014,0.274644,0.0,0.0,2
33044,0.960250,0.420011,0.190370,0.915435,0.336118,0.036872,84.151,0.061257,0.374602,0.262303,0.391339,0.0,0.0,2
105018,0.008160,0.533155,0.709440,0.300927,0.347076,0.027579,100.030,0.245441,0.335636,0.208720,0.350632,0.0,0.0,2
49931,0.495373,0.561037,0.388667,0.068159,0.128947,0.255811,105.824,0.202095,0.376508,0.410917,0.399240,0.0,0.0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34406,0.063832,0.395921,0.926013,0.885229,0.194911,0.042750,96.218,0.126521,0.310875,0.275075,0.324765,0.0,0.0,2
18026,0.995796,0.215221,0.129007,0.972870,0.339714,0.030085,84.796,0.537703,0.248466,0.053097,0.259567,0.0,0.0,2
18027,0.995765,0.539181,0.264597,0.972724,0.246283,0.032837,102.617,0.775923,0.076188,0.034536,0.079592,0.0,0.0,2
36302,0.296315,0.889528,0.232582,0.932233,0.111159,0.317319,100.087,0.745758,0.285311,0.279443,0.298059,0.0,0.0,2


In [302]:
artist_name_recommend['artist_name'].value_counts()

Big Blood    94
Name: artist_name, dtype: int64

In [303]:
genre_recommend['genre'].value_counts()

Rock    3892
Name: genre, dtype: int64

In [304]:
genre_recommend['artist_name'].value_counts()

Glove Compartment               65
Blah Blah Blah                  62
Mors Ontologica                 50
Les Baudouins Morts             38
Kraus                           35
                                ..
Alone in 1982                    1
Ostrich Tuning                   1
The Dalai Lama Rama Fa Fa Fa     1
The Rusty Bells                  1
Lost Boy                         1
Name: artist_name, Length: 725, dtype: int64

#### Testing

In [305]:
testing = Y.iloc[6:12]['track_id']  # the six tracks used as the test input below

In [306]:
testing

4743     18970
2411     10097
12308    97959
12233    95407
9727     44240
516       1307
Name: track_id, dtype: int64

In [307]:
ids = testing.loc[testing.index]

In [308]:
songs = metadata.loc[testing]  # look up the metadata of the test tracks by track_id

In [309]:
songs

track_id,album_title,artist_name,genre,track_title
18970,Air,Sandro Marinoni & Stefano Roncarolo,Jazz,Double Fee
10097,Edison Blue Amberol: 2687,Harvey Hindermyer and Helen Clark,OldTime|Historic,"Hello, Frisco!"
97959,4a @ electric,Gilo,Electronic,Este som - parte2
95407,Wiiiiiiiide Awake EP,junior85,Electronic,raymondscott
44240,"Son Of 1,000 Pardons",Joey Ripps,HipHop,First Of Many Pardons
1307,Realistic Psychosis,Nicky Andrews,Electronic,Bru Ha Ha


In [310]:
test_recs = predict(t, Y.iloc[6:12])  # renamed from `re`, which shadows the standard-library module name

In [311]:
output = recommend(test_recs, metadata, Y.iloc[6:12])

In [312]:
ge_re, ge_ar, ge_mix = output[0], output[1], output[2]

In [313]:
ge_re.head()

track_id,album_title,artist_name,genre,track_title
384,Summer Set,Blanketship,Electronic,Baja Jones
386,Summer Set,Blanketship,Electronic,Clapartroach
387,Summer Set,Blanketship,Electronic,I wish I wish
396,On the Back of a Dying Beast: Volume 1,Borful Tang,Electronic,Juggernaut Soliloquy
397,On the Back of a Dying Beast: Volume 1,Borful Tang,Electronic,The Tides Of Land


In [314]:
ge_ar.head(10)

track_id,album_title,artist_name,genre,track_title
97959,4a @ electric,Gilo,Electronic,Este som - parte2
97960,4a @ electric,Gilo,Electronic,Eu sou o fado
97961,4a @ electric,Gilo,Electronic,Why does my heart...
97962,4a @ electric,Gilo,Electronic,Breacking down


In [315]:
ge_mix.head(10)

track_id,album_title,artist_name,genre,track_title
20578,@ ISSUE 10/22/09,Sim Cain,Blues,Two
81368,Under the Lamp,E. R. Goodman,AvantGarde|International|Blues|Jazz|Classical|,Open Door
111907,Reel Time Canvas,Camden,Rock,How To Make America Proud
28754,Soundliketrains (2002-2004),Los Llamarada,Rock,It's a dream
33903,Piriperos Nails,HOM,AvantGarde|International|,IBM Deep Blue
58602,Live at WFMU on Beastin the Airwaves with Keil...,Regal Degal,Rock,Logs in the River
14199,Live at WFMU's Loud Live Acts on 4/02/09,Mental Abuse,Rock,No God
26734,Folk Den Project,Roger McGuinn,Folk,Ill Fly Away
34698,Songs From The Boidem,51%,AvantGarde|International|Blues|Jazz|,Taboo+ - Siman Sheela
50482,Introducing,The Brians,AvantGarde|International|,Sugar Clouds


In [316]:
ge_re.shape

(2170, 4)

In [317]:
ge_ar.shape

(4, 4)

In [318]:
ge_mix.shape

(464, 4)