<img src="https://letsgrowmore.in/wp-content/uploads/2021/05/growmore-removebg-preview-600x245.png" />

# Music Recommendation System

### Introduction to Music Recommendation Systems using Collaborative Filtering and K-Nearest Neighbors
Music recommender systems analyze users' listening patterns, preferences, and behavior to provide personalized song recommendations. They draw on signals such as listening history, genre preferences, and artist similarities, and on techniques such as collaborative filtering and content-based filtering, to suggest songs that match a user's tastes and help them discover new music.

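Collaborative filtering and k-nearest neighbors (KNN) are popular techniques in music recommendation systems. Collaborative filtering recommends songs based on similarities between users or between items, inferred from listening behavior. KNN is one algorithm commonly used within collaborative filtering: it finds the users or items closest to a given one under a similarity measure. The notebook that follows takes a different route: it frames recommendation as binary classification and predicts the `target` column with a LightGBM gradient-boosted model trained on engineered user and song features.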
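
To make the KNN idea concrete, here is a minimal sketch of item-based collaborative filtering in plain Python. It is not part of the notebook's pipeline: the toy `plays` log, the song and user names, and the helper functions `cosine` and `nearest_songs` are all hypothetical, and cosine similarity is just one reasonable choice of distance.

```python
import math
from collections import defaultdict

# Hypothetical toy listening log: each (user, song) pair means the user played the song.
plays = [
    ("u1", "songA"), ("u1", "songB"),
    ("u2", "songA"), ("u2", "songB"), ("u2", "songC"),
    ("u3", "songC"), ("u3", "songD"),
]

# Represent each song by the set of users who played it (a binary item vector).
listeners = defaultdict(set)
for user, song in plays:
    listeners[song].add(user)

def cosine(a, b):
    """Cosine similarity between two binary vectors stored as sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def nearest_songs(song, k=2):
    """Return the k songs most similar to `song`, best match first."""
    scores = [
        (other, cosine(listeners[song], listeners[other]))
        for other in list(listeners) if other != song
    ]
    # Sort by descending similarity, breaking ties alphabetically.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:k]

neighbors = nearest_songs("songA", k=2)
print(neighbors)
```

Here `songB` is the nearest neighbor of `songA` because exactly the same users played both, so a user who played only `songA` would be recommended `songB` first.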

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are read from the current working directory.
# Listing the directory confirms that the CSV files used below are present.

import os
print(os.listdir("./"))

['.ipynb_checkpoints', '1st_submission.csv', 'members.csv', 'Music Recommender.ipynb', 'sample_submission.csv', 'songs.csv', 'song_extra_info.csv', 'test.csv', 'train.csv', 'Untitled.ipynb']


In [2]:
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
import gc
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import lightgbm as lgb
import datetime
import math

In [3]:
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df

In [4]:
train = reduce_mem_usage(pd.read_csv('./train.csv'))
test = reduce_mem_usage(pd.read_csv('./test.csv'))
sei = pd.read_csv('./song_extra_info.csv')
members = pd.read_csv('./members.csv',parse_dates=['registration_init_time','expiration_date'])
songs = pd.read_csv('./songs.csv')

Memory usage of dataframe is 337.71 MB
Memory usage after optimization is: 82.41 MB
Decreased by 75.6%
Memory usage of dataframe is 117.04 MB
Memory usage after optimization is: 42.17 MB
Decreased by 64.0%


In [5]:
print('Shape of train is ->',train.shape)
print('Shape of test is ->',test.shape)
print('Shape of Song Extra Info is ->',sei.shape)
print('Shape of Members is ->',members.shape)
print('Shape of Songs is ->',songs.shape)

Shape of train is -> (7377418, 6)
Shape of test is -> (2556790, 6)
Shape of Song Extra Info is -> (2295971, 3)
Shape of Members is -> (34403, 7)
Shape of Songs is -> (2296320, 7)


In [6]:
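def get_codes(isrc):
    """Return the four-digit year encoded in an ISRC, or NaN if the code is missing."""
    if pd.isnull(isrc):
        return np.nan
    # Characters 5:7 hold the two-digit year; values above 17 are read as 19xx, the rest as 20xx
    yy = int(str(isrc)[5:7])
    return 1900 + yy if yy > 17 else 2000 + yy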
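
An ISRC is laid out as a two-letter country code, a three-character registrant code, a two-digit year, and a five-digit designation code, which is why the function above slices characters `5:7`. The sketch below replays that rule without pandas on three ISRC values that appear in the `sei.sample(10)` output; the name `isrc_year` and the `pivot` parameter are illustrative and not part of the notebook.

```python
def isrc_year(isrc, pivot=17):
    """Return the four-digit year encoded in an ISRC, or None if it is missing."""
    if not isrc:
        return None
    yy = int(isrc[5:7])  # two-digit year field of the ISRC
    # Two-digit years above the pivot are read as 19xx, the rest as 20xx.
    return 1900 + yy if yy > pivot else 2000 + yy

# ISRC values copied from the notebook's sample output, plus one missing value.
examples = ["USSM18900825", "FRYAS0755350", "JPP741605003", None]
years = [isrc_year(code) for code in examples]
print(years)  # [1989, 2007, 2016, None]
```

The pivot of 17 is a heuristic: a recording from 1917 or earlier would be misread as 20xx, which is an acceptable trade-off only if such recordings are rare in the data.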

In [7]:
sei['year'] = sei['isrc'].apply(lambda x: get_codes(x))
sei.sample(10)

Unnamed: 0,song_id,name,isrc,year
1826265,cJ45pPeYHs0omK0bdLNqhMjlmOhOaGSyUuGDoby9VVw=,Ma Rho,FRYAS0755350,2007.0
1466077,HFzMHBevEWA4KlxGL00a3G/PkxOMlWEAgG4GtrXymyY=,Say You Won't Let Go (Originally Performed by ...,GBRBE0791734,2007.0
115986,fjgrSRkdDheD0827ktLXLuuiH/fJxtAZIwZcw6eJ8CY=,Bad Luck Buffet,JPP741605003,2016.0
2220240,cIHTgbmxC2m6MMVV/iwYGHGR1UdFr5SLQMuuemctwzM=,RICE,SGA051500020,2015.0
878148,Ew/z3y50au7ggZ5ogLW5oLNZqAc9av8dS5lmDE8jJjI=,Sonate Re Mineur/Pastorale Kk9,,
738639,JQhfQ5d+sMUciVkzMdOIVE5qgs+zP7GDyCqlaO6OeVs=,Anything Goes,USSM18900825,1989.0
1351866,MWqXdGFkvJqoNBqwDX9U4LYhnIX7KJYELrsquyfhkgg=,That Was Good (feat. Sean Armstrong),QM5NT1500007,2015.0
2266589,374VHRGpnprNSt/ChPsD7lDbzNn6jfTkPSShjNPsCpE=,Out Of Time,USA177510030,1975.0
718161,N05fOQ4dJml82eDGnpyhglVkz4/xwFcwJfn1qrlBDmQ=,Poème| Op. 44/2,USA560880598,2008.0
1266286,8bDpZ0ENzoEKIqoDBydna9gomJY6o+7peh2q1noHVGg=,The Eagle Will Rise Again,USAR19700008,1997.0


In [8]:
members['membership_days'] = members['expiration_date'].subtract(members['registration_init_time']).dt.days.astype(int)
members['registration_year'] = members['registration_init_time'].dt.year
members['expiration_year'] = members['expiration_date'].dt.year
members.drop(columns = ['registration_init_time' , 'expiration_date'] , inplace = True)
members.head()

Unnamed: 0,msno,city,bd,gender,registered_via,membership_days,registration_year,expiration_year
0,XQxgAYj3klVKjR3oxPPXYYFp4soD4TuBghkhMTD4oTw=,1,0,,7,2223,2011,2017
1,UizsfmJb9mV54qE9hCYyU07Va97c0lCRLEQX3ae+ztM=,1,0,,7,725,2015,2017
2,D8nEhsIOBSoE6VthTaqDX8U6lqjJ7dLdr72mOyLya2A=,1,0,,4,457,2016,2017
3,mCuD+tZ1hERA/o5GPqk38e041J8ZsBaLcu7nGoIIvhI=,1,0,,9,1,2015,2015
4,q4HRBfVSssAFS9iRfxWrohxuk9kCYMKjHOEagUMV6rQ=,1,0,,4,138,2017,2017


In [9]:
# Extending columns
# merging the database
train = train.merge(songs , on='song_id' , how='left')
train = train.merge(members , on = 'msno' , how='left')
train = train.merge(sei , on = 'song_id' , how='left')
test  = test.merge(songs , on='song_id' , how='left')
test = test.merge(members , on = 'msno' , how = 'left')
test =  test.merge(sei , on = 'song_id' , how = 'left')
del sei ,members , songs
gc.collect()

44

In [10]:
# Report the share of missing values, then impute: mean for song_length, mode for language
print(train['song_length'].isnull().value_counts()/train.shape[0])
train['song_length'] = train['song_length'].fillna(train['song_length'].mean()).astype(np.uint32)
print(train['language'].isnull().value_counts()/train.shape[0])
train['language'] = train['language'].fillna(train['language'].mode().values[0]).astype(np.int8)
test['song_length'] = test['song_length'].fillna(test['song_length'].mean()).astype(np.uint32)
test['language'] = test['language'].fillna(test['language'].mode().values[0]).astype(np.int8)

False    0.999985
True     0.000015
Name: song_length, dtype: float64
False    0.99998
True     0.00002
Name: language, dtype: float64


In [11]:
def genre_count(genre):
    """Number of '|'-separated genre ids; 0 for the missing-value placeholder."""
    if genre == 'no_genre_id':
        return 0
    return genre.count('|') + 1
print(train['genre_ids'].isnull().value_counts()/train.shape[0])
train['genre_ids'] = train['genre_ids'].fillna('no_genre_id')
train['genre_ids_count'] = train['genre_ids'].apply(genre_count).astype(np.int8)
test['genre_ids'] = test['genre_ids'].fillna('no_genre_id')
test['genre_ids_count'] = test['genre_ids'].apply(genre_count).astype(np.int8)

False    0.983944
True     0.016056
Name: genre_ids, dtype: float64


In [12]:
def artist_count(art):
    """Estimate the number of artists from the separators in the name; 0 if missing."""
    if art == 'no_artist_name':
        return 0
    return art.count('|') + art.count('/') + art.count('//') + art.count(';') + 1
train['artist_name'] = train['artist_name'].fillna('no_artist_name')
train['artist_count'] = train['artist_name'].apply(artist_count).astype(np.int8)
test['artist_name'] = test['artist_name'].fillna('no_artist_name')
test['artist_count'] = test['artist_name'].apply(artist_count).astype(np.int8)

In [13]:
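def count_composer(comp):
    """Estimate the number of composers from the separators in the field; 0 if missing."""
    if comp == 'no_composer':
        return 0
    return comp.count('|') + comp.count('/') + comp.count('//') + comp.count(';') + 1
def count_lyricist(lyr):
    """Estimate the number of lyricists from the separators in the field; 0 if missing."""
    if lyr == 'no_lyricist':
        return 0
    return lyr.count('|') + lyr.count('/') + lyr.count('//') + lyr.count(';') + 1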
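
One detail of the counting rule above is worth noting: `str.count('/')` already matches both slashes of a `//` separator, so adding `count('//')` on top counts that single separator three times. The sketch below shows a possible alternative that splits on runs of separators instead. It is an assumption about what the counts are meant to capture, not the notebook's method, and the names `SEPARATORS`, `count_names`, and `notebook_count` are hypothetical.

```python
import re

# Runs of '|', '/', or ';' are treated as a single delimiter between names.
SEPARATORS = re.compile(r"\|+|/+|;+")

def count_names(field, missing_marker):
    """Count names in a delimited field; the missing-value placeholder counts as 0."""
    if field == missing_marker:
        return 0
    # Ignore empty pieces produced by leading, trailing, or doubled separators.
    parts = [p for p in SEPARATORS.split(field) if p.strip()]
    return len(parts)

def notebook_count(field):
    """The counting rule used in the cells above, reproduced for comparison."""
    return field.count('|') + field.count('/') + field.count('//') + field.count(';') + 1

print(count_names("A| B| C", "no_composer"), notebook_count("A| B| C"))  # 3 3
print(count_names("A//B", "no_composer"), notebook_count("A//B"))        # 2 4
```

Both rules still split a name such as `AC/DC` in two, so either count is best read as a rough feature rather than an exact number of contributors.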

In [14]:
train['composer'] = train['composer'].fillna('no_composer')
train['composer_count'] = train['composer'].apply(count_composer).astype(np.int8)
train['lyricist'] = train['lyricist'].fillna('no_lyricist')
train['lyricist_count'] = train['lyricist'].apply(count_lyricist).astype(np.int8)
test['composer'] = test['composer'].fillna('no_composer')
test['composer_count'] = test['composer'].apply(count_composer).astype(np.int8)
test['lyricist'] = test['lyricist'].fillna('no_lyricist')
test['lyricist_count'] = test['lyricist'].apply(count_lyricist).astype(np.int8)

In [15]:
# Map each song_id to the number of rows it appears in (Series.iteritems was removed in pandas 2.0)
dict_count_song_played_train = train['song_id'].value_counts().to_dict()
dict_count_song_played_test = test['song_id'].value_counts().to_dict()
def return_number_played(x):
    try:
        return dict_count_song_played_train[x]
    except KeyError:
        try:
            return dict_count_song_played_test[x]
        except KeyError:
            return 0
train['number_of_time_played'] = train['song_id'].apply(lambda x: return_number_played(x))
test['number_of_time_played'] = test['song_id'].apply(lambda x: return_number_played(x))

In [16]:
# Map each user (msno) to the number of rows they appear in across train and test
dict_user_activity = pd.concat([train['msno'] , test['msno']] , axis = 0).value_counts().to_dict()
def return_user_activity(x):
    try:
        return dict_user_activity[x]
    except KeyError:
        return 0
train['user_activity_msno'] = train['msno'].apply(lambda x: return_user_activity(x))
test['user_activity_msno'] = test['msno'].apply(lambda x: return_user_activity(x))

In [17]:
train_col = list(train.columns)
test_col = list(test.columns)
for f in test_col :
    if f not in train_col:
        print('ERROR !!!  Column from Test not found in train is ->' , f)
label_encoding = ['source_system_tab', 'source_screen_name',
       'source_type','gender']
drop = ['msno', 'song_id' , 'isrc','artist_name',
       'composer', 'lyricist','name','genre_ids']
min_max_scaling = ['number_of_time_played', 'user_activity_msno','membership_days', 'song_length']

ERROR !!!  Column from Test not found in train is -> id


In [18]:
for f in label_encoding:
    lb = LabelEncoder()
    lb.fit(list(train[f].values) + list(test[f].values))
    train[f] = lb.transform(list(train[f].values))
    test[f] = lb.transform(list(test[f].values))
for f in min_max_scaling:
    ms = MinMaxScaler()
    train[f] = ms.fit_transform(train[[f]])
    test[f] = ms.transform(test[[f]])
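
The scaling loop above fits each `MinMaxScaler` on the training column and reuses the fitted minimum and maximum for the test column. A consequence is that test values outside the training range land outside [0, 1], since the scaler does not clip by default. The plain-Python sketch below illustrates this with made-up song lengths; `fit_min_max` is a hypothetical stand-in for the scaler, not its implementation.

```python
def fit_min_max(values):
    """Learn min and max from `values` and return a function that rescales with them."""
    lo, hi = min(values), max(values)
    span = hi - lo

    def transform(xs):
        # A constant column has zero span; map it to 0.0 to avoid dividing by zero.
        return [(x - lo) / span if span else 0.0 for x in xs]

    return transform

# Made-up song lengths in milliseconds.
train_lengths = [120_000, 200_000, 320_000]
test_lengths = [100_000, 220_000, 400_000]

scale = fit_min_max(train_lengths)   # statistics come from the training data only
scaled_train = scale(train_lengths)
scaled_test = scale(test_lengths)
print(scaled_train)
print(scaled_test)
```

Tree-based models such as LightGBM split on value order rather than magnitude, so out-of-range values are harmless here, but they would matter for distance-based methods such as KNN.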

In [19]:
for col in train.columns:
    if train[col].dtype == object:
        train[col] = train[col].astype('category')
        test[col] = test[col].astype('category')

In [20]:
train.sample(10)

Unnamed: 0,msno,song_id,source_system_tab,source_screen_name,source_type,target,song_length,genre_ids,artist_name,composer,...,expiration_year,name,isrc,year,genre_ids_count,artist_count,composer_count,lyricist_count,number_of_time_played,user_activity_msno
6441750,Gp/f+FgCr2z+dMgO7wdoA4k7VUlOv03y4Xw8n8ORzFc=,6c/FbNPsK7qGvQGyRFa1ECv0uJLoIPX1QegcGl/maWs=,3,8,3,0,0.021453,444,I.O.I,no_composer,...,2017,DOWNPOUR,KRA381700114,2017.0,1,1,0,0,0.059906,0.038895
491662,9tQemYoJ/2uNTdsmb9WeEcO3AUZ79qACNdeCsJLD6uw=,kEFy4i96pziXPV/UPtnmzzitZcGIdBVT9Gsi2UQRPvs=,3,8,3,0,0.027394,458,九龍,趙照,...,2017,當你老了,,,1,1,1,1,0.005368,0.042316
4472624,8T8BuOf4qZekNSK6opgFjrxwIiDYKdz8bNek16l7VJU=,YmAISVdWO5T+OgcuVLr6qQ7OsqPzAbBO7BrEwfcxMSw=,3,8,3,1,0.019988,1609,Zedd,Jon Bellion| Anton Zaslavski| Tim James| Anton...,...,2017,Beautiful Now,USUM71505090,2015.0,1,1,5,0,0.219296,0.036868
6432818,lRbSFInUq8XekV6BiYNehssrnGJkLslNmMOdxn1yO2w=,Bs6gQpg71QmXFTvl2VdwQeQA9KZ5czw238ZWjE/tnFg=,2,12,2,0,0.022603,465,蕭亞軒 (Elva Hsiao),no_composer,...,2017,倒數,TWA270912003,2009.0,1,1,0,0,0.033853,0.048524
194115,zlHBUBsWDaokMojjGy6ePkhan130FyKnXNIbuqriyFo=,qCjK0SHytXxm2p+hl96KkOhYt19voU0ARfuJCa9m8z8=,3,8,3,1,0.024505,465,田馥甄 (Hebe),陳小霞,...,2017,口袋的溫度,TWD951343110,2013.0,1,1,1,2,0.040009,0.102749
1440321,de6JR6jK7i+t2u9wDgGIi2PsyrvM6GLft5rmx5yvfuQ=,GxMkUHL0RcciWpl3oxRSyV+NKJFtdvaYDPvI4jPhh3U=,3,8,3,0,0.02387,465,小宇-宋念宇,宋念宇,...,2017,我在角落觀察ABC,TWA770700720,2007.0,1,1,1,1,0.006942,0.219182
503433,FsnD6eISCRrYBij8+Gb1USi7YFPilM2xP3uhFw/8dWY=,qn+XcFoHbyUc003F/gVETnDbjN4imFz81D78x2+KIy0=,3,8,3,0,0.021717,1609,Girls' Generation (少女時代),Roel De Meulemeester| Guy Balbaert| Stefanie D...,...,2017,Girls,JPPO01504108,2015.0,1,1,3,0,0.00501,0.073229
4434252,cNNgFqsgLfTSIDCDM6Wqg/Fcd8FeINkDaSQ6KPFSyvQ=,+R1gjRTawxmMLwHdwmwm0eK48J8NuLlqLHh5pL8MzIM=,6,16,8,0,0.028088,465,信樂團 (Shin Band),Keith Stuart,...,2017,千年之戀,TWA770400202,2004.0,1,1,1,1,0.071214,0.104776
5290700,pgwFdM4Ibn/C2PNxozWcUMz7WDPeFjggJX1D2lq3jrk=,Ib3gYTwQSE6t4uBAVEvBBsOkdaWsffKXRG02pOZgntU=,7,1,11,1,0.023427,465,陶喆 (David Tao),陶喆,...,2017,天天,EMIDD0288284,2002.0,1,1,1,1,0.060407,0.063727
5952392,5pr+x566HOsYLEv4ouZ5cEHFFwBB74xaFbExdPGJ6b0=,jB0cPuBYP/ycfh+sYyo/8UI3sJQ7C6Uy9XaNiZ9lyYY=,6,16,8,1,0.025854,465,王心如 (Cynthia Wang),康小白,...,2018,壞天氣好心情,TWA750900504,2009.0,1,1,1,2,0.007086,0.288483


In [21]:
X_train = train.drop(columns=['target'])
Y_train = train['target'].values
X_test = test.drop(columns=['id'])
ids = test['id'].values
del train , test
gc.collect()
train_set = lgb.Dataset(X_train , Y_train)

In [22]:
params = {
        'objective': 'binary',
        'metric': 'auc',
        'boosting': 'gbdt',
        'learning_rate': 0.3 ,
        'verbose': 0,
        'num_leaves': 108,
        'bagging_fraction': 0.95,
        'bagging_freq': 1,
        'bagging_seed': 1,
        'feature_fraction': 0.9,
        'feature_fraction_seed': 1,
        'max_bin': 256,
        'max_depth': 10,
        'num_rounds': 200
    }

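# valid_sets is the training set itself, so the AUC logged every 5 rounds is a training score.
# lgb.log_evaluation needs LightGBM >= 3.3; older releases take verbose_eval=5 instead.
%time model_f1 = lgb.train(params, train_set=train_set, valid_sets=[train_set], callbacks=[lgb.log_evaluation(period=5)])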

You can set `force_col_wise=true` to remove the overhead.
[5]	training's auc: 0.736013
[10]	training's auc: 0.747873
[15]	training's auc: 0.753968
[20]	training's auc: 0.75881
[25]	training's auc: 0.762558
[30]	training's auc: 0.765889
[35]	training's auc: 0.768492
[40]	training's auc: 0.771594
[45]	training's auc: 0.773583
[50]	training's auc: 0.776168
[55]	training's auc: 0.778398
[60]	training's auc: 0.779996
[65]	training's auc: 0.782235
[70]	training's auc: 0.783841
[75]	training's auc: 0.785289
[80]	training's auc: 0.786489
[85]	training's auc: 0.787656
[90]	training's auc: 0.788899
[95]	training's auc: 0.790035
[100]	training's auc: 0.791538
[105]	training's auc: 0.792555
[110]	training's auc: 0.793562
[115]	training's auc: 0.796136
[120]	training's auc: 0.797285
[125]	training's auc: 0.798097
[130]	training's auc: 0.798919
[135]	training's auc: 0.80002
[140]	training's auc: 0.800621
[145]	training's auc: 0.801345
[150]	training's auc: 0.80235
[155]	training's auc: 0.803263
[160
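
The log above reports AUC, measured on the training data because `valid_sets` is the training set. As a reminder of what the number means, AUC is the probability that a randomly chosen positive example is scored above a randomly chosen negative one. The sketch below computes it by brute force on made-up labels and scores. The `auc` function is illustrative only; it is not how LightGBM computes the metric and would be far too slow for millions of rows.

```python
def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    # Compare every positive score with every negative score.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up targets and predicted probabilities.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.3, 0.6, 0.7, 0.8]
result = auc(labels, scores)
print(round(result, 4))  # 0.8333
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect one. Because the logged values are training scores, estimating generalization would need a held-out split, for example via the `train_test_split` import already present in the notebook.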

In [23]:
pred_test = model_f1.predict(X_test)
print('Saving Predictions')
sub = pd.DataFrame()
sub['id'] = ids
sub['target'] = pred_test
sub.to_csv('1st_submission.csv' , index = False , float_format ='%.5f' )

Saving Predictions


In [24]:
sub.head()

Unnamed: 0,id,target
0,0,0.652969
1,1,0.704967
2,2,0.083125
3,3,0.089885
4,4,0.095229


### Conclusion:
Music recommender systems improve the listening experience by providing personalized song recommendations. They rely on a range of algorithms, including collaborative filtering and k-nearest neighbors, to analyze user behavior and preferences.

Collaborative filtering allows the system to identify patterns and similarities between users or items to generate relevant recommendations. By analyzing factors such as user history, genre preferences, and artist similarities, the system can identify similar users or items and suggest songs that align with the user's tastes and interests. This approach helps users discover new music and expands their musical horizons.
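
The user-based variant described above can be sketched in a few lines of plain Python: find the listeners whose history overlaps most with the target user's, then suggest what they played that the target user has not. Everything here is hypothetical, including the `history` data, the user names, and the choice of Jaccard similarity, and the notebook itself does not implement this step.

```python
from collections import Counter

# Hypothetical listening histories: user -> set of songs played.
history = {
    "ana": {"songA", "songB", "songC"},
    "ben": {"songA", "songB", "songD"},
    "cara": {"songE", "songF"},
}

def jaccard(a, b):
    """Overlap between two sets: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(user, k=1, top_n=3):
    """Suggest unplayed songs from the k most similar users, weighted by similarity."""
    others = sorted(
        (u for u in history if u != user),
        key=lambda u: (-jaccard(history[user], history[u]), u),
    )[:k]
    votes = Counter()
    for other in others:
        weight = jaccard(history[user], history[other])
        if weight == 0:
            continue  # users with no overlap carry no signal
        for song in history[other] - history[user]:
            votes[song] += weight
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [song for song, _ in ranked[:top_n]]

print(recommend("ana"))   # ['songD']
print(recommend("cara"))  # []
```

The empty result for `cara` shows the cold-start weakness of this approach: a user with no overlapping history gets no suggestions, which is one reason feature-based models are often used alongside it.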

K-nearest neighbors, as a specific algorithm within collaborative filtering, further refines the recommendation process by finding the most similar items or users based on their characteristics. This technique helps improve the accuracy and relevance of the recommendations, ensuring that users receive song suggestions that are closely aligned with their preferences.

Overall, music recommender systems have changed how listeners discover and enjoy music. In this notebook, a LightGBM classifier trained on engineered user and song features reached a training AUC of about 0.80 after 155 boosting rounds. That figure is measured on the training data itself, so a held-out validation split would be needed to estimate how well the model generalizes. As the underlying techniques continue to improve, recommendations should become more precise and better tailored to individual listeners.