# Music Recommendation by Using LightGBM

## WSDM - KKBox's Music Recommendation Challenge

This competition was released on Kaggle (see [3]).

In this task, we predict the chance that a user listens to a song again within a time window after the first observable listening event was triggered.

If a recurring listening event is triggered within a month after the user’s very first observable listening event, the target is marked 1 in the training set, and 0 otherwise. The same rule applies to the test set.

Kaggle provides a training set that contains the first observable listening event for each unique user-song pair within a specific time period. Metadata for each user and each song is also provided.

## Boosting Machine Learning

In this project, we adopt the boosting technique from machine learning.

The philosophy of boosting is to use a set of __weak learners__ to create a __strong learner__. A __weak learner__ is defined as a classifier that is only slightly correlated with the true classification. In contrast, a __strong learner__ is a classifier that is arbitrarily well correlated with the true classification.

Boosting algorithms iteratively train weak learners with respect to a distribution over the training examples and add them to a final strong learner. When the weak learners are added, they are typically weighted according to their accuracy. After each weak learner is added, the data are reweighted: __examples that are misclassified gain weight and examples that are classified correctly lose weight__. In this way, future weak learners focus more on the examples that previous weak learners misclassified. Repeating this step many times yields a strong learner that handles the task much better than any single weak learner.
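
The reweighting loop described above can be sketched with an AdaBoost-style procedure on decision stumps. This is only an illustration under stated assumptions: labels are coded as -1/+1, the toy data and all function names (`fit_stump`, `adaboost`, `ensemble_predict`) are made up for this example, and it is not the procedure LightGBM itself uses.

```python
import numpy as np

def stump_predict(X, j, thr, sign):
    """Predict -sign for rows with X[:, j] <= thr and +sign otherwise."""
    return np.where(X[:, j] <= thr, -sign, sign)

def fit_stump(X, y, w):
    """Return (feature, threshold, sign, weighted_error) of the best stump."""
    best = (0, 0.0, 1, np.inf)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                err = w[stump_predict(X, j, thr, sign) != y].sum()
                if err < best[3]:
                    best = (j, thr, sign, err)
    return best

def adaboost(X, y, n_rounds=3):
    """Train on labels in {-1, +1}; return a list of (alpha, stump) pairs."""
    w = np.full(len(y), 1.0 / len(y))          # start from uniform example weights
    ensemble = []
    for _ in range(n_rounds):
        j, thr, sign, err = fit_stump(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)   # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)  # more accurate learners get a larger weight
        pred = stump_predict(X, j, thr, sign)
        w *= np.exp(-alpha * y * pred)         # misclassified examples gain weight
        w /= w.sum()                           # keep the weights a distribution
        ensemble.append((alpha, (j, thr, sign)))
    return ensemble

def ensemble_predict(X, ensemble):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(X, *stump) for alpha, stump in ensemble)
    return np.where(score >= 0, 1, -1)

# Toy 1-D problem that no single threshold can separate.
X_toy = np.arange(10, dtype=float).reshape(-1, 1)
y_toy = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])

model_toy = adaboost(X_toy, y_toy, n_rounds=3)
first_stump_accuracy = (stump_predict(X_toy, *model_toy[0][1]) == y_toy).mean()
toy_accuracy = (ensemble_predict(X_toy, model_toy) == y_toy).mean()
print(first_stump_accuracy, toy_accuracy)
```

On this toy set the first stump alone misclassifies three points, while the weighted vote of three stumps classifies all ten correctly.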

## Light Gradient Boosting Machine (LightGBM)

When the weak learners are small, fixed-size __decision trees__ and a differentiable loss function is used, we get a __Gradient Boosting Decision Tree (GBDT)__ [1]. Due to its efficiency, accuracy, and interpretability, GBDT achieves state-of-the-art performance in many machine learning tasks, such as multi-class classification, click prediction, and learning to rank.
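
To make the idea concrete, here is a minimal sketch of gradient boosting with one-split regression trees (stumps) under squared loss, where the negative gradient is simply the residual. The 1-D toy data, the function names, and the choice of squared loss are assumptions made for illustration; LightGBM grows larger trees and uses a logistic loss for binary targets.

```python
import numpy as np

def fit_regression_stump(x, residual):
    """Least-squares stump on a 1-D feature: returns (threshold, left_value, right_value)."""
    best = None
    for thr in np.unique(x)[:-1]:              # keep both sides non-empty
        left, right = residual[x <= thr], residual[x > thr]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, thr, left.mean(), right.mean())
    return best[1:]

def gradient_boost(x, y, n_rounds=50, learning_rate=0.1):
    """Fit stumps to the residuals one after another and sum their shrunken outputs."""
    base = y.mean()                            # initial constant prediction
    pred = np.full_like(y, base, dtype=float)
    stumps = []
    for _ in range(n_rounds):
        residual = y - pred                    # negative gradient of 0.5 * (y - pred)**2
        thr, left_val, right_val = fit_regression_stump(x, residual)
        pred += learning_rate * np.where(x <= thr, left_val, right_val)
        stumps.append((thr, left_val, right_val))
    return base, stumps, pred

x_demo = np.linspace(0.0, 1.0, 50)
y_demo = np.sin(2 * np.pi * x_demo)
base, stumps, fitted = gradient_boost(x_demo, y_demo)

mse_start = ((y_demo - base) ** 2).mean()      # error of the constant model
mse_end = ((y_demo - fitted) ** 2).mean()      # error after boosting
print(mse_start, mse_end)
```

Each round fits a new stump to what the current ensemble still gets wrong, so the training error shrinks as stumps accumulate.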

However, __conventional implementations__ of GBDT need to scan, for every feature, all the data instances to estimate the information gain of all possible split points. This makes these implementations very time consuming on big data. To address this, Microsoft published at NIPS 2017 a __new implementation__ of GBDT, called __LightGBM__, which speeds up the training process of conventional GBDT by up to over 20 times while achieving almost the same accuracy in experiments on multiple public datasets [2].

In [3]:
import numpy as np
import pandas as pd
import csv

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import lightgbm as lgb
from tqdm import tqdm

The original training dataset provided by Kaggle is too large for our slow computers to work on, and so is the test dataset. Furthermore, the test data provided by Kaggle has no labels, so we cannot use it to see how well the algorithm works. Therefore, we discard the test data. Instead, we extract 10,000 samples from the training data and split them into a larger group of 7,000 samples for training and a smaller group of 3,000 samples for testing.

In [5]:
ntr = 7000  # number of rows used for training
nts = 3000  # number of held-out rows used for testing
print('Loading data...')
data_path = './data/'
train = pd.read_csv(data_path + 'train.csv', nrows=ntr)
names = ['msno', 'song_id', 'source_system_tab', 'source_screen_name',
         'source_type', 'target']
# Skip the header line plus the ntr training rows so the two samples do not overlap.
test1 = pd.read_csv(data_path + 'train.csv', names=names, skiprows=ntr + 1, nrows=nts)
songs = pd.read_csv(data_path + 'songs.csv')
members = pd.read_csv(data_path + 'members.csv')

Loading data...
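
When one CSV file is cut into two consecutive samples, it helps to remember that `nrows` counts data rows while an integer `skiprows` counts physical lines from the top of the file, header included. The sketch below checks this on a tiny in-memory file; the column names and sizes are hypothetical and only stand in for `train.csv`.

```python
import io
import pandas as pd

# Hypothetical miniature file: one header line followed by 10 data rows.
csv_text = "row_id,target\n" + "\n".join(f"{i},{i % 2}" for i in range(10))
n_train, n_test = 7, 3
names = ['row_id', 'target']

# First n_train data rows; the header is consumed automatically.
head = pd.read_csv(io.StringIO(csv_text), nrows=n_train)

# Skipping n_train + 1 lines drops the header and the rows already used.
tail = pd.read_csv(io.StringIO(csv_text), names=names,
                   skiprows=n_train + 1, nrows=n_test)

# Skipping only n_train lines starts at the last row of the first sample.
overlap = pd.read_csv(io.StringIO(csv_text), names=names,
                      skiprows=n_train, nrows=n_test)

print(list(head['row_id']), list(tail['row_id']), list(overlap['row_id']))
```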


To evaluate the learned recommendation system at the end, we extract the true labels of the test data into ytr.

In [6]:
test = test1.drop(['target'],axis=1)
ytr = np.array(test1['target'])

Rearrange the columns of the test data so that they fit the remaining code.

In [7]:
test_name = ['id','msno','song_id','source_system_tab',\
             'source_screen_name','source_type']
test['id']=np.arange(nts)
test = test[test_name]

In [8]:
print('Data preprocessing...')
song_cols = ['song_id', 'artist_name', 'genre_ids', 'song_length', 'language']
train = train.merge(songs[song_cols], on='song_id', how='left')
test = test.merge(songs[song_cols], on='song_id', how='left')

Data preprocessing...


In [9]:
members['registration_year'] = members['registration_init_time'].apply(lambda x: int(str(x)[0:4]))
members['registration_month'] = members['registration_init_time'].apply(lambda x: int(str(x)[4:6]))
members['registration_date'] = members['registration_init_time'].apply(lambda x: int(str(x)[6:8]))

In [10]:
members['expiration_year'] = members['expiration_date'].apply(lambda x: int(str(x)[0:4]))
members['expiration_month'] = members['expiration_date'].apply(lambda x: int(str(x)[4:6]))
members['expiration_date'] = members['expiration_date'].apply(lambda x: int(str(x)[6:8]))
members = members.drop(['registration_init_time'], axis=1)
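
As a possible alternative to slicing the date strings by hand, the same year, month, and day fields can be obtained with `pd.to_datetime`. This is a sketch on made-up values in the same YYYYMMDD integer layout; the frame `members_demo` and the column name `registration_day` are hypothetical, and unlike string slicing this call raises an error on values that are not valid calendar dates.

```python
import pandas as pd

# Hypothetical rows in the same YYYYMMDD integer layout as members.csv.
members_demo = pd.DataFrame({'registration_init_time': [20110820, 20150628, 20160411]})

# Parse once, then read the components from the datetime accessor.
parsed = pd.to_datetime(members_demo['registration_init_time'].astype(str),
                        format='%Y%m%d')
members_demo['registration_year'] = parsed.dt.year
members_demo['registration_month'] = parsed.dt.month
members_demo['registration_day'] = parsed.dt.day
print(members_demo)
```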

In [11]:
members_cols = members.columns
train = train.merge(members[members_cols], on='msno', how='left')
test = test.merge(members[members_cols], on='msno', how='left')

In [12]:
train = train.fillna(-1)
test = test.fillna(-1)

In [13]:
import gc
del members, songs; gc.collect();

In [14]:
cols = list(train.columns)
cols.remove('target')

In [15]:
for col in tqdm(cols):
    if train[col].dtype == 'object':
        train[col] = train[col].apply(str)
        test[col] = test[col].apply(str)

        le = LabelEncoder()
        train_vals = list(train[col].unique())
        test_vals = list(test[col].unique())
        le.fit(train_vals + test_vals)
        train[col] = le.transform(train[col])
        test[col] = le.transform(test[col])

100%|██████████████████████████████████████████| 19/19 [00:00<00:00, 48.56it/s]
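
The encoder above is fitted on the union of the training and test values because a label that appears only in the test data would otherwise have no code. The following sketch reproduces that idea with pandas alone on made-up values; it assumes the codes should follow the sorted order of the labels, which is also how `LabelEncoder` orders its classes.

```python
import numpy as np
import pandas as pd

train_col = pd.Series(['rock', 'pop', 'jazz', 'pop'])
test_col = pd.Series(['pop', 'metal'])   # 'metal' never appears in train_col

# Sorted union of all labels seen in either sample.
classes = np.unique(np.concatenate([train_col.unique(), test_col.unique()]))

# Both columns share one code table, so equal strings get equal integers.
train_codes = pd.Categorical(train_col, categories=classes).codes
test_codes = pd.Categorical(test_col, categories=classes).codes
print(list(classes), list(train_codes), list(test_codes))
```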


In [16]:
# Song popularity
# Encoded song ids run from 0 to the maximum id, so the range must include the maximum.
unique_songs = range(max(train['song_id'].max(), test['song_id'].max()) + 1)
song_popularity = pd.DataFrame({'song_id': unique_songs, 'popularity':0})

train_sorted = train.sort_values('song_id')
train_sorted.reset_index(drop=True, inplace=True)
test_sorted = test.sort_values('song_id')
test_sorted.reset_index(drop=True, inplace=True)

In [17]:
# Count the rows per encoded song id with binary searches on the sorted id arrays.
train_ids = train_sorted['song_id'].values
test_ids = test_sorted['song_id'].values
for unique_song in tqdm(unique_songs):
    if unique_song != (len(unique_songs)-1):
        train_pop = (np.searchsorted(train_ids, unique_song+1) -
                     np.searchsorted(train_ids, unique_song))
        test_pop = (np.searchsorted(test_ids, unique_song+1) -
                    np.searchsorted(test_ids, unique_song))
    else:
        train_pop = len(train_ids) - np.searchsorted(train_ids, unique_song)
        test_pop = len(test_ids) - np.searchsorted(test_ids, unique_song)
    # Write the count into this song's row of the 'popularity' column
    # rather than creating a new column named after the song id.
    song_popularity.loc[unique_song, 'popularity'] = train_pop + test_pop

100%|█████████████████████████████████████| 5770/5770 [00:24<00:00, 235.15it/s]
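
The per-song counts can also be computed without an explicit loop. The sketch below uses `value_counts` on the concatenated id columns; the two small series are hypothetical stand-ins for the encoded `train['song_id']` and `test['song_id']`, and `song_popularity_demo` is a name chosen for this example.

```python
import pandas as pd

# Hypothetical label-encoded song ids from the two samples.
train_song_ids = pd.Series([0, 2, 2, 3, 0, 2])
test_song_ids = pd.Series([1, 2, 3, 3])

n_songs = int(max(train_song_ids.max(), test_song_ids.max())) + 1

# Count occurrences across both samples; ids that never occur get 0.
counts = (pd.concat([train_song_ids, test_song_ids])
            .value_counts()
            .reindex(range(n_songs), fill_value=0))

song_popularity_demo = pd.DataFrame({'song_id': range(n_songs),
                                     'popularity': counts.values})
print(song_popularity_demo)
```

To use such a table as a feature, it would still have to be merged back into the training and test frames on `song_id` before the feature matrices are built.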


In [18]:
# Build the feature matrices and the label vector


X = np.array(train.drop(['target'], axis=1))
y = train['target'].values

X_test = np.array(test.drop(['id'], axis=1))
ids = test['id'].values

del train, test; gc.collect();

X_train, X_valid, y_train, y_valid = train_test_split(X, y, \
    test_size=0.1, random_state = 12)
    
del X, y; gc.collect();

In [19]:
d_train = lgb.Dataset(X_train, label=y_train)
d_valid = lgb.Dataset(X_valid, label=y_valid) 

watchlist = [d_train, d_valid]

In [20]:
print('Training LGBM model...')
params = {}
params['learning_rate'] = 0.4
params['application'] = 'binary'
params['max_depth'] = 15
params['num_leaves'] = 2**8
params['verbosity'] = 0
params['metric'] = 'auc'

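# Early stopping and periodic logging are passed as callbacks; LightGBM 4.0
# removed the early_stopping_rounds and verbose_eval arguments of lgb.train.
model = lgb.train(params, train_set=d_train, num_boost_round=200, valid_sets=watchlist,
                  callbacks=[lgb.early_stopping(stopping_rounds=10),
                             lgb.log_evaluation(period=10)])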

Training LGBM model...
Training until validation scores don't improve for 10 rounds.
[10]	training's auc: 0.990018	valid_1's auc: 0.844558
[20]	training's auc: 0.999901	valid_1's auc: 0.839645
Early stopping, best iteration is:
[13]	training's auc: 0.997805	valid_1's auc: 0.849754


In [21]:
print('Making predictions and saving them...')
p_test = model.predict(X_test)

Making predictions and saving them...


In [22]:
subm = pd.DataFrame()
subm['id'] = ids
subm['target'] = p_test
subm.to_csv('submission.csv.gz', compression = 'gzip', index=False, float_format = '%.5f')
print('Done!')

Done!


For each id in the test data, the model now predicts the probability that the user will listen to the corresponding song a second time. The result is stored in p_test. We use a hard threshold rule to obtain yhat: if $p_{test}[i]>0.5$, then $yhat[i]=1$; otherwise $yhat[i]=0$. We can then calculate the accuracy acc of our LightGBM model.

In [23]:
yhat = (p_test>0.5).astype(int)
comp = (yhat==ytr).astype(int)
acc = comp.sum()/comp.size*100
print('The accuracy of lgbm model on test data is: {0:f}%'.format(acc))

The accuracy of lgbm model on test data is: 77.900000%
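
Since the model is trained and early-stopped on AUC, the same metric can be computed on the held-out samples as a threshold-free complement to accuracy. Below is a minimal NumPy sketch based on the rank-sum identity; the function name `auc_score` and the four demo values are assumptions for illustration, and it expects both classes to be present.

```python
import numpy as np

def auc_score(y_true, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores, kind='mergesort')
    sorted_scores = scores[order]
    ranks = np.empty(len(scores), dtype=float)
    # Assign 1-based average ranks so that tied scores count as half a win.
    i = 0
    while i < len(scores):
        j = i
        while j + 1 < len(scores) and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0
        i = j + 1
    n_pos = int((y_true == 1).sum())
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

y_demo = np.array([0, 0, 1, 1])
p_demo = np.array([0.1, 0.4, 0.35, 0.8])
auc_demo = auc_score(y_demo, p_demo)
print(auc_demo)
```

In this notebook the call would be `auc_score(ytr, p_test)`, which can be compared directly with the validation AUC printed during training.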


For comparison, we use random guessing to predict on the same test data set. That is, we assign yhat_rand the value 0 or 1 for each id in the test data according to a $Bernoulli(1/2)$ distribution.

In [24]:
rd_seed = np.random.uniform(0,1,nts)
yhat_rand = (rd_seed>0.5).astype(int)
comp_rand = (yhat_rand==ytr).astype(int)
acc_rand = comp_rand.sum()/comp_rand.size*100
print('The accuracy of random model on test data is: {0:f}%'.format(acc_rand))

The accuracy of random model on test data is: 49.933333%
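
Coin flipping stays near 50% whatever the label balance is, so a second reference point may be useful: always predicting the most frequent training label. The sketch below shows the computation on made-up labels; the function name is hypothetical, and the class balance of the 10,000-row sample used here is not reported, so no value for this notebook is claimed.

```python
import numpy as np

def majority_baseline_accuracy(y_train_labels, y_test_labels):
    """Accuracy (in percent) of always predicting the most frequent training label."""
    y_train_labels = np.asarray(y_train_labels)
    y_test_labels = np.asarray(y_test_labels)
    majority = int(y_train_labels.mean() >= 0.5)   # 1 if at least half are positive
    return (y_test_labels == majority).mean() * 100

# Hypothetical labels: 70% positives in training, 60% in the held-out set.
acc_major = majority_baseline_accuracy([1] * 7 + [0] * 3, [1] * 6 + [0] * 4)
print('The accuracy of the majority baseline is: {0:f}%'.format(acc_major))
```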


The LightGBM model is clearly better than random guessing (77.9% versus about 49.9% accuracy), which means that it has learned a useful signal for this problem. Of course, this is just a first trial; there is much work to do if one wants to further improve the prediction accuracy.

# References

[1] Friedman J H. Greedy function approximation: a gradient boosting machine[J]. Annals of statistics, 2001: 1189-1232.

[2] Ke G, Meng Q, Finley T, et al. LightGBM: A Highly Efficient Gradient Boosting Decision Tree[C]//Advances in Neural Information Processing Systems. 2017: 3146-3154.

[3] https://www.kaggle.com/c/kkbox-music-recommendation-challenge