## Latent Factor Model 

For the model-based approach we chose a latent factor model with a bias term for missing data. We assume that the values in the rating matrix are not missing at random, so we introduce a bias term for the missing entries together with weights that control how strongly those entries influence the fit.
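One common way to write such a model is a weighted matrix factorization in which every missing entry is treated as a weak observation of a constant bias. This is a sketch under assumptions, not necessarily the exact objective used in this notebook. The symbols $\mathcal{O}$, $p_u$, $q_i$, $b$, $w$ and $\lambda$ are assumed notation:

$$
\min_{P,\,Q} \;
\sum_{(u,i) \in \mathcal{O}} \left( r_{ui} - p_u^\top q_i \right)^2
\; + \; w \sum_{(u,i) \notin \mathcal{O}} \left( b - p_u^\top q_i \right)^2
\; + \; \lambda \left( \lVert P \rVert_F^2 + \lVert Q \rVert_F^2 \right)
$$

Here $\mathcal{O}$ is the set of observed user-item pairs, $p_u$ and $q_i$ are the latent vectors of user $u$ and item $i$ (rows of $P$ and $Q$), $b$ is the bias assigned to missing entries, $w$ is the weight of those entries relative to observed ones, and $\lambda$ is the regularization strength.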

This section is structured as follows:
* Define the latent factor model
* Hyperparameter Tuning
* Scaling
* Final Test

#### Model Definition

In [1]:
path = 'data/train.csv'
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline
data = pd.read_csv(path)




In [16]:
def sample_data_50_50(data):
    """Select the 10,000 most active users and 100 media items:
    the 50 most popular items plus 50 drawn at random from ranks 51-5000."""
    # Number of distinct media items per user, most active users first
    user_rating_count = data.groupby('user_id')['media_id'].nunique()
    user_rating_count = user_rating_count.sort_values(ascending=False)
    # Select the 10,000 most active users
    user = user_rating_count.iloc[:10000].index

    # Number of distinct users per media item, most popular items first
    media_count = data.groupby('media_id')['user_id'].nunique()
    media_count = media_count.sort_values(ascending=False)

    # Top 50 items, plus a random sample of 50 from the next 4,950 items
    # (the slice starts at 50 so that the 51st item is not skipped)
    media1 = media_count.iloc[:50]
    media2 = media_count.iloc[50:5000].sample(n=50)
    # Series.append was removed in pandas 2.0; pd.concat is equivalent here
    media = pd.concat([media1, media2]).index

    # Return the selected user ids and media ids
    return user, media
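The "top half plus random tail" selection can be checked on a small example. The following is a minimal sketch on made-up data. The helper `top_and_random_tail`, its parameters, and the toy values are hypothetical and are not part of the notebook:

```python
import pandas as pd

def top_and_random_tail(counts, n_top, n_random, tail_end, seed=0):
    """Return the index of the n_top largest entries plus n_random entries
    drawn at random from the next-ranked entries up to position tail_end."""
    ranked = counts.sort_values(ascending=False)
    top = ranked.iloc[:n_top]
    tail = ranked.iloc[n_top:tail_end].sample(n=n_random, random_state=seed)
    return pd.concat([top, tail]).index

# Toy long-format listening log (hypothetical values)
toy = pd.DataFrame({
    'user_id':  [1, 1, 2, 2, 3, 3, 3, 4, 4, 5],
    'media_id': [10, 11, 10, 12, 10, 11, 13, 10, 14, 11],
})

# Number of distinct users per media item
media_count = toy.groupby('media_id')['user_id'].nunique()
selected = top_and_random_tail(media_count, n_top=2, n_random=2, tail_end=5)
print(list(selected))
```

Fixing `random_state` makes the sampled tail reproducible, which helps when results from different tuning runs are compared.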

In [8]:
def rating_matrix(data, user, media):
    """Return a rating matrix in wide format (users x media items).

    data:  long-format DataFrame with columns user_id, media_id and is_listened
    user:  collection of user_ids to keep
    media: collection of media_ids to include as columns
    """

    # sample users
    users = data[data['user_id'].isin(user)]

    # sample items
    items = users[users['media_id'].isin(media)]

    # drop duplicates (assumption: keep the first of an item)
    matrix = items.drop_duplicates(subset=['media_id', 'user_id'], keep='first')

    # get long format into wide format
    # TODO: check memory usage, improve to sparse matrix storage if necessary
    matrix = matrix.pivot(index='user_id',
                          columns='media_id',
                          values='is_listened')

    return matrix
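The long-to-wide step is easiest to see on a few rows. This is a minimal sketch with made-up values (the frame `log` is hypothetical). It shows why duplicates are dropped before pivoting and how unobserved user-item pairs become `NaN`:

```python
import pandas as pd

# Toy long-format log (hypothetical values); user 1 has two rows for media 10
log = pd.DataFrame({
    'user_id':     [1, 1, 1, 2, 3],
    'media_id':    [10, 10, 11, 10, 11],
    'is_listened': [1, 0, 1, 0, 1],
})

# pivot() raises on duplicate (user_id, media_id) pairs, so deduplicate first
deduped = log.drop_duplicates(subset=['media_id', 'user_id'], keep='first')
wide = deduped.pivot(index='user_id', columns='media_id', values='is_listened')

print(wide)
print('missing cells:', int(wide.isna().sum().sum()), 'of', wide.size)
```

The `NaN` cells are the entries that the bias term and bias weights of the model are meant to handle.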

In [17]:
user, media = sample_data_50_50(data)
print(len(media))
print(media)

100
Int64Index([132434634, 129632340, 131576046, 124603270, 133165774, 133661814,
            132123604, 129011934,  67354825, 132614858,  70266756, 122440564,
            122927594, 129310248, 125089724, 132123626, 132123630, 132265704,
            129313094, 124237488, 126866297, 122886138, 130604714, 134748108,
            119437608, 118986142, 117678828, 120026876, 125890431,  78671166,
             73982869, 127539479, 133940962, 127248545,  95813354, 122450802,
             79223833, 123883254,  99976954, 124550860, 127246869, 129121520,
            111780412,   3135183,  95938334, 130250890, 126884459,  86922279,
             92734438, 132123606,   1044157,  97956134,   2963337, 134464870,
              1056838, 134951488,    769970, 136336120, 114438294, 124105990,
             98286648,   2603789,  75212314,  65446363, 134300156, 108937884,
            114393902,  89538849,   6426473, 133583416, 133782124,   3098217,
             68442426,  66201357,    280556, 124262676,    6

#### Hyperparameter Tuning

In this section we explore the hyperparameters of the model and evaluate each setting on the cross-validation set. The following hyperparameters need to be chosen:
* Number of latent factors
* Regularization term
* Bias 
* Bias weights
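To make the role of these four hyperparameters concrete, here is a minimal sketch of one way such a model could be fitted. It uses alternating least squares on a weighted squared error, which is an assumed technique and not necessarily what `latent_factors_with_bias` does. The function name `weighted_als_sketch` and all of its parameters are hypothetical:

```python
import numpy as np

def weighted_als_sketch(R, n_factors, reg, bias, bias_weight, n_iter=20, seed=0):
    """Factorize R (NaN = missing) as U @ V.T by alternating least squares.

    Missing cells are imputed with `bias` and down-weighted by `bias_weight`;
    observed cells have weight 1. This is an assumed formulation.
    """
    rng = np.random.RandomState(seed)
    observed = ~np.isnan(R)
    target = np.where(observed, R, bias)
    weight = np.where(observed, 1.0, bias_weight)

    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, n_factors))
    V = rng.normal(scale=0.1, size=(n_items, n_factors))
    ridge = reg * np.eye(n_factors)

    for _ in range(n_iter):
        # Weighted ridge regression for every user, with V held fixed
        for u in range(n_users):
            Vw = V * weight[u][:, None]
            U[u] = np.linalg.solve(Vw.T @ V + ridge, Vw.T @ target[u])
        # Weighted ridge regression for every item, with U held fixed
        for i in range(n_items):
            Uw = U * weight[:, i][:, None]
            V[i] = np.linalg.solve(Uw.T @ U + ridge, Uw.T @ target[:, i])
    return U @ V.T, U, V

# Toy rating matrix (hypothetical values), NaN marks missing entries
R = np.array([[1, 0, np.nan],
              [1, np.nan, 0],
              [np.nan, 0, 0],
              [1, 0, 0]], dtype=float)
R_hat, U, V = weighted_als_sketch(R, n_factors=2, reg=0.1, bias=0.2, bias_weight=0.1)
print(np.round(R_hat, 2))
```

In this formulation a larger `bias_weight` pulls predictions for unobserved pairs towards `bias`, while `reg` shrinks the latent vectors and `n_factors` sets their dimension.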

Tuning takes place in two rounds. In the first round we explore the parameter space to get an idea of where the best parameters lie. The second round is informed by this exploration and tunes the parameters further to find the best-performing model.
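One possible way to derive the second-round grid from the first is to zoom in on the interval around the best coarse value. The sketch below assumes log-spaced refinement. The helper `refine_grid` and the choice of best index are hypothetical and only illustrate the idea:

```python
import numpy as np

def refine_grid(values, best_index, n_points=5):
    """Return n_points log-spaced values between the neighbours of the best
    coarse grid value (clipped at the ends of the grid)."""
    lo = values[max(best_index - 1, 0)]
    hi = values[min(best_index + 1, len(values) - 1)]
    return np.geomspace(lo, hi, num=n_points)

coarse = [0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100, 500, 1000]
# Purely for illustration, suppose the value at index 4 scored best in round one
fine = refine_grid(coarse, best_index=4)
print(fine)
```

Log spacing suits parameters such as the regularization term, whose coarse grid spans several orders of magnitude. A linear grid may be more natural for the bias.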

##### First Round 


In [None]:
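regularization = [0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100, 500, 1000]
latent_factors = [1, 2, 3, 5, 8, 10, 15, 20, 25, 30, 50, 100]
# range() only accepts integers; np.arange yields 0.0, 0.05, ..., 0.95
bias = np.arange(0, 1, 0.05)
bias_weights = [0.01, 0.05, 0.1, 0.3, 0.5, 0.8, 1]

# Build the rating matrix for the sampled users and items
matrix = rating_matrix(data, user, media)

# performance[i, j, k, m] holds the accuracy for
# regularization[i], latent_factors[j], bias[k], bias_weights[m]
performance = np.zeros((len(regularization), len(latent_factors),
                        len(bias), len(bias_weights)))

# enumerate() keeps each index in step with its loop and avoids reusing
# the name `l` for both a latent factor count and an array index
for i, r in enumerate(regularization):
    for j, n_factors in enumerate(latent_factors):
        for k, b in enumerate(bias):
            for m, bw in enumerate(bias_weights):
                # NOTE: the keyword names `bias` and `bias_weights` are assumed;
                # adjust them to the actual signature of latent_factors_with_bias
                R_hat, U, V = latent_factors_with_bias(matrix,
                                                       latent_factors=n_factors,
                                                       regularization=r,
                                                       bias=b,
                                                       bias_weights=bw)
                acc = test_accuracy(matrix, R_hat)
                performance[i, j, k, m] = acc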
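Once a four-dimensional accuracy array has been filled, the best combination can be read off with `np.argmax` and `np.unravel_index`. The sketch below uses small hypothetical grids and random numbers as a stand-in for real accuracies, so the printed values carry no meaning:

```python
import numpy as np

# Hypothetical small grids and a random stand-in for the accuracy array
toy_regularization = [0.01, 0.1, 1]
toy_latent_factors = [2, 5]
toy_bias = np.arange(0, 1, 0.25)
toy_bias_weights = [0.1, 1]

rng = np.random.RandomState(0)
toy_performance = rng.rand(len(toy_regularization), len(toy_latent_factors),
                           len(toy_bias), len(toy_bias_weights))

# argmax works on the flattened array; unravel_index maps it back to 4 indices
i, j, k, m = np.unravel_index(np.argmax(toy_performance), toy_performance.shape)
best = {
    'regularization': toy_regularization[i],
    'latent_factors': toy_latent_factors[j],
    'bias': toy_bias[k],
    'bias_weight': toy_bias_weights[m],
    'accuracy': toy_performance[i, j, k, m],
}
print(best)
```

Averaging the array over all but one axis, for example `toy_performance.mean(axis=(1, 2, 3))`, gives a rough view of how sensitive the accuracy is to a single hyperparameter.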

##### First Round: Results

##### Second Round

##### Second Round: Results