# Problem Statement

Understanding customers and their preferences is the holy grail for online businesses. Building a recommender system is one of the common ways to do so.

In this contest, you need to build a model that predicts a given user’s rating (from 0 to 10 stars) for a given item, based on past ratings of other items. Rating prediction is the primary part of a recommendation problem when explicit ratings are available. No additional information (user demographics, item content features, etc.) is given, so predictions have to be made using only the ratings of already rated items.

# Dataset

The data contains ratings from 40,000 users on 120 items. Ratings from users who rated fewer than 10 items have been removed.

**training.csv** - This contains 958,529 ratings selected randomly from a total of 1,599,544 ratings. It has 4 columns:
- ID: Unique ID for each record
- userId: Unique user ID for each customer
- itemId: Item ID of the product
- rating: Rating given to each item by user

**test.csv** - This file has three columns: ID, userId and itemId. Predictions on this set are what the leaderboard scores.

**submission.csv** - This contains the model's predictions for the test file. The file must contain two columns (ID and rating).

# Evaluation

The metric used to evaluate the model is the Root Mean Squared Error (RMSE) between the predicted and the actual ratings.
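
As a quick reference for the metric, here is a minimal sketch of RMSE in plain Python. The function name `rmse` and the toy ratings are assumptions for illustration; the contest's own scoring code is not shown in this notebook.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences of ratings."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if len(predicted) == 0:
        raise ValueError("at least one rating is required")
    squared_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared_error / float(len(predicted)))

# Hypothetical ratings on the 0-10 scale, for illustration only.
example_rmse = rmse([5.0, 7.0, 2.0], [6.0, 7.0, 4.0])
```

Because errors are squared before averaging, a few badly wrong predictions hurt the score more than many slightly wrong ones.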

The test data is split 25:75 between the public and private leaderboards.

# Import Packages

In [2]:
import graphlab as gl
import graphlab.aggregate as agg
from graphlab.toolkits.feature_engineering import *
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()



[INFO] graphlab.cython.cy_server: GraphLab Create v2.1 started. Logging: /tmp/graphlab_server_1488240761.log


# Import Data

In [3]:
train = gl.SFrame(data='data/train_MLWARE2.csv')
test = gl.SFrame(data='data/test_MLWARE2.csv')

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,int,int,float]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,int,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


In [4]:
train

ID,userId,itemId,rating
16041_129,16041,129,0.5
16041_25,16041,25,0.5
16041_28,16041,28,5.5
16041_101,16041,101,0.5
16041_47,16041,47,1.5
16041_132,16041,132,0.5
16041_38,16041,38,0.5
16041_89,16041,89,10.0
16041_17,16041,17,2.5
16041_116,16041,116,6.5


In [3]:
# Create row identifier for test data to sort for submission
test['row'] = gl.SArray.from_sequence(len(test))

In [5]:
test

ID,userId,itemId,row
16041_10,16041,10,0
16041_107,16041,107,1
16041_1,16041,1,2
16041_40,16041,40,3
16041_96,16041,96,4
16041_137,16041,137,5
16041_51,16041,51,6
16041_59,16041,59,7
16041_135,16041,135,8
16041_15,16041,15,9


In [6]:
print "Train.........."
print "No. of unique items:", len(train['itemId'].unique())
print "No. of unique users:", len(train['userId'].unique())
print "userId range:", min(train['userId'].unique()), "->", max(train['userId'].unique())
print "itemId range:", min(train['itemId'].unique()), "->", max(train['itemId'].unique())

print "\nTest.........."
print "No. of unique items:", len(test['itemId'].unique())
print "No. of unique users:", len(test['userId'].unique())
print "userId range:", min(test['userId'].unique()), "->", max(test['userId'].unique())
print "itemId range:", min(test['itemId'].unique()), "->", max(test['itemId'].unique())

Train..........
No. of unique items: 120
No. of unique users: 40000
userId range: 0 -> 59131
itemId range: 1 -> 139

Test..........
No. of unique items: 120
No. of unique users: 39982
userId range: 0 -> 59131
itemId range: 1 -> 139


In [7]:
submit = gl.SFrame(data='data/sample_submission_MLWARE2.csv')
submit

Insufficient number of rows to perform type inference
Could not detect types. Using str for each column.


ID,rating


# Popularity Model

This model scores an RMSE of ~2.6 on the Public LB and is used as the benchmark.
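
To make the benchmark concrete, here is a minimal sketch of an item-mean baseline in plain Python. It assumes that a popularity model trained with a rating target scores each item by its mean training rating; that is an assumption about the library's behaviour, and the helper names and toy data below are hypothetical.

```python
from collections import defaultdict

def fit_item_means(ratings):
    """Compute per-item mean ratings and the global mean.

    ratings: iterable of (user_id, item_id, rating) tuples.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    grand_total = 0.0
    n = 0
    for _, item_id, rating in ratings:
        totals[item_id] += rating
        counts[item_id] += 1
        grand_total += rating
        n += 1
    item_means = {item: totals[item] / counts[item] for item in totals}
    return item_means, grand_total / n

def predict_item_mean(item_means, global_mean, item_id):
    # Every user gets the same prediction for a given item;
    # items never seen in training fall back to the global mean.
    return item_means.get(item_id, global_mean)

# Hypothetical toy data, for illustration only.
toy_ratings = [(1, 10, 4.0), (2, 10, 6.0), (1, 20, 9.0)]
item_means, global_mean = fit_item_means(toy_ratings)
```

A baseline like this ignores who the user is, which is why it makes a reasonable floor for personalised models to beat.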

In [44]:
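# NOTE: `users` and `items` are the side-feature SFrames built in the
# "Approach 2" section below; run those cells before this one.
popularity_model = gl.popularity_recommender.create(train,
                                                    user_id='userId', item_id='itemId',
                                                    target='rating',
                                                    user_data=users, item_data=items,
                                                    random_seed=123, verbose=False)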

In [45]:
popularity_model

Class                            : PopularityRecommender

Schema
------
User ID                          : userId
Item ID                          : itemId
Target                           : rating
Additional observation features  : 0
User side features               : ['userId', 'count', 'rating_mean', 'rating_sd']
Item side features               : ['itemId', 'count', 'rating_mean', 'rating_sd', 'wr']

Statistics
----------
Number of observations           : 958529
Number of users                  : 40000
Number of items                  : 120

Training summary
----------------
Training time                    : 0.1676

Model Parameters
----------------
Model class                      : PopularityRecommender

In [46]:
# Predictions
popularity_pred = popularity_model.recommend(k=120, new_observation_data=test,
                                             exclude=None, exclude_known=False,
                                             random_seed=123, verbose=False)

In [50]:
test1 = test.join(popularity_pred, on=['userId','itemId'], how='left')
test1 = test1.sort('row')
#test1['score'] = test1['score'].apply(lambda x: round(x,1))

In [51]:
test1.head(3)

ID,userId,itemId,row,score,rank
16041_10,16041,10,0,4.69391715178,115
16041_107,16041,107,1,4.19703820467,119
16041_1,16041,1,2,5.37728754074,98
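
The join-then-sort step above exists because the joined frame is not guaranteed to keep the test file's row order, hence the `row` column. A minimal sketch of the same idea in plain Python follows; `attach_scores`, the default score, and the toy values are hypothetical and only illustrate the left join on `(userId, itemId)`.

```python
def attach_scores(test_rows, predictions, default_score):
    """Left-join predictions onto test rows, preserving test order.

    test_rows:   list of (ID, userId, itemId) in submission order.
    predictions: list of (userId, itemId, score) in any order.
    Rows without a matching prediction get default_score.
    """
    score_by_key = {(user, item): score for user, item, score in predictions}
    return [(row_id, score_by_key.get((user, item), default_score))
            for row_id, user, item in test_rows]

# Hypothetical values, for illustration only.
toy_test = [("16041_10", 16041, 10), ("16041_107", 16041, 107), ("16041_1", 16041, 1)]
toy_predictions = [(16041, 1, 5.4), (16041, 10, 4.7)]
toy_submission = attach_scores(toy_test, toy_predictions, default_score=5.9)
```

Filling unmatched rows with a default is a design choice of this sketch: a left join would otherwise leave missing scores, which a submission file cannot contain.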


In [24]:
# Submission
submit = gl.SFrame({'ID':test1['ID'],'rating':test1['score']})
submit.save('data/submit.csv', format='csv')

# Similarity Model

In [52]:
similarity_model = gl.item_similarity_recommender.create(train,
                                                         user_id='userId', item_id='itemId', target='rating',
                                                         user_data=users, item_data=items,
                                                         similarity_type='jaccard', only_top_k=120, verbose=False)

In [53]:
similarity_model

Class                            : ItemSimilarityRecommender

Schema
------
User ID                          : userId
Item ID                          : itemId
Target                           : rating
Additional observation features  : 0
User side features               : ['userId', 'count', 'rating_mean', 'rating_sd']
Item side features               : ['itemId', 'count', 'rating_mean', 'rating_sd', 'wr']

Statistics
----------
Number of observations           : 958529
Number of users                  : 40000
Number of items                  : 120

Training summary
----------------
Training time                    : 1.2289

Model Parameters
----------------
Model class                      : ItemSimilarityRecommender
threshold                        : 0.001
similarity_type                  : jaccard
training_method                  : auto

Other Settings
--------------
degree_approximation_threshold   : 4096
sparse_density_estimation_sample_size : 4096
max_data_passes                

In [54]:
# Predictions
similarity_pred = similarity_model.recommend(k=120, new_observation_data=test,
                                             exclude=None, exclude_known=False,
                                             random_seed=123, verbose=False)

In [55]:
test2 = test.join(similarity_pred, on=['userId','itemId'], how='left')
test2 = test2.sort('row')
#test2['score'] = test2['score'].apply(lambda x: round(x,1))

In [56]:
test2.head(3)

ID,userId,itemId,row,score,rank
16041_10,16041,10,0,0.204375151193,57
16041_107,16041,107,1,0.204425914108,56
16041_1,16041,1,2,0.20575889214,52


In [53]:
# Submission
submit = gl.SFrame({'ID':test2['ID'],'rating':test2['score']})
submit.save('data/submit.csv', format='csv')

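This model doesn't improve on the benchmark and scores ~6.2 on the Public LB. The scores returned by `recommend` here are similarity-based values of around 0.2 rather than ratings on the 0-10 scale, which likely accounts for the large error.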
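
For reference, Jaccard similarity between two items depends only on which users rated them, not on the rating values. A minimal sketch in plain Python, with a hypothetical `raters` mapping from item ID to the set of users who rated it:

```python
def jaccard(users_a, users_b):
    """Jaccard similarity: |A intersect B| / |A union B| for two sets of user IDs."""
    a, b = set(users_a), set(users_b)
    union = a | b
    if not union:
        return 0.0  # two items with no raters: define similarity as 0
    return len(a & b) / float(len(union))

# Hypothetical toy data: item ID -> users who rated that item.
raters = {10: {1, 2, 3}, 107: {2, 3, 4}, 1: {5}}
sim_10_107 = jaccard(raters[10], raters[107])  # 2 shared users out of 4 distinct
sim_10_1 = jaccard(raters[10], raters[1])      # no shared users
```

Since the measure always lies between 0 and 1, it is better suited to ranking items than to predicting a star rating directly.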

# Factorization Model

In [None]:
factorization_model = gl.ranking_factorization_recommender.create(train,
                                                                  user_id='userId', item_id='itemId', target='rating',
                                                                  num_factors = 200, max_iterations = 30,
                                                                  sgd_step_size = 0.016276, sgd_convergence_threshold=1e-06,
                                                                  regularization=1e-09, linear_regularization=1e-09,
                                                                  unobserved_rating_value=0, ranking_regularization=0,
                                                                  solver = "auto", random_seed=123)

In [75]:
factorization_model

Class                            : RankingFactorizationRecommender

Schema
------
User ID                          : userId
Item ID                          : itemId
Target                           : rating
Additional observation features  : 1
User side features               : []
Item side features               : []

Statistics
----------
Number of observations           : 958529
Number of users                  : 40000
Number of items                  : 120

Training summary
----------------
Training time                    : 73.8624

Model Parameters
----------------
Model class                      : RankingFactorizationRecommender
num_factors                      : 200
binary_target                    : 0
side_data_factorization          : 1
solver                           : auto
nmf                              : 0
max_iterations                   : 30

Regularization Settings
-----------------------
regularization                   : 0.0
regularization_type              : nor

# Predictions

In [81]:
factorization_pred = factorization_model.recommend(k=120, new_observation_data=test,
                                             exclude=None, exclude_known=False,
                                             random_seed=123, verbose=False)

In [82]:
test3 = test.join(factorization_pred, on=['userId','itemId'], how='left')
test3 = test3.sort('row')

In [83]:
test3.head()

ID,userId,itemId,row,score,rank
16041_10,16041,10,0,2.32468843367,117
16041_107,16041,107,1,1.47171229031,120
16041_1,16041,1,2,3.73109954503,107
16041_40,16041,40,3,6.18635744509,52
16041_96,16041,96,4,4.94744706061,86
16041_137,16041,137,5,7.79970431235,4
16041_51,16041,51,6,4.67210061756,94
16041_59,16041,59,7,6.62484834623,41
16041_135,16041,135,8,5.96890032199,60
16041_15,16041,15,9,6.07083183434,58


# Submission

In [84]:
submit = gl.SFrame({'ID':test3['ID'],'rating':test3['score']})
submit.save('data/submit.csv', format='csv')

This model achieves the following RMSE:

- Train: 0.000268562
- Public LB: 2.0648
- Private LB: 2.064361
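
The gap between a near-zero train RMSE and a leaderboard RMSE of about 2.06 suggests the model is overfitting, which is plausible with 200 factors and regularization of 1e-09. The sketch below shows the general idea of a biased matrix factorization trained with SGD and checked against held-out ratings. It is not GraphLab's implementation; all names, hyperparameters, and the toy data are assumptions.

```python
import math
import random

def train_mf(ratings, num_factors=2, epochs=200, lr=0.01, reg=0.05, seed=123):
    """Fit r_ui ~ mu + b_u + b_i + p_u . q_i with plain SGD.

    ratings: list of (user, item, rating). Returns a predict(user, item)
    function that only works for users and items seen in training.
    """
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / float(len(ratings))
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    b_u = {u: 0.0 for u in users}
    b_i = {i: 0.0 for i in items}
    p = {u: [rng.gauss(0, 0.1) for _ in range(num_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(num_factors)] for i in items}

    def predict(u, i):
        dot = sum(pf * qf for pf, qf in zip(p[u], q[i]))
        return mu + b_u[u] + b_i[i] + dot

    order = list(ratings)
    for _ in range(epochs):
        rng.shuffle(order)
        for u, i, r in order:
            err = r - predict(u, i)
            # Gradient steps with L2 shrinkage on biases and factors.
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            for f in range(num_factors):
                pu, qi = p[u][f], q[i][f]
                p[u][f] += lr * (err * qi - reg * pu)
                q[i][f] += lr * (err * pu - reg * qi)
    return predict

def rmse_of(predict, ratings):
    se = sum((r - predict(u, i)) ** 2 for u, i, r in ratings)
    return math.sqrt(se / float(len(ratings)))

# Hypothetical toy data: fit on toy_train, check on a held-out rating.
toy_train = [(0, 'a', 8.0), (0, 'b', 2.0), (1, 'a', 9.0), (1, 'b', 3.0),
             (2, 'a', 1.0), (2, 'b', 7.0), (3, 'a', 2.0)]
toy_holdout = [(3, 'b', 8.0)]
predict_fn = train_mf(toy_train)
train_rmse = rmse_of(predict_fn, toy_train)
holdout_rmse = rmse_of(predict_fn, toy_holdout)
```

Comparing `train_rmse` with `holdout_rmse` on a split of the real training data gives an offline estimate of leaderboard performance, so `num_factors` and `reg` can be tuned without spending submissions.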

# Approach 2

This approach builds user-side and item-side features from the training ratings.

In [14]:
users = train.groupby(key_columns='userId',
                      operations={'count': agg.COUNT(),
                                  'rating_mean': agg.AVG('rating'),
                                  'rating_sd': agg.STDV('rating')})

In [16]:
users['rating_mean'] = users['rating_mean'].apply(lambda x: round(x,2))
users['rating_sd'] = users['rating_sd'].apply(lambda x: round(x,2))

In [17]:
users

userId,count,rating_mean,rating_sd
21855,11,5.59,1.55
7899,16,7.06,1.85
30621,29,5.45,1.45
43116,18,6.97,1.16
26319,15,5.23,2.71
26439,9,10.0,0.0
21925,14,6.75,0.94
22098,38,8.3,1.48
26561,10,5.15,2.43
3143,12,6.13,1.45


In [36]:
items = train.groupby(key_columns='itemId', 
                      operations={'count': agg.COUNT(),
                                  'rating_mean': agg.AVG('rating'),
                                  'rating_sd': agg.STDV('rating')})

In [38]:
import numpy as np
a = np.array(items['count'])
np.percentile(a, 50)

6415.0

In [39]:
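m = 6500                    # prior weight, approx. the median item rating count (50th percentile)
n = items['count']          # number of ratings per item
R = items['rating_mean']    # mean rating per item
c = train['rating'].mean()  # global mean rating: 5.928695949731271
# Weighted rating: each item's mean is shrunk toward the global mean
items['wr'] = ((n/(n+m))*R) + ((m/(n+m))*c)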

In [40]:
items['rating_mean'] = items['rating_mean'].apply(lambda x: round(x,2))
items['rating_sd'] = items['rating_sd'].apply(lambda x: round(x,2))
items['wr'] = items['wr'].apply(lambda x: round(x,2))

In [41]:
items

itemId,count,rating_mean,rating_sd,wr
118,6366,6.29,2.61,6.11
60,9852,6.59,2.28,6.33
132,4468,5.66,2.56,5.82
36,8667,6.37,2.5,6.18
136,6558,6.23,2.4,6.08
116,11737,6.71,2.16,6.43
24,9726,6.48,2.31,6.26
2,7393,6.25,2.42,6.1
46,4193,5.79,2.52,5.87
79,6601,5.73,2.5,5.83
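
As a worked check of the `wr` column, the sketch below recomputes the weighted rating for item 118 from the rounded values shown in the table. The function name `weighted_rating` is an assumption; the formula is the one used in the cell above.

```python
def weighted_rating(n, R, m, c):
    """Shrink an item's mean rating R (based on n ratings) toward the global mean c.

    m is the prior weight: an item with n == m ratings lands halfway
    between its own mean and the global mean.
    """
    n = float(n)  # avoid integer division under Python 2
    return (n / (n + m)) * R + (m / (n + m)) * c

# Item 118 from the table: count=6366, rating_mean=6.29 (rounded).
wr_item_118 = weighted_rating(n=6366, R=6.29, m=6500, c=5.928695949731271)
```

Rounded to two decimals this gives 6.11, matching the table. Items with few ratings are pulled strongly toward the global mean, while heavily rated items keep a value close to their own mean.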


In [None]:
factorization_model = gl.ranking_factorization_recommender.create(train,
                                                                  user_id='userId', item_id='itemId', target='rating',
                                                                  user_data=users.select_columns(['userId','count','rating_mean']),
                                                                  item_data=items.select_columns(['itemId','count','wr']),
                                                                  num_factors = 200, max_iterations = 30,
                                                                  #sgd_step_size = 0.016276,
                                                                  sgd_convergence_threshold=1e-06,
                                                                  regularization=1e-09, linear_regularization=1e-09,
                                                                  unobserved_rating_value=0,
                                                                  #ranking_regularization=0,
                                                                  solver = "auto", random_seed=123)

In this case, none of the user or item features improved the validation RMSE. Each feature (count, mean, weighted mean, etc.) was tried in turn.