# Movielens

Movielens is a dataset of (user,movie,rating) triples. Given training data of such triples, 
we want to predict ratings for unseen (user,movie) pairs. All ratings are between 0.5 and 5.0.


The fastai lesson 5 notebook trains a collaborative filtering model on the Movielens data in "ratings.csv", holding out a random subset of ratings for validation. That notebook reaches a mean squared error (MSE) validation loss of 0.765, which is a bit better than the best published result for Movielens at the time.
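
For reference, the mean squared error quoted above averages the squared differences between predicted and actual ratings, and its square root (RMSE) is in the same units as the ratings themselves. A minimal sketch in plain Python, using made-up ratings purely for illustration:

```python
import math

def mean_squared_error(predicted, actual):
    """Average of squared differences between two equal-length sequences."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

# Illustrative values only, not taken from ratings.csv
actual = [4.0, 3.5, 5.0, 2.0]
predicted = [3.5, 3.5, 4.0, 3.0]

mse = mean_squared_error(predicted, actual)
rmse = math.sqrt(mse)
print(mse, rmse)
```

By the same conversion, an MSE of 0.765 corresponds to an RMSE of roughly 0.87 rating points.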

In this notebook I also train on the movielens dataset in "ratings.csv", with a random subset of points chosen for validation. I train in 2 ways:

1. Using a collaborative filtering model from my file CollaborativeFiltering.py 
2. Using a general structured data model from my file StructuredData.py

(Note: The collaborative filtering problem is a specific case of the structured data problem with exactly 2 categorical input variables and 1 continuous output variable.)
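
To make the note concrete: a common collaborative filtering formulation gives each user and each movie an embedding vector and predicts a rating from their dot product plus bias terms. The sketch below uses plain Python and tiny hand-picked embeddings. It illustrates the general idea only and is not necessarily how `CollabFilterNet` is implemented; `predict_rating`, `user_emb`, and `movie_emb` are hypothetical names.

```python
def predict_rating(user_vec, movie_vec, user_bias=0.0, movie_bias=0.0):
    """Dot product of two embedding vectors plus per-user and per-movie biases."""
    if len(user_vec) != len(movie_vec):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(u * m for u, m in zip(user_vec, movie_vec))
    return dot + user_bias + movie_bias

# Hypothetical 3-dimensional embeddings keyed by categorical id
user_emb = {1: [1.0, 0.5, 0.0], 2: [0.0, 1.0, 1.0]}
movie_emb = {10: [2.0, 1.0, 0.0], 20: [0.0, 1.0, 2.0]}

# The two categorical inputs (userId, movieId) select the embeddings;
# the single continuous output is the predicted rating.
first = predict_rating(user_emb[1], movie_emb[10], user_bias=0.5)
second = predict_rating(user_emb[2], movie_emb[20])
print(first, second)
```

In a trained model the embeddings and biases are learned parameters. A general structured data model would typically concatenate the embeddings and pass them through fully connected layers instead of taking a dot product.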

Performance with my collaborative filtering model is a bit better than with my general structured data model (a validation MSE of about 0.762 versus 0.787), and comparable to the result in the fastai lesson 5 notebook. However, I did not spend much time optimizing the parameters of the general structured data model, and it has many more of them.

In [1]:
# Automatic reloading and inline plotting
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
# My Imports
from General import *
from CollaborativeFiltering import *
from StructuredData import *

# Standard Imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Torch Imports
import torch
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

PATH = 'data/movielens/'

### Method 1 - Train Using a Collaborative Filtering Model from CollaborativeFiltering.py

In [3]:
# data object
filename = PATH + 'ratings.csv'
data = CollabFilterDataObj.from_csv(filename,'userId','movieId','rating',bs=128)

# pytorch model
model = CollabFilterNet.from_dataobj(data, emb_dim=50)

# optimizer and loss function
optimizer = optim.Adam(model.parameters())
loss_func = nn.MSELoss()

# learner 
learner = Learner(PATH, data, model, optimizer, loss_func)

In [4]:
learner.fit(lr=0.003,num_cycles=2,cycle_mult = 2)

epoch   train_loss  val_loss    

0       0.89780     0.97462       epoch run time: 0 min, 4.14 sec
1       0.47825     0.76586       epoch run time: 0 min, 4.06 sec
2       0.41527     0.76170       epoch run time: 0 min, 4.06 sec
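
The three epochs above are consistent with a warm-restart schedule in which each cycle is `cycle_mult` times longer than the previous one: one epoch for the first cycle, then two for the second. The helper below is a hypothetical sketch of that arithmetic, assuming `base_cycle_length` and `cycle_mult` both default to 1. It is not the `Learner` implementation.

```python
def cycle_lengths(num_cycles, base_cycle_length=1, cycle_mult=1):
    """Epochs per cycle when each cycle is cycle_mult times the previous one."""
    return [base_cycle_length * cycle_mult ** i for i in range(num_cycles)]

# Matches the call fit(num_cycles=2, cycle_mult=2)
doubling = cycle_lengths(num_cycles=2, cycle_mult=2)
print(doubling, sum(doubling))

# Matches the call fit(num_cycles=2, base_cycle_length=2)
fixed = cycle_lengths(num_cycles=2, base_cycle_length=2)
print(fixed, sum(fixed))
```

Under these assumptions the first call runs 3 epochs and the second runs 4, which agrees with the number of rows printed by the corresponding `learner.fit` calls in this notebook.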


### Method 2 - Train Using a General Structured Data Model from StructuredData.py

In [5]:
# define data object
cat_vars = ['userId','movieId']
cont_vars = ['rating']
output_var = 'rating'
output_type = 'cont'
bs = 128

df = pd.read_csv(PATH + 'ratings.csv')
df = df.reindex(columns=['userId','movieId','rating'])
users = df['userId'].unique()
items = df['movieId'].unique()
user_labels = {users[i]:i for i in range(len(users))}
item_labels = {items[i]:i for i in range(len(items))}
labels = [user_labels,item_labels]
train_df, val_df = SplitDataFrameTrainVal(df)

xcat_df, xcont_df, y, scaling_values, category_labels = \
ProcessDataFrame(train_df, cat_vars, cont_vars, output_var, scale_cont = 'No', 
                 category_labels = labels, unknown_category = False)
train_ds = StructuredDataset(xcat_df,xcont_df,y,output_type)

xcat_df, xcont_df, y, scaling_values, category_labels = \
ProcessDataFrame(val_df, cat_vars, cont_vars, output_var, scale_cont = 'No', 
                 category_labels = labels, unknown_category = False)
val_ds = StructuredDataset(xcat_df,xcont_df,y,output_type)


data = StructuredDataObj(train_ds, val_ds, labels, scaling_values,
                         bs, num_workers=4, test_ds = None)


# define pytorch model
fc_layer_sizes = [50,10,1]
emb_sizes = 'default'
output_range = [0.5,5.0]
dropout_levels = (0,0,[0,0.5,0.5])
use_bn = True
model = StructuredDataNet.from_dataobj(data, fc_layer_sizes, emb_sizes, output_range, dropout_levels, use_bn)

# optimizer and loss function
optimizer = optim.Adam(model.parameters())
loss_func = nn.MSELoss()

# learner object
learner = Learner(PATH,data,model,optimizer,loss_func)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  for var in cat_vars: df[var] = df[var].astype('category')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  for var in cont_vars: df[var] = df[var].astype('float32')
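
The `output_range = [0.5,5.0]` argument in the cell above suggests that the network's final activation is squashed into the valid rating interval. One common way to do this is to apply a sigmoid and rescale it, as sketched below in plain Python. Whether `StructuredDataNet` does exactly this is an assumption, and `scale_to_range` is a hypothetical helper.

```python
import math

def scale_to_range(raw, low=0.5, high=5.0):
    """Map an unbounded value into (low, high) with a rescaled sigmoid."""
    sigmoid = 1.0 / (1.0 + math.exp(-raw))
    return low + (high - low) * sigmoid

# Large negative inputs approach the lower bound, large positive the upper
for raw in (-10.0, 0.0, 10.0):
    print(raw, round(scale_to_range(raw), 3))
```

Bounding the output this way keeps predictions inside the range of possible ratings, so the model does not have to learn that range on its own.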


In [6]:
learner.fit(lr=0.003,num_cycles=2,cycle_mult = 2)

epoch   train_loss  val_loss    

0       0.89963     0.91686       epoch run time: 0 min, 6.49 sec
1       0.74073     0.81896       epoch run time: 0 min, 6.22 sec
2       0.68890     0.80451       epoch run time: 0 min, 6.25 sec


In [7]:
learner.fit(lr=0.003,num_cycles=2,base_cycle_length=2)

epoch   train_loss  val_loss    

0       0.68689     0.80359       epoch run time: 0 min, 6.17 sec
1       0.63907     0.79171       epoch run time: 0 min, 6.44 sec
2       0.64052     0.78692       epoch run time: 0 min, 7.48 sec
3       0.61415     0.78966       epoch run time: 0 min, 8.73 sec
