# Collaborative Filtering - with Matrix Factorization and Neural Networks

Collaborative filtering systems recommend items based on similarity measures between users and/or items: the items recommended to a user are those preferred by similar users.
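To make the idea concrete, here is a minimal sketch of user-based collaborative filtering on a hypothetical toy dataset (the users, items, and ratings below are made up for illustration and are not taken from MovieLens). It assumes cosine similarity computed over co-rated items, which is only one of several possible similarity measures.

```python
import math

# Hypothetical toy ratings: user -> {item: rating}; not taken from MovieLens.
ratings = {
    "alice": {"A": 5.0, "B": 4.0, "C": 1.0},
    "bob":   {"A": 4.0, "B": 5.0, "D": 5.0},
    "carol": {"A": 1.0, "C": 5.0, "E": 4.0},
}

def cosine(r1, r2):
    """Cosine similarity computed over the items both users rated."""
    common = set(r1) & set(r2)
    if not common:
        return 0.0
    dot = sum(r1[i] * r2[i] for i in common)
    n1 = math.sqrt(sum(r1[i] ** 2 for i in common))
    n2 = math.sqrt(sum(r2[i] ** 2 for i in common))
    return dot / (n1 * n2)

def recommend(user, ratings):
    """Return the most similar user and that user's items unseen by `user`, best rated first."""
    others = [u for u in ratings if u != user]
    nearest = max(others, key=lambda u: cosine(ratings[user], ratings[u]))
    unseen = {i: r for i, r in ratings[nearest].items() if i not in ratings[user]}
    return nearest, sorted(unseen, key=unseen.get, reverse=True)

print(recommend("alice", ratings))  # ('bob', ['D'])
```

The models in the rest of this notebook do not compute similarities explicitly; they learn embeddings from which similar users and items end up with similar vectors.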

The MovieLens dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. The version downloaded here contains 100836 ratings and 3683 tag applications across 9742 movies (https://grouplens.org/datasets/movielens/). To get the data:

`wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip`

In [1]:
! wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip

--2019-06-13 21:25:10--  http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 978202 (955K) [application/zip]
Saving to: ‘ml-latest-small.zip’


2019-06-13 21:25:10 (3.17 MB/s) - ‘ml-latest-small.zip’ saved [978202/978202]



## MovieLens dataset

In [12]:
from pathlib import Path
import pandas as pd
import numpy as np

import torch
import torch.nn as nn
import torch.nn.functional as F

In [13]:
import os
os.getcwd()

'/home/jupyter/recco'

In [14]:
PATH = Path("/home/jupyter/recco/ml-latest-small/")
list(PATH.iterdir())

[PosixPath('/home/jupyter/recco/ml-latest-small/README.txt'),
 PosixPath('/home/jupyter/recco/ml-latest-small/tags.csv'),
 PosixPath('/home/jupyter/recco/ml-latest-small/movies.csv'),
 PosixPath('/home/jupyter/recco/ml-latest-small/links.csv'),
 PosixPath('/home/jupyter/recco/ml-latest-small/ratings.csv')]

In [15]:
! head /home/jupyter/recco/ml-latest-small/ratings.csv

userId,movieId,rating,timestamp
1,1,4.0,964982703
1,3,4.0,964981247
1,6,4.0,964982224
1,47,5.0,964983815
1,50,5.0,964982931
1,70,3.0,964982400
1,101,5.0,964980868
1,110,4.0,964982176
1,151,5.0,964984041


In [16]:
# reading a csv into pandas
data = pd.read_csv(PATH/"ratings.csv")

In [17]:
data.head()

|   | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |


### Encoding data
We encode the data to have contiguous ids for users and movies. You can think of this as a categorical encoding of our two categorical variables, userId and movieId.

In [18]:
# split train and validation before encoding
np.random.seed(3)
msk = np.random.rand(len(data)) < 0.8
train = data[msk].copy()
val = data[~msk].copy()

In [22]:
len(train), len(val), len(data)

(80450, 20386, 100836)

In [23]:
# here is a handy function modified from fast.ai
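def proc_col(col, train_col=None):
    """Encodes a pandas column with contiguous ids.

    Ids are assigned from train_col when it is given; values not seen there map to -1.
    Returns (name2idx, encoded array, number of unique ids).
    """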
    if train_col is not None:
        uniq = train_col.unique()
    else:
        uniq = col.unique()
    name2idx = {o:i for i,o in enumerate(uniq)}
    return name2idx, np.array([name2idx.get(x, -1) for x in col]), len(uniq)

In [24]:
def encode_data(df, train=None):
    """Encodes rating data with contiguous user and movie ids.

    If train is provided, encodes df with the same encoding as train
    and drops rows whose user or movie does not appear in train.
    """
    df = df.copy()
    for col_name in ["userId", "movieId"]:
        train_col = None
        if train is not None:
            train_col = train[col_name]
        if col_name == 'movieId':
            movieId2idx,col,_ = proc_col(df[col_name], train_col)
        else:
            userId2idx,col,_ = proc_col(df[col_name], train_col)
        df[col_name] = col
        df = df[df[col_name] >= 0]
        
    return (df, movieId2idx, userId2idx) if train is None else df

In [25]:
df_train, movieId2idx, userId2idx  = encode_data(train)
df_val = encode_data(val, train)
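The following plain-Python sketch mirrors what `proc_col` does, using hypothetical raw ids chosen only for illustration. It shows why validation rows can disappear: an id that never occurs in the training split is encoded as -1, and `encode_data` filters out rows with negative ids.

```python
# Hypothetical raw ids, chosen only for illustration.
train_ids = [10, 42, 10, 7]
val_ids = [42, 99, 7]

# Map each distinct training id to a contiguous index, in order of first appearance.
id2idx = {}
for raw in train_ids:
    id2idx.setdefault(raw, len(id2idx))

train_enc = [id2idx[r] for r in train_ids]
val_enc = [id2idx.get(r, -1) for r in val_ids]  # -1 marks ids unseen in training

print(id2idx)     # {10: 0, 42: 1, 7: 2}
print(train_enc)  # [0, 1, 0, 2]
print(val_enc)    # [1, -1, 2]
```

Contiguous ids matter because they are used directly as row indices into the embedding tables defined later.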

## Matrix Factorization with bias
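We want to extend the matrix factorization model discussed in class with a "bias" parameter for each user and another "bias" parameter for each movie. <br>
In class the parameters were the matrices $U$ and $V$; here we add $u_0$, a vector of dimension $n_u$, and $v_0$, a vector of dimension $n_m$. The prediction for user $i$ and movie $j$ is

$$\hat{y}_{ij} = u_{0i} + v_{0j} + u_i \cdot v_j$$

Let $Y$ be the $n_u \times n_m$ matrix of ratings, $R$ the indicator matrix with $r_{ij} = 1$ if user $i$ rated movie $j$ and $0$ otherwise, and $N = \sum_{i,j} r_{ij}$ the number of observed ratings. The loss is the mean squared error over the observed ratings:

$$E = \frac{1}{N} \sum_{i,j} r_{ij} \, (y_{ij} - \hat{y}_{ij})^2$$

With the masked residuals $\Delta = (Y - \hat{Y}) \odot R$ and learning rate $\eta$, the gradients and gradient-descent updates are:

* $$\frac{\partial E}{\partial u_{0i}} = -\frac{2}{N} \sum_{j} \Delta_{ij} \qquad u_{0i} \leftarrow u_{0i} - \eta \, \frac{\partial E}{\partial u_{0i}}$$

* $$\frac{\partial E}{\partial v_{0j}} = -\frac{2}{N} \sum_{i} \Delta_{ij} \qquad v_{0j} \leftarrow v_{0j} - \eta \, \frac{\partial E}{\partial v_{0j}}$$

* $$\frac{\partial E}{\partial U} = -\frac{2}{N} \, \Delta V \qquad U \leftarrow U - \eta \, \frac{\partial E}{\partial U}$$

* $$\frac{\partial E}{\partial V} = -\frac{2}{N} \, \Delta^T U \qquad V \leftarrow V - \eta \, \frac{\partial E}{\partial V}$$

In NumPy terms, the sum over $j$ is `np.sum(Delta, axis=1)` (one value per user) and the sum over $i$ is `np.sum(Delta, axis=0)` (one value per movie).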
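As a sanity check on the biased model, here is a minimal NumPy sketch of the masked mean-squared-error loss and one gradient-descent step. It assumes $Y$ and $R$ are $n_u \times n_m$ arrays with users as rows; the toy sizes, the random data, and the helper names `loss` and `gradients` are assumptions made for illustration. One analytic bias gradient is compared against a finite-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
n_u, n_m, k = 4, 5, 3                                   # toy sizes (assumed)
Y = rng.integers(1, 6, size=(n_u, n_m)).astype(float)   # toy ratings in 1..5
R = (rng.random((n_u, n_m)) < 0.6).astype(float)        # 1 where a rating is observed
R[0, 0] = 1.0                                           # guarantee at least one observation
N = R.sum()

U = rng.normal(scale=0.1, size=(n_u, k))
V = rng.normal(scale=0.1, size=(n_m, k))
u0 = np.zeros(n_u)
v0 = np.zeros(n_m)

def loss(U, V, u0, v0):
    Y_hat = U @ V.T + u0[:, None] + v0[None, :]
    return (R * (Y - Y_hat) ** 2).sum() / N

def gradients(U, V, u0, v0):
    D = R * (Y - (U @ V.T + u0[:, None] + v0[None, :]))  # masked residuals
    return (-2 / N * D @ V,          # dE/dU
            -2 / N * D.T @ U,        # dE/dV
            -2 / N * D.sum(axis=1),  # dE/du0, one entry per user
            -2 / N * D.sum(axis=0))  # dE/dv0, one entry per movie

# Finite-difference check on a single user-bias entry.
eps = 1e-6
u0_shift = u0.copy()
u0_shift[1] += eps
numeric = (loss(U, V, u0_shift, v0) - loss(U, V, u0, v0)) / eps
analytic = gradients(U, V, u0, v0)[2][1]

# One gradient-descent step with learning rate eta.
eta = 0.1
before = loss(U, V, u0, v0)
gU, gV, gu0, gv0 = gradients(U, V, u0, v0)
U, V, u0, v0 = U - eta * gU, V - eta * gV, u0 - eta * gu0, v0 - eta * gv0
after = loss(U, V, u0, v0)
print(numeric, analytic, before, after)
```

The PyTorch model below never needs these hand-written gradients because autograd computes them, but the check is a cheap way to confirm the derivation.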

In [35]:
class MF_bias(nn.Module):
    def __init__(self, num_users, num_items, emb_size=100):
        super(MF_bias, self).__init__()
        self.user_emb = nn.Embedding(num_users, emb_size)
        self.user_bias = nn.Embedding(num_users, 1)
        self.item_emb = nn.Embedding(num_items, emb_size)
        self.item_bias = nn.Embedding(num_items, 1)
        # init 
        self.user_emb.weight.data.uniform_(0,0.05)
        self.item_emb.weight.data.uniform_(0,0.05)
        self.user_bias.weight.data.uniform_(-0.01,0.01)
        self.item_bias.weight.data.uniform_(-0.01,0.01)
        
    def forward(self, u, v):
        U = self.user_emb(u)
        V = self.item_emb(v)
        b_u = self.user_bias(u).squeeze()
        b_v = self.item_bias(v).squeeze()
        return (U*V).sum(1) +  b_u  + b_v

## Training MF model

In [126]:
num_users = len(df_train.userId.unique())
num_items = len(df_train.movieId.unique())

In [150]:
model = MF_bias(num_users, num_items, emb_size=50).cuda()  # if you have a GPU .cuda()

In [151]:
# here we are not using data loaders because our data fits well in memory
def train_epocs(model, epochs=10, lr=0.01, wd=5e-5, unsqueeze=False):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=wd)
    for i in range(epochs):
        model.train()  # test_loss() leaves the model in eval mode, so switch back every epoch
        users = torch.LongTensor(df_train.userId.values).cuda()
        items = torch.LongTensor(df_train.movieId.values).cuda()
        ratings = torch.FloatTensor(df_train.rating.values).cuda()
        if unsqueeze:
            ratings = ratings.unsqueeze(1)
        y_hat = model(users, items)
        loss = F.mse_loss(y_hat, ratings)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if i % 5 == 0:
            print('train loss at epoch ', i ,'is ', loss.item()) 
            test_loss(model, unsqueeze)

In [152]:
# Here is what unsqueeze does
ratings = torch.FloatTensor(df_train.rating.values)
print(ratings.shape)
ratings = ratings.unsqueeze(1) #.cuda()
ratings.shape

torch.Size([80450])


torch.Size([80450, 1])
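The `unsqueeze` flag exists because `MF_bias` returns predictions of shape `[N]`, while a model whose last layer is `nn.Linear(n_hidden, 1)` returns shape `[N, 1]`. If the shapes of predictions and targets differ, broadcasting can silently produce an `[N, N]` matrix of differences instead of `N` element-wise ones. The sketch below shows the effect with NumPy, whose broadcasting rules PyTorch follows; the three toy values are assumptions for illustration.

```python
import numpy as np

y_hat = np.array([[4.0], [3.0], [5.0]])   # predictions, shape (3, 1)
y = np.array([4.0, 3.0, 5.0])             # targets, shape (3,)

diff_bad = y_hat - y                      # broadcasts to shape (3, 3)
diff_ok = y_hat - y[:, None]              # shapes match: (3, 1)

mse_bad = (diff_bad ** 2).mean()
mse_ok = (diff_ok ** 2).mean()
print(diff_bad.shape, mse_bad)            # wrong: nonzero although predictions are perfect
print(diff_ok.shape, mse_ok)              # correct: 0.0
```

Here the predictions equal the targets, so the correct loss is zero; the mismatched version still reports a positive value because it compares every prediction with every target.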

In [153]:
def test_loss(model, unsqueeze=False):
    model.eval()
    users = torch.LongTensor(df_val.userId.values).cuda()
    items = torch.LongTensor(df_val.movieId.values).cuda()
    ratings = torch.FloatTensor(df_val.rating.values).cuda()
    if unsqueeze:
        ratings = ratings.unsqueeze(1)
    with torch.no_grad():  # gradients are not needed for evaluation
        y_hat = model(users, items)
        loss = F.mse_loss(y_hat, ratings)
    print('test loss at epoch is ', loss.item()) 
    print('##############')

In [154]:
train_epocs(model, epochs=30, lr= 0.1)

train loss at epoch  0 is  13.133536338806152
test loss at epoch is  7.642823696136475
##############
train loss at epoch  5 is  1.2623982429504395
test loss at epoch is  1.119561791419983
##############
train loss at epoch  10 is  1.930947184562683
test loss at epoch is  1.3582265377044678
##############
train loss at epoch  15 is  1.258612871170044
test loss at epoch is  1.16812002658844
##############
train loss at epoch  20 is  1.0722205638885498
test loss at epoch is  1.2347640991210938
##############
train loss at epoch  25 is  0.6588515639305115
test loss at epoch is  0.8953675627708435
##############


In [155]:
train_epocs(model, epochs=10, lr= 0.001)

train loss at epoch  0 is  0.6485084295272827
test loss at epoch is  0.7965345978736877
##############
train loss at epoch  5 is  0.63094162940979
test loss at epoch is  0.7882490158081055
##############


In [156]:
train_epocs(model, epochs=5, lr= 0.0001)

train loss at epoch  0 is  0.6198918223381042
test loss at epoch is  0.7857194542884827
##############


In [157]:
train_epocs(model, epochs=10, lr= 0.0001)

train loss at epoch  0 is  0.6188392639160156
test loss at epoch is  0.7852720618247986
##############
train loss at epoch  5 is  0.6178157329559326
test loss at epoch is  0.784857451915741
##############


In [170]:
train_epocs(model, epochs=10, lr= 0.00001)

train loss at epoch  0 is  0.5047546029090881
test loss at epoch is  0.7760410308837891
##############
train loss at epoch  5 is  0.5046563744544983
test loss at epoch is  0.7760418057441711
##############


In [171]:
avg_ratings = df_train.groupby('movieId')['rating'].mean().to_dict()

In [172]:
movies = pd.read_csv(PATH/"movies.csv")
movies['id'] = movies.movieId.apply(lambda x: movieId2idx[x] if x in movieId2idx else -1)
movies['avg_rating'] = movies.id.apply(lambda x: avg_ratings[x] if x in avg_ratings else -1)
movies['rating_bin'] = pd.cut(movies['avg_rating'],5,labels=[0,1,2,3,4])
color_mapping = {0:'black', 1:'darkred', 2: 'darkseagreen', 3:'orange', 4:'lightblue'}
movies['color'] = movies['rating_bin'].apply(lambda x: color_mapping[x])

In [173]:
# keep only movies seen in training, ordered by encoded id so that row k
# lines up with row k of the item embedding matrix
in_train = movies[movies.id.isin(df_train.movieId.unique())].sort_values('id')
titles = in_train.title
rating_bin = in_train.rating_bin
colors = in_train.color

In [174]:
from collections import Counter
Counter(rating_bin)

Counter({4: 2617, 3: 4524, 2: 1572, 1: 285})

In [175]:
# item embedding matrix: one row per encoded movie id
embedding_array = model.item_emb.weight.data.cpu().numpy()

In [176]:
from sklearn.manifold import TSNE
#t-sne on embedding vectors
tsne = TSNE(n_components=2, verbose=1, perplexity=30, n_iter=1000,learning_rate=10)
tsne_results = tsne.fit_transform(embedding_array)

[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 8998 samples in 0.016s...
[t-SNE] Computed neighbors for 8998 samples in 7.347s...
[t-SNE] Computed conditional probabilities for sample 1000 / 8998
[t-SNE] Computed conditional probabilities for sample 2000 / 8998
[t-SNE] Computed conditional probabilities for sample 3000 / 8998
[t-SNE] Computed conditional probabilities for sample 4000 / 8998
[t-SNE] Computed conditional probabilities for sample 5000 / 8998
[t-SNE] Computed conditional probabilities for sample 6000 / 8998
[t-SNE] Computed conditional probabilities for sample 7000 / 8998
[t-SNE] Computed conditional probabilities for sample 8000 / 8998
[t-SNE] Computed conditional probabilities for sample 8998 / 8998
[t-SNE] Mean sigma: 0.000004
[t-SNE] KL divergence after 250 iterations with early exaggeration: 88.359222
[t-SNE] Error after 1000 iterations: 2.022384


#### Check out what the model has learned

In [177]:
from bokeh.plotting import figure, show, output_notebook, save
from bokeh.models import HoverTool, value, LabelSet, Legend, ColumnDataSource
output_notebook()
source = ColumnDataSource(dict(
    x=tsne_results[:, 0],
    y=tsne_results[:, 1],
    title= titles
))
title = 'Visualizing 50 dimensional Embeddings with T-SNE'

plot_lda = figure(plot_width=1000, plot_height=600,
                     title=title, tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
                     x_axis_type=None, y_axis_type=None, min_border=1)

plot_lda.scatter(x='x', y='y',source=source,
                alpha=0.4, size=10)

# hover tools
hover = plot_lda.select(dict(type=HoverTool))
hover.tooltips = {"content": "Title: @title"}

show(plot_lda)
save(plot_lda, 't-SNE_emb.html') #save plot as HTML

  warn("save() called but no resources were supplied and output_file(...) was never called, defaulting to resources.CDN")
  warn("save() called but no title was supplied and output_file(...) was never called, using default title 'Bokeh Plot'")


'/home/jupyter/recco/t-SNE_emb.html'

## Neural Network Model

In [178]:
# Note that the user and item embeddings are concatenated rather than multiplied,
# so they could potentially have different sizes.
# Further tuning of the regularization could give better results.
    
class CollabFNet(nn.Module):
    def __init__(self, num_users, num_items, emb_size=100, n_hidden=10):
        super(CollabFNet, self).__init__()
        self.user_emb = nn.Embedding(num_users, emb_size)
        self.item_emb = nn.Embedding(num_items, emb_size)
        self.lin1 = nn.Linear(emb_size*2, n_hidden)
        self.lin2 = nn.Linear(n_hidden, 1)
        self.drop1 = nn.Dropout(0.1)
        self.drop2 = nn.Dropout(0.0)
        
    def forward(self, u, v):
        U = self.user_emb(u)
        V = self.item_emb(v)
        x = torch.cat([U, V], dim=1)
        x = self.drop1(x)
        x = F.relu(self.lin1(x))
        x = self.drop2(x)
        x = self.lin2(x)
        return x
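Because the two embeddings are concatenated before the first linear layer, nothing forces them to share a size. The following is a minimal sketch of such a variant; the class name `CollabFNetAsym` and the default sizes are hypothetical, and it has not been trained or tuned on MovieLens.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CollabFNetAsym(nn.Module):
    """Hypothetical variant of CollabFNet with separate user and item embedding sizes."""
    def __init__(self, num_users, num_items, user_emb_size=20, item_emb_size=50, n_hidden=10):
        super().__init__()
        self.user_emb = nn.Embedding(num_users, user_emb_size)
        self.item_emb = nn.Embedding(num_items, item_emb_size)
        # the first layer takes the concatenation, so its input size is the sum
        self.lin1 = nn.Linear(user_emb_size + item_emb_size, n_hidden)
        self.lin2 = nn.Linear(n_hidden, 1)
        self.drop1 = nn.Dropout(0.1)

    def forward(self, u, v):
        x = torch.cat([self.user_emb(u), self.item_emb(v)], dim=1)
        x = self.drop1(x)
        x = F.relu(self.lin1(x))
        return self.lin2(x)

# Shape check on toy ids (CPU only).
toy_model = CollabFNetAsym(num_users=6, num_items=8)
toy_model.eval()
u = torch.tensor([0, 1, 5])
v = torch.tensor([2, 7, 3])
out = toy_model(u, v)
print(out.shape)  # torch.Size([3, 1])
```

A smaller user embedding may be a reasonable choice when there are far fewer users than movies, but whether it helps here would have to be checked on the validation set.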

In [184]:
model = CollabFNet(num_users, num_items, emb_size=50).cuda()

In [188]:
train_epocs(model, epochs=50, lr=0.1, wd=1e-5, unsqueeze=True) 

train loss at epoch  0 is  0.6087620854377747
test loss at epoch is  3.7063727378845215
##############
train loss at epoch  5 is  0.7105394601821899
test loss at epoch is  1.347374677658081
##############
train loss at epoch  10 is  0.7356892824172974
test loss at epoch is  0.9125908017158508
##############
train loss at epoch  15 is  0.657213032245636
test loss at epoch is  0.8483330607414246
##############
train loss at epoch  20 is  0.6466460227966309
test loss at epoch is  0.7778878808021545
##############
train loss at epoch  25 is  0.6261124014854431
test loss at epoch is  0.7569930553436279
##############
train loss at epoch  30 is  0.617447555065155
test loss at epoch is  0.7566865682601929
##############
train loss at epoch  35 is  0.5975403785705566
test loss at epoch is  0.7531224489212036
##############
train loss at epoch  40 is  0.5862475037574768
test loss at epoch is  0.7611249089241028
##############
train loss at epoch  45 is  0.5702094435691833
test loss at epoch is 

In [189]:
train_epocs(model, epochs=20, lr=0.01, wd=1e-6, unsqueeze=True)

train loss at epoch  0 is  0.6248961687088013
test loss at epoch is  0.7775956392288208
##############
train loss at epoch  5 is  0.5494442582130432
test loss at epoch is  0.7836381196975708
##############
train loss at epoch  10 is  0.5421764850616455
test loss at epoch is  0.7898766994476318
##############
train loss at epoch  15 is  0.5381168723106384
test loss at epoch is  0.7883901000022888
##############
