# Neural Network approach to collaborative filtering

Collaborative filtering is a technique for building recommendation systems. The underlying assumption is that users with similar tastes will like similar things.

To frame this problem as a structured neural network, we will create an embedding (a list of learned numbers) for each user and for each movie. To predict how a user will rate a movie, we first concatenate the user's and the movie's embeddings into a single vector and pass it through a fully connected neural network. Second, a single-number embedding for the user bias and another for the movie bias are added to the output. Finally, that value is passed through a sigmoid and rescaled to squash it into a range that covers the minimum and maximum possible ratings (with a margin of 0.5 on each side).

Read more about collaborative filtering on [Wikipedia](https://en.wikipedia.org/wiki/Collaborative_filtering).


## Imports

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.learner import *
from fastai.column_data import *

## Data Setup
To create this recommendation system, we will use a tabular data format. Every row consists of a particular user, a particular movie, and that user's rating for that movie. During training, the network learns user and movie embeddings that predict the associated rating.

Load in the raw data

In [2]:
path = "data/movie-lens/"

In [3]:
ratings = pd.read_csv(os.path.join(path, "ratings.csv"))
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [4]:
movies = pd.read_csv(os.path.join(path, "movies.csv"))
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


`ratings` is the dataframe that contains all the users, movies, and associated ratings. We convert the raw user and movie ids into contiguous, zero-based indices so that they can be used directly as row indices into the embedding matrices we create later.

In [5]:
u_uniq = ratings.userId.unique()
user2idx = {u_id:i for i,u_id in enumerate(u_uniq)}
ratings.userId = ratings.userId.apply(lambda x: user2idx[x])

m_uniq = ratings.movieId.unique()
movie2idx = {m_id:i for i,m_id in enumerate(m_uniq)}
ratings.movieId = ratings.movieId.apply(lambda x: movie2idx[x])

n_users  = ratings.userId.nunique()
n_movies = ratings.movieId.nunique()
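
As a toy illustration of the remapping above, the sketch below applies the same dictionary trick to a short, made-up list of raw ids (the values are hypothetical, not taken from the dataset). Ids are numbered in order of first appearance, so the resulting indices are contiguous and start at 0, which is what an embedding lookup expects.

```python
# Hypothetical raw ids, with gaps and repeats, standing in for ratings.movieId
raw_ids = [31, 1029, 31, 1061, 1029]

# Unique ids in order of first appearance (dicts preserve insertion order)
uniq = list(dict.fromkeys(raw_ids))

# Map each raw id to a contiguous, zero-based index
id2idx = {raw_id: i for i, raw_id in enumerate(uniq)}

# Replace every raw id with its index
indices = [id2idx[r] for r in raw_ids]

print(uniq)     # [31, 1029, 1061]
print(indices)  # [0, 1, 0, 2, 1]
```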

Split the data into an input dataframe (x) and a target series (y)

In [6]:
x = ratings.drop(['rating', 'timestamp'], axis=1)
y = ratings['rating'].astype(np.float32)

In [7]:
x.head()

Unnamed: 0,userId,movieId
0,0,0
1,0,1
2,0,2
3,0,3
4,0,4


In [8]:
y.head()

0    2.5
1    3.0
2    3.0
3    2.0
4    4.0
Name: rating, dtype: float32

Create a list of validation indexes and then build the data object. By default this uses an 80% train, 20% validation split. The provided path dictates where models are saved, and the final argument (64) is the batch size.

In [9]:
val_idxs = get_cv_idxs(len(ratings))
data = ColumnarModelData.from_data_frame(path, val_idxs, x, y, ['userId', 'movieId'], 64)
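
For intuition about the split, here is a minimal sketch of picking a random 20% of row positions as validation indices. It uses only the standard library and is not fastai's `get_cv_idxs` implementation; the function name `sample_val_idxs`, the seed, and the row count are assumptions made for illustration.

```python
import random

def sample_val_idxs(n_rows, val_pct=0.2, seed=42):
    """Return a random sample of row positions to hold out for validation."""
    rng = random.Random(seed)        # fixed seed so the split is reproducible
    n_val = int(n_rows * val_pct)    # number of held-out rows
    return rng.sample(range(n_rows), n_val)

val = sample_val_idxs(1000)
train = sorted(set(range(1000)) - set(val))  # everything not held out

print(len(val), len(train))  # 200 800
```

Fixing the seed matters here: if the validation rows changed between runs, validation losses from different experiments would not be comparable.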

## Model
This neural network features an embedding for each user and each movie. Additionally, it contains a bias for each user and each movie that is added near the end of the network. This extra bias term lowers the validation loss by roughly 0.03. Note that an embedding lookup is just a computational shortcut for multiplying a one-hot encoding by a matrix of weights.
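
The last point can be checked by hand. The sketch below, written in plain Python rather than PyTorch to stay dependency-free, multiplies a one-hot row vector by a small made-up weight matrix and confirms that the result equals simply indexing the matching row. The helper names and weight values are illustrative assumptions.

```python
# A made-up 4 x 3 weight matrix: 4 items, 3 factors each
weights = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position `index`."""
    return [1.0 if j == index else 0.0 for j in range(size)]

def vec_mat(vec, mat):
    """Multiply a row vector by a matrix (len(vec) must equal len(mat))."""
    n_cols = len(mat[0])
    return [sum(v * row[c] for v, row in zip(vec, mat)) for c in range(n_cols)]

idx = 2
via_matmul = vec_mat(one_hot(idx, len(weights)), weights)  # full multiply
via_lookup = weights[idx]                                  # plain row lookup

print(via_matmul == via_lookup)  # True
```

The multiply touches every row of the matrix only to discard all but one, which is why a direct lookup is preferred when there are many users or movies.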

In [10]:
def get_embed(ni, nf):
    """
    A simple function that creates an embedding
    
    :param ni: number of unique indices / number of things to make an embedding for
    :param nf: number of factors / length of embedding
    """
    embed = nn.Embedding(ni, nf)
    embed.weight.data.uniform_(-0.02, 0.02)
    return embed 

In [11]:
min_rating, max_rating = ratings.rating.min(), ratings.rating.max()
class EmbeddingNet(nn.Module):
    def __init__(self, n_users, n_items, n_factors, nh=10, p1=0.5, p2=0.5):
        """
        self.u - the embedding of the users
        self.ub - the biases for each individual user
        self.m - the embedding of the movies
        self.mb - the biases for each individual movie
        """
        super().__init__()
        embeds = [(n_users, n_factors), (n_users,1), (n_items, n_factors), (n_items,1)]
        (self.u, self.ub, self.m, self.mb) = [get_embed(*e) for e in embeds]
        self.lin1 = nn.Linear(n_factors*2, nh)
        self.lin2 = nn.Linear(nh, 1)
        self.drop1 = nn.Dropout(p1)
        self.drop2 = nn.Dropout(p2)
        
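    def forward(self, cats, conts):
        # cats holds the categorical columns: user index, then movie index
        users,items = cats[:,0],cats[:,1]
        # Concatenate user and movie embeddings, then apply dropout
        x = self.drop1(torch.cat([self.u(users),self.m(items)], dim=1))
        x = self.drop2(F.relu(self.lin1(x)))
        # Add the per-user and per-movie biases to the network output
        x = self.lin2(x) + self.ub(users) + self.mb(items)
        # Rescale the sigmoid to (min_rating-0.5, max_rating+0.5)
        return torch.sigmoid(x) * (max_rating-min_rating+1) + min_rating-0.5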
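
To make the final line of `forward` concrete, the sketch below evaluates the same rescaling formula in plain Python. The bounds 0.5 and 5.0 are assumed for illustration; the notebook takes them from the data. The output range is `(min_rating - 0.5, max_rating + 0.5)`, slightly wider than the rating range, presumably so that the extreme ratings can be reached with a finite input to the sigmoid.

```python
import math

def scale_to_ratings(x, min_rating=0.5, max_rating=5.0):
    """Map a raw output x into (min_rating - 0.5, max_rating + 0.5)."""
    sig = 1.0 / (1.0 + math.exp(-x))  # sigmoid, strictly between 0 and 1
    return sig * (max_rating - min_rating + 1) + min_rating - 0.5

# Very negative, zero, and very positive raw outputs
for x in (-10.0, 0.0, 10.0):
    print(x, round(scale_to_ratings(x), 3))
```

With these assumed bounds, a raw output of 0 maps to the midpoint of the range, 2.75.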

In [12]:
wd=1e-6
model = EmbeddingNet(n_users, n_movies, n_factors=20, nh=10).cuda()
opt = optim.Adam(model.parameters(), 1e-3, weight_decay=wd)

In [13]:
fit(model, data, 4, opt, F.mse_loss)

epoch      trn_loss   val_loss                                  
    0      0.932313   0.818633  
    1      0.819566   0.793078                                  
    2      0.796216   0.784089                                  
    3      0.766555   0.777681                                  



[0.7776812]

In [14]:
set_lrs(opt, 1e-4)
val_loss = fit(model, data, 3, opt, F.mse_loss)

epoch      trn_loss   val_loss                                  
    0      0.694063   0.77544   
    1      0.719205   0.774536                                  
    2      0.69612    0.773747                                  



The validation loss (mean squared error) settles at about 0.774. It would likely improve further if all of the data were used for training.

In [15]:
np.sqrt(val_loss)

array([0.87963], dtype=float32)
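
The square root converts the mean squared error into a root mean squared error, which is in the same units as the ratings. A minimal sketch of both metrics on a few made-up predictions follows (the `preds` and `targets` values are hypothetical); the last line repeats the conversion for the final validation loss reported above.

```python
import math

def mse(preds, targets):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)

preds   = [3.0, 4.5, 2.0, 5.0]   # hypothetical model outputs
targets = [2.5, 4.0, 3.0, 4.0]   # hypothetical true ratings

err = mse(preds, targets)
print(err)             # 0.625
print(math.sqrt(err))  # RMSE, in rating units

# Final validation loss from the training log above
print(round(math.sqrt(0.773747), 4))  # 0.8796
```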