# Collaborative Filtering Recommender Systems

A recommender system for movies.

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf

Movie ratings dataset from [MovieLens "ml-latest-small"](https://grouplens.org/datasets/movielens/latest/).

This dataset consists of ratings on a scale of 0.5 to 5 in 0.5 step increments. As loaded below, it has $n_u = 610$ users and $n_m = 9724$ rated movies.

In [2]:
ratings_df = pd.read_csv("./ml-latest-small/ratings.csv")

In [3]:
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [4]:
max_user_id = ratings_df["userId"].max()
print("max_user_id", max_user_id)

max_movie_id = ratings_df["movieId"].max()
print("max_movie_id", max_movie_id)

max_user_id 610
max_movie_id 193609


In [5]:
movies_df = pd.read_csv("./ml-latest-small/movies.csv")

# Create a mapping dictionary for movieId to row number
mapping_dict = dict(zip(movies_df["movieId"], movies_df.index + 1))

In [6]:
mapping_dict[500]

437

In [7]:
# Combine both conditions in a single mask to avoid the reindexing warning
ratings_df[(ratings_df["userId"] == 1) & (ratings_df["movieId"] == 500)]

Unnamed: 0,userId,movieId,rating,timestamp
27,1,500,3.0,964981208


In [8]:
# Update the movieId column in the ratings DataFrame
ratings_df["movieId"] = ratings_df["movieId"].map(mapping_dict)

In [9]:
# Combine both conditions in a single mask to avoid the reindexing warning
ratings_df[(ratings_df["userId"] == 1) & (ratings_df["movieId"] == 437)]

Unnamed: 0,userId,movieId,rating,timestamp
27,1,437,3.0,964981208


In [10]:
movies_df[movies_df["movieId"] == 500]

Unnamed: 0,movieId,title,genres
436,500,Mrs. Doubtfire (1993),Comedy|Drama


In [11]:
movies_df["movieId"] = movies_df["movieId"].map(mapping_dict)

In [12]:
movies_df[movies_df["movieId"] == 437]

Unnamed: 0,movieId,title,genres
436,437,Mrs. Doubtfire (1993),Comedy|Drama
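
The remapping in the cells above can be reproduced on a toy example. This is a minimal sketch with made-up data: `toy_movies`, `toy_ratings` and `id_to_row` are hypothetical names, and the three movie ids are chosen only to show how sparse ids become contiguous 1-based ids.

```python
import pandas as pd

# Toy stand-ins for movies.csv and ratings.csv (hypothetical values).
toy_movies = pd.DataFrame({"movieId": [1, 3, 500], "title": ["A", "B", "C"]})
toy_ratings = pd.DataFrame(
    {"userId": [1, 1, 2], "movieId": [500, 1, 3], "rating": [3.0, 4.0, 5.0]}
)

# Map each original movieId to its 1-based row position in toy_movies.
id_to_row = dict(zip(toy_movies["movieId"], toy_movies.index + 1))

# Apply the same mapping to both frames so they stay consistent.
toy_ratings["movieId"] = toy_ratings["movieId"].map(id_to_row)
toy_movies["movieId"] = toy_movies["movieId"].map(id_to_row)

print(toy_ratings["movieId"].tolist())
print(toy_movies["movieId"].tolist())
```

Applying the mapping to the ratings and the movie table in the same step keeps lookups by the new id valid in both frames.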


In [13]:
# create matrix R, the user-movie ratings
# R = ratings_df.pivot(index="movieId", columns="userId", values="rating")
R = ratings_df.pivot_table(
    index="movieId",
    columns="userId",
    values="rating",
    aggfunc=lambda x: 1 if x.notnull().any() else 0,
    fill_value=0,
)

In [14]:
R.head()

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
movieId
1,1,0,0,0,1,0,1,0,0,0,...,1,0,1,1,1,1,1,1,1,1
2,0,0,0,0,0,1,0,1,0,0,...,0,1,0,1,1,0,0,1,0,0
3,1,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,1,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0


In [15]:
R.iloc[0]

userId
1      1
2      0
3      0
4      0
5      1
      ..
606    1
607    1
608    1
609    1
610    1
Name: 1, Length: 610, dtype: int64

In [16]:
Y = ratings_df.pivot_table(
    index="movieId", columns="userId", values="rating", fill_value=0
)

In [17]:
Y.shape

(9724, 610)
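
`pivot_table` only creates a row for a movie that has at least one rating, so the 9724 rows above count rated movies. If `movies.csv` also lists titles that nobody rated (an assumption worth checking with `len(movies_df)`), row positions in `Y` will not line up with row positions in the movie table. A minimal sketch of one way to keep them aligned, on made-up data with hypothetical names:

```python
import pandas as pd

# Hypothetical toy data: movie 2 is in the catalogue but has no ratings.
toy_movies = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
toy_ratings = pd.DataFrame(
    {"userId": [1, 2, 2], "movieId": [1, 1, 3], "rating": [4.0, 3.0, 5.0]}
)

Y_toy = toy_ratings.pivot_table(
    index="movieId", columns="userId", values="rating", fill_value=0
)
print(Y_toy.shape)  # only rated movies get a row

# Reindex on the full catalogue so row k matches row k of toy_movies.
Y_full = Y_toy.reindex(toy_movies["movieId"], fill_value=0)
print(Y_full.shape)
```

With the reindexed matrix, a positional index into the ratings matrix and into the title list refers to the same movie.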

In [18]:
Y.head()

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
movieId
1,4.0,0.0,0.0,0.0,4.0,0.0,4.5,0.0,0.0,0.0,...,4.0,0.0,4.0,3.0,4.0,2.5,4.0,2.5,3.0,5.0
2,0.0,0.0,0.0,0.0,0.0,4.0,0.0,4.0,0.0,0.0,...,0.0,4.0,0.0,5.0,3.5,0.0,0.0,2.0,0.0,0.0
3,4.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0


In [19]:
Y.iloc[0]

userId
1      4.0
2      0.0
3      0.0
4      0.0
5      4.0
      ... 
606    2.5
607    4.0
608    2.5
609    3.0
610    5.0
Name: 1, Length: 610, dtype: float64

In [20]:
print("Y", Y.shape, "R", R.shape)

Y (9724, 610) R (9724, 610)
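
Both matrices can also be derived from a single pivot, which guarantees that they share the same shape and ordering. The following is a sketch on made-up data, not the notebook's own cell; `pivot`, `R_toy` and `Y_toy` are hypothetical names.

```python
import pandas as pd

toy_ratings = pd.DataFrame(
    {"userId": [1, 2, 2], "movieId": [1, 1, 2], "rating": [4.0, 0.5, 5.0]}
)

# Pivot without fill_value: missing (movie, user) pairs stay NaN.
pivot = toy_ratings.pivot_table(index="movieId", columns="userId", values="rating")

R_toy = pivot.notna().astype(int).to_numpy()  # 1 where a rating exists
Y_toy = pivot.fillna(0).to_numpy()  # 0 where it does not

print(R_toy)
print(Y_toy)
```

Deriving the indicator from `notna()` rather than from `Y != 0` keeps the two notions separate even though the lowest rating on this scale is 0.5.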


In [21]:
# df to np array
Y = Y.values
R = R.values

# calculate mean for the 1st movie
tsmean = np.mean(Y[0, R[0, :].astype(bool)])
print(f"Average rating for movie 1 : {tsmean:0.3f} / 5")

Average rating for movie 1 : 3.921 / 5



## Cost function

The collaborative filtering cost function is given by
$$J({\mathbf{x}^{(0)},...,\mathbf{x}^{(n_m-1)},\mathbf{w}^{(0)},b^{(0)},...,\mathbf{w}^{(n_u-1)},b^{(n_u-1)}})= \left[ \frac{1}{2}\sum_{(i,j):r(i,j)=1}(\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} - y^{(i,j)})^2 \right]
+ \underbrace{\left[
\frac{\lambda}{2}
\sum_{j=0}^{n_u-1}\sum_{k=0}^{n-1}(\mathbf{w}^{(j)}_k)^2
+ \frac{\lambda}{2}\sum_{i=0}^{n_m-1}\sum_{k=0}^{n-1}(\mathbf{x}_k^{(i)})^2
\right]}_{regularization}
\tag{1}$$
The first summation in (1) is "for all $i$, $j$ where $r(i,j)$ equals $1$" and could be written:

$$
= \left[ \frac{1}{2}\sum_{j=0}^{n_u-1} \sum_{i=0}^{n_m-1}r(i,j)*(\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} - y^{(i,j)})^2 \right]
+\text{regularization}
$$
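
Equation (1) can be checked term by term with an unvectorized reference. The sketch below uses plain NumPy and a tiny hand-made example; `cofi_cost_loop` is a hypothetical helper intended only for verifying a faster implementation.

```python
import numpy as np

def cofi_cost_loop(X, W, b, Y, R, lambda_):
    """Unvectorized reference for equation (1); slow, for checking only."""
    num_movies, num_users = Y.shape
    J = 0.0
    for j in range(num_users):
        for i in range(num_movies):
            if R[i, j] == 1:  # only rated (movie, user) pairs contribute
                err = np.dot(W[j], X[i]) + b[0, j] - Y[i, j]
                J += 0.5 * err**2
    # Regularization over all user parameters and all movie features.
    J += (lambda_ / 2) * (np.sum(W**2) + np.sum(X**2))
    return J

# Tiny hand-checkable case: 2 movies, 2 users, 1 feature.
X = np.array([[1.0], [2.0]])
W = np.array([[1.0], [0.0]])
b = np.array([[0.0, 1.0]])
Y = np.array([[3.0, 0.0], [0.0, 2.0]])
R = np.array([[1, 0], [0, 1]])

# Squared-error part: 0.5 * (-2)^2 + 0.5 * (-1)^2 = 2.5
# Regularization:     0.5 * (1 + 5) = 3.0
print(cofi_cost_loop(X, W, b, Y, R, lambda_=1.0))  # 5.5
```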


In [22]:
def cofi_cost_func_v(X, W, b, Y, R, lambda_):
    """
    Returns the cost for collaborative filtering.
    Vectorized for speed. Uses TensorFlow operations to be compatible with a custom training loop.
    Args:
      X (ndarray (num_movies, num_features)): matrix of item features
      W (ndarray (num_users, num_features)) : matrix of user parameters
      b (ndarray (1, num_users))            : vector of user parameters
      Y (ndarray (num_movies, num_users))   : matrix of user ratings of movies
      R (ndarray (num_movies, num_users))   : matrix, where R(i, j) = 1 if the i-th movie was rated by the j-th user
      lambda_ (float): regularization parameter
    Returns:
      J (float) : Cost
    """
    j = (tf.linalg.matmul(X, tf.transpose(W)) + b - Y) * R
    J = 0.5 * tf.reduce_sum(j**2) + (lambda_ / 2) * (
        tf.reduce_sum(X**2) + tf.reduce_sum(W**2)
    )
    return J

In [23]:
def normalizeRatings(Y, R):
    """
    Preprocess data by subtracting mean rating for every movie (every row).
    Only include real ratings R(i,j)=1.
    [Ynorm, Ymean] = normalizeRatings(Y, R) normalizes Y so that each movie
    has a rating of 0 on average. Unrated movies then have a mean rating of 0.
    Returns the normalized ratings Ynorm and the per-movie mean rating Ymean.
    """
    Ymean = (np.sum(Y * R, axis=1) / (np.sum(R, axis=1) + 1e-12)).reshape(-1, 1)
    Ynorm = Y - np.multiply(Ymean, R)
    return (Ynorm, Ymean)
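
A small worked example makes the effect of mean normalization concrete. This sketch re-implements the same arithmetic under a hypothetical name, `normalize_demo`, so that it runs on its own; the toy matrices are made up.

```python
import numpy as np

def normalize_demo(Y, R):
    """Same arithmetic as the function above, repeated so the sketch is standalone."""
    Ymean = (np.sum(Y * R, axis=1) / (np.sum(R, axis=1) + 1e-12)).reshape(-1, 1)
    return Y - Ymean * R, Ymean

# Movie 0 was rated 5 and 3 by two users; movie 1 has no ratings at all.
Y_toy = np.array([[5.0, 3.0, 0.0], [0.0, 0.0, 0.0]])
R_toy = np.array([[1, 1, 0], [0, 0, 0]])

Ynorm_toy, Ymean_toy = normalize_demo(Y_toy, R_toy)
print(Ymean_toy.ravel())  # per-movie mean over rated entries only
print(Ynorm_toy)  # rated entries are centered, unrated entries stay 0
```

The small constant in the denominator only guards against division by zero for movies without ratings; their mean comes out as 0.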

In [24]:
# Normalize the Dataset
Ynorm, Ymean = normalizeRatings(Y, R)

In [30]:
# for consistent results
SEED = 42

#  Useful Values
num_movies, num_users = Y.shape
num_features = 100

# Set Initial Parameters (W, X), use tf.Variable to track these variables
tf.random.set_seed(SEED)
W = tf.Variable(tf.random.normal((num_users, num_features), dtype=tf.float64), name="W")
X = tf.Variable(
    tf.random.normal((num_movies, num_features), dtype=tf.float64), name="X"
)
b = tf.Variable(tf.random.normal((1, num_users), dtype=tf.float64), name="b")

optimizer = tf.keras.optimizers.legacy.Adam(learning_rate=1e-1)

In [31]:
print("X", X.shape)
print("W", W.shape)
print("b", b.shape)
print("num_features", num_features)

X (9724, 100)
W (610, 100)
b (1, 610)
num_features 100


In [32]:
iterations = 500
lambda_ = 1

for iter in range(iterations):
    # record the operations used to compute the cost
    with tf.GradientTape() as tape:
        cost_value = cofi_cost_func_v(X, W, b, Ynorm, R, lambda_)

    grads = tape.gradient(cost_value, [X, W, b])
    optimizer.apply_gradients(zip(grads, [X, W, b]))

    if iter % 50 == 0:
        print(f"Training loss at iteration {iter}: {cost_value:0.1f}")

Training loss at iteration 0: 5569194.9
Training loss at iteration 50: 74231.8
Training loss at iteration 100: 19463.6
Training loss at iteration 150: 9042.4
Training loss at iteration 200: 6076.7
Training loss at iteration 250: 5039.1
Training loss at iteration 300: 4605.9
Training loss at iteration 350: 4393.0
Training loss at iteration 400: 4272.4
Training loss at iteration 450: 4196.0
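
The gradients that `tape.gradient` returns for this cost can also be written out by hand. The following is a plain NumPy sketch, not the notebook's TensorFlow code: `cost_np` and `grads_np` are hypothetical helpers, the problem sizes are made up, and a finite-difference comparison serves as a sanity check on the derivation.

```python
import numpy as np

def cost_np(X, W, b, Y, R, lambda_):
    """NumPy version of cost (1), used here only to check the gradients."""
    E = (X @ W.T + b - Y) * R
    return 0.5 * np.sum(E**2) + (lambda_ / 2) * (np.sum(X**2) + np.sum(W**2))

def grads_np(X, W, b, Y, R, lambda_):
    """Hand-derived gradients of cost (1) with respect to X, W and b."""
    E = (X @ W.T + b - Y) * R  # masked residuals, shape (num_movies, num_users)
    dX = E @ W + lambda_ * X
    dW = E.T @ X + lambda_ * W
    db = E.sum(axis=0, keepdims=True)  # b is not regularized
    return dX, dW, db

# Small random problem (hypothetical sizes): 4 movies, 3 users, 2 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))
W = rng.normal(size=(3, 2))
b = rng.normal(size=(1, 3))
Y = rng.uniform(0.5, 5.0, size=(4, 3))
R = (rng.random((4, 3)) > 0.4).astype(float)

# Central finite difference on one entry of X as a sanity check.
eps = 1e-6
X_hi, X_lo = X.copy(), X.copy()
X_hi[0, 0] += eps
X_lo[0, 0] -= eps
numeric = (cost_np(X_hi, W, b, Y, R, 1.0) - cost_np(X_lo, W, b, Y, R, 1.0)) / (2 * eps)
analytic = grads_np(X, W, b, Y, R, 1.0)[0][0, 0]
print(abs(numeric - analytic))  # should be close to zero
```

Automatic differentiation avoids having to maintain these expressions by hand when the cost function changes.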


In [33]:
movieList_df = pd.read_csv(
    "./ml-latest-small/movies.csv",
    header=0,
    index_col=0,
    delimiter=",",
    quotechar='"',
)
movieList = movies_df["title"].to_list()

In [34]:
movieList[:5]

['Toy Story (1995)',
 'Jumanji (1995)',
 'Grumpier Old Men (1995)',
 'Waiting to Exhale (1995)',
 'Father of the Bride Part II (1995)']

In [35]:
movies_df.to_csv("./temp.csv")

In [36]:
my_ratings = np.zeros(num_movies)

my_ratings[277] = 5  # 277,"Shawshank Redemption, The (1994)",Crime|Drama
my_ratings[3141] = 5  # 3141,Memento (2000),Mystery|Thriller
my_ratings[706] = 1  # 706,2001: A Space Odyssey (1968),Adventure|Drama|Sci-Fi
my_ratings[3233] = 4  # 3233,"Fast and the Furious, The (2001)",Action|Crime|Thriller
my_ratings[257] = 5  # 257,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller
# 7302,How to Train Your Dragon (2010),Adventure|Animation|Children|Fantasy|IMAX
my_ratings[7302] = 5
my_ratings[3622] = 3  # 3622,Amelie (Fabuleux destin d'Amélie Poulain, Le)
# 7667,My Afternoons with Margueritte (La tête en friche) (2010),Comedy
my_ratings[7667] = 4.5
my_ratings[5265] = 5  # 5265,"I, Robot (2004)",Action|Adventure|Sci-Fi|Thriller
my_ratings[8305] = 4.5  # 8305,"Wolf of Wall Street, The (2013)",Comedy|Crime|Drama
my_ratings[7016] = 5  # 7016,X-Men Origins: Wolverine (2009),Action|Sci-Fi|Thriller
my_ratings[8688] = 1  # 8688,Justice League (2017),Action|Adventure|Sci-Fi
my_ratings[5917] = 5  # 5917,Batman Begins (2005),Action|Crime|IMAX

my_rated = [i for i in range(len(my_ratings)) if my_ratings[i] > 0]

print("\nNew user ratings:\n")
for i in range(len(my_ratings)):
    if my_ratings[i] > 0:
        print(f'Rated {my_ratings[i]} for  {movies_df.loc[i,"title"]}')


New user ratings:

Rated 5.0 for  Pulp Fiction (1994)
Rated 5.0 for  Shawshank Redemption, The (1994)
Rated 1.0 for  2001: A Space Odyssey (1968)
Rated 5.0 for  Memento (2000)
Rated 4.0 for  Fast and the Furious, The (2001)
Rated 3.0 for  Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001)
Rated 5.0 for  I, Robot (2004)
Rated 5.0 for  Batman Begins (2005)
Rated 5.0 for  X-Men Origins: Wolverine (2009)
Rated 5.0 for  How to Train Your Dragon (2010)
Rated 4.5 for  My Afternoons with Margueritte (La tête en friche) (2010)
Rated 4.5 for  Wolf of Wall Street, The (2013)
Rated 1.0 for  Justice League (2017)


In [42]:
# Add new user ratings to Y
Y = np.c_[my_ratings, Y]

# Add new user indicator matrix to R
R = np.c_[(my_ratings != 0).astype(int), R]

In [47]:
# Normalize the Dataset
Ynorm, Ymean = normalizeRatings(Y, R)

In [48]:
# for consistent results
SEED = 42

#  Useful Values
num_movies, num_users = Y.shape
num_features = 100

# Set Initial Parameters (W, X), use tf.Variable to track these variables
tf.random.set_seed(SEED)
W = tf.Variable(tf.random.normal((num_users, num_features), dtype=tf.float64), name="W")
X = tf.Variable(
    tf.random.normal((num_movies, num_features), dtype=tf.float64), name="X"
)
b = tf.Variable(tf.random.normal((1, num_users), dtype=tf.float64), name="b")

optimizer = tf.keras.optimizers.legacy.Adam(learning_rate=1e-1)

In [49]:
print(num_users)

611


In [50]:
iterations = 500
lambda_ = 1

for iter in range(iterations):
    # record the operations used to compute the cost
    with tf.GradientTape() as tape:
        cost_value = cofi_cost_func_v(X, W, b, Ynorm, R, lambda_)

    grads = tape.gradient(cost_value, [X, W, b])
    optimizer.apply_gradients(zip(grads, [X, W, b]))

    if iter % 50 == 0:
        print(f"Training loss at iteration {iter}: {cost_value:0.1f}")

Training loss at iteration 0: 5610060.4
Training loss at iteration 50: 74642.7
Training loss at iteration 100: 19605.5
Training loss at iteration 150: 9118.6
Training loss at iteration 200: 6116.4
Training loss at iteration 250: 5061.4
Training loss at iteration 300: 4618.6
Training loss at iteration 350: 4400.5
Training loss at iteration 400: 4277.6
Training loss at iteration 450: 4200.3


In [62]:
# Make a prediction using trained weights and biases
p = np.matmul(X.numpy(), np.transpose(W.numpy())) + b.numpy()

# restore the mean
pm = p + Ymean

my_predictions = pm[:, 0]

# sort predictions
ix = tf.argsort(my_predictions, direction="DESCENDING")

for i in range(100):
    j = ix[i]
    if j not in my_rated:
        print(
            f"Predicting rating {min(5.0, my_predictions[j]):0.2f} for movie {movieList[j]}"
        )

print("\n\nOriginal vs Predicted ratings:\n")
for i in range(len(my_ratings)):
    if my_ratings[i] > 0:
        print(
            f"Original {my_ratings[i]}, Predicted {min(5.0, my_predictions[i]):0.2f} for {movieList[i]}"
        )

Predicting rating 5.00 for movie Crow: Salvation, The (2000)
Predicting rating 5.00 for movie Return of Martin Guerre, The (Retour de Martin Guerre, Le) (1982)
Predicting rating 5.00 for movie Tom and Jerry: Shiver Me Whiskers (2006)
Predicting rating 5.00 for movie Thin Line Between Love and Hate, A (1996)
Predicting rating 5.00 for movie The Fault in Our Stars (2014)
Predicting rating 5.00 for movie Europa (Zentropa) (1991)
Predicting rating 5.00 for movie Night Porter, The (Portiere di notte, Il) (1974)
Predicting rating 5.00 for movie MatchMaker, The (1997)
Predicting rating 5.00 for movie The Final Girls (2015)
Predicting rating 5.00 for movie Beach Blanket Bingo (1965)
Predicting rating 5.00 for movie Fog, The (2005)
Predicting rating 5.00 for movie Bone Tomahawk (2015)
Predicting rating 5.00 for movie Stand and Deliver (1988)
Predicting rating 5.00 for movie Black Mass (2015)
Predicting rating 5.00 for movie Tightrope (1984)
Predicting rating 5.00 for movie Mozart and the Whale 
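
The ranking step can be sketched with NumPy alone. The data below is made up and all names (`toy_predictions`, `toy_rated`, `toy_titles`) are hypothetical; clipping to the 0.5 to 5 rating scale is an assumption that mirrors the `min(5.0, ...)` used in the cell above.

```python
import numpy as np

# Hypothetical predictions for 5 movies and the indices the user already rated.
toy_predictions = np.array([4.2, 5.6, 3.1, 4.9, 5.0])
toy_rated = {1}
toy_titles = ["A", "B", "C", "D", "E"]

# Indices ordered from highest to lowest predicted rating.
order = np.argsort(-toy_predictions)

# Keep the three best movies the user has not rated yet.
top_unrated = [int(j) for j in order if int(j) not in toy_rated][:3]

for j in top_unrated:
    clipped = float(np.clip(toy_predictions[j], 0.5, 5.0))  # stay on the rating scale
    print(f"Predicting rating {clipped:0.2f} for movie {toy_titles[j]}")
```

Storing the rated indices in a set keeps each membership test constant time, which matters more as the catalogue grows.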

In practice, additional information can be used to improve the recommendations. Above, the top predicted ratings lie in a small range, and many are clipped to 5.00. We can refine the list by selecting, from the top predictions, movies that have a high average rating and more than 80 ratings. This section uses a [Pandas](https://pandas.pydata.org/) data frame, which has many handy sorting features.
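
As a minimal sketch of that kind of filter, the per-movie rating count and mean can be computed with one `groupby`. The data and both thresholds (`MIN_COUNT`, `MIN_MEAN`) are made up for illustration and are not the notebook's values.

```python
import pandas as pd

toy_ratings = pd.DataFrame(
    {
        "movieId": [1, 1, 1, 2, 2, 3],
        "rating": [5.0, 4.0, 4.5, 2.0, 3.0, 5.0],
    }
)

# Per-movie number of ratings and mean rating.
stats = toy_ratings.groupby("movieId")["rating"].agg(["count", "mean"])

# Hypothetical thresholds, chosen only so the toy data shows both filters at work.
MIN_COUNT = 2
MIN_MEAN = 4.0
keep = stats[(stats["count"] >= MIN_COUNT) & (stats["mean"] >= MIN_MEAN)]

print(keep.index.tolist())
```

Movie 3 has a perfect average but a single rating, and movie 2 has enough ratings but a low average, so only movie 1 survives both filters.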

In [61]:
ratings_count = ratings_df["movieId"].value_counts()

popular_movies = movies_df[
    movies_df["movieId"].isin(ratings_count[ratings_count > 80].index)
]

popular_movies.shape

(206, 3)
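
When pairing predictions with a filtered frame such as `popular_movies`, it helps to walk the frame's row labels rather than `range(len(...))`, which would only visit the first rows of the full table. The sketch below uses made-up data and assumes that the row labels of the movie table line up with positions in the prediction vector; all names are hypothetical.

```python
import numpy as np
import pandas as pd

toy_movies = pd.DataFrame({"title": ["A", "B", "C", "D"]})
toy_predictions = np.array([4.8, 3.0, 5.1, 4.0])

# Pretend only rows 1 and 2 passed the popularity filter.
toy_popular = toy_movies.iloc[[1, 2]]

# Use the surviving row labels to look up both the prediction and the title.
res = [
    (float(toy_predictions[i]), toy_movies.loc[i, "title"])
    for i in toy_popular.index
]
res.sort(key=lambda pair: pair[0], reverse=True)

for score, title in res:
    print(f"Predicting rating {score:0.2f} for movie {title}")
```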

In [56]:
predicts = pm[:, 0]

res = []
for i in range(len(popular_movies)):
    res.append((predicts[i], movieList[i]))

res.sort(key=lambda x: x[0], reverse=True)

for item in res:
    print(f"Predicting rating {item[0]:0.2f} for movie {item[1]}")

Predicting rating 5.44 for movie Awfully Big Adventure, An (1995)
Predicting rating 5.43 for movie Heidi Fleiss: Hollywood Madam (1995)
Predicting rating 5.42 for movie Lamerica (1994)
Predicting rating 5.17 for movie Braveheart (1995)
Predicting rating 4.97 for movie Usual Suspects, The (1995)
Predicting rating 4.87 for movie Chungking Express (Chung Hing sam lam) (1994)
Predicting rating 4.81 for movie Crimson Tide (1995)
Predicting rating 4.78 for movie Die Hard: With a Vengeance (1995)
Predicting rating 4.75 for movie Living in Oblivion (1995)
Predicting rating 4.75 for movie Toy Story (1995)
Predicting rating 4.74 for movie Heat (1995)
Predicting rating 4.70 for movie Twelve Monkeys (a.k.a. 12 Monkeys) (1995)
Predicting rating 4.69 for movie Cry, the Beloved Country (1995)
Predicting rating 4.65 for movie Persuasion (1995)
Predicting rating 4.60 for movie Crumb (1994)
Predicting rating 4.57 for movie Clueless (1995)
Predicting rating 4.52 for movie Flirting With Disaster (1996)
Pr