<a href="https://colab.research.google.com/github/Henry-Le-CS/Basic-Machine-Learning/blob/master/Recommender_System_(Collaborative_Filtering)_with_Movies_ratings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project Description

For this project, we will implement a recommender for the users based on the data given in **The MovieLens Datasets: History and Context:** https://dl.acm.org/doi/10.1145/2827872

We will use collaborative filtering to apply for this recommender system


# Packages

We will apply the following packages for this project

- Tensorflow
- Matplotlib
- Numpy
- Pandas

In [188]:
import tensorflow as tf
from tensorflow import keras
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Dataset

We will use the dataset of  MovieLens "ml-latest-small". The source can be found above.

It contains of $n_u = $ 443 users' rating ranged from 0.5 to 5, with a step of 0.5. There will be $n_m = $ 4778 movies and $n_f = 10 $ features for us to examine.

The data will be put in a matrix Y $\in \mathbf{R^{n_m \text{ x } n_u}}$

For research purpose, we will include the dataset for X, W, b, and R

Where:

- X are the feature matrix $\in \mathbf{R^{n_m\text{ x } n_f}}$
- W, b are the parameter matrix $\in \mathbf{R^{n_u\text{ x } n_f}} \text{ and } \mathbf{R^{n_u}} \text{ respectively}$
- R are the matrix that contrain boolean type to see if the user j had voted for the movie i $\in \mathbf{R^{n_m \text{ x } n_u}}$


$
\mathbf{X} = 
\begin{bmatrix}
--- (\mathbf{x}^{(0)})^T --- \\
--- (\mathbf{x}^{(1)})^T --- \\
\vdots \\
--- (\mathbf{x}^{(n_m-1)})^T --- \\
\end{bmatrix} , \quad
\mathbf{W} = 
\begin{bmatrix}
--- (\mathbf{w}^{(0)})^T --- \\
--- (\mathbf{w}^{(1)})^T --- \\
\vdots \\
--- (\mathbf{w}^{(n_u-1)})^T --- \\
\end{bmatrix},\quad
\mathbf{ b} = 
\begin{bmatrix}
 b^{(0)}  \\
 b^{(1)} \\
\vdots \\
b^{(n_u-1)} \\
\end{bmatrix}\quad
$

- Each row of matrix X is the feature vector for movie i
- Each row of W and b is the parameter vector to recommend for user j



In [229]:
Y = pd.read_csv('small_movies_Y.csv',header = None).values

In [230]:
print('The first five element of utility matrix Y: \n',Y[:5])
print('The shape of Y: {}'.format(Y.shape))

The first five element of utility matrix Y: 
 [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [5. 0. 0. ... 4. 3. 3.]]
The shape of Y: (4778, 443)


In [231]:
X = pd.read_csv('small_movies_X.csv',header = None).values
W = pd.read_csv('small_movies_W.csv',header = None).values
b = pd.read_csv('small_movies_b.csv',header = None).values
R = pd.read_csv('small_movies_R.csv',header = None).values

In [232]:
print('The shape of X: {}'.format(X.shape))
print('The shape of W: {}'.format(W.shape))
print('The shape of b: {}'.format(b.shape))
print('The shape of R: {}'.format(R.shape))

The shape of X: (4778, 10)
The shape of W: (443, 10)
The shape of b: (1, 443)
The shape of R: (4778, 443)


In [233]:
num_movies, num_users = R.shape
num_features = W.shape[1]
print('The number of movies in the list: ', num_movies)
print('The number of users that rated: ', num_users)
print('The number of features for each movie: ', num_features)

The number of movies in the list:  4778
The number of users that rated:  443
The number of features for each movie:  10


# Cost Function

We define the cost function with respect to X, W, b, as below:

$
J(X,W,b) = \frac{1}{2}\sum_{(i,j):r(i,j)=1}(\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} - y^{(i,j)})^2 + \frac{\lambda}{2} \sum_{j=0}^{n_u-1}\sum_{k=0}^{n-1}(\mathbf{w}^{(j)}_k)^2
$

Or the vectorized implementation is:

$
J(X,W,b) = \frac{1}{2} \sum{ (XW^T+b-Y)^2} \odot R + \frac{\lambda}{2}\sum (X^2+W^2)
$

The $\odot$ represents the elementwise matrix multiplication

We will implement both methods to calculate cost function

In [234]:
def cost_iterative(X,W,R, b, Y,lambda_):
  num_movies,num_users = Y.shape
  J = 0
  for i in range(num_movies):
    for j in range(num_users):
      if(R[i,j]==1):
        J = J + ((W[j].dot(X[i])+b[0,j]-Y[i,j])*R[i,j])**2
  J = J/2 + lambda_/2*np.sum(W**2)+lambda_/2*np.sum(X**2)
  return J

In [235]:
num_users_r = 4
num_movies_r = 5 
num_features_r = 3

X_r = X[:num_movies_r, :num_features_r]
W_r = W[:num_users_r,  :num_features_r]
b_r = b[0, :num_users_r].reshape(1,-1)
Y_r = Y[:num_movies_r, :num_users_r]
R_r = R[:num_movies_r, :num_users_r]

# Evaluate cost function
J = cost_iterative(X_r, W_r,R_r, b_r, Y_r, 0);
print(f"Cost with no regularization: {J:0.2f}")
J = cost_iterative(X_r, W_r,R_r, b_r, Y_r, 1.5);
print(f"Cost with regularization: {J:0.2f}")


Cost with no regularization: 13.67
Cost with regularization: 28.09


In [236]:
def cost_matrix(X,W, R, b, Y, lambda_):
  j = tf.reduce_sum(((tf.linalg.matmul(X,tf.transpose(W))+b-Y)*R)**2)/2
  j += lambda_/2 * (tf.reduce_sum(X**2)+tf.reduce_sum(W**2))
  return j

In [237]:
J = cost_matrix(X_r, W_r,R_r, b_r, Y_r, 0);
print(f"Cost with no regularization: {J:0.2f}")
J = cost_matrix(X_r, W_r,R_r, b_r, Y_r, 1.5);
print(f"Cost with regularization: {J:0.2f}")

Cost with no regularization: 13.67
Cost with regularization: 28.09


# Learning for recommendations

Firstly, we need to load the data for the movie list, then we can define our own ratings to test the model.

We will load a dataframe and an array to make it easier to use

In [238]:
movies_df = pd.read_csv('Movie_rating.csv')
movies_df.head()

Unnamed: 0.1,Unnamed: 0,mean rating,number of ratings,title
0,0,3.4,5,"Yards, The (2000)"
1,1,3.25,6,Next Friday (2000)
2,2,2.0,4,Supernova (2000)
3,3,2.0,4,Down to You (2000)
4,4,2.672414,29,Scream 3 (2000)


In [239]:
movies = pd.read_csv('Movie_rating.csv')[['mean rating','number of ratings','title']].values
movies[:5]

array([[3.4, 5, 'Yards, The (2000)'],
       [3.25, 6, 'Next Friday (2000)'],
       [2.0, 4, 'Supernova (2000)'],
       [2.0, 4, 'Down to You (2000)'],
       [2.6724137931034484, 29, 'Scream 3 (2000)']], dtype=object)

Next, I will define my own set of ratings

In [240]:
my_ratings = np.zeros((num_movies))
my_ratings[2700] = 4.5   # Toy story 3
my_ratings[2609] = 1;  # A random movie
my_ratings[929]  = 5   # Lord of the Rings: The Return of the King
my_ratings[246]  = 5   # Shrek (2001)
my_ratings[2716] = 4   # Inception
my_ratings[1150] = 5   # Incredibles, The (2004)
my_ratings[382]  = 0.5   # Amelie (Fabuleux destin d'Amélie Poulain, Le)
my_ratings[366]  = 5   # Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)
my_ratings[622]  = 5   # Harry Potter and the Chamber of Secrets (2002)
my_ratings[988]  = 1   # Eternal Sunshine of the Spotless Mind (2004)
my_ratings[2925] = 2  # Louis Theroux: Law & Disorder (2008)
my_ratings[2937] = 1   # Nothing to Declare (Rien à déclarer)
my_ratings[793]  = 4.5   # Pirates of the Caribbean: The Curse of the Black Pearl (2003)
my_ratings[486] = 5 # Spider-man(2000)
my_ratings[1884] = 5 # Spider-man(2000)
my_ratings[3846] = 5 # Amazing Spider-man 3
my_ratings[3657] = 3.5 #Haunted House 2
my_ratings[1929] = 4 #Transformer 2
my_ratings[1930] = 4 #Harry Potter and menh lenh phuong hoang
my_ratings[3475] = 3  #The conjuring (2013)
my_ratings[3904] = 4 #Avengers 2
my_ratings[3911] = 5 #Avengers 3
my_ratings[3912] = 5 #Thor 3
my_ratings[3914] = 5 # Guardian of the galaxy 3
my_ratings[3915] = 5 # Captain America 3
my_ratings[3916] = 5 # Doctor strange
my_ratings[3917] = 4 # Xmen apocalypse
my_ratings[3918] = 5 # Spiderman
my_ratings[3919] = 5 #Avengers 3
my_ratings[2145] = 4 #Iron man

In [241]:
np.sum(my_ratings>0)

30

In [242]:
print('My new ratings are:')
for i in range(len(my_ratings)):
  if(my_ratings[i]>0):
    print('{}: {}/5'.format(movies_df.loc[i,'title'],my_ratings[i]))

My new ratings are:
Shrek (2001): 5.0/5
Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001): 5.0/5
Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001): 0.5/5
Spider-Man (2002): 5.0/5
Harry Potter and the Chamber of Secrets (2002): 5.0/5
Pirates of the Caribbean: The Curse of the Black Pearl (2003): 4.5/5
Lord of the Rings: The Return of the King, The (2003): 5.0/5
Eternal Sunshine of the Spotless Mind (2004): 1.0/5
Incredibles, The (2004): 5.0/5
Spider-Man 3 (2007): 5.0/5
Transformers (2007): 4.0/5
Harry Potter and the Order of the Phoenix (2007): 4.0/5
Iron Man (2008): 4.0/5
Persuasion (2007): 1.0/5
Toy Story 3 (2010): 4.5/5
Inception (2010): 4.0/5
Louis Theroux: Law & Disorder (2008): 2.0/5
Nothing to Declare (Rien à déclarer) (2010): 1.0/5
Conjuring, The (2013): 3.0/5
Haunted House 2, A (2014): 3.5/5
Black Sea (2015): 5.0/5
Avengers: Age of Ultron (2015): 4.0/5
Avengers: Infinity War - Part I (2018): 5.0/5
Thor: Ragnarok (2017): 5.0/5
Capt

We need to concatenate the new ratings to the original Y, R and update the necessary variables to normalize them

In [252]:
Y = pd.read_csv('small_movies_Y.csv',header = None).values
R = pd.read_csv('small_movies_R.csv',header = None).values
Y = np.c_[my_ratings,Y]
R = np.c_[(my_ratings!=0).astype(int), R]

In [253]:
print('The new shape of Y: {}'.format(Y.shape))
print('The new shape of R: {}'.format(R.shape))

The new shape of Y: (4778, 444)
The new shape of R: (4778, 444)


In [254]:
def normalize(Y, R):
  #we need to calculate the average of each row and make it a vector of size mx1
  Ymean = (np.sum(Y*R,axis = 1)/np.sum(R,axis = 1)+1e-12)
  Ymean = Ymean.reshape(-1,1)
  Ynorm = Y - np.multiply(Ymean, R)
  return Ymean, Ynorm

In [255]:
Ymean, Ynorm = normalize(Y,R)

Let's prepare to train and set the optimizer

In [256]:
num_movies, num_users = Y.shape
num_features = 30

# Set Initial Parameters (W, X), use tf.Variable to track these variables
tf.random.set_seed(1234) # for consistent results
W = tf.Variable(tf.random.normal((num_users,  num_features),dtype=tf.float64),  name='W')
X = tf.Variable(tf.random.normal((num_movies, num_features),dtype=tf.float64),  name='X')
b = tf.Variable(tf.random.normal((1, num_users),   dtype=tf.float64),  name='b')

# Instantiate an optimizer.
optimizer = keras.optimizers.Adam(learning_rate=1e-1)

Let's train them

In [257]:
iterations = 200
lambda_ = 1.5
for i in range(iterations+1):
  with tf.GradientTape() as tape:
    costJ = cost_matrix(X,W,R,b,Y,lambda_)
  grads = tape.gradient(costJ, [X,W,b])
  optimizer.apply_gradients(zip(grads,[X,W,b]))
  if i % 20 == 0:
    print(f'Iter {i} - Loss: {costJ} \n')

Iter 0 - Loss: 991716.6263689055 

Iter 20 - Loss: 58507.66613270386 

Iter 40 - Loss: 21895.23431142898 

Iter 60 - Loss: 11935.08128950034 

Iter 80 - Loss: 8506.281844820549 

Iter 100 - Loss: 7017.842224216112 

Iter 120 - Loss: 6219.504352702818 

Iter 140 - Loss: 5712.406647368076 

Iter 160 - Loss: 5350.909261107175 

Iter 180 - Loss: 5074.1692804251215 

Iter 200 - Loss: 4851.863162494498 



# Recommendation

I will recommend movies for myself as no one is free to do so :<
To predict the ratings of a movie i for a user j, we have:

$\mathbf{y}_{i,j} = \mathbf{x}_i×\mathbf{w}_j+\mathbf{b}_j$

Or generally, we have a vectorized version of the predicted utility matrix as: 

$\mathbf{Y} = \mathbf{X}\mathbf{W^T}+\mathbf{b}$

Then we denormolize the matrix by simply adding $Y_{mean}$ back to the matrix.

In [258]:
p = np.matmul(X.numpy(),W.numpy().T)+b.numpy()
p_restore = p+Ymean
my_recommended_list = tf.argsort(p[:,0],direction = 'DESCENDING')
#I'll list 15 movies for myself to enjoy tonight
my_rated = [i for i in range(len(my_ratings)) if my_ratings[i]>0]
for i in range(25):
  if my_recommended_list[i] not in my_rated:
    print('Movie names: {}'.format(movies_df.loc[i,'title']))


Movie names: Yards, The (2000)
Movie names: Next Friday (2000)
Movie names: Supernova (2000)
Movie names: Down to You (2000)
Movie names: Scream 3 (2000)
Movie names: Boondock Saints, The (2000)
Movie names: Gun Shy (2000)
Movie names: Beach, The (2000)
Movie names: Snow Day (2000)
Movie names: Tigger Movie, The (2000)
Movie names: Hanging Up (2000)
Movie names: Whole Nine Yards, The (2000)
Movie names: Black Tar Heroin: The Dark End of the Street (2000)
Movie names: Wonder Boys (2000)
Movie names: Chain of Fools (2000)
Movie names: Next Best Thing, The (2000)
Movie names: What Planet Are You From? (2000)
Movie names: Mission to Mars (2000)


# Conclusion

Collaborative Filtering tends to find what similar users would like and the recommendations to be provided and in order to classify the users into clusters of similar types and recommend each user according to the preference of its cluster.

Therefore, I intentionally choose action, sci-fi movies to to rate. The result turns out to be good enough. Perhaps, I might need to rate more  movies like other people to see it works perfectly or the data need to concentrate on recent years.