# Movie Recommender

We will implement the collaborative filtering learning algorithm and apply it to a dataset of movie ratings. This dataset
consists of ratings on a scale of 1 to 5. The dataset has nu = 943 users, and nm = 1682 movies.<br /><br />
The objective of collaborative filtering is to predict movie ratings for the movies that users have not yet rated, that is, the entries with R(i,j) = 0. This will allow us to recommend the movies with the highest predicted ratings to the user.

In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import scipy.io
import scipy.optimize

In [2]:
datafile = 'data/ex8_movies.mat'
data = scipy.io.loadmat(datafile)

In [3]:
data.keys()

dict_keys(['__globals__', 'Y', '__header__', '__version__', 'R'])

In [4]:
Y = data['Y']
R = data['R']

In [5]:
Y.shape

(1682, 943)

In [6]:
R.shape

(1682, 943)

In [7]:
# nm = 1682; nu = 943
n = 100 # the number of features
# X is nm*n matrix; theta is nu*n matrix

The model predicts the rating for **movie i** by **user j** as: <br />** y<sup>(i,j)</sup> = (\theta<sup>(j)</sup>)<sup>T</sup>x<sup>(i)</sup> **<br \><br \>
In order to use an on-the-shelf minimizer such as fmin_cg, the cost function has been set up to unroll the parameters into a single vector params.

In [8]:
def unroll(X, theta):
    return np.concatenate((X.flatten(), theta.flatten()))

In [9]:
def reshape(Xtheta, nm, nu, n):
    X = Xtheta[:nm*n].reshape((nm,n))
    theta = Xtheta[-nu*n:].reshape((nu,n))
    return X, theta

In [10]:
datafile = 'data/ex8_movieParams.mat'
data = scipy.io.loadmat(datafile)

In [11]:
data.keys()

dict_keys(['Theta', '__globals__', 'num_users', 'num_features', 'num_movies', 'X', '__header__', '__version__'])

In [12]:
X = data['X']
theta = data['Theta']
nu = int(data['num_users'])
nm = int(data['num_movies'])
nf = int(data['num_features'])

In [13]:
# for testing the correctness of the code
nu = 4; nm = 5; n = 3
X = X[:nm,:n]
theta = theta[:nu,:n]
Y = Y[:nm,:nu]
R = R[:nm,:nu]

### 1. Cost Function

In [14]:
def cost(Xtheta, Y, R, nu, nm, n, reg):
    X, theta = reshape(Xtheta, nm, nu, n)
    J = 0.5*np.sum(np.power(np.multiply(X.dot(theta.T), R)-Y,2))
    J = J+0.5*reg*np.sum(np.multiply(theta,theta))+0.5*reg*np.sum(np.multiply(X,X))
    return J

In [15]:
# To test the correctness using small dataset
Xtheta = unroll(X, theta)
J = cost(Xtheta, Y, R, nu, nm, n, 0)

In [16]:
J

22.224603725685675

In [17]:
J = cost(Xtheta, Y, R, nu, nm, n, 1.5)

In [18]:
J

31.344056244274221

### 2. Gradient

In [19]:
def gradient(Xtheta, Y, R, nu, nm, n, reg=0):
    X, theta = reshape(Xtheta, nm, nu, n)
    gradientX = (np.multiply(X.dot(theta.T), R)-Y).dot(theta)
    gradientTheta = ((np.multiply(X.dot(theta.T), R)-Y).T).dot(X)
    gradientX = gradientX+reg*X
    gradientTheta = gradientTheta+reg*theta
    return unroll(gradientX, gradientTheta)

### 3. Rate the Movies
    

In [20]:
movies = []
with open('data/movie_ids.txt') as f:
    for line in f:
        namelst = line.split(' ')
        name = ' '.join(namelst[1:])
        movies.append(name)

# Here we use the sample ratings to test the correctness of the code
# The rating can be modified for fun!
my_ratings = np.zeros((1682,1))
my_ratings[0]   = 4
my_ratings[97]  = 2
my_ratings[6]   = 3
my_ratings[11]  = 5
my_ratings[53]  = 4
my_ratings[63]  = 5
my_ratings[65]  = 3
my_ratings[68]  = 5
my_ratings[182] = 4
my_ratings[225] = 5
my_ratings[354] = 5

In [21]:
datafile = 'data/ex8_movies.mat'
data = scipy.io.loadmat(datafile)

In [22]:
data.keys()

dict_keys(['__globals__', 'Y', '__header__', '__version__', 'R'])

In [23]:
R = data['R']
Y = data['Y']

In [24]:
# add my ratings into the matrix Y and R
my_ratings_R = my_ratings>0
Y = np.c_[Y, my_ratings]
R = np.c_[R, my_ratings_R]

In [25]:
Y.shape

(1682, 944)

In [26]:
R.shape

(1682, 944)

In [27]:
def meanNormalization(Y, R):
    mean_movie = np.sum(Y, axis=1)/np.sum(R, axis=1)
    mean_movie = mean_movie.reshape((Y.shape[0],1))
    return Y-mean_movie, mean_movie

In [28]:
Y_norm, mean_movie = meanNormalization(Y, R)

### 4. Recommendation

In [29]:
nm = Y.shape[0]
nu = Y.shape[1]
n = 10

In [30]:
# Random initial values for X and theta
X = np.random.rand(nm,n)
theta = np.random.rand(nu,n)
Xtheta = unroll(X, theta)

In [31]:
reg = 10
recommendation = scipy.optimize.fmin_cg(cost, x0=Xtheta, fprime=gradient, args=(Y, R, nu, nm, n, reg), maxiter=50,disp=True,full_output=True)

         Current function value: 72848.319175
         Iterations: 50
         Function evaluations: 81
         Gradient evaluations: 81


In [32]:
X, theta = reshape(recommendation[0], nm, nu, n)

In [33]:
p = X.dot(theta.T)

In [34]:
p.shape

(1682, 944)

In [35]:
my_res = p[:,-1]+mean_movie.flatten()

In [36]:
my_res_idx = np.argsort(-my_res)

In [37]:
print ("The Top 10 movies recommended for you:\t")
for i in range(10):
    print ("%d %s score: %f" %(i+1, movies[my_res_idx[i]], my_res[my_res_idx[i]]))

The Top 10 movies recommended for you:	
1 Titanic (1997)
 score: 8.367244
2 Shawshank Redemption, The (1994)
 score: 8.347072
3 Schindler's List (1993)
 score: 8.333890
4 Star Wars (1977)
 score: 8.301666
5 Raiders of the Lost Ark (1981)
 score: 8.136483
6 Good Will Hunting (1997)
 score: 8.070331
7 Usual Suspects, The (1995)
 score: 7.993570
8 Braveheart (1995)
 score: 7.980482
9 Empire Strikes Back, The (1980)
 score: 7.955492
10 Casablanca (1942)
 score: 7.914373


In [38]:
my_res_idx

array([ 312,   63,  317, ..., 1556, 1492,  829], dtype=int64)