### Read in the data

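We read our dataset and store it in a sparse matrix in compressed sparse column (CSC) format, using the scipy library. A sparse matrix stores only the nonzero elements. Note that the cell below first fills a dense array and only then converts it to CSC.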

### Here I am just reading in the data and splitting it into a random training and test set (80-20).

In [2]:
import numpy as np
import csv
from scipy.sparse import csc_matrix, csr_matrix

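# constants defining the dimensions of our User Rating Matrix (URM)
# ids are used directly as row/column indices, so each bound must exceed the largest id
MAX_PID = 193610 # number of columns (movie ids)
MAX_UID = 611 # number of rows (user ids)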


urm = np.zeros(shape=(MAX_UID,MAX_PID), dtype=np.float32)
with open('./ratings training.csv', 'rb') as trainFile:
    urmReader = csv.reader(trainFile, delimiter=',')
    for row in urmReader:
        urm[int(row[0]), int(row[1])] = float(row[2])

urm = csc_matrix(urm, dtype=np.float32)

In [3]:
urm

<611x193610 sparse matrix of type '<type 'numpy.float32'>'
	with 80668 stored elements in Compressed Sparse Column format>
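
The cell above allocates a dense 611 x 193610 array before converting it. As a minimal sketch of an alternative, the matrix can be assembled from (user, movie, rating) triplets without the dense intermediate. The helper name `build_urm` and the toy triplets are illustrative assumptions, not part of the original notebook. One behavioural difference to keep in mind: when converting from COO format, duplicate (user, movie) pairs are summed, whereas the dense assignment above keeps the last value.

```python
import numpy as np
from scipy.sparse import coo_matrix

def build_urm(triplets, n_users, n_items):
    """Build a CSC user-rating matrix from (user_id, item_id, rating) triplets."""
    users, items, ratings = zip(*triplets)
    coo = coo_matrix(
        (np.asarray(ratings, dtype=np.float32), (users, items)),
        shape=(n_users, n_items),
    )
    # CSC is the format the rest of the notebook works with
    return coo.tocsc()

# Toy data (made up, not taken from the ratings files)
toy = [(1, 2, 4.0), (1, 5, 3.5), (3, 2, 5.0)]
toy_urm = build_urm(toy, n_users=4, n_items=6)
print(toy_urm.shape, toy_urm.nnz)
```

With the real data, the triplets would come from the same `csv.reader` loop used above.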

### Retrieve the test users

First we build a dictionary, uTest, keyed by the ids of the users for whom we want to make predictions. Each user starts with an empty list of recommendations.

In [4]:
uTest = dict()
with open("./ratings test.csv", 'rb') as testFile:
	testReader = csv.reader(testFile, delimiter=',')
	for row in testReader:
		uTest[int(row[0])] = list()

Then we want to find the movies already seen by these users in order not to recommend them again.

In [5]:
moviesSeen = dict()
with open("./ratings training.csv", 'rb') as trainFile:
	urmReader = csv.reader(trainFile, delimiter=',')
	for row in urmReader:
		# create the user's list on first sight, then record the movie id
		# (replaces a bare try/except that could also hide parsing errors)
		moviesSeen.setdefault(int(row[0]), []).append(int(row[1]))

In [6]:
import math as mt
import csv
from sparsesvd import sparsesvd

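K=90 # number of singular values/vectors to keep

# sparsesvd returns U as a (K x users) array, s as a vector and Vt as a (K x movies) array
U, s, Vt = sparsesvd(urm, K)

dim = (len(s), len(s))
S = np.zeros(dim, dtype=np.float32)
for i in range(0, len(s)):
	# note: the square root of each singular value is stored, so U*S*Vt is a
	# rescaled variant of the rank-K approximation rather than the approximation itself
	S[i,i] = mt.sqrt(s[i])

# transpose U so that each row corresponds to a user
U = csr_matrix(np.transpose(U), dtype=np.float32)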
S = csr_matrix(S, dtype=np.float32)
Vt = csr_matrix(Vt, dtype=np.float32)
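
To make the decomposition step concrete, here is a small sketch of a truncated SVD on a toy matrix. It uses `numpy.linalg.svd` instead of `sparsesvd` so that it runs without extra packages; the matrix values and variable names are made up for illustration. It compares the standard rank-k approximation (singular values on the diagonal) with the square-root scaling used in the cell above.

```python
import numpy as np

# Toy rating matrix (rows: users, columns: movies); values are made up
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [0, 1, 5, 4],
              [1, 0, 4, 5]], dtype=np.float64)

k = 2
U_full, s_full, Vt_full = np.linalg.svd(R, full_matrices=False)
U_k, s_k, Vt_k = U_full[:, :k], s_full[:k], Vt_full[:k, :]

# Standard rank-k approximation: singular values themselves on the diagonal
R_k = U_k.dot(np.diag(s_k)).dot(Vt_k)

# Variant mirroring the notebook cell: square roots of the singular values
R_sqrt = U_k.dot(np.diag(np.sqrt(s_k))).dot(Vt_k)

# Reconstruction errors (Frobenius norm) of the two variants
err_k = np.linalg.norm(R - R_k)
err_sqrt = np.linalg.norm(R - R_sqrt)
print(err_k, err_sqrt)
```

The standard variant is the best rank-k approximation in the Frobenius norm, so its error is never larger than that of the square-root variant. The two scalings can also rank movies differently for a given user, which is worth keeping in mind when reading the recommendations.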

In [7]:
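from scipy.sparse.linalg import * # not strictly needed: `*` between scipy sparse matrices is already matrix multiplication

rightTerm = S*Vt # (K x movies) matrix, computed once and reused for every user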

estimatedRatings = np.zeros(shape=(MAX_UID, MAX_PID), dtype=np.float16)
for userTest in uTest:
	prod = U[userTest, :]*rightTerm

	#we convert the vector to dense format in order to get the indices of the movies with the best estimated ratings 
	estimatedRatings[userTest, :] = prod.todense()
	recom = (-estimatedRatings[userTest, :]).argsort()[:250]
	for r in recom:
		if r not in moviesSeen[userTest]:
			uTest[userTest].append(r)

			if len(uTest[userTest]) == 5:
				break
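
The loop above ranks the top 250 candidates and then drops the ones a user has already seen. If most of those 250 were already seen, fewer than five recommendations remain, which is one possible explanation for short or empty lists. As a hedged sketch of an alternative, the seen movies can be masked before ranking. The function name `top_n_unseen` and the toy scores are assumptions for illustration only.

```python
import numpy as np

def top_n_unseen(scores, seen, n=5):
    """Return the indices of the n highest scores, skipping indices in `seen`."""
    scores = np.asarray(scores, dtype=np.float64).copy()
    seen_idx = np.fromiter(seen, dtype=np.int64)
    # Masked entries sort last and are filtered out below
    scores[seen_idx] = -np.inf
    order = np.argsort(-scores)
    return [int(i) for i in order if np.isfinite(scores[i])][:n]

# Toy example: five movies, the user has already seen movie 1
toy_scores = [0.1, 0.9, 0.8, 0.3, 0.7]
recs = top_n_unseen(toy_scores, seen={1}, n=2)
print(recs)
```

This returns fewer than `n` items only when the user has fewer than `n` unseen movies in total.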

In [8]:
uTest

{1: [2231, 1263, 454, 266, 1200],
 2: [58559, 91529, 2959, 80463, 106782],
 3: [329, 110, 1215, 1200, 1997],
 4: [2599, 593, 1265, 63082, 2396],
 5: [480, 593, 34, 296, 161],
 6: [661, 158, 442, 596, 586],
 7: [150, 4306, 541, 1206, 912],
 8: [457, 592, 21, 595, 185],
 9: [2716, 1291, 3948, 318, 1302],
 10: [5816, 339, 4886, 68954, 4896],
 11: [527, 780, 292, 316, 733],
 12: [527, 597, 2671, 1777, 1569],
 13: [593, 260, 2329, 364, 1573],
 14: [597, 161, 253, 457, 587],
 15: [1270, 79132, 110, 116797, 7153],
 16: [296, 6016, 7153, 1036, 1221],
 17: [356, 1213, 858, 2959, 1193],
 18: [],
 19: [4226, 2105, 60],
 20: [3147, 1, 2012, 594, 1022],
 21: [903, 165, 380, 3033],
 22: [4973, 48516, 46578, 4963, 79132],
 23: [1201, 111, 1219, 29, 296],
 24: [2959, 5989, 1291, 2324, 48774],
 25: [1196, 2959, 91529, 2028, 109487],
 26: [161, 590, 318, 110, 292],
 27: [500, 1028, 596, 1375, 2716],
 28: [],
 29: [1221, 111, 541, 1233, 2194],
 30: [89745, 2959, 134130, 116797, 589],
 31: [260, 104, 95, 

## Let's see how the recommender does compared to the movies a user has seen
One user that stands out is user 20, for whom 'Jumanji' has been recommended. I'm not sure that user 20 has really never seen 'Jumanji' (I, at least, have never met anyone who hasn't), but as far as these ratings are concerned they have not. They could, for example, have watched it with other people without using their own Netflix account. Let's look at their recommendations.

In [None]:
from pandas import read_csv
from pandas import DataFrame
movies = read_csv('movies.csv')
movies.head()

In [21]:
movies.loc[uTest[20],'title']

3147    Tailor of Panama, The (2001)
1                     Jumanji (1995)
2012          Free Enterprise (1998)
594                   Twister (1996)
1022               Birds, The (1963)
Name: title, dtype: object
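
A caveat about the lookup above: `read_csv` without `index_col` gives the DataFrame a default integer index, so `movies.loc[ids, 'title']` selects rows by that row label. The labels coincide with `movieId` values only if the file's ids happen to equal row numbers, which is worth checking against `movies.head()`. The sketch below shows the difference on a made-up three-row catalogue; the ids and titles are placeholders, not real data.

```python
import pandas as pd

# Hypothetical catalogue; ids and titles are placeholders
toy_movies = pd.DataFrame({
    'movieId': [1, 2, 5],
    'title': ['Movie A', 'Movie B', 'Movie C'],
})

ids = [2, 1]

# .loc on the default index selects by row label (0, 1, 2), not by movieId
by_row_label = toy_movies.loc[ids, 'title'].tolist()

# Setting movieId as the index makes .loc look the ids up directly
by_movie_id = toy_movies.set_index('movieId').loc[ids, 'title'].tolist()

print(by_row_label)
print(by_movie_id)
```

If the two results differ on the real file, the titles printed above correspond to row positions rather than to the recommended movie ids.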

So, Twister and Jumanji are a couple of mid-90s blockbusters. Now let's see which of the recommended movies this user has rated in the ratings data set, and whether their ratings are in line with the recommendations.

In [50]:
ratings = read_csv('ratings.csv')
user20 = ratings.loc[ratings['userId']== 20]
user20.loc[user20['movieId'].isin(uTest[20])]

| | userId | movieId | rating |
|---|---|---|---|
| 2993 | 20 | 594 | 5.0 |
| 3012 | 20 | 1022 | 4.5 |


Recall that we split the ratings data into training and test sets, and that movies rated in the training set are excluded from the recommendations. These two ratings therefore presumably come from the held-out test set: the user actually did see Twister and The Birds and rated them highly. Personally, I think that if this person gave Twister a 5, they will surely rate Jumanji highly as well. I have never seen the other two movies either, but apparently I need to! It's no wonder that the Singular Value Decomposition was a winning approach in the Netflix competition.